Inside the World's Largest AI Supercluster: xAI Colossus
---
Summary
#Largest AI Supercomputer: The #xAI #Colossus is built with over 100,000 GPUs, massive storage, and #High Speed Networking, designed for #AI projects beyond typical #Chatbot applications.
#Record Breaking Construction: The facility, containing over 100,000 GPUs, was constructed in just 122 days—significantly faster than traditional #Supercomputers that often take years.
#Advanced Liquid Cooling System: The #Data Halls are equipped with #State Of The Art liquid cooling, using separate pipes for hot water and cold water, which efficiently manages #Heat from the #GPU Servers.
#Scalable GPU Racks: Each rack includes multiple #NVIDIA #HGX H100 units in #Compact, easily serviceable designs, featuring #Cooling Manifolds and advanced #GPU configurations.
#Innovative Power Management: #Tesla #Megapacks support the power demands of the #AI Clusters by managing microsecond power fluctuations, stabilizing #Energy Delivery to the GPU units.
Widescreen Wonder: #LasVegasSphere has a roughly 15,000 m² (~3.67 acre) interior LED display (16x16K) and a roughly 54,000 m² exterior LED display (‘Exosphere’) consisting of 1.23 million LED ‘pucks’. Driving all these pixels are around 150 #NVIDIA RTX #A6000 #GPU cards, installed in computer systems that are networked using NVIDIA #BlueField data processing units (#DPU) and NVIDIA #ConnectX6 NICs (up to 400 Gb/s), with visual content transferred from Sphere Studios in California. All this hardware uses 45 kW. https://blogs.nvidia.com/blog/sphere-las-vegas/
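A quick sanity check on the figures quoted above, as a minimal sketch: it converts between square metres and acres and spreads the quoted 45 kW evenly over the roughly 150 GPUs. The function names are made up for illustration, and the even split is an assumption, since the 45 kW figure covers all of the hardware and not only the GPUs.

```python
# Sanity-check helpers for the area and power figures in the post.
# Assumption: "acre" means the international acre (exactly 4046.8564224 m^2).
SQM_PER_ACRE = 4046.8564224


def sqm_to_acres(sqm: float) -> float:
    """Convert square metres to acres."""
    return sqm / SQM_PER_ACRE


def acres_to_sqm(acres: float) -> float:
    """Convert acres to square metres."""
    return acres * SQM_PER_ACRE


def avg_watts_per_gpu(total_kw: float, gpu_count: int) -> float:
    """Average power per GPU if the whole budget were split evenly.

    Assumption: an even split; this is an upper bound on the per-GPU
    average, because the total also covers CPUs, DPUs and NICs.
    """
    return total_kw * 1000 / gpu_count


if __name__ == "__main__":
    print(f"3.67 acres is about {acres_to_sqm(3.67):,.0f} m^2")
    print(f"54,000 m^2 is about {sqm_to_acres(54_000):.1f} acres")
    print(f"45 kW over 150 GPUs is {avg_watts_per_gpu(45, 150):.0f} W each")
```

The two area figures describe different surfaces: 3.67 acres is just under 15,000 m², while 54,000 m² is a little over 13 acres. The even power split works out to 300 W per GPU.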
@karppinen Mellanox/NVIDIA has been trying to shove #BlueField into any customer box they can for years. They're even mandatory in some configurations (e.g. DGX) and there's no shortage of stock.
The "Self-Hosted DPU Controller" mode mentioned in the video has been officially supported since BSP 4.5.0 in December 2023, but customers like #Netflix and us had access to it long before then.
Netflix is probably running this right now at 100 watts, but we have no confirmation.
@karppinen According to the video stream for that talk, this referred to a prototype that wasn't ready or in use when he talked about it, and it consumed at least 125 watts when last measured.
So while the idea is nothing new and it's quite possible (people have been running Yocto Linux with nginx and Offload directly on #nvidia #Bluefield for a while), those slides do not prove that #netflix "can now" do it.
What better #introduction to #fosstodon than a look at what I think was my first #foss contribution. I’d like to apologize now for the unbounded memory allocation bug I introduced. Oops.