00:00Elon Musk and his XAI startup have built the largest and most powerful artificial
00:05intelligence training supercomputer in the world.
00:09Elon has named this beast Colossus.
00:13It is equipped with the latest Nvidia GPU hardware, it's liquid-cooled with vast amounts
00:18of water, and is powered by giant Tesla Megapack batteries.
00:22Elon believes that all of this combined will create the world's most powerful artificial
00:26intelligence, one that will literally solve the mysteries of the universe.
00:30And what we see today is only the beginning.
00:34This is what's inside Colossus.
00:37The location is Memphis, Tennessee in an industrial park southwest of the city center
00:42on the bank of the mighty Mississippi River.
00:44The building itself wasn't constructed by XAI; it was previously home to Electrolux,
00:49which is a Swedish appliance manufacturer.
00:52So if you've been wondering why Elon chose Memphis and not Austin,
00:56it basically just comes down to finding the right building in the right location
01:00to get this thing up and running as fast as possible.
01:04Now, as unassuming as the exterior of Colossus might be, it's what's inside that counts.
01:10And inside is the largest AI training cluster in the world.
01:14Currently, over 100,000 NVIDIA HGX H100 GPUs connected with exabytes of data storage over
01:21a super fast network.
01:24NVIDIA CEO Jensen Huang has said himself that Colossus is, quote,
01:28easily the fastest supercomputer on the planet.
01:32And it was all built to power Grok, an AI model that Elon Musk and XAI will evolve into
01:39something far more capable than a simple chatbot.
01:43This is the breeding ground for artificial super intelligence.
01:48The entire facility as we see it was built in just 122 days.
01:53That is insane.
01:55A more traditional supercomputer cluster would have just one half to one quarter the number of
02:00GPUs as Colossus, but the construction of those traditional systems would take years from start to
02:06finish. The training work happens in an area called the data hall.
02:10XAI uses a configuration known as the raised floor data hall, which splits the system into three levels.
02:17Above is the power, below is the cooling, and in the middle is the GPU cluster.
02:22There are four data halls inside Colossus, each with 25,000 GPUs plus storage and the fiber optic
02:30network that ties it all together. Colossus uses water for liquid cooling.
02:34Below the GPU cluster is a network of giant pipes that move vast amounts of water in and out of
02:40the facility. Hot water from the servers is sent outside to a chiller, which lowers the temperature
02:45of the water by a few degrees before pumping it back in. This doesn't necessarily need to be cold
02:50water though. Without getting too deep into thermodynamics, just remember that energy always
02:55travels from hot to cold. So as long as the temperature of the water is lower than the hardworking
03:01GPUs which get pretty hot, then the excess heat energy will be drawn into the water as it flows
03:06past and heat will be removed from the system. Here is what those GPU racks look like. Each tray is
03:12loaded with eight Nvidia H100 GPUs, the current state-of-the-art chip for AI training. That will
03:19change in a relatively short amount of time, and Elon already has plans to upgrade Colossus to the Nvidia
03:24B200 chip when that becomes widely available, but for right now, there's no time to waste. There are
03:31eight of these racks built into one cabinet with a total of 64 GPU chips and 16 CPU chips in every
03:38vertical stack. Each of the racks has its own independent water cooling system, with these small
03:44tubes that lead directly into the GPU housing, blue tubes for cold water delivery, and red tubes for hot
03:49water extraction. The beauty of these GPU racks built for XAI by Supermicro is that each one can
03:56be pulled individually for maintenance, and it's serviceable on the tray. That means the entire
04:02cabinet doesn't need to be shut down and disassembled just to replace one chip. The technician can simply
04:07pull the rack, perform the service right there on the tray, and then slide it back in and get back to
04:13training. This is unique in the AI industry. Only XAI has a setup like this, and it will allow them to
04:19keep their downtime to an absolute minimum. The same is true for the water system. Each cabinet has its
04:25own cooling management unit at the base that's responsible for monitoring flow rate and temperature,
04:31with an individual water pump that can easily be removed and serviced. Now, the thing to keep in mind
04:36about gigantic computer systems like this is that things will break. There's no way to avoid that,
04:43but having a plan to keep failures localized and get problems solved as fast as possible,
04:48that is going to make an incredible difference in the overall productivity of the cluster. On the
04:54back of each cabinet is a rear door heat exchanger. That's basically just a really big fan that pulls
04:59air through the rack and facilitates the heat transfer from the hot chips to the cool water.
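That hot-to-cold heat flow can be put into rough numbers with the standard relation Q = ṁ·c·ΔT. Here is a minimal sketch of the estimate; the per-cabinet heat load and the few-degree temperature rise are illustrative assumptions, not published XAI specs:

```python
# Rough estimate of the water flow needed to carry away a cabinet's heat.
# All load and temperature figures are illustrative assumptions, not xAI specs.

SPECIFIC_HEAT_WATER = 4186.0  # J/(kg*K)

def required_flow_kg_per_s(heat_load_watts: float, delta_t_kelvin: float) -> float:
    """Mass flow of water needed to absorb heat_load_watts while the
    water warms by delta_t_kelvin (from Q = m_dot * c * dT)."""
    return heat_load_watts / (SPECIFIC_HEAT_WATER * delta_t_kelvin)

# Assume a cabinet of 64 GPUs at ~700 W each plus overhead: ~50 kW total,
# and a chiller that lowers the water "by a few degrees" (dT = 5 K).
flow = required_flow_kg_per_s(50_000, 5.0)
print(f"{flow:.1f} kg/s")  # -> 2.4 kg/s, roughly 2.4 liters of water per second
```

The takeaway is that even lukewarm water works, as the video says: what matters is the temperature difference and the flow rate, not absolute coldness.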
05:05This replaces giant air conditioning units that are found in typical data centers, and again,
05:10keeps each of the racks self-contained. Every fan is glowing with a colored light. That's not for
05:15aesthetics; it's a way for technicians to quickly identify failures. A healthy fan will have a blue
05:21light, while a bad fan will switch to a red light, and then they just replace those individual units as
05:26they go down. While GPU chips do the heavy lifting for AI training, CPU chips are used for preparing the
05:32data and running the operating system. There are two CPUs for every eight GPUs. All of the data used to
05:40train Grok is held in a massive hard drive storage system. Exabytes of text, images, and video that are
05:46fed into the training cluster. One exabyte is a billion gigabytes, and all of that data is handled
05:53by a super high-speed network system. Data is moved around Colossus by ethernet, but this is not anything
05:59like your home network. The XAI network is powered by Nvidia BlueField-3 DPUs. That's a data processing unit,
06:07and these chips can handle 400 gigabits per second through a network of fiber optic cables.
06:12That's around 400 times faster than a very fast home internet connection. The ethernet is necessary
06:20for scaling beyond the size of a traditional supercomputer system. You see, AI training requires
06:25a massive amount of storage that needs to be accessible by every server in the data center.
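The scale of those numbers is easy to sanity-check with back-of-the-envelope arithmetic. The 400 Gb/s and one-exabyte figures come from the video; the 1 Gb/s "very fast home connection" baseline is an assumption:

```python
# Back-of-the-envelope network arithmetic. 400 Gb/s and 1 EB are from the
# video; the 1 Gb/s home-connection baseline is an assumed figure.

DPU_GBPS = 400   # BlueField-3 line rate, gigabits per second
HOME_GBPS = 1    # assumed very fast home fiber connection

print(DPU_GBPS // HOME_GBPS, "times faster")  # -> 400 times faster

# How long would one exabyte take to move over a single 400 Gb/s link?
EXABYTE_BITS = 1_000_000_000 * 1_000_000_000 * 8  # 1 EB = a billion gigabytes
seconds = EXABYTE_BITS / (DPU_GBPS * 1_000_000_000)
print(f"{seconds / 86_400 / 365:.1f} years over one link")  # about 0.6 years
```

Months to move the dataset over a single link is exactly why the storage has to be reachable over many parallel high-speed paths rather than one pipe.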
06:30Now, this massive amount of equipment requires an equally massive amount of power,
06:35and again, XAI has done something totally unique with their energy delivery. They are using Tesla
06:41Energy. Colossus doesn't use solar energy; it's drawing power from traditional grid generators.
06:47But there was a problem that XAI encountered when they started to bring their 100,000 GPU system online.
06:53The tiny millisecond variations in power coming from the grid would create inconsistencies in the
06:59training process. We are talking very small fluctuations, but at this giant scale, those will add up quickly.
07:06So the solution was to bring in Tesla Megapack battery units. So what they do now is pipe input
07:13power from the grid into the Megapacks, then the batteries discharge directly into the training cluster.
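That grid-to-battery-to-cluster arrangement is essentially a buffer: the pack absorbs the grid's fluctuations so the cluster sees a flat draw. A toy simulation of the idea (all figures are made up for illustration; this is not Tesla's actual Megapack control logic):

```python
# Toy model of battery buffering: noisy grid power charges the pack,
# which discharges a steady level into the cluster. Purely illustrative;
# the capacity, load, and noise figures are assumptions.
import random

random.seed(0)
battery_mwh = 50.0        # assumed stored energy in the Megapacks
STEADY_DRAW_MW = 150.0    # assumed constant cluster load

for _ in range(10):       # ten one-second steps
    grid_mw = 150.0 + random.uniform(-5, 5)  # fluctuating grid input
    # The battery absorbs the difference: surplus charges it, deficit drains it.
    battery_mwh += (grid_mw - STEADY_DRAW_MW) / 3600.0  # MW over 1 s -> MWh
    # The cluster always sees STEADY_DRAW_MW, never the fluctuation.

print(f"{battery_mwh:.2f} MWh remaining; cluster saw {STEADY_DRAW_MW} MW throughout")
```

The design point is that the battery's state of charge soaks up the noise, while the output side stays constant, which is the "super consistent" delivery the training run needs.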
07:18This provides the super consistent direct energy required for the entire network to have the most efficient
07:25training session that is physically possible. This unique energy upgrade will become even more
07:31critical when XAI doubles the size of Colossus to over 200,000 H100 GPUs, something that Elon claims will
07:39happen within the next two months. That is an insane rate of growth, and it's got the established AI giants
07:47scared. There have been reports that OpenAI CEO Sam Altman has already told Microsoft executives that he's
07:54concerned Elon will soon overtake them in access to computing power. Of course, this stuff ain't cheap.
08:01It was just a few months ago that XAI raised $6 billion in venture capital funding, bringing the
08:06one-year-old company to a valuation of $24 billion. That's a lot of money for a young company that only
08:14had one basic product on the market at the time. But they did have the richest man in the world at the
08:20controls, so obviously that counts for a lot. Now, we've just seen reports from the Wall Street Journal
08:25that Elon is already looking for a lot more money, enough to bring the value of XAI to $40 billion.
08:33For a sense of scale, the industry giant OpenAI is currently valued at $157 billion, while a smaller
08:41scale operation like Perplexity, who makes a highly regarded AI search tool, they're expected to soon hit a
08:47valuation of $8 billion. As for Grok, the AI chatbot is continuing to rapidly evolve thanks to new power
08:54provided by Colossus. Just recently, Grok was upgraded to include vision capabilities, meaning
09:00that the AI can analyze and comprehend input from images alongside its existing text functions.
09:07This new feature is integrated into the X social media platform for premium users. Now, when you see an
09:13image in a post, you can click a button to send that image to Grok, where you can now ask the AI any
09:19question you want about the content of that image. Grok can analyze or provide additional context.
09:26This is an important step for XAI on their path towards achieving artificial general intelligence.
09:32That's a big buzzword right now. It basically just means an AI that can do pretty much anything.
09:38Essentially, an artificial reproduction of the human mind and its incredible versatility.
09:43We can write words, we can make music, we can solve complex problems, invent new things.
09:48In theory, an artificial general intelligence would have all of the knowledge of the entire human race,
09:55all concentrated into one super powerful computer brain, making it infinitely smarter than any human
10:03being. Then the AGI can use that knowledge to learn even more, to discover the undiscoverable,
10:09solve the unsolvable, invent the uninventable. According to Elon Musk, this is how we unlock
10:16the mysteries of the universe and the very nature of our own existence. Or the AI will go rogue and kill
10:22us all. But that's where Neuralink comes in, which is a whole other video that we've already made.
10:27Make sure you check one of those out next.