Rack Scale and Data Center Scale Solutions (AI Data Center Building Blocks) Powered by NVIDIA Blackwell

Transcript
00:00Good morning everybody. My name is Bernard. This is Jack. We want to bring you a bit of a boring
00:07topic. We want to talk about cooling. We want to give you an overview about what it means for you,
00:12what it means for your customers, and why looking at the right cooling is important. Over to you Jack.
00:18Morning. Hi. So Bernard, do you know that right now a lot of data centers are running on very
00:24low water temperatures, and they're also reusing very little of the heat from the racks. Yeah. And also a big problem is
00:32it's very loud. Yeah, it's not only loud. It's basically, if you look in a data center today, all the
00:38data centers got designed like 15 years ago. Sure. And we were talking about 10 kilowatts per rack. Now that's
00:46not sustainable, and the noise level
00:48was never that high because you never needed to cool processors of more than a thousand watts.
00:53Yeah. But this is not actually the right design anymore. If you look at the design guidelines that we have right now,
01:01they are written for air cooling. And most of the data centers, especially the hyperscalers, they are pushing the temperature to
01:09the very limit, because for every degree you raise the temperature in the data hall, you save about 4% of the electricity
01:15just on cooling.
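(A rough sketch of what that rule of thumb implies; the roughly 4% per degree figure is the speakers' approximation, and applying it multiplicatively is an assumption made here for illustration.)

```python
# Rough sketch: cooling-electricity savings from raising the data-hall set point,
# using the speakers' ~4% per degree rule of thumb (applied multiplicatively here,
# which is an assumption for illustration).
def cooling_energy_fraction(delta_t_degrees: float, savings_per_degree: float = 0.04) -> float:
    """Fraction of the original cooling electricity still needed after raising
    the set point by delta_t_degrees."""
    return (1.0 - savings_per_degree) ** delta_t_degrees

for dt in (1, 3, 5, 8):
    remaining = cooling_energy_fraction(dt)
    print(f"+{dt} degC set point -> ~{(1 - remaining) * 100:.0f}% less cooling electricity")
```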
01:16But now things will change with liquid cooling. And that's why we're going to walk you through a little
01:22bit about what is going to change with liquid cooling inside the data centers.
01:28So carbon emissions are important for all of us. We need to get more efficient. And when you build a
01:35data center today, there are two things you need to keep in mind.
01:38The power consumption for the cooling and the power consumption for the servers. And effectively, when you look at the
01:43servers standalone, they have huge emissions already.
01:47But when you need to cool them, it's even more. So that's why we're tackling that with a new idea.
01:53And very often people don't realize that it's not just the electricity consumed that produces a lot of carbon
02:01emissions, but also the heating that we're using, because about 50% of global energy actually goes to
02:09heating.
02:11So if we can reuse the heat from data centers, especially from the high-density computing racks, then we can
02:19actually save a lot of energy, or carbon emissions, just on the heating part.
02:25Yeah. And that has an effect on greenhouse gases. It has an effect on everything. It's the environment we need to
02:31look at. Because the more power we have, the more acceleration we do in the data center, the more we
02:37need to look at how to make it efficient. Accelerated computing is solving a lot of challenges because you can
02:45do different tasks in much less time. But we also need to look at how we do it
02:50efficiently.
02:51Yeah. I mean, if you look at the GHG Protocol, the greenhouse gas emissions protocol, for the past 10,
03:0015 years we've actually been putting very, very little effort in, and only on Scope 1 and Scope 2. But we can
03:07do more if we spend more effort on what's going on in Scope 3. And this is something that we
03:14can do in the future, especially in the data center industry.
03:17Okay. So when you go into a data center today, and for those who have been, the noise level is
03:24absurd. You cannot talk to somebody, you need your earplugs, it's just impossible to talk. You text with your phone
03:33in a room, even to the person next to you.
03:35Okay. And this is not only an issue from a communication point of view, it's a health issue. So we
03:41need to look at that for the people that actually need to maintain that stuff.
03:46So when you look at the water cooling solution, the water can come in hotter. So you can basically
03:55envision that whereas previously you needed to chill the water down, now you can effectively work with whatever water temperature
04:02you have, recycle that heat, and drastically reduce the noise level.
04:07Yep. Well, as I mentioned before, most data centers today are cooled with chilled water. And now
04:13we have another choice: by utilizing liquid cooling technology, we're able to bring up the water temperature, or
04:20even reuse the heat from the racks, to get a more sustainable data center environment.
04:26And also, to reduce the occupational hazards. How many of you have been inside a data center? What was the noise
04:35level like? Really loud, right? If you stay inside too long, it will hurt your ears.
04:42So what NVIDIA and we are trying to do is bring down the noise level, so that the people
04:48working inside that environment are more comfortable.
04:52And also what you can do with that water, and we've seen projects like that: they're heating the whole buildings
04:58around it. Because the water is there, the heat needs to go somewhere, so why don't you just reuse it
05:04for heating the buildings, the schools, everything. Lots of universities that have started doing that basically now have free heating
05:10in their environment.
05:16So when you look at the right-hand side of the slide, a few years ago we were talking
05:22about 800 watts per server, two CPUs, a bit of that. And now a tray of a GB200, which you
05:30use 36 of in a rack, has 5.7 kilowatts.
05:36So we're now talking about a rack of up to 140 to 150 kilowatts that you need to deploy.
05:44We talked at GTC in San Jose a few months ago about looking at rack solutions for 500
05:51kilowatts.
05:52So you cannot do that with air cooling, it's just impossible. Physics will just not allow you to do that.
05:59So this is where people need to look at how do you do that and how you can maintain it.
06:05So we work very closely with Supermicro to build something around how you can do sustainable data centers and how
06:13you can cool them more efficiently.
06:15Okay. Well, when NVIDIA and Supermicro tried to promote this liquid cooling technology about two years ago, I don't think
06:22anybody would have said yes.
06:24Because typically inside a data center, water loops are not allowed, or they cannot have water loops inside the
06:31data hall.
06:31But now it's not a choice. It's a must. Because if you're running on AI infrastructures, liquid cooling is the
06:38only way you can do it.
06:39Because the air cooling capacity has reached its limits. So this is not a question of how or when.
06:46But this is a question of, you know, is it a good time and what's the next step of the
06:52liquid cooling?
06:52Yeah. And I know there is resistance, because electricity and water are typically not a very good match.
06:59So a lot of people are very scared about having a water loop in there. But effectively, if you have
07:03no choice, people need to adopt it.
07:05But that also requires people to redesign all the infrastructure. It's not just that we give you a new rack of
07:11servers.
07:12We need to look at the whole thing. It's not something you can do within a day; it takes you six
07:16to 12 months to get it done.
07:20So historically, you always had these classical air cooling systems. Then there was something that came up about five to 10
07:29years ago with the rear door heat exchanger.
07:31You use big, massive heat exchangers on the back of every air-cooled rack that absorb some of the heat.
07:37And now we're moving to basically cooling the chips directly. So that's the evolution.
07:45A rear door heat exchanger is something you can retrofit more easily in the data center. Liquid cooling requires more infrastructural change.
07:55Yep, definitely. And how many of you have seen, say, a rear door heat exchanger in the data center, or even
08:01direct-to-chip liquid cooling?
08:05One. Not many. Two. Sorry. Two. Oh, nice.
08:09Well, this is going to be a game changer in the data center industry. But unfortunately, most of the designers
08:16are still exploring how to design a proper liquid cooling facility.
08:21Because this is unlike conventional design. Yeah, maybe we will share a little bit about the advantages of
08:29having liquid cooling inside your data centers.
08:31When you look at the distribution of power, now you can effectively use the power to do
08:41what it's supposed to do.
08:42You don't want to pay 32% of your power just for the environment. You want to reduce that to
08:48a bare minimum.
08:49And that's what you can do with liquid cooling. Now, there are countries where power cost doesn't matter.
08:54But effectively, here in Europe, power cost is a big factor. We're talking about 20 to 30 cents per kilowatt-hour.
09:01So every dollar you can save, every euro you can save on power, is effectively making you more competitive
09:10and more successful with your solution.
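(A minimal sketch of that math with illustrative numbers: a 1 MW IT load, 0.25 EUR per kWh from the range quoted, the 32% overhead mentioned above for air cooling, and an assumed 10% overhead for liquid cooling.)

```python
# Illustrative only: yearly electricity cost when cooling and other facility overhead
# consume a given fraction of power on top of the IT load.
HOURS_PER_YEAR = 8760

def annual_cost_eur(it_load_mw: float, overhead_fraction: float, eur_per_kwh: float) -> float:
    """Total yearly electricity cost in EUR; overhead_fraction is the extra power
    spent on cooling/facility relative to the IT load (0.32 -> PUE of roughly 1.32)."""
    total_kw = it_load_mw * (1.0 + overhead_fraction) * 1000
    return total_kw * HOURS_PER_YEAR * eur_per_kwh

air_cooled = annual_cost_eur(it_load_mw=1.0, overhead_fraction=0.32, eur_per_kwh=0.25)
liquid_cooled = annual_cost_eur(it_load_mw=1.0, overhead_fraction=0.10, eur_per_kwh=0.25)  # assumed overhead
print(f"air-cooled:    ~{air_cooled / 1e6:.2f} M EUR per year")
print(f"liquid-cooled: ~{liquid_cooled / 1e6:.2f} M EUR per year")
print(f"difference:    ~{(air_cooled - liquid_cooled) / 1e6:.2f} M EUR per year, per MW of IT load")
```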
09:11Yeah. Well, I think for most enterprise data centers, the challenge is that if you are looking for more
09:18computing power inside your current facility, it's usually not possible.
09:24But once you convert to liquid cooling, you immediately see about 20% of the electricity become available just to
09:30give you an upgrade on the computing.
09:31And at the same time, you save power on the cooling side, which gives you a lot of benefit
09:39in the long term if you just move that 20% from cooling to more computing, to more meaningful power allocations.
09:49So, DLC adoption started in Europe about 12 to 13 years ago. It started in supercomputing.
09:58A lot of these HPC sites really incentivized the universities to start looking at it. They wanted to reuse the
10:05heat.
10:06The hyperscalers started to use it because they want to deploy massive compute at big scale.
10:11And where we really see the biggest issue today is that enterprise data centers are not retrofittable.
10:19So, a lot of it comes down to what we call colo.
10:22There are a lot of players like Equinix and others.
10:25We work with all of them to actually design their next data halls to be able to absorb GPU power.
10:33Because I spoke to friends of mine yesterday. Does anybody have any idea how much a data center costs to build
10:41from scratch, per megawatt?
10:43Anyone want to guess?
10:45We're talking between 7 to 10 million euro per megawatt.
10:51So, if you build a data center for 20 megawatt, which everybody wants to do today, it's 200 million euro
10:57upfront investment.
10:59So, you want to use that energy efficiently in order to get the maximum output.
11:05All right.
Well, from the front line of the data center industry, we are seeing a lot of changes.
First of all, enterprise data centers are now adapting to the new cooling technology.
But most of the time, they don't have enough power to support AI infrastructure on site.
So, they are now forced to move into a co-location space.
And as for the cloud service providers, they are designing data centers that can be adapted everywhere in the world.
Because you cannot just build data centers anywhere globally.
You have to find a place to properly house your data centers.
So, that leaves the co-location providers with some new challenges.
Because they have to design a facility that's multipurpose.
You have to house low-density racks and also the high-density AI racks.
So, this will create a lot of new challenges in the data center industry.
12:00And also, getting the power.
12:03Getting the power from the power plant to the data center is complex.
Because certain items in that chain have up to two years of lead time.
12:11So, you actually, yeah.
12:13We've seen that.
I spoke to a big tech company in Switzerland and they said, we have enough power.
12:18No problem.
12:19We just don't have a transformer.
And they were told, oh yeah, in 18 months we may have one for you.
12:26So, it's great.
But that's all part of why you need to become more energy efficient.
And you can see it here with the high-density servers.
If you can just take 15% away from the cooling and throw it at compute,
it gets you 15% more throughput.
15% effectively for free.
Because you don't have the power.
You need to live with what you have.
12:51That's right.
Well, one thing that is often not mentioned is that when you convert from air cooling technology to liquid cooling,
you automatically see about a 15% reduction just on the server.
Because a lot of power was actually consumed by the fans just to cool down the server, to move the
air inside.
And you see it really, really well here on this slide.
A traditional data center, even if you run a traditional workload, it's 50-50.
You keep some of the servers on air because it doesn't really matter with air cooling.
You have the capacity, easy.
Traditional HPC, you already move more onto liquid cooling, especially the accelerated part.
You might still stay on air cooling for classical purposes and switches.
13:36When you look at the GB300 racks, the only thing that still has a fan is the switch.
13:44Everything else is liquid cooled.
13:45It's one rack.
13:47So now you have, instead of 130 kilowatt air cooled, you have 127 kilowatt liquid cooled.
13:54And the 3 kilowatt is basically just noise.
13:58Right.
And very often the customer comes to us and says, well, I want to do liquid cooling.
Can you help us?
I say, yes.
But you still need some air cooling capacity.
Because very often customers assume that if you go liquid cooling, it's 100% liquid cooling.
But for direct-to-chip liquid cooling, you still need some air cooling capacity.
And most of the time, the conventional air-cooled data centers have to raise their air cooling capacity just in
order to house high performance computing and also AI infrastructures.
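(A rough sketch of why some air capacity remains necessary; the cold-plate heat-capture fraction used below is an assumption, not a figure from the talk.)

```python
# Illustrative only: how much heat still ends up in the room air when a rack is
# direct-to-chip liquid cooled. The liquid capture fraction is an assumed value.
def residual_air_load_kw(rack_power_kw: float, liquid_capture_fraction: float = 0.875) -> float:
    """Heat (kW) rejected to room air: everything the cold plates do not capture
    (switches, power-supply losses, memory, miscellaneous components)."""
    return rack_power_kw * (1.0 - liquid_capture_fraction)

for rack_kw in (60, 130, 150):
    print(f"{rack_kw} kW rack -> ~{residual_air_load_kw(rack_kw):.0f} kW still handled by air")
# Even a largely liquid-cooled hall therefore keeps some air-handling capacity.
```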
And what you see here is a high-level description of a 60 megawatt site with 39 megawatts of consumption.
14:39You see the power backup generators.
14:41You see the water storage.
14:42You see the cooling towers.
14:43That's all around the building.
All of that is noisy.
But it's not as bad if it's noisy outside rather than inside.
Inside, you can actually reduce your noise level down to something where you can work and speak.
And you gain density.
15:01But we will show that a bit later.
And also, liquid cooling simply utilizes the physics of liquids, because liquid absorbs more heat than air.
So now, with liquid cooling technology, we are able to save more than 10 times just on cooling the
IT equipment itself, which is much more efficient than in the past.
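(A back-of-the-envelope look at the physics behind that claim, using standard textbook property values rather than figures from the talk.)

```python
# Rough comparison of how much heat a given volume of coolant can carry away per
# degree of temperature rise. Property values are approximate, at around 25-30 degC.
water = {"density_kg_m3": 997.0, "cp_j_kgk": 4182.0}   # liquid water
air   = {"density_kg_m3": 1.16,  "cp_j_kgk": 1006.0}   # air at ~30 degC, 1 atm

def volumetric_heat_capacity(fluid: dict) -> float:
    """Joules per cubic metre per kelvin: density * specific heat."""
    return fluid["density_kg_m3"] * fluid["cp_j_kgk"]

ratio = volumetric_heat_capacity(water) / volumetric_heat_capacity(air)
print(f"water carries ~{ratio:,.0f}x more heat per unit volume per degree than air")
# Roughly 3,500x, which is why modest water flow through cold plates can replace
# enormous volumes of chilled air moved by fans.
```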
15:26Yeah.
15:27And it's the only way forward.
15:28Accelerated computing needs power.
It doesn't need as many servers, but it needs power.
So here is a very big cost breakdown of what a data center build is.
And you see that land is a problem now, because you want to have land close
to either a hub, a connection point, or a power plant.
And surprisingly enough, certain people have started to grab all that land at any price.
Maybe they build, maybe they don't, but land first.
Everything else just follows the usual pattern.
So this is a slide I really like to talk about a bit more.
Today, if you're going air-cooled, we're talking about 10U servers.
You need a 10U server that you cannot lift anymore to basically use GPUs.
This is the B200 server from Supermicro.
We work very closely together.
We ship them, but they're 10U servers.
More versatile.
And when you move to liquid-cooled, you move down to 4U.
16:31Now, when you look at the racks next to it, this is already an optimized air-cooled rack.
16:39Meaning that's a 60 kilowatt rack with air cooling.
16:43You cannot deploy that in any enterprise data center.
This is what you typically use in the Nordics, where the ambient temperature is lower.
They're able to do those 60 kilowatts there.
Typically, I go to people and say,
oh yeah, now you have a big rack, a 42 or 48U rack, and you can get one server in.
17:06So I was joking many years ago saying, I should sell some bezels that just blink.
17:12So when your manager comes down into the data center, he sees a full rack blinking around.
17:16But effectively, there's only one server left.
17:18And when you move to liquid cooling on the same platform, you can have eight servers.
17:24So you have theoretically up to eight times the density of what you do.
17:30And if you go GB200, it's up to 72 GPUs per rack.
So I think that's the real difference: space, density, manageability.
That's everything you need to keep in mind when you do a project rollout.
It's not just buying servers, buying GPUs.
You always need to look end to end.
And Supermicro and we are happy to work with you in order to make it efficient.
We don't just want to ship you servers,
and then you call us a week later saying,
okay, now I have the server, but I don't have the power.
That's not the point.
18:03Yep, always happens.
And we can also foresee that in the future, especially for AI data centers, they're getting smaller and
smaller.
Because of the advantage of space efficiency, especially for countries like Singapore or Japan,
where data centers are usually located in very highly populated areas.
So land and space are very precious.
18:28And that's what you see here.
I think, depending on the workload, you can basically reduce your server count by up to
30 or 40x.
And that's really workload dependent.
For some workloads, you have a serial workload, you need CPUs.
But a lot of stuff you can parallelize.
So historically, with the power you had, you built a full data hall.
Now, you only have that amount of power, so you can basically condense it into six to eight racks.
And you can use the space for something else.
Because you don't have more power, so why would you waste space?
19:04Well, I spoke to a co-location provider in Hong Kong, a friend of mine.
19:08I said, well, if I can save you all this much space, what are you going to do with this?
19:13He said, well, there's nothing much I can do because there's no power.
But I can put some ping pong tables here, just for fun.
So this is a bit about the space cost.
We can show you the math.
It's not something that is super exciting, but I think the breakdown is here. One thing that often gets ignored
about liquid cooling is the physics.
19:38And OPEX is a very important thing.
19:41Lifespan of a server is a very important thing.
But what we also see is that the performance of a liquid-cooled data center is significantly more stable,
because every chip gets about the same temperature.
When you put them in an airflow, you never know where the air is moving.
That gives you, first of all, more predictability in performance.
20:01And when you're a colo partner, you want to have an SLA.
20:05You want to have a certain performance guarantee.
20:08That's much easier if you do it with liquid.
The other thing we see is when we compare defect rates between DLC versus air.
DLC is reducing the defect rate and the downtime massively.
So basically, by investing in direct liquid cooling, you're not only improving OPEX, you're reducing downtime.
And downtime basically is the biggest issue, because you need to call a service engineer, your model run is over,
everything.
The more stable you can keep your temperature environment, the better it is for the hardware platform.
20:43The biggest part of the OPEX is actually coming from maintenance costs.
Especially in some data centers, maintenance technicians are actually in short supply.
So in order to have more data center uptime, I think you have to find some way to reduce the
maintenance costs.
Especially the human resources.
21:07Yeah, human resources is tough.
21:10You know, it's like everybody who tried to hire an IT engineer in the last 10 years knows that.
21:14Because there are no IT engineers left.
So you need to look at how you can use your resources more efficiently, and at how you reduce downtime.
Your software programmers, who basically believe that data comes out of the cloud, just want to run their stuff.
They just want to have something stable.
They don't want to worry about, oh, my server is down again.
21:33They want to have it stable.
21:35So liquid cooling is a way to overcome some of those obstacles.
21:40So this is a slide where we talk about 288 racks.
21:44This is a really big deployment.
We've seen similar ones in France.
21:50We see them in Germany.
21:51This is about a rough investment of 200 million dollars.
21:56Probably more in the range of 350.
21:59This is what all the governments are now aiming for, for the national AI strategies.
If you were to build that with air-cooled capacity, it would be impossible.
22:09You don't have the space.
22:11You would basically require potentially up to a thousand rack data center to do it with air cooling.
22:19On liquid cooling, you save the space.
22:22You reduce the number of racks drastically.
22:24And your overall TCO is going down by 17%.
22:3017% can make a big difference if you're bidding for a project, if you want to deliver a project.
22:36If you need to go to your investors and say, my cost is going down 17%.
22:43That's more money, more profit, faster ROI.
22:46I think this is what really matters.
22:48It's a big challenge to get there.
22:51But once you're there, you basically have a way to cheaper money.
Well, these are just our calculations over five years of how direct-to-chip liquid cooling can help you with
TCO savings.
23:06And for those who are interested in details, please contact us afterwards.
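(A minimal sketch of the kind of five-year TCO comparison being referenced; every input below is a placeholder for illustration, not a number from the slide.)

```python
# Illustrative only: a toy 5-year TCO model comparing an air-cooled and a
# liquid-cooled deployment of the same IT capacity. All inputs are placeholders.
YEARS = 5
EUR_PER_KWH = 0.25
HOURS_PER_YEAR = 8760

def five_year_tco(capex_meur: float, it_load_mw: float,
                  cooling_overhead: float, annual_maintenance_meur: float) -> float:
    """Capex plus electricity plus maintenance over five years, in millions of euros."""
    energy_meur_per_year = (it_load_mw * (1 + cooling_overhead)
                            * 1000 * HOURS_PER_YEAR * EUR_PER_KWH) / 1e6
    return capex_meur + YEARS * (energy_meur_per_year + annual_maintenance_meur)

air = five_year_tco(capex_meur=200, it_load_mw=20, cooling_overhead=0.40, annual_maintenance_meur=6)
liquid = five_year_tco(capex_meur=210, it_load_mw=20, cooling_overhead=0.12, annual_maintenance_meur=4)
print(f"air-cooled 5-year TCO:    ~{air:.0f} M EUR")
print(f"liquid-cooled 5-year TCO: ~{liquid:.0f} M EUR  ({(1 - liquid / air) * 100:.0f}% lower)")
```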
23:11Yeah.
So this is what the modular data center looks like.
So one thing we do very closely with Supermicro: we talked a lot about building data centers for liquid
cooling.
That's a long transition.
What we also work on is building shipping container data centers,
where you move the container into a big warehouse,
and this data center runs in isolation in the shipping container.
That's something that is, well, it's not cheap.
It's not as ideal as building it in a big, purpose-built data center.
But if you have a warehouse, if you have the space, it's still faster to move in container after container after
container
than to completely rebuild the concrete and everything.
23:50So there are companies, there are a couple of them.
23:53If you need help to find them, we are here to help you.
23:56Reach out to Supermicro, reach out to me.
We can link you with those people who can actually explain to you exactly how it works.
24:03But for some bigger companies, that's really interesting.
And we're getting a lot of inquiries about containerized data centers.
And one of the reasons is that, for example, in Japan, it is a lot easier to get a permit
for electricity use under 2 megawatts.
So now a lot of Japanese data center providers are building or designing their data centers with a 2 megawatt
module,
especially containerized, in order to get approval from the government.
Instead of building a very big one like 50 megawatts, they are dividing it into 20 or 25 modularized
data center clusters of 2 megawatts each.
Yeah, and also what happened is there were big fertilizer factories and other factories in Europe that all have power.
But the halls, the halls are just dirty and everything.
So they're the ideal targets for these containerized data centers, because they have the power there, the infrastructure is
there,
and you can bring the containers in.
You don't need to worry about how the environment looks today.
So one of the things I think Supermicro and we are super proud of: about a year ago, we deployed a
supercomputer in the States.
It has 100,000 GPUs, that's in the realm of 16,000 nodes, and it's a complete direct liquid cooled
solution.
And it was brought up within four months.
Four months, meaning that you ship 16,000 servers, effectively that's 2,000 racks, to get it racked, stacked and
shipped.
Because the reason why liquid cooling is more efficient to deploy is that you build the complete rack, you test the
complete rack in Supermicro's facilities,
and you ship the rack already pre-cabled to the customer site.
You don't need to screw anything together, you just get the rack in, you just move it in, you prepare everything,
put in the cables, put in the power, happy days.
25:59And one thing you see at the bottom, on the fourth picture, is a cooling distribution unit.
26:07Supermicro produces them.
And that's a big, big advantage, because when you look at a data center, you can either use in-row
or end-of-row
CDUs, big, big units, or Supermicro can work with you and with us to do an in-rack CDU.
So each rack has its own CDU.
It's a bit more costly at scale, but it gives you the flexibility to more easily add more racks into
the data hall.
Well, just a little true story behind this: we were competing with our competitor on the same project.
So we started building out, there were four data halls, and we took two, and we finished the two data
halls in four months,
and the competitor still hadn't finished the first data hall, because at Supermicro
we had built the whole rack integration solution.
We tested it at our site before we shipped it out to customers, and then you just plug in the water and the
power,
and everything is already tested and fine.
27:05So this is something that has never been done before, and we did it last year.
27:09It was more than 700 racks with NVIDIA.
27:13And for those of you who are interested in having a tour of this project, you can scan the barcode
27:20on the screen.
27:21There's a YouTube video available online.
27:24And now I think before we get thrown off the stage, because in about two minutes, I think that's about
27:31the clock,
27:33our CEO, Jensen, will have started his keynote in the dome, which I think is somewhere here.
27:38There will be a live stream on this stage. That's why we're a bit of a...
27:42If you want to talk to us, if you want to grab us, grab us, we will be here.
But I think we don't really have a lot of time for Q&A this time around.
27:49Yeah, actually, I do have a very quick question while time is running.
27:52So it's very, very interesting, very, very technical.
Just, I mean, you're talking about data centers for huge use, huge, huge data centers for organizational, corporate use.
But I have a question: going forward, do you imagine people, for example, having private data centers?
Why? Because a way to make money tomorrow could be to rent out their GPU space.
I mean, we're talking about every single home on the planet having that potential going forward.
28:21No, I think the biggest issue is you need the power grid.
28:25Okay.
You need to have, think about it, if you want to build one rack, that's 150 kilowatts.
28:29If you build a house, you don't have a 150 kilowatt line.
28:31Okay, okay.
28:32I think that's one thing.
But what we do see is a lot of these MSPs, the traditional service partners. So enterprise data centers are
just suffering with the power.
Okay.
So they're moving to specialized MSPs, people that basically build these containers in warehouses.
28:48Yeah.
28:49Okay, well listen, thank you very much.
28:51Yeah.
28:51Ladies and gentlemen, Jack and Bernard.
28:59from NVIDIA and Supermicro.
29:02Got it at last.
29:03All right.
29:04So please, gentlemen.
29:04Okay.
29:05Thank you very much.
29:05Thank you.
29:06All right.