Overheating problem found in NVIDIA’s ‘Blackwell’ servers… “We’re considering returning to H100”

NVIDIA server rack equipped with GB200 (Photo = Shutterstock)

Nvidia’s next-generation ‘Blackwell’ GPU is reported to overheat when installed in high-capacity server racks. As a result, major customers such as Microsoft (MS), Meta, and xAI are concerned about whether they will be able to bring Blackwell servers online on schedule.

The Information reported on the 15th (local time), citing an insider, that overheating problems were occurring in servers equipped with 72 Blackwell GPUs.

The servers are expected to consume up to 120 kilowatts (kW) per rack, and overheating can reduce GPU performance and potentially damage components. As a result, Nvidia is reported to have had to revise the server rack design several times.

To address the overheating issue, Nvidia has instructed its suppliers to change the server rack design and has worked with its partners on engineering to improve cooling performance. Nonetheless, there are concerns that the repeated design changes may delay some companies' deployment of the server racks.

Although it is common to adjust server designs before launch, it has been pointed out that the Blackwell rack changes were made late in the production process.

Nevertheless, Nvidia has not yet notified customers of any delay and says it plans to deliver the server racks on schedule, by the end of the first half of next year.

Blackwell GPUs have already had their production schedule delayed once because of design flaws. As a result, the final revised Blackwell GPU entered mass production at the end of October and is expected to begin shipping at the end of January next year.

In particular, as design changes kept occurring during production, some customers were reportedly considering buying more of the existing chips, whose stability is already proven, instead of Blackwell.

NVIDIA’s current-generation Hopper chips, namely the ‘H100’ and ‘H200’, offer significantly lower performance than the Blackwell product line’s ‘B100’ and ‘B200’.

Reporter Park Chan cpark@aitimes.com
