Nvidia's MLPerf Submission Shows B200 Offers Up To 2.2x Training Performance Of H100
Analysis Nvidia offered the first look at how its upcoming Blackwell accelerators stack up against the venerable H100 in real-world training workloads, claiming up to 2.2x higher performance.
The benchmarks, released as part of this week's MLPerf results, are in line with what we expected from Blackwell at this stage. The DGX B200 systems – used in Nvidia's Nyx supercomputer – boast roughly 2.27x the peak floating-point performance of last gen's H100 systems across FP8, FP16, BF16, and TF32 precisions.
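That 2.27x figure lines up with the published peak specs. As a quick sanity check in Python – the B200's 9 petaFLOPS sparse FP8 rating appears later in this piece, while the H100 SXM figure is our own number from Nvidia's datasheet, not the submission data:

    # Back-of-the-envelope check on the ~2.27x peak-FLOPS claim,
    # using published sparse FP8 ratings in petaFLOPS
    h100_fp8_sparse = 3.958  # H100 SXM datasheet figure (our assumption)
    b200_fp8_sparse = 9.0    # B200 rating cited later in this article
    print(f"{b200_fp8_sparse / h100_fp8_sparse:.2f}x")  # -> 2.27x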
And this is borne out in the results. Against the H100, the B200 managed 2.2x higher performance when fine-tuning Llama 2 70B and twice the performance when pre-training GPT-3 175B.
However, it's not just raw FLOPS at play here. According to Nvidia, Blackwell's substantially higher memory bandwidth – up to 8 TBps on the flagship parts – also came into play.
"Taking advantage of higher-bandwidth HBM3e memory, just 64 Blackwell GPUs were run in the GPT-3 LLM benchmark without compromising per-GPU performance. The same benchmark run using Hopper needed 256 GPUs to achieve the same performance," the acceleration champ explained in a blog post.
While that benchmark was conducted using just 64 GPUs across eight nodes, it's not clear whether that represents one partition of a larger system, or whether Nyx is "super" in terms of performance rather than scale – details on the machine are sparse. From what we've gathered from images and past DGX configurations, we're looking at a modular system of three, maybe four, eight-GPU nodes per rack, with rack count and interconnect bandwidth the two big question marks.
The Register reached out to Nvidia for clarification on Nyx, and we'll let you know if we hear anything back.
The fact that Nvidia is using the B200 as the basis for its first training submissions tells us that there's still a good amount of performance left on the table.
On paper, the B200 is capable of churning out 9 petaFLOPS of sparse FP8 performance, and is rated at a kilowatt of power. The 1.2 kW GPUs found in Nvidia's flagship GB200, on the other hand, are each capable of delivering 10 petaFLOPS at the same precision.
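Taking those ratings and board powers at face value puts the two parts at slightly different efficiency points – a trivial comparison, using only the figures above:

    # Perf-per-watt from the peak sparse FP8 ratings and board powers above
    for name, pflops, watts in [("B200", 9.0, 1000), ("GB200 GPU", 10.0, 1200)]:
        print(f"{name}: {pflops * 1e15 / watts / 1e12:.2f} TFLOPS/W")
    # B200: 9.00 TFLOPS/W; GB200 GPU: 8.33 TFLOPS/W

On paper, then, the GB200's GPUs trade a little efficiency for absolute throughput – though liquid cooling and a far larger NVLink domain muddy any direct comparison.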
However, it's not just that GB200 systems have higher peak performance – the GPU domain is also considerably larger. Traditionally, DGX systems have housed eight GPUs interconnected by a high-speed NVLink switch fabric, with additional scale achieved using multiple InfiniBand links between the nodes.
With Blackwell, Nvidia has expanded the NVLink domain from eight to 72 accelerators with its NVL72 reference designs.
How large a difference this actually makes to time to train is hard to say, but we could see a sizable uplift by the time MLCommons releases its next batch of training results: time to train is often limited by data movement, and NVLink is several times faster than InfiniBand.
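To put rough numbers on that, here's a toy ring all-reduce cost model. The link bandwidths – roughly 900 GB/s per direction for fifth-gen NVLink, about 50 GB/s for 400 Gbps NDR InfiniBand – and the 350 GB gradient payload (175 billion gradients at two bytes apiece) are our illustrative assumptions, not measured figures:

    # Toy ring all-reduce cost model: each GPU moves roughly
    # 2*(N-1)/N times the gradient payload over its slowest link
    def allreduce_seconds(grad_gb: float, n_gpus: int, link_gb_per_s: float) -> float:
        return 2 * (n_gpus - 1) / n_gpus * grad_gb / link_gb_per_s

    grad_gb = 350.0  # ~175B gradients at 2 bytes each (assumed)
    print(f"NVLink 5: {allreduce_seconds(grad_gb, 72, 900.0):.2f} s")  # ~0.77 s
    print(f"NDR IB  : {allreduce_seconds(grad_gb, 72, 50.0):.2f} s")   # ~13.81 s

Real training runs overlap communication with compute and use hierarchical collectives, so the gap won't show up one-for-one in time to train – but it illustrates why a bigger NVLink domain matters.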
Even if Nvidia's next training submission is still based on B200 systems, improvements in software and networking infrastructure could drive further gains.
Next-gen ConnectX-8 SuperNICs are set to double InfiniBand bandwidth to 800 Gbps. Meanwhile, software optimizations and other upgrades have driven considerable performance improvements since Hopper made its debut on the MLPerf ranking.
Blackwell's training results come just months after Nvidia shared the first MLPerf inference benchmarks for the compute platform. In those tests, Blackwell achieved a 4x uplift over Hopper.
In addition to Blackwell, Nvidia shared large-scale training results for the GPT-3 175B benchmark using 11,616 Hopper GPUs. This is significant – it's not uncommon to see clusters several times that size deployed to support model development. ®