NVIDIA Breaks Records in Generative AI with MLPerf Training v4.0

NVIDIA has set new performance and scale records in the generative AI domain, according to a recent submission to MLPerf Training v4.0. This achievement underscores the company's ongoing dominance in AI training benchmarks, particularly in the realm of large language models (LLMs) and generative AI.

MLPerf Training v4.0 Updates

MLPerf Training, developed by the MLCommons consortium, is the industry-standard benchmark for evaluating end-to-end AI training performance. The latest version, v4.0, introduced two new tests to reflect popular industry workloads. The first test measures the fine-tuning speed of Llama 2 70B using the low-rank adaptation (LoRA) technique. The second test focuses on graph neural network (GNN) training, based on an implementation of the relational graph attention network (RGAT).
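
For readers unfamiliar with LoRA: rather than updating all of a model's weights during fine-tuning, LoRA freezes the pretrained weight matrix and trains a small low-rank update alongside it, drastically reducing the number of trainable parameters. A minimal PyTorch sketch of the idea (illustrative only, not the benchmark's reference implementation):

```python
# A minimal sketch of the LoRA idea (illustrative; not the MLPerf reference code).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)              # pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # low-rank factor, trained
        self.B = nn.Parameter(torch.zeros(d_out, r))        # zero-init: no effect at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(4096, 4096, r=16)
y = layer(torch.randn(2, 4096))  # only A and B receive gradients
```

Because only the small A and B matrices are trained, a LoRA adapter for a model as large as Llama 2 70B is a tiny fraction of the full model's size, which is what makes fine-tuning tractable on a single node.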

The updated test suite includes a variety of workloads such as LLM pre-training (GPT-3 175B), LLM fine-tuning (Llama 2 70B with LoRA), text-to-image (Stable Diffusion v2), and several others, covering a wide range of AI applications.

NVIDIA's Record-Breaking Performance

In the latest MLPerf Training round, NVIDIA achieved remarkable performance using a full stack of its hardware and software solutions:

  • NVIDIA Hopper GPUs
  • Fourth-generation NVLink interconnect with third-generation NVSwitch chip
  • NVIDIA Quantum-2 InfiniBand networking
  • An optimized NVIDIA software stack

These components have been further optimized since the last round, enabling NVIDIA to break previous records. For instance, NVIDIA improved its GPT-3 175B training time from 10.9 minutes using 3,584 H100 GPUs to just 3.4 minutes using 11,616 H100 GPUs, demonstrating near-linear performance scaling.
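
A quick back-of-envelope check, using the figures quoted above, shows why this scaling qualifies as near-linear:

```python
# Scaling-efficiency arithmetic using the GPT-3 175B figures from the text.
small = {"gpus": 3_584,  "minutes": 10.9}
large = {"gpus": 11_616, "minutes": 3.4}

speedup    = small["minutes"] / large["minutes"]   # ~3.21x faster
gpu_ratio  = large["gpus"] / small["gpus"]         # ~3.24x more GPUs
efficiency = speedup / gpu_ratio                   # ~0.99, i.e. near-linear
print(f"{speedup:.2f}x speedup on {gpu_ratio:.2f}x GPUs -> {efficiency:.0%} efficiency")
```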

Generative AI and LLM Fine-Tuning

NVIDIA also set new records in LLM fine-tuning, particularly with the Llama 2 70B model developed by Meta. Utilizing the LoRA technique, a single DGX H100 with eight H100 GPUs completed the fine-tuning in just over 28 minutes. The NVIDIA H200 Tensor Core GPU further reduced this time to 24.7 minutes. NVIDIA's submissions also showcased scalability, achieving a fine-tuning time of just 1.5 minutes using 1,024 H100 GPUs.

To achieve these results, the company leveraged the context parallelism capability available in the NVIDIA NeMo framework, which partitions long sequences across GPUs. Additionally, an FP8 implementation of self-attention in cuDNN improved performance by 15% at the 8-GPU scale.
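
The submission's FP8 attention lives inside cuDNN, but the general pattern of running Transformer layers in FP8 on Hopper can be sketched with NVIDIA's Transformer Engine library. The layer sizes below are made up for illustration; this is not the submission code:

```python
# Illustrative FP8 Transformer execution via NVIDIA Transformer Engine
# (a stand-in for the cuDNN FP8 attention used in the actual submission).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

hidden, ffn, heads, seq, batch = 1024, 4096, 16, 512, 2  # illustrative sizes

layer = te.TransformerLayer(hidden, ffn, heads).cuda()
x = torch.randn(seq, batch, hidden, device="cuda")

recipe = DelayedScaling(fp8_format=Format.HYBRID)  # E4M3 forward, E5M2 backward
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)        # attention and MLP GEMMs execute in FP8
y.sum().backward()      # gradients flow through the FP8 layers as usual
```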

Advancements in Visual Generative AI

MLPerf Training v4.0 also includes a benchmark for text-to-image generative AI based on Stable Diffusion v2. NVIDIA's submissions delivered up to 80% more performance at the same scales through extensive software enhancements, such as the use of full-iteration CUDA Graphs and an optimized distributed optimizer for Stable Diffusion.
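
Full-iteration CUDA Graphs capture the entire training step (forward pass, backward pass, and optimizer update) as a single replayable graph, eliminating per-kernel CPU launch overhead. A minimal sketch following PyTorch's whole-network capture pattern, with a toy model standing in for Stable Diffusion:

```python
# A minimal sketch of full-iteration CUDA Graph capture, following PyTorch's
# whole-network capture pattern; the model and sizes are illustrative only.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(1024, 1024, device="cuda")
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
static_x = torch.randn(8, 1024, device="cuda")  # static input buffer
static_y = torch.randn(8, 1024, device="cuda")  # static target buffer

# Warm up on a side stream before capture (required by the capture API).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        opt.zero_grad(set_to_none=True)
        F.mse_loss(model(static_x), static_y).backward()
        opt.step()
torch.cuda.current_stream().wait_stream(s)

# Capture forward, backward, and the optimizer step as one graph.
g = torch.cuda.CUDAGraph()
opt.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = F.mse_loss(model(static_x), static_y)
    static_loss.backward()
    opt.step()

# Replay: copy fresh data into the static buffers, then launch the whole
# iteration with a single call instead of many individual kernel launches.
for _ in range(10):
    static_x.copy_(torch.randn(8, 1024, device="cuda"))
    g.replay()
```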

Graph Neural Network Training

NVIDIA set new records in GNN training as well, submitting results at scales of 8, 64, and 512 H100 GPUs and achieving a record time of just 1.1 minutes in the largest, 512-GPU configuration. At the 8-GPU scale, H200 Tensor Core GPUs delivered a 47% boost over the H100 submission.
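
To make the workload concrete: a relational graph attention layer computes attention over each node's neighbors while learning separate transformations per edge type. The sketch below uses PyTorch Geometric's RGATConv on a random graph; the MLPerf benchmark is based on a different RGAT implementation and a far larger dataset, so treat this purely as an illustration:

```python
# Illustrative RGAT layer via PyTorch Geometric (not the MLPerf reference code).
import torch
from torch_geometric.nn import RGATConv

num_nodes, num_edges = 100, 500
in_dim, out_dim, num_relations = 64, 32, 3                 # toy sizes

x = torch.randn(num_nodes, in_dim)                         # node features
edge_index = torch.randint(0, num_nodes, (2, num_edges))   # random edges
edge_type = torch.randint(0, num_relations, (num_edges,))  # relation per edge

conv = RGATConv(in_dim, out_dim, num_relations)
out = conv(x, edge_index, edge_type)                       # [num_nodes, out_dim]
```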

Key Takeaways

NVIDIA continues to lead in AI training performance, showcasing the highest versatility and efficiency across a range of AI workloads. The company's ongoing optimization of its software stack ensures more performance per GPU, reducing training costs and enabling the training of more demanding models.

Looking ahead, the NVIDIA Blackwell platform, announced at GTC 2024, promises to democratize trillion-parameter AI, delivering up to 30x faster real-time trillion-parameter inference and up to 4x faster trillion-parameter training compared to NVIDIA Hopper GPUs.

For more detailed information, visit the NVIDIA Technical Blog.


