NVIDIA Unveils Grouped GEMM APIs in cuBLAS 12.5 to Boost DL and HPC Performance

The latest release of the NVIDIA cuBLAS library, version 12.5, brings significant updates aimed at enhancing the functionality and performance of deep learning (DL) and high-performance computing (HPC) workloads, according to the NVIDIA Technical Blog. Key updates include the introduction of Grouped GEMM APIs, improved matrix multiplication (matmul) performance on NVIDIA Hopper (H100 and H200) and Ada (L40S) GPUs, and enhanced performance tuning options.

Grouped GEMM APIs

The newly introduced Grouped GEMM APIs generalize batched APIs by allowing different matrix sizes, transpositions, and scaling factors to be grouped and executed in one kernel launch. This approach has shown a 1.2x speedup in certain scenarios, such as the generation phase of a mixture-of-experts (MoE) model with batch sizes of 8 and 64 and FP16 inputs and outputs.

Two new sets of APIs support Grouped GEMM:

  1. cublas<t>gemmGroupedBatched for FP32 (including TF32) and FP64 precisions.
  2. cublasGemmGroupedBatchedEx for FP16, BF16, FP32 (including TF32), and FP64 precisions.

These APIs support variable shapes, transpositions, and scaling factors. Examples can be found on the NVIDIA/CUDALibrarySamples GitHub repository.
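To make the grouping idea concrete, the following is a minimal CPU sketch of what a grouped GEMM computes, not the cuBLAS implementation. The function name and argument order are hypothetical and only loosely mirror the per-group arrays described above. It assumes row-major nested lists for readability, whereas cuBLAS itself uses column-major storage and runs the whole set in one kernel launch on the GPU.

```python
def _at(mat, transposed, i, j):
    """Return element (i, j) of op(mat), where op is identity or transpose."""
    return mat[j][i] if transposed else mat[i][j]


def gemm_grouped_batched(transa, transb, m, n, k, alpha,
                         a_list, b_list, beta, c_list, group_size):
    """CPU reference: C = alpha * op(A) @ op(B) + beta * C for every problem.

    transa, transb, m, n, k, alpha, beta and group_size hold one entry per
    group. a_list, b_list and c_list are flat, with one matrix per problem,
    ordered group by group. Each C matrix is updated in place.
    """
    idx = 0  # position in the flat matrix lists
    for g in range(len(group_size)):
        for _ in range(group_size[g]):
            a, b, c = a_list[idx], b_list[idx], c_list[idx]
            for i in range(m[g]):
                for j in range(n[g]):
                    acc = 0.0
                    for p in range(k[g]):
                        acc += (_at(a, transa[g], i, p)
                                * _at(b, transb[g], p, j))
                    c[i][j] = alpha[g] * acc + beta[g] * c[i][j]
            idx += 1
    return c_list


# Hypothetical example: group 0 holds two 2x2 problems, group 1 holds one
# 1x1 problem with k = 3, a transposed A, and different scaling factors.
a_list = [[[1, 2], [3, 4]], [[1, 0], [0, 1]], [[1], [2], [3]]]
b_list = [[[5, 6], [7, 8]], [[9, 8], [7, 6]], [[4], [5], [6]]]
c_list = [[[0, 0], [0, 0]], [[0, 0], [0, 0]], [[10]]]
gemm_grouped_batched(
    transa=[False, True], transb=[False, False],
    m=[2, 1], n=[2, 1], k=[2, 3],
    alpha=[1.0, 2.0], a_list=a_list, b_list=b_list,
    beta=[0.0, 1.0], c_list=c_list, group_size=[2, 1],
)
```

Within a group, every problem shares a shape, transposition setting, and scaling factors, while different groups are free to differ. A plain batched GEMM corresponds to the special case of a single group.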

Latest LLM Matmul Performance on NVIDIA H100, H200, and L40S GPUs

Recent performance snapshots show significant speedups for Llama 2 70B and GPT3 training phases on NVIDIA H100, H200, and L40S GPUs. The H200 GPU, in particular, demonstrates nearly 3x and 5x speedups compared to the A100 for Llama 2 70B and GPT3 training phases, respectively. These improvements are measured without locking GPU clocks and account for the number of times each GEMM is repeated in the workload.

Figure 1. Speedup of the GEMM-only fraction of e2e workloads
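The source does not spell out how the repetition counts enter the speedup figure. One plausible reading, sketched below as an assumption, is to weight each GEMM's time by how often it occurs in the workload and then compare the totals. The function name and the timings are hypothetical and are not NVIDIA's measurements.

```python
def gemm_only_speedup(gemms):
    """Speedup of the GEMM-only fraction, weighted by repetition count.

    gemms is an iterable of (repeat_count, baseline_time, new_time) tuples,
    with both times in the same unit. The result is the total weighted
    baseline time divided by the total weighted new time.
    """
    gemms = list(gemms)  # allow generators, since we iterate twice
    baseline_total = sum(count * base for count, base, _ in gemms)
    new_total = sum(count * new for count, _, new in gemms)
    if new_total <= 0:
        raise ValueError("total new time must be positive")
    return baseline_total / new_total


# Hypothetical workload: a small GEMM repeated 96 times that gets 2x faster,
# and a large GEMM repeated 4 times that gets 5x faster.
speedup = gemm_only_speedup([(96, 2.0, 1.0), (4, 10.0, 2.0)])
print(f"{speedup:.2f}x")
```

Under this weighting, a frequently repeated GEMM with a modest gain can matter more to the overall figure than a rare GEMM with a large one.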

Library Performance and Benchmarking

Several enhancements have been made to runtime performance heuristics and performance tuning APIs. The cuBLAS library uses a recommender system at runtime to dispatch the fastest available configuration for any user-requested matmuls. This system is trained on actual timing data from a wide range of problems and configurations.

Figure 2. Sampling of various GEMMs using multiple configurations in different kernel families

For advanced users, the cublasLtMatmulAlgoGetHeuristic API enables performance tuning to achieve faster implementations. Examples of auto-tuning in cuBLAS can be found on the NVIDIA/CUDALibrarySamples repository.

Figure 4. An example of auto-tuning in cuBLAS
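The general shape of such an auto-tuning loop can be sketched as follows: take a list of candidate configurations (in cuBLASLt these would come from cublasLtMatmulAlgoGetHeuristic), time each one, and keep the fastest. This is a generic illustration rather than the code from the samples repository. The autotune function, its parameters, and the candidate names are all hypothetical, and the clock is injectable so the example runs without a GPU.

```python
import time


def autotune(candidates, run, repeats=5, warmup=1, clock=time.perf_counter):
    """Return (best_candidate, median_time) over the given candidates.

    run(candidate) executes the workload once with that candidate.
    Warm-up runs are not timed; the median of the timed runs is used
    so that a single slow outlier does not decide the winner.
    """
    best = None
    for cand in candidates:
        for _ in range(warmup):
            run(cand)
        samples = []
        for _ in range(repeats):
            start = clock()
            run(cand)
            samples.append(clock() - start)
        samples.sort()
        median = samples[len(samples) // 2]
        if best is None or median < best[1]:
            best = (cand, median)
    if best is None:
        raise ValueError("no candidates to tune over")
    return best


# Hypothetical stand-in for real matmul launches: a fake clock that each
# "run" advances by a fixed cost per candidate.
state = {"now": 0.0}
cost = {"algo_a": 3.0, "algo_b": 1.0, "algo_c": 2.0}


def fake_run(cand):
    state["now"] += cost[cand]


best, elapsed = autotune(list(cost), fake_run, clock=lambda: state["now"])
print(best, elapsed)
```

On a real GPU, the timed region would need to include device synchronization (for example, with CUDA events), since kernel launches return before the work finishes.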

Better Functionality and Performance in cuBLASLt

Since cuBLAS 12.0, numerous enhancements have been introduced:

  1. Fused epilogue support parity between BF16 and FP16 precisions on NVIDIA Ampere and Ada.
  2. Additional fused epilogues on NVIDIA Hopper and Ampere.
  3. Support for FP8 on Ada GPUs and performance updates on Ada L4, L40, and L40S.
  4. Removal of M, N, and batch size limitations of cuBLASLt matmul API.
  5. Improved performance of heuristics cache for workloads with high eviction rate.
  6. cuBLAS symbols are available in the CUDA Toolkit symbols for Linux repository.

For more information on cuBLAS, see the documentation and samples.


