NVIDIA Unveils Grouped GEMM APIs In cuBLAS 12.5 To Boost DL And HPC Performance
The latest release of the NVIDIA cuBLAS library, version 12.5, brings updates aimed at improving the functionality and performance of deep learning (DL) and high-performance computing (HPC) workloads, according to the NVIDIA Technical Blog. Key updates include the introduction of Grouped GEMM APIs, improved matrix multiplication (matmul) performance on NVIDIA Hopper (H100 and H200) and Ada (L40S) GPUs, and enhanced performance tuning options.
Grouped GEMM APIs
The newly introduced Grouped GEMM APIs generalize batched APIs by allowing different matrix sizes, transpositions, and scaling factors to be grouped and executed in one kernel launch. This approach has shown a 1.2x speedup in certain scenarios, such as the generation phase of a mixture-of-experts (MoE) model with batch sizes of 8 and 64 and FP16 inputs and outputs.
Two new sets of APIs support Grouped GEMM:
- cublas<t>gemmGroupedBatched for FP32 (including TF32) and FP64 precisions.
- cublasGemmGroupedBatchedEx for FP16, BF16, FP32 (including TF32), and FP64 precisions.
These APIs support variable shapes, transpositions, and scaling factors. Examples can be found on the NVIDIA/CUDALibrarySamples GitHub repository.
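To make the grouping idea concrete, the following is a minimal CPU reference sketch in plain Python, not the cuBLAS API itself. It assumes that each group shares one shape, one pair of transposition flags, and one pair of scaling factors, and that the problems are passed as flat lists ordered group by group. The names GemmGroup, gemm, and grouped_gemm are hypothetical and chosen only for illustration. Where the library would execute the grouped problems in one kernel launch on the GPU, this sketch simply loops over them.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]


@dataclass
class GemmGroup:
    """Hypothetical descriptor: all problems in a group share these settings."""
    trans_a: bool
    trans_b: bool
    m: int
    n: int
    k: int
    alpha: float
    beta: float
    group_size: int  # number of (A, B, C) problems in this group


def transpose(x: Matrix) -> Matrix:
    return [list(row) for row in zip(*x)]


def gemm(a, b, c, g: GemmGroup) -> Matrix:
    """Compute alpha * op(A) @ op(B) + beta * C for one problem on the CPU."""
    op_a = transpose(a) if g.trans_a else a
    op_b = transpose(b) if g.trans_b else b
    assert len(op_a) == g.m and len(op_a[0]) == g.k
    assert len(op_b) == g.k and len(op_b[0]) == g.n
    return [
        [
            g.alpha * sum(op_a[i][p] * op_b[p][j] for p in range(g.k))
            + g.beta * c[i][j]
            for j in range(g.n)
        ]
        for i in range(g.m)
    ]


def grouped_gemm(groups, a_list, b_list, c_list):
    """Run every problem; the flat lists are ordered group by group (assumed)."""
    results = []
    idx = 0
    for g in groups:
        for _ in range(g.group_size):
            results.append(gemm(a_list[idx], b_list[idx], c_list[idx], g))
            idx += 1
    assert idx == len(a_list) == len(b_list) == len(c_list)
    return results


# Two groups with different shapes, transpositions, and scaling factors.
groups = [
    GemmGroup(False, False, m=2, n=2, k=3, alpha=1.0, beta=0.0, group_size=2),
    GemmGroup(True, False, m=1, n=2, k=2, alpha=2.0, beta=1.0, group_size=1),
]
a_list = [
    [[1, 2, 3], [4, 5, 6]],
    [[1, 0, 0], [0, 1, 0]],
    [[1], [2]],  # stored as k x m because trans_a is set for this group
]
b_list = [
    [[1, 0], [0, 1], [1, 1]],
    [[1, 2], [3, 4], [5, 6]],
    [[1, 2], [3, 4]],
]
c_list = [
    [[0, 0], [0, 0]],
    [[0, 0], [0, 0]],
    [[1, 1]],
]
results = grouped_gemm(groups, a_list, b_list, c_list)
print(results)
```

A plain batched GEMM corresponds to the special case of a single group, which is the sense in which the grouped form generalizes it.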
Latest LLM Matmul Performance on NVIDIA H100, H200, and L40S GPUs
Recent performance snapshots show significant speedups for Llama 2 70B and GPT3 training phases on NVIDIA H100, H200, and L40S GPUs. The H200 GPU, in particular, demonstrates nearly 3x and 5x speedups compared to the A100 for Llama 2 70B and GPT3 training phases, respectively. These improvements are measured without locking GPU clocks and account for the number of times each GEMM is repeated in the workload.
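The article does not give the exact aggregation formula, so the following is only a sketch of one plausible reading of "accounting for the number of times each GEMM is repeated": each GEMM's time is weighted by its repeat count before the baseline and new totals are compared. The function name workload_speedup and all timing numbers are hypothetical and are not measurements from the blog.

```python
def workload_speedup(gemms):
    """gemms: iterable of (repeat_count, baseline_time, new_time) tuples.

    Returns total baseline time divided by total new time, with each
    GEMM weighted by how often it occurs in the workload (assumed method).
    """
    gemms = list(gemms)
    baseline_total = sum(count * base for count, base, _ in gemms)
    new_total = sum(count * new for count, _, new in gemms)
    return baseline_total / new_total


# Hypothetical timings in microseconds: a frequent GEMM that got 2x faster
# and a rare GEMM that got 8x faster.
example = [(96, 200, 100), (4, 800, 100)]
weighted = workload_speedup(example)
unweighted_mean = sum(base / new for _, base, new in example) / len(example)
print(weighted, unweighted_mean)
```

With these made-up numbers the weighted figure stays close to the speedup of the frequent GEMM, while a plain average of per-GEMM speedups would overstate the gain.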
Library Performance and Benchmarking
Several enhancements have been made to runtime performance heuristics and performance tuning APIs. The cuBLAS library uses a recommender system at runtime to dispatch the fastest available configuration for any user-requested matmuls. This system is trained on actual timing data from a wide range of problems and configurations.
For advanced users, the cublasLtMatmulAlgoGetHeuristic API enables performance tuning to achieve faster implementations. Examples of auto-tuning in cuBLAS can be found on the NVIDIA/CUDALibrarySamples repository.
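The auto-tuning pattern described above can be sketched generically: obtain several candidate configurations, time each one on the actual problem, and keep the fastest. The Python sketch below is an illustration of that loop, not cuBLASLt code. In a real setup the candidates would presumably be the algorithms returned by cublasLtMatmulAlgoGetHeuristic and the run step a cublasLtMatmul call timed on the GPU; here autotune, run, and clock are hypothetical names.

```python
import time


def autotune(candidates, run, repeats=5, clock=time.perf_counter):
    """Return (best_candidate, best_time) using best-of-`repeats` timing.

    `run(candidate)` executes the workload once with that candidate.
    `clock` is injectable so the timing source can be swapped out.
    """
    best, best_time = None, float("inf")
    for cand in candidates:
        run(cand)  # warm-up run, not timed
        elapsed = float("inf")
        for _ in range(repeats):
            start = clock()
            run(cand)
            elapsed = min(elapsed, clock() - start)
        if elapsed < best_time:
            best, best_time = cand, elapsed
    return best, best_time


# Deterministic demo: a fake clock that advances by a fixed cost per run.
class FakeClock:
    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


fake_clock = FakeClock()
costs = {"algo_a": 3.0, "algo_b": 1.0, "algo_c": 2.0}  # hypothetical costs


def fake_run(name):
    fake_clock.now += costs[name]


best, best_time = autotune(list(costs), fake_run, repeats=3, clock=fake_clock)
print(best, best_time)
```

Taking the minimum over several repeats, after an untimed warm-up, is one common way to reduce noise; other statistics such as the median would also be reasonable.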
Better Functionality and Performance in cuBLASLt
Since cuBLAS 12.0, numerous enhancements have been introduced:
- Fused epilogue support parity between BF16 and FP16 precisions on NVIDIA Ampere and Ada.
- Additional fused epilogues on NVIDIA Hopper and Ampere.
- Support for FP8 on Ada GPUs and performance updates on Ada L4, L40, and L40S.
- Removal of M, N, and batch size limitations of cuBLASLt matmul API.
- Improved performance of heuristics cache for workloads with high eviction rate.
- cuBLAS symbols are now available in the CUDA Toolkit symbols for Linux repository.
For more information on cuBLAS, see the documentation and samples.