NVIDIA CUDA-X math libraries provide the fundamental numerical building blocks that enable developers to deploy accelerated applications across multiple high-performance domains, including AI and scientific computing.
cuBLAS is a CUDA-X math library consisting of a highly optimized collection of basic linear algebra subroutines for matrix and vector operations, specifically tuned to deliver the best possible performance on NVIDIA hardware through familiar, easy-to-use APIs.
The latest cuBLAS update in NVIDIA CUDA Toolkit 13.0 Update 2 introduces new APIs and implementations that significantly boost the performance of double-precision (FP64) matrix multiplications (matmuls). This is achieved through floating-point (FP) emulation on the Tensor Cores present in GPU architectures such as NVIDIA GB200 NVL72 and NVIDIA RTX PRO 6000 Blackwell Server Edition. For comprehensive information on GPU compatibility for both FP32 and FP64 emulation, refer to the cuBLAS documentation.
This new emulated FP64 matmul implementation complements the recently released single-precision (FP32) matmul emulation. Developers can fine-tune the required accuracy for FP64 matrix multiplications, but by default cuBLAS maintains accuracy equal to or better than native hardware. It automatically assesses whether an operation will perform better using FP emulation (with accuracy preserved) or native hardware, and then selects the optimal implementation.
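To give a sense of what this looks like from the application side, here is a minimal sketch of an ordinary FP64 GEMM through the classic cuBLAS API. There is no emulation-specific code in it; whether the call is actually dispatched to emulated Tensor Core kernels depends on your GPU and on the emulation settings in your cuBLAS version, so treat this as an illustration of the drop-in workflow rather than a guarantee of emulated execution.

```cpp
#include <cublas_v2.h>

// Minimal sketch: a standard double-precision GEMM, C = A * B, column-major,
// with A, B, C already resident on the device. On supported Blackwell GPUs
// with CUDA Toolkit 13.0 Update 2 and emulation enabled, cuBLAS can route
// this unchanged call to emulated FP64 Tensor Core kernels when that is
// faster and accuracy preserving.
void fp64_gemm(cublasHandle_t handle, int m, int n, int k,
               const double* dA, const double* dB, double* dC) {
    const double alpha = 1.0;
    const double beta  = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, dA, m,   // lda = m
                        dB, k,   // ldb = k
                &beta,  dC, m);  // ldc = m
}
```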
This post explains cuBLAS capabilities in CUDA Toolkit 13.0 Update 2, including:
- Seamless access to Tensor Core performance through familiar, easy-to-use developer APIs
- FP32 emulation on Blackwell BF16 Tensor Cores, providing increased performance over native FP32 matrix multiplication while preserving accuracy (see the sketch following this list)
- FP64 emulation on Blackwell INT8 Tensor Cores, providing a safe, automatic performance increase with fallback to native execution when needed
- FP emulation for increased performance across a variety of application domains and hardware platforms
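For the FP32 path called out above, the sketch below shows one way to request BF16x9 emulation explicitly through cublasGemmEx. The compute-type enum used here is the name recent cuBLAS releases introduced for BF16x9 emulation; treat it as an assumption and verify the exact spelling against the cuBLAS headers and documentation for your toolkit version.

```cpp
#include <cublas_v2.h>

// Sketch: explicitly requesting a BF16x9-emulated FP32 matmul.
// Assumption: CUBLAS_COMPUTE_32F_EMULATED_16BFX9 is the compute type exposed
// for BF16x9 emulation in recent cuBLAS releases; check your headers.
void fp32_gemm_emulated(cublasHandle_t handle, int m, int n, int k,
                        const float* dA, const float* dB, float* dC) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha, dA, CUDA_R_32F, m,
                         dB, CUDA_R_32F, k,
                 &beta,  dC, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F_EMULATED_16BFX9,
                 CUBLAS_GEMM_DEFAULT);
}
```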
This is the first release of FP64 matmul emulation, with more advancements to follow in upcoming releases.
Floating-point emulation in practice
The cuBLAS library exposes two flavors of matmul emulation: the BF16x9 algorithm for FP32 and the Ozaki Scheme for FP64. The BF16x9 algorithm uses a static decomposition that can safely and performantly emulate all normal and subnormal FP32 values on Blackwell BF16 Tensor Cores, as sketched below.
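As a toy, scalar-level illustration of that decomposition (not the library's actual kernels), the snippet below splits two FP32 values into three BF16 slices each and recovers their product from the nine BF16-by-BF16 partial products, which is the same pattern of work that gets mapped onto BF16 Tensor Cores with FP32 accumulation.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <cmath>

// Truncate a float to bfloat16 precision by keeping the top 16 bits of its
// IEEE 754 bit pattern. Real hardware rounds; truncation is enough to
// illustrate the decomposition.
static float to_bf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFF0000u;             // drop the low 16 mantissa bits
    std::memcpy(&x, &bits, sizeof(bits));
    return x;
}

int main() {
    float a = 1.2345678f, b = 7.6543210f;

    // Split each FP32 value into three BF16 "slices": hi + mid + lo.
    // Three slices of 8 significant bits each cover the 24-bit FP32 significand.
    float as[3], bs[3];
    as[0] = to_bf16(a); as[1] = to_bf16(a - as[0]); as[2] = to_bf16(a - as[0] - as[1]);
    bs[0] = to_bf16(b); bs[1] = to_bf16(b - bs[0]); bs[2] = to_bf16(b - bs[0] - bs[1]);

    // a * b becomes a sum of 9 BF16 x BF16 products (hence "BF16x9"),
    // accumulated in FP32.
    float prod = 0.0f;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            prod += as[i] * bs[j];

    std::printf("native FP32 product : %.9g\n", a * b);
    std::printf("BF16x9 product      : %.9g\n", prod);
    std::printf("absolute difference : %.3g\n", std::fabs(a * b - prod));
    return 0;
}
```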
Emulating FP64 with the Ozaki Scheme, however, presents a common challenge: the numerics of the problem call for different representations, so a single configuration cannot emulate all FP64 values both accurately and performantly. Specifically, because the Ozaki Scheme uses a fixed-point representation for the operands after their exponents are aligned, the number of “mantissa bits” required is data dependent and must be greater than or equal to the 53 bits of the IEEE 754 FP64 representation to deliver the same or better accuracy.
To solve this problem, the cuBLAS library includes an automatic dynamic precision (ADP) framework that seamlessly analyzes the inputs to determine whether emulation can be safely leveraged for increased performance. If so, the emulation parameters are automatically configured to deliver accuracy equal to or better than a native FP64 matmul.
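To see why the required bit count is data dependent, consider the simplified back-of-the-envelope estimate below. This is not the actual ADP heuristic, only an illustration of the underlying effect: once all elements in a block are shifted to a common (largest) exponent, every extra power of two of dynamic range costs roughly one additional fixed-point bit if the full 53-bit significands are to be preserved.

```cpp
#include <algorithm>
#include <climits>
#include <cmath>
#include <cstdio>

// Conceptual estimate (not the cuBLAS ADP heuristic): the fixed-point width
// needed to keep every element's full 53-bit FP64 significand after the
// exponents are aligned to the largest one in the data.
static int estimated_bits(const double* x, int n) {
    int emax = INT_MIN, emin = INT_MAX;
    for (int i = 0; i < n; ++i) {
        if (x[i] == 0.0) continue;               // zeros need no bits
        int e = std::ilogb(x[i]);
        emax = std::max(emax, e);
        emin = std::min(emin, e);
    }
    if (emax == INT_MIN) return 0;               // all zeros
    // An element at exponent emin still needs its 53 significand bits, now
    // sitting (emax - emin) positions further down the fixed-point word.
    return 53 + (emax - emin);
}

int main() {
    double narrow[] = {1.0, 2.5, 0.75, 1.125};         // similar magnitudes
    double wide[]   = {1.0e+8, 3.0, 2.0e-6, 5.0e+2};   // large dynamic range
    std::printf("narrow-range data: ~%d bits\n", estimated_bits(narrow, 4));
    std::printf("wide-range data  : ~%d bits\n", estimated_bits(wide, 4));
    return 0;
}
```

In the library, ADP performs this kind of analysis on the actual matrices and either configures the emulation parameters or falls back to native FP64, all automatically: well-scaled data needs few bits beyond 53, while data with a wide dynamic range needs more.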
Application results: ecTrans
When weather forecasting or climate modeling applications simulate the complex physics involved across the Earth’s atmosphere, oceans, and other systems, a grid is required to discretize the domain and perform the calculations. The open source ecTrans library relies on linear algebra operations to perform the grid-based transformations which are used for the weather predictions of the Integrated Forecasting System (IFS).
As shown in Figure 1, using NVIDIA Blackwell Tensor Cores for FP32 emulation significantly improves performance in ecTrans by providing a 2.4x speedup to the matrix product computations.


In addition to the increased performance, the numerical accuracy achieved with FP emulation is either equivalent or superior to the results obtained with native FP32. To validate this, 1,000 consecutive forward and backward applications of the spectral transform were repeated on real data fields from an actual simulation.
During this process, the error distributions of the velocities (U and V) and temperature (T) with BF16x9 FP emulation were tracked and compared to the results obtained using standard FP32 precision (the operational precision used at the European Centre for Medium-Range Weather Forecasts for daily forecasts).


Figure 2 shows the probability density functions of the absolute errors for FP32, TF32, and BF16x9 FP emulation. These plots correspond to the likelihood of encountering a given error if velocities and temperatures are randomly sampled: the closer a curve is to a delta function centered at 0, the more accurate the underlying implementation.
The results for TF32 are not visible on the velocity plots because of their large error terms. Zooming out, the large errors in the velocities and temperatures would become visible, which demonstrates the sensitivity of weather modeling to precision. In contrast, BF16x9 FP emulation not only keeps accuracy within acceptable ranges but delivers the same or better accuracy than native FP32, while exceeding FP32 performance.
Application results: BerkeleyGW
The BerkeleyGW code is used by researchers to study physical properties of materials that emerge as a result of how electrons change energy states. It is a massively parallel code that has been run at full scale on leadership-class supercomputers. Using GPUs with BerkeleyGW can deliver an 86x speedup over the CPU-only implementation, and the code can be accelerated even further with FP emulation.
Using emulated complex FP64 matmuls (ZGEMM) in the CHISUM routine of the BerkeleyGW Epsilon module allows for some flexibility in finding the optimal balance between accuracy and performance. By default, cuBLAS uses its ADP framework to determine the parameters that guarantee results as accurate as native FP64. This is done automatically for users and results in the performance gains shown in Figure 3.


However, the cuBLAS API enables the user to fine-tune performance further by using fewer bits for the emulated FP64 operations. For BerkeleyGW, two cases were measured: FP emulation with the default ADP setting and with a manually set 55 mantissa bits. Both resulted in accuracy well within widely accepted tolerances (10E-10) compared to the reference values, with the 55-mantissa-bit case providing even more acceleration.
The performance difference comes from ADP determining that more than 55 mantissa bits are required (because, as described above, the required fixed-point width grows with the dynamic range of the data); in practice, however, the reduced precision of the manually set 55 mantissa bits does not affect application-level accuracy for these tests. If more performance is desired, the cuBLAS APIs let you adjust the precision used during emulation and explore whether the resulting accuracy meets your application's needs.
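For readers who want to experiment, here is a hedged sketch of opting a ZGEMM call into emulation at the handle level. It assumes that the emulation-strategy controls introduced alongside FP32 emulation (cublasSetEmulationStrategy with CUBLAS_EMULATION_STRATEGY_EAGER) also govern FP64 emulation in CUDA Toolkit 13.0 Update 2; the explicit mantissa-bit setting used for the 55-bit BerkeleyGW runs is a separate knob that is not shown here, so consult the cuBLAS documentation for the exact control.

```cpp
#include <cublas_v2.h>
#include <cuComplex.h>

// Sketch: a complex FP64 GEMM (ZGEMM) with emulation requested on the handle.
// Assumption: the handle-level emulation strategy API from recent cuBLAS
// releases also covers FP64 emulation; by default, the ADP framework keeps
// accuracy at or above native FP64.
void zgemm_with_emulation(cublasHandle_t handle, int m, int n, int k,
                          const cuDoubleComplex* dA,
                          const cuDoubleComplex* dB,
                          cuDoubleComplex* dC) {
    cublasSetEmulationStrategy(handle, CUBLAS_EMULATION_STRATEGY_EAGER);

    const cuDoubleComplex alpha = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex beta  = make_cuDoubleComplex(0.0, 0.0);
    cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, dA, m,
                        dB, k,
                &beta,  dC, m);
}
```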
Application results: Quantum Espresso
The open source Quantum Espresso (QE) collection of applications is used worldwide for materials science calculations based on density functional theory (DFT). The core of these applications is highly optimized both for scale-out distributed computation and for fine-grained parallelism within a node.
QE depends on efficient double-precision GEMMs to apply operators during each step of the fundamental iteration cycle that determines ground-state energies of atoms and materials. This double-precision GEMM usage is similar to that of many other DFT-based applications, so the performance improvements realized for Quantum Espresso from FP emulation are expected to translate to many other DFT applications as well.
For the results shown in Figure 4, the Ausurf benchmark dataset was used to measure both the quality of the numerical results and the performance of QE with FP emulation enabled in the cuBLAS library on an RTX PRO 6000 Blackwell Server Edition GPU.


Figure 4 shows that FP emulation with ADP provides a significant 1.5x end-to-end speedup, and with further tuning down to 39 mantissa bits, an almost 3x end-to-end speedup is achieved. The accuracy results for all configurations are indistinguishable from one another until emulated FP64 with 39 mantissa bits is used, which still produces application output values that are consistent to 12 (base-10) significant digits.
The performance difference between ADP and 55 mantissa bits stems from the ADP framework determining that more than 55 mantissa bits are required for IEEE 754 FP64-level accuracy; in practice, however, using fewer mantissa bits does not impact the measured application-level accuracy.
Benchmarking results: Heat maps
In addition to the end-to-end application performance improvements from FP emulation, it is important to understand the applicability range of emulation when analyzing how it can improve your application's performance. The three heat maps shown in Figures 5-7 demonstrate the performance improvements from using emulated matmuls across different matrix shapes on a GB200 NVL72 GPU for FP32 and FP64, and on an RTX PRO 6000 Blackwell Server Edition for FP64.






All three heat maps show substantial performance gains on moderate and large problem shapes. Moreover, in Figures 6 and 7, the ADP framework uses 55 mantissa bits, and we can see that when the problems are too small to benefit from emulation, there is no performance penalty for attempting emulation because the cuBLAS heuristics select the native FP64 algorithms. We expect further improvements to both performance and the applicability region in future cuBLAS releases.
What’s next for FP emulation
While FP emulation is already accelerating real applications, NVIDIA is continuing to advance and improve this technology across several key areas. Additional key BLAS level-3 and LAPACK routines throughout the CUDA-X math libraries will be accelerated through both FP32 and FP64 emulation. The team will continue to improve FP64 emulation with optimizations to the ADP framework and GEMM kernels, reduced workspace memory requirements, and the Ozaki-II Scheme.
Using the techniques discussed in this post, you can take advantage of Tensor Core performance for algorithms that rely on matrix multiplication without changing your code or performing tedious performance analysis. cuBLAS automatically selects the best strategy, delivering high performance while preserving the desired level of accuracy.
To start using FP emulation and explore its benefits in your own applications, download CUDA Toolkit 13.0 Update 2.
To learn more, check out these related resources:
