NVIDIA CUDA-X math libraries provide the fundamental numerical building blocks that enable developers to deploy accelerated applications across multiple high-performance domains, including AI and scientific computing.
cuBLAS is a CUDA-X math library consisting of a highly optimized collection of basic linear algebra subroutines for matrix and vector operations, specifically tuned to deliver the best possible performance on NVIDIA hardware through familiar, easy-to-use APIs.
The latest cuBLAS update in NVIDIA CUDA Toolkit 13.0 Update 2 introduces new APIs and implementations that significantly boost the performance of double-precision (FP64) matrix multiplications (matmuls). This is achieved through floating-point (FP) emulation on the Tensor Cores present in GPU architectures such as NVIDIA GB200 NVL72 and NVIDIA RTX PRO 6000 Blackwell Server Edition. For comprehensive information on GPU compatibility for both FP32 and FP64 emulation, refer to the cuBLAS documentation.
This new emulated FP64 matmul implementation complements the recently released single-precision (FP32) matmul emulation. Developers can fine-tune the required accuracy for FP64 matrix multiplications, but by default cuBLAS maintains accuracy equal to or better than native hardware. It automatically assesses whether an operation will perform better using FP emulation (with accuracy preserved) or native hardware, and then selects the optimal implementation.
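To give a sense of what this looks like from the application side, here is a minimal sketch of an ordinary FP64 GEMM through the classic cuBLAS API. There is no emulation-specific code in it; whether the call is actually dispatched to emulated Tensor Core kernels depends on your GPU and on the emulation settings in your cuBLAS version, so treat this as an illustration of the drop-in workflow rather than a guarantee of emulated execution.

```cpp
#include <cublas_v2.h>

// Minimal sketch: a standard double-precision GEMM, C = A * B, column-major,
// with A, B, C already resident on the device. On supported Blackwell GPUs
// with CUDA Toolkit 13.0 Update 2 and emulation enabled, cuBLAS can route
// this unchanged call to emulated FP64 Tensor Core kernels when that is
// faster and accuracy preserving.
void fp64_gemm(cublasHandle_t handle, int m, int n, int k,
               const double* dA, const double* dB, double* dC) {
    const double alpha = 1.0;
    const double beta  = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, dA, m,   // lda = m
                        dB, k,   // ldb = k
                &beta,  dC, m);  // ldc = m
}
```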
This post explains cuBLAS capabilities in CUDA Toolkit 13.0 Update 2, including:
- Seamless access to Tensor Core performance through familiar, easy-to-use developer APIs
- FP32 emulation on Blackwell BF16 Tensor Cores, providing increased performance over native FP32 matrix multiplication while preserving accuracy (see the sketch following this list)
- FP64 emulation on Blackwell INT8 Tensor Cores, providing a safe, automatic performance increase with fallback to native execution when needed
- FP emulation for increased performance across a variety of application domains and hardware platforms
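For the FP32 path called out above, the sketch below shows one way to request BF16x9 emulation explicitly through cublasGemmEx. The compute-type enum used here is the name recent cuBLAS releases introduced for BF16x9 emulation; treat it as an assumption and verify the exact spelling against the cuBLAS headers and documentation for your toolkit version.

```cpp
#include <cublas_v2.h>

// Sketch: explicitly requesting a BF16x9-emulated FP32 matmul.
// Assumption: CUBLAS_COMPUTE_32F_EMULATED_16BFX9 is the compute type exposed
// for BF16x9 emulation in recent cuBLAS releases; check your headers.
void fp32_gemm_emulated(cublasHandle_t handle, int m, int n, int k,
                        const float* dA, const float* dB, float* dC) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha, dA, CUDA_R_32F, m,
                         dB, CUDA_R_32F, k,
                 &beta,  dC, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F_EMULATED_16BFX9,
                 CUBLAS_GEMM_DEFAULT);
}
```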
This is the first release of FP64 matmul emulation, with more advancements to follow in upcoming releases.
Floating-point emulation in practice
The cuBLAS library exposes two flavors of matmul emulation: the BF16x9 algorithm for FP32 and the Ozaki Scheme for FP64. The BF16x9 algorithm uses a static decomposition that can safely and performantly emulate all normal and subnormal FP32 values on Blackwell BF16 Tensor Cores, as sketched below.
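As a toy, scalar-level illustration of that decomposition (not the library's actual kernels), the snippet below splits two FP32 values into three BF16 slices each and recovers their product from the nine BF16-by-BF16 partial products, which is the same pattern of work that gets mapped onto BF16 Tensor Cores with FP32 accumulation.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <cmath>

// Truncate a float to bfloat16 precision by keeping the top 16 bits of its
// IEEE 754 bit pattern. Real hardware rounds; truncation is enough to
// illustrate the decomposition.
static float to_bf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFF0000u;             // drop the low 16 mantissa bits
    std::memcpy(&x, &bits, sizeof(bits));
    return x;
}

int main() {
    float a = 1.2345678f, b = 7.6543210f;

    // Split each FP32 value into three BF16 "slices": hi + mid + lo.
    // Three slices of 8 significant bits each cover the 24-bit FP32 significand.
    float as[3], bs[3];
    as[0] = to_bf16(a); as[1] = to_bf16(a - as[0]); as[2] = to_bf16(a - as[0] - as[1]);
    bs[0] = to_bf16(b); bs[1] = to_bf16(b - bs[0]); bs[2] = to_bf16(b - bs[0] - bs[1]);

    // a * b becomes a sum of 9 BF16 x BF16 products (hence "BF16x9"),
    // accumulated in FP32.
    float prod = 0.0f;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            prod += as[i] * bs[j];

    std::printf("native FP32 product : %.9g\n", a * b);
    std::printf("BF16x9 product      : %.9g\n", prod);
    std::printf("absolute difference : %.3g\n", std::fabs(a * b - prod));
    return 0;
}
```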
Emulating FP64 with the Ozaki Scheme, however, presents a common challenge: the numerics of the problem call for different representations, so a single configuration cannot emulate all FP64 values both accurately and performantly. Specifically, because the Ozaki Scheme uses a fixed-point representation for the operands after their exponents are aligned, the number of “mantissa bits” required is data dependent and must be greater than or equal to the 53 bits of the IEEE 754 FP64 representation to deliver the same or better accuracy.
To solve this problem, the cuBLAS library includes an automatic dynamic precision (ADP) framework that seamlessly analyzes the inputs to determine whether emulation can be safely leveraged for increased performance. If so, the emulation parameters are automatically configured to deliver accuracy equal to or better than a native FP64 matmul.
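To see why the required bit count is data dependent, consider the simplified back-of-the-envelope estimate below. This is not the actual ADP heuristic, only an illustration of the underlying effect: once all elements in a block are shifted to a common (largest) exponent, every extra power of two of dynamic range costs roughly one additional fixed-point bit if the full 53-bit significands are to be preserved.

```cpp
#include <algorithm>
#include <climits>
#include <cmath>
#include <cstdio>

// Conceptual estimate (not the cuBLAS ADP heuristic): the fixed-point width
// needed to keep every element's full 53-bit FP64 significand after the
// exponents are aligned to the largest one in the data.
static int estimated_bits(const double* x, int n) {
    int emax = INT_MIN, emin = INT_MAX;
    for (int i = 0; i < n; ++i) {
        if (x[i] == 0.0) continue;               // zeros need no bits
        int e = std::ilogb(x[i]);
        emax = std::max(emax, e);
        emin = std::min(emin, e);
    }
    if (emax == INT_MIN) return 0;               // all zeros
    // An element at exponent emin still needs its 53 significand bits, now
    // sitting (emax - emin) positions further down the fixed-point word.
    return 53 + (emax - emin);
}

int main() {
    double narrow[] = {1.0, 2.5, 0.75, 1.125};         // similar magnitudes
    double wide[]   = {1.0e+8, 3.0, 2.0e-6, 5.0e+2};   // large dynamic range
    std::printf("narrow-range data: ~%d bits\n", estimated_bits(narrow, 4));
    std::printf("wide-range data  : ~%d bits\n", estimated_bits(wide, 4));
    return 0;
}
```

In the library, ADP performs this kind of analysis on the actual matrices and either configures the emulation parameters or falls back to native FP64, all automatically: well-scaled data needs few bits beyond 53, while data with a wide dynamic range needs more.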
Application results: ecTrans
When weather forecasting or climate modeling applications simulate the complex physics involved across the Earth’s atmosphere, oceans, and other systems, a grid is required to discretize the domain and perform the calculations. The open source ecTrans library relies on linear algebra operations to perform the grid-based transformations which are used for the weather predictions of the Integrated Forecasting System (IFS).
As shown in Figure 1, using NVIDIA Blackwell Tensor Cores for FP32 emulation significantly improves performance in ecTrans by providing a 2.4x speedup to the matrix product computations.


In addition to the increased performance, the numerical accuracy achieved with FP emulation is either equivalent or superior to the results obtained with native FP32. To validate this, 1,000 consecutive forward and backward applications of the spectral transform were repeated on real data fields from an actual simulation.
During this process, the error distributions of the velocities (U and V) and temperature (T) with BF16x9 FP emulation were tracked and compared to the results obtained using standard FP32 precision (the operational precision used at the European Centre for Medium-Range Weather Forecasts for daily forecasts).


Figure 2 shows the probability density functions of the absolute errors for FP32, TF32, and BF16x9 FP emulation. These plots correspond to the likelihood of encountering a given error if velocities and temperatures are randomly sampled: the closer a curve is to a delta function centered at 0, the more accurate the underlying implementation.
The results for TF32 are not visible on the velocity plots because of their large error terms. Zooming out, the large errors in the velocities and temperatures would become visible, which demonstrates the sensitivity of weather modeling to precision. In contrast, BF16x9 FP emulation not only keeps accuracy within acceptable ranges but delivers the same or better accuracy than native FP32, while exceeding FP32 performance.
Application results: BerkeleyGW
The BerkeleyGW code is used by researchers to study physical properties of materials that emerge as a result of how electrons change energy states. It is a massively parallel code that has been run at full scale on leadership-class supercomputers. Using GPUs with BerkeleyGW can deliver an 86x speedup over the CPU-only implementation, and the code can be accelerated even further with FP emulation.
Using emulated complex FP64 matmuls (ZGEMM) in the CHISUM routine of the BerkeleyGW Epsilon module allows for some flexibility in finding the optimal balance between accuracy and performance. By default, cuBLAS uses its ADP framework to determine the parameters that guarantee results as accurate as native FP64. This is done automatically for users and results in the performance gains shown in Figure 3.


However, the cuBLAS API enables the user to fine-tune performance further by using fewer bits for the emulated FP64 operations. For BerkeleyGW, two cases were measured: FP emulation with the default ADP setting and with a manually set 55 mantissa bits. Both resulted in accuracy well within widely accepted tolerances (10E-10) compared to the reference values, with the 55-mantissa-bit case providing even more acceleration.
The performance difference comes from ADP determining that more than 55 mantissa bits are required (because, as described above, the required fixed-point width grows with the dynamic range of the data); in practice, however, the reduced precision of the manually set 55 mantissa bits does not affect application-level accuracy for these tests. If more performance is desired, the cuBLAS APIs let you adjust the precision used during emulation and explore whether the resulting accuracy meets your application's needs.
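For readers who want to experiment, here is a hedged sketch of opting a ZGEMM call into emulation at the handle level. It assumes that the emulation-strategy controls introduced alongside FP32 emulation (cublasSetEmulationStrategy with CUBLAS_EMULATION_STRATEGY_EAGER) also govern FP64 emulation in CUDA Toolkit 13.0 Update 2; the explicit mantissa-bit setting used for the 55-bit BerkeleyGW runs is a separate knob that is not shown here, so consult the cuBLAS documentation for the exact control.

```cpp
#include <cublas_v2.h>
#include <cuComplex.h>

// Sketch: a complex FP64 GEMM (ZGEMM) with emulation requested on the handle.
// Assumption: the handle-level emulation strategy API from recent cuBLAS
// releases also covers FP64 emulation; by default, the ADP framework keeps
// accuracy at or above native FP64.
void zgemm_with_emulation(cublasHandle_t handle, int m, int n, int k,
                          const cuDoubleComplex* dA,
                          const cuDoubleComplex* dB,
                          cuDoubleComplex* dC) {
    cublasSetEmulationStrategy(handle, CUBLAS_EMULATION_STRATEGY_EAGER);

    const cuDoubleComplex alpha = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex beta  = make_cuDoubleComplex(0.0, 0.0);
    cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, dA, m,
                        dB, k,
                &beta,  dC, m);
}
```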
Application results: Quantum Espresso
The open source Quantum Espresso (QE) collection of applications is used worldwide for materials science calculations based on density functional theory (DFT). The core of these applications is highly optimized both for scale-out distributed computation and for fine-grained parallelism within a node.
QE depends on efficient double-precision GEMMs to apply operators during each step of the fundamental iteration cycle that determines ground-state energies of atoms and materials. This double-precision GEMM usage is similar to that of many other DFT-based applications, so the performance improvements realized for Quantum Espresso from FP emulation are expected to translate to many other DFT applications as well.
For the results shown in Figure 4, the Ausurf benchmark dataset was used to measure both the quality of the numerical results and the performance of QE with FP emulation enabled in the cuBLAS library on an RTX PRO 6000 Blackwell Server Edition GPU.


Figure 4 shows that FP emulation with ADP provides a significant 1.5x end-to-end speedup, and with further tuning down to 39 mantissa bits, an almost 3x end-to-end speedup is achieved. The accuracy results for all configurations are indistinguishable from one another until emulated FP64 with 39 mantissa bits is used, which still produces application output values that are consistent to 12 (base-10) significant digits.
The performance difference between ADP and 55 mantissa bits stems from the ADP framework determining that more than 55 mantissa bits are required for IEEE 754 FP64-level accuracy; in practice, however, using fewer mantissa bits does not impact the measured application-level accuracy.
Benchmarking results: Heat maps
In addition to the end-to-end application performance improvements from FP emulation, it is important to understand the applicability range of emulation when analyzing how it can improve your application's performance. The three heat maps shown in Figures 5-7 demonstrate the performance improvements from using emulated matmuls across different matrix shapes on a GB200 NVL72 GPU for FP32 and FP64, and on an RTX PRO 6000 Blackwell Server Edition for FP64.






All three heat maps show substantial performance gains on moderate and large problem shapes. Moreover, in Figures 6 and 7, the ADP framework uses 55 mantissa bits, and we can see that when the problems are too small to benefit from emulation, there is no performance penalty for attempting emulation because the cuBLAS heuristics select the native FP64 algorithms. We expect further improvements to both performance and the applicability region in future cuBLAS releases.
What’s next for FP emulation
While FP emulation is already accelerating real applications, NVIDIA is continuing to advance and improve this technology across several key areas. Additional key BLAS level-3 and LAPACK routines throughout the CUDA-X math libraries will be accelerated through both FP32 and FP64 emulation. The team will continue to improve FP64 emulation with optimizations to the ADP framework and GEMM kernels, reduced workspace memory requirements, and the Ozaki-II Scheme.
Using the techniques discussed in this post, you can take advantage of Tensor Core performance for algorithms that rely on matrix multiplication without changing your code or performing tedious performance analysis. cuBLAS automatically selects the best strategy, delivering high performance while preserving the desired level of accuracy.
To start using FP emulation and explore its benefits in your own applications, download CUDA Toolkit 13.0 Update 2.
To learn more, check out these related resources:
