As AI weather and climate prediction models rapidly gain adoption, the NVIDIA Earth-2 platform provides libraries and tools for accelerating solutions with a GPU-optimized software stack. Downscaling, the task of refining coarse-resolution (25 km scale) weather data, enables national meteorological service (NMS) agencies to deliver high-resolution predictions for agriculture, energy, transportation, and disaster preparedness at spatial resolutions fine enough for actionable decision-making and planning.
Traditional dynamical downscaling is prohibitively expensive, especially for large ensembles at high resolution and over extensive spatial domains. CorrDiff, a generative AI downscaling model, sidesteps the computational bottlenecks of traditional numerical methods, achieves state-of-the-art results, and uses a patch-based multidiffusion approach to scale to continental and global domains. This AI-based solution unlocks significant gains in efficiency and scalability compared with traditional numerical methods, while greatly reducing computational costs.
CorrDiff has gained global adoption for various use cases, demonstrating its versatility and impact across domains where fine-scale weather information is crucial:
- The Weather Company (TWC) for supporting the agriculture, energy, and aviation industries.
- G42 for improving smog and dust storm predictions in the Middle East.
- Tomorrow.io for enhancing a variety of storm-scale predictions, including fire weather forecasts and wind gust forecasts that disrupt railway operations.
In this blog post, we present the performance optimizations and enhancements for CorrDiff training and inference that were incorporated into two tools within the Earth-2 stack, NVIDIA PhysicsNeMo and NVIDIA Earth2Studio. Achieving over 50x speedup over the training and inference baselines, these optimizations enable:
- Scaling patch-based training to the entire planet in under 3,000 GPU-hours.
- Reducing most country-scale trainings to O(100) GPU-hours.
- Training over the contiguous United States (CONUS) in under 1,000 GPU-hours.
- Fine-tuning and bespoke training that democratize km-scale AI weather.
- Country-scale inference in GPU-seconds, planetary-scale inference in GPU-minutes.
- Generating large ensembles affordably for high-resolution probabilistic forecasting.
- Interactive exploration of kilometer-scale data.
CorrDiff: Training and inference


Figure 1 illustrates the training and sampling workflow of CorrDiff for generative downscaling. During diffusion training, a pretrained regression model generates the conditional mean, which serves as input for training the diffusion model. For background and details on CorrDiff, refer to the CorrDiff publication, PhysicsNeMo docs, and source code.
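To make the two-stage structure concrete, here is a minimal sketch of the regression-plus-diffusion composition; the function and argument names are placeholders rather than the actual PhysicsNeMo interfaces, and the coarse input is assumed to be already interpolated onto the output grid.

```python
import torch

def downscale(regression_net, diffusion_sampler, x_coarse):
    # Sketch of CorrDiff's two-stage generative downscaling; names are placeholders,
    # and x_coarse is the coarse (25 km) input interpolated onto the 2 km output grid.
    with torch.no_grad():
        # Stage 1: a pretrained regression UNet predicts the conditional mean of the
        # high-resolution fields given the coarse input.
        mean_hr = regression_net(x_coarse)

    # Stage 2: the diffusion model, conditioned on the coarse input and the conditional
    # mean, samples the residual fine-scale detail the deterministic mean cannot capture.
    residual_hr = diffusion_sampler(condition=torch.cat([x_coarse, mean_hr], dim=1))

    # The downscaled sample is the conditional mean plus the generated residual.
    return mean_hr + residual_hr
```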
Why optimize CorrDiff?
Diffusion models are resource-intensive because they depend on iterative sampling, with each denoising step involving multiple neural network evaluations. This makes inference time-consuming and expensive. Training can be even costlier, since the denoiser needs to be trained across the complete range of noise levels. Optimizing their performance requires:
- Streamlining core operations, such as fusing kernels, using mixed precision, and using NVIDIA CUDA graphs (a graph-capture sketch follows this list).
- Improving the sampling process by reducing the number of denoising steps and using optimal time integration schemes.
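As an illustration of the CUDA graph point above, the following is a minimal PyTorch sketch of capturing a denoising forward pass into a graph; the tiny convolution stands in for the CorrDiff denoiser so the example is self-contained, and is not the actual model.

```python
import torch

# A trivial stand-in for the CorrDiff denoiser so the sketch is runnable; the real
# model is a UNet conditioned on coarse inputs and a noise level.
denoiser = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1).cuda().eval()
static_x = torch.randn(1, 4, 448, 448, device="cuda")

# Warm up on a side stream before capture, as required for CUDA graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        _ = denoiser(static_x)
torch.cuda.current_stream().wait_stream(s)

# Capture one denoising step into a graph; replaying it launches all kernels at once,
# removing per-kernel CPU launch overhead inside the sampling loop.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = denoiser(static_x)

static_x.copy_(torch.randn_like(static_x))  # copy fresh data into the captured input buffer
g.replay()                                  # relaunch the captured kernels
result = static_out.clone()
```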
CorrDiff uses the EDM architecture, in which several computationally expensive operations, such as group normalization, activation functions, and convolutions, can be optimized using advanced libraries and kernel fusion.
CorrDiff also uses a two-stage pipeline (regression and correction), offering opportunities to amortize cost across multiple diffusion steps by caching regression outputs and minimizing redundant compute.
Accelerated CorrDiff
In the following figures, we summarize the various optimizations that lead to over a 50x speedup in both training and inference costs over the CONUS domain. Figures 2 and 3 summarize the cumulative speedup factors achieved over the baseline with each successive optimization. Details of each optimization are provided in subsequent sections.




Optimized CorrDiff: How it’s achieved
The baseline performance of CorrDiff on NVIDIA H100 GPUs with FP32 precision, batch size = 1, patch size = 1 (in absolute time) was as follows:
- Regression forward: 1204 ms
  - Domain: CONUS, 1056 × 1792 pixels
  - Input channels: ["u500", "v500", "z500", "t500", "u850", "v850", "z850", "t850", "u10m", "v10m", "t2m", "tcwv"] at 25 km resolution
  - Output channels: ["refc", "2t", "10u", "10v"] at 2 km resolution
- Diffusion forward: 155 ms
  - Domain: spatial patch of size 448 × 448 pixels
  - Input channels: ["u500", "v500", "z500", "t500", "u850", "v850", "z850", "t850", "u10m", "v10m", "t2m", "tcwv"] at 25 km resolution
  - Output channels: ["refc", "2t", "10u", "10v"] at 2 km resolution
- Diffusion backward: 219 ms
While effective, this baseline was limited by expensive regression model forward passes and inefficient data transposes.


Key CorrDiff training optimizations
To achieve substantial acceleration of CorrDiff training, culminating in a 53.86x speedup on NVIDIA B200 and 25.51x on H100, we introduced the series of performance optimizations outlined below.
Optimization 1: Enable AMP-BF16 for training
The original training recipe uses FP32 precision. We enabled Automatic Mixed Precision (AMP) with BF16 for training to reduce memory usage and improve throughput without compromising numerical stability, resulting in a 2.03x speedup over the baseline.
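For reference, a generic AMP-BF16 training step looks like the sketch below; model, loss_fn, optimizer, and dataloader are placeholders, not the exact PhysicsNeMo training loop.

```python
import torch

# BF16 keeps FP32's exponent range, so no GradScaler is required, unlike FP16 AMP.
for inputs, targets in dataloader:
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(inputs), targets)  # forward pass runs in BF16 where safe
    loss.backward()   # gradients follow the autocast-selected precisions
    optimizer.step()  # master weights remain in FP32
```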
Optimization 2: Amortizing regression cost using multi-iteration patching
In the original patch-based training workflow, each 448×448 patch sample for diffusion model training required inference of the regression model over the complete 1056×1792 CONUS domain. This caused diffusion model training throughput to be bottlenecked by regression model inference.
We improved efficiency by caching regression outputs and reusing them across 16 diffusion patches per timestamp. This provided broader spatial coverage while spreading regression costs more effectively, yielding a 12.33× speedup over baseline.
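A minimal sketch of this amortization pattern follows; regression_net, diffusion_net, diffusion_loss, optimizer, and the dataloader are hypothetical placeholders, and the actual PhysicsNeMo loop differs in detail.

```python
import torch

PATCH = 448
PATCHES_PER_SAMPLE = 16  # diffusion patches drawn per cached regression output

for x_coarse, y_fine in dataloader:
    # One expensive full-domain regression forward, cached for reuse.
    with torch.no_grad():
        mean_full = regression_net(x_coarse)
    cond_full = torch.cat([x_coarse, mean_full], dim=1)

    # Amortize that cost over 16 diffusion patches from the same timestamp.
    for _ in range(PATCHES_PER_SAMPLE):
        i = torch.randint(0, y_fine.shape[-2] - PATCH + 1, (1,)).item()
        j = torch.randint(0, y_fine.shape[-1] - PATCH + 1, (1,)).item()
        loss = diffusion_loss(
            diffusion_net,
            target=y_fine[..., i:i + PATCH, j:j + PATCH],
            condition=cond_full[..., i:i + PATCH, j:j + PATCH],
        )
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
```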
Optimization 3: Eliminating data transposes with Apex GroupNorm
The training pipeline initially used the default NCHW memory layout, which triggers costly implicit memory transposes before and after convolutions. Switching the model and input tensors to NHWC (channels-last) format aligns them with cuDNN’s preferred layout. However, PyTorch GroupNorm ops don’t support the channels-last format. To avoid transposes and keep data in channels-last format for more efficient normalization kernels, we replaced PyTorch GroupNorm with NVIDIA Apex GroupNorm. This eliminated transpose overhead and yielded a 16.71× speedup over the baseline.
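A sketch of the layout change and the GroupNorm swap is shown below; it assumes the apex.contrib.group_norm.GroupNorm interface (Apex builds vary), and model/x are generic placeholders rather than the CorrDiff modules, which PhysicsNeMo handles via the use_apex_gn option.

```python
import torch
from apex.contrib.group_norm import GroupNorm as ApexGroupNorm  # import path is an assumption

model = model.to(memory_format=torch.channels_last)  # NHWC weights for cuDNN-preferred layout
x = x.to(memory_format=torch.channels_last)          # NHWC activations

# Swap torch.nn.GroupNorm modules for the Apex kernel so normalization no longer forces
# an implicit NHWC -> NCHW -> NHWC round trip around each convolution.
def swap_groupnorm(module):
    for name, child in module.named_children():
        if isinstance(child, torch.nn.GroupNorm):
            gn = ApexGroupNorm(child.num_groups, child.num_channels, eps=child.eps)
            gn.load_state_dict(child.state_dict())  # reuse the trained affine parameters
            setattr(module, name, gn)
        else:
            swap_groupnorm(child)

swap_groupnorm(model)
```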
Optimization 4: Fusing GroupNorm with SiLU
By fusing GroupNorm and the SiLU activation into a single kernel using Apex, we reduced kernel launches and the number of global memory accesses. This increased GPU utilization and delivered a 17.15× speedup over the baseline.
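A brief sketch of the fused form, assuming the act="silu" argument exposed by the Apex contrib GroupNorm in recent builds (an assumption, not verified against every Apex version):

```python
import torch
from apex.contrib.group_norm import GroupNorm as ApexGroupNorm  # import path is an assumption

# One fused kernel performs both the normalization and the SiLU activation, instead of a
# GroupNorm kernel followed by a separate elementwise SiLU kernel.
norm_act = ApexGroupNorm(num_groups=32, num_channels=256, act="silu").cuda().half()
x = torch.randn(1, 256, 56, 56, device="cuda", dtype=torch.float16)
x = x.to(memory_format=torch.channels_last)
y = norm_act(x)  # equivalent to silu(group_norm(x)), in a single kernel launch
```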
Optimization 5: Extended channel dimension support in Apex GroupNorm
Some CorrDiff layers use channel dimensions unsupported by Apex. We extended support for these channel dimensions, unlocking fusion for all layers. This improved performance to a 19.74× speedup over the baseline.
Optimization 6: Kernel fusion through graph compilation
We used torch.compile to fuse the remaining elementwise operations (e.g., addition, multiplication, sqrt, exp). This improved scheduling, reduced global memory accesses, and cut Python overhead, delivering a 25.51× speedup over the baseline.
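In its simplest form this is a one-line change, sketched below with `denoiser` as a placeholder for the CorrDiff UNet.

```python
import torch

# torch.compile lets TorchInductor fuse the remaining elementwise operations into larger
# generated kernels and removes per-op Python dispatch overhead.
denoiser = torch.compile(denoiser)

# Keep NVTX/profiling annotations out of the compiled hot path (profile_mode: False),
# since they introduce graph breaks and force sections back to eager execution.
```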
Optimization 7: Apex GroupNorm V2 on NVIDIA Blackwell
Using Apex GroupNorm V2, optimized for NVIDIA Blackwell GPUs, yielded a 53.86× speedup over the baseline on B200 and a 2.1× speedup over the H100-optimized workflow.


Training throughput
We compare the training throughput of the baseline CorrDiff on NVIDIA Hopper against optimized versions on Hopper and Blackwell in Table 1. The optimized implementation achieves efficiency improvements across both architectures, with Blackwell showing the most significant gains.
Note: Regression refers to the regression forward pass. Diffusion refers to the combined diffusion forward and backward passes. Total is the sum of the two columns (regression forward + diffusion forward + diffusion backward).
| GPU | Version | Precision | Regression (ms/patch) | Diffusion (ms/patch) | Total runtime (ms/patch) | Throughput (patches/s) |
| --- | --- | --- | --- | --- | --- | --- |
| H100 | Baseline | FP32 | 1204.0 | 374.0 | 1578.0 | 0.63 |
| H100 | Optimized | BF16 | 10.609 | 51.25 | 61.859 | 16.2 |
| B200 | Optimized | BF16 | 4.734 | 24.56 | 29.297 | 34.1 |
Speed-of-Light evaluation
To evaluate how close our optimized CorrDiff workflow comes to the hardware performance ceiling, we conducted a Speed-of-Light (SOL) evaluation on H100 and B200 GPUs. This provides an upper-bound estimate of achievable performance by assessing how effectively GPU resources are being used.
Steps to estimate SOL (a bookkeeping sketch follows the list):
- Identify low-utilization kernels: We focus on kernels with both DRAM read/write bandwidth < 60% and Tensor Core utilization < 60%. Such kernels are neither memory-bound nor compute-bound, making them likely performance bottlenecks.
- Estimate per-kernel potential: For each low-utilization kernel, we estimate the potential speedup under ideal conditions, namely full DRAM bandwidth usage or full Tensor Core activity.
- Aggregate overall speedup: We then compute the hypothetical end-to-end speedup if each kernel were optimized to its ideal performance.
- Compute SOL efficiency: Finally, we estimate the fraction of the theoretical maximum SOL as the fraction of peak performance achievable if the top 10 runtime-dominant kernels were individually boosted to their theoretical maximum.
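The sketch below shows the bookkeeping these steps imply. The kernel records would come from a profiler export (for example, Nsight Compute); the field names and values are purely illustrative, not real measurements.

```python
kernels = [  # illustrative values only
    {"time_ms": 12.0, "dram_util": 0.15, "tc_util": 0.10},  # low-utilization kernel
    {"time_ms": 8.0,  "dram_util": 0.85, "tc_util": 0.20},  # memory-bound kernel
    {"time_ms": 5.0,  "dram_util": 0.30, "tc_util": 0.75},  # compute-bound kernel
]

def ideal_time(k):
    # A kernel below 60% on both metrics is neither memory- nor compute-bound; at "speed
    # of light" its dominant resource would be fully utilized, so its runtime shrinks in
    # proportion to that utilization.
    if k["dram_util"] < 0.6 and k["tc_util"] < 0.6:
        return k["time_ms"] * max(k["dram_util"], k["tc_util"])
    return k["time_ms"]  # already close to a hardware roofline

measured = sum(k["time_ms"] for k in kernels)

# Idealize only the top 10 runtime-dominant kernels, per the last step above.
by_time = sorted(kernels, key=lambda k: k["time_ms"], reverse=True)
ideal = sum(ideal_time(k) for k in by_time[:10]) + sum(k["time_ms"] for k in by_time[10:])

print(f"Hypothetical end-to-end speedup: {measured / ideal:.2f}x")
print(f"Estimated fraction of SOL achieved: {ideal / measured:.1%}")
```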
Using this framework, our optimized CorrDiff workflow achieves 63% of the estimated SOL on H100 and 67% on B200. This indicates strong GPU utilization while still leaving headroom for future kernel-level improvements.
To further assess efficiency, we visualize kernel performance as shown in Figures 6 and 7. Each dot represents a kernel, plotted by NVIDIA Tensor Core utilization (x-axis) and combined DRAM read/write bandwidth utilization (y-axis). The dot size reflects the kernel’s share of total runtime, highlighting performance-critical operations.
Kernels near the top or right edges are generally well optimized, as they fully exploit memory or compute resources. In contrast, kernels in the bottom-left quadrant underutilize both and represent the best opportunities for further optimization. This visualization provides a clear picture of the runtime distribution and helps identify where GPU efficiency can be improved.


Figure 6 shows the distribution of kernels in terms of Tensor Core utilization and DRAM bandwidth utilization for the baseline CorrDiff implementation. In the unoptimized workflow with FP32 precision, more than 95% of the time is spent in low-utilization kernels where both DRAM utilization (read + write) and Tensor Core utilization are below 60%.
The majority of runtime-dominant kernels cluster near the origin, showing very low DRAM and Tensor Core utilization. Only a small number of kernels lie near the upper or right boundaries, where kernels become clearly memory-bound or compute-bound. The unoptimized CONUS CorrDiff workflow reached only 1.23% of SOL on B200.


Figure 7 shows the distribution of kernels in the optimized implementation in terms of Tensor Core utilization and DRAM bandwidth utilization. In the optimized workflow with AMP-BF16 training, a higher proportion of kernels lie near the top-left or bottom-right edges, indicating good performance and GPU utilization. Optimized CorrDiff now reaches 67% of SOL on B200. Despite the overall improvements in the optimized workflow, some kernels still have the potential to be accelerated further.
CorrDiff inference optimizations
Many of the training optimizations can also be applied to inference. We introduced several more inference-specific optimizations to maximize performance.
Optimized multi-diffusion
CorrDiff uses a patch-based multi-diffusion approach, where overlapping spatial patches are denoised and then aggregated. Initially, 27.1% of the total runtime was spent in im2col folding/unfolding operations. Precomputing overlap counts for each pixel and using torch.compile() to speed up the remaining folding/unfolding steps eliminates the im2col bottleneck entirely, resulting in a 7.86x speedup.
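A sketch of the overlap-count idea follows, with illustrative CONUS-like dimensions; the stride value, the `blend` helper, and the `denoised_patches` iterator are hypothetical, not the PhysicsNeMo implementation.

```python
import torch

H, W, PATCH, STRIDE = 1056, 1792, 448, 392  # illustrative dimensions and stride

def starts(full, patch, stride):
    # Patch start offsets along one axis, making sure the domain edge is covered.
    s = list(range(0, full - patch + 1, stride))
    if s[-1] != full - patch:
        s.append(full - patch)
    return s

# Precompute, once, how many overlapping patches cover each output pixel.
overlap = torch.zeros(1, 1, H, W)
for i in starts(H, PATCH, STRIDE):
    for j in starts(W, PATCH, STRIDE):
        overlap[..., i:i + PATCH, j:j + PATCH] += 1

@torch.compile  # fuse the remaining elementwise aggregation work
def blend(canvas, overlap):
    return canvas / overlap.clamp(min=1)

def aggregate(denoised_patches, overlap):
    # denoised_patches: hypothetical iterable of ((i, j), patch_tensor) results.
    canvas = torch.zeros(1, 4, H, W)
    for (i, j), patch_out in denoised_patches:
        canvas[..., i:i + PATCH, j:j + PATCH] += patch_out
    return blend(canvas, overlap)
```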
Deterministic Euler sampler (12 steps)
The original stochastic sampler used 18 denoising steps with the Heun solver and second-order correction. By switching to a deterministic sampler using the Euler solver (with no second-order correction), we reduced the number of denoising steps to 12 without impacting output quality. This change delivered an additional ~2.8× speedup on both Hopper and Blackwell. The overall speedup with the 12-step deterministic sampler is 21.94x on H100 and 54.87x on B200.
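For orientation, a minimal deterministic EDM-style Euler sampler looks like the sketch below; denoiser and cond are placeholders, and the schedule constants follow the standard EDM defaults rather than CorrDiff's exact configuration.

```python
import torch

@torch.no_grad()
def euler_sample(denoiser, cond, steps=12, sigma_max=80.0, sigma_min=0.002, rho=7.0):
    # Deterministic sampler: no stochastic churn and no Heun (second-order) correction.
    t = torch.linspace(0, 1, steps, device=cond.device)
    sigmas = (sigma_max ** (1 / rho)
              + t * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    sigmas = torch.cat([sigmas, torch.zeros(1, device=cond.device)])

    # Start from pure noise at sigma_max; 4 output channels (refc, 2t, 10u, 10v).
    x = torch.randn(cond.shape[0], 4, *cond.shape[-2:], device=cond.device) * sigmas[0]
    for i in range(steps):
        denoised = denoiser(x, sigmas[i], cond)  # D(x; sigma) from the EDM preconditioning
        d = (x - denoised) / sigmas[i]           # probability-flow ODE derivative
        x = x + (sigmas[i + 1] - sigmas[i]) * d  # single Euler step per noise level
    return x
```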
Several of the optimizations described in this blog post also apply to diffusion models in general, and some are specific to patch-based approaches. As such, they can be ported to other models in PhysicsNeMo and used in the development of solutions beyond weather downscaling.
Getting began
Train/inference CorrDiff in PhysicsNeMo: PhysicsNeMo CorrDiff documentation
- To train with the optimized codebase, follow the instructions in the CorrDiff repo readme, and set the following options in the training.perf section of your chosen training YAML config:
fp_optimizations: amp-bf16
use_apex_gn: True
torch_compile: True
profile_mode: False
- To run inference with the optimized codebase, follow the instructions in the CorrDiff repo readme, and set the following options in the generation.perf section of your chosen generation config:
use_fp16: True
use_apex_gn: True
use_torch_compile: True
profile_mode: False
io_syncronous: True
- Set profile_mode to False for optimized performance, because the NVTX annotations would introduce graph breaks in the torch.compile workflow.
- To use the latest Apex GroupNorm kernels, either build Apex GroupNorm in the PhysicsNeMo container Dockerfile or build it locally after loading the PhysicsNeMo container.
- Clone the Apex repo and build using:
CFLAGS="-g0" NVCC_APPEND_FLAGS="--threads 8" pip install --no-build-isolation --no-cache-dir --disable-pip-version-check --config-settings "--build-option=--group_norm" .
Learn more about optimized CorrDiff training in PhysicsNeMo and run optimized workflows in Earth2Studio.
