Deploy High-Performance AI Models in Windows Applications on NVIDIA RTX AI PCs


Today, Microsoft is making Windows ML available to developers. Windows ML enables C#, C++, and Python developers to optimally run AI models locally across PC hardware, spanning CPUs, NPUs, and GPUs. On NVIDIA RTX GPUs, it uses the NVIDIA TensorRT for RTX Execution Provider (EP), leveraging the GPU’s Tensor Cores and architectural advancements like FP8 and FP4 to deliver the fastest AI inference performance on Windows-based RTX AI PCs.

“Windows ML unlocks full TensorRT acceleration for GeForce RTX and RTX Pro GPUs, delivering exceptional AI performance on Windows 11,” said Logan Iyer, VP, Distinguished Engineer, Windows Platform and Developer. “We’re excited it’s generally available for developers today to construct and deploy powerful AI experiences at scale.”

Overview of Windows ML and TensorRT for RTX EP

Video 1. Deploy high-performance AI models in Windows applications on NVIDIA RTX AI PCs

Windows ML is built upon the ONNX Runtime APIs for inferencing. It extends the ONNX Runtime APIs to handle dynamic initialization and dependency management of the execution provider across CPU, NPU, and GPU hardware on the PC. In addition, Windows ML automatically downloads the necessary execution provider on demand, removing the need for app developers to manage dependencies and packages across multiple hardware vendors.

Figure 1. Windows ML stack diagram

NVIDIA TensorRT for RTX Execution Provider (EP) provides several advantages to Windows ML developers using ONNX Runtime including: 

  • Run ONNX models with low-latency inference and 50% faster throughput compared to prior DirectML implementations on NVIDIA RTX GPUs, as shown in Figure 2 below.
  • Integrates directly with Windows ML through its flexible EP architecture and ONNX Runtime integration.
  • Just-in-time compilation for streamlined deployment on end-user devices. Learn more about the compilation process in TensorRT for RTX. This compilation process is supported in ONNX Runtime via EP context models.
  • Leverages Tensor Core architectural advancements like FP8 and FP4.
  • Lightweight package at just under 200 MB.
  • Support for a wide variety of model architectures, including LLMs (with the ONNX Runtime GenAI SDK extension), diffusion models, CNNs, and more.

Learn more about TensorRT for RTX.

Figure 2. Generation throughput speedup of various models on Windows ML versus DirectML. Data measured on an NVIDIA RTX 5090 GPU.

Choosing an execution provider

The 1.23.0 release of ONNX Runtime, included with Windows ML, provides vendor- and execution-provider-independent APIs for device selection. This dramatically reduces the amount of application logic needed to take advantage of the optimal execution provider on each hardware vendor's platform. The code excerpts below show how to do this effectively and obtain maximum performance on NVIDIA GPUs.

// Register desired execution provider libraries of various vendors
auto env = Ort::Env(ORT_LOGGING_LEVEL_WARNING);
env.RegisterExecutionProviderLibrary("nv_tensorrt_rtx", L"onnxruntime_providers_nv_tensorrt_rtx.dll");

// Option 1: Rely on the ONNX Runtime execution policy
Ort::SessionOptions sessions_options;
sessions_options.SetEpSelectionPolicy(OrtExecutionProviderDevicePolicy_PREFER_GPU);

// Option 2: Iterate over EpDevices to perform manual device selection
std::vector<Ort::ConstEpDevice> ep_devices = env.GetEpDevices();
std::vector<Ort::ConstEpDevice> selected_devices = select_ep_devices(ep_devices);

Ort::SessionOptions session_options;
Ort::KeyValuePairs ep_options;
session_options.AppendExecutionProvider_V2(env, selected_devices, ep_options);

The same can be done in Python:

# Register desired execution provider libraries of various vendors
ort.register_execution_provider_library("NvTensorRTRTXExecutionProvider", "onnxruntime_providers_nv_tensorrt_rtx.dll")

# Option 1: Rely on the ONNX Runtime execution policy
session_options = ort.SessionOptions()
session_options.set_provider_selection_policy(ort.OrtExecutionProviderDevicePolicy.PREFER_GPU)

# Option 2: Iterate over EpDevices to perform manual device selection
ep_devices = ort.get_ep_devices()
ep_device = select_ep_devices(ep_devices)

provider_options = {}
session_options.add_provider_for_devices([ep_device], provider_options)
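
With the session options configured by either approach, creating the inference session itself is unchanged. The following minimal Python sketch ties the pieces together; the model path is a placeholder and the feeds depend on your model:

import onnxruntime as ort

# Register the TensorRT for RTX EP library once per process
ort.register_execution_provider_library(
    "NvTensorRTRTXExecutionProvider",
    "onnxruntime_providers_nv_tensorrt_rtx.dll")

# Let ONNX Runtime pick the best available device, preferring the GPU
session_options = ort.SessionOptions()
session_options.set_provider_selection_policy(ort.OrtExecutionProviderDevicePolicy.PREFER_GPU)

# Create the session; "model.onnx" is a placeholder path
session = ort.InferenceSession("model.onnx", sess_options=session_options)

# Inspect the expected inputs before calling session.run with model-specific feeds
print([(i.name, i.shape) for i in session.get_inputs()])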

Precompiled runtimes offering quick load times

Model runtimes can now be precompiled using EP context ONNX files in ONNX Runtime. Each execution provider can use this to optimize entire subgraphs of an ONNX model and provide an EP-specific implementation. The result can be serialized to disk to enable quick load times with Windows ML, which is often faster than the prior operator-based approach in DirectML.

The chart below shows that the TensorRT for RTX EP takes time to compile, but is faster to load and run inference on the model because the optimizations are already serialized. In addition, the runtime cache feature in the TensorRT for RTX EP ensures that the kernels generated during the compile phase are serialized and stored in a directory, so they don't have to be recompiled for subsequent inferences.
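
In practice, EP context generation is driven through ONNX Runtime session configuration entries. The sketch below uses the generic ONNX Runtime EP context keys; the file names are illustrative and the exact behavior may vary by EP version:

import onnxruntime as ort

# First session: compile the model and serialize an EP context model to disk
compile_options = ort.SessionOptions()
compile_options.add_session_config_entry("ep.context_enable", "1")                  # emit EP context nodes
compile_options.add_session_config_entry("ep.context_file_path", "model_ctx.onnx")  # illustrative output path
ort.InferenceSession("model.onnx", sess_options=compile_options)

# Later sessions: load the precompiled EP context model for fast startup,
# optionally combined with the TensorRT for RTX runtime cache (see below)
session = ort.InferenceSession("model_ctx.onnx", sess_options=ort.SessionOptions())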

Figure 3. Load times of DeepSeek-R1-Distill-Qwen-7B model runtimes using the ONNX model only, with EP context files, and with EP context files plus runtime cache. Lower is better.

Minimal data transfer overheads with ONNX Runtime Device API and Windows ML

The new ONNX Runtime device API, also available in Windows ML, enumerates the available devices for each execution provider. Using this new concept, developers can allocate device-specific tensors without additional EP-dependent type specifications.

By leveraging CopyTensors and IOBinding, this API enables developers to perform EP-agnostic, GPU-accelerated inference with minimal runtime data transfer overhead—resulting in improved performance and cleaner code design.
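
As an illustration, the Python sketch below binds a device-resident input and a device-located output once with IOBinding, so the inference loop runs without per-iteration host-to-device or device-to-host copies. The tensor names, shapes, and the "cuda" device type are assumptions for a diffusion-style model; a real pipeline would also update the bound input between iterations:

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")  # placeholder model path
io_binding = session.io_binding()

# One-time host-to-device copy of the input
latents = np.random.randn(1, 16, 64, 64).astype(np.float32)        # hypothetical input tensor
latents_gpu = ort.OrtValue.ortvalue_from_numpy(latents, "cuda", 0)  # device-resident copy
io_binding.bind_ortvalue_input("latents", latents_gpu)              # hypothetical input name

# Keep the output on the device so it can feed the next model in the pipeline
io_binding.bind_output("out_sample", device_type="cuda", device_id=0)  # hypothetical output name

for _ in range(30):                         # e.g., 30 denoising iterations
    session.run_with_iobinding(io_binding)  # no host/device copies inside the loop

# Single device-to-host copy after the loop
result = io_binding.copy_outputs_to_cpu()[0]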

Figure 5 showcases the Stable Diffusion 3.5 Medium model leveraging the ONNX Runtime device API. Figure 4 below shows the time required for a single iteration of the diffusion loop for the same model, both with and without device IO bindings.

Figure 4. Stable Diffusion 3.5 Medium running with and without device bindings on an AMD Ryzen 7 7800X3D CPU + RTX 5090 GPU connected via PCIe 5.0. Lower time is better.

Using Nsight Systems, we visualized the performance overhead caused by repetitive copies between host and device when IO binding is not used:

Figure 5. Nsight Systems timeline showing the overhead created by additional synchronous PCIe traffic.

Prior to each inference run, a copy of the input tensor is performed, highlighted in green in our profile, and a device-to-host copy of the output takes about the same time. In addition, ONNX Runtime uses pageable memory by default, for which the device-to-host copy is an implicit synchronization, even though ONNX Runtime uses the cudaMemcpyAsync API.

However, when input and output tensors are bound via IO binding, the host-to-device copy of the input happens only once, prior to the multi-model inference pipeline. The same applies to the device-to-host copy of the output, after which we synchronize the CPU with the GPU again. The Nsight trace above shows multiple inference runs in the loop without any copy or synchronization operations in between, freeing CPU resources in the meantime. This results in a one-time device copy time of 4.2 milliseconds and a one-time host copy time of 1.3 milliseconds, for a total copy time of 5.5 milliseconds regardless of the number of iterations in the inference loop. For reference, this approach yields a ~75x reduction in copy time for a 30-iteration loop!

TensorRT for RTX Specific Optimizations

The TensorRT for RTX execution provider offers custom options to further optimize performance. The most important optimizations are listed below, followed by a configuration sketch.

  • CUDA graphs: Enabled by setting enable_cuda_graph to capture all CUDA kernels launched from TensorRT within a graph, thereby reducing kernel launch overhead on the CPU. This matters when the TensorRT graph launches many small kernels, so that the GPU can execute them faster than the CPU can submit them. This method yields around a 30% performance gain with LLMs and is useful for many model types, including traditional AI models and CNN architectures.
Figure 6. Throughput speedups with CUDA graphs enabled compared to disabled in the ONNX Runtime API. Data measured on an NVIDIA RTX 5090 GPU with several LLMs.
  • Runtime cache: nv_runtime_cache_path points to a directory where compiled kernels can be cached for quick load times, in combination with EP context nodes.
  • Dynamic shapes: Override known dynamic shape ranges by setting the three options profile_{min|max|opt}_shapes, or specify static shapes using AddFreeDimensionOverrideByName to fix the input shapes of a model. This feature is currently experimental. See the sketch after this list for how these options fit together.
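
A minimal sketch of passing these options when creating a session is shown below. The option names follow the list above; the shape-profile value format, the free dimension name, and the provider registration string are assumptions and may differ from the EP's exact syntax:

import onnxruntime as ort

trt_rtx_options = {
    "enable_cuda_graph": "1",                    # capture TensorRT kernel launches in a CUDA graph
    "nv_runtime_cache_path": "./trt_rtx_cache",  # directory for the serialized kernel cache
    # Experimental: constrain dynamic shape ranges (value format is an assumption)
    "profile_min_shapes": "input_ids:1x1",
    "profile_opt_shapes": "input_ids:1x128",
    "profile_max_shapes": "input_ids:1x512",
}

session_options = ort.SessionOptions()
# Alternative to shape profiles: pin a free dimension to a static size
session_options.add_free_dimension_override_by_name("batch_size", 1)

session = ort.InferenceSession(
    "model.onnx",                                   # placeholder model path
    sess_options=session_options,
    providers=["NvTensorRTRTXExecutionProvider"],   # provider string as registered earlier in this post
    provider_options=[trt_rtx_options],
)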

Summary

We’re excited to collaborate with Microsoft to bring Windows ML and the TensorRT for RTX EP to Windows application developers for maximum performance on NVIDIA RTX GPUs. Top Windows application developers, including Topaz Labs and Wondershare Filmora, are currently working on integrating Windows ML and the TensorRT for RTX EP into their applications.

Get started with Windows ML, ONNX Runtime APIs, and the TensorRT for RTX EP using the resources below:

Stay tuned for future improvements and get up to speed with the new APIs that our samples demonstrate. If you have a feature request, feel free to open an issue on GitHub and let us know!

Acknowledgements

We would like to thank Gaurav Garg, Kumar Anshuman, Umang Bhatt, and Vishal Agarawal for their contributions to this blog.


