Deploying AI applications across diverse consumer hardware has traditionally forced a trade-off. You can optimize for specific GPU configurations and achieve peak performance at the cost of portability. Alternatively, you can build generic, portable engines and leave performance on the table. Bridging this gap often requires manual tuning, multiple build targets, or accepting compromises.
NVIDIA TensorRT for RTX seeks to eliminate this trade-off. At under 200 MB, this lean inference library provides a Just-In-Time (JIT) optimizer that compiles engines in under 30 seconds. This makes it ideal for real-time, responsive AI applications on consumer-grade devices.
TensorRT for RTX introduces adaptive inference: engines that automatically optimize at runtime for your specific system, progressively improving compilation and inference performance as your application runs. No manual tuning, no multiple build targets, no intervention required.
Build a lightweight, portable engine once, deploy it anywhere, and let it adapt to the user's hardware. At runtime, the engine automatically compiles GPU-specific specialized kernels, learns from your workload patterns, and improves performance over time, all without any developer intervention. For more details, see the NVIDIA TensorRT for RTX documentation.
Adaptive inference
With TensorRT for RTX, runtime performance improves over time without any manual intervention. Three features work in tandem to enable this self-optimization: Dynamic Shapes Kernel Specialization tunes performance to your workloads' shapes, CUDA Graphs eliminate overhead when executing those kernels, and runtime caching persists these improvements across sessions. The result: your engine gets faster as it runs. A combined sketch of the three settings follows the list below.
- Dynamic Shapes Kernel Specialization: Automatically compiles faster kernels for shapes encountered at runtime and seamlessly swaps them in, improving performance in real time by specializing for workload conditions.
- Built-in CUDA Graphs: Automatically captures, instantiates, and executes kernels as a single batch, reducing kernel launch overhead and boosting inference performance, while integrating with Dynamic Shapes.
- Runtime caching: Reduces JIT overhead by persisting compiled kernels across sessions, avoiding redundant compilation.
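As a quick preview, here is a minimal sketch of how all three features are enabled together on an engine that has already been built and deserialized (the engine object here is an assumption); each API appears again with full context in the sections below.

import tensorrt_rtx as trt_rtx

# Minimal sketch: enable all three adaptive-inference features on an
# already-deserialized engine (each API is covered in detail below)
runtime_config = engine.create_runtime_config()

# 1. Dynamic Shapes Kernel Specialization (lazy background compilation)
runtime_config.dynamic_shapes_kernel_specialization_strategy = (
    trt_rtx.DynamicShapesKernelSpecializationStrategy.LAZY
)

# 2. Built-in CUDA Graphs (capture the whole graph to cut launch overhead)
runtime_config.cuda_graph_strategy = trt_rtx.CudaGraphStrategy.WHOLE_GRAPH_CAPTURE

# 3. Runtime caching (persist compiled kernels across sessions)
runtime_cache = runtime_config.create_runtime_cache()
runtime_config.set_runtime_cache(runtime_cache)

context = engine.create_execution_context(runtime_config)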
For a live demonstration of these features working together on an actual diffusion pipeline with concrete speedups, see the Adaptive Inference Acceleration With TensorRT for RTX walkthrough video.
Static optimization versus adaptive inference workflows
Traditional inference frameworks require developers to predict input shapes and build optimized engines for every target configuration at compile time. TensorRT for RTX takes a different approach: engines adapt to actual workloads at runtime. Table 1 compares these two workflows.
| Component | Static workflow | Adaptive inference |
|---|---|---|
| Build targets | Multiple engines per GPU | Single portable engine |
| Shape flexibility | Optimized at build time for predicted shapes | Optimized automatically at runtime for shapes actually seen |
| Inference run 1 | Optimal performance (if pretuned shape) | Near-optimal performance |
| Inference run N | Same performance | Performance improves over time as new shapes are encountered (plus cached specializations) |
| Developer effort | Manual tuning per config | Zero intervention |
Adaptive inference closes the gap with the static workflow, offering optimal performance while eliminating build complexity and developer effort.
Performance comparison: Adaptive versus static
To demonstrate the performance of adaptive inference, we compared the FLUX.1 [dev] model in FP8 precision at 512×512 with dynamic shapes on an RTX 5090 (Windows 11) using TensorRT for RTX 1.3 against a static optimizer. As shown in Figure 1, adaptive inference surpasses static optimization by iteration 2 and reaches 1.32x faster with all features enabled. Runtime caching also accelerates JIT compilation from 31.92 s to 1.95 s (16x), enabling subsequent sessions to start at peak performance immediately.


Motivating example
Creating a TensorRT engine from an ONNX model provides a motivating example:
import tensorrt_rtx as trt_rtx

# Create the builder, network, and ONNX parser
logger = trt_rtx.Logger(trt_rtx.Logger.WARNING)
builder = trt_rtx.Builder(logger)
network = builder.create_network()
parser = trt_rtx.OnnxParser(network, logger)

# Parse the ONNX model into the network definition
with open("your_model.onnx", "rb") as f:
    parser.parse(f.read())
Dynamic Shapes Kernel Specialization
Models tend to have varying input dimensions across different image resolutions, variable sequence lengths, or dynamic batch sizes. Dynamic Shapes Kernel Specialization automatically generates and caches optimized kernels for the shapes your application encounters at runtime, tailored specifically to the model's input dimensions. These optimized kernels are cached and reused, so subsequent inferences with the same shape run at peak performance, minimizing the compromise between flexibility and speed.
Figure 2 presents the inference speedup with TensorRT for RTX Dynamic Shapes Kernel Specialization across model categories on an NVIDIA GeForce RTX 5090 (Windows 11). Each bar shows the average performance gain when specialized kernels are automatically generated and swapped in for encountered input shapes versus using generic “fallback” kernels.


The benefits scale with your workload variety. Models that process diverse input shapes see consistent performance across all configurations, while maintaining the flexibility to handle whatever comes next. Learn more about working with dynamic shapes.
Continuing with the initial example:
# Define optimization profile: min/opt/max shapes for dynamic dimensions
profile = builder.create_optimization_profile()
profile.set_shape(
    "input",
    min=(1, 3, 224, 224),
    opt=(8, 3, 224, 224),
    max=(32, 3, 224, 224),
)

# Create the builder config and register the profile
config = builder.create_builder_config()
config.add_optimization_profile(profile)

# ... build engine ...
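# One way to fill in the build step above (a sketch; assumes the standard
# build_serialized_network / Runtime.deserialize_cuda_engine flow)
serialized_engine = builder.build_serialized_network(network, config)
runtime = trt_rtx.Runtime(logger)
engine = runtime.deserialize_cuda_engine(serialized_engine)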
# Configure the dynamic shape kernel specialization strategy
# The default is lazy compilation, set explicitly below for illustration
# Lazy compilation automatically swaps in kernels compiled in the background,
# adaptively improving performance for shapes encountered at runtime
runtime_config = engine.create_runtime_config()
runtime_config.dynamic_shapes_kernel_specialization_strategy = (
    trt_rtx.DynamicShapesKernelSpecializationStrategy.LAZY
)
Built-in CUDA Graphs
Modern neural networks can execute hundreds of individual GPU kernels per inference. Each kernel launch carries overhead, typically 5-15 microseconds of CPU and driver work. For models dominated by small operations (compact convolutions, small matrix multiplications, elementwise operations), this launch time becomes a bottleneck.
When per-kernel launch overhead dominates execution time, the GPU idles while the CPU queues work: the enqueue time approaches or exceeds actual GPU compute time. This condition, referred to as being "enqueue-bound," can be addressed with CUDA Graphs.
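As a rough way to check whether a model is enqueue-bound, you can compare the CPU time spent enqueueing work against the GPU time measured with CUDA events. The sketch below assumes an existing execution context with its input and output tensor addresses already bound, and uses PyTorch purely for convenient stream and event handles.

import time
import torch  # used here only for CUDA stream/event handles (an assumption)

# Rough enqueue-bound check: compare CPU enqueue time with GPU execution time.
# Assumes `context` is an execution context with tensor addresses already set.
stream = torch.cuda.Stream()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record(stream)
t0 = time.perf_counter()
context.execute_async_v3(stream.cuda_stream)  # returns once the kernels are enqueued
enqueue_ms = (time.perf_counter() - t0) * 1e3
end.record(stream)
end.synchronize()

gpu_ms = start.elapsed_time(end)
# If the enqueue time approaches the GPU time, the workload is enqueue-bound
# and built-in CUDA Graphs are likely to help.
print(f"enqueue: {enqueue_ms:.2f} ms, GPU: {gpu_ms:.2f} ms")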
CUDA Graphs capture the complete inference sequence as a graph structure, eliminating kernel-launch overhead and optimizing common use cases, including repeated model calls. TensorRT for RTX launches the entire computation graph in a single operation instead of launching kernels individually.
This can shave multiple milliseconds off each inference iteration, for instance providing a 1.8 ms (23%) boost on every run of the SD 2.1 UNet model, as measured on a Windows machine with an RTX 5090 GPU. The feature is especially helpful on Windows systems with Hardware-Accelerated GPU Scheduling enabled. Models with many small kernels see the greatest benefit, boosting the performance of enqueue-bound workloads.
Furthermore, in the context of dynamic shapes, the built-in CUDA Graphs support captures and executes only the shape-specialized kernels. This approach ensures that the CUDA Graph focuses on accelerating the most performant kernels, typically those used most frequently. Read more about working with built-in CUDA Graphs.
Figure 3 shows the inference speedup with TensorRT for RTX using built-in CUDA Graphs on an RTX 5090 GPU (Windows 11, Hardware-Accelerated GPU Scheduling enabled). Note that gains for CUDA Graphs are more pronounced on image networks with many relatively short-running kernels.


Adding to the example:
# Enable CUDA Graph capture for reduced kernel launch overhead
runtime_config.cuda_graph_strategy = trt_rtx.CudaGraphStrategy.WHOLE_GRAPH_CAPTURE
Runtime caching
JIT compilation provides portability and automatic GPU-specific optimization in TensorRT for RTX. Runtime caching takes this further by preserving compiled kernels—including the specialized dynamic shape kernels referenced previously—across sessions, eliminating redundant compilation work.


To use runtime caching, begin by running initial inferences with your commonly used shapes. This warm-up generates specialized kernels tailored to those shapes. Using the runtime cache API, these kernels can then be serialized into a binary blob, which can be saved to disk for future reuse.
By loading this binary blob in subsequent sessions, you ensure that the most optimized kernels are available immediately, eliminating the need for a warm-up period, avoiding any performance regression, and preventing fallback to generic kernels. This allows your application to achieve peak performance from the very first inference run.
In addition, the runtime cache file can be bundled with your application. If you know your target users' specific platforms (OS, GPU, CUDA, and TensorRT versions), you can pregenerate the runtime cache for those environments. Using your provided runtime cache file, users can bypass kernel compilation overhead entirely, enabling optimal performance from the very first run. Read more about working with runtime caching.
Completing the example:
from polygraphy import util

# Create runtime cache to persist compiled kernels across runs
runtime_cache = runtime_config.create_runtime_cache()

# Load the existing cache, if available
runtime_cache_file = "runtime.cache"
with util.LockFile(runtime_cache_file):
    try:
        loaded_cache_bytes = util.load_file(runtime_cache_file)
        if loaded_cache_bytes:
            runtime_cache.deserialize(loaded_cache_bytes)
    except Exception:
        pass  # No cache yet; it will be populated during inference

runtime_config.set_runtime_cache(runtime_cache)
context = engine.create_execution_context(runtime_config)
# ... run inference ...
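# One way to fill in the inference step above (a sketch). The tensor names
# "input"/"output" and the use of PyTorch for device buffers are illustrative
# assumptions; any CUDA allocator works.
import torch

context.set_input_shape("input", (8, 3, 224, 224))
input_buf = torch.randn(8, 3, 224, 224, device="cuda", dtype=torch.float32)
output_buf = torch.empty(tuple(context.get_tensor_shape("output")), device="cuda", dtype=torch.float32)
context.set_tensor_address("input", input_buf.data_ptr())
context.set_tensor_address("output", output_buf.data_ptr())

stream = torch.cuda.current_stream()
context.execute_async_v3(stream.cuda_stream)
stream.synchronize()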
# Save the cache for future runs
runtime_cache = runtime_config.get_runtime_cache()
with util.LockFile(runtime_cache_file):
    with runtime_cache.serialize() as buffer:
        util.save_file(buffer, runtime_cache_file, description="runtime cache")
Get started with adaptive inference
Three technologies work together to make adaptive inference optimizations easy:
- Dynamic Shapes Kernel Specialization ensures each shape runs optimally.
- CUDA Graphs eliminate the overhead of executing those optimized kernels.
- Runtime caching makes those optimizations persistent across sessions.
AI applications can adapt to any input dimension while maintaining the performance characteristics of static-shape inference. No compromises or artificial constraints in your application design. Read more about TensorRT for RTX best practices for performance.
To experience adaptive inference with NVIDIA TensorRT for RTX, visit the NVIDIA/TensorRT-RTX GitHub repo and try the FLUX.1 [dev] Pipeline Optimized with TensorRT RTX notebook. You can also view the Adaptive Inference Acceleration with TensorRT for RTX walkthrough video for a live demonstration of these features.
Start building AI apps for NVIDIA RTX PCs to run models faster and more privately on-device, and streamline development with NVIDIA tools, SDKs, and models on Windows.
