NVIDIA TensorRT LLM enables developers to build high-performance inference engines for large language models (LLMs), but deploying a new architecture traditionally requires significant manual effort. To address this challenge, today we're announcing the availability of AutoDeploy as a beta feature in TensorRT LLM.
AutoDeploy compiles off-the-shelf PyTorch models into inference-optimized graphs. This avoids the need to bake inference-specific optimizations directly into model code, reducing LLM deployment time. AutoDeploy enables the shift from manually reimplementing and optimizing each model toward a compiler-driven workflow that separates model authoring from inference optimization.
This post introduces the AutoDeploy architecture and capabilities, and shows how it enabled support for recent NVIDIA Nemotron models at launch.
What’s AutoDeploy?
Every new LLM architecture comes with its own inference challenges, from transformer models to hybrid vision language models (VLMs) to state space models (SSMs). Turning a reference implementation into a high-performance inference engine typically requires adding KV cache management, sharding weights across GPUs, fusing operations, and tuning the execution graph for specific hardware.
AutoDeploy shifts this workflow toward a compiler-driven approach. Instead of requiring model authors to manually reimplement inference logic, AutoDeploy automatically extracts a computation graph from an off-the-shelf PyTorch model and applies a series of automated transformations to produce an inference-optimized TensorRT LLM graph. This lets you describe the model once in PyTorch and delegate inference-specific concerns, such as caching, sharding, kernel selection, and runtime integration, to the compiler and runtime.
This approach is particularly well suited to the long tail of models, including new research architectures, internal variants, and fast-moving open source models, where manual reimplementation is often impractical or unjustified. AutoDeploy enables deployment at launch with competitive baseline performance, while preserving a clear path to incremental optimization as models mature.
AutoDeploy provides:
- Seamless model translation: Automatically converts Hugging Face models into TensorRT LLM graphs without manual rewrites
- Single source of truth: Keeps the original PyTorch model as the canonical definition
- Inference optimization: Applies sharding, quantization, KV cache insertion, attention fusion, CUDA Graphs optimization, and more
- Deployment at launch: Enables immediate deployment with ongoing performance improvements over time
- Turnkey setup: Ships as part of TensorRT LLM with examples and documentation
AutoDeploy can be used for:
- New or experimental architectures: Rapidly deploy research models, hybrid designs, or novel token mixing (attention) mechanisms
- Long-tail model support: Serve internal, fine-tuned, or less common models without bespoke inference implementations
- Fast performance bring-up: Reach competitive baseline performance quickly, then optimize incrementally
- Unified training-to-inference workflow: Keep PyTorch as the model definition while relying on TensorRT LLM for runtime integration
AutoDeploy currently supports more than 100 text-to-text LLMs and offers early support for VLMs and SSMs, as well as performance-optimized models such as the Llama model family and NVIDIA Nemotron 3 Nano.
AutoDeploy technical background
AutoDeploy sits between the original Hugging Face model and the TensorRT LLM runtime. The LLM API accepts a model name or checkpoint directory and returns a high-level LLM object. Under the hood, that object can use AutoDeploy (automated) or a manual backend.
As Figure 1 shows, the AutoDeploy path automatically extracts a graph, applies optimizations, and generates an inference-optimized graph. The manual path requires engineers to rewrite the model (adding KV cache logic, attention kernels, sharding, kernel fusion, and more) before running it through the same runtime.
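In code, the automated path is reached through the standard LLM API. The minimal sketch below illustrates this; the import path for the AutoDeploy backend is an assumption and may change between TensorRT LLM releases, so check the AutoDeploy documentation for the current form.

```python
# Minimal sketch of the AutoDeploy entry point (import path assumed; it may
# differ across TensorRT LLM releases).
from tensorrt_llm._torch.auto_deploy import LLM

# A Hugging Face model name or a local checkpoint directory is accepted.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# The returned LLM object behaves like any other TensorRT LLM engine.
outputs = llm.generate(["The capital of France is"])
print(outputs[0].outputs[0].text)
```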


Graph capture and pattern matching
AutoDeploy uses the torch.export API to capture the model as a standardized Torch graph consisting of core ATen operations and custom (user- or AutoDeploy-provided) operations. The exported graph then undergoes a series of automated transformations to pattern-match and canonicalize the graph representation of common building blocks.
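As a rough illustration of this capture step, the snippet below uses a toy module (not a real LLM layer) to show how torch.export produces a graph of ATen operations that downstream passes can pattern-match; exact op names can vary by PyTorch version.

```python
import torch
from torch.export import export

class TinyBlock(torch.nn.Module):
    """Toy stand-in for a model building block, used only to illustrate capture."""
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(64, 64)

    def forward(self, x):
        return torch.nn.functional.silu(self.proj(x))

# torch.export captures the module as a standardized graph of core ATen ops.
ep = export(TinyBlock(), (torch.randn(2, 8, 64),))

# The captured graph (nodes such as aten.linear and aten.silu) is what
# pattern-matching and canonicalization passes operate on.
print(ep.graph_module.graph)
```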
In this initial step, AutoDeploy ensures that common building blocks such as mixture of experts (MoE), attention, RoPE, or state-space layers are represented by reference implementations expressed as custom ops and single nodes in the graph.
Figure 2 provides an example of how attention is represented across all models as a single, easy-to-interpret custom operator in PyTorch.


This approach ensures a seamless model onboarding process that is decoupled from performance optimization and runtime integration.
Furthermore, model onboarding happens on a sliding scale between fully automated onboarding through pattern matching and (full) manual rewrites, ensuring that the final model graph can fully execute the model. The model author can inject custom kernels into the model graph by decorating the relevant operations as PyTorch custom operators; the AutoDeploy compiler will not modify these operators (Figure 3).
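A minimal sketch of this escape hatch is shown below, using PyTorch's torch.library.custom_op API; the op name and fused kernel are hypothetical, and the reference body stands in for a hand-written kernel.

```python
import torch

# Hypothetical custom op: wrapping the computation this way keeps it as a single
# opaque node in the exported graph, so compiler passes leave it untouched.
@torch.library.custom_op("my_model::fused_rmsnorm", mutates_args=())
def fused_rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # Reference implementation; a hand-written CUDA kernel could be dispatched here instead.
    variance = x.float().pow(2).mean(dim=-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight

@fused_rmsnorm.register_fake
def _(x, weight, eps):
    # Shape and dtype propagation so torch.export can trace through the op.
    return torch.empty_like(x)
```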


Sharding, fusion, and performance optimization
In the next stages, AutoDeploy automatically applies performance optimizations through compiler-like passes that combine fusion passes, performance-tuned recipes, and the insertion of optimized kernels into the graph representation. During this stage, the model is also sharded for multi-GPU inference based on available heuristics or prespecified sharding hints, reusing the sharding hints provided by Hugging Face.
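To make the idea of a compiler-like pass concrete, here is a simplified, illustrative pass (not AutoDeploy's actual implementation) that walks an exported graph and reports linear-followed-by-GELU pairs that a real fusion pass would replace with a single fused kernel node; the exact ATen op names may vary by PyTorch version.

```python
import torch
from torch.export import export

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.up = torch.nn.Linear(64, 256)
        self.down = torch.nn.Linear(256, 64)

    def forward(self, x):
        return self.down(torch.nn.functional.gelu(self.up(x)))

gm = export(MLP(), (torch.randn(2, 8, 64),)).graph_module

# Walk the captured graph and locate linear -> gelu pairs. A real fusion pass
# would rewrite each match into a single node calling an optimized fused kernel.
for node in gm.graph.nodes:
    if node.op == "call_function" and node.target is torch.ops.aten.gelu.default:
        producer = node.args[0]
        if (
            isinstance(producer, torch.fx.Node)
            and producer.op == "call_function"
            and producer.target is torch.ops.aten.linear.default
        ):
            print(f"fusion candidate: {producer.name} -> {node.name}")
```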
Flexible attention and caching support
During graph capture and pattern matching, AutoDeploy represents token mixing operators (for example, attention) as simple prefill-only operations expressed as AutoDeploy-canonicalized reference operators. This is depicted in Figure 3 for the example of softmax attention.
The system then automatically swaps in performance-optimized attention kernels and integrates the caching mechanisms of token mixing operators into the optimized TensorRT LLM cache manager system. Currently, AutoDeploy can handle models that are arbitrarily composed of softmax attention, state-space layers (Mamba2), linear attention (DeltaNet), and causal convolution.
Adding support for other cached operators follows a strict interface and is easily extensible.
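As a purely hypothetical sketch (not AutoDeploy's actual interface), a cached token-mixing operator roughly needs to tell the cache manager what per-sequence state to allocate and provide prefill and decode variants that read and update that state:

```python
from typing import Protocol, Tuple
import torch

class CachedTokenMixingOp(Protocol):
    """Hypothetical illustration of what a cached token-mixing operator must expose.
    The real AutoDeploy interface differs; this only conveys the shape of the contract."""

    def cache_shapes(self, max_batch_size: int, max_seq_len: int) -> Tuple[torch.Size, ...]:
        """Per-sequence state buffers (e.g., KV pages or SSM state) the cache manager allocates."""
        ...

    def forward_prefill(self, hidden_states: torch.Tensor, *state: torch.Tensor) -> torch.Tensor:
        """Process the full prompt and populate the state buffers."""
        ...

    def forward_decode(self, hidden_states: torch.Tensor, *state: torch.Tensor) -> torch.Tensor:
        """Process one new token per sequence, reading and updating the state buffers."""
        ...
```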
Compilation tooling
AutoDeploy integrates with common off-the-shelf tooling for further compiling and lowering the model, such as torch.compile, CUDA Graphs for fixed batch-size decode-only batches, multistream optimizations, and more.
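These are standard PyTorch mechanisms rather than anything AutoDeploy-specific. For instance, torch.compile's reduce-overhead mode captures fixed-shape calls into CUDA Graphs, which is the kind of lowering suited to decode-only batches (illustrative snippet, requires a GPU; not AutoDeploy's internal setup):

```python
import torch

# Generic torch.compile + CUDA Graphs usage (illustrative only).
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda().half()

# "reduce-overhead" mode captures fixed-shape invocations into CUDA Graphs,
# which suits decode-only batches with a fixed batch size.
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(8, 1024, device="cuda", dtype=torch.half)
for _ in range(3):   # warmup runs trigger compilation and graph capture
    y = compiled(x)
torch.cuda.synchronize()
```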
Runtime integration
AutoDeploy handles all aspects of integrating the model into the optimized TensorRT LLM runtime, including features like the overlap scheduler, chunked prefill, speculative decoding, and cache and state management, without burdening the model author with the intertwined dependencies between the model and the runtime.
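For illustration, such runtime features are typically enabled on the LLM object rather than in the model definition; the argument name below is an assumption based on the public LLM API and may differ between TensorRT LLM releases.

```python
# Hedged sketch: import path and argument names are assumptions and may differ
# between TensorRT LLM releases; consult the LLM API reference.
from tensorrt_llm._torch.auto_deploy import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,   # chunked prefill handled by the runtime, not the model code
)
```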
AutoDeploy performance example: Nemotron 3 Nano
To gauge AutoDeploy capabilities, the team onboarded NVIDIA Nemotron 3 Nano, a hybrid MoE model. While hand-tuning such a model for inference would typically take weeks, AutoDeploy enabled onboarding within days, followed by incremental optimizations that performed in line with a manually tuned baseline.
On a single NVIDIA Blackwell DGX B200 GPU, AutoDeploy performed on par with the manually optimized baseline in TensorRT LLM (Figure 4). It delivered up to 350 tokens per second per user and up to 13,000 output tokens per second, for latency-sensitive and high-throughput applications, respectively.


Data was collected for ISL/OSL 1k/1k, TP=1, on NVIDIA DGX B200 using TensorRT LLM v1.3.0rc1, trtllm-serve, and the AIPerf benchmarking tool.
To reproduce the results yourself, follow the steps outlined in the NVIDIA Nemotron 3 Nano checkpoint.
Model onboarding example: Nemotron-Flash
Nemotron-Flash is a representative example of the kind of architecture that can be difficult to support with a purely manual inference workflow. This hybrid research model combines multiple token mixers, including state space layers, softmax attention, and linear attention, and would require significant engineering effort to reimplement, optimize, and maintain by hand.
With AutoDeploy, existing optimization passes for Nemotron-Flash layers could be reused out of the box, without any model-specific engineering. New layer types, such as the DeltaNet update rule, were integrated as an incremental extension rather than a full rewrite and can be reused for future model onboarding work.
As a result, Nemotron-Flash was onboarded and performance-optimized within days and is now supported out of the box. This highlights the core strength of AutoDeploy: once optimizations are expressed as reusable compiler passes, new and unconventional architectures can immediately benefit from the full optimization stack, dramatically reducing time-to-deployment while maintaining high inference performance.
The team used TensorRT LLM AutoDeploy to benchmark Nemotron Flash 3B Instruct against Qwen2.5 3B Instruct, a widely adopted, heavily hand-tuned model in the same size range. For the benchmarking scenario in Figure 5 (ISL/OSL=8k/16k), Nemotron-Flash outperforms Qwen2.5, highlighting how novel model architectures can be quickly onboarded to achieve production-ready performance.


Data was collected for ISL/OSL 8k/16k, TP=1, on NVIDIA DGX H100 using TensorRT LLM v1.3.0rc1, trtllm-serve, and the AIPerf benchmarking tool.
Get started with TensorRT LLM AutoDeploy
TensorRT LLM AutoDeploy marks a shift toward treating inference optimization as a compiler and runtime responsibility rather than a burden on the model author. This approach enables faster experimentation, broader model coverage, and a cleaner separation between model design and deployment.
Instead of hand-tuning each model, you can describe the architecture once and let the system apply graph transformations and optimized kernels. Early successes such as Nemotron 3 Nano and Nemotron-Flash demonstrate that deployment at model launch with competitive performance is achievable across diverse model architectures.
TensorRT LLM AutoDeploy is rapidly evolving. If you're interested in experimenting with this feature or contributing to its development, check out the AutoDeploy documentation and example scripts.
Acknowledgments
We'd like to thank those who have contributed to AutoDeploy, including Ajinkya Rasane, Bala Marimuthu, Chenghao Zhang, Chenjie Luo, Eran Geva, Frida Hou, Gal Hubara Agam, Govind Ramnarayan, Grzegorz Kwasniewski, Hao Guo, Jingyu Xin, Joyjit Daw, Karthik Vetrivel, Lucas Liebenwein, Neta Zmora, Suguna Varshini Velury, Suyog Gupta, Tal Cherckez, Taylor Lee, Wanli Jiang, Wei-Ming Chen, William Zhang, and Yoco Xiao.
