Reasoning models are growing rapidly in size and are increasingly being integrated into agentic AI workflows that interact with other models and external tools. Deploying these models and workflows in production environments requires distributing them across multiple GPU nodes, which demands careful orchestration and coordination across GPUs.
NVIDIA Dynamo 1.0—available now—addresses these problems by accelerating generative AI and reasoning models in large-scale distributed environments. The AI framework delivers low-latency, high-throughput, distributed inference for production-grade multi-node AI deployments.
Dynamo supports leading open source inference engines, including SGLang, NVIDIA TensorRT LLM, and vLLM. It has also delivered strong results in trusted third-party benchmarks such as MLPerf and SemiAnalysis InferenceX, reinforcing its position as a production-grade inference platform. Dynamo can boost the number of requests served by up to 7x on NVIDIA Blackwell, as demonstrated in the recent SemiAnalysis InferenceX benchmark.


SemiAnalysis InferenceX, updated March 3, 2026. Results for DeepSeek R1-0528, FP4, 1k/1k, interactivity: ~50 tok/sec/user.
This blog details how early adopters have integrated Dynamo into real-world inference workflows, the system-level performance improvements achieved, and the newest features and optimizations added to the framework.
Early adopters and real-world impact
At last year's GTC, NVIDIA introduced NVIDIA Dynamo, a low-latency, high-throughput distributed inference framework built for multi-node AI deployments. Since then, NVIDIA has worked collaboratively with the open source ecosystem to harden Dynamo for production-grade performance and large-scale workloads. Over this period, Dynamo has achieved significant milestones:
- Successfully deployed in production workflows: AstraZeneca, Baseten, ByteDance, CoreWeave, Crusoe, DigitalOcean, Gcore, GMI Cloud, Nebius, Meituan, Pinterest, Prime Intellect, Rednote, SoftBank Corp., Tencent Cloud, Together AI, Vultr, and many more have deployed Dynamo in production to scale multi-node inference, optimize throughput, and improve latency. Watch the Dynamo Day recordings to hear directly from organizations deploying Dynamo.
- Integrated into managed Kubernetes environments: Alibaba Cloud, Amazon Web Services (AWS), Google Cloud, Microsoft Azure, and Oracle Cloud Infrastructure (OCI) have built integrations showing how Dynamo can be seamlessly deployed into their managed Kubernetes environments, scaling inference to meet the growing demand for AI.
- Adopted by major open source frameworks: Modular Dynamo components such as NIXL have been widely adopted by inference engines including llm-d, NVIDIA TensorRT LLM, SGLang, and vLLM to speed up KV cache transfers between GPUs. LMCache has integrated its KV caching directly into Dynamo's storage solutions, SGLang has integrated its HiCache solution into Dynamo's Router, and LangChain has built an integration that injects agentic hints into Dynamo's Router, validating its composable architecture.
- Inspired contributions from across the AI ecosystem: Developers across the AI community have contributed to Dynamo and broadened its capabilities. Mooncake and Alibaba extended the Dynamo AIConfigurator with SGLang support; Microsoft tested and hardened Dynamo on Azure Kubernetes Service (AKS), contributing fixes, deployment guides, public demos, and Planner/AIConfigurator enhancements; Prime Intellect co-designed and integrated LoRA adapter support; and Baseten validated early Dynamo features in production-like environments, then upstreamed bug fixes and hardening patches.
- Enabled integration with storage solutions: Cloudian, DDN, Dell, HPE, IBM, NetApp, Pure Storage, VAST, and WEKA have integrated Dynamo into their AI solutions. This allows inference workloads to scale beyond GPU memory constraints and support very large context lengths with storage.
Dynamo 1.0 builds on these milestones while marking the framework's maturity and production readiness. Keep reading for more highlights about the update.
Optimizing inference for agentic workloads
Today's inference runtimes treat every request and KV cache block the same: a system prompt reused across many turns has the same eviction priority as a one-off chain-of-thought. Multi-turn agents, however, reuse prefixes and follow predictable patterns. An evicted multi-turn KV block must be recomputed, leading to wasted compute and higher inference costs. Dynamo addresses this gap with new agentic inference optimizations:
- Dynamo frontend API: Accepts agent hints (per-request metadata such as latency sensitivity, expected output length, and cache control) and passes them to the router and KV cache manager.
- Dynamo KV-aware router: Uses priority and latency agent hints to adjust queue ordering so user-facing turns run before background work. It can also take expected output sequence length (OSL) into account to improve load-balancing accuracy.
- Dynamo KV cache manager: Supports experimental cache pinning. Pinned nodes resist eviction for the specified duration and are moved to host memory rather than being deleted.
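To make the flow above concrete, here is a minimal sketch of what a request carrying agent hints might look like at the frontend. The field names under `hints` (`priority`, `expected_osl`, `pin_cache`) are illustrative placeholders, not the exact Dynamo request schema:

```python
import json

# Hypothetical sketch: an OpenAI-compatible chat payload carrying
# per-request agent hints that the frontend could forward to the
# KV-aware router and KV cache manager. Field names are illustrative.
def build_request(messages, priority="interactive", expected_osl=256,
                  pin_cache=True):
    """Build a chat payload with per-request agent hints attached."""
    return {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": messages,
        "hints": {
            "priority": priority,          # user-facing vs. background turn
            "expected_osl": expected_osl,  # expected output length, for load balancing
            "pin_cache": pin_cache,        # resist eviction of the shared prefix
        },
    }

payload = build_request([{"role": "user", "content": "Summarize the report."}])
print(json.dumps(payload, indent=2))
```

A background summarization turn would instead pass `priority="background"` and `pin_cache=False`, letting the router deprioritize it behind user-facing work.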
The community has built on these optimizations to create custom routing and integrate agent hints into popular frameworks such as LangChain's ChatNVIDIADynamo and the NVIDIA NeMo Agent Toolkit.
Running Dynamo with the NeMo Agent Toolkit demonstrated up to 4x lower time to first token (TTFT) and 1.5x higher throughput when running the Llama 3.1 model on NVIDIA Hopper.


Advancing multimodal inference optimization
Dynamo 1.0 introduces three new features designed to speed up multimodal inference in image-heavy workloads, where image encoding can be a bottleneck:
- Disaggregated encode/prefill/decode (E/P/D): Instead of running E/P/D on the same GPU, Dynamo separates them into distinct stages that scale independently. Running the encode phase on dedicated workers improves batching, memory efficiency, and overall throughput.
- Multimodal embedding cache: A CPU-backed least recently used (LRU) cache stores computed image embeddings off-GPU so repeated images skip encoding entirely. This applies to both disaggregated and aggregated setups.
- Multimodal KV routing: Multimodal KV routing extends Dynamo's KV-aware router to account for image content. A dedicated multimodal router downloads images and then selects the backend worker with the highest cache overlap, including overlap on blocks containing images.
Running the Qwen3-VL-30B-A3B-Instruct-FP8 multimodal model on NVIDIA GB200, Dynamo's embedding cache accelerated TTFT by up to 30% and throughput by up to 25% on image requests.
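The idea behind the embedding cache can be sketched in a few lines: key embeddings by a content hash of the image so repeated images skip the expensive encode step. This toy version (class name, capacity, and the stand-in encoder are all illustrative, not Dynamo's implementation) keeps the cache in CPU memory with LRU eviction:

```python
import hashlib
from collections import OrderedDict

# Minimal sketch of a CPU-backed LRU cache for image embeddings.
# Repeated images hit the cache by content hash, so the encode
# stage can be skipped entirely.
class EmbeddingLRU:
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._cache = OrderedDict()  # image hash -> embedding

    @staticmethod
    def _key(image_bytes):
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_encode(self, image_bytes, encode_fn):
        key = self._key(image_bytes)
        if key in self._cache:
            self._cache.move_to_end(key)     # mark as recently used
            return self._cache[key]
        embedding = encode_fn(image_bytes)   # expensive GPU encode on a miss
        self._cache[key] = embedding
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return embedding

calls = []
def fake_encode(image_bytes):
    calls.append(image_bytes)      # count how often we actually encode
    return [len(image_bytes)]      # stand-in for a real embedding vector

cache = EmbeddingLRU(capacity=2)
cache.get_or_encode(b"img-a", fake_encode)
cache.get_or_encode(b"img-a", fake_encode)  # cache hit, no re-encode
print(len(calls))  # -> 1
```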


Adding native support for video generation
Recent video-generation models are setting a new bar for cinematic quality and motion realism. But serving them efficiently is non-trivial: their inference workloads are compute- and memory-intensive, especially at high resolutions.
Dynamo 1.0 adds native support for video-generation models, with integrations for leading open source inference frameworks such as FastVideo, SGLang Diffusion, TensorRT LLM Diffusion, and vLLM-Omni. This brings Dynamo's modular stack, including its low-overhead front end, streaming capabilities, and high-efficiency scheduling engine, to modern video workloads.
This integration demonstrates that state-of-the-art video generation can be delivered efficiently on Dynamo. For a step-by-step walkthrough of how to deploy video-generation models with Dynamo, check out this how-to guide.
Accelerating inference startup by 7x with Dynamo ModelExpress
Modern inference clusters are constantly spinning new replicas up and down in response to traffic. Each new process has to repeat the same heavy startup pipeline:
- Downloading model checkpoints
- Loading weights from remote or shared storage
- Applying model optimizations
- Compiling kernels
- Constructing NVIDIA CUDA graphs
To solve this challenge, Dynamo ensures that the expensive parts of worker startup are done once and reused repeatedly through two new ModelExpress capabilities:
- Checkpoint restore: Instead of treating every replica as a fresh boot, Dynamo runs the full initialization sequence a single time, captures the “ready-to-serve” state to persistent storage, and then brings new replicas online by restoring from that checkpoint rather than rebuilding everything from scratch.
- Model weight streaming: Rather than having each new worker independently download model weights, write them to local or shared storage, and then load them into GPU memory, ModelExpress loads the model once on an initial worker and streams the weights to additional workers over high-bandwidth interconnects using the NVIDIA Inference Xfer Library (NIXL) and NVIDIA NVLink, eliminating reliance on storage bandwidth.
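The checkpoint-restore pattern can be illustrated with a toy sketch: do the expensive initialization once, persist the ready-to-serve state, and let every later replica restore it instead of rebooting from scratch. The `expensive_init` function and state shape below are stand-ins, not ModelExpress internals:

```python
import os
import pickle
import tempfile
import time

# Toy sketch of checkpoint restore: the first boot pays the full
# initialization cost and persists the result; later replicas
# restore the saved state instead of re-initializing.
def expensive_init():
    time.sleep(0.1)  # stands in for download + load + optimize + compile
    return {"weights": [0.0] * 1000, "compiled": True}

def start_replica(checkpoint_path):
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            return pickle.load(f)        # fast path: restore ready state
    state = expensive_init()             # slow path: first boot only
    with open(checkpoint_path, "wb") as f:
        pickle.dump(state, f)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "ready.ckpt")
t0 = time.perf_counter(); start_replica(ckpt); cold = time.perf_counter() - t0
t0 = time.perf_counter(); state = start_replica(ckpt); warm = time.perf_counter() - t0
print(warm < cold)  # restoring is much faster than a fresh boot
```

In the real system the restored state lives in persistent storage shared across the fleet, so every subsequent replica takes the fast path.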


For large models, especially in fleets that scale aggressively, model weight streaming can accelerate model loading time by up to 7x for large MoE models like DeepSeek v3 on NVIDIA H200.
Scaling Kubernetes on NVIDIA GB300 NVL72
NVIDIA Grove, an open source API that is part of Dynamo, simplifies deploying hierarchical, gang-scheduled, topology-aware AI workloads on Kubernetes. In Dynamo 1.0, Grove adds setup automation for the NVIDIA NVLink fabric on rack-scale systems such as NVIDIA GB300 NVL72. This allows users to define placement policies across every layer of infrastructure, from cloud regions and availability zones down to data centers, network blocks, racks, hosts, and even non-uniform memory access (NUMA) nodes.


Traditionally, using the NVIDIA GB300 NVL72 NVLink fabric required users to manually define and manage compute domains. This release introduces a unified topology API that enables developers to seamlessly colocate prefill and decode on the same NVIDIA NVL72 rack to optimize KV cache transfers, confine an inference stack to a single data center for latency needs, and place frontend services on nearby CPU-only nodes for efficient request handling. Grove integrates with advanced AI schedulers, like the KAI Scheduler, to ensure these constraints are enforced.
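The kinds of constraints such a topology API expresses can be sketched as data. The structure and field names below are purely illustrative (not the actual Grove schema); they only show the three placement decisions described above:

```python
# Hypothetical sketch of a topology-aware placement policy of the
# kind Grove's unified topology API expresses. All field names are
# illustrative, not the real Grove API.
placement_policy = {
    "constraints": [
        # Keep prefill and decode on the same NVLink-connected rack
        # so KV cache transfers stay on the fast fabric.
        {"colocate": ["prefill", "decode"], "level": "rack"},
        # Confine the whole inference stack to one data center
        # to bound request latency.
        {"confine": ["prefill", "decode", "frontend"], "level": "datacenter"},
        # The frontend needs no GPUs; prefer nearby CPU-only nodes.
        {"place": "frontend", "nodeSelector": {"gpu": "none"}},
    ],
}
print(len(placement_policy["constraints"]))  # -> 3
```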
Integration with the Kubernetes Inference Gateway
A previous Dynamo release introduced a plugin that allows users to combine the Kubernetes-native Inference Gateway extension's routing with Dynamo's KV-aware router.


In a typical Dynamo setup, routing is handled by Dynamo's KV-aware router. The router evaluates the queue depth and relevant KV cache information on each worker, then makes a probabilistic decision using a weighted combination of these factors.
Dynamo's KV-aware router can run inside the Inference Gateway to benefit from integration with routing plugins, filters, and other gateway capabilities in Kubernetes-based environments.
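A KV-aware routing decision of the kind described above can be sketched as follows. The specific weights and the softmax sampling are illustrative choices; Dynamo's actual scoring function may differ:

```python
import math
import random

# Sketch of a KV-aware routing decision: score each worker by a
# weighted combination of KV cache overlap (higher is better) and
# queue depth (lower is better), then sample a worker from a
# softmax over the scores. Weights are illustrative.
def route(workers, w_overlap=1.0, w_queue=0.5, temperature=1.0, rng=random):
    scores = [w_overlap * w["kv_overlap"] - w_queue * w["queue_depth"]
              for w in workers]
    m = max(scores)  # subtract max for numerical stability
    weights = [math.exp((s - m) / temperature) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Probabilistic pick, biased toward warm-cache, short-queue workers.
    return rng.choices(range(len(workers)), weights=probs, k=1)[0]

workers = [
    {"kv_overlap": 0.9, "queue_depth": 2},  # warm cache, short queue
    {"kv_overlap": 0.1, "queue_depth": 8},  # cold cache, long queue
]
picks = [route(workers) for _ in range(1000)]
print(picks.count(0) > picks.count(1))  # mostly routes to worker 0
```

Sampling rather than always picking the top-scoring worker avoids herding every request onto one replica when scores are close.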
Deploying fast, latency-aware inference with zero configuration
Deploying large models requires deep expertise to balance latency, throughput, and cost targets through complex scaling and configuration steps. Dynamo's new Dynamo Graph Deployment Request (DGDR) removes that friction by providing a simple, one-step path from service-level objectives (SLOs) to optimized inference deployments.
DGDR combines the intelligence of the planner and AIConfigurator into a unified, Kubernetes-native deployment flow. Instead of navigating multiple tools, scripts, and guesswork, developers can now specify a model, target hardware, and traffic goals in a YAML file (and soon through an intuitive web UI), and Dynamo handles the rest.
Behind the scenes, the AIConfigurator produces rapid, simulation-based recommendations for quick iteration, while the planner performs deeper on-cluster profiling for precise, production-grade optimization. Both routes deliver an auto-deployable Dynamo Graph Deployment (DGD) that meets the user's desired balance of cost, performance, and scalability, without hand-configuring a deployment.
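To give a feel for the shape of such a request, here is a hypothetical DGDR-style spec expressed as a Python dict: a model, target hardware, and SLO targets from which the deployment is derived. Every field name below is illustrative, not the actual DGDR schema:

```python
# Hypothetical sketch of a DGDR-style spec: the user states the model,
# hardware, and SLOs; Dynamo derives the deployment. Field names are
# illustrative, not the real DGDR schema.
dgdr = {
    "apiVersion": "nvidia.com/v1alpha1",
    "kind": "DynamoGraphDeploymentRequest",
    "spec": {
        "model": "deepseek-ai/DeepSeek-R1",
        "hardware": {"gpu": "H200", "maxGpus": 16},
        "slo": {
            "ttftMs": 300,        # time-to-first-token target
            "itlMs": 20,          # inter-token latency target
            "throughputRps": 50,  # sustained requests per second
        },
        # "fast" = simulation-based sizing; "profile" = on-cluster profiling
        "mode": "fast",
    },
}
print(dgdr["spec"]["slo"]["ttftMs"])  # -> 300
```

In YAML form this would be applied like any other Kubernetes custom resource, with the resulting DGD generated automatically.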
Increasing resiliency with fault detection and request migration
A key design principle in Dynamo is to be resilient by default so applications keep running even when individual workers fail or hang. The updated Dynamo fault tolerance combines two pillars:
- Early fault detection: Dynamo adds a framework-independent "canary health check" that probes workers on a configurable schedule. If these checks don't receive a valid response, the worker is marked unhealthy and removed from routing. Additionally, the Dynamo frontend performs active detection using network-level signals. If establishing a new stream to a worker fails, or an existing stream ends unexpectedly mid-request, that worker is immediately removed from the set of active workers (for about five seconds) so no new requests are sent to it.
- Request cancellation and migration: Request cancellation support is enabled out of the box, allowing in-flight work to be terminated when it no longer makes sense to continue. When a worker becomes unavailable, Dynamo can migrate affected requests to another worker and resume processing, preserving the request itself rather than forcing the client to resubmit from scratch. This ensures failures don't automatically translate into user-visible errors.
By combining layered health detection with cancellation and migration, Dynamo aims to keep LLM applications responsive even when individual workers fail.
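The two detection layers can be sketched in a toy router: a canary probe filters out workers that fail a health check, and a reported stream failure bans a worker for a short cooldown window (about five seconds in Dynamo). The `Router` class and probe callable are illustrative, not Dynamo's implementation:

```python
import time

# Toy sketch of layered fault detection: canary probes plus a
# short routing ban after a network-level stream failure.
class Router:
    def __init__(self, workers, cooldown_s=5.0):
        self.workers = workers
        self.cooldown_s = cooldown_s
        self.banned_until = {}  # worker -> monotonic timestamp

    def report_stream_failure(self, worker):
        # Active detection: a failed or aborted stream removes the
        # worker from routing for the cooldown window.
        self.banned_until[worker] = time.monotonic() + self.cooldown_s

    def healthy_workers(self, probe):
        now = time.monotonic()
        return [w for w in self.workers
                if probe(w)                                # canary health check
                and self.banned_until.get(w, 0.0) <= now]  # not cooling down

router = Router(["worker-a", "worker-b"])
router.report_stream_failure("worker-b")   # e.g. stream ended mid-request
print(router.healthy_workers(lambda w: True))  # -> ['worker-a']
```

Once the cooldown expires and the canary check passes again, the worker naturally rejoins the routing set with no manual intervention.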


Advancing KV caching to storage
In Dynamo 1.0, KV Block Manager (KVBM) introduces several features that enhance flexibility, visibility, and deployment options:
- Object storage support: KVBM now works with Amazon Simple Storage Service (Amazon S3) and the Azure-style blob APIs used by major storage vendors and cloud providers. This allows model operators to integrate KVBM with existing file systems, S3, or other cloud object stores without building separate KV offload pipelines for each backend.
- Global KV event emission: KVBM emits events whenever KV blocks move between storage tiers (GPU memory, CPU memory, local SSD, and remote storage) or are evicted. The KV router's indexer consumes these events to maintain a consistent, cluster-wide view of KV block locations, enabling smarter routing and improved cache reuse across multiple model replicas and inference engines.
- Pip-installable module: KVBM can now be installed directly into inference engines like vLLM or TensorRT LLM without requiring the full Dynamo stack. Teams using different inference frameworks can share a common KV offload tool rather than re-implementing eviction policies and storage integrations.
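The indexer pattern behind global KV event emission can be sketched in a few lines: consume store/evict events and keep a map from block hash to its current locations across workers and tiers. The event shape and tier names here are illustrative, not KVBM's wire format:

```python
# Minimal sketch of an indexer that consumes KV events to maintain
# a cluster-wide view of block locations. Event fields and tier
# names are illustrative placeholders.
class KVIndex:
    def __init__(self):
        self.locations = {}  # block hash -> {(worker, tier), ...}

    def apply(self, event):
        key = event["block"]
        loc = (event["worker"], event["tier"])
        if event["type"] == "stored":
            self.locations.setdefault(key, set()).add(loc)
        elif event["type"] == "evicted":
            self.locations.get(key, set()).discard(loc)

    def where(self, block):
        """Return all (worker, tier) locations currently holding the block."""
        return self.locations.get(block, set())

index = KVIndex()
index.apply({"type": "stored", "block": "b1", "worker": "w0", "tier": "gpu"})
index.apply({"type": "stored", "block": "b1", "worker": "w0", "tier": "cpu"})
index.apply({"type": "evicted", "block": "b1", "worker": "w0", "tier": "gpu"})
print(sorted(index.where("b1")))  # -> [('w0', 'cpu')]
```

A router consulting this index can prefer workers that still hold a request's prefix blocks, even after those blocks have been offloaded from GPU memory.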


Looking ahead
Looking forward, the Dynamo product roadmap will focus on expanding multimodal capabilities to support richer, more context-aware interactions; advancing diffusion-based models to unlock real-time, higher-quality video generation; and scaling agentic workloads and reinforcement learning. Dynamo is built in the open with the community. To get involved, explore the code and issues in the NVIDIA GitHub repository, drop into the biweekly Dynamo office hours, and dive into the existing technical blogs.
Acknowledgments
Akshatha Kamath, Anish Maddipoti, Anna Tchernych, Ben Hamm, Biswa Ranjan Panda, Dhruv Nandakumar, Ekin Karabulut, Ganesh Kudleppanavar, Hannah Simmons, Hannah Zhang, Harry Kim, Hongkuan Zhou, Hyunjae Woo, Ishan Dhanani, Itay Neeman, Jacky Hui, Jakub Kosek, John Kim, Kavin Krishnan, Kyle Kranen, Maksim Khadkevich, Michael Demoret, Moein Khazraee, Neal Vaidya, Neelay Shah, Qi Wang, Ryan McCormick, Sanjay Chatterjee, Schwinn Saereesitthipitak, Suman Tatiraju, Vikram Sharma Mailthody, Vishwanath Venkatesan, and many others contributed to this post.
