The exponential growth in large language model complexity has created challenges such as models too large for single GPUs, workloads that demand high throughput and low latency, and infrastructure that must coordinate thousands of interconnected components seamlessly. The NVIDIA Run:ai v2.23 release addresses these challenges through an integration with NVIDIA Dynamo—a high-throughput, low-latency inference framework designed for serving generative AI models across distributed environments.
In this blog, we’ll cover:
- The scaling problem of today’s workloads, which require multi-node inference with multiple components, and the coordination challenges that come with it.
- How Dynamo accelerates inference, why scheduling matters, and the role of orchestration in making workloads efficient at scale.
- How the NVIDIA Run:ai v2.23 Dynamo integration provides gang scheduling and topology-aware placement for predictable, low-latency deployments.
- How to get started, with a step-by-step guide for setting up network topology and deploying Dynamo on NVIDIA Run:ai with these capabilities enabled.
The scaling problem
As model parameters and the number of distributed components (e.g., prefill and decode workers, router, etc.) increase, memory requirements and computational demands grow significantly. This forces us to distribute model layers and the KV cache across multiple GPUs, and increasingly, across multiple nodes. While techniques like tensor parallelism solve the capacity challenge, they introduce a coordination challenge: how do you make dozens of distributed components work together as seamlessly as a single accelerator? The answer lies in advanced inference frameworks that can manage this complexity transparently.
How Dynamo accelerates inference
NVIDIA Dynamo was purpose-built to tackle distributed inference challenges through features including:
- Disaggregated prefill and decode inference that maximizes GPU throughput and enables trade-offs between latency and throughput.
- Dynamic GPU scheduling that adapts to fluctuating demand.
- LLM-aware request routing to avoid unnecessary KV cache re-computation.
- Accelerated data transfer that uses the NVIDIA Inference Xfer Library (NIXL) to reduce inference response times.
- KV cache offloading that uses multiple memory hierarchies for higher throughput.
These capabilities ensure that even the largest models can run efficiently across distributed GPU clusters, but only if the underlying orchestration doesn’t get in the way.
Why scheduling matters: running Dynamo workloads efficiently at scale
Running multi-node inference in clusters comes with its own challenges. Dynamo workloads involve tightly coupled components like routers, prefill, and decode. Scheduling these independently can result in partial deployments, such as decode pods running while prefill pods remain pending, leading to idle GPUs.
Even with all components active, poor placement hurts performance. Leaders and workers spread across distant nodes add latency and reduce throughput due to cross-rack communication and bandwidth bottlenecks. Addressing these orchestration issues is key to complementing Dynamo’s runtime efficiency across the cluster. This is where NVIDIA Run:ai’s advanced scheduling capabilities become essential.
NVIDIA Run:ai meets Dynamo
Addressing orchestration challenges requires more than just starting pods. It requires starting the right pods together and placing them in the right locations. This is precisely what NVIDIA Run:ai brings to Dynamo with two key capabilities: gang scheduling to launch components atomically, and topology-aware placement to co-locate them for low-latency communication.
Gang scheduling: all-or-nothing deployment
Dynamo workloads now use NVIDIA Run:ai’s gang scheduling capabilities, treating groups of interdependent pods as a single deployment unit. This atomic scheduling approach ensures that either all required components (prefill workers and leaders, decode workers and leaders) are placed simultaneously, or the deployment waits until sufficient resources are available.
By eliminating partial deployment scenarios, higher cluster utilization emerges naturally as resource fragmentation disappears. Partially deployed workloads no longer consume cluster resources while waiting indefinitely for missing components. Cold start lag is also reduced: when resources become available, entire workloads launch atomically rather than spinning up incrementally, shortening time-to-service.
The result is predictable, efficient placement for multi-node inference workloads with no additional configuration required; the scheduler manages this coordination automatically.
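The all-or-nothing behavior is easy to observe: while a workload waits for capacity, none of its pods start running. A quick way to check, assuming the Run:ai/KAI scheduler’s PodGroup custom resources are exposed in your cluster (resource names can vary by version):
# Pods belonging to the gang stay Pending until the entire group can be placed
kubectl -n runai-project-a get pods
# If PodGroup resources are available, they show the gang's aggregate scheduling status
kubectl -n runai-project-a get podgroups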
Topology-aware scheduling: reducing latency
The integration includes topology-aware scheduling, which is especially helpful for multi-node deployments. Administrators can define a cluster’s topology, enabling the scheduler to make strategic component placement decisions. Interdependent components (such as prefill and decode roles) are positioned to minimize cross-node latency while maximizing the use of high-speed interconnects.
This topology awareness becomes critical at scale for multi-node deployments, where network communication can easily become the bottleneck. The result is improved communication throughput between components and reduced network overhead, delivering lower latency and better performance for large-scale distributed workloads.
How to get started with NVIDIA Run:ai v2.23 and Dynamo
Ensure you have the following before continuing:
- A Kubernetes cluster with NVIDIA Run:ai v2.23 installed and a project named runai-project-a initialized (see the documentation).
- Access to the kubeconfig file.
- Helm installed.
- A Hugging Face access token for pulling models, stored as a Kubernetes secret:
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN='<your_hf_token>' \
  -n runai-project-a
Note: Replace <your_hf_token> with your actual Hugging Face token. Keep this token secure and never commit it to source control.
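Optionally, confirm that the secret exists in the project namespace:
# The secret should be listed under runai-project-a
kubectl -n runai-project-a get secret hf-token-secret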
Establishing network topology
To co-locate tightly coupled Dynamo components and cut cross-node latency, configure a network topology in NVIDIA Run:ai that represents your cluster’s physical layout. Start by ensuring your Kubernetes nodes are labeled with proximity indicators such as topology.kubernetes.io/region: us-west, topology.kubernetes.io/zone: us-west-1a, etc.
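If your nodes don’t already carry these labels (many cloud providers set them automatically), you can apply and verify them yourself; the node name below is a placeholder:
# Label a node with its region and zone (replace node-a with an actual node name)
kubectl label node node-a topology.kubernetes.io/region=us-west topology.kubernetes.io/zone=us-west-1a
# Verify the labels across all nodes
kubectl get nodes -L topology.kubernetes.io/region,topology.kubernetes.io/zone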
Next, specify in NVIDIA Run:ai which label keys define proximity. In the NVIDIA Run:ai user interface, open the cluster’s settings and add the label keys you use (for example, topology.kubernetes.io/zone, topology.kubernetes.io/region).
Create a topology by ordering these keys from closest to farthest. Make sure the label values you use in the network topology setup (e.g., us-west-1a) exactly match what you applied to the nodes:


Then, attach this network topology to the relevant node pool(s) from the node pools view. Different pools can carry different topologies if your hardware or network fabrics vary by pool.
From this point on, scheduling is automatic. NVIDIA Run:ai applies a “preferred” soft constraint at the closest tier first and only relaxes to broader tiers if the cluster can’t place the whole gang together at that level. Combined with gang scheduling, this ensures your Dynamo pods land together on the best-available nodes (for example, nodes in the same rack) or wait until they can, eliminating partial, inefficient deployments. For more information, refer to the official documentation page.


Dynamo in action
Once the network topology is configured in the NVIDIA Run:ai user interface, Dynamo workloads automatically use gang scheduling and topology-aware scheduling. This ensures tightly coupled components (e.g., decode, router) launch together or wait as a group, while the scheduler co-locates them on the closest tier (e.g., the same zone or rack) to reduce latency. Users can specify preferred or required placement strategies by annotating their workloads.
Step 1. Set environment variables
# Define the required environment variables
export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
export NAMESPACE=dynamo-cloud
export RELEASE_VERSION=0.5.1
Step 2. Create a Kubernetes namespace
# Create a dedicated namespace for the deployment
kubectl create namespace $NAMESPACE
Step 3. Install the custom resource definitions (CRDs) and platform components
# CRDs
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-$RELEASE_VERSION.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace dynamo-cloud
# Platform Components
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-$RELEASE_VERSION.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --set dynamo-operator.namespaceRestriction.enabled=false
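Optionally, verify that both Helm releases deployed successfully before checking the pods:
# Both dynamo-crds and dynamo-platform should report a deployed status
helm list --namespace ${NAMESPACE}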
Step 4. Confirm pod status
# Ensure that all components are running
kubectl -n $NAMESPACE get pods
Step 5. Deploy the vLLM disaggregated example
Download the example YAML (disagg.yaml) from the Dynamo repository. Set metadata.namespace to runai-project-a and add the following annotations:
metadata:
  namespace: runai-project-a
  annotations:
    kai.scheduler/topology-preferred-placement: "topology.kubernetes.io/zone"
    kai.scheduler/topology: "topology-1"
    # kai.scheduler/topology-required-placement: "topology.kubernetes.io/zone" -> if the pods must be in the same zone, use the required placement annotation instead of the preferred one
Apply the YAML:
kubectl apply -f disagg.yaml
As pods start, you’ll see the operator, control plane, and all components running, with decode and prefill pods scheduled in the same zone based on the topology.
NAME                                             READY   STATUS    RESTARTS   AGE
vllm-disagg-frontend-79f459c95-57fm6             1/1     Running   0          30m
vllm-disagg-vllmdecodeworker-6c8d64f569-56phf    1/1     Running   0          30m
vllm-disagg-vllmprefillworker-755cb88fcf-pflb5   1/1     Running   0          30m
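To confirm the topology-aware placement, compare the node each pod landed on with that node’s zone label:
# Show which node each pod was scheduled on
kubectl -n runai-project-a get pods -o wide
# Show the zone label of each node to confirm co-location
kubectl get nodes -L topology.kubernetes.io/zone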
Step 6. Send a request to the deployed model
To test the deployment locally, port-forward the frontend:
kubectl -n runai-project-a port-forward pod/vllm-disagg-frontend-79f459c95-57fm6 8000:8000
Send a sample request using curl:
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {
        "role": "user",
        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
      }
    ],
    "stream": false,
    "max_tokens": 30
  }'
A successful response returns a JSON with a generated completion:
{"id":"chatcmpl-559682f7-8845-4014-b670-47a5f32f07c6","selections":[{"index":0,"message":{"content":"nOkay, I need to develop a detailed character background for the explorer in Eldoria. Let me start by understanding the user's query.","role":"assistant","reasoning_content":null},"finish_reason":"stop"}],"created":1758043876,"model":"Qwen/Qwen3-0.6B","object":"chat.completion","usage":{"prompt_tokens":196,"completion_tokens":29,"total_tokens":225}}%
The deployment uses NVIDIA Run:ai’s gang scheduling and topology-aware placement to start pods together, minimize latency, and maximize GPU utilization by avoiding idle resources.
Wrapping up
Large-scale LLM inference succeeds when a high-performance inference framework is paired with a scheduler that knows how to place and start it. NVIDIA Dynamo delivers the former with disaggregated prefill and decode, LLM-aware routing, and efficient KV cache management. NVIDIA Run:ai v2.23 contributes the latter with gang scheduling and topology-aware placement.
Together, they make multi-node inference predictable and performant: pods launch atomically, components stay close on fast links, and GPUs remain busy. The result is higher throughput, lower latency, and better utilization across Kubernetes clusters, scaling reliably and maximizing infrastructure return.
Looking for effective ways to overcome the challenges of scaling AI workloads? Join our upcoming webinar for expert insights and practical solutions.
Get started with NVIDIA Run:ai and Dynamo using the following resources:
