Over the past few years, AI inference has evolved from single-model, single-pod deployments into complex, multicomponent systems. A model deployment may now consist of several distinct components, such as prefill, decode, vision encoders, and key value (KV) routers. In addition, entire agentic pipelines are emerging, where multiple such model instances collaborate to perform reasoning, retrieval, or multimodal tasks.
This shift has changed the scaling and orchestration problem from “run N replicas of a pod” to “coordinate a group of components as one logical system.” Managing such a system requires scaling and scheduling the appropriate pods together, recognizing that each component has distinct configuration and resource needs, starting them in a deliberate order, and placing them within the cluster with network topology in mind. Ultimately, the goal is to orchestrate the system and scale its components with awareness of their dependencies as a whole, rather than one pod at a time.
To address these challenges, today we’re announcing that NVIDIA Grove, a Kubernetes API for running modern ML inference workloads on Kubernetes clusters, is now available within NVIDIA Dynamo as a modular component. Grove is fully open source and available on the ai-dynamo/grove GitHub repo.
How NVIDIA Grove orchestrates inference as a whole
Grove allows you to scale your multinode inference deployment from a single replica to data center scale, supporting tens of thousands of GPUs. With Grove, you can describe your entire inference serving system in Kubernetes (for example, prefill, decode, routing, or any other component) as a single Custom Resource (CR).
From that one spec, the platform coordinates hierarchical gang scheduling, topology‑aware placement, multilevel autoscaling, and explicit startup ordering. You get precise control of how the system behaves without stitching together scripts, YAML files, or custom controllers.
Originally motivated by the challenges of orchestrating multinode, disaggregated inference systems, Grove is flexible enough to map naturally to any real-world inference architecture—from traditional single-node aggregated inference to agentic pipelines with multiple models. Grove enables developers to define complex AI stacks in a concise, declarative, and framework-agnostic manner.
The key requirements for multinode disaggregated serving are detailed below.
Multilevel autoscaling for interdependent components
Modern inference systems need autoscaling at multiple levels: individual components (prefill workers for traffic spikes), related component groups (prefill leaders with their workers), and whole service replicas for overall capacity. These levels affect each other: scaling prefill workers may require more decode capacity, and new service replicas need proper component ratios. Traditional pod-level autoscaling can’t handle these interdependencies.
System-level lifecycle management with recovery and rolling updates
Recovery and updates must operate on complete service instances, not individual Kubernetes pods. A failed prefill worker must properly reconnect to its leader after a restart, and rolling updates must preserve network topology to maintain low latency. The platform must treat multicomponent systems as single operational units optimized for both performance and availability.
Flexible hierarchical gang scheduling
The AI workload scheduler should support flexible gang scheduling that goes beyond traditional all-or-nothing placement. Disaggregated serving creates a new challenge: the inference system needs to guarantee essential component combinations (at least one prefill and one decode worker, for example) while allowing independent scaling of each component type. The challenge is that prefill and decode components should scale at different ratios based on workload patterns.
Traditional gang scheduling prevents this independent scaling by forcing everything into groups that must scale together. The system needs policies that enforce minimum viable component combinations while enabling flexible scaling.
Topology-aware scheduling
Component placement affects performance. On systems like NVIDIA GB200 NVL72, scheduling the related prefill and decode pods on the same NVIDIA NVLink domain optimizes KV-cache transfer latency. The scheduler must understand physical network topology, placing related components near each other while spreading replicas for availability.
Role‑aware orchestration and explicit startup ordering
Components have different jobs, configurations, and startup requirements. For instance, prefill and decode leaders execute different startup logic than workers, and workers can’t start before leaders are ready. The platform needs role-specific configuration and dependency enforcement for reliable initialization.
Put together, this is the bigger picture: inference teams need a simple and declarative way to describe their system as it is actually operated (multiple roles, multiple nodes, clear multilevel dependencies) and have the platform schedule, scale, heal, and update to that description.
Grove primitives
High-performance inference frameworks use Grove hierarchical APIs to express role-specific logic and multilevel scaling, enabling consistent, optimized deployment across diverse cluster environments. Grove achieves this by orchestrating multicomponent AI workloads using three hierarchical custom resources in its Workload API.
In the example shown in Figure 1, PodClique A represents a frontend component, B and C represent prefill-leader and prefill-worker, and D and E represent decode-leader and decode-worker.


- PodCliques represent groups of Kubernetes pods with specific roles, such as prefill leader or worker, decode leader or worker, or a frontend service, each with independent configuration and scaling logic.
- PodCliqueScalingGroups bundle tightly coupled PodCliques that must scale together, such as the prefill leader and prefill workers that together represent one model instance.
- PodCliqueSets define the entire multicomponent workload, specifying startup ordering, scaling policies, and gang-scheduling constraints that ensure all components start together or fail together. When scaling for additional capacity, Grove creates complete replicas of the entire PodGangSet and defines spread constraints that distribute these replicas across the cluster for high availability, while keeping each replica’s components network-packed for optimal performance. The commands after this list show one way to inspect these objects in a running cluster.
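To make the hierarchy concrete, you can list these objects directly once Grove is installed and a workload is deployed. This is a minimal sketch: the resource names come from the Grove CRDs (shown in Step 3 later in this post), and <namespace> is a placeholder for your deployment’s namespace.

# List the Grove Workload API objects backing a deployment
kubectl get podcliquesets.grove.io -n <namespace>
kubectl get podcliquescalinggroups.grove.io -n <namespace>
kubectl get podcliques.grove.io -n <namespace>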


A Grove-enabled Kubernetes cluster brings two key components together: the Grove operator and a scheduler capable of understanding PodGang resources, such as the KAI Scheduler, an open source subcomponent of the NVIDIA Run:ai platform.
When a PodCliqueSet resource is created, the Grove operator validates the specification and automatically generates the underlying Kubernetes objects required to realize it. This includes the constituent PodCliques, PodCliqueScalingGroups, and the associated pods, services, secrets, and autoscaling policies. As part of this process, Grove also creates PodGang resources, which are part of the Scheduler API and translate workload definitions into concrete scheduling constraints for the cluster’s scheduler.
Each PodGang encapsulates detailed requirements for its workload, including minimum replica guarantees, network topology preferences to optimize inter-component bandwidth, and spread constraints to maintain availability. Together, these ensure topology-aware placement and efficient resource utilization across the cluster.
The scheduler continuously watches for PodGang resources and applies gang-scheduling logic, ensuring that all required components are scheduled together, or not at all until resources are available. Placement decisions are made with GPU topology awareness and cluster locality in mind.
The result is a coordinated deployment of multicomponent AI systems, where prefill services, decode workers, and routing components start in the correct order, are placed close together in the network for performance, and recover cohesively as a group. This prevents resource fragmentation, avoids partial deployments, and enables stable, efficient operation of complex model-serving pipelines at scale.
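If you want to see the constraints the scheduler receives, you can inspect the generated PodGang objects directly. This is an illustrative check, assuming the podgangs.scheduler.grove.io CRD listed in Step 3 below; <namespace> and <podgang-name> are placeholders.

# List the PodGang resources generated by the Grove operator
kubectl get podgangs.scheduler.grove.io -n <namespace>
# Inspect one PodGang to see gang membership, topology preferences, and spread constraints
kubectl get podgangs.scheduler.grove.io <podgang-name> -n <namespace> -o yaml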
How to get started with Grove using Dynamo
This section walks you through deploying a disaggregated serving architecture with KV routing using Dynamo and Grove. The setup uses the Qwen3 0.6B model and demonstrates the ability of Grove to manage distributed inference workloads with separate prefill and decode workers.
Note: This is a foundational example designed to help you understand the core concepts. For more complex deployments, refer to the ai-dynamo/grove GitHub repo.
Prerequisites
First, ensure that you have the following components ready in your Kubernetes cluster:
- Kubernetes cluster with GPU support
- kubectl configured to access your cluster
- Helm CLI installed
- Hugging Face token secret (referenced as hf-token-secret), which can be created with the following command:
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=<HF_TOKEN>
Note: In the command, replace <HF_TOKEN> with your actual Hugging Face token. Keep this token secure and never commit it to source control.
Step 1: Create a namespace
kubectl create namespace vllm-v1-disagg-router
Step 2: Install Dynamo CRDs and Dynamo Operator with Grove
# 1. Set environment
export NAMESPACE=vllm-v1-disagg-router
export RELEASE_VERSION=0.5.1
# 2. Install CRDs
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
# 3. Install Dynamo Operator + Grove
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace --set "grove.enabled=true"
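As a quick sanity check, you can confirm that both Helm releases deployed successfully before continuing. This is a minimal sketch; note that dynamo-crds was installed into the default namespace in the commands above.

# Verify the Helm releases
helm list --namespace default        # should list dynamo-crds
helm list --namespace ${NAMESPACE}   # should list dynamo-platform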
Step 3: Confirm Grove installation
kubectl get crd | grep grove
Expected output:
podcliques.grove.io
podcliquescalinggroups.grove.io
podcliquesets.grove.io
podgangs.scheduler.grove.io
podgangsets.grove.io
Step 4: Create the DynamoGraphDeployment configuration
Create a DynamoGraphDeployment manifest that defines a disaggregated serving architecture with one frontend, two decode workers, and one prefill worker, and save it as dynamo-grove.yaml:
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: dynamo-grove
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-v1-disagg-router
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv
    VllmDecodeWorker:
      dynamoNamespace: vllm-v1-disagg-router
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 2
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
          workingDir: /workspace/components/backends/vllm
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
    VllmPrefillWorker:
      dynamoNamespace: vllm-v1-disagg-router
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
          workingDir: /workspace/components/backends/vllm
          command:
            - python3
            - -m
            - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --is-prefill-worker
Step 5: Deploy the configuration
kubectl apply -f dynamo-grove.yaml -n ${NAMESPACE}
Step 6: Confirm the deployment
Confirm that the operator and Grove pods were created:
kubectl get pods -n ${NAMESPACE}
Expected output:
NAME READY STATUS RESTARTS AGE
dynamo-grove-0-frontend-w2xxl 1/1 Running 0 10m
dynamo-grove-0-vllmdecodeworker-57ghl 1/1 Running 0 10m
dynamo-grove-0-vllmdecodeworker-drgv4 1/1 Running 0 10m
dynamo-grove-0-vllmprefillworker-27hhn 1/1 Running 0 10m
dynamo-platform-dynamo-operator-controller-manager-7774744kckrr 2/2 Running 0 10m
dynamo-platform-etcd-0 1/1 Running 0 10m
dynamo-platform-nats-0 2/2 Running 0 10m
Step 7: Test the deployment
First, port-forward the frontend:
kubectl port-forward svc/dynamo-grove-frontend 8000:8000 -n ${NAMESPACE}
Then test the endpoint:
curl http://localhost:8000/v1/models
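Because the frontend exposes an OpenAI-compatible API, you can also send a test chat completion to the deployed model. This is an illustrative request assuming the standard /v1/chat/completions route; adjust the prompt and max_tokens as needed.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Briefly explain disaggregated inference."}],
    "max_tokens": 64
  }'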
Optionally, you can inspect the PodClique resource to see how Grove groups pods together, including replica counts:
kubectl get podclique dynamo-grove-0-vllmdecodeworker -n vllm-v1-disagg-router -o yaml
Ready for more?
NVIDIA Grove is fully open source and available on the ai-dynamo/grove GitHub repo. We invite you to try Grove in your own Kubernetes environments, whether with Dynamo, as a standalone component, or alongside other high-performance AI inference engines.
Explore the Grove Deployment Guide and ask questions on GitHub or Discord. To see Grove in action, visit NVIDIA Booth #753 at KubeCon 2025 in Atlanta. We welcome contributions, pull requests, and feedback from the community.
To learn more, check out these additional resources:
Acknowledgments
The NVIDIA Grove project acknowledges the valuable contributions of all the open source developers, testers, and community members who have participated in its evolution, with special thanks to SAP (Madhav Bhargava, Saketh Kalaga, Frank Heine) for their exceptional contributions and support. Open source thrives on collaboration. Thank you for being a part of Grove.
