Streamline Complex AI Inference on Kubernetes with NVIDIA Grove



Over the past few years, AI inference has evolved from single-model, single-pod deployments into complex, multicomponent systems. A model deployment may now consist of several distinct components, such as prefill, decode, vision encoders, and key value (KV) routers. In addition, entire agentic pipelines are emerging, where multiple such model instances collaborate to perform reasoning, retrieval, or multimodal tasks.

This shift has changed the scaling and orchestration problem from "run N replicas of a pod" to "coordinate a group of components as one logical system." Managing such a system requires scaling and scheduling the appropriate pods together, recognizing that each component has distinct configuration and resource needs, starting them in a deliberate order, and placing them in the cluster with network topology in mind. Ultimately, the goal is to orchestrate a system and scale components with awareness of their dependencies as a whole, rather than one pod at a time.

To address these challenges, we're announcing today that NVIDIA Grove, a Kubernetes API for running modern ML inference workloads on Kubernetes clusters, is now available within NVIDIA Dynamo as a modular component. Grove is fully open source and available on the ai-dynamo/grove GitHub repo.

How NVIDIA Grove orchestrates inference as a whole

Grove allows you to scale your multinode inference deployment from a single replica to data center scale, supporting tens of thousands of GPUs. With Grove, you can describe your entire inference serving system in Kubernetes (for instance, prefill, decode, routing, or any other component) as a single Custom Resource (CR).

From that one spec, the platform coordinates hierarchical gang scheduling, topology‑aware placement, multilevel autoscaling, and explicit startup ordering. You get precise control of how the system behaves without stitching together scripts, YAML files, or custom controllers.

Originally motivated by the challenges of orchestrating multinode, disaggregated inference systems, Grove is flexible enough to map naturally to any real-world inference architecture—from traditional single-node aggregated inference to agentic pipelines with multiple models. Grove enables developers to define complex AI stacks in a concise, declarative, and framework-agnostic manner.

The key requirements of multinode disaggregated serving systems are detailed below.

Multilevel autoscaling for interdependent components

Modern inference systems need autoscaling at multiple levels: individual components (prefill workers for traffic spikes), related component groups (prefill leaders with their workers), and entire service replicas for overall capacity. These levels affect each other: scaling prefill workers may require more decode capacity, and new service replicas need proper component ratios. Traditional pod-level autoscaling can't handle these interdependencies.
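At the component-group level, one way this can surface is as a standard Kubernetes HorizontalPodAutoscaler that targets a Grove scaling group instead of a Deployment, so the whole group scales as a unit. The sketch below is illustrative only: the grove.io/v1alpha1 API version, the target name, and the presence of a scale subresource on PodCliqueScalingGroup are assumptions, so verify them against the CRDs installed in your cluster.

# Hedged sketch: autoscale one component group independently of the others.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prefill-autoscaler
spec:
  scaleTargetRef:
    apiVersion: grove.io/v1alpha1    # assumption; check your installed CRDs
    kind: PodCliqueScalingGroup
    name: my-workload-prefill        # hypothetical scaling group name
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70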

System-level lifecycle management with recovery and rolling updates

Recovery and updates must operate on complete service instances, not individual Kubernetes pods. A failed prefill worker must properly reconnect to its leader after a restart, and rolling updates must preserve network topology to maintain low latency. The platform must treat multicomponent systems as single operational units optimized for both performance and availability.

Flexible hierarchical gang scheduling

The AI workload scheduler should support flexible gang scheduling that goes beyond traditional all-or-nothing placement. Disaggregated serving creates a new challenge: the inference system needs to guarantee essential component combinations (at least one prefill and one decode worker, for example) while allowing independent scaling of each component type. The challenge is that prefill and decode components should scale at different ratios based on workload patterns.

Traditional gang scheduling prevents this independent scaling by forcing everything into groups that must scale together. The system needs policies that enforce minimum viable component combinations while enabling flexible scaling, as sketched below.
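As a hedged illustration, each worker clique can declare its own replica count and minimum availability, so the scheduler guarantees a viable floor while each component type scales independently above it. Field names follow Figure 1; treat the exact schema as an assumption and check the installed Grove CRDs.

# Illustrative fragment: prefill and decode scale at different ratios,
# but neither drops below its minimum viable count.
cliques:
  - name: prefill-worker
    replicas: 3          # scale with prefill-heavy traffic
    minAvailable: 1      # never schedule the system without at least one
  - name: decode-worker
    replicas: 6          # decode scales at its own ratio
    minAvailable: 1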

Topology-aware scheduling 

Component placement affects performance. On systems like NVIDIA GB200 NVL72, scheduling the related prefill and decode pods on the same NVIDIA NVLink domain optimizes KV-cache transfer latency. The scheduler must understand physical network topology, placing related components near each other while spreading replicas for availability.

Role‑aware orchestration and explicit startup ordering

Components have different jobs, configurations, and startup requirements. For example, prefill and decode leaders execute different startup logic than workers, and workers can't start before leaders are ready. The platform needs role-specific configuration and dependency enforcement for reliable initialization.

Taken together, this is the bigger picture: inference teams need a simple, declarative way to describe their system as it is actually operated (multiple roles, multiple nodes, clear multilevel dependencies) and have the platform schedule, scale, heal, and update to that description.

Grove primitives

High-performance inference frameworks use Grove's hierarchical APIs to express role-specific logic and multilevel scaling, enabling consistent, optimized deployment across diverse cluster environments. Grove achieves this by orchestrating multicomponent AI workloads using three hierarchical custom resources in its Workload API.

For the example shown in Figure 1, PodClique A represents a frontend component, B and C represent prefill-leader and prefill-worker, and D and E represent decode-leader and decode-worker.

PodClique specifies its own replica and minimum availability counts (for example, PodClique C with three replicas and two MinAvailable).
Figure 1. Key components of NVIDIA Grove include PodClique, ScalingGroup, and PodCliqueSet, and how they work together
  • PodCliques represent groups of Kubernetes pods with specific roles, such as prefill leader or worker, decode leader or worker, or a frontend service, each with independent configuration and scaling logic.
  • PodCliqueScalingGroups bundle tightly coupled PodCliques that must scale together, such as the prefill leader and prefill workers that together represent one model instance.
  • PodCliqueSets define the entire multicomponent workload, specifying startup ordering, scaling policies, and gang-scheduling constraints that ensure all components start together or fail together. When scaling for additional capacity, Grove creates complete replicas of the entire PodGangSet and defines spread constraints that distribute these replicas across the cluster for high availability, while keeping each replica's components network-packed for optimal performance. A minimal sketch of how these primitives compose follows this list.
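The sketch below is illustrative only: the apiVersion, the field names (cliques, scalingGroups, startsAfter), and all object names are assumptions chosen to show the shape of the hierarchy rather than the exact Grove schema. Refer to the ai-dynamo/grove repo for authoritative examples.

# Hedged sketch of a PodCliqueSet: a frontend clique plus a prefill
# scaling group (leader and workers). Schema details are assumptions.
apiVersion: grove.io/v1alpha1
kind: PodCliqueSet
metadata:
  name: my-inference-stack
spec:
  cliques:
    - name: frontend              # routing/API role
      replicas: 1
      podSpec: {}                 # pod template omitted for brevity
    - name: prefill-leader
      replicas: 1
      podSpec: {}
    - name: prefill-worker
      replicas: 3
      minAvailable: 2             # scheduling floor, as in Figure 1
      startsAfter:                # hypothetical startup-ordering field
        - prefill-leader
      podSpec: {}
  scalingGroups:
    - name: prefill               # leader and workers scale as one unit
      cliques:
        - prefill-leader
        - prefill-worker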
Diagram showing the Grove workflow. A user defines a PodCliqueSet, which is processed by the Grove Operator. The operator creates and manages resources such as PodCliques, ScalingGroups (Prefill and Decode), Secrets, HPAs, and Services. These resources are combined into a PodGang that represents a schedulable unit of work. The PodGang is then passed to an Advanced AI Scheduler (KAI scheduler, for example). The right side of the figure depicts multiple pods running on GPU-enabled nodes.
Figure 2. Grove workflow

A Grove-enabled Kubernetes cluster brings two key components together: the Grove operator and a scheduler capable of understanding PodGang resources, such as the KAI Scheduler, an open source subcomponent of the NVIDIA Run:ai platform.

When a PodCliqueSet resource is created, the Grove operator validates the specification and automatically generates the underlying Kubernetes objects required to realize it. This includes the constituent PodCliques, PodCliqueScalingGroups, and the associated pods, services, secrets, and autoscaling policies. As part of this process, Grove also creates PodGang resources, which are part of the Scheduler API and translate workload definitions into concrete scheduling constraints for the cluster's scheduler.

Each PodGang encapsulates detailed requirements for its workload, including minimum replica guarantees, network topology preferences to optimize inter-component bandwidth, and spread constraints to maintain availability. Together, these ensure topology-aware placement and efficient resource utilization across the cluster.

The scheduler continuously watches for PodGang resources and applies gang-scheduling logic, ensuring that all required components are scheduled together, or not at all until resources are available. Placement decisions are made with GPU topology awareness and cluster locality in mind.

The result is a coordinated deployment of multicomponent AI systems, where prefill services, decode workers, and routing components start in the proper order, are placed close together in the network for performance, and recover cohesively as a group. This prevents resource fragmentation, avoids partial deployments, and enables stable, efficient operation of complex model-serving pipelines at scale.

How to get started with Grove using Dynamo

This section walks you through deploying a disaggregated serving architecture with a KV-routing deployer using Dynamo and Grove. The setup uses the Qwen3 0.6B model and demonstrates Grove's ability to manage distributed inference workloads with separate prefill and decode workers.

Note: This is a foundational example designed to help you understand the core concepts. For more complex deployments, refer to the ai-dynamo/grove GitHub repo.

Prerequisites

First, make sure you have the following components ready in your Kubernetes cluster:

  • Kubernetes cluster with GPU support
  • kubectl configured to access your cluster
  • Helm CLI installed
  • Hugging Face token secret (referenced as hf-token-secret), which can be created with the following command:
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=<YOUR_HF_TOKEN>

Note: In the code, replace <YOUR_HF_TOKEN> with your actual Hugging Face token. Keep this token secure and do not commit it to source control.

Step 1: Create a namespace

kubectl create namespace vllm-v1-disagg-router

Step 2: Install Dynamo CRDs and Dynamo Operator with Grove

# 1. Set environment
export NAMESPACE=vllm-v1-disagg-router
export RELEASE_VERSION=0.5.1

# 2. Install CRDs
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default

# 3. Install Dynamo Operator + Grove
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace --set "grove.enabled=true"

Step 3: Confirm Grove installation

kubectl get crd | grep grove

Expected output:

podcliques.grove.io                             
podcliquescalinggroups.grove.io                  
podcliquesets.grove.io                           
podgangs.scheduler.grove.io                      
podgangsets.grove.io 

Step 4: Create the DynamoGraphDeployment configuration

Create a DynamoGraphDeployment manifest that defines a disaggregated serving architecture with one frontend, two decode workers, and one prefill worker:

apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: dynamo-grove
spec:
  services:
    Frontend:
      dynamoNamespace: vllm-v1-disagg-router
      componentType: frontend
      replicas: 1
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv
    VllmDecodeWorker:
      dynamoNamespace: vllm-v1-disagg-router
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 2
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
          workingDir: /workspace/components/backends/vllm
          command:
          - python3
          - -m
          - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
    VllmPrefillWorker:
      dynamoNamespace: vllm-v1-disagg-router
      envFromSecret: hf-token-secret
      componentType: worker
      replicas: 1
      resources:
        limits:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
          workingDir: /workspace/components/backends/vllm
          command:
          - python3
          - -m
          - dynamo.vllm
          args:
            - --model
            - Qwen/Qwen3-0.6B
            - --is-prefill-worker

Step 5: Deploy the configuration

kubectl apply -f dynamo-grove.yaml -n ${NAMESPACE}

Step 6: Confirm the deployment 

Confirm that the operator and Grove pods were created:

kubectl get pods -n ${NAMESPACE}

Expected output:

NAME                                                              READY   STATUS    RESTARTS      AGE
dynamo-grove-0-frontend-w2xxl                                     1/1     Running     0           10m
dynamo-grove-0-vllmdecodeworker-57ghl                             1/1     Running     0           10m
dynamo-grove-0-vllmdecodeworker-drgv4                             1/1     Running     0           10m
dynamo-grove-0-vllmprefillworker-27hhn                            1/1     Running     0           10m
dynamo-platform-dynamo-operator-controller-manager-7774744kckrr   2/2     Running     0           10m
dynamo-platform-etcd-0                                            1/1     Running     0           10m
dynamo-platform-nats-0                                            2/2     Running     0           10m

Step 7: Test the deployment

First, port-forward the frontend:

kubectl port-forward svc/dynamo-grove-frontend 8000:8000 -n ${NAMESPACE}

Then test the endpoint:

curl http://localhost:8000/v1/models
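You can also send a sample request through the frontend. This assumes the Dynamo frontend exposes an OpenAI-compatible chat completions endpoint at /v1/chat/completions; adjust the path if your deployment differs:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "What is disaggregated serving?"}],
        "max_tokens": 64
      }'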

Optionally, you can inspect the PodClique resource to see how Grove groups pods together, including replica counts:

kubectl get podclique dynamo-grove-0-vllmdecodeworker -n vllm-v1-disagg-router -o yaml
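Similarly, you can list the PodGang resources that Grove hands to the scheduler; the resource name matches the podgangs.scheduler.grove.io CRD verified in Step 3:

kubectl get podgangs.scheduler.grove.io -n vllm-v1-disagg-router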

Ready for more? 

NVIDIA Grove is fully open source and available on the ai-dynamo/grove GitHub repo. We invite you to try Grove in your own Kubernetes environments: with Dynamo, as a standalone component, or alongside high-performance AI inference engines.

Explore the Grove Deployment Guide and ask questions on GitHub or Discord. To see Grove in action, visit NVIDIA Booth #753 at KubeCon 2025 in Atlanta. We welcome contributions, pull requests, and feedback from the community.

Acknowledgments

The NVIDIA Grove project acknowledges the valuable contributions of all open source developers, testers, and community members who have participated in its evolution, with special thanks to SAP (Madhav Bhargava, Saketh Kalaga, Frank Heine) for their exceptional contributions and support. Open source thrives on collaboration. Thank you for being a part of Grove.


