Deploying Disaggregated LLM Inference Workloads on Kubernetes

As large language model (LLM) inference workloads grow in complexity, a single monolithic serving process starts to hit its limits. Prefill and decode stages have fundamentally different compute profiles, yet traditional deployments force them onto the same hardware, leaving GPUs underutilized and scaling inflexible.

Disaggregated serving addresses this by splitting the inference pipeline into distinct stages such as prefill, decode, and routing, each running as an independent service that can be resourced and scaled on its own terms.

This post gives an overview of how disaggregated inference is deployed on Kubernetes, explores different ecosystem solutions and how they execute on a cluster, and evaluates what they provide out of the box.

How do aggregated and disaggregated inference differ?

Before diving into Kubernetes manifests, it helps to know the two inference deployment modes for LLMs: In aggregated serving, a single process (or tightly coupled group of processes) handles the entire inference lifecycle from input to output. Disaggregated serving splits the pipeline into distinct stages such as prefill, decode, and routing, each running as an independent service (see Figure 1, below).

Aggregated inference

In a standard aggregated setup, a single model server (or coordinated group of servers in a parallel configuration) handles the full request lifecycle. A user prompt comes in, the server tokenizes it, runs prefill to build context, generates output tokens autoregressively (decode), and returns the response. Everything happens in a single process or tightly coupled pod group.

This is conceptually simple and works well for many use cases. But it means your hardware alternates between two fundamentally different workloads: Prefill is compute-intensive and benefits from high floating point operations per second (FLOPS), while decode is memory-bandwidth-bound and benefits from large, fast memory.

Disaggregated inference

Disaggregated architectures separate these stages into distinct services:

  • Prefill workers process the input prompt. This is compute-heavy; you want to maximize GPU compute for high throughput and can parallelize aggressively.
  • Decode workers generate output tokens one by one. This is memory-bandwidth-bound due to the autoregressive nature of LLMs; you want GPUs with fast high-bandwidth memory (HBM) access.
  • Router/gateway directs incoming requests, manages Key-Value (KV) cache routing between prefill and decode stages, and handles load balancing of requests across your workers.

Why disaggregate? Three reasons stand out:

  1. Different resource and optimization profiles per stage: With disaggregation, you can match GPU resources, model sharding techniques, and batch sizes to each stage’s needs rather than compromising on a single approach.
  2. Independent scaling: Prefill and decode traffic patterns differ. A long-context prompt creates a large prefill burst but a steady decode stream. Scaling each stage independently lets you respond to actual demand.
  3. Higher GPU utilization: Separating stages lets each saturate its target resource (compute for prefill, memory bandwidth for decode) rather than alternating between the two.

Frameworks like NVIDIA Dynamo and llm-d implement this pattern. The question becomes: How do you orchestrate it on Kubernetes?

Why scheduling is key to multi-pod inference performance on Kubernetes

Deploying a multi-pod inference workload (either model-parallel aggregated models or disaggregated models) is only half the story. How the scheduler places pods across the cluster directly impacts performance; placing a Tensor Parallel (TP) group’s pods on the same rack with high-bandwidth NVIDIA NVLink interconnects can mean the difference between fast inference and a network bottleneck. Three scheduling capabilities matter most here:

  • Gang scheduling ensures all pods in a group are placed with all-or-nothing semantics, preventing partial deployments that waste GPUs.
  • Hierarchical gang scheduling extends basic gang scheduling to multi-level workloads. In disaggregated inference, you need nested minimum guarantees per component or role: each Tensor Parallel group (e.g., 4 pods forming one decode instance) must be scheduled atomically, and the full system (at least n prefill instances + at least m decode instances + router) also needs system-level coordination. Without this, one role can consume all available GPUs while the other waits indefinitely: a partial deployment that holds resources but can’t serve requests.
  • Topology-aware placement colocates tightly coupled pods on nodes with high-bandwidth interconnects, minimizing inter-node communication latency.
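As a rough illustration of how a pod opts into such a scheduler, the manifests later in this post set schedulerName: kai-scheduler on the pod spec. A minimal sketch (the queue label key and value follow KAI’s quickstart conventions, but the queue name and image here are assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: decode-tp-member
  labels:
    kai.scheduler/queue: inference   # scheduling queue; the queue name is an assumption
spec:
  schedulerName: kai-scheduler       # route this pod to KAI instead of the default scheduler
  containers:
  - name: decode
    image: example/decode:latest     # placeholder image
    resources:
      limits:
        nvidia.com/gpu: "1"
```

Gang membership itself is typically not written by hand; a higher-level workload operator derives the gang (PodGroup/PodGang) from the application resource, as the rest of this post shows.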

These three capabilities determine how an AI scheduler, such as KAI Scheduler, places pods based on the application’s scheduling constraints. It’s also essential for the AI orchestration layer to determine what must be gang-scheduled, and when. For example, when prefill scales independently, something needs to decide that the new pods form a gang with a minimum availability guarantee, without disrupting existing decode pods. As a result, the orchestration layer and the scheduler must work closely together across the full application lifecycle, handling multi-level auto-scaling, rolling updates, and more, to ensure optimal runtime conditions for AI workloads.

This is where higher-level workload abstractions come in. APIs like LeaderWorkerSet (LWS) and NVIDIA Grove allow users to declaratively express the structure of their inference application: which roles exist, how they relate to one another, how they should scale, and what topology constraints matter. The API’s operator translates that application-level intent into concrete scheduling constraints (including PodGroups, gang requirements, topology hints) that determine what gangs to create and when.

KAI Scheduler then plays the critical role of satisfying those constraints, solving the how: gang scheduling, hierarchical gang scheduling, and topology-aware placement. In this post, we use KAI as the scheduler, though there are other schedulers in the community that support subsets of these features. Readers can explore the broader scheduling landscape through the Cloud Native Computing Foundation (CNCF) ecosystem.

Deploying disaggregated inference

Disaggregated architectures have multiple roles, each with different resource profiles and scaling needs. Since each role in a disaggregated pipeline is a distinct workload, a natural approach with LWS is to create a separate resource for each role.

Prefill workers (4 replicas, tensor parallel size 2):

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: prefill-workers
spec:
  replicas: 4
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: prefill-leader
      spec:
        containers:
        - name: prefill
          image: 
          args: ["--role=prefill", "--tensor-parallel-size=2"]
          resources:
            limits:
              nvidia.com/gpu: "1"
    workerTemplate:
      spec:
        containers:
        - name: prefill
          image: 
          args: ["--role=prefill"]
          resources:
            limits:
              nvidia.com/gpu: "1"

Decode workers (2 replicas, tensor parallel size 4):

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: decode-workers
spec:
  replicas: 2
  leaderWorkerTemplate:
    size: 4
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: decode-leader
      spec:
        containers:
        - name: decode
          image: 
          args: ["--role=decode", "--tensor-parallel-size=4"]
          resources:
            limits:
              nvidia.com/gpu: "1"
    workerTemplate:
      spec:
        containers:
        - name: decode
          image: 
          args: ["--role=decode"]
          resources:
            limits:
              nvidia.com/gpu: "1"

Router (a standard Deployment; no leader-worker topology needed):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: router
spec:
  replicas: 2
  selector:
    matchLabels:
      app: router
  template:
    metadata:
      labels:
        app: router
    spec:
      containers:
      - name: router
        image: 
        env:
        - name: PREFILL_ENDPOINT
          value: "prefill-workers"
        - name: DECODE_ENDPOINT
          value: "decode-workers"

Each role is managed as its own resource. You can scale prefill and decode independently and update them on different schedules.

It’s important to note that the scheduler treats prefill-workers and decode-workers as independent workloads. The scheduler will place each of them successfully, but it has no knowledge that they form a single inference pipeline. In practice, this means a few things:

  • Topology coordination between prefill and decode (placing them on the same rack for fast KV cache transfer) requires manually adding pod affinity rules that reference labels across the two LWS resources.
  • Scaling one role doesn’t automatically account for the other: If a burst of long-context requests requires more prefill capacity, you scale prefill-workers, but the new prefill pods aren’t guaranteed to land near existing decode pods unless you’ve configured affinity yourself.
  • Rolling out a new model version means coordinating updates across three independent resources. LWS’s partition update mechanism supports staged rollouts per resource, but synchronizing across resources is managed externally.

That last point is worth calling out. Inference frameworks move fast and don’t always guarantee backwards compatibility between versions, so prefill pods on the old version and decode pods on the new version may not be able to communicate. Models also take time to load, and prefill and decode workers frequently become ready at different rates. During an unsynchronized rollout, this can create a temporary imbalance, where many new decode pods are ready but only a few new prefill pods are (or vice versa). This can create a bottleneck in your inference pipeline until everything catches up.
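For example, the cross-resource topology coordination above could be expressed with pod affinity merged into the prefill pod templates. A sketch, assuming the role labels from the manifests in this post; the rack-level node label key is an assumption, since clusters vary:

```yaml
# Merged into the prefill leaderTemplate/workerTemplate pod spec:
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            role: decode-leader                   # label from the decode LWS
        topologyKey: topology.kubernetes.io/rack  # assumed rack label on nodes
```

This pulls new prefill pods toward racks that already host decode leaders, but it has to be maintained by hand in every role’s template.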

These patterns work. The coordination just happens outside of Kubernetes primitives: in the inference framework’s routing layer, in custom autoscalers, in bespoke operators, or even manually. Another option is Grove’s API, which takes a different approach by moving that coordination into the Kubernetes resource itself.

Grove expresses all roles in a single PodCliqueSet:

apiVersion: grove.io/v1alpha1
kind: PodCliqueSet
metadata:
  name: inference-disaggregated
spec:
  replicas: 1
  template:
    cliqueStartupType: CliqueStartupTypeExplicit
    terminationDelay: 30s

    cliques:
    - name: router
      spec:
        roleName: router
        replicas: 2
        podSpec:
          schedulerName: kai-scheduler
          containers:
          - name: router
            image: 
            resources:
              requests:
                cpu: 100m

    - name: prefill
      spec:
        roleName: prefill
        replicas: 4
        startsAfter: [router]
        podSpec:
          schedulerName: kai-scheduler
          containers:
          - name: prefill
            image: 
            args: ["--role=prefill", "--tensor-parallel-size=2"]
            resources:
              limits:
                nvidia.com/gpu: "1"
        autoScalingConfig:
          maxReplicas: 8
          metrics:
          - type: Resource
            resource:
              name: cpu
              target:
                type: Utilization
                averageUtilization: 70

    - name: decode
      spec:
        roleName: decode
        replicas: 2
        startsAfter: [router]
        podSpec:
          schedulerName: kai-scheduler
          containers:
          - name: decode
            image: 
            args: ["--role=decode", "--tensor-parallel-size=4"]
            resources:
              limits:
                nvidia.com/gpu: "1"
        autoScalingConfig:
          maxReplicas: 6
          metrics:
          - type: Resource
            resource:
              name: cpu
              target:
                type: Utilization
                averageUtilization: 80

    topologyConstraint:
      packDomain: rack

The Grove operator manages PodCliques for each role and coordinates scheduling, startup, and lifecycle across all of them. A few things to note in the YAML:

  • startsAfter: [router] on prefill and decode tells the operator to gate their startup until the router is ready. This is expressed declaratively and enforced through init containers. When you first deploy, router pods start and become ready first, then prefill and decode pods start in parallel (since both depend only on the router).
  • autoScalingConfig on each clique lets you define per-role scaling policies. The operator creates a horizontal pod autoscaler (HPA) for each, so prefill and decode scale independently based on their own metrics.
  • topologyConstraint with packDomain: rack tells the KAI Scheduler to pack all cliques within the same rack, optimizing KV cache transfer between prefill and decode stages over high-bandwidth interconnects.

After applying this manifest, you can inspect all the resources Grove creates:

$ kubectl get pcs,pclq,pg,pod
NAME                                            AGE
podcliqueset.grove.io/inference-disaggregated   45s

NAME                                                  AGE
podclique.grove.io/inference-disaggregated-0-router   44s
podclique.grove.io/inference-disaggregated-0-prefill  44s
podclique.grove.io/inference-disaggregated-0-decode   44s

NAME                                                AGE
podgang.scheduler.grove.io/inference-disaggregated-0  44s

NAME                                              READY   STATUS    AGE
pod/inference-disaggregated-0-router-k8x2m        1/1     Running   44s
pod/inference-disaggregated-0-router-w9f4n        1/1     Running   44s
pod/inference-disaggregated-0-prefill-abc12       1/1     Running   44s
pod/inference-disaggregated-0-prefill-def34       1/1     Running   44s
pod/inference-disaggregated-0-prefill-ghi56       1/1     Running   44s
pod/inference-disaggregated-0-prefill-jkl78       1/1     Running   44s
pod/inference-disaggregated-0-decode-mn90p        1/1     Running   44s
pod/inference-disaggregated-0-decode-qr12s        1/1     Running   44s

One PodCliqueSet, three PodCliques (one per role), one PodGang for coordinated scheduling, and pods matching each role’s replica count. The startsAfter dependency is enforced through init containers: Prefill and decode pods wait for the router to become ready before their main containers start.

Scaling disaggregated workloads

Once a disaggregated workload is running, scaling becomes the central operational challenge. Prefill and decode have different bottlenecks; teams might want to autoscale prefill workers based on time to first token (TTFT) and decode workers based on inter-token latency (ITL) independently, to meet service level agreements (SLAs) while minimizing GPU costs.

In practice, disaggregated scaling operates at three levels:

  1. Per-role scaling: adding or removing pods within a single role (e.g., scaling prefill from 4 to 6 replicas).
  2. Per-TP-group scaling: scaling complete Tensor Parallel groups as atomic units, since you can’t add half a TP group.
  3. Cross-role coordination: when you add prefill capacity, you may also need to scale the router to handle increased throughput, or scale decode to consume the extra prefill output.

Different tools address different levels.

How inference frameworks coordinate scaling

Inference frameworks address scaling at the application level with custom autoscalers that have visibility into inference-specific metrics. llm-d’s workload variant autoscaler (WVA) monitors per-pod KV cache utilization and queue depth via Prometheus, using a spare-capacity model to determine when replicas should be added or removed. Rather than scaling deployments directly, WVA emits target replica counts as Prometheus metrics that standard HPA or Kubernetes-based event-driven autoscaling (KEDA) can act on, keeping the scaling actuation within Kubernetes-native primitives.

The NVIDIA Dynamo planner takes a different approach: It natively understands disaggregated serving, running separate prefill and decode scaling loops that target TTFT and ITL SLAs respectively. It predicts upcoming demand using time-series models, computes replica requirements from profiled per-GPU throughput curves, and enforces a global GPU budget across both roles.

This global visibility matters because in practice there’s an optimal ratio between prefill and decode that shifts with request patterns. Scale prefill 3x without scaling decode and the extra output has nowhere to go: decode bottlenecks and KV cache transfer queues up. Application-level autoscalers handle this because they can see the full pipeline; Kubernetes-native HPA targeting individual resources doesn’t inherently maintain cross-resource ratios.
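As a toy sketch of budget-constrained, ratio-aware sizing (not Dynamo’s actual algorithm; the helper, throughput numbers, and GPU counts are all made up for illustration):

```python
from math import ceil

def plan_replicas(prompt_tps, output_tps, prefill_tput, decode_tput,
                  gpus_per_prefill, gpus_per_decode, gpu_budget):
    """Size each role from demand, then shrink both together to fit a GPU budget."""
    # demand-driven sizing: tokens/sec divided by per-replica throughput
    prefill = ceil(prompt_tps / prefill_tput)
    decode = ceil(output_tps / decode_tput)
    used = prefill * gpus_per_prefill + decode * gpus_per_decode
    if used > gpu_budget:
        # scale both roles down by the same factor to roughly preserve their ratio
        factor = gpu_budget / used
        prefill = max(1, int(prefill * factor))
        decode = max(1, int(decode * factor))
    return prefill, decode

# plenty of budget: demand alone decides -> (4, 3)
print(plan_replicas(20000, 3000, 5000, 1000, 2, 4, gpu_budget=32))
# tight budget: both roles shrink together -> (2, 1)
print(plan_replicas(20000, 3000, 5000, 1000, 2, 4, gpu_budget=10))
```

The point of the sketch is the last step: any planner that sizes roles independently but shares a GPU budget has to reconcile the two, which is exactly what per-resource HPAs cannot do on their own.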

Scaling with separate LWS resources

With one LWS per role, you scale each independently:

kubectl scale lws prefill-workers --replicas=6
kubectl scale lws decode-workers --replicas=3

Standard HPA can target each LWS individually, or an external autoscaler (like the Dynamo planner or llm-d’s WVA) makes coordinated decisions and updates both. The coordination logic lives in the autoscaler, not in the Kubernetes resources themselves.
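LWS exposes the scale subresource, so a standard autoscaling/v2 HPA can target it directly. A sketch, assuming a metrics adapter already exports a TTFT metric (the metric name and threshold are made up):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: prefill-hpa
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: prefill-workers
  minReplicas: 2
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: time_to_first_token_seconds   # custom metric; name and adapter are assumptions
      target:
        type: AverageValue
        averageValue: "500m"                # aim for 0.5s average TTFT
```

Note that scaling the LWS replica count adds or removes whole leader-worker groups, so each HPA step provisions a complete TP group.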

Scaling with Grove

Grove brings per-role scaling into a single resource. Each PodClique has its own replica count and optional autoScalingConfig, so HPAs can manage roles independently based on per-role metrics:

kubectl scale pclq inference-disaggregated-0-prefill --replicas=6

The operator creates additional prefill pods while leaving the router and decode untouched:

NAME                                                  AGE
podclique.grove.io/inference-disaggregated-0-router   5m
podclique.grove.io/inference-disaggregated-0-prefill  5m
podclique.grove.io/inference-disaggregated-0-decode   5m

NAME                                              READY   STATUS    AGE
pod/inference-disaggregated-0-router-k8x2m        1/1     Running   5m
pod/inference-disaggregated-0-router-w9f4n        1/1     Running   5m
pod/inference-disaggregated-0-prefill-abc12       1/1     Running   5m
pod/inference-disaggregated-0-prefill-def34       1/1     Running   5m
pod/inference-disaggregated-0-prefill-ghi56       1/1     Running   5m
pod/inference-disaggregated-0-prefill-jkl78       1/1     Running   5m
pod/inference-disaggregated-0-prefill-tu34v       1/1     Running   12s  # new
pod/inference-disaggregated-0-prefill-wx56y       1/1     Running   12s  # new
pod/inference-disaggregated-0-decode-mn90p        1/1     Running   5m
pod/inference-disaggregated-0-decode-qr12s        1/1     Running   5m

Six prefill pods, two router pods, two decode pods; only prefill changed.

For roles that use multi-node Tensor Parallelism internally, PodCliqueScalingGroup ensures multiple PodCliques scale together as a unit while preserving the replica ratio between them. For example, in a configuration where each prefill instance consists of one leader pod and four worker pods:

podCliqueScalingGroups:
    - name: prefill
      cliqueNames: [pleader, pworker]
      replicas: 2
      minAvailable: 1
      scaleConfig:
        maxReplicas: 4

With replicas: 2, this creates two complete prefill instances: 2 x (1 leader + 4 workers) = 10 pods total. The minAvailable: 1 guarantee means the system won’t scale below one complete Tensor Parallel group.

Scaling the group from 2 to 3 replicas adds a third complete instance while preserving the 1:4 leader-to-worker ratio:

$ kubectl scale pcsg inference-disaggregated-0-prefill --replicas=3

Both the leader and worker cliques scaled together as a unit: the new replica (prefill-2) has one pleader pod and four pworker pods, matching the ratio. A new PodGang was created for the third replica to ensure it gets gang-scheduled.

NAME                                                              AGE
podcliquescalinggroup.grove.io/inference-disaggregated-0-prefill  10m

NAME                                                              AGE
podclique.grove.io/inference-disaggregated-0-prefill-0-pleader    10m
podclique.grove.io/inference-disaggregated-0-prefill-0-pworker    10m
podclique.grove.io/inference-disaggregated-0-prefill-1-pleader    10m
podclique.grove.io/inference-disaggregated-0-prefill-1-pworker    10m
podclique.grove.io/inference-disaggregated-0-prefill-2-pleader    8s  # new
podclique.grove.io/inference-disaggregated-0-prefill-2-pworker    8s  # new

NAME                                                              AGE
podgang.scheduler.grove.io/inference-disaggregated-0              10m
podgang.scheduler.grove.io/inference-disaggregated-0-prefill-0    10m
podgang.scheduler.grove.io/inference-disaggregated-0-prefill-1    8s  # new

Getting started

Whether you’re running a single disaggregated pipeline or operating dozens across your cluster, the building blocks are emerging and the community is building them in the open. Each approach in this post represents a different point on the spectrum between simplicity and integrated coordination.

The right choice depends on your workload, your team’s operational model, and how much lifecycle management you want the platform to handle versus the application layer.

Check out these resources for more information.

Join us at KubeCon EU

If you’re attending KubeCon EU 2026 in Amsterdam, drop by booth No. 241 and join the session where we’ll cover an end-to-end open source AI inference stack. Explore the Grove Deployment Guide and ask questions on GitHub or Discord. We’d love to hear how you’re thinking about disaggregated inference on Kubernetes.


