Removing the Guesswork from Disaggregated Serving



Deploying and optimizing large language models (LLMs) for high-performance, cost-effective serving is a formidable engineering problem. The best configuration for any given workload (spanning hardware, parallelism, and the prefill/decode split) resides in a large, multi-dimensional search space that is impractical to explore manually or through exhaustive testing. AIConfigurator, an open source tool that complements the NVIDIA Dynamo AI serving stack, is designed to cut through this complexity and get you to an optimal deployment in minutes.

The core advantage of AIConfigurator is that you don't need to run every possible configuration on real hardware to predict which one will perform best. Instead, it decomposes LLM inference into its constituent operations and measures each one in isolation on the target GPU. AIConfigurator can then reassemble those measurements to estimate the end-to-end performance of any configuration, all without occupying a single GPU-hour at search time.
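The decompose-and-reassemble idea can be sketched in a few lines of Python. Everything below is illustrative: the operation names, the table layout, and the latency numbers are made up for the example, not AIConfigurator's actual database schema.

```python
# Hypothetical sketch: per-operation latencies, measured once on the target
# GPU, are looked up and summed to estimate end-to-end step latency.

# (op, batch_size) -> measured latency in milliseconds (illustrative values)
measured_ms = {
    ("gemm_qkv", 32): 0.40,
    ("attention", 32): 0.90,
    ("gemm_mlp", 32): 0.70,
    ("allreduce", 32): 0.15,
}

def estimate_step_latency(ops, batch_size, table):
    """Estimate one forward-step latency by composing per-op measurements."""
    return sum(table[(op, batch_size)] for op in ops)

layer_ops = ["gemm_qkv", "attention", "gemm_mlp", "allreduce"]
num_layers = 64
step_ms = num_layers * estimate_step_latency(layer_ops, 32, measured_ms)
print(f"estimated decode step: {step_ms:.1f} ms")
```

Because the table is keyed by operation and shape, evaluating a new configuration is a series of dictionary lookups rather than a GPU run, which is what makes the search essentially free.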

This blog provides a quick overview of how AIConfigurator works, how to use it with Dynamo, and how ecosystem contributors such as Alibaba and Mooncake are helping extend this open source project to more frameworks.

Using AIConfigurator to configure disaggregated serving

With AIConfigurator, the latency estimate for every operation, including general matrix multiplications (GEMMs), attention, communication, and mixture-of-experts (MoE) dispatch, is backed by real kernel measurements collected on the target hardware. The collector toolchain benchmarks every primitive across supported quantization modes, batch sizes, sequence lengths, and GPU counts, and logs the results to a silicon-calibrated performance database. When collected data isn't available for a new model or GPU, AIConfigurator falls back to speed-of-light roofline estimates with empirical correction factors, giving usable recommendations even before the model has been empirically profiled.
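A speed-of-light roofline fallback for a GEMM can be sketched as follows. The peak-compute and bandwidth figures are roughly H100-class numbers used only as assumptions, and the 0.7 efficiency factor stands in for AIConfigurator's empirical calibration; none of these values come from the tool itself.

```python
# Illustrative roofline estimate: latency is bounded below by the larger of
# the compute roof and the memory roof, then scaled by an assumed
# empirical efficiency factor.

def roofline_gemm_ms(m, n, k, bytes_per_elem=2,
                     peak_tflops=989.0, peak_gbps=3352.0,
                     correction=0.7):
    """Speed-of-light GEMM latency (ms), divided by an assumed efficiency."""
    flops = 2 * m * n * k                                   # multiply-adds
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # A, B, C traffic
    compute_ms = flops / (peak_tflops * 1e12) * 1e3
    memory_ms = bytes_moved / (peak_gbps * 1e9) * 1e3
    return max(compute_ms, memory_ms) / correction

# Decode-style GEMM: small batch (m), large weights -> memory-bound
print(f"{roofline_gemm_ms(32, 8192, 8192):.3f} ms")
```

The same two-roof structure applies to attention and communication primitives; only the FLOP and byte counts change per operation.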

On top of this estimation layer, AIConfigurator models continuous batching for aggregated serving, rate-matches prefill and decode worker pools for disaggregated serving, and handles MoE-specific concerns such as expert parallelism and token routing skew. Rather than returning a single answer, it computes the Pareto frontier across all evaluated configurations, showing the throughput-vs-latency tradeoff for aggregated and disaggregated modes side-by-side. The full search, often spanning tens of thousands of candidate configurations, completes in seconds instead of the days a GPU-based sweep would take.
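The Pareto-frontier step itself is simple: keep a configuration only if no other configuration is at least as good on both axes. The configuration names and numbers below are invented for illustration.

```python
# Minimal sketch of a Pareto frontier over (throughput, latency) points:
# a config is dominated if another config has throughput >= and latency <=.

def pareto_frontier(configs):
    """configs: list of (name, tokens_per_sec_per_gpu, tpot_ms)."""
    frontier = []
    for name, tps, tpot in configs:
        dominated = any(
            tps2 >= tps and tpot2 <= tpot and (tps2, tpot2) != (tps, tpot)
            for _, tps2, tpot2 in configs
        )
        if not dominated:
            frontier.append((name, tps, tpot))
    return sorted(frontier, key=lambda c: c[2])  # order by latency

candidates = [
    ("tp4_disagg", 550, 14.0),
    ("tp8_agg",    400, 12.0),
    ("tp2_agg",    300, 20.0),  # dominated by tp4_disagg
    ("tp8_disagg", 600, 18.0),
]
for cfg in pareto_frontier(candidates):
    print(cfg)
```

Plotting the surviving points for both serving modes gives exactly the side-by-side tradeoff curve described above.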

To see how this tool can help you as a developer, consider a concrete example: deploying Qwen3-32B with NVFP4 quantization across 64 NVIDIA B200 GPUs, with target SLAs of 1000ms time-to-first-token (TTFT) and 15ms time-per-output-token (TPOT). With a single command, you can search through thousands of candidate configurations:

pip install aiconfigurator  # or install from source for the latest version

aiconfigurator cli default \
  --model-path nvidia/Qwen3-32B-NVFP4 \
  --total-gpus 64 \
  --system b200_sxm \
  --isl 15000 --osl 500 \
  --ttft 1000 --tpot 15 \
  --save-dir ./results

Within seconds, AIConfigurator returns a recommendation. In this example, disaggregated serving achieves 550 tokens/s/GPU, a 38% improvement over the best aggregated configuration. The output includes a Pareto frontier visualizing the full tradeoff space, ranked configurations (best_config_topn.csv), engine configurations for each worker type, and ready-to-use deployment artifacts for both serving modes.

Pareto frontier chart comparing aggregated and disaggregated serving configurations for Qwen3-32B on 64 GPUs, showing disaggregated serving achieves better throughput
Figure 1. Example TPS/GPU vs. latency Pareto frontier drawn by AIConfigurator

For disaggregated serving in Dynamo, deploying the recommended configuration takes a single command:

kubectl apply -f results/disagg/top1/k8s_deploy.yaml

This workflow generalizes across models and hardware. The same interface applies whether deploying Qwen3-32B on eight NVIDIA H200 GPUs or DeepSeek-V3 across a multi-node B200 cluster; AIConfigurator adapts its search space and recommendations to the specified model, hardware, and SLA constraints.

Extending support to multiple frameworks

AIConfigurator originally supported only NVIDIA TensorRT LLM, but as frameworks like SGLang gained traction, particularly for MoE models like DeepSeek, single-backend support was no longer sufficient. We designed a framework-agnostic abstraction layer with a unified parameter mapping that normalizes each backend's config schemas and terminology behind a single interface. That investment paid off when community partners such as Mooncake and Alibaba brought SGLang support to life, contributing collectors, validation, and integration work covered in the following sections.

From a user’s perspective, comparing backends is a one-flag change:

# TensorRT LLM
aiconfigurator cli default \
  --model-path nvidia/Qwen3-32B-NVFP4 \
  --total-gpus 64 --system b200_sxm \
  --backend trtllm

# SGLang
aiconfigurator cli default \
  --model-path nvidia/Qwen3-32B-NVFP4 \
  --total-gpus 64 --system b200_sxm \
  --backend sglang

# vLLM
aiconfigurator cli default \
  --model-path nvidia/Qwen3-32B-NVFP4 \
  --total-gpus 64 --system b200_sxm \
  --backend vllm

To make it even simpler, --backend auto compares all three frameworks in a single command:

aiconfigurator cli default \
  --model-path nvidia/Qwen3-32B-NVFP4 \
  --total-gpus 64 --system b200_sxm \
  --backend auto

The search process is identical across backends; only the generated deployment artifacts differ, with each backend receiving native config files, CLI arguments, and K8s manifests in its expected format. AIConfigurator currently ships with silicon-validated performance data for TensorRT LLM and SGLang across NVIDIA H100, H200, and B200 systems, with vLLM support on select platforms as well.

WideEP inference for SGLang

SGLang is especially popular for running Wide Expert Parallelism (WideEP), which dramatically increases decode throughput for MoE models like DeepSeek V3/R1 by distributing experts across a large number of GPUs. To accurately model SGLang's WideEP pathway, AIConfigurator simulates key elements like DeepEP all-to-all communication, MTP, MLA attention, Attention DP, workload-aware MoE, and expert parallel load balancing (EPLB). Modeling MoE and EPLB poses the greatest challenge.

WideEP's MoE routing inherently suffers from load imbalance, with some experts receiving more tokens than others. AIConfigurator models this power-law workload distribution using an alpha parameter. This alpha acts as a lookup key in the performance database, linking distribution patterns to collected latency profiles, similar to the standard MoE path. An alpha of 1.01 empirically matches DeepSeek V3.1 well for both prefill and decode across datasets.

In WideEP deployments, AIConfigurator models EPLB by adjusting two factors instead of directly simulating the algorithm. First, the workload distribution alpha is lowered from 1.01 to 0.6 to reflect the load smoothing from expert replication. Second, the effective token count is multiplied by 0.8, modeling the empirical reduction in maximum per-GPU token load. Together, these changes select the appropriate latency curve and adjust the operating point accordingly.
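The effect of the alpha parameter and the EPLB adjustments can be illustrated with a Zipf-like load model. The expert count, token count, and the interpretation of alpha as a Zipf exponent are assumptions made for this sketch; only the 1.01, 0.6, and 0.8 values come from the text above.

```python
# Illustrative power-law expert-load model: with Zipf-like routing weights,
# the hottest expert's token share grows with alpha; lowering alpha (as
# AIConfigurator does to model EPLB) flattens the load.

def max_expert_share(num_experts, alpha):
    """Fraction of tokens hitting the most-loaded expert under a
    power-law load with exponent alpha."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_experts + 1)]
    return max(weights) / sum(weights)

experts = 256
skewed = max_expert_share(experts, 1.01)   # alpha matching DeepSeek V3.1
smoothed = max_expert_share(experts, 0.6)  # EPLB-style load smoothing

# EPLB modeling also scales the effective per-GPU token load by 0.8
tokens_per_gpu = 4096                      # hypothetical operating point
effective_tokens = tokens_per_gpu * 0.8

print(f"hot-expert share: {skewed:.3f} -> {smoothed:.3f}")
print(f"effective tokens/GPU: {effective_tokens:.0f}")
```

The lowered alpha selects a flatter latency curve from the database, and the 0.8 multiplier shifts the lookup to a lighter operating point, mirroring the two adjustments described above.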

Bar chart comparing MoE latency prediction methods across batch/sequence configurations, showing Power Law 1.01 most closely matches real-time measurements, typically within 1ms to 4ms.
Figure 2. Power-law simulation

Preliminary results are promising: the best configuration identified by AIConfigurator aligns with the manually tuned production configuration. Further collaboration is planned to bring this to production readiness.

Mooncake: Initial SGLang support in AIConfigurator

AIConfigurator initially supported only TensorRT LLM, reserving interfaces for SGLang and vLLM without full implementations. Contributors from Mooncake (an open source collaboration between Moonshot AI, Tsinghua University, and others) then developed the first version of the SGLang backend.

They first completed the collector layer, modeling and encapsulating core operations (GEMM, attention, batch-GEMM). This enabled quick support for models like Llama, Qwen, and DeepSeek. This work, combined with the subsequent SGLang WideEP effort, formed the first SGLang backend for AIConfigurator.

Alibaba: Integrating AIConfigurator into the AI Serving Stack for automated deployments

The AI Serving Stack, built on Alibaba Container Service for Kubernetes (ACK), is an end-to-end solution for efficient and scalable cloud-native LLM inference. It manages the entire lifecycle, offering deployment, smart routing, auto-scaling, and deep observability.

Architecture diagram of Alibaba's AI Serving Stack using AIConfigurator to autogenerate disaggregated serving deployments with prefill and decode workers on Kubernetes.
Figure 3. An Alibaba graphic showing how it uses AIConfigurator in its container service

Within this stack, the RoleBasedGroup (RBG), an SGLang community-incubated AI orchestration engine to which Alibaba Cloud heavily contributes, simplifies LLM inference service deployment on Kubernetes. RBG uses "Role" as its core orchestration unit, dividing prefill-decode-disaggregated services into router, prefill, and decode roles to coordinate their placement, scaling, and updates. This ensures a balance of performance and stability with role-based extensibility.

The complete Dynamo service stack can be deployed with the AI Serving Stack on ACK, feeding AIConfigurator's prediction results into AIConfigurator's generator module so the ACK team can generate the deployable configuration for RBG. By integrating this process, Alibaba achieved 1.86x the throughput on the Qwen3-235B-FP8 model compared to the baseline, while maintaining TTFT <5000ms and ITL <40ms.

RBG will continue to track AIConfigurator's progress and provide Day 0 support for rapid deployment of new models in ACK.

Alibaba: Building HiSim on top of AIConfigurator

AIConfigurator optimizes static workloads, but it cannot easily model dynamic, bursty production traffic, complex scheduling, and KV cache dynamics. To overcome this, the Alibaba TAIR KV Cache Team created Tair-KVCache-HiSim, a lightweight, high-fidelity, event-driven system simulator.

HiSim tackles dynamic traffic and queuing (predicting TTFT, TPOT, and throughput under variable request rates and complex scheduling like SGLang's) and advanced KV cache optimization (quantifying tradeoffs for multi-level storage and various eviction/prefetch policies) via system-level simulation.

HiSim comprises a workload generator, a global router simulator, and an inference engine simulator (IES). The IES uses a unified global clock to coordinate the scheduler simulator (managing LLM request preemption and batching), the KV cache manager simulator (HiCacheController, modeling the three-level KV cache and eviction), and the BatchRunnerEstimator (AIConfiguratorTimePredictor, which computes batch latency based on AIConfigurator).
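The coordination pattern, in which independent components advance under one global clock via timestamped events, can be shown with a toy event-driven simulator. The class and callback names, latencies, and the three-token request below are all invented for illustration; this is not HiSim's implementation.

```python
# Toy event-driven simulator: a priority queue of (timestamp, seq, callback)
# tuples drives a single global clock, in the spirit of HiSim's IES design.
import heapq

class EventSim:
    def __init__(self):
        self.clock = 0.0
        self.queue = []   # (timestamp, seq, callback); seq breaks ties
        self.seq = 0
        self.log = []

    def schedule(self, delay, callback):
        heapq.heappush(self.queue, (self.clock + delay, self.seq, callback))
        self.seq += 1

    def run(self):
        while self.queue:
            self.clock, _, callback = heapq.heappop(self.queue)
            callback(self)

def request_arrives(sim):
    sim.log.append((sim.clock, "prefill_start"))
    sim.schedule(120.0, decode_step)     # hypothetical prefill latency (ms)

def decode_step(sim):
    sim.log.append((sim.clock, "decode_step"))
    if sum(1 for _, e in sim.log if e == "decode_step") < 3:
        sim.schedule(15.0, decode_step)  # hypothetical per-token latency

sim = EventSim()
sim.schedule(0.0, request_arrives)
sim.run()
print(sim.log)
```

In a full simulator, the scheduler, cache manager, and batch-latency estimator would each register events on this same queue, so the clock orders their interactions without any component polling the others.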

This structure adapts rapidly to diverse inference engines (vLLM, SGLang, TensorRT LLM), accurately mimicking real-world configurations, runtime parameters, and execution semantics (parallelism, batching, device optimizations) without engine modification, ensuring high fidelity.

HiSim guides SGLang R&D by allowing configuration tuning to quantify scheduling tradeoffs (TTFT/throughput, queueing/memory, cache hit/TTFT, overlap efficiency) without code changes. It provides "oracle" evaluation for new hardware by estimating performance ceilings and identifying bottlenecks using theoretical specs. HiSim also aids HiCache architecture exploration and cost/performance optimization through three-level KV cache design (e.g., L2 size, prefetch/eviction policy, L3 bandwidth needs, write-through vs. write-back) to find the best cost-performance point.

Leveraging AIConfigurator, HiSim extends static evaluation to active, cost-aware deployment recommendations for dynamic traffic. The end-to-end simulation is within 5% error of real-world performance. Future work will enhance this collaboration to build a high-fidelity, production-ready system simulator.

What’s next for AIConfigurator

The roadmap ahead extends AIConfigurator from a standalone command line tool into a core component of the Dynamo platform:

  • Faster model support. “Hybrid” mode already provides Day 1 recommendations via speed-of-light estimates; we’re also automating the silicon data-collection pipeline to speed up fully validated support.
  • Powering Dynamo deployments. AIConfigurator is becoming the configuration engine behind Dynamo’s Kubernetes flow via the DynamoGraphDeploymentRequest (DGDR) CRD, producing optimized deployments from a single YAML file.
  • Dynamic workload modeling. Moving beyond static input sequence length/output sequence length/concurrency targets toward models that capture production workload distributions directly.

NVIDIA plans to keep working with third parties to bring AIConfigurator to more systems and tools. AIConfigurator actively welcomes contributions, including performance data for new hardware, additional backend support, new features, and extensions like HiSim.

See the AIConfigurator repository to get started, and check out the Dynamo project for the fastest way to set up disaggregated serving.

For a full technical treatment, including formal definitions and validation results, read our paper: AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving.


