How NVIDIA's Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI's Sovereign Models

As global AI adoption accelerates, developers face a growing challenge: delivering large language model (LLM) performance that meets real-world latency and cost requirements. Running models with tens of billions of parameters in production, especially for conversational or voice-based AI agents, demands high throughput, low latency, and predictable service-level performance. For startups building sovereign AI models from scratch, these challenges are amplified by the need to balance model scale and accuracy with infrastructure efficiency—while also maintaining data sovereignty and cost control.

Sarvam AI, a generative AI startup based in Bengaluru, India, set out to build large, multilingual, multimodal foundation models that serve the country's diverse population, support nearly two dozen languages, and keep model development and data governance fully under India's sovereign control. To meet strict latency targets and improve inference efficiency for its flagship sovereign 30B model, Sarvam AI collaborated with NVIDIA to co-design hardware and software optimizations.

This collaboration delivered a 4x speedup in inference performance on NVIDIA Blackwell over baseline NVIDIA H100 GPUs and established a path for deployment on the next-generation NVIDIA Blackwell architecture. The end-to-end performance boost was achieved through kernel and scheduling optimizations on NVIDIA H100 SXM GPUs, which contributed a 2x speedup. That was combined with the powerful compute capabilities of Blackwell, together with NVFP4 weight quantization, for an additional 2x speedup, with an even larger performance gain of 2.8x seen at higher interactivity points.

NVIDIA engineers helped Sarvam AI build 3B, 30B, and 100B foundation models and optimize a new family of sovereign foundation models that were trained using NVIDIA Nemotron libraries, including the NVIDIA NeMo Framework and NVIDIA NeMo-RL. These models support 22 Indian languages, English, math, and code. They demonstrate how developer teams can leverage NVIDIA's full-stack AI platform—from data to deployment—to achieve state-of-the-art performance and localized AI capabilities.

This post walks through the joint engineering effort and shares benchmarks for the speedups achieved on the NVIDIA H100, the most widely deployed NVIDIA GPU in India. We also provide an early look at how these workloads are being adapted for the NVIDIA Blackwell architecture.

Making multilingual sovereign AI scalable with MoE

To deliver sovereign-scale intelligence with high efficiency, the Sarvam AI models employ a sophisticated heterogeneous mixture-of-experts (MoE) architecture tailored for deep reasoning and linguistic density. These models were pretrained from scratch at 3B, 30B, and 100B parameter scales using the NVIDIA NeMo framework and NVIDIA Megatron-LM. NeMo-RL was then used for post-training workflows for these models, including long-context reasoning.

Sarvam 30B uses a 19-layer depth (1 dense + 18 MoE) with 128 experts and a top-6 routing strategy, relying on grouped query attention (GQA) to balance memory bandwidth with generation quality.

Sarvam 100B scales this design to 32 layers (1 dense + 31 MoE) and employs top-8 routing over 128 experts with a larger MoE FFN hidden size of 2048. The 100B model also adopts multi-head latent attention (MLA)—similar to DeepSeek-V3—to aggressively compress the key-value (KV) cache, enabling massive context windows without the memory penalties of standard attention.

Both models feature a shared expert design, where a dedicated expert handles common features while routed experts tackle specialized tasks. This combination of high active parameter counts (via top-6/top-8 routing) and sophisticated memory access patterns created a unique serving challenge, necessitating the deep kernel optimizations on NVIDIA Hopper and NVIDIA Blackwell GPUs detailed below.
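
To make the routing pattern concrete, below is a minimal PyTorch sketch of a top-k MoE block with a shared expert, mirroring the top-6-of-128 layout described above. It illustrates the general technique only, not Sarvam's implementation; the class name, hidden/FFN sizes, and the naive per-expert loop are placeholders (production engines dispatch with grouped GEMM kernels, as discussed later in this post).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoEWithSharedExpert(nn.Module):
    """Minimal top-k MoE block with an always-on shared expert (illustration only)."""

    def __init__(self, hidden=2048, ffn=1024, n_experts=128, top_k=6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts, bias=False)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))
            for _ in range(n_experts)
        )
        # Dedicated shared expert that processes every token, handling common features.
        self.shared_expert = nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))

    def forward(self, x):  # x: [tokens, hidden]
        logits = self.router(x)                                 # [tokens, n_experts]
        weights, idx = torch.topk(logits, self.top_k, dim=-1)   # route each token to k experts
        weights = F.softmax(weights, dim=-1)

        routed = torch.zeros_like(x)
        for slot in range(self.top_k):                          # naive dispatch; real engines use grouped GEMMs
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                routed[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])

        return routed + self.shared_expert(x)                   # shared expert output is always added
```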

The performance challenge: SLAs and baseline configuration on NVIDIA H100

Optimizing the Sarvam 30B model wasn't just about raw speed; it was about maximizing density under strict latency constraints. For the applications served by this model—voice-to-voice agents—we established the following service-level agreements (SLAs):

  • P95 (95th percentile) time to first token (TTFT): < 1000 ms
  • P95 (95th percentile) inter-token latency (ITL): < 15 ms

P95 (95th percentile) latency in inference performance testing indicates that 95% of served requests complete faster than this threshold, while the slowest 5% take longer. It's a critical tail-latency metric used to evaluate user experience and system stability, ensuring that even under load, most users experience no more than a specific delay. The engineering goal was to maximize the inference server's token throughput across concurrently served requests without breaching these P95 targets.
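
As a concrete illustration of the metric, the sketch below computes P95 TTFT and ITL from measured samples and checks them against the budgets above. The function names and the synthetic measurements are placeholders; real samples would come from a load-test harness.

```python
import numpy as np


def p95(samples_ms):
    """95th-percentile latency: 95% of requests finish at or below this value."""
    return float(np.percentile(samples_ms, 95))


def meets_sla(ttft_ms, itl_ms, ttft_budget_ms=1000.0, itl_budget_ms=15.0):
    """Return True if measured TTFT/ITL samples stay within the P95 budgets."""
    return p95(ttft_ms) < ttft_budget_ms and p95(itl_ms) < itl_budget_ms


# Example with synthetic measurements (replace with real load-test output).
ttft = [450, 620, 710, 980, 1020]      # per-request time to first token, ms
itl = [9.2, 10.5, 11.1, 13.8, 14.6]    # per-token inter-token latency, ms
print(p95(ttft), p95(itl), meets_sla(ttft, itl))
```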

For the initial performance evaluation, the Sarvam AI and NVIDIA teams selected the SGLang inference engine. Unlike standard serving frameworks that treat the KV cache as a linear buffer, SGLang implements RadixAttention—a mechanism that manages the KV cache as a radix tree. This was critical for the Sarvam 30B architecture: RadixAttention enables automatic prefix sharing, allowing the shared expert context and system prompts to be computed once and reused across concurrent requests. In addition, SGLang's Cache-Aware Scheduler maximizes the hit rate of these shared prefixes, significantly reducing redundant memory operations during the prefill phase.

The Sarvam AI and NVIDIA teams modeled a production traffic profile characterized by an average input sequence length (ISL) of 3,584 tokens and an output sequence length (OSL) of 128 tokens. Guided by internal simulation data, we deployed the model on two NVIDIA H100 SXM GPUs with a parallelism strategy designed to balance the distinct memory and compute requirements of the MoE layers:

  • Expert parallelism (EP=2) for the expert weights. This configuration uses Grouped GEMM kernels to maximize compute density and ensures that the large expert weights reside in HBM, reducing the cost of expert routing.
  • Data parallelism (DP=2) for the attention weights with --enable-dp-attention. This enabled us to parallelize attention computation across parallel batches, significantly boosting the aggregate throughput of the prefill phase.

While this configuration provided a strong functional baseline, profiling revealed that satisfying the sub-second TTFT at high concurrency required deeper optimization, leading us to the specific kernel and precision strategies detailed below.

From profiling to performance: eliminating MoE bottlenecks

Simulation data indicated that a concurrency range of 32 to 64 requests would offer the best chance of meeting the SLA requirements. To identify the specific bottlenecks limiting token throughput in this concurrency range, the NVIDIA and Sarvam AI teams used NVIDIA Nsight Systems to capture execution traces of both the prefill and decode phases at a concurrency of 32 requests. We then processed the traces to extract the microsecond-level latency contribution of each kernel within a single transformer layer.
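
The trace post-processing can be approximated with a script like the one below, which assumes the Nsight Systems capture has been exported to a per-kernel summary CSV (for example, via nsys stats). The column names ("Kernel Name", "Duration (ns)") and the kernel-name fragments are assumptions for illustration; they depend on the export format and the serving stack.

```python
import csv
from collections import defaultdict


def kernel_latency_breakdown(csv_path, layer_kernels):
    """Sum per-kernel GPU time (in microseconds) for the kernels of one transformer layer.

    csv_path:      kernel summary exported from an Nsight Systems trace
                   (assumed columns: "Kernel Name", "Duration (ns)").
    layer_kernels: mapping of human-readable op name -> substring to match in kernel names.
    """
    totals = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            duration_us = float(row["Duration (ns)"]) / 1e3
            for op, needle in layer_kernels.items():
                if needle in row["Kernel Name"]:
                    totals[op] += duration_us
                    break
    return dict(totals)


# Hypothetical name fragments; real kernel names depend on the serving stack and kernel libraries.
ops = {"qk_norm_rope": "rope", "attention": "flash", "router_topk": "topk", "moe_gemm": "grouped_gemm"}
breakdown = kernel_latency_breakdown("prefill_kernels.csv", ops)
print(sorted(breakdown.items(), key=lambda kv: -kv[1]))
```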

The profiling revealed that while the heavy General Matrix Multiplication (GEMM) operations (experts and attention) were performing well, significant latency bubbles existed in the non-compute-intensive operations—specifically in the MoE routing logic and positional embedding calculations. These operations suffered from kernel launch overheads and redundant memory reads.

A diagram showing the Nsight Systems profiler timeline for a single transformer layer during the model's prefill phase. Stacked rows display GPU metrics (clock frequencies, GPU active time, SM active percentage, SM instructions, and SM warp occupancy) above a sequence of GPU kernel launches, with red boxes highlighting the most time-consuming operations: QK normalization and RoPE, attention, router logits with top-K selection, routed MoE expert GEMM plus GLU, and shared expert GEMM plus GLU.
Figure 1. Nsight Systems profiler timeline showing SM activity and kernel execution over time during the prefill phase, with red boxes marking the most costly kernels in the layer—QK normalization, attention, and MoE expert computation.

Following these observations, we executed a targeted optimization strategy across three axes: kernel optimizations, scheduling efficiency, and disaggregated serving.

Making transformer layers 34% faster with kernel-level optimizations

The NVIDIA and Sarvam AI teams systematically targeted the most costly kernels identified in the trace, replacing standard PyTorch implementations with fused, architecture-specific kernels. We first implemented the models using a baseline implementation on SGLang with H100 GPUs and then optimized them to achieve significant speedups, as detailed in Table 1 and the discussion that follows.

| Kernel | Baseline time (microseconds) | Optimized time (microseconds) | Optimization applied |
|---|---|---|---|
| RMSNorm + Prepare QKV | 186 | 185 | N/A |
| QK Norm + RoPE | 414 | 54 | Optimized fused in-place query-key normalization kernel |
| Attention | 322 | 296 | FA3 for prefill, FlashInfer backend for decode |
| Post-attention linear projection | 114 | 112 | N/A |
| AllReduce | 252 | 250 | N/A |
| Router logits and TopK | 560 | 134 | Fused TopK implementation; ReplicatedLinear block for router logits |
| Routed experts computation | 1103 | 1080 | Kernel parameters tuned for the DEP2 configuration (64 experts per GPU) |
| Shared expert computation | 216 | 215 | Overlapped with TopK using NVIDIA CUDA streams |
| AllReduce | 265 | 249 | N/A |
| Total layer time | 3432 | 2575 | 1.34x faster prefill overall |

Table 1. Kernel-level optimizations pay off: fusing and tuning the hottest kernels cut layer time drastically and deliver faster prefill.

MoE routing (4.1x faster than baseline H100 performance): The most significant bottleneck identified was the MoE routing mechanism. In the baseline, computing router logits and performing TopK selection involved multiple kernel launches and redundant memory round-trips.

  • Optimization: We implemented a fused TopK kernel that fuses the logit computation and selection logic into a single CUDA kernel. In addition, we used a ReplicatedLinear block for the router logits. Since the router weights are small, replicating them across GPUs eliminates the need for expensive communication during the gating phase, keeping the operation purely compute-bound (see the conceptual sketch below).
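
The gist of the change can be sketched in eager PyTorch as follows; the actual optimization is a fused CUDA kernel inside the serving stack, so this is only a conceptual contrast under placeholder shapes.

```python
import torch

# Conceptual contrast only; the production change is a custom fused CUDA kernel plus a
# ReplicatedLinear router, and all shapes below are placeholders.
hidden, n_experts, top_k = 2048, 128, 6
x = torch.randn(4096, hidden, device="cuda", dtype=torch.bfloat16)

# The router weight is tiny (128 x 2048 in bf16 is roughly 0.5 MB), so replicating it on every
# GPU is far cheaper than sharding it and paying a communication step during gating.
w_router = torch.randn(n_experts, hidden, device="cuda", dtype=torch.bfloat16)

# Baseline pattern: separate launches for logits, softmax, and top-k, with intermediates
# written back to global memory between each step.
logits = x @ w_router.t()
probs = torch.softmax(logits.float(), dim=-1)
topk_weights, topk_experts = torch.topk(probs, top_k, dim=-1)

# What the fused kernel does conceptually: read x once, produce the top-k expert IDs and
# weights in a single launch, and avoid materializing the full probability matrix. Eager
# PyTorch cannot express that fusion; the line below only shows the reduced op count.
topk_logits, topk_experts_fused = torch.topk(x @ w_router.t(), top_k, dim=-1)
```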

Fusing positional embeddings (7.6x faster than baseline H100 performance): The baseline implementation of query-key (QK) norm followed by rotary positional embeddings (RoPE) required reading and writing the large KV cache twice.

  • Optimization: We deployed a custom fused in-place QK norm + RoPE kernel. This kernel performs normalization and rotary embedding calculations in a single pass, keeping the data in the L2 cache and reducing global memory bandwidth consumption (a reference for the math is sketched below).
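
For reference, the arithmetic the fused kernel performs looks roughly like the following PyTorch function. This is an eager, out-of-place approximation with placeholder shapes; the production kernel does the same math in a single in-place pass.

```python
import torch


def qk_norm_rope_reference(q, k, q_w, k_w, cos, sin, eps=1e-6):
    """Reference for the fused kernel's math: RMSNorm then RoPE applied to Q and K in one pass.

    q, k: [tokens, heads, head_dim]; q_w, k_w: [head_dim]; cos, sin: [tokens, 1, head_dim].
    """
    def rms_norm(t, w):
        return t * torch.rsqrt(t.float().pow(2).mean(-1, keepdim=True) + eps).to(t.dtype) * w

    def rotate_half(t):
        t1, t2 = t.chunk(2, dim=-1)
        return torch.cat((-t2, t1), dim=-1)

    q, k = rms_norm(q, q_w), rms_norm(k, k_w)
    q = q * cos + rotate_half(q) * sin
    k = k * cos + rotate_half(k) * sin
    return q, k
```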

Hiding latency with overlap: While the shared expert computation itself saw negligible speedup, we effectively hid its cost. By using separate NVIDIA CUDA streams, we scheduled the shared expert computation to execute asynchronously alongside the router logits and TopK calculation. This parallelism ensures that the GPU's compute units (streaming multiprocessors, or SMs) remain saturated even while the routing logic is being resolved.
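
The overlap pattern can be reproduced with PyTorch CUDA streams, as in the hedged sketch below; router, shared_expert, and the stream handling are illustrative placeholders rather than the production scheduling code.

```python
import torch

shared_stream = torch.cuda.Stream()  # side stream dedicated to the shared expert


def gate_with_overlapped_shared_expert(x, router, shared_expert, top_k=6):
    """Run the shared expert on a side stream while the default stream computes
    router logits and top-k selection, then synchronize before combining results."""
    shared_stream.wait_stream(torch.cuda.current_stream())   # x must be ready before the side stream reads it
    with torch.cuda.stream(shared_stream):
        shared_out = shared_expert(x)                         # overlaps with the routing work below

    logits = router(x)                                        # default stream: gating
    weights, experts = torch.topk(logits, top_k, dim=-1)

    torch.cuda.current_stream().wait_stream(shared_stream)    # join before consuming shared_out
    shared_out.record_stream(torch.cuda.current_stream())     # keep the caching allocator's bookkeeping correct
    return weights, experts, shared_out
```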

These targeted kernel optimizations reduced the total time per transformer layer in a prefill iteration from 3.4 ms to 2.5 ms, a 1.3x speedup over baseline H100 performance. This latency reduction directly translated to higher supportable concurrency, allowing us to serve more users per GPU while maintaining the strict <1000 ms time to first token (TTFT) and <15 ms inter-token latency (ITL) SLAs, as shown in Figure 2 below.

Line chart titled “Performance impact of kernel optimizations on Sarvam 30B model,” plotting tokens per second per GPU against tokens per second per user for optimized kernels versus the unoptimized baseline. The optimized line stays above the baseline at all concurrency points; at 75 tokens per second per user, optimized kernels reach about 1,255 TPS/GPU versus about 998 TPS/GPU for the baseline, a 1.26x improvement.
Figure 2. Performance gains from kernel optimizations across various concurrency points. In focus is the performance gain at the 75 TPS/user point, where kernel optimizations deliver a 1.26x improvement in overall token throughput per GPU.

How mixed prefill and decode scheduling improves GPU utilization

While kernel-level optimizations improve individual operation latency, significant efficiency gains can also be achieved at the scheduler level, both for aggregated serving (prefill and decode run on the same GPU) and for disaggregated serving (prefill and decode run on different GPUs).

The default scheduling strategy for aggregated serving in the SGLang engine is to strictly serialize the prefill and decode phases. In this default mode, the GPU processes a batch of prefills, finishes them, and only then switches to processing decodes. While this simplifies memory management, it often results in suboptimal GPU utilization. Prefills are typically compute-bound (dense matrix multiplications), while decodes are memory-bound (loading the KV cache). Serializing them means the GPU's Tensor Cores are underutilized during decode phases, and memory bandwidth may be underutilized during prefill phases, particularly at the low concurrency operating point imposed by the tight SLA requirements.

To address this, we enabled a mixed batching strategy. This approach allows the SGLang scheduler to combine prefill tokens and decode tokens within the same batch or compute chunk. By processing a chunk of prefill tokens alongside ongoing decode requests, we achieve a complementary resource profile on the GPU. This optimization introduces a subtle tradeoff: mixing heavy prefill chunks into the decode stream can increase inter-token latency (ITL) for the active decode requests, as they must wait for the shared compute resources.

However, for the Sarvam 30B workload, we observed that this impact was marginal and well within our 15 ms ITL SLA. In exchange, the end-to-end request latency improved significantly due to the reduction in queue times. By clearing the prefill queue faster (piggybacking on decodes), we reduced the time requests spent waiting to start, ultimately driving up total system throughput by 15%. This scheduling optimization is quite favorable in the high-ISL, low-OSL scenario of interest here. For more decode-heavy cases, it may be worthwhile to pick smaller mixed chunk sizes or disable mixing altogether. A simplified sketch of the mixed-batching idea follows.
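
The sketch below is a deliberately simplified view of mixed chunking: each engine step packs every active decode token plus as many queued prefill tokens as fit under a token budget. The queue layout, token budget, and function are illustrative; the real logic lives inside SGLang's scheduler.

```python
from collections import deque


def build_mixed_batch(prefill_queue: deque, decode_requests: list, token_budget: int = 8192):
    """Pack one engine step: all active decode tokens plus as many queued prefill
    tokens as fit in the remaining budget (simplified illustration of mixed chunking)."""
    batch = [("decode", req_id, 1) for req_id in decode_requests]  # 1 new token per decode request
    budget = token_budget - len(decode_requests)
    while prefill_queue and budget > 0:
        req_id, remaining = prefill_queue[0]
        chunk = min(remaining, budget)
        batch.append(("prefill", req_id, chunk))
        budget -= chunk
        if chunk == remaining:
            prefill_queue.popleft()                                # this request's prefill is finished
        else:
            prefill_queue[0] = (req_id, remaining - chunk)         # keep the rest for the next step
    return batch


# Example: two queued prefills (3,584 tokens each) mixed with 32 active decodes.
queue = deque([("req-a", 3584), ("req-b", 3584)])
print(build_mixed_batch(queue, [f"d{i}" for i in range(32)]))
```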

Line chart titled “Impact of mixed prefill and decode chunks in SGLang aggregate serving,” plotting tokens per second per GPU against P95 request latency for separate versus mixed prefill/decode chunks. Mixed chunks deliver higher throughput at all latency points; around the 2-second latency point, mixed chunks reach roughly 1,310 TPS/GPU versus about 1,140 TPS/GPU for separate chunks, an approximately 1.15x improvement.
Figure 3. The impact of mixed chunk scheduling, with 15% token throughput gains seen at the 2-second request latency point.

How disaggregated serving removes the critical path and boosts throughput 1.5x

Despite the kernel and scheduling improvements, our profiling indicated that inter-GPU communication for token distribution (expert parallelism) remained on the critical path. Since the Sarvam 30B model (optimized with FP8 precision) fits comfortably within a single NVIDIA H100 SXM GPU's memory, we pivoted from model parallelism to disaggregated serving.

We reconfigured the setup to use a 1P+1D strategy via the SGLang router: dedicating one NVIDIA H100 SXM GPU exclusively to prefill and another to decode. This approach eliminated the overhead of routing tokens between GPUs during the forward pass. The result was immediate: we observed a sharp reduction in TTFT (as prefill workers ran uninterrupted) and a significant increase in per-user decode throughput (1.5x over baseline H100 performance), proving that for this model size, pipeline separation outweighs the benefits of aggregated memory capacity.

Line chart titled “Performance impact of disaggregated serving with NVIDIA H100 SXM GPUs on Sarvam 30B model,” plotting tokens per second per GPU against tokens per second per user for three configurations: the unoptimized aggregate EP2 baseline, the optimized aggregate EP2 configuration, and the disaggregated 1P+1D optimized configuration. The disaggregated configuration delivers the highest throughput at every point; at 75 tokens per second per user it reaches about 1,995 TPS/GPU versus about 998 TPS/GPU for the baseline, roughly a 2x improvement.
Figure 4. The benefits of disaggregated serving on NVIDIA H100 SXM for the Sarvam 30B model

The end-to-end impact of kernel, scheduling, and disaggregation optimizations

Figure 5 below summarizes the end-to-end performance speedup we were able to achieve through a combination of kernel and scheduling optimizations. We also observe that disaggregated serving is the most efficient configuration for this model, this ISL/OSL workload pattern, and the specific TTFT and ITL SLAs.

Bar chart titled “Performance optimization journey for Sarvam 30B on NVIDIA H100 SXM,” showing token throughput ratio at 75 tokens per second per user across serving configurations: baseline aggregated serving at 1.00, aggregated serving with optimized kernels (MoE GEMM shape tuning, router kernel optimization, fused normalization with RoPE, and shared expert overlap) at 1.26, aggregated serving with optimized kernels and mixed prefill/decode chunking at 1.31, and disaggregated 1P+1D serving with optimized kernels at 2.00.
Figure 5. Progressive improvements in Sarvam 30B model inference on NVIDIA H100 SXM through a combination of kernel optimizations, scheduling optimizations, and disaggregated serving.

Running the Sarvam 30B model on NVIDIA Blackwell GPUs

The NVIDIA Blackwell architecture is designed to accelerate generative AI. The NVIDIA Blackwell GPU delivers up to 20 PFLOPS of peak FP4 compute and 8 TB/s of memory bandwidth, a significant leap over the NVIDIA H100 GPU's capabilities. This throughput is driven by the second-generation Transformer Engine, which uses the new NVFP4 format to provide over 2x the performance of FP8 while maintaining high model accuracy.

To take advantage of these capabilities in the Sarvam models, we used the NVIDIA Model Optimizer to quantize the base BF16 model to the NVFP4 format. Unlike the multi-GPU H100 configurations above, the NVIDIA HGX B200 system was able to serve the Sarvam 30B model most efficiently with just one Blackwell GPU. By combining the kernel and scheduling optimizations for the model with NVIDIA Blackwell's NVFP4 compute throughput, we were able to realize a 4x increase in inference serving throughput at the 75 tokens per second per user operating point.
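
A hedged sketch of this post-training quantization step is shown below, based on the TensorRT Model Optimizer (modelopt) PTQ API. The checkpoint path, calibration prompts, and the NVFP4_DEFAULT_CFG config name are assumptions to verify against your installed modelopt version and the specifics of the deployment; it is not Sarvam AI's production pipeline.

```python
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint path; the Sarvam checkpoints are not assumed to be public.
model = AutoModelForCausalLM.from_pretrained("path/to/sarvam-30b-bf16", torch_dtype="bfloat16").cuda()
tokenizer = AutoTokenizer.from_pretrained("path/to/sarvam-30b-bf16")


def calibrate(m):
    """Run a handful of representative prompts so activation ranges can be observed."""
    for prompt in ["उदाहरण संकेत", "Example calibration prompt"]:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)


# NVFP4_DEFAULT_CFG is the NVFP4 recipe in recent modelopt releases
# (assumption: verify the config name for your installed version).
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=calibrate)
```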

As shown in Figure 6 below, the NVIDIA Blackwell GPU enables high performance at low latency thanks to its superior compute, as well as exceptional throughput at higher concurrencies thanks to its memory capacity advantage.

Line chart titled “Performance comparison between NVIDIA B200 and NVIDIA H100 SXM for Sarvam 30B inference,” plotting tokens per second per GPU against tokens per second per user for NVIDIA B200 (NVFP4, aggregate, one GPU) versus NVIDIA H100 SXM (FP8, disaggregated 1P+1D). The B200 line is higher at every operating point; at 100 tokens per second per user it reaches about 3,571 TPS/GPU versus about 1,274 TPS/GPU for H100, roughly a 2.8x advantage, and it maintains a 2x advantage at 75 tokens per second per user.
Figure 6. The NVIDIA Blackwell GPU offers 2.8x higher token throughput than the NVIDIA H100 SXM GPU at the 100 TPS/user operating point.

Learn more

Together, this work shows what is possible when model design, kernel engineering, scheduling strategy, quantization, and GPU architecture are treated as a single system rather than isolated components. By co-optimizing across the full stack, Sarvam AI and NVIDIA delivered substantial gains in throughput and latency while maintaining the strict TTFT and inter-token latency targets required for real-world deployment.

The result shouldn’t be only a faster model, but a more economically viable and sovereign-ready inference stack that scales to national workloads. These learnings provide a blueprint for other teams constructing large, production-grade AI systems on NVIDIA platforms.

More details about Sarvam AI's models can be found here.

To start exploring your own sovereign AI model strategy, check out the NVIDIA Nemotron framework and libraries for training, fine-tuning, and deploying models on local infrastructure.

Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.

And read more about NVIDIA Cloud Functions, NVIDIA's multi-cloud, high-performance AI inference solution, here.


