As global AI adoption accelerates, developers face a growing challenge: delivering large language model (LLM) performance that meets real-world latency and cost requirements. Running models with tens of billions of parameters in production, especially for conversational or voice-based AI agents, demands high throughput, low latency, and predictable service-level performance. For startups building sovereign AI models from scratch, these challenges are amplified by the need to balance model scale and accuracy with infrastructure efficiency—while also maintaining data sovereignty and cost control.
Sarvam AI, a generative AI startup based in Bengaluru, India, set out to build large, multilingual, multimodal foundation models that serve its country’s diverse population, support nearly two dozen languages, and keep model development and data governance fully under India’s sovereign control. To meet strict latency targets and improve inference efficiency for its flagship Sovereign 30B model, Sarvam AI collaborated with NVIDIA to co-design hardware and software optimizations.
This collaboration delivered a 4x speedup in inference performance on NVIDIA Blackwell over baseline NVIDIA H100 GPUs and established a path for deployment on the next-generation NVIDIA Blackwell architecture. The end-to-end performance boost was achieved through kernel and scheduling optimizations on NVIDIA H100 SXM GPUs that contributed a 2x speedup. That was combined with the powerful compute capabilities of Blackwell, together with NVFP4 weight quantization, for an additional 2x speedup, with an even larger performance gain of 2.8x seen at higher interactivity points.
NVIDIA engineers helped Sarvam AI build 3B, 30B, and 100B foundation models, and optimize a new family of sovereign foundation models trained using NVIDIA Nemotron libraries, including the NVIDIA NeMo Framework and NVIDIA NeMo-RL. These models support 22 Indian languages, English, math, and code. They demonstrate how developer teams can leverage NVIDIA’s full-stack AI platform—from data to deployment—to achieve state-of-the-art performance and localized AI capabilities.
This post walks through the joint engineering effort and shares benchmarks for the speedups achieved on the NVIDIA H100, the most widely deployed NVIDIA GPU in India. We also provide an early look at how these workloads are being adapted for the NVIDIA Blackwell architecture.
Making multilingual sovereign AI scalable with MoE
To deliver sovereign-scale intelligence with high efficiency, the Sarvam AI models employ a sophisticated heterogeneous mixture-of-experts (MoE) architecture tailored for deep reasoning and linguistic density. These models were pretrained from scratch at the 3B, 30B, and 100B scales using the NVIDIA NeMo framework and NVIDIA Megatron-LM. NeMo-RL was used for post-training workflows, including long-context reasoning.
Sarvam 30B uses a 19-layer depth (1 dense + 18 MoE) with 128 experts and a top-6 routing strategy, relying on grouped query attention (GQA) to balance memory bandwidth with generation quality.
Sarvam 100B scales this design to 32 layers (1 dense + 31 MoE) and employs top-8 routing over 128 experts with a larger MoE FFN hidden size of 2048. The 100B model also adopts multi-head latent attention (MLA)—similar to DeepSeek-V3—to aggressively compress the key-value (KV) cache, enabling massive context windows without the memory penalties of standard attention.
Both models feature a shared expert design in which a dedicated expert handles common features while routed experts tackle specialized tasks. This combination of high active parameter counts (via top-6/top-8 routing) and complex memory access patterns created a unique serving challenge, necessitating the deep kernel optimizations on NVIDIA Hopper and NVIDIA Blackwell GPUs detailed below.
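To make the routing design concrete, below is a minimal PyTorch sketch of a shared-expert MoE block with top-k gating. The dimensions, activation, and gating details are illustrative assumptions for exposition, not Sarvam's actual implementation (which relies on optimized Grouped GEMM kernels rather than a Python loop over experts).

```python
# Minimal sketch of shared-expert MoE routing with top-k gating.
# Sizes and module names are illustrative only, not Sarvam's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertMoE(nn.Module):
    def __init__(self, hidden=4096, ffn=2048, num_experts=128, top_k=6):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts, bias=False)   # gating network
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))
             for _ in range(num_experts)]
        )
        self.shared_expert = nn.Sequential(                        # always-on expert
            nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden)
        )
        self.top_k = top_k

    def forward(self, x):                        # x: [tokens, hidden]
        logits = self.router(x)                  # [tokens, num_experts]
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the selected experts
        out = self.shared_expert(x)              # shared expert handles common features
        for k in range(self.top_k):              # routed experts handle specialized tokens
            for e in idx[:, k].unique().tolist():
                mask = idx[:, k] == e
                out[mask] = out[mask] + weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```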
The performance challenge: SLAs and baseline configuration on NVIDIA H100
Optimizing the Sarvam 30B model wasn’t just about raw speed; it was about maximizing density under strict latency constraints. For the applications served by this model—voice-to-voice agents—we established the following service level agreements (SLAs):
- P95 (95th percentile) time to first token (TTFT): < 1000 ms
- P95 (95th percentile) inter-token latency (ITL): < 15 ms
P95 (95th percentile) latency in inference performance testing indicates that 95% of served requests complete faster than this threshold, while the slowest 5% take longer. It is a critical tail-latency metric used to evaluate user experience and system stability, ensuring that even under load, most users experience no more than a specified delay. The engineering goal was to maximize the inference server’s token throughput (concurrently served requests) without breaching these P95 targets.
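For concreteness, here is a minimal sketch of how the P95 values can be computed from measured per-request latencies; the sample numbers are placeholders, not benchmark data.

```python
# Computing the P95 tail-latency metrics used as SLA targets above.
# The latency arrays are placeholders; in practice they come from load-test logs.
import numpy as np

ttft_ms = np.array([412.0, 387.5, 905.3, 640.2, 1103.8])   # per-request time to first token
itl_ms = np.array([9.8, 11.2, 13.4, 10.1, 14.9])           # per-token inter-token latencies

p95_ttft = np.percentile(ttft_ms, 95)
p95_itl = np.percentile(itl_ms, 95)

print(f"P95 TTFT: {p95_ttft:.1f} ms (SLA < 1000 ms: {'PASS' if p95_ttft < 1000 else 'FAIL'})")
print(f"P95 ITL:  {p95_itl:.1f} ms (SLA < 15 ms:   {'PASS' if p95_itl < 15 else 'FAIL'})")
```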
For the initial performance evaluation, the Sarvam AI and NVIDIA teams selected the SGLang inference engine. Unlike standard serving frameworks that treat the KV cache as a linear buffer, SGLang implements RadixAttention—a mechanism that manages the KV cache as a radix tree. This was critical for the Sarvam 30B architecture; RadixAttention enables automatic prefix sharing, allowing the shared expert context and system prompts to be computed once and reused across concurrent requests. In addition, SGLang’s cache-aware scheduler maximizes the hit rate of these shared prefixes, significantly reducing redundant memory operations during the prefill phase.
The Sarvam AI and NVIDIA teams modeled a production traffic profile characterized by an average input sequence length (ISL) of 3,584 tokens and an output sequence length (OSL) of 128 tokens. Guided by internal simulation data, we deployed the model on two NVIDIA H100 SXM GPUs with a parallelism strategy designed to balance the distinct memory and compute requirements of the MoE layers:
- Expert parallelism (EP=2) for the expert weights. This configuration uses Grouped GEMM kernels to maximize compute density and ensures that the large expert weights reside in HBM, reducing the cost of expert routing.
- Data parallelism (DP=2) for the attention weights with --enable-dp-attention. This enabled us to parallelize attention computation across parallel batches, significantly boosting the aggregate throughput of the prefill phase. (A hedged launch sketch of this baseline configuration follows this list.)
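For reference, a launch of this baseline configuration could look like the sketch below. Only --enable-dp-attention is taken directly from the text above; the model path and the remaining flag names (--tp-size, --dp-size, --ep-size, and so on) are assumptions based on common SGLang server arguments and should be checked against the SGLang version in use.

```python
# Hedged sketch: launching the two-GPU H100 baseline described above with SGLang.
# The model path and most flag names are assumptions; verify against your SGLang version.
import subprocess

cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "sarvam/sarvam-30b",   # hypothetical model identifier
    "--tp-size", "2",                      # two NVIDIA H100 SXM GPUs
    "--ep-size", "2",                      # expert parallelism (EP=2) for the MoE weights
    "--dp-size", "2",                      # data parallelism (DP=2) for attention
    "--enable-dp-attention",               # flag referenced in the text above
    "--host", "0.0.0.0",
    "--port", "30000",
]
subprocess.run(cmd, check=True)
```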
While this configuration provided a solid functional baseline, profiling revealed that satisfying the sub-second TTFT at high concurrency required deeper optimization, leading us to the specific kernel and precision strategies detailed below.
From profiling to performance: eliminating MoE bottlenecks
Simulation data indicated that a concurrency range of 32 to 64 requests would offer the best chance of meeting the SLA requirements. To identify the specific bottlenecks limiting token throughput in this concurrency range, the NVIDIA and Sarvam AI teams used NVIDIA Nsight Systems to capture execution traces of both the prefill and decode phases at a concurrency of 32 requests. We then processed the traces to extract the microsecond-level latency contribution of each kernel within a single transformer layer.
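As an illustration of this trace-processing step, the sketch below aggregates per-kernel GPU time from a CSV report exported with nsys stats. The report name, output filename, and column headers follow common Nsight Systems output but may differ across versions, so treat them as assumptions rather than the team's exact tooling.

```python
# Hedged sketch: summarizing per-kernel GPU time from an Nsight Systems trace.
# Assumes the trace was exported with something like:
#   nsys stats --report cuda_gpu_kern_sum --format csv --output kernels profile.nsys-rep
# Column names below follow common nsys CSV output and may differ across versions.
import pandas as pd

df = pd.read_csv("kernels_cuda_gpu_kern_sum.csv")    # hypothetical export filename

summary = (
    df[["Name", "Instances", "Total Time (ns)"]]
    .assign(total_us=lambda d: d["Total Time (ns)"] / 1_000)   # convert to microseconds
    .sort_values("total_us", ascending=False)
    .head(15)                                        # the most expensive kernels
)
print(summary.to_string(index=False))
```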
The profiling revealed that while the heavy general matrix multiplication (GEMM) operations (experts and attention) were performing well, significant latency bubbles existed in the non-compute-intensive operations—specifically in the MoE routing logic and positional embedding calculations. These operations suffered from kernel launch overheads and redundant memory reads.


Following these observations, we executed a targeted optimization strategy across three axes: kernel optimizations, scheduling efficiency, and disaggregated serving.
Speeding up the transformer layer by 1.34x with kernel-level optimizations
The NVIDIA and Sarvam AI teams systematically targeted the most expensive kernels identified in the trace, replacing standard PyTorch implementations with fused, architecture-specific kernels. We first implemented the models using a baseline SGLang implementation on H100 GPUs and then optimized them to achieve significant speedups, as detailed in Table 1 below and in the accompanying text.
| Kernel | Baseline time (microseconds) | Optimized time (microseconds) | Optimization applied |
| --- | --- | --- | --- |
| RMSNorm + Prepare QKV | 186 | 185 | N/A |
| QK Norm + RoPE | 414 | 54 | Use optimized fused in-place query-key normalization kernel |
| Attention | 322 | 296 | Use FA3 for prefill, FlashInfer backend for decode |
| Post-attention linear projection | 114 | 112 | N/A |
| AllReduce | 252 | 250 | N/A |
| Router logits and TopK | 560 | 134 | Use fused TopK implementation; ReplicatedLinear block for router logits |
| Routed experts computation | 1103 | 1080 | Tune kernel parameters for the DEP2 configuration (64 experts per GPU) |
| Shared expert computation | 216 | 215 | Overlap with TopK using NVIDIA CUDA streams |
| AllReduce | 265 | 249 | N/A |
| Total layer time | 3432 | 2575 | 1.34x faster prefill overall |
MoE routing (4.1x faster than baseline H100 performance): The most significant bottleneck identified was the MoE routing mechanism. In the baseline, computing router logits and performing TopK selection involved multiple kernel launches and redundant memory round-trips.
- Optimization: We implemented a fused TopK kernel that fuses the logit computation and selection logic into a single CUDA kernel. We also used a ReplicatedLinear block for the router logits. Because the router weights are small, replicating them across GPUs eliminates the need for expensive communication during the gating phase, keeping the operation purely compute-bound (a conceptual sketch follows this item).
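The snippet below is a conceptual PyTorch rendering of the work the fused kernel collapses: a small replicated router GEMM, softmax, and top-k selection. In eager PyTorch these are still separate kernel launches; the production optimization performs the equivalent steps in one CUDA kernel. Shapes are illustrative.

```python
# Conceptual (unfused) view of the MoE gating path that the fused TopK kernel replaces.
# Each line below is a separate kernel launch in eager PyTorch; the optimized version
# does the equivalent work in a single fused CUDA kernel. Shapes are illustrative.
import torch

tokens, hidden, num_experts, top_k = 4096, 4096, 128, 6
x = torch.randn(tokens, hidden, device="cuda", dtype=torch.bfloat16)
router_w = torch.randn(num_experts, hidden, device="cuda", dtype=torch.bfloat16)  # replicated on every GPU

logits = x @ router_w.t()                                    # ReplicatedLinear: no inter-GPU comms
probs = torch.softmax(logits.float(), dim=-1)                # gating probabilities
topk_vals, topk_idx = torch.topk(probs, top_k, dim=-1)       # per-token expert selection
topk_vals = topk_vals / topk_vals.sum(dim=-1, keepdim=True)  # renormalize routing weights
```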
Fusing positional embeddings (7.6x faster than baseline H100 performance): The baseline implementation of query-key (QK) norm, followed by rotary positional embeddings (RoPE), required reading and writing the large KV cache twice.
- Optimization: We deployed a custom fused in-place QK norm + RoPE kernel. This kernel performs normalization and rotary embedding calculations in a single pass, keeping the data in the L2 cache and reducing global memory bandwidth consumption (a reference sketch of the unfused path follows this item).
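For clarity, here is a reference PyTorch rendering of the unfused path, assuming a standard RMSNorm-style QK norm and the rotate-half RoPE formulation; the shapes and epsilon are illustrative. The custom kernel performs both steps in one in-place pass instead of the multiple global-memory round-trips shown here.

```python
# Reference (unfused) QK-norm + RoPE path, shown to clarify what the custom kernel fuses.
# Each step reads and writes q/k from global memory; the fused in-place kernel does both
# in a single pass. Shapes and eps are illustrative.
import torch

def rms_norm(t, weight, eps=1e-6):
    return t * torch.rsqrt(t.pow(2).mean(-1, keepdim=True) + eps) * weight

def apply_rope(t, cos, sin):
    # rotate-half formulation of rotary positional embeddings
    half = t.shape[-1] // 2
    t1, t2 = t[..., :half], t[..., half:]
    rotated = torch.cat((-t2, t1), dim=-1)
    return t * cos + rotated * sin

# q, k: [batch, seq, heads, head_dim]; cos, sin: [seq, 1, head_dim] (broadcastable)
def qk_norm_rope(q, k, q_w, k_w, cos, sin):
    q = apply_rope(rms_norm(q, q_w), cos, sin)   # two memory round-trips per tensor
    k = apply_rope(rms_norm(k, k_w), cos, sin)
    return q, k
```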
Hiding latency with overlap: While the shared expert computation itself saw negligible speedup, we effectively hid its cost. Using separate CUDA streams, we scheduled the shared expert computation to execute asynchronously alongside the router logits and TopK calculation. This parallelism keeps the GPU’s compute units (streaming multiprocessors, or SMs) saturated even while the routing logic is being resolved.
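The sketch below illustrates this overlap pattern with PyTorch CUDA streams. The module and tensor names are placeholders for exposition; the production implementation lives inside the serving engine's MoE layer rather than in Python.

```python
# Hedged sketch: overlapping the shared-expert GEMMs with router logits + TopK using
# CUDA streams. `router`, `shared_expert`, and `x` are placeholders, not Sarvam's code.
import torch

side_stream = torch.cuda.Stream()

def moe_gate_with_overlap(x, router, shared_expert, top_k=6):
    # Make the side stream wait for any pending work that produced `x`.
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        shared_out = shared_expert(x)              # runs concurrently with routing below

    logits = router(x)                             # default stream: router logits
    weights, idx = torch.topk(logits, top_k, -1)   # default stream: expert selection

    # Re-synchronize before the routed-expert computation consumes shared_out.
    torch.cuda.current_stream().wait_stream(side_stream)
    return shared_out, weights, idx
```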
These targeted kernel optimizations reduced the total time per transformer layer in a prefill iteration from 3.4 ms to 2.5 ms, a 1.3x speedup over baseline H100 performance. This latency reduction translated directly into higher supportable concurrency, allowing us to serve more users per GPU while maintaining the strict < 1000 ms time to first token (TTFT) and < 15 ms inter-token latency (ITL) SLAs, as shown in Figure 2 below.


How mixed prefill and decode scheduling improves GPU utilization
While kernel-level optimizations improve the latency of individual operations, significant efficiency gains can be achieved at the scheduler level by optimizing aggregated serving (prefill and decode run on the same GPU) and disaggregated serving (prefill and decode run on different GPUs).
The default scheduling strategy for aggregated serving in the SGLang engine is to strictly serialize the prefill and decode phases. In this default mode, the GPU processes a batch of prefills, finishes them, and only then switches to processing decodes. While this simplifies memory management, it often results in suboptimal GPU utilization. Prefills are typically compute-bound (dense matrix multiplications), while decodes are memory-bound (loading the KV cache). Serializing them means the GPU’s Tensor Cores are underutilized during decode phases, and memory bandwidth may be underutilized during prefill phases, particularly at the low concurrency operating point imposed by the tight SLA requirements.
To address this, we enabled a mixed batching strategy. This approach allows the SGLang scheduler to combine prefill tokens and decode tokens within the same batch or compute chunk. By processing a chunk of prefill tokens alongside ongoing decode requests, we achieve a complementary resource profile on the GPU. This optimization introduces a subtle tradeoff: mixing heavy prefill chunks into the decode stream can increase inter-token latency (ITL) for the active decode requests, as they must wait for shared compute resources.
However, for the Sarvam 30B workload, we observed that this impact was marginal and well within our 15 ms ITL SLA. In exchange, end-to-end request latency improved significantly due to the reduction in queue times. By clearing the prefill queue faster (piggybacking on decodes), we reduced the time requests spent waiting to start, ultimately driving up total system throughput by 15%. This scheduling optimization is quite favorable in the high-ISL, low-OSL scenario of interest here. For more decode-heavy cases, it may be worthwhile to choose smaller mixed chunk sizes or to disable mixed batching altogether.
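As a hedged illustration, mixed batching could be enabled on top of the earlier baseline launch sketch roughly as follows. The --enable-mixed-chunk and --chunked-prefill-size flags follow common SGLang server arguments but should be verified against the version in use, and the chunk size shown is illustrative rather than the tuned production value.

```python
# Hedged sketch: enabling mixed prefill/decode batching on top of the baseline launch
# command (`cmd`) from the earlier sketch. Flag names and the chunk size are assumptions.
mixed_batching_args = [
    "--enable-mixed-chunk",            # let prefill chunks ride along with decode batches
    "--chunked-prefill-size", "4096",  # cap prefill tokens per chunk to bound the ITL impact
]
launch_cmd = cmd + mixed_batching_args
```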


How disaggregated serving removes communication from the critical path and boosts throughput 1.5x
Despite the kernel and scheduling improvements, our profiling indicated that inter-GPU communication for token distribution (expert parallelism) remained on the critical path. Because the Sarvam 30B model (optimized with FP8 precision) fits comfortably within a single NVIDIA H100 SXM GPU’s memory, we pivoted from model parallelism to disaggregated serving.
We reconfigured the setup to use a 1P+1D strategy via the SGLang router: dedicating one NVIDIA H100 SXM GPU exclusively to prefill and another to decode. This approach eliminated the overhead of routing tokens between GPUs during the forward pass. The result was immediate: we observed a sharp reduction in TTFT (as prefill workers ran uninterrupted) and a significant increase in per-user decode throughput (1.5x over baseline H100 performance), showing that for this model size, pipeline separation outweighs the benefits of aggregated memory capacity.
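A hedged sketch of what such a 1P+1D deployment could look like follows: two single-GPU SGLang workers (one for prefill, one for decode) fronted by the SGLang router. The disaggregation and router flag names are assumptions modeled on SGLang's prefill/decode disaggregation support and should be verified against the documentation for the version in use.

```python
# Hedged sketch of a 1P+1D disaggregated deployment: one worker dedicated to prefill,
# one to decode, fronted by the SGLang router. Flag names are assumptions; check the
# SGLang docs for the exact arguments supported by your version.
import subprocess

prefill = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", "sarvam/sarvam-30b",        # hypothetical model identifier
    "--disaggregation-mode", "prefill",         # GPU 0: prefill only
    "--base-gpu-id", "0", "--port", "30001",
])
decode = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", "sarvam/sarvam-30b",
    "--disaggregation-mode", "decode",          # GPU 1: decode only
    "--base-gpu-id", "1", "--port", "30002",
])
router = subprocess.Popen([
    "python", "-m", "sglang_router.launch_router",
    "--pd-disaggregation",                      # route requests prefill -> decode
    "--prefill", "http://127.0.0.1:30001",
    "--decode", "http://127.0.0.1:30002",
])
```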


The end-to-end impact of kernel, scheduling, and disaggregation optimizations
Figure 5 below summarizes the end-to-end performance speedup we achieved through a combination of kernel and scheduling optimizations. We also observe that disaggregated serving is the most favorable configuration for this model, this ISL/OSL workload pattern, and the specific TTFT and ITL SLAs.


Running the Sarvam 30B model on NVIDIA Blackwell GPUs
The NVIDIA Blackwell architecture is designed to accelerate generative AI. The NVIDIA Blackwell GPU delivers up to 20 PFLOPS of peak FP4 compute and 8 TB/s of memory bandwidth, representing a major leap over the NVIDIA H100 GPU’s capabilities. This throughput is driven by the second-generation Transformer Engine, which uses the new NVFP4 format to deliver over 2x the performance of FP8 while maintaining high model accuracy.
To take advantage of these capabilities in the Sarvam models, we used NVIDIA Model Optimizer to quantize the base BF16 model to the NVFP4 format. Unlike the multi-GPU H100 configuration, the NVIDIA HGX B200 was able to serve the Sarvam 30B model most efficiently with just one Blackwell GPU. By combining the kernel and scheduling optimizations for the model with NVIDIA Blackwell’s NVFP4 compute throughput, we were able to realize a 4x increase in inference serving throughput at the 75 tokens per second per user operating point.
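As a sketch of the quantization step, the example below uses the post-training quantization API of NVIDIA TensorRT Model Optimizer with an NVFP4 configuration. The configuration name, export helper, model identifier, and calibration data are assumptions to be checked against the installed Model Optimizer version; Sarvam's actual calibration pipeline is not described here.

```python
# Hedged sketch: NVFP4 post-training quantization with NVIDIA TensorRT Model Optimizer.
# Config/export helper names follow the Model Optimizer API (NVFP4_DEFAULT_CFG,
# export_hf_checkpoint) but should be verified; model name and calibration data are placeholders.
import torch
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sarvam/sarvam-30b"                       # hypothetical HF identifier
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def calibrate(m):
    # Run a small calibration set through the model so activation scales can be measured.
    with torch.no_grad():
        for text in ["placeholder calibration sample"] * 16:
            inputs = tokenizer(text, return_tensors="pt").to(m.device)
            m(**inputs)

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=calibrate)
export_hf_checkpoint(model, export_dir="sarvam-30b-nvfp4")  # checkpoint for the serving engine
```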
As shown in Figure 6 below, the NVIDIA Blackwell GPU enables high performance at low latency thanks to its superior compute, as well as exceptional throughput at higher concurrencies thanks to its memory capacity advantage.


Learn more
Together, this work shows what is possible when model design, kernel engineering, scheduling strategy, quantization, and GPU architecture are treated as a single system rather than isolated components. By co-optimizing across the full stack, Sarvam AI and NVIDIA delivered substantial gains in throughput and latency while maintaining the strict TTFT and inter-token latency targets required for real-world deployment.
The result is not just a faster model, but a more economically viable and sovereign-ready inference stack that scales to national workloads. These learnings provide a blueprint for other teams building large, production-grade AI systems on NVIDIA platforms.
More details about Sarvam AI’s models can be found here.
To start exploring your own sovereign AI model strategy, check out the NVIDIA Nemotron framework and libraries for training, fine-tuning, and deploying models on local infrastructure.
Stay up to date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.
And read more about NVIDIA Cloud Functions, NVIDIA’s multi-cloud, high-performance AI inference solution, here.
