How NVIDIA's Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI's Sovereign Models

As global AI adoption accelerates, developers face a growing challenge: delivering large language model (LLM) performance that meets real-world latency and cost requirements. Running models with tens of billions of parameters in production, especially for conversational or voice-based AI agents, demands high throughput, low latency, and predictable service-level performance. For startups building sovereign AI models from scratch, these challenges are amplified by the need to balance model scale and accuracy with infrastructure efficiency—while also maintaining data sovereignty and cost control.

Sarvam AI, a generative AI startup based in Bengaluru, India, set out to build large, multilingual, multimodal foundation models that serve the country's diverse population, support nearly two dozen languages, and keep model development and data governance fully under India's sovereign control. To meet strict latency targets and improve inference efficiency for its flagship sovereign 30B model, Sarvam AI collaborated with NVIDIA to co-design hardware and software optimizations.

This collaboration delivered a 4x speedup in inference performance on NVIDIA Blackwell over baseline NVIDIA H100 GPUs and established a path for deployment on the next-generation NVIDIA Blackwell architecture. The end-to-end performance boost was achieved through kernel and scheduling optimizations on NVIDIA H100 SXM GPUs, which contributed a 2x speedup. That was combined with the powerful compute capabilities of Blackwell, together with NVFP4 weight quantization, for an additional 2x speedup, with an even larger performance gain of 2.8x seen at higher interactivity points.

NVIDIA engineers helped Sarvam AI build 3B, 30B, and 100B foundation models and optimize a new family of sovereign foundation models that were trained using NVIDIA Nemotron libraries, including the NVIDIA NeMo Framework and NVIDIA NeMo-RL. These models support 22 Indian languages, English, math, and code. They demonstrate how developer teams can leverage NVIDIA's full-stack AI platform—from data to deployment—to achieve state-of-the-art performance and localized AI capabilities.

This post walks through the joint engineering effort and shares benchmarks for the speedups achieved on the NVIDIA H100, the most widely deployed NVIDIA GPU in India. We also provide an early look at how these workloads are being adapted for the NVIDIA Blackwell architecture.

Making multilingual sovereign AI scalable with MoE

To deliver sovereign-scale intelligence with high efficiency, the Sarvam AI models employ a sophisticated heterogeneous mixture-of-experts (MoE) architecture tailored for deep reasoning and linguistic density. These models were pretrained from scratch at 3B, 30B, and 100B parameter scales using the NVIDIA NeMo framework and NVIDIA Megatron-LM. NeMo-RL was then used for post-training workflows for these models, including long-context reasoning.

Sarvam 30B uses a 19-layer depth (1 dense + 18 MoE) with 128 experts and a top-6 routing strategy, relying on grouped query attention (GQA) to balance memory bandwidth with generation quality.

Sarvam 100B scales this design to 32 layers (1 dense + 31 MoE) and employs top-8 routing over 128 experts with a larger MoE FFN hidden size of 2048. The 100B model also adopts multi-head latent attention (MLA)—similar to DeepSeek-V3—to aggressively compress the key-value (KV) cache, enabling massive context windows without the memory penalties of standard attention.

Both models feature a shared expert design, where a dedicated expert handles common features while routed experts tackle specialized tasks. This combination of high active parameter counts (via top-6/top-8 routing) and sophisticated memory access patterns created a unique serving challenge, necessitating the deep kernel optimizations on NVIDIA Hopper and NVIDIA Blackwell GPUs detailed below.
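
To make the routing pattern concrete, below is a minimal PyTorch sketch of a top-k MoE block with a shared expert, mirroring the top-6-of-128 layout described above. It illustrates the general technique only, not Sarvam's implementation; the class name, hidden/FFN sizes, and the naive per-expert loop are placeholders (production engines dispatch with grouped GEMM kernels, as discussed later in this post).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoEWithSharedExpert(nn.Module):
    """Minimal top-k MoE block with an always-on shared expert (illustration only)."""

    def __init__(self, hidden=2048, ffn=1024, n_experts=128, top_k=6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, n_experts, bias=False)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))
            for _ in range(n_experts)
        )
        # Dedicated shared expert that processes every token, handling common features.
        self.shared_expert = nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))

    def forward(self, x):  # x: [tokens, hidden]
        logits = self.router(x)                                 # [tokens, n_experts]
        weights, idx = torch.topk(logits, self.top_k, dim=-1)   # route each token to k experts
        weights = F.softmax(weights, dim=-1)

        routed = torch.zeros_like(x)
        for slot in range(self.top_k):                          # naive dispatch; real engines use grouped GEMMs
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                routed[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])

        return routed + self.shared_expert(x)                   # shared expert output is always added
```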

The performance challenge: SLAs and baseline configuration on NVIDIA H100

Optimizing the Sarvam 30B model wasn't just about raw speed; it was about maximizing density under strict latency constraints. For the applications served by this model—voice-to-voice agents—we established the following service-level agreements (SLAs):

  • P95 (95th percentile) time to first token (TTFT): < 1000 ms
  • P95 (95th percentile) inter-token latency (ITL): < 15 ms

P95 (95th percentile) latency in inference performance testing indicates that 95% of served requests complete faster than this threshold, while the slowest 5% take longer. It's a critical tail-latency metric used to evaluate user experience and system stability, ensuring that even under load, most users experience no more than a specific delay. The engineering goal was to maximize the inference server's token throughput across concurrently served requests without breaching these P95 targets.
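
As a concrete illustration of the metric, the sketch below computes P95 TTFT and ITL from measured samples and checks them against the budgets above. The function names and the synthetic measurements are placeholders; real samples would come from a load-test harness.

```python
import numpy as np


def p95(samples_ms):
    """95th-percentile latency: 95% of requests finish at or below this value."""
    return float(np.percentile(samples_ms, 95))


def meets_sla(ttft_ms, itl_ms, ttft_budget_ms=1000.0, itl_budget_ms=15.0):
    """Return True if measured TTFT/ITL samples stay within the P95 budgets."""
    return p95(ttft_ms) < ttft_budget_ms and p95(itl_ms) < itl_budget_ms


# Example with synthetic measurements (replace with real load-test output).
ttft = [450, 620, 710, 980, 1020]      # per-request time to first token, ms
itl = [9.2, 10.5, 11.1, 13.8, 14.6]    # per-token inter-token latency, ms
print(p95(ttft), p95(itl), meets_sla(ttft, itl))
```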

For the initial performance evaluation, the Sarvam AI and NVIDIA teams selected the SGLang inference engine. Unlike standard serving frameworks that treat the KV cache as a linear buffer, SGLang implements RadixAttention—a mechanism that manages the KV cache as a radix tree. This was critical for the Sarvam 30B architecture: RadixAttention enables automatic prefix sharing, allowing the shared expert context and system prompts to be computed once and reused across concurrent requests. In addition, SGLang's Cache-Aware Scheduler maximizes the hit rate of these shared prefixes, significantly reducing redundant memory operations during the prefill phase.

The Sarvam AI and NVIDIA teams modeled a production traffic profile characterized by an average input sequence length (ISL) of 3,584 tokens and an output sequence length (OSL) of 128 tokens. Guided by internal simulation data, we deployed the model on two NVIDIA H100 SXM GPUs with a parallelism strategy designed to balance the distinct memory and compute requirements of the MoE layers:

  • Expert parallelism (EP=2) for the expert weights. This configuration uses Grouped GEMM kernels to maximize compute density and ensures that the large expert weights reside in HBM, reducing the cost of expert routing.
  • Data parallelism (DP=2) for the attention weights with --enable-dp-attention. This enabled us to parallelize attention computation across parallel batches, significantly boosting the aggregate throughput of the prefill phase.

While this configuration provided a strong functional baseline, profiling revealed that satisfying the sub-second TTFT at high concurrency required deeper optimization, leading us to the specific kernel and precision strategies detailed below.

From profiling to performance: eliminating MoE bottlenecks

Simulation data indicated that a concurrency range of 32 to 64 requests would offer the best chance of meeting the SLA requirements. To identify the specific bottlenecks limiting token throughput in this concurrency range, the NVIDIA and Sarvam AI teams used NVIDIA Nsight Systems to capture execution traces of both the prefill and decode phases at a concurrency of 32 requests. We then processed the traces to extract the microsecond-level latency contribution of each kernel within a single transformer layer.
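
The trace post-processing can be approximated with a script like the one below, which assumes the Nsight Systems capture has been exported to a per-kernel summary CSV (for example, via nsys stats). The column names ("Kernel Name", "Duration (ns)") and the kernel-name fragments are assumptions for illustration; they depend on the export format and the serving stack.

```python
import csv
from collections import defaultdict


def kernel_latency_breakdown(csv_path, layer_kernels):
    """Sum per-kernel GPU time (in microseconds) for the kernels of one transformer layer.

    csv_path:      kernel summary exported from an Nsight Systems trace
                   (assumed columns: "Kernel Name", "Duration (ns)").
    layer_kernels: mapping of human-readable op name -> substring to match in kernel names.
    """
    totals = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            duration_us = float(row["Duration (ns)"]) / 1e3
            for op, needle in layer_kernels.items():
                if needle in row["Kernel Name"]:
                    totals[op] += duration_us
                    break
    return dict(totals)


# Hypothetical name fragments; real kernel names depend on the serving stack and kernel libraries.
ops = {"qk_norm_rope": "rope", "attention": "flash", "router_topk": "topk", "moe_gemm": "grouped_gemm"}
breakdown = kernel_latency_breakdown("prefill_kernels.csv", ops)
print(sorted(breakdown.items(), key=lambda kv: -kv[1]))
```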

The profiling revealed that while the heavy General Matrix Multiplication (GEMM) operations (experts and attention) were performing well, significant latency bubbles existed in the non-compute-intensive operations—specifically in the MoE routing logic and positional embedding calculations. These operations suffered from kernel launch overheads and redundant memory reads.

A diagram showing the Nsight Systems profiler timeline for a single transformer layer during the model's prefill phase. Stacked rows display GPU metrics (clock frequencies, GPU active time, SM active percentage, SM instructions, and SM warp occupancy) above a sequence of GPU kernel launches, with red boxes highlighting the most time-consuming operations: QK normalization and RoPE, attention, router logits with top-K selection, routed MoE expert GEMM plus GLU, and shared expert GEMM plus GLU.
Figure 1. Nsight Systems profiler timeline showing SM activity and kernel execution over time during the prefill phase, with red boxes marking the most costly kernels in the layer—QK normalization, attention, and MoE expert computation.

Following these observations, we executed a targeted optimization strategy across three axes: kernel optimizations, scheduling efficiency, and disaggregated serving.

Making transformer layers 34% faster with kernel-level optimizations

The NVIDIA and Sarvam AI teams systematically targeted the most costly kernels identified in the trace, replacing standard PyTorch implementations with fused, architecture-specific kernels. We first implemented the models using a baseline implementation on SGLang with H100 GPUs and then optimized them to achieve significant speedups, as detailed in Table 1 and the discussion that follows.

| Kernel | Baseline time (microseconds) | Optimized time (microseconds) | Optimization applied |
|---|---|---|---|
| RMSNorm + Prepare QKV | 186 | 185 | N/A |
| QK Norm + RoPE | 414 | 54 | Optimized fused in-place query-key normalization kernel |
| Attention | 322 | 296 | FA3 for prefill, FlashInfer backend for decode |
| Post-attention linear projection | 114 | 112 | N/A |
| AllReduce | 252 | 250 | N/A |
| Router logits and TopK | 560 | 134 | Fused TopK implementation; ReplicatedLinear block for router logits |
| Routed experts computation | 1103 | 1080 | Kernel parameters tuned for the DEP2 configuration (64 experts per GPU) |
| Shared expert computation | 216 | 215 | Overlapped with TopK using NVIDIA CUDA streams |
| AllReduce | 265 | 249 | N/A |
| Total layer time | 3432 | 2575 | 1.34x faster prefill overall |

Table 1. Kernel-level optimizations pay off: fusing and tuning the hottest kernels cut layer time drastically and deliver faster prefill.

MoE routing (4.1x faster than baseline H100 performance): The most significant bottleneck identified was the MoE routing mechanism. In the baseline, computing router logits and performing TopK selection involved multiple kernel launches and redundant memory round-trips.

  • Optimization: We implemented a fused TopK kernel that fuses the logit computation and selection logic into a single CUDA kernel. In addition, we used a ReplicatedLinear block for the router logits. Since the router weights are small, replicating them across GPUs eliminates the need for expensive communication during the gating phase, keeping the operation purely compute-bound (see the conceptual sketch below).
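
The gist of the change can be sketched in eager PyTorch as follows; the actual optimization is a fused CUDA kernel inside the serving stack, so this is only a conceptual contrast under placeholder shapes.

```python
import torch

# Conceptual contrast only; the production change is a custom fused CUDA kernel plus a
# ReplicatedLinear router, and all shapes below are placeholders.
hidden, n_experts, top_k = 2048, 128, 6
x = torch.randn(4096, hidden, device="cuda", dtype=torch.bfloat16)

# The router weight is tiny (128 x 2048 in bf16 is roughly 0.5 MB), so replicating it on every
# GPU is far cheaper than sharding it and paying a communication step during gating.
w_router = torch.randn(n_experts, hidden, device="cuda", dtype=torch.bfloat16)

# Baseline pattern: separate launches for logits, softmax, and top-k, with intermediates
# written back to global memory between each step.
logits = x @ w_router.t()
probs = torch.softmax(logits.float(), dim=-1)
topk_weights, topk_experts = torch.topk(probs, top_k, dim=-1)

# What the fused kernel does conceptually: read x once, produce the top-k expert IDs and
# weights in a single launch, and avoid materializing the full probability matrix. Eager
# PyTorch cannot express that fusion; the line below only shows the reduced op count.
topk_logits, topk_experts_fused = torch.topk(x @ w_router.t(), top_k, dim=-1)
```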

Fusing positional embeddings (7.6x faster than baseline H100 performance): The baseline implementation of query-key (QK) norm followed by rotary positional embeddings (RoPE) required reading and writing the large KV cache twice.

  • Optimization: We deployed a custom fused in-place QK norm + RoPE kernel. This kernel performs normalization and rotary embedding calculations in a single pass, keeping the data in the L2 cache and reducing global memory bandwidth consumption (a reference for the math is sketched below).
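
For reference, the arithmetic the fused kernel performs looks roughly like the following PyTorch function. This is an eager, out-of-place approximation with placeholder shapes; the production kernel does the same math in a single in-place pass.

```python
import torch


def qk_norm_rope_reference(q, k, q_w, k_w, cos, sin, eps=1e-6):
    """Reference for the fused kernel's math: RMSNorm then RoPE applied to Q and K in one pass.

    q, k: [tokens, heads, head_dim]; q_w, k_w: [head_dim]; cos, sin: [tokens, 1, head_dim].
    """
    def rms_norm(t, w):
        return t * torch.rsqrt(t.float().pow(2).mean(-1, keepdim=True) + eps).to(t.dtype) * w

    def rotate_half(t):
        t1, t2 = t.chunk(2, dim=-1)
        return torch.cat((-t2, t1), dim=-1)

    q, k = rms_norm(q, q_w), rms_norm(k, k_w)
    q = q * cos + rotate_half(q) * sin
    k = k * cos + rotate_half(k) * sin
    return q, k
```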

Hiding latency with overlap: While the shared expert computation itself saw negligible speedup, we effectively hid its cost. By using separate NVIDIA CUDA streams, we scheduled the shared expert computation to execute asynchronously alongside the router logits and TopK calculation. This parallelism ensures that the GPU's compute units (streaming multiprocessors, or SMs) remain saturated even while the routing logic is being resolved.
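
The overlap pattern can be reproduced with PyTorch CUDA streams, as in the hedged sketch below; router, shared_expert, and the stream handling are illustrative placeholders rather than the production scheduling code.

```python
import torch

shared_stream = torch.cuda.Stream()  # side stream dedicated to the shared expert


def gate_with_overlapped_shared_expert(x, router, shared_expert, top_k=6):
    """Run the shared expert on a side stream while the default stream computes
    router logits and top-k selection, then synchronize before combining results."""
    shared_stream.wait_stream(torch.cuda.current_stream())   # x must be ready before the side stream reads it
    with torch.cuda.stream(shared_stream):
        shared_out = shared_expert(x)                         # overlaps with the routing work below

    logits = router(x)                                        # default stream: gating
    weights, experts = torch.topk(logits, top_k, dim=-1)

    torch.cuda.current_stream().wait_stream(shared_stream)    # join before consuming shared_out
    shared_out.record_stream(torch.cuda.current_stream())     # keep the caching allocator's bookkeeping correct
    return weights, experts, shared_out
```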

These targeted kernel optimizations reduced the total time per transformer layer in a prefill iteration from 3.4 ms to 2.5 ms, a 1.3x speedup over baseline H100 performance. This latency reduction directly translated to higher supportable concurrency, allowing us to serve more users per GPU while maintaining the strict <1000 ms time to first token (TTFT) and <15 ms inter-token latency (ITL) SLAs, as shown in Figure 2 below.

Line chart titled “Performance impact of kernel optimizations on Sarvam 30B model,” plotting tokens per second per GPU against tokens per second per user for optimized kernels versus the unoptimized baseline. The optimized line stays above the baseline at all concurrency points; at 75 tokens per second per user, optimized kernels reach about 1,255 TPS/GPU versus about 998 TPS/GPU for the baseline, a 1.26x improvement.
Figure 2. Performance gains from kernel optimizations across various concurrency points. In focus is the performance gain at the 75 TPS/user point, where kernel optimizations deliver a 1.26x improvement in overall token throughput per GPU.

How mixed prefill and decode scheduling improves GPU utilization

While kernel-level optimizations improve individual operation latency, significant efficiency gains can also be achieved at the scheduler level, both for aggregated serving (prefill and decode run on the same GPU) and for disaggregated serving (prefill and decode run on different GPUs).

The default scheduling strategy for aggregated serving in the SGLang engine is to strictly serialize the prefill and decode phases. In this default mode, the GPU processes a batch of prefills, finishes them, and only then switches to processing decodes. While this simplifies memory management, it often results in suboptimal GPU utilization. Prefills are typically compute-bound (dense matrix multiplications), while decodes are memory-bound (loading the KV cache). Serializing them means the GPU's Tensor Cores are underutilized during decode phases, and memory bandwidth may be underutilized during prefill phases, particularly at the low concurrency operating point imposed by the tight SLA requirements.

To address this, we enabled a mixed batching strategy. This approach allows the SGLang scheduler to combine prefill tokens and decode tokens within the same batch or compute chunk. By processing a chunk of prefill tokens alongside ongoing decode requests, we achieve a complementary resource profile on the GPU. This optimization introduces a subtle tradeoff: mixing heavy prefill chunks into the decode stream can increase inter-token latency (ITL) for the active decode requests, as they must wait for the shared compute resources.

However, for the Sarvam 30B workload, we observed that this impact was marginal and well within our 15 ms ITL SLA. In exchange, the end-to-end request latency improved significantly due to the reduction in queue times. By clearing the prefill queue faster (piggybacking on decodes), we reduced the time requests spent waiting to start, ultimately driving up total system throughput by 15%. This scheduling optimization is quite favorable in the high-ISL, low-OSL scenario of interest here. For more decode-heavy cases, it may be worthwhile to pick smaller mixed chunk sizes or disable mixing altogether. A simplified sketch of the mixed-batching idea follows.
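
The sketch below is a deliberately simplified view of mixed chunking: each engine step packs every active decode token plus as many queued prefill tokens as fit under a token budget. The queue layout, token budget, and function are illustrative; the real logic lives inside SGLang's scheduler.

```python
from collections import deque


def build_mixed_batch(prefill_queue: deque, decode_requests: list, token_budget: int = 8192):
    """Pack one engine step: all active decode tokens plus as many queued prefill
    tokens as fit in the remaining budget (simplified illustration of mixed chunking)."""
    batch = [("decode", req_id, 1) for req_id in decode_requests]  # 1 new token per decode request
    budget = token_budget - len(decode_requests)
    while prefill_queue and budget > 0:
        req_id, remaining = prefill_queue[0]
        chunk = min(remaining, budget)
        batch.append(("prefill", req_id, chunk))
        budget -= chunk
        if chunk == remaining:
            prefill_queue.popleft()                                # this request's prefill is finished
        else:
            prefill_queue[0] = (req_id, remaining - chunk)         # keep the rest for the next step
    return batch


# Example: two queued prefills (3,584 tokens each) mixed with 32 active decodes.
queue = deque([("req-a", 3584), ("req-b", 3584)])
print(build_mixed_batch(queue, [f"d{i}" for i in range(32)]))
```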

Line chart titled “Impact of mixed prefill and decode chunks in SGLang aggregate serving,” plotting tokens per second per GPU against P95 request latency for separate versus mixed prefill/decode chunks. Mixed chunks deliver higher throughput at all latency points; around the 2-second latency point, mixed chunks reach roughly 1,310 TPS/GPU versus about 1,140 TPS/GPU for separate chunks, an approximately 1.15x improvement.
Figure 3. The impact of mixed chunk scheduling, with 15% token throughput gains seen at the 2-second request latency point.

How disaggregated serving removes the critical path and boosts throughput 1.5x

Despite the kernel and scheduling improvements, our profiling indicated that inter-GPU communication for token distribution (expert parallelism) remained on the critical path. Since the Sarvam 30B model (optimized with FP8 precision) fits comfortably within a single NVIDIA H100 SXM GPU's memory, we pivoted from model parallelism to disaggregated serving.

We reconfigured the setup to use a 1P+1D strategy via the SGLang router: dedicating one NVIDIA H100 SXM GPU exclusively to prefill and another to decode. This approach eliminated the overhead of routing tokens between GPUs during the forward pass. The result was immediate: we observed a sharp reduction in TTFT (as prefill workers ran uninterrupted) and a significant increase in per-user decode throughput (1.5x over baseline H100 performance), proving that for this model size, pipeline separation outweighs the benefits of aggregated memory capacity.

Line chart titled “Performance impact of disaggregated serving with NVIDIA H100 SXM GPUs on Sarvam 30B model,” plotting tokens per second per GPU against tokens per second per user for three configurations: the unoptimized aggregate EP2 baseline, the optimized aggregate EP2 configuration, and the disaggregated 1P+1D optimized configuration. The disaggregated configuration delivers the highest throughput at every point; at 75 tokens per second per user it reaches about 1,995 TPS/GPU versus about 998 TPS/GPU for the baseline, roughly a 2x improvement.
Figure 4. The benefits of disaggregated serving on NVIDIA H100 SXM for the Sarvam 30B model

The end-to-end impact of kernel, scheduling, and disaggregation optimizations

Figure 5 below summarizes the end-to-end performance speedup we were able to achieve through a combination of kernel and scheduling optimizations. We also observe that disaggregated serving is the most efficient configuration for this model, this ISL/OSL workload pattern, and the specific TTFT and ITL SLAs.

Bar chart titled “Performance optimization journey for Sarvam 30B on NVIDIA H100 SXM,” showing token throughput ratio at 75 tokens per second per user across serving configurations: baseline aggregated serving at 1.00, aggregated serving with optimized kernels (MoE GEMM shape tuning, router kernel optimization, fused normalization with RoPE, and shared expert overlap) at 1.26, aggregated serving with optimized kernels and mixed prefill/decode chunking at 1.31, and disaggregated 1P+1D serving with optimized kernels at 2.00.
Figure 5. Progressive improvements in Sarvam 30B model inference on NVIDIA H100 SXM through a combination of kernel optimizations, scheduling optimizations, and disaggregated serving.

Running the Sarvam 30B model on NVIDIA Blackwell GPUs

The NVIDIA Blackwell architecture is designed to accelerate generative AI. The NVIDIA Blackwell GPU delivers up to 20 PFLOPS of peak FP4 compute and 8 TB/s of memory bandwidth, a significant leap over the NVIDIA H100 GPU's capabilities. This throughput is driven by the second-generation Transformer Engine, which uses the new NVFP4 format to provide over 2x the performance of FP8 while maintaining high model accuracy.

To take advantage of these capabilities in the Sarvam models, we used the NVIDIA Model Optimizer to quantize the base BF16 model to the NVFP4 format. Unlike the multi-GPU H100 configurations above, the NVIDIA HGX B200 system was able to serve the Sarvam 30B model most efficiently with just one Blackwell GPU. By combining the kernel and scheduling optimizations for the model with NVIDIA Blackwell's NVFP4 compute throughput, we were able to realize a 4x increase in inference serving throughput at the 75 tokens per second per user operating point.
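
A hedged sketch of this post-training quantization step is shown below, based on the TensorRT Model Optimizer (modelopt) PTQ API. The checkpoint path, calibration prompts, and the NVFP4_DEFAULT_CFG config name are assumptions to verify against your installed modelopt version and the specifics of the deployment; it is not Sarvam AI's production pipeline.

```python
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint path; the Sarvam checkpoints are not assumed to be public.
model = AutoModelForCausalLM.from_pretrained("path/to/sarvam-30b-bf16", torch_dtype="bfloat16").cuda()
tokenizer = AutoTokenizer.from_pretrained("path/to/sarvam-30b-bf16")


def calibrate(m):
    """Run a handful of representative prompts so activation ranges can be observed."""
    for prompt in ["उदाहरण संकेत", "Example calibration prompt"]:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)


# NVFP4_DEFAULT_CFG is the NVFP4 recipe in recent modelopt releases
# (assumption: verify the config name for your installed version).
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=calibrate)
```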

As shown in Figure 6 below, the NVIDIA Blackwell GPU enables high performance at low latency thanks to its superior compute, as well as exceptional throughput at higher concurrencies thanks to its memory capacity advantage.

Line chart titled “Performance comparison between NVIDIA B200 and NVIDIA H100 SXM for Sarvam 30B inference,” plotting tokens per second per GPU against tokens per second per user for NVIDIA B200 (NVFP4, aggregate, one GPU) versus NVIDIA H100 SXM (FP8, disaggregated 1P+1D). The B200 line is higher at every operating point; at 100 tokens per second per user it reaches about 3,571 TPS/GPU versus about 1,274 TPS/GPU for H100, roughly a 2.8x advantage, and it maintains a 2x advantage at 75 tokens per second per user.
Figure 6. The NVIDIA Blackwell GPU offers 2.8x higher token throughput than the NVIDIA H100 SXM GPU at the 100 TPS/user operating point.

Learn more

Together, this work shows what is possible when model design, kernel engineering, scheduling strategy, quantization, and GPU architecture are treated as a single system rather than isolated components. By co-optimizing across the full stack, Sarvam AI and NVIDIA delivered substantial gains in throughput and latency while maintaining the strict TTFT and inter-token latency targets required for real-world deployment.

The result shouldn’t be only a faster model, but a more economically viable and sovereign-ready inference stack that scales to national workloads. These learnings provide a blueprint for other teams constructing large, production-grade AI systems on NVIDIA platforms.

More details about Sarvam AI's models can be found here.

To start exploring your own sovereign AI model strategy, check out the NVIDIA Nemotron framework and libraries for training, fine-tuning, and deploying models on local infrastructure.

Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.

And read more about NVIDIA Cloud Functions, NVIDIA's multi-cloud, high-performance AI inference solution, here.


