Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai

As AI workloads scale, achieving high throughput, efficient resource usage, and predictable latency becomes essential. NVIDIA Run:ai addresses these challenges through intelligent scheduling and dynamic GPU fractioning. GPU fractioning is wholly delivered by NVIDIA Run:ai in any environment—cloud, NCP, and on-premises.

This post presents the joint benchmarking effort between NVIDIA and AI cloud provider Nebius to gauge how NVIDIA Run:ai fractional GPU allocation can improve large language model (LLM) inference performance. Nebius' AI Cloud provided the infrastructure foundation, dedicated NVIDIA GPUs, NVIDIA Quantum InfiniBand networking, and the hyperscaler-grade performance and elasticity needed to deliver these gains at production scale.

All benchmarks were executed using NVIDIA NIM microservices. This approach provides standardized, production-grade model deployment with consistent performance, security, and lifecycle management across environments.

The results show that fractional GPUs dramatically increase effective capacity without compromising latency SLAs:

  • 77% of full GPU throughput and 86% of full-GPU concurrent user capacity using only a 0.5 GPU fraction, with time to first token (TTFT) under one second
  • Up to 2x more concurrent inference users on smaller models using 0.25 GPU fractions
  • Up to 3x more total system users when running mixed workloads (chat, reasoning, embeddings) on shared GPUs
  • Near-linear throughput scaling across 0.5, 0.25, and 0.125 GPU fractions, with modest TTFT impact
  • Production-ready autoscaling with no latency cliffs or error spikes during scale-out

This benchmarking shows that fractional GPU scheduling is no longer just an optimization technique. It's a foundational capability for running large-scale, multimodel LLM inference efficiently in production.

Enterprise challenges with LLM inference

Enterprise IT departments operate with a finite, often fixed inventory of GPUs. Deploying an LLM for inference requires a dedicated GPU (or multiple GPUs) to be allocated to a single LLM instance, even when traffic is sporadic. This is necessary because the model must load all of its weights before serving an inference request, keeping the latency for generating tokens (responses) as low as possible.

Because of this, most LLMs consume all of the GPUs allocated to them, making it difficult to run more than one model on the same pool of GPUs. In this scenario, enterprise IT must manually maintain the GPU-to-LLM allocation, determine when and how to scale LLMs as the number of users requesting inference grows in order to maintain latency between chat requests and generated tokens, and cannot repurpose idle GPUs during off-peak hours.

Ideally, enterprises want an elastic environment where GPUs can be used to run multiple LLMs, not just one, without significantly impacting the number of users who can run inference or the latency those users experience. They can scale GPUs up based on workload demand and scale them down during off-peak hours so that other workloads can consume the same GPUs.

Scale inference workloads with NVIDIA Run:ai and Nebius AI Cloud 

The NVIDIA Run:ai platform addresses these pain points through its high-throughput AI workload scheduler, built for large-scale GPU clusters and dynamic fractional GPU allocation, without sacrificing performance. Together, NVIDIA Run:ai orchestration and Nebius AI Cloud infrastructure create a versatile, production-ready framework for maximizing GPU ROI. 

In benchmarking tests conducted by NVIDIA and Nebius AI Cloud, NVIDIA Run:ai delivered up to 2x greater user capacity on existing hardware during peak periods, demonstrating that enterprises can significantly scale inference workloads without proportional increases in GPU investment.

Dynamic GPU fractioning

NVIDIA Run:ai enables GPUs to be fractioned into smaller units (such as 0.5 GPU allocations) that serve multiple workloads concurrently. Users specify their memory requirements directly, and the scheduler allocates resources on demand without any preconfiguration. This is especially impactful for inference workloads, where smaller, concurrent requests can share GPU resources without significant performance degradation.

Memory isolation is enforced at runtime while compute cycles are distributed fairly among active processes. Users can also define a guaranteed minimum (Request) with a burstable upper bound (Limit), allowing workloads to consume additional GPU capacity when available and release it automatically when demand shifts.
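The listing below is a minimal sketch of how a half-GPU request might be expressed, assuming the gpu-fraction and gpu-memory pod annotations and the runai-scheduler scheduler name exposed by NVIDIA Run:ai; confirm the exact keys and values against the documentation for your Run:ai version. It uses the Kubernetes Python client to submit a pod that asks for 0.5 of a GPU.

```python
# Minimal sketch: request a 0.5 GPU fraction for an inference pod.
# Assumes NVIDIA Run:ai is installed and that the "gpu-fraction" /
# "gpu-memory" pod annotations are the fractioning interface exposed
# by your Run:ai version -- verify the exact keys in the Run:ai docs.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="llama31-8b-fractional",
        namespace="runai-inference",          # hypothetical project namespace
        annotations={
            "gpu-fraction": "0.5",            # guaranteed half-GPU (Request)
            # "gpu-memory": "40000Mi",        # alternative: request GPU memory directly
        },
    ),
    spec=client.V1PodSpec(
        scheduler_name="runai-scheduler",     # hand placement to the Run:ai scheduler
        containers=[
            client.V1Container(
                name="nim-llm",
                image="nvcr.io/nim/meta/llama-3.1-8b-instruct:latest",  # illustrative tag
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="runai-inference", body=pod)
```

Under the Request/Limit model described above, the guaranteed fraction can be paired with a higher burstable limit; how that limit is expressed depends on the Run:ai version in use.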

Intelligent workload scheduling

The NVIDIA Run:ai scheduler acts as the "brain" of the operation, analyzing workload priorities, resource requirements, and system capacity to optimize allocations. It prioritizes latency-sensitive tasks, such as real-time inference, over batch-oriented training jobs during peak periods, ensuring service-level agreements (SLAs) are met.

The scheduler also automatically scales LLMs up or down based on the number of concurrent users running inference and token latency, according to the SLA criteria set by the administrator. These strategies collectively drive higher utilization rates, lower operational complexity, and reduce total cost of ownership (TCO).

Teams at NVIDIA and Nebius ran benchmarks to measure the impact NVIDIA Run:ai has on running inference at scale for various LLMs. Scale tests measured the number of concurrent users that could run various chat requests while recording TTFT, output throughput (tokens/second generated), and GPU utilization. At NVIDIA, these tests were run on a cluster built following the PCIe-optimized NVIDIA Enterprise Reference Architectures with NVIDIA H100 NVL GPUs. At Nebius AI Cloud, the tests were run on a cluster built following the HGX-based Enterprise RA for NVIDIA HGX B200 GPUs.

Benchmarking setup

The software stack is based on NVIDIA Enterprise RAs (Figure 1). This includes the NVIDIA AI Enterprise stack to manage GPUs, with NVIDIA GPU Operator for lifecycle management, NVIDIA Network Operator for north-south and east-west networking, NVIDIA NIM Operator to download the various model weights, and NVIDIA NIM microservices to deploy the different models. This was deployed on a cluster of nodes managed by Kubernetes. To learn more, see NVIDIA NIM LLM with NVIDIA Run:ai and Vanilla Kubernetes for Enterprise RA.
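As a concrete illustration of how a model is declared in this stack, the sketch below creates a NIMService-style custom resource with the Kubernetes Python client so the NIM Operator can reconcile it into a running NIM microservice. The API group, version, and spec fields shown are assumptions modeled on the NIM Operator CRDs; verify them against the operator release you deploy.

```python
# Illustrative sketch: declare a NIM inference service as a custom resource
# for the NIM Operator to reconcile. apiVersion, kind, and spec fields are
# assumptions modeled on the NIM Operator CRDs -- verify against your
# installed operator version before use.
from kubernetes import client, config

config.load_kube_config()

nim_service = {
    "apiVersion": "apps.nvidia.com/v1alpha1",   # assumed CRD group/version
    "kind": "NIMService",
    "metadata": {"name": "llama31-8b", "namespace": "nim"},
    "spec": {
        "image": {"repository": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
                  "tag": "latest"},             # illustrative image reference
        "replicas": 1,
        "expose": {"service": {"type": "ClusterIP", "port": 8000}},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="apps.nvidia.com",
    version="v1alpha1",
    namespace="nim",
    plural="nimservices",
    body=nim_service,
)
```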

Infrastructure

Identical benchmarks were run across two hardware configurations: an on-premises cluster with 64 NVIDIA H100 NVL GPUs built to NVIDIA Enterprise RA specifications, and a Nebius AI Cloud cluster with 32 NVIDIA HGX B200 GPUs. This dual-environment approach validates that the results generalize across both self-managed infrastructure and public cloud deployments.

Figure 1. NVIDIA Run:ai deployment on NVIDIA Enterprise Reference Architecture

Model selection

The four models chosen span different sizes, memory footprints, and inference use cases (Table 1). This range enables evaluating fractional allocation across workloads with different resource profiles.

Model | Number of parameters | Memory requirements | Use case
Llama 3.1 8B Instruct | 8B | ~16 GB | General-purpose chat
Phi-4-Mini | 3.8B | ~8 GB | Lightweight assistant
Qwen3-14B | 14B | ~28 GB | Reasoning
Qwen-Embeddings-0.6B | 0.6B | ~1.5 GB | Document embedding and reranking
Table 1. Models chosen span diverse sizes, memory requirements, and use cases

Notably, the largest model (Qwen3-14B) occupies only ~35% of one NVIDIA H100 NVL GPU's 80 GB capacity, illustrating why traditional whole-GPU allocation can leave so much capacity stranded.
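As a quick sanity check on these footprints, the snippet below applies the common rule of thumb of roughly two bytes per parameter for FP16/BF16 weights (ignoring KV cache and activation overhead), which reproduces the ~28 GB figure for Qwen3-14B and its ~35% share of an 80 GB device.

```python
# Rough weight-memory estimate: ~2 bytes per parameter for FP16/BF16 weights,
# ignoring KV cache, activations, and framework overhead.
GPU_MEMORY_GB = 80  # per-GPU capacity assumed in this benchmark

models = {
    "Llama 3.1 8B Instruct": 8e9,
    "Phi-4-Mini": 3.8e9,
    "Qwen3-14B": 14e9,
    "Qwen-Embeddings-0.6B": 0.6e9,
}

for name, params in models.items():
    weights_gb = params * 2 / 1e9           # FP16 weights in GB
    share = weights_gb / GPU_MEMORY_GB      # fraction of one GPU occupied
    print(f"{name:<24} ~{weights_gb:5.1f} GB  ({share:.0%} of one GPU)")
```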

Methodology

GenAI Perf was used to simulate concurrent users sending chat requests to each NIM endpoint. The tool records per-session latency and throughput, enabling measurement under increasing load. A minimal sketch of this concurrency sweep follows the metric list below.

Primary metrics include:

  • TTFT: Latency from request submission to first response token
  • Output throughput: Tokens generated per second per session
  • GPU utilization: Percentage of GPU memory consumed under load
  • Concurrency scaling: Maximum simultaneous users supported while maintaining TTFT and throughput within acceptable bounds (for example, the point at which adding more users causes latency to exceed the SLA)
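The sketch below illustrates this methodology: it drives a NIM chat endpoint with GenAI Perf at increasing concurrency levels and stops once P95 TTFT exceeds the 1,000 ms budget. The endpoint URL and model name are placeholders, and the artifact path and JSON schema are assumptions; check your GenAI Perf version's documentation for the exact flags and export layout.

```python
# Sketch of the concurrency-scaling methodology: sweep concurrency against a
# NIM chat endpoint with GenAI Perf and record where the TTFT SLA (1,000 ms)
# is exceeded. URL/model are placeholders; artifact path and JSON keys are
# assumptions about the GenAI Perf export format.
import json
import subprocess

ENDPOINT = "http://nim-llama31-8b.nim.svc.cluster.local:8000"  # placeholder URL
MODEL = "meta/llama-3.1-8b-instruct"
TTFT_SLA_MS = 1000

for concurrency in (50, 100, 200, 400, 800):
    artifact_dir = f"artifacts/c{concurrency}"
    subprocess.run(
        [
            "genai-perf", "profile",
            "-m", MODEL,
            "--endpoint-type", "chat",
            "--streaming",
            "--url", ENDPOINT,
            "--concurrency", str(concurrency),
            "--artifact-dir", artifact_dir,
        ],
        check=True,
    )
    # Assumed export location and key layout for the per-run statistics.
    with open(f"{artifact_dir}/profile_export_genai_perf.json") as f:
        stats = json.load(f)
    ttft_p95_ms = stats["time_to_first_token"]["p95"]
    print(f"concurrency={concurrency}: TTFT p95 = {ttft_p95_ms:.0f} ms")
    if ttft_p95_ms > TTFT_SLA_MS:
        print("TTFT SLA exceeded; previous level is the max supported concurrency")
        break
```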

Test conditions

Each model was benchmarked under the following five configurations:

  • Baseline: LLM inference without NVIDIA Run:ai (native Kubernetes scheduling)
  • Full GPU(s) with NVIDIA Run:ai: 1.0 GPU allocation per model replica
  • Fractional 0.5 GPU(s): NVIDIA Run:ai with 0.5 GPU allocation per model replica
  • Fractional 0.25 GPU(s): NVIDIA Run:ai with 0.25 GPU allocation per model replica
  • Mixed mode: Multiple LLMs co-located on shared GPUs

For the Qwen-Embeddings model, data ingestion throughput was also tested to evaluate embedding-specific workloads.
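For completeness, here is one way this test matrix could be encoded to drive the runs programmatically; the configuration names and short model labels are illustrative, not official identifiers.

```python
# Compact encoding of the test matrix above. Configuration names and model
# labels are illustrative for this sketch; "mixed" stands for several
# co-located models per GPU rather than a single fraction.
TEST_CONFIGS = [
    {"name": "baseline-no-runai", "runai": False, "gpu_fraction": 1.0},
    {"name": "runai-full-gpu",    "runai": True,  "gpu_fraction": 1.0},
    {"name": "runai-half-gpu",    "runai": True,  "gpu_fraction": 0.5},
    {"name": "runai-quarter-gpu", "runai": True,  "gpu_fraction": 0.25},
    {"name": "runai-mixed-mode",  "runai": True,  "gpu_fraction": "mixed"},
]

MODELS = ["llama-3.1-8b-instruct", "phi-4-mini", "qwen3-14b", "qwen-embeddings-0.6b"]

# Each (model, configuration) pair feeds the GenAI Perf sweep shown earlier;
# the embeddings model additionally gets a data-ingestion throughput test.
runs = [(model, cfg["name"]) for model in MODELS for cfg in TEST_CONFIGS]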

Benchmarking results using NVIDIA Run:ai

This section presents observations based on the results captured from GenAI Perf.

Fractional GPU efficiency at half allocation

NVIDIA Run:ai was evaluated across two dimensions: scheduler overhead compared with native Kubernetes, and fractional GPU efficiency at various allocation sizes. The following subsections detail the findings for each.

No scheduler overhead

NVIDIA Run:ai introduces no measurable performance penalty compared with native Kubernetes scheduling across all test configurations. At 64 GPUs, NVIDIA Run:ai with full GPU allocation delivered 10,200 concurrent users versus 9,934 for the native scheduler, confirming the scheduler itself adds no overhead.

Fractional GPU efficiency

Concurrent user scaling: At 64 GPUs, the 0.5 GPU configuration supported 8,768 concurrent users while keeping TTFT for every user under one second (1,000 ms), which is 86% of the full GPU capacity (10,200 CCU). This demonstrates that fractional allocation introduces only a modest performance trade-off, enabling enterprises to run multiple models on shared GPUs or scale deployments more granularly without significant capacity loss (Figure 2).

Graph showing CCU scaling from 1–64 GPUs for Meta Llama 3.1 8B. Three configurations compared: no Run:ai, Run:ai at 1.0 GPU, and Run:ai at 0.5 GPU. At 64 GPUs, 0.5 GPU delivers 86% of full CCU (8,768 vs 10,200).
Figure 2. Concurrent user scaling for Llama 3.1 8B Instruct powered by the NVIDIA H100 NVL GPU cluster

Output throughput: Token generation throughput showed similar efficiency. At 64 GPUs, the 0.5 GPU configuration achieved 152,694 tokens/sec, or 77% of full GPU throughput (198,680 tokens/sec), as shown in Figure 3.

All three configurations—without NVIDIA Run:ai, NVIDIA Run:ai with full GPU, and NVIDIA Run:ai with fractional GPU—scale linearly from one to 64 GPUs. This linear relationship confirms that the efficiency ratios observed at scale aren’t artifacts of small deployments.

Graph showing throughput scaling from 1–64 GPUs for Llama 3.1 8B. Three configurations: no Run:ai, Run:ai at 1.0 GPU, Run:ai at 0.5 GPU. 0.5 GPU delivers 77% of full GPU throughput.
Figure 3. Output throughput scaling for Llama 3.1 8B Instruct powered by the NVIDIA H100 NVL GPU cluster

Smaller models scale further with quarter-GPU fractions

Smaller models have lighter memory footprints, which means they can take even greater advantage of fractional allocation. Phi-4-Mini was tested with 0.25 GPU fractions to measure how much additional concurrency and throughput this enables.

Graph showing CCU scaling from 1–32 GPUs for Phi-4-Mini-4B-Instruct on NVIDIA HGX B200 (Nebius AI Cloud). At 32 GPUs: 1.0 GPU = 7,100 CCU, 0.5 GPU = 11,000 CCU (155%), 0.25 GPU = 12,200 CCU (172%).
Figure 4. Concurrent user scaling (1-32 GPUs) for Phi-4-Mini with TTFT under 1,000 ms on an NVIDIA HGX B200 cluster running on Nebius AI Cloud

On smaller models such as Phi-4-Mini, NVIDIA Run:ai with 0.25 GPU fractions supported up to 72% more concurrent users than full-GPU allocation (Figure 4). At 32 GPUs, this configuration achieved ~450K tokens/sec with P95 TTFT under 300 ms (Figure 5). Phi-4-Mini is an ideal candidate for high-density fractional deployments due to its small parameter count and tensor efficiency.

Graph showing throughput scaling for Phi-4-Mini-4B-Instruct on Blackwell (Nebius). At 32 GPUs: 1.0 GPU = 456,295 tokens/sec, 0.5 GPU = 458,138 (100%), 0.25 GPU = 389,197 (85%).
Figure 5. Throughput at scale for Phi-4 Mini NIM on NVIDIA HGX B200 cluster running on Nebius AI Cloud

Multimodel co-location on fractional GPUs in Nebius AI Cloud

NVIDIA Run:ai supports allocating fractional GPUs dynamically. In the previous tests, fractional GPUs served a single model at a time. One test instead loaded two models (Llama 3.1 8B and DeepSeek-R1-Distill-8B) on fractional 0.5 NVIDIA H100 NVL GPUs using NVIDIA Run:ai, so a single NVIDIA H100 NVL GPU ran two inference models.

Results show double the concurrent users with NVIDIA Run:ai versus deploying a single NIM pod per GPU (Figure 6). The performance impact increased once the scale exceeded 50% of the GPUs in the cluster. At maximum scale, the TTFT for the combined users degraded by 3x while throughput dropped by only 0.4x.

Bar chart comparing system CCU: without Run:ai = 9,934; Run:ai 0.5 GPU = 8,768; Run:ai 0.5 GPU mixed models = 17,792.
Figure 6. Total number of concurrent users on a cluster powered by NVIDIA H100 NVL GPU servers running two models on a single GPU

Traditional Kubernetes schedulers don't support this fractional allocation. NVIDIA Run:ai enables loading multiple models with dynamic frame buffer memory allocation, without manual capacity planning.

NVIDIA NIM complements this by packaging each model as a production-ready, optimized inference microservice with consistent startup and health signaling. NVIDIA Run:ai then enforces memory isolation and fair compute distribution at runtime. Combined, this enables safe co-location of heterogeneous workloads without cross-model interference.

Bar chart comparing total concurrent users between a mixed model scenario and Llama-only deployment across three scales. Mixed (0.5 Llama plus 0.25 PHI plus 0.125 Qwen) delivers ~3x more users: 1-GPU = 303 versus 104; 1-Host (8 GPUs) = 2,960 versus 850; 1-Cluster = 9,190 versus 3,000.
Figure 7. The total number of system users running multiple models on the NVIDIA HGX B200 cluster in Nebius AI Cloud more than tripled

Nebius ran a similar test co-deploying 0.5 GPU Llama 3.1 8B, 0.25 GPU Phi-4-Mini, and 0.125 GPU Qwen-Embeddings. The cluster achieved predictable scaling with no cross-model interference, and combined throughput exceeded 350K TPS at full scale (Figure 8). The total number of concurrent users that could run inference went up by almost 3x (Figure 7). This validates that the NVIDIA Run:ai scheduler can bin-pack heterogeneous inference workloads without destabilizing latency or utilization.

Bar chart comparing total throughput between a mixed model scenario and Llama-only deployment across three scales. Mixed achieves higher TPS at all scales: 1-GPU = 9,943 versus 6,894; 1-Host = 141,838 vs 52,740; 1-Cluster = 354,312 vs 200,979.
Figure 8. Total system throughput while running multiple models on the NVIDIA HGX B200 cluster in Nebius AI Cloud
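To see why this layout packs so densely, a quick back-of-the-envelope check (using only the fractions stated above) shows how one GPU hosts three model replicas with headroom to spare:

```python
# One co-location "bundle" from the Nebius mixed-workload test:
# 0.5 (Llama 3.1 8B) + 0.25 (Phi-4-Mini) + 0.125 (Qwen-Embeddings).
bundle = {
    "llama-3.1-8b-instruct": 0.5,
    "phi-4-mini": 0.25,
    "qwen-embeddings-0.6b": 0.125,
}

used = sum(bundle.values())          # 0.875 of one GPU
headroom = 1.0 - used                # 0.125 left for bursting or another embedder
replicas_per_gpu = len(bundle)       # 3 co-located replicas instead of 1

print(f"GPU fraction used per bundle: {used}")
print(f"Free fraction per GPU: {headroom}")
print(f"Model replicas per GPU (vs 1 with whole-GPU allocation): {replicas_per_gpu}")
```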

Autoscaling NIM LLM with NVIDIA Run:ai

NVIDIA Run:ai supports autoscaling inference pods based on concurrent users, throughput, or latency thresholds. Nebius configured Llama 3.1 8B to scale when concurrent users exceeded 50, triggering NVIDIA Run:ai to allocate additional GPUs to the NIM inference service.

Replicas scaled smoothly from 1 to 16 as demand increased. The autoscaling traces showed a clean ramp-up with no TTFT spikes, stable GPU utilization during pod warm-up, and negligible HTTP error rates, demonstrating that fractional GPU inference can scale elastically while maintaining SLAs.
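The scaling policy itself is simple to reason about. The sketch below is an illustrative controller loop, not the NVIDIA Run:ai API: it applies the threshold of 50 concurrent users per replica used in the Nebius test, clamped to the 1 to 16 replica range observed during the run.

```python
# Illustrative autoscaling policy (not the NVIDIA Run:ai API): add replicas
# when load exceeds ~50 concurrent users per replica, remove them as load
# falls, and stay within the 1-16 replica range observed in the benchmark.
import math

MAX_USERS_PER_REPLICA = 50
MIN_REPLICAS, MAX_REPLICAS = 1, 16

def desired_replicas(concurrent_users: int) -> int:
    """Replica count that keeps each replica at or under the user threshold."""
    target = math.ceil(concurrent_users / MAX_USERS_PER_REPLICA)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, target))

# Example: ramping demand drives the same smooth 1 -> 16 scale-out seen in Figure 9.
for users in (20, 120, 400, 800, 900, 300, 60):
    print(f"{users:4d} concurrent users -> {desired_replicas(users):2d} replicas")
```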

Run:ai dashboard showing autoscaling for Llama 3.1 8B.
Figure 9. Autoscaling results for Llama 3.1 8B on NVIDIA HGX B200 in Nebius AI Cloud

Get started with GPU fractioning in NVIDIA Run:ai

NVIDIA Run:ai enables efficient GPU utilization through dynamic allocation, fractioning, and intelligent workload placement. Combined with Nebius AI Cloud’s dedicated GPUs, NVIDIA networking, and hyperscaler-grade elasticity, enterprises can achieve:

  • GPU utilization improvements under fractional scheduling, eliminating fragmentation and idle pockets
  • Near‑linear throughput scaling across 0.5 and 0.25 GPU slices (and 0.125 for embeddings), with modest TTFT impact
  • Clean co-existence of mixed workloads: embeddings plus generative plus summarization on the same nodes
  • Production‑ready autoscaling for fractional LLM inference—no SLA cliffs during scale‑out
  • More workloads per GPU, higher concurrency, and reduced fleet size

For an executive summary of this benchmark, see Scaling Efficient Production-Grade Inference with NVIDIA Run:ai on Nebius. 

Get started with the latest version of NVIDIA Run:ai, v2.24. To learn more, check out the NVIDIA GTC 2026 session, Scale Inference Using Open Models: How Nebius Token Factory Delivers Control and Efficiency (Presented by Nebius) [S82234].


