Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai

As AI workloads scale, achieving high throughput, efficient resource usage, and predictable latency becomes essential. NVIDIA Run:ai addresses these challenges through intelligent scheduling and dynamic GPU fractioning. GPU fractioning is wholly delivered by NVIDIA Run:ai in any environment—cloud, NCP, and on-premises.

This post presents the joint benchmarking effort between NVIDIA and AI cloud provider Nebius to gauge how NVIDIA Run:ai fractional GPU allocation can improve large language model (LLM) inference performance. Nebius' AI Cloud provided the infrastructure foundation, dedicated NVIDIA GPUs, NVIDIA Quantum InfiniBand networking, and the hyperscaler-grade performance and elasticity needed to deliver these gains at production scale.

All benchmarks were executed using NVIDIA NIM microservices. This approach provides standardized, production-grade model deployment with consistent performance, security, and lifecycle management across environments.

The results show that fractional GPUs dramatically increase effective capacity without compromising latency SLAs:

  • 77% of full GPU throughput and 86% of full-GPU concurrent user capacity using only a 0.5 GPU fraction, with time to first token (TTFT) under one second
  • Up to 2x more concurrent inference users on smaller models using 0.25 GPU fractions
  • Up to 3x more total system users when running mixed workloads (chat, reasoning, embeddings) on shared GPUs
  • Near-linear throughput scaling across 0.5, 0.25, and 0.125 GPU fractions, with modest TTFT impact
  • Production-ready autoscaling with no latency cliffs or error spikes during scale-out

This benchmarking shows that fractional GPU scheduling is no longer just an optimization technique. It's a foundational capability for running large-scale, multimodel LLM inference efficiently in production.

Enterprise challenges with LLM inference

Enterprise IT departments operate with a finite, often fixed inventory of GPUs. Deploying an LLM for inference requires a dedicated GPU (or multiple GPUs) to be allocated to a single LLM instance, even when traffic is sporadic. This is necessary because the model must load all of its weights before serving an inference request, keeping the latency for generating tokens (responses) as low as possible.

Because of this, most LLMs consume all of the GPUs allocated to them, making it difficult to run more than one model on the same pool of GPUs. In this scenario, enterprise IT must manually maintain the GPU-to-LLM allocation, determine when and how to scale LLMs as the number of users requesting inference grows in order to maintain latency between chat requests and generated tokens, and cannot repurpose idle GPUs during off-peak hours.

Ideally, enterprises want an elastic environment where GPUs can be used to run multiple LLMs, not just one, without significantly impacting the number of users who can run inference or the latency those users experience. They can scale GPUs up based on workload demand and scale them down during off-peak hours so that other workloads can consume the same GPUs.

Scale inference workloads with NVIDIA Run:ai and Nebius AI Cloud 

The NVIDIA Run:ai platform addresses these pain points through its high-throughput AI workload scheduler, built for large-scale GPU clusters and dynamic fractional GPU allocation, without sacrificing performance. Together, NVIDIA Run:ai orchestration and Nebius AI Cloud infrastructure create a versatile, production-ready framework for maximizing GPU ROI. 

In benchmarking tests conducted by NVIDIA and Nebius AI Cloud, NVIDIA Run:ai delivered up to 2x greater user capacity on existing hardware during peak periods, demonstrating that enterprises can significantly scale inference workloads without proportional increases in GPU investment.

Dynamic GPU fractioning

NVIDIA Run:ai enables GPUs to be fractioned into smaller units (such as 0.5 GPU allocations) that serve multiple workloads concurrently. Users specify their memory requirements directly, and the scheduler allocates resources on demand without any preconfiguration. This is especially impactful for inference workloads, where smaller, concurrent requests can share GPU resources without significant performance degradation.

Memory isolation is enforced at runtime while compute cycles are distributed fairly among active processes. Users can also define a guaranteed minimum (Request) with a burstable upper bound (Limit), allowing workloads to consume additional GPU capacity when available and release it automatically when demand shifts.
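The listing below is a minimal sketch of how a half-GPU request might be expressed, assuming the gpu-fraction and gpu-memory pod annotations and the runai-scheduler scheduler name exposed by NVIDIA Run:ai; confirm the exact keys and values against the documentation for your Run:ai version. It uses the Kubernetes Python client to submit a pod that asks for 0.5 of a GPU.

```python
# Minimal sketch: request a 0.5 GPU fraction for an inference pod.
# Assumes NVIDIA Run:ai is installed and that the "gpu-fraction" /
# "gpu-memory" pod annotations are the fractioning interface exposed
# by your Run:ai version -- verify the exact keys in the Run:ai docs.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="llama31-8b-fractional",
        namespace="runai-inference",          # hypothetical project namespace
        annotations={
            "gpu-fraction": "0.5",            # guaranteed half-GPU (Request)
            # "gpu-memory": "40000Mi",        # alternative: request GPU memory directly
        },
    ),
    spec=client.V1PodSpec(
        scheduler_name="runai-scheduler",     # hand placement to the Run:ai scheduler
        containers=[
            client.V1Container(
                name="nim-llm",
                image="nvcr.io/nim/meta/llama-3.1-8b-instruct:latest",  # illustrative tag
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="runai-inference", body=pod)
```

Under the Request/Limit model described above, the guaranteed fraction can be paired with a higher burstable limit; how that limit is expressed depends on the Run:ai version in use.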

Intelligent workload scheduling

The NVIDIA Run:ai scheduler acts as the "brain" of the operation, analyzing workload priorities, resource requirements, and system capacity to optimize allocations. It prioritizes latency-sensitive tasks, such as real-time inference, over batch-oriented training jobs during peak periods, ensuring service-level agreements (SLAs) are met.

The scheduler also automatically scales LLMs up or down based on the number of concurrent users running inference and token latency, according to the SLA criteria set by the administrator. These strategies collectively drive higher utilization rates, lower operational complexity, and reduce total cost of ownership (TCO).

Teams at NVIDIA and Nebius ran benchmarks to measure the impact NVIDIA Run:ai has on running inference at scale for various LLMs. Scale tests measured the number of concurrent users that could run various chat requests while recording TTFT, output throughput (tokens/second generated), and GPU utilization. At NVIDIA, these tests were run on a cluster built following the PCIe-optimized NVIDIA Enterprise Reference Architectures with NVIDIA H100 NVL GPUs. At Nebius AI Cloud, the tests were run on a cluster built following the HGX-based Enterprise RA for NVIDIA HGX B200 GPUs.

Benchmarking setup

The software stack is based on NVIDIA Enterprise RAs (Figure 1). This includes the NVIDIA AI Enterprise stack to manage GPUs, with NVIDIA GPU Operator for lifecycle management, NVIDIA Network Operator for north-south and east-west networking, NVIDIA NIM Operator to download the various model weights, and NVIDIA NIM microservices to deploy the different models. This was deployed on a cluster of nodes managed by Kubernetes. To learn more, see NVIDIA NIM LLM with NVIDIA Run:ai and Vanilla Kubernetes for Enterprise RA.
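As a concrete illustration of how a model is declared in this stack, the sketch below creates a NIMService-style custom resource with the Kubernetes Python client so the NIM Operator can reconcile it into a running NIM microservice. The API group, version, and spec fields shown are assumptions modeled on the NIM Operator CRDs; verify them against the operator release you deploy.

```python
# Illustrative sketch: declare a NIM inference service as a custom resource
# for the NIM Operator to reconcile. apiVersion, kind, and spec fields are
# assumptions modeled on the NIM Operator CRDs -- verify against your
# installed operator version before use.
from kubernetes import client, config

config.load_kube_config()

nim_service = {
    "apiVersion": "apps.nvidia.com/v1alpha1",   # assumed CRD group/version
    "kind": "NIMService",
    "metadata": {"name": "llama31-8b", "namespace": "nim"},
    "spec": {
        "image": {"repository": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
                  "tag": "latest"},             # illustrative image reference
        "replicas": 1,
        "expose": {"service": {"type": "ClusterIP", "port": 8000}},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="apps.nvidia.com",
    version="v1alpha1",
    namespace="nim",
    plural="nimservices",
    body=nim_service,
)
```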

Infrastructure

Identical benchmarks were run across two hardware configurations: an on-premises cluster with 64 NVIDIA H100 NVL GPUs built to NVIDIA Enterprise RA specifications, and a Nebius AI Cloud cluster with 32 NVIDIA HGX B200 GPUs. This dual-environment approach validates that the results generalize across both self-managed infrastructure and public cloud deployments.

Figure 1. NVIDIA Run:ai deployment on NVIDIA Enterprise Reference Architecture

Model selection

The four models chosen span different sizes, memory footprints, and inference use cases (Table 1). This range enables evaluating fractional allocation across workloads with different resource profiles.

Model | Number of parameters | Memory requirements | Use case
Llama 3.1 8B Instruct | 8B | ~16 GB | General-purpose chat
Phi-4-Mini | 3.8B | ~8 GB | Lightweight assistant
Qwen3-14B | 14B | ~28 GB | Reasoning
Qwen-Embeddings-0.6B | 0.6B | ~1.5 GB | Document embedding and reranking
Table 1. Models chosen span diverse sizes, memory requirements, and use cases

Notably, the largest model (Qwen3-14B) occupies only ~35% of one NVIDIA H100 NVL GPU's 80 GB capacity, illustrating why traditional whole-GPU allocation can leave so much capacity stranded.
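As a quick sanity check on these footprints, the snippet below applies the common rule of thumb of roughly two bytes per parameter for FP16/BF16 weights (ignoring KV cache and activation overhead), which reproduces the ~28 GB figure for Qwen3-14B and its ~35% share of an 80 GB device.

```python
# Rough weight-memory estimate: ~2 bytes per parameter for FP16/BF16 weights,
# ignoring KV cache, activations, and framework overhead.
GPU_MEMORY_GB = 80  # per-GPU capacity assumed in this benchmark

models = {
    "Llama 3.1 8B Instruct": 8e9,
    "Phi-4-Mini": 3.8e9,
    "Qwen3-14B": 14e9,
    "Qwen-Embeddings-0.6B": 0.6e9,
}

for name, params in models.items():
    weights_gb = params * 2 / 1e9           # FP16 weights in GB
    share = weights_gb / GPU_MEMORY_GB      # fraction of one GPU occupied
    print(f"{name:<24} ~{weights_gb:5.1f} GB  ({share:.0%} of one GPU)")
```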

Methodology

GenAI Perf was used to simulate concurrent users sending chat requests to each NIM endpoint. The tool records per-session latency and throughput, enabling measurement under increasing load. A minimal sketch of this concurrency sweep follows the metric list below.

Primary metrics include:

  • TTFT: Latency from request submission to first response token
  • Output throughput: Tokens generated per second per session
  • GPU utilization: Percentage of GPU memory consumed under load
  • Concurrency scaling: Maximum simultaneous users supported while maintaining TTFT and throughput within acceptable bounds (for example, the point at which adding more users causes latency to exceed the SLA)
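The sketch below illustrates this methodology: it drives a NIM chat endpoint with GenAI Perf at increasing concurrency levels and stops once P95 TTFT exceeds the 1,000 ms budget. The endpoint URL and model name are placeholders, and the artifact path and JSON schema are assumptions; check your GenAI Perf version's documentation for the exact flags and export layout.

```python
# Sketch of the concurrency-scaling methodology: sweep concurrency against a
# NIM chat endpoint with GenAI Perf and record where the TTFT SLA (1,000 ms)
# is exceeded. URL/model are placeholders; artifact path and JSON keys are
# assumptions about the GenAI Perf export format.
import json
import subprocess

ENDPOINT = "http://nim-llama31-8b.nim.svc.cluster.local:8000"  # placeholder URL
MODEL = "meta/llama-3.1-8b-instruct"
TTFT_SLA_MS = 1000

for concurrency in (50, 100, 200, 400, 800):
    artifact_dir = f"artifacts/c{concurrency}"
    subprocess.run(
        [
            "genai-perf", "profile",
            "-m", MODEL,
            "--endpoint-type", "chat",
            "--streaming",
            "--url", ENDPOINT,
            "--concurrency", str(concurrency),
            "--artifact-dir", artifact_dir,
        ],
        check=True,
    )
    # Assumed export location and key layout for the per-run statistics.
    with open(f"{artifact_dir}/profile_export_genai_perf.json") as f:
        stats = json.load(f)
    ttft_p95_ms = stats["time_to_first_token"]["p95"]
    print(f"concurrency={concurrency}: TTFT p95 = {ttft_p95_ms:.0f} ms")
    if ttft_p95_ms > TTFT_SLA_MS:
        print("TTFT SLA exceeded; previous level is the max supported concurrency")
        break
```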

Test conditions

Each model was benchmarked under the following five configurations:

  • Baseline: LLM inference without NVIDIA Run:ai (native Kubernetes scheduling)
  • Full GPU(s) with NVIDIA Run:ai: 1.0 GPU allocation per model replica
  • Fractional 0.5 GPU(s): NVIDIA Run:ai with 0.5 GPU allocation per model replica
  • Fractional 0.25 GPU(s): NVIDIA Run:ai with 0.25 GPU allocation per model replica
  • Mixed mode: Multiple LLMs co-located on shared GPUs

For the Qwen-Embeddings model, data ingestion throughput was also tested to evaluate embedding-specific workloads.
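For completeness, here is one way this test matrix could be encoded to drive the runs programmatically; the configuration names and short model labels are illustrative, not official identifiers.

```python
# Compact encoding of the test matrix above. Configuration names and model
# labels are illustrative for this sketch; "mixed" stands for several
# co-located models per GPU rather than a single fraction.
TEST_CONFIGS = [
    {"name": "baseline-no-runai", "runai": False, "gpu_fraction": 1.0},
    {"name": "runai-full-gpu",    "runai": True,  "gpu_fraction": 1.0},
    {"name": "runai-half-gpu",    "runai": True,  "gpu_fraction": 0.5},
    {"name": "runai-quarter-gpu", "runai": True,  "gpu_fraction": 0.25},
    {"name": "runai-mixed-mode",  "runai": True,  "gpu_fraction": "mixed"},
]

MODELS = ["llama-3.1-8b-instruct", "phi-4-mini", "qwen3-14b", "qwen-embeddings-0.6b"]

# Each (model, configuration) pair feeds the GenAI Perf sweep shown earlier;
# the embeddings model additionally gets a data-ingestion throughput test.
runs = [(model, cfg["name"]) for model in MODELS for cfg in TEST_CONFIGS]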

Benchmarking results using NVIDIA Run:ai

This section presents observations based on the results captured from GenAI Perf.

Fractional GPU efficiency at half allocation

NVIDIA Run:ai was evaluated across two dimensions: scheduler overhead compared with native Kubernetes, and fractional GPU efficiency at various allocation sizes. The following subsections detail the findings for each.

No scheduler overhead

NVIDIA Run:ai introduces no measurable performance penalty compared with native Kubernetes scheduling across all test configurations. At 64 GPUs, NVIDIA Run:ai with full GPU allocation delivered 10,200 concurrent users versus 9,934 for the native scheduler, confirming the scheduler itself adds no overhead.

Fractional GPU efficiency

Concurrent user scaling: At 64 GPUs, the 0.5 GPU configuration supported 8,768 concurrent users while keeping TTFT for every user under one second (1,000 ms), which is 86% of the full GPU capacity (10,200 CCU). This demonstrates that fractional allocation introduces only a modest performance trade-off, enabling enterprises to run multiple models on shared GPUs or scale deployments more granularly without significant capacity loss (Figure 2).

Graph showing CCU scaling from 1–64 GPUs for Meta Llama 3.1 8B. Three configurations compared: no Run:ai, Run:ai at 1.0 GPU, and Run:ai at 0.5 GPU. At 64 GPUs, 0.5 GPU delivers 86% of full CCU (8,768 vs 10,200).
Figure 2. Concurrent user scaling for Llama 3.1 8B Instruct powered by the NVIDIA H100 NVL GPU cluster

Output throughput: Token generation throughput showed similar efficiency. At 64 GPUs, the 0.5 GPU configuration achieved 152,694 tokens/sec, or 77% of full GPU throughput (198,680 tokens/sec), as shown in Figure 3.

All three configurations—without NVIDIA Run:ai, NVIDIA Run:ai with full GPU, and NVIDIA Run:ai with fractional GPU—scale linearly from one to 64 GPUs. This linear relationship confirms that the efficiency ratios observed at scale aren’t artifacts of small deployments.

Graph showing throughput scaling from 1–64 GPUs for Llama 3.1 8B. Three configurations: no Run:ai, Run:ai at 1.0 GPU, Run:ai at 0.5 GPU. 0.5 GPU delivers 77% of full GPU throughput.
Figure 3. Output throughput scaling for Llama 3.1 8B Instruct powered by the NVIDIA H100 NVL GPU cluster

Smaller models scale further with quarter-GPU fractions

Smaller models have lighter memory footprints, which means they can take even greater advantage of fractional allocation. Phi-4-Mini was tested with 0.25 GPU fractions to measure how much additional concurrency and throughput this enables.

Graph showing CCU scaling from 1–32 GPUs for Phi-4-Mini-4B-Instruct on NVIDIA HGX B200 (Nebius AI Cloud). At 32 GPUs: 1.0 GPU = 7,100 CCU, 0.5 GPU = 11,000 CCU (155%), 0.25 GPU = 12,200 CCU (172%).
Figure 4. Concurrent user scaling (1-32 GPUs) for Phi-4-Mini with TTFT under 1,000 ms on an NVIDIA HGX B200 cluster running on Nebius AI Cloud

On smaller models such as Phi-4-Mini, NVIDIA Run:ai with 0.25 GPU fractions supported up to 72% more concurrent users than full-GPU allocation (Figure 4). At 32 GPUs, this configuration achieved ~450K tokens/sec with P95 TTFT under 300 ms (Figure 5). Phi-4-Mini is an ideal candidate for high-density fractional deployments due to its small parameter count and tensor efficiency.

Graph showing throughput scaling for Phi-4-Mini-4B-Instruct on Blackwell (Nebius). At 32 GPUs: 1.0 GPU = 456,295 tokens/sec, 0.5 GPU = 458,138 (100%), 0.25 GPU = 389,197 (85%).
Figure 5. Throughput at scale for Phi-4 Mini NIM on NVIDIA HGX B200 cluster running on Nebius AI Cloud

Multimodel co-location on fractional GPUs in Nebius AI Cloud

NVIDIA Run:ai supports allocating fractional GPUs dynamically. In the previous tests, fractional GPUs served a single model at a time. One test instead loaded two models (Llama 3.1 8B and DeepSeek-R1-Distill-8B) on fractional 0.5 NVIDIA H100 NVL GPUs using NVIDIA Run:ai, so a single NVIDIA H100 NVL GPU ran two inference models.

Results show double the concurrent users with NVIDIA Run:ai versus deploying a single NIM pod per GPU (Figure 6). The performance impact increased once the scale exceeded 50% of the GPUs in the cluster. At maximum scale, the TTFT for the combined users degraded by 3x while throughput dropped by only 0.4x.

Bar chart comparing system CCU: without Run:ai = 9,934; Run:ai 0.5 GPU = 8,768; Run:ai 0.5 GPU mixed models = 17,792.
Figure 6. Total number of concurrent users on a cluster powered by NVIDIA H100 NVL GPU servers running two models on a single GPU

Traditional Kubernetes schedulers don't support this fractional allocation. NVIDIA Run:ai enables loading multiple models with dynamic frame buffer memory allocation, without manual capacity planning.

NVIDIA NIM complements this by packaging each model as a production-ready, optimized inference microservice with consistent startup and health signaling. NVIDIA Run:ai then enforces memory isolation and fair compute distribution at runtime. Combined, this enables safe co-location of heterogeneous workloads without cross-model interference.

Bar chart comparing total concurrent users between a mixed model scenario and Llama-only deployment across three scales. Mixed (0.5 Llama plus 0.25 PHI plus 0.125 Qwen) delivers ~3x more users: 1-GPU = 303 versus 104; 1-Host (8 GPUs) = 2,960 versus 850; 1-Cluster = 9,190 versus 3,000.
Figure 7. The total number of system users running multiple models on the NVIDIA HGX B200 cluster in Nebius AI Cloud more than tripled

Nebius ran a similar test co-deploying 0.5 GPU Llama 3.1 8B, 0.25 GPU Phi-4-Mini, and 0.125 GPU Qwen-Embeddings. The cluster achieved predictable scaling with no cross-model interference, and combined throughput exceeded 350K TPS at full scale (Figure 8). The total number of concurrent users that could run inference went up by almost 3x (Figure 7). This validates that the NVIDIA Run:ai scheduler can bin-pack heterogeneous inference workloads without destabilizing latency or utilization.

Bar chart comparing total throughput between a mixed model scenario and Llama-only deployment across three scales. Mixed achieves higher TPS at all scales: 1-GPU = 9,943 versus 6,894; 1-Host = 141,838 vs 52,740; 1-Cluster = 354,312 vs 200,979.
Figure 8. Total system throughput while running multiple models on the NVIDIA HGX B200 cluster in Nebius AI Cloud
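To see why this layout packs so densely, a quick back-of-the-envelope check (using only the fractions stated above) shows how one GPU hosts three model replicas with headroom to spare:

```python
# One co-location "bundle" from the Nebius mixed-workload test:
# 0.5 (Llama 3.1 8B) + 0.25 (Phi-4-Mini) + 0.125 (Qwen-Embeddings).
bundle = {
    "llama-3.1-8b-instruct": 0.5,
    "phi-4-mini": 0.25,
    "qwen-embeddings-0.6b": 0.125,
}

used = sum(bundle.values())          # 0.875 of one GPU
headroom = 1.0 - used                # 0.125 left for bursting or another embedder
replicas_per_gpu = len(bundle)       # 3 co-located replicas instead of 1

print(f"GPU fraction used per bundle: {used}")
print(f"Free fraction per GPU: {headroom}")
print(f"Model replicas per GPU (vs 1 with whole-GPU allocation): {replicas_per_gpu}")
```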

Autoscaling NIM LLM with NVIDIA Run:ai

NVIDIA Run:ai supports autoscaling inference pods based on concurrent users, throughput, or latency thresholds. Nebius configured Llama 3.1 8B to scale when concurrent users exceeded 50, triggering NVIDIA Run:ai to allocate additional GPUs to the NIM inference service.

Replicas scaled smoothly from 1 to 16 as demand increased. The autoscaling traces showed a clean ramp-up with no TTFT spikes, stable GPU utilization during pod warm-up, and negligible HTTP error rates, demonstrating that fractional GPU inference can scale elastically while maintaining SLAs.
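The scaling policy itself is simple to reason about. The sketch below is an illustrative controller loop, not the NVIDIA Run:ai API: it applies the threshold of 50 concurrent users per replica used in the Nebius test, clamped to the 1 to 16 replica range observed during the run.

```python
# Illustrative autoscaling policy (not the NVIDIA Run:ai API): add replicas
# when load exceeds ~50 concurrent users per replica, remove them as load
# falls, and stay within the 1-16 replica range observed in the benchmark.
import math

MAX_USERS_PER_REPLICA = 50
MIN_REPLICAS, MAX_REPLICAS = 1, 16

def desired_replicas(concurrent_users: int) -> int:
    """Replica count that keeps each replica at or under the user threshold."""
    target = math.ceil(concurrent_users / MAX_USERS_PER_REPLICA)
    return max(MIN_REPLICAS, min(MAX_REPLICAS, target))

# Example: ramping demand drives the same smooth 1 -> 16 scale-out seen in Figure 9.
for users in (20, 120, 400, 800, 900, 300, 60):
    print(f"{users:4d} concurrent users -> {desired_replicas(users):2d} replicas")
```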

Run:ai dashboard showing autoscaling for Llama 3.1 8B.
Figure 9. Autoscaling results for Llama 3.1 8B on NVIDIA HGX B200 in Nebius AI Cloud

Get started with GPU fractioning in NVIDIA Run:ai

NVIDIA Run:ai enables efficient GPU utilization through dynamic allocation, fractioning, and intelligent workload placement. Combined with Nebius AI Cloud’s dedicated GPUs, NVIDIA networking, and hyperscaler-grade elasticity, enterprises can achieve:

  • GPU utilization improvements under fractional scheduling, eliminating fragmentation and idle pockets
  • Near‑linear throughput scaling across 0.5 and 0.25 GPU slices (and 0.125 for embeddings), with modest TTFT impact
  • Clean co-existence of mixed workloads: embeddings plus generative plus summarization on the same nodes
  • Production‑ready autoscaling for fractional LLM inference—no SLA cliffs during scale‑out
  • More workloads per GPU, higher concurrency, and reduced fleet size

For an executive summary of this benchmark, see Scaling Efficient Production-Grade Inference with NVIDIA Run:ai on Nebius. 

Get started with the latest version of NVIDIA Run:ai, v2.24. To learn more, check out the NVIDIA GTC 2026 session, Scale Inference Using Open Models: How Nebius Token Factory Delivers Control and Efficiency (Presented by Nebius) [S82234].


