Organizations deploying LLMs are challenged by inference workloads with widely different resource requirements. A small embedding model might use only a few gigabytes of GPU memory, while a 70B+ parameter LLM could require multiple GPUs. This diversity often results in low average GPU utilization, high compute costs, and unpredictable latency.
The issue isn’t merely packing more workloads onto GPUs, but scheduling them intelligently. Without orchestration that understands inference workload patterns, organizations face a choice between overprovisioning (wasting resources) and underprovisioning (degrading performance).
This blog post covers:
- The inference utilization problem: Why traditional scheduling underutilizes GPU resources.
- How NVIDIA NIM delivers production inference: The role of containerized microservices in standardizing model deployment.
- NVIDIA Run:ai’s intelligent scheduling strategies: Four key capabilities that improve performance (lower latency, higher TPS per GPU) while increasing GPU utilization and reducing compute costs.
- Benchmarking results: ~2x GPU utilization improvement with minimal throughput loss, up to ~1.4x higher throughput under heavy concurrency with dynamic fractions, and 44-61x faster first-request latency with GPU memory swap.
- The best way to start: Practical guidance for implementing these strategies with NIM on NVIDIA Run:ai.
The inference utilization problem
GPU utilization determines how many workloads can run on a given cluster, and at what cost. In practice, most inference deployments leave significant GPU capacity idle because each model is assigned a full GPU “just to be safe,” or because naive sharing without memory isolation causes out-of-memory (OOM) conditions and latency spikes under traffic.
Without intelligent orchestration, teams are forced to choose between overprovisioning (waste) and underprovisioning (performance risk).
How NVIDIA NIM delivers production inference
NVIDIA NIM packages optimized inference engines as containerized microservices with:
- Packaged inference engines: Inference runtimes pre-configured for high throughput and low latency
- Industry-standard APIs: OpenAI-compatible endpoints for integration
- Model optimization: Automatic selection of quantization, batching, and acceleration techniques
- Production-ready containers: Pre-built with dependencies, tested at scale
- Security and compliance: Enterprise-grade security controls and container signing for deployments
- Enterprise support: NVIDIA support and maintenance for production deployments
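Because NIM exposes OpenAI-compatible endpoints, any OpenAI-style client can call a deployed model. The sketch below uses only the Python standard library; the base URL and model name are illustrative placeholders, not values from this post:

```python
# Minimal sketch of calling a NIM OpenAI-compatible endpoint.
# "http://nim.example.local:8000" and the model name are hypothetical.
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send_chat_request(base_url: str, payload: dict) -> dict:
    """POST the payload to the endpoint and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("mistralai/mistral-7b-instruct-v0.3", "Hello!")
```

Since the request and response shapes follow the OpenAI schema, existing client tooling works unchanged against a NIM deployment.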
NIM standardizes the deployment layer, but maximizing GPU utilization requires intelligent orchestration. That is where NVIDIA Run:ai’s scheduling capabilities become essential.
How NVIDIA Run:ai unlocks efficient resource management for NVIDIA NIM
Inference utilization is about more than just scheduling—it’s about adapting to how workloads behave. With NVIDIA Run:ai, NIM deployments get inference-first prioritization, GPU fractions with full memory isolation, smarter placement based on workload needs, dynamic memory management, and autoscaling (including replica scaling and scale-to-zero). This lets deployments follow traffic and give GPUs back when models are idle.
Inference priority protects user-facing workloads
NVIDIA Run:ai automatically assigns inference workloads the highest default priority, ensuring training jobs never preempt them. Why this matters:
- Inference serves users: Latency spikes and downtime impact the user experience and SLA compliance.
- Training can tolerate interruption: Model training can checkpoint and resume; inference requests cannot wait.
This automatic priority assignment eliminates manual tuning in most environments. For organizations running mixed workloads, it ensures training jobs flex around inference demands rather than competing with them. GPUs can train when inference load is low, automatically yielding resources when user-facing requests arrive.
GPU fractions with bin packing for multiple small models on a GPU
Many NIM workloads, like embeddings, rerankers, and small LLMs, rarely need a whole GPU. When used with GPU fractions, NVIDIA Run:ai’s bin packing strategy fills GPUs before allocating new ones, maximizing utilization across the cluster.
How GPU fractions with bin packing work:
- GPU fractions provide true memory isolation (not soft limits). Each model gets a guaranteed memory allocation.
- Bin packing scores GPUs by current utilization, so the scheduler prioritizes filling partially used GPUs with new workloads before allocating fresh ones.
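To make the mechanism concrete, here is a toy sketch of fraction-aware bin packing: place each workload on the fullest GPU that still has room. The scoring rule and model names are illustrative only, not NVIDIA Run:ai’s actual implementation:

```python
# Toy bin packing for GPU fractions: prefer partially used GPUs so that
# free GPUs stay whole. Illustrative sketch, not Run:ai's scheduler.

def place(workloads, gpus):
    """workloads: list of (name, fraction); gpus: list of free fractions.
    Returns {name: gpu_index} placements."""
    free = list(gpus)
    placement = {}
    for name, frac in sorted(workloads, key=lambda w: -w[1]):
        # GPUs with enough free fraction, fullest (least free) first.
        candidates = [i for i, f in enumerate(free) if f >= frac]
        if not candidates:
            raise RuntimeError(f"no GPU can fit {name} ({frac})")
        best = min(candidates, key=lambda i: free[i])
        free[best] -= frac
        placement[name] = best
    return placement

# Three hypothetical NIM fractions packed onto two GPUs: the 0.65 and 0.3
# workloads share one GPU, leaving most of the second GPU free.
print(place([("llm-7b", 0.3), ("vlm-12b", 0.4), ("moe-30b", 0.65)],
            [1.0, 1.0]))
```

The same intuition explains the ≈1.5-GPU footprint in the benchmark below: filling partially used GPUs first leaves contiguous free capacity for other workloads.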
Benchmarking results:
The approach was tested by simulating a scenario with three NIM models (a 7B LLM, a 12B VLM, and a 30B MoE) on NVIDIA H100 GPUs:
- Scenario A: Three GPUs with one H100 GPU per NIM (baseline)
- Scenario B: Three NIM microservices on 1.5 H100 GPUs using NVIDIA Run:ai fractions, keeping NIM configurations and client load patterns constant


Exercising short- and long-context prompts, the key findings include:
- Each NIM retained about 91–100% of its single-GPU throughput, with modest increases in time-to-first-token (TTFT) and end-to-end (E2E) latency.
- Mistral-7B matched its dedicated-GPU throughput at 834 token/s with long-context input (100%).
- Nemotron-3-Nano-30B retained 95% (582 vs. 614 token/s).
- Nemotron-Nano-12B-v2-VL retained 91% (658 vs. 723 token/s) at short-context input.
Three NIM microservices that previously required three dedicated H100s were consolidated onto ≈1.5 H100s, freeing the remaining capacity for other workloads.
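The retention percentages follow directly from the throughput numbers above; a quick arithmetic check:

```python
# Verifying the retention percentages from the reported token/s figures.
retained = {
    "Nemotron-3-Nano-30B": 582 / 614,
    "Nemotron-Nano-12B-v2-VL": 658 / 723,
}
print({name: f"{r:.0%}" for name, r in retained.items()})
# {'Nemotron-3-Nano-30B': '95%', 'Nemotron-Nano-12B-v2-VL': '91%'}
```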
Dynamic GPU fractions maintain performance under heavy concurrent requests
Static GPU fractions guarantee memory isolation, but they impose a rigid ceiling that creates stranded capacity. As concurrent requests increase, each NIM’s KV-cache grows dynamically to track active sequences. When that growth hits the fixed fraction boundary, throughput plateaus and latency degrades. This bottleneck forces a difficult trade-off: over-allocate fractions (wasting GPU capacity) or cap concurrency to stay within the fixed memory budget.
NVIDIA Run:ai’s dynamic GPU fractions solve this by replacing fixed allocations with a request/limit model, borrowing Kubernetes resource semantics for GPU memory:
- Request: The guaranteed minimum fraction, always reserved for the workload.
- Limit: The burstable upper bound, enabling the NIM to expand into available GPU memory when KV-cache or compute pressure increases.
When a NIM operates at its request, the unused headroom between the request and limit stays available to co-located workloads. When concurrent traffic spikes, the NIM bursts toward its limit, claiming that memory and converting it into active throughput. This transition between request and limit is handled automatically. Workloads scale up when they need resources and release them when demand subsides, maximizing total GPU utilization without manual intervention.
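A toy model of the request/limit semantics makes the burst behavior concrete. The policy below is a conceptual sketch under stated assumptions, not NVIDIA Run:ai’s scheduler logic:

```python
# Toy request/limit grant: a workload always gets its request, and may
# borrow idle headroom up to its limit. Numbers are illustrative only.

def grant(demand: float, request: float, limit: float, free_pool: float) -> float:
    """Fraction actually granted: at least `request`, bursting toward
    `limit` while headroom remains in the shared `free_pool`."""
    guaranteed = request
    burst_wanted = min(demand, limit) - guaranteed
    burst = max(0.0, min(burst_wanted, free_pool))
    return guaranteed + burst

# A 30B model (request 0.65, limit 0.75) under KV-cache pressure,
# with 0.2 of the GPU currently idle: the full demand is granted.
print(grant(demand=0.72, request=0.65, limit=0.75, free_pool=0.2))  # ≈ 0.72
```

When the free pool shrinks (co-located workloads burst at the same time), the grant degrades gracefully toward the guaranteed request rather than failing.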
Benchmarking results:
Using the same three NIM models and 1.5 H100 GPU footprint from Experiment 1, static fractions were replaced with dynamic fractions to measure performance under increasing concurrency:
- Mistral-7B NIM (Request: 0.3, Limit: 0.4)
- Nemotron-Nano-12B-v2-VL NIM (Request: 0.4, Limit: 0.5)
- Nemotron-3-Nano-30B NIM (Request: 0.65, Limit: 0.75)
Scenarios compared:
- Scenario A (static fractions + bin packing): The fixed-fraction deployment from Experiment 1 (see Figure 1), where each NIM has a hard memory ceiling with full isolation.
- Scenario B (dynamic fractions + bin packing): Same bin-packed layout on ≈1.5 H100 GPUs, but each NIM uses a request/limit pair instead of a fixed allocation.


In Figures 2, 3, and 4, as concurrency ramped up, static fractions hit a performance wall: throughput stalled and latency spiked because models couldn’t access additional memory for growing KV caches. With dynamic fractions, NIM microservices absorbed the pressure by bursting toward their limits during traffic peaks and releasing memory back when the load subsided.
Across all three NVIDIA NIM microservices, dynamic fractions delivered up to 1.4x higher throughput and 1.7x lower latency, scaling cleanly with concurrency. For instance:
- Nemotron-3-Nano-30B sustained 1,025 token/s at 256 concurrent requests with dynamic fractions, compared with a static-fraction ceiling of 721 token/s at just 4 concurrent requests before instability (1.4x).
- Mistral-7B-Instruct-v0.3 p50 end-to-end latency dropped from 5,235 ms to 3,098 ms at 64 concurrent 2,048-token requests (1.7x).
The p50 latency curve stays smooth and monotonic rather than spiking or collapsing, confirming that the request/limit headroom accommodates KV-cache growth patterns, improving GPU utilization.
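The headline ratios check out against the raw numbers:

```python
# Sanity-checking the 1.4x and 1.7x gains from the figures above.
throughput_gain = 1025 / 721   # Nemotron-3-Nano-30B token/s, dynamic vs. static
latency_gain = 5235 / 3098     # Mistral-7B p50 E2E latency, static vs. dynamic
print(round(throughput_gain, 1), round(latency_gain, 1))  # 1.4 1.7
```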


Key takeaway:
- Static fractions + bin packing: Predictable traffic, low-to-moderate concurrency, models with stable memory footprints
- Dynamic GPU fractions + bin packing: Variable traffic, high concurrency, models with significant KV-cache growth


Dynamic GPU fractions eliminate the performance ceiling of static allocations at high concurrency while maintaining workload density. With static fractions, the KV-cache cannot grow beyond the fixed memory boundary, and the inference engine begins rejecting requests because it lacks the headroom to admit new sequences. Dynamic GPU fractions solve this: a NIM can burst into available headroom on demand, and organizations get both the efficiency of bin packing and the resilience to handle traffic spikes without allocating additional GPUs.
GPU memory swap: Efficiently serving rarely-used models
Organizations serving LLMs face a fundamental trade-off between latency and cost. Scaling an LLM from zero means full container initialization, loading model weights from disk, and allocating GPU memory, a process that can take tens of seconds to minutes. Because this cold-start latency is unacceptable for user-facing applications, most organizations choose to over-provision, keeping multiple replicas always-on with dedicated GPUs even during low-traffic or idle periods.
This guarantees low latency but wastes GPU capacity, paying for hardware that sits idle just to avoid the risk of a cold start. Scale-to-zero (the Kubernetes pattern of shutting down idle replicas completely and restarting them on demand) can free the GPUs, but the cold-start penalty makes it impractical for latency-sensitive inference workloads.
How GPU memory swap works:
With GPU memory swap, models are kept in CPU memory and model weights are swapped dynamically between CPU and GPU as requests arrive. Only the active model’s weights reside in GPU memory at any moment. When a request targets an idle model, NVIDIA Run:ai’s GPU memory swap moves the currently loaded model’s weights to CPU RAM and loads the requested model into GPU memory, keeping it warm for a configurable window. The model never leaves memory entirely; it just moves between GPU and CPU, eliminating the need for container restarts, disk I/O, and cold-start initialization.
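The swap decision can be sketched as a tiny state machine: all weights stay warm in CPU RAM, and only one model occupies the GPU at a time. This is a conceptual sketch with hypothetical model names, not Run:ai’s implementation:

```python
# Conceptual sketch of GPU memory swap: weights stay resident in CPU RAM,
# only the active model occupies GPU memory, and a swap is a memory copy
# rather than a container restart. Illustrative only.

class SwapScheduler:
    def __init__(self, models):
        self.cpu_resident = set(models)  # all weights kept warm in CPU RAM
        self.on_gpu = None               # the single active model

    def serve(self, model: str) -> str:
        """Route a request to `model`, swapping its weights in if needed."""
        if model not in self.cpu_resident:
            raise KeyError(f"{model} not loaded")
        if self.on_gpu != model:
            # Evict the current model to CPU RAM and load the target.
            # No disk I/O or cold-start initialization is involved.
            self.on_gpu = model
            return "swapped-in"   # seconds-scale first-token latency
        return "warm"             # already on GPU

sched = SwapScheduler({"llm-7b", "vlm-12b", "moe-30b"})
assert sched.serve("llm-7b") == "swapped-in"
assert sched.serve("llm-7b") == "warm"       # subsequent requests are warm
assert sched.serve("moe-30b") == "swapped-in"
```

The key contrast with scale-to-zero is the final transition: a swap-in is a CPU-to-GPU copy, while a cold start pays container initialization and disk I/O on every activation.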
GPU memory swap works across single-GPU, multi-GPU, and fractional GPU workloads. Previous benchmarking with single-GPU deployments showed up to 66x improvements in time to first token (TTFT) compared with scale-from-zero. This benchmark combined GPU memory swap with NIM deployments on fractional GPUs to test whether the same latency advantages hold when models share hardware through bin packing and under memory constraints.
Benchmarking results:
Latency was compared between GPU memory swap and scale-from-zero for the same three NIM deployments:
- Scenario A (scale-from-zero): Each NIM cold‑starts from scratch on a dedicated H100 GPU when traffic arrives (three GPUs in total).
- Scenario B (GPU memory swap): The three NVIDIA NIM microservices share 1.5 H100 GPUs (with the same fractions from previous experiments), with swap-in/swap-out between GPU and CPU memory.




With scale-from-zero, infrequently accessed NIM microservices suffer high first-request latency due to full cold starts. With GPU memory swap, first-request latency stays acceptable, and subsequent requests see warm TTFT. All three NIM microservices run on half the GPUs, freeing the remaining capacity for high-traffic or other workloads.
At 128-token input, cold-start TTFT ranged from 75.3 s (Mistral-7B) to 92.7 s (Nemotron-3-Nano-30B), while GPU memory swap reduced these to 1.23–1.61 s, a 55–61x improvement. At 2,048-token input, cold-start TTFT of 158.3–180.2 s dropped to 3.52–4.02 s with swap, a consistent ~44x reduction.
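These speedups can be sanity-checked from the range endpoints (the per-model pairing of cold-start and swap values is not given, so only endpoint ratios are computed):

```python
# Endpoint ratios for the TTFT speedups quoted above.
short_ctx = (75.3 / 1.23, 92.7 / 1.61)    # 128-token input
long_ctx = (158.3 / 3.52, 180.2 / 4.02)   # 2,048-token input
print([round(r) for r in short_ctx])  # [61, 58], within the 55-61x range
print([round(r) for r in long_ctx])   # [45, 45], the consistent ~44x
```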
Key takeaway: GPU memory swap delivers 44-61x faster TTFT than scale-from-zero while using fewer resources when combined with GPU fractions, eliminating the cold-start penalty for infrequently accessed models, whether deployed on dedicated or fractional GPUs.
Start with NVIDIA Run:ai and NVIDIA NIM
Try this guide to get started with deploying NVIDIA NIM as a native inference workload on NVIDIA Run:ai. Watch this webinar to see how teams manage growing AI workloads with intelligent scheduling, fine-grained GPU controls, Kubernetes-native traffic balancing, and autoscaling—while new platform updates improve access control, endpoint management, and visibility.
