Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads

In production Kubernetes environments, the mismatch between model requirements and GPU capacity creates inefficiencies. Lightweight automatic speech recognition (ASR) or text-to-speech (TTS) models may require only 10 GB of VRAM, yet they occupy a whole GPU in standard Kubernetes deployments. Because the scheduler maps each model to at least one full GPU and cannot easily share a GPU across models, expensive compute resources often remain underutilized.

Solving this isn't just about cost reduction; it's about optimizing cluster density to serve more concurrent users on the same hardware. This guide details how to implement and benchmark GPU partitioning strategies, specifically NVIDIA Multi-Instance GPU (MIG) and time-slicing, to fully utilize compute resources.

Using a production-grade voice AI pipeline as our testbed, we show how to co-locate models to maximize infrastructure ROI while maintaining >99% reliability and strict latency guarantees.

Addressing GPU resource fragmentation

By default, the NVIDIA Device Plugin for Kubernetes exposes GPUs as integer resources. A pod requests nvidia.com/gpu: 1, and the scheduler binds it to an entire physical device.
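As a minimal sketch of this default request pattern (the pod name and image below are placeholders, not from our deployment):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tts-nim        # hypothetical pod name
spec:
  containers:
  - name: tts
    image: nvcr.io/nim/example-tts:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1   # reserves one entire physical GPU
```

Even if the model inside needs only ~10 GB of VRAM, the scheduler reserves the whole device for this pod.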

Large language models (LLMs) like NVIDIA Nemotron, Llama 3, or Qwen 7B/8B require dedicated compute to maintain low time to first token (TTFT) and high batch throughput. However, support models in a generative AI pipeline, such as embedding models, ASR, TTS, or guardrails, often use only a fraction of a card. Running these lightweight models on dedicated GPUs results in:

  • Low utilization: GPU compute utilization often hovers near 0-10%.
  • Cluster bloat: More nodes must be provisioned to run the same number of pods.
  • Scaling friction: Adding a new capability requires a new physical GPU.

To resolve this, we must break the 1:1 relationship between pods and GPUs.

Architecture: Partitioning strategies

We evaluated two primary strategies for GPU partitioning supported by the NVIDIA GPU Operator.

Software-based partitioning: Time-slicing and MPS

Time-slicing enables multiple NVIDIA CUDA processes to share a GPU by interleaving execution. It functions similarly to a CPU scheduler: context A runs, pauses, and context B runs.

  • Mechanism: Software-level scheduling through the CUDA driver.
  • Pros: Maximizes utilization. Enables “bursting”—if Pod A is idle, Pod B can use 100% of the GPU’s compute cores.
  • Cons: No hardware isolation. A memory overflow (OOM) in one pod may impact the shared execution context, and heavy compute in one pod can throttle neighbors (the “noisy neighbor” effect).
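As a sketch of how time-slicing is enabled through the GPU Operator's device plugin configuration (the ConfigMap name and the `any` profile key are our own choices; the `sharing.timeSlicing` schema follows the GPU Operator documentation):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config     # hypothetical name
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 2           # advertise 2 schedulable replicas per physical GPU
```

Once this config is referenced from the ClusterPolicy (`devicePlugin.config`), each physical GPU is advertised as two `nvidia.com/gpu` resources, so two pods can land on one card.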

In addition to time-slicing, NVIDIA Multi-Process Service (MPS) offers another software-based approach. MPS enables multiple processes to share GPU resources concurrently using a server-client architecture. This provides more flexibility than MIG and is more resilient to certain issues, such as memory leaks, than plain time-slicing.

However, in production, both methods share a single execution context, limiting isolation. While modern MPS provides isolated virtual address spaces, it lacks hardware-level fault isolation. This means a fatal execution error or illegal memory access in one process can propagate across the shared context, potentially resulting in a GPU reset that affects other processes sharing the card.

MIG: The hardware approach to partitioning

MIG physically partitions the GPU into separate instances, each with its own dedicated memory, cache, and streaming multiprocessors (SMs). To the OS and Kubernetes, these appear as separate PCI devices.

  • Mechanism: Hardware-level isolation.
  • Pros: Strict quality of service (QoS). One workload can’t impact the performance or memory stability of another.
  • Cons: Rigid sizing. If a partition is idle, its compute resources can’t be “borrowed” by a neighbor.
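Because each MIG instance is advertised as its own resource type, a pod claims a slice rather than a whole card. A minimal sketch (the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: asr-nim        # hypothetical pod name
spec:
  containers:
  - name: asr
    image: nvcr.io/nim/example-asr:latest   # placeholder image
    resources:
      limits:
        nvidia.com/mig-3g.40gb: 1   # one 3g.40gb MIG slice, not a full GPU
```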

While time-slicing offers flexibility, MIG is preferred for production environments where strict hardware-level fault isolation is required to meet enterprise SLAs. Hardware partitioning ensures that a memory error in one model cannot cause a cascading failure across the shared GPU, a critical requirement for mission-critical voice AI.

Experimental setup: The voice AI pipeline

To validate these strategies in a production-realistic scenario, we used a multimodal voice-to-voice AI pipeline. This workload is ideal for benchmarking because it mixes three distinct traffic patterns: streaming ASR input, bursty TTS synthesis, and compute-heavy LLM generation.

Before optimizing, it’s critical to understand our latency profile. In our voice-to-voice pipeline, the LLM is the dominant bottleneck. Under heavy load, the LLM accounts for ~9 seconds of the total processing time. This delay can fluctuate significantly with context length; for example, high-input scenarios (like training users) or growing conversation histories increase processing overhead compared with short prompts. As history accumulates, the LLM must process more tokens before generating a response, extending the bottleneck behind which the support models must be masked.

Consolidating support models like ASR and TTS provides a strategic path to maximizing hardware utilization while maintaining end-to-end responsiveness. While consolidation may introduce a slight latency increase of 100-200 ms, the gains in infrastructure throughput and ROI are significant.

Our hypothesis

Consolidating ASR and TTS workloads on a single GPU preserves latency and jitter while freeing compute for additional LLM instances.

Experiment

We designed three distinct configurations for testing. In each round, we used three voice samples, waiting for the first response from LLM+TTS to finish. The setup used a Kubernetes cluster with models deployed using NVIDIA NIM and managed by the NVIDIA NIM Operator. The worker node had access to three NVIDIA A100 Tensor Core GPUs.

  • Experiment 1: Baseline with three GPUs
    • Setup: One dedicated GPU for each model (LLM, ASR, TTS).
    • Goal: Establish the “gold standard” for latency and throughput against which to measure optimization.
    • Resource: nvidia.com/gpu: 1 per pod.
  • Experiment 2: Time-slicing with two GPUs
    • Setup: LLM retains a dedicated GPU. ASR and TTS share GPU 0 using software-level time-slicing.
    • Goal: Test if dynamic scheduling can handle the “noisy neighbor” contention between streaming ASR and bursty TTS.
    • Resource: nvidia.com/gpu: 1 (Shared via replicas: 2).
  • Experiment 3: MIG Partitioning with two GPUs
    • Setup: LLM retains a dedicated GPU. GPU 0 is physically partitioned into two isolated instances.
    • Goal: Test whether hardware isolation provides better stability than software scheduling.
    • Resource: nvidia.com/mig-3g.40gb: 1 per pod.

Configuration Note: To achieve these topologies, we used specific configurations within the NVIDIA GPU Operator.

  • For Experiment 2, we used the timeSlicing configuration to advertise multiple replicas per physical GPU.
  • For Experiment 3, we applied a custom mig-configs ConfigMap to partition the GPU into two 3g.40gb instances.
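As a sketch of the Experiment 3 partitioning (the ConfigMap and profile names are our own; the `mig-configs` schema follows the GPU Operator's MIG Manager conventions), GPU 0 is split into two 3g.40gb instances while GPUs 1 and 2 stay whole:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config       # hypothetical name
  namespace: gpu-operator
data:
  config.yaml: |-
    version: v1
    mig-configs:
      two-3g-40gb:              # hypothetical profile name
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2        # two isolated instances on GPU 0
        - devices: [1, 2]
          mig-enabled: false    # GPUs 1-2 remain whole for the LLM
```

The profile is then applied by labeling the node, for example: `kubectl label node <node-name> nvidia.com/mig.config=two-3g-40gb --overwrite`.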

(For the exact kubectl commands and YAML manifests used to reproduce this setup, see the Implementation Appendix at the end of this post.)

Results

To judge resource fragmentation, we tested the system with two distinct traffic patterns:

  • Light load: 5 concurrent users simulating ~135 seconds of sustained interaction.
  • Heavy load: 50 concurrent users simulating ~375 seconds of sustained interaction. 

Figure 5 compares generative AI inference throughput across traffic patterns. The data shows how partitioning affects request processing as concurrency increases, across baseline (dedicated GPUs), time-slicing (software sharing), and MIG (hardware partitioning) under light and heavy loads. All experiments achieved a 100% success rate with no failures. The observed req/s reflects the LLM bottleneck in the pipeline.

Mean latency metrics

The following analysis evaluates how different GPU partitioning strategies impact overall system efficiency and responsiveness.

Throughput compared with latency

Consolidating ASR and TTS workloads onto a single GPU results in an optimized pipeline, enabling the cluster to support more simultaneous AI streams. However, our benchmarks reveal a critical performance divergence between the two partitioning strategies:

  1. MIG (hardware): Highest efficiency 

Experiment 3 achieved the highest per-unit productivity, reaching ~1.00 req/s per GPU. By providing dedicated hardware paths for each instance, MIG eliminates resource contention. Organizations can achieve near-full system capability while effectively freeing up an entire GPU for other heavy LLM workloads.

  2. Time-slicing (software): Higher density with overhead

Experiment 2 showed that software-level sharing can also improve per-GPU density compared with the baseline, achieving ~0.76 req/s per GPU. However, the CUDA driver’s management of rapid context switches between streaming and bursty models introduces scheduling overhead. While functional, this software-based approach does not reach the aggregate throughput efficiency provided by hardware partitioning.

Latency and the bursty factor 

Time-slicing handles individual bursty tasks slightly faster, with a mean TTS latency of 144.7 ms compared with MIG’s 168.2 ms. However, this 23.5 ms difference represents a small fraction of the total end-to-end pipeline response time at this scale. Under heavy load, the LLM accounts for the overwhelming majority of the total interaction time. Because the end user cannot perceive a ~20 ms delta within a multi-second response, the throughput stability of MIG is the more valuable production metric.

Recommendations for partitioning

Based on the benchmark data, we recommend the following decision matrix:

  1. Default to MIG for production scale and stability
    • Experiment 3 showed that MIG handles higher request volumes (2 req/s) with only a minor latency trade-off.
    • Strict hardware-level fault isolation prevents a memory overflow in one process from crashing the other.
    • Best for production environments where throughput and 100% reliability are the priorities.
  2. Use time-slicing for development or low-concurrency apps
    • This approach involves a 32% reduction in total throughput and introduces shared-resource dependencies.
    • Best for development, CI/CD, and PoCs to run a full pipeline on a minimal hardware footprint.

Get started

  1. Experiment further: Try the repository.
  2. Implement partitioning: Follow our Implementation Guide to configure MIG profiles and use the provided YAML manifests to eliminate resource fragmentation in your cluster.
  3. Scale with NIM: Deploy NVIDIA NIM pipelines to fully utilize your ASR, TTS, and LLM workloads for maximum ROI.


