High-performance computing (HPC) customers continue to scale rapidly, with generative AI, large language models (LLMs), computer vision, and other use cases driving tremendous growth in GPU resource needs. Consequently, GPU efficiency is an ever-growing focus of infrastructure optimization. With enormous GPU fleet sizes, even small inefficiencies translate into significant cluster bottlenecks.
Optimizing GPU usage helps:
- Generate significant savings in operational costs.
- Enable more workloads to access GPU resources.
- Improve developer experience and throughput.
In this post, we present our process for reducing idle GPU waste across large-scale clusters, an effort that has the potential to save tens of millions of dollars in infrastructure costs while also improving overall developer productivity and resource utilization. In industry terms, waste means GPUs are not being used to their full potential, specifically due to ineffective cluster management or misses in optimization or error resolution.
Understanding GPU waste
GPU waste can be classified into multiple categories, and each requires its own tailored solution. One of the most frequent issues is jobs occupying GPU resources while sitting completely idle and not doing meaningful work. The following table summarizes these waste issues.
| GPU waste issue | Solutions | Observed frequency |
| --- | --- | --- |
| Hardware unavailability caused by failures | Fleet health efficiency program for monitoring, tracking, and rolling out fixes to hardware | Low |
| GPUs are healthy but not occupied | Occupancy efficiency programs, which primarily involve scheduler efficiency | Low |
| Jobs occupy GPUs but don’t use the compute efficiently | Application optimization efforts | High |
| Jobs occupy GPUs but don’t use them | Idle waste reduction program | Moderate |
Through operating research clusters that support highly diverse workloads, we have encountered both expected and unexpected causes of GPU idleness. Distinguishing between them is difficult but essential to ensure that researcher productivity remains unaffected. We’ve identified several recurring patterns that result in idle GPUs. Some of these include:
- CPU-only data processing jobs: Running on GPU nodes without using the GPUs.
- Misconfigured jobs: Over-provisioning GPUs due to exclusive node settings.
- Stuck jobs: Jobs that appear active but are stalled.
- Infrastructure overhead: Delays from container downloads or data fetching.
- Unattended interactive sessions: Leftover jobs consuming resources.
Ways to reduce GPU resource waste
To reduce idle GPU waste at scale, emphasis was placed on observing actual cluster behavior rather than relying on theoretical utilization targets. Once the underlying patterns surfaced, it became clear that efficiency could be meaningfully improved through a focused set of operational techniques rather than sweeping architectural changes.
From that analysis, we prioritized four techniques:
- Data collection and analysis: Gathered utilization and job traces to identify the top contributors to GPU waste.
- Metric development: Created a dedicated GPU idle waste metric to track baselines and measure improvements over time.
- Customer collaboration: Resolved inefficiencies by working directly with users and teams whose workflows drove the highest idle impact.
- Scaling solutions: Built self-serve tools and automation pipelines so improvements could scale across the entire fleet.
Building the GPU utilization metrics pipeline
To build the GPU utilization metrics pipeline, we aligned real-time telemetry from the NVIDIA Data Center GPU Manager (DCGM) with Slurm job metadata to create a unified view of how workloads actually consumed GPU resources. Although Slurm provided data at a five-minute granularity, it was sufficient for joining with the higher-resolution DCGM fields.
A key enabler in this process was the NVIDIA DCGM Exporter’s HPC job-mapping capability, through which GPU activity could be tagged with precise job context. That established the foundation needed to measure idle periods, identify waste contributors, and attribute inefficiencies to specific workflows.
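To make that join concrete, here is a minimal sketch of the pipeline step. It relies on assumptions not spelled out in this post: the DCGM Exporter is scraped into Prometheus at a hypothetical PROM_URL, the job-mapping feature surfaces the Slurm job ID under an hpc_job label (the label name may differ in your deployment), and sacct is runnable from the host executing the script.

```python
"""Minimal sketch: join DCGM Exporter samples with Slurm job metadata.

Assumptions not taken from this post: Prometheus scrapes the exporter at
PROM_URL, the HPC job-mapping feature exposes the Slurm job ID as an
`hpc_job` label, and `sacct` is available on this host.
"""
import subprocess

import requests

PROM_URL = "http://prometheus.internal:9090"  # hypothetical endpoint


def fetch_gpu_utilization(window: str = "5m") -> list[dict]:
    """Average DCGM_FI_DEV_GPU_UTIL per (host, GPU, job) over the window."""
    query = f"avg_over_time(DCGM_FI_DEV_GPU_UTIL[{window}])"
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=30)
    resp.raise_for_status()
    samples = []
    for result in resp.json()["data"]["result"]:
        labels = result["metric"]
        samples.append({
            "host": labels.get("Hostname"),
            "gpu": labels.get("gpu"),
            "job_id": labels.get("hpc_job"),  # injected by the job-mapping feature
            "util_pct": float(result["value"][1]),
        })
    return samples


def fetch_slurm_jobs() -> dict[str, dict]:
    """Job-level metadata (user, partition, state) from Slurm accounting."""
    out = subprocess.run(
        ["sacct", "--allocations", "--parsable2", "--noheader",
         "--format=JobID,User,Partition,Elapsed,State"],
        capture_output=True, text=True, check=True,
    ).stdout
    jobs = {}
    for line in out.splitlines():
        job_id, user, partition, elapsed, state = line.split("|")
        jobs[job_id] = {"user": user, "partition": partition,
                        "elapsed": elapsed, "state": state}
    return jobs


if __name__ == "__main__":
    jobs = fetch_slurm_jobs()
    for sample in fetch_gpu_utilization():
        meta = jobs.get(sample["job_id"], {})
        print(sample["job_id"], meta.get("user"), sample["host"],
              sample["gpu"], f"{sample['util_pct']:.0f}%")
```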


With the pipeline established, the next step was to examine the DCGM signals that drove the analysis and define how idle GPU behavior would be identified. The following sections outline the metrics used and the criteria applied to determine when a job was considered to be causing idle GPU time.
Tapping into DCGM
DCGM is NVIDIA’s management and monitoring framework for data-center GPUs. It provides a robust set of tools and APIs that let you observe, control, and optimize GPU resources at scale.
At its core, DCGM provides a wide range of metrics and telemetry data, organized into structures called fields. Each field has a unique identifier and field number; together, they represent everything from GPU temperature and clock speeds to utilization and power draw. You can explore the full list of available fields in the official documentation.
Here’s what these metrics typically cover:
- GPU utilization metrics: Measure how actively a GPU is being used. These include indicators for core compute load, memory usage, I/O throughput, and power consumption, helping you see whether a GPU is doing productive work or sitting idle.
- GPU performance metrics: Reflect how efficiently a GPU is working. Metrics such as clock speed, thermal status, and throttling events help assess performance and detect bottlenecks.
For the GPU waste metric, the DCGM_FI_DEV_GPU_UTIL field was used as the primary indicator of high-level GPU activity. Future iterations of the analysis are planned to transition to DCGM_FI_PROF_GR_ENGINE_ACTIVE to capture a more precise view of GPU engine utilization.
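As a quick illustration of what sampling these two fields looks like, the sketch below shells out to dcgmi dmon. It assumes the DCGM host engine is running locally and that field IDs 203 and 1001 correspond to DCGM_FI_DEV_GPU_UTIL and DCGM_FI_PROF_GR_ENGINE_ACTIVE; the output parsing is approximate and may need adjusting for your DCGM version.

```python
"""Sketch: sample the two fields above with `dcgmi dmon`.

Assumes a local DCGM host engine; output parsing is approximate and may
need adjusting for your DCGM version.
"""
import subprocess

# 203 = DCGM_FI_DEV_GPU_UTIL, 1001 = DCGM_FI_PROF_GR_ENGINE_ACTIVE
FIELD_IDS = "203,1001"


def sample_gpu_activity() -> dict[int, dict]:
    out = subprocess.run(
        ["dcgmi", "dmon", "-e", FIELD_IDS, "-c", "1"],
        capture_output=True, text=True, check=True,
    ).stdout
    readings = {}
    for line in out.splitlines():
        parts = line.split()
        # Data rows look roughly like: "GPU 0    95    0.950"
        if len(parts) >= 4 and parts[0] == "GPU":
            readings[int(parts[1])] = {
                "gpu_util_pct": float(parts[2]),
                "gr_engine_active": float(parts[3]),
            }
    return readings


if __name__ == "__main__":
    for gpu_id, vals in sample_gpu_activity().items():
        print(f"GPU {gpu_id}: util={vals['gpu_util_pct']:.0f}% "
              f"gr_active={vals['gr_engine_active']:.2f}")
```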
What classifies a job as idle?
AI and machine-learning (ML) workloads often include periods where the GPU is not actively used, either due to infrastructure inefficiencies or the natural behavior of the workload. Several common scenarios were observed:
- Container downloads: Job startup can stall while containers are pulled across multiple hosts, especially under heavy load or slow registry performance.
- Data loading and initialization: Training workflows may wait on data retrieval from storage before GPU compute begins.
- Checkpoint reads and writes: Utilization can drop during checkpointing operations.
- Model-specific behavior: Some model types simply don’t fully utilize the GPU by design.
To account for these cases, a threshold for prolonged inactivity was established. A conservative definition was used: A workload was considered idle when a full hour of continuous GPU inactivity was detected.
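The sketch below captures that rule as code: a GPU sample counts as inactive when its utilization sits near zero, and a workload is flagged once a full hour of continuous inactivity is observed. The 1% cutoff and the sample format are illustrative assumptions, not the production configuration.

```python
"""Illustrative version of the idle rule: one hour of continuous inactivity."""
from dataclasses import dataclass

IDLE_UTIL_PCT = 1.0          # assumed cutoff below which a sample counts as inactive
IDLE_WINDOW_SECONDS = 3600   # one full hour of continuous inactivity


@dataclass
class Sample:
    timestamp: float  # Unix seconds
    util_pct: float   # DCGM_FI_DEV_GPU_UTIL for a single GPU


def longest_idle_stretch(samples: list[Sample]) -> float:
    """Length in seconds of the longest run of consecutive inactive samples."""
    ordered = sorted(samples, key=lambda s: s.timestamp)
    longest, run_start = 0.0, None
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.util_pct < IDLE_UTIL_PCT and cur.util_pct < IDLE_UTIL_PCT:
            run_start = prev.timestamp if run_start is None else run_start
            longest = max(longest, cur.timestamp - run_start)
        else:
            run_start = None
    return longest


def is_idle(samples: list[Sample]) -> bool:
    """True when the GPU was continuously inactive for at least the window."""
    return longest_idle_stretch(samples) >= IDLE_WINDOW_SECONDS
```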
Once the GPU waste metric was established, the focus shifted toward making the data usable. The goal was not only to surface idle behavior but to expose it in a way that allowed researchers and platform teams to quickly understand the source of inefficiencies. To support this, several visualization layers and operational tools were built to turn the underlying telemetry into clear signals and automated interventions.
GPU waste metrics were surfaced through two primary interfaces:
- User portal: An internal NVIDIA portal where ML researchers could view cluster-, user-, and job-level GPU usage, making idle patterns far easier to recognize.
- OneLogger: A unified monitoring layer that correlated job phases with GPU telemetry, giving users clearer visibility into where inefficiencies emerged.
Together, these tools made GPU waste more transparent and actionable.
Tooling: Idle GPU job reaper
We developed a service to identify and clean up jobs that were no longer using their GPUs, essentially providing self-cleaning behavior for the fleet. Because the cluster runs highly diverse workloads with no shared abstraction layer, users were given the ability to tune the reaper’s thresholds to match the expected idle characteristics of their jobs. This allowed the system to distinguish between predictable idle phases and real waste.
At a high level, the service:
- Monitored GPU utilization through DCGM metrics.
- Flagged jobs with prolonged periods of inactivity.
- Terminated those jobs and reclaimed the idle GPUs.
- Logged and reported the recovered capacity and user configurations to drive further improvements.
This approach ensured that both expected and unexpected idle patterns could be handled consistently across the fleet.
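A highly simplified sketch of that control loop follows. The get_gpu_idle_seconds and notify_user hooks are hypothetical stand-ins for the metrics pipeline and messaging integration, the per-user override table is a placeholder for the real configuration mechanism, and termination goes through scancel.

```python
"""Simplified reaper loop; helper hooks and thresholds are placeholders."""
import subprocess

DEFAULT_IDLE_SECONDS = 3600                   # fleet-wide default threshold
USER_THRESHOLDS = {"team-simulation": 7200}   # hypothetical per-user overrides


def reap_idle_jobs(running_jobs, get_gpu_idle_seconds, notify_user, dry_run=True):
    """Flag jobs whose GPUs all sat idle past their threshold and reclaim them."""
    reclaimed = []
    for job in running_jobs:  # each job: {"job_id": str, "user": str}
        threshold = USER_THRESHOLDS.get(job["user"], DEFAULT_IDLE_SECONDS)
        idle_seconds = get_gpu_idle_seconds(job["job_id"])  # {gpu_id: seconds idle}
        if not idle_seconds:
            continue
        # Only reap when every GPU held by the job has been idle past the threshold.
        if all(seconds >= threshold for seconds in idle_seconds.values()):
            notify_user(job["user"], job["job_id"], threshold)
            if not dry_run:
                subprocess.run(["scancel", job["job_id"]], check=True)
            reclaimed.append(job["job_id"])
    return reclaimed  # logged downstream to report recovered capacity
```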
Tooling: Job linter
We created a job-linting tool to detect misconfigured workloads—for instance, jobs requesting exclusive access to all GPUs on a node but only using a subset, leaving the remaining devices idle. Future versions of the linter are planned to expand coverage to a broader set of misconfiguration patterns.
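As an illustration only, a lint check for that specific pattern might look like the following. The SBATCH regexes and the GPUS_PER_NODE constant are assumptions about one common submission style rather than a description of the actual linter.

```python
"""Illustrative lint check: exclusive node request with fewer GPUs than the node has."""
import re

GPUS_PER_NODE = 8  # hypothetical node shape


def lint_batch_script(script_text: str) -> list[str]:
    warnings = []
    exclusive = re.search(r"^#SBATCH\s+--exclusive\b", script_text, re.MULTILINE)
    gpus = re.search(r"^#SBATCH\s+--gpus(?:-per-node)?[=\s]+(\d+)",
                     script_text, re.MULTILINE)
    if exclusive and gpus and int(gpus.group(1)) < GPUS_PER_NODE:
        warnings.append(
            f"--exclusive reserves all {GPUS_PER_NODE} GPUs on the node, but only "
            f"{gpus.group(1)} are requested; the remaining devices will sit idle."
        )
    return warnings


if __name__ == "__main__":
    script = "#SBATCH --exclusive\n#SBATCH --gpus=2\nsrun python train.py\n"
    for warning in lint_batch_script(script):
        print(warning)
```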
Tooling: Defunct jobs
Time-limited jobs in the cluster often led users to submit long chains of follow-on jobs that waited in the queue with reserved resources, even when they were no longer needed. Another issue is that any regression in a user’s job could be compounded by large numbers of repeated re-runs. These defunct submissions consumed scheduling cycles and introduced unnecessary overhead. Tooling was built to automatically detect and cancel such redundant jobs, reducing waste and improving overall scheduling efficiency.
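A rough sketch of how such chains can be spotted from the queue is shown below. The grouping heuristic (same user and job name) and the retention allowance are illustrative assumptions; the exact policy of the production tooling is not described in this post.

```python
"""Sketch: find redundant queued follow-on jobs, grouped by (user, job name)."""
import subprocess
from collections import defaultdict

MAX_QUEUED_PER_CHAIN = 3  # assumed allowance of queued follow-ons per chain


def find_defunct_chains() -> dict[tuple[str, str], list[str]]:
    out = subprocess.run(
        ["squeue", "-h", "-t", "PD", "-o", "%i %u %j"],  # pending jobs: id, user, name
        capture_output=True, text=True, check=True,
    ).stdout
    chains: dict[tuple[str, str], list[str]] = defaultdict(list)
    for line in out.splitlines():
        job_id, user, name = line.split(maxsplit=2)
        chains[(user, name)].append(job_id)
    # Keep the oldest submissions; everything beyond the allowance is a candidate.
    # Lexical sort is approximate; array job IDs would need extra handling.
    return {key: sorted(ids)[MAX_QUEUED_PER_CHAIN:]
            for key, ids in chains.items() if len(ids) > MAX_QUEUED_PER_CHAIN}


if __name__ == "__main__":
    for (user, name), extra in find_defunct_chains().items():
        print(f"{user}/{name}: {len(extra)} redundant queued jobs "
              f"(candidates for scancel: {' '.join(extra)})")
```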
Lessons learned and next steps
Small inefficiencies compound quickly at scale. Once the right metrics were exposed, visibility alone drove a natural shift toward accountability and better behavior across teams. Beyond metrics, researchers also need actionable guidance on how to improve the efficiency of their workloads. Broad adoption of these practices was needed to achieve fleet-level impact. Monitoring tools needed to integrate directly into the daily workflow to be effective; making utilization insights available at job submission time and within experiment-tracking interfaces proved essential.
Through these efforts, GPU waste was reduced from roughly 5.5% to about 1%, a considerable improvement that translated into meaningful cost savings and increased availability of GPUs for high-priority workloads. These gains demonstrated how operational inefficiencies, once surfaced and addressed, can return significant capacity to the fleet.
The measurement process also surfaced a number of infrastructure gaps that contribute to idle behavior. Several improvements are planned to further reduce waste, such as faster container loading, data caching, support for long-running jobs, and enhanced debugging tools.
Start instrumenting and monitoring DCGM metrics today. These signals reveal where GPU cycles are being wasted and provide the foundation for building simple, actionable tooling that helps researchers optimize their jobs and keep GPUs consistently utilized.
Mohamed Fawzy, Mohammed Elshall, Bugra Gedik, Michael Hale, Kaiwen Shi, Vishal Patil, and Ashita Kulkarni contributed to the research described in this post.
