Kubernetes underpins a large portion of AI workloads in production. Yet maintaining GPU nodes and ensuring that applications are running, training jobs are progressing, and traffic is being served across Kubernetes clusters is easier said than done.
NVSentinel is designed to help with these challenges. An open source system for Kubernetes AI clusters, NVSentinel continuously monitors GPU health and automatically remediates issues before they disrupt workloads.
A health system for Kubernetes GPU clusters
NVSentinel is an intelligent monitoring and self-healing system for Kubernetes clusters that run GPU workloads. Just as a building’s fire alarm continuously monitors for smoke and automatically responds to threats, NVSentinel continuously monitors your GPU hardware and automatically responds to failures. It’s part of a broader category of open source health automation tools designed to improve GPU uptime, utilization, and reliability.
GPU clusters are expensive and failures are costly. In modern AI and high-performance computing, organizations operate large clusters of servers with NVIDIA GPUs that can cost tens of thousands of dollars each. When those GPUs fail, the consequences can include:
- Silent corruption: Faulty GPUs producing incorrect results that go undetected
- Cascading failures: One bad GPU crashing an entire multi-day training job
- Wasted resources: Healthy GPUs sitting idle while waiting for a failed node to recover
- Manual intervention: On-call engineers getting paged at all hours to diagnose issues
- Lost productivity: Data scientists spending hours re-running failed experiments
Traditional monitoring systems detect problems but rarely fix them. Accurately diagnosing and remediating GPU issues still requires deep expertise, and remediation can take hours or even days.
NVIDIA runs some of the world’s largest GPU clusters to support products and research efforts such as NVIDIA Omniverse, NVIDIA Cosmos, and NVIDIA Isaac GR00T. Maintaining the health of those clusters at scale requires automation.
Over the past 12 months, NVIDIA teams have been developing and testing NVSentinel internally across NVIDIA DGX Cloud clusters. It has already helped reduce downtime and improve utilization by detecting and isolating GPU failures in minutes instead of hours.
How does NVSentinel work?
NVSentinel is installed in each Kubernetes cluster you run. Once deployed, NVSentinel continuously watches nodes for errors, analyzes events, and takes automated actions such as quarantining, draining, labeling, or triggering external remediation workflows. Specific NVSentinel features include continuous monitoring, data aggregation and analysis, and more, as detailed below.


Continuous monitoring
NVSentinel deploys modular GPU and system monitors to track thermal issues, memory errors, and hardware faults using NVIDIA DCGM diagnostics. It also inspects kernel logs for driver crashes and hardware errors, and can integrate with cloud provider APIs (AWS, GCP, OCI) to detect maintenance events or hardware faults. The modular design makes it easy to extend with custom monitors and data sources.
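To see the kind of raw signals these monitors collect, you can run similar checks by hand on a GPU node. The commands below use standard NVIDIA and Linux tooling and are illustrative rather than part of NVSentinel itself:

# Run a quick (level 1) DCGM diagnostic against the GPUs DCGM manages
dcgmi diag -r 1
# Scan the kernel log for NVIDIA Xid events, which signal driver or hardware errors
sudo dmesg | grep -i xid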
Data aggregation and analysis
Collected signals flow into the NVSentinel analysis engine, which classifies events by severity and type. Using rule-based patterns similar to operational runbooks, it distinguishes between transient issues, hardware faults, and systemic cluster problems. For example:
- A single correctable ECC error might be logged and monitored
- Repeated uncorrectable ECC errors trigger node quarantine
- Driver crashes result in node drain and cordon actions
This approach shifts health management in the cluster from “detect and alert” to “detect, diagnose, and act,” with policy-driven responses that you can configure declaratively.
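To make the correctable-versus-uncorrectable distinction concrete, you can inspect the ECC counters that feed this kind of rule directly on a node. This is a manual illustration of the underlying signal, not NVSentinel’s own interface:

# Show volatile (since reboot) and aggregate ECC error counts per GPU;
# repeated uncorrectable (double-bit) errors are the pattern that warrants quarantine
nvidia-smi -q -d ECC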
Automated remediation
When a node is identified as unhealthy, NVSentinel coordinates the Kubernetes-level response:
- Cordon and drain to prevent workload disruption
- Set NodeConditions that expose GPU or system health context to the scheduler and operators
- Trigger external remediation hooks to reset or reprovision hardware
The NVSentinel remediation workflow is pluggable by design. If you already have a repair or reprovisioning workflow, it can be integrated seamlessly with NVSentinel. This makes it easy to connect with custom systems such as service management platforms, node imaging pipelines, or cloud automation tools.
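The cordon and drain steps map onto standard Kubernetes operations, so you can reproduce or undo them manually when needed. A minimal sketch, where the node name is a placeholder:

# Mark the node unschedulable so no new pods are placed on it
kubectl cordon gpu-node-01
# Evict existing workloads, ignoring DaemonSet-managed pods
kubectl drain gpu-node-01 --ignore-daemonsets --delete-emptydir-data
# Return the node to service once your repair workflow has finished
kubectl uncordon gpu-node-01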


The system is disaggregated, enabling operators to use only the components they need. It’s designed to fit into diverse operational models rather than replace them. You may choose to:
- Deploy only monitoring and detection
- Enable automated cordon and drain actions
- Enable full closed-loop remediation (for more advanced environments)
Example: Detecting and recovering from GPU errors
To give an example, consider a 64-GPU training job. One node starts reporting repeated double-bit ECC errors. Traditionally, this might go unnoticed until the job fails hours later. With NVSentinel, the GPU Health Monitor detects the pattern, the Analyzer classifies the node as degraded, the node is cordoned and drained, and a remediation workflow reprovisions the node. The job continues with minimal disruption, saving hours of GPU time and preventing wasted compute.
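In a scenario like this, you can watch the quarantine take effect from the cluster side; the node name below is hypothetical:

# A cordoned node shows a Ready,SchedulingDisabled status
kubectl get nodes
# Conditions, taints, and recent events explain why the node was taken out of rotation
kubectl describe node gpu-node-01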
How to get started with NVSentinel
NVSentinel uses the NVIDIA Data Center GPU Manager (DCGM), deployed through the NVIDIA GPU Operator, to collect GPU and NVIDIA NVSwitch health signals. If your environment supports the GPU Operator and DCGM, NVSentinel can monitor and act on GPU-level faults.
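If the GPU Operator is not already running in your cluster, it can be installed with Helm as described in the GPU Operator documentation. The namespace below follows that documentation and can be adjusted for your environment:

# Add the NVIDIA Helm repository and install the GPU Operator, which deploys DCGM
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator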
Supported NVIDIA hardware includes all data center GPUs supported by DCGM, such as:
- NVIDIA H100 (80 GB, 144 GB, NVL)
- NVIDIA B200 series
- NVIDIA A100 (PCIe and SXM4)
- NVIDIA V100
- NVIDIA A30 / A40
- NVIDIA P100, P40, P4
- NVIDIA K80 and newer Tesla-class data center GPUs
DCGM also exposes telemetry for NVSwitch-based systems, enabling NVSentinel to monitor NVIDIA DGX and HGX platforms, including DGX A100, DGX H100, HGX A100, HGX H100, and HGX B200. For an authoritative list, see the DCGM Supported Products documentation.
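To confirm that DCGM can see the GPUs, and on supported systems the NVSwitches, in a given node before relying on NVSentinel’s checks, you can list the devices it discovers:

# List the devices visible to the DCGM host engine on this node
dcgmi discovery -l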
Note that NVSentinel is currently in an experimental phase. We don’t recommend using NVSentinel in production systems at this point.
Installation
You can deploy NVSentinel into your Kubernetes clusters using a single command:
helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v0.3.0  # Replace with any published chart version
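After the chart is installed, a quick way to confirm the release is healthy is to check its Helm status and look for the pods it created. The namespace and pod names depend on how you install the chart:

# Check the Helm release and find the NVSentinel pods across namespaces
helm status nvsentinel
kubectl get pods -A | grep -i nvsentinel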
The NVSentinel documentation explains how to integrate with DCGM, customize monitors and actions, and deploy on-premises or in the cloud. Example manifests are included for both environments.
More NVIDIA initiatives for advancing GPU health
NVSentinel is also part of a broader set of NVIDIA initiatives focused on advancing GPU health, transparency, and operational resilience for customers. These initiatives include the NVIDIA GPU Health service, which provides fleet-level telemetry and integrity insights that complement NVSentinel’s Kubernetes-native monitoring and automation. Together, these efforts reflect NVIDIA’s ongoing commitment to helping operators run healthier and more reliable GPU infrastructure at every scale.
Get involved with NVSentinel
NVSentinel is currently in an experimental phase. We encourage you to try it and leave feedback through GitHub issues in the NVIDIA/NVSentinel repo. We don’t recommend using NVSentinel in production systems just yet. Upcoming releases will expand GPU telemetry coverage and logging systems such as NVIDIA DCGM, and add more remediation workflows and policy engines. More stability checks and documentation will also be added as the project matures toward a stable release.
NVSentinel is open source and we welcome contributions. To get involved, you can:
- Test NVSentinel on your own GPU clusters
- Share feedback and have requests on GitHub
- Contribute new monitors, analysis rules, or remediation workflows
To get started, visit the NVIDIA/NVSentinel GitHub repo and follow the NVSentinel project roadmap for regular updates.
