Automate Kubernetes AI Cluster Health with NVSentinel



Kubernetes underpins a large portion of all AI workloads in production. Yet maintaining GPU nodes and ensuring that applications are running, training jobs are progressing, and traffic is served across Kubernetes clusters is easier said than done.

NVSentinel is designed to assist with these challenges. An open source system for Kubernetes AI clusters, NVSentinel continuously monitors GPU health and automatically remediates issues before they disrupt workloads.

A health system for Kubernetes GPU clusters

NVSentinel is an intelligent monitoring and self-healing system for Kubernetes clusters that run GPU workloads. Just as a building’s fire alarm continuously monitors for smoke and automatically responds to threats, NVSentinel continuously monitors your GPU hardware and automatically responds to failures. It’s part of a broader category of open source health automation tools designed to improve GPU uptime, utilization, and reliability.

GPU clusters are expensive and failures are costly. In modern AI and high-performance computing, organizations operate large clusters of servers with NVIDIA GPUs that can cost tens of thousands of dollars each. If those GPUs fail, the consequences can include:

  • Silent corruption: Faulty GPUs producing incorrect results that go undetected
  • Cascading failures: One bad GPU crashing an entire multi-day training job
  • Wasted resources: Healthy GPUs sitting idle while waiting for a failed node to recover
  • Manual intervention: On-call engineers getting paged at all hours to diagnose issues
  • Lost productivity: Data scientists spending hours re-running failed experiments

Traditional monitoring systems detect problems but rarely fix them. Accurately diagnosing and remediating GPU issues still requires deep expertise, and remediation can take hours or even days.

NVIDIA runs some of the world’s largest GPU clusters to support products and research efforts such as NVIDIA Omniverse, NVIDIA Cosmos, and NVIDIA Isaac GR00T. Maintaining the health of those clusters at scale requires automation.

Over the past 12 months, NVIDIA teams have been developing and testing NVSentinel internally across NVIDIA DGX Cloud clusters. It has already helped reduce downtime and improve utilization by detecting and isolating GPU failures in minutes instead of hours.

How does NVSentinel work?

NVSentinel is installed in each Kubernetes cluster you run. Once deployed, NVSentinel continuously watches nodes for errors, analyzes events, and takes automated actions such as quarantining, draining, labeling, or triggering external remediation workflows. Specific NVSentinel features include continuous monitoring, data aggregation and analysis, and more, as detailed below.

Diagram of the NVSentinel architecture showing three stages: health and event monitors feeding into a processing layer with a database and analyzer, which then drives remediation actions that interact with the Kubernetes API.
Figure 1. NVSentinel logical architecture

Continuous monitoring

NVSentinel deploys modular GPU and system monitors to track thermal issues, memory errors, and hardware faults using NVIDIA DCGM diagnostics. It also inspects kernel logs for driver crashes and hardware errors, and can integrate with cloud provider APIs (AWS, GCP, OCI) to detect maintenance events or hardware faults. The modular design makes it easy to extend with custom monitors and data sources.
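
NVSentinel collects these signals automatically, but the same checks can be run by hand to see what it consumes. A minimal sketch, assuming a default GPU Operator install; the namespace, pod, and node names below are placeholders and will vary per environment:

kubectl -n gpu-operator get pods | grep dcgm                          # locate the DCGM pod
kubectl -n gpu-operator exec -it <dcgm-pod> -- dcgmi discovery -l     # list GPUs visible to DCGM
kubectl -n gpu-operator exec -it <dcgm-pod> -- dcgmi diag -r 2        # run a medium-length DCGM diagnostic
ssh <gpu-node> 'dmesg -T | grep -iE "xid|nvrm"'                       # kernel log entries for driver/hardware errors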

Data aggregation and analysis

Collected signals flow into the NVSentinel analysis engine, which classifies events by severity and type. Using rule-based patterns similar to operational runbooks, it distinguishes between transient issues, hardware faults, and systemic cluster problems. For instance:

  • A single correctable ECC error might be logged and monitored
  • Repeated uncorrectable ECC errors trigger node quarantine
  • Driver crashes result in node drain and cordon actions

This approach shifts cluster health management from “detect and alert” to “detect, diagnose, and act,” with policy-driven responses that you can declaratively configure.
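
The actual rule and policy format is defined in the NVSentinel documentation. Purely as an illustration of the idea, and not NVSentinel’s real schema, a declarative mapping from event patterns to actions could be sketched like this:

# Hypothetical illustration only -- not NVSentinel's actual configuration schema
cat <<'EOF' > example-health-policy.yaml
rules:
  - match: { source: dcgm, event: ecc-single-bit, count: 1 }        # single correctable ECC error
    action: log
  - match: { source: dcgm, event: ecc-double-bit, repeated: true }  # repeated uncorrectable ECC errors
    action: quarantine
  - match: { source: kernel-log, event: driver-crash }
    action: cordon-and-drain
EOF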

Automated remediation 

When a node is identified as unhealthy, NVSentinel coordinates the Kubernetes-level response (rough manual equivalents are sketched after this list):

  • Cordon and drain to prevent workload disruption
  • Set NodeConditions that expose GPU or system health context to the scheduler and operators
  • Trigger external remediation hooks to reset or reprovision hardware 
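
NVSentinel drives these steps through the Kubernetes API on its own; for orientation, the rough manual kubectl equivalents are shown below, with gpu-node-17 as a placeholder node name:

kubectl cordon gpu-node-17                                             # mark the node unschedulable
kubectl drain gpu-node-17 --ignore-daemonsets --delete-emptydir-data   # evict workloads safely
kubectl describe node gpu-node-17 | grep -A 10 "Conditions:"           # review NodeConditions exposing health context
kubectl uncordon gpu-node-17                                           # return the node to service after remediation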

The NVSentinel remediation workflow is pluggable by design. If you already have an existing repair or reprovisioning workflow, it can be integrated seamlessly with NVSentinel. This makes it easy to connect with custom systems such as service management platforms, node imaging pipelines, or cloud automation tools.

Diagram showing the NVSentinel remediation workflow: an unhealthy event flows into issue identification, then to cordon and drain of the affected node, followed by a remediation action executed through Kubernetes or another system API, resulting in a healthy event and uncordoning the node to return it to service.
Figure 2. NVSentinel remediation workflow

The system is disaggregated, enabling operators to use only the components they need. It’s designed to fit into diverse operational models rather than replace them. You may choose to:

  • Deploy only monitoring and detection
  • Enable automated cordon and drain actions
  • Enable full closed-loop remediation (for more advanced environments)

Example: Detecting and recovering from GPU errors

To give an example, consider a 64-GPU training job. One node starts reporting repeated double-bit ECC errors. Traditionally, this might go unnoticed until the job fails hours later. With NVSentinel, the GPU Health Monitor detects the pattern, the Analyzer classifies the node as degraded, the node is cordoned and drained, and a remediation workflow reprovisions the node. The job continues with minimal disruption, saving hours of GPU time and preventing wasted compute.
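
To see the same symptom by hand, double-bit ECC errors show up in nvidia-smi counters and as kernel Xid messages. A quick manual check might look like the following; the node name is a placeholder and how you reach the node is environment-specific:

nvidia-smi -q -d ECC                          # per-GPU volatile and aggregate ECC error counters
dmesg -T | grep -iE "xid|ecc"                 # NVIDIA Xid / ECC messages in the kernel log
kubectl get node gpu-node-17                  # confirm the node was cordoned (Ready,SchedulingDisabled)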

How to get started with NVSentinel

NVSentinel uses the NVIDIA Data Center GPU Manager (DCGM), deployed through the NVIDIA GPU Operator, to collect GPU and NVIDIA NVSwitch health signals. If your environment supports the GPU Operator and DCGM, NVSentinel can monitor and act on GPU-level faults.
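
A quick way to confirm that the prerequisite stack is in place; the gpu-operator namespace is the common default for the GPU Operator and is an assumption here:

kubectl get pods -n gpu-operator                # all GPU Operator components
kubectl get pods -n gpu-operator | grep dcgm    # DCGM components specifically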

Supported NVIDIA hardware includes all data center GPUs supported by DCGM, such as:

  • NVIDIA H100 (80 GB, 144 GB, NVL)
  • NVIDIA B200 series
  • NVIDIA A100 (PCIe and SXM4)
  • NVIDIA V100
  • NVIDIA A30 / A40
  • NVIDIA P100, P40, P4
  • NVIDIA K80 and newer Tesla-class data center GPUs

DCGM also exposes telemetry for NVSwitch-based systems, enabling NVSentinel to monitor NVIDIA DGX and HGX platforms, including DGX A100, DGX H100, HGX A100, HGX H100, and HGX B200. For an authoritative list, see the DCGM Supported Products documentation.

Note that NVSentinel is currently in an experimental phase. We don’t recommend using NVSentinel in production systems at this point.

Installation

You can deploy NVSentinel into your Kubernetes clusters using a single command:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v0.3.0  # Replace with any published chart version
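
After installation, a quick sanity check; the release name matches the command above, and the namespace depends on how you ran helm:

helm status nvsentinel                       # confirm the release installed successfully
kubectl get pods -A | grep -i nvsentinel     # locate the NVSentinel pods regardless of namespace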

The NVSentinel documentation explains how to integrate with DCGM, customize monitors and actions, and deploy on-premises or in the cloud. Example manifests are included for both environments.

More NVIDIA initiatives for advancing GPU health

NVSentinel is also part of a broader set of NVIDIA initiatives focused on advancing GPU health, transparency, and operational resilience for customers. These initiatives include the NVIDIA GPU Health service, which provides fleet-level telemetry and integrity insights that complement NVSentinel’s Kubernetes-native monitoring and automation. Together, these efforts reflect NVIDIA’s ongoing commitment to helping operators run healthier and more reliable GPU infrastructure at every scale.

Get involved with NVSentinel

NVSentinel is currently in an experimental phase. We encourage you to try it and share feedback through GitHub issues on the NVIDIA/NVSentinel repo. We don’t recommend using NVSentinel in production systems just yet. Upcoming releases will expand coverage of GPU telemetry and logging systems such as NVIDIA DCGM, and add more remediation workflows and policy engines. More stability checks and documentation will also be added as the project matures toward a stable release.

NVSentinel is open source and we welcome contributions. To get involved, you can:

  • Test NVSentinel on your own GPU clusters
  • Share feedback and feature requests on GitHub
  • Contribute new monitors, analysis rules, or remediation workflows

To get started, visit the NVIDIA/NVSentinel GitHub repo and follow the NVSentinel project roadmap for regular updates.


