Enable Gang Scheduling and Workload Prioritization in Ray with NVIDIA KAI Scheduler

NVIDIA KAI Scheduler is now natively integrated with KubeRay, bringing the identical scheduling engine that powers high‑demand and high-scale environments in NVIDIA Run:ai directly into your Ray clusters. This implies you’ll be able to now tap into gang scheduling, workload autoscaling, workload prioritization, hierarchical queues, and plenty of more features in your Ray environment. Together, these capabilities make your infrastructure smarter by coordinating job starts, sharing GPUs efficiently, and prioritizing workloads. And all you might have to do is configure it.

What this implies for Ray users:

Gang scheduling: no partial startups 
Distributed Ray workloads need all their employees and actors to start out together—or under no circumstances. KAI ensures they launch as a coordinated gang, stopping wasteful partial allocations that stall training or inference pipelines.
Workload and cluster autoscaling 
For workloads similar to offline batch inference, Ray clusters can scale up as cluster resources turn out to be available or when queues permit over-quota usage. They may also scale down as demand decreases, providing elastic compute aligned with resource availability and workload needs without manual intervention.
Workload priorities: smooth coexistence of several types of workloads 
High‑priority inference jobs can routinely preempt lower‑priority batch training if resources are limited, keeping your applications responsive without manual intervention.
Hierarchical queuing with priorities: dynamic resource sharing 
Create queues for various project teams with clear priorities in order that when capability is on the market, the upper‑priority queue can borrow idle resources from other teams.

On this post, we’ll walk through a hands-on example of how KAI enables smarter resource allocation and responsiveness for Ray—particularly in clusters where training and online inference must coexist. You’ll see how you can:

Schedule distributed Ray employees as a gang, ensuring coordinated startup.
Leverage priority-based scheduling, where inference jobs preempt lower-priority training jobs.

The result’s a tightly integrated execution stack, built from tools designed to work together, from scheduling policies to model serving.

Enable Gang Scheduling and Workload Prioritization in Ray with NVIDIA KAI Scheduler

Technical setup

Step 1: Arrange KAI Scheduler queues

Step 2: Submit a training job with gang scheduling and workload prioritization

Step 3: Deploy an inference service with higher priority using vLLM

Observe preemption in motion

Interact with the deployed model

Wrapping up

What are your thoughts on this topic?
Let us know in the comments below.

Share this article

Recent posts

Claude Skills and Subagents: Escaping the Prompt Engineering Hamster Wheel

Scaling ML Inference on Databricks: Liquid or Partitioned? Salted or Not?

Statement from Dario Amodei on our discussions with the Department of War Anthropic

Google quantum-proofs HTTPS by squeezing 2.5kB of information into 64-byte space – Ars Technica

Generative AI, Discriminative Human

Enable Gang Scheduling and Workload Prioritization in Ray with NVIDIA KAI Scheduler

Technical setup

Step 1: Arrange KAI Scheduler queues

Step 2: Submit a training job with gang scheduling and workload prioritization

Step 3: Deploy an inference service with higher priority using vLLM

Observe preemption in motion

Interact with the deployed model

Wrapping up

What are your thoughts on this topic? Let us know in the comments below.

Share this article

Recent posts

What are your thoughts on this topic?
Let us know in the comments below.