Enable Gang Scheduling and Workload Prioritization in Ray with NVIDIA KAI Scheduler

NVIDIA KAI Scheduler is now natively integrated with KubeRay, bringing the same scheduling engine that powers high-demand, high-scale environments in NVIDIA Run:ai directly into your Ray clusters.
This means you can now tap into gang scheduling, workload autoscaling, workload prioritization, hierarchical queues, and many more features in your Ray environment. Together, these capabilities make your infrastructure smarter by coordinating job starts, sharing GPUs efficiently, and prioritizing workloads. And all you have to do is configure it.

What this means for Ray users:

  • Gang scheduling: no partial startups

    Distributed Ray workloads need all their workers and actors to start together, or not at all. KAI ensures they launch as a coordinated gang, preventing wasteful partial allocations that stall training or inference pipelines.
  • Workload and cluster autoscaling

    For workloads such as offline batch inference, Ray clusters can scale up as cluster resources become available or when queues permit over-quota usage. They can also scale down as demand decreases, providing elastic compute aligned with resource availability and workload needs without manual intervention.
  • Workload priorities: smooth coexistence of different types of workloads

    High-priority inference jobs can automatically preempt lower-priority batch training if resources are limited, keeping your applications responsive without manual intervention.
  • Hierarchical queuing with priorities: dynamic resource sharing

    Create queues for different project teams with clear priorities so that when capacity is available, the higher-priority queue can borrow idle resources from other teams.

In this post, we’ll walk through a hands-on example of how KAI enables smarter resource allocation and responsiveness for Ray, particularly in clusters where training and online inference must coexist. You’ll see how to:

  • Schedule distributed Ray workers as a gang, ensuring coordinated startup.
  • Leverage priority-based scheduling, where inference jobs preempt lower-priority training jobs.

The result is a tightly integrated execution stack, built from tools designed to work together, from scheduling policies to model serving.

Technical setup

This example assumes a GPU-enabled Kubernetes cluster with the KAI Scheduler installed and the KubeRay operator configured to use KAI as its batch scheduler, which is done by setting the following Helm value when installing the operator:

--set batchScheduler.name=kai-scheduler
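For reference, here is a minimal sketch of such an install; the Helm release and chart names below are common defaults and should be treated as assumptions, so adjust them to your setup:

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm install kuberay-operator kuberay/kuberay-operator \
  --set batchScheduler.name=kai-scheduler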

Step 1: Set up KAI Scheduler queues

Before submitting Ray workloads, queues must be defined for the KAI Scheduler. KAI Scheduler supports hierarchical queuing, which enables teams and departments to be organized into multi-level structures with fine-grained control over resource distribution.

In this example, a simple two-level hierarchy will be created with a top-level parent queue called department-1 and a child queue called team-a. All workloads in this demo will be submitted through team-a, but in a real deployment, multiple departments and teams can be configured to reflect organizational boundaries.

apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: department-1
spec:
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-a
spec:
  parentQueue: department-1
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1

A quick breakdown of the key parameters:

  • Quota: The deserved share of resources to which a queue is entitled.
  • Limit: The upper bound on how many resources a queue can consume.
  • Over Quota Weight: Determines how surplus resources are distributed among queues that have the same priority. Queues with higher weights receive a larger portion of the extra capacity.

In this demo, no specific quotas, limits, or priorities are enforced. We’re keeping it simple to focus on the mechanics of the integration. However, these fields provide powerful tools for enforcing fairness and managing contention across organizational boundaries; the sketch below shows what an enforced share could look like.
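As a purely illustrative sketch (the queue name team-b and all numeric values here are assumptions, not part of this demo), a team queue with an enforced GPU share could look like the following, with only the gpu block shown for brevity:

apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-b
spec:
  parentQueue: department-1
  resources:
    gpu:
      quota: 4            # deserved share: 4 GPUs
      limit: 8            # hard cap: the queue can never consume more than 8 GPUs
      overQuotaWeight: 2  # receives a larger slice of surplus capacity than weight-1 peers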

To create the queues:

kubectl apply -f kai-scheduler-queue.yaml
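Optionally, verify that both queues were created; the plural resource name queues.scheduling.run.ai is assumed from the CRD group used in the manifest above:

kubectl get queues.scheduling.run.ai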

With the queue hierarchy now in place, workloads can be submitted and scheduled under team-a.

Step 2: Submit a training job with gang scheduling and workload prioritization

With the queues in place, it’s time to run a training workload using KAI’s gang scheduling.

In this example, we define a simple Ray cluster with one head node and one GPU worker replica (scalable up to two).

KAI schedules all of the cluster’s Kubernetes Pods (the worker and the head) as a gang, meaning they launch together or not at all, and if preemption occurs, they’re stopped together too.

The only required configuration to enable KAI scheduling is the kai.scheduler/queue label, which assigns the job to a KAI queue, in this case team-a.

An optional setting, priorityClassName: train, marks the job as a preemptible training workload. It’s included here to illustrate how KAI applies workload prioritization. For more information on workload priority, please refer to the official documentation.

Here’s the manifest used in this demo:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-sample
  labels:
    kai.scheduler/queue: team-a
    priorityClassName: train
spec:
  headGroupSpec:
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.46.0
          resources:
            limits:
              cpu: 4
              memory: 15Gi
            requests:
              cpu: "1"
              memory: "2Gi"
  # ---- One Worker with a GPU ----
  workerGroupSpecs:
  - groupName: worker
    replicas: 1
    minReplicas: 1
    maxReplicas: 2
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.46.0
          resources:
            limits:
              cpu: 4
              memory: 15Gi
            requests:
              cpu: "1"
              memory: "1Gi"
              nvidia.com/gpu: "1"

To apply the workload:

kubectl apply -f kai-scheduler-example.yaml

KAI then gang-schedules the Ray head and its worker pods, which you can confirm as shown below.
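To list just this cluster’s pods, you can filter on the ray.io/cluster label that the KubeRay operator applies to the pods it creates (treat the label as an assumption if you run a customized operator):

kubectl get pods -l ray.io/cluster=raycluster-sample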

Step 3: Deploy an inference service with higher priority using vLLM

Now that we’ve submitted a training workload, let’s walk through how KAI Scheduler handles inference workloads, which are non-preemptible and higher priority by default. This distinction enables inference workloads to preempt lower-priority training jobs when GPU resources are limited, ensuring fast model responses for user-facing services.

In this example, we’ll:

  • Deploy Qwen2.5-7B-Instruct using vLLM with Ray Serve and RayService.
  • Submit the job to the same queue (team-a) as the training job.
  • Use the label kai.scheduler/queue to enable KAI scheduling.
  • Set the priorityClassName to inference to mark this as a high-priority workload.

Note: The only required label for scheduling with KAI is kai.scheduler/queue. The priorityClassName: inference used here is optional and specific to this demo to demonstrate workload preemption. Also, remember to create a Kubernetes secret named ‘hf-token’ containing your Hugging Face token before applying the YAML; one way to do that is shown below.
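If you prefer to create the secret imperatively rather than through the Secret manifest included at the end of the YAML, a minimal sketch (assuming your token is exported as HF_TOKEN):

kubectl create secret generic hf-token --from-literal=hf_token=$HF_TOKEN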

Here’s the manifest (make sure to add your own HF token in the Secret):

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ray-kai-scheduler-serve-llm
  labels:
    kai.scheduler/queue: team-a      
    priorityClassName: inference     

spec:
  serveConfigV2: |
    applications:
    - name: llms
      import_path: ray.serve.llm:build_openai_app
      route_prefix: "/"
      args:
        llm_configs:
        - model_loading_config:
            model_id: qwen2.5-7b-instruct
            model_source: Qwen/Qwen2.5-7B-Instruct
          engine_kwargs:
            dtype: bfloat16
            max_model_len: 1024
            device: auto
            gpu_memory_utilization: 0.75
          deployment_config:
            autoscaling_config:
              min_replicas: 1
              max_replicas: 1
              target_ongoing_requests: 64
            max_ongoing_requests: 128
  rayClusterConfig:
    rayVersion: "2.46.0"
    headGroupSpec:
      rayStartParams:
        num-cpus: "0"
        num-gpus: "0"
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray-llm:2.46.0-py311-cu124
            ports:
            - containerPort: 8000
              name: serve
              protocol: TCP
            - containerPort: 8080
              name: metrics
              protocol: TCP
            - containerPort: 6379
              name: gcs
              protocol: TCP
            - containerPort: 8265
              name: dashboard
              protocol: TCP
            - containerPort: 10001
              name: client
              protocol: TCP
            resources:
              limits:
                cpu: 4
                memory: 16Gi
              requests:
                cpu: 1
                memory: 4Gi
    workerGroupSpecs:
    - replicas: 1
      minReplicas: 1
      maxReplicas: 1
      numOfHosts: 1
      groupName: gpu-group
      rayStartParams:
        num-gpus: "1"
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray-llm:2.46.0-py311-cu124
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: hf_token
            resources:
              limits:
                cpu: 4
                memory: 15Gi
                nvidia.com/gpu: "1"
              requests:
                cpu: 1
                memory: 15Gi
                nvidia.com/gpu: "1"

---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
type: Opaque
stringData:
  hf_token: $HF_TOKEN

Apply the workload:

kubectl apply -f ray-service.kai-scheduler.llm-serve.yaml

Loading the model and starting the vLLM engine will take a while here.
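While the engine loads, a plain pod watch (nothing KAI-specific) is enough to see the scheduler evict the training pods as the inference pods are admitted:

kubectl get pods -w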

Observe preemption in action

Once applied, you’ll notice that KAI Scheduler preempts the training job to make room for the inference workload, since both compete for the same GPU, but the inference workload has higher priority.

Example output from kubectl get pods:

$ kubectl get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
ray-kai-scheduler-serve-llm-xxxx-gpu-group-worker-xxxx 1/1     Running   0          21m
ray-kai-scheduler-serve-llm-xxxx-head-xxxx             1/1     Running   0          21m
raycluster-sample-head-xxxx                            0/1     Running   0          21m
raycluster-sample-worker-worker-xxxx                   0/1     Running   0          21m

For the sake of simplicity in this demo, we load the model directly from Hugging Face inside the container. This works for showcasing KAI Scheduler logic and preemption behavior. However, in real production environments, model loading time becomes a critical factor, especially when autoscaling inference replicas or recovering from eviction.

For that, we recommend using NVIDIA Run:ai Model Streamer, which is natively supported in vLLM and can be used out of the box with Ray. For reference, please refer to the Ray documentation, which includes an example showing how to configure the Model Streamer in your Ray workloads.
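As a rough sketch of what that change looks like in the serve config above (the runai_streamer load format is assumed to be available in your vLLM version; check the Ray and vLLM docs for your release), it typically comes down to the engine_kwargs:

engine_kwargs:
  load_format: runai_streamer  # stream weights with the Run:ai Model Streamer instead of the default loader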

Interact with the deployed model

Before we port forward to access the Ray dashboard or the inference endpoint, let’s list the available services to make sure we target the right one:

$ kubectl get svc
NAME                                         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                         AGE
ray-kai-scheduler-serve-llm-head-svc         ClusterIP   None            xxxxxx        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   17m
ray-kai-scheduler-serve-llm-xxxxx-head-svc   ClusterIP   None            xxxxxx        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   24m
ray-kai-scheduler-serve-llm-serve-svc        ClusterIP   xx.xxx.xx.xxx   xxxxxx        8000/TCP                                        17m
raycluster-sample-head-svc                   ClusterIP   None            xxxxxx        10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP   32m

Now that we can see the service names, we’ll use:

  • ray-kai-scheduler-serve-llm-xxxxx-head-svc to forward the Ray dashboard.
  • ray-kai-scheduler-serve-llm-serve-svc to forward the model’s endpoint.

Then, port forward the Ray dashboard:

kubectl port-forward svc/ray-kai-scheduler-serve-llm-xxxxx-head-svc 8265:8265

Then open http://127.0.0.1:8265 to view the Ray dashboard and confirm the deployment is active.

Figure 1. Overview of the Ray dashboard, where users can see the status of their workloads
Figure 2. Overview of the deployment logs, where users can see the model being deployed

Next, port forward the LLM serve endpoint:

kubectl port-forward svc/ray-kai-scheduler-serve-llm-serve-svc 8000:8000

And finally, query the model using an OpenAI-compatible API call:

curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
        "model": "qwen2.5-7b-instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 13,
        "temperature": 0}'

Sample response:

{
  "id": "qwen2.5-7b-instruct-xxxxxx",
  "object": "text_completion",
  "created": 1753172931,
  "model": "qwen2.5-7b-instruct",
  "selections": [
    {
      "index": 0,
      "text": " city of neighborhoods, each with its own distinct character and charm.",
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 4,
    "completion_tokens": 13,
    "total_tokens": 17
  }
}

Wrapping up

In this blog, we explored how the KAI Scheduler brings advanced scheduling capabilities to Ray workloads, including gang scheduling and hierarchical queuing. We demonstrated how training and inference workloads can be efficiently prioritized, with inference workloads able to preempt training jobs when resources are limited.

While this demo used a simple open-weight model and Hugging Face for convenience, NVIDIA Run:ai Model Streamer is a production-grade option that reduces model spin-up times by streaming model weights directly from S3 or other high-bandwidth storage to GPU memory. It’s also natively integrated with vLLM and works out of the box with Ray, as shown in this example from Ray’s docs. We’re excited to see what the community builds with this stack. Happy scheduling.

The KAI Scheduler team will be at KubeCon North America in Atlanta this November. Have questions about gang scheduling, workload autoscaling, or AI workload optimization? Find us at our booth or sessions.

Get started with KAI Scheduler.


