NVIDIA Run:ai v2.24 introduces time-based fairshare, a new scheduling mode that brings fair-share scheduling with time awareness for over-quota resources to Kubernetes clusters. This capability, built on the open source KAI Scheduler that powers NVIDIA Run:ai, addresses a long-standing challenge in shared GPU infrastructure.
Consider two teams with equal priority sharing a cluster. Team A repeatedly submits smaller jobs, while Team B needs to run a larger job that requires more resources. Each time resources free up, the smaller jobs from Team A fit immediately and get scheduled. The larger job from Team B keeps waiting for enough resources to become available. Before that happens, the next small job from Team A claims the freed capacity. The result: although both teams have equal priority and entitlements, Team A runs job after job while the job from Team B sits in the queue indefinitely.
Time-based fairshare solves this problem by giving the scheduler memory. Instead of calculating fair share at a single point in time, the scheduler now tracks historical resource usage and adjusts each queue’s share based on past consumption. Teams that have used more resources recently receive lower scores for over-quota allocation, while teams that have been waiting receive a boost.
Time-based fairshare results in proportional compute time over days and weeks. This enables true time-sharing of GPU resources, burst access for infrequent large jobs, and resource planning that aligns with weekly or monthly GPU-hour budgets. Importantly, guaranteed quotas and queue priorities continue to work exactly as before.
This post explains the problem in more detail, walks through a real-world use case, and demonstrates how to enable time-based fairshare in NVIDIA Run:ai and KAI Scheduler.
Why is over-quota GPU resource fairness important?
Enterprise deployments have shown a consistent pattern: when organizations move from static GPU allocation to dynamic scheduling, cluster usage becomes far more dynamic. Over-quota resources (the shared pool beyond guaranteed quotas) become one of the most heavily utilized resource types. Teams regularly exceed their guaranteed allocations, leading to higher GPU utilization and more compute time for researchers.
This makes over-quota fairness critical. When a significant portion of cluster value comes from this shared pool, that pool must be divided fairly over time.
How does stateless fair-share scheduling work?
The classical stateless fair-share algorithm divides cluster resources in two phases. First, it allocates the Deserved Quota, the guaranteed resources that each queue is entitled to. This allocation always happens first and is unaffected by historical usage. Time-based fairshare doesn’t change this behavior.
After deserved quotas are satisfied, any remaining capacity becomes the Over-Quota Pool, a shared surplus that queues compete for based on their weights. This is where point-in-time fairness breaks down.
When dividing over-quota resources, the scheduler takes the following steps (a simplified code sketch follows the list):
- Groups queues by priority level and starts with the highest tier
- Calculates fair share based on weights within that tier:
  - Queues using less than their fair share get resources first
  - Breaks ties using workload submission time
- If resources remain, moves to the next priority tier and repeats
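For readers who prefer code, here is a minimal Python sketch of this point-in-time division. It is a simplified model for illustration only, not the KAI Scheduler implementation; the Queue fields and the divide_over_quota helper are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Queue:
    name: str
    priority: int             # higher value = higher priority tier
    weight: float             # over-quota weight within the tier
    allocated: float = 0.0    # over-quota GPUs currently held
    created_at: float = 0.0   # tie-breaker: queue creation timestamp
    pending: list = field(default_factory=list)  # pending job sizes, in GPUs

def divide_over_quota(queues, free_gpus):
    """Simplified point-in-time division of the over-quota pool."""
    # Group queues by priority level and start with the highest tier.
    for priority in sorted({q.priority for q in queues}, reverse=True):
        tier = [q for q in queues if q.priority == priority and q.pending]
        while free_gpus > 0 and any(q.pending for q in tier):
            # Fair share in this tier is proportional to configured weights.
            total_weight = sum(q.weight for q in tier)
            fair_share = {q.name: free_gpus * q.weight / total_weight for q in tier}
            # Queues furthest below their fair share get resources first;
            # ties fall back to creation time, then name.
            candidates = sorted((q for q in tier if q.pending),
                                key=lambda q: (q.allocated - fair_share[q.name],
                                               q.created_at, q.name))
            winner = candidates[0]
            job = winner.pending[0]
            if job > free_gpus:
                break  # the next job in line does not fit right now
            winner.pending.pop(0)
            winner.allocated += job
            free_gpus -= job
        # If resources remain, the loop moves to the next priority tier and repeats.
    return free_gpus
```

Note how, with equal weights and zero allocations, the sort key collapses to the tie-breakers, which is exactly why the same queue keeps winning in the equal-weights case described next.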
Here’s where the problem lies. Consider the following two cases of queues competing for over-quota resources.
When queues have equal weights: Each receives the same calculated fair share. When resources become available after a job completes, both queues are in exactly the same state: same allocation (zero), same fair share, both with pending jobs. The scheduler sees no difference between them, falls back to tie-breakers (queue creation timestamp, then alphabetical order), and the same queue wins every time.
When queues have different weights: The higher-weight queue receives a larger fair share, which is correct. But the point-in-time calculation doesn’t track whether queues actually receive their proportional share over time. For example, if Queue A has weight 3 and Queue B has weight 1, the scheduler correctly calculates that A is entitled to 75% of over-quota resources (3/4) and B to 25% (1/4). But if Queue A submits large workloads while Queue B submits many smaller ones, Queue B’s jobs can more easily fit within its fair share, while Queue A’s large jobs would push it above its fair share. The scheduler continues to prefer Queue B because it appears “underallocated” at each decision point. Over time, Queue B ends up running far more workloads than its 25% entitlement.
In both cases, the scheduler has no memory. It doesn’t know that one team just finished running a job while the other has been waiting for hours.
How does time-based fairshare work?
The core idea of time-based fairshare is simple: for each queue, compare the proportion of over-quota resources it actually consumed over the configured time window against the proportion it should have received based on its weight. Then adjust accordingly.
For example, if Queue A has weight 3 and Queue B has weight 1, Queue A should receive 75% of over-quota resources and Queue B should receive 25%. If the scheduler looks back over the past week and sees that Queue A actually consumed 90% while Queue B received only 10%, it boosts Queue B’s effective weight and reduces Queue A’s, steering future allocations back toward the 75/25 split.
Everything else stays the same. Deserved quotas are still guaranteed first. Priority ordering still applies. Queue hierarchies work as before. Time-based fairshare only changes how the over-quota pool gets divided.
How is time-based fairshare calculated?
The scheduler uses three inputs to adjust the effective weight of each queue:
- Weight: What the queue should get based on its configured weight relative to others
- Usage: What the queue actually consumed over a configurable time window (default: one week)
- K-value: How aggressively the scheduler corrects imbalances. Higher values mean faster correction
When a queue has consumed more than its fair share, its effective weight is reduced. When it has been starved, its effective weight is boosted. This way, allocations naturally drift back toward the intended proportions over time.
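The exact formula lives in the KAI Scheduler and its documentation; the sketch below is only a hypothetical illustration of how these three inputs could combine, reusing the weight-3/weight-1 example from above. The function name and the linear adjustment rule are assumptions, not the production implementation.

```python
def effective_weight(weight, total_weight, usage_share, k=1.0):
    """Hypothetical effective-weight adjustment (not the exact KAI Scheduler formula).

    weight / total_weight -- the share of over-quota resources the queue is entitled to
    usage_share           -- the share it actually consumed over the time window
    k                     -- how aggressively imbalances are corrected
    """
    entitled_share = weight / total_weight
    # Positive deviation means the queue consumed more than its entitlement;
    # negative means it was starved.
    deviation = usage_share - entitled_share
    # Scale the weight down when over-consuming, up when starved.
    return max(weight * (1 - k * deviation), 0.0)

# Example from the text: Queue A (weight 3) consumed 90% of over-quota usage
# in the window, Queue B (weight 1) consumed 10%.
total = 3 + 1
adj_a = effective_weight(3, total, usage_share=0.90)  # entitled to 75%, used 90%
adj_b = effective_weight(1, total, usage_share=0.10)  # entitled to 25%, used 10%
print(adj_a, adj_b)             # roughly 2.55 and 1.15: A shrinks, B grows
print(adj_a / (adj_a + adj_b))  # A's new effective share, roughly 0.69
```

With these numbers, Queue A’s effective weight drops from 3 to about 2.55 and Queue B’s rises from 1 to about 1.15, shifting Queue A’s effective over-quota share from 75% to roughly 69% until the historical imbalance works itself off.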
Time-based fairshare can be enabled or disabled directly from the UI (see the Node Pools section of the NVIDIA Run:ai documentation), while parameters like window size, window type, and decay rates can be tuned via API to balance responsiveness against stability. Because these settings are configured per node pool, administrators can experiment on a dedicated node pool without affecting the rest of the cluster. For full details, see the time-based fairshare documentation.
A few details worth noting:
- Usage is measured against cluster capacity, not against what other queues consumed. This prevents teams from being penalized for using GPUs that would otherwise have sat idle. For example, a queue that consumed 40 GPUs in a 100-GPU cluster is charged 40% usage, even if it accounted for 80% of everything actually running at the time.
- Priority still comes first. Time-based fairshare operates within each priority tier. A high-priority queue still gets resources before lower-priority queues, regardless of historical usage.
Example scenario: One cluster, multiple workload types
This section walks through a practical scenario that shows how time-based fairshare solves resource contention in a heterogeneous cluster.
A 100-GPU cluster is shared by two ML teams with very different workload patterns. The LLM team focuses on post-training and inference, with 30 GPUs guaranteed. The Vision team focuses on computer vision R&D, with 20 GPUs guaranteed. Both teams have equal over-quota weight. The remaining 50 GPUs form the over-quota pool, available for burst workloads.
The LLM team runs customer-facing inference endpoints that serve production traffic. These inference workloads use 10 GPUs continuously. They’re critical and must never be interrupted. The remaining 20 GPUs of their quota, plus access to the over-quota pool, are available for post-training jobs when the team occasionally needs to improve its models based on customer feedback.
The Vision team focuses on computer vision research: running VSCode, testing architectures, hyperparameter sweeps, and training object detection models. They have a steady stream of training jobs that regularly tap into the over-quota pool.
The problem: Burst access becomes blocked
One day, the LLM team finishes analyzing a batch of customer feedback and is ready to launch a post-training run. The job needs 60 GPUs: the 20 GPUs remaining in their quota plus 40 from the over-quota pool. What happens with and without time-based fairshare is outlined below.
To illustrate this scenario, we used the open source time-based fairshare simulator from the KAI Scheduler. This tool lets you model different cluster configurations and visualize how resources are allocated over time. The simulations below show exactly what happens in our example scenario.
Without time-based fairshare
- LLM team’s inference endpoints continue running on their 10 guaranteed GPUs (deserved quota is protected).
- Vision team has been continuously running CV training jobs, consuming over-quota resources.
- LLM team’s 60-GPU post-training job enters the queue.
- Every time over-quota resources free up, the Vision team has more pending jobs ready.
- Vision team’s jobs continue to be scheduled first. This happens because the LLM team’s 40-GPU over-quota request exceeds their fair share. The scheduler won’t allocate beyond fair share while the Vision team still has pending jobs claiming their portion. The LLM team must wait until the Vision team’s over-quota usage drops.
- LLM team’s post-training job waits…and waits…and waits.
The LLM team’s inference services are fine, and the guaranteed quota works perfectly. But their post-training job is effectively starved because a team with continuous workloads monopolizes the over-quota pool. The occasional user never gets their turn.


With time-based fairshare
For detailed instructions on configuring time-based fairshare in the NVIDIA Run:ai UI under node pools, see the NVIDIA Run:ai documentation or the KAI Scheduler documentation.
With time-based fairshare, the scheduler tracks historical usage. When the LLM team submits their post-training job:
- Vision team has accumulated high historical over-quota usage from continuous CV training
- LLM team has minimal historical over-quota usage (they’ve been running jobs within quota)
- LLM team’s effective fair share is boosted because they’ve been “starved” of over-quota resources
- LLM team’s 60-GPU job is scheduled
If the post-training job runs long enough, both teams end up time-sharing over-quota resources. The LLM team runs for a while, accumulating usage. As its historical usage grows, the Vision team becomes relatively more starved and starts getting prioritized. Resources oscillate back and forth (sometimes the LLM job runs, sometimes Vision jobs run), resulting in fair sharing over time rather than one team monopolizing the pool.
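To make the oscillation concrete, here is a deliberately tiny toy model in Python. It assumes a 50-GPU over-quota pool, equal weights, an exponentially decayed usage counter, and a rule that serves the most starved queue first; it is neither the KAI Scheduler algorithm nor the simulator, just a sketch of why allocations swing back and forth once history is tracked.

```python
# Toy model of over-quota time-sharing (not the simulator or the real scheduler).
OVER_QUOTA = 50                   # shared pool from the example scenario
LLM_NEED, VISION_NEED = 40, 50    # over-quota GPUs each team can absorb per step
DECAY = 0.9                       # how quickly old usage ages out of the window

# Starting state: Vision has been bursting for a while, LLM has not.
usage = {"llm": 0.0, "vision": 200.0}   # decayed historical usage per queue

for step in range(20):
    # With equal weights, each queue is entitled to half of over-quota usage;
    # the queue furthest below that entitlement counts as starved.
    total = usage["llm"] + usage["vision"] or 1.0
    starved = min(usage, key=lambda q: usage[q] / total)
    other = "vision" if starved == "llm" else "llm"
    # Serve the starved queue's demand first, then give the rest to the other queue.
    grant = {starved: min(OVER_QUOTA, LLM_NEED if starved == "llm" else VISION_NEED)}
    grant[other] = OVER_QUOTA - grant[starved]
    # Age historical usage and add this step's consumption.
    for q in usage:
        usage[q] = usage[q] * DECAY + grant[q]
    print(f"step {step:2d}: LLM={grant['llm']:2d} GPUs, Vision={grant['vision']:2d} GPUs")
```

Running it shows the LLM queue claiming the pool first, then the two queues trading it back and forth, mirroring the back-and-forth described above.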


Time-based fairshare enables several important patterns, including:
- Protected critical workloads: Inference endpoints and other production services run on guaranteed quota, completely untouched by fairness adjustments.
- Burst access when needed: Teams that don’t regularly consume over-quota resources can still get burst capacity when they need it, without being blocked for long periods or even permanently.
- Fair sharing over time: No team monopolizes the over-quota pool indefinitely. Everyone gets their proportional share across the configured time window.
- Fairer treatment of large workloads: In point-in-time fair share, queues with large jobs often get deprioritized because smaller jobs from other queues fit more easily. Time-based fairshare improves this: as the queue with large jobs accumulates less usage, it becomes increasingly prioritized until it gets a chance to run.
Get started with NVIDIA Run:ai time-based fairshare
Time-based fairshare addresses a fundamental limitation of point-in-time fair-share scheduling: the lack of memory. By tracking historical usage, the scheduler distributes over-quota resources fairly across time windows rather than only at each scheduling decision. Guaranteed quotas remain untouched, so critical workloads like inference endpoints stay protected.
Ready to get started? NVIDIA Run:ai v2.24 includes time-based fairshare with straightforward configuration through the platform UI. Settings are configured per node pool, so it’s easy to experiment on a dedicated pool without imposing the new mode across your entire cluster. For setup details, see the time-based fairshare documentation.
Time-based fairshare is also available in the open source KAI Scheduler. Complete the configuration steps, enable Prometheus, set your parameters, and start scheduling.
Want to try time-based fairshare before deploying it? Try the time-based fairshare simulator, where you can model queue allocations over time. Define your queues, weights, and workloads in a simple YAML file, run the simulation, and visualize how resources oscillate between competing teams.
To learn more about time-based fairshare and other features in the NVIDIA Run:ai v2.24 release, join the upcoming webinar Elevate Your AI Operations With Simplified Workload Management.
