Serving LLMs to many applications and users in parallel is difficult because they compete for limited GPU resources. This article is the first in a series on LLM performance, based on our experience with serving self-hosted LLMs at TNG Technology Consulting GmbH. In this first part, we focus on the impact of queuing and discuss different scheduling strategies.
Starting Point: A Bare Inference Engine
An inference engine like vLLM or HuggingFace TGI consists of
- a worker that does the actual work of calculating the next token of a request
- a queue to which requests are added when they first arrive
- a scheduler that takes requests from the queue and moves them to the worker
Why do we need a queue here? Because calculations on the GPU are more performant and resource-efficient when they are done batch-wise instead of in isolation for individual requests. This backend queue allows the scheduler to pick multiple requests and put them into the same batch to be processed.
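In Python-flavored pseudocode, the basic loop might look like this. It is a deliberately simplified sketch; real engines like vLLM implement this far more efficiently, and all names here are purely illustrative:

```python
from collections import deque

MAX_BATCH_SIZE = 8       # assumption: illustrative fixed batch size
request_queue = deque()  # requests land here first when they arrive

def schedule_batch():
    """Scheduler: pull up to MAX_BATCH_SIZE requests from the queue into one batch."""
    batch = []
    while request_queue and len(batch) < MAX_BATCH_SIZE:
        batch.append(request_queue.popleft())
    return batch

def worker_step(batch):
    """Worker: one forward pass computes the next token for all requests in the batch."""
    return {request["id"]: "<next token>" for request in batch}  # placeholder for the GPU work
```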
Note that typically each inference engine serves only a single model, and we have multiple deployments running for different models in parallel.
Problem: “Power Users” Can Block Other Users
When a single user “A” sends a lot of requests, they can quickly fill up the queue. Other users (“B” and “C”) who send their requests only shortly after are blocked from using the same model until all requests from “A” have been processed. Note that the figure focuses on vLLM as inference engine, but the issue is more general and applies to other backends as well.
Solution: Fair Scheduling
At TNG, users don’t send requests directly to the vLLM backend but to an API server (what we call “LLM-Server”). Here we can have separate queues for each user (and model), and a scheduler that is not FIFO (first in – first out) but goes round-robin through all user queues. This achieves some “fair scheduling”: for example, in the diagram users B and C send their requests a bit later, when user A’s first request has already been scheduled. At that moment, three requests from user A had already been waiting for a while, but user C only has to wait for one request from user A to be completed.
The key idea is: prioritize requests from different users in our own component and not in the inference backend!
Typically, you can’t change the order of requests once they have been sent to the inference engine, so you have to bring them into the right order while they are still in the LLM-Server, where we have full control.
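A minimal sketch of such a fair scheduler, assuming one FIFO queue per user and a plain round-robin pass over those queues (class and method names are illustrative, not part of any real API):

```python
from collections import OrderedDict, deque

class FairScheduler:
    """Per-user FIFO queues, served round-robin (illustrative sketch)."""

    def __init__(self):
        self.user_queues = OrderedDict()  # user_id -> deque of pending requests

    def enqueue(self, user_id, request):
        self.user_queues.setdefault(user_id, deque()).append(request)

    def next_request(self):
        """Serve one request per user before any user is served twice in a row."""
        for user_id, queue in list(self.user_queues.items()):
            if queue:
                request = queue.popleft()
                self.user_queues.move_to_end(user_id)  # this user goes to the back
                return request
        return None  # nothing to schedule right now
```

In the example above, user A’s remaining requests stay queued while the single requests from B and C each get their turn first.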
Possible Extensions
You can consider different factors in order to decide what “fair” scheduling means. In the above example, any new user with a single request should be served before any user is served two or more times in a row. That is “fair” on a number-of-requests level. But you could also look at processing time: how long does a request keep the backend busy? Long prompts and long generations will “block” the LLM for other users for a longer time, so maybe shorter requests should have precedence? Unfortunately, it is very difficult to estimate the generation length. While some requests come with a “max_tokens” limit, a typical chat message from an interactive AI assistant has no token limit, and can vary between a very short generation (“summarize this text”) and a very long one (“tell me a story”, “write all code for xyz”).
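Given that caveat, a crude proxy is usually all you can get. One possible heuristic, assuming a fixed fallback value for requests without a max_tokens limit (the constant and function names are made up for illustration):

```python
DEFAULT_EXPECTED_TOKENS = 512  # assumption: fallback when no max_tokens is set

def estimated_cost(prompt_tokens: int, max_tokens: int | None = None) -> int:
    """Crude proxy for how long a request will keep the backend busy."""
    expected_generation = max_tokens if max_tokens is not None else DEFAULT_EXPECTED_TOKENS
    return prompt_tokens + expected_generation
```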
Depending on the prompts, there can be some benefit in arranging requests based on similarity, so that vLLM can maximize cache hits, which speeds up performance. This kind of KV-cache-aware routing has recently gained attention through frameworks like NVIDIA Dynamo and AIBrix.
In a business context, and for hosted LLMs, the cost of individual requests can be another metric to consider, but it comes with similar challenges.
The solution can also be extended by having not just one queue per user (and model) but several queues with different priorities. For example, interactive applications like TNG’s AI Assistant with chat interface should have higher priority, because users who don’t see any progress within five seconds will think the application is broken. Users who generate code reviews for tens of files and several thousand lines of code, however, will expect the LLM requests to take a while. And some use cases (like benchmark runs, scheduled via a batch API) should have such low priority that they don’t disturb other use cases and are only scheduled when nothing else is running.
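One way to sketch this, assuming the three tiers from the example above (the tier names and class are illustrative; the metric-based gating for the batch tier is discussed further below):

```python
from collections import deque

PRIORITY_TIERS = ("interactive", "default", "batch")  # assumption: three tiers

class TieredFairScheduler:
    def __init__(self):
        # tier -> user_id -> deque of pending requests
        self.queues = {tier: {} for tier in PRIORITY_TIERS}

    def enqueue(self, tier, user_id, request):
        self.queues[tier].setdefault(user_id, deque()).append(request)

    def next_request(self):
        """Serve higher tiers first; round-robin over users within each tier."""
        for tier in PRIORITY_TIERS:
            user_queues = self.queues[tier]
            for user_id in list(user_queues):
                if user_queues[user_id]:
                    request = user_queues[user_id].popleft()
                    user_queues[user_id] = user_queues.pop(user_id)  # rotate user to the back
                    return request
        return None
```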
Problem: No Backpressure by Backend Queue
Consider again the scenario where one user A sends a lot of requests at once, and some time later a new user C joins and wants to schedule a single request. If the scheduler on the LLM-Server side sent every request (according to the fair prioritization) immediately to the backend, all requests from A would accumulate again in the FIFO queue. The new user C would again have to wait until all previously received requests have been processed. There would be almost no improvement over the initial scenario without the LLM-Server.
Ideally, you could limit the maximum number of elements in the backend queue, but vLLM doesn’t offer that option. Therefore, we have to dynamically adjust the rate at which the scheduler on the LLM-Server side sends new requests to the backend. Here, our goal is: keep the backend queue length short in order to minimize the latencies experienced by new users.
(The simplest approach would be a static rate limit, but this would likely result in underutilization when most requests are short, and it would be hard to calibrate for different models and load patterns.)
Solution: Fetch Metrics
In order to make the backend queue length available in the LLM-Server, we need to fetch the respective Prometheus metrics from the vLLM /metrics endpoint. Our fair scheduler is only allowed to send requests to the backend as long as the backend queue length metric is smaller than three, for example. This target length for the backend queue can be lowered for an even shorter latency for new users, until it results in under-utilized batches and reduced efficiency – there is a trade-off. Keep in mind, though, that the target queue length does NOT reflect maximum concurrency; there can still be more than three requests being processed in parallel; vLLM will add queued requests to the current batch as soon as there is sufficient space (“continuous batching”).
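A minimal sketch of this feedback loop, assuming a local vLLM deployment that exposes the Prometheus gauge vllm:num_requests_waiting on its /metrics endpoint (the exact metric name may differ between vLLM versions):

```python
import re
import requests

VLLM_URL = "http://localhost:8000"  # assumption: local vLLM deployment
TARGET_QUEUE_LENGTH = 3             # only send new requests below this length

def backend_queue_length() -> int:
    """Read the current backend queue length from vLLM's Prometheus metrics."""
    text = requests.get(f"{VLLM_URL}/metrics", timeout=2).text
    # Prometheus text format, e.g.: vllm:num_requests_waiting{model_name="..."} 1.0
    match = re.search(r'^vllm:num_requests_waiting(?:\{[^}]*\})?\s+([0-9eE+\-.]+)',
                      text, re.MULTILINE)
    return int(float(match.group(1))) if match else 0

def may_dispatch() -> bool:
    """The fair scheduler only forwards a request while the backend queue is short."""
    return backend_queue_length() < TARGET_QUEUE_LENGTH
```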
Possible Extensions
Once you have the feedback loop between the backend queue and the scheduler in the LLM-Server, you can easily extend the set of vLLM metrics used for making scheduling decisions. For example, for a good user experience in TNG’s interactive AI Assistant we aim for a high token generation speed (e.g., >7 tokens/s, that is ~150 ms per token), and if the reported time-per-output-token metric rises above 150 ms, no new requests will be scheduled.
You can also configure different metric thresholds for different request priorities. For example, for low-priority requests from the batch API we only schedule a request when the backend queue is completely empty: we would rather risk short periods of underutilized GPUs than cause increased latencies for any later request with higher priority.
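Building on the previous sketch (VLLM_URL, TARGET_QUEUE_LENGTH, backend_queue_length()), a possible combination of both checks could look like this; the time-per-output-token metric is exposed as a Prometheus histogram, so we derive a mean from its _sum and _count series (again, metric names may differ between versions):

```python
import re
import requests

TPOT_LIMIT_SECONDS = 0.150  # aim for >~7 tokens/s in the interactive assistant

def mean_time_per_output_token() -> float:
    """Cumulative mean TPOT since server start; a real version would use a recent window."""
    text = requests.get(f"{VLLM_URL}/metrics", timeout=2).text

    def value(name: str) -> float:
        match = re.search(rf'^{name}(?:\{{[^}}]*\}})?\s+([0-9eE+\-.]+)', text, re.MULTILINE)
        return float(match.group(1)) if match else 0.0

    total = value("vllm:time_per_output_token_seconds_sum")
    count = value("vllm:time_per_output_token_seconds_count")
    return total / count if count else 0.0

def may_dispatch(priority: str) -> bool:
    if mean_time_per_output_token() > TPOT_LIMIT_SECONDS:
        return False                  # generation is already too slow, back off
    queue_length = backend_queue_length()
    if priority == "batch":
        return queue_length == 0      # batch jobs only run when nothing is waiting
    return queue_length < TARGET_QUEUE_LENGTH
```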
There is quite some potential for optimization: for example, if you allow scheduling for a backend queue length shorter than three and the current metric is zero, you can immediately send three requests to the backend before having to fetch the metrics again. In the worst case, none of them fit into the current batch and all of them end up in the backend queue (in which case the new length is three, which is just fine).
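A sketch of this optimization, reusing the helpers and the scheduler from the previous sketches; send_to_backend() is a hypothetical function that forwards a request to vLLM:

```python
def dispatch_up_to_headroom(scheduler) -> None:
    """Send as many requests as the target queue length currently allows."""
    headroom = TARGET_QUEUE_LENGTH - backend_queue_length()
    for _ in range(max(0, headroom)):
        request = scheduler.next_request()
        if request is None:
            break                    # no pending requests in any user queue
        send_to_backend(request)     # hypothetical helper that forwards the request to vLLM
```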
Alternative: Backend-Side Priority Scheduling
Recently, vLLM added a priority-based scheduling option as an alternative to FIFO. For this new feature, requests are tagged with a priority before they are sent to the backend. vLLM then regularly checks whether any higher-priority request is queuing and sorts all requests by priority (both the running batch and the waiting queue). This can not only let a high-priority request literally jump the queue but even move it directly into the processed batch. The price is (apart from some overhead due to sorting) that lower-priority requests may be evicted from the running batch and put back into the waiting queue.
Can Backend-Side Priority Scheduling Replace All Queues on the LLM-Server Side?
vLLM understands the request priority as a continuous number. This does not only allow you to distinguish between default, high priority (interactive AI assistant), and low priority (batch API); you can even apply small decrements in priority for every request that a user has already sent to the backend and whose response is still pending. For example, users B and C send only single requests, which will have priority zero. User A sends four requests at once; the first one has priority zero, too, but the next ones will have priorities one, two, and three, and will be processed later (higher number = lower priority).
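A sketch of this priority assignment, assuming the vLLM server was started with priority scheduling enabled (e.g., via the --scheduling-policy priority option in recent versions) and that its OpenAI-compatible API accepts a priority field in the request body; please verify both for your vLLM version:

```python
from collections import defaultdict
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # assumption: local vLLM
pending_requests = defaultdict(int)  # user_id -> requests currently in flight

def send_with_priority(user_id: str, messages: list[dict], base_priority: int = 0):
    # every request the user already has in flight pushes the next one back a bit
    priority = base_priority + pending_requests[user_id]
    pending_requests[user_id] += 1
    try:
        return client.chat.completions.create(
            model="my-model",                   # assumption: name of the deployed model
            messages=messages,
            extra_body={"priority": priority},  # vLLM-specific field, not part of the OpenAI API
        )
    finally:
        pending_requests[user_id] -= 1          # blocking call: response is complete here
```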
There are still some caveats:
- The backend-priority feature is only available for vLLM, not for HuggingFace TGI.
- Having a scheduler on the LLM-Server side allows you to adjust the scheduling rate based on time-per-output-token. The backend-priority scheduling does not control the scheduling rate.
- The impact of frequent re-ordering of queue and batch on latencies still needs to be measured for realistic load scenarios.
Overall, backend-side priority scheduling can be a good strategy for vLLM-based systems, since it simplifies the queueing logic on the upstream LLM-Server. Unfortunately, in a practical setting you cannot get rid of the LLM-Server as an additional scheduling layer, since you need some objective instance to assign priorities to individual requests.
Summary & Outlook
Queueing and scheduling are crucial for applications with multiple users and clients of different priorities, as they significantly impact the user experience. This is especially true in scenarios where multiple clients submit requests in parallel, which benefit from a dedicated “fair scheduling” strategy. Although backend features like priority scheduling can simplify the optimization, an upstream gateway server is still necessary to fully manage these complexities.
In the next part of this blog post series, we will shift gears and focus on token generation in the inference engine during the prefill and decode phases. In particular, we will discuss resource utilization and techniques for concurrent processing of multiple requests.




