
In the previous part of our series on LLM performance, we looked into the differences between the prefill and decode phases during token generation. Briefly: for the first output token (prefill step), the entire prompt must be processed, which can be parallelized efficiently and can saturate GPU utilization. For every later output token (decode steps), only a single additional token must be processed, which is less compute-intensive but has to be done sequentially. When many requests are processed concurrently, any strategy that aims for low latency must run prefill steps for newly arriving requests while the decode steps of previously scheduled requests are still ongoing. Concurrent processing of new as well as running requests therefore requires careful balancing between the prefill and decode stages, which presents two major challenges that we discuss in the following. One is a readily solvable issue, while the other constitutes a more fundamental flaw.
The Simpler Challenge: Long Prompts Block the Queue
Since individual decode steps are not compute-intensive, one can increase throughput by batching the decodes of multiple requests. For prefill, however, this approach does not work. Because all prompt tokens are processed in parallel, a single prefill step can already saturate GPU utilization. Consequently, in the default chunked-prefill strategy of vLLM, each prefill chunk contains only prompt tokens of a single request. The next request in line has to wait until the previous prefill phase has finished before its own prefill phase can start.
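For reference, here is a minimal sketch of how this is configured in vLLM, assuming recent versions where `enable_chunked_prefill` and `max_num_batched_tokens` are engine arguments (defaults differ between releases, so check the docs for your version):

```python
from vllm import LLM, SamplingParams

# Minimal sketch: enable chunked prefill and cap the per-step token budget.
# Each scheduler step first packs the single-token decodes of running requests
# and then fills the remaining budget with prompt tokens of one prefill request.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # example model used in this article
    tensor_parallel_size=4,
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,  # token budget shared by decodes and the prefill chunk
)

outputs = llm.generate(["A short prompt ..."], SamplingParams(max_tokens=64))
```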
This sequential scheduling of prefill chunks for different requests poses a challenge: whenever a request with a very long prompt is scheduled for prefill, any subsequent request has to wait for the duration of the long prefill before its own processing starts; a long prompt blocks the prefill queue. (Note that the sequential processing of prefills is the default behavior of chunked prefill and only becomes visible when there already is a concurrent request in its decode phase; hence the name "partial prefill".)

Unfortunately, this challenge can neither be solved with vLLM-side priority scheduling (see the first article of this series) nor with a more sophisticated upstream scheduler. The reason is that the long prompt may be scheduled before any subsequent requests exist, so there is nothing the scheduler could wait for.
Request-Parallel Prefills
A simple solution would be to process prefill chunks of different requests in parallel. This is not resource-optimal, since a single-request prefill chunk can already saturate the available compute. Any additional prefill executed in parallel would likely lengthen the prefill duration a bit and slow down any concurrent decode requests even further. This would be acceptable if it reduced the latency of short requests and made the system appear more responsive. The approach fails, however, when the next request in line has a long prompt too. In that case, two compute-intensive prefills would be batched together, resulting in a severe slowdown.
In one of the recent vLLM updates, an improved strategy has been implemented: it allows parallel prefills of different requests, but with a limit on the number of concurrently processed long-prompt requests. An example configuration could enable batching of prefills for four requests, of which only one may be longer than 10,000 prompt tokens. With such a configuration, the behavior for long requests is still the same as before: long prompts are processed sequentially. Short requests, however, no longer have to wait for the long prefill of a previous request to finish; short prompts can take a fast lane. These requests no longer suffer from long waiting times and show much lower time-to-first-token metrics.
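In vLLM, this looks roughly like the following sketch; the parameter names `max_num_partial_prefills`, `max_long_partial_prefills`, and `long_prefill_token_threshold` are taken from recent releases and should be treated as an assumption to verify against your version's documentation:

```python
from vllm import LLM

# Sketch of the example configuration from the text: up to four requests may be
# in their prefill phase at the same time, but at most one of them may exceed
# the long-prompt threshold of 10,000 tokens. Short prompts therefore take the
# fast lane past a long prefill that is already in progress.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,
    enable_chunked_prefill=True,
    max_num_partial_prefills=4,           # prefill chunks batched across requests
    max_long_partial_prefills=1,          # at most one long prefill at a time
    long_prefill_token_threshold=10_000,  # what counts as a "long" prompt
)
```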
Of course, parallel prefills can only reduce waiting times; the time-per-output-token remains elevated during the concurrent long-prefill operation. In this regard, request-parallel prefills show the same behavior and performance as standard chunked prefill, just with a shorter time-to-first-token.

The Fundamental Flaw: Token Generation Slowed Down by Parallel Prefills
Whenever prefill and decode of different requests are executed in the same GPU operation, the step takes longer than an isolated decode step. The user experiences an interruption or a slowdown of token generation caused by a subsequent request. In particular, a single request with a long prompt is sufficient to slow down all previously scheduled requests that are already in their decode phase.
This is a fundamental flaw in the concurrent processing of prefill and decode on the same GPUs, because there is little you can do:
- (a) You can penalize long prompts and let them wait (e.g. until all short, high-priority requests have finished). This comes at the price of increased latency for those requests, and it does not fix the root cause: in particular, when request-parallel prefills are enabled, the slowdown also affects short-prompt requests that are scheduled after the long-prompt one. Moreover, in times of high load, long-prompt requests may have only a very small chance of being scheduled within a reasonable time. At TNG, we implemented a similar strategy in an API for batch requests, which are scheduled with very low priority. A minimal sketch of such length-based prioritization is shown after this list.
- (b) You can have a separate inference server for long-prompt requests and a router that forwards requests depending on load and prompt length. This approach requires more GPU resources, but the inference server for short-context requests has lower GPU-memory requirements (for instance, Llama-3.3-70B needs four H100s for a context length of 130k tokens, while a second deployment with two H100s could already serve requests with context lengths below 10k tokens). However, a sophisticated router design is required in order to optimize resource utilization: for instance, when there are no long-prompt requests, the larger inference server should still be utilized. A simple length-based routing sketch also follows after this list.
- (c) You can have separate inference engines for prefill and decode. This architecture, known as disaggregated prefill, combines multiple vLLM deployments, each of which runs only prefill or only decode. After finishing the prefill phase, the KV cache is transferred to the decode worker, which causes a small communication overhead. But since prefill and decode run isolated on different GPUs, concurrent prefills no longer directly disrupt decodes.
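As a sketch of approach (a): priorities derived from the prompt length can be passed to vLLM's priority scheduling policy, so that long prompts yield to shorter ones. This assumes the `scheduling_policy="priority"` engine argument and the `priority` parameter of `generate()`, and that lower priority values are scheduled first; verify both against your vLLM version.

```python
from vllm import LLM, SamplingParams

# Sketch of option (a): make long prompts wait by giving them a worse priority.
# Assumption: with scheduling_policy="priority", requests with *lower* priority
# values are scheduled first (check the docs for your vLLM version).
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,
    scheduling_policy="priority",
)

prompts = ["A short question ...", "A very long document ... " * 2000]
tokenizer = llm.get_tokenizer()

# Derive the priority from the prompt length: short prompts get small values and
# jump ahead, long prompts are pushed to the back of the queue.
priorities = [len(tokenizer.encode(p)) for p in prompts]

outputs = llm.generate(prompts, SamplingParams(max_tokens=128), priority=priorities)
```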
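And a minimal sketch of the length-based routing idea from (b), with hypothetical endpoint URLs and threshold. A production router would also take the current load of both deployments into account:

```python
import requests

# Hypothetical endpoints of two OpenAI-compatible vLLM deployments (names and
# threshold are placeholders for this sketch).
SHORT_CONTEXT_URL = "http://short-context-vllm:8000/v1/completions"  # e.g. 2x H100, prompts < 10k tokens
LONG_CONTEXT_URL = "http://long-context-vllm:8000/v1/completions"    # e.g. 4x H100, up to 130k tokens
LONG_PROMPT_THRESHOLD = 10_000


def route_completion(prompt: str, max_tokens: int = 256) -> dict:
    """Forward a request to the deployment that matches its prompt length."""
    # Crude token estimate; a real router would use the model's tokenizer and
    # also consider queue lengths reported by both servers.
    approx_tokens = len(prompt) // 4
    url = LONG_CONTEXT_URL if approx_tokens > LONG_PROMPT_THRESHOLD else SHORT_CONTEXT_URL
    payload = {
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }
    return requests.post(url, json=payload, timeout=600).json()
```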
The difference between ideal concurrent processing (which would be no different from isolated requests), actual concurrent processing, and a disaggregated-prefill strategy is shown by the following measurements:

Disaggregated Prefill – Optimized for Latency
Separating prefill and decode largely eliminates the slowdown of token generation in the presence of other requests, which makes it a very attractive strategy. It comes at the price of a second full-size vLLM deployment (e.g. for Llama-3.3-70B, you would need four H100 GPUs for a prefill worker and another four H100 GPUs for a decode worker if you wanted to support a maximum context length of 130k tokens). Another drawback is the uneven GPU utilization: because prefill is compute-intensive and decode is not, the prefill worker will likely saturate GPU utilization before the decode worker does. On the other hand, large clusters could consist of different numbers of prefill and decode workers (depending on load patterns), in order to optimize resource utilization.
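For completeness, here is a sketch of what a prefill worker looks like with vLLM's experimental disaggregated-prefill support. The connector name and config fields follow the example shipped with vLLM at the time of writing and should be treated as assumptions that may change between releases:

```python
from vllm import LLM
from vllm.config import KVTransferConfig

# Sketch of a prefill worker in the experimental disaggregated-prefill setup.
# kv_role="kv_producer" means this instance only runs prefill and ships the
# resulting KV cache to a matching decode worker (kv_role="kv_consumer",
# kv_rank=1), which then generates the output tokens.
ktc = KVTransferConfig.from_cli(
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer",'
    '"kv_rank":0,"kv_parallel_size":2}'
)

prefill_llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,
    kv_transfer_config=ktc,
)
```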
Disaggregated prefill isn’t intended to extend total throughput, relatively total “goodput” (i.e. the speed of requests that satisfy latency targets). Consequently, it isn’t the most effective use of GPU resources in case your application isn’t sensitive to latency of individual requests.
Another caveat: the disaggregated-prefill feature in vLLM is still experimental, and some optimizations and features are not available yet. For instance, there are currently lower limits on context length, and the decode worker does not use CUDA graphs consistently, causing the slower decode of the long-prompt request in the figure above. Fortunately, these are not fundamental obstacles and are likely to be resolved in future versions of vLLM.
