The Strangest Bottleneck in Modern LLMs


Introduction

We are currently living in a time where Artificial Intelligence, especially Large Language Models like ChatGPT, has become deeply integrated into our everyday lives and workflows. These models can handle a wide range of tasks, from something as complex as writing code to something as simple as summarising a chunk of text. However, the impressive capabilities of these models are held back largely by a single bottleneck: although the hardware can run them at incredibly fast speeds, the actual process of getting a response from them can still feel slow and sluggish.

Motivation

Essentially, for every token the model generates, the entire set of model weights has to be streamed from GPU memory into the chip’s compute cores, the calculation is performed, and the results are written back. Because the actual calculation takes far less time than moving all that data around, the chip sits idle waiting for the next batch of weights to arrive. This is very wasteful.

There have been several attempts to design algorithms that keep the chip busy instead of letting it sit idle between memory transfers. One such technique is Speculative Decoding [2], where a smaller, usually much weaker model drafts multiple future tokens that the main model verifies all at once. But since the smaller model is often far less capable, it makes many mistakes, which the main model then has to reject, defeating the entire purpose. Alternatively, purely parallel diffusion models can write hundreds of tokens at once, but this speed often comes at the cost of accuracy and language coherence. An ideal architecture would lie somewhere in between, with the accuracy of AR models and the speed of diffusion models.

The Solution: TiDAR

The researchers at Nvidia thought the same, and so they propose a novel architecture, which they call TiDAR [1], short for “Think in Diffusion, Talk in Autoregression.”

The genius of TiDAR lies in the way it turns a process that is normally sequential (as in conventional LLMs) into a parallel one. TiDAR shows that even though autoregression and diffusion are two completely different design philosophies, they can still be unified and exploited for their respective strengths.

To understand it at its core, we need to look at how the input is constructed for this model. A standard LLM simply feeds in all past words to predict the next token, one at a time. In TiDAR, however, we construct a special, three-part input sequence.

Imagine we have the sentence “The cat sat.” Glued together, the fully constructed input sequence would look something like this:

(Source: Author)
  • The Prefix: “The”, “cat”, “sat” (the history we got from the user).
  • The Drafts: “on”, “the” (the guesses from the previous step that need to be checked in this iteration).
  • The Future Masks: [MASK], [MASK] (empty slots where we want new guesses; see the sketch after this list).
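
Concretely, building this sequence is just a concatenation. Here is a minimal sketch (plain strings stand in for token IDs, and the exact layout in the paper’s implementation may differ):

```python
# Illustrative sketch: assembling TiDAR's three-part input for one iteration.
# Real implementations work on token IDs; plain strings are used here for clarity.

MASK = "[MASK]"

def build_input(prefix, drafts, num_masks):
    """Concatenate prefix + last iteration's drafts + empty slots to fill."""
    return list(prefix) + list(drafts) + [MASK] * num_masks

prefix = ["The", "cat", "sat"]   # history from the user / tokens accepted so far
drafts = ["on", "the"]           # guesses produced in the previous iteration
sequence = build_input(prefix, drafts, num_masks=2)

print(sequence)
# ['The', 'cat', 'sat', 'on', 'the', '[MASK]', '[MASK]']
```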

Now that we have the background on the input sequence, let’s get to understanding how the actual processing happens.

(Source: Author)
A full diagram of how the TiDAR architecture works

Component 1: “Talking” (The Autoregressive Verifier)

This is the first and most crucial part of the model architecture. In this phase, the model’s job is to verify the drafts generated in the previous iteration ("on", "the") and decide whether they are good enough to keep.

How Parallel Verification Works

At this point, you might ask yourself, “If the model has to check whether the drafts are good or not, how is this any faster than simply generating them instead?” Let’s answer that question.

In a traditional autoregressive model, if you want to generate 5 words, you have to run the model 5 separate times. You feed in word 1 to get word 2, then feed in words 1+2 to get word 3, and so on. The GPU has to load the huge model weights from memory 5 separate times. This is the main bottleneck that needs to be eliminated.
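
As a minimal sketch, assuming a stand-in `model` function that represents one full, weight-loading forward pass and returns next-token logits, standard greedy generation looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 32000

def model(tokens):
    """Stand-in for a full LLM forward pass: returns next-token logits.
    In a real system, every call here re-loads all model weights from memory."""
    return rng.standard_normal(VOCAB_SIZE)

def generate_autoregressive(prompt_ids, n_new=5):
    tokens = list(prompt_ids)
    for _ in range(n_new):                     # 5 new tokens => 5 forward passes
        logits = model(tokens)
        tokens.append(int(np.argmax(logits)))  # greedy pick of the next token
    return tokens

print(generate_autoregressive([101, 2009, 3001], n_new=5))
```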

This is exactly what TiDAR fixes when it verifies the draft tokens, because it can do this in a single shot, which means 2 words ["on", "the"] are added to the output in just one forward pass. It uses a causal attention mask for this process, which ensures:

  1. When checking “on”, the model can only see “The cat sat”.
  2. When checking “the”, the model can only see “The cat sat on”.

Since the GPU is a massive parallel processor, it can calculate the “correctness” of all these drafts simultaneously in a single operation. It’s effectively doing 2 steps of work for the price of 1. That’s where the big speedup comes from.
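
Here is a rough sketch of that idea, assuming a stand-in `forward_all_positions` function that represents one causal-masked forward pass and returns a next-token distribution for every position in the sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 32000

def forward_all_positions(tokens):
    """Stand-in for ONE causal-masked forward pass: one row of next-token
    logits per input position, shape [len(tokens), vocab_size]."""
    return rng.standard_normal((len(tokens), VOCAB_SIZE))

prefix = [101, 102, 103]     # "The cat sat"
drafts = [201, 202]          # "on", "the" from the previous iteration

logits = forward_all_positions(prefix + drafts)   # one pass over all 5 tokens

# Because of the causal mask, row i is conditioned only on tokens[0..i], so the
# row just before each draft position tells us what the model itself would
# have generated there -- all drafts can be checked at the same time.
for j, draft in enumerate(drafts):
    row = logits[len(prefix) - 1 + j]
    model_choice = int(np.argmax(row))
    print(f"draft token {draft} vs model's own pick {model_choice}")
```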

The Easy Correction Mechanism

But what happens if the draft is wrong? What if the drafts were ["in", "pizza"] instead of ["on", "the"]?

The best part is that it doesn’t matter if the drafts are wrong. The correction is virtually free.

The model verifies the drafts by calculating a probability distribution over its vocabulary, conditioned on the context it gets. If the drafts are plausible predictions that the model itself could have chosen, they are accepted; if not, the model picks the most probable word from the distribution it just calculated.

Since we ran this computation in the same forward pass, we don’t need to run the model again. We simply:

  1. Discard the bad draft ["in"].
  2. Immediately swap in the winner ["on"] from the probability list we just calculated.
  3. Cut off all subsequent drafts ["pizza"] (because they were based on the incorrect word).

This guarantees that the final output we end up with is mathematically as valid as if the model had been running slowly, step by step. We get the speed of parallel processing with the accuracy of sequential processing.
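
The sketch below captures this accept-or-correct logic in its simplest greedy form (the paper uses a rejection-sampling-style rule when sampling; this simplification is only meant to illustrate the idea):

```python
import numpy as np

def verify_drafts(drafts, draft_logits):
    """Greedy acceptance sketch: keep drafts while they match the model's own
    argmax; at the first mismatch, swap in the model's token and stop.
    draft_logits[j] is the next-token distribution that the same forward pass
    produced for draft position j."""
    accepted = []
    for j, draft in enumerate(drafts):
        best = int(np.argmax(draft_logits[j]))
        if draft == best:
            accepted.append(draft)       # draft survives verification
        else:
            accepted.append(best)        # free correction from the same pass
            break                        # later drafts depended on the bad one
    return accepted

# Toy example: the model "wants" token 7 first and token 9 second.
logits = np.full((2, 10), -1.0)
logits[0, 7] = 1.0
logits[1, 9] = 1.0
print(verify_drafts([7, 9], logits))   # [7, 9] -> both drafts accepted
print(verify_drafts([3, 9], logits))   # [7]    -> corrected, rest discarded
```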

Component 2: “Thinking” (The Diffusion Drafter)

While the autoregressive “talking” component is busy verifying which tokens to keep and which to reject, the “thinking” component drafts the tokens for the next iteration.

Filling the Empty Slots

Remember those [MASK] tokens at the end of our input sequence? The diffusion head tries to fill in these blanks so that the autoregressive head can verify them in the next iteration.

For this part specifically, the model looks at all the words in the sequence at once. To do this, it uses a bidirectional mask instead of the usual causal mask, but only for these [MASK] tokens.

Why Bidirectional?

Since the diffusion head has to draft multiple tokens at once, it needs to be able to relate all the words to all the [MASK] slots. It effectively has to capture the “vibe” of the whole sequence to fill in the [MASK] tokens, hence the bidirectional mask.

For our example sequence, the diffusion head looks at all the [MASK] tokens together, along with the history (“The cat sat on the”), and tries to “denoise” them into the most plausible and coherent text. It effectively asks, “What comes next?”, and it might come up with “red mat”.

The final attention mask, combining both components, looks like the following:

(Source: Author)
For the prefix and draft tokens, the mask is a lower-triangular (causal) matrix, but for the [MASK] tokens, there is no restriction on where they can attend.
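
A small sketch of how such a hybrid mask could be built from that description (the paper’s actual implementation details may differ):

```python
import numpy as np

def tidar_attention_mask(n_prefix, n_draft, n_mask):
    """Boolean mask: entry [i, j] is True if position i may attend to position j.
    Prefix + draft tokens attend causally; [MASK] tokens attend everywhere."""
    n_causal = n_prefix + n_draft
    n_total = n_causal + n_mask
    mask = np.zeros((n_total, n_total), dtype=bool)

    # Causal (lower-triangular) block for the prefix and draft tokens.
    mask[:n_causal, :n_causal] = np.tril(np.ones((n_causal, n_causal), dtype=bool))

    # [MASK] rows: no restriction -- they attend to the whole sequence,
    # including each other (the bidirectional part).
    mask[n_causal:, :] = True
    return mask

# "The cat sat" + drafts "on", "the" + two [MASK] slots
print(tidar_attention_mask(n_prefix=3, n_draft=2, n_mask=2).astype(int))
```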

The Continuous Cycle

This creates a continuous cycle:

  1. In Step 1, the Diffusion head guesses “on the”.
  2. In Step 2, those guesses move into the “Draft” position.
  3. The Autoregressive head verifies them (and corrects them if needed).
  4. Meanwhile, the diffusion head moves on to guessing the next phrase (“red mat”).

By continually drafting ahead while verifying behind, TiDAR keeps the GPU fully utilized, ensuring that no computing power goes to waste.
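
Putting the pieces together, the generation loop might look roughly like the sketch below. Everything here is a stand-in: `tidar_forward` fakes a single forward pass that both verifies the current drafts and produces the next ones, and the sketch glosses over details such as how a rejection affects the freshly drafted block:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 32000
BLOCK = 2            # how many tokens are drafted / verified per iteration

def tidar_forward(tokens, drafts):
    """Stand-in for ONE TiDAR forward pass, returning:
    - verify_logits: next-token logits at each draft position (AR head)
    - new_drafts:    guesses for the next block's [MASK] slots (diffusion head)"""
    verify_logits = rng.standard_normal((len(drafts), VOCAB_SIZE))
    new_drafts = [int(t) for t in rng.integers(0, VOCAB_SIZE, size=BLOCK)]
    return verify_logits, new_drafts

def generate(prompt_ids, steps=4):
    tokens, drafts = list(prompt_ids), []
    for _ in range(steps):
        verify_logits, new_drafts = tidar_forward(tokens, drafts)  # draft + verify together
        for j, d in enumerate(drafts):               # greedy accept-or-correct
            best = int(np.argmax(verify_logits[j]))
            tokens.append(d if d == best else best)
            if d != best:
                break
        drafts = new_drafts                          # become next iteration's drafts
    return tokens

print(generate([101, 102, 103]))
```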

The Results

The researchers put TiDAR through a variety of tests to see whether their novel approach actually delivers. Let’s take a look at what they concluded:

1. Speed: A Massive Leap Forward

The most critical metric for this architecture is whether it can improve inference speed, which it does, and quite substantially.

Compared to a standard autoregressive (AR) model, TiDAR demonstrates a big increase in throughput. Throughput here refers to the number of tokens the model can generate per second.

  • For the 1.5B-parameter model, TiDAR achieved a speedup of 4.71x, meaning this architecture can generate the same amount of text nearly 5x faster than a standard LLM architecture.
  • For the larger 8B-parameter model, the gap is even wider, with the speedup reaching up to 5.91x.

This is a drastic improvement over the standard next-token prediction scheme, moving from generating one token at a time to drafting multiple tokens at once.

2. Quality: Closing the Gap

Until now, purely diffusion-based LLMs like Dream [4] or LLaDA [5] have always found it difficult to match the reasoning capabilities and coherence of AR models.

TiDAR, however, with its hybrid approach, has managed to close this gap almost entirely. By using the autoregressive head to verify the draft tokens produced by the diffusion head, TiDAR benefits from the fidelity of AR models and the speed of pure diffusion models at the same time.

  • On benchmarks like HumanEval (coding) [6] and GSM8K (math) [7], TiDAR achieved scores that were “lossless” compared to the baseline AR model.
  • In fact, on some metrics it even slightly outperformed the baseline, likely because of the “look-ahead” nature of the drafting process, which helps the model plan better on reasoning tasks.
(Source: Adapted from Liu et al. (2025) [1], Table 2)
This table shows the accuracy scores of peer models compared to TiDAR. “Trust AR” is the standard mode, where the AR head’s opinion is weighted more than the diffusion head’s when deciding whether the drafts are correct. “Trust Diff” is the mode where the diffusion head is weighted more heavily than the AR head.

3. Efficiency vs. Speculative Decoding

The authors also tested TiDAR against one of the current best approaches to speeding up inference, EAGLE-3 [3] (an algorithm based on Speculative Decoding).

As discussed earlier, Speculative Decoding relies on a separate, smaller model to draft future tokens, which the main model then verifies. The problem is that the smaller model makes a lot of mistakes, resulting in rejected tokens and wasted compute. TiDAR, however, uses its own trunk to both draft and verify the tokens, which makes the drafted tokens far more accurate and high-quality.

  • The “Acceptance Rate” (how often the drafts are correct) was significantly higher for TiDAR, for the reason stated above.
  • This high acceptance rate means the model spends less time correcting its mistakes and more time generating the actual text.
(Source: Adapted from Liu et al. (2025) [1], Table 1)
Shared with base: whether the draft model and the main model share the same trunk.
Parallel Decoding: whether the drafter writes one token at a time or many tokens at once.
Parallel to Verification: whether the architecture can draft and verify at the same time.

4. The “Free Token” Advantage

Finally, the results validate the core hypothesis of the paper: that the GPU can be utilized up to its absolute limits.

The experiments conducted by the authors show that TiDAR’s drafting mechanism adds almost no latency compared to a standard forward pass. In a standard pass, the GPU is memory-bound, which means that moving data on and off the chip is the rate-limiting step rather than the actual compute.

In TiDAR, however, we can load the GPU with extra work instead of letting it sit idle. The graph below essentially tells us how many tokens we can draft in a single forward pass before the computation itself becomes the bottleneck for the GPU.
It turns out that we can draft roughly 60 tokens per forward pass before the GPU becomes compute-bound.

(Source: Adapted from Liu et al. (2025) [1], Figure 1)

In the graph above, the x-axis shows the number of drafted tokens and the y-axis shows the model’s latency. In the green region, the curve is flat, meaning there is no increase in latency even as we increase the number of draft tokens. Only around 60 tokens (the yellow region) does the latency start rising, signifying that the actual computation now takes more time than moving data to and from memory.
This means we can, in principle, generate around 60 tokens at once with no added latency.
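
A back-of-the-envelope roofline estimate shows why a number in this range is plausible. All of the hardware figures below are assumptions chosen purely for illustration, not measurements from the paper:

```python
# Rough roofline sketch (illustrative numbers only): how many tokens can share
# one weight load before compute, rather than memory traffic, dominates?

params = 8e9                   # 8B-parameter model
bytes_per_weight = 2           # FP16 weights
memory_bandwidth = 2.0e12      # bytes/s  -- assumed effective bandwidth
achieved_flops = 1.2e14        # FLOP/s   -- assumed *achieved* rate, well below peak

# Time to stream the weights once vs. compute time per extra token
# (~2 FLOPs per parameter per token for the matrix multiplies).
weight_load_time = params * bytes_per_weight / memory_bandwidth
compute_time_per_token = 2 * params / achieved_flops

break_even_tokens = weight_load_time / compute_time_per_token
print(f"~{break_even_tokens:.0f} tokens per forward pass before compute dominates")
# With these illustrative figures, the crossover lands around 60 tokens,
# the same ballpark as the paper's measurement.
```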


References

  1. Liu, J., Dong, X., Ye, Z., et al. (2025). TiDAR: Think in Diffusion, Talk in Autoregression. arXiv preprint.
  2. Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. International Conference on Machine Learning (ICML).
  3. Li, Y., Wei, F., Zhang, C., & Zhang, H. (2025). EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. arXiv preprint.
  4. Ye, J., et al. (2025). Dream 7B: Diffusion Large Language Models. arXiv preprint.
  5. Nie, S., et al. (2025). Large Language Diffusion Models. arXiv preprint.
  6. Chen, M., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint.
  7. Cobbe, K., et al. (2021). Training Verifiers to Solve Math Word Problems. arXiv preprint.