What the Bits-over-Random Metric Changed in How I Think About RAG and Agents


As an Edinburgh-trained PhD in Information Retrieval from Victor Lavrenko’s Multimedia Information Retrieval Lab at Edinburgh, where I trained in the late 2000s, I have long viewed retrieval through the framework of traditional IR thinking:

Those are still useful questions. But after reading the recent work on Bits over Random (BoR), I think they are incomplete for the agentic systems many of us are now building.

Figure 1: In LLM systems, retrieval quality is not just about finding relevant information, but about how much irrelevant material comes with it. The librarian analogy illustrates the core idea behind Bits over Random (BoR): one system floods the context window with noisy, low-selectivity retrieval, while the other delivers a smaller, cleaner, more selective bundle that is easier for the model to use. 📖 Source: image by author via GPT-5.4.

The ICLR blogpost sharpened something I had felt for a while in production LLM systems: retrieval quality should account for both how much good content we find and how much irrelevant material we bring along with it. In other words, as we crank up recall, we also increase the risk of context pollution.

What makes BoR useful is that it gives us a language for this. BoR tells us whether retrieval is genuinely selective, or whether we are achieving success mostly by stuffing the context window with more material. When BoR falls, it is a sign that the retrieved bundle is becoming less discriminative relative to chance. In practice, that often correlates with the model being forced to read more junk, more overlap, or more weakly relevant material.

The important nuance is that BoR does not directly measure what the model “feels” when reading a prompt. It measures retrieval selectivity relative to random chance. But lower selectivity often goes hand in hand with more irrelevant context, more prompt pollution, more attention dilution, and worse downstream performance. Put simply, BoR helps tell us when retrieval is still selective and when it has begun to degenerate into context stuffing.

That idea matters far more for RAG and agents than it did for traditional search.

Why retrieval dashboards can mislead agent teams

One of the easiest traps in RAG is to look at your retrieval dashboard, see healthy metrics, and conclude that the system is doing well. You might see:

On paper things may look better but, in reality, the agent might actually behave worse. Your agent could suffer any number of maladies, such as diffuse answers to queries, unreliable tool use, or simply an increase in latency and token cost without any real user benefit.

This disconnect happens because most retrieval dashboards still reflect a human search worldview. They assume the consumer of the retrieved set can skim, filter, and ignore junk. Humans are surprisingly good at this. LLMs are not consistently good at it.

An LLM does not “notice” ten retrieved items and casually focus on the best two in the way a strong analyst would. It processes the entire bundle as prompt context. That means the retrieval layer is surfacing evidence that is actively shaping the model’s working memory.

This is why I think agent teams should stop treating retrieval as a back-office ranking problem and start treating it as a reasoning-budget allocation problem. When building performant agentic systems, the key question is both:

and:

That is the lens BoR pushes you toward, and I have found it to be a very useful one.

Context engineering is becoming a first-class discipline

One reason this paper has resonated with me is that it fits a broader shift already happening in practice. Software engineers and ML practitioners working on LLM systems are gradually becoming something closer to context engineers.

That means designing systems that decide:

In traditional software, we worry about memory, compute, and API boundaries. In LLM systems, we also have to worry about context purity. The context window is contested cognitive real estate.

Every irrelevant passage, duplicated chunk, weakly related example, verbose tool definition, and poorly timed retrieval result competes with the thing the model most needs to focus on. That is why I like the pollution metaphor. Irrelevant context contaminates the model’s workspace.

The BoR poster gives this intuition a more rigorous shape by telling us that we should stop evaluating retrieval only by whether it succeeds. We should also ask how much better the retrieval is compared to chance, at the depth (top-K retrieved items) that we are actually using. That is a very practitioner-friendly question.
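The poster itself defines the metric precisely; as a rough, hedged illustration of the "better than chance, in bits" idea, one can compare observed Success@K against a random baseline in log-space. The function name and the log-ratio form below are my own sketch, not the paper's exact formula:

```python
from math import log2

def bits_over_random(success_at_k: float, random_success_at_k: float) -> float:
    """Illustrative BoR-style score: how many bits better than chance
    the observed Success@K is. 0.0 means no better than random."""
    return log2(success_at_k / random_success_at_k)

# A retriever that succeeds 90% of the time at a depth where a random
# selection would succeed 30% of the time:
print(round(bits_over_random(0.9, 0.3), 3))  # ≈ 1.585 bits (log2 of 3x chance)
```

The useful property is that the score collapses toward zero as the random baseline catches up with observed success, which is exactly the "context stuffing" regime discussed below.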

Why tool overload breaks agents

This is where I think the BoR work becomes especially important for real-world agent systems.

In classic RAG, the corpus is usually large. You may be retrieving from tens of thousands or millions of chunks. In that regime, random chance stays weak for longer. Tool selection is very different.

In an agent, the model may be selecting among 20, 50, or 100 tools. That sounds manageable until you realize that several tools are often vaguely plausible for the same task. Once that happens, dumping all tools into context is not thoroughness. It is confusion disguised as completeness.

I have seen this pattern repeatedly in agent design:

But often the real issue is architectural, not prompt-level. The model is being asked to pick from an overloaded context where the distinctions are too weak and the options too numerous.

What BoR adds here is a useful way to formalize something people often feel only intuitively: there is a point where the selection task becomes so crowded that the model is no longer demonstrating meaningful selectivity.

That is why I strongly prefer agent designs with:

  • Staged tool retrieval: retrieve a small candidate set of tools first, then ask the model to choose among them
  • Domain routing: split tools into domains and route to a domain before final selection
  • Compressed capability summaries: keep tool descriptions short enough that distinctions stay obvious
  • Explicit exclusion of irrelevant tools: keep clearly inapplicable tools out of the prompt entirely

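As a concrete sketch of the first two ideas, staged tool retrieval plus domain routing: the tool names, domains, and word-overlap scorer below are all invented stand-ins (a real system would use embeddings and a learned or rule-based router), but the two-stage shape is the point.

```python
# Hypothetical two-stage router: route to a domain first, then shortlist
# within it. Only the k shortlisted tools ever reach the model's prompt.
TOOLS = {
    "billing": {
        "refund_payment": "issue a refund for a customer payment",
        "get_invoice": "fetch an existing invoice by its id",
    },
    "calendar": {
        "create_event": "create a new calendar event",
        "list_events": "list upcoming calendar events for a user",
    },
}

def overlap(query: str, text: str) -> int:
    """Crude relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def route(query: str, k: int = 2) -> list[str]:
    # Stage 1: domain routing -- pick the domain whose pooled tool
    # descriptions best match the query.
    domain = max(TOOLS, key=lambda d: overlap(query, " ".join(TOOLS[d].values())))
    # Stage 2: staged tool retrieval -- shortlist only k tools from that
    # domain; everything else is explicitly excluded from the prompt.
    ranked = sorted(TOOLS[domain],
                    key=lambda t: overlap(query, TOOLS[domain][t]),
                    reverse=True)
    return ranked[:k]

print(route("refund the customer payment"))  # ['refund_payment', 'get_invoice']
```

The design choice worth noticing is that the shortlist size k, not the library size, is what bounds the model's selection task.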
In my experience, tool selection should be treated more like retrieval than like static prompt decoration.

Understanding BoR through tool selection

One of the most useful things about BoR is that it sharpens what top-K really means in tool-using agents.

In document retrieval, increasing top-K often means moving from top-5 passages to top-20 or top-50 from a very large corpus. In tool selection, the same move has a very different character. When an agent only has a modest tool library, increasing top-K may mean moving from a shortlist of 3 candidate tools, to 5, to 8, and eventually to the familiar but dangerous fallback: showing the model every tool.

That often improves recall or Success@K, because the right tool is more likely to be somewhere in the visible set. But that improvement can be misleading. As K grows, you are not only helping the router. You are also making it easier for a random selector to include a relevant tool.

So the real question is not simply whether a relevant tool appeared in the shortlist. The more important question is how much better than random chance that appearance is. This is precisely where BoR becomes useful.

A simple example makes the intuition clearer. Suppose you have 10 tools, and for a given class of task 2 of them are genuinely relevant. If you show the model just one tool, the random chance of surfacing a relevant one is 20 percent. At 3 tools, the random baseline rises sharply. At 5 tools, random inclusion is already fairly strong. At 10 tools, it is 100%, because you have shown everything. So yes, Success@K rises as K rises. But the meaning of that success changes. At low K, success indicates real discrimination. At high K, success may simply mean you included enough of the menu that failure became difficult.
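Those percentages follow from a hypergeometric baseline: the chance that a uniformly random shortlist of K tools contains at least one relevant tool. A small sketch, using the 10-tools/2-relevant numbers from the example, plus a larger library to show how much longer random chance stays weak there:

```python
from math import comb

def random_inclusion(n_tools: int, n_relevant: int, k: int) -> float:
    """Probability that a uniformly random shortlist of k tools
    contains at least one of the n_relevant relevant tools."""
    if k >= n_tools:
        return 1.0
    # Hypergeometric: 1 - P(every one of the k picks is irrelevant)
    return 1.0 - comb(n_tools - n_relevant, k) / comb(n_tools, k)

# Small library: 10 tools, 2 relevant.
for k in (1, 3, 5, 10):
    print(f"K={k:3d}: random baseline = {random_inclusion(10, 2, k):.0%}")
# K=1 -> 20%, K=3 -> 53%, K=5 -> 78%, K=10 -> 100%

# Large library: 1,000 tools, 5 relevant -- random stays weak far longer.
for k in (10, 50, 100):
    print(f"K={k:3d}: random baseline = {random_inclusion(1000, 5, k):.0%}")
```

At 3 of 10 tools the random baseline is already above 50%, which is what "rises sharply" means in the example above; at 50 of 1,000 tools it is still only around a quarter.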

That is what I mean by helping random chance rather than demonstrating meaningful selectivity.

This matters because, with tools, the problem is worse than a misleading metric. When you show too many tools, the prompt gets longer, descriptions begin to overlap, the model sees more near-matches, distinctions become fuzzier, parameter confusion rises, and the chance of selecting a plausible-but-wrong tool increases. So even though top-K recall improves, the quality of the final decision may worsen. This is the small-tool paradox: adding more candidate tools can increase apparent coverage while decreasing the agent’s ability to choose cleanly.

A practical way to think about this is that tool selection often falls into three regimes. In the healthy regime, K is small relative to the number of tools, and the appearance of a relevant tool in the shortlist tells you the router actually did something useful. For example, 30 total tools, 2 or 3 relevant, and a shortlist of 3 or 4 still looks like real selection. In the grey zone, K is large enough that recall improves, but random inclusion is also rising quickly. For example, 20 tools, 3 relevant, shortlist of 8. Here you may still gain something, but you should already be asking whether you are truly routing or merely widening the funnel. Finally, there is the collapse regime, where K is so large that success mostly comes from exposing enough of the tool menu that random selection would also succeed often. If you have 15 tools, 3 relevant ones, and a shortlist of 12 or all 15, then “high recall” is no longer saying much. You are getting close to brute-force exposure.

Operationally, this pushes me toward a better question. In a small-tool system, I recommend avoiding the overexposure mindset that asks:

The higher query is:

That mindset encourages disciplined routing.

In practice, that usually means routing first and selecting second, keeping the shortlist very small, compressing tool descriptions so distinctions are obvious, splitting tools into domains before final selection, and testing whether increasing K improves end-to-end task accuracy, not just tool recall. A useful sanity check is this: if giving the model all tools performs about the same as your routed shortlist, then your routing layer is probably not adding much value. And if giving the model more tools improves recall but worsens overall task performance, you are likely in exactly the regime where K helps random chance more than real selectivity.
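That sanity check can be wired into a tiny harness. Everything here is illustrative: the run outcomes are fabricated, and in practice each entry would come from executing the agent on a labelled task under each condition (routed shortlist vs. all tools in context).

```python
def accuracy(runs: list[bool]) -> float:
    """Fraction of end-to-end task successes."""
    return sum(runs) / len(runs)

# Fabricated outcomes for the same 10 tasks under the two conditions.
routed_shortlist = [True] * 9 + [False]          # 90% task success
all_tools_in_context = [True] * 7 + [False] * 3  # 70% task success

routed, exposed = accuracy(routed_shortlist), accuracy(all_tools_in_context)
if routed <= exposed:
    print("Routing adds little end-to-end value; revisit the router.")
else:
    print(f"Routing helps: {routed:.0%} vs {exposed:.0%} with all tools exposed.")
```

Note that the comparison is on task success, not tool recall; recall alone would reward the all-tools condition almost by construction.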

When the failure mode changes: large tool libraries

The big-tool case is different, and this is where an important nuance matters. A bigger tool universe does not mean we should dump hundreds of tools into context and expect the system to work better. It just means the failure mode changes.

If an agent has 1,000 tools available and only a handful are relevant, then increasing top-K from 10 to 50 or even 100 may still represent meaningful selectivity. Random chance stays weaker for longer than it does in the small-tool case. In that sense, BoR is still useful: it helps stop us from mistaking broader exposure for better routing. It asks whether a larger shortlist reflects real selectivity, or whether it is merely helping by exposing a larger slice of the search space.

But BoR does not capture the whole problem here. With very large tool libraries, the issue may no longer be that random chance has become too strong. The issue may be that the model is simply drowning in options. A shortlist of 200 tools can still be better than random in BoR terms and yet still be a terrible prompt. Tool descriptions overlap, near-matches proliferate, distinctions become harder to maintain, and the model is forced to reason over a crowded semantic menu.

So BoR is valuable, but it is not sufficient by itself. It is better at telling us whether a shortlist is genuinely discriminative relative to chance than whether that shortlist is still cognitively manageable for the model. In large tool libraries, we therefore need both perspectives: BoR to measure selectivity, and downstream measures such as tool-choice quality, latency, parameter correctness, and end-to-end task success to measure usability.


The design implication is the same even though the reason differs. With small tool sets, broad exposure quickly becomes bad because it helps random chance too much. With very large tool sets, broad exposure becomes bad because it overwhelms the model. In both cases, the answer is not to stuff more into context. It is to design better routing.

My own rule of thumb: the model should see less, but cleaner

If I had to summarize the practical shift in a single sentence, it would be this: for LLM systems, smaller and cleaner is usually better than larger and more comprehensive.

That sounds obvious, but many systems are still designed as if “more context” is automatically safer. In reality, once a baseline level of useful evidence is present, additional retrieval can become harmful. It increases token cost and latency, but more importantly it widens the field of competing cues inside the prompt.

I have come to think about prompt construction in three layers:

Layer 1: mandatory task context

Layer 2: highly selective grounding

Layer 3: optional overflow

Most failures come from letting Layer 3 invade Layer 2. That is why retrieval should be judged not only by coverage, but by its ability to preserve a clean Layer 2.
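A minimal sketch of enforcing that layering with a token budget. The function shape, the scorer, and the headroom fraction are my own assumptions; `tokens` is a crude word-count stand-in for a real tokenizer:

```python
def build_prompt(task: str,
                 grounding: list[tuple[float, str]],
                 overflow: list[str],
                 budget_tokens: int = 2000) -> str:
    def tokens(text: str) -> int:
        return len(text.split())  # crude stand-in for a real tokenizer

    parts = [task]                        # Layer 1: mandatory task context
    used = tokens(task)
    # Layer 2: highly selective grounding, highest-scored chunks first.
    for _, chunk in sorted(grounding, key=lambda sc: sc[0], reverse=True):
        if used + tokens(chunk) > budget_tokens:
            break
        parts.append(chunk)
        used += tokens(chunk)
    # Layer 3: optional overflow, only while generous headroom remains,
    # so it can never crowd out Layer 2.
    for chunk in overflow:
        if used + tokens(chunk) > int(budget_tokens * 0.8):
            break
        parts.append(chunk)
        used += tokens(chunk)
    return "\n\n".join(parts)
```

The design point is that Layer 3 is budgeted last and most conservatively, which operationalizes the rule that overflow must not invade the selective grounding layer.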

Where I think BoR is particularly useful

I do not see BoR as a replacement for all retrieval metrics. I see it as a very useful additional lens, especially in these cases:

1. Selecting K in production

2. Evaluating agent tool routing

3. Diagnosing why downstream quality falls despite “better retrieval”

4. Comparing systems with different retrieval depths

5. Preventing overconfidence in benchmark results

Where I think BoR may be insufficient by itself

I like the paper, but I would not treat BoR as the final answer to retrieval evaluation. There are at least a few important caveats.

First, not every task needs only one good item. Some tasks genuinely require synthesis across multiple pieces of evidence. In those cases, a success-style view can understate the need for broader retrieval.

Second, retrieval usefulness is not binary. Two chunks may both count as “relevant,” while one is far more actionable, concise, or decision-useful for the model.

Third, prompt organization still matters. A noisy bundle that is carefully structured may perform better than a slightly cleaner bundle that is poorly ordered or badly formatted.

Fourth, the model itself matters. Different LLMs have different tolerance for clutter, different long-context behavior, and different tool-use reliability. A retrieval policy that pollutes one model may be acceptable for another.

Fifth, and this is especially relevant for large tool libraries, BoR tells us more about selectivity than about usability. A shortlist can still look meaningfully better than random and yet be too crowded, too overlapping, or too semantically messy for the model to use well.

So I would not use BoR in isolation. I would pair it with:

Still, even with those caveats, BoR contributes something important: it forces us to stop confusing coverage with selectivity.

How this changes evaluation practice for me

The biggest practical shift is that I would now evaluate retrieval systems more like this:

Then ask:

For agents, I would go even further:

That is a more realistic evaluation setup for the kinds of systems many teams are actually deploying.

The broader lesson

The main lesson I took from the ICLR poster is much broader than a single new metric: it is that LLM system quality depends heavily on the cleanliness of the context we construct around the model. That has consequences across the agentic stack:

The best LLM systems will be those that expose the right information, at the right moment, in the smallest clean bundle that still supports the task. That is the essence of what good context engineering looks like.

Final thought

For years, retrieval was mostly about finding needles in haystacks. For LLM systems, that is no longer enough. Now the job is also to avoid dragging half the haystack into the prompt along with the needle.

That is why I think the BoR idea matters and is so impactful. It gives practitioners a better language for a real production problem: how to measure when useful context has quietly become polluted context. And once you start looking at your systems that way, a lot of familiar agent failures begin to make much more sense.

BoR does not directly measure what the model “feels” when reading a prompt, but it does tell us when retrieval is ceasing to be meaningfully selective and beginning to resemble brute-force context stuffing. In practice, that is often precisely the regime where LLMs begin to read more junk, reason less cleanly, and perform worse downstream.

More broadly, I think this points to an important emerging sub-field: developing better metrics for measuring LLM system performance in realistic settings, not just model capability in isolation. We have become reasonably good at measuring accuracy, recall, and benchmark performance, but much less good at measuring what happens when a model is forced to reason through cluttered, overlapping, or weakly filtered context.

That, to me, exposes a real gap. BoR helps measure selectivity relative to chance, which is valuable. But there is still a missing concept around what I would term cognitive overload: the point at which a model may still have the right information somewhere in view, yet performs worse because too many competing options, snippets, tools, or cues are presented at once. In other words, the failure is no longer just a retrieval failure. It is a reasoning failure induced by prompt pollution.

I believe that better ways of measuring this kind of cognitive overload will become increasingly important as agentic systems grow more complex. The next step forward may not only come from larger models or larger context windows, but from better ways of quantifying when the model’s working context has crossed the line from useful breadth into harmful overload.
