How We Achieved State of the Art




Research agents are rapidly becoming one of the most essential applications of AI. Research is a foundational knowledge-work task: collecting, reading, and synthesizing information underpins everything from writing and decision-making to coding itself. Yet human-driven research is constrained by memory, reading speed, and time. AI research agents, in contrast, can process vast amounts of data, synthesize insights immediately, and scale effortlessly. For this reason, research agents are emerging as a top use case for AI today and may soon become a core subcomponent of broader agentic workflows across content generation, coding, sales, and more. In this post, we share the technical and philosophical lessons we’ve learned building a state-of-the-art research agent, and where we believe the field is headed.




Building for the Future



Agent Harness

The job of building an agent harness is to create a software layer that enhances a model’s runtime execution through context management, tool invocation, loop control, orchestration, and error handling. Building applications on top of rapidly improving models is, however, a distinctly modern engineering challenge. How can we design software today that absorbs the performance gains from future model releases?

This requires forecasting how models will evolve, staying optimistic about their progress, limiting assumptions, and avoiding hand-crafted optimizations.

We learned this the hard way seven months ago, when we had to abandon our first attempt at deep research and rebuild the whole system from scratch. The original architecture was intricate and complex (we thought this was a good thing), but its assumptions became bottlenecks when the next generation of models arrived.



Models

Over the past seven months, model capabilities have quietly but meaningfully evolved (especially in their tool-calling abilities). This single optimization focus has pushed us from workflows to agents. We believe future models will be trained to solve the current pain points of agent developers. Every model is ultimately consumed by a harness, so models should evolve in service of that harness. We hope to see models improve in high-recall summarization (for context compression), tool-calling reliability, and concision in writing.



Tools

Similarly, tools should evolve to support LLMs and widely adopted agent harnesses. The best tools perform some context engineering on the tool side, abstracted away from the agent. They should return only the most relevant data instead of dumping large volumes of tokens into the context window. As a tool provider, we’ve invested heavily in our advanced search feature, which has context engineering baked in. This in turn lowers hallucinations and latency for the downstream agent processes.



Takeaways

To build agents that improve over time, we followed a few guiding principles:

  1. Simplify orchestration logic and lean into autonomy.
  2. Pay close attention to what models and tools are being optimized for, and leverage their emerging capabilities.
  3. Focus on context engineering (more on this in the next section).



Context Engineering — An Exercise in Curation

Long-horizon research tasks expose a fundamental challenge in current agent design: maintaining a clean, optimized context window over time. If curating context is not something the engineer pays close attention to, the agent is all but destined for failure. The following outlines our thinking around this idea within the deep research domain.



Context-Managed Web Retrieval

Using Tavily’s Advanced Search is the natural first step in overcoming this challenge, in that it abstracts away the processing of raw web content and returns only the most relevant content chunks from each source. In leveraging this functionality, we let Tavily Search do the heavy lifting and allow Tavily Research to reap the benefit, gathering the most useful content in a latency-efficient manner.
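As a rough illustration, here is a minimal sketch of this retrieval step using the tavily-python SDK. The query string and API key are placeholders, and the exact parameter names (`search_depth`, `chunks_per_source`, `max_results`) reflect our reading of the current API, so check the docs before relying on them.

```python
# Minimal sketch: context-managed retrieval with Tavily's advanced search.
from tavily import TavilyClient

client = TavilyClient(api_key="tvly-...")  # placeholder key

response = client.search(
    query="impact of context engineering on agent reliability",
    search_depth="advanced",   # returns pre-extracted, relevance-ranked chunks
    chunks_per_source=3,       # cap how much of each source enters the context
    max_results=5,
)

# Each result carries only the most relevant chunks, not the full raw page,
# so the agent's context window stays small and on-topic.
for result in response["results"]:
    print(result["url"], result["score"])
    print(result["content"][:200], "...")
```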

Ensuring that the agent doesn’t overfit to a single research thread is the next step towards an efficient context-gathering pipeline. It’s in this regard that global state persistence and source deduplication are paramount; in our case, they help threefold (a minimal sketch follows the list below):

  1. It ensures the agent is exposed only to fresh information.
  2. It allows the engineer to recognize when the information scope is narrowing and to prompt the agent to explore untapped relevant domains.
  3. It enables effective source attribution later on in the generation process.
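The sketch below illustrates the idea of a global research state with source deduplication and a narrowing-scope signal. It is not our exact implementation; the field names and the 0.5 freshness threshold are illustrative assumptions.

```python
# Illustrative global state with source deduplication across research threads.
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    seen_urls: set[str] = field(default_factory=set)   # global dedup index
    sources: list[dict] = field(default_factory=list)  # retained for attribution

    def is_narrowing(self, results: list[dict], threshold: float = 0.5) -> bool:
        """Call before add_results: if most results were already seen, the
        information scope is shrinking and the agent should be prompted to
        explore untapped domains."""
        if not results:
            return True
        fresh_ratio = sum(r["url"] not in self.seen_urls for r in results) / len(results)
        return fresh_ratio < threshold

    def add_results(self, results: list[dict]) -> list[dict]:
        """Keep only sources the agent has not seen in any thread."""
        fresh = [r for r in results if r["url"] not in self.seen_urls]
        for r in fresh:
            self.seen_urls.add(r["url"])
            self.sources.append(r)
        return fresh
```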

At Tavily, interacting with the web is our bread and butter. Architecting a refined web-retrieval system engineered for deep research was a foundational building block for our deep research agent design as a whole.



Modeling the Human-Web Interaction

Humans research in an inherently unstructured, iterative way. We start by defining the task: what we’re trying to accomplish and what information we need. We then gather data from our sources, extracting the key insights and holding them in short-term memory, letting these distilled thoughts guide our subsequent actions.

This cycle repeats: collect information, distill it, decide what to do next. Only once we’ve gathered enough understanding to produce the final deliverable do we return to the original sources, using them as references to assemble the finished product.

We believe deep research agents should be designed in a similar manner: tool outputs should be distilled into reflections, and only the set of past reflections should be used as context for the tool caller. Much like humans, it is only when the agent begins to prepare the final deliverable that you should provide the raw information as context, to ensure there is no information loss.
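A simplified sketch of that loop follows. `call_model`, `run_tool`, and `summarize` are hypothetical callables standing in for the tool-calling LLM, tool execution, and a high-recall summarizer; the control flow, not the specific helpers, is the point.

```python
# Reflection-based context distillation: only reflections flow back to the
# tool caller; raw sources are held aside for the final write-up.
def run_research(task, call_model, run_tool, summarize, max_iterations=8):
    reflections = []   # distilled insights fed back to the tool caller
    raw_sources = []   # full tool outputs, kept out of context until generation

    for _ in range(max_iterations):
        # The tool caller sees the task plus reflections, never raw web dumps.
        action = call_model(task=task, context=reflections)
        if action["type"] == "finish":
            break
        output = run_tool(action)                # e.g. a web search or page fetch
        raw_sources.extend(output["results"])    # persist raw content outside the context
        reflections.append(summarize(output))    # compress into a short reflection

    # Only at generation time do the raw sources re-enter the context,
    # so the final report is written without information loss.
    return call_model(task=task, context=raw_sources)
```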



Doing More with Less

This approach differs from traditional context structuring in a ReAct-based agent architecture. Typically, tool calls and outputs are propagated through the tool-calling loop, with previously retrieved and generated tokens persisted in the context window on each subsequent iteration. This pattern can be seen in LangChain’s Open Deep Research agent implementation, and from a token-consumption perspective, it can be modeled by the following quadratic series, where n is the number of tokens added to the context on each tool-calling iteration and m is the number of tool-calling iterations.

$$n + 2n + 3n + \cdots + mn = n \cdot \frac{m(m+1)}{2}$$


In contrast, our proposed approach to context engineering removes this token propagation (since the knowledge distillations, even when aggregated, are negligible compared to the volume of tokens gathered from the web) and can be modeled by the following linear series.

$$n + n + n + \cdots + n = n \cdot m$$


When comparing the two approaches, tokens are saved on a per-agent basis by a factor of $\frac{m+1}{2}$.
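To make the factor concrete, here is a quick numeric check of the two series with hypothetical values of n = 4,000 tokens per iteration and m = 9 iterations (and ignoring the reflection overhead, which the model above treats as negligible):

```python
# Compare token consumption of the propagated vs. distilled context strategies.
n, m = 4_000, 9

propagated = sum(i * n for i in range(1, m + 1))  # n + 2n + ... + mn
distilled = n * m                                  # n + n + ... + n

print(propagated)              # 180000 tokens -> n * m(m+1)/2
print(distilled)               # 36000 tokens  -> n * m
print(propagated / distilled)  # 5.0 -> the (m+1)/2 factor
```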

Through this approach, we were able to reduce token consumption by 66% (compared to Open Deep Research) while achieving SOTA on DeepResearch Bench – the intersection of quality and efficiency in full effect.





Productionizing Agents — an Ongoing Challenge

Building production-grade agents is a balancing act. We leaned into autonomy to maximize performance and quality, while still meeting strict requirements for latency, cost, and reliability.



Engineering with Non-Determinism

LLMs are inherently non-deterministic, and we found that giving them guard-railed freedom to reason and iterate produces the strongest results. But when autonomy goes wrong, agent behavior can veer off course. Tools can be called incorrectly, LLMs can overfit to a subtopic, and expected reasoning patterns may break. No single safeguard will catch all of these issues.

A shift in engineering mindset is required: treat failure modes as core design considerations, not afterthoughts. Simple guardrails like tool-call retries or model cascades help, but proactively anticipating anomalies and reinforcing proper patterns through prompting and edge-case testing are what enable production-grade, long-running agents.
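As one example of such a guardrail, the sketch below combines tool-call retries with a model cascade. It is not our exact implementation; `invoke` and the model names are hypothetical stand-ins for a single tool-calling step.

```python
# Illustrative guardrail: retry a tool call, then fall back to a stronger model.
import time

def guarded_tool_call(invoke, models=("small-model", "large-model"),
                      retries_per_model=2, backoff_seconds=1.0):
    last_error = None
    for model in models:                      # model cascade: cheap first, strong second
        for attempt in range(retries_per_model):
            try:
                return invoke(model=model)
            except Exception as err:          # malformed tool call, timeout, etc.
                last_error = err
                time.sleep(backoff_seconds * (attempt + 1))
    # The failure mode is a design consideration, not an afterthought:
    # surface it explicitly so the orchestrator can degrade gracefully.
    raise RuntimeError("tool call failed after retries and cascade") from last_error
```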




Optimal Tooling — Less is More

In our experience, it’s better to expose a small, essential toolset to the agent rather than a large, complex one. We were tempted to over-engineer by adding many tools that seemed useful in theory, but in practice this created new failure modes and made it harder for LLMs to consistently select the right tool and iterate effectively.



Evals

We used evals to steer our development process but also recognize their shortcomings. LLM-as-a-judge evals are difficult to trust: current models are non-deterministic, uninterpretable in their reasoning, and can become bottlenecks, especially for long-running agents where a single experiment can take days to complete.

Rather than optimizing for benchmark scores, we optimized for directional feedback. The core question was always: did this change make the agent more reliable and more useful in practice? Evals became a tool for validating that direction, not the optimization goal. Intuition and careful agent-trace monitoring consistently provided higher-signal feedback than any single eval score.

Overall, the best outcome isn’t the highest numerical score. For production systems, improvements like reduced token usage, reliability, lower latency, and fewer failures are more valuable than a one-point bump on an eval.


If you’re interested in experiencing the results of these findings in practice, you can sign up for early access to Tavily Research here.


