In the spring of 2023, the world got excited about the emergence of LLM-based AI agents. Powerful demos like AutoGPT and BabyAGI demonstrated the potential of LLMs running in a loop, selecting the next action, observing its results, and then selecting the next action, one step at a time (an approach often called the ReACT framework). This new approach was expected to power agents that autonomously and generically perform multi-step tasks: give them an objective and a set of tools, and they will take care of the rest. By the end of 2024, the landscape is filled with AI agents and AI agent-building frameworks. But how do they measure up against the promise?
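To make the pattern concrete, here is a minimal sketch of such a loop – purely illustrative, with hypothetical `call_llm` and `tools` stand-ins rather than any particular framework's API:

```python
# A minimal sketch of the ReACT-style loop these demos popularized.
# `call_llm` and `tools` are hypothetical stand-ins, not any specific framework's API.
def react_agent(objective, tools, call_llm, max_steps=10):
    history = []  # interleaved decisions and observations
    for _ in range(max_steps):
        # Ask the LLM to pick the next action given the objective and the history so far.
        decision = call_llm(objective=objective, history=history, tools=list(tools))
        if decision["action"] == "finish":
            return decision["answer"]
        # Execute the chosen tool and feed the observation back into the loop.
        observation = tools[decision["action"]](decision["input"])
        history.append((decision, observation))
    return None  # step budget exhausted without reaching the objective
```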
It's safe to say that agents powered by the naive ReACT framework suffer from severe limitations. Give them a task that requires more than a few steps, using more than a few tools, and they will fail miserably. Beyond their obvious latency issues, they will lose track, fail to follow instructions, stop too early or too late, and produce wildly different results on each attempt. And it's no wonder: the ReACT framework takes the limitations of unpredictable LLMs and compounds them with the number of steps. However, agent builders looking to solve real-world use cases, especially in the enterprise, cannot make do with that level of performance. They need reliable, predictable, and explainable results for complex multi-step workflows. And they need AI systems that mitigate, rather than exacerbate, the unpredictable nature of LLMs.
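The compounding is easy to see with illustrative (made-up) numbers: even a step that succeeds 95% of the time drags a long loop down quickly.

```python
# Illustrative numbers only: per-step reliability compounds over the length of the loop.
per_step_reliability = 0.95
for steps in (5, 10, 20):
    print(steps, "steps ->", round(per_step_reliability ** steps, 2))
# 5 steps -> 0.77, 10 steps -> 0.6, 20 steps -> 0.36
```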
So how are agents built in the enterprise today? For use cases that require more than a few tools and more than a few steps (e.g., conversational RAG), agent builders have largely abandoned the dynamic and autonomous promise of ReACT for methods that rely heavily on static chaining – the creation of predefined chains designed to solve a specific use case. This approach resembles traditional software engineering and is far from the agentic promise of ReACT. It achieves higher levels of control and reliability but lacks autonomy and flexibility. Solutions are therefore development-intensive, narrow in application, and too rigid to handle high levels of variation in the input space and the environment.
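As a concrete (and hypothetical) illustration, a static chain for conversational RAG might look like the sketch below: the flow is fixed at development time, and the LLM performs atomic steps plus one simple runtime check. All helper names are illustrative stand-ins, not a real library's API.

```python
# A hypothetical static chain for conversational RAG: the flow is fixed at development
# time; the LLM mostly performs atomic steps inside it. All names are illustrative.
def conversational_rag_chain(question, chat_history, llm, retriever):
    # Step 1 (atomic LLM step): rewrite the question into a standalone search query.
    query = llm.rewrite_query(question, chat_history)
    # Step 2: retrieve supporting context from the indexed document corpus.
    context = retriever.search(query, top_k=5)
    # Step 3 (atomic LLM step): draft an answer grounded in the retrieved context.
    answer = llm.generate_answer(question, context)
    # Step 4 (runtime check): if the draft is unsupported, retry once with wider retrieval.
    if not llm.is_grounded(answer, context):
        context = retriever.search(query, top_k=20)
        answer = llm.generate_answer(question, context)
    return answer
```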
To be sure, static chaining practices vary in how "static" they are. Some chains use LLMs only to perform atomic steps (for example, to extract information, summarize text, or draft a message), while others also use LLMs to make some decisions dynamically at runtime (for example, an LLM routing between alternative flows within the chain, or an LLM validating the result of a step to determine whether it should be run again). In any event, as long as LLMs are responsible for any dynamic decision-making in the solution, we are inevitably caught in a tradeoff between reliability and autonomy. The more static a solution is, the more reliable and predictable it is, but also the less autonomous – and therefore the more narrow in application and the more development-intensive. The more dynamic and autonomous a solution is, the more generic and easy to build, but also the less reliable and predictable.
This tradeoff can be represented in the following graphic:
This raises the question: why have we yet to see an agentic framework that can be placed in the upper-right quadrant? Are we doomed to forever trade off reliability for autonomy? Can we not get a framework that offers the simple interface of a ReACT agent (take an objective and a set of tools and figure it out) without sacrificing reliability?
The answer is – we can and we will! But for that, we need to realize that we've been doing it all wrong. All current agent-building frameworks share a common flaw: they rely on LLMs as the dynamic, autonomous component. However, the crucial element we are missing – what we need in order to create agents that are both autonomous and reliable – is planning technology. And LLMs are NOT great planners.
But first, what is "planning"? By "planning" we mean the ability to explicitly model alternative courses of action that lead to a desired outcome, and to efficiently explore and exploit these alternatives under budget constraints. Planning should happen at both the macro and micro levels. A macro-plan breaks a task down into the dependent and independent steps that must be executed to achieve the desired outcome. What is often overlooked is the need for micro-planning aimed at guaranteeing desired outcomes at the step level. There are many available strategies for increasing reliability and achieving guarantees at the single-step level by spending more inference-time compute. For example, you can paraphrase semantic search queries multiple times, retrieve more context per query, use a larger model, or sample more inferences from an LLM – all resulting in more requirement-satisfying results from which to choose the best one. A micro-planner can use inference-time compute efficiently to achieve the best results under a given compute and latency budget, scaling the resource investment as needed for the particular task at hand. That way, planful AI systems can mitigate the probabilistic nature of LLMs and achieve guaranteed outcomes at the step level. Without such guarantees, we are back to the compounding error problem that will undermine even the best macro-level plan.
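A minimal sketch of what step-level micro-planning might look like, under the assumption that each strategy reports its own cost and that a `score` function measures how well a candidate satisfies the step's requirements (all names here are hypothetical):

```python
# A hypothetical micro-planner: spend more inference-time compute (paraphrased queries,
# wider retrieval, bigger models, extra samples) until the step's requirement is met
# or the budget runs out. `strategies` are ordered cheapest-first; each returns
# (candidate_result, cost). All names are illustrative.
def micro_plan_step(query, strategies, score, threshold, budget):
    best, spent = None, 0
    for strategy in strategies:
        if spent >= budget:
            break
        candidate, cost = strategy(query)
        spent += cost
        if best is None or score(candidate) > score(best):
            best = candidate
        if score(best) >= threshold:
            return best  # requirement satisfied at the step level; stop spending
    return best  # budget exhausted; fall back to the best candidate found so far
```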
But why can't LLMs function as planners? After all, they are capable of translating high-level instructions into reasonable chains of thought or plans defined in natural language or code. The reason is that planning requires more than that. Planning requires the ability to model alternative courses of action that may reasonably lead to the desired outcome AND to reason about the expected utility and expected costs (in compute and/or latency) of each alternative. While LLMs can potentially generate representations of available courses of action, they cannot predict their corresponding expected utility and costs. For example, what are the expected utility and costs of using model X vs. model Y to generate an answer for a given context? What is the expected utility of looking for a specific piece of information in the indexed document corpus vs. making an API call to the CRM? Your LLM doesn't begin to have a clue. And for good reason – historical traces of these probabilistic traits are rarely found in the wild and are not included in LLM training data. They also tend to be specific to the particular tool and data environment in which the AI system will operate, unlike the general knowledge that LLMs can acquire. And even if LLMs could predict expected utility and costs, reasoning about them to choose the most effective course of action is a logical, decision-theoretic deduction that cannot be assumed to be reliably performed by an LLM's next-token predictions.
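For contrast, the decision itself is a simple expected-utility calculation once those estimates exist – the hard part is obtaining grounded estimates, which is exactly what LLMs lack. The numbers below are made up purely for illustration:

```python
# Illustrative (made-up) numbers: the decision-theoretic comparison an LLM cannot be
# trusted to perform, since it has no grounded estimates of these probabilities.
alternatives = {
    "model_X": {"p_success": 0.92, "value": 1.0, "cost": 0.30},  # larger, slower model
    "model_Y": {"p_success": 0.80, "value": 1.0, "cost": 0.05},  # smaller, cheaper model
}

def expected_net_utility(a):
    # Expected value of a successful answer minus the compute/latency cost of the attempt.
    return a["p_success"] * a["value"] - a["cost"]

best = max(alternatives, key=lambda name: expected_net_utility(alternatives[name]))
print(best)  # model_Y in this example: 0.75 vs. 0.62
```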
So what are the missing ingredients for AI planning technology? We need planner models that can learn, from experience and simulation, to explicitly model alternative courses of action and their corresponding utility and cost probabilities for a specific task in a specific tool and data environment. We need a Plan Definition Language (PDL) that can be used to represent and reason about those courses of action and probabilities. And we need an execution engine that can deterministically and efficiently execute a given plan defined in PDL.
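Purely as an illustration (the actual form of PDL is an open design question), a plan in such a language might capture alternative courses of action per step together with learned utility and cost estimates:

```python
# Purely illustrative: one possible shape for a plan with alternative courses of action
# and learned utility/cost estimates. Nothing here is a defined PDL syntax.
plan = {
    "goal": "answer_customer_question",
    "steps": [
        {
            "id": "gather_context",
            "requires": [],
            "alternatives": [
                {"action": "search_documents", "p_success": 0.85, "cost": 0.10},
                {"action": "query_crm_api",    "p_success": 0.70, "cost": 0.02},
            ],
        },
        {
            "id": "draft_answer",
            "requires": ["gather_context"],
            "alternatives": [
                {"action": "generate_with_large_model", "p_success": 0.90, "cost": 0.30},
                {"action": "generate_with_small_model", "p_success": 0.75, "cost": 0.05},
            ],
        },
    ],
}
# An execution engine would choose among the alternatives per step based on these
# estimates and the remaining budget, then run the chosen actions deterministically.
```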
Some people are already hard at work on delivering on this promise. Until then, keep building static chains. Just please don't call them "agents".