Dumber LLM Agents Need More Constraints and Better Tools

LlamaIndex Blog

Summary

In this post, we compare how well LLM-powered agents with different degrees of complexity perform on practical data tasks (financial analysis). We compare the performance of agents with more complex, unconstrained interaction behavior (ReAct) against agents with simpler, more constrained interactions (routing). We specifically analyze how much complexity can be added at the agent layer vs. the tool layer.

We find that the choice of language model matters a lot. ReAct agents powered by “dumber” models (tongue-in-cheek, we’re referring to any non-GPT-4 model as “dumb”) struggle to return relevant results over data. We find that constraining agent interaction behavior, and giving agents access to tools that can more explicitly perform complex actions, helps improve query performance with these less sophisticated LLMs. In contrast, more sophisticated models (GPT-4) can more reliably use the ReAct loop to execute a wide range of complex data queries.

This blog post is quite detailed; we offer a lot of experiments and results below. Best of all, you can run it all yourself with our example notebook!

Overview of Agents

Building LLM-powered agents has become increasingly popular in the past few months. Frameworks like LangChain have made it much easier to create these agents according to a set of common abstractions.

At a high level, an “agent” is essentially an automated decision engine that can be used to interact with an external environment. The core agent loop looks something like the following:

  1. The agent has access to a set of “tools”, which are generic functions that it can execute. It has an awareness of each tool through some attached metadata, and it can call each tool (either as a function call or a structured API).
  2. The user feeds a natural language input to the agent.
  3. Given the input, the agent executes a sequence of tool calls in some fashion, and returns the response.
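
To make this concrete, here is a minimal sketch of such a loop in plain Python. Everything in it (the Tool dataclass, the prompt format, the run_react_agent helper) is a hypothetical illustration of the pattern, not a LlamaIndex or LangChain API:

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Tool:
    name: str
    description: str          # metadata the agent reasons over
    fn: Callable[[str], str]  # the generic function the agent can call

def run_react_agent(user_input: str, tools: List[Tool], llm: Callable[[str], str]) -> str:
    """Hypothetical ReAct-style loop: reason -> act -> observe, repeated until done."""
    scratchpad = f"Task: {user_input}\n"
    tool_descriptions = "\n".join(f"- {t.name}: {t.description}" for t in tools)
    for _ in range(5):  # cap iterations so a confused model can't loop forever
        # 1. the LLM reasons over the tool metadata + history and picks the next action
        decision = llm(
            f"Tools:\n{tool_descriptions}\n{scratchpad}"
            "Next action as 'tool_name: input', or 'FINISH: answer':"
        )
        if decision.startswith("FINISH:"):
            return decision[len("FINISH:"):].strip()
        # 2. execute the chosen tool and record the observation
        tool_name, _, tool_input = decision.partition(":")
        tool = next((t for t in tools if t.name == tool_name.strip()), None)
        if tool is None:
            scratchpad += f"Observation: unknown tool '{tool_name}'\n"
            continue
        scratchpad += f"Action: {decision}\nObservation: {tool.fn(tool_input.strip())}\n"
    return "Agent did not finish within the step limit."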

There are a variety of ways to perform step 3:

  • The most popular is probably ReAct: the agent reasons over the next action, constructs an action command, and executes the action. It repeats these steps in an iterative loop until the task is complete.
  • There are other interaction modes too. Recently there was a paper on Plan-and-Solve Prompting, which generates a plan beforehand (to decompose a complex task into simpler ones). Before ReAct there were also related techniques such as Self-Ask and Chain-of-Thought prompting.

“Complex” vs. “Simple” Agent Interaction Techniques

We classify techniques like ReAct as more complex and unconstrained: this is because they perform iterative reasoning and also break the input down into smaller steps. Complex agent interaction loops allow for more freedom of behavior, but place an increased burden on the LLM being used. The pro of complex interaction frameworks is that they can be more general and handle a broader class of queries than simple tools. The con is that if the LLM is not up to par, these frameworks are prone to making mistakes; unconstrained behavior can lead to unexpected results.

On the other end of the spectrum, you can imagine a simple and constrained agent interaction mechanism, where the agent does a one-step selection of the underlying tool to use, and returns the response from the tool. The agent essentially just acts as a router from the query to a Tool. There are no steps to break the query down into smaller ones, and no iterative chain-of-thought loops. The pro here is that the model will likely make fewer errors. The con is that the interaction technique allows for less freedom and imposes more constraints on behavior.
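
Using the same hypothetical Tool type from the sketch above, the routing-only interaction collapses to a single LLM call: pick one tool, run it, and return its output directly:

def run_router_agent(user_input: str, tools: List[Tool], llm: Callable[[str], str]) -> str:
    """Hypothetical one-step router: a single tool selection, no iterative loop."""
    choice = llm(
        "Pick exactly one tool name for this query.\nTools:\n"
        + "\n".join(f"- {t.name}: {t.description}" for t in tools)
        + f"\nQuery: {user_input}"
    )
    # fall back to the first tool if the model's choice doesn't parse
    tool = next((t for t in tools if t.name == choice.strip()), tools[0])
    # the tool's response is returned directly; the LLM never sees it again
    return tool.fn(user_input)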

Investigating Agent Interaction Techniques for Data Querying

We at LlamaIndex are interested in how agents can help augment data tasks. More specifically, we’re interested in how agents can help perform complex user queries over a diverse range of data sources. This includes not only asking questions over a single document, but also being able to synthesize insights across multiple documents and return them to the user.

LlamaIndex query engines can be used as Tools within an agent to query your data (we offer seamless integrations with LangChain). These Tools can vary in complexity. For instance, a simple Tool could be our vector store query engine, which does top-k embedding retrieval from a vector store. A more advanced Tool could be a query engine over our graph data structure, which can be set up to explicitly provide compare/contrast capabilities over any subset of documents. The tool itself can contain “agent-like” decision-making capabilities under the hood. LlamaIndex provides a variety of modules around routing, query decomposition, and multi-step query planning.

In this blog post, we’re interested in comparing the following approaches to designing agents and tools, to see which can provide good answers to different user queries in a robust fashion:

  • more complex and unconstrained agent interaction (ReAct) over a set of simple Tools
  • simpler and more constrained agent interaction (simple routing) over more complex Tools
Complex Agents with Simple Tools, Simple Agents with Complex Tools

Essentially, what we’re interested in is how much complexity can be pushed into the agent interaction layer vs. left in the Tool layer. We explore the following concrete example: let’s say the user query is to compare/contrast two different documents (a relatively complex query). If the set of Tools are all just vector indices over different documents, can the agent interaction loop figure out how to execute that query reliably against the vector indices? Alternatively, if we push the complexity down to the Tool layer, then we can explicitly have a Tool that performs “compare/contrast” over your documents. The burden on the agent is then to simply call this Tool, instead of interacting with a set of other tools in a more complex fashion.

High-Level Findings

The high-level finding mirrors the title: dumber LLM agents need more constraints and better tools. More specifically, we found that a GPT-3-powered agent in a ReAct loop didn’t provide good results on complex queries; it was not able to figure out the right interaction pattern over the provided set of Tools in order to surface the results. Instead, by adding more constraints to the agent’s behavior and providing more sophistication in the Tool itself, we were able to get a GPT-3 agent to produce better results.

We did find that GPT-4 ReAct agents were able to provide better query results than GPT-3 agents when presented with a set of simple Tools over the data. This suggests that more powerful agents may not need as many tools that “explicitly” perform tasks, since much of that logic can be handled in the agent interaction loop.

Setup

Our data consists of three Uber 10-Q filings (quarterly financial reports) from 2022: March, June, and September. We’d like to execute different queries over this data; the majority of these queries involve comparing different bits of information between these documents.

from llama_index import SimpleDirectoryReader

march_2022 = SimpleDirectoryReader(input_files=["../data/10q/uber_10q_march_2022.pdf"]).load_data()
june_2022 = SimpleDirectoryReader(input_files=["../data/10q/uber_10q_june_2022.pdf"]).load_data()
sept_2022 = SimpleDirectoryReader(input_files=["../data/10q/uber_10q_sept_2022.pdf"]).load_data()

We use LlamaIndex to define a vector index over each document, which just stores the document chunks + embeddings in a vector store. We can then query each vector index using a simple QueryEngine. We create a Tool for each of these QueryEngine objects.

from llama_index import GPTVectorStoreIndex

# define indices
march_index = GPTVectorStoreIndex.from_documents(march_2022)
june_index = GPTVectorStoreIndex.from_documents(june_2022)
sept_index = GPTVectorStoreIndex.from_documents(sept_2022)

# define query engines (retrieve the top-3 chunks by embedding similarity)
march_engine = march_index.as_query_engine(similarity_top_k=3)
june_engine = june_index.as_query_engine(similarity_top_k=3)
sept_engine = sept_index.as_query_engine(similarity_top_k=3)
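
As a quick sanity check, each engine can be queried directly before any agent gets involved (the query string here is our own illustrative example):

# directly query one engine to verify retrieval works end-to-end
response = sept_engine.query("What was Uber's revenue for the quarter ending September 2022?")
print(response)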

We also define a ComposableGraph over these three documents; it roughly follows the guide described here. This graph is explicitly set up to perform compare/contrast queries over the three documents.

from llama_index import GPTListIndex
from llama_index.indices.composability import ComposableGraph

graph = ComposableGraph.from_indices(
    GPTListIndex,
    children_indices=[march_index, june_index, sept_index],
    index_summaries=[
        "Provides information about Uber quarterly financials ending March 2022",
        "Provides information about Uber quarterly financials ending June 2022",
        "Provides information about Uber quarterly financials ending September 2022",
    ],
)

The graph can be queried with a ComposableGraphQueryEngine:

# import paths follow the llama_index 0.6-era API
from llama_index.indices.query.query_transform.base import DecomposeQueryTransform
from llama_index.query_engine import TransformQueryEngine

# define decompose_transform (breaks a graph query into per-index subqueries)
decompose_transform = DecomposeQueryTransform(verbose=True)

# define custom query engines
custom_query_engines = {}
for index in [march_index, june_index, sept_index]:
    query_engine = index.as_query_engine(service_context=service_context)
    query_engine = TransformQueryEngine(
        query_engine,
        query_transform=decompose_transform,
        transform_extra_info={'index_summary': index.index_struct.summary},
    )
    custom_query_engines[index.index_id] = query_engine

custom_query_engines[graph.root_id] = graph.root_index.as_query_engine(
    service_context=service_context,
    streaming=True,
)

# define graph query engine
g_engine = graph.as_query_engine(
    custom_query_engines=custom_query_engines
)
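
Note that these snippets reference a service_context that is not defined in the excerpts above. A minimal sketch of how it might be constructed with the llama_index API of that era (the specific model name is our assumption):

from langchain.llms import OpenAI
from llama_index import LLMPredictor, ServiceContext

# assumption: a GPT-3-class model; for the GPT-4 experiments you would swap in
# a GPT-4 chat model (e.g. langchain.chat_models.ChatOpenAI) instead
llm = OpenAI(temperature=0, model_name="text-davinci-003")
service_context = ServiceContext.from_defaults(
    llm_predictor=LLMPredictor(llm=llm)
)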

We try the following agent setups:

  • A zero-shot GPT-3 ReAct agent with three Tools, where each Tool corresponds to the vector index over a 10-Q filing.
  • Same as above, but using GPT-4 instead.
  • A simple router “agent” with four Tools: the three Tools listed above + the ComposableGraphQueryEngine explicitly set up to perform compare/contrast queries.

The code snippets for initializing these agents are below. For the simple router agent, we use the native RouterQueryEngine within LlamaIndex, though you should also be able to achieve similar results in LangChain through either the zero-shot agent (with tweaked settings) or the router chain.

# initializing the zero-shot ReAct agent
# (import path follows the llama_index 0.6-era API)
from llama_index.langchain_helpers.agents import (
    IndexToolConfig,
    LlamaToolkit,
    create_llama_agent,
)

uber_config_sept = IndexToolConfig(
    query_engine=sept_engine,
    name=f"Uber 10Q September 2022",
    description=f"Provides information about Uber quarterly financials ending September 2022",
    tool_kwargs={"return_direct": False},
)
uber_config_june = IndexToolConfig(
    query_engine=june_engine,
    name=f"Uber 10Q June 2022",
    description=f"Provides information about Uber quarterly financials ending June 2022",
    tool_kwargs={"return_direct": False},
)
uber_config_march = IndexToolConfig(
    query_engine=march_engine,
    name=f"Uber 10Q March 2022",
    description=f"Provides information about Uber quarterly financials ending March 2022",
    tool_kwargs={"return_direct": False},
)

toolkit = LlamaToolkit(
    index_configs=[uber_config_sept, uber_config_june, uber_config_march],
)

# this is a lightweight wrapper around `initialize_agent` in langchain
# (which defaults to the zero-shot ReAct agent)
agent_chain = create_llama_agent(
    toolkit,
    llm,  # can be GPT-3 or GPT-4
    verbose=True,
)

Simple Router Agent Setup


# import paths follow the llama_index 0.6-era API
from llama_index.query_engine.router_query_engine import RouterQueryEngine
from llama_index.selectors.llm_selectors import LLMSingleSelector
from llama_index.tools.query_engine import QueryEngineTool

query_tool_sept = QueryEngineTool.from_defaults(
    query_engine=sept_engine,
    description=f"Provides information about Uber quarterly financials ending September 2022",
)
query_tool_june = QueryEngineTool.from_defaults(
    query_engine=june_engine,
    description=f"Provides information about Uber quarterly financials ending June 2022",
)
query_tool_march = QueryEngineTool.from_defaults(
    query_engine=march_engine,
    description=f"Provides information about Uber quarterly financials ending March 2022",
)
query_tool_graph = QueryEngineTool.from_defaults(
    query_engine=g_engine,
    description=f"Provides comparisons between Uber financials across quarters in 2022. Can be used to answer "
    "any questions that require analysis across multiple quarters.",
)

# our "router" query engine is effectively a simple agent that can only perform routing
query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        query_tool_sept,
        query_tool_june,
        query_tool_march,
        query_tool_graph,
    ],
)
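
As noted earlier, you should be able to approximate this router behavior in LangChain as well. A rough, untested sketch using the langchain 0.0.x-era API, where return_direct=True makes the agent return the chosen Tool's output directly:

from langchain.agents import initialize_agent, Tool

# wrap each query engine in a LangChain Tool; return_direct=True short-circuits
# the ReAct loop after one tool call, approximating a one-step router
lc_tools = [
    Tool(
        name="Uber 10Q September 2022",
        func=lambda q: str(sept_engine.query(q)),
        description="Provides information about Uber quarterly financials ending September 2022",
        return_direct=True,
    ),
    # ... analogous Tools for the June/March engines and the graph engine
]
router_agent = initialize_agent(
    lc_tools, llm, agent="zero-shot-react-description", verbose=True
)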

Now that we’ve described the setup, let’s take a look at the results below!

Findings and Experiments

At a high level, we find that using GPT-3 in ReAct agents produces suboptimal results on these queries. The agents tend to exhibit the following characteristics:

  • The set of tools chosen can differ even when the questions are semantically similar, leading to variability in the responses.
  • Oftentimes we expect a given query to make use of all three 10-Q filings, but only a subset of them is picked.
  • Sometimes the agent uses tools during the CoT process that are irrelevant to the query.

In contrast, we find that GPT-4 ReAct agents provide answers that are more relevant and predictable, and exhibit fewer errors in intermediate results.

Finally, we find that using a simpler routing-only GPT-3 agent with access to an explicit “compare/contrast” tool allows the agent to perform better.

As a reminder, full results are in the notebook: https://colab.research.google.com/drive/1uP38k4nr8OPmXbY4dLoKKQW0F29WtNuY?usp=sharing

GPT-3 ReAct Agent Results

agent_chain.run(input="Analyze Uber revenue growth over the last few quarters")

Response: (see the notebook for the full output)

We see that only the September 10-Q filing is chosen to answer the query. The September 10-Q does contain some information about revenue growth compared to the same period in 2021, but that doesn’t explicitly answer the question, which is about revenue growth over the past few quarters.

agent_chain.run(input="Analyze changes in risk factors for Uber")

Response: (see the notebook for the full output)

The September and June 10-Q filings are chosen, but not March. Furthermore, the answer is vague and doesn’t provide much detail on concrete risk factors for Uber (and it also mentions that the risk factors “have changed over the past three quarters” even though it only used two Tools).

With this query, we more explicitly showcase how slight changes in prompts can induce different chain-of-thought paths through different Tools, and as a result produce different answers.

# Prompt variation 1
agent_chain.run(input="Analyze Uber revenue growth and risk factors over time")

Response: (see the notebook for the full output)

# Prompt variation 2
agent_chain.run(input="Analyze Uber revenue growth and risk factors over quarters")

The main difference between these two queries is “over time” versus “over quarters.” As we can see, not only are the chosen Tools different between the two variations, but the Tool inputs are different as well: in the first it’s “financials”, and in the second it’s “Revenue growth and risk factors.”

Because the Tool input in the first variant is unrelated to the query, the answer is similarly vague: “Uber’s revenue growth and risk factors can be analyzed by comparing the financials…”

Here, instead of asking a compare/contrast query, let’s just ask a question about a single filing.

agent_chain.run(input="How much cash did Uber have in sept 2022?")

We see that the agent makes two errors: 1) it is not able to provide an action input to each Tool, and 2) it ends up searching through the June and March filings, which are irrelevant to the query.

GPT-4 ReAct Agent Results

GPT-4 ReAct agents perform a lot better than GPT-3 agents. They comprehensively go through the set of available Tools, and provide much more detailed commentary extraction and response synthesis.

We won’t go through all of the examples here, but they can be found in the example notebook!

agent_chain_gpt4.run(input="Analyze Uber revenue growth over the last few quarters")

Response: (see the notebook for the full output)

Unlike the GPT-3 agent, here the GPT-4 agent at least goes through every filing and synthesizes the results.

agent_chain_gpt4.run(input="Analyze changes in risk factors for Uber")

Response: (see the notebook for the full output)

Here the GPT-4 agent still only looks at September and June (and skips March), but the response is much more detailed and references concrete facts within the report.

GPT-3 Router Agent Results

Reminder: the router agent doesn’t do any CoT, and it has additional access to our ComposableGraph query engine, which can explicitly perform compare/contrast queries. Let’s take a look at how this agent does.

response = query_engine.query("Analyze Uber revenue growth over the last few quarters")

Response (intermediate steps):

Selecting query engine 3: Provides comparisons between Uber financials across quarters in 2022. Can be used to answer any questions that require analysis across multiple quarters..
> Current query: Analyze Uber revenue growth over the last few quarters
> New query: What was Uber's revenue growth from the last quarter ending March 2022 compared to the previous quarter?
> Current query: Analyze Uber revenue growth over the last few quarters
> New query: What was Uber's revenue growth from the last quarter ending June 2022 compared to the previous quarter?
> Current query: Analyze Uber revenue growth over the last few quarters
> New query: What was Uber's revenue growth from the last few quarters ending September 2022?

Final Response

Based on the limited context information provided, we can only analyze Uber's revenue growth for two specific quarters:

1. For the quarter ending March 2022, Uber's revenue growth increased compared to the first quarter of 2021. However, the exact percentage of growth is not provided.

2. For the quarter ending September 2022, Uber's revenue growth was 72%.

Unfortunately, we do not have enough data to analyze the revenue growth for the quarter ending June 2022 or to compare it to the previous quarter. To provide a more comprehensive analysis, additional data on Uber's revenue growth for other quarters would be needed.

To be fair, this response doesn’t cover the quarter ending June. But the fault for that lies with the implementation of the Tool itself rather than with the agent interaction step. The agent’s only job is to pick the right Tool to solve the task at hand, and it does that job correctly: it chooses the Tool corresponding to our ComposableGraphQueryEngine abstraction (which can perform compare/contrast queries).

response = query_engine.query("Analyze changes in risk factors for Uber")

Response (intermediate steps):

INFO:llama_index.query_engine.router_query_engine:Selecting query engine 3: Provides comparisons between Uber financials across quarters in 2022. Can be used to answer any questions that require analysis across multiple quarters..
Selecting query engine 3: Provides comparisons between Uber financials across quarters in 2022. Can be used to answer any questions that require analysis across multiple quarters..
> Current query: Analyze changes in risk factors for Uber
> New query: What are the risk factors for Uber in the quarter ending March 2022?
> Current query: Analyze changes in risk factors for Uber
> New query: What are the risk factors for Uber in the quarter ending June 2022?
> Current query: Analyze changes in risk factors for Uber
> New query: What are the risk factors for Uber in the quarter ending September 2022?

Final Response

From the context information provided, we can analyze the changes in risk factors for Uber across the three quarters as follows:

1. COVID-19 pandemic: The ongoing pandemic remains a constant risk factor across all quarters, with its impact on driver and courier supply constraints, consumer demand, and government restrictions. However, the specific concerns related to the pandemic may evolve, such as the emergence of new virus variants or changes in vaccination rates.

2. Driver and courier supply constraints: This risk factor is present in the quarter ending March 2022 but is not explicitly mentioned in the subsequent quarters. However, it may still be an underlying concern, especially if the pandemic continues to affect driver availability.

3. Changes in consumer demand: This risk factor is mentioned in the quarter ending March 2022, with an imbalance between Mobility and Delivery services. It is not explicitly mentioned in the subsequent quarters, but fluctuations in revenue due to changes in Gross Bookings and Mobility Trip volumes in the quarter ending June 2022 could be related to changes in consumer demand.

4. Government restrictions and regulations: This risk factor is present in the quarter ending March 2022, with the impact of restrictions on Mobility offerings. In the quarter ending June 2022, the focus shifts to the potential impact of regulatory changes and legal claims, such as driver classification in the UK. In the quarter ending September 2022, the risk factor of driver classification as employees, workers, or quasi-employees becomes more prominent.

5. Uncertainty around future developments: This risk factor is mentioned in the quarter ending March 2022 but is not explicitly mentioned in the subsequent quarters. However, it may still be an underlying concern, especially with the ongoing impact of COVID-19 and potential permanent changes to end-users' behavior.

6. Accounting pronouncements: This risk factor is introduced in the quarter ending June 2022, with the adoption and impact of recently issued accounting pronouncements potentially affecting Uber's consolidated financial statements.

7. Integration and performance of acquired businesses: This risk factor is introduced in the quarter ending June 2022, with the acquisition of Transplace contributing to the increase in Freight revenue.

8. Competition: This risk factor is introduced in the quarter ending September 2022, highlighting the highly competitive nature of the mobility, delivery, and logistics industries, and the need to lower fares, offer incentives, and run promotions to remain competitive.

9. Significant losses and increasing operating expenses: This risk factor is introduced in the quarter ending September 2022, emphasizing the financial challenges Uber faces in achieving profitability.

The job of the agent is once more simple: it chooses the graph query engine that is optimized to perform “compare/contrast” queries. The result is much more detailed than the result obtained from the GPT-3 ReAct agent (and is comparable to that of the GPT-4 ReAct agent). It contains a detailed analysis of all risk factors in bullet points, and it offers a comparison across time for each bullet point.

response = query_engine.query("Analyze Uber revenue growth and risk factors over the last few quarters")

Response (intermediate steps):

> Current query: Analyze Uber revenue growth and risk factors over quarters
> New query: What is Uber's revenue growth and risk factors for the quarter ending March 2022?
> Current query: Analyze Uber revenue growth and risk factors over quarters
> New query: What is Uber's revenue growth and risk factors for the quarter ending March 2022?
> Current query: Analyze Uber revenue growth and risk factors over quarters
> New query: What is Uber's revenue growth and risk factors for the quarter ending June 2022?
> Current query: Analyze Uber revenue growth and risk factors over quarters
> New query: What is Uber's revenue growth and risk factors for the quarter ending June 2022?
> Current query: Analyze Uber revenue growth and risk factors over quarters
> New query: What were Uber's total revenues for the quarter ending September 2022?
> Current query: Analyze Uber revenue growth and risk factors over quarters
> New query: What were Uber's total revenues for the quarter ending September 2022?

Final Response

Uber's revenue growth has been impressive over the last few quarters, with the quarter ending March 2022 showing 136% year-over-year growth and the quarter ending June 2022 showing 105% year-over-year growth. The revenue for the quarter ending June 2022 was $8,343 million.

The risk factors for Uber's business have been largely related to the ongoing impacts of the COVID-19 pandemic, including reduced global demand for Mobility rides, supply constraints, and potential permanent changes to end-user behavior. These risk factors have been present in both the quarter ending March 2022 and the quarter ending June 2022, though the specific risks have varied slightly.

Concluding Thoughts

ReAct-based agents offer a powerful, general reasoning loop and have the potential to solve complex tasks over your data. But they tend to work reliably only with more powerful language models such as GPT-4. Less sophisticated models (e.g. GPT-3) make more unpredictable and erroneous decisions, leading to subpar query performance over your data sources.

Agents implemented with “dumber” models need more interaction constraints in order to make reliable, less erroneous decisions. We find that if we explicitly constrain the agent interface and push the complexity down to the Tool layer, we can still create agents that offer good performance over your data.

Of course, this is just an initial analysis, and there are a few caveats/limitations:

  • You may be able to “prompt hack” the default ReAct loop to get more consistent results; we didn’t try that.
  • We only tested this over a set of three financial documents. A lot more work needs to be done to test this on thousands of docs.
  • We only compared GPT-3 and GPT-4; there are many more models to test/benchmark, e.g. ChatGPT, open-source models, Anthropic’s Claude, etc.
  • We didn’t test other agent interaction patterns besides ReAct: “plan-and-solve” agents (though we do have similar formulations in LlamaIndex), AutoGPT-like task management, and more.

Whether you’ve run into similar findings or you disagree with our analysis, let us know! We’d love to facilitate this discussion on our Discord.

Notebook Walkthrough

You can find the full notebook walkthrough here: https://colab.research.google.com/drive/1uP38k4nr8OPmXbY4dLoKKQW0F29WtNuY?usp=sharing
