We’re releasing Transformers Agents 2.0!
⇒ 🎁 On top of our existing agent type, we introduce two new agents that can iterate based on past observations to solve complex tasks.
⇒ 💡 We aim for the code to be clear and modular, and for common attributes like the final prompt and tools to be transparent.
⇒ 🤝 We add sharing options to boost community agents.
⇒ 💪 An extremely performant new agent framework, allowing a Llama-3-70B-Instruct agent to outperform GPT-4-based agents on the GAIA Leaderboard!
🚀 Go try it out and climb ever higher on the GAIA leaderboard!
transformers.agents has now been upgraded to the stand-alone library smolagents! The two libraries have very similar APIs, so switching is straightforward.
Go check out the smolagents introduction blog here.
Table of Contents
What’s an agent?
Large Language Models (LLMs) can tackle a wide range of tasks, but they often struggle with specific tasks like logic, calculation, and search. When prompted in these domains in which they do not perform well, they frequently fail to generate a correct answer.
One approach to overcome this weakness is to create an agent, which is just a program driven by an LLM. The agent is empowered by tools to help it perform actions. Thus, when the agent needs a specific skill during problem-solving, it can simply rely on an appropriate tool from its toolbox.
Experimentally, agent frameworks generally work very well, achieving state-of-the-art performance on several benchmarks. For instance, take a look at the top submissions for HumanEval: they are agent systems.
The Transformers Agents approach
Building agent workflows is complex, and we feel these systems need a lot of clarity and modularity. We launched Transformers Agents one year ago, and we're doubling down on our core design goals.
Our framework strives for:
- Clarity through simplicity: we reduce abstractions to the minimum. Simple error logs and accessible attributes let you easily inspect what's happening and give you more clarity.
- Modularity: we prefer to propose building blocks rather than full, complex feature sets. You are free to choose whatever building blocks are best for your project.
- For instance, since any agent system is just a vehicle powered by an LLM engine, we decided to conceptually separate the two, which lets you create any agent type from any underlying LLM.
On top of that, we have sharing features that let you build on the shoulders of giants!
Main elements
- Tool: this is the class that lets you use a tool or implement a new one. It is composed mainly of a callable forward method that executes the tool action, and a set of a few essential attributes: name, description, inputs and output_type. These attributes are used to dynamically generate a usage manual for the tool and insert it into the LLM's prompt (see the minimal sketch after this list).
- Toolbox: it's a set of tools that are provided to an agent as resources to solve a particular task. For performance reasons, tools in a toolbox are already instantiated and ready to go. This is because some tools take time to initialize, so it's usually better to re-use an existing toolbox and just swap one tool, rather than re-building a set of tools from scratch at each agent initialization.
- CodeAgent: a very simple agent that generates its actions as one single blob of Python code. It will not be able to iterate on previous observations.
- ReactAgent: ReAct agents follow a cycle of Thought ⇒ Action ⇒ Observation until they have solved the task. We propose two classes of ReactAgent: ReactCodeAgent generates its actions as Python blobs, while ReactJsonAgent generates its actions as JSON blobs.
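As a quick illustration of the Tool interface described above, here is a minimal sketch of a custom tool; the word_counter name and its logic are made up for this example, but the attributes follow the same pattern as the tools used later in this post:

from transformers.agents import Tool

class WordCounterTool(Tool):
    name = "word_counter"
    description = "Counts the number of words in the provided text."
    inputs = {
        "text": {
            "type": "text",
            "description": "The text whose words should be counted.",
        }
    }
    output_type = "text"

    def forward(self, text: str) -> str:
        # The forward method implements the tool action and returns its result as text.
        return str(len(text.split()))

The name, description and inputs attributes are exactly what gets turned into the usage manual injected into the LLM's prompt.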
Check out the documentation to learn how to use each component!
How do agents work under the hood?
In essence, what an agent does is "allowing an LLM to use tools". Agents have a key agent.run() method that:
- Provides information about tool usage to your LLM in a specific prompt. This way, the LLM can select tools to run to solve the task.
- Parses the tool calls from the LLM output (can be via code, JSON format, or any other format).
- Executes the calls.
- If the agent is designed to iterate on previous outputs, it keeps a memory with previous tool calls and observations. This memory can be more or less fine-grained depending on how long-term you want it to be; the sketch below illustrates the loop.
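Schematically, this loop looks something like the sketch below. It is simplified pseudocode for illustration only, not the actual transformers.agents implementation; the message format and the JSON blob convention are assumptions matching the ReactJsonAgent example later in this post:

import json

def run_agent(task, llm_engine, tools, system_prompt, max_iterations=10):
    # The system prompt contains the usage manual generated from the tools' attributes.
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": task},
    ]
    for _ in range(max_iterations):
        llm_output = llm_engine(messages)                    # 1. let the LLM pick a tool
        tool_call = json.loads(llm_output)                   # 2. parse the tool call blob
        name, arguments = tool_call["action"], tool_call["action_input"]
        if name == "final_answer":                           # the agent decided it is done
            return arguments
        observation = tools[name](**arguments)               # 3. execute the call
        # 4. store the call and its observation so the next iteration can use them
        messages.append({"role": "assistant", "content": llm_output})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "Reached max iterations without a final answer."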
For more general context about agents, you can read this excellent blog post by Lilian Weng or our earlier blog post about building agents with LangChain.
To take a deeper dive into our package, go take a look at the agents documentation.
Example use cases
To get early access to this feature, please first install transformers from its main branch:
pip install "git+https://github.com/huggingface/transformers.git#egg=transformers[agents]"
Agents 2.0 will be released in version v4.41.0, landing mid-May.
Self-correcting Retrieval-Augmented-Generation
Quick definition: Retrieval-Augmented Generation (RAG) is "using an LLM to answer a user query, but basing the answer on information retrieved from a knowledge base". It has many advantages over using a vanilla or fine-tuned LLM: to name a few, it allows grounding the answer on true facts and reducing confabulations, it allows providing the LLM with domain-specific knowledge, and it allows fine-grained control of access to information from the knowledge base.
Let's say we want to perform RAG, and some parameters must be dynamically generated. For example, depending on the user query we might want to restrict the search to specific subsets of the knowledge base, or we might want to adjust the number of documents retrieved. The difficulty is: how to dynamically adjust these parameters based on the user query?
Well, we can do this by giving our agent access to these parameters!
Let's set up this system.
Run the line below to install the required dependencies:
pip install langchain sentence-transformers faiss-cpu
We first load a knowledge base on which we want to perform RAG: this dataset is a compilation of the documentation pages for many Hugging Face packages, stored as markdown.
import datasets
knowledge_base = datasets.load_dataset("m-ric/huggingface_doc", split="train")
Now we prepare the knowledge base by processing the dataset and storing it into a vector database to be used by the retriever. We are going to use LangChain, since it features excellent utilities for vector databases:
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
source_docs = [
Document(
page_content=doc["text"], metadata={"source": doc["source"].split("https://huggingface.co/")[1]}
) for doc in knowledge_base
]
docs_processed = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(source_docs)[:1000]
embedding_model = HuggingFaceEmbeddings(model_name="thenlper/gte-small")
vectordb = FAISS.from_documents(
documents=docs_processed,
embedding=embedding_model
)
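As an optional sanity check before going further, you can run a plain similarity search against the store (the query here is arbitrary):

# Optional sanity check: retrieve the closest chunks for an arbitrary query.
for doc in vectordb.similarity_search("How can I load a dataset?", k=2):
    print(doc.metadata["source"], "->", doc.page_content[:100])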
Now that we have the database ready, let's build a RAG system that answers user queries based on it!
We want our system to select only from the most relevant sources of information, depending on the query.
Our documentation pages come from the following sources:
>>> all_sources = list(set([doc.metadata["source"] for doc in docs_processed]))
>>> print(all_sources)
['blog', 'optimum', 'datasets-server', 'datasets', 'transformers', 'course',
'gradio', 'diffusers', 'evaluate', 'deep-rl-class', 'peft',
'hf-endpoints-documentation', 'pytorch-image-models', 'hub-docs']
How can we select the relevant sources based on the user query?
👉 Let us build our RAG system as an agent that will be free to choose its sources!
We create a retriever tool that the agent can call with the parameters of its choice:
import json
from transformers.agents import Tool
from langchain_core.vectorstores import VectorStore
class RetrieverTool(Tool):
name = "retriever"
description = "Retrieves some documents from the knowledge base which have the closest embeddings to the input query."
inputs = {
"query": {
"type": "text",
"description": "The query to perform. This needs to be semantically near your goal documents. Use the affirmative form relatively than an issue.",
},
"source": {
"type": "text",
"description": ""
},
}
output_type = "text"
def __init__(self, vectordb: VectorStore, all_sources: str, **kwargs):
super().__init__(**kwargs)
self.vectordb = vectordb
self.inputs["source"]["description"] = (
f"The source of the documents to look, as a str representation of an inventory. Possible values within the list are: {all_sources}. If this argument shouldn't be provided, all sources can be searched."
)
def forward(self, query: str, source: str = None) -> str:
assert isinstance(query, str), "Your search query have to be a string"
if source:
if isinstance(source, str) and "[" not in str(source):
source = [source]
source = json.loads(str(source).replace("'", '"'))
docs = self.vectordb.similarity_search(query, filter=({"source": source} if source else None), k=3)
if len(docs) == 0:
return "No documents found with this filtering. Try removing the source filter."
return "Retrieved documents:nn" + "n===Document===n".join(
[doc.page_content for doc in docs]
)
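Before handing the tool to an agent, you can sanity-check it by calling its forward method directly (an optional snippet; the query is arbitrary):

# Optional: call the tool directly to inspect its raw output.
retriever_tool = RetrieverTool(vectordb, all_sources)
print(retriever_tool.forward(query="How to load a dataset"))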
Now it’s straightforward to create an agent that leverages this tool!
The agent will need these arguments upon initialization:
- tools: a list of tools that the agent will be able to call.
- llm_engine: the LLM that powers the agent.
Our llm_engine must be a callable that takes as input a list of messages and returns text. It also needs to accept a stop_sequences argument that indicates when to stop generating. For convenience, we directly use the HfEngine class provided in the package to get an LLM engine that calls our Inference API.
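If you prefer to bring your own engine, the interface is easy to satisfy. Here is a minimal sketch, assuming you route the calls through huggingface_hub's InferenceClient (the built-in HfEngine used below already does the equivalent for you):

from typing import Dict, List, Optional
from huggingface_hub import InferenceClient

client = InferenceClient(model="meta-llama/Meta-Llama-3-70B-Instruct")

def custom_llm_engine(messages: List[Dict[str, str]], stop_sequences: Optional[List[str]] = None) -> str:
    # The agent passes a list of {"role": ..., "content": ...} dicts and expects plain text back.
    response = client.chat_completion(messages, stop=stop_sequences, max_tokens=1500)
    return response.choices[0].message.content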
from transformers.agents import HfEngine, ReactJsonAgent
llm_engine = HfEngine("meta-llama/Meta-Llama-3-70B-Instruct")
agent = ReactJsonAgent(
tools=[RetrieverTool(vectordb, all_sources)],
llm_engine=llm_engine
)
agent_output = agent.run("Please show me a LORA finetuning script")
print("Final output:")
print(agent_output)
Since we initialized the agent as a ReactJsonAgent, it has been automatically given a default system prompt that tells the LLM engine to proceed step by step and generate tool calls as JSON blobs (you can replace this prompt template with your own as needed).
Then, when its .run() method is launched, the agent takes care of calling the LLM engine, parsing the tool call JSON blobs and executing these tool calls, all in a loop that ends only when the final answer is provided.
And we get the following output:
Calling tool: retriever with arguments: {'query': 'LORA finetuning script', 'source': "['transformers', 'datasets-server', 'datasets']"}
Calling tool: retriever with arguments: {'query': 'LORA finetuning script'}
Calling tool: retriever with arguments: {'query': 'LORA finetuning script example', 'source': "['transformers', 'datasets-server', 'datasets']"}
Calling tool: retriever with arguments: {'query': 'LORA finetuning script example'}
Calling tool: final_answer with arguments: {'answer': 'Here is an example of a LORA finetuning script: https://github.com/huggingface/diffusers/blob/dd9a5caf61f04d11c0fa9f3947b69ab0010c9a0f/examples/text_to_image/train_text_to_image_lora.py#L371'}
Final output:
Here is an example of a LORA finetuning script: https://github.com/huggingface/diffusers/blob/dd9a5caf61f04d11c0fa9f3947b69ab0010c9a0f/examples/text_to_image/train_text_to_image_lora.py#L371
We can see the self-correction in action: the agent first tried to restrict sources, but due to the lack of corresponding documents it ended up not restricting sources at all.
We can verify that by inspecting the LLM output in the logs for step 2: print(agent.logs[2]['llm_output'])
Thought: I will try to retrieve some documents related to LORA finetuning scripts from the entire knowledge base, without any source filtering.
Action:
{
"motion": "retriever",
"action_input": {"query": "LORA finetuning script"}
}
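To inspect the full trace rather than a single step, you could loop over the logs; a small optional snippet, assuming each entry is a dict like the one accessed above:

# Optional: dump the LLM output of every step of the run.
for i, step in enumerate(agent.logs):
    if isinstance(step, dict) and "llm_output" in step:
        print(f"===== Step {i} =====")
        print(step["llm_output"])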
Using a simple multi-agent setup 🤝 for efficient web browsing
In this example, we want to build an agent and test it on the GAIA benchmark (Mialon et al. 2023). GAIA is an extremely difficult benchmark, with most questions requiring several steps of reasoning using different tools. A specifically difficult requirement is to have a powerful web browser, able to navigate to pages with specific constraints: discovering pages using the website's inner navigation, selecting specific articles in time…
Web browsing requires diving deeper into subpages and scrolling through lots of text tokens that will not be necessary for the higher-level task-solving. So we assign the web-browsing sub-tasks to a specialized web surfer agent, and provide it with some tools to browse the web and a specific prompt.
Defining these tools is outside the scope of this post, but you can check the repository to find the specific implementations.
from transformers.agents import ReactJsonAgent, HfEngine
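# The web tools below (SearchInformationTool, VisitTool, etc.) are defined in the
# example repository mentioned above; they are not part of transformers.agents itself.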
WEB_TOOLS = [
SearchInformationTool(),
NavigationalSearchTool(),
VisitTool(),
DownloadTool(),
PageUpTool(),
PageDownTool(),
FinderTool(),
FindNextTool(),
]
websurfer_llm_engine = HfEngine(
model="CohereForAI/c4ai-command-r-plus"
)
websurfer_agent = ReactJsonAgent(
tools=WEB_TOOLS,
llm_engine=websurfer_llm_engine,
)
To allow this agent to be called by a higher-level task-solving agent, we can simply encapsulate it in another tool:
class SearchTool(Tool):
name = "ask_search_agent"
description = "A search agent that can browse the web to reply an issue. Use it to assemble informations, not for problem-solving."
inputs = {
"query": {
"description": "Your query, as a natural language sentence. You're talking to an agent, so provide them with as much context as possible.",
"type": "text",
}
}
output_type = "text"
def forward(self, query: str) -> str:
return websurfer_agent.run(query)
Then we initialize the task-solving agent with this search tool:
from transformers.agents import ReactCodeAgent
llm_engine = HfEngine(model="meta-llama/Meta-Llama-3-70B-Instruct")
react_agent_hf = ReactCodeAgent(
tools=[SearchTool()],
llm_engine=llm_engine,
)
Let's run the agent with the following task:
Use density measures from the chemistry materials licensed by Marisa Alviar-Agnew & Henry Agnew under the CK-12 license in LibreText’s Introductory Chemistry materials as compiled 08/21/2023.
I have a gallon of honey and a gallon of mayonnaise at 25C. I remove one cup of honey at a time from the gallon of honey. How many times will I need to remove a cup to have the honey weigh less than the mayonnaise? Assume the containers themselves weigh the same.
Thought: I will use the 'ask_search_agent' tool to find the density of honey and mayonnaise at 25C.
==== Agent is executing the code below:
density_honey = ask_search_agent(query="What's the density of honey at 25C?")
print("Density of honey:", density_honey)
density_mayo = ask_search_agent(query="What's the density of mayonnaise at 25C?")
print("Density of mayo:", density_mayo)
===
Observation:
Density of honey: The density of honey is around 1.38-1.45kg/L at 20C. Although I could not find information specific to 25C, minor temperature differences are unlikely to affect the density that much, so it is likely to remain within this range.
Density of mayo: The density of mayonnaise at 25°C is 0.910 g/cm³.
===== New step =====
Thought: I will convert the density of mayonnaise from g/cm³ to kg/L and then calculate the initial weights of the honey and mayonnaise in a gallon. After that, I will calculate the weight of honey after removing one cup at a time until it weighs less than the mayonnaise.
==== Agent is executing the code below:
density_honey = 1.42 # taking the average of the range
density_mayo = 0.910 # converting g/cm³ to kg/L
density_mayo = density_mayo * 1000 / 1000 # conversion
gallon_to_liters = 3.785 # conversion factor
initial_honey_weight = density_honey * gallon_to_liters
initial_mayo_weight = density_mayo * gallon_to_liters
cup_to_liters = 0.236 # conversion factor
removed_honey_weight = cup_to_liters * density_honey
===
Observation:
===== New step =====
Thought: Now that I have the initial weights of honey and mayonnaise, I will try to calculate the number of cups to remove from the honey to make it weigh less than the mayonnaise using a simple arithmetic operation.
==== Agent is executing the code below:
cups_removed = int((initial_honey_weight - initial_mayo_weight) / removed_honey_weight) + 1
print("Cups removed:", cups_removed)
final_answer(cups_removed)
===
>>> Final answer: 6
✅ And the answer is correct!
Testing our agents
Let’s take our agent framework for a spin and benchmark different models with it!
All the code for the experiments below can be found here.
Benchmarking LLM engines
The agents_reasoning_benchmark is a small (but mighty!) reasoning test for evaluating agent performance. This benchmark was already used and explained in more detail in our earlier blog post.
The idea is that the choice of tools you use with your agents can radically alter performance for certain tasks. So this benchmark restricts the set of tools to a calculator and a basic search tool. We picked questions from several datasets that could be solved using only these two tools:
Here we try 3 different engines: Mixtral-8x7B, Llama-3-70B-Instruct, and GPT-4 Turbo.
The results are shown above, averaged over two complete runs for more precision. We also tested Command-R+ and Mixtral-8x22B, but don't show them for clarity.
⇒ Llama-3-70B-Instruct leads the Open-Source models: it’s on par with GPT-4, and it’s especially strong in a ReactCodeAgent because of Llama 3’s strong coding performance!
💡 It's interesting to compare JSON- and Code-based ReAct agents: with less powerful LLM engines like Mixtral-8x7B, Code-based agents do not perform as well as JSON ones, since the LLM engine frequently fails to generate good code. But the Code version really shines with more powerful models as engines: in our experience, the Code version even outperforms JSON with Llama-3-70B-Instruct. As a result, we use the Code version for our next challenge: testing on the complete GAIA benchmark.
Climbing up the GAIA Leaderboard with a multi-modal agent
GAIA (Mialon et al., 2023) is an extremely difficult benchmark: you can see in the agent_reasoning_benchmark above that models do not perform above 50% even though we cherry-picked tasks that could be solved with 2 basic tools.
Now that we want to get a score on the complete set, we no longer cherry-pick questions. Thus we have to cover all modalities, which leads us to use these specific tools:
- SearchTool: the web browser defined above.
- TextInspectorTool: open documents as text files and return their content.
- SpeechToTextTool: transcribe audio files to text. We use the default tool based on distil-whisper.
- VisualQATool: analyze images visually. For these we use the shiny new Idefics2-8b-chatty!
We first initialize these tools (for more detail, inspect the code in the repository).
Then we initialize our agent:
from transformers.agents import ReactCodeAgent, HfEngine
TASK_SOLVING_TOOLBOX = [
SearchTool(),
VisualQATool(),
SpeechToTextTool(),
TextInspectorTool(),
]
react_agent_hf = ReactCodeAgent(
tools=TASK_SOLVING_TOOLBOX,
llm_engine=HfEngine(model="meta-llama/Meta-Llama-3-70B-Instruct"),
memory_verbose=True,
)
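We then run the agent over the benchmark questions and store each predicted answer for submission. The sketch below is only illustrative: gaia_questions is a placeholder for the loaded GAIA tasks, and the exact submission format is handled by the evaluation script in the repository.

import json

# Illustrative only: gaia_questions stands in for the loaded GAIA tasks.
answers = []
for example in gaia_questions:
    prediction = react_agent_hf.run(example["question"])
    answers.append({"task_id": example["task_id"], "model_answer": str(prediction)})

with open("answers.jsonl", "w") as f:
    for row in answers:
        f.write(json.dumps(row) + "\n")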
And after some time needed to complete the 165 questions, we submit our result to the GAIA Leaderboard, and… 🥁🥁🥁
⇒ Our agent ranks 4th: it beats many GPT-4-based agents, and is now the reigning contender for the Open-Source category!
Conclusion
We will keep improving this package in the coming months. We have already identified several exciting paths in our development roadmap:
- More agent sharing options: for now you can push or load tools from the Hub; we will implement pushing/loading agents too.
- Higher tools, especially for image processing.
- Long-term memory management.
- Multi-agent collaboration.
👉 Go check out transformers agents! We’re looking forward to receiving your feedback and your ideas.
Let’s fill the highest of the leaderboard with more open-source models! 🚀
