Open-source LLMs as LangChain Agents



Open-source LLMs have now reached a performance level that makes them suitable reasoning engines for powering agent workflows: Mixtral even surpasses GPT-3.5 on our benchmark, and its performance could easily be further enhanced with fine-tuning.

We have released the best agentic library on the market: smolagents!
Go check out the smolagents introduction blog here.



Introduction

Large Language Models (LLMs) trained for causal language modeling can tackle a wide range of tasks, but they often struggle with basic tasks like logic, calculation, and search. The worst scenario is when they perform poorly in a domain, such as math, yet still try to handle all the calculations themselves.

To overcome this weakness, amongst other approaches, one can integrate the LLM into a system where it can call tools: such a system is called an LLM agent.

In this post, we explain the inner workings of ReAct agents, then show how to build them using the ChatHuggingFace class recently integrated in LangChain. Finally, we benchmark several open-source LLMs against GPT-3.5 and GPT-4.






What are agents?

The definition of LLM agents is quite broad: LLM agents are all systems that use LLMs as their engine and can perform actions on their environment based on observations. They can use several iterations of the Perception ⇒ Reflection ⇒ Action cycle to achieve their task and are often augmented with planning or knowledge management systems to enhance their performance. You can find a good review of the agents landscape in Xi et al., 2023.

Today, we are focusing on ReAct agents. ReAct is an approach to building agents based on the concatenation of two words, “Reasoning” and “Acting.” In the prompt, we describe the model, which tools it can use, and ask it to think “step by step” (also called Chain-of-Thought behavior) to plan and execute its next actions to reach the final answer.




Toy example of a ReAct agent’s inner workings

The graph above seems very high-level, but under the hood it’s quite simple.

Take a look at this notebook: we implement a barebones tool-call example with the Transformers library.

The LLM is called in a loop with a prompt containing, in essence:

Here is a question: "{question}"
You have access to these tools: {tools_descriptions}.
You should first reflect with ‘Thought: {your_thoughts}’, then you either:
- call a tool with the correct JSON formatting,
- or print your final answer starting with the prefix ‘Final Answer:’

Then you parse the LLM’s output:

  • if it contains the string ‘Final Answer:’, the loop ends and you print the answer,
  • else, the LLM should have output a tool call: you can parse this output to get the tool name and arguments, then call said tool with said arguments. The output of this tool call is appended to the prompt, and you call the LLM again with this extended information, until it has enough information to finally provide a final answer to the question (a minimal sketch of this loop follows below).
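
To make the control flow concrete, here is a minimal sketch of this loop in plain Python. The names call_llm and TOOLS are hypothetical placeholders for your model call and tool registry, not part of any library:

import json

# Minimal sketch of the ReAct loop. `call_llm` sends a prompt to your model and
# returns its text output; `TOOLS` maps tool names to Python callables.
def run_react_loop(question, call_llm, TOOLS, tools_descriptions, max_iterations=5):
    prompt = (
        f'Here is a question: "{question}"\n'
        f"You have access to these tools: {tools_descriptions}\n"
        "You should first reflect with 'Thought: {your_thoughts}', then you either:\n"
        "- call a tool with the correct JSON formatting,\n"
        "- or print your final answer starting with the prefix 'Final Answer:'\n"
    )
    for _ in range(max_iterations):
        output = call_llm(prompt)
        if "Final Answer:" in output:
            # The loop ends: return everything after the prefix.
            return output.split("Final Answer:")[-1].strip()
        # Otherwise the model should have emitted a JSON tool call:
        # extract the JSON blob, parse it, and call the corresponding tool.
        json_blob = output[output.find("{") : output.rfind("}") + 1]
        tool_call = json.loads(json_blob)
        observation = TOOLS[tool_call["action"]](**tool_call["action_input"])
        # Append the action and its observation, then call the LLM again.
        prompt += f"\n{output}\nObservation: {observation}\n"
    return "Agent stopped after reaching the maximum number of iterations."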

For instance, the LLM’s output can look like this, when answering the question: How many seconds are in 1:23:45?

Thought: I need to convert the time string into seconds.

Action:
{
    "action": "convert_time",
    "action_input": {
        "time": "1:23:45"
    }
}

Since this output doesn’t contain the string ‘Final Answer:’, it is calling a tool: so we parse this output and get the tool call parameters: call the tool convert_time with arguments {"time": "1:23:45"}.
Running this tool call returns {'seconds': '5025'}.
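
For reference, a hypothetical convert_time tool matching this example (our own illustrative implementation, not taken from the notebook) could be as simple as:

def convert_time(time: str) -> dict:
    """Convert an 'hours:minutes:seconds' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in time.split(":"))
    return {"seconds": str(hours * 3600 + minutes * 60 + seconds)}

convert_time(time="1:23:45")  # returns {'seconds': '5025'}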

So we append this whole blob to the prompt.

The new prompt is now (a slightly more elaborate version of):

Here is a question: "How many seconds are in 1:23:45?"
You have access to these tools:
    - convert_time: converts a time given in hours:minutes:seconds into seconds.

You should first reflect with ‘Thought: {your_thoughts}’, then you either:
- call a tool with the correct JSON formatting,
- or print your final answer starting with the prefix ‘Final Answer:’

Thought: I need to convert the time string into seconds.

Action:
{
    "action": "convert_time",
    "action_input": {
        "time": "1:23:45"
    }
}
Observation: {'seconds': '5025'}

➡️ We call the LLM again, with this new prompt. Given that it has access to the tool call’s result in the Observation, the LLM is now most likely to output:

Thought: I now have the information needed to answer the question.
Final Answer: There are 5025 seconds in 1:23:45.

And the task is solved!



Challenges of agent systems

Generally speaking, the difficult parts of running an agent system for the LLM engine are:

  1. From the supplied tools, select the one that will help advance towards the desired goal: e.g. when asked "What is the smallest prime number greater than 30,000?", the agent could call the Search tool with "What is the height of K2?", but it won’t help.
  2. Call tools with rigorous argument formatting: for instance, when trying to calculate the speed of a car that went 3 km in 10 minutes, you have to call the Calculator tool to divide distance by time. Even if your Calculator tool accepts calls in the JSON format {"tool": "Calculator", "args": "3km/10min"}, there are many pitfalls (a defensive parsing sketch follows this list), for instance:
    • Misspelling the tool name: "calculator" or "Compute" wouldn’t work
    • Giving the names of the arguments instead of their values: "args": "distance/time"
    • Non-standardized formatting: "args": "3km in 10minutes"
  3. Efficiently ingesting and using the information gathered in past observations, be it the initial context or the observations returned after tool uses.
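
One common mitigation for pitfall 2 is to validate the parsed call before executing it. The sketch below is purely illustrative (the tool registry and error messages are our assumptions, not part of LangChain):

import json

KNOWN_TOOLS = {"Calculator", "Search"}  # assumed tool registry for this sketch

def validate_tool_call(raw_output: str) -> dict:
    """Parse a JSON tool call and reject the most common formatting mistakes."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError as error:
        raise ValueError(f"Tool call is not valid JSON: {error}")
    tool_name = call.get("tool")
    if tool_name not in KNOWN_TOOLS:
        # Catches misspelled tool names such as "calculator" or "Compute".
        raise ValueError(f"Unknown tool name: {tool_name!r}")
    if "args" not in call:
        raise ValueError("The tool call is missing its 'args' field.")
    return call

validate_tool_call('{"tool": "Calculator", "args": "3km/10min"}')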

So, what would a complete agent setup look like?



Running agents with LangChain

We have just integrated a ChatHuggingFace wrapper that lets you create agents based on open-source models in 🦜🔗LangChain.

The code to create the ChatModel and give it tools is really simple; you can check it all in the LangChain documentation.

from langchain_community.llms import HuggingFaceEndpoint
from langchain_community.chat_models.huggingface import ChatHuggingFace

llm = HuggingFaceEndpoint(repo_id="HuggingFaceH4/zephyr-7b-beta")

chat_model = ChatHuggingFace(llm=llm)
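
As a quick sanity check (this snippet is our addition, assuming the inference endpoint above is reachable and your Hugging Face token is configured), you can query the chat model directly before turning it into an agent:

from langchain_core.messages import HumanMessage

# Send a single user message to the wrapped model and print its reply.
messages = [HumanMessage(content="How many seconds are in 1:23:45?")]
response = chat_model.invoke(messages)
print(response.content)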

You can make the chat_model into an agent by giving it a ReAct-style prompt and tools:

from langchain import hub
from langchain.agents import AgentExecutor, load_tools
from langchain.agents.format_scratchpad import format_log_to_str
from langchain.agents.output_parsers import (
    ReActJsonSingleInputOutputParser,
)
from langchain.tools.render import render_text_description
from langchain_community.utilities import SerpAPIWrapper


tools = load_tools(["serpapi", "llm-math"], llm=llm)


prompt = hub.pull("hwchase17/react-json")
prompt = prompt.partial(
    tools=render_text_description(tools),
    tool_names=", ".join([t.name for t in tools]),
)


chat_model_with_stop = chat_model.bind(stop=["\nObservation"])
agent = (
    {
        "input": lambda x: x["input"],
        "agent_scratchpad": lambda x: format_log_to_str(x["intermediate_steps"]),
    }
    | prompt
    | chat_model_with_stop
    | ReActJsonSingleInputOutputParser()
)


agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

agent_executor.invoke(
    {
        "input": "Who's the present holder of the speed skating world record on 500 meters? What's her current age raised to the 0.43 power?"
    }
)

And the agent will process the input:

Thought: To answer this question, I need to find the age of the current speed skating world record holder. I will use the search tool to find this information.
Action:
{
    "action": "search",
    "action_input": "speed skating world record holder 500m age"
}
Observation: ...



Agents Showdown: how do open-source LLMs perform as general purpose reasoning agents?

You can find the code for this benchmark here.



Evaluation

We want to measure how open-source LLMs perform as general purpose reasoning agents. Thus we select questions requiring the use of logic and of basic tools: a calculator and access to internet search.
The final dataset is a combination of samples from 3 other datasets:

  • For testing internet search capability, we have chosen questions from HotpotQA: this is originally a retrieval dataset, but it can be used for general question answering, with access to the internet. Some questions originally need to combine information from various sources: in our setting, this means performing several steps of internet search to combine the results.
  • For calculator usage, we added questions from GSM8K: this dataset tests grade-school math ability, and is entirely solvable by correctly leveraging the 4 operators (add, subtract, multiply, divide).
  • We also picked questions from GAIA, a very difficult benchmark for General AI Assistants. The questions in the original dataset can require many other different tools, such as a code interpreter or pdf reader: we hand-picked questions that do not require any tools other than search and a calculator.

Evaluation was performed with GPT-4-as-a-judge using a prompt based on the Prometheus prompt format, giving results on a 5-point Likert scale: see the exact prompt used here.
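
The average 1-5 scores are later reported on a 0-100% scale; the exact rescaling is not spelled out here, but a natural linear mapping would look like the following (an assumption on our part):

def likert_to_percent(score: float) -> float:
    """Linearly rescale a 1-5 Likert score onto 0-100% (assumed mapping)."""
    return (score - 1) / 4 * 100

likert_to_percent(5)  # 100.0
likert_to_percent(3)  # 50.0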



Models

We evaluate a few strong open-source models.

These models are evaluated in LangChain’s ReAct implementation.
This means we prompt them to output their function calls in this format:

{
  "motion": $TOOL_NAME,
  "action_input": $INPUT
}

For comparison, we have also evaluated GPT-3.5 and GPT-4 on the same examples using LangChain’s OpenAI-specific agent.
Since this uses their own function-calling template, on which they were fine-tuned, the OpenAI models should be at their top performance.



Results

Since the open-source models were not specifically fine-tuned for calling functions in the given output format, they are at a slight disadvantage compared to the OpenAI agents.

Despite this, some models perform really well! 💪

Here’s an example of Mixtral-8x7B answering the question: “Which city has a larger population, Guiyang or Tacheng?”

Thought: To answer this question, I need to find the current populations of both Guiyang and Tacheng. I will use the search tool to find this information.
Action:
{
    "action": "search",
    "action_input": "current population of Guiyang"
}
4.697 million (2016)

Thought: I have obtained the population of Guiyang, which is 4.697 million (as of 2016). Now, I will use the search tool again to find the population of Tacheng.
Action:
{
    "action": "search",
    "action_input": "current population of Tacheng"
}
Population: Xinjiang: Tacheng data was reported at 1,110.000 thousand persons in 2021. This records a decrease from the previous number of 1,138.638 thousand persons for 2020.
Population: Xinjiang: Tacheng data is updated yearly, averaging 1,023.000 thousand persons from Dec 2005 to 2021, with 17 observations.

I have obtained the population of Tacheng, which is approximately 1.11 million (as of 2021). Comparing the two populations, Guiyang has a larger population than Tacheng.

Thought: I now know the final answer
Final Answer: Guiyang has a larger population, which is approximately 4.697 million (as of 2016), compared to Tacheng’s population of approximately 1.11 million (as of 2021).

Here’s a benchmark of the models on our evaluation dataset (the average scores, originally on a scale of 1-5, have been converted to a scale of 0-100% for readability):

benchmark of agents performance

As you can see, some open-source models do not perform well at powering agent workflows: while this was expected for the small Zephyr-7b, Llama2-70b performs surprisingly poorly.

👉 But Mixtral-8x7B performs really well: it even beats GPT-3.5! 🏆

And this is out-of-the-box performance: contrary to GPT-3.5, Mixtral was not fine-tuned for agent workflows (to our knowledge), which somewhat hinders its performance. For instance, on GAIA, 10% of questions fail because Mixtral tries to call a tool with incorrectly formatted arguments. With proper fine-tuning for function calling and task planning skills, Mixtral’s score would likely be even higher.

➡️ We strongly recommend that open-source builders start fine-tuning Mixtral for agents, to surpass the next challenger: GPT-4! 🚀

Closing remarks:

  • The GAIA benchmark, although it is tried here on a small subsample of questions and a few tools, seems like a very robust indicator of overall model performance for agent workflows, since it generally involves several reasoning steps and rigorous logic.
  • Agent workflows allow LLMs to increase performance: for instance, on GSM8K, GPT-4’s technical report reports 92% for 5-shot CoT prompting, while giving it a calculator allows us to reach 95% in zero-shot. For Mixtral-8x7B, the LLM Leaderboard reports 57.6% with 5-shot, while we get 73% in zero-shot. (Keep in mind that we tested only 20 questions from GSM8K.)


