Today, new libraries and low-code platforms are making it easier than ever to build AI agents, often referred to as digital workers. Tool calling is one of the primary abilities driving the "agentic" nature of Generative AI models by extending their ability beyond conversational tasks. By executing tools (functions), agents can take action on your behalf and solve complex, multi-step problems that require robust decision making and interaction with a variety of external data sources.
This article focuses on how reasoning is expressed through tool calling, explores some of the challenges of tool use, covers common ways to evaluate tool-calling ability, and provides examples of how different models and agents interact with tools.
At the core of successful agents lie two key expressions of reasoning: reasoning through evaluation and planning, and reasoning through tool use.
- Reasoning through evaluation and planning relates to an agent's ability to effectively break down a problem by iteratively planning, assessing progress, and adjusting its approach until the task is completed. Techniques like Chain-of-Thought (CoT), ReAct, and Prompt Decomposition are all patterns designed to improve the model's ability to reason strategically by breaking down tasks to solve them accurately. This type of reasoning is more macro-level, ensuring the task is completed correctly by working iteratively and taking into account the results from each stage.
- Reasoning through tool use relates to the agent's ability to effectively interact with its environment, deciding which tools to call and how to structure each call. These tools enable the agent to retrieve data, execute code, call APIs, and more. The strength of this type of reasoning lies in the proper execution of tool calls rather than reflecting on the results from the call.
While both expressions of reasoning are important, they don't always need to be combined to create powerful solutions. For example, OpenAI's new o1 model excels at reasoning through evaluation and planning because it was trained to reason using chain of thought. This has significantly improved its ability to think through and solve complex challenges, as reflected on a variety of benchmarks. For example, the o1 model has been shown to surpass human PhD-level accuracy on the GPQA benchmark covering physics, biology, and chemistry, and scored in the 86th-93rd percentile on Codeforces contests. While o1's reasoning ability could be used to generate text-based responses that suggest tools based on their descriptions, it currently lacks explicit tool calling abilities (at least for now!).
In contrast, many models are fine-tuned specifically for reasoning through tool use, enabling them to generate function calls and interact with APIs very effectively. These models are focused on calling the right tool in the right format at the right time, but are typically not designed to evaluate their own results as thoroughly as o1 might. The Berkeley Function Calling Leaderboard (BFCL) is a great resource for comparing how different models perform on function calling tasks. It also provides an evaluation suite to compare your own fine-tuned model on various challenging tool calling tasks. In fact, the latest dataset, BFCL v3, was just released and now includes multi-step, multi-turn function calling, further raising the bar for tool-based reasoning tasks.
Both types of reasoning are powerful independently, and when combined, they have the potential to create agents that can effectively break down complicated tasks and autonomously interact with their environment. For more examples of AI agent architectures for reasoning, planning, and tool calling, check out my team's survey paper on ArXiv.
Building robust and reliable agents requires overcoming many different challenges. When solving complex problems, an agent often needs to balance multiple tasks at once including planning, interacting with the right tools at the right time, formatting tool calls properly, remembering outputs from previous steps, avoiding repetitive loops, and adhering to guidance to protect the system from jailbreaks, prompt injections, and other attacks.
Too many demands can easily overwhelm a single agent, leading to a growing trend where what may appear to an end user as one agent is, behind the scenes, a collection of many agents and prompts working together to divide and conquer the task. This division allows tasks to be broken down and handled in parallel by different models and agents tailored to solve that specific piece of the puzzle.
It's here that models with excellent tool calling capabilities come into play. While tool-calling is a powerful way to enable productive agents, it comes with its own set of challenges. Agents need to understand the available tools, select the right one from a set of potentially similar options, format the inputs accurately, call tools in the right order, and potentially integrate feedback or instructions from other agents or humans. Many models are fine-tuned specifically for tool calling, allowing them to specialize in selecting functions at the right time with high accuracy.
Some of the key considerations when fine-tuning a model for tool calling include:
- Proper Tool Selection: The model should understand the relationship between available tools, make nested calls when applicable, and select the right tool in the presence of other similar tools.
- Handling Structural Challenges: Although most models use JSON format for tool calling, other formats like YAML or XML can also be used. Consider whether the model needs to generalize across formats or only ever use one. Regardless of the format, the model needs to include the appropriate parameters for each tool call, potentially using results from a previous call in subsequent ones (see the sketch after this list).
- Ensuring Dataset Diversity and Robust Evaluations: The dataset used should be diverse and cover the complexity of multi-step, multi-turn function calling. Proper evaluations should be performed to prevent overfitting and avoid benchmark contamination.
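To make the structural point concrete, here is a minimal sketch of what JSON-formatted tool calls might look like, including a chained case where one call depends on the output of a previous one. The tool names and fields are hypothetical and only illustrate the kind of structure a fine-tuned model needs to produce reliably:

import json

# Hypothetical tool calls a model might emit for the request:
# "Find the three most recent papers on tool calling and email them to Alice."
tool_calls = [
    {
        "name": "search_papers",
        "arguments": {"query": "LLM tool calling", "num_results": 3},
    },
    {
        "name": "send_email",
        # Chained call: the message body depends on the result of search_papers,
        # so it must be filled in from the previous call's output.
        "arguments": {
            "recipient_email": "alice@example.com",
            "subject": "Recent papers on tool calling",
            "message": "<results of search_papers>",
        },
    },
]

print(json.dumps(tool_calls, indent=2))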
With the growing importance of tool use in language models, many datasets have emerged to help evaluate and improve model tool-calling capabilities. Two of the most popular benchmarks today are the Berkeley Function Calling Leaderboard and Nexus Function Calling Benchmark, both of which Meta used to evaluate the performance of their Llama 3.1 model series. A recent paper, ToolACE, demonstrates how agents can be used to create a diverse dataset for fine-tuning and evaluating model tool use.
Let's explore each of these benchmarks in more detail:
- Berkeley Function Calling Leaderboard (BFCL): BFCL contains 2,000 question-function-answer pairs across multiple programming languages. Today there are 3 versions of the BFCL dataset, each with enhancements to better reflect real-world scenarios. For example, BFCL-V2, released August 19th, 2024, includes user-contributed samples designed to address evaluation challenges related to dataset contamination. BFCL-V3, released September 19th, 2024, adds multi-turn, multi-step tool calling to the benchmark. This is critical for agentic applications where a model needs to make multiple tool calls over time to successfully complete a task. Instructions for evaluating models on BFCL can be found on GitHub, with the latest dataset available on HuggingFace and the current leaderboard accessible here. The Berkeley team has also released various versions of their Gorilla Open-Functions model fine-tuned specifically for function-calling tasks.
- Nexus Function Calling Benchmark: This benchmark evaluates models on zero-shot function calling and API usage across nine different tasks classified into three major categories covering single, parallel, and nested tool calls. Nexusflow also released NexusRaven-V2, a model designed for function-calling. The Nexus benchmark is available on GitHub and the corresponding leaderboard is on HuggingFace.
- ToolACE: The ToolACE paper demonstrates a creative approach to overcoming the challenges of collecting real-world data for function-calling. The research team created an agentic pipeline to generate a synthetic dataset for tool calling consisting of over 26,000 different APIs. The dataset includes examples of single, parallel, and nested tool calls, as well as non-tool-based interactions, and supports both single and multi-turn dialogs. The team released a fine-tuned version of Llama-3.1-8B-Instruct, ToolACE-8B, designed to handle these complex tool-calling tasks. A subset of the ToolACE dataset is available on HuggingFace.
Each of these benchmarks facilitates our ability to evaluate model reasoning expressed through tool calling. These benchmarks and fine-tuned models reflect a growing trend towards developing more specialized models for specific tasks and increasing LLM capabilities by extending their ability to interact with the real world.
If you're interested in exploring tool-calling in action, here are some examples to get you started, organized by ease of use, ranging from simple built-in tools to fine-tuned models and agents with tool-calling abilities.
Level 1 — ChatGPT: The best place to start and see tool-calling live without needing to define any tools yourself is through ChatGPT. Here you can use GPT-4o through the chat interface to call and execute tools for web browsing. For example, when asked "what's the latest AI news this week?" ChatGPT-4o will conduct a web search and return a response based on the information it finds. Remember, the new o1 model doesn't have tool-calling abilities yet and can't search the web.
While this built-in web-searching feature is convenient, most use cases will require defining custom tools that can integrate directly into your own model workflows and applications. This brings us to the next level of complexity.
Level 2 — Using a Model with Tool Calling Abilities and Defining Custom Tools:
This level involves using a model with tool-calling abilities to get a sense of how effectively the model selects and uses its tools. It's important to note that when a model is trained for tool-calling, it only generates the text or code for the tool call; it doesn't actually execute the code itself. Something external to the model must invoke the tool, and it's at this point — where we're combining generation with execution — that we transition from language model capabilities to agentic systems.
To get a sense of how models express tool calls, we can turn to the Databricks Playground. For example, we can select the model Llama 3.1 405B and give it access to the sample tools get_distance_between_locations and get_current_weather. When prompted with the user message "I am going on a trip from LA to New York how far are these two cities? And what's the weather like in New York? I want to be prepared for when I get there" the model decides which tools to call and what parameters to pass so it can effectively respond to the user.
In this example, the model suggests two tool calls. Since the model cannot execute the tools, the user needs to fill in a sample result to simulate the tool output (e.g., "2500" for the distance and "68" for the weather). The model then uses these simulated outputs to respond to the user.
This approach to using the Databricks Playground allows you to observe how the model uses custom-defined tools and is a great way to test your function definitions before implementing them in your tool-calling enabled applications or agents.
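While the Playground handles this exchange through its UI, the underlying structure can be sketched as a chat transcript. The schemas and message format below are assumptions for illustration (following a common OpenAI-style convention), not the exact format Databricks uses:

# Hypothetical schemas for the two sample tools referenced above.
tools = [
    {
        "name": "get_distance_between_locations",
        "description": "Returns the distance in miles between two locations.",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string"},
                "destination": {"type": "string"},
            },
            "required": ["origin", "destination"],
        },
    },
    {
        "name": "get_current_weather",
        "description": "Returns the current temperature (Fahrenheit) for a location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
]

# Simplified transcript: the model proposes tool calls, the user supplies
# simulated results, and the model then answers in natural language.
messages = [
    {"role": "user", "content": "I am going on a trip from LA to New York..."},
    {"role": "assistant", "tool_calls": [
        {"name": "get_distance_between_locations",
         "arguments": {"origin": "Los Angeles", "destination": "New York"}},
        {"name": "get_current_weather", "arguments": {"location": "New York"}},
    ]},
    {"role": "tool", "name": "get_distance_between_locations", "content": "2500"},
    {"role": "tool", "name": "get_current_weather", "content": "68"},
    # The final assistant turn then reads something like: "The cities are about
    # 2,500 miles apart, and it is currently around 68°F in New York."
]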
Outside of the Databricks Playground, we can observe and evaluate how effectively different models available on platforms like HuggingFace use tools through code directly. For example, we can load different models like Llama-3.2-3B-Instruct, ToolACE-8B, NexusRaven-V2-13B, and more from HuggingFace, give them the same system prompt, tools, and user message, then observe and compare the tool calls each model returns. This is a great way to understand how well different models reason about using custom-defined tools and can help you determine which tool-calling models are best suited for your applications.
Here is an example demonstrating a tool call generated by Llama-3.2-3B-Instruct based on the following tool definitions and user message. The same steps could be followed for other models to compare their generated tool calls.
import torch
from transformers import pipeline

function_definitions = """[
    {
        "name": "search_google",
        "description": "Performs a Google search for a given query and returns the top results.",
        "parameters": {
            "type": "dict",
            "required": [
                "query"
            ],
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query to use for the Google search."
                },
                "num_results": {
                    "type": "integer",
                    "description": "The number of search results to return.",
                    "default": 10
                }
            }
        }
    },
    {
        "name": "send_email",
        "description": "Sends an email to a specified recipient.",
        "parameters": {
            "type": "dict",
            "required": [
                "recipient_email",
                "subject",
                "message"
            ],
            "properties": {
                "recipient_email": {
                    "type": "string",
                    "description": "The email address of the recipient."
                },
                "subject": {
                    "type": "string",
                    "description": "The subject of the email."
                },
                "message": {
                    "type": "string",
                    "description": "The body of the email."
                }
            }
        }
    }
]
"""

# This is the suggested system prompt from Meta
system_prompt = """You are an expert in composing functions. You are given a question and a set of possible functions.
Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
If none of the functions can be used, point it out. If the given question lacks the parameters required by the function,
also point it out. You should only return the function call in tools call sections.
If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]\n
You SHOULD NOT include any other text in the response.
Here is a list of functions in JSON format that you can invoke.\n\n{functions}\n""".format(functions=function_definitions)
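The snippet above stops at the system prompt, so here is one way it could be completed. This continues the code above; the user message is made up for illustration, and the generation settings and output indexing assume a recent version of transformers that accepts chat-formatted inputs to the text-generation pipeline:

# Hypothetical user message chosen to exercise both tools defined above.
user_message = (
    "Search for the top 3 articles about this week's AI news and email a "
    "short summary to jane.doe@example.com."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_message},
]

# Load the model; dtype and device settings are assumptions and may need
# adjusting for your hardware. Llama models also require accepting the
# license on HuggingFace and authenticating with an access token.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

output = generator(messages, max_new_tokens=128, do_sample=False)

# With the system prompt above, the assistant's reply should contain only the
# bracketed tool calls, e.g. something like:
# [search_google(query="AI news this week", num_results=3), send_email(...)]
print(output[0]["generated_text"][-1]["content"])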
From here we can move to Level 3, where we define agents that execute the tool calls generated by the language model.
Level 3 — Agents (invoking/executing LLM tool calls): Agents often express reasoning both through planning and evaluation as well as through tool calling, making them an increasingly important aspect of AI-based applications. Using libraries like LangGraph, AutoGen, Semantic Kernel, or LlamaIndex, you can quickly create an agent using models like GPT-4o or Llama 3.1-405B which support both conversations with the user and tool execution.
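These frameworks handle the invoke-and-loop logic for you, but the core idea can be sketched without one. The snippet below is a minimal sketch that assumes the model's output follows the bracketed func_name(param=value) format from the Level 2 system prompt; a production agent would add validation, error handling, and a loop that feeds tool results back to the model:

import ast
import re

# Local implementations of the tools; in a real agent these would call real APIs.
def search_google(query: str, num_results: int = 10) -> str:
    return f"(stub) top {num_results} results for '{query}'"

def send_email(recipient_email: str, subject: str, message: str) -> str:
    return f"(stub) email sent to {recipient_email}"

TOOLS = {"search_google": search_google, "send_email": send_email}

def execute_tool_calls(model_output: str) -> list:
    """Parse calls like [search_google(query="AI news", num_results=3)] and run them."""
    results = []
    for name, arg_str in re.findall(r"(\w+)\((.*?)\)", model_output):
        if name not in TOOLS:
            continue  # unknown tool: skip it, or ask the model to retry
        # Parse keyword arguments only, using literal evaluation for safety.
        call = ast.parse(f"f({arg_str})", mode="eval").body
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
        results.append(TOOLS[name](**kwargs))
    return results

# Example with a hand-written model output in the expected bracketed format:
print(execute_tool_calls('[search_google(query="latest AI news", num_results=3)]'))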
Check out these guides for some exciting examples of agents in action:
The future of agentic systems will be driven by models with strong reasoning abilities that enable them to effectively interact with their environment. As the field evolves, I expect we will continue to see a proliferation of smaller, specialized models focused on specific tasks like tool-calling and planning.
It's important to consider the current limitations of model sizes when building agents. For example, according to the Llama 3.1 model card, the Llama 3.1-8B model is not reliable for tasks that involve both maintaining a conversation and calling tools; larger models with 70B+ parameters should be used for these types of tasks instead. This, alongside other emerging research on fine-tuning small language models, suggests that smaller models may serve best as specialized tool-callers while larger models may be better suited for more advanced reasoning. By combining these abilities, we can build increasingly effective agents that provide a seamless user experience and allow people to leverage these reasoning abilities in both professional and personal endeavors.
Interested in discussing further or collaborating? Reach out on LinkedIn!