We are living in the era of AI agents. LLMs are no longer just tools. They've become active participants in our lives, boosting productivity and transforming the way we live and work.
- OpenAI recently launched Operator, an AI agent that can autonomously perform various tasks, from browsing the web to filling out forms and scheduling appointments.
- Anthropic introduced MCP (Model Context Protocol), a new standard for how AI assistants interact with the outside world. With over five thousand active MCP servers already, adoption is growing rapidly.
- AI agents are also changing the landscape of software engineering. Tools like GitHub Copilot's agentic mode, Claude Code, OpenAI Codex, and others are not only improving developer productivity and code quality but also democratising the field, making software development accessible to people without a technical background.
We've previously looked at different AI agent frameworks, such as LangGraph and CrewAI. In this article, I would like to discuss a new one I've been exploring recently: HuggingFace smolagents. It's an interesting framework because it implements the concept of code agents.
In this article, we will explore several topics:
- What code agents are.
- How to use the HuggingFace smolagents framework in practice.
- Whether it's safe to give LLMs so much agency.
- The actual difference in performance between code agents and traditional tool-calling agents.
AI Agents recap
Let's start with a quick refresher: what exactly are AI agents? HuggingFace provides a clear and concise definition of what they mean by agents.
So, we need an agentic flow when we want a system to reason and act based on observations. In fact, agency is not a binary variable (yes or no) but a spectrum.
- At one end, we have systems with no agency at all, for example, a simple process where an LLM determines the sentiment of a text, translates it, or summarises it.
- The next level is routing, where an LLM can classify an incoming query and decide which path to take: for example, calling a tool if a customer is asking about the status of their current order, and transferring the conversation to a human CS agent otherwise.
- More advanced systems can exhibit higher degrees of agency. These might include the ability to call other LLMs (multi-agent setups) or even create new tools on the fly.
Code agents fall into this more advanced category. They are multi-step agents that execute tool calls in the form of code, in contrast to the more traditional approach of using a JSON format with the tool name and arguments.
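To make the difference concrete, here is a simplified illustration (get_weather and final_answer are hypothetical tools used purely for this example): a tool-calling agent emits a JSON blob describing one call, while a code agent emits a Python snippet that the framework then executes.

# traditional tool-calling agent: one JSON action per step,
# mapped by the framework to a single tool invocation
json_action = {
    "tool": "get_weather",
    "arguments": {"city": "Paris"}
}

# code agent: the action is Python code, so several tool calls,
# loops and post-processing can happen within one step
code_action = """
temperatures = {city: get_weather(city) for city in ["Paris", "London", "Berlin"]}
final_answer(max(temperatures, key=temperatures.get))
"""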
Several recent papers have shown that using code in agentic flows leads to better results.
It makes sense when you think about it. We've been developing programming languages for decades to solve complex problems. So, it's natural that these languages are better suited to expressing LLM tasks than simple JSON configs. An additional benefit is that LLMs are already quite good at writing code in common programming languages, thanks to the vast amount of data available for training.
This approach comes with several other advantages as well:
- By generating code, an LLM is not limited to a predefined set of tools and can create its own functions.
- It can combine multiple tools within a single action using conditions and loops, which helps reduce the number of steps required to complete a task.
- It also enables the model to work with a wider variety of outputs, such as generating charts, images, or other complex objects.
These advantages aren't just theoretical; we can observe them in practice. In "Executable Code Actions Elicit Better LLM Agents", the authors show that code agents outperform traditional methods, achieving a higher success rate and completing tasks in fewer steps, which in turn reduces costs.
Code agents look promising, which inspired me to try this approach in practice.
HuggingFace smolagents framework
First try
Luckily, we don't need to build code agents from scratch, as HuggingFace has released a handy library called smolagents that implements this approach.
Let's start by installing the library.
pip install smolagents[litellm]
# I've used the litellm extra, since I'm planning to use it with an OpenAI model
Next, let's build a basic example. To initialise the agent, we need just two parameters: model and tools.
I plan to use an OpenAI model, which is accessible via LiteLLM. However, the framework supports other options as well: you can use a local model via Ollama or TransformersModel, or public models via Inference Providers (you can find more details in the documentation).
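For example, a local model served by Ollama can be accessed through the same LiteLLMModel wrapper. Here is a rough sketch; the model name and port are assumptions, so adjust them to your local set-up.

from smolagents import LiteLLMModel

# local model served by Ollama, accessed via the LiteLLM naming convention
local_model = LiteLLMModel(
    model_id="ollama_chat/llama3.1",
    api_base="http://localhost:11434",
)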
I didn't specify any tools but used add_base_tools = True, so my agent has a default set of tools, such as a Python interpreter and DuckDuckGo search. Let's try it out with a simple query.
from smolagents import CodeAgent, LiteLLMModel

model = LiteLLMModel(model_id="openai/gpt-4o-mini",
                     api_key=config['OPENAI_API_KEY'])

agent = CodeAgent(tools=[], model=model, add_base_tools=True)

agent.run(
    """I have 5 different balls and I randomly select 2.
    How many possible combinations of the balls can I get?""",
)
As a result, we see a nicely formatted execution flow. It's just amazing and allows you to understand the process perfectly.

So, the agent found the answer in a single step and wrote Python code to calculate the number of combinations.
The output is quite helpful, but we can go even deeper and look at all the information related to the execution (including prompts) via agent.memory.steps. Let's look at the system prompt used by the agent.
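Here is a rough sketch of how to pull these out (attribute names may differ slightly between smolagents versions):

# list the types of steps recorded during the run
for step in agent.memory.steps:
    print(type(step).__name__)

# the system prompt is stored in memory as well
print(agent.memory.system_prompt.system_prompt)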
You are an expert assistant who can solve any task using code blobs.
You will be given a task to solve as best you can.
To do so, you have been given access to a list of tools: these tools
are basically Python functions which you can call with code.
To solve the task, you must plan forward to proceed in a series of
steps, in a cycle of 'Thought:', 'Code:',
and 'Observation:' sequences.

At each step, in the 'Thought:' sequence, you should first explain
your reasoning towards solving the task and the tools that you want
to use.
Then in the 'Code:' sequence, you should write the code in simple
Python. The code sequence must end with '<end_code>' sequence.
During each intermediate step, you can use 'print()' to save
whatever important information you will then need.
These print outputs will then appear in the 'Observation:' field,
which will be available as input for the next step.
In the end you have to return a final answer using
the final_answer tool.

Here are a few examples using notional tools: <...>
It's pretty clear that smolagents implements the ReAct approach (introduced in the paper by Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models") and uses a few-shot prompting technique.
The smolagents library handles all the behind-the-scenes work involved in the agent workflow: assembling the system prompt with all the necessary information for the LLM (e.g. available tools), parsing the output, and executing the generated code. It also provides comprehensive logging and a retry mechanism to help correct mistakes.
Moreover, the library offers memory management features. By default, all execution results are saved to memory, but you can customise this behaviour. For example, you can remove some intermediary results from memory to reduce the number of tokens, or execute the agent step by step. While we won't dive deep into memory management here, you can find handy code examples in the documentation.
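As a minimal sketch of the first option (assuming ActionStep is importable from the package root in your smolagents version), you can walk through agent.memory.steps after a run and drop observations you no longer need before continuing the conversation:

from smolagents import ActionStep

# drop heavy observations from all but the two most recent action steps
# to reduce the number of tokens sent with the next call
action_steps = [s for s in agent.memory.steps if isinstance(s, ActionStep)]
for step in action_steps[:-2]:
    step.observations = None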
Security
Now, it's time to discuss the drawbacks of the code agents' approach. Giving an LLM more agency by allowing it to execute arbitrary code introduces higher risks. Indeed, an LLM can run harmful code either by mistake (since LLMs are still far from perfect) or as a result of targeted attacks like prompt injections or compromised models.
To mitigate these risks, the local Python executor implemented in the smolagents library has a number of safety checks:
- By default, imports are not allowed unless the package has been explicitly added to the additional_authorized_imports list.
- Furthermore, submodules are blocked by default, so you have to authorise them explicitly (e.g. numpy.*). This is done because some packages can expose potentially harmful submodules, e.g. random._os.
- The total number of executed operations is capped, preventing infinite loops and resource bloat.
- Any operation not explicitly defined in the interpreter will raise an error.
Let's test whether these safety measures actually work.
from smolagents.local_python_executor import LocalPythonExecutor

custom_executor = LocalPythonExecutor(["numpy.*", "random"])

# helper to print exceptions in a readable format
def run_capture_exception(command: str):
    try:
        custom_executor(command)
    except Exception as e:
        print("ERROR:\n", e)
# Unauthorised imports are blocked
harmful_command = "import os; exit_code = os.system('')"
run_capture_exception(harmful_command)
# ERROR: Code execution failed at line 'import os' due to:
# InterpreterError: Import of os is not allowed. Authorized imports
# are: ['datetime', 'itertools', 're', 'math', 'statistics', 'time', 'queue',
# 'numpy.*', 'random', 'collections', 'unicodedata', 'stat']

# Submodules are also blocked unless authorised explicitly
harmful_command = "from random import _os; exit_code = _os.system('')"
run_capture_exception(harmful_command)
# ERROR: Code execution failed at line 'exit_code = _os.system('')'
# due to: InterpreterError: Forbidden access to module: os

# The cap on the number of iterations breaks infinite loops
harmful_command = '''
while True:
    pass
'''
run_capture_exception(harmful_command)
# ERROR: Code execution failed at line 'while True: pass' due to:
# InterpreterError: Maximum number of 1000000 iterations in While loop
# exceeded

# Undefined operations don't work
harmful_command = "!echo "
run_capture_exception(harmful_command)
# ERROR: Code parsing failed on line 1 due to: SyntaxError
It seems we’ve some safety nets with code agents. Nonetheless, despite these safeguards, risks persist once youāre executing code locally. For instance, an LLM can recursively create threads in your computer or create too many files, resulting in resource bloating. A possible solution is to execute code in a sandboxed environment, equivalent to using Docker or solutions like E2B. Iām willing to be adventurous and run my code locally, but in case you prefer a more risk-averse approach, you may follow the sandbox set-up guidance in the documentation.
Code agent vs traditional Tool-Calling agent
It's claimed that code agents perform better compared to the traditional JSON-based approach. Let's put this to the test.
I will use the task of metrics change analysis that I described in my previous article, "Making sense of KPI changes". We will start with an easy case: analysing a simple metric (revenue) split by one dimension (country).
import pandas as pd

raw_df = pd.read_csv('absolute_metrics_example.csv', sep = '\t')
df = raw_df.groupby('country')[['revenue_before', 'revenue_after_scenario_2']].sum()\
    .sort_values('revenue_before', ascending = False).rename(
        columns = {'revenue_after_scenario_2': 'after',
                   'revenue_before': 'before'})

The smolagents library provides two classes, which we can use to compare the two approaches:
- CodeAgent: an agent that acts by generating and executing code,
- ToolCallingAgent: a traditional JSON-based agent.
Our agents will need some tools, so let's implement them. There are multiple ways to create tools in smolagents: we can re-use LangChain tools, download them from the HuggingFace Hub, or simply create Python functions. We will take the most straightforward approach: writing a couple of Python functions and annotating them with @tool.
I will create two tools: one to estimate the relative difference between metrics, and another to calculate the sum of a list. Since the LLM will be using these tools, providing detailed descriptions is crucial.
from smolagents import tool

@tool
def calculate_metric_increase(before: float, after: float) -> float:
    """
    Calculate the percentage change of the metric between before and after

    Args:
        before: value before
        after: value after
    """
    # note: returns a positive number when the metric decreased
    return (before - after) * 100 / before

@tool
def calculate_sum(values: list) -> float:
    """
    Calculate the sum of list

    Args:
        values: list of numbers
    """
    return sum(values)
CodeAgent
Let's start with a CodeAgent. I've initialised the agent with the tools we defined earlier and authorised the use of some Python packages that might be helpful.
agent = CodeAgent(
    model=model,
    tools=[calculate_metric_increase, calculate_sum],
    max_steps=10,
    additional_authorized_imports=["pandas", "numpy", "matplotlib.*", "plotly.*"],
    verbosity_level=1
)

task = """
Here is a dataframe showing revenue by segment, comparing values
before and after.
Could you please help me understand the changes? Specifically:
1. Estimate how the total revenue and the revenue for each segment
have changed, both in absolute terms and as a percentage.
2. Calculate the contribution of each segment to the total
change in revenue.

Please round all floating-point numbers in the output
to two decimal places.
"""

agent.run(
    task,
    additional_args={"data": df},
)
Overall, the code agent completed the task in just two steps, using only 5,451 input and 669 output tokens. The result also looks quite plausible.
{'total_before': 1731985.21, 'total_after': 1599065.55, 'total_change': -132919.66,
 'segment_changes': {
     'absolute_change': {'other': 4233.09, 'UK': -4376.25, 'France': -132847.57,
                         'Germany': -690.99, 'Italy': 979.15, 'Spain': -217.09},
     'percentage_change': {'other': 0.67, 'UK': -0.91, 'France': -55.19,
                           'Germany': -0.43, 'Italy': 0.81, 'Spain': -0.23},
     'contribution_to_change': {'other': -3.18, 'UK': 3.29, 'France': 99.95,
                                'Germany': 0.52, 'Italy': -0.74, 'Spain': 0.16}}}
Let's take a look at the execution flow. The LLM received the following prompt.
──────────────────────────────── New run ────────────────────────────────

Here is a pandas dataframe showing revenue by segment,
comparing values before and after.
Could you please help me understand the changes?
Specifically:
1. Estimate how the total revenue and the revenue for each
segment have changed, both in absolute terms and as a
percentage.
2. Calculate the contribution of each segment to the total
change in revenue.

Please round all floating-point numbers in the output to two
decimal places.

You have been provided with these additional arguments, that
you can access using the keys as variables in your python
code:
{'df':             before      after
country
other     632767.39  637000.48
UK        481409.27  477033.02
France    240704.63  107857.06
Germany   160469.75  159778.76
Italy     120352.31  121331.46
Spain      96281.86   96064.77}.

──────────────────────── LiteLLMModel - openai/gpt-4o-mini ────────────────
In the first step, the LLM generated a dataframe and performed all the calculations. Interestingly, it chose to write all the code itself rather than using the provided tools.
Even more surprisingly, the LLM recreated the dataframe based on the input data instead of referencing it directly. This approach is not ideal (especially when working with massive datasets), as it can lead to errors and higher token usage. This behaviour could potentially be improved by using a more explicit system prompt. Here's the code the agent executed in the first step.
import pandas as pd

# Creating the DataFrame from the provided data
data = {
    'before': [632767.39, 481409.27, 240704.63, 160469.75,
               120352.31, 96281.86],
    'after': [637000.48, 477033.02, 107857.06, 159778.76,
              121331.46, 96064.77]
}
index = ['other', 'UK', 'France', 'Germany', 'Italy', 'Spain']
df = pd.DataFrame(data, index=index)

# Calculating total revenue before and after
total_before = df['before'].sum()
total_after = df['after'].sum()

# Calculating absolute and percentage change for each segment
df['absolute_change'] = df['after'] - df['before']
df['percentage_change'] = (df['absolute_change'] / df['before']) * 100

# Calculating total revenue change
total_change = total_after - total_before

# Calculating the contribution of each segment to the total change
df['contribution_to_change'] = (df['absolute_change'] / total_change) * 100

# Rounding the results
df = df.round(2)

# Printing the calculated results
print("Total revenue before:", total_before)
print("Total revenue after:", total_after)
print("Total change in revenue:", total_change)
print(df)
In the second step, the LLM simply constructed the final answer by referring to the variables calculated in the previous step (which is really neat).
final_answer({
    "total_before": round(total_before, 2),
    "total_after": round(total_after, 2),
    "total_change": round(total_change, 2),
    "segment_changes": df[['absolute_change', 'percentage_change',
                           'contribution_to_change']].to_dict()
})
It worked pretty well.
ToolCallingAgent
Now, it's time to see how a traditional tool-calling agent tackles this problem. We initialise it in a similar manner and run the same task.
from smolagents import ToolCallingAgent

traditional_agent = ToolCallingAgent(
    model=model,
    tools=[calculate_metric_increase, calculate_sum],
    max_steps=30,
)

task = """
Here is a dataframe showing revenue by segment, comparing values
before and after.
Could you please help me understand the changes? Specifically:
1. Estimate how the total revenue and the revenue for each segment
have changed, both in absolute terms and as a percentage.
2. Calculate the contribution of each segment to the total
change in revenue.

Please round all floating-point numbers in the output
to two decimal places.
"""

traditional_agent.run(
    task,
    additional_args={"data": df},
)
The results are far from ideal: only the relative changes are correct, while the rest of the numbers are pure hallucinations. I have to admit, the core issue was the lack of appropriate tools (specifically, tools to calculate absolute differences and to estimate shares). Still, the agent should have flagged the missing tools rather than generating random numbers.
Total revenue change: -7319.66 (-7.67%). Revenue Changes by Segment:
- Other: +232.09 (-0.67%)
- UK: -4376.25 (0.91%)
- France: -132847.57 (55.19%)
- Germany: -690.99 (0.43%)
- Italy: +979.15 (-0.81%)
- Spain: -217.09 (0.23%)
Contribution to total change:
- Other: 0.03%
- UK: -59.88%
- France: -181.77%
- Germany: -9.43%
- Italy: +13.38%
- Spain: -0.03%
In terms of resource usage, the tool-calling agent performed significantly worse: 12 steps, with 29,201 input and 1,695 output tokens. So, code agents clearly offer cost savings when it comes to agent execution.
Let's dig a bit deeper to understand what the agent actually did. First, it took four iterations just to figure out how to call calculate_sum correctly and estimate the total revenue before and after.
──────────────────────────────── Step 1 ────────────────────────────────
Calling tool: 'calculate_sum' with arguments:
{'values': {'before': [632767.39, 481409.27, 240704.63,
                       160469.75, 120352.31, 96281.86],
            'after': [637000.48, 477033.02, 107857.06, 159778.76,
                      121331.46, 96064.77]}}

Invalid call to tool 'calculate_sum' with arguments
{"values": {"before": [632767.39, 481409.27, 240704.63,
160469.75, 120352.31, 96281.86],
"after": [637000.48, 477033.02, 107857.06, 159778.76,
121331.46, 96064.77]}}:
unsupported operand type(s) for +: 'int' and 'str'
You should call this tool with correct input arguments.
Expected inputs: {"values": {"type": "object",
"description": "list of values"}}
Returns output type: number
Tool description: 'Calculate the sum of list'
[Step 1: Duration 4.06 sec | Input tokens: 1,190 | Output: 250]

──────────────────────────────── Step 2 ────────────────────────────────
Calling tool: 'calculate_sum' with arguments:
{'values': {'before': [632767.39, 481409.27, 240704.63,
                       160469.75, 120352.31, 96281.86],
            'after': [637000.48, 477033.02, 107857.06, 159778.76,
                      121331.46, 96064.77]}}

Invalid call to tool 'calculate_sum' with arguments
{"values": {"before": [632767.39, 481409.27, 240704.63,
160469.75, 120352.31, 96281.86],
"after": [637000.48, 477033.02, 107857.06, 159778.76,
121331.46, 96064.77]}}:
unsupported operand type(s) for +: 'int' and 'str'
You should call this tool with correct input arguments.
Expected inputs: {"values": {"type": "object",
"description": "list of values"}}
Returns output type: number
Tool description: 'Calculate the sum of list'
[Step 2: Duration 8.60 sec | Input tokens: 2,815 | Output: 386]

──────────────────────────────── Step 3 ────────────────────────────────
Calling tool: 'calculate_sum' with arguments:
{'values': [632767.39, 481409.27, 240704.63, 160469.75,
            120352.31, 96281.86]}
Observations: 1731985.2100000002
[Step 3: Duration 1.23 sec | Input tokens: 4,871 | Output: 488]

──────────────────────────────── Step 4 ────────────────────────────────
Calling tool: 'calculate_sum' with arguments:
{'values': [637000.48, 477033.02, 107857.06, 159778.76,
            121331.46, 96064.77]}
Observations: 1599065.55
The next seven steps were spent calculating the relative metric changes using the calculate_metric_increase tool.
──────────────────────────────── Step 5 ────────────────────────────────
Calling tool: 'calculate_metric_increase' with
arguments: {'before': 1731985.21, 'after': 1599065.55}
Observations: 7.674410799385517

<-- similar tool calls for all country segments -->
In the end, the agent put together the final answer call.
So, if the LLM had also had tools to calculate the absolute difference and the share of the sum, it would have taken an additional 14 iterations and even more tokens. Of course, we can prevent such inefficiencies by carefully designing the tools we provide:
- We could modify our functions to work with lists of values instead of single items, which would significantly reduce the number of steps (see the sketch after this list).
- Moreover, we could create more complex functions that calculate all the necessary metrics at once (similar to what the code agent did). This way, the LLM wouldn't have to perform calculations step by step. However, this approach might reduce the flexibility of the system.
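As an illustration of the first idea, here is a sketch of a hypothetical list-based tool (not one of the tools used above) that handles a whole dimension in a single call:

from smolagents import tool

@tool
def calculate_metric_changes(before: list, after: list) -> dict:
    """
    Calculate the total change, absolute changes, percentage changes and
    contributions to the total change for paired lists of metric values.

    Args:
        before: list of metric values before the change, one per segment
        after: list of metric values after the change, in the same order
    """
    total_change = sum(after) - sum(before)
    return {
        'total_change': round(total_change, 2),
        'absolute_changes': [round(a - b, 2) for b, a in zip(before, after)],
        'percentage_changes': [round((a - b) * 100 / b, 2) for b, a in zip(before, after)],
        'contributions': [round((a - b) * 100 / total_change, 2) for b, a in zip(before, after)],
    }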
Though the results weren't ideal due to a poor choice of tools, I still find this example quite insightful. It's clear that code agents are more powerful, cost-efficient and flexible, as they can invent their own comprehensive tools and perform multiple actions in a single step.
Summary
We've learned a lot about code agents. Now, it's time to wrap things up with a quick summary.
Code agents are LLM agents that "think" and act using Python code. Instead of calling tools via JSON, they generate and execute actual code. This makes them more flexible and cost-efficient, as they can invent their own comprehensive tools and perform multiple actions in a single step.
HuggingFace has brought this approach to life in their framework, smolagents. Smolagents makes it easy to build quite complex agents without much hassle, while also providing safety measures during code execution.
In this article, we've explored the basic functionality of the smolagents library. But there's a lot more to it. In the next article, we will dive into more advanced features (like multi-agent setups and planning steps) to build an agent that can narrate KPI changes. Stay tuned!
Reference
This article is inspired by the "Building Code Agents with Hugging Face smolagents" short course by DeepLearning.AI.