We are living in the era of AI agents. LLMs are no longer just tools. They've become active participants in our lives, boosting productivity and transforming the way we live and work.
- OpenAI recently launched Operator, an AI agent that can autonomously perform various tasks, from browsing the web to filling out forms and scheduling appointments.
- Anthropic introduced MCP (Model Context Protocol), a new standard for how AI assistants interact with the outside world. With over five thousand active MCP servers already, adoption is growing rapidly.
- AI agents are also changing the landscape of software engineering. Tools like GitHub Copilot's agentic mode, Claude Code, OpenAI Codex, and others are not only improving developer productivity and code quality but also democratising the field, making software development accessible to people without a technical background.
We've previously looked at different AI agent frameworks, such as LangGraph and CrewAI. In this article, I would like to discuss a new one I've been exploring recently: HuggingFace smolagents. It's an interesting framework because it implements the concept of code agents.
In this article, we will explore several topics:
- What code agents are.
- How to use the HuggingFace smolagents framework in practice.
- Whether it's safe to give LLMs so much agency.
- The actual difference in performance between code agents and traditional tool-calling agents.
AI Agents recap
Let's start with a quick refresher: what exactly are AI agents? HuggingFace provides a clear and concise definition of what they mean by agents.
So, we need an agentic flow when we want a system to reason and act based on observations. In fact, agency is not a binary variable (yes or no) but a spectrum.
- At one end, we have systems with no agency at all, for example, a simple process where an LLM determines the sentiment of a text, translates it, or summarises it.
- The next level is routing, where an LLM can classify an incoming query and decide which path to take: for example, calling a tool if a customer is asking about the status of their current order, and transferring the conversation to a human CS agent otherwise.
- More advanced systems can exhibit higher degrees of agency. These might include the ability to call other LLMs (multi-agent setups) or even create new tools on the fly.
Code agents fall into this more advanced category. They are multi-step agents that execute tool calls in the form of code, in contrast to the more traditional approach of using a JSON format with the tool name and arguments.
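To make the difference concrete, here is a simplified illustration (get_weather and final_answer are hypothetical tools used purely for this example): a tool-calling agent emits a JSON blob describing one call, while a code agent emits a Python snippet that the framework then executes.

# traditional tool-calling agent: one JSON action per step,
# mapped by the framework to a single tool invocation
json_action = {
    "tool": "get_weather",
    "arguments": {"city": "Paris"}
}

# code agent: the action is Python code, so several tool calls,
# loops and post-processing can happen within one step
code_action = """
temperatures = {city: get_weather(city) for city in ["Paris", "London", "Berlin"]}
final_answer(max(temperatures, key=temperatures.get))
"""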
Several recent papers have shown that using code in agentic flows leads to better results.
It makes sense when you think about it. We've been developing programming languages for decades to solve complex problems. So, it's natural that these languages are better suited to expressing LLM tasks than simple JSON configs. An additional benefit is that LLMs are already quite good at writing code in common programming languages, thanks to the vast amount of data available for training.
This approach comes with several other advantages as well:
- By generating code, an LLM is not limited to a predefined set of tools and can create its own functions.
- It can combine multiple tools within a single action using conditions and loops, which helps reduce the number of steps required to complete a task.
- It also enables the model to work with a wider variety of outputs, such as generating charts, images, or other complex objects.
These advantages aren't just theoretical; we can observe them in practice. In "Executable Code Actions Elicit Better LLM Agents", the authors show that code agents outperform traditional methods, achieving a higher success rate and completing tasks in fewer steps, which in turn reduces costs.
Code agents look promising, which inspired me to try this approach in practice.
HuggingFace smolagents framework
First try
Luckily, we don't need to build code agents from scratch, as HuggingFace has released a handy library called smolagents that implements this approach.
Let's start by installing the library.
pip install smolagents[litellm]
# I've used the litellm extra, since I'm planning to use it with an OpenAI model
Next, let's build a basic example. To initialise the agent, we need just two parameters: model and tools.
I plan to use an OpenAI model, which is accessible via LiteLLM. However, the framework supports other options as well: you can use a local model via Ollama or TransformersModel, or public models via Inference Providers (you can find more details in the documentation).
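For example, a local model served by Ollama can be accessed through the same LiteLLMModel wrapper. Here is a rough sketch; the model name and port are assumptions, so adjust them to your local set-up.

from smolagents import LiteLLMModel

# local model served by Ollama, accessed via the LiteLLM naming convention
local_model = LiteLLMModel(
    model_id="ollama_chat/llama3.1",
    api_base="http://localhost:11434",
)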
I didn't specify any tools but used add_base_tools = True, so my agent has a default set of tools, such as a Python interpreter and DuckDuckGo search. Let's try it out with a simple query.
from smolagents import CodeAgent, LiteLLMModel

model = LiteLLMModel(model_id="openai/gpt-4o-mini",
                     api_key=config['OPENAI_API_KEY'])

agent = CodeAgent(tools=[], model=model, add_base_tools=True)

agent.run(
    """I have 5 different balls and I randomly select 2.
    How many possible combinations of the balls can I get?""",
)
As a result, we see a nicely formatted execution flow. It's just amazing and allows you to understand the process perfectly.

So, the agent found the answer in a single step and wrote Python code to calculate the number of combinations.
The output is quite helpful, but we can go even deeper and look at all the information related to the execution (including prompts) via agent.memory.steps. Let's look at the system prompt used by the agent.
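Here is a rough sketch of how to pull these out (attribute names may differ slightly between smolagents versions):

# list the types of steps recorded during the run
for step in agent.memory.steps:
    print(type(step).__name__)

# the system prompt is stored in memory as well
print(agent.memory.system_prompt.system_prompt)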
You are an expert assistant who can solve any task using code blobs.
You will be given a task to solve as best you can.
To do so, you have been given access to a list of tools: these tools
are basically Python functions which you can call with code.
To solve the task, you must plan forward to proceed in a series of
steps, in a cycle of 'Thought:', 'Code:',
and 'Observation:' sequences.

At each step, in the 'Thought:' sequence, you should first explain
your reasoning towards solving the task and the tools that you want
to use.
Then in the 'Code:' sequence, you should write the code in simple
Python. The code sequence must end with '<end_code>' sequence.
During each intermediate step, you can use 'print()' to save
whatever important information you will then need.
These print outputs will then appear in the 'Observation:' field,
which will be available as input for the next step.
In the end you have to return a final answer using
the final_answer tool.

Here are a few examples using notional tools: <...>
It's pretty clear that smolagents implements the ReAct approach (introduced in the paper by Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models") and uses a few-shot prompting technique.
The smolagents library handles all the behind-the-scenes work involved in the agent workflow: assembling the system prompt with all the necessary information for the LLM (e.g. available tools), parsing the output, and executing the generated code. It also provides comprehensive logging and a retry mechanism to help correct mistakes.
Moreover, the library offers memory management features. By default, all execution results are saved to memory, but you can customise this behaviour. For example, you can remove some intermediary results from memory to reduce the number of tokens, or execute the agent step by step. While we won't dive deep into memory management here, you can find handy code examples in the documentation.
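As a minimal sketch of the first option (assuming ActionStep is importable from the package root in your smolagents version), you can walk through agent.memory.steps after a run and drop observations you no longer need before continuing the conversation:

from smolagents import ActionStep

# drop heavy observations from all but the two most recent action steps
# to reduce the number of tokens sent with the next call
action_steps = [s for s in agent.memory.steps if isinstance(s, ActionStep)]
for step in action_steps[:-2]:
    step.observations = None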
Security
Now, it's time to discuss the drawbacks of the code agents' approach. Giving an LLM more agency by allowing it to execute arbitrary code introduces higher risks. Indeed, an LLM can run harmful code either by mistake (since LLMs are still far from perfect) or as a result of targeted attacks like prompt injections or compromised models.
To mitigate these risks, the local Python executor implemented in the smolagents library has a number of safety checks:
- By default, imports are not allowed unless the package has been explicitly added to the additional_authorized_imports list.
- Furthermore, submodules are blocked by default, so you have to authorise them explicitly (e.g. numpy.*). This is done because some packages can expose potentially harmful submodules, e.g. random._os.
- The total number of executed operations is capped, preventing infinite loops and resource bloat.
- Any operation not explicitly defined in the interpreter will raise an error.
Let's test whether these safety measures actually work.
from smolagents.local_python_executor import LocalPythonExecutor

custom_executor = LocalPythonExecutor(["numpy.*", "random"])

# helper to print exceptions in a readable format
def run_capture_exception(command: str):
    try:
        custom_executor(command)
    except Exception as e:
        print("ERROR:\n", e)
# Unauthorised imports are blocked
harmful_command = "import os; exit_code = os.system('')"
run_capture_exception(harmful_command)
# ERROR: Code execution failed at line 'import os' due to:
# InterpreterError: Import of os is not allowed. Authorized imports
# are: ['datetime', 'itertools', 're', 'math', 'statistics', 'time', 'queue',
# 'numpy.*', 'random', 'collections', 'unicodedata', 'stat']

# Submodules are also blocked unless authorised explicitly
harmful_command = "from random import _os; exit_code = _os.system('')"
run_capture_exception(harmful_command)
# ERROR: Code execution failed at line 'exit_code = _os.system('')'
# due to: InterpreterError: Forbidden access to module: os

# The cap on the number of iterations breaks infinite loops
harmful_command = '''
while True:
    pass
'''
run_capture_exception(harmful_command)
# ERROR: Code execution failed at line 'while True: pass' due to:
# InterpreterError: Maximum number of 1000000 iterations in While loop
# exceeded

# Undefined operations don't work
harmful_command = "!echo "
run_capture_exception(harmful_command)
# ERROR: Code parsing failed on line 1 due to: SyntaxError
It seems we’ve some safety nets with code agents. Nonetheless, despite these safeguards, risks persist once youāre executing code locally. For instance, an LLM can recursively create threads in your computer or create too many files, resulting in resource bloating. A possible solution is to execute code in a sandboxed environment, equivalent to using Docker or solutions like E2B. Iām willing to be adventurous and run my code locally, but in case you prefer a more risk-averse approach, you may follow the sandbox set-up guidance in the documentation.
Code agent vs traditional Tool-Calling agent
It's claimed that code agents perform better compared to the traditional JSON-based approach. Let's put this to the test.
I will use the task of metrics change analysis that I described in my previous article, "Making sense of KPI changes". We will start with an easy case: analysing a simple metric (revenue) split by one dimension (country).
import pandas as pd

raw_df = pd.read_csv('absolute_metrics_example.csv', sep = '\t')
df = raw_df.groupby('country')[['revenue_before', 'revenue_after_scenario_2']].sum()\
    .sort_values('revenue_before', ascending = False).rename(
        columns = {'revenue_after_scenario_2': 'after',
                   'revenue_before': 'before'})

The smolagents library provides two classes, which we can use to compare the two approaches:
- CodeAgent: an agent that acts by generating and executing code,
- ToolCallingAgent: a traditional JSON-based agent.
Our agents will need some tools, so let's implement them. There are multiple ways to create tools in smolagents: we can re-use LangChain tools, download them from the HuggingFace Hub, or simply create Python functions. We will take the most straightforward approach: writing a couple of Python functions and annotating them with @tool.
I will create two tools: one to estimate the relative difference between metrics, and another to calculate the sum of a list. Since the LLM will be using these tools, providing detailed descriptions is crucial.
from smolagents import tool

@tool
def calculate_metric_increase(before: float, after: float) -> float:
    """
    Calculate the percentage change of the metric between before and after

    Args:
        before: value before
        after: value after
    """
    # note: returns a positive number when the metric decreased
    return (before - after) * 100 / before

@tool
def calculate_sum(values: list) -> float:
    """
    Calculate the sum of list

    Args:
        values: list of numbers
    """
    return sum(values)
CodeAgent
Let's start with a CodeAgent. I've initialised the agent with the tools we defined earlier and authorised the use of some Python packages that might be helpful.
agent = CodeAgent(
    model=model,
    tools=[calculate_metric_increase, calculate_sum],
    max_steps=10,
    additional_authorized_imports=["pandas", "numpy", "matplotlib.*", "plotly.*"],
    verbosity_level=1
)

task = """
Here is a dataframe showing revenue by segment, comparing values
before and after.
Could you please help me understand the changes? Specifically:
1. Estimate how the total revenue and the revenue for each segment
have changed, both in absolute terms and as a percentage.
2. Calculate the contribution of each segment to the total
change in revenue.

Please round all floating-point numbers in the output
to two decimal places.
"""

agent.run(
    task,
    additional_args={"data": df},
)
Overall, the code agent completed the task in just two steps, using only 5,451 input and 669 output tokens. The result also looks quite plausible.
{'total_before': 1731985.21, 'total_after': 1599065.55, 'total_change': -132919.66,
 'segment_changes': {
     'absolute_change': {'other': 4233.09, 'UK': -4376.25, 'France': -132847.57,
                         'Germany': -690.99, 'Italy': 979.15, 'Spain': -217.09},
     'percentage_change': {'other': 0.67, 'UK': -0.91, 'France': -55.19,
                           'Germany': -0.43, 'Italy': 0.81, 'Spain': -0.23},
     'contribution_to_change': {'other': -3.18, 'UK': 3.29, 'France': 99.95,
                                'Germany': 0.52, 'Italy': -0.74, 'Spain': 0.16}}}
Let's take a look at the execution flow. The LLM received the following prompt.
──────────────────────────────── New run ────────────────────────────────

Here is a pandas dataframe showing revenue by segment,
comparing values before and after.
Could you please help me understand the changes?
Specifically:
1. Estimate how the total revenue and the revenue for each
segment have changed, both in absolute terms and as a
percentage.
2. Calculate the contribution of each segment to the total
change in revenue.

Please round all floating-point numbers in the output to two
decimal places.

You have been provided with these additional arguments, that
you can access using the keys as variables in your python
code:
{'df':             before      after
country
other     632767.39  637000.48
UK        481409.27  477033.02
France    240704.63  107857.06
Germany   160469.75  159778.76
Italy     120352.31  121331.46
Spain      96281.86   96064.77}.

──────────────────────── LiteLLMModel - openai/gpt-4o-mini ────────────────
In the first step, the LLM generated a dataframe and performed all the calculations. Interestingly, it chose to write all the code itself rather than using the provided tools.
Even more surprisingly, the LLM recreated the dataframe based on the input data instead of referencing it directly. This approach is not ideal (especially when working with massive datasets), as it can lead to errors and higher token usage. This behaviour could potentially be improved by using a more explicit system prompt. Here's the code the agent executed in the first step.
import pandas as pd

# Creating the DataFrame from the provided data
data = {
    'before': [632767.39, 481409.27, 240704.63, 160469.75,
               120352.31, 96281.86],
    'after': [637000.48, 477033.02, 107857.06, 159778.76,
              121331.46, 96064.77]
}
index = ['other', 'UK', 'France', 'Germany', 'Italy', 'Spain']
df = pd.DataFrame(data, index=index)

# Calculating total revenue before and after
total_before = df['before'].sum()
total_after = df['after'].sum()

# Calculating absolute and percentage change for each segment
df['absolute_change'] = df['after'] - df['before']
df['percentage_change'] = (df['absolute_change'] / df['before']) * 100

# Calculating total revenue change
total_change = total_after - total_before

# Calculating the contribution of each segment to the total change
df['contribution_to_change'] = (df['absolute_change'] / total_change) * 100

# Rounding the results
df = df.round(2)

# Printing the calculated results
print("Total revenue before:", total_before)
print("Total revenue after:", total_after)
print("Total change in revenue:", total_change)
print(df)
In the second step, the LLM simply constructed the final answer by referring to the variables calculated in the previous step (which is really neat).
final_answer({
    "total_before": round(total_before, 2),
    "total_after": round(total_after, 2),
    "total_change": round(total_change, 2),
    "segment_changes": df[['absolute_change', 'percentage_change',
                           'contribution_to_change']].to_dict()
})
It worked pretty well.
ToolCallingAgent
Now, it's time to see how a traditional tool-calling agent tackles this problem. We initialise it in a similar manner and run the same task.
from smolagents import ToolCallingAgent

traditional_agent = ToolCallingAgent(
    model=model,
    tools=[calculate_metric_increase, calculate_sum],
    max_steps=30,
)

task = """
Here is a dataframe showing revenue by segment, comparing values
before and after.
Could you please help me understand the changes? Specifically:
1. Estimate how the total revenue and the revenue for each segment
have changed, both in absolute terms and as a percentage.
2. Calculate the contribution of each segment to the total
change in revenue.

Please round all floating-point numbers in the output
to two decimal places.
"""

traditional_agent.run(
    task,
    additional_args={"data": df},
)
The results are far from ideal: only the relative changes are correct, while the rest of the numbers are pure hallucinations. I have to admit, the core issue was the lack of appropriate tools (specifically, tools to calculate absolute differences and to estimate shares). Still, the agent should have flagged the missing tools rather than generating random numbers.
Total revenue change: -7319.66 (-7.67%). Revenue Changes by Segment:
- Other: +232.09 (-0.67%)
- UK: -4376.25 (0.91%)
- France: -132847.57 (55.19%)
- Germany: -690.99 (0.43%)
- Italy: +979.15 (-0.81%)
- Spain: -217.09 (0.23%)
Contribution to total change:
- Other: 0.03%
- UK: -59.88%
- France: -181.77%
- Germany: -9.43%
- Italy: +13.38%
- Spain: -0.03%
In terms of resource usage, the tool-calling agent performed significantly worse: 12 steps, with 29,201 input and 1,695 output tokens. So, code agents clearly offer cost savings when it comes to agent execution.
Let's dig a bit deeper to understand what the agent actually did. First, it took four iterations just to figure out how to call calculate_sum correctly and estimate the total revenue before and after.
──────────────────────────────── Step 1 ────────────────────────────────
Calling tool: 'calculate_sum' with arguments:
{'values': {'before': [632767.39, 481409.27, 240704.63,
                       160469.75, 120352.31, 96281.86],
            'after': [637000.48, 477033.02, 107857.06, 159778.76,
                      121331.46, 96064.77]}}

Invalid call to tool 'calculate_sum' with arguments
{"values": {"before": [632767.39, 481409.27, 240704.63,
160469.75, 120352.31, 96281.86],
"after": [637000.48, 477033.02, 107857.06, 159778.76,
121331.46, 96064.77]}}:
unsupported operand type(s) for +: 'int' and 'str'
You should call this tool with correct input arguments.
Expected inputs: {"values": {"type": "object",
"description": "list of values"}}
Returns output type: number
Tool description: 'Calculate the sum of list'
[Step 1: Duration 4.06 sec | Input tokens: 1,190 | Output: 250]

──────────────────────────────── Step 2 ────────────────────────────────
Calling tool: 'calculate_sum' with arguments:
{'values': {'before': [632767.39, 481409.27, 240704.63,
                       160469.75, 120352.31, 96281.86],
            'after': [637000.48, 477033.02, 107857.06, 159778.76,
                      121331.46, 96064.77]}}

Invalid call to tool 'calculate_sum' with arguments
{"values": {"before": [632767.39, 481409.27, 240704.63,
160469.75, 120352.31, 96281.86],
"after": [637000.48, 477033.02, 107857.06, 159778.76,
121331.46, 96064.77]}}:
unsupported operand type(s) for +: 'int' and 'str'
You should call this tool with correct input arguments.
Expected inputs: {"values": {"type": "object",
"description": "list of values"}}
Returns output type: number
Tool description: 'Calculate the sum of list'
[Step 2: Duration 8.60 sec | Input tokens: 2,815 | Output: 386]

──────────────────────────────── Step 3 ────────────────────────────────
Calling tool: 'calculate_sum' with arguments:
{'values': [632767.39, 481409.27, 240704.63, 160469.75,
            120352.31, 96281.86]}
Observations: 1731985.2100000002
[Step 3: Duration 1.23 sec | Input tokens: 4,871 | Output: 488]

──────────────────────────────── Step 4 ────────────────────────────────
Calling tool: 'calculate_sum' with arguments:
{'values': [637000.48, 477033.02, 107857.06, 159778.76,
            121331.46, 96064.77]}
Observations: 1599065.55
The next seven steps were spent calculating the relative metric changes using the calculate_metric_increase tool.
──────────────────────────────── Step 5 ────────────────────────────────
Calling tool: 'calculate_metric_increase' with
arguments: {'before': 1731985.21, 'after': 1599065.55}
Observations: 7.674410799385517

<-- similar tool calls for all country segments -->
In the end, the agent put together the final answer call.
So, if the LLM had also had tools to calculate the absolute difference and the share of the sum, it would have taken an additional 14 iterations and even more tokens. Of course, we can prevent such inefficiencies by carefully designing the tools we provide:
- We could modify our functions to work with lists of values instead of single items, which would significantly reduce the number of steps (see the sketch after this list).
- Moreover, we could create more complex functions that calculate all the necessary metrics at once (similar to what the code agent did). This way, the LLM wouldn't have to perform calculations step by step. However, this approach might reduce the flexibility of the system.
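As an illustration of the first idea, here is a sketch of a hypothetical list-based tool (not one of the tools used above) that handles a whole dimension in a single call:

from smolagents import tool

@tool
def calculate_metric_changes(before: list, after: list) -> dict:
    """
    Calculate the total change, absolute changes, percentage changes and
    contributions to the total change for paired lists of metric values.

    Args:
        before: list of metric values before the change, one per segment
        after: list of metric values after the change, in the same order
    """
    total_change = sum(after) - sum(before)
    return {
        'total_change': round(total_change, 2),
        'absolute_changes': [round(a - b, 2) for b, a in zip(before, after)],
        'percentage_changes': [round((a - b) * 100 / b, 2) for b, a in zip(before, after)],
        'contributions': [round((a - b) * 100 / total_change, 2) for b, a in zip(before, after)],
    }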
Though the results weren't ideal due to a poor choice of tools, I still find this example quite insightful. It's clear that code agents are more powerful, cost-efficient and flexible, as they can invent their own comprehensive tools and perform multiple actions in a single step.
Summary
We've learned a lot about code agents. Now, it's time to wrap things up with a quick summary.
Code agents are LLM agents that "think" and act using Python code. Instead of calling tools via JSON, they generate and execute actual code. This makes them more flexible and cost-efficient, as they can invent their own comprehensive tools and perform multiple actions in a single step.
HuggingFace has brought this approach to life in their framework, smolagents. Smolagents makes it easy to build quite complex agents without much hassle, while also providing safety measures during code execution.
In this article, we've explored the basic functionality of the smolagents library. But there's a lot more to it. In the next article, we will dive into more advanced features (like multi-agent setups and planning steps) to build an agent that can narrate KPI changes. Stay tuned!
Reference
This article is inspired by the "Building Code Agents with Hugging Face smolagents" short course by DeepLearning.AI.