A Better Way to Execute Actions

By Aksel Joonas Reedi and Aymeric Roucher

Today we’re sharing research that bridges two powerful paradigms in AI agent design: the expressiveness of code-based actions and the reliability of structured generation. Our findings show that forcing CodeAgents to generate both their thoughts and their code in a structured JSON format can significantly outperform traditional approaches across multiple benchmarks.

Figure 1 (accuracy.png): Accuracy comparison of three approaches: Structured CodeAgent (blue), CodeAgent (orange), and ToolCallingAgent (gray) on SmolBench (GAIA, MATH, SimpleQA, and Frames). Error bars represent 95% confidence intervals.



🤔 The Evolution of Agent Actions

AI agents need to take actions in the world – whether that is calling APIs, processing data, or reasoning through complex problems. How agents express these actions has evolved through several paradigms:

Traditional JSON Agent: Agents generate structured JSON to call tools.

{"tool": "get_weather", "arguments": {"city": "Paris"}}

These agents operate by choosing from a list of predefined tools and generating JSON-formatted calls. This method of calling tools was popularized by OpenAI’s function calling API and has since become the most widely used way to call tools.

It’s reliable, but limited by:

  • A limited set of actions: The actions the agent can take are expressed only through predefined tools, which limits its functionality.
  • Lack of composability: If the task requires composing information from multiple sources, JSON agents struggle because they lack support for maintaining intermediate state across tool calls. While some models support parallel tool calls, they cannot easily handle scenarios where one tool’s output determines the next action or where results must be compared and processed together.
  • Rigid structure: Very limited in handling cases where tools don’t match exactly what needs to be done.

Code Agents: Agents make use of their innate coding ability and write executable Python code directly.


temperature_sum = 0
for city in ["Paris", "Tokyo", "New York"]:
    temp = get_weather(city)
    temperature_sum += temp
    
print(f"Average temperature: {temperature_sum / 3:.1f}°C")

This shift, first presented as CodeAct in the paper “Executable Code Actions Elicit Better LLM Agents”, gave AI agents the flexibility to write arbitrary executable Python code in addition to tool calling.

The key insight here is that tools are called directly from within the code, making variables and state management far more reliable. Agents can call tools inside loops, functions, and conditional statements – essentially generating a dynamic graph of tool execution in each action!
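For example, here is a minimal sketch (using hypothetical get_weather and send_alert tools) of a single code action that branches on an intermediate result:

# Hypothetical tools exposed to the agent: get_weather(city) -> float, send_alert(message) -> None
for city in ["Paris", "Tokyo", "New York"]:
    temp = get_weather(city)  # tool call inside a loop
    if temp > 35:             # the next tool call depends on this intermediate result
        send_alert(f"Heat warning for {city}: {temp:.1f}°C")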

Pros of using a CodeAgent:

  • Smart tool use: Agents decide which tools to use based on the context of the moment.
  • Unlimited flexibility: Can use any Python functionality to achieve a goal.
  • Ability to test hypotheses: Agents can hypothesize and test, resulting in more flexibility in their actions.

However, parsing code from markdown can be error-prone, which leads us to a proposal: why not use structured generation to generate code actions?



➡️ Adding Structured Outputs to Code Agents

With structured outputs, you can force the LLM to generate explicit thoughts and code as a JSON blob:


{
  "thoughts": "I need to find the average temperature across 3 cities.",
  "code": "temperature_sum = 0\nfor city in [\"Paris\", \"Tokyo\", \"New York\"]:\n    temp = get_weather(city)\n    temperature_sum += temp\n\nprint(f\"Average temperature: {temperature_sum / 3:.1f}°C\")"
}

The key difference is that the generation is enforced: instead of merely being prompted to output thoughts, then code, the use of structured outputs forces the model to respect the structure.
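Concretely, the structure can be enforced by constraining decoding with a JSON schema along these lines (an illustrative sketch, not the exact schema smolagents uses internally); providers that support structured outputs accept such a schema through a response_format-style parameter:

# Illustrative schema: both fields are required, so the model must emit its
# reasoning ("thoughts") before its action ("code").
agent_action_schema = {
    "type": "object",
    "properties": {
        "thoughts": {"type": "string", "description": "Reasoning about the next step"},
        "code": {"type": "string", "description": "Executable Python code for this step"},
    },
    "required": ["thoughts", "code"],
    "additionalProperties": False,
}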

This approach adds the reliability of structured generation to the flexibility of code execution, getting the best of both worlds.

  • Explicit reasoning: The thoughts field forces the agent to reason right before it takes an action.
  • Reliable parsing: The JSON structure eliminates markdown parsing errors.
  • Full code expressiveness: The code field retains all the flexibility of code agents.
  • Better separation: Clear separation between planning and execution.



🧪 Benchmark Results

We compared these three paradigms across multiple benchmarks, including GAIA, MATH, SimpleQA, and Frames. The results show a clear pattern: code actions + structured generation consistently improves performance for capable models.

Across the most capable models, the structured approach consistently outperformed the regular CodeAgent approach by 2-7 percentage points on average.

  • OpenAI models: Show the largest improvements with structure, particularly on reasoning-heavy tasks.
  • Claude models: Benefit from structure, with Claude 3.7 Sonnet showing especially strong results.
  • Qwen models: Generally improve with structure, though the “structure tax” (see the next section) creeps in for smaller models.



💡 Why Structure (Generally) Helps



The Parsing Problem is Real

Our implementation of CodeAgent in smolagents extracts Python code from the LLM output, which can fail when:

  • The markdown code block is incomplete or incorrectly formatted
  • Multiple code blocks appear in a single response

Structured generation eliminates these issues with reliable JSON parsing.
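To see the difference, compare the two parsing paths in this simplified sketch (illustrative only, not the actual smolagents parser): extracting code from a markdown reply depends on the model formatting its fences correctly, while the structured path is a single json.loads call:

import json
import re

def parse_markdown_action(llm_output: str) -> str:
    # Fragile: fails if the code fence is missing, unclosed, or duplicated
    match = re.search(r"```(?:python)?\n(.*?)```", llm_output, re.DOTALL)
    if match is None:
        raise ValueError("No well-formed code block found")
    return match.group(1)

def parse_structured_action(llm_output: str) -> str:
    # Reliable: structured generation guarantees valid JSON with a "code" field
    return json.loads(llm_output)["code"]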

To understand why structured generation matters, we analyzed 15,724 agent traces across our benchmarks. The results are striking:

  • 2.4% of traces had parsing errors in their first call
  • Traces with first call parsing errors: 42.3% success rate
  • Traces without first call parsing errors: 51.3% success rate

Agent traces without parsing errors succeed 21.3% more often (in relative terms: 51.3% vs. 42.3%) than those with parsing errors.

This isn’t just about convenience – parsing errors create a cascade of failures that significantly impact overall agent performance. When an agent cannot execute its first action due to malformed code, it often struggles to recover, resulting in suboptimal problem-solving paths.

Figure 2 (parsing error.png): Parsing errors in the first step reduce the agent’s success rate by 21.3% and increase the average number of steps taken from 3.18 to 4.63.

Moreover: Enforced Reasoning Process

The use of structured generation and explicit thoughts not only prompts but forces agents to articulate their reasoning before acting. This leads to:

  • Better planning: Agents think through problems more systematically.
  • Enhanced reliability: Explicit reasoning catches logical errors early.



The Structure Tax

Our results also reveal a clear capability threshold: models need sufficient instruction-following ability and enough JSON coverage in their pre-training data to benefit from structured generation. This means that structured approaches work best with:

  • Large, well-trained models
  • Models with strong instruction-following capabilities
  • Models fine-tuned on structured generation.



When Structure Breaks: A Real Example

Here’s what happens when a smaller model (e.g. mistralai/Mistral-7B-Instruct-v0.3) tries to generate structured code – the cognitive load becomes too much:

{
  "thought": "I want to seek out the peak...",
  "code": "web_search(query="Eiffel Tower height")", "
}

The model generates syntactically broken Python code: web_search(query="Eiffel Tower height")", – notice the malformed string with an extra quote and comma. This leads to an immediate SyntaxError and execution failure.

This illustrates the “structure tax”: smaller models struggle to simultaneously handle JSON formatting, Python syntax, and the actual problem-solving logic. The cognitive overhead of structured generation can overwhelm models that would otherwise perform reasonably well with simpler markdown-based code generation.



🚀 When to Use Structured CodeAgents

✅ Use Structured CodeAgents when:

  • Working with capable models (32B+ parameters or frontier models)
  • Tasks require complex reasoning and code execution
  • You need reliable parsing of agent outputs

⚠️ Consider alternatives when:

  • Working with smaller models that struggle with structured generation
  • Simple, predefined workflows are sufficient



How to use it with smolagents:

It’s super easy! Just enable it with use_structured_outputs_internally:

from smolagents import CodeAgent, InferenceClientModel, GoogleSearchTool


agent = CodeAgent(
    tools=[GoogleSearchTool(provider="serper")],
    model=InferenceClientModel("Qwen/Qwen3-235B-A22B", provider='nebius'),
    use_structured_outputs_internally=True 
)

result = agent.run("Calculate the time for a cheetah to run across the Golden Gate Bridge")

The LLM will generate something like this:

{
  "thoughts": "I need to find the length of the Golden Gate Bridge and the top speed of a cheetah, then calculate the time.",
  "code": "bridge_info = web_search('Golden Gate Bridge length meters')\ncheetah_speed = web_search('Cheetah top speed') ..."
}

Then the “code” part gets executed by the agent as usual: this is the standard CodeAgent, but now with 100% parsing reliability!
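Conceptually, each step then looks something like this simplified sketch (not the actual smolagents internals; the real agent runs the code in a sandboxed Python interpreter rather than plain exec):

import json

def run_step(llm_response: str, tools: dict) -> None:
    action = json.loads(llm_response)       # reliable parsing of the structured output
    print("Thoughts:", action["thoughts"])  # the reasoning stays visible in the trace
    exec(action["code"], {**tools})         # run the code with the tools in scope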



Implementation Tips

  1. Clear prompting: Ensure your prompts clearly specify the expected JSON structure (see the sketch below).
  2. Model selection: Choose models with strong structured generation capabilities.
  3. Choose the right provider: Some API providers, like OpenAI or Anthropic, support structured generation out of the box. If you’re using inference providers through Hugging Face, support for structured generation varies across providers. Here is a list of providers that support structured generation: Structured generation support for Models in smolagents.
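For the first tip, a system-prompt excerpt along these lines (an illustrative sketch, not the exact smolagents prompt) makes the expected structure explicit:

STRUCTURED_ACTION_INSTRUCTIONS = """
Respond with a single JSON object and nothing else:
{
  "thoughts": "<your reasoning about the next step>",
  "code": "<executable Python code for this step>"
}
Do not wrap the JSON object in markdown code fences.
"""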



The Bigger Picture – What’s Next?

This research suggests we’re moving toward a more nuanced understanding of agent architectures. It isn’t just about “what can the agent do?” but “how should the agent think about what it’s doing?”

Maybe making the reasoning process more explicit helps the model stay on track. Or maybe it’s just easier to parse. Either way, it’s a win.

But this is only the beginning. There are so many questions left to explore:

  • What other structural improvements could help?
  • How can we make this work better across different model architectures, especially smol models?
  • What does this tell us about the nature of AI reasoning?

For now, if you’re using smolagents (or building your own CodeAgent system), consider giving structured outputs a try. Your parsing errors will thank you, and you might just see a nice boost in performance!


