A Lightweight Math Reasoning Agent with SmolAgents



Illustration: an LLM using a calculator to answer questions.

By Intel AI Software Group

DeepMath is an aligned math reasoning agent built on Qwen3-4B Thinking and fine-tuned with GRPO (Group Relative Policy Optimization). Instead of verbose text, the model emits tiny Python snippets for intermediate steps, runs them in a secure sandbox, and folds the results back into its reasoning, reducing errors and output length. The agent is implemented using the smolagents library.

We evaluate DeepMath on 4 math datasets: MATH500, AIME, HMMT, and HLE, and show that:

  • 🤖 The math agent alone reduces output lengths by up to 66%, while often improving accuracy.

  • ⚡ GRPO training improves the agent's performance even further, on nearly all benchmarks.

👉 Code and evaluation scripts: https://github.com/IntelLabs/DeepMath
👉 Model: https://huggingface.co/Intel/deepmath-v1



Why DeepMath?

Large language models (LLMs) have advanced reasoning capabilities, but mathematical problem-solving remains challenging; chain-of-thought traces can be lengthy and prone to arithmetic mistakes. Recent works[^1][^2] show that small models can reach strong performance, and other studies[^3] investigate tool use to improve reliability. What those papers generally don't emphasize is reducing trace verbosity or explicitly training models to prefer short, computation-oriented traces executed in a constrained, auditable environment.

We focused on two goals:

  1. Offload deterministic computation to a secure executor.

  2. Train models to prefer concise, computation-oriented traces over verbose text.

DeepMath tackles this by combining a small Python executor with a fine-tuned LLM, enabling concise, computation-driven reasoning. The model learns to generate short Python snippets, which are executed in a sandbox and reintegrated into the context. GRPO fine-tuning encourages this behavior by rewarding correctness and favoring shorter outputs.
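
To make the pattern concrete, here is a toy illustration (not the DeepMath interface): the model emits a tiny snippet instead of multi-line textual arithmetic, the snippet is executed, and its output is folded back into the trace. The `<output>` delimiter and the capture logic below are assumptions for illustration only.

```python
import contextlib
import io

# Snippet the model might emit instead of spelling out 10*9*8/6 in prose:
snippet = "from math import comb\nprint(comb(10, 3))"

# Execute the snippet and capture its stdout (in DeepMath this happens
# inside a constrained sandbox, not a bare exec()).
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    exec(snippet, {})
result = buf.getvalue().strip()  # "120"

# Fold the execution result back into the reasoning trace.
trace = (
    "To count the ways to choose 3 of 10 items, compute C(10, 3).\n"
    f"<output>{result}</output>\n"
    f"So the answer is {result}."
)
print(trace)
```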



How It Works

  • Base model: Qwen3-4B Thinking.
  • Executor constraints: sandboxed environment, allow-list of imported modules, per-snippet timeout.
  • Inference: a math agent was built on smolagents; vLLM is used as the inference engine (a minimal setup sketch follows Figure 1).
  • Training: based on the GRPO trainer in TRL, we modified TRL’s vLLM client and server to generate GRPO completions using our DeepMath agent.

Changes to vLLM client and server in TRL library.
Figure 1: The vLLM client and server were modified to use the DeepMath agent when generating candidates, while still using the vLLM backend.
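
For a rough sense of the inference setup, the sketch below builds a smolagents CodeAgent backed by a vLLM OpenAI-compatible server. The model id, endpoint, import allow-list, and step limit are illustrative assumptions, not the repository's exact configuration.

```python
from smolagents import CodeAgent, OpenAIServerModel

# Point smolagents at a locally served vLLM endpoint (e.g. `vllm serve Intel/deepmath-v1`).
model = OpenAIServerModel(
    model_id="Intel/deepmath-v1",          # assumed model id
    api_base="http://localhost:8000/v1",   # assumed vLLM server address
    api_key="EMPTY",
)

# A code-writing agent with a restricted import allow-list and a step limit.
agent = CodeAgent(
    tools=[],                              # pure Python execution, no extra tools
    model=model,
    additional_authorized_imports=["math", "sympy", "numpy"],  # assumed allow-list
    max_steps=8,                           # assumed cap on agent iterations
)

answer = agent.run("What is the remainder when 7**2024 is divided by 100?")
print(answer)
```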

  • Agent Interface: During inference, the model can output normal tokens or special agent calls containing Python snippets.

  • Execution: Snippets run in a sandboxed environment with strict safety constraints (no file I/O, no network, timeouts); a minimal executor sketch follows Figure 2.

  • Design Goals:

    • Concision: Replace multi-line textual calculations with short, focused snippets.

    • Determinism & Safety: Implement strict execution limits.

    • Interpretability: Snippets are readable and auditable.

Output example: a short Python snippet and its output, used in the reasoning process.
Figure 2: Output example where Python code is generated and executed, and the result is inserted into the trace and used as context.
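
For a concrete feel of the execution constraints, here is a minimal executor sketch: process isolation with a per-snippet timeout and a crude import allow-list check. It is illustrative only and does not reproduce DeepMath's sandbox (which also blocks file and network access); the allow-list contents are assumptions.

```python
import ast
import subprocess
import sys

ALLOWED_MODULES = {"math", "sympy", "fractions", "itertools"}  # assumed allow-list

def run_snippet(code: str, timeout_s: float = 5.0) -> str:
    # Static check: reject imports outside the allow-list.
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = ([alias.name for alias in node.names]
                     if isinstance(node, ast.Import) else [node.module or ""])
            if any(name.split(".")[0] not in ALLOWED_MODULES for name in names):
                return "ImportError: module not in allow-list"
    # Run in a separate interpreter process with a hard per-snippet timeout.
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "TimeoutError: snippet exceeded the time limit"
    return (proc.stdout or proc.stderr).strip()

print(run_snippet("from math import comb\nprint(comb(10, 3))"))  # -> 120
```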



Training with GRPO

We fine-tune the model using GRPO, a reward-based optimization that balances:

  • Accuracy Reward: +1 for correct answers.

  • Using code snippets: +1 for generating code snippets, weighted 10:1 vs. the accuracy reward.

  • Length reduction: shorter outputs are encouraged by capping GRPO completion candidates at 5k tokens.

  • Temperature Scheduling: We use a linear temperature schedule (T = 1.2 → T = 0.7) to balance exploration and stability during training. A higher temperature early on encourages exploration, and lowering it later stabilizes training as the desired behavior is learned (a sketch of the reward functions and schedule follows this list).

  • In-context Learning: we include 4 solved examples whose traces contain agent calls and executor outputs, so the model learns the syntax and the call/response pattern.

  • Dataset: we used the Tool-Integrated Reasoning (TIR) subset of the OpenMathReasoning dataset. Note that GRPO uses only the problem, not the answer, from the data. This dataset was chosen to ensure the problems benefit from the external tool.
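
Below is a hedged sketch of how these rewards and the temperature schedule could be expressed using TRL's GRPO reward-function convention (each function maps a batch of plain-text completions to per-sample scores). The code-call marker, the answer-extraction rule, the `expected_answer` column name, and the weighting mechanism are assumptions for illustration; the repository contains the actual training script.

```python
import re

def extract_answer(text: str) -> str:
    # Hypothetical extractor: take the content of the last \boxed{...}.
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else ""

def accuracy_reward(completions, expected_answer=None, **kwargs):
    # +1 for a correct final answer (plain string match; a real verifier would
    # normalize expressions). `expected_answer` is an assumed dataset column
    # passed through to the reward function.
    return [1.0 if extract_answer(c) == str(a) else 0.0
            for c, a in zip(completions, expected_answer or [])]

def code_use_reward(completions, **kwargs):
    # +1 when the trace contains at least one agent (code) call; the "<code>"
    # marker is an assumption, not DeepMath's actual delimiter.
    return [1.0 if "<code>" in c else 0.0 for c in completions]

def temperature_at(step: int, total_steps: int,
                   t_start: float = 1.2, t_end: float = 0.7) -> float:
    # Linear schedule: explore early (T = 1.2), stabilize later (T = 0.7).
    frac = min(step / max(total_steps, 1), 1.0)
    return t_start + frac * (t_end - t_start)

# The 10:1 weighting between the two rewards and the 5k-token cap would be set
# in the trainer config, e.g. GRPOConfig(reward_weights=[...],
# max_completion_length=5000), with weights ordered to match the reward_funcs
# list passed to GRPOTrainer.
```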



Evaluation

We benchmarked DeepMath against baselines on 4 datasets. Metrics include:

  • majority@16: robustness across samples, as used in previous math-reasoning works (see references); a simplified sketch follows this list.

  • Mean output length: brevity.
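
For reference, a simplified sketch of the majority@16 metric; answer extraction and normalization here are assumptions, and the repository's evaluation scripts define the exact matching rules.

```python
from collections import Counter
import re

def extract_answer(completion: str) -> str:
    # Take the content of the last \boxed{...}, if present (simplified).
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else ""

def majority_at_k(completions: list[str], reference: str) -> float:
    # Majority-vote the extracted answers and score the winner against the reference.
    answers = [a for a in (extract_answer(c) for c in completions) if a]
    if not answers:
        return 0.0
    voted, _ = Counter(answers).most_common(1)[0]
    return 1.0 if voted == reference else 0.0

# majority@16: average majority_at_k over the dataset, with 16 samples per problem.
```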

Main results table.

  • We compare a baseline configuration (Qwen3-4B-Thinking-2507, no agentic inference) with our DeepMath model. As an ablation, we evaluate the agentic framework we developed running with the untrained Qwen3 model, denoted +Agent. Moreover, we examine whether the GRPO training (for agentic use) improves non-agentic inference, denoted +GRPO. Thus the two ablations are independent, not additive.

  • We observe that agentic inference reduces output lengths, with mixed accuracy results. The DeepMath model is both GRPO-trained and run in agentic mode, and shows the best accuracy with shortened traces. We conclude that both GRPO training and agentic inference are needed for best results.

Key Insight: DeepMath reduces output length by up to 66% while improving accuracy on difficult datasets.



Why It Matters

  • Accuracy: Offloading computation reduces arithmetic errors.

  • Efficiency: Shorter outputs mean faster inference and easier interpretability.

  • Safety: Sandbox execution mitigates risks of running arbitrary code.



Conclusion

DeepMath demonstrates a practical, lightweight approach: combine a small executor with an LLM and train the model to prefer short, computation-driven traces. Offloading deterministic computation reduces arithmetic and numerical errors and shortens traces, and GRPO fine-tuning further encourages concise, correct answers. The result is a more accurate and more interpretable math-solving agent that does not require a large model or heavyweight external tools.



Try It Yourself

Check out the GitHub repo and share your feedback! Contributions welcome. 🚀



Citation

If you use DeepMath in your research, please cite:

@software{deepmath2025,
  author = {Fleischer, Daniel and Berchansky, Moshe and Wasserblat, Moshe},
  title = {DeepMath: A Lightweight Math Reasoning Agent for LLMs},
  year = {2025},
  publisher = {Intel AI Labs},
  url = {https://github.com/IntelLabs/DeepMath}
}



Limitations & Future Work

  • Scope: we focused on a small model and on mathematical reasoning.

  • Generalization: evaluated on contest-style math; results may not transfer to open-ended mathematical creativity or formal proofs.

  • Executing generated code is inherently risky. DeepMath uses strict sandboxing and resource limits, but any deployment should carefully manage attack surfaces and enforce rate limits.



References

[^1]: Luo, Michael, Sijun Tan, Justin Wong, et al. 2025. “DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL.” https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2

[^2]: Liu, Mingjie, Shizhe Diao, Ximing Lu, et al. 2025. “ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models.” arXiv:2505.24864. Preprint, arXiv, May 30. https://doi.org/10.48550/arXiv.2505.24864

[^3]: Moshkov, Ivan, Darragh Hanley, Ivan Sorokin, et al. 2025. “AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning Dataset.” arXiv:2504.16891. Preprint, arXiv, April 23. https://doi.org/10.48550/arXiv.2504.16891


