Training LLMs to reason with notebooks

The past year has been all about giving LLMs more tools and autonomy to solve more complex and open-ended tasks. The goal of the Jupyter Agent is to give the model the ultimate tool: code execution.

A natural way to display multi-step code execution along with reasoning is inside a Jupyter Notebook, which consists of code and markdown cells. So we built Jupyter Agent to act as an agent that can execute code directly inside a Jupyter notebook and use this environment to solve data analysis and data science tasks. Think of it like Cursor, but living natively inside your data science workflow.
We built a demo of this vision with Qwen3-Coder, currently one of the strongest coding models. It is a follow-up to our earlier work on jupyter-agent (v1).

While large models are starting to show useful behavior, the key question is how we can continue improving them. To this end, we focus on strengthening smaller models to perform well on agentic data science tasks, as they currently struggle to compete with the large models.

The goal of this project is to build a pipeline to first generate high-quality training data, then fine-tune an existing small model, and finally evaluate whether the model’s performance improves on relevant benchmarks.

Let’s begin with the last step: choosing a robust benchmark for evaluating models on data science tasks.



🏁 Primer: the DABStep Benchmark

In order to understand whether we’re making progress towards better data science agents, we need a benchmark to measure such capabilities. Last year, in partnership with Adyen, we introduced the DABStep benchmark: a way to evaluate data science agents on realistic tasks. The setup is straightforward: provide the LLM with datasets and ask it to answer non-trivial data questions.

Example tasks:

| Question | Answer |
|---|---|
| Which card scheme had the highest average fraud rate in 2023? | SwiftCharge |
| For the year 2023, focusing on the merchant Crossfit Hanna, if we incentivize users to switch to a different Authorization Characteristics Indicator, which option would be the most cost-effective? | E:346.49 |

This benchmark remains difficult for today’s LLMs: for example, the best out-of-the-box model is Claude 4 Sonnet, which reaches less than 20% accuracy on the hard tasks.
You can explore the live leaderboard here.



🎯 First Baseline

Now that we have identified a good benchmark, we can try to climb it! We set out to build a dataset for fine-tuning such that even a small data agent model could perform well on DABStep.

Our first choice was Qwen3-4B-Thinking-2507: extremely small (fast to iterate with, easy to run), yet strong enough to act in agentic scenarios.

Baseline results:

  • Easy tasks: 44.4%
  • Hard tasks: 2.1%

Not great, but a promising starting point, since it leaves plenty of room for improvement. Let’s look at how we can improve it!



🔧 Primer on Scaffolding

A core aspect of agents that sets them apart from a pure chat model is the scaffolding built around the model to steer its behaviour. The evaluation script in DABStep, for instance, uses smolagents to execute code. Smolagents comes with predefined behaviors, prompting structures, and expected formats.

We also studied the Qwen-Agent codebase, where the authors tailor the scaffolding to the model. This makes sense: Claude Code, for instance, works shockingly well with Claude Sonnet because their scaffolding is aligned.

So, we restructured our scaffolding:

  • Stripped it down to ~200 lines of code.
  • No external dependencies.
  • Inspired by the spirit of tiny-agents.

👉 Check it out here: utils.py.

Results: accuracy jumped from 44.4% → 59.7% (easy split). 🚀

Our loop (sketched below):

  • A while loop with two tools: code execution to run the code and final_answer to return the final answer.
  • We differ from Qwen-Agent by explicitly adding a final_answer tool, which in our testing has improved performance.
  • Compared to smolagents, we simplified the scaffolding by removing a number of prompts and tools. Smolagents also hardcodes a number of assumptions into the model by using the ReAct framework.
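
As a rough illustration, the loop boils down to something like the sketch below. The tool names, the execute_code helper, and the OpenAI-compatible client are assumptions made for illustration; the real implementation lives in utils.py:

import json

def run_agent(client, model, tools, execute_code, question, max_turns=20):
    # Keep querying the model until it calls final_answer or we run out of turns.
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        response = client.chat.completions.create(
            model=model, messages=messages, tools=tools
        )
        msg = response.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            continue  # no tool call yet, let the model keep reasoning
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            if call.function.name == "final_answer":
                return args["answer"]  # terminate the loop with the answer
            output = execute_code(args["code"])  # e.g. run in an E2B sandbox
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": output}
            )
    return None  # no final answer within the turn budget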



🏃‍♂️ Training Pipeline

With simplified scaffolding in place, we focused on fine-tuning Qwen3-4B for data science agentic tasks.



⚙️ Dataset Pipeline

The recipe to improve a model on a certain task or behaviour is to train it on data that reflects the task as closely as possible. A natural starting point is to look at real Jupyter notebooks and find notebooks that align closely with the task that we plan to tackle, namely data analysis.

Kaggle offers a wealth of high-quality data analysis notebooks and makes them publicly available:

Datasets:

  • Kaggle Notebooks dataset: ~2TB of notebooks.
  • Kaggle Datasets: 5TB of Kaggle datasets that we manually downloaded and linked to the notebooks.
  • Rich metadata for every notebook (authors, datasets used, etc.).

Now that we have good results with a base model, it is time to build a dataset that can help us improve it even further. We designed a multi-stage pipeline using Datatrove to clean and prepare Kaggle notebooks at scale.

Jupyter Agent Dataset Pipeline

Here’s how each step worked:



1. Large-scale deduplication

We started with ~2TB of Kaggle notebooks and reduced them to ~250GB by reusing our work from the BigCode project. As part of the StarCoder2 training data processing, the notebooks (without output cells) had already been deduplicated.
Most Kaggle notebooks are small variations or near-identical copies, so this step was essential.
Key insight: ~90% of raw notebooks are duplicates, which would have skewed training if left unfiltered.



2. Downloading linked datasets

Most Kaggle notebooks reference external datasets via Kaggle metadata. To make sure the code inside the notebooks could actually run, we built a pipeline that automatically fetched these linked datasets. This step was crucial, since many notebooks would otherwise be incomplete or non-executable.

Using the kagglehub package, we downloaded thousands of datasets, about 5TB in total. To keep things manageable and relevant:

  • We filtered out datasets containing model checkpoints, large multimodal corpora, or LLM-related files.
  • We also excluded very large datasets (10GB+) that couldn’t fit into the virtual E2B sandboxes we used for execution.

By the end, we had a rich collection of executable notebooks paired with their datasets, providing the foundation for training agents in realistic, runnable environments.
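
As a rough sketch, this step looks something like the code below, where the dataset handles come from the notebooks’ Kaggle metadata; the keyword filter and size threshold are illustrative, not the exact values we used:

import os

import kagglehub

SKIP_KEYWORDS = ("checkpoint", "llm", "weights")  # assumed filter terms
MAX_SIZE_GB = 10  # skip datasets too large for the E2B sandboxes

def fetch_dataset(handle):
    # Skip datasets that are clearly model checkpoints or LLM-related
    if any(keyword in handle.lower() for keyword in SKIP_KEYWORDS):
        return None
    path = kagglehub.dataset_download(handle)  # downloads and caches locally
    size_gb = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    ) / 1e9
    return path if size_gb <= MAX_SIZE_GB else None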



3. Edu scoring

We scored notebooks based on educational quality using Qwen3-32B. We found that using every notebook was not optimal, as many contained trivial or broken code. Our educational scoring approach is detailed in edu_scoring.py.

TL;DR: We assigned each notebook a score from 1–5 based on clarity, completeness, and educational value, and kept only those above a certain threshold. This filtering removed about 70% of the notebooks.

This is similar to the insight from the BeyondWeb paper, which showed that using high-quality data is better for synthetic data generation, a step we relied on for QA (Question-Answer) generation.
This helped the model learn from high-quality notebooks instead of noisy ones.
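
A rough sketch of such an LLM-based scorer is shown below; the prompt wording, the threshold, and the OpenAI-compatible client are assumptions, and the actual implementation is in edu_scoring.py:

import re

from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible endpoint serving Qwen3-32B

EDU_PROMPT = """Rate the following Jupyter notebook for educational value on a scale
from 1 to 5, considering clarity, completeness, and whether it demonstrates meaningful
data analysis. Answer with 'Score: <number>'.

Notebook:
{notebook}
"""

def edu_score(notebook_text):
    response = client.chat.completions.create(
        model="Qwen/Qwen3-32B",
        messages=[{"role": "user", "content": EDU_PROMPT.format(notebook=notebook_text)}],
    )
    match = re.search(r"Score:\s*([1-5])", response.choices[0].message.content)
    return int(match.group(1)) if match else 1  # treat unparseable replies as low quality

notebooks = [...]  # deduplicated notebooks from the previous steps
kept = [nb for nb in notebooks if edu_score(nb) >= 4]  # threshold assumed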



4. Filtering irrelevant notebooks

We excluded notebooks about training LLMs or unrelated to data analysis.
We also removed notebooks that didn’t actually use datasets, through an automated LLM-based filtering process using Qwen3-32B. The filtering implementation can be found in extract_packages_and_files.py.

TL;DR: We prompted Qwen3-32B to identify and remove notebooks that either (1) had nothing to do with data analysis, or (2) didn’t actually use datasets. This step removed about 20% of the notebooks.

This ensured we trained only on relevant data science tasks.



5. QA generation

Using the cleaned notebooks, we generated question–answer pairs with Qwen3-32B. The questions and answers are grounded in the actual notebook traces, so the QA pairs are based on real code execution results.
Prompt design: we asked the LLM to produce natural questions that might realistically be asked of the dataset, then validated whether the notebook provided an accurate answer.

Challenge: We had to try many prompts to get higher-difficulty questions, because LLMs tended to generate trivial ones like “what is the size of the dataset”.
Insight: We broke this into two steps because LLMs tended to hallucinate answers:

  1. Generate the question and answer.
  2. Ask another LLM (with access to the notebook) to check whether the answer was correct.

The complete prompting strategy and implementation is available in qa_generation.py.
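
Schematically, the two-step loop looks like this (prompts abridged and simplified; client is an OpenAI-compatible client as in the scoring sketch above):

import json

GEN_PROMPT = """Given this notebook, write one non-trivial question that could realistically
be asked about its dataset, plus the answer the notebook supports.
Return a JSON object with keys 'question' and 'answer'.

Notebook:
{notebook}
"""

VERIFY_PROMPT = """Here is a notebook, a question and a proposed answer.
Using only the notebook's code and outputs, reply CORRECT or INCORRECT.

Notebook:
{notebook}

Question: {question}
Proposed answer: {answer}
"""

def generate_qa(client, notebook):
    # Step 1: generate a candidate question-answer pair
    raw = client.chat.completions.create(
        model="Qwen/Qwen3-32B",
        messages=[{"role": "user", "content": GEN_PROMPT.format(notebook=notebook)}],
    ).choices[0].message.content
    qa = json.loads(raw)
    # Step 2: ask a second pass of the model to verify the answer against the notebook
    verdict = client.chat.completions.create(
        model="Qwen/Qwen3-32B",
        messages=[{"role": "user", "content": VERIFY_PROMPT.format(notebook=notebook, **qa)}],
    ).choices[0].message.content
    return qa if "INCORRECT" not in verdict else None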



6. Trace generation

Finally, we want to generate clean code execution traces, since even the original notebooks after processing are often open-ended and verbose, with plenty of irrelevant parts. However, we want our Jupyter Agent to get to the result efficiently. To obtain cleaner notebook traces for training, we generated traces synthetically based on the original notebooks.
We prompted the Qwen3-Coder-480B model to generate Jupyter notebook code that answers the question from the previously generated synthetic QA pair.
Traces capture step-by-step code execution, including intermediate outputs, which are crucial for agent training.

We used E2B sandboxes for our agent to solve the synthetic QA pairs, which required fetching the Kaggle datasets so the code could actually run.

Challenge 1: Many datasets were unavailable.
Trick: Since LLMs are strong at code and have a decent world model, we prompted them to act as a code interpreter when the dataset was missing (see the sketch below, after the prompt excerpt).

Beginning of the prompt:

You are a stateful Python code interpreter that executes code in a persistent environment. Your role is to execute Python code while maintaining state across multiple code cells, similar to a Jupyter notebook environment.
[REST OF THE PROMPT]
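
A sketch of how this fallback can be wired in; the model id and helper names are assumptions, and INTERPRETER_PROMPT stands for the system prompt excerpted above:

INTERPRETER_PROMPT = (
    "You are a stateful Python code interpreter that executes code in a "
    "persistent environment. ..."  # full prompt as excerpted above
)

def execute_cell(sandbox, client, code, history):
    # Prefer real execution in the E2B sandbox when the dataset is available
    if sandbox is not None:
        return sandbox.run_code(code).text
    # Fallback: let an LLM role-play the stateful interpreter for this cell
    history.append({"role": "user", "content": code})
    reply = client.chat.completions.create(
        model="Qwen/Qwen3-Coder-480B-A35B-Instruct",  # assumed model id
        messages=[{"role": "system", "content": INTERPRETER_PROMPT}] + history,
    )
    simulated_output = reply.choices[0].message.content
    history.append({"role": "assistant", "content": simulated_output})
    return simulated_output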

Challenge 2: The Qwen3-Coder-480B-A35B model doesn’t support thinking mode, so how can we extract code commentary? By default it often outputs only a brief comment followed by several steps of code execution. However, we want some reasoning or comments between every cell.
Trick: When switching from Qwen3-32B to Qwen3-Coder-480B-A35B, we noticed that the output message content was often empty. This seems to be a previously known quirk of Qwen3-Coder models, where the model returns an empty assistant response when using tool calling. We enforce some text commentary through tooling by passing ‘comment’ as a required field in the code execution tool call. This way, when a non-reasoning model is used for code cell generation, it will by default output a description of its actions in the first person, emulating the structure of thinking traces.
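
Concretely, this amounts to marking comment as a required parameter in the code-execution tool’s JSON schema, roughly like this (exact field names and wording are illustrative):

code_tool = {
    "type": "function",
    "function": {
        "name": "execute_code",
        "description": "Run Python code in the notebook and return its output.",
        "parameters": {
            "type": "object",
            "properties": {
                "comment": {
                    "type": "string",
                    "description": "Short first-person explanation of what this cell does and why.",
                },
                "code": {
                    "type": "string",
                    "description": "Python code to execute in the next cell.",
                },
            },
            "required": ["comment", "code"],  # forcing 'comment' yields commentary between cells
        },
    },
}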

Note: the final answer generated in the notebook may differ from the answer specified in the QA pair. This is because the agent model may use data preprocessing methods and steps different from the original Kaggle notebook, and the synthetic question often does not specify them. This discrepancy is normal and lays the foundation for an exciting new research direction: how language models tend to approach data analysis and whether they do it differently from humans. For full transparency, we keep both the LLM-generated final answer and the original answer from the actual Kaggle notebook as signals of the model’s performance. We encourage the community to try different dataset mixes to see how they can push performance even further.



7. Final curation

We truncated overly long outputs and filtered out trivial traces to prevent context length issues and keep only high-quality traces.
We kept non-trivial, multi-turn traces aligned with DABStep-style tasks.
The resulting Jupyter Agent Dataset, with 51k synthetic notebooks and almost 0.2B tokens, became the foundation for SFT on the Qwen3-4B models.
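
For illustration, the curation pass amounts to something like the snippet below; the field names and thresholds are assumptions rather than the exact values we used:

MAX_OUTPUT_CHARS = 2000  # truncate long execution outputs
MIN_CODE_CELLS = 2       # drop trivial single-step traces

def curate(trace):
    code_cells = 0
    for message in trace["text"]:  # the trace is stored as a list of chat messages
        if message["role"] == "tool":
            message["content"] = message["content"][:MAX_OUTPUT_CHARS]
        elif message["role"] == "assistant" and message.get("tool_calls"):
            code_cells += 1
    return trace if code_cells >= MIN_CODE_CELLS else None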

With this dataset in hand, the natural next step is to see whether it actually helps our model become a stronger data science agent. Let’s move on to the training pipeline and evaluate the impact!



🏃‍♂️ Training Pipeline

With the curated dataset ready, we turned to the key question: does this data actually help the model get better at solving data analysis tasks?
To find out, we set up a straightforward fine-tuning pipeline and ran experiments to measure the impact of training on our synthetic notebooks.

Some training steps turned out to be particularly interesting and gave us useful insights:

  • For trace generation, we used LLMs to generate QA pairs, which gave us a verifiable environment.
  • Finally, we fine-tuned Qwen3-4B with TRL (a minimal sketch follows this list).
    • Used assistant_only_loss=True → small performance boost.
    • Added NEFTune noise for full-parameter multi-epoch training → avoids overfitting.
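
A minimal version of the fine-tuning setup with TRL could look like the sketch below; the hyperparameter values are illustrative, and the actual configs live in the finetuning directory:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

ds = load_dataset("jupyter-agent/jupyter-agent-dataset", split="non-thinking")

config = SFTConfig(
    output_dir="qwen3-4b-jupyter-agent",
    assistant_only_loss=True,    # compute the loss on assistant tokens only
    neftune_noise_alpha=10.0,    # NEFTune embedding noise (value assumed)
    num_train_epochs=5,
    learning_rate=1e-5,          # assumed
)
trainer = SFTTrainer(
    model="Qwen/Qwen3-4B-Instruct-2507",
    args=config,
    train_dataset=ds,
)
trainer.train()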

Challenges:

  • Prompting models for tool calling is hard: not all prompts deliver the same performance (Qwen docs).
  • We had to manually test each one to find what worked best.
  • There’s no standardization in response formats for tool calling, making it difficult to switch between models.
  • Qwen’s native chat template is not adapted to the assistant_only_loss=True training mode in TRL, which requires generation tokens by default. We therefore adapt the original chat templates by wrapping the assistant response part in generation tags (see the sketch after this list).
  • Training thinking models on short reasoning texts may disrupt model capabilities → full-parameter training works better compared to PEFT in this case.
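
Schematically, the template adaptation boils down to wrapping the assistant span in generation tags, as in the simplified ChatML-style template below (not the literal Qwen template):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")

# The {% generation %} block marks assistant tokens so that TRL's
# assistant_only_loss (via return_assistant_tokens_mask) can build the loss mask.
tokenizer.chat_template = (
    "{%- for message in messages -%}"
    "{%- if message['role'] == 'assistant' -%}"
    "<|im_start|>assistant\n"
    "{% generation %}{{ message['content'] }}<|im_end|>\n{% endgeneration %}"
    "{%- else -%}"
    "<|im_start|>{{ message['role'] }}\n{{ message['content'] }}<|im_end|>\n"
    "{%- endif -%}"
    "{%- endfor -%}"
    "{%- if add_generation_prompt -%}<|im_start|>assistant\n{%- endif -%}"
)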

Our complete training implementation, including hyperparameter configurations and template adaptations, is available in the finetuning directory of our repo.



📊 Results

First, we generated our final dataset using Qwen3-Coder-480B-A35B, which contains high-quality code and short reasoning-like traces. Afterwards, we started training and experimented with various configurations: PEFT/adapters vs. full-parameter tuning, learning rate, number of epochs, adding noise, and others. We discovered that full-parameter fine-tuning lets the model better replicate the response quality of Qwen3-Coder-480B-A35B, with shorter supporting commentary that fits the data analysis task without unnecessarily long reasoning.

We’ve got done a small ablation study on the impact of no. training epochs:

| Model | No. of epochs | DABStep (Easy) |
|---|---|---|
| Qwen3-4B-Instruct-2507 (Base) | 0 | 38.67% |
| Qwen3-4B-Instruct-2507 (Our Scaffolding) | 0 | 52.78% |
| Qwen3-4B-Instruct-2507 | 2 | 63.89% |
| Qwen3-4B-Instruct-2507 | 3 | 73.61% |
| Qwen3-4B-Instruct-2507 | 5 | 75% |
| Qwen3-4B-Instruct-2507 | 7 | 70.83% |

We observe that it is beneficial to use a few more epochs than usual for SFT, with a lower learning rate and higher NEFTune noise (7). Finally, we compare our trained models with the implemented scaffolding to isolate the pure impact of our training dataset. In summary, we see up to a 36%/22% boost on the DABStep easy score compared with the base/scaffolded model:

DABstep Easy Score

We can also see that the hard score increases too, even though our dataset is focused on easier questions:

DABstep Hard Score

The figures above show a noticeable impact of both the new scaffolding and tuning on our synthetic notebooks. This makes Qwen3-4B (with our pipeline + scaffolding) a state-of-the-art small-model agent on DABStep.

In practice, the model can now solve a wide range of realistic Kaggle-style data analysis tasks with consistent execution.
It’s not yet strong enough for the hardest queries, but we’ve shown that even small models can become powerful agents when paired with the right data and scaffolding.



Try Jupyter Agent Yourself

These results demonstrate that even small models can become powerful data science agents with the right training approach. Ready to try it yourself? We have made everything openly available so you can experiment with our fine-tuned models and dataset.

We openly release the best-performing checkpoints of the tuned Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507, along with the training dataset, which you can try out and experiment with:

You’ll be able to load Jupyter Agent Dataset in only a few lines using the next code:

from datasets import load_dataset
from transformers import AutoTokenizer

# Any Qwen3-4B tokenizer works here; we use the fine-tuned checkpoint's
tokenizer = AutoTokenizer.from_pretrained("jupyter-agent/jupyter-agent-qwen3-4b-instruct")

ds = load_dataset("jupyter-agent/jupyter-agent-dataset", split="non-thinking")

# Each sample stores the full notebook trace as chat messages in the "text" field
tokenizer.apply_chat_template(ds[0]["text"])

You can also use the sourced Kaggle datasets directly with E2B code execution using the following code:

import kagglehub
import e2b_code_interpreter as e2b
from datasets import load_dataset

ds = load_dataset("jupyter-agent/jupyter-agent-dataset", split="thinking")

# Download the Kaggle dataset linked to this sample
dataset_name = ds[0]["kaggle_dataset_name"]
path = kagglehub.dataset_download(dataset_name)
print(path)

# Start an E2B sandbox and upload the files used by the notebook
sandbox_init = e2b.Sandbox(timeout=240)

file_name = ds[0]["files_used"][0]
file_name = file_name.split('/')[-1] if '/' in file_name else file_name
with open(f"{path}/{file_name}", "rb") as file:
    sandbox_init.files.write(f"/home/user/input/{file_name}", file)

# The sandbox is now ready to execute the sample's code cells
execution = sandbox_init.run_code("")

You can use the tuned Jupyter Agent Qwen-based models following the Qwen documentation code:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "jupyter-agent/jupyter-agent-qwen3-4b-instruct"

# Load the fine-tuned instruct checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate and decode the completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)

For the Thinking model, you can decode both the thinking response and the content using the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "jupyter-agent/jupyter-agent-qwen3-4b-thinking"
# Load the model and tokenizer and generate output_ids exactly as in the previous snippet

try:
    # 151668 is the </think> token id: everything before it is the thinking trace
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")



🔮 Next Steps

  • Harder tasks: Generate tougher, multi-step questions that better reflect real-world analysis.
  • Scaling up: Train on larger volumes of curated traces to push beyond the current 3.4% performance on the hard split.
  • Distillation: Investigate knowledge distillation, which has shown strong results for improving small models.
  • Reinforcement Learning (RL): Build an RL environment; RL has been shown to achieve state-of-the-art performance on agentic tasks. Since our QA setup already provides a verifiable environment, we could leverage it directly for RL training.

Maybe this will lead to… Jupyter-Agent 3. 😉

We hope that our findings will encourage others to continue making progress in developing more powerful notebook coding agents, and we’re excited to see what the community builds next. Dive into our jupyter-agent dataset on the 🤗 Hub and explore the codebase at https://github.com/huggingface/jupyter-agent to start your own experiments on agents for Jupyter notebooks.


