How We Hit #1 on DABStep with Reusable Tool Generation

The world of knowledge is vast, but quantitative information is often sparse or unavailable in text form online, presenting a major challenge for deep research agents. This post shares an architecture, NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer, for building autonomous data analysis agents, developed by the NVIDIA Kaggle Grandmasters (KGMON) LLM Agent Research Team. The project introduces an agent specialized for dataset exploration and analysis, designed to handle the complexities of multi-step reasoning, tool calling, and iterative data analysis. Notably, our approach establishes state-of-the-art (SOTA) performance on the Data Agent Benchmark for Multi-step Reasoning (DABStep), ranking 1st place with a 30x speedup over the Claude Code baseline.




Motivation: Bridging the Gap in Data Analysis

Deep research agents, especially those relying on web text search, fall short when dealing with structured, tabular data that requires complex, multi-step queries.

Our core motivation is to create an agent that excels at:

  • Iterating faster on analysis through automatic code generation and execution.
  • Cracking complex tabular questions with multi-step reasoning and tool use.
  • Making sense of large unstructured contexts using semantic search.
  • Staying oriented during experiments by automatically generating and interpreting visualizations.

NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer aims to deliver capabilities including automatic open-ended exploratory data analysis, tabular data Q&A, predictive modeling, and forecasting.



The NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer Architecture

In NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer, we implement different agent loops for different use cases. The architecture leverages the NVIDIA NeMo Agent Toolkit to drive these loops, utilizing tools designed specifically from a data scientist's perspective. For open-ended exploratory data analysis, the system pairs a ReAct agent with a Jupyter Notebook tool, allowing for continuous, bi-directional interaction. Alternatively, for multi-step rule-based tabular data QA, the architecture utilizes a Tool Calling Agent. This agent interacts with a distinct suite of specialized tools to perform its structured tasks: a stateful Python interpreter, a retriever, and a file structure detector.
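The stateful Python interpreter mentioned above can be sketched in a few lines: code executes in a persistent namespace, so each tool call can build on variables defined by earlier calls. This is a minimal illustration, not the toolkit's actual implementation.

```python
import contextlib
import io

class StatefulInterpreter:
    """Minimal sketch of a stateful Python interpreter tool: every call
    executes in the same namespace, so state persists across tool calls."""

    def __init__(self) -> None:
        self.namespace: dict = {}

    def run(self, code: str) -> str:
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, self.namespace)
        except Exception as exc:
            return f"Error: {exc}"
        return buf.getvalue()

interp = StatefulInterpreter()
interp.run("import math; r = math.sqrt(16)")
print(interp.run("print(r + 1)"))  # earlier state is still visible: prints 5.0
```

Persisting the namespace is what lets the agent split a long analysis into many small, checkable steps instead of one monolithic script.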

[Figure: NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer agent loops and tool suite]



Open-ended Exploration and Tabular Data QA

Currently the NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer focuses on two primary applications:



1. Open-ended Exploratory Data Analysis (EDA)

The figure below illustrates the architecture for open-ended exploratory data analysis driven by a ReAct Agent. The workflow begins with the user mounting a dataset and sending questions or instructions to the ReAct Agent, which translates these inputs into specific tool calls. These calls are sent to the Notebook Manipulation Tools, a set capable of standard operations like creating notebooks, adding code, and running cells. Once the tools execute the commands, the raw output flows into the Tool Output Handler. A critical feature of this handler is its integration with a Vision-Language Model (VLM): if the tool output includes a visual plot, the handler sends it to the VLM to generate a textual description and suggestions for improving the plot's aesthetics and data richness. The handler then replaces the visual plot with this text-based description, sending the processed tool output back to the ReAct Agent so it can formulate an informed response to the user.
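The handler's plot-to-text substitution can be sketched as follows; `describe_plot` is a hypothetical stand-in for the actual VLM call.

```python
def describe_plot(image_bytes: bytes) -> str:
    """Hypothetical stand-in for a VLM call that captions a plot and
    suggests aesthetic and data-richness improvements."""
    return "[VLM description of the plot + improvement suggestions]"

def handle_tool_output(output: dict) -> str:
    """If the tool output is an image, replace it with the VLM's textual
    analysis; otherwise pass the text through unchanged."""
    if output.get("mime_type", "").startswith("image/"):
        return describe_plot(output["data"])
    return output.get("data", "")
```

The key design choice is that the ReAct Agent only ever sees text, so image payloads never inflate the language model's context.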

[Figure: Open-ended EDA workflow with the ReAct Agent, Notebook Manipulation Tools, and Tool Output Handler]



2. Multi-Step Rule-based Tabular Data QA

This addresses hard questions that require multi-step reasoning and tool calling against a tabular dataset. We focus on the Data Agent Benchmark for Multi-step Reasoning (DABStep), which comprises 450 tasks focused on the financial payments sector. The benchmark is structured into three major components:

[Figure: DABStep benchmark components: Context & Query, Benchmark Tasks, and Evaluation]

The Context & Query component includes questions and heterogeneous data sources (such as CSV and JSON files), alongside a markdown manual detailing domain logic and rules. The Benchmark Tasks component categorizes the workload into Easy Tasks (16%), which are basic single-dataset queries, and Hard Tasks (84%), which require complex, multi-step, tool-augmented reasoning. These hard tasks involve reading documentation, generating code (such as SQL or Pandas), and cross-referencing data to calculate a solution, where web search offers little to no useful help. Finally, the Evaluation phase measures success using an Exact Text Match with strict formatting requirements, expecting a JSONL output that includes both the agent_answer and the reasoning_trace.
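A minimal sketch of the submission side of this format (the `task_id` field name is an assumption; the benchmark's exact schema may differ):

```python
import json

def format_submission(rows: list) -> str:
    """Serialize answers as JSONL: one object per task carrying the
    agent_answer and reasoning_trace fields the benchmark expects."""
    return "\n".join(
        json.dumps({
            "task_id": row["task_id"],          # assumed field name
            "agent_answer": row["agent_answer"],
            "reasoning_trace": row["reasoning_trace"],
        })
        for row in rows
    )

def exact_match(agent_answer: str, gold: str) -> bool:
    """Exact text match after trimming outer whitespace; any other
    formatting difference counts as wrong."""
    return agent_answer.strip() == gold.strip()
```

The strictness of the exact-match criterion is why formatting instructions loom so large in the system prompt.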



Cracking DABStep: A Multi-Phase Approach

To achieve state-of-the-art (SOTA) results on DABStep, we separate the heavy lifting from the fast execution. The system is split into three distinct phases: a Learning phase, where the agent uses general skills and ground truth data to forge reusable, specialized tools; an Inference phase, which applies these tools to solve new questions rapidly; and an Offline Reflection phase, which reviews the outputs to generate deeper insights. This mimics how a human data scientist operates: spending significant effort upfront to build a sturdy toolkit so that future tasks become efficient and scalable.

[Figure: The three-phase approach: Learning, Inference, and Offline Reflection]



Phase 1: The Learning Loop

In the Learning phase, we deploy a heavyweight model (like Opus 4.5/4.6) in a multi-pass loop equipped with a full arsenal of tools, including a stateful Python interpreter, bash tools, and file structure detectors. By tackling a batch of representative tasks (e.g., Tasks 1 through 10) and validating them against ground truth answers, the agent builds a comprehensive mental model of the dataset. It then synthesizes these individual Python scripts into one master solution, ultimately distilling it down to a highly optimized library of reusable functions (helper.py) and a concise set of few-shot examples that demonstrate how the helper functions are used to solve the questions in the dev split (training set).
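To make this concrete, here is the kind of shape a distilled helper.py could take. The function names and column schema are hypothetical: the real library is synthesized by the learning-phase agent from the actual dataset.

```python
import pandas as pd

def load_merchant(merchants: pd.DataFrame, merchant: str) -> pd.Series:
    """Fetch a single merchant's metadata row (hypothetical schema)."""
    return merchants.loc[merchants["merchant"] == merchant].iloc[0]

def applicable_fee_ids(fees: pd.DataFrame, card_scheme: str, mcc: int) -> list:
    """Return fee rule IDs whose card scheme matches and whose MCC list
    contains the merchant's MCC (hypothetical schema)."""
    mask = (fees["card_scheme"] == card_scheme) & fees["mcc"].apply(lambda xs: mcc in xs)
    return fees.loc[mask, "fee_id"].tolist()
```

Each helper encodes one piece of domain logic from the markdown manual, so downstream task solutions never have to rediscover it.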

[Figure: The Learning loop distilling task solutions into helper.py and few-shot examples]



Recognizing Interconnected Tasks & Optimizing Sub-Solutions Across the Board

The core insight driving this approach is that complex data questions rarely exist in isolation. As shown in the merchant fee examples, different tasks often share the same foundational data operations. For example, computing a specific transaction fee for a specific month (Task 2) requires the same initial steps (fetching merchant info and finding fee data) as simply listing the applicable fee IDs (Task 1). Recognizing and mapping this overlap is the key to building a modular, DRY (Don't Repeat Yourself) system.

[Figure: Interconnected merchant fee tasks sharing the same foundational data operations]

Instead of writing isolated, brittle scripts for each new query, the agent actively searches for the most robust logic. If "Version 1" of a function works perfectly for Task 1 but fails under the slightly different constraints of Task 2, the agent recognizes the flaw. By actively testing candidate functions via the Python interpreter against the ground truth of multiple interconnected tasks, the agent iteratively discovers a "Version 2" that successfully generalizes across the entire batch.
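A toy illustration of this Version 1 to Version 2 progression (the data and function names are invented for the example):

```python
def monthly_total_v1(transactions: list, merchant: str) -> float:
    """'Version 1': hard-codes January, which happens to satisfy Task 1."""
    return sum(t["amount"] for t in transactions
               if t["merchant"] == merchant and t["month"] == 1)

def monthly_total_v2(transactions: list, merchant: str, month: int) -> float:
    """'Version 2': the month becomes a parameter, so the same function
    passes the ground truth of both Task 1 and Task 2."""
    return sum(t["amount"] for t in transactions
               if t["merchant"] == merchant and t["month"] == month)
```

Testing v1 against Task 2's ground truth exposes the hidden assumption; v2 is the generalization the learning loop keeps.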



Refactoring and Packaging

[Figure: Refactoring independent task scripts into the centralized helper.py library]

Once the optimal, generalized logic is found, the agent refactors the bulky independent scripts into a clean, unified architecture. The complex data extraction and computation steps are packaged into the centralized helper.py library. As a result, the code needed to answer any specific query shrinks dramatically. The final task solutions transform from long, complex scripts into lightweight instructions that simply import and execute the appropriate tools from the helper library.
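Illustratively, a post-refactor task solution reduces to a few lines composing helper calls. The helper is stubbed inline here; in the real system it would be imported from helper.py, and the fee table is invented for the example.

```python
def get_fee_rate(fee_id: int) -> float:
    """Inline stand-in for a helper.py lookup (illustrative fee table)."""
    return {17: 0.023, 29: 0.031}[fee_id]

def solve_task(fee_id: int, volume: float) -> float:
    """The entire task-specific code: one composition of helper calls."""
    return round(get_fee_rate(fee_id) * volume, 2)

print(solve_task(17, 1000.0))  # → 23.0
```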



Phase 2: Fast and Lean Inference

[Figure: The lean Inference phase driven by a lightweight model and helper.py signatures]

With the foundational code written, the Inference phase shifts to a smaller, faster model (like Haiku 4.5) running a single-pass loop. Because the complex domain logic is already securely housed in helper.py, the inference agent only needs a basic Python interpreter to do its job. To keep token costs and latency to an absolute minimum, the context window is aggressively pruned: the agent is fed only the function signatures (not the underlying code) alongside a streamlined system prompt, allowing it to efficiently orchestrate the pre-built tools to solve unseen tasks.
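Signature-only pruning is easy to sketch with the standard library's inspect module; `example_helper` is a made-up function standing in for a real helper.

```python
import inspect

def example_helper(fees, card_scheme: str, month: int) -> float:
    """Compute total fees for a scheme in a month."""
    total = 0.0
    # ... long implementation the inference agent never sees ...
    return total

def signature_summary(fn) -> str:
    """Render one line per helper: name, signature, and the first docstring
    line. Only this summary enters the inference model's context."""
    doc = (inspect.getdoc(fn) or "").splitlines()
    first_line = doc[0] if doc else ""
    return f"{fn.__name__}{inspect.signature(fn)}  # {first_line}"

print(signature_summary(example_helper))
```

One such line per helper replaces hundreds of lines of implementation, which is where most of the token savings come from.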



Phase 3: Unsupervised Offline Reflection

[Figure: Unsupervised offline reflection and group-consistency review]

To ensure high quality without bottlenecking the live inference loop, we move critical quality control entirely offline. This phase relies on two powerful LLM evaluation techniques, reflection and group-consistency, driven by a heavyweight model (like Opus or Sonnet 4.6) acting as an unsupervised reviewer.

Reflection is the process where the model looks back at the agent's generated code and reasoning to audit its performance. It asks the tough questions: Did the agent effectively utilize the helper.py library? Did it follow the prompt faithfully? Are there any obvious mistakes in the code?

Group-consistency, on the other hand, involves analyzing multiple candidate solutions across groups of similar test questions to ensure the agent's logic remains stable. If the agent solves the same kind of query using conflicting methods, the offline model flags the discrepancy and reasons through which approach is actually correct. By moving these computationally heavy checks offline, we can deeply analyze the data without sacrificing the speed of the Inference phase.
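The group-consistency idea can be sketched as a simple check: group candidate solutions by question type and flag any group where conflicting methods were used (field names here are illustrative; in practice the grouping and the adjudication are done by the LLM reviewer).

```python
from collections import defaultdict

def find_inconsistent_groups(solutions: list) -> list:
    """Each solution carries a question `group` label and the `method` it
    used. Returns groups that mixed conflicting methods, flagged for
    offline review."""
    methods = defaultdict(set)
    for s in solutions:
        methods[s["group"]].add(s["method"])
    return sorted(g for g, ms in methods.items() if len(ms) > 1)

flags = find_inconsistent_groups([
    {"group": "fee_lookup", "method": "join_on_mcc"},
    {"group": "fee_lookup", "method": "filter_by_scheme"},
    {"group": "monthly_volume", "method": "groupby_month"},
])
print(flags)  # → ['fee_lookup']
```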



Closing the Loop: Injecting Insights for Faster Inference

The insights generated during this offline reflection aren't only for analytics: they're actively fed back into the architecture to close the learning loop. By extracting key patterns, edge cases, and potential pitfalls from the test data, the heavy model compiles these learnings and injects them directly into the system prompt for future Inference phases. Because the lightweight inference agent already holds these pre-calculated insights in its starting prompt, we completely eliminate the need for slow, computationally expensive online reflection or consistency checks. The result is an Inference phase that remains blazingly fast and token-efficient while constantly compounding its accuracy with every offline review.
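Closing the loop can be as simple as splicing the distilled insights into the inference system prompt; the template text below is illustrative, not the production prompt.

```python
BASE_PROMPT = "You answer tabular questions using only the helper functions below."

def build_system_prompt(signatures: list, insights: list) -> str:
    """Assemble the inference prompt from helper signatures plus pitfalls
    harvested by the offline reflection pass (template is illustrative)."""
    parts = [BASE_PROMPT, "Available helpers:"]
    parts += [f"  {sig}" for sig in signatures]
    if insights:
        parts.append("Pitfalls learned from offline review:")
        parts += [f"  - {tip}" for tip in insights]
    return "\n".join(parts)
```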



Results

| System | Easy | Hard | Time/Task | Code Length |
|---|---|---|---|---|
| NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer + Haiku 4.5 | 87.5 | 89.95 | 20 s | 1,870 chars |
| Claude Code + Opus 4.5 | 90.2 | 66.93 | 10 min | 5,011 chars |
| DataPilot from AntGroup | 86.11 | 87.57 | unknown | unknown |
| DS-STAR from Google AI | 87.5 | 45.24 | unknown | unknown |

To validate this architecture, we benchmarked our three-phase NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer approach (using the lightweight Haiku 4.5 for inference) against a standard baseline using Claude Code with the heavyweight Opus 4.5, which attempts to solve every task from scratch. The results highlight the massive efficiency gains of our methodology. Because our inference agent relies on the pre-built helper.py library, it solves tasks at blazing speed, taking only 20 seconds per task and generating a highly concise 1,870 characters of code. In stark contrast, the from-scratch approach takes a painstaking 10 minutes per task and bloats the code length to 5,011 characters. Most impressively, this 30x speedup doesn't compromise complex reasoning. While the heavy Opus model barely edged us out on "Easy" tasks (90.2 vs. 87.5), our approach dominated the "Hard" tasks, scoring 89.95 compared with the baseline's 66.93. This shows that investing time in upfront learning and code abstraction allows even smaller, faster models to outperform heavier models on complex, multi-step problems.

This performance secured our architecture 1st place on the official DABStep leaderboard. The NVIDIA KGMON (NeMo Agent Toolkit) Data Explorer approach significantly outperformed AntGroup's DataPilot and Google AI's DS-STAR on complex problems. With a score of 89.95 on "Hard" tasks, our system surpassed DataPilot (87.57) and nearly doubled DS-STAR's score (45.24). Given that 84% of the benchmark consists of hard-level tasks, our lead in this category directly secures our position as the best overall solution. These results establish our three-phase methodology as the current state-of-the-art for both efficient and rigorous tabular reasoning.



Conclusion: A New Paradigm for Data-Intensive Research

Building on the NVIDIA NeMo Agent Toolkit, the Data Explorer agent represents a major step forward in automated analysis of structured tabular data. By employing flexible agent loops, a ReAct loop for open-ended exploratory data analysis and a multi-phase system for rule-based tabular QA, the agent is uniquely positioned to handle complex, multi-step reasoning tasks. The success of the multi-phase approach on the difficult DABStep benchmark, particularly the proactive learning loop that generates reusable, generalized functions, validates the strategy of separating foundational knowledge building from rapid inference. Data Explorer moves beyond simple question answering to embody the operational workflow of a seasoned data scientist, delivering scalable, high-quality insights and establishing a new paradigm for data-intensive research driven by LLM-powered agents.

Ready to build your own data exploration agent? Get started with NVIDIA Launchable. Examples will be released soon!


