Language models are becoming increasingly capable and can solve tasks autonomously as agents. There are many exciting use cases, especially at the intersection of reasoning, code, and data. However, proper evaluation benchmarks on real-world problems are lacking, which hinders progress in the field.
To tackle this challenge, Adyen and Hugging Face built the Data Agent Benchmark for Multi-step Reasoning (DABstep) together. DABstep consists of over 450 data analysis tasks designed to gauge the capabilities of state-of-the-art LLMs and AI agents.
Our findings reveal that DABstep presents a major challenge for current AI models, with the most capable reasoning-based agents achieving only 16% accuracy, highlighting the significant progress still to be made in the field.
DABstep requires AI models to:
- dive into the details of the data and be rigorous (no hallucinations)
- reason over free-form text and databases
- connect to real-life use cases (not just math or code)
In this blog post, we cover the design and construction of the benchmark, explore evaluation results, and discuss the significant gap between current models and the ability to solve complex data analysis tasks effectively.
Motivation
Data analysis is both an art and a science that requires technical skill, domain knowledge, and creativity, and thus it is rarely straightforward. Even seasoned data analysts face challenges like:
- Simple but time-consuming tasks: The sheer volume of even simple tasks often turns straightforward analysis into hours of repetitive work.
- Complex context and high cognitive load: Some tasks require analysts to juggle intricate domain-specific knowledge, making them both time-intensive and mentally draining. For instance: (1) reading distributed, nested, and complex documentation; (2) analyzing data; (3) reasoning over results; and finally, providing recommendations that steer the direction of the business.
- Technical acumen: Analyzing data can be straightforward provided the data is highly available, of high quality, and ready to serve. Unfortunately, this is rarely the case, and analysts need technical depth to build pipelines that consume, transform, and serve data. Data analysts often take on tasks that formally belong to data engineering.
At companies like Adyen, analysts tackle a spectrum of problems, from routine queries to complex workflows requiring creativity, precision, and iterative reasoning. Access to a capable data analysis agent that can automate simple and repetitive tasks and assist with complex ones would allow analysts to work faster, reduce mental strain, and focus on solving more impactful problems. That would be a pivotal moment for many industries that need data analysis and insights, such as finance.
Recent advancements in agentic workflows, where LLMs equipped with tools independently execute multi-step tasks, have shown tremendous promise across domains like coding, open QA, software engineering, and even Kaggle competitions. These systems are not just theoretical; they are already driving real-world productivity gains.
So the question becomes: can agentic workflows reshape the way we approach data analysis?
Introducing DABstep
Progress in machine learning is fueled by high-quality benchmarks that yield reliable progress signals. That is why we are excited to introduce the Data Agent Benchmark for Multi-step Reasoning (DABstep), a new benchmark for evaluating and advancing agentic workflows in data analysis.
Here’s what makes DABstep unique:
- Real-world use cases: Built on 450+ real-world tasks extracted from Adyen's actual workloads. These tasks are not synthetic toy problems; they reflect challenges analysts face daily, setting DABstep apart from benchmarks like DS-1000 or DS Bench [^1].
- Balancing structured and unstructured data: The tasks require advanced data analysis skills to navigate structured data and to understand multiple datasets and documents captured as unstructured data.
- Easy setup: Unlike benchmarks such as SWE-bench or MLE-bench, which require complex configurations, DABstep is straightforward to use. Generating answers with a model only requires access to a code execution environment, and participants can submit answers directly to a leaderboard for automatic evaluation.
- Factoid evaluation: Tasks have been designed to be evaluated objectively; the evaluation of a task's output always maps to a binary outcome, right or wrong, without interpretation.
- Multi-step complexity: DABstep tests systems across a spectrum of analytical tasks, from routine queries to multi-step, iterative workflows. Unlike benchmarks focused on isolated questions, DABstep challenges models to engage in end-to-end agentic reasoning across diverse, practical tasks.
How does DABstep achieve all this while remaining easy to run? Let's take a look at its design!
What's inside DABstep?
DABstep has been designed for low-barrier usage, quality evaluation, and increasing difficulty levels. To this end, we are releasing the following items as part of DABstep: datasets, tasks, evals, a real-time leaderboard, and baselines.
Data
One of the biggest challenges analysts must overcome when working on real-world problems is balancing domain knowledge and technical skills. To this end, DABstep comprises both unstructured and structured data to measure domain knowledge and technical skills, respectively.
Table 1 shows a snapshot of some of the datasets we are releasing with the benchmark.
| Name | Description |
|---|---|
| payments.csv | Payments dataset of 138k (anonymized) transactions with various signals around fraud and risk use-cases. |
| payments-readme.md | Documentation for the Payments dataset |
| acquirer_countries.csv | Table of acquiring banks and their respective countries |
| fees.json | Extensive dataset composed of 1000 Scheme Fee structures. |
| merchant_category_codes.csv | Table of Merchant Category Codes (MCCs) |
| merchant_data.json | Table describing merchants |
| manual.md | In finance, business contexts are often outlined in extensive handbooks from networks, regulators, and processors. For the first version of this benchmark, we have created a markdown file (manual.md) that distills essential business knowledge into a precise yet simplified format for solving tasks accurately. |
Table 1: The benchmark consists of various datasets across different tasks, including the financial payments sector
Some of the structured datasets are CSV and JSON files representing real-world data, such as transaction telemetry and business metadata (e.g., merchant category codes). In addition, there is unstructured data such as documentation, lengthy manuals, and detailed handbooks like those issued by networks, regulators, and processors.
All of these datasets were extracted from real-world tasks at Adyen.
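To give a feel for the setup, here is a minimal sketch of how an agent (or an analyst) might load the context files for a first look, assuming pandas and a local `data/` directory; the file names match Table 1, but the inspection shown here is purely illustrative and assumes `fees.json` parses to a list of rule objects.

```python
import json
import pandas as pd

# Assumed local paths; point these at wherever the benchmark context files live.
payments = pd.read_csv("data/payments.csv")       # structured: ~138k transactions
fees = json.load(open("data/fees.json"))          # structured: scheme fee rules
manual = open("data/manual.md").read()            # unstructured: domain manual

# A first exploratory pass: what columns and fee fields exist?
print(payments.columns.tolist())
print(payments.head())
print(f"{len(fees)} fee rules")
print(manual[:500])                               # skim the start of the manual for definitions
```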
Tasks
Based on the new datasets included in DABstep, we are releasing several tasks with increasing difficulty levels designed to test an AI agent's accuracy.
Each task comprises the following items:
- A question that poses a challenge to the analyst.
- A level encapsulating the difficulty of the task.
- Guidelines on how to format the answer to meet the specifications of the factoid evaluation.
None of the tasks can be solved with one shot of code; in other words, they cannot be solved by reasoning alone but rather require sequential steps of iterative problem-solving. For instance, at a minimum, the agent must first learn which columns exist in the respective dataset before it can answer a question. This contrasts with popular benchmarks like GAIA, MATH, and SimpleQA, where many questions can be answered correctly with a single shot of code.
Two example tasks are shown in Figure 1, and an example human-made reference solution is shown in Figure 2.
| Easy Set example (Risk/Fraud) | Hard Set example (Scheme Fees) |
|---|---|
| Question: Which card scheme had the highest average fraud rate in 2023?<br>Guidance: Answer must be the name of the scheme.<br>[LLM/Agent Loop…]<br>Answer: SwiftCharge | Question: For the year 2023, focusing on the merchant Crossfit Hanna, if we aimed to reduce fraudulent transactions by encouraging users to switch to a different Authorization Characteristics Indicator through incentives, which option would be the most cost-effective based on the lowest possible fees?<br>Guidance: Answer must be the selected ACI to incentivize and the associated cost rounded to 2 decimals in this format: {card_scheme}:{fee}.<br>[LLM/Agent Loop…]<br>Answer: E:346.49 |
Figure 1: On the left is an example Risk/Fraud question from the Easy Set; its solution requires referencing at least two data sources and three shots of code. On the right is an example Scheme Fees question from the Hard Set; its solution requires referencing at least two data sources and multiple shots of code. The included answers are for demonstration purposes only and are withheld from the dataset.
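To illustrate the iterative flavor of even the Easy Set, here is a rough sketch of how the fraud-rate question above might be answered once the schema has been discovered in an earlier turn. The column names (`card_scheme`, `year`, `has_fraudulent_dispute`) are assumptions for illustration; a real run would take them from `payments-readme.md` and the dataframe itself.

```python
import pandas as pd

payments = pd.read_csv("data/payments.csv")

# Turn 1 (earlier): inspecting payments.columns revealed the relevant fields.
# Turn 2: restrict to 2023 and compute the fraud rate per card scheme (column names assumed).
df_2023 = payments[payments["year"] == 2023]
fraud_rate = df_2023.groupby("card_scheme")["has_fraudulent_dispute"].mean()

# Turn 3: format the answer per the task guidance (name of the scheme only).
print(fraud_rate.idxmax())
```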
Levels
The benchmark consists of two difficulty levels:
- Easy Level: These tasks function as warm-ups, helping to verify setups, integrations, and research direction. They typically require only a single structured dataset and minimal contextual knowledge. On average, humans achieve a 62% baseline on these tasks after 3+ hours of work, while a Llama 70B zero-shot prompt can exceed 90% accuracy.
- Hard Level: These tasks demand a more complex approach, involving multiple structured datasets and domain-specific knowledge. Unlike the easy level, they typically cannot be solved with single-shot code generation and require multiple steps of reasoning.
As an example of a multi-step reasoning problem, the following code shows a snippet of the human-made reference solution to a Hard Level task. Overall, it is broken down into four (4) sequential steps, including the development of various support macros. To code this solution, the agent would have to have specific domain knowledge and the ability to work in sequential steps of iterative reasoning.
Figure 2: The 220-line reference solution to a question in the Hard Set: "If the merchant {merchant} had switched its MCC code to {target_mcc} before the start of 2023, how much of a difference in fees would they have to pay for the year 2023?" The solution requires multiple steps of inductive reasoning that are difficult for one-shot code generation.
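To make the multi-step structure concrete, here is a heavily simplified sketch of how such a solution might be organized; it is not the reference solution. The fee-rule fields (`merchant_category_code`, `fixed_amount`, `rate`), the column names, and the MCC values are hypothetical stand-ins for the far richer matching logic described in manual.md.

```python
import json
import pandas as pd

payments = pd.read_csv("data/payments.csv")
fee_rules = json.load(open("data/fees.json"))

# Step 1: a support macro that prices one transaction under a given MCC
# (hypothetical fields; the real rules involve many more conditions).
def fee_for_txn(txn, mcc, rules):
    total = 0.0
    for rule in rules:
        if mcc in rule.get("merchant_category_code", []):
            total += rule.get("fixed_amount", 0.0)
            total += rule.get("rate", 0) * txn["eur_amount"] / 10000
    return total

# Step 2: a support macro that totals fees over a set of transactions.
def total_fees(txns, mcc, rules):
    return sum(fee_for_txn(t, mcc, rules) for _, t in txns.iterrows())

# Step 3: scope to the merchant's 2023 transactions (column names assumed).
txns = payments[(payments["merchant"] == "Merchant X") & (payments["year"] == 2023)]

# Step 4: fees under the hypothetical target MCC minus fees under the current MCC.
delta = total_fees(txns, 5999, fee_rules) - total_fees(txns, 5942, fee_rules)
print(round(delta, 2))
```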
Generalization
Some quick comments on how we hope to encourage generalization with the benchmark.
Symbolic reasoning: In the spirit of GSM-Symbolic, tasks have been exploded in cardinality using permutations of time ranges, merchant names, etc. The rationale is to remove the possibility of "lucky guesses" and to validate core reasoning (repeatability of reasoning) and generalization.
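As an illustration, the templating can be as simple as the following sketch; the template text and value pools here are made up for demonstration and are not the actual generation code.

```python
from itertools import product

# Illustrative template and value pools; the benchmark permutes many more
# dimensions (time ranges, merchants, schemes, ACIs, ...) drawn from the data.
template = "What was the average fraud rate for {merchant} on {scheme} during {period}?"
merchants = ["Crossfit Hanna", "Merchant B"]
schemes = ["SwiftCharge", "Scheme B"]
periods = ["Q1 2023", "Q3 2023"]

tasks = [
    template.format(merchant=m, scheme=s, period=p)
    for m, s, p in product(merchants, schemes, periods)
]
print(len(tasks), "task variants from a single template")  # 2 * 2 * 2 = 8
```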
Hidden test set: We have opted not to divide the dataset into validation and test sets and are only releasing a held-out test set. This is because a data analyst agent should be able to generalize across various analysis tasks not necessarily captured in this version of the benchmark.
Dev set: Given this tough generalization setting, the size of the benchmark (450 questions), and in the spirit of developer friendliness, we have also released a dev set, which is a representative subset of the full test set, including answers. The purpose of this dev set is to allow researchers to configure their end-to-end submission pipeline locally, with evaluation and fast feedback loops, before going through the leaderboard proper.
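For example, a local dev-set loop could look roughly like the sketch below, assuming the tasks are distributed via the Hugging Face Hub; the repository id, config, split, and field names are placeholders, so check the leaderboard page for the canonical ones.

```python
from datasets import load_dataset

# Placeholder repo id / config / split; see the DABstep leaderboard page for the real ones.
dev_tasks = load_dataset("adyen/DABstep", "tasks", split="dev")

for task in dev_tasks.select(range(3)):
    print(task["question"])    # assumed field name
    print(task["guidelines"])  # assumed field name: formatting guidance for the factoid answer
    # run your agent here and compare its output against task["answer"] locally
```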
To test broad generalization, DABstep should not be benchmarked alone; it should be seen together with other benchmarks that test overall generalization and problem-solving (e.g., MMLU, SuperGLUE, GPQA).
Evaluations
For simplicity, we have opted for a factoid-based answer evaluation system. This means that answers to benchmark questions should be simple words, numbers, or multiple-choice combinations. This allows for unbiased, scalable, and model-free evaluations (as opposed to natural-language answer submissions evaluated by a judge LLM).
That said, we did not intend answer formatting to be the focus of the benchmark. To that end, we implemented a series of flexible evaluation methods that keep the focus on the accuracy of the answers rather than on their formatting. For example, we use adaptive tolerance to compare numerical values, allowing for variations in precision and formatting. Strings are normalized and compared using fuzzy matching with a similarity-ratio threshold. Lists are evaluated element-wise after normalization.
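To make this concrete, here is a minimal sketch of such a lenient scorer. The exact tolerances and thresholds of the official evaluator may differ; this simply illustrates the three cases (numbers, strings, lists) described above.

```python
from difflib import SequenceMatcher

def numbers_match(pred: str, gold: str, rel_tol: float = 1e-2) -> bool:
    # Adaptive numeric comparison: strip common formatting, then compare with tolerance.
    p = float(pred.replace(",", "").replace("%", "").replace("$", ""))
    g = float(gold.replace(",", "").replace("%", "").replace("$", ""))
    return abs(p - g) <= rel_tol * max(1.0, abs(g))

def strings_match(pred: str, gold: str, threshold: float = 0.9) -> bool:
    # Normalized fuzzy match with a similarity-ratio threshold.
    return SequenceMatcher(None, pred.strip().lower(), gold.strip().lower()).ratio() >= threshold

def answers_match(pred: str, gold: str) -> bool:
    # Lists are scored element-wise; single answers are just one-element lists.
    pred_items = [x.strip() for x in pred.split(",")]
    gold_items = [x.strip() for x in gold.split(",")]
    if len(pred_items) != len(gold_items):
        return False
    for p, g in zip(pred_items, gold_items):
        try:
            ok = numbers_match(p, g)
        except ValueError:
            ok = strings_match(p, g)
        if not ok:
            return False
    return True

print(answers_match("346.5", "346.49"))             # True under the assumed tolerance
print(answers_match("swiftcharge", "SwiftCharge"))  # True via normalization + fuzzy match
```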
Real-time leaderboard
DABstep features a real-time leaderboard hosted on Hugging Face, where participants can submit their answers and have them graded immediately. You can see how you stand against others across the globe, with fast feedback.
View of the leaderboard with the highest-ranked submissions. Link: DABstep Leaderboard
Baselines
We’re providing a set of baseline evaluations across popular open and closed models.

Figure 3: Performance (on the Hard set) across closed and open models/providers. * The reasoning models did not work well with the unified ReAct prompt we used across all chat models, so we had to craft a special reasoning prompt. See baseline implementation and prompt details here. We benchmarked the commercial offering of DeepSeek-V3.
From Figure 3 we can see that there is a lot of progress to be made, with even the best available agents not crossing the 20% line.
The best-performing agents were based on the latest reasoning models, with o3-mini coming out on top at 16% accuracy and R1 coming in at 13%**. The closest chat-based model was Claude Sonnet at 12%, with the open DeepSeek V3 coming in at 6%.
One surprising finding was that while instruct models perform well out of the box with a ReAct prompt, reasoning models do not and achieve 0% accuracy. Common failure modes include poor instruction following, invalid code syntax, failing to close code blocks, improper tool use, and one-turn dialogues (i.e., no sequential steps). It took multiple iterations on the prompt to get the reasoning models to perform well on this benchmark.
The baselines provided as part of the benchmark use standardized prompts across the chat and reasoning models, and thus they should be considered non-optimized and a lower bound on performance.
*We had to design a special prompt for the reasoning models because our unified ReAct prompt, while performing excellently with chat models, performed exceptionally poorly with all of the reasoning models.
** R1 performance is extrapolated from a sample due to the prolonged outage at the time of publication.
In addition, we tracked the cost of running the full benchmark for each commercial offering and compare the costs in Table 2 below:
| Name | Cost | Cost/Task |
|---|---|---|
| o1 | $435 | $0.967 |
| Claude 3.5 Sonnet | $90 | $0.200 |
| o3-mini | $85 | $0.198 |
| GPT 4o | $50 | $0.111 |
| Claude 3.5 Haiku | $35 | $0.078 |
| GPT 4o-mini | $3 | $0.007 |
| Deepseek R1 | $3 | $0.007 |
| Deepseek V3 | $2 | $0.004 |
Table 2: Costs for commercial models. Due to subjectivity/variance, we did not include a price analysis of open models. The cost/performance tradeoff is explored in Figure 4.
We break down the economics in Figure 4 via the accuracy-vs-cost tradeoff.

Figure 4: Performance vs. cost tradeoff across commercial providers.
Arguably, the economics of DeepSeek R1 are ideal, as there is essentially no tradeoff between performance and cost.
Now, let's take a look at how you can run the benchmark yourself and evaluate your own models.
Getting Started and Infra
We are mindful that doing agentic research by interacting with the benchmark requires an execution environment and involves costs. We are lowering the barrier by providing access to Hugging Face's Inference API and smolagents. With these tools, researchers get 1k free LLM requests daily and access to a secure local code execution environment.
For convenience, we provide an example notebook: a ready-to-go solution for submitting an entry at zero cost (quickstart.ipynb).
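If you prefer to start from a plain script, the following sketch shows what a minimal baseline loop might look like with smolagents, assuming the `CodeAgent` and `HfApiModel` interfaces available at the time of writing; the quickstart notebook remains the canonical reference for prompt formatting and answer submission.

```python
from smolagents import CodeAgent, HfApiModel

# Free-tier Inference API model; swap in any model/provider you have access to.
model = HfApiModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")
agent = CodeAgent(
    tools=[],
    model=model,
    additional_authorized_imports=["pandas", "numpy", "json"],
)

# A single task; in practice you would loop over the dev/test set and collect answers.
task = (
    "Which card scheme had the highest average fraud rate in 2023? "
    "The context files are in ./data (payments.csv, payments-readme.md, manual.md, ...). "
    "Answer with the name of the scheme only."
)
print(agent.run(task))
```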
By eliminating friction, DABstep ensures that anyone, from seasoned researchers to curious newcomers, can contribute to advancing agentic workflows in data analysis.
Future direction
We are really excited about the release of DABstep and think it will help test the state of data analysis agents today. However, this release marks just the first step, and we plan to evolve the benchmark over time.
At the current rate of progress in AI, we foresee that the benchmark will eventually be considered solved in its current state. However, the benchmark is designed to remain valid for a long time by increasing the difficulty along many dimensions. We will build on the benchmark with full backwards compatibility. Below are some broad avenues along which we will improve it.
Tasks: The current tasks are narrow and limited in scope, covering mostly fraud and payment fees. This is a subset of the real world, as there are many other dimensions and variables at play. In the future, we will expand the benchmark to include tasks in the areas of approval rates (issuer refusals), authentication drop-offs, and real-time situations over a wider time span, such as seasonal components. This would test the ability of agents to balance several variables at the same time and execute trade-offs across multiple dimensions.
Domains: The benchmark currently revolves around tasks from the financial sector. However, we invite researchers and practitioners from other fields, such as health, biology, insurance, and telecommunications, to contribute new subsets to the benchmark so we can evaluate performance across many domains.
Data scale: The structured data will eventually no longer fit in memory. It will not be analyzable with standard tooling, requiring analysts to use distributed computing engines or to schedule workflows for later analysis.
Documentation: The unstructured data that captures the domain knowledge will grow to include more files covering time evolution (e.g., bulletins), different formats (e.g., PDF), and different versions of similar-but-distinct logic for each scheme, acquirer, or partner. The context will reach a point where it simply will not fit within the token limits of current and future context windows.
Multimodal: Agents should also become multimodal. To this end, we will enhance the benchmark with tasks that require extracting logic by interpreting and creating plots and graphs.
Related Works
Existing benchmarks for evaluating AI in data analysis have advanced the field, and DABstep builds on their foundations.
DS Bench evaluates 466 questions from ModelOff competitions, primarily designed for Excel-based workflows. While effective for small-scale tasks, Excel does not support the iterative, code-driven workflows common in real-world data analysis (e.g., Jupyter notebooks). Moreover, its reliance on GPT-4 as an evaluator introduces bias and reduces generalizability.
DS-1000 tests Python-based data analysis tasks sourced from StackOverflow, curated to avoid memorization. However, its tasks are short and single-shot, lacking real datasets and iterative reasoning. This limits its ability to evaluate end-to-end workflows or multimodal capabilities.
Acknowledgments: Harm de Vries (Graidd), Arjun Guha (Northeastern University), Hanna van der Vlis (Adyen)
[^1]: While the tasks are realistic, the data has been generated synthetically. The business context (including merchant names, rates, volumes, transaction values, fraud rates, and fees) has been artificially generated and does not reflect the actual performance that companies might exhibit. For instance, fraud rates have been intentionally elevated for the purpose of this exercise.
