
Most current AI benchmarks focus on answering questions about the past, either by testing models on existing knowledge (in a static manner, as in HLE or GPQA, or augmented, as in BrowseComp or GAIA) or on previously solved problems (as in PaperBench, DABStep, or most coding evaluations). However, we believe that more valuable AI, and ultimately AGI, will be distinguished by its ability to use this past to forecast interesting aspects of the future, rather than merely reciting old facts.
Forecasting future events is a complex and holistic task: it requires sophisticated reasoning, synthesis, probability weighing, and real understanding, rather than pattern matching against or searching existing information. Evaluating models on their ability to predict future outcomes, whether in science, economics, geopolitics, or technology, tests the kind of intelligence that creates real-world value.
Beyond its inherent importance, this forecasting-based approach also solves many methodological problems faced by current evaluations and benchmarks. Traditional benchmarks that measure accuracy on fixed test sets are inevitably affected by possible data contamination, and without access to a model’s complete, reproducible training pipeline, it’s hard to trust the results. The most serious evaluation efforts now keep their test sets completely private, creating a frustrating arms race between evaluators and potential “gaming the leaderboard” mechanics (Singh et al., 2025).
Forecasting makes contamination impossible by design: you can’t train on data that does not yet exist! This creates a level playing field where success depends on reasoning capability rather than memorization.
Perhaps most importantly, predictions about the future are inherently verifiable. We can wait and see who was right, creating an objective, time-stamped measure of model performance.
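To make the mechanics concrete, here is a minimal scoring sketch (our own illustration, not the exact FutureBench implementation): each prediction is stored with timestamps, and once an event resolves we can compute accuracy or a Brier score over the resolved subset.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Prediction:
    question: str
    p_yes: float                  # model's probability that the event happens
    made_at: datetime             # when the prediction was made
    resolves_at: datetime         # when the outcome becomes known
    outcome: bool | None = None   # filled in after the event resolves

def score(predictions: list[Prediction]) -> dict:
    """Score only the predictions whose outcome is already known."""
    resolved = [p for p in predictions if p.outcome is not None]
    if not resolved:
        return {"n": 0}
    accuracy = sum((p.p_yes >= 0.5) == p.outcome for p in resolved) / len(resolved)
    brier = sum((p.p_yes - float(p.outcome)) ** 2 for p in resolved) / len(resolved)
    return {"n": len(resolved), "accuracy": accuracy, "brier": brier}
```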
We therefore propose evaluating agents on their ability to predict future events (Ye et al., 2024; Karger et al., 2025). FutureBench draws from real-world prediction markets and emerging news to create interesting prediction tasks grounded in actual future outcomes. We collect events from live news coverage and prediction market platforms, filtering them to focus on emerging events worth predicting. Using an agent-based approach, we curate scenarios that require real reasoning rather than simple pattern matching. Think geopolitical developments, market movements, or technology adoption trends: events where informed analysis actually matters.
Can Agents Predict Future Events?
This is the obvious question, and it’s at the heart of what makes this benchmark interesting! We believe the answer can’t be a simple “yes” or “no”, since it mostly depends on the specific questions; there are always important caveats to consider.
Humans constantly use their ability to weigh current information to predict future events. Aren’t most career moves, relationship decisions, and even business strategies essentially bets on future outcomes?
Some predictions involve irreducible uncertainty (will it rain on December 17th, 2027 at noon?), but many do not. When a skilled analyst predicts a company’s quarterly earnings or a policy expert forecasts election outcomes, they’re using available information to make informed decisions. That is precisely what we’re asking AI agents to do with FutureBench! The task is not to get agents to fortune-tell, but rather to synthesize information and reason under stronger uncertainty than most other benchmarks.
An agent’s prediction quality directly reflects its ability to search for relevant information, synthesize complex data, and reason about cause-and-effect relationships. These are precisely the capabilities we want to measure in real-world applications.
Tools like DeepResearch are already used for market analysis and strategic planning, where the quality of information collection strongly correlates with decision-making effectiveness. FutureBench is inspired by this process and tries to evaluate agents against objective, verifiable outcomes.
Building a benchmark that tests real prediction capabilities requires a steady stream of meaningful questions. We have developed two complementary approaches that capture different types of future events:
1. News-Generated Questions: Finding Tomorrow’s Headlines Today
Our first approach uses AI to mine current events for prediction opportunities. We deploy a smolagents-based agent to scrape a few major news websites, analyze front-page articles, and generate prediction questions about their likely outcomes. The agent reads through the front pages, identifies interesting articles, and formulates specific, time-bound questions from their content, for example “Will the Federal Reserve cut interest rates by at least 0.25% by July 1st, 2025?”
We guide this process with carefully crafted prompts that specify what makes a good prediction question: events that are meaningful, verifiable, and still uncertain at extraction time.
Technical Stack:
- Model: DeepSeek-V3 for reasoning and query generation
- Scraping: Firecrawl for reliable content extraction
- Search: Tavily for additional context when needed
The agent typically generates 5 questions per scraping session, each with a time horizon of one week, meaning we assume we will know the answer to the question after seven days. This gives us a natural pipeline of fresh evaluation material tied to real-world events.
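For illustration, here is a minimal sketch of such a question-generation agent built with smolagents; the scraping tool, prompt wording, model id, and site are simplified stand-ins for our actual Firecrawl/Tavily setup.

```python
import requests
from smolagents import CodeAgent, LiteLLMModel, tool

@tool
def scrape_frontpage(url: str) -> str:
    """Fetch a news front page and return its raw HTML (truncated).

    Args:
        url: Address of the news front page to scrape.
    """
    # Simplified stand-in for a Firecrawl-style scraper that returns clean markdown.
    return requests.get(url, timeout=30).text[:20000]

QUESTION_PROMPT = """Read the front page of {site}.
Pick interesting, still-unresolved stories and write 5 specific, time-bound,
verifiable yes/no prediction questions that will resolve within 7 days, e.g.
"Will the Federal Reserve cut interest rates by at least 0.25% by July 1st, 2025?"
Return them as a plain numbered list."""

model = LiteLLMModel(model_id="deepseek/deepseek-chat")  # example id for DeepSeek-V3
agent = CodeAgent(tools=[scrape_frontpage], model=model)
questions = agent.run(QUESTION_PROMPT.format(site="https://www.reuters.com"))
```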
2. Polymarket Integration: Leveraging Prediction Markets
Our second source is Polymarket, a prediction market platform where real participants make forecasts about future events. We currently ingest around 8 questions per week.
However, the raw data needs filtering. We apply strong filters to remove generic questions about temperature and some questions about the stock and crypto markets, which would otherwise be too numerous for practical use in our benchmark.
In addition, Polymarket questions have fewer constraints on the final “realization” time: the actual outcome of an event might only become available next month or by the end of the year. These are still very relevant questions, but outcome collection is more sparse.
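The filtering itself is mostly keyword-based. Here is a minimal sketch, with an illustrative keyword list and plain-string inputs rather than Polymarket’s actual API schema:

```python
# Keywords used to drop weather/temperature and most stock/crypto questions
# (illustrative list, not our exact filter).
EXCLUDED_KEYWORDS = ("temperature", "°f", "°c", "bitcoin", "ethereum", "crypto",
                     "stock", "s&p", "nasdaq")

def keep_question(question: str) -> bool:
    """Return True if the question should stay in the benchmark."""
    q = question.lower()
    return not any(keyword in q for keyword in EXCLUDED_KEYWORDS)

raw = [
    "Will monthly inflation increase by 0.2% in June?",
    "Will Bitcoin close above $120k this week?",
    "Will the highest temperature in NYC exceed 90°F on Friday?",
]
filtered = [q for q in raw if keep_question(q)]  # keeps only the inflation question
```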
Example Questions
Here’s an example of what comes out of our query generation pipeline:
| News-Generated | Polymarket |
|---|---|
| “Will the Federal Reserve cut interest rates by at least 0.25% by July 1st, 2025?” | “Will monthly inflation increase by 0.2% in June?” |
| “Will Ukraine and Russia hold peace negotiations by July 8th, 2025?” | “Will Zohran Mamdani’s RCV margin of victory be greater than 13% in the New York City Mayoral Democratic Primary?” |
FutureBench: Three Levels of Systematic Evaluation
The next question is: what does this kind of benchmark let us measure? The framework operates on three distinct levels, allowing us to isolate exactly what we’re measuring:
- Level 1: Framework Comparison. Keep the underlying LLMs and tools constant while varying the framework. How does a LangChain-based agent compare to one built with CrewAI when both use GPT-4 and the same search tools? This isolates the impact of different agentic frameworks.
- Level 2: Tool Performance. Fix the LLM and framework while comparing different tool implementations. Which search tool (for example Tavily, Google, Bing) leads to better predictions, holding everything else constant? This reveals which tools actually provide value, and how much value tools bring in general relative to models without any tools.
- Level 3: Model Capabilities. Hold the framework and tools constant while testing different LLMs. Given access to the same set of tools, does DeepSeek-V3 use them as effectively as GPT-4? This measures pure reasoning ability.
This systematic approach lets us understand exactly where performance gains and losses occur within the agent pipeline.
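In practice this boils down to sweeping a configuration grid and fixing two of the three dimensions at a time. A minimal sketch, with example names rather than the exact set of frameworks, tools, and models we run:

```python
from itertools import product

FRAMEWORKS = ["smolagents", "langchain", "crewai"]
SEARCH_TOOLS = ["tavily", "google", "bing", None]   # None = base model without tools
MODELS = ["gpt-4.1", "claude-3.7-sonnet", "deepseek-v3"]

# Level 1: fix (model, tool) and vary the framework.
# Level 2: fix (model, framework) and vary the tool.
# Level 3: fix (framework, tool) and vary the model.
for framework, search_tool, model in product(FRAMEWORKS, SEARCH_TOOLS, MODELS):
    config = {"framework": framework, "search_tool": search_tool, "model": model}
    # run_agent(config, question) would build the agent for this configuration and
    # return its prediction; grouping results by the varying dimension isolates it.
    print(config)
```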

The benchmark also serves as a robust test of instruction following. Agents must respect specific formatting requirements and generate actions that can be correctly parsed and executed. In practice, this often reveals where smaller language models struggle with complex multi-step reasoning.
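As a toy example of what this format enforcement can look like (the exact answer format used in FutureBench may differ), a parser can simply reject any response that does not end with a strict yes/no answer:

```python
import re

def parse_final_answer(agent_output: str) -> str | None:
    """Extract a strict Yes/No answer; return None if the required format is violated.

    Assumes, for illustration, that agents are instructed to finish their response
    with a line like 'Final answer: Yes'.
    """
    match = re.search(r"final answer:\s*(yes|no)\b", agent_output, flags=re.IGNORECASE)
    return match.group(1).capitalize() if match else None

assert parse_final_answer("...reasoning...\nFinal answer: No") == "No"
assert parse_final_answer("I think probably yes?") is None  # counts as a format failure
```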
🚀 Try it yourself! Explore the live leaderboard: FutureBench Interactive Leaderboard
Predicting The Future: Agents and Initial Results
We use smolagents as the baseline agent framework for all questions. We also compute performance on the base models. For the prediction task itself, the agents get access to a focused toolkit:
- Search: Tavily integration for finding recent information and expert analysis
- Web Scraper: A simple web scraping tool for following up on specific sources and getting detailed context.
This intentionally lean setup forces agents to be strategic about information gathering while still providing the tools needed for informed predictions.
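A minimal sketch of such a prediction agent built with smolagents is shown below, using the library’s built-in search and webpage tools as stand-ins for our Tavily integration and scraper, and an example model id:

```python
from smolagents import CodeAgent, DuckDuckGoSearchTool, LiteLLMModel, VisitWebpageTool

model = LiteLLMModel(model_id="anthropic/claude-3-7-sonnet-latest")  # example id
agent = CodeAgent(tools=[DuckDuckGoSearchTool(), VisitWebpageTool()], model=model)

question = "Will the Federal Reserve cut interest rates by at least 0.25% by July 1st, 2025?"
prediction = agent.run(
    "Research the question below using the available tools, then answer on the last "
    "line as 'Final answer: Yes' or 'Final answer: No'.\n\n"
    f"Question: {question}"
)
```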
Initial Results

We compare different models using smolagents as a baseline (you can find the leaderboard on our HF Space). We also run the plain language models without web access to estimate a general prior. As expected, agentic models perform better than plain language models, and stronger models show more stable prediction quality. Overall, we also find interesting patterns in how different models approach a question:
Interesting Action Patterns
Running this benchmark has revealed insights into how different models approach information gathering. One striking difference concerns scraping: GPT-4.1 appears to rely more on search results, while Claude 3.7 and 4 explore the web in more detail and tend to use web scraping more often. This thorough approach also means collecting many more input tokens during the research process, which increases cost.
Models show interesting approaches to making predictions. For example, to answer the question “Will annual inflation increase by 2.6% or more in June?”:
- The DeepSeek-V3 agent analyzed June 2025 inflation prospects by searching recent CPI data (finding current inflation at 2.4-2.8%), considered tariff impacts as upward pressure, and concluded inflation would exceed the 2.6% threshold.
- Claude 3.7 analyzed June 2025 inflation through comprehensive research (11 searches vs. DeepSeek-V3’s 3): it systematically gathered May 2025 CPI data (2.4% year-over-year), identified decelerating monthly trends (0.2%→0.1%), weighed tariff pressures against restrictive Fed policy, calculated the precise 0.2% gap needed, and concluded that the recent deceleration made reaching the 2.6% threshold unlikely, answering “No.”
- GPT-4.1 analyzed June 2025 inflation through targeted searches for market consensus and forecasts: it identified May 2025 CPI at 2.4% (below the 2.5% expectation), noted weak 0.1% monthly increases, found no forecaster predicting 2.6%+ for June, and concluded the jump from 2.4% to 2.6% was unlikely given recent below-expectation trends.
Interestingly, Claude was the only model that attempted to access the Bureau of Labor Statistics website to scrape it directly, which failed because it is a .gov website and we do not allow this type of action.
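The domain restriction itself can be as simple as a hostname check; a minimal sketch (the actual FutureBench policy may be broader):

```python
from urllib.parse import urlparse

BLOCKED_SUFFIXES = (".gov",)  # illustrative blocklist

def is_allowed(url: str) -> bool:
    """Reject URLs whose hostname ends with a blocked suffix (e.g. bls.gov)."""
    host = urlparse(url).hostname or ""
    return not host.endswith(BLOCKED_SUFFIXES)

assert not is_allowed("https://www.bls.gov/cpi/")      # blocked
assert is_allowed("https://www.reuters.com/markets")   # allowed
```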
The models exhibit distinct reasoning patterns in their outputs. GPT’s analysis focused on consensus forecasts as the key signal for future events rather than extrapolating from current data; Claude’s approach showed a rigorous analytical structure, with a systematic pro/con framework and quantitative gap analysis; and DeepSeek-V3’s output displayed explicit acknowledgment of data limitations and systematic methodology adjustments when its initial approaches hit constraints.
These behavioral differences reveal interesting patterns in how different models approach information gathering. The variations in web usage and token consumption suggest that models have distinct strategies for tackling prediction tasks, which FutureBench can help us measure and understand.
One challenge is that evaluation can be expensive due to the large number of input tokens. For example, Claude tends to visit web pages often, accumulating many input tokens. In a multi-turn loop, this can make the number of input tokens skyrocket very quickly, increasing the cost of every subsequent generation, even though most tokens are eventually cached.
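A toy back-of-the-envelope calculation (with made-up numbers) shows why: if each turn scrapes another page and the whole conversation history is resent as input on every turn, cumulative input tokens grow quadratically with the number of turns.

```python
# Assume each scraped page adds ~8k tokens and the full context is resent every turn.
tokens_per_page = 8_000
context = 0
cumulative_input = 0
for turn in range(1, 11):
    context += tokens_per_page   # context grows by one page per turn
    cumulative_input += context  # the whole context is billed as input again
print(cumulative_input)          # 440,000 input tokens after just 10 turns
```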
FutureBench is an evolving benchmark: as we discover new findings and better patterns, we will keep incorporating them. We would love feedback from the community to understand how to better source questions, which experiments to run, and which data is the most interesting to analyze.
References
Singh, S., Nan, Y., Wang, A., D’souza, D., Kapoor, S., Ustun, A., Koyejo, S., Deng, Y., Longpre, S., Smith, N., Ermiş, B.H., Fadaee, M., & Hooker, S. (2025). The Leaderboard Illusion. ArXiv, abs/2504.20879.
Karger, E., Bastani, H., Yueh-Han, C., Jacobs, Z., Halawi, D., Zhang, F., & Tetlock, P.E. (2025). ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities. ICLR.
Ye, C., Hu, Z., Deng, Y., Huang, Z., Ma, M.D., Zhu, Y., & Wang, W. (2024). MIRAI: Evaluating LLM Agents for Event Forecasting. ArXiv, abs/2407.01231.
