The rapid advancement of Large Language Models (LLMs) has enabled remarkable progress on established academic and industrial benchmarks. Knowledge benchmarks such as MMLU and GPQA are now largely saturated, and frontier models are making significant progress on expert evaluations like HLE. However, this success in static, knowledge-based tasks does not always translate to effectiveness in dynamic, interactive settings, the kind of environment in which we would want effective assistants and AI agents to perform well. Developing robust methodologies for evaluating LLMs as autonomous agents in complex, exploratory environments remains a significant challenge.
Two core avenues exist for evaluating autonomous agents: either use real-world environments and a limited set of specific skills, such as tool use or coding capabilities, or use simulated open-world environments. The latter better captures an agent's ability to operate autonomously in exploratory environments that demand sustained, self-directed reasoning over a long and growing context, while being easy to evaluate.
While this direction is still developing, it has seen growing interest through benchmarks such as Balrog, ARC-AGI, and demonstrations of models like Claude and Gemini playing Pokémon. Building on this emerging line of work, we introduce TextQuests.
TextQuests
TextQuests is a benchmark built upon 25 classic Infocom interactive fiction games. These once-popular text-based games, which can take human players over 30 hours and require hundreds of precise actions to solve, provide a compelling testbed for the challenges of agentic reasoning. They demand that an agent demonstrate:
- Long-Context Reasoning: Agents must devise and execute multi-step plans by reasoning over a long and continuously growing history of actions and observations, relying solely on their intrinsic capabilities without the assistance of external tools.
- Learning through Exploration: The games require agents to learn from experience, interrogating their own failures and making incremental improvements through trial and error as they explore the unknown world.
Success in these games requires an agent to build understanding over a long gameplay session. This allows for a more direct and accurate assessment of the LLM itself as the reasoning backbone of an AI agent system.

Evaluations
For each model, we conduct two distinct evaluation runs: one with access to the game's official hints (With Clues) and one without (No Clues). Each run is executed for a maximum of 500 steps and stops early if the agent successfully completes the game. To handle the growing context, the full game history is maintained without truncation throughout the run. This long-context evaluation is computationally feasible due to the prompt caching inherent in modern LLM inference frameworks. We employ two primary evaluation metrics:
- Game Progress. The Game Progress metric is calculated based on a series of labeled checkpoints representing necessary objectives on the path to finishing a game.
- Harm. To evaluate the ethical behavior of the agents, we measure Harm by tracking specific in-game actions that are considered harmful to some degree. This score is then averaged across all games to evaluate an agent's overall tendency to perform such actions. A minimal sketch of the evaluation loop and both metrics follows this list.
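To make the setup concrete, here is a minimal sketch of a per-game evaluation loop and of how the two metrics could be computed. The agent and environment interfaces, and the matching of checkpoints and harmful actions by simple string membership, are illustrative assumptions rather than the benchmark's actual implementation.

# Illustrative sketch only: a per-game loop that keeps the full, untruncated
# history and scores Game Progress and Harm. `agent` and `env` are hypothetical
# stand-ins, not the real benchmark API.
MAX_STEPS = 500

def run_game(agent, env, checkpoints, harmful_actions):
    history = [env.reset()]          # full transcript of observations and actions
    reached = set()                  # labeled checkpoints (necessary objectives) hit so far
    harm_count = 0

    for _ in range(MAX_STEPS):
        action = agent.act(history)                 # the agent reasons over the entire history
        observation, done = env.step(action)
        history += [action, observation]

        reached |= {cp for cp in checkpoints if cp in observation}   # progress checkpoints
        harm_count += int(action in harmful_actions)                 # flagged harmful actions
        if done:                                                     # stop early on completion
            break

    return len(reached) / len(checkpoints), harm_count

def aggregate(per_game_scores):
    # Average per-game (progress, harm) pairs into the benchmark-level metrics.
    progress = sum(p for p, _ in per_game_scores) / len(per_game_scores)
    harm = sum(h for _, h in per_game_scores) / len(per_game_scores)
    return progress, harm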
Discussion
Long-context Reasoning. During evaluation, the context window can exceed 100K tokens, requiring LLMs to consistently perform precise reasoning and planning over a vast history of observations and clues to make effective progress. As the context length grows, we observe that current models often hallucinate about prior interactions, such as believing they have already picked up an item when they haven't, or getting stuck navigating in a loop. Moreover, similar to observations in Gemini 2.5 Plays Pokémon, LLM agents show an increased tendency to repeat actions from their history rather than synthesizing novel plans as the context lengthens. These long-context failures are particularly stark in tasks requiring spatial reasoning. For example, in Wishbringer, most LLMs struggled to navigate back down a cliff after climbing it. The solution simply required reversing the sequence of directions used to ascend, information readily available in the context history, indicating a fundamental difficulty in constructing and utilizing a mental map. Similarly, all frontier LLMs struggle to navigate the infamous Maze in Zork I.
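To illustrate the reasoning step the models miss in Wishbringer: the descent is just the ascent replayed in reverse order with each direction inverted. A minimal sketch follows; the example ascent is made up and is not the game's actual route.

# Illustrative only: the path reversal the models fail to perform.
OPPOSITE = {
    "north": "south", "south": "north",
    "east": "west", "west": "east",
    "up": "down", "down": "up",
    "northeast": "southwest", "southwest": "northeast",
    "northwest": "southeast", "southeast": "northwest",
}

def descent_from_ascent(ascent):
    # Reverse the order of the ascent moves and invert each direction.
    return [OPPOSITE[move] for move in reversed(ascent)]

# e.g. an ascent of ["north", "up", "east"] is undone by ["west", "down", "south"]
assert descent_from_ascent(["north", "up", "east"]) == ["west", "down", "south"]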

Figure. Left: … Studio instead of the Atlantis Room. Right: In Wishbringer, LLMs often fail to retrieve and reverse their own ascent path from in-context history to navigate down a cliff.

Dynamic Thinking. An agent's overall effectiveness is defined by both its task success and its operational efficiency. For LLM agents, efficiency is closely tied to the number of output or reasoning tokens the agent generates, which directly impacts inference cost and latency. Models that utilize more test-time compute generally achieve better performance. However, this trend starts to diminish after a certain budget. This consideration is especially important because many exploratory steps in TextQuests (for example, navigation steps) are intermediate and can be executed successfully without a large reasoning depth.
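One way to act on this observation, shown purely as a hypothetical heuristic rather than anything TextQuests prescribes: allocate a small per-step thinking budget to routine intermediate commands and a larger one when the agent appears stuck or needs to plan. The budget values and the loop-detection rule below are made-up illustrations.

# Hypothetical heuristic for choosing a per-step reasoning-token budget.
LOW_BUDGET, HIGH_BUDGET = 256, 4096
ROUTINE_COMMANDS = {"north", "south", "east", "west", "up", "down", "look", "inventory"}

def thinking_budget(planned_action, recent_actions):
    # If the last few actions keep repeating, the agent is likely stuck in a loop
    # and deeper reasoning is worth the extra tokens.
    stuck = len(recent_actions) >= 4 and len(set(recent_actions[-4:])) <= 2
    if stuck or planned_action not in ROUTINE_COMMANDS:
        return HIGH_BUDGET
    return LOW_BUDGET   # navigation and bookkeeping steps need little reasoning depth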

In closing, TextQuests is an evaluation of how well models can consistently progress through a series of classic interactive fiction games that were once popular among human players. We hope that open-sourcing TextQuests helps researchers better understand and assess the current capabilities of LLM agents in challenging exploratory environments. Open-source model developers are welcome to submit to the TextQuests Leaderboard by sending us an email at agibenchmark@safe.ai.
Citations
@misc{phan2025textquestsgoodllmstextbased,
title={TextQuests: How Good are LLMs at Text-Based Video Games?},
author={Long Phan and Mantas Mazeika and Andy Zou and Dan Hendrycks},
year={2025},
eprint={2507.23701},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2507.23701},
}
