How Can A Model 10,000× Smaller Outsmart ChatGPT?


1. Introduction

For the last decade, the AI industry has operated on a single unspoken convention: that intelligence can only emerge at scale. We convinced ourselves that for models to truly mimic human reasoning, we needed larger and deeper networks. Unsurprisingly, this led to stacking more transformer blocks on top of one another (Vaswani et al., 2017), adding billions of parameters, and training across data centers that require megawatts of power.

But has this race for ever-larger models blinded us to a far more efficient path? What if real intelligence isn't about the size of the model, but about how long you let it reason? Can a tiny network, given the freedom to iterate on its own solution, outsmart a model hundreds of times larger than itself?

2. The Fragility of the Giants

To understand why we need a new approach, we must first look at why current reasoning models like GPT-4, Claude, and DeepSeek still struggle with complex logic.

These models are primarily trained on the Next-Token-Prediction (NTP) objective. They process the prompt through their billion-parameter layers to predict the next token in a sequence. Even when they use "Chain-of-Thought" (CoT) (Wei et al., 2022) to "reason" about a problem, they are still just predicting words, which, unfortunately, isn't thinking.

This approach has two flaws.

First, it is brittle. Since the model generates its answer token by token, a single mistake in the early stages of reasoning can snowball into a completely different, and often wrong, answer. The model lacks the ability to stop, backtrack, and correct its internal logic before answering. It has to fully commit to the path it started on, often hallucinating confidently just to complete the sentence.

The second problem is that modern reasoning models rely on memorization over logical deduction. They perform well on seemingly unseen tasks because they have likely encountered similar problems in their enormous training data. But when faced with a truly novel problem, something they have never seen before (like the ARC-AGI benchmark), their massive parameter counts become useless. This suggests that current models adapt known solutions instead of formulating one from scratch.

3. Tiny Recursive Models: Trading Space for Time

The Tiny Recursion Model (TRM) (Jolicoeur-Martineau, 2025) breaks the process of reasoning down into a compact, cyclic procedure. Traditional transformer networks (i.e., our LLMs) are feed-forward architectures that must map input to output in a single pass. TRM, by contrast, works like a recurrent machine built around a single small MLP module, which improves its output iteratively. This lets it beat the best current mainstream reasoning models while being less than 7M parameters in size.

To understand how this network solves problems so efficiently, let's walk through the architecture from input to solution.

(Source: Author)
Visual illustration of the full TRM training/inference process

3.1. The Setup: The “Trinity” of State

In standard LLMs, the only "state" is the KV cache of the conversation history. TRM, by contrast, maintains three distinct vectors that feed information into one another:

  1. The Immutable Query (x): The original problem (e.g., a Maze or a Sudoku grid), embedded into a vector space. It is never updated during training or inference.
  2. The Current Hypothesis (y): The model's current "best guess" at the answer. At step t=0, it is initialized as a learnable parameter that gets updated alongside the model itself.
  3. The Latent Reasoning (z): This vector holds the abstract "thoughts" or intermediate logic the model uses to derive its answer. Like y, it is also initialized as a learnable parameter.
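As a rough sketch of this setup, here is the "trinity" of state in plain Python. The embedding, hidden size, and initial values are illustrative stand-ins, not the paper's actual learned parameters:

```python
import random

HIDDEN = 8  # illustrative hidden size; the real model uses a larger one

def embed_query(puzzle_tokens, seed=0):
    """Stand-in embedding: map each token id to a fixed pseudo-random vector.
    The query x is computed once and never updated afterwards."""
    rng = random.Random(seed)
    table = {tok: [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)]
             for tok in sorted(set(puzzle_tokens))}
    return [table[tok] for tok in puzzle_tokens]

# x: the immutable query (e.g., a flattened Sudoku row with 0 = empty cell)
x = embed_query([5, 0, 3, 0])

# y (current hypothesis) and z (latent reasoning) start from fixed initial
# values; in TRM these initializations are learned alongside the network.
rng = random.Random(1)
y = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in x]
z = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in x]
```

Note that only y and z will change during reasoning; x stays frozen after embedding.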

3.2. The Core Engine: The Single-Network Loop

At the heart of TRM is a single, tiny neural network, typically just two layers deep. This network isn't a "model layer" in the standard sense; it is more like a function that is called repeatedly.

The reasoning process consists of a nested loop with two distinct stages: Latent Reasoning and Answer Refinement.

Step A: Latent Reasoning (Updating z)

First, the model is tasked with just thinking. It takes the current state (the three vectors described above) and runs a recursive loop to update its internal understanding of the problem.
For a fixed number of sub-steps (n), the network updates its latent thought vector z by:

z ← net(x, y, z)
(Source: Author)
The model runs all three inputs through the network to update its thought vector (this repeats for n steps).

Here, the network looks at the problem (x), its current best guess (y), and its previous thought (z). This lets the model discover contradictions or logical leaps in its understanding, which it then uses to update z. Note that the answer y is not updated yet; the model is purely thinking about the problem.

Step B: Answer Refinement (Updating y)

Once the latent reasoning loop has run for n steps, the model projects these insights into its answer state. It uses the same network for this projection:

y ← net(y, z)
(Source: Author)
To refine its answer state, the model ingests only the thought vector and the current answer state.

The model translates its reasoning (z) into a tangible prediction (y). This new answer then becomes the input for the next cycle of reasoning, and the whole process repeats for T total cycles.

Step C: The Cycle Continues

After every n steps of thought refinement, one answer-refinement step runs, and this outer cycle is invoked T times. This creates a powerful feedback loop in which the model refines its own output over multiple iterations. The new answer (y at t+1) might reveal information missed by all preceding steps (e.g., "filling this Sudoku cell reveals that the 5 must go here"). The model feeds this new answer back into Step A and continues refining its thoughts until the entire Sudoku grid is filled.
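The nested loop described in Steps A to C can be sketched in a few lines. Here the states are reduced to plain floats and the 2-layer network is replaced by a toy `tiny_net`, purely to make the control flow visible; this is not the paper's implementation:

```python
import math

def tiny_net(*states):
    """Toy stand-in for TRM's single 2-layer network: any function that
    maps its input states to a new bounded state would do here."""
    return math.tanh(sum(states))

def trm_reason(x, y, z, n=6, T=3):
    """Nested TRM loop: n latent updates per answer refinement, T cycles."""
    for _ in range(T):             # outer loop: T answer-refinement cycles
        for _ in range(n):         # Step A: n "thinking" sub-steps
            z = tiny_net(x, y, z)  # update latent thought z from (x, y, z)
        y = tiny_net(y, z)         # Step B: refine answer y from (y, z) only
    return y, z

y_final, z_final = trm_reason(x=0.5, y=0.0, z=0.0)
```

The key design point survives even in this toy: the same function serves both roles, and only the inputs distinguish "thinking" (x, y, z) from "answering" (y, z).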

3.3. The “Exit” Button: Simplified Adaptive Computation Time

Another major innovation of TRM is how it manages the reasoning process efficiently. An easy problem might be solved in just two loops, while a hard one might require 50 or more, so hard-coding a fixed number of loops is restrictive and not ideal. The model should be able to decide whether it has already solved the problem or still needs more iterations to think.

TRM employs Adaptive Computation Time (ACT) to dynamically decide when to stop, based on the difficulty of the input problem.

TRM treats stopping as a simple binary classification problem, based on how confident the model is in its own current answer.

The Halting Probability (h):

At the end of each answer-refinement step, the model projects its answer state into a single scalar between 0 and 1, representing its confidence:

h_t = σ(Linear(y_t))
(Source: Author)
h_t: halting probability.
σ: sigmoid activation, bounding the output between 0 and 1.
Linear: linear transformation applied to the answer vector.
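In code, this halting head is just a linear projection followed by a sigmoid. A minimal sketch with hand-picked weights (the real head's weights are learned):

```python
import math

def halting_probability(y, w, b):
    """h = sigmoid(w . y + b): project the answer state to a confidence in (0, 1)."""
    logit = sum(wi * yi for wi, yi in zip(w, y)) + b
    return 1.0 / (1.0 + math.exp(-logit))

# Example: a 4-dimensional answer state and illustrative (untrained) weights
h = halting_probability(y=[0.9, -0.2, 0.4, 0.1],
                        w=[1.0, 0.5, -0.3, 0.2],
                        b=0.0)
```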

The Training Objective:

The model is trained with a Binary Cross-Entropy (BCE) loss. It learns to output 1 (stop) when its current answer y matches the ground truth, and 0 (continue) when it doesn't.

Loss_halt = BCE(h_t, I(y_t = y_true))
(Source: Author)
Loss_halt: loss that teaches the model when to stop.
I(•): indicator function that outputs 1 if the condition inside is true, else 0.
y_true: the ground-truth answer; the model should stop once its prediction matches it.
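A minimal sketch of this objective: the indicator compares the predicted answer to the ground truth, and the BCE term penalizes misplaced confidence (`eps` is just a guard against log(0); the answers shown are made up):

```python
import math

def halting_loss(h, y_pred, y_true, eps=1e-9):
    """BCE between the halt probability h and the target I(y_pred == y_true):
    the model is taught to say 'stop' exactly when its answer is correct."""
    target = 1.0 if y_pred == y_true else 0.0
    return -(target * math.log(h + eps)
             + (1.0 - target) * math.log(1.0 - h + eps))

# Being confident (h = 0.95) is cheap when right and expensive when wrong:
loss_right = halting_loss(0.95, y_pred=[5, 3], y_true=[5, 3])
loss_wrong = halting_loss(0.95, y_pred=[5, 1], y_true=[5, 3])
```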

Inference:

When the model runs on a new problem, it checks this probability h after every loop (i.e., every full cycle of thought and answer refinement).

  • If h > threshold: The model is confident enough. It hits the "Exit Button" and returns the current answer as final.
  • If h < threshold: The model is still unsure. It feeds y and z back into the TRM loop for further deliberation and refinement.
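Putting the two branches together, inference is a simple budgeted loop. The names `run_cycle` and `halt_prob` are placeholders for the pieces sketched earlier, and a real implementation also caps the iteration count (here `max_loops`):

```python
def solve(run_cycle, halt_prob, y, z, threshold=0.5, max_loops=16):
    """Run full TRM cycles until the halting head is confident
    or the compute budget runs out, then return the current answer."""
    for _ in range(max_loops):
        y, z = run_cycle(y, z)        # one cycle: n thought steps + 1 answer step
        if halt_prob(y) > threshold:  # confident enough: hit the exit button
            break
    return y

# Tiny demo: the "cycle" nudges y toward 1, confidence is y itself,
# so the loop exits as soon as y crosses the threshold.
answer = solve(run_cycle=lambda y, z: (y + 0.2, z),
               halt_prob=lambda y: y,
               y=0.0, z=0.0, threshold=0.5)
```

Easy problems exit after a couple of cycles; hard ones spend the whole budget, which is exactly the adaptive behavior ACT is after.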

This mechanism makes TRM computationally efficient. It achieves high accuracy not by being big, but by being persistent, allocating its compute budget exactly where it is needed.

4. The Results

To truly test the limits of TRM, it was benchmarked on some of the hardest logical datasets available, such as Sudoku and the ARC-AGI (Chollet, 2019) challenge.

1. The Sudoku-Extreme Benchmark

The first test was the Sudoku-Extreme benchmark, a dataset of specially curated hard Sudoku puzzles that require deep logical deduction and the ability to backtrack on steps the model later realizes were wrong.

The results run contrary to convention. TRM, with a mere 5 million parameters, achieved 87.4% accuracy on the dataset.

To place this in perspective:

  • Today's standard reasoning LLMs like Claude 3.7, GPT o3-mini, and DeepSeek R1 could not complete a single Sudoku problem from the entire dataset, resulting in 0% accuracy across the board (Wang et al., 2025).
  • The previous state-of-the-art recursive model (HRM) used 27 million parameters (over 5x larger) and achieved 55.0% accuracy.
  • By simply removing HRM's complex hierarchy-based architecture and focusing on a single recursive loop, TRM improved accuracy by over 30 percentage points while also reducing the parameter count.
(Source: Adapted from Jolicoeur-Martineau, Table 1)
T & n: number of cycles of answer and thought refinement, respectively.
w/ ACT: with the Adaptive Computation Time module, the model performs slightly worse.
w/ separate fH, fL: separate networks used for thought and answer refinement.
w/ 4 layers, n=3: doubled the depth of the recursive module but halved the number of recursions.
w/ self-attention: recursive module based on attention blocks instead of an MLP.

2. The “Capability Trap”: Why Deeper Was Worse

Perhaps the most counterintuitive insight from the paper was what happened when the author tried to make TRM "better" by doubling its parameter count.

When the network depth was increased from 2 layers to 4, performance didn't go up; instead, it crashed.

  • 2-Layer TRM: 87.4% Accuracy on Sudoku.
  • 4-Layer TRM: 79.5% Accuracy on Sudoku.

In the world of LLMs, adding more layers and making the model deeper has been the default way to increase intelligence. But for recursive reasoning on small datasets (TRM was trained on only ~1,000 examples), extra layers can become a liability: they give the model excess capacity to memorize patterns instead of deducing them, leading to overfitting.

This validates the paper's core hypothesis: depth in time beats depth in space. It can be far more effective to have a small model think for a long time than to have a large model think briefly. The model doesn't need more capacity to memorize; it needs more time and an effective medium to reason in.

3. The ARC-AGI Challenge: Humiliating the Giants

The Abstraction and Reasoning Corpus (ARC-AGI) is widely considered one of the toughest benchmarks for testing pattern recognition and logical reasoning in AI models. It essentially tests fluid intelligence: the ability to learn new abstract rules of a system from just a few examples. This is where most modern LLMs typically fail.

The results here are even more striking. TRM, with only 7 million parameters, achieved 44.6% accuracy on ARC-AGI-1.

Compare this to the giants of the industry:

  • DeepSeek R1 (671 Billion Parameters): 15.8% accuracy.
  • Claude 3.7 (size unknown, likely hundreds of billions): 28.6% accuracy.
  • Gemini 2.5 Pro: 37.0% accuracy.

A model that is roughly 0.001% the size of DeepSeek R1 outperformed it by nearly 3x. This is arguably the most parameter-efficient performance ever recorded on this benchmark. Only at the scale of Grok-4's 1.7T parameters do we see performance that beats the recursive reasoning approaches of HRM and TRM.

(Source: Adapted from Jolicoeur-Martineau, Table 5)

5. Conclusion

For years, we have gauged AI progress by the number of zeros in the parameter count. The Tiny Recursion Model offers an alternative to this convention. It proves that a model doesn't need to be massive to be smart; it just needs the time to think effectively.

As we look toward AGI, the answer may not lie in building larger data centers to house trillion-parameter models. Instead, it might lie in building tiny, efficient models of logic that can ponder a problem for as long as they need, mimicking the very human act of stopping, thinking, and solving.


References

  1. Jolicoeur-Martineau, A. (2025). Less is More: Recursive Reasoning with Tiny Networks. arXiv.
  2. Wang, G., Li, J., Sun, Y., Chen, X., Liu, C., Wu, Y., Lu, M., Song, S., & Yadkori, Y. A. (2025). Hierarchical Reasoning Model. arXiv.
  3. Chollet, F. (2019). On the Measure of Intelligence. arXiv.
  4. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv.
  5. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. arXiv.