Since the conception of AI, researchers have always held faith in scale: the belief that general intelligence is an emergent property born out of size. If we just keep adding parameters and training on gargantuan corpora, human-like reasoning will eventually present itself.
But we soon discovered that even this brute-force approach has its own shortcomings. Evidence suggests that many of our frontier models are severely undertrained and have inflated parameter counts (Hoffmann et al., 2022) [3], which indicates that we may be spending compute in the wrong direction after all.
The Hidden Flaws of the AI Giants
We made the most powerful AI systems ever built think in a slow, awkward, foreign language: English. To find solutions to problems, they must “reason out loud” through a word-for-word, step-by-step process, producing many irrelevant and inefficiently managed “tokens” along the way.
Then there is the well-established industry practice of “bigger is better.” This has led to the development of models with billions of parameters and training sets with trillions of tokens. The sheer size of such models suggests that they are not really reasoning; they are simply very good imitators. Instead of finding an original, novel solution to a specific problem, they rely on the fact that they were shown something similar during training to arrive at an answer.
Lastly, and perhaps most critically, these models are limited to a “one-size-fits-all” approach to thinking. When dealing with a very difficult problem, a model cannot decide to spend additional processing time on a particularly hard part of the problem. Granted, if a model takes more time on a harder problem, it generates more CoT tokens (Wei et al., 2022) [4]. But this does not necessarily replicate human reasoning, which involves deep stages of pondering without any tangible verbal dialogue.
Hierarchical Reasoning Models
Enter Hierarchical Reasoning Models (HRMs) (Wang et al., 2025b) [1]: instead of the clumsy “think out loud” approach, they reason silently and fluently inside their native latent space, a rich, high-dimensional world of numbers. This is far closer to our own human intuition, where deep thoughts often precede the words we use to describe them.
The heart of this new architecture is beautifully simple yet dynamic: a patient, high-level H-module sets the overall strategy, while a fast, low-level L-module is responsible for carrying that strategy out in detail. Both modules are implemented as simple transformer blocks (Vaswani et al., 2017) [2] stacked on top of one another.
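To make the division of labour concrete, here is a minimal PyTorch-style sketch (an illustration, not the authors’ code) of the two modules as small, weight-shared stacks of transformer blocks; the class name, sizes, and layer counts are all assumptions:

```python
# A minimal, illustrative sketch (not the authors' code): each module is a small
# stack of standard transformer encoder blocks, reused recurrently on a state.
import torch
import torch.nn as nn

class RecurrentModule(nn.Module):
    """A weight-shared stack of transformer blocks applied at every recurrent step."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, state: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # New state = transformer blocks applied to (old state + conditioning input),
        # where the conditioning input is the plan, the worker's output, or the problem.
        return self.blocks(state + context)

h_module = RecurrentModule()  # slow planner: updated once per high-level cycle
l_module = RecurrentModule()  # fast worker: updated at every low-level timestep
```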
How HRM Thinks: A Look Inside
HRM breaks down the act of “thinking” into a dynamic, two-speed system. To understand how it solves a complex problem like a 30×30 maze, let’s walk through the entire journey from input to answer.
Overall Architecture of the HRM
(Note: all H-module and L-module instances share their respective weights and process information in a recurrent manner.)
1. The Setup: Embedding and Initializations
- Flatten and Embed: As the name suggests, the input (for example, a Sudoku grid or a maze) is flattened into a one-dimensional stream of patches/tokens and then fed into an embedding layer, which converts the human-interpretable puzzle into embedding vectors the machine can work with.
- Initialize Memory: Two recurrent states are now instantiated: a High-Level state (zH), which acts as a supervisor dictating the overarching direction of thought and reasoning, and a Low-Level state (zL) responsible for executing the reasoning in that direction (see the sketch below).
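A minimal sketch of this setup stage, under assumed shapes and names (a 30×30 maze, a small cell vocabulary), might look like this:

```python
# Sketch of the setup stage under assumed shapes: a 30x30 maze with V distinct
# cell types is flattened into 900 tokens, embedded, and the two states created.
import torch
import torch.nn as nn

V, D = 16, 256                              # assumed cell vocabulary and hidden size
embed = nn.Embedding(V, D)

maze = torch.randint(0, V, (1, 30, 30))     # toy stand-in for a real input grid
x = embed(maze.flatten(start_dim=1))        # (1, 900, D): the embedded problem

z_h = torch.randn(1, 900, D)                # High-Level "plan" state (zH)
z_l = torch.randn(1, 900, D)                # Low-Level "work" state (zL)
```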
2. The Core Engine: Real Reasoning Starts Here
At its core, HRM is a nested loop, and a single pass through it is termed a “segment”. Each segment contains several H-module and L-module cycles.
- Step A: Setting the Plan
The High-Level (H) module begins by establishing a high-level plan. Its memory state (zH) is held constant for a fixed number of steps and is initialized randomly on the first pass. In our maze example, this initial plan might be very abstract/general, like “explore paths that move downwards and to the right.”
- Step B: Executing the Plan
With the High-Level module’s plan as a fixed guide, the Low-Level (L) module begins a series of recurrent computations. For a fixed number of timesteps (T), it iteratively updates its own hidden state (zL), drawing on three inputs:
- Its own work from the previous step (zL_previous).
- The fixed plan from the High-Level Module (zH).
- The original problem (the embedded maze).
The Low-Level module, while keeping the overarching strategy in mind, explores numerous paths, hits dead ends, backtracks, and repeats until it reaches a conclusion, which is then shared with the High-Level module.
- Step C: Changing the Plan Accordingly
Once the L-module has finished its recurrent working cycles, its final memory state (zL_final), which represents the outcome of its computation, is fed to the H-module for refinement. The H-module revises its plan and devises a new strategy for the L-module to follow in the next iteration. For instance: “The downward path is an eventual dead end. The new plan is to explore paths leading right.”
- Step D: Reset and Repeat
The L-module receives this updated plan from its “supervisor” for the next cycle of recurrent, intensive work. This continues for “N” H-module cycles, each consisting of “T” sub-steps of the L-module. A minimal sketch of this nested loop is shown below.
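The following sketch reuses the modules and states from the earlier snippets; the values of N and T are illustrative hyperparameters, not the paper’s settings:

```python
# Minimal sketch of one reasoning segment: N high-level cycles, each containing
# T low-level steps, using h_module, l_module, x, z_h, z_l defined above.
N, T = 2, 4

def run_segment(z_h, z_l, x):
    for _ in range(N):                      # one high-level cycle per iteration
        for _ in range(T):                  # Step B: low-level recurrent work
            # the worker sees its own state, the fixed plan, and the problem
            z_l = l_module(z_l, z_h + x)
        z_h = h_module(z_h, z_l)            # Step C: planner refines its strategy
    return z_h, z_l                         # Step D: states carry into the next segment
```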
3. The “Exit” Button: Deciding When to Stop
A single pass through the engine (a “segment”) may not be enough for a more nuanced or harder problem. This is where HRM’s most ingenious feature comes in: Adaptive Computation Time (ACT) (Graves, 2016) [6].
After each full segment of thought (N × T cycles), the model generates a tentative answer. Its final high-level state is then fed into a simple linear network, which decides: “Am I confident enough to stop, or should I think more?”
- If the model determines that it is confident enough in its answer, it halts and presents it as the final solution.
- If not, it decides to “ponder” further: the final memory states of the L and H modules are carried over as the initialization for an entirely new segment, which continues the thinking process. A sketch of this outer loop is shown below.
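The sketch below assumes the run_segment() helper from the previous snippet and two hypothetical heads (q_head, output_head); it shows the inference-time view only and is not the authors’ code:

```python
# Sketch of the ACT outer loop: run segments until the Q-head prefers halting
# or the segment budget M_max is exhausted.
import torch.nn as nn

M_max = 8                                   # maximum number of segments allowed
q_head = nn.Linear(256, 2)                  # scores: (Q_halt, Q_proceed)
output_head = nn.Linear(256, 16)            # maps the state back to answer tokens

def solve(z_h, z_l, x):
    for m in range(M_max):
        z_h, z_l = run_segment(z_h, z_l, x)            # one full segment of thought
        answer = output_head(z_h).argmax(dim=-1)       # tentative answer after segment m
        q_halt, q_proceed = q_head(z_h.mean(dim=1))[0]
        if q_halt > q_proceed or m == M_max - 1:       # confident enough (or out of budget)
            return answer
        # otherwise: keep z_h, z_l and "ponder" for another segment
```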
Implementation of ACT:
The model learns when to stop through a Q-learning paradigm.
- The Q-Head: This is a simple linear layer that makes the call to either proceed with reasoning or to stop. It takes the final memory state of the H-module at the end of a segment and outputs two scores: Q_halt and Q_proceed.
- The ‘Halt’ Value (Q_halt): This score represents the model’s confidence that it should stop now. During training, the model learns to make this score predict the immediate, final reward. The target it is trained to match is straightforward: 1 if the predicted answer is correct, and 0 if it is wrong:

G_halt^m = 1 if ŷ^m = y, else 0

G_halt^m: the reward for stopping the reasoning process at segment m
ŷ^m: the model’s predicted answer for the task (e.g., the maze solution)
y: the ground truth for the prediction (e.g., the actual maze solution)
m: the current segment iteration number
- The ‘Proceed’ Value (Q_proceed): This represents the estimated reward the model would receive if it continued thinking for another segment instead of stopping right now. Its target is the larger of the two Q-scores estimated at the immediately following segment:

G_proceed^m = max(Q_halt^{m+1}, Q_proceed^{m+1})

G_proceed^m: the reward for continuing to reason after segment m
m: the current segment iteration number
Q_halt^{m+1}, Q_proceed^{m+1}: the Q-head’s predicted scores for the next segment
- The Dual-Loss System: After each segment of thought, the model’s total loss combines two objectives (a minimal sketch follows this list):
- Task Loss: the standard sequence-to-sequence cross-entropy for getting the answer wrong.
- Q-Learning Loss: the ACT loss (binary cross-entropy) for making a poor stopping decision.

L_total^m = Loss(ŷ^m, y) + BinaryCrossEntropy(Q^m, G^m)

L_total^m: the total loss for the entire model at segment m
ŷ^m: the model’s predicted answer for the task (e.g., the maze solution)
y: the ground truth for the prediction (e.g., the actual maze solution)
Q^m: the Q-head’s predicted (halt, proceed) scores
G^m: the corresponding Q-head targets
- This allows the model to learn both objectives simultaneously: how to solve the given problem, and how to recognise when it has been solved.
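Here is a simplified sketch of how the targets and the combined per-segment loss could be computed, following the definitions above; the tensor shapes, helper names, and the q_scores/q_next arguments (the Q-head outputs for the current and next segment) are my own assumptions, not the paper’s code:

```python
# Sketch of the per-segment training targets and loss.
import torch
import torch.nn.functional as F

def act_targets(pred, truth, q_next):
    # G_halt: 1 if the current prediction matches the ground truth exactly, else 0
    g_halt = (pred == truth).all().float()
    # G_proceed: best value the Q-head expects from the next segment (bootstrapped,
    # detached so no gradient flows through the target)
    g_proceed = q_next.max().detach()
    return torch.stack([g_halt, g_proceed])

def segment_loss(logits, truth, q_scores, q_next):
    task_loss = F.cross_entropy(logits.transpose(1, 2), truth)        # solve the task
    targets = act_targets(logits.argmax(dim=-1), truth, q_next)
    act_loss = F.binary_cross_entropy_with_logits(q_scores, targets)  # stop at the right time
    return task_loss + act_loss                                       # L_total = task + ACT
```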
Putting It to the Test: Results
Sudoku and Maze Benchmarks
When benchmarked against several state-of-the-art reasoning models, HRM performs significantly better on complex reasoning tasks involving Sudoku puzzles and 30×30 mazes. Both require extensive logical deduction, the ability to backtrack, and spatial planning. As shown below, all the other models, which rely on Chain-of-Thought prompting, failed to produce even a single valid solution. These findings support the notion that letting models reason in a far more expressive latent space works better than making them talk to themselves via CoT.

(Figure: X-axis shows the accuracy of the models on the respective benchmarks.)
Architecture Over Scale: A Paradigm of Efficiency
The model pulls off this feat while also delivering extreme parameter and data efficiency. It achieves its top-tier performance with just 27 million parameters, trained from scratch on roughly 1,000 datapoints per task, with no expensive pre-training on web-scale datasets and no brittle prompt-engineering tactics. This further supports the hypothesis that the model internalises general patterns and reasons far more efficiently than the standard CoT-based approach.
Abstract Reasoning and Fluid Intelligence: The ARC-AGI Challenge
The Abstraction and Reasoning Corpus (ARC) (Chollet, 2019) [5] is a widely accepted benchmark for fluid intelligence; it requires models to infer vague, abstract rules from only a few visual examples. HRM, with just 27 million parameters, outperforms most mainstream reasoning models. Despite its size, it scored 40.3% on ARC-AGI-1, while much larger models with tremendous compute at their disposal, like o3-mini and Claude 3.7, managed subpar scores of 34.5% and 21.2% respectively.

(Figure: X-axis shows the accuracy of the models on the respective benchmarks.)
Unlocking True Computational Depth
Performance of vanilla Transformer architectures quickly plateaus when given more compute: simply adding more layers yields diminishing returns on complex reasoning. In contrast, HRM’s accuracy scales almost linearly with additional computational steps. This provides direct evidence from the paper that the architecture is not a fixed-depth system; it has an intrinsic ability to use extra compute to tackle complex tasks, a capability that the underlying structure of a standard Transformer lacks.

(Figure: X-axis shows the accuracy of the models on the Sudoku-Extreme Full dataset.)
Intelligent Efficiency: Solving Problems with Less Effort
The Adaptive Computation Time (ACT) mechanism allows the model to dynamically allocate its computational resources based on problem difficulty. An HRM equipped with ACT achieves the same top-tier accuracy as a model hard-coded to use a high number of steps, but it does so with significantly fewer resources on average. It learns to conserve compute by solving easy problems quickly and dedicating more “ponder time” only when necessary, demonstrating an intelligent efficiency that moves beyond brute-force computation.

These two graphs should be analysed together to understand the efficiency of the ACT mechanism. The X-axis on both charts represents the computational budget: for the “Fixed M” model, it is the exact number of steps it must perform, while for the “ACT” model, it is the maximum allowed number of steps (Mmax). The Y-axis in Figure (a) shows the average number of steps actually used, while the Y-axis in Figure (b) shows the final accuracy.
The “Fixed M” model’s accuracy (black line, Fig. b) peaks when its budget is 8, but this comes at a fixed cost of using exactly 8 steps for every problem (black line, Fig. a). The “ACT” model (blue line, Fig. b) achieves a virtually identical peak accuracy when its maximum budget is 8. However, Fig. (a) shows that to achieve this, it uses an average of only about 1.5 steps. The conclusion is clear: the ACT model learns to deliver the same top-tier performance while using less than a quarter of the computational resources, intelligently stopping early on problems it has already solved.
References
[1] Wang, Guan, et al. “Hierarchical Reasoning Model.” (2025).
[2] Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems 30 (2017).
[3] Hoffmann, Jordan, et al. “Training compute-optimal large language models.” (2022).
[4] Wei, Jason, et al. “Chain-of-thought prompting elicits reasoning in large language models.” Advances in Neural Information Processing Systems 35 (2022): 24824-24837.
[5] Chollet, François. “On the measure of intelligence.” (2019).
[6] Graves, Alex. “Adaptive computation time for recurrent neural networks.” (2016).
