Recent large language models (LLMs), such as OpenAI's o1/o3, DeepSeek's R1, and Anthropic's Claude 3.7, demonstrate that allowing the model to think deeper and longer at test time can significantly enhance its reasoning capability. The core technique underlying this deep-thinking ability is known as chain-of-thought (CoT), where the model iteratively generates intermediate reasoning steps and appends them to the current context until producing the final answer.
However, as tasks become increasingly complex, the number of steps needed to solve them grows dramatically. For example, consider solving NP-hard problems using CoT: the reasoning trace would inevitably span exponentially many steps, assuming a fixed-size Transformer as the base model and P ≠ NP. This raises a critical question: will the write-only CoT paradigm eventually hit a fundamental wall as tasks get harder?
Unfortunately, probably yes. Several limitations emerge for harder tasks: (1) the chain will inevitably exceed the model's context window, (2) critical information becomes buried and nearly impossible to retrieve from the mass of preceding tokens, and (3) self-attention over an ever-growing context makes generating each new token prohibitively expensive.
In this post, we challenge the conventional "write-only" CoT reasoning paradigm that dominates current LLM architectures, from both theoretical and practical perspectives. We then explore a fundamentally different reasoning approach that enables LLMs not only to generate thoughts, but also to erase them. This capability for thought erasure not only offers significant practical advantages in performance and efficiency, but also proves fundamental for achieving optimal reasoning efficiency from a computational-theory perspective.
This post is based on our paper accepted at the International Conference on Machine Learning (ICML) 2025, a collaboration with Nathan Srebro, David McAllester, and Zhiyuan Li. The code is available.

Not Everything Needs to Be Remembered
The concept of selectively discarding information has deep roots in the history of computer science, from the earliest computational models to modern systems. The classic Turing machine overwrites symbols on its tape rather than preserving every state; programming languages reclaim memory through stack frames that are automatically released when functions complete their execution; and modern garbage collectors continuously identify and remove objects no longer accessible to the program. These mechanisms were not merely efficiency optimizations; they were essential design choices that made complex computation possible within finite resources.
This concept also applies to human reasoning. In theorem proving, once a lemma is established, we discard its detailed derivation while preserving the result; when exploring problem-solving approaches, we simply mark unproductive paths as "failed" without retaining their full traces. Throughout complex reasoning, we naturally compress information, retaining conclusions while discarding the scaffolding used to reach them.
✏️ PENCIL: A New Reasoning Paradigm
We therefore propose ✏️ PENCIL, a new reasoning paradigm for LLMs. Unlike ✒️ CoT, which only generates thoughts, PENCIL recursively generates and erases thoughts until reaching the final answer. It maintains only the minimal context required for generating future thoughts, so the model can think longer and deeper to solve harder tasks with a shorter working memory. The following figure illustrates how PENCIL works:

How Do Models Erase Thoughts?
PENCIL's erasure mechanism draws on two classical ideas. The first is rewriting rules from logic and classical automated theorem proving, which repeatedly apply predefined rules to simplify complex logical or arithmetic expressions into canonical forms until reaching a final answer. The second is functional programming languages, which create stack frames to store local variables when a function is called and release the corresponding memory when it returns, automatically discarding intermediate states that are no longer needed.
Specifically, we introduce three special tokens, called [CALL], [SEP], and [RETURN], and use the following reduction rule to implement erasure:

C [CALL] T [SEP] A [RETURN]   ⇒   C A

where C stands for the context, T for the intermediate thoughts, and A for the answer. Whenever the generated sequence completely matches the pattern on the left, PENCIL triggers the reduction rule, erasing the thoughts T and merging the answer A back into the context. It is important to note that C, T, and A can themselves contain special tokens, thereby supporting recursive structures similar to nested function calls; for instance, C may contain another [CALL] token, indicating that a new thinking subroutine has been initiated.
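To make the rule concrete, here is a minimal Python sketch (our illustration, not the paper's implementation) of how the reduction can be applied to a flat token sequence; applied repeatedly, it also handles the nested case, since each [RETURN] closes the innermost open [CALL]:

```python
# Minimal sketch of PENCIL's reduction rule (illustrative; not the official implementation).

CALL, SEP, RETURN = "[CALL]", "[SEP]", "[RETURN]"

def reduce_once(tokens):
    """Apply C [CALL] T [SEP] A [RETURN]  =>  C A once the last token is [RETURN].

    We locate the last [SEP] before the final [RETURN] and the last [CALL] before
    that [SEP], then splice out the thoughts T in between.
    """
    if not tokens or tokens[-1] != RETURN:
        return tokens                                            # nothing to reduce yet
    sep = max(i for i, t in enumerate(tokens[:-1]) if t == SEP)
    call = max(i for i, t in enumerate(tokens[:sep]) if t == CALL)
    answer = tokens[sep + 1:-1]                                  # A: between [SEP] and [RETURN]
    return tokens[:call] + answer                                # C followed by A; T is erased

# Example: a subroutine's derivation is erased, only its answer is merged back.
seq = ["solve", "x+y", CALL, "try", "y=2-x", "substitute", SEP, "x=1,", "y=1", RETURN]
print(reduce_once(seq))    # ['solve', 'x+y', 'x=1,', 'y=1']
```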
How to Use PENCIL?
PENCIL's erasure mechanism flexibly supports various reasoning patterns, such as:
1️⃣ Task Decomposition: Using [CALL] to initiate subproblems, generating intermediate results, and then using [SEP] and [RETURN] to merge the output and erase the subproblem's reasoning details;
2️⃣ Branch and Backtrack: Using a [CALL], [SEP], [RETURN] triplet to manage an exploration branch of a search tree, erasing invalid paths upon conflicts or failures (see the illustrative trace after this list);
3️⃣ Summarization / Tail Recursion: Condensing a lengthy reasoning trace into a concise summary, similar to tail-call optimization in programming:

C [CALL] T [SEP] T' [RETURN]   ⇒   C T'

where T represents the original lengthy reasoning process (or a harder problem), and T' represents the summarized or simplified content (or an equivalent, more tractable problem).
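As a toy illustration of the Branch-and-Backtrack pattern (the tokens and clause names below are made up for this example), a failed branch disappears entirely once the reduction fires, while its one-line verdict is kept:

```python
# Hypothetical branch-and-backtrack trace (abridged); PENCIL applies
# C [CALL] T [SEP] A [RETURN] => C A as soon as [RETURN] is generated.

before = ["...context...",
          "[CALL]", "assume", "x1=True", "propagate", "clause 3 violated",
          "[SEP]", "x1=True", "fails", "[RETURN]"]

after = ["...context...", "x1=True", "fails"]
# The whole exploration of the failed branch (the thoughts T) is erased;
# only the verdict "x1=True fails" stays, so the model can try x1=False
# next with a context that never grows beyond the current search path.
```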
Example on an NP-Complete Task
As an example, consider the classic NP-Complete problem of Boolean Satisfiability (SAT): given a Boolean formula, determine whether there exists a variable assignment that makes it true. This problem is widely believed to require exponential time but only polynomial space to solve, with the simplest approach being to traverse a binary search tree of depth n.
Traditional CoT would accumulate intermediate calculations, causing the context length to grow in proportion to the number of nodes in the search tree, i.e., an exponential O(2^n). In comparison, PENCIL can recursively branch to try True/False for a variable, backtracking upon conflict and erasing all thoughts inside that branch. This keeps the context length proportional to the search depth, i.e., a space complexity of only O(n).
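To make this concrete, below is a minimal sketch (our illustration; the paper's trace generator and token layout may differ) of a DFS SAT solver that emits PENCIL-style tokens. Each branch is wrapped in [CALL] ... [SEP] verdict [RETURN], so once a branch returns, the reduction erases its entire subtree of thoughts and the live context only ever holds the current search path of length O(n).

```python
# Illustrative PENCIL-style trace generation for SAT (not the paper's exact generator).
# Clauses use DIMACS-style signed integers: [[1, -2], [2, 3]] means (x1 or not x2) and (x2 or x3).

CALL, SEP, RETURN = "[CALL]", "[SEP]", "[RETURN]"

def satisfied(clauses, assign):
    """True iff every clause contains at least one literal made true by `assign`."""
    return all(any((lit > 0) == assign[abs(lit)] for lit in clause) for clause in clauses)

def branch(clauses, n, var, assign, trace):
    """Explore the subtree over variables var..n, wrapping it in one [CALL]...[RETURN] block.
    When PENCIL reduces that block, the subtree's thoughts are erased and only the
    verdict ('SAT'/'UNSAT') survives, so the live context length stays O(n)."""
    trace.append(CALL)
    if var > n:                                   # leaf: a complete assignment
        ok = satisfied(clauses, assign)
    else:
        ok = False
        for value in (True, False):               # try x_var = True, then x_var = False
            trace.append(f"x{var}={value}")
            assign[var] = value
            if branch(clauses, n, var + 1, assign, trace):
                ok = True
                break                             # satisfying branch found, stop exploring
    trace += [SEP, "SAT" if ok else "UNSAT", RETURN]
    return ok

clauses = [[1, -2], [-1, 2], [2, 3]]              # (x1 or not x2) and (not x1 or x2) and (x2 or x3)
trace = []
print(branch(clauses, n=3, var=1, assign={}, trace=trace))   # True: e.g. x1=x2=x3=True works
```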
The following figure compares the maximum context length of vanilla CoT without reduction (blue) and PENCIL with reduction (red). As problem complexity increases, PENCIL achieves dramatic space savings, notably reducing the context length from 151,192 to just 3,335 tokens for Einstein's Puzzle.

Training and Experiments
The core difference between CoT and PENCIL during training is the calculation of the loss function:


For CoT, the loss for each new token is computed against the entire historical context; for PENCIL, after each "write-erase" iteration, the loss for new tokens is computed only on the reduced sequence. Although both generate the same number of tokens, PENCIL significantly shortens the context length seen by each token and is therefore more efficient.
It is also worth noting that after each reduction, the KV cache for the shared prefix C can be reused directly, with only the cache for the much shorter part A needing recomputation.
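The sketch below illustrates this loss computation in PyTorch-style code; the data layout (one (reduced context, new tokens) pair per write-erase iteration) and the Hugging-Face-style `model(...).logits` interface are our assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def pencil_loss(model, iterations):
    """Sketch of PENCIL's training loss (our simplification).

    `iterations` is a list of (context_ids, new_ids) pairs: for each write-erase
    iteration, `context_ids` is the reduced sequence left after the previous
    reduction and `new_ids` are the tokens generated in that iteration. Each new
    token is predicted from the (short) reduced sequence only, not the full history;
    in practice the KV cache for the shared prefix C can also be reused across iterations.
    """
    total = 0.0
    for context_ids, new_ids in iterations:
        input_ids = torch.cat([context_ids, new_ids])           # (L,) with L = |C| + |new|
        logits = model(input_ids.unsqueeze(0)).logits[0]        # (L, vocab), causal LM assumed
        preds = logits[len(context_ids) - 1 : -1]               # positions that predict the new tokens
        total = total + F.cross_entropy(preds, new_ids, reduction="sum")
    return total
```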
Experimental Results
Our experiments focus on three inherently hard reasoning tasks: 3-SAT (NP-Complete), QBF (PSPACE-Complete), and Einstein's Puzzle (natural-language reasoning). For each task, we wrote a generator to produce a training set that includes the special tokens. We train a small Transformer (10.6M parameters for SAT/QBF; 25.2M parameters for Einstein's Puzzle) from random initialization on these tasks.
📊 Compared with CoT, we found that PENCIL can solve larger-scale reasoning problems. As shown in the figure below, on the SAT (left) and QBF (right) tasks, both CoT and PENCIL solve problems perfectly when the problem size is small; but as the size increases, traditional CoT's accuracy drops significantly (e.g., to only about 50% for SAT at n=10), while PENCIL maintains high accuracy (≥ 99%). This is primarily because CoT's context length explodes exponentially, whereas PENCIL avoids this explosion through dynamic reduction.

⚡️ Moreover, PENCIL significantly saves computational resources. As shown in the figure, for QBF tasks (n = 3 to 6), we compared the convergence speed of CoT (blue) and PENCIL (red) under the same FLOPs budget. PENCIL quickly reaches 100% accuracy, while CoT, due to its continually expanding context length, requires far more FLOPs to approach optimality. As the problem size increases, the gap between the two becomes more pronounced.

(Figure: convergence under the same FLOPs budget for QBF, n = 3 to 6. Circles and vertical lines indicate the first time each method reaches optimal performance.)
🧩 We further considered a very difficult logical reasoning problem: Einstein's Puzzle. Each instance consists of 5 houses and 5 attribute categories for the people living in them: color, nationality, drink, cigarette, and pet (e.g., Red/Green/Blue, Brit/German/Swede, Bird/Dog/Fish, etc.). Given clues like "the green house is right next to the bird owner's" and "the dog owner lives in the red house," the task is to deduce "who owns the fish?" This problem is an extreme challenge for existing LLMs: even GPT-4 struggles to solve it. The figure below shows a simplified version with only 3 houses and 3 attribute categories:

As shown below, on this problem that even large models struggle with, PENCIL achieves 97% accuracy using only a small 25.2M-parameter model, while traditional CoT achieves only 25% accuracy (close to random guessing).

Theory: Universal Efficient Computation
We further demonstrate PENCIL's fundamental advantage over traditional CoT from the perspective of theoretical expressive power: PENCIL is Turing complete with optimal space complexity, and can therefore solve any computable task efficiently. This is fundamentally impossible for CoT!
Main Results
Specifically, we prove:

Theorem (informal). Any Turing machine that runs in time T and space S can be simulated by a fixed-size Transformer using PENCIL with O(T) total generated tokens and a maximum context length of O(S).

In other words, for any Turing machine running in time T and space S, PENCIL needs only O(T) tokens while maintaining a maximum context length of O(S) to produce the same result. While previous work established that traditional CoT can make Transformers Turing complete, it demands a context length of O(T), with each token representing an intermediate computation step. This difference in maximum context length is crucial because, for most algorithms, the space complexity S is substantially smaller than the time complexity T, especially for harder problems.
Consider NP-Complete problems like Traveling Salesman or Hamiltonian Circuit, which are widely believed to require exponential time but are solvable in polynomial space. Traditional CoT cannot solve these within a polynomial context length; it needs at least exponential length, which exceeds the practical memory limits of any real system. PENCIL, in contrast, can solve them using only a polynomial maximum context length, making previously intractable reasoning tasks feasible.
Proof Sketch
We now briefly sketch the proof, where the key insight is to have PENCIL use a series of "Simulation-Summarization" iterations to clean up its memory.

Step 1: Using CoT to Encode Turing Machine Transitions. As illustrated in the left part of the figure above, we encode each Turing machine transition as a single token whose embedding encodes the triplet of "new state", "written symbol", and "head movement direction". The model can use self-attention to compute the current head position and determine the symbol at that position. Without reduction, this process generates T tokens with a context length of O(T).
Step 2: Alternating "Simulation-Summarization". PENCIL achieves space/time optimality by alternating between two phases:
- Simulation: Repeatedly generate Turing machine state transition tokens, simulating multiple computation steps;
- Summarization: When the newly generated tokens exceed twice the space needed, summarize the computation using S tokens. The reduction rule then discards the previous thoughts, keeping only the latest Turing machine configuration for the next round.
This strategy maintains total token generation at O(T) while limiting context length to O(S).
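As a toy illustration of this alternation (a Python sketch for intuition only; in the paper the whole loop is realized inside a fixed-size Transformer), the live context grows by one token per simulated step and is periodically collapsed to a summary of the current configuration:

```python
# Toy "Simulation-Summarization" loop (intuition only, not the paper's construction).

CALL, SEP, RETURN = "[CALL]", "[SEP]", "[RETURN]"

def reduce_once(tokens):
    """PENCIL reduction: C [CALL] T [SEP] A [RETURN]  =>  C A."""
    sep = max(i for i, t in enumerate(tokens[:-1]) if t == SEP)
    call = max(i for i, t in enumerate(tokens[:sep]) if t == CALL)
    return tokens[:call] + tokens[sep + 1:-1]

def simulate(step, config, space_S, max_steps=100_000):
    """`step` maps a machine configuration to (next_config, halted).
    The context holds one token per transition plus a periodic summary of the configuration."""
    context = [CALL, repr(config)]                     # summary of the initial configuration
    for _ in range(max_steps):
        config, halted = step(config)
        context.append("<transition>")                 # Simulation: one token per step
        if halted:
            break
        if len(context) > 2 * space_S:                 # Summarization: context grew past ~2S
            context = reduce_once(context + [SEP, repr(config), RETURN])
            context = [CALL] + context                 # reopen the thinking subroutine
    return config, context

# Example "machine": a counter that decrements until it hits zero (T = 300 steps),
# while its configuration needs far less space than T.
final, ctx = simulate(lambda c: (c - 1, c == 1), config=300, space_S=20)
print(final, len(ctx))   # 0, and the context length never exceeded ~2 * space_S (vs. ~300 for CoT)
```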
Step 3: Transformer Implementation. To prove that this process can be implemented by Transformers, we developed the Full-Access Sequence Processing (FASP) programming language and proved that any algorithm written in FASP can be implemented by a fixed-size Transformer. In a FASP program, each variable corresponds to a Transformer sub-module, and each line of code transforms existing variables into a new variable through predefined functions, which is equivalent to constructing a more complex Transformer from sub-modules. The variable returned by the program is the desired Transformer that encodes the algorithm. We wrote a FASP program that implements the "Simulation-Summarization" operation, which implies that there exists a constant-size Transformer that can perform the same function.
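For intuition only, here is a loose Python analogy of that compositional idea (this is not FASP syntax, which we do not reproduce here): each variable stands in for a sequence-level sub-module, and each assignment builds a new variable by composing existing ones.

```python
# Loose analogy for the compositional style described above (not actual FASP code).
# Each "variable" is a function over token sequences, standing in for a Transformer
# sub-module; each assignment composes existing variables into a more complex one.

def compose(f, g):
    """Build a bigger 'module' by feeding g's output into f."""
    return lambda seq: f(g(seq))

last_token  = lambda seq: seq[-1]                   # primitive sub-module
shift_right = lambda seq: [None] + list(seq[:-1])   # primitive sub-module
prev_token  = compose(last_token, shift_right)      # new variable built from the two above

print(prev_token([3, 1, 4, 1, 5]))   # 1: the token just before the last one
```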
Conclusion
In conclusion, we propose a new reasoning paradigm, PENCIL, which alternates between generation and erasure and enables models to think deeper to solve more complicated problems. Theoretically, we prove that PENCIL achieves Turing completeness with optimal time and space efficiency, and can therefore solve any computable problem efficiently. Looking forward, a promising direction is to fine-tune existing LLMs to incorporate PENCIL's memory-efficient reasoning capabilities. We hope these findings encourage re-examining current reasoning models from the perspective of the theory of computation.

