We Didn’t Invent Attention — We Just Rediscovered It


Time and again, someone claims they’ve invented a revolutionary AI architecture. But when you see the identical mathematical pattern — selective amplification + normalization — emerge independently from gradient descent, evolution, and chemical reactions, you realize we didn’t invent the attention mechanism with the Transformer architecture. We rediscovered fundamental optimization principles that govern how any system processes information under energy constraints. Understanding attention as amplification rather than selection suggests specific architectural improvements and explains why current approaches work. Eight minutes here gives you a mental model that might guide better system design for the next decade.

When Vaswani and colleagues published “Attention Is All You Need” in 2017, they thought they were proposing something revolutionary [1]. Their transformer architecture abandoned recurrent networks entirely, relying instead on attention mechanisms to process entire text sequences simultaneously. The mathematical core was simple: compute compatibility scores between positions, convert them to weights, and use those weights to selectively combine information.

But this pattern appears to emerge independently wherever information processing systems face resource constraints under complexity. Not because there’s some universal law of attention, but because certain mathematical structures appear to represent convergent solutions to fundamental optimization problems.

We may be looking at one of those rare cases where biology, chemistry, and AI have converged on similar computational strategies — not through shared mechanisms, but through shared mathematical constraints.

The 500-Million-Year Experiment

The biological evidence for attention-like mechanisms is remarkably deep. The optic tectum/superior colliculus system, which implements spatial attention through competitive inhibition, shows extraordinary evolutionary conservation across vertebrates [2]. From fish to humans, this neural architecture maintains structural and functional consistency across 500+ million years of evolution.

But perhaps more intriguing is the convergent evolution.

Independent lineages developed attention-like selective processing multiple times: compound eye systems in insects [3], camera eyes in cephalopods [4], hierarchical visual processing in birds [5], and cortical attention networks in mammals [2]. Despite vastly different neural architectures and evolutionary histories, these systems converged on similar solutions for selective information processing.

This raises a compelling question: Are we seeing evidence of fundamental computational constraints that govern how complex systems must process information under resource limitations?

Even simple organisms suggest this pattern scales remarkably. The nematode C. elegans, with only 302 neurons, demonstrates sophisticated attention-like behaviors in food seeking and predator avoidance [6]. Plants exhibit attention-like selective resource allocation, directing growth responses toward relevant environmental stimuli while ignoring others [7].

The evolutionary conservation is striking, but we should be cautious about drawing direct equivalences. Biological attention involves specific neural circuits shaped by evolutionary pressures quite different from the optimization landscapes that produce AI architectures.

Attention as Amplification: Reframing the Mechanism

Recent theoretical work has fundamentally challenged how we understand attention mechanisms. Philosophers Peter Fazekas and Bence Nanay demonstrated that traditional “filter” and “highlight” metaphors fundamentally mischaracterize what attention actually does [8].

They argue that attention doesn’t select inputs — it amplifies presynaptic signals in a non-stimulus-driven way, interacting with built-in normalization mechanisms that create the appearance of selection. The mathematical structure they identify is the following:

  1. Amplification: Increase the strength of certain input signals
  2. Normalization: Built-in mechanisms (like divisive normalization) process these amplified signals
  3. Apparent Selection: The combination creates what appears to be selective filtering (a toy numerical sketch of these three steps follows Figure 1)
Figure 1: Attention doesn’t filter inputs — it amplifies certain signals, then normalization creates apparent selectivity. Like an audio mixer with automatic gain control, the result looks selective but the mechanism is amplification. Image by the author.
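To make the three steps concrete, here is a toy numerical sketch, loosely in the spirit of the normalization model of attention [16]. The function name and numbers are my own invention for illustration: amplify one of four equal inputs, divide by the pooled activity, and an apparent “selection” falls out.

```python
import numpy as np

def amplify_then_normalize(drive, gain, sigma=1.0):
    """Toy model: attentional gain applied to inputs, followed by
    divisive normalization over the pooled response."""
    amplified = gain * drive                      # 1. amplification
    return amplified / (sigma + amplified.sum())  # 2. divisive normalization

drive = np.array([1.0, 1.0, 1.0, 1.0])  # four equally strong inputs
gain = np.array([3.0, 1.0, 1.0, 1.0])   # attention boosts only the first

print(amplify_then_normalize(drive, gain).round(2))
# -> [0.43 0.14 0.14 0.14]: the boosted input dominates the normalized
#    response (3. apparent selection), yet nothing was ever filtered out
```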

This framework explains seemingly contradictory findings in neuroscience. Effects like increased firing rates, receptive field reduction, and surround suppression all emerge from the same underlying mechanism — amplification interacting with normalization computations that operate independently of attention.

Fazekas and Nanay focused specifically on biological neural systems. Whether this amplification framework extends to other domains remains an open question, but the mathematical parallels are suggestive.

Chemical Computers and Molecular Amplification

Perhaps the most surprising evidence comes from chemical systems. Baltussen and colleagues demonstrated that the formose reaction — a network of autocatalytic reactions involving formaldehyde, dihydroxyacetone, and metal catalysts — can perform sophisticated computation [9].

Figure 2: A Chemical Computer in Motion. Mix five simple chemicals in a stirred reactor, and something remarkable happens — the chemical soup learns to recognize patterns, predict future changes, and sort information into categories. No programming, no training, no silicon chips. Just molecules doing math. This formose reaction network processes information using the same selective-amplification principles that power ChatGPT’s attention mechanism, but it evolved naturally through chemistry alone. Image by the author.

The system shows selective amplification across up to 10⁶ different molecular species, achieving > 95% accuracy on nonlinear classification tasks. Different molecular species respond differentially to input patterns, creating what looks like chemical attention through selective amplification. Remarkably, the system operates on timescales (500 ms to 60 minutes) that overlap with biological and artificial attention mechanisms.

But the chemical system lacks the hierarchical control mechanisms and learning dynamics that characterize biological attention. Yet the mathematical structure — selective amplification creating apparent selectivity — appears strikingly similar.

Programmable autocatalytic networks provide additional evidence. Metal ions like Nd³⁺ create biphasic control mechanisms, both accelerating and inhibiting reactions depending on concentration [10]. This generates controllable selective amplification that implements Boolean logic functions and polynomial mappings through purely chemical processes.

Information-Theoretic Constraints and Universal Optimization

The convergence across these different domains may reflect deeper mathematical necessities. Information bottleneck theory provides a formal framework: any system with limited processing capacity must solve the optimization problem of compressing its input as much as possible while preserving task-relevant information [11].
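For readers who want the formal statement, the information bottleneck objective from [11] captures this trade-off as a single minimization over the encoder p(t | x):

minimize I(X; T) − β · I(T; Y)

where X is the raw input, T its compressed internal representation, Y the task-relevant variable, I(·;·) denotes mutual information, and β controls how much compression is traded against predictive power.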

Jan Karbowski’s work on information thermodynamics reveals universal energy constraints on information processing [12]. The fundamental thermodynamic bound on computation creates selection pressure for efficient selective processing mechanisms across all substrates capable of computation. In its simplest Landauer-type form,

σ ≥ k_B ln 2 · ΔI,

where σ is the entropy production rate and ΔI the rate of information being processed (erased or overwritten). Information processing costs energy, so efficient attention mechanisms carry a survival and performance advantage.

This creates universal pressure for efficient architectures — whether you’re evolution designing a brain, chemistry organizing reactions, or gradient descent training transformers.

Neural networks operating at criticality — the edge between order and chaos — maximize information processing capacity while maintaining stability [13]. Empirical measurements show that conscious attention in humans occurs precisely at these critical transitions [14]. Transformer networks during training exhibit similar phase transitions, organizing attention weights near critical points where information processing is optimized [15].

This suggests the possibility that attention-like mechanisms may emerge wherever systems face the fundamental trade-off between processing capacity and energy efficiency under resource constraints.

Convergent Mathematics, Not Universal Mechanisms

The evidence points toward a preliminary conclusion. Rather than discovering universal mechanisms, we may be witnessing convergent mathematical solutions to similar optimization problems.

The mathematical structure — selective amplification combined with normalization — appears across these domains, but the underlying mechanisms and constraints differ significantly.

For transformer architectures, this reframing suggests specific insights:

  • Q·K computation implements amplification.

The dot product Q·K^T computes semantic compatibility between query and key representations, acting as a learned amplification function in which high compatibility scores amplify specific signal pathways. Dividing by √d_k prevents saturation in high-dimensional spaces and keeps gradients flowing.

  • Softmax normalization creates winner-take-all dynamics

Softmax implements competitive normalization through divisive renormalization: w_ij = exp(s_ij) / Σ_k exp(s_ik). The exponential amplifies differences between scores (winner-take-all dynamics) while division by the sum ensures Σ_j w_ij = 1. Mathematically, this is a form of divisive normalization.

  • Weighted V combination produces apparent selectivity

There is no explicit selection operator in this combination; it is simply a linear combination of value vectors. The apparent selectivity emerges from the sparsity pattern induced by softmax normalization: high attention weights create effective gating without any explicit gating mechanism.

Together, the three steps induce winner-take-all dynamics on the value space; the sketch below walks through them in code.
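For concreteness, here is a minimal NumPy sketch of standard scaled dot-product attention (the computation from [1]). The code and shapes are my own illustration; the comments mark where amplification, normalization, and apparent selection occur in the amplification-plus-normalization reading.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # 1. Amplification: compatibility scores act as learned gains on signal pathways
    scores = Q @ K.T / np.sqrt(d_k)
    # 2. Normalization: exponential amplification + divisive normalization (softmax)
    weights = softmax(scores, axis=-1)
    # 3. Apparent selection: a plain linear combination of the value vectors
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(w.round(2))  # each row sums to 1; large scores dominate, but no key is ever "filtered out"
```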

Implications for AI Development

Understanding attention as amplification + normalization rather than selection offers several practical insights for AI architecture design:

  • Separating Amplification and Normalization: Current transformers conflate these mechanisms. We might explore architectures that decouple them, allowing for more flexible normalization strategies beyond softmax [16] (a rough sketch of this and the next two ideas follows this list).
  • Non-Content-Based Amplification: Biological attention includes “non-stimulus-driven” amplification. Current transformer attention is purely content-based (Q·K compatibility). We could investigate learned positional biases, task-specific amplification patterns, or meta-learned amplification strategies.
  • Local Normalization Pools: Biology uses “pools of surrounding neurons” for normalization rather than global normalization. This suggests exploring local attention neighborhoods, hierarchical normalization across layers, or dynamic normalization pool selection.
  • Critical Dynamics: The evidence that attention operates near critical points suggests that effective attention mechanisms should exhibit specific statistical signatures — power-law distributions, avalanche dynamics, and critical fluctuations [17].
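As a rough illustration of the first three bullets, here is one hypothetical way they could be prototyped. This is my own sketch, not an established architecture; pos_bias, window, and temperature are invented knobs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decoupled_attention(Q, K, V, pos_bias=None, window=None, temperature=1.0):
    """Hypothetical attention variant that separates amplification from normalization.

    pos_bias    -- additive, non-content-based amplification (would be learned in practice)
    window      -- if set, normalize only over a local pool of +/- window key positions
                   (assumes queries and keys share the same positional indexing)
    temperature -- sharpness of the normalization step, decoupled from the scores
    """
    n_q, d_k = Q.shape
    n_k = K.shape[0]

    # Amplification: content-based scores plus an optional non-content-based bias
    scores = Q @ K.T / np.sqrt(d_k)
    if pos_bias is not None:
        scores = scores + pos_bias

    # Local normalization pool: exclude keys outside the window before normalizing
    if window is not None:
        offsets = np.abs(np.arange(n_q)[:, None] - np.arange(n_k)[None, :])
        scores = np.where(offsets <= window, scores, -np.inf)

    # Normalization, with its own temperature knob
    weights = softmax(scores / temperature, axis=-1)
    return weights @ V
```

Here the window mimics a local normalization pool, pos_bias injects amplification that is not driven by content, and temperature lets the sharpness of the competition be tuned independently of the learned compatibility scores.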

Open Questions and Future Directions

Several fundamental questions remain:

  1. How deep do the mathematical parallels extend? Are we seeing true computational equivalence or superficial similarity?
  2. What can chemical reservoir computing teach us about minimal attention architectures? If simple chemical networks can achieve attention-like computation, what does this suggest about the complexity requirements for AI attention?
  3. Do information-theoretic constraints predict the evolution of attention in scaling AI systems? As models become larger and face more complex environments, will attention mechanisms naturally evolve toward these universal optimization principles?
  4. How can we integrate biological insights about hierarchical control and adaptation into AI architectures? The gap between static transformer attention and dynamic biological attention remains substantial.

Conclusion

The story of attention appears to be less about invention and more about rediscovery. Whether in the formose reaction’s chemical networks, the superior colliculus’s neural circuits, or transformer architectures’ learned weights, we see variations on a mathematical theme: selective amplification combined with normalization to create apparent selectivity.

This doesn’t diminish the achievement of transformer architectures — if anything, it suggests they represent a fundamental computational insight that transcends their specific implementation. The mathematical constraints that govern efficient information processing under resource limitations appear to push different systems toward similar solutions.

As we continue scaling AI systems, understanding these deeper mathematical principles may prove more valuable than mimicking biological mechanisms directly. The convergent evolution of attention-like processing suggests we’re working with fundamental computational constraints, not engineering choices.

Nature spent 500 million years exploring these optimization landscapes through evolution. We rediscovered similar solutions through gradient descent in a few years. The question now is whether understanding these mathematical principles can guide us toward even better solutions that transcend both biological and current artificial approaches.

Final note

The real test: if someone reads this and designs a better attention mechanism as a result, we’ve created value.


Thanks for reading — and sharing!

Javier Marin
Applied AI Consultant | Production AI Systems + Regulatory Compliance
[email protected]


References

[1] Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998–6008.

[2] Knudsen, E. I. (2007). Fundamental components of attention. Annual Review of Neuroscience, 30, 57–78.

[3] Nityananda, V., et al. (2016). Attention-like processes in insects. Proceedings of the Royal Society B, 283(1842), 20161986.

[4] Cartron, L., et al. (2013). Visual object recognition in cuttlefish. , 16(3), 391–401.

[5] Wylie, D. R., & Crowder, N. A. (2014). Avian models for 3D scene evaluation. , 102(5), 704–717.

[6] Jang, H., et al. (2012). Neuromodulatory state and sex specify alternative behaviors through antagonistic synaptic pathways in C. elegans. Neuron, 75(4), 585–592.

[7] Trewavas, A. (2009). Plant behaviour and intelligence. , 32(6), 606–616.

[8] Fazekas, P., & Nanay, B. (2021). Attention is amplification, not selection. The British Journal for the Philosophy of Science, 72(1), 299–324.

[9] Baltussen, M. G., et al. (2024). Chemical reservoir computation in a self-organizing reaction network. Nature, 631(8021), 549–555.

[10] Kriukov, D. V., et al. (2024). Exploring the programmability of autocatalytic chemical reaction networks. Nature Communications, 15(1), 8649.

[11] Tishby, N., & Zaslavsky, N. (2015). Deep learning and the information bottleneck principle. IEEE Information Theory Workshop (ITW).

[12] Karbowski, J. (2024). Information thermodynamics: From physics to neuroscience. Entropy, 26(9), 779.

[13] Beggs, J. M., & Plenz, D. (2003). Neuronal avalanches in neocortical circuits. Journal of Neuroscience, 23(35), 11167–11177.

[14] Freeman, W. J. (2008). . Springer-Verlag.

[15] Gao, J., et al. (2016). Universal resilience patterns in complex networks. Nature, 530(7590), 307–312.

[16] Reynolds, J. H., & Heeger, D. J. (2009). The normalization model of attention. Neuron, 61(2), 168–185.

[17] Shew, W. L., et al. (2009). Neuronal avalanches imply maximum dynamic range in cortical networks at criticality. Journal of Neuroscience, 29(49), 15595–15600.
