Long before computers and Artificial Intelligence, we had established institutions designed to reason systematically about human behavior — the court. The legal system is one of humanity’s oldest reasoning engines: facts and evidence are taken as input, relevant laws serve as the reasoning rules, and verdicts are the system’s output. The laws themselves, however, have been evolving since the very beginning of human civilization. The earliest codified law – the Code of Hammurabi (circa 1750 BCE) – represents one of the first large-scale attempts to formalize moral and social reasoning into explicit symbolic rules. Its beauty lies in clarity and uniformity — yet it is also rigid, incapable of adapting to context. Centuries later, Common Law traditions, like those shaped by the English courts, introduced the opposite philosophy: reasoning from precedent and prior cases. Today’s legal systems are usually a blend of both, though the proportions vary across countries.
In contrast to the cohesive combination found in legal systems, the same pair of paradigms in AI — Symbolism and Connectionism — appears to be significantly harder to unite. The latter has dominated the recent surge of AI development, where everything is implicitly learned from enormous amounts of data and compute and encoded across the parameters of neural networks. And this direction has indeed proven very effective in terms of benchmark performance. So, do we really need a symbolic component in our AI systems?
Symbolic Systems vs. Neural Networks: A Perspective of Information Compression
To answer the question above, we need to take a closer look at both systems. From a computational standpoint, both symbolic systems and neural networks can be seen as machines of compression — they reduce the vast complexity of the world into compact representations that enable reasoning, prediction, and control. Yet they do so through fundamentally different mechanisms, guided by opposite philosophies of what it means to “understand”.
In essence, both paradigms can be imagined as filters applied to raw reality. Given input \(X\), each learns or defines a transformation \(H(\cdot)\) that yields a compressed representation \(Y = H(X)\), preserving the information it considers meaningful and discarding the rest. But the shape of this filtering differs. Generally speaking, symbolic systems behave like high-pass filters — they extract the sharp, rule-defining contours of the world while ignoring its smooth gradients. Neural networks, in contrast, resemble low-pass filters, smoothing away local fluctuations to capture global structure. The difference is not in what they see, but in what they choose to forget.
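To make the filter analogy tangible, here is a minimal NumPy sketch (the signal, kernel width, and noise level are arbitrary illustrative choices): a moving-average kernel acts as the low-pass, “neural” style of forgetting, while the residual it leaves behind acts as the high-pass, “symbolic” one.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "world signal": a slow trend, a few sharp rule-like jumps, and noise.
t = np.linspace(0, 1, 500)
slow_trend = np.sin(2 * np.pi * t)
sharp_jumps = np.where(t > 0.5, 1.0, 0.0) + np.where(t > 0.8, -0.7, 0.0)
x = slow_trend + sharp_jumps + 0.05 * rng.standard_normal(t.size)

# Low-pass filtering (the "neural" style of forgetting): a moving average
# smooths away local discontinuities but keeps the global shape.
window = 25
low_pass = np.convolve(x, np.ones(window) / window, mode="same")

# High-pass filtering (the "symbolic" style of forgetting): keep exactly what
# the smoothing threw away -- the sharp, rule-defining structure.
high_pass = x - low_pass

print("variance kept by the low-pass view :", round(float(np.var(low_pass)), 3))
print("variance kept by the high-pass view:", round(float(np.var(high_pass)), 3))
```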
Symbolic systems compress by discretization. They carve the continuous fabric of experience into distinct categories, relations, and rules: a legal code, a grammar, or an ontology. Each symbol acts as a handle for manipulation within a pre-defined schema. The process resembles projecting a noisy signal onto a set of human-designed basis vectors — a space spanned by concepts such as Entity and Relation. A knowledge graph, for instance, might read the sentence “UIUC is a great university and I love it” and retain only the triple (UIUC, is_a, University), discarding everything that falls outside its schema. The result is clarity and composability, but also rigidity: meaning outside the ontological frame simply evaporates.
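As a toy illustration of this kind of schema-based forgetting (the schema, the sentence, and the string-matching “extractor” below are stand-in assumptions, not a real information-extraction pipeline):

```python
# Toy ontology: the only things this symbolic "filter" is allowed to remember.
SCHEMA_ENTITIES = {"UIUC": "University"}
SCHEMA_RELATION = "is_a"

def project_onto_schema(sentence: str) -> list[tuple[str, str, str]]:
    """Keep only facts expressible in the schema; everything else evaporates."""
    triples = []
    for entity, entity_type in SCHEMA_ENTITIES.items():
        if entity in sentence:
            triples.append((entity, SCHEMA_RELATION, entity_type))
    return triples

sentence = "UIUC is a great university and I love it"
print(project_onto_schema(sentence))
# [('UIUC', 'is_a', 'University')] -- the sentiment "I love it" is simply gone.
```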
Neural networks, in contrast, compress by smoothing. They forgo discrete categories in favor of smooth manifolds where nearby inputs yield similar activations (often bounded by some Lipschitz constant in modern LLMs). Rather than mapping data to predefined coordinates, they learn a latent geometry that encodes correlations implicitly. The world, on this view, is not a set of rules but a field of gradients. This makes neural representations remarkably adaptive: they can interpolate, analogize, and generalize across unseen examples. But the same smoothness that grants flexibility also breeds opacity. Information is entangled, semantics become distributed, and interpretability is lost in the very act of generalization.
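The smoothness claim can be made concrete with a toy network: because tanh is 1-Lipschitz, the map below cannot send nearby inputs to far-apart activations, and the product of the weights’ spectral norms bounds the ratio (in the spirit of Bartlett et al., 2017). This is a sketch with random weights, not a real LLM.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-layer network with tanh activations. tanh is 1-Lipschitz, so the
# whole map is Lipschitz with constant at most ||W2||_2 * ||W1||_2.
W1 = 0.2 * rng.standard_normal((64, 16))
W2 = 0.2 * rng.standard_normal((8, 64))

def f(x: np.ndarray) -> np.ndarray:
    return W2 @ np.tanh(W1 @ x)

lipschitz_bound = np.linalg.norm(W2, 2) * np.linalg.norm(W1, 2)

x = rng.standard_normal(16)
eps = 1e-3 * rng.standard_normal(16)

ratio = np.linalg.norm(f(x + eps) - f(x)) / np.linalg.norm(eps)
print("empirical output/input shift ratio:", round(float(ratio), 4))
print("spectral-norm Lipschitz bound     :", round(float(lipschitz_bound), 4))
# Nearby inputs always land on nearby activations -- smooth, but unnamed.
```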
| Property | Symbolic Systems | Neural Networks |
|---|---|---|
| Survived Information | Discrete, schema-defined facts | Common, continuous statistical patterns |
| Source of Abstraction | Human-defined ontology | Data-driven manifold |
| Robustness | Brittle at rule edges | Locally robust but globally fuzzy |
| Error Mode | Missed facts | Smoothed facts |
| Interpretability | High | Low |
In conclusion, we can summarize the difference between the two systems from the information-compression perspective in a single sentence: symbolic systems keep the sharp, rule-defining structure and forget the smooth background, while neural networks keep the smooth statistical background and forget the sharp details. This also explains why neuro-symbolic systems are an art of compromise: they can harness knowledge from both paradigms by using them collaboratively at different scales, with neural networks providing a global, low-resolution backbone and symbolic components supplying high-resolution local details.
The Challenge of Scalability
Though it is very tempting to add symbolic components into neural networks to harness the benefits of both, scalability is a big obstacle standing in the way of such attempts, especially in the era of Foundation Models. Traditional neuro-symbolic systems rely on a set of expert-defined ontologies / schemas / symbols, which is assumed to cover all possible input cases. This is acceptable for domain-specific systems (for instance, a pizza-ordering chatbot); however, you cannot apply similar approaches to open-domain systems, where you would need experts to construct trillions of symbols and their relations.
A natural response is to go fully data-driven: instead of asking humans to handcraft an ontology, we let the model induce its own “symbols” from internal activations. Sparse autoencoders (SAEs) are a prominent incarnation of this idea. By factorizing hidden states into a large set of sparse features, they seem to give us a dictionary of neural concepts: each feature fires on a specific pattern, is (often) human-interpretable, and behaves like a discrete unit that can be turned on or off. At first glance, this looks like a perfect escape from the expert bottleneck: we no longer design the symbol set; we learn it.
The standard SAE objective jointly learns a dictionary \(D\) and a sparse code \(z\) for each hidden state \(h\):

\[
\min_{D,\,z}\;\; \lVert h - D z \rVert_2^2 \;+\; \lambda \lVert z \rVert_1
\]

Here \(D\) is called the dictionary, where each column stores a semantically meaningful concept; the first term is the reconstruction error of the hidden state \(h\), while the second is a sparsity penalty encouraging a minimal number of activated neurons in the code.

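For concreteness, here is a minimal sketch of an SAE implementing the objective above (the dimensions, the plain ReLU encoder, and the \(\ell_1\) coefficient are illustrative assumptions; published SAEs differ in details such as bias handling and normalization):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE over hidden states: h is reconstructed as D z with sparse z."""

    def __init__(self, hidden_dim: int, dict_size: int):
        super().__init__()
        self.encoder = nn.Linear(hidden_dim, dict_size)
        # The decoder weight plays the role of the dictionary D (one concept per column).
        self.decoder = nn.Linear(dict_size, hidden_dim, bias=False)

    def forward(self, h: torch.Tensor):
        z = torch.relu(self.encoder(h))   # sparse code
        h_hat = self.decoder(z)           # reconstruction D z
        return h_hat, z

def sae_loss(h, h_hat, z, l1_coeff: float = 1e-3):
    recon = (h - h_hat).pow(2).sum(dim=-1).mean()  # reconstruction term
    sparsity = z.abs().sum(dim=-1).mean()          # ell_1 sparsity penalty
    return recon + l1_coeff * sparsity

# Usage on a batch of hidden states (dimensions are illustrative only).
sae = SparseAutoencoder(hidden_dim=768, dict_size=768 * 16)
h = torch.randn(8, 768)
h_hat, z = sae(h)
sae_loss(h, h_hat, z).backward()
```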
However, an SAE-only approach runs into two fundamental issues. The first is computational: using SAEs as a live symbolic layer would require multiplying every hidden state by an enormous dictionary matrix, paying a dense computation cost even when the resulting code is sparse. This makes them impractical for deployment at Foundation Model scale. The second is conceptual: SAE features are symbol-like representations, but they are not a symbolic system — they lack an explicit formal language, compositional operators, and executable rules. They tell us what concepts exist in the model’s latent space, but not how to reason with them.
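To put a rough number on the computational concern (all figures below are illustrative assumptions, not measurements of any particular model):

```python
# Back-of-the-envelope cost of using an SAE as a live symbolic layer.
# All numbers are illustrative assumptions, not measurements of any model.
hidden_dim = 4096            # residual-stream width of a mid-sized model
expansion = 16               # overcompleteness factor of the dictionary
dict_size = hidden_dim * expansion

# Encoding one hidden state is a dense (hidden_dim x dict_size) multiply,
# roughly 2 * m * n FLOPs, even though the resulting code is sparse.
flops_per_token_per_layer = 2 * hidden_dim * dict_size
print(f"{flops_per_token_per_layer / 1e6:.0f} MFLOPs per token, per hooked layer")
# ~537 MFLOPs -- before decoding, and before hooking more than one layer.
```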
This does not mean we should abandon SAEs altogether — they provide ingredients, not a finished meal. Rather than asking SAEs to be the symbolic system, we can treat them as a bridge between the model’s internal concept space and the many symbolic artefacts we already have: knowledge graphs, ontologies, rule bases, taxonomies, where reasoning can occur by definition. A high-quality SAE trained on a large model’s hidden states then becomes a shared “concept coordinate system”: different symbolic systems can be aligned within this coordinate system by associating their symbols with the SAE features that are consistently activated when those symbols are invoked in context.
Doing this has several benefits over simply placing symbolic systems side by side and querying them independently. First, it enables symbol merging and aliasing across systems: if two symbols from different formalisms repeatedly light up almost the same set of SAE features, we have strong evidence that they correspond to the same underlying neural concept and can be linked or even unified. Second, it supports cross-system relation discovery: symbols that are far apart in our hand-designed schemas but consistently close in SAE space point to bridges we did not encode — new relations, abstractions, or mappings between domains. Third, SAE activations give us a model-centric notion of salience: symbols that never find a clear counterpart in the neural concept space are candidates for pruning or refactoring, while strong SAE features with no matching symbol in any system highlight blind spots shared by all of our current abstractions.
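As a sketch of the first of these benefits (symbol merging and aliasing), suppose we have already aggregated, for each symbol in each system, the SAE activations observed when that symbol is in play (how to build these embeddings is sketched later). The system names, dimensions, and similarity threshold below are hypothetical:

```python
import numpy as np

DICT_SIZE = 4096  # size of the (hypothetical) SAE code space

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def find_aliases(system_a: dict, system_b: dict, threshold: float = 0.8):
    """Propose symbol pairs from two symbolic systems that load on nearly
    the same SAE features and are therefore candidates for merging."""
    proposals = []
    for name_a, emb_a in system_a.items():
        for name_b, emb_b in system_b.items():
            sim = cosine(emb_a, emb_b)
            if sim >= threshold:
                proposals.append((name_a, name_b, round(sim, 3)))
    return sorted(proposals, key=lambda p: -p[2])

# Random stand-ins for aggregated SAE activations of symbols from a legal
# ontology and a medical taxonomy; the shared component makes them aliases.
rng = np.random.default_rng(0)
shared_concept = rng.random(DICT_SIZE)
legal_kg = {"duty_of_care": shared_concept + 0.01 * rng.random(DICT_SIZE)}
medical_kb = {"standard_of_care": shared_concept + 0.01 * rng.random(DICT_SIZE)}

print(find_aliases(legal_kg, medical_kb))
# [('duty_of_care', 'standard_of_care', 1.0)]
```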
Crucially, this use of SAEs stays scalable. The expensive SAE is trained offline, and the symbolic systems themselves do not need to grow to “Foundation Model size” — they can remain as small or as large as their respective tasks require. At inference time, the neural network continues to do the heavy lifting in its continuous latent space; the symbolic artefacts only shape, constrain, or audit behaviour at the points where explicit structure and accountability are most useful. SAEs help by tying all these heterogeneous symbolic views back to a single learned conceptual map of the model, making it possible to compare, merge, and improve them without ever constructing a monolithic, expert-designed symbolic twin.
When Can an SAE Serve as a Symbolic Bridge?
The picture above quietly assumes that our SAE is “good enough” to serve as a meaningful coordinate system. What does that actually require? We do not need perfection, nor do we need the SAE to outperform human symbolic systems on every axis. Instead, we need a few more modest but crucial properties:
– Semantic Continuity: Inputs that express the same underlying concept should induce similar support patterns in the sparse code: the same subset of SAE features should tend to be non-zero, rather than flickering on and off under small paraphrases or context shifts. In other words, semantic equivalence should be reflected in a stable pattern of active concepts.
– Partial Interpretability: We do not need to understand every feature, but a nontrivial fraction of them should admit robust human descriptions, so that merging and debugging are possible at the concept level.
– Behavioral Relevance: The features that the SAE discovers must actually matter for the model’s outputs: intervening on them, or conditioning on their presence, should change or predict the model’s decisions in systematic ways.
– Capacity and Grounding: An SAE can only refactor whatever structure already exists in the base model; it cannot conjure rich concepts out of a weak backbone. For the “concept coordinate system” picture to make sense, the base model itself must be large and well-trained enough that its hidden states already encode a diverse, non-trivial set of abstractions. Meanwhile, the SAE needs sufficient dimensionality and overcompleteness: if the code space is too small, many distinct concepts will be forced to share the same features, resulting in entangled and unstable representations.
We now discuss the first three properties in detail.
Semantic Continuity
At the level of pure function approximation, a deep neural network with ReLU- or GELU-type activations implements a Lipschitz-continuous map: small perturbations of the input cannot cause arbitrarily large jumps in the output logits. But this kind of continuity is very different from what we need in a sparse autoencoder. For the base model, a few neurons flipping on or off can easily be absorbed by downstream layers and redundancy; as long as the final logits change smoothly, we are satisfied.
In an SAE, in contrast, we are no longer just looking at a smooth output — we are treating the support pattern of the sparse code reconstructed over the residual stream as a proto-symbolic object. A “concept” is identified with a particular subset of the code being active. That makes the geometry far more brittle: if a small change in the underlying representation pushes a pre-activation across the ReLU threshold in the SAE layer, a neuron in the code will suddenly flip from off to on (or vice versa), and from the symbolic perspective the concept has appeared or disappeared. There is no downstream network to average this out; the code itself is the representation we care about.
The sparsity penalty used to train the SAE exacerbates this further. The usual SAE objective combines a reconstruction loss with an \(\ell_1\) penalty on the activations, which explicitly encourages most neuron values to be as close to zero as possible. As a result, even many useful neurons end up sitting near the activation boundary: just above zero when they are needed, just below zero when they are not — a phenomenon referred to as “activation shrinkage” in SAEs. This is bad for semantic continuity at the support-pattern level: tiny perturbations of the input can change which neurons are non-zero, even when the underlying meaning has barely changed. Therefore, Lipschitz continuity of the base model does not automatically give us a stable non-zero subset of the code in SAE space, and support-level stability has to be treated as a separate design goal and evaluated explicitly.
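One way to make support-level stability measurable is to compare the sets of active features across paraphrases of the same meaning, for example via average pairwise Jaccard overlap. The sketch below assumes a hypothetical `encode` hook that returns the SAE code for a text; everything else is plain NumPy:

```python
import numpy as np

def support(code: np.ndarray, threshold: float = 0.0) -> set:
    """Indices of active SAE features: the proto-symbolic 'concept set'."""
    return set(np.flatnonzero(code > threshold).tolist())

def support_jaccard(code_a: np.ndarray, code_b: np.ndarray) -> float:
    sa, sb = support(code_a), support(code_b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def support_stability(codes: list) -> float:
    """Average pairwise Jaccard overlap of active-feature sets across
    paraphrases of the same meaning (1.0 means perfectly stable support)."""
    scores = [
        support_jaccard(codes[i], codes[j])
        for i in range(len(codes))
        for j in range(i + 1, len(codes))
    ]
    return float(np.mean(scores)) if scores else 1.0

# Hypothetical usage, where `encode(text)` returns the SAE code of a chosen layer:
# codes = [encode(p) for p in paraphrases_of("the defendant owed a duty of care")]
# print(support_stability(codes))
```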
Partial Interpretability
An SAE defines an overcomplete dictionary to store the possible features learned from data. As a result, we should not expect every feature to come with a clean human-readable description; in practice only a fraction of them do. Even for that subset, the descriptions are approximate rather than exact. When we align existing symbols to the SAE space, however, it is the activation patterns in the SAE layer that we rely on: we probe the model in contexts where a symbol is “in play”, record the resulting sparse codes, and use the aggregated code as an embedding for that symbol. Symbols from different systems whose embeddings are close can be linked or merged, even if we never assign human-readable semantics to every individual feature.
Interpretable features then play a more focused role: they provide human-facing anchors inside this activation geometry. If a particular feature has a reasonably accurate description, all symbols that load heavily on it inherit a shared semantic hint (e.g. “these are all duty-of-care-like things”), making it easier to inspect, debug, and organize the merged symbolic space. In other words, we do not need a perfect, fully named dictionary. We need (i) enough capacity so that important concepts can get their own directions, and (ii) a sizeable, behaviorally relevant subset of features whose approximate meanings are stable enough to serve as anchors. The rest of the overcomplete code can remain anonymous background; it still contributes to distances and clusters in the SAE space, even if we never name it.
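A minimal sketch of this probing-and-aggregation step is given below; `sae_code_fn` and the feature-label dictionary are hypothetical hooks standing in for a real SAE pipeline and whatever partial feature descriptions happen to be available:

```python
import numpy as np

def symbol_embedding(contexts: list, sae_code_fn) -> np.ndarray:
    """Aggregate the sparse codes observed while a symbol is 'in play'
    into a single embedding for that symbol in SAE space."""
    codes = np.stack([sae_code_fn(text) for text in contexts])
    return codes.mean(axis=0)  # mean activation per feature

def anchor_features(embedding: np.ndarray, feature_labels: dict, top_k: int = 5):
    """Attach human-readable hints from the interpretable subset of features."""
    top = np.argsort(-embedding)[:top_k]
    return [(int(i), feature_labels.get(int(i), "<unnamed>")) for i in top]

# Hypothetical usage: `sae_code_for` maps a text to the SAE code of the final
# token's hidden state, and `labels` names only a subset of feature ids.
# duty_of_care = symbol_embedding(contexts_where("duty of care"), sae_code_for)
# print(anchor_features(duty_of_care, labels))
```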
Behavioral Relevance via Counterfactuals
A feature is only interesting, as part of a bridge, if it actually influences the model’s behavior — not merely if it correlates with a pattern in the data. In causal terms, we care about whether the feature lies on a causal path in the network’s computation from input to output: if we perturb the feature while holding everything else fixed, does the model’s behaviour change in the way that its believed meaning would predict?
Formally, changing a feature is analogous to an intervention of the form \(\text{do}(z = c)\) in the causal sense, where we overwrite that internal variable and rerun the computation. But unlike classical causal inference, we do not really need Pearl’s do-calculus to identify \(P(y \mid \text{do}(z))\). The neural network is a fully observable and intervenable system, so we can simply execute the intervention on the internal nodes and observe the new output. In this sense, neural networks give us the luxury of performing idealized interventions that are impossible in most real-world social or economic systems.
Intervening on SAE features is conceptually similar but implemented differently. We typically do not know the meaning of an arbitrary value in the feature space, so the hard intervention \(\text{do}(z = c)\) mentioned above may not be meaningful. Instead, we amplify or suppress the magnitude of an existing feature, which behaves more like a soft intervention: the structural graph is left untouched, but the feature’s effective influence is modified. Because the SAE reconstructs hidden activations as a linear combination of a small number of semantically meaningful features, we can change the coefficients of those features to implement meaningful, localized interventions without affecting other features.
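Reusing the toy SAE sketched earlier, such a soft intervention might look like the following; the layer index, feature index, and scaling factor are hypothetical, and the hook wiring is schematic since real transformer blocks often return tuples rather than bare tensors:

```python
import torch

def intervene_on_feature(h: torch.Tensor, sae, feature_idx: int, scale: float):
    """Soft intervention on one SAE feature: amplify or suppress its coefficient
    in the sparse code, then map the edited code back to the hidden space."""
    with torch.no_grad():
        _, z = sae(h)                  # sparse code for this hidden state
        z = z.clone()
        z[..., feature_idx] *= scale   # e.g. 0.0 to ablate, 5.0 to amplify
        return sae.decoder(z)          # patched hidden state D z'

# Schematic usage with a forward hook on the layer the SAE was trained on
# (layer index, feature index, and scale are hypothetical):
# def hook(module, inputs, output):
#     return intervene_on_feature(output, sae, feature_idx=1234, scale=0.0)
# handle = model.layers[20].register_forward_hook(hook)
# ...run the model, compare outputs with and without the hook...
# handle.remove()
```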

Symbolic-System Based Compression as an Alignment Process
Now let’s take a slightly different view. While neural networks compress the world into highly abstract, continuous manifolds, symbolic systems compress it along the dimensions that humans have decided to care about. From this perspective, compressing information into the symbolic space is an alignment process, where a messy, high-dimensional world is projected onto a space whose coordinates reflect human concepts, interests, and values.
When we introduce symbols like “duty of care”, “threat of violence”, or “protected attribute” into a symbolic system, we are not just inventing labels. This compression process does three things at once:
– It selects which elements of the world the system is obliged to care about (and which it is supposed to ignore).
– It creates a shared vocabulary so that different stakeholders can reliably point to “the same thing” in disputes and audits.
– It turns those symbols into commitment points: once written down, they can be cited, challenged, and reinterpreted, but not quietly erased.
In contrast, a purely neural compression lives entirely inside the model. Its latent axes are unnamed, its geometry is private, and its content can drift as training data or fine-tuning objectives change. Such a representation is excellent for generalization, but poor as a locus of obligation. It is hard to say, in that space alone, what the system owes to anyone, or which distinctions it is supposed to treat as invariant. In other words, neural compression serves prediction, while symbolic compression serves alignment with a human normative frame.
If you see symbolic systems as alignment maps rather than mere rule lists, the connection to accountability becomes direct. To say “the model must not discriminate on protected attributes”, or “the model must apply a duty-of-care standard”, is to insist that certain symbolic distinctions be reflected, in a stable way, inside its internal concept space — and that we be able to locate, probe, and, if necessary, correct those reflections. And this accountability is often desirable, even at the cost of compromising part of the model’s capability.
From Hidden Law to Shared Symbols
In the Zuo Zhuan, the Jin statesman Shu-Xiang once wrote to Zi-Chan of Zheng, warning that once the people could read the law for themselves, they would contend over its letter and no longer stand in awe of their superiors. For centuries, the ruling class had maintained order through secrecy, believing that fear thrived where understanding ended. That is why it became a milestone in ancient Chinese history when Zi-Chan shattered that tradition, cast the criminal code onto bronze tripods, and displayed it publicly in 536 BCE. AI systems now face a similar problem. Who will be the next Zi-Chan?
References
- Bloom, J., Elhage, N., Nanda, N., Heimersheim, S., & Ngo, R. (2024). Scaling monosemanticity: Sparse autoencoders and language models. Anthropic.
- Garcez, A. d’Avila, Gori, M., Lamb, L. C., Serafini, L., Spranger, M., & Tran, S. N. (2019). Neural-symbolic computing: An efficient methodology for principled integration of machine learning and reasoning. FLAIRS Conference Proceedings, 32, 1–6.
- Gao, L., Dupré la Tour, T., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., & Wu, J. (2024). Scaling and evaluating sparse autoencoders.
- Bartlett, P. L., Foster, D. J., & Telgarsky, M. (2017). Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30, 6241–6250.
- Chiang, T. (2023, February 9). ChatGPT is a blurry JPEG of the Web. The New Yorker.
- Pearl, J. (2009). Causality: Models, reasoning, and inference (2nd ed.). Cambridge University Press.
- Donoghue v Stevenson [1932] AC 562 (HL).
