In a seminal but underappreciated book titled Universal Artificial Intelligence, Marcus Hutter attempted a mathematical formulation of universal artificial intelligence, abbreviated AIXI. This article aims to make AIXI accessible to data scientists, technical enthusiasts, and general audiences, both conceptually and formally.
We start with a brief overview of the axioms of probability theory. Subsequently we delve into conditional probability, whose calculation is governed by Bayes’ theorem. While Bayes’ theorem provides the framework for updating beliefs, it leaves open the question of how to assign priors. To address this, we turn to algorithmic information theory, which connects Kolmogorov complexity, defined as the length of the shortest program that outputs a string, with the assignment of Bayesian priors. The bridge between these two ideas is the Solomonoff prior, also known as the universal prior. The universal prior provides us with the essential scaffolding to explore the AIXI formalism, which integrates sequential decision theory, the Solomonoff prior, and Occam’s razor. In the final section, we briefly discuss the limitations of AIXI and alternative approaches to formalizing a universal agent, while acknowledging that the term “universal agent” carries significant philosophical ambiguity. In particular, we discuss Active Inference as a philosophical alternative to AIXI, in which the former models an embodied predictive agent, whereas the latter models a disembodied utility-maximization algorithm.
Probability Axioms
The Kolmogorov axioms define a probability space as a triple (Ω, 𝒜, P), where Ω is the sample space, 𝒜 the collection of subsets of Ω (the events of interest), and P the function that assigns to each event a probability normalized to the unit interval.
- If A ∈ 𝒜, then P(A) ≥ 0
- If A, B ∈ 𝒜 and A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)
- P(Ω) = 1
The first axiom, non-negativity, ensures that probabilities are meaningful as real-valued measures of belief or frequency. The second, additivity, formalizes the principle that the probability of disjoint outcomes is the sum of their individual probabilities. And the third, normalization, ensures that the total probability assigned to the entire sample space equals one.
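To make the axioms concrete, here is a minimal Python sketch (using a fair six-sided die as a hypothetical sample space) that checks non-negativity, additivity over disjoint events, and normalization:

```python
from fractions import Fraction

# Sample space Ω for a fair six-sided die and the measure P over elementary outcomes.
omega = {1, 2, 3, 4, 5, 6}
mass = {outcome: Fraction(1, 6) for outcome in omega}

def P(event):
    """Probability of an event (a subset of Ω) as the sum of its outcomes' masses."""
    return sum(mass[outcome] for outcome in event)

A = {1, 2}   # event: the roll is 1 or 2
B = {5, 6}   # event: the roll is 5 or 6 (disjoint from A)

assert P(A) >= 0                                    # non-negativity
assert A & B == set() and P(A | B) == P(A) + P(B)   # additivity for disjoint events
assert P(omega) == 1                                # normalization
print(P(A), P(A | B), P(omega))                     # 1/3 2/3 1
```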
However, while the probability axioms specify the structural rules of probability, they don’t prescribe how probabilities should be updated in light of new evidence. In this sense, the Kolmogorov framework is analytic and a priori: it defines what a probability measure must satisfy, but not how such a measure is revised through evidence. To move from static probability assignments to dynamic inference, we need a way to relate new data to existing hypotheses, namely conditional probability. This epistemic gap is addressed in frequentist statistics by interpreting conditional probabilities through long-run frequencies of repeated events, typically under the assumption that such events are independent and identically distributed, whereas Bayes’ theorem provides a normative rule for updating beliefs about hypotheses in light of new evidence, useful especially when observations are made incrementally or sample spaces are not well defined.
Bayesian Inference
First formalized by Thomas Bayes, an eighteenth-century English Presbyterian minister, Bayes’ theorem is an algebraic consequence of the definition of conditional probability. Once we understand how conditional probability is calculated, Bayes’ theorem can be derived through a few algebraic operations. Let’s recall how we compute conditional probability:
P(H|D) = P(H ∩ D) / P(D)
This states that the probability of the hypothesis H given the evidence D is computed as the joint probability of the hypothesis and the evidence divided by the probability of the evidence.
Why would we compute conditional probability this way? Let’s walk through it by means of an example. The probability that it rained (the hypothesis) given that the ground is wet (the evidence) assumes that these two events are dependent. If they were independent, then the probability of their joint occurrence would be computed by their product P(H) · P(D). This is because, under independence, P(H|D) = P(H) and P(D|H) = P(D): the event of the ground being wet tells us nothing about whether it rained, and vice versa. Notice what we’ve just asserted: P(H ∩ D) = P(H) · P(D). This means that the probability of the intersection of independent events is computed as the product of their individual probabilities.
Set-theoretically, the joint probability is defined as the intersection of the two sets:
P(H ∩ D)
To understand the proportions involved, picture the sample space as two overlapping regions: the conditional probability we seek is the fraction of the region D (the ground is wet) that also lies within H (it rained).
But in practice we almost never have prior knowledge of the joint probability distribution. This is where a simple algebraic step helps us. We multiply both sides of the equation by the denominator to solve for the joint probability P(H ∩ D):
P(H ∩ D) = P(H|D) · P(D)
Now conversely, if we wanted to compute the probability that the ground is wet given that it rained, our conditional probability formula would be the following:
P(D|H) = P(D ∩ H) / P(H)
The same transformation gives us:
P(D ∩ H) = P(D|H) · P(H)
Notice that the joint probability of the two events in question is the term both conditional probabilities share. Since P(H ∩ D) = P(D ∩ H) is symmetric and therefore identical across the two equations, we get the following crucial equality:
P(H|D) · P(D) = P(D|H) · P(H)
Therefore, if we wish to test the hypothesis that “it rained” given that the “ground is wet”, we can rearrange this equality to obtain Bayes’ formula:
P(H|D) = P(D|H) · P(H) / P(D)
In Bayesian terminology, we refer to P(H|D) as the posterior (namely, the probability we wish to establish), P(H) as the prior, P(D|H) as the likelihood, and P(D) as the marginal (or evidence).
This conventional nomenclature is important when dealing with Bayesian statistics. The likelihood supplies the conditional probability of the data under the hypothesis (providing the values used to update our beliefs), whereas the marginal normalizes the posterior over the sample space of the data.
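As a quick illustration, here is a minimal Python sketch of the rain / wet-ground example; the prior and the two likelihood values are made up purely for illustration:

```python
# Hypothesis H: "it rained"; evidence D: "the ground is wet".
p_rain = 0.3             # prior P(H), assumed for illustration
p_wet_given_rain = 0.9   # likelihood P(D|H), assumed
p_wet_given_dry = 0.2    # likelihood P(D|not H), assumed

# Marginal P(D) via the law of total probability.
p_wet = p_wet_given_rain * p_rain + p_wet_given_dry * (1 - p_rain)

# Bayes' theorem: posterior P(H|D) = P(D|H) * P(H) / P(D).
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet
print(round(p_rain_given_wet, 3))   # 0.659: wet ground raises the probability of rain from 0.3
```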
Since we need values for all the terms in Bayes’ formula, a major hurdle in Bayesian statistics is how best to assign them. In particular, specifying the prior can be difficult since we don’t always have the requisite knowledge in advance. One strategy for approximating the prior is to use an uninformative prior, where we assign the same probability to all possible outcomes, in line with the Laplacian principle of indifference; another is to use an informative prior, namely a prior that aims to approximate the actual probability distribution of the event. In our example, this could be the Poisson distribution of daily rain.
As we shift from viewing hypotheses as fixed values to treating them as random variables, the goal becomes to infer the full posterior distribution rather than a single point estimate. To do this, we model the prior P(H) and the likelihood P(D|H) as probability distributions and compute the marginal probability of the data P(D), which is obtained either as a sum (via the probability mass function for discrete variables) or as an integral (via the probability density function for continuous variables). These components allow us to apply Bayes’ theorem to obtain the posterior distribution P(H|D):
P(H|D) = P(D|H) · P(H) / P(D)
The probability mass function (PMF) sums to one over all values of the distribution, whereas the probability density function (PDF) integrates to one as the area under the curve of the distribution. For continuous variables we integrate because we are dealing with infinitely many values in the distribution. Below is the formula for the marginal for continuous variables, namely the law of total probability expressed with the probability density function:
P(D) = ∫ P(D|H) · P(H) dH
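Below is a minimal sketch of both forms of the marginal: a discrete sum over a made-up hypothesis set, and a numerical integral for a continuous case that assumes a Beta(2, 2) prior over a rain probability θ with a Bernoulli likelihood:

```python
import numpy as np

# Discrete marginal: P(D) = Σ_H P(D|H) · P(H) over a finite hypothesis set (values assumed).
priors      = np.array([0.5, 0.3, 0.2])   # P(H_i)
likelihoods = np.array([0.8, 0.4, 0.1])   # P(D|H_i)
marginal_discrete = np.sum(likelihoods * priors)   # 0.54

# Continuous marginal: P(D) = ∫ P(D|θ) p(θ) dθ with a Beta(2, 2) prior over the rain
# probability θ and a single Bernoulli observation D = "wet ground".
theta = np.linspace(0.0, 1.0, 10_001)
prior_pdf = 6.0 * theta * (1.0 - theta)   # Beta(2, 2) density
likelihood = theta                        # P(D|θ) for one observed success
marginal_continuous = np.sum(likelihood * prior_pdf) * (theta[1] - theta[0])   # ≈ 0.5

print(marginal_discrete, marginal_continuous)
```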
Bayesian statistics forms an alternative framework to the more established frequentist approach. Though its historical roots are as deep as those of the frequentist formulation, computational intractability limited its adoption for much of the 20th century. With advances in computational power, Bayesian methods have undergone rapid development and increased application. Today, Bayesian statistics plays a central role in machine learning (ML), particularly in probabilistic modeling, uncertainty estimation, and decision-making under uncertainty.
Kolmogorov Complexity
We saw that the Kolmogorov axioms supply an analytic framework for computing probabilities, defining the probability of a union of disjoint events as a sum, and we saw that the probability of an intersection of independent events is their product. But they didn’t tell us how to compute joint probabilities when events are dependent. For this we invoked Bayes’ theorem, which uses the set intersection to derive a general formula for conditional probability.
However, in our explication of Bayesian probability we identified the assignment of priors as a problem for the framework.
This is where the notion of Kolmogorov complexity comes in. While Kolmogorov complexity doesn’t itself involve probability, through the universal prior (which we’ll explicate below) it encodes an a posteriori meta-theoretic assumption as a mathematical selection bias. This meta-theoretic generalization states that simpler hypotheses should be considered more probable a priori. If we’re faced with the datum or outcome that the ground is wet, which hypothesis from the store of all possible hypotheses should we select? Intuitively we want the hypothesis that assigns the highest probability to the observed outcome. But without additional information, how are we to know which hypothesis maximizes the probability of the outcome? Kolmogorov’s framework answers this within the bounds of algorithmic information theory: prefer the hypothesis corresponding to the shortest program that generates the data.
In order to understand the motivation behind this, let’s first state the problem within algorithmic information theory, then circle back to its application in less abstract, real-world scenarios.
In algorithmic information theory, we encode symbolic sequences or strings in some base, such as base-2 binary strings. We define a Universal Turing Machine U as a (partial, i.e. not necessarily defined on every input) function from a program p to an output x, i.e. U(p) = x. Think of this as loosely analogous to f(x) = y. The program represents the hypothesis or theory, whereas the output represents the data or evidence. This mapping is key to grasping the intuitive thrust of the theory.
The Kolmogorov complexity of a string x is defined as the length of the shortest program that outputs x, where K(x) measures the length of that program in bits:
K(x) = min { |p| : U(p) = x }
This expression tells us that out of all the programs p that produce x as output, we select the shortest one, namely the one with minimal |p|. Kolmogorov complexity is defined over finite binary strings x ∈ {0,1}*.
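Kolmogorov complexity itself is uncomputable, but an off-the-shelf compressor gives a crude, machine-dependent upper bound on it. The sketch below uses zlib purely as an illustrative stand-in for “shortest description length”, comparing a highly regular string with incompressible random bytes:

```python
import os
import zlib

def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data, a rough upper bound on K(x)."""
    return len(zlib.compress(data, 9))

regular = b"01" * 5_000        # highly regular: a short description ("repeat '01' 5,000 times") exists
random_ = os.urandom(10_000)   # incompressible noise: no description much shorter than the data itself

print(compressed_length(regular))   # a few dozen bytes
print(compressed_length(random_))   # close to the original 10,000 bytes
```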
What we now have to do is connect Kolmogorov complexity to probability theory so that it can inform our Bayesian priors. To do this, we notice a connection, at least superficial at first, between Kolmogorov complexity and Shannon entropy. Both appear to quantify some measure of information content: the greater the uncertainty, the larger the amount of information required to encode the event. Both K(x) and H(X) are measured in bits, so what’s the difference?
K(x) describes the length in bits of the shortest program that outputs the string x, whereas H(X) computes the number of bits needed on average to encode an outcome drawn from the probability distribution over the sample space of X. It seems as if some deep connection must hold between these two measures. What then is the relation between Kolmogorov complexity and Shannon entropy? We need a bridge from raw program lengths to probabilities.
If we isolate a single outcome from the Shannon distribution, we can define the self-information of the output of our program:
I(x) = log₂(1 / P(x)) = −log₂ P(x)
This says that the self-information (think of it as the entropy measure for a single outcome) is the log-inverse of the probability of x occurring. I(x) is the per-outcome quantity whose average over the full distribution defines Shannon entropy:
H(X) = Σₓ P(x) · log₂(1 / P(x)) = −Σₓ P(x) · log₂ P(x)
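Here is a minimal sketch of both quantities for a made-up four-symbol distribution, showing that H(X) is simply the probability-weighted average of the self-information values I(x):

```python
import math

# An assumed distribution over four symbols.
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

def self_information(p: float) -> float:
    """I(x) = -log2 P(x), in bits."""
    return -math.log2(p)

entropy = sum(p * self_information(p) for p in P.values())   # H(X) = Σ P(x) · I(x)

for symbol, p in P.items():
    print(symbol, self_information(p))   # 1, 2, 3, 3 bits
print(entropy)                           # 1.75 bits on average
```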
Now, the Coding Theorem states that Kolmogorov complexity is roughly equal to the Shannon self-information contained in a string:
K(x) ≈ log₂(1 / P(x)) = −log₂ P(x)
This states that the shortest program that outputs x is roughly as long as the log-inverse of the total probability of producing x. In other words, the distribution assigns the highest weight to the shortest program! We have now connected raw program length with probability theory.
This is how we connect algorithmic compressibility, namely program length defined for an individual instance, to probability and information theory, enabling us to treat compressibility as a prior. As a side note, the reason we don’t have an exact equality in the equation above is that the relationship holds only up to an additive constant c, which depends on the choice of Universal Turing Machine (UTM), making K(x) machine-dependent up to an upper bound c across UTMs:
|K_U(x) − K_U′(x)| ≤ c, for any two universal machines U and U′
Now you’re probably wondering: what kind of distribution enables us to assign probabilities to all possible program lengths? That distribution is the Solomonoff prior.
Solomonoff Induction
As we discussed, the choice of prior affects the posterior, especially when sample sizes are small. This raises the question: what prior should we assign when we know nothing in advance about the process that generated our data? This is what Solomonoff’s prior encodes. Precisely, the Solomonoff prior encodes the probability of observing an output sequence x as the total probability that a random program outputs x when run on a universal Turing machine.
Now, let’s take a look at Solomonoff’s universal prior formula, which should click into place our earlier assertion that algorithmic probability is intimately connected with simplicity. Solomonoff defined the universal prior P(x) as the sum, over all finite binary prefix-free programs p that output x, of each program’s probability, where the probability of p is given by its simplicity, 2^−|p|:
P(x) = Σ_{p : U(p) = x} 2^−|p|
Because we define the probability of a program as shrinking by half for every additional bit it contains, the more bits in the program, the smaller its weight in the distribution. Therefore, the total probability over all prefix-free binary programs will be dominated by the shortest programs.
We stated that the Solomonoff prior is defined over finite, prefix-free, binary strings. Let’s make sure we understand each of those qualifiers. We use binary sequences because a Universal Turing Machine can be defined in terms of binary inputs and outputs, and any information object can be represented with a binary encoding. We define the prior over finite sequences in order to meet the conditions of computability: infinite sequences are not Turing-computable.
A set of strings is prefix-free if no string in the set is a prefix of another:
for all p, q ∈ C with p ≠ q, p is not a prefix of q (for example, {0, 10, 110, 111} is prefix-free, while {0, 01} is not)
This yields sets of disjoint finite binary strings. In other words, we need disjoint sets to compute their union. Per the Kolmogorov axiom of additivity, the probability of the union of the members of the set can then be expressed as the sum of their probabilities.
Disjointness ensures that the probabilities associated with the hypothesis or prior strings obey Kraft’s inequality, which states that the sum of all the probabilities does not exceed the unit interval:
Σ_{c ∈ C} 2^−|c| ≤ 1
This tells us that for each string c in a prefix-free set C, its probability is expressed as 2 raised to a negative exponent, where the exponent is the length of the string in bits. Because all the sequences are disjoint, their sum cannot exceed 1 (though it can be less than one, making it a semi-measure). This allows us to treat code weights as probabilities and consequently to compute the probability mass of the entire distribution by summing over the string weights.
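The sketch below checks both properties for a small hand-picked code: no codeword is a prefix of another, and the 2^−|c| weights sum to at most one:

```python
def is_prefix_free(codes):
    """True if no codeword is a proper prefix of another codeword."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)

def kraft_sum(codes):
    """Σ 2^(-|c|) over all codewords; at most 1 for a prefix-free set."""
    return sum(2.0 ** -len(c) for c in codes)

good = ["0", "10", "110", "1110"]   # prefix-free; sum = 0.9375 < 1 (a semi-measure)
bad  = ["0", "01", "11"]            # "0" is a prefix of "01", so not prefix-free

print(is_prefix_free(good), kraft_sum(good))   # True 0.9375
print(is_prefix_free(bad), kraft_sum(bad))     # False 1.0
```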
Accordingly, Solomonoff’s prior is defined as the sum of the weights or probabilities of the finite binary programs p that output x:
P(x) = Σ_{p : U(p) = x} 2^−|p|
Therefore, in order to compute the probability that a particular program produced the output x, we condition that program’s weight on the sum of the weights of all possible programs:
P(p|x) = P(x|p) · 2^−|p| / Σ_q P(x|q) · 2^−|q|
Because p is deterministic, the likelihood P(x|p) is a delta (indicator) function: either a program outputs x or it doesn’t.
P(x|p) = 1 if U(p) = x, and 0 otherwise
Further, since the prior is defined over prefix-free binary strings, we can express the conditionalization in terms of bit strings:
P(x_{t+1} | x_{1:t}) = P(x_{1:t} x_{t+1}) / P(x_{1:t})
Instead of a joint distribution over events, we have a weighted sum over bitstrings that syntactically generate x as a stand-in for all possible events. This reveals some of the limitations of the formalism, which we’ll delve into later, such as the lack of structural bias and of representational alignment.
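The full Solomonoff prior is uncomputable, but its shape can be caricatured with a toy “programming language”: in the sketch below, a program is just a bit pattern repeated forever, its prior weight is 2^−|p|, and the prediction of the next bit conditions the weighted mass on the observed prefix via the delta-function likelihood. This is only an illustration under these toy assumptions, not an implementation of Solomonoff induction:

```python
from itertools import product

def toy_output(program: str, length: int) -> str:
    """Toy 'interpreter': a program is a bit pattern repeated until `length` bits are produced."""
    return (program * length)[:length]

def predict_next_bit(observed: str, max_len: int = 10) -> float:
    """Probability that the next bit is '1', mixing all toy programs up to max_len bits,
    each weighted by 2^(-|p|) and kept only if its output matches the observed prefix."""
    mass = {"0": 0.0, "1": 0.0}
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            p = "".join(bits)
            output = toy_output(p, len(observed) + 1)
            if output[:-1] == observed:          # delta-function likelihood
                mass[output[-1]] += 2.0 ** -len(p)
    return mass["1"] / (mass["0"] + mass["1"])

print(predict_next_bit("010101"))   # 0.08: the two-bit program "01" dominates, favoring '0' next
```

Because the two-bit program “01” carries far more weight than any longer pattern consistent with the data, the mixture’s prediction is dominated by the simplest continuation.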
Together with the Coding Theorem, the Solomonoff prior offers a deep connection between induction and compressibility: the most probable hypotheses are the most compressible ones. In the real world, however, we know that the most “compressible” theories are not always those with the greatest explanatory or predictive power, though the best theoretical approximations do tend toward simplicity.
The formula below expresses our notions of algorithmic complexity, the universal prior, and information entropy as roughly equivalent to one another (up to the additive constants discussed above):
K(x) ≈ −log₂ P(x) ≈ I(x)
AIXI
As it stands, our theory of universal induction, which combines the Solomonoff prior with Bayesian updating, is not yet defined for an agent that acts on its environment. What if we combine Solomonoff induction with sequential decision theory?
This is where Marcus Hutter’s AIXI theory comes in: it integrates Solomonoff induction, decision theory, and reinforcement learning so that our universal prior can do work for an agent.
Transitioning from Solomonoff induction into the territory of decision theory and reinforcement learning requires expanding our ontology to actions, observations, and rewards. AIXI models a universal agent whose interaction with any computable environment enables it to choose the action that maximizes expected reward. In more formal terms, AIXI selects an action at every time step and in return receives an observation and a reward from the environment. How does AIXI select the optimal action? As we’ll see, because AIXI encodes an ideal Bayesian agent, it constitutes a model-based agent. However, unlike a typical Bellman-based deterministic agent (which solves the Bellman equations to determine optimality; see my earlier article on reinforcement learning for that), AIXI maintains uncertainty over all possible environments. It does so by computing the sum of products of the likelihoods, namely the probabilities of environmental feedback given the action history, and the Solomonoff weights assigned to each computable environment (or program); this sum is called the universal mixture.
Put succinctly, the universal mixture is the term inside AIXI that defines the probabilistic prediction of the next observation-reward pair given the action history. It is computed as a sum of products over the environment space: each environment model assigns probabilities to observation-reward sequences given the actions taken so far, and each is weighted by its Solomonoff prior. The universal mixture is given by the formula below:
ξ(o₁r₁ … oₜrₜ | a₁ … aₜ) = Σ_{ν ∈ M} 2^−K(ν) · ν(o₁r₁ … oₜrₜ | a₁ … aₜ)
The universal mixture accumulates a predictive distribution over the environment through each action the agent takes. Think of ξ as assigning probabilities to every possible future trajectory of observations and rewards given a sequence of actions.
The universal mixture provides us with the probability of observation-reward pairs under the action history, but it doesn’t tell us which of those trajectories is the most valuable or reward-maximizing. For this we sum the rewards along each trajectory:
R = r_k + r_{k+1} + … + r_m
In order to identify which trajectory to choose, we weight the sum of rewards of each trajectory by the probability the universal mixture assigns to that trajectory:
Σ_{o_k r_k … o_m r_m} (r_k + … + r_m) · ξ(o₁r₁ … o_m r_m | a₁ … a_m)
As such, we compute the expected return by weighting each trajectory’s cumulative reward by its probability.
Once we compute expectations, the final step involves choosing the action that maximizes expected return given the weighting of rewards and environment probabilities. For this we employ the arg max operator as follows:
a_k = arg max_{a_k} Σ_{o_k r_k} … max_{a_m} Σ_{o_m r_m} (r_k + … + r_m) · ξ(o₁r₁ … o_m r_m | a₁ … a_m)
Arg max selects the action that maximizes expected cumulative return across all possible trajectories.

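AIXI itself is uncomputable, but the structure of the mixture-weighted arg max step can be illustrated with a toy example. In the sketch below (a hypothetical two-armed bandit), the environments are two candidate reward models and the weights are made-up stand-ins for the Solomonoff weights 2^−K(ν); the agent simply picks the one-step action with the highest mixture-expected reward:

```python
# Toy 'environments': each maps an action to a probability of receiving reward 1.
environments = [
    {"left": 0.9, "right": 0.1},   # ν1: the left arm is good
    {"left": 0.2, "right": 0.8},   # ν2: the right arm is good
]
weights = [0.75, 0.25]             # made-up stand-ins for the Solomonoff weights 2^(-K(ν))

def mixture_expected_reward(action: str) -> float:
    """One-step expected reward under the weighted mixture ξ of candidate environments."""
    return sum(w * env[action] for w, env in zip(weights, environments))

best_action = max(["left", "right"], key=mixture_expected_reward)   # the arg max step
print(best_action, mixture_expected_reward("left"), mixture_expected_reward("right"))
# left 0.725 0.275
```

The real AIXI expression plans over multi-step futures and re-weights the environments by Bayesian conditioning after every observation-reward pair; the sketch above collapses the horizon to a single step.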
AIXI aims to formalize a universal agent whose equipment with the Solomonoff prior biases it toward environments with minimal Kolmogorov complexity. Apart from assuming all possible computable environments and associating compressibility with higher probability, the AIXI agent acquires structural bias only from interaction with the environment. This ensures that AIXI should be able to navigate any environment (provided that it is structured, i.e. computable, to begin with).
While AIXI formalizes a universal agent, it doesn’t sufficiently bias this agent so that it can efficiently navigate actual environments. That is to say, the procedures for optimal action or decision that AIXI formalizes don’t encode efficient structural biases, such as domain-specific heuristics or architectural constraints, that might speed up learning. However, this feature is a consequence of AIXI’s scope: setting universal bounds on decision optimality across all possible environments. In principle, AIXI acquires efficient structural biases indirectly through Bayesian updating. With sufficient environmental sampling, AIXI asymptotically converges to the true environment over infinite interactions, assuming the environment is computable and has non-zero prior. In practice, however, convergence can be inefficient because the posterior weight remains spread over too many environments, leaving the agent’s actions suboptimal for an indefinite period of time.
Noncomputability & Structural Bias
In its universal form AIXI is not algorithmically implementable because Kolmogorov complexity and the Solomonoff prior are both uncomputable. The class of all programs that halt and produce valid (computable) environments is not recursively enumerable: there is no effective procedure for deciding which programs halt or compute valid environments. Computing the universal prior requires simulating infinitely many environments, while computing future expectations requires unbounded foresight, all of which is mathematically intractable.
For this reason, computable approximations of AIXI exist, such as AIXItl. AIXItl introduces time and program-length bounds (which is what the tl stands for), restricting the environment space M to programs that run for at most t time steps and are at most l bits long. However, AIXItl is still inefficient because the combinatorial space of environments is exponential: O(2^l). Model-free and gradient-based approximations represent further alternatives in the search for a universal agent. These later alternatives use heuristics and sampling-based methods for exploration, such as Monte Carlo Tree Search, for optimal decision making. Fundamentally, the contest lies between model-based and model-free methods, the latter of which derive their biases entirely from interaction with the environment.
Representational Alignment
As we already stated, AIXI treats the universe as a computable sequence represented through finite binary strings. The assumption that the environment is Turing-computable is not entailed by the Church-Turing thesis and thereby represents an additional assumption of AIXI. The truth of this assumption in principle (that is, not with respect to any realizable machine) is an open question, though there is reason to think it false.
As we saw, AIXI treats observations as bitstrings, whereas real-world data call for structured representations such as graphs, tensors, and differential equations, to name a few. In order to encode richer structures within AIXI we would need priors over such structured representations. Encoding structural biases would make AIXI more efficient, but at the expense of its universality. The cost of encoding real-world representational structures within a Bayesian model is therefore specializing the model at the expense of environment-generalizability. Given that in practice an agent that realizes AIXI is not possible, we should therefore look to agents that do encode real-world representations, such as computable approximations of AIXI, model-free agents, or deep-learning-based agents.
Active Inference
We saw that AIXI incorporates a universal prior, but this prior is entirely unstructured and embodies no prior knowledge about the world except the meta-level selection bias that treats short or compressible programs as the most likely. We also saw that this makes both AIXI and the Solomonoff prior computationally intractable, which precludes implementing either in complete form.
Another strand of agent modeling, more recently branded Active Inference, whose centerpiece is the Free Energy Principle, aims to integrate the modeling lineages of utility maximization, reinforcement learning, Bayesian inference, predictive coding, statistical mechanics, and far-from-thermodynamic-equilibrium dynamics into the unified model of a hierarchical generative Bayesian agent. Optimality in the Active Inference generative Bayesian model consists in minimizing free energy, where free energy is defined as the expected surprise about the joint occurrence of sensations and their inferred causes.
Put in colloquial terms, the generative model predicts future perceptions given actions through a forward model that estimates the causes of those sensations via the prior, and it conversely estimates or predicts the actions required to bring about preferred states through an inverse model. The agent dynamically navigates the environment through loops of forward and inverse models or, more simply, loops of perception and action. Free energy arises from the mismatch between predictions and environmental feedback and is minimized through hierarchical Bayesian updating of the model’s priors.
The formula below formally expresses the computation of variational free energy as the divergence between the recognition density (the approximate posterior) and the conditional density (the true posterior), where ỹ stands for observed input, 𝜗 for latent causes, p(ỹ, 𝜗) defines the generative model as the joint probability density of perceptions and latent causes, and q(𝜗) defines the approximate posterior:
F = D_KL[ q(𝜗) ‖ p(𝜗|ỹ) ] − log p(ỹ) = E_q[ log q(𝜗) − log p(ỹ, 𝜗) ]
The first expression in the formula is the Kullback-Leibler divergence between the approximate posterior q(𝜗) and the true posterior p(𝜗|ỹ), minus the log model evidence log p(ỹ). In general, the Kullback-Leibler divergence quantifies the information-theoretic divergence between a model distribution Q and the actual distribution P. Free energy thus results from the divergence between the approximate and true posterior, offset by the log model evidence. The second expression recasts the same quantity as the expectation, under the approximate posterior, of the log ratio between q(𝜗) and the generative model p(ỹ, 𝜗); equivalently, it is the cross-entropy between the approximate posterior and the generative model minus the entropy of the approximate posterior. Minimizing free energy amounts to reducing the divergence between the recognition density and the conditional density.
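Here is a minimal sketch for a single observation and a binary latent cause, with a made-up generative model p(ỹ, 𝜗) and an assumed approximate posterior q(𝜗), verifying that the two decompositions described above yield the same free energy:

```python
import numpy as np

# Assumed generative model over one fixed observation ỹ and a binary latent cause ϑ.
p_theta     = np.array([0.5, 0.5])     # p(ϑ)
p_y_given   = np.array([0.9, 0.3])     # p(ỹ|ϑ) evaluated at the observed ỹ
p_joint     = p_theta * p_y_given      # p(ỹ, ϑ)
p_evidence  = p_joint.sum()            # p(ỹ), the model evidence
p_posterior = p_joint / p_evidence     # p(ϑ|ỹ), the true posterior

q = np.array([0.6, 0.4])               # q(ϑ), an assumed approximate posterior

# Decomposition 1: F = KL[q(ϑ) || p(ϑ|ỹ)] - log p(ỹ)
kl = np.sum(q * np.log(q / p_posterior))
F1 = kl - np.log(p_evidence)

# Decomposition 2: F = E_q[log q(ϑ) - log p(ỹ, ϑ)]  (cross-entropy minus entropy)
F2 = np.sum(q * (np.log(q) - np.log(p_joint)))

print(F1, F2)   # identical up to floating point; both >= -log p(ỹ)
```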
Both AIXI and Active Inference offer optimal Bayesian agents in different ways. But whereas AIXI is formally non-computable in its unbounded form, Active Inference enables tractable approximations via variational Bayesian models. Optimization in AIXI consists in maximizing rewards, whereas in Active Inference it consists in minimizing free energy. In the former, model accuracy results implicitly from maximizing rewards, whereas in the latter, maximizing reward results implicitly from minimizing expected surprise or free energy. In this regard, Active Inference constitutes a structured generative model that hierarchically estimates latent causes, guiding action selection through inference rather than through AIXI’s enumeration, which selects the reward-maximizing action across all possible environments. Yet Active Inference remains an incomplete framework, since it glosses over many details about concrete agents, such as goal-setting, model learning (where it is notably vague), and a viable description of agent boundaries (the Markov blanket formulation is insufficient and unable to distinguish biological agents from physical systems that don’t amount to actual agents).
References
Friston, K., Kilner, J., & Harrison, L. (2006). A free energy principle for the brain. Journal of Physiology-Paris, 100(1–3), 70–87. https://doi.org/10.1016/j.jphysparis.2006.10.001
Hutter, M. (2005). Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability (1st ed.). Springer Berlin Heidelberg. https://doi.org/10.1007/b138233
Sommaruga, G. (Ed.). (2009). Formal Theories of Information: From Shannon to Semantic Information Theory and General Concepts of Information (Lecture Notes in Computer Science, Vol. 5363). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-00659-3
Zabell, S. (2009). Philosophy of inductive logic: The Bayesian perspective. In L. Haaparanta (Ed.), The Development of Modern Logic (pp. 725–774). Oxford University Press. https://doi.org/10.1093/acprof:oso/9780195137316.003.0044