Coding-theorem Like Behaviour and Emergence of the Universal Distribution from Resource-bounded Algorithmic Probability
Introduced by Solomonoff and Levin, the seminal concepts of Algorithmic Probability (AP) and the Universal Distribution (UD) predict the way in which strings distribute as the result of running ‘random’ computer programs. Previously referred to as ‘miraculous’ because of its surprisingly powerful properties and applications as the optimal theoretical solution to the challenge of induction and inference, approximations to AP and the UD are of the greatest importance in computer science and science in general. Here we are interested in the emergence, rates of convergence, and Coding-theorem-like behaviour as a marker of AP acting in subuniversal models of computation. To this end, we investigate empirical distributions of computer programs of weaker computational power according to the Chomsky hierarchy. We introduce measures of algorithmic probability and algorithmic complexity based upon resource-bounded computation, compared to the previously thoroughly investigated distributions produced from the output of Turing machines. The approach allows numerical approximations to algorithmic (Kolmogorov-Chaitin) complexity-based estimations at each level of a computational hierarchy. We demonstrate that all these estimations are correlated in rank and that they converge as a function of computational power, despite the fundamental differences between the computational models.
Keywords algorithmic coding-theorem like behaviour, Solomonoff’s induction, Levin’s semi-measure, computable algorithmic complexity, Finite-state complexity, transducer complexity, context-free grammar complexity, linear-bounded complexity, time resource-bounded complexity.
Department of Computer Science, University of Oxford, Oxford, U.K.
Algorithmic Dynamics Lab, Unit of Computational Medicine, SciLifeLab, Centre for Molecular Medicine, Department of Medicine Solna, Karolinska Institute, Stockholm, Sweden.
Algorithmic Nature Group, LABORES, Paris, France.
Posgrado en Ciencias e Ingeniería de la Computación at the Universidad Nacional Autonoma de México (UNAM)
Departamento de Matemáticas, Facultad de Ciencias, Universidad Nacional Autónoma de México (UNAM), Ciudad de México, México.
1 Motivation and Significance
An algorithmic ‘law’ regulates the behaviour of the output distribution of computer programs. The Universal Distribution is the probability distribution that establishes how all the output strings from a universal computer running a random computer program distribute. The Algorithmic Probability of a string $s$ is formally defined by:

$$m(s) = \sum_{p\,:\,U(p)=s} 2^{-|p|}$$

where the sum is over all halting programs $p$ for which $U$, a prefix-free universal Turing machine, outputs the string $s$. A prefix-free universal Turing machine defines a set of valid programs such that the sum is bounded by Kraft's inequality and is not greater than 1 ($m$ is also called a semi-probability measure because some programs will not halt, and thus the sum of the probabilities is never exactly 1).
An invariance theorem establishes that the choice of reference universal Turing machine introduces a vanishing bias as a function of the string size and is thus asymptotically negligible:

$$|K_U(s) - K_V(s)| < c_{UV}$$

where $c_{UV}$ is a constant that depends on $U$ and $V$ (think of the size of a compiler translating in both directions) but is independent of $s$, so the reference machine can safely be dropped in the long term. However, the invariance theorem says nothing about the rate of convergence, which makes numerical experiments such as these all the more relevant and necessary.
Algorithmic Probability and the Universal Distribution represent the theoretically optimal solution to the challenge of induction and inference, according to R. Solomonoff, a co-founder of algorithmic information theory [32, 33, 34].
More recently, at a panel discussion at the World Science Festival in New York City on Dec 14, 2014, Marvin Minsky, one of the founding fathers of AI, said [own transcription]:
It seems to me that the most important discovery since Gödel was the discovery by Chaitin, Solomonoff and Kolmogorov of the concept called Algorithmic Probability which is a fundamental new theory of how to make predictions given a collection of experiences and this is a beautiful theory, everybody should learn it, but it’s got one problem, that is, that you cannot actually calculate what this theory predicts because it is too hard, it requires an infinite amount of work. However, it should be possible to make practical approximations to the Chaitin, Kolmogorov, Solomonoff theory that would make better predictions than anything we have today. Everybody should learn all about that and spend the rest of their lives working on it.
The Universal Distribution has also been referred to as miraculous because of its properties in inference and prediction . However, neither AP nor the UD is computable. This has meant that for decades after their discovery little to no attempt was made to apply Algorithmic Probability to problems in general science. Nonetheless, $m$ is not only uncomputable but, more precisely, lower semi-computable, which means that it can be approximated from below. It is thus of fundamental interest to science, in its application and contribution to the challenges of complexity, inference and causality, to keep pushing the boundaries towards better approximating methods for algorithmic probability. A recent framework and pipeline of numerical methods to estimate them has been advanced and proven successful in many areas of application, ranging from cognition to graph complexity [9, 12, 7, 39, 40].
There are many properties of AP that make it optimal [32, 33, 34, 23]. For example, the same Universal Distribution will work for any problem within a convergent error; it can deal with missing and multidimensional data; the data need not be stationary or ergodic; there is no under-fitting or over-fitting, because the method is parameter-free and thus the data need not be divided into training and test sets; and it is the gold standard for a Bayesian approach in the sense that it updates the distribution in the most efficient and accurate way possible with no assumptions.
Several interesting extensions of resource-bounded Universal Search approaches have been introduced to make algorithmic probability more useful in practice [27, 28, 20, 14], some of which provide theoretical bounds . Some approaches have explored relaxing the conditions (e.g. universality) on which Levin's Universal Search is fundamentally based , or have introduced domain-specific versions (and thus versions of conditional AP). Here we explore the behaviour of explicitly weaker models of computation of increasing computational power, in order to investigate the asymptotic behaviour and emergence of the Universal Distribution, the ability of the different models to approximate it, and the actual empirical distributions that such models produce.
The so-called Universal Search  is based on dovetailing over all possible programs and their runtimes, such that the fraction of time allocated to program $p$ is $2^{-|p|}$, where $|p|$ is the size of the program in bits. Despite the algorithm's simplicity and remarkable theoretical properties, a potentially huge constant slowdown factor has kept it from being used much in practice. Some approaches to speeding it up have introduced bias and made the search domain-specific, which has at the same time limited the power of Algorithmic Probability.
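The time-allocation scheme above can be illustrated with a minimal sketch. The program table and names here are hypothetical stand-ins for a real interpreter: each 'program' is just a bit-string code mapped to the number of steps it needs and the output it would produce.

```python
def universal_search(programs, target, max_phase=20):
    """Toy Levin-style search: in phase k, each program p receives
    2**(k - len(p)) simulation steps, so p's share of total time is
    proportional to 2**(-len(p)).  `programs` maps a bit-string code to
    (steps_needed, output), a stand-in for running a real interpreter."""
    for phase in range(max_phase):
        for code, (steps_needed, output) in programs.items():
            budget = 2 ** (phase - len(code))
            if budget >= steps_needed and output == target:
                return code, phase
    return None

# Hypothetical program table: shorter programs get a larger time share,
# so the shortest producer of `target` is found first.
progs = {"01": (3, "0000"), "0110": (2, "0000"), "111": (5, "1111")}
```

Under this scheme a program of length $l$ receives a $2^{-l}$ share of the total time, so short halting programs are found first, mirroring the theoretical allocation.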
There are practical applications of AP that make it very relevant. If one could translate some of the power of Algorithmic Probability to decidable models (thus below Type-0 in the Chomsky hierarchy), without having to deal with the uncomputability of algorithmic complexity and algorithmic probability, it would effectively be possible to trade computing power for predictive power. While trade-offs must exist for this to be possible (full predictability and uncomputability are incompatible), finding a threshold above which the coding theorem applies would be key to recovering some of this power by relaxing the computational power of an algorithm. If a smooth trade-off is found before the undecidability border of Turing-completeness, it means that some of the advantages of Algorithmic Information Theory can be partially recovered from simpler models of computation in exchange for accuracy. Such simpler models of computation may model physical processes that are computationally rich but subject to noise or bounded in resources. More real-world approaches may then lead to applications such as the reduction of conformational distributions in protein folding [6, 10], in a framework that may favour or rule out certain paths, thereby helping to predict the most likely (algorithmic) final configuration. If the chemical and thermodynamic laws that these processes are subject to are considered algorithmic in any way, even under random interactions (e.g. molecular Brownian motion), the Universal Distribution may help quantify the most likely regions if those laws constitute forms of computation below or at the Turing level that we explore here. This can be considered more plausible if one takes into account that probabilistic bias affects convergence  and that we have demonstrated  that finite approximations to the Universal Distribution may explain some phenomenology of natural selection.
1.1 Uncomputability in complexity
Here we explore the middle ground at this boundary and study the interplay between computable and non-computable measures of algorithmic probability connected to algorithmic complexity. Indeed, a deep connection between the algorithmic (Kolmogorov-Chaitin) complexity $K(s)$ of an object $s$ and its algorithmic probability $m(s)$ was found and formalized by way of the algorithmic Coding theorem. The theorem establishes that the probability of $s$ being produced by a random algorithm is inversely proportional to its algorithmic complexity (up to a constant):

$$K(s) = -\log_2 m(s) + O(1)$$
Levin proved that the output distribution established by Algorithmic Probability dominates (up to multiplicative scaling) any other distribution produced by algorithmic means as long as the executor is a universal machine, hence giving the distribution its ‘universal’ character (and name, as the ‘Universal Distribution’).
This so-called Universal Distribution is a signature of Turing-completeness. However, many processes that model or regulate natural phenomena may not necessarily be Turing-universal. For example, some models of self-assembly may not be powerful enough to reach Turing-completeness, yet they display output distributions similar to those predicted by the Universal Distribution by way of the algorithmic Coding theorem, with simplicity highly favoured by frequency of production. Noise is another source of power degradation that may hamper universality and therefore the scope and application of algorithmic probability. However, if some sub-universal systems approach coding-theorem behaviour, this gives us great predictive capabilities and less powerful but computable algorithmic complexity measures. Here we ask whether such distributions can be partially or totally explained by importing the relation established by the coding theorem, and under what conditions non-universal systems can display algorithmic coding-theorem-like behaviour.
Here we produce empirical distributions of systems at each of the computing levels of the Chomsky hierarchy, starting from transducers (Type-3) as defined in , Context-free grammars (Type-2) as defined in , linear-bounded non-deterministic Turing machines (Type-1) as approximations to bounded Kolmogorov-Chaitin complexity and from a universal procedure from an enumeration of Turing machines (Type-0) as defined in [5, 21]. We report the results of the experiments and comparisons, showing the gradual coding-theorem-like behaviour at the boundary between decidable and undecidable systems.
We will denote by TM$(n, m)$, or just $(n, m)$, the set of all strings produced by all the Turing machines with $n$ states and $m$ symbols.
2.1 The Chomsky Hierarchy
The Chomsky hierarchy is a strict containment hierarchy of classes of formal grammars equivalent to different computational models of increasing computing power. At each of the 4 levels, grammars and automata compute a larger set of possible languages and strings. From weaker to stronger computational power:
[Type-3] The most restricted grammars, generating the regular languages: rules have a single non-terminal symbol on the left-hand side and terminal or non-terminal symbols on the right-hand side. We study this level by way of finite-state transducers, a generalization of finite-state automata that produce an output at every step, generating a set of relations on the output tape; FSTs do not recognize a larger set of languages than FSAs and thus represent this level. We used an enumeration of transducers introduced in , where an invariance theorem was also proved, demonstrating that the choice of enumeration is invariant (up to a constant).

[Type-2] Grammars that generate the context-free languages. Grammars of this kind are extensively used in linguistics. The languages generated by context-free grammars are exactly the languages that can be recognized by a non-deterministic pushdown automaton. We denote this level by CFG. We generated production rules for 40 000 grammars according to a sound scheme introduced in .
[Type-1] Grammars that generate the context-sensitive languages. The languages described by these grammars are exactly those that can be recognized by a linear-bounded automaton, a non-deterministic Turing machine whose tape is bounded by a constant times the length of the input, which we will denote by LBA. An AP-based variation is introduced here, which we denote by LBA/AP.
[Type-0] We also explore the consequences of relaxing the halting configuration (state) condition in models of universal computation when it comes to comparing their output distributions.
2.2 Finite-state complexity
Formal language theory and algorithmic complexity had traditionally been disconnected, beyond counting the number of states or the number of transitions of a minimal finite automaton accepting a regular language. The main reason for this lack of connection was that languages are sets of strings, rather than the individual strings that algorithmic complexity measures, and a meaningful definition of the complexity of a language was lacking, as was a definition of finite-state algorithmic complexity. In , a connection was established by extending the notions of Blum static complexity and of encoded function space. Then, in , a version of algorithmic complexity was developed by replacing Turing machines with finite transducers; the induced complexity is called Finite-state complexity (FSA). Despite the fact that the Universality Theorem (true for Turing machines) is false for finite transducers, rather surprisingly the invariance theorem does hold for Finite-state complexity and, in contrast with descriptional complexities (plain and prefix-free), Finite-state complexity is computable.
Finite-state complexity, defined in , is analogous to the core concept of Algorithmic Information Theory (AIT), Kolmogorov-Chaitin complexity, but is based on finite transducers instead of Turing machines. Finite-state complexity is computable and there is no a priori upper bound on the number of states used for minimal descriptions of arbitrary strings.
Consider a transducer with a finite set of states. Its transition function is encoded by a binary string $\sigma$ (see  for details). The transducer encoded by $\sigma$ is called $T_\sigma$, where $\sigma \in S$, the set of all strings of this encoded form.
In  it was shown that the set of all transducers can be enumerated by a regular language, and that there exists a hierarchy of more general computable encodings. For this experiment we fix .
As in traditional AIT, where Turing machines are used to describe binary strings, transducers describe strings in the following way: we say that a pair $(\sigma, p)$, with $\sigma \in S$ and $p \in \{0,1\}^*$, is a description of the string $x$ if and only if $T_\sigma(p) = x$. The size of the description is defined in the following way:

$$||(\sigma, p)|| = |\sigma| + |p|$$
(Definition) The Finite-state complexity of a string $x$ (that we will identify as FSA in the results) with respect to the encoding $S$ is defined by

$$C_S(x) = \min\{|\sigma| + |p| : T_\sigma(p) = x\}$$
An important characteristic of traditional AIT is the invariance theorem, which states that complexity is optimal up to an additive constant and relies on the existence of a universal Turing machine (the additive constant being, in fact, its size). In contrast with AIT, due to the non-existence of a “universal transducer”, Finite-state complexity includes the size of the transducer as part of the encoding length. Nevertheless, the invariance theorem holds true for Finite-state complexity. An interesting consequence of the invariance theorem for Finite-state complexity is the existence of an upper bound for all $x$, given by the length of the string which encodes the identity transducer. Hence Finite-state complexity is computable. If $S$ and $S'$ are encodings, then the corresponding complexities are related by a computable function.
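As an illustration of how a transducer description is run and sized, the following is a minimal sketch. The dictionary-based transition-table format and the function names are our own simplification, not the binary encoding of the cited works.

```python
def run_transducer(delta, p, start=0):
    """Run a toy finite-state transducer.  `delta` maps (state, bit) to
    (next_state, output_string); the transducer reads input p bit by bit
    and concatenates the outputs (outputs may be empty or multi-bit)."""
    state, out = start, ""
    for bit in p:
        state, o = delta[(state, bit)]
        out += o
    return out

def description_size(sigma, p):
    # Size of a description (sigma, p): length of the encoded transition
    # table plus length of the input, mirroring ||(sigma, p)|| = |sigma| + |p|.
    return len(sigma) + len(p)

# A hypothetical one-state transducer that doubles each input bit.
delta = {(0, "0"): (0, "00"), (0, "1"): (0, "11")}
```

The Finite-state complexity of a string is then the minimum of `description_size` over all pairs whose run produces that string.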
An alternative definition of Finite-state complexity based on Algorithmic Probability is as follows:
(Definition) The AP-based Finite-state complexity of a string $x$ (denoted by FSA/AP in the results) with respect to the encoding $S$ is defined by
That is, the number of times that a string is accepted by a transducer (in this case, as reported in the results, for encodings of size 8 to 22).
2.2.1 Building a Finite-state empirical distribution
We now define the construction of an empirical distribution using Finite-state complexity. We introduce our alternative definition of algorithmic probability using transducers.
(Finite-state Algorithmic Probability) Let $S$ be the set of encodings of all transducers as binary strings of the form described above. We then define the algorithmic probability of a string $x$ as follows
For any string $x$, this is the algorithmic probability of $x$, computed over the set of encodings $S$. In the construction of the empirical distribution for Finite-state complexity, we consider the set of strings such that , , and . Hence . Following  we define the empirical distribution function (i.e. the probability distribution) as
(Finite-state Distribution, FSA)
In other words, the procedure considers all strings of a given length and determines which of them encode a transducer. It then runs each encoded transducer and counts the number of times each output string is produced.¹

¹ Since $S$ is in fact regular, we could indeed use an enumeration of $S$, but for this work we analyze all binary strings of a given length.
We note that in the encoding , a string occurring as the output of a transition in  contributes to the size of a description of a string . The decision to consider strings such that  was based on the fact that, in the encoding of the smallest transducer, the one with the transition function,
where  is the empty string, the string  (which occurs as the output of transitions in ) has length  and so contributes to the size of the description of a string .
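The construction of the empirical distribution can be sketched as follows. This is a toy version that enumerates inputs over a couple of hand-written transducers rather than over the regular-language encoding $S$ used in the experiments; all names are illustrative.

```python
from collections import Counter
from itertools import product

def all_bitstrings(max_len):
    # Enumerate all binary strings of length 0..max_len.
    for l in range(max_len + 1):
        for bits in product("01", repeat=l):
            yield "".join(bits)

def empirical_distribution(transducers, max_len):
    """Toy version of the empirical distribution D: feed every bit-string
    of length <= max_len to every transducer and count the outputs;
    normalising the counts gives the empirical probability of each output.
    Each transducer is a dict mapping (state, bit) -> (next_state, output)."""
    counts = Counter()
    for delta in transducers:
        for p in all_bitstrings(max_len):
            state, out = 0, ""
            for bit in p:
                state, o = delta[(state, bit)]
                out += o
            counts[out] += 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

# Two hypothetical one-state transducers: identity and bit-doubling.
identity = {(0, "0"): (0, "0"), (0, "1"): (0, "1")}
doubler  = {(0, "0"): (0, "00"), (0, "1"): (0, "11")}
dist = empirical_distribution([identity, doubler], 2)
```

Strings producible by more transducer/input pairs accumulate higher empirical probability, which is the coding-theorem-like signal the experiments look for.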
2.3 Context-free grammars
In , Wharton describes an algorithm for a general-purpose grammar enumerator which is adaptable to different classes of grammars (e.g. regular, context-free, etc.). We implemented this algorithm in the Wolfram Language with the purpose of enumerating context-free grammars in Chomsky Normal Form over the terminal vocabulary . Before describing the implementation of the algorithm we define the following terminology:
A grammar is a 4-tuple $G = (N, T, P, S)$.
$N$ is the non-terminal vocabulary.
$T$ is the terminal vocabulary.
$P$ is the set of productions.
$S \in N$ is the start symbol.
$V = N \cup T$ is the vocabulary of $G$.
For any grammar $G$, $|N|$ and $|P|$ denote the cardinalities of $N$ and $P$ respectively.
First, we define the structure of a grammar. Let $G$ be any grammar, and suppose we are given the non-terminal vocabulary $N$ with an arbitrary ordering such that the first non-terminal is the start symbol $S$. The grammar has a structure which consists of a list of integers: the $i$-th integer is the number of productions having the $i$-th non-terminal on the left-hand side (according to the ordering of $N$). Hence the cardinality of the set of productions satisfies .
Now, let  be a class of grammars over a terminal vocabulary . By  we denote the grammars in  with complexity . We then enumerate  by increasing complexity. To every complexity class corresponds a set of structure classes determined by  and . A complexity class is therefore enumerated by enumerating each of its structure classes (i.e. every structure that constitutes it). In addition, we need to define an ordered sequence consisting of all possible right-hand sides for the production rules. The sequence is ordered lexicographically (first terminals, then non-terminals) and is defined according to the class of grammars we want to enumerate. For example, if we are interested in enumerating the class of Chomsky Normal Form grammars over the terminal vocabulary  and the non-terminal vocabulary , we set .
Given a complexity , the algorithm described below (which we implemented in the Wolfram Language running on Mathematica) enumerates all the grammars according to  in a structure class .
The complexity measure is provided by the following pairing function in and :
In other words, given , we apply the inverse of the above function in order to get the values of and . This function is implemented by the function pairingInverse.
The set of non terminals is generated by the function generateSetN.
The ordered sequence is generated using the set of non terminals by the function generateSetR.
The different structure classes that correspond to complexity are generated by the function generateStructureClasses.
All the possible grammars with the structure classes defined at previous step are then generated. Each grammar has an associated matrix . This is performed by function generateStructureMatricesA[, Length].
The sequence is used to generate the rules of the grammars by the function generateGrammars[, ].
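Under the assumption that the pairing in question is the standard Cantor pairing (the text does not fix the exact function, so this is a hypothetical stand-in for pairingInverse), the inversion step can be sketched as:

```python
def cantor_pair(a, b):
    # Cantor pairing: a bijection from pairs of naturals to naturals.
    return (a + b) * (a + b + 1) // 2 + b

def pairing_inverse(z):
    """Invert the Cantor pairing function: recover (a, b) from z.  A
    hypothetical stand-in for pairingInverse, which maps a single
    complexity index back to its two grammar parameters."""
    w = int(((8 * z + 1) ** 0.5 - 1) // 2)  # largest w with w(w+1)/2 <= z
    t = w * (w + 1) // 2
    b = z - t
    a = w - b
    return a, b
```

The round trip `pairing_inverse(cantor_pair(a, b)) == (a, b)` holds for all naturals, so a single complexity value determines both parameters.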
2.3.2 The CYK algorithm
A procedure to decide whether a bit string is generated by a grammar was implemented according to the Cocke–Younger–Kasami (CYK) algorithm. CYK is an efficient worst-case parsing algorithm that operates on grammars in Chomsky Normal Form (CNF) in $O(n^3 \cdot |G|)$, where $n$ is the length of the parsed string and $|G|$ is the size of the CNF grammar $G$. The algorithm considers every possible substring of the input string and decides whether the string belongs to the language generated by the grammar. The implementation was adapted from .
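A minimal membership test in the spirit of the adapted procedure can be sketched as follows; the grammar format and names are illustrative, not the Wolfram Language implementation used in the experiments.

```python
def cyk(string, grammar, start="S"):
    """CYK membership test for a grammar in Chomsky Normal Form.
    `grammar` maps each non-terminal to a list of productions, each a
    1-tuple (terminal,) or a 2-tuple (B, C) of non-terminals."""
    n = len(string)
    if n == 0:
        return False
    # table[i][l] = set of non-terminals deriving string[i : i + l + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(string):
        for A, prods in grammar.items():
            if (ch,) in prods:
                table[i][0].add(A)
    for l in range(1, n):              # substring length minus 1
        for i in range(n - l):         # start position
            for k in range(l):         # split point
                for A, prods in grammar.items():
                    for prod in prods:
                        if len(prod) == 2:
                            B, C = prod
                            if B in table[i][k] and C in table[i + k + 1][l - k - 1]:
                                table[i][l].add(A)
    return start in table[0][n - 1]

# Hypothetical CNF grammar for { 0^n 1^n : n >= 1 }:
# S -> XY | XA, A -> SY, X -> 0, Y -> 1.
g = {"S": [("X", "Y"), ("X", "A")], "A": [("S", "Y")],
     "X": [("0",)], "Y": [("1",)]}
```

The triple loop over length, start position and split point gives the $O(n^3)$ factor, with the grammar size entering through the innermost production scan.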
2.4 CFG Algorithmic Probability
We can now define the Algorithmic Probability of a string according to CFG as follows:
(Context-free Algorithmic Probability, CFG)
And its respective distribution:
where, as defined in 2.3.1,  is the language generated by the grammar and  denotes the cardinality of the sample set of the grammars considered. For the results reported here, .
with a grammar of complexity at most according to a structure class .
2.5 Linear-bounded complexity
In  it is shown that the time-bounded Kolmogorov distribution is universal (convergence), and the question of an analogue of the algorithmic Coding theorem is described as an open problem, one likely to be solved by exploiting the universality result. On the other hand, in [2, 13] it has been shown that time-bounded algorithmic complexity (being computable) is a Solovay function. Solovay functions are upper bounds on algorithmic complexity (in its prefix-free version) that agree with it, up to an additive constant, on infinitely many strings.
In [5, 21] we described a numerical approach to the problem of approximating the Kolmogorov complexity of short strings. This approach exhaustively executes all deterministic 2-symbol Turing machines, constructs an output frequency distribution, and then applies the Coding theorem to approximate the algorithmic complexity of the strings produced.
For this experiment we follow the same approach, but consider the less powerful model of computation of linear-bounded automata (LBA). An LBA is essentially a single-tape Turing machine which never leaves the cells on which the input was placed . It is well known that the class of languages accepted by LBAs is in fact that of the context-sensitive languages .
2.6 Time complexity
This is the class of Turing machines that produce an output in polynomial time with respect to the size of their input. It is easy to see that this class is contained in the class defined by linear-bounded automata (LBA): if the number of transitions is bounded by a linear function, so is the number of cells the machine can visit; but it is important to note that LBAs are not time-restricted and can use non-deterministic transitions. Moreover, given that Turing machines can decide context-free grammars in polynomial time (e.g. with the CYK algorithm), this class is higher in the hierarchy than the Type-2 languages.
Within the context of this article, we will represent this class by the set of Turing machines with 4 states and 2 symbols with no input whose execution time is upper-bounded by a fixed constant. We cap execution times at 27, 54, 81 and 107 steps, for a total of 44 079 842 304 Turing machines, where 107 is the Busy Beaver value for this set.
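Running a machine under a fixed step cap can be sketched as follows. The rule format and halting convention are simplified stand-ins for the CTM setup (here a designated state −1 plays the role of the halting state, and the output is the visited portion of the tape).

```python
def run_tm(rules, max_steps):
    """Run a toy 1-tape, 2-symbol Turing machine from a blank tape and
    return the tape contents on halting, or None if the step cap is hit.
    `rules` maps (state, symbol) -> (write, move, next_state); reaching
    state -1 halts the machine."""
    tape, head, state = {}, 0, 0
    for _ in range(max_steps):
        write, move, state = rules[(state, tape.get(head, 0))]
        tape[head] = write
        head += move
        if state == -1:  # halting state reached: output = visited tape
            lo, hi = min(tape), max(tape)
            return "".join(str(tape.get(i, 0)) for i in range(lo, hi + 1))
    return None  # did not halt within the step budget

# A hypothetical 2-state machine that writes "11" and halts.
writer = {(0, 0): (1, 1, 1), (1, 0): (1, 1, -1)}
# A machine that moves right forever: it never halts under any cap.
runner = {(0, 0): (0, 1, 0)}
```

Enumerating all rule tables of a given size and collecting the outputs of the halting runs yields the capped empirical distributions compared in the experiments.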
2.7 The Chomsky hierarchy bounding execution time
The definition of bounded algorithmic complexity is a variation of the unbounded version as follows:
(Linear-bounded Algorithmic Complexity, LBA)
Being bounded to tapes of linear size, the Turing machines that decide context-sensitive grammars (Type-1) can be simulated in exponential time by deterministic Turing machines.
Exactly where each class of the Chomsky hierarchy sits with respect to the time-based computational complexity classification is related to seminal open problems. For instance, an equality between the languages recognized by linear-bounded automata and those recognized in exponential time would settle the question. Nevertheless, varying the allowed computation time for the CTM algorithm allows us to capture approximations to the descriptive complexity of an object with lower computing resources, in a similar way as does considering each level of the Chomsky hierarchy.
2.8 Non-halting models
We also considered models with no halting configuration, such as cellular automata (nonH-CA) and Turing machines (nonH-TM) with no halting state, as defined in , in order to assess whether or not they converge to the Universal Distribution defined over machines with a halting condition. For cellular automata, we exhaustively ran all 256 Elementary Cellular Automata  (i.e. taking the closest neighbours and the centre cell into consideration) and all 65 536 so-called General Cellular Automata  (that is, with two neighbours to one side, one to the other, and the centre cell). For Turing machines, we ran all 4096 (2,2) Turing machines with no halting state, and a sample of 65 536 (the same number as for CA) Turing machines in (3,2), also with no halting state.
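A minimal sketch of the non-halting cellular-automaton runs follows, using the standard Wolfram rule encoding; since there is no halting condition, a fixed number of steps plays the role of the cut-off, as in the nonH-CA experiments.

```python
def eca_step(row, rule):
    """One step of an Elementary Cellular Automaton on a cyclic row.
    `rule` is the Wolfram rule number (0-255); each new cell is bit
    (4*left + 2*centre + right) of the rule number."""
    n = len(row)
    return tuple(
        (rule >> (4 * row[(i - 1) % n] + 2 * row[i] + row[(i + 1) % n])) & 1
        for i in range(n)
    )

def eca_run(rule, width, steps):
    # Run from a single black cell (a standard initial condition) and
    # return every row produced; the evolution is simply cut off after
    # `steps`, since these models have no halting state.
    row = tuple(1 if i == width // 2 else 0 for i in range(width))
    rows = [row]
    for _ in range(steps):
        row = eca_step(row, rule)
        rows.append(row)
    return rows
```

The rows (or fixed-width windows of them) are then counted as output strings to build the nonH-CA empirical distributions.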
2.9 Consolidating the empirical distributions
In order to compare the distributions at each level of the Chomsky hierarchy, we need to consolidate cases of bias imposed by arbitrary choices in the chosen model (e.g. starting from a tape filled with 0s or with 1s). This is because, for example, the string 0000 should occur exactly the same number of times as 1111 does, given that 0000 and 1111 have exactly the same algorithmic complexity. If $s$ is the string and $f(s)$ its frequency of production, we thus consolidate the algorithmic probability of $s$ as follows:
where $r(s)$ is the reversal of $s$ (e.g. 0001 becomes 1000) and $n(s)$ is the negation of $s$ (e.g. 0001 becomes 1110), for all empirical distributions for FSA, CFG, LBA and TM. It is worth noticing that reversal and negation increase the algorithmic complexity of $s$ by at most a very small constant, and thus there is no reason to expect either the model or the strings so produced to have different algorithmic complexity. Greater detail on the counting method is given in .
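This consolidation step can be sketched as follows, assuming a frequency table keyed by binary strings (the function name is ours):

```python
def consolidate(freq):
    """Consolidate an empirical distribution: the frequency assigned to
    s is pooled with that of its reversal and its bit-wise negation,
    since none of these operations changes the algorithmic complexity
    of s by more than a small constant."""
    flip = str.maketrans("01", "10")
    pooled = {}
    for s in freq:
        variants = {s, s[::-1], s.translate(flip), s[::-1].translate(flip)}
        pooled[s] = sum(freq.get(v, 0) for v in variants)
    total = sum(pooled.values())
    return {s: c / total for s, c in pooled.items()}
```

After consolidation, 0001 and 1000 (and 1110, 0111, had they been produced) all receive the same pooled probability, removing the arbitrary 0/1 and left/right asymmetries of any single model.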
Table 1 reports the number of strings produced at each model.
|Model||Strings produced|
|LBA 107 = TM(4,2)||1302|
3.1 Finite-state complexity
The experiment consists of a thorough analysis of all strings that satisfy . If a string satisfies this condition (for some ), then we compute  and generate a set of output strings, from which a frequency distribution is constructed. Separately, we compute the Finite-state complexity for strings such that  (this is an arbitrary decision).
3.1.1 Produced distributions for
The results given in Table 2 indicate how many strings satisfy (such that encodes the transition table of some transducer and is an input for it) per string length.
Running FSAs is very fast: there is no halting problem and all of them stop very quickly. However, while FSA and CFG preserve some of the ranking of the more powerful computational models and accelerate the appearance of ‘jumpers’ (long strings of low complexity), these weak models of computation do not generate the highest-algorithmic-complexity strings found in the tail of the distributions of more powerful models, as shown in Fig. 4.
For example, there is only one binary string of length 8 that encodes a transducer out of strings in , which is the transducer with the transition function (4) (we refer to it as the smallest transducer).
In the case of , we found that out of strings only two of them encode the smallest transducer with “0” and “1” as input. Again, the only string produced by this distribution is the empty string .
This is the first distribution in which one of the strings encodes a transducer with two states. The Finite-state complexity of the strings it produces is shown in Table 16 and Table 17 in the Supplementary Material.
The rest of the tables are reported in the Supplementary Material.
3.2 Computing Finite-state complexity
We performed another experiment in order to further analyze the characteristics of Finite-state complexity of all strings of length . We summarize the results we obtained for computing Finite-state complexity for each string of length in Table 5 whereas Table 6 shows the strings that encode transducers such that .
3.3 Context-free grammar distribution
We created production rules for 298 233 grammars with up to 26 non-terminal symbols and used the first 40 000 of them on a set of 155 strings whose frequencies we also had at all other levels (FSA, LBA and TM). Table 7 shows the top 20 produced strings.
3.4 Linear-bounded automata distribution
Table 8 shows the different values that we considered for the experiments and the number of strings produced by all LBA’s with states .
|States||Tape space||Steps||Initial position||Strings produced|
Because of limitations of computational power, we randomly generated LBAs with states . According to the assumptions explained above, the tape space should be 16. However, we took the tape space to be 17, since that allowed us to place the initial head position right in the middle of the tape. Table 9 shows the experiments we performed.
As the results demonstrate, by varying the allowed execution time for the space of Turing machines we can approximate the CTM distribution corresponding to each level of the Chomsky hierarchy. For instance, regular languages (Type-3 grammars) can be decided in linear time, given that each transition and state in a finite automaton can be encoded by a corresponding state and transition in a Turing machine. Context-free grammars (Type-2) can be decided in polynomial time with parsing algorithms such as CYK.
|States||Tape space||Steps||Initial position||Random LBA’s||produced|
3.5 Emergence of the Universal Distribution
3.5.1 Time-bounded Emergence
Fig. 5 shows how LBA asymptotically approximate the Universal Distribution.
3.5.2 Rate of convergence of the distributions
One opportunity offered by this analysis is to assess the way in which other methods rank strings by (statistical) randomness, such as Shannon entropy, and the performance of other means of approximating algorithmic complexity, such as lossless compression algorithms, in particular one of the most popular, based on LZW (Compress). We can then compare these two methods against estimations of the Universal Distribution produced by TM(4,2). The results (see Fig. 6) for both entropy and compression conform to the theoretical expectation. Entropy correlates best at the first level of the Chomsky hierarchy, that of FSAs, stressing that the algorithmic discriminatory power of entropy to tell randomness apart from pseudo-randomness is limited to statistical regularities of the kind that regular languages capture. Lossless compression, at least as assessed by one of the methods behind the most popular lossless compression formats, outperformed Shannon entropy, though not by much, and was at best most correlated to the output distribution generated by CFG. This comes as no surprise, given that popular implementations of lossless compression are variations of Shannon entropy generalized to blocks (variable-width windows) that capture repetitions, often followed by a remapping that assigns shorter codes to values with higher probabilities (dictionary encoding, Huffman coding), effectively a basic grammar based on a simple rewriting-rule system. We also found that, while non-halting models approximate the Universal Distribution, they start diverging from TM and remain correlated to LBAs with lower runtimes despite an increasing number of states. This may be expected from the over-representation, in non-halting models, of strings that machines with a halting configuration would not count as produced (a string being defined as produced only upon halting).
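The two statistical baselines above can be sketched as follows; zlib is used here as an LZ-style stand-in for the LZW-based Compress used in the experiments, and the function names are ours.

```python
import math
import zlib

def shannon_entropy(s):
    # Per-symbol Shannon entropy of the character frequencies in s.
    n = len(s)
    counts = {c: s.count(c) for c in set(s)}
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def compressed_size(s):
    # Length in bytes of the zlib-compressed string: an LZ-style proxy
    # for algorithmic complexity (a stand-in for Compress/LZW).
    return len(zlib.compress(s.encode()))

# Entropy cannot separate a statistically balanced but trivially simple
# string (such as 0101...) from a random-looking one of the same
# 0/1 frequencies; block-based compression does somewhat better.
periodic = "01" * 32
```

On `periodic`, entropy is maximal (1 bit per symbol) even though the string is highly compressible, illustrating why entropy correlates only with the lowest, FSA-level regularities.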
4 Some open questions
4.1 Tighter bounds and suprema values
We have provided upper and lower bounds for each model of computation, but the current intervals appear to overlap for some measures of correlation. One open question concerns the exact boundaries, in particular how close to the Universal Distribution the supremum of each model of computation can take us; in other words, finding tighter bounds for the intervals at each level.
4.2 Computable error corrections
There are some very interesting open questions to explore further in the future, for example, whether computable corrections can be applied to sub-universal distributions (e.g. those calculated by context-free grammars) in order to correct for trivial, and also computable, biases such as string length. Indeed, while CFGs produced an interesting distribution closer to that of LBAs and better than that of FSAs, there is, for example, an over-representation of trivially long strings which can easily be corrected. The suspicion is that, while one can apply such corrections and increase the speed of convergence to TMs, the interesting cases are non-computable. A theoretical framework and a numerical analysis would be interesting to develop.
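A minimal sketch of such a computable length correction (the 2^-(l+1) weighting per length class is our own illustrative choice of computable prior, not a prescription from the text):

```python
from collections import defaultdict

def correct_length_bias(freq):
    """Renormalize raw output frequencies so that each length class l
    carries weight 2^-(l+1), removing over-representation of long
    strings while preserving the ranking within each length class."""
    by_len = defaultdict(float)
    for s, f in freq.items():
        by_len[len(s)] += f
    corrected = {s: (f / by_len[len(s)]) * 2.0 ** -(len(s) + 1)
                 for s, f in freq.items()}
    total = sum(corrected.values())
    return {s: p / total for s, p in corrected.items()}

# Hypothetical raw frequencies with long strings over-represented:
raw = {"0": 10, "1": 6, "0000": 50, "1111": 50}
corrected = correct_length_bias(raw)
```

After the correction, the short strings regain the larger share of probability mass that a length-penalizing semi-measure would assign them, while the relative order of strings of equal length is untouched.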
4.3 Sensitivity to choice of enumeration
This is equivalent to the choice of programming language or of reference universal Turing machine. Another interesting question is that of the stability of the chosen enumerations for the different models of computation, both at the same and at different levels of computational power. A striking outcome of the results reported here is that not only does an increase in computing power yield better approximations to the distribution produced by Turing machines, but completely different models of computation, differing not only in language description but also in computational power, are rather independent of the enumeration. While the enumeration we have followed for each computational model is not arbitrary, as it follows a length criterion of increasing program size, it is clear that one can devise enumerations that produce completely different behaviour. We have pointed out this stability before in experiments with Turing machines, cellular automata and Post tag systems, and noted how these results suggest a sort of 'natural behaviour', defined as behaviour that is not artificially introduced with the purpose of producing a different-looking initial distribution (before convergence, per the invariance theorem, for systems with that property). In this sense, we have proposed a measure of 'algorithmicity' in the sense of the Universal Distribution, quantifying how close to or removed from other approximations a method producing a distribution is, in particular from that of one of the most standard Turing machine models, the one used for the Busy Beaver, which we have shown is not a special case, since several other variations of this model and completely different models of computation produce similar output distributions [41, 42, 4], including the results reported in this paper.
However, another open question is to enumerate systems in different ways and numerically quantify how many of them converge or diverge, how much and for how long the divergent ones diverge and under what conditions, and whether the convergent ones dominate.
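The rank correlations used for such comparisons can be computed directly; below is a self-contained sketch of the Spearman coefficient over the strings two enumerations have in common (the toy distributions are hypothetical):

```python
def ranks(xs):
    """1-based average ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(d1, d2):
    """Spearman rank correlation over the strings both distributions
    produce (assumes at least two distinct frequency values each)."""
    common = sorted(set(d1) & set(d2))
    x, y = ranks([d1[s] for s in common]), ranks([d2[s] for s in common])
    n = len(common)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

# Hypothetical output frequencies from two different enumerations:
enum_a = {"0": 4, "1": 3, "00": 2, "11": 1}
enum_b = {"0": 5, "1": 4, "00": 2, "11": 1, "010": 1}
rho = spearman(enum_a, enum_b)
```

Restricting to the common support makes the comparison meaningful even when one enumeration produces strings the other never outputs.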
4.4 Missing strings from non-halting models
We have seen that for halting models of computation, decreasing the computational power has the effect of missing the most algorithmically random strings of the next model up in computational power. As we have also seen, non-halting models appear to converge to the lower-runtime distributions of LBAs; even though these are highly correlated with TMs, they do not appear to approach TMs but remain stable, producing output distributions similar to those of LBAs. An interesting question to explore is what kinds of strings are missed by non-halting machines. As opposed to the strings missed by halting machines below the Turing-universal limit, Turing-universal models of computation without a halting configuration may miss other kinds of strings. Do they miss more or fewer random or simple strings as compared to halting models?
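Which strings a given model fails to produce can be read off directly from its empirical distribution; a minimal sketch (the toy distribution is hypothetical):

```python
from itertools import product

def missing_strings(distribution, n):
    """Binary strings of length n absent from a model's empirical output
    distribution; at sub-universal levels these tend to be the most
    algorithmically random strings of that length."""
    produced = set(distribution)
    return ["".join(bits) for bits in product("01", repeat=n)
            if "".join(bits) not in produced]

# Hypothetical distribution that only ever outputs constant strings:
toy = {"00": 0.5, "11": 0.5}
```

Enumerating the complement of the support in this way is what allows the question above, whether halting and non-halting models miss different kinds of strings, to be investigated numerically.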
Different sub-universal systems may produce different empirical distributions, not only in practice but also in principle, i.e. asymptotically they may diverge. However, we have here provided the means to make meaningful comparisons, especially against an empirical distribution that has been found to be stable and apparently convergent. It is interesting to explore and seek Coding-theorem-like behaviour in sub-universal systems, in order to better understand the landscape of algorithmic probability and complexity for all types of computational systems, a task to which we have here contributed.
The results reported here show that, indeed, the closer a system is to Turing universality, the more closely its output distribution resembles the empirical distribution of universal systems, and that finite approximations of algorithmic complexity from finite sub-universal systems constitute an interesting approach, departing from the use of limited computable measures, that can approximate the power of universal measures of complexity.
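Such finite approximations turn an empirical output distribution into numerical complexity estimates via Coding-theorem-like behaviour, K(s) ≈ -log2 D(s); a minimal sketch over a hypothetical distribution:

```python
import math

def ctm_complexity(distribution):
    """Coding-theorem-style estimates K(s) = -log2 D(s) (in bits) from an
    empirical output distribution mapping strings to probabilities."""
    return {s: -math.log2(p) for s, p in distribution.items() if p > 0}

# Hypothetical empirical distribution: frequently produced (simple)
# strings receive low complexity estimates.
toy = {"0": 0.5, "01": 0.25, "010": 0.125, "0110100110": 0.125}
estimates = ctm_complexity(toy)
```

Strings never produced receive no estimate at all, which is precisely the "missing strings" phenomenon discussed in Section 4.4.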
The results also show improvements over current major tools for approximating algorithmic complexity, such as lossless compression algorithms. To our knowledge, it was previously not possible to quantify or even compare lossless compression algorithms, as there was no standard numerical alternative for approximating algorithmic complexity. The construction of empirical distributions based on Algorithmic Probability does provide the means, and constitutes an approach to evaluating performance in what we have named an algorithmicity test.
At least for our implementations (which may or may not reflect the best algorithms, in terms of time complexity, for emulating these computational models and thus producing their output distributions), it is not particularly faster to produce distributions from weaker models when the interest is in producing strings of high algorithmic complexity for evaluation, but it is faster in cases where only finer-grained values for low-complexity strings are needed. Compared to entropic and lossless compression approximations, producing partial distributions from finite approximations of Algorithmic Probability, even over weak models of computation, constitutes a major improvement for strings that are otherwise assigned a greater randomness content by traditional methods such as Shannon entropy and equivalent statistical formulations.
-  L. Antunes and L. Fortnow, Time-Bounded Universal Distributions, Electronic Colloquium on Computational Complexity, Report No. 144, 2005.
-  L. Bienvenu and R. Downey. Kolmogorov complexity and Solovay functions. In STACS, volume 3 of LIPIcs, pages 147–158. Schloss Dagstuhl- Leibniz-Zentrum fuer Informatik, 2009.
-  C. Campeanu, K. Culik II, K. Salomaa, S. Yu, State complexity of basic operations on finite languages. In: O. Boldt, H. Jürgensen (eds.) WIA 1999, LNCS, vol. 2214, pp. 60–70. Springer, Heidelberg, 2001.
-  J.-P. Delahaye, H. Zenil, Towards a stable definition of Kolmogorov-Chaitin complexity, arXiv:0804.3459 [cs.IT], 2008.
-  J.-P. Delahaye and H. Zenil, Numerical Evaluation of the Complexity of Short Strings: A Glance Into the Innermost Structure of Algorithmic Randomness, Applied Mathematics and Computation 219, pp. 63–77, 2012.
-  K. Dingle, S. Schaper, and A.A. Louis, The structure of the genotype-phenotype map strongly constrains the evolution of non-coding RNA, Interface Focus 5: 20150053, 2015.
-  N. Gauvrit, F. Soler-Toscano, H. Zenil, Natural Scene Statistics Mediate the Perception of Image Complexity, Visual Cognition, vol. 22:8, pp. 1084–1091, 2014.
-  T. Rado, “On non-computable functions” Bell System Technical Journal 41:3, 877–884, 1962.
-  N. Gauvrit, H. Zenil, F. Soler-Toscano, J.-P. Delahaye, P. Brugger, Human Behavioral Complexity Peaks at Age 25, PLoS Comput Biol 13(4): e1005408, 2017.
-  S.F. Greenbury, I.G. Johnston, A.A. Louis, S.E. Ahnert, A tractable genotype-phenotype map for the self-assembly of protein quaternary structure, J. R. Soc. Interface 11, 20140249, 2014.
-  S. Hernández-Orozco, H. Zenil, N.A. Kiani, Algorithmically probable mutations reproduce aspects of evolution such as convergence rate, genetic memory, modularity, diversity explosions, and mass extinction, arXiv:1709.00268 [cs.NE]
-  V. Kempe, N. Gauvrit, D. Forsyth, Structure emerges faster during cultural transmission in children than in adults, Cognition, 136, 247–254, 2014.
-  R. Hölzl, T. Kräling, W. Merkle, Time-bounded Kolmogorov complexity and Solovay functions, Theory Comput. Syst. 52:1, 80–94, 2013.
-  M. Hutter, A Theory of Universal Artificial Intelligence based on Algorithmic Complexity, Springer, 2000.
-  L.A Levin, Universal sequential search problems. Problems of Information Transmission, 9:265–266, 1973.
-  L.A. Levin. Laws of information conservation (non-growth) and aspects of the foundation of probability theory, Problems Information Transmission, 10(3):206–210, 1974.
-  L.G. Kraft, A device for quantizing, grouping, and coding amplitude modulated pulses, Cambridge, MA: MS Thesis, Electrical Engineering Department, Massachusetts Institute of Technology, 1949.
-  C.S. Calude, K. Salomaa, T.K. Roblot, Finite-state complexity, Theoretical Computer Science, 412(41), pp. 5668–5677, 2011.
-  S. Schaper and A.A. Louis, The arrival of the frequent: how bias in genotype-phenotype maps can steer populations to local optima PLoS ONE 9(2): e86635, 2014.
-  B.R. Steunebrink, J. Schmidhuber, Towards an Actual Gödel Machine Implementation. In P. Wang, B. Goertzel, eds., Theoretical Foundations of Artificial General Intelligence, Springer, 2012.
-  F. Soler-Toscano, H. Zenil, J.-P. Delahaye, N. Gauvrit, Calculating Kolmogorov complexity from the output frequency distributions of small Turing machines, PLoS ONE, 9(5), e96223, 2014.
-  J.E. Hopcroft and J.D. Ullman, Formal languages and their relation to automata, Addison-Wesley, 1969.
-  W. Kirchherr and M. Li and P. Vitányi, The Miraculous Universal Distribution, Mathematical Intelligencer, 19, 7–15, 1997.
-  B.Y. Peled, V.K. Mishra, A.Y. Carmi, Computing by nowhere increasing complexity, arXiv:1710.01654 [cs.IT]
-  J. Rangel-Mondragon, Recognition and Parsing of Context-Free, http://library.wolfram.com/infocenter/MathSource/3128/ Accessed on Aug 15, 2017.
-  F. Soler-Toscano, H. Zenil, A Computable Measure of Algorithmic Probability by Finite Approximations with an Application to Integer Sequences, Complexity (accepted).
-  J. Schmidhuber, Optimal Ordered Problem Solver, Machine Learning, 54, 211–254, 2004.
-  J. Schmidhuber, V. Zhumatiy, M. Gagliolo, Bias-Optimal Incremental Learning of Control Sequences for Virtual Robots. In Groen, et al. (eds) Proceedings of the 8th conference on Intelligent Autonomous Systems, IAS-8, Amsterdam, The Netherlands, pp. 658–665, 2004.
-  F. Soler-Toscano, H. Zenil, J.-P. Delahaye, and N. Gauvrit, Small Turing Machines with Halting State: Enumeration and Running on a Blank Tape. http://demonstrations.wolfram.com/SmallTuringMachinesWithHaltingStateEnumerationAndRunningOnAB/. Wolfram Demonstrations Project. Published: January 3, 2013.
-  R.J. Solomonoff. A formal theory of inductive inference: Parts 1 and 2. Information and Control, 7:1–22 and 224–254, 1964.
-  M. Li, and P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, 3rd ed, Springer, N.Y., 2008.
-  R.J. Solomonoff, Complexity–Based Induction Systems: Comparisons and Convergence Theorems, IEEE Trans. on Information Theory, vol IT–24, No. 4, pp. 422–432, 1978.
-  R.J. Solomonoff, The Application of Algorithmic Probability to Problems in Artificial Intelligence, in L.N. Kanal and J.F. Lemmer (eds.), Uncertainty in Artificial Intelligence, pp. 473–491, Elsevier, 1986.
-  R.J. Solomonoff, A System for Incremental Learning Based on Algorithmic Probability, Proceedings of the Sixth Israeli Conference on Artificial Intelligence, Computer Vision and Pattern Recognition, pp. 515–527, Dec. 1989.
-  R.M. Wharton, Grammar enumeration and inference, Information and Control, 33(3), 253–272, 1977.
-  M. Wiering and J. Schmidhuber. Solving, POMDPs using Levin search and EIRA, In Proceedings of the International Conference on Machine Learning (ICML), pages 534–542, 1996.
-  S. Wolfram, A New Kind of Science, Wolfram Media, Champaign, IL., 2002.
-  H. Zenil and J-P. Delahaye, On the Algorithmic Nature of the World, In G. Dodig-Crnkovic and M. Burgin (eds), Information and Computation, World Scientific Publishing Company, 2010.
-  H. Zenil, F. Soler-Toscano, K. Dingle and A. Louis, Correlation of Automorphism Group Size and Topological Properties with Program-size Complexity Evaluations of Graphs and Complex Networks, Physica A: Statistical Mechanics and its Applications, vol. 404, pp. 341–358, 2014.
-  H. Zenil, N.A. Kiani and J. Tegnér, Methods of Information Theory and Algorithmic Complexity for Network Biology, Seminars in Cell and Developmental Biology, vol. 51, pp. 32-43, 2016.
-  H. Zenil, F. Soler-Toscano, J.-P. Delahaye and N. Gauvrit, Two-Dimensional Kolmogorov Complexity and Validation of the Coding Theorem Method by Compressibility, PeerJ Computer Science, 1:e23, 2015.
-  F. Soler-Toscano, H. Zenil, J.-P. Delahaye and N. Gauvrit , Correspondence and Independence of Numerical Evaluations of Algorithmic Information Measures, Computability, vol. 2, no. 2, pp 125-140, 2013.
.0.1 Empirical distributions for FSA
For this case there are six strings that encode a transducer; in fact, two of them are different from the smallest transducer. However, as in the previous cases, the only string produced is the empty string.
The next distribution consists of 12 strings with inputs of lengths one and three, but the output of these transducers is still the empty string.
The following one contains 34 transducers with inputs whose lengths range from two to four. The output distribution still contains only the empty string.
The next is a more interesting distribution. It consists of 68 strings whose inputs range over lengths 1, 3 and 5. Table 10 shows the probability distribution of the strings produced, and the finite-state complexity of the strings that comprise it is summarized in Table 11.
The next distribution is richer than the previous ones, since it contains 156 strings that encode different transducers. Table 12 shows the different strings produced by this distribution.
We note the following facts:
The length of the longest string produced is two.
The empty string remains the one with the highest probability.
The Finite-state complexity of the strings produced ranges from 7 to 10 (see Table 13).
Two strings of length 2 are produced, namely "00" and "11".
The next distribution shows an even more diverse set of strings produced (see Table 18). We note the following interesting facts:
The longest string produced is of length 5.
For the first time, a distribution produces all strings of length 2.
.0.2 Further empirical distributions
Here are the strings that comprise each one of these distributions.
.0.3 Code in Python for finite-state complexity
The program distributionTransducers.py analyzes all strings of a given length to determine whether each one is a valid encoding of a transducer-input pair and, if so, to compute the corresponding output. This program generates a set of output strings (result-experiment-distribution.csv) from which we can construct an output frequency distribution.
Example of execution:
python distributionTransducers.py 8 10 analyzes all strings of length 8 up to length 10.
python distributionTransducers.py 8 8 analyzes all strings of length 8.
The file result-experiment-distribution.csv contains the following columns:
string, corresponds to the candidate encoding strings discussed above.
valid-encoding, takes value 1 if the string is a valid encoding and 0 otherwise.
sigma, the part of the encoding that describes the transducer.
string-p, the part of the encoding that gives the input to the transducer.
num-states, number of states of the encoded transducer.
output, the string produced by running the transducer on its input.
output-complexity, finite-state complexity of the output string.
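From these columns the empirical output frequency distribution can be rebuilt; a minimal sketch, assuming a plain comma-separated file whose header row names exactly the columns listed above:

```python
import csv
from collections import Counter

def output_distribution(path="result-experiment-distribution.csv"):
    """Tally the `output` column over rows with valid-encoding = 1 and
    normalize the counts into an empirical probability distribution."""
    counts = Counter()
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            if row["valid-encoding"] == "1":
                counts[row["output"]] += 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}
```

Filtering on valid-encoding ensures that strings which do not encode any transducer-input pair contribute nothing to the distribution.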
The program computeComplexityStrings.py computes the finite-state complexity of all strings up to a given length (this is the implementation of the algorithm described in the cited work on finite-state complexity). This program generates the file result-complexity.csv, which contains the following columns:
x, the string whose finite-state complexity is being calculated.
complexity, the finite-state complexity of the string.
sigma, the transducer part of a minimal encoding of the string.
string-p, the input part of a minimal encoding of the string.