Towards a stable definition of Kolmogorov-Chaitin complexity

# Towards a stable definition of Kolmogorov-Chaitin complexity

## Abstract

Although information content is invariant up to an additive constant, the range of possible additive constants applicable to programming languages is so large that in practice it plays a major role in the actual evaluation of $K(s)$, the Kolmogorov-Chaitin complexity of a string $s$. Some attempts have been made to arrive at a framework stable enough for a concrete definition of $K$, independent of any constant under a programming language, by appealing to the naturalness of the language in question. The aim of this paper is to present an approach to overcome the problem by looking at a set of models of computation converging in output probability distribution such that naturalness can be inferred, thereby providing a framework for a stable definition of $K$ under the set of convergent models of computation.


XXI (2008)

Keywords: algorithmic information theory, program-size complexity.

## 1 Introduction

We will use the term model of computation to refer both to a Turing-complete programming language and to a specific device such as a universal Turing machine.

The term natural for a Turing machine or a programming language has been used within several contexts and with a wide range of meanings. Many of these meanings are related to the expressive semantic framework of a model of computation. Others refer to how well a model fits with an algorithm implementation. Previous attempts have been made to arrive at a model of computation stable enough to define the Kolmogorov-Chaitin complexity of a string independently of the choice of programming language. These attempts have used, for instance, lambda calculus and combinatory logic[9, 13], appealing to their naturalness. We provide further tools for determining whether approaches such as these are natural in the sense of producing the same relative Kolmogorov-Chaitin measures. Our approach is an attempt to make precise such appeals to the term natural in relation to the Kolmogorov-Chaitin complexity, and to provide a framework for a stable definition of $K$ that is sufficiently independent of additive constants.

Definition The Kolmogorov-Chaitin complexity $K_U(s)$ of a string $s$ with respect to a universal Turing machine $U$ is defined as the binary length of the shortest program $p$ that produces as output the string $s$:

$$K_U(s) = \min\{|p| : U(p) = s\}$$

A major drawback of $K$ is that it is uncomputable[1] because of the undecidability of the halting problem. Hence the only way to approach $K$ is by compressibility methods. A major criticism brought forward against $K$ (for example in [7]) is its high dependence on the choice of programming language.
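The compressibility approach mentioned above can be illustrated with a minimal sketch. It assumes a general-purpose compressor (here `zlib` from the Python standard library) as a stand-in: the compressed length only gives an upper-bound proxy for $K$, up to the size of the decompressor, and is not $K$ itself. The function name `compressed_length_bits` is introduced here for illustration only.

```python
import os
import zlib


def compressed_length_bits(s: bytes) -> int:
    """Length in bits of the zlib-compressed form of s.

    This is only an upper-bound proxy for K(s), not K(s) itself.
    """
    return 8 * len(zlib.compress(s, 9))


regular = b"01" * 500            # highly regular 1000-byte string
random_like = os.urandom(1000)   # 1000 bytes from the OS entropy source

print(compressed_length_bits(regular))
print(compressed_length_bits(random_like))
```

The regular string compresses to a small fraction of its length, while the random-looking one does not compress at all; a different compressor would give different absolute values, which is the dependence on the language discussed in this paper.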

## 2 Dependence on additive constants

The following theorem tells us that the definition of Kolmogorov-Chaitin complexity makes sense even when it is dependent upon the programming language:

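Theorem (invariance) If $U_1$ and $U_2$ are two universal Turing machines and $K_{U_1}(s)$ and $K_{U_2}(s)$ the Kolmogorov-Chaitin complexity of a binary string $s$ when $U_1$ or $U_2$ are used respectively, then there exists a constant $c$ such that for all binary strings $s$:

$$|K_{U_1}(s) - K_{U_2}(s)| < c$$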

In other terms, there is a program $p_1$ for the universal machine $U_1$ that allows $U_1$ to simulate $U_2$. This is usually called an interpreter or compiler in $U_1$ for $U_2$. Let $p_2$ be the shortest program producing some string $s$ according to $U_2$. Then the result of chaining together the programs $p_1$ and $p_2$ generates $s$ in $U_1$. Chaining $p_1$ onto $p_2$ adds only constant length to $p_2$, so there exists a constant $c$ that bounds the difference between the length of the shortest program in $U_1$ and the length of the shortest program in $U_2$ that generates the arbitrary string $s$.
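The argument can be written out as a short derivation. The symbols are assumptions made here for illustration: $U_1$ and $U_2$ are the two universal machines, $p_1$ is an interpreter for $U_2$ written for $U_1$, and $q_1$ (a name not in the original text) is an interpreter for $U_1$ written for $U_2$.

$$K_{U_1}(s) \le |p_1| + K_{U_2}(s), \qquad K_{U_2}(s) \le |q_1| + K_{U_1}(s)$$

$$\Rightarrow \quad |K_{U_1}(s) - K_{U_2}(s)| \le \max(|p_1|, |q_1|) < c \quad \text{for any } c > \max(|p_1|, |q_1|)$$

The bound does not depend on $s$, but nothing in the derivation limits how large the interpreters, and therefore $c$, can be.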

However, the constants involved can be arbitrarily large, so that one can even affect the relative order relation of $K$ under two different universal Turing machines: if $s_1$ and $s_2$ are two different strings and $K_{U_1}(s_1) < K_{U_1}(s_2)$, one can construct an alternative universal machine $U_2$ that not only changes the values for $K(s_1)$ and $K(s_2)$ but reverses the order relation to $K_{U_2}(s_1) > K_{U_2}(s_2)$.

One of the first conclusions drawn from algorithmic information theory is that at least one among the $2^n$ binary strings of length $n$ will not be compressible at all. That is because there are only $2^n - 1$ binary programs shorter than $n$. In general, there are fewer than $2^{n-c}$ descriptions shorter than $n - c$ bits, so only a small fraction of the strings of length $n$ can be compressed by more than $c$ bits. It is a straightforward conclusion that no compressing language can arbitrarily compress all strings at once. The strings a language can compress depend on the language used, since any string (even a random-looking one) can in some way be encoded to shorten its description within the language in question, even if a string compressible under other languages turns out to be incompressible under the new one. So one can always come up with another language capable of effectively compressing any given string. In other terms, the value of $K(s)$ for a fixed $s$ can be arbitrarily made up by constructing a suitable programming language for it. However, one would wish to avoid such artificial constructions by finding distinguished programming languages which are natural in some technical sense (rather than tailor-made to favor any particular string) while also preserving the relative values of $K(s)$ for all (or most) binary strings of length $n$ within any programming language sharing the same order-preserving property.
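The counting argument can be checked numerically. The following is a minimal sketch; the function names are introduced here for illustration and "compressed by more than $c$ bits" is taken to mean "has a description shorter than $n - c$ bits".

```python
def count_programs_shorter_than(n: int) -> int:
    """Number of binary strings of length 0, 1, ..., n-1 (equals 2**n - 1)."""
    return sum(2 ** k for k in range(n))


def max_fraction_compressible(n: int, c: int) -> float:
    """Upper bound on the fraction of length-n strings that can have a
    description shorter than n - c bits."""
    return count_programs_shorter_than(n - c) / 2 ** n


n = 10
print(count_programs_shorter_than(n), "programs for", 2 ** n, "strings")
print(max_fraction_compressible(n, 3))  # strictly below 2**-3
```

Since there is always one program fewer than there are strings of length $n$, at least one string of each length has no shorter description.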

As suggested in [7], suppose that in a programming language $L$, the shortest program $p$ that generates a random-looking string $s$ is almost as long as $s$ itself. One can specify a new programming language $L'$ whose universal machine $U'$ is just like the universal machine $U$ for $L$ except that, when presented with a very short program $p'$, $U'$ simulates $U$ on the long program $p$, generating $s$. In other words, the complexity of $s$ can be "buried" inside of $U'$ so that it does not show up in the program that generates $s$. This arbitrariness makes it hard to find a stable definition of Kolmogorov-Chaitin complexity unless a theory of natural programming languages is provided, unlike the usual definition in terms of an arbitrary Turing-complete programming language.

For instance, one can conceive of a universal machine that produces certain strings very often or very seldom, despite being able to produce any conceivable string given its universality. Let's say that a universal Turing machine is tailor-made to produce a particular string $s$ much less often than any other binary string. By following the relation of Kolmogorov-Chaitin complexity to the universal distribution[11, 8] one would conclude that for the said tailor-made construction the string $s$ is of greater Kolmogorov-Chaitin complexity than any other, which may seem counterintuitive. This is the kind of artificial construction one would prefer to avoid, particularly if there is a set of programming languages whose output distributions converge, such that between two natural programming languages the additive constant remains small enough to make $K$ invariant under the encoding from one language to the other, thus yielding stable values of $K$.

The issue of dependence on additive constants often comes up when $K(s)$ is evaluated using a particular programming language or universal Turing machine. One will always find that the additive constant is large enough to produce very different values. This is even worse for short strings, in particular for strings shorter than the size of the program implementation. One way to overcome the problem of the calculation of $K(s)$ for short strings was suggested in [2, 3]. It involved building from scratch a prior empirical distribution of the frequency of the outputs according to a formalism of universal computation. In these experiments, some of the models of computation explored seemed to converge, up to a certain degree, leading us to propose a natural definition of $K$ for short strings. That was possible because the additive constant up to which the output probability distributions converge has a lesser impact on the calculation of $K$, particularly for the strings at the top of the classification (thus the most frequent and stable strings). This would make it possible to establish a stable definition and calculation of $K$ for a set of models of computation identified as natural, for which relative orders are preserved even for larger strings.

Our attempt differs from previous attempts in that the programs generated by different models may produce the same relative $K$ despite the programming language or the universal Turing machine being necessarily compact in terms of size. This is what one would expect for a stable definition of $K$ to work with even if there were still some additive constants involved.

## 3 Towards a stable definition of K

The experiment described in detail in [2] proceeded by analyzing the outputs of two different models of computation: deterministic Turing machines ($TM$) and one-dimensional cellular automata ($CA$). Methods and techniques for enumerating, generating and performing exhaustive searches are described in further detail in [14]. The Turing machine model represents the basic framework underlying many concepts in computer science, including the definition of Kolmogorov-Chaitin complexity, while the cellular automaton has been widely studied as a particularly interesting model also capable of universal computation. The descriptions for both $TM$ and $CA$ followed standard formalisms commonly used in the literature.

The Turing machine description consisted of a list of rules (a finite program) capable of manipulating a linear list of cells, called the tape, using an access pointer called the head. The directions of the tape are designated right and left. The finite program can be in any one of a finite set of states numbered from 1 to $n$, with 1 the state at which the machine starts its computation. There is a distinguished state called the halting state at which the machine halts. Each tape cell can contain a 0 or a 1 (there is no special blank character). Time is discrete and the time instants (steps) are ordered $0, 1, 2, \ldots$, with 0 the time at which the machine starts its computation. At any time, the head is positioned over a particular cell. At time 0 the head is situated on a distinguished cell on the tape called the start cell, and the finite program starts in state 1. At time 0 all cells contain the same symbol, either 0 or 1.

A rule can be written in 5-tuple notation as $\{s_i, k_i, s_{i+1}, k_{i+1}, d\}$, where $s_i$ is the scanned symbol under the head, $k_i$ the state at time $t$, $s_{i+1}$ the symbol to write at time $t+1$, $k_{i+1}$ the state at time $t+1$, and $d$ the head movement, either to the right or to the left, at time $t+1$. As usual, a Turing machine can perform the following operations:

1. write an element from $\{0, 1\}$.

2. shift the head one cell left or right.

3. change the state of the finite program to one of the states in $\{1, \ldots, n\}$.

When the machine is running it executes the above operations at the rate of one operation per step. At the end of a computation the Turing machine has produced an output described by the contiguous cells of the tape that the head has visited.

An analogous standard description of a one-dimensional cellular automaton was followed. A one-dimensional cellular automaton is a collection of cells on a grid that evolves through a number of discrete time steps according to a set of rules based on the states of neighboring cells, applied in parallel to each row over time. In a binary cellular automaton, each cell can take only one of two possible values (0 or 1). When the cellular automaton starts its computation, it applies the rules at row 0. A neighborhood of radius $r$ means that the rule takes into consideration the value of the cell itself, $r$ cells to the right and $r$ cells to the left in order to determine the value of the corresponding cell in the next row.

For the Turing machines, the experiments were performed over the set of 2-state 2-symbol Turing machines, henceforth denoted as $TM(2,2)$. There are 4096 different Turing machines according to the description given above and the formula $(2sk)^{sk}$ derived from the traditional 5-tuple rule description of a Turing machine, with $s$ the number of symbols and $k$ the number of states. All the machines were run for $t$ steps each, first on an empty tape filled with 0 and then once again on an empty tape filled with 1.
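The enumeration and execution of small Turing machines can be sketched as follows. This is an illustrative sketch under stated assumptions, not the original experimental code: it assumes 2 states and 2 symbols with no separate halting state (which gives $(2 \cdot 2 \cdot 2)^{2 \cdot 2} = 4096$ rule tables), a fixed number of steps, and an output read from the contiguous cells visited by the head. The names `all_machines` and `run` are hypothetical.

```python
from itertools import product

STATES = (1, 2)
SYMBOLS = (0, 1)
MOVES = (-1, 1)  # -1 = left, +1 = right


def all_machines():
    """Yield every rule table {(state, symbol): (write, move, next_state)}."""
    keys = [(q, a) for q in STATES for a in SYMBOLS]
    actions = list(product(SYMBOLS, MOVES, STATES))  # 8 actions per entry
    for choice in product(actions, repeat=len(keys)):
        yield dict(zip(keys, choice))


def run(machine, blank, steps):
    """Run from state 1 on a tape filled with `blank`.

    Returns the contiguous cells visited by the head as a binary string.
    """
    tape = {}
    pos, state = 0, 1
    lo = hi = 0
    for _ in range(steps):
        write, move, state = machine[(state, tape.get(pos, blank))]
        tape[pos] = write
        pos += move
        lo, hi = min(lo, pos), max(hi, pos)
    return "".join(str(tape.get(i, blank)) for i in range(lo, hi + 1))


machines = list(all_machines())
print(len(machines))
# Each machine is fed a tape of 0s and then a tape of 1s.
outputs = [run(m, blank, 10) for m in machines for blank in (0, 1)]
print(len(outputs))
```

Running every machine on both blank tapes mirrors the symmetric setup described above; the step count of 10 is an arbitrary choice for the sketch.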

We proceeded in the same fashion for nearest-neighbor cellular automata, starting from a single 1 on a background of 0s and from a single 0 start cell on a background of 1s; these automata are henceforth denoted by $CA(1)$. Since there are $2^3 = 8$ possible binary states for the three cells neighboring a given cell, there are a total of $2^8 = 256$ elementary cellular automata, or $CA(1)$.
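An elementary cellular automaton run can be sketched as follows. The sketch assumes the usual rule numbering, in which bit $4l + 2c + r$ of the rule number gives the new value of a cell with left neighbor $l$, own value $c$ and right neighbor $r$. It uses a cyclic row wide enough that the disturbance from the start cell cannot wrap around within the given number of steps, so the result matches an infinite uniform background. The function name `eca_run` is hypothetical, and how output strings were extracted from the rows in the original experiment is not modeled here.

```python
def eca_run(rule, steps, cell=1, background=0):
    """Evolve elementary CA `rule` (0-255) for `steps` steps.

    Starts from a single `cell` on a uniform `background` and returns
    the final row as a list of 0/1 values.
    """
    width = 2 * steps + 3
    row = [background] * width
    row[width // 2] = cell
    for _ in range(steps):
        row = [
            (rule >> (4 * row[i - 1] + 2 * row[i] + row[(i + 1) % width])) & 1
            for i in range(width)
        ]
    return row


print(eca_run(90, 1))
print(2 ** (2 ** 3), "elementary rules")
# Both initial conditions, for every rule:
rows = [eca_run(r, 8, cell, 1 - cell) for r in range(256) for cell in (0, 1)]
print(len(rows))
```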

Let $s(TM_n, t)$ and $s(CA_n, t)$ be the two sets of output strings produced by the $n$-th Turing machine and the $n$-th cellular automaton respectively after $t$ steps, according to an enumeration of Turing machines and cellular automata. A probability distribution was built as follows. The sample space associated with the experiment is the set of finite binary strings, since both $s(TM_n, t)$ and $s(CA_n, t)$ are sets of binary strings. Let $X$ denote the set of outputs from either $TM$ or $CA$, treated as a discrete random variable taking values in this sample space. For a string $s$, $P(X = s)$ is the probability that $X$ produces the substring $s$. Let $D(X) = \{(s, P(X = s)) : s \in X\}$. In other words, $D(X)$ is the set of pairs formed by a string and the probability of that string being produced by a Turing machine or a cellular automaton after $t$ steps.

### 3.1 Output probability distribution D(X)

$D(X)$ is a discrete probability distribution since $\sum_{s} P(X = s) = 1$ as $s$ runs through the set of all possible values of $X$, a finite set of binary strings. $D(X)$, simply denoted $D$ from now on, was calculated in [2] for two sets of Turing machines and cellular automata with small state and symbol values, up to a certain string length $n$.
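Building such a distribution from a list of outputs can be sketched as follows. The sketch assumes, as a simplification, that each output string is counted once as a whole; the function name `output_distribution` is hypothetical.

```python
from collections import Counter


def output_distribution(outputs):
    """Return [(string, probability), ...] sorted from most to least frequent.

    Ties in probability are broken by the string itself so the order is
    reproducible.
    """
    counts = Counter(outputs)
    total = sum(counts.values())
    pairs = ((s, c / total) for s, c in counts.items())
    return sorted(pairs, key=lambda p: (-p[1], p[0]))


D = output_distribution(["00", "00", "01", "11"])
print(D)
print(sum(p for _, p in D))
```

The probabilities sum to 1 by construction, and the sorted order is the ranking later compared between models.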

In each case $D$ was found to be stable under several variations such as number of steps and sample sizes, allowing us to define a stable distribution for each, denoted from now on as $D_{TM(2,2)}$ for the distribution from Turing machines and $D_{CA(1)}$ for the distribution from cellular automata.

### 3.2 Equivalence of complexity classes

The application of a widely used theorem in group theory may provide further stability, getting rid of crossings due to exchanged strings, that is, different strings that probably have the same Kolmogorov-Chaitin complexity but bias the rank comparisons. Ideally, one would group and weight the frequency of the strings with the same expected complexity in order to measure the rank correlation without any additional bias. Consider, for instance, two typical distributions $D_1$ and $D_2$ for which the calculated frequencies have placed the strings $s_1$ and $s_2$ at the top of $D_1$ and $D_2$ respectively. If the ranking distance of both distributions is then calculated, one might get a biased measurement due to the exchange of $s_1$ with $s_2$ despite the fact that both should have, in principle, the same Kolmogorov-Chaitin complexity. Therefore, we want to find out how to group these strings so that they do not affect the rank comparison.

The Pólya-Burnside enumeration theorem[10], which makes it possible to count the number of discrete combinatorial objects of a given type as a function of their symmetries, was used. We have found experimentally that the symmetries that are supposed to preserve the Kolmogorov-Chaitin complexity of a string are reversion ($sy$), complementation ($co$) and their compositions ($sy \circ co$ and $co \circ sy$). In all the distributions built from the experiments so far we have found that strings always tend to group themselves in contiguous groups with their complemented and reversed versions. That is also a consequence of the setup of the experiments, since each Turing machine ran from an empty tape filled with zeros first and then again from an empty tape filled with ones in order to avoid any antisymmetry bias. For the same reason, each cellular automaton ran starting with a 0 on a background of ones and once again with a 1 on a background of zeros.

Definition (complexity class) Let $D$ be the probability distribution produced by a computation. A complexity class $C$ in $D$ is the set of strings $\{s_1, s_2, \ldots, s_n\}$ such that $K(s_1) = K(s_2) = \cdots = K(s_n)$.

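The above clearly induces a partition since $\bigcup_{i} C_i = D$ and $C_i \cap C_j = \emptyset$ for $i \neq j$, with $i, j \leq m$ and $m$ the number of strings in $D$. In other words, all strings in $D$ are in one and only one complexity class. We will denote by $D_r$ the reduced distribution of $D$. Evidently the number of elements in $D$ is greater than or equal to the number of elements in $D_r$.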

The Pólya-Burnside enumeration theorem will help us arrive at $D_r$. There are $2^n$ different binary strings of length $n$ and 4 possible transformations to take into consideration:

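1. $id$, the identity symmetry, $id(s) = s$.

2. $sy$, the reversion symmetry given by: if $s = d_1 d_2 \cdots d_n$, then $sy(s) = d_n \cdots d_2 d_1$.

3. $co$, the complementation symmetry given by $co(s) = \bar{d}_1 \bar{d}_2 \cdots \bar{d}_n$ with $\bar{d}_i = 1 - d_i$.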

Let $T$ denote the set of all possible transformations under composition of the above, including the fourth transformation $sy \circ co$.

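The number of complexity classes for strings of length $n$ can then be obtained by applying the Burnside theorem according to the following formula:

$$\frac{2^n + 2^{(n+1)/2}}{4}, \text{ for } n \text{ odd}$$

$$\frac{2^n + 2^{n/2} + 2^{n/2}}{4}, \text{ otherwise.}$$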

This is obtained by calculating the number of binary strings invariant under each transformation in $T$. For the transformation $id$ there are $2^n$ invariant strings. For $sy$ there are $2^{n/2}$ if $n$ is even and $2^{(n+1)/2}$ if $n$ is odd. The number of invariant strings under $co$ is zero, and the number of invariant strings under $sy \circ co$ is $2^{n/2}$ if $n$ is even, or zero if it is odd. Let's use $B(D)$ to denote the application of the Burnside theorem to a distribution $D$. As a consequence of applying $B(D)$, grouping and adding up the frequencies of the strings, one has to divide the frequency results by 2 or 4 (depending on the number of strings grouped for each class) according to the following formula:

$$fr(C) = \frac{\sum_{s_i \in C} fr(s_i)}{|C|}, \qquad C = \{s, sy(s), co(s), sy(co(s))\}$$

where $fr(s_i)$ represents the frequency of the string $s_i$ and the denominator the cardinality of the union set of the strings equivalent under $T$.

For example, the string $0000$ for $n = 4$ is grouped with the string $1111$ because they both have the same algorithmic complexity: $K(0000) = K(1111)$. The index of each class is the first string in the class according to arithmetical order. Thus the class {0000, 1111} is represented by $0000$. Another example of a class with two member strings is the one represented by $0101$, from the class $\{0101, 1010\}$. By contrast, the string $0001$ has three other strings of length 4 in the same class: $0111, 1000, 1110$. The other class with four members is the one represented by $0010$, the other three strings being $0100, 1011, 1101$. In every class $C$, any string can be obtained from any other string in $C$ by applying a transformation in $T$.
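The grouping into classes and the Burnside count can be cross-checked by brute force. The following is a minimal sketch; the function names are introduced here for illustration, and `reduced_frequency` assumes that the reduced frequency of a class is the sum of its members' frequencies divided by the number of distinct members.

```python
from itertools import product


def sy(s):
    """Reversion of a binary string."""
    return s[::-1]


def co(s):
    """Complementation of a binary string."""
    return "".join("1" if d == "0" else "0" for d in s)


def complexity_classes(n):
    """Map each class representative (first string in order) to its members."""
    classes = {}
    for bits in product("01", repeat=n):
        s = "".join(bits)
        orbit = {s, sy(s), co(s), sy(co(s))}
        classes[min(orbit)] = sorted(orbit)
    return classes


def burnside_count(n):
    """Number of classes of length-n strings predicted by the Burnside formula."""
    if n % 2:
        return (2 ** n + 2 ** ((n + 1) // 2)) // 4
    return (2 ** n + 2 ** (n // 2) + 2 ** (n // 2)) // 4


def reduced_frequency(freq, s):
    """Average frequency over the distinct strings equivalent to s."""
    orbit = {s, sy(s), co(s), sy(co(s))}
    return sum(freq.get(t, 0.0) for t in orbit) / len(orbit)


for rep, members in complexity_classes(4).items():
    print(rep, members)
print(burnside_count(4))
```

For $n = 4$ the brute-force grouping yields six classes, two with four members and four with two members, in agreement with the formula.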

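It is clear that $B$ induces a total order in $D_r$ from $D$ under the transformations preserving $K$, because if $s_1$, $s_2$ and $s_3$ are strings in $D$: if $K(s_1) \leq K(s_2)$ and $K(s_2) \leq K(s_1)$ then $K(s_1) = K(s_2)$, so $s_1$ and $s_2$ are in the same complexity class (antisymmetry); if $K(s_1) \leq K(s_2)$ and $K(s_2) \leq K(s_3)$ then $K(s_1) \leq K(s_3)$ (transitivity); and either $K(s_1) \leq K(s_2)$ or $K(s_2) \leq K(s_1)$ (totality).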

Hereafter the reduced distribution $D_r$ will simply be denoted by $D$, it being understood that it refers to $D_r$ after applying $B(D)$.

### 3.3 Rank order correlation

To determine the degree of correlation between the frequency distributions, we followed a statistical method for rank comparisons[5]. Spearman's rank correlation coefficient is a non-parametric measure of correlation, i.e. it makes no assumptions about the frequency distribution of the variables. It is equivalent to the Pearson correlation computed on ranks. The Spearman coefficient measures the correspondence between two rankings and allows the significance of this correspondence to be assessed. The Spearman rank correlation coefficient is:

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

where $d_i$ is the difference between the ranks of corresponding values of $x$ and $y$, and $n$ is the number of pairs of values.

The Spearman coefficient is in the interval $[-1, 1]$ where:

• If the agreement between the two rankings is perfect (i.e., the two rankings are the same) the coefficient has value 1.

• If the disagreement between the two rankings is perfect (i.e., one ranking is the reverse of the other) the coefficient has value -1.

• For all other arrangements the value lies between -1 and 1, and increasing values (for the same number of elements) imply increasing agreement between the rankings.

• If the rankings are completely independent, the coefficient has value 0.
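The coefficient can be computed directly from its definition. The following is a minimal sketch that assumes there are no tied values (ties would require averaged ranks); the function names are introduced here for illustration.

```python
def ranks(values):
    """Rank 1 goes to the largest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation: 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


print(spearman([0.5, 0.3, 0.2], [0.6, 0.3, 0.1]))  # same ranking
print(spearman([0.5, 0.3, 0.2], [0.1, 0.3, 0.6]))  # reversed ranking
```

The two calls illustrate the extreme cases listed above: identical rankings give 1 and reversed rankings give -1.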

#### Level of significance

The approach to testing whether an observed value of $\rho$ is significantly different from zero is to calculate the probability that it would be greater than or equal to the observed $\rho$ under the null hypothesis (that the rankings are correlated only by chance), using a permutation test in order to conclude that the obtained value of $\rho$ is unlikely to occur by chance.

The level of significance is determined by a permutation test[6], checking all permutations of ranks in the sample and counting the fraction for which $\rho$ is more extreme than the $\rho$ found from the data. As the number of permutations grows as $n!$, this is not practical even for small numbers. An asymptotically equivalent permutation test can be created when there are too many possible orderings of the data. For fewer than 9 elements we proceeded by an exhaustive permutation test. For more than 9 elements the significance was calculated by Monte Carlo sampling, which takes a small (relative to the total number of permutations) random sample of the possible orderings; in our case the sample was big enough to guarantee the results given the number of elements.
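Both variants of the test can be sketched in a few lines. This is an illustrative sketch, not the original procedure: the one-sided criterion ($\rho$ greater than or equal to the observed value), the cut-off `exact_limit=8`, the default of 10000 Monte Carlo samples and the fixed seed are assumptions made here, and the function names are hypothetical.

```python
import random
from itertools import permutations
from math import factorial


def rho(rx, ry):
    """Spearman coefficient computed from two rank lists without ties."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def significance(rx, ry, samples=10000, exact_limit=8, seed=0):
    """Fraction of orderings of ry whose rho is >= the observed rho.

    Exhaustive for short rankings, Monte Carlo sampling otherwise.
    """
    observed = rho(rx, ry)
    n = len(rx)
    if n <= exact_limit:
        orderings = permutations(ry)
        total = factorial(n)
    else:
        rng = random.Random(seed)
        orderings = (rng.sample(ry, n) for _ in range(samples))
        total = samples
    hits = sum(1 for p in orderings if rho(rx, p) >= observed)
    return hits / total


print(significance([1, 2], [1, 2]))        # two elements, perfect match
print(significance([1, 2, 3], [1, 2, 3]))  # three elements, perfect match
print(significance(list(range(1, 13)), list(range(1, 13))))
```

With only two elements a perfect match still leaves half of the orderings at least as extreme, so short rankings cannot reach a convincing significance level however well they agree.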

The significance convention is that when the value is above $0.05$, the correlation might be the product of chance and has to be rejected. If it is $0.05$ or below, there is enough confidence that the correlation has not occurred by chance, and the correlation is said to be significant. If it is $0.01$ or below, the correlation is said to be highly significant and very unlikely to be the product of chance, since it would occur by chance less than 1 time in a hundred.

The significance tables generated and followed for the calculation of the significance of the Spearman correlation coefficients can be consulted in the following URL:

### 3.4 Convergence in distributions

We want to find out if the probability distributions built from single and different models of computation converge.

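Definition (convergence in order) A sequence of distributions $D_1, D_2, \ldots, D_n$ converges in order to $D$ if, for every string $s$, $ord_{D_n}(s) \to ord_{D}(s)$ when $n$ tends to infinity, where $ord_{D}(s)$ is the rank of $s$ in $D$. In other words, $D_n$ converges to an order when $n$ tends to infinity.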

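Definition (convergence in values) A sequence of distributions $D_1, D_2, \ldots, D_n$ converges in values to $D$ if, for every string $s$, $P_{D_n}(s) \to P_{D}(s)$ when $n$ tends to infinity, where $P_{D}(s)$ is the probability of $s$ in $D$.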

Definition (order-preserving) A Turing machine $M$ is Kolmogorov-Chaitin complexity monotone, or Kolmogorov-Chaitin complexity order-preserving, if, given the output probability distribution $D$ of $M$, $P(s_1) \geq P(s_2)$ implies $K(s_1) \leq K(s_2)$.

Definition (quasi order-preserving) A Turing machine $M$ is $c$-Kolmogorov-Chaitin complexity monotone, or $c$-Kolmogorov-Chaitin complexity order-preserving, if $M$ is Kolmogorov-Chaitin complexity monotone for most strings. When the property holds for all strings, $M$ is simply Kolmogorov-Chaitin complexity order-preserving.

In order to determine the degree of order preservation we have introduced the term $c$, which will be determined by the correlation significance between two given output probability distributions $D_1$ and $D_2$.

In other words, one can still define a monotony measure even if only a significant first segment of the distributions converge. This is important because by algorithmic probability we know that random-looking strings will be–and because of their random nature have to be–very unstable exchanging places at the bottom of the distributions. But we may nevertheless want to know whether a distribution converges for most of the strings.

Whether or not a probability distribution converges to $D$, one might still want to check whether two different models of computation converge to each other:

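Definition (relative Kolmogorov-Chaitin monotony) Let $M$ and $N$ be two Turing machines. $M$ and $N$ are relatively $c$-Kolmogorov-Chaitin complexity monotone if, given their probability distributions $D_1$ and $D_2$ respectively, $P_{D_1}(s_1) \geq P_{D_1}(s_2)$ in $D_1$ implies $P_{D_2}(s_1) \geq P_{D_2}(s_2)$ in $D_2$ for all strings $s_1, s_2$.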

Definition (distribution length) Given a model $M$, the length of its output probability distribution $D$, denoted by $\ell(D)$, is the length of the longest string $s \in D$.

Main result $TM(2,2)$ and $CA(1)$ are relative Kolmogorov-Chaitin complexity quasi monotone up to $\ell(D) = 12$.

The following table shows the Spearman rank correlation coefficients for $D_{TM(2,2)}$ with $D_{CA(1)}$ for string lengths 2 to 12:

Significance values are not expected to score well at the beginning because there are too few elements to determine whether anything other than chance produced the order. For 2 elements in each rank order there are only 2 ways to arrange each rank, and even if they make a perfect match, as they do, the significance cannot be higher than 50 percent because there is still an even chance of having produced that particular order. The same holds for 3 elements, even though the ranks made a perfect match as well. But starting at 6 elements one can begin to look at an actual significance value, and up to 12 in the sequence one can witness a notable increase until the value stabilizes at a level which is, for all of them, highly significant. Only one case was merely significant rather than highly significant according to the threshold convention.

The fact that each of the values of the sequence is either significant or highly significant makes the convergence of the entire sequence even more significant. $D_{TM(2,2)}$ and $D_{CA(1)}$ are therefore statistically highly correlated and they are relative 0.01-Kolmogorov-Chaitin complexity quasi monotone up to $\ell(D) = 12$ in almost all values. Therefore $TM(2,2)$ and $CA(1)$ are relative Kolmogorov-Chaitin complexity monotone.

It also turned out that the Pearson correlation coefficients between the actual probability values of $D_{TM(2,2)}$ and $D_{CA(1)}$ were all highly significant, with the following values:

The above results are important not only because they show that $TM(2,2)$ and $CA(1)$ are relatively Kolmogorov-Chaitin monotone up to $\ell(D) = 12$ but also because they constitute the basis and evidence for the formulation of the conjectures in section 3.5:

### 3.5 Conjectures of convergence

Let $ord(D)$ denote the ranking order of a distribution $D$ and $val(D)$ the actual probability values of $D$ for each string $s$, then:

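Conjecture 1 If $D_n = D_{TM(n,2)}$, then for all strings $s$, $val_{D_n}(s) \to val_{D_{TM}}(s)$ when $n \to \infty$, with $val_{D_{TM}}(s)$ the limit frequencies. In other words, the sequence of probability values $val(D_{TM(n,2)})$ converges when $n$ tends to infinity. Let's call this limit distribution $D_{TM}$ hereafter.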

Conjecture 2 The sequence $ord(D_{TM(n,2)})$ converges when $n$ tends to infinity.

Notice that conjecture 2 is weaker than conjecture 1, since conjecture 2 could be true even if conjecture 1 is false. Both conjectures 1 and 2 imply that there exists a $k$ such that for all $n > k$, $TM(n,2)$ is Kolmogorov-Chaitin complexity order-preserving.

Likewise for cellular automata:

Conjecture 3 The sequence $val(D_{CA(n)})$ converges to $D_{CA}$ when $n$ tends to infinity.

Conjecture 4 The sequence $ord(D_{CA(n)})$ converges when $n$ tends to infinity.

As for Turing machines, conjecture 3 implies conjecture 4, but conjecture 4 could be true even if conjecture 3 is false. Both conjectures 3 and 4 imply that there exists a $k$ such that for all $n > k$, $CA(n)$ is Kolmogorov-Chaitin complexity order-preserving.

Conjecture 5 $val(D_{TM}) = val(D_{CA})$.

Conjecture 6 $ord(D_{TM}) = ord(D_{CA})$.

In other words, the distributions for both $TM$ and $CA$ converge to the same limit distribution.

Conjecture 5 implies conjecture 6, but conjecture 6 could be true even if conjecture 5 is false.

Both $D_{TM}$ and $D_{CA}$ define $D$, from now on the natural probability distribution. We can now propose our definition of a natural model of computation:

Definition (naturalness in distribution) $M$ is a natural model of computation if it is $c$-Kolmogorov-Chaitin monotone or $c$-Kolmogorov-Chaitin order-preserving with respect to the natural probability distribution $D$.

In other words, any model of computation preserving the relative order of the natural distribution $D$ is natural in terms of Kolmogorov-Chaitin complexity under our definition. So one can now technically say that a tailor-made Turing machine producing a sufficiently different output distribution is not natural according to the prior $D$. One can now also define (a) a degree of naturalness according to the ranking coefficient and number of order-preserving strings as suggested before and (b) a Kolmogorov-Chaitin order-preserving test with which one can say whether a programming language or Turing machine is natural by designing an experiment and running the test. For (a) it suffices to follow the ideas in this paper. For (b) one can follow the experiments partially described here, supplemented with further details available in [3], in order to produce a probability distribution that can be compared to the natural probability distribution to determine whether or not convergence occurs. The use of these natural distributions as prior probability distributions is one of the possible applications. The following URL provides the full tables: http://www.mathrix.org/experimentalAIT/naturaldistribution
Further details, including the original programs, are available online in the experimental Algorithmic Information Theory

Further experiments are being performed, both for bigger classes of the same models of computation and for other models of computation, including some that clearly are not Kolmogorov-Chaitin order-preserving. More experiments will be performed covering different parameterizations, such as distributions for non-empty initial configurations, possible rates of convergence and radius of convergence, as well as the actual relation between the mathematical expected values of the theoretical definitions of $K(s)$ and $m(s)$ (the so-called universal distribution[9]), as first suggested in [2, 3]. We are aware of the possible expected differences between probability distributions produced by self-nondelimiting vs. self-delimiting programs[4], such as in the case discussed within this paper, where the halting state of the Turing machines was partially dismissed while the halting of the cellular automata was randomly chosen to produce the desired length of strings for comparison with the TM distributions. Further investigation suggests that there may be interesting qualitative differences in the probability distributions they produce. These can also be studied using this approach.

If these conjectures are true, as suggested by our experiments, this procedure is a feasible and effective approach to both $K(s)$ and $m(s)$. Moreover, as suggested in [2], it is a way to approach the Kolmogorov-Chaitin complexity of short strings. Furthermore, statistical approaches might in general be good approaches to the Kolmogorov-Chaitin complexity of strings of any length, as long as the sample is large enough to reach a reasonable significance.

## References

1. C.S. Calude, Information and Randomness: An Algorithmic Perspective (Texts in Theoretical Computer Science. An EATCS Series), 2nd edition, Springer, 2002.
2. J.P. Delahaye, H. Zenil, On the Kolmogorov-Chaitin complexity for short sequences, in C.S. Calude (ed.), Randomness and Complexity, from Leibniz to Chaitin, World Scientific, 2007.
3. J.P. Delahaye, H. Zenil, On the Kolmogorov-Chaitin complexity for short sequences (long version), arXiv:0704.1043v3 [cs.CC], 2007.
4. G.J. Chaitin, Algorithmic Information Theory, Cambridge University Press, 1987.
5. G.W. Snedecor, W.G. Cochran, Statistical Methods, 8th edition, Iowa State University Press, 1989.
6. P.I. Good, Permutation, Parametric and Bootstrap Tests of Hypotheses, 3rd edition, Springer, 2005.
7. K. Kelly, Ockham's Razor, Truth, and Information, in J. van Benthem and P. Adriaans (eds.), Handbook of the Philosophy of Information, to appear.
8. A.K. Zvonkin, L.A. Levin, The Complexity of finite objects and the Algorithmic Concepts of Information and Randomness, UMN = Russian Math. Surveys, 25(6):83-124, 1970.
9. M. Li and P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, Springer, 1997.
10. J.H. Redfield, The Theory of Group-Reduced Distributions, American Journal of Mathematics, Vol. 49, No. 3, pp. 433-455, 1927.
11. R. Solomonoff, The Discovery of Algorithmic Probability, Journal of Computer and System Sciences, Vol. 55, No. 1, pp. 73-88, August 1997.
12. R. Solomonoff, A Preliminary Report on a General Theory of Inductive Inference (Revision of Report V-131), Zator Co., Cambridge, Mass., Feb. 4, 1960.
13. J. Tromp, Binary Lambda Calculus and Combinatory Logic, in M. Hutter, W. Merkle and P.M.B. Vitányi (eds.), Kolmogorov Complexity and Applications, Dagstuhl Seminar Proceedings, Internationales Begegnungs- und Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany, 2006.
14. S. Wolfram, A New Kind of Science, Wolfram Media, Champaign, IL, 2002.