# Computable Variants of AIXI which are More Powerful than AIXItl

Susumu Katayama, University of Miyazaki
May 22, 2018

###### Abstract

This paper presents Unlimited Computable AI, or UCAI, a family of computable variants of AIXI. UCAI is more powerful than AIXItl, the conventional family of computable variants of AIXI, in the following ways: 1) UCAI supports models of terminating computation, including typed lambda calculus, while AIXItl only supports Turing machines with a timeout, which can be simulated by typed lambda calculus for any timeout value; 2) unlike UCAI, AIXItl bounds the program length.

## 1 Introduction

AIXI [2] is an AI model for theoretical discussion of the limitations of AI, and it is expected to be universal. AIXI models the environment as a Turing machine and formalizes the interaction between the environment and the AI agent as a discrete-time reinforcement learning [6] problem, which is an optimization problem in an unknown environment. AIXI is not computable without approximation.

AIXItl [2] is a family of computable variants of AIXI, but its members limit both the description length of the environment and the computation time per time step when computed by a sequential Turing machine. In particular, the limit on the program description length may be a problem when dealing with a universe whose size is unknown beforehand.

In this paper, we prove two main theorems about Unlimited AI (UAI), which is our AIXI variant using a model of terminating computation as the environment model.

The first theorem shows that the action value function of UAI (which represents how valuable each action is in each situation) is computable to arbitrary precision. Roughly speaking, this result means that UAI is at least as computable as Q-learning [7]: UAI and Q-learning both select an action at each time step by operating on real-valued action value functions. Strictly speaking, this operation is not exactly computable if different actions may have the same action value, because comparing two exactly equal real values requires comparing an infinite number of digits. In practice, however, Q-learning is considered computable, because the computation is cut off at some precision.

The other theorem shows that by adequately selecting the least significant bits of each prior probability, UAI becomes exactly computable. We call the resulting AI model Unlimited Computable AI (UCAI).

Because the model of computation used by AIXItl is "Turing machines with a timeout and a bounded program length", which is a more limited class than primitive recursive functions, our AI models UAI and UCAI cover a more general class of environments.

Making AIXI computable involves solving the following three problems:

Problem 1: programs that are candidate models of the environment can enter infinite loops;

Problem 2: an infinite series must be computed, because there are infinitely many candidate models of the environment;

Problem 3: all operations are on real values.

AIXItl deals with the above problems in the following way:

1. it deals with Problem 1 by limiting the computation time with a timeout;

2. it deals with Problem 2 by limiting the length of programs, which makes the number of programs finite.

Although Problem 3 must also be considered in a strictly theoretical discussion, it is simply ignored in the papers on AIXI (e.g. [2]) and is not explicitly solved. However, if the set of rewards is limited to rational numbers, all the computation can be executed as operations on rational numbers, because the number of programs is finite in AIXItl. Throughout this paper, we assume that rewards can only take rational values.

On the other hand, our UCAI algorithm deals with the three problems in the following way:

1. by not limiting the environment model to Turing machines and instead permitting models of terminating computation, it deals with Problem 1 while supporting models of computation that are more powerful than Turing machines with a timeout;

2. it deals with Problems 2 and 3 without limiting the program length by applying a technique called Exact Real Arithmetic (e.g. [1]), which enables exact computation over real values under some limitations.

The rest of this paper is organized in the following way. Section 2.2 introduces AIXI. Section 2.3 introduces exact real arithmetic and its implementation using lazy evaluation. Section 3 generalizes AIXI, defines our new variants UCAI, and proves that UCAI algorithms are exactly computable. Section 4 summarizes the results and discusses what the results mean.

## 2 Preparation

### 2.1 Finite List

Before defining AIXI, we define (finite) lists, because AIXI deals with discrete-time sequences. Infinite lists, or streams, will be defined in Section 2.3.

###### Definition 1 (Finite List).

A finite list is either the empty list $[]$ or the result of appending an element in front of a finite list using the cons operator $:$.

The cons operator is right associative, and thus a list such as $x_1 : (x_2 : (x_3 : []))$ can be written as $x_1 : x_2 : x_3 : []$ by omitting the parentheses.

denotes the set of lists of ’s.

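For lists we define the concatenation operator $\mathbin{+\!\!+}$ as follows.

###### Definition 2 (Concatenation).

$$\begin{aligned} [] \mathbin{+\!\!+} y &= y \\ (x_1 : x) \mathbin{+\!\!+} y &= x_1 : (x \mathbin{+\!\!+} y) \end{aligned}$$

$\mathbin{+\!\!+}$ is also right-associative.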

### 2.2 AIXI

AIXI[2] is an AI model for theoretical discussion of the limitations of AI, which is expected to be universal.

Mostly, AIXI is based on the general reinforcement learning framework:

• at each time step, the learning agent and the environment interact via perceptions and actions: the agent chooses an action based on the interaction history and sends it to the environment, then, the environment chooses a perception based on the history and sends it to the agent, then again the agent chooses an action based on the new history and sends it to the environment, and so on;

• the perception includes the information of reward, which reflects how well the agent has been behaving; the agent’s purpose is to maximize the expected return, that is the expectation of total sum of future reward;

• the environment is unknown to the agent. It may even change in time.

AIXI models the environment as a Turing machine. It estimates the environment in the way that higher prior probabilities are assigned to simpler programs. At each time step it selects the action which maximizes the expected return weighted by the belief assigned to each Turing machine.

AIXI has the following parameters:

• the finite non-empty set of actions $A$,

• the finite non-empty set of observations,

• the finite and bounded set of rewards, and

• the horizon function $m$, which gives the last time step $m(k)$ taken into account at time $k$.

Although [2] assumes non-negative rewards, we omit this limitation because it only applies to AIXItl, which is an approximation of AIXI, and because it is easily amended. Also, we rarely mention observations and rewards separately; instead, we use the set of perceptions $E$ and a projection function that extracts the reward from a perception.

At each time step $k$, AIXI computes the action $a_k$ from the interaction history $\text{æ}_{1..k-1}$ based on the following equations. First, the action value function, which computes the expected return for each action based on the current history, is defined as follows:

$$Q(\text{æ}_{1..k-1}, a) = \sum_{e_k \in E} \max_{a_{k+1} \in A} \sum_{e_{k+1} \in E} \max_{a_{k+2} \in A} \cdots \max_{a_{m(k)} \in A} \sum_{e_{m(k)} \in E} \tag{1}$$

where denotes the observation at time . denotes that is the action-perception pair at time . In general, denotes , i.e., the sequence from time to time of time-varying variable . Thus, is the interaction history at time .

Here the candidate environment models $q$ range over the set of monotone Turing machines, and $\xi$ is the universal prior defined as

$$\xi(q) = 2^{-l(q)}$$

where $l(q)$ is the description length of $q$. The universal prior is designed to prefer simple programs by assigning lower probabilities to longer programs.

Based on the action value function defined above, AIXI always chooses the best action, i.e., the action $a_k$ at time $k$ after the interaction history $\text{æ}_{1..k-1}$ such that

$$a_k = \arg\max_{a \in A} Q(\text{æ}_{1..k-1}, a) \tag{2}$$

### 2.3 Exact Real Arithmetic and Lazy Evaluation

Exact real arithmetic (e.g. [1]) is a set of techniques for effectively implementing exact computations over the real numbers. Because the set of real numbers is a continuum and each real number contains the information of infinite number of digits, the reader may doubt if it is even possible. Actually, that is not a problem because the set of real numbers that can be uniquely defined by a finite program is countable.

There are two main kinds of approaches to representing exact real values [5]. One represents a real number as a function that takes a precision and returns an approximation to that precision [1], and the other represents the mantissa as a lazy infinite stream of digits or other integral values. The latter stream-based approach has various representations, which are reviewed in [5].

In this paper, we prove the computability of our AIXI variants when using the redundant binary stream representation. For simplicity, we normalize the set of rewards so that all the computations can be executed within $[-1, 1]$, and we stick to fixed-point computations.

Lazy functional languages such as Haskell adopt the lazy evaluation model, and can deal with infinite data structures such as infinite lists (a.k.a. streams) and infinite trees. The idea of lazy evaluation is to postpone computation until it is requested. Even if the remaining computation will generate infinite data, the data structure can hold a thunk, or a description of the remaining computation, until a more precise description is requested.
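As a rough illustration of the same idea outside Haskell, a Python generator can play the role of a lazy stream: it holds the remaining computation and produces one more digit only when asked. The function name `binary_digits` is a hypothetical helper introduced for this sketch, not something defined in this paper.

```python
from itertools import islice

def binary_digits(num, den):
    """Lazily yield the binary digits of num/den, assuming 0 <= num/den < 1."""
    while True:
        num *= 2
        digit, num = divmod(num, den)  # next digit and the remaining fraction
        yield digit

# Only the requested prefix of the infinite expansion is ever computed.
first_six = list(islice(binary_digits(1, 3), 6))
```

The generator object itself corresponds to the thunk: nothing beyond the requested six digits of 1/3 is evaluated.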

Now we give a definition of a stream.

###### Definition 3 (Stream).

A stream is an infinite list. An infinite list is the result of appending an element in front of an infinite list using the cons operator $:$. (We use the same operator for both finite and infinite lists.)

The cons operator is right associative, and thus a stream can be written as $x_1 : x_2 : x_3 : \cdots$ by omitting the parentheses.

denotes the set of streams of ’s.

In this paper, stream variables are suffixed with $\infty$ for readability, such as $x_\infty$.

In order to make sure that such infinite computations keep producing results, namely, to guarantee computability to arbitrary precision, it is enough to show that each element (or digit, in the case of a stream of digits) and the data representing the remaining computation can be computed at each recursion step.

###### Definition 4 (Digit-wise computability).

A function $f$ returning an FPRBSR (fixed point redundant binary stream representation, Definition 5 in Appendix A.1) is digit-wise computable iff it keeps generating the resulting stream indefinitely, i.e., for any precision there always exists a time by which the result has been computed to that precision.

In order to prove that a function $f$ is digit-wise computable, it is enough to show that $f$ can be represented as

$$f(x) = g(x) : (h(f))(x)$$

We also need to require that once the first elements of a stream (the most significant digits, in the case of a stream of digits) are computed and fixed, they do not change later because of carries. In fact, the usual binary representation does not always satisfy this rule: when adding two numbers, a carry from arbitrarily far to the right may still change a leading digit, so that digit may remain undetermined forever. To avoid this problem we can use the redundant binary representation (a.k.a. signed binary representation), which needs only a bounded amount of carry propagation for most arithmetic operations.

The most serious limitation of exact real arithmetic is that a comparison between two exactly equal numbers does not terminate, because it amounts to comparing two infinite lists that never differ. We can still compare two different numbers to tell which is greater.

In the defining equations of AIXI, this limitation only affects the $\arg\max$ operation. (It does not affect the $\max$ operations, as shown in Lemma 8.) This means that it is very difficult to choose the best action exactly when the best and second-best action values are almost the same. In practice, however, there may be cases where we can ignore such small differences and use fixed-precision floating-point approximations instead of exact real numbers.

This paper proves that the action values of our AIXI generalization can be computed to arbitrary precision for arbitrary prior distributions when a model of terminating computation is used instead of Turing machines (Theorem 1), and that the whole computation, including the $\arg\max$ operation, is computable if we modify the least significant digits of the probabilities of the prior distribution to special irrational values (Theorem 2).

## 3 Contributions

In this section, we introduce our AIXI variant, which supports more powerful models of computation than AIXItl and is still computable. To this end, we start by generalizing AIXI to the necessary level and then specialize it to obtain computable models.

### 3.1 Generalizing AIXI

In this section we generalize AIXI by generalizing the model of computation from monotone Turing machine to other models and generalizing the prior distribution function on it.

#### 3.1.1 Generalizing the Model of Computation

Although AIXI and AIXItl model the environment as monotone Turing machines, there is no necessity or significant advantage in limiting the discussion to Turing machines. There are many Turing-complete models of computation, such as $\lambda$-calculus and combinatory logic, and terminating Turing-incomplete models such as typed $\lambda$-calculus are also worthy of consideration. Also, all computer languages can be regarded as models of computation. For example, the functional language Haskell can be regarded as a model of computation extending typed $\lambda$-calculus.

One may argue that the choice of monotone Turing machine instead of usual Turing machine is important, but the same effect can be obtained by using lazy I/O, i.e., by modeling the input and the output as streams. (Moreover, even when streams are not available, the same computation power is achieved by supplying the perception history instead of supplying only the current perception as the current input, though the efficiency is sacrificed.)

The idea of termination and lazy I/O can coexist. For example, Agda is a computer language equipped with both. The careful reader should recall that similarly monotone Turing machines may or may not enter an infinite loop while computing the result of each step.

Extending the available set of models of computation to include $\lambda$-calculus and functional languages has the additional benefit of enabling incremental learning, by assigning biased prior probabilities that prioritize expressions with useful functionality [3]. We do not discuss this further in this paper.

#### 3.1.2 Assigning Prior Distributions

Even when using a model other than Turing machines, we need to define a counterpart of universal prior. This is not difficult. [4] induces functional programs by assigning a prior to each functional program based on the BNF representation of the grammar of the language.

The universal prior, which is adopted by AIXI, assigns a prior probability to each Turing machine represented as a stream of bits. The grammar of streams of bits, or BitStreams, can be defined as

 BitStream ::= 0:BitStream | 1:BitStream

The universal prior assigns the uniform distribution as follows:

$$\mathrm{BitStream} = \begin{cases} 0 : \mathrm{BitStream}, & \text{w.p. } 1/2, \\ 1 : \mathrm{BitStream}, & \text{w.p. } 1/2 \end{cases} \tag{3}$$

A BitStream is infinite, and the prior probability of any single one is $0$. However, finite programs only use the initial finite part of the sequence as the program, and the remaining infinite part does not matter. Therefore, the probability of selecting the equivalence class of BitStreams representing a given finite program is positive.

We can also assign prior distributions to other models of computation. For example, for $\lambda$-calculus with de Bruijn indices, the grammar can be defined as:

 Exp ::= XNat | (λExp) | (Exp Exp)
 Nat ::= 0 | σNat

and the prior distribution can be assigned as follows:

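$$\mathrm{Exp} = \begin{cases} X\mathrm{Nat}, & \text{w.p. } 1/2, \\ (\lambda\,\mathrm{Exp}), & \text{w.p. } 1/4, \\ (\mathrm{Exp}\ \mathrm{Exp}), & \text{w.p. } 1/4 \end{cases} \tag{4}$$

$$\mathrm{Nat} = \begin{cases} 0, & \text{w.p. } 1/2, \\ \sigma\mathrm{Nat}, & \text{w.p. } 1/2 \end{cases} \tag{5}$$

Note that the probability of the recursive cases, in particular the application case (Exp Exp), must be small enough; if too much probability is assigned to it, the expected length of programs becomes infinite. In general, if the probability of obtaining an infinite syntax tree is positive, the expected value of the program length is infinite. This means that if the expected value of the program length is finite, we obtain a finite program with probability $1$.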

For the case of Eqs. 4 and 5, the expected length $x$ of Exp and the expected length $k$ of Nat satisfy

$$\begin{aligned} x &= (1+k)/2 + (1+x)/4 + (1+2x)/4 \\ k &= 1/2 + (1+k)/2 \end{aligned}$$

if the length of each data constructor is $1$, and thus they are well-defined, taking the values

$$\begin{aligned} k &= 2 \\ x &= 2k + 4 = 8 \end{aligned}$$
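The fixed point of the expected-length equations can be checked mechanically, and the prior of Eqs. 4 and 5 can be sampled directly. The following is a minimal sketch under the assumption that every data constructor has length 1; the function names `sample_nat` and `sample_exp_length` are hypothetical helpers introduced only for this illustration.

```python
from fractions import Fraction
import random

# Exact fixed point: k = 1/2 + (1 + k)/2 gives k = 2, and then
# x = (1 + k)/2 + (1 + x)/4 + (1 + 2x)/4 rearranges to x = 2k + 4.
k = Fraction(2)
x = 2 * k + 4
assert k == Fraction(1, 2) + (1 + k) / 2
assert x == (1 + k) / 2 + (1 + x) / 4 + (1 + 2 * x) / 4

def sample_nat(rng):
    """Length of one Nat: 0 (w.p. 1/2) or sigma Nat (w.p. 1/2)."""
    length = 1
    while rng.random() < 0.5:
        length += 1
    return length

def sample_exp_length(rng):
    """Length of one Exp drawn from Eq. 4, counting each constructor as 1."""
    u = rng.random()
    if u < 0.5:
        return 1 + sample_nat(rng)          # X Nat
    if u < 0.75:
        return 1 + sample_exp_length(rng)   # (lambda Exp)
    return 1 + sample_exp_length(rng) + sample_exp_length(rng)  # (Exp Exp)

rng = random.Random(0)
lengths = [sample_exp_length(rng) for _ in range(1000)]
```

Every sampled expression is finite, in line with the observation that a finite expected length implies a finite program with probability 1.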

For our purpose, we need to restrict the programs to closed expressions, which can be achieved by using the uniform distribution over the valid indices for each variable.

In general, priors can be assigned to polynomial syntaxes that can be defined in BNF notation, i.e., by direct sums and direct products. Again, however, the probability of generating infinite syntax trees should be $0$.

The following is an example algorithm which assigns a prior distribution to a polynomial grammar without mutual recursion:

###### Algorithm 1.

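For the grammar rule

$$X_i ::= \sum_{k=1}^{m} \prod_{l=1}^{n} X_l^{j_{kl}}$$

select the $k$-th alternative of $X_i$ with probability $p_{ik}$ such that:

$$p_{ik} = \begin{cases} \min\left\{\dfrac{1}{j_{ki}+1}, \dfrac{1}{m}\right\}, & \text{if } j_{ki} \neq 0, \\[2ex] \dfrac{1 - \sum_{l \in \{1..m\},\, j_{li} \neq 0} p_{il}}{\left|\{l \mid l \in \{1..m\},\ j_{li} = 0\}\right|}, & \text{otherwise} \end{cases} \tag{6}$$

where

$$X_i ::= \sum_{k=1}^{m} \prod_{l=1}^{n} X_l^{j_{kl}}$$

represents

$$\begin{array}{rl}
X_i ::= & (T_1\ \underbrace{X_1 \ldots X_1}_{j_{11}} \ldots \underbrace{X_l \ldots X_l}_{j_{1l}} \ldots \underbrace{X_n \ldots X_n}_{j_{1n}}) \\
& \vdots \\
\mid & (T_k\ \underbrace{X_1 \ldots X_1}_{j_{k1}} \ldots \underbrace{X_l \ldots X_l}_{j_{kl}} \ldots \underbrace{X_n \ldots X_n}_{j_{kn}}) \\
& \vdots \\
\mid & (T_m\ \underbrace{X_1 \ldots X_1}_{j_{m1}} \ldots \underbrace{X_l \ldots X_l}_{j_{ml}} \ldots \underbrace{X_n \ldots X_n}_{j_{mn}})
\end{array}$$

in BNF notation, where each $T_k$ is the tag that distinguishes the alternatives from each other.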

The background ideas of Algorithm 1 are:

• to keep the probability of selecting an alternative in which $X_i$ occurs recursively $j$ times below $1/j$, and

• to share the remaining probability among the remaining alternatives, because finite programs should have base cases.
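The probabilities of Eq. 6 for a single rule can be computed as in the following sketch. It assumes the rule is described only by the list of counts $j_{ki}$ (how often $X_i$ itself occurs in alternative $k$); the function name `alternative_probabilities` and this input encoding are choices made for the illustration, not part of Algorithm 1.

```python
from fractions import Fraction

def alternative_probabilities(self_counts):
    """Probabilities p_ik of Eq. 6 for one rule X_i.

    self_counts[k] is j_ki, the number of times X_i occurs in alternative k.
    """
    m = len(self_counts)
    base_cases = sum(1 for j in self_counts if j == 0)
    if base_cases == 0:
        raise ValueError("the rule needs at least one non-recursive alternative")
    # Recursive alternatives: min{1/(j+1), 1/m}.
    recursive = {
        k: min(Fraction(1, j + 1), Fraction(1, m))
        for k, j in enumerate(self_counts) if j != 0
    }
    # Base cases share whatever probability is left.
    remaining = Fraction(1) - sum(recursive.values())
    return [recursive.get(k, remaining / base_cases) for k in range(m)]

# Exp ::= XNat | (lambda Exp) | (Exp Exp) has self-recursion counts 0, 1, 2.
exp_probs = alternative_probabilities([0, 1, 2])
```

For the Exp rule this yields 1/3 for each alternative, which differs from the hand-chosen values of Eq. 4; both assignments keep the recursive alternatives rare enough.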

Note that there can be grammatically correct but type-incorrect programs, as well as non-terminating programs in the case of Turing machines. Such programs should be selected with probability $0$, and the actual probabilities of valid programs should be the values defined by Eq. 6 divided by the sum of the probabilities of valid programs. Operationally, this is equivalent to retrying the generation of a syntax tree whenever Algorithm 1 results in an invalid program. This causes no problem, because the probability of generating a valid program is a constant, even though its value is unknown, and the result of Eq. 2 is not affected by multiplication by a constant.

#### 3.1.3 UAI: Our Generalized AIXI

UAI, or Unlimited AI, is our generalization of AIXI for supporting the above extensions. A UAI has the following parameters:

• the non-empty finite set of actions ,

• the non-empty finite set of observations ,

• the finite and bounded set of rational rewards where and ,

• the horizon function such that ,

• the model of computation , which can be viewed as a set of programs taking a lazy list of actions and returning a lazy list of perceptions,

• the prior distribution function

UAI denotes UAI with the above parameters. The definition of the set of perceptions and the projection function are the same as those of AIXI’s.

The action value function of UAI is defined as follows:

$$Q(\text{æ}_{1..k-1}, a) = \sum_{e_k \in E} \max_{a_{k+1} \in A} \sum_{e_{k+1} \in E} \max_{a_{k+2} \in A} \cdots \max_{a_{m(k)} \in A} \sum_{e_{m(k)} \in E} \tag{7}$$

where

denotes the condition that takes as the input, returns as the output, …, takes as the input, and returns as the output, if is an interactive program such as that of monotone Turing machines and implementation using lazy I/O. can be a model of computation without interaction; in such cases, such that

$$\begin{aligned} q'(a_1, []) &= e_1 \\ q'(a_{1..2}, e_1) &= e_2 \\ &\ \ \vdots \\ q'(a_{1..m(k)}, e_{1..m(k)-1}) &= e_{m(k)} \end{aligned}$$

has to be implemented, which requires recomputation of the states of the environment.

Based on the action value function defined above, UAI always chooses the best action, i.e., the action $a_k$ at time $k$ after the interaction history $\text{æ}_{1..k-1}$ such that

$$a_k = \arg\max_{a \in A} Q(\text{æ}_{1..k-1}, a) \tag{8}$$

AIXI can be represented as UAI using the set of monotone Turing machines and the universal prior $\xi$. Although AIXI, unlike UAI, does not explicitly limit the set of rewards to rational numbers, the papers on AIXI do not discuss how real numbers are to be represented.
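For intuition, the alternating sum and max structure of the action value function can be sketched for a finite set of terminating environment models. The sketch assumes the usual AIXI summand from [2], namely the total reward from the current step to the horizon multiplied by the total prior weight of the models consistent with the history. The names `consistent`, `action_value`, and the toy environments are hypothetical, and a real UAI sums over infinitely many programs rather than a finite list.

```python
from fractions import Fraction

def consistent(model, history):
    """True if the model reproduces every perception in the history."""
    actions = [a for a, _ in history]
    return all(model(actions[:i + 1]) == e for i, (_, e) in enumerate(history))

def action_value(history, action, models, actions, percepts, reward, horizon):
    """Expectimax value of `action` after `history`, looking ahead to `horizon`."""
    def value(hist, a, total_reward):
        result = Fraction(0)
        for e in percepts:                      # sum over perceptions
            new_hist = hist + [(a, e)]
            new_total = total_reward + reward(e)
            if len(new_hist) == horizon:
                weight = sum(w for q, w in models if consistent(q, new_hist))
                result += new_total * weight
            else:                               # max over the next action
                result += max(value(new_hist, b, new_total) for b in actions)
        return result
    return value(list(history), action, Fraction(0))

# Two toy environments: one rewards action "L", the other rewards "R".
likes_left = lambda acts: ("o", 1 if acts[-1] == "L" else 0)
likes_right = lambda acts: ("o", 1 if acts[-1] == "R" else 0)
models = [(likes_left, Fraction(1, 2)), (likes_right, Fraction(1, 4))]
percepts = [("o", 0), ("o", 1)]
reward = lambda e: e[1]

q_left = action_value([], "L", models, ["L", "R"], percepts, reward, horizon=1)
q_right = action_value([], "R", models, ["L", "R"], percepts, reward, horizon=1)
```

Each model here is a total function from the action history to the next perception, which mirrors the non-interactive case where the environment state is recomputed from the history at every step.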

#### 3.1.4 Computability of UAI

Is UAI implementable using exact real arithmetic? Our conclusion is that the action values are digit-wise-computable, and that the only part that can cause an infinite loop is the $\arg\max$ operation. One may think that the infinite summation can also cause an infinite loop, but this is not the case if the set of rewards is bounded and does not change with time: the summation is then bounded by a geometric series, and thus more and more digits become fixed, starting from the most significant ones, as the computation proceeds.

The following Theorem 1 clarifies the above claim.

###### Theorem 1 (Computability of the action value function of UAI).

The action value function of UAI defined by Eq. 7 is digit-wise-computable if holds and is a model of terminating computation.

See Appendix A.4 for the proof of Theorem 1.

This theorem shows that the action value function of UAI is computable to arbitrary precision, provided that the set of possible environments is a set of total functions. We could not prove, for general priors such as AIXI's $\xi$, that the $\arg\max$ operation over the exact action values is computable, because that operation involves comparisons between real numbers that can be exactly equal. The reader should notice that action selection in Q-learning also involves such an $\arg\max$ operation over real numbers. In other words, this theorem proves that UAI is as computable as Q-learning.

Theorem 1 is proved for the case of ; thanks to the following Lemma 1, this does not limit the applicability of the theorem.

###### Lemma 1 (Linearity).

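For a positive real number $p$ and a real number $s$, let $Q'$ be the value obtained by replacing each reward $r$ in Eq. 7 with $r' = pr + s$, i.e.,

$$Q'(\text{æ}_{1..k-1}, a) = \sum_{e_k \in E} \max_{a_{k+1} \in A} \sum_{e_{k+1} \in E} \max_{a_{k+2} \in A} \cdots \max_{a_{m(k)} \in A} \sum_{e_{m(k)} \in E} \tag{9}$$

Then, for a real number $c(\text{æ}_{1..k-1})$,

$$Q'(\text{æ}_{1..k-1}, a) = p\,Q(\text{æ}_{1..k-1}, a) + s\,c(\text{æ}_{1..k-1}) \tag{10}$$

holds.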

See Appendix A.4 for the proof of Lemma 1.

###### Corollary 1.

For any positive real number $p$ and any real number $s$, the value of Eq. 8 does not change when each reward $r$ in Eq. 7 is replaced with $r' = pr + s$. In other words, for $Q'$ defined by Eq. 9,

$$\arg\max_{a \in A} Q(\text{æ}_{1..k-1}, a) = \arg\max_{a \in A} Q'(\text{æ}_{1..k-1}, a)$$

###### Proof.

Self-explanatory from Lemma 1. ∎

Thanks to lazy evaluation, UAI automatically omits unnecessary executions of environment candidate programs that do not affect the result of the $\arg\max$ operation, although it is unpredictable how much computation time is saved. Moreover, if there is a deadline for each interaction step, UAI can return the result at the precision it has reached by the deadline. It can maintain the set of candidate best actions based on the first digit of the action values, the set based on the first two digits, and so on, and randomly select from the most precise set available at the deadline.
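The deadline-based selection can be sketched as follows, assuming each action value is available as a stream of signed binary digits (the FPRBSR of Appendix A.1), so that after $n$ digits the true value lies within $2^{-n}$ of the prefix value. The names `candidate_actions`, `prefix_value`, and `stream` are hypothetical helpers for this illustration.

```python
from fractions import Fraction
from itertools import chain, islice, repeat

def prefix_value(digits):
    """Value of a finite prefix of signed binary digits."""
    return sum(Fraction(d, 2 ** (i + 1)) for i, d in enumerate(digits))

def candidate_actions(value_streams, n):
    """Actions that can still be optimal once n digits of each value are known."""
    radius = Fraction(1, 2 ** n)  # the unseen digits change a value by at most this
    approx = {a: prefix_value(islice(make(), n)) for a, make in value_streams.items()}
    best_lower_bound = max(v - radius for v in approx.values())
    return {a for a, v in approx.items() if v + radius >= best_lower_bound}

def stream(*prefix):
    """Hypothetical helper: a factory for a digit stream that continues with zeros."""
    return lambda: chain(prefix, repeat(0))

# Action values 1/4, 1/8 and -1/2 as digit streams.
values = {"L": stream(0, 1), "R": stream(0, 1, -1), "U": stream(-1)}
```

With one digit all three actions remain candidates, with two digits "U" is ruled out, and with five digits only "L" remains; an agent with a deadline would pick at random from whichever set it had reached.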

### 3.2 UCAI: A Fully-Computable Subset of UAI

Now we show that UAI can be made implementable by selecting an adequate prior distribution.

As we have seen in the previous section, the only part that can cause an infinite loop is the $\arg\max$ operation. Because AIXI assigns less plausibility to longer programs, it is possible (and in fact it is really the case) that only a finite number of the most plausible programs affect the decision made by the $\arg\max$ operator over the action values, while the remaining infinitely many programs do not affect it. The action values need to be computed only to the precision at which we can tell them apart. All we have to take care of is never to compare two exactly equal values, because comparing equal values up to the precision at which they differ means an infinite loop.

Thus, we can concentrate on avoiding comparison between action values which may be the same. We can do two things:

• compare two values that are known to be different beforehand, and tell which is the greater;

• skip comparison of two values that are known to be the same beforehand, and say that they are the same.

In other words, it is enough to know whether two action values are the same or not before comparison.

Our solution is inspired by the fact that

$$(k, l) \neq (k', l') \Rightarrow k\sqrt{m} + l\sqrt{n} \neq k'\sqrt{m} + l'\sqrt{n}$$

holds for positive rational numbers $k$, $l$, $k'$, and $l'$, and positive coprime integers $m$ and $n$ that are not both perfect squares. Our solution avoids accidental coincidence between action values by

• slightly modifying the prior distribution so that it consists of irrational numbers and satisfies some conditions,

• limiting rewards to rational numbers, and

• limiting rewards to positive numbers in order to prevent the sum of rewards from happening to be 0.

#### 3.2.1 Priors for Making Difference in Values

This section gives an example way of assigning prior probabilities based on the syntactic structure.

In the case of BitStream defined in Eq. 3, we can assign the prior in the following way. (The number $2^{-64}$ is not important; it is used only to show that the difference can be made insignificant for those who ignore the difference between fixed-precision floating-point approximations and real numbers.)

$$\mathrm{BitStream} = \mathrm{BS}(1) \tag{11}$$

$$\mathrm{BS}(i) = \begin{cases} 0 : \mathrm{BS}(2i), & \text{w.p. } 1/2 - 2^{-64} d(2i), \\ 1 : \mathrm{BS}(2i+1), & \text{w.p. } 1/2 - 2^{-64} d(2i+1), \\ \bot, & \text{w.p. } 2^{-64} d(2i) + 2^{-64} d(2i+1) \end{cases} \tag{12}$$

Cases where the resulting program syntax includes $\bot$ are dealt with in the same way as invalid programs with incorrect types: a new program syntax is redrawn, so the actual probabilities are the above values renormalized over the programs that do not include $\bot$.

The function $d$ must satisfy the following three conditions:

1. for all ,

2. for all ,

3. $d$ is digit-wise-computable.

Note that the syntax tree branches with different irrational probabilities at each node.

An example of $d$ satisfying the above three conditions is

$$d(i) = \frac{1}{\sqrt{p_i}}, \quad \text{where } p_i \text{ is the } i\text{-th prime number}$$

How to calculate square roots of positive integers by hand is well known, and the result is digit-wise-computable even when using usual (non-redundant) binary numbers; we can just regard the resulting binary number as a redundant one.
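The binary digits of the reciprocal square root of a prime can be obtained with integer arithmetic only, as in the following sketch. It recomputes the whole prefix for each requested length instead of streaming digit by digit, which is a simplification made for brevity; the function name `inv_sqrt_bits` is hypothetical.

```python
from math import isqrt

def inv_sqrt_bits(p, n):
    """First n binary digits after the point of 1/sqrt(p), for an integer p >= 2."""
    # floor(2**n / sqrt(p)) equals isqrt(4**n // p), so only integers are needed.
    scaled = isqrt(4 ** n // p)
    return [(scaled >> (n - 1 - i)) & 1 for i in range(n)]

bits_for_two = inv_sqrt_bits(2, 8)
```

Because the digits are obtained by truncation, a longer prefix always extends a shorter one, so digits that have been emitted never change.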

The following Algorithm 2 is a generalization of the above Algorithm 1 to a more general case of polynomial grammar without mutual recursion.

###### Algorithm 2.

Modify the grammar rule

$$X_i ::= \sum_{k=1}^{m} \prod_{l=1}^{n} X_l^{j_{kl}}$$

to

$$X_i ::= X'_i(1) \tag{13}$$

$$X'_i(h) ::= \sum_{k=1}^{m} \prod_{l=1}^{n} X'_l(mh+k-1)^{j_{kl}} \tag{14}$$

and select the th alternative of with probability using defined in Eq. 6, and select with the remaining probability .

Now we can prove the following theorem:

###### Theorem 2 (Computability of UCAI).

Let be a model of computation that only includes terminating programs and can condition on a value. Then, UAI is computable by using the prior distribution defined by Algorithm 2.

See Appendix A.4 for the proof of Theorem 2.

## 4 Conclusions

This paper proposed AIXI variants that support a broader class of environments than AIXItl, and proved their computability.

When considering real-world interaction, the processing time for each time step should be considered limited, even if we permit the discrete-time model. In this sense, a timeout is a natural idea, and it is understandable to limit the program length, considering that the information accessible within a limited time is limited. However, considering that the real world, with its vast space, is highly parallel, AIXItl, which models the environment using a sequential model of computation, is not necessarily the best choice.

On the other hand, UCAI needs to simulate the environment many times at each interaction step. It is unnatural to think that a UCAI agent, as a computer, is as parallel as the environment.

Still, we think UAI is more powerful than AIXItl even when there is a deadline at each interaction step. In the case of AIXItl, each environment candidate program times out after a fixed time limit, and the number of candidate programs is fixed by the length bound. In the case of UAI, on the other hand, there is no timeout for each environment candidate; a UAI agent only needs to time out at the actual deadline, and thus more programs can be tried. The main advantage of UAI comes from the fact that, thanks to lazy evaluation, UAI automatically omits the execution of longer environment candidate programs when shorter ones already cause a difference in the action values.

Another expected merit of this paper is freeing AIXI from Turing machines. Although Turing machines are well known to all computer scientists, we have never met anyone who usually programs in a language similar to Turing machines. We think that this fact discourages researchers from doing theoretical research on universal AI models. By enabling the use of practical programming languages as the model of computation, we expect that more researchers will be attracted to this field.

One question is whether we need to use a Turing complete model for modeling the behavior of the environment at each time step, i.e., whether there is a case where AIXI should be used over UCAI. We think UCAI is enough for the following reasons:

• indeed, this world can simulate Turing machines with finite tape, but expecting them to terminate within one interaction step is a stupid idea; it is enough to be able to simulate Turing machines if there are tapes in the environment; for this purpose, it is enough to be able to compute finite-to-finite map at each time step, not requiring loops;

• AIXI does not adequately model the real world because it permits incomputable agent.

## References

• [1] Hans Boehm and Robert Cartwright. Exact real arithmetic formulating real numbers as functions. In David A. Turner, editor, Research Topics in Functional Programming, pages 43–64. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1990.
• [2] Marcus Hutter. Universal algorithmic intelligence: A mathematical topdown approach. In B. Goertzel and C. Pennachin, editors, Artificial General Intelligence, Cognitive Technologies, pages 227–290. Springer, Berlin, 2007.
• [3] Susumu Katayama. Ideas for a reinforcement learning algorithm that learns programs. In Artificial General Intelligence - 9th International Conference, AGI 2016, New York, USA, July 16–19, 2016, Proceedings, pages 354–362, 2016.
• [4] Yura Perov and Frank Wood. Learning probabilistic programs. arXiv preprint arXiv:1407.2646, 2014.
• [5] Dave Plume. A Calculator for Exact Real Number Computation. PhD thesis, University of Edinburgh, 1998.
• [6] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
• [7] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. In Machine Learning, pages 279–292, 1992.

## Appendix A The Detailed Proofs

### A.1 Definitions

###### Definition 5 (Fixed point redundant binary stream representation, FPRBSR).

Let $x \in [-1, 1]$. A fixed point redundant binary stream representation, or FPRBSR, of $x$ is a stream $x_\infty = x_1 : x_2 : x_3 : \cdots$ where

$$x = \sum_{i=1}^{\infty} 2^{-i} x_i$$

and

$$x_i \in \{-1, 0, 1\}$$

hold for each $i$.

denotes the interpretation of the given FPRBSR, or conversion to .

holds.
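The interpretation of an FPRBSR can be illustrated on finite prefixes, as in the following sketch. The function name `prefix_value` is a hypothetical helper; an actual FPRBSR is infinite, and a prefix of length $n$ pins the value down only to within $2^{-n}$.

```python
from fractions import Fraction

def prefix_value(digits):
    """Sum of 2**-i * x_i over a finite prefix x_1, x_2, ... of signed digits."""
    assert all(d in (-1, 0, 1) for d in digits)
    return sum(Fraction(d, 2 ** i) for i, d in enumerate(digits, start=1))

# Redundancy: different digit sequences can denote the same number.
a = prefix_value([0, 1])     # 0/2 + 1/4
b = prefix_value([1, -1])    # 1/2 - 1/4
```

Both prefixes denote 1/4, which is exactly the redundancy that lets a leading digit be fixed before all later digits are known.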

###### Definition 6 (Comparison).

Comparison of two values is either $(<)$, $(=)$, or $(>)$, where ‘values’ can be digits, FPRBSR’s, or tuples of them. Comparison (except that of FPRBSR’s) is defined in the following way, using the usual order relations $<$ and $>$ and the equality relation $=$:

$$x \mathbin{?} y = \begin{cases} (<), & \text{if } x < y, \\ (=), & \text{if } x = y, \\ (>), & \text{if } x > y \end{cases} \tag{15}$$

### A.2 Digit-wise Computabilities of Operations on FPRBSR’s

We firstly define the average of two FPRBSR’s instead of the sum of them, because the average may not overflow.

The average of two FPRBSR’s is known to be digit-wise-computable.

###### Lemma 2 (Average is digit-wise-computable).

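If $x_\infty$ and $y_\infty$ are both digit-wise-computable FPRBSR’s, then their average $x_\infty \oplus y_\infty$ is also a digit-wise-computable FPRBSR.

###### Proof.

The following algorithm computes $x_\infty \oplus y_\infty$ as an FPRBSR:

$$(x_1 : x_2 : x_{3\infty}) \oplus (y_1 : y_2 : y_{3\infty}) = \begin{cases} \frac{x_1 + y_1}{2} : c : s, & \text{if } x_1 + y_1 \in \{-2, 0, 2\} \\ -1 : c + 1 : s, & \text{if } x_1 + y_1 = -1 \text{ and } x_2 + y_2 < 0 \\ 0 : c - 1 : s, & \text{if } x_1 + y_1 = -1 \text{ and } x_2 + y_2 \geq 0 \\ 0 : c + 1 : s, & \text{if } x_1 + y_1 = 1 \text{ and } x_2 + y_2 < 0 \\ 1 : c - 1 : s, & \text{if } x_1 + y_1 = 1 \text{ and } x_2 + y_2 \geq 0 \end{cases}$$

where $c : s = (x_2 : x_{3\infty}) \oplus (y_2 : y_{3\infty})$.

$\oplus$ defined above is digit-wise-computable because each recursive call of $\oplus$ determines some digits. ∎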
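The averaging algorithm can be sketched with Python generators. The sketch assumes that the leading digit of the recursive call on the tails is what gets adjusted by $+1$ or $-1$; unrolling that recursion gives a single loop that carries one pending correction from each digit to the next. The names `average`, `stream`, and `prefix_value` are hypothetical helpers.

```python
from fractions import Fraction
from itertools import chain, islice, repeat

def average(xs, ys):
    """Digit stream of (x + y) / 2 for two infinite signed-binary digit streams."""
    sums = (x + y for x, y in zip(xs, ys))
    cur = next(sums)
    carry = 0  # correction owed to the next output digit (the c+1 / c-1 cases)
    for nxt in sums:
        if cur % 2 == 0:
            digit, owed = cur // 2, 0
        elif nxt < 0:
            digit, owed = (cur - 1) // 2, 1
        else:
            digit, owed = (cur + 1) // 2, -1
        yield digit + carry
        cur, carry = nxt, owed

def stream(*prefix):
    """Hypothetical helper: the given digits followed by zeros forever."""
    return chain(prefix, repeat(0))

def prefix_value(digits):
    return sum(Fraction(d, 2 ** i) for i, d in enumerate(digits, start=1))

# Average of 1/2 and 3/4.
digits = list(islice(average(stream(1), stream(1, 1)), 8))
```

Each output digit needs only one digit of lookahead on the inputs, which is the sense in which every step of the recursion determines a digit.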

###### Lemma 3 (Sum is digit-wise-computable).

If and are both digit-wise-computable FPRBSR’s and their average is within , then their sum is also a digit-wise-computable FPRBSR.

###### Proof.

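FPRBSR’s within $[-1/2, 1/2]$ can be doubled by the following digit-wise-computable function $\mathrm{double}$, where $f(x_\infty)$ represents $1 + x$ and $g(x_\infty)$ represents $-1 + x$.

$$\begin{aligned}
\mathrm{double}(0 : x_{2\infty}) &= x_{2\infty} \\
\mathrm{double}(1 : x_{2\infty}) &= f(x_{2\infty}) \\
\mathrm{double}(-1 : x_{2\infty}) &= g(x_{2\infty}) \\
f(0 : x_{2\infty}) &= 1 : f(x_{2\infty}) \\
f(-1 : x_{2\infty}) &= 1 : x_{2\infty} \\
g(0 : x_{2\infty}) &= -1 : g(x_{2\infty}) \\
g(1 : x_{2\infty}) &= -1 : x_{2\infty}
\end{aligned}$$

Thus, $x + y$ can be computed as $\mathrm{double}(x_\infty \oplus y_\infty)$. ∎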
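A hedged Python sketch of the doubling step follows. It reads $f$ as "add 1" and $g$ as "subtract 1", so a leading 0 inside $f$ emits the digit 1 and a leading 0 inside $g$ emits $-1$. The function name `double` mirrors the text, but the single `sign` variable that merges $f$ and $g$, and the handling of the case that the text does not list, are assumptions of this sketch.

```python
from itertools import chain, islice, repeat

def double(xs):
    """Yield the digits of 2x, assuming the stream xs denotes x in [-1/2, 1/2]."""
    xs = iter(xs)
    first = next(xs)
    if first == 0:
        yield from xs              # double(0 : t) = t
        return
    sign = first                   # +1 selects f (1 + t), -1 selects g (-1 + t)
    for d in xs:
        if d == 0:
            yield sign             # f(0 : t) = 1 : f(t),  g(0 : t) = -1 : g(t)
        elif d == -sign:
            yield sign             # f(-1 : t) = 1 : t,    g(1 : t) = -1 : t
            yield from xs
            return
        else:
            # Assumption (case not listed in the text): within the stated range
            # the remaining digits must all be -sign, so the result is exactly
            # +1 or -1, written as an endless run of the digit `sign`.
            yield from repeat(sign)

three_eighths = chain([1, 0, -1], repeat(0))            # 1/2 - 1/8
doubled = list(islice(double(three_eighths), 4))        # [1, 1, 0, 0], i.e. 3/4
```

Under the precondition of Lemma 3, a sum would then be obtained by applying `double` to the digit stream of the average.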

Likewise, multiplication of redundant binary representations is known to be computable.[5]

###### Lemma 4 (Product is digit-wise-computable).

If $x_\infty$ and $y_\infty$ are both digit-wise-computable FPRBSR’s, then their product is also a digit-wise-computable FPRBSR.

We also need to show that the maximum of two FPRBSR’s is digit-wise-computable.

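For non-redundant binary $x$ and $y$, $\max\{x, y\}$ can be computed by emitting the digits on which $x$ and $y$ agree until we know which is greater, and then using the digits of the greater of the two. This algorithm does not work correctly for redundant binary $x_\infty$ and $y_\infty$, because we cannot tell which is greater by only comparing digits (e.g. $1 : -1 : z_\infty$ and $0 : 1 : z_\infty$ represent the same value although their first digits differ).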

Our solution uses the fact that

$$\max\{x, y\} = \frac{x + y + |x - y|}{2} = \frac{x + y + |x + (-y)|}{2}$$

We need to show that negation and taking the absolute value are digit-wise-computable.

###### Lemma 5 (Negation is digit-wise-computable).

If $x_\infty$ is a digit-wise-computable FPRBSR, then its negation is also a digit-wise-computable FPRBSR.

###### Proof.

The following algorithm computes $-x$ as an FPRBSR:

$$-_\infty(y : y_\infty) = (-y) : -_\infty(y_\infty)$$

∎

###### Lemma 6 (The absolute value is digit-wise-computable).

If $x_\infty$ is a digit-wise-computable FPRBSR, then its absolute value is also a digit-wise-computable FPRBSR.

###### Proof.

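The following algorithm computes $|x|$ as an FPRBSR:

$$\begin{aligned}
|0 : y_\infty|_\infty &= 0 : |y_\infty|_\infty \\
|1 : y_\infty|_\infty &= 1 : y_\infty \\
|-1 : y_\infty|_\infty &= 1 : -_\infty(y_\infty)
\end{aligned}$$

∎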
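The rules of Lemmas 5 and 6 translate almost directly into Python generators. The sketch below assumes infinite digit iterators as input; the function names `negate` and `absolute` are illustrative and not taken from the paper.

```python
from itertools import chain, islice, repeat

def negate(xs):
    """Negation flips every digit: -(y : ys) = (-y) : -(ys)."""
    for d in xs:
        yield -d

def absolute(xs):
    """Absolute value: skip leading zeros, then fix the sign of the remainder."""
    xs = iter(xs)
    for d in xs:
        if d == 0:
            yield 0                   # |0 : ys| = 0 : |ys|
        else:
            yield 1                   # the first nonzero digit becomes 1
            if d == 1:
                yield from xs         # |1 : ys| = 1 : ys
            else:
                yield from negate(xs) # |-1 : ys| = 1 : -(ys)
            return

minus_eighth = chain([0, -1, 1], repeat(0))          # -1/4 + 1/8 = -1/8
abs_digits = list(islice(absolute(minus_eighth), 4)) # [0, 1, -1, 0], i.e. 1/8
```

The absolute value relies on the fact that the first nonzero digit fixes the sign of the whole stream up to equality with zero: a leading $-1$ can never be followed by digits that make the value strictly positive.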

###### Lemma 7 (The binary max is digit-wise-computable).

If $x_\infty$ and $y_\infty$ are both digit-wise-computable FPRBSR’s, then their maximal value is also a digit-wise-computable FPRBSR.

###### Proof.

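Since

$$\frac{\max\{x, y\}}{2} = (x \oplus y) \oplus |x \oplus (-y)|,$$

$\max\{x, y\}/2$ is digit-wise-computable by Lemmas 2, 5, and 6.

Since $\max\{x, y\}$ is either $x$ or $y$, it can be represented in FPRBSR, and $\max\{x, y\}/2 \in [-1/2, 1/2]$. Therefore, $\max\{x, y\}$ can be computed as

$$\mathrm{double}\bigl((x \oplus y) \oplus |x \oplus (-y)|\bigr).$$

∎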

###### Lemma 8 (max over sets is digit-wise-computable.).

If $X$ is a finite non-empty set of digit-wise-computable FPRBSR’s, then $\max X$ is also a digit-wise-computable FPRBSR.

###### Proof.

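Let $x_\infty$ and $y_\infty$ be FPRBSR’s.

If $X$ is a finite and non-empty set of digit-wise-computable FPRBSR’s, $\max X$ can be computed by a finite number of binary $\max$ operations in the following way:

$$\begin{aligned}
\max\{x_\infty\} &= x_\infty && (16) \\
\max(\{x_\infty, y_\infty\} + Y) &= \max(\{\max\{x_\infty, y_\infty\}\} + Y) && (17)
\end{aligned}$$

where $+$ over sets denotes the direct sum of two sets. ∎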

###### Lemma 9 (Comparison of two different reals is computable).

If $x_\infty$ and $y_\infty$ are both digit-wise-computable FPRBSR’s and we know $x \neq y$ beforehand, then their comparison $x_\infty \mathbin{?} y_\infty$ is computable.

###### Proof.

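$x_\infty \oplus (-_\infty y_\infty)$ is digit-wise-computable. The comparison can be computed as $\mathrm{cmp}(x_\infty \oplus (-_\infty y_\infty))$ using $\mathrm{cmp}$ defined as follows:

$$\begin{aligned}
\mathrm{cmp}(0 : x_\infty) &= \mathrm{cmp}(x_\infty) && (18) \\
\mathrm{cmp}(-1 : x_\infty) &= (<) && (19) \\
\mathrm{cmp}(1 : x_\infty) &= (>) && (20)
\end{aligned}$$

Since $x \neq y$, the stream contains a nonzero digit, so $\mathrm{cmp}$ terminates, and the first nonzero digit determines the sign of a nonzero value. ∎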
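A minimal Python sketch of $\mathrm{cmp}$ is given below. It takes the digit stream of the halved difference $x \oplus (-y)$ as input, which is assumed to have been produced elsewhere by the averaging and negation steps. The name `cmp_sign` and the string results are illustrative choices, and the function does not terminate on an all-zero infinite stream, matching the requirement that $x \neq y$ is known beforehand.

```python
from itertools import chain, repeat

def cmp_sign(diff):
    """Return '<' or '>' from the first nonzero digit of a difference stream."""
    for d in diff:
        if d < 0:
            return '<'    # cmp(-1 : xs) = (<)
        if d > 0:
            return '>'    # cmp(1 : xs) = (>)
        # d == 0: cmp(0 : xs) = cmp(xs), keep scanning
    raise ValueError("finite stream ended without a nonzero digit")

below = cmp_sign(chain([0, 0, -1, 1], repeat(0)))   # -1/8 + 1/16 < 0
above = cmp_sign(chain([0, 1, -1], repeat(0)))      # 1/4 - 1/8 > 0
```

The precondition is essential: the stream $-1 : 1 : 1 : 1 : \dots$ denotes exactly $0$, yet its first nonzero digit is $-1$, so the scan is only trustworthy when equality has been ruled out.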

###### Lemma 10 (argmax of a monomorphism over a finite set is computable.).

If $X$ is a finite non-empty set and $f$ is a monomorphism which is digit-wise-computable using FPRBSR, then $\arg\max_{x \in X} f(x)$ is computable.

###### Proof.

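Let $f_\infty$ be a function that computes $f$ digit-wise, i.e., $f_\infty(x)$ is an FPRBSR of $f(x)$.

$\arg\max_{x \in X} f(x)$ can be computed by the following algorithm:

$$\begin{aligned}
\mathop{\arg\max}_{x \in \{a\}} f(x) &= a && (21) \\
\mathop{\arg\max}_{x \in \{a, b\} + Y} f(x) &= \begin{cases}
\mathop{\arg\max}_{x \in \{a\} + Y} f(x), & \text{if } f_\infty(a) \mathbin{?} f_\infty(b) = (>) \\
\mathop{\arg\max}_{x \in \{b\} + Y} f(x), & \text{if } f_\infty(a) \mathbin{?} f_\infty(b) = (<)
\end{cases} && (22)
\end{aligned}$$

The case $(=)$ cannot occur for $a \neq b$ because $f$ is a monomorphism, so each comparison is computable by Lemma 9. ∎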

### A.3 Digit-wise-computability of Diminishing Series

In this section we provide lemmas and their proofs about digit-wise-computability of infinite series of positive values diminishing exponentially. Their main purpose is to be applied to the infinite summation in Eq. 7.

###### Lemma 11 (Diminishing series by $2^{-2}$ are digit-wise-computable.).

If holds and is a digit-wise-computable FPRBSR for all positive integer , then, is digit-wise-computable.

Note that forms a stream of FPRBSR’s, i.e., a stream of streams of digits.

###### Proof.

We show that is digit-wise-computable as using the function defined as follows:

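$$\begin{aligned}
S((1 : s_\infty) : x_{\infty\infty}) &= 0 : 1 : ((-1 : s_\infty) + S(x_{\infty\infty})) && (23) \\
S((s : s_\infty) : x_{\infty\infty}) &= 0 : 0 : ((s : s_\infty) + S(x_{\infty\infty})), \quad s \in \{-1, 0\} && (24)
\end{aligned}$$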

Note that the results of ’s of Eqs. 2324 fall within because and . Obviously, is digit-wise-computable.

Now we show that

From Eq. 23, we obtain

 (25)

Likewise, from Eq. 24, for

 (26)

From Eqs. 25 and 26,

 (27)

Thus, by applying Eq. 27 repeatedly,

 S(x∞(1):x∞(2):x∞(3):...)

###### Lemma 12.

Let a digit-wise-computable probability function over finite lists of ’s. Also, let a computable predicate over such lists. Then, is digit-wise-computable.

###### Proof.

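Since $\rho$ is a probability function,

$$\sum_{p} \rho(p) = 1$$

holds. Thus,

$$\sum_{p : P(p)} \rho(p) \le 1$$

By reorganizing the summations in order of increasing length, starting from the shortest, these two relations can be rewritten as

$$\begin{aligned}
\sum_{k=1}^{\infty} \sum_{p : l(p) = k,\, P(p)} \rho(p) &\le 1 && (28) \\
\sum_{k=1}^{\infty} \sum_{p : l(p) = k} \rho(p) &= 1 && (29)
\end{aligned}$$

where $l(p)$ denotes the length of $p$.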

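Now let

$$\begin{aligned}
R(i) &= \sum_{p : l(p) = i,\, P(p)} \rho(p) && (30) \\
R'(i) &= \sum_{p : l(p) = i} \rho(p) && (31)
\end{aligned}$$

Then,

$$\begin{aligned}
R(i) &\le R'(i) && (32) \\
\sum_{k=1}^{\infty} R(k) &\le 1 && (33) \\
\sum_{k=1}^{\infty} R'(k) &= 1 && (34)
\end{aligned}$$

and $R(i)$ and $R'(i)$ are digit-wise-computable because they consist of finite summations and digit-wise-computable computations. Let $R_\infty(i)$ and $R'_\infty(i)$ be FPRBSR representations of $R(i)$ and $R'(i)$ respectively.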

Because the left-hand side of Eq. 34 is well-defined,

$$\lim_{i \to \infty} \sum_{k=i}^{\infty} R'(k) = 0$$

holds. In other words, for any $\varepsilon > 0$ there exists a natural number $n$ that satisfies $\sum_{k=n}^{\infty} R'(k) < \varepsilon$. Thus, for some Skolem function $f$ that takes $\varepsilon$ and returns such a natural number $n$,

$$\sum_{k=f(2^{-2i})}^{\infty} R'(k) < 2^{-2i}$$

holds.

There are infinitely many candidates for $f$ because we need not choose the minimal $n$. Let

$$g(i) = f(2^{-2i})$$

Then, we can choose the following computable implementation of $g$:

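$$\begin{aligned}
(g(0), s_\infty(0)) &= (1, 0_\infty) && (35) \\
(g(i+1), s_\infty(i+1)) &= g'(i, g(i), s_\infty(i)), \quad i \ge 0 && (36) \\
g'(i, n, 0 : s'_\infty) &= g'(i, n+1, 0 : s'_\infty +_\infty R'_\infty(n)) && (37) \\
g'(i, n, 1 : s'_\infty) &= g'(i, n+1, 1 : s'_\infty +_\infty R'_\infty(n)), \quad \text{if } s'_\infty \text{ is not prefixed with } 2i \text{ 0's} && (38) \\
g'(i, n, 1 : s'_\infty) &= (n+1, 1 : s'_\infty), \quad \text{if } s'_\infty \text{ is prefixed with } 2i \text{ 0's} && (39)
\end{aligned}$$

where

$$0_\infty = 0 : 0_\infty$$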

If we define $x(k)$ as

$$x(k) = \sum_{i=g(k)}^{g(k+1)-1} R(i)$$

then $x$ is a sequence that diminishes at the rate of $2^{-2}$. Moreover, each $x(k)$ is digit-wise-computable because it can be computed by finitely many additions. Therefore, from Lemma 11,

$$\begin{aligned}
& \sum_{i=1}^{\infty} R(i) && (40) \\
&= \sum_{i=1}^{g(1)-1} R(i) + \sum_{i=g(1)}^{g(2)-1} R(i) + \sum_{i=g(2)}^{g(3)-1} R(i) + \dots + \sum_{i=g(k)}^{g(k+1)-1} R(i) + \dots && (41) \\
&= \sum_{i=1}^{g(1)-1} R(i) + x(1) + x(2) + \dots && (42)
\end{aligned}$$

is digit-wise-computable because it is a series diminishing by $2^{-2}$. ∎

### A.4 Proofs of the Main Theorems

###### Proof of Lemma 1.
 Q′(\ae1..k−1,a) =∑ek∈Emaxak+1∈A∑ek+1∈Emaxak+2∈A...maxam(k)∈A∑em(k)∈E (43) =∑ek∈Emaxak+1∈A∑ek+1∈Emaxak+2∈A...maxam(k)∈A∑em(k)∈E (44)

Now, for each there always exists only one , because is a model of terminating computation. Thus,

More generally,

Thus,

 Q′(\ae1..k−1,a) =p⎛⎜⎝∑ek∈Emaxak+1∈A∑ek+1∈Emaxak+2∈A...maxam(k)∈A∑em(k)∈E (45) (46)

Therefore, Eq. 10 holds for

###### Proof of Theorem 1.

From Lemma 12,

 (47)

is computable if is a model of terminating computation.

From Lemmas 3, 4, and 8, $Q$ defined by Eq. 7 is computable because the right-hand side of Eq. 7 consists only of addition, multiplication, maximization, and the right-hand side of Eq. 47. ∎

###### Proof of Theorem 2.

From Corollary 1 we can obtain an equivalent UAI algorithm satisfying if we know the lower bound and the upper bound of the set of rewards. (Footnote 2: Note that must not include unlike Theorem 1.) From Theorem 1, the action value function in the $\arg\max$ operation in Eq. 8 is digit-wise-computable. Since $\arg\max$ for a monomorphic function is computable from Lemma 10, it is enough to show that

 a↦Q(\ae1..k−1,a) (48)

is monomorphic under the condition requested by the theorem.

From Eq. 7, we obtain

 Q(\ae1..k−1,a) =∑ek∈