# A Minimum Relative Entropy Principle for Learning and Acting

###### Abstract

This paper proposes a method to construct an adaptive agent that is universal with respect to a given class of experts, where each expert is an agent that has been designed specifically for a particular environment. This adaptive control problem is formalized as the problem of minimizing the relative entropy of the adaptive agent from the expert that is most suitable for the unknown environment. If the agent is a passive observer, then the optimal solution is the well-known Bayesian predictor. However, if the agent is active, then its past actions need to be treated as causal interventions on the I/O stream rather than normal probability conditions. Here it is shown that the solution to this new variational problem is given by a stochastic controller called the Bayesian control rule, which implements adaptive behavior as a mixture of experts. Furthermore, it is shown that under mild assumptions, the Bayesian control rule converges to the control law of the most suitable expert.

Department of Engineering

University of Cambridge

Cambridge CB2 1PZ, UK

Daniel A. Braun dab54@cam.ac.uk

Department of Engineering

University of Cambridge

Cambridge CB2 1PZ, UK


Keywords: Artificial Intelligence, Minimum Relative Entropy Principle, Bayesian Control Rule, Interaction Sequences, Operation Modes.

## 1 Introduction

When the behavior of an environment under any control signal is fully known, the designer can choose an agent that produces the desired dynamics. (In accordance with the control literature, we use the terms agent and controller interchangeably. Similarly, the terms environment and plant are used synonymously.) Instances of this problem include hitting a target with a cannon under known weather conditions, solving a maze with its map in hand, and controlling a robotic arm in a manufacturing plant. However, when the behavior of the plant is unknown, the designer faces the problem of adaptive control. Examples include shooting the cannon without the appropriate measurement equipment, finding the way out of an unknown maze, and designing an autonomous robot for Martian exploration. Adaptive control turns out to be far more difficult than its non-adaptive counterpart. This is because any good policy has to carefully trade off explorative versus exploitative actions, i.e. actions for the identification of the environment’s dynamics versus actions to control it in a desired way. Even when the environment’s dynamics are known to belong to a particular class for which optimal agents are available, constructing the corresponding optimal adaptive agent is in general computationally intractable, even for simple toy problems (Duff, 2002). Thus, finding tractable approximations has been a major focus of research.

Recently, it has been proposed to reformulate the problem statement for some classes of control problems based on the minimization of a relative entropy criterion. For example, a large class of optimal control problems can be solved very efficiently if the problem statement is reformulated as the minimization of the deviation of the dynamics of a controlled system from the uncontrolled system (Todorov, 2006, 2009; Kappen et al., 2009). In this work, a similar approach is introduced. If a class of agents is given, where each agent solves a different environment, then adaptive controllers can be derived from a minimum relative entropy principle. In particular, one can construct an adaptive agent that is universal with respect to this class by minimizing the average relative entropy from the environment-specific agent.

However, this extension is not straightforward. There is a syntactical difference between actions and observations that has to be taken into account when formulating the variational problem. More specifically, actions have to be treated as interventions obeying the rules of causality (Pearl, 2000; Spirtes et al., 2000; Dawid, 2010). If this distinction is made, the variational problem has a unique solution given by a stochastic control rule called the Bayesian control rule. This control rule is particularly interesting because it translates the adaptive control problem into an on-line inference problem that can be applied forward in time. Furthermore, this work shows that under mild assumptions, the adaptive agent converges to the environment-specific agent.

The paper is organized as follows. Section 2 introduces notation and sets up the adaptive control problem. Section 3 formulates adaptive control as a minimum relative entropy problem. After an initial, naïve approach, the need for causal considerations is motivated. Then, the Bayesian control rule is derived from a revised relative entropy criterion. In Section 4, the conditions for convergence are examined and a proof is given. Section 5 illustrates the usage of the Bayesian control rule for the multi-armed bandit problem and the undiscounted Markov decision problem. Section 6 discusses properties of the Bayesian control rule and relates it to previous work in the literature. Section 7 concludes.

## 2 Preliminaries

In the following, both agent and environment are formalized as causal models over I/O sequences. Agent and environment are coupled to exchange symbols following a standard interaction protocol having discrete time, observation and control signals. The treatment of the dynamics is fully probabilistic, and in particular, both actions and observations are random variables, which is in contrast to the decision-theoretic agent formulation treating only observations as random variables (Russell and Norvig, 2003). All proofs are provided in the appendix.

#### Notation.

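A set is denoted by a calligraphic letter like $\mathcal{A}$. The words set & alphabet and element & symbol are used to mean the same thing respectively. Strings are finite concatenations of symbols and sequences are infinite concatenations. $\mathcal{A}^n$ denotes the set of strings of length $n$ based on $\mathcal{A}$, and $\mathcal{A}^* := \bigcup_{n \geq 0} \mathcal{A}^n$ is the set of finite strings. Furthermore, $\mathcal{A}^\infty := \{a_1 a_2 \ldots \mid a_i \in \mathcal{A} \text{ for all } i = 1, 2, \ldots\}$ is defined as the set of one-way infinite sequences based on the alphabet $\mathcal{A}$. Tuples are written with parentheses $(a_1, a_2, a_3)$ or as strings $a_1 a_2 a_3$. For substrings, the following shorthand notation is used: a string that runs from index $i$ to $k$ is written as $a_{i:k} := a_i a_{i+1} \ldots a_{k-1} a_k$. Similarly, $a_{\leq i} := a_1 a_2 \ldots a_i$ is a string starting from the first index. Also, symbols are underlined to glue them together like $\underline{ao}$ in $\underline{ao}_{\leq i} := a_1 o_1 a_2 o_2 \ldots a_i o_i$. The function $\log(x)$ is meant to be taken w.r.t. base 2, unless indicated otherwise.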

#### Interactions.

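The possible I/O symbols are drawn from two finite sets. Let $\mathcal{O}$ denote the set of inputs (observations) and let $\mathcal{A}$ denote the set of outputs (actions). The set $\mathcal{Z} := \mathcal{A} \times \mathcal{O}$ is the interaction set. A string $\underline{ao}_{\leq t}$ or $\underline{ao}_{<t} a_t$ is an interaction string (optionally ending in $a_t$ or $o_t$) where $a_k \in \mathcal{A}$ and $o_k \in \mathcal{O}$. Similarly, a one-sided infinite sequence $a_1 o_1 a_2 o_2 \ldots$ is an interaction sequence. The set of interaction strings of length $t$ is denoted by $\mathcal{Z}^t$. The sets of (finite) interaction strings and sequences are denoted as $\mathcal{Z}^*$ and $\mathcal{Z}^\infty$ respectively. The interaction string of length 0 is denoted by $\epsilon$.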

#### I/O system.

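Agents and environments are formalized as I/O systems. An I/O system $\mathbf{Pr}$ is a probability distribution over interaction sequences $\mathcal{Z}^\infty$. $\mathbf{Pr}$ is uniquely determined by the conditional probabilities

$$
\mathbf{Pr}(a_t \mid \underline{ao}_{<t}), \qquad \mathbf{Pr}(o_t \mid \underline{ao}_{<t} a_t) \tag{1}
$$

for each $\underline{ao}_{\leq t} \in \mathcal{Z}^*$. However, the semantics of the probability distribution $\mathbf{Pr}$ are only fully defined once it is coupled to another system.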

#### Interaction system.

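Let $\mathbf{P}$, $\mathbf{Q}$ be two I/O systems. An interaction system $(\mathbf{P}, \mathbf{Q})$ is a coupling of the two systems giving rise to the generative distribution $\mathbf{G}$ that describes the probabilities that actually govern the I/O stream once the two systems are coupled. $\mathbf{G}$ is specified by the equations

$$
\begin{aligned}
\mathbf{G}(a_t \mid \underline{ao}_{<t}) &:= \mathbf{P}(a_t \mid \underline{ao}_{<t}), \\
\mathbf{G}(o_t \mid \underline{ao}_{<t} a_t) &:= \mathbf{Q}(o_t \mid \underline{ao}_{<t} a_t),
\end{aligned}
$$

valid for all $\underline{ao}_{\leq t} \in \mathcal{Z}^*$. Here, $\mathbf{G}$ models the true probability distribution over interaction sequences that arises by coupling two systems through their I/O streams. More specifically, for the system $\mathbf{P}$, $\mathbf{P}(a_t \mid \underline{ao}_{<t})$ is the probability of producing action $a_t \in \mathcal{A}$ given history $\underline{ao}_{<t}$ and $\mathbf{P}(o_t \mid \underline{ao}_{<t} a_t)$ is the predicted probability of the observation $o_t \in \mathcal{O}$ given history $\underline{ao}_{<t} a_t$. Hence, for $\mathbf{P}$, the sequence $o_1 o_2 \ldots$ is its input stream and the sequence $a_1 a_2 \ldots$ is its output stream. In contrast, the roles of actions and observations are reversed in the case of the system $\mathbf{Q}$. Thus, the sequence $o_1 o_2 \ldots$ is its output stream and the sequence $a_1 a_2 \ldots$ is its input stream. This model of interaction is fairly general, and many other interaction protocols can be translated into this scheme. As a convention, given an interaction system $(\mathbf{P}, \mathbf{Q})$, $\mathbf{P}$ is an agent to be constructed by the designer, and $\mathbf{Q}$ is an environment to be controlled by the agent. Figure 1 illustrates this setup.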
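The coupling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the binary alphabets, the function names `agent_policy` and `env_response`, and all probabilities are hypothetical and do not come from the paper. The agent supplies the action probabilities, the environment supplies the observation probabilities, and alternating samples from the two yields an interaction string.

```python
import random

def sample(dist, rng):
    """Draw one symbol from a dict {symbol: probability}."""
    symbols = list(dist)
    return rng.choices(symbols, weights=[dist[s] for s in symbols], k=1)[0]

def agent_policy(history):
    """Hypothetical agent: repeats the last observation with probability 0.9."""
    if not history:
        return {0: 0.5, 1: 0.5}
    last_obs = history[-1][1]
    return {last_obs: 0.9, 1 - last_obs: 0.1}

def env_response(history, action):
    """Hypothetical environment: echoes the action with probability 0.8."""
    return {action: 0.8, 1 - action: 0.2}

def interact(policy, response, steps, seed=0):
    """Sample an interaction string a_1 o_1 ... a_T o_T from the coupled systems."""
    rng = random.Random(seed)
    history = []  # list of (action, observation) pairs
    for _ in range(steps):
        a = sample(policy(history), rng)       # action drawn from the agent
        o = sample(response(history, a), rng)  # observation drawn from the environment
        history.append((a, o))
    return history

trajectory = interact(agent_policy, env_response, steps=5)
print(trajectory)
```

Note that the agent's own predictions of the observations play no role in generating the stream; only its action probabilities and the environment's observation probabilities do.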

#### Control Problem.

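An environment $\mathbf{Q}$ is said to be known iff the agent $\mathbf{P}$ is such that for any $\underline{ao}_{\leq t} \in \mathcal{Z}^*$,

$$
\mathbf{P}(o_t \mid \underline{ao}_{<t} a_t) = \mathbf{Q}(o_t \mid \underline{ao}_{<t} a_t).
$$

Intuitively, this means that the agent “knows” the statistics of the environment’s future behavior under any past, and in particular, it “knows” the effects of given controls. If the environment is known, then the designer of the agent can build a custom-made policy into $\mathbf{P}$ such that the resulting generative distribution $\mathbf{G}$ produces interaction sequences that are desirable. This can be done in multiple ways. For instance, the controls can be chosen such that the resulting policy maximizes a given utility criterion; or such that the resulting trajectory of the interaction system stays close enough to a prescribed trajectory. Formally, if $\mathbf{Q}$ is known, and if the conditional probabilities $\mathbf{P}(a_t \mid \underline{ao}_{<t})$ for all $\underline{ao}_{\leq t} \in \mathcal{Z}^*$ have been chosen such that the resulting generative distribution $\mathbf{G}$ over interaction sequences given by

$$
\begin{aligned}
\mathbf{G}(a_t \mid \underline{ao}_{<t}) &= \mathbf{P}(a_t \mid \underline{ao}_{<t}), \\
\mathbf{G}(o_t \mid \underline{ao}_{<t} a_t) &= \mathbf{Q}(o_t \mid \underline{ao}_{<t} a_t) = \mathbf{P}(o_t \mid \underline{ao}_{<t} a_t)
\end{aligned}
$$

is desirable, then $\mathbf{P}$ is said to be tailored to $\mathbf{Q}$.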

#### Adaptive control problem.

If the environment is unknown, then the task of designing an appropriate agent constitutes an adaptive control problem. Specifically, this work deals with the case when the designer already has a class of agents that are tailored to the class of possible environments. Formally, it is assumed that $\mathbf{Q}$ is going to be drawn with probability $P(m)$ from a set $\mathcal{Q} := \{\mathbf{Q}_m\}_{m \in \mathcal{M}}$ of possible systems before the interaction starts, where $\mathcal{M}$ is a countable set. Furthermore, one has a set $\mathcal{P} := \{\mathbf{P}_m\}_{m \in \mathcal{M}}$ of systems such that for each $m \in \mathcal{M}$, $\mathbf{P}_m$ is tailored to $\mathbf{Q}_m$ and the interaction system $(\mathbf{P}_m, \mathbf{Q}_m)$ has a generative distribution $\mathbf{G}_m$ that produces desirable interaction sequences. How can the designer construct a system $\mathbf{P}$ such that its behavior is as close as possible to the custom-made system $\mathbf{P}_m$ under any realization of $\mathbf{Q}_m \in \mathcal{Q}$?

## 3 Adaptive Systems

The main goal of this paper is to show that the problem of adaptive control outlined in the previous section can be reformulated as a universal compression problem. This can be informally motivated as follows. Suppose the agent $\mathbf{P}$ is implemented as a machine that is interfaced with the environment $\mathbf{Q}$. Whenever the agent interacts with the environment, the agent’s state changes as a necessary consequence of the interaction. This “change in state” can take place in many possible ways: by updating the internal memory; consulting a random number generator; changing the physical location and orientation; and so forth. Naturally, the design of the agent facilitates some interactions while it complicates others. For instance, if the agent has been designed to explore a natural environment, then it might incur a very low memory footprint when recording natural images, while being very memory-inefficient when recording artificially created images. If one abstracts away from the inner workings of the machine and decides to encode the state transitions as binary strings, then the minimal amount of resources in bits that are required to implement these state changes can be derived directly from the associated probability distribution $\mathbf{P}$. In the context of adaptive control, an agent can be constructed such that it minimizes the expected amount of changes necessary to implement the state transitions, or equivalently, such that it maximally compresses the experience. Thereby, compression can be taken as a stand-alone principle to design adaptive agents.

### 3.1 Universal Compression and Naïve Construction of Adaptive Agents

In coding theory, the problem of compressing a sequence of observations from an unknown source is known as the adaptive coding problem. This is solved by constructing universal compressors, i.e. codes that adapt on-the-fly to any source within a predefined class. Such codes are obtained by minimizing the average deviation of a predictor from the true source, and then by constructing codewords using the predictor. In this subsection, this procedure will be used to derive an adaptive agent (Ortega and Braun, 2010a).

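Formally, the deviation of a predictor from a true distribution is measured by the relative entropy. (The relative entropy, also known as the KL divergence, measures the average number of extra bits needed to encode symbols when the wrong predictor is used.) A first approach would be to construct an agent $\mathbf{P}$ so as to minimize the total expected relative entropy to $\mathbf{P}_m$. This is constructed as follows. Define the history-dependent relative entropies over the action $a_t$ and observation $o_t$ as

$$
\begin{aligned}
D_m^{a_t}(\underline{ao}_{<t}) &:= \sum_{a_t} \mathbf{P}_m(a_t \mid \underline{ao}_{<t}) \log \frac{\mathbf{P}_m(a_t \mid \underline{ao}_{<t})}{\mathbf{Pr}(a_t \mid \underline{ao}_{<t})}, \\
D_m^{o_t}(\underline{ao}_{<t} a_t) &:= \sum_{o_t} \mathbf{P}_m(o_t \mid \underline{ao}_{<t} a_t) \log \frac{\mathbf{P}_m(o_t \mid \underline{ao}_{<t} a_t)}{\mathbf{Pr}(o_t \mid \underline{ao}_{<t} a_t)},
\end{aligned}
$$

where $\mathbf{Pr}$ will be the argument of the variational problem. Then, one removes the dependency on the past by averaging over all possible histories:

$$
\begin{aligned}
D_m^{a_t} &:= \sum_{\underline{ao}_{<t}} \mathbf{P}_m(\underline{ao}_{<t}) \, D_m^{a_t}(\underline{ao}_{<t}), \\
D_m^{o_t} &:= \sum_{\underline{ao}_{<t} a_t} \mathbf{P}_m(\underline{ao}_{<t} a_t) \, D_m^{o_t}(\underline{ao}_{<t} a_t).
\end{aligned}
$$

Finally, the total expected relative entropy of $\mathbf{Pr}$ from $\mathbf{P}_m$ is obtained by summing up all time steps and then by averaging over all choices of the true environment:

$$
D := \limsup_{t \to \infty} \sum_{m} P(m) \sum_{\tau=1}^{t} \left( D_m^{a_\tau} + D_m^{o_\tau} \right). \tag{2}
$$

Using (2), one can define a variational problem with respect to $\mathbf{Pr}$. The agent $\mathbf{P}$ that one is looking for is the system $\mathbf{Pr}$ that minimizes the total expected relative entropy in (2), i.e.

$$
\mathbf{P} := \arg\min_{\mathbf{Pr}} D(\mathbf{Pr}). \tag{3}
$$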

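The solution to Equation 3 is the system $\mathbf{P}$ defined by the set of equations

$$
\begin{aligned}
\mathbf{P}(a_t \mid \underline{ao}_{<t}) &= \sum_{m} \mathbf{P}_m(a_t \mid \underline{ao}_{<t}) \, w_m(\underline{ao}_{<t}), \\
\mathbf{P}(o_t \mid \underline{ao}_{<t} a_t) &= \sum_{m} \mathbf{P}_m(o_t \mid \underline{ao}_{<t} a_t) \, w_m(\underline{ao}_{<t} a_t),
\end{aligned} \tag{4}
$$

valid for all $\underline{ao}_{\leq t} \in \mathcal{Z}^*$, where the mixture weights are

$$
\begin{aligned}
w_m(\underline{ao}_{<t}) &:= \frac{P(m) \, \mathbf{P}_m(\underline{ao}_{<t})}{\sum_{m'} P(m') \, \mathbf{P}_{m'}(\underline{ao}_{<t})}, \\
w_m(\underline{ao}_{<t} a_t) &:= \frac{P(m) \, \mathbf{P}_m(\underline{ao}_{<t} a_t)}{\sum_{m'} P(m') \, \mathbf{P}_{m'}(\underline{ao}_{<t} a_t)}.
\end{aligned} \tag{5}
$$

For reference, see Haussler and Opper (1997) and Opper (1998). It is clear that $\mathbf{P}$ is just the Bayesian mixture over the agents $\mathbf{P}_m$. If one defines the conditional probabilities

$$
\begin{aligned}
P(a_t \mid m, \underline{ao}_{<t}) &:= \mathbf{P}_m(a_t \mid \underline{ao}_{<t}), \\
P(o_t \mid m, \underline{ao}_{<t} a_t) &:= \mathbf{P}_m(o_t \mid \underline{ao}_{<t} a_t)
\end{aligned} \tag{6}
$$

for all $\underline{ao}_{\leq t} \in \mathcal{Z}^*$, then Equation 4 can be rewritten as

$$
\begin{aligned}
\mathbf{P}(a_t \mid \underline{ao}_{<t}) &= \sum_{m} P(a_t \mid m, \underline{ao}_{<t}) \, P(m \mid \underline{ao}_{<t}) = P(a_t \mid \underline{ao}_{<t}), \\
\mathbf{P}(o_t \mid \underline{ao}_{<t} a_t) &= \sum_{m} P(o_t \mid m, \underline{ao}_{<t} a_t) \, P(m \mid \underline{ao}_{<t} a_t) = P(o_t \mid \underline{ao}_{<t} a_t),
\end{aligned} \tag{7}
$$

where the $P(m \mid \underline{ao}_{<t}) = w_m(\underline{ao}_{<t})$ and $P(m \mid \underline{ao}_{<t} a_t) = w_m(\underline{ao}_{<t} a_t)$ are just the posterior probabilities over the elements in $\mathcal{M}$ given the past interactions. Hence, the conditional probabilities in (4) that minimize the total expected divergence are just the predictive distributions $P(a_t \mid \underline{ao}_{<t})$ and $P(o_t \mid \underline{ao}_{<t} a_t)$ that one obtains by standard probability theory, and in particular, Bayes’ rule. This is interesting, as it provides a teleological justification for Bayes’ rule.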

The behavior of $\mathbf{P}$ can be described as follows. At any given time $t$, $\mathbf{P}$ maintains a mixture over systems $\mathbf{P}_m$. The weighting over them is given by the mixture coefficients $w_m$. Whenever a new action $a_t$ or a new observation $o_t$ is produced (by the agent or the environment respectively), the weights $w_m$ are updated according to Bayes’ rule. In addition, $\mathbf{P}$ issues an action $a_t$ suggested by a system $\mathbf{P}_m$ drawn randomly according to the weights $w_m$.
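The update just described can be sketched as follows. The two memoryless experts `m1` and `m2`, the binary alphabets and all numbers are hypothetical and chosen only for illustration. Note that in this naïve construction the agent's own action also shifts the weights.

```python
# Hypothetical memoryless experts (illustrative numbers only).
POLICY = {"m1": {0: 0.9, 1: 0.1}, "m2": {0: 0.1, 1: 0.9}}  # expert's action probabilities
HYPOTHESIS = {                                             # expert's observation probabilities
    "m1": {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}},      # indexed as [action][observation]
    "m2": {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}},
}

def normalize(weights):
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

def mixture_action_dist(weights):
    """Action distribution of the Bayesian mixture over the experts."""
    return {a: sum(weights[m] * POLICY[m][a] for m in weights) for a in (0, 1)}

def naive_update(weights, action, observation):
    """Bayes' rule applied to BOTH the action and the observation."""
    after_action = normalize({m: w * POLICY[m][action] for m, w in weights.items()})
    return normalize(
        {m: w * HYPOTHESIS[m][action][observation] for m, w in after_action.items()}
    )

prior = {"m1": 0.5, "m2": 0.5}
posterior = naive_update(prior, action=0, observation=1)
print(mixture_action_dist(prior), posterior)
```

With these numbers, issuing action 0 already moves the weight of `m1` from 0.5 to 0.9 before any observation arrives, which is exactly the behavior questioned in the next paragraph.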

However, there is an important problem with $\mathbf{P}$ that arises due to the fact that it is not only a system that is passively observing symbols, but also actively generating them. In the subjective interpretation of probability theory, conditionals play the role of observations made by the agent that have been generated by an external source. This interpretation suits the symbols $o_1, o_2, o_3, \ldots$ because they have been issued by the environment. However, symbols that are generated by the system itself require a fundamentally different belief update. Intuitively, the difference can be explained as follows. Observations provide information that allows the agent to infer properties of the environment. In contrast, actions do not carry information about the environment, and thus have to be incorporated differently into the belief of the agent. In the following section we illustrate this problem with a simple statistical example.

### 3.2 Causality

Causality is the study of the functional dependencies of events. This stands in contrast to statistics, which, on an abstract level, can be said to study the equivalence dependencies (i.e. co-occurrence or correlation) amongst events. Causal statements differ fundamentally from statistical statements. Examples that highlight the differences are many, such as “do smokers get lung cancer?” as opposed to “do smokers have lung cancer?”; “assign $y \leftarrow f(x)$” as opposed to “compare $y = f(x)$” in programming languages; and “$a \leftarrow F/m$” as opposed to “$F = m\,a$” in Newtonian physics. The study of causality has recently enjoyed considerable attention from researchers in the fields of statistics and machine learning. Especially over the last decade, significant progress has been made towards the formal understanding of causation (Shafer, 1996; Pearl, 2000; Spirtes et al., 2000; Dawid, 2010). In this subsection, the aim is to provide the essential tools required to understand causal interventions. For a more in-depth exposition of causality, the reader is referred to the specialized literature.

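To illustrate the need for causal considerations in the case of generated symbols, consider the following thought experiment. Suppose a statistician is asked to design a model for a simple time series $X_1, X_2, X_3, \ldots$ and she decides to use a Bayesian method. Assume she collects a first observation $X_1 = x_1$. She computes the posterior probability density function (pdf) over the parameters $\theta$ of the model given the data using Bayes’ rule:

$$
p(\theta \mid X_1 = x_1) = \frac{p(X_1 = x_1 \mid \theta) \, p(\theta)}{\int p(X_1 = x_1 \mid \theta') \, p(\theta') \, d\theta'},
$$

where $p(X_1 = x_1 \mid \theta)$ is the likelihood of $x_1$ given $\theta$ and $p(\theta)$ is the prior pdf of $\theta$. She can use the model to predict the next observation by drawing a sample $x_2$ from the predictive pdf

$$
p(X_2 = x_2 \mid X_1 = x_1) = \int p(X_2 = x_2 \mid X_1 = x_1, \theta) \, p(\theta \mid X_1 = x_1) \, d\theta,
$$

where $p(X_2 = x_2 \mid X_1 = x_1, \theta)$ is the likelihood of $x_2$ given $x_1$ and $\theta$. She understands that the nature of $x_2$ is very different from $x_1$: while $x_1$ is informative and does change the belief state of the Bayesian model, $x_2$ is non-informative and thus is a reflection of the model’s belief state. Hence, she would never use $x_2$ to further condition the Bayesian model. Mathematically, she seems to imply that

$$
p(\theta \mid X_1 = x_1, X_2 = x_2) = p(\theta \mid X_1 = x_1)
$$

if $x_2$ has been generated from $p(X_2 \mid X_1 = x_1)$ itself. But this simple independence assumption is not correct, as the following elaboration of the example will show.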

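The statistician is now told that the source is waiting for the simulated data point $x_2$ in order to produce a next observation $X_3 = x_3$ which does depend on $x_2$. She hands in $x_2$ and obtains a new observation $x_3$. Using Bayes’ rule, the posterior pdf over the parameters is now

$$
\frac{p(X_3 = x_3 \mid X_1 = x_1, X_2 = x_2, \theta) \, p(X_1 = x_1 \mid \theta) \, p(\theta)}{\int p(X_3 = x_3 \mid X_1 = x_1, X_2 = x_2, \theta') \, p(X_1 = x_1 \mid \theta') \, p(\theta') \, d\theta'} \tag{8}
$$

where $p(X_3 = x_3 \mid X_1 = x_1, X_2 = x_2, \theta)$ is the likelihood of the new data $x_3$ given the old data $x_1$, the parameters $\theta$ and the simulated data $x_2$. Notice that this looks almost like the posterior pdf $p(\theta \mid X_1 = x_1, X_2 = x_2, X_3 = x_3)$ given by

$$
\frac{p(X_3 = x_3 \mid X_1 = x_1, X_2 = x_2, \theta) \, p(X_2 = x_2 \mid X_1 = x_1, \theta) \, p(X_1 = x_1 \mid \theta) \, p(\theta)}{\int p(X_3 = x_3 \mid X_1 = x_1, X_2 = x_2, \theta') \, p(X_2 = x_2 \mid X_1 = x_1, \theta') \, p(X_1 = x_1 \mid \theta') \, p(\theta') \, d\theta'}
$$

with the exception that in the latter case, the Bayesian update contains the likelihoods of the simulated data $p(X_2 = x_2 \mid X_1 = x_1, \theta)$. This suggests that Equation 8 is a variant of the posterior pdf $p(\theta \mid X_1 = x_1, X_2 = x_2, X_3 = x_3)$ but where the simulated data $x_2$ is treated in a different way than the data $x_1$ and $x_3$.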

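Define the pdf $p'$ such that the pdfs $p'(\theta)$, $p'(X_1 \mid \theta)$, $p'(X_3 \mid X_1, X_2, \theta)$ are identical to $p(\theta)$, $p(X_1 \mid \theta)$ and $p(X_3 \mid X_1, X_2, \theta)$ respectively, but differ in $p'(X_2 \mid X_1, \theta)$:

$$
p'(X_2 \mid X_1, \theta) = \delta(X_2 - x_2),
$$

where $\delta$ is the Dirac delta function. That is, $p'$ is identical to $p$ but it assumes that the value of $X_2$ is fixed to $x_2$ given $X_1$ and $\theta$. For $p'$, the simulated data $x_2$ is non-informative:

$$
-\log_2 p'(X_2 = x_2 \mid X_1, \theta) = 0.
$$

If one computes the posterior pdf $p'(\theta \mid X_1 = x_1, X_2 = x_2, X_3 = x_3)$, one obtains the result of Equation 8:

$$
\begin{aligned}
p'(\theta \mid X_1 = x_1, X_2 = x_2, X_3 = x_3)
&= \frac{p'(X_3 = x_3 \mid X_1 = x_1, X_2 = x_2, \theta) \, p'(X_2 = x_2 \mid X_1 = x_1, \theta) \, p'(X_1 = x_1 \mid \theta) \, p'(\theta)}{\int p'(X_3 = x_3 \mid X_1 = x_1, X_2 = x_2, \theta') \, p'(X_2 = x_2 \mid X_1 = x_1, \theta') \, p'(X_1 = x_1 \mid \theta') \, p'(\theta') \, d\theta'} \\
&= \frac{p(X_3 = x_3 \mid X_1 = x_1, X_2 = x_2, \theta) \, p(X_1 = x_1 \mid \theta) \, p(\theta)}{\int p(X_3 = x_3 \mid X_1 = x_1, X_2 = x_2, \theta') \, p(X_1 = x_1 \mid \theta') \, p(\theta') \, d\theta'}.
\end{aligned}
$$

Thus, in order to explain Equation 8 as a posterior pdf given the observed data $x_1$ and $x_3$ and the generated data $x_2$, one has to intervene $p$ in order to account for the fact that $x_2$ is non-informative given $x_1$ and $\theta$. In other words, the statistician, by defining the value of $X_2$ herself, has changed the (natural) regime that brings about the series $X_1, X_2, X_3, \ldots$, which is mathematically expressed by redefining the pdf.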

Two essential ingredients are needed to carry out interventions. First, one needs to know the functional dependencies amongst the random variables of the probabilistic model. This is provided by the causal model, i.e. the unique factorization of the joint probability distribution over the random variables encoding the causal dependencies. In the general case, this defines a partial order over the random variables. In the previous thought experiment, the causal model of the joint pdf $p(\theta, X_1, X_2, X_3)$ is given by the set of conditional pdfs

$$
p(\theta), \quad p(X_1 \mid \theta), \quad p(X_2 \mid X_1, \theta), \quad p(X_3 \mid X_1, X_2, \theta).
$$

Second, one defines the intervention that sets $X$ to the value $x$, denoted as $X \leftarrow x$, as the operation on the causal model replacing the conditional probability of $X$ by a Dirac delta function $\delta(X - x)$ or a Kronecker delta $\delta_x^X$ for a continuous or a discrete variable $X$ respectively. In our thought experiment, it is easily seen that

$$
p'(\theta, X_1 = x_1, X_2 = x_2, X_3 = x_3) = p(\theta, X_1 = x_1, X_2 \leftarrow x_2, X_3 = x_3)
$$

and thereby,

$$
p'(\theta \mid X_1 = x_1, X_2 = x_2, X_3 = x_3) = p(\theta \mid X_1 = x_1, X_2 \leftarrow x_2, X_3 = x_3).
$$

Causal models contain additional information that is not available in the joint probability distribution alone. The appropriate model for a given situation depends on the story that is being told. Note that the same intervention can lead to different results if the underlying causal models differ. Thus, if the causal model had been

$$
p(X_3), \quad p(X_2 \mid X_3), \quad p(X_1 \mid X_2, X_3), \quad p(\theta \mid X_1, X_2, X_3),
$$

then the intervention $X_2 \leftarrow x_2$ would differ from $p'$, i.e.

$$
p'(\theta, X_1 = x_1, X_2 = x_2, X_3 = x_3) \neq p(\theta, X_1 = x_1, X_2 \leftarrow x_2, X_3 = x_3),
$$

even though both causal models represent the same joint probability distribution. In the following, this paper will use the shorthand notation $\hat{x} := X \leftarrow x$ when the random variable is obvious from the context.
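The difference between conditioning on a generated symbol and intervening on it can be checked numerically on a small discrete analogue of the thought experiment. Everything in this sketch is a hypothetical stand-in: the parameter takes only the two values `"A"` and `"B"`, the data are binary, and the third symbol is assumed to copy the second with a parameter-dependent probability.

```python
# Hypothetical discrete analogue of the thought experiment (illustrative numbers only).
PRIOR = {"A": 0.5, "B": 0.5}   # prior over the parameter
BIAS = {"A": 0.8, "B": 0.3}    # parameter-dependent probability used by all likelihoods

def lik_x1(x1, theta):
    """Likelihood of the first observation."""
    return BIAS[theta] if x1 == 1 else 1 - BIAS[theta]

def lik_x2(x2, x1, theta):
    """Likelihood of the second symbol (assumed independent of x1 for simplicity)."""
    return BIAS[theta] if x2 == 1 else 1 - BIAS[theta]

def lik_x3(x3, x1, x2, theta):
    """Likelihood of the third symbol: it copies x2 with probability BIAS[theta]."""
    return BIAS[theta] if x3 == x2 else 1 - BIAS[theta]

def posterior(x1, x2, x3, intervene_x2):
    """Posterior over the parameter; an intervention replaces the x2 factor by a delta."""
    unnorm = {}
    for theta, p in PRIOR.items():
        factor_x2 = 1.0 if intervene_x2 else lik_x2(x2, x1, theta)
        unnorm[theta] = p * lik_x1(x1, theta) * factor_x2 * lik_x3(x3, x1, x2, theta)
    z = sum(unnorm.values())
    return {theta: v / z for theta, v in unnorm.items()}

conditioned = posterior(1, 1, 1, intervene_x2=False)
intervened = posterior(1, 1, 1, intervene_x2=True)
print(conditioned, intervened)
```

Treating the self-generated symbol as ordinary evidence makes the model more confident in `"A"` than the intervened update does, because the likelihood of a symbol the model produced itself is counted as if it were data.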

### 3.3 Causal construction of adaptive agents

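Following the discussion in the previous section, an adaptive agent $\mathbf{P}$ is going to be constructed by minimizing the expected relative entropy to the $\mathbf{P}_m$, but this time treating actions as interventions. Based on the definition of the conditional probabilities in Equation 6, the total expected relative entropy to characterize $\mathbf{P}$ using interventions is going to be defined. Assuming the environment is chosen first, and that each symbol depends functionally on the environment and all the previously generated symbols, the causal model is given by

$$
P(m), \quad P(a_1 \mid m), \quad P(o_1 \mid m, a_1), \quad P(a_2 \mid m, a_1, o_1), \quad P(o_2 \mid m, a_1, o_1, a_2), \quad \ldots
$$

Importantly, interventions index a set of intervened probability distributions derived from a base probability distribution. Hence, the set of fixed intervention sequences of the form $\hat{a}_1, \hat{a}_2, \ldots$ indexes probability distributions over observation sequences $o_1, o_2, \ldots$. Because of this, one defines a set of criteria indexed by the intervention sequences, but it will be clear that they all have the same solution. Define the history-dependent intervened relative entropies over the action $a_t$ and observation $o_t$ as

$$
\begin{aligned}
C_m^{a_t}(\underline{\hat{a}o}_{<t}) &:= \sum_{a_t} P(a_t \mid m, \underline{\hat{a}o}_{<t}) \log_2 \frac{P(a_t \mid m, \underline{\hat{a}o}_{<t})}{\mathbf{Pr}(a_t \mid \underline{ao}_{<t})}, \\
C_m^{o_t}(\underline{\hat{a}o}_{<t} \hat{a}_t) &:= \sum_{o_t} P(o_t \mid m, \underline{\hat{a}o}_{<t} \hat{a}_t) \log_2 \frac{P(o_t \mid m, \underline{\hat{a}o}_{<t} \hat{a}_t)}{\mathbf{Pr}(o_t \mid \underline{ao}_{<t} a_t)},
\end{aligned}
$$

where $\mathbf{Pr}$ is a given arbitrary agent. Note that past actions are treated as interventions. In particular, $P(a_t \mid m, \underline{\hat{a}o}_{<t})$ represents the knowledge state when the past actions have already been issued but the next action is not known yet. Then, averaging the previous relative entropies over all pasts yields

$$
\begin{aligned}
C_m^{a_t} &:= \sum_{\underline{ao}_{<t}} P(\underline{\hat{a}o}_{<t} \mid m) \, C_m^{a_t}(\underline{\hat{a}o}_{<t}), \\
C_m^{o_t} &:= \sum_{\underline{ao}_{<t} a_t} P(\underline{\hat{a}o}_{<t} \hat{a}_t \mid m) \, C_m^{o_t}(\underline{\hat{a}o}_{<t} \hat{a}_t).
\end{aligned}
$$

Here again, because of the knowledge state in time represented by $C_m^{a_t}(\underline{\hat{a}o}_{<t})$ and $C_m^{o_t}(\underline{\hat{a}o}_{<t} \hat{a}_t)$, the averages are taken treating past actions as interventions. Finally, define the total expected relative entropy of $\mathbf{Pr}$ from $\mathbf{P}_m$ as the sum of $(C_m^{a_t} + C_m^{o_t})$ over time, averaged over the possible draws of the environment:

$$
C := \limsup_{t \to \infty} \sum_{m} P(m) \sum_{\tau=1}^{t} \left( C_m^{a_\tau} + C_m^{o_\tau} \right). \tag{9}
$$

The variational problem consists in choosing the agent $\mathbf{P}$ as the system $\mathbf{Pr}$ minimizing $C = C(\mathbf{Pr})$, i.e.

$$
\mathbf{P} := \arg\min_{\mathbf{Pr}} C(\mathbf{Pr}). \tag{10}
$$

The following theorem shows that this variational problem has a unique solution, which will be the central theme of this paper.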

###### Theorem 1

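The solution to Equation 10 is the system $\mathbf{P}$ defined by the set of equations

$$
\begin{aligned}
\mathbf{P}(a_t \mid \underline{ao}_{<t}) &= P(a_t \mid \underline{\hat{a}o}_{<t}) = \sum_{m} P(a_t \mid m, \underline{ao}_{<t}) \, v_m(\underline{ao}_{<t}), \\
\mathbf{P}(o_t \mid \underline{ao}_{<t} a_t) &= P(o_t \mid \underline{\hat{a}o}_{<t} \hat{a}_t) = \sum_{m} P(o_t \mid m, \underline{ao}_{<t} a_t) \, v_m(\underline{ao}_{<t} a_t),
\end{aligned} \tag{11}
$$

valid for all $\underline{ao}_{\leq t} \in \mathcal{Z}^*$, where the mixture weights are

$$
v_m(\underline{ao}_{<t} a_t) = v_m(\underline{ao}_{<t}) := \frac{P(m) \prod_{\tau=1}^{t-1} P(o_\tau \mid m, \underline{ao}_{<\tau} a_\tau)}{\sum_{m'} P(m') \prod_{\tau=1}^{t-1} P(o_\tau \mid m', \underline{ao}_{<\tau} a_\tau)}. \tag{12}
$$

The behavior of $\mathbf{P}$ differs in an important aspect from that of the naïve agent of Section 3.1. At any given time $t$, $\mathbf{P}$ maintains a mixture over systems $\mathbf{P}_m$. The weighting over these systems is given by the mixture coefficients $v_m$. In contrast to the naïve agent, $\mathbf{P}$ updates the weights $v_m$ only whenever a new observation $o_t$ is produced by the environment. The update follows Bayes’ rule but treats past actions as interventions by dropping the evidence they provide. In addition, $\mathbf{P}$ issues an action $a_t$ suggested by a system $\mathbf{P}_m$ drawn randomly according to the weights $v_m$.

Perhaps surprisingly, the theorem says that the optimal solution to the variational problem in (10) is precisely the predictive distribution over actions and observations treating actions as interventions and observations as conditionals, i.e. it is the solution that one would obtain by applying only standard probability and causal calculus. This provides a teleological interpretation for the agent $\mathbf{P}$ akin to the naïve agent constructed in Section 3.1.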
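The behavior described by the theorem can be sketched as a sampling loop. As before, this is a minimal illustration under assumptions that are not from the paper: two hypothetical memoryless operation modes, binary symbols, and a true environment that happens to match the hypothesis of mode `m1`. The key point is that the weights are multiplied only by observation likelihoods; the likelihood of the issued action never enters.

```python
import random

# Hypothetical memoryless operation modes (illustrative numbers only).
POLICY = {"m1": {0: 0.9, 1: 0.1}, "m2": {0: 0.1, 1: 0.9}}  # action probabilities per mode
HYPOTHESIS = {                                             # observation probabilities per mode
    "m1": {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}},      # indexed as [action][observation]
    "m2": {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}},
}

def draw(dist, rng):
    """Sample a key of dist with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]

def causal_update(weights, action, observation):
    """Posterior over modes: only the observation likelihood enters the update."""
    unnorm = {m: w * HYPOTHESIS[m][action][observation] for m, w in weights.items()}
    z = sum(unnorm.values())
    return {m: v / z for m, v in unnorm.items()}

def control_step(weights, environment, rng):
    """Sample a mode, act according to its policy, observe, then update the weights."""
    mode = draw(weights, rng)
    action = draw(POLICY[mode], rng)
    observation = environment(action, rng)
    return action, observation, causal_update(weights, action, observation)

def true_environment(action, rng):
    """Assumed ground truth: the environment behaves like the hypothesis of m1."""
    return draw(HYPOTHESIS["m1"][action], rng)

rng = random.Random(0)
weights = {"m1": 0.5, "m2": 0.5}
for _ in range(50):
    _, _, weights = control_step(weights, true_environment, rng)
print(weights)
```

Compared with a naïve Bayesian mixture, the only change is the missing multiplication by the policy term in `causal_update`.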

### 3.4 Summary

Adaptive control is formalized as the problem of designing an agent for an unknown environment chosen from a class of possible environments. If the environment-specific agents are known, then the Bayesian control rule allows constructing an adaptive agent by combining these agents. The resulting adaptive agent is universal with respect to the environment class. In this context, the constituent agents are called the operation modes of the adaptive agent. They are represented by causal models over the interaction sequences, i.e. conditional probabilities $P(a_t \mid m, \underline{ao}_{<t})$ and $P(o_t \mid m, \underline{ao}_{<t} a_t)$ for all $\underline{ao}_{\leq t} \in \mathcal{Z}^*$, and where $m \in \mathcal{M}$ is the index or parameter characterizing the operation mode. The probability distribution over the input stream (output stream) is called the hypothesis (policy) of the operation mode. The following box collects the essential equations of the Bayesian control rule. In particular, here the rule is stated using a recursive belief update.

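#### Bayesian control rule:

Given a set of operation modes $\{P(\cdot \mid m, \cdot)\}_{m \in \mathcal{M}}$ over interaction sequences in $\mathcal{Z}^\infty$ and a prior distribution $P(m)$ over the parameters $\mathcal{M}$, the probability of the action $a_{t+1}$ is given by

$$
P(a_{t+1} \mid \underline{\hat{a}o}_{\leq t}) = \sum_{m} P(a_{t+1} \mid m, \underline{ao}_{\leq t}) \, P(m \mid \underline{\hat{a}o}_{\leq t}), \tag{13}
$$

where the posterior probability over operation modes is given by the recursion

$$
P(m \mid \underline{\hat{a}o}_{\leq t}) = \frac{P(o_t \mid m, \underline{ao}_{<t} a_t) \, P(m \mid \underline{\hat{a}o}_{<t})}{\sum_{m'} P(o_t \mid m', \underline{ao}_{<t} a_t) \, P(m' \mid \underline{\hat{a}o}_{<t})}.
$$

## 4 Convergence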

The aim of this section is to develop a set of sufficient conditions of convergence and then to provide a proof of convergence. To simplify the exposition, the analysis has been limited to the case of controllers having a finite number of input-output models.

### 4.1 Policy diagrams

In the following, we use “policy diagrams” as a useful informal tool to analyze the effect of policies on environments. Figure 2 illustrates an example.

Policy diagrams are especially useful to analyze the effect of policies on different hypotheses about the environment’s dynamics. An agent that is endowed with a set of operation modes $\mathcal{M}$ can be seen as having hypotheses about the environment’s underlying dynamics, given by the observation models $P(o_t \mid m, \underline{ao}_{<t} a_t)$, and associated policies, given by the action models $P(a_t \mid m, \underline{ao}_{<t})$, for all $m \in \mathcal{M}$. For the sake of simplifying the interpretation of policy diagrams, we will assume the existence of a state space $\mathcal{S}$ and a function $T: \mathcal{Z}^* \to \mathcal{S}$ mapping I/O histories into states. (Note, however, that no such assumptions are made to obtain the results of this section.) With this assumption, policies and hypotheses can be seen as conditional probabilities

respectively, defining transition probabilities

for a Markov chain in the state space, where and contains the transitions such that .

### 4.2 Divergence processes

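The central question in this section is whether the Bayesian control rule converges to the correct control law. That is, whether $P(a_t \mid \underline{\hat{a}o}_{<t}) \to P(a_t \mid m^*, \underline{ao}_{<t})$ as $t \to \infty$ when $m^*$ is the true operation mode, i.e. the operation mode such that $P(o_t \mid m^*, \underline{ao}_{<t} a_t) = \mathbf{Q}(o_t \mid \underline{ao}_{<t} a_t)$. As will be obvious from the discussion in the rest of this section, this is in general not true.

As is easily seen from Equation 13, showing convergence amounts to showing that the posterior distribution $P(m \mid \underline{\hat{a}o}_{<t})$ concentrates its probability mass on a subset of operation modes $\mathcal{M}^*$ having essentially the same output stream as $m^*$,

$$
\sum_{m \in \mathcal{M}} P(a_t \mid m, \underline{ao}_{<t}) \, P(m \mid \underline{\hat{a}o}_{<t}) \approx \sum_{m \in \mathcal{M}^*} P(a_t \mid m^*, \underline{ao}_{<t}) \, P(m \mid \underline{\hat{a}o}_{<t}) \approx P(a_t \mid m^*, \underline{ao}_{<t}).
$$

Hence, understanding the asymptotic behavior of the posterior probabilities

$$
P(m \mid \underline{\hat{a}o}_{\leq t})
$$

is crucial here. In particular, we need to understand under what conditions these quantities converge to zero. The posterior can be rewritten as

$$
P(m \mid \underline{\hat{a}o}_{\leq t}) = \frac{P(\underline{\hat{a}o}_{\leq t} \mid m) \, P(m)}{\sum_{m'} P(\underline{\hat{a}o}_{\leq t} \mid m') \, P(m')} = \frac{P(m) \prod_{\tau=1}^{t} P(o_\tau \mid m, \underline{ao}_{<\tau} a_\tau)}{\sum_{m'} P(m') \prod_{\tau=1}^{t} P(o_\tau \mid m', \underline{ao}_{<\tau} a_\tau)}.
$$

If all the summands but the one with index $m^*$ are dropped from the denominator, one obtains the bound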