An Intrinsically-Motivated Approach for Learning Highly Exploring and Fast Mixing Policies



What is a good exploration strategy for an agent that interacts with an environment in the absence of external rewards? Ideally, we would like to get a policy driving towards a uniform state-action visitation (highly exploring) in a minimum number of steps (fast mixing), in order to ease efficient learning of any goal-conditioned policy later on. Unfortunately, it is remarkably arduous to directly learn an optimal policy of this nature. In this paper, we propose a novel surrogate objective for learning highly exploring and fast mixing policies, which focuses on maximizing a lower bound to the entropy of the steady-state distribution induced by the policy. In particular, we introduce three novel lower bounds, leading to as many optimization problems, that trade off theoretical guarantees against computational complexity. Then, we present a model-based reinforcement learning algorithm, IDEAL, to learn an optimal policy according to the introduced objective. Finally, we provide an empirical evaluation of this algorithm on a set of hard-exploration tasks.



1 Introduction

In general, the Reinforcement Learning (RL) framework [sutton2018reinforcement] assumes the presence of a reward signal coming from a, potentially unknown, environment to a learning agent. When this signal is sufficiently informative about the utility of the agent’s decisions, RL has proved to be rather successful in solving challenging tasks, even at a super-human level [mnih2015human, silver2017mastering]. However, in most real-world scenarios, we cannot rely on a well-shaped, complete reward signal. This may prevent the agent from learning anything until, while performing random actions, it eventually stumbles into some sort of external reward. Thus, what is a good objective for a learning agent to pursue, in the absence of an external reward signal, to prepare itself to learn efficiently, eventually, a goal-conditioned policy?

Intrinsic motivation [chentanez2005intrinsically, oudeyer2009topology] traditionally tries to answer this pressing question by designing self-motivated goals that favor exploration. In a curiosity-driven approach, first proposed in [schmidhuber1991possibility], the intrinsic objective encourages the agent to explore novel states by rewarding prediction errors [stadie2015incentivizing, pathak2017curiosity, burda2018large, burda2018exploration]. In a similar flavor, other works propose to relate an intrinsic reward to some sort of learning progress [lopes2012exploration] or information gain [mohamed2015variational, houthooft2016vime], stimulating the agent’s empowerment over the environment. Count-based approaches [bellemare2016unifying, tang2017exploration, ostrovski2017count] consider exploration bonuses that decrease with the state visitation frequencies, assigning high rewards to rarely visited states. Although the mentioned approaches have been relatively effective in solving sparse-reward, hard-exploration tasks [pathak2017curiosity, burda2018exploration], they have some common limitations that may affect their ability to methodically explore an environment in the absence of external rewards, as pointed out in [ecoffet2019go]. In particular, due to the consumable nature of their intrinsic bonuses, the learning agent could prematurely lose interest in a frontier of high rewards (detachment). Furthermore, the agent may suffer from derailment when trying to return to a promising state, previously discovered, if a naïve exploratory mechanism, such as $\epsilon$-greedy, is combined with the intrinsic motivation mechanism (which is often the case). To overcome these limitations, recent works suggest alternative approaches to motivate the agent towards a more systematic exploration of the environment [hazan2018provably, ecoffet2019go].
In particular, in [hazan2018provably] the authors consider an intrinsic objective directed at maximizing an entropic measure over the state distribution induced by a policy. Then, they provide a provably efficient algorithm to learn a mixture of deterministic policies that is overall optimal w.r.t. the maximum-entropy exploration objective. To the best of our knowledge, none of the mentioned approaches explicitly addresses the related aspect of the mixing time of an exploratory policy, which represents the time it takes for the policy to reach its full capacity in terms of exploration. Nonetheless, in many cases we would like to maximize the probability of reaching any potential target state having a fairly limited number of interactions at hand for exploring the environment. Notably, this context presents some analogies to the problem of maximizing the efficiency of a random walk [hassibi2014optimized].

In this paper, we present a novel approach to learn exploratory policies that are, at the same time, highly exploring and fast mixing. In Section 3, we propose a surrogate objective to address the problem of maximum-entropy exploration over both the state space (Section 3.1) and the action space (Section 3.2). The idea is to search for a policy that maximizes a lower bound to the entropy of the induced steady-state distribution. We introduce three new lower bounds and the corresponding optimization problems, discussing their pros and cons. Furthermore, we discuss how to complement the introduced objective to account for the mixing time of the learned policy (Section 3.3). In Section 4, we present the Intrinsically-Driven Effective and Efficient Exploration ALgorithm (IDEAL), a novel, model-based, reinforcement learning method to learn highly exploring and fast mixing policies through iterative optimizations of the introduced objective. In Section 5, we provide an empirical evaluation to illustrate the merits of our approach on hard-exploration, finite domains, and to show how it fares in comparison to count-based and maximum-entropy approaches. Finally, in Section 6, we discuss the proposed approach and related works. The proofs of the Theorems are reported in Appendix A.

2 Preliminaries

A discrete-time Markov Decision Process (MDP) [puterman2014markov] is defined as a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, d_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(s'|s,a)$ is a Markovian transition model defining the distribution of the next state $s'$ given the current state $s$ and action $a$, $R$ is the reward function, such that $R(s,a)$ is the expected immediate reward when taking action $a$ from state $s$, and $d_0$ is the initial state distribution. A policy $\pi(a|s)$ defines the probability of taking an action $a$ in state $s$.

In the following we will indifferently turn to scalar or matrix notation, where $\mathbf{v}$ denotes a vector, $\mathbf{M}$ denotes a matrix, and $\mathbf{v}^T$, $\mathbf{M}^T$ denote their transpose. A matrix is row (column) stochastic if it has non-negative entries and all of its rows (columns) sum to one. A matrix is doubly stochastic if it is both row and column stochastic. We denote with $\mathbb{P}$ the space of doubly stochastic matrices. The $L_\infty$-norm $\|\mathbf{M}\|_\infty$ of a matrix is its maximum absolute row sum, while $\|\mathbf{M}\|_2$ and $\|\mathbf{M}\|_F$ are its $L_2$ and Frobenius norms respectively. We denote with $\mathbf{1}_n$ a column vector of $n$ ones and with $\mathbf{1}_{n \times m}$ a matrix of ones with $n$ rows and $m$ columns. Using matrix notation, $d_0$ is a column vector of size $|\mathcal{S}|$ having elements $d_0(s)$, $P$ is a row stochastic matrix of size $(|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}|)$ that describes the transition model $P(s'|s,a)$, $\Pi$ is a row stochastic matrix of size $(|\mathcal{S}| \times |\mathcal{S}||\mathcal{A}|)$ that contains the policy $\pi(a|s)$, and $P^\pi = \Pi P$ is a row stochastic matrix of size $(|\mathcal{S}| \times |\mathcal{S}|)$ that represents the state transition matrix under policy $\pi$. We denote with the space of all the stationary Markovian policies.

In the absence of any reward, i.e., when $R(s,a) = 0$ for every $(s,a)$, a policy $\pi$ induces, over the MDP $\mathcal{M}$, a Markov Chain (MC) [levin2017markov] defined by $\mathcal{C} = (\mathcal{S}, P^\pi, d_0)$, where $P^\pi(s'|s) = \sum_{a} \pi(a|s) P(s'|s,a)$ is the state transition model. Having defined the $t$-step transition matrix as $P^\pi_t = (P^\pi)^t$, the state distribution of the MC at time step $t$ is $d^\pi_t = (P^\pi_t)^T d_0$, while $d^\pi = \lim_{t \to \infty} d^\pi_t$ is the steady-state distribution. If the MC is ergodic, i.e., aperiodic and recurrent, it admits a unique steady-state distribution, such that $d^\pi = (P^\pi)^T d^\pi$. The mixing time of the MC describes how fast the state distribution converges to the steady state:


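where $\epsilon$ is the mixing threshold. An MC is reversible if the condition $P^\pi(s'|s) \, d^\pi(s) = P^\pi(s|s') \, d^\pi(s')$ holds. Let $\lambda(P^\pi)$ be the eigenvalues of $P^\pi$. For ergodic reversible MCs the largest eigenvalue is 1 with multiplicity 1. Then, we can define the second largest eigenvalue modulus and the spectral gap as:

$\xi(P^\pi) = \max_{\lambda \in \lambda(P^\pi) \setminus \{1\}} |\lambda|, \qquad \gamma(P^\pi) = 1 - \xi(P^\pi). \qquad (2)$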
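The quantities introduced in this section can be illustrated numerically. The following is a minimal sketch, assuming a tabular MDP stored as nested lists `P[s][a][s_next]` and a policy table `pi[s][a]`; all function names and the toy MDP are hypothetical. The mixing time is measured here in total variation and from a single initial distribution, which is a simplification of the definition in the text.

```python
def induced_chain(P, pi):
    """State transition matrix of the induced MC: P_pi[s][t] = sum_a pi[s][a] * P[s][a][t]."""
    n_states = len(P)
    return [[sum(pi[s][a] * P[s][a][t] for a in range(len(pi[s])))
             for t in range(n_states)] for s in range(n_states)]

def step(d, P_pi):
    """One step of the chain: d_next[t] = sum_s d[s] * P_pi[s][t]."""
    n = len(d)
    return [sum(d[s] * P_pi[s][t] for s in range(n)) for t in range(n)]

def steady_state(P_pi, d0, n_steps=10_000):
    """Approximate the steady-state distribution by iterating the chain (ergodic MC assumed)."""
    d = list(d0)
    for _ in range(n_steps):
        d = step(d, P_pi)
    return d

def mixing_time(P_pi, d0, eps, max_steps=10_000):
    """Smallest t whose total-variation distance from the steady state is at most eps (from d0 only)."""
    target = steady_state(P_pi, d0)
    d = list(d0)
    for t in range(max_steps + 1):
        if 0.5 * sum(abs(x - y) for x, y in zip(d, target)) <= eps:
            return t
        d = step(d, P_pi)
    return None

# Toy MDP (an assumption for illustration): action 0 stays, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
pi = [[0.8, 0.2], [0.8, 0.2]]

P_pi = induced_chain(P, pi)
d0 = [1.0, 0.0]
d = steady_state(P_pi, d0)
t_mix = mixing_time(P_pi, d0, eps=0.01)
```

In this toy chain the induced matrix has eigenvalues 1 and 0.6, so the distance to the steady state shrinks by a factor 0.6 per step; a policy that switches state more often would have a larger spectral gap and mix faster.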


3 Optimization Problems for Highly Exploring and Fast Mixing Policies

In this section, we define a set of optimization problems whose goal is to identify a stationary Markovian policy that effectively explores the state-action space. The optimization problem is introduced in three steps: first we ask for a policy that maximizes some lower bound to the steady-state distribution entropy, then we foster exploration over the action space by adding a constraint on the minimum action probability, and finally we add another constraint to reduce the mixing time of the Markov chain induced by the policy.

3.1 Highly Exploring Policies over the State Space

Intuitively, a good exploration policy should guarantee to visit the state space as uniformly as possible. In this view, a potential objective function is the entropy of the steady-state distribution induced by a policy over the MDP [hazan2018provably]. The resulting optimal policy is:

$\pi^* \in \arg\max_{\pi} H(d^\pi), \qquad (3)$

where $d^\pi$ is the steady-state distribution induced by $\pi$ and $H(d^\pi) = -\sum_{s} d^\pi(s) \log d^\pi(s)$ is the state distribution entropy. Unfortunately, a direct optimization of this objective is particularly arduous since the steady-state distribution entropy is not a concave function of the policy [hazan2018provably]. To overcome this issue, a possible solution [hazan2018provably] is to use the conditional gradient method, such that the gradients of the steady-state distribution entropy become the intrinsic reward in a sequence of approximate dynamic programming problems [bertsekas1995dynamic].

In this paper, we follow an alternative route that consists in maximizing a lower bound to the policy entropy. In particular, in the following we will consider three lower bounds that lead to as many optimization problems (named Infinity, Frobenius, Column Sum) that show different trade-offs between theoretical guarantees and computational complexity.

Infinity  From the theory of Markov chains [levin2017markov], we know a necessary and sufficient condition for a policy to induce a uniform steady-state distribution (i.e., to achieve the maximum possible entropy). We report this result in the following theorem.

Theorem. Let $P$ be the transition matrix of a given MDP. The steady-state distribution induced by a policy $\pi$ is uniform over $\mathcal{S}$ iff the induced state transition matrix $P^\pi$ is doubly stochastic.

Unfortunately, given the constraints specified by the transition matrix $P$, a stationary Markovian policy that induces a doubly stochastic $P^\pi$ may not exist. On the other hand, it is possible to lower bound the entropy of the steady-state distribution induced by policy $\pi$ as a function of the minimum $L_\infty$-norm between $P^\pi$ and any doubly stochastic matrix.

Theorem. Let $P$ be the transition matrix of a given MDP and $\mathbb{P}$ the space of doubly stochastic matrices. The entropy of the steady-state distribution induced by a policy $\pi$ is lower bounded by:

The maximization of this lower bound leads to the following constrained optimization problem:


It is worth noting that this optimization problem can be reformulated as a linear program with optimization variables and inequality constraints and equality constraints (the linear program formulation can be found in Appendix B.1). In order to avoid the exponential growth of the number of constraints as a function of the number of states, we are going to introduce alternative optimization problems.
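The condition stated in the first theorem of this section can be checked numerically on small chains. The sketch below is illustrative only: the two matrices and the helper names are assumptions, and the steady state is approximated by iterating the chain rather than computed exactly.

```python
def is_doubly_stochastic(M, tol=1e-9):
    """True if M is non-negative and all of its rows and columns sum to one."""
    n = len(M)
    nonneg = all(x >= -tol for row in M for x in row)
    rows = all(abs(sum(row) - 1.0) <= tol for row in M)
    cols = all(abs(sum(M[i][j] for i in range(n)) - 1.0) <= tol for j in range(n))
    return nonneg and rows and cols

def long_run(P_pi, d0, n_steps=2000):
    """State distribution after n_steps transitions, starting from d0."""
    d = list(d0)
    n = len(d)
    for _ in range(n_steps):
        d = [sum(d[s] * P_pi[s][t] for s in range(n)) for t in range(n)]
    return d

# A lazy cycle on three states: row and column sums are all one.
P_ds = [[0.5, 0.5, 0.0],
        [0.0, 0.5, 0.5],
        [0.5, 0.0, 0.5]]
# A row stochastic matrix whose first column collects most of the mass.
P_skewed = [[0.9, 0.1, 0.0],
            [0.5, 0.0, 0.5],
            [0.5, 0.0, 0.5]]

d0 = [1.0, 0.0, 0.0]
d_ds = long_run(P_ds, d0)          # approaches the uniform distribution
d_skewed = long_run(P_skewed, d0)  # concentrates on the first state
```

Only the doubly stochastic chain ends up visiting the three states uniformly, even though both chains start from the same state.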

Frobenius  It is worth noting that different transition matrices $P^\pi$ having equal $L_\infty$-norm distance from a doubly stochastic matrix might lead to significantly different state distribution entropies $H(d^\pi)$, as the $L_\infty$-norm only accounts for the state corresponding to the maximum absolute row sum. The Frobenius norm can better capture the distance between $P^\pi$ and a doubly stochastic matrix over all the states, as discussed in Appendix C. For this reason, we have derived a lower bound to the policy entropy that replaces the $L_\infty$-norm with the Frobenius one.

Theorem. Let $P$ be the transition matrix of a given MDP and $\mathbb{P}$ the space of doubly stochastic matrices. The entropy of the steady-state distribution induced by a policy $\pi$ is lower bounded by:

It can be shown (see Corollary A.1 in Appendix A) that the lower bound based on the Frobenius norm cannot be better (i.e., larger) than the one based on the infinity norm. However, we have the advantage that the resulting optimization problem has significantly fewer constraints than Problem (4):


This problem is a (linearly constrained) quadratic problem with optimization variables and inequality constraints and equality constraints.
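The motivation for the Frobenius formulation can be illustrated with a small numerical sketch. Everything below is an assumption made for the example: the distance is measured against one fixed doubly stochastic target (the uniform matrix) rather than minimized over all doubly stochastic matrices, and the two perturbed matrices are hand-picked.

```python
import math

def inf_dist(A, B):
    """Maximum absolute row sum of A - B."""
    return max(sum(abs(a - b) for a, b in zip(ra, rb)) for ra, rb in zip(A, B))

def fro_dist(A, B):
    """Frobenius norm of A - B."""
    return math.sqrt(sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)))

def long_run(P_pi, d0, n_steps=2000):
    """State distribution after n_steps transitions, starting from d0."""
    d = list(d0)
    n = len(d)
    for _ in range(n_steps):
        d = [sum(d[s] * P_pi[s][t] for s in range(n)) for t in range(n)]
    return d

def entropy(d):
    """Shannon entropy (natural logarithm) of a distribution."""
    return -sum(p * math.log(p) for p in d if p > 0)

third = 1.0 / 3.0
U = [[third] * 3 for _ in range(3)]        # uniform, doubly stochastic target
shifted = [third + 0.2, third - 0.2, third]

P_one = [shifted, [third] * 3, [third] * 3]  # a single row deviates from U
P_all = [shifted, shifted, shifted]          # every row deviates in the same way

d0 = [1.0, 0.0, 0.0]
H_one = entropy(long_run(P_one, d0))
H_all = entropy(long_run(P_all, d0))
```

Both matrices sit at the same infinity-norm distance (0.4) from the target, but `P_all` is farther in Frobenius norm and also induces a lower steady-state entropy, which is the kind of difference the infinity norm cannot see.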

Column Sum  Problems (4) and (5) aim at finding a policy associated with a state transition matrix that is doubly stochastic. To achieve this result it is enough to guarantee that the column sums of the matrix $P^\pi$ are all equal to one [kirkland2010column]. A measure that can be used to evaluate the distance to a doubly stochastic matrix is the absolute sum of the difference between one and the column sums: $\sum_{s'} \big| 1 - \sum_{s} P^\pi(s'|s) \big|$. The following theorem provides a lower bound to the policy entropy as a function of this measure.

Theorem. Let $P$ be the transition matrix of a given MDP. The entropy of the steady-state distribution induced by a policy $\pi$ is lower bounded by:

The optimization of this lower bound leads to the following linear program:


Besides being a linear program, Problem (6), unlike the other optimization problems presented, does not require optimizing over the space of all the doubly stochastic matrices, thus significantly reducing the number of optimization variables () and constraints ( inequalities and equalities). The linear program formulation of Problem (6) can be found in Appendix B.2.
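The column-sum measure described above is cheap to evaluate for a given induced transition matrix. A minimal sketch follows; the function name and the two example matrices are assumptions for illustration, and the code only evaluates the measure, it does not solve Problem (6).

```python
def column_sum_gap(P_pi):
    """Sum over next states of |1 - column sum| for a row stochastic matrix."""
    n = len(P_pi)
    return sum(abs(1.0 - sum(P_pi[s][t] for s in range(n))) for t in range(n))

# Doubly stochastic chain: every column already sums to one.
P_ds = [[0.5, 0.5, 0.0],
        [0.0, 0.5, 0.5],
        [0.5, 0.0, 0.5]]
# Row stochastic chain with column sums 1.9, 0.1 and 1.0.
P_skewed = [[0.9, 0.1, 0.0],
            [0.5, 0.0, 0.5],
            [0.5, 0.0, 0.5]]

gap_ds = column_sum_gap(P_ds)
gap_skewed = column_sum_gap(P_skewed)
```

The measure is zero exactly when the row stochastic matrix is also column stochastic, and it grows as incoming probability mass concentrates on a few states.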

3.2 Highly Exploring Policies over the State and Action Space

Although the policy resulting from the optimization of one of the above problems may lead to the most uniform exploration of the state space, the actual goal of the exploration phase is to collect enough information on the environment to optimize, at some point, a goal-conditioned policy [pong2019skew]. To this end, it is essential to have an exploratory policy that adequately covers the action space in any visited state. Unfortunately, the optimization of Problems (4), (5), (6) does not even guarantee that the obtained policy is stochastic. Thus, we need to embed in the problem a secondary objective that takes into account the exploration over the action space. This can be done by enforcing a minimal entropy over actions in the policy to be learned, adding to (4), (5), (6) the following constraints:


where . This secondary objective is actually in competition with the objective of uniform exploration over states. Indeed, an overblown incentive in the exploration over actions may limit the state distribution entropy of the optimal policy. Having a low probability of visiting a state decreases the likelihood of sampling an action from that state, hence, also reducing the exploration over actions. To illustrate that, Figure 1 shows state distribution entropies () and state-action distribution entropies, i.e., , achieved by the optimal policy w.r.t. Problem (5) on the Single Chain domain [furmston2010variational] for different values of .
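The two entropies compared in this discussion can be computed as follows. This is a minimal sketch under stated assumptions: the state-action distribution is taken to be the product of the state distribution and the policy, and the names `state_action_distribution`, `satisfies_min_action_prob` and the threshold `xi` are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (natural logarithm) of a distribution given as a flat list."""
    return -sum(x * math.log(x) for x in p if x > 0)

def state_action_distribution(d, pi):
    """Flattened distribution over (s, a) pairs: d[s] * pi[s][a]."""
    return [d[s] * pi[s][a] for s in range(len(d)) for a in range(len(pi[s]))]

def satisfies_min_action_prob(pi, xi):
    """True if every action has probability at least xi in every state."""
    return all(p >= xi for row in pi for p in row)

d = [0.25, 0.25, 0.25, 0.25]                 # uniform state distribution
pi_det = [[1.0, 0.0] for _ in range(4)]      # deterministic policy
pi_unif = [[0.5, 0.5] for _ in range(4)]     # uniform policy

H_state = entropy(d)
H_sa_det = entropy(state_action_distribution(d, pi_det))
H_sa_unif = entropy(state_action_distribution(d, pi_unif))
```

With the same uniform state distribution, the deterministic policy adds nothing to the state-action entropy, while the uniform policy adds log 2; this is why a constraint on the action probabilities is needed to cover the action space.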

Figure 1: State distribution entropy (), state-action distribution entropy () for different values of on the Single Chain domain.
Figure 2: State distribution entropy (), spectral gap () for different values of on the Single Chain domain (left). Color-coded state distribution overlaid on a -rooms gridworld for different values of (right).

3.3 An Objective to Make Highly Exploring Policies Mix Faster

In many cases, such as in episodic tasks where the horizon for exploration is capped, we may have interest in trading inferior state entropy for faster convergence of the learned policy. Although the doubly stochastic matrices are equally valid in terms of steady-state distribution, the choice of the target strongly affects the mixing properties of the MC induced by the policy. Indeed, while an MC with a uniform transition matrix, i.e., transition probabilities $P^\pi(s'|s) = 1/|\mathcal{S}|$ for any $s$, $s'$, mixes in no time, an MC with probability one on the self-loops never converges to a steady state. This is evident considering that the mixing time of an MC is bounded as follows [levin2017markov, Theorems 12.3 and 12.4]:


where is the mixing threshold, is a minorization of , and is the spectral gap of  (2). From the literature of MCs, we know that a variant of the Problems (4), (5) having the uniform transition matrix as target and the $L_2$-norm as matrix norm, is equivalent to the problem of finding the fastest mixing transition matrix  [boyd2004fastest]. However, the choice of this target may overly limit the entropy over the state distribution induced by the optimal policy. Instead, we look for a generalization that allows us to prioritize fast exploration at will. Thus, we consider a continuum of relaxations in the fastest mixing objective by embedding in Problems (4) and (5) (but not in Problem (6)) the following constraints:


where . By setting , we force the optimization problem to consider the uniform transition matrix as a target, thus aiming to reduce the mixing time, while larger values of relax this objective, allowing us to get a higher steady-state distribution entropy. In Figure 2 we show how the parameter affects the trade-off between high steady-state entropy and low mixing times (i.e., high spectral gaps), reporting the values obtained by optimal policies w.r.t. Problem (5) for different .
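The two extreme doubly stochastic targets discussed in this section can be compared directly. The sketch below is illustrative: the four-state chains, the helper names and the use of total variation as distance are assumptions.

```python
def distribution_after(P_pi, d0, t):
    """State distribution after t transitions, starting from d0."""
    d = list(d0)
    n = len(d)
    for _ in range(t):
        d = [sum(d[s] * P_pi[s][u] for s in range(n)) for u in range(n)]
    return d

def total_variation(p, q):
    """Total-variation distance between two distributions."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))

n = 4
uniform = [1.0 / n] * n
P_uniform = [[1.0 / n] * n for _ in range(n)]                               # uniform transition matrix
P_self = [[1.0 if s == t else 0.0 for t in range(n)] for s in range(n)]    # self-loops only

d0 = [1.0, 0.0, 0.0, 0.0]
dist_uniform = total_variation(distribution_after(P_uniform, d0, 1), uniform)
dist_self = total_variation(distribution_after(P_self, d0, 100), uniform)
```

Both matrices are doubly stochastic, yet the first reaches the uniform distribution after a single step while the second never moves away from the initial state, which is why the choice of the target matters for the mixing time.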

4 A Model-Based Algorithm for Highly Exploring and Fast Mixing Policies

Input: , , batch size
Initialize and transition counts
for  until convergence  do
     Collect N steps with and update
     Estimate the transition model as:
      optimal policy for (4) (or (5) or (6)),
     given the parameters and
end for
Output: exploratory policy
Algorithm 1 IDEAL
Figure 3: Model estimation error on the Double Chain with , , (100 runs, 95% c.i.).

In this section, we present an approach to incrementally learn a highly exploring and fast mixing policy through interactions with an unknown environment, developing a novel model-based exploration algorithm called Intrinsically-Driven Effective and Efficient Exploration ALgorithm (IDEAL). Since Problems (4), (5), (6) require an explicit representation of the transition matrix, we need to estimate the transition model from samples before performing an objective optimization (model-based approach). In tabular settings, this can be easily done by adopting the transition frequency as a proxy for the (unknown) transition probabilities, obtaining an estimated transition model . However, in hard-exploration tasks, it can be arbitrarily arduous to sample transitions from the most difficult-to-reach states by relying on naïve exploration mechanisms, such as a random policy. To address the issue, we lean on an iterative approach in which we alternate model estimation phases with optimization sweeps of the objectives (4), (5) or (6). In this way, we combine the benefit of collecting samples with highly exploring policies to better estimate the transition model and the benefit of having a better-estimated model to learn superior exploratory policies. In order to steer the policy towards state-action pairs that have never been sampled, we keep their corresponding next-state distribution uniform over all possible states, thus making the pair particularly valuable in the perspective of the optimization problem. The algorithm converges whenever the exploratory policy remains unchanged during consecutive optimization sweeps and, if we know the size of the MDP, when all state-action pairs have been sufficiently explored. In Algorithm 1 we report the pseudo-code of IDEAL. Finally, in Figure 3 we compare the iterative formulation against a non-iterative one, i.e., an approach that collects samples with a random policy and then optimizes the exploration objective off-line.
Considering an exploration task on the Double Chain domain [furmston2010variational], we show that the iterative form has a clear edge in reducing the model estimation error . Both approaches employ a Frobenius formulation.
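The alternation between model estimation and policy optimization described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the optimization of Problems (4), (5) or (6) is abstracted into a caller-supplied `optimize_policy` function, the convergence check is replaced by a fixed number of iterations, and the names `estimate_model`, `ideal_sketch` and `env_step` are hypothetical.

```python
import random

def estimate_model(counts):
    """Transition frequencies from counts[s][a][s_next]; unvisited (s, a) pairs stay uniform."""
    n_states = len(counts)
    model = []
    for s_counts in counts:
        rows = []
        for sa_counts in s_counts:
            total = sum(sa_counts)
            if total == 0:
                rows.append([1.0 / n_states] * n_states)   # never sampled: uniform next state
            else:
                rows.append([c / total for c in sa_counts])
        model.append(rows)
    return model

def ideal_sketch(env_step, optimize_policy, n_states, n_actions, s0,
                 batch_size, n_iters, seed=0):
    """Alternate sample collection with the current policy and policy optimization."""
    rng = random.Random(seed)
    counts = [[[0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    pi = [[1.0 / n_actions] * n_actions for _ in range(n_states)]  # start from a uniform policy
    s = s0
    for _ in range(n_iters):
        for _ in range(batch_size):
            a = rng.choices(range(n_actions), weights=pi[s])[0]
            s_next = env_step(s, a, rng)
            counts[s][a][s_next] += 1
            s = s_next
        pi = optimize_policy(estimate_model(counts))
    return pi, counts

# Usage with placeholder components (both are assumptions for the example).
def cyclic_env(s, a, rng):
    return (s + 1) % 3

def keep_uniform(model):
    return [[0.5, 0.5] for _ in model]

pi, counts = ideal_sketch(cyclic_env, keep_uniform, n_states=3, n_actions=2,
                          s0=0, batch_size=10, n_iters=5)
```

Keeping unvisited pairs uniform makes them look attractive to any objective that rewards spreading probability mass over next states, which is what pushes the optimized policy towards the unexplored parts of the model.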

5 Experimental Evaluation

In this section, we provide the experimental evaluation of IDEAL. First, we show a set of experiments on the illustrative Single Chain and Double Chain domains [furmston2010variational, peters2010relative]. The Single Chain consists of states having possible actions, one to climb up the chain from state to , and the other to directly fall to the initial state . The two actions are flipped with a probability , making the environment stochastic and reducing the probability of visiting the higher states. The Double Chain concatenates two Single Chains into a bigger one sharing the central state , which is the initial state. Thus, the chain can be climbed in two directions. These two domains, albeit rather simple from a dimensionality standpoint, are actually hard to explore uniformly, due to the high share of actions returning to the initial state and preventing the agent from consistently reaching the higher states. Then, we present an experiment on the much more complex Knight Quest environment [fruit2018efficient, Appendix], having and . This domain takes inspiration from classical arcade games, in which a knight has to rescue a princess in the shortest possible time without being killed by the dragon. To accomplish this feat, the knight has to perform an intricate sequence of actions. In the absence of any reward, it is a fairly challenging environment for exploration. On these domains, we address the task of learning the best exploratory policy in a limited number of samples. In particular, we evaluate these policies in terms of the induced state entropy and state-action entropy .
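The Single Chain dynamics described above can be sketched as a tabular transition model. Several details are assumptions made for this example, since they are not specified here: the number of states, the flip probability, and the behavior of the climbing action in the top state (it is assumed to keep the agent there).

```python
def single_chain(n_states, p_flip):
    """Transition model P[s][a][s_next]; action 0 climbs, action 1 falls to state 0."""
    P = [[[0.0] * n_states for _ in range(2)] for _ in range(n_states)]
    for s in range(n_states):
        up = min(s + 1, n_states - 1)   # assumption: climbing from the top state stays there
        P[s][0][up] += 1.0 - p_flip     # climb succeeds
        P[s][0][0] += p_flip            # climb flipped into a fall
        P[s][1][0] += 1.0 - p_flip      # fall succeeds
        P[s][1][up] += p_flip           # fall flipped into a climb
    return P

def long_run_uniform_policy(P, n_steps=2000):
    """State distribution reached by a uniformly random policy started in state 0."""
    n = len(P)
    d = [1.0] + [0.0] * (n - 1)
    for _ in range(n_steps):
        d = [sum(d[s] * 0.5 * (P[s][0][t] + P[s][1][t]) for s in range(n)) for t in range(n)]
    return d

P = single_chain(n_states=5, p_flip=0.1)
d_random = long_run_uniform_policy(P)
```

Under a uniformly random policy the visitation probability halves at every level of the chain, so the higher states are rarely reached; this is the behavior that makes the domain hard to explore uniformly.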

We compare our approach with MaxEnt [hazan2018provably], the model-based algorithm to learn maximum entropy exploration that we have previously discussed in the paper, and a count-based approach inspired by the exploration bonuses of MBIE-EB [strehl2008analysis], which we refer to as CountBased in the following. The latter shares the same structure as our algorithm, but replaces the policy optimization sweeps with approximate value iterations [bertsekas1995dynamic], where the reward for a given state is inversely proportional to the visit count of that state. It is worth noting that the results reported for the MaxEnt algorithm are related to the mixture policy , where is a set of -deterministic policies, and is a probability distribution over . For the sake of simplicity, we have equipped all the approaches with a little domain knowledge, i.e., the cardinality of the state and action spaces. However, this can be avoided without a significant impact on the presented results. For every experiment, we will report the batch-size , and the parameters , of IDEAL. CountBased and MaxEnt employ $\epsilon$-greedy policies having in all the experiments. In any plot, we will additionally provide the performance of a baseline policy, denoted as Random, that randomly selects an action in every state. Detailed information about the presented results, along with an additional experiment, can be found in Appendix D.

Figure 4: State distribution entropy () and probability of the least favorable state () for different objective formulations (Frobenius, Infinity, Column Sum) on the Single Chain domain. We report exact solutions with (left), and approximate optimizations with , , (100 runs, 95% c.i.) (right).

First, in Figure 4, we compare the Problems (4), (5), (6) on the Single Chain environment. On one hand, we show the performance achieved by the exact solutions, i.e., computed with a full knowledge of . While the plain formulations () are remarkably similar, adding a constraint over the action entropy () has a significantly different impact. On the other hand, we illustrate the performance of IDEAL, equipped with the alternative optimization objectives, in learning a good exploratory policy from samples. In this case, the Frobenius clearly achieves a better performance. In the following, we will report the results of IDEAL considering only the best-performing formulation, which, for all the presented experiments, corresponds to the Frobenius.

In Figure 10(a), we show that IDEAL compares well against the other approaches in exploring the Double Chain domain. It achieves superior state entropy and state-action entropy, and it converges faster to the optimum. It also displays a higher probability of visiting the least favorable state, and it behaves positively in the estimation of the transition model. Notably, the CountBased algorithm fails to reach high exploration due to a detachment problem [ecoffet2019go], since it fluctuates between two exploratory policies that are greedy towards the two directions of the chain. By contrast, in a domain having a clear direction for exploration, such as the simpler Single Chain domain, CountBased ties the explorative performance of IDEAL (Figure 10(b)). On the other hand, MaxEnt is effective in the exploration performance, but much slower to converge, both in the Double Chain and the Single Chain. Note that in Figure 10(a), the model estimation error of MaxEnt starts higher than the others, since it employs a different strategy to fill the transition probabilities of never reached states, inspired by [brafman2002r]. In Figure 10(c), we present an experiment on the higher-dimensional Knight Quest environment. IDEAL achieves a remarkable state entropy, while MaxEnt struggles to converge towards a satisfying exploratory policy. CountBased (not reported in Figure 10(c), see Appendix D) fails to explore the environment altogether, oscillating between policies with low entropy.

In Figure 10(d), we illustrate how the exploratory policies learned in the Double Chain environment are effective to ease learning of any possible goal-conditioned policy afterwards. To this end, the exploratory policies, learned by the three approaches through 3000 samples (Figure 10(a)), are employed to collect samples in a fixed horizon (within a range from 10 to 100 steps). Then, a goal-conditioned policy is learned off-line through approximate value iteration [bertsekas1995dynamic] on this small amount of samples. The goal is to optimize a reward function that is 1 for the hardest state to reach (i.e., the state that is least frequently visited with a random policy) and 0 in all the other states. In this setting, all the methods prove to be rather successful w.r.t. the baseline, though IDEAL compares positively against the other strategies.
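The off-line goal-conditioned step can be sketched with plain tabular value iteration on a transition model. This is a simplified stand-in for the approximate value iteration used in the experiments: the discount factor `gamma`, the number of sweeps, the state-based reward and the deterministic four-state chain are all assumptions made for the example.

```python
def value_iteration(P, reward, gamma=0.95, n_iters=500):
    """Greedy policy and values for a model P[s][a][s_next] and a state reward vector."""
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        Q = [[reward[s] + gamma * sum(P[s][a][t] * V[t] for t in range(n_states))
              for a in range(n_actions)] for s in range(n_states)]
        V = [max(q) for q in Q]
    policy = [max(range(n_actions), key=lambda a: Q[s][a]) for s in range(n_states)]
    return policy, V

# Deterministic chain (assumption): action 0 climbs (top state stays), action 1 falls to state 0.
n = 4
P = [[[0.0] * n for _ in range(2)] for _ in range(n)]
for s in range(n):
    P[s][0][min(s + 1, n - 1)] = 1.0
    P[s][1][0] = 1.0

reward = [0.0, 0.0, 0.0, 1.0]   # reward 1 only in the hardest state to reach
policy, V = value_iteration(P, reward)
```

With an accurate model the greedy policy climbs the chain in every state; with a poorly estimated model (for instance one built from samples that never reach the top) the reward would be invisible, which is why the quality of the exploratory samples matters.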

Figure 10: Comparison of the algorithms on exploration tasks (a, b, c) and goal-conditioned learning (d), with parameters , , (a, b, d) and , , (c). (95% c.i. over 100 runs (a, b), 40 runs (c), 500 runs (d)). Comparison of the solve time (e) achieved by Column Sum and Dual formulations as a function of the number of variables.

6 Discussion

In this section, we first discuss how the proposed approach might be extended beyond tabular settings and an alternative formulation for the policy entropy optimization. Then, we consider some relevant work related to this paper.

6.1 Potential Extension to Continuous

We believe that the proposed approach has the potential to be extended to more general, continuous, settings, by exploiting the core idea of avoiding a probability concentration on a subset of outgoing transitions from a state. Indeed, a compelling feature of the presented lower bounds is that they characterize an infinite-step property, the entropy of the steady-state distribution, relying only on one-step quantities, i.e., without requiring the state transition matrix to be unrolled several times. In addition to this, the lower bounds provide an evaluation for the current policy, and they can be computed for any policy. Thus, we could potentially operate a direct search in the policy space through the gradient of an approximation of these lower bounds. To perform the approximation we could use a kernel for a soft aggregation over regions of the, now continuous, state space.

6.2 A Dual Formulation

A potential alternative to deal with the optimization of the objective (3) is to consider its dual formulation. This is rather similar to the approach proposed in [tarbouriech2019active] to address the different problem of active exploration in an MDP. The basic idea is to directly maximize the entropy over the state-action stationary distribution and then to recover the policy afterwards. In this setting, we define the state-action stationary distribution induced by a policy as , where is a vector of size having elements . Since not all the distributions over the state-action space can actually be induced by a policy over the MDP, we characterize the set of feasible distributions:

Then, we can formulate the Dual Problem as:


Finally, letting denote the solution of Problem (10), we can recover the policy inducing the optimal state-action entropy by normalizing the state-action distribution over the actions of each state.
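Recovering a policy from a state-action stationary distribution amounts to a per-state normalization. A minimal sketch follows, assuming the distribution is stored as a table `omega[s][a]`; the function name and the uniform fallback for states with zero mass are assumptions.

```python
def policy_from_state_action(omega):
    """pi[s][a] = omega[s][a] / sum_a omega[s][a]; uniform where the state has no mass."""
    policy = []
    for row in omega:
        mass = sum(row)
        if mass > 0:
            policy.append([w / mass for w in row])
        else:
            policy.append([1.0 / len(row)] * len(row))  # assumption: fallback for unvisited states
    return policy

omega = [[0.1, 0.3],
         [0.6, 0.0],
         [0.0, 0.0]]
pi = policy_from_state_action(omega)
```

The state distribution is obtained from the same table by summing each row, so a single solution of the dual problem gives both the visitation frequencies and the policy that induces them.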

The Dual Problem displays some appealing features. In particular, Problem (10) is already a convex program, so that it can be optimized right away, and it allows us to explicitly maximize the entropy over the state-action space. Nonetheless, we think that this alternative formulation has three major shortcomings. First, the optimization of the convex program (10) could be much slower than the optimization of the linear programs Column Sum and Infinity [grotschel1993ellipsoid]. Secondly, it does not allow controlling the mixing time of the learned policy, which can be extremely relevant. Lastly, the applicability of the Dual Problem to continuous environments seems far-fetched. It is worth noting that, from an empirical evaluation, the dual formulation does not provide any significant benefit in the entropy of the learned policy w.r.t. the lower bound formulations (see Appendix D). Figure 10(e) shows how the solve time of the Column Sum scales better with the number of variables () in incrementally large Knight Quest domains.

6.3 Related Work

As discussed in the previous sections, Hazan et al. [hazan2018provably] consider an objective not that dissimilar to the one presented in this paper, even if they propose a fairly different solution to the problem. Their method learns a mixture of deterministic policies instead of a single stochastic policy. In a similar flavor, Tarbouriech and Lazaric [tarbouriech2019active] develop an approach, based on a dual formulation of the objective, to learn a mixture of stochastic policies for active exploration.

Others propose to intrinsically motivate the agent towards learning to reach all possible states in the environment [lim2012autonomous]. To extend this same idea from the tabular setting to the context of a continuous, high-dimensional state space, Pong et al. [pong2019skew] employ a generative model to seek a maximum-entropy goal distribution. Ecoffet et al. [ecoffet2019go] propose a method, called Go-Explore, to methodically reach any state by keeping an archive of any visited state and the best trajectory that brought the agent there. At each iteration, the agent draws a promising state from the archive, returns there replicating the stored trajectory (Go), then explores from this state trying to discover new states (Explore).

Another promising intrinsic objective is to make value out of the exploration phase by acquiring a set of reusable skills, typically formulated by means of the option framework [sutton1999between]. In [barto2004intrinsically], a set of options is learned by maximizing an intrinsic reward that is generated at the occurrence of some, user-defined, salient event. The approach proposed by Bonarini et al. [bonarini2006incremental], which presents some similarities with the work in [ecoffet2019go], is based on learning a set of options to return with high probability to promising states. In their context, a promising state presents high unbalance between the probabilities of the input and output transitions [bonarini2006self], so that it is both a hard state to reach, and a doorway to reach many other states. In this way, the learned options heuristically favor an even exploration of the state space.

7 Conclusions

In this paper, we proposed a new model-based algorithm, IDEAL, to learn highly exploring and fast mixing policies. The algorithm outputs a policy that maximizes a lower bound to the entropy of the steady-state distribution. We presented three formulations of the lower bound that trade off tightness against the computational complexity of the optimization in different ways. The experimental evaluation showed that IDEAL achieves better performance than other approaches striving for uniform exploration of the environment, while avoiding the risk of detachment and derailment [ecoffet2019go]. Future work could focus on extending the applicability of the presented approach to non-tabular environments, following the blueprint in Section 6.1. We believe that this work provides a valuable contribution towards solving the conundrum of what a reinforcement learning agent should learn in the absence of any reward coming from the environment.


This work has been partially supported by the Italian MIUR PRIN 2017 Project ALGADIMAR “Algorithms, Games, and Digital Market”.


Appendix A Proofs




Let us recall the definition of the steady-state distribution of the MC induced by the policy over the MDP:

If the steady-state distribution is uniform, we have:


then the state transition matrix is column stochastic, while it is also row stochastic by definition. Conversely, if the matrix is doubly stochastic, we aim to prove that a steady-state distribution that is not uniform causes an inconsistency in the stationary condition. Let us consider a perturbation of the uniform distribution, such that for all the states in outside of:


where is a, sufficiently small, positive constant. Since is doubly stochastic, the sum:


is a convex combination of the elements in . Hence, for the stationary condition to hold, we must have and for all different from . Nevertheless, a state with probability one on the self-loop cannot have a stationary distribution different from or . ∎
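The statement proved above can be checked numerically on a small chain. The following is a minimal sketch, not part of the original proof: the two 3-state matrices are hypothetical examples, and the steady-state distribution is approximated by plain power iteration under the assumption that the chain is ergodic.

```python
def step(dist, P):
    """One step of the chain: new[j] = sum_i dist[i] * P[i][j] (rows of P sum to 1)."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary(P, iters=1000):
    """Approximate the steady-state distribution by power iteration."""
    n = len(P)
    dist = [1.0] + [0.0] * (n - 1)
    for _ in range(iters):
        dist = step(dist, P)
    return dist


# Hypothetical doubly stochastic matrix: every row and every column sums to 1.
P_ds = [[0.5, 0.3, 0.2],
        [0.2, 0.5, 0.3],
        [0.3, 0.2, 0.5]]

# Hypothetical row-stochastic matrix whose columns do not sum to 1.
P_rs = [[0.9, 0.1, 0.0],
        [0.5, 0.4, 0.1],
        [0.5, 0.0, 0.5]]

d_ds = stationary(P_ds)  # close to the uniform distribution
d_rs = stationary(P_rs)  # concentrated on the first state
```

In this example the doubly stochastic chain settles on the uniform distribution, while the second chain, which is only row stochastic, does not.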




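We start by rewriting the entropy of $d^\pi$ as follows:

$$H(d^\pi) = -\sum_{s \in \mathcal{S}} d^\pi(s) \log d^\pi(s) = \log |\mathcal{S}| - D_{KL}(d^\pi \,\|\, d^u), \tag{14}$$

where $d^u$ is the uniform distribution over the state space (all the entries equal to $1/|\mathcal{S}|$) and $D_{KL}(p \,\|\, q)$ is the Kullback-Leibler (KL) divergence between distribution $p$ and $q$.
Using the reverse Pinsker inequality [csiszar2006context, p. 1012 and Lemma 6.3], we can upper bound the KL divergence between $d^\pi$ and $d^u$:

$$D_{KL}(d^\pi \,\|\, d^u) \leq \frac{\|d^\pi - d^u\|_1^2}{\min_{s \in \mathcal{S}} d^u(s)} = |\mathcal{S}| \, \|d^\pi - d^u\|_1^2. \tag{15}$$

The total variation between the two steady-state distributions $d^\pi$ and $d^u$ can in turn be upper bounded by (see [schweitzer1968perturbation]):

$$\|d^\pi - d^u\|_1 \leq \|Z^u\|_\infty \, \|P^\pi - P^u\|_\infty, \tag{16}$$

where $Z^u$ is the fundamental matrix of $P^u$ and $P^u$ is any doubly stochastic matrix ($P^u \in \mathbb{P}$). Since the fundamental matrix associated to any doubly stochastic matrix is row stochastic [hunter2010some], then $\|Z^u\|_\infty = 1$. Furthermore, since the bound in Equation (16) holds for any $P^u \in \mathbb{P}$, we can rewrite the bound as follows:

$$\|d^\pi - d^u\|_1 \leq \inf_{P^u \in \mathbb{P}} \|P^\pi - P^u\|_\infty. \tag{17}$$

Combining Equations (15) and (17) we get an upper bound to the KL divergence, which, once replaced in Equation (14), provides the lower bound in the statement and concludes the proof. ∎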
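The first two steps of this proof lend themselves to a quick numerical sanity check. The sketch below is illustrative only: the distribution `d` is a hypothetical example, and the reverse Pinsker step is assumed to take the form "KL divergence is at most the squared L1 distance divided by the smallest entry of the reference distribution", which for a uniform reference over n states gives the cap n times the squared L1 distance.

```python
import math


def entropy(d):
    """Shannon entropy (natural logarithm) of a discrete distribution."""
    return -sum(p * math.log(p) for p in d if p > 0)


def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


d = [5 / 6, 5 / 36, 1 / 36]  # hypothetical non-uniform steady-state distribution
n = len(d)
u = [1.0 / n] * n            # uniform distribution over the n states

# Entropy rewritten as log(n) minus the KL divergence from the uniform distribution.
lhs = entropy(d)
rhs = math.log(n) - kl(d, u)

# Assumed reverse Pinsker cap on the KL divergence for a uniform reference.
l1 = sum(abs(a - b) for a, b in zip(d, u))
pinsker_cap = n * l1 ** 2
```

Here `lhs` and `rhs` agree up to floating-point error, and `kl(d, u)` stays below `pinsker_cap`, as the two steps require.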




From the properties of the matrix norms [petersen2008matrix], we have that for any matrix it holds:

As a consequence:

where . Combining this inequality with the result in Theorem 3.1 concludes the proof. ∎
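The norm property invoked in this proof is presumably the standard relation between the infinity norm and the Frobenius norm of an n-by-n matrix, namely that the infinity norm is at most the square root of n times the Frobenius norm. Under that assumption, a minimal sketch of an empirical check on random matrices (the seed and the matrix size are arbitrary choices) could look like this:

```python
import math
import random


def inf_norm(A):
    """Induced infinity norm: the maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in A)


def fro_norm(A):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in A for x in row))


random.seed(0)  # arbitrary seed, only for reproducibility
n = 5
gaps = []
for _ in range(1000):
    A = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    # Non-negative whenever inf_norm(A) <= sqrt(n) * fro_norm(A).
    gaps.append(math.sqrt(n) * fro_norm(A) - inf_norm(A))

smallest_gap = min(gaps)
```

A non-negative `smallest_gap` is consistent with the inequality; it is of course an empirical check and not a substitute for the proof.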




We start with defining the vector that results from the difference between the vector of ones and the vector of the column sums: . We denote with the matrix obtained from by adding to the row corresponding to state :

It is worth noting that, since , the column sums and the row sums of matrix are all equal to . Nonetheless, is not guaranteed to be doubly stochastic since its entries can be lower than . However, it is possible to show that

When is doubly stochastic, the above inequality holds by definition. When has negative entries, it is always possible to transform it to a doubly stochastic matrix without increasing the distance from . In order to remove the negative entries of , we need to trade probability with the other states, so as to preserve the row sum. Each state that gives probability to state , will receive the same amount of probability taken by the columns corresponding to positive values of the vector . In order to illustrate this procedure, we consider a four-state MDP and a policy that leads to the following state transition matrix:

The corresponding vector is

Summing to the first row of we get:

Since we have two negative elements, to get a doubly stochastic matrix we can modify the matrix as follows:

  • move from element to and (to keep the row sum equal to 1) move from to

  • move from element to and (to keep the row sum equal to 1) move from to

The resulting matrix is:

The described procedure yields a doubly stochastic matrix such that . Combining this upper bound with the result in Theorem 3.1 concludes the proof. ∎
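The first step of the construction, adding the column-sum deficit vector to one row, can be sketched in a few lines. This is an illustrative example on a hypothetical 3-state matrix rather than the four-state example of the proof; the names `column_deficit` and `add_to_row` are made up for this sketch.

```python
def column_deficit(P):
    """Vector of ones minus the vector of column sums of P."""
    n = len(P)
    return [1.0 - sum(P[i][j] for i in range(n)) for j in range(n)]


def add_to_row(P, c, row):
    """Copy of P with the vector c added to the given row."""
    Q = [list(r) for r in P]
    Q[row] = [x + cj for x, cj in zip(Q[row], c)]
    return Q


# Hypothetical row-stochastic matrix with column sums (1.9, 0.5, 0.6).
P = [[0.9, 0.1, 0.0],
     [0.5, 0.4, 0.1],
     [0.5, 0.0, 0.5]]

c = column_deficit(P)    # approximately (-0.9, 0.5, 0.4), entries sum to zero
Q0 = add_to_row(P, c, 0) # row 0 becomes approximately (0.0, 0.6, 0.4)
Q1 = add_to_row(P, c, 1) # row 1 becomes approximately (-0.4, 0.9, 0.5)

row_sums = [sum(r) for r in Q1]
col_sums = [sum(Q1[i][j] for i in range(3)) for j in range(3)]
```

Both `Q0` and `Q1` have unit row and column sums, but `Q1` contains a negative entry. This is the case in which the proof redistributes probability to recover a doubly stochastic matrix.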

Corollary A.1.

The infinity-norm bound in Theorem 3.1 is never less than the corresponding Frobenius-norm bound.


From the properties of the matrix norms [petersen2008matrix], we have that for any matrix it holds:


As a consequence:

where . It follows that

Appendix B Optimization Problems

B.1 Linear program formulation of Problem (4)

Problem (4) can be rewritten as follows:

subject to

The first set of inequality constraints can be transformed into a set of linear inequality constraints. Each constraint is obtained by removing the absolute values and considering a different combination of the signs in front of the terms in the summation. As a result, if the original summation contains $n$ elements, the number of linear constraints is $2^n$. Since this process needs to be repeated for each state, the first set of constraints can be replaced by $|\mathcal{S}| \cdot 2^{|\mathcal{S}|}$ linear constraints.
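The sign-enumeration trick described above can be sketched as follows. This is a minimal illustration, with hypothetical helper names, of why a single constraint on a sum of n absolute values is equivalent to 2^n linear constraints, one per choice of signs.

```python
from itertools import product


def sign_patterns(n):
    """All 2**n sign vectors, one per linear constraint."""
    return list(product((1, -1), repeat=n))


def satisfies_all(x, t, patterns):
    """True if sum(s_i * x_i) <= t holds for every sign pattern s.

    This is equivalent to sum(|x_i|) <= t, because the largest signed
    sum is attained when each sign matches the sign of x_i.
    """
    return all(sum(s * xi for s, xi in zip(pattern, x)) <= t
               for pattern in patterns)


patterns = sign_patterns(3)   # 8 linear constraints for a 3-term summation
x = [0.2, -0.3, 0.1]          # sum of absolute values is 0.6
ok_loose = satisfies_all(x, 0.7, patterns)
ok_tight = satisfies_all(x, 0.5, patterns)
```

The exponential growth in the number of constraints is what makes this formulation expensive as the number of terms in the summation increases.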

B.2 Linear program formulation of Problem (6)

Let be a vector of length . Problem (6) can be rewritten as follows:

subject to

Appendix C Illustrative Example

The example in Figure 11 shows that the Frobenius norm can capture the distance between a transition matrix and a doubly stochastic matrix better than the $\infty$-norm. Indeed, the $\infty$-norm only accounts for the state that corresponds to the maximum absolute row sum of the difference, while the Frobenius norm considers the difference across all the states. In the example, we see two transition matrices that are equally bad in the worst state and thus have equal $\infty$-norm. However, one of them is fairly unbalanced in the other states as well, whereas the other is uniform there, and so the latter is clearly preferable in view of the uniform exploration objective.

Figure 11: Graphical representation of a Markov chain (left), with the transition probabilities on the edges. On the right, a table providing the values of the $\infty$-norm and the Frobenius norm of the difference w.r.t. a uniform transition matrix, along with the state distribution entropies.
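The same effect can be reproduced with a small numerical sketch. The two 3-state matrices below are hypothetical and are not the ones depicted in Figure 11; they are chosen so that both deviate from the uniform matrix by the same amount in the worst state, while only one of them deviates in the remaining states.

```python
import math


def inf_norm(A):
    """Induced infinity norm: the maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in A)


def fro_norm(A):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in A for x in row))


def diff(A, B):
    """Entry-wise difference A - B."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


third = 1.0 / 3.0
U = [[third] * 3 for _ in range(3)]       # uniform, doubly stochastic

skewed = [0.8, 0.1, 0.1]
P_a = [skewed, [third] * 3, [third] * 3]  # unbalanced in one state only
P_b = [skewed, skewed, skewed]            # unbalanced in every state

inf_a, inf_b = inf_norm(diff(P_a, U)), inf_norm(diff(P_b, U))
fro_a, fro_b = fro_norm(diff(P_a, U)), fro_norm(diff(P_b, U))
```

The two infinity norms coincide, whereas the Frobenius norm is smaller for `P_a`, the matrix that is uniform outside the worst state.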

Appendix D Experimental Evaluation: Further Details

In the following, we provide further details on the experimental evaluation covered by Section 5. First, for the sake of clarity, we report the pseudo-code of the MaxEnt and CountBased algorithms, which we have compared with our approach. Then, for each presented experiment, we recap the full set of parameters employed, and we report the value of the exact solutions and a characterization of the solve time for all the different formulations. Finally, we illustrate an additional experiment in the River Swim domain [strehl2008analysis].

As a side note, it is worth reporting that our implementation of the optimization Problems (4), (5), (6) is based on the CVXPY framework [cvxpy] and makes use of the MOSEK optimizer.

D.1 Algorithms: Pseudo-Code

In Algorithm 2, we report the pseudo-code of the MaxEnt algorithm [hazan2018provably]. In Algorithm 3, we report the pseudo-code of the CountBased algorithm, which is inspired by the exploration bonus of MBIE-EB [strehl2008analysis].

Input: , batch size , step size
Initialize , transition counts , state visitation counts
Set and , let
for  until convergence  do
     Sample , collect steps with , and update
     Estimate the transition model as:
     Define the reward function as:
     Compute as the -greedy policy for the MDP
     Set , update
end for
Output: exploratory policy
Algorithm 2 MaxEnt
Input: , batch size
Initialize , transition counts , state visitation counts
for  until convergence  do
     Collect N steps with and update
     Estimate the transition model as:
     Define the reward function as
     Compute as the -greedy policy for the MDP
end for
Output: exploratory policy
Algorithm 3 CountBased
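Both algorithms above estimate a transition model from counts and define a reward from state visitation counts. The following is a minimal sketch of these two steps, not the authors' implementation: the maximum-likelihood estimate and the bonus `beta / sqrt(count + 1)` are assumptions in the spirit of MBIE-EB, and the function names, `beta`, and the sample transitions are hypothetical.

```python
import math
from collections import defaultdict


def estimate_model(counts, n_states):
    """Maximum-likelihood transition estimate from transition counts.

    counts[(s, a)][s_next] holds the number of observed transitions.
    Returns a dict mapping (s, a) to a list of probabilities over next states.
    """
    model = {}
    for sa, nxt in counts.items():
        total = sum(nxt.values())
        model[sa] = [nxt.get(s2, 0) / total for s2 in range(n_states)]
    return model


def count_bonus(state_counts, s, beta=1.0):
    """Assumed count-based reward: larger for rarely visited states."""
    return beta / math.sqrt(state_counts.get(s, 0) + 1)


# Hypothetical batch of (state, action, next_state) samples.
transitions = [(0, 1, 1), (0, 1, 1), (0, 1, 0), (1, 0, 0)]

counts = defaultdict(lambda: defaultdict(int))
state_counts = defaultdict(int)
for s, a, s2 in transitions:
    counts[(s, a)][s2] += 1
    state_counts[s] += 1

model = estimate_model(counts, n_states=2)
bonus_frequent = count_bonus(state_counts, 0)  # state visited three times
bonus_rare = count_bonus(state_counts, 1)      # state visited once
```

A policy that is greedy with respect to such a reward is pushed towards the less visited states, which is the intuition behind the count-based baseline.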

D.2 Experiments

For each experiment covered by Section 5, we provide (Table 1) the cardinality of the state space, the cardinality of the action space, the values of the parameters of IDEAL, the parameter of MaxEnt and CountBased, the number of iterations, and the batch size, which are shared by all the approaches. For every domain, we report (Table 2) the value of the exact solution of Problems (4), (5), (6), (10). In Table 3, we provide the time to find a solution for Problems (4), (5), (6) that we experienced running the optimization on a single-core general-purpose CPU. For every experiment (except Knight Quest), we show additional figures reporting the performance of all the formulations of IDEAL.

The River Swim environment [strehl2008analysis] mimics the task of crossing a river by swimming either upstream or downstream. The action of swimming upstream fails with high probability, while the action of swimming downstream is deterministic. Due to this imbalance in the effort needed to cross the environment in the two directions, it is a fairly hard task in view of uniform exploration.
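To make the imbalance concrete, the sketch below builds a simplified River Swim chain. It is a hypothetical simplification and not the exact dynamics of [strehl2008analysis]: the number of states, the success probability `p_up`, and the assumption that a failed upstream move leaves the agent in place are all illustrative choices.

```python
def river_swim(n_states, p_up):
    """Simplified River Swim chain, returned as P[action][s][s_next].

    Action 0 (downstream) always moves one state to the left.
    Action 1 (upstream) moves one state to the right with probability
    p_up and otherwise leaves the agent where it is.
    """
    down = [[0.0] * n_states for _ in range(n_states)]
    up = [[0.0] * n_states for _ in range(n_states)]
    for s in range(n_states):
        down[s][max(s - 1, 0)] = 1.0
        right = min(s + 1, n_states - 1)
        up[s][right] += p_up
        up[s][s] += 1.0 - p_up
    return [down, up]


n_states, p_up = 6, 0.1  # hypothetical values
down, up = river_swim(n_states, p_up)

# Expected number of steps to traverse the whole chain in each direction.
steps_downstream = n_states - 1
steps_upstream = (n_states - 1) / p_up
```

With these values, crossing upstream takes about ten times as many steps in expectation as crossing downstream, so a policy has to favor the upstream action heavily in order to visit all the states evenly.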

Table 1: Full set of parameters employed in the presented experiments (one row per domain: Single Chain, Double Chain, Knight Quest, River Swim).

Table 2: State distribution entropy related to the exact solution for all the problem formulations (columns: Single Chain, Double Chain, Knight Quest; row label: Column Sum).

Table 3: Time to solve the optimization for all the problem formulations, reported in seconds (columns: Single Chain, Double Chain, Knight Quest; row label: Column Sum).

Single Chain

Figure 12: Comparison of the algorithms’ exploration performances in the Single Chain environment with parameters , , (100 runs, 95% c.i.).

Double Chain

Figure 13: Comparison of the algorithms’ exploration performances in the Double Chain environment with parameters , , (100 runs, 95% c.i.).

Knight Quest

Figure 14: Comparison of the algorithms’ state entropy in the Knight Quest environment with parameters , , (40 runs, 95% c.i.).


Figure 15: Comparison of the algorithms on a goal-conditioned learning task in the Double Chain environment with parameters , , (500 runs, 95% c.i.).

River Swim

Figure 16: Comparison of the algorithms’ exploration performances in the River Swim environment with parameters , , (100 runs, 95% c.i.).


  1. A complete version of the paper, which includes the Appendix, is available at