An Intrinsically-Motivated Approach for Learning Highly Exploring and Fast Mixing Policies
Abstract
What is a good exploration strategy for an agent that interacts with an environment in the absence of external rewards? Ideally, we would like to get a policy driving towards a uniform state-action visitation (highly exploring) in a minimum number of steps (fast mixing), in order to ease efficient learning of any goal-conditioned policy later on. Unfortunately, it is remarkably arduous to directly learn an optimal policy of this nature. In this paper, we propose a novel surrogate objective for learning highly exploring and fast mixing policies, which focuses on maximizing a lower bound to the entropy of the steady-state distribution induced by the policy. In particular, we introduce three novel lower bounds, which lead to as many optimization problems that trade off theoretical guarantees against computational complexity. Then, we present a model-based reinforcement learning algorithm, IDEAL, to learn an optimal policy according to the introduced objective. Finally, we provide an empirical evaluation of this algorithm on a set of hard-exploration tasks.
1 Introduction
In general, the Reinforcement Learning (RL) framework [sutton2018reinforcement] assumes the presence of a reward signal coming from a, potentially unknown, environment to a learning agent. When this signal is sufficiently informative about the utility of the agent's decisions, RL has proved to be rather successful in solving challenging tasks, even at a super-human level [mnih2015human, silver2017mastering]. However, in most real-world scenarios, we cannot rely on a well-shaped, complete reward signal. This may prevent the agent from learning anything until, while performing random actions, it eventually stumbles into some sort of external reward. Thus, what is a good objective for a learning agent to pursue, in the absence of an external reward signal, to prepare itself to eventually learn a goal-conditioned policy efficiently?
Intrinsic motivation [chentanez2005intrinsically, oudeyer2009topology] traditionally tries to answer this pressing question by designing self-motivated goals that favor exploration. In a curiosity-driven approach, first proposed in [schmidhuber1991possibility], the intrinsic objective encourages the agent to explore novel states by rewarding prediction errors [stadie2015incentivizing, pathak2017curiosity, burda2018large, burda2018exploration]. In a similar vein, other works propose to relate an intrinsic reward to some sort of learning progress [lopes2012exploration] or information gain [mohamed2015variational, houthooft2016vime], stimulating the agent's empowerment over the environment. Count-based approaches [bellemare2016unifying, tang2017exploration, ostrovski2017count] consider exploration bonuses proportional to the state visitation frequencies, assigning high rewards to rarely visited states. Although the mentioned approaches have been relatively effective in solving sparse-rewards, hard-exploration tasks [pathak2017curiosity, burda2018exploration], they have some common limitations that may affect their ability to methodically explore an environment in the absence of external rewards, as pointed out in [ecoffet2019go]. In particular, due to the consumable nature of their intrinsic bonuses, the learning agent could prematurely lose interest in a frontier of high rewards (detachment). Furthermore, the agent may suffer from derailment when trying to return to a promising, previously discovered state, if a naïve exploratory mechanism, such as $\epsilon$-greedy, is combined with the intrinsic motivation mechanism (which is often the case). To overcome these limitations, recent works suggest alternative approaches to motivate the agent towards a more systematic exploration of the environment [hazan2018provably, ecoffet2019go].
In particular, in [hazan2018provably] the authors consider an intrinsic objective which is directed to the maximization of an entropic measure over the state distribution induced by a policy. Then, they provide a provably efficient algorithm to learn a mixture of deterministic policies that is overall optimal w.r.t. the maximum-entropy exploration objective. To the best of our knowledge, none of the mentioned approaches explicitly addresses the related aspect of the mixing time of an exploratory policy, which represents the time it takes for the policy to reach its full capacity in terms of exploration. Nonetheless, in many cases we would like to maximize the probability of reaching any potential target state having a fairly limited number of interactions at hand for exploring the environment. Notably, this context presents some analogies to the problem of maximizing the efficiency of a random walk [hassibi2014optimized].
In this paper, we present a novel approach to learn exploratory policies that are, at the same time, highly exploring and fast mixing.
In Section 3, we propose a surrogate objective to address the problem of maximum-entropy exploration over both the state space (Section 3.1) and the action space (Section 3.2). The idea is to search for a policy that maximizes a lower bound to the entropy of the induced steady-state distribution. We introduce three new lower bounds and the corresponding optimization problems, discussing their pros and cons.
Furthermore, we discuss how to complement the introduced objective to account for the mixing time of the learned policy (Section 3.3).
In Section 4, we present the Intrinsically-Driven Effective and Efficient Exploration ALgorithm (IDEAL), a novel, model-based, reinforcement learning method to learn highly exploring and fast mixing policies through iterative optimizations of the introduced objective.
In Section 5, we provide an empirical evaluation to illustrate the merits of our approach on hard-exploration, finite domains, and to show how it fares in comparison to count-based and maximum-entropy approaches.
Finally, in Section 6, we discuss the proposed approach and related works. The proofs of the Theorems are reported in Appendix A.
2 Preliminaries
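A discrete-time Markov Decision Process (MDP) [puterman2014markov] is defined as a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, d_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(s'|s,a)$ is a Markovian transition model defining the distribution of the next state $s'$ given the current state $s$ and action $a$, $R$ is the reward function, such that $R(s,a)$ is the expected immediate reward when taking action $a$ from state $s$, and $d_0$ is the initial state distribution. A policy $\pi(a|s)$ defines the probability of taking an action $a$ in state $s$.
In the following we will use scalar and matrix notation interchangeably, where $\mathbf{v}$ denotes a vector, $\mathbf{M}$ denotes a matrix, and $\mathbf{v}^T$, $\mathbf{M}^T$ denote their transposes. A matrix is row (column) stochastic if it has non-negative entries and all of its rows (columns) sum to one. A matrix is doubly stochastic if it is both row and column stochastic. We denote with $\mathbb{U}$ the space of doubly stochastic matrices. The $L_\infty$-norm $\|\mathbf{M}\|_\infty$ of a matrix is its maximum absolute row sum, while $\|\mathbf{M}\|_2$ and $\|\mathbf{M}\|_F$ are its $L_2$ and Frobenius norms respectively. We denote with $\mathbf{1}_n$ a column vector of $n$ ones and with $\mathbf{1}_{n \times m}$ a matrix of ones with $n$ rows and $m$ columns. Using matrix notation, $\mathbf{d}_0$ is a column vector of size $|\mathcal{S}|$ having elements $d_0(s)$, $\mathbf{P}$ is a row stochastic matrix of size $(|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}|)$ that describes the transition model $P(s'|s,a)$, $\mathbf{\Pi}$ is a row stochastic matrix of size $(|\mathcal{S}| \times |\mathcal{S}||\mathcal{A}|)$ that contains the policy $\pi(a|s)$, and $\mathbf{P}^\pi = \mathbf{\Pi}\mathbf{P}$ is a row stochastic matrix of size $(|\mathcal{S}| \times |\mathcal{S}|)$ that represents the state transition matrix under policy $\pi$. We denote with $\Pi$ the space of all the stationary Markovian policies.
In the absence of any reward, i.e., when $R(s,a) = 0$ for every $(s,a) \in \mathcal{S} \times \mathcal{A}$, a policy $\pi$ induces, over the MDP $\mathcal{M}$, a Markov Chain (MC) [levin2017markov] defined by $\mathcal{C} = (\mathcal{S}, P^\pi, d_0)$, where $P^\pi(s'|s) = \sum_{a \in \mathcal{A}} \pi(a|s) P(s'|s,a)$ is the state transition model. Having defined the $t$-step transition matrix as $\mathbf{P}_t^\pi = (\mathbf{P}^\pi)^t$, the state distribution of the MC at time step $t$ is $\mathbf{d}_t^\pi = (\mathbf{P}_t^\pi)^T \mathbf{d}_0$, while $\mathbf{d}^\pi = \lim_{t \to \infty} \mathbf{d}_t^\pi$ is the steady-state distribution. If the MC is ergodic, i.e., aperiodic and recurrent, it admits a unique steady-state distribution, such that $\mathbf{d}^\pi = (\mathbf{P}^\pi)^T \mathbf{d}^\pi$. The mixing time $t_{\text{mix}}$ of the MC describes how fast the state distribution converges to the steady state:
$$t_{\text{mix}} = \min \left\{ t \in \mathbb{N} : \sup_{\mathbf{d}_0} \left\| \mathbf{d}_t^\pi - \mathbf{d}^\pi \right\|_\infty \le \epsilon \right\} \tag{1}$$
where $\epsilon$ is the mixing threshold. An MC is reversible if the condition $d^\pi(s) P^\pi(s'|s) = d^\pi(s') P^\pi(s|s')$ holds for every pair of states $s, s'$. Let $\lambda(\mathbf{P}^\pi)$ be the eigenvalues of $\mathbf{P}^\pi$. For ergodic reversible MCs the largest eigenvalue is 1 with multiplicity 1. Then, we can define the second largest eigenvalue modulus $\lambda_\star$ and the spectral gap $\gamma$ as:
$$\lambda_\star = \max_{\lambda \in \lambda(\mathbf{P}^\pi) \setminus \{1\}} |\lambda|, \qquad \gamma = 1 - \lambda_\star \tag{2}$$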
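The quantities defined in this section can be computed directly on a small tabular example. The following is a minimal sketch in plain Python, assuming a hypothetical 2-state, 2-action MDP whose numbers are chosen only for illustration. The helper names (`induced_chain`, `steady_state`, `entropy`) are not from the paper. The steady state is approximated by power iteration, and the closed-form spectral gap used here is specific to 2-state chains.

```python
import math

def induced_chain(policy, transitions):
    """State transition matrix under a policy: P_pi[s][s2] = sum_a pi(a|s) P(s2|s,a)."""
    n_states = len(transitions)
    return [
        [sum(policy[s][a] * transitions[s][a][s2] for a in range(len(policy[s])))
         for s2 in range(n_states)]
        for s in range(n_states)
    ]

def steady_state(chain, d0, tol=1e-12, max_steps=100_000):
    """Power iteration d_{t+1} = (P_pi)^T d_t, stopped when the update is below tol."""
    d = list(d0)
    n = len(d)
    for _ in range(max_steps):
        nxt = [sum(d[s] * chain[s][s2] for s in range(n)) for s2 in range(n)]
        if max(abs(x - y) for x, y in zip(nxt, d)) < tol:
            return nxt
        d = nxt
    return d

def entropy(dist):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in dist if p > 0)

# Hypothetical toy MDP (illustrative numbers): P[s][a][s2] = P(s2 | s, a).
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]
uniform_policy = [[0.5, 0.5], [0.5, 0.5]]

chain = induced_chain(uniform_policy, P)
d_pi = steady_state(chain, [1.0, 0.0])

# For a 2-state chain the eigenvalues are 1 and (trace - 1), so the second
# largest eigenvalue modulus and the spectral gap have a closed form.
slem = abs(chain[0][0] + chain[1][1] - 1.0)
spectral_gap = 1.0 - slem
```

On larger chains the eigenvalues would have to come from a numerical eigen-solver; the closed form above is only a convenience of the 2-state case.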
3 Optimization Problems for Highly Exploring and Fast Mixing Policies
In this section, we define a set of optimization problems whose goal is to identify a stationary Markovian policy that effectively explores the state-action space. The optimization problem is introduced in three steps: first we ask for a policy that maximizes some lower bound to the steady-state distribution entropy, then we foster exploration over the action space by adding a constraint on the minimum action probability, and finally we add another constraint to reduce the mixing time of the Markov chain induced by the policy.
3.1 Highly Exploring Policies over the State Space
Intuitively, a good exploration policy should visit the state space as uniformly as possible. In this view, a potential objective function is the entropy of the steady-state distribution induced by a policy over the MDP [hazan2018provably]. The resulting optimal policy is:
$$\pi^* \in \arg\max_{\pi \in \Pi} H(\mathbf{d}^\pi) \tag{3}$$
where $H(\mathbf{d}^\pi) = -\sum_{s \in \mathcal{S}} d^\pi(s) \log d^\pi(s)$ is the state distribution entropy. Unfortunately, a direct optimization of this objective is particularly arduous since the steady-state distribution entropy is not a concave function of the policy [hazan2018provably]. To overcome this issue, a possible solution [hazan2018provably] is to use the conditional gradient method, such that the gradients of the steady-state distribution entropy become the intrinsic reward in a sequence of approximate dynamic programming problems [bertsekas1995dynamic].
In this paper, we follow an alternative route that consists in maximizing a lower bound to the policy entropy. In particular, in the following we will consider three lower bounds that lead to as many optimization problems (named Infinity, Frobenius, Column Sum) that show different trade-offs between theoretical guarantees and computational complexity.
Infinity. From the theory of Markov chains [levin2017markov], we know a necessary and sufficient condition for a policy to induce a uniform steady-state distribution (i.e., to achieve the maximum possible entropy). We report this result in the following theorem.
Theorem 3.1. Let $\mathbf{P}$ be the transition matrix of a given MDP. The steady-state distribution $\mathbf{d}^\pi$ induced by a policy $\pi$ is uniform over $\mathcal{S}$ iff the matrix $\mathbf{P}^\pi = \mathbf{\Pi}\mathbf{P}$ is doubly stochastic.
Unfortunately, given the constraints specified by the transition matrix $\mathbf{P}$, a stationary Markovian policy that induces a doubly stochastic $\mathbf{P}^\pi$ may not exist. On the other hand, it is possible to lower bound the entropy of the steady-state distribution induced by policy $\pi$ as a function of the minimum $L_\infty$-norm between $\mathbf{P}^\pi$ and any doubly stochastic matrix.
Theorem 3.2. Let $\mathbf{P}$ be the transition matrix of a given MDP and $\mathbb{U}$ the space of doubly stochastic matrices. The entropy of the steady-state distribution $\mathbf{d}^\pi$ induced by a policy $\pi$ is lower bounded by:
$$H(\mathbf{d}^\pi) \ge \log |\mathcal{S}| - |\mathcal{S}| \inf_{\mathbf{P}^u \in \mathbb{U}} \left\| \mathbf{P}^u - \mathbf{\Pi}\mathbf{P} \right\|_\infty^2$$
The maximization of this lower bound leads to the following constrained optimization problem:
$$\min_{\pi \in \Pi,\ \mathbf{P}^u \in \mathbb{U}} \left\| \mathbf{P}^u - \mathbf{\Pi}\mathbf{P} \right\|_\infty \tag{4}$$
It is worth noting that this optimization problem can be reformulated as a linear program with optimization variables and inequality constraints and equality constraints (the linear program formulation can be found in Appendix B.1). In order to avoid the exponential growth of the number of constraints as a function of the number of states, we are going to introduce alternative optimization problems.
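Both results of this paragraph can be checked numerically on small chains. The sketch below makes two assumptions that are not taken from the paper: the chains are hypothetical 3-state examples, and the bound is evaluated against the uniform matrix only, rather than against the infimum over all doubly stochastic matrices. Assuming the bound has the form $\log|\mathcal{S}| - |\mathcal{S}| \|\mathbf{P}^u - \mathbf{P}^\pi\|_\infty^2$ for any doubly stochastic $\mathbf{P}^u$, a single target gives a valid but weaker bound than the infimum.

```python
import math

def steady_state(chain, steps=2000):
    """Power iteration d <- (P_pi)^T d, started from a point mass on state 0."""
    n = len(chain)
    d = [1.0] + [0.0] * (n - 1)
    for _ in range(steps):
        d = [sum(d[s] * chain[s][s2] for s in range(n)) for s2 in range(n)]
    return d

def entropy(dist):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def is_doubly_stochastic(m, tol=1e-9):
    """Non-negative entries, and every row and every column sums to one."""
    n = len(m)
    rows_ok = all(abs(sum(row) - 1.0) < tol for row in m)
    cols_ok = all(abs(sum(m[s][s2] for s in range(n)) - 1.0) < tol for s2 in range(n))
    return rows_ok and cols_ok and all(x >= 0 for row in m for x in row)

def inf_norm_distance(a, b):
    """Maximum absolute row sum of the difference a - b."""
    return max(sum(abs(x - y) for x, y in zip(ra, rb)) for ra, rb in zip(a, b))

# Hypothetical chains (illustrative numbers only).
doubly = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.2, 0.5]]
skewed = [[0.40, 0.30, 0.30], [0.30, 0.40, 0.30], [0.35, 0.35, 0.30]]

n = 3
uniform_target = [[1.0 / n] * n for _ in range(n)]  # one particular doubly stochastic target

d_doubly = steady_state(doubly)   # uniform, as the first theorem predicts
d_skewed = steady_state(skewed)   # not uniform: column sums differ from one

# Assumed form of the bound, evaluated at the uniform target only.
bound = math.log(n) - n * inf_norm_distance(uniform_target, skewed) ** 2
```

The closer the induced chain is to a doubly stochastic matrix, the closer this quantity gets to the maximum entropy $\log|\mathcal{S}|$.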
Frobenius. It is worth noting that different transition matrices $\mathbf{P}^\pi$ having equal $\|\mathbf{P}^u - \mathbf{P}^\pi\|_\infty$ might lead to significantly different state distribution entropies $H(\mathbf{d}^\pi)$, as the $L_\infty$-norm only accounts for the state corresponding to the maximum absolute row sum. The Frobenius norm can better capture the distance between $\mathbf{P}^u$ and $\mathbf{P}^\pi$ over all the states, as discussed in Appendix C. For this reason, we have derived a lower bound to the policy entropy that replaces the $L_\infty$-norm with the Frobenius one.
Theorem 3.3. Let $\mathbf{P}$ be the transition matrix of a given MDP and $\mathbb{U}$ the space of doubly stochastic matrices. The entropy of the steady-state distribution $\mathbf{d}^\pi$ induced by a policy $\pi$ is lower bounded by:
$$H(\mathbf{d}^\pi) \ge \log |\mathcal{S}| - |\mathcal{S}|^2 \inf_{\mathbf{P}^u \in \mathbb{U}} \left\| \mathbf{P}^u - \mathbf{\Pi}\mathbf{P} \right\|_F^2$$
It can be shown (see Corollary A.1 in Appendix A) that the lower bound based on the Frobenius norm cannot be better (i.e., larger) than the one with the infinity norm. However, we have the advantage that the resulting optimization problem has significantly fewer constraints than Problem (4):
$$\min_{\pi \in \Pi,\ \mathbf{P}^u \in \mathbb{U}} \left\| \mathbf{P}^u - \mathbf{\Pi}\mathbf{P} \right\|_F^2 \tag{5}$$
This problem is a (linearly constrained) quadratic problem with optimization variables and inequality constraints and equality constraints.
Column Sum. Problems (4) and (5) aim at finding a policy associated with a state transition matrix that is doubly stochastic. To achieve this result it is enough to guarantee that the column sums of the matrix $\mathbf{P}^\pi$ are all equal to one [kirkland2010column]. A measure that can be used to evaluate the distance to a doubly stochastic matrix is the absolute sum of the differences between one and the column sums: $\sum_{s' \in \mathcal{S}} \left| 1 - \sum_{s \in \mathcal{S}} P^\pi(s'|s) \right| = \left\| (\mathbf{I} - \mathbf{\Pi}\mathbf{P})^T \mathbf{1}_{|\mathcal{S}|} \right\|_1$. The following theorem provides a lower bound to the policy entropy as a function of this measure.
Theorem 3.4. Let $\mathbf{P}$ be the transition matrix of a given MDP. The entropy of the steady-state distribution $\mathbf{d}^\pi$ induced by a policy $\pi$ is lower bounded by:
The optimization of this lower bound leads to the following linear program:
$$\min_{\pi \in \Pi} \left\| (\mathbf{I} - \mathbf{\Pi}\mathbf{P})^T \mathbf{1}_{|\mathcal{S}|} \right\|_1 \tag{6}$$
Besides being a linear program, unlike the other optimization problems presented, Problem (6) does not require to optimize over the space of all the doubly stochastic matrices, thus significantly reducing the number of optimization variables () and constraints ( inequalities and equalities). The linear program formulation of Problem (6) can be found in Appendix B.2.
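To make the Column Sum measure concrete, the sketch below evaluates it on a hypothetical 2-state, 2-action MDP and searches for a policy that minimizes it. A brute-force grid search stands in for the linear program here, so this is not the formulation of Appendix B.2. The helper names and the numbers in `P` are assumptions made for illustration.

```python
def induced_chain(policy, transitions):
    """State transition matrix under a policy: P_pi[s][s2] = sum_a pi(a|s) P(s2|s,a)."""
    n_states = len(transitions)
    return [
        [sum(policy[s][a] * transitions[s][a][s2] for a in range(len(policy[s])))
         for s2 in range(n_states)]
        for s in range(n_states)
    ]

def column_sum_gap(chain):
    """Sum over columns of |1 - column sum|; zero iff the chain is doubly stochastic."""
    n = len(chain)
    return sum(abs(1.0 - sum(chain[s][s2] for s in range(n))) for s2 in range(n))

def two_action_policy(p0, p1):
    """Policy taking action 0 with probability p0 in state 0 and p1 in state 1."""
    return [[p0, 1.0 - p0], [p1, 1.0 - p1]]

# Hypothetical toy MDP (illustrative numbers): P[s][a][s2] = P(s2 | s, a).
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]

# Brute-force search over a coarse grid of policies (a stand-in for the LP).
grid = [i / 20 for i in range(21)]
candidates = [two_action_policy(p0, p1) for p0 in grid for p1 in grid]
best_policy = min(candidates, key=lambda pol: column_sum_gap(induced_chain(pol, P)))

best_gap = column_sum_gap(induced_chain(best_policy, P))
uniform_gap = column_sum_gap(induced_chain(two_action_policy(0.5, 0.5), P))
```

In this toy case a policy with zero gap exists, so the induced chain is doubly stochastic. In general the minimum may be strictly positive.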
3.2 Highly Exploring Policies over the State and Action Space
Although the policy resulting from the optimization of one of the above problems may lead to the most uniform exploration of the state space, the actual goal of the exploration phase is to collect enough information on the environment to optimize, at some point, a goal-conditioned policy [pong2019skew]. To this end, it is essential to have an exploratory policy that adequately covers the action space $\mathcal{A}$ in any visited state. Unfortunately, the optimization of Problems (4), (5), (6) does not even guarantee that the obtained policy is stochastic. Thus, we need to embed in the problem a secondary objective that takes into account the exploration over $\mathcal{A}$. This can be done by enforcing a minimal entropy over actions in the policy to be learned, adding to (4), (5), (6) the following constraints:
$$\pi(a|s) \ge \xi, \quad \forall s \in \mathcal{S},\ \forall a \in \mathcal{A} \tag{7}$$
where $\xi \in [0, \frac{1}{|\mathcal{A}|}]$. This secondary objective is actually in competition with the objective of uniform exploration over states. Indeed, an overblown incentive in the exploration over actions may limit the state distribution entropy of the optimal policy. Having a low probability of visiting a state decreases the likelihood of sampling an action from that state, hence, also reducing the exploration over actions. To illustrate that, Figure 2 shows state distribution entropies ($H(\mathbf{d}^\pi)$) and state-action distribution entropies, i.e., the entropies of the joint distribution $d^\pi(s)\pi(a|s)$, achieved by the optimal policy w.r.t. Problem (5) on the Single Chain domain [furmston2010variational] for different values of $\xi$.
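The relation between state entropy and state-action entropy can be illustrated with a few lines of Python. This is a sketch under loud assumptions: the state distribution `d` is a hypothetical vector held fixed across policies, whereas in the actual problem it changes with the policy. The threshold `xi` plays the role of the minimum action probability, and all helper names are illustrative.

```python
import math

def entropy(dist):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def state_action_distribution(d, policy):
    """Flattened joint distribution with entries d(s) * pi(a|s)."""
    return [d[s] * policy[s][a] for s in range(len(d)) for a in range(len(policy[s]))]

def min_action_probability(policy):
    """Smallest action probability over all states."""
    return min(p for row in policy for p in row)

# Hypothetical state distribution, held fixed here only for illustration.
d = [0.35, 0.35, 0.30]
deterministic = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
stochastic = [[0.8, 0.2], [0.6, 0.4], [0.5, 0.5]]
xi = 0.2  # assumed minimum action probability

joint_det = state_action_distribution(d, deterministic)
joint_sto = state_action_distribution(d, stochastic)

# Chain rule: H(s, a) = H(s) + sum_s d(s) * H(pi(.|s)).
expected_sto = entropy(d) + sum(d[s] * entropy(stochastic[s]) for s in range(len(d)))
```

A deterministic policy adds nothing to the state entropy, while a policy that respects the minimum action probability adds the average per-state action entropy.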
3.3 An Objective to Make Highly Exploring Policies Mix Faster
In many cases, such as in episodic tasks where the horizon for exploration is capped, we may have interest in trading inferior state entropy for faster convergence of the learned policy. Although the doubly stochastic matrices are equally valid in terms of steady-state distribution, the choice of the target $\mathbf{P}^u$ strongly affects the mixing properties of the MC induced by the policy. Indeed, while an MC with a uniform transition matrix, i.e., transition probabilities $P^\pi(s'|s) = \frac{1}{|\mathcal{S}|}$ for any $s$, $s'$, mixes in no time, an MC with probability one on the self-loops never converges to a steady state. This is evident considering that the mixing time of an MC is bounded as follows [levin2017markov, Theorems 12.3 and 12.4]:
$$\left( \frac{1}{\gamma} - 1 \right) \log \frac{1}{2\epsilon} \le t_{\text{mix}} \le \frac{1}{\gamma} \log \frac{1}{d_{\min}\,\epsilon} \tag{8}$$
where $\epsilon$ is the mixing threshold, $d_{\min} = \min_{s} d^\pi(s)$ is a minorization of $\mathbf{d}^\pi$, and $\gamma$ is the spectral gap of (2). From the literature of MCs, we know that a variant of the Problems (4), (5) having the uniform transition matrix as target and the $L_2$ as matrix norm, is equivalent to the problem of finding the fastest mixing transition matrix [boyd2004fastest]. However, the choice of this target may overly limit the entropy over the state distribution induced by the optimal policy. Instead, we look for a generalization that allows us to prioritize fast exploration at will. Thus, we consider a continuum of relaxations in the fastest mixing objective by embedding in Problems (4) and (5) (but not in Problem (6)) the following constraints:
$$P^u(s, s') \le \zeta, \quad \forall s, s' \in \mathcal{S} \tag{9}$$
where $\zeta \in [\frac{1}{|\mathcal{S}|}, 1]$. By setting $\zeta = \frac{1}{|\mathcal{S}|}$, we force the optimization problem to consider the uniform transition matrix as a target, thus aiming to reduce the mixing time, while larger values of $\zeta$ relax this objective, allowing us to get a higher steady-state distribution entropy. In Figure 2 we show how the parameter $\zeta$ affects the trade-off between high steady-state entropy and low mixing times (i.e., high spectral gaps), reporting the values obtained by optimal policies w.r.t. Problem (5) for different $\zeta$.
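The effect of the target matrix on mixing can be observed empirically. The sketch below compares three hypothetical 3-state doubly stochastic chains: the uniform matrix, the identity (probability one on the self-loops), and a lazy mixture of the two. The distance is measured as the largest per-state gap to the uniform distribution, which is an assumed choice, and `mixing_steps` is an illustrative helper rather than anything from the paper.

```python
def step(dist, chain):
    """One step of the chain: d_{t+1} = (P_pi)^T d_t."""
    n = len(dist)
    return [sum(dist[s] * chain[s][s2] for s in range(n)) for s2 in range(n)]

def mixing_steps(chain, target, eps=1e-3, max_steps=1000):
    """Smallest t such that every point-mass start is within eps of target, else None.

    Checking point masses is enough: the distance is convex in the initial
    distribution, so its supremum is attained at a vertex of the simplex.
    """
    n = len(chain)
    dists = [[1.0 if i == s else 0.0 for i in range(n)] for s in range(n)]
    for t in range(1, max_steps + 1):
        dists = [step(d, chain) for d in dists]
        worst = max(abs(x - y) for d in dists for x, y in zip(d, target))
        if worst <= eps:
            return t
    return None

n = 3
uniform = [[1.0 / n] * n for _ in range(n)]
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
# Lazy chain: stay put with weight 0.9, jump uniformly with weight 0.1 (illustrative).
lazy = [[0.9 * identity[i][j] + 0.1 * uniform[i][j] for j in range(n)] for i in range(n)]
target = [1.0 / n] * n

steps_uniform = mixing_steps(uniform, target)
steps_identity = mixing_steps(identity, target)
steps_lazy = mixing_steps(lazy, target)
```

All three matrices are doubly stochastic, yet they behave very differently: the uniform chain reaches the target after a single step, the identity never does, and the lazy chain sits in between.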
4 A ModelBased Algorithm for Highly Exploring and Fast Mixing Policies
In this section, we present an approach to incrementally learn a highly exploring and fast mixing policy through interactions with an unknown environment, developing a novel model-based exploration algorithm called Intrinsically-Driven Effective and Efficient Exploration ALgorithm (IDEAL). Since Problems (4), (5), (6) require an explicit representation of the matrix $\mathbf{P}$, we need to estimate the transition model from samples before performing an objective optimization (model-based approach). In tabular settings, this can be easily done by adopting the transition frequency as a proxy for the (unknown) transition probabilities, obtaining an estimated transition model $\hat{\mathbf{P}}$. However, in hard-exploration tasks, it can be arbitrarily arduous to sample transitions from the most difficult-to-reach states by relying on naïve exploration mechanisms, such as a random policy. To address the issue, we lean on an iterative approach in which we alternate model estimation phases with optimization sweeps of the objectives (4), (5) or (6). In this way, we combine the benefit of collecting samples with highly exploring policies to better estimate the transition model and the benefit of having a better-estimated model to learn superior exploratory policies. In order to steer the policy towards state-action pairs that have never been sampled, we keep their corresponding next-state distribution uniform over all possible states, thus making the pair particularly valuable in the perspective of the optimization problem. The algorithm converges whenever the exploratory policy remains unchanged during consecutive optimization sweeps and, if we know the size of the MDP, when all state-action pairs have been sufficiently explored. In Algorithm 1 we report the pseudocode of IDEAL. Finally, in Figure 3 we compare the iterative formulation against a non-iterative one, i.e., an approach that collects samples with a random policy and then optimizes the exploration objective off-line.
Considering an exploration task on the Double Chain domain [furmston2010variational], we show that the iterative form has a clear edge in reducing the model estimation error. Both approaches employ a Frobenius formulation.
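The model estimation step described above can be sketched as follows. This is a minimal illustration, assuming samples are stored as `(s, a, s_next)` index triples. The helper `estimate_model` is hypothetical and is not the paper's implementation. It only covers the frequency estimate and the uniform fill-in for unsampled pairs, not the optimization sweeps.

```python
def estimate_model(samples, n_states, n_actions):
    """Frequency estimate of P(s_next | s, a) from (s, a, s_next) samples.

    State-action pairs that were never sampled keep a uniform next-state
    distribution, as described in the text.
    """
    counts = [[[0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    for s, a, s_next in samples:
        counts[s][a][s_next] += 1

    model = []
    for s in range(n_states):
        rows = []
        for a in range(n_actions):
            total = sum(counts[s][a])
            if total == 0:
                rows.append([1.0 / n_states] * n_states)  # never sampled: uniform
            else:
                rows.append([c / total for c in counts[s][a]])
        model.append(rows)
    return model

# Hypothetical batch of transitions on a 2-state, 2-action problem.
samples = [(0, 0, 1), (0, 0, 1), (0, 0, 0), (1, 1, 0)]
model = estimate_model(samples, n_states=2, n_actions=2)
```

In an iterative scheme this estimate would be recomputed after every batch, with the next batch collected by the policy optimized on the current estimate.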
5 Experimental Evaluation
In this section, we provide the experimental evaluation of IDEAL. First, we show a set of experiments on the illustrative Single Chain and Double Chain domains [furmston2010variational, peters2010relative]. The Single Chain consists of a chain of states with two possible actions each, one to climb up the chain from a state to the next one, and the other to directly fall back to the initial state. The two actions are flipped with a given probability, making the environment stochastic and reducing the probability of visiting the higher states. The Double Chain concatenates two Single Chains into a bigger one sharing the central state, which is the initial state. Thus, the chain can be climbed in two directions. These two domains, albeit rather simple from a dimensionality standpoint, are actually hard to explore uniformly, due to the high share of actions returning to the initial state and preventing the agent from consistently reaching the higher states. Then, we present an experiment on the much more complex Knight Quest environment [fruit2018efficient, Appendix], which has larger state and action spaces. This domain takes inspiration from classical arcade games, in which a knight has to rescue a princess in the shortest possible time without being killed by the dragon. To accomplish this feat, the knight has to perform an intricate sequence of actions. In the absence of any reward, it is a fairly challenging environment for exploration. On these domains, we address the task of learning the best exploratory policy in a limited number of samples. In particular, we evaluate these policies in terms of the induced state entropy $H(\mathbf{d}^\pi)$ and state-action entropy.
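The Single Chain dynamics described above can be written down as a small transition model. The sketch below rests on several assumptions that the text does not fix: the number of states and the flip probability are hypothetical parameters, action 0 is taken to be the climbing action, and climbing from the top state is assumed to leave the agent there.

```python
def single_chain(n_states, flip_prob):
    """Transition model P[s][a][s2] for a chain with a climb and a fall action.

    Action 0 intends to climb to the next state, action 1 intends to fall back
    to the initial state 0. With probability flip_prob the effects are swapped.
    """
    transitions = [[[0.0] * n_states for _ in range(2)] for _ in range(n_states)]
    for s in range(n_states):
        up = min(s + 1, n_states - 1)  # assumption: the top state is absorbing for climbs
        transitions[s][0][up] += 1.0 - flip_prob
        transitions[s][0][0] += flip_prob
        transitions[s][1][0] += 1.0 - flip_prob
        transitions[s][1][up] += flip_prob
    return transitions

# Hypothetical sizes, chosen only for illustration.
n_states = 5
chain_model = single_chain(n_states, flip_prob=0.1)

# Chain induced by a policy that picks each action with probability 0.5.
random_chain = [
    [0.5 * chain_model[s][0][s2] + 0.5 * chain_model[s][1][s2] for s2 in range(n_states)]
    for s in range(n_states)
]
```

Under the random policy every state climbs with probability one half and falls back otherwise, which is why the higher states are visited so rarely.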
We compare our approach with MaxEnt [hazan2018provably], the model-based algorithm to learn maximum entropy exploration that we have previously discussed in the paper, and a count-based approach inspired by the exploration bonuses of MBIE-EB [strehl2008analysis], which we refer to as Count-Based in the following. The latter shares the same structure as our algorithm, but replaces the policy optimization sweeps with approximate value iterations [bertsekas1995dynamic], where the reward for a given state is inversely proportional to the visit count of that state. It is worth noting that the results reported for the MaxEnt algorithm are related to the mixture policy $\pi_{\text{mix}} = (\mathcal{D}, \boldsymbol{\alpha})$, where $\mathcal{D}$ is a set of deterministic policies, and $\boldsymbol{\alpha}$ is a probability distribution over $\mathcal{D}$. For the sake of simplicity, we have equipped all the approaches with a little domain knowledge, i.e., the cardinality of $\mathcal{S}$ and $\mathcal{A}$. However, this can be avoided without a significant impact on the presented results. For every experiment, we will report the batch-size $N$, and the parameters $\xi$, $\zeta$ of IDEAL. Count-Based and MaxEnt employ $\epsilon$-greedy policies with a fixed $\epsilon$ in all the experiments. In any plot, we will additionally provide the performance of a baseline policy, denoted as Random, that randomly selects an action in every state. Detailed information about the presented results, along with an additional experiment, can be found in Appendix D.

First, in Figure 4, we compare the Problems (4), (5), (6) on the Single Chain environment. On one hand, we show the performance achieved by the exact solutions, i.e., computed with a full knowledge of $\mathbf{P}$. While the plain formulations ($\xi = 0$) are remarkably similar, adding a constraint over the action entropy ($\xi > 0$) has a significantly different impact. On the other hand, we illustrate the performance of IDEAL, equipped with the alternative optimization objectives, in learning a good exploratory policy from samples. In this case, the Frobenius clearly achieves a better performance. In the following, we will report the results of IDEAL considering only the best-performing formulation, which, for all the presented experiments, corresponds to the Frobenius.
In Figure (a), we show that IDEAL compares well against the other approaches in exploring the Double Chain domain. It achieves superior state entropy and state-action entropy, and it converges faster to the optimum. It also displays a higher probability of visiting the least favorable state, and it behaves positively in the estimation of the transition model. Notably, the Count-Based algorithm fails to reach high exploration due to a detachment problem [ecoffet2019go], since it fluctuates between two exploratory policies that are greedy towards the two directions of the chain. By contrast, in a domain having a clear direction for exploration, such as the simpler Single Chain domain, Count-Based ties the explorative performance of IDEAL (Figure (b)). On the other hand, MaxEnt is effective in the exploration performance, but much slower to converge, both in the Double Chain and the Single Chain. Note that in Figure (a), the model estimation error of MaxEnt starts higher than the others, since it employs a different strategy to fill the transition probabilities of never reached states, inspired by [brafman2002r]. In Figure (c), we present an experiment on the higher-dimensional Knight Quest environment. IDEAL achieves a remarkable state entropy, while MaxEnt struggles to converge towards a satisfying exploratory policy. Count-Based (not reported in Figure (c), see Appendix D) fails to explore the environment altogether, oscillating between policies with low entropy.
In Figure (d), we illustrate how the exploratory policies learned in the Double Chain environment are effective to ease learning of any possible goal-conditioned policy afterwards. To this end, the exploratory policies, learned by the three approaches through 3000 samples (Figure (a)), are employed to collect samples in a fixed horizon (within a range from 10 to 100 steps). Then, a goal-conditioned policy is learned off-line through approximate value iteration [bertsekas1995dynamic] on this small amount of samples. The goal is to optimize a reward function that is 1 for the hardest state to reach (i.e., the state that is less frequently visited with a random policy), 0 in all the other states. In this setting, all the methods prove to be rather successful w.r.t. the baseline, though IDEAL compares positively against the other strategies.
6 Discussion
In this section, we first discuss how the proposed approach might be extended beyond tabular settings and an alternative formulation for the policy entropy optimization. Then, we consider some relevant work related to this paper.
6.1 Potential Extension to Continuous
We believe that the proposed approach has potential to be extended to more general, continuous, settings, by exploiting the core idea of avoiding a probability concentration on a subset of outgoing transitions from a state. Indeed, a compelling feature of the presented lower bounds is that they characterize an infinite-step property, the entropy of the steady-state distribution, relying only on one-step quantities, i.e., without requiring to unroll several times the state transition matrix $\mathbf{P}^\pi$. In addition to this, the lower bounds provide an evaluation for the current policy, and they can be computed for any policy. Thus, we could potentially operate a direct search in the policy space through the gradient of an approximation of these lower bounds. To perform the approximation we could use a kernel for a soft aggregation over regions of the, now continuous, state space.
6.2 A Dual Formulation
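A potential alternative to deal with the optimization of the objective (3) is to consider its dual formulation. This is rather similar to the approach proposed in [tarbouriech2019active] to address the different problem of active exploration in an MDP. The basic idea is to directly maximize the entropy over the state-action stationary distribution and then to recover the policy afterwards. In this setting, we define the state-action stationary distribution induced by a policy $\pi$ as $\boldsymbol{\omega}^\pi = \mathbf{\Pi}^T \mathbf{d}^\pi$, where $\boldsymbol{\omega}^\pi$ is a vector of size $|\mathcal{S}||\mathcal{A}|$ having elements $\omega^\pi(s,a)$. Since not all the distributions over the state-action space can actually be induced by a policy over the MDP, we characterize the set of feasible distributions:
$$\Omega = \Big\{ \boldsymbol{\omega} \ge 0 : \sum_{s,a} \omega(s,a) = 1, \ \sum_{a} \omega(s,a) = \sum_{s',a'} P(s|s',a')\, \omega(s',a') \ \ \forall s \in \mathcal{S} \Big\}$$
Then, we can formulate the Dual Problem as:
$$\max_{\boldsymbol{\omega} \in \Omega} H(\boldsymbol{\omega}) \tag{10}$$
Finally, letting $\boldsymbol{\omega}^*$ denote the solution of Problem (10), we can recover the policy inducing the optimal state-action entropy as $\pi^*(a|s) = \omega^*(s,a) / \sum_{a'} \omega^*(s,a')$.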
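The last step of the dual approach, recovering a policy from a state-action distribution, is simple enough to sketch. The code below assumes the distribution is stored as a table `omega[s][a]`, and it assigns a uniform policy to states with zero mass, which is a choice made here and not stated in the text. The feasibility check reuses a hypothetical 2-state, 2-action MDP, and all helper names are illustrative.

```python
def policy_from_state_action(omega):
    """Recover pi(a|s) = omega(s,a) / sum_a' omega(s,a') from a state-action table."""
    policy = []
    for row in omega:
        mass = sum(row)
        if mass > 0:
            policy.append([w / mass for w in row])
        else:
            policy.append([1.0 / len(row)] * len(row))  # assumption for unvisited states
    return policy

def is_stationary(omega, transitions, tol=1e-9):
    """Check sum_a omega(s,a) == sum_{s',a'} P(s|s',a') omega(s',a') for every s."""
    n_states = len(omega)
    for s in range(n_states):
        inflow = sum(
            transitions[s_prev][a][s] * omega[s_prev][a]
            for s_prev in range(n_states)
            for a in range(len(omega[s_prev]))
        )
        if abs(sum(omega[s]) - inflow) > tol:
            return False
    return True

# Hypothetical state-action distribution (illustrative numbers).
omega = [[0.1, 0.3], [0.45, 0.15]]
policy = policy_from_state_action(omega)

# Feasible example: uniform policy on a toy MDP, with its steady state d = (8/17, 9/17).
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]
d = [0.4 / 0.85, 0.45 / 0.85]
omega_feasible = [[0.5 * d[0], 0.5 * d[0]], [0.5 * d[1], 0.5 * d[1]]]
```

Only distributions that pass the stationarity check belong to the feasible set; the first table above is just an input for the normalization step and is not claimed to be feasible for any particular MDP.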
The Dual Problem displays some appealing features. In particular, the objective in (10) is already convex, so that it can be optimized right away, and it allows one to explicitly maximize the entropy over the state-action space. Nonetheless, we think that this alternative formulation has three major shortcomings. First, the optimization of the convex program (10) could be way slower than the optimization of the linear programs Column Sum and Infinity [grotschel1993ellipsoid]. Secondly, it does not allow one to control the mixing time of the learned policy, which can be extremely relevant. Lastly, the applicability of the Dual Problem to continuous environments seems far-fetched. It is worth noting that, from an empirical evaluation, the dual formulation does not provide any significant benefit in the entropy of the learned policy w.r.t. the lower bounds formulations (see Appendix D). Figure (e) shows how the solve time of the Column Sum scales better with the number of variables in incrementally large Knight Quest domains.
6.3 Related Work
As discussed in the previous sections, Hazan et al. [hazan2018provably] consider an objective not that dissimilar to the one presented in this paper, even if they propose a fairly different solution to the problem. Their method learns a mixture of deterministic policies instead of a single stochastic policy. In a similar vein, Tarbouriech and Lazaric [tarbouriech2019active] develop an approach, based on a dual formulation of the objective, to learn a mixture of stochastic policies for active exploration.
Others propose to intrinsically motivate the agent towards learning to reach all possible states in the environment [lim2012autonomous]. To extend this same idea from the tabular setting to the context of a continuous, high-dimensional state space, Pong et al. [pong2019skew] employ a generative model to seek a maximum-entropy goal distribution. Ecoffet et al. [ecoffet2019go] propose a method, called Go-Explore, to methodically reach any state by keeping an archive of any visited state and the best trajectory that brought the agent there. At each iteration, the agent draws a promising state from the archive, returns there replicating the stored trajectory (Go), then explores from this state trying to discover new states (Explore).
Another promising intrinsic objective is to make value out of the exploration phase by acquiring a set of reusable skills, typically formulated by means of the option framework [sutton1999between]. In [barto2004intrinsically], a set of options is learned by maximizing an intrinsic reward that is generated at the occurrence of some, user-defined, salient event. The approach proposed by Bonarini et al. [bonarini2006incremental], which presents some similarities with the work in [ecoffet2019go], is based on learning a set of options to return with high probability to promising states. In their context, a promising state presents high unbalance between the probabilities of the input and output transitions [bonarini2006self], so that it is both a hard state to reach, and a doorway to reach many other states. In this way, the learned options heuristically favor an even exploration of the state space.
7 Conclusions
In this paper, we proposed a new model-based algorithm, IDEAL, to learn highly exploring and fast mixing policies. The algorithm outputs a policy that maximizes a lower bound to the entropy of the steady-state distribution. We presented three formulations of the lower bound that differently trade off tightness with computational complexity of the optimization. The experimental evaluation showed that IDEAL is able to achieve better performance than other approaches striving for uniform exploration of the environment, while it avoids the risk of detachment and derailment [ecoffet2019go]. Future works could focus on extending the applicability of the presented approach to non-tabular environments, following the blueprint in Section 6.1. We believe that this work provides a valuable contribution in view of solving the conundrum of what a reinforcement learning agent should learn in the absence of any reward coming from the environment.
Acknowledgments
This work has been partially supported by the Italian MIUR PRIN 2017 Project ALGADIMAR “Algorithms, Games, and Digital Market”.
References
Appendix A Proofs
Proof.
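Let us recall the definition of the steady-state distribution $d^\pi$ of the MC induced by the policy $\pi$ over the MDP, where $P^\pi$ denotes the state transition matrix:
$$d^\pi(s) = \sum_{s' \in \mathcal{S}} d^\pi(s') \, P^\pi(s \mid s'), \quad \forall s \in \mathcal{S}.$$
If $d^\pi$ is a uniform distribution we have:
$$\frac{1}{|\mathcal{S}|} = \sum_{s' \in \mathcal{S}} \frac{1}{|\mathcal{S}|} \, P^\pi(s \mid s') \;\Longrightarrow\; \sum_{s' \in \mathcal{S}} P^\pi(s \mid s') = 1, \quad \forall s \in \mathcal{S}, \tag{11}$$
then, the state transition matrix is column stochastic, while it is also row stochastic by definition. Conversely, if the matrix is doubly stochastic, we aim to prove that a steady-state distribution that is not uniform causes an inconsistency in the stationary condition. Let us consider a perturbation of the uniform distribution, such that for all the states in outside of: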
(12) 
where is a, sufficiently small, positive constant. Since is doubly stochastic, the sum:
(13) 
is a convex combination of the elements in . Hence, for the stationary condition to hold, we must have and for all different from . Nevertheless, a state with probability one on the self-loop cannot have a stationary distribution different from or . ∎
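The first direction of the argument can be checked numerically. The following is a minimal sketch, assuming two hypothetical 3-state transition matrices (`P_ds` and `P_rs` are illustrative and do not come from the paper): it applies one step of the chain to the uniform distribution and shows that the uniform distribution is preserved when the matrix is doubly stochastic, while it is not preserved when only the rows sum to one.

```python
def step(dist, P):
    """One step of the chain: new[s] = sum over s' of dist[s'] * P[s'][s]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def is_doubly_stochastic(P, tol=1e-12):
    """Check that every row and every column of P sums to one."""
    n = len(P)
    rows_ok = all(abs(sum(row) - 1.0) <= tol for row in P)
    cols_ok = all(abs(sum(P[i][j] for i in range(n)) - 1.0) <= tol
                  for j in range(n))
    return rows_ok and cols_ok


# Hypothetical doubly stochastic matrix (rows index the current state).
P_ds = [[0.50, 0.25, 0.25],
        [0.25, 0.50, 0.25],
        [0.25, 0.25, 0.50]]

# Hypothetical row-stochastic matrix whose columns do not sum to one.
P_rs = [[0.8, 0.1, 0.1],
        [0.6, 0.3, 0.1],
        [0.5, 0.2, 0.3]]

uniform = [1.0 / 3.0] * 3
uniform_after_ds = step(uniform, P_ds)  # stays uniform
uniform_after_rs = step(uniform, P_rs)  # mass drifts toward state 0
```

In the second case the first state receives more than one third of the mass after a single step, so the uniform distribution cannot be stationary for that chain.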
Proof.
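We start with rewriting the entropy of the steady-state distribution $d^\pi$ as follows:
$$H(d^\pi) = -\sum_{s \in \mathcal{S}} d^\pi(s) \log d^\pi(s) = \log |\mathcal{S}| - D_{KL}(d^\pi \,\|\, d^u), \tag{14}$$
where $d^u$ is the uniform distribution over the state space (all the entries equal to $1/|\mathcal{S}|$) and $D_{KL}(p \,\|\, q)$ is the Kullback-Leibler (KL) divergence between distributions $p$ and $q$.
Using the reverse Pinsker inequality [csiszar2006context, p. 1012 and Lemma 6.3], we can upper bound the KL divergence between the steady-state distribution and the uniform distribution: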
(15) 
The total variation between the two steady-state distributions can in turn be upper bounded by (see [schweitzer1968perturbation]):
(16) 
where is the fundamental matrix and is any doubly stochastic matrix (). Since the fundamental matrix associated with any doubly stochastic matrix is row stochastic [hunter2010some], then . Furthermore, since the bound in Equation (16) holds for any , we can rewrite the bound as follows:
(17) 
Combining Equations (15) and (17) we get an upper bound to the KL divergence, which, once replaced in Equation (14), provides the lower bound in the statement and concludes the proof.
∎
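The first step of this proof rests on a standard identity: the entropy of a distribution equals the logarithm of the number of states minus its KL divergence from the uniform distribution. The following is a small numerical sketch of that identity, assuming a hypothetical 4-state distribution `d` chosen only for illustration.

```python
import math


def entropy(d):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in d if p > 0)


def kl_divergence(p, q):
    """KL divergence D(p || q) for discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical steady-state distribution over four states.
d = [0.5, 0.25, 0.125, 0.125]
n = len(d)
uniform = [1.0 / n] * n

lhs = entropy(d)                                # H(d)
rhs = math.log(n) - kl_divergence(d, uniform)   # log|S| - KL(d || uniform)
```

Since the KL divergence is non-negative and vanishes only for the uniform distribution, any upper bound on it translates directly into a lower bound on the entropy, which is the route the proof follows.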
Proof.
From the properties of the matrix norms [petersen2008matrix], we have that for any matrix it holds:
As a consequence:
where . Combining this inequality with the result in Theorem 3.1 concludes the proof. ∎
Proof.
We start with defining the vector that results from the difference between the vector of ones and the vector of the column sums: . We denote with the matrix obtained from by adding to the row corresponding to state :
It is worth noting that, since , the column sums and the row sums of matrix are all equal to . Nonetheless, is not guaranteed to be doubly stochastic since its entries can be lower than . However, it is possible to show that
When is doubly stochastic, the above inequality holds by definition. When has negative entries, it is always possible to transform it to a doubly stochastic matrix without increasing the distance from . In order to remove the negative entries of , we need to trade probability with the other states, so as to preserve the row sum. Each state that gives probability to state , will receive the same amount of probability taken by the columns corresponding to positive values of the vector . In order to illustrate this procedure, we consider a four-state MDP and a policy that leads to the following state transition matrix:
The corresponding vector is
Summing to the first row of we get:
Since we have two negative elements, to get a doubly stochastic matrix we can modify the matrix as follows:

move from element to and (to keep the row sum equal to 1) move from to

move from element to and (to keep the row sum equal to 1) move from to
The resulting matrix is:
The described procedure yields a doubly stochastic matrix such that . Combining this upper bound with the result in Theorem 3.1 concludes the proof. ∎
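The first step of the construction, adding the column-sum deficit to a single row, can be sketched as follows. The four-state matrix `P` below is a hypothetical stand-in for the example in the proof, and the helper names are illustrative. The sketch only covers this first step: it shows that the row and column sums all become one while negative entries may appear, and it does not implement the subsequent redistribution that removes them.

```python
def column_sums(P):
    """Sum of each column of a square matrix given as a list of rows."""
    n = len(P)
    return [sum(P[i][j] for i in range(n)) for j in range(n)]


def add_deficit_to_row(P, row):
    """Add the vector (1 - column sums) to the chosen row of P.

    Returns the new matrix and the deficit vector. The input is not modified.
    """
    deficit = [1.0 - c for c in column_sums(P)]
    Q = [list(r) for r in P]
    Q[row] = [q + d for q, d in zip(Q[row], deficit)]
    return Q, deficit


# Hypothetical row-stochastic transition matrix over four states.
P = [[0.50, 0.50, 0.00, 0.00],
     [0.50, 0.25, 0.25, 0.00],
     [0.50, 0.00, 0.25, 0.25],
     [0.25, 0.00, 0.00, 0.75]]

Q, deficit = add_deficit_to_row(P, row=0)
```

The entries of the deficit vector sum to zero because the entries of a row-stochastic matrix sum to the number of states, so the modified row still sums to one. In this instance the first entry of the modified row is negative, which is the situation the redistribution step in the proof is meant to handle.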
Proof.
From the properties of the matrix norms [petersen2008matrix], we have that for any matrix it holds:
(18) 
As a consequence:
where . It follows that
∎
Appendix B Optimization Problems
B.1 Linear program formulation of Problem (4)
Problem (4) can be rewritten as follows:
(19)  
subject to  
The first set of inequality constraints can be transformed into a set of linear inequality constraints. Each constraint is obtained by removing the absolute values and considering a different combination of the signs in front of the terms in the summation. As a result, if the original summation contains $n$ elements, the number of linear constraints is $2^n$. Since this process needs to be repeated for each state, the first set of constraints can be replaced by $2^n$ linear constraints per state.
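The sign-expansion trick described above can be sketched in a few lines. This is an illustration only, with hypothetical helper names: a constraint of the form sum_i |x_i| <= v is replaced by one linear constraint per sign pattern, and the two forms are checked to agree on a sample point. The total count assumes that the summation ranges over all the states, so that n equals the number of states.

```python
from itertools import product


def expand_l1_constraint(n):
    """All 2**n sign patterns that replace sum_i |x_i| <= v
    with the linear constraints sum_i sign_i * x_i <= v."""
    return list(product((1, -1), repeat=n))


def satisfies_all(x, v, signs):
    """True if x satisfies every linear constraint of the expansion."""
    return all(sum(s * xi for s, xi in zip(pattern, x)) <= v
               for pattern in signs)


def total_constraints(n_states):
    """Constraints after expanding one absolute-value sum per state
    (assumes each sum has n_states terms)."""
    return n_states * 2 ** n_states


signs = expand_l1_constraint(3)
x = [0.25, -0.5, 0.125]  # sum of absolute values is 0.875
```

The exponential growth in the number of constraints is what makes this formulation expensive for large state spaces.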
B.2 Linear program formulation of Problem (6)
Let be a vector of length . Problem (6) can be rewritten as follows:
(20)  
subject to  
Appendix C Illustrative Example
The example in Figure 11 shows that the Frobenius norm can capture the distance between a transition matrix and a doubly stochastic one better than the infinity norm. Indeed, the infinity norm only accounts for the state that corresponds to the maximum absolute row sum of the difference, while the Frobenius norm considers the difference across all the states. In the example, we see two transition matrices that are equally bad in the worst state and thus have equal infinity norm. However, one of them is fairly unbalanced also in the other states, where the other is uniform instead, and so the latter is clearly preferable in view of the uniform exploration objective.
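A small numerical sketch makes the same point. The matrices below are hypothetical and are not the ones in the figure; as a simplifying assumption, the distance is measured from the uniform transition matrix `U`, which is one particular doubly stochastic matrix, rather than from the closest doubly stochastic matrix.

```python
import math


def infinity_norm(A):
    """Maximum absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)


def frobenius_norm(A):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(a * a for row in A for a in row))


def difference(P, Q):
    """Entry-wise difference P - Q."""
    return [[p - q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]


n = 4
U = [[0.25] * n for _ in range(n)]  # uniform, hence doubly stochastic

# Both matrices share the same worst state (state 0, a self-loop).
# P_a is uniform elsewhere, P_b is unbalanced in every other state too.
P_a = [[1.0, 0.0, 0.0, 0.0]] + [[0.25] * n for _ in range(n - 1)]
P_b = [[1.0, 0.0, 0.0, 0.0]] + [[0.5, 0.5, 0.0, 0.0] for _ in range(n - 1)]

inf_a = infinity_norm(difference(P_a, U))
inf_b = infinity_norm(difference(P_b, U))
fro_a = frobenius_norm(difference(P_a, U))
fro_b = frobenius_norm(difference(P_b, U))
```

Here the infinity norm assigns the same value to both matrices, while the Frobenius norm is strictly smaller for `P_a`, reflecting that it is closer to uniform in the remaining states.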

Appendix D Experimental Evaluation: Further Details
In the following, we provide further details on the experimental evaluation covered by Section 5. First, for the sake of clarity, we report the pseudocode of MaxEnt and CountBased algorithms, which we have compared with our approach. Then, for any presented experiment, we recap the full set of parameters employed, we show the value of the exact solutions, and a characterization of the solve time, for all the different formulations. Finally, we illustrate an additional experiment in the River Swim domain [strehl2008analysis].
As a side note, it is worth reporting that our implementation of the optimization Problems (4), (5), (6) is based on the CVXPY framework [cvxpy] and makes use of the MOSEK optimizer.
D.1 Algorithms: Pseudocode
D.2 Experiments
For any experiment covered by Section 5, we provide (Table 1) the cardinality of the state space, the cardinality of the action space, the values of the parameters of IDEAL, the parameter of MaxEnt and CountBased, the number of iterations, and the batch size, which are shared by all the approaches. For every domain, we report (Table 2) the value of the exact solution of Problems (4), (5), (6), (10). In Table 3, we provide the time to find a solution for Problems (4), (5), (6) that we experienced running the optimization on a single-core general-purpose CPU. For any experiment (except Knight Quest), we show additional figures reporting the performance of all the formulations of IDEAL.
The River Swim environment [strehl2008analysis] mimics the task of crossing a river by swimming either upstream or downstream. The action of swimming upstream fails with high probability, while the action of swimming downstream is deterministic. Due to this imbalance in the effort needed to cross the environment in the two directions, it is a fairly hard task in view of uniform exploration.
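The asymmetry described above can be sketched as a small transition model. This is a simplified illustration and not the exact dynamics of [strehl2008analysis]: the number of states, the success probability `p_upstream`, and the choice that a failed upstream action leaves the agent in place are all assumptions made for the example.

```python
DOWNSTREAM, UPSTREAM = 0, 1


def river_swim_transitions(n_states=6, p_upstream=0.25):
    """Build P[s][a][s_next] for a simplified River Swim chain.

    Swimming downstream always moves one state back (deterministic).
    Swimming upstream moves one state forward with probability
    p_upstream and otherwise leaves the agent where it is (assumed).
    """
    P = [[[0.0] * n_states for _ in range(2)] for _ in range(n_states)]
    for s in range(n_states):
        # Deterministic downstream move, clipped at the first state.
        P[s][DOWNSTREAM][max(s - 1, 0)] = 1.0
        # Unreliable upstream move, clipped at the last state.
        forward = min(s + 1, n_states - 1)
        P[s][UPSTREAM][forward] += p_upstream
        P[s][UPSTREAM][s] += 1.0 - p_upstream
    return P


P = river_swim_transitions()
```

Under these assumptions, a policy that picks actions uniformly at random drifts toward the downstream end of the chain, which illustrates why covering the state space evenly is hard in this domain.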
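Table 1: Parameters of each experiment. Rows: Single Chain, Double Chain, Knight Quest, River Swim.
Table 2: Exact solutions of Problems (4), (5), (6), (10). Rows: Infinity, Frobenius, Column Sum, Dual. Columns: Single Chain, Double Chain, Knight Quest.
Table 3: Solve Time (sec) for Problems (4), (5), (6). Rows: Infinity, Frobenius, Column Sum. Columns: Single Chain, Double Chain, Knight Quest.
Additional figures: Single Chain, Double Chain, Knight Quest, GoalConditioned, River Swim.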
Footnotes
 A complete version of the paper, which includes the Appendix, is available at https://arxiv.org/abs/1907.04662