Fast rates for online learning in Linearly Solvable Markov Decision Processes
Gergely Neu (gergely.neu@gmail.com)
Universitat Pompeu Fabra, Barcelona, Spain
Vicenç Gómez (vicen.gomez@upf.edu)
Universitat Pompeu Fabra, Barcelona, Spain
Abstract
We study the problem of online learning in a class of Markov decision processes known as linearly solvable MDPs. In the stationary version of this problem, a learner interacts with its environment by directly controlling the state transitions, attempting to balance a fixed state-dependent cost and a certain smooth cost penalizing extreme control inputs. In the current paper, we consider an online setting where the state costs may change arbitrarily between consecutive rounds, and the learner only observes the costs at the end of each respective round. We are interested in constructing algorithms for the learner that guarantee small regret against the best stationary control policy chosen in full knowledge of the cost sequence. Our main result shows that the smoothness of the control cost enables the simple algorithm of following the leader to achieve a regret of order $\log^2 T$ after $T$ rounds, vastly improving on the best known regret bound of order $T^{3/4}$ for this setting.
Keywords: Online learning, fast rates, Markov decision processes, optimal control
1 Introduction
We consider the problem of online learning in Markov decision processes (MDPs) where a learner sequentially interacts with an environment by repeatedly taking actions that influence the future states of the environment while incurring some immediate costs. The goal of the learner is to choose its actions in a way that the accumulated costs are as small as possible. Several variants of this problem have been well studied in the literature, primarily in the case where the costs are assumed to be independent and identically distributed (Sutton and Barto, 1998; Puterman, 1994; Bertsekas and Tsitsiklis, 1996; Szepesvári, 2010). In the current paper, we consider the case where the costs are generated by an arbitrary external process and the learner aims to minimize its total loss during the learning procedure, conforming to the learning paradigm known as online learning (Cesa-Bianchi and Lugosi, 2006; Shalev-Shwartz, 2012). In the online-learning framework, the performance of the learner is measured in terms of the regret, defined as the gap between the total costs incurred by the learner and the total costs of the best comparator chosen from a prespecified class of strategies. In the case of online learning in MDPs, a natural class of strategies is the set of all state-feedback policies: several works studied minimizing regret against this class both in the stationary-cost (Bartlett and Tewari, 2009; Jaksch et al., 2010; Abbasi-Yadkori and Szepesvári, 2011) and the non-stochastic setting (Even-Dar et al., 2009; Yu et al., 2009; Neu et al., 2010, 2012; Zimin and Neu, 2013; Dick et al., 2014; Neu et al., 2014; Abbasi-Yadkori et al., 2014). In the non-stochastic setting, most works consider MDPs with unstructured, finite state spaces and guarantee that the regret increases no faster than $O(\sqrt{T})$ as the number of interaction rounds $T$ grows large. A notable exception is the work of Abbasi-Yadkori et al. (2014), who consider the special case of (continuous-state) linear-quadratic control with arbitrarily changing target states, and propose an algorithm that guarantees a regret bound of $O(\log^2 T)$.
In the present paper, we study another special class of MDPs that turns out to allow fast rates. Specifically, we consider the class of so-called linearly solvable MDPs (in short, LMDPs), first proposed and named so by Todorov (2006). This class takes its name from the special property that the Bellman optimality equations characterizing the optimal behavior policy take the form of a system of linear equations, which makes optimization remarkably straightforward in such problems. The continuous formulation (in both space and time) was discovered independently by Kappen (2005) and is known as path integral control. LMDPs have many interesting properties. For example, optimal control laws for LMDPs can be linearly combined to derive composite optimal control laws efficiently (Todorov, 2009). Also, the inverse optimal control problem in LMDPs can be expressed as a convex optimization problem (Dvijotham and Todorov, 2010). LMDPs generalize an existing duality between optimal control computation and Bayesian inference (Todorov, 2008). Indeed, the popular belief propagation algorithm used in dynamic probabilistic graphical models is equivalent to the power iteration method used to solve LMDPs (Kappen et al., 2012).
The LMDP framework has found applications in robotics (Matsubara et al., 2014; Ariki et al., 2016), crowdsourcing (Abbasi-Yadkori et al., 2015), and controlling the growth dynamics of complex networks (Thalmeier et al., 2017). The related path integral control framework of Kappen (2005) has been applied in several real-world tasks, including robot navigation (Kinjo et al., 2013), motor skill reinforcement learning (Theodorou et al., 2010; Rombokas et al., 2013; Gómez et al., 2014), aggressive car maneuvering (Williams et al., 2016) and autonomous flight of teams of quadrotors (Gómez et al., 2016).
In the present paper, we show that besides the aforementioned properties, the structure of LMDPs also enables constructing efficient online learning procedures with very low regret. In particular, we show that, under some mild assumptions on the structure of the LMDP, the (conceptually) simplest online learning strategy of following the leader guarantees a regret of order $\log^2 T$, vastly improving over the best known previous result by Guan, Raginsky, and Willett (2014), who prove a regret bound of order $T^{3/4+\epsilon}$ for arbitrarily small $\epsilon > 0$ under the same assumptions. Our approach is based on the observation that the optimal control law arising from the LMDP structure is a smooth function of the underlying cost function, enabling rapid learning without any regularization whatsoever.
The rest of the paper is organized as follows. Section 2 introduces the formalism of LMDPs and summarizes some basic facts that our technical content is going to rely on. Section 3 describes our online learning model. Our learning algorithm is described in Section 4 and analyzed in Section 5. Finally, we draw conclusions in Section 6.
Notation.
We will consider several real-valued functions over a finite state space $\mathcal{X}$, and we will often treat these functions as finite-dimensional (column) vectors endowed with the usual definitions of the $\ell_p$ norms $\|\cdot\|_p$. The set of probability distributions over $\mathcal{X}$ will be denoted as $\Delta(\mathcal{X})$. Indefinite sums with running variables $x$ or $x'$ are understood to run through all of $\mathcal{X}$.
2 Background on linearly solvable MDPs
This section serves as a quick introduction to the formalism of linearly solvable MDPs (LMDPs; Todorov, 2006). These decision processes are defined by the tuple $\{\mathcal{X}, P, c\}$, where $\mathcal{X}$ is a finite set of states, $P$ is a transition kernel called the passive dynamics (with $P(x'|x)$ being the probability of the process moving to state $x'$ given the previous state $x$) and $c$ is the state-cost function. Our Markov decision process is a sequential decision-making problem where the initial state $X_1$ is drawn from some distribution $\mu_1$, and the following steps are repeated for an indefinite number of rounds $t = 1, 2, \dots$:

The learner chooses a transition kernel $Q_t$ satisfying $\mathrm{supp}\, Q_t(\cdot|x) \subseteq \mathrm{supp}\, P(\cdot|x)$ for all $x \in \mathcal{X}$.

The learner observes $X_t$ and draws the next state $X_{t+1} \sim Q_t(\cdot|X_t)$.

The learner incurs the cost
$$ \ell(X_t, Q_t) = c(X_t) + D\bigl(Q_t(\cdot|X_t) \,\|\, P(\cdot|X_t)\bigr), $$
where $D(q \,\|\, p)$ is the relative entropy (or Kullback-Leibler divergence) between the probability distributions $q$ and $p$, defined as $D(q \,\|\, p) = \sum_{x} q(x) \log \frac{q(x)}{p(x)}$.
The state-cost function should be thought of as specifying the objective for the learner in the MDP, while the relative-entropy term governs the costs associated with significant deviations from the passive dynamics. Accordingly, we refer to this component as the control cost. A central question in the theory of Markov decision problems is finding a behavior policy that minimizes (some notion of) the long-term total costs. In this paper, we consider the problem of minimizing the long-term average cost-per-stage $\limsup_{T\to\infty} \frac{1}{T}\, \mathbb{E}\bigl[\sum_{t=1}^{T} \ell(X_t, Q_t)\bigr]$. Assuming that the passive dynamics is aperiodic and irreducible, this limit is minimized by a stationary policy (see, e.g., Puterman (1994, Sec. 8.4.4)). Below, we provide two distinct derivations for the optimal stationary policy that minimizes the average costs under this assumption.
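As a small illustration of the cost structure described above, the following Python sketch evaluates the one-step cost, that is, the state cost plus the relative entropy between the chosen kernel and the passive dynamics, on a hypothetical three-state chain. The matrices `P` and `Q`, the vector `c`, and the helper names `kl` and `round_cost` are assumptions made for this example only and do not come from the paper.

```python
import numpy as np

def kl(q, p):
    """Relative entropy D(q || p), using the convention 0 log 0 = 0."""
    m = q > 0
    return float(np.sum(q[m] * np.log(q[m] / p[m])))

def round_cost(x, Q, P, c):
    """State cost at x plus the control cost of using row Q[x] instead of P[x]."""
    return c[x] + kl(Q[x], P[x])

# Hypothetical passive dynamics, state costs and a controlled kernel.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
c = np.array([0.0, 0.5, 1.0])
Q = np.array([[0.8, 0.1, 0.1],
              [0.6, 0.3, 0.1],
              [0.7, 0.2, 0.1]])

passive_cost = round_cost(2, P, P, c)     # no control: only the state cost remains
controlled_cost = round_cost(2, Q, P, c)  # steering away from P adds a control cost
```

Following the passive dynamics exactly incurs no control cost, whereas pushing the chain towards the cheap state is paid for in the same round through the relative-entropy term.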
2.1 The Bellman equations
We first take an approach rooted in dynamic programming (Bertsekas, 2007), following Todorov (2006). Under our assumptions, the optimal stationary policy minimizing the average cost is given by finding the solution to the Bellman optimality equation
$$ v(x) = c(x) - \lambda + \min_{q \in \Delta(\mathcal{X})} \Bigl\{ \sum_{x'} q(x') v(x') + D\bigl(q \,\|\, P(\cdot|x)\bigr) \Bigr\} \tag{1} $$
for all $x \in \mathcal{X}$, where $v$ is called the optimal value function and $\lambda$ is the average cost associated with the optimal policy. (Footnote 1: This solution is guaranteed to be unique up to a constant shift of the values: if $v$ is a solution, then so is $v + a$ for any $a \in \mathbb{R}$. Unless stated otherwise, we will assume that $v$ is such that $v(x_0) = 0$ holds for a fixed state $x_0$.) Linearly solvable MDPs get their name from the fact that the Bellman optimality equation can be rewritten in a simple linear form. To see this, observe that by elementary calculations involving Lagrange multipliers, we have
$$ \min_{q \in \Delta(\mathcal{X})} \Bigl\{ \sum_{x'} q(x') v(x') + D\bigl(q \,\|\, P(\cdot|x)\bigr) \Bigr\} = -\log \sum_{x'} P(x'|x)\, e^{-v(x')}, $$
so, after defining the exponentiated value function $z(x) = e^{-v(x)}$ for all $x$, plugging into Equation (1) and exponentiating both sides gives
$$ z(x) = e^{\lambda - c(x)} \sum_{x'} P(x'|x)\, z(x'). \tag{2} $$
Rewriting the above set of equations in matrix form, we obtain the linear equations
$$ e^{-\lambda} z = G P z, $$
where $G$ is a diagonal matrix with $G_{xx} = e^{-c(x)}$. By the Perron–Frobenius theorem (see, e.g., Chapter 8 of Meyer (2000)) concerning positive matrices, the above system of linear equations has a unique solution $z$ satisfying $z(x) > 0$ for all $x$ (Footnote 2: As in the case of the Bellman equations, this solution is unique up to a scaling of $z$), and this eigenvector corresponds to the largest eigenvalue $e^{-\lambda}$ of $GP$. Since the solution of the Bellman optimality equation (1) is unique (up to a constant shift corresponding to a constant scaling of $z$), we obtain that $\lambda$ is the average cost of the optimal policy. In summary, the Bellman optimality equation takes the form of a Perron–Frobenius eigenvalue problem, which can be efficiently solved by iterative methods such as the well-known power method for finding top eigenvectors. Finally, getting back to the basic form (1) of the Bellman equations, we can conclude after simple calculations that the optimal policy can be computed for all $x, x'$ as
$$ Q(x'|x) = \frac{P(x'|x)\, z(x')}{\sum_{y} P(y|x)\, z(y)}. $$
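The eigenvalue formulation above can be sketched numerically with the power method. The following Python example is a minimal illustration under stated assumptions: the three-state passive dynamics `P`, the cost vector `c`, and the function name `solve_lmdp` are hypothetical, and the iteration is the textbook power method rather than any implementation taken from the paper.

```python
import numpy as np

def solve_lmdp(P, c, iters=10_000, tol=1e-12):
    """Power iteration on G P, where G = diag(exp(-c)).

    Returns (lam, z, Q): the optimal average cost, the exponentiated value
    function (normalized to sum to one) and the optimal transition kernel.
    """
    GP = np.exp(-c)[:, None] * P           # row x of P scaled by exp(-c(x))
    z = np.full(len(c), 1.0 / len(c))
    for _ in range(iters):
        z_new = GP @ z
        z_new /= z_new.sum()               # keep the iterate normalized
        done = np.abs(z_new - z).max() < tol
        z = z_new
        if done:
            break
    gamma = (GP @ z).sum()                 # top eigenvalue, since z sums to one
    lam = -np.log(gamma)
    Q = P * z[None, :]                     # tilt the passive dynamics by z
    Q /= Q.sum(axis=1, keepdims=True)
    return lam, z, Q

# Hypothetical example: strictly positive passive dynamics, costs in [0, 1].
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
c = np.array([0.0, 0.5, 1.0])
lam, z, Q = solve_lmdp(P, c)
```

The returned `z` can be checked against the linear fixed-point equation by comparing `z` with `np.exp(lam - c) * (P @ z)`; the two vectors should agree up to the tolerance of the iteration.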
2.2 The convex optimization view
We also provide an alternative (and, to our knowledge, as yet unpublished) view of the optimal control problem in LMDPs, based on convex optimization. For the purposes of this paper, we find this form to be more insightful, as it enables us to study our learning problem in the framework of online convex optimization (Hazan, 2011, 2016; Shalev-Shwartz, 2012). To derive this form, observe that under our assumptions, every feasible policy $Q$ induces a stationary distribution $\mu_Q$ over the state space satisfying $\mu_Q(x') = \sum_{x} \mu_Q(x) Q(x'|x)$ for all $x'$. This stationary distribution and the policy together induce a distribution $\pi_Q$ over $\mathcal{X}^2$ defined for all $x, x'$ as $\pi_Q(x, x') = \mu_Q(x) Q(x'|x)$. We will call $\pi_Q$ the stationary transition measure induced by $Q$, which is motivated by the observation that $\pi_Q(x, x')$ corresponds to the probability of observing the transition $x \to x'$ in the equilibrium state: $\pi_Q(x, x') = \lim_{t\to\infty} \mathbb{P}[X_t = x, X_{t+1} = x']$. Notice that, with this notation, the average cost-per-stage of policy $Q$ can be rewritten in the form
$$ \sum_{x} \mu_Q(x) \Bigl( c(x) + D\bigl(Q(\cdot|x) \,\|\, P(\cdot|x)\bigr) \Bigr) = \sum_{x,x'} \pi_Q(x,x') \log \frac{\pi_Q(x,x')}{\sum_{y} \pi_Q(x,y)} + \sum_{x,x'} \pi_Q(x,x') \bigl( c(x) - \log P(x'|x) \bigr). $$
The first term in the final expression above is the negative conditional entropy of $X'$ relative to $X$, where $(X, X')$ is a pair of random states drawn from $\pi_Q$. Since the negative conditional entropy is convex in $\pi_Q$ (for a proof, see Appendix A.1) and the second term in the expression is linear in $\pi_Q$, we can see that the average cost is a convex function of $\pi_Q$. This suggests that we can view the optimal control problem as having to find a feasible stationary transition measure that minimizes the expected costs. In short, defining
$$ f(\pi; c) = \sum_{x,x'} \pi(x,x') \log \frac{\pi(x,x')}{\sum_{y} \pi(x,y)} + \sum_{x,x'} \pi(x,x') \bigl( c(x) - \log P(x'|x) \bigr) \tag{3} $$
and $\mathcal{S}$ as the (convex) set of feasible stationary transition measures $\pi \in \Delta(\mathcal{X}^2)$ satisfying
$$ \sum_{x'} \pi(x, x') = \sum_{x''} \pi(x'', x) \quad \text{for all } x \in \mathcal{X}, \tag{4} $$
the optimization problem can be succinctly written as $\min_{\pi \in \mathcal{S}} f(\pi; c)$. In Appendix A.2, we provide a derivation of the optimal control given by Equation (2) starting from the formulation given above. We also remark that our analysis will heavily rely on the fact that $f$ is affine in $c$.
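To make the stationary-transition-measure view concrete, the sketch below computes the measure induced by a policy and evaluates the average cost as a function of that measure (negative conditional entropy plus a linear term). It then compares the optimal policy, obtained from the top eigenvector of $\mathrm{diag}(e^{-c})P$, with the passive dynamics. The example chain, the cost vector and all function names are hypothetical choices for illustration, not code from the paper.

```python
import numpy as np

def stationary_distribution(Q):
    """Left eigenvector of Q for eigenvalue 1, normalized to a distribution."""
    w, V = np.linalg.eig(Q.T)
    mu = np.real(V[:, np.argmax(np.real(w))])
    return mu / mu.sum()

def transition_measure(Q):
    """pi(x, x') = mu(x) Q(x'|x), with mu the stationary distribution of Q."""
    return stationary_distribution(Q)[:, None] * Q

def average_cost(pi, c, P):
    """Average cost of pi; assumes pi(x, x') > 0 only where P(x'|x) > 0."""
    mu = pi.sum(axis=1)                    # state marginal of pi
    m = pi > 0                             # convention 0 log 0 = 0
    control = np.sum(pi[m] * np.log(pi[m] / (mu[:, None] * P)[m]))
    return float(control + mu @ c)

# Hypothetical passive dynamics and state costs.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
c = np.array([0.0, 0.5, 1.0])

# Optimal policy from the top eigenpair of diag(exp(-c)) P.
w, V = np.linalg.eig(np.exp(-c)[:, None] * P)
k = np.argmax(np.real(w))
z = np.abs(np.real(V[:, k]))
lam = -np.log(np.real(w[k]))
Q_opt = P * z[None, :]
Q_opt /= Q_opt.sum(axis=1, keepdims=True)

cost_opt = average_cost(transition_measure(Q_opt), c, P)
cost_passive = average_cost(transition_measure(P), c, P)
```

In this sketch `cost_opt` should coincide with `lam` up to numerical error, which is a useful sanity check that the convex formulation and the eigenvalue formulation describe the same optimum.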
3 Online learning in linearly solvable MDPs
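We now present the precise learning setting that we consider in the present paper. We will study an online learning scheme where for each round $t = 1, 2, \dots, T$, the following steps are repeated:

The learner chooses a transition kernel $Q_t$ satisfying $\mathrm{supp}\, Q_t(\cdot|x) \subseteq \mathrm{supp}\, P(\cdot|x)$ for all $x \in \mathcal{X}$.

The learner observes $X_t$ and draws the next state $X_{t+1} \sim Q_t(\cdot|X_t)$.

Obliviously to the learner’s choice, the environment chooses the state-cost function $c_t$.

The learner incurs the cost
$$ \ell_t(X_t, Q_t) = c_t(X_t) + D\bigl(Q_t(\cdot|X_t) \,\|\, P(\cdot|X_t)\bigr). $$

The environment reveals the state-cost function $c_t$.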
The key change from the stationary setting described in the previous section is that the state-cost function now may change arbitrarily between rounds, and the learner is only allowed to observe the costs after it has made its decision. We stress that we assume that the learner fully knows the passive dynamics, so the only difficulty comes from having to deal with the changing costs. As usual in the online-learning literature, our goal is to do nearly as well as the best stationary policy chosen in hindsight after observing the entire sequence of cost functions. To define our precise performance measure, we first define the total expected cost of a policy $Q$ as
$$ L_T(Q) = \mathbb{E}\Bigl[ \sum_{t=1}^{T} \ell_t(X'_t, Q) \Bigr], \qquad \ell_t(x, Q) = c_t(x) + D\bigl(Q(\cdot|x) \,\|\, P(\cdot|x)\bigr), $$
where the state trajectory $X'_t$ is generated sequentially as $X'_{t+1} \sim Q(\cdot|X'_t)$ and the expectation integrates over the randomness of the transitions. Having this definition in place, we can specify the best stationary policy $Q^*_T = \arg\min_{Q} L_T(Q)$ (Footnote 3: The existence of the minimum is warranted by the fact that $L_T$ is a continuous function bounded from below on its compact domain) and define our performance measure as the (total expected) regret against $Q^*_T$:
$$ R_T = \mathbb{E}\Bigl[ \sum_{t=1}^{T} \ell_t(X_t, Q_t) \Bigr] - L_T(Q^*_T), $$
where the expectation integrates over both the randomness of the state transitions and the potential randomization used by the learning algorithm. Having access to this definition, we can now formally define the goal of the learner as having to come up with a sequence of policies $Q_1, Q_2, \dots$ that guarantee that the total regret grows sublinearly, that is, that the average per-round regret $R_T / T$ asymptotically converges to zero.
For our analysis, it will be useful to define an idealized version of the above online optimization problem, where the learner is allowed to immediately switch between the stationary distributions of the chosen policies. By making use of the convex-optimization view given in Section 2.2, we define an auxiliary online convex optimization (or, in short, OCO; see, e.g., Hazan, 2011; Shalev-Shwartz, 2012) problem called the idealized OCO problem where in each round $t = 1, 2, \dots, T$, the following steps are repeated:

The learner chooses the stationary transition measure $\pi_t \in \mathcal{S}$.

Obliviously to the learner’s choice, the environment chooses the loss function $f_t = f(\cdot\,; c_t)$.

The learner incurs a loss of $f_t(\pi_t)$.

The environment reveals the loss function $f_t$.
The performance of the learner in this setting is measured by the idealized regret
$$ \widehat{R}_T = \sum_{t=1}^{T} f_t(\pi_t) - \min_{\pi \in \mathcal{S}} \sum_{t=1}^{T} f_t(\pi). $$
Throughout the paper, we will consider oblivious environments that choose the sequence of state-cost functions $c_1, \dots, c_T$ without taking into account the states visited by the learner. This assumption will enable us to simultaneously reason about the expected costs under any sequence of state distributions, and thus to make a connection between the idealized regret $\widehat{R}_T$ and the true regret $R_T$. This technique was first used by Even-Dar et al. (2009) and was shown to be essentially inevitable by Yu et al. (2009): as discussed in their Section 3.1, no learning algorithm can avoid linear regret if the environment is not oblivious.
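The idealized regret can be evaluated numerically for a given cost sequence. The sketch below does so for a learner that, in each round, plays the optimal stationary policy for the average of the costs seen so far (a follow-the-leader rule). It relies on the fact that the average cost is affine in the state cost, so the best fixed comparator in hindsight is the optimal policy for the mean cost vector. The chain `P`, the random cost sequence and the helper names are assumptions made for this illustration.

```python
import numpy as np

def solve(P, c):
    """Optimal average cost and kernel from the top eigenpair of diag(e^{-c}) P."""
    w, V = np.linalg.eig(np.exp(-c)[:, None] * P)
    k = np.argmax(np.real(w))
    z = np.abs(np.real(V[:, k]))
    Q = P * z[None, :]
    Q /= Q.sum(axis=1, keepdims=True)
    return -np.log(np.real(w[k])), Q

def avg_cost(Q, P, c):
    """Stationary average cost of kernel Q; assumes P > 0 entrywise."""
    w, V = np.linalg.eig(Q.T)
    mu = np.real(V[:, np.argmax(np.real(w))])
    mu /= mu.sum()
    kl = np.sum(Q * np.log(Q / P), axis=1)   # control cost in each state
    return float(mu @ (c + kl))

def idealized_ftl_regret(P, costs):
    """Idealized regret of the follow-the-leader rule on costs of shape (T, n)."""
    T, n = costs.shape
    c_bar, total = np.zeros(n), 0.0
    for t in range(T):
        _, Q = solve(P, c_bar)               # leader for the costs seen so far
        total += avg_cost(Q, P, costs[t])    # idealized loss in this round
        c_bar += (costs[t] - c_bar) / (t + 1)
    best, _ = solve(P, costs.mean(axis=0))   # best fixed policy in hindsight
    return total - T * best

# Hypothetical passive dynamics and an oblivious random cost sequence in [0, 1].
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
costs = np.random.default_rng(0).uniform(0.0, 1.0, size=(500, 3))
regret = idealized_ftl_regret(P, costs)
```

Running the function for increasing horizons gives an informal picture of how slowly the idealized regret grows on a particular sequence; it is not a substitute for the analysis.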
4 Algorithm and main result
In this section, we propose a simple algorithm for online learning in LMDPs based on the “follow-the-leader” (FTL) strategy. On a high level, the idea of this algorithm is greedily betting on the policy that seems to have been optimal for the total costs observed so far. While this strategy is known to fail catastrophically in several simple learning problems (see, e.g., Cesa-Bianchi and Lugosi 2006), it is known to perform well in several important scenarios such as sequential prediction under the logarithmic loss (Merhav and Feder, 1992) or prediction with expert advice under bounded losses, given that losses are stationary (Kotłowski, 2016), and it often serves as a strong benchmark strategy (de Rooij et al., 2014; Sani et al., 2014). In our learning problem, following the leader is a very natural choice of algorithm, as the convex formulation of Section 2.2 suggests that we can effectively build on the analysis of Follow-the-Regularized-Leader-type algorithms without having to explicitly regularize the objective.
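In precise terms, our algorithm computes the sequence of policies $Q_t$ by running FTL in the idealized setting: in round $t$, the algorithm chooses the stationary transition measure
$$ \pi_t = \arg\min_{\pi \in \mathcal{S}} \sum_{s=1}^{t-1} f(\pi; c_s) = \arg\min_{\pi \in \mathcal{S}} \frac{1}{t-1} \sum_{s=1}^{t-1} f(\pi; c_s) = \arg\min_{\pi \in \mathcal{S}} f\Bigl(\pi; \frac{1}{t-1} \sum_{s=1}^{t-1} c_s\Bigr) = \arg\min_{\pi \in \mathcal{S}} f(\pi; \bar{c}_{t-1}), $$
where the third equality uses the fact that $f$ is affine in its second argument and the last step introduces the average state-cost function $\bar{c}_{t-1} = \frac{1}{t-1} \sum_{s=1}^{t-1} c_s$. This form implies that $\pi_t$ can be computed as the optimal control for the state-cost function $\bar{c}_{t-1}$, which can be done by following the procedure described in Section 2.1. Precisely, we define the diagonal matrix $G_t$ with its $x$-th diagonal element $e^{-\bar{c}_{t-1}(x)}$, let $\gamma_t$ be the largest eigenvalue of $G_t P$ and $z_t$ be the corresponding (unit-norm) right eigenvector. Also, let $v_t = -\log z_t$ and $\lambda_t = -\log \gamma_t$, and note that $\lambda_t$ is the optimal average cost-per-stage given the cost function $\bar{c}_{t-1}$. Finally, we define the policy used in round $t$ as
$$ Q_t(x'|x) = \frac{P(x'|x)\, z_t(x')}{\sum_{y} P(y|x)\, z_t(y)} \tag{5} $$
for all $x$ and $x'$. We denote the induced stationary distribution by $\mu_t$. The algorithm is presented as Algorithm 1.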
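Input: Passive dynamics $P$.
Initialization: $\bar{c}_0(x) = 0$ for all $x \in \mathcal{X}$.
For $t = 1, 2, \dots, T$, repeat

Construct $G_t = \mathrm{diag}\bigl(e^{-\bar{c}_{t-1}}\bigr)$.

Find the right eigenvector $z_t$ of $G_t P$ corresponding to the largest eigenvalue.

Compute the policy $Q_t(x'|x) = P(x'|x)\, z_t(x') \big/ \sum_{y} P(y|x)\, z_t(y)$.

Observe state $X_t$ and draw $X_{t+1} \sim Q_t(\cdot|X_t)$.

Observe state-cost function $c_t$ and update $\bar{c}_t = \bigl(1 - \tfrac{1}{t}\bigr) \bar{c}_{t-1} + \tfrac{1}{t} c_t$.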
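The steps of the algorithm above can be sketched in a few lines of Python. This is a minimal simulation under stated assumptions, not the authors' implementation: the passive dynamics `P`, the random cost sequence, the starting state and the function names are hypothetical, and the eigenvector is obtained with a dense eigen-solver instead of the power method.

```python
import numpy as np

def optimal_policy(P, c):
    """Optimal LMDP kernel for state cost c via the top eigenvector of diag(e^{-c}) P."""
    w, V = np.linalg.eig(np.exp(-c)[:, None] * P)
    z = np.abs(np.real(V[:, np.argmax(np.real(w))]))
    Q = P * z[None, :]
    return Q / Q.sum(axis=1, keepdims=True)

def follow_the_leader(P, costs, x0=0, seed=0):
    """Run follow-the-leader on costs of shape (T, n); returns the total cost incurred."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    c_bar = np.zeros(n)                      # average cost so far, zero initially
    x, total = x0, 0.0
    for t, c_t in enumerate(costs, start=1):
        Q = optimal_policy(P, c_bar)         # leader for the past average cost
        q, p = Q[x], P[x]
        m = q > 0
        total += c_t[x] + np.sum(q[m] * np.log(q[m] / p[m]))  # state + control cost
        x = rng.choice(n, p=q)               # draw the next state from Q(.|x)
        c_bar += (c_t - c_bar) / t           # running average now includes c_t
    return float(total)

# Hypothetical passive dynamics and an oblivious cost sequence in [0, 1].
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
costs = np.random.default_rng(1).uniform(0.0, 1.0, size=(200, 3))
total = follow_the_leader(P, costs)
```

Note that the simulation uses no step size or other tuning parameter: the only state carried between rounds is the running average of the observed cost vectors.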
Now we present our main result. First, we state two key assumptions about the underlying passive dynamics; both of these assumptions are also made by Guan et al. (2014).
Assumption 1
The passive dynamics $P$ is irreducible and aperiodic. In particular, there exists a natural number $H > 0$ such that $(P^n)(y|x) > 0$ for all $n \ge H$ and all $x, y \in \mathcal{X}$. We will refer to $H$ as the (worst-case) hitting time.
Assumption 2
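The passive dynamics $P$ is ergodic in the sense that its Markov–Dobrushin ergodicity coefficient is strictly less than $1$:
$$ \alpha(P) = \max_{x, y \in \mathcal{X}} \frac{1}{2} \bigl\| P(\cdot|x) - P(\cdot|y) \bigr\|_1 < 1. $$
A standard consequence (see, e.g., Seneta 2006) of Assumption 2 is that the passive dynamics mixes quickly: for any distributions $\mu, \mu' \in \Delta(\mathcal{X})$, we have
$$ \bigl\| (\mu - \mu')^\top P \bigr\|_1 \le \alpha(P)\, \| \mu - \mu' \|_1 . $$
We will sometimes refer to $\tau(P) = \bigl(\log(1/\alpha(P))\bigr)^{-1}$ as the mixing time associated with $P$. Now we are ready to state our main result: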
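For intuition about the ergodicity assumption, the sketch below computes the Markov–Dobrushin coefficient in its standard form, the largest total-variation distance between two rows of the kernel, and checks the one-step contraction of the $\ell_1$ distance between two distributions. The example kernel, the random distributions and the function name are hypothetical; the definition used here is the textbook one and is stated as an assumption about the intended notation.

```python
import numpy as np

def dobrushin_coefficient(P):
    """max over state pairs of the total-variation distance between rows of P."""
    diffs = np.abs(P[:, None, :] - P[None, :, :]).sum(axis=2)  # pairwise l1 distances
    return 0.5 * float(diffs.max())

# Hypothetical passive dynamics.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
alpha = dobrushin_coefficient(P)

# Two arbitrary distributions over the states.
rng = np.random.default_rng(0)
mu, nu = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))

lhs = np.abs((mu - nu) @ P).sum()      # distance after one step of the chain
rhs = alpha * np.abs(mu - nu).sum()    # contracted initial distance
```

A coefficient strictly below one means that every step of the chain shrinks the distance between any two state distributions by at least that factor, which is what drives the mixing arguments in the analysis.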
Theorem 1
The asymptotic notation used in the theorem hides a number of factors that depend only on the passive dynamics. In particular, the bound scales polynomially with the worst-case mixing time of any optimal policy, and shows no explicit dependence on the number of states. (Footnote 4: Of course, the mixing time does depend on the size of the state space in general.) We explicitly state the bound at the end of the proof presented in the next section as Equation (8), when all terms are formally defined.
5 Analysis
In this section, we provide a series of lemmas paving the way towards proving Theorem 1. The attentive reader may find some of these lemmas familiar from related work: indeed, we build on several technical results from Even-Dar et al. (2009), Neu et al. (2014) and Guan et al. (2014). Our main technical contribution is an efficient combination of these tools that enables us to go well beyond the best known bounds for our problem, proved by Guan et al. (2014). Throughout the section, we will assume that the conditions of Theorem 1 are satisfied.
Before diving into the analysis, we state some technical results that we will use several times. We defer all proofs to Appendix B. First, we present some important facts regarding LMDPs with bounded state-costs. In particular, we define $Q^*_c$ as the optimal policy with respect to an arbitrary state-cost function $c$ and let $\mathcal{C}$ be the set of all state-costs bounded in $[0,1]$. We define $\mathcal{Q}^*$ as the set of optimal policies induced by state-cost functions in $\mathcal{C}$: $\mathcal{Q}^* = \{ Q^*_c : c \in \mathcal{C} \}$. Observe that $Q_t \in \mathcal{Q}^*$ for all $t$, as $c_t \in \mathcal{C}$ and $\bar{c}_t \in \mathcal{C}$ for all $t$. Below, we give several useful results concerning policies in $\mathcal{Q}^*$. For stating these results, let $c \in \mathcal{C}$ and $Q = Q^*_c$. We first note that the average cost $\lambda$ of $Q$ is bounded in $[0,1]$: By the Perron–Frobenius theorem (see, e.g., Meyer, 2000, Chapter 8), we have that the largest eigenvalue $e^{-\lambda}$ of $GP$ is bounded by the maximal and minimal row sums of $GP$: $\min_x e^{-c(x)} \le e^{-\lambda} \le \max_x e^{-c(x)}$, which translates to having $\lambda \in [0,1]$ under our assumptions. The next key result bounds the value functions and the control costs in terms of the hitting time:
Lemma 2
For all and , the value functions satisfy . Furthermore, all policies satisfy
The proof is loosely based on ideas from Bartlett and Tewari (2009). The second statement guarantees that the mixing time is finite for all policies in :
Lemma 3
The Markov–Dobrushin coefficient of any policy is bounded as
The proof builds on the previous lemma and uses standard ideas from Markov-chain theory. In what follows, we will use $\tau$ and $\alpha$ to denote the worst-case mixing time and ergodicity coefficient, respectively. With this notation, we can state the following lemma that establishes that the value functions are Lipschitz with respect to the state-cost function. For stating and proving the result, it is useful to define the span seminorm $\|v\|_s = \max_x v(x) - \min_x v(x)$. Note that it is easy to show that $\|\cdot\|_s$ is indeed a seminorm, as it satisfies all the requirements to be a norm except that it maps all constant vectors (and not just zero) to zero.
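The span seminorm is simple to compute, and a short numerical check makes its defining properties tangible. The sketch below uses the standard definition, the maximum entry minus the minimum entry; the vectors `u` and `v` and the function name `span` are arbitrary choices for illustration.

```python
import numpy as np

def span(v):
    """Span seminorm: largest entry minus smallest entry."""
    v = np.asarray(v, dtype=float)
    return float(v.max() - v.min())

v = np.array([0.2, -1.0, 3.5])
u = np.array([1.0, 4.0, -2.0])

span_v = span(v)              # 3.5 - (-1.0)
span_shifted = span(v + 7.0)  # adding a constant vector leaves the span unchanged
span_const = span(np.full(3, 7.0))  # constant vectors are mapped to zero
```

Because value functions in the average-cost setting are only determined up to an additive constant, a quantity that ignores constant shifts is a natural way to measure how far apart two value functions are.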
Lemma 4
Let $c$ and $c'$ be two state-cost functions taking values in the interval $[0,1]$ and let $v$ and $v'$ be the corresponding optimal value functions. Then,
The proof roughly follows the proof of Proposition 3 of Guan et al. (2014), with the slight difference that we make the constant factor in the bound explicit. A consequence of this result is our final key lemma in this section, which is what actually makes our fast rates possible: a bound on the rate of change of the policies chosen by the algorithm.
Lemma 5
.
The proof is based on ideas by Guan et al. (2014). As for the proof of Theorem 1, we follow the path of Even-Dar et al. (2009), Neu et al. (2014) and Guan et al. (2014), and first analyze the idealized setting where the learner is allowed to directly pick stationary distributions instead of policies. Then, we show how to relate the idealized regret of FTL to its true regret in the original problem.
5.1 Regret in the idealized OCO problem
Let us now consider the idealized online convex optimization problem described at the end of Section 3. In this setting, our algorithm can be formally stated as choosing the stationary transition measure . This view enables us to follow a standard proof technique for analyzing online convex optimization algorithms, going back to at least Merhav and Feder (1992). The first ingredient of our proof is the so-called “follow-the-leader/be-the-leader” lemma (Cesa-Bianchi and Lugosi, 2006, Lemma 3.1):
Lemma 6
$\sum_{t=1}^{T} f(\pi_{t+1}; c_t) \le \min_{\pi \in \mathcal{S}} \sum_{t=1}^{T} f(\pi; c_t)$.
The second step exploits the bound on the change rate of the policies to show that looking one step into the future does not buy much advantage. Note however that controlling the change rate is not sufficient by itself, as our loss functions are effectively unbounded.
Lemma 7
.
In the interest of space, we only provide a proof sketch here and defer the full proof to Appendix B.5.
Proof sketch Let us define . By exploiting the affinity of in its second argument, we can start by proving . Furthermore, by using the form of the optimal policy given in Eq. (5) and the form of given in Eq. (3), we can obtain
The first term can be bounded by a simple argument (see, e.g., Lemma 4 of Neu et al. 2014) that leads to
Now, the first factor can be bounded by and the second by appealing to Lemma 5. The proof is concluded by plugging the above bounds into Equation (12), using , summing up both sides, and noting that .
Putting Lemmas 6 and 7 together, we obtain the following bound on the idealized regret of FTL:
Lemma 8
.
5.2 Regret in the reactive setting
We first show that the advantage of the true best policy over our final policy is bounded.
Lemma 9
Let be the smallest nonzero transition probability under the passive dynamics and . Then, .
The proof follows from applying Lemma 1 from Neu et al. (2014) and observing that holds for all . It remains to relate the total cost of FTL to the total idealized cost of the algorithm. This is done in the following lemma:
Lemma 10
.
Proof Let . Similarly to the proof of Lemma 7, we rewrite using Equation (11) to obtain
where the last step uses and . Now, noticing that , we obtain
where the last inequality uses Lemma 4 and to bound the first term. By Lemma 4, this last term can be bounded by , where is the value function corresponding to the all-zero state-cost function.
In the rest of the proof, we are going to prove the inequality
(6) 
It is easy to see that this trivially holds for , so we will assume that the contrary holds in the following derivations. To prove Equation (6) for larger values of , we can follow the proofs of Lemma 5 of Neu et al. (2014) or Lemma 5.2 of Even-Dar et al. (2009) to obtain
(7) 
For completeness, we include a proof in Appendix B.6. For bounding the last term, we split the sum at :
where the first inequality follows from bounding the factors by and , respectively, and bounding the sums by the full geometric sums. The second-to-last inequality follows from our assumption that . That is, we have successfully proved Equation (6). Now the statement of the lemma follows from summing up for all and noting that and .
Now the proof of Theorem 1 follows easily from combining the bounds of Lemmas 8–10. The result is
(8) 
Thus, we can see that the bound indeed demonstrates a polynomial dependence on the mixing time , and depends logarithmically on the smallest nonzero transition probability via .
6 Discussion
In this paper, we have shown that, besides the well-established computational advantages, linearly solvable MDPs also admit a remarkable information-theoretic advantage: fast learnability in the online setting. In particular, we show that a regret of $O(\log^2 T)$ is achievable by the simple algorithm of following the leader, thus greatly improving on the best previously known regret bounds of $O(T^{3/4+\epsilon})$. At first sight, our improvement may appear dramatic: in their paper, Guan et al. (2014) pose the possibility of improving their bounds to $O(\sqrt{T})$ as an important open question (Sec. VII). In light of our results, these conjectured improvements are also grossly suboptimal. On the other hand, our new results can also be seen to complement well-known results on fast rates in online learning (see, e.g., van Erven et al. 2015 for an excellent summary). Indeed, our learning setting can be seen as a generalized variant of sequential prediction under the relative-entropy loss (see, e.g., Cesa-Bianchi and Lugosi, 2006, Sec. 3.6), which is known to be exp-concave. Such exp-concave losses are well studied in the online learning literature, and are known to allow logarithmic regret bounds (Kivinen and Warmuth, 1999; Hazan et al., 2007).
Inspired by these related results, we ask the question: Is the loss function defined in Section 2.2 exp-concave? While our derivations in Appendix A.1 indicate that it has curvature in certain directions, we were not able to prove its exp-concavity. Similarly to the approach of Merhav and Feder (1992), our analysis in the current paper merely exploits the Lipschitzness of the optimal policies with respect to the cost functions, but otherwise does not explicitly make use of the curvature of the loss. We hope that our work presented in this paper will inspire future studies that will clarify the exact role of the LMDP structure in efficient online learnability, potentially also leading to a better understanding of policy gradient algorithms for LMDPs (Todorov, 2010).
Finally, let us comment on the tightness of our bounds. Regardless of whether the loss function is exp-concave or not, we are almost certain that our rates can be improved to at least $O(\log T)$ by using a more sophisticated algorithm. While our focus in this paper was on improving the asymptotic regret guarantees, we also slightly improve on the results of Guan et al. (2014) in that we make the leading constants more explicit. However, we expect that the dependence on these constants may also be improved in future work. Note however that the potential looseness of our bounds does not impact the performance of the algorithm itself, as it never makes use of any problem-dependent constants.
Acknowledgements
This work was supported by the UPFellows Fellowship (Marie Curie COFUND program no. 600387) and the Ramon y Cajal program RYC-2015-18878 (AEI/MINEICO/FSE, UE). The authors wish to thank the three anonymous reviewers for their valuable comments that helped to improve the paper.
References
 AbbasiYadkori et al. (2015) Y. AbbasiYadkori, P. L. Bartlett, X. Chen, and A. Malek. Largescale Markov decision problems with KL control cost and its application to crowdsourcing. In 32nd International Conference on Machine Learning (ICML) 2015, pages 1053–1062, 2015.
 AbbasiYadkori and Szepesvári (2011) Y. AbbasiYadkori and Cs. Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In COLT, 2011.
 AbbasiYadkori et al. (2014) Y. AbbasiYadkori, P. Bartlett, and V. Kanade. Tracking adversarial targets. In ICML 2014, pages 369–377, 2014.
 Ariki et al. (2016) Y. Ariki, T. Matsubara, and S. H. Hyon. Latent KullbackLeibler control for dynamic imitation learning of wholebody behaviors in humanoid robots. In 2016 IEEERAS 16th International Conference on Humanoid Robots (Humanoids), pages 946–951, 2016.
 Bartlett and Tewari (2009) P. L. Bartlett and A. Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In UAI 2009, 2009.
 Bertsekas (2007) D. P. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, Belmont, MA, 3 edition, 2007.
 Bertsekas and Tsitsiklis (1996) D. P. Bertsekas and J. N. Tsitsiklis. NeuroDynamic Programming. Athena Scientific, Belmont, MA, 1996.
 CesaBianchi and Lugosi (2006) N. CesaBianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
 de Rooij et al. (2014) S. de Rooij, T. van Erven, P. D. Grünwald, and W. M. Koolen. Follow the leader if you can, hedge if you must. Accepted to the Journal of Machine Learning Research, 2014.
 Dick et al. (2014) T. Dick, A. György, and Cs. Szepesvári. Online learning in markov decision processes with changing cost sequences. In ICML 2014, 2014.
 Dvijotham and Todorov (2010) K. Dvijotham and E. Todorov. Inverse optimal control with linearlysolvable mdps. In Proceedings of the 27th International Conference on Machine Learning (ICML10), pages 335–342, 2010.
 EvenDar et al. (2009) E. EvenDar, S. M. Kakade, and Y. Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
 Gómez et al. (2014) V. Gómez, H. J. Kappen, J. Peters, and G. Neumann. Policy search for path integral control. European Conference on Machine Learning and Knowledge Discovery in Databases, 8724 LNAI(PART 1):482–497, 2014.
 Gómez et al. (2016) V. Gómez, S. Thijssen, A. C. Symington, S. Hailes, and H. J. Kappen. Realtime stochastic optimal control for multiagent quadrotor systems. In 26th International Conference on Automated Planning and Scheduling, 2016.
 Guan et al. (2014) P. Guan, M. Raginsky, and R. M. Willett. Online markov decision processes with kullback–leibler control cost. Automatic Control, IEEE Transactions on, 59(6):1423–1438, 2014.
 Hazan (2011) E. Hazan. The convex optimization approach to regret minimization. In S. Sra, S. Nowozin, and S. Wright, editors, Optimization for Machine Learning, pages 287–303. MIT press, 2011.
 Hazan et al. (2007) E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169–192, 2007.
 Hazan (2016) E. Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(34):157–325, 2016.
 Jaksch et al. (2010) T. Jaksch, R. Ortner, and P. Auer. Nearoptimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 99:1563–1600, August 2010. ISSN 15324435.
 Kappen (2005) H. J. Kappen. Linear theory for control of nonlinear stochastic systems. Physical Review Letters, 95(20):200201, 2005.
 Kappen et al. (2012) H. J. Kappen, V. Gómez, and M. Opper. Optimal control as a graphical model inference problem. Machine Learning, 87(2):159–182, 2012.
 Kinjo et al. (2013) K. Kinjo, E. Uchibe, and K. Doya. Evaluation of linearly solvable Markov decision process with dynamic model learning in a mobile robot navigation task. Frontiers in Neurorobotics, 7:1–13, 2013.
 Kivinen and Warmuth (1999) J. Kivinen and M. Warmuth. Averaging expert predictions. In Proceedings of the Fourth European Conference on Computational Learning Theory, pages 153–167. Lecture Notes in Artificial Intelligence, Vol. 1572. Springer, 1999.
 Kotłowski (2016) W. Kotłowski. On minimaxity of follow the leader strategy in the stochastic setting. In International Conference on Algorithmic Learning Theory, pages 261–275, 2016.
 Matsubara et al. (2014) T. Matsubara, V. Gómez, and H. J. Kappen. Latent Kullback-Leibler control for continuous-state systems using probabilistic graphical models. In 30th Conference on Uncertainty in Artificial Intelligence (UAI), 2014.
 Merhav and Feder (1992) N. Merhav and M. Feder. Universal sequential learning and decision from individual data sequences. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory. ACM Press, 1992.
 Meyer (2000) C. D. Meyer. Matrix analysis and applied linear algebra, volume 2. SIAM, 2000.
 Neu et al. (2010) G. Neu, A. György, and Cs. Szepesvári. The online loop-free stochastic shortest-path problem. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), pages 231–243, 2010.
 Neu et al. (2012) G. Neu, A. György, and Cs. Szepesvári. The adversarial stochastic shortest path problem with unknown transition probabilities. In AISTATS 2012, pages 805–813, 2012.
 Neu et al. (2014) G. Neu, A. György, Cs. Szepesvári, and A. Antos. Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59:676–691, 2014.
 Puterman (1994) M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, April 1994.
 Rombokas et al. (2013) E. Rombokas, M. Malhotra, E. A. Theodorou, E. Todorov, and Y. Matsuoka. Reinforcement learning and synergistic control of the ACT hand. IEEE/ASME Transactions on Mechatronics, 18(2):569–577, 2013.
 Sani et al. (2014) A. Sani, G. Neu, and A. Lazaric. Exploiting easy data in online optimization. In NIPS-27, pages 810–818, 2014.
 Seneta (2006) E. Seneta. Nonnegative matrices and Markov chains. Springer Science & Business Media, 2006.
 Shalev-Shwartz (2012) S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
 Sutton and Barto (1998) R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
 Szepesvári (2010) Cs. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
 Thalmeier et al. (2017) D. Thalmeier, V. Gómez, and H. J. Kappen. Action selection in growing state spaces: control of network structure growth. Journal of Physics A: Mathematical and Theoretical, 50(3):034006, 2017.
 Theodorou et al. (2010) E. Theodorou, J. Buchli, and S. Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11:3137–3181, 2010.
 Todorov (2006) E. Todorov. Linearly-solvable Markov decision problems. In NIPS-18, pages 1369–1376, 2006. ISBN 0262232537.
 Todorov (2008) E. Todorov. General duality between optimal control and estimation. In 47th IEEE Conference on Decision and Control (CDC 2008), pages 4286–4292. IEEE, 2008.
 Todorov (2009) E. Todorov. Compositionality of optimal control laws. In NIPS-22, pages 1856–1864, 2009.
 Todorov (2010) E. Todorov. Policy gradients in linearly-solvable MDPs. In NIPS-23, pages 2298–2306. Curran Associates, 2010.
 van Erven et al. (2015) T. van Erven, P. D. Grünwald, N. A. Mehta, M. D. Reid, and R. C. Williamson. Fast rates in statistical and online learning. Journal of Machine Learning Research, 16:1793–1861, 2015.
 Williams et al. (2016) G. Williams, P. Drews, B. Goldfain, J. M. Rehg, and E. A. Theodorou. Aggressive driving with model predictive path integral control. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 1433–1440, May 2016. doi: 10.1109/ICRA.2016.7487277.
 Yu et al. (2009) J. Y. Yu, S. Mannor, and N. Shimkin. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737–757, 2009.
 Zimin and Neu (2013) A. Zimin and G. Neu. Online learning in episodic Markovian decision processes by relative entropy policy search. In NIPS-26, pages 1583–1591, 2013.
Appendix A The convex optimization view of optimal control in LMDPs
This section summarizes some facts regarding the convex optimization formulation of Section 2.2. We first show that the negative conditional entropy constituting the only nonlinear term in the objective is convex.
A.1 The convexity of the negative conditional entropy
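Let us consider a joint probability distribution $p$ on the finite set $\mathcal{X}\times\mathcal{Y}$. We denote $p(x) = \sum_{y} p(x,y)$ and $p(y|x) = p(x,y)/p(x)$. We study the negative conditional entropy of $(X,Y)\sim p$ as a function of $p$:
\[
R(p) = \sum_{x,y} p(x,y) \log\frac{p(x,y)}{p(x)} = \sum_{x,y} p(x,y) \log p(y|x).
\]
We will study the Bregman divergence $B_R$ corresponding to $R$, defined for two joint distributions $p$ and $q$ as
\[
B_R(p\|q) = R(p) - R(q) - \nabla R(q)^\top (p - q).
\]
Our aim is to show that $B_R$ is nonnegative, which will imply the convexity of $R$.
We begin by computing the partial derivative of $R(p)$ with respect to $p(x,y)$:
\[
\frac{\partial R(p)}{\partial p(x,y)} = \log p(x,y) + 1 - \log p(x) - \sum_{y'} \frac{p(x,y')}{p(x)} = \log p(y|x),
\]
where we used the fact that $\partial p(x)/\partial p(x,y) = 1$ for all $y$. With this expression, we have
\[
\nabla R(q)^\top (p - q) = \sum_{x,y} \bigl(p(x,y) - q(x,y)\bigr) \log q(y|x) = \sum_{x,y} p(x,y) \log q(y|x) - R(q).
\]
Thus, the Bregman divergence takes the form
\[
B_R(p\|q) = \sum_{x,y} p(x,y) \log\frac{p(y|x)}{q(y|x)} = \sum_{x} p(x)\, D\bigl(p(\cdot|x)\,\big\|\,q(\cdot|x)\bigr) \ge \frac{1}{2} \sum_{x} p(x) \bigl\|p(\cdot|x) - q(\cdot|x)\bigr\|_1^2,
\]
where $D(\cdot\|\cdot)$ denotes the Kullback–Leibler divergence and the last step follows from Pinsker’s inequality. Thus, we have shown that the Bregman divergence is nonnegative on the set of joint distributions over $\mathcal{X}\times\mathcal{Y}$, proving that $R$ is convex.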
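The identity behind this argument can be checked numerically. The sketch below is only an illustration under assumed notation: it writes `p` and `q` for two strictly positive joint distributions, takes the gradient of the negative conditional entropy at `q` to be the log-conditional `log q(y|x)`, and compares the resulting Bregman divergence with the weighted sum of conditional KL divergences and with the Pinsker lower bound. All function names are hypothetical helpers, not part of the paper.

```python
import math
import random


def random_joint(nx, ny, rng):
    """Draw a strictly positive joint distribution on an nx-by-ny grid."""
    w = [[rng.random() + 0.1 for _ in range(ny)] for _ in range(nx)]
    total = sum(map(sum, w))
    return [[v / total for v in row] for row in w]


def conditionals(p):
    """Return the marginals p(x) and the conditionals p(y|x)."""
    marg = [sum(row) for row in p]
    cond = [[v / m for v in row] for row, m in zip(p, marg)]
    return marg, cond


def neg_cond_entropy(p):
    """Negative conditional entropy: sum_{x,y} p(x,y) log p(y|x)."""
    _, cond = conditionals(p)
    return sum(v * math.log(c)
               for row, crow in zip(p, cond)
               for v, c in zip(row, crow))


def bregman(p, q):
    """Bregman divergence of neg_cond_entropy, using gradient log q(y|x)."""
    _, qcond = conditionals(q)
    inner = sum((pv - qv) * math.log(qc)
                for prow, qrow, qcrow in zip(p, q, qcond)
                for pv, qv, qc in zip(prow, qrow, qcrow))
    return neg_cond_entropy(p) - neg_cond_entropy(q) - inner


def weighted_kl_and_pinsker(p, q):
    """Return sum_x p(x) KL(p(.|x) || q(.|x)) and its Pinsker lower bound."""
    pm, pc = conditionals(p)
    _, qc = conditionals(q)
    kl = sum(m * sum(a * math.log(a / b) for a, b in zip(ra, rb))
             for m, ra, rb in zip(pm, pc, qc))
    lower = sum(0.5 * m * sum(abs(a - b) for a, b in zip(ra, rb)) ** 2
                for m, ra, rb in zip(pm, pc, qc))
    return kl, lower


# Compare the three quantities on randomly drawn pairs of distributions.
rng = random.Random(0)
results = []
for _ in range(200):
    p, q = random_joint(3, 4, rng), random_joint(3, 4, rng)
    results.append((bregman(p, q),) + weighted_kl_and_pinsker(p, q))
```

Each entry of `results` holds the Bregman divergence, the weighted KL sum, and the Pinsker bound for one random pair. The first two should agree up to floating-point error, and the third should never exceed them.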
A.2 Derivation of the optimal control
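Here, we give an alternative derivation of the optimal control given in Section 2.1 based on the optimization problem $\min_{\pi} f(\pi; c)$ for an arbitrary bounded state-cost function $c$. As a reminder, $f$ is given by
\[
f(\pi; c) = \sum_{x,x'} \pi(x,x') \left( c(x) + \log\frac{\pi(x,x')}{P(x'|x) \sum_{y} \pi(x,y)} \right)
\]
and the feasible set is given by the following convex constraints:
\[
\begin{aligned}
\sum_{x'} \pi(x,x') &= \sum_{x''} \pi(x'',x) && \forall x, \\
\sum_{x,x'} \pi(x,x') &= 1, \\
\pi(x,x') &\ge 0 && \forall x,x', \\
\pi(x,x') &= 0 && \forall x,x' : P(x'|x) = 0.
\end{aligned}
\]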
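To make the formulation concrete, the following sketch builds a feasible point from a control and evaluates the objective on it. It rests on assumptions not spelled out in this appendix: the objective is taken to be the state cost plus the KL divergence between the control `Q` and the passive dynamics `P`, and the stationary state-transition distribution is taken to be `pi(x, x') = mu(x) * Q(x'|x)`, with `mu` the stationary distribution of `Q`. The 3-state matrices and the helper names are hypothetical.

```python
import math


def stationary(Q, iters=5000):
    """Approximate the stationary distribution of Q by power iteration."""
    n = len(Q)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[x] * Q[x][y] for x in range(n)) for y in range(n)]
    return mu


def objective(pi, P, c):
    """Sum over (x, x') of pi * (c(x) + log(pi / (pi's x-marginal * P)))."""
    total = 0.0
    for x, row in enumerate(pi):
        marg = sum(row)
        for y, v in enumerate(row):
            if v > 0.0:  # terms with pi(x, x') = 0 contribute nothing
                total += v * (c[x] + math.log(v / (marg * P[x][y])))
    return total


# Hypothetical example: passive dynamics P, control Q, state costs c.
# Q puts zero mass wherever P does, matching the support constraint.
P = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
Q = [[0.8, 0.2, 0.0], [0.6, 0.3, 0.1], [0.0, 0.7, 0.3]]
c = [0.0, 1.0, 2.0]
n = len(c)

mu = stationary(Q)
pi = [[mu[x] * Q[x][y] for y in range(n)] for x in range(n)]

# Residuals of the stationarity and normalization constraints.
flow_gap = max(abs(sum(pi[x]) - sum(pi[z][x] for z in range(n)))
               for x in range(n))
mass_gap = abs(sum(map(sum, pi)) - 1.0)

# Average cost of Q: state cost plus KL(Q(.|x) || P(.|x)), weighted by mu.
avg_cost = sum(
    mu[x] * (c[x] + sum(q * math.log(q / p)
                        for q, p in zip(Q[x], P[x]) if q > 0.0))
    for x in range(n)
)
value = objective(pi, P, c)
```

Under these assumptions, `flow_gap` and `mass_gap` should be numerically zero, and `value` should match `avg_cost`, so that minimizing the objective over the feasible set amounts to minimizing the average cost over stationary controls.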
We begin by slightly adjusting the definition of $f$ for it to become a barrier function: we set $f(\pi; c) = \infty$ for all $\pi$ not satisfying the last two constraints. It is easy to see that this adjustment does not change the optimum of $f$, but it helps to get rid of the inequality constraints. Thus, with this form of $f$, we can characterize the optimum of $f$ using the technique of Lagrange multipliers. (Alternatively, one could introduce KKT multipliers for all constraints and eliminate the last two by complementary slackness, which yields the same characterization.)
Precisely, we introduce a Lagrange multiplier $V(x)$ for every $x$ to enforce the first constraint and $\lambda$ to enforce the second one, and write the Lagrangian as