Mirror Descent Search and its Acceleration

The research was partially supported by JSPS KAKENHI (Grant numbers JP26120005, JP16H03219, and JP17K12737).

Megumi Miyashita, Shiro Yano, Toshiyuki Kondo
Dept. of Computer and Information Sciences, Graduate School of Engineering, Tokyo University of Agriculture and Technology, Tokyo, Japan
Division of Advanced Information Technology and Computer Science, Institute of Engineering, Tokyo University of Agriculture and Technology, Tokyo, Japan
Abstract

In recent years, attention has been focused on the relationship between black-box optimization problems and reinforcement learning problems. In this research, we propose the Mirror Descent Search (MDS) algorithm, which is applicable to both black-box optimization problems and reinforcement learning problems. Our method is based on the mirror descent method, which is a general optimization algorithm. The contribution of this research is roughly twofold. We propose two essential algorithms, called MDS and Accelerated Mirror Descent Search (AMDS), and two approximate algorithms: Gaussian Mirror Descent Search (G-MDS) and Gaussian Accelerated Mirror Descent Search (G-AMDS). This research shows that advanced methods developed in the context of mirror descent research can be applied to reinforcement learning problems. We also clarify the relationship between an existing reinforcement learning algorithm and our method. With two evaluation experiments, we show that our proposed algorithms converge faster than some state-of-the-art methods.

keywords:
Reinforcement Learning, Mirror Descent, Bregman Divergence, Accelerated Mirror Descent, Policy Improvement with Path Integrals
journal: Robotics and Autonomous Systems

1 Introduction

The similarity between black-box optimization problems and reinforcement learning (RL) problems has inspired recent researchers to develop novel RL algorithms Stulp2013Policy (); Hwangbo2014ROCKEfficient (); Salimans2017Evolution (). The objective of a black-box optimization problem is to find the input that optimizes an unknown function. Because the objective function is unknown, we usually solve the black-box optimization problem without gradient information. The same holds for RL problems. The objective of an RL problem is to find the optimal policy that maximizes the expected cumulative reward Sutton1998Reinforcement (). As in a black-box optimization problem, the agent initially does not know the problem formulation, so it must cope with this lack of information. In this research, we propose RL algorithms from the standpoint of black-box optimization.

RL algorithms are roughly categorized into value-based and policy-based methods. In a value-based method, the agent learns the value of taking an action in a given state. In a policy-based method, by contrast, the agent learns the policy directly from observations. Moreover, RL algorithms are divided into model-free and model-based approaches. In a model-based approach, the agent first learns a model of the system from samples, and then learns the policy or value function using that model. In a model-free approach, the agent learns the policy or value function without such a model. RL algorithms usually assume that the behavior of the environment is well approximated by a Markov decision process (MDP).

Recently, KL-divergence regularization has come to play a key role in policy-search algorithms. The KL divergence is one of the essential metrics between two distributions. Past methods Schulman2015Trust (); Peters2010Relative (); Abdolmaleki2017Deriving (); Abdolmaleki2015Model (); Zimin2013Online (); Daniel2012Hierarchical () employ KL-divergence regularization to keep a new distribution suitably close to a referential distribution. It is important to note that there exist two types of KL divergence: the KL and the reverse-KL (RKL) divergence Bishop2006Pattern (); nowozin2016f (). The past researches mentioned above are clearly divided into algorithms with KL divergence Schulman2015Trust (); Abdolmaleki2017Deriving () and algorithms with RKL divergence Peters2010Relative (); Abdolmaleki2015Model (); Zimin2013Online (); Daniel2012Hierarchical (). We review these algorithms in detail afterward.

The Bregman divergence is a general metric that includes both the KL and the RKL divergence Amari2009alpha () (see A). Moreover, it includes the Euclidean distance, the Mahalanobis distance, the Hellinger distance, and so on. The Mirror Descent (MD) algorithm employs the Bregman divergence to regularize the learning steps of the decision variables; it includes a variety of gradient methods Bubeck2015Convex (). Accelerated mirror descent Krichene2015Accelerated () is one of the recent advances applicable to MD algorithms universally.

In this study, we propose four reinforcement learning algorithms on the basis of the MD method. The proposed algorithms can be applied in non-MDP settings. We propose two essential algorithms, mirror descent search (MDS) and accelerated mirror descent search (AMDS), and two approximate versions of them, Gaussian mirror descent search (G-MDS) and Gaussian accelerated mirror descent search (G-AMDS). G-AMDS showed significant improvement in convergence speed and optimality in two benchmark problems. If other existing reinforcement learning algorithms can be reformulated in the MDS form, they would also benefit from the acceleration. We also clarify the relationship between existing reinforcement learning algorithms and our method. As an example, we show the relationship between MDS and Policy Improvement with Path Integrals (PI2) Theodorou2010generalized (); Theodorou2010Reinforcement () in section 5.

2 Related Works

This section proceeds in the following order. First, we introduce the concepts of the KL and RKL divergences. Then we review the two types of RL algorithms: RL with KL divergence Schulman2015Trust (); Abdolmaleki2017Deriving () and RL with RKL divergence Peters2010Relative (); Abdolmaleki2015Model (); Zimin2013Online (); Daniel2012Hierarchical (). We also review the RL algorithm PI2; we show the relation between PI2 and our method afterward. We conclude this section with a comment on other MD-based RL algorithms.

The KL divergence between x and x′ is represented as follows:

 KL(x, x′) = Σ_{j=1}^m x_j log(x_j / x′_j), x, x′ ∈ R^m, x_j, x′_j > 0. (1)

We call KL(x, x′) the Kullback-Leibler divergence under the condition that we determine x by reference to the fixed x′; we call KL(x′, x) the reverse-KL divergence nowozin2016f (). The Bregman divergence includes both the KL and RKL divergences Amari2009alpha (), so we expect it to provide a unified formulation of the above-mentioned algorithms.
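The asymmetry between the two directions can be seen numerically. The following sketch, with hypothetical 3-point distributions, evaluates Eq. (1) in both directions; in general the two values differ.

```python
import numpy as np

def kl(x, xp):
    """Kullback-Leibler divergence KL(x, x') = sum_j x_j log(x_j / x'_j), Eq. (1)."""
    x, xp = np.asarray(x, float), np.asarray(xp, float)
    return float(np.sum(x * np.log(x / xp)))

# Hypothetical distributions on a 3-point simplex.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

# The divergence is asymmetric: KL(p, q) and the reverse direction KL(q, p)
# are both non-negative but generally unequal.
print(kl(p, q), kl(q, p))
```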

Let us introduce the RKL-based RL algorithms. Relative Entropy Policy Search (REPS) Peters2010Relative () is one of the pioneering algorithms focusing on the information loss during the policy-search process. The information loss is defined as the relative entropy, also known as the RKL divergence, between the old policy and the new policy. The new policy is determined under an upper-bound constraint on the RKL divergence. Episode-based REPS also bounds the information loss with regard to the upper-level policy Daniel2012Hierarchical (); it was proposed as an episode-based extension of REPS. The paper Zimin2013Online () discussed the similarity between episode-based REPS and the proximal point algorithm, and proposed the Online-REPS algorithm as a theoretically guaranteed one. MOdel-based Relative Entropy stochastic search (MORE) also employs the RKL divergence Abdolmaleki2015Model (); it extends episode-based REPS to a model-based RL algorithm. All of these algorithms employ the RKL divergence in their formulation.

There are also methods employing the KL divergence. Trust Region Policy Optimization (TRPO) Schulman2015Trust (), one of the most suitable algorithms for deep reinforcement learning problems, updates the policy parameters under a KL-divergence bound. The research Abdolmaleki2017Deriving () showed that the KL divergence between policies plays a key role in deriving the well-known heuristic algorithm Covariance Matrix Adaptation Evolution Strategy (CMA-ES) Hansen2001Completely (); the authors named their method Trust-Region Covariance Matrix Adaptation Evolution Strategy (TR-CMA-ES). TR-CMA-ES is similar to episode-based REPS but uses the KL divergence. The Proximal Policy Optimization (PPO) algorithm also introduces the KL divergence in its penalized objective schulman2017proximal ().

PI2 Theodorou2010Reinforcement (); Theodorou2013information () is another RL algorithm worth mentioning. PI2 encouraged researchers Stulp2012Policy (); Hwangbo2014ROCKEfficient () to focus on the relationship between RL algorithms and black-box optimization. For example, Stulp2012Policy () proposes a reinforcement learning algorithm that combines PI2 with the black-box optimization algorithm CMA-ES. The authors of Theodorou2012Relative (); Theodorou2013information () discussed the connection between PI2 and KL control. We further discuss PI2 from the viewpoint of our proposed methods in section 5.

Previous studies also proposed reinforcement learning algorithms on the basis of the MD method Mahadevan2012Sparse (); Montgomery2016Guided (). Mirror Descent TD(λ) (MDTD) Mahadevan2012Sparse () is a value-based RL algorithm. The paper Mahadevan2012Sparse () employs the Minkowski distance on Euclidean space rather than the KL divergence. By contrast, we basically employ Bregman divergences on the simplex, i.e., a non-Euclidean space. Mirror Descent Guided Policy Search (MDGPS) Montgomery2016Guided () is also associated with our proposed method; it showed that a mirror descent formulation improves Guided Policy Search (GPS) Levine2013Guided (). A distinctive feature of MDGPS is that it depends on both the KL and the RKL divergence. However, as is shown in Krichene2015Efficient (), there is a variety of Bregman divergences on the simplex other than the KL and RKL divergences. Moreover, the choice of divergence plays an important role in accelerating mirror descent Krichene2015Accelerated (). So we explicitly use the Bregman divergence in this research.

3 Mirror Descent Search and Its Variants

3.1 Problem Statement

In this section, we mainly explain our algorithm as a method for the black-box optimization problem. Consider the problem of minimizing the original objective function J(ω) defined on a subspace Ω. We represent the decision variable by ω ∈ Ω. Rather than dealing with the decision variable ω directly, we consider a continuous probability density function p(ω) of ω. Let us introduce the probability space (Ω, F, P), where F is the σ-field of Ω and P is a probability measure over F.

In this paper, we introduce the continuous probability density function p(ω) as the alternative decision variable defined on the probability space. We also define the alternative objective function ~J as the expectation of the original objective function J(ω):

 ~J = ∫_Ω J(ω) p(ω) dω. (2)

Therefore, we search over the following domain:

 p(ω) ≥ 0, (3)
 ∫_Ω p(ω) dω = 1. (4)

Let us introduce the set P_all consisting of all probability density functions defined on the probability space. The optimal generative probability p∗(ω) is

 p∗(ω) = arg min_{p(ω)∈P_all} { ∫_Ω J(ω) p(ω) dω } = arg min_{p(ω)∈P_all} ~J. (5)

From the viewpoint of black-box optimization problems, the algorithm aims at obtaining the optimal decision variable p∗(ω) that optimizes the alternative objective function ~J. From the viewpoint of reinforcement learning problems, its purpose is to obtain the optimal policy that optimizes the reward. Next, we introduce an iterative algorithm that converges to the optimal solution.
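As a minimal illustration of the alternative objective in Eq. (2), the sketch below estimates the expectation of a hypothetical quadratic objective under a Gaussian density by Monte Carlo sampling; the concrete objective, mean, and variance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical original objective J(omega): a 1-D quadratic.
J = lambda w: (w - 2.0) ** 2

# Alternative decision variable: a Gaussian density p(omega).
mu, sigma = 0.0, 1.0
samples = rng.normal(mu, sigma, size=100_000)

# Alternative objective (Eq. 2): the expectation of J under p,
# estimated here by a sample average.
J_tilde = J(samples).mean()
print(J_tilde)  # analytically sigma^2 + (mu - 2)^2 = 5
```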

3.2 Mirror Descent Search and Gaussian-Mirror Descent Search

3.2.1 Mirror Descent Search (MDS)

The algorithm is divided into three steps, as shown in Fig. 1.

Discretizing Prior Distribution

To update the continuous probability density function p(ω), we need to discretize it by sampling, because we don't know the form of the objective function; we can only evaluate the objective value corresponding to each sample.

First, we discretize p(ω) based on sampling. For illustrative purposes, we assume here that we can get infinitely many samples θ_i. To satisfy the definition of a discrete probability distribution, we discretize the continuous distribution using the function q for the acquired samples Billingsley2008Probability ():

 q(θ_i) := lim_{Δθ→0} p(θ_i ≤ ω ≤ θ_i + Δθ) / Σ_{j=0}^∞ p(θ_j ≤ ω ≤ θ_j + Δθ), (1 ≤ i ≤ ∞), (6)
 Σ_{j=0}^∞ q(θ_j) = 1. (7)

With Eq. (6) and Eq. (7), our objective function ~J becomes the expectation of the original objective function:

 ~J=∞∑j=1J(θj)q(θj)=⟨J,q⟩, (8)

where

 q = [q_1, …] := [q(θ_1), …] ∈ Q, (9)
 J = [J_1, …] := [J(θ_1), …] ∈ R^∞. (10)

Updating by Mirror Descent

After discretizing the continuous distribution p(ω), we employ the mirror descent algorithm (B) to update the discretized distribution q:

 qk=arg minq∈Q{⟨∇q~J,q⟩+ηBϕ(q,qk−1)}, (11)

where η is the step size. We call q_{k−1} the prior distribution and q_k the posterior distribution. The domain of the decision variable is the simplex Q. B_φ is the Bregman divergence, which is induced by an arbitrary smooth convex function φ and is defined as

 Bϕ(x,x′)=ϕ(x)−ϕ(x′)−⟨∇ϕ(x′),x−x′⟩. (12)

There are numerous variations of the Bregman divergence on the simplex, such as the KL divergence and the Euclidean distance restricted to the simplex Krichene2015Efficient (). Moreover, the slightly perturbed KL divergence, first introduced in Krichene2015Efficient (), is another important divergence: it plays a key role in accelerating the convergence speed of mirror descent, as discussed in Krichene2015Accelerated () and in this paper.

Because ∇_q ~J = J, we finally obtain the convex optimization problem:

 qk=arg minq∈Q{⟨J,q⟩+ηBϕ(q,qk−1)}. (13)

Although we have assumed an infinite number of samples from p(ω), this works only in theory. In what follows, we approximate the distribution using a sufficiently large number of samples.
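When the Bregman divergence in Eq. (13) is chosen as the KL divergence, the minimizer takes the exponentiated-gradient form (cf. Eq. (27)). A minimal sketch over a finite sample set, with hypothetical objective values, can be written as follows.

```python
import numpy as np

def mds_step(q_prev, J_vals, eta):
    """One MDS update (Eq. 13) with KL(q, q_prev) as the Bregman divergence:
    the closed-form minimizer is q_k proportional to q_prev * exp(-J / eta)."""
    w = q_prev * np.exp(-J_vals / eta)
    return w / w.sum()

# Hypothetical discretized setting: five samples and their objective values.
q = np.full(5, 0.2)                           # uniform prior over the samples
J_vals = np.array([3.0, 1.0, 0.2, 2.0, 5.0])

for _ in range(50):
    q = mds_step(q, J_vals, eta=1.0)

# The posterior mass concentrates on the lowest-cost sample (index 2).
print(q.argmax())
```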

Density Estimation

We estimate the continuous probability density function p(ω) from the posterior distribution q_k. The procedure of MDS with K iterations is summarized in Algorithm 1.

3.2.2 Gaussian-Mirror Descent Search (G-MDS)

We consider a specific case where the Bregman divergence in Eq. (11) is the RKL divergence. Then, Eq. (13) can be rewritten as follows:

 qk=arg minq∈Q{⟨J,q⟩+ηKL(q,qk−1)}. (14)

In G-MDS, we consider q_{k−1} to be a Gaussian distribution with mean μ_{k−1} and variance-covariance matrix Σ^ε_{k−1}, so each sample θ_{k,i} is generated accordingly:

 θk,i∼N(μk−1,Σϵk−1) (15)

Because the derived algorithm is an instance of MDS with the constraint that the policy is a Gaussian distribution, we name it G-MDS. The procedure of G-MDS with K iterations is summarized in Algorithm 2.

As discussed in sections 4 and 5, the G-MDS formulation sheds new light on the existing method PI2. Deisenroth also discussed the similarity between episode-based REPS and PI2 Deisenroth2013Survey (). To compare the asymptotic behavior of these algorithms appropriately, Algorithm 2 only updates the mean vector of the Gaussian distribution, as PI2 also updates only the mean vector. Many past studies proposed procedures to update the variance-covariance matrix Hansen2001Completely (); Stulp2012Path (); Abdolmaleki2017Deriving (); these methods would be applicable to G-MDS.
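A minimal G-MDS-style loop in the spirit of Algorithm 2 can be sketched as follows. The objective, covariance, step size, and sample counts are illustrative assumptions, and only the mean vector is updated, as stated above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical objective over a 2-D parameter theta.
opt = np.array([1.0, -0.5])
J = lambda th: np.sum((th - opt) ** 2, axis=-1)

mu = np.zeros(2)            # mean of the Gaussian sampling distribution
Sigma = 0.1 * np.eye(2)     # fixed covariance (only the mean is updated)
eta = 1.0

for _ in range(100):
    theta = rng.multivariate_normal(mu, Sigma, size=20)   # rollouts
    costs = J(theta)
    w = np.exp(-(costs - costs.min()) / eta)              # exponentiated costs
    w /= w.sum()
    mu = w @ theta                                        # weighted posterior mean

print(mu)  # drifts toward the optimum [1.0, -0.5]
```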

3.3 Accelerated Mirror Descent Search and Gaussian-Accelerated Mirror Descent Search

3.3.1 Accelerated Mirror Descent Search (AMDS)

Next, the accelerated mirror descent (AMD) method Krichene2015Accelerated () is applied to the proposed method. AMD is an accelerated method that generalizes Nesterov’s accelerated gradient such that it can be applied to MD. The details of AMD are explained in C. Here, AMD yields the following equations:

 q_k = λ_{k−1} q̃z_{k−1} + (1 − λ_{k−1}) q̃x_{k−1}, with λ_{k−1} = r / (r + (k−1)), (16)
 q̃z_k = arg min_{q̃z∈R^m} { ((k−1)s/r) ⟨J_{k−1}, q̃z⟩ + B_φ(q̃z, q̃z_{k−1}) }, (17)
 q̃x_k = arg min_{q̃x∈R^m} { γs ⟨J_{k−1}, q̃x⟩ + R(q̃x, q_k) }, (18)

where R is a regularization function, which belongs to the Bregman divergences Krichene2015Accelerated (), γ and r are hyperparameters, and s is the step size.

The procedure of AMDS with K iterations is summarized in Algorithm 3. Fig. 2 also explains the implementation of AMDS; each caption in Fig. 2 corresponds to a line number of Algorithm 3.
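The three-step update of Eqs. (16)-(18) can be sketched on a toy simplex problem as below. As a simplification, plain KL steps stand in for both the Bregman divergence and the regularizer R (the original accelerated mirror descent uses a slightly perturbed KL divergence); the objective values and hyperparameters are illustrative assumptions.

```python
import numpy as np

def kl_step(q_prev, grad, step):
    """Entropic mirror step: argmin step * <grad, q> + KL(q, q_prev) on the simplex."""
    w = q_prev * np.exp(-step * grad)
    return w / w.sum()

# Hypothetical objective values on a 4-point simplex.
J_vals = np.array([2.0, 0.5, 1.0, 3.0])
m = len(J_vals)
qz = qx = q = np.full(m, 1.0 / m)
r, s, gamma = 3.0, 0.1, 1.0

for k in range(1, 200):
    lam = r / (r + (k - 1))
    q = lam * qz + (1 - lam) * qx               # Eq. (16): mix the two sequences
    qz = kl_step(qz, J_vals, (k - 1) * s / r)   # Eq. (17): growing step size
    qx = kl_step(q, J_vals, gamma * s)          # Eq. (18): prox step from q

print(q.argmax())  # mass concentrates on the lowest-cost vertex (index 1)
```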

3.3.2 Gaussian-Accelerated Mirror Descent Search (G-AMDS)

In accordance with prior work Krichene2015Accelerated (), we apply the RKL distance to the Bregman divergence in Eq. (17) and to the regularizer R in Eq. (18). As the divergence takes the form of the slightly perturbed KL divergence, we denote it accordingly in Algorithm 4. We approximate the distributions in Eqs. (17) and (18) with Gaussian distributions; accordingly, this method is called G-AMDS. Although the result cannot be calculated analytically, an efficient numerical calculation is known to be available Krichene2015Accelerated (). The procedure of G-AMDS with K iterations is summarized in Algorithm 4.

4 Experimental Evaluations

In this section, we show comparative experiments. We compare the learning curves of G-MDS, G-AMDS, PI2, and episode-based REPS on two tasks. We selected PI2 and episode-based REPS as baselines because they are state-of-the-art methods. In Theodorou2010generalized (); Theodorou2010Reinforcement (), these methods were equipped with heuristics such as cost normalization and simulated annealing; we do not use these heuristics in our evaluations, focusing instead on the theoretically guaranteed performance of the algorithms. Our source code is available online (https://github.com/mmilk1231/MirrorDescentSearch). We gratefully acknowledge that the PI2 code Theodorou2010generalized () and the AMD code (https://github.com/walidk/AcceleratedMirrorDescent) Krichene2015Accelerated () were helpful in implementing ours.

We performed a 2-DOF point via-point task to evaluate the proposed method. The agent is represented as a point on the x-y plane and learns to pass through the point (0.5, 0.2) at 250 ms. We employed DMPs Ijspeert2003Learning () to parameterize the policy; a DMP represents the trajectory of the agent along the x- and y-axes at each time step. The parameter settings are as follows: 100 updates, 10 rollouts, and 20 basis functions. Before learning, an initial trajectory from (0, 0) to (1, 1) is generated.

The reward function is as follows:

 r_t = 5000 f_t^2 + 0.5 θ^T θ, (19)
 Δr_{250ms} = 1.0×10^{10} ((0.5 − x_{250ms})^2 + (0.2 − y_{250ms})^2), (20)

where θ denotes the policy parameter.

We summarize the results in Fig. 3, which shows that the G-AMDS agent learns faster than all the other agents and that the agent was able to accomplish the task.

Table 1 shows the average cost and the standard deviation of the cost at the last update (the right endpoint of Fig. 3). In the figure, the thin lines represent the standard deviation of the cost. Fig. 3 also shows the acquired trajectory at the last update. We set the variance-covariance matrix of the sampling distribution to the unit matrix in all algorithms.

We performed a 10-DOF arm via-point task and a 50-DOF arm via-point task to evaluate the proposed method. The agent learns to control its end-effector to pass through the point (0.5, 0.5) at 300 ms. Before learning, the arm trajectory is initialized to minimize jerk.

The reward function for the D-DOF arm is as follows:

 r_t = [Σ_{i=1}^D (D+1−i)(0.1 f_{i,t}^2 + 0.5 θ_i^T θ_i)] / [Σ_{i=1}^D (D+1−i)], (21)
 Δr_{300ms} = 1.0×10^8 ((0.5 − x_{300ms})^2 + (0.5 − y_{300ms})^2), (22)

where x_{300ms} and y_{300ms} denote the end-effector position at 300 ms. DMPs Ijspeert2003Learning () are also used to parameterize the policy. The parameter settings are as follows: 1000 updates, 10 rollouts, and 100 basis functions.

We summarize the results in Fig. 4 and Fig. 5, from which we can confirm that G-AMDS learns faster than all the other algorithms; moreover, its variance is the smallest. As Fig. 4 and Fig. 5 show, the G-AMDS agent accomplished both the 10-DOF task and the 50-DOF task. Thus, G-AMDS appears to scale well with dimensionality.

Table 2 and Table 3 show the average cost and the standard deviation of the cost at the last update.

5 Relation between MDS and PI2

Theodorou et al. proposed the PI2 algorithm in Theodorou2010Reinforcement () and discussed the relation between PI2 and KL control in Theodorou2012Relative (); Theodorou2013information (). In this section, we provide an explanation of PI2 from the viewpoint of MDS.

5.1 Problem Statement and Algorithm of PI2

We begin with the problem statement of PI2 Theodorou2012Relative (); Kappen2011Optimal ():

 min_{{u_t}_{t,…,T}} E_τ[L(τ)], (23)
 s.t. dx_t = f(x_t) dt + G(x_t)(u_t dt + dw_t), (24)

where x_t denotes the state, u_t the control input, f(x_t) the passive dynamics, G(x_t) the control matrix, and L(τ) the cost of a trajectory τ; w_t is a Wiener process Kappen2011Optimal (). It is essential to point out that u_t plays the role of a feedback gain for x_t, so our objective is to find the optimal feedback gain; in particular, we try to optimize the averaged continuous time series.

Eq. (24) can be interpreted in two ways. Under the model-free reinforcement learning problem, Eq. (24) represents the actual physical dynamics of the real plant. Under the model-predictive optimal control problem, Eq. (24) would represent the predictive model of the real plant. Below, we mainly consider the model-free reinforcement learning setting.

Eqs. (23)-(24) satisfy the linearized Hamilton-Jacobi-Bellman (HJB) equation under the quadratic-cost assumption Theodorou2010Reinforcement (); Kappen2011Optimal (), where L(τ) contains a state-dependent cost. With the path-integral calculation, we acquire the analytic solution of the HJB equation. The authors finally proposed Algorithm 5 as an iterative algorithm for the problem.

Next, we discuss the relation between MDS and PI2. Theodorou et al. Theodorou2012Relative (); Theodorou2013information () proposed a more general problem setting, which we treat in section 5.3.

5.2 PI2 from a Viewpoint of MDS

First of all, we reformulate Eq. (23)-Eq. (24) as Eq. (25):

 minp(h) ∫J(h)p(h)dh, (25)

where p(h) is the probability distribution of the stochastic process h. The stochastic process h is a Gaussian process with mean function μ, because all of its increments are Gaussian. Once h is sampled, the trajectory τ and the cost L(τ) are uniquely determined, so we define J(h) := L(τ). Our problem is to find the optimal probability distribution p∗(h).

We introduce MDS approach to optimize :

 p_{k+1}(h) = arg min_{p∈P} { ∫ J(h) p(h) dh + η KL(p(h), p_k(h)) }. (26)

Eq. (27) is the solution of Eq. (26), which is also known as exponentiated gradient Shalev-Shwartz2012Online ().

 p_{k+1}(h) = exp(−J(h)/η) p_k(h) / ∫ exp(−J(h)/η) p_k(h) dh (27)

The posterior mean function becomes

 μ_{k+1} = ∫ h · p_{k+1}(h) dh (28)
     = [∫ h exp(−J(h)/η) p_k(h) dh] / [∫ exp(−J(h)/η) p_k(h) dh]. (29)

By the Monte Carlo approximation, Eq. (29) can be approximated by

 ~μ_{k+1} = μ_k + [Σ_{j=1}^m (h_j − μ_k) exp(−J(h_j)/η)] / [Σ_{j=1}^m exp(−J(h_j)/η)], (30)

where h_j ∼ p_k(h). With the above-mentioned procedure, μ_k gradually gets closer to the optimum.

We explain the similarity and difference between Eq. (30) and PI2. To simplify the notation, we introduce ε_{k,j} := h_j − μ_k. Eq. (30) becomes

 ~μ_{k+1} = μ_k + Σ_{j=1}^m ε_{k,j} P_{k,j}, (31)
 P_{k,j} := exp(−J(h_j)/η) / Σ_{j′=1}^m exp(−J(h_{j′})/η). (32)

Eqs. (31)-(32) correspond to Lines 9-13 in Algorithm 5. There are two important differences between PI2 and the algorithm obtained here. First, as Lines 7-9 in Algorithm 5 show, PI2 sequentially updates the decision variable at each time step based on the provisional cumulative rewards; our procedure, as Eq. (31) shows, just uses the entire cumulative reward. We can bridge this gap by introducing dynamic programming, as used in Theodorou2012Relative (); Theodorou2013information () (see E). Second, PI2 assumes a Wiener process, while MDS is applicable to arbitrary stochastic processes; this difference would be important for dealing with more complex stochastic processes.
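The equivalence between the weighted-mean form of Eqs. (29)-(30) and the noise-weighted form of Eqs. (31)-(32) can be checked numerically; the cost function and sample values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
eta, m = 1.0, 10

mu = np.array([0.3, -0.7])
h = rng.normal(mu, 0.5, size=(m, 2))     # sampled parameters h_j ~ p_k(h)
eps = h - mu                             # exploration noise eps_{k,j} = h_j - mu_k
J_vals = np.sum(h ** 2, axis=1)          # hypothetical trajectory costs J(h_j)

P = np.exp(-J_vals / eta)
P /= P.sum()                             # Eq. (32): normalized exponentiated costs

mu_eq31 = mu + eps.T @ P                 # Eq. (31): mu_k + sum_j eps_{k,j} P_{k,j}
mu_eq29 = h.T @ P                        # Eq. (29)/(30): weighted mean of samples

print(np.allclose(mu_eq31, mu_eq29))     # the two forms coincide
```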

5.3 More General Problem Setting and Online Mirror Descent Trick

Theodorou et al. proposed a more general problem setting in Theodorou2012Relative (); Theodorou2013information ().

 min_{{u_t}_{t,…,T}} E_τ[L(τ)], (33)
 s.t. dx_t = f(x_t) dt + G(x_t)(dz_t + dξ_t), (34)
 dz_t = u_t dt + dw_t. (35)

They introduced an additional Wiener process ξ_t, which represents the stochasticity of the passive dynamics. All the other variables are defined as in section 5.1 and section 5.2.

In this setting, the trajectory τ remains a stochastic variable even after h is sampled. The evaluated value J(h) is represented by

 J(h)=∫p(ξ)j(h,ξ)dξ, (36)

where j(h, ξ) denotes the cost realized under the sampled h and the passive-dynamics noise ξ.

It is important to note that we can approximate MDS by:

 p_{k+1}(h) = arg min_{p∈P} { ∫ j(h, ξ_k) p(h) dh + η B_φ(p(h), p_k(h)) }, (38)
 s.t. J(h) = lim_{k→∞} (1/k) Σ_k j(h, ξ_k).

We can prove that Eq. (38) asymptotically converges to the optimal solution (see D). This trick is called online mirror descent. It enables us to make use of MDS under a single-rollout setting. Fig. 6 is a schematic view of Eq. (36) and Eq. (38).

The same holds for the reinforcement learning problem. We usually employ the expected cumulative reward as the objective function. Although there is uncertainty not only in the dynamics but also in the reward function, we expect Eq. (38) to be applicable to reinforcement learning problems.
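A single-rollout sketch of the online trick in Eq. (38): each update sees only one noisy evaluation j(h, ξ_k) = J(h) + noise, yet the iterates concentrate on the minimizer of the expected cost. The cost vector, noise level, and step size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical expected costs J(h) over five candidates h.
J_true = np.array([2.0, 0.3, 1.0, 1.5, 2.5])
eta = 20.0

p = np.full(5, 0.2)
for _ in range(2000):
    j_noisy = J_true + rng.normal(0.0, 1.0, size=5)   # one stochastic evaluation
    p = p * np.exp(-j_noisy / eta)                    # online MDS step with KL
    p /= p.sum()

# Despite never observing J(h) itself, the distribution concentrates on
# the argmin of the expected cost (index 1).
print(p.argmax())
```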

6 Conclusions

In this research, we proposed four optimization algorithms for both black-box optimization problems and reinforcement learning problems. On the basis of the MD method, we proposed two essential algorithms, MDS and AMDS, and two approximate versions of them, G-MDS and G-AMDS. We then discussed the relation between our proposed methods and related algorithms; in particular, in section 5 we provided a detailed discussion of the relation between MDS and PI2. We compared the performance of G-MDS, G-AMDS, PI2, and episode-based REPS on two tasks. G-AMDS showed significant improvements in convergence speed and optimality.

These results suggest that a variety of existing MD extensions can be applied to reinforcement learning algorithms. Moreover, a variety of Bayesian techniques, such as variational inference, may also be applicable to reinforcement learning algorithms, since there is a theoretical relation between the MD method and Bayes' theorem Dai2016provable (). We also refer to Natural Evolution Strategies (NES) Wierstra2014Natural (); Salimans2017Evolution (). NES uses the natural gradient to update a parameterized distribution; the natural gradient comes from constraints on the KL divergence or the Hellinger distance between two distributions. Because the Bregman divergence includes both the KL divergence and the Hellinger distance, we expect some connection between MDS and NES; recent work suggests a relation between the natural gradient and MD Raskutti2013information (). Although we did not evaluate G-MDS and G-AMDS with variance-covariance matrix updates in this study, we believe CMA-ES and its variants would improve performance. Parallelization of the MDS algorithms would also be important future work.

Acknowledgment

The research was supported by JSPS KAKENHI (Grant numbers JP26120005, JP16H03219, and JP17K12737).

Appendix A Bregman, KL and RKL divergence

We sketch the proof that both the KL and RKL divergences are Bregman divergences Amari2009alpha ().

First of all, we define the smooth convex function φ_α in the Bregman divergence Eq. (12) as

 φ_α(x) = (2/(1+α)) Σ_{i=1}^N (1 + ((1−α)/2) x_i)^{2/(1−α)}, (39)

where α ∈ R. By directly substituting Eq. (39) into the Bregman divergence, we acquire B_α(x, y). The work Amari2009alpha () provides the proof that B_α becomes the α-divergence. The divergence under the condition α = ±1 is defined by the limit cases α → ±1.

The limit cases α → ±1 are easy to calculate. We acquire

 lim_{α→+1} B_α(x, y) = Σ_{i=1}^N [exp(x_i) − exp(y_i) + exp(y_i)(y_i − x_i)], (40)

and

 lim_{α→−1} B_α(x, y) = Σ_{i=1}^N [−x_i + y_i + (1 + x_i) log(1 + x_i) − (1 + x_i) log(1 + y_i)]. (41)

Under the conditions x_i = log p_i and y_i = log q_i, Eq. (40) becomes

 lim_{α→+1} B_α(log p, log q) = Σ_{i=1}^N q_i log(q_i / p_i), (42)

and, under the conditions x_i = p_i − 1 and y_i = q_i − 1, Eq. (41) becomes

 lim_{α→−1} B_α(p − 1, q − 1) = Σ_{i=1}^N p_i log(p_i / q_i). (43)

Here, we used Σ_i p_i = Σ_i q_i = 1 in these calculations.

Finally, we have proved that both the KL and RKL divergences belong to the Bregman divergences, as shown by Eq. (42) and Eq. (43).

Appendix B Mirror Descent

We explain the mirror descent algorithm in this section. Let x ∈ X be the decision variable and f the objective function.

 xk=arg minx∈X{⟨∇f(xk−1),x⟩+ηBϕ(x,xk−1)} (44)

where B_φ is the Bregman divergence. The first term linearizes the objective function around x_{k−1}, and the second term controls the step size of x_k by bounding the Bregman divergence between the new decision-variable candidate x and the old one x_{k−1}.
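The definition above can be made concrete with a short sketch: for the half squared Euclidean norm as φ, the Bregman divergence reduces to half the squared distance, which is why the mirror descent step Eq. (44) then coincides with plain gradient descent. The vectors below are arbitrary illustrative values.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """B_phi(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y>  (Eq. 12)."""
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# phi = half squared Euclidean norm.
phi = lambda v: 0.5 * v @ v
grad_phi = lambda v: v

x = np.array([1.0, 2.0])
y = np.array([0.0, 1.0])

# For this phi, B_phi(x, y) = 0.5 * ||x - y||^2.
print(bregman(phi, grad_phi, x, y))  # 1.0 here, since ||x - y||^2 = 2
```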

Appendix C Accelerated Mirror Descent

We explain the accelerated mirror descent (AMD) algorithm, proposed in Krichene2015Accelerated (), in this section. AMD is an accelerated method that generalizes Nesterov's accelerated gradient descent. Let x ∈ X be the decision variable and f the objective function.

 xk = λk−1~zk−1+(1−λk−1)~xk−1,with λk−1=rr+(k−1) (45) ~zk = arg min~z∈X{(k−1)sr⟨∇f(xk),~z⟩+Bϕ(~z,~zk−1)} (46) ~xk = arg min~x∈X{γs⟨∇f(xk),~x⟩+R(~x,xk)}, (47)

where B_φ is the Bregman divergence, γ and r are hyperparameters, and s is the step size. In general, R represents the Bregman divergence of an arbitrary smooth convex function. For more detail on the algorithm, refer to Krichene2015Accelerated ().

AMD consists of the two MD equations, Eqs. (46) and (47). The parameter λ_k in Eq. (45) defines the mixture ratio of Eqs. (46) and (47). λ_k is initially close to 1, so AMD behaves according to Eq. (46); as λ_k comes close to 0, AMD converges to Eq. (47).

We provide two remarks related to this method. First, AMD naturally includes simulated annealing, while existing methods such as PI2 include it heuristically Theodorou2010generalized (); Theodorou2010Reinforcement (). The factor (k−1)s/r in Eq. (46) acts as a time-varying learning rate; as the learning step proceeds, this factor makes the gradient term increasingly important in the optimization of Eq. (46). This is equivalent to a simulated-annealing operation, which becomes clearer if Eq. (46) is reformulated in exponentiated-gradient form.

The other remark concerns an advantage of reverse-KL (RKL) minimization. The methods in this paper and in the original AMD paper both include it. The RKL minimization problem shows mode-seeking behavior when the referential distribution is multi-modal Bishop2006Pattern (). According to Eq. (45), x_k is a mixture, hence possibly multi-modal, when ~z_{k−1} and ~x_{k−1} lie on the simplex; such is the case with the reference of R in Eq. (47). As the learning step proceeds, the update gradually shifts from Eq. (46) toward Eq. (47), and we conjecture that the mode-seeking behavior is effective for AMD in converging to the latter MD update, Eq. (47).

Appendix D Online Mirror Descent

We begin with the optimization problem:

 q_k = arg min_{q∈Q} { ⟨j_{k−1}, q⟩ + η B_φ(q, q_{k−1}) }, (48)
 s.t. J = lim_{k→∞} (1/k) Σ_k j_k. (49)

From the reformulation

 q_k = arg min_{q∈Q} { ⟨j_{k−1}, q⟩ + η[φ(q) − φ(q_{k−1}) − ⟨∇φ(q_{k−1}), q − q_{k−1}⟩] } (50)
    = arg min_{q∈Q} { ⟨j_{k−1} − η∇φ(q_{k−1}), q⟩ + ηφ(q) } (51)

and a relational expression of the dual space in mirror descent

 ∇φ(q_{k−1}) = ∇φ(q_{k−2}) − (1/η) j_{k−2} = ⋯ = ∇φ(q_0) − (1/η) Σ_{i=0}^{k−2} j_i, (52)

we can reformulate Eq. (48) as follows:

 q_k = arg min_{q∈Q} { ⟨j_{k−1} + Σ_{i=0}^{k−2} j_i − η∇φ(q_0), q⟩ + ηφ(q) } (53)
    = arg min_{q∈Q} { ⟨Σ_{i=0}^{k−1} j_i − η∇φ(q_0), q⟩ + ηφ(q) } (54)
    = arg min_{q∈Q} { ⟨Σ_{i=0}^{k−1} j_i, q⟩ + η(φ(q) − ⟨∇φ(q_0), q⟩) } (55)
    = arg min_{q∈Q} { ⟨(1/k) Σ_{i=0}^{k−1} j_i, q⟩ + (η/k) B_φ(q, q_0) } (56)

Next, we reformulate the original problem in the same way.

 q_k = arg min_{q∈Q} { ⟨J, q⟩ + η B_φ(q, q_{k−1}) } (57)
    = arg min_{q∈Q} { ⟨J, q⟩ + (η/k) B_φ(q, q_0) } (58)

The more the number of updates increases, the closer (1/k) Σ_i j_i gets to J. Thus, we can replace J in Eq. (57) with (1/k) Σ_i j_i when the number of updates is sufficient.

Appendix E Dynamic Programming on MDS

We begin with the problem setting:

 min_{p(h)} ∫ J(h) p(h) dh + η KL[p(h) | p_k(h)] (59)

We assume discrete-time dynamics and the specific Markov-chain structure Theodorou2013information ():

 pk(h)=T∏t=1pk(ht∣ht−1). (60)

In addition, we assume the decomposable objective function

 J(h)=T∑t=0F(ht). (61)

Eq. (59) becomes

 (62)

By Bellman principle, we get (see Theodorou2013information () for details)

 Vt(ht)=minp(ht+1∣ht) {F(ht)+ηKL[p(ht+1|ht)|pk(ht+1|ht)]+∫