Model-Based Imitation Learning with Accelerated Convergence

Abstract

Sample efficiency is critical in solving real-world reinforcement learning problems, where agent-environment interactions can be costly. Imitation learning from expert advice has proved to be an effective strategy for reducing the number of interactions required to train a policy. Online imitation learning, a specific type of imitation learning that interleaves policy evaluation and policy optimization, is a particularly effective framework for training policies with provable performance guarantees. In this work, we seek to further accelerate the convergence rate of online imitation learning, making it more sample efficient. We propose two model-based algorithms inspired by Follow-the-Leader (FTL) with prediction: MoBIL-VI based on solving variational inequalities and MoBIL-Prox based on stochastic first-order updates. When a dynamics model is learned online, these algorithms can provably accelerate the best known convergence rate up to an order. Our algorithms can be viewed as a generalization of stochastic Mirror-Prox by Juditsky et al. (2011), and admit a simple constructive FTL-style analysis of performance. The algorithms are also empirically validated in simulation.


1 Introduction

Imitation learning (IL) has recently received attention for its ability to speed up policy learning when solving reinforcement learning (RL) problems (Abbeel and Ng, 2005; Ross et al., 2011; Ross and Bagnell, 2014; Chang et al., 2015; Sun et al., 2017; Le et al., 2018). Unlike pure RL techniques, which rely on uninformed random exploration to locally improve a policy, IL leverages prior knowledge about a problem in terms of expert demonstrations. At a high level, this additional information provides policy learning with an informed search direction toward the expert policy.

The goal of IL is to quickly learn a policy that can perform at least as well as the expert policy. Because the expert policy may be suboptimal with respect to the RL problem of interest, performing IL does not necessarily solve the RL problem. It is often used to provide a good warm start to the RL problem, minimizing the number of interactions with the environment. Sample efficiency is especially critical when learning is deployed in applications like robotics, where every interaction incurs real-world costs such as time and money.

By reducing IL to an online learning problem, online IL (Ross et al., 2011) provides a framework for convergence analysis and mitigates the covariate shift problem encountered in batch IL (Argall et al., 2009; Bojarski et al., 2017). In particular, under proper assumptions, the performance of a policy sequence updated by Follow-the-Leader (FTL) can converge on average to the performance of the expert policy (Ross et al., 2011). Recently, it was shown that this rate is sufficient to make IL more sample efficient than solving an RL problem from scratch (Cheng et al., 2018).

In this work, we further accelerate the convergence rate of IL. Inspired by the observation of Cheng and Boots (2018) that the online learning problem of IL is not completely adversarial, we propose two MOdel-Based IL (MoBIL) algorithms, MoBIL-VI and MoBIL-Prox, that can achieve a fast rate of convergence. Under the same assumptions as Ross et al. (2011), these algorithms improve the on-average convergence rate to $O(1/N^2)$ when a dynamics model is learned online.

The improved speed of our algorithms is attributed to using the dynamics model as a simulator to predict the next per-round cost in online learning. We first conceptually show that this idea can be realized as a variational inequality problem in MoBIL-VI. Next, we propose a practical first-order stochastic algorithm MoBIL-Prox, which alternates between the steps of taking the true gradient and of taking the simulated gradient. MoBIL-Prox is a generalization of stochastic Mirror-Prox proposed by Juditsky et al. (2011) to the case where the problem is weighted and the vector field is unknown but learned online. In both the theory and the experiments, we show that having a weighting scheme is pivotal to speed up convergence, and this generalization is made possible by a new constructive FTL-style regret analysis, which greatly simplifies the original algebraic proof (Juditsky et al., 2011). The performance of MoBIL-Prox is also empirically validated in simulation.

2 Preliminaries

2.1 Reinforcement Learning and Imitation Learning

Let $\mathcal{S}$ and $\mathcal{A}$ be the state and action spaces, respectively. The objective of RL is to search for a stationary policy $\pi$ with good performance within a policy class $\Pi$. This can be characterized by the stochastic optimization problem with expected cost $J(\pi)$ defined below:

$\min_{\pi\in\Pi} J(\pi), \qquad J(\pi) \coloneqq \mathbb{E}_{s\sim d_\pi}\,\mathbb{E}_{a\sim\pi_s}\!\big[ c_t(s,a) \big],$  (1)

in which $s\in\mathcal{S}$, $a\in\mathcal{A}$, $c_t$ is the instantaneous cost at time $t$, $d_\pi$ is a generalized stationary distribution given by executing policy $\pi$, and $\pi_s$ is the distribution of actions $a$ given state $s$. The policies here are assumed to be parametric. However, to make the writing compact, we will abuse the notation $\pi$ to also denote its parameter, and assume $\Pi$ is a compact convex subset in some normed space with norm $\|\cdot\|$.

Based on the abstracted distribution $d_\pi$, the formulation in (1) subsumes multiple discrete-time RL problems. For example, a $\gamma$-discounted infinite-horizon problem can be considered by setting $c_t = c$ as a time-invariant cost and $d_\pi(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\, d_{\pi,t}(s)$, in which $d_{\pi,t}$ denotes the state distribution at time $t$ under policy $\pi$. Similarly, a $T$-horizon RL problem can be considered by setting $d_\pi(s) = \frac{1}{T}\sum_{t=0}^{T-1} d_{\pi,t}(s)$. Note that while we use the notation $a\sim\pi_s$, the policy is allowed to be deterministic; in this case, the notation means evaluation. For notational compactness, we will often omit the random variable inside the expectation (e.g. we shorten (1) to $\mathbb{E}_{d_\pi}\mathbb{E}_{\pi_s}[c_t]$). In addition, we denote $Q^{\pi}_t(s,a)$ as the Q-function at time $t$ with respect to policy $\pi$.
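As a concrete illustration of the $T$-horizon instantiation, $J(\pi)$ can be estimated by rolling out the policy and averaging the instantaneous costs over time and over rollouts. The following Python sketch assumes a hypothetical environment interface (`reset()`, `step(a)`), a cost function, and a `policy.sample(s, rng)` method, none of which are specified in the paper.

import numpy as np

def estimate_J(policy, env, cost, T=1000, num_rollouts=10, seed=0):
    """Monte-Carlo estimate of J(pi) = E_{s ~ d_pi} E_{a ~ pi_s}[c(s, a)],
    where d_pi averages the state distributions over time steps 0..T-1."""
    rng = np.random.default_rng(seed)
    rollout_means = []
    for _ in range(num_rollouts):
        s = env.reset()
        costs = []
        for t in range(T):
            a = policy.sample(s, rng)   # a ~ pi_s
            costs.append(cost(s, a))    # instantaneous cost c(s, a)
            s = env.step(a)             # hypothetical step interface
        rollout_means.append(np.mean(costs))  # time average corresponds to d_pi
    return float(np.mean(rollout_means))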

In this paper, we consider IL and view it as an indirect approach to solving the RL problem. Specifically, we assume there is a black-box oracle $\pi^\star$, called the expert policy, from which demonstrations $a^\star \sim \pi^\star_s$ can be queried for any state $s\in\mathcal{S}$. To satisfy the querying requirement, the expert policy is usually an algorithm; for example, it can represent a planning algorithm which solves a simplified version of (1), or some engineered, hard-coded policy (see e.g. Pan et al. (2017)).

The purpose of incorporating the expert policy into solving (1) is to quickly obtain a policy that has reasonable performance. To this end, we consider solving a surrogate problem of (1),

$\min_{\pi\in\Pi}\ \mathbb{E}_{s\sim d_\pi}\!\big[ D(\pi^\star_s, \pi_s) \big],$  (2)

where $D(\cdot,\cdot)$ is a function that measures the difference between two distributions over actions (e.g. KL divergence; see Appendix B). Importantly, the objective in (2) has the property that $D(\pi^\star_s, \pi_s) \ge 0$ and there is a constant $C_{\pi^\star} > 0$ such that

$\mathbb{E}_{a\sim\pi_s}\!\big[ Q^{\pi^\star}_t(s,a) \big] - \mathbb{E}_{a\sim\pi^\star_s}\!\big[ Q^{\pi^\star}_t(s,a) \big] \le C_{\pi^\star}\, D(\pi^\star_s, \pi_s), \qquad \forall s\in\mathcal{S},\ t\in\mathbb{N},$  (3)

in which $\mathbb{N}$ denotes the set of natural numbers. By the Performance Difference Lemma (Kakade and Langford, 2002), it can be shown that the condition in (3) implies (Cheng and Boots, 2018)

$J(\pi) \le J(\pi^\star) + C_{\pi^\star}\, \mathbb{E}_{s\sim d_\pi}\!\big[ D(\pi^\star_s, \pi_s) \big].$  (4)

Therefore, solving (2) can lead to a policy that performs similarly to the expert policy $\pi^\star$.

2.2 Imitation Learning as Online Learning

The surrogate problem in (2) is comparably more structured than the original RL problem in (1). In particular, the distance-like function $D$ is given, and we know that $\mathbb{E}_{s\sim d_\pi}\big[D(\pi^\star_s,\pi_s)\big]$ is close to zero when $\pi$ is close to $\pi^\star$. On the contrary, $J(\pi)$ in (1) generally can still be large, even if $\pi$ is a good policy (since its value also depends on the definition of the cost). In other words, the range of the surrogate problem in (2) is normalized, which is the crucial property for the reduction from IL to online learning (Cheng and Boots, 2018).

The reduction is based on observing that, with the normalization property, we can characterize the expressiveness of the policy class $\Pi$ with a constant $\epsilon_\Pi$ defined, for all $N\in\mathbb{N}$, as

$\epsilon_\Pi \coloneqq \max_{\pi_1,\dots,\pi_N\in\Pi}\ \min_{\pi\in\Pi}\ \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{s\sim d_{\pi_n}}\!\big[ D(\pi^\star_s, \pi_s) \big],$  (5)

which measures the worst possible average difference between $\pi^\star$ and the best approximating policy in $\Pi$, with respect to the state distributions induced by policies in $\Pi$. Ross et al. (2011) make use of this property and reduce (2) into an online learning problem by distinguishing the influence of $\pi$ on the state distribution $d_\pi$ and on the action distribution $\pi_s$ in (2). To make this transparent, we define a bivariate function

$F(\pi', \pi) \coloneqq \mathbb{E}_{s\sim d_{\pi'}}\!\big[ D(\pi^\star_s, \pi_s) \big].$  (6)

Using this bivariate function $F$, the online learning setup can be described as follows: in round $n$, the learner applies a policy $\pi_n$ and then the environment reveals a per-round cost

$\ell_n(\pi) \coloneqq F(\pi_n, \pi) = \mathbb{E}_{s\sim d_{\pi_n}}\!\big[ D(\pi^\star_s, \pi_s) \big].$  (7)

Ross et al. (2011) show that if the sequence $\{\pi_n\}$ is selected by a no-regret algorithm, then the policy sequence will have good performance in terms of (2). For example, DAgger updates the policy by FTL, $\pi_{n+1} = \arg\min_{\pi\in\Pi}\sum_{m=1}^{n}\ell_m(\pi)$, and has the following guarantee.

Theorem 2.1 (Ross et al., 2011; Cheng and Boots, 2018).

If each $\ell_n$ is $\alpha$-strongly convex and $\|\nabla\ell_n\|_* \le G_\ell$, then DAgger has performance on average satisfying $\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\big[J(\pi_n)\big] \le J(\pi^\star) + C_{\pi^\star}\big(\epsilon_\Pi + O\big(\tfrac{G_\ell^2\ln N}{\alpha N}\big)\big)$.

First-order variants of DAgger based on Follow-the-Regularized-Leader (FTRL) have also been proposed by Sun et al. (2017) and Cheng et al. (2018), which have the same performance but only require taking a stochastic gradient step in each iteration, without keeping all the previous cost functions (i.e. data) as in the original FTL approach. The bound in Theorem 2.1 also applies to the expected performance of a policy randomly picked out of the sequence $\{\pi_n\}_{n=1}^{N}$, although it does not necessarily translate into the performance of the last policy $\pi_N$ (Cheng and Boots, 2018).
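To make the reduction concrete, a minimal DAgger-style loop (FTL on the aggregated imitation losses) is sketched below. The environment, the expert oracle, and the `fit` routine are hypothetical stand-ins; `fit` performs the FTL step by refitting the policy on all data collected so far.

def dagger_ftl(policy, expert, env, fit, num_rounds=50, rollout_len=200):
    """Online IL by FTL: pi_{n+1} = argmin_pi sum_{m <= n} ell_m(pi)."""
    dataset = []                               # aggregated (state, expert action) pairs
    for n in range(num_rounds):
        s = env.reset()
        for _ in range(rollout_len):           # sample states from d_{pi_n}
            dataset.append((s, expert.act(s))) # query the expert at the visited state
            a = policy.sample(s)
            s = env.step(a)                    # hypothetical step interface, as above
        policy = fit(dataset)                  # FTL: refit on all data so far
    return policy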

3 Accelerating Imitation Learning with Dynamics Models

The reduction-based approach to solving IL has demonstrated success in speeding up policy learning. However, because interactions with the environment are necessary to approximately evaluate the per-round cost, it is interesting to determine whether the convergence rate of IL can be further improved. A faster convergence rate will be valuable in real-world applications, where data collection is expensive.

We answer this question affirmatively. We show that, by modeling the dynamics in the RL problem, the convergence rate of IL can be improved by up to an order. The improvement comes from leveraging the fact that the per-round cost defined in (7) is not completely unknown or adversarial, as is assumed in the most general online learning setting. Because the same bivariate function $F$ is used in (7) over different rounds, the online component actually comes from the reduction made by Ross et al. (2011), which ignores information about how $F$ changes with its first argument; in other words, it omits the variations of $d_{\pi'}$ when $\pi'$ changes. Therefore, we argue that the original reduction proposed by Ross et al. (2011), while allowing the use of (5) to characterize the performance, loses one critical piece of information present in the original RL problem: the dynamics are the same across different rounds of online learning.

We propose two model-based algorithms (MoBIL-VI and MoBIL-Prox) to accelerate IL. The first algorithm, MoBIL-VI, is conceptual in nature and updates the policies by solving variational inequality (VI) problems (Facchinei and Pang, 2007). This algorithm is used to illustrate how modeling the dynamics can help to speed up IL. The second algorithm, MoBIL-Prox, is a first-order method. It alternates between taking stochastic gradients by interacting with the environment and by evaluating the dynamics model. We will prove that this simple practical approach has the same performance as the conceptual one: when the dynamics model is learned online, both algorithms can converge in $O(1/N^2)$, in contrast to DAgger's $O(\ln N / N)$ convergence.

3.1 Relationship between Performance and Average Weighted Regret

Before presenting the two algorithms, we first summarize the core idea of the reduction from IL to online learning in a simple lemma, which builds the foundation of our algorithms (proved in Appendix E.1).

Lemma 3.1. Let $\{w_n > 0\}$ be a sequence of weights and $w_{1:N} \coloneqq \sum_{n=1}^{N} w_n$. Then it holds that

$\mathbb{E}\!\left[ \sum_{n=1}^{N} \tfrac{w_n}{w_{1:N}} J(\pi_n) \right] \le J(\pi^\star) + C_{\pi^\star}\!\left( \epsilon_\Pi + \mathbb{E}\!\left[ \mathrm{regret}^w_N(\tilde{\ell}) \right] \right),$

where $\mathrm{regret}^w_N(\tilde{\ell}) \coloneqq \frac{1}{w_{1:N}}\big( \sum_{n=1}^{N} w_n \tilde{\ell}_n(\pi_n) - \min_{\pi\in\Pi} \sum_{n=1}^{N} w_n \tilde{\ell}_n(\pi) \big)$ is the weighted average regret, $\tilde{\ell}_n$ is an unbiased estimate of $\ell_n$, and $\epsilon_\Pi$ is given in Definition 4.1. In other words, the on-average performance convergence of an online IL algorithm mainly depends on the rate of the expected weighted average regret. For example, in DAgger, the weighting is uniform and $\mathbb{E}[\mathrm{regret}^w_N(\tilde{\ell})]$ is bounded by $O(\ln N / N)$; by Lemma 3.1, this rate directly proves Theorem 2.1.

3.2 Algorithms

0: Input: policy $\pi_1$, model $\hat{F}_1$, number of iterations $N$
0: Output: $\hat{\pi}$
1: Set the weights $\{w_n\}_{n=1}^{N}$ and sample an integer $K$ with $P(K = n) \propto w_n$
2: for $n = 1, \dots, N$ do
3:   Run $\pi_n$ in the real environment to collect data to define $\tilde{\ell}_n$ and $\tilde{g}_n$ (MoBIL-VI assumes $\tilde{\ell}_n = \ell_n$ and $\tilde{g}_n = g_n$)
4:   Update the model to $\hat{F}_{n+1}$ by FTL: $\hat{F}_{n+1} \in \arg\min_{\hat{F}\in\hat{\mathcal{F}}} \sum_{m=1}^{n} w_m \tilde{g}_m(\hat{F})$
5:   For MoBIL-VI, update the policy to $\pi_{n+1}$ by solving the VI in (8); for MoBIL-Prox, update the policy to $\pi_{n+1}$ by the first-order update rule in (9)
6: end for
7: Set $\hat{\pi} = \pi_K$
Algorithm 1 MoBIL-VI and MoBIL-Prox

From Lemma 3.1, we know that improving the regret bound implies a faster convergence of IL. This leads to the main idea of MoBIL-VI and MoBIL-Prox: to use the model information to approximately play Be-the-Leader (BTL) (Kalai and Vempala, 2005), i.e. $\pi_{n+1} \approx \arg\min_{\pi\in\Pi}\sum_{m=1}^{n+1} w_m \ell_m(\pi)$. To understand why playing BTL can minimize the regret, we recall a classical regret bound of online learning.

Lemma 3.2 (Strong FTL Lemma (McMahan, 2017)). For any sequence of decisions $\{\pi_n\}$ and losses $\{f_n\}$, $\mathrm{regret}_N \coloneqq \sum_{n=1}^{N} f_n(\pi_n) - \min_{\pi\in\Pi}\sum_{n=1}^{N} f_n(\pi) \le \sum_{n=1}^{N}\big[ \Phi_n(\pi_n) - \Phi_n(\pi^\star_n) \big]$, where $\Phi_n \coloneqq \sum_{m=1}^{n} f_m$ and $\pi^\star_n \coloneqq \arg\min_{\pi\in\Pi}\Phi_n(\pi)$.

Namely, if the decision $\pi_n$ made in round $n$ is close to the best decision $\pi^\star_n$ in round $n$ after the new per-round cost is revealed (which depends on $\pi_n$), then the regret will be small.
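For completeness, here is the standard derivation behind Lemma 3.2 in the notation above (the paper cites McMahan (2017) for the proof). With $\Phi_0 \coloneqq 0$ and $\pi^\star_0$ arbitrary,

$\mathrm{regret}_N = \sum_{n=1}^{N} f_n(\pi_n) - \Phi_N(\pi^\star_N) = \sum_{n=1}^{N}\big[\Phi_n(\pi_n) - \Phi_{n-1}(\pi_n)\big] - \sum_{n=1}^{N}\big[\Phi_n(\pi^\star_n) - \Phi_{n-1}(\pi^\star_{n-1})\big]$

$= \sum_{n=1}^{N}\big[\Phi_n(\pi_n) - \Phi_n(\pi^\star_n)\big] - \sum_{n=1}^{N}\big[\Phi_{n-1}(\pi_n) - \Phi_{n-1}(\pi^\star_{n-1})\big] \le \sum_{n=1}^{N}\big[\Phi_n(\pi_n) - \Phi_n(\pi^\star_n)\big],$

since $\Phi_{n-1}(\pi^\star_{n-1}) \le \Phi_{n-1}(\pi_n)$ by the optimality of $\pi^\star_{n-1}$. In particular, playing BTL ($\pi_n = \pi^\star_n$) makes the upper bound vanish. Keeping the dropped nonnegative second sum, instead of discarding it, is exactly the refinement made by the Stronger FTL Lemma in Section 4.3. Choosing $f_n = w_n\ell_n$ recovers the weighted regret of Lemma 3.1 up to the normalization by $w_{1:N}$.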

The two algorithms are summarized in Algorithm 1; they mainly differ in the policy update rule (line 5). Like DAgger, they both learn the policy in an interactive manner. In round $n$, both algorithms execute the current policy $\pi_n$ in the real environment to collect data to define the per-round cost functions (line 3): $\tilde{\ell}_n$ is an unbiased estimate of $\ell_n$ in (7) for policy learning, and $\tilde{g}_n$ is an unbiased estimate of the per-round cost $g_n$ for model learning. Given the current per-round costs, the two algorithms then update the model (line 4) and the policy (line 5) using the respective rules. The exact definitions of $g_n$ and $\tilde{g}_n$ will be given later in Section 4.1; for the time being, we can think of $\hat{\mathcal{F}}$ as the family of dynamics models and $\tilde{g}_n$ as the empirical loss function used to train a dynamics model (e.g. the KL divergence of the prediction).

A Conceptual Algorithm: MoBIL-VI

We first present our conceptual algorithm MoBIL-VI, which is simpler to explain. We assume that the per-round costs $\ell_n$ and $g_n$ are given exactly, as in Theorem 2.1. This assumption will be removed in MoBIL-Prox later. To realize the idea of BTL, in round $n$, MoBIL-VI uses the newly learned model $\hat{F}_{n+1}$ to estimate $F$ in (6) and then updates the policy by solving the VI problem below: find $\pi_{n+1}\in\Pi$ such that

(8)

If the estimated per-round cost is convex in the policy, the VI problem in (8) is equivalent to a fixed-point problem: it seeks a policy $\pi_{n+1}$ that minimizes the weighted sum of the past per-round costs plus the model's prediction of the next per-round cost, where the prediction is itself evaluated under $\pi_{n+1}$ (a concrete sketch is given below). Therefore, if the model is perfect (i.e. $\hat{F}_{n+1} = F$), then $\pi_{n+1}$ plays exactly BTL and, by Lemma 3.2, the regret is non-positive.
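To make this structure concrete, one natural instantiation of such an update (a sketch in the notation above, under the assumption that the update balances the FTL objective on past costs against the model's prediction of the next one; the precise formulation is equation (8)) is: find $\pi_{n+1}\in\Pi$ such that, for all $\pi'\in\Pi$,

$\big\langle \nabla\hat{\Phi}_{n+1}(\pi_{n+1}),\ \pi' - \pi_{n+1} \big\rangle \ge 0, \qquad \hat{\Phi}_{n+1}(\pi) \coloneqq \sum_{m=1}^{n} w_m \ell_m(\pi) + w_{n+1}\hat{F}_{n+1}(\pi_{n+1}, \pi),$

where the gradient is taken with respect to $\pi$, the second argument of $\hat{F}_{n+1}$. Because $\hat{\Phi}_{n+1}$ depends on the solution $\pi_{n+1}$ through the first argument of $\hat{F}_{n+1}$, this is a variational inequality (equivalently, a fixed point of the map $\pi \mapsto \arg\min_{\pi''\in\Pi} \sum_{m=1}^{n} w_m\ell_m(\pi'') + w_{n+1}\hat{F}_{n+1}(\pi,\pi'')$) rather than a plain minimization.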

In general, we can show that, even with modeling errors, MoBIL-VI can still reach a faster convergence rate such as $O(1/N^2)$, if a nonuniform weighting scheme is used and the model is updated online. The details are presented in Section 4.2.

A Practical Algorithm: MoBIL-Prox

While the previous conceptual algorithm achieves a faster convergence, it requires solving a nontrivial VI problem in each iteration. In addition, it assumes $\ell_n$ is given as a function and requires keeping all the past data to define the update. Here we relax these unrealistic assumptions and propose MoBIL-Prox. In MoBIL-Prox, at round $n$, the policy is updated by taking two gradient steps:

(9)

in which $r_n$ is an $\alpha_n$-strongly convex function ($\alpha_n \ge 0$) such that $\pi_n$ is its global minimum and $r_n(\pi_n) = 0$ (e.g. a Bregman divergence centered at $\pi_n$), and the two gradients in (9) are estimates of $\nabla\ell_n$ and $\nabla\ell_{n+1}$, respectively. Here we only require the estimate of $\nabla\ell_n$ to be unbiased, whereas the estimate of $\nabla\ell_{n+1}$ could be biased, since it is obtained from simulation with the learned model.

MoBIL-Prox treats $\hat{\pi}_{n+1}$, which plays FTL with the gradients collected from the real environment, as a rough estimate of the next policy $\pi_{n+1}$ and uses it to query a gradient estimate from the model $\hat{F}_{n+1}$. Therefore, the learner's final decision $\pi_{n+1}$ can approximately play BTL. If we compare the update rule of $\pi_{n+1}$ with the VI problem in (8), we can see that MoBIL-Prox linearizes the problem and attempts to approximate the next gradient $\nabla\ell_{n+1}(\pi_{n+1})$ by the model gradient evaluated at $\hat{\pi}_{n+1}$. While the above approximation is crude, interestingly it is sufficient to speed up the convergence rate to be as fast as MoBIL-VI under very mild assumptions, as shown later in Section 4.3.
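To illustrate the structure of this two-step scheme, the sketch below implements a MoBIL-Prox-style round with a Euclidean quadratic regularizer centered at the origin and a projection onto an l2 ball standing in for the compact set $\Pi$. The helper names (`real_grad_fn`, `model_grad_fn`), the ball radius, and the schedules for $w_n$ and $\alpha_n$ are illustrative assumptions, not the paper's exact update (9).

import numpy as np

def project(theta, radius=10.0):
    # Euclidean projection onto an l2 ball (a stand-in for the compact set Pi).
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def mobil_prox_style_round(theta_n, grad_sum, n, real_grad_fn, model_grad_fn,
                           p=1, alpha=1.0):
    """One round of a MoBIL-Prox-style update (illustrative sketch only).

    On entry, grad_sum holds the weighted real-environment gradients of
    rounds m < n; this call adds round n's term.
    """
    w_n = float(n) ** p
    grad_sum = grad_sum + w_n * real_grad_fn(theta_n)   # unbiased gradient at pi_n

    # Step 1: FTL with regularization on the real gradients -> intermediate policy.
    lam = alpha * sum(float(m) ** p for m in range(1, n + 1))
    theta_hat = project(-grad_sum / lam)

    # Step 2: query the learned dynamics model at the intermediate policy to
    # predict the next per-round gradient, then take the final decision.
    w_next = float(n + 1) ** p
    pred_grad = model_grad_fn(theta_hat)                # simulated (possibly biased) gradient
    theta_next = project(-(grad_sum + w_next * pred_grad) / (lam + alpha * w_next))
    return theta_next, grad_sum

In this sketch grad_sum should be initialized to a zero vector of the parameter dimension; in the paper, $r_n$ is centered at the current policy $\pi_n$, whereas the single quadratic used here keeps the closed-form FTL solution short.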

4 Theoretical Analysis

In the following analyses, we will focus on bounding the expected weighted average regret, as it directly translates into the average performance bound by Lemma 3.1. For short, we write $w_{1:n} \coloneqq \sum_{m=1}^{n} w_m$ for $n\in\mathbb{N}$. The full proofs are given in Appendix F.

4.1 Assumptions

Dynamics model  We consider a dynamics model $\hat{F}$ as an approximator of $F$ in (6) and let $\hat{\mathcal{F}}$ be the class of such estimators (their connection to the transition probability will be made clear in Section 4.4). These models are assumed to be Lipschitz continuous in the following sense.

Assumption 4.1.

There is $L < \infty$ such that, for all $\hat{F}\in\hat{\mathcal{F}}$ and $\pi\in\Pi$, the map $\pi' \mapsto \nabla_2\hat{F}(\pi',\pi)$ is $L$-Lipschitz continuous, in which $\nabla_2$ denotes the partial derivative with respect to the second argument.

Per-round costs  The per-round cost $\ell_n$ for policy learning is given in (7), and we define the per-round cost $g_n$ for model learning as an upper bound of the model error (made precise in Section 4.4). For example, $g_n$ can be selected as the expected KL divergence between two transition probabilities (see Section 4.4). We make some structural assumptions on $\tilde{\ell}_n$ and $\tilde{g}_n$, which are also made by Ross et al. (2011) (see Theorem 2.1).

Assumption 4.2.

With probability one, $\tilde{\ell}_n$ is $\alpha$-strongly convex and $\|\nabla\tilde{\ell}_n\|_* \le G_\ell$, $\forall n\in\mathbb{N}$; $\tilde{g}_n$ is $\beta$-strongly convex and $\|\nabla\tilde{g}_n\|_* \le G_g$, $\forall n\in\mathbb{N}$.

By definition, these properties extend to $\ell_n$ and $g_n$. We note that they can be relaxed to just convexity, and our algorithms still improve the best known convergence rate (see Table 1 and Appendix G).
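For reference, the standard notions used in Assumption 4.2, stated in the norm $\|\cdot\|$ on $\Pi$ and its dual norm $\|\cdot\|_*$: a differentiable function $f$ is $\alpha$-strongly convex if

$f(\pi') \ge f(\pi) + \langle \nabla f(\pi),\, \pi' - \pi \rangle + \tfrac{\alpha}{2}\|\pi' - \pi\|^2, \qquad \forall\, \pi, \pi' \in \Pi,$

and a dual-norm gradient bound $\|\nabla f(\pi)\|_* \le G$ for all $\pi\in\Pi$ is equivalent to $f$ being $G$-Lipschitz on the convex set $\Pi$.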

Expressiveness of hypothesis classes  Finally, with the per-round costs defined, we introduce two constants, $\epsilon_\Pi$ and $\epsilon_{\hat{\mathcal{F}}}$, to characterize the expressiveness of the policy class $\Pi$ and the model class $\hat{\mathcal{F}}$. These two constants generalize the idea of (5) to stochastic and general weighting settings. When the exact per-round losses are available (i.e. $\tilde{\ell}_n = \ell_n$) and the weighting is uniform, Definition 4.1 agrees with (5). Similarly, we see that if $\pi^\star\in\Pi$ and $F\in\hat{\mathcal{F}}$, then $\epsilon_\Pi$ and $\epsilon_{\hat{\mathcal{F}}}$ are zero.

Definition 4.1.

A policy class $\Pi$ is $\epsilon_\Pi$-close to $\pi^\star$ if, for all $N\in\mathbb{N}$ and weights $\{w_n > 0\}_{n=1}^{N}$, $\mathbb{E}\big[\min_{\pi\in\Pi}\frac{1}{w_{1:N}}\sum_{n=1}^{N} w_n\tilde{\ell}_n(\pi)\big] \le \epsilon_\Pi$. Similarly, a model class $\hat{\mathcal{F}}$ is $\epsilon_{\hat{\mathcal{F}}}$-close to $F$ if $\mathbb{E}\big[\min_{\hat{F}\in\hat{\mathcal{F}}}\frac{1}{w_{1:N}}\sum_{n=1}^{N} w_n\tilde{g}_n(\hat{F})\big] \le \epsilon_{\hat{\mathcal{F}}}$.

Table 1: Convergence rate comparison. Rows: the policy per-round cost $\tilde{\ell}_n$ convex or strongly convex; columns: the model per-round cost $\tilde{g}_n$ convex, strongly convex, or without a model.

4.2 Performance of MoBIL-VI

Here we show the performance of MoBIL-VI when there is modeling error in $\hat{F}_n$. The main idea is to treat MoBIL-VI as online learning with prediction (Rakhlin and Sridharan, 2013), in which the model-based estimate of the next per-round cost is obtained after solving the VI problem (8).

Proposition 4.2. For MoBIL-VI with uniform weights, the expected regret is bounded by a constant plus the regret of the online model learning problem.

By Lemma 3.1, this means that if the model class is expressive enough (i.e. $\epsilon_{\hat{\mathcal{F}}} = 0$), then by adapting the model online we can improve the original convergence rate in $O(\ln N / N)$ of Ross et al. (2011) to $O(1/N)$. While removing the $\ln N$ factor does not seem like much, we will show that running MoBIL-VI with a nonuniform weighting can improve the convergence rate to $O(1/N^2)$.

Theorem 4.2. For MoBIL-VI with the nonuniform weights used in Algorithm 1, the expected weighted average regret converges in $O(1/N^2)$ up to a term proportional to the weighted average regret of online model learning.

The key is that this residual term can be upper bounded by the regret of the online learning problem for models, which has per-round cost $w_n g_n$. Therefore, if the model is learned by FTL as in Algorithm 1, randomly picking a policy out of $\{\pi_n\}$ proportionally to the weights $\{w_n\}$ has expected convergence in $O(1/N^2)$ if $\epsilon_{\hat{\mathcal{F}}} = 0$.

4.3 Performance of MoBIL-Prox

We characterize the gradient error with and , where also entails potential bias.

Assumption 4.3.

and

We show that this simple first-order algorithm, MoBIL-Prox, achieves similar performance to the conceptual algorithm MoBIL-VI. To this end, we introduce a stronger lemma than Lemma 3.2.

Lemma 4.3 (Stronger FTL Lemma). Let $\Phi_n \coloneqq \sum_{m=1}^{n} f_m$ and $\pi^\star_n \coloneqq \arg\min_{\pi\in\Pi}\Phi_n(\pi)$, with $\Phi_0 \coloneqq 0$. For any sequence of decisions $\{\pi_n\}$ and losses $\{f_n\}$, $\mathrm{regret}_N = \sum_{n=1}^{N}\big[\Phi_n(\pi_n) - \Phi_n(\pi^\star_n)\big] - \Delta_N$, where $\Delta_N \coloneqq \sum_{n=1}^{N}\big[\Phi_{n-1}(\pi_n) - \Phi_{n-1}(\pi^\star_{n-1})\big] \ge 0$.

The additional nonnegative term $\Delta_N$ in Lemma 4.3 is pivotal in proving the performance of MoBIL-Prox.

Theorem 4.3. For MoBIL-Prox with the weights and regularization above, the expected weighted average regret converges at the same rate as in Theorem 4.2, up to additional terms due to the gradient estimation errors characterized in Assumption 4.3.

Proof sketch.

Here we give a proof sketch in big-O notation (see Appendix F.2 for the details). To bound , recall . Define . Since is -strongly convex, is -strongly convex, and , satisfies , . This implies , where .

The following lemma (Lemma 4.4, a pathwise regret bound) upper bounds the weighted regret by using the Stronger FTL Lemma (Lemma 4.3). Since the second term in Lemma 4.3 is negative, we just need to upper bound the expectation of the first term. Using the triangle inequality, a second lemma (Lemma 4.5, an expected quadratic error bound) controls the error of using the model to predict the next per-round cost. With Lemma 4.4 and Lemma 4.5, we see that the expected weighted regret is bounded by a sum of three terms. When the regularization is large enough, the first term is of the claimed order. For the third term, because the model is learned online using FTL with strongly convex costs, it is bounded by the regret of online model learning. Substituting this bound back proves the theorem. ∎

The main assumption made in Theorem 4.3 is that $\nabla_2\hat{F}$ is $L$-Lipschitz continuous (Assumption 4.1). This condition is practical, as we are free to choose the model class $\hat{\mathcal{F}}$ and the per-round costs for model learning. Compared with Theorem 4.2, Theorem 4.3 considers the inexactness of $\tilde{\ell}_n$ and $\tilde{g}_n$ explicitly; hence the additional terms due to the gradient errors in Assumption 4.3. Under the same assumption as MoBIL-VI, namely that $\ell_n$ and $g_n$ are directly available, we can actually show that the simple MoBIL-Prox has the same performance as MoBIL-VI, which is a corollary of Theorem 4.3.

Corollary 4.1.

If and , for MoBIL-Prox with , .

It is worth noting that the variances of the stochastic gradients in Assumption 4.3 should be considerably smaller than the upper bound on the gradients in Theorem 2.1. Therefore, in general, MoBIL-Prox has a better upper bound than DAgger. Finally, we see that, as the convergence result in Theorem 4.3 is based on the assumption that model learning also has no regret, the FTL update rule (line 4) can be replaced by a no-regret first-order method without changing the result. This would make the algorithm even simpler to implement. In addition, it is interesting to note that the accelerated convergence is made possible when model learning puts more weight on the costs in later rounds (because the weights $w_n$ increase with $n$).

4.4 Model Learning

So far we have stated model learning rather abstractly: it only requires $g_n$ to be an upper bound of the model error. Now we give a particular example of $g_n$ and $\tilde{g}_n$ in terms of KL divergence, and consider learning a transition model online that induces a bivariate function of the same form as (6), where $\mathcal{M}$ is the class of transition models. Specifically, let $D_{\mathrm{KL}}$ denote the KL divergence and let $d_{\pi|\hat{P}}$ be the generalized stationary distribution (cf. (1)) generated by running policy $\pi$ under a transition model $\hat{P}\in\mathcal{M}$. We define, for $\hat{P}\in\mathcal{M}$, the induced bivariate function $\hat{F}(\pi',\pi) \coloneqq \mathbb{E}_{s\sim d_{\pi'|\hat{P}}}\big[D(\pi^\star_s, \pi_s)\big]$.

We show that the error of $\hat{F}$ can be bounded by the KL-divergence error of $d_{\pi|\hat{P}}$.

Lemma 4.6. Assume $D(\pi^\star_\cdot, \pi_\cdot)$ is Lipschitz continuous with respect to its state argument. Then $|F(\pi',\pi) - \hat{F}(\pi',\pi)|$ can be bounded in terms of the KL divergence between $d_{\pi'}$ and $d_{\pi'|\hat{P}}$.

Directly minimizing the marginal KL divergence is a nonconvex problem and requires backpropagation through time. To make the problem simpler, we further upper bound it in terms of the KL divergence between the true and the modeled transition probabilities.

To make the problem concrete, here we consider $T$-horizon RL problems.

Proposition 4.4. For a $T$-horizon problem with transition dynamics $P$, let $\hat{P}$ be the modeled dynamics. Then the marginal KL divergence above can be upper bounded, up to a constant, by the expected KL divergence between the transition probabilities, $\mathbb{E}_{(s,a)\sim d_{\pi}}\big[D_{\mathrm{KL}}\big(P(\cdot|s,a)\,\|\,\hat{P}(\cdot|s,a)\big)\big]$.

Therefore, we can simply take $g_n(\hat{P}) \coloneqq \mathbb{E}_{(s,a)\sim d_{\pi_n}}\big[D_{\mathrm{KL}}\big(P(\cdot|s,a)\,\|\,\hat{P}(\cdot|s,a)\big)\big]$ as the upper bound in Proposition 4.4, and $\tilde{g}_n$ as its empirical approximation obtained by sampling state-action transition triples through running policy $\pi_n$ in the real environment. This construction agrees with the causal relationship assumed in Section 3.2.1.
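As an illustration of this construction, one per-round model update could minimize a weighted empirical negative log-likelihood over the sampled transition triples (for a Gaussian transition model with fixed covariance this coincides, up to constants, with the empirical KL objective above). The buffer layout, the feature map, and the Gaussian model below are illustrative assumptions rather than the paper's implementation.

import numpy as np

def fit_gaussian_dynamics(buffer, feature_fn, reg=1e-3):
    """Fit s' ~ N(W^T phi(s, a), sigma^2 I) by weighted ridge regression.

    buffer: list of (weight, s, a, s_next), where weight is the w_n of the
            round in which the transition was collected.
    Refitting on the whole weighted buffer is one FTL step on the model
    per-round costs (line 4 of Algorithm 1), up to constants.
    """
    Phi, Y, w = [], [], []
    for weight, s, a, s_next in buffer:
        Phi.append(feature_fn(s, a))
        Y.append(s_next)
        w.append(weight)
    Phi, Y, w = np.asarray(Phi), np.asarray(Y), np.asarray(w)
    # Weighted least squares: argmin_W sum_i w_i ||Y_i - Phi_i W||^2 + reg*||W||^2
    A = Phi.T @ (w[:, None] * Phi) + reg * np.eye(Phi.shape[1])
    B = Phi.T @ (w[:, None] * Y)
    W = np.linalg.solve(A, B)
    residuals = Y - Phi @ W
    sigma2 = np.average(np.sum(residuals**2, axis=1), weights=w) / Y.shape[1]
    return W, sigma2

As noted in Section 4.3, replaying the whole weighted buffer is not essential: a weighted no-regret first-order update on the newest batch would also suffice.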

(a) linear, cvx   (b) linear, scvx   (c) MLP, cvx   (d) MLP, scvx

Figure 1: Results on CPB: accumulated rewards (y-axis) versus iteration index (x-axis). The blue and green lines are the baselines: the light lines are the state-of-the-art IL algorithm (DAggereD (Cheng et al., 2018)), whereas the dark lines are MoBIL-Prox with the true model. Dark red and light red are MoBIL-Prox with online learned models. From left to right, the function approximator and the weighting scheme vary as indicated in the panel labels, and the regularization is set to match the peak learning rates to facilitate a fair comparison (see Appendix C).

5 Experiments

To empirically validate MoBIL-Prox, we conducted experiments on a Cart-Pole Balancing (CPB) task (see Appendix C for details). We study the sample efficiency of MoBIL-Prox with different access to the dynamics model: (a) no access, (b) the true model, or (c) online learned models; and with different rates for the weights $w_n$. For regularization, we set $r_n$ in terms of two hyperparameters, which together determine an effective learning rate. We consider two schemes, (scvx) and (cvx), which use the optimal rates for the cases where the per-round cost is strongly convex and convex, respectively (see Appendix G).

Figure 1 shows the results (averaged over multiple random seeds) for linear and multi-layer perceptron (MLP) approximators (used for the expert, the learner, and the dynamics model). The numbers of samples collected from the environment and from the model (if a model is present) in each round are the same. We can observe that, with uniform weights, having model information does not improve the performance over standard online IL. This is predicted by Proposition 4.2, which says that with uniform weights the improvement is only a negligible factor. When the nonuniform weights suggested by Theorem 4.3 are used, MoBIL-Prox improves the convergence (especially with the scvx rate). This is a highly non-trivial result, because from the learning-rate perspective, both choices of weights lead to the same order of learning rate.

6 Discussion

We propose two novel model-based IL algorithms with strong theoretical guarantees, which are provably up to an order faster than state-of-the-art IL algorithms. While we proved the performance guarantees under convexity assumptions, we empirically find that MoBIL-Prox improves performance even when using MLP approximators.

MoBIL-Prox is closely related to stochastic Mirror-Prox (Nemirovski, 2004; Juditsky et al., 2011). In particular, when the true dynamics are known and MoBIL-Prox is set to convex mode (see Appendix G), MoBIL-Prox gives the same update rule as stochastic Mirror-Prox with a particular step size (see Appendix H for a thorough discussion). Therefore, MoBIL-Prox can be viewed as a generalization of Mirror-Prox: 1) it allows nonuniform weights, and 2) it allows the vector field to be estimated online by alternately taking stochastic gradients and model-estimated gradients. The design of MoBIL-Prox is made possible by our Stronger FTL Lemma (Lemma 4.3), which greatly simplifies the original algebraic proof proposed by Nemirovski (2004) and Juditsky et al. (2011). Using Lemma 4.3 reveals more closely the interactions between model updates and policy updates. In addition, it shows more clearly the effect of non-uniform weighting, which is essential to achieving the accelerated convergence. To the best of our knowledge, even the analysis of the original (stochastic) Mirror-Prox from the FTL perspective is new.

Acknowledgements

This work was supported in part by NSF NRI Award 1637758 and NSF CAREER Award 1750483.

Appendix A Notation

Symbol: Meaning
$N$: total number of online learning iterations
$J(\pi)$: average accumulated cost, (1)
$d_\pi$: generalized stationary state distribution
$D(\pi^\star_s, \pi_s)$: the difference between the distributions $\pi^\star_s$ and $\pi_s$
$\pi^\star$: expert policy
$\Pi$: hypothesis class of policies
$\pi_n$: the policy run in the environment at the $n$th online learning iteration
$\hat{\mathcal{F}}$: hypothesis class of models (elements denoted as $\hat{F}$)
$\hat{F}_n$: the model at the beginning of the $n$th online learning iteration
$\epsilon_\Pi$: policy class complexity (Definition 4.1)
$\epsilon_{\hat{\mathcal{F}}}$: model class complexity (Definition 4.1)
$F(\pi',\pi)$: bivariate function, (6)
$\ell_n(\pi)$: per-round cost for policy learning, (7)
$\tilde{\ell}_n$: an unbiased estimate of $\ell_n$
$g_n$: per-round cost for model learning, an upper bound of the model error
$\tilde{g}_n$: an unbiased estimate of $g_n$
$\alpha$: $\tilde{\ell}_n$ is $\alpha$-strongly convex (Assumption 4.2)
$G_\ell$: $\|\nabla\tilde{\ell}_n\|_* \le G_\ell$ (Assumption 4.2)
$\beta$: $\tilde{g}_n$ is $\beta$-strongly convex (Assumption 4.2)
$G_g$: $\|\nabla\tilde{g}_n\|_* \le G_g$ (Assumption 4.2)
$L$: Lipschitz constant of $\nabla_2\hat{F}$ (Assumption 4.1)
$\mathrm{regret}^w_N$: weighted average regret (Section 4)
$\mathrm{regret}^w_N(\tilde{\ell})$: weighted average regret, defined in Lemma 3.1
$\{w_n\}$: the sequence of weights used to define the weighted average regret
Table 2: Meaning of Symbols in the Main Text

Appendix B Imitation Learning Objective Function and Choice of Distance

Here we provide a short introduction to the objective function of IL in (2). The idea of IL is based on the Performance Difference Lemma, whose proof can be found, e.g. in (Kakade and Langford, 2002).

Lemma B.1 (Performance Difference Lemma).

Let $\pi$ and $\pi'$ be two policies and let $A^{\pi'}_t(s,a) \coloneqq Q^{\pi'}_t(s,a) - \mathbb{E}_{a'\sim\pi'_s}\big[Q^{\pi'}_t(s,a')\big]$ be the (dis)advantage function with respect to running $\pi'$. Then it holds that

$J(\pi) = J(\pi') + \mathbb{E}_{s\sim d_\pi}\,\mathbb{E}_{a\sim\pi_s}\!\big[ A^{\pi'}_t(s,a) \big].$  (B.1)

Using Lemma B.1, we can relate the performance of the learner's policy $\pi$ and the expert policy $\pi^\star$ as

$J(\pi) = J(\pi^\star) + \mathbb{E}_{s\sim d_\pi}\,\mathbb{E}_{a\sim\pi_s}\!\big[ A^{\pi^\star}_t(s,a) \big] = J(\pi^\star) + \mathbb{E}_{s\sim d_\pi}\Big[ \mathbb{E}_{a\sim\pi_s}\big[ Q^{\pi^\star}_t(s,a) \big] - \mathbb{E}_{a\sim\pi^\star_s}\big[ Q^{\pi^\star}_t(s,a) \big] \Big],$

where the last equality uses the definition of the advantage function and the fact that $\mathbb{E}_{a\sim\pi^\star_s}\big[A^{\pi^\star}_t(s,a)\big] = 0$. Therefore, if the inequality in (3) holds, then minimizing (2) would minimize the performance difference between the policies as in (4).

Intuitively, we can set $D(\pi^\star_s, \pi_s) = \mathbb{E}_{a\sim\pi_s}\big[A^{\pi^\star}_t(s,a)\big]$, and (4) becomes an equality with $C_{\pi^\star} = 1$. This corresponds to the objective function used in AggreVaTe by Ross and Bagnell (2014). However, this choice requires $Q^{\pi^\star}$ to be given as a function or to be estimated online, which may be inconvenient or complicated in some settings.

Therefore, $D$ is usually used to construct a strict upper bound in (4). The choice of $D$ and $C_{\pi^\star}$ is usually derived from some statistical distance, and it depends on the topology of the action space and the policy class $\Pi$. For discrete action spaces, $D$ can be selected as a convex upper bound of the total variation distance between $\pi^\star_s$ and $\pi_s$, and $C_{\pi^\star}$ is a bound on the range of $Q^{\pi^\star}$ (e.g., a hinge loss used by Ross et al. (2011)). For continuous action spaces, $D$ can be selected as an upper bound of the Wasserstein distance between $\pi^\star_s$ and $\pi_s$, and $C_{\pi^\star}$ is the Lipschitz constant of $Q^{\pi^\star}$ with respect to the action (Pan et al., 2017). More generally, for stochastic policies, we can simply set $D$ to the Kullback-Leibler (KL) divergence (e.g. as done by Cheng et al. (2018)), because it upper bounds both the total variation distance and the Wasserstein distance.

Appendix C Experimental Details

Cart-Pole Balancing  The Cart-Pole Balancing task is a classic control problem, in which the goal is to keep the pole balanced in an upright posture with force applied only to the cart. The state and action spaces are both continuous, with dimensions 4 and 1, respectively. The state includes the horizontal position and velocity of the cart, and the angle and angular velocity of the pole. The time horizon of this task is a fixed number of steps, and there is small uniform noise injected into the initial state, with later transitions being deterministic. We used the implementation from OpenAI Gym (Brockman et al., 2016) with the DART physics engine (Lee et al., 2018).

Policy learning  We adopted Gaussian policies in our experiments, i.e. for any state $s$, $\pi_s$ is Gaussian distributed. The mean of $\pi_s$ is modeled either as a linear function or as an MLP, and the covariance matrix of $\pi_s$ is restricted to be diagonal and independent of the state. The experts and the learners are from the same policy class (e.g. MLP expert and MLP learner), and the experts are trained using actor-critic-based policy gradient with ADAM (Kingma and Ba, 2014), in which the value function is estimated by TD(0) learning. After training, the experts achieve the highest accumulated rewards consistently. With regard to the imitation learning loss, we set $D$ in (2) to the KL divergence, which implies that $\tilde{\ell}_n$ is the average KL divergence between the expert's and the learner's action distributions over the states sampled by running $\pi_n$ in the environment. In addition, a running batch normalizer is applied, and to stabilize learning each batch is further weighted (differently in the scvx and cvx schemes) so that the normalizer converges faster than the policy.
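For the Gaussian policies above, the KL imitation loss has a closed form. The sketch below computes it for diagonal covariances; the `gaussian(s)` accessor returning a `(mean, std)` pair and the direction KL(expert || learner) are illustrative assumptions.

import numpy as np

def kl_diag_gaussians(mu0, std0, mu1, std1):
    """KL( N(mu0, diag(std0^2)) || N(mu1, diag(std1^2)) ), summed over action dims."""
    var0, var1 = std0**2, std1**2
    return float(np.sum(
        np.log(std1 / std0) + (var0 + (mu0 - mu1)**2) / (2.0 * var1) - 0.5
    ))

def imitation_loss(states, expert, learner):
    # Empirical loss: mean over visited states of KL(pi*_s || pi_s).
    return float(np.mean([
        kl_diag_gaussians(*expert.gaussian(s), *learner.gaussian(s))
        for s in states
    ]))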

Model learning The model learned online is selected to be deterministic (the true model is deterministic too), and it has the same representation as the policy (e.g. linear model is selected for linear policy). With a similar proof as in Section 4.4, given a batch of transition triples