Model Imitation for Model-Based Reinforcement Learning

Yueh-Hua Wu, Ting-Han Fan, Peter J. Ramadge, Hao Su
National Taiwan University
University of California San Diego
Princeton University

Model-based reinforcement learning (MBRL) aims to learn a dynamic model to reduce the number of interactions with real-world environments. However, due to estimation error, rollouts in the learned model, especially those of long horizon, fail to match the ones in real-world environments. This mismatching has seriously impacted the sample complexity of MBRL. The phenomenon can be attributed to the fact that previous works employ supervised learning to learn the one-step transition models, which has inherent difficulty ensuring the matching of distributions from multi-step rollouts. Based on the claim, we propose to learn the synthesized model by matching the distributions of multi-step rollouts sampled from the synthesized model and the real ones via WGAN. We theoretically show that matching the two can minimize the difference of cumulative rewards between the real transition and the learned one. Our experiments also show that the proposed model imitation method outperforms the state-of-the-art in terms of sample complexity and average return.

1 Introduction

Reinforcement learning (RL) has attracted great interest because many real-world problems can be modeled as sequential decision-making problems. Model-free reinforcement learning (MFRL) is favored for its capability of learning complex tasks when interactions with environments are cheap. However, in the majority of real-world problems, such as autonomous driving, interactions are extremely costly, and MFRL becomes infeasible. One critique of MFRL is that it does not fully exploit past queries to the environment, which motivates us to consider model-based reinforcement learning (MBRL). In addition to learning an agent policy, MBRL also uses the queries to learn the dynamics of the environment that the agent is interacting with. If the learned dynamics are accurate enough, the agent can acquire the desired skill by simply interacting with the simulated environment, so that the number of samples to collect in the real world can be greatly reduced. As a result, MBRL has become one of the possible solutions to reduce the number of samples required to learn an optimal policy.

Most previous works on MBRL adopt supervised learning with $\ell_2$-based errors [luo2018slbo; kurutach18metrpo; clavera2018mbmpo] or maximum likelihood [janner2019trust] to obtain an environment model that synthesizes real transitions. These non-trivial developments imply that optimizing a policy on a synthesized environment is a challenging task. Because the estimation error of the model accumulates as the trajectory grows, it is hard to train a policy on a long synthesized trajectory. On the other hand, training on short trajectories makes the policy short-sighted. This issue is known as the planning horizon dilemma [langlois2019benchmarking]. As a result, despite its intuitive appeal, MBRL has to be designed meticulously.

Intuitively, we would like to learn a transition model in a way that it can reproduce the trajectories that have been generated in the real world. Since the attained trajectories are sampled according to a certain policy, directly employing supervised learning may not necessarily lead to this result, especially when the policy is stochastic. The resemblance in trajectories matters because we estimate the policy gradient by generating rollouts; however, the one-step model learning adopted by many MBRL methods does not guarantee this. Some previous works propose multi-step training [luo2018slbo]; however, experiments show that model learning fails to benefit much from the multi-step loss. We attribute this outcome to the essence of supervised learning, which fundamentally preserves only the one-step transition, so the similarity between real trajectories and the synthesized ones cannot be guaranteed.

In this work, we propose to learn the transition model via distribution matching. Specifically, we use WGAN [wgan] to match the distributions of state-action-next-state triple in real/learned models so that the agent policy can generate similar trajectories when interacting with either the true transition or the learned transition. Figure 1 illustrates the difference between methods based on supervised learning and distribution matching. Different from the ensemble methods proposed in previous works, our method is capable of generalizing to unseen transitions with only one dynamic model because merely incorporating multiple models does not alter the essence that one-step (or few-step) supervised learning fails to imitate the distribution of multi-step rollouts.

Concretely, we gather some transitions in the real world according to a policy. To learn the real transition, we then sample fake transitions from our synthesized model with the same policy. The synthesized model serves as the generator in the WGAN framework, and a critic discriminates between the two sets of transition data. We update the generator and the critic alternately until the synthesized data cannot be distinguished from the real data, which, as we show later, theoretically gives $T = T'$, where $T$ and $T'$ denote the true and synthesized transitions.

Our contributions are summarized below:

  • We propose an MBRL method called model imitation (MI), which enforces the learned transition model to generate similar rollouts to the real one so that policy gradient is accurate;

  • We theoretically show that the transition can be learned by MI, in the sense that the learned transition matches the true one by consistency and the difference in cumulative rewards between the two is small;

  • To stabilize model learning, we derive a guarantee for our sampling technique and investigate training across WGANs;

  • We experimentally show that MI is more sample efficient than state-of-the-art MBRL and MFRL methods and outperforms them on four standard tasks.

Figure 1: Distribution matching enables the learned transition to generate similar rollouts to the real ones even when the policy is stochastic or the initial states are close. On the other hand, training with supervised learning does not ensure rollout similarity and the resulting policy gradient may be inaccurate. This figure considers a fixed policy sampling in the real world and a transition model.

2 Related work

In this section, we introduce our motivation inspired by learning from demonstration (LfD) [schaal1997learning] and give a brief survey of MBRL methods.

2.1 Learning from Demonstration

A straightforward approach to LfD is to leverage behavior cloning (BC), which reduces LfD to a supervised learning problem. Even though learning a policy via BC is time-efficient, it cannot imitate a policy without sufficient demonstration because the error may accumulate without the guidance of an expert [ross2011dagger]. Generative Adversarial Imitation Learning (GAIL) [ho2016generative] is another state-of-the-art LfD method that learns an optimal policy by utilizing generative adversarial training to match the occupancy measure [syed2008apprenticeship]. GAIL learns an optimal policy by matching the distribution of the trajectories generated from an agent policy with the distribution of the given demonstration. ho2016generative shows that the two distributions match if and only if the agent has learned the optimal policy. One of the advantages of GAIL is that it only requires a small amount of demonstration data to obtain an optimal policy, but it requires a considerable number of interactions with environments for the generative adversarial training to converge.

Our intuition is to draw an analogy between transition learning (TL) and learning from demonstration (LfD). In LfD, trajectories sampled from a fixed transition are given, and the goal is to learn a policy. In TL, on the other hand, trajectories sampled from a fixed policy are given, and we would like to imitate the underlying transition. In other words, from LfD to TL, we interchange the roles of the policy and the transition. It is therefore tempting to study the counterpart of GAIL in TL; i.e., learning the transition by distribution matching. Fortunately, by doing so, the pros of GAIL remain while the cons are insubstantial in MBRL because sampling with the learned model is considered to be much cheaper than sampling in the real one. That GAIL learns a better policy than BC does suggests that distribution matching possesses the potential to learn a better transition than supervised learning.

2.2 Model-Based Reinforcement Learning

A deterministic transition is usually optimized with an $\ell_2$-based error. Nagabandi18, an approach that uses supervised learning with mean-squared error as its objective, is shown to perform well under fine-tuning. To alleviate model bias, some previous works adopt ensembles [kurutach18metrpo; jacob2018steve], where multiple transition models with different initializations are trained at the same time. In a slightly more complicated manner, clavera2018mbmpo utilizes meta-learning to gather information from multiple models. Lastly, on the theoretical side, SLBO [luo2018slbo] is the first algorithm developed from solid theoretical properties for model-based deep RL via a joint model-policy optimization framework.

For the stochastic transition, maximum likelihood estimator or moment matching are natural ways to learn a synthesized transition, which is usually modeled by the Gaussian distribution. Following this idea, Gaussian process [Kupcsik2013gauss; Deisenroth2015gauss] and Gaussian process with model predictive control [Kamthe2017gauss] are introduced as an uncertainty-aware version of MBRL. Similar to the deterministic case, to mitigate model bias and foster stability, an ensemble method for probabilistic networks [kurtland2018pets] is also studied. An important distinction between training a deterministic or stochastic transition is that although the stochastic transition can model the noise hidden within the real world, the stochastic model may also induce instability if the true transition is deterministic. This is a potential reason why an ensemble of models is adopted to reduce variance.

3 Background

3.1 Reinforcement Learning

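We consider the standard Markov Decision Process (MDP) [sutton1998introduction]. An MDP is represented by a tuple $\langle S, A, T, r, \gamma \rangle$, where $S$ is the state space, $A$ is the action space, $T(s_{t+1} \mid s_t, a_t)$ is the transition density of state $s_{t+1}$ at time step $t+1$ given action $a_t$ made under state $s_t$, $r(s, a)$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor.

A stochastic policy $\pi(a \mid s)$ is a density of action $a$ given state $s$. Let the initial state distribution be $\alpha$. The performance of the triple $(\alpha, \pi, T)$ is evaluated in the expectation of the cumulative reward in the $\gamma$-discounted infinite horizon setting:

$$R(\alpha, \pi, T) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, \alpha, \pi, T\Big] = \mathbb{E}\Big[\sum_{t=0}^{H-1} r(s_t, a_t) \,\Big|\, \alpha, \pi, T\Big]. \tag{1}$$

Equivalently, $R(\alpha, \pi, T)$ is the expected cumulative reward in a length-$H$ trajectory $\{s_t, a_t\}_{t=0}^{H-1}$ generated by $(\alpha, \pi, T)$ with $H \sim \mathrm{Geometric}(1-\gamma)$. When $\alpha$ and $T$ are fixed, $R(\cdot)$ becomes a function that only depends on $\pi$, and reinforcement learning algorithms [sutton1998introduction] aim to find a policy $\pi$ to maximize $R(\pi)$.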
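The equivalence between the discounted infinite-horizon return and the undiscounted return over a geometrically distributed horizon can be checked numerically. The sketch below is only an illustration under stated assumptions: the reward sequence is an arbitrary made-up example that is zero after its last entry, and the function names are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum_t gamma^t * r_t for a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def expected_geometric_return(rewards, gamma):
    """E over H ~ Geometric(1 - gamma) of sum_{t < H} r_t.

    Assumes rewards are zero after len(rewards) steps.
    """
    n = len(rewards)
    prefix = [0.0]
    for r in rewards:
        prefix.append(prefix[-1] + r)
    total = 0.0
    # P(H = h) = (1 - gamma) * gamma^(h - 1) for h = 1, 2, ...
    for h in range(1, n + 1):
        total += (1.0 - gamma) * gamma ** (h - 1) * prefix[h]
    # Every horizon longer than n collects the full sum; P(H > n) = gamma^n.
    total += gamma ** n * prefix[n]
    return total


# Made-up reward sequence, for illustration only.
rewards = [1.0, 0.5, -2.0, 3.0, 0.25]
gamma = 0.9
lhs = discounted_return(rewards, gamma)
rhs = expected_geometric_return(rewards, gamma)
```

The two quantities agree because $P(H > t) = \gamma^t$, so each reward $r_t$ is collected with exactly the weight it receives under discounting.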

3.2 Occupancy Measure

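Given initial state distribution $\alpha(s)$, policy $\pi(a \mid s)$ and transition $T(s' \mid s, a)$, the normalized occupancy measure $\rho_T^{\alpha,\pi}(s, a)$ generated by $(\alpha, \pi, T)$ is defined as

$$\rho_T^{\alpha,\pi}(s, a) = \sum_{t=0}^{\infty} (1-\gamma)\gamma^t \, P(s_t = s, a_t = a \mid \alpha, \pi, T), \tag{2}$$

where $P(\cdot)$ is the probability measure and will be replaced by a density function if $S$ or $A$ is continuous. Intuitively, $\rho_T^{\alpha,\pi}(s, a)$ is a distribution of $(s, a)$ in a length-$H$ trajectory $\{s_t, a_t\}_{t=0}^{H-1}$ with $H \sim \mathrm{Geometric}(1-\gamma)$ following the laws of $(\alpha, \pi, T)$. From Syed08, the relation between $\rho_T^{\alpha,\pi}$ and $(\alpha, \pi, T)$ is characterized by the Bellman flow constraint. Specifically, $\rho = \rho_T^{\alpha,\pi}$ as defined in Eq. 2 is the unique solution to:

$$\rho(s, a) = \pi(a \mid s)\Big[(1-\gamma)\,\alpha(s) + \gamma \int \rho(s', a')\, T(s \mid s', a')\, ds'\, da'\Big], \qquad \rho(s, a) \ge 0. \tag{3}$$

In addition, Theorem 2 of Syed08 gives that $\pi(a \mid s)$ and $\rho_T^{\alpha,\pi}(s, a)$ have a one-to-one correspondence with $\alpha(s)$ and $T(s' \mid s, a)$ fixed; i.e., $\pi(a \mid s) = \rho(s, a) / \int \rho(s, a')\, da'$ is the only policy whose occupancy measure is $\rho$.

With the occupancy measure, the cumulative reward Eq. 1 can be represented as

$$R(\alpha, \pi, T) = \frac{1}{1-\gamma}\, \mathbb{E}_{(s,a) \sim \rho_T^{\alpha,\pi}}\big[r(s, a)\big]. \tag{4}$$

The goal of maximizing the cumulative reward can then be achieved by adjusting $\rho_T^{\alpha,\pi}$, and this motivates us to adopt distribution matching approaches like WGAN [wgan] to learn a transition model.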
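To make the occupancy measure concrete, the following sketch computes it for a small tabular MDP, checks the Bellman flow constraint, and compares the return computed directly with the return computed from the occupancy measure. The two-state, two-action MDP and all function names are made-up assumptions for illustration; the infinite sums are truncated after a fixed number of steps.

```python
def occupancy_measure(alpha, policy, trans, gamma, iters=2000):
    """rho(s, a) = sum_t (1 - gamma) gamma^t P(s_t = s, a_t = a), truncated."""
    n_s, n_a = len(alpha), len(policy[0])
    state_dist = list(alpha)              # P(s_t = s), starting at t = 0
    rho = [[0.0] * n_a for _ in range(n_s)]
    weight = 1.0 - gamma                  # (1 - gamma) * gamma^t
    for _ in range(iters):
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                p_sa = state_dist[s] * policy[s][a]
                rho[s][a] += weight * p_sa
                for s2 in range(n_s):
                    nxt[s2] += p_sa * trans[s][a][s2]
        state_dist = nxt
        weight *= gamma
    return rho


def flow_residual(rho, alpha, policy, trans, gamma):
    """Largest violation of the Bellman flow constraint over all (s, a)."""
    n_s, n_a = len(alpha), len(policy[0])
    worst = 0.0
    for s in range(n_s):
        inflow = sum(rho[s0][a0] * trans[s0][a0][s]
                     for s0 in range(n_s) for a0 in range(n_a))
        for a in range(n_a):
            rhs = policy[s][a] * ((1.0 - gamma) * alpha[s] + gamma * inflow)
            worst = max(worst, abs(rho[s][a] - rhs))
    return worst


def direct_return(alpha, policy, trans, reward, gamma, iters=2000):
    """sum_t gamma^t E[r(s_t, a_t)], truncated after `iters` steps."""
    n_s, n_a = len(alpha), len(policy[0])
    state_dist, total, disc = list(alpha), 0.0, 1.0
    for _ in range(iters):
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                p_sa = state_dist[s] * policy[s][a]
                total += disc * p_sa * reward[s][a]
                for s2 in range(n_s):
                    nxt[s2] += p_sa * trans[s][a][s2]
        state_dist = nxt
        disc *= gamma
    return total


# Made-up toy MDP (2 states, 2 actions), for illustration only.
gamma = 0.9
alpha = [0.7, 0.3]
policy = [[0.6, 0.4], [0.2, 0.8]]                      # policy[s][a]
trans = [[[0.9, 0.1], [0.3, 0.7]],
         [[0.5, 0.5], [0.1, 0.9]]]                     # trans[s][a][s']
reward = [[1.0, 0.0], [-0.5, 2.0]]                     # reward[s][a]

rho = occupancy_measure(alpha, policy, trans, gamma)
return_from_rho = sum(rho[s][a] * reward[s][a]
                      for s in range(2) for a in range(2)) / (1.0 - gamma)
return_direct = direct_return(alpha, policy, trans, reward, gamma)
residual = flow_residual(rho, alpha, policy, trans, gamma)
```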

4 Theoretical Analysis for WGAN

In this section, we present a consistency result and error bounds for WGAN [wgan]. All proofs of the following theorems and lemmas can be found in Appendix A.

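In the setting of MBRL, the training objective for WGAN is

$$\min_{T'} \max_{\|f\|_L \le 1} \; \mathbb{E}_{(s,a) \sim \rho_T,\; s' \sim T(\cdot \mid s,a)}\big[f(s, a, s')\big] - \mathbb{E}_{(s,a) \sim \rho_{T'},\; s' \sim T'(\cdot \mid s,a)}\big[f(s, a, s')\big], \tag{5}$$

where $\rho_T$ and $\rho_{T'}$ denote the normalized occupancy measures under the true transition $T$ and the synthesized transition $T'$. By Kantorovich-Rubinstein duality [Villani2008_opt_transport], the optimal value of the inner maximization is exactly $W_1(p, p')$, where $p(s, a, s') = \rho_T(s, a)\, T(s' \mid s, a)$ is the discounted distribution of $(s, a, s')$ and $p'$ is its counterpart under $T'$. Thus, by minimizing over the choice of $T'$, we are essentially finding $p'$ that minimizes $W_1(p, p')$, which gives the consistency result.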

Proposition 4.1 (Consistency for WGAN).

Let $T$ and $T'$ be the true and synthesized transitions respectively. If WGAN is trained to its optimal point, we have

$$T(s' \mid s, a) = T'(s' \mid s, a), \quad \forall (s, a) \in \mathrm{Supp}(\rho_T),$$

where $\mathrm{Supp}(\rho_T)$ is the support of the occupancy measure $\rho_T$ under the true transition.

The support constraint is inevitable because the training data is sampled from $\rho_T$ and guaranteeing anything beyond it can be difficult. Still, we will empirically show that the support constraint is not an issue in our experiments because the performance improves quickly in the beginning, indicating that $\mathrm{Supp}(\rho_T)$ may be large enough initially.

Now that training with WGAN gives a consistent estimate of the true transition, it is sensible to train a synthesized transition upon it. However, the consistency result is too restrictive as it only discusses the optimal case. The next step is to analyze the non-optimal situation and observe how the cumulative reward deviates w.r.t. the training error.

Theorem 4.2 (Error Bound for WGAN).

Let $\rho_T(s, a)$, $\rho_{T'}(s, a)$ be the normalized occupancy measures generated by the true transition $T$ and the synthesized one $T'$. If the reward function is $L_r$-Lipschitz and the training error of WGAN is $\epsilon$, we have $|R(\pi, T) - R(\pi, T')| \le \epsilon L_r / (1-\gamma)$.

Theorem 4.2 indicates that if WGAN is trained properly, i.e., having small $\epsilon$, the cumulative reward on the synthesized trajectory will be close to that on the true trajectory. As MBRL aims to train a policy on the synthesized trajectory, the accuracy of the cumulative reward over the synthesized trajectory is thus the bottleneck. Theorem 4.2 also implies that WGAN's error is linear in the (expected) length of the trajectory $(1-\gamma)^{-1}$. This is in sharp contrast to the error bounds in most RL literature, where the dependency on the trajectory length is usually quadratic [syed2010; ross2011dagger], or of even higher order. Since WGAN gives us a better estimation of the cumulative reward in the learned model, the policy update becomes more accurate.
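The mechanism behind this kind of bound, namely that a Lipschitz reward cannot differ in expectation by more than its Lipschitz constant times the 1-Wasserstein distance, can be illustrated on a one-dimensional toy example. This is a sketch under assumptions, not the paper's experiment: the samples stand in for state-action pairs, the reward is a made-up Lipschitz function, and the closed form for the 1-Wasserstein distance used here holds only for two empirical distributions on the real line with equally many samples.

```python
def w1_equal_size(xs, ys):
    """1-Wasserstein distance between two 1-D empirical distributions
    with the same number of samples (match sorted samples pairwise)."""
    assert len(xs) == len(ys)
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


def reward(x, lipschitz):
    """A made-up reward that is `lipschitz`-Lipschitz in x."""
    return lipschitz * abs(x - 0.3)


def mean(values):
    return sum(values) / len(values)


gamma = 0.99
L_r = 2.0
# Toy samples standing in for (s, a) pairs from the real and learned models.
real = [0.0, 0.2, 0.5, 0.9, 1.4]
synth = [0.4, 0.5, 0.8, 1.0, 1.6]

eps = w1_equal_size(real, synth)
gap = abs(mean([reward(x, L_r) for x in real])
          - mean([reward(x, L_r) for x in synth])) / (1.0 - gamma)
bound = eps * L_r / (1.0 - gamma)
```

In this example `gap` stays below `bound`, and both scale with $(1-\gamma)^{-1}$, matching the linear dependence on the expected trajectory length.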

5 Model Imitation for Model-Based Reinforcement Learning

In this section, we present a practical MBRL method called model imitation (MI) that incorporates the transition learning mentioned in Section 4.

5.1 Sampling Technique for Transition Learning

Because long synthesized trajectories drift away from the real ones, it is hard to train the WGAN directly from a long synthesized trajectory. To tackle this issue, we use the synthesized transition to sample short trajectories with initial states sampled from the true trajectory.

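To analyze this sampling technique, let $\beta$ be the discount factor of the short trajectories so that the expected length is $(1-\beta)^{-1}$. Let $\rho^{\beta}_{T'}$, $\hat{\rho}^{\beta}_{T}$, $\rho^{\beta}_{T}$, $\rho_{T}$ be the normalized occupancy measures of synthesized short trajectories, empirical true short trajectories, true short trajectories and the true long trajectories. By the triangle inequality, the 1-Wasserstein distance can be bounded by

$$W_1(\rho^{\beta}_{T'}, \rho_{T}) \le W_1(\rho^{\beta}_{T'}, \hat{\rho}^{\beta}_{T}) + W_1(\hat{\rho}^{\beta}_{T}, \rho^{\beta}_{T}) + W_1(\rho^{\beta}_{T}, \rho_{T}).$$

The first term is upper bounded by the training error of WGAN on short trajectories, which can be small empirically because the short ones are easier to imitate. The second term is bounded by Canas2012wdist_bound and Lemma A.3, with a rate that depends on the dimension $d$ of $(s, a)$. The third term is bounded by Lemma A.4 and [dist_bounds] in terms of the diameter of the state-action space. The second term encourages $\beta$ to be large while the third term does the opposite. Besides, $\beta$ need not be large if the number of sampled short trajectories is large enough; in practice we may sample several short trajectories to reduce the error of the second term. Finally, since $\rho^{\beta}_{T'}$ is the occupancy measure we train on, from the proof of Theorem 4.2 we deduce that

$$|R(\pi, T) - R(\pi, T')| \le \frac{L_r}{1-\gamma}\, W_1(\rho^{\beta}_{T'}, \rho_{T}).$$

Thus, WGAN may perform better under this sampling technique.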
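The sampling technique can be sketched as follows: each short rollout starts from a state drawn from the real data and continues in the learned model with probability equal to the short-horizon discount factor, so its length is geometrically distributed. The scalar toy dynamics, the toy policy, and all names (`sample_short_rollouts`, `max_len`) are hypothetical stand-ins, not the paper's implementation.

```python
import random


def sample_short_rollouts(real_states, policy, model, beta, n_rollouts, rng,
                          max_len=1000):
    """Sample rollouts in the learned model, starting from real states.

    Each rollout has length L with P(L = l) = (1 - beta) * beta^(l - 1),
    capped at max_len, so the expected length is about 1 / (1 - beta).
    """
    rollouts = []
    for _ in range(n_rollouts):
        s = rng.choice(real_states)        # initial state from real data
        rollout = []
        while True:
            a = policy(s)
            s_next = model(s, a)           # step in the synthesized model
            rollout.append((s, a, s_next))
            s = s_next
            # Continue with probability beta, stop otherwise.
            if len(rollout) >= max_len or rng.random() >= beta:
                break
        rollouts.append(rollout)
    return rollouts


# Toy deterministic stand-ins for the policy and the learned transition.
def toy_policy(s):
    return -0.1 * s


def toy_model(s, a):
    return 0.9 * s + a


rng = random.Random(0)
real_states = [0.5, -1.0, 2.0, 0.0]
rollouts = sample_short_rollouts(real_states, toy_policy, toy_model,
                                 beta=0.8, n_rollouts=50, rng=rng)
```

The collected triples would then serve as the synthesized samples when training the WGAN critic.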

5.2 Empirical Transition Learning

To learn the real transition based on the occupancy measure matching mentioned in Section 4, we employ a transition learning scheme that aligns the distribution of $(s, a, s')$ between the real and the learned environments. Inspired by how GAIL [ho2016generative] learns to align $(s, a)$ via solving an MDP with rewards extracted from a discriminator, we formulate an MDP with rewards from a discriminator over $(s, a, s')$. Specifically, the WGAN critic $f(s, a, s')$ in Eq. 5 is used as the (pseudo) reward of our MDP. Interestingly, there is a duality between GAIL and our transition learning: for GAIL, the transition is fixed and the objective is to train a policy to maximize the cumulative pseudo rewards, while for our transition learning, the policy is fixed and the objective is to train a synthesized transition to maximize the cumulative pseudo rewards.

In practice, since the policy is updated alternately with the synthesized model, we are required to train a number of WGANs as the policy changes. Although the generators across WGANs correspond to the same transition and can be similar, we observe that WGAN may get stuck at a local optimum when we switch from one WGAN training to another. The reason is that, unlike GAN, which mimics the Jensen-Shannon divergence so that its inner maximization is upper bounded by $0$, WGAN mimics the Wasserstein distance and its inner maximization is unbounded from above. Intuitively, such unboundedness makes the WGAN critic so strong that the WGAN generator (the synthesized transition) cannot find a way out and gets stuck at a local optimum. We therefore modify the WGAN objective to alleviate this situation. To ensure boundedness, for a fixed $\delta > 0$, we introduce cut-offs in the WGAN objective so that the inner maximization is upper bounded by $2\delta$:

$$\min_{T'} \max_{\|f\|_L \le 1} \; \mathbb{E}_{(s,a,s') \sim p}\big[\min(\delta, f(s, a, s'))\big] + \mathbb{E}_{(s,a,s') \sim p'}\big[\min(\delta, -f(s, a, s'))\big], \tag{6}$$

where $p$ and $p'$ denote the distributions of $(s, a, s')$ under the true and synthesized transitions. As $\delta \to \infty$, Eq. 6 recovers the WGAN objective, Eq. 5. Therefore, this is a truncated version of WGAN. To comprehend Eq. 6 further, notice that, since $\min(\delta, x) = \delta - \max(0, \delta - x)$, it is equivalent to

$$\max_{T'} \min_{\|f\|_L \le 1} \; \mathbb{E}_{(s,a,s') \sim p}\big[\max(0, \delta - f(s, a, s'))\big] + \mathbb{E}_{(s,a,s') \sim p'}\big[\max(0, \delta + f(s, a, s'))\big], \tag{7}$$

which is a hinge loss version of the generative adversarial objective. Such a WGAN is introduced in lim2017geometric, where the consistency result is provided, and further experiments are evaluated in zhang2018self. According to lim2017geometric, the inner minimization can be interpreted as the soft-margin SVM. Consequently, it provides a geometric intuition of maximizing the margin, which potentially enhances robustness. Finally, because the objective of transition learning is to maximize the cumulative pseudo rewards on the MDP, $T'$ does not directly optimize Eq. 7. Note that the truncation only takes part in the inner minimization:

$$\min_{\|f\|_L \le 1} \; \mathbb{E}_{(s,a,s') \sim p}\big[\max(0, \delta - f(s, a, s'))\big] + \mathbb{E}_{(s,a,s') \sim p'}\big[\max(0, \delta + f(s, a, s'))\big], \tag{8}$$

which gives us a WGAN critic $f(s, a, s')$. As mentioned, $f$ will be the pseudo reward function. Later, we will introduce a transition learning version of PPO [schulman2017proximal] to optimize the cumulative pseudo reward.
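The relation between the truncated objective and the hinge loss can be checked on plain lists of critic outputs. The sketch assumes one particular form of the truncation, namely clipping the critic at $\delta$ on real samples and its negation at $\delta$ on synthesized samples; the function names and the sample values are made up for illustration, and the Lipschitz constraint on the critic is not modeled here.

```python
def truncated_wgan_value(f_real, f_fake, delta):
    """E_real[min(delta, f)] + E_fake[min(delta, -f)] (assumed truncation)."""
    real = sum(min(delta, f) for f in f_real) / len(f_real)
    fake = sum(min(delta, -f) for f in f_fake) / len(f_fake)
    return real + fake


def hinge_critic_loss(f_real, f_fake, delta):
    """E_real[max(0, delta - f)] + E_fake[max(0, delta + f)]."""
    real = sum(max(0.0, delta - f) for f in f_real) / len(f_real)
    fake = sum(max(0.0, delta + f) for f in f_fake) / len(f_fake)
    return real + fake


# Made-up critic outputs on real and synthesized (s, a, s') triples.
f_real = [1.5, 0.2, -0.4, 3.0]
f_fake = [-2.0, 0.1, -0.7, 0.9]
delta = 1.0

value = truncated_wgan_value(f_real, f_fake, delta)
hinge = hinge_critic_loss(f_real, f_fake, delta)
```

Because `value` equals `2 * delta - hinge`, maximizing the truncated objective over the critic is the same as minimizing the hinge loss, and the truncated objective never exceeds `2 * delta`.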

Initialize the policy, the transition model, the WGAN critic, and the environment dataset
for each iteration do
     Take actions in the real environment according to the policy
     Pre-train the transition model and the WGAN critic by optimizing Eq. 8 and 11 on the collected data
     for each outer epoch do
         for each model epoch do
              Optimize Eq. 8 and 11 over the WGAN critic and the transition model
         for each policy epoch do
              Update the policy by TRPO on the data generated by the transition model
Algorithm 1 Model Imitation for Model-Based Reinforcement Learning
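The control flow of Algorithm 1 can be sketched as a nested loop. This is a structural sketch only: the callables (`collect_real`, `update_model_and_critic`, `update_policy`) and the loop-count names (`n_iters`, `n_outer`, `n_model`, `n_policy`) are hypothetical placeholders, and the stubs below merely count calls instead of performing any learning.

```python
def model_imitation(collect_real, update_model_and_critic, update_policy,
                    n_iters, n_outer, n_model, n_policy):
    """Skeleton of the alternating model/policy updates."""
    dataset = []
    for _ in range(n_iters):
        # Gather real transitions with the current policy.
        dataset.extend(collect_real())
        # Pre-train the transition model and the critic on the real data.
        update_model_and_critic(dataset)
        for _ in range(n_outer):
            for _ in range(n_model):
                # Critic step and transition-model step.
                update_model_and_critic(dataset)
            for _ in range(n_policy):
                # Policy step on rollouts generated by the learned model.
                update_policy()
    return dataset


# Counting stubs, used only to exercise the control flow.
counts = {"collect": 0, "model": 0, "policy": 0}


def stub_collect():
    counts["collect"] += 1
    return [("s", "a", "s_next")]


def stub_model_update(dataset):
    counts["model"] += 1


def stub_policy_update():
    counts["policy"] += 1


data = model_imitation(stub_collect, stub_model_update, stub_policy_update,
                       n_iters=3, n_outer=2, n_model=4, n_policy=5)
```

Taking several model-policy update pairs per batch of real data is the design choice inherited from SLBO; the actual loop counts are hyper-parameters.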

After modifying the WGAN objective, to include both the stochastic and (approximately) deterministic scenarios, the synthesized transition is modeled by a Gaussian distribution over the next state conditioned on $(s, a)$. Although the underlying transitions of tasks like MuJoCo [todorov2012mujoco] are deterministic, modeling them with a Gaussian does not harm the transition learning empirically.

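Recall that the synthesized transition is trained on an MDP whose reward function is the critic of the truncated WGAN. To achieve this goal with proper stability, we employ PPO [schulman2017proximal], which is an efficient approximation of TRPO [schulman2015trust]. Although PPO was originally designed for policy optimization, it can be adapted to transition learning with a fixed sampling policy by applying the PPO clipped objective (Eq. 7 of schulman2017proximal) to the transition density instead of the policy density:

$$L^{\mathrm{PPO}} = \hat{\mathbb{E}}_t\Big[\min\big(r_t \hat{A}_t,\; \mathrm{clip}(r_t, 1-\epsilon_{\mathrm{clip}}, 1+\epsilon_{\mathrm{clip}})\, \hat{A}_t\big)\Big], \qquad r_t = \frac{T'(s_{t+1} \mid s_t, a_t)}{T'_{\mathrm{old}}(s_{t+1} \mid s_t, a_t)},$$

where $T'_{\mathrm{old}}$ is the synthesized transition before the update, $\hat{A}_t$ is the advantage estimated from the pseudo rewards, and $\epsilon_{\mathrm{clip}}$ is the clipping parameter.
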
To enhance the stability of transition learning, in addition to PPO, we also optimize maximum likelihood, which can be regarded as a regularization. We empirically observe that jointly optimizing both maximum likelihood and the PPO objective attains a better transition model for policy gradient. The overall loss of transition learning (Eq. 11) therefore combines the PPO objective with the loss of MLE, which is policy-agnostic and can be estimated with all collected real transitions. For more implementation details, please see Appendix B.1.
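A minimal sketch of how a clipped surrogate and a maximum-likelihood term could be combined for a one-dimensional Gaussian transition is given below. It is an illustration under assumptions rather than the paper's loss: the probability ratio is taken between the new and old transition densities, the advantages are assumed to be precomputed from the pseudo rewards, and the weighting `mle_coef` and the default `clip_eps` are hypothetical.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) at x."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(std)
            - 0.5 * ((x - mean) / std) ** 2)


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped objective, averaged over samples (to be maximized)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)          # T'_new(s'|s,a) / T'_old(s'|s,a)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def gaussian_nll(next_states, means, stds):
    """Average negative log-likelihood of real next states (MLE loss)."""
    logps = [gaussian_logpdf(x, m, s)
             for x, m, s in zip(next_states, means, stds)]
    return -sum(logps) / len(logps)


def transition_loss(ppo_objective, mle_loss, mle_coef=1.0):
    """Hypothetical combination: minimize -PPO objective + weighted MLE loss."""
    return -ppo_objective + mle_coef * mle_loss


# Made-up numbers for illustration only.
logp_old = [gaussian_logpdf(0.1, 0.0, 1.0), gaussian_logpdf(-0.3, 0.0, 1.0)]
logp_new = [gaussian_logpdf(0.1, 0.1, 1.0), gaussian_logpdf(-0.3, -0.2, 1.0)]
advantages = [0.5, -1.0]
ppo = clipped_surrogate(logp_new, logp_old, advantages)
nll = gaussian_nll([0.1, -0.3], [0.1, -0.2], [1.0, 1.0])
loss = transition_loss(ppo, nll)
```

The MLE term depends only on real transitions and not on the sampling policy, which is why it can be estimated from all collected data.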

We consider a training procedure similar to SLBO [luo2018slbo], where they consider the fact that the value function is dependent on the varying transition model. As a result, unlike most of the MBRL methods that have only one pair of model-policy update for each real environment sampling, SLBO proposes to take multiple update pairs for each real environment sampling so that the objective composed of the model loss and the value loss can be optimized. Our proposed model imitation (MI) method is summarized in Algorithm 1.

In the experiment section, we would like to answer the following questions. (1) Does the proposed model imitation outperform the state-of-the-art in terms of sample complexity and average return? (2) Does the proposed model imitation benefit from distribution matching, and is it superior to its model-free and model-based counterparts, TRPO and SLBO?

Figure 2: Learning curves of our MI versus two model-free and four model-based baselines. The solid lines indicate the mean of five trials and the shaded regions suggest standard deviation.

To fairly compare algorithms and enhance reproducibility, we adopt open-sourced environments released along with a model-based benchmark paper [langlois2019benchmarking], which is based on a physical simulation engine, MuJoCo [todorov2012mujoco]. Specifically, we evaluate the proposed algorithm MI on four continuous control tasks: Hopper, HalfCheetah, Ant, and Reacher. For the hyper-parameters mentioned in Algorithm 1 and coefficients such as the entropy regularization coefficient, please refer to Appendix B.2.

We compare to two model-free algorithms, TRPO [schulman2015trust] and PPO [schulman2017proximal], to assess the benefit of utilizing the proposed model imitation, since our MI (Algorithm 1) uses TRPO for policy gradient to update the agent policy. We also compare MI to four model-based methods. SLBO [luo2018slbo] gives a theoretical guarantee of monotonic improvement for model-based deep RL and proposes to update a joint model-policy objective. PETS [kurtland2018pets] proposes to employ uncertainty-aware dynamic models with sampling-based uncertainty to capture both aleatoric and epistemic uncertainty. METRPO [kurutach18metrpo] shows that insufficient data may cause instability and proposes to use an ensemble of models to regularize the learning process. STEVE [jacob2018steve] dynamically interpolates among model rollouts of various horizon lengths and favors those whose estimates have lower error.

Figure 2 shows the learning curves for all methods. In Hopper, HalfCheetah, and Ant, MI converges fairly fast and learns a policy significantly better than competitors’. In Ant, even though MI does not improve the performance too much from the initial one, the fact that it maintains the average return at around 1,000 indicates that MI can capture a better transition than other methods do with only 5,000 transition data. Even though we do not employ an ensemble of models, the curves show that our learning does not suffer from high variance. In fact, the performance shown in Figure 2 indicates that the variance of MI is lower than that of methods incorporating ensembles such as METRPO and PETS.

The questions raised at the beginning of this section can now be answered. The learned model enables TRPO to explore the world without directly accessing real transitions, and therefore TRPO equipped with MI needs far fewer interactions with the real world to learn a good policy. Even though MI is based on the training framework proposed in SLBO, the additional distribution matching component allows the synthesized model to generate rollouts similar to those of the real environments, which empirically gives superior performance because we rely on long rollouts to estimate the policy gradient.

To better understand the performance presented in Figure 2, we further compare MI with bench-marked RL algorithms recorded in langlois2019benchmarking including state-of-the-art MFRL methods such as TD3 [fujimoto2018td3] and SAC [haarnoja2018soft]. It should be noted that the reported results of langlois2019benchmarking are the final performance after 200k time-steps but we only use up to 100k time-steps to train MI. Table 1 indicates that MI significantly outperforms most of the MBRL and MFRL methods with fewer samples, which verifies that MI is more sample-efficient by incorporating distribution matching.

        Hopper   HalfCheetah   Ant     Reacher
MBRL    8/10     10/10         8/10    8/10
MFRL    3/4      2/4           4/4     3/4
Table 1: Proportion of bench-marked RL methods that are inferior to MI in terms of t-test. An entry $x/y$ indicates that among $y$ approaches, MI is significantly better than $x$ approaches. The detailed performance can be found in Table 1 of langlois2019benchmarking. It should be noted that the reported results in langlois2019benchmarking are the final performance after 200k time-steps whereas ours are no more than 100k time-steps.

6 Conclusion

We have pointed out that the state-of-the-art methods concentrate on learning synthesized models in a supervised fashion, which does not guarantee that the policy is able to reproduce a similar trajectory in the learned model and therefore the model may not be accurate enough to estimate long rollouts. We have proposed to incorporate WGAN to achieve occupancy measure matching between the real transition and the synthesized model and theoretically shown that matching indicates the closeness in cumulative rewards between the synthesized model and the real environment.

To enable stable training across WGANs, we have suggested using a truncated version of WGAN to prevent training from getting stuck at local optima. The empirical behavior of WGAN in applications such as imitation learning indicates its potential to learn the transition with fewer samples than supervised learning. We have confirmed this experimentally by showing that MI converges much faster and obtains a better policy than state-of-the-art model-based and model-free algorithms.


Appendix A Proofs

A.1 Proof for WGAN

Proposition A.1 (Consistency for WGAN).

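Let $\alpha$, $\pi$ and $T'$ be the initial state distribution, policy and synthesized transition. Let $T$ be the true transition and $p(s, a, s') = \rho_T(s, a)\, T(s' \mid s, a)$ be the discounted distribution of the triple $(s, a, s')$, with $p'(s, a, s') = \rho_{T'}(s, a)\, T'(s' \mid s, a)$ defined analogously. If the WGAN is trained to its optimal point, we have

$$T(s' \mid s, a) = T'(s' \mid s, a), \quad \forall (s, a) \in \mathrm{Supp}(\rho_T).$$

Proof. Because the loss function of WGAN is the 1-Wasserstein distance, we know $p(s, a, s') = p'(s, a, s')$ at its optimal points. Plugging this into the Bellman flow constraint Eq. (3),

$$\rho_T(s, a) = \pi(a \mid s)\Big[(1-\gamma)\alpha(s) + \gamma \int p(\bar{s}, \bar{a}, s)\, d\bar{s}\, d\bar{a}\Big] = \pi(a \mid s)\Big[(1-\gamma)\alpha(s) + \gamma \int p'(\bar{s}, \bar{a}, s)\, d\bar{s}\, d\bar{a}\Big] = \rho_{T'}(s, a).$$

That is, $\rho_T = \rho_{T'}$. Finally, recalling $p = \rho_T\, T$ and $p' = \rho_{T'}\, T'$, we arrive at $T(s' \mid s, a) = T'(s' \mid s, a)$ for all $(s, a) \in \mathrm{Supp}(\rho_T)$.
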
Theorem A.2 (Two-sided Errors for WGAN).

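Let $\rho_T(s, a)$, $\rho_{T'}(s, a)$ be the normalized occupancy measures generated by the true transition $T$ and the synthesized one $T'$. Suppose the reward is $L_r$-Lipschitz. If the training error of WGAN is $\epsilon$, then $|R(\pi, T) - R(\pi, T')| \le \epsilon L_r / (1-\gamma)$.

Proof. Observe that the occupancy measure $\rho_T(s, a)$ is a marginal distribution of $p(s, a, s') = \rho_T(s, a)\, T(s' \mid s, a)$, and likewise $\rho_{T'}$ is a marginal of $p'$. Because the distance between the marginals is upper bounded by that of the joints, we have

$$W_1(\rho_T, \rho_{T'}) \le W_1(p, p') \le \epsilon,$$

where $W_1$ is the 1-Wasserstein distance. Then, the cumulative reward is bounded by

$$R(\pi, T) - R(\pi, T') = \frac{L_r}{1-\gamma}\Big(\mathbb{E}_{\rho_T}\Big[\frac{r(s,a)}{L_r}\Big] - \mathbb{E}_{\rho_{T'}}\Big[\frac{r(s,a)}{L_r}\Big]\Big) \le \frac{L_r}{1-\gamma} \sup_{\|f\|_L \le 1}\Big(\mathbb{E}_{\rho_T}[f] - \mathbb{E}_{\rho_{T'}}[f]\Big) = \frac{L_r}{1-\gamma}\, W_1(\rho_T, \rho_{T'}),$$

where the first inequality holds because $r(s, a)/L_r$ is 1-Lipschitz and the last equality follows from Kantorovich-Rubinstein duality Villani2008_opt_transport. Since the distance is symmetric, the same conclusion holds if we interchange $T$ and $T'$, so we arrive at

$$|R(\pi, T) - R(\pi, T')| \le \frac{L_r}{1-\gamma}\, W_1(\rho_T, \rho_{T'}) \le \frac{\epsilon L_r}{1-\gamma}.$$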

A.2 Lemmas for Sampling Techniques

Lemma A.3.

Let . If , then .


where Li is the polylogarithm function. From Wood1992polylog, the limiting behavior of it is

where is the gamma function. Since when , we know when , . Finally, since , we conclude that . ∎

Lemma A.4.

Let be a the normalized occupancy measure generated by the triple with discount factor . Let be the normalized occupancy measure generated by the triple with discount factor . If , then .


By definition of the occupancy measure we have

where is the density of at time if generated by the triple . The TV distance is bounded by

where comes from that is a strictly decreasing function. Since , its sign flips from to at some index; say . Finally, the sum of the absolute value are the same from and from because the total probability is conservative, and the difference on one side is the same as that on the other. ∎

Appendix B Experiments

B.1 Implementation Details

We normalize states according to the statistics derived from the first batch of states from the real world. To ensure stability, we maintain the same mean and standard deviation throughout the training process.

Instead of directly predicting the next state, we estimate the state difference [kurutach18metrpo; luo2018slbo]. Since we incorporate state normalization, the transition network is trained to output the normalized state difference.
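The normalization and state-difference target described here can be sketched as follows. The exact target used in the paper is not recoverable from the text, so this sketch assumes the network output is the difference between the normalized next state and the normalized current state; the function names, the `eps` constant, and the sample states are hypothetical.

```python
def fit_normalizer(first_batch, eps=1e-8):
    """Per-dimension mean and std from the first batch of real states.

    The statistics are computed once and kept fixed afterwards.
    """
    n, dim = len(first_batch), len(first_batch[0])
    mean = [sum(s[i] for s in first_batch) / n for i in range(dim)]
    std = [(sum((s[i] - mean[i]) ** 2 for s in first_batch) / n) ** 0.5 + eps
           for i in range(dim)]
    return mean, std


def normalize(state, mean, std):
    return [(x - m) / sd for x, m, sd in zip(state, mean, std)]


def delta_target(state, next_state, std):
    """Assumed regression target: normalized next state minus normalized state."""
    return [(n - s) / sd for s, n, sd in zip(state, next_state, std)]


def reconstruct_next(state, predicted_delta, std):
    """Invert the target to recover a next-state prediction."""
    return [s + d * sd for s, d, sd in zip(state, predicted_delta, std)]


# Made-up states, for illustration only.
first_batch = [[0.0, 1.0], [2.0, 3.0], [4.0, -1.0]]
mean, std = fit_normalizer(first_batch)
s, s_next = [1.0, 2.0], [1.5, 1.0]
target = delta_target(s, s_next, std)
recovered = reconstruct_next(s, target, std)
```

Predicting differences keeps the regression targets small when consecutive states are close, and fixing the statistics avoids a moving target during training.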

To enhance state exploration, we sample real transitions according to a Gaussian policy whose mean is the mean of our Gaussian-parameterized policy and whose standard deviation is fixed. In addition, since we model the transition as a Gaussian distribution, we found that matching with is empirically more stable and more sample-efficient than matching with .

For policy update, it is shown that using the mean of the Gaussian-parameterized transition can accelerate policy optimization and better balance exploration and exploitation.

B.2 Hyperparameters

HalfCheetah Hopper Reacher Ant
1 10
20 60 100 30
horizon for model update 20 10 30
entropy regularization 0.001 0.005
Table 2: List of hyper-parameters adopted in our experiments.