Variational Discriminator Bottleneck:
Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow
Adversarial learning methods have been proposed for a wide range of applications, but the training of adversarial models can be notoriously unstable. Effectively balancing the performance of the generator and discriminator is critical, since a discriminator that achieves very high accuracy will produce relatively uninformative gradients. In this work, we propose a simple and general technique to constrain information flow in the discriminator by means of an information bottleneck. By enforcing a constraint on the mutual information between the observations and the discriminator’s internal representation, we can effectively modulate the discriminator’s accuracy and maintain useful and informative gradients. We demonstrate that our proposed variational discriminator bottleneck (VDB) leads to significant improvements across three distinct application areas for adversarial learning algorithms. Our primary evaluation studies the applicability of the VDB to imitation learning of dynamic continuous control skills, such as running. We show that our method can learn such skills directly from raw video demonstrations, substantially outperforming prior adversarial imitation learning methods. The VDB can also be combined with adversarial inverse reinforcement learning to learn parsimonious reward functions that can be transferred and re-optimized in new settings. Finally, we demonstrate that VDB can train GANs more effectively for image generation, improving upon a number of prior stabilization methods. (Video: https://youtu.be/0qTCNx4AtJU)
1 Introduction

Adversarial learning methods provide a promising approach to modeling distributions over high-dimensional data with complex internal correlation structures. These methods generally use a discriminator to supervise the training of a generator in order to produce samples that are indistinguishable from the data. A particular instantiation is generative adversarial networks, which can be used for high-fidelity generation of images (Goodfellow et al., 2014; Karras et al., 2017) and other high-dimensional data (Vondrick et al., 2016; Xie et al., 2018; Donahue et al., 2018). Adversarial methods can also be used to learn reward functions in the framework of inverse reinforcement learning (Finn et al., 2016a; Fu et al., 2017), or to directly imitate demonstrations (Ho & Ermon, 2016). However, they suffer from major optimization challenges, one of which is balancing the performance of the generator and discriminator. A discriminator that achieves very high accuracy can produce relatively uninformative gradients, but a weak discriminator can also hamper the generator’s ability to learn. These challenges have led to widespread interest in a variety of stabilization methods for adversarial learning algorithms (Arjovsky et al., 2017; Kodali et al., 2017; Berthelot et al., 2017).
In this work, we propose a simple regularization technique for adversarial learning, which constrains the information flow from the inputs to the discriminator using a variational approximation to the information bottleneck. By enforcing a constraint on the mutual information between the input observations and the discriminator’s internal representation, we can encourage the discriminator to learn a representation that has heavy overlap between the data and the generator’s distribution, thereby effectively modulating the discriminator’s accuracy and maintaining useful and informative gradients for the generator. Our approach to stabilizing adversarial learning can be viewed as an adaptive variant of instance noise (Salimans et al., 2016; Sønderby et al., 2016; Arjovsky & Bottou, 2017). However, we show that the adaptive nature of this method is critical. Constraining the mutual information between the discriminator’s internal representation and the input allows the regularizer to directly limit the discriminator’s accuracy, which automates the choice of noise magnitude and applies this noise to a compressed representation of the input that is specifically optimized to model the most discerning differences between the generator and data distributions.
The main contribution of this work is the variational discriminator bottleneck (VDB), an adaptive stochastic regularization method for adversarial learning that substantially improves performance across a range of different application domains, examples of which are available in Figure 1. Our method can be easily applied to a variety of tasks and architectures. First, we evaluate our method on a suite of challenging imitation tasks, including learning highly acrobatic skills from mocap data with a simulated humanoid character. Our method also enables characters to learn dynamic continuous control skills directly from raw video demonstrations, and drastically improves upon previous work that uses adversarial imitation learning. We further evaluate the effectiveness of the technique for inverse reinforcement learning, which recovers a reward function from demonstrations in order to train future policies. Finally, we apply our framework to image generation using generative adversarial networks, where employing VDB improves the performance in many cases.
2 Related Work
Recent years have seen an explosion of adversarial learning techniques, spurred by the success of generative adversarial networks (GANs) (Goodfellow et al., 2014). A GAN framework is commonly composed of a discriminator and a generator, where the discriminator’s objective is to classify samples as real or fake, while the generator’s objective is to produce samples that fool the discriminator. Similar frameworks have also been proposed for inverse reinforcement learning (IRL) (Finn et al., 2016b) and imitation learning (Ho & Ermon, 2016). The training of adversarial models can be extremely unstable, with one of the most prevalent challenges being balancing the interplay between the discriminator and the generator (Berthelot et al., 2017). The discriminator can often overpower the generator, easily differentiating between real and fake samples, thus providing the generator with uninformative gradients for improvement (Che et al., 2016). Alternative loss functions have been proposed to mitigate this problem (Mao et al., 2016; Zhao et al., 2016; Arjovsky et al., 2017). Regularizers have been incorporated to improve stability and convergence, such as gradient penalties (Kodali et al., 2017; Gulrajani et al., 2017a; Mescheder et al., 2018), reconstruction loss (Che et al., 2016), and a myriad of other heuristics (Sønderby et al., 2016; Salimans et al., 2016; Arjovsky & Bottou, 2017; Berthelot et al., 2017). Task-specific architectural designs can also substantially improve performance (Radford et al., 2015; Karras et al., 2017). Similarly, our method also aims to regularize the discriminator in order to improve the feedback provided to the generator. But instead of explicit regularization of gradients or architecture-specific constraints, we apply a general information bottleneck that encourages the discriminator to ignore irrelevant cues, which then allows the generator to focus on improving the most discerning differences between real and fake samples.
Adversarial techniques have also been applied to inverse reinforcement learning (Fu et al., 2017), where a reward function is recovered from demonstrations, which can then be used to train policies to reproduce a desired skill. Finn et al. (2016a) showed an equivalence between maximum entropy IRL and GANs. Similar techniques have been developed for adversarial imitation learning (Ho & Ermon, 2016; Merel et al., 2017), where agents learn to imitate demonstrations without explicitly recovering a reward function. One advantage of adversarial methods is that by leveraging a discriminator in place of a reward function, they can be applied to imitate skills where reward functions can be difficult to engineer. However, the performance of policies trained through adversarial methods still falls short of those produced by manually designed reward functions, when such reward functions are available (Rajeswaran et al., 2017; Peng et al., 2018). We show that our method can significantly improve upon previous works that use adversarial techniques, and produces results of comparable quality to those from state-of-the-art approaches that utilize manually engineered reward functions.
Our variational discriminator bottleneck is based on the information bottleneck (Tishby & Zaslavsky, 2015), a technique for regularizing internal representations to minimize the mutual information with the input. Intuitively, a compressed representation can improve generalization by ignoring irrelevant distractors present in the original input. The information bottleneck can be instantiated in practical deep models by leveraging a variational bound and the reparameterization trick, inspired by a similar approach in variational autoencoders (VAE) (Kingma & Welling, 2013). The resulting variational information bottleneck approximates this compression effect in deep networks (Alemi et al., 2016). Building on the success of VAEs and GANs, a number of efforts have been made to combine the two. Makhzani et al. (2016) used adversarial discriminators during the training of VAEs to encourage the marginal distribution of the latent encoding to be similar to the prior distribution; similar techniques include Mescheder et al. (2017) and Chen et al. (2018). Conversely, Larsen et al. (2016) modeled the generator of a GAN using a VAE. Zhao et al. (2016) used an autoencoder instead of a VAE to model the discriminator, but did not enforce an information bottleneck on the encoding. While instance noise is widely used in modern architectures (Salimans et al., 2016; Sønderby et al., 2016; Arjovsky & Bottou, 2017), we show that explicitly enforcing an information bottleneck leads to improved performance over simply adding noise for a variety of applications.
3 Preliminaries

In this section, we provide a review of the variational information bottleneck proposed by Alemi et al. (2016) in the context of supervised learning. Our variational discriminator bottleneck is based on the same principle, and can be instantiated in the context of GANs, inverse RL, and imitation learning. Given a dataset $\{x_i, y_i\}$, with features $x_i$ and labels $y_i$, the standard maximum likelihood estimate $q(y_i | x_i)$ can be determined according to

$$\min_q \; \mathbb{E}_{x, y \sim p(x, y)}\left[-\log q(y | x)\right].$$
Unfortunately, this estimate is prone to overfitting, and the resulting model can often exploit idiosyncrasies in the data (Krizhevsky et al., 2012; Srivastava et al., 2014). Alemi et al. (2016) proposed regularizing the model using an information bottleneck to encourage the model to focus only on the most discriminative features. The bottleneck can be incorporated by first introducing an encoder $E(z | x)$ that maps the features $x$ to a latent distribution over $Z$, and then enforcing an upper bound $I_c$ on the mutual information $I(X, Z)$ between the encoding and the original features. This results in the following regularized objective $J(q, E)$:

$$J(q, E) = \min_{q, E} \; \mathbb{E}_{x, y \sim p(x, y)}\left[\mathbb{E}_{z \sim E(z | x)}\left[-\log q(y | z)\right]\right] \quad \text{s.t.} \quad I(X, Z) \le I_c.$$
Note that the model $q(y | z)$ now maps samples from the latent distribution $z$ to the label $y$. The mutual information is defined according to

$$I(X, Z) = \int p(x, z) \log \frac{p(x, z)}{p(x)\, p(z)} \, dx \, dz = \int p(x)\, E(z | x) \log \frac{E(z | x)}{p(z)} \, dx \, dz,$$
where $p(x)$ is the distribution given by the dataset. Computing the marginal distribution $p(z) = \int E(z | x)\, p(x) \, dx$ can be challenging. Instead, a variational lower bound can be obtained by using an approximation $r(z)$ of the marginal. Since $\mathrm{KL}\left[p(z) \,\|\, r(z)\right] \ge 0$, we have $\int p(z) \log p(z)\, dz \ge \int p(z) \log r(z)\, dz$, so an upper bound on $I(X, Z)$ can be obtained via the KL divergence:

$$I(X, Z) \le \int p(x)\, E(z | x) \log \frac{E(z | x)}{r(z)} \, dx \, dz = \mathbb{E}_{x \sim p(x)}\left[\mathrm{KL}\left[E(z | x) \,\|\, r(z)\right]\right].$$
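When the variational marginal $r(z)$ is a standard Gaussian and the encoder is a diagonal Gaussian, the KL term in this bound has the well-known closed form $\mathrm{KL}[\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, 1)] = \frac{1}{2}(\mu^2 + \sigma^2 - 1 - \log \sigma^2)$ per dimension. The sketch below (function names are ours, purely illustrative) checks this closed form against direct numerical integration:

```python
import numpy as np

def kl_gauss_std_normal(mu, sigma):
    # Closed-form KL[N(mu, sigma^2) || N(0, 1)], the per-dimension penalty
    # that appears when r(z) is a standard Gaussian.
    return 0.5 * (mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

# Sanity check: integrate E(z|x) * log(E(z|x) / r(z)) on a wide grid.
mu, sigma = 0.7, 0.5
z = np.linspace(-8.0, 8.0, 200001)
enc = np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
prior = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
kl_numeric = np.trapz(enc * np.log(enc / prior), z)
```

The closed form and the numerical integral agree to high precision, and the KL vanishes exactly when the encoder matches the prior ($\mu = 0$, $\sigma = 1$).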
This provides an upper bound $\tilde{J}(q, E) \ge J(q, E)$ on the regularized objective:

$$\tilde{J}(q, E) = \min_{q, E} \; \mathbb{E}_{x, y \sim p(x, y)}\left[\mathbb{E}_{z \sim E(z | x)}\left[-\log q(y | z)\right]\right] \quad \text{s.t.} \quad \mathbb{E}_{x \sim p(x)}\left[\mathrm{KL}\left[E(z | x) \,\|\, r(z)\right]\right] \le I_c.$$
To solve this problem, the constraint can be subsumed into the objective with a coefficient $\beta$:

$$\min_{q, E} \; \mathbb{E}_{x, y \sim p(x, y)}\left[\mathbb{E}_{z \sim E(z | x)}\left[-\log q(y | z)\right]\right] + \beta \left(\mathbb{E}_{x \sim p(x)}\left[\mathrm{KL}\left[E(z | x) \,\|\, r(z)\right]\right] - I_c\right).$$
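To make the penalized objective concrete, here is a minimal NumPy sketch of a single VIB forward pass for binary classification, assuming a linear Gaussian encoder, a logistic classifier, and the reparameterization trick; the names and shapes are illustrative, not the architecture used by Alemi et al. (2016):

```python
import numpy as np

rng = np.random.default_rng(0)

def vib_loss(x, y, W_mu, W_sig, w_cls, beta, ic):
    # Encoder: diagonal Gaussian z ~ N(mu(x), sigma(x)^2).
    mu = x @ W_mu
    sigma = np.exp(x @ W_sig)            # positive standard deviations
    eps = rng.standard_normal(mu.shape)
    z = mu + sigma * eps                 # reparameterization trick
    # Classifier q(y|z): Bernoulli likelihood through a sigmoid.
    p = 1.0 / (1.0 + np.exp(-(z @ w_cls)))
    nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # KL[E(z|x) || N(0, I)] in closed form, averaged over the batch.
    kl = np.mean(np.sum(0.5 * (mu**2 + sigma**2 - 1 - np.log(sigma**2)), axis=1))
    return nll + beta * (kl - ic), nll, kl

x = rng.standard_normal((32, 4))
y = (x[:, 0] > 0).astype(float)
W_mu = rng.standard_normal((4, 2)) * 0.1
W_sig = np.zeros((4, 2))                 # sigma = 1 at initialization
w_cls = rng.standard_normal(2) * 0.1
loss, nll, kl = vib_loss(x, y, W_mu, W_sig, w_cls, beta=0.1, ic=0.5)
```

In a real model the two linear maps would be deep networks and the loss would be minimized by gradient descent; the sketch only shows how the likelihood term and the KL penalty combine.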
Alemi et al. (2016) evaluated the method on supervised learning tasks, and showed that models trained with a VIB can be less prone to overfitting and more robust to adversarial examples.
4 Variational Discriminator Bottleneck
To outline our method, we first consider a standard GAN framework consisting of a discriminator $D$ and a generator $G$, where the goal of the discriminator is to distinguish between samples from the target distribution $p^*(x)$ and samples from the generator $G(x)$:

$$\max_G \min_D \; \mathbb{E}_{x \sim p^*(x)}\left[-\log\left(D(x)\right)\right] + \mathbb{E}_{x \sim G(x)}\left[-\log\left(1 - D(x)\right)\right].$$
We incorporate a variational information bottleneck by introducing an encoder $E$ into the discriminator that maps a sample $x$ to a stochastic encoding $z \sim E(z | x)$, and then applying a constraint $I_c$ on the mutual information $I(X, Z)$ between the original features and the encoding. $D$ is then trained to classify samples drawn from the encoder distribution. A schematic illustration of the framework is available in Figure 2. The regularized objective $J(D, E)$ for the discriminator is given by

$$J(D, E) = \min_{D, E} \; \mathbb{E}_{x \sim p^*(x)}\left[\mathbb{E}_{z \sim E(z | x)}\left[-\log\left(D(z)\right)\right]\right] + \mathbb{E}_{x \sim G(x)}\left[\mathbb{E}_{z \sim E(z | x)}\left[-\log\left(1 - D(z)\right)\right]\right] \quad \text{s.t.} \quad \mathbb{E}_{x \sim \tilde{p}(x)}\left[\mathrm{KL}\left[E(z | x) \,\|\, r(z)\right]\right] \le I_c,$$
with $\tilde{p} = \frac{1}{2} p^* + \frac{1}{2} G$ being a mixture of the target distribution and the generator. We refer to this regularizer as the variational discriminator bottleneck (VDB). To optimize this objective, we can introduce a Lagrange multiplier $\beta$:

$$J(D, E) = \min_{D, E} \max_{\beta \ge 0} \; \mathbb{E}_{x \sim p^*(x)}\left[\mathbb{E}_{z \sim E(z | x)}\left[-\log\left(D(z)\right)\right]\right] + \mathbb{E}_{x \sim G(x)}\left[\mathbb{E}_{z \sim E(z | x)}\left[-\log\left(1 - D(z)\right)\right]\right] + \beta \left(\mathbb{E}_{x \sim \tilde{p}(x)}\left[\mathrm{KL}\left[E(z | x) \,\|\, r(z)\right]\right] - I_c\right).$$
As we will discuss in Section 4.1 and demonstrate in our experiments, enforcing a specific mutual information budget between $x$ and $z$ is critical for good performance. We therefore adaptively update $\beta$ via dual gradient descent to enforce a specific constraint $I_c$ on the mutual information:

$$D, E \leftarrow \arg\min_{D, E} \mathcal{L}(D, E, \beta),$$
$$\beta \leftarrow \max\left(0, \; \beta + \alpha_\beta \left(\mathbb{E}_{x \sim \tilde{p}(x)}\left[\mathrm{KL}\left[E(z | x) \,\|\, r(z)\right]\right] - I_c\right)\right),$$
where $\mathcal{L}(D, E, \beta)$ is the Lagrangian

$$\mathcal{L}(D, E, \beta) = \mathbb{E}_{x \sim p^*(x)}\left[\mathbb{E}_{z \sim E(z | x)}\left[-\log\left(D(z)\right)\right]\right] + \mathbb{E}_{x \sim G(x)}\left[\mathbb{E}_{z \sim E(z | x)}\left[-\log\left(1 - D(z)\right)\right]\right] + \beta \left(\mathbb{E}_{x \sim \tilde{p}(x)}\left[\mathrm{KL}\left[E(z | x) \,\|\, r(z)\right]\right] - I_c\right),$$
and $\alpha_\beta$ is the stepsize for the dual variable in dual gradient descent (Boyd & Vandenberghe, 2004). In practice, we perform only one gradient step on $D$ and $E$, followed by an update to $\beta$. We refer to a GAN that incorporates a VDB as a variational generative adversarial network (VGAN).
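The dual update is a one-line rule: raise the penalty when the bottleneck is violated, relax it (down to zero) when the constraint is slack. A hedged sketch, with names of our choosing rather than from a released implementation:

```python
import numpy as np

def update_beta(beta, kl_batch, ic, step):
    # beta <- max(0, beta + step * (E[KL] - Ic)): the batch-averaged KL
    # estimates E_{x~p~}[KL[E(z|x) || r(z)]]; the max keeps beta >= 0.
    return max(0.0, beta + step * (np.mean(kl_batch) - ic))

beta = 0.1
# Batch KL above the budget Ic -> the penalty coefficient grows.
b_up = update_beta(beta, np.array([1.2, 0.8, 1.0]), ic=0.5, step=0.1)
# Batch KL below the budget -> the coefficient shrinks, clipped at zero.
b_dn = update_beta(0.01, np.array([0.1, 0.2]), ic=0.5, step=0.1)
```

With the values above, the first update raises $\beta$ from 0.1 to 0.15, while the second would push $\beta$ negative and is clipped to 0.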
In our experiments, the prior $r(z) = \mathcal{N}(0, I)$ is modeled with a standard Gaussian. The encoder $E(z | x) = \mathcal{N}\left(\mu_E(x), \Sigma_E(x)\right)$ models a Gaussian distribution in the latent variables $z$, with mean $\mu_E(x)$ and diagonal covariance matrix $\Sigma_E(x)$. We use a simplified objective for the generator:

$$\max_G \; \mathbb{E}_{x \sim G(x)}\left[-\log\left(1 - D\left(\mu_E(x)\right)\right)\right],$$
where the KL penalty is excluded from the generator’s objective. Instead of computing the expectation over $E(z | x)$, we found that approximating the expectation by evaluating $D$ at the mean $\mu_E(x)$ of the encoder’s distribution was sufficient for our tasks. The discriminator is modeled with a single linear unit followed by a sigmoid, $D(z) = \sigma\left(w_D^T z + b_D\right)$, with weights $w_D$ and bias $b_D$.
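As an illustrative sketch of these two design choices (a single linear unit with a sigmoid for the discriminator, and the generator loss evaluated at the encoder mean rather than at sampled encodings), with all names hypothetical:

```python
import numpy as np

def discriminator(z, w, b):
    # Single linear unit followed by a sigmoid: D(z) = sigma(w^T z + b).
    return 1.0 / (1.0 + np.exp(-(z @ w + b)))

def generator_loss(x_fake, enc_mu, w, b):
    # Simplified generator objective: evaluate D at the encoder mean
    # mu_E(x) instead of sampling z, and omit the KL penalty.
    d = discriminator(enc_mu(x_fake), w, b)
    return np.mean(-np.log(1.0 - d))

rng = np.random.default_rng(1)
x_fake = rng.standard_normal((8, 3))
w, b = np.zeros(2), 0.0                       # untrained discriminator
enc_mu = lambda v: v[:, :2]                   # stand-in for a learned mu_E
loss = generator_loss(x_fake, enc_mu, w, b)
```

With zero weights the discriminator outputs $D = 0.5$ everywhere, so the generator loss reduces to $-\log(0.5) = \log 2$, a useful baseline when debugging.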
4.1 Discussion and Analysis
To interpret the effects of the VDB, we consider the results presented by Arjovsky & Bottou (2017), which show that for two distributions with disjoint support, the optimal discriminator can perfectly classify all samples and its gradients will be zero almost everywhere. Thus, as the discriminator converges to the optimum, the gradients for the generator vanish accordingly. To address this issue, Arjovsky & Bottou (2017) proposed applying continuous noise to the discriminator inputs, thereby ensuring that the distributions have continuous support everywhere. In practice, if the original distributions are sufficiently distant from each other, the added noise will have negligible effects. As shown by Mescheder et al. (2017), the optimal choice for the variance of the noise to ensure convergence can be quite delicate. In our method, by first using a learned encoder to map the inputs to an embedding and then applying an information bottleneck on the embedding, we can dynamically adjust the variance of the noise such that the distributions not only share support in the embedding space, but also have significant overlap. Since the minimum amount of information required for binary classification is 1 bit, selecting an information constraint $I_c < 1$ prevents the discriminator from perfectly differentiating between the distributions. To illustrate the effects of the VDB, we consider a simple task of training a discriminator to differentiate between two Gaussian distributions. Figure 2 visualizes the decision boundaries learned with different bounds $I_c$ on the mutual information. Without a VDB, the discriminator learns a sharp decision boundary, resulting in vanishing gradients for much of the space. But as $I_c$ decreases and the bound tightens, the decision boundary is smoothed, providing more informative gradients that can be leveraged by the generator.
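The link between noise magnitude and discriminator accuracy can be seen analytically in one dimension: for two Gaussians with means $\pm d$, adding independent Gaussian noise shrinks the best achievable classification accuracy toward chance. The sketch below is our own illustration of this principle, not the paper's 2-D experiment:

```python
from math import erf, sqrt

def bayes_accuracy(d, sigma, noise):
    # Optimal accuracy for classifying N(-d, .) vs N(+d, .) after adding
    # independent Gaussian noise of std `noise` to the inputs:
    # Phi(d / sqrt(sigma^2 + noise^2)), where Phi is the standard normal CDF.
    total = sqrt(sigma**2 + noise**2)
    return 0.5 * (1.0 + erf((d / total) / sqrt(2.0)))

# Accuracy degrades monotonically toward 0.5 as the noise scale grows.
accs = [bayes_accuracy(2.0, 0.5, s) for s in (0.0, 1.0, 4.0)]
```

The VDB's adaptive $\beta$ effectively tunes this noise scale automatically, in a learned embedding space, so that the discriminator's accuracy stays in an informative regime.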
Taking this analysis further, we can extend Theorem 3.2 from Arjovsky & Bottou (2017) to analyze the VDB, and show that the gradient of the generator will be non-degenerate for a small enough constraint $I_c$, under some additional simplifying assumptions. The result in Arjovsky & Bottou (2017) states that the gradient consists of vectors that point toward samples on the data manifold, multiplied by coefficients that depend on the noise. However, these coefficients may be arbitrarily small if the generated samples are far from real samples, and the noise is not large enough. This can still cause the generator gradient to vanish. In the case of the VDB, the constraint ensures that these coefficients are always bounded below. Due to space constraints, this result is presented in Appendix A.
4.2 VAIL: Variational Adversarial Imitation Learning
To extend the VDB to imitation learning, we start with the generative adversarial imitation learning (GAIL) framework (Ho & Ermon, 2016), where the discriminator’s objective is to differentiate between the state distribution $\pi^*(s)$ induced by a target policy and the state distribution $\pi(s)$ of the agent’s policy:

$$\max_\pi \min_D \; \mathbb{E}_{s \sim \pi^*(s)}\left[-\log\left(D(s)\right)\right] + \mathbb{E}_{s \sim \pi(s)}\left[-\log\left(1 - D(s)\right)\right].$$
The discriminator is trained to maximize the likelihood assigned to states from the target policy, while minimizing the likelihood assigned to states from the agent’s policy. The discriminator also serves as the reward function for the agent, which encourages the policy to visit states that, to the discriminator, appear indistinguishable from the demonstrations. Similar to the GAN framework, we can incorporate a VDB into the discriminator:

$$J(D, E) = \min_{D, E} \max_{\beta \ge 0} \; \mathbb{E}_{s \sim \pi^*(s)}\left[\mathbb{E}_{z \sim E(z | s)}\left[-\log\left(D(z)\right)\right]\right] + \mathbb{E}_{s \sim \pi(s)}\left[\mathbb{E}_{z \sim E(z | s)}\left[-\log\left(1 - D(z)\right)\right]\right] + \beta \left(\mathbb{E}_{s \sim \tilde{\pi}(s)}\left[\mathrm{KL}\left[E(z | s) \,\|\, r(z)\right]\right] - I_c\right),$$
where $\tilde{\pi} = \frac{1}{2}\pi^* + \frac{1}{2}\pi$ represents a mixture of the target policy and the agent’s policy. The reward for $\pi$ is then specified by the discriminator, $r_t = -\log\left(1 - D\left(\mu_E(s_t)\right)\right)$. We refer to this method as variational adversarial imitation learning (VAIL).
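A hedged sketch of this reward computation, assuming the GAIL-style $-\log(1 - D)$ form with the discriminator evaluated at the encoder mean (function name is ours):

```python
import numpy as np

def vail_reward(d_at_mu):
    # r_t = -log(1 - D(mu_E(s_t))): d_at_mu is the discriminator output at
    # the encoder mean. States scored as expert-like (D near 1) yield large
    # rewards; states scored as agent-like (D near 0) yield small ones.
    return -np.log(1.0 - d_at_mu)

r_expert_like = vail_reward(0.9)   # D thinks the state looks like a demo
r_agent_like = vail_reward(0.1)    # D thinks the state looks generated
```

Note the reward is unbounded above as $D \to 1$, which is one reason modulating the discriminator's accuracy matters for stable policy optimization.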
4.3 VAIRL: Variational Adversarial Inverse Reinforcement Learning
The VDB can also be applied to adversarial inverse reinforcement learning (Fu et al., 2017) to yield a new algorithm which we call variational adversarial inverse reinforcement learning (VAIRL). AIRL operates in a similar manner to GAIL, but with a discriminator of the form

$$D(s, a) = \frac{\exp\left(f(s, a)\right)}{\exp\left(f(s, a)\right) + \pi(a | s)},$$
where $f(s, a) = g(s, a) + \gamma h(s') - h(s)$, with $g$ and $h$ being learned functions. Under certain restrictions on the environment, Fu et al. show that if $g$ is defined to depend only on the current state $s$, the optimal $g^*$ recovers the expert’s true reward function up to a constant, $g^*(s) = r^*(s) + \text{const}$. In this case, the learned reward can be re-used to train policies in environments with different dynamics, and will yield the same policy as if the policy were trained under the expert’s true reward. In contrast, GAIL’s discriminator typically cannot be re-optimized in this way (Fu et al., 2017). In VAIRL, we introduce stochastic encoders $E_g(z_g | s)$ and $E_h(z_h | s)$, and $g, h$ are modified to be functions of the encoding. We can reformulate Equation 13 as

$$D(s, z) = \frac{\exp\left(f(z_g, z_h, z_h')\right)}{\exp\left(f(z_g, z_h, z_h')\right) + \pi(a | s)},$$
for $z = (z_g, z_h, z_h')$ and $f(z_g, z_h, z_h') = D_g(z_g) + \gamma D_h(z_h') - D_h(z_h)$. We then obtain a modified objective of the form

$$J(D, E) = \min_{D, E} \max_{\beta \ge 0} \; \mathbb{E}_{s, s' \sim \pi^*}\left[\mathbb{E}_{z \sim E(z | s, s')}\left[-\log\left(D(s, z)\right)\right]\right] + \mathbb{E}_{s, s' \sim \pi}\left[\mathbb{E}_{z \sim E(z | s, s')}\left[-\log\left(1 - D(s, z)\right)\right]\right] + \beta \left(\mathbb{E}_{s, s' \sim \tilde{\pi}}\left[\mathrm{KL}\left[E(z | s, s') \,\|\, r(z)\right]\right] - I_c\right),$$
where $\pi(s, s')$ denotes the joint distribution of successive states from a policy, and $E(z | s, s') = E_g(z_g | s) \cdot E_h(z_h | s) \cdot E_h(z_h' | s')$.
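A minimal sketch of the AIRL/VAIRL discriminator arithmetic (function names are ours; in practice $g$, $h$, and the encoders are neural networks, and their encoding-space outputs stand in for the scalars below):

```python
import numpy as np

def f_value(g_zg, h_zh_next, h_zh, gamma):
    # f = g(z_g) + gamma * h(z_h') - h(z_h): a reward term plus a
    # discount-consistent potential-shaping term.
    return g_zg + gamma * h_zh_next - h_zh

def airl_discriminator(f, pi_a_given_s):
    # D = exp(f) / (exp(f) + pi(a|s)); D > 0.5 exactly when exp(f)
    # exceeds the policy's likelihood of the action.
    ef = np.exp(f)
    return ef / (ef + pi_a_given_s)

d_neutral = airl_discriminator(0.0, 1.0)       # exp(0) = pi(a|s) -> D = 0.5
f = f_value(1.0, 2.0, 2.0, gamma=0.9)          # 1 + 0.9*2 - 2
```

At optimality in AIRL, $f$ converges to the policy's log-probability under the expert, which is what makes the recovered $g$ interpretable as a reward.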
5 Experiments

We evaluate our method on adversarial learning problems in imitation learning, inverse reinforcement learning, and image generation. In the case of imitation learning, we show that the VDB enables agents to learn complex motion skills from a single demonstration, including visual demonstrations provided in the form of video clips. We also show that the VDB improves the performance of inverse RL methods. Inverse RL aims to reconstruct a reward function from a set of demonstrations, which can then be used to perform the task in new environments, in contrast to imitation learning, which aims to recover a policy directly. Our method is also not limited to control tasks, and we demonstrate its effectiveness for unconditional image generation.
[Table 1: average joint rotation error of the final policies for each method; rows include Merel et al., 2017, GAIL - noise, GAIL - noise z, and Peng et al., 2018 (0.26, 0.21, 0.20, 0.14, 0.19 across the skills).]
5.1 VAIL: Variational Adversarial Imitation Learning
The goal of the motion imitation tasks is to train a simulated character to mimic demonstrations provided by mocap clips recorded from human actors. Each mocap clip provides a sequence of target states that the character should track at each timestep. We use a similar experimental setup as Peng et al. (2018), with a 34 degrees-of-freedom humanoid character. We found that the discriminator architecture can greatly affect the performance on complex skills. The particular architecture we employ differs substantially from those used in prior work (Merel et al., 2017), details of which are available in Appendix B. The dimensionality of the encoding $z$, the information constraint $I_c$, and the dual stepsize $\alpha_\beta$ are fixed across all skills. All policies are trained using PPO (Schulman et al., 2017).
The motions learned by the policies are best seen in the supplementary video. Snapshots of the character’s motions are shown in Figure 3. Each skill is learned from a single demonstration. VAIL is able to closely reproduce a variety of skills, including those that involve highly dynamic flips and complex contacts. We compare VAIL to a number of other techniques, including state-only GAIL (Ho & Ermon, 2016), GAIL with instance noise applied to the discriminator inputs (GAIL - noise), and GAIL with instance noise applied to the last hidden layer (GAIL - noise z). Learning curves for the various methods are shown in Figure 4 and Table 1 summarizes the performance of the final policies. Performance is measured in terms of the average joint rotation error between the simulated character and the reference motion. We also include a reimplementation of the method described by Merel et al. (2017). For the purpose of our experiments, GAIL denotes policies trained using our particular architecture but without a VDB, and Merel et al. (2017) denotes policies trained using an architecture that closely mirrors those from previous work. Furthermore, we include comparisons to policies trained using the handcrafted reward from Peng et al. (2018), as well as policies trained via behavioral cloning (BC). Since mocap data does not provide expert actions, we use the policies from Peng et al. (2018) as oracles to provide state-action demonstrations, which are then used to train the BC policies via supervised learning. Each BC policy is trained with 10k samples from the oracle policies, while all other policies are trained from just a single demonstration, the equivalent of approximately 100 samples.
VAIL consistently outperforms the other adversarial methods. Simply adding instance noise to the inputs (Salimans et al., 2016) or hidden layer without the KL constraint (Sønderby et al., 2016) leads to worse performance, since the network can learn a latent representation that renders the effects of the noise negligible. Though training with the handcrafted reward still outperforms the adversarial methods, VAIL demonstrates comparable performance to the handcrafted reward without manual reward or feature engineering, and produces motions that closely resemble the original demonstrations. The method from Merel et al. (2017) was able to imitate simple skills such as running, but was unable to reproduce more acrobatic skills such as the backflip and spinkick. In the case of running, our implementation produces more natural gaits than the results reported in Merel et al. (2017). Behavioral cloning is unable to reproduce any of the skills, despite being provided with substantially more demonstration data than the other methods.
While our method achieves substantially better results on motion imitation when compared to prior work, previous methods can still produce reasonable behaviors. However, if the demonstrations are provided in terms of the raw pixels from video clips, instead of mocap data, the imitation task becomes substantially harder. The goal of the agent is therefore to directly imitate the skill depicted in the video. This is also a setting where manually engineering rewards is impractical, since simple losses like pixel distance do not provide a semantically meaningful measure of similarity. Figure 6 compares learning curves of policies trained with VAIL, GAIL, and policies trained using a reward function defined by the average pixel-wise difference between the frame from the video demonstration and a rendered image of the agent at each timestep. Each frame is represented by an RGB image.
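The exact scaling of the pixel-based baseline reward is not reproduced above; as an illustrative stand-in, a negative mean squared pixel difference captures the same signal (function name is ours):

```python
import numpy as np

def pixel_reward(demo_frame, agent_frame):
    # Negative mean squared pixel difference between the demonstration frame
    # and the agent's rendered frame; maximal (0) when the images match.
    demo = demo_frame.astype(float)
    agent = agent_frame.astype(float)
    return -np.mean((demo - agent) ** 2)

a = np.zeros((4, 4, 3), dtype=np.uint8)        # toy "demo" frame
b = np.full((4, 4, 3), 10, dtype=np.uint8)     # toy "agent" frame
r_same = pixel_reward(a, a)
r_diff = pixel_reward(a, b)
```

As the surrounding text notes, this kind of reward compares images pixel by pixel and carries no semantic notion of similarity, which is why it fails on the video imitation task.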
Both GAIL and the pixel-loss are unable to learn the running gait. VAIL is the only method that successfully learns to imitate the skill from the video demonstration. Snapshots of the video demonstration and the simulated motion are available in Figure 5. To further investigate the effects of the VDB, we visualize the gradient of the discriminator with respect to images from the video demonstration and simulation. Saliency maps for discriminators trained with VAIL and GAIL are available in Figure 5. The VAIL discriminator learns to attend to spatially coherent image patches around the character, while the GAIL discriminator exhibits less structure. The magnitude of the gradients from VAIL also tends to be significantly larger than those from GAIL, which may suggest that VAIL is able to mitigate the problem of vanishing gradients present in GAIL.
To evaluate the effects of the adaptive updates, we compare policies trained with different fixed values of $\beta$ and policies where $\beta$ is updated adaptively to enforce a desired information constraint $I_c$. Figure 6 illustrates the learning curves and the KL loss over the course of training. When $\beta$ is too small, performance reverts to that achieved by GAIL. Large values of $\beta$ help to smooth the discriminator landscape and improve learning speed during the early stages of training, but converge to worse final performance. Policies trained using dual gradient descent to adaptively update $\beta$ consistently achieve the best performance overall.
5.2 VAIRL: Variational Adversarial Inverse Reinforcement Learning
Next, we use VAIRL to recover reward functions from demonstrations. Unlike the discriminator learned by VAIL, the reward function recovered by VAIRL can be re-optimized to train new policies from scratch in the same environment; in some cases, it can also be used to transfer similar behavior to different environments. In Figure 7, we show the results of applying VAIRL to the C-maze from Fu et al. (2017), and a more complex S-maze; the simple 2D observation spaces of these tasks make it easy to interpret the recovered reward functions. In both mazes, the expert is trained to navigate from a start position at the bottom of the maze to a fixed target position at the top. We use each method to obtain an imitation policy and approximate the expert’s reward on the original maze. The recovered reward is then used to train a new policy to solve a horizontally mirrored version of the training maze. On the C-maze, we found that AIRL would sometimes overfit to the training environment, and fail to transfer to the new environment; this is evidenced by both the reward visualization in Figure 7 (left) and the higher return variance in Figure 7 (right). In contrast, by incorporating a VDB into AIRL, VAIRL learns a substantially smoother reward function that is more suitable for transfer. Furthermore, we found that in the S-maze—which has two internal walls instead of one—AIRL was too unstable to acquire a meaningful reward function, whereas VAIRL was able to learn a reasonable reward in most cases. To evaluate the effects of the VDB, we observe that the performance of VAIRL drops on both tasks when the KL constraint is disabled ($\beta = 0$), suggesting that the improvements from the VDB cannot be attributed entirely to the noise introduced by the sampling process for $z$. Further details of these experiments and illustrations of the recovered reward functions are available in Appendix C.
5.3 VGAN: Variational Generative Adversarial Networks
Finally, we apply the VDB to image generation with generative adversarial networks, which we refer to as VGAN. Experiments are conducted on the CIFAR-10 (Krizhevsky et al.), CelebA (Liu et al., 2015), and CelebA-HQ (Karras et al., 2018) datasets. We compare our approach to recent stabilization techniques: WGAN-GP (Gulrajani et al., 2017b), instance noise (Sønderby et al., 2016; Arjovsky & Bottou, 2017), and gradient penalty (GP) (Mescheder et al., 2018), as well as the original GAN (Goodfellow et al., 2014) on CIFAR-10. To measure performance, we report the Fréchet Inception Distance (FID) (Heusel et al., 2017), which has been shown to be more consistent with human evaluation. All methods are implemented using the same base model, built on the ResNet architecture of Mescheder et al. (2018). Aside from tuning the KL constraint $I_c$ for VGAN, no additional hyperparameter optimization was performed to modify the settings provided by Mescheder et al. (2018). The performance of the various methods on CIFAR-10 is shown in Figure 8. While the vanilla GAN and instance noise are prone to diverging as training progresses, VGAN remains stable. Note that instance noise can be seen as a non-adaptive version of VGAN without constraints on the mutual information. This experiment again highlights that there is a significant improvement from imposing the information bottleneck over simply adding instance noise. VGAN is competitive with WGAN-GP and GP. Since the VDB and GP are complementary techniques, we also train a model that combines both, which we refer to as VGAN-GP. This combination achieves the best performance overall with an FID of 18.1. See Figure 9 for samples of images generated with our approach. Please refer to Appendix D for experimental details and more results.
6 Conclusion

We present the variational discriminator bottleneck, a general regularization technique for adversarial learning. Our experiments show that the VDB is broadly applicable to a variety of domains, and yields significant improvements over previous techniques on a number of challenging tasks. While our experiments have produced promising results for video imitation, the results have been primarily with videos of synthetic scenes. We believe that extending the technique to imitating real-world videos is an exciting direction. Another exciting direction for future work is a more in-depth theoretical analysis of the method, to derive convergence and stability results or conditions.
- Alemi et al. (2016) Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. CoRR, abs/1612.00410, 2016. URL http://arxiv.org/abs/1612.00410.
- Arjovsky & Bottou (2017) Martín Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. CoRR, abs/1701.04862, 2017. URL http://arxiv.org/abs/1701.04862.
- Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 214–223, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/arjovsky17a.html.
- Berthelot et al. (2017) David Berthelot, Tom Schumm, and Luke Metz. BEGAN: boundary equilibrium generative adversarial networks. CoRR, abs/1703.10717, 2017. URL http://arxiv.org/abs/1703.10717.
- Boyd & Vandenberghe (2004) Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004. ISBN 0521833787.
- Bullet (2015) Bullet. Bullet physics library, 2015. http://bulletphysics.org.
- Che et al. (2016) Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. CoRR, abs/1612.02136, 2016. URL http://arxiv.org/abs/1612.02136.
- Chen et al. (2018) Liqun Chen, Shuyang Dai, Yunchen Pu, Erjin Zhou, Chunyuan Li, Qinliang Su, Changyou Chen, and Lawrence Carin. Symmetric variational autoencoder and connections to adversarial learning. In Amos Storkey and Fernando Perez-Cruz (eds.), Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 661–669, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR. URL http://proceedings.mlr.press/v84/chen18b.html.
- Donahue et al. (2018) Chris Donahue, Julian McAuley, and Miller Puckette. Synthesizing audio with generative adversarial networks. CoRR, abs/1802.04208, 2018. URL http://arxiv.org/abs/1802.04208.
- Finn et al. (2016a) Chelsea Finn, Paul F. Christiano, Pieter Abbeel, and Sergey Levine. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. CoRR, abs/1611.03852, 2016a. URL http://arxiv.org/abs/1611.03852.
- Finn et al. (2016b) Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pp. 49–58, 2016b. URL http://jmlr.org/proceedings/papers/v48/finn16.html.
- Fu et al. (2017) Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. CoRR, abs/1710.11248, 2017. URL http://arxiv.org/abs/1710.11248.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27, pp. 2672–2680. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.
- Gulrajani et al. (2017a) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 5767–5777. Curran Associates, Inc., 2017a. URL http://papers.nips.cc/paper/7159-improved-training-of-wasserstein-gans.pdf.
- Gulrajani et al. (2017b) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017b.
- Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.
- Hinton et al. Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent.
- Ho & Ermon (2016) Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems 29, pp. 4565–4573. Curran Associates, Inc., 2016.
- Holden et al. (2017) Daniel Holden, Taku Komura, and Jun Saito. Phase-functioned neural networks for character control. ACM Trans. Graph., 36(4):42:1–42:13, July 2017. ISSN 0730-0301. doi: 10.1145/3072959.3073663. URL http://doi.acm.org/10.1145/3072959.3073663.
- Karras et al. (2017) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. CoRR, abs/1710.10196, 2017. URL http://arxiv.org/abs/1710.10196.
- Karras et al. (2018) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk99zCeAb.
- Kingma & Welling (2013) Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013. URL http://dblp.uni-trier.de/db/journals/corr/corr1312.html#KingmaW13.
- Kodali et al. (2017) Naveen Kodali, Jacob D. Abernethy, James Hays, and Zsolt Kira. How to train your DRAGAN. CoRR, abs/1705.07215, 2017. URL http://arxiv.org/abs/1705.07215.
- Krizhevsky et al. Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/~kriz/cifar.html.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
- Larsen et al. (2016) Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In Maria Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1558–1566, New York, New York, USA, 20–22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/larsen16.html.
- Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.
- Lucic et al. (2017) Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. arXiv preprint arXiv:1711.10337, 2017.
- Makhzani et al. (2016) Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian Goodfellow. Adversarial autoencoders. In International Conference on Learning Representations, 2016. URL http://arxiv.org/abs/1511.05644.
- Mao et al. (2016) Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, and Zhen Wang. Multi-class generative adversarial networks with the L2 loss function. CoRR, abs/1611.04076, 2016. URL http://arxiv.org/abs/1611.04076.
- Merel et al. (2017) Josh Merel, Yuval Tassa, Dhruva TB, Sriram Srinivasan, Jay Lemmon, Ziyu Wang, Greg Wayne, and Nicolas Heess. Learning human behaviors from motion capture by adversarial imitation. CoRR, abs/1707.02201, 2017. URL http://arxiv.org/abs/1707.02201.
- Mescheder et al. (2018) Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3481–3490, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/mescheder18a.html.
- Mescheder et al. (2017) Lars M. Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. CoRR, abs/1701.04722, 2017. URL http://arxiv.org/abs/1701.04722.
- Peng et al. (2018) Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Trans. Graph., 37(4):143:1–143:14, July 2018. ISSN 0730-0301. doi: 10.1145/3197517.3201311. URL http://doi.acm.org/10.1145/3197517.3201311.
- Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015. URL http://arxiv.org/abs/1511.06434.
- Rajeswaran et al. (2017) Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. CoRR, abs/1709.10087, 2017. URL http://arxiv.org/abs/1709.10087.
- Salimans et al. (2016) Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. CoRR, abs/1606.03498, 2016. URL http://arxiv.org/abs/1606.03498.
- Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, 2015.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347.
- Sønderby et al. (2016) Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. CoRR, abs/1610.04490, 2016. URL http://arxiv.org/abs/1610.04490.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January 2014. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2627435.2670313.
- Tishby & Zaslavsky (2015) Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. CoRR, abs/1503.02406, 2015. URL http://arxiv.org/abs/1503.02406.
- Vondrick et al. (2016) Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. CoRR, abs/1609.02612, 2016. URL http://arxiv.org/abs/1609.02612.
- Xie et al. (2018) You Xie, Erik Franz, Mengyu Chu, and Nils Thuerey. tempogan: A temporally coherent, volumetric gan for super-resolution fluid flow. ACM Transactions on Graphics (TOG), 37(4):95, 2018.
- Zhao et al. (2016) Junbo Jake Zhao, Michaël Mathieu, and Yann LeCun. Energy-based generative adversarial network. CoRR, abs/1609.03126, 2016. URL http://arxiv.org/abs/1609.03126.
Appendix A Analysis and Proofs
In this appendix, we show that the gradient of the generator when the discriminator is augmented with the VDB is non-degenerate, under some mild additional assumptions. First, we assume a pointwise constraint of the form KL[E(z|x) ‖ r(z)] ≤ I_c for all x. In practice, we use an average KL constraint, since we found it more convenient to optimize, though a pointwise constraint can also be enforced by using the largest constraint violation to increment the dual variable β. We could likely also extend the analysis to the average constraint, though we leave this to future work. The main theorem can then be stated as follows:
Let G(z) denote the generator's mapping from a noise vector z to a point in the data space X. Given the generator distribution p_G(x), the data distribution p*(x), a VDB with an encoder E(z|x) = N(z | μ_E(x), Σ_E), and the constraint KL[E(z|x) ‖ r(z)] ≤ I_c, the gradient passed to the generator has the form

∇_θ E_{z∼p(z)} [log(1 − D*(μ_E(G(z))))]
  = E_{z∼p(z)} [ a(z) ∫ E(μ_E(G(z)) | x) ∇_θ ‖μ_E(G(z)) − μ_E(x)‖² p*(x) dx
               − b(z) ∫ E(μ_E(G(z)) | x) ∇_θ ‖μ_E(G(z)) − μ_E(x)‖² p_G(x) dx ],

where D* is the optimal discriminator, a(z) and b(z) are positive functions, and we always have a(z), b(z) ≥ C(I_c), where C(I_c) is a continuous monotonic function that approaches a positive constant as I_c → 0.
Analysis for an encoder with an input-dependent variance σ(x) is also possible, but more involved. For notational simplicity, we further assume below that Σ_E is diagonal with diagonal entries σ_i². This assumption is not required, but substantially simplifies the linear algebra. Analogously to Theorem 3.2 from Arjovsky & Bottou (2017), this theorem states that the gradient of the generator points towards points in the data distribution, and away from points in the generator distribution. However, going beyond the theorem in Arjovsky & Bottou (2017), this result states that the coefficients on these vectors, given by a(z) and b(z), are always bounded below by a value that approaches a positive constant as we decrease I_c, meaning that the gradient does not vanish. The proof of the first part of this theorem is essentially identical to the proof presented by Arjovsky & Bottou (2017), but accounts for the fact that the noise is now injected into the latent space of the VDB, rather than being added directly to x. This result assumes that E(z|x) has a learned but input-independent variance Σ_E, though the proof can be repeated for an input-dependent or non-diagonal Σ_E:
Overloading p* and p_G, let p*(z) and p_G(z) be the distributions of embeddings under the real data and the generator respectively. p*(z) is then given by

p*(z) = ∫ E(z|x) p*(x) dx,

and similarly for p_G(z):

p_G(z) = ∫ E(z|x) p_G(x) dx.

From Arjovsky & Bottou (2017), the optimal discriminator between p*(z) and p_G(z) is

D*(z) = p*(z) / (p*(z) + p_G(z)).

The gradient passed to the generator then has the form

∇_θ E_{z∼p(z)} [log(1 − D*(μ_E(G(z))))].

Expanding D* and differentiating, we then have

∇_θ E_{z∼p(z)} [log(1 − D*(μ_E(G(z))))]
  = E_{z∼p(z)} [ a(z) ∫ E(μ_E(G(z)) | x) ∇_θ ‖μ_E(G(z)) − μ_E(x)‖² p*(x) dx
               − b(z) ∫ E(μ_E(G(z)) | x) ∇_θ ‖μ_E(G(z)) − μ_E(x)‖² p_G(x) dx ],

for positive coefficient functions a(z) and b(z).
Similar to the result from Arjovsky & Bottou (2017), the gradient of the generator drives the generator's samples in the embedding space towards embeddings of the points from the dataset, weighted by their likelihood under the real data. For an arbitrary encoder E(z|x), real and fake samples in the embedding space may be far apart. As such, the coefficients a(z) and b(z) can be arbitrarily small, thereby resulting in vanishing gradients for the generator.
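The vanishing-coefficient problem can be illustrated numerically. The sketch below (a 1-D toy, not the paper's implementation; the means and variance are illustrative) builds the optimal discriminator D*(z) = p*(z) / (p*(z) + p_G(z)) for two Gaussian embedding distributions and shows that the real-data density at a generated embedding, which scales the gradient coefficient, decays exponentially as the embeddings separate.

```python
import numpy as np

def gaussian_pdf(z, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at z."""
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def optimal_discriminator(z, mu_real, mu_fake, sigma):
    """Optimal discriminator D*(z) = p*(z) / (p*(z) + p_G(z)) between two
    1-D Gaussian embedding distributions with a shared encoder variance."""
    p_real = gaussian_pdf(z, mu_real, sigma)
    p_fake = gaussian_pdf(z, mu_fake, sigma)
    return p_real / (p_real + p_fake)

sigma = 1.0
for gap in [0.5, 2.0, 8.0]:
    z_fake = gap  # embedding of a generated sample; real data embedded at 0
    d_star = optimal_discriminator(z_fake, 0.0, gap, sigma)
    # The coefficient pulling the fake embedding towards the data is scaled by
    # the real-data density at the fake embedding; it vanishes as the gap grows.
    density = gaussian_pdf(z_fake, 0.0, sigma)
    print(f"gap={gap:4.1f}  D*(z_fake)={d_star:.4f}  real density at z_fake={density:.2e}")
```

As the gap grows, D* saturates and the real-data density at the fake embedding collapses, which is exactly the regime the KL constraint below is designed to prevent.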
The second part of the theorem states that a(z), b(z) ≥ C(I_c), where C(I_c) is a continuous monotonic function that approaches a positive constant as I_c → 0. This is the main result, and it relies on the fact that KL[E(z|x) ‖ r(z)] ≤ I_c. The intuition behind this result is that, for any two inputs x and y, the encoded distributions E(z|x) and E(z|y) have means that cannot be more than some distance apart, and that distance shrinks with I_c. This allows us to bound the likelihood E(μ_E(y) | x) below, which ensures that the coefficients on the vectors in the theorem above are always at least as large as C(I_c).
Let r(z) = N(0, I) be the prior distribution, and suppose the KL divergences for all x in the dataset and all x generated by the generator are bounded by

KL[E(z|x) ‖ r(z)] ≤ I_c.

From the definition of the KL-divergence,

KL[E(z|x) ‖ r(z)] = (1/2) ( ‖μ_E(x)‖² + Σ_{i=1}^{d} (σ_i² − 1 − log σ_i²) ),

with d denoting the dimension of z. Since σ_i² − 1 − log σ_i² ≥ 0 for all σ_i > 0, we can bound the length of all embedding vectors,

‖μ_E(x)‖² ≤ 2 I_c,

and similarly for ‖μ_E(y)‖². A lower bound on the likelihood E(μ_E(y) | x), where x is drawn from the dataset and y is produced by the generator, can then be determined: by the triangle inequality,

‖μ_E(x) − μ_E(y)‖ ≤ ‖μ_E(x)‖ + ‖μ_E(y)‖ ≤ 2√(2 I_c),

and it follows that

(1/2) Σ_i (μ_E(x)_i − μ_E(y)_i)² / σ_i² ≤ ‖μ_E(x) − μ_E(y)‖² / (2 σ_min²) ≤ 4 I_c / σ_min²,

where σ_min = min_i σ_i and σ_max = max_i σ_i. The likelihood is therefore bounded below by

E(μ_E(y) | x) = (2π)^{−d/2} (Π_i σ_i)^{−1} exp( −(1/2) Σ_i (μ_E(x)_i − μ_E(y)_i)² / σ_i² ) ≥ (2π σ_max²)^{−d/2} exp( −4 I_c / σ_min² ).

From the KL constraint, we can derive a lower bound and an upper bound on each σ_i, since each per-dimension term satisfies σ_i² − 1 − log σ_i² ≤ 2 I_c. For the lower bound, since σ_i² ≥ 0,

−log σ_i² ≤ 2 I_c + 1 − σ_i² ≤ 2 I_c + 1,  and hence  σ_i² ≥ e^{−(2 I_c + 1)}.

For the upper bound, since log σ_i² ≤ σ_i² / e,

σ_i² (1 − 1/e) − 1 ≤ σ_i² − 1 − log σ_i² ≤ 2 I_c,  and hence  σ_i² ≤ (2 I_c + 1) e / (e − 1).

Substituting these bounds on σ_min and σ_max into the likelihood bound above, we arrive at the following lower bound:

E(μ_E(y) | x) ≥ ( 2π (2 I_c + 1) e / (e − 1) )^{−d/2} exp( −4 I_c e^{2 I_c + 1} ) =: C(I_c),

which is continuous and monotonic in I_c, and approaches a positive constant as I_c → 0.
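The chain of bounds in this proof can be spot-checked numerically. The sketch below (illustrative values of I_c and the embedding dimension; not from the paper) samples diagonal Gaussian encoders, keeps those that satisfy the KL constraint, and verifies the derived mean-norm and variance bounds.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian encoder."""
    return 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - np.log(sigma ** 2))

rng = np.random.default_rng(0)
ic = 0.5   # information constraint I_c (illustrative value)
d = 8      # embedding dimension (illustrative value)

sigma_lo2 = np.exp(-(2 * ic + 1))                 # derived lower bound on sigma_i^2
sigma_hi2 = (2 * ic + 1) * np.e / (np.e - 1.0)    # derived upper bound on sigma_i^2

checked = 0
for _ in range(2000):
    mu = rng.normal(scale=0.1, size=d)
    sigma = np.exp(rng.normal(scale=0.05, size=d))
    if kl_to_standard_normal(mu, sigma) <= ic:
        # Mean-norm bound: ||mu||^2 <= 2 I_c
        assert np.sum(mu ** 2) <= 2 * ic + 1e-9
        # Per-dimension variance bounds from sigma_i^2 - 1 - log sigma_i^2 <= 2 I_c
        assert np.all(sigma ** 2 >= sigma_lo2 - 1e-9)
        assert np.all(sigma ** 2 <= sigma_hi2 + 1e-9)
        checked += 1
print(f"{checked} sampled encoders satisfied the constraint and all derived bounds")
```

Every encoder that satisfies the KL constraint respects the bounds, consistent with the derivation above.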
Appendix B Imitation Learning
The goal of the motion imitation tasks is to train a simulated agent to mimic a demonstration provided in the form of a mocap clip recorded from a human actor. We use a similar experimental setup to Peng et al. (2018), with a 34 degrees-of-freedom humanoid character. The state consists of features that represent the configuration of the character's body (link positions and velocities). We also include a phase variable ϕ ∈ [0, 1] among the state features, which records the character's progress along the motion and helps to synchronize the character with the reference motion, with ϕ = 0 and ϕ = 1 denoting the start and end of the motion respectively. The action a sampled from the policy specifies target poses for PD controllers positioned at each joint. Given a state s, the policy specifies a Gaussian distribution over the action space, π(a|s) = N(μ(s), Σ), with a state-dependent mean μ(s) and a fixed diagonal covariance matrix Σ. μ(s) is modeled using a 3-layer fully-connected network with 1024 and 512 hidden units, followed by a linear output layer that specifies the mean of the Gaussian. ReLU activations are used for all hidden layers. The value function is modeled with a similar architecture, but with a single linear output unit. The policy is queried at a fixed control rate, while physics simulation is performed at 1.2kHz using the Bullet physics engine (Bullet, 2015).
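The Gaussian policy described above can be sketched as follows. This is an illustrative NumPy reconstruction, not the authors' code: the layer sizes follow the text, while the initialization, the value of the fixed standard deviation, and the state/action dimensions are assumptions.

```python
import numpy as np

class GaussianPolicy:
    """Sketch of the policy: a fully-connected network with 1024- and 512-unit
    ReLU hidden layers and a linear output for the Gaussian mean, combined with
    a fixed diagonal covariance."""

    def __init__(self, state_dim, action_dim, sigma=0.1, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [state_dim, 1024, 512, action_dim]
        # He-style initialization (an assumption; the paper does not specify it).
        self.weights = [rng.normal(scale=np.sqrt(2.0 / m), size=(m, n))
                        for m, n in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.zeros(n) for n in sizes[1:]]
        self.sigma = sigma  # fixed diagonal standard deviation (illustrative value)
        self.rng = rng

    def mean(self, state):
        h = state
        for w, b in zip(self.weights[:-1], self.biases[:-1]):
            h = np.maximum(h @ w + b, 0.0)             # ReLU hidden layers
        return h @ self.weights[-1] + self.biases[-1]  # linear mean output

    def sample(self, state):
        """Sample a ~ N(mu(s), sigma^2 I): a target pose for the PD controllers."""
        mu = self.mean(state)
        return mu + self.sigma * self.rng.normal(size=mu.shape)
```

The value function would reuse the same hidden architecture with a single linear output unit.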
Given the rewards from the discriminator, PPO (Schulman et al., 2017) is used to train the policy, with separate stepsizes for the policy, the value function, and the discriminator. Gradient descent with momentum 0.9 is used for all models, and the standard clipping threshold is applied to the PPO surrogate objective. When evaluating the performance of the policies, each episode is simulated for a maximum horizon of 20s. Early termination is triggered whenever the character's torso contacts the ground, leaving the policy with the maximum pose error, measured in radians, for all remaining timesteps.
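For reference, the clipped PPO surrogate used to train the policy can be sketched as below. This is a generic reconstruction of the PPO objective from Schulman et al. (2017), not the authors' implementation; the epsilon default is illustrative.

```python
import numpy as np

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, epsilon=0.2):
    """Clipped PPO surrogate loss (to be minimized) over a batch of samples.
    log_prob_new/log_prob_old are per-sample action log-probabilities under the
    current and behavior policies; advantage is the estimated advantage."""
    ratio = np.exp(log_prob_new - log_prob_old)          # importance ratio
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Pessimistic bound: take the smaller of the two surrogate terms.
    return -np.mean(np.minimum(unclipped, clipped))
```

With identical old and new policies the ratio is 1 and the loss reduces to the negated mean advantage, while large ratio deviations are clipped to the [1 − ε, 1 + ε] band.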
Unlike the policy and value function, which are modeled with standard fully-connected networks, the discriminator is modeled by a phase-functioned neural network (PFNN) to explicitly model the time-dependency of the reference motion (Holden et al., 2017). While the parameters of a standard network are fixed, the parameters of a PFNN are functions of the phase variable ϕ. The parameters θ(ϕ) of the network for a given ϕ are determined by a weighted combination of a set of fixed parameters {θ_0, θ_1, …, θ_{k−1}},

θ(ϕ) = Σ_i w_i(ϕ) θ_i,

where w_i(ϕ) is a phase-dependent weight for θ_i. In our implementation, we use k sets of parameters, and w(ϕ) is designed to linearly interpolate between the two adjacent sets of parameters for each phase ϕ, where each set of parameters θ_i corresponds to a discrete phase value ϕ_i, spaced uniformly between 0 and 1. For a given value of ϕ, the parameters of the discriminator are determined according to

θ(ϕ) = (1 − λ) θ_i + λ θ_{i+1},   λ = (ϕ − ϕ_i) / (ϕ_{i+1} − ϕ_i),

where ϕ_i and ϕ_{i+1} correspond to the phase values that form the endpoints of the phase interval containing ϕ. A PFNN is used for all motion imitation experiments, both VAIL and GAIL, except for those that use the approach proposed by Merel et al. (2017), which use standard fully-connected networks for the discriminator. Figure 10 compares the performance of VAIL when the discriminator is modeled with a phase-functioned neural network (with PFNN) to discriminators modeled with standard fully-connected networks. We increased the size of the layers of the fully-connected nets to have a similar number of parameters as the PFNN. We evaluate the performance of fully-connected nets that receive the phase variable as part of the input (no PFNN), and fully-connected nets that do not receive ϕ as an input. The phase-functioned discriminator leads to significant performance improvements across all tasks evaluated. Policies trained without a phase variable perform worst overall, suggesting that phase information is critical for performance. All methods perform well on simpler skills, such as running, but the additional phase structure introduced by the PFNN proved to be vital for successful imitation of more complex skills, such as the dance and backflip.
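The phase-dependent parameter lookup can be sketched as a small helper. This is an illustrative reconstruction, not the authors' implementation: it assumes k parameter sets anchored at uniformly spaced phase values and a cyclic phase (the last set wraps around to the first), which is a common choice for periodic motions.

```python
import numpy as np

def pfnn_parameters(phi, theta_sets):
    """Linearly interpolate between the two parameter sets adjacent to phase
    phi in [0, 1). theta_sets is a list of k parameter arrays, with set i
    anchored at phase i / k (uniform spacing, cyclic wraparound)."""
    k = len(theta_sets)
    phi = phi % 1.0              # wrap the phase for cyclic motions
    pos = phi * k                # position within the k uniform phase intervals
    i = int(pos) % k
    lam = pos - int(pos)         # interpolation weight lambda in [0, 1)
    return (1.0 - lam) * theta_sets[i] + lam * theta_sets[(i + 1) % k]
```

Every weight matrix and bias of the discriminator would be looked up this way before evaluating the network at a sample with phase ϕ.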
Next, we compare the accuracy of discriminators trained using different methods. Figure 11 illustrates the accuracy of the discriminators over the course of training. Discriminators trained via GAIL quickly overpower the policy and learn to accurately differentiate between samples, even when instance noise is applied to the inputs. VAIL without the KL constraint slows the discriminator's progress, but the discriminator nonetheless reaches near-perfect accuracy given a larger number of samples. Once the KL constraint is enforced, the information bottleneck constrains the performance of the discriminator, whose accuracy converges to an intermediate value. Figure 11 also visualizes the value of the dual variable β over the course of training for the motion imitation tasks, along with the loss of the KL term in the objective. The dual gradient descent update effectively enforces the VDB constraint E[KL[E(z|x) ‖ r(z)]] ≤ I_c.
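The dual update on β referenced above can be sketched as a projected gradient-ascent step on the Lagrange multiplier. This is a minimal sketch under the assumption of a simple first-order dual update; the step size is illustrative.

```python
def update_beta(beta, avg_kl, ic, step_size=1e-5):
    """Dual ascent step on the multiplier beta enforcing E[KL] <= I_c.
    beta grows while the constraint is violated (avg_kl > ic) and shrinks,
    clipped at zero, once the average KL falls below the bottleneck I_c."""
    return max(0.0, beta + step_size * (avg_kl - ic))
```

In training, the discriminator loss would include a β-weighted KL penalty, with `update_beta` applied after each discriminator update so that the bottleneck is satisfied on average.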
In the video imitation tasks, we use a simplified 2D biped character in order to avoid issues that may arise due to depth ambiguity from monocular videos. The biped character has a total of 12 degrees-of-freedom, with similar state and action parameters as the humanoid. The video demonstrations are generated by rendering a reference motion into a sequence of video frames, which are then provided to the agent as a demonstration. The goal of the agent is to imitate the motion depicted in the video, without access to the original reference motion, and the reference motion is used only to evaluate performance.
Appendix C Inverse Reinforcement Learning
C.1 Experimental Setup
We evaluate on two maze tasks, as illustrated in Figure 12. The C-maze is taken from Fu et al. (2017): in this maze, the agent starts at a random point within a small fixed distance of the mean start position. The agent has a continuous 2D action space which allows it to accelerate in the x or y directions, and it is able to observe its x and y position, but not its velocity. The ground truth reward is