Variational Discriminator Bottleneck:
Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow

Xue Bin Peng & Angjoo Kanazawa & Sam Toyer & Pieter Abbeel & Sergey Levine
University of California, Berkeley
{xbpeng,kanazawa,sdt,pabbeel,svlevine}@berkeley.edu
Abstract

Adversarial learning methods have been proposed for a wide range of applications, but the training of adversarial models can be notoriously unstable. Effectively balancing the performance of the generator and discriminator is critical, since a discriminator that achieves very high accuracy will produce relatively uninformative gradients. In this work, we propose a simple and general technique to constrain information flow in the discriminator by means of an information bottleneck. By enforcing a constraint on the mutual information between the observations and the discriminator’s internal representation, we can effectively modulate the discriminator’s accuracy and maintain useful and informative gradients. We demonstrate that our proposed variational discriminator bottleneck (VDB) leads to significant improvements across three distinct application areas for adversarial learning algorithms. Our primary evaluation studies the applicability of the VDB to imitation learning of dynamic continuous control skills, such as running. We show that our method can learn such skills directly from raw video demonstrations, substantially outperforming prior adversarial imitation learning methods. The VDB can also be combined with adversarial inverse reinforcement learning to learn parsimonious reward functions that can be transferred and re-optimized in new settings. Finally, we demonstrate that VDB can train GANs more effectively for image generation, improving upon a number of prior stabilization methods. (Video: https://youtu.be/0qTCNx4AtJU)


1 Introduction

Adversarial learning methods provide a promising approach to modeling distributions over high-dimensional data with complex internal correlation structures. These methods generally use a discriminator to supervise the training of a generator in order to produce samples that are indistinguishable from the data. A particular instantiation is generative adversarial networks, which can be used for high-fidelity generation of images (Goodfellow et al., 2014; Karras et al., 2017) and other high-dimensional data (Vondrick et al., 2016; Xie et al., 2018; Donahue et al., 2018). Adversarial methods can also be used to learn reward functions in the framework of inverse reinforcement learning (Finn et al., 2016a; Fu et al., 2017), or to directly imitate demonstrations (Ho & Ermon, 2016). However, they suffer from major optimization challenges, one of which is balancing the performance of the generator and discriminator. A discriminator that achieves very high accuracy can produce relatively uninformative gradients, but a weak discriminator can also hamper the generator’s ability to learn. These challenges have led to widespread interest in a variety of stabilization methods for adversarial learning algorithms (Arjovsky et al., 2017; Kodali et al., 2017; Berthelot et al., 2017).

In this work, we propose a simple regularization technique for adversarial learning, which constrains the information flow from the inputs to the discriminator using a variational approximation to the information bottleneck. By enforcing a constraint on the mutual information between the input observations and the discriminator’s internal representation, we can encourage the discriminator to learn a representation that has heavy overlap between the data and the generator’s distribution, thereby effectively modulating the discriminator’s accuracy and maintaining useful and informative gradients for the generator. Our approach to stabilizing adversarial learning can be viewed as an adaptive variant of instance noise (Salimans et al., 2016; Sønderby et al., 2016; Arjovsky & Bottou, 2017). However, we show that the adaptive nature of this method is critical. Constraining the mutual information between the discriminator’s internal representation and the input allows the regularizer to directly limit the discriminator’s accuracy, which automates the choice of noise magnitude and applies this noise to a compressed representation of the input that is specifically optimized to model the most discerning differences between the generator and data distributions.

Figure 1: Our method is general and can be applied to a broad range of adversarial learning tasks. Left: Motion imitation with adversarial imitation learning. Middle: Image generation. Right: Learning transferable reward functions through adversarial inverse reinforcement learning.

The main contribution of this work is the variational discriminator bottleneck (VDB), an adaptive stochastic regularization method for adversarial learning that substantially improves performance across a range of different application domains, examples of which are available in Figure 1. Our method can be easily applied to a variety of tasks and architectures. First, we evaluate our method on a suite of challenging imitation tasks, including learning highly acrobatic skills from mocap data with a simulated humanoid character. Our method also enables characters to learn dynamic continuous control skills directly from raw video demonstrations, and drastically improves upon previous work that uses adversarial imitation learning. We further evaluate the effectiveness of the technique for inverse reinforcement learning, which recovers a reward function from demonstrations in order to train future policies. Finally, we apply our framework to image generation using generative adversarial networks, where employing VDB improves the performance in many cases.

2 Related Work

Recent years have seen an explosion of adversarial learning techniques, spurred by the success of generative adversarial networks (GANs) (Goodfellow et al., 2014). A GAN framework is commonly composed of a discriminator and a generator, where the discriminator’s objective is to classify samples as real or fake, while the generator’s objective is to produce samples that fool the discriminator. Similar frameworks have also been proposed for inverse reinforcement learning (IRL) (Finn et al., 2016b) and imitation learning (Ho & Ermon, 2016). The training of adversarial models can be extremely unstable, with one of the most prevalent challenges being balancing the interplay between the discriminator and the generator (Berthelot et al., 2017). The discriminator can often overpower the generator, easily differentiating between real and fake samples, thus providing the generator with uninformative gradients for improvement (Che et al., 2016). Alternative loss functions have been proposed to mitigate this problem (Mao et al., 2016; Zhao et al., 2016; Arjovsky et al., 2017). Regularizers have been incorporated to improve stability and convergence, such as gradient penalties (Kodali et al., 2017; Gulrajani et al., 2017a; Mescheder et al., 2018), reconstruction loss (Che et al., 2016), and a myriad of other heuristics (Sønderby et al., 2016; Salimans et al., 2016; Arjovsky & Bottou, 2017; Berthelot et al., 2017). Task-specific architectural designs can also substantially improve performance (Radford et al., 2015; Karras et al., 2017). Similarly, our method also aims to regularize the discriminator in order to improve the feedback provided to the generator. But instead of explicit regularization of gradients or architecture-specific constraints, we apply a general information bottleneck that encourages the discriminator to ignore irrelevant cues, which then allows the generator to focus on improving the most discerning differences between real and fake samples.

Adversarial techniques have also been applied to inverse reinforcement learning (Fu et al., 2017), where a reward function is recovered from demonstrations, which can then be used to train policies to reproduce a desired skill. Finn et al. (2016a) showed an equivalence between maximum entropy IRL and GANs. Similar techniques have been developed for adversarial imitation learning (Ho & Ermon, 2016; Merel et al., 2017), where agents learn to imitate demonstrations without explicitly recovering a reward function. One advantage of adversarial methods is that by leveraging a discriminator in place of a reward function, they can be applied to imitate skills where reward functions can be difficult to engineer. However, the performance of policies trained through adversarial methods still falls short of those produced by manually designed reward functions, when such reward functions are available (Rajeswaran et al., 2017; Peng et al., 2018). We show that our method can significantly improve upon previous works that use adversarial techniques, and produces results of comparable quality to those from state-of-the-art approaches that utilize manually engineered reward functions.

Our variational discriminator bottleneck is based on the information bottleneck (Tishby & Zaslavsky, 2015), a technique for regularizing internal representations to minimize the mutual information with the input. Intuitively, a compressed representation can improve generalization by ignoring irrelevant distractors present in the original input. The information bottleneck can be instantiated in practical deep models by leveraging a variational bound and the reparameterization trick, inspired by a similar approach in variational autoencoders (VAE) (Kingma & Welling, 2013). The resulting variational information bottleneck approximates this compression effect in deep networks (Alemi et al., 2016). Building on the success of VAEs and GANs, a number of efforts have been made to combine the two. Makhzani et al. (2016) used adversarial discriminators during the training of VAEs to encourage the marginal distribution of the latent encoding to be similar to the prior distribution; similar techniques were used by Mescheder et al. (2017) and Chen et al. (2018). Conversely, Larsen et al. (2016) modeled the generator of a GAN using a VAE. Zhao et al. (2016) used an autoencoder instead of a VAE to model the discriminator, but did not enforce an information bottleneck on the encoding. While instance noise is widely used in modern architectures (Salimans et al., 2016; Sønderby et al., 2016; Arjovsky & Bottou, 2017), we show that explicitly enforcing an information bottleneck leads to improved performance over simply adding noise for a variety of applications.

3 Preliminaries

In this section, we provide a review of the variational information bottleneck proposed by Alemi et al. (2016) in the context of supervised learning. Our variational discriminator bottleneck is based on the same principle, and can be instantiated in the context of GANs, inverse RL, and imitation learning. Given a dataset $\{x_i, y_i\}$, with features $x_i$ and labels $y_i$, the standard maximum likelihood estimate $q(y_i \mid x_i)$ can be determined according to

$$\min_{q} \; \mathbb{E}_{x, y \sim p(x, y)}\left[-\log q(y \mid x)\right] \tag{1}$$

Unfortunately, this estimate is prone to overfitting, and the resulting model can often exploit idiosyncrasies in the data (Krizhevsky et al., 2012; Srivastava et al., 2014). Alemi et al. (2016) proposed regularizing the model using an information bottleneck to encourage the model to focus only on the most discriminative features. The bottleneck can be incorporated by first introducing an encoder $E(z \mid x)$ that maps the features $x$ to a latent distribution over $Z$, and then enforcing an upper bound $I_c$ on the mutual information between the encoding and the original features, $I(X, Z)$. This results in the following regularized objective

$$J(q, E) = \min_{q, E} \; \mathbb{E}_{x, y \sim p(x, y)}\left[\mathbb{E}_{z \sim E(z \mid x)}\left[-\log q(y \mid z)\right]\right] \tag{2}$$
$$\text{s.t. } \; I(X, Z) \le I_c$$

Note that the model $q(y \mid z)$ now maps samples $z$ from the latent distribution to the label $y$. The mutual information is defined according to

$$I(X, Z) = \int p(x, z) \log \frac{p(x, z)}{p(x)\, p(z)} \, dx \, dz = \int p(x)\, E(z \mid x) \log \frac{E(z \mid x)}{p(z)} \, dx \, dz \tag{3}$$

where $p(x)$ is the distribution given by the dataset. Computing the marginal distribution $p(z) = \int E(z \mid x)\, p(x)\, dx$ can be challenging. Instead, a variational lower bound can be obtained by using an approximation $r(z)$ of the marginal. Since $\mathrm{KL}\left[p(z) \,\|\, r(z)\right] \ge 0$, $\int p(z) \log p(z)\, dz \ge \int p(z) \log r(z)\, dz$, an upper bound on $I(X, Z)$ can be obtained via the KL divergence,

$$I(X, Z) \le \int p(x)\, E(z \mid x) \log \frac{E(z \mid x)}{r(z)} \, dx \, dz = \mathbb{E}_{x \sim p(x)}\left[\mathrm{KL}\left[E(z \mid x) \,\|\, r(z)\right]\right] \tag{4}$$
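To see why this bound holds, note that the KL divergence to any approximate marginal $r(z)$ upper-bounds the KL divergence to the true marginal $p(z) = \int E(z \mid x)\, p(x)\, dx$; using only the definitions above,

$$\mathbb{E}_{x \sim p(x)}\left[\mathrm{KL}\left[E(z \mid x) \,\|\, r(z)\right]\right] = \mathbb{E}_{x \sim p(x)}\left[\mathrm{KL}\left[E(z \mid x) \,\|\, p(z)\right]\right] + \mathrm{KL}\left[p(z) \,\|\, r(z)\right] = I(X, Z) + \mathrm{KL}\left[p(z) \,\|\, r(z)\right] \ge I(X, Z),$$

since the KL divergence is non-negative.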

This provides an upper bound $\tilde{J}(q, E)$ on the regularized objective, $J(q, E) \le \tilde{J}(q, E)$,

$$\tilde{J}(q, E) = \min_{q, E} \; \mathbb{E}_{x, y \sim p(x, y)}\left[\mathbb{E}_{z \sim E(z \mid x)}\left[-\log q(y \mid z)\right]\right] \tag{5}$$
$$\text{s.t. } \; \mathbb{E}_{x \sim p(x)}\left[\mathrm{KL}\left[E(z \mid x) \,\|\, r(z)\right]\right] \le I_c$$

To solve this problem, the constraint can be subsumed into the objective with a coefficient $\beta$,

$$\min_{q, E} \; \mathbb{E}_{x, y \sim p(x, y)}\left[\mathbb{E}_{z \sim E(z \mid x)}\left[-\log q(y \mid z)\right]\right] + \beta \left(\mathbb{E}_{x \sim p(x)}\left[\mathrm{KL}\left[E(z \mid x) \,\|\, r(z)\right]\right] - I_c\right) \tag{6}$$

Alemi et al. (2016) evaluated the method on supervised learning tasks, and showed that models trained with a VIB can be less prone to overfitting and more robust to adversarial examples.
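As a concrete reference, the following is a minimal PyTorch sketch of a classifier trained with a variational information bottleneck in the form of Equation 6. The layer sizes and the fixed coefficient `beta` are illustrative assumptions, not the settings used by Alemi et al. (2016).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBClassifier(nn.Module):
    """Classifier with a variational information bottleneck (cf. Eq. 6)."""
    def __init__(self, in_dim, z_dim, n_classes):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, z_dim)      # mean of E(z|x)
        self.fc_logvar = nn.Linear(256, z_dim)  # log-variance of E(z|x)
        self.classifier = nn.Linear(z_dim, n_classes)  # q(y|z)

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.classifier(z), mu, logvar

def vib_loss(logits, labels, mu, logvar, beta=1e-3):
    # KL[E(z|x) || N(0, I)] in closed form, averaged over the batch.
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=1).mean()
    return F.cross_entropy(logits, labels) + beta * kl
```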

Figure 2: Left: Overview of the variational discriminator bottleneck. The encoder $E(z \mid x)$ first maps samples $x$ to a latent distribution. The discriminator $D(z)$ is then trained to classify samples $z$ from the latent distribution. An information bottleneck $I(X, Z) \le I_c$ is applied to $Z$. Right: Visualization of discriminators trained to differentiate two Gaussians with different KL bounds $I_c$.

4 Variational Discriminator Bottleneck

To outline our method, we first consider a standard GAN framework consisting of a discriminator $D$ and a generator $G$, where the goal of the discriminator is to distinguish between samples from the target distribution $p^*(x)$ and samples from the generator $G(x)$,

$$\max_{G} \min_{D} \; \mathbb{E}_{x \sim p^*(x)}\left[-\log D(x)\right] + \mathbb{E}_{x \sim G(x)}\left[-\log\left(1 - D(x)\right)\right].$$

We incorporate a variational information bottleneck by introducing an encoder $E$ into the discriminator that maps a sample $x$ to a stochastic encoding $z \sim E(z \mid x)$, and then apply a constraint $I_c$ on the mutual information between the original features and the encoding. $D$ is then trained to classify samples $z$ drawn from the encoder distribution. A schematic illustration of the framework is available in Figure 2. The regularized objective $J(D, E)$ for the discriminator is given by

$$J(D, E) = \min_{D, E} \; \mathbb{E}_{x \sim p^*(x)}\left[\mathbb{E}_{z \sim E(z \mid x)}\left[-\log D(z)\right]\right] + \mathbb{E}_{x \sim G(x)}\left[\mathbb{E}_{z \sim E(z \mid x)}\left[-\log\left(1 - D(z)\right)\right]\right] \tag{7}$$
$$\text{s.t. } \; \mathbb{E}_{x \sim \tilde{\pi}(x)}\left[\mathrm{KL}\left[E(z \mid x) \,\|\, r(z)\right]\right] \le I_c$$

with $\tilde{\pi} = \tfrac{1}{2} p^* + \tfrac{1}{2} G$ being a mixture of the target distribution and the generator. We refer to this regularizer as the variational discriminator bottleneck (VDB). To optimize this objective, we can introduce a Lagrange multiplier $\beta$,

$$J(D, E) = \min_{D, E} \max_{\beta \ge 0} \; \mathbb{E}_{x \sim p^*(x)}\left[\mathbb{E}_{z \sim E(z \mid x)}\left[-\log D(z)\right]\right] + \mathbb{E}_{x \sim G(x)}\left[\mathbb{E}_{z \sim E(z \mid x)}\left[-\log\left(1 - D(z)\right)\right]\right] + \beta \left(\mathbb{E}_{x \sim \tilde{\pi}(x)}\left[\mathrm{KL}\left[E(z \mid x) \,\|\, r(z)\right]\right] - I_c\right) \tag{8}$$

As we will discuss in Section 4.1 and demonstrate in our experiments, enforcing a specific mutual information budget between $x$ and $z$ is critical for good performance. We therefore adaptively update $\beta$ via dual gradient descent to enforce a specific constraint $I_c$ on the mutual information,

$$D, E \leftarrow \arg\min_{D, E} \mathcal{L}(D, E, \beta)$$
$$\beta \leftarrow \max\left(0, \; \beta + \alpha_\beta \left(\mathbb{E}_{x \sim \tilde{\pi}(x)}\left[\mathrm{KL}\left[E(z \mid x) \,\|\, r(z)\right]\right] - I_c\right)\right) \tag{9}$$

where $\mathcal{L}(D, E, \beta)$ is the Lagrangian

$$\mathcal{L}(D, E, \beta) = \mathbb{E}_{x \sim p^*(x)}\left[\mathbb{E}_{z \sim E(z \mid x)}\left[-\log D(z)\right]\right] + \mathbb{E}_{x \sim G(x)}\left[\mathbb{E}_{z \sim E(z \mid x)}\left[-\log\left(1 - D(z)\right)\right]\right] + \beta \left(\mathbb{E}_{x \sim \tilde{\pi}(x)}\left[\mathrm{KL}\left[E(z \mid x) \,\|\, r(z)\right]\right] - I_c\right) \tag{10}$$

and $\alpha_\beta$ is the stepsize for the dual variable in dual gradient descent (Boyd & Vandenberghe, 2004). In practice, we perform only one gradient step on $D$ and $E$, followed by an update to $\beta$. We refer to a GAN that incorporates a VDB as a variational generative adversarial network (VGAN).

In our experiments, the prior $r(z) = \mathcal{N}(0, I)$ is modeled with a standard Gaussian. The encoder $E(z \mid x) = \mathcal{N}\left(\mu_E(x), \Sigma_E(x)\right)$ models a Gaussian distribution in the latent variables $Z$, with mean $\mu_E(x)$ and diagonal covariance matrix $\Sigma_E(x)$. We use a simplified objective for the generator,

$$\max_{G} \; \mathbb{E}_{x \sim G(x)}\left[-\log\left(1 - D\left(\mu_E(x)\right)\right)\right] \tag{11}$$

where the KL penalty is excluded from the generator’s objective. Instead of computing the expectation over $E(z \mid x)$, we found that approximating the expectation by evaluating $D$ at the mean $\mu_E(x)$ of the encoder’s distribution was sufficient for our tasks. The discriminator is modeled with a single linear unit followed by a sigmoid, $D(z) = \sigma\left(w_D^T z + b_D\right)$, with weights $w_D$ and bias $b_D$.
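To make the procedure concrete, the sketch below performs a single VDB update: a joint discriminator/encoder step on the Lagrangian of Equation 10, a dual gradient step on $\beta$ as in Equation 9, and a generator step on Equation 11 evaluated at the encoder mean. The network interfaces, optimizers, and hyperparameter values are placeholders for illustration rather than the exact settings used in our experiments.

```python
import torch
import torch.nn.functional as F

def kl_to_standard_normal(mu, logvar):
    # KL[ N(mu, diag(exp(logvar))) || N(0, I) ] per sample.
    return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=1)

def vgan_step(enc, disc, gen, d_opt, g_opt, x_real, noise, beta, i_c, beta_lr):
    """One VDB update. `d_opt` optimizes the encoder and discriminator jointly,
    `g_opt` optimizes the generator; `disc` outputs probabilities in (0, 1)."""
    # --- Discriminator / encoder step on the Lagrangian (Eq. 10) ---
    x_fake = gen(noise).detach()
    mu_r, logvar_r = enc(x_real)
    mu_f, logvar_f = enc(x_fake)
    z_r = mu_r + torch.randn_like(mu_r) * torch.exp(0.5 * logvar_r)
    z_f = mu_f + torch.randn_like(mu_f) * torch.exp(0.5 * logvar_f)
    d_real, d_fake = disc(z_r), disc(z_f)
    gan_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
             + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    # KL averaged over the mixture of real and generated samples.
    kl = 0.5 * (kl_to_standard_normal(mu_r, logvar_r).mean()
                + kl_to_standard_normal(mu_f, logvar_f).mean())
    d_loss = gan_loss + beta * (kl - i_c)
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # --- Dual gradient step on beta (Eq. 9) ---
    beta = max(0.0, beta + beta_lr * (kl.item() - i_c))

    # --- Generator step (Eq. 11), evaluated at the encoder mean ---
    mu_f, _ = enc(gen(noise))
    d_fake = disc(mu_f)
    # Maximize E[-log(1 - D(mu_E(G(u))))]  <=>  minimize E[log(1 - D(.))].
    g_loss = torch.log(1.0 - d_fake + 1e-8).mean()
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return beta
```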

4.1 Discussion and Analysis

To interpret the effects of the VDB, we consider the results presented by Arjovsky & Bottou (2017), which show that for two distributions with disjoint support, the optimal discriminator can perfectly classify all samples and its gradients will be zero almost everywhere. Thus, as the discriminator converges to the optimum, the gradients for the generator vanish accordingly. To address this issue, Arjovsky & Bottou (2017) proposed applying continuous noise to the discriminator inputs, thereby ensuring that the distributions have continuous support everywhere. In practice, if the original distributions are sufficiently distant from each other, the added noise will have negligible effects. As shown by Mescheder et al. (2017), the optimal choice for the variance of the noise to ensure convergence can be quite delicate. In our method, by first using a learned encoder to map the inputs to an embedding and then applying an information bottleneck on the embedding, we can dynamically adjust the variance of the noise such that the distributions not only share support in the embedding space, but also have significant overlap. Since the minimum amount of information required for binary classification is 1 bit, by selecting an information constraint $I_c < 1$, the discriminator is prevented from perfectly differentiating between the distributions. To illustrate the effects of the VDB, we consider a simple task of training a discriminator to differentiate between two Gaussian distributions. Figure 2 visualizes the decision boundaries learned with different bounds $I_c$ on the mutual information. Without a VDB, the discriminator learns a sharp decision boundary, resulting in vanishing gradients for much of the space. But as $I_c$ decreases and the bound tightens, the decision boundary is smoothed, providing more informative gradients that can be leveraged by the generator.

Taking this analysis further, we can extend Theorem 3.2 from Arjovsky & Bottou (2017) to analyze the VDB, and show that the gradient of the generator will be non-degenerate for a small enough constraint $I_c$, under some additional simplifying assumptions. The result in Arjovsky & Bottou (2017) states that the gradient consists of vectors that point toward samples on the data manifold, multiplied by coefficients that depend on the noise. However, these coefficients may be arbitrarily small if the generated samples are far from real samples, and the noise is not large enough. This can still cause the generator gradient to vanish. In the case of the VDB, the constraint ensures that these coefficients are always bounded below. Due to space constraints, this result is presented in Appendix A.

4.2 VAIL: Variational Adversarial Imitation Learning

To extend the VDB to imitation learning, we start with the generative adversarial imitation learning (GAIL) framework (Ho & Ermon, 2016), where the discriminator’s objective is to differentiate between the state distribution induced by a target policy $\pi^*(s)$ and the state distribution of the agent’s policy $\pi(s)$,

$$\max_{\pi} \min_{D} \; \mathbb{E}_{s \sim \pi^*(s)}\left[-\log D(s)\right] + \mathbb{E}_{s \sim \pi(s)}\left[-\log\left(1 - D(s)\right)\right].$$

The discriminator is trained to maximize the likelihood assigned to states from the target policy, while minimizing the likelihood assigned to states from the agent’s policy. The discriminator also serves as the reward function for the agent, which encourages the policy to visit states that, to the discriminator, appear indistinguishable from the demonstrations. Similar to the GAN framework, we can incorporate a VDB into the discriminator,

$$J(D, E) = \min_{D, E} \max_{\beta \ge 0} \; \mathbb{E}_{s \sim \pi^*(s)}\left[\mathbb{E}_{z \sim E(z \mid s)}\left[-\log D(z)\right]\right] + \mathbb{E}_{s \sim \pi(s)}\left[\mathbb{E}_{z \sim E(z \mid s)}\left[-\log\left(1 - D(z)\right)\right]\right] + \beta \left(\mathbb{E}_{s \sim \tilde{\pi}(s)}\left[\mathrm{KL}\left[E(z \mid s) \,\|\, r(z)\right]\right] - I_c\right) \tag{12}$$

where $\tilde{\pi} = \tfrac{1}{2} \pi^* + \tfrac{1}{2} \pi$ represents a mixture of the target policy and the agent’s policy. The reward for the agent is then specified by the discriminator, $r_t = -\log\left(1 - D\left(\mu_E(s)\right)\right)$. We refer to this method as variational adversarial imitation learning (VAIL).
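For reference, the sketch below shows how the VAIL discriminator can be converted into a reward signal for the policy update; the encoder/discriminator interface and the small epsilon are illustrative assumptions.

```python
import torch

def vail_reward(enc, disc, states, eps=1e-8):
    """Reward r_t = -log(1 - D(mu_E(s_t))), evaluated at the encoder mean."""
    with torch.no_grad():
        mu, _ = enc(states)   # mean of E(z|s)
        d = disc(mu)          # D(z) in (0, 1)
        return -torch.log(1.0 - d + eps)
```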

4.3 VAIRL: Variational Adversarial Inverse Reinforcement Learning

The VDB can also be applied to adversarial inverse reinforcement learning (Fu et al., 2017) to yield a new algorithm which we call variational adversarial inverse reinforcement learning (VAIRL). AIRL operates in a similar manner to GAIL, but with a discriminator of the form

$$D(s, a, s') = \frac{\exp\left(f(s, a, s')\right)}{\exp\left(f(s, a, s')\right) + \pi(a \mid s)} \tag{13}$$

where $f(s, a, s') = g(s, a) + \gamma h(s') - h(s)$, with $g$ and $h$ being learned functions. Under certain restrictions on the environment, Fu et al. show that if $g$ is defined to depend only on the current state $s$, the optimal $g^*(s)$ recovers the expert’s true reward function $r^*(s)$ up to a constant. In this case, the learned reward can be re-used to train policies in environments with different dynamics, and will yield the same policy as if the policy were trained under the expert’s true reward. In contrast, GAIL’s discriminator typically cannot be re-optimized in this way (Fu et al., 2017). In VAIRL, we introduce stochastic encoders $E_g(z_g \mid s)$ and $E_h(z_h \mid s)$, and $g, h$ are modified to be functions of the encoding. We can reformulate Equation 13 as

$$D(s, a, s') = \frac{\exp\left(f(z_g, z_h, z_h')\right)}{\exp\left(f(z_g, z_h, z_h')\right) + \pi(a \mid s)}$$

for $z_g \sim E_g(z_g \mid s)$, $z_h \sim E_h(z_h \mid s)$, $z_h' \sim E_h(z_h' \mid s')$, and $f(z_g, z_h, z_h') = g(z_g) + \gamma h(z_h') - h(z_h)$. We then obtain a modified objective of the form

$$J(D, E) = \min_{D, E} \max_{\beta \ge 0} \; \mathbb{E}_{s, s' \sim \pi^*}\left[\mathbb{E}_{z \sim E(z \mid s, s')}\left[-\log D(s, a, s')\right]\right] + \mathbb{E}_{s, s' \sim \pi}\left[\mathbb{E}_{z \sim E(z \mid s, s')}\left[-\log\left(1 - D(s, a, s')\right)\right]\right] + \beta \left(\mathbb{E}_{s, s' \sim \tilde{\pi}}\left[\mathrm{KL}\left[E(z \mid s, s') \,\|\, r(z)\right]\right] - I_c\right)$$

where $\pi(s, s')$ denotes the joint distribution of successive states from a policy, $z = (z_g, z_h, z_h')$, and $E(z \mid s, s') = E_g(z_g \mid s)\, E_h(z_h \mid s)\, E_h(z_h' \mid s')$.
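A minimal sketch of the VAIRL discriminator structure described above, with separate stochastic encoders for the reward and shaping terms; the module interfaces and the log-space formulation are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class VAIRLDiscriminator(nn.Module):
    """AIRL-style discriminator (cf. Eq. 13) acting on stochastic encodings of s and s'."""
    def __init__(self, enc_g, enc_h, g_net, h_net, gamma=0.99):
        super().__init__()
        self.enc_g, self.enc_h = enc_g, enc_h   # E_g(z_g|s) and E_h(z_h|s)
        self.g, self.h = g_net, h_net           # learned reward and shaping terms
        self.gamma = gamma

    @staticmethod
    def sample(enc, s):
        mu, logvar = enc(s)
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def forward(self, s, s_next, log_pi_a):
        z_g = self.sample(self.enc_g, s)
        z_h = self.sample(self.enc_h, s)
        z_h_next = self.sample(self.enc_h, s_next)
        # f = g(z_g) + gamma * h(z'_h) - h(z_h)
        f = self.g(z_g) + self.gamma * self.h(z_h_next) - self.h(z_h)
        # D = exp(f) / (exp(f) + pi(a|s)) = sigmoid(f - log pi(a|s)),
        # computed in log space for numerical stability.
        return torch.sigmoid(f - log_pi_a)
```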

(a) Backflip
(b) Cartwheel
(c) Dance
(d) Run
Figure 3: Simulated humanoid performing various skills. VAIL is able to closely imitate a broad range of skills from mocap data.

5 Experiments

We evaluate our method on adversarial learning problems in imitation learning, inverse reinforcement learning, and image generation. In the case of imitation learning, we show that the VDB enables agents to learn complex motion skills from a single demonstration, including visual demonstrations provided in the form of video clips. We also show that the VDB improves the performance of inverse RL methods. Inverse RL aims to reconstruct a reward function from a set of demonstrations, which can then be used to perform the task in new environments, in contrast to imitation learning, which aims to recover a policy directly. Our method is also not limited to control tasks, and we demonstrate its effectiveness for unconditional image generation.

Figure 4: Learning curves comparing VAIL to other methods for motion imitation. Performance is measured using the average joint rotation error between the simulated character and the reference motion. Each method is evaluated with 3 random seeds.
Method Backflip Cartwheel Dance Run Spinkick
BC 3.01 2.88 2.93 2.63 2.88
Merel et al., 2017
GAIL
GAIL - noise
GAIL - noise z
VAIL (ours)
Peng et al., 2018 0.26 0.21 0.20 0.14 0.19

Table 1: Average joint rotation error (radians) on humanoid motion imitation tasks. VAIL outperforms the other methods for all skills evaluated, except for policies trained using the manually-designed reward function from (Peng et al., 2018).

5.1 VAIL: Variational Adversarial Imitation Learning

The goal of the motion imitation tasks is to train a simulated character to mimic demonstrations provided by mocap clips recorded from human actors. Each mocap clip provides a sequence of target states that the character should track at each timestep. We use a similar experimental setup as Peng et al. (2018), with a 34 degrees-of-freedom humanoid character. We found that the discriminator architecture can greatly affect the performance on complex skills. The particular architecture we employ differs substantially from those used in prior work (Merel et al., 2017), details of which are available in Appendix B. The encoding is D and an information constraint of is applied for all skills, with a dual stepsize of . All policies are trained using PPO (Schulman et al., 2017).

The motions learned by the policies are best seen in the supplementary video. Snapshots of the character’s motions are shown in Figure 3. Each skill is learned from a single demonstration. VAIL is able to closely reproduce a variety of skills, including those that involve highly dynamic flips and complex contacts. We compare VAIL to a number of other techniques, including state-only GAIL (Ho & Ermon, 2016), GAIL with instance noise applied to the discriminator inputs (GAIL - noise), and GAIL with instance noise applied to the last hidden layer (GAIL - noise z). Learning curves for the various methods are shown in Figure 4 and Table 1 summarizes the performance of the final policies. Performance is measured in terms of the average joint rotation error between the simulated character and the reference motion. We also include a reimplementation of the method described by Merel et al. (2017). For the purpose of our experiments, GAIL denotes policies trained using our particular architecture but without a VDB, and Merel et al. (2017) denotes policies trained using an architecture that closely mirrors those from previous work. Furthermore, we include comparisons to policies trained using the handcrafted reward from Peng et al. (2018), as well as policies trained via behavioral cloning (BC). Since mocap data does not provide expert actions, we use the policies from Peng et al. (2018) as oracles to provide state-action demonstrations, which are then used to train the BC policies via supervised learning. Each BC policy is trained with 10k samples from the oracle policies, while all other policies are trained from just a single demonstration, the equivalent of approximately 100 samples.

VAIL consistently outperforms the other adversarial methods. Simply adding instance noise to the inputs (Salimans et al., 2016) or hidden layer without the KL constraint (Sønderby et al., 2016) leads to worse performance, since the network can learn a latent representation that renders the effects of the noise negligible. Though training with the handcrafted reward still outperforms the adversarial methods, VAIL demonstrates comparable performance to the handcrafted reward without manual reward or feature engineering, and produces motions that closely resemble the original demonstrations. The method from Merel et al. (2017) was able to imitate simple skills such as running, but was unable to reproduce more acrobatic skills such as the backflip and spinkick. In the case of running, our implementation produces more natural gaits than the results reported in Merel et al. (2017). Behavioral cloning is unable to reproduce any of the skills, despite being provided with substantially more demonstration data than the other methods.

Figure 5: Left: Snapshots of the video demonstration and the simulated character trained with VAIL. The policy learns to run by directly imitating the video. Right: Saliency maps that visualize the magnitude of the discriminator’s gradient with respect to input images from both the demonstration and the simulation.
Figure 6: Left: Learning curves comparing policies for the video imitation task trained using a pixel-wise loss as the reward, GAIL, and VAIL. Only VAIL successfully learns to run from a video demonstration. Middle: Effect of training with fixed values of $\beta$ and an adaptive $\beta$. Right: KL loss over the course of training with an adaptive $\beta$. The dual gradient descent update for $\beta$ effectively enforces the VDB constraint $I_c$.

Video Imitation:

While our method achieves substantially better results on motion imitation when compared to prior work, previous methods can still produce reasonable behaviors. However, if the demonstrations are provided in terms of the raw pixels from video clips, instead of mocap data, the imitation task becomes substantially harder. The goal of the agent is therefore to directly imitate the skill depicted in the video. This is also a setting where manually engineering rewards is impractical, since simple losses like pixel distance do not provide a semantically meaningful measure of similarity. Figure 6 compares learning curves of policies trained with VAIL, GAIL, and policies trained using a reward function defined by the average pixel-wise difference between the frame from the video demonstration and a rendered image of the agent at each timestep. Each frame is represented by an RGB image.
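For reference, the pixel-loss baseline corresponds to a reward of roughly the following form; the choice of norm, image scaling, and resolution are implementation details assumed here for illustration.

```python
import numpy as np

def pixel_reward(demo_frame, sim_frame):
    """Negative mean per-pixel (L1) difference between the demonstration frame and
    the rendered simulation frame at the same timestep; images are arrays in [0, 1]."""
    diff = np.abs(demo_frame.astype(np.float32) - sim_frame.astype(np.float32))
    return -float(diff.mean())
```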

Both GAIL and the pixel-loss are unable to learn the running gait. VAIL is the only method that successfully learns to imitate the skill from the video demonstration. Snapshots of the video demonstration and the simulated motion are available in Figure 5. To further investigate the effects of the VDB, we visualize the gradient of the discriminator with respect to images from the video demonstration and simulation. Saliency maps for discriminators trained with VAIL and GAIL are available in Figure 5. The VAIL discriminator learns to attend to spatially coherent image patches around the character, while the GAIL discriminator exhibits less structure. The magnitudes of the gradients from VAIL also tend to be significantly larger than those from GAIL, which may suggest that VAIL is able to mitigate the problem of vanishing gradients present in GAIL.
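The saliency maps in Figure 5 can be reproduced by differentiating the discriminator output with respect to the input image; a minimal sketch is shown below, where the discriminator interface is an assumption.

```python
import torch

def saliency_map(discriminator, image):
    """Magnitude of the discriminator's gradient w.r.t. an input image of shape
    (C, H, W); returns an (H, W) map by taking the maximum over channels."""
    x = image.clone().unsqueeze(0).requires_grad_(True)
    discriminator(x).sum().backward()
    return x.grad.abs().squeeze(0).max(dim=0).values
```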

Adaptive Constraint:

To evaluate the effects of the adaptive updates, we compare policies trained with different fixed values of $\beta$ and policies where $\beta$ is updated adaptively to enforce a desired information constraint $I_c$. Figure 6 illustrates the learning curves and the KL loss over the course of training. When $\beta$ is too small, performance reverts to that achieved by GAIL. Large values of $\beta$ help to smooth the discriminator landscape and improve learning speed during the early stages of training, but converge to worse final performance. Policies trained using dual gradient descent to adaptively update $\beta$ consistently achieve the best performance overall.

5.2 VAIRL: Variational Adversarial Inverse Reinforcement Learning

Method          C-maze          S-maze
GAIL            -24.6 ± 7.2     1.0 ± 1.3
VAIL            -65.6 ± 18.9    20.8 ± 39.7
AIRL            -15.3 ± 7.8     -0.2 ± 0.1
VAIRL (β = 0)   -25.5 ± 7.2     62.3 ± 33.2
VAIRL (ours)    -10.0 ± 2.2     74.0 ± 38.7
TRPO expert     -5.1            153.2
Figure 7: Left: C-Maze and S-Maze. When trained on the training maze on the left, AIRL learns a reward that overfits to the training task, and which cannot be transferred to the mirrored maze on the right. In contrast, VAIRL learns a smoother reward function that enables more-reliable transfer. Right: Performance on flipped test versions of our two training mazes. We report mean return (± std. dev.) over five runs for imitation methods, and mean return for the single expert used to generate demonstrations.

Next, we use VAIRL to recover reward functions from demonstrations. Unlike the discriminator learned by VAIL, the reward function recovered by VAIRL can be re-optimized to train new policies from scratch in the same environment; in some cases, it can also be used to transfer similar behaviour to different environments. In Figure 7, we show the results of applying VAIRL to the C-maze from Fu et al. (2017), and a more complex S-maze; the simple 2D observation spaces of these tasks make it easy to interpret the recovered reward functions. In both mazes, the expert is trained to navigate from a start position at the bottom of the maze to a fixed target position at the top. We use each method to obtain an imitation policy and approximate the expert’s reward on the original maze. The recovered reward is then used to train a new policy to solve a horizontally mirrored version of the training maze. On the C-maze, we found that AIRL would sometimes overfit to the training environment, and fail to transfer to the new environment; this is evidenced by both the reward visualization in Figure 7 (left) and the higher return variance in Figure 7 (right). In contrast, by incorporating a VDB into AIRL, VAIRL learns a substantially smoother reward function that is more suitable for transfer. Furthermore, we found that in the S-maze—which has two internal walls instead of one—AIRL was too unstable to acquire a meaningful reward function, whereas VAIRL was able to learn a reasonable reward in most cases. To evaluate the effects of the VDB, we observe that the performance of VAIRL drops on both tasks when the KL constraint is disabled ($\beta = 0$), suggesting that the improvements from the VDB cannot be attributed entirely to the noise introduced by the sampling process for $z$. Further details of these experiments and illustrations of the recovered reward functions are available in Appendix C.

Method FID
GAN 63.6
Inst Noise 30.7
VGAN (ours) 24.8
GP 22.6
WGAN-GP 19.9
VGAN-GP (ours) 18.1
Figure 8: Comparison of VGAN and other methods on CIFAR-10, with performance evaluated using the Fréchet Inception Distance (FID).
Figure 9: Random image samples on CIFAR-10, CelebA 128×128, and CelebAHQ 1024×1024 using VGAN.

5.3 VGAN: Variational Generative Adversarial Networks

Finally, we apply the VDB to image generation with generative adversarial networks, which we refer to as VGAN. Experiments are conducted on the CIFAR-10 (Krizhevsky et al.), CelebA (Liu et al., 2015), and CelebAHQ (Karras et al., 2018) datasets. We compare our approach to recent stabilization techniques: WGAN-GP (Gulrajani et al., 2017b), instance noise (Sønderby et al., 2016; Arjovsky & Bottou, 2017), and gradient penalty (GP) (Mescheder et al., 2018), as well as the original GAN (Goodfellow et al., 2014) on CIFAR-10. To measure performance, we report the Fréchet Inception Distance (FID) (Heusel et al., 2017), which has been shown to be more consistent with human evaluation. All methods are implemented using the same base model, built on the ResNet architecture of Mescheder et al. (2018). Aside from tuning the KL constraint for VGAN, no additional hyperparameter optimization was performed to modify the settings provided by Mescheder et al. (2018). The performance of the various methods on CIFAR-10 is shown in Figure 8. While the vanilla GAN and instance noise are prone to diverging as training progresses, VGAN remains stable. Note that instance noise can be seen as a non-adaptive version of VGAN without a constraint on the mutual information. This experiment again highlights that there is a significant improvement from imposing the information bottleneck over simply adding instance noise. VGAN is competitive with WGAN-GP and GP. Since the VDB and GP are complementary techniques, we also train a model that combines both, which we refer to as VGAN-GP. This combination achieves the best performance overall, with an FID of 18.1. See Figure 9 for samples of images generated with our approach. Please refer to Appendix D for experimental details and more results.

6 Conclusion

We present the variational discriminator bottleneck, a general regularization technique for adversarial learning. Our experiments show that the VDB is broadly applicable to a variety of domains, and yields significant improvements over previous techniques on a number of challenging tasks. While our experiments have produced promising results for video imitation, the results have been primarily with videos of synthetic scenes. We believe that extending the technique to imitating real-world videos is an exciting direction. Another exciting direction for future work is a more in-depth theoretical analysis of the method, to derive convergence and stability results or conditions.

References

Supplementary Material

Appendix A Analysis and Proofs

In this appendix, we show that the gradient of the generator when the discriminator is augmented with the VDB is non-degenerate, under some mild additional assumptions. First, we assume a pointwise constraint of the form $\mathrm{KL}\left[E(z \mid x) \,\|\, r(z)\right] \le I_c$ for all $x$. In reality, we use an average KL constraint, since we found it to be more convenient to optimize, though a pointwise constraint is also possible to enforce by using the largest constraint violation to increment $\beta$. We could likely also extend the analysis to the average constraint, though we leave this to future work. The main theorem can then be stated as follows:

Theorem A.1.

Let denote the generator’s mapping from a noise vector to a point in . Given the generator distribution and data distribution , a VDB with an encoder , and , the gradient passed to the generator has the form

where is the optimal discriminator, and are positive functions, and we always have , where is a continuous monotonic function, and as .

Analysis for an encoder with an input-dependent variance is also possible, but more involved. We’ll further assume below for notational simplicity that is diagonal with diagonal values . This assumption is not required, but substantially simplifies the linear algebra. Analogously to Theorem 3.2 from Arjovsky & Bottou (2017), this theorem states that the gradient of the generator points in the direction of points in the data distribution, and away from points in the generator distribution. However, going beyond the theorem in Arjovsky & Bottou (2017), this result states that the coefficients on these vectors, given by , are always bounded below by a value that approaches a positive constant as we decrease , meaning that the gradient does not vanish. The proof of the first part of this theorem is essentially identical to the proof presented by Arjovsky & Bottou (2017), but accounting for the fact that the noise is now injected into the latent space of the VDB, rather than being added directly to . This result assumes that has a learned but input-independent variance , though the proof can be repeated for an input-dependent or non-diagonal :

Proof.

Overloading and , let and be the distribution of embeddings under the real data and generator respectively. is then given by

and similarly for

From Arjovsky & Bottou (2017), the optimal discriminator between and is

The gradient passed to the generator then has the form

Let

We then have

Similar to the result from Arjovsky & Bottou (2017), the gradient of the generator drives the generator’s samples in the embedding space towards embeddings of the points from the dataset weighted by their likelihood under the real data. For an arbitrary encoder , real and fake samples in the embedding may be far apart. As such, the coefficients can be arbitrarily small, thereby resulting in vanishing gradients for the generator.

The second part of the theorem states that is a continuous monotonic function, and as . This is the main result, and relies on the fact that . The intuition behind this result is that, for any two inputs and , their encoded distributions and have means that cannot be more than some distance apart, and that distance shrinks with . This allows us to bound below by , which ensures that the coefficients on the vectors in the theorem above are always at least as large as .

Proof.

Let be the prior distribution and suppose the KL divergence for all in the dataset and all generated by the generator are bounded by

From the definition of the KL-divergence we can bound the length of all embedding vectors,

and similarly for , with denoting the dimension of . A lower bound on , where and , can then be determined by

Since ,

and it follows that

The likelihood is therefore bounded below by

Since ,

(14)

From the KL constraint, we can derive a lower bound and an upper bound on .

For the upper bound, since ,

Substituting and into Equation 14, we arrive at the following lower bound

Appendix B Imitation Learning

Experimental Setup:

The goal of the motion imitation tasks is to train a simulated agent to mimic a demonstration provided in the form of a mocap clip recorded from a human actor. We use a similar experimental setup as Peng et al. (2018), with a 34 degrees-of-freedom humanoid character. The state consists of features that represent the configuration of the character’s body (link positions and velocities). We also include a phase variable $\phi \in [0, 1]$ among the state features, which records the character’s progress along the motion and helps to synchronize the character with the reference motion, with $\phi = 0$ and $\phi = 1$ denoting the start and end of the motion, respectively. The action sampled from the policy specifies target poses for PD controllers positioned at each joint. Given a state, the policy specifies a Gaussian distribution over the action space, with a state-dependent mean $\mu_\pi(s)$ and fixed diagonal covariance matrix $\Sigma_\pi$. The mean is modeled using a 3-layer fully-connected network with 1024 and 512 hidden units, followed by a linear output layer that specifies the mean of the Gaussian. ReLU activations are used for all hidden layers. The value function is modeled with a similar architecture but with a single linear output unit. The policy is queried at 30 Hz. Physics simulation is performed at 1.2 kHz using the Bullet physics engine (Bullet, 2015).

Given the rewards from the discriminator, PPO (Schulman et al., 2017) is used to train the policy, with separate stepsizes for the policy, the value function, and the discriminator. Gradient descent with momentum 0.9 is used for all models, and the standard PPO clipped surrogate objective is applied. When evaluating the performance of the policies, each episode is simulated for a maximum horizon of 20. Early termination is triggered whenever the character’s torso contacts the ground, after which the policy is assigned the maximum error for all remaining timesteps.

Phase-Functioned Discriminator:

Unlike the policy and value function, which are modeled with standard fully-connected networks, the discriminator is modeled by a phase-functioned neural network (PFNN) to explicitly model the time-dependency of the reference motion (Holden et al., 2017). While the parameters of a standard network are fixed, the parameters of a PFNN are functions of the phase variable $\phi$. The parameters $\theta(\phi)$ of the network for a given $\phi$ are determined by a weighted combination of a set of fixed parameters $\{\theta_i\}$,

$$\theta(\phi) = \sum_i w_i(\phi)\, \theta_i,$$

where $w_i(\phi)$ is a phase-dependent weight for $\theta_i$. In our implementation, we use a fixed collection of parameter sets, and $w_i(\phi)$ is designed to linearly interpolate between the two adjacent sets of parameters for each phase $\phi$, where each set of parameters corresponds to a discrete phase value spaced uniformly over the phase range. For a given value of $\phi$, the parameters of the discriminator are determined according to

$$\theta(\phi) = \frac{\phi_{i+1} - \phi}{\phi_{i+1} - \phi_i}\, \theta_i + \frac{\phi - \phi_i}{\phi_{i+1} - \phi_i}\, \theta_{i+1},$$

where $\phi_i$ and $\phi_{i+1}$ correspond to the phase values that form the endpoints of the phase interval that contains $\phi$. A PFNN is used for all motion imitation experiments, both VAIL and GAIL, except for those that use the approach proposed by Merel et al. (2017), which use standard fully-connected networks for the discriminator. Figure 10 compares the performance of VAIL when the discriminator is modeled with a phase-functioned neural network (with PFNN) to discriminators modeled with standard fully-connected networks. We increased the size of the layers of the fully-connected nets to have a similar number of parameters as a PFNN. We evaluate the performance of fully-connected nets that receive the phase variable as part of the input (no PFNN), and fully-connected nets that do not receive $\phi$ as an input (no phase). The phase-functioned discriminator leads to significant performance improvements across all tasks evaluated. Policies trained without a phase variable perform worst overall, suggesting that phase information is critical for performance. All methods perform well on simpler skills, such as running, but the additional phase structure introduced by the PFNN proved to be vital for successful imitation of more complex skills, such as the dance and backflip. A sketch of the interpolation is given below.
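A minimal sketch of the phase-functioned parameter blending described above, assuming the phase is normalized to [0, 1] and the parameter sets are placed at evenly spaced phase values; the number of parameter sets and other implementation details are assumptions.

```python
import numpy as np

def pfnn_params(theta_sets, phase):
    """Linearly interpolate between adjacent parameter sets for a phase in [0, 1].
    theta_sets: list of K parameter arrays associated with evenly spaced phase
    values 0, 1/(K-1), ..., 1."""
    k = len(theta_sets) - 1                 # number of phase intervals
    idx = min(int(phase * k), k - 1)        # interval containing the phase
    t = phase * k - idx                     # interpolation weight within the interval
    return (1.0 - t) * theta_sets[idx] + t * theta_sets[idx + 1]
```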

Figure 10: Learning curves comparing VAIL with a discriminator modeled by a phase-functioned neural network (PFNN), to modeling the discriminator with a fully-connected network that receives the phase variable as part of the input (no PFNN), and a discriminator modeled with a fully-connected network that does not receive $\phi$ as an input (no phase).
Figure 11: Left: Accuracy of the discriminator trained using different methods for imitating the dance skill. Middle: Value of the dual variable $\beta$ over the course of training. Right: KL loss over the course of training. The dual gradient descent update for $\beta$ effectively enforces the VDB constraint $I_c$.

Next, we compare the accuracy of discriminators trained using different methods. Figure 11 illustrates the accuracy of the discriminators over the course of training. Discriminators trained via GAIL quickly overpower the policy, and learn to accurately differentiate between samples, even when instance noise is applied to the inputs. VAIL without the KL constraint slows the discriminator’s progress, but nonetheless reaches near-perfect accuracy with a larger number of samples. Once the KL constraint is enforced, the information bottleneck limits the performance of the discriminator, whose accuracy plateaus well below perfect classification. Figure 11 also visualizes the value of $\beta$ over the course of training for the motion imitation tasks, along with the loss of the KL term in the objective. The dual gradient descent update effectively enforces the VDB constraint $I_c$.

Video Imitation:

In the video imitation tasks, we use a simplified 2D biped character in order to avoid issues that may arise due to depth ambiguity from monocular videos. The biped character has a total of 12 degrees-of-freedom, with similar state and action parameters as the humanoid. The video demonstrations are generated by rendering a reference motion into a sequence of video frames, which are then provided to the agent as a demonstration. The goal of the agent is to imitate the motion depicted in the video, without access to the original reference motion, and the reference motion is used only to evaluate performance.

Appendix C Inverse Reinforcement Learning

C.1 Experimental Setup

Figure 12: Left: The C-maze used for training and its mirror version used for testing. Colour contours show the ground truth reward function that we use to train the expert and evaluate transfer quality, while the red and green dots show the initial and goal positions, respectively. Right: The analogous diagram for the S-maze.

Environments

We evaluate on two maze tasks, as illustrated in Figure 12. The C-maze is taken from Fu et al. (2017): in this maze, the agent starts at a random point within a small fixed distance of the mean start position. The agent has a continuous, 2D action space which allows it to accelerate in the $x$ or $y$ directions, and is able to observe its $x$ and $y$ position, but not its velocity. The ground truth reward is