Plug & Play Generative Networks:
Conditional Iterative Generation of Images in Latent Space
Abstract
Generating high-resolution, photorealistic images has been a longstanding goal in machine learning. Recently, Nguyen et al. [37] showed one interesting way to synthesize novel images by performing gradient ascent in the latent space of a generator network to maximize the activations of one or multiple neurons in a separate classifier network. In this paper we extend this method by introducing an additional prior on the latent code, improving both sample quality and sample diversity, leading to a state-of-the-art generative model that produces high-quality images at higher resolutions than previous generative models, and does so for all 1000 ImageNet categories. In addition, we provide a unified probabilistic interpretation of related activation maximization methods and call the general class of models “Plug and Play Generative Networks.” PPGNs are composed of 1) a generator network $G$ that is capable of drawing a wide range of image types and 2) a replaceable “condition” network $C$ that tells the generator what to draw. We demonstrate the generation of images conditioned on a class (when $C$ is an ImageNet or MIT Places classification network) and also conditioned on a caption (when $C$ is an image captioning network). Our method also improves the state of the art of Multifaceted Feature Visualization [40], which generates the set of synthetic inputs that activate a neuron in order to better understand how deep neural networks operate. Finally, we show that our model performs reasonably well at the task of image inpainting. While image models are used in this paper, the approach is modality-agnostic and can be applied to many types of data.
1 Introduction
Recent years have seen generative models that are increasingly capable of synthesizing diverse, realistic images that capture both the fine-grained details and global coherence of natural images [54, 27, 9, 15, 43, 24]. However, many important open challenges remain, including (1) producing photorealistic images at high resolutions [30], (2) training generators that can produce a wide variety of images (e.g. all 1000 ImageNet classes) instead of only one or a few types (e.g. faces or bedrooms [43]), and (3) producing a diversity of samples that match the diversity in the dataset instead of modeling only a subset of the data distribution [14, 53]. Current image generative models often work well at low resolutions, but struggle to generate high-resolution, globally coherent images (especially for datasets such as ImageNet [7] that have a large variability [41, 47, 14]) due to many challenges including difficulty in training [47, 41] and computationally expensive sampling procedures [54, 55].
Nguyen et al. [37] recently introduced a technique that produces high-quality images at a high resolution. Their Deep Generator Network-based Activation Maximization^1 (DGN-AM) involves training a generator $G$ to create realistic images from compressed features extracted from a pretrained classifier network $E$ (Fig. 3f). To generate images conditioned on a class, an optimization process is launched to find a hidden code $h$ that $G$ maps to an image that highly activates a neuron in another classifier $C$ (not necessarily the same as $E$). Not only does DGN-AM produce realistic images at a high resolution (Figs. 2b & S10b), but, without having to retrain $G$, it can also produce interesting new types of images that $G$ never saw during training. For example, a $G$ trained on ImageNet can produce ballrooms, jail cells, and picnic areas if $C$ is trained on the MIT Places dataset (Fig. S17, top).
^1 Activation maximization is a technique of searching via optimization for the synthetic image that maximally activates a target neuron in order to understand which features that neuron has learned to detect [11].
A major limitation with DGN-AM, however, is the lack of diversity in the generated samples. While samples may vary slightly (e.g. “cardoons” with two or three flowers viewed from slightly different angles; see Fig. 2b), the whole image tends to have the same composition (e.g. a close-up of a single cardoon plant with a green background). It is noteworthy that the images produced by DGN-AM closely match the images from that class that most highly activate the class neuron (Fig. 2a). Optimization often converges to the same mode even with different random initializations, a phenomenon common with activation maximization [11, 40, 59]. In contrast, real images within a class tend to show more diversity (Fig. 2c). In this paper, we improve the diversity and quality of samples produced via DGN-AM by adding a prior on the latent code that keeps optimization along the manifold of realistic-looking images (Fig. 2d).
We do this by providing a probabilistic framework in which to unify and interpret activation maximization approaches [48, 64, 40, 37] as a type of energy-based model [4, 29] where the energy function is a sum of multiple constraint terms: (a) priors (e.g. biasing images to look realistic) and (b) conditions, typically given as a category of a separately trained classification model (e.g. encouraging images to look like “pianos” or both “pianos” and “candles”). We then show how to sample iteratively from such models using an approximate Metropolis-adjusted Langevin sampling algorithm.
We call this general class of models Plug and Play Generative Networks (PPGN). The name reflects an important, attractive property of the method: one is free to design an energy function, and “plug and play” with different priors and conditions to form a new generative model. This property has recently been shown to be useful in multiple image generation projects that use the DGN-AM generator network prior and swap in different condition networks [66, 13]. In addition to generating images conditioned on a class, PPGNs can generate images conditioned on text, forming a text-to-image generative model that allows one to describe an image with words and have it synthesized. We accomplish this by attaching a recurrent, image-captioning network (instead of an image classification network) to the output of the generator, and performing similar iterative sampling. Note that, while this paper discusses only the image generation domain, the approach should generalize to many other data types. We publish our code and the trained networks at http://EvolvingAI.org/ppgn.
2 Probabilistic interpretation of iterative image generation methods
Beginning with the Metropolis-adjusted Langevin algorithm [46, 45] (MALA), it is possible to define a Markov chain Monte Carlo (MCMC) sampler whose stationary distribution approximates a given distribution $p(x)$. We refer to our variant of MALA as MALA-approx, which uses the following transition operator:^2
$$x_{t+1} = x_t + \epsilon_{12} \nabla \log p(x_t) + N(0, \epsilon_3^2) \tag{1}$$
^2 We abuse notation slightly in the interest of space and denote $N(0, \epsilon_3^2)$ as a sample from that distribution. The first step size is given as $\epsilon_{12}$ in anticipation of later splitting into separate $\epsilon_1$ and $\epsilon_2$ terms.
A full derivation and discussion is given in Sec. S1. Using this sampler we first derive a probabilistically interpretable formulation for activation maximization methods (Sec. 2.1) and then interpret other activation maximization algorithms in this framework (Sec. 2.2, Sec. S2).
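As a minimal sketch of the transition operator in Eq. 1, the chain below repeatedly applies "current point, plus a step along the gradient of the log density, plus Gaussian noise" with no accept/reject step. The 1-D Gaussian target, the function name `mala_approx_chain`, and the step sizes are assumptions chosen for illustration; they are not taken from our experiments.

```python
import random

def mala_approx_chain(grad_log_p, x0, eps12, eps3, n_steps, seed=0):
    """Run x_{t+1} = x_t + eps12 * grad_log_p(x_t) + N(0, eps3^2) with no reject step."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        x = x + eps12 * grad_log_p(x) + rng.gauss(0.0, eps3)
        samples.append(x)
    return samples

# Toy target (an assumption): p(x) = N(MU, 1), so grad log p(x) = -(x - MU).
MU = 3.0
eps3 = 0.5
eps12 = 0.5 * eps3 ** 2  # standard Langevin coupling of step size and noise scale
chain = mala_approx_chain(lambda x: -(x - MU), x0=0.0,
                          eps12=eps12, eps3=eps3, n_steps=20000)
burned = chain[2000:]  # discard the burn-in portion of the chain
chain_mean = sum(burned) / len(burned)
```

With these settings the empirical mean of the chain should land close to the target mean; because the reject step is omitted, the stationary distribution is only approximately the target.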
2.1 Probabilistic framework for Activation Maximization
Assume we wish to sample from a joint model $p(x, y)$, which can be decomposed into an image model and a classification model:
$$p(x, y) = p(x)\, p(y \mid x) \tag{2}$$
This equation can be interpreted as a “product of experts” [19] in which each expert determines whether a soft constraint is satisfied. First, a $p(y \mid x)$ expert determines a condition for image generation (e.g. images have to be classified as “cardoon”). Also, in a high-dimensional image space, a good $p(x)$ expert is needed to ensure the search stays in the manifold of the image distribution that we try to model (e.g. images of faces [6, 63], shoes [67] or natural images [37]); otherwise we might encounter “fooling” examples that are unrecognizable but have high $p(y \mid x)$ [38, 51]. Thus, $p(y \mid x)$ and $p(x)$ together impose a complicated high-dimensional constraint on image generation.
We could write a sampler for the full joint $p(x, y)$, but because $y$ variables are categorical, suppose for now that we fix $y$ to be a particular chosen class $y_c$, with $y_c$ either sampled or chosen outside the inner sampling loop.^3 This leaves us with the conditional $p(x \mid y = y_c)$:
$$p(x \mid y = y_c) = \frac{p(x)\, p(y = y_c \mid x)}{p(y = y_c)} \propto p(x)\, p(y = y_c \mid x) \tag{3}$$
^3 One could resample $y$ in the loop as well, but resampling $y$ via the Langevin family under consideration is not a natural fit: because $y$ values from the data set are one-hot (and from the model hopefully nearly so), there will be a wide small- or zero-likelihood region between $(x, y)$ pairs coming from different classes. Thus making local jumps will not be a good sampling scheme for the $y$ components.
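We can construct a MALA-approx sampler for this model, which produces the following update step:
$$\begin{aligned} x_{t+1} &= x_t + \epsilon_{12} \nabla \log p(x_t \mid y = y_c) + N(0, \epsilon_3^2) \\ &= x_t + \epsilon_{12} \nabla \log p(x_t) + \epsilon_{12} \nabla \log p(y = y_c \mid x_t) + N(0, \epsilon_3^2) \end{aligned} \tag{4}$$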
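Expanding the $\nabla$ into explicit partial derivatives and decoupling $\epsilon_{12}$ into explicit $\epsilon_1$ and $\epsilon_2$ multipliers, we arrive at the following form of the update rule:
$$x_{t+1} = x_t + \epsilon_1 \frac{\partial \log p(x_t)}{\partial x_t} + \epsilon_2 \frac{\partial \log p(y = y_c \mid x_t)}{\partial x_t} + N(0, \epsilon_3^2) \tag{5}$$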
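We empirically found that decoupling the $\epsilon_1$ and $\epsilon_2$ multipliers works better. An intuitive interpretation of the actions of these three terms is as follows:

$\epsilon_1$ term: take a step from the current image toward one that looks more like a generic image (an image from any class).

$\epsilon_2$ term: take a step from the current image toward one that the classifier assigns a higher probability of belonging to the chosen class $y_c$.

$\epsilon_3$ term: add a small amount of noise to jump around the search space to encourage a diversity of images.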
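The three terms of Eq. 5 can be sketched as one function that adds a prior step, a condition step, and noise to the current state. The toy 1-D model below (a standard normal prior and a logistic "classifier" with weight `W`), the function name `conditional_step`, and all step sizes are hypothetical choices for illustration only.

```python
import math
import random

def conditional_step(x, grad_log_prior, grad_log_cond, eps1, eps2, eps3, rng):
    """One update of Eq. 5: prior step + condition step + Gaussian noise."""
    prior_term = eps1 * grad_log_prior(x)  # pull toward generic, high-p(x) inputs
    cond_term = eps2 * grad_log_cond(x)    # pull toward the chosen class
    noise_term = rng.gauss(0.0, eps3) if eps3 > 0 else 0.0
    return x + prior_term + cond_term + noise_term

# Hypothetical toy model: p(x) = N(0, 1) and p(y = y_c | x) = sigmoid(W * x).
W = 2.0

def grad_log_prior(x):
    return -x

def grad_log_cond(x):
    # d/dx log sigmoid(W x) = W * (1 - sigmoid(W x))
    return W * (1.0 - 1.0 / (1.0 + math.exp(-W * x)))

rng = random.Random(0)
x = 0.0
for _ in range(500):
    x = conditional_step(x, grad_log_prior, grad_log_cond,
                         eps1=0.05, eps2=0.05, eps3=0.0, rng=rng)
x_mode = x  # with eps3 = 0 the iteration settles at a mode of p(x) p(y = y_c | x)
```

Setting `eps3 = 0` turns the sampler into plain gradient ascent, which is why it settles on a single point; a positive `eps3` keeps the chain moving around that mode instead.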
2.2 Interpretation of previous models
Aside from the errors introduced by not including a reject step, the stationary distribution of the sampler in Eq. 5 will converge to the appropriate distribution if the $\epsilon$ terms are chosen appropriately [61]. Thus, we can use this framework to interpret previously proposed iterative methods for generating samples, evaluating whether each method faithfully computes and employs each term.
There are many previous approaches that iteratively sample from a trained model to generate images [48, 64, 40, 37, 60, 2, 11, 63, 67, 6, 39, 38, 34], with methods designed for different purposes such as activation maximization [48, 64, 40, 37, 60, 11, 38, 34] or generating realistic-looking images by sampling in the latent space of a generator network [63, 37, 67, 6, 2, 17]. However, most of them are gradient-based, and can be interpreted as a variant of MCMC sampling from a graphical model [25].
While an analysis of the full spectrum of approaches is outside this paper’s scope, we do examine a few representative approaches under this framework in Sec. S2. In particular, we interpret the models that lack a $p(x)$ image prior, yielding adversarial or fooling examples [51, 38], as setting $(\epsilon_1, \epsilon_2, \epsilon_3) = (0, 1, 0)$; and methods that use $L_2$ decay during sampling as using a Gaussian $p(x)$ prior with $(\epsilon_1, \epsilon_2, \epsilon_3) = (\lambda, 1, 0)$. Both lack a noise term and thus sacrifice sample diversity.
3 Plug and Play Generative Networks
Previous models are often limited in that they use hand-engineered priors when sampling in either image space or the latent space of a generator network (see Sec. S2). In this paper, we experiment with 4 different explicitly learned priors modeled by a denoising autoencoder (DAE) [57].
We choose a DAE because, although it does not allow evaluation of $p(x)$ directly, it does allow approximation of the gradient of the log probability when trained with Gaussian noise with variance $\sigma^2$ [1]; with sufficient capacity and training time, the approximation is perfect in the limit as $\sigma \to 0$:
$$\frac{\partial \log p(x)}{\partial x} \approx \frac{R_x(x) - x}{\sigma^2} \tag{6}$$
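where $R_x$ is the reconstruction function in $x$ space representing the DAE, i.e. $R_x(x)$ is a “denoised” output of the autoencoder $R_x$ (an encoder followed by a decoder) when the encoder is fed input $x$. This term approximates exactly the $\epsilon_1$ term required by our sampler, so we can use it to define the steps of a sampler for an image $x$ from class $y_c$. Pulling the $\sigma^2$ term into $\epsilon_1$, the update is:
$$x_{t+1} = x_t + \epsilon_1 \left( R_x(x_t) - x_t \right) + \epsilon_2 \frac{\partial \log p(y = y_c \mid x_t)}{\partial x_t} + N(0, \epsilon_3^2) \tag{7}$$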
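The approximation in Eq. 6 can be checked in closed form on a toy case. The sketch below assumes 1-D data drawn from a zero-mean Gaussian, for which the ideal denoiser under Gaussian corruption is the posterior mean (a simple shrinkage toward zero); it then compares the reconstruction-based estimate with the exact gradient of the log density as the noise variance shrinks. The Gaussian data model and all function names are assumptions for illustration, not part of our trained DAE.

```python
def optimal_denoiser(x_noisy, data_var, noise_var):
    """Posterior mean E[x | x_noisy] for x ~ N(0, data_var), noise ~ N(0, noise_var)."""
    return x_noisy * data_var / (data_var + noise_var)

def dae_score_estimate(x, data_var, noise_var):
    """Eq. 6 estimate of d log p(x) / dx: (R(x) - x) / sigma^2."""
    return (optimal_denoiser(x, data_var, noise_var) - x) / noise_var

def true_score(x, data_var):
    """Exact gradient of log N(x; 0, data_var)."""
    return -x / data_var

x, data_var = 1.5, 2.0
# Absolute error of the estimate for decreasing noise variances sigma^2.
errors = [abs(dae_score_estimate(x, data_var, s2) - true_score(x, data_var))
          for s2 in (1.0, 0.1, 0.001)]
```

In this toy case the estimate equals $-x / (s^2 + \sigma^2)$ for data variance $s^2$, so the error vanishes as $\sigma \to 0$, matching the limit stated for Eq. 6.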
3.1 PPGN-$x$: DAE model of $p(x)$
3.2 DGN-AM: sampling without a learned prior
Poor mixing in the high-dimensional pixel space of PPGN-$x$ is consistent with previous observations that mixing on higher layers can result in faster exploration of the space [5, 33]. Thus, to ameliorate the problem of slow mixing, we may reparameterize $p(x)$ as $\int_h p(h)\, p(x \mid h)\, dh$ for some latent $h$, and perform sampling in this lower-dimensional $h$-space.
While several recent works had success with this approach [37, 6, 63], they often hand-design the $p(h)$ prior. Among these, the DGN-AM method [37] searches in the latent space of a generator network $G$ to find a code $h$ such that the image $G(h)$ highly activates a given neuron in a target DNN. We start by reproducing their results for comparison. $G$ is trained following the methodology in Dosovitskiy & Brox [9] with an image reconstruction loss, a Generative Adversarial Networks (GAN) loss [14], and an $L_2$ loss in a feature space of an encoder $E$ (Fig. 3f). The last loss encourages generated images to match the real images in a high-level feature space and is referred to as “feature matching” [47] in this paper, but is also known as “perceptual similarity” [28, 9] or a form of “moment matching” [31]. Note that in the GAN training for $G$, we simultaneously train a discriminator $D$ to tell apart real images $x$ from generated images $G(h)$. More training details are in Sec. S4.4.
The directed graphical model interpretation of DGN-AM is $h \to x \to y$ (see Fig. 3b) and the joint $p(h, x, y)$ can be decomposed into:
$$p(h, x, y) = p(h)\, p(x \mid h)\, p(y \mid x) \tag{8}$$
where $h$ in this case represents features extracted from the first fully connected layer (called fc6) of a pretrained AlexNet [26] 1000-class ImageNet [7] classification network (see Fig. 3f). $p(x \mid h)$ is modeled by $G$, an upconvolutional (also “deconvolutional”) network [10] with 9 upconvolutional and 3 fully connected layers. $p(y \mid x)$ is modeled by $C$, which in this case is also the AlexNet classifier. The model for $p(h)$ was an implicit unimodal Gaussian implemented via $L_2$ decay in $h$-space [37].
Since $x$ is a deterministic variable, the model simplifies to:
$$p(h, y) = p(h)\, p(y \mid h) \tag{9}$$
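From Eq. 5, if we define a Gaussian $p(h)$ centered at 0 and set $(\epsilon_1, \epsilon_2, \epsilon_3) = (\lambda, 1, 0)$, pulling Gaussian constants into $\lambda$, we obtain the following noiseless update rule in Nguyen et al. [37] to sample $h$ from class $y_c$:
$$h_{t+1} = (1 - \lambda)\, h_t + \epsilon_2 \frac{\partial \log C_c(G(h_t))}{\partial G(h_t)} \frac{\partial G(h_t)}{\partial h_t} \tag{10}$$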
where $C_c(\cdot)$ represents the output unit associated with class $y_c$. As before, all terms are computable in a single forward-backward pass. More concretely, to compute the $\epsilon_2$ term, we push a code $h$ through the generator $G$ and condition network $C$ to the output class $c$ that we want to condition on (Fig. 3b, red arrows), and back-propagate the gradient via the same path to $h$. The final $h$ is pushed through $G$ to produce an image sample.
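The forward-backward computation behind Eq. 10 can be sketched with tiny stand-in networks. Here the generator is a linear map from a 2-d code to 3 "pixels" and the condition network is a linear softmax classifier; both weight matrices, the helper names, and the values of `lam` and `eps2` are made-up assumptions so that the chain rule can be written out by hand. They do not correspond to the trained $G$ and $C$.

```python
import math

# Hypothetical toy networks: x = G_W h, class probabilities = softmax(C_W x).
G_W = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]  # 3 "pixels" from a 2-d code
C_W = [[0.6, -0.1, 0.2], [-0.4, 0.3, 0.5]]    # 2 classes from 3 "pixels"

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def mat_t_vec(m, v):
    # Multiply by the transpose: the backward pass through a linear layer.
    return [sum(m[i][j] * v[i] for i in range(len(m))) for j in range(len(m[0]))]

def softmax(z):
    top = max(z)
    e = [math.exp(v - top) for v in z]
    total = sum(e)
    return [v / total for v in e]

def log_prob_class(h, c):
    return math.log(softmax(matvec(C_W, matvec(G_W, h)))[c])

def grad_log_prob_class(h, c):
    """d log C_c(G(h)) / dh: back-propagate through C, then through G."""
    probs = softmax(matvec(C_W, matvec(G_W, h)))
    d_logits = [(1.0 if k == c else 0.0) - p for k, p in enumerate(probs)]
    d_x = mat_t_vec(C_W, d_logits)  # gradient with respect to G(h)
    return mat_t_vec(G_W, d_x)      # multiplied by dG(h)/dh

def dgn_am_step(h, c, lam, eps2):
    """Noiseless update of Eq. 10: L2 decay plus a step up the class log-probability."""
    g = grad_log_prob_class(h, c)
    return [(1.0 - lam) * hi + eps2 * gi for hi, gi in zip(h, g)]

h = [0.0, 0.0]
before = log_prob_class(h, c=1)
for _ in range(200):
    h = dgn_am_step(h, c=1, lam=0.01, eps2=0.5)
after = log_prob_class(h, c=1)
```

Starting from the zero code, repeated steps raise the log-probability of the target class while the decay factor keeps the code from growing without bound. Because there is no noise term, different runs from the same start follow the same path.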
Under this newly proposed framework, we have successfully reproduced the original DGN-AM results and their issue of converging to the same mode when starting from different random initializations (Fig. 2b). We also found that DGN-AM mixes somewhat poorly, yielding the same image after many sampling steps (Figs. 12(b) & 13(b)).
3.3 PPGN-$h$: Generator and DAE model of $p(h)$
We attempt to address the poor mixing speed of DGN-AM by incorporating a proper $p(h)$ prior learned via a DAE into the sampling procedure described in Sec. 3.2. Specifically, we train $R_h$, a 7-layer, fully-connected DAE on $h$ (as before, $h$ is a feature vector). Full training details are provided in Sec. S4.3.
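The update rule to sample $h$ from this model is similar to Eq. 10 except for the inclusion of all three $\epsilon$ terms:
$$h_{t+1} = h_t + \epsilon_1 \left( R_h(h_t) - h_t \right) + \epsilon_2 \frac{\partial \log C_c(G(h_t))}{\partial G(h_t)} \frac{\partial G(h_t)}{\partial h_t} + N(0, \epsilon_3^2) \tag{11}$$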
Concretely, to compute $R_h(h_t)$ we push $h_t$ through the learned DAE, encoding and decoding it (Fig. 3c, black arrows). The $\epsilon_2$ term is computed via a forward and backward pass through both the $G$ and $C$ networks as before (Fig. 3c, red arrows). Lastly, we add the same amount of noise $N(0, \sigma^2)$ used during DAE training to $h$. Equivalently, noise can also be added before the encode-decode step.
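The relation between the learned-prior update (Eq. 11) and the decay-based update (Eq. 10) can be made concrete with a stand-in DAE. In the sketch below the "DAE" simply shrinks the code toward zero by a factor `(1 - K)`, which is the ideal denoiser for a zero-mean Gaussian prior; under that assumption the reconstruction term reduces exactly to $L_2$ decay with $\lambda = \epsilon_1 K$. The shrinkage DAE, the constant condition gradient, and all names and values are hypothetical.

```python
import random

def ppgn_h_step(h, reconstruct, grad_log_cond, eps1, eps2, eps3, rng):
    """One update of Eq. 11 on a list-valued code h."""
    r = reconstruct(h)    # R_h(h): encode-decode pass through the DAE
    g = grad_log_cond(h)  # gradient of log C_c(G(h)) with respect to h
    return [hi + eps1 * (ri - hi) + eps2 * gi
            + (rng.gauss(0.0, eps3) if eps3 > 0 else 0.0)
            for hi, ri, gi in zip(h, r, g)]

# Hypothetical stand-ins: a DAE that shrinks codes toward zero and a
# constant condition gradient.
K = 0.2

def shrink_dae(h):
    return [(1.0 - K) * hi for hi in h]

def const_grad(h):
    return [1.0, -2.0]

h0 = [0.5, 1.5]
eps1, eps2 = 0.1, 0.05
ppgn_next = ppgn_h_step(h0, shrink_dae, const_grad, eps1, eps2,
                        eps3=0.0, rng=random.Random(0))

# Eq. 10 with lambda = eps1 * K gives the same noiseless step.
lam = eps1 * K
decay_next = [(1.0 - lam) * hi + eps2 * gi
              for hi, gi in zip(h0, const_grad(h0))]
```

A learned DAE differs from this stand-in in that its reconstruction pulls codes toward whichever region of the feature distribution is nearby rather than toward a single point, and a positive `eps3` adds the noise that the decay-based rule lacks.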
We sample^4 from this model and show results in Figs. 12(c) & 13(c). As expected, the chain mixes faster than PPGN-$x$, with subsequent samples appearing more qualitatively different from their predecessors. However, the samples for PPGN-$h$ are qualitatively similar to those from DGN-AM (Figs. 12(b) & 13(b)). Samples still lack quality and diversity, which we hypothesize is due to the poor $p(h)$ model learned by the DAE.
^4 If faster mixing or more stable samples are desired, then $\epsilon_1$ and $\epsilon_3$ can be scaled up or down together; here we scale both down.
3.4 Joint PPGN-$h$: joint Generator and DAE
The previous result suggests that the simple multi-layer perceptron DAE poorly modeled the distribution of $h$ features. This could occur because the DAE faces the generally difficult unconstrained density estimation problem. To combat this issue, we experiment with modeling $h$ via $x$ with a DAE: $h \to x \to h$. Intuitively, to help the DAE better model $h$, we force it to generate realistic-looking images $x$ from features $h$ and then decode them back to $h$. One can train this DAE from scratch separately from $G$ (as done for PPGN-$h$). However, in the DGN-AM formulation, $G$ models the $h \to x$ mapping (Fig. 3b) and $E$ models the $x \to h$ mapping (Fig. 3f). Thus, the composition of $G$ and $E$ can be considered an AE $h \to x \to h$ (Fig. 3d).
Note that this $h \to x \to h$ autoencoder is theoretically not a formal $h$-DAE because its two components were trained with neither noise added to $h$ nor an $L_2$ reconstruction loss for $h$ [37] (more details in Sec. S4.4), as is required for regular DAE training [57]. To make it a more theoretically justifiable $h$-DAE, we add noise to $h$ and train $G$ with an additional reconstruction loss for $h$ (Fig. S9c). We do the same for $x$ and $h_1$ (pool5 features), hypothesizing that a little noise added to $x$ and $h_1$ might encourage $G$ to be more robust [57]. In other words, with the same existing network structures from DGN-AM [37], we train $G$ differently by treating the entire model as being composed of 3 interleaved DAEs that share parameters: one each for $h$, $h_1$, and $x$ (see Fig. S9c). Note that $E$ remains frozen, and $G$ is trained with 4 losses in total, i.e. three $L_2$ reconstruction losses for $x$, $h$, and $h_1$ and a GAN loss for $x$. See Sec. S4.5 for full training details. We call this the Joint PPGN-$h$ model.
We sample from this model following the update rule in Eq. 11, with noise added to all three variables ($h$, $h_1$ and $x$) instead of only to $h$ (Fig. 3d vs e). The noise amounts added at each layer are the same as those used during training. As hypothesized, we observe that the sampling chain from this model mixes substantially faster and produces samples with better quality than all previous PPGN treatments (Figs. 12(d) & 13(d)), including PPGN-$h$, which has a multi-layer perceptron $h$-DAE.
3.5 Ablation study with Noiseless Joint PPGN-$h$
While the Joint PPGN-$h$ outperforms all previous treatments in sample quality and diversity (as the chain mixes faster), the model is trained with a combination of four losses and noise added to all variables. This complex training process can be difficult to understand, making further improvements non-intuitive. To shed more light on how the Joint PPGN-$h$ works, we perform ablation experiments which later reveal a better-performing variant.
Noise sweeps. To understand the effects of adding noise to each variable, we train variants of the Joint PPGN-$h$ (1) with different noise levels, (2) using noise on only a single variable, and (3) using noise on multiple variables simultaneously. We did not find these variants to produce qualitatively better reconstruction results than the Joint PPGN-$h$. Interestingly, in a PPGN variant trained with no noise at all, the $h$-autoencoder given by the composition of $G$ and $E$ still appears to be contractive, i.e. robust to a large amount of noise (Fig. S16). This is beneficial during sampling; if “unrealistic” codes appear, $G$ could map them back to realistic-looking images. We believe this property might emerge for multiple reasons: (1) $G$ and $E$ are not trained jointly; (2) $h$ features encode global, high-level rather than local, low-level information; (3) the presence of the adversarial cost when training $G$ could make the $h \to x$ mapping more “many-to-one” by pushing $x$ towards modes of the image distribution.
Combinations of losses. To understand the effects of each loss component, we repeat the Joint PPGN-$h$ training (Sec. 3.4), but without noise added to the variables. Specifically, we test different combinations of losses and compare the quality of images produced by pushing the codes of real images through $G$ (without MCMC sampling).
First, we found that removing the adversarial loss from the 4-loss combination yields blurrier images (Fig. 7(c)). Second, we compare 3 different feature matching losses: fc6, pool5, and both fc6 and pool5 combined, and found that the pool5 feature matching loss leads to the best image quality (Sec. S3). Our result is consistent with Dosovitskiy & Brox [9]. Thus, the model that we found empirically to produce the best image quality is trained without noise and with three losses: a pool5 feature matching loss, an adversarial loss, and an image reconstruction loss. We call this variant “Noiseless Joint PPGN-$h$”: it produced the results in Figs. 1 & 2 and Sections 3.5 & 4.
Noiseless Joint PPGN-$h$. We sample from this model following the same update rule in Eq. 11 (we need noise to make it a proper sampling procedure, but found that infinitesimally small noise produces better and more diverse images, which is to be expected given that the DAE in this variant was trained without noise). Interestingly, the chain mixes substantially faster than DGN-AM (Figs. 12(e) & 12(b)) although the only difference between the two treatments is the existence of the learned $p(h)$ prior. Overall, the Noiseless Joint PPGN-$h$ produces a large amount of sample diversity (Fig. 2). Compared to the Joint PPGN-$h$, the Noiseless Joint PPGN-$h$ produces better image quality, but mixes slightly slower (Figs. S13 & S14). Sweeping across the noise levels during sampling, we noted that larger noise amounts often result in worse image quality, but not necessarily faster mixing speed (Fig. S15). Also, as expected, a small $\epsilon_1$ multiplier makes the chain mix faster, and a large one pulls the samples towards being generic instead of class-specific (Fig. S23).
Evaluations. Evaluating image generative models is challenging, and there is not yet a commonly accepted quantitative performance measure [53]. We qualitatively evaluate sample diversity of the Noiseless Joint PPGN-$h$ variant by running 10 sampling chains, each for 200 steps, to produce 2000 samples, and filtering out samples whose class probability falls below a threshold. From the remaining samples, we randomly pick 400 and plot them in a grid t-SNE [56] (Figs. S12 & S11). More examples for the reader’s evaluation of sample quality and diversity are provided in Figs. S21, S22 & S25. To better observe the mixing speed, we show videos of sampling chains (with one sample per frame; no samples filtered out) from within classes and between 10 different classes at https://goo.gl/36S0Dy. In addition, Table S3 provides quantitative comparisons between PPGN, auxiliary classifier GAN [41] and real ImageNet images in image quality (via Inception score [47] & Inception accuracy [41]) and diversity (via MS-SSIM metric [41]).
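The sample-selection step of this qualitative evaluation (pool the chains, drop low-probability samples, pick a random subset for the grid) might be sketched as follows. The function name `select_for_grid`, the placeholder probabilities, and in particular the value of `THRESHOLD` are hypothetical; real class probabilities would come from the condition network.

```python
import random

def select_for_grid(samples, class_probs, threshold, n_pick, seed=0):
    """Drop samples with class probability below `threshold`, then pick up to `n_pick` at random."""
    kept = [s for s, p in zip(samples, class_probs) if p >= threshold]
    rng = random.Random(seed)
    return rng.sample(kept, min(n_pick, len(kept)))

# Placeholder data: 10 chains x 200 steps = 2000 sample ids with made-up probabilities.
data_rng = random.Random(1)
sample_ids = [(chain, step) for chain in range(10) for step in range(200)]
probs = [data_rng.random() for _ in sample_ids]

THRESHOLD = 0.5  # hypothetical cutoff, chosen only for this illustration
picked = select_for_grid(sample_ids, probs, THRESHOLD, n_pick=400)
```

Sampling without replacement keeps every grid cell distinct; if fewer than `n_pick` samples survive the filter, all survivors are returned.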
4 Additional results
In this section, we take the Noiseless Joint PPGN-$h$ model and show its capabilities on several different tasks.
4.1 Generating images with different condition networks
A compelling property that makes PPGN different from other existing generative models is that one can “plug and play” with different prior and condition components (as shown in Eq. 2) and ask the model to perform new tasks, including challenging the generator to produce images that it has never seen before. Here, we demonstrate this feature by replacing the $p(y \mid x)$ component with different networks.
Generating images conditioned on classes
Above we showed that PPGN could generate a diversity of high-quality samples for ImageNet classes (Figs. 1 & 2 & Sec. 3.5). Here, we test whether the generator $G$ within the PPGN can generalize to new types of images that it has never seen before. Specifically, we sample with a different $p(y \mid x)$ model: an AlexNet DNN [26] trained to classify 205 categories of scene images from the MIT Places dataset [65]. Similar to DGN-AM [37], the PPGN generates realistic-looking images for classes that the generator was never trained on, such as “alley” or “hotel room” (Fig. 4). A side-by-side comparison between DGN-AM and PPGN is in Fig. S17.
Generating images conditioned on captions
Instead of conditioning on classes, we can also condition the image generation on a caption (Fig. 3g). Here, we swap in an image-captioning recurrent network (called LRCN) from [8] that was trained on the MS COCO dataset [32] to predict a caption $y$ given an image $x$. Specifically, LRCN is a two-layer LSTM network that generates captions conditioned on features extracted from the output softmax layer of AlexNet [26].
We found that PPGN can generate reasonable images in many cases (Figs. 5 & S18), although the image quality is lower than when conditioning on classes. In other cases, it fails to generate high-quality images for certain types of images such as “people” or “giraffe”, which are not categories in the generator’s training set (Fig. S18). We also observe “fooling” images [38], i.e. those that look unrecognizable to humans but produce high-scoring captions. More results are in Fig. S18. The challenges for this task could be: (1) the sampling is conditioned on many words at the same time, and the gradients backpropagated from different words could conflict with each other; (2) the LRCN captioning model itself is easily fooled, thus additional priors on the conversion from image features to natural language could improve the result further; (3) the depth of the entire model (AlexNet and LRCN) impairs gradient propagation during sampling. In the future, it would be interesting to experiment with other state-of-the-art image captioning models [12, 58]. Overall, we have demonstrated that PPGNs can be flexibly turned into a text-to-image model by combining the prior with an image captioning network, and this process does not even require additional training.
Generating images conditioned on hidden neurons
PPGNs can perform a more challenging form of activation maximization called Multifaceted Feature Visualization (MFV) [40], which involves generating the set of inputs that activate a given neuron. Instead of conditioning on a class output neuron, here we condition on a hidden neuron, revealing many facets that a neuron has learned to detect (Fig. 6).
4.2 Inpainting
Because PPGNs can be interpreted probabilistically, we can also sample from them conditioned on part of an image (in addition to the class condition) to perform inpainting, i.e. filling in missing pixels given the observed context regions [42, 3, 63, 54]. The model must understand the entire image to be able to reasonably fill in a large masked-out region that is positioned randomly. Overall, we found that PPGNs are able to perform inpainting, suggesting that the models do “understand” the semantics of concepts such as junco or bell pepper (Fig. 7) rather than merely memorizing the images. More details and results are in Sec. S5.
5 Conclusion
The most useful property of PPGN is the capability of “plug and play”: allowing one to drop in a replaceable condition network and generate images according to a condition specified at test time. Beyond the applications we demonstrated here, one could use PPGNs to synthesize images for videos or create art with one or even multiple condition networks at the same time [13]. Note that DGN-AM [37], the predecessor of PPGNs, has previously enabled both scientists and amateurs without substantial resources to take a pretrained condition network and generate art [13] and scientific visualizations [66]. An explanation for why this is possible is that the features that the generator was trained to invert are relatively general and cover the set of natural images. Thus, there is great value in producing flexible, powerful generators that can be combined with pretrained condition networks in a plug and play fashion.
Acknowledgments
We thank Theo Karaletsos and Noah Goodman for helpful discussions, and Jeff Donahue for providing a trained image captioning model [8] for our experiments. We also thank Joost Huizinga, Christopher Stanton, Rosanne Liu, Tyler Jaszkowiak, Richard Yang, and Jon Berliner for invaluable suggestions on preliminary drafts.
References
 [1] G. Alain and Y. Bengio. What regularized autoencoders learn from the datagenerating distribution. The Journal of Machine Learning Research, 15(1):3563–3593, 2014.
 [2] K. Arulkumaran, A. Creswell, and A. A. Bharath. Improving sampling from generative autoencoders with markov chains. arXiv preprint arXiv:1610.09296, 2016.
 [3] C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman. PatchMatch: a randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG), 28(3):24, 2009.
 [4] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. Book in preparation for MIT Press, 2016.
 [5] Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai. Better mixing via deep representations. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 552–560, 2013.
 [6] A. Brock, T. Lim, J. Ritchie, and N. Weston. Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093, 2016.
 [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
 [8] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Computer Vision and Pattern Recognition, 2015.
 [9] A. Dosovitskiy and T. Brox. Generating Images with Perceptual Similarity Metrics based on Deep Networks. In Advances in Neural Information Processing Systems, 2016.
 [10] A. Dosovitskiy, J. Tobias Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1538–1546, 2015.
 [11] D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features of a deep network. Technical report, University of Montreal, 2009.
 [12] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov. Devise: A deep visual-semantic embedding model. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2121–2129. Curran Associates, Inc., 2013.
 [13] G. Goh. Image synthesis from yahoo open nsfw. https://opennsfw.gitlab.io, 2016.
 [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [15] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. Draw: A recurrent neural network for image generation. In ICML, 2015.
 [16] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, pages 513–520, 2006.
 [17] T. Han, Y. Lu, S.-C. Zhu, and Y. N. Wu. Alternating back-propagation for generator network. In AAAI, 2017.
 [18] W. K. Hastings. Monte carlo sampling methods using markov chains and their applications. Biometrika, 57(1):97–109, 1970.
 [19] G. E. Hinton. Products of experts. In Artificial Neural Networks, 1999. ICANN 99. Ninth International Conference on (Conf. Publ. No. 470), volume 1, pages 1–6. IET, 1999.
 [20] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 2015.
 [21] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.
 [22] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. arXiv preprint arXiv:1603.08155, 2016.
 [23] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [24] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. Dec. 2014.
 [25] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
 [26] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.
 [27] H. Larochelle and I. Murray. The neural autoregressive distribution estimator. Journal of Machine Learning Research, 15:29–37, 2011.
 [28] A. B. L. Larsen, S. K. Sønderby, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. CoRR, abs/1512.09300, 2015.
 [29] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning. Predicting structured data, 1:0, 2006.
 [30] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
 [31] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In International Conference on Machine Learning, pages 1718–1727, 2015.
 [32] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
 [33] H. Luo, P. L. Carrier, A. C. Courville, and Y. Bengio. Texture modeling with convolutional spike-and-slab rbms and deep extensions. In AISTATS, pages 415–423, 2013.
 [34] A. Mahendran and A. Vedaldi. Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision, pages 1–23, 2016.
 [35] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092, 1953.
 [36] A. Mordvintsev, C. Olah, and M. Tyka. Inceptionism: Going deeper into neural networks. Google Research Blog. Retrieved June, 20, 2015.
 [37] A. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in Neural Information Processing Systems, 2016.
 [38] A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 [39] A. Nguyen, J. Yosinski, and J. Clune. Innovation engines: Automated creativity and improved stochastic optimization via deep learning. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), 2015.
 [40] A. Nguyen, J. Yosinski, and J. Clune. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. In Visualization for Deep Learning Workshop, ICML conference, 2016.
 [41] A. Odena, C. Olah, and J. Shlens. Conditional Image Synthesis With Auxiliary Classifier GANs. ArXiv e-prints, Oct. 2016.
 [42] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. arXiv preprint arXiv:1604.07379, 2016.
 [43] A. Radford, L. Metz, and S. Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. Nov. 2015.
 [44] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive autoencoders: Explicit invariance during feature extraction. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 833–840, 2011.
 [45] G. O. Roberts and J. S. Rosenthal. Optimal scaling of discrete approximations to langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268, 1998.
 [46] G. O. Roberts and R. L. Tweedie. Exponential convergence of langevin distributions and their discrete approximations. Bernoulli, pages 341–363, 1996.
 [47] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. CoRR, abs/1606.03498, 2016.
 [48] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, presented at ICLR Workshop 2014, 2013.
 [49] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
 [50] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
 [51] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
 [52] Y. W. Teh, A. H. Thiery, and S. J. Vollmer. Consistency and fluctuations for stochastic gradient langevin dynamics. 2014.
 [53] L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models. Nov 2016. International Conference on Learning Representations.
 [54] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel Recurrent Neural Networks. ArXiv e-prints, Jan. 2016.
 [55] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with pixelcnn decoders. CoRR, abs/1606.05328, 2016.
 [56] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
 [57] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.
 [58] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2014.
 [59] D. Wei, B. Zhou, A. Torralba, and W. Freeman. Understanding intra-class knowledge inside cnn. arXiv preprint arXiv:1507.02379, 2015.
 [60] D. Wei, B. Zhou, A. Torralba, and W. Freeman. Understanding intra-class knowledge inside cnn. arXiv preprint arXiv:1507.02379, 2015.
 [61] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
 [62] J. Xie, Y. Lu, S.-C. Zhu, and Y. N. Wu. Cooperative training of descriptor and generator networks. arXiv preprint arXiv:1609.09408, 2016.
 [63] R. Yeh, C. Chen, T. Y. Lim, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with perceptual and contextual losses. arXiv preprint arXiv:1607.07539, 2016.
 [64] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. In Deep Learning Workshop, International Conference on Machine Learning (ICML), 2015.
 [65] B. Zhou, A. Khosla, À. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene cnns. In International Conference on Learning Representations (ICLR), volume abs/1412.6856, 2014.
 [66] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva. Places: An image database for deep scene understanding. arXiv preprint arXiv:1610.02055, 2016.
 [67] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
Supplementary materials for:
Plug & Play Generative Networks:
Conditional Iterative Generation of Images in Latent Space
Appendix S1 Markov chain Monte Carlo methods and derivation of MALA-approx
Assume a distribution $p(x)$ that we wish to produce samples from. For certain distributions with amenable structure it may be possible to write down directly an independent and identically distributed (IID) sampler, but in general this can be difficult. In such cases where IID samplers are not readily available, we may instead resort to Markov Chain Monte Carlo (MCMC) methods for sampling. Complete discussions of this topic fill books [25, 4]. Here we briefly review the background that led to the sampler we propose.
In cases where evaluation of $p(x)$ is possible, we can write down the Metropolis-Hastings (hereafter: MH) sampler for $p(x)$ [35, 18]. It requires a choice of proposal distribution $q(x'|x)$; for simplicity we consider (and later use) a simple Gaussian proposal distribution. Starting with an $x_0$ from some initial distribution, the sampler takes steps according to a transition operator defined by the below routine, with $N(0, \sigma^2)$ shorthand for a sample from that Gaussian proposal distribution:
1. $x_{t+1} = x_t + N(0, \sigma^2)$
2. $\alpha = p(x_{t+1}) / p(x_t)$
3. if $\alpha < 1$, reject sample $x_{t+1}$ with probability $1 - \alpha$ by setting $x_{t+1} = x_t$, else keep $x_{t+1}$
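The MH routine above can be sketched in a few lines of NumPy. This is a minimal illustration only: the function name `mh_sample`, the standard normal target, and the step count are assumptions made for the example and are not taken from the paper's experiments. The ratio $\alpha$ is evaluated in log space for numerical stability.

```python
import numpy as np

def mh_sample(log_p, x0, sigma, n_steps, rng):
    """Random-walk Metropolis-Hastings with an isotropic Gaussian proposal."""
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_steps,) + x.shape)
    n_accepted = 0
    for t in range(n_steps):
        # Step 1: propose x_{t+1} = x_t + N(0, sigma^2)
        proposal = x + sigma * rng.standard_normal(x.shape)
        # Step 2: alpha = p(proposal) / p(x), kept as a log ratio
        log_alpha = log_p(proposal) - log_p(x)
        # Step 3: accept with probability min(1, alpha), otherwise keep x
        if np.log(rng.uniform()) < log_alpha:
            x = proposal
            n_accepted += 1
        samples[t] = x
    return samples, n_accepted / n_steps

def log_p_std_normal(x):
    # Unnormalized log density of N(0, I); a toy target chosen for illustration
    return -0.5 * np.sum(x ** 2)

rng = np.random.default_rng(0)
samples, accept_rate = mh_sample(log_p_std_normal, np.zeros(1),
                                 sigma=1.0, n_steps=20000, rng=rng)
```

Only ratios of $p$ enter the accept test, so an unnormalized density is sufficient in this sketch.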
In theory, sufficiently many steps of this simple sampling rule produce samples for any computable $p(x)$, but in practice it has two problems: it mixes slowly, because steps are small and uncorrelated in time, and it requires us to be able to compute $p(x)$ to calculate $\alpha$, which is often not possible. A Metropolis-adjusted Langevin algorithm (hereafter: MALA) [46, 45] addresses the first problem. This sampler follows a slightly modified procedure:
1. $x_{t+1} = x_t + \frac{\sigma^2}{2} \nabla \log p(x_t) + N(0, \sigma^2)$
2. $\alpha = f(x_t, x_{t+1}, p(x_{t+1}), p(x_t))$
3. if $\alpha < 1$, reject sample $x_{t+1}$ with probability $1 - \alpha$ by setting $x_{t+1} = x_t$, else keep $x_{t+1}$
where $f(\cdot)$ is the slightly more complex calculation of $\alpha$, with the notable property that as the step size goes to 0, $f(\cdot) \to 1$. This sampler preferentially steps in the direction of higher probability, which allows it to spend less time rejecting low probability proposals, but it still requires computation of $p(x)$ to calculate $\alpha$.
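A minimal NumPy sketch of one MALA transition follows. The text does not spell out the acceptance calculation, so the code assumes the standard Metropolis-Hastings ratio for the asymmetric Langevin proposal; the name `mala_step` and the standard normal toy target are likewise illustrative assumptions.

```python
import numpy as np

def mala_step(x, log_p, grad_log_p, sigma, rng):
    """One MALA transition; returns (new_x, accepted)."""
    def proposal_mean(z):
        # Langevin drift: z + (sigma^2 / 2) * grad log p(z)
        return z + 0.5 * sigma ** 2 * grad_log_p(z)

    def log_q(dst, src):
        # Log density (up to a constant) of proposing dst when standing at src
        return -np.sum((dst - proposal_mean(src)) ** 2) / (2.0 * sigma ** 2)

    proposal = proposal_mean(x) + sigma * rng.standard_normal(x.shape)
    # Assumed acceptance ratio: p(x') q(x | x') / (p(x) q(x' | x)), in log space
    log_alpha = (log_p(proposal) + log_q(x, proposal)) - (log_p(x) + log_q(proposal, x))
    if np.log(rng.uniform()) < log_alpha:
        return proposal, True
    return x, False

log_p = lambda x: -0.5 * np.sum(x ** 2)   # toy target: standard normal
grad_log_p = lambda x: -x

rng = np.random.default_rng(0)
x = np.zeros(1)
chain, n_accepted = [], 0
for _ in range(20000):
    x, accepted = mala_step(x, log_p, grad_log_p, sigma=0.5, rng=rng)
    chain.append(x.copy())
    n_accepted += accepted
chain = np.array(chain)
accept_rate = n_accepted / len(chain)
```

On this toy target almost every proposal is accepted at a small step size, which is the behavior that motivates dropping the reject step altogether.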
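Table S1. Samplers and their properties.
| Sampler | Mixes | Uses accept/reject step and requires $p(x)$ | Update rule (not including accept/reject step) |
| --- | --- | --- | --- |
| MH | slowly | yes | $x_{t+1} = x_t + N(0, \sigma^2)$ |
| MALA | ok | yes | $x_{t+1} = x_t + \frac{\sigma^2}{2} \nabla \log p(x_t) + N(0, \sigma^2)$ |
| MALA-approx | ok | no | $x_{t+1} = x_t + \epsilon_{12} \nabla \log p(x_t) + N(0, \epsilon_3^2)$ |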
The stochastic gradient Langevin dynamics (SGLD) method [61, 52] was proposed to sidestep this troublesome requirement by generating probability proposals that are based on a small subset of the data only: by using stochastic gradient descent plus noise, by skipping the accept-reject step, and by using decreasing step sizes. Inspired by SGLD, we define an approximate sampler by assuming small step size and doing away with the reject step (by accepting every sample). The idea is that the stochasticity of SGD itself introduces an implicit noise: although the resulting update does not produce asymptotically unbiased samples, it does if we also anneal the step size (or, equivalently, gradually increase the minibatch size).
While an accept ratio of 1 is only approached in the limit as the step size goes to zero, in practice we empirically observe that this approximation produces reasonable samples even for moderate step sizes. This approximation leads to a sampler defined by the simple update rule:
(12) $x_{t+1} = x_t + \frac{\sigma^2}{2} \nabla \log p(x_t) + N(0, \sigma^2)$
As explained below, we propose to decouple the two step sizes for each of the above two terms after $x_t$, with two independent scaling factors to allow independently tuning each ($\epsilon_{12}$ and $\epsilon_3$ in Eq. 13). This variant makes sense when we consider that the stochasticity of SGD itself introduces more noise, breaking the direct link between the amount of noise injected and the step size under Langevin dynamics.
We note that $\nabla \log p(x_t)$ is, up to sign, just the gradient of the energy (because the partition function does not depend on $x$) and that the scaling factor ($\sigma^2/2$ in the above equation) can be partially absorbed when changing the temperature associated with energy, since temperature is just a multiplicative scaling factor in the energy. Changing that link between the two terms is thus equivalent to changing temperature because the incorrect scale factor can be absorbed in the energy as a change in the temperature. Since we do not control directly the amount of noise (some of which is now produced by the stochasticity of SGD itself), it is better to “manually” control the trade-off by introducing an extra hyperparameter. Doing so also may help to compensate for the fact that the SGD noise is not perfectly normal, which introduces a bias in the Markov chain. By manually controlling both the step size and the normal noise, we can thus find a good trade-off between variance (or temperature level, which would blur the distribution) and bias (which makes us sample from a slightly different distribution). In our experience, such decoupling has helped find better trade-offs between sample diversity and quality, perhaps compensating for idiosyncrasies of sampling without a reject step. We call this sampler MALA-approx:
(13) $x_{t+1} = x_t + \epsilon_{12} \nabla \log p(x_t) + N(0, \epsilon_3^2)$
Table S1 summarizes the samplers and their properties.
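To make the effect of decoupling the two step sizes concrete, the following NumPy sketch runs a MALA-approx style chain (gradient step plus Gaussian noise, every sample accepted) on a toy standard normal target. The target, the function names, and the particular step sizes are illustrative assumptions. For this toy target the chain is linear, so its stationary variance has the closed form $\epsilon_3^2 / (2\epsilon_{12} - \epsilon_{12}^2)$, which the code uses as a check.

```python
import numpy as np

def mala_approx_chain(grad_log_p, x0, eps12, eps3, n_steps, rng):
    """Iterate x <- x + eps12 * grad log p(x) + N(0, eps3^2), accepting every step."""
    x = np.asarray(x0, dtype=float)
    chain = np.empty((n_steps,) + x.shape)
    for t in range(n_steps):
        x = x + eps12 * grad_log_p(x) + eps3 * rng.standard_normal(x.shape)
        chain[t] = x
    return chain

def stationary_variance(eps12, eps3):
    """Closed-form chain variance when the target is N(0, 1) (toy case only)."""
    return eps3 ** 2 / (2.0 * eps12 - eps12 ** 2)

grad_log_p = lambda x: -x  # score of a standard normal (toy assumption)
rng = np.random.default_rng(0)

sigma = 0.5
# Coupled setting: eps12 = sigma^2 / 2 and eps3 = sigma, as in Langevin dynamics
coupled = mala_approx_chain(grad_log_p, np.zeros(1), sigma ** 2 / 2, sigma, 50000, rng)
# Decoupled setting: same gradient step, half the injected noise
cooler = mala_approx_chain(grad_log_p, np.zeros(1), sigma ** 2 / 2, sigma / 2, 50000, rng)
```

In the coupled setting the chain variance sits slightly above the target variance of 1, a bias that comes from skipping the reject step; shrinking the noise scale alone concentrates the samples, which acts like lowering the temperature.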
Appendix S2 Probabilistic interpretation of previous models (continued)
In this paper, we consider four main representative approaches in light of the framework. Here we discuss the first three (activation maximization with no priors, with a Gaussian prior, and with hand-designed priors) and refer readers to the main text (Sec. 2.2) for the fourth approach.
Activation maximization with no priors. From Eq. 5, if we set $(\epsilon_1, \epsilon_2, \epsilon_3) = (0, 1, 0)$, we obtain a sampler that follows the class gradient directly without contributions from a $p(x)$ term or the addition of noise. In a high-dimensional space, this results in adversarial or fooling images [51, 38]. We can also interpret the sampling procedure in [51, 38] as a sampler with non-zero $\epsilon_1$ but with a $p(x)$ such that $\partial \log p(x) / \partial x = 0$; in other words, a uniform $p(x)$ where all images are equally likely.
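Table S2. The three variants of the class gradient term, with $l_i$ the logits and $s_i$ the softmax outputs of the classification network.
| Variant | Expression | Notes |
| --- | --- | --- |
| a. Derivative of logit | $\frac{\partial l_i}{\partial x}$ | Has worked well in practice [37, 11], but is not quite the right term to maximize under the sampler framework set out in this paper. |
| b. Derivative of softmax | $\frac{\partial s_i}{\partial x} = s_i \left( \frac{\partial l_i}{\partial x} - \sum_j s_j \frac{\partial l_j}{\partial x} \right)$ | Previously avoided due to poor performance [48, 64], but poor performance may have been due to ill-conditioned optimization rather than the inclusion of logits from other classes. In particular, the term goes to 0 as $s_i$ goes to zero. |
| c. Derivative of log of softmax | $\frac{\partial \log s_i}{\partial x} = \frac{\partial \log p(y = y_i \mid x_t)}{\partial x} = \frac{\partial l_i}{\partial x} - \frac{\partial}{\partial x} \log \sum_j \exp(l_j)$ | Correct term under the sampler framework set out in this paper. Well-behaved under optimization, perhaps due to the $\partial l_i / \partial x$ term untouched by the $s_i$ multiplier. |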
Activation maximization with a Gaussian prior. To combat the fooling problem [38], several works have used $L_2$ decay, which can be thought of as a simple Gaussian prior over images [48, 64, 60]. From Eq. 5, if we define a Gaussian $p(x)$ centered at the origin (assume the mean image has been subtracted) and set $(\epsilon_1, \epsilon_2, \epsilon_3) = (\lambda, 1, 0)$, pulling Gaussian constants into $\lambda$, we obtain the following noiseless update rule:
(14) $x_{t+1} = (1 - \lambda) x_t + \frac{\partial \log p(y = y_c \mid x_t)}{\partial x_t}$
The first term decays the current image slightly toward the origin, as appropriate under a Gaussian image prior, and the second term pulls the image toward higher probability regions for the chosen class. Here, the second term is computed as the derivative of the log of a softmax unit in the output layer of the classification network, which is trained to model $p(y|x)$. If we let $l_i$ be the logit outputs of a classification network, where $i$ indexes over the classes, then the softmax outputs are given by $s_i = \exp(l_i) / \sum_j \exp(l_j)$, and the value $p(y = y_c | x_t)$ is modeled by the softmax unit $s_c$.
Note that the second term is similar, but not identical, to the gradient of logit term used by [48, 64, 34]. There are three variants of computing this class gradient term: 1) derivative of logit; 2) derivative of softmax; and 3) derivative of log of softmax. Previously mentioned papers empirically reported that the derivative of the logit unit produces better visualizations than the derivative of the softmax unit (Table S2a vs. b), but this observation had not been fully justified [48]. In light of our probabilistic interpretation (discussed in Sec. 2.1), we consider activation maximization as performing noisy gradient descent to minimize the energy function $E(x, y)$:
(15) $E(x, y) = -\log p(x, y) = -\log \left( p(x)\, p(y|x) \right) = -\left( \log p(x) + \log p(y|x) \right)$
To sample from the joint model $p(x, y)$, we follow the energy gradient:
(16) $\nabla E(x, y) = -\left( \nabla \log p(x) + \nabla \log p(y|x) \right)$
which yields the class gradient term that matches that in our framework (Eq. 14, second term). In addition, recall that the classification network is trained to model $p(y|x)$ via softmax; thus the class gradient variant (the derivative of log of softmax) is the most theoretically justifiable in light of our interpretation. We summarize all three variants in Table S2. Overall, we found the proposed class gradient term a) theoretically justifiable under the probabilistic interpretation, and b) to work well empirically, and thus suggest it for future activation maximization studies.
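The three class gradient variants can be compared numerically on a toy classifier. The sketch below assumes a linear logit layer $l = Wx + b$ (an illustrative stand-in for a deep network, so that $\partial l_i / \partial x$ is simply row $i$ of $W$); the function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return z / z.sum()

def class_gradients(W, b, x, i):
    """Gradients w.r.t. x of the logit, softmax, and log-softmax of class i
    for a linear classifier with logits l = W @ x + b (toy assumption)."""
    s = softmax(W @ x + b)
    d_logit = W[i]                        # (a) derivative of logit
    d_log_softmax = W[i] - s @ W          # (c) dl_i/dx - sum_j s_j dl_j/dx
    d_softmax = s[i] * d_log_softmax      # (b) same direction, scaled by s_i
    return d_logit, d_softmax, d_log_softmax

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)
d_logit, d_softmax, d_log_softmax = class_gradients(W, b, x, i=2)
```

Variant (b) equals variant (c) multiplied by $s_i$, so it vanishes whenever the class probability is small, whereas (c) retains the unscaled logit derivative.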
Activation maximization with hand-designed priors. In an effort to outdo the simple Gaussian prior, many works have proposed more creative, hand-designed image priors such as Gaussian blur [64], total variation [34], jitter [36], and data-driven patch priors [59]. These priors effectively serve as a simple $p(x)$ component. Those that cannot be explicitly expressed in the mathematical form (e.g. jitter [36] and center-biased regularization [40]) can be written as a general regularization function $r(\cdot)$ as in [64], in which case the noiseless update becomes:
(17) $x_{t+1} = r(x_t) + \frac{\partial \log p(y = y_c \mid x_t)}{\partial x_t}$
Appendix S3 Comparing feature matching losses
The addition of feature matching losses (i.e. the differences between a real image and a generated image not in pixel space, but in a feature space, such as a high-level code in a deep neural network) to the training cost has been shown to substantially improve the quality of samples produced by generator networks, e.g. by producing sharper and more realistic images [9, 28, 22].
Dosovitskiy & Brox [9] used the feature matching loss measured in the pool5 layer code space of the AlexNet deep neural network (DNN) [26] trained to classify 1000-class ImageNet images [7]. Here, we empirically compare several feature matching losses computed in different layers of the AlexNet DNN. Specifically, we follow the training procedure in Dosovitskiy & Brox [9], and train 3 generator networks, each with a different feature matching loss computed in different layers from the pretrained AlexNet DNN: a) pool5, b) fc6 and c) both pool5 and fc6 losses. We empirically found that matching the pool5 features leads to the best image quality (Fig. S8), and chose the generator with this loss for the main experiments in the paper.
Appendix S4 Training details
S4.1 Common training framework
We use the Caffe framework [21] to train the networks. All networks are trained with the Adam optimizer [23] with momentum , , and , and an initial learning rate of following [9]. The batch size is . To stabilize the GAN training, we follow heuristic rules based on the ratio of the discriminator loss over generator loss and pause the training of the generator or discriminator if one of them is winning too much. In most cases, the heuristics are a) pause training D if ; b) pause training G if . We did not find BatchNorm [20] helpful in further stabilizing the training as found in Radford et al. [43]. We have not experimented with all of the techniques discussed in Salimans et al. [47], some of which could further improve the results.
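The loss-ratio heuristic for balancing the two players can be sketched as follows. The thresholds `low` and `high` are hypothetical placeholders chosen only for illustration (the values used in the experiments are not reproduced here), and the function name `choose_updates` is likewise an assumption.

```python
def choose_updates(loss_d, loss_g, low=0.1, high=10.0):
    """Decide which network to update from the ratio loss_D / loss_G.

    `low` and `high` are placeholder thresholds, not the paper's values.
    Returns a pair of booleans (train_d, train_g).
    """
    ratio = loss_d / max(loss_g, 1e-12)  # guard against division by zero
    train_d = ratio >= low    # pause D when it is winning (its loss is comparatively tiny)
    train_g = ratio <= high   # pause G when it is winning (D's loss is comparatively large)
    return train_d, train_g

# Example: the discriminator is far ahead, so only the generator is updated
train_d, train_g = choose_updates(loss_d=0.01, loss_g=1.0)
```

In a training loop the two flags would gate the respective optimizer steps for that minibatch.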
S4.2 Training PPGN-x
We train a DAE for images and incorporate it into the sampling procedure as a $p(x)$ prior to avoid fooling examples [37]. The DAE is a 4-layer convolutional network that encodes an image to the conv1 layer of AlexNet [26] and decodes it back to images with 3 upconvolutional layers. We add an amount of Gaussian noise with to images during training. The network is trained via the common training framework described in Sec. S4.1 for minibatch iterations. We use regularization of .
S4.3 Training PPGN-h
For the PPGN-h variant, we train two separate networks: a generator $G$ (that maps codes $h$ to images $x$) and a prior $p(h)$. $G$ is trained via the same procedure described in Sec. S4.4. We model $p(h)$ via a multilayer perceptron DAE with 7 hidden layers of size: . We experimented with larger networks but found this to work the best. We sweep across different amounts of Gaussian noise , and empirically chose (i.e. 10% of the mean feature activation). The network is trained via the common training framework described in Sec. S4.1 for minibatch iterations. We use regularization of .
S4.4 Training Noiseless Joint PPGN-h
Here we describe the training details of the generator network used in the main experiments in Sections 3.3, 3.5, 3.4. The training procedure follows closely the framework by Dosovitskiy & Brox [9].
The purpose is to train a generator network $G$ to reconstruct images from an abstract, high-level feature code space of an encoder network $E$, here the first fully connected layer (fc6) of an AlexNet DNN [26] pretrained to perform image classification on the ImageNet dataset [7] (Fig. S9a). We train $G$ as a decoder for the encoder $E$, which is kept frozen. In other words, $E$ and $G$ form an image autoencoder (Fig. S9b).
Training losses. $G$ is trained with 3 different losses as in Dosovitskiy & Brox [9], namely, an adversarial loss $L_{GAN}$, an image reconstruction loss $L_{img}$, and a feature matching loss $L_{h_1}$ measured in the spatial layer pool5 (Fig. S9b):
(18) $L = L_{img} + L_{h_1} + L_{GAN}$
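$L_{img}$ and $L_{h_1}$ are $L_2$ reconstruction losses in their respective space of images $x$ and $h_1$ (pool5) codes, with hats denoting the reconstructed quantities:
(19) $L_{img} = \| \hat{x} - x \|^2$
(20) $L_{h_1} = \| \hat{h}_1 - h_1 \|^2$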
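The adversarial loss for $G$ (which intuitively maximizes the chance $D$ makes mistakes) follows the original GAN paper [14]:
(21) $L_{GAN} = -\sum_i \log D(G(h_i))$
where $x_i$ is a training image, and $h_i = E(x_i)$ is a code. As in Goodfellow et al. [14], $D$ tries to tell apart real and fake images, and is trained with the adversarial loss as follows:
(22) $L_D = -\sum_i \left[ \log D(x_i) + \log \left( 1 - D(G(h_i)) \right) \right]$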
Architecture. $G$ is an upconvolutional (also “deconvolutional”) network [10] with 9 upconvolutional and 3 fully connected layers. $D$ is a regular convolutional network for image classification with a similar architecture to AlexNet [26] with 5 convolutional layers followed by 3 fully connected layers, and 2 outputs (for “real” and “fake” classes).
The networks are trained via the common training framework described in Sec. S4.1 for minibatch iterations. We use regularization of .
Specifics of DGN-AM reproduction. Note that while the original set of parameters in Nguyen et al. [37] (including a small number of iterations, an $L_2$ decay on code $h$, and a step size decay) produces high-quality images, it does not allow for a long sampling chain, traveling from one mode to another. For comparisons with other models within our framework, we sample from DGN-AM with , which is slightly different from in Eq. 10, but produces qualitatively the same result.
S4.5 Training Joint PPGN-h
Using the same existing network structures from DGN-AM [37], we train the generator $G$ differently by treating the entire model as being composed of 3 interleaved DAEs: one for $h$, $h_1$, and $x$ respectively (see Fig. S9c). Specifically, we add Gaussian noise to these variables during training and incorporate three corresponding $L_2$ reconstruction losses (see Fig. S9c). Adding noise to an AE can be considered as a form of regularization that encourages an autoencoder to extract more useful features [57]. Thus, here, we hypothesize that adding a small amount of noise to $x$ and $h_1$ might slightly improve the result. In addition, the benefits of adding noise to $h$ and training the pair $G$ and $E$ as a DAE for $h$ are twofold: 1) it allows us to formally estimate the quantity $\partial \log p(h) / \partial h$ (see Eq. 6) following a previous mathematical justification from Alain & Bengio [1]; 2) it allows us to sample with a larger noise level, which might improve the mixing speed.
We add noise to $h$ during training, and train $G$ with an $L_2$ reconstruction loss for $h$, where $\hat{h}$ denotes the reconstructed code:
(23) $L_h = \| \hat{h} - h \|^2$
Thus, generator network $G$ is trained with 4 losses in total:
(24) $L = L_{img} + L_{h_1} + L_h + L_{GAN}$
Three losses $L_{img}$, $L_{h_1}$, and $L_{GAN}$ remain the same as in the training of Noiseless Joint PPGN-h (Sec. S4.4). Network architectures and other common training details remain the same as described in Sec. S4.4.
The amounts of Gaussian noise added to the 3 variables $x$, $h_1$, and $h$ correspond, respectively, to 1% of the mean pixel values and 10% of the mean activations in pool5 and fc6 space, computed from the training set. We experimented with larger noise levels, but were not able to train the model successfully as large amounts of noise resulted in training instability. We also tried training without noise for $x$, i.e. treating the model as being composed of 2 DAEs instead of 3, but did not obtain qualitatively better results.
Note that, although we did not experiment with it in this paper, jointly training both the generator and the encoder via their respective maximum likelihood training algorithms is possible. Also, Xie et al. [62] have proposed a training regime that alternately updates these two networks. That cooperative training scheme indeed yields a generator that synthesizes impressive results for multiple image datasets [62].
Appendix S5 Inpainting
We first randomly mask out a patch of a real image (Fig. 7a). The patch size is chosen following Pathak et al. [42]. We perform the same update rule as in Eq. 11 (conditioning on a class, e.g. “volcano”), but with an additional step updating image $x$ during the forward pass:
(25) $x = M \odot x + (1 - M) \odot x_{real}$
where $M$ is the binary mask for the corrupted patch, $(1 - M) \odot x_{real}$ is the uncorrupted area of the real image, and $\odot$ denotes the Hadamard (elementwise) product. Intuitively, we clamp the observed parts of the synthesized image and then sample only the unobserved portion in each pass. The DAE model and the image classification network model see progressively refined versions of the final, filled-in image. This approach tends to fill in semantically correct content, but it often fails to match the local details of the surrounding context (Fig. 7b, the predicted pixels often do not transition smoothly to the surrounding context). An explanation is that we are sampling in the fully-connected feature space, which mostly encodes information of the global structure of objects instead of local details [64].
To encourage the synthesized image to match the context of the real image, we can add an extra condition in pixel space in the form of an additional term to the update rule in Eq. 5 to update $x$ in the direction of minimizing the cost: $\| (1 - M) \odot x - (1 - M) \odot x_{real} \|_2^2$. This helps the filled-in pixels match the surrounding context better (Fig. 7 b vs. c). Compared to the Content-Aware Fill feature in Photoshop CS6, which is based on the PatchMatch technique [3], our method often performs worse in matching the local features of the surrounding context, but can fill in semantic objects better in many cases (Fig. 7, bird & bell pepper). More inpainting results are provided in Fig. S24.
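The clamping step and the pixel-space context cost described in this appendix can be sketched with NumPy as follows. The array sizes, the patch location, and the function names are illustrative assumptions; the mask is taken to be 1 inside the corrupted patch and 0 on observed pixels.

```python
import numpy as np

def clamp_observed(x, x_real, mask):
    """Keep synthesized values inside the hole and real pixels everywhere else."""
    return mask * x + (1.0 - mask) * x_real

def context_cost(x, x_real, mask):
    """Squared L2 mismatch between x and the real image on the observed pixels."""
    diff = (1.0 - mask) * (x - x_real)
    return float(np.sum(diff ** 2))

def context_cost_grad(x, x_real, mask):
    """Gradient of context_cost with respect to x (zero inside the hole)."""
    return 2.0 * (1.0 - mask) * (x - x_real)

rng = np.random.default_rng(0)
x_real = rng.uniform(0.0, 1.0, size=(8, 8))    # stand-in for a real image
x_synth = rng.uniform(0.0, 1.0, size=(8, 8))   # stand-in for a generated image
mask = np.zeros((8, 8))
mask[2:5, 3:6] = 1.0                            # corrupted patch (toy location)

x_clamped = clamp_observed(x_synth, x_real, mask)
```

After clamping, the context cost is zero by construction; the cost and its gradient matter when the image is updated without clamping, where they pull the observed region back toward the real image.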
Appendix S6 PPGN-x: DAE model of $p(x)$
We investigate the effectiveness of using a DAE to model $p(x)$ directly (Fig. 3a). This DAE is a 4-layer convolutional network trained on unlabeled images from ImageNet. We sweep across different noise amounts for training the DAE and empirically find that a noise level of of the pixel value range, corresponding to , produces the best results. Full training and architecture details are provided in Sec. S4.2.
We sample from this chain following Eq. 7 and show samples in Figs. 12(a) & 13(a). The $\epsilon_1$ and $\epsilon_3$ values correspond to the noise level used while training the DAE, and the $\epsilon_2$ value is chosen manually to produce the best samples. PPGN-x exhibits two expected problems: first, it models the data distribution poorly, evidenced by the images becoming blurry over time. Second, the chain mixes slowly, changing only slightly in hundreds of steps.
Note that, instead of training the above DAE, one can also form an $x$-DAE by combining a pair of separately trained encoder $E$ and generator $G$ into a composition $G(E(x))$. We also experiment with this model and call it Joint PPGN-x. The details of networks $E$ and $G$ and how they can be combined are described in Sec. 3.4 (Joint PPGN-h). For sampling, we sample in the image space, similarly to the PPGN-x in this section. We found that the Joint PPGN-x model performs better than PPGN-x, but worse than Joint PPGN-h (data not shown).
Appendix S7 Why PPGNs produce high-quality images
One practical question is why Joint PPGN-h produces high-quality images at a high resolution for 1000-class ImageNet more successfully than other existing latent variable models [41, 47, 43]. We can consider this question from two perspectives.
First, from the perspective of the training loss, $G$ is trained with the combination of three losses (Fig. S9b), which may be a beneficial approach to model $p(x)$. The gradient of the GAN [14] loss that is used to train $G$ pushes each reconstruction $G(h)$ toward a mode of real images $p(x)$ and away from the current reconstruction distribution. This can be seen by noting that the Bayes optimal $D$ is $p(x) / (p(x) + q(x))$, where $q(x)$ is the reconstruction distribution [14]. Since $x$ is already near a mode of $p(x)$, the net effect is to push $G(h)$ towards one of the modes of $p(x)$, thus making the reconstructions sharper and more plausible. If one uses only the GAN objective and no reconstruction objectives ($L_2$ losses in the pixel or feature space), $G$ may bring the sample far from the original $x$, possibly collapsing several modes of $x$ into fewer modes. This is the typical, known “missing-mode” behavior of GANs [47, 14] that arises in part because GANs minimize the Jensen-Shannon divergence rather than Kullback-Leibler divergence between the data and model distributions, leading to an over-memorization of modes [53]. The reconstruction losses are important to combat this missing-mode problem and may also serve to enable better convergence of the feature space autoencoder to the distribution it models, which is necessary in order to make the $h$-space reconstruction properly estimate $\partial \log p(h) / \partial h$ [1].
Second, from the perspective of the learned mapping, we train the parameters of the $E$ and $G$ pair of networks as an AE, mapping $x \to h \to x$ (see Fig. S9b). In this configuration, as in VAEs [24] and regular DAEs [57], the one-to-one mapping helps prevent the typical latent-to-input missing mode collapse that occurs in GANs, where some input images are not representable using any code [14, 47]. However, unlike in VAEs and DAEs, where the latent distribution is learned in a purely unsupervised manner, we leverage the labeled ImageNet data to train $E$ in a supervised manner that yields a distribution of features $h$ that we hypothesize to be semantically meaningful and useful for building a generative image model. To further understand the effectiveness of using deep, supervised features, it might be interesting future work to train PPGNs with other feature distributions such as random features or shallow features (e.g. produced by PCA).
| Model | Image size | Inception accuracy | Inception score | MS-SSIM | Percent of classes |
| --- | --- | --- | --- | --- | --- |
| Real ImageNet images | | 76.1% | 210.4 ± 4.6 | 0.10 ± 0.06 | 999 / 1000 |
| AC-GAN [41] | | 10.1% | N/A | N/A | 847 / 1000 |
| PPGN | | 59.6% | 60.6 ± 1.6 | 0.23 ± 0.11 | 829 / 1000 |
| PPGN samples resized to | | 54.8% | 47.7 ± 1.0 | 0.25 ± 0.11 | 770 / 1000 |
Row 2: Note that we chose to compare with AC-GAN [41] because this model is also class-conditional and, to the best of our knowledge, it produces the previous highest resolution ImageNet images in the literature.
Row 3: For comparison with ImageNet images, the spatial dimension of the samples from the generator is and we did not crop it to as done in other experiments in the paper.
Row 4: Although imperfect, we resized PPGN samples down to (last row) for comparison with AC-GAN.