Generative Moment Matching Networks

Yujia Li yujiali@cs.toronto.edu
Kevin Swersky kswersky@cs.toronto.edu
Richard Zemel zemel@cs.toronto.edu
Department of Computer Science, University of Toronto, Toronto, ON, Canada
Canadian Institute for Advanced Research, Toronto, ON, Canada

Abstract

We consider the problem of learning deep generative models from data. We formulate a method that generates an independent sample via a single feedforward pass through a multilayer perceptron, as in the recently proposed generative adversarial networks (Goodfellow et al., 2014). Training a generative adversarial network, however, requires careful optimization of a difficult minimax program. Instead, we utilize a technique from statistical hypothesis testing known as maximum mean discrepancy (MMD), which leads to a simple objective that can be interpreted as matching all orders of statistics between a dataset and samples from the model, and can be trained by backpropagation. We further boost the performance of this approach by combining our generative network with an autoencoder network, using MMD to learn to generate codes that can then be decoded to produce samples. We show that the combination of these techniques yields excellent generative models compared to baseline approaches as measured on MNIST and the Toronto Face Database.
The most visible successes in the area of deep learning have come from the application of deep models to supervised learning tasks. Models such as convolutional neural networks (CNNs) and long short-term memory (LSTM) networks are now achieving impressive results on a number of tasks such as object recognition (Krizhevsky et al., 2012; Sermanet et al., 2014; Szegedy et al., 2014), speech recognition (Graves & Jaitly, 2014; Hinton et al., 2012a), image caption generation (Vinyals et al., 2014; Fang et al., 2014; Kiros et al., 2014), machine translation (Cho et al., 2014; Sutskever et al., 2014), and more. Despite these successes, one of the main bottlenecks of the supervised approach is the difficulty of obtaining enough data to learn abstract features that capture the rich structure of the data. It is well recognized that a promising avenue is to use unsupervised learning on unlabelled data, which is far more plentiful and cheaper to obtain.
A longstanding and inherent problem in unsupervised learning is defining a good method for evaluation. Generative models offer the ability to evaluate generalization in the data space, which can also be qualitatively assessed. In this work we propose a generative model for unsupervised learning that we call generative moment matching networks (GMMNs). GMMNs are generative neural networks that begin with a simple prior from which it is easy to draw samples. These are propagated deterministically through the hidden layers of the network, and the output is a sample from the model. Thus, with GMMNs it is easy to quickly draw independent random samples, as opposed to the expensive MCMC procedures that are necessary in other models such as Boltzmann machines (Ackley et al., 1985; Hinton, 2002; Salakhutdinov & Hinton, 2009). The structure of a GMMN is most analogous to the recently proposed generative adversarial networks (GANs) (Goodfellow et al., 2014); however, unlike GANs, whose training involves a difficult minimax optimization problem, GMMNs are comparatively simple: they are trained to minimize a straightforward loss function using backpropagation.
The key idea behind GMMNs is the use of a statistical hypothesis testing framework called maximum mean discrepancy (Gretton et al., 2007). Training a GMMN to minimize this discrepancy can be interpreted as matching all moments of the model distribution to the empirical data distribution. Using the kernel trick, MMD can be represented as a simple loss function that we use as the core training objective for GMMNs. Using minibatch stochastic gradient descent, training can be kept efficient, even with large datasets.
As a second contribution, we show how GMMNs can be used to bootstrap autoencoder networks in order to further improve the generative process. The idea behind this approach is to train an autoencoder network and then apply a GMMN to the code space of the autoencoder. This allows us to leverage the rich representations learned by autoencoder models as the basis for comparing data and model distributions. To generate samples in the original data space, we simply sample a code from the GMMN and then use the decoder of the autoencoder network.
Our experiments show that this relatively simple, yet very flexible framework is effective at producing good generative models in an efficient manner. On MNIST and the Toronto Face Dataset (TFD) we demonstrate improved results over comparable baselines, including GANs. Source code for training GMMNs will be made available at https://github.com/yujiali/gmmn.
Suppose we are given two sets of samples $X = \{x_i\}_{i=1}^{N}$ and $Y = \{y_j\}_{j=1}^{M}$ and are asked whether the generating distributions $P_X = P_Y$. Maximum mean discrepancy is a frequentist estimator for answering this question, also known as the two sample test (Gretton et al., 2007; 2012a). The idea is simple: compare statistics between the two datasets; if they are similar, then the samples are likely to come from the same distribution.
Formally, the following MMD measure computes the mean squared difference of the statistics of the two sets of samples:

$$\mathcal{L}_{\mathrm{MMD}^2} = \left\| \frac{1}{N} \sum_{i=1}^{N} \phi(x_i) - \frac{1}{M} \sum_{j=1}^{M} \phi(y_j) \right\|^2 \quad (1)$$

$$= \frac{1}{N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} \phi(x_i)^\top \phi(x_{i'}) - \frac{2}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \phi(x_i)^\top \phi(y_j) + \frac{1}{M^2} \sum_{j=1}^{M} \sum_{j'=1}^{M} \phi(y_j)^\top \phi(y_{j'}) \quad (2)$$
Taking $\phi$ to be the identity function leads to matching the sample mean; other choices of $\phi$ can be used to match higher-order moments.
Written in this form, each term in Equation (2) involves only inner products between the $\phi$ vectors, and therefore the kernel trick can be applied:

$$\mathcal{L}_{\mathrm{MMD}^2} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} k(x_i, x_{i'}) - \frac{2}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} k(x_i, y_j) + \frac{1}{M^2} \sum_{j=1}^{M} \sum_{j'=1}^{M} k(y_j, y_{j'}) \quad (3)$$
The kernel trick implicitly lifts the sample vectors into an infinite-dimensional feature space. When this feature space corresponds to a universal reproducing kernel Hilbert space, it can be shown that, asymptotically, MMD is 0 if and only if $P_X = P_Y$ (Gretton et al., 2007; 2012a).
For universal kernels like the Gaussian kernel, defined as $k(x, x') = \exp(-\frac{1}{2\sigma}\|x - x'\|^2)$, where $\sigma$ is the bandwidth parameter, we can use a Taylor expansion to get an explicit feature map that contains an infinite number of terms and covers all orders of statistics. Minimizing MMD under this feature expansion is then equivalent to minimizing a distance between all moments of the two distributions.
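As a concrete sketch of Equation (3), the kernelized estimator can be computed directly from two sample matrices. The following is a minimal NumPy implementation (the function and variable names are our own, not from the paper's released code); it uses the biased V-statistic form, which includes the diagonal kernel terms:

```python
import numpy as np

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X (N, D) and Y (M, D),
    using the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma))."""
    def gram(A, B):
        # Pairwise squared Euclidean distances, then the Gaussian kernel.
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-np.maximum(d2, 0.0) / (2 * sigma))
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    # The three averages correspond to the three double sums in Eq. (3).
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()
```

For identical sample sets the estimate is exactly zero, and it grows as the two empirical distributions diverge.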
In this work we focus on generative models due to their ability to capture the salient properties and structure of data. Deep generative models are particularly appealing because they are capable of learning a latent manifold on which the data has high density. Learning this manifold allows smooth variations in the latent space to result in nontrivial transformations in the original space, effectively traversing between high density modes through low density areas (Bengio et al., 2013a). They are also capable of disentangling factors of variation, which means that each latent variable can become responsible for modelling a single, complex transformation in the original space that would otherwise involve many variables (Bengio et al., 2013a). Even if we restrict ourselves to the field of deep learning, there are a vast array of approaches to generative modelling. Below, we outline some of these methods.
One popular class of generative models used in deep learning are undirected graphical models, such as Boltzmann machines (Ackley et al., 1985), restricted Boltzmann machines (Hinton, 2002), and deep Boltzmann machines (Salakhutdinov & Hinton, 2009). These models are normalized by a typically intractable partition function, making training, evaluation, and sampling extremely difficult, usually requiring expensive Markov chain Monte Carlo (MCMC) procedures.
Next there is the class of fully visible directed models, such as fully visible sigmoid belief networks (Neal, 1992) and the neural autoregressive distribution estimator (Larochelle & Murray, 2011). These admit efficient log-likelihood calculation, gradient-based learning, and efficient sampling, but require that an ordering be imposed on the observable variables, which can be unnatural for domains such as images, and their sequential nature prevents them from taking advantage of parallel computation.
More related to our own work, there is a line of research devoted to recovering density models from autoencoder networks using MCMC procedures (Rifai et al., 2012; Bengio et al., 2013b; 2014). These attempt to use contraction operators or denoising criteria to generate a Markov chain through repeated perturbations during the encoding phase, followed by decoding.
Also related to our own work, there is the class of deep variational networks (Rezende et al., 2014; Kingma & Welling, 2014; Mnih & Gregor, 2014). These are also deep, directed generative models; however, they make use of an additional neural network that is designed to approximate the posterior over the latent variables. Training is carried out via a variational lower bound on the log-likelihood of the model distribution. These models are trained using stochastic gradient descent, but they either require that the latent representation be continuous (Kingma & Welling, 2014), or require many secondary networks to sufficiently reduce the variance of gradient estimates in order to produce a sufficiently good learning signal (Mnih & Gregor, 2014).
Finally, there is some early work that proposed using feedforward neural networks to learn generative models. MacKay (1995) proposed a model that is closely related to ours, which also used a feedforward network to map prior samples to the data space. However, instead of directly outputting samples, an extra distribution is associated with the output, and sampling was used extensively for learning and inference in this model. Magdon-Ismail & Atiya (1998) proposed using a neural network to learn a transformation from the data space to another space in which the transformed data points are uniformly distributed; this transformation network then learns the cumulative distribution function.
The high-level idea of the GMMN is to use a neural network to learn a deterministic mapping from samples of a simple, easy-to-sample distribution to samples from the data distribution. The architecture of the generative network is exactly the same as that of a generative adversarial network (Goodfellow et al., 2014). However, we propose to train the network by simply minimizing the MMD criterion, avoiding the hard minimax objective used in generative adversarial network training.
More specifically, the generative network has a stochastic hidden layer $h \in \mathbb{R}^H$ with $H$ hidden units at the top, with an independent uniform prior on each unit,

$$p(h) = \prod_{j=1}^{H} \mathcal{U}(h_j) \quad (4)$$

Here $\mathcal{U}(h) = \frac{1}{2}\mathbb{I}[-1 \le h \le 1]$ is a uniform distribution in $[-1, 1]$, where $\mathbb{I}[\cdot]$ is an indicator function. Other choices for the prior are also possible, as long as it is a simple enough distribution from which we can easily draw samples.
The vector $h$ is then passed through the neural network and deterministically mapped to a vector $x$ in the $D$-dimensional data space,

$$x = f(h; w) \quad (5)$$

where $f$ is the neural network mapping function, which can contain multiple layers of nonlinearities, and $w$ represents the parameters of the neural network. One example architecture for $f$ is illustrated in Figure 1(a), which has 3 intermediate ReLU (Nair & Hinton, 2010) layers and one logistic sigmoid output layer.
Figure 1. Example architectures: (a) GMMN; (b) GMMN+AE.
The prior $p(h)$ and the mapping $f$ jointly define a distribution $p(x)$ in the data space. To generate a sample $x \sim p(x)$, we only need to sample $h$ from the uniform prior and then pass it through the neural net to get $x = f(h; w)$.
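To make the sampling procedure concrete, here is a toy sketch of Equations (4) and (5) in NumPy: a uniform prior draw is pushed through a small ReLU network with a logistic sigmoid output. The layer sizes and the (untrained) weights are illustrative placeholders, not the architecture used in the experiments.

```python
import numpy as np

rng = np.random.RandomState(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative layer sizes and randomly initialized (untrained) weights.
H, D = 10, 784
W1, b1 = 0.1 * rng.randn(H, 64), np.zeros(64)
W2, b2 = 0.1 * rng.randn(64, D), np.zeros(D)

def sample_gmmn(n):
    h = rng.uniform(-1.0, 1.0, size=(n, H))      # uniform prior, Eq. (4)
    return sigmoid(relu(h @ W1 + b1) @ W2 + b2)  # deterministic map, Eq. (5)

samples = sample_gmmn(5)  # each row is one independent sample
```

Each call to `sample_gmmn` produces independent samples with a single feedforward pass, with no MCMC involved.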
Goodfellow et al. (2014) proposed to train this network by using an extra discriminative network, which tries to distinguish between model samples and data samples. The generative network is then trained to counteract this, in order to make the samples indistinguishable to the discriminative network. The gradient of this objective can be backpropagated through the generative network. However, because of the minimax nature of the formulation, it is easy to get stuck at a local optimum, so the training of the generative and discriminative networks must be interleaved and carefully scheduled. By contrast, our learning algorithm simply involves minimizing the MMD objective.
Assume we have a dataset of training examples $X^d = \{x_i^d\}_{i=1}^{N}$ ($d$ for data) and a set of samples $X^s = \{x_j^s\}_{j=1}^{M}$ generated from our model ($s$ for samples). The MMD objective is differentiable when the kernel is differentiable. For example, for Gaussian kernels $k(x, y) = \exp(-\frac{1}{2\sigma}\|x - y\|^2)$, the gradient of $\mathcal{L}_{\mathrm{MMD}^2}$ with respect to a model sample $x_j^s$ has a simple form:

$$\frac{\partial \mathcal{L}_{\mathrm{MMD}^2}}{\partial x_j^s} = \frac{2}{M^2} \sum_{j'=1}^{M} \frac{1}{\sigma} k(x_j^s, x_{j'}^s)(x_{j'}^s - x_j^s) - \frac{2}{NM} \sum_{i=1}^{N} \frac{1}{\sigma} k(x_j^s, x_i^d)(x_i^d - x_j^s) \quad (6)$$
This gradient can then be backpropagated through the generative network to update the parameters $w$.
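The gradient in Equation (6) can be transcribed directly and sanity-checked against finite differences of the squared-MMD estimate. The following sketch (with our own function names) does exactly that for the biased estimator:

```python
import numpy as np

def gauss_k(x, y, sigma):
    # Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma)) for vectors.
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma))

def mmd2(Xd, Xs, sigma):
    # Biased squared-MMD estimate (Eq. (3)) between data Xd and samples Xs.
    N, M = len(Xd), len(Xs)
    t_dd = sum(gauss_k(a, b, sigma) for a in Xd for b in Xd) / N**2
    t_ds = sum(gauss_k(a, b, sigma) for a in Xd for b in Xs) / (N * M)
    t_ss = sum(gauss_k(a, b, sigma) for a in Xs for b in Xs) / M**2
    return t_dd - 2.0 * t_ds + t_ss

def mmd2_grad(Xd, Xs, j, sigma):
    # Gradient of mmd2 with respect to the j-th model sample Xs[j], Eq. (6).
    N, M = len(Xd), len(Xs)
    g = np.zeros_like(Xs[j])
    for jp in range(M):  # from the sample-sample term
        g += (2.0 / (M**2 * sigma)) * gauss_k(Xs[j], Xs[jp], sigma) * (Xs[jp] - Xs[j])
    for i in range(N):   # from the data-sample cross term
        g -= (2.0 / (N * M * sigma)) * gauss_k(Xs[j], Xd[i], sigma) * (Xd[i] - Xs[j])
    return g
```

A central-difference check on a few random points confirms the analytic form agrees with the numerical gradient.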
Real-world data can be complicated and high-dimensional, which is one reason why generative modelling is such a difficult task. Autoencoders, on the other hand, are designed to solve the arguably simpler task of reconstruction. If trained properly, autoencoder models can be very good at representing data in a code space that captures enough statistical information for the data to be reliably reconstructed.
The code space of an autoencoder has several advantages for creating a generative model. The first is that its dimensionality can be explicitly controlled. Visual data, for example, while represented in a high-dimensional space, often lies on a low-dimensional manifold. This is beneficial for a statistical estimator like MMD, because the amount of data required to produce a reliable estimate grows with the dimensionality of the data (Ramdas et al., 2015). The second advantage is that each dimension of the code space can end up representing complex variations in the original data space. This concept is referred to in the literature as disentangling factors of variation (Bengio et al., 2013a).
For these reasons, we propose to bootstrap autoencoder models with a GMMN to create what we refer to as the GMMN+AE model. These operate by first learning an autoencoder and producing code representations of the data, then freezing the autoencoder weights and learning a GMMN to minimize MMD between generated codes and data codes. A visualization of this model is given in Figure 1(b).
Our method for training a GMMN+AE proceeds as follows:

1. Greedily pretrain the autoencoder layer-wise (Bengio et al., 2007).
2. Fine-tune the autoencoder.
3. Train a GMMN to model the code-layer distribution, using an MMD objective on the final encoding layer.
We found that adding dropout to the encoding layers can be beneficial in terms of creating a smooth manifold in code space. This is analogous to the motivation behind contractive and denoising autoencoders (Rifai et al., 2011; Vincent et al., 2008).
Here we outline some design choices that we have found to improve the performance of GMMNs.
Bandwidth Parameter. The bandwidth parameter $\sigma$ in the kernel plays a crucial role in determining the statistical efficiency of MMD, and setting it optimally is an open problem. A good heuristic is to perform a line search to obtain the bandwidth that produces the maximal distance (Sriperumbudur et al., 2009); other more advanced heuristics are also available (Gretton et al., 2012b). As a simpler approximation, for most of our experiments we use a mixture of kernels spanning multiple ranges. That is, we choose the kernel to be

$$k(x, x') = \sum_{q=1}^{K} k_{\sigma_q}(x, x') \quad (7)$$

where $k_{\sigma_q}$ is a Gaussian kernel with bandwidth parameter $\sigma_q$. We found that choosing a few simple bandwidth values spanning different scales and using a mixture of 5 or more kernels was sufficient to obtain good results. The weighting of the different kernels can be further tuned to achieve better results, but we kept them equally weighted for simplicity.
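A sketch of the mixture kernel in Equation (7) with equal weights follows; the particular bandwidth values below are illustrative placeholders rather than the ones used in the experiments:

```python
import numpy as np

def mix_gauss_gram(A, B, sigmas=(1.0, 5.0, 10.0, 20.0, 40.0)):
    """Gram matrix of the equally weighted Gaussian kernel mixture in Eq. (7).
    The bandwidths in `sigmas` are illustrative placeholders."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    d2 = np.maximum(d2, 0.0)  # guard against small negative rounding errors
    return sum(np.exp(-d2 / (2.0 * s)) for s in sigmas)
```

A sum of Gaussian kernels is itself a valid (positive-definite) kernel, so it can be substituted directly into the MMD estimator.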
Square Root Loss. In practice, we have found that better results can be obtained by optimizing $\mathcal{L}_{\mathrm{MMD}} = \sqrt{\mathcal{L}_{\mathrm{MMD}^2}}$. This loss can be important for driving the difference between the two distributions as close to 0 as possible. Compared to $\mathcal{L}_{\mathrm{MMD}^2}$, which flattens out when its value gets close to 0, $\mathcal{L}_{\mathrm{MMD}}$ behaves much better for small values. Alternatively, this can be understood by writing down the gradient of $\mathcal{L}_{\mathrm{MMD}}$ with respect to the parameters $w$:

$$\frac{\partial \mathcal{L}_{\mathrm{MMD}}}{\partial w} = \frac{1}{2 \mathcal{L}_{\mathrm{MMD}}} \frac{\partial \mathcal{L}_{\mathrm{MMD}^2}}{\partial w} \quad (8)$$

The $\frac{1}{2 \mathcal{L}_{\mathrm{MMD}}}$ term automatically adapts the effective learning rate. This is especially beneficial when both $\mathcal{L}_{\mathrm{MMD}}$ and $\frac{\partial \mathcal{L}_{\mathrm{MMD}^2}}{\partial w}$ become small, where this extra factor helps maintain larger gradients.
Minibatch Training. One of the issues with MMD is that the use of kernels means that the computation of the objective scales quadratically with the amount of data. In the literature there have been several alternative estimators designed to overcome this (Gretton et al., 2012a). In our case, we found it sufficient to optimize MMD using minibatch optimization. In each weight update, a small subset of data is chosen, and an equal number of samples are drawn from the GMMN. Within a minibatch, MMD is applied as usual. As we are using exact samples from the model and the data distribution, the minibatch MMD is still a good estimator of the population MMD. We found this approach to be both fast and effective. The minibatch training algorithm for the GMMN is shown in Algorithm 1.
Algorithm 1. Minibatch training for GMMN.
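As an illustration of the minibatch procedure described above, the following toy sketch performs gradient descent on the squared-MMD objective with a linear "generator" and a fixed set of prior draws. It is a deliberately simplified stand-in: a real GMMN uses a multilayer network, fresh prior samples per minibatch, the mixture kernel, and the square root loss.

```python
import numpy as np

def gauss_gram(A, B, sigma):
    # Gram matrix of the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma)).
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2 * sigma))

def mmd2(Xd, Xs, sigma):
    # Biased squared-MMD estimate between data Xd and model samples Xs.
    return (gauss_gram(Xd, Xd, sigma).mean() + gauss_gram(Xs, Xs, sigma).mean()
            - 2 * gauss_gram(Xd, Xs, sigma).mean())

def grad_wrt_samples(Xd, Xs, sigma):
    # Vectorized form of the per-sample gradient in Eq. (6).
    N, M = len(Xd), len(Xs)
    Kss, Ksd = gauss_gram(Xs, Xs, sigma), gauss_gram(Xs, Xd, sigma)
    g = (2.0 / (M**2 * sigma)) * (Kss @ Xs - Kss.sum(1)[:, None] * Xs)
    g -= (2.0 / (N * M * sigma)) * (Ksd @ Xd - Ksd.sum(1)[:, None] * Xs)
    return g

rng = np.random.RandomState(0)
Xd = np.array([2.0, -1.0]) + 0.1 * rng.randn(64, 2)  # toy "dataset"
H = rng.uniform(-1.0, 1.0, size=(64, 3))             # prior draws (held fixed here)
W, b = 0.01 * rng.randn(3, 2), np.zeros(2)           # linear "generator" x = h W + b
sigma, lr = 5.0, 1.0
loss_before = mmd2(Xd, H @ W + b, sigma)
for _ in range(300):
    Xs = H @ W + b
    g = grad_wrt_samples(Xd, Xs, sigma)  # dL/dXs
    W -= lr * (H.T @ g)                  # chain rule through the linear map
    b -= lr * g.sum(0)
loss_after = mmd2(Xd, H @ W + b, sigma)
```

After training, the loss drops and the generated samples cluster around the data, illustrating how the MMD gradient pulls the model distribution toward the empirical one.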
Figure 2. Samples and nearest neighbors: (a) GMMN MNIST samples; (b) GMMN TFD samples; (c) GMMN+AE MNIST samples; (d) GMMN+AE TFD samples; (e) GMMN nearest neighbors for MNIST samples; (f) GMMN+AE nearest neighbors for MNIST samples; (g) GMMN nearest neighbors for TFD samples; (h) GMMN+AE nearest neighbors for TFD samples.
We trained GMMNs on two benchmark datasets: MNIST (LeCun et al., 1998) and the Toronto Face Dataset (TFD) (Susskind et al., 2010). For MNIST, we used the standard test set of 10,000 images and split out 5,000 of the standard 60,000 training images for validation; the remaining 55,000 were used for training. For TFD, we used the same training and test sets and fold splits as Goodfellow et al. (2014), but split out a small subset of the training data to use as a validation set. For both datasets, rescaling the images to have pixel intensities between 0 and 1 was the only preprocessing step.
On both datasets, we trained the GMMN network in both the input data space and the code space of an autoencoder. For all the networks in this section, a uniform distribution in $[-1, 1]^H$ was used as the prior for the $H$-dimensional stochastic hidden layer at the top of the GMMN, which was followed by 4 ReLU layers and an output layer of logistic sigmoid units. The autoencoder for MNIST had 4 layers, 2 for the encoder and 2 for the decoder; for TFD the autoencoder had 6 layers in total, 3 for the encoder and 3 for the decoder. For both autoencoders the encoder and decoder had mirrored architectures. All layers in the autoencoder network used sigmoid nonlinearities, which also guaranteed that the code space dimensions lay in $[0, 1]$, so that they could match the GMMN outputs. The network architectures for MNIST are shown in Figure 1.
The autoencoders were trained separately from the GMMN. Cross entropy was used as the reconstruction loss. We first did standard layer-wise pretraining, then fine-tuned all layers jointly. Dropout (Hinton et al., 2012b) was used on the encoder layers. After training the autoencoder, we fixed it and passed the input data through the encoder to get the corresponding codes. The GMMN network was then trained in this code space to match the statistics of generated codes to those of codes from data examples. When generating samples, the generated codes were passed through the decoder to get samples in the input data space.
For all experiments in this section, the GMMN networks were trained with minibatches of size 1000; for each minibatch, we generated a set of 1000 samples from the network, and the loss and gradient were computed from these 2000 points. We used the square root loss function throughout.
Evaluation of our model is not straightforward: since we do not have an explicit form for the probability density function, it is not easy to compute the log-likelihood of data. However, sampling from our model is easy. We therefore followed the same evaluation protocol used for related models (Bengio et al., 2013a; 2014; Goodfellow et al., 2014). A Gaussian Parzen window (kernel density estimator) was fit to 10,000 samples generated from the model. The likelihood of the test data was then computed under this distribution. The scale parameter of the Gaussians was selected by a grid search over a fixed range using the validation set.
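The Parzen window evaluation can be sketched as follows: fit an isotropic Gaussian kernel density estimate to model samples and report the mean log-likelihood of test points. The function name is ours, and the bandwidth would be chosen on a validation set as described above.

```python
import numpy as np

def parzen_log_likelihood(samples, test, sigma):
    """Mean log-likelihood of `test` points under an isotropic Gaussian Parzen
    window (kernel density estimate) fit to `samples`. `sigma` is the standard
    deviation of each Gaussian component."""
    n, d = samples.shape
    d2 = (np.sum(test**2, 1)[:, None] + np.sum(samples**2, 1)[None, :]
          - 2 * test @ samples.T)
    e = -np.maximum(d2, 0.0) / (2 * sigma**2)
    # Numerically stable log-sum-exp over the n mixture components.
    m = e.max(1, keepdims=True)
    log_p = (m[:, 0] + np.log(np.exp(e - m).sum(1))
             - np.log(n) - 0.5 * d * np.log(2 * np.pi * sigma**2))
    return log_p.mean()
```

Test points near the sample cloud receive much higher log-likelihood than points far from it, which is the quantity the protocol compares across models.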
The hyperparameters of the networks, including the learning rate and momentum for both autoencoder and GMMN training, the dropout rate for the autoencoder, and the number of hidden units in each layer of both the autoencoder and the GMMN, were tuned using Bayesian optimization (Snoek et al., 2012; 2014) to optimize the validation set likelihood under the Gaussian Parzen window density estimate. (We used the service provided by https://www.whetlab.com.)
Table 1. Log-likelihood of the test sets under the Gaussian Parzen window estimate (mean ± standard error).

Model             MNIST       TFD
DBN               138 ± 2     1909 ± 66
Stacked CAE       121 ± 1.6   2110 ± 50
Deep GSN          214 ± 1.1   1890 ± 29
Adversarial nets  225 ± 2     2057 ± 26
GMMN              147 ± 2     2085 ± 25
GMMN+AE           282 ± 2     2204 ± 20
The log-likelihoods of the test sets for both datasets are shown in Table 1. The GMMN is competitive with other approaches, while the GMMN+AE significantly outperforms the other models. This shows that MMD, despite being relatively simple, is a powerful objective for training good generative models, especially when combined with an effective decoder.
Samples generated from the GMMN models are shown in Figure 2(a-d). The GMMN+AE produces the most visually appealing samples, which is reflected in its Parzen window log-likelihood estimates. The likely explanation is that perturbations in the code space correspond to smooth transformations along the manifold of the data space; in that sense, the decoder is able to "correct" noise in the code space.
To determine whether the models merely learned to copy the data, we follow the example of Goodfellow et al. (2014) and visualize the nearest neighbours of several samples, in terms of Euclidean pixel-wise distance, in Figure 2(e-h). By this metric, the samples do not appear to be mere copies of data examples.
One of the interesting aspects of a deep generative model such as the GMMN is that it is possible to directly explore the data manifold. Using the GMMN+AE model, we randomly sampled 5 points in the uniform space and show their corresponding data space projections in Figure 3. These points are highlighted by red boxes. From left to right, top to bottom we linearly interpolate between these points in the uniform space and show their corresponding projections in data space. The manifold is smooth for the most part, and almost all of the projections correspond to realistic looking data. For TFD in particular, these transformations involve complex attributes, such as the changing of pose, expression, lighting, gender, and facial hair.
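The interpolation procedure above amounts to walking a straight line in the prior space and projecting each intermediate point through the generator. A minimal sketch, with a stand-in generator function in place of a trained GMMN+AE:

```python
import numpy as np

def interpolate_latents(g, h0, h1, steps=8):
    """Linearly interpolate between prior vectors h0 and h1 and map each
    intermediate point through the generator g (any prior-to-data function)."""
    ts = np.linspace(0.0, 1.0, steps)
    return np.stack([g((1.0 - t) * h0 + t * h1) for t in ts])

# A stand-in "generator" for illustration; a trained network would be used here.
g = lambda h: np.tanh(3.0 * h)
path = interpolate_latents(g, np.array([-1.0, -1.0]), np.array([1.0, 1.0]))
```

Displaying the rows of `path` as images (for a real generator) visualizes the traversal between the two sampled points.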
In this paper we provide a simple and effective framework for training deep generative models, which we call generative moment matching networks. Our approach is based on optimizing maximum mean discrepancy so that samples generated from the model are indistinguishable from data examples in terms of their moment statistics. As is standard with MMD, the kernel trick allows a GMMN to avoid explicitly computing these moments, resulting in a simple training objective, and the use of minibatch stochastic gradient descent allows training to scale to large datasets.
Our second contribution combines MMD with autoencoders for learning a generative model of the code layer. The code samples from the model can then be fed through the decoder in order to generate samples in the original space. The use of autoencoders makes learning the generative model a much simpler problem. Combined with MMD, pretrained autoencoders can be readily bootstrapped into a good generative model of data. On MNIST and the Toronto Face Database, the GMMN+AE model achieves superior performance compared to other approaches. For these datasets, we demonstrate that the GMMN+AE is able to discover the implicit manifold of the data.
There are many interesting directions for research using MMD. One extension is to consider alternatives to the standard MMD criterion in order to speed up training, such as the class of linear-time estimators that has been developed recently in the literature (Gretton et al., 2012a).
Another possibility is to utilize random features (Rahimi & Recht, 2007). These are randomized feature expansions whose inner product converges to a kernel function as the number of features increases. This idea was recently explored for MMD by Zhao & Meng (2014). One advantage of this approach is that the cost would no longer grow quadratically with minibatch size, because we could use the original objective given in Equation (2). Another advantage is that the data statistics could be precomputed from the entire dataset, which would reduce the variance of the objective gradients.
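Under the random-feature view, the squared MMD reduces to the squared Euclidean distance between mean feature vectors, which is linear in the number of samples. A sketch using random Fourier features for the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma)); the implementation and names are our own, not from the cited work:

```python
import numpy as np

def mmd2_rff(X, Y, n_features=500, sigma=1.0, seed=0):
    """Linear-time squared-MMD estimate using random Fourier features that
    approximate the Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma))."""
    rng = np.random.RandomState(seed)
    d = X.shape[1]
    Wf = rng.randn(d, n_features) / np.sqrt(sigma)       # spectral frequencies
    bf = rng.uniform(0.0, 2.0 * np.pi, size=n_features)  # random phases
    phi = lambda A: np.sqrt(2.0 / n_features) * np.cos(A @ Wf + bf)
    diff = phi(X).mean(0) - phi(Y).mean(0)  # difference of mean feature maps
    return diff @ diff
```

Because the mean feature maps can be accumulated in a single pass, the data-side statistics could indeed be precomputed once over the entire dataset.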
Another direction we would like to explore is joint training of the autoencoder model with the GMMN. Currently, these are treated separately, but joint training may encourage the learning of codes that are both suitable for reconstruction as well as generation.
While a GMMN provides an easy way to sample data, the posterior distribution over the latent variables is not readily available. It would be interesting to explore ways to infer the posterior distribution over the latent space. A straightforward way to do this is to learn a neural network to predict the latent vector given a sample. This is reminiscent of the recognition models used in the wake-sleep algorithm (Hinton et al., 1995) or variational autoencoders (Kingma & Welling, 2014).
An interesting application of MMD that is not directly related to generative modelling comes from recent work on learning fair representations (Zemel et al., 2013). There, the objective is to train a prediction method that is invariant to a particular sensitive attribute of the data. Their solution is to learn an intermediate clustering-based representation. MMD could instead be applied to learn a more powerful, distributed representation such that the statistics of the representation do not change conditioned on the sensitive variable. This idea can be further generalized to learn representations invariant to known biases.
Finally, the notion of utilizing an autoencoder with the GMMN+AE model provides new avenues for creating generative models of even more complex datasets. For example, it may be possible to use a GMMN+AE with convolutional autoencoders (Zeiler et al., 2010; Masci et al., 2011; Makhzani & Frey, 2014) in order to create generative models of high-resolution color images.
Acknowledgements We thank David Warde-Farley for helpful clarifications regarding Goodfellow et al. (2014), and Charlie Tang for providing relevant references. We thank CIFAR, NSERC, and Google for research funding.
References
 Ackley et al. (1985) Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
 Bengio et al. (2007) Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., et al. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems (NIPS), 2007.
 Bengio et al. (2013a) Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. Better mixing via deep representations. In Proceedings of the 28th International Conference on Machine Learning (ICML), 2013a.
 Bengio et al. (2013b) Bengio, Y., Yao, L., Alain, G., and Vincent, P. Generalized denoising autoencoders as generative models. In Advances in Neural Information Processing Systems, pp. 899–907, 2013b.
 Bengio et al. (2014) Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. Deep generative stochastic networks trainable by backprop. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2014.
 Cho et al. (2014) Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
 Fang et al. (2014) Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J., Zitnick, C. L., and Zweig, G. From captions to visual concepts and back. arXiv preprint arXiv:1411.4952, 2014.
 Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
 Graves & Jaitly (2014) Graves, A. and Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1764–1772, 2014.
 Gretton et al. (2007) Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. J. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems (NIPS), 2007.
 Gretton et al. (2012a) Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012a.
 Gretton et al. (2012b) Gretton, A., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., Fukumizu, K., and Sriperumbudur, B. K. Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems, pp. 1205–1213, 2012b.
 Hinton (2002) Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
 Hinton et al. (1995) Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. The “wake-sleep” algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.
 Hinton et al. (2012a) Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012a.
 Hinton et al. (2012b) Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012b.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. Autoencoding variational Bayes. In International Conference on Learning Representations, 2014.
 Kiros et al. (2014) Kiros, R., Salakhutdinov, R., and Zemel, R. S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
 Larochelle & Murray (2011) Larochelle, H. and Murray, I. The neural autoregressive distribution estimator. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 MacKay (1995) MacKay, D. J. Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 354(1):73–80, 1995.
 Magdon-Ismail & Atiya (1998) Magdon-Ismail, M. and Atiya, A. Neural networks for density estimation. In NIPS, pp. 522–528, 1998.
 Makhzani & Frey (2014) Makhzani, A. and Frey, B. A winner-take-all method for training sparse convolutional autoencoders. In NIPS Deep Learning Workshop, 2014.
 Masci et al. (2011) Masci, J., Meier, U., Cireşan, D., and Schmidhuber, J. Stacked convolutional autoencoders for hierarchical feature extraction. In Artificial Neural Networks and Machine Learning–ICANN 2011, pp. 52–59. Springer, 2011.
 Mnih & Gregor (2014) Mnih, A. and Gregor, K. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, 2014.
 Nair & Hinton (2010) Nair, V. and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, pp. 807–814, 2010.
 Neal (1992) Neal, R. M. Connectionist learning of belief networks. Artificial intelligence, 56(1):71–113, 1992.
 Rahimi & Recht (2007) Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), 2007.
 Ramdas et al. (2015) Ramdas, A., Reddi, S. J., Poczos, B., Singh, A., and Wasserman, L. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In The Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), 2015.
 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286, 2014.
 Rifai et al. (2011) Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. Contractive autoencoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML11), pp. 833–840, 2011.
 Rifai et al. (2012) Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. A generative process for sampling contractive autoencoders. In International Conference on Machine Learning (ICML), 2012.
 Salakhutdinov & Hinton (2009) Salakhutdinov, R. and Hinton, G. E. Deep Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, 2009.
 Sermanet et al. (2014) Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. Overfeat: Integrated recognition, localization and detection using convolutional networks. In International Conference on Learning Representations, 2014.
 Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, 2012.
 Snoek et al. (2014) Snoek, J., Swersky, K., Zemel, R. S., and Adams, R. P. Input warping for Bayesian optimization of non-stationary functions. In International Conference on Machine Learning, 2014.
 Sriperumbudur et al. (2009) Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Lanckriet, G. R., and Schölkopf, B. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems, pp. 1750–1758, 2009.
 Susskind et al. (2010) Susskind, J., Anderson, A., and Hinton, G. E. The Toronto face dataset. Technical report, Department of Computer Science, University of Toronto, 2010.
 Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.
 Szegedy et al. (2014) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
 Vincent et al. (2008) Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103. ACM, 2008.
 Vinyals et al. (2014) Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2014.
 Zeiler et al. (2010) Zeiler, M. D., Krishnan, D., Taylor, G. W., and Fergus, R. Deconvolutional networks. In Computer Vision and Pattern Recognition, pp. 2528–2535. IEEE, 2010.
 Zemel et al. (2013) Zemel, R., Wu, Y., Swersky, K., Pitassi, T., and Dwork, C. Learning fair representations. In International Conference on Machine Learning, pp. 325–333, 2013.
 Zhao & Meng (2014) Zhao, J. and Meng, D. FastMMD: Ensemble of circular discrepancy for efficient two-sample test. arXiv preprint arXiv:1405.2664, 2014.