Deep Adversarial Belief Networks
Abstract
We present a novel adversarial framework for training deep belief networks (DBNs), which replaces the generator network in the methodology of generative adversarial networks (GANs) with a DBN and introduces a highly parallelizable numerical algorithm for training the resulting architecture in a stochastic manner. Unlike existing techniques, this framework can be applied to the most general form of DBNs with no requirement for backpropagation. As such, it lays a new foundation for developing DBNs on a par with GANs, with various regularization units such as pooling and normalization. By forgoing backpropagation, our framework also exhibits superior scalability compared to other DBN and GAN learning techniques. We present a number of numerical experiments in computer vision and neuroscience to illustrate the main advantages of our approach.
1 Introduction
An essential problem in statistical machine learning (ML) is to model a given data set as a collection of independent samples from an underlying probability distribution. This distribution is generally referred to as a generative model. Representing and training generative models has a long and fruitful history and is popular in different mathematical modeling disciplines related to ML. The advent of deep learning and its associated methodology has in recent years resulted in remarkable advances in the area of inference. Generative Adversarial Networks (GANs) are among the chief examples, and have gained enormous attention in different domains of application, including automatic translation yu2017seqgan ; isola2017image , image generation goodfellow2014generative ; reed2016generative ; arjovsky2017wasserstein and super-resolution ledig2017photo . Deep Belief Networks (DBNs) are another set of highly popular examples with a wide range of applications in acoustics mohamed2012acoustic ; lee2009unsupervised , computer vision nair20093d ; teh2001rate and others.
Within the class of generative models, the notion of a generative network marks a clear difference between modern ML techniques, such as GANs and DBNs, and other conventional methods. In a nutshell, a generative network consists of a randomized computational unit, which is able to generate random realizations from a wide variety of distributions. For instance, GANs utilize a standard deep neural network (DNN) that we refer to as a neural generator, fed with a random sample from a fixed distribution. On the other hand, DBNs consist of a Markov chain of random vectors with the last vector in the chain as the output and the others as hidden features. In contrast to conventional techniques, generative networks in GANs and DBNs encode the generative models in an implicit way, whereby the desired probability distribution may be computationally infeasible to evaluate, but can still be sampled efficiently. This fundamental difference leads to an incredible potential in representing highly complex generative models, at the cost, however, of a radical paradigm shift in the training methodology. In this respect, GANs provide a novel and numerically efficient training approach, relying on an adversarial learning framework and stochastic gradient descent (SGD) with backpropagation.
It has been observed that in many cases, DBNs have remarkable advantages over the neural generators in GANs. A motivating example, considered in this work, is modeling the recorded activities of biological neurons from the visual cortex area of a mouse brain under visual stimulation. While neural generators inherit the limiting properties of neural networks, such as continuity and differentiability, DBNs enjoy much more versatile statistical properties, including sparsity and less severe regularity srivastava2012multimodal ; boureau2008sparse . For modeling biological neurons, sparsity considerably limits standard GANs’ performance. Relying on a symmetric probabilistic relation between different layers, some DBNs can also be used in a reverse order, i.e. by feeding the data at the output layer and reversely generating the hidden features. DBNs based on Restricted Boltzmann Machines (RBMs) are prominent examples of such reversible networks. This unique property yields an efficient method for feature extraction, which has been exploited in different applications such as data completion and denoising lee2009convolutional ; lee2009unsupervised . It has also been used in supervised learning problems such as classification, by learning a joint generative model for the data and the labels and feeding the data to generate the labels hinton2006fast ; lee2009convolutional ; chen2015spectral . Using reversible DBNs, as we later argue in this paper, also allows us to symmetrically capture the relation of stimuli and neural activities, such that one of them can be inferred from the other. In addition, when given a small amount of training data, DBNs have the potential of a better statistical performance than DNNs, on account of their Bayesian nature. This turns out to be crucial to modeling neural activities as these recorded data are often in short supply.
Despite their numerous advantages, DBNs are less popular than GANs in practice, especially when highly deep structures are required to represent complex models. One reason is that unlike neural generators, the existing training techniques for DBNs are based on layerwise Gibbs sampling and/or variational methods, resulting in a substantially slower convergence rate than GANs. Moreover, the formulation of GANs admits various inference principles, as exemplified by the Wasserstein GAN (WGAN) architecture arjovsky2017wasserstein , while DBNs are generally trained on the basis of the Maximum Likelihood (MaL) principle, which can become numerically unstable in many practical situations arjovsky2017wasserstein . Furthermore, the various well known regularizers for neural generators such as pooling layers and normalization, are not readily used in DBNs as of now. Our goal in this paper is to address the aforementioned issues by endowing DBNs with similar training methodologies available for GANs, while avoiding the conventional layerwise training based on the MaL principle. The result may also be interpreted as a new generation of GANs with more flexible DBNs as their generators. We show that applying the adversarial learning approach to DBNs leads to a numerically more efficient algorithm than GANs, since unlike the backpropagation algorithm in GANs, our training approach is parallelizable over different layers. The main contributions of this paper can be summarized as follows:

Inspired by GANs, we develop an adversarial training framework for DBNs with superior numerical properties, including scalability through parallelization and compatibility with the acceleration and adaptive learning rate schemes.

Unlike existing approaches, our framework can address the most generic form of DBNs, creating a potential to incorporate similar regularization units as DNNs, such as pooling and normalization. Focusing on the standard RBM-based DBNs and convolutional DBNs, we leave the details of further generalizations to a future study.

Based on our framework, we develop algorithms that train DBNs under different metrics than the MaL principle, such as the Wasserstein distance.

We consider a number of illustrative experiments with the MNIST handwritten digits dataset lecun1998gradient as well as the aforementioned biological neural activities.
1.1 Related Literature
DBNs belong to a broader family of Bayesian networks nielsen2009bayesian ; neapolitan2004learning . The most popular form of DBNs is for binary variables and generalizes the "shallow" architecture of Restricted Boltzmann Machines (RBMs) smolensky1986information ; hinton2006reducing ; ackley1985learning . The most efficient methods of training RBMs, such as the contrastive divergence method, are based on Monte Carlo (MC) sampling and variational Bayesian techniques tieleman2008training ; sutskever2010convergence ; hinton2012practical . Similar techniques are used in a layerwise fashion for training DBNs consisting of multiple layers of RBMs bengio2007greedy ; boureau2008sparse ; hinton2006fast . Another approach for training DBNs is based on the variational lower bound (a.k.a. the Evidence Lower Bound, ELBO) mnih2014neural . Modifications of DBNs are also considered in the literature. In teh2001rate and nair2010rectified for example, the application of DBNs to non-binary variables is discussed by respectively introducing binomial and rectified linear (ReLU) units. To impose shift invariance, convolutional DBNs are introduced in lee2009convolutional . Using DBNs for modeling neural activities has also been considered in lee2008sparse ; zheng2014eeg . Compared to DBNs, GANs and neural generators belong to a more recent literature. The idea of GANs stems from the original work in goodfellow2014generative . Different variations of GANs, such as WGAN arjovsky2017wasserstein and Deep Convolutional GANs (DCGANs) radford2015unsupervised are highly popular in the literature. Conditional GANs are introduced in mirza2014conditional for modeling the relation of two variables such as images and labels gauthier2014conditional ; isola2017image . It is worth noting that using adversarial learning for training neural generators is not limited to GANs. For example, makhzani2015adversarial introduces a training method based on adversarial learning and the probabilistic autoencoder architecture used in the so-called Variational Autoencoders (VAEs) kingma2013auto .
2 Mathematical Background
2.1 Problem Formulation: Training Deep Belief Networks
Given an observed data set of points from a data space (domain), we are to estimate a probability measure on that space under which the data points form a set of independent random samples. (To define such a measure, we naturally assume that the data space is also equipped with a proper sigma algebra. Indeed, we are practically concerned only with the case where the data space is a space of finite-dimensional real vectors with the standard Borel sigma algebra.) DBNs address this problem by generating a random variable with the desired distribution. To this end, multiple layers of random variables are considered. In RBM-based DBNs, each layer is a random binary vector of a given dimension, and the joint probability density function (p.d.f.; for simplicity, we interchangeably use the terms probability distribution and probability mass) of the layers is written as
(1) 
where the model is parametrized by a set of weights and normalized by a proper constant. The output layer thus reflects the desired distribution, while the layers form a Markov chain. This is apparent in the RBM-based formulation in Eq. (1), as the p.d.f. can be factored into terms that each include only adjacent layers. We may therefore write the joint distribution in the "forward" form, as the product of the marginal distribution of the input layer and the transitional probabilities between successive layers, each with a suitable normalization constant. This forward representation enables us to conveniently sample the output of DBNs by first sampling the input layer from its marginal and then successively sampling the next layers from the transitional probabilities, given the realizations of the previous layers, until reaching the output. We observe that in the forward representation, the elements of the input layer are independent. Conditioned on their previous layer, the elements of the next layers are also independent. As the variables in Eq. (1) are binary, the constants can also be explicitly calculated, resulting in logistic functions for the probability of individual elements in the first layer as well as the conditional probabilities of the subsequent layers:
(2) 
where the nonlinearity is the logistic function. Once a DBN is trained, it can also be used in a backward way by factorizing the joint p.d.f. in the reverse order, starting from the marginal of the output layer, where with an abuse of notation we use the same symbol for the backward transitional probabilities. For the RBM-based model in Eq. (1), this factorization can be easily carried out, leading to expressions similar to Eq. (2), which allows us to reversely sample a set of hidden features from a given data point at the output layer.
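The forward sampling procedure described above can be sketched as follows. This is a minimal NumPy illustration, assuming logistic (sigmoid) conditionals of the kind referred to in Eq. (2); the function name `sample_dbn`, the argument layout, and the bias/weight parametrization are hypothetical choices made for the example and are not taken from the authors' implementation.

```python
import numpy as np


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def sample_dbn(b0, weights, biases, n_samples, rng):
    """Draw joint samples of all layers by forward (ancestral) sampling.

    b0      : bias vector of the input layer, shape (d_0,)          (assumed form)
    weights : list of matrices; weights[l] has shape (d_l, d_{l+1}) (assumed form)
    biases  : list of vectors; biases[l] has shape (d_{l+1},)
    Returns a list of binary arrays, one per layer, each (n_samples, d_l).
    """
    # Input layer: independent Bernoulli units with logistic probabilities.
    p0 = sigmoid(b0)
    h = (rng.random((n_samples, b0.size)) < p0).astype(float)
    layers = [h]
    # Each next layer is sampled given only the realization of the previous one.
    for W, b in zip(weights, biases):
        p = sigmoid(h @ W + b)
        h = (rng.random(p.shape) < p).astype(float)
        layers.append(h)
    return layers
```

The last element of the returned list plays the role of the DBN output; the earlier elements are the hidden features, which are kept because the gradient estimate discussed later uses samples of the entire network.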
DBNs are generally trained based on the Maximum Likelihood (MaL) principle. For a given set of weights, consider the marginal distribution of the output layer. The MaL principle then leads to the following optimization problem for training DBNs:
(3) 
The major difficulty in Eq. (3) is that the marginal likelihood term and its derivative are extremely difficult to calculate, so that gradient-based optimization techniques are not directly applicable. To overcome this difficulty and make DBN training more viable, we exploit in this paper the GAN methodology to lift the numerical difficulties of the MaL-based optimization framework.
2.2 Proposed Method: Deep Adversarial Belief Networks
For training the DBNs, we adopt a similar solution to GANs. We consider the empirical measure of the data set, i.e. the average of Dirac's delta measures at the data points, and take the solution of the following optimization problem:
(4) 
where the objective is a positive distance or divergence function between two measures, and the feasible set is the set of probability measures on the data space generated by the DBNs in Eq. (1). We observe that Eq. (4) generalizes the MaL framework in Eq. (3), since the latter is obtained as a special case by letting the divergence be the Kullback-Leibler divergence between the empirical measure and the DBN output measure. More generally and similarly to GANs, we consider those distance (or divergence) functions that can be written as
(5) 
where the functions over which Eq. (5) is optimized form a family of real-valued functions on the data space, known as the discriminators. Furthermore, the two terms of Eq. (5) involve a pair of real functions, and the notation implies that the variables in the arguments of the expectations are respectively distributed according to the two measures being compared. The original GAN formulation uses a pair of logarithmic functions with the set of all measurable functions as discriminators, corresponding to the Jensen-Shannon divergence. The WGAN formalism is obtained by taking a pair of linear functions with opposite signs and the set of all 1-Lipschitz functions as discriminators, which leads to the Wasserstein distance between measures. The MaL framework in Eq. (3) can also be obtained by a suitable choice of the two functions. In practice, the discriminator is limited to the family of deep neural networks (DNNs) with a suitable fixed architecture. In this case, we index the discriminator by the set of weights in the neural network at hand. Plugging Eq. (5) into Eq. (4) and using the above-mentioned specifications of the discriminator, we obtain the following optimization framework:
(6) 
Our proposed technique for training DBNs hence entails solving the optimization problem in Eq. (6) to obtain the set of parameters of the underlying DBN. As we shortly elaborate, the stochastic gradient method provides a practical scheme for this purpose. We also observe that Eq. (6) bears a similar adversarial interpretation to GANs: as the loss function reflects the objective of the discriminator in distinguishing the "real" samples of the data set from the "fake" ones generated by the DBN, the goal of the DBN is to deceive the discriminator by counterfeiting "true samples" in the best possible way.
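To make the role of the two loss functions concrete, the sketch below evaluates a sample-mean version of the adversarial objective for a fixed discriminator. It assumes the standard pairs from the GAN literature (logarithm and log of one minus the argument for the original GAN, identity and negation for WGAN) and assumes that the objective is the sum of one expectation over data samples and one over generated samples. The names `LOSS_PAIRS` and `empirical_objective` are illustrative only.

```python
import numpy as np

# Assumed loss pairs (first function applied to discriminator outputs on data,
# second applied to discriminator outputs on generated samples).
LOSS_PAIRS = {
    "gan": (np.log, lambda d: np.log1p(-d)),   # log d  and  log(1 - d)
    "wgan": (lambda d: d, lambda d: -d),       # d      and  -d
}


def empirical_objective(d_real, d_fake, kind="gan"):
    """Sample-mean estimate of the adversarial objective.

    d_real : discriminator outputs on data samples
    d_fake : discriminator outputs on generated (DBN) samples
    For kind="gan" the outputs must lie strictly between 0 and 1.
    """
    f, g = LOSS_PAIRS[kind]
    return float(np.mean(f(d_real)) + np.mean(g(d_fake)))
```

For instance, a GAN discriminator that outputs 0.5 everywhere cannot tell the two sample sets apart, and the objective then equals minus the logarithm of 4, the well-known equilibrium value of the original GAN game.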
2.2.1 Algorithmic Details
The optimization problem in Eq. (6) can be solved by the SGD method: at each iteration, a set of samples (minibatch) from either the data set or the output of the underlying DBN is randomly selected. The gradients of the corresponding term in the objective of Eq. (6), with respect to both the DBN weights and the discriminator weights, are estimated using the samples and subsequently applied. At a given iteration, if a set of samples from the data set is used, we adopt the standard procedure of estimating the gradient by calculating the sample mean:
(7) 
where the two entries of the estimate are the gradients with respect to the DBN weights and the discriminator weights. Note that the gradient with respect to the DBN weights is zero in this case, since the data samples do not depend on them. The above solution is not applicable when the DBN samples are employed, since the relation of their corresponding term to the DBN weights is implicit in the underlying distribution. For this reason, we first express the exact gradient of this term with respect to the DBN weights as follows (in the continuous variable case, the summation is replaced by an integral, but the final expression remains unchanged):
(8) 
where the expectation is taken over all layers of the network, distributed according to the joint p.d.f. in Eq. (1). We observe in Eq. (8) that the expected value on the right hand side is over all layers of the DBN, while the original expression on the left hand side is over the output only. Next, we estimate the right hand side by generating samples of the entire network, each of which contains a sample of the output layer, and calculating the sample mean. This leads to the following expression for the gradient
(9) 
We notice that the term in Eq. (9) involving the joint p.d.f. can be efficiently calculated on account of the Markovian properties of the DBNs, which allow us to express it layer by layer. This shows that the gradients of the variables at individual layers can be independently calculated in parallel, thus forgoing the backpropagation algorithm. This represents a great numerical advantage of adversarial DBNs over DNNs. We observe that for the RBM-based DBNs, calculating this term amounts to differentiating the expressions in Eq. (2), which can be found in the standard literature of DBNs hinton2006fast , and is hence omitted here for the sake of space. Once the elements of the stochastic gradient are calculated based on either Eq. (7) or Eq. (9), they are applied to their corresponding parameters with a suitable learning rate.
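The layerwise gradient computation can be illustrated with a short sketch. It assumes that the gradient takes the standard log-derivative (score-function) form, in which each sample's loss value multiplies the gradient of the log-probability of that sample, and it assumes sigmoid conditionals as in Eq. (2). Under these assumptions the log-probability splits into one term per layer transition, so each layer's gradient needs only the samples of its two adjacent layers and the shared per-sample loss values. The function name `layer_score_gradient` and the argument `reward` (the discriminator-based loss evaluated at each output sample) are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def layer_score_gradient(h_prev, h_next, W, b, reward):
    """Score-function gradient estimate for a single layer transition.

    h_prev : samples of layer l,      shape (n, d_l)
    h_next : samples of layer l + 1,  shape (n, d_{l+1})
    W, b   : weights (d_l, d_{l+1}) and biases (d_{l+1},) of this transition
    reward : per-sample loss values computed from the output layer, shape (n,)
    Returns sample-mean gradient estimates with the shapes of W and b.
    """
    # Derivative of log p(h_next | h_prev) w.r.t. the pre-activation
    # for Bernoulli units with logistic probabilities.
    resid = h_next - sigmoid(h_prev @ W + b)
    weighted = resid * reward[:, None]
    grad_W = h_prev.T @ weighted / len(reward)
    grad_b = weighted.mean(axis=0)
    return grad_W, grad_b
```

Because the function touches only one transition, it can be called for all layers concurrently once the network samples and the per-sample loss values are available, which is the sense in which no backward pass through the layers is needed. A quick sanity check is that a constant `reward` yields a gradient estimate close to zero, since the expected score of a probability model vanishes.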
2.2.2 Extensions
Our training method by Eq. (7) and Eq. (9) enables us to extend the existing framework of DBNs in multiple respects:
Modifying MaL Principle: We can easily alter our training principle by modifying the pair of functions in Eq. (5). In particular, we consider the Wasserstein metric and the JS divergence in our next experiments, which are popular choices in the GAN literature.
Non-RBM Layers: As seen, our training technique is applicable to any DBN, such as Eq. (2), for which the derivative of the forward representation is simple to compute. For example, we may simply obtain the convolutional DBNs by replacing the linear terms in Eq. (2) with a convolution. The resulting expressions and derivatives are similar to those in lee2009convolutional and are hence skipped; we nevertheless use the resulting algorithm in our experiments. Further operations such as normalization factors and pooling can also be incorporated in the description of the transitional probabilities in Eq. (2), but their adoption is postponed to future work as they will impact the reversibility property.
Accelerated Learning: Another advantage of our training methodology is that it admits standard techniques in optimization algorithms, such as acceleration and adaptive step size to improve convergence. We examine some of these approaches in our experiments.
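As an illustration of an adaptive step-size scheme, the following sketch implements a single Adam update kingma2014adam in NumPy, with that method's usual default hyperparameters. How the update is wired to the stochastic gradients of Eq. (7) and Eq. (9) is an assumption of this example: the estimate is simply passed as `grad`, and for the maximizing player the negated gradient would be passed instead. The function name `adam_step` is illustrative.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam descent step.

    param, grad : arrays of the same shape (parameter and its gradient estimate)
    m, v        : running first and second moment estimates (same shape)
    t           : 1-based iteration counter, used for bias correction
    Returns the updated (param, m, v).
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias-corrected moment estimates.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

On the first iteration the step has magnitude close to `lr` in every coordinate regardless of the gradient scale, which is one reason adaptive schemes can tame the high variance of sampled gradient estimates.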
3 Experiments
In this section, we examine our proposed training algorithm by way of two groups of numerical experiments. The first group concerns the application of DBNs to a computer vision problem, namely the MNIST dataset, containing 60,000 labeled samples of grayscale handwritten digits for training and 10,000 more for testing. The second group investigates DBNs for modeling neural activities of the visual cortex under given visual stimuli.
3.1 Generation of MNIST-Like Digits
The goal of our first experiment is to generate synthetic handwritten digits by a DBN, with or without control over the generated digit. For this experiment, we use the dataset of salakhutdinov2 , containing 1797 samples of 8 by 8 cropped MNIST images, further binarized by thresholding the original grayscale images. In the first part of this experiment, a DBN is adversarially trained by Eq. (7) and Eq. (9) in an unsupervised way, i.e. by feeding the image samples as the last layer and discarding the labels. Sampling the resulting DBN generates handwritten instances with no control over the underlying digit. In the second part, another DBN is trained in a supervised way (with ground truth labels) to gain control over the generated digit. For this purpose, we adopt a similar approach to the conditional GAN structure mirza2014conditional by treating a pair of label and image as a data point, whose two components are respectively fed to the first and last layers of a DBN. The discriminator takes this pair as an input and Eq. (7) and Eq. (9) are similarly used. The output of the discriminator in the two parts of our experiment can be interpreted as the likelihood of the input samples following the distribution of the training images, either unconditionally or given a label.
In both parts of our experiment, the discriminator is a 2-layer densely connected neural network, whose input, hidden, and output layers have sizes [64,64,1], respectively. The output layer has a sigmoid activation function. In the first part, we employ a densely connected 3-layer DBN as in Eq. (2), each layer with 64 units. For the second part, we add an input layer of length 10, corresponding to the one-hot encoding (converting categorical integers to a binary vector) of the ten digit labels.
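The one-hot encoding of the digit labels can be sketched in a few lines of NumPy. The helper name `one_hot` is hypothetical, and the example covers only the encoding itself, not how the encoded label is combined with the image inside the network.

```python
import numpy as np


def one_hot(labels, num_classes=10):
    """Convert integer class labels to binary indicator vectors.

    labels : sequence of integers in [0, num_classes)
    Returns an array of shape (len(labels), num_classes) with a single 1 per row.
    """
    labels = np.asarray(labels)
    out = np.zeros((labels.size, num_classes))
    out[np.arange(labels.size), labels] = 1.0
    return out
```

Each encoded label is a valid realization of a binary layer of length 10, so it can be clamped at the first layer of the DBN in the same way that a binarized image is clamped at the last one.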
The diagrams of the DBN and the discriminator are shown in Fig. 1. Samples of the generated images in the two experiments are also shown in Fig. 1. Note that the last layer of the resulting DBNs generates binary images of the digits, while the values of the conditional distribution of the pixels of the last layer given the previous layer in Eq. (2) can be used as the grayscale images of the original MNIST digits, before thresholding. We show the results of the conditional distribution of the last layer in Fig. 1. We observe that the generated digits are well distributed and resemble the real digits. Moreover, in the bottom plot, each row is conditioned on a certain label, which shows that the generation of the digits is, to a large extent, associated with the labels, while the digits maintain variability for the same label.
Training by Wasserstein distance: We also conducted the first part of our experiment using the WGAN formalism. Specifically, we apply weight clipping arjovsky2017wasserstein on the discriminator, remove the sigmoid function at the end of the discriminator, and change the pair of loss functions in Eq. (7) and Eq. (9) according to the Wasserstein distance. We show in Fig. 2 some generated samples of the digits using the Wasserstein DBN, which demonstrates that training DBNs with metrics other than the KL divergence leads to different distributional properties of the generated images. In this experiment, we also explore different adaptive learning rate strategies. The convergence properties of three different optimizers, namely SGD, RMSprop tieleman2012lecture and Adam kingma2014adam , are depicted on the right hand side of Fig. 2. As seen, SGD leads to the highest variance and the slowest convergence, while Adam results in the most suitable saddle point solution of Eq. (6).
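Weight clipping, as used in the WGAN formalism, amounts to projecting every discriminator parameter onto a small symmetric interval after each update. The sketch below shows this step in isolation; the helper name `clip_weights` and the threshold value 0.01 are illustrative assumptions, not settings reported for these experiments.

```python
import numpy as np


def clip_weights(params, c=0.01):
    """Clip every entry of every parameter array to the interval [-c, c].

    params : list of NumPy arrays (discriminator weights and biases)
    c      : clipping threshold (assumed value, to be tuned in practice)
    Returns a new list of clipped arrays; the inputs are left unchanged.
    """
    return [np.clip(p, -c, c) for p in params]
```

The clipping keeps the discriminator within a family of functions with bounded slope, which serves as a practical surrogate for the 1-Lipschitz constraint of the Wasserstein distance.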
3.2 Classification of MNIST
In this experiment, we use the DBN formalism for classification of MNIST images. For this purpose, we train a conditional convolutional DBN on the original MNIST dataset using the proposed framework. This is similar to the second part of our previous experiment with digits generation, with a slight difference that the DBN takes images as a nonbinary input and outputs a vector of length 10. We repeat this experiment by binarizing the entire MNIST dataset, but since the two results are similar, we only present one set of results. The images are still used as the conditional inputs to the discriminator, as in Fig. 1. The DBN includes 4 convolutional layers, with the following number of filters [32,32,16,10], and their corresponding sizes [11,11,5,4]. In the discriminator architecture, we first feed the conditional image to a CNN to generate a feature vector of length 64. We subsequently concatenate the feature vector with the DBN’s output (generated labels) and pass the result through a two layer linear neural network.
Since generative models are not specialized for classification, their training is usually followed by a fine tuning stage, where their weights are updated by backpropagation as a conventional CNN lee2009unsupervised . We use the first layer of a trained convolutional DBN as the pretrained weights for the first layer filters of a CNN, and finetune the weights using backpropagation. The classification performance is compared with direct training of a CNN classifier of the same structure. The resulting accuracy for different sizes of the training set are listed in Table 1. This shows that CNN initialized by DBN outperforms normal CNN for a small training set and has better generalizability. For the CNN without DBN pretraining and a small size of the training set, the result is highly unstable. In contrast, CNN initialized by DBN exhibits a considerably more consistent performance. The reported accuracy for CNN is the average of 100 runs.
Table 1. Classification accuracy for different sizes of the training set.

Training set size    DBN + CNN            CNN
100                  77.772% (1.501%)     70.037% (6.32%)
1000                 96.081% (0.622%)     94.625% (0.355%)
60000                99.140% (0.096%)     98.906% (0.092%)
3.3 Modeling Visual Cortex Neural Activities
Modeling neural spikes is a natural application of DBNs lee2008sparse . The power of DBNs in addressing the limitations of DNNs, such as sparsity and reduced datasets, is demonstrated in this task. DBNs naturally generate binary outputs, which can be easily interpreted as neural spikes. In this experiment, we show that a DBN is capable of modeling sparse spike signals.
Our dataset is recorded by a two-photon calcium imaging system capturing large scale neural activities stirman2016wide ; huang2018 . We simultaneously record activities of individual neurons as time series in the primary visual cortex (V1) and the anterolateral (AL) areas of the visual cortex of an awake mouse. The top 50 neurons whose activities are most correlated across 20 trials are selected for modeling, and their binary spike trains are obtained by applying a standard deconvolution technique to the recorded time series.
We model the spikes of the 50 neurons independent of the visual stimuli with a four-layer dense DBN. The numbers of units per layer are [128,128,128,50]. The discriminator is a two-layer dense neural network with 64 neurons in the hidden layer. The training procedure is similar to the first part of the experiment in Section 3.1. In Fig. 3, the firing probabilities for individual neurons given by the DBN are depicted, which exhibit high resemblance to the real firing rates, despite the fact that the overall firing rate is very low (about 1 spike per 100 frames). We also verify that the log-likelihood of the real data under the DBN increases during training. The likelihood of the output layer of the DBN is estimated by sampling the penultimate layer, evaluating the distribution of the last layer conditioned on each sample of the penultimate layer, and averaging, which amounts to the total probability rule.
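The likelihood estimate described above can be sketched as a simple Monte Carlo average. The example assumes sigmoid conditionals for the last layer as in Eq. (2) and takes the penultimate-layer samples as given (for instance, produced by forward sampling of the network); the function name `log_likelihood_mc` and the argument layout are hypothetical.

```python
import numpy as np


def log_likelihood_mc(x, h_penult, W, b):
    """Monte Carlo estimate of log p(x) for a binary output vector x.

    x        : binary output vector, shape (d,)
    h_penult : samples of the penultimate layer, shape (n, d_prev)
    W, b     : weights (d_prev, d) and biases (d,) of the last transition
    The estimate averages p(x | h) over the samples (total probability rule).
    """
    logits = h_penult @ W + b
    # log Bernoulli likelihood with logistic probabilities:
    # x * log(sigmoid(z)) + (1 - x) * log(1 - sigmoid(z)) = x * z - log(1 + exp(z))
    log_cond = (x * logits - np.logaddexp(0.0, logits)).sum(axis=1)
    # Average in the probability domain, computed stably in the log domain.
    m = log_cond.max()
    return float(m + np.log(np.mean(np.exp(log_cond - m))))
```

Working in the log domain matters for sparse spike data with many output units, since the product of per-unit probabilities can underflow when computed directly.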
4 Conclusion
In this paper, we proposed an adversarial training framework for DBNs. The experiments verify that our method works under different structures and settings, including GAN, WGAN, conditional GAN and various optimizers. The development of this method opens a promising way to train complex DBNs. Future work includes implementing more components for DBNs, improving the modeling capability and training stability, and exploring more complex and dedicated DBN structures for neural modeling, such as 3D convolution and recurrent structures.
References
 [1] David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.
 [2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
 [3] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in neural information processing systems, pages 153–160, 2007.
 [4] Y-Lan Boureau, Yann LeCun, et al. Sparse feature learning for deep belief networks. In Advances in neural information processing systems, pages 1185–1192, 2008.
 [5] Yushi Chen, Xing Zhao, and Xiuping Jia. Spectral–spatial classification of hyperspectral data based on deep belief network. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 8(6):2381–2392, 2015.
 [6] Jon Gauthier. Conditional generative adversarial nets for convolutional face generation. Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester, 2014(5):2, 2014.
 [7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [8] Geoffrey E Hinton. A practical guide to training restricted boltzmann machines. In Neural networks: Tricks of the trade, pages 599–619. Springer, 2012.
 [9] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
 [10] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006.
 [11] Yuming Huang, Ashkan Panahi, Han Wang, Bo Jiang, Hamid Krim, Yiyi Yu, Spencer L. Smith, and Liyi Dai. Model-free inference of neuronal connectivity via embedding dimensionality. In ICMNS, 2018.
 [12] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
 [13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [14] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 [15] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [16] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
 [17] Honglak Lee, Chaitanya Ekanadham, and Andrew Y Ng. Sparse deep belief net model for visual area v2. In Advances in neural information processing systems, pages 873–880, 2008.
 [18] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th annual international conference on machine learning, pages 609–616. ACM, 2009.
 [19] Honglak Lee, Peter Pham, Yan Largman, and Andrew Y Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. In Advances in neural information processing systems, pages 1096–1104, 2009.
 [20] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 [21] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 [22] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.
 [23] Abdelrahman Mohamed, George E Dahl, and Geoffrey Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):14–22, 2012.
 [24] Vinod Nair and Geoffrey E Hinton. 3d object recognition with deep belief nets. In Advances in neural information processing systems, pages 1339–1347, 2009.
 [25] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML10), pages 807–814, 2010.
 [26] Richard E Neapolitan et al. Learning bayesian networks, volume 38. Pearson Prentice Hall Upper Saddle River, NJ, 2004.
 [27] Thomas Dyhre Nielsen and Finn Verner Jensen. Bayesian networks and decision graphs. Springer Science & Business Media, 2009.
 [28] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 [29] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
 [30] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine learning, pages 872–879. ACM, 2008.
 [31] Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, Colorado Univ at Boulder Dept of Computer Science, 1986.
 [32] Nitish Srivastava and Ruslan R Salakhutdinov. Multimodal learning with deep boltzmann machines. In Advances in neural information processing systems, pages 2222–2230, 2012.
 [33] Jeffrey N Stirman, Ikuko T Smith, Michael W Kudenov, and Spencer L Smith. Wide field-of-view, multi-region, two-photon imaging of neuronal activity in the mammalian brain. Nature biotechnology, 34(8):857, 2016.
 [34] Ilya Sutskever and Tijmen Tieleman. On the convergence properties of contrastive divergence. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 789–795, 2010.
 [35] Yee Whye Teh and Geoffrey E Hinton. Rate-coded restricted Boltzmann machines for face recognition. In Advances in neural information processing systems, pages 908–914, 2001.
 [36] Tijmen Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pages 1064–1071. ACM, 2008.
 [37] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
 [38] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 [39] Wei-Long Zheng, Jia-Yi Zhu, Yong Peng, and Bao-Liang Lu. EEG-based emotion classification using deep belief networks. In 2014 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2014.