Generative Deep Deconvolutional Learning
Abstract
A generative model is developed for deep (multilayered) convolutional dictionary learning. A novel probabilistic pooling operation is integrated into the deep model, yielding efficient bottom-up (pretraining) and top-down (refinement) probabilistic learning. After learning the deep convolutional dictionary, testing is implemented via deconvolutional inference. To speed up this inference, a new statistical approach is proposed to project the top-layer dictionary elements to the data level. Following this, only one layer of deconvolution is required during testing. Experimental results demonstrate powerful capabilities of the model to learn multilayer features from images, and excellent classification results are obtained on the MNIST and Caltech 101 datasets.
Yunchen Pu, Xin Yuan and Lawrence Carin 

Department of Electrical and Computer Engineering, Duke University, Durham, NC, 27708, USA 
{yunchen.pu,xin.yuan,lcarin}@duke.edu 
1 Introduction
Convolutional networks, introduced in Lecun et al. (1998), have demonstrated excellent performance on image classification and other tasks. There are at least two key components of this model: computational efficiency manifested by leveraging the convolution operator, and a deep architecture, in which the features of a given layer serve as the inputs to the next layer above. Since that seminal contribution, much work has been undertaken on improving deep convolutional networks (Lecun et al., 1998), deep deconvolutional networks (Zeiler et al., 2010), convolutional deep restricted Boltzmann machines (Lee et al., 2009), and on Bayesian convolutional dictionary learning (Chen et al., 2013), among others.
An important technique employed in these deep models is pooling, in which a contiguous block of features from the layer below is mapped to a single input feature for the layer above. The pooling step manifests robustness, by minimizing the effects of variations due to small shifts, and it has the advantage of reducing the number of features as one moves higher in the hierarchical representation (possibly mitigating overfitting). Methods that have been considered include average and maximum pooling, in which the single feature mapped as input to the layer above is respectively the average or maximum of the corresponding block of features below. Average pooling may introduce blur to learned filters (Zeiler & Fergus, 2013), and use of the maximum (“max pooling”) is widely employed. Note that average and max pooling are deterministic. The stochastic pooling proposed by Zeiler & Fergus (2013) and the probabilistic max-pooling used by Lee et al. (2009) often improve upon deterministic pooling. The use of stochastic pooling is also attractive in the context of developing a generative model for the deep convolutional representation, as highlighted in this paper. Specifically, we develop a deep generative statistical model, which starts at the highest-level features, and maps these through a sequence of layers, until ultimately mapping to the data plane (e.g., an image). The feature at a given layer is mapped via a multinomial distribution to one feature in a block of features at the layer below (and all other features in the block at the next layer are set to zero). This is analogous to the method in Lee et al. (2009), in the sense of imposing that there is at most one nonzero activation within a pooling block. As we demonstrate, this yields a generative statistical model with which Bayesian inference may be readily implemented, with all layers analyzed jointly to fit the data.
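To make the top-down stochastic pooling concrete, the following sketch (in Python with NumPy, purely illustrative; the function name and the uniform block probabilities are our own choices, not from the paper) maps a single parent feature to one position in its pooling block via a multinomial draw and sets the remaining block entries to zero:

```python
import numpy as np

def stochastic_unpool(value, probs, rng):
    """Map a single parent feature to one position in its pooling block.

    `value` is the feature at the layer above; `probs` are multinomial
    probabilities over the block positions (in the paper these carry
    Dirichlet priors and are inferred). All other block entries are zero,
    so at most one activation is nonzero within the block.
    """
    block = np.zeros(len(probs))
    idx = rng.choice(len(probs), p=probs)  # multinomial position draw
    block[idx] = value
    return block

rng = np.random.default_rng(0)
block = stochastic_unpool(2.5, np.array([0.25, 0.25, 0.25, 0.25]), rng)
```

Running the generative model top-down, this operation is applied to every pooled feature, so each block below contains at most one nonzero entry, exactly the constraint the multinomial construction imposes.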
We use bottom-up pretraining, in which we initially learn the parameters of each layer sequentially, one at a time, from bottom to top, based on the features at the layer below. In the refinement phase, however, all model parameters are learned jointly, top-down. Each consecutive layer in the model is locally conjugate in a statistical sense, so learning model parameters may be readily performed using sampling or variational methods. We here develop a Gibbs sampler for learning, with the goal of obtaining a maximum a posteriori (MAP) estimate of the model parameters, as in the original paper on Gibbs sampling (Geman & Geman, 1984) (we have found it unnecessary, and too expensive, to attempt an accurate estimate of the full posterior). The Gibbs sampler employed for parameter learning may be viewed as an alternative to typical optimization-based learning (Lecun et al., 1998; Zeiler et al., 2010), making convenient use of the developed generative statistical model.
The work in Zeiler et al. (2010); Chen et al. (2013) involves learning convolutional dictionaries, and at the testing phase one must perform a (generally) expensive nonlinear deconvolution step at each layer. In Kavukcuoglu et al. (2010) convolutional dictionaries are also learned at the training stage, but one simultaneously learns a convolutional filter bank and a nonlinear function. The convolutional filter bank can be implemented quickly at test time (no nonlinear deconvolutional inversion) and, linked with the nonlinear function, this computationally efficient testing step is meant to approximate the deconvolutional network.
We propose an alternative approach to yield fast inversion at test time, while still retaining an aspect of the nonlinear deconvolution operation. As detailed below, in the learning phase we infer a deep hierarchy of convolutional dictionary elements, which, if handled as in Zeiler et al. (2010), requires joint deconvolution at each layer when testing. However, leveraging our generative statistical model, the dictionary elements at the top of the hierarchy can be mapped through a sequence of linear operations to the image/data plane. At test time, we only employ the features from the top layer in the hierarchy, mapped to the data plane, and therefore only a single layer of deconvolution need be applied. This implies that the test-time computational cost is independent of the number of layers employed during the learning phase.
This paper makes three contributions: (i) rather than employing beta-Bernoulli sparsity at each layer of the model separately, as in Chen et al. (2011; 2013), the sparsity is manifested via a multinomial process between layers, constituting stochastic pooling, and allowing all layers of the deep model to be coupled when learning; (ii) the stochastic pooling manifests a proper top-down generative model, allowing a new means of mapping high-level features to the data plane; and (iii) a novel form of testing is employed with deep models, with the top-layer features mapped to the data plane, and deconvolution applied only once, directly with the data. This methodology yields excellent performance on image-recognition tasks, as demonstrated in the experiments.
2 Modeling Framework
The proposed model is applicable to general data for which a convolutional dictionary representation is appropriate. One may, for example, apply the model to one-dimensional signals such as audio, or to two-dimensional imagery. In this paper we focus on imagery, and hence assume two-dimensional signals and convolutions. Grayscale images are considered for simplicity, with straightforward extension to color.
2.1 Single-Layer Convolutional Dictionary Learning
Assume grayscale images , with ; the images are analyzed jointly to learn the convolutional dictionary . Specifically consider the model
(1) 
where is the convolution operator, denotes the Hadamard (elementwise) product, the elements of are in , the elements of are real, and represents the residual. indicates which shifted version of is used to represent . Considering (typically and ), the corresponding weights are of size .
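As a hedged illustration of the convolutional synthesis in (1), the sketch below (NumPy; all names are ours) builds an image as a sum over dictionary elements of the "full" 2D convolution of each element with its masked weight map. The plain-loop convolution stands in for the FFT-based implementation used later in the paper:

```python
import numpy as np

def conv2d_full(a, b):
    """Plain 'full' 2D convolution (no FFT), for illustration only."""
    H, W = a.shape
    h, w = b.shape
    out = np.zeros((H + h - 1, W + w - 1))
    for i in range(h):
        for j in range(w):
            out[i:i + H, j:j + W] += b[i, j] * a
    return out

def synthesize(dictionary, weights, indicators):
    """Single-layer convolutional synthesis in the spirit of Eq. (1):
    the image is the sum over k of d_k * (w_k ⊙ z_k), where ⊙ is the
    Hadamard product and * is 2D convolution (residual omitted)."""
    return sum(conv2d_full(w * z, d)
               for d, w, z in zip(dictionary, weights, indicators))
```

For example, a single dictionary element with one active weight reproduces a shifted, scaled copy of that element in the synthesized image, which is how the binary indicators select which shifted versions of each element are used.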
Let and represent elements of and , respectively. Within a Bayesian construction, the priors for the model may be represented as (Paisley & Carin, 2009):
(2)  
(3)  
(4) 
where , denotes the gamma distribution, represents the identity matrix, and are hyperparameters, for which default settings are discussed in Paisley & Carin (2009); Chen et al. (2011; 2013). While the model may look somewhat complicated, local conjugacy admits Gibbs sampling or variational Bayes inference (Chen et al., 2011; 2013).
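A minimal sketch of how a beta-Bernoulli prior of this type induces sparse activations follows. The specific Beta parameterization below (a/K, b(K-1)/K) follows the general construction in Paisley & Carin (2009) and is an assumption on our part, since the exact hyperparameter forms in (2)-(4) are not reproduced here; all function names are illustrative:

```python
import numpy as np

def sample_sparse_weights(K, shape, a, b, rng):
    """Draw K sparse weight maps via a beta-Bernoulli construction:
    pi_k ~ Beta(a/K, b(K-1)/K) controls how often element k is used,
    z is a binary usage indicator, and w a Gaussian weight (a sketch)."""
    pi = rng.beta(a / K, b * (K - 1) / K, size=K)
    weights = []
    for k in range(K):
        z = (rng.random(shape) < pi[k]).astype(float)  # binary indicators
        w = rng.normal(size=shape)                     # real-valued weights
        weights.append(w * z)                          # Hadamard product w ⊙ z
    return weights

rng = np.random.default_rng(1)
ws = sample_sparse_weights(K=20, shape=(4, 4), a=1.0, b=1.0, rng=rng)
```

Because most draws of pi_k are small for large K, most indicator entries are zero, which is the sparsity mechanism that local conjugacy lets a Gibbs sampler update in closed form.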
In Chen et al. (2011; 2013) a deep model was developed based on (1), by using as the input of the layer above. In order to do this, a pooling operation (e.g., the max-pooling used in Chen et al. (2013)) is employed, reducing the feature dimension as one moves to higher layers. However, the model was learned by stacking layers upon each other, without subsequent overall refinement. This was because use of deterministic max pooling undermined development of a proper top-down generative model that coupled all layers; therefore, in Chen et al. (2013) the model in (1) was used sequentially from the bottom up, but the overall model parameters were never coupled when learning. To address this, we propose a probabilistic pooling procedure, yielding a top-down deep generative statistical structure, coupling all parameters when performing learning. As discussed when presenting results, this joint learning of all layers plays a critical role in improving model performance. The stochastic pooling applied here is closely related to that in Zeiler & Fergus (2013); Lee et al. (2009).
2.2 Pretraining & Stochastic Pooling
Parameters of the deep model are learned by first analyzing one layer of the model at a time, starting at the bottom layer (touching the data), and sequentially stacking layers. The parameters of each layer of the model are learned separately, conditioned on parameters of the layers learned thus far (as in Chen et al. (2011; 2013)). The parameters learned in this manner serve as initializations for the top-down refinement step, discussed in Sec. 2.3, in which parameters at all layers of the deep model are learned jointly.
Assume an layer model, with layer the top layer, and layer 1 at the bottom, closest to the data. In the pretraining stage, the output of layer is the input to layer , after pooling. Layer has dictionary elements, and we have:
(5)  
(6) 
The expression is a 2D (spatial) activation map, for image , model layer , dictionary element . The expression may be viewed as a 3D entity, with its th plane defined by a “pooled” version of (pooling discussed next). The dictionary elements and residual are also three dimensional (each 2D plane of and is the spatially dependent structure of the corresponding features), and the convolution is performed in the 2D spatial domain, simultaneously for each layer of the feature map.
We now discuss the relationship between and layer of . The 2D activation map is partitioned into dimensional contiguous blocks (pooling blocks with respect to layer of the model); see the left part of Figure 1. Associated with each block of pixels in is one pixel at layer of ; the relative locations of the pixels in are the same as the relative locations of the blocks in . Within each block of , either all pixels are zero, or only one pixel is nonzero, with the position of that pixel selected stochastically via a multinomial distribution. Each pixel at layer of equals the largest-amplitude element in the associated block of (i.e., max pooling). Hence, if all elements of a block of are zero, the corresponding pixel in is also zero. If a block of has a (single) nonzero element, that nonzero element is the corresponding pixel value at the th layer of .
The bottom-up generative process for each block of proceeds as follows (left part of Figure 1). The model first imposes that a given block of is either all zero or has one nonzero element, and this binary question is modeled via the beta-Bernoulli representation of (6). If a given block has a nonzero value, the position of that value in the associated block is defined by a multinomial distribution, and its value is modeled as represented in (6). The beta-Bernoulli step and the subsequent multinomial step are combined into one equivalent statistical representation, as discussed next.
Let denote the th block of at layer , where assuming integer divisions. We introduce a latent variable to implement at most one nonzero element out of the entries in through
(7) 
where and denote multinomial and Dirichlet distribution, respectively (the Dirichlet distribution has a set of parameters, and here we imply that are equal, and set to the value indicated in ). has entries, of which only one is equal to 1. If the last element is 1, this means all . Since the th block at layer corresponds to one element at layer , we have
(8) 
Hence, if the last element of is 1, all elements of block are zero; if not, the location of the nonzero element in the first entries of locates the position of the nonzero element in the corresponding block. The remaining parts of the model are represented as in (6).
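The combined beta-Bernoulli/multinomial construction of (7)-(8) can be sketched as follows (NumPy; the block size and probability vector are illustrative, and in the paper the probabilities carry a Dirichlet prior). The latent one-hot vector has one entry per block position plus a final "all-zero" entry:

```python
import numpy as np

def sample_block_indicator(block_size, probs, rng):
    """Sample the latent indicator of Eq. (7): a one-hot vector of
    length block_size + 1; if the last entry fires, the block is zero."""
    idx = rng.choice(block_size + 1, p=probs)
    z = np.zeros(block_size + 1)
    z[idx] = 1
    return z

def unpool_block(z, value):
    """Eq. (8)-style mapping: place `value` at the selected position,
    or return an all-zero block if the last indicator entry is 1."""
    block = np.zeros(len(z) - 1)
    if z[-1] != 1:
        block[np.argmax(z)] = value
    return block
```

This makes explicit why the representation is equivalent to a beta-Bernoulli "on/off" draw followed by a multinomial position draw: the extra category absorbs the "off" event.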
In the pretraining phase, we start with , which is the data . We learn using the blocked activation weights, via Gibbs sampling, where the multinomial distribution associates each nonzero element with a position in the corresponding block. The MAP Gibbs sample is then selected, defining model parameters for the layer under analysis. The “stacked” and pooled are used to define , and the learning procedure then continues, learning dictionary elements and activation maps , again via Gibbs sampling and MAP selection. This continues sequentially up to the th, or top, layer. For the top layer, since no pooling is necessary, the beta-Bernoulli prior in (2) is used.
2.3 Model Refinement With Stochastic Pooling
The learning performed with the top-down generative model (right part of Fig. 1) constitutes a refinement of the parameters learned during pretraining, and the excellent initialization constituted by the parameters learned during pretraining is key to the subsequent model performance.
In the refinement phase, the equations are (almost) the same, but we now proceed top-down, from (5) to (6). The generative process constitutes and , and after convolution is manifested; the residual is now absent at all layers, except the bottom layer, at which the fit to the data is performed. Each element of has an associated pooling block in . Via a multinomial distribution, as in pretraining, each element of is mapped to one position in the corresponding block of , and all other elements in that block are set to zero. Since is manifested top-down as a convolution of and , will in general have no elements exactly equal to zero (but many will be small, based on the pretraining). Hence, each block of will have one nonzero element, with position defined by the multinomial.[1]

[1] We also considered a model exactly as in pretraining, in which, in the pooling step, a pixel in could be mapped via the multinomial to an all-zero activation block in the layer below; the results are essentially unchanged from the method discussed above.
During pretraining many blocks of will be all-zero, since we preferred a sparse representation, while during refinement this sparsity requirement is relaxed, and in general each pooling block of will have one nonzero element (but it is still sparse), and this value is mapped via pooling to the corresponding pixel in . In pretraining the Dirichlet and multinomial distributions were of size , allowing the all-zero activation block; during refinement the multinomial and Dirichlet are of dimension . The corresponding Dirichlet and multinomial parameters from pretraining are used to constitute initializations for refinement.
2.4 Top-Level Features and Testing
In order to understand deep convolutional models, researchers have visualized dictionary elements mapped to the image level (Zeiler & Fergus, 2014). One key challenge of this visualization is that a single dictionary element at a high layer can have multiple representations at the layer below, given different activations in each pooling block (in our model, this is manifested by the stochasticity associated with the multinomial-based pooling). Zeiler & Fergus (2014) showed different versions of the same upper-layer dictionary element at the image level. Because of this capability of accurate dictionary localization at each layer, deep convolutional models perform well in classification. However, also due to these multiple representations, during testing one has to infer dictionary activations layer by layer (via deconvolution), which is computationally expensive. To alleviate this issue, Kavukcuoglu et al. (2010) proposed an approximation method using convolutional filter banks (fast because there is no explicit deconvolution) followed by a nonlinear function. Though efficient at test time, in the training step one must simultaneously learn deconvolutional dictionaries and associated filter banks, and the choice of nonlinear function is critical to the performance of the model. Moreover, in the context of the framework proposed here, it is difficult to integrate the approach of Kavukcuoglu et al. (2008; 2010) into a Bayesian model.
We propose a new approach to accelerate testing. After performing model learning (after refinement), we project top-layer dictionary elements down to the data plane. At test time, deconvolution is performed only once, using the top-layer dictionary elements mapped to the data plane. The top-layer activation strengths inferred via this deconvolution are then used in a subsequent classifier. The different manifestations of a top-layer dictionary element mapped to the data plane are constituted by different (stochastic) pooling mappings via the multinomial. To select the top-layer dictionary elements in the data plane used for testing, we employ maximum-likelihood (ML) dictionary elements, with ML performed across the different choices of the max pooling at each layer. Hence, after this ML-based top-layer dictionary selection, a pixel at a given layer is mapped to the same location in the associated block at the layer below, for all convolutional shifts (the same max-pooling map for all shifts at a given layer). The key approximation is thus that the stochastic pooling employed for each pixel, mapping to a position in a block at the layer below, is replaced by an ML-based deterministic pooling (possibly a different deterministic map at each layer). This simple approach has the advantage of Zeiler & Fergus (2014) at test time, in that we retain the deconvolution operation (unlike Kavukcuoglu et al. (2010)), but deconvolution must only be performed once (not at each layer). In the experiments presented below, when visualizing inferred dictionary elements in the image plane, this ML-based dictionary selection is employed. More details on this aspect of the model are provided in the Supplementary Material.
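Under the ML-based deterministic pooling just described, projecting a top-layer dictionary element to the data plane reduces to alternately unpooling (placing each pixel at a fixed position in its block) and convolving with the lower-layer dictionary, one layer at a time. The sketch below (NumPy; function names, the single fixed position per layer, and the plain-loop convolution are all our own simplifications) illustrates that linear mapping:

```python
import numpy as np

def conv2d_full(a, b):
    """Plain 'full' 2D convolution, for illustration only."""
    H, W = a.shape
    h, w = b.shape
    out = np.zeros((H + h - 1, W + w - 1))
    for i in range(h):
        for j in range(w):
            out[i:i + H, j:j + W] += b[i, j] * a
    return out

def ml_unpool(feature_map, pool, position):
    """Deterministic unpool: every pixel goes to the same fixed
    position within its (pool x pool) block -- the ML choice that
    replaces stochastic pooling at test time."""
    H, W = feature_map.shape
    up = np.zeros((H * pool, W * pool))
    r, c = position
    up[r::pool, c::pool] = feature_map
    return up

def project_to_data_plane(top_atom, lower_dicts, pools, positions):
    """Map a top-layer dictionary element to the data level by
    alternating deterministic unpooling and convolution with the
    lower-layer dictionary, one layer at a time (a sketch)."""
    atom = top_atom
    for d, p, pos in zip(lower_dicts, pools, positions):
        atom = conv2d_full(ml_unpool(atom, p, pos), d)
    return atom
```

Because every step is linear, the projected atom can be precomputed once after training, so testing needs only a single layer of deconvolution against these data-plane atoms.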
3 Gibbs-Sampling-Based Learning and Inference
Due to local conjugacy at every component of the model, the local conditional posterior distribution for all parameters of our model is manifested in closed form, yielding efficient Gibbs sampling (see Supplementary Material for details). As in all previous convolutional models of this type, the FFT is leveraged to accelerate computation of the convolution operations, here within Gibbs update equations.
In the pretraining step, we select the ML sample from 500 collection samples, after first computing and discarding 1500 burnin samples. The same number of burnin and collection samples, with ML selection, is performed for model refinement. This ML selection of collection samples shares the same spirit as Geman & Geman (1984), in the sense of yielding a MAP solution (not attempting to approximate the full posterior). During testing, we select the ML sample across 200 deconvolutional samples, after first discarding 500 burnin samples.
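The burn-in-then-ML-selection procedure can be sketched generically as follows (plain Python; the state, sampler, and likelihood here are placeholders, not the paper's actual conditional updates):

```python
def map_from_gibbs(sample_fn, loglik_fn, n_burn, n_collect, state):
    """Run a Gibbs-style chain and keep the maximum-likelihood
    collection sample, mirroring the paper's MAP-style use of Gibbs
    sampling: burn-in samples are discarded, and among collection
    samples the one with the highest log-likelihood is retained."""
    best, best_ll = None, -float("inf")
    for t in range(n_burn + n_collect):
        state = sample_fn(state)          # one full Gibbs sweep
        if t >= n_burn:                   # collection phase only
            ll = loglik_fn(state)
            if ll > best_ll:
                best, best_ll = state, ll
    return best, best_ll
```

In the paper's setting, `sample_fn` would be a full sweep over the closed-form conditional posteriors and `loglik_fn` the model likelihood; the same loop is reused for pretraining (1500/500), refinement (1500/500), and test-time deconvolution (500/200).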
4 Experimental Results
We here apply our model to the MNIST and Caltech 101 datasets. We compare dictionaries (viewed in the data plane) before and after refinement. Classification results (average of 10 trials) using top-layer features are presented for both datasets. As in (Paisley & Carin, 2009), the hyperparameters are set as , where is the number of dictionary elements at the corresponding layer, and ; these are standard hyperparameter settings (Paisley & Carin, 2009) for such models, and no tuning or optimization was performed. All code is written in MATLAB and executed on a desktop with a 3.8 GHz CPU and 24 GB of memory. Model training including refinement with one class (30 images) of Caltech 101 takes about 40 CPU minutes, and testing (deconvolution) for one image takes less than 1 second. These results were run on a single computer, for demonstration; acceleration via parallel implementation, GPUs (Krizhevsky et al., 2012), and coding in C will be considered in the future, and the successes realized recently in accelerating convolution-based models of this type are transferable to our model.
MNIST Dataset
Table 1: Classification error on the MNIST test set.

Methods  Test error 
DBN Hinton & Salakhutdinov (2006)  1.20% 
CBDN Lee et al. (2009)  0.82% 
0.53%  
0.35%  
MCDNN Ciresan et al. (2012)  0.23% 
SPCNN Zeiler & Fergus (2013)  
Average Pooling  0.83% 
Max Pooling  0.55% 
Stochastic Pooling  0.47% 
MCMC (10000 Training)  0.89% 
Batch VB (10000 Training)  0.95% 
Online VB (60000 Training)  0.96% 
Ours, 2-layer model + 1-layer features  
60000 Training  0.42% 
10000 Training  0.68% 
5000 Training  1.02% 
2000 Training  1.11% 
1000 Training  1.66% 
We first consider the widely studied MNIST data (http://yann.lecun.com/exdb/mnist/), which has 60,000 training and 10,000 testing images, each , for digits 0 through 9. A two-layer model is used with dictionary size () at the first layer and at the second layer; the pooling size is () and the numbers of dictionary elements at layers 1 and 2 are and , respectively. We obtained these numbers of dictionary elements by setting the initial number of dictionary elements to a relatively large value in the pretraining step and discarding infrequently used elements by counting the corresponding binary indicator , i.e., inferring the number of needed dictionary elements, as in Chen et al. (2013).
Table 1 summarizes the classification results of our model compared with some related results on the MNIST data. The second (top) layer features corresponding to the refined dictionary are sent to a nonlinear support vector machine (SVM) (Chang & Lin, 2011) with a Gaussian kernel, in a one-vs-all multiclass classifier, with classifier parameters tuned via 5-fold cross-validation (no tuning on the deep feature learning). Rather than concatenating features at all layers as in Zeiler & Fergus (2013); Chen et al. (2013), we only use the top-layer features as the input to the SVM (deconvolution is performed only with top-layer dictionary elements), which saves much computation time (as well as memory) in both inference and classification, since the feature size is small. When the model is trained using all 60000 digits, we achieve an error rate of on testing, which is very close to the state of the art, but with a relatively simple model compared to Ciresan et al. (2012); the error rates obtained using features learned after pretraining, before refinement, are similar to those in Chen et al. (2013) ( error), underscoring the importance of the refinement step. We further plot the testing error in Fig. (c) (bottom part) when the training size is reduced, compared to the results reported in Zeiler & Fergus (2013). It can be seen that our model outperforms every approach in Zeiler & Fergus (2013).
In order to examine the properties of the learned model, in Fig. (a) we visualize trained dictionaries at layer 2 mapped down to the data level. It is observed qualitatively that refinement improves the dictionary; the atoms after refinement are much sharper. If the average pooling described in Zeiler & Fergus (2013) is used, the dictionaries are blurry (middle-left part of Fig. (a)). When a threshold is imposed on the refined dictionary elements, they look like digits (rightmost part).
To further verify the efficacy of our model, we show in Fig. (b) the interpolation results for digits with half of the pixels missing, as in Lee et al. (2009). A one-layer model cannot recover the digits, while a two-layer model provides a good recovery (bottom row of Fig. (b)). Furthermore, by using our refinement approach, the recovery is much clearer (compare the bottom-left and bottom-middle parts of Fig. (b)). Given this excellent performance, more challenging interpolation results are shown in Fig. (c) (upper part), where we cannot identify any digits from the observations; even in this case, the model provides promising reconstructions.
Caltech 101 Dataset
We next consider the Caltech 101 dataset. First we analyze our model with images in the “easy face” category; 64 images (after local contrast normalization (Jarrett et al., 2009)) have been resized to and a three-layer deep model is used. At layers 1, 2 and 3, the number of dictionary elements is set respectively to , and (these inferred in the pretraining step, as discussed above), with dictionary sizes , and . The pooling sizes are (layer 1 to layer 2) and (layer 2 to layer 3). Example learned dictionary elements are mapped to the image level and shown in Fig. 3. It can be seen that the first-layer dictionary extracts edges of the images, while the second-layer dictionary elements look like parts of a face and the third-layer elements are almost entire faces. We can see the improvement manifested by refinement by comparing the right two parts of Fig. 3 (the dictionaries after refinement are sharper). Similar to the MNIST example, we also show in Fig. 4 the interpolation results for face data with half of the pixels missing, using a two-layer model (the dictionary sizes are and at layers 1 and 2, respectively, with max-pooling size ). It can be seen that the missing parts are recovered progressively more accurately, moving from a one-layer to a two-layer model. Though the background is a little noisy, each face is recovered in great detail by the second-layer dictionary (a three-layer model gives similar results, omitted here for brevity).
We develop Caltech 101 dictionaries by learning on each data class in isolation, and then concatenate all (top-layer) dictionaries when learning the classifier. In Figure 5 we depict dictionary elements learned for two data classes, projected to the image level (more results are shown in the Supplementary Material). It can be seen that the layer-1 dictionary elements are similar for the two data classes, while the upper-layer dictionary elements are data-class dependent. One problem with this parallel training is that the dictionary may be redundant across image classes (especially at the first layer). However, during testing, using the proposed approach, we only use top-layer dictionaries, which are typically distinct across data classes (for the data considered).
Table 2: Classification accuracy on Caltech 101.

# Training Images per Category  15  30 

DN Zeiler et al. (2010)  58.6 %  66.9% 
CBDN Lee et al. (2009)  57.7 %  65.4% 
HBP Chen et al. (2013)  58%  65.7% 
ScSPM Yang et al. (2009)  67 %  73.2% 
PFV Seidenari et al. (2014)  71.47%  80.13% 
RKSVD Li et al. (2013)  79 %  83% 
Convnet Zeiler & Fergus (2014)  83.8 %  86.5% 
Ours, 2-layer model + 1-layer features  70.02%  80.31% 
Ours, 3-layer model + 1-layer features  75.24%  82.78% 
For Caltech 101 classification, we follow the setup in Yang et al. (2009), selecting 15 and 30 images per category for training, and testing on the rest. The features of testing images are inferred based on the top-layer dictionaries and sent to a multiclass SVM; we again use a nonlinear SVM with a Gaussian kernel, with parameters tuned via cross-validation. Our results and related results are summarized in Table 2. For our model, we present results based on 2-layer and 3-layer models. It can be seen that our model (the 3-layer one) provides results close to the state of the art in Zeiler & Fergus (2014), which used a much more complicated model (i.e., a 7-layer convolutional network, pretrained on the ImageNet dataset), and our results are also very close to the state-of-the-art results using hand-crafted features (e.g., SIFT in Li et al. (2013)). Based on features learned by our model at the pretraining stage, our classification performance is comparable to that of the HBP model in Chen et al. (2013) (around 65% accuracy for a 2-layer model, when training with 30 examples per class), with our results demonstrating a 17% improvement in performance after model refinement.
5 Conclusions
A deep generative convolutional dictionary-learning model has been developed within a Bayesian setting, with efficient Gibbs-sampling-based MAP parameter estimation. The proposed framework enjoys efficient bottom-up and top-down probabilistic inference. A probabilistic pooling module has been integrated into the model, a key component in developing a principled top-down generative model with efficient learning and inference. Extensive experimental results demonstrate the efficacy of the model in learning multilayered features from images. A novel method has been developed to project the high-layer dictionary elements to the image level, and efficient single-layer deconvolutional inference is accomplished during testing. On the MNIST and Caltech 101 datasets, our results are very near the state of the art, but with relatively simple model complexity at test time. Future work includes performing deep feature learning and classifier design jointly. The algorithm will also be ported to a GPU-based implementation, allowing scaling to large-scale datasets.
References
 Chang & Lin (2011) Chang, C.C. and Lin, C.J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011.
 Chen et al. (2011) Chen, B., Polatkan, G., Sapiro, G., Carin, L., and Dunson, D. B. The hierarchical beta process for convolutional factor analysis and deep learning. In ICML, 2011.
 Chen et al. (2013) Chen, B., Polatkan, G., Sapiro, G., Blei, D., Dunson, D., and Carin, L. Deep learning with hierarchical convolutional factor analysis. IEEE TPAMI, 2013.
 Ciresan et al. (2012) Ciresan, D., Meier, U., and Schmidhuber, J. Multicolumn deep neural networks for image classification. In CVPR, 2012.
 Ciresan et al. (2011) Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. Flexible, high performance convolutional neural networks for image classification. IJCAI, 2011.
 Geman & Geman (1984) Geman, S. and Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE TPAMI, 1984.
 Hinton & Salakhutdinov (2006) Hinton, G. and Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science, 2006.
 Jarrett et al. (2009) Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. What is the best multistage architecture for object recognition? ICCV, 2009.
 Kavukcuoglu et al. (2008) Kavukcuoglu, K., Ranzato, M., and LeCun, Y. Fast inference in sparse coding algorithms with applications to object recognition. In ArXiv 1010.3467, 2008.
 Kavukcuoglu et al. (2010) Kavukcuoglu, K., Sermanet, P., Boureau, YL., Gregor, K., Mathieu, M., and LeCun, Y. Learning convolutional feature hierarchies for visual recognition. NIPS, 2010.
 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
 Lecun et al. (1998) Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. In Proceedings of the IEEE, 1998.
 Lee et al. (2009) Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. ICML, 2009.
 Li et al. (2013) Li, Q., Zhang, H., Guo, J., Bhanu, B., and An, L. Referencebased scheme combined with Ksvd for scene image categorization. IEEE Signal Processing Letters, 2013.
 Paisley & Carin (2009) Paisley, J. and Carin, L. Nonparametric factor analysis with beta process priors. In ICML, 2009.
 Seidenari et al. (2014) Seidenari, L., Serra, G., Bagdanov, A., and Del Bimbo, A. Local pyramidal descriptors for image recognition. IEEE TPAMI, 2014.
 Yang et al. (2009) Yang, J., Yu, K., Gong, Y., and Huang, T. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
 Zeiler & Fergus (2013) Zeiler, M. and Fergus, R. Stochastic pooling for regularization of deep convolutional neural networks. ICLR, 2013.
 Zeiler & Fergus (2014) Zeiler, M. and Fergus, R. Visualizing and understanding convolutional networks. ECCV, 2014.
 Zeiler et al. (2010) Zeiler, M., Krishnan, D., Taylor, G., and Fergus, R. Deconvolutional networks. CVPR, 2010.
Supplementary Material
Appendix A Conditional Posterior Distributions for Gibbs Sampling
In the th layer, the model can be written as:
(9) 
For simplification, we define the following symbols (operations):
(10)  
(11)  
(12)  
(13)  
(14) 
The symbol is the element-wise product operator and is the element-wise division operator.
means if and , with element
(15) 
For each MCMC iteration, the samples are drawn from:

Sample :
(16) (17) (18) 
Sample :
(19) 
Sample :
(20) (21) (22) 
Sample
(23) 
Sample :
Let , , ; , ; we can see that and are in one-to-one correspondence. From
(24) and
(25) we have
(26) 
Sample :
(27) (28) (29) 
Sample :
(30)
Appendix B Projection of Dictionaries to the Data Layer
B.1 Notation
Assume and . Here are the pooling ratio and the pooling map is . In the block of and , there is at most one nonzero element, where , . Now, let , then the following pooling and unpooling functions can be defined:

Define , with . Recall that within each pooling block, has at most one nonzero element, and therefore
(31) The following is an example to demonstrate :

Define , with
(32) The following is an example to demonstrate :

Define , with
(33) The following is an example to demonstrate , with :

Define , with :
(34) The following is an example to demonstrate with :
B.2 Some Useful Lemmas
Lemma 1.
Lemma 2.
Lemma 3.
Lemma 4.
Lemma 5.
The first three lemmas are obvious. We now provide proofs of Lemma 4 and Lemma 5.
Lemma 4 proof:
Recall that the convolution operator means if and , then the element is given by
(35) 
where
(36) 
Let , , and , . We want to prove that . Deriving element-wise, we have
(37) 
Since if and , then , and if not .
Lemma 5 proof:
Let