Generative Deep Deconvolutional Learning


Yunchen Pu, Xin Yuan and Lawrence Carin
Department of Electrical and Computer Engineering, Duke University, Durham, NC, 27708, USA
{yunchen.pu,xin.yuan,lcarin}@duke.edu
Abstract

A generative model is developed for deep (multi-layered) convolutional dictionary learning. A novel probabilistic pooling operation is integrated into the deep model, yielding efficient bottom-up (pretraining) and top-down (refinement) probabilistic learning. After learning the deep convolutional dictionary, testing is implemented via deconvolutional inference. To speed up this inference, a new statistical approach is proposed to project the top-layer dictionary elements to the data level. Following this, only one layer of deconvolution is required during testing. Experimental results demonstrate powerful capabilities of the model to learn multi-layer features from images, and excellent classification results are obtained on the MNIST and Caltech 101 datasets.


1 Introduction

Convolutional networks, introduced in Lecun et al. (1998), have demonstrated excellent performance on image classification and other tasks. There are at least two key components of this model: computational efficiency manifested by leveraging the convolution operator, and a deep architecture, in which the features of a given layer serve as the inputs to the next layer above. Since that seminal contribution, much work has been undertaken on improving deep convolutional networks (Lecun et al., 1998), deep deconvolutional networks (Zeiler et al., 2010), convolutional deep restricted Boltzmann machines (Lee et al., 2009), and on Bayesian convolutional dictionary learning (Chen et al., 2013), among others.

An important technique employed in these deep models is pooling, in which a contiguous block of features from the layer below are mapped to a single input feature for the layer above. The pooling step manifests robustness, by minimizing the effects of variations due to small shifts, and it has the advantage of reducing the number of features as one moves higher in the hierarchical representation (possibly mitigating over-fitting). Methods that have been considered include average and maximum pooling, in which the single feature mapped as input to the layer above is respectively the average or maximum of the corresponding block of features below. Average pooling may introduce blur to learned filters (Zeiler & Fergus, 2013), and use of the maximum (“max pooling”) is widely employed. Note that average and max pooling are deterministic. Stochastic pooling proposed by Zeiler & Fergus (2013) and the probabilistic max-pooling used by Lee et al. (2009) often improve the pooling process. The use of stochastic pooling is also attractive in the context of developing a generative model for the deep convolutional representation, as highlighted in this paper. Specifically, we develop a deep generative statistical model, which starts at the highest-level features, and maps these through a sequence of layers, until ultimately mapping to the data plane (e.g., an image). The feature at a given layer is mapped via a multinomial distribution to one feature in a block of features at the layer below (and all other features in the block at the next layer are set to zero). This is analogous to the method in Lee et al. (2009), in the sense of imposing that there is at most one non-zero activation within a pooling block. As we demonstrate, this yields a generative statistical model with which Bayesian inference may be readily implemented, with all layers analyzed jointly to fit the data.
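To make the stochastic pooling concrete, the following is a minimal sketch (in Python; names and the uniform block distribution are ours for illustration, whereas in the model the multinomial probabilities are Dirichlet-distributed and inferred) of the top-down mapping from one upper-layer feature to a single position in its pooling block at the layer below:

import numpy as np

def stochastic_unpool(feature_map, block=(3, 3), rng=np.random.default_rng(0)):
    # Top-down stochastic "unpooling": each pixel of the upper-layer feature map
    # is placed at one position of its pooling block in the layer below, drawn
    # from a multinomial; all other positions in that block are set to zero.
    H, W = feature_map.shape
    py, px = block
    lower = np.zeros((H * py, W * px))
    for i in range(H):
        for j in range(W):
            k = rng.integers(py * px)      # uniform multinomial here, for illustration only
            di, dj = divmod(k, px)
            lower[i * py + di, j * px + dj] = feature_map[i, j]
    return lower

Applying this map layer by layer, followed by convolution with the corresponding dictionaries, generates data from the top-level features.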

We use bottom-up pretraining, in which initially we sequentially learn parameters of each layer one at a time, from bottom to top, based on the features at the layer below. However, in the refinement phase, all model parameters are learned jointly, top-down. Each consecutive layer in the model is locally conjugate in a statistical sense, so learning model parameters may be readily performed using sampling or variational methods. We here develop a Gibbs sampler for learning, with the goal of obtaining a maximum a posterior (MAP) estimate of the model parameters, as in the original paper on Gibbs sampling (Geman & Geman, 1984) (we have found it unnecessary, and too expensive, to attempt an accurate estimate of the full posterior). The Gibbs sampler employed for parameter learning may be viewed as an alternative to typical optimization-based learning (Lecun et al., 1998; Zeiler et al., 2010), making convenient use of the developed generative statistical model.

The work in Zeiler et al. (2010); Chen et al. (2013) involves learning convolutional dictionaries, and at the testing phase one must perform a (generally) expensive nonlinear deconvolution step at each layer. In Kavukcuoglu et al. (2010) convolutional dictionaries are also learned at the training stage, but one simultaneously learns a convolutional filterbank and a nonlinear function. The convolutional filterbank can be implemented quickly at test (no nonlinear deconvolutional inversion) and, linked with the nonlinear function, this computationally efficient testing step is meant to approximate the deconvolutional network.

We propose an alternative approach to yield fast inversion at test, while still retaining an aspect of the nonlinear deconvolution operation. As detailed below, in the learning phase, we infer a deep hierarchy of convolutional dictionary elements, which if handled like in Zeiler et al. (2010), requires joint deconvolution at each layer when testing. However, leveraging our generative statistical model, the dictionary elements at the top of the hierarchy can be mapped through a sequence of linear operations to the image/data plane. At test, we only employ the features from the top layer in the hierarchy, mapped to the data plane, and therefore only a single layer of deconvolution need be applied. This implies that the test-time computational cost is independent of the number of layers employed during the learning phase.

This paper makes three contributions: (i) rather than employing beta-Bernoulli sparsity at each layer of the model separately, as in Chen et al. (2011; 2013), the sparsity is manifested via a multinomial process between layers, constituting stochastic pooling and allowing all layers of the deep model to be coupled during learning; (ii) the stochastic pooling manifests a proper top-down generative model, allowing a new means of mapping high-level features to the data plane; and (iii) a novel form of testing is employed with deep models, with the top-layer features mapped to the data plane and deconvolution applied only once, directly on the data. This methodology yields excellent performance on image-recognition tasks, as demonstrated in the experiments.

2 Modeling Framework

The proposed model is applicable to general data for which a convolutional dictionary representation is appropriate. One may, for example, apply the model to one-dimensional signals such as audio, or to two-dimensional imagery. In this paper we focus on imagery, and hence assume two-dimensional signals and convolutions. Gray-scale images are considered for simplicity, with straightforward extension to color.

2.1 Single-Layer Convolutional Dictionary Learning

Assume gray-scale images , with ; the images are analyzed jointly to learn the convolutional dictionary . Specifically consider the model

(1)

where is the convolution operator, denotes the Hadamard (element-wise) product, the elements of are in , the elements of are real, and represents the residual. indicates which shifted version of is used to represent . Considering (typically and ), the corresponding weights are of size .
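As an illustration of the single-layer model in (1), the sketch below synthesizes an image as a sum of dictionary elements convolved with binary-gated weight maps, plus a Gaussian residual; the names D, W, Z and sigma_e are ours and merely illustrative:

import numpy as np
from scipy.signal import convolve2d

def synthesize_image(D, W, Z, sigma_e=0.01, rng=np.random.default_rng(0)):
    # Single-layer convolutional synthesis, a sketch of Eq. (1):
    #   X = sum_k conv2(D[k], W[k] elementwise-times Z[k]) + E,
    # with Z[k] binary, W[k] real-valued, and E a Gaussian residual.
    K = len(D)
    # 'full' convolution: weight maps smaller than the image expand to the image size
    X = sum(convolve2d(W[k] * Z[k], D[k], mode='full') for k in range(K))
    return X + sigma_e * rng.standard_normal(X.shape)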

Let and represent elements of and , respectively. Within a Bayesian construction, the priors for the model may be represented as (Paisley & Carin, 2009):

(2)
(3)
(4)

where , denotes the gamma distribution, represents the identity matrix, and are hyperparameters, for which default settings are discussed in Paisley & Carin (2009); Chen et al. (2011; 2013). While the model may look somewhat complicated, local conjugacy admits Gibbs sampling or variational Bayes inference (Chen et al., 2011; 2013).
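For orientation, the beta-Bernoulli construction of Paisley & Carin (2009); Chen et al. (2011) on which (2)-(4) build has the following generic form (written here only as an illustration of the type of priors used; the precise symbols and hyperparameter placement follow the cited works):

z_{i,k}^{(n)} \sim \mathrm{Bernoulli}(\pi_k), \qquad \pi_k \sim \mathrm{Beta}\big(a/K,\; b(K-1)/K\big),
w_{i,k}^{(n)} \sim \mathcal{N}(0, \gamma_w^{-1}), \qquad \gamma_w \sim \mathrm{Gamma}(c, d),
\mathbf{e}^{(n)} \sim \mathcal{N}(\mathbf{0}, \gamma_e^{-1}\mathbf{I}), \qquad \gamma_e \sim \mathrm{Gamma}(e, f).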

In Chen et al. (2011; 2013) a deep model was developed based on (1), by using as the input of the layer above. In order to do this, a pooling operation (e.g., the max-pooling used in Chen et al. (2013)) is employed, reducing the feature dimension as one moves to higher layers. However, the model was learned by stacking layers upon each other, without subsequent overall refinement. This was because use of deterministic max pooling undermined development of a proper top-down generative model that coupled all layers; therefore, in Chen et al. (2013) the model in (1) was used sequentially from bottom-up, but the overall model parameters were never coupled when learning. To tackle this, we propose a probabilistic pooling procedure, yielding a top-down deep generative statistical structure, coupling all parameters when performing learning. As discussed when presenting results, this joint learning of all layers plays a critical role in improving model performance. The stochastic pooling applied here is closely related to that in Zeiler & Fergus (2013); Lee et al. (2009).

2.2 Pretraining & Stochastic Pooling

Parameters of the deep model are learned by first analyzing one layer of the model at a time, starting at the bottom layer (touching the data), and sequentially stacking layers. The parameters of each layer of the model are learned separately, conditioned on parameters of the layers learned thus far (like in Chen et al. (2011; 2013)). The parameters learned in this manner serve as initializations for the top-down refinement step, discussed in Sec. 2.3, in which parameters at all layers of the deep model are learned jointly.

Assume an -layer model, with layer the top layer, and layer 1 at the bottom, closest to the data. In the pretraining stage, the output of layer is the input to layer , after pooling. Layer has dictionary elements, and we have:

(5)
(6)

The expression is a 2D (spatial) activation map, for image , model layer , dictionary element . The expression may be viewed as a 3D entity, with its -th plane defined by a “pooled” version of (pooling discussed next). The dictionary elements and residual are also three dimensional (each 2D plane of and is the spatial-dependent structure of the corresponding features), and the convolution is performed in the 2D spatial domain, simultaneously for each layer of the feature map.

We now discuss the relationship between and layer of . The 2D activation map is partitioned into dimensional contiguous blocks (pooling blocks with respect to layer of the model); see the left part of Figure 1. Associated with each block of pixels in is one pixel at layer of ; the relative locations of the pixels in are the same as the relative locations of the blocks in . Within each block of , either all pixels are zero, or only one pixel is non-zero, with the position of that pixel selected stochastically via a multinomial distribution. Each pixel at layer of equals the largest-amplitude element in the associated block of (i.e., max pooling). Hence, if all elements of a block of are zero, the corresponding pixel in is also zero. If a block of has a (single) non-zero element, that non-zero element is the corresponding pixel value at the -th layer of .

The bottom-up generative process for each block of proceeds as follows (left part of Figure 1). The model first imposes that a given block of is either all zero or has one non-zero element, and this binary question is modeled as the beta-Bernoulli representation of (6). If a given block has a non-zero value, the position of that value in the associated block is defined by a multinomial distribution, and its value is modeled as represented in (6). The beta-Bernoulli step, followed by multinomial, are combined into one equivalent statistical representation, as discussed next.

Figure 1: Schematic of the proposed generative process. Left: bottom-up pretraining; right: top-down refinement. (Zoom in for best visualization; a larger version can be found in the Supplementary Material.)

Let denote the -th block of at layer , where integer division is assumed. We introduce a latent variable to enforce that at most one element among the entries in is non-zero, through

(7)

where and denote multinomial and Dirichlet distribution, respectively (the Dirichlet distribution has a set of parameters, and here we imply that are equal, and set to the value indicated in ). has entries, of which only one is equal to 1. If the last element is 1, this means all . Since the -th block at layer corresponds to one element at layer , we have

(8)

Hence, if the last element of is 1, all elements of block are zero; if not, the location of the non-zero element in the first entries of locates the position of the non-zero element in the corresponding block. The remaining parts of the model are represented as in (6).
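The sketch below illustrates (7)-(8) for a single pooling block: a one-hot draw over (block size + 1) outcomes either selects the position of the single non-zero weight or declares the whole block zero. The symmetric probabilities theta are used only for illustration; in the model they are Dirichlet-distributed and inferred.

import numpy as np

def sample_pooling_block(w_value, block=(3, 3), theta=None, rng=np.random.default_rng(0)):
    # One-hot latent variable z over (block size + 1) outcomes; the last outcome
    # means the whole block is zero, otherwise its index places the single
    # non-zero weight w_value inside the block, and that value is pooled upward.
    py, px = block
    if theta is None:
        theta = np.ones(py * px + 1) / (py * px + 1)   # illustrative symmetric probabilities
    z = rng.multinomial(1, theta)
    idx = int(np.argmax(z))
    block_vals = np.zeros(py * px)
    if idx < py * px:                                  # non-zero block
        block_vals[idx] = w_value
        pooled = w_value                               # the single non-zero element is passed up
    else:                                              # all-zero block
        pooled = 0.0
    return block_vals.reshape(py, px), pooled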

In the pretraining phase, we start with , which is the data . We learn using the blocked activation weights, via Gibbs sampling, where the multinomial distribution associates each non-zero element with a position in the corresponding block. The MAP Gibbs sample is then selected, defining model parameters for the layer under analysis. The “stacked” and pooled are used to define , and the learning procedure then continues, learning dictionary elements and activation maps , again via Gibbs sampling and MAP selection. This continues sequentially up to the -th, or top, layer. For the top layer, since no pooling is necessary, the beta-Bernoulli prior in (2) is used.
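The layer-wise pretraining procedure can be summarized with the following sketch; gibbs_learn_layer and pool are hypothetical callables standing in for the Gibbs sampler of Sec. 3 and the stochastic pooling of Sec. 2.2, not functions defined in the paper.

def pretrain(images, num_layers, pool_sizes, gibbs_learn_layer, pool):
    # Bottom-up, layer-wise pretraining (sketch): learn each layer from the pooled
    # activations of the layer below, keeping the MAP (highest-likelihood) Gibbs sample.
    layer_input, params = images, []
    for l in range(num_layers):
        D, W, Z = gibbs_learn_layer(layer_input)        # MAP sample after burn-in
        params.append((D, W, Z))
        if l < num_layers - 1:                          # top layer uses the beta-Bernoulli prior, no pooling
            layer_input = pool([w * z for w, z in zip(W, Z)], pool_sizes[l])
    return params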

2.3 Model Refinement With Stochastic Pooling

The learning performed with the top-down generative model (right part of Fig. 1) constitutes a refinement of the parameters learned during pretraining, and the excellent initialization constituted by the parameters learned during pretraining is key to the subsequent model performance.

In the refinement phase, the equations are (almost) the same, but we now proceed top down, from (5) to (6). The generative process constitutes and , and after convolution is manifested; the residual is now absent at all layers except layer , at which the fit to the data is performed. Each element of has an associated pooling block in . Via a multinomial distribution, as in pretraining, each element of is mapped to one position in the corresponding block of , and all other elements in that block are set to zero. Since is manifested top-down as a convolution of and , it will in general have no elements exactly equal to zero (but many will be small, based on the pretraining). Hence, each block of will have one non-zero element, with position defined by the multinomial. (We also considered a model exactly as in pretraining, in which the pooling step could map a pixel in , via the multinomial, to an all-zero activation block in layer ; the results are essentially unchanged from the method discussed above.)

During pretraining many blocks of are all-zero, since the prior encourages a sparse representation; during refinement this sparsity requirement is relaxed, and in general each pooling block of has one non-zero element (the representation is still sparse), whose value is mapped via pooling to the corresponding pixel in . In pretraining the Dirichlet and multinomial distributions are of size , allowing the all-zero activation block; during refinement the multinomial and Dirichlet are of dimension . The corresponding Dirichlet and multinomial parameters from pretraining are used as initializations for refinement.

2.4 Top-Level Features and Testing

In order to understand deep convolutional models, researchers have visualized dictionary elements mapped to the image level (Zeiler & Fergus, 2014). One key challenge of this visualization is that one dictionary element at high layers can have multiple representations at the layer below, given different activations in each pooling block (in our model, this is manifested by the stochasticity associated with the multinomial-based pooling). Zeiler & Fergus (2014) showed different versions of the same upper-layer dictionary element at the image level. Because of this capability of accurate dictionary localization at each layer, deep convolutional models perform well in classification. However, also due to these multiple representations, during testing, one has to infer dictionary activations layer by layer (via deconvolution), which is computationally expensive. In order to alleviate this issue, Kavukcuoglu et al. (2010) proposed an approximation method using convolutional filter banks (fast because there is no explicit deconvolution) followed by a nonlinear function. Though efficient at test time, in the training step one must simultaneously learn deconvolutional dictionaries and associated filterbanks, and the choice of non-linear function is critical to the performance of the model. Moreover, in the context of the framework proposed here, it is difficult to integrate the approach of Kavukcuoglu et al. (2008; 2010) into a Bayesian model.

We propose a new approach to accelerate testing. After performing model learning (after refinement), we project top-layer dictionary elements down to the data plane. At test, deconvolution is only performed once, using the top-layer dictionary elements mapped to the data plane. The top-layer activation strengths inferred via this deconvolution are then used in a subsequent classifier. The different manifestations of a top-layer dictionary element mapped to the data plane are constituted by different (stochastic) pooling mappings via the multinomial. To select top-layer dictionary elements in the data plane, used for test, we employ maximum-likelihood (ML) dictionary elements, with ML performed across the different choices of the max pooling at each layer. Hence, after this ML-based top-layer dictionary selection, a pixel at layer is mapped to the same location in the associated layer block, for all convolutional shifts (same max-pooling map for all shifts at a given layer). Hence, the key approximation is that the stochastic pooling employed for each pixel at layer to a position in a block at layer is replaced by an ML-based deterministic pooling (possibly a different deterministic map at each layer). This simple approach has the advantage of Zeiler & Fergus (2014) at test, in that we retain the deconvolution operation (unlike Kavukcuoglu et al. (2010)), but deconvolution must only be performed once (not at each layer). In the experiments presented below, when visualizing inferred dictionary elements in the image plane, this ML-based dictionary selection is employed. More details on this aspect of the model are provided in the Supplementary Material.
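The projection of a top-layer dictionary element to the data plane can be sketched as repeated deterministic (ML) unpooling followed by convolution with the lower-layer dictionaries; the array layout and names below are our own illustrative choices, not the paper's notation.

import numpy as np
from scipy.signal import convolve2d

def project_to_data_plane(top_atom, lower_dicts, block_sizes, ml_positions):
    # top_atom: array (K_l, h, w) -- channels of a top-layer dictionary element.
    # lower_dicts[l]: array (K_l, C_l, hd, wd), C_l = channel count of the layer below (1 at the data plane).
    # ml_positions[l]: (di, dj) -- the maximum-likelihood within-block position used in place of stochastic pooling.
    maps = top_atom
    for l in reversed(range(len(lower_dicts))):
        py, px = block_sizes[l]
        di, dj = ml_positions[l]
        K, H, W = maps.shape
        unpooled = np.zeros((K, H * py, W * px))
        unpooled[:, di::py, dj::px] = maps              # deterministic (ML) unpooling
        D = lower_dicts[l]
        C = D.shape[1]
        out_h = unpooled.shape[1] + D.shape[2] - 1
        out_w = unpooled.shape[2] + D.shape[3] - 1
        maps = np.zeros((C, out_h, out_w))
        for c in range(C):
            for k in range(K):                          # convolve each channel and sum
                maps[c] += convolve2d(unpooled[k], D[k, c], mode='full')
    return maps[0] if maps.shape[0] == 1 else maps      # 2D image for grayscale data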

3 Gibbs-Sampling-Based Learning and Inference

Due to local conjugacy at every component of the model, the local conditional posterior distribution for all parameters of our model is manifested in closed form, yielding efficient Gibbs sampling (see Supplementary Material for details). As in all previous convolutional models of this type, the FFT is leveraged to accelerate computation of the convolution operations, here within Gibbs update equations.

In the pre-training step, we select the ML sample from 500 collection samples, after first computing and discarding 1500 burn-in samples. The same number of burn-in and collection samples, with ML selection, is performed for model refinement. This ML selection of collection samples shares the same spirit as Geman & Geman (1984), in the sense of yielding a MAP solution (not attempting to approximate the full posterior). During testing, we select the ML sample across 200 deconvolutional samples, after first discarding 500 burn-in samples.
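The sampling-with-ML-selection scheme can be summarized as follows; sample_step and log_likelihood are hypothetical stand-ins for the conditional-posterior sweeps and data-fit score detailed in the Supplementary Material, and the defaults mirror the pretraining/refinement setting (testing uses 500 burn-in and 200 collection samples).

def gibbs_map(sample_step, log_likelihood, init_state, burn_in=1500, collect=500):
    # Gibbs sampling with MAP-style sample selection: discard burn-in sweeps,
    # then keep the single highest-likelihood collection sample rather than
    # attempting to approximate the full posterior.
    state = init_state
    for _ in range(burn_in):
        state = sample_step(state)
    best_state, best_ll = None, -float('inf')
    for _ in range(collect):
        state = sample_step(state)
        ll = log_likelihood(state)
        if ll > best_ll:
            best_state, best_ll = state, ll
    return best_state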

4 Experimental Results

We here apply our model to the MNIST and Caltech 101 datasets. We compare dictionaries (viewed in the data plane) before and after refinement. Classification results (average of 10 trials) using top-layer features are presented for both datasets. As in (Paisley & Carin, 2009), the hyperparameters are set as , where is the number of dictionary elements at the corresponding layer, and ; these are standard hyperparameter settings (Paisley & Carin, 2009) for such models, and no tuning or optimization was performed. All code is written in MATLAB and executed on a desktop with a 3.8 GHz CPU and 24 GB of memory. Model training, including refinement, with one class (30 images) of Caltech 101 takes about 40 CPU minutes, and testing (deconvolution) for one image takes less than 1 second. These results were run on a single computer, for demonstration; acceleration via parallel implementation, GPUs (Krizhevsky et al., 2012), and coding in C will be considered in the future, and the successes realized recently in accelerating convolution-based models of this type are transferable to our model.

MNIST Dataset

Methods Test error
DBN Hinton & Salakhutdinov (2006) 1.20%
CBDN Lee et al. (2009) 0.82%
0.53%
0.35%
MCDNN Ciresan et al. (2012) 0.23%
SPCNN Zeiler & Fergus (2013)
Average Pooling 0.83%
Max Pooling 0.55%
Stochastic Pooling 0.47%
MCMC (10000 Training) 0.89%
Batch VB (10000 Training) 0.95%
online VB (60000 Training) 0.96%
Ours, 2-layer model + 1-layer features
60000 Training 0.42%
10000 Training 0.68%
5000 Training 1.02%
2000 Training 1.11%
1000 Training 1.66%
Table 1: Classification Error of MNIST data

We first consider the widely studied MNIST data (http://yann.lecun.com/exdb/mnist/), which has 60,000 training and 10,000 testing images, each , for digits 0 through 9. A two-layer model is used with dictionary size () at the first layer and at the second layer; the pooling size is () and the number of dictionary elements at layers 1 and 2 are and , respectively. We obtained these numbers of dictionary elements by setting the initial number to a relatively large value in the pretraining step and discarding infrequently used elements by counting the corresponding binary indicators, i.e., inferring the number of needed dictionary elements, as in Chen et al. (2013).
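The element-selection step described above amounts to pruning atoms whose binary indicators are rarely active; a minimal sketch (array layout and the usage threshold are illustrative assumptions):

import numpy as np

def prune_dictionary(D, Z, min_usage=0.01):
    # D: list of K dictionary atoms; Z: array (N, K, H, W) of binary activation
    # indicators across N images. Keep an atom if its average activation rate
    # across images and spatial positions exceeds min_usage.
    usage = Z.reshape(Z.shape[0], Z.shape[1], -1).mean(axis=(0, 2))  # per-atom activation rate
    keep = usage > min_usage
    return [d for d, k in zip(D, keep) if k], keep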

Table 1 summarizes the classification results of our model compared with related results on the MNIST data. The second (top) layer features corresponding to the refined dictionary are sent to a nonlinear support vector machine (SVM) (Chang & Lin, 2011) with Gaussian kernel, in a one-vs-all multi-class classifier, with classifier parameters tuned via 5-fold cross-validation (no tuning on the deep feature learning). Rather than concatenating features at all layers as in Zeiler & Fergus (2013); Chen et al. (2013), we only use the top-layer features as the input to the SVM (deconvolution is only performed with top-layer dictionary elements), which saves much computation time (as well as memory) in both inference and classification, since the feature size is small. When the model is trained using all 60000 digits, we achieve an error rate of 0.42% on testing, which is very close to the state of the art, but with a relatively simple model compared to Ciresan et al. (2012); the error rate obtained using features learned after pretraining, before refinement, is similar to that in Chen et al. (2013) ( error), underscoring the importance of the refinement step. We further plot the testing error in Fig. 2(c) (bottom part) when the training size is reduced, compared to the results reported in Zeiler & Fergus (2013). It can be seen that our model outperforms every approach in Zeiler & Fergus (2013).
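The classification protocol described above can be sketched as follows with scikit-learn (the paper uses LIBSVM; the grid values here are illustrative assumptions):

from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def train_classifier(features, labels):
    # Gaussian-kernel SVM on top-layer features, one-vs-all multi-class,
    # with (C, gamma) tuned by 5-fold cross-validation.
    grid = {'estimator__C': [1, 10, 100], 'estimator__gamma': ['scale', 1e-2, 1e-3]}
    clf = GridSearchCV(OneVsRestClassifier(SVC(kernel='rbf')), grid, cv=5)
    clf.fit(features, labels)
    return clf.best_estimator_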

Figure 2: (a) Visualization of the dictionary learned by the proposed model; note the refined dictionary (right) is much sharper than the dictionary before refinement (middle). (b) Missing-data interpolation of digits. (c) Upper part: a more challenging case of missing-data interpolation of digits. Bottom part: testing error when training with reduced dataset sizes on MNIST.

In order to examine the properties of the learned model, in Fig. 2(a) we visualize trained dictionaries at layer 2 mapped down to the data level. It is observed qualitatively that refinement improves the dictionary; the atoms after refinement are much sharper. If the average pooling described in Zeiler & Fergus (2013) is used, the dictionaries are blurry (middle-left part of Fig. 2(a)). When a threshold is imposed on the refined dictionary elements, they look like digits (rightmost part).

To further verify the efficacy of our model, we show in Fig. 2(b) the interpolation results of digits with half missing, as in Lee et al. (2009). A one-layer model cannot recover the digits, while a two-layer model provides a good recovery (bottom row of Fig. 2(b)). Furthermore, by using our refinement approach, the recovery is much clearer (compare the bottom-left and bottom-middle parts of Fig. 2(b)). Given this excellent performance, more challenging interpolation results are shown in Fig. 2(c) (upper part), where we cannot identify any digits from the observations; even in this case, the model provides promising reconstructions.

Figure 3: Dictionary elements in each layer trained with 64 “face easy” images from Caltech 101.
Figure 4: Face data interpolation using a 2-layer model. From left to right: truth, observed data, layer-1 recovery, layer-2 recovery.

Caltech 101 Dataset

We next consider the Caltech 101 dataset. First we analyze our model with images in the “easy face” category; 64 images (after local contrast normalization (Jarrett et al., 2009)) have been resized to and a three-layer deep model is used. At layers 1, 2 and 3, the number of dictionary elements is set respectively to , and (these are inferred in the pretraining step, as discussed above), with dictionary sizes , and . The pooling sizes are (layer 1 to layer 2) and (layer 2 to layer 3). Example learned dictionary elements are mapped to the image level and shown in Fig. 3. It can be seen that the first-layer dictionary extracts edges of the images, while the second-layer dictionary elements look like parts of a face and the third-layer elements are almost entire faces. We can see the improvement manifested by refinement by comparing the right two parts in Fig. 3 (the dictionaries after refinement are sharper). Similar to the MNIST example, we also show in Fig. 4 the interpolation results of face data with half missing, using a two-layer model (the dictionary sizes are and at layers 1 and 2, respectively, with max-pooling size ). It can be seen that the missing parts are recovered progressively more accurately as one moves from a one-layer to a two-layer model. Though the background is a little noisy, each face is recovered in great detail by the second-layer dictionary (a three-layer model gives similar results, omitted here for brevity).

Figure 5: Trained dictionaries per class mapped to the data plane. Row 1-2: nautilus, revolver. Column 1-4: training images after local contrast normalization, layer-1 dictionary, layer-2 dictionary, layer-3 dictionary.

We develop Caltech 101 dictionaries by learning on each data class in isolation, and then concatenate all (top-layer) dictionaries when learning the classifier. In Figure 5 we depict dictionary elements learned for two data classes, projected to the image level (more results are shown in the Supplementary Material). It can be seen that the layer-1 dictionary elements are similar for the two data classes, while the upper-layer dictionary elements are data-class dependent. One problem of this parallel training is that the dictionary may be redundant across image classes (especially at the first layer). However, during testing, using the proposed approach, we only use top-layer dictionaries, which are typically distinct across data classes (for the data considered).

# Training Images per Category 15 30
DN Zeiler et al. (2010) 58.6 % 66.9%
CBDN Lee et al. (2009) 57.7 % 65.4%
HBP  Chen et al. (2013) 58% 65.7%
ScSPM  Yang et al. (2009) 67 % 73.2%
P-FV  Seidenari et al. (2014) 71.47% 80.13%
R-KSVD  Li et al. (2013) 79 % 83%
Convnet Zeiler & Fergus (2014) 83.8 % 86.5%
Ours, 2-layer model + 1-layer features 70.02% 80.31%
Ours, 3-layer model + 1-layer features 75.24% 82.78%
Table 2: Classification Accuracy Rate of Caltech-101.

For Caltech 101 classification, we follow the setup in Yang et al. (2009), selecting 15 and 30 images per category for training, and testing on the rest. The features of testing images are inferred based on the top-layer dictionaries and sent to a multi-class SVM; we again use a Gaussian-kernel nonlinear SVM with parameters tuned via cross-validation. Our results and related results are summarized in Table 2. For our model, we present results based on 2-layer and 3-layer models. It can be seen that our model (the 3-layer one) provides results close to the state of the art in Zeiler & Fergus (2014), which used a much more complicated model (a 7-layer convolutional network pretrained on the ImageNet dataset), and our results are also very close to the state-of-the-art results using hand-crafted features (e.g., SIFT in Li et al. (2013)). Based on features learned by our model at the pretraining stage, our classification performance is comparable to that of the HBP model in Chen et al. (2013) (around 65% accuracy for a 2-layer model, when training with 30 examples per class), with our results demonstrating a 17% improvement in performance after model refinement.

5 Conclusions

A deep generative convolutional dictionary-learning model has been developed within a Bayesian setting, with efficient Gibbs-sampling-based MAP parameter estimation. The proposed framework enjoys efficient bottom-up and top-down probabilistic inference. A probabilistic pooling module has been integrated into the model, a key component to developing a principled top-down generative model, with efficient learning and inference. Extensive experimental results demonstrate the efficacy of the model to learn multi-layered features from images. A novel method has been developed to project the high-layer dictionary elements to the image level, and efficient single-layer deconvolutional inference is accomplished during testing. On the MNIST and Caltech 101 datasets, our results are very near the state of the art, but with relatively simple model complexity at test. Future work includes performing deep feature learning and classifier design jointly. The algorithm will also be ported to a GPU-based implementation, allowing scaling to large-scale datasets.

References

  • Chang & Lin (2011) Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011.
  • Chen et al. (2011) Chen, B., Polatkan, G., Sapiro, G., Carin, L., and Dunson, D. B. The hierarchical beta process for convolutional factor analysis and deep learning. In ICML, 2011.
  • Chen et al. (2013) Chen, B., Polatkan, G., Sapiro, G., Blei, D., Dunson, D., and Carin, L. Deep learning with hierarchical convolutional factor analysis. IEEE T-PAMI, 2013.
  • Ciresan et al. (2012) Ciresan, D., Meier, U., and Schmidhuber, J. Multi-column deep neural networks for image classification. In CVPR, 2012.
  • Ciresan et al. (2011) Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. Flexible, high performance convolutional neural networks for image classification. IJCAI, 2011.
  • Geman & Geman (1984) Geman, S. and Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE T-PAMI, 1984.
  • Hinton & Salakhutdinov (2006) Hinton, G. and Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science, 2006.
  • Jarrett et al. (2009) Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. What is the best multi-stage architecture for object recognition? ICCV, 2009.
  • Kavukcuoglu et al. (2008) Kavukcuoglu, K., Ranzato, M., and LeCun, Y. Fast inference in sparse coding algorithms with applications to object recognition. arXiv:1010.3467, 2008.
  • Kavukcuoglu et al. (2010) Kavukcuoglu, K., Sermanet, P., Boureau, Y-L., Gregor, K., Mathieu, M., and LeCun, Y. Learning convolutional feature hierarchies for visual recognition. NIPS, 2010.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • Lecun et al. (1998) Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998.
  • Lee et al. (2009) Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. ICML, 2009.
  • Li et al. (2013) Li, Q., Zhang, H., Guo, J., Bhanu, B., and An, L. Reference-based scheme combined with K-SVD for scene image categorization. IEEE Signal Processing Letters, 2013.
  • Paisley & Carin (2009) Paisley, J. and Carin, L. Nonparametric factor analysis with beta process priors. In ICML, 2009.
  • Seidenari et al. (2014) Seidenari, L., Serra, G., Bagdanov, A., and Del Bimbo, A. Local pyramidal descriptors for image recognition. IEEE T-PAMI, 2014.
  • Yang et al. (2009) Yang, J., Yu, K., Gong, Y., and Huang, T. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
  • Zeiler & Fergus (2013) Zeiler, M. and Fergus, R. Stochastic pooling for regularization of deep convolutional neural networks. ICLR, 2013.
  • Zeiler & Fergus (2014) Zeiler, M. and Fergus, R. Visualizing and understanding convolutional networks. ECCV, 2014.
  • Zeiler et al. (2010) Zeiler, M., Krishnan, D., Taylor, G., and Fergus, R. Deconvolutional networks. CVPR, 2010.

Supplementary Material

Appendix A Conditional Posteriori Distributions for Gibbs Sampling

In the -th layer, the model can be written as:

(9)

For simplification, we define the following symbols (operations):

(10)
(11)
(12)
(13)
(14)

The symbol is the element-wise product operator and is the element-wise division operator.

means if and , with element

(15)

For each MCMC iteration, the samples are drawn from:

  • Sample :

    (16)
    (17)
    (18)
  • Sample :

    (19)
  • Sample :

    (20)
    (21)
    (22)
  • Sample

    (23)
  • Sample :

    Let , , ; , ; we can see that and are in one-to-one correspondence. From

    (24)

    and

    (25)

    we have

    (26)
  • Sample :

    (27)
    (28)
    (29)
  • Sample :

    (30)

Appendix B Projection of Dictionaries to the Data Layer

b.1 Notation

Assume and . Here are the pooling ratio and the pooling map is . In the block of and , there is at most one non-zero element, where , . Now, let ; then the following pooling and unpooling functions can be defined (a generic code sketch of such pooling/unpooling maps is given after the list):

  1. Define , with . Recall that within each pooling block, has at most one non-zero element, and therefore

    (31)

    The following is an example to demonstrate :

  2. Define , with

    (32)

    The following is an example to demonstrate :

  3. Define , with

    (33)

    The following is an example to demonstrate ,with :

  4. Define , with :

    (34)

    The following is an example to demonstrate with :
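As a generic illustration of pooling/unpooling maps of this kind, the sketch below assumes max-amplitude pooling over non-overlapping blocks and an unpooling that restores each value to its recorded within-block position; these assumptions and names are ours, intended only to convey the structure of the operators above.

import numpy as np

def max_pool(A, block=(3, 3)):
    # Pool each non-overlapping block to its largest-amplitude element,
    # returning the pooled map and the within-block index of that element.
    py, px = block
    H, W = A.shape
    pooled = np.zeros((H // py, W // px))
    index = np.zeros((H // py, W // px), dtype=int)
    for i in range(H // py):
        for j in range(W // px):
            blk = A[i * py:(i + 1) * py, j * px:(j + 1) * px]
            k = int(np.abs(blk).argmax())
            pooled[i, j] = blk.flat[k]
            index[i, j] = k
    return pooled, index

def unpool(pooled, index, block=(3, 3)):
    # Inverse map: place each pooled value back at its recorded block position.
    py, px = block
    H, W = pooled.shape
    A = np.zeros((H * py, W * px))
    for i in range(H):
        for j in range(W):
            di, dj = divmod(int(index[i, j]), px)
            A[i * py + di, j * px + dj] = pooled[i, j]
    return A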

b.2 Some Useful Lemmas

Lemma 1.

Lemma 2.

Lemma 3.

Lemma 4.

Lemma 5.

The first three lemmas are straightforward. We now provide the proofs of Lemma 4 and Lemma 5.
Lemma 4 proof:
Recall that the convolution operator means if and , then the element is given by

(35)

where

(36)

Let , , and , . We want to prove that . Deriving elementwise we have

(37)

Since if and , then , and if not .

Lemma 5 proof:
Let