Adaptive Network Sparsification via Dependent Variational Beta-Bernoulli Dropout


Juho Lee, Saehoon Kim , Jaehong Yoon, Hae Beom Lee,
Eunho Yang, Sung Ju Hwang
UNIST, Ulsan, South Korea, KAIST, Daejeon, South Korea,
AITRICS, Seoul, South Korea, University of Oxford, Oxford, United Kingdom
juho.lee@stats.ox.ac.uk, shkim@aitrics.com, jaehong.yoon@kaist.ac.kr,
hblee@unist.ac.kr, eunhoy@kaist.ac.kr, sjhwang82@kaist.ac.kr
Abstract

While variational dropout approaches have been shown to be effective for network sparsification, they are still suboptimal in the sense that they set the dropout rate for each neuron without considering the input data. With such input-independent dropout, each neuron evolves to be generic across inputs, which makes it difficult to sparsify networks without accuracy loss. To overcome this limitation, we propose adaptive variational dropout whose probabilities are drawn from a sparsity-inducing beta-Bernoulli prior. It allows each neuron to evolve to be either generic or specific to certain inputs, or to be dropped altogether. Such input-adaptive sparsity-inducing dropout allows the resulting network to tolerate a larger degree of sparsity without losing its expressive power, by removing redundancies among features. We validate our dependent variational beta-Bernoulli dropout on multiple public datasets, on which it obtains significantly more compact networks than baseline methods, with consistent accuracy improvements over the base networks.

 


Preprint. Work in progress.

1 Introduction

One of the main obstacles in applying deep learning to large-scale problems and low-power computing systems is the large number of network parameters, as it can lead to excessive memory and computational overheads. To tackle this problem, researchers have explored network sparsification methods to remove unnecessary connections in a network, which is implementable either by weight pruning han2016deep () or sparsity-inducing regularizations wen2016learning ().

Recently, variational Bayesian approaches have been shown to be useful for network sparsification, outperforming non-Bayesian counterparts. They take a completely different approach from the conventional methods, which use either thresholding or sparsity-inducing norms on parameters, and use the well-known dropout regularization instead. Specifically, these approaches use variational dropout kingma2015variational (), which adds multiplicative stochastic noise to each neuron, as a means of obtaining sparse neural networks. Unnecessary neurons can be removed either by setting the dropout rate individually for each neuron with an unbounded dropout rate molchanov2017variational () or by pruning based on the signal-to-noise ratio neklyudov2017structured ().

While these variational dropout approaches do yield compact networks, they are suboptimal in that the dropout rate for each neuron is learned completely independently of the given input data and labels. With input-independent dropout regularization, each neuron has no choice but to encode generic information for all possible inputs: it does not know what inputs and tasks it will be given at evaluation time, and it is retained at a fixed rate regardless of the input. Obtaining a high degree of sparsity in such a setting is difficult, as dropping any of the neurons results in information loss. For maximal utilization of the network capacity, and thus to obtain a more compact model, each neuron should instead be either irreplaceably generic and used by all tasks, or highly specialized for a task, so that there is minimal redundancy among the learned representations. This goal can be achieved by adaptively setting the dropout probability for each input, such that some of the neurons are retained with high probability only for certain types of inputs and tasks.

To this end, we propose a novel input-dependent variational dropout regularization for data- and task-dependent network sparsification. We first propose beta-Bernoulli dropout, which learns to set the dropout rate for each individual neuron by generating the dropout mask from a beta-Bernoulli prior, and show how to train it using variational inference. This dropout regularization is a proper way of obtaining a Bayesian neural network and also sparsifies the network, since the beta-Bernoulli distribution is a sparsity-inducing prior. Then, we propose dependent beta-Bernoulli dropout, which is an input-dependent version of our variational dropout regularization.

Such adaptive regularization has been utilized for general network regularization by a non-Bayesian and non-sparsity-inducing model ba2013adaptive (); yet, the increased memory and computational overheads that come from learning additional weights for dropout mask generation made it less appealing for generic network regularization. In our case of network sparsification, however, the overheads at training time are more than rewarded by the reduced memory and computational requirements at evaluation time, thanks to the high degree of sparsification obtained in the final output model.

We validate our dependent beta-Bernoulli variational dropout regularizer on multiple public datasets for network sparsification performance and prediction error, on which it obtains significantly more compact networks (on CIFAR-100, memory reduced to 5.0% of the original with a 4.03x speedup; see Table 1) with reduced prediction errors, when compared with both the base network and existing network sparsification methods. Further analysis of the learned dropout probability for each unit reveals that our input-adaptive variational dropout approach generates a clearly distinguishable dropout mask for each task, thus enabling each task to utilize a different set of neurons for its specialization.

Our contribution in this paper is threefold:

  • We propose beta-Bernoulli dropout, a novel dropout regularizer that learns to generate a Bernoulli dropout mask for each neuron with a sparsity-inducing prior, and that obtains a high degree of sparsity without accuracy loss.

  • We further propose dependent beta-Bernoulli dropout, which yields significantly more compact networks than input-independent beta-Bernoulli dropout, and which further performs run-time pruning for even lower computational cost.

  • Our beta-Bernoulli dropout regularizations provide novel ways to implement a sparse Bayesian neural network, and we provide a variational inference framework for learning it.

2 Related Work

Deep neural networks are known to be prone to overfitting, due to their large number of parameters. Dropout srivastava2014dropout () is an effective regularization that helps prevent overfitting by reducing co-adaptations of the units in the networks. During dropout training, the hidden units in the networks are randomly dropped with a fixed probability $p$, which is equivalent to multiplying the units by Bernoulli noise. It was later found that multiplying by Gaussian noise with the same mean and variance as the rescaled Bernoulli noise, $\mathcal{N}(1, \frac{p}{1-p})$, works just as well or even better srivastava2014dropout ().

Dropout regularizations generally treat the dropout rate as a hyperparameter to be tuned, but there have been several studies that aim to automatically determine a proper dropout rate. kingma2015variational () propose to determine the variance of the Gaussian dropout by stochastic gradient variational Bayes. Generalized dropout srinivas2016generalized () places a beta prior on the dropout rate and learns the posterior of the dropout rate through variational Bayes. They showed that by adjusting the hyperparameters of the beta prior, we can obtain several regularization algorithms with different characteristics. Our beta-Bernoulli dropout is similar to one of its special cases, but while they obtain the dropout rates via point estimates and compute the gradients of the binary random variables with biased heuristics, we approximate the posterior distribution of the dropout rate with variational distributions and compute asymptotically unbiased gradients for the binary random variables.

Ba et al. ba2013adaptive () proposed adaptive dropout (StandOut), where the dropout rate for each individual neuron is determined as a function of the inputs. This idea is similar in spirit to our dependent beta-Bernoulli dropout, but they use heuristics to model this function, while we use a proper variational Bayesian approach to obtain the dropout rates. One drawback of their model is the increased memory and computational cost from the additional parameters introduced for dropout mask generation, which is not negligible when the network is large. Our model also requires additional parameters, but with our model the increased cost at training time is rewarded at evaluation time, as it yields a significantly sparser network than the baseline model as an effect of the sparsity-inducing prior.

Recently, there has been growing interest in structure learning or sparsification of deep neural networks. Han et al. han2016deep () proposed a strategy to iteratively prune weak network weights for efficient computations, and Wen et al. wen2016learning () proposed a group sparsity learning algorithm to drop neurons, filters or even residual blocks in deep neural networks. In Bayesian learning, various sparsity-inducing priors have been demonstrated to efficiently prune network weights with little drop in accuracy molchanov2017variational (); louizos2017bayesian (); neklyudov2017structured (); louizos2018learning (). From the nonparametric Bayesian perspective, Feng et al. feng2015learning () proposed an IBP-based algorithm that learns the proper number of channels in convolutional neural networks using the asymptotic small-variance limit approximation of the IBP. While our dropout regularizer is motivated by the IBP, as is this work, our work differs from it in the input-adaptive adjustment of dropout rates, which allows each neuron to specialize into features specific to some subsets of tasks.

3 Backgrounds

3.1 Bayesian Neural Networks and Stochastic Gradient Variational Bayes

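Suppose that we are given a neural network $f(\mathbf{x};\mathbf{W})$ parametrized by $\mathbf{W}$, a training set $\mathcal{D}=\{(\mathbf{x}_n,\mathbf{y}_n)\}_{n=1}^N$, and a likelihood $p(\mathbf{y}\mid f(\mathbf{x};\mathbf{W}))$ chosen according to the problem of interest (e.g., the categorical distribution for a classification task). In Bayesian neural networks, the parameter $\mathbf{W}$ is treated as a random variable drawn from a pre-specified prior distribution $p(\mathbf{W})$, and the goal of learning is to compute the posterior distribution $p(\mathbf{W}\mid\mathcal{D})$:

$$p(\mathbf{W}\mid\mathcal{D}) \propto p(\mathbf{W})\prod_{n=1}^{N} p(\mathbf{y}_n\mid f(\mathbf{x}_n;\mathbf{W})). \tag{1}$$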

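When a novel input $\mathbf{x}_*$ is given, the prediction $\mathbf{y}_*$ is obtained as a distribution, by marginalizing over $\mathbf{W}$ under the posterior $p(\mathbf{W}\mid\mathcal{D})$ as follows:

$$p(\mathbf{y}_*\mid\mathbf{x}_*,\mathcal{D}) = \int p(\mathbf{y}_*\mid f(\mathbf{x}_*;\mathbf{W}))\,p(\mathbf{W}\mid\mathcal{D})\,\mathrm{d}\mathbf{W}. \tag{2}$$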

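Unfortunately, $p(\mathbf{W}\mid\mathcal{D})$ is in general computationally intractable due to computing the marginal likelihood $p(\mathcal{D})$, and thus we resort to approximate inference schemes. Specifically, we use variational Bayes (VB), where we posit a variational distribution $q(\mathbf{W};\boldsymbol{\phi})$ of known parametric form and minimize the KL-divergence between it and the true posterior, $D_{\mathrm{KL}}[q(\mathbf{W};\boldsymbol{\phi})\,\|\,p(\mathbf{W}\mid\mathcal{D})]$. It turns out that minimizing this KL-divergence is equivalent to maximizing the evidence lower bound (ELBO),

$$\mathcal{L}(\boldsymbol{\phi}) = \sum_{n=1}^{N}\mathbb{E}_{q(\mathbf{W};\boldsymbol{\phi})}\big[\log p(\mathbf{y}_n\mid f(\mathbf{x}_n;\mathbf{W}))\big] - D_{\mathrm{KL}}[q(\mathbf{W};\boldsymbol{\phi})\,\|\,p(\mathbf{W})], \tag{3}$$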

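where the first term measures the expected log-likelihood of the dataset w.r.t. $q(\mathbf{W};\boldsymbol{\phi})$, and the second term regularizes $q(\mathbf{W};\boldsymbol{\phi})$ so that it does not deviate too much from the prior. The parameter $\boldsymbol{\phi}$ is learned by gradient descent, but this involves two challenges. First, the expected likelihood is intractable in many cases, and so is its gradient. To resolve this, we assume that $q(\mathbf{W};\boldsymbol{\phi})$ is reparametrizable, so that we can obtain i.i.d. samples from $q(\mathbf{W};\boldsymbol{\phi})$ by computing a differentiable transformation of i.i.d. noise $\boldsymbol{\varepsilon}^{(s)}$ (kingma2014auto, ; rezende2014stochastic, ) as $\mathbf{W}^{(s)} = T(\boldsymbol{\varepsilon}^{(s)};\boldsymbol{\phi})$. Then we can obtain a low-variance unbiased estimator of the gradient, namely

$$\nabla_{\boldsymbol{\phi}}\sum_{n=1}^{N}\mathbb{E}_{q(\mathbf{W};\boldsymbol{\phi})}\big[\log p(\mathbf{y}_n\mid f(\mathbf{x}_n;\mathbf{W}))\big] \approx \sum_{n=1}^{N}\frac{1}{S}\sum_{s=1}^{S}\nabla_{\boldsymbol{\phi}}\log p\big(\mathbf{y}_n\mid f(\mathbf{x}_n;T(\boldsymbol{\varepsilon}^{(s)};\boldsymbol{\phi}))\big). \tag{4}$$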

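The second challenge is that the number of training instances $N$ may be too large, which makes it impractical to compute the summation of all expected log-likelihood terms. To address this challenge, we employ the stochastic gradient descent technique, where we approximate the ELBO $\mathcal{L}(\boldsymbol{\phi})$ with the summation over a uniformly sampled mini-batch $B$,

$$\mathcal{L}(\boldsymbol{\phi}) \approx \frac{N}{|B|}\sum_{n\in B}\mathbb{E}_{q(\mathbf{W};\boldsymbol{\phi})}\big[\log p(\mathbf{y}_n\mid f(\mathbf{x}_n;\mathbf{W}))\big] - D_{\mathrm{KL}}[q(\mathbf{W};\boldsymbol{\phi})\,\|\,p(\mathbf{W})]. \tag{5}$$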

Combining the reparametrization and the mini-batch sampling, we obtain an unbiased estimator of the ELBO gradient $\nabla_{\boldsymbol{\phi}}\mathcal{L}(\boldsymbol{\phi})$ to update $\boldsymbol{\phi}$. This procedure, often referred to as stochastic gradient variational Bayes (SGVB) kingma2014auto (), is guaranteed to converge to a local optimum under proper learning-rate scheduling.
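The unbiasedness of the mini-batch step can be checked on a toy problem. The sketch below is only an illustration, assuming the usual rescaling of the mini-batch sum by N/|B|; the per-example values and the function names are hypothetical and not taken from the paper. It averages the rescaled estimate over every possible mini-batch and compares it with the full-data sum.

```python
from itertools import combinations

def minibatch_estimate(per_example_terms, batch_indices):
    """Estimate the full-data sum from a mini-batch by rescaling with N / |B|."""
    n_total = len(per_example_terms)
    batch_sum = sum(per_example_terms[i] for i in batch_indices)
    return n_total / len(batch_indices) * batch_sum

def average_over_all_batches(per_example_terms, batch_size):
    """Average the estimate over every possible mini-batch of the given size."""
    batches = list(combinations(range(len(per_example_terms)), batch_size))
    total = sum(minibatch_estimate(per_example_terms, b) for b in batches)
    return total / len(batches)

# Hypothetical per-example expected log-likelihood terms (illustrative values only).
terms = [-0.3, -1.2, -0.7, -2.1, -0.05]
full_sum = sum(terms)
avg_estimate = average_over_all_batches(terms, batch_size=2)
```

Because every example appears in the same number of mini-batches, `avg_estimate` agrees with `full_sum` up to floating-point error, which is what makes uniform mini-batch sampling an unbiased approximation of the data term.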

3.2 Latent feature models and Indian Buffet Processes

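In a latent feature model, data $\{\mathbf{x}_n\}_{n=1}^N$ are assumed to be generated as combinations of latent features $\{\mathbf{w}_k\}_{k=1}^K$:

$$\mathbf{x}_n = h\Big(\sum_{k=1}^{K} z_{n,k}\,\mathbf{w}_k\Big), \tag{6}$$

where $z_{n,k}=1$ means that $\mathbf{x}_n$ possesses the $k$-th feature $\mathbf{w}_k$, and $h$ is an arbitrary function.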

The Indian Buffet Process (IBP) griffiths2005infinite () is a generative process of binary matrices with an infinite number of columns. Given $N$ data points, the IBP generates a binary matrix $\mathbf{Z}$ whose $n$-th row encodes the feature indicator $\mathbf{z}_n$. The IBP is suitable as a prior process in latent feature models, since it generates a possibly infinite number of columns and adaptively adjusts the number of features to the given dataset. Hence, with an IBP prior we need not specify the number of features in advance.

One interesting observation is that while it is a marginal of the beta-Bernoulli processes (thaibux2007hierarchical, ), the IBP may also be understood as a limit of the finite-dimensional beta-Bernoulli process. More specifically, the IBP with parameter $\alpha>0$ can be obtained as

$$\pi_k \sim \mathrm{beta}(\alpha/K,\,1), \qquad z_{n,k}\mid\pi_k \sim \mathrm{Ber}(\pi_k), \qquad K\to\infty. \tag{7}$$

This beta-Bernoulli process naturally induces sparsity in the latent feature allocation matrix $\mathbf{Z}$. As $K\to\infty$, the expected number of nonzero entries in $\mathbf{Z}$ converges to $N\alpha$ (griffiths2005infinite, ), where $\alpha$ is a hyperparameter that controls the overall sparsity level of $\mathbf{Z}$.
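The sparsity property can be illustrated numerically. The following is a minimal sketch, assuming the finite model in which each column probability is drawn from beta(α/K, 1) and each entry is Bernoulli given that probability; the function names and the toy sizes are hypothetical. Under that assumption the expected number of nonzero entries is N·K·(α/K)/(α/K + 1), which approaches Nα as K grows.

```python
import random

def expected_nonzeros(n_rows, n_cols, alpha):
    """Exact expected number of ones: N * K * E[pi_k], with E[pi_k] = a / (a + 1), a = alpha / K."""
    a = alpha / n_cols
    return n_rows * n_cols * a / (a + 1.0)

def sample_mask_matrix(n_rows, n_cols, alpha, rng):
    """Draw one N x K binary matrix from the finite beta-Bernoulli model."""
    a = alpha / n_cols
    # beta(a, 1) has CDF pi ** a, so inverse-CDF sampling gives pi = u ** (1 / a).
    probs = [rng.random() ** (1.0 / a) for _ in range(n_cols)]
    return [[1 if rng.random() < p else 0 for p in probs] for _ in range(n_rows)]

rng = random.Random(0)
counts = [sum(map(sum, sample_mask_matrix(10, 50, 3.0, rng))) for _ in range(200)]
empirical_mean = sum(counts) / len(counts)
analytic_mean = expected_nonzeros(10, 50, 3.0)
large_k_limit = expected_nonzeros(10, 10**6, 3.0)  # close to N * alpha = 30
```

With N = 10 and α = 3, the matrix holds about 30 ones in expectation no matter how many columns are available, which is why a large K does not by itself produce a dense mask.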

In this paper, we relate the latent feature model (6) to neural networks with dropout masks. Specifically, the binary random variables $z_{n,k}$ correspond to the dropout indicators, and the features correspond to the inputs or intermediate units in neural networks. From this connection, we can think of a hierarchical Bayesian model where we place the IBP, or finite-dimensional beta-Bernoulli priors, on the binary dropout indicators. We expect that, because the IBP favors sparse models, the resulting neural network will also be sparse.

3.3 Dependent Indian Buffet Processes

One important assumption in the IBP is that features are exchangeable: the distribution is invariant to permutations of the feature assignments. This assumption makes posterior inference convenient, but restricts flexibility when we want to model the dependency of feature assignments on input covariates $\mathbf{x}$, such as times or spatial locations.

To this end, Williamson et al. williamson2010dependent () proposed dependent Indian Buffet processes (dIBP), which triggered a line of follow-up work (zhou2011dependent, ; ren2011kernel, ). These models can be summarized as the following generative process:

$$\boldsymbol{\pi} \sim \mathrm{IBP}(\alpha), \qquad \mathbf{z}\mid\boldsymbol{\pi},\mathbf{x} \sim \mathrm{Ber}(g(\boldsymbol{\pi},\mathbf{x})), \tag{8}$$

where $g$ is an arbitrary function that maps $\mathbf{x}$ and $\boldsymbol{\pi}$ to a probability. In our latent feature interpretation of neural network layers above, the input covariates $\mathbf{x}$ correspond to the inputs or the activations of the previous layer. In other words, we build a data-dependent dropout model where the dropout rates depend on the inputs. In the main contribution section, we explain in detail how we construct these data-dependent dropout layers.

4 Main contribution

4.1 Variational Beta-Bernoulli dropout

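Inspired by the latent-feature model interpretation of layers in neural networks, we propose a Bayesian neural network layer overlaid with binary random masks sampled from the finite-dimensional beta-Bernoulli prior. Specifically, let $\mathbf{W}$ be a parameter of a neural network layer, and let $\mathbf{z}_n\in\{0,1\}^K$ be a binary mask vector to be applied for the $n$-th observation $\mathbf{x}_n$. The dimension of $\mathbf{W}$ need not be equal to $K$. Instead, we may enforce arbitrary group sparsity by sharing the binary masks among multiple elements of $\mathbf{W}$. For instance, let $\mathbf{W}\in\mathbb{R}^{K\times L\times M}$ be a parameter tensor in a convolutional neural network with $K$ channels. To enforce channel-wise sparsity, we introduce $\mathbf{z}_n\in\{0,1\}^K$ of dimension $K$, and the resulting masked parameter $\widetilde{\mathbf{W}}_n$ for the $n$-th observation is given as

$$(\widetilde{\mathbf{W}}_n)_{k,\ell,m} = z_{n,k}\,W_{k,\ell,m}, \tag{9}$$

where $W_{k,\ell,m}$ is the $(k,\ell,m)$-th element of $\mathbf{W}$. From now on, with a slight abuse of notation, we denote this binary mask multiplication as

$$\widetilde{\mathbf{W}}_n = \mathbf{z}_n\otimes\mathbf{W}, \tag{10}$$

with appropriate sharing of binary mask random variables. The generative process of our Bayesian neural network is then described as

$$\mathbf{W}\sim\mathcal{N}(\mathbf{0},\lambda\mathbf{I}), \quad \boldsymbol{\pi}\sim\prod_{k=1}^{K}\mathrm{beta}(\pi_k;\alpha/K,1), \quad \mathbf{z}_n\mid\boldsymbol{\pi}\sim\prod_{k=1}^{K}\mathrm{Ber}(z_{n,k};\pi_k), \quad \widetilde{\mathbf{W}}_n = \mathbf{z}_n\otimes\mathbf{W}. \tag{11}$$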
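The channel-wise sharing of the mask can be made concrete with nested lists. This is a small sketch, assuming a weight tensor indexed as channel × row × column and one binary mask entry per channel; the helper name `apply_channel_mask` and the toy values are illustrative and not from the paper.

```python
def apply_channel_mask(weights, mask):
    """Multiply every element of channel k by the shared mask entry mask[k].

    weights: nested list indexed as weights[k][l][m] (K x L x M).
    mask:    list of K entries in {0, 1}.
    """
    return [[[z * w for w in row] for row in channel]
            for z, channel in zip(mask, weights)]

# Toy tensor with K = 2 channels of size 2 x 2 (illustrative values only).
weights = [[[1.0, 2.0], [3.0, 4.0]],
           [[5.0, 6.0], [7.0, 8.0]]]
masked = apply_channel_mask(weights, mask=[1, 0])
```

A single zero in the mask removes a whole channel, which is what turns per-unit dropout into structured (group) sparsity that can be exploited for actual speedups.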

Note the difference between our model and the model in gal2016dropout (). In gal2016dropout (), only a Gaussian prior is placed on the parameter $\mathbf{W}$, and dropout is applied in the variational distribution $q(\mathbf{W})$ to approximate $p(\mathbf{W}\mid\mathcal{D})$. Our model, on the other hand, includes the binary mask in the prior, and the posterior for the binary masks should also be approximated.

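The goal of the posterior inference is to compute the posterior distribution $p(\mathbf{W},\mathbf{Z},\boldsymbol{\pi}\mid\mathcal{D})$, where $\mathbf{Z}=\{\mathbf{z}_1,\dots,\mathbf{z}_N\}$. We approximate this posterior with a variational distribution of the form

$$q(\mathbf{W},\mathbf{Z},\boldsymbol{\pi}\mid\mathbf{X}) = \delta_{\widehat{\mathbf{W}}}(\mathbf{W})\prod_{k=1}^{K}q(\pi_k)\prod_{n=1}^{N}\prod_{k=1}^{K}q(z_{n,k}\mid\pi_k). \tag{12}$$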

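For $\mathbf{W}$, we use a computationally efficient point estimate to get the single value $\widehat{\mathbf{W}}$, with the weight-decay regularization arising from the zero-mean Gaussian prior. For $\boldsymbol{\pi}$, following nalisnick2017stick (), we use the Kumaraswamy distribution (kumaraswamy1980generalized, ) for $q(\pi_k)$:

$$q(\pi_k;a_k,b_k) = a_k b_k\,\pi_k^{a_k-1}\,(1-\pi_k^{a_k})^{b_k-1}, \tag{13}$$

since it closely resembles the beta distribution and is easily reparametrizable as

$$\pi_k(u;a_k,b_k) = \big(1-u^{1/b_k}\big)^{1/a_k}, \qquad u\sim\mathrm{unif}([0,1]). \tag{14}$$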
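The reparametrization can be checked directly against the Kumaraswamy CDF. The sketch below assumes the standard Kumaraswamy form with CDF 1 − (1 − x^a)^b; the function names and the parameter values are illustrative. Feeding uniform noise u through the transformation yields a sample whose CDF value is 1 − u, which is again uniform, so the transformed noise has the intended distribution.

```python
def kumaraswamy_cdf(x, a, b):
    """CDF of the Kumaraswamy(a, b) distribution on [0, 1]."""
    return 1.0 - (1.0 - x ** a) ** b

def sample_kumaraswamy(a, b, u):
    """Map uniform noise u in (0, 1) to a Kumaraswamy(a, b) sample.

    The map is differentiable in a and b, which is what makes the
    distribution convenient for reparametrized gradient estimates.
    """
    return (1.0 - u ** (1.0 / b)) ** (1.0 / a)

a, b, u = 2.0, 3.0, 0.3          # illustrative parameter and noise values
pi = sample_kumaraswamy(a, b, u)
cdf_at_sample = kumaraswamy_cdf(pi, a, b)   # equals 1 - u up to rounding
```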

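We further assume that $q(z_{n,k}\mid\pi_k) = p(z_{n,k}\mid\pi_k) = \mathrm{Ber}(\pi_k)$. A sample $z_k$ from $q(z_{n,k}\mid\pi_k)$ can be obtained by reparametrization based on continuous relaxation (maddison2017concrete, ; jang2017categorical, ; gal2017concrete, ),

$$z_k = \mathrm{sgm}\bigg(\frac{1}{\tau}\Big(\log\frac{\pi_k}{1-\pi_k} + \log\frac{u}{1-u}\Big)\bigg), \tag{15}$$

where $\tau$ is the temperature of the continuous relaxation, $u\sim\mathrm{unif}([0,1])$, and $\mathrm{sgm}(x) = 1/(1+e^{-x})$. The KL-divergence between the prior and the variational distribution is then obtained in closed form as follows (nalisnick2017stick, ):

$$D_{\mathrm{KL}}[q(\mathbf{Z},\boldsymbol{\pi})\,\|\,p(\mathbf{Z},\boldsymbol{\pi})] = \sum_{k=1}^{K}\bigg\{\frac{a_k-\alpha/K}{a_k}\Big(-\gamma-\Psi(b_k)-\frac{1}{b_k}\Big) + \log\frac{a_k b_k}{\alpha/K} - \frac{b_k-1}{b_k}\bigg\}, \tag{16}$$

where $\gamma$ is the Euler-Mascheroni constant and $\Psi(\cdot)$ is the digamma function.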
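As a sanity check on the closed form, the per-unit term can be evaluated in a few lines. This is a sketch under stated assumptions: it uses the Kumaraswamy-versus-beta(·, 1) KL expression from the stick-breaking VAE literature, with the prior shape α/K passed in as `prior_shape`, and a small hand-written digamma approximation, since the Python standard library does not provide one. All names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329

def digamma(x):
    """Digamma for x > 0 via psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def kl_kumaraswamy_beta(a, b, prior_shape):
    """KL[Kumaraswamy(a, b) || beta(prior_shape, 1)] for a single unit."""
    return ((a - prior_shape) / a * (-EULER_GAMMA - digamma(b) - 1.0 / b)
            + math.log(a * b / prior_shape)
            - (b - 1.0) / b)

# Kumaraswamy(a, 1) coincides with beta(a, 1), so the divergence vanishes.
kl_same = kl_kumaraswamy_beta(0.3, 1.0, 0.3)
# Kumaraswamy(2, 1) against the uniform prior beta(1, 1): log(2) - 1/2.
kl_vs_uniform = kl_kumaraswamy_beta(2.0, 1.0, 1.0)
```

The two special cases have known answers, which makes them convenient regression checks if the expression is reimplemented in an autodiff framework.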

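Having all the above ingredients, we can apply the SGVB framework described in Section 3.1 to optimize the variational parameters $\boldsymbol{\phi} = \{\widehat{\mathbf{W}},\mathbf{a},\mathbf{b}\}$. After training, the prediction for a novel input $\mathbf{x}_*$ is given as

$$p(\mathbf{y}_*\mid\mathbf{x}_*,\mathcal{D}) = \mathbb{E}_{p(\mathbf{z}_*,\boldsymbol{\pi},\mathbf{W}\mid\mathcal{D})}\big[p(\mathbf{y}_*\mid f(\mathbf{x}_*;\mathbf{z}_*\otimes\mathbf{W}))\big] \approx \mathbb{E}_{q(\mathbf{z}_*,\boldsymbol{\pi})}\big[p(\mathbf{y}_*\mid f(\mathbf{x}_*;\mathbf{z}_*\otimes\widehat{\mathbf{W}}))\big], \tag{17}$$

and we found that the following naive approximation works well in practice,

$$p(\mathbf{y}_*\mid\mathbf{x}_*,\mathcal{D}) \approx p\big(\mathbf{y}_*\mid f(\mathbf{x}_*;\mathbb{E}_q[\mathbf{z}_*]\otimes\widehat{\mathbf{W}})\big), \tag{18}$$

where

$$\mathbb{E}_q[z_{*,k}] = \mathbb{E}_{q(\pi_k)}[\pi_k] = \frac{b_k\,\Gamma(1+a_k^{-1})\,\Gamma(b_k)}{\Gamma(1+a_k^{-1}+b_k)}. \tag{19}$$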
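The difference between the training-time sample and the test-time expectation can be sketched as follows. The code assumes the Kumaraswamy mean b·Γ(1 + 1/a)·Γ(b)/Γ(1 + 1/a + b) and a logistic relaxation of the Bernoulli draw; the pruning threshold and all function names are hypothetical choices for illustration, not values reported in the paper.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def relaxed_bernoulli_sample(pi, u, temperature):
    """Training time: continuous relaxation of a Bernoulli(pi) draw from noise u in (0, 1)."""
    logit = math.log(pi) - math.log(1.0 - pi) + math.log(u) - math.log(1.0 - u)
    return sigmoid(logit / temperature)

def expected_keep_probability(a, b):
    """Test time: mean of Kumaraswamy(a, b), computed in log space for stability."""
    log_mean = (math.log(b) + math.lgamma(1.0 + 1.0 / a) + math.lgamma(b)
                - math.lgamma(1.0 + 1.0 / a + b))
    return math.exp(log_mean)

def keep_units(a_params, b_params, threshold):
    """Keep unit k only if its expected mask value reaches the (hypothetical) threshold."""
    return [expected_keep_probability(a, b) >= threshold
            for a, b in zip(a_params, b_params)]

kept = keep_units([1.0, 0.1], [1.0, 5.0], threshold=1e-2)
```

In this toy setting the first unit has expected mask value 0.5 and is kept, while the second has an expected value of roughly 3e-4 and is pruned.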

4.2 Variational Dependent Beta-Bernoulli Dropout

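Now we describe our Bayesian neural network model with an input-dependent beta-Bernoulli dropout prior, constructed as follows:

$$\boldsymbol{\pi}\sim\prod_{k=1}^{K}\mathrm{beta}(\pi_k;\alpha/K,1), \qquad \mathbf{z}_n\mid\boldsymbol{\pi},\mathbf{x}_n\sim\prod_{k=1}^{K}\mathrm{Ber}\big(z_{n,k};\pi_k\,\varphi_k(x_{n,k})\big). \tag{20}$$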

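Here, $\mathbf{x}_n$ is the input to the dropout layer. For convolutional layers, we apply global average pooling to the tensors to get vectorized inputs. In principle, we may introduce another fully connected layer as $(\varphi_1(\mathbf{x}_n),\dots,\varphi_K(\mathbf{x}_n)) = \mathrm{sgm}(\mathbf{V}\mathbf{x}_n+\mathbf{c})$, with additional parameters $\mathbf{V}$ and $\mathbf{c}$, but this is undesirable for network sparsification. Rather than adding the parameters of a fully connected layer, we propose a simple yet effective way to generate input-dependent probabilities with minimal parameters involved. Specifically, we construct each $\varphi_k(x_{n,k})$ independently as follows:

$$\varphi_k(x_{n,k}) = \mathrm{clip}\Big(\gamma_k\frac{x_{n,k}-\mu_k}{\sigma_k}+\beta_k,\ \epsilon\Big), \qquad \mathrm{clip}(x,\epsilon) = \min(1-\epsilon,\max(\epsilon,x)), \tag{21}$$

where $\mu_k$ and $\sigma_k$ are the estimates of the $k$-th components of the mean and standard deviation of the inputs, $\gamma_k$ and $\beta_k$ are scaling and shifting parameters to be learned, and $\epsilon>0$ is some small tolerance to prevent overflow.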

The parameterization in (21) is motivated by batch normalization (ioffe2015batch, ). The intuition behind this construction is as follows. Provided that we have good estimates of $\mu_k$ and $\sigma_k$, the inputs after standardization would be approximately distributed as $\mathcal{N}(0,1)$, so the inputs would be centered around zero, with insignificant values being close to zero or negative. Hence, if we pass them through $\min(1-\epsilon,\max(\epsilon,x))$, the outputs for the insignificant dimensions would be close to zero. However, some inputs may be important for the final classification regardless of their significance. In that case, we expect the corresponding shifting parameter $\beta_k$ to be large. Thus, through $\boldsymbol{\beta}=(\beta_1,\dots,\beta_K)$ we control the overall sparsity of the dropout layer, but we want these parameters to be small unless required, so as to get sparse outcomes. We enforce this by placing a zero-mean Gaussian prior $\mathcal{N}(\mathbf{0},\rho\mathbf{I})$ on $\boldsymbol{\beta}$.
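The construction described above can be sketched in a few lines. The code assumes the form "standardize, scale, shift, then clip into [ε, 1 − ε]"; the function names, the default tolerance, and the toy inputs are illustrative assumptions rather than the authors' implementation.

```python
def clip(value, eps):
    """Clamp a value into the interval [eps, 1 - eps]."""
    return min(1.0 - eps, max(eps, value))

def input_dependent_probability(x, mean, std, scale, shift, eps=1e-3):
    """Standardize the input, apply the learned scale and shift, then clip."""
    return clip(scale * (x - mean) / std + shift, eps)

# An input at the running mean with no shift is treated as insignificant ...
p_insignificant = input_dependent_probability(0.0, mean=0.0, std=1.0, scale=1.0, shift=0.0)
# ... unless a large learned shift keeps the unit regardless of the input value.
p_shifted = input_dependent_probability(0.0, mean=0.0, std=1.0, scale=1.0, shift=0.7)
# A strongly activated input saturates at the upper clipping bound.
p_active = input_dependent_probability(5.0, mean=0.0, std=1.0, scale=1.0, shift=0.0)
```

Only two learned parameters per unit (scale and shift) are needed, in contrast to the K × K weights of an extra fully connected gating layer.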

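The goal of variational inference is hence to learn the posterior distribution $p(\mathbf{W},\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\beta}\mid\mathcal{D})$, and we approximate this with a variational distribution of the form

$$q(\mathbf{W},\mathbf{Z},\boldsymbol{\pi},\boldsymbol{\beta}\mid\mathbf{X}) = \delta_{\widehat{\mathbf{W}}}(\mathbf{W})\prod_{k=1}^{K}q(\pi_k)\,q(\beta_k)\prod_{n=1}^{N}\prod_{k=1}^{K}q(z_{n,k}\mid\pi_k,\mathbf{x}_n), \tag{22}$$

where $q(\pi_k)$ are the same as in beta-Bernoulli dropout, $q(\beta_k)=\mathcal{N}(\eta_k,\kappa_k^2)$, and $q(z_{n,k}\mid\pi_k,\mathbf{x}_n) = p(z_{n,k}\mid\pi_k,\mathbf{x}_n)$. (Footnote 1: In principle, we may introduce an inference network $q(\mathbf{z}\mid\boldsymbol{\pi},\mathbf{x},\mathbf{y})$ and minimize the KL-divergence between $q(\mathbf{z}\mid\boldsymbol{\pi},\mathbf{x},\mathbf{y})$ and $p(\mathbf{z}\mid\boldsymbol{\pi},\mathbf{x})$, but this results in a discrepancy between training and testing when sampling $\mathbf{z}$, and also makes optimization cumbersome. Hence, we chose to simply set them equal. Please refer to sohn2015learning () for a discussion of this.) The KL-divergence is computed as

$$D_{\mathrm{KL}}[q(\mathbf{Z},\boldsymbol{\pi}\mid\mathbf{X})\,\|\,p(\mathbf{Z},\boldsymbol{\pi})] + D_{\mathrm{KL}}[q(\boldsymbol{\beta})\,\|\,p(\boldsymbol{\beta})], \tag{23}$$

where the first term was described for beta-Bernoulli dropout and the second term can be computed analytically.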

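The prediction for a novel input $\mathbf{x}_*$ is done similarly to beta-Bernoulli dropout, with the naive approximation for the expectation:

$$p(\mathbf{y}_*\mid\mathbf{x}_*,\mathcal{D}) \approx p\big(\mathbf{y}_*\mid f(\mathbf{x}_*;\mathbb{E}_q[\mathbf{z}_*]\otimes\widehat{\mathbf{W}})\big), \tag{24}$$

where

$$\mathbb{E}_q[z_{*,k}] = \mathbb{E}_q[\pi_k]\cdot\varphi_k(x_{*,k}). \tag{25}$$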

Two stage pruning scheme

Since $\pi_k \geq \pi_k\,\varphi_k(x_{n,k})$ for all $x_{n,k}$, we expect the resulting network to be sparser than the network pruned only with beta-Bernoulli dropout (only with $\boldsymbol{\pi}$). To achieve this, we propose a two-stage pruning scheme, where we first prune the network with beta-Bernoulli dropout, and then prune the network again with $\varphi_k(x_{n,k})$ while holding the variables $\boldsymbol{\pi}$ fixed. By fixing $\boldsymbol{\pi}$, the resulting network is guaranteed to be at least as sparse as the network before the second pruning. To save memory, once trained, we pre-prune all the units whose $\varphi_k(x_{n,k})$ values were below the threshold for every training instance $\mathbf{x}_n$. We found that this saves memory without any accuracy drop in testing.
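The two-stage scheme can be sketched with plain lists. This illustration makes two assumptions that the text does not pin down: the second stage thresholds the product of the expected input-independent probability and the input-dependent factor, and a unit is pre-pruned only if that product stays below the threshold for every training instance. The threshold and the numbers are hypothetical.

```python
def stage_one_keep(expected_pi, threshold):
    """Stage 1: keep unit k if its input-independent probability reaches the threshold."""
    return [p >= threshold for p in expected_pi]

def stage_two_keep(expected_pi, phi_per_instance, threshold):
    """Stage 2: keep unit k if expected_pi[k] * phi[k] reaches the threshold for some instance.

    phi_per_instance[n][k] is the input-dependent factor of unit k for training instance n.
    """
    keep = []
    for k, p in enumerate(expected_pi):
        largest_phi = max(row[k] for row in phi_per_instance)
        keep.append(p * largest_phi >= threshold)
    return keep

expected_pi = [0.9, 0.5, 0.0005, 0.8]             # illustrative values
phi = [[0.9, 0.0001, 0.9, 0.001],
       [0.7, 0.0002, 0.8, 0.0005]]
first = stage_one_keep(expected_pi, threshold=1e-3)
second = stage_two_keep(expected_pi, phi, threshold=1e-3)
```

Because the input-dependent factor never exceeds one, every unit kept after the second stage was also kept after the first, so the second stage can only remove units.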

5 Experiments

We now compare our beta-Bernoulli dropout (BB) and input-dependent beta-Bernoulli dropout (DBB) to other structure learning/pruning algorithms on several neural networks using benchmark datasets.

Experiment Settings

We followed a common setting to compare pruning algorithms by using LeNet 500-300, LeNet 5-Caffe (https://github.com/BVLC/caffe/blob/master/examples/mnist), and VGG-like (zagoruyko2015torch, ) networks on the MNIST lecun1998gradient (), CIFAR-10, and CIFAR-100 datasets (krizhevsky2009tr, ). We included recent pruning methods for a fair comparison: sparse variational dropout (SVD, molchanov2017variational ()), structured sparsity learning (SSL, wen2016learning ()) and structured Bayesian pruning (SBP, neklyudov2017structured ()). We faithfully tuned all hyperparameters of the baseline methods on a validation set to find a reasonable solution that is well balanced between accuracy and sparsification, while fixing the batch size (100) and the maximum number of epochs (300) to match our experiment setting.

Implementation Details

We pretrained all networks using the standard training procedure before fine-tuning for network sparsification molchanov2017variational (); neklyudov2017structured (). While pruning, we set the learning rate for the weights to 0.1 times that of the variational parameters, as in neklyudov2017structured (). We used Adam (kingma2015adam, ) for optimization of both BB and DBB. For DBB, as mentioned in Section 4.2, we first prune networks with BB, and then prune again with DBB while holding the variational parameters for $\boldsymbol{\pi}$ fixed.

We report all hyperparameters of BB and DBB for reproducing our results. We set the ratio $\alpha/K$ to the same small value for all layers of BB and DBB. In principle, we may fix $K$ to be a large number and tune $\alpha$. However, in network sparsification tasks, $K$ is given as the number of neurons/filters to be pruned. Hence, we chose to set the ratio $\alpha/K$ to be a small number altogether. In the testing phase, we pruned the neurons/filters whose expected dropout mask probabilities are smaller than a fixed threshold (footnote 3: we tried different threshold values, but the difference was insignificant). For the input-dependent dropout, since the number of pruned neurons/filters differs according to the inputs, we report it as the running average over the test data. We fixed the temperature parameter $\tau$ of the concrete distribution and the prior variance of $\boldsymbol{\beta}$ for all experiments.

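LeNet 500-300

| Method | Error (%) | Neurons | Speedup | Memory |
|---|---|---|---|---|
| Original | 1.69 | 784-500-300 | 1.0x | 100.0% |
| SVD molchanov2017variational () | 1.43 | 537-59-31 | 16.2x | 6.18% |
| SSL wen2016learning () | 2.25 | 505-68-17 | 15.3x | 6.55% |
| SBP neklyudov2017structured () | 1.7 | 219-102-40 | 20.5x | 4.87% |
| BB | 1.36 | 264-140-65 | 11.8x | 8.50% |
| DBB | 1.41 | 100-36-34 | 112.4x | 8.21% |

LeNet5-Caffe

| Method | Error (%) | Neurons/Filters | Speedup | Memory |
|---|---|---|---|---|
| Original | 0.81 | 20-50-800-500 | 1.0x | 100.0% |
| SVD molchanov2017variational () | 0.7 | 11-31-263-27 | 2.9x | 3.74% |
| SSL wen2016learning () | 1.12 | 3-17-172-63 | 17.1x | 2.86% |
| SBP neklyudov2017structured () | 0.82 | 13-15-100-50 | 4.5x | 2.4% |
| BB | 0.55 | 15-33-138-64 | 2.0x | 5.07% |
| DBB | 0.59 | 13-31-53-34 | 2.5x | 4.84% |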
Figure 1: Top: Comparison of pruning algorithms on the MNIST dataset. We report the medians of five runs. Speedups are calculated in terms of FLOPs. Bottom left: class-average dropout-mask values for the first layer of LeNet 500-300. Bottom right: correlation coefficients of the class-average dropout-mask values for the four layers in LeNet5-Caffe.
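To give a feel for how a FLOP-based speedup relates to the retained layer widths, the sketch below counts multiply-adds for a stack of dense layers. It is a rough estimate under stated assumptions (bias terms are ignored and the 10-way MNIST output layer is appended), so it should not be expected to reproduce the reported figures exactly, and the authors' precise counting procedure is not specified here. The function name is illustrative.

```python
def dense_multiply_adds(layer_sizes):
    """Multiply-add count of a fully connected network with the given layer widths."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# LeNet 500-300 widths from the table, plus the 10-class output layer (assumption).
original = dense_multiply_adds([784, 500, 300, 10])
pruned = dense_multiply_adds([537, 59, 31, 10])    # widths of the SVD row
estimated_speedup = original / pruned
```

For the SVD row this estimate gives a speedup of about 16.1x, close to the 16.2x reported in the table. The same count does not explain the run-time speedup of DBB, whose retained widths vary per input and are reported as running averages.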

5.1 Experiments on MNIST dataset

We used LeNet 500-300 and LeNet 5-Caffe networks on MNIST for comparison. Following convention, we applied dropout to the inputs of the fully connected layers and right after the convolution for the convolutional layers. The results are summarized in Fig. 1. For both networks, BB achieved the lowest error, a significant improvement over the original networks. DBB produced much sparser networks than those pruned with BB, with only a small drop in accuracy (see memory usage), and further performed run-time pruning, which significantly reduced the number of neurons to consider and yielded large gains in speedup (as much as 112.4x on LeNet 500-300). Note that for LeNet5-Caffe, BB and DBB tend to prune fully-connected layers rather than convolutional layers. This is because the number of dropout masks applied to the fully-connected layers is much larger, which results in a much stronger KL-divergence term. We may resolve this by scaling the KL term for the convolutional layers with a constant, in which case we optimize a lower bound on the original ELBO.

On LeNet 500-300, DBB pruned a large number of neurons in the input layer, because the inputs to this network are simply vectorized pixels. DBB adaptively prunes the regions where there are no pixel values, while other input-independent pruning algorithms prune the general background areas (Fig. 1, bottom left). Notice that in general, DBB tends to prune fewer neurons/filters in the lower layers of the networks and more neurons/filters in the higher layers (Fig. 1). Further, the dropout masks generated by DBB tend to be generic at lower network layers to extract common features, but become class-specific at higher layers to specialize features for class discrimination. This phenomenon is clearly shown in Fig. 1 (bottom right), which plots the correlation between the class-average dropout masks at each layer of LeNet5-Caffe.

5.2 Experiments on CIFAR-10 and CIFAR-100 datasets

We compared the pruning algorithms on a VGG-like network adapted for the CIFAR-10 and CIFAR-100 datasets. For CIFAR-10, we ran the algorithms for 200 epochs with batch size 100, and for CIFAR-100 we ran them for 300 epochs with batch size 100. Table 1 summarizes the performance of each algorithm, where BB and DBB achieved impressive sparsity with improved accuracies. In particular, the network pruned with DBB showed a similar error to the one pruned with BB while using far fewer filters. Further analysis of the filters retained by DBB in Fig. 2 shows that DBB either retains most filters (layer 3) or performs generic pruning (layer 8) at lower layers, while performing diversified pruning at higher layers (layer 15). Further, at layer 15, instances from the same class retained similar filters, while instances from different classes retained different filters.

CIFAR-10

| Method | Error | Filters | Speedup | Memory |
|---|---|---|---|---|
| Original | 7.1% | 64-64-128-128-256-256-256-512-512-512-512-512-512-512-512 | 1.00x | 100.0% |
| SVD molchanov2017variational () | 7.7% | 64-64-128-128-254-218-58-107-91-14-152-79-107-109-473 | 2.00x | 10.5% |
| SSL wen2016learning () | 7.4% | 64-64-128-128-254-219-59-107-92-14-161-103-398-398-492 | 1.97x | 40.5% |
| SBP neklyudov2017structured () | 9.01% | 64-64-128-128-254-221-60-113-97-14-181-124-483-484-493 | 1.95x | 15.7% |
| BB | 6.77% | 51-64-126-125-229-116-44-38-24-13-16-10-28-28-75 | 2.56x | 5.5% |
| DBB | 6.76% | 35-61-108-118-192-83-37-24-15-11-10-9-23-23-52 | 3.51x | 4.2% |

CIFAR-100

| Method | Error | Filters | Speedup | Memory |
|---|---|---|---|---|
| Original | 30.86% | 64-64-128-128-256-256-256-512-512-512-512-512-512-512-512 | 1.0x | 100.0% |
| SVD molchanov2017variational () | 29.81% | 64-64-128-128-256-236-99-187-109-22-133-77-119-123-512 | 1.84x | 13.0% |
| SSL wen2016learning () | 30.76% | 64-64-128-128-255-236-99-194-120-22-273-219-512-512-512 | 1.78x | 42.9% |
| SBP neklyudov2017structured () | 36.95% | 64-64-128-128-255-236-99-194-119-22-102-91-61-63-348 | 1.84x | 12.6% |
| BB | 30.15% | 36-59-103-121-211-142-87-50-19-18-13-10-47-47-150 | 3.04x | 5.8% |
| DBB | 30.22% | 28-53-76-116-191-131-80-37-15-16-10-8-40-39-55 | 4.03x | 5.0% |

Table 1: Classification error and sparsification performance of various pruning methods on the CIFAR-10 and CIFAR-100 datasets. We report the medians of five runs, and calculate the speedup in terms of FLOPs.
Figure 2: Parts of the dropout masks in the 3rd, 8th, and 15th layers of the VGG network for CIFAR-100, for different inputs.

6 Conclusion

We have proposed a novel beta-Bernoulli dropout for network regularization and sparsification, where we learn dropout probabilities for each neuron in either an input-independent or an input-dependent manner. Our beta-Bernoulli dropout learns the distribution of a sparse Bernoulli dropout mask for each neuron in a variational inference framework, in contrast to existing work that learns the distribution of Gaussian multiplicative noise or weights, and obtains significantly more compact networks than those competing approaches. Further, our dependent beta-Bernoulli dropout, which input-adaptively decides which neurons to drop, further improves on the input-independent beta-Bernoulli dropout, both in terms of the size of the final network obtained and run-time computations.

References

  • [1] J. Ba and B. Frey. Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems 26, 2013.
  • [2] J. Feng and T. Darrell. Learning the structure of deep convolutional networks. IEEE International Conference on Computer Vision, 2015.
  • [3] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
  • [4] Y. Gal, J. Hron, and A. Kendall. Concrete dropout. Advances in Neural Information Processing Systems, 2017.
  • [5] T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, 2005.
  • [6] S. Han, H. Mao, and W. J. Dally. Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the International Conference on Learning Representations, 2016.
  • [7] S. Ioffe and C. Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
  • [8] E. Jang, S. Gu, and B. Poole. Categorical reparametrization with Gumbel-softmax. In Proceedings of the International Conference on Learning Representations, 2017.
  • [9] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015.
  • [10] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparametrization trick. In Advances in Neural Information Processing Systems 28, 2015.
  • [11] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2014.
  • [12] A. Krizhevsky and G. E. Hinton. Learning multiple layers of features from tiny images. Technical report, Computer Science Department, University of Toronto, 2009.
  • [13] P. Kumaraswamy. A generalized probability density function for double-bounded random processes. Journal of Hydrology, 1980.
  • [14] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [15] C. Louizos, K. Ullrich, and M. Welling. Bayesian compression for deep learning. Advances in Neural Information Processing Systems, 2017.
  • [16] C. Louizos, M. Welling, and D. P. Kingma. Learning sparse neural networks through regularization. International Conference on Learning Representations, 2018.
  • [17] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: a continuous relaxation of discrete random variables. In Proceedings of the International Conference on Learning Representations, 2017.
  • [18] D. Molchanov, A. Ashukha, and D. Vetrov. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.
  • [19] E. Nalisnick and P. Smyth. Stick-breaking variational autoencoders. In Proceedings of the International Conference on Learning Representations, 2017.
  • [20] K. Neklyudov, D. Molchanov, A. Ashukha, and D. Vetrov. Structured Bayesian pruning via log-normal multiplicative noise. Advances in Neural Information Processing Systems, 2017.
  • [21] L. Ren, Y. Wang, D. B. Dunson, and L. Carin. The kernel beta process. In Advances in Neural Information Processing Systems 24, 2011.
  • [22] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, 2014.
  • [23] K. Sohn, H. Lee, and X. Yan. Learning structured ouput representation using deep conditional generative models. Advances in Neural Information Processing Systems 28, 2015.
  • [24] S. Srinivas and R. V. Babu. Generalized dropout. arXiv preprint arXiv:1611.06791, 2016.
  • [25] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • [26] R. Thibaux and M. I. Jordan. Hierarchical beta processess and the Indian buffet processes. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, 2007.
  • [27] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems 29, 2016.
  • [28] S. Williamson, P. Orbanz, and Z. Ghahramani. Dependent indian buffet processes. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.
  • [29] S. Zagoruyko. 92.45 on CIFAR-10 in Torch. Technical report, 2015.
  • [30] M. Zhou, H. Yang, G. Sapiro, and D. B. Dunson. Dependent hierarchical beta process for image interpolation and denoising. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.