Abstract

We show that implicit filter level sparsity manifests in convolutional neural networks (CNNs) which employ Batch Normalization and ReLU activation, and are trained with adaptive gradient descent techniques and L2 regularization or weight decay. Through an extensive empirical study (anonymous) we hypothesize the mechanism behind the sparsification process, and find surprising links to certain filter sparsification heuristics proposed in the literature. The emergence, and subsequent pruning, of selective features is observed to be one of the contributing mechanisms, leading to feature sparsity on par with or better than certain explicit sparsification / pruning approaches. In this workshop article we summarize our findings, and point out corollaries of selective-feature penalization which could also be employed as heuristics for filter pruning.


 

Implicit Filter Sparsification In Convolutional Neural Networks

 

Dushyant Mehta  Kwang In Kim  Christian Theobalt


Correspondence to: Dushyant Mehta <dmehta@mpi-inf.mpg.de>.
On-Device Machine Learning & Compact Deep Neural Network Representations (ODML-CDNNR) Workshop at International Conference on Machine Learning, Long Beach, California, 2019. Copyright 2019 by the author(s).
Introduction

In this article we discuss the findings from (anonymous) regarding the filter level sparsity which emerges in certain types of feedforward convolutional neural networks. Filter refers to the weights and the nonlinearity associated with a particular feature, acting together as a unit; we use filter and feature interchangeably throughout the document. We particularly focus on the implications of these findings for feature pruning aimed at neural network speed-up.

In networks which employ Batch Normalization and ReLU activation, certain filters are observed, after training, to not activate for any input. Importantly, this sparsity emerges in the presence of regularizers such as L2 and weight decay (WD), which are generally understood to be non sparsity-inducing, and it vanishes when regularization is removed. We experimentally observe the following:

  • The sparsity is much higher when using adaptive flavors of SGD vs. (m)SGD.

  • Adaptive methods see higher sparsity with L2 regularization than with WD. No sparsity emerges in the absence of regularization.

  • In addition to the regularizers, the extent of the emergent sparsity is also influenced by hyperparameters seemingly unrelated to regularization. The sparsity decreases with increasing mini-batch size, decreasing network size and increasing task difficulty.

  • The primary hypothesis that we put forward is that selective features (feature selectivity being the fraction of training exemplars for which a feature produces a max activation less than some threshold) see a disproportionately higher amount of regularization than non-selective ones. This consistently explains how parameters such as mini-batch size, network size, and task difficulty indirectly impact sparsity by affecting feature selectivity.

  • A secondary hypothesis to explain the higher sparsity observed with adaptive methods is that Adam (and possibly other adaptive approaches) learns more selective features. Though there is evidence of highly selective features with Adam, this requires further study.

  • Synthetic experiments show that the interaction of the L2 regularizer with the update equation of adaptive methods causes stronger effective regularization than WD, which can explain the discrepancy in sparsity between L2 and WD; a minimal simulation of this interaction is sketched below.
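
To make this interaction concrete, the following self-contained simulation (ours, for illustration only; all hyperparameter values are arbitrary) tracks a single weight whose task gradient is negligible, as for a filter that rarely activates, under an Adam-style update. With L2 regularization the penalty enters the adaptive moment estimates, whereas decoupled weight decay is applied outside the adaptive update.

```python
# Minimal sketch (not the paper's code): shrinkage of one weight with ~zero task
# gradient under an Adam-style update, with L2 regularization vs. decoupled WD.
import math

def adam_shrink(decoupled_wd, steps=5000, lr=1e-3, lam=1e-4,
                beta1=0.9, beta2=0.999, eps=1e-8):
    theta, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 0.0                        # task gradient is negligible for this feature
        if not decoupled_wd:
            g += lam * theta           # L2: penalty enters the adaptive moments
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
        if decoupled_wd:
            theta -= lr * lam * theta  # WD: applied outside the adaptive update
    return theta

print("Adam + L2 :", adam_shrink(decoupled_wd=False))  # driven close to zero
print("Adam + WD :", adam_shrink(decoupled_wd=True))   # barely shrinks
```

In this simulation the L2 term dominates both moment estimates when the task gradient is near zero, so the normalized update stays close to the learning rate in magnitude, while decoupled WD only shrinks the weight by lr·λ·θ per step. This mirrors the disproportionate penalization of low-gradient (selective) features discussed above.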

Quantifying Feature Sparsity:  Feature sparsity can be measured by per-feature activation and by per-feature scale. For sparsity by activation, the absolute activations for each feature are max pooled over the entire feature plane; if this maximum stays below a small threshold over the entire training corpus, the feature is considered inactive. For sparsity by scale, we consider the scale parameter (γ) of the learned affine transform in the Batch Norm layer, and consider a feature inactive if |γ| falls below a small threshold. Explicitly zeroing the features thus marked inactive does not affect the test error. The thresholds chosen are purposefully conservative, and comparable levels of sparsity are observed for higher activation and scale thresholds.
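
As an illustration of how these two measures could be computed, below is a minimal PyTorch-style sketch (ours, not the paper's code; the helper names are hypothetical, the model is assumed to use Conv2d → BatchNorm2d → ReLU blocks, and the default thresholds are placeholders rather than the exact values used in the study).

```python
# Minimal PyTorch sketch (ours). Assumes `model` uses Conv2d -> BatchNorm2d -> ReLU
# blocks; `layers` maps names to the post-ReLU modules to probe; threshold
# defaults are placeholders, not the exact values used in the paper.
import torch
import torch.nn as nn

@torch.no_grad()
def sparsity_by_scale(model, scale_thresh=1e-3):
    """Fraction of features whose learned BatchNorm scale |gamma| is below threshold."""
    gammas = torch.cat([m.weight.abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    return (gammas < scale_thresh).float().mean().item()

@torch.no_grad()
def sparsity_by_activation(model, layers, train_loader, act_thresh=1e-12, device="cpu"):
    """Fraction of features whose max |activation|, pooled over the feature plane
    and taken over the entire training corpus, is below threshold."""
    running_max = {}
    hooks = []

    def make_hook(name):
        def hook(_module, _inputs, output):
            cur = output.abs().amax(dim=(0, 2, 3))  # per-feature max over batch + spatial dims
            running_max[name] = torch.maximum(running_max.get(name, cur), cur)
        return hook

    for name, module in layers.items():
        hooks.append(module.register_forward_hook(make_hook(name)))
    model.eval()
    for images, _labels in train_loader:
        model(images.to(device))
    for h in hooks:
        h.remove()

    per_feature_max = torch.cat([v.flatten() for v in running_max.values()])
    return (per_feature_max < act_thresh).float().mean().item()
```

For BasicNet, `layers` could, for instance, map C1-C7 to the corresponding post-ReLU modules; filters flagged by either measure can then be zeroed out to verify that the test error is unchanged.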

The implicit sparsification process can remove 70-80% of the convolutional filters from VGG-16 on CIFAR10/100, far exceeding the sparsity levels reported by (li2017pruning), and performs comparably to (liu2017learning) for VGG-11 on ImageNet. Common hyperparameters such as mini-batch size can be used as knobs to control the extent of sparsity, with no changes to the tooling of the traditional NN training pipeline.

We present some of the experimental results, and discuss links between the proposed hypotheses and existing pruning heuristics.

Figure 1: BasicNet: Structure of the basic convolution network studied in this paper. We refer to the convolution layers as C1-7.
Table 1: Convolutional filter sparsity in BasicNet trained on CIFAR10/100 for different combinations of regularization and gradient descent methods. Shown are the % of non-useful / inactive convolution filters, as measured by the maximum activation over the training corpus and by the learned BatchNorm scale falling below conservative thresholds, averaged over 3 runs. The lowest test error per optimizer is highlighted, and sparsity (green) or lack of sparsity (red) for the best and near-best configurations is indicated via text color. L2: L2 regularization; WD: weight decay (adjusted with the same scaling schedule as the learning rate).