We show that implicit filter-level sparsity manifests in convolutional neural networks (CNNs) which employ Batch Normalization and ReLU activation, and are trained with adaptive gradient descent techniques and L2 regularization or weight decay. Through an extensive empirical study (anonymous) we hypothesize the mechanism behind the sparsification process, and find surprising links to certain filter sparsification heuristics proposed in the literature. The emergence, and subsequent pruning, of selective features is observed to be one of the contributing mechanisms, leading to feature sparsity on par with or better than certain explicit sparsification / pruning approaches. In this workshop article we summarize our findings, and point out corollaries of selective-feature penalization which could also be employed as heuristics for filter pruning.
Implicit Filter Sparsification In Convolutional Neural Networks
Dushyant Mehta, Kwang In Kim, Christian Theobalt
On-Device Machine Learning & Compact Deep Neural Network Representations (ODML-CDNNR) Workshop at International Conference on Machine Learning, Long Beach, California, 2019. Copyright 2019 by the author(s).
In this article we discuss the findings from (anonymous) regarding filter level sparsity which emerges in certain types of feedforward convolutional neural networks. Filter refers to the weights and the nonlinearity associated with a particular feature, acting together as a unit. We use filter and feature interchangeably throughout the document. We particularly focus on the implications of the findings on feature pruning for neural network speed up.
In networks which employ Batch Normalization and ReLU activation, certain filters are observed, after training, to not activate for any input. Importantly, the sparsity emerges in the presence of regularizers such as L2 and weight decay (WD), which are in general understood to be non-sparsity-inducing, and the sparsity vanishes when regularization is removed. We experimentally observe the following:
The sparsity is much higher when using adaptive flavors of SGD vs. (m)SGD.
Adaptive methods see higher sparsity with L2 regularization than with WD. No sparsity emerges in the absence of regularization.
In addition to the regularizers, the extent of the emergent sparsity is also influenced by hyperparameters seemingly unrelated to regularization. The sparsity decreases with increasing mini-batch size, decreasing network size and increasing task difficulty.
The primary hypothesis that we put forward is that selective features (feature selectivity being the fraction of training exemplars for which a feature's max activation falls below some threshold) see a disproportionately higher amount of regularization than non-selective ones. This consistently explains how parameters such as mini-batch size, network size, and task difficulty indirectly impact sparsity by affecting feature selectivity.
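The selectivity measure above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation; the function name and the threshold value are illustrative assumptions.

```python
import numpy as np

def feature_selectivity(max_activations, threshold=0.1):
    """Per-feature selectivity: the fraction of training exemplars for
    which a feature's max (spatially pooled) activation falls below a
    threshold. Values near 1 indicate a highly selective feature that
    activates for only a few exemplars.

    max_activations: array of shape (num_exemplars, num_features),
    holding each feature's max activation per training exemplar.
    The threshold of 0.1 is illustrative, not from the paper.
    """
    return (max_activations < threshold).mean(axis=0)
```

For example, a feature whose pooled activation exceeds the threshold on only 1 in 4 exemplars would have selectivity 0.75, while a feature active on every exemplar would have selectivity 0.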
A secondary hypothesis to explain the higher sparsity observed with adaptive methods is that Adam (and possibly other adaptive approaches) learns more selective features. Though there is evidence of highly selective features with Adam, this requires further study.
Synthetic experiments show that the interaction of the L2 regularizer with the update equation in adaptive methods causes stronger regularization than WD. This can explain the discrepancy in sparsity between L2 and WD.
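The distinction can be seen in a single Adam update. In the sketch below (illustrative, not the paper's experimental code), the L2 penalty is folded into the gradient and therefore passes through the adaptive normalization, while decoupled weight decay is applied directly to the weights, AdamW-style. For a parameter whose task gradient is near zero, as for a selective or inactive feature, the second-moment estimate is dominated by the penalty term itself, so the normalized L2 step approaches the full learning rate regardless of the weight's magnitude, whereas the decoupled decay step remains proportional to the (small) weight.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, l2=0.0, wd=0.0):
    """One Adam update step, as a minimal sketch.

    l2: L2 penalty folded into the gradient, so it passes through the
        adaptive normalization (the standard 'Adam + L2' coupling).
    wd: decoupled weight decay applied directly to w, bypassing the
        adaptive normalization.
    """
    g = g + l2 * w                        # L2 enters the moment estimates
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)            # bias-corrected first moment
    v_hat = v / (1 - beta2**t)            # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    w = w - lr * wd * w                   # decoupled decay, un-normalized
    return w, m, v
```

Running one step with zero task gradient shows the coupled L2 penalty shrinking a small weight by roughly the learning rate, while decoupled WD shrinks it by only lr * wd * w, orders of magnitude less for small w.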
Quantifying Feature Sparsity: Feature sparsity can be measured by per-feature activation and by per-feature scale. For sparsity by activation, the absolute activations for each feature are max pooled over the entire feature plane. If this pooled value stays below a small threshold over the entire training corpus, the feature is considered inactive. For sparsity by scale, we consider the scale γ of the learned affine transform in the Batch Norm layer, and consider a feature inactive if its |γ| is below a small threshold. Explicitly zeroing the features thus marked inactive does not affect the test error. The thresholds chosen are purposefully conservative, and comparable levels of sparsity are observed at higher activation and scale thresholds.
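Both sparsity measures can be sketched as below. The function names and threshold defaults are illustrative assumptions, not the paper's exact values; `max_acts` is assumed to already hold the max-pooled absolute activations per exemplar and feature.

```python
import numpy as np

def sparsity_by_activation(max_acts, act_threshold=1e-6):
    """Fraction of features whose max absolute (spatially max-pooled)
    activation stays below act_threshold over the whole training corpus.

    max_acts: array of shape (num_exemplars, num_features).
    Threshold value is illustrative.
    """
    return (max_acts.max(axis=0) < act_threshold).mean()

def sparsity_by_scale(bn_gamma, scale_threshold=1e-3):
    """Fraction of features whose learned Batch Norm scale |gamma|
    falls below scale_threshold. Threshold value is illustrative."""
    return (np.abs(bn_gamma) < scale_threshold).mean()
```

Features flagged by either measure can then be zeroed out (or removed) and the test error re-checked, as done in the experiments summarized above.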
The implicit sparsification process can remove 70-80% of the convolutional filters from VGG-16 on CIFAR10/100, far exceeding the sparsity achieved by (li2017pruning), and performs comparably to (liu2017learning) for VGG-11 on ImageNet. Common hyperparameters such as mini-batch size can be used as knobs to control the extent of sparsity, with no tooling changes needed to the traditional NN training pipeline.
We present some of the experimental results, and discuss links of the hypothesis to pruning heuristics.