On Implicit Filter Level Sparsity in Convolutional Neural Networks
We investigate filter level sparsity that emerges in convolutional neural networks (CNNs) which employ Batch Normalization and ReLU activation, and are trained with adaptive gradient descent techniques and L2 regularization (or weight decay). Through an extensive experimental study we cast our initial findings into hypotheses and conclusions about the mechanisms underlying the emergent filter level sparsity. This study provides new insight into the performance gap observed between adaptive and non-adaptive gradient descent methods in practice. Further, analysis of the effect of training strategies and hyperparameters on the sparsity leads to practical suggestions for designing CNN training regimes, enabling us to explore the tradeoffs between feature selectivity, network capacity, and generalization performance. Lastly, we show that the implicit sparsity can be harnessed for neural network speedup on par with or better than explicit sparsification / pruning approaches, without requiring any modifications to the typical training pipeline.
In this work we show that filter level sparsity emerges in certain types of feedforward convolutional neural networks. (A filter here refers to the weights and the nonlinearity associated with a particular feature, acting together as a unit; we use filter and feature interchangeably throughout the document.) In networks which employ Batch Normalization and ReLU activation, certain filters are observed to not activate for any input after training. We investigate the cause and implications of this emergent sparsity. Our findings relate to the anecdotally known and poorly understood ‘dying ReLU’ phenomenon [StanfordCS231n], wherein some features in ReLU networks get cut off during training, reducing the effective learning capacity of the network. Approaches such as Leaky ReLU [maas2013rectifier] and RandomOut [cohen2016randomout] propose symptomatic fixes, but with a limited understanding of the cause.
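As an illustration of how such inactive filters can be identified after training, the sketch below records the maximum post-ReLU response of every filter over a held-out set; filters whose response never exceeds zero are the 'dead' filters discussed above. It assumes a PyTorch model whose ReLUs are nn.Module instances, and the names model and loader are hypothetical rather than taken from the paper.

```python
# A minimal sketch, assuming a PyTorch model whose ReLUs are nn.Module instances.
# The names `model` and `loader` are hypothetical; this is not the paper's code.
import torch
import torch.nn as nn

@torch.no_grad()
def find_dead_filters(model, loader, device="cpu"):
    """Return, per ReLU layer, the indices of filters that never activate."""
    max_response = {}
    hooks = []

    def make_hook(name):
        def hook(module, inputs, output):
            # Reduce over all dimensions except the channel/feature dimension (dim 1)
            dims = tuple(d for d in range(output.dim()) if d != 1)
            act = output.amax(dim=dims)
            max_response[name] = torch.maximum(max_response.get(name, act), act)
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.ReLU):
            hooks.append(module.register_forward_hook(make_hook(name)))

    model.eval().to(device)
    for images, _ in loader:
        model(images.to(device))
    for h in hooks:
        h.remove()

    # A filter is "dead" if its post-ReLU response is zero for every input seen
    return {name: (resp == 0).nonzero().flatten().tolist()
            for name, resp in max_response.items()}
```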
We conclude, through a systematic experimental study, that the emergence of sparsity is the direct result of a disproportionate relative influence of the regularizer (L2 or weight decay) vis-à-vis the gradients from the primary training objective in ReLU networks.
The extent of the resulting sparsity is affected by multiple phenomena which subtly impact the relative influence of the regularizer. Many of the hyperparameters and design choices for training neural networks interplay with these phenomena to influence the extent of sparsity. We show that increasing the mini-batch size decreases the extent of sparsity, adaptive gradient descent methods exhibit a much higher sparsity than stochastic gradient descent (SGD) for both L2 regularization and weight decay, and L2 regularization couples with adaptive gradient methods to further increase sparsity compared to weight decay.
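To make the distinction between L2 regularization and decoupled weight decay concrete, the sketch below contrasts the two for a single, simplified Adam-style update on one parameter tensor (bias correction omitted; all names and default values are illustrative assumptions). With L2 regularization the penalty enters the gradient and is rescaled by the adaptive step size, whereas decoupled weight decay shrinks the weights directly.

```python
import torch

def adam_step_l2(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8, l2=1e-4):
    # L2 regularization: the penalty is added to the gradient, so it is
    # rescaled by Adam's per-parameter adaptive denominator.
    grad = grad + l2 * w
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    w -= lr * m / (v.sqrt() + eps)
    return w

def adam_step_wd(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8, wd=1e-4):
    # Decoupled weight decay: the shrinkage acts on the weights directly and is
    # not modulated by the adaptive denominator.
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    w -= lr * m / (v.sqrt() + eps)
    w -= lr * wd * w
    return w

# Example usage on a hypothetical conv filter bank
w = torch.randn(64, 3, 3, 3)
m, v = torch.zeros_like(w), torch.zeros_like(w)
g = torch.randn_like(w)               # gradient from the primary objective
w = adam_step_l2(w, g, m, v)
```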
We show that understanding the impact of these design choices yields useful and readily controllable sparsity which can be leveraged for considerable neural network speedup, without sacrificing generalization performance and without requiring any explicit pruning [molchanov2017pruning, li2017pruning] or sparsification [liu2017learning] steps. The implicit sparsification process can remove 70-80% of the convolutional filters from VGG-16 on CIFAR10/100, far exceeding the reduction achieved by [li2017pruning], and performs comparably to [liu2017learning] for VGG-11 on ImageNet.
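As a sketch of how this implicit sparsity could be harvested after training, the snippet below counts the filters whose learned Batch Normalization scale has collapsed to near zero; the threshold of 1e-3 is an illustrative assumption and not necessarily the exact selection criterion used in our experiments.

```python
# A minimal sketch, assuming a PyTorch model; the threshold is a hypothetical choice.
import torch
import torch.nn as nn

@torch.no_grad()
def prunable_filters(model, threshold=1e-3):
    """Per BatchNorm2d layer, report how many filters have a (near) zero scale."""
    report = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.BatchNorm2d):
            gamma = module.weight.abs()
            report[name] = {"total": gamma.numel(),
                            "prunable": int((gamma < threshold).sum())}
    return report
```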
Further, the improved understanding of the sparsification process that we provide can better guide efforts to develop strategies which ameliorate its undesirable aspects while retaining the desirable ones. Our insights will also make practitioners more aware of the implicit tradeoffs between network capacity and generalization made under the surface when changing hyperparameters that are seemingly unrelated to network capacity.
2 Emergence of Filter Sparsity in CNNs
2.1 Setup and Preliminaries
Our basic setup comprises a 7-layer convolutional network with 2 fully connected layers, as shown in Figure 1. The network structure is inspired by VGG [simonyan2014very], but is more compact. We refer to this network as BasicNet in the rest of the document. We use a variety of gradient descent approaches and a batch size of 40, with a method specific base learning rate for 250 epochs, and scale down the learning rate by 10 for an additional 75 epochs. We train on CIFAR10 and CIFAR100 [krizhevsky2009learning], with normalized images and random horizontal flips applied during training. Xavier initialization [glorot2010understanding] is used for the network weights, with the appropriate gain for ReLU. The base learning rates and other hyperparameters are as follows: Adam (1e-3, β1=0.9, β2=0.99, ε=1e-8), Adadelta (1.0, ρ=0.9, ε=1e-6), SGD (0.1, momentum=0.9), Adagrad (1e-2). PyTorch [paszke2017automatic] is used for training, and we study the effect of varying the amount and type of regularization on the extent of sparsity and test error in Table 2.1.
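For concreteness, the optimizer settings listed above could be instantiated in PyTorch roughly as follows. BasicNet itself is assumed to be defined elsewhere, the factory function and its argument names are illustrative, and the weight decay value is left as an argument since Table 2.1 varies the amount and type of regularization.

```python
# A minimal sketch of the optimizer configurations described above; not the paper's code.
import torch.optim as optim

def make_optimizer(name, params, weight_decay=0.0):
    if name == "adam":
        return optim.Adam(params, lr=1e-3, betas=(0.9, 0.99), eps=1e-8,
                          weight_decay=weight_decay)
    if name == "adadelta":
        return optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-6,
                              weight_decay=weight_decay)
    if name == "sgd":
        return optim.SGD(params, lr=0.1, momentum=0.9,
                         weight_decay=weight_decay)
    if name == "adagrad":
        return optim.Adagrad(params, lr=1e-2, weight_decay=weight_decay)
    raise ValueError(f"unknown optimizer: {name}")

# Learning rate schedule: base rate for 250 epochs, then scaled down by 10 for 75 more
# scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[250], gamma=0.1)
```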