On Implicit Filter Level Sparsity in Convolutional Neural Networks

Dushyant Mehta, Kwang In Kim, Christian Theobalt
MPI for Informatics, University of Bath, Saarland Informatics Campus
Abstract

We investigate filter level sparsity that emerges in convolutional neural networks (CNNs) which employ Batch Normalization and ReLU activation, and are trained with adaptive gradient descent techniques and L2 regularization (or weight decay). We conduct an extensive experimental study casting these initial findings into hypotheses and conclusions about the mechanisms underlying the emergent filter level sparsity. This study provides new insight into the performance gap observed between adaptive and non-adaptive gradient descent methods in practice. Further, analysis of the effect of training strategies and hyperparameters on the sparsity leads to practical suggestions for designing CNN training strategies, enabling us to explore the tradeoffs between feature selectivity, network capacity, and generalization performance. Lastly, we show that the implicit sparsity can be harnessed for neural network speedup on par with or better than explicit sparsification / pruning approaches, without requiring any modifications to the typical training pipeline.

1 Introduction

In this work we show that filter level sparsity emerges in certain types of feedforward convolutional neural networks. (Here, "filter" refers to the weights and the nonlinearity associated with a particular feature, acting together as a unit; we use filter and feature interchangeably throughout the document.) In networks which employ Batch Normalization and ReLU activation, after training, certain filters are observed to not activate for any input. We investigate the cause and implications of this emergent sparsity. Our findings relate to the anecdotally known and poorly understood 'dying ReLU' phenomenon [StanfordCS231n], wherein some features in ReLU networks get cut off during training, leading to a reduced effective learning capacity of the network. Approaches such as Leaky ReLU [maas2013rectifier] and RandomOut [cohen2016randomout] propose symptomatic fixes, but with a limited understanding of the cause.
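As a concrete illustration (ours, not taken from the paper), a filter can be flagged as inactive by checking that its post-ReLU response stays at zero for every input in the dataset. The helper below is a minimal PyTorch sketch; the hook-based probing and the threshold eps are illustrative assumptions.

import torch

@torch.no_grad()
def find_inactive_filters(model, layer, data_loader, device="cpu", eps=1e-12):
    # Return indices of filters in `layer` whose maximum post-ReLU activation
    # over the whole dataset stays below `eps`. `layer` is assumed to be the
    # module producing that layer's post-ReLU feature maps.
    max_act = None
    captured = {}

    def hook(_module, _inputs, output):
        # Per-filter maximum over the batch and spatial dimensions.
        captured["max"] = output.detach().amax(dim=(0, 2, 3))

    handle = layer.register_forward_hook(hook)
    model.eval()
    for images, _ in data_loader:
        model(images.to(device))
        batch_max = captured["max"]
        max_act = batch_max if max_act is None else torch.maximum(max_act, batch_max)
    handle.remove()
    return (max_act < eps).nonzero(as_tuple=True)[0].tolist()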

We conclude, through a systematic experimental study, that the emergence of sparsity is the direct result of a disproportionately large influence of the regularizer (L2 or weight decay) vis-a-vis the gradients from the primary training objective in ReLU networks.

The extent of the resulting sparsity is affected by multiple phenomena which subtly impact the relative influence of the regularizer, and many of the hyperparameters and design choices for training neural networks interact with these phenomena to influence the extent of sparsity. We show that increasing the mini-batch size decreases the extent of sparsity, that adaptive gradient descent methods exhibit much higher sparsity than stochastic gradient descent (SGD) for both L2 regularization and weight decay, and that L2 regularization couples with adaptive gradient methods to further increase sparsity compared to weight decay.
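To make the distinction between L2 regularization and decoupled weight decay concrete, the following sketch (ours, not the paper's code) shows a single Adam-style parameter update: with L2, the penalty gradient lambda*w passes through the adaptive rescaling together with the data gradient, whereas with decoupled weight decay the shrinkage is applied to the weights directly.

import torch

def adam_step(w, grad, state, lr=1e-3, betas=(0.9, 0.99), eps=1e-8,
              l2=0.0, weight_decay=0.0):
    # One Adam update on tensor `w` with gradient `grad`.
    if l2 > 0:
        # L2 regularization: the penalty gradient is rescaled by the
        # adaptive denominator together with the data gradient.
        grad = grad + l2 * w
    state["t"] += 1
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * grad
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * grad * grad
    m_hat = state["m"] / (1 - betas[0] ** state["t"])
    v_hat = state["v"] / (1 - betas[1] ** state["t"])
    w = w - lr * m_hat / (v_hat.sqrt() + eps)
    if weight_decay > 0:
        # Decoupled weight decay: direct shrinkage, unaffected by the
        # adaptive rescaling.
        w = w - lr * weight_decay * w
    return w

# Usage: state = {"m": torch.zeros_like(w), "v": torch.zeros_like(w), "t": 0}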

We show that understanding the impact of these design choices yields useful and readily controllable sparsity, which can be leveraged for considerable neural network speedup without trading off generalization performance and without requiring any explicit pruning [molchanov2017pruning, li2017pruning] or sparsification [liu2017learning] steps. The implicit sparsification process can remove 70-80% of the convolutional filters from VGG-16 on CIFAR10/100, far exceeding the reduction achieved by [li2017pruning], and performs comparably to [liu2017learning] for VGG-11 on ImageNet.
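As an illustration of how such implicitly sparsified filters can be removed after training, the rough sketch below (ours; the helper name and the gamma threshold are illustrative assumptions, not the paper's code) slices a Conv-BatchNorm pair down to its surviving filters and adjusts the input channels of the following convolution. It assumes plain convolutions (no groups or dilation) and ignores the small constant a pruned filter's BatchNorm bias would otherwise contribute downstream.

import torch
import torch.nn as nn

@torch.no_grad()
def prune_conv_bn(conv, bn, next_conv, gamma_thresh=1e-3):
    # Keep only the filters whose learned BatchNorm scale gamma is non-negligible.
    keep = (bn.weight.abs() > gamma_thresh).nonzero(as_tuple=True)[0]

    new_conv = nn.Conv2d(conv.in_channels, len(keep), conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    new_conv.weight.copy_(conv.weight[keep])
    if conv.bias is not None:
        new_conv.bias.copy_(conv.bias[keep])

    new_bn = nn.BatchNorm2d(len(keep))
    new_bn.weight.copy_(bn.weight[keep])
    new_bn.bias.copy_(bn.bias[keep])
    new_bn.running_mean.copy_(bn.running_mean[keep])
    new_bn.running_var.copy_(bn.running_var[keep])

    # The following convolution only needs the surviving input channels.
    new_next = nn.Conv2d(len(keep), next_conv.out_channels, next_conv.kernel_size,
                         stride=next_conv.stride, padding=next_conv.padding,
                         bias=next_conv.bias is not None)
    new_next.weight.copy_(next_conv.weight[:, keep])
    if next_conv.bias is not None:
        new_next.bias.copy_(next_conv.bias)
    return new_conv, new_bn, new_next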

Further, the improved understanding of the sparsification process that we provide can better guide efforts towards developing strategies that ameliorate its undesirable aspects while retaining the desirable ones. Our insights should also make practitioners more aware of the implicit tradeoffs between network capacity and generalization that are made under the surface when changing hyperparameters seemingly unrelated to network capacity.

2 Emergence of Filter Sparsity in CNNs

2.1 Setup and Preliminaries

Our basic setup comprises a 7-layer convolutional network with 2 fully connected layers, as shown in Figure 1. The network structure is inspired by VGG [simonyan2014very], but is more compact. We refer to this network as BasicNet in the rest of the document. We train with a variety of gradient descent approaches, using a batch size of 40 and a method-specific base learning rate for 250 epochs, then scale the learning rate down by a factor of 10 for an additional 75 epochs. We train on CIFAR10 and CIFAR100 [krizhevsky2009learning], with normalized images and random horizontal flips applied during training. Xavier initialization [glorot2010understanding] is used for the network weights, with the appropriate gain for ReLU. The base learning rates and other hyperparameters are as follows: Adam (lr = 1e-3, β1 = 0.9, β2 = 0.99, ε = 1e-8), Adadelta (lr = 1.0, ρ = 0.9, ε = 1e-6), SGD (lr = 0.1, momentum = 0.9), Adagrad (lr = 1e-2). PyTorch [paszke2017automatic] is used for training. We study the effect of varying the amount and type of regularization on the extent of sparsity and test error in Table 1.
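The setup can be summarized with the following PyTorch sketch; it is our reconstruction under stated assumptions rather than the authors' code. The stand-in model, the CIFAR10 normalization statistics, and the regularization coefficient are illustrative placeholders, and the optimizer's weight_decay argument here corresponds to the L2 variant (the decoupled weight decay variant would apply the shrinkage separately).

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

def init_weights(m):
    # Xavier initialization with the gain appropriate for ReLU.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_normal_(m.weight, gain=nn.init.calculate_gain('relu'))
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# Normalized images with random horizontal flips during training
# (channel statistics are commonly used CIFAR10 values, not from the paper).
transform = T.Compose([
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261)),
])
train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True,
                                         transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=40, shuffle=True)

# Stand-in for BasicNet (Figure 1); layer widths here are illustrative only.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))
model.apply(init_weights)

# Method-specific base learning rate; weight_decay is PyTorch's L2-style
# penalty, with an illustrative coefficient.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.99),
                             eps=1e-8, weight_decay=1e-4)
# Alternatives studied:
# torch.optim.Adadelta(model.parameters(), lr=1.0, rho=0.9, eps=1e-6)
# torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# torch.optim.Adagrad(model.parameters(), lr=1e-2)

# 250 epochs at the base learning rate, then 75 more at a 10x lower rate.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[250], gamma=0.1)
criterion = nn.CrossEntropyLoss()
for epoch in range(325):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()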

Figure 1: BasicNet: structure of the basic convolutional network studied in this paper. We refer to the individual convolution layers as C1-C7. The fully connected head shown here is for CIFAR10/100; a different fully connected structure is used for TinyImageNet and ImageNet.
Table 1: Convolutional filter sparsity in BasicNet trained on CIFAR10/100 for different combinations of regularization and gradient descent methods. Shown is the percentage of non-useful / inactive convolution filters, as measured by the maximum activation over the training corpus (max act. below a small threshold) and by the magnitude of the learned BatchNorm scale γ (below a small threshold), averaged over 3 runs. The lowest test error per optimizer is highlighted, and sparsity (green) or lack of sparsity (red) for the best and near-best configurations is indicated via text color. L2: L2 regularization, WD: weight decay (adjusted with the same scaling schedule as the learning rate).
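For reference, the γ-based measure can be computed with a few lines of PyTorch; this is our sketch, and the threshold value is an illustrative assumption rather than the one used for the table.

import torch.nn as nn

def bn_filter_sparsity(model, thresh=1e-3):
    # Percentage of convolutional filters whose learned BatchNorm scale gamma
    # is (near) zero, aggregated over all BatchNorm2d layers in the model.
    inactive, total = 0, 0
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            inactive += int((m.weight.abs() < thresh).sum())
            total += m.weight.numel()
    return 100.0 * inactive / max(total, 1)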