Group Sparse Regularization for
Deep Neural Networks
Abstract
In this paper, we consider the joint task of simultaneously optimizing (i) the weights of a deep neural network, (ii) the number of neurons for each hidden layer, and (iii) the subset of active input features (i.e., feature selection). While these problems are generally dealt with separately, we present a simple regularized formulation that solves all three of them in parallel, using standard optimization routines. Specifically, we extend the group Lasso penalty (which originated in the linear regression literature) in order to impose group-level sparsity on the network’s connections, where each group is defined as the set of outgoing weights from a unit. Depending on the specific case, the weights can be related to an input variable, to a hidden neuron, or to a bias unit, thus performing all the aforementioned tasks simultaneously in order to obtain a compact network. We perform an extensive experimental evaluation, comparing with classical weight decay and Lasso penalties. We show that a sparse version of the group Lasso penalty is able to achieve competitive performance, while at the same time resulting in extremely compact networks with a smaller number of input features. We evaluate both on a toy dataset for handwritten digit recognition, and on multiple realistic large-scale classification problems.
I Introduction
The recent interest in deep learning has made it feasible to train very deep (and large) neural networks, leading to remarkable accuracies in many high-dimensional problems including image recognition, video tagging, biomedical diagnosis, and others [schmidhuber2015deep, lecun2015deep]. While even five hidden layers were considered challenging until very recently, today simple techniques such as the inclusion of interlayer connections [he2015deep] and dropout [srivastava2014dropout] allow training networks with hundreds (or thousands) of hidden layers, amounting to millions (or billions) of adaptable parameters. At the same time, it has become extremely common to ‘overpower’ the network, by providing it with more flexibility and complexity than strictly required by the data at hand. Arguments that favor simple models over complex ones for describing a phenomenon are well known in the machine learning literature [domingos2012few]. However, this is far from being just a philosophical problem of ‘choosing the simplest model’. Having too many weights in a network can clearly increase the risk of overfitting; in addition, their exchange is the main bottleneck in most parallel implementations of gradient descent, where agents must forward them to a centralized parameter server [recht2011hogwild, seide2014parallelizability]; and finally, the resulting models might not work on low-power or embedded devices due to the excessive computational power needed for performing dense, large matrix-matrix multiplications [courbariaux2015binaryconnect].
In practice, current evidence points to the fact that the majority of weights in most deep networks are not necessary to their accuracy. As an example, Denil et al. [denil2013predicting] demonstrated that it is possible to learn only a small percentage of the weights, while the others can be predicted using a kernel-based estimator, resulting in most cases in a negligible drop in terms of classification accuracy. Similarly, in some cases it is possible to replace the original weight matrix with a low-rank approximation, and perform gradient descent on the factor matrices [sainath2013low]. Driven by these observations, the number of works trying to reduce the network’s weights has recently increased drastically. Most of these works either require strong assumptions on the connectivity (e.g., the low-rank assumption), or they require multiple, separate training steps. For example, the popular pruning method of Han et al. [han2015learning] works by first training a network, setting to zero all the weights below a fixed threshold, and then fine-tuning the remaining connections with a second training step. Alternatively, learned weights can be reduced by applying vector quantization techniques [gong2014compressing], which however are formulated as a separate optimization problem. There are many other possibilities, e.g. (i) we can use ‘distillation’ to train a separate, smaller network that imitates the original one, as popularized by Hinton et al. [hinton2015distilling]; (ii) we can build on classical works on pruning, such as the optimal brain damage algorithm, which uses second-order information on the cost function to remove ‘non-salient’ connections after training [lecun1989optimal]; (iii) we can work with limited numerical precision to reduce storage [gupta2015deep] (up to the extreme of a single bit per weight [courbariaux2015binaryconnect]); (iv) we can use hash functions to force weight sharing [chen2015compressing]; and so on.
When considering high-dimensional datasets, an additional problem is that of feature selection, where we search for a small subset of input features that carries most of the discriminative information [guyon2003introduction]. Feature selection and pruning are related problems: adding a new set of features to a task generally requires increasing the network’s capacity (in terms of number of neurons), all the way up to the last hidden layer. Similarly to before, there are countless techniques for feature selection (or dimensionality reduction of the input vector), including principal component analysis, mutual information [kwak2002input], autoencoders, and many others. What we obtain, however, is a rather complex workflow of machine learning primitives: one algorithm to select features; an optimization criterion for training the network; and possibly another procedure to compress the weight matrices. This raises the following question, which is the main motivation for this paper: is there a principled way of performing all three tasks simultaneously, by minimizing a properly defined cost function? This is further motivated by the fact that, in a neural network, pruning a node and deleting an input feature are almost equivalent problems. In fact, it is customary to consider the input vector as an additional layer of the neural network, having no ingoing connections and having outgoing connections to the first hidden layer. In this sense, pruning a neuron from this initial layer can be considered the same as deleting the corresponding input feature.
Currently, the only principled way to achieve this objective is the use of $\ell_1$ regularization, wherein we penalize the sum of absolute values of the weights during training. The $\ell_1$ norm acts as a convex proxy of the non-convex, non-differentiable $\ell_0$ norm [tibshirani1996regression]. Its use originated in the linear regression setting, where it is called the Lasso estimator, and it has been widely popularized recently thanks to the interest in compressive sensing [candes2008introduction, bach2012optimization]. Even if it has a non-differentiable point at $0$, in practice this rarely causes problems to standard first-order optimizers. In fact, it is common to simultaneously impose both weight-level sparsity with the $\ell_1$ norm, and weight minimization using the $\ell_2$ norm, resulting in the so-called ‘elastic net’ penalization [zou2005regularization]. Despite its popularity, however, the $\ell_1$ norm is only an indirect way of solving the previously mentioned problems: a neuron can be removed if, and only if, all its ingoing or outgoing connections have been set to $0$. In a sense, this is highly suboptimal: between two equally sparse networks, we would prefer the one with a more structured level of sparsity, i.e. with a smaller number of neurons per layer.
In this paper, we show how a simple modification of the Lasso penalty, called the ‘group Lasso’ penalty in the linear regression literature [yuan2006model, schmidt2010graphical], can be used efficiently to this end. A group Lasso formulation can be used to impose sparsity on a group level, such that all the variables in a group are either simultaneously set to $0$, or none of them are. An additional variation, called the sparse group Lasso, can also be used to impose further sparsity on the non-sparse groups [friedman2010note, simon2013sparse]. Here, we apply this idea by considering all the outgoing weights from a neuron as a single group. In this way, the optimization algorithm is able to remove entire neurons at a time. Depending on the specific neuron, we obtain different effects, corresponding to what we discussed before: feature selection when removing an input neuron; pruning when removing an internal neuron; and also bias selection when considering a bias unit (see next section). The idea of group regularization in machine learning is well known when considering convex loss functions [jenatton2011structured], including multi-kernel [bach2008consistency] and multi-task problems [liu2009multi]. However, to the best of our knowledge, such a general formulation was never considered in the neural networks literature, except for very specific cases. For example, Zhao et al. [zhao2015heterogeneous] used a group sparse penalty to select groups of features co-occurring in a robotic control task. Similarly, Zhu et al. [zhu2016co] have used a group sparse formulation to select informative groups of features in a multi-modal context.
On the contrary, in this paper we use the group Lasso formulation as a generic tool for enforcing compact networks with a smaller subset of selected features. In fact, our experimental comparisons show that the best results are obtained with the sparse group term, where we can obtain comparable accuracies to $\ell_2$-regularized and $\ell_1$-regularized networks, while at the same time reducing by a large margin the number of neurons in every layer. In addition, the regularizer can be implemented immediately in most existing software libraries, and it does not increase the computational complexity with respect to a traditional weight decay technique.
Outline of the paper
The paper is organized as follows. Section II describes standard techniques for regularizing a neural network during training, namely $\ell_2$, $\ell_1$ and composite $\ell_2$/$\ell_1$ terms. Section III describes our novel group Lasso and sparse group Lasso penalties, showing the meaning of groups in this context. Then, we test our algorithms in Section IV on a simple toy dataset of handwritten digit recognition, followed by multiple realistic experiments with standard deep learning benchmarks. After going more in depth with respect to some related pruning techniques in Section V, we conclude with some final remarks in Section VI.
Notation
In the rest of the paper, vectors are denoted by boldface lowercase letters, e.g. $\mathbf{a}$, while matrices are denoted by boldface uppercase letters, e.g. $\mathbf{A}$. All vectors are assumed column vectors. The operator $\|\cdot\|_p$ is the standard $\ell_p$ norm on a Euclidean space. For $p = 2$ this is the Euclidean norm, while for $p = 1$ we obtain the Manhattan (or taxicab) norm, defined for a generic vector $\boldsymbol{\beta} \in \mathbb{R}^B$ as $\|\boldsymbol{\beta}\|_1 = \sum_{k=1}^{B} |\beta_k|$.
II Weight-level regularization for neural networks
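Let us denote by $\mathbf{y} = f(\mathbf{x}; \mathbf{w})$ a generic deep neural network, taking as input a vector $\mathbf{x} \in \mathbb{R}^d$, and returning a vector $\mathbf{y} \in \mathbb{R}^o$ after propagating it through $H$ hidden layers. The vector $\mathbf{w} \in \mathbb{R}^Q$ is used as a shorthand for the column-vector concatenation of all adaptable parameters of the network. The generic $k$th hidden layer, $1 \le k \le H+1$, operates on an $L_k$-dimensional input vector $\mathbf{h}_k$ and returns an $L_{k+1}$-dimensional output vector $\mathbf{h}_{k+1}$ as:
$$\mathbf{h}_{k+1} = g_k\left(\mathbf{W}_k \mathbf{h}_k + \mathbf{b}_k\right) \tag{1}$$
where $\{\mathbf{W}_k, \mathbf{b}_k\}$ are the adaptable parameters of the layer, while $g_k(\cdot)$ is a properly chosen activation function to be applied element-wise. By convention we have $\mathbf{h}_1 = \mathbf{x}$. For training the weights of the network, consider a generic training set of $N$ examples given by $\{(\mathbf{x}_1, \mathbf{d}_1), \ldots, (\mathbf{x}_N, \mathbf{d}_N)\}$. The network is trained by minimizing a standard regularized cost function:
$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \left\{ \frac{1}{N} \sum_{i=1}^{N} L(\mathbf{d}_i, f(\mathbf{x}_i)) + \lambda R(\mathbf{w}) \right\} \tag{2}$$
where $L(\cdot, \cdot)$ is a proper cost function, $R(\cdot)$ is used to impose regularization, and the scalar coefficient $\lambda \in \mathbb{R}^+$ weights the two terms. Standard choices for $L(\cdot, \cdot)$ are the squared error for regression problems, and the cross-entropy loss for classification problems [haykin2009neural].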
By far the most common choice for regularizing the network, thus avoiding overfitting, is to impose a squared $\ell_2$ norm constraint on the weights:
$$R_{\ell_2}(\mathbf{w}) \triangleq \|\mathbf{w}\|_2^2 \tag{3}$$
In the neural networks’ literature, this is commonly denoted as ‘weight decay’ [moody1995simple], since in a steepest descent approach, its net effect is to reduce the weights by a factor proportional to their magnitude at every iteration. Sometimes it is also denoted as Tikhonov regularization. However, the only way to enforce sparsity with weight decay is to artificially force to zero all weights that are lower, in absolute terms, than a certain threshold. Even in this way, its sparsity effect might be negligible.
As we stated in the introduction, the second most common approach to regularize the network, inspired by the Lasso algorithm, is to penalize the absolute magnitude of the weights:
$$R_{\ell_1}(\mathbf{w}) \triangleq \|\mathbf{w}\|_1 = \sum_{k} |w_k| \tag{4}$$
The $\ell_1$ formulation is not differentiable at $0$, where it is necessary to resort to a subgradient formulation. Everywhere else, its gradient has constant magnitude, and in a standard minimization procedure it moves each weight by a constant amount towards zero (in the next section, we also provide a simple geometrical intuition on its behavior). While there exist customized algorithms to solve non-convex $\ell_1$-regularized problems [ochs2015iteratively], it is common in the neural networks’ literature to directly apply the same first-order procedures (e.g., stochastic gradient descent with momentum) as for the weight decay formulation. As an example, all libraries built on top of the popular Theano framework [bergstra2010theano] assign a default gradient value of $0$ to terms such that $w_k = 0$. Due to this, a thresholding step after optimization is generally required also in this case to obtain precisely sparse solutions [bengio2012practical], although the resulting level of sparsity is considerably higher than when using weight decay.
One popular variation is to approximate the $\ell_1$ norm by a smooth convex term, e.g. $|w_k| \approx \sqrt{w_k^2 + \beta}$ for a sufficiently small scalar factor $\beta > 0$, to obtain a smooth problem. Another possibility is to consider a mixture of $\ell_2$ and $\ell_1$ regularization, which is sometimes denoted as elastic net penalization [zou2005regularization]. The problem in this case, however, is that two different hyperparameters must be selected to weight the two terms.
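As a minimal NumPy sketch of the weight-level penalties discussed in this section (not the code released with the paper), the functions below evaluate the weight decay and Lasso penalties of (3) and (4), a subgradient of the Lasso term that takes the value 0 at zero weights, and a square-root smooth surrogate. The function names and the default value of `beta` are illustrative assumptions.

```python
import numpy as np

def l2_penalty(w):
    """Weight decay: squared Euclidean norm of the weight vector."""
    return float(np.sum(w ** 2))

def l1_penalty(w):
    """Lasso: sum of the absolute values of the weights."""
    return float(np.sum(np.abs(w)))

def l1_subgradient(w):
    """One valid subgradient of the Lasso penalty.

    np.sign returns 0 where w_k == 0, which matches the common
    default choice of 0 at the non-differentiable point.
    """
    return np.sign(w)

def smooth_l1_penalty(w, beta=1e-8):
    """Smooth surrogate sum_k sqrt(w_k^2 + beta); beta is an assumed default."""
    return float(np.sum(np.sqrt(w ** 2 + beta)))

def elastic_net_penalty(w, lam1, lam2):
    """Mixture of the two penalties; note the two separate hyperparameters."""
    return lam1 * l1_penalty(w) + lam2 * l2_penalty(w)

w = np.array([0.5, -2.0, 0.0, 1.5])
print(l2_penalty(w), l1_penalty(w), l1_subgradient(w))
```

The surrogate stays close to the Lasso value for small `beta`, while being differentiable everywhere.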
III Neuron-level regularization with group sparsity
III-A Formulation of the algorithm
Both $\ell_2$ regularization in (3) and $\ell_1$ regularization in (4) are efficient for preventing overfitting, but they are not optimal for obtaining compact networks. Generally speaking, a neuron can be removed from the architecture only if all its connections (either ingoing or outgoing) have been zeroed out during training. However, this objective is not actively pursued while minimizing the cost in (2). Among the many local minima, some might be equivalent (or almost equivalent) in terms of accuracy, while corresponding to more compact and efficient networks. As there is no principled way to converge to one instead of the other, when using these kinds of regularization the resulting network’s design will simply be a matter of initialization of the optimization procedure.
The basic idea of this paper is to consider group-level sparsity, in order to force all outgoing connections from a single neuron (corresponding to a group) to be either simultaneously zero, or not. More specifically, we consider three different groups of variables, corresponding to three different effects of the group-level sparsity:

Input groups $\mathcal{G}_{in}$: a single element $\mathbf{g}_i \in \mathcal{G}_{in}$ is the vector of all outgoing connections from the $i$th input neuron to the network, i.e. it corresponds to the $i$th row of the transposed first-layer weight matrix $\mathbf{W}_1^T$.

Hidden groups $\mathcal{G}_h$: in this case, a single element $\mathbf{g} \in \mathcal{G}_h$ corresponds to the vector of all outgoing connections from one of the neurons in the hidden layers of the network, i.e. one row of a transposed weight matrix $\mathbf{W}_k^T$ with $k > 1$. There is one such group for each neuron in the internal layers up to the final output one.

Bias groups $\mathcal{G}_b$: these are one-dimensional groups (scalars) corresponding to the biases of the network, of which there is one for every hidden and output neuron. They correspond to a single element of the bias vectors $\mathbf{b}_k$.
Overall, we have one group per input feature, one per hidden neuron and one per bias, corresponding to three specific effects on the resulting network. If the variables of an input group are set to zero, the corresponding feature can be neglected during the prediction phase, effectively corresponding to a feature selection procedure. Then, if the variables in a hidden group are set to zero, we can remove the corresponding neuron, thereby obtaining a pruning effect and a thinner hidden layer. Finally, if a variable in a bias group is set to zero, we can remove the corresponding bias from the neuron. We note that having a separate group for every bias is not the only choice. We can consider having a single bias unit for every layer feeding every neuron in that layer. In this case, we would have a single bias group per layer, corresponding to keeping or deleting every bias in it. Generally speaking, we have not found significant improvements in one way or the other.
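The three kinds of groups can be collected with a few lines of NumPy. This is only a sketch under an assumed storage convention, namely that each weight matrix is stored with shape (inputs, outputs), so that one row holds the outgoing weights of one unit; the helper name `build_groups` and the layer sizes in the example are hypothetical.

```python
import numpy as np

def build_groups(weights, biases):
    """Split the parameters of a feedforward network into groups.

    weights: list of arrays, weights[k] has shape (n_in, n_out)  (assumed layout)
    biases:  list of 1-D arrays, one per layer
    Returns (input_groups, hidden_groups, bias_groups) as lists of 1-D arrays.
    """
    # One group per input feature: rows of the first weight matrix.
    input_groups = [row for row in weights[0]]
    # One group per hidden neuron: rows of all the subsequent weight matrices.
    hidden_groups = [row for W in weights[1:] for row in W]
    # One scalar group per bias.
    bias_groups = [np.atleast_1d(b_j) for b in biases for b_j in b]
    return input_groups, hidden_groups, bias_groups

# Toy network: 3 inputs, one hidden layer with 4 units, 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
biases = [rng.normal(size=4), rng.normal(size=2)]

g_in, g_h, g_b = build_groups(weights, biases)
print(len(g_in), len(g_h), len(g_b))
```

With this layout, zeroing a whole input group removes a feature, and zeroing a whole hidden group removes a neuron.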
A visual representation of this weight grouping strategy is shown in Fig. 1 for a simple network with two inputs (top of the figure), one hidden layer with two units (middle of the figure), and a single output unit (bottom of the figure). In the figure, input groups are shown with a green background; hidden groups (which in this case have a single element per group) are shown with a blue background; while the bias groups are surrounded in a light red background.
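Let us define for simplicity the total set of groups as the union of the input, hidden and bias groups:
$$\mathcal{G} = \mathcal{G}_{in} \cup \mathcal{G}_h \cup \mathcal{G}_b$$
Group sparse regularization can be written as [yuan2006model]:
$$R_{\ell_{2,1}}(\mathbf{w}) \triangleq \sum_{\mathbf{g} \in \mathcal{G}} \sqrt{|\mathbf{g}|} \, \|\mathbf{g}\|_2 \tag{5}$$
where $|\mathbf{g}|$ denotes the dimensionality of the vector $\mathbf{g}$, and it ensures that each group gets weighted uniformly. Note that, for one-dimensional groups, the expression in (5) simplifies to the standard Lasso. Similarly to the $\ell_1$ norm, the term in (5) is convex but non-smooth, since its gradient is not defined if $\|\mathbf{g}\|_2 = 0$. Up to the constant factor $\sqrt{|\mathbf{g}|}$, the subgradient of a single term in (5) is given by:
$$\frac{\partial \|\mathbf{g}\|_2}{\partial \mathbf{g}} = \begin{cases} \dfrac{\mathbf{g}}{\|\mathbf{g}\|_2} & \text{if } \mathbf{g} \neq \mathbf{0} \\ \left\{ \mathbf{t} : \|\mathbf{t}\|_2 \le 1 \right\} & \text{otherwise} \end{cases} \tag{6}$$
As for the $\ell_1$ norm, we have found very good convergence behaviors using standard first-order optimizers, with a default choice of $\mathbf{0}$ as subgradient in the second case. Also here, a final thresholding step is required to obtain precisely sparse solutions. Note that we have used the symbol $\ell_{2,1}$ in (5) as the formulation is closely related to the $\ell_{2,1}$ norm defined for matrices.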
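The group penalty and its subgradient can be sketched in NumPy as follows. This is an illustrative sketch rather than the released implementation: groups are assumed to be given as a list of 1-D arrays, the function names are hypothetical, and the zero vector is used as the subgradient for all-zero groups.

```python
import numpy as np

def group_lasso_penalty(groups):
    """Sum over groups of sqrt(group size) times the Euclidean norm of the group."""
    return float(sum(np.sqrt(g.size) * np.linalg.norm(g) for g in groups))

def group_term_subgradient(g):
    """Subgradient of one term sqrt(|g|) * ||g||_2 of the group penalty.

    For a non-zero group this is the gradient; for an all-zero group we
    return the zero vector, one admissible element of the subdifferential.
    """
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return np.zeros_like(g)
    return np.sqrt(g.size) * g / norm

groups = [np.array([3.0, 4.0]), np.array([0.0, 0.0]), np.array([-2.0])]
print(group_lasso_penalty(groups))
print(group_term_subgradient(groups[0]))
```

For the one-dimensional group in the example, the term reduces to the absolute value of the weight, as in the standard Lasso.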
The formulation in (5) might still be suboptimal, however, since we lose guarantees of sparsity at the level of single connections among those remaining after removing some of the groups. To force this, we also consider the following composite ‘sparse group Lasso’ (SGL) penalty [friedman2010note, simon2013sparse], i.e. the sum of the group Lasso penalty in (5) and the Lasso penalty in (4):
$$R_{\text{SGL}}(\mathbf{w}) \triangleq R_{\ell_{2,1}}(\mathbf{w}) + R_{\ell_1}(\mathbf{w}) \tag{7}$$
The SGL penalty has the same properties as its constituting norms, namely, it is convex but non-differentiable. Differently from an elastic net penalization, we have found that optimal results can be achieved by considering a single regularization factor for both terms in (7).
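A possible NumPy sketch of the SGL penalty and of the resulting regularized cost, with a single regularization factor shared by both terms, is given below. It assumes that the groups form a partition of all the adaptable parameters, so that the Lasso term can be accumulated group by group; `sgl_penalty`, `regularized_cost` and the numeric values are hypothetical.

```python
import numpy as np

def sgl_penalty(groups):
    """Sparse group Lasso: group Lasso term plus Lasso term.

    groups: list of 1-D arrays assumed to partition all the parameters.
    """
    group_term = sum(np.sqrt(g.size) * np.linalg.norm(g) for g in groups)
    l1_term = sum(np.sum(np.abs(g)) for g in groups)
    return float(group_term + l1_term)

def regularized_cost(data_loss, groups, lam):
    """Average data loss plus the SGL penalty weighted by a single factor lam."""
    return data_loss + lam * sgl_penalty(groups)

groups = [np.array([3.0, 4.0]), np.array([0.0, 0.0]), np.array([-2.0])]
print(sgl_penalty(groups))
print(regularized_cost(1.0, groups, lam=0.1))
```

In an automatic differentiation framework, the same expression would simply be added to the training loss before computing gradients.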
A visual comparison between $\ell_1$, $\ell_{2,1}$, and SGL penalizations is given in Fig. 2. The dashed box represents one weight matrix connecting an input layer to an output layer. In gray, we show a possible combination of matrix elements that are zeroed out by the corresponding penalization. The Lasso penalty removes elements without optimizing neuron-level considerations. In this example, it removes some scattered connections, and we might remove the second neuron from the second layer (only in case the bias unit to the neuron has also been deleted). The group Lasso penalization removes all connections exiting from the second neuron, which can now be safely removed from the network. The sparsity level is just slightly higher than in the first case, but the resulting connectivity is more structured. Finally, the SGL formulation combines the advantages of both formulations: we remove all connections from the second neuron in the first layer and two of the remaining connections, thus achieving the highest level of sparsity in the layer and an extremely compact (and power-efficient) network.
III-B Graphical interpretation of group sparsity
The group Lasso penalty admits a very interesting geometrical interpretation whenever the first term in (2) is convex (see for example [bach2012optimization, Section 1]). Although this is not the case of neural networks, whose model is highly nonconvex due to the presence of the hidden layers, this interpretation does help in visualizing why the resulting formulation provides a group sparse solution. For this reason, we briefly describe it here for the sake of understanding.
For a convex loss in (2), standard arguments from duality theory show that the problem can be reformulated as follows [ben2001lectures]:
$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{N} \sum_{i=1}^{N} L(\mathbf{d}_i, f(\mathbf{x}_i)) \quad \text{subject to} \quad R(\mathbf{w}) \le \mu \tag{8}$$
where $\mu$ is a scalar whose precise value depends on $\lambda$, and whose existence is guaranteed thanks to the absence of duality gap. In machine learning, this is sometimes called Ivanov regularization, in honor of the Russian mathematician Valentin K. Ivanov [pelckmans2004morozov]. For a small value of $\mu$, such that the constraint in (8) is active at the optimum $\mathbf{w}^*$, it can be shown that the set of points for which the loss is equal to its value at $\mathbf{w}^*$ is tangent to the set $R(\mathbf{w}) = \mu$. Due to this, an empirical way to visualize the behavior of the different penalties is to consider the shape of the set $R(\mathbf{w}) = \mu$ corresponding to them. The shapes corresponding to $\ell_2$ regularization, Lasso, and group Lasso are shown in Fig. 3 for a simple problem with three variables. The shape for a weight decay penalty is a sphere (shown in Fig. 3(a)), which does not favor any of the solutions. On the contrary, the Lasso penalty imposes a three-dimensional diamond-shaped surface (shown in Fig. 3(b)), whose vertices lie on the axes and correspond to all the possible combinations of sparse solutions. Finally, consider the shape imposed by the group Lasso penalty (shown in Fig. 3(c)), where we set one group comprising the first two variables, and another group comprising only the third variable. The shape now has infinitely many singular points, corresponding to solutions having zeroes either on the first and second variables simultaneously, or in the third variable.
IV Experimental results
IV-A Experimental setup
In this section, we evaluate our proposal on different classification benchmarks. In particular, we begin with a simple toy dataset to illustrate its general behavior, and then move on to more elaborate, real-world datasets. In all cases, we use ReLU activation functions [glorot2011deep] for the hidden layers of the network:
$$g(s) = \max(0, s) \tag{9}$$
while we use the standard one-hot encoding for the different classes, and a softmax activation function for the output layer. Denoting as $\mathbf{s}$ the vector of values in input to the softmax, its $i$th output is computed as:
$$\text{softmax}_i(\mathbf{s}) = \frac{\exp(s_i)}{\sum_{j} \exp(s_j)} \tag{10}$$
The weights of the network are initialized according to the method described in [glorot2010understanding], and the networks are trained using the popular Adam algorithm [kingma2014adam], a variant of stochastic gradient descent with both adaptive step sizes and momentum. In all cases, parameters of the Adam procedure are kept at the default values described in [kingma2014adam], while the size of the mini-batches is varied depending on the dimensionality of the problem. Specifically, we minimize the loss function in (2) with the standard cross-entropy loss between the one-hot target $\mathbf{d}$ and the network output $f(\mathbf{x})$, given by:
$$L(\mathbf{d}, f(\mathbf{x})) = -\sum_{i} d_i \log\left(f_i(\mathbf{x})\right) \tag{11}$$
and multiple choices for the regularization penalty. Dataset loading, preprocessing and splitting are done with the sklearn library [scikitlearn]. First, every input column is normalized to a fixed range with an affine transformation. Then, for every run we randomly hold out a portion of the dataset for testing, and we repeat each experiment multiple times in order to average out statistical variations. For training, we exploit the Lasagne framework (https://github.com/Lasagne/Lasagne), which is built on top of the Theano library [bergstra2010theano]. Open source code to replicate the experiments is available on the web under the BSD-2 license (https://bitbucket.org/ispamm/grouplassodeepnetworks).
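For reference, the ReLU activation, the softmax output and the cross-entropy loss used in this setup can be sketched in NumPy as below. The experiments themselves rely on Lasagne and Theano; this standalone version is only illustrative, and the max-shift in the softmax and the small constant `eps` inside the logarithm are numerical-stability assumptions that are not part of the formulas.

```python
import numpy as np

def relu(s):
    """Element-wise rectifier max(0, s)."""
    return np.maximum(0.0, s)

def softmax(s):
    """Softmax of a 1-D vector; subtracting the max leaves the result unchanged."""
    e = np.exp(s - np.max(s))
    return e / np.sum(e)

def cross_entropy(d, y, eps=1e-12):
    """Cross-entropy between a one-hot target d and predicted probabilities y."""
    return float(-np.sum(d * np.log(y + eps)))

s = np.array([2.0, 2.0])          # inputs to the softmax for a two-class toy case
y = softmax(s)                    # equal inputs give equal probabilities
d = np.array([1.0, 0.0])          # one-hot encoding of the first class
print(y, cross_entropy(d, y))
```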
IV-B Comparisons with the DIGITS dataset
To begin with, we evaluate our algorithm on a toy dataset of handwritten digit recognition, namely the DIGITS dataset [alimoglu1996methods]. It is composed of grey images of handwritten digits collected from several dozen different people. We compare four neural networks, trained respectively with the weight decay in (3) (denoted as L2NN), the Lasso penalty in (4) (denoted as L1NN), the proposed group Lasso penalty in (5) (denoted as GL1NN), and finally its sparse variation in (7) (denoted as SGL1NN). In all cases, we use a simple network with two hidden layers. We run the optimization algorithm for a fixed number of epochs, using mini-batches. After training, all weights whose absolute value is under a small threshold are set to $0$.
The aim of this preliminary test is to evaluate what we obtain from the different penalties when varying the regularization factor $\lambda$. To this end, we run each algorithm by choosing $\lambda$ in an exponentially spaced range. Results of this set of experiments are shown in Fig. 4. There are several key observations to be made from the results. To begin with, the overall behavior in terms of test accuracy with respect to the four penalties, shown in Fig. 4(a), is similar among the algorithms, as they rapidly converge to the optimal accuracy for sufficiently small regularization factors, beyond which their results are basically indistinguishable. Fig. 4(b) shows the level of sparsity that we obtain, which is evaluated as the percentage of zero weights with respect to the total number of connections. The sparsity of L2NN is clearly unsatisfactory, both in the best case and on average. The sparsity of GL1NN is lower than the corresponding sparsity of L1NN, while the results of SGL1NN (shown with a dashed blue line) are equal or superior to all alternatives. In particular, for a suitable regularization factor both L1NN and SGL1NN are able to remove four fifths of the connections. At the same time, the resulting sparsity is considerably more structured for the proposed algorithm, which is able to consistently remove more features, as shown in Fig. 4(c), and neurons in the hidden layers, as shown in Fig. 4(d).
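The post-training thresholding step and the measures reported above (percentage of zero weights, surviving features and surviving neurons) can be computed along these lines. This is a sketch under assumptions: weight matrices are taken to have shape (inputs, outputs), the threshold `tol` is a hypothetical value, and the function names are not taken from the released code.

```python
import numpy as np

def threshold_weights(W, tol=1e-3):
    """Return a copy of W where entries with |w| < tol are set to exactly 0."""
    W = W.copy()
    W[np.abs(W) < tol] = 0.0
    return W

def sparsity_percent(weights):
    """Percentage of zero entries over all the weight matrices."""
    total = sum(W.size for W in weights)
    zeros = sum(int(np.sum(W == 0)) for W in weights)
    return 100.0 * zeros / total

def active_units(W):
    """Indices of units (rows of W) that keep at least one non-zero outgoing weight."""
    return np.flatnonzero(np.any(W != 0, axis=1))

# Toy first-layer matrix: 3 input features, 2 hidden units.
W1 = np.array([[0.5, 1e-5],
               [1e-6, -1e-4],
               [0.2, 0.3]])
W1_sparse = threshold_weights(W1)
print(sparsity_percent([W1_sparse]))   # share of removed connections
print(active_units(W1_sparse))         # input features that are still used
```

Applied to the first weight matrix, `active_units` lists the selected input features; applied to the later matrices, it lists the hidden neurons that cannot be pruned.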
Since the input to the classifier is an image, it is quite interesting to visualize which features (corresponding to pixels of the original image) are neglected in the proposed approaches, in order to further validate the proposal empirically. This is shown for one representative run in Fig. 5. In Fig. 5(a) we see a characteristic image in input to the system, representing in this case a single handwritten digit. We see that the digit covers the whole image with respect to its height, while there is some white space to its left and right, which is not interesting from a discriminative point of view. In Fig. 5(b) we visualize the results of GL1NN (which are very similar to those of SGL1NN), by plotting the cumulative intensity of the weights connecting the input layer to the first hidden layer (where white color represents an input with all outgoing connections set to $0$). We see that the algorithm does what we would have expected in this case, by ignoring all pixels corresponding to the outermost left and right regions of the image.
| Origin | Name | Features | Size | N. Classes | Desired output | Reference |
|---|---|---|---|---|---|---|
| UCI Repository | Sensorless Drive Diagnosis (SSD) | 48 | 58508 | 11 | Motor operating condition | [bayer2013sensorless] |
| MLData Repository | MNIST Handwritten Digits | 784 | 70000 | 10 | Digit (0-9) | [deng2012mnist] |
| MLData Repository | Forest Covertypes (COVER) | 54 | 581012 | 7 | Cover type of forest | [blackard1999comparative] |
IV-C Comparisons with large-scale datasets
We now evaluate our algorithm on three more realistic datasets, which require the use of deeper, larger networks. A schematic description of them is given in Table I in terms of features, number of patterns, and number of output classes. The first is downloaded from the UCI repository (http://archive.ics.uci.edu/ml/), while the second and third ones are downloaded from the MLData repository (http://mldata.org/). In the Sensorless Drive Diagnosis (SSD) dataset, we wish to predict whether a motor has one or more defective components, starting from a set of 48 features obtained from the motor’s electric drive signals (see [bayer2013sensorless] for details on the feature extraction process). The dataset is composed of 58508 examples obtained under 11 different operating conditions. The MNIST database is an extremely well-known database of handwritten digit recognition [deng2012mnist], composed of 70 thousand $28 \times 28$ gray images of the digits 0-9. Finally, the COVER dataset is the task of predicting the actual cover type of a forest (e.g. ponderosa pine) from a set of 54 features extracted from cartographic data (see [blackard1999comparative, Table 1] for a complete list of cover types). This dataset has roughly half a million training examples, but only 7 possible classes, compared to 11 and 10 classes for SSD and MNIST, respectively.
| Dataset | Neurons | Regularization | Minibatch size |
|---|---|---|---|
| SSD | 40/40/30/11 | | 500 |
| MNIST | 400/300/100/10 | | 400 |
| COVER | 50/50/20/7 | | 1000 |
Details on the network’s architecture, regularization factor and mini-batch size for the three datasets are given in Table II. Generally speaking, we use the same regularization factor for all algorithms, as it was shown to provide the best results in terms of classification accuracy and sparsity of the network. The network architecture is selected based on an analysis of previous works and is given in the second column of Table II, where $x/y/z$ means a network with one $x$-dimensional hidden layer, a second $y$-dimensional hidden layer, and a $z$-dimensional output layer. We stress that our focus is on comparing the different penalties, and very similar results can be obtained for different choices of the network’s architecture and the regularization factors. Additionally, we only consider SGL1NN, as the previous section has shown that it can consistently outperform the simpler GL1NN.
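| Dataset | Measure | L2NN | L1NN | SGL1NN |
|---|---|---|---|---|
| SSD | Training accuracy [%] | | | |
| | Test accuracy [%] | | | |
| | Training time [secs.] | | | |
| | Sparsity [%] | | | |
| | Neurons | | | |
| MNIST | Training accuracy [%] | | | |
| | Test accuracy [%] | | | |
| | Training time [secs.] | | | |
| | Sparsity [%] | | | |
| | Neurons | | | |
| COVER | Training accuracy [%] | | | |
| | Test accuracy [%] | | | |
| | Training time [secs.] | | | |
| | Sparsity [%] | | | |
| | Neurons | | | |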
The results for these experiments are given in Table III, where we show the average training and test accuracy, training time, sparsity of the network, and final size of each hidden layer (which is highlighted with a light blue background). As a note on training times, results for the smaller SSD dataset are obtained on an Intel Core i3 @ 3.07 GHz with 4 GB of RAM, while results for MNIST and COVER are obtained on an Intel Xeon E5-2620 @ 2.10 GHz, with 8 GB of RAM and a CUDA backend employing an Nvidia Tesla K20c. We see that the results in terms of test accuracy are comparable between the three algorithms, with a negligible loss on the MNIST dataset for SGL1NN. However, SGL1NN results in networks which are extremely sparse and more compact than those of its two competitors. Let us consider as an example the MNIST dataset. In this case, the algorithm removes on average more features from the input vector than L1NN. Also, the resulting network has fewer neurons in the hidden layers than both L1NN and L2NN. Also in this case, we can visually inspect the resulting features selected by the algorithm, which are shown in Fig. 6. In Fig. 6(a) we see an example of an input pattern (corresponding to a single digit), while in Fig. 6(b) we plot the cumulative intensity of the outgoing weights from the input layer. Differently from the DIGITS case, the images in this case have a large white margin on all sides, which is efficiently neglected by the proposed formulation, as shown by the white portions of Fig. 6(b).
One last observation can be made regarding the required training time. The SGL penalty is actually faster to compute than both the $\ell_2$ and $\ell_1$ norms when the code is run on the CPU, while we obtain a slower training time (albeit by a small margin) when it is executed on the CUDA backend. The reason for this is the need to compute two square root operations per group in (7). This gap can be removed by exploiting several options for faster mathematical computations (at the cost of precision) on the GPU, e.g. by using the ‘-prec-sqrt’ flag on the Nvidia CUDA compiler.
Overall, the results presented in this section show that the sparse group Lasso penalty makes it easy to obtain networks with a high level of sparsity and a low number of neurons (both in the input layer and in the hidden layers), while incurring no or negligible losses in accuracy.
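As an illustration of how such compactness figures can be read off a trained network, the sketch below counts the units that still have at least one non-negligible outgoing weight, together with the fraction of weights that are effectively zero. The tolerance `tol`, the function names, and the list-of-lists layout are hypothetical choices for this example, not the exact procedure used to produce Table III.

```python
def active_units(groups, tol=1e-3):
    """Count units with at least one outgoing weight larger than tol.

    groups: list of lists of floats, one inner list per unit.
    tol: hypothetical threshold below which a weight is treated as zero.
    """
    return sum(1 for g in groups if any(abs(w) > tol for w in g))


def weight_sparsity(groups, tol=1e-3):
    """Fraction of individual weights whose magnitude is at most tol."""
    total = sum(len(g) for g in groups)
    zeros = sum(1 for g in groups for w in g if abs(w) <= tol)
    return zeros / total


# Three units; the second one has only negligible outgoing weights,
# so it could be removed from the network altogether.
layer = [[0.5, -0.2], [0.0, 1e-4], [0.0, 0.3]]
print(active_units(layer))     # 2
print(weight_sparsity(layer))  # 0.5
```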
V Related works
Before concluding our paper, we describe a few related works that we briefly mentioned in the introduction, in order to highlight some common points and differences. Recently, there has been a sustained interest in methods that randomly decrease the complexity of the network during training. For example, dropout [srivastava2014dropout] randomly removes a set of connections; stochastic depth skips entire layers [huang2016deep]; while [glorot2011deep] introduced the possibility of applying the $\ell_1$ penalty to the activations of the neurons in order to further sparsify their firing patterns. However, these methods are only used to simplify the training phase, while the entire network is needed at the prediction stage. Thus, they are only tangentially related to what we discuss here.
A second class of related works groups all the so-called pruning methods, which can be used to simplify the network’s structure after training is completed. Historically, the most common method to achieve this is the optimal brain damage algorithm introduced by LeCun [lecun1989optimal], which removes connections by measuring a ‘saliency’ quantity related to the second-order derivatives of the cost function at the optimum. Other methods instead require computing the sensitivity of the error to the removal of each neuron, in order to choose an optimal subset of neurons to be deleted [suzuki2001simple]. More recently, a two-step learning process introduced by Han et al. [han2015learning] has also gained a lot of popularity. In this method, the network is originally trained considering an penalty, in order to learn which connections are ‘important’. Then, the non-important connections, namely all weights under a given threshold, are set to zero, and the network is retrained while keeping the zeroed-out weights fixed. This procedure can also be repeated iteratively to further reduce the size of the network. None of these methods, however, satisfies the requirements we set out in the introduction: they either require a separate pruning process, do not act directly at the neuron level, or rely on heuristic assumptions that must hold at the pruning phase. As an example, the optimal brain damage algorithm is built on the so-called diagonal approximation, stating that the error modification resulting from modifying many weights can be computed by summing the individual contributions from each weight perturbation.
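The threshold-and-retrain step described above can be sketched as follows. This is only a schematic illustration in plain Python, with hypothetical names (`prune_mask`, `masked_sgd_step`) and a plain gradient step standing in for whatever optimizer is actually used; it is not the implementation of [han2015learning].

```python
def prune_mask(weights, threshold):
    """Return 1.0 for weights to keep and 0.0 for weights to prune.

    A weight is pruned when its magnitude falls under the threshold.
    """
    return [1.0 if abs(w) >= threshold else 0.0 for w in weights]


def masked_sgd_step(weights, grads, mask, lr):
    """One gradient step in which pruned weights are kept fixed at zero."""
    return [m * (w - lr * g) for w, g, m in zip(weights, grads, mask)]


# Step 1: after the initial training, zero out the small weights.
weights = [0.8, -0.05, 0.3, 0.01]
mask = prune_mask(weights, threshold=0.1)
weights = [m * w for w, m in zip(weights, mask)]

# Step 2: retrain; the mask prevents the pruned weights from being updated.
grads = [1.0, 1.0, 1.0, 1.0]  # placeholder gradients for illustration
weights = masked_sgd_step(weights, grads, mask, lr=0.1)
print(mask)
```

Repeating both steps with a new threshold corresponds to the iterative variant mentioned in the text.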
A final class of methods is not interested in learning an optimal topology so much as in reducing the actual number of parameters and/or the storage requirements of the network. The most common method in this class is the low-rank approximation method [sainath2013low], where a weight matrix is replaced by a low-rank factorization , where the rank must be chosen by the user. Optimization is then performed directly on the two factors instead of the original matrix. The choice of the rank makes it possible to balance compression against accuracy. As an example, if we wish to compress the network by a factor , we can choose [sainath2013low]:
(12) 
However, this approximation is not guaranteed to work efficiently, and may lead to considerably worse results for a poor choice of .
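The trade-off between rank and compression follows from a simple parameter count, sketched below under assumed notation: an m × n weight matrix is replaced by two factors of sizes m × r and r × n, and p denotes the target fraction of the original parameter count. These symbols and the function names are hypothetical and introduced only for this example. The two factors hold r(m + n) parameters in total, so the factorization is smaller than p · m · n whenever r < p · m · n / (m + n).

```python
import math


def factored_params(m, n, r):
    """Parameters in an (m x r) factor plus an (r x n) factor."""
    return r * (m + n)


def max_rank(m, n, p):
    """Largest integer rank r such that r * (m + n) < p * m * n.

    p is the assumed target fraction of the original m * n parameters.
    """
    bound = p * m * n / (m + n)
    # Strict inequality: step down by one when the bound is an integer.
    return max(math.ceil(bound) - 1, 0)


# Example: a 1000 x 500 layer compressed to under a quarter of its size.
m, n, p = 1000, 500, 0.25
r = max_rank(m, n, p)
print(r, factored_params(m, n, r), m * n)  # 83 124500 500000
```

The count says nothing about accuracy: whether a network constrained to that rank still performs well has to be verified empirically.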
VI Conclusions
In this paper, we have introduced a way to simultaneously perform pruning and feature selection while optimizing the weights of a neural network. Our sparse group Lasso penalty can be implemented efficiently (and easily) in most software libraries, with a very small overhead with respect to standard $\ell_2$ or $\ell_1$ formulations. At the same time, our experimental comparisons show its superior performance in obtaining highly compact networks, with definite savings in terms of storage requirements and power consumption on embedded devices.
There are two main lines of research that we wish to explore in future contributions. To begin with, there is the problem of studying the interaction between a sparse formulation (which originated in the case of convex costs) and a nonconvex cost as in (2), an aspect which is still open in the optimization literature. It would be interesting to investigate the possible improvements obtained with a nonconvex sparse regularizer, such as the $\ell_p$ norm with fractional $p$. Alternatively, we might improve the sparse behavior of (4) and (5) by iteratively solving a convex approximation to the original nonconvex problem, e.g. by exploiting the techniques presented in [scutari2014decomposition], as we did in a previous work on semi-supervised support vector machines [scardapane2016distributed].
Second, we are interested in exploring group Lasso formulations for other types of neural networks, including convolutional neural networks and recurrent neural networks. As an example, we are actively working on extending our previous work on sparse regularization in reservoir computing architectures [bianchi2015prediction], where it is shown that sparse connectivity can help create clusters of neurons, resulting in heterogeneous features extracted from the recurrent layer.
Acknowledgments
The authors wish to thank Dr. Paolo Di Lorenzo for his insightful comments.