Logarithmic Pruning is All You Need
The Lottery Ticket Hypothesis is a conjecture that every large neural network contains a subnetwork that, when trained in isolation, achieves comparable performance to the large network. An even stronger conjecture has been proven recently: Every sufficiently overparameterized network contains a subnetwork that, at random initialization, but without training, achieves comparable accuracy to the trained large network. This latter result, however, relies on a number of strong assumptions and guarantees a polynomial factor on the size of the large network compared to the target function. In this work, we remove the most limiting assumptions of this previous work while providing significantly tighter bounds: the overparameterized network only needs a logarithmic factor (in all variables but depth) number of neurons per weight of the target subnetwork.
The recent success of neural network (NN) models in a variety of tasks, ranging from vision (khan2020convnet) to speech synthesis (oord2016wavenet) to playing games (schrittwieser2019mastering; ebendt2009weighted), has sparked a number of works aiming to understand how and why they work so well. Proving theoretical properties for neural networks is quite a difficult task, with challenges due to the intricate composition of the functions they implement and the high-dimensional regimes of their training dynamics. The field is vibrant but still in its infancy, and many theoretical tools are yet to be built to provide guarantees on what and how NNs can learn. A lot of progress has been made towards understanding the convergence properties of NNs (see e.g., allen-zhu2019convergence, zou2019improved and references therein). The fact remains that training and deploying deep NNs come at a large cost (livni2014), which is problematic. To avoid this problem, one could stick to a small network size. However, it is becoming evident that there are benefits to using oversized networks, as the literature on overparameterized models (belkin-late2018power) points out. Another solution, commonly used in practice, is to prune a trained network to reduce the size and hence the cost of prediction/deployment. Although theoretical guarantees are missing, experimental works show that pruning can considerably reduce the network size without sacrificing accuracy.
The influential work of frankle2018lottery made the following observation: if one a) trains a large network for long enough and observes its performance on a dataset, b) prunes it substantially to reveal a much smaller subnetwork with good (or better) performance, c) resets the weights of the subnetwork to their original values and removes the rest of the weights, and d) retrains the subnetwork in isolation, then the subnetwork reaches the same test performance as the large network, and trains faster. frankle2018lottery thus conjecture that every successfully trained network contains a much smaller subnetwork that, when trained in isolation, has comparable performance to the large network, without sacrificing computing time. They name this phenomenon the Lottery Ticket Hypothesis, and a ‘winning ticket’ is a subnetwork of the kind just described.
ramanujan2019s went even further by observing that if the network architecture is large enough, then it contains a smaller network that, even without any training, has comparable accuracy to the trained large network. They support their claim with empirical results using a new pruning algorithm, and even provide a simple asymptotic justification that we rephrase here: Starting from the inputs and progressing toward the outputs, for any neuron of the target network, sample as many neurons as required until one calculates a function within small error of the target neuron; then, after pruning the unnecessary neurons, the newly generated network will be within some small error of the target network. Interestingly, ulyanov2018 pointed out that randomly initialized but untrained ConvNets already encode a great deal of the image statistics required for restoration tasks such as de-noising and inpainting, and the only prior information needed to do them well seems to be contained in the network structure itself, since no part of the network was learned from data.
Very recently, building upon the work of ramanujan2019s, malach2020proving proved a significantly stronger version of the “pruning is all you need” conjecture, moving away from asymptotic results to non-asymptotic ones: With high probability, any target network of layers and neurons per layer can be approximated within accuracy by pruning a larger network whose size is polynomial in the size of the target network. To prove their bounds, malach2020proving make assumptions about the norms of the inputs and of the weights. This polynomial bound already tells us that unpruned networks contain many ‘winning tickets’ even without training. Then it is natural to ask: could the most important task of gradient descent be pruning?
Building on these previous works, we aim to provide stronger theoretical guarantees, still based on the motto that “pruning is all you need”, while also offering further insights into how ‘winning tickets’ may be found. In this work we relax the aforementioned assumptions while greatly strengthening the theoretical guarantees: the number of samples required to approximate one target weight improves from polynomial to logarithmic order in all variables except the depth.
How this paper is organized.
After some notation (Section 2) and the description of the problem (Section 3), we provide a general approximation propagation lemma (Section 4), which shows the effect of the different variables on the required accuracy. Next, in Section 5, we show how to construct the large, fully-connected ReLU network; the construction is identical to that of malach2020proving, except that weights are sampled from a hyperbolic weight distribution instead of a uniform one. We then give our theoretical results in Section 6, showing that only neurons per target weight are required under some similar conditions as the previous work (with layers, neurons per layer and accuracy) or (with some other dependencies inside the log) if these conditions are relaxed. For completeness, the most important technical result is included in Section 7. Other technical results, a table of notation, and further ideas can be found in Appendix C.
2 Notation and definitions
A network architecture is described by a positive integer corresponding to the number of fully connected feed-forward layers, and a list of positive integers corresponding to the profile of widths, where is the number of neurons in layer and is the input dimension, and a list of activation functions —all neurons in layer use the activation function . Networks from the architecture implement functions from to that are obtained by successive compositions:
Let be a target network from architecture . The composition of such is as follows: Each layer has a matrix of connection weights, and an activation function , such as tanh, the logistic sigmoid, ReLU, Heaviside, etc. The network takes as input a vector where for example or , etc. In layer , the neuron with in-coming weights calculates , where is usually the output of the previous layer. Note that is the -th row of the matrix . The vector denotes the output of the whole layer when it receives from the previous layer. Furthermore, for a given network input we recursively define by setting , and for then . The output of neuron in layer is . The network output is .
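The recursive definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: there are no bias terms, each weight matrix is a list of rows (one row of incoming weights per neuron), and the names `forward`, `W1`, `W2` and the numerical values are hypothetical, not taken from the construction in this paper.

```python
import math

def relu(z):
    return max(0.0, z)

def forward(weights, activations, x):
    """Compose the layers: each neuron applies its activation to (row . x)."""
    for W, sigma in zip(weights, activations):
        # Row i of W holds the incoming weights of neuron i in this layer.
        x = [sigma(sum(w * xj for w, xj in zip(row, x))) for row in W]
    return x

# Hypothetical network: 2 inputs -> 2 ReLU neurons -> 1 tanh output.
W1 = [[1.0, -1.0], [0.5, 0.5]]
W2 = [[1.0, -2.0]]
y = forward([W1, W2], [relu, math.tanh], [2.0, 1.0])
```

The loop variable `x` plays the role of the layer input at each step, so the final value is the network output for the given input.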
For an activation function , let be its Lipschitz factor (when it exists), that is, is the smallest real number such that for all . For ReLU and tanh we have , and for the logistic sigmoid, . Let be the corresponding to the activation function of all the neurons in layer , and let .
Define to be the maximum number of neurons per layer. The total number of connection weights in the architecture is denoted , and we have .
For all , let be the maximum activation at any layer of a target network , including the network inputs but excluding the network outputs. We also write ; when is restricted to the set of inputs of interest (not necessarily the set of all possible inputs) such as a particular dataset, we expect to be bounded by a small constant in most if not all cases. For example, for a neural network with only sigmoid activations and inputs in . For ReLU activations, can in principle grow as fast as , but since networks with sigmoid activations are universal approximators, we expect that for all functions that can be approximated with a sigmoid network there is a ReLU network calculating the same function with .
The large network has an architecture , possibly wider and deeper than the target network . The pruned network is obtained by pruning (setting to 0) many weights of the large network . For each layer , and each pair of neurons and , for the weight of the large network there is a corresponding mask such that the weight of the pruned network is . The pruned network will have a different architecture from , but at a higher level (by grouping some neurons together) it will have the same ‘virtual’ architecture, with virtual weights . As in previous theoretical work, we consider an ‘oracle’ pruning procedure, as our objective is to understand the limitations of even the best pruning procedures.
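As a minimal sketch of what pruning means here, the following Python snippet multiplies each weight of a layer by its binary mask entry. The helper name `prune` and all numerical values are made up for illustration; in particular, the mask is simply given, standing in for the ‘oracle’ pruning procedure rather than implementing any search.

```python
def prune(weights, mask):
    """Element-wise product of a weight matrix with a binary mask (0 = pruned)."""
    return [[w * b for w, b in zip(w_row, b_row)]
            for w_row, b_row in zip(weights, mask)]

# Randomly initialized weights of one layer (made-up values).
W = [[0.3, -0.7, 0.1],
     [0.9, 0.2, -0.4]]
# Binary mask assumed to be supplied by an oracle (made-up values).
B = [[1, 0, 0],
     [0, 0, 1]]

W_pruned = prune(W, B)
kept = sum(sum(row) for row in B)  # number of weights that survive pruning
```

The weights themselves are never modified: the pruned network differs from the large one only through the mask.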
For a matrix , we denote by its spectral norm, equal to its largest singular value, and its max-norm is . In particular, for a vector , we have and and also . This means for example that is a stronger condition than .
Objective: Given an architecture and accuracy , construct a network from some larger architecture , such that if the weights of are randomly initialized (no training), then for any target network from , setting some of the weights of to 0 (pruning) reveals a subnetwork such that with high probability,
Question: How large must be to contain all such ? More precisely, how many more neurons or how many more weights must have compared to ?
ramanujan2019s were the first to provide a formal asymptotic argument proving that such a can indeed exist at all.
malach2020proving went substantially further by providing the first polynomial bound on the size of compared to the size of the target network .
To do so, they make the following assumptions on the target network: (i) the inputs must satisfy , and at all layers : (ii) the weights must be bounded in , (iii) they must satisfy at all layers , and (iv) the number of non-zero weights at layer must be less than : . Note that these constraints imply that . Then under these conditions, they prove that any ReLU network with layers and neurons per layer can be approximated
4 Approximation Propagation
In this section, we analyze how the approximation error between two networks with the same architecture propagates through the layers. The following lemma is a generalization of the (end of the) proof of malach2020proving that removes their aforementioned assumptions and provides better insight into the impact of the different variables on the required accuracy, but is not sufficient in itself to obtain better bounds. For two given networks with the same architecture, it determines what accuracy is needed on each individual weight so the outputs of the two neural networks differ by at most on any input. Note that no randomization appears at this stage.
Lemma 1 (Approximation propagation).
Consider two networks and with the same architecture with respective weight matrices and , each weight being in . Given , if for each weight of the corresponding weight of we have , and if
The proof is given in Appendix C.
Consider an architecture with only ReLU activation function (), weights in and assume that and take the worst case , then Lemma 1 tells us that the approximation error on each individual weight must be at most so as to guarantee that the approximation error between the two networks is at most . This is exponential in the number of layers. If we assume instead that as in previous work then this reduces to a mild polynomial dependency: . ∎
5 Construction of the ReLU Network and Main Ideas
We now explain how to construct the large network given only the architecture , the accuracy , and the domain of the weights. Apart from this, the target network is unknown. In this section all activation functions are ReLU , and thus .
We use a similar construction of the large network as malach2020proving: both the target network and the large network consist of fully connected ReLU layers, but may be wider and deeper. The weights of are in . The weights for (at all layers) are all sampled from the same distribution; the only difference from the previous work is the distribution of the weights: we use a hyperbolic distribution instead of a uniform one.
Between layer and of the target architecture, for the large network we insert an intermediate layer of ReLU neurons. Layer is fully connected to layer which is fully connected to layer . In contrast to the target network , in the layers and are not directly connected. The insight of malach2020proving is to use two intermediate (fully connected ReLU) neurons and of the large network to mimic one weight of the target network (see Fig. 1): Calling the input and output weights of and that match the input and output of the connection , then in the pruned network all connections apart from these 4 are masked out by a binary mask set to 0. These two neurons together implement a ‘virtual’ weight and calculate the function by taking advantage of the identity :
Hence, if and , the virtual weight made of and is approximately . Then, for each target weight , malach2020proving sample many such intermediate neurons to ensure that two of them can be pruned so that with high probability. This requires samples and, when combined with Lemma 1 (see Example 2), makes the general bound on the whole network grow exponentially in the number of layers, unless strong constraints are imposed.
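The two-neuron trick can be checked numerically. The sketch below relies on the ReLU identity x = ReLU(x) - ReLU(-x): one intermediate neuron has a positive input weight and handles positive inputs, the other has a negative input weight and handles negative inputs, and in both cases the product of input and output weight equals the target weight. The function and parameter names (`virtual_weight_output`, `w_in_pos`, and so on) and the numerical values are hypothetical, and the products are taken to be exact, whereas with sampled weights they would only be approximate.

```python
def relu(z):
    return max(0.0, z)

def virtual_weight_output(x, w_in_pos, w_out_pos, w_in_neg, w_out_neg):
    """Contribution of two intermediate ReLU neurons to one output connection."""
    return w_out_pos * relu(w_in_pos * x) + w_out_neg * relu(w_in_neg * x)

w = -0.6  # target weight (made-up value)
# Require w_in_pos > 0 and w_in_neg < 0, with both in/out products equal to w.
params = dict(w_in_pos=0.8, w_out_pos=w / 0.8,
              w_in_neg=-0.5, w_out_neg=w / -0.5)

for x in (-2.0, -0.3, 0.0, 0.7, 1.5):
    # For every input, the pair of neurons reproduces the linear map x -> w * x.
    assert abs(virtual_weight_output(x, **params) - w * x) < 1e-12
```

Only one of the two neurons is active for any nonzero input, which is why the two sign configurations can be treated separately.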
To obtain a logarithmic dependency on , we use three new insights that take advantage of the composability of neural networks: 1) ‘binary’ decomposition of the weights, 2) product weights, and 3) batch sampling. We detail them next.
Our most important improvement is to build the weight not with just two intermediate neurons, but with of them, so as to decompose the weight into pieces of different precisions, and recombine them with the sum in the neuron at layer (see Fig. 1), using a suitable binary mask vector in the pruned network . Intuitively, the weight is decomposed into its binary representation up to a precision of bits: . Using a uniform distribution to obtain these weights would not help, however. But, because the high precision bits are now all centered around 0, we can use a hyperbolic sampling distribution which has high density near 0. More precisely, but still a little simplified, for a weight we approximate within accuracy with the virtual weight such that:
where is factored out since all connections have the same mask, and where and , , and . Note however that, because of the inexactness of the sampling process, we use a decomposition in base instead of base 2 (Lemma 9 in Section 7).
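To illustrate why a distribution with high density near 0 suits this decomposition, here is a small sampling sketch. It assumes the hyperbolic density is proportional to 1/v on an interval [alpha, beta]; the endpoints, the sample size, and the helper names are hypothetical choices for this illustration. Under that assumption, inverse-CDF sampling reduces to drawing the logarithm uniformly, and every interval of the form [a, 2a] receives the same probability mass regardless of scale.

```python
import math
import random

def sample_hyperbolic(alpha, beta, rng):
    """Draw v in [alpha, beta) with density proportional to 1/v (inverse CDF)."""
    return alpha * (beta / alpha) ** rng.random()

rng = random.Random(0)
alpha, beta = 1e-4, 1.0  # hypothetical support
samples = [sample_hyperbolic(alpha, beta, rng) for _ in range(20000)]

def fraction_in(lo, hi):
    """Empirical fraction of samples falling in [lo, hi)."""
    return sum(lo <= v < hi for v in samples) / len(samples)

# Theoretical mass of any interval [a, 2a] inside the support.
expected = math.log(2) / math.log(beta / alpha)
near_zero = fraction_in(0.001, 0.002)
near_one = fraction_in(0.1, 0.2)
```

With a uniform distribution, the interval near 0.001 would receive about a hundredth of the mass of the interval near 0.1; here both empirical fractions should be close to `expected`, so low-order ‘bits’ are as easy to hit as high-order ones.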
Recall that . For fixed signs of and , this function can be equivalently calculated for all possible values of these two weights such that the product remains unchanged. Hence, forcing and to take 2 specific values is wasteful as one can take advantage of the cumulative probability mass of all their combinations. We thus make use of the induced product distribution, which avoids squaring the number of required samples. Define the distribution for positive weights with and , symmetric around 0, for :
Then, instead of sampling uniformly until both and , we sample both from so that , taking advantage of the induced product distribution (Lemma 28).
Sampling sufficiently many intermediate neurons so that a subset of them are employed in approximating one target weight with high probability and then discarding (pruning) all other intermediate neurons is wasteful. Instead, we allow these samples to be ‘recycled’ to be used for other neurons in the same layer. This is done by partitioning the neurons in different buckets (categories) and ensuring that each bucket has enough neurons (LABEL:lem:fillcat).
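The bucket-filling idea can be simulated directly. The following sketch is an assumption-laden toy model, not the lemma itself: each sampled neuron falls into one of `num_buckets` equally likely buckets with probability `bucket_prob` each, or into no bucket otherwise, and we count how many draws are needed until every bucket holds at least `per_bucket` samples. All names and parameter values are hypothetical.

```python
import random

def draws_until_filled(num_buckets, per_bucket, bucket_prob, rng):
    """Count random draws until every bucket holds at least `per_bucket` samples.

    Each draw lands in bucket i with probability `bucket_prob`, or in no bucket
    with the remaining probability (requires num_buckets * bucket_prob <= 1).
    """
    counts = [0] * num_buckets
    draws = 0
    while min(counts) < per_bucket:
        draws += 1
        i = int(rng.random() / bucket_prob)  # index of the bucket hit, if any
        if i < num_buckets:
            counts[i] += 1
    return draws

rng = random.Random(1)
shared = draws_until_filled(num_buckets=10, per_bucket=50, bucket_prob=0.05, rng=rng)
```

Because all buckets are filled from the same stream of samples, the total number of draws is governed by the slowest bucket rather than by the sum of separate sampling budgets for each bucket.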
6 Theoretical Results
We now have all the elements to present our central theorem, which tells us how many intermediate neurons to sample to approximate all weights at a layer of the target network with high probability. Remark 4 below will then describe the result in terms of number of neurons per target weight.
Theorem 3 (ReLU sampling bound).
For a given architecture where is the ReLU function, with weights in and a given accuracy , the network constructed as above with weights sampled from with , and , requires only to sample intermediate neurons for each layer , where
( is in Lemma 1 with for ReLU), in order to ensure that for any target network with the given architecture , there exist binary masks of such that for the resulting subnetwork ,
Step 1. Sampling intermediate neurons to obtain product weights. Consider a single target weight . Recalling that and , we rewrite Eq. 1 as
The two virtual weights and are obtained separately. We need both and so that .
Consider (the case is similar). We now sample intermediate neurons, fully connected to the previous and next layers, but only keeping the connection between the same input and output neurons as (the other weights are zeroed out by the mask ). For a single sampled intermediate neuron , all its weights, including and , are sampled from , thus the product is sampled from the induced product distribution and, a quarter of the time, and have the correct signs (recall we need and ). Define
then with where the last inequality follows from Lemma 28 for , and , and similarly for with .
Note that because and , the samples for and are mutually exclusive which will save us a factor 2 later.
Step 2. ‘Binary’ decomposition/recomposition. Consider a target weight . Noting that Corollary 10 equally applies for negative weights by first negating them, we obtain and by two separate applications of Corollary 10 where we substitute , , . Substituting with in Eq. 2 shows that this leads to a factor 8 on . Therefore, sampling weights from in with ensures that there exists a binary mask of size at most such that with probability at least . We proceed similarly for . Note that Corollary 10 guarantees , even though the large network may have individual weights larger than .
Step 2’. Batch sampling. Take to be the number of ‘bits’ required to decompose a weight with Corollary 10 (via Lemma 9). Sampling different intermediate neurons for each target weight and discarding samples is wasteful: Since there are target weights at layer , we would need intermediate neurons, when in fact most of the discarded neurons could be recycled for other target weights.
Instead, we sample all the weights of layer at the same time, requiring that we have at least samples for each of the intervals of the ‘binary’ decompositions of and . Then we use LABEL:lem:fillcat with categories: The first categories are for the decomposition of and the next ones are for . Note that these categories are indeed mutually exclusive as explained in Step 1 and, adapting Eq. 2, each has probability at least (for any ). Hence, using LABEL:lem:fillcat where we take and , we only need to sample intermediate neurons to ensure that with probability at least each and can be decomposed into product weights in each of the intervals of Lemma 9.
Step 3. Network approximation. Using a union bound, we need for the claim to hold simultaneously for all layers. Finally, when considering only the virtual weights (constructed from and ), and now have the same architecture, hence choosing as in Lemma 1 ensures that with probability at least , . ∎
Consider for all and assume , and . Then and . Then we can interpret Theorem 3 as follows: When sampling the weights of a ReLU architecture from the hyperbolic distribution, we only need to sample neurons per target weight (assuming ). Compare with the bound of malach2020proving which, under the further constraints that and and with uniform sampling in , needed to sample neurons per target weight.
We can now state our final result.
Corollary 7 (Weight count ratio).
Under the same conditions as Theorem 3, let be the number of weights in the fully connected architecture and the number of weights of the large network , then the weight count ratio is .
We have , and the total number of weights in the network if layers are fully connected is at most , where . Hence the weight count ratio is . ∎
Since in the pruned network each target weight requires neurons, the large network has at most a constant factor more neurons than the pruned network.
7 Technical lemma: Random weights
The following lemma shows that if weights are sampled from a hyperbolic distribution, we can construct a ‘goldary’ (as opposed to ‘binary’) representation of the weight with only samples. Because of the randomness of the process, we use a “base” instead of a base 2 for logarithms, so that the different ‘bits’ have overlapping intervals. As the proof clarifies, the optimal base is . The base is convenient. The number is known as the ‘golden ratio’ in the mathematical literature, which explains the name we use.
Lemma 9 (Golden-ratio decomposition).
For any given and , where is the golden ratio, define the probability density for with normalization . For any , if with , then with probability at least over the random sampling of ‘weights’ for , the following holds: For every ‘target weight’ , there exists a mask with such that is -close to , indeed .
Let . First, consider a sequence such that each is in the interval for . We construct an approximating for any weight by successively subtracting when possible. Formally
By induction we can show that . This holds for . Assume : If then .
If then .
The last inequality is true for , which is satisfied due to the restriction . Hence the error for .
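The deterministic part of the argument (subtract a value whenever it fits in the remainder) can be sketched as follows. This is an illustration under explicit assumptions: the base is taken to be gamma = 1/phi, where phi is the golden ratio, so that gamma + gamma^2 = 1; the i-th value is drawn from the interval [gamma^(i+1), gamma^i] for i = 1, ..., k; and the target weight lies in [0, 1]. The indexing convention and the names `greedy_decompose`, `GAMMA`, `k` are choices made for this sketch. With these choices the remainder after step i is at most gamma^i, since either the value is skipped (remainder below gamma^i) or subtracted (remainder at most gamma^(i-1) - gamma^(i+1) = gamma^i).

```python
import random

PHI = (1 + 5 ** 0.5) / 2
GAMMA = 1 / PHI  # satisfies GAMMA + GAMMA**2 == 1 up to rounding

def greedy_decompose(w, values):
    """Subtract each value from the remainder whenever it fits.

    Returns the binary mask of the values used and the final remainder.
    """
    remainder, mask = w, []
    for v in values:
        take = v <= remainder
        mask.append(int(take))
        if take:
            remainder -= v
    return mask, remainder

rng = random.Random(0)
k = 20  # number of 'bits' (hypothetical choice)
for _ in range(1000):
    # One value per interval [GAMMA**(i+1), GAMMA**i], for i = 1..k.
    values = [rng.uniform(GAMMA ** (i + 1), GAMMA ** i) for i in range(1, k + 1)]
    w = rng.random()  # target weight in [0, 1)
    mask, remainder = greedy_decompose(w, values)
    # The approximation error shrinks geometrically with the number of values.
    assert 0.0 <= remainder <= GAMMA ** k + 1e-12
```

The overlap between consecutive intervals is what lets the construction tolerate values that are only known to lie in an interval, rather than being exact powers of the base.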
Now consider a random sequence where we sample over the interval for . In the event that there is at least one sample in each interval , we can use the construction above with a subsequence of such that and as in the construction above. Next we lower bound the probability that each interval contains at least one sample. Let be the event “no sample is in ” and let . Then , hence
and thus choosing ensures that . Finally,
and so we can take . ∎
Corollary 10 (Golden-ratio decomposition for weights in ).
For any given , define the probability density for with normalization . Let , For any , if