Are wider nets better given the same number of parameters?
Empirical studies demonstrate that the performance of neural networks improves with an increasing number of parameters. In most of these studies, the number of parameters is increased by increasing the network width. This raises the question: Is the observed improvement due to the larger number of parameters, or is it due to the larger width itself? We compare different ways of increasing model width while keeping the number of parameters constant. We show that for models initialized with a random, static sparsity pattern in the weight tensors, network width is the determining factor for good performance, while the number of weights is secondary, as long as trainability is ensured. As a step towards understanding this effect, we analyze these models in the framework of Gaussian Process kernels. We find that the distance between the sparse finite-width model kernel and the infinite-width kernel at initialization is indicative of model performance.
Deep neural networks have shown great empirical success in solving a variety of tasks across different application domains. One of the prominent empirical observations about neural nets is that increasing the number of parameters leads to improved performance [DBLP:journals/corr/NeyshaburTS14, neyshabur2018the, hestness2017deep, kaplan2020scaling]. The consequences of this effect for model optimization and generalization have been explored extensively. In the vast majority of these studies, both empirical and theoretical, the number of parameters is increased by increasing the width of the network [neyshabur2018the, du2018gradient, allen2019convergence]. Network width itself on the other hand has been the subject of interest in studies analyzing its effect on the dynamics of neural network optimization, e.g. using Neural Tangent Kernels [jacot2018neural, arora2019exact] and Gaussian Process Kernels [wilson2016deep, lee2017deep]. All studies we know of suffer from the same fundamental issue: When increasing the width, the number of parameters is being increased as well, and therefore it is not possible to separate the effect of increasing width from the effect of increasing number of parameters. How does each of these factors — width and number of parameters — contribute to the improvement in performance?
To our knowledge, there is no controlled study that answers the above question. The goal of our work is to decouple the effect of the width from the effect of the number of parameters. In particular, we aim to answer the question of whether the performance improves if we increase the width while keeping the number of parameters constant. We conduct a principled study addressing this question, proposing and testing methods of increasing the width of the network while keeping the number of parameters constant. Surprisingly, we find scenarios under which most of the performance benefits come from increasing the width.
1.1 Our contributions
In this paper we make the following contributions:
We propose three candidate methods (illustrated in Figure 2) for increasing network width while keeping the number of parameters constant.
Linear bottleneck: Substituting each weight matrix by a product of two weight matrices. This corresponds to limiting the rank of the weight matrix.
Non-linear bottleneck: Narrowing every other layer and widening the rest.
Static sparsity: Setting some weights to zero using a mask that is randomly chosen at initialization, and remains static throughout training.
We show that performance can be improved by increasing the width without increasing the number of model parameters. We find that test accuracy can be improved using method 1 or 3, while method 2 only degrades performance. However, method 1 suffers from a separate degradation due to the inferior implicit bias caused by the reparameterization. Consequently, we focus on the sparsity method 3, as it leads to the best results and is applicable to any network type.
We empirically investigate different ways in which random, static sparsity can be distributed among layers of the network and, based on our findings, propose an algorithm to do this effectively (Section 2.3).
We demonstrate that the improvement due to widening (while keeping the number of parameters fixed) holds across standard image datasets and models. Surprisingly, we observe that for ImageNet, increasing the width according to method 3 leads to almost identical performance as when we allow the number of weights to increase along with the width (Section 2.3).
To understand the observed effect theoretically, we study a simplified model and show that the improved performance of a wider, sparse network is correlated with a reduced distance between its Gaussian Process kernel and that of an infinitely wide network. We propose that reduced kernel distance may explain the observed effect (Section 3).
1.2 Related Work
Our work is similar in nature to the body of work studying the role of overparametrization and width. [DBLP:journals/corr/NeyshaburTS14] observed that increasing the number of hidden units beyond what is necessary to fit the training data leads to improved test performance, and attributed this to the inductive bias of the optimization algorithm. [soltanolkotabi2018theoretical, neyshabur2018the, allen2019convergence] further studied the role of over-parameterization in improving optimization and generalization. [woodworth2020kernel] studied the implicit regularization of gradient descent in the over-parameterized setting, and [belkin2019reconciling] investigated the behavior at interpolation. Furthermore, [lee2017deep] showed that networks at initialization become Gaussian Processes in the large width limit, and [jacot2018neural] showed that infinitely wide networks behave as linear models when trained using gradient flow. [2020arXiv200715801L] systematically compared these different theoretical approaches. In all the above works, the number of parameters is increased by increasing the width. In this work, by contrast, we conduct a controlled study of the effect of width by keeping the number of parameters fixed.
Perhaps [littwin2020collegial] is the closest work to ours, which investigates overparametrization achieved through ensembling as opposed to layer width. Ensembling is achieved by connecting several networks in parallel into one “collegial” ensemble. They show that for large ensembles the optimization dynamics simplify and resemble the dynamics of wide models, yet scale much more favorably in terms of number of parameters. However, the method employed there is borrowed from the ResNeXt architecture [2016arXiv161105431X], which involves altering the overall structure of the network as width is increased. In this work we try to make minimal changes to network structure, in order to isolate the effect of width on network performance.
Finally, static sparsity is the basis of the main method we use to increase network width while keeping the number of parameters fixed. There is a large body of work on the topic of sparse neural networks [Park2016HolisticSF, Narang2017ExploringSI, Bellec2018DeepRT, LTH, Elsen2019FastSC, Gale2019TheSO, you2019drawing], and many studies derive sophisticated approaches to optimize the sparsity pattern [SNIP, evci2019rigging, GraSP, SynFlow]. In our study, however, sparsity itself is not the subject of interest. In order to minimize its effect in our controlled experiments, the sparsity pattern that we apply is randomly chosen and static. A recent work [frankle2020pruning] demonstrates that a random, fixed sparsity pattern leads to equal performance as the more involved methods for pruning applied at initialization, when the per-layer sparsity distribution is the same. However, we do not explore this direction in our study.
2 Empirical Investigation
In this section, we first explain our experimental methodology and then investigate the effectiveness of different approaches to increase width while keeping the number of parameters fixed. Finally, we discuss the respective roles of width and the number of parameters in improving performance.
In order to identify how width and number of parameters separately affect performance, we need to decouple these quantities. For a fully-connected layer, width is the number of hidden units, and for a convolutional layer, width corresponds to the number of output channels. Increasing the width of one layer ordinarily entails an increase in the number of weights. Therefore, some adjustment is required to keep the total number of weights constant. This adjustment can be made at the level of layers by reducing some other dimension (i.e., by introducing a “bottleneck”), or at the level of weights by setting some of them to be zero. When choosing a method for our analysis, we take particular care about possible confounding variables, as many aspects of a neural network are intertwined. For example, we prefer to keep the number of non-linear layers in the network fixed, as it has been shown that changing it can significantly affect the expressive power, optimization dynamics and generalization properties of the network. With these constraints in mind, we specify three different methods listed in Section 1.1. In this section, we discuss these methods in detail, evaluate them, and present the results obtained on image classification tasks.
In summary, our approach is as follows: We select a network architecture which specifies layer types and arrangement (e.g., MLP with some number of hidden layers, or ResNet-18), set layer widths, and count the number of weights. We refer to this model as the baseline, and derive other models from it by increasing the width while keeping the number of weights constant. In this way, we obtain a family of models that we train using the same training procedure, and compare their test accuracies. For comparison and as a sanity check, we also consider the dense variants of the wider models.
We tested a variety of simple models, and observed the same general behavior in the context of our research question. For the discussion in this paper, we focus on two model types: an MLP with one hidden layer, and a ResNet-18. We chose the ResNet-18 architecture because it is a standard model widely used in practice. Its size is small enough that it allows us to increase the width substantially, yet it has sufficient representational power to obtain a nontrivial accuracy on standard image datasets. When creating a family of ResNet-18 models, we refer to the number of output channels of the first convolutional layer as the width, and do not alter the default width ratios of all following layers. Further details of each experiment and figure are specified in Appendix A.
2.2 Bottleneck methods
Here we discuss the two bottleneck methods described in the previous section and in Figure 2.
Substitute each weight matrix W of size m x n by a product UV, where U has size m x k and V has size k x n. The inner dimension k is chosen so that the total number of weights, k(m + n), stays fixed as the outer dimensions grow.
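The parameter arithmetic of the linear bottleneck can be made concrete with a small helper (the function names and symbols here are our own, for illustration only):

```python
def factorized_params(m, n, k):
    """Number of weights in the factorization U (m x k) times V (k x n)."""
    return k * (m + n)

def max_rank_for_budget(m, n):
    """Largest inner dimension k whose factorization fits the dense
    budget of m * n weights."""
    return (m * n) // (m + n)
```

Widening m and n while shrinking k keeps the weight count fixed, at the cost of a lower-rank effective weight matrix; e.g. a 100 x 100 dense matrix admits an inner dimension of at most 50 at the same budget.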
One way to create a non-linear bottleneck is to split each layer in two as described above and add a non-linearity to the first layer. The problem with this approach is that adding a non-linearity changes the expressive power of the model, and hence adds a factor that we cannot control. An alternative approach, which is more favorable in our case and which we adopt here, is to modify the layers in pairs. The input dimension of the first layer is increased, while its output dimension (and consequently the input dimension of the second layer) is reduced. Non-linear bottlenecks are widely used in practice, particularly as a way to save parameters in very deep architectures, such as deeper variants of ResNet [he2016deep] and DenseNet [huang2017densely].
The performance of these methods on the CIFAR-10 and CIFAR-100 datasets is demonstrated in Figure 3. The results indicate that increasing width using the linear bottleneck indeed leads to improved accuracy up to a certain width. Moreover, this effect is more pronounced in smaller models that are less overparameterized. This is similar to the improvement gained when increasing the width along with the number of parameters (without a bottleneck): The improvement tends to be diminished when the base network is wider. However, as discussed above, the act of substituting a weight matrix by a product of two weight matrices changes the implicit bias of optimization. Indeed, the performance of the original model, indicated by dashed horizontal lines in Figure 3, is significantly higher than that of the transformed models. Therefore, even though the width increase at a constant number of weights with the linear bottleneck method improves the result of the narrowest model obtained after the transformation, it typically does not outperform the default ResNet-18 model. Due to this lack of control over the effect of inductive bias, we conclude that this method does not qualify for use in our controlled experiments.
The non-linear bottleneck method does not suffer from the same issues as the linear version in terms of inductive bias of reparametrization. However, as Figure 3 demonstrates, no empirical improvement is obtained by increasing the width, except for the model with baseline width (and then the improvement is mild). Therefore, we conclude that the non-linear bottleneck model does not show a significant enough improvement to be considered as an effective method for our study.
2.3 Sparsity method
We now turn to the sparsity method, which is the main method considered in this work, illustrated in Figure 1(c). We start with a baseline model that has dense weight tensors. We then increase the width by a widening factor, sparsifying the weight tensors by setting a randomly chosen subset of parameters to zero throughout training. The total number of tunable parameters is the same as that of the baseline model. The advantage of the sparsity method over the bottleneck methods considered earlier is that it allows us to control the number of parameters without altering the overall network structure. In an attempt to run a controlled experiment and minimize the differences between the sparsified setup and the baseline setup, we choose the sparsity mask at random during initialization and keep it static during training. In this sense, our sparsity method differs from most other pruning methods discussed in the literature, where the aim is to maximize performance. We define the connectivity of a sparse model to be the ratio between the number of tunable parameters and the number of parameters in a dense model of the same width.
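For a single fully-connected layer, the procedure can be sketched as follows (a minimal sketch with our own names and shapes, not the experiment code; the mask is drawn once at initialization and reapplied after every update):

```python
import numpy as np

rng = np.random.default_rng(0)

def widen_with_static_sparsity(d_in, d_out, widen_factor):
    """Widen a dense d_in x d_out layer by `widen_factor` while keeping the
    number of trainable weights fixed, via a random, static 0/1 mask."""
    n_dense = d_in * d_out                        # weights in the baseline layer
    wide_shape = (d_in, d_out * widen_factor)     # widened layer
    n_wide = wide_shape[0] * wide_shape[1]
    connectivity = n_dense / n_wide               # fraction of weights kept
    # choose which weights stay trainable, uniformly at random, once at init
    mask = np.zeros(n_wide, dtype=bool)
    mask[rng.choice(n_wide, size=n_dense, replace=False)] = True
    mask = mask.reshape(wide_shape)
    W = rng.normal(size=wide_shape) * mask        # masked initialization
    return W, mask, connectivity
```

With widening factor k, the connectivity of such a layer is 1/k, so the number of tunable weights matches the baseline exactly.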
In order to implement the sparsity method, we need to choose how to distribute the sparsity both across network layers and within each layer. We choose the number of weights to be removed in each layer to be proportional to the number of parameters in that layer, except for BatchNorm layers which are kept intact. Within each layer, the positions of the weights to remove are chosen randomly, sampled uniformly across all dimensions of the weight tensor. In our experiments, this method of distributing the weights led to the best performance among the different methods we tried. See the following discussion as well as Appendix C for comparisons of different sparsity distribution methods.
We study a fully-connected network with one hidden layer trained on MNIST. We use this simple setup to compare different sparsity distribution patterns across layers for a given network connectivity, and to compare the effects of increased width with and without ReLU non-linearities.
The dense MLP at width n has two weight matrices: the first layer matrix has size 784 x n and the last layer has size n x 10. In the experiments, we set the total number of weights to 3970 (a dense MLP of width 5) and consider widths ranging from 5 to 160, corresponding to network connectivities between 1.0 and 0.03. At each width, we vary the connectivity of the last layer (the smaller of the two) between 1.0 and 0.1.
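Concretely, the connectivity at a given width follows from the MNIST dimensions 784 and 10 (a small helper of our own for illustration):

```python
def mlp_connectivity(width, total_weights=3970, d_in=784, d_out=10):
    """Connectivity of a one-hidden-layer MLP of the given width, constrained
    to a fixed total number of tunable weights."""
    dense_weights = (d_in + d_out) * width   # weights of the dense model
    return total_weights / dense_weights
```

At the base width 5 the connectivity is exactly 1.0, and it falls inversely with the width.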
The results are shown in Figure 4. We find that sparse, wide models can outperform the dense, baseline models for both ReLU and linear activations. The ReLU model attains its maximum performance at around 3-6% connectivity, which corresponds to a widening factor of 16 or 32. At the optimal point the connectivity of the last layer remains high. It is therefore advantageous in this case to remove more weights from the first layer than from the last layer. This makes intuitive sense: Removing weights from a layer that starts out with fewer weights can be expected to make optimization more difficult. This result motivates our choice to remove weights proportionally to layer size when sparsifying other models. Finally, the fact that larger width leads to improvement even in the deep linear model implies that the improvement cannot be attributed only to the increased model capacity that a wider ReLU network enjoys.
We train families of ResNet-18 models on ImageNet, CIFAR-10, CIFAR-100 and SVHN, covering a range of widths and model sizes. A detailed example of the sparsity distribution over all layers is shown in Appendix D.
Table 1: ImageNet test accuracy (%) for models of increasing width, left to right; the number of weights in millions is given in parentheses.

| dense  | 68.03 (11.7) | 69.11 (22.8) | 70.22 (45.7) | 70.91 (90.7) | 71.89 (180.6) |
| sparse | –            | 69.56 (11.7) | 70.02 (11.7) | 70.66 (11.7) | 70.53 (11.7)  |
Table 1 shows the results obtained on ImageNet. As expected, the performance improves as the width and the number of weights increase (row 1). However, up to a certain width, a comparable improvement is achieved when only the width grows and the number of weights remains fixed (row 2). We believe the reason for declining test accuracy at yet larger widths is that the connectivity becomes too small and impairs the trainability of the model. The impaired trainability can be observed in Figure 6 and the additional plots presented in Appendix E. Therefore, in this case the determining factor for model performance is width rather than number of parameters. Figure 5 shows the results of training model families on several additional datasets. We again find that performance improves with width at a fixed number of parameters. The effect is most pronounced for more difficult tasks and for smaller models that do not reach 100% training accuracy, yet it is still present for models that do fit the training set (see also Appendix E).
Figure 6 compares the performance improvement of a sparse, wide model against that of a dense model with the same width. In particular, it shows the fraction of the improvement that can be attributed to width alone: That is, the ratio between the sparse/baseline accuracy gap and the dense/baseline accuracy gap. We see that, as long as trainability is not impaired by sparsifying, most of the improvement in performance can be attributed to the width.
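This ratio can be written as a one-line helper (the function name is our own; the example numbers are taken from Table 1):

```python
def width_attribution(acc_baseline, acc_sparse_wide, acc_dense_wide):
    """Fraction of the dense widening gain recovered by a sparse model of the
    same width: ratio of the two accuracy gaps over the baseline."""
    return (acc_sparse_wide - acc_baseline) / (acc_dense_wide - acc_baseline)
```

For instance, with the baseline at 68.03%, the sparse model at 69.56% and the dense model of the same width at 69.11%, the ratio exceeds 1: at that width, the sparse model recovers the full dense gain.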
3 Theoretical Analysis in a Simplified Setting
We showed empirically that wide, sparse networks with random connectivity patterns can outperform dense, narrower networks when the number of parameters is fixed. The performance as a function of the width has a maximum when the network is sparse. In this section we investigate a potential theoretical explanation for this effect.
It is well known that wider (dense) networks can achieve consistently better performance. In the infinite-width limit, the training dynamics of neural networks is equivalent (under certain conditions) to kernel-based learning, where the kernel is a function of the model parameters at initialization [jacot2018neural]. We conjecture that the kernel of a finite-width network at initialization is indicative of its performance, and that optimal performance is achieved when its distance to the infinite-width kernel is minimized. We further hypothesize that this distance can be reduced by increasing the network width at a fixed number of parameters. In the following, we formalize this conjecture and derive expressions for the kernel of a sparse finite-width network with one hidden layer. We calculate the kernel distance theoretically, and show that the distance predicted using this result is in good agreement with experiments.
Consider a 2-layer ReLU network function given by
Here, x is the input, the network parameters are the first- and second-layer weights (for simplicity, we omit the biases), and Θ is the Heaviside step function. We note that in this section we use NTK parameterization [jacot2018neural] for simplicity, whereas in previous sections we used LeCun parameterization. Each network parameter is sampled independently from a zero-mean Gaussian with probability p, and is set to zero with probability 1 - p. The probability density function for each parameter is therefore given by
where δ is the Dirac delta function. In the following, we consider the Gaussian Process (GP) kernel of the network, defined as the expected product of network outputs, K(x1, x2) = E[f(x1) f(x2)].
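At finite width, this expectation over weight draws can be estimated by Monte Carlo. The following sketch uses our own normalization conventions (an NTK-style 1/sqrt scaling, Gaussian weights kept with probability p), which are assumptions rather than the paper's exact definitions:

```python
import numpy as np

def sparse_gp_kernel_mc(x1, x2, width, p, n_samples=2000, seed=0):
    """Monte Carlo estimate of the GP kernel E[f(x1) f(x2)] of a 2-layer
    ReLU network whose weights are standard Gaussian with probability p
    and zero otherwise (conventions assumed, not taken from the paper)."""
    rng = np.random.default_rng(seed)
    d = len(x1)
    vals = np.empty(n_samples)
    for s in range(n_samples):
        # Bernoulli(p)-sparse Gaussian weights, redrawn for every sample
        U = rng.normal(size=(width, d)) * (rng.random((width, d)) < p)
        v = rng.normal(size=width) * (rng.random(width) < p)
        h1 = np.maximum(U @ x1 / np.sqrt(d), 0.0)  # hidden ReLU activations
        h2 = np.maximum(U @ x2 / np.sqrt(d), 0.0)
        vals[s] = (v @ h1) * (v @ h2) / width      # f(x1) * f(x2)
    return vals.mean()
```

Comparing such an estimate, at several widths and connectivities, against the infinite-width kernel is the kind of empirical check reported in Figure 6(b).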
Consider the GP kernel of a 2-layer network with ReLU activations, where the weights are sampled from the sparse distribution defined above. The mean squared distance between this kernel and the (dense) infinite-width kernel is
Here, the mask is a vector with elements in {0, 1}, and the masked input is the input vector with the corresponding elements zeroed out.
The proof is provided in Appendix F. Also in the Appendix, we derive the following closed-form approximation to the kernel distance (3) under the assumptions that the input vectors are independent random vectors and that the input dimension is large:
In order to keep the number of parameters fixed as we change the width, we scale the connectivity inversely with the width. Under this constraint, and in the sparse regime, the distance (6) is minimized at an intermediate width.
In Figure 6(b) we compare this approximation with the GP kernel computed empirically at initialization. We find good agreement with the theoretical prediction (6) in the sparse regime. Furthermore, we see that the minimal kernel distance at initialization and the optimal performance of the trained network are obtained at a similar width, providing evidence in support of our hypothesis.
In this work we studied the question: Do wider networks perform better because they have more parameters, or because of the larger width itself? We considered several ways of increasing the width while keeping the number of parameters fixed, either by introducing bottlenecks into the network, or by sparsifying the weight tensors using a static, random mask. Among the methods we tested, the one that led to the best results was removing weights at random in proportion to the layer size, using a static mask generated at initialization. In our image classification experiments, increasing the width using this sparsity method (while keeping the total number of parameters constant) led to significant improvements in model performance. The effect was strongest when starting with a narrow baseline model. Additionally, when comparing the wide, sparse models against dense models of the same width, we found that the width itself accounts for most of the performance gains; this holds true up to the width above which the training accuracy of the sparse models begins to deteriorate, presumably due to low connectivity between neurons.
Focusing on the sparsity method, we initiated a theoretical study of the effect, hypothesizing that the improvement in performance is correlated with having a Gaussian Process kernel that is closer to the infinite-width kernel. We computed the GP kernel of a sparse, 2-layer ReLU network, and derived a simple approximate formula for the distance between this kernel and the infinite-width dense kernel. In our experiment, we found a surprisingly strong correlation between the model performance and the distance to the infinite-width kernel.
While our work is fundamental in nature, and sparsity is not the subject of this paper, the method we propose may lead to practical benefits in the future. Using current hardware and available deep learning libraries, we cannot reap the benefits of a sparse weight tensor in terms of reduced computational budget. However, in our experiments we find that the optimal sparsity can be around 1-10% for convolutional models (corresponding to a widening factor between 3 and 10). Therefore, using an implementation that natively supports sparse operations, our method may be used to build faster, more memory-efficient networks.
The authors would like to thank Ethan Dyer, Utku Evci, Etai Littwin, Joshua Susskind, and Shuangfei Zhai for useful discussions.
Appendix A Experimental details
In this section we provide experimental details and additional information about the figures in the main text.
In all experiments, we use a standard PyTorch implementation of the ResNet-18 model. We set the number of weights in the model by changing the number of output channels in the first convolutional layer (referred to as the model width), while leaving the width ratios between the first convolutional layer and the subsequent four blocks of the ResNet-18 at their default values (1, 2, 4, 8). We do not include the weights of the BatchNorm layers in the total weight count, and we do not sparsify these layers.
All models are trained using SGD with momentum 0.9, cross-entropy loss, and an initial learning rate of 0.1. The learning rate value and schedule were tuned for the smallest baseline model. For ImageNet, we use weight decay 1e-4, a cosine learning rate schedule, and train for 150 epochs. For the other datasets, we use weight decay 5e-4, train for 300 epochs, and decay the learning rate at epochs 50, 120 and 200 with gamma=0.1. We do not apply early stopping, and we report the best achieved test accuracy.
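For the non-ImageNet runs, the resulting step schedule can be written as a small helper (our own sketch of the hyperparameters above):

```python
def lr_at_epoch(epoch, base_lr=0.1, milestones=(50, 120, 200), gamma=0.1):
    """Learning rate under a step schedule: multiply by gamma at each
    milestone epoch that has been passed."""
    drops = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** drops
```

This mirrors what PyTorch's MultiStepLR scheduler computes with the same milestones and gamma.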
In Figure 1, the baseline model has width 64 (1e7 weights) for ImageNet, and 18 (9e5 weights) for the other datasets.
In Figure 5, we consider baseline models with base widths [8, 12, 18, 40, 64], corresponding to a total of [1.8e5, 4.0e5, 9.0e5, 4.4e6, 1.1e7] weights respectively.
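These weight counts grow approximately quadratically with the base width, as expected for convolution-dominated models; a rough consistency check (our own helper, an approximation rather than an exact count):

```python
def approx_weight_count(base_width, base_weights, new_width):
    """Rough weight count of a widened conv model: convolutional layers
    dominate, and their parameter count grows quadratically with width."""
    return base_weights * (new_width / base_width) ** 2
```

For example, scaling the width-8 model (1.8e5 weights) to width 18 predicts roughly 9.1e5 weights, close to the listed 9.0e5.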
All networks are MLPs with one hidden layer, a total of 3970 weights (base width 5), and either (a) ReLU or (b) linear activation function. The networks are parameterized according to the standard PyTorch implementation (weights and biases are randomly initialized from the uniform distribution). We train these models on MNIST for a fixed number of 300 epochs (ensuring convergence), with the SGD optimizer, no momentum, cross-entropy loss, a constant learning rate of 0.1, and mini-batch size 100.
For ReLU, the highest test accuracy (marked by white stars in the plot) is 96.3%, achieved by models with connectivity 0.06 (width 80) or 0.03 (width 160). For Linear, it is 92.7%, achieved at connectivity 0.13 (width 40). The color scheme is centered at the test accuracy attained by the baseline model (approximately 90% in both cases), and its upper limit is set to the respective highest achieved value. Empty (white) cells correspond to invalid combinations of connectivity values. Note that the horizontal axis is not uniformly spaced.
The MLP has one hidden layer, no biases, ReLU activation function, and NTK-style parameterization. It is trained on a subset of 2048 samples from the MNIST training set and tested on the full MNIST test set. The input is normalized with pixel mean and standard deviation as (image - mean)/stdev. We train for 300 epochs with vanilla SGD using Cross-Entropy loss and batch size 256. The learning rate was tuned separately for each width.
The number of weights in a dense model is 794 times its width, while all sparse models have the same number of weights as the smallest dense model (width 8): 6,352. The empirical approximation of the infinite-width kernel is computed on a very wide dense MLP at initialization.
A.1 ImageNet data preprocessing
In order to decrease the size of the dataset and be able to download it to cloud instances with limited storage, we resized all images in the dataset by keeping their proportions fixed and setting their smallest dimension to 256. This procedure reduces the accuracy of ResNet models by about 1-2%.
Transformations for ImageNet:
We follow the standard transformations used in training on ImageNet. Following is the list of PyTorch data transformations applied to each image.
RandomResizedCrop(size=224, scale=(0.2, 1.0)) on the training set.
Resize(256) followed by CenterCrop(224) on the test set.
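Assembled as torchvision pipelines, the two lists above read as follows (a sketch; the ToTensor step is our assumption of a standard pipeline and is not stated in the text):

```python
import torchvision.transforms as T

# Training-set pipeline: RandomResizedCrop as specified above.
train_tf = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.ToTensor(),
])

# Test-set pipeline: Resize, then CenterCrop.
test_tf = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
])
```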
Appendix B Sparsity distribution code
The following code implements our algorithm for distributing sparsity over model layers. Figure 8 illustrates the procedure.
def get_ntf(num_to_freeze_tot, num_W, tensor_dims, lnames_sorted):
    """
    Distribute the total number of weights to freeze over model layers.

    Parameters
    num_to_freeze_tot (int) - total number of weights to freeze.
    num_W (dict) - layer names (keys) and number of weights in layer (vals).
    tensor_dims (dict) - layer names (keys) and the dimensions of layer tensor (vals).
    lnames_sorted (list of str) - layer names, sorted by magnitude in descending order.

    Returns
    num_to_freeze (list of int) - number of weights to freeze per layer,
        order corresponding to lnames_sorted.
    """
    num_layers = len(lnames_sorted)
    num_to_freeze = np.zeros(num_layers, dtype=int)  # init

    # list of num. weights in layer, in sorted order (largest first)
    num_W_sorted_list = [num_W[lname] for lname in lnames_sorted]

    # compute num. weights differences between layers
    num_W_diffs = np.diff(num_W_sorted_list)
    num_W_diffs = [abs(d) for d in num_W_diffs]

    # auxiliary vector for the following dot product to compute the bins
    aux_vect = np.arange(1, len(num_W_diffs) + 1)

    # the bins of the staggered sparsification: array of max. num. of weights
    # that can be frozen within the given layer before the next-smaller layer
    # gets involved into sparsification
    ntf_lims = [np.dot(aux_vect[:k], num_W_diffs[:k]) for k in range(1, num_layers)]

    # find in which bin num_to_freeze_tot falls - this gives the number of
    # layers to sparsify (find_ge returns the first element >= the given
    # value together with its index)
    lim_val, lim_ind = find_ge(ntf_lims, num_to_freeze_tot)
    num_layers_to_sparsify = lim_ind + 1

    # base fill: chunks of num. weights that are frozen in each involved layer
    # until all involved layers have equal num. weights remaining
    base_fill = [sum(num_W_diffs[lind:lim_ind]) for lind in range(lim_ind)]
    base_fill.append(0)

    # the rest is distributed evenly over all layers involved
    rest_tot = num_to_freeze_tot - sum(base_fill)
    rest = int(np.floor(rest_tot / num_layers_to_sparsify))
    num_to_freeze[:num_layers_to_sparsify] = np.array(base_fill) + rest

    # first layer gets the few additional frozen weights when rest_tot is not
    # evenly divisible by num_layers_to_sparsify
    rest_mismatch = rest_tot - rest * num_layers_to_sparsify
    num_to_freeze[0] += rest_mismatch

    assert sum(num_to_freeze) == num_to_freeze_tot
    return num_to_freeze
Appendix C Sparsity distribution in convolutional layers
The weight tensor of a convolutional layer has 4 dimensions: input, output, kernel width, kernel height. As discussed in Section 2.3, when reducing network connectivity, we remove weights randomly across all dimensions of a weight tensor. A reasonable alternative for convolutional layers is to remove weights in the input and output dimensions only, leaving the kernels themselves unchanged. We test this approach on ResNet-18 and find that it leads to very similar results in general. Specifically for smaller models, removing weights along all tensor dimensions results in better performance. For the smallest networks (base widths 8 and 12) and small connectivity, we observed a gap of up to 3% in test accuracy on both CIFAR datasets. Figure 9 shows results on CIFAR-100, where the effect is more pronounced, for models with base widths 8, 12 and 18 (1.8e5, 4.0e5 and 9.0e5 weights, respectively). Each experiment was repeated for 10 random seeds.
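The alternative scheme can be sketched as follows (a hypothetical helper of our own: it masks whole input-output channel pairs, so each kernel is kept or dropped as a unit):

```python
import numpy as np

def channel_pair_mask(shape, keep_frac, rng):
    """Boolean mask over a conv weight tensor of shape (out, in, kh, kw)
    that zeroes entire kernels instead of individual weights."""
    out_c, in_c, kh, kw = shape
    pair_mask = rng.random((out_c, in_c)) < keep_frac
    # broadcast the per-pair decision over the kernel spatial dimensions
    return np.broadcast_to(pair_mask[:, :, None, None], shape)
```

In contrast, the default scheme used in the paper draws the mask uniformly over all four tensor dimensions.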
Appendix D Sparsity distribution in ResNet-18
On a coarse level, the ResNet-18 architecture is as follows: one convolutional layer, followed by four modules, followed by one fully-connected layer; each module comprises two blocks, and each block contains two convolutional layers. The number of output channels in the first convolutional layer is the same as in the first module, and the output channel numbers of the subsequent modules follow the fixed ratios 1:2:4:8; we do not change these ratios in our experiments. When building a family of ResNet-18 models, we vary the number of output channels of the first convolutional layer, and refer to this as the width of the model, while the widths of all subsequent layers are set according to the mentioned ratios.
When reducing the connectivity of a ResNet-18 model, we remove weights from each layer according to layer size. More precisely, we first remove weights from the layer with the largest number of weights until it reaches the size of the next-smaller layer. We then proceed with removing weights from these two layers equally, and continue this procedure until the targeted total number of weights in the network is achieved.
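This staggered, water-filling procedure can be sketched in pure Python (our own simplified reimplementation of the code in Appendix B; it returns per-layer removal counts ordered from largest to smallest layer):

```python
def staggered_freeze(layer_sizes, n_freeze):
    """Distribute n_freeze removed weights over layers, largest layers first:
    shave the biggest layer(s) down to the next level, then split evenly."""
    sizes = sorted(layer_sizes, reverse=True)
    remaining = list(sizes)
    while n_freeze > 0:
        top = max(remaining)
        idxs = [i for i, s in enumerate(remaining) if s == top]
        lower = max((s for s in remaining if s < top), default=0)
        # freeze at most down to the next level, and at most n_freeze in total
        chunk = min(len(idxs) * (top - lower), n_freeze)
        per = chunk // len(idxs)
        for i in idxs:
            remaining[i] -= per
        n_freeze -= per * len(idxs)
        if per == 0:
            # leftover smaller than one weight per layer: give it to the first
            remaining[idxs[0]] -= n_freeze
            n_freeze = 0
    return [s - r for s, r in zip(sizes, remaining)]
```

For example, with layers of 10, 6 and 4 weights and 7 weights to remove, the largest layer is first shaved down to 6, and the remaining removals are then split between the two largest layers.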
Figure 10 shows the layer-wise sparsity distribution in a ResNet-18 with 1.8e5 weights at various widths as an example.
Appendix E Additional figures for ResNet-18 experiments
Appendix F Theoretical details
In this section we provide additional details on the theoretical analysis of Section 3.
Proof (Theorem 1).
We begin by defining the integral
In particular, for we have the relation .
A straightforward calculation gives the following.
Here is a 0,1 vector of length , , and is the vector with some elements zeroed out ( and are defined similarly). The integration is over all the non-zero s, namely .
Consider now the sparse kernel . It is easy to check that
Let be the dense kernel (with ) at width , and let be the dense infinite-width kernel. From (11) and (12) we see that , and . Using these results, the mean square distance between the sparse and infinite-width kernels is now given by
F.1 Approximating the kernel distance
Next, we derive the approximate form (6) by using plausible arguments. In this calculation we assume that , and that are independent random vectors with elements sampled from . The derivation is not rigorous, but we compare the results against an empirical calculation in Figure 7 and find good agreement when .
For given we expect the dominant contribution to in the sum (10) to come from terms where , and so we consider a mask with this property.
We can then approximate and (and similarly for ).
where is a random sign.
Next, we consider the integrals .
We will rely on the following result from [cho2009kernel].
Using the random vector approximations, we find
For the sparse functions , we assume as before that the contribution from the sum over masks is concentrated where the mask . For a mask obeying this condition we then have . In particular,
We can also consider the diagonal elements, setting and . We have . In particular,
We would like to find the minimum of the distance (28) when keeping the mean number of parameters fixed. Roughly, we would like to minimize with constant. Assuming that the minimum is at , we find that the minimum is at
- In the case of a ResNet-18 model, each convolutional layer is replaced by two convolutional layers with kernel dimensions and , respectively, and with before widening.
- Here we used the fact that these are random vectors of effective dimension , and that for such a vector we have .
- The definition here has a factor of 2 difference compared with [cho2009kernel].