Calibrate and Prune: Improving Reliability of Lottery Tickets Through Prediction Calibration
The hypothesis that sub-network initializations (lottery tickets) exist within the initializations of over-parameterized networks, which when trained in isolation produce highly generalizable models, has led to crucial insights into network initialization and has enabled computationally efficient inference. In order to realize the full potential of these pruning strategies, particularly when utilized in transfer learning scenarios, it is necessary to understand the behavior of winning tickets when they might overfit to the dataset characteristics. In supervised and semi-supervised learning, prediction calibration is a commonly adopted strategy to handle such inductive biases in models. In this paper, we study the impact of incorporating calibration strategies during model training on the quality of the resulting lottery tickets, using several evaluation metrics. More specifically, we apply a suite of calibration strategies to different combinations of architectures and datasets, and evaluate the fidelity of sub-networks retrained based on winning tickets. Furthermore, we report the generalization performance of tickets across distributional shifts, when the inductive biases are explicitly controlled using calibration mechanisms. Finally, we provide key insights and recommendations for obtaining reliable lottery tickets, which we demonstrate to achieve improved generalization.
With an over-parameterized neural network, pruning or compressing its layers, while not compromising performance, can significantly improve the computational efficiency of the inference step. However, until recently, training such sparse networks directly from scratch has been challenging, and most often they have been found to be inferior to their dense counterparts. Frankle and Carbin, in their work on the lottery ticket hypothesis (LTH), showed that one can find sparse sub-networks embedded in over-parameterized networks, which when trained using the same initialization as the original model can achieve similar or sometimes even better performance. Surprisingly, even aggressively pruned networks ( weights pruned) were shown to be comparable to the original network, as long as they were initialized appropriately. Such a well-performing sub-network is often referred to as a winning lottery ticket or simply a winning ticket.
Following this pivotal work, several studies have been carried out to understand the role of initialization, the effect of the pruning criterion used and the importance of retraining the sub-networks [32, 7, 23, 5, 12, 26] for the success of lottery tickets. In order to fully realize the merits of these pruning approaches, it is critical to understand the behavior of winning tickets when they might be overfit to the dataset characteristics or specific training configurations, e.g., combinations of model architecture and optimizer. For example, Desai et al. evaluated winning tickets under data distribution shifts, and found that the tickets demonstrated strong generalization capabilities. Similarly, a related study reported that the winning tickets generalized reasonably well across changes in the training configuration.
In supervised and semi-supervised learning, prediction calibration is a widely adopted strategy to avoid picking up inductive biases that compromise the reliability of models [3, 2]. Broadly, calibration is the process of adjusting predictions to improve the error distribution of a predictive model. For example, in the MixMatch approach for semi-supervised learning, an alignment cost on marginal distributions is used to improve the generalization of models to unseen data. In this paper, we propose to study the impact of incorporating calibration strategies during model training on the behavior of the resulting lottery tickets. To this end, we explore a suite of calibration strategies, and evaluate the performance of lottery tickets, in terms of accuracy and calibration metrics, on several dataset/model combinations. Finally, we also investigate the performance of those tickets using transfer learning experiments, and provide key insights for obtaining reliable lottery tickets.
2 Lottery Ticket Hypothesis
Formally, the process of lottery ticket training can be described as follows: (i) train an over-parameterized model, starting from its initial parameters, to infer the final parameters; (ii) prune the model by applying a mask identified using a masking criterion, e.g., LTH uses weight magnitudes; (iii) reinitialize the sparse sub-network by resetting the non-zero weights to their original initial values, and retrain. These steps are repeated until a desired level of pruning is achieved.
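The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the parameters are a single flat array, `train_fn` is a hypothetical stand-in for any training routine that returns the trained weights, and the names `magnitude_mask` and `lottery_ticket` are introduced here only for illustration.

```python
import numpy as np

def magnitude_mask(weights, mask, prune_ratio):
    """Zero out the `prune_ratio` fraction of still-active weights with the smallest magnitude."""
    active = np.flatnonzero(mask)                       # indices that survived earlier rounds
    n_prune = int(round(prune_ratio * active.size))
    if n_prune == 0:
        return mask.copy()
    order = np.argsort(np.abs(weights.ravel()[active]))  # smallest magnitudes first
    new_mask = mask.copy().ravel()
    new_mask[active[order[:n_prune]]] = 0
    return new_mask.reshape(mask.shape)

def lottery_ticket(theta_init, train_fn, rounds, prune_ratio=0.2):
    """Iterative magnitude pruning with rewinding to the original initialization."""
    mask = np.ones_like(theta_init)
    for _ in range(rounds):
        theta_final = train_fn(theta_init * mask, mask)         # (i) train the (masked) model
        mask = magnitude_mask(theta_final, mask, prune_ratio)   # (ii) prune by weight magnitude
        # (iii) rewind: the next round restarts from theta_init * mask
    return mask, theta_init * mask
```

With a toy `train_fn` such as `lambda w, m: 2.0 * w * m`, two rounds at a 20% ratio on ten weights leave six active weights, each still holding its original initial value.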
Why Does LTH Work?
The work by Zhou et al. sheds light on the reasons for the success of LTH training. The authors generalized the iterative magnitude pruning of LTH, and proposed several other choices for the pruning criterion and the initialization strategy. Most importantly, they reported that retaining the signs from the original initialization is the most crucial factor, and also argued that zeroing out certain weights is a form of training and hence accelerates convergence. However, these variants still require training the over-parameterized model and hence do not save training computations. Consequently, Wang et al. computed the gradient flows of a network, and performed pruning prior to training, such that the gradient flows are preserved. Note that alternate pruning approaches exist in the literature; for instance, variational dropout has been adopted for sparsifying networks, and Lee et al. improved upon this by using a sparsity-inducing Beta-Bernoulli prior.
Is Retraining Required?
Another key finding from LTH studies is that randomly initialized, over-parameterized networks contain sub-networks that lead to good performance without updates to their weight values. Note that identifying such sub-networks still requires training. Similar results were reported with Weight Agnostic Networks. These works disentangle weight values from the network structure, and show that structure alone can encode sufficient discriminatory information. Another intriguing observation from these studies is that certain initialization distributions such as Kaiming Normal and Scaled Kaiming Normal are considerably better than others such as Kaiming Uniform and Xavier Normal.
Transfer Learning using LTH
Pruning and transfer learning have been studied before [22, 33]; however, there are only a handful of works so far that have explored the connection between transfer learning and LTH. For example, one such study investigates the transfer of initializations instead of transferring learned representations. In particular, it was found that winning tickets from large datasets transferred well to small datasets, when the datasets were assumed to be drawn from similar distributions. This empirical result hints at the potential existence of a distribution of tickets that can generalize across datasets. In this spirit, Mehta introduced the Ticket Transfer Hypothesis: there exists a sparse sub-network of a model trained on the source data, which when fine-tuned to the target data will perform comparably to a model that is obtained by fine-tuning the dense model directly.
3 Improving Winning Tickets using Prediction Calibration
In this paper, we use the term calibration to refer to any strategy that is utilized to adjust the model predictions to match any prior on the model’s behavior or error distribution. Formally, let us consider a $K$-way classification problem, where $\mathbf{x}$ and $y$ denote the input data and its corresponding label respectively. We assume that the observed samples are drawn from the unknown joint distribution $p(\mathbf{x}, y)$. The task of classifying any sample $\mathbf{x}$ amounts to predicting the tuple $(\hat{y}, \hat{p})$, where $\hat{y}$ represents the predicted label and $\hat{p}$ is the likelihood of the prediction. In other words, $\hat{p}$ is a sample from the unknown likelihood $p(y \mid \mathbf{x})$, which represents the associated uncertainties in the prediction, and the label $\hat{y}$ is derived based on $\hat{p}$. While approximating these likelihoods has been the focus of deep uncertainty quantification techniques, prediction calibration has been more commonly adopted to improve model reliability. For example, in a fully supervised setting, one might expect a reliable predictive model not to be overconfident while making wrong predictions. On the other hand, in a semi-supervised setting, one can enforce the aggregate of predictions on the unlabeled data to match the marginal distribution of the labeled data. In practice, these requirements are incorporated as regularization strategies to systematically adjust the predictions during training, most often leading to better performing models. In this paper, we study the role of prediction calibration during model training on the reliability of the resulting lottery tickets. To this end, we consider the following approaches for training the dense models and subsequently obtaining sparse winning tickets using the LTH approach described in Section 2:
No Calibration: This is the baseline approach where we utilize only the standard cross-entropy loss for training the model.
Variance Weighted Confidence Calibration (VWCC): This approach uses stochastic inferences to calibrate the confidence of deep networks. More specifically, we utilize a loss function that augments the standard cross-entropy loss with a confidence-calibration term, where the two terms are weighted using the variance measured via multiple stochastic inferences. Mathematically, this can be written as:
In this loss, the standard cross-entropy term is computed for each sample, the predictions are inferred using multiple stochastic inferences per sample, and the variance in the predictions is used to balance the two loss terms. More specifically, we perform multiple forward passes with dropout in the network and push the softmax probabilities closer to a uniform distribution, i.e., high uncertainty, when the variance is large. The normalized variance is given by the mean of the Bhattacharyya coefficients between each of the predictions and the mean prediction.
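The variance-weighted balancing described above might be sketched as follows. This is a hedged illustration, not the loss from the cited work: it assumes the loss is a per-sample convex combination of the cross-entropy and a cross-entropy against the uniform distribution. The text gives the weight as the mean Bhattacharyya coefficient; because that coefficient equals 1 for identical distributions, the sketch assumes the weight is one minus that mean, so that it grows with disagreement between the dropout passes. Both choices, and the name `vwcc_loss`, are assumptions of this sketch.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def vwcc_loss(stochastic_logits, labels, eps=1e-12):
    """stochastic_logits: (T, N, K) logits from T dropout passes; labels: (N,) class ids."""
    probs = softmax(stochastic_logits)                       # (T, N, K)
    mean_probs = probs.mean(axis=0)                          # (N, K)
    # Bhattacharyya coefficient between each pass and the mean prediction.
    bc = np.sqrt(probs * mean_probs[None]).sum(axis=-1)      # (T, N)
    # ASSUMPTION: weight = 1 - mean coefficient, so more disagreement gives a larger weight.
    alpha = np.clip(1.0 - bc.mean(axis=0), 0.0, 1.0)         # (N,)
    log_p = np.log(probs + eps)
    n = labels.shape[0]
    ce = -log_p[:, np.arange(n), labels].mean(axis=0)        # cross-entropy, averaged over passes
    to_uniform = -log_p.mean(axis=-1).mean(axis=0)           # cross-entropy against a uniform target
    return np.mean((1.0 - alpha) * ce + alpha * to_uniform), alpha
```

When all passes agree, the weight is zero and the sketch reduces to the plain cross-entropy; as the passes disagree, the uniform-target term takes over.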
Mixup: Mixup is a popular augmentation strategy that generates additional synthetic training samples by convexly combining random pairs of images and their corresponding labels, in order to temper overconfidence in predictions. Recently, it was found that mixup regularization leads to improved calibration in the resulting model. Formally, mixup training is based on Vicinal Risk Minimization, wherein the model is trained not only on the training data, but also on samples in the vicinity of each training sample. The vicinal points are generated as follows:
$$\tilde{\mathbf{x}} = \lambda\,\mathbf{x}_i + (1-\lambda)\,\mathbf{x}_j, \qquad \tilde{y} = \lambda\,y_i + (1-\lambda)\,y_j,$$
where $\mathbf{x}_i$ and $\mathbf{x}_j$ are two randomly chosen samples with their associated labels $y_i$ and $y_j$. The parameter $\lambda \in [0, 1]$, drawn from a symmetric Beta distribution, determines the mixing ratio.
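A minimal sketch of generating vicinal points for one mini-batch is given below. It assumes one-hot labels, pairs each sample with a random partner from the same batch, and uses a single mixing ratio per batch; the function name `mixup_batch` and the default Beta parameter `alpha=1.0` are illustrative assumptions rather than settings reported in the paper.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=1.0, rng=None):
    """Convexly combine random pairs of inputs and their one-hot labels."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)              # mixing ratio from a symmetric Beta distribution
    perm = rng.permutation(x.shape[0])        # random partner for every sample in the batch
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam
```

Because the same ratio is applied to inputs and labels, each mixed label remains a valid probability vector that sums to one.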
Marginal Distribution Alignment (MDA): When a classifier model is biased and assigns non-trivial probabilities towards a single class for all samples, the resulting predictions are often unreliable. In such scenarios, we can adopt a calibration strategy wherein we discourage assignment of all samples to a single class.
This strategy compares the prior probability of each class with the mean softmax probability for that class across all samples in the dataset. As in prior work, we assume a uniform prior distribution, and approximate the mean softmax probabilities using mini-batches.
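One way to realize such an alignment term is sketched below. It is only an assumption that the regularizer takes the form of a KL divergence between the class prior and the mini-batch mean of the softmax outputs; the name `mda_regularizer` and the `eps` smoothing constant are likewise illustrative.

```python
import numpy as np

def mda_regularizer(probs, prior=None, eps=1e-12):
    """KL(prior || mean softmax over the mini-batch); probs has shape (N, K)."""
    k = probs.shape[1]
    prior = np.full(k, 1.0 / k) if prior is None else np.asarray(prior, dtype=float)
    mean_probs = probs.mean(axis=0)           # mini-batch estimate of the marginal distribution
    return float(np.sum(prior * np.log((prior + eps) / (mean_probs + eps))))
```

The term is close to zero when the batch-level predictions are spread evenly across classes and grows when most of the probability mass collapses onto one class. In training it would be added to the cross-entropy loss with a weighting hyperparameter.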
Likelihood Weighted Confidence Calibration (LWCC): In lieu of variance weighting for confidence calibration in VWCC, we propose the following strategy directly based on the estimated likelihoods. In particular,
The indicator function ensures that the weight is at its maximum value when the prediction is wrong, i.e., it pushes the softmax probabilities towards a high-entropy uniform distribution. On the other hand, when the prediction is correct, the weighting term penalizes cases where the likelihood is low.
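The weighting behavior described above can be illustrated with a short sketch. The exact loss is not reproduced here; the sketch assumes a weight of 1 for wrong predictions and one minus the predicted likelihood for correct ones, combined with the same two loss terms as in VWCC. The function name `lwcc_loss` and this specific form are assumptions.

```python
import numpy as np

def lwcc_loss(probs, labels, eps=1e-12):
    """probs: (N, K) softmax outputs; labels: (N,) class ids."""
    n = labels.shape[0]
    pred = probs.argmax(axis=1)
    conf = probs.max(axis=1)                           # likelihood of the predicted label
    # ASSUMPTION: weight is 1 when wrong, and (1 - likelihood) when correct.
    alpha = np.where(pred != labels, 1.0, 1.0 - conf)
    log_p = np.log(probs + eps)
    ce = -log_p[np.arange(n), labels]                  # standard cross-entropy per sample
    to_uniform = -log_p.mean(axis=1)                   # cross-entropy against a uniform target
    return float(np.mean((1.0 - alpha) * ce + alpha * to_uniform)), alpha
```

For a correct prediction with likelihood 0.9 the weight is 0.1, so the cross-entropy dominates; for a wrong prediction only the uniform-target term is active.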
Likelihood Weighted Confidence Calibration with Stochastic Inferences (LWCC-SI): We also tested a variant of LWCC, where we incorporated stochastic inferencing. More specifically, we apply dropout and obtain different predictions for each sample. The loss function in eqn. (5) is computed using the average prediction across the realizations.
Normalized Bin Assignment (NBA): A popular metric used for evaluating the calibration of classifier models is the empirical calibration error (ECE) (the definition can be found in Section 4). This metric measures the discrepancy between the average confidences and the accuracies of a model. In practice, we first bin the maximum softmax probabilities (i.e., the confidences) for each of the samples and then measure bin-wise discrepancy scores. Finally, we compute a weighted average of the scores, where the weights correspond to the ratio of samples in each bin. Intuitively, assigning all samples to a high-confidence bin can lead to overconfidence compared to the accuracy of the model, while assigning all samples to a low-confidence bin will produce an under-confident model even when the accuracy is reasonable. To discourage either of these cases, we propose the following regularization strategy:
Here, the regularizer is defined in terms of the total number of bins considered, the number of samples in each bin and a bin-level weighting. Since the operation of counting the number of samples in each bin is not differentiable, we use a soft histogram function, and we assign larger weights to lower/higher confidence bins to avoid underconfidence/overconfidence.
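The soft histogram idea can be sketched as follows. The paper does not specify the soft histogram function, so the kernel here (a softmax over negative squared distances to the bin centers), the `temperature` value and the name `soft_histogram` are all assumptions made for illustration. The bin-level weights and the regularizer itself are not reproduced.

```python
import numpy as np

def soft_histogram(conf, n_bins=10, temperature=0.01):
    """Differentiable-style bin counts: each confidence is spread over nearby bins."""
    centers = (np.arange(n_bins) + 0.5) / n_bins                  # bin centers in [0, 1]
    logits = -((conf[:, None] - centers[None, :]) ** 2) / temperature
    logits -= logits.max(axis=1, keepdims=True)                   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)                 # soft assignment per sample
    return weights.sum(axis=0)                                    # soft count per bin
```

The soft counts always sum to the number of samples, and lowering `temperature` makes them approach the hard counts.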
4 Empirical Studies
We perform empirical studies with different dataset/model architecture combinations to understand the impact of prediction calibration on the reliability of the winning tickets. Though standard classification metrics such as accuracy are routinely used to evaluate the performance of lottery tickets, their reliability has not been studied so far. Our focus is to examine the calibration of lottery tickets. Since the predictive scores of a well-calibrated classifier should reflect the actual likelihood of correctness, we consider popular calibration metrics for our evaluation, namely (i) empirical calibration error (ECE) (eq. 7), (ii) Brier score (eq. 9) and (iii) negative log likelihood (NLL) (eq. 8). We present comparisons on the reliability of the winning tickets obtained by utilizing different prediction calibration strategies (discussed in Section 3) while training the over-parameterized model. Note, we used the following hyperparameters to implement the different strategies: For MDA, we set for Cifar-10, SVHN and fashion MNIST, while for MNIST. We used a dropout rate of for VWCC and LWCC-SI in all pruning iterations.
Ideally, in a classification setup we expect the predicted softmax probabilities to reflect the model’s true confidence in its prediction. Therefore, the predicted softmax probabilities can be used to evaluate the quality of the prediction uncertainties and thus the model reliability [25, 13, 4]. We now formally describe the evaluation metrics used in our experiment.
Empirical Calibration Error: This is the most widely used metric to evaluate the calibration of predicted uncertainties. Since ECE takes only the prediction confidence into account and not the complete prediction probability, it is often considered an insufficient metric. Consequently, variants of this metric have been considered. In our setup, we adopt the following strategy: we bin the mean of the maximum softmax probabilities from each of the $N$ samples into $B$ bins and compute the calibration error as the discrepancy between the average confidence and the average accuracy in each of these bins:
$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{n_b}{N}\,\bigl|\mathrm{acc}(b) - \mathrm{conf}(b)\bigr|, \tag{7}$$
where $n_b$ represents the number of predictions falling in bin $b$, and $\mathrm{acc}(b)$ and $\mathrm{conf}(b)$ are the accuracy and the average confidence of the samples in bin $b$.
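The binning procedure can be sketched in NumPy as below. The sketch assumes equal-width, right-closed bins over [0, 1] and a single softmax vector per sample; the function name `calibration_error` and the default of 10 bins are illustrative choices, not values taken from the paper.

```python
import numpy as np

def calibration_error(probs, labels, n_bins=10):
    """Weighted average of |accuracy - confidence| over equal-width confidence bins."""
    conf = probs.max(axis=1)                                   # confidence of each prediction
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Right-closed bins (lo, hi]; interior edges give bin ids 0 .. n_bins - 1.
    bin_ids = np.clip(np.digitize(conf, edges[1:-1], right=True), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap                         # weight = fraction of samples in bin
    return float(ece)
```

A model that predicts with confidence 0.8 and is right 80% of the time scores zero, while a model that is always wrong at confidence 0.9 scores 0.9.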
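Negative Log Likelihood: Given the prediction likelihoods, the negative log likelihood metric can be used to obtain a notion of calibration, as shown in [13, 11]. For a set of predictions on $N$ given samples, the NLL metric is defined as follows:
$$\mathrm{NLL} = -\sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i), \tag{8}$$
where $p(y_i \mid \mathbf{x}_i)$ is the predicted likelihood of the true label $y_i$ for sample $\mathbf{x}_i$.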
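For completeness, the NLL and the Brier score named in Section 4 might be computed as in the sketch below. Both are averaged over samples here, which is an assumption made so that values are comparable across dataset sizes. The Brier score follows the common multi-class definition (squared distance between the softmax vector and the one-hot label), which may differ in normalization from the one used in the paper.

```python
import numpy as np

def negative_log_likelihood(probs, labels, eps=1e-12):
    """Mean negative log of the probability assigned to the true label."""
    n = labels.shape[0]
    return float(-np.mean(np.log(probs[np.arange(n), labels] + eps)))

def brier_score(probs, labels):
    """Mean squared distance between the softmax vector and the one-hot label."""
    onehot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))
```

Both scores are zero for perfectly confident, correct predictions and grow as the probability assigned to the true class shrinks.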
4.1 MNIST with a Fully Connected Network
We conducted an initial investigation on the MNIST digit recognition dataset using a fully connected network (FCN). To enable fair comparison, we adopt the same architecture and hyperparameters as in prior work on LTH, i.e., we use LeNet-300-100 as our base architecture for this experiment. This fully connected neural network has two hidden layers with 300 and 100 neurons respectively. The networks are trained at a learning rate of 1e-4 with the Adam optimizer for 60 epochs, using mini-batches of size 60.
A key design choice to be made while implementing LTH is whether to prune a fixed ratio of parameters in each layer, often referred to as local pruning, as opposed to pruning a fixed ratio of parameters by taking into account all parameters of the network, i.e., global pruning. For fair comparison with the experimental setup in prior work, we adopt the local pruning strategy for the case of MNIST. In particular, we perform magnitude-based weight pruning to select the sparse sub-networks, and the pruning ratio was set to 20% in each iteration, except for the last layer, which is pruned at 10%.
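The difference between the two choices can be made concrete with a small sketch. It assumes each layer is a NumPy array and uses a magnitude quantile as the pruning threshold; the function names `local_masks` and `global_masks` are introduced here for illustration only.

```python
import numpy as np

def local_masks(layers, ratio):
    """Prune the same fraction of smallest-magnitude weights inside every layer."""
    masks = []
    for w in layers:
        thresh = np.quantile(np.abs(w), ratio)          # per-layer threshold
        masks.append((np.abs(w) > thresh).astype(w.dtype))
    return masks

def global_masks(layers, ratio):
    """Prune a fraction of weights using one threshold shared by all layers."""
    all_mags = np.concatenate([np.abs(w).ravel() for w in layers])
    thresh = np.quantile(all_mags, ratio)               # single network-wide threshold
    return [(np.abs(w) > thresh).astype(w.dtype) for w in layers]
```

With layers `[1, 2, 3, 4, 5]` and `[10, 20, 30, 40, 50]` at a 20% ratio, local pruning removes one weight from each layer, whereas global pruning removes the two smallest weights overall, both of which come from the first layer.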
The next crucial component in LTH is the initialization scheme to be adopted for the weights in the pruned sub-networks. More specifically, we investigated two popular strategies, namely rewinding weights to the initializations of the over-parameterized network and randomly reinitializing the tickets in every iteration. In all our experiments, we found the former strategy to produce better performance and hence we report the results for only that case. Note that this observation is consistent with results from earlier works on LTH such as [8, 23].
4.2 CIFAR-10 with ResNet-18
We also conduct experiments on the CIFAR-10 dataset using ResNet-18. Following prior work, for this case we perform global pruning at a ratio of 20% in each iteration, and we do not prune the parameters used for downsampling representations from residual blocks or the final fully connected output layer. We trained the networks using the SGD optimizer with a learning rate of 0.01, a weight decay of 0.0001 and a momentum of 0.9, for 130 epochs. We annealed the learning rate by a factor of 0.1 after 80 and 120 epochs. Note that, similar to the previous case, we rewind weights in the sub-network to the initialization of the over-parameterized network.
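As a quick arithmetic check of this schedule, the sketch below computes the step-wise learning rate and the fraction of prunable weights that remains after repeated 20% pruning rounds. It assumes 0-indexed epochs with the decay applied from epochs 80 and 120 onward, and it ignores the layers that are excluded from pruning; both helper names are illustrative.

```python
def learning_rate(epoch, base_lr=0.01, milestones=(80, 120), gamma=0.1):
    """Step schedule: multiply the rate by `gamma` at each milestone that has been reached."""
    return base_lr * gamma ** sum(epoch >= m for m in milestones)

def remaining_fraction(n_rounds, prune_ratio=0.2):
    """Fraction of prunable weights left after `n_rounds` of iterative pruning."""
    return (1.0 - prune_ratio) ** n_rounds
```

Under these assumptions the rate is 0.01 for epochs 0 to 79, 0.001 for epochs 80 to 119 and 0.0001 afterwards, and ten pruning rounds leave about 10.7% of the prunable weights.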
4.3 SVHN with VGG-19 Convolutional Network
For experiments with VGG-19, we use its modified variant as in [23, 8], i.e., all fully connected layers are removed, and the last max pool layer is replaced with a global average pool layer followed by a linear layer that maps the output of this layer to the number of output classes. Global pruning has been found to outperform local pruning when larger networks are considered, and hence we adopt global pruning for this case as well. Following prior work, we employed the SGD optimizer with a learning rate of 0.05, a momentum of 0.9 and a weight decay of 0.0001. We trained the networks for 100 epochs using mini-batches of size 250. The learning rate was annealed by a factor of 0.1 after and epochs respectively. The pruning ratio was set at 20% in every iteration, and the final fully connected layer is never pruned.
4.4 Discussion of Results
Figure 5 illustrates the classification performance of the lottery tickets obtained from different dataset/model architecture combinations. In each of the cases, we report the results from networks trained with different calibration strategies. Except for the case of the Fashion MNIST dataset, we find that calibrated networks lead to better performing sub-networks. The benefits are more pronounced as the network complexity increases. For example, in the case of Cifar-10 we observe about to percentage points improvement over standard LTH at all compression ratios. In contrast, we observe about improvement on the MNIST dataset at different iterations. Across all datasets, we find that strategies that explicitly promote confidence calibration, namely VWCC and LWCC-SI, and augmentation strategies such as Mixup provide maximal benefits, while approaches that adjust the softmax probabilities with simplistic priors, e.g., the uniform marginal distribution in MDA, do not provide consistent improvements. By enabling a principled characterization of prediction probabilities, stochastic inferencing (based on dropout) appears to be helpful, as seen by comparing LWCC-SI with its deterministic variant LWCC.
In order to understand the reliability of the resulting models in each iteration, we report the three calibration metrics for each of the datasets. In Figure 12(a-c), we show the evaluation metrics for the Fashion MNIST dataset. We find that, similar to the classification performance, explicit confidence calibration methods demonstrate significant improvements in all metrics, thus providing evidence of model reliability. Surprisingly, state-of-the-art methods such as mixup, which produce highly calibrated models (in Figure 12(a), mixup achieves the lowest ECE at compression), do not always produce tickets that are inherently well-calibrated. However, with the standard MNIST dataset, we observe that, though uncertainty-based approaches produce lower calibration metrics, they become inferior as the compression ratio increases. We suspect that, with simpler model architectures, the uncertainties can be over-estimated when the sub-network is highly sparse. In comparison, we do not see this behavior with the ResNet-18 and VGG-19 models. As the model complexity increases (see Figures 16 and 20), we find consistent improvements in calibration at all compression ratios, thus hinting that the structure of the sub-network plays a critical role in the generalization of tickets, in addition to the initialization strategy in LTH. This becomes more apparent from the transfer learning experiment in the next section.
5 Impact on Transfer Learning Performance
Given that the process of generating the winning ticket is computationally expensive, it is useful to analyze the generalizability of winning tickets in different training scenarios and configurations. In a recent study, it was found through extensive experimentation that the winning ticket initializations could be reused and are generalizable across models trained on different datasets and optimizers. It was also reported that tickets generated using large datasets are more generalizable than those obtained from smaller datasets.
Along similar lines, the next critical question is whether improving the reliability of winning tickets via prediction calibration has any effect on the transfer learning performance. Hence, we set up an experimental study to evaluate the generalizability of winning tickets with and without the incorporation of prediction calibration during the training of the original network. To this end, we adopt a setup similar to prior work: we investigate the transferability of tickets to different datasets from the same distribution. Similar to the empirical studies in the previous section, we evaluate the prediction performance and reliability of the resulting models through the three calibration metrics, namely ECE, NLL and the Brier score. An extensive analysis of ticket transfer generalization performance across various training scenarios and different datasets is critical, and is part of our future work.
5.1 Cifar-10a to Cifar-10b
In this experiment, we study the generalizability of tickets, generated from over-parameterized networks with and without explicit calibration-promoting strategies, within the same data distribution. Following prior work, we divide the Cifar-10 dataset into two equal training splits, namely Cifar-10a and Cifar-10b, with k training samples in each, with k samples in each class. The source model was trained on the Cifar-10a split and the Cifar-10b split was treated as the target. We used the ResNet-18 architecture for both source and target models, and the hyperparameter settings for training both models were adopted from prior work. Similar to the Cifar-10 experiment in the previous section, we used the SGD optimizer with a learning rate of 0.01, a momentum of 0.9, a weight decay of 0.0001 and a batch size of 128. As discussed earlier, we do not prune the fully connected layers and perform global pruning. Given the winning tickets from the source dataset, we retrain the model for the target dataset and evaluate the reliability of the target models using the metrics in Section 4.
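A class-balanced split of this kind might be produced as in the sketch below. It assumes the labels are available as an integer array and that each class is divided evenly at random between the two halves; the function name `balanced_halves` and the fixed default seed are illustrative, and the actual split used in the experiments is not specified beyond the description above.

```python
import numpy as np

def balanced_halves(labels, rng=None):
    """Split sample indices into two disjoint halves with equal counts per class."""
    rng = np.random.default_rng(0) if rng is None else rng
    idx_a, idx_b = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))   # shuffle this class's samples
        half = idx.size // 2
        idx_a.extend(idx[:half])                             # first half goes to split "a"
        idx_b.extend(idx[half:])                             # remainder goes to split "b"
    return np.sort(np.array(idx_a)), np.sort(np.array(idx_b))
```

The first index set would serve as the source split and the second as the target split, so that the two models never see the same training sample.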
From Figure 25, we make two interesting observations. First, prediction calibration leads to orders-of-magnitude performance gains, in terms of all metrics, over the baseline LTH. Second, confidence calibration with stochastic inferencing achieves over gain in the classification accuracy, when the source ticket was used to train the target model. This clearly emphasizes the benefits of taking into account the Bayesian uncertainties from the model construction process when identifying sparse sub-networks. The resulting models are thus found to generalize to much larger data variations. In the next section, we summarize all our key findings and provide recommendations for improving lottery tickets in practice.
6 Key Findings
While different pruning strategies have been explored in existing works, the common conclusion has been that weight-magnitude-based pruning is the most effective, and hence the research focus has shifted towards investigating better initialization strategies for the sub-networks. However, our results clearly show that using prediction calibration during the training of the over-parameterized model can produce sub-networks that demonstrate improved generalization (transfer learning experiment) and produce consistently reliable models (shown using calibration metric evaluations on different dataset/model combinations). This is an interesting result in that we have resorted to the vanilla initialization strategy adopted by LTH, and the performance improvements come solely from more effective sub-networks. This motivates further research to better understand the role of the sub-network selection step, not by merely adjusting the pruning criterion, but by designing networks that are not just accurate, but also better calibrated to meaningful priors.
We find the two calibration strategies that attempt to directly balance the cross-entropy loss against the term promoting high-entropy softmax probabilities in cases of unreliable predictions, namely LWCC and VWCC, to be consistently useful in all experiments. This is a surprising result since it has been empirically shown in existing works that augmentation strategies such as mixup often produce highly calibrated models in practice. In other words, the intuitions that we have on calibration methods do not directly translate to the case of lottery tickets, thus warranting further research in this direction.
Another important observation is that under challenging generalization scenarios, i.e. Cifar-10a to Cifar-10b transfer, methods that involve stochastic inferencing, namely VWCC and LWCC-SI, significantly outperform all baselines, both in terms of accuracy and calibration metrics. This emphasizes the need for characterizing prediction uncertainties in over-parameterized models and the value of leveraging them for better model generalization.
Our results are particularly important in the context of recent efforts that attempt to design randomly initialized neural networks that can be utilized for a given dataset, without even carrying out model training [26, 9]. While sufficiently over-parameterized random networks will most likely contain sub-networks that can achieve reasonable accuracy without training, calibration strategies can help identify the most effective ones, in terms of both generalization and reliability.
References
[1] (2019) Pseudo-labeling and confirmation bias in deep semi-supervised learning. arXiv preprint arXiv:1908.02983.
[2] (2019) ReMixMatch: semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785.
[3] (2019) Mixmatch: a holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 5050–5060.
[4] (1983) The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician) 32 (1-2), pp. 12–22.
[5] (2019) Evaluating lottery tickets under distributional shifts. arXiv preprint arXiv:1910.12708.
[6] (2019) Sparse networks from scratch: faster training without losing performance. arXiv preprint arXiv:1907.04840.
[7] (2019) Rigging the lottery: making all tickets winners. arXiv preprint arXiv:1911.11134.
[8] (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
[9] (2019) Weight agnostic neural networks. arXiv preprint arXiv:1906.04358.
[10] (2016) Uncertainty in deep learning. University of Cambridge 1, pp. 3.
[11] (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102 (477), pp. 359–378.
[12] (2019) One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers.
[13] (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1321–1330.
[14] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
[15] (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980.
[16] (2010) Convolutional deep belief networks on cifar-10. Unpublished manuscript 40 (7), pp. 1–9.
[17] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
[18] (2010) MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2.
[19] (2018) Adaptive network sparsification with dependent variational beta-bernoulli dropout. arXiv preprint arXiv:1805.10896.
[20] (2019) Sparse transfer learning via winning lottery tickets. arXiv preprint arXiv:1905.07785.
[21] (2017) Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2498–2507.
[22] (2016) Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440.
[23] (2019) One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. In Advances in Neural Information Processing Systems, pp. 4933–4943.
[24] (2019) Measuring calibration in deep learning. arXiv preprint arXiv:1904.01685.
[25] (2005) Evaluating predictive uncertainty challenge. In Machine Learning Challenges Workshop, pp. 1–27.
[26] (2019) What’s hidden in a randomly weighted neural network?. arXiv preprint arXiv:1911.13299.
[27] (2019) Learning for single-shot confidence calibration in deep neural networks through stochastic inferences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9030–9038.
[28] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[29] (2019) On mixup training: improved calibration and predictive uncertainty for deep neural networks. arXiv preprint arXiv:1905.11001.
[30] (2019) Picking winning tickets before training by preserving gradient flow.
[31] (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.
[32] (2019) Deconstructing lottery tickets: zeros, signs, and the supermask. arXiv preprint arXiv:1905.01067.
[33] (2017) To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878.