Linear Mode Connectivity and The Lottery Ticket Hypothesis


Abstract

We introduce instability analysis, which assesses whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise. We find that standard vision models become stable in this way early in training. From then on, the outcome of optimization is determined to within a linearly connected region.

We use instability to study iterative magnitude pruning (IMP), the procedure used by work on the lottery ticket hypothesis to identify subnetworks that could have trained to full accuracy from initialization. We find that these subnetworks only reach full accuracy when they are stable, which either occurs at initialization for small-scale settings (MNIST) or early in training for large-scale settings (Resnet-50 and Inception-v3 on ImageNet).


1 Introduction

When training a neural network with mini-batch stochastic gradient descent (SGD), training examples are presented to the network in a random order within each epoch. This random order can be seen as noise that varies from training run to training run and alters the network’s trajectory through the optimization landscape, even when hyperparameters are fixed. In this paper, we investigate how much variability this data order randomness induces in the optimization trajectories of neural networks and the role this variability plays in sparse, lottery ticket networks (Frankle and Carbin, 2019).

Figure 1: A diagram of instability analysis from iteration 0 (left) and iteration $k$ (right).

Instability analysis. To study these questions, we propose instability analysis. The goal of instability analysis is to determine whether the outcome of optimization is robust to different samples of SGD noise. The left diagram in Figure 1 visualizes instability analysis. First, we create a neural network with random initialization $W_0$. We then train two copies of this network in parallel on different data orders (which models different samples of SGD noise). Finally, we measure the effect of these different samples of SGD noise by comparing the resulting networks. We also study this behavior starting from the state of the network at iteration $k$ of training (Figure 1 right). Doing so allows us to determine when the outcome of optimization becomes robust to different samples of SGD noise.

To compare the trained networks that result from instability analysis, we study the optimization landscape along the line between them (blue curve in Figure 1). Does error remain flat or even decrease (meaning the networks are in the same, linearly connected minimum), or is there a barrier of increased error? We define the instability of the network to SGD noise as the maximum increase in test error along this linear path (red line). A network is stable if error does not increase along the path, i.e., instability $\approx 0$.
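To make this measurement concrete, the sketch below shows one way to compute the quantity in PyTorch. It is a minimal sketch rather than our exact implementation: `evaluate_error` is an assumed helper that returns test error for a model, and interpolating batch-norm buffers along with the weights is a simplification.

```python
import torch

def interpolate_state_dicts(sd_1, sd_2, alpha):
    """(1 - alpha) * W1 + alpha * W2, applied to every floating-point tensor.

    Integer entries (e.g., batch-norm step counters) are copied from the first
    network; floating-point buffers such as running means are interpolated
    along with the weights, which is a simplification.
    """
    out = {}
    for name, tensor in sd_1.items():
        if torch.is_floating_point(tensor):
            out[name] = (1.0 - alpha) * tensor + alpha * sd_2[name]
        else:
            out[name] = tensor.clone()
    return out

def instability(model, sd_1, sd_2, evaluate_error, num_points=30):
    """Maximum rise in error along the linear path between two trained copies.

    `model` is an architecture-matched module used as a shell for evaluation,
    and `evaluate_error(model)` is an assumed helper returning test error.
    """
    errors = []
    for alpha in torch.linspace(0.0, 1.0, num_points):
        model.load_state_dict(interpolate_state_dicts(sd_1, sd_2, alpha.item()))
        errors.append(evaluate_error(model))
    endpoint_mean = 0.5 * (errors[0] + errors[-1])
    return max(errors) - endpoint_mean
```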

Interpolating at the end of training assesses a linear form of mode connectivity, a phenomenon where the minima found by two networks are connected by a path of constant error. Draxler et al. (2018) and Garipov et al. (2018) show that the modes of standard vision networks trained from different initializations are connected by piece-wise linear paths of constant error or loss. Based on this work, we expect that all networks we examine are connected by such paths. However, the modes found by Draxler et al. and Garipov et al. are not connected by linear paths. The only extant example of linear mode connectivity is by Nagarajan and Kolter (2019), who train MLPs from the same initialization on disjoint subsets of MNIST and find that the resulting networks are connected by linear paths of constant test error. In contrast, we explore linear connectivity from points throughout training, we do so at a larger scale, and we focus on different samples of SGD noise rather than disjoint samples of data.

| Network | Variant | Dataset | Params | Train For | Batch | Accuracy | Optimizer | Rate | Schedule | Warmup | BatchNorm | Pruning Level | Pruning Style |
| Lenet | — | MNIST | 266K | 24K iters | 60 | 98.3 ± 0.1% | adam | 12e-4 | constant | 0 | No | 3.5% | Iterative |
| Resnet-20 | Standard | CIFAR-10 | 274K | 63K iters | 128 | 91.7 ± 0.1% | momentum | 0.1 | 10x drop at 32K, 48K | 0 | Yes | 16.8% | Iterative |
| Resnet-20 | Low | CIFAR-10 | 274K | 63K iters | 128 | 88.8 ± 0.1% | momentum | 0.01 | 10x drop at 32K, 48K | 0 | Yes | 8.6% | Iterative |
| Resnet-20 | Warmup | CIFAR-10 | 274K | 63K iters | 128 | 89.7 ± 0.3% | momentum | 0.03 | 10x drop at 32K, 48K | 30K | Yes | 8.6% | Iterative |
| VGG-16 | Standard | CIFAR-10 | 14.7M | 63K iters | 128 | 93.7 ± 0.1% | momentum | 0.1 | 10x drop at 32K, 48K | 0 | Yes | 1.5% | Iterative |
| VGG-16 | Low | CIFAR-10 | 14.7M | 63K iters | 128 | 91.7 ± 0.1% | momentum | 0.01 | 10x drop at 32K, 48K | 0 | Yes | 5.5% | Iterative |
| VGG-16 | Warmup | CIFAR-10 | 14.7M | 63K iters | 128 | 93.4 ± 0.1% | momentum | 0.1 | 10x drop at 32K, 48K | 30K | Yes | 1.5% | Iterative |
| Resnet-50 | — | ImageNet | 25.5M | 90 epochs | 1024 | 76.1 ± 0.1% | momentum | 0.4 | 10x drop at 30, 60, 80 | 5 epochs | Yes | 30% | One-shot |
| Inception-v3 | — | ImageNet | 27.1M | 171 epochs | 1024 | 78.1 ± 0.1% | momentum | 0.03 | linear decay to 0.005 | 0 | Yes | 30% | One-shot |
Table 1: Our networks and hyperparameters. Accuracies are the means and standard deviations across three initializations. Hyperparameters for Resnet-20 standard are from He et al. (2016). Hyperparameters for VGG-16 standard are from Liu et al. (2019). Hyperparameters for low, warmup, and Lenet are adapted from Frankle and Carbin (2019). Hyperparameters for ImageNet networks are from Google’s reference TPU code (Google, 2018). Note: Frankle and Carbin mistakenly refer to Resnet-20 as “Resnet-18,” which is a separate network.

We examine the instability of standard networks for MNIST, CIFAR-10, and ImageNet. All but the smallest MNIST network are unstable at initialization. However, by a point early in training (3% for Resnet-20 on CIFAR-10 and 20% for Resnet-50 on ImageNet), all networks become stable. From this point forward, the outcome of optimization is determined to a linearly connected minimum.

The lottery ticket hypothesis. Finally, we show that instability analysis is a valuable scientific tool for assessing the effect of SGD noise in other contexts. Specifically, we study the sparse networks discussed by the recent lottery ticket hypothesis (LTH; Frankle and Carbin, 2019). The LTH conjectures that, at initialization, neural networks contain sparse subnetworks that can train in isolation to full accuracy.

Empirical evidence for the LTH consists of experiments using a procedure called iterative magnitude pruning (IMP). On small networks for MNIST and CIFAR-10, IMP finds subnetworks at initialization that can match the accuracy of the full network (we refer to such subnetworks as matching) at sparsity levels far beyond those at which randomly pruned or randomly reinitialized subnetworks can do the same. In more challenging settings, however, there is no empirical evidence for the LTH. IMP subnetworks of VGGs and Resnets on CIFAR-10 and ImageNet perform no better than other sparse networks (Liu et al., 2019; Gale et al., 2019).

We find that instability analysis distinguishes known cases where IMP succeeds and fails to find a matching subnetwork, providing the first basis for understanding the mixed results in the literature. Namely, IMP subnetworks are only matching when they are stable. Using this insight, we identify new scenarios where we can find sparse, matching subnetworks, including in more challenging settings (e.g., Resnet-50 on ImageNet). In these settings, sparse IMP subnetworks become stable early in training rather than at initialization, just as we found with the unpruned networks. Moreover, these stable IMP subnetworks are also matching. In other words, early in training (if not at initialization), sparse subnetworks emerge that are capable of completing training in isolation and reaching full accuracy. These findings shed new light on neural network training dynamics and hint at possible mechanisms underlying lottery ticket phenomena.

Contributions. We make the following contributions:

  • We introduce instability analysis to identify whether a neural network will find the same linearly connected minimum despite different samples of SGD noise.

  • On a range of image classification benchmarks including standard networks on ImageNet, we observe that networks become stable to SGD noise early in training.

  • We use instability analysis to distinguish successes and failures of IMP (the core method behind the lottery ticket hypothesis) identified in prior work. Namely, extremely sparse IMP subnetworks are matching only when stable.

  • We extend IMP with rewinding and show that IMP subnetworks become stable and matching when set to their weights from early in training. In doing so, we show how to find matching subnetworks that were present early in training in more challenging settings than in prior work.

Figure 2: Error when linearly interpolating between networks trained from the same initialization on different data orders. Lines are means and standard deviations over three initializations and three data orders (nine samples total). The trained networks are at 0.0 and 1.0.

2 Preliminaries and Methodology

Instability analysis via linear connectivity. Instability analysis evaluates whether the minima found when training two copies of a neural network on different randomly sampled executions of SGD (i.e., different data orders) are linearly connected by a path over which error does not increase. The network could be randomly initialized ($W_0$ in Figure 1) or the result of $k$ training iterations ($W_k$). To perform instability analysis, we make two copies of the network and train them to completion with different random data orders, resulting in weights $W_T^1$ and $W_T^2$. We then linearly interpolate between the trained weights (dashed line) and compute the train or test error at each point (blue curve) to determine whether it increased (minima are not linearly connected) or did not (minima are linearly connected).

We represent SGD by a function $\mathcal{A}_t^{t+u}(W_t, \xi)$ that maps the weights $W_t$ at step $t$ and SGD randomness $\xi \sim U$ to the weights at step $t+u$ by training for $u$ steps (for $t + u \leq T$). $\mathcal{E}(W)$ denotes the error of a network with weights $W$. Algorithm 1 outlines our procedure:

1: Create a network with randomly initialized weights $W_0$.
2: Train $W_0$ to $W_k$ with noise $\xi \sim U$: $W_k = \mathcal{A}_0^k(W_0, \xi)$.
3: Train $W_k$ to $W_T^1$ with noise $\xi_1 \sim U$: $W_T^1 = \mathcal{A}_k^T(W_k, \xi_1)$.
4: Train $W_k$ to $W_T^2$ with noise $\xi_2 \sim U$: $W_T^2 = \mathcal{A}_k^T(W_k, \xi_2)$.
5: Evaluate $\mathcal{E}(\alpha W_T^1 + (1-\alpha) W_T^2)$ for $\alpha \in [0, 1]$.
Algorithm 1: Instability analysis from iteration $k$.
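The sketch below expresses Algorithm 1 in PyTorch terms. Different samples of SGD noise are modeled by seeding the `DataLoader` shuffle differently for each copy; `make_network`, `train`, and `evaluate_interpolation` are hypothetical helpers standing in for the architecture, the SGD loop, and step 5, and the batch size shown is illustrative.

```python
import copy
import torch
from torch.utils.data import DataLoader

def instability_analysis(make_network, train, evaluate_interpolation,
                         dataset, k, T, seed_1=1, seed_2=2):
    """Instability analysis from iteration k (Algorithm 1), as a sketch.

    `train(model, loader, start_step, end_step)` runs SGD between the given
    steps; `evaluate_interpolation(model_1, model_2)` returns the error along
    the linear path between the two trained copies. Both are assumed helpers.
    """
    # Step 1: randomly initialized network W_0.
    network = make_network()

    # Step 2: train W_0 to W_k with one sample of SGD noise (one data order).
    loader = DataLoader(dataset, batch_size=128, shuffle=True,
                        generator=torch.Generator().manual_seed(0))
    train(network, loader, start_step=0, end_step=k)

    # Steps 3-4: train two copies of W_k to completion on different data orders.
    copy_1, copy_2 = copy.deepcopy(network), copy.deepcopy(network)
    loader_1 = DataLoader(dataset, batch_size=128, shuffle=True,
                          generator=torch.Generator().manual_seed(seed_1))
    loader_2 = DataLoader(dataset, batch_size=128, shuffle=True,
                          generator=torch.Generator().manual_seed(seed_2))
    train(copy_1, loader_1, start_step=k, end_step=T)
    train(copy_2, loader_2, start_step=k, end_step=T)

    # Step 5: interpolate between the trained copies and report the result.
    return evaluate_interpolation(copy_1, copy_2)
```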

We describe the result of linear interpolation (step 5) with a quantity that we term instability. Let $\mathcal{E}_{\mathrm{mean}}$ be the average test error of $W_T^1$ and $W_T^2$. Let $\mathcal{E}_{\mathrm{max}}$ be the highest test error when linearly interpolating between $W_T^1$ and $W_T^2$. The instability is $\mathcal{E}_{\mathrm{max}} - \mathcal{E}_{\mathrm{mean}}$ (red line in Figure 1). When instability $\approx 0$, the minima are linearly connected and we say the network is stable. Otherwise, we say it is unstable. Empirically, we consider instability below 2% to be stable; this margin accounts for noise as we interpolate and matches the increases in test error along the paths found by Draxler et al. (2018, Table B.1) and Garipov et al. (2018, Table 2). We interpolate using 30 evenly-spaced values of $\alpha \in [0, 1]$, and we average instability from three initializations and three data orders per initialization (nine combinations total).

Networks and datasets. We study image classification networks on MNIST, CIFAR-10, and ImageNet as specified in Table 1. All hyperparameters are standard values from reference implementations or prior work as cited in Table 1. The warmup and low variants of Resnet-20 and VGG-16 are adapted from hyperparameters in Frankle and Carbin (2019).

3 Neural Network Instability to SGD Noise

In this section, we perform instability analysis on the standard networks in Table 1 from many points during training. We find that, although only Lenet is stable at initialization, every network becomes stable early in training, meaning the outcome of optimization from that point forward is determined to within a linearly connected region.

Instability at initialization. We begin by studying the effect of data order on linear connectivity when starting at initialization. We use Algorithm 1 with $k = 0$ (visualized in Figure 1 left): train two copies of the same, randomly initialized network with different data orders. Figure 2 shows the train (purple) and test (red) error when linearly interpolating between the minima found by these copies. Except for Lenet (MNIST), none of the networks are stable at initialization. In fact, train and test error rise to the point of random guessing when linearly interpolating. Lenet's error rises slightly, but by less than a percentage point. We conclude that, in general, larger-scale image classification networks are unstable at initialization.

Instability during training. Although larger networks are unstable at initialization, they may become stable at some point afterwards; in the limit, they will be stable by definition after the last step of training. To investigate when stability emerges, we train a network for $k$ steps, make two copies, train them to completion on different data orders, and linearly interpolate (visualized in Figure 1 right). We do so for many values of $k$, assessing whether there is a point after which the outcome of optimization is determined modulo linear interpolation regardless of the data order.

Figure 3: The instability when linearly interpolating between the minima found by networks trained on different data orders from step $k$. Each line is the mean and standard deviation across three initializations and three data orders (nine samples in total).
Figure 4: Test instability when making two copies of the state of the network at iteration $k$ and either training for the remaining $T - k$ iterations (blue) or training for $T$ iterations with the learning rate schedule reset to iteration 0 (orange).

Figure 3 presents the instability of the networks for various values of $k$. Instability is the maximum error during interpolation (the peaks in Figure 2) minus the mean of the errors of the two networks (the endpoints in Figure 2). In all cases, test set instability decreases as $k$ increases, culminating in networks that are stable. The iteration at which stability emerges is surprisingly early. For example, it occurs at iteration 2000 for Resnet-20 and iteration 1000 for VGG-16; in other words, after 3% and 1.5% of training, SGD noise cannot affect the final minimum modulo linear interpolation. Stability occurs later for Resnet-50 and Inception-v3: at epoch 18 (20% into training) and epoch 28 (16%), respectively, using the test set.

For Lenet, Resnet-20, and VGG-16, instability is essentially identical when measured in terms of train or test error, and the networks become stable at the same time when using both quantities. For Resnet-50 and Inception-v3, train instability follows the same trend as test instability but is slightly higher at all points, meaning train set stability occurs later for Resnet-50 and does not occur in our range of analysis for Inception-v3. Going forward, we present all results with respect to test error for simplicity and include corresponding train error data in the appendices.

Disentangling instability from training time. Varying the iteration $k$ from which we run instability analysis has two effects. First, it changes the state of the network from which we train two copies to completion on different data orders. Second, it changes the number of iterations for which those copies are trained. Concretely, when we run instability analysis from iteration $k$, we train the copies under different data orders for $T - k$ iterations. As $k$ increases, the copies have fewer iterations during which to potentially find linearly unconnected minima. It is possible that the gradual decrease in instability as $k$ increases and the eventual emergence of stability is just an artifact of these shorter training times.

To disentangle the role of training time in our experiments, we modify instability analysis to train the copies for $T$ iterations no matter the value of $k$. When doing so, we reset the learning rate schedule to iteration 0 after making the copies. In Figure 4, we compare instability with and without this modification for Resnet-20 and VGG-16 on CIFAR-10. Instability is indistinguishable in both cases, indicating that the different numbers of training steps did not play a role in the onset of stability. Going forward, we present all results by training copies for $T - k$ iterations.
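A sketch of the modified protocol, under the same assumptions as the earlier sketches (hypothetical `train` and `make_lr_schedule` helpers); the only changes are the fresh learning-rate schedule and the full budget of $T$ steps for each copy.

```python
import copy

def copies_with_full_budget(network_at_k, data_loaders, train, make_lr_schedule, T):
    """Train each copy for T iterations from the iteration-k state, with the
    learning-rate schedule reset to iteration 0 (assumed helpers)."""
    copies = [copy.deepcopy(network_at_k) for _ in range(2)]
    for model, loader in zip(copies, data_loaders):
        schedule = make_lr_schedule(start_step=0)  # reset rather than resuming at step k
        train(model, loader, schedule, num_steps=T)
    return copies  # interpolate between these copies as before
```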

1: Create a network with randomly initialized weights $W_0$.
2: Initialize a pruning mask $m = 1^{|W_0|}$.
3: Train $W_0$ to $W_k$ with noise $\xi \sim U$: $W_k = \mathcal{A}_0^k(W_0, \xi)$.
4: for $n \in \{1, \ldots, N\}$ do
5:     Train $m \odot W_k$ to $m \odot W_T$ with noise $\xi' \sim U$: $W_T = \mathcal{A}_k^T(m \odot W_k, \xi')$.
6:     Prune the lowest-magnitude entries of $W_T$ that remain. Let $m[i] = 0$ if $W_T[i]$ is pruned.
7: Return $W_k, m$.
Algorithm 2: IMP with rewinding to iteration $k$ and $N$ pruning iterations.

4 Instability and Lottery Tickets

In this section, we leverage instability analysis and our observations about stable networks to gain new insights into the behavior of sparse lottery ticket networks.

4.1 Overview

We have long known that it is possible to prune neural networks after training, often removing 90% of weights without reducing accuracy after some additional training (e.g., Reed, 1993; Han et al., 2015; Gale et al., 2019). However, sparse networks are more difficult to train from scratch. At the most extreme sparsities attained by pruning, sparse networks trained in isolation are generally less accurate than the corresponding dense networks (Han et al., 2015; Li et al., 2016; Liu et al., 2019; Frankle and Carbin, 2019).

However, there is a known class of networks that remain accurate at these sparsities. On small vision tasks, an algorithm called iterative magnitude pruning (IMP) retroactively finds sparse subnetworks that were capable of training in isolation from initialization to full accuracy (Frankle and Carbin, 2019). The existence of such subnetworks raises the prospect of replacing conventional, dense networks with sparse ones, creating new opportunities to reduce the cost of training. However, in more challenging settings, IMP subnetworks perform no better than other kinds of subnetworks, and they do not train to full accuracy at the sparsities attained by pruning (Liu et al., 2019; Gale et al., 2019).

We find that instability offers new insights into the behavior of IMP subnetworks and a potential explanation for their successes and failures. Namely, the sparsest IMP subnetworks only train to full accuracy when they are stable. That is, when different samples of SGD noise cause an IMP subnetwork to find minima that are not linearly connected, then test accuracy is lower. This contrasts with the unpruned networks, whose accuracy seems unaffected by instability.

Figure 5: Test error when linearly interpolating between sparse subnetworks trained from the same initialization on different data orders. Left: IMP subnetworks that are matching at $k = 0$; right: IMP subnetworks that are not matching at $k = 0$. Lines are means and standard deviations over three initializations and three data orders (nine samples in total). The trained networks are at interpolation 0.0 and 1.0. Percents are weights remaining.

4.2 Methodology

Iterative magnitude pruning. Iterative magnitude pruning (IMP) is a procedure to retroactively find a subnetwork of the state of the full network at iteration $k$ of training. To do so, IMP trains a network to completion, prunes the weights with the lowest magnitudes globally, and rewinds the remaining weights back to their values at iteration $k$ (Algorithm 2). The result is a subnetwork $(W_k, m)$, where $W_k$ is the state of the full network at iteration $k$ and $m$ is a mask such that $m \odot W_k$ (where $\odot$ is the element-wise product) is a pruned network. We can run IMP iteratively (pruning 20% of weights as in Han et al. (2015), rewinding, and repeating until reaching a target sparsity) or in one shot (pruning to the target sparsity at once). We one-shot prune the ImageNet networks for efficiency and iteratively prune otherwise (Table 1).
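A sketch of these two core operations in PyTorch terms appears below, assuming state dicts keyed by parameter name and a hypothetical `train_to_completion` helper that trains the masked network (keeping pruned weights at zero) and returns its final weights. It simplifies Algorithm 2 rather than reproducing our implementation.

```python
import torch

def global_magnitude_prune(trained_sd, mask, fraction=0.2):
    """Zero out the lowest-magnitude `fraction` of the weights that remain.

    `trained_sd` maps parameter names to trained tensors (W_T) and `mask` maps
    the same names to binary tensors (m). Magnitudes are pooled across all
    masked parameters, so the pruning threshold is global.
    """
    remaining = torch.cat([trained_sd[name][mask[name] == 1].abs().flatten()
                           for name in mask])
    num_to_prune = int(fraction * remaining.numel())
    if num_to_prune == 0:
        return {name: m.clone() for name, m in mask.items()}
    threshold = torch.kthvalue(remaining, num_to_prune).values
    return {name: mask[name] * (trained_sd[name].abs() > threshold).float()
            for name in mask}

def imp_with_rewinding(model, sd_at_k, mask, train_to_completion, rounds):
    """Iterative IMP: rewind to W_k, train to completion, prune 20%, repeat.

    Returns the subnetwork (W_k, m); `train_to_completion` is an assumed helper.
    """
    for _ in range(rounds):
        rewound = {name: sd_at_k[name] * mask[name] if name in mask else sd_at_k[name]
                   for name in sd_at_k}
        model.load_state_dict(rewound)      # rewind to iteration k and apply the mask
        trained_sd = train_to_completion(model)
        mask = global_magnitude_prune(trained_sd, mask, fraction=0.2)
    return sd_at_k, mask
```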

Frankle and Carbin (2019) focus on finding sparse subnetworks at initialization; as such, they only rewind to iteration 0. One of our contributions is to generalize IMP to any rewinding iteration $k$. They refer to subnetworks that match the accuracy of the full network as winning tickets because they have "won the initialization lottery" with weights that make attaining this accuracy possible. When we rewind to iteration $k > 0$, subnetworks are no longer randomly initialized, so the term winning ticket is no longer appropriate. Instead, we refer to such subnetworks simply as matching.

Sparsity levels. In this section, we focus on the most extreme sparsity levels for which IMP returns a matching subnetwork at any rewinding iteration $k$. These levels are in Table 1, and Appendix B explains these choices. These sparsities provide the best contrast between sparse networks that are matching and (1) the full, overparameterized networks and (2) other classes of sparse networks. Appendix E includes the analyses from this section for all sparsities for Resnet-20 and VGG-16, which we summarize in Section 4.4. Due to the computational costs of these experiments, we only collected data across all sparsities for these networks.

Figure 6: Instability of subnetworks created by pruning the state of the full network at iteration $k$. Left: IMP subnetworks that are matching at $k = 0$; right: IMP subnetworks that are not matching at $k = 0$. Lines are means and standard deviations over three initializations and three data orders (nine samples in total). Percents are weights remaining.
| Network | Full | IMP | Rand Prune | Rand Reinit | Full - IMP | Matching? |
| Lenet | 98.3 | 98.2 | 96.7 | 97.5 | 0.1 | Y |
| Resnet-20 | 91.7 | 88.5 | 88.6 | 88.8 | 3.2 | N |
| Resnet-20 Low | 88.8 | 89.0 | 85.7 | 84.7 | -0.2 | Y |
| Resnet-20 Warmup | 89.7 | 89.6 | 85.7 | 85.6 | 0.1 | Y |
| VGG-16 | 93.7 | 90.9 | 89.4 | 91.0 | 2.8 | N |
| VGG-16 Low | 91.7 | 91.6 | 90.1 | 90.2 | 0.1 | Y |
| VGG-16 Warmup | 93.4 | 93.2 | 90.1 | 90.7 | 0.2 | Y |
| Resnet-50 | 76.1 | 73.7 | 73.1 | 73.4 | 2.4 | N |
| Inception-v3 | 78.1 | 75.7 | 75.2 | 75.5 | 2.4 | N |
Table 2: Accuracy (%) of IMP and random subnetworks when rewinding to iteration $k = 0$ at the sparsities in Table 1. Accuracies are means across three initializations. All standard deviations are small.

4.3 Experiments and Results

Recapping the lottery ticket hypothesis. We begin by studying sparse subnetworks trained from initialization ($k = 0$). This is the lottery ticket experiment from Frankle and Carbin (2019). As Table 2 shows, when rewinding to iteration 0, IMP subnetworks of Lenet are matching, as are variants of Resnet-20 and VGG-16 with lower learning rates or learning rate warmup (changes proposed by Frankle and Carbin to make it possible for IMP to find matching subnetworks). However, IMP subnetworks of standard Resnet-20, standard VGG-16, Resnet-50, and Inception-v3 are not matching. In fact, they are no more accurate than subnetworks generated by randomly pruning or reinitializing the IMP subnetworks, suggesting that neither the structure nor the initialization uncovered by IMP provides a performance advantage. For full details on the accuracy of these subnetworks at all levels of sparsity, see Appendix B.

IMP subnetwork instability at initialization. When we perform instability analysis on these subnetworks, we find that they are only matching when they are stable (Figure 5). The IMP subnetworks of Lenet, Resnet-20 (low, warmup), and VGG-16 (low, warmup) are stable and matching (Figure 5, left). In all other cases, IMP subnetworks are neither stable nor matching (Figure 5, right). The low and warmup experiments are notable because Frankle and Carbin (2019) selected these hyperparameters specifically so that IMP would find matching subnetworks; the fact that this change also makes the subnetworks stable adds further evidence of a connection between instability and accuracy in IMP subnetworks.

At these sparsities, no randomly pruned or reinitialized subnetworks are matching, and only those of Lenet are stable: their error rises only slightly when interpolating, although they are not matching. For all other networks, error approaches that of random guessing when interpolating.

IMP subnetwork instability during training. We just saw that IMP subnetworks are matching from initialization only when they are stable. In Section 3, we found that unpruned networks become stable only after a certain amount of training. Here, we combine these observations: we study whether IMP subnetworks become stable later in training and, if so, whether improved accuracy follows.

Concretely, we perform IMP where we rewind to iteration $k$ after pruning. Doing so produces a subnetwork $(W_k, m)$ of the state of the full network at iteration $k$. We then run instability analysis using this subnetwork. Another way of looking at this experiment is that it simulates training the full network to iteration $k$, generating a pruning mask, and evaluating the instability of the resulting sparse network; the underlying mask-generation procedure involves training the network many times in the course of performing IMP.

The blue dots in Figure 6 show the instability of the IMP subnetworks at many rewinding iterations. Networks whose IMP subnetworks were stable when rewinding to iteration 0 remain stable at all other rewinding points (Figure 6, left). Notably, networks whose IMP subnetworks were unstable when rewinding to iteration 0 become stable when rewinding later. IMP subnetworks of Resnet-20 and VGG-16 become stable at iterations 500 (0.8% into training) and 1000 (1.6%). Likewise, IMP subnetworks of Resnet-50 and Inception-v3 become stable at epochs 5 (5.5% into training) and 6 (3.5%). In all cases, the IMP subnetworks become stable sooner than the unpruned networks, substantially so for Resnet-50 (epoch 5 vs. 18) and Inception-v3 (epoch 6 vs. 28).

Figure 7: Error of subnetworks created using the state of the full network at iteration $k$ and a sparse pruning mask. Left: IMP subnetworks that are matching at $k = 0$; right: IMP subnetworks that are not matching at $k = 0$. Gray lines are the accuracy of the full network to one standard deviation. Lines are means and standard deviations over three initializations and three data orders (nine samples in total). Percents are weights remaining.

The error of the IMP subnetworks behaves similarly. The blue line in Figure 7 plots the error of the IMP subnetworks and the gray line plots the error of the full networks to one standard deviation; subnetworks are matching when the lines cross. Networks whose IMP subnetworks were matching when rewinding to iteration 0 (Figure 7, left) generally remain matching at later iterations (except for Resnet-20 low and VGG-16 low at the latest rewinding points). Notably, networks whose IMP subnetworks were not matching when rewinding to iteration 0 (Figure 7, right) become matching when rewinding later. Moreover, these rewinding points closely coincide with those where the subnetworks become stable. In summary, at these extreme sparsities, IMP subnetworks are matching when they are stable.

Randomly pruned and reinitialized subnetworks are unstable and non-matching at all rewinding points (with Lenet again an exception). Although it is beyond the scope of our study, this behavior suggests a potential broader link between subnetwork stability and accuracy: IMP subnetworks are matching and become stable at least as early as the full networks, while other subnetworks are less accurate and unstable for the sparsities and rewinding points we consider.

4.4 Results at Other Sparsity Levels

Thus far, we have studied instability at only two sparsities: unpruned networks (Section 3) and an extreme sparsity (Section 4.3). In this section, we examine sparsities between these levels and beyond the extreme sparsity for Resnet-20 and VGG-16. Figure 8 presents the median iteration at which IMP and randomly pruned subnetworks become stable (instability below 2%) and matching (accuracy drop within a small margin that allows for noise) across sparsity levels.

Stability behavior. As sparsity increases, the iteration at which IMP subnetworks become stable moves earlier, plateaus, and eventually increases. In contrast, the stability iteration of randomly pruned subnetworks only moves later until the subnetworks are no longer stable at any rewinding iteration.

Matching behavior. We separate the sparsities into three ranges reflecting when different sparse networks are matching. In sparsity range I, the networks are overparameterized, so much so that even randomly pruned subnetworks are matching (red). This range occurs when more than 80.0% and 16.8% of weights remain for Resnet-20 and VGG-16.

In sparsity range II, the networks are sufficiently sparse that only IMP subnetworks are matching (orange). This range occurs when 80.0%-13.4% and 16.8%-1.2% of weights remain in Resnet-20 and VGG-16. For part of this range, IMP subnetworks become matching and stable at approximately the same rewinding iteration; namely, when 51.2%-13.4% and 6.9%-1.5% of weights remain for Resnet-20 and VGG-16. In Section 4.3, we observed this behavior for a single, extreme sparsity level for each network. Based on Figure 8, we conclude that there are many sparsities where these rewinding iterations coincide for Resnet-20 and VGG-16.

In sparsity range III, the networks are so sparse that even IMP subnetworks are not matching at any rewinding iteration we consider. This range occurs when fewer than 13.4% and 1.2% of weights remain for Resnet-20 and VGG-16. According to Appendix E, the error of IMP subnetworks still decreases when they become stable (although not to the point that they are matching), potentially suggesting a broader relationship between instability and accuracy.

Figure 8: The median rewinding iteration at which IMP and randomly pruned subnetworks of Resnet-20 and VGG-16 become stable and matching. A network is stable if its instability is below 2%. A network is matching if its accuracy drop is within a small margin for noise; we only include points where a majority of subnetworks are matching at a rewinding iteration. Includes three initializations and three data orders (nine samples in total).

5 Discussion

Instability analysis. We introduce instability analysis as a novel way to study the sensitivity of a neural network’s optimization trajectory to data order randomness. In doing so, we uncover a class of situations in which linear mode connectivity emerges, whereas previous examples of mode connectivity (e.g., between networks trained from different initializations) at similar scales required piece-wise linear paths (Draxler et al., 2018; Garipov et al., 2018).

Our full network results divide training into two phases: an unstable phase where the network finds linearly unconnected minima due to SGD noise and a stable phase where the linearly connected minimum is determined. Our finding that stability emerges early in training adds to work suggesting that training comprises a noisy first phase and a less stochastic second phase. For example, the Hessian eigenspectrum settles into a few large values and a bulk (Gur-Ari et al., 2018), and large-batch training at high learning rates benefits from learning rate warmup (Goyal et al., 2017).

One way to exploit our findings is to explore changing aspects of optimization (e.g., learning rate schedule or optimizer) similar to Goyal et al. (2017) once the network becomes stable to improve performance; instability analysis can evaluate the consequences of doing so. We also believe instability analysis provides a scientific tool for topics related to the scale and distribution of SGD noise, e.g., the relationship between batch size, learning rate, and generalization (LeCun et al., 2012; Keskar et al., 2017; Goyal et al., 2017; Smith and Le, 2018; Smith et al., 2018) and the efficacy of alternative learning rate schedules (Smith, 2017; Smith and Topin, 2018; Li and Arora, 2019).

The lottery ticket hypothesis. The lottery ticket hypothesis (Frankle and Carbin, 2019) conjectures that any “randomly initialized, dense neural network contains a subnetwork that—when trained in isolation—matches the accuracy of the original network.” This work is among several recent papers to propose that merely sparsifying at initialization can produce high performance neural networks (Mallya et al., 2018; Zhou et al., 2019; Ramanujan et al., 2019). Frankle and Carbin support the lottery ticket hypothesis by using IMP to find matching subnetworks at initialization in small vision networks. However, follow-up studies show (Liu et al., 2019; Gale et al., 2019) and we confirm that IMP does not find matching subnetworks in more challenging settings. We use instability analysis to distinguish the successes and failures of IMP as identified in previous work. In doing so, we make a new connection between the hypothesis and the optimization dynamics of neural networks.

Practical impact of rewinding. By augmenting IMP with rewinding, we show how to find matching subnetworks in much larger settings than in previous work, albeit from early in training rather than initialization. Our technique has already been adopted for practical purposes. Morcos et al. (2019) show that subnetworks found by IMP with rewinding transfer between vision tasks, meaning the effort of finding a subnetwork can be amortized by reusing it many times. Renda et al. (2020) show that IMP with rewinding prunes to state-of-the-art sparsities, matching or exceeding the performance of standard techniques that fine-tune at a low learning rate after pruning (e.g., Han et al., 2015; He et al., 2018). Other efforts use rewinding to further study lottery tickets (Yu et al., 2020; Frankle et al., 2020; Caron et al., 2020; Savarese et al., 2020; Yin et al., 2020).

Pruning. In larger-scale settings, IMP subnetworks only become stable and matching after the full network has been trained for some number of steps. Recent proposals attempt to prune networks at initialization (Lee et al., 2019; Wang et al., 2020), but our results suggest that the best time to do so may be after some training. Likewise, most pruning methods only begin to sparsify networks late in training or after training (Han et al., 2015; Gale et al., 2019; He et al., 2018). The existence of matching subnetworks early in training suggests that there is an unexploited opportunity to prune networks much earlier than current methods.

6 Conclusions

We propose instability analysis to shed light on the variability of neural network optimization trajectories induced by random data orders. We find that standard networks for MNIST, CIFAR-10, and ImageNet become stable to this randomness early in training, after which the outcome of optimization is determined to a linearly connected minimum.

We then apply instability analysis to better understand a key question at the center of work on the lottery ticket hypothesis: why does iterative magnitude pruning find sparse networks that can train from initialization to full accuracy in smaller-scale settings (e.g., MNIST) but not on more challenging tasks (e.g., ImageNet)? We find that extremely sparse IMP subnetworks only train to full accuracy when they are stable, which occurs at initialization in some settings but only after some amount of training in others.

Instability analysis contributes to a growing range of empirical tools for studying and understanding the behavior of neural networks in practice. In our paper, we show that it has already yielded new insights into neural network training dynamics and lottery ticket phenomena.

Appendix A Overview and Contents

In this supplementary material, we include data that either (1) we processed to produce the plots in the paper or (2) that we were not able to fit in the main body of the paper. The contents of these appendices are as follows:

Appendix B. The process by which we chose the “extreme” sparsity levels used in Section 4.

Appendix C. Details about the states of the unpruned networks and IMP subnetworks at the rewinding iterations, including full network accuracy, distance from initialization, distance to the trained weights, and the distance between trained weights under different data orders.

Appendix D. Instability data throughout training for Resnet-20 and VGG-16; that is, interpolating between the states at each epoch of networks trained on different data orders.

Appendix E. Instability and test error across rewinding iterations for Resnet-20 and VGG-16 at all levels of sparsity (not just the extreme sparsity we analyzed in Section 4.3).

Appendix F. The error when linearly interpolating for all networks in all configurations (unpruned and sparse) at all rewinding iterations. This data was used to create the instability plots in Figures 3 and 6.

Appendix G. The training set instability for the sparse networks corresponding to the test set instability data that we present in Section 4 Figure 6.

Appendix H. Metrics other than linear mode connectivity for comparing the networks that result from our instability experiments: $L_2$ distance, cosine distance, classification differences, and the $L_2$ distance between per-example losses.

Appendix B Selecting Extreme Sparsity Levels for IMP

In this appendix, we describe how we select the extreme sparsity level that we examine in Section 4.3 for each IMP subnetwork. For each network and hyperparameter configuration, our goal is to study the most extreme sparsity level at which matching subnetworks are known to exist early in training. To do so, we use IMP to generate subnetworks at many different sparsities for many different rewinding iterations. We then select the most extreme sparsity level at which IMP, under any rewinding iteration, produces a matching subnetwork.

In Figure 9, each plot contains the maximum accuracy found by any rewinding iteration in red. The black line is the accuracy of the unpruned network to one standard deviation. For each network, we select the most extreme sparsity for which the red and black lines intersect. As a basis for comparison, these plots also include the result of performing IMP with $k = 0$ (blue line), random pruning (orange line), and random reinitialization of the IMP subnetworks with $k = 0$ (green line).

Note that, for computational reasons, Resnet-50 and Inception-v3 are pruned using one-shot pruning, meaning the networks are pruned to the target sparsity all at once. All other networks are pruned using iterative pruning, meaning the networks are pruned by 20% after each iteration of IMP until they reach the target sparsity. Pruning 20% per iteration is standard practice in the pruning literature (Han et al., 2015; Frankle and Carbin, 2019; Renda et al., 2020). This information is specified in Table 1.

Appendix C The State of the Network at Rewinding

C.1 Methodology

In the main body of the paper, we make two copies of the network at a rewinding iteration $k$, optionally apply a pruning mask (as in Section 4), and train the copies from there to completion under different data orders. We find that, for a sufficiently large value of $k$, the trained networks find the same, linearly connected minimum. In this appendix, we address the following question: what is the state of the network at the rewinding points from which this linear connectivity results? Are the networks so far along in training that they are virtually fully optimized? Have they traveled the vast majority of the distance from initialization to the eventual minimum? In this sense, is the iteration at which the network becomes stable "trivial"? We address these questions in two ways.

Error at rewinding. In Figure 10, we present the error of the unpruned network at each rewinding iteration we consider in the main body of the paper. With this data, we investigate how close the network has come to its full accuracy when it becomes stable.

$L_2$ distances. In Figures 11 and 12, we measure various $L_2$ distances that capture how close the network is to initialization and to the end of training. In particular, we measure three distances, as shown in the diagram below (an annotated version of Figure 1).

(Diagram annotations: distance from initialization to rewinding; distance from rewinding to the end of training; distance between copies trained on different data orders.)

First, we measure the distance in parameter space from initialization to the state of the network at each rewinding iteration $k$ (blue circle in the diagram above and in Figures 11 and 12); for the sparse IMP subnetworks, we measure the distance after applying the pruning mask to both the initialization and the state of the network at iteration $k$. This quantity captures the distance that the network has traversed from initialization by iteration $k$.

Second, we measure the distance from the state of the network at the rewinding iteration $k$ to its state at the end of training under one data order (orange x in the diagram above and in Figures 11 and 12). This quantity captures the distance that the network traverses after the rewinding iteration $k$. If the network is very close to the optimum by the time it becomes stable, then we expect this quantity to be small compared to the distance between initialization and iteration $k$; that would indicate that the network has already traversed a large distance and has a relatively smaller distance to go.

Finally, we measure the distance between the final states of networks trained from the rewinding iteration $k$ under different data orders (green triangle in the diagram above and in Figures 11 and 12). This quantity captures the size of the linearly connected minimum found by the networks. We are interested in how this distance compares to the distance traveled by the networks and how this quantity changes as the rewinding iteration varies.
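As a concrete sketch (not our exact measurement code), these three distances can be computed by flattening each network into a single vector, applying the pruning mask first for the IMP subnetworks; the state-dict inputs are assumptions of the sketch.

```python
import torch

def flatten_weights(state_dict, mask=None):
    """Concatenate all floating-point parameters into one vector, applying an
    optional pruning mask first (as we do for the IMP subnetworks)."""
    parts = []
    for name, tensor in state_dict.items():
        if not torch.is_floating_point(tensor):
            continue
        if mask is not None and name in mask:
            tensor = tensor * mask[name]
        parts.append(tensor.flatten())
    return torch.cat(parts)

def l2_rewinding_distances(sd_init, sd_k, sd_final_1, sd_final_2, mask=None):
    """The three distances in the annotated diagram: init -> rewind,
    rewind -> end of training, and between copies trained on different data orders."""
    w0, wk = flatten_weights(sd_init, mask), flatten_weights(sd_k, mask)
    w1, w2 = flatten_weights(sd_final_1, mask), flatten_weights(sd_final_2, mask)
    return {
        "init_to_rewind": torch.norm(w0 - wk).item(),
        "rewind_to_final": torch.norm(wk - w1).item(),
        "between_copies": torch.norm(w1 - w2).item(),
    }
```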

C.2 Results

Error at rewinding. These results appear in Figure 10. Recall that the unpruned networks become stable at a different (typically later) iteration than the IMP subnetworks, so we consider two rewinding points for each network.

Unpruned networks. Resnet-20 and VGG-16 become stable at iterations 2000 and 1000, at which point test error is about 25% (compared to final error 8.3%) for Resnet-20 and 20% (compared to final error 6.3%) for VGG-16. Train error is at a similar value to test error at these points; in both cases, train error eventually converges to 0%. We conclude that, at the iteration at which they become stable, these networks have not fully converged but are much closer to their final errors than to random guessing.

We see similar behavior for the unpruned Resnet-50 and Inception-v3 networks, which become stable at epochs 18 and 28. At these points, test error is 55% (compared to final error 24%) for Resnet-50 and 33% (compared to final error 22%) for Inception-v3. Both networks are most of the way to their final accuracies.

IMP pruned subnetworks. The IMP pruned subnetworks become stable earlier than the unpruned networks. Resnet-20 and VGG-16 become stable at iterations 500 and 1000, at which point error is 30% (compared to final error 8.3%) for Resnet-20 and 35% (compared to final error 6.3%) for VGG-16. These networks have not fully converged but are closer to their final errors than to random guessing. IMP subnetworks of Resnet-50 and Inception-v3 become stable much earlier than the unpruned networks—at epoch 5 and epoch 6, respectively. At these points, error is much higher—55% for Resnet-50 and 40% for Inception-v3—leaving these networks substantial room to further train. We did not evaluate the train accuracy at these checkpoints for the ImageNet networks due to storage and computational limitations.

$L_2$ distances. These results appear in Figures 11 and 12.

Unpruned networks. Resnet-20 and VGG-16 become stable at iterations 2000 and 1000, at which point they are closer to their initial weights than to their final weights. This indicates that they still have a substantial distance to travel on the optimization landscape and are still far from their final weights. This result is particularly remarkable considering our observation in Appendix D that stable networks follow the same, linearly connected trajectory throughout training (according to test error); the distance data suggests that they do so for a substantial distance.

The unpruned Resnet-50 and Inception-v3 networks are closer to their final weights than their initial weights when they become stable. In fact, it appears that distance from initialization begins to plateau and distance to the final weights only decreases slowly. This may indicate that the networks will make much slower progress for the remaining 80% of training iterations.

The green triangles in these plots show the distance between the weights of copies of the network trained from a rewinding iteration to completion on different data orders. In all cases, the distance between these copies is substantial, even after the networks become stable. As a point of comparison, we use the distance that the networks travel between initialization and the final weights, which is captured by the orange x at rewinding iteration 0. For Resnet-20, the distance between copies trained on different data orders from iteration 2000 (when it becomes stable) is more than half the distance that the network travels during the entirety of training. The same is true for VGG-16 from iteration 1000 (when it becomes stable). For Resnet-50 and Inception-v3, this distance is about a quarter and half (respectively) of the distance the networks travel over the course of training. These are remarkably large distances considering that any network on this line segment reaches full test accuracy.

IMP pruned subnetworks. We discuss the IMP subnetworks in Figure 12. Each distance in this figure is measured after applying the pruning mask to all weights. When Resnet-20 and VGG-16 become stable (iterations 2000 and 1000, respectively), they are about 2x (Resnet-20) and 3x (VGG-16) closer to their initial weights than their final weights. Resnet-50 and Inception-v3 are about equal distances from both points for the epochs at which they become stable.

Unique to the IMP subnetworks, we observe here and in Appendix H that the distance between copies trained on different data orders drops alongside instability, plateauing at a lower value when training from the rewinding iteration at which the subnetworks become stable. Even this lower distance is still a substantial fraction of the overall distance the network travels: 25%, 45%, 27%, and 28% for Resnet-20, VGG-16, Resnet-50, and Inception-v3.

Appendix D Instability Throughout Training

In Section 3, we find that stable networks arrive at minima that are linearly connected. In this appendix, we study whether the trajectories they follow are also linearly connected. In other words, when training two copies of the same network with different noise, are the states of the network at each iteration connected by a linear path over which test error does not increase? In the main body of the paper, we study this quantity only at the end of training (i.e., at iteration $T$). Here, we study it for all iterations throughout training. To study this behavior, we linearly interpolate between the networks at each epoch of training and compute instability.

Figure 13 plots instability throughout training for Resnet-20 and VGG-16 from different rewinding iterations for both train and test error for the unpruned networks and the IMP subnetworks. We begin with the unpruned networks. For $k = 0$ (blue line), instability increases rapidly. In fact, it follows the same pattern as error: as the train or test error of each network decreases, the maximum possible instability increases (since instability never exceeds random guessing). With larger values of $k$, instability increases more slowly throughout training. When $k$ is sufficiently large that the networks are stable at the end of training, they are generally stable at every epoch of training (e.g., $k = 2000$, pink line). In other words, after iteration 2000, the networks follow identical optimization trajectories modulo linear interpolation.

The IMP subnetworks of Resnet-20 exhibit the same behavior as the unpruned network: when the network is stable at the end of training, it is stable throughout training, meaning two copies of the same network follow the same optimization trajectory up to linear interpolation. The IMP subnetworks of VGG-16 exhibit slightly different behavior at rewinding iterations 500 and 1000: instability initially spikes (meaning the networks rapidly become separated by a loss barrier) but decreases gradually thereafter. For rewinding iteration 1000, it decreases to 0, meaning the networks are stable by the end of training. For all other rewinding iterations, being stable at the end of training corresponds to being stable throughout training, so it is possible that rewinding iteration 1000 represents a transition point between the unstable rewinding iterations earlier and the stable rewinding iterations later.

Appendix E Instability Data at All Sparsities

In Figure 6 in Section 4.3, we show the effect of rewinding iteration on instability and test error for sparse subnetworks. We specifically focus on the most extreme level of sparsity for which IMP at any rewinding iteration is matching (as selected in Appendix B). In this appendix, we present the relationship between rewinding iteration and instability/test error for all levels of sparsity for standard Resnet-20 (Figures 15 and 14) and VGG-16 (Figures 17 and 16) on CIFAR-10. Section 4.4 and Figure 8 summarize this data, so we defer analysis of this data to that section.

This data begins with 80% of weights remaining and includes sparsities attained by repeatedly pruning 20% of weights (e.g., 64% of weights remaining, 51% of weights remaining, etc.). We include these levels in particular because we use IMP to prune 20% of weights per iteration, meaning we have sparse IMP subnetworks for each of these levels. We include data for every sparsity level displayed in Appendix B, including those beyond the extreme sparsities we study in Section 4.3.
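These levels follow directly from removing 20% of the remaining weights per IMP round; a one-line sketch of the schedule:

```python
# Percent of weights remaining after each round of pruning 20% of what remains:
# 80.0, 64.0, 51.2, 41.0, 32.8, 26.2, 21.0, 16.8, 13.4, 10.7, ...
remaining = [100 * 0.8 ** n for n in range(1, 11)]
```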

We only collected this data for standard Resnet-20 and VGG-16 on CIFAR-10. We determined that it was more valuable to spend our limited computational resources on these networks (whose instability and accuracy are sensitive to rewinding at the extreme sparsity level) than for the low and warmup variants (which are consistently stable and matching at the extreme sparsity level). We did not have the computational resources to compute this data on the ImageNet networks for all sparsities.

Appendix F Full Linear Interpolation Data

In Figures 3 and 6, we plot the instability value derived from linearly interpolating between copies of the same network or subnetwork trained on different data orders. In this appendix, we plot the linear interpolation data from which we derived the instabilities in Figures 3 and 6. We plot this data for the unpruned networks (Figure 18), IMP subnetworks (Figure 19), randomly pruned subnetworks (Figure 20), and the randomly reinitialized IMP subnetworks (Figure 21).

Appendix G Train Instability for Sparse Subnetworks

In Section 4, we only measure instability and error on the test set. We make this choice for simplicity after observing in Section 3 that train and test instability closely align. In this appendix, we present the corresponding data from Section 4 on the train set. Figures 22 and 23 examine the instability and error of the same IMP subnetworks as Figure 6, but they show both the train and test sets. We did not compute the train set quantities for Inception-v3 due to computational limitations.

Train set and test set instability are nearly identical, just as we found in Section 3. Interestingly, the two coincide more closely for IMP subnetworks of Resnet-50 than they do for the unpruned networks in Section 3.

For networks that are unstable at rewinding iteration 0, train error and test error follow similar trends, starting higher when the subnetworks are unstable and dropping when the subnetworks become stable. In other words, the unstable IMP subnetworks are not able to fully optimize to 0% train error, while the stable IMP subnetworks are. This means that, at earlier rewinding iterations, the IMP subnetworks are having trouble optimizing, not just generalizing.

Appendix H Alternate Distance Metrics

Instability analysis involves training two copies of the same network on different data orders and comparing the networks that result. In the main body of the paper, our method of comparison is linear interpolation, which we find to offer valuable new insights into neural network optimization and the lottery ticket hypothesis. However, one could parameterize instability analysis with a wide range of other metrics for comparing pairs of neural networks. In this appendix, we discuss four alternate methods for which we collected data using the MNIST and CIFAR-10 networks.

$L_2$ distance. One simple way to compare neural networks is to measure the $L_2$ distance between their trained weights. The limitation of this metric is that there is not necessarily any relationship between $L_2$ distance and the functional similarity of the networks or the structure of the loss landscape. In other words, there is no clear interpretation of $L_2$ distance.

In Figure 24, we plot the $L_2$ distance at all rewinding points for the unpruned networks. In Figure 25, we plot the $L_2$ distance at all rewinding points for all three classes of sparse networks. We plot this data separately because $L_2$ distance is not necessarily comparable between sparse networks (which have fewer unpruned parameters) and dense networks (which have more parameters).

For the unpruned networks, $L_2$ distance decreases linearly as we logarithmically increase the rewinding iteration. We see no distinct changes in behavior when the networks become stable, and the distance remains far from 0 at this point.

For the IMP subnetworks, $L_2$ distance mirrors the behavior of instability. In cases where the IMP subnetworks are stable at all rewinding points (Resnet-20 low/warmup, VGG-16 low/warmup, and Lenet), the distance is at a lower level than the distance between the other baselines (random pruning and random reinitialization) and is consistent across rewinding points. In cases where the IMP subnetworks are unstable at initialization but become stable later (Resnet-20 and VGG-16), the distance begins high (at the same level as the distance for the randomly pruned and randomly reinitialized baselines) and drops when the subnetworks become stable, settling at a lower level.

Although stable IMP subnetworks are closer in $L_2$ distance than unstable IMP subnetworks and the baselines, the distance remains far from zero. In general, it is difficult to translate the results of this metric into higher-level statements about the relationships between the networks.

Cosine distance. In Figures 26 (unpruned networks) and 27 (sparse networks), we plot the cosine distance in a manner similar to $L_2$ distance. The results are similar to those for $L_2$ distance, and the same interpretation applies.
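A sketch of both metrics, treating each trained network as a single flattened weight vector (e.g., produced by concatenating its parameters, masked for the sparse subnetworks):

```python
import torch
import torch.nn.functional as F

def weight_space_distances(w_1, w_2):
    """L2 and cosine distance between two networks given as flattened weight vectors."""
    l2 = torch.norm(w_1 - w_2).item()
    cosine = 1.0 - F.cosine_similarity(w_1, w_2, dim=0).item()
    return l2, cosine
```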

Classification differences. This metric computes the number of examples that are classified differently by the two networks. Unlike linear interpolation and $L_2$/cosine distance, this metric looks at the functional behavior of the networks rather than their parameterizations. It is particularly valuable because it allows us to compare the dense and sparse networks directly.
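A sketch of this metric over a test loader (PyTorch models and data loader are assumed inputs):

```python
import torch

@torch.no_grad()
def classification_differences(model_1, model_2, loader, device="cpu"):
    """Count examples on which two trained networks predict different classes."""
    model_1.eval()
    model_2.eval()
    differences = 0
    for inputs, _ in loader:
        inputs = inputs.to(device)
        preds_1 = model_1(inputs).argmax(dim=1)
        preds_2 = model_2(inputs).argmax(dim=1)
        differences += (preds_1 != preds_2).sum().item()
    return differences
```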

In Figures 28 (test set) and 29 (train set), we plot this metric for the unpruned and sparse networks across rewinding iterations. The unpruned networks generally classify the same number of examples differently no matter the rewinding iteration, although the number of different classifications decreases gradually for the latest rewinding iterations for Resnet-20 low and warmup. We see no relationship between this metric and instability.

The behavior of the IMP sparse networks better matches instability. IMP subnetworks that are stable from initialization (Resnet-20 low and warmup, VGG-16 low and warmup, Lenet) consistently have the same distance no matter the rewinding iteration. This distance is lower than that for the randomly pruned and randomly reinitialized baselines.

IMP subnetworks that are unstable at iteration 0 (Resnet-20 and VGG-16) have the same number of different classifications as the baselines when rewinding to iteration 0. When the networks become stable, the number of different classifications drops substantially to a lower level.

One challenge with using this distance metric is that it is inherently entangled with accuracy. As the accuracy of the networks improves, the number of different classifications might decrease simply because the networks classify more examples correctly (and thereby, the same way). Consider the IMP subnetworks of Resnet-20 on the CIFAR-10 test set (the graph in the upper right of Figure 28, blue line). At rewinding iteration 0, the networks have about 11% error on the test set, meaning there are at most 2200 examples they could classify differently (each network misclassifies about 1100 of the 10,000 test examples, and two networks can only disagree on an example that at least one of them misclassifies). In Figure 28, we see that the networks are classifying about 1100 examples differently.

When the Resnet-20 IMP subnetworks are stable, error decreases to 8.5%, meaning at most 1700 examples can be classified differently. However, in Figure 28, we see that only about 350 examples are being classified differently. Although this number is lower than the 1100 differences at rewinding iteration 0 in absolute terms, accuracy has improved as well, so we must consider these differences in context. At rewinding iteration 0, classification differences are 50% of their maximum possible value, while at rewinding iteration 1000 they are at 21% of their maximum possible value. In summary, as the IMP subnetworks become stable, they behave in a more functionally similar fashion, even accounting for accuracy improvements.

Loss distance. This metric computes the L2 distance between the vectors of per-example cross-entropy losses of the two networks. Like classification differences, it considers only the functional behavior of the networks, but it uses the per-example loss rather than the classification decisions, which may provide more information about that behavior. We plot this data in Figures 30 (test set) and 31 (train set). It largely mirrors the behavior of the classification-difference metric, and the same interpretations apply.
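A sketch of the loss-distance computation under the same assumed interfaces is shown below: it collects the per-example cross-entropy losses of each network and takes the L2 distance between the two loss vectors.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_distance(model_a, model_b, loader, device="cpu"):
    """L2 distance between the per-example cross-entropy loss vectors."""
    model_a.eval()
    model_b.eval()
    losses_a, losses_b = [], []
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        losses_a.append(F.cross_entropy(model_a(inputs), labels, reduction="none"))
        losses_b.append(F.cross_entropy(model_b(inputs), labels, reduction="none"))
    return torch.linalg.norm(torch.cat(losses_a) - torch.cat(losses_b)).item()
```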

Figure 9: An illustration of the methodology by which we select the extreme sparsity levels that we study in Section 4. The red line is the maximum accuracy achieved by any IMP subnetwork under any rewinding iteration. The black line is the accuracy of the full network. We use the most extreme sparsity level for which the red and black lines overlap. Each line is the mean and standard deviation across three runs with different initializations.

Figure 10: The error of the full networks at the rewinding iteration specified on the x-axis. For clarity, this is the error of the network at that specific iteration of training, before any copies are made or further training occurs. Each line is the mean and standard deviation across three initializations.

Figure 11: Various distances for the full networks at the rewinding iteration specified on the x-axis. Each line is the mean and standard deviation across three initializations.

Figure 12: Various distances for the IMP subnetworks at the rewinding iteration specified on the x-axis. Each line is the mean and standard deviation across three initializations. Each distance is computed after applying the pruning mask to the states of the networks in question.
Figure 13: Instability throughout training for Resnet-20 and VGG-16 using both the unpruned networks and the IMP-pruned networks, as computed on both the test set and train set. Each line involves training to iteration k and then training two copies on different data orders from there. Each point is the instability when interpolating between the states of the networks at the training iteration on the x-axis.

Figure 14: The test error of subnetworks of Resnet-20 created using the state of the full network at iteration k and trained on different data orders from there. Each line is the mean and standard deviation across three initializations. Gray lines are the accuracies of the full networks to one standard deviation. Percents are percents of weights remaining.

Figure 15: The instability of subnetworks of Resnet-20 created using the state of the full network at iteration k and trained on different data orders from there. Each line is the mean and standard deviation across three initializations and three data orders (nine samples total). Percents are percents of weights remaining.

Figure 16: The test error of subnetworks of VGG-16 created using the state of the full network at iteration k and trained on different data orders from there. Each line is the mean and standard deviation across three initializations. Gray lines are the accuracies of the full networks to one standard deviation. Percents are percents of weights remaining.

Figure 17: The instability of subnetworks of VGG-16 created using the state of the full network at iteration k and trained on different data orders from there. Each line is the mean and standard deviation across three initializations and three data orders (nine samples total). Percents are percents of weights remaining.
Figure 18: The error when linearly interpolating between the minima found by randomly initializing a network, training to iteration k, and training two copies from there to completion using different data orders. Each line is the mean and standard deviation across three initializations and three data orders (nine samples in total). The errors of the trained networks are at interpolation = 0.0 and 1.0.
Figure 19: The error when linearly interpolating between the minima found by randomly initializing a network, training to iteration k, pruning according to IMP, and training two copies from there to completion using different data orders. Each line is the mean and standard deviation across three initializations and three data orders (nine samples in total). The errors of the trained networks are at interpolation = 0.0 and 1.0. We did not interpolate using the training set for the ImageNet networks due to computational limitations.
Figure 20: The error when linearly interpolating between the minima found by randomly initializing a network, training to iteration k, pruning randomly in the same layerwise proportions as IMP, and training two copies from there to completion using different data orders. Each line is the mean and standard deviation across three initializations and three data orders (nine samples in total). The errors of the trained networks are at interpolation = 0.0 and 1.0. We did not interpolate using the training set for the ImageNet networks due to computational limitations.
Figure 21: The error when linearly interpolating between the minima found by randomly initializing a network, training to iteration k, pruning according to IMP, randomly reinitializing, and training two copies from there to completion using different data orders. Each line is the mean and standard deviation across three initializations and three data orders (nine samples in total). The errors of the trained networks are at interpolation = 0.0 and 1.0. We did not interpolate using the training set for the ImageNet networks due to computational limitations.

Figure 22: The train and test set instability of subnetworks that are created by using the state of the full network at iteration k, applying the pruning mask found by performing IMP with rewinding to iteration k, and training on different data orders from there. Each line is the mean and standard deviation across three initializations and three data orders (nine samples in total). Percents are percents of weights remaining. We did not compute the train set quantities for Inception-v3 due to computational limitations.

Figure 23: The train and test set error of subnetworks that are created by using the state of the full network at iteration k, applying the pruning mask found by performing IMP with rewinding to iteration k, and training on different data orders from there. Each line is the mean and standard deviation across three initializations. Percents are percents of weights remaining. We did not compute the train set quantities for Inception-v3 due to computational limitations.

Figure 24: The L2 distance between networks that are created by training the full network to iteration k, making two copies, and training on different data orders from there. Each line is the mean and standard deviation across three initializations and three data orders (nine samples in total). Percents are percents of weights remaining. We did not compute the train set quantities for the ImageNet networks due to computational limitations.

Figure 25: The L2 distance between subnetworks that are created by using the state of the full network at iteration k, applying a pruning mask, and training on different data orders from there. Each line is the mean and standard deviation across three initializations and three data orders (nine samples in total). Percents are percents of weights remaining. We did not compute the train set quantities for the ImageNet networks due to computational limitations.

Figure 26: The cosine distance between networks that are created by training the full network to iteration k, making two copies, and training on different data orders from there. Each line is the mean and standard deviation across three initializations and three data orders (nine samples in total). Percents are percents of weights remaining. We did not compute the train set quantities for the ImageNet networks due to computational limitations.

Figure 27: The cosine distance between subnetworks that are created by using the state of the full network at iteration k, applying a pruning mask, and training on different data orders from there. Each line is the mean and standard deviation across three initializations and three data orders (nine samples in total). Percents are percents of weights remaining. We did not compute the train set quantities for the ImageNet networks due to computational limitations.

Figure 28: The number of different test set classifications between networks that are created by training the full network to iteration k, optionally applying a pruning mask, and training on different data orders from there. Each line is the mean and standard deviation across three initializations and three data orders (nine samples in total). Percents are percents of weights remaining.

Figure 29: The number of different train set classifications between networks that are created by training the full network to iteration k, optionally applying a pruning mask, and training on different data orders from there. Each line is the mean and standard deviation across three initializations and three data orders (nine samples in total). Percents are percents of weights remaining.

Figure 30: The L2 distance between the per-example losses on the test set for networks that are created by training the full network to iteration k, optionally applying a pruning mask, and training on different data orders from there. Each line is the mean and standard deviation across three initializations and three data orders (nine samples in total). Percents are percents of weights remaining. We did not compute the train set quantities for the ImageNet networks due to computational limitations.

Figure 31: The L2 distance between the per-example losses on the train set for networks that are created by training the full network to iteration k, optionally applying a pruning mask, and training on different data orders from there. Each line is the mean and standard deviation across three initializations and three data orders (nine samples in total). Percents are percents of weights remaining. We did not compute the train set quantities for the ImageNet networks due to computational limitations.

Footnotes

  1. In Appendix E, we present the full instability and error data that we used to produce this summary.
  2. In the worst case, all examples that one network misclassifies will be classified correctly by the other. Since each network misclassifies 1100 examples, 2200 examples will be classified differently in total.
