Linear Mode Connectivity and the Lottery Ticket Hypothesis
Abstract
We introduce instability analysis, which assesses whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise. We find that standard vision models become stable in this way early in training. From then on, the outcome of optimization is determined to within a linearly connected region.
We use instability to study iterative magnitude pruning (IMP), the procedure used by work on the lottery ticket hypothesis to identify subnetworks that could have trained to full accuracy from initialization. We find that these subnetworks only reach full accuracy when they are stable, which either occurs at initialization for small-scale settings (MNIST) or early in training for large-scale settings (Resnet50 and Inceptionv3 on ImageNet).
1 Introduction
When training a neural network with minibatch stochastic gradient descent (SGD), training examples are presented to the network in a random order within each epoch. This random order can be seen as noise that varies from training run to training run and alters the network’s trajectory through the optimization landscape, even when hyperparameters are fixed. In this paper, we investigate how much variability this data order randomness induces in the optimization trajectories of neural networks and the role this variability plays in sparse, lottery ticket networks (Frankle and Carbin, 2019).
Instability analysis. To study these questions, we propose instability analysis. The goal of instability analysis is to determine whether the outcome of optimization is robust to different samples of SGD noise. The left diagram in Figure 1 visualizes instability analysis. First, we create a neural network with a random initialization W_0. We then train two copies of this network in parallel on different data orders (which models different samples of SGD noise). Finally, we measure the effect of these different samples of SGD noise by comparing the resulting networks. We also study this behavior starting from the state W_k of the network at iteration k of training (Figure 1 right). Doing so allows us to determine when the outcome of optimization becomes robust to different samples of SGD noise.
To compare the trained networks that result from instability analysis, we study the optimization landscape along the line between them (blue curve in Figure 1). Does error remain flat or even decrease (meaning the networks are in the same, linearly connected minimum), or is there a barrier of increased error? We define the instability of the network to SGD noise as the maximum increase in test error along this linear path (red line). A network is stable if error does not increase along the path, i.e., instability ≈ 0.
Interpolating at the end of training assesses a linear form of mode connectivity, a phenomenon where the minima found by two networks are connected by a path of constant error. Draxler et al. (2018) and Garipov et al. (2018) show that the modes of standard vision networks trained from different initializations are connected by piecewise linear paths of constant error or loss. Based on this work, we expect that all networks we examine are connected by such paths. However, the modes found by Draxler et al. and Garipov et al. are not connected by linear paths. The only extant example of linear mode connectivity is by Nagarajan and Kolter (2019), who train MLPs from the same initialization on disjoint subsets of MNIST and find that the resulting networks are connected by linear paths of constant test error. In contrast, we explore linear connectivity from points throughout training, we do so at a larger scale, and we focus on different samples of SGD noise rather than disjoint samples of data.
Table 1: Networks, datasets, and hyperparameters.

| Network | Variant | Dataset | Params | Train For | Batch | Accuracy | Optimizer | Rate | Schedule | Warmup | BatchNorm | Pruning Level | Pruning Style |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lenet | – | MNIST | 266K | 24K iters | 60 | 98.3 ± 0.1% | Adam | 12e-4 | constant | 0 | No | 3.5% | Iterative |
| Resnet20 | Standard | CIFAR10 | 274K | 63K iters | 128 | 91.7 ± 0.1% | momentum | 0.1 | 10x drop at 32K, 48K | 0 | Yes | 16.8% | Iterative |
| Resnet20 | Low | CIFAR10 | 274K | 63K iters | 128 | 88.8 ± 0.1% | momentum | 0.01 | 10x drop at 32K, 48K | 0 | Yes | 8.6% | Iterative |
| Resnet20 | Warmup | CIFAR10 | 274K | 63K iters | 128 | 89.7 ± 0.3% | momentum | 0.03 | 10x drop at 32K, 48K | 30K | Yes | 8.6% | Iterative |
| VGG16 | Standard | CIFAR10 | 14.7M | 63K iters | 128 | 93.7 ± 0.1% | momentum | 0.1 | 10x drop at 32K, 48K | 0 | Yes | 1.5% | Iterative |
| VGG16 | Low | CIFAR10 | 14.7M | 63K iters | 128 | 91.7 ± 0.1% | momentum | 0.01 | 10x drop at 32K, 48K | 0 | Yes | 5.5% | Iterative |
| VGG16 | Warmup | CIFAR10 | 14.7M | 63K iters | 128 | 93.4 ± 0.1% | momentum | 0.1 | 10x drop at 32K, 48K | 30K | Yes | 1.5% | Iterative |
| Resnet50 | – | ImageNet | 25.5M | 90 eps | 1024 | 76.1 ± 0.1% | momentum | 0.4 | 10x drop at eps 30, 60, 80 | 5 eps | Yes | 30% | One-Shot |
| Inceptionv3 | – | ImageNet | 27.1M | 171 eps | 1024 | 78.1 ± 0.1% | momentum | 0.03 | linear decay to 0.005 | 0 | Yes | 30% | One-Shot |
We examine the instability of standard networks for MNIST, CIFAR10, and ImageNet. All but the smallest MNIST network are unstable at initialization. However, by a point early in training (3% for Resnet20 on CIFAR10 and 20% for Resnet50 on ImageNet), all networks become stable. From this point forward, the outcome of optimization is determined to a linearly connected minimum.
The lottery ticket hypothesis. Finally, we show that instability analysis is a valuable scientific tool for assessing the effect of SGD noise in other contexts. Specifically, we study the sparse networks discussed by the recent lottery ticket hypothesis (LTH; Frankle and Carbin, 2019). The LTH conjectures that, at initialization, neural networks contain sparse subnetworks that can train in isolation to full accuracy.
Empirical evidence for the LTH consists of experiments using a procedure called iterative magnitude pruning (IMP). On small networks for MNIST and CIFAR10, IMP finds subnetworks at initialization that can match the accuracy of the full network (we refer to such subnetworks as matching) at sparsity levels far beyond those at which randomly pruned or randomly reinitialized subnetworks can do the same. In more challenging settings, however, there is no empirical evidence for the LTH. IMP subnetworks of VGGs and Resnets on CIFAR10 and ImageNet perform no better than other sparse networks (Liu et al., 2019; Gale et al., 2019).
We find that instability analysis distinguishes known cases where IMP succeeds and fails to find a matching subnetwork, providing the first basis for understanding the mixed results in the literature. Namely, IMP subnetworks are only matching when they are stable. Using this insight, we identify new scenarios where we can find sparse, matching subnetworks, including in more challenging settings (e.g., Resnet50 on ImageNet). In these settings, sparse IMP subnetworks become stable early in training rather than at initialization, just as we found with the unpruned networks. Moreover, these stable IMP subnetworks are also matching. In other words, early in training (if not at initialization), sparse subnetworks emerge that are capable of completing training in isolation and reaching full accuracy. These findings shed new light on neural network training dynamics and hint at possible mechanisms underlying lottery ticket phenomena.
Contributions. We make the following contributions:

We introduce instability analysis to identify whether a neural network will find the same linearly connected minimum despite different samples of SGD noise.

On a range of image classification benchmarks including standard networks on ImageNet, we observe that networks become stable to SGD noise early in training.

We use instability analysis to distinguish successes and failures of IMP (the core method behind the lottery ticket hypothesis) identified in prior work. Namely, extremely sparse IMP subnetworks are matching only when stable.

We extend IMP with rewinding and show that IMP subnetworks become stable and matching when set to their weights from early in training. In doing so, we show how to find matching subnetworks that were present early in training in more challenging settings than in prior work.
2 Preliminaries and Methodology
Instability analysis via linear connectivity. Instability analysis evaluates whether the minima found when training two copies of a neural network on different randomly sampled executions of SGD (i.e., different data orders) are linearly connected by a path over which error does not increase. The network could be randomly initialized (W_0 in Figure 1) or the result of k training iterations (W_k). To perform instability analysis, we make two copies of the network and train them to completion with different random data orders, resulting in trained weights W_T^1 and W_T^2. We then linearly interpolate between the trained weights (dashed line) and compute the train or test error at each point (blue curve) to determine whether it increased (minima are not linearly connected) or did not (minima are linearly connected).
We represent SGD by a function A_{t→t+u}(W_t, U) that maps weights W_t at step t and SGD randomness U to weights W_{t+u} at step t+u by training for u steps (for t + u ≤ T, where T is the total number of training steps). E(W) denotes the error of a network with weights W. Algorithm 1 outlines our procedure:
We describe the result of linear interpolation (step 5) with a quantity that we term instability. Let E_mean be the average test error of W_T^1 and W_T^2. Let E_sup = max_α E(α W_T^1 + (1 − α) W_T^2) be the highest test error when linearly interpolating between W_T^1 and W_T^2. The instability is E_sup − E_mean (red line in Figure 1). When instability ≈ 0, the minima are linearly connected and we say the network is stable. Otherwise, we say it is unstable. Empirically, we consider instability below 2% to be stable; this margin accounts for noise as we interpolate and matches the increases in test error along the paths found by Draxler et al. (2018, Table B.1) and Garipov et al. (2018, Table 2). We interpolate using 30 evenly-spaced values of α ∈ [0, 1], and we average instability from three initializations and three data orders per initialization (nine combinations in total).
Networks and datasets. We study image classification networks on MNIST, CIFAR10, and ImageNet as specified in Table 1. All hyperparameters are standard values from reference implementations or prior work as cited in Table 1. The warmup and low variants of Resnet20 and VGG16 are adapted from hyperparameters in Frankle and Carbin (2019).
3 Neural Network Instability to SGD Noise
In this section, we perform instability analysis on the standard networks in Table 1 from many points during training. We find that, although only Lenet is stable at initialization, every network becomes stable early in training, meaning the outcome of optimization from that point forward is determined to within a linearly connected region.
Instability at initialization. We begin by studying the effect of data order on linear connectivity when starting at initialization. We use Algorithm 1 with k = 0 (visualized in Figure 1 left): train two copies of the same, randomly initialized network with different data orders. Figure 2 shows the train (purple) and test (red) error when linearly interpolating between the minima found by these copies. Except for Lenet (MNIST), none of the networks are stable at initialization. In fact, train and test error rise to the point of random guessing when linearly interpolating. Lenet's error rises slightly, but by less than a percentage point. We conclude that, in general, larger-scale image classification networks are unstable at initialization.
Instability during training. Although larger networks are unstable at initialization, they may become stable at some point afterward; in the limit, they will be stable by definition after the last step of training. To investigate when stability emerges, we train a network for k steps, make two copies, train them to completion on different data orders, and linearly interpolate (visualized in Figure 1 right). We do so for many values of k, assessing whether there is a point after which the outcome of optimization is determined modulo linear interpolation, regardless of the data order.
Figure 3 presents the instability of the networks for various values of k. Instability is the maximum error during interpolation (the peaks in Figure 2) minus the mean of the errors of the two networks (the endpoints in Figure 2). In all cases, test set instability decreases as k increases, culminating in networks that are stable. The iteration at which stability emerges is surprisingly early. For example, it occurs at iteration 2000 for Resnet20 and iteration 1000 for VGG16; in other words, after 3% and 1.5% of training, SGD noise cannot affect the final minimum modulo linear interpolation. Stability occurs later for Resnet50 and Inceptionv3: at epoch 18 (20% into training) and epoch 28 (16%), respectively, using the test set.
For Lenet, Resnet20, and VGG16, instability is essentially identical when measured in terms of train or test error, and the networks become stable at the same time when using both quantities. For Resnet50 and Inceptionv3, train instability follows the same trend as test instability but is slightly higher at all points, meaning train set stability occurs later for Resnet50 and does not occur in our range of analysis for Inceptionv3. Going forward, we present all results with respect to test error for simplicity and include corresponding train error data in the appendices.
Disentangling instability from training time. Varying the iteration k from which we run instability analysis has two effects. First, it changes the state of the network from which we train two copies to completion on different data orders. Second, it changes the number of iterations for which those copies are trained. Concretely, when we run instability analysis from iteration k, we train the copies under different data orders for T − k iterations. As k increases, the copies have fewer iterations during which to potentially find linearly unconnected minima. It is possible that the gradual decrease in instability as k increases and the eventual emergence of stability are just artifacts of these shorter training times.
To disentangle the role of training time in our experiments, we modify instability analysis to train the copies for T iterations no matter the value of k. When doing so, we reset the learning rate schedule to iteration 0 after making the copies. In Figure 4, we compare instability with and without this modification for Resnet20 and VGG16 on CIFAR10. Instability is indistinguishable in both cases, indicating that the different numbers of training steps did not play a role in the onset of stability. Going forward, we present all results by training the copies for T − k iterations.
4 Instability and Lottery Tickets
In this section, we leverage instability analysis and our observations about stable networks to gain new insights into the behavior of sparse lottery ticket networks.
4.1 Overview
We have long known that it is possible to prune neural networks after training, often removing 90% of weights without reducing accuracy after some additional training (e.g., Reed, 1993; Han et al., 2015; Gale et al., 2019). However, sparse networks are more difficult to train from scratch. At the most extreme sparsities attained by pruning, sparse networks trained in isolation are generally less accurate than the corresponding dense networks (Han et al., 2015; Li et al., 2016; Liu et al., 2019; Frankle and Carbin, 2019).
However, there is a known class of networks that remain accurate at these sparsities. On small vision tasks, an algorithm called iterative magnitude pruning (IMP) retroactively finds sparse subnetworks that were capable of training in isolation from initialization to full accuracy (Frankle and Carbin, 2019). The existence of such subnetworks raises the prospect of replacing conventional, dense networks with sparse ones, creating new opportunities to reduce the cost of training. However, in more challenging settings, IMP subnetworks perform no better than other kinds of subnetworks, and they do not train to full accuracy at the sparsities attained by pruning (Liu et al., 2019; Gale et al., 2019).
We find that instability offers new insights into the behavior of IMP subnetworks and a potential explanation for their successes and failures. Namely, the sparsest IMP subnetworks only train to full accuracy when they are stable. That is, when different samples of SGD noise cause an IMP subnetwork to find minima that are not linearly connected, then test accuracy is lower. This contrasts with the unpruned networks, whose accuracy seems unaffected by instability.
4.2 Methodology
Iterative magnitude pruning. Iterative magnitude pruning (IMP) is a procedure to retroactively find a subnetwork of the state of the full network at iteration k of training. To do so, IMP trains a network to completion, prunes the weights with the lowest magnitudes globally, and rewinds the remaining weights back to their values at iteration k (Algorithm 2). The result is a subnetwork (W_k, m), where W_k is the state of the full network at iteration k and m is a mask such that m ⊙ W_k (where ⊙ is the element-wise product) is a pruned network. We can run IMP iteratively (pruning 20% of weights as in Han et al. (2015), rewinding, and repeating until reaching a target sparsity) or in one shot (pruning to the target sparsity at once). We one-shot prune the ImageNet networks for efficiency and iteratively prune otherwise (Table 1).
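A minimal sketch of IMP with rewinding, under loud assumptions: `train` is a stand-in for running SGD to completion (here it just doubles every weight), and weights are a single flat vector rather than a real network. The masking-and-rewinding logic is the part this illustrates.

```python
import numpy as np

def imp_with_rewinding(w_k, train, rounds, prune_frac=0.2):
    """Sketch of iterative magnitude pruning with rewinding to iteration k.

    w_k:   flat vector of weights saved at rewinding iteration k.
    train: stand-in for SGD, mapping (weights, mask) -> trained weights.
    Each round trains to completion, prunes the lowest-magnitude
    `prune_frac` of surviving weights globally, and rewinds the
    survivors to their values in w_k.
    """
    mask = np.ones_like(w_k)
    for _ in range(rounds):
        w_final = train(w_k * mask, mask)
        survivors = np.abs(w_final[mask == 1])
        threshold = np.quantile(survivors, prune_frac)
        mask = mask * (np.abs(w_final) > threshold)
    return w_k * mask, mask

# Toy stand-in: "training" doubles every weight, so pruning keeps the
# largest-magnitude initial weights. Real IMP would train a network here.
rng = np.random.default_rng(0)
w_k = rng.normal(size=1000)
w, mask = imp_with_rewinding(w_k, train=lambda w, m: 2.0 * w, rounds=3)
print(mask.mean())  # ~0.8**3 = 0.512 of weights remain after three rounds
```

Setting `w_k` to the initial weights recovers the original lottery ticket experiment (rewinding to iteration 0); the generalization in this paper saves `w_k` from early in training instead.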
Frankle and Carbin (2019) focus on finding sparse subnetworks at initialization; as such, they only rewind to iteration 0. One of our contributions is to generalize IMP to any rewinding iteration k. They refer to subnetworks that match the accuracy of the full network as winning tickets because they have “won the initialization lottery” with weights that make attaining this accuracy possible. When we rewind to iteration k > 0, subnetworks are no longer randomly initialized, so the term winning ticket is no longer appropriate. Instead, we refer to such subnetworks simply as matching.
Sparsity levels. In this section, we focus on the most extreme sparsity levels for which IMP returns a matching subnetwork at any rewinding iteration k. These levels are in Table 1, and Appendix B explains these choices. These sparsities provide the best contrast between sparse networks that are matching and (1) the full, overparameterized networks and (2) other classes of sparse networks. Appendix E includes the analyses from this section for all sparsities for Resnet20 and VGG16, which we summarize in Section 4.4. Due to the computational costs of these experiments, we only collected data across all sparsities for these networks.
Table 2: Test accuracy (%) of subnetworks when rewinding to iteration 0, along with IMP subnetwork instability (%).

| Network | Full | IMP | Rand Prune | Rand Reinit | IMP Instability | Matching? |
|---|---|---|---|---|---|---|
| Lenet | 98.3 | 98.2 | 96.7 | 97.5 | 0.1 | Y |
| Resnet20 | 91.7 | 88.5 | 88.6 | 88.8 | 3.2 | N |
| Resnet20 Low | 88.8 | 89.0 | 85.7 | 84.7 | 0.2 | Y |
| Resnet20 Warmup | 89.7 | 89.6 | 85.7 | 85.6 | 0.1 | Y |
| VGG16 | 93.7 | 90.9 | 89.4 | 91.0 | 2.8 | N |
| VGG16 Low | 91.7 | 91.6 | 90.1 | 90.2 | 0.1 | Y |
| VGG16 Warmup | 93.4 | 93.2 | 90.1 | 90.7 | 0.2 | Y |
| Resnet50 | 76.1 | 73.7 | 73.1 | 73.4 | 2.4 | N |
| Inceptionv3 | 78.1 | 75.7 | 75.2 | 75.5 | 2.4 | N |
4.3 Experiments and Results
Recapping the lottery ticket hypothesis. We begin by studying sparse subnetworks trained from initialization (k = 0). This is the lottery ticket experiment from Frankle and Carbin (2019). As Table 2 shows, when rewinding to iteration 0, IMP subnetworks of Lenet are matching, as are variants of Resnet20 and VGG16 with lower learning rates or learning rate warmup (changes proposed by Frankle and Carbin to make it possible for IMP to find matching subnetworks). However, IMP subnetworks of standard Resnet20, standard VGG16, Resnet50, and Inceptionv3 are not matching. In fact, they are no more accurate than subnetworks generated by randomly pruning or reinitializing the IMP subnetworks, suggesting that neither the structure nor the initialization uncovered by IMP provides a performance advantage. For full details on the accuracy of these subnetworks at all levels of sparsity, see Appendix B.
IMP subnetwork instability at initialization. When we perform instability analysis on these subnetworks, we find that they are only matching when they are stable (Figure 5). The IMP subnetworks of Lenet, Resnet20 (low, warmup), and VGG16 (low, warmup) are stable and matching (Figure 5, left). In all other cases, IMP subnetworks are neither stable nor matching (Figure 5, left). The low and warmup experiments are notable because Frankle and Carbin (2019) selected these hyperparameters specifically for IMP to find matching subnetworks; the fact that this change also makes the subnetworks stable adds further evidence of a connection between instability and accuracy in IMP subnetworks.
No randomly pruned or reinitialized subnetworks are stable or matching at these sparsities except Lenet. These subnetworks of Lenet are not matching but error only rises slightly when interpolating. For all other networks, error approaches that of random guessing when interpolating.
IMP subnetwork instability during training. We just saw that IMP subnetworks are matching from initialization only when they are stable. In Section 3, we found that unpruned networks become stable only after a certain amount of training. Here, we combine these observations: we study whether IMP subnetworks become stable later in training and, if so, whether improved accuracy follows.
Concretely, we perform IMP where we rewind to iteration k after pruning. Doing so produces a subnetwork (W_k, m) of the state of the full network at iteration k. We then run instability analysis using this subnetwork. Another way of looking at this experiment is that it simulates training the full network to iteration k, generating a pruning mask, and evaluating the instability of the resulting sparse network; the underlying mask-generation procedure involves training the network many times in the course of performing IMP.
The blue dots in Figure 6 show the instability of the IMP subnetworks at many rewinding iterations. Networks whose IMP subnetworks were stable when rewinding to iteration 0 remain stable at all other rewinding points (Figure 6, left). Notably, networks whose IMP subnetworks were unstable when rewinding to iteration 0 become stable when rewinding later. IMP subnetworks of Resnet20 and VGG16 become stable at iterations 500 (0.8% into training) and 1000 (1.6%). Likewise, IMP subnetworks of Resnet50 and Inceptionv3 become stable at epochs 5 (5.5% into training) and 6 (3.5%). In all cases, the IMP subnetworks become stable sooner than the unpruned networks, substantially so for Resnet50 (epoch 5 vs. 18) and Inceptionv3 (epoch 6 vs. 28).
The error of the IMP subnetworks behaves similarly. The blue line in Figure 7 plots the error of the IMP subnetworks and the gray line plots the error of the full networks to one standard deviation; subnetworks are matching when the lines cross. Networks whose IMP subnetworks were matching when rewinding to iteration 0 (Figure 7, left) generally remain matching at later iterations (except for Resnet20 low and VGG16 low at the latest rewinding points). Notably, networks whose IMP subnetworks were not matching when rewinding to iteration 0 (Figure 7, right) become matching when rewinding later. Moreover, these rewinding points closely coincide with those where the subnetworks become stable. In summary, at these extreme sparsities, IMP subnetworks are matching when they are stable.
Randomly pruned and reinitialized subnetworks are unstable and nonmatching at all rewinding points (with Lenet again an exception). Although it is beyond the scope of our study, this behavior suggests a potential broader link between subnetwork stability and accuracy: IMP subnetworks are matching and become stable at least as early as the full networks, while other subnetworks are less accurate and unstable for the sparsities and rewinding points we consider.
4.4 Results at Other Sparsity Levels
Thus far, we have studied instability at only two sparsities: unpruned networks (Section 3) and an extreme sparsity (Section 4.3).
In this section, we examine sparsities between these levels and beyond the extreme sparsity for Resnet20 and VGG16.
Figure 8 presents the median iteration at which IMP and randomly pruned subnetworks become stable (instability below 2%) and matching (accuracy drop within a small margin that allows for noise) across sparsity levels.
Stability behavior. As sparsity increases, the stability iteration of the IMP subnetworks becomes earlier, plateaus, and eventually increases. In contrast, the stability iteration of randomly pruned subnetworks only becomes later until the subnetworks are no longer stable at any rewinding iteration.
Matching behavior. We separate the sparsities into three ranges reflecting when different sparse networks are matching. In sparsity range I, the networks are overparameterized, so much so that even randomly pruned subnetworks are matching (red). This range occurs when more than 80.0% and 16.8% of weights remain for Resnet20 and VGG16, respectively.
In sparsity range II, the networks are sufficiently sparse that only IMP subnetworks are matching (orange). This range occurs when 80.0% to 13.4% and 16.8% to 1.2% of weights remain in Resnet20 and VGG16, respectively. For part of this range, IMP subnetworks become matching and stable at approximately the same rewinding iteration; namely, when 51.2% to 13.4% and 6.9% to 1.5% of weights remain for Resnet20 and VGG16. In Section 4.3, we observed this behavior for a single, extreme sparsity level for each network. Based on Figure 8, we conclude that there are many sparsities at which these rewinding iterations coincide for Resnet20 and VGG16.
In sparsity range III, the networks are so sparse that even IMP subnetworks are not matching at any rewinding iteration we consider. This range occurs when fewer than 13.4% and 1.2% of weights remain for Resnet20 and VGG16. According to Appendix E, the error of IMP subnetworks still decreases when they become stable (although not to the point that they are matching), potentially suggesting a broader relationship between instability and accuracy.
5 Discussion
Instability analysis. We introduce instability analysis as a novel way to study the sensitivity of a neural network’s optimization trajectory to data order randomness. In doing so, we uncover a class of situations in which linear mode connectivity emerges, whereas previous examples of mode connectivity (e.g., between networks trained from different initializations) at similar scales required piecewise linear paths (Draxler et al., 2018; Garipov et al., 2018).
Our full network results divide training into two phases: an unstable phase where the network finds linearly unconnected minima due to SGD noise and a stable phase where the linearly connected minimum is determined. Our finding that stability emerges early in training adds to work suggesting that training comprises a noisy first phase and a less stochastic second phase. For example, the Hessian eigenspectrum settles into a few large values and a bulk (Gur-Ari et al., 2018), and large-batch training at high learning rates benefits from learning rate warmup (Goyal et al., 2017).
One way to exploit our findings is to explore changing aspects of optimization (e.g., learning rate schedule or optimizer) similar to Goyal et al. (2017) once the network becomes stable to improve performance; instability analysis can evaluate the consequences of doing so. We also believe instability analysis provides a scientific tool for topics related to the scale and distribution of SGD noise, e.g., the relationship between batch size, learning rate, and generalization (LeCun et al., 2012; Keskar et al., 2017; Goyal et al., 2017; Smith and Le, 2018; Smith et al., 2018) and the efficacy of alternative learning rate schedules (Smith, 2017; Smith and Topin, 2018; Li and Arora, 2019).
The lottery ticket hypothesis. The lottery ticket hypothesis (Frankle and Carbin, 2019) conjectures that any “randomly initialized, dense neural network contains a subnetwork that—when trained in isolation—matches the accuracy of the original network.” This work is among several recent papers to propose that merely sparsifying at initialization can produce high-performance neural networks (Mallya et al., 2018; Zhou et al., 2019; Ramanujan et al., 2019). Frankle and Carbin support the lottery ticket hypothesis by using IMP to find matching subnetworks at initialization in small vision networks. However, follow-up studies show (Liu et al., 2019; Gale et al., 2019) and we confirm that IMP does not find matching subnetworks in more challenging settings. We use instability analysis to distinguish the successes and failures of IMP as identified in previous work. In doing so, we make a new connection between the hypothesis and the optimization dynamics of neural networks.
Practical impact of rewinding. By augmenting IMP with rewinding, we show how to find matching subnetworks in much larger settings than in previous work, albeit from early in training rather than initialization. Our technique has already been adopted for practical purposes. Morcos et al. (2019) show that subnetworks found by IMP with rewinding transfer between vision tasks, meaning the effort of finding a subnetwork can be amortized by reusing it many times. Renda et al. (2020) show that IMP with rewinding prunes to state-of-the-art sparsities, matching or exceeding the performance of standard techniques that fine-tune at a low learning rate after pruning (e.g., Han et al., 2015; He et al., 2018). Other efforts use rewinding to further study lottery tickets (Yu et al., 2020; Frankle et al., 2020; Caron et al., 2020; Savarese et al., 2020; Yin et al., 2020).
Pruning. In largerscale settings, IMP subnetworks only become stable and matching after the full network has been trained for some number of steps. Recent proposals attempt to prune networks at initialization (Lee et al., 2019; Wang et al., 2020), but our results suggest that the best time to do so may be after some training. Likewise, most pruning methods only begin to sparsify networks late in training or after training (Han et al., 2015; Gale et al., 2019; He et al., 2018). The existence of matching subnetworks early in training suggests that there is an unexploited opportunity to prune networks much earlier than current methods.
6 Conclusions
We propose instability analysis to shed light on the variability of neural network optimization trajectories induced by random data orders. We find that standard networks for MNIST, CIFAR10, and ImageNet become stable to this randomness early in training, after which the outcome of optimization is determined to a linearly connected minimum.
We then apply instability analysis to better understand a key question at the center of work on the lottery ticket hypothesis: why does iterative magnitude pruning find sparse networks that can train from initialization to full accuracy in smallerscale settings (e.g., MNIST) but not on more challenging tasks (e.g., ImageNet)? We find that extremely sparse IMP subnetworks only train to full accuracy when they are stable, which occurs at initialization in some settings but only after some amount of training in others.
Instability analysis contributes to a growing range of empirical tools for studying and understanding the behavior of neural networks in practice. In our paper, we show that it has already yielded new insights into neural network training dynamics and lottery ticket phenomena.
Appendix A Overview and Contents
In this supplementary material, we include data that either (1) we processed to produce the plots in the paper or (2) that we were not able to fit in the main body of the paper. The contents of these appendices are as follows:
Appendix C. Details about the states of the unpruned networks and IMP subnetworks at the rewinding iterations, including full network accuracy, L2 distance from initialization, L2 distance to the final trained weights, and the L2 distance between trained weights under different data orders.
Appendix D. Instability data throughout training for Resnet20 and VGG16; that is, interpolating between the states at each epoch of networks trained on different data orders.
Appendix E. Instability and test error across rewinding iterations for Resnet20 and VGG16 at all levels of sparsity (not just the extreme sparsity we analyzed in Section 4.3).
Appendix F. The error when linearly interpolating for all networks in all configurations (unpruned and sparse) at all rewinding iterations. This data was used to create the instability plots in Figures 3 and 6.
Appendix G. The training set instability for the sparse networks corresponding to the test set instability data that we present in Figure 6 in Section 4.
Appendix H. Metrics other than linear mode connectivity for comparing the networks that result from our instability experiments: ℓ2 distance, cosine distance, classification differences, and ℓ2 distance between per-example losses.
Appendix B Selecting Extreme Sparsity Levels for IMP
In this appendix, we describe how we select the extreme sparsity level that we examine in Section 4.3 for each IMP subnetwork. For each network and hyperparameter configuration, our goal is to study the most extreme sparsity level at which matching subnetworks are known to exist early in training. To do so, we use IMP to generate subnetworks at many different sparsities for many different rewinding iterations. We then select the most extreme sparsity level at which IMP at any rewinding iteration produces a matching subnetwork.
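This selection rule can be sketched in a few lines of numpy. The function and table layout below are ours, not the paper's: we assume an accuracy table of shape sparsities × rewinding iterations, and we treat "matching" as reaching accuracy within one standard deviation of the unpruned network.

```python
import numpy as np

def select_extreme_sparsity(acc, unpruned_mean, unpruned_std, sparsities):
    """Return the most extreme sparsity at which any rewinding iteration
    yields a matching subnetwork.

    acc: array of shape (num_sparsities, num_rewind_iters) with the test
         accuracy of the IMP subnetwork at each sparsity/rewind pair.
    sparsities: fractions of weights remaining, ordered from least to
         most extreme (i.e., decreasing).
    """
    threshold = unpruned_mean - unpruned_std          # "matching" cutoff
    best = acc.max(axis=1)         # best accuracy over rewinding iterations
    matching_idx = np.where(best >= threshold)[0]     # sparsities that match
    return sparsities[matching_idx[-1]] if len(matching_idx) else None
```

For instance, with sparsities [0.8, 0.64, 0.512] and an unpruned accuracy of 0.92 ± 0.01, the function returns the deepest sparsity whose best accuracy over all rewinding iterations still reaches 0.91.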
In Figure 9, each plot shows in red the maximum accuracy found by any rewinding iteration. The black line is the accuracy of the unpruned network to within one standard deviation. For each network, we select the most extreme sparsity at which the red and black lines intersect. As a basis for comparison, these plots also include the result of performing IMP with rewinding iteration 0 (blue line), random pruning (orange line), and random reinitialization of the IMP subnetworks with rewinding iteration 0 (green line).
Note that, for computational reasons, Resnet-50 and Inception-v3 are pruned using one-shot pruning, meaning the networks are pruned to the target sparsity all at once. All other networks are pruned using iterative pruning, meaning 20% of remaining weights are pruned after each iteration of IMP until the networks reach the target sparsity. Pruning 20% per iteration is standard practice in the pruning literature (Han et al., 2015; Frankle and Carbin, 2019; Renda et al., 2020). This information is specified in Table 1.
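The difference between the two schemes can be sketched as follows. This is a minimal numpy illustration with function names of our own choosing; real IMP retrains the network between iterative pruning steps, which this sketch omits.

```python
import numpy as np

def magnitude_mask(weights, fraction_remaining):
    """One-shot pruning: keep the largest-magnitude weights in a single step."""
    k = int(round(fraction_remaining * weights.size))    # weights to keep
    threshold = np.sort(np.abs(weights).ravel())[::-1][k - 1]
    return (np.abs(weights) >= threshold).astype(float)

def iterative_mask(weights, n_iterations, prune_per_iter=0.2):
    """Iterative pruning: remove 20% of the *remaining* weights per IMP
    iteration, leaving (1 - prune_per_iter) ** n_iterations of the weights.
    (Real IMP retrains between steps; this sketch prunes in one pass.)"""
    mask = np.ones_like(weights, dtype=float)
    flat_mask = mask.ravel()
    magnitudes = np.abs(weights).ravel()
    for _ in range(n_iterations):
        alive = np.where(flat_mask == 1)[0]
        k = int(round(prune_per_iter * len(alive)))      # weights to remove
        smallest = alive[np.argsort(magnitudes[alive])[:k]]
        flat_mask[smallest] = 0
    return mask
```

After n iterative steps, 0.8^n of the weights remain (80%, 64%, 51.2%, ...), matching the sparsity levels discussed elsewhere in these appendices.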
Appendix C The State of the Network at Rewinding
C.1 Methodology
In the main body of the paper, we make two copies of the network at a rewinding iteration k, optionally apply a pruning mask (as in Section 4), and train the two copies from there to completion under different data orders. We find that, for a sufficiently large value of k, the trained networks find the same, linearly connected minimum. In this appendix, we address the following question: what is the state of the network at the rewinding points from which this linear connectivity results? Are the networks so far along in training that they are virtually fully optimized? Have they traveled the vast majority of the distance from initialization to the eventual minimum? In this sense, is the iteration at which the network becomes stable “trivial”? We address these questions in two ways.
Error at rewinding. In Figure 10, we present the error of the unpruned network at each rewinding iteration we consider in the main body of the paper. With this data, we investigate how close the network has come to its full accuracy when it becomes stable.
ℓ2 distances. In Figures 11 and 12, we measure various ℓ2 distances that capture how close the network is to initialization and to the end of training. In particular, we measure three distances, as shown in the diagram below (an annotated version of Figure 1).
First, we measure the ℓ2 distance in parameter space from initialization to the state of the network at each rewinding iteration k (blue circle in the diagram above and in Figures 11 and 12); for the sparse IMP subnetworks, we measure this distance after applying the pruning mask to both the initialization and the state of the network at iteration k. This quantity captures the distance that the network has traversed from initialization by iteration k.
Second, we measure the ℓ2 distance from the state of the network at the rewinding iteration k to its state at the end of training under one data order (orange x in the diagram above and in Figures 11 and 12). This quantity captures the distance that the network traverses after the rewinding iteration k. If the network is very close to the optimum by the time it becomes stable, then we expect this quantity to be small compared to the distance between initialization and iteration k; that would indicate that the network has already traversed a large distance and has a relatively small distance left to go.
Finally, we measure the ℓ2 distance between the final states of networks trained from the rewinding iteration k under different data orders (green triangle in the diagram above and in Figures 11 and 12). This quantity captures the size of the linearly connected minimum found by the networks. We are interested in how this distance compares to the distance traveled by the networks and how it changes as the rewinding iteration varies.
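The three measurements can be sketched as follows; this is a minimal numpy sketch with hypothetical names, where each argument is the network's flattened parameter vector at the corresponding point in training.

```python
import numpy as np

def l2(a, b):
    """Euclidean (L2) distance between two flattened parameter vectors."""
    return float(np.linalg.norm(a - b))

def rewind_distances(w_init, w_rewind, w_final_1, w_final_2, mask=None):
    """The three distances described above: init -> rewind (blue circle),
    rewind -> final under one data order (orange x), and final <-> final
    under two data orders (green triangle). For sparse IMP subnetworks,
    the pruning mask is applied to all weights first."""
    if mask is not None:
        w_init, w_rewind = w_init * mask, w_rewind * mask
        w_final_1, w_final_2 = w_final_1 * mask, w_final_2 * mask
    return {
        "init_to_rewind": l2(w_init, w_rewind),
        "rewind_to_final": l2(w_rewind, w_final_1),
        "between_finals": l2(w_final_1, w_final_2),
    }
```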
C.2 Results
Error at rewinding. These results appear in Figure 10. Recall that the unpruned networks become stable at a different (typically later) iteration than the IMP subnetworks, so we consider two rewinding points for each network.
Unpruned networks. Resnet-20 and VGG-16 become stable at iterations 2000 and 1000, respectively, at which point test error is about 25% (compared to a final error of 8.3%) for Resnet-20 and 20% (compared to a final error of 6.3%) for VGG-16. Train error is similar to test error at these points; in both cases, train error eventually converges to 0%. We conclude that, at the iteration at which they become stable, these networks have not fully converged but are much closer to their final errors than to random guessing.
We see similar behavior for the unpruned Resnet-50 and Inception-v3 networks, which become stable at epochs 18 and 28, respectively. At these points, test error is 55% (compared to a final error of 24%) for Resnet-50 and 33% (compared to a final error of 22%) for Inception-v3. Both networks are most of the way to their final accuracies.
IMP pruned subnetworks. The IMP pruned subnetworks become stable earlier than the unpruned networks. Resnet-20 and VGG-16 become stable at iterations 500 and 1000, respectively, at which point error is 30% (compared to a final error of 8.3%) for Resnet-20 and 35% (compared to a final error of 6.3%) for VGG-16. These networks have not fully converged but are closer to their final errors than to random guessing. IMP subnetworks of Resnet-50 and Inception-v3 become stable much earlier than the unpruned networks, at epochs 5 and 6, respectively. At these points, error is much higher (55% for Resnet-50 and 40% for Inception-v3), leaving these networks substantial room to train further. We did not evaluate the train accuracy at these checkpoints for the ImageNet networks due to storage and computational limitations.
ℓ2 distances. These results appear in Figures 11 and 12.

Unpruned networks. Resnet-20 and VGG-16 become stable at iterations 2000 and 1000, respectively, at which point they are closer to their initial weights than to their final weights. This indicates that they still have a substantial distance to travel on the optimization landscape and are still far from their final weights. This result is particularly remarkable considering our observation in Appendix D that stable networks follow the same, linearly connected trajectory throughout training (according to test error); the distance data suggests that they do so for a substantial distance.
The unpruned Resnet-50 and Inception-v3 networks are closer to their final weights than to their initial weights when they become stable. In fact, the distance from initialization begins to plateau, and the distance to the final weights decreases only slowly. This may indicate that the networks make much slower progress during the remaining 80% of training iterations.
The green triangles in these plots show the ℓ2 distance between the weights of copies of the network trained from a rewinding iteration to completion on different data orders. In all cases, the distance between these copies is substantial, even after the networks become stable. As a point of comparison, we use the distance that the networks travel between initialization and the final weights, which is captured by the orange x for rewinding iteration 0. For Resnet-20, the distance between copies trained on different data orders from iteration 2000 (when it becomes stable) is more than half the distance that the network travels during the entirety of training. The same is true for VGG-16 from iteration 1000 (when it becomes stable). For Resnet-50 and Inception-v3, this distance is about a quarter and half (respectively) of the distance the networks travel over the course of training. These are remarkably large distances considering that any network on this line segment reaches full test accuracy.
IMP pruned subnetworks. We discuss the IMP subnetworks in Figure 12. Each distance in this figure is measured after applying the pruning mask to all weights. When Resnet-20 and VGG-16 become stable (iterations 2000 and 1000, respectively), they are about 2x (Resnet-20) and 3x (VGG-16) closer to their initial weights than to their final weights. Resnet-50 and Inception-v3 are about equal distances from both points at the epochs at which they become stable.
Unique to the IMP subnetworks, we observe here and in Appendix H that the distance between copies trained on different data orders drops alongside instability, plateauing at a lower value when training from the rewinding iteration at which the subnetworks become stable. Even this lower distance is still a substantial fraction of the overall distance the network travels: 25%, 45%, 27%, and 28% for Resnet-20, VGG-16, Resnet-50, and Inception-v3, respectively.
Appendix D Instability Throughout Training
In Section 3, we find that stable networks arrive at minima that are linearly connected. In this appendix, we study whether the trajectories they follow are also linearly connected. In other words, when training two copies of the same network under different samples of SGD noise, are the states of the networks at each iteration connected by a linear path over which test error does not increase? In the main body of the paper, we study this quantity only at the end of training. Here, we study it at every iteration throughout training. To do so, we linearly interpolate between the networks at each epoch of training and compute instability.
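A minimal sketch of this computation, assuming a caller-supplied `eval_error` function that maps a parameter vector to train or test error. Instability here is the height of the error barrier on the linear path: the maximum interpolated error minus the mean of the endpoint errors.

```python
import numpy as np

def instability(w1, w2, eval_error, n_points=31):
    """Error barrier on the linear path between two trained networks:
    the maximum error over interpolated parameter vectors minus the mean
    of the two endpoint errors. `eval_error` maps a flattened parameter
    vector to train or test error and must be supplied by the caller."""
    alphas = np.linspace(0.0, 1.0, n_points)
    errors = [eval_error((1 - a) * w1 + a * w2) for a in alphas]
    return max(errors) - 0.5 * (errors[0] + errors[-1])

def instability_per_epoch(checkpoints_1, checkpoints_2, eval_error):
    """Instability at each epoch, given matching lists of checkpoints
    from two training runs on different data orders."""
    return [instability(c1, c2, eval_error)
            for c1, c2 in zip(checkpoints_1, checkpoints_2)]
```

Evaluating this at each epoch's pair of checkpoints produces per-epoch instability curves of the kind discussed in this appendix.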
Figure 13 plots instability throughout training for Resnet-20 and VGG-16 from different rewinding iterations for both train and test error for the unpruned networks and the IMP subnetworks. We begin with the unpruned networks. For rewinding iteration 0 (blue line), instability increases rapidly. In fact, it follows the same pattern as error: as the train or test error of each network decreases, the maximum possible instability increases (since instability never exceeds random guessing). With larger rewinding iterations, instability increases more slowly throughout training. When the rewinding iteration is sufficiently large that the networks are stable at the end of training, they are generally stable at every epoch of training (e.g., iteration 2000, pink line). In other words, after iteration 2000, the networks follow identical optimization trajectories modulo linear interpolation.
The IMP subnetworks of Resnet-20 exhibit the same behavior as the unpruned network: when the network is stable at the end of training, it is stable throughout training, meaning two copies of the same network follow the same optimization trajectory up to linear interpolation. The IMP subnetworks of VGG-16 exhibit slightly different behavior at rewinding iterations 500 and 1000: instability initially spikes (meaning the networks rapidly become separated by a loss barrier) but decreases gradually thereafter. For rewinding iteration 1000, it decreases to 0, meaning the networks are stable by the end of training. For all other rewinding iterations, being stable at the end of training corresponds to being stable throughout training, so it is possible that rewinding iteration 1000 represents a transition point between the unstable earlier rewinding iterations and the stable later ones.
Appendix E Instability Data at All Sparsities
In Figure 6 in Section 4.3, we show the effect of rewinding iteration on instability and test error for sparse subnetworks. We specifically focus on the most extreme level of sparsity for which IMP at any rewinding iteration is matching (as selected in Appendix B). In this appendix, we present the relationship between rewinding iteration and instability/test error at all levels of sparsity for standard Resnet-20 (Figures 15 and 14) and VGG-16 (Figures 17 and 16) on CIFAR-10. Section 4.4 and Figure 8 summarize this data, so we defer analysis to that section.
This data begins with 80% of weights remaining and includes sparsities attained by repeatedly pruning 20% of weights (e.g., 64% of weights remaining, 51% of weights remaining, etc.). We include these levels in particular because we use IMP to prune 20% of weights per iteration, meaning we have sparse IMP subnetworks for each of these levels. We include data for every sparsity level displayed in Appendix B, including those beyond the extreme sparsities we study in Section 4.3.
We only collected this data for standard Resnet-20 and VGG-16 on CIFAR-10. We determined that it was more valuable to spend our limited computational resources on these networks (whose instability and accuracy are sensitive to the rewinding iteration at the extreme sparsity level) than on the low and warmup variants (which are consistently stable and matching at the extreme sparsity level). We did not have the computational resources to compute this data for the ImageNet networks at all sparsities.
Appendix F Full Linear Interpolation Data
In Figures 3 and 6, we plot the instability value derived from linearly interpolating between copies of the same network or subnetwork trained on different data orders. In this appendix, we plot the linear interpolation data from which we derived the instabilities in Figures 3 and 6. We plot this data for the unpruned networks (Figure 18), IMP subnetworks (Figure 19), randomly pruned subnetworks (Figure 20), and the randomly reinitialized IMP subnetworks (Figure 21).
Appendix G Train Instability for Sparse Subnetworks
In Section 4, we only measure instability and error on the test set. We make this choice for simplicity after observing in Section 3 that train and test instability closely align. In this appendix, we present the corresponding data from Section 4 on the train set. Figures 22 and 23 examine the instability and error of the same IMP subnetworks as Figure 6, but they show both the train and test sets. We did not compute the train set quantities for Inception-v3 due to computational limitations.
Train set and test set instability are nearly identical, just as we found in Section 3. Interestingly, the two coincide more closely for IMP subnetworks of Resnet-50 than they do for the unpruned networks in Section 3.
For networks that are unstable at rewinding iteration 0, train error and test error follow similar trends, starting higher when the subnetworks are unstable and dropping when the subnetworks become stable. In other words, the unstable IMP subnetworks are not able to fully optimize to 0% train error, while the stable IMP subnetworks are. This means that, at earlier rewinding iterations, the IMP subnetworks are having trouble optimizing, not just generalizing.
Appendix H Alternate Distance Metrics
Instability analysis involves training two copies of the same network on different data orders and comparing the networks that result. In the main body of the paper, our method of comparison is linear interpolation, which we find to offer valuable new insights into neural network optimization and the lottery ticket hypothesis. However, one could parameterize instability analysis with a wide range of other metrics for comparing pairs of neural networks. In this appendix, we discuss four alternate methods for which we collected data using the MNIST and CIFAR-10 networks.
ℓ2 distance. One simple way to compare neural networks is to measure the ℓ2 distance between their trained weights. The limitation of this metric is that there is not necessarily any relationship between ℓ2 distance and the functional similarity of the networks or the structure of the loss landscape. In other words, there is no clear interpretation of ℓ2 distance.
In Figure 24, we plot the ℓ2 distance metric at all rewinding points for the unpruned networks. In Figure 25, we plot the ℓ2 distance metric at all rewinding points for all three classes of sparse networks. We plot this data separately because ℓ2 distance is not necessarily comparable between sparse networks (which have fewer parameters) and dense networks (which have more parameters).
For the unpruned networks, ℓ2 distance decreases linearly as we increase the rewinding iteration logarithmically. We see no distinct change in behavior when the networks become stable, and the distance remains far from 0 at that point.
For the IMP subnetworks, ℓ2 distance mirrors the behavior of instability. In cases where the IMP subnetworks are stable at all rewinding points (Resnet-20 low/warmup, VGG-16 low/warmup, and Lenet), the ℓ2 distance is lower than the distance between the other baselines (random pruning and random reinitialization) and is consistent across rewinding points. In cases where the IMP subnetworks are unstable at initialization but become stable later (Resnet-20 and VGG-16), the ℓ2 distance begins high (at the same level as for the randomly pruned and randomly reinitialized baselines) and drops when the subnetworks become stable, settling at a lower level.
Although stable IMP subnetworks are closer in ℓ2 distance than unstable IMP subnetworks and the baselines, the distance remains far from zero. In general, it is difficult to translate the results of this metric into higher-level statements about the relationships between the networks.
Cosine distance. In Figures 26 (unpruned networks) and 27 (sparse networks), we plot the cosine distance in a manner similar to ℓ2 distance. The results are similar to those for ℓ2 distance, and the same interpretation applies.
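Both metrics reduce to a few lines over the flattened parameter vectors; a sketch, with function names of our own:

```python
import numpy as np

def l2_distance(w1, w2):
    """L2 distance between flattened parameter vectors."""
    return float(np.linalg.norm(w1 - w2))

def cosine_distance(w1, w2):
    """One minus the cosine similarity of the flattened parameter vectors."""
    cos = np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))
    return float(1.0 - cos)
```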
Classification differences. This distance metric computes the number of examples that are classified differently by two networks. Unlike linear interpolation and ℓ2/cosine distance, this metric examines the functional behavior of the networks rather than their parameters. This metric is particularly valuable because it allows us to compare the dense and sparse networks directly.
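A sketch of this metric, assuming each network's per-example outputs (logits or probabilities) are available as an N × num_classes array:

```python
import numpy as np

def classification_differences(outputs_1, outputs_2):
    """Number of examples the two networks classify differently, given
    per-example outputs (logits or probabilities) of shape (N, num_classes)."""
    return int((outputs_1.argmax(axis=1) != outputs_2.argmax(axis=1)).sum())
```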
In Figures 28 (test set) and 29 (train set), we plot this metric for the unpruned and sparse networks across rewinding iterations. The unpruned networks generally classify the same number of examples differently no matter the rewinding iteration, although the number of different classifications decreases gradually at the latest rewinding iterations for Resnet-20 low and warmup. We see no relationship between this metric and instability.
The behavior of the sparse IMP networks better matches instability. IMP subnetworks that are stable from initialization (Resnet-20 low and warmup, VGG-16 low and warmup, and Lenet) consistently have the same number of classification differences no matter the rewinding iteration. This number is lower than that for the randomly pruned and randomly reinitialized baselines.
IMP subnetworks that are unstable at iteration 0 (Resnet-20 and VGG-16) have the same number of different classifications as the baselines when rewinding to iteration 0. When the networks become stable, the number of different classifications drops substantially to a lower level.
One challenge with using this distance metric is that it is inherently entangled with accuracy. As the accuracy of the networks improves, the number of different classifications might decrease simply because the networks classify more examples correctly (and thereby, the same way). Consider the IMP subnetworks of Resnet-20 on the CIFAR-10 test set (the graph in the upper right of Figure 28, blue line). At rewinding iteration 0, the networks have about 11% error on the test set, meaning there are at most 2200 examples they could classify differently. When the Resnet-20 IMP subnetworks are stable, error decreases to 8.5%, meaning at most 1700 examples can be classified differently. However, in Figure 28, we see that only about 350 examples are classified differently. Although this number is lower in absolute terms than the 1100 differences at rewinding iteration 0, accuracy has improved as well, so we must consider these differences in context. At rewinding iteration 0, classification differences are at 50% of their maximum possible value, while at rewinding iteration 1000, they are at 21% of their maximum possible value. In summary, as the IMP subnetworks become stable, they behave in a more functionally similar fashion, even accounting for accuracy improvements.
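The normalization used in this comparison can be made explicit. A sketch, assuming a 10,000-example test set and the bound that two networks, each with error rate e, can disagree on at most 2eN examples (their error sets may be disjoint):

```python
def normalized_differences(n_diff, error, n_examples=10000):
    """Classification differences as a fraction of the maximum possible.
    Each network misclassifies error * n_examples examples, and the two
    error sets could be disjoint, so at most 2 * error * n_examples
    examples (capped at n_examples) can be classified differently."""
    max_diff = min(2 * error * n_examples, n_examples)
    return n_diff / max_diff
```

With the numbers above: 1100 differences at 11% error is 50% of the maximum, while 350 differences at 8.5% error is about 21%.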
Loss distance. This distance metric computes the ℓ2 distance between the vectors of per-example cross-entropy losses of the two networks. This metric again considers only the functional behavior of the networks, but it uses the per-example loss rather than the classification decisions, which may provide more information about the functional behavior of the networks. We plot this data in Figures 30 (test set) and 31 (train set). It largely mirrors the behavior of the classification-difference metric, and the same interpretations apply.
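A sketch of this metric, again assuming per-example logits and integer labels; the softmax/cross-entropy helper is our own, written for numerical stability:

```python
import numpy as np

def per_example_xent(logits, labels):
    """Cross-entropy loss of each example from raw logits, shape (N, C)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def loss_distance(logits_1, logits_2, labels):
    """L2 distance between the two networks' per-example loss vectors."""
    return float(np.linalg.norm(per_example_xent(logits_1, labels)
                                - per_example_xent(logits_2, labels)))
```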