Abstract
Reducing the test-time resource requirements of a neural network while preserving test accuracy is crucial for running inference on resource-constrained devices. To achieve this goal, we introduce a novel network reparameterization based on the Kronecker-factored eigenbasis (KFE), and then apply Hessian-based structured pruning methods in this basis. As opposed to existing Hessian-based pruning algorithms, which prune in parameter coordinates, our method works in the KFE, where different weights are approximately independent, enabling accurate pruning and fast computation. We demonstrate the effectiveness of the proposed method empirically through extensive experiments. In particular, we highlight that the improvements are especially significant for more challenging datasets and networks. With negligible loss of accuracy, an iterative-pruning version gives a 10× reduction in model size and an 8× reduction in FLOPs on wide ResNet32. Our code is available here.
EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis
Chaoqi Wang, Roger Grosse, Sanja Fidler, Guodong Zhang
Proceedings of the International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).
Deep neural networks exhibit good generalization behavior in the over-parameterized regime (Zhang et al., 2016; Neyshabur et al., 2018), where the number of network parameters exceeds the number of training samples. However, over-parameterization leads to high computational cost and memory overhead at test time, making it hard to deploy deep neural networks on resource-limited devices.
Network pruning (LeCun et al., 1990; Hassibi et al., 1993; Han et al., 2015b; Dong et al., 2017; Zeng & Urtasun, 2019) has been identified as an effective technique to improve the efficiency of deep networks for applications with limited test-time computation and memory budgets. Without much loss in accuracy, classification networks can be compressed by a factor of 10 or even more (Han et al., 2015b; Zeng & Urtasun, 2019) on ImageNet (Deng et al., 2009). A typical pruning procedure consists of three stages: 1) train a large, over-parameterized model, 2) prune the trained model according to a certain criterion, and 3) fine-tune the pruned model to regain the lost performance.
Most existing work on network pruning focuses on the second stage. A common idea is to select parameters for pruning based on weight magnitudes (Hanson & Pratt, 1989; Han et al., 2015b). However, weights with small magnitude are not necessarily unimportant (LeCun et al., 1990). As a consequence, magnitude-based pruning might delete important parameters or preserve unimportant ones. By contrast, Optimal Brain Damage (OBD) (LeCun et al., 1990) and Optimal Brain Surgeon (OBS) (Hassibi et al., 1993) prune weights based on the Hessian of the loss function; the advantage is that both criteria reflect the sensitivity of the cost to the weight. Though OBD and OBS have proven to be effective for shallow neural networks, it remains challenging to extend them to deep networks because of the high computational cost of computing second derivatives. To solve this issue, several approximations to the Hessian have been proposed recently which assume layer-wise independence (Dong et al., 2017) or Kronecker structure (Zeng & Urtasun, 2019).
All of the aforementioned methods prune individual weights, leading to non-structured architectures which do not enjoy computational speedups unless one employs dedicated hardware (Han et al., 2016) and software, which is difficult and expensive in real-world applications (Liu et al., 2018). In contrast, structured pruning methods such as channel pruning (Liu et al., 2017; Li et al., 2016b) aim to preserve the convolutional structure by pruning at the level of channels or even layers, and thus automatically enjoy computational gains even with standard software frameworks and hardware.
Our Contributions. In this work, we focus on structured pruning. We first extend OBD and OBS to channel pruning, showing that they can match the performance of a state-of-the-art channel pruning algorithm (Liu et al., 2017). We then interpret them from the Bayesian perspective, showing that OBD and OBS each approximate the full-covariance Gaussian posterior with factorized Gaussians, while minimizing different variational objectives. However, different weights can be highly coupled in Bayesian neural network posteriors (e.g., see Figure 2), suggesting that full-factorization assumptions may hurt the pruning performance.
Based on this insight, we prune in a different coordinate system in which the posterior is closer to factorial. Specifically, we consider the Kronecker-factored eigenbasis (KFE) (George et al., 2018; Bae et al., 2018), in which the Hessian for a given layer is closer to diagonal. We propose a novel network reparameterization inspired by Desjardins et al. (2015) which explicitly parameterizes each layer in terms of the KFE. Because the Hessian matrix is closer to diagonal in the KFE, we can apply OBD with less cost to prediction accuracy; we call this method EigenDamage.
Instead of sparse weight matrices, pruning in the KFE leads to a low-rank approximation, or bottleneck structure, in each layer (see Figure 1). While most existing structured pruning methods (He et al., 2017; Li et al., 2016b; Liu et al., 2017; Luo et al., 2017) require specialized network architectures, EigenDamage can be applied to any fully connected or convolution layers without modifications. Furthermore, in contrast to traditional low-rank approximations (Denton et al., 2014; Lebedev et al., 2014; Jaderberg et al., 2014), which minimize the Frobenius norm of the weight-space error, EigenDamage is loss-aware. As a consequence, the user need only choose a single compression ratio parameter, and EigenDamage automatically determines an appropriate rank for each layer, so it is calibrated across layers. Empirically, EigenDamage outperforms strong baselines which prune in parameter coordinates, especially on more challenging datasets and networks.
In this section, we first introduce some background for understanding and reinterpreting Hessian-based weight pruning algorithms, and then briefly review structured pruning to provide context for the task that we will deal with.
Laplace Approximation. In general, we can obtain the Laplace approximation (MacKay, 1992) by simply taking the second-order Taylor expansion around a local mode. For neural networks, we can find such modes with SGD. Given a neural network with local MAP parameters $\theta^*$ after training on a dataset $\mathcal{D}$, we can obtain the Laplace approximation over the weights around $\theta^*$ by:
$$\log p(\theta \mid \mathcal{D}) \approx \log p(\theta^* \mid \mathcal{D}) - \frac{1}{2} (\theta - \theta^*)^\top H (\theta - \theta^*), \tag{1}$$
where $\theta = [\mathrm{vec}(W_1), \dots, \mathrm{vec}(W_L)]$ collects the weights of all $L$ layers, and $H$ is the Hessian matrix of the negative log posterior evaluated at $\theta^*$. Assuming $H$ is p.s.d., the Laplace approximation is equivalent to approximating the posterior over weights as a Gaussian distribution with $\theta^*$ and $H$ as the mean and precision, respectively. In practice, we can use the Fisher information matrix $F$ to approximate $H$, as done in Graves (2011); Zhang et al. (2017); Ritter et al. (2018). This ensures a p.s.d. matrix and allows efficient approximation (Martens, 2014).
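Forward and reverse KL divergence (Murphy, 2012). Suppose the true distribution is $p$ and the approximate distribution is $q$. The forward and reverse KL divergences are $D_{\mathrm{KL}}(p \,\|\, q)$ and $D_{\mathrm{KL}}(q \,\|\, p)$, respectively. In general, minimizing the forward KL gives rise to mass-covering behavior, and minimizing the reverse KL gives rise to zero-forcing/mode-seeking behavior (Minka et al., 2005). When we use a factorized Gaussian distribution $q$ with diagonal covariance $\tilde{\Sigma}$ to approximate a multivariate Gaussian distribution $p = \mathcal{N}(\mu, \Sigma)$, the solutions to minimizing the forward KL and reverse KL are
$$\tilde{\Sigma}_{\text{forward}} = \mathrm{diag}(\Sigma), \qquad \tilde{\Sigma}_{\text{reverse}} = \mathrm{diag}(\Lambda)^{-1},$$
where the precision matrix $\Lambda = \Sigma^{-1}$ and $\mathrm{diag}(\cdot)$ keeps only the diagonal entries. For the Laplace approximation, the true posterior covariance is $H^{-1}$.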
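The two factorized solutions can be compared numerically. The following is a minimal sketch on a hypothetical two-dimensional Gaussian (the covariance values are illustrative assumptions, not taken from the paper); it shows that the reverse-KL variances are never larger than the forward-KL ones.

```python
import numpy as np

# Hypothetical 2-D Gaussian with strongly correlated components.
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
Lambda = np.linalg.inv(Sigma)            # precision matrix

# Forward KL, D_KL(p || q): match the marginal variances of p.
var_forward = np.diag(Sigma)
# Reverse KL, D_KL(q || p): invert the diagonal of the precision matrix.
var_reverse = 1.0 / np.diag(Lambda)

print("forward-KL variances:", var_forward)
print("reverse-KL variances:", var_reverse)
```

With correlation 0.9 the reverse-KL variance shrinks to 1 - 0.9² = 0.19, which illustrates why the reverse-KL fit underestimates the spread of the true distribution.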
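KFAC. Kronecker-factored approximate curvature (KFAC) (Martens & Grosse, 2015) uses a Kronecker-factored approximation to the Fisher matrix of fully connected layers, i.e. layers with no weight sharing. Consider the $l$-th layer in a neural network whose input activations are $a \in \mathbb{R}^{n}$, weight matrix is $W \in \mathbb{R}^{n \times m}$, and output is $s \in \mathbb{R}^{m}$, so that $s = W^\top a$. The weight gradient is therefore $\nabla_W \mathcal{L} = a (\nabla_s \mathcal{L})^\top$. With this formula, KFAC decomposes this layer's Fisher matrix $F$ with an independence assumption:
$$F = \mathbb{E}\big[\mathrm{vec}(\nabla_W \mathcal{L})\, \mathrm{vec}(\nabla_W \mathcal{L})^\top\big] = \mathbb{E}\big[\nabla_s \mathcal{L} (\nabla_s \mathcal{L})^\top \otimes a a^\top\big] \approx \mathbb{E}\big[\nabla_s \mathcal{L} (\nabla_s \mathcal{L})^\top\big] \otimes \mathbb{E}\big[a a^\top\big] = S \otimes A, \tag{2}$$
where $A = \mathbb{E}[a a^\top]$ and $S = \mathbb{E}[\nabla_s \mathcal{L} (\nabla_s \mathcal{L})^\top]$.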
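Grosse & Martens (2016) further extended KFAC to convolutional layers under additional assumptions of spatial homogeneity (SH) and spatially uncorrelated derivatives (SUD). Suppose the input is $a \in \mathbb{R}^{c_{in} \times h \times w}$ and the output is $s \in \mathbb{R}^{c_{out} \times h \times w}$. Then the gradient of the reshaped weight $W \in \mathbb{R}^{c_{in} k^2 \times c_{out}}$ is $\nabla_W \mathcal{L} = \sum_{t \in \mathcal{T}} a_t (\nabla_{s_t} \mathcal{L})^\top$, and the corresponding Fisher matrix is:
$$F = \mathbb{E}\Big[\sum_{t \in \mathcal{T}} \sum_{t' \in \mathcal{T}} \nabla_{s_t} \mathcal{L} (\nabla_{s_{t'}} \mathcal{L})^\top \otimes a_t a_{t'}^\top\Big] \approx \frac{1}{|\mathcal{T}|} \mathbb{E}\Big[\sum_{t \in \mathcal{T}} \nabla_{s_t} \mathcal{L} (\nabla_{s_t} \mathcal{L})^\top\Big] \otimes \mathbb{E}\Big[\sum_{t \in \mathcal{T}} a_t a_t^\top\Big] = S \otimes A, \tag{3}$$
where $\mathcal{T}$ is the set of spatial locations, $a_t \in \mathbb{R}^{c_{in} k^2}$ is the patch extracted from $a$, $\nabla_{s_t} \mathcal{L} \in \mathbb{R}^{c_{out}}$ is the gradient to each spatial location in $s$, $A = \mathbb{E}[\sum_{t} a_t a_t^\top]$ and $S = \frac{1}{|\mathcal{T}|} \mathbb{E}[\sum_{t} \nabla_{s_t} \mathcal{L} (\nabla_{s_t} \mathcal{L})^\top]$. Decomposing $F$ into $A$ and $S$ not only avoids the quadratic storage cost of the exact Fisher, but also enables efficient computation of the Fisher-vector product:
$$F\, \mathrm{vec}(X) = (S \otimes A)\, \mathrm{vec}(X) = \mathrm{vec}(A X S), \tag{4}$$
and fast computation of the inverse and eigendecomposition:
$$F^{-1} = S^{-1} \otimes A^{-1}, \qquad F = (Q_S \Lambda_S Q_S^\top) \otimes (Q_A \Lambda_A Q_A^\top) = (Q_S \otimes Q_A)(\Lambda_S \otimes \Lambda_A)(Q_S \otimes Q_A)^\top, \tag{5}$$
where $Q$ and $\Lambda$ are eigenvectors and eigenvalues. Since $Q_S \otimes Q_A$ gives the eigenbasis of the Kronecker product, we call it the Kronecker-factored Eigenbasis (KFE).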
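The Kronecker identities behind Eqns. (4) and (5) can be checked numerically. This is a minimal sketch for a fully connected layer, using random stand-ins for the activations and back-propagated gradients (all sizes and the helper `vec` are illustrative assumptions); it uses column-stacking vectorization so that vec(a gᵀ) = g ⊗ a.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 4, 3, 1000                      # input dim, output dim, samples (illustrative)

a = rng.standard_normal((N, n))           # stand-in input activations
g = rng.standard_normal((N, m))           # stand-in gradients w.r.t. layer outputs

A = a.T @ a / N                           # A = E[a a^T]
S = g.T @ g / N                           # S = E[g g^T]
F = np.kron(S, A)                         # KFAC Fisher, shape (n*m, n*m)

def vec(M):
    """Column-stacking vectorization, consistent with vec(a g^T) = g ⊗ a."""
    return M.reshape(-1, order="F")

# Fisher-vector product without forming F: (S ⊗ A) vec(X) = vec(A X S).
X = rng.standard_normal((n, m))
fvp_fast = vec(A @ X @ S)
fvp_full = F @ vec(X)

# Eigendecomposition of the two small factors gives the KFE, Q_S ⊗ Q_A.
lam_A, Q_A = np.linalg.eigh(A)
lam_S, Q_S = np.linalg.eigh(S)
Q = np.kron(Q_S, Q_A)
F_rebuilt = Q @ np.diag(np.kron(lam_S, lam_A)) @ Q.T

# Inverse of the Kronecker product from the inverses of the factors.
F_inv = np.kron(np.linalg.inv(S), np.linalg.inv(A))
```

Only the n × n and m × m factors are ever decomposed or inverted; the full (nm) × (nm) matrix is formed here purely to verify the identities.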
Structured Pruning. Structured network pruning (He et al., 2017; Liu et al., 2017; Li et al., 2016b; Luo et al., 2017) is a technique to reduce the size of a network while retaining the original convolutional structure. Among structured pruning methods, channel/filter pruning is the most popular. Let $c_{in}$ denote the number of input channels for the $l$-th convolutional layer and $h$/$w$ be the height/width of the input feature maps. The conv layer transforms the input $a \in \mathbb{R}^{c_{in} \times h \times w}$ with $c_{out}$ filters $W_i \in \mathbb{R}^{c_{in} \times k \times k}$. All the filters constitute the kernel matrix $W \in \mathbb{R}^{c_{in} k^2 \times c_{out}}$. When a filter is pruned, its corresponding feature map in the next layer is removed, so channel and filter pruning are typically referred to as the same thing. However, most current channel pruning methods either require predefined target models (Li et al., 2016b; Luo et al., 2017) or specialized network architectures (Liu et al., 2018), making them hard to use.
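OBD and OBS share the same basic pruning pipeline: first train a network to a (local) minimum in error at weight $\theta^*$, and then prune the weight that leads to the smallest increase in the training error. The predicted increase in the error for a change $\Delta\theta$ in the full weight vector is:
$$\Delta \mathcal{L} = \underbrace{\Big(\frac{\partial \mathcal{L}}{\partial \theta}\Big)^\top \Delta\theta}_{\approx 0} + \frac{1}{2} \Delta\theta^\top H \Delta\theta + O(\|\Delta\theta\|^3). \tag{6}$$
Eqn. (6) is a simple second-order Taylor expansion around the local mode, which is essentially the Laplace approximation. According to Eqn. (1), we can reinterpret the above cost function from a probabilistic perspective:
$$\Delta \mathcal{L} \approx \frac{1}{2} \Delta\theta^\top H \Delta\theta = -\log p_{\mathrm{LA}}(\theta^* + \Delta\theta \mid \mathcal{D}) + \mathrm{const}, \tag{7}$$
where LA denotes Laplace approximation.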
OBD. Due to the intractability of computing the full Hessian in deep networks, the Hessian matrix is approximated by a diagonal matrix in OBD. If we prune a weight $\theta_q^*$, then the corresponding change in weights as well as the cost are:
$$\Delta\theta_q = -\theta_q^* \quad \text{and} \quad \Delta \mathcal{L}_{\mathrm{OBD}} = \frac{1}{2} H_{qq} (\theta_q^*)^2. \tag{8}$$
OBD regards all the weights as uncorrelated, such that removing one will not affect the others. This treatment can be problematic if the weights are correlated in the posterior.
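OBS. In OBS, the importance of each weight is calculated by solving the following constrained optimization problem, which takes the correlations among weights into account:
$$\min_{\Delta\theta} \frac{1}{2} \Delta\theta^\top H \Delta\theta \quad \text{s.t.} \quad e_q^\top \Delta\theta + \theta_q^* = 0, \tag{9}$$
where $e_q$ is the unit selecting vector whose $q$-th element is 1 and 0 otherwise. Solving Eqn. (9) yields the optimal weight change and the corresponding change in error:
$$\Delta\theta = -\frac{\theta_q^*}{[H^{-1}]_{qq}} H^{-1} e_q \quad \text{and} \quad \Delta \mathcal{L}_{\mathrm{OBS}} = \frac{1}{2} \frac{(\theta_q^*)^2}{[H^{-1}]_{qq}}. \tag{10}$$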
The main difference is that OBS not only prunes a single weight but takes into account the correlation between weights and updates the rest of the weights to compensate.
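The single-weight criteria in Eqns. (8) and (10) can be written in a few lines. The sketch below uses a hypothetical two-weight problem (the values of `theta` and `H` are illustrative assumptions); it checks that the OBS update zeroes the selected weight and that the quadratic loss increase matches the OBS saliency.

```python
import numpy as np

def obd_saliency(theta, H):
    """Eq. (8): loss increase from zeroing each weight, using only diag(H)."""
    return 0.5 * theta**2 * np.diag(H)

def obs_saliency(theta, H_inv):
    """Eq. (10): loss increase when the other weights compensate optimally."""
    return 0.5 * theta**2 / np.diag(H_inv)

def obs_update(theta, H_inv, q):
    """Eq. (10): weight change that zeroes weight q and adjusts the rest."""
    return -(theta[q] / H_inv[q, q]) * H_inv[:, q]

# Hypothetical weights and Hessian.
theta = np.array([1.0, 2.0])
H = np.array([[2.0, 1.0],
              [1.0, 1.0]])
H_inv = np.linalg.inv(H)

q = int(np.argmin(obs_saliency(theta, H_inv)))   # weight with the smallest OBS saliency
delta = obs_update(theta, H_inv, q)
new_theta = theta + delta                        # pruned weight is now exactly zero
true_increase = 0.5 * delta @ H @ delta          # quadratic model of the loss change
```

For a positive definite Hessian, the OBS saliency of a weight never exceeds its OBD saliency, since OBS is allowed to move the remaining weights.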
A common belief is that OBS is superior to OBD, though it is only feasible for shallow networks. In the following paragraphs, we will show that this may not be the case in practice, even when we can compute the exact Hessian inverse.
From Eqn. (8), we can see that OBD can be viewed as OBS with the off-diagonal entries of the Hessian ignored. If we prune only one weight at a time, OBS is advantageous in the sense that it takes into account the off-diagonal entries. However, pruning weights one by one is time consuming and typically infeasible for modern neural networks. It is more common to prune many weights at a time (Zeng & Urtasun, 2019; Dong et al., 2017; Han et al., 2015b), especially in structured pruning (Liu et al., 2017; Luo et al., 2017; Li et al., 2016b).
We note that, when pruning multiple weights simultaneously, both OBD and OBS can be interpreted as using a factorized Gaussian to approximate the true posterior over weights, but with different objectives. Specifically, OBD can be obtained by minimizing the reverse KL divergence ($D_{\mathrm{KL}}(q \,\|\, p)$), whereas OBS uses the forward KL divergence ($D_{\mathrm{KL}}(p \,\|\, q)$). Reverse KL underestimates the variance of the true distribution and overestimates the importance of each weight. By contrast, forward KL overestimates the variance and prunes more aggressively. The following example illustrates that while OBS outperforms OBD when pruning only a single weight, there is no guarantee that OBS is better than OBD when pruning multiple weights simultaneously, since OBS may prune highly correlated weights all together.
Example 1.
Suppose a neural network has converged to a local minimum with weight $\theta^*$ and associated Hessian $H$. Compute the resulting weight and increase in loss of OBD and OBS for the following cases. Case 1: Prune one weight (OBS is better). Case 2: Prune two weights simultaneously (OBD is better).
OBD and OBS are equivalent when the true posterior distribution is fully factorized. It has been observed that different weights are highly coupled (Zhang et al., 2017), so the diagonal approximation is too crude. However, the correlations are small in the KFE (see Figure 2). This motivates us to consider applying OBD in the KFE, where the diagonal approximation is more reasonable.
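The two cases of Example 1 can be reproduced qualitatively with a small script. The weights and Hessian below are hypothetical values chosen for illustration (they are not the numbers of the original example), and the evaluation convention is an assumption: the OBD selection is scored without compensation, while the OBS selection is scored with the jointly optimal compensation of the remaining weights, which is the most favorable setting for OBS.

```python
import numpy as np

# Hypothetical setup: weights 0 and 1 are strongly coupled, weight 2 is independent.
theta = np.array([1.0, 1.2, 1.0])
H = np.array([[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.0],
              [0.0, 0.0, 0.3]])
H_inv = np.linalg.inv(H)

obd = 0.5 * theta**2 * np.diag(H)          # per-weight OBD saliency, Eq. (8)
obs = 0.5 * theta**2 / np.diag(H_inv)      # per-weight OBS saliency, Eq. (10)

def loss_increase(idx, compensate):
    """Quadratic loss increase after zeroing the weights in idx.

    compensate=False keeps the other weights fixed (OBD-style).
    compensate=True lets the other weights move optimally (OBS-style).
    """
    idx = np.asarray(idx)
    t = theta[idx]
    if compensate:
        return 0.5 * t @ np.linalg.solve(H_inv[np.ix_(idx, idx)], t)
    return 0.5 * t @ H[np.ix_(idx, idx)] @ t

results = {}
for k in (1, 2):                           # number of weights pruned at once
    obd_set = np.argsort(obd)[:k]
    obs_set = np.argsort(obs)[:k]
    results[k] = (loss_increase(obd_set, False), loss_increase(obs_set, True))
    print(k, sorted(obd_set.tolist()), sorted(obs_set.tolist()), results[k])
```

With these values OBS wins when one weight is pruned, but when two weights are pruned it selects both coupled weights, so nothing is left to compensate for them and the loss increase is larger than for the OBD selection.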
We use the Fisher matrix to approximate the Hessian. In the following, we briefly discuss the relationship between these matrices. For more detailed discussion, we refer readers to Martens (2014); Pascanu & Bengio (2013a).
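Suppose the function $z = f(x, \theta)$ is parameterized by $\theta$, and the loss function is $\ell(y, z)$. Then the Hessian $H$ at a (local) minimum is well approximated by the generalized Gauss-Newton matrix $G$:
$$H = \mathbb{E}\Big[J_f^\top H_\ell J_f + \sum_{j} [\nabla_z \ell(y, z)]_j H_{[f]_j}\Big] \approx \mathbb{E}\big[J_f^\top H_\ell J_f\big] = G, \tag{11}$$
where $\nabla_z \ell(y, z)$ is the gradient of $\ell(y, z)$ evaluated at $z = f(x, \theta)$, $H_\ell$ is the Hessian of $\ell(y, z)$ w.r.t. $z$, $J_f$ is the Jacobian of $f$ w.r.t. $\theta$, and $H_{[f]_j}$ is the Hessian of the $j$-th component of $f(x, \theta)$.
Pascanu & Bengio (2013b) showed that the Fisher matrix and the generalized Gauss-Newton matrix are identical when the model predictive distribution is in the exponential family, such as the categorical distribution (for classification) and the Gaussian distribution (for regression), justifying the use of the Fisher to approximate the Hessian.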
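The equality between the Fisher and the generalized Gauss-Newton matrix is easy to verify for a categorical predictive distribution. The sketch below assumes a hypothetical linear softmax classifier on a single input (model, sizes and variable names are illustrative); the Fisher takes the expectation over labels drawn from the model's own prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, c = 5, 3                                # input dim, number of classes (illustrative)
x = rng.standard_normal(d)
Theta = rng.standard_normal((c, d))        # linear model z = Theta x

z = Theta @ x
p = np.exp(z - z.max())
p /= p.sum()                               # softmax predictive distribution

# Jacobian of z w.r.t. the row-major flattening of Theta: J = I_c ⊗ x^T.
J = np.kron(np.eye(c), x[None, :])         # shape (c, c*d)

# Generalized Gauss-Newton: J^T H_l J, with H_l the Hessian of the
# cross-entropy loss w.r.t. the logits z.
H_l = np.diag(p) - np.outer(p, p)
G = J.T @ H_l @ J

# Fisher: expectation over labels y sampled from the model's own prediction.
F = np.zeros_like(G)
for y in range(c):
    grad_z = p - np.eye(c)[y]              # gradient of -log p_y w.r.t. z
    grad_theta = J.T @ grad_z
    F += p[y] * np.outer(grad_theta, grad_theta)
```

Both matrices reduce to Jᵀ(diag(p) - ppᵀ)J, so they agree to numerical precision.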
OBD and OBS were originally used for weight-level pruning. Before introducing our main contributions, we first extend OBD and OBS to structured (channel/filter-level) pruning. The most naïve approach is to first compute the importance of every weight, i.e., Eqn. (8) for OBD and Eqn. (10) for OBS, and then sum the importances within each filter. We use this approach as a baseline, and denote it COBD and COBS. For COBS, because inverting the Hessian/Fisher matrix is computationally intractable, we adopt the KFAC approximation for efficient inversion, as first proposed by Zeng & Urtasun (2019) for weight-level pruning.
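In the scenario of structured pruning, a more sophisticated approach is to take into account the correlation of the weights within the same filter. For example, we can compute the importance of each filter as follows:
$$\Delta\theta_i = -\theta_i^* \quad \text{and} \quad \Delta \mathcal{L}_i = \frac{1}{2} \theta_i^{*\top} F^{(i)} \theta_i^*, \tag{12}$$
where $\theta_i^*$ and $F^{(i)}$ are the parameter vector and Fisher matrix of the $i$-th filter $W_i$, respectively. To do this, we would need to store the Fisher matrix for each filter, which is intractable for large convolutional layers. To overcome this problem, we adopt the KFAC approximation $F \approx S \otimes A$, and compute the change in weights as well as the importance in the following way:
$$\Delta\theta_i = -\theta_i^* \quad \text{and} \quad \Delta \mathcal{L}_i = \frac{1}{2} S_{ii}\, \theta_i^{*\top} A\, \theta_i^*. \tag{13}$$
Unlike Eqn. (12), the input factor $A$ is shared between different filters, and is therefore cheap to store. By analogy, we can compute the change in weights and importance of each filter for KronOBS as:
$$\Delta\theta = -\frac{S^{-1} e_i \otimes \theta_i^*}{[S^{-1}]_{ii}} \quad \text{and} \quad \Delta \mathcal{L}_i = \frac{1}{2} \frac{\theta_i^{*\top} A\, \theta_i^*}{[S^{-1}]_{ii}}, \tag{14}$$
where $e_i \in \mathbb{R}^{c_{out}}$ is the selecting vector with 1 at the position of the $i$-th filter and 0 elsewhere. We refer to Eqn. (13) and Eqn. (14) as KronOBD and KronOBS (see Algorithm 1). See the appendix for derivations.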
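The filter-level criteria can be computed from the two Kronecker factors alone. The following is a minimal sketch with random stand-ins for the KFAC factors (sizes, the helper `random_spd` and all values are illustrative assumptions); the weight matrix stores one filter per column, matching column-stacking vectorization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 4                                 # n = c_in * k^2, m = c_out (illustrative)
W = rng.standard_normal((n, m))             # column i is the filter theta_i

def random_spd(dim):
    """Random symmetric positive definite matrix (stand-in for a KFAC factor)."""
    B = rng.standard_normal((dim, dim))
    return B @ B.T / dim + 0.1 * np.eye(dim)

A, S = random_spd(n), random_spd(m)
S_inv = np.linalg.inv(S)

quad = np.einsum("ni,nk,ki->i", W, A, W)    # theta_i^T A theta_i for each filter
kron_obd = 0.5 * np.diag(S) * quad          # KronOBD importance, Eq. (13)
kron_obs = 0.5 * quad / np.diag(S_inv)      # KronOBS importance, Eq. (14)

def kron_obs_update(i):
    """KronOBS weight change: Delta W = -theta_i (S^{-1} e_i)^T / [S^{-1}]_ii."""
    return -np.outer(W[:, i], S_inv[:, i]) / S_inv[i, i]

i = int(np.argmin(kron_obs))                # least important filter under KronOBS
dW = kron_obs_update(i)
W_pruned = W + dW                           # column i is zero, other filters compensate
```

Only the shared n × n factor A and the m × m factor S are needed; no per-filter Fisher block is stored.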
As argued above, weight-level OBD and OBS approximate the posterior distribution with a factorized Gaussian around the mode, which is overly restrictive and cannot capture the correlation between weights. Although we just extended them to filter/channel pruning, which captures correlations of weights within the same filter, the interactions between filters are ignored. In this section, we propose to decorrelate the weights before pruning. In particular, we introduce a novel network reparameterization by breaking each linear operation into three stages. Intuitively, the role of the first and third stages is to rotate to the KFE.
Considering a single layer with weight $W$ and KFAC Fisher $S \otimes A$ (see the KFAC background above), we can decompose the weight matrix in the following form:
$$\mathrm{vec}(W) = (Q_S \otimes Q_A)\, \mathrm{vec}(W') \;\Longrightarrow\; W = Q_A W' Q_S^\top, \tag{15}$$
where $W' = Q_A^\top W Q_S$. It is easy to show that the Fisher matrix for $W'$ is diagonal if the assumptions of KFAC are satisfied (George et al., 2018). We then apply COBD (or equivalently COBS, since the Fisher is close to diagonal) on $W'$ for both input and output channels. This way, each layer has a bottleneck structure, which is a low-rank approximation; we term this eigenpruning. (Note that COBD and KronOBD only prune the output channels, since this automatically results in removal of the corresponding input channel in the next layer.) We refer to our proposed method as EigenDamage (see Algorithm 2).
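The three-stage reparameterization and the resulting bottleneck can be sketched as follows. The KFAC factors are random stand-ins, and the way rows and columns are selected (summing diagonal-Fisher importances and keeping a fixed number of each) is a simplifying assumption for illustration, not necessarily the exact selection rule of Algorithm 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 4                                  # input and output dims (illustrative)
W = rng.standard_normal((n, m))
a = rng.standard_normal(n)

def random_spd(dim):
    """Random symmetric positive definite matrix (stand-in for a KFAC factor)."""
    B = rng.standard_normal((dim, dim))
    return B @ B.T / dim + 0.1 * np.eye(dim)

A, S = random_spd(n), random_spd(m)
lam_A, Q_A = np.linalg.eigh(A)
lam_S, Q_S = np.linalg.eigh(S)

W_kfe = Q_A.T @ W @ Q_S                      # W' in Eq. (15)

# Three-stage forward pass: rotate in, apply W', rotate out.
s_three_stage = Q_S @ (W_kfe.T @ (Q_A.T @ a))
s_original = W.T @ a

# In the KFE the KFAC Fisher of W' is diagonal: entry (i, j) is lam_A[i] * lam_S[j].
importance = 0.5 * W_kfe**2 * np.outer(lam_A, lam_S)

# Keep the rows (input side) and columns (output side) with the most total importance.
keep_rows = np.sort(np.argsort(importance.sum(axis=1))[-4:])
keep_cols = np.sort(np.argsort(importance.sum(axis=0))[-3:])
W_small = W_kfe[np.ix_(keep_rows, keep_cols)]
Q_A_small, Q_S_small = Q_A[:, keep_rows], Q_S[:, keep_cols]

# Pruned layer: same input and output sizes, smaller inner dimensions.
s_pruned = Q_S_small @ (W_small.T @ (Q_A_small.T @ a))
```

The pruned layer still maps an n-dimensional input to an m-dimensional output, which is why the surrounding architecture does not need to change.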
EigenDamage preserves the input and output shape, and thus can be applied to any convolutional or fully connected architecture without modification, in contrast with Liu et al. (2017), which requires adaptations for networks with cross-layer connections. Furthermore, like all Hessian-based pruning methods, our criterion allows us to set one global compression ratio for the whole network, making it easy to use. Moreover, the introduced eigenbasis can be further compressed by the "doubly factored" Kronecker approximation (Ba et al., 2016), and can also be compressed by depthwise separable decomposition, as detailed in the following sections.
The above method relies heavily on the Taylor expansion (6), which may be accurate if we prune only a few filters. Unfortunately, the approximation will break down if we prune a large number of filters. In order to handle this issue, we can conduct the pruning process iteratively and only prune a few filters in each iteration. Specifically, once we finish pruning the network for the first time, each layer has a bottleneck structure (i.e., $W = Q_A W' Q_S^\top$). We can then conduct the next pruning iteration (after fine-tuning) on $W'$ in the same manner. This will result in two new eigenbases $Q_A'$ and $Q_S'$ associated with $W'$. Conveniently, we can always merge these two new eigenbases (i.e., $Q_A'$ and $Q_S'$) into the old ones, so as to reduce the model size as well as FLOPs, by:
$$Q_A \leftarrow Q_A Q_A' \quad \text{and} \quad Q_S \leftarrow Q_S Q_S'. \tag{16}$$
This procedure may take several iterations until it reaches the desired compression ratio.
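Merging the eigenbases after a second pruning round is a pair of matrix products. The sketch below uses random matrices with orthonormal columns as stand-ins for the pruned bases (all sizes and the helper `orthonormal_columns` are illustrative assumptions) and compares the parameter counts before and after merging.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 6              # original input/output dims (illustrative)
r_in, r_out = 5, 4       # bottleneck dims after the first pruning round
r_in2, r_out2 = 3, 2     # bottleneck dims after the second round

def orthonormal_columns(rows, cols):
    """Random matrix with orthonormal columns (stand-in for a pruned eigenbasis)."""
    Q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return Q

Q_A, Q_S = orthonormal_columns(n, r_in), orthonormal_columns(m, r_out)
Q_A2, Q_S2 = orthonormal_columns(r_in, r_in2), orthonormal_columns(r_out, r_out2)
W2 = rng.standard_normal((r_in2, r_out2))      # inner weight after the second round

# Unmerged layer: W = Q_A (Q_A' W'' Q_S'^T) Q_S^T uses five matrices.
W_unmerged = Q_A @ (Q_A2 @ W2 @ Q_S2.T) @ Q_S.T

# Eq. (16): fold the new bases into the old ones.
Q_A_merged = Q_A @ Q_A2                        # shape (n, r_in2)
Q_S_merged = Q_S @ Q_S2                        # shape (m, r_out2)
W_merged = Q_A_merged @ W2 @ Q_S_merged.T

params_unmerged = Q_A.size + Q_A2.size + W2.size + Q_S2.size + Q_S.size
params_merged = Q_A_merged.size + W2.size + Q_S_merged.size
```

The merged layer computes the same function with three matrices instead of five, so every iteration leaves the network in the same bottleneck form.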
Since the eigenbasis can take up a large chunk of memory for convolutional networks ($Q_A$ has shape $c_{in} k^2 \times c_{in} k^2$), we further leverage the internal structure to reduce the model size. Inspired by Ba et al. (2016)'s "doubly factored" Kronecker approximation for layers whose input feature maps are too large, we ignore the correlation among the spatial locations within the same input channel. In that case, $A$ only captures the correlation between different channels. Here we abuse the notation slightly and let $A \in \mathbb{R}^{c_{in} \times c_{in}}$ denote the covariance matrix along the channel dimension and $a_t \in \mathbb{R}^{c_{in}}$ (see blue cubes) the activation of each spatial location:
$$A = \mathbb{E}\big[a_t a_t^\top\big]. \tag{17}$$
The expectation in Eqn. (17) is taken over training examples and spatial locations $\mathcal{T}$. We note that with such an approximation, $Q_A$ can be efficiently implemented by a $1 \times 1$ conv, resulting in compact bottleneck structures like ResNet (He et al., 2016a), as shown in Figure 1. This greatly reduces the size of the eigenbasis to $1/k^4$ of the original one.
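The channel-only covariance of Eqn. (17) and the corresponding 1×1 rotation can be sketched with plain array operations. Activations are random stand-ins and all sizes are illustrative assumptions; the 1×1 convolution is written as an `einsum` over the channel axis.

```python
import numpy as np

rng = np.random.default_rng(0)
N, c_in, h, w, k = 16, 4, 8, 8, 3            # illustrative sizes
acts = rng.standard_normal((N, c_in, h, w))  # stand-in input feature maps

# Eq. (17): average a_t a_t^T over examples and spatial locations.
a_t = acts.transpose(0, 2, 3, 1).reshape(-1, c_in)    # one row per (example, location)
A = a_t.T @ a_t / a_t.shape[0]                        # shape (c_in, c_in)
_, Q_A = np.linalg.eigh(A)

# Rotating every spatial location by Q_A^T is a 1x1 convolution with kernel Q_A^T.
rotated = np.einsum("oc,nchw->nohw", Q_A.T, acts)

full_basis_size = (c_in * k * k) ** 2        # Q_A built from k x k patches
channel_basis_size = c_in ** 2               # Q_A built from channels only
ratio = channel_basis_size / full_basis_size # equals 1 / k**4
```

After the rotation, the channel covariance of the feature maps is diagonal, which is the decorrelation the eigenbasis is meant to provide on the input side.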
Depthwise separable convolution has been proven to be effective in designing lightweight models (Howard et al., 2017; Chollet, 2017; Zhang et al., 2018b; Ma et al., 2018). The idea of separable convolution can be naturally incorporated into our method to further reduce the computational cost and model size. For convolution filters $W \in \mathbb{R}^{c_{in} \times c_{out} \times k \times k}$, we perform the singular value decomposition (SVD) for every slice $W_{:,:,i,j} \in \mathbb{R}^{c_{in} \times c_{out}}$; then we get a diagonal matrix as well as two new bases, as shown in Figure 3 (a). However, such a decomposition will result in more than twice the original parameters due to the two new bases. Therefore, we again ignore the correlation along the spatial dimension of filters, i.e. we share the bases across spatial positions (see Figure 3 (b)). In particular, we solve the following problem:
$$\min_{U, V, \{\Lambda_{i,j}\}} \sum_{i,j} \big\| W_{:,:,i,j} - U \Lambda_{i,j} V^\top \big\|_F^2, \tag{18}$$
where $U \in \mathbb{R}^{c_{in} \times c_{in}}$, $\Lambda_{i,j} \in \mathcal{D}^{c_{in} \times c_{out}}$ ($\mathcal{D}$ is the domain of diagonal matrices) and $V \in \mathbb{R}^{c_{out} \times c_{out}}$. We can merge $U$ and $V$ into $Q_A$ and $Q_S$ respectively, and then replace $W'$ with $\Lambda$, which can be implemented with a depthwise convolution. By doing so, we are able to further reduce the size of the filter to $1/\max(c_{in}, c_{out})$ of the original one.
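One ingredient of this decomposition is simple to illustrate: for fixed orthonormal bases, the best diagonal factor for each spatial position is the diagonal of the rotated slice. The sketch below makes two assumptions that are not from the paper: the shared bases are taken from the SVD of the centre slice (a hypothetical stand-in for actually solving Eqn. (18)), and the filter is a random tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
c_in, c_out, k = 4, 6, 3                          # illustrative sizes
W = rng.standard_normal((c_in, c_out, k, k))

# Hypothetical shared bases: here simply the SVD bases of the centre slice.
U, _, Vt = np.linalg.svd(W[:, :, k // 2, k // 2])
V = Vt.T                                          # U: (c_in, c_in), V: (c_out, c_out)

r = min(c_in, c_out)
# For fixed orthonormal U and V, the best diagonal Lambda_ij in Frobenius norm
# is the diagonal of U^T W_ij V.
rotated = np.einsum("ca,cokl,ob->abkl", U, W, V)  # U^T W_ij V for every (i, j)
lam = np.einsum("aakl->akl", rotated[:r, :r])     # shape (r, k, k): depthwise kernel

# Reconstruction U Lambda_ij V^T and its total squared error (objective of Eq. 18).
W_hat = np.einsum("ca,akl,oa->cokl", U[:, :r], lam, V[:, :r])
error = np.sum((W - W_hat) ** 2)

dense_params = c_in * c_out * k * k
depthwise_params = r * k * k                      # 1 / max(c_in, c_out) of the dense filter
```

The diagonal factor has one k × k kernel per retained channel, which is exactly the parameter layout of a depthwise convolution.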
Table 1: Single-pass pruning results on CIFAR10 and CIFAR100. Column groups are labeled by dataset (C10 = CIFAR10, C100 = CIFAR100) and prune ratio (60% or 90%). Acc = test accuracy (%), W = reduction in weights (%), F = reduction in FLOPs (%).

| Method | C10 60% Acc | C10 60% W | C10 60% F | C10 90% Acc | C10 90% W | C10 90% F | C100 60% Acc | C100 60% W | C100 60% F | C100 90% Acc | C100 90% W | C100 90% F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VGG19 (Baseline) | 94.17 | - | - | - | - | - | 73.34 | - | - | - | - | - |
| NN Slimming (Liu et al., 2017) | 92.84 | 80.07 | 42.65 | 85.01 | 97.85 | 97.89 | 71.89 | 74.60 | 38.33 | 58.69 | 97.76 | 94.09 |
| COBD | 94.04 ± 0.12 | 82.01 ± 0.44 | 38.18 ± 0.45 | 92.34 ± 0.18 | 97.68 ± 0.02 | 77.39 ± 0.36 | 72.23 ± 0.15 | 77.03 ± 0.05 | 33.70 ± 0.04 | 58.07 ± 0.60 | 97.97 ± 0.04 | 77.55 ± 0.25 |
| COBS | 94.08 ± 0.07 | 76.96 ± 0.14 | 34.73 ± 0.11 | 91.92 ± 0.16 | 97.27 ± 0.04 | 87.53 ± 0.41 | 72.27 ± 0.13 | 73.83 ± 0.03 | 38.09 ± 0.06 | 58.87 ± 1.34 | 97.61 ± 0.01 | 91.94 ± 0.26 |
| KronOBD | 94.00 ± 0.11 | 80.40 ± 0.26 | 38.19 ± 0.55 | 92.92 ± 0.26 | 97.47 ± 0.02 | 81.44 ± 0.68 | 72.29 ± 0.11 | 77.24 ± 0.10 | 37.90 ± 0.24 | 60.70 ± 0.51 | 97.56 ± 0.08 | 82.55 ± 0.39 |
| KronOBS | 94.09 ± 0.12 | 79.71 ± 0.26 | 36.93 ± 0.15 | 92.56 ± 0.21 | 97.32 ± 0.02 | 80.39 ± 0.21 | 72.12 ± 0.14 | 74.18 ± 0.04 | 36.59 ± 0.11 | 60.66 ± 0.35 | 97.48 ± 0.03 | 83.57 ± 0.27 |
| EigenDamage | 93.98 ± 0.06 | 78.18 ± 0.12 | 37.13 ± 0.41 | 92.29 ± 0.21 | 97.15 ± 0.04 | 86.51 ± 0.26 | 72.90 ± 0.06 | 76.64 ± 0.12 | 37.40 ± 0.11 | 65.18 ± 0.10 | 97.31 ± 0.01 | 88.63 ± 0.12 |
| VGG19+$L_1$ (Baseline) | 93.71 | - | - | - | - | - | 73.08 | - | - | - | - | - |
| NN Slimming (Liu et al., 2017) | 93.79 | 83.45 | 49.23 | 91.99 | 97.93 | 86.00 | 72.78 | 76.53 | 39.92 | 57.07 | 97.59 | 93.86 |
| COBD | 93.84 ± 0.04 | 84.19 ± 0.01 | 47.34 ± 0.02 | 91.29 ± 0.30 | 97.88 ± 0.02 | 81.22 ± 0.38 | 72.73 ± 0.09 | 79.47 ± 0.02 | 39.04 ± 0.02 | 56.49 ± 0.06 | 97.96 ± 0.03 | 80.91 ± 0.16 |
| COBS | 93.85 ± 0.01 | 82.88 ± 0.02 | 44.58 ± 0.10 | 91.14 ± 0.13 | 97.31 ± 0.03 | 88.18 ± 0.27 | 72.58 ± 0.09 | 76.17 ± 0.01 | 41.61 ± 0.06 | 44.18 ± 0.87 | 97.31 ± 0.02 | 91.90 ± 0.07 |
| KronOBD | 93.86 ± 0.06 | 84.78 ± 0.00 | 50.10 ± 0.00 | 91.14 ± 0.26 | 97.74 ± 0.02 | 83.09 ± 0.33 | 72.44 ± 0.03 | 79.99 ± 0.02 | 43.46 ± 0.02 | 57.59 ± 0.21 | 97.53 ± 0.02 | 85.04 ± 0.07 |
| KronOBS | 93.84 ± 0.04 | 84.33 ± 0.03 | 48.01 ± 0.13 | 91.13 ± 0.17 | 97.37 ± 0.01 | 81.52 ± 0.18 | 72.61 ± 0.15 | 77.27 ± 0.03 | 40.89 ± 0.59 | 57.61 ± 0.67 | 97.51 ± 0.02 | 86.60 ± 0.14 |
| EigenDamage | 93.88 ± 0.04 | 79.50 ± 0.02 | 39.84 ± 0.11 | 91.79 ± 0.16 | 96.84 ± 0.02 | 84.82 ± 0.21 | 73.01 ± 0.05 | 75.41 ± 0.03 | 37.46 ± 0.06 | 64.91 ± 0.23 | 97.28 ± 0.04 | 88.65 ± 0.06 |
| ResNet32 (Baseline) | 95.30 | - | - | - | - | - | 78.17 | - | - | - | - | - |
| NN Slimming (Liu et al., 2017) | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| COBD | 95.11 ± 0.10 | 70.36 ± 0.39 | 66.18 ± 0.46 | 91.75 ± 0.42 | 97.30 ± 0.06 | 93.50 ± 0.37 | 75.70 ± 0.31 | 66.68 ± 0.25 | 67.53 ± 0.25 | 59.52 ± 0.24 | 97.74 ± 0.08 | 94.88 ± 0.08 |
| COBS | 95.04 ± 0.07 | 67.90 ± 0.25 | 76.75 ± 0.36 | 90.04 ± 0.21 | 95.49 ± 0.22 | 97.39 ± 0.04 | 75.16 ± 0.32 | 66.83 ± 0.03 | 76.59 ± 0.34 | 58.20 ± 0.56 | 91.99 ± 0.07 | 96.27 ± 0.02 |
| KronOBD | 95.11 ± 0.09 | 63.97 ± 0.22 | 63.41 ± 0.42 | 92.57 ± 0.09 | 96.11 ± 0.12 | 94.18 ± 0.17 | 75.86 ± 0.37 | 63.92 ± 0.23 | 62.97 ± 0.17 | 62.42 ± 0.41 | 96.42 ± 0.05 | 95.85 ± 0.08 |
| KronOBS | 95.14 ± 0.07 | 64.21 ± 0.31 | 61.89 ± 0.79 | 92.76 ± 0.12 | 96.14 ± 0.27 | 94.37 ± 0.54 | 75.98 ± 0.33 | 62.36 ± 0.40 | 60.41 ± 1.02 | 63.62 ± 0.50 | 93.56 ± 0.14 | 95.65 ± 0.13 |
| EigenDamage | 95.17 ± 0.12 | 71.99 ± 0.13 | 70.25 ± 0.24 | 93.05 ± 0.23 | 96.05 ± 0.03 | 94.74 ± 0.02 | 75.51 ± 0.11 | 69.80 ± 0.11 | 71.62 ± 0.21 | 65.72 ± 0.04 | 95.21 ± 0.04 | 94.62 ± 0.06 |
| PreResNet29+$L_1$ (Baseline) | 94.42 | - | - | - | - | - | 75.70 | - | - | - | - | - |
| NN Slimming (Liu et al., 2017) | 92.32 | 71.60 | 80.95 | 82.50 | 93.49 | 95.88 | 68.87 | 61.68 | 82.03 | 49.48 | 93.70 | 96.33 |
| COBD | 91.17 ± 0.16 | 87.48 ± 0.23 | 78.14 ± 0.70 | 80.03 ± 0.21 | 98.45 ± 0.02 | 96.03 ± 0.10 | 62.19 ± 0.18 | 89.72 ± 0.01 | 82.24 ± 0.16 | 36.44 ± 0.90 | 98.65 ± 0.00 | 96.81 ± 0.02 |
| COBS | 91.64 ± 0.22 | 83.52 ± 0.12 | 76.33 ± 0.21 | 76.59 ± 0.69 | 98.34 ± 0.02 | 98.47 ± 0.02 | 68.10 ± 0.29 | 81.26 ± 0.10 | 89.47 ± 0.04 | 32.77 ± 0.89 | 97.89 ± 0.01 | 98.73 ± 0.00 |
| KronOBD | 90.22 ± 0.43 | 74.84 ± 0.20 | 67.83 ± 0.33 | 82.68 ± 0.20 | 98.18 ± 0.04 | 94.90 ± 0.13 | 57.76 ± 0.28 | 76.85 ± 0.06 | 72.38 ± 0.02 | 34.26 ± 1.12 | 98.62 ± 0.00 | 96.09 ± 0.00 |
| KronOBS | 89.02 ± 0.17 | 72.96 ± 0.20 | 70.14 ± 0.18 | 81.77 ± 0.59 | 98.44 ± 0.01 | 96.85 ± 0.09 | 60.28 ± 0.37 | 70.53 ± 0.11 | 76.60 ± 0.14 | 33.45 ± 0.96 | 98.31 ± 0.00 | 97.15 ± 0.01 |
| EigenDamage | 93.80 ± 0.05 | 70.09 ± 0.12 | 63.13 ± 0.26 | 89.10 ± 0.13 | 93.45 ± 0.04 | 90.67 ± 0.06 | 73.62 ± 0.16 | 66.73 ± 0.17 | 62.86 ± 0.12 | 65.11 ± 0.15 | 92.33 ± 0.02 | 90.52 ± 0.02 |

In this section, we aim to verify the effectiveness of EigenDamage in reducing the test-time resource requirements of a network without significantly sacrificing accuracy. We compare EigenDamage with other compression methods in terms of test accuracy, reduction in weights, reduction in FLOPs, and inference wall-clock time speedup. Wherever possible, we analyze the tradeoff curves involving test accuracy and resource requirements. We find that EigenDamage gives a significantly more favorable tradeoff curve, especially on larger architectures and more difficult datasets.
We test our methods on two network architectures: VGGNet (Simonyan & Zisserman, 2014) and (Pre)ResNet (He et al., 2016b; a). For ResNet, we widen the network by a factor of 4, as done in Zhang et al. (2018a). We make use of three standard benchmark datasets: CIFAR10, CIFAR100 (Krizhevsky, 2009) and TinyImageNet (https://tinyimagenet.herokuapp.com). We compare EigenDamage to the extended versions COBD/COBS and KronOBD/KronOBS as well as one state-of-the-art channel-level pruning algorithm, NN Slimming (Liu et al., 2017; 2018), and a low-rank approximation algorithm, CP-Decomposition (Jaderberg et al., 2014). Note that because NN Slimming requires imposing an $L_1$ loss on the scaling weights of BatchNorm (Ioffe & Szegedy, 2015), we train the networks with two different settings, i.e., with and without the $L_1$ loss, for fair comparison.
For networks with skip connections, NN Slimming can only be applied to specially designed network architectures. Therefore, in addition to ResNet32, we also test on PreResNet29 (He et al., 2016b), which is in the same family of architectures considered by Liu et al. (2017). In our experiments, all the baseline (i.e. unpruned) networks are trained from scratch with SGD. We train the networks for 150 epochs on the CIFAR datasets and 300 epochs on TinyImageNet, using weight decay and an initial learning rate that is decayed by a factor of 10 at two fixed fractions of the total number of training epochs. For the networks trained with $L_1$ sparsity on BatchNorm, we followed the same settings as in Liu et al. (2017).
We first consider the single-pass setting, where we perform a single round of pruning and then fine-tune the network. Specifically, we compare eigenpruning (EigenDamage) against our proposed baselines COBD, COBS, KronOBD, KronOBS and a state-of-the-art channel-level pruning method, NN Slimming, on CIFAR10 and CIFAR100 with VGGNet and (Pre)ResNet. For EigenDamage, we count both the parameters of $W'$ and the two eigenbases. For all methods, we test a variety of pruning ratios. Due to the space limit, please refer to the appendix for the full results. In order to avoid pruning all the channels in some layers, we cap the fraction of channels that can be pruned in each layer. After pruning, the network is fine-tuned for 150 epochs with weight decay, and the learning rate decay follows the same scheme as in training. We run each experiment multiple times in order to reduce the variance of the results.
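Selecting units under one global pruning ratio with a per-layer cap can be sketched as a single sorted pass. The function name, the cap value and the tie-breaking rule below are hypothetical choices for illustration; the text only states that a cap on the per-layer fraction exists.

```python
import numpy as np

def select_pruned_units(importances, prune_ratio, max_layer_frac):
    """Pick the globally least important units, capping the fraction per layer.

    importances: list of 1-D arrays, one per layer.
    Returns one boolean mask per layer (True = prune).
    """
    total = sum(len(imp) for imp in importances)
    budget = int(round(prune_ratio * total))
    caps = [int(np.floor(max_layer_frac * len(imp))) for imp in importances]

    # Sort all (importance, layer, index) triples from least to most important.
    entries = sorted(
        (float(v), l, i) for l, imp in enumerate(importances) for i, v in enumerate(imp)
    )
    masks = [np.zeros(len(imp), dtype=bool) for imp in importances]
    counts = [0] * len(importances)
    for _, l, i in entries:
        if budget == 0:
            break
        if counts[l] < caps[l]:            # skip units in layers that hit their cap
            masks[l][i] = True
            counts[l] += 1
            budget -= 1
    return masks

# Hypothetical importances for two layers with four units each.
layers = [np.array([0.01, 0.02, 0.03, 0.04]), np.array([0.5, 0.6, 0.7, 0.05])]
masks = select_pruned_units(layers, prune_ratio=0.5, max_layer_frac=0.75)
```

In this toy case the first layer would lose all four units under a purely global threshold; the cap keeps one of them and the remaining budget moves to the second layer.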
Results on CIFAR datasets. The results on CIFAR datasets are presented in Table 1. It shows that even COBD and COBS can almost match NN Slimming on CIFAR10 and CIFAR100 with VGGNet if trained with $L_1$ sparsity on BatchNorm, and outperform it when trained without it. Moreover, when the pruning ratio is 90%, the two Kronecker-factored channel-level variants (KronOBD and KronOBS) outperform NN Slimming on CIFAR100 with VGGNet by about 2% in terms of test accuracy (60.70 and 60.66 versus 58.69 in Table 1). For the experiments on ResNet, EigenDamage achieves better test accuracy than the other methods (by about 2%) when the pruning ratio is 90% on the CIFAR100 dataset. Besides, for the experiments on PreResNet, EigenDamage achieves the best test accuracy on all configurations and outperforms the other baselines by a bigger margin.
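Table 2: Single-pass pruning results on TinyImageNet with VGG19 at a prune ratio of 50%.

| Method | Test acc (%) | Reduction in weights (%) | Reduction in FLOPs (%) |
|---|---|---|---|
| VGG19 (Baseline) | 61.56 | - | - |
| VGG19+$L_1$ (Baseline) | 60.68 | - | - |
| NN Slimming (Liu et al., 2017) | 50.90 | 60.14 | 85.42 |
| COBD | 51.10 ± 0.60 | 69.27 ± 0.22 | 63.61 ± 0.19 |
| COBS | 53.13 ± 0.47 | 57.99 ± 0.52 | 78.51 ± 0.56 |
| KronOBD | 53.82 ± 0.32 | 67.22 ± 0.19 | 76.11 ± 0.24 |
| KronOBS | 53.54 ± 0.32 | 64.51 ± 0.23 | 74.57 ± 0.29 |
| EigenDamage | 58.20 ± 0.30 | 61.87 ± 0.11 | 66.21 ± 0.15 |
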
To summarize, EigenDamage performs the best across almost all the settings, and the improvements become more significant when the pruning ratio is high (e.g. 90% in Table 1), especially on more complicated networks, e.g. (Pre)ResNet, which demonstrates the effectiveness of pruning in the KFE. Moreover, EigenDamage adopts the bottleneck structure, which preserves the input and output dimension, as illustrated in Figure 1, and thus can be trivially applied to any fully connected or convolution layer without modification.
As mentioned earlier, the success of loss-aware pruning algorithms relies on the approximation to the loss function for identifying unimportant weights/filters. Therefore, we visualize the loss on the training set after one-pass pruning (without fine-tuning) in Figure 6. For EigenDamage with VGG19 on CIFAR10, the increase in loss is negligible even when a large fraction of the weights is pruned, and for the other settings the loss is also significantly lower than for the other methods. The remaining methods, which prune in the original weight space, all result in a large increase in loss, and the resulting network performs similarly to uniform predictions in terms of loss.
Results on TinyImageNet dataset. Apart from the results on CIFAR datasets, we further test our methods on a more challenging dataset, TinyImageNet, with VGGNet. TinyImageNet consists of 200 classes with 500 images per class for training, and 10,000 images for testing, which are downsampled from the original ImageNet dataset. The results are in Table 2. Again, EigenDamage outperforms all the baselines by a significant margin.
We further plot the pruning ratio in each convolution layer for a detailed analysis. As shown in Figure 4, NN Slimming tends to prune more in the bottom layers but retain most of the filters in the top layers, which is undesirable since neural networks typically learn compact representations in the top layers. This may explain why NN Slimming performs worse than other methods on TinyImageNet (see Table 2). By contrast, EigenDamage yields a balanced pruning ratio across different layers: it retains most filters in the bottom layers while pruning most redundant weights in the top layers.
We further experiment with the iterative setting, where the pruning can be conducted iteratively until it reaches a desired model size or FLOPs. Concretely, the iterative pruning is conducted for times with a pruning ratio of at each iteration for simulating the process. In order to avoiding pruning the entire layer, we also adopt the same strategy as in Liu et al. (2017), i.e. we constrain that at most of the channels can be pruned in each layer for each iteration.
We compare EigenDamage to COBD, COBS, KronOBD and KronOBS. The results are summarized in Figure 5. EigenDamage performs slightly better than the other baselines with VGGNet and achieves significantly higher performance on ResNet. Specifically, on CIFAR10 with VGGNet, nearly all the methods achieve similar results owing to the simplicity of both the dataset and the network. The performance gap becomes clearer as the dataset becomes more challenging, e.g., CIFAR100. On a more sophisticated network, ResNet, the improvements of EigenDamage are especially significant on both CIFAR10 and CIFAR100. Furthermore, EigenDamage is especially effective in reducing the number of FLOPs, due to the bottleneck structure.
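To see why a bottleneck structure reduces FLOPs, the following sketch counts multiply-accumulates for a dense convolution and for a bottleneck replacement. It assumes the bottleneck is a 1x1 convolution, a k x k convolution on the reduced channels, and another 1x1 convolution, with stride 1 and unchanged spatial size; the layer sizes and the helper names are hypothetical.

```python
def conv_flops(c_in, c_out, k, h, w):
    """Multiply-accumulates of a k x k convolution producing an h x w output map."""
    return h * w * c_in * c_out * k * k

def bottleneck_flops(c_in, c_out, k, h, w, r_in, r_out):
    """1x1 (c_in -> r_in), then k x k (r_in -> r_out), then 1x1 (r_out -> c_out)."""
    return (conv_flops(c_in, r_in, 1, h, w)
            + conv_flops(r_in, r_out, k, h, w)
            + conv_flops(r_out, c_out, 1, h, w))

dense = conv_flops(64, 64, 3, 32, 32)
pruned = bottleneck_flops(64, 64, 3, 32, 32, r_in=16, r_out=16)
print(f"dense={dense}, bottleneck={pruned}, reduction={1 - pruned / dense:.1%}")
```

The expensive k x k convolution only sees the reduced channel counts, so its cost falls with the product of the two reduced widths, while the 1x1 convolutions add a comparatively small overhead.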
Since EigenDamage can also be viewed as a low-rank approximation, we compare it with a state-of-the-art low-rank method, CPDecomposition (Lebedev et al., 2014), which computes a low-rank decomposition of the filter into a sum of rank-one tensors. We experiment with low-rank approximation for VGG19 on CIFAR100 and TinyImageNet. For CPDecomposition, we test two settings: (1) we vary the ranks from times of the original rank at each layer; (2) we vary ranks in for computing the approximation (footnote 6: we choose the minimum of the target rank and the original rank of the convolution filter as the rank for approximation). For EigenDamage, we choose different pruning ratios in the range of , and EigenDamageDepthwise Frob is obtained by applying depthwise separable decomposition to the network obtained by EigenDamage.
The results are presented in Figure 7. EigenDamage outperforms CPDecomposition significantly in terms of speedup and accuracy. Moreover, CPDecomposition approximates the original weights under the Frobenius norm in the original weight coordinates, which does not precisely reflect the sensitivity of the training loss. In contrast, EigenDamage is loss-aware, so the resulting approximation achieves lower training loss when only pruning is applied, i.e., without fine-tuning, as shown in Figure 7. Note that EigenDamage determines the approximation rank for each layer automatically given a global pruning ratio, whereas CPDecomposition requires a predetermined approximation rank for each layer, so its search complexity grows exponentially in the number of layers.
In this paper, we introduced a novel network reparameterization based on the Kronecker-factored eigenbasis, in which the entry-wise independence assumption is approximately satisfied. This lets us prune the weights effectively using Hessian-based pruning methods. The pruned networks have a low-rank (bottleneck) structure, which allows for fast computation. Empirically, EigenDamage outperforms strong baselines that prune in the original parameter coordinates, especially on more challenging datasets and networks.
Acknowledgements We thank Shengyang Sun, Ricky Chen, David Duvenaud, Jonathan Lorraine for their feedback on early drafts. GZ was funded by an MRIS Early Researcher Award.
References
 Ba et al. (2016) Ba, J., Grosse, R., and Martens, J. Distributed second-order optimization using Kronecker-factored approximations. 2016.
 Bader & Kolda (2007) Bader, B. W. and Kolda, T. G. Efficient matlab computations with sparse and factored tensors. SIAM Journal on Scientific Computing, 30(1):205–231, 2007.
 Bae et al. (2018) Bae, J., Zhang, G., and Grosse, R. Eigenvalue corrected noisy natural gradient. arXiv preprint arXiv:1811.12565, 2018.
 Chollet (2017) Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258, 2017.
 Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. IEEE, 2009.
 Denton et al. (2014) Denton, E. L., Zaremba, W., Bruna, J., LeCun, Y., and Fergus, R. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pp. 1269–1277, 2014.
 Desjardins et al. (2015) Desjardins, G., Simonyan, K., Pascanu, R., et al. Natural neural networks. In Advances in Neural Information Processing Systems, pp. 2071–2079, 2015.
 Dong et al. (2017) Dong, X., Chen, S., and Pan, S. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 4857–4867, 2017.
 George et al. (2018) George, T., Laurent, C., Bouthillier, X., Ballas, N., and Vincent, P. Fast approximate natural gradient descent in a Kronecker-factored eigenbasis. arXiv preprint arXiv:1806.03884, 2018.
 Graves (2011) Graves, A. Practical variational inference for neural networks. In Advances in neural information processing systems, pp. 2348–2356, 2011.
 Grosse & Martens (2016) Grosse, R. and Martens, J. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning, pp. 573–582, 2016.
 Han et al. (2015a) Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015a.
 Han et al. (2015b) Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143, 2015b.
 Han et al. (2016) Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and Dally, W. J. Eie: efficient inference engine on compressed deep neural network. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pp. 243–254. IEEE, 2016.
 Hanson & Pratt (1989) Hanson, S. J. and Pratt, L. Y. Comparing biases for minimal network construction with backpropagation. In Advances in neural information processing systems, pp. 177–185, 1989.
 Hassibi et al. (1993) Hassibi, B., Stork, D. G., and Wolff, G. J. Optimal brain surgeon and general network pruning. In Neural Networks, 1993., IEEE International Conference on, pp. 293–299. IEEE, 1993.
 He et al. (2016a) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016a.
 He et al. (2016b) He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Springer, 2016b.
 He et al. (2017) He, Y., Zhang, X., and Sun, J. Channel pruning for accelerating very deep neural networks. 2017.
 Howard et al. (2017) Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 Hubara et al. (2016) Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks. In Advances in neural information processing systems, pp. 4107–4115, 2016.
 Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 Jacob et al. (2018) Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713, 2018.
 Jaderberg et al. (2014) Jaderberg, M., Vedaldi, A., and Zisserman, A. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
 Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Lebedev et al. (2014) Lebedev, V., Ganin, Y., Rakhuba, M., Oseledets, I., and Lempitsky, V. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
 LeCun et al. (1990) LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In Advances in neural information processing systems, pp. 598–605, 1990.
 Li et al. (2016a) Li, F., Zhang, B., and Liu, B. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016a.
 Li et al. (2016b) Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016b.
 Lin et al. (2015) Lin, Z., Courbariaux, M., Memisevic, R., and Bengio, Y. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009, 2015.
 Liu et al. (2017) Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. Learning efficient convolutional networks through network slimming. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 2755–2763. IEEE, 2017.
 Liu et al. (2018) Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270, 2018.
 Luo et al. (2017) Luo, J.-H., Wu, J., and Lin, W. Thinet: A filter level pruning method for deep neural network compression. arXiv preprint arXiv:1707.06342, 2017.
 Ma et al. (2018) Ma, N., Zhang, X., Zheng, H.-T., and Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. arXiv preprint arXiv:1807.11164, 2018.
 MacKay (1992) MacKay, D. J. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.
 Martens (2014) Martens, J. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
 Martens & Grosse (2015) Martens, J. and Grosse, R. Optimizing neural networks with Kronecker-factored approximate curvature. In International conference on machine learning, pp. 2408–2417, 2015.
 Minka et al. (2005) Minka, T. et al. Divergence measures and message passing. Technical report, 2005.
 Murphy (2012) Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
 Neyshabur et al. (2018) Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. Towards understanding the role of overparametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.
 Novikov et al. (2015) Novikov, A., Podoprikhin, D., Osokin, A., and Vetrov, D. P. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pp. 442–450, 2015.
 Pascanu & Bengio (2013a) Pascanu, R. and Bengio, Y. Revisiting natural gradient for deep networks. 2013a.
 Pascanu & Bengio (2013b) Pascanu, R. and Bengio, Y. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013b.
 Ritter et al. (2018) Ritter, H., Botev, A., and Barber, D. A scalable laplace approximation for neural networks. 2018.
 Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
 Zeng & Urtasun (2019) Zeng, W. and Urtasun, R. MLPrune: Multilayer pruning for automated neural network compression, 2019. URL https://openreview.net/forum?id=r1g5b2RcKm.
 Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
 Zhang et al. (2017) Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. Noisy natural gradient as variational inference. arXiv preprint arXiv:1712.02390, 2017.
 Zhang et al. (2018a) Zhang, G., Wang, C., Xu, B., and Grosse, R. Three mechanisms of weight decay regularization. arXiv preprint arXiv:1810.12281, 2018a.
 Zhang et al. (2018b) Zhang, X., Zhou, X., Lin, M., and Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018b.
 Zhu et al. (2016) Zhu, C., Han, S., Mao, H., and Dally, W. J. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.
Supplementary Material
Derivation of KronOBD. Assume that the weight of a conv layer is , where , and , and that the two Kronecker factors are and . Then the Fisher information matrix of can be approximated by . Substituting the K-FAC Fisher for the Hessian in eqn. (8), we get:
(19) 
where represents the change in , and is the weight of the th filter, i.e., the th column of . Under the assumption that the filters are independent of each other, is diagonal. The importance of each filter and the corresponding change in weights are then:
(20) 
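As a numerical illustration of the KronOBD score, the sketch below assumes the following notation, which is not fixed by the text above: a weight matrix `W` of shape n x m whose columns are the filters, an input-side factor `A` (n x n), an output-side factor `S` (m x m), and a Fisher approximation given by the Kronecker product of `S` and `A` under column-stacking vectorization. Under these assumptions, removing filter i alone changes the quadratic loss approximation by one half of S_ii times the quadratic form of that filter in `A`.

```python
import numpy as np

def kron_obd_importance(W, A, S):
    """Importance of each filter (column of W) when S is treated as diagonal.

    W: (n, m) weight matrix, A: (n, n) and S: (m, m) Kronecker factors.
    The names and shapes are assumptions made for this sketch.
    """
    # quad[i] = theta_i^T A theta_i for every column theta_i of W
    quad = np.einsum("ni,nk,ki->i", W, A, W)
    return 0.5 * np.diag(S) * quad
```

Filters with the smallest scores are the candidates for removal; under the independence assumption no compensation of the remaining filters is needed, so the change in weights simply zeroes out the removed column.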
Derivation of KronOBS. Under the KronOBS assumption that different filters are correlated with each other, is no longer diagonal. Then, similar to eqn. (20), the corresponding structured version of eqn. (9) becomes:
(21) 
We can solve the above constrained optimization problem with a Lagrange multiplier:
(22) 
Taking the derivative w.r.t. and setting it to , we get:
(23) 
Substituting it back into the constraint and solving the resulting equation, we get:
(24) 
Substituting eqn. (24) back into eqn. (23), we finally obtain the optimal change in weights if we remove filter :
(25) 
To evaluate the importance of each filter, we substitute eqn. (25) back into eqn. (21):
(26)  
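For reference, the Lagrangian steps of the KronOBS derivation can be worked through as follows. This is a sketch under assumed notation rather than a transcription of the numbered equations: $W \in \mathbb{R}^{n \times m}$ has the filters $\theta_i$ as columns, $A \in \mathbb{R}^{n \times n}$ and $S \in \mathbb{R}^{m \times m}$ are symmetric, invertible Kronecker factors with Fisher approximation $S \otimes A$ (column-stacking $\mathrm{vec}$), $e_i$ is the $i$-th standard basis vector, and $\lambda \in \mathbb{R}^{n}$ is the multiplier.

```latex
\begin{align*}
\min_{\Delta W}\;& \tfrac{1}{2}\,\mathrm{vec}(\Delta W)^\top (S \otimes A)\,\mathrm{vec}(\Delta W)
   = \tfrac{1}{2}\,\mathrm{Tr}\!\left(\Delta W^\top A\,\Delta W\,S\right)
   \quad \text{s.t.}\quad \Delta W e_i + \theta_i = 0, \\
\mathcal{L}(\Delta W, \lambda) &= \tfrac{1}{2}\,\mathrm{Tr}\!\left(\Delta W^\top A\,\Delta W\,S\right)
   + \lambda^\top\!\left(\Delta W e_i + \theta_i\right), \\
\frac{\partial \mathcal{L}}{\partial \Delta W} &= A\,\Delta W\,S + \lambda e_i^\top = 0
   \;\;\Longrightarrow\;\; \Delta W = -A^{-1}\lambda\, e_i^\top S^{-1}, \\
\Delta W e_i = -\theta_i
   &\;\;\Longrightarrow\;\; \lambda = \frac{A\,\theta_i}{[S^{-1}]_{ii}}, \\
\Delta W &= -\frac{\theta_i\, e_i^\top S^{-1}}{[S^{-1}]_{ii}}, \\
\Delta \mathcal{L}_i &= \tfrac{1}{2}\,\mathrm{Tr}\!\left(\Delta W^\top A\,\Delta W\,S\right)
   = \frac{1}{2}\,\frac{\theta_i^\top A\,\theta_i}{[S^{-1}]_{ii}}.
\end{align*}
```

When $S$ is diagonal, $[S^{-1}]_{ii} = 1/S_{ii}$ and $\Delta W$ has a single nonzero column $-\theta_i$, so the score reduces to the KronOBD case.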
In this section, we will introduce the algorithm for solving the optimization problem in eqn. (18).
Khatri-Rao product. The Khatri-Rao product of two matrices and is the column-wise Kronecker product, that is:
(27) 
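A minimal NumPy sketch of the column-wise Kronecker product follows; the function name `khatri_rao` and the argument names are chosen here for illustration.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x R) and B (J x R), giving (I*J x R)."""
    if A.shape[1] != B.shape[1]:
        raise ValueError("A and B must have the same number of columns")
    I, R = A.shape
    J = B.shape[0]
    # out[i * J + j, r] = A[i, r] * B[j, r], i.e. column r is kron(A[:, r], B[:, r])
    return np.einsum("ir,jr->ijr", A, B).reshape(I * J, R)
```

Each output column is the Kronecker product of the corresponding input columns, so the result has as many rows as the product of the two row counts and the same number of columns as the inputs.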
Kruskal tensor notation. Suppose has a low-rank Canonical Polyadic (CP) structure. Following Bader & Kolda (2007), we refer to it as a Kruskal tensor. It can be defined by a collection of factor matrices, for , such that:
(28) 
where . Denote by the mode unfolding of a Kruskal tensor, which has the following form that depends on the Khatri-Rao products of the factor matrices:
(29) 
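The relation between a Kruskal tensor and its unfoldings can be checked numerically for a three-way tensor. The sketch below uses NumPy's row-major reshape, so the order of the factors inside the Khatri-Rao product differs from the column-major convention of Bader & Kolda (2007); the helper names are hypothetical.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x R) and B (J x R)."""
    I, R = A.shape
    return np.einsum("ir,jr->ijr", A, B).reshape(I * B.shape[0], R)

def kruskal_to_tensor(A, B, C):
    """X[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def unfold(X, mode):
    """Mode-`mode` unfolding with row-major ordering of the remaining modes."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((d, 2)) for d in (3, 4, 5))
X = kruskal_to_tensor(A, B, C)

# Each unfolding is one factor times the transposed Khatri-Rao product of the others.
print(np.allclose(unfold(X, 0), A @ khatri_rao(B, C).T))
print(np.allclose(unfold(X, 1), B @ khatri_rao(A, C).T))
print(np.allclose(unfold(X, 2), C @ khatri_rao(A, B).T))
```

This identity is what turns each step of alternating least squares into an ordinary linear least-squares problem in a single factor matrix.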
Alternating Least Squares (ALS). We can use ALS to solve problems similar to eqn. (18). Suppose we are approximating using . For fixed , there is a closed-form solution for . Specifically, we can update by the following update rule:
(30) 
alternately until convergence or until the maximum number of iterations is reached. For the Mahalanobis norm case (with as the metric tensor), if we set the derivative with respect to to ,
(31) 
we can get the corresponding update rule for :
(32) 
where unvec and vec are inverse operators of each other; in our case, the unvec operation converts the vectorized matrix back to its original matrix form. and has the same shape as , and for each column .
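The two kinds of update can be sketched for a three-way tensor as follows. This is an illustrative sketch, not the authors' code: `als_update` is the standard Frobenius-norm ALS step, while `weighted_update` assumes the metric is supplied as a dense positive-definite matrix `M` acting on the row-major vectorization of the unfolding, which may differ from how the metric tensor is structured in the paper. All function names are hypothetical.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x R) and B (J x R)."""
    I, R = A.shape
    return np.einsum("ir,jr->ijr", A, B).reshape(I * B.shape[0], R)

def unfold(X, mode):
    """Mode-`mode` unfolding with row-major ordering of the remaining modes."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def als_update(X, factors, n):
    """Closed-form Frobenius-norm update of factor n with the other two fixed."""
    others = [factors[m] for m in range(3) if m != n]
    Z = khatri_rao(others[0], others[1])                 # ordering matches unfold()
    gram = (others[0].T @ others[0]) * (others[1].T @ others[1])  # equals Z^T Z
    return unfold(X, n) @ Z @ np.linalg.pinv(gram)

def cp_als(X, rank, n_iter=100, seed=0):
    """Rank-`rank` CP fit of a 3-way tensor by cycling over the factors."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((d, rank)) for d in X.shape]
    for _ in range(n_iter):
        for n in range(3):
            factors[n] = als_update(X, factors, n)
    return factors

def weighted_update(X_n, Z, M):
    """Update of one factor under a metric M on the row-major vec of the unfolding X_n.

    Minimizes (x - D a)^T M (x - D a), where x = vec(X_n) and D a = vec(A Z^T).
    """
    I = X_n.shape[0]
    D = np.kron(np.eye(I), Z)                            # vec(A Z^T) = (I kron Z) vec(A)
    a = np.linalg.solve(D.T @ M @ D, D.T @ M @ X_n.reshape(-1))
    return a.reshape(I, Z.shape[1])                      # "unvec" back to matrix form
```

With `M` set to the identity, `weighted_update` reduces to the Frobenius-norm update. Forming `D` explicitly is only practical for small problems; a structured metric such as a Kronecker-factored one would normally be exploited to avoid building the dense matrix.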
We present additional results on one-pass pruning in the following tables. We also present the tabulated data as trade-off curves (accuracy vs. reduction in weights and accuracy vs. reduction in FLOPs) to make the differences between methods easier to see.
| Method | Test acc (%) @ 50% | Reduction in weights (%) @ 50% | Reduction in FLOPs (%) @ 50% | Test acc (%) @ 70% | Reduction in weights (%) @ 70% | Reduction in FLOPs (%) @ 70% | Test acc (%) @ 80% | Reduction in weights (%) @ 80% | Reduction in FLOPs (%) @ 80% |
|---|---|---|---|---|---|---|---|---|---|
| VGG19 (Baseline) | 94.17 | - | - | - | - | - | - | - | - |
| NN Slimming (Liu et al., 2017) | 92.84 | 73.84 | 38.88 | 92.89 | 84.30 | 54.83 | 91.92 | 91.77 | 76.43 |
| COBD | 94.01 ± 0.15 | 76.84 ± 0.30 | 35.07 ± 0.38 | 94.04 ± 0.09 | 85.88 ± 0.10 | 41.17 ± 0.23 | 93.70 ± 0.07 | 92.17 ± 0.07 | 56.87 ± 0.33 |
| COBS | 94.19 ± 0.10 | 66.91 ± 0.08 | 26.12 ± 0.13 | 93.97 ± 0.16 | 84.97 ± 0.02 | 43.16 ± 0.20 | 93.77 ± 0.12 | 91.52 ± 0.09 | 63.64 ± 0.13 |
| KronOBD | 93.91 ± 0.16 | 73.93 ± 0.42 | 33.71 ± 0.69 | 93.95 ± 0.12 | 85.80 ± 0.09 | 43.78 ± 0.24 | 93.78 ± 0.17 | 92.04 ± 0.04 | 60.81 ± 0.26 |
| KronOBS | 94.03 ± 0.13 | 69.17 ± 0.20 | 28.02 ± 0.26 | 94.10 ± 0.15 | 85.83 ± 0.09 | 42.56 ± 0.14 | 93.87 ± 0.14 | 92.00 ± 0.04 | 60.19 ± 0.38 |
| EigenDamage | 94.15 ± 0.05 | 68.64 ± 0.19 | 28.09 ± 0.21 | 94.15 ± 0.14 | 85.78 ± 0.06 | 45.68 ± 0.31 | 93.68 ± 0.22 | 92.51 ± 0.05 | 66.98 ± 0.36 |
| VGG19+ (Baseline) | 93.71 | - | - | - | - | - | - | - | - |
| NN Slimming (Liu et al., 2017) | 93.79 | 77.44 | 45.19 | 93.74 | 88.81 | 52.15 | 93.48 | 92.60 | 62.23 |
| COBD | 93.85 ± 0.03 | 76.83 ± 0.01 | 41.14 ± 0.05 | 93.88 ± 0.03 | 89.04 ± 0.03 | 52.73 ± 0.14 | 93.38 ± 0.04 | 93.29 ± 0.05 | 63.74 ± 0.12 |
| COBS | 93.88 ± 0.04 | 74.95 ± 0.03 | 37.56 ± 0.06 | 93.84 ± 0.04 | 88.53 ± 0.20 | 51.88 ± 0.00 | 93.27 ± 0.04 | 92.01 ± 0.02 | 63.96 ± 0.10 |
| KronOBD | 93.88 ± 0.01 | 83.43 ± 0.00 | 49.58 ± 0.00 | 93.89 ± 0.03 | 89.02 ± 0.01 | 53.40 ± 0.09 | 93.33 ± 0.05 | 93.55 ± 0.06 | 67.01 ± 0.30 |
| KronOBS | 93.85 ± 0.03 | 76.95 ± 0.01 | 42.04 ± 0.10 | 93.88 ± 0.04 | 88.69 ± 0.02 | 52.38 ± 0.08 | 93.44 ± 0.07 | 92.66 ± 0.05 | 63.77 ± 0.27 |
| EigenDamage | 93.84 ± 0.04 | 78.14 ± 0.11 | 39.02 ± 0.30 | 93.85 ± 0.04 | 85.71 ± 0.01 | 46.56 ± 0.03 | 93.40 ± 0.07 | 91.48 ± 0.06 | 62.18 ± 0.29 |

| Method | Test acc (%) @ 50% | Reduction in weights (%) @ 50% | Reduction in FLOPs (%) @ 50% | Test acc (%) @ 70% | Reduction in weights (%) @ 70% | Reduction in FLOPs (%) @ 70% | Test acc (%) @ 80% | Reduction in weights (%) @ 80% | Reduction in FLOPs (%) @ 80% |
|---|---|---|---|---|---|---|---|---|---|
| VGG19 (Baseline) | 73.34 | - | - | - | - | - | - | - | - |
| NN Slimming (Liu et al., 2017) | 72.77 | 66.50 | 30.61 | 69.98 | 85.56 | 54.51 | 66.09 | 92.33 | 76.76 |
| COBD | 72.82 ± 0.15 | 65.47 ± 0.13 | 24.24 ± 0.10 | 71.10 ± 0.22 | 86.06 ± 0.04 | 41.18 ± 0.04 | 67.46 ± 0.26 | 93.31 ± 0.06 | 60.39 ± 0.25 |
| COBS | 72.73 ± 0.17 | 62.31 ± 0.05 | 25.50 ± 0.06 | 71.25 ± 0.21 | 84.49 ± 0.04 | 49.25 ± 0.52 | 67.47 ± 0.13 | 91.04 ± 0.06 | 68.38 ± 0.26 |
| KronOBD | 72.88 ± 0.12 | 67.11 ± 0.21 | 28.57 ± 0.19 | 71.16 ± 0.11 | 85.83 ± 0.10 | 47.19 ± 0.35 | 67.70 ± 0.32 | 92.86 ± 0.05 | 65.26 ± 0.26 |
| KronOBS | 72.89 ± 0.12 | 67.26 ± 0.08 | 25.80 ± 0.16 | 71.36 ± 0.17 | 84.75 ± 0.02 | 45.74 ± 0.17 | 68.17 ± 0.34 | 92.16 ± 0.03 | 63.95 ± 0.19 |
| EigenDamage | 73.39 ± 0.12 | 66.05 ± 0.11 | 28.55 ± 0.11 | 71.62 ± 0.14 | 85.69 ± 0.05 | 54.83 ± 0.79 | 69.50 ± 0.22 | 92.92 ± 0.03 | 74.55 ± 0.33 |
| VGG19+ (Baseline) | 73.08 | - | - | - | - | - | - | - | - |
| NN Slimming (Liu et al., 2017) | 73.24 | 72.68 | 35.37 | 71.55 | 84.38 | 51.59 | 66.55 | 92.48 | 76.54 |
| COBD | 73.39 ± 0.05 | 74.16 ± 0.01 | 35.13 ± 0.01 | 71.40 ± 0.09 | 86.13 ± 0.01 | 45.47 ± 0.10 | 67.56 ± 0.16 | 93.00 ± 0.01 | 63.42 ± 0.11 |
| COBS | 73.44 ± 0.04 | 71.17 ± 0.03 | 33.77 ± 0.67 | 71.30 ± 0.12 | 84.07 ± 0.01 | 56.74 ± 0.13 | 66.90 ± 0.23 | 91.20 ± 0.04 | 73.39 ± 0.31 |
| KronOBD | 73.24 ± 0.05 | 74.00 ± 0.03 | 36.56 ± 0.03 | 71.01 ± 0.13 | 86.66 ± 0.05 | 52.66 ± 0.21 | 67.24 ± 0.20 | 92.90 ± 0.05 | 68.62 ± 0.21 |
| KronOBS | 73.20 ± 0.12 | 72.27 ± 0.03 | 36.45 ± 0.66 | 71.88 ± 0.11 | 84.77 ± 0.01 | 50.53 ± 0.08 | 67.75 ± 0.14 | 92.08 ± 0.01 | 67.39 ± 0.17 |
| EigenDamage | 73.23 ± 0.08 | 66.80 ± 0.02 | 29.49 ± 0.03 | 71.81 ± 0.13 | 84.27 ± 0.04 | 52.75 ± 0.21 | 69.83 ± 0.24 | 92.36 ± 0.01 | 73.68 ± 0.13 |

| Method | Test acc (%) @ 50% | Reduction in weights (%) @ 50% | Reduction in FLOPs (%) @ 50% | Test acc (%) @ 70% | Reduction in weights (%) @ 70% | Reduction in FLOPs (%) @ 70% | Test acc (%) @ 80% | Reduction in weights (%) @ 80% | Reduction in FLOPs (%) @ 80% |
|---|---|---|---|---|---|---|---|---|---|
| ResNet32 (Baseline) | 95.30 | - | - | - | - | - | - | - | - |
| COBD | 95.27 ± 0.10 | 60.67 ± 0.44 | 55.46 ± 0.35 | 95.00 ± 0.17 | 80.64 ± 0.40 | 76.78 ± 0.63 | 94.41 ± 0.10 | 90.73 ± 0.11 | 86.66 ± 0.34 |
| COBS | 95.30 ± 0.15 | 58.99 ± 0.02 | 65.54 ± 0.25 | 94.43 ± 0.17 | 76.27 ± 0.35 | 85.89 ± 0.23 | 93.45 ± 0.25 | 86.15 ± 0.68 | 92.77 ± 0.05 |
| KronOBD | 95.30 ± 0.09 | 56.05 ± 0.24 | 52.21 ± 0.36 | 94.94 ± 0.02 | 73.98 ± 0.46 | 74.97 ± 0.63 | 94.60 ± 0.14 | 85.96 ± 0.41 | 86.36 ± 0.38 |
| KronOBS | 95.46 ± 0.08 | 56.48 ± 0.26 | 50.93 ± 0.46 | 94.92 ± 0.11 | 73.77 ± 0.24 | 74.58 ± 0.44 | 94.44 ± 0.08 | 85.65 ± 0.46 | 86.05 ± 0.55 |
| EigenDamage | 95.28 ± 0.16 | 59.68 ± 0.28 | 58.32 ± 0.23 | 94.86 ± 0.11 | 82.57 ± 0.27 | 80.88 ± 0.36 | 94.23 ± 0.13 | 90.48 ± 0.35 | 88.86 ± 0.50 |
PreResNet29+ 