Compression of Deep Convolutional Neural Networks under Joint Sparsity Constraints
Abstract
We consider the optimization of deep convolutional neural networks (CNNs) such that they provide good performance while having reduced complexity when deployed either on conventional systems that use spatial-domain convolution or on lower-complexity systems designed for Winograd convolution. Furthermore, we explore the universal quantization and compression of these networks. In particular, the proposed framework produces one compressed model whose convolutional filters are sparse not only in the spatial domain but also in the Winograd domain. Hence, one compressed model can be deployed universally on any platform, without the need for retraining on the deployed platform, and the sparsity of its convolutional filters can be exploited for further complexity reduction in either domain. To obtain a better compression ratio, the sparse model is compressed in the spatial domain, which has fewer parameters. From our experiments, we obtain highly compressed models for ResNet-18, AlexNet and CT-SRCNN whose computational complexity is also substantially reduced.
Yoojin Choi, Mostafa El-Khamy, Jungwon Lee
SoC R&D, Samsung Semiconductor Inc., San Diego, CA 92121, USA
{yoojin.c,mostafa.e,jungwon2.lee}@samsung.com
Preprint. Work in progress.
1 Introduction
Deep learning with convolutional neural networks (CNNs) has recently achieved performance breakthroughs in many computer vision applications lecun2015deep (). However, the large model size and huge computational complexity hinder the deployment of state-of-the-art CNNs on resource-limited platforms such as battery-powered mobile devices. Thus, it is of great interest to compress complex CNNs into compact forms to lower their storage requirements and to reduce their computational costs sze2017efficient (); cheng2018model ().
CNN size compression has been actively investigated for memory and storage size reduction. Han et al. han2015deep () showed impressive compression results by weight pruning, quantization using k-means clustering, and Huffman coding. This has been followed by further analysis and mathematical optimization, and more efficient CNN compression schemes have since been suggested, e.g., in choi2017towards (); ullrich2017soft (); agustsson2017soft (); louizos2017bayesian (); choi2018universal (). On the other hand, CNN computational complexity reduction has also been investigated. The major computational cost of CNNs comes from the multiply-accumulate (MAC) operations in their convolutional layers (sze2017efficient, Table II). There have been two directions to reduce the convolution complexity in CNNs:

First, instead of conventional spatial-domain convolution, it has been suggested to use either frequency-domain convolution mathieu2013fast (); vasilache2014fast () or Winograd convolution lavin2016fast (). In particular, for typical small-size filters, Lavin & Gray lavin2016fast () showed that Winograd convolution is more efficient than both spatial-domain convolution and frequency-domain convolution.

Second, weight pruning is another approach to reduce the number of MACs required for convolution by skipping the MACs that involve pruned (zero) weights. Previous work mostly focused on spatial-domain weight pruning, which enables low-complexity sparse spatial-domain convolution, e.g., see han2015learning (); lebedev2016fast (); wen2016learning (); guo2016dynamic (); park2016faster (); lin2017runtime (). More recently, there have also been some attempts to prune Winograd-domain weights and reduce the complexity of Winograd convolution li2017enabling (); liu2018efficient ().
Previous works either focused on spatial-domain weight pruning and compression or on Winograd-domain weight pruning and complexity reduction. Compression of Winograd CNNs has never been addressed before, to the best of our knowledge. Another shortcoming of the previous works addressing the complexity reduction of Winograd CNNs is that their final CNNs are no longer backward compatible with spatial-domain convolution, due to the non-invertibility of the Winograd transformation, and hence they suffer accuracy losses if they need to be run on platforms that only support spatial-domain convolution. To our knowledge, this paper is the first to propose a universal CNN pruning and compression framework covering both Winograd and spatial-domain convolution.
Our proposed solutions are summarized in Figure 1. The main novelty of the proposed framework is that it optimizes CNNs such that their convolutional filters can be pruned either in the Winograd domain or in the spatial domain without accuracy loss and without extra training or fine-tuning in that domain. Our CNNs can be further optimized for, and compressed by, universal quantization and universal source coding such that their decompressed convolutional filters still have sparsity in both the Winograd and spatial domains. Hence, one universally compressed model can be deployed on any platform, whether it uses spatial-domain convolution or Winograd convolution, with no need for further training. Since many low-power platforms, such as mobile phones, are expected to support only the inference of CNNs, and not their training, our approach is desirable for wide-scale deployment of pre-trained models without worrying about the underlying inference engines.
2 Preliminary
2.1 Winograd convolution
We first review the Winograd convolution algorithm winograd1980arithmetic () in this subsection. For the sake of illustration, consider a two-dimensional (2D) input and a small 2D filter to be convolved. We first prepare a set of overlapping patches extracted from the input with a fixed stride. Each of the patches is convolved with the filter by the Winograd convolution algorithm and produces one patch of the output.
Let x and y denote one of the input patches and its corresponding output patch, respectively, and let g denote the filter. In Winograd convolution, the input and the filter are transformed into the Winograd domain as B^T x B and G g G^T using the Winograd transformation matrices B and G, respectively, where the superscript T denotes the matrix transpose. In the Winograd domain, the transformed input and filter have the same size as the input patch, and they are multiplied element-wise. Then, the output is transformed back to the spatial domain using the matrix A by
y = A^T [(G g G^T) ⊙ (B^T x B)] A,  (1)
where ⊙ is the element-wise product of two matrices. The transformation matrices A, B and G are fixed for given filter and patch sizes and can be obtained from the Chinese remainder theorem (e.g., see (blahut2010fast, Section 5.3)). In case of multiple input channels, the inverse transformation in (1) can be deployed once after summation over all channels of the element-wise product outputs in the Winograd domain (see (lavin2016fast, Section 4)).
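To make the transform concrete, below is a minimal NumPy sketch of Winograd convolution F(2×2, 3×3) with the standard transformation matrices from lavin2016fast (); the function names and the brute-force reference check are our illustrative additions.

```python
import numpy as np

# Winograd F(2x2, 3x3) transformation matrices (Lavin & Gray, 2016).
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_patch(x, g):
    """Convolve a 4x4 input patch x with a 3x3 filter g into a 2x2 output,
    as in (1): y = A^T [(G g G^T) .* (B^T x B)] A."""
    u = G @ g @ G.T        # filter in the Winograd domain (4x4)
    v = Bt @ x @ Bt.T      # input patch in the Winograd domain (4x4)
    return At @ (u * v) @ At.T

def direct_patch(x, g):
    """Reference: direct 'valid' correlation of a 4x4 patch with a 3x3 filter."""
    y = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            y[i, j] = np.sum(x[i:i + 3, j:j + 3] * g)
    return y

x, g = np.random.randn(4, 4), np.random.randn(3, 3)
assert np.allclose(winograd_patch(x, g), direct_patch(x, g))
```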
2.2 Sparse Winograd convolution
Similar to spatial-domain weight pruning for low-complexity sparse spatial-domain convolution, one can skip some of the computations in the Winograd domain by pruning (i.e., setting to zero) some of the Winograd-transformed filter weights (the elements of G g G^T in (1)) for sparse Winograd convolution. The work most closely related to this approach can be found in li2017enabling (); liu2018efficient ().
Pruning spatial-domain weights does not yield sparse Winograd-domain filters in general, since the sparsity is not maintained after the transformation. Thus, Li et al. li2017enabling () introduced new Winograd layers, which are similar to convolutional layers but whose learnable parameters are defined in the Winograd domain rather than in the spatial domain. In their framework, Winograd-domain weights are learned directly during training, where the loss and gradients are computed with Winograd layers. For Winograd-domain weight pruning, some insignificant Winograd-domain weights are nullified in every training iteration based on their magnitude and gradient values. In liu2018efficient (), the complexity of Winograd layers is further reduced by putting rectified linear units (ReLUs) in the Winograd domain and skipping MACs not only for zero weights but also for zero activations in the Winograd domain.
However, if we learn Winograd-domain weights directly using Winograd layers, the trained model has to use Winograd layers in inference as well. We cannot transform the learned Winograd-domain weights back to the spatial domain without considerable accuracy loss, since the inverse problem is over-determined (there are more weights in the Winograd domain than in the spatial domain). Hence, the model is not deployable on platforms that only support classical spatial-domain convolution. Moreover, storing Winograd-domain weights is inefficient, since the number of weights is larger in the Winograd domain. Thus, we suggest that it is better to compress the weights in the spatial domain even if the target computational platform only deploys Winograd convolution.
2.3 Universal compression
A universal CNN compression framework was proposed in choi2018universal (), where CNNs are optimized for and compressed by universal quantization and universal entropy source coding, with schemes such as the variants of Lempel–Ziv–Welch ziv1977universal (); ziv1978compression (); welch1984technique () and the Burrows–Wheeler transform effros2002universal (). Of particular interest for universal quantization is randomized uniform quantization, where uniform random dithering makes the distortion independent of the source, and the gap of its rate from the rate–distortion bound is provably no more than a fraction of one bit per sample at any distortion level, for any source zamir1992universal (). Universal CNN compression has practical advantages: it is easily applicable to any CNN model at any desired compression rate, without the extra burden required by previous approaches to compute or estimate the statistics of the CNN weights, and it is guaranteed to achieve near-optimal performance.
3 Main results
We consider a typical CNN model consisting of convolutional layers. Each layer convolves its multi-channel input with a set of small 2D convolutional filters, one for each pair of an input channel and an output channel, to produce its multi-channel output.
3.1 Regularization for jointly sparse convolutional filters
In this subsection, we introduce our Winograd-domain and spatial-domain partial L2 regularizers to obtain convolutional filters that are sparse both in the Winograd domain and in the spatial domain. We choose L2 regularizers to promote sparsity, although other regularizers such as L1 regularizers can be used instead. For notational simplicity, we collect all the learnable convolutional filters of all layers into one set. Moreover, given any matrix and a threshold, we define the indicator matrix of the same size whose element is one if the magnitude of the corresponding element of the matrix is not greater than the threshold, and zero otherwise.
Winograd-domain partial L2 regularization: To optimize CNNs under a Winograd-domain sparsity constraint, we introduce the Winograd-domain partial L2 regularizer, which penalizes the elements of the Winograd-domain filters that should be driven to zero, but is expressed in terms of the elements of the spatial-domain filters as below:
(2) 
where the Winograd transformation matrix of each layer is determined by the filter size and the input patch size used for Winograd convolution in that layer (see Section 2.1), and the regularizer is normalized by the total number of Winograd-domain weights of all layers. Observe that the L2 regularization is applied only to the Winograd-domain weights whose magnitudes are not greater than the threshold value. Due to this partial L2 regularization, the spatial-domain weights are updated in a direction such that, after training and transformation into the Winograd domain, part of the Winograd-domain weights diminish towards zero; we can then prune those insignificant Winograd-domain weights at inference for sparse Winograd convolution with negligible accuracy loss.
Given a desired sparsity level (in %) in the Winograd domain, we set the threshold value to be the corresponding percentile of the Winograd-domain weight magnitude values. The threshold is updated at every training iteration as the weights are updated. Note that the threshold decreases as training goes on, since the regularized Winograd-domain weights gradually converge to zero (see Section 3.2).
Spatial-domain partial L2 regularization: To optimize CNNs to also have sparsity in the spatial domain, we regularize the cost function by the partial L2 norm of the spatial-domain weights below a threshold determined by a given target sparsity level (in %), similar to (2), as below:
(3) 
where the regularizer is normalized by the total number of spatial-domain weights of all layers.
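As a concrete illustration, the following PyTorch sketch evaluates the two partial L2 regularizers for a single layer's filter bank; the helper names, the per-layer transformation matrix G passed as an argument, and the use of a magnitude percentile of the current weights as the threshold are our assumptions, consistent with the description above but not necessarily the exact implementation.

```python
import torch

def partial_l2(x, sparsity):
    """Partial L2: penalize only the elements whose magnitudes are not greater than
    the threshold set at the (sparsity)-th percentile of |x|, i.e., the weights
    that should be driven towards zero."""
    theta = torch.quantile(x.abs().flatten(), sparsity / 100.0)
    mask = (x.abs() <= theta).to(x.dtype)        # indicator matrix of Section 3.1
    return torch.sum((x * mask) ** 2) / x.numel()

def joint_regularizer(w, G, sparsity_wd, sparsity_sd):
    """Winograd-domain and spatial-domain partial L2 regularizers for one layer.
    w: spatial-domain filters, shape (out_ch, in_ch, r, r); G: Winograd filter
    transformation matrix of shape (n, r) for this layer (see Section 2.1)."""
    w_wd = torch.einsum('ab,oibc,dc->oiad', G, w, G)   # per-filter G w G^T
    r_wd = partial_l2(w_wd, sparsity_wd)               # regularizes Winograd-domain weights
    r_sd = partial_l2(w, sparsity_sd)                  # regularizes spatial-domain weights
    return r_wd, r_sd
```

Note that the Winograd-domain term is differentiable with respect to the spatial-domain filters, so gradients flow back through the transformation, as required by (2).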
3.2 Regularized training with learnable regularization coefficients
Combining the regularizers in (2) and (3), the cost function to minimize in training is given by
(4) 
where the expectation of the network loss is taken over a training dataset, and the network loss function is, for example, the cross-entropy loss for classification or the mean-squared-error loss for regression. We emphasize that training is performed in the spatial domain with conventional spatial-domain convolution and we update the spatial-domain filters, while the regularizers steer the spatial-domain filters to be sparse not only in the spatial domain but also in the Winograd domain after transformation.
In (4), we introduced two regularization coefficients, one for each regularizer. Conventionally, a fixed value is used for a regularization coefficient. However, we observe that using fixed regularization coefficients for the whole training is not effective for finding good sparse models. For small fixed coefficients, the regularization is weak and we cannot reach the desired sparsity after training. For large fixed coefficients, on the other hand, we can achieve the desired sparsity, but this likely comes with considerable performance loss due to the strong regularization.
Learnable regularization coefficient: To overcome the problems with fixed regularization coefficients, we propose novel learnable regularization coefficients, i.e., we let the regularization coefficients be learnable parameters. Starting from a small initial coefficient value, we learn an accurate model with little regularization. As training goes on, we make the regularization coefficients increase gradually so that the performance does not degrade much, but we finally have convolutional filters that are sparse in both the Winograd and spatial domains at the end of training. Towards this end, we first reparametrize each regularization coefficient through an always-positive function (e.g., an exponential) of an underlying parameter and learn that parameter instead, so that the regularization coefficients stay positive during training. Moreover, we include an additional regularization term that penalizes small regularization coefficients and encourages them to increase in training. The cost function in (4) is then altered into
(5) 
Observe that we introduced one new hyper-parameter while making the regularization coefficients learnable. The trade-off between the loss and the regularization is now controlled by this new hyper-parameter instead of by the regularization coefficients, which is beneficial since it is not directly tied to either the loss or the regularization, and the optimal trade-off between them can be learned.
Gradient descent: From (5), we have
(6) 
where the gradient of the network loss with respect to the weights is provided by the CNN backpropagation algorithm. It can be shown that
(7)  
(8) 
The detailed proof of (7) can be found in Appendix A, while (8) is straightforward to show. Combining (6)–(8), we can perform gradient descent for the weights. The two regularization-coefficient parameters are also updated by gradient descent using their own partial derivatives of (5). Observe that, as a regularizer decreases, its regularization coefficient is driven to a larger value. A larger regularization coefficient further encourages the spatial-domain weights to move in the direction where the regularized Winograd-domain weights converge to zero in the following update. In this way, we gradually sparsify the Winograd-domain filters. Similarly, the spatial-domain filters are sparsified owing to the increasing spatial-domain regularization coefficient and the decreasing spatial-domain regularizer.
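For concreteness, one regularized training step with learnable regularization coefficients could look as follows; the exponential parametrization and the penalty term zeta * exp(-a), which discourages small coefficients, are illustrative choices consistent with the description above rather than the exact form of (5), and joint_regularizer refers to the sketch in Section 3.1.

```python
import torch

def training_step(model, batch, loss_fn, G_per_layer, a_wd, a_sd,
                  zeta=1.0, sparsity_wd=80.0, sparsity_sd=80.0):
    """One training step; a_wd and a_sd are learnable scalars, so the regularization
    coefficients exp(a_wd) and exp(a_sd) stay positive, and zeta * exp(-a) pushes
    them upwards as the corresponding regularizers shrink."""
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)

    r_wd, r_sd = 0.0, 0.0
    for name, G in G_per_layer.items():                # layers using Winograd convolution
        w = dict(model.named_parameters())[name]
        rw, rs = joint_regularizer(w, G, sparsity_wd, sparsity_sd)
        r_wd, r_sd = r_wd + rw, r_sd + rs

    cost = (loss
            + torch.exp(a_wd) * r_wd + torch.exp(a_sd) * r_sd   # learnable coefficients
            + zeta * (torch.exp(-a_wd) + torch.exp(-a_sd)))     # penalize small coefficients
    cost.backward()                                    # gradients for weights, a_wd and a_sd
    return cost
```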
Figure 2: Weight histograms of the AlexNet second convolutional layer in the Winograd domain (top) and in the spatial domain (bottom) at (a) 0, (b) 100k, (c) 120k, and (d) 200k training iterations.
Evolution of the weight histogram: In Figure 2, we show how the weight histogram (distribution) of the AlexNet second convolutional layer evolves in the Winograd domain and in the spatial domain as training proceeds under the proposed partial L2 regularizers with the learnable regularization coefficients. Observe that a part of the weights converges to zero in both domains. Finally, we have a peak at zero in each domain, which can be pruned at little accuracy loss.
Figure 3: Examples of convolutional filters pruned in the Winograd domain (top) and in the spatial domain (bottom).
Examples of pruned filters: In Figure 3, we present samples of convolutional filters that are sparse both in the Winograd domain and in the spatial domain, obtained by our regularization method for different sparsity levels. The filters are from the AlexNet second convolutional layer, and we assume the Winograd convolution of Section 2.1 for them.
3.3 Universal CNN pruning and compression
We compress the jointly sparse CNN model from Section 3.2 by universal compression in the spatial domain for universal deployment. Universal compression consists of the following three steps.
Universal quantization and pruning: First, we randomize the spatial-domain weights by adding uniform random dithers, and quantize the dithered weights uniformly with a fixed quantization interval by
(9) 
where the quantization is applied to the individual spatial-domain weights of all layers, and the dithers are independent and identically distributed uniform random variables supported on one quantization interval centered at zero. The weights rounded to zero in (9) are pruned and fixed to zero for the rest of the fine-tuning and compression steps. The random dithering values or their random seed are assumed to be known at deployment, and the dithering values are cancelled for the unpruned weights after decompression by
(10) 
where the result is the final deployed value of each weight for inference.
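A minimal NumPy sketch of this step, assuming dithers drawn uniformly from one quantization interval centered at zero and the usual randomized-rounding form of (9) (variable and function names are illustrative):

```python
import numpy as np

def universal_quantize(w, delta, seed=0):
    """Dithered uniform quantization of spatial-domain weights w (1D array).
    Returns codebook indexes and the dithers (shared with the decoder via the seed)."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(-delta / 2, delta / 2, size=w.shape)   # i.i.d. uniform dithers
    idx = np.rint((w + u) / delta).astype(np.int64)        # quantize the dithered weights
    return idx, u

def reconstruct(idx, codebook, u):
    """Cancel the dithers for unpruned weights as in (10); pruned weights stay zero."""
    values = np.array([codebook[k] for k in idx])
    return np.where(idx == 0, 0.0, values - u)

w = 0.05 * np.random.randn(10000)                          # toy spatial-domain weights
delta = 0.02
idx, u = universal_quantize(w, delta, seed=42)
codebook = {k: k * delta for k in np.unique(idx)}          # initial shared quantized values
w_hat = reconstruct(idx, codebook, u)
print("pruned fraction:", np.mean(idx == 0))
```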
Fine-tuning the uniform codebook: Second, we fine-tune the uniform codebook to compensate for the accuracy loss after quantization. The average gradient is computed over the unpruned weights that are quantized to the same value in (9). Then, their shared quantized value in the codebook is updated by gradient descent using this average gradient, which is given by
(11) 
where the update runs over training iterations with a given learning rate, and the average in (11) is taken over the index set of all weights that are quantized to the same nonzero codebook value in (9). After the codebook is updated, the individual weights are updated by following their shared quantized value in the codebook. We emphasize that the weights pruned in (9) are not fine-tuned and stay zero. We do not include the spatial-domain regularizer in (11), since this step follows the joint sparsity optimization as shown in Figure 1: we have already determined which spatial-domain weights to prune in (9) and fixed them to zero. However, to maintain the sparsity in the Winograd domain while optimizing the quantization codebook in the spatial domain, we keep the Winograd-domain regularizer in (11).
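A sketch of the codebook update in (11), assuming per-weight gradients of the network loss are already available from backpropagation (names illustrative):

```python
import numpy as np

def update_codebook(codebook, idx, weight_grads, lr):
    """One gradient-descent step on the shared quantized values: each nonzero codeword
    moves by the average gradient of the unpruned weights assigned to it."""
    for k in codebook:
        if k == 0:
            continue                                   # pruned weights stay zero
        members = (idx == k)
        if members.any():
            codebook[k] -= lr * weight_grads[members].mean()
    return codebook

# After the update, every unpruned weight follows its (fine-tuned) shared value:
# w_quantized = np.array([codebook[k] for k in idx])
```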
Universal lossless source coding: Finally, universal lossless source coding follows for compression. It is assumed that the encoder and the decoder share the information on the random dithers, e.g., through a compression protocol that sends the random seed. The codebook indexes of the universally quantized weights are passed as an input stream to a universal entropy source coding scheme such as Lempel–Ziv–Welch ziv1977universal (); ziv1978compression (); welch1984technique (), gzip gailly2003gzip (), or bzip2 seward1998bzip2 (), which uses the Burrows–Wheeler transform effros2002universal (), producing a compressed stream. We also deploy the codebook that maps the indexes to the corresponding fine-tuned shared quantized values for decompression.
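As an example, the index stream can be serialized and compressed with Python's built-in bzip2 bindings; the 16-bit byte packing below is an illustrative choice.

```python
import bz2
import numpy as np

def compress_indexes(idx):
    """Compress the stream of codebook indexes with bzip2 (Burrows-Wheeler based)."""
    payload = idx.astype(np.int16).tobytes()        # assumes the indexes fit in 16 bits
    return bz2.compress(payload, compresslevel=9)

def decompress_indexes(blob):
    return np.frombuffer(bz2.decompress(blob), dtype=np.int16).astype(np.int64)

# compressed = compress_indexes(idx)
# ratio = (idx.size * 4) / len(compressed)          # vs. storing 32-bit floating-point weights
```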
Deployment: At deployment, the compressed stream is decompressed, and the random dithers are cancelled to recover the unpruned spatial-domain weights as in (10). Then, the CNN can be deployed in the spatial domain with the desired sparsity. If we instead deploy the CNN in the Winograd domain, its convolutional filters are transformed into the Winograd domain and pruned to the desired sparsity level.
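For Winograd-domain deployment, the decompressed spatial-domain filters are transformed and then thresholded; a sketch assuming a global magnitude percentile over the transformed weights (names illustrative):

```python
import numpy as np

def prune_in_winograd_domain(w, G, sparsity):
    """Transform spatial-domain filters w (out_ch, in_ch, r, r) with G (n, r) and zero out
    the smallest-magnitude Winograd-domain weights up to the target sparsity (%)."""
    w_wd = np.einsum('ab,oibc,dc->oiad', G, w, G)    # per-filter G w G^T
    theta = np.percentile(np.abs(w_wd), sparsity)
    w_wd[np.abs(w_wd) <= theta] = 0.0                # pruned weights are skipped in the MACs
    return w_wd
```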
4 Experiments
ResNet-18 for ImageNet classification: We evaluate our universal CNN pruning and compression scheme on the ResNet-18 model he2015deep () for the ImageNet ILSVRC 2012 dataset russakovsky2015imagenet (). As in liu2018efficient (), we modify the original ResNet-18 model by replacing its strided convolutional layers with stride-1 convolutional layers and max-pooling layers, in order to utilize Winograd convolution in all possible convolutional layers. One difference from liu2018efficient () is that we place max-pooling after convolution (Conv+Maxpool) instead of before convolution (Maxpool+Conv). Our modification provides better accuracy (see Figure 4), although it comes with more computation.
Table 1: Pruned ResNet-18 models, evaluated with (1) spatial-domain convolution and (2) Winograd convolution.

| Regularization (sparsity) | (1) Pruning | (1) Top-1 / Top-5 accuracy | (1) # MACs per image | (2) Pruning | (2) Top-1 / Top-5 accuracy | (2) # MACs per image |
| Pre-trained model | – | 68.2 / 88.6 | 2347.1M | – | 68.2 / 88.6 | 1174.0M |
| SD (80%) | SD (80%) | 67.8 / 88.4 | 837.9M | WD (80%) | 56.9 / 80.7 | 467.0M |
| WD (80%) | SD (80%) | 44.0 / 70.5 | 819.7M | WD (80%) | 68.4 / 88.6 | 562.9M |
| WD+SD (80%) | SD (80%) | 67.8 / 88.5 | 914.9M | WD (80%) | 67.8 / 88.5 | 522.6M |
The Winograd-domain regularizer is applied to all convolutional filters for which Winograd convolution can be used (see Section 2.1). The spatial-domain regularizer is applied to all convolutional and fully-connected layers, not only for pruning but also for the later compression in the spatial domain. We train with the Adam optimizer kingma2014adam (). The hyper-parameter introduced in (5) is fixed for the whole training, and the two regularization-coefficient parameters are initialized to the same small value and are also updated with the Adam optimizer.
Table 1 summarizes the average pruning ratio, the accuracy, and the number of MACs needed to process one input image for the pruned ResNet-18 models. The number of MACs for Winograd convolution is counted following (lavin2016fast, Section 5). We compare three models obtained by spatial-domain regularization only (SD), Winograd-domain regularization only (WD), and both Winograd-domain and spatial-domain regularization (WD+SD). The accuracy is evaluated using (1) spatial-domain convolution and (2) Winograd convolution for the convolutional layers with small filters (we used https://github.com/IntelLabs/SkimCaffe li2017enabling () for Winograd convolution in the accuracy evaluation). In case of (2), the filters are transformed into the Winograd domain and pruned to the desired ratio.
As expected, regularizing only one domain produces the desired sparsity only in that domain; if we prune the weights in the other domain, we suffer considerable accuracy loss. Using both the Winograd-domain and spatial-domain regularizers, we can produce one model that is sparse and accurate in both domains. As shown in Table 1, the number of MACs is reduced substantially when using sparse convolution in either the spatial domain or the Winograd domain, at an accuracy loss of less than 0.5%.
We compare the accuracy of our pruned ResNet-18 models to the ones from liu2018efficient () in Figure 4. Observe that our models outperform the ones from liu2018efficient () in the Winograd domain. We emphasize that the major advantage of our scheme is that it produces one model that can use either sparse spatial-domain convolution or sparse Winograd convolution. In contrast, the models from liu2018efficient () are constrained to use their special Winograd-ReLU layers, even though they can additionally exploit the dynamic sparsity of ReLU activations in the Winograd domain, as explained in Section 2.2.
Table 2: Compressed ResNet-18 models, evaluated with (1) spatial-domain convolution and (2) Winograd convolution.

| Regularization (sparsity) | Quantization (cell size) | Compression ratio | (1) Top-1 / Top-5 accuracy | (1) # MACs per image | (2) Top-1 / Top-5 accuracy | (2) # MACs per image |
| Pre-trained model | – | – | 68.2 / 88.6 | 2347.1M | 68.2 / 88.6 | 1174.0M |
| WD+SD (80%) | UQ (0.005) | 24.2 | 67.4 / 88.2 | 888.6M | 67.4 / 88.2 | 516.4M |
| WD+SD (80%) | UQ (0.010) | 28.9 | 66.9 / 87.9 | 859.0M | 66.9 / 87.9 | 516.5M |
| WD+SD (80%) | UQ (0.020) | 38.4 | 63.7 / 86.0 | 749.6M | 63.7 / 86.0 | 495.1M |
| WD+SD (80%) | DUQ (0.005) | 23.8 | 67.5 / 88.2 | 886.5M | 67.4 / 88.3 | 516.3M |
| WD+SD (80%) | DUQ (0.010) | 28.7 | 66.8 / 87.8 | 848.1M | 66.6 / 87.7 | 512.9M |
| WD+SD (80%) | DUQ (0.020) | 38.6 | 60.0 / 83.5 | 708.8M | 60.0 / 83.5 | 502.5M |
Table 2 shows our universal CNN quantization and compression results for the ResNet-18 model. We take the model obtained by WD+SD (80%) regularization in Table 1 and compress its weights as described in Section 3.3. We compare uniform quantization (UQ) and dithered uniform quantization (DUQ). We use bzip2 seward1998bzip2 () for universal source coding. The results show that we can achieve more than 23× compression at an accuracy loss of less than 1% in both cases (1) and (2). Uniform quantization is known to be asymptotically optimal in the high-resolution regime, i.e., for small quantization cell sizes gish1968asymptotically (), but it loses its optimality as the cell size increases. Thus, DUQ can outperform UQ, in particular for larger cell sizes, as shown in our results.
AlexNet for ImageNet classification: We perform similar pruning and compression experiments on the AlexNet model krizhevsky2012imagenet (). The AlexNet model has a huge number of weights in its fully-connected (FC) layers (58.6M out of 61M in total), and thus we first prune roughly 90% of the spatial-domain weights, mostly from its FC layers, by incremental pruning as suggested in han2015learning ().
We retrain the pruned AlexNet model similarly to the ResNet-18 case above. In particular, we apply the proposed regularizers only to the second to fifth convolutional layers (Conv2–Conv5), whose filter sizes are small enough for Winograd convolution (see Section 2.1). The first convolutional layer (Conv1) is excluded, since its filter size is not small enough for efficient Winograd convolution.
Table 3: Compressed AlexNet models. (1) Spatial-domain convolution and (2) Winograd convolution are used respectively for Conv2–Conv5 in the accuracy evaluation.

| | Regularization (sparsity) | Quantization (cell size) | Compression ratio | Sparsity (%): Conv1 | Conv2 | Conv3 | Conv4 | Conv5 | FC1 | FC2 | FC3 | Top-1 / Top-5 accuracy (%) | # MACs per image |
| (1) | Pre-trained model | – | – | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 56.8 / 80.0 | 724.4M |
| (1) | WD+SD (70%) | UQ (0.010) | 47.5 | 17.2 | 68.3 | 81.9 | 76.1 | 72.7 | 93.5 | 92.2 | 80.4 | 56.0 / 79.5 | 237.1M |
| (1) | WD+SD (70%) | DUQ (0.010) | 47.7 | 18.3 | 66.8 | 81.7 | 75.8 | 72.4 | 93.7 | 92.3 | 80.6 | 56.1 / 79.3 | 240.0M |
| (1) | Han et al. han2015deep () | – | 35.0 | 16.0 | 62.0 | 65.0 | 63.0 | 63.0 | 91.0 | 91.0 | 75.0 | 57.2 / 80.3 | 301.1M |
| (2) | Pre-trained model | – | – | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 56.8 / 80.0 | 330.0M |
| (2) | WD+SD (70%) | UQ (0.010) | 47.5 | 17.2 | 43.9 | 72.0 | 63.7 | 62.4 | 93.5 | 92.2 | 80.4 | 56.0 / 79.5 | 144.2M |
| (2) | WD+SD (70%) | DUQ (0.010) | 47.7 | 18.3 | 47.4 | 71.7 | 63.3 | 62.0 | 93.7 | 92.3 | 80.6 | 56.0 / 79.3 | 142.6M |
| (2) | Li et al. li2017enabling () | – | N/A | 0.0 | 90.6 | 95.8 | 94.3 | 93.9 | 0.0 | 0.0 | 0.0 | 57.3 / N/A | 319.8M |
Table 4: Compressed CT-SRCNN models, evaluated with (1) spatial-domain convolution and (2) Winograd convolution.

| Regularization (sparsity) | Quantization (cell size) | Compression ratio | (1) PSNR (dB) | (1) SSIM | (1) # MACs per image | (2) PSNR (dB) | (2) SSIM | (2) # MACs per image |
| Pre-trained model | – | – | 29.70 | 0.8301 | 233.2G | 29.70 | 0.8301 | 56.7G |
| WD+SD (90%) | UQ (0.020) | 35.4 | 29.32 | 0.8225 | 17.4G | 29.32 | 0.8225 | 9.9G |
| WD+SD (90%) | DUQ (0.020) | 34.8 | 29.30 | 0.8222 | 18.0G | 29.31 | 0.8222 | 10.0G |
In Table 3, we provide the compression ratio, the per-layer sparsity, the accuracy, and the number of MACs needed to process one input image for the compressed AlexNet models. We compare our results to han2015deep () in the spatial domain and to li2017enabling () in the Winograd domain; we note that han2015deep () and li2017enabling () produce sparse models in only one domain. Compared to li2017enabling (), our method yields less pruning for the Conv2–Conv5 layers in the Winograd domain, but we also prune the Conv1 and FC layers heavily in the spatial domain. Observe that the number of MACs is substantially reduced when using sparse spatial-domain convolution and when using sparse Winograd convolution, at an accuracy loss of less than 1%. The results also show that we achieve a compression ratio of more than 47× (Table 3).
CT-SRCNN for image super-resolution: We finally evaluate the proposed scheme on the cascade-trained SRCNN (CT-SRCNN) model with 9 convolutional layers ren2018ctsrcnn (). We apply the Winograd-domain regularizer to the small filters of the second to the last layers, for which Winograd convolution is used (see Section 2.1); the large filters of the first layer are excluded. The spatial-domain regularizer is applied to all 9 layers.
The average peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) of the compressed CT-SRCNN models are compared on the Set14 dataset zeyde2010single () in Table 4. We also summarize in Table 4 the compression ratio and the number of MACs needed to produce one high-definition output image by super resolution. Observe that we achieve roughly 35× compression at a PSNR loss of about 0.4 dB, while the number of MACs is greatly reduced when using sparse spatial-domain convolution and when using sparse Winograd convolution.
5 Conclusion
We introduced a framework for pruning and compression of CNNs that is independent of the underlying hardware or software platform. The proposed scheme produces one compressed model whose convolutional filters are sparse in both the Winograd and spatial domains. Thus, one compressed model can be deployed on any platform, and the sparsity of its convolutional filters can be utilized for complexity reduction in either domain, unlike previous approaches that yield sparse models in only one domain. We showed by experiments that the proposed scheme successfully prunes and compresses ResNet-18, AlexNet, and the 9-layer CT-SRCNN with large compression ratios and significant complexity reduction. Finally, our regularization method for joint sparsity can be extended to sparse frequency-domain convolution, which remains as future work. It will also be interesting to compare our partial L2 regularizer to L1 or k-support norm argyriou2012sparse () regularization for sparsity.
Appendix A Appendix
In this appendix, we give the proof of (7).
Proof.
We have (e.g., see (golub2012matrix, Section 1.3.7))
vec(G w G^T) = (G ⊗ G) vec(w),  (12)
where vec(·) denotes the column-vectorization of a matrix and ⊗ denotes the Kronecker product of two matrices. Thus, it follows that
(13) 
For any matrix and any pair of column vectors of compatible dimensions, it is straightforward to show that
where the diagonal matrix above has its diagonal elements taken from the indicated vector, and it then follows that
(14) 
Combining (12)–(14) leads us to
(15) 
which results in (7) of our paper. We note that the gradient is actually not defined at the discontinuous points where any Winograd-domain weight is exactly equal to the threshold in magnitude; this, however, can be ignored in stochastic gradient descent. ∎
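As a quick numerical sanity check of the vectorization identity (12) (with the column-major vec convention of the reference), the following snippet can be used; the matrix sizes are arbitrary examples.

```python
import numpy as np

G = np.random.randn(4, 3)                        # e.g., a Winograd filter transformation matrix
w = np.random.randn(3, 3)                        # a spatial-domain filter

vec = lambda M: M.flatten(order='F')             # column-major vectorization
lhs = vec(G @ w @ G.T)
rhs = np.kron(G, G) @ vec(w)                     # (G kron G) vec(w)
assert np.allclose(lhs, rhs)                     # identity (12)
```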
References
 (1) Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
 (2) V. Sze, Y.H. Chen, T.J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
 (3) Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “Model compression and acceleration for deep neural networks: The principles, progress, and challenges,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 126–136, 2018.
 (4) S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in International Conference on Learning Representations, 2016.
(5) Y. Choi, M. El-Khamy, and J. Lee, “Towards the limit of network quantization,” in International Conference on Learning Representations, 2017.
 (6) K. Ullrich, E. Meeds, and M. Welling, “Soft weightsharing for neural network compression,” in International Conference on Learning Representations, 2017.
 (7) E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, “Softtohard vector quantization for endtoend learning compressible representations,” in Advances in Neural Information Processing Systems, 2017, pp. 1141–1151.
 (8) C. Louizos, K. Ullrich, and M. Welling, “Bayesian compression for deep learning,” in Advances in Neural Information Processing Systems, 2017, pp. 3290–3300.
(9) Y. Choi, M. El-Khamy, and J. Lee, “Universal deep neural network compression,” arXiv preprint arXiv:1802.02271, 2018.
 (10) M. Mathieu, M. Henaff, and Y. LeCun, “Fast training of convolutional networks through FFTs,” arXiv preprint arXiv:1312.5851, 2013.
 (11) N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun, “Fast convolutional nets with fbfft: A GPU performance evaluation,” arXiv preprint arXiv:1412.7580, 2014.
 (12) A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4013–4021.
 (13) S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
 (14) V. Lebedev and V. Lempitsky, “Fast convnets using groupwise brain damage,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2554–2564.
 (15) W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 2074–2082.
 (16) Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient DNNs,” in Advances In Neural Information Processing Systems, 2016, pp. 1379–1387.
 (17) J. Park, S. Li, W. Wen, P. T. P. Tang, H. Li, Y. Chen, and P. Dubey, “Faster CNNs with direct sparse convolutions and guided pruning,” International Conference on Learning Representations, 2017.
 (18) J. Lin, Y. Rao, J. Lu, and J. Zhou, “Runtime neural pruning,” in Advances in Neural Information Processing Systems, 2017, pp. 2178–2188.
 (19) S. Li, J. Park, and P. T. P. Tang, “Enabling sparse Winograd convolution by native pruning,” arXiv preprint arXiv:1702.08597, 2017.
(20) X. Liu, J. Pool, S. Han, and W. J. Dally, “Efficient sparse-Winograd convolutional neural networks,” in International Conference on Learning Representations, 2018.
 (21) S. Winograd, Arithmetic Complexity of Computations. SIAM, 1980, vol. 33.
 (22) R. E. Blahut, Fast Algorithms for Signal Processing. Cambridge University Press, 2010.
 (23) J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337–343, 1977.
 (24) ——, “Compression of individual sequences via variablerate coding,” IEEE Transactions on Information Theory, vol. 24, no. 5, pp. 530–536, 1978.
(25) T. A. Welch, “A technique for high-performance data compression,” Computer, vol. 17, no. 6, pp. 8–19, 1984.
(26) M. Effros, K. Visweswariah, S. R. Kulkarni, and S. Verdú, “Universal lossless source coding with the Burrows–Wheeler transform,” IEEE Transactions on Information Theory, vol. 48, no. 5, pp. 1061–1081, 2002.
 (27) R. Zamir and M. Feder, “On universal quantization by randomized uniform/lattice quantizers,” IEEE Transactions on Information Theory, vol. 38, no. 2, pp. 428–436, 1992.
 (28) J.L. Gailly and M. Adler, “gzip,” 2003. [Online]. Available: www.gzip.org
 (29) J. Seward, “bzip2,” 1998. [Online]. Available: www.bzip.org
 (30) K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
 (31) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
 (32) D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 (33) H. Gish and J. Pierce, “Asymptotically efficient quantizing,” IEEE Transactions on Information Theory, vol. 14, no. 5, pp. 676–683, 1968.
 (34) A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
(35) H. Ren, M. El-Khamy, and J. Lee, “CT-SRCNN: Cascade trained and trimmed deep convolutional neural networks for image super resolution,” in IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.
 (36) R. Zeyde, M. Elad, and M. Protter, “On single image scaleup using sparserepresentations,” in International Conference on Curves and Surfaces. Springer, 2010, pp. 711–730.
(37) A. Argyriou, R. Foygel, and N. Srebro, “Sparse prediction with the k-support norm,” in Advances in Neural Information Processing Systems, 2012, pp. 1457–1465.
 (38) G. H. Golub and C. F. Van Loan, Matrix Computations. JHU Press, 2013.