Compression of Deep Convolutional Neural Networks under Joint Sparsity Constraints


Yoojin Choi   Mostafa El-Khamy   Jungwon Lee
SoC R&D, Samsung Semiconductor Inc.
San Diego, CA 92121, USA
{yoojin.c,mostafa.e,jungwon2.lee}@samsung.com
Abstract

We consider the optimization of deep convolutional neural networks (CNNs) such that they provide good performance while having reduced complexity when deployed either on conventional systems utilizing spatial-domain convolution or on lower-complexity systems designed for Winograd convolution. Furthermore, we explore the universal quantization and compression of these networks. In particular, the proposed framework produces one compressed model whose convolutional filters are sparse not only in the spatial domain but also in the Winograd domain. Hence, one compressed model can be deployed universally on any platform, without the need for re-training on the deployed platform, and the sparsity of its convolutional filters can be exploited for further complexity reduction in either domain. To achieve a better compression ratio, the sparse model is compressed in the spatial domain, which has fewer parameters. In our experiments, the proposed framework yields compressed ResNet-18, AlexNet, and CT-SRCNN models while also reducing their computational complexity.

 


Preprint. Work in progress.

1 Introduction

Deep learning with convolutional neural networks (CNNs) has recently achieved performance breakthroughs in many computer vision applications lecun2015deep (). However, the large model size and huge computational complexity hinder the deployment of state-of-the-art CNNs on resource-limited platforms such as battery-powered mobile devices. Thus, it is of great interest to compress complex CNNs into compact forms to lower their storage requirements and to reduce their computational costs sze2017efficient (); cheng2018model ().

CNN size compression has been actively investigated for memory and storage size reduction. Han et al. han2015deep () showed impressive compression results by weight pruning, quantization using k-means clustering, and Huffman coding. This work was followed by further analysis and mathematical optimization, and more efficient CNN compression schemes have been suggested since, e.g., in choi2017towards (); ullrich2017soft (); agustsson2017soft (); louizos2017bayesian (); choi2018universal (). On the other hand, CNN computational complexity reduction has also been investigated. The major computational cost of CNNs comes from the multiply-accumulate (MAC) operations in their convolutional layers (sze2017efficient, Table II). There have been two directions to reduce the convolution complexity in CNNs: one is to prune convolutional filters and exploit sparse convolution so that MACs involving zero weights can be skipped, and the other is to adopt fast convolution algorithms, such as FFT-based or Winograd convolution, that lower the arithmetic complexity of convolution itself.

Previous works either focused on spatial-domain weight pruning and compression, or on Winograd-domain weight pruning and complexity reduction. Compression of Winograd CNNs has never been addressed before, to the best of our knowledge. Another shortcoming of the previous works addressing the complexity reduction of Winograd CNNs is that their final CNNs are no longer backward compatible with spatial-domain convolution, due to the non-invertibility of the Winograd transformation, and hence they suffer from accuracy losses if they need to be run on platforms that only support spatial-domain convolution. To our knowledge, this paper is the first to present a universal CNN pruning and compression framework for both Winograd and spatial-domain convolution.

Our proposed solutions are summarized in Figure 1. The main novelty of the proposed framework is that it optimizes CNNs such that their convolutional filters can be pruned either in the Winograd domain or in the spatial domain without accuracy losses and without extra training or fine-tuning in that domain. Our CNNs can be further optimized for, and compressed by, universal quantization and universal source coding such that their decompressed convolutional filters still have sparsity in both the Winograd and spatial domains. Hence, one universally compressed model can be deployed on any platform, whether it uses spatial-domain convolution or Winograd convolution, with no need for further training. Since many low-power platforms, such as mobile phones, are expected to support only the inference of CNNs, and not their training, our approach is desirable for wide-scale deployment of pre-trained models without worrying about the underlying inference engines.

Figure 1: Universal CNN weight pruning and compression for supporting sparse Winograd convolution as well as sparse spatial-domain convolution.

2 Preliminary

2.1 Winograd convolution

We first review the Winograd convolution algorithm winograd1980arithmetic () in this subsection. For the sake of illustration, consider a two-dimensional (2-D) input of size H×W and a 2-D filter of size r×r for convolution. We first prepare a set of patches of size n×n, for n ≥ r, extracted from the input with a stride of n−r+1. Each of the n×n patches is convolved with the r×r filter by the Winograd convolution algorithm and produces an output patch of size (n−r+1)×(n−r+1).

Let x and y be one of the n×n input patches and its corresponding output patch, respectively, and let g be the r×r filter. In Winograd convolution, the input and the filter are transformed into the Winograd domain by B^T x B and G g G^T using the Winograd transformation matrices B and G, respectively, where the superscript T denotes the matrix transpose. In the Winograd domain, both B^T x B and G g G^T are of size n×n, and their element-wise product follows. Then, the output is transformed back to the spatial domain using the matrix A by

y = A^T [ (G g G^T) ⊙ (B^T x B) ] A,    (1)

where ⊙ is the element-wise product of two matrices. The transformation matrices A, B and G are determined by n and r and can be obtained from the Chinese remainder theorem (e.g., see (blahut2010fast, Section 5.3)). In the case of multiple input channels, the inverse transformation in (1) can be applied once, after summation over all channels of the element-wise product outputs in the Winograd domain (see (lavin2016fast, Section 4)).
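
For concreteness, the following NumPy sketch implements (1) for the common F(2×2, 3×3) case (4×4 input patches, 3×3 filters, 2×2 output patches), using the standard transformation matrices of Lavin and Gray lavin2016fast (), and checks the result against direct 2-D correlation; it is an illustration, not the authors' implementation.

```python
import numpy as np

# Standard transformation matrices for Winograd F(2x2, 3x3) (Lavin & Gray, 2016):
# 4x4 input patch, 3x3 filter, 2x2 output patch.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(x, g):
    """Compute (1): y = A^T [(G g G^T) (.) (B^T x B)] A for one 4x4 patch x and 3x3 filter g."""
    U = G @ g @ G.T        # filter in the Winograd domain (4x4)
    V = B_T @ x @ B_T.T    # input patch in the Winograd domain (4x4)
    return A_T @ (U * V) @ A_T.T

def direct_2d_correlation(x, g):
    """Direct 'valid' 2-D correlation, as computed by a CNN convolutional layer."""
    out = np.empty((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * g)
    return out

x = np.random.randn(4, 4)
g = np.random.randn(3, 3)
assert np.allclose(winograd_f2x2_3x3(x, g), direct_2d_correlation(x, g))
```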

2.2 Sparse Winograd convolution

Similar to spatial-domain weight pruning for sparse spatial-domain convolution of low complexity, one can skip some of the computations in the Winograd domain by pruning (i.e., setting to zero) some of the Winograd-transformed filter weights (elements of G g G^T in (1)) for sparse Winograd convolution. The works most related to this idea can be found in li2017enabling (); liu2018efficient ().

Pruning spatial-domain weights does not yield sparse Winograd-domain filters in general since the sparsity is not maintained after transformation. Thus, Li et al. li2017enabling () introduced new Winograd layers, which are similar to convolutional layers but their learnable parameters are defined in the Winograd domain, and not in the spatial domain. In their framework, Winograd-domain weights are directly learned in training where the loss and gradients are computed with Winograd layers. For Winograd-domain weight pruning, some insignificant Winograd-domain weights are nullified in every training iteration based on their magnitude and gradient values. In liu2018efficient (), the complexity of Winograd layers is further reduced by putting rectified linear units (ReLUs) in the Winograd domain and skipping MACs not only for zero weights but also for zero activations in the Winograd domain.

However, if we learn Winograd-domain weights directly using Winograd layers, the trained model has to use Winograd layers in inference as well. We cannot transform the learned Winograd-domain weights back to the spatial domain without considerable accuracy loss, since the inverse mapping is over-determined (there are more Winograd-domain weights than spatial-domain weights). Hence, the model is not deployable on platforms that only support classical spatial-domain convolution. Moreover, storing Winograd-domain weights is inefficient, since the number of weights is larger in the Winograd domain. Thus, we suggest that it is better to compress weights in the spatial domain even if the target computational platform only deploys Winograd convolution.
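
As a quick numerical illustration of why spatial-domain sparsity is not preserved, a 3×3 filter with a single non-zero weight generally maps to a much denser 4×4 Winograd-domain filter under G g G^T (this sketch reuses the F(2×2, 3×3) matrix G from the example above):

```python
import numpy as np

G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])

g = np.zeros((3, 3))
g[0, 0] = 1.0                 # spatial-domain filter with 8 of its 9 weights pruned
U = G @ g @ G.T               # the same filter in the Winograd domain (4x4)
print(np.count_nonzero(g))    # 1 non-zero weight in the spatial domain
print(np.count_nonzero(U))    # 9 non-zero weights in the Winograd domain
```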

2.3 Universal compression

A universal CNN compression framework was proposed in choi2018universal (), where CNNs are optimized for and compressed by universal quantization and universal entropy source coding with schemes such as variants of Lempel–Ziv–Welch ziv1977universal (); ziv1978compression (); welch1984technique () and the Burrows–Wheeler transform effros2002universal (). Of particular interest for universal quantization is randomized uniform quantization, where uniform random dithering makes the distortion independent of the source, and the gap of its rate from the rate-distortion bound at any distortion level is provably bounded by a constant number of bits per sample for any source zamir1992universal (). Universal CNN compression has practical advantages: it is easily applicable to any CNN model at any desired compression rate, without the extra burden required by previous approaches of computing or estimating the statistics of the CNN weights, and it is guaranteed to achieve near-optimal performance.

3 Main results

We consider a typical CNN model consisting of L convolutional layers. The input of layer l has C_l channels of size H_l×W_l and the output has C_{l+1} channels of size H_{l+1}×W_{l+1}, where the input is convolved with C_l·C_{l+1} filters of size r_l×r_l. For 1 ≤ l ≤ L, 1 ≤ i ≤ C_l and 1 ≤ j ≤ C_{l+1}, let w_l(i,j) be the 2-D convolutional filter for input channel i and output channel j of layer l.

3.1 Regularization for jointly sparse convolutional filters

In this subsection, we introduce our Winograd-domain and spatial-domain partial L2 regularizers to attain convolutional filters that are sparse both in the Winograd domain and in the spatial domain. We choose L2 regularizers to promote sparsity, although other regularizers such as L1 regularizers can be used instead. For notational simplicity, let w be the set of all convolutional filters of the L layers, which are learnable, i.e., w = {w_l(i,j) : 1 ≤ l ≤ L, 1 ≤ i ≤ C_l, 1 ≤ j ≤ C_{l+1}}. Moreover, given any matrix X and a threshold θ ≥ 0, we define 1_{|X| ≤ θ} as the matrix that has the same size as X, whose element is one if the corresponding element x in X satisfies |x| ≤ θ and is zero otherwise.

Winograd-domain partial L2 regularization: To optimize CNNs under a Winograd-domain sparsity constraint, we introduce the Winograd-domain partial L2 regularizer, which constrains the elements of the Winograd-domain filters G_l w_l(i,j) G_l^T that should be drawn to zero, but is expressed in terms of the elements of the spatial-domain filters w_l(i,j), as below:

R_WD(w; θ_WD) = (1/N_WD) Σ_{l=1}^{L} Σ_{i=1}^{C_l} Σ_{j=1}^{C_{l+1}} ‖ 1_{|G_l w_l(i,j) G_l^T| ≤ θ_WD} ⊙ (G_l w_l(i,j) G_l^T) ‖²,    (2)

where ‖·‖ denotes the L2 norm and G_l is the Winograd transformation matrix determined by the filter size and the input patch size of layer l for Winograd convolution (see Section 2.1); N_WD is the total number of Winograd-domain weights of all layers. Observe that the L2 regularization is applied only to the part of the Winograd-domain weights whose magnitudes are not greater than the threshold value θ_WD. Due to this partial L2 regularization, the spatial-domain weights are updated towards directions that make part of their Winograd-domain counterparts diminish during training, so that after training we can prune those insignificant Winograd-domain weights at inference for sparse Winograd convolution with negligible accuracy loss.

Given a desired sparsity level s_WD (%) in the Winograd domain, we set the threshold value θ_WD to be the s_WD-th percentile of the Winograd-domain weight magnitude values. The threshold is updated at every training iteration as the weights are updated. Note that the threshold decreases as training goes on, since the regularized Winograd-domain weights gradually converge to zero (see Section 3.2).
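
Below is a minimal PyTorch sketch of the Winograd-domain partial L2 regularizer of (2), assuming a single layer of 3×3 filters with the F(2×2, 3×3) matrix G; the threshold is set to the s-th percentile of the Winograd-domain weight magnitudes, and the 0/1 mask is treated as a constant (detached) so that gradients flow only through the regularized Winograd-domain weights, as in (7). The function and variable names are ours, not the authors'.

```python
import torch

# F(2x2, 3x3) filter transformation matrix G (see Section 2.1).
G = torch.tensor([[1.0,  0.0, 0.0],
                  [0.5,  0.5, 0.5],
                  [0.5, -0.5, 0.5],
                  [0.0,  0.0, 1.0]])

def winograd_partial_l2(w, sparsity):
    """Winograd-domain partial L2 regularizer for one layer of 3x3 filters.
    w: tensor of shape (out_channels, in_channels, 3, 3); sparsity in [0, 1)."""
    Ww = torch.einsum('ak,oikl,bl->oiab', G, w, G)        # G w G^T per filter -> (out, in, 4, 4)
    theta = torch.quantile(Ww.abs().flatten(), sparsity)  # threshold = s-th percentile of |Ww|
    mask = (Ww.abs() <= theta).float().detach()           # constant 0/1 mask, as in (2)
    return (mask * Ww).pow(2).sum() / Ww.numel()          # normalized partial L2 term

w = torch.randn(8, 4, 3, 3, requires_grad=True)
reg = winograd_partial_l2(w, sparsity=0.8)
reg.backward()   # gradients w.r.t. the spatial-domain weights, cf. (7)
```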

Spatial-domain partial L2 regularization: To optimize CNNs while having sparsity in the spatial domain, we regularize the cost function by the sum of squared L2 norms of the partial spatial-domain weights, determined by 1_{|w_l(i,j)| ≤ θ_SD} for a given target sparsity level s_SD (%), similar to (2), as below:

R_SD(w; θ_SD) = (1/N_SD) Σ_{l=1}^{L} Σ_{i=1}^{C_l} Σ_{j=1}^{C_{l+1}} ‖ 1_{|w_l(i,j)| ≤ θ_SD} ⊙ w_l(i,j) ‖²,    (3)

where N_SD is the total number of spatial-domain weights of all layers.

3.2 Regularized training with learnable regularization coefficients

Combining the regularizers in (2) and (3), the cost function C to minimize in training is given by

C(X; w) = E(X; w) + λ_WD R_WD(w; θ_WD) + λ_SD R_SD(w; θ_SD),    (4)

for λ_WD > 0 and λ_SD > 0, where X is a training dataset and E(X; w) is the network loss function, such as the cross-entropy loss for classification or the mean-squared-error loss for regression. We emphasize that training is performed in the spatial domain with conventional spatial-domain convolution and we update the spatial-domain filters in w, while the regularizers steer the spatial-domain filters to be sparse not only in the spatial domain but also in the Winograd domain when they are transformed.

In (4), we introduced two regularization coefficients, λ_WD and λ_SD. Conventionally, a fixed value is used for a regularization coefficient. However, we observe that using fixed regularization coefficients for the whole of training is not efficient for finding good sparse models. For small fixed coefficients, the regularization is weak and we cannot reach the desired sparsity after training. For large fixed coefficients, on the other hand, we can achieve the desired sparsity, but this likely comes with considerable performance loss due to the strong regularization.

Learnable regularization coefficient: To overcome the problems with fixed regularization coefficients, we propose novel learnable regularization coefficients, i.e., we let the regularization coefficients be learnable parameters. Starting from a small initial coefficient value, we learn an accurate model with little regularization. As training goes on, we make the regularization coefficients increase gradually so that the performance does not degrade much, but we finally have sparse convolutional filters in both the Winograd and spatial domains at the end of training. Towards this end, we first replace λ_WD and λ_SD with e^{α_WD} and e^{α_SD}, respectively, and learn α_WD and α_SD instead, so that the regularization coefficients are always positive during training. Moreover, we include an additional regularization term, e.g., ζ(e^{−α_WD} + e^{−α_SD}) for some ζ > 0, to penalize small regularization coefficients and encourage them to increase in training. The cost function in (4) is then altered into

C(X; w, α_WD, α_SD) = E(X; w) + e^{α_WD} R_WD(w; θ_WD) + e^{α_SD} R_SD(w; θ_SD) + ζ (e^{−α_WD} + e^{−α_SD}).    (5)

Observe that we have introduced a new hyper-parameter ζ while making the regularization coefficients learnable. The trade-off between the loss and the regularization is now controlled by the new hyper-parameter ζ instead of by the regularization coefficients, which is beneficial since ζ is not directly tied to either the loss or the regularization, and the optimal trade-off between them can be learned.
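
A minimal PyTorch sketch of training with learnable regularization coefficients is given below. The variable names, the initial values, and the specific penalty ζ(e^{−α_WD} + e^{−α_SD}) are illustrative choices consistent with the description above (the penalty is given only as an example), not necessarily the authors' exact settings.

```python
import torch

# Learnable regularization coefficients: lambda = exp(alpha), learned jointly with the weights.
zeta = 1e-3                                         # trade-off hyper-parameter (value assumed)
alpha_wd = torch.tensor(-10.0, requires_grad=True)  # exp(alpha_wd) starts small, then grows
alpha_sd = torch.tensor(-10.0, requires_grad=True)

def total_cost(loss, r_wd, r_sd):
    """Cost of (5): network loss, exp(alpha)-weighted regularizers, and a penalty that
    discourages small coefficients. The penalty zeta*(exp(-a_wd) + exp(-a_sd)) is one
    plausible choice consistent with the text, which gives the penalty only as an example."""
    return (loss
            + torch.exp(alpha_wd) * r_wd
            + torch.exp(alpha_sd) * r_sd
            + zeta * (torch.exp(-alpha_wd) + torch.exp(-alpha_sd)))

# alpha_wd and alpha_sd are passed to the optimizer together with the network weights
# (the paper updates them with Adam), so the coefficients grow as the regularizers shrink.
```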

Gradient descent: From (5), we have

∇_w C = ∇_w E(X; w) + e^{α_WD} ∇_w R_WD(w; θ_WD) + e^{α_SD} ∇_w R_SD(w; θ_SD),    (6)

where ∇_w E(X; w) is provided by the CNN back-propagation algorithm. It can be shown that

∇_{w_l(i,j)} R_WD(w; θ_WD) = (2/N_WD) G_l^T [ 1_{|G_l w_l(i,j) G_l^T| ≤ θ_WD} ⊙ (G_l w_l(i,j) G_l^T) ] G_l,    (7)
∇_{w_l(i,j)} R_SD(w; θ_SD) = (2/N_SD) 1_{|w_l(i,j)| ≤ θ_SD} ⊙ w_l(i,j).    (8)

The detailed proof of (7) can be found in Appendix A, while (8) is straightforward to show. Combining (6)–(8), we can perform gradient descent for the weights in w. We update α_WD and α_SD by gradient descent using ∂C/∂α_WD and ∂C/∂α_SD, respectively. Observe that ∂C/∂α_WD vanishes only when e^{α_WD} R_WD(w; θ_WD) is balanced against the penalty on small coefficients, which implies that as the regularizer R_WD decreases, the regularization coefficient e^{α_WD} gets larger. A larger regularization coefficient further encourages the spatial-domain weights to move towards directions where the regularized Winograd-domain weights converge to zero in the following updates. In this way, we gradually sparsify the Winograd-domain filters. Similarly, the spatial-domain filters are sparsified owing to the increasing e^{α_SD} and the decreasing R_SD.
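
The closed form (7) (with the 0/1 mask held constant) can be sanity-checked against automatic differentiation; a small PyTorch check for a single 3×3 filter with the F(2×2, 3×3) matrix G follows (names and the toy threshold are ours):

```python
import torch

G = torch.tensor([[1.0,  0.0, 0.0],
                  [0.5,  0.5, 0.5],
                  [0.5, -0.5, 0.5],
                  [0.0,  0.0, 1.0]], dtype=torch.float64)

w = torch.randn(3, 3, dtype=torch.float64, requires_grad=True)
theta, n_wd = 0.5, 16                          # toy threshold; 16 Winograd-domain weights here

Ww = G @ w @ G.T                               # Winograd-domain filter
mask = (Ww.abs() <= theta).double().detach()   # constant 0/1 mask
reg = (mask * Ww).pow(2).sum() / n_wd          # single-filter version of (2)
reg.backward()

closed_form = (2.0 / n_wd) * (G.T @ (mask * Ww) @ G).detach()   # right-hand side of (7)
assert torch.allclose(w.grad, closed_form)
```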

Figure 2: Weight histogram snapshots of the AlexNet second convolutional layer in the Winograd domain and in the spatial domain at 0, 100k, 120k, and 200k training iterations.

Evolution of weight histogram: In Figure 2, we present how the weight histogram (distribution) of the AlexNet second convolutional layer evolves in the Winograd domain and in the spatial domain as training proceeds under the proposed partial L2 regularizers with learnable regularization coefficients. Observe that a part of the weights converges to zero in both domains. Finally, we have a sharp peak at zero in each domain, which can be pruned at little accuracy loss.

Figure 3: Convolutional filter samples, shown in the Winograd domain and in the spatial domain, that are sparse in both domains, obtained from the AlexNet second convolutional layer.

Examples of pruned filters: In Figure 3, we present convolutional filter samples that are sparse both in the Winograd domain and in the spatial domain, obtained by our regularization method for different sparsity levels. The AlexNet second convolutional layer consists of 5×5 filters, and we assume the use of the Winograd convolution algorithm of Section 2.1.

3.3 Universal CNN pruning and compression

We compress the jointly sparse CNN model from Section 3.2 by universal compression in the spatial domain for universal deployment. Universal compression consists of the following three steps.

Universal quantization and pruning: First, we randomize the spatial-domain weights by adding uniform random dithers, and quantize the dithered weights uniformly with interval Δ by

q_i = Δ · round((w_i + u_i)/Δ),    (9)

where w_i, 1 ≤ i ≤ N_SD, are the individual spatial-domain weights of all layers, and the u_i are independent and identically distributed uniform random dithers with support [−Δ/2, Δ/2]. The weights rounded to zero in (9) are pruned and fixed to zero for the rest of the fine-tuning and compression steps. The random dithering values, or their random seed, are assumed to be known at deployment, and the dithers are cancelled for the unpruned weights after decompression by

w̄_i = q_i − u_i if q_i ≠ 0, and w̄_i = 0 otherwise,    (10)

where w̄_i is the final deployed value of weight w_i for inference.
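
A NumPy sketch of the quantization step (9) and the dither cancellation (10) is given below, assuming the standard dither support of [−Δ/2, Δ/2]; the function names are ours.

```python
import numpy as np

def universal_quantize(w, delta, seed=0):
    """Randomized uniform quantization of (9): add a uniform dither, then round to the
    nearest multiple of delta. Weights that are rounded to zero are pruned."""
    rng = np.random.default_rng(seed)                   # the seed is shared with the decoder
    u = rng.uniform(-delta / 2, delta / 2, size=w.shape)
    q = delta * np.round((w + u) / delta)
    return q, u

def cancel_dither(q, u):
    """Decompression step of (10): cancel the dither for the unpruned weights only."""
    return np.where(q != 0, q - u, 0.0)

w = 0.05 * np.random.randn(1000)
q, u = universal_quantize(w, delta=0.02)
w_hat = cancel_dither(q, u)
print("pruned fraction:", np.mean(q == 0))
print("max error (unpruned):", np.abs(w - w_hat)[q != 0].max())   # at most delta / 2
```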

Fine-tuning the uniform codebook: Second, we fine-tune the uniform codebook to compensate for the accuracy loss after quantization. The average gradient is computed over the unpruned weights that are quantized to the same value in (9). Then, their shared quantized value in the codebook is updated by gradient descent using this average gradient, which is given by

c_n(t) = c_n(t−1) − η (1/|I_n|) Σ_{i ∈ I_n} ∂C/∂w_i,    (11)

where t is the iteration index, η is the learning rate, and I_n is the index set of all weights that are quantized to the same value c_n in (9) for some non-zero integer n. After the codebook is updated, the individual weights are updated to follow their shared quantized value in the codebook, i.e., w_i = c_n(t) for all i ∈ I_n and n ≠ 0. We emphasize here that the weights pruned in (9) are not fine-tuned and stay zero. We do not include the spatial-domain regularizer in (11), since this step follows the joint sparsity optimization, as shown in Figure 1: we have already determined which spatial-domain weights to prune in (9) and fixed them to zero. However, to maintain the sparsity in the Winograd domain while optimizing the quantization codebook in the spatial domain, we keep the Winograd-domain regularizer, i.e., we use C = E(X; w) + e^{α_WD} R_WD(w; θ_WD) as the cost in (11) instead of (5).
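
A sketch of the codebook fine-tuning step of (11): the gradients of all unpruned weights sharing a quantization index are averaged, their shared codebook value is updated, and the member weights then follow the updated entry. Here grad stands for ∂C/∂w from back-propagation (with the Winograd-domain regularizer retained), and all names are ours.

```python
import numpy as np

def finetune_codebook(w, q_index, grad, codebook, lr):
    """One gradient step of (11) on the shared quantized values.
    w:        current spatial-domain weights (already quantized), NumPy array
    q_index:  integer quantization index per weight (0 means pruned)
    grad:     dC/dw from back-propagation, with the Winograd-domain regularizer kept
    codebook: dict mapping index -> shared quantized value
    """
    for n, c in codebook.items():
        if n == 0:
            continue                                     # pruned weights stay zero
        members = (q_index == n)
        if members.any():
            codebook[n] = c - lr * grad[members].mean()  # average gradient of the cluster
            w[members] = codebook[n]                     # members follow the shared value
    return w, codebook
```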

Universal lossless source coding: Finally, universal lossless source coding follows for compression. It is assumed that the encoder and the decoder share the information on the random dithers, or that the dithering information is made known to both of them through the compression protocol, e.g., by sending the random seed. The indexes in the codebook of the universally quantized weights are passed as an input stream to a universal entropy coding scheme such as Lempel–Ziv–Welch ziv1977universal (); ziv1978compression (); welch1984technique (), gzip gailly2003gzip (), or bzip2 seward1998bzip2 (), which uses the Burrows–Wheeler transform effros2002universal (), and a compressed stream is produced. We also need to deploy the codebook that contains the indexes and the corresponding fine-tuned shared quantized values for decompression.
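
As a concrete illustration of this step, the stream of quantization indexes can be serialized and compressed with bzip2 directly from Python (a minimal sketch with a toy index stream, not the authors' pipeline; the codebook of shared values is stored alongside):

```python
import bz2
import numpy as np

q_index = np.random.randint(-8, 9, size=100_000).astype(np.int8)  # toy stream of indexes
compressed = bz2.compress(q_index.tobytes(), 9)                   # Burrows-Wheeler based coder
decoded = np.frombuffer(bz2.decompress(compressed), dtype=np.int8)
assert np.array_equal(decoded, q_index)
print(len(compressed), "bytes for", q_index.size, "indexes")
```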

Deployment: At deployment, the compressed stream is decompressed, and random dithers are cancelled to get unpruned spatial-domain weights as in (10). Then, the CNN can be deployed in the spatial domain with the desired sparsity. If we deploy the CNN in the Winograd domain, its convolutional filters are transformed into the Winograd domain, and pruned to the desired sparsity level.

4 Experiments

ResNet-18 for ImageNet classification: We evaluate our universal CNN pruning and compression scheme on the ResNet-18 model of he2015deep () for the ImageNet ILSVRC 2012 dataset russakovsky2015imagenet (). As in liu2018efficient (), we modify the original ResNet-18 model by replacing its strided convolutional layers with stride-1 convolutional layers and max-pooling layers, in order to utilize Winograd convolution for all possible convolutional layers. One difference from liu2018efficient () is that we place max-pooling after convolution (Conv+Maxpool) instead of placing it before convolution (Maxpool+Conv). Our modification provides better accuracy (see Figure 4), although it comes with more computation.

Regularization (sparsity ) (1) Spatial-domain convolution (2) Winograd convolution
Pruning Top-1 / Top-5 accuracy # MACs per image Pruning Top-1 / Top-5 accuracy # MACs per image
Pre-trained model - 68.2 / 88.6 2347.1M - 68.2 / 88.6 1174.0M
SD (80%) SD (80%) 67.8 / 88.4 837.9M WD (80%) 56.9 / 80.7 467.0M
WD (80%) SD (80%) 44.0 / 70.5 819.7M WD (80%) 68.4 / 88.6 562.9M
WD+SD (80%) SD (80%) 67.8 / 88.5 914.9M WD (80%) 67.8 / 88.5 522.6M
Table 1: Accuracy and complexity of our pruned ResNet-18 models.

The Winograd-domain regularizer is applied to all convolutional filters for which Winograd convolution can be used. We assume the use of Winograd convolution for the 3×3 filters (see Section 2.1). The spatial-domain regularizer is applied to all convolutional and fully-connected layers, not only for pruning but also for the later compression in the spatial domain. We train with the Adam optimizer kingma2014adam (). The hyper-parameter ζ in (5) is fixed, and α_WD and α_SD start from small initial values and are also updated with the Adam optimizer.

Table 1 summarizes the average pruning ratio, the accuracy, and the number of MACs to process one 224×224 input image for the pruned ResNet-18 models. The number of MACs for Winograd convolution is counted following (lavin2016fast, Section 5). We compare three models obtained by spatial-domain regularization only (SD), Winograd-domain regularization only (WD), and both Winograd-domain and spatial-domain regularization (WD+SD). The accuracy is evaluated using (1) spatial-domain convolution and (2) Winograd convolution for the convolutional layers of 3×3 filters; for the Winograd-convolution evaluation we used https://github.com/IntelLabs/SkimCaffe li2017enabling (). In case of (2), the filters are transformed into the Winograd domain and pruned to the desired ratio.

As expected, the proposed regularization method produces the desired sparsity only in the regularized domain; if we prune weights in the other domain, we suffer considerable accuracy loss. Using both the Winograd- and spatial-domain regularizers, we can produce one model that is sparse and accurate in both domains. As shown in Table 1, the number of MACs is reduced when using sparse convolution in either the spatial or the Winograd domain, at an accuracy loss of less than 0.5%.

Figure 4: Performance comparison of the pruned ResNet-18 models using sparse Winograd-domain convolution for different sparsity levels in the Winograd domain.

We compare the accuracy of our pruned ResNet-18 models to the ones from liu2018efficient () in Figure 4. Observe that our models outperform the ones from liu2018efficient () in the Winograd domain. We emphasize that the major advantage of our scheme is that it produces one model that can use either sparse spatial-domain convolution or sparse Winograd convolution. In contrast, the models from liu2018efficient () are constrained to utilize their special Winograd-ReLU layers, even though they can additionally exploit the dynamic sparsity of ReLU activations in the Winograd domain, as explained in Section 2.2.

Regularization (sparsity ) Quantization (cell size ) Compression ratio (1) Spatial-domain convolution (2) Winograd convolution
Top-1 / Top-5 accuracy # MACs per image Top-1 / Top-5 accuracy # MACs per image
Pre-trained model - 68.2 / 88.6 2347.1M 68.2 / 88.6 1174.0M
WD+SD (80%) UQ (0.005) 24.2 67.4 / 88.2 888.6M 67.4 / 88.2 516.4M
UQ (0.010) 28.9 66.9 / 87.9 859.0M 66.9 / 87.9 516.5M
UQ (0.020) 38.4 63.7 / 86.0 749.6M 63.7 / 86.0 495.1M
DUQ (0.005) 23.8 67.5 / 88.2 886.5M 67.4 / 88.3 516.3M
DUQ (0.010) 28.7 66.8 / 87.8 848.1M 66.6 / 87.7 512.9M
DUQ (0.020) 38.6 60.0 / 83.5 708.8M 60.0 / 83.5 502.5M
Table 2: Summary of compression results for the ResNet-18 model.

Table 2 shows our universal CNN quantization and compression results for the ResNet-18 model. We take the model obtained by the WD+SD (80%) regularization in Table 1 and compress its weights as described in Section 3.3. We compare uniform quantization (UQ) and dithered uniform quantization (DUQ). We use bzip2 seward1998bzip2 () for universal source coding. The results show that we can achieve compression ratios above 20 at an accuracy loss of less than 1% in both cases (1) and (2). Uniform quantization is known to be asymptotically optimal in the high-resolution regime as Δ → 0 gish1968asymptotically (), but it loses its optimality as Δ increases. Thus, DUQ can outperform UQ, in particular as Δ increases, as shown in our results.

AlexNet for ImageNet classification: We perform similar pruning and compression experiments on the AlexNet model of krizhevsky2012imagenet (). The AlexNet model has a huge number of weights in its fully-connected (FC) layers (58.6M out of 61M in total), and thus we first prune roughly 90% of the spatial-domain weights, mostly from the FC layers, by incremental pruning as suggested in han2015learning ().

We then re-train the pruned AlexNet model, similar to the ResNet-18 case above. In particular, we apply the proposed regularizers only to the second to fifth convolutional layers (Conv2–Conv5), whose filter sizes are small (5×5 and 3×3). We assume the use of Winograd convolution for both the 3×3 and the 5×5 filters (see Section 2.1). The first convolutional layer (Conv1) is excluded since its 11×11 filters are not small enough for Winograd convolution.

Regularization (sparsity ) Quantization (cell size ) Compression ratio Sparsity (%) Top-1 / Top-5 accuracy (%) # MACs per image
Conv1 Conv2 Conv3 Conv4 Conv5  FC1  FC2  FC3
(1) Pre-trained model - 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 56.8 / 80.0 724.4M
WD+SD (70%) UQ (0.010) 47.5 17.2 68.3 81.9 76.1 72.7 93.5 92.2 80.4 56.0 / 79.5 237.1M
DUQ (0.010) 47.7 18.3 66.8 81.7 75.8 72.4 93.7 92.3 80.6 56.1 / 79.3 240.0M
Han et al. han2015deep () 35.0 16.0 62.0 65.0 63.0 63.0 91.0 91.0 75.0 57.2 / 80.3 301.1M
(2) Pre-trained model - 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 56.8 / 80.0 330.0M
WD+SD (70%) UQ (0.010) 47.5 17.2 43.9 72.0 63.7 62.4 93.5 92.2 80.4 56.0 / 79.5 144.2M
DUQ (0.010) 47.7 18.3 47.4 71.7 63.3 62.0 93.7 92.3 80.6 56.0 / 79.3 142.6M
Li et al. li2017enabling () N/A 0.0 90.6 95.8 94.3 93.9 0.0 0.0 0.0 57.3 / N/A 319.8M
(1) Spatial-domain convolution and (2) Winograd convolution are used respectively for Conv2–Conv5 in accuracy evaluation.

Table 3: Summary of compression results for the AlexNet model.

Regularization (sparsity ) Quantization (cell size ) Compression ratio (1) Spatial-domain convolution (2) Winograd convolution
PSNR (dB) SSIM # MACs per image PSNR (dB) SSIM # MACs per image
Pre-trained model - 29.70 0.8301 233.2G 29.70 0.8301 56.7G
WD+SD (90%) UQ (0.020) 35.4 29.32 0.8225 17.4G 29.32 0.8225 9.9G
DUQ (0.020) 34.8 29.30 0.8222 18.0G 29.31 0.8222 10.0G
Table 4: Summary of compression results for the 9-layer CT-SRCNN model of up-scaling factor .

In Table 3, we provide the compression ratio, the sparsity, the accuracy, and the number of MACs to process one input image for the compressed AlexNet models. We compare our results to han2015deep () in the spatial domain and to li2017enabling () in the Winograd domain. We note that han2015deep () and li2017enabling () produce sparse models in only one domain. Compared to li2017enabling (), our method yields less pruning for the Conv2–Conv5 layers in the Winograd domain, but we also prune the Conv1 and FC layers heavily in the spatial domain. Observe that the number of MACs is reduced both when using sparse spatial-domain convolution and when using sparse Winograd convolution, at an accuracy loss of less than 1%. The results also show that we achieve a compression ratio of more than 40.

CT-SRCNN for image super resolution: We finally evaluate the proposed scheme on the cascade-trained SRCNN (CT-SRCNN) model with 9 convolutional layers ren2018ctsrcnn (). We apply the Winograd-domain regularizer to the filters of the second to the last layers and use Winograd convolution for them (see Section 2.1); the filters of the first layer are excluded. The spatial-domain regularizer is applied to all 9 layers.

The average peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) on the Set14 dataset zeyde2010single () are compared in Table 4 for the compressed CT-SRCNN models. We also summarize in Table 4 the compression ratio and the number of MACs for super resolution to produce one high-definition output image. Observe that we achieve a compression ratio of about 35 at a PSNR loss of approximately 0.4 dB. The number of MACs is reduced substantially both when using sparse spatial-domain convolution and when using sparse Winograd convolution.

5 Conclusion

We introduced a framework for hardware- and software-platform-independent pruning and compression of CNNs. The proposed scheme produces one compressed model whose convolutional filters are sparse in both the Winograd and spatial domains. Thus, one compressed model can be deployed on any platform, and the sparsity of its convolutional filters can be utilized for complexity reduction in either domain, unlike previous approaches that yield sparse models in only one domain. We showed by experiments that the proposed scheme successfully prunes and compresses ResNet-18, AlexNet, and the 9-layer CT-SRCNN while substantially reducing their computational complexity. Finally, our regularization method for joint sparsity can be extended to sparse frequency-domain convolution, which remains future work. It will also be interesting to compare our partial L2 norm to the L1 norm or the k-support norm argyriou2012sparse () for sparsity regularization.

Appendix A Appendix

In this appendix, we show the proof of (7) of our paper.

Proof.

We have (e.g., see (golub2012matrix, Section 1.3.7))

vec(G_l w_l(i,j) G_l^T) = (G_l ⊗ G_l) vec(w_l(i,j)),    (12)

where vec(X) is the column-vectorization of a matrix X and ⊗ denotes the Kronecker product of two matrices. Thus, it follows that

‖ 1_{|G_l w_l(i,j) G_l^T| ≤ θ_WD} ⊙ (G_l w_l(i,j) G_l^T) ‖² = ‖ 1_{|(G_l ⊗ G_l) vec(w_l(i,j))| ≤ θ_WD} ⊙ ((G_l ⊗ G_l) vec(w_l(i,j))) ‖².    (13)

For any matrix A and column vectors x and θ, it is straightforward to show that

∇_x ‖ 1_{|Ax| ≤ θ} ⊙ (Ax) ‖² = 2 A^T diag(1_{|Ax| ≤ θ}) A x,

where diag(1_{|Ax| ≤ θ}) is the diagonal matrix whose diagonal elements are from 1_{|Ax| ≤ θ}, and then it follows that

∇_{vec(w_l(i,j))} ‖ 1_{|(G_l ⊗ G_l) vec(w_l(i,j))| ≤ θ_WD} ⊙ ((G_l ⊗ G_l) vec(w_l(i,j))) ‖² = 2 (G_l ⊗ G_l)^T diag(1_{|(G_l ⊗ G_l) vec(w_l(i,j))| ≤ θ_WD}) (G_l ⊗ G_l) vec(w_l(i,j)).    (14)

Combining (12)–(14) leads us to

∇_{w_l(i,j)} ‖ 1_{|G_l w_l(i,j) G_l^T| ≤ θ_WD} ⊙ (G_l w_l(i,j) G_l^T) ‖² = 2 G_l^T [ 1_{|G_l w_l(i,j) G_l^T| ≤ θ_WD} ⊙ (G_l w_l(i,j) G_l^T) ] G_l,    (15)

which results in (7) of our paper after summing over all layers and channels and normalizing by N_WD. We note that the gradient is actually not defined at the discontinuous points where any element of G_l w_l(i,j) G_l^T is exactly equal to the threshold θ_WD in magnitude, which however can be ignored in stochastic gradient descent. ∎
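
The identity (12) is the standard vec–Kronecker identity vec(AXB) = (B^T ⊗ A) vec(X) applied with A = G_l and B = G_l^T; it can be checked numerically, e.g., with the F(2×2, 3×3) matrix G (a small NumPy sketch):

```python
import numpy as np

G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
w = np.random.randn(3, 3)

vec = lambda M: M.flatten(order='F')      # column-vectorization
lhs = vec(G @ w @ G.T)
rhs = np.kron(G, G) @ vec(w)              # (G kron G) vec(w)
assert np.allclose(lhs, rhs)
```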

References

  • (1) Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • (2) V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
  • (3) Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “Model compression and acceleration for deep neural networks: The principles, progress, and challenges,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 126–136, 2018.
  • (4) S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in International Conference on Learning Representations, 2016.
  • (5) Y. Choi, M. El-Khamy, and J. Lee, “Towards the limit of network quantization,” in International Conference on Learning Representations, 2017.
  • (6) K. Ullrich, E. Meeds, and M. Welling, “Soft weight-sharing for neural network compression,” in International Conference on Learning Representations, 2017.
  • (7) E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, “Soft-to-hard vector quantization for end-to-end learning compressible representations,” in Advances in Neural Information Processing Systems, 2017, pp. 1141–1151.
  • (8) C. Louizos, K. Ullrich, and M. Welling, “Bayesian compression for deep learning,” in Advances in Neural Information Processing Systems, 2017, pp. 3290–3300.
  • (9) Y. Choi, M. El-Khamy, and J. Lee, “Universal deep neural network compression,” arXiv preprint arXiv:1802.02271, 2018.
  • (10) M. Mathieu, M. Henaff, and Y. LeCun, “Fast training of convolutional networks through FFTs,” arXiv preprint arXiv:1312.5851, 2013.
  • (11) N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun, “Fast convolutional nets with fbfft: A GPU performance evaluation,” arXiv preprint arXiv:1412.7580, 2014.
  • (12) A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4013–4021.
  • (13) S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
  • (14) V. Lebedev and V. Lempitsky, “Fast convnets using group-wise brain damage,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2554–2564.
  • (15) W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 2074–2082.
  • (16) Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient DNNs,” in Advances In Neural Information Processing Systems, 2016, pp. 1379–1387.
  • (17) J. Park, S. Li, W. Wen, P. T. P. Tang, H. Li, Y. Chen, and P. Dubey, “Faster CNNs with direct sparse convolutions and guided pruning,” International Conference on Learning Representations, 2017.
  • (18) J. Lin, Y. Rao, J. Lu, and J. Zhou, “Runtime neural pruning,” in Advances in Neural Information Processing Systems, 2017, pp. 2178–2188.
  • (19) S. Li, J. Park, and P. T. P. Tang, “Enabling sparse Winograd convolution by native pruning,” arXiv preprint arXiv:1702.08597, 2017.
  • (20) X. Liu, J. Pool, S. Han, and W. J. Dally, “Efficient sparse-Winograd convolutional neural networks,” in International Conference on Learning Representations, 2018.
  • (21) S. Winograd, Arithmetic Complexity of Computations.   SIAM, 1980, vol. 33.
  • (22) R. E. Blahut, Fast Algorithms for Signal Processing.   Cambridge University Press, 2010.
  • (23) J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337–343, 1977.
  • (24) ——, “Compression of individual sequences via variable-rate coding,” IEEE Transactions on Information Theory, vol. 24, no. 5, pp. 530–536, 1978.
  • (25) T. A. Welch, “A technique for high-performance data compression,” Computer, vol. 17, no. 6, pp. 8–19, 1984.
  • (26) M. Effros, K. Visweswariah, S. R. Kulkarni, and S. Verdú, “Universal lossless source coding with the Burrows Wheeler transform,” IEEE Transactions on Information Theory, vol. 48, no. 5, pp. 1061–1081, 2002.
  • (27) R. Zamir and M. Feder, “On universal quantization by randomized uniform/lattice quantizers,” IEEE Transactions on Information Theory, vol. 38, no. 2, pp. 428–436, 1992.
  • (28) J.-L. Gailly and M. Adler, “gzip,” 2003. [Online]. Available: www.gzip.org
  • (29) J. Seward, “bzip2,” 1998. [Online]. Available: www.bzip.org
  • (30) K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
  • (31) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • (32) D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • (33) H. Gish and J. Pierce, “Asymptotically efficient quantizing,” IEEE Transactions on Information Theory, vol. 14, no. 5, pp. 676–683, 1968.
  • (34) A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
  • (35) H. Ren, M. El-Khamy, and J. Lee, “CT-SRCNN: Cascade trained and trimmed deep convolutional neural networks for image super resolution,” in IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.
  • (36) R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in International Conference on Curves and Surfaces.   Springer, 2010, pp. 711–730.
  • (37) A. Argyriou, R. Foygel, and N. Srebro, “Sparse prediction with the k-support norm,” in Advances in Neural Information Processing Systems, 2012, pp. 1457–1465.
  • (38) G. H. Golub and C. F. Van Loan, Matrix Computations.   JHU Press, 2013.