Quantized Convolutional Neural Networks for Mobile Devices
Abstract
Recently, convolutional neural networks (CNNs) have demonstrated impressive performance in various computer vision tasks. However, high-performance hardware is typically indispensable for applying CNN models, due to their high computational complexity, which prohibits their wider deployment. In this paper, we propose an efficient framework, namely Quantized CNN, to simultaneously speed up the computation and reduce the storage and memory overhead of CNN models. Both filter kernels in convolutional layers and weighting matrices in fully-connected layers are quantized, with the aim of minimizing the estimation error of each layer's response. Extensive experiments on the ILSVRC-12 benchmark demonstrate 4-6x speed-up and 15-20x compression with merely one percentage point loss of classification accuracy. With our quantized CNN model, even mobile devices can accurately classify images within one second.
1 Introduction
In recent years, we have witnessed the great success of convolutional neural networks (CNNs) [19] in a wide range of visual applications, including image classification [16, 27], object detection [10, 9], age estimation [24, 23], etc. This success mainly comes from deeper network architectures as well as tremendous amounts of training data. However, as networks grow deeper, the model complexity increases rapidly in both the training and testing stages, which leads to a very high demand on computational ability. For instance, the 8-layer AlexNet [16] involves 60M parameters and requires over 729M FLOPs (the number of FLoating-point OPerations) to classify a single image. Although the training stage can be carried out offline on high-performance clusters with GPU acceleration, the test-phase computation cost may be unaffordable for common personal computers and mobile devices. Due to their limited computational ability and memory space, mobile devices are in practice unable to run deep convolutional networks. Therefore, it is crucial to accelerate the computation and compress the memory consumption of CNN models.
For most CNNs, the convolutional layers are the most time-consuming part, while the fully-connected layers involve massive numbers of network parameters. Due to the intrinsic difference between them, existing works usually focus on improving the efficiency of either the convolutional layers or the fully-connected layers. In [7, 13, 32, 31, 18, 17], low-rank approximation or tensor decomposition is adopted to speed up convolutional layers. On the other hand, parameter compression in fully-connected layers is explored in [3, 7, 11, 30, 2, 12, 28]. Overall, the above-mentioned algorithms are able to achieve faster computation or smaller storage. However, few of them can achieve significant acceleration and compression simultaneously for the whole network.
In this paper, we propose a unified framework for convolutional networks, namely Quantized CNN (Q-CNN), to simultaneously accelerate and compress CNN models with only minor performance degradation. With the network parameters quantized, the responses of both convolutional and fully-connected layers can be efficiently estimated via approximate inner product computation. We minimize the estimation error of each layer's response during parameter quantization, which better preserves the model's performance. In order to suppress the accumulative error when quantizing multiple layers, an effective training scheme is introduced that takes the previous layers' estimation error into consideration. Our Q-CNN model enables fast test-phase computation, and its storage and memory consumption are also significantly reduced.
We evaluate our Q-CNN framework for image classification on two benchmarks, MNIST [20] and ILSVRC-12 [26]. For MNIST, our Q-CNN approach achieves over 12x compression for two neural networks (without convolution), with lower accuracy loss than several baseline methods. For ILSVRC-12, we attempt to improve the test-phase efficiency of four convolutional networks: AlexNet [16], CaffeNet [15], CNN-S [1], and VGG-16 [27]. Generally, Q-CNN achieves 4x acceleration and 15x compression (sometimes higher) for each network, with less than a 1% drop in top-5 classification accuracy. Moreover, we implement the quantized CNN model on mobile devices and dramatically improve the test-phase efficiency, as depicted in Figure 1. The main contributions of this paper can be summarized as follows:

We propose a unified Q-CNN framework to accelerate and compress convolutional networks. We demonstrate that better quantization can be learned by minimizing the estimation error of each layer's response.

We propose an effective training scheme to suppress the accumulative error while quantizing the whole convolutional network.

Our Q-CNN framework achieves 4-6x speed-up and 15-20x compression, while the classification accuracy loss is within one percentage point. Moreover, the quantized CNN model can be deployed on mobile devices, classifying an image within one second.
2 Preliminary
During the test phase of convolutional networks, the computation overhead is dominated by the convolutional layers; meanwhile, the majority of network parameters are stored in the fully-connected layers. Therefore, for better test-phase efficiency, it is critical to speed up the convolution computation and compress the parameters of fully-connected layers.
Our observation is that the forward-passing process of both convolutional and fully-connected layers is dominated by the computation of inner products. More formally, consider a convolutional layer with input feature maps $S \in \mathbb{R}^{d_s \times d_s \times C_s}$ and response feature maps $T \in \mathbb{R}^{d_t \times d_t \times C_t}$, where $d_s, d_t$ are the spatial sizes and $C_s, C_t$ are the numbers of feature map channels. The response at the 2-D spatial position $p_t$ in the $c_t$-th response feature map is computed as:

    T_{p_t}(c_t) = \sum_{(p_k, p_s)} \langle W_{c_t, p_k}, S_{p_s} \rangle    (1)

where $W_{c_t} \in \mathbb{R}^{d_k \times d_k \times C_s}$ is the $c_t$-th convolutional kernel and $d_k$ is the kernel size. We use $p_s$ and $p_k$ to denote the 2-D spatial positions in the input feature maps and convolutional kernels, and both $W_{c_t, p_k}$ and $S_{p_s}$ are $C_s$-dimensional vectors. The layer response is the sum of inner products at all positions within the $d_k \times d_k$ receptive field in the input feature maps.
Similarly, for a fully-connected layer, we have:

    T(c_t) = \langle W_{c_t}, S \rangle    (2)

where $S \in \mathbb{R}^{C_s}$ and $T \in \mathbb{R}^{C_t}$ are the layer input and layer response, respectively, and $W_{c_t} \in \mathbb{R}^{C_s}$ is the weighting vector for the $c_t$-th neuron of this layer.
Product quantization [14] is widely used in approximate nearest neighbor search, demonstrating better performance than hashing-based methods [21, 22]. The idea is to decompose the feature space into the Cartesian product of multiple subspaces, and then learn a sub-codebook for each subspace. A vector is then represented by the concatenation of sub-codewords for efficient distance computation and storage.
In this paper, we leverage product quantization to implement efficient inner product computation. Consider the inner product between two vectors $x, y \in \mathbb{R}^{D}$. First, both $x$ and $y$ are split into $M$ subvectors, denoted as $x^{(m)}$ and $y^{(m)}$. Then each $x^{(m)}$ is quantized with a sub-codeword $c^{(m)}_{k_m}$ from the $m$-th sub-codebook, so we have

    \langle x, y \rangle = \sum_{m=1}^{M} \langle x^{(m)}, y^{(m)} \rangle \approx \sum_{m=1}^{M} \langle c^{(m)}_{k_m}, y^{(m)} \rangle    (3)

which transforms the inner product computation into $M$ addition operations, provided the inner products between each subvector $y^{(m)}$ and all the sub-codewords in the $m$-th sub-codebook have been computed in advance.
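To make the mechanics of Eq. (3) concrete, here is a minimal NumPy sketch of product-quantized inner products. The function names, the row-wise codebook layout, and the nearest-codeword assignment rule are illustrative conventions chosen for this sketch, not the authors' implementation:

```python
import numpy as np

def pq_assign(x, codebooks):
    """Split x into M subvectors and assign each to its nearest sub-codeword."""
    subvectors = np.split(x, len(codebooks))
    return [int(np.argmin(((cb - s) ** 2).sum(axis=1)))
            for cb, s in zip(codebooks, subvectors)]

def pq_inner_product(assignments, codebooks, y):
    """Approximate <x, y> as sum_m <c^(m)_{k_m}, y^(m)> (Eq. 3).

    Once the look-up tables are built, each inner product costs only
    M table look-ups and M - 1 additions."""
    sub_y = np.split(y, len(codebooks))
    # one table per subspace: inner products of all K sub-codewords with y^(m)
    tables = [cb @ s for cb, s in zip(codebooks, sub_y)]
    return sum(table[k] for table, k in zip(tables, assignments))
```

If `x` happens to coincide with one sub-codeword per subspace, the approximation is exact; in general, the error is controlled by the quality of the learned sub-codebooks.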
Quantization-based approaches have been explored in several works [11, 2, 12]. These approaches mostly focus on compressing parameters in fully-connected layers [11, 2], and none of them can provide acceleration for test-phase computation. Furthermore, [11, 12] require the network parameters to be reconstructed during the test phase, which limits the compression to disk storage rather than run-time memory consumption. In contrast, our approach offers simultaneous acceleration and compression for both convolutional and fully-connected layers, and can dramatically reduce the run-time memory consumption.
3 Quantized CNN
In this section, we present our approach for accelerating and compressing convolutional networks. Firstly, we introduce an efficient test-phase computation process with the network parameters quantized. Secondly, we demonstrate that better quantization can be learned by directly minimizing the estimation error of each layer's response. Finally, we analyze the computation complexity of our quantized CNN model.
3.1 Quantizing the Fully-connected Layer
For a fully-connected layer, we denote its weighting matrix as $W \in \mathbb{R}^{C_s \times C_t}$, where $C_s$ and $C_t$ are the dimensions of the layer input and response, respectively. The weighting vector $W_{c_t}$ is the $c_t$-th column vector in $W$.

We evenly split the $C_s$-dimensional space (where the $W_{c_t}$ lie) into $M$ subspaces, each of $C_s / M$ dimensions. Each $W_{c_t}$ is then decomposed into $M$ subvectors, denoted as $W^{(m)}_{c_t}$. A sub-codebook can be learned for each subspace after gathering all the subvectors within that subspace. Formally, for the $m$-th subspace, we optimize:
    \min_{D^{(m)}, B^{(m)}} \| D^{(m)} B^{(m)} - W^{(m)} \|_F^2    (4)

where $W^{(m)} \in \mathbb{R}^{(C_s / M) \times C_t}$ consists of the $m$-th subvectors of all weighting vectors. The sub-codebook $D^{(m)}$ contains $K$ sub-codewords, and each column in $B^{(m)}$ is an indicator vector (with only one non-zero entry), specifying which sub-codeword is used to quantize the corresponding subvector. The optimization can be solved via k-means clustering.
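A compact NumPy sketch of Eq. (4): each subspace's sub-codebook is learned by running k-means over the subvectors of all columns of the weighting matrix. The tiny `kmeans` helper and all names here are written for illustration and are not the paper's code:

```python
import numpy as np

def kmeans(X, K, iters=25, seed=0):
    """Bare-bones k-means on the rows of X -> (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for k in range(K):
            if np.any(assign == k):       # leave empty clusters unchanged
                centroids[k] = X[assign == k].mean(axis=0)
    return centroids, assign

def quantize_fc(W, M, K):
    """Eq. (4): per-subspace k-means on a C_s x C_t weighting matrix.

    Returns M sub-codebooks (each K x C_s/M, codewords as rows) and an
    M x C_t index matrix playing the role of the indicator matrices B^(m)."""
    C_s, C_t = W.shape
    assert C_s % M == 0, "C_s must split evenly into M subspaces"
    codebooks, indices = [], []
    for W_m in np.split(W, M, axis=0):    # W_m: (C_s/M) x C_t
        D_m, B_m = kmeans(W_m.T, K)       # rows = m-th subvectors of all columns
        codebooks.append(D_m)
        indices.append(B_m)
    return codebooks, np.array(indices)
```

Reconstructing column $c_t$ simply concatenates `codebooks[m][indices[m, c_t]]` over the $M$ subspaces.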
The layer response is then approximately computed as:

    T(c_t) = \sum_{m=1}^{M} \langle W^{(m)}_{c_t}, S^{(m)} \rangle \approx \sum_{m=1}^{M} \langle D^{(m)}_{k_m(c_t)}, S^{(m)} \rangle    (5)

where $D^{(m)}_{k_m(c_t)}$ is the $k_m(c_t)$-th sub-codeword (column) in $D^{(m)}$, and $S^{(m)}$ is the $m$-th subvector of the layer input. Here $k_m(c_t)$ is the index of the sub-codeword used to quantize the subvector $W^{(m)}_{c_t}$.
In Figure 2, we depict the parameter quantization and test-phase computation process of the fully-connected layer. By decomposing the weighting matrix into $M$ sub-matrices, $M$ sub-codebooks can be learned, one per subspace. During the test phase, the layer input is split into $M$ subvectors, denoted as $S^{(m)}$. For each subspace, we compute the inner products between $S^{(m)}$ and every sub-codeword in $D^{(m)}$, and store the results in a look-up table. Afterwards, only $M$ addition operations are required to compute each response, so the overall time complexity can be reduced from $O(C_s C_t)$ to $O(C_s K + C_t M)$. On the other hand, only the sub-codebooks and quantization indices need to be stored, which dramatically reduces the storage consumption.
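The two-stage test-phase procedure for a quantized fully-connected layer can be sketched as follows (notation mirrors Eq. (5); the row-wise codebook layout is an assumption of this sketch):

```python
import numpy as np

def fc_forward_quantized(s, codebooks, indices):
    """Approximate T(c_t) = sum_m <D^(m)_{k_m(c_t)}, s^(m)> via look-up tables.

    s: layer input of length C_s; codebooks: M arrays of shape (K, C_s/M);
    indices: (M, C_t) integer matrix of sub-codeword assignments."""
    sub_s = np.split(s, len(codebooks))
    # Stage 1: M*K inner products of length C_s/M -> O(C_s * K) multiplications.
    tables = [cb @ s_m for cb, s_m in zip(codebooks, sub_s)]
    # Stage 2: each response needs only M look-ups and additions -> O(C_t * M).
    return sum(table[idx] for table, idx in zip(tables, indices))
```

A quick sanity check is to compare the output against an exact product with the weight matrix reconstructed from the codebooks and indices; the two agree exactly, since quantization error lives entirely in the reconstructed weights.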
3.2 Quantizing the Convolutional Layer
Unlike the 1-D weighting vector in the fully-connected layer, each convolutional kernel is a 3-dimensional tensor $W_{c_t} \in \mathbb{R}^{d_k \times d_k \times C_s}$. Before quantization, we need to determine how to split it into subvectors, i.e., along which dimension to apply the subspace splitting. During the test phase, the input feature maps are traversed by each convolutional kernel with a sliding window in the spatial domain. Since these sliding windows partially overlap, we split each convolutional kernel along the dimension of feature map channels, so that the pre-computed inner products can be re-used at multiple spatial locations. Specifically, we learn the quantization in each subspace by:

    \min_{D^{(m)}, \{B^{(m)}_{p_k}\}} \sum_{p_k} \| D^{(m)} B^{(m)}_{p_k} - W^{(m)}_{p_k} \|_F^2    (6)

where $W^{(m)}_{p_k} \in \mathbb{R}^{(C_s / M) \times C_t}$ contains the $m$-th subvectors of all convolutional kernels at position $p_k$. The optimization can also be solved by k-means clustering in each subspace.
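In NumPy, the channel-wise splitting of Eq. (6) might look as follows: one codebook per channel group, shared across all $C_t$ kernels and all $d_k \times d_k$ spatial positions. The `kmeans` helper is a plain implementation written for this sketch, and the kernel memory layout is an assumed convention:

```python
import numpy as np

def kmeans(X, K, iters=25, seed=0):
    """Bare-bones k-means on the rows of X -> (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for k in range(K):
            if np.any(assign == k):
                centroids[k] = X[assign == k].mean(axis=0)
    return centroids, assign

def quantize_conv(W, M, K):
    """Eq. (6): quantize kernels W (C_t, d_k, d_k, C_s) along the channel axis.

    The codebook of each subspace is shared over every spatial position p_k,
    so the inner products pre-computed on the input maps can be reused across
    overlapping sliding windows."""
    C_t, d_k, _, C_s = W.shape
    codebooks, indices = [], []
    for W_m in np.split(W, M, axis=3):      # (C_t, d_k, d_k, C_s/M)
        flat = W_m.reshape(-1, C_s // M)    # every subvector, every position
        D_m, assign = kmeans(flat, K)
        codebooks.append(D_m)
        indices.append(assign.reshape(C_t, d_k, d_k))
    return codebooks, indices
```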
With the convolutional kernels quantized, we approximately compute the response feature maps by:

    T_{p_t}(c_t) = \sum_{(p_k, p_s)} \sum_{m=1}^{M} \langle W^{(m)}_{c_t, p_k}, S^{(m)}_{p_s} \rangle \approx \sum_{(p_k, p_s)} \sum_{m=1}^{M} \langle D^{(m)}_{k_m(c_t, p_k)}, S^{(m)}_{p_s} \rangle    (7)

where $S^{(m)}_{p_s}$ is the $m$-th subvector at position $p_s$ in the input feature maps, and $k_m(c_t, p_k)$ is the index of the sub-codeword used to quantize the $m$-th subvector at position $p_k$ in the $c_t$-th convolutional kernel.
Similar to the fully-connected layer, we pre-compute look-up tables of inner products with the input feature maps. The response feature maps are then approximately computed with (7), and both the time and storage complexity can be greatly reduced.
3.3 Quantization with Error Correction
So far, we have presented an intuitive approach to quantize the parameters and improve the test-phase efficiency of convolutional networks. However, there are still two critical drawbacks. First, minimizing the quantization error of the model parameters does not necessarily yield the quantized network that is optimal for classification accuracy; minimizing the estimation error of each layer's response is more closely related to the network's classification performance. Second, the quantization of one layer is independent of the others, which may lead to error accumulation when quantizing multiple layers. The estimation error of the network's final response is likely to grow quickly, since the error introduced by earlier quantized layers also affects all following layers.

To overcome these two limitations, we introduce the idea of error correction into the quantization of network parameters. This improved quantization approach directly minimizes the estimation error of each layer's response, and can compensate for the error introduced by previous layers. With the error correction scheme, we can quantize the network with much less performance degradation than with the original quantization method.
3.3.1 Error Correction for the Fully-connected Layer
Suppose we have $N$ images with which to learn the quantization of a fully-connected layer, and the layer input and response for image $n$ are denoted as $S_n$ and $T_n$. In order to minimize the estimation error of the layer response, we optimize:

    \min_{\{D^{(m)}\}, \{B^{(m)}\}} \sum_{n} \| T_n - \sum_{m=1}^{M} (D^{(m)} B^{(m)})^{T} S^{(m)}_n \|_F^2    (8)

where the first term in the Frobenius norm is the desired layer response, and the second term is the approximate layer response computed via the quantized parameters.
A block coordinate descent approach can be applied to minimize this objective function. For the $m$-th subspace, the residual error is defined as:

    R^{(m)}_n = T_n - \sum_{m' \neq m} (D^{(m')} B^{(m')})^{T} S^{(m')}_n    (9)

and we then attempt to minimize the residual error within this subspace, that is:

    \min_{D^{(m)}, B^{(m)}} \sum_{n} \| R^{(m)}_n - (D^{(m)} B^{(m)})^{T} S^{(m)}_n \|_F^2    (10)

which can be solved by alternately updating the sub-codebook and the sub-codeword assignments.
Update $D^{(m)}$. We fix the sub-codeword assignment $B^{(m)}$, and let $L_k$ denote the set of response dimensions (columns of $B^{(m)}$) currently assigned to the $k$-th sub-codeword. The optimization in (10) can be reformulated as:

    \min_{D^{(m)}} \sum_{k} \sum_{n} \sum_{c_t \in L_k} \left( R^{(m)}_n(c_t) - \langle D^{(m)}_k, S^{(m)}_n \rangle \right)^2    (11)

which implies that the optimization over one sub-codeword does not affect the other sub-codewords. Hence, for each sub-codeword, we construct a least-squares problem from (11) to update it.
Update $B^{(m)}$. With the sub-codebook fixed, it is easy to see that the optimization of each column in $B^{(m)}$ is mutually independent. For the $c_t$-th column, the optimal sub-codeword assignment is given by:

    k^{*}_m(c_t) = \arg\min_{k} \sum_{n} \left( R^{(m)}_n(c_t) - \langle D^{(m)}_k, S^{(m)}_n \rangle \right)^2    (12)
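The two update steps can be sketched as a small block coordinate descent loop. This is a toy re-implementation of Eqs. (9)-(12) under assumed conventions (sub-codewords stored as rows of `D[m]`, `B[m]` stored as an index vector, random initialization rather than the paper's), not the authors' code:

```python
import numpy as np

def quantize_fc_ec(S, T, M, K, iters=10, seed=0):
    """Fit the layer *response* T (N x C_t) from input S (N x C_s), Eqs. (8)-(12)."""
    rng = np.random.default_rng(seed)
    C_t = T.shape[1]
    S_sub = np.split(S, M, axis=1)                     # each (N, C_s/M)
    D = [rng.standard_normal((K, S_sub[0].shape[1])) for _ in range(M)]
    B = [rng.integers(0, K, size=C_t) for _ in range(M)]
    for _ in range(iters):
        for m in range(M):
            P = [S_sub[j] @ D[j].T for j in range(M)]  # (N, K) look-up tables
            # residual this subspace has to explain (Eq. 9)
            R = T - sum(P[j][:, B[j]] for j in range(M) if j != m)
            # update assignments B^(m): exhaustive argmin per column (Eq. 12)
            err = ((R[:, :, None] - P[m][:, None, :]) ** 2).sum(axis=0)
            B[m] = err.argmin(axis=1)
            # update D^(m): one least-squares problem per sub-codeword (Eq. 11)
            for k in range(K):
                cols = np.flatnonzero(B[m] == k)
                if cols.size:
                    D[m][k] = np.linalg.lstsq(S_sub[m], R[:, cols].mean(axis=1),
                                              rcond=None)[0]
    return D, B

def approx_response(S, D, B):
    """Approximate responses: sum_m of the looked-up table columns."""
    S_sub = np.split(S, len(D), axis=1)
    return sum((S_sub[m] @ D[m].T)[:, B[m]] for m in range(len(D)))
```

Each step solves its subproblem exactly, so the objective in Eq. (8) is non-increasing over iterations.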
3.3.2 Error Correction for the Convolutional Layer
We adopt a similar idea to minimize the estimation error of the convolutional layer's response feature maps, that is:

    \min_{\{D^{(m)}\}, \{B^{(m)}_{p_k}\}} \sum_{n} \sum_{p_t} \| T_{n, p_t} - \sum_{(p_k, p_s)} \sum_{m=1}^{M} (D^{(m)} B^{(m)}_{p_k})^{T} S^{(m)}_{n, p_s} \|_F^2    (13)
This optimization can also be solved by block coordinate descent. More details on solving it can be found in the supplementary material.
3.3.3 Error Correction for Multiple Layers
The above quantization method can be applied sequentially to each layer in the CNN model. One concern is that the estimation error of the layer response caused by previous layers will be accumulated and will affect the quantization of the following layers. Here, we propose an effective training scheme to address this issue.
We consider the quantization of a specific layer, assuming its previous layers have already been quantized. The parameter quantization is optimized on the layer inputs and responses of a group of training images. To quantize this layer, we take the layer input in the quantized network as $S_n$, and the layer response in the original (unquantized) network as $T_n$ in Eqs. (8) and (13). In this way, the optimization is guided by the actual input in the quantized network and the desired response in the original network, so the accumulative error introduced by the previous layers is explicitly taken into consideration. Consequently, this training scheme effectively suppresses the accumulative error when quantizing multiple layers.
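The scheme can be summarized in a few lines of Python: each layer's quantizer sees the input produced by the already-quantized prefix, and the target response produced by the original network. This is a generic sketch; `quantize_layer` stands in for the error-corrected quantization described above, and the ReLU non-linearity and linear-layer structure are simplifying assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def quantize_network(weights, X, quantize_layer):
    """Layer-by-layer quantization with cross-layer error correction.

    weights: list of C_s x C_t matrices of the original network;
    X: calibration inputs (N x C_s); quantize_layer(S, T) returns a function
    approximating s -> s @ W, fitted so that it maps S close to T."""
    s_orig, s_hat, approx_layers = X, X, []
    for W in weights:
        t_target = s_orig @ W                # desired response: original network
        f = quantize_layer(s_hat, t_target)  # fitted on the *quantized* input
        approx_layers.append(f)
        s_orig = relu(t_target)
        s_hat = relu(f(s_hat))
    return approx_layers
```

With an idealized "quantizer" that fits each layer exactly by least squares, the quantized network reproduces the original network's outputs, which is a convenient correctness check for the training-scheme plumbing.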
Another possible solution is to adopt back-propagation to jointly update the sub-codebooks and sub-codeword assignments in all quantized layers. However, since the sub-codeword assignments are discrete, gradient-based optimization can be quite difficult, if not entirely impossible. Therefore, back-propagation is not adopted here, but it could be a promising extension for future work.
3.4 Computation Complexity
Now we analyze the test-phase computation complexity of convolutional and fully-connected layers, with and without parameter quantization. For our proposed Q-CNN model, the forward pass through each layer mainly consists of two procedures: pre-computation of inner products, and approximate computation of the layer response. Both the sub-codebooks and the sub-codeword assignments are stored for the test-phase computation. We report the detailed comparison of computation and storage overhead in Table 1.
           Layer   Model   Complexity
    FLOPs  Conv.   CNN     d_t^2 d_k^2 C_s C_t
                   Q-CNN   d_s^2 C_s K + d_t^2 d_k^2 C_t M
           FCnt.   CNN     C_s C_t
                   Q-CNN   C_s K + C_t M
    Bytes  Conv.   CNN     4 d_k^2 C_s C_t
                   Q-CNN   4 C_s K + d_k^2 C_t M log2(K) / 8
           FCnt.   CNN     4 C_s C_t
                   Q-CNN   4 C_s K + C_t M log2(K) / 8
As we can see from Table 1, the reduction in computation and storage overhead largely depends on two hyper-parameters: $M$ (the number of subspaces) and $K$ (the number of sub-codewords in each subspace). Larger values of $M$ and $K$ lead to more fine-grained quantization, but are less efficient in computation and storage. In practice, we can vary these two parameters to balance the trade-off between the test-phase efficiency and the accuracy loss of the quantized CNN model.
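The fully-connected entries of Table 1 can be turned into a small cost model. The formulas below are our reading of the description above (4-byte floats for weights and codebooks, log2(K)-bit indices; `M` is the number of subspaces), so treat them as a sketch rather than the paper's exact accounting:

```python
import math

def fc_cost(C_s, C_t, M, K):
    """FLOPs and storage (bytes) of a fully-connected layer, plain vs. quantized."""
    flops = {
        "cnn": C_s * C_t,            # one dense matrix-vector product
        "qcnn": C_s * K + C_t * M,   # table pre-computation + look-up additions
    }
    bytes_ = {
        "cnn": 4 * C_s * C_t,                              # float32 weights
        "qcnn": 4 * C_s * K + C_t * M * math.log2(K) / 8,  # codebooks + indices
    }
    return flops, bytes_
```

Plugging in CaffeNet's first fully-connected layer (C_s = 9216, C_t = 4096) with subvector dimension 3 (so M = 3072) and K = 32 gives a storage compression of about 16.7x, which matches the "3/32" row of Table 6 and serves as a consistency check for this cost model.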
4 Related Work
There have been several attempts to accelerate the test-phase computation of convolutional networks, many of them inspired by low-rank decomposition. Denton et al. [7] presented a series of low-rank decomposition designs for convolutional kernels. Similarly, CP-decomposition was adopted in [17] to transform a convolutional layer into multiple layers of lower complexity. Zhang et al. [32, 31] considered the subsequent non-linear units while learning the low-rank decomposition. [18] applied group-wise pruning to the convolutional tensor to decompose it into multiplications of thinned dense matrices. Recently, fixed-point based approaches have been explored in [5, 25]: by representing the connection weights (or even the network activations) with fixed-point numbers, the computation can greatly benefit from hardware acceleration.
Another parallel research trend is to compress the parameters of fully-connected layers. Ciresan et al. [3] randomly remove connections to reduce network parameters. Matrix factorization was adopted in [6, 7] to decompose the weighting matrix into two low-rank matrices, demonstrating that significant redundancy exists in network parameters. Hinton et al. [8] proposed using dark knowledge (the response of a well-trained network) to guide the training of a much smaller network, which was superior to directly training it. By exploiting the similarity among neurons, Srinivas et al. [28] proposed a systematic way to remove redundant neurons instead of network connections. In [30], multiple fully-connected layers were replaced by a single "Fastfood" layer, which can be trained end-to-end with the convolutional layers. Chen et al. [2] randomly grouped connection weights into hash buckets, and then fine-tuned the network with back-propagation. [12] combined pruning, quantization, and Huffman coding to achieve a higher compression rate. Gong et al. [11] adopted vector quantization to compress the weighting matrix, which is in fact a special case of our approach (Q-CNN without error correction, applied to fully-connected layers only).
5 Experiments
In this section, we evaluate our quantized CNN framework on two image classification benchmarks, MNIST [20] and ILSVRC-12 [26]. For the acceleration of convolutional layers, we compare with:

CPD [17]: CP-Decomposition;

GBD [18]: Group-wise Brain Damage;

LANR [31]: Low-rank Approximation of Non-linear Responses.
and for the compression of fullyconnected layers, we compare with the following approaches:

RER [3]: Random Edge Removal;

LRD [6]: Low-Rank Decomposition;

DK [8]: Dark Knowledge;

HashNet [2]: Hashed Neural Nets;

DPP [28]: Data-free Parameter Pruning;

SVD [7]: Singular Value Decomposition;

DFC [30]: Deep Fried Convnets.
For all the above baselines, we use their reported results under the same settings for fair comparison. We report the theoretical speed-up for more consistent results, since the realistic speed-up may be affected by various factors, e.g., CPU, cache, and RAM. We compare the theoretical and realistic speed-up in Section 5.4, and discuss the effect of adopting the BLAS library for acceleration.
Our approaches are denoted as "Q-CNN" and "Q-CNN (EC)", where the latter adopts error correction while the former does not. We implement the parameter quantization in MATLAB, and fine-tune the resulting network with Caffe [15]. Additional results of our approach can be found in the supplementary material.
5.1 Results on MNIST
The MNIST dataset contains 70k images of handwritten digits: 60k for training and 10k for testing. To evaluate compression performance, we pre-train two neural networks, one 3-layer and one 5-layer, where each hidden layer contains 1000 units. Different compression techniques are then adopted to compress these two networks, with results depicted in Table 2.
    Method          3-layer             5-layer
                    Compr.    Error     Compr.    Error
    Original        -         1.35%     -         1.12%
    RER [3]         8x        2.19%     8x        1.24%
    LRD [6]         8x        1.89%     8x        1.77%
    DK [8]          8x        1.71%     8x        1.26%
    HashNets [2]    8x        1.43%     8x        1.22%
    Q-CNN           12.1x     1.42%     13.4x     1.34%
    Q-CNN (EC)      12.1x     1.39%     13.4x     1.19%
In our Q-CNN framework, the trade-off between accuracy and efficiency is controlled by $M$ (the number of subspaces) and $K$ (the number of sub-codewords in each subspace). Since $M$ is determined once the dimension of each subvector is chosen, we tune these two hyper-parameters to adjust the quantization precision. In Table 2, we fix the hyper-parameters to a single setting for both networks.
From Table 2, we observe that our Q-CNN (EC) approach offers higher compression rates with less performance degradation than all baselines on both networks. The error correction scheme is effective in reducing the accuracy loss, especially for the deeper (5-layer) network. We also find the performance of both Q-CNN and Q-CNN (EC) to be quite stable: the standard deviation over five random runs is merely 0.05%. Therefore, we report single-run performance in the remaining experiments.
5.2 Results on ILSVRC-12
The ILSVRC-12 benchmark consists of over one million training images drawn from 1000 categories, and a disjoint validation set of 50k images. We report both the top-1 and top-5 classification error rates on the validation set, using single-view testing (central patch only).
We demonstrate our approach on four convolutional networks: AlexNet [16], CaffeNet [15], CNN-S [1], and VGG-16 [27]. The first two models have been adopted in several related works and are therefore included for comparison. CNN-S and VGG-16 adopt wider or deeper structures for better classification accuracy, and are included here to demonstrate the scalability of our approach. We compare these networks' computation and storage overhead in Table 3, together with their classification error rates on ILSVRC-12.
    Model      FLOPs      Bytes      Top-1 Err.   Top-5 Err.
    AlexNet    7.29e+8    2.44e+8    42.78%       19.74%
    CaffeNet   7.27e+8    2.44e+8    42.53%       19.59%
    CNN-S      2.94e+9    4.12e+8    37.31%       15.82%
    VGG-16     1.55e+10   5.53e+8    28.89%       10.05%
5.2.1 Quantizing the Convolutional Layer
To begin with, we quantize the second convolutional layer of AlexNet, which is the most time-consuming layer during the test phase. In Table 4, we report the increase in the top-1/top-5 error rates under several settings, comparing with two baseline methods, CPD [17] and GBD [18].
    Method       Para.    Speed-up   Top-1 Err.          Top-5 Err.
                                     No FT      FT       No FT      FT
    CPD [17]     -        3.19x      -          -        0.94%      0.44%
                 -        4.52x      -          -        3.20%      1.22%
                 -        6.51x      -          -        69.06%     18.63%
    GBD [18]     -        3.33x      12.43%     0.11%    -          -
                 -        5.00x      21.93%     0.43%    -          -
                 -        10.00x     48.33%     1.13%    -          -
    Q-CNN        4/64     3.70x      10.55%     1.63%    8.97%      1.37%
                 6/64     5.36x      15.93%     2.90%    14.71%     2.27%
                 6/128    4.84x      10.62%     1.57%    9.10%      1.28%
                 8/128    6.06x      18.84%     2.91%    18.05%     2.66%
    Q-CNN (EC)   4/64     3.70x      0.35%      0.20%    0.27%      0.17%
                 6/64     5.36x      0.64%      0.39%    0.50%      0.40%
                 6/128    4.84x      0.27%      0.11%    0.34%      0.21%
                 8/128    6.06x      0.55%      0.33%    0.50%      0.31%
From Table 4, we see that at large speed-up rates (over 4x), the performance loss of both CPD and GBD becomes severe, especially before fine-tuning. The naive parameter quantization method suffers from a similar problem. By incorporating error correction, our Q-CNN model achieves up to 6x speed-up with merely a 0.6% drop in accuracy, even without fine-tuning. The accuracy loss can be further reduced by fine-tuning the subsequent layers. Hence, it is more effective to minimize the estimation error of each layer's response than to minimize the quantization error of the network parameters.
    Model     Method       Para.   Speed-up   Compression   Top-1 Err.        Top-5 Err.
                                                            No FT     FT      No FT     FT
    AlexNet   Q-CNN (EC)   4/64    3.32x      10.58x        1.33%     -       0.94%     -
                           6/64    4.32x      14.32x        2.32%     -       1.90%     -
                           6/128   3.71x      10.27x        1.44%     0.13%   1.16%     0.36%
                           8/128   4.27x      12.08x        2.25%     0.99%   1.64%     0.60%
    VGG-16    LANR [31]    -       4.00x      2.73x         -         -       0.95%     0.35%
              Q-CNN (EC)   6/128   4.06x      14.40x        3.04%     1.06%   1.83%     0.45%
Next, we take one step further and attempt to speed up all the convolutional layers in AlexNet with Q-CNN (EC), fixing the quantization hyper-parameters across all layers. From Table 5, we observe that the loss in accuracy grows more mildly than in the single-layer case. The speed-up rates reported here are consistently smaller than those in Table 4, since the acceleration effect is less significant for some layers (e.g., "conv4" and "conv5"). For AlexNet, our Q-CNN model (8/128) accelerates the computation of all the convolutional layers by a factor of 4.27x, while the increases in the top-1 and top-5 error rates are no more than 2.5%. After fine-tuning the remaining fully-connected layers, the performance loss can be further reduced to less than 1%.
In Table 5, we also report the comparison against LANR [31] on VGG-16. At a similar speed-up rate (about 4x), their approach outperforms ours in the top-5 classification error (an increase of 0.95% versus our 1.83%). After fine-tuning, the gap narrows to 0.35% versus 0.45%. At the same time, our approach offers over 14x compression of the parameters in convolutional layers, much larger than their roughly 2.7x compression (the compression effect of their approach was not explicitly discussed in the paper; we estimate it from their description). Therefore, our approach is effective in accelerating and compressing networks with many convolutional layers, with only minor performance loss.
5.2.2 Quantizing the Fully-connected Layer
For demonstration, we first compress the parameters of a single fully-connected layer. In CaffeNet, the first fully-connected layer possesses over 37 million parameters (9216 x 4096), more than 60% of the whole network's parameters. Our Q-CNN approach is adopted to quantize this layer, with results reported in Table 6. The performance loss of our Q-CNN model is negligible (within 0.4%), much smaller than that of the baseline methods (DPP and SVD). Furthermore, error correction is effective in preserving the classification accuracy, especially at higher compression rates.
    Method       Para.   Compression   Top-1 Err.   Top-5 Err.
    DPP [28]     -       1.19x         0.16%        -
                 -       1.47x         1.76%        -
                 -       1.91x         4.08%        -
                 -       2.75x         9.68%        -
    SVD [7]      -       1.38x         0.03%        0.03%
                 -       2.77x         0.07%        0.07%
                 -       5.54x         0.36%        0.19%
                 -       11.08x        1.23%        0.86%
    Q-CNN        2/16    15.06x        0.19%        0.19%
                 3/16    21.94x        0.35%        0.28%
                 3/32    16.70x        0.18%        0.12%
                 4/32    21.33x        0.28%        0.16%
    Q-CNN (EC)   2/16    15.06x        0.10%        0.07%
                 3/16    21.94x        0.18%        0.03%
                 3/32    16.70x        0.14%        0.11%
                 4/32    21.33x        0.16%        0.12%
Now we evaluate our approach's performance for compressing all the fully-connected layers of CaffeNet in Table 7. The third layer is in effect a combination of 1000 classifiers, and is more critical to the classification accuracy. Hence, we adopt a much more fine-grained hyper-parameter setting for this layer. Although the speed-up effect no longer exists there, we can still achieve around 8x compression for this last layer.
    Method       Para.   Compression   Top-1 Err.   Top-5 Err.
    SVD [7]      -       1.26x         0.14%        -
                 -       2.52x         1.22%        -
    DFC [30]     -       1.79x         0.66%        -
                 -       3.58x         0.31%        -
    Q-CNN        2/16    13.96x        0.28%        0.29%
                 3/16    19.14x        0.70%        0.47%
                 3/32    15.25x        0.44%        0.34%
                 4/32    18.71x        0.75%        0.59%
    Q-CNN (EC)   2/16    13.96x        0.31%        0.30%
                 3/16    19.14x        0.59%        0.47%
                 3/32    15.25x        0.31%        0.27%
                 4/32    18.71x        0.57%        0.39%
From Table 7, we see that with less than a 1% drop in accuracy, Q-CNN achieves high compression rates (13-19x), much larger than those of SVD (at most 2.52x) and DFC (at most 3.58x). (In Table 6, SVD means replacing the weighting matrix with the product of two low-rank matrices; in Table 7, SVD means fine-tuning the network after the low-rank matrix decomposition.) Again, Q-CNN with error correction consistently outperforms the naive Q-CNN approach as adopted in [11].
5.2.3 Quantizing the Whole Network
So far, we have evaluated the performance of CNN models with either the convolutional or the fully-connected layers quantized. Now we demonstrate the quantization of the whole network with a three-stage strategy. Firstly, we quantize all convolutional layers with error correction, while the fully-connected layers remain untouched. Secondly, we fine-tune the fully-connected layers of the quantized network on the ILSVRC-12 training set to restore the classification accuracy. Finally, the fully-connected layers of the fine-tuned network are quantized with error correction. We report the performance of our Q-CNN models in Table 8.
    Model      Para.            Speed-up   Compression   Top-1 / Top-5 Err.
               Conv.    FCnt.
    AlexNet    8/128    3/32    4.05x      15.40x        1.38% / 0.84%
               8/128    4/32    4.15x      18.76x        1.46% / 0.97%
    CaffeNet   8/128    3/32    4.04x      15.40x        1.43% / 0.99%
               8/128    4/32    4.14x      18.76x        1.54% / 1.12%
    CNN-S      8/128    3/32    5.69x      16.32x        1.48% / 0.81%
               8/128    4/32    5.78x      20.16x        1.64% / 0.85%
    VGG-16     6/128    3/32    4.05x      16.55x        1.22% / 0.53%
               6/128    4/32    4.06x      20.34x        1.35% / 0.58%
For the convolutional layers, we use the 8/128 setting for AlexNet, CaffeNet, and CNN-S, and the 6/128 setting for VGG-16 (see the "Para." column of Table 8), to ensure roughly 4x speed-up for each network. We then vary the hyper-parameter settings of the fully-connected layers for different compression levels. For the former two networks, we achieve up to 18.76x compression with about 1% loss in top-5 classification accuracy. For CNN-S, we achieve 5.78x speed-up and 20.16x compression, while the drop in top-5 classification accuracy is merely 0.85%. The result on VGG-16 is even more encouraging: with 4.06x speed-up and 20.34x compression, the increase in the top-5 error rate is only 0.58%. Hence, our proposed Q-CNN framework can improve the efficiency of convolutional networks with minor performance loss, which is acceptable in many applications.
5.3 Results on Mobile Devices
We have developed an Android application that performs CNN-based image classification on mobile devices, based on our Q-CNN framework. The experiments are carried out on a Huawei Mate 7 smartphone, equipped with a 1.8GHz Kirin 925 CPU. The test-phase computation runs on a single CPU core, without GPU acceleration.
    Model     Method   Time     Storage     Memory      Top-5 Err.
    AlexNet   CNN      2.93s    232.56MB    264.74MB    19.74%
              Q-CNN    0.95s    12.60MB     74.65MB     20.70%
    CNN-S     CNN      10.58s   392.57MB    468.90MB    15.82%
              Q-CNN    2.61s    20.13MB     129.49MB    16.68%
In Table 9, we compare the computational efficiency and classification accuracy of the original and quantized CNN models. Our Q-CNN framework achieves about 3x speed-up for AlexNet, and about 4x speed-up for CNN-S. Moreover, we compress the storage consumption by nearly 20x, and the required run-time memory is only about one quarter of that of the original model. At the same time, the loss in top-5 classification accuracy is no more than 1%. Therefore, our proposed approach improves the run-time efficiency in multiple respects, making the deployment of CNN models on mobile platforms tractable.
5.4 Theoretical vs. Realistic Speedup
In Table 10, we compare the theoretical and realistic speed-up on AlexNet. The BLAS [29] library is used in Caffe [15] to accelerate the matrix multiplication in convolutional and fully-connected layers; however, it may not always be an option on mobile devices. Therefore, we measure the run-time speed under two settings, i.e., with BLAS enabled or disabled. The realistic speed-up is slightly lower with BLAS on, indicating that Q-CNN does not benefit from BLAS as much as CNN does. Other optimization techniques, e.g., SIMD, SSE, and AVX [4], may further improve the realistic speed-up, and will be explored in the future.
BLAS | FLOPs (CNN) | FLOPs (Q-CNN) | Time (CNN) | Time (Q-CNN) | Theo. Speed-up | Real. Speed-up
-----|-------------|---------------|------------|--------------|----------------|---------------
Off  | 7.29e+8     | 1.75e+8       | 321.10ms   | 75.62ms      | 4.15           | 4.25
On   | 7.29e+8     | 1.75e+8       | 167.79ms*  | 55.35ms      | 4.15           | 3.03

*This is Caffe's runtime speed. The code for the other three settings is available at https://github.com/jiaxiang-wu/quantized-cnn.
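As a sanity check, the speed-up columns in Table 10 follow directly from the raw numbers (values copied from the table; the quoted FLOPs are rounded, so the theoretical ratio comes out near, rather than exactly at, the reported 4.15):

```python
# Raw numbers from Table 10 (AlexNet; times in milliseconds).
flops_cnn, flops_qcnn = 7.29e8, 1.75e8
time_cnn_off, time_qcnn_off = 321.10, 75.62   # BLAS disabled
time_cnn_on, time_qcnn_on = 167.79, 55.35     # BLAS enabled

theo = flops_cnn / flops_qcnn            # FLOPs ratio (theoretical speed-up)
real_off = time_cnn_off / time_qcnn_off  # measured speed-up without BLAS
real_on = time_cnn_on / time_qcnn_on     # measured speed-up with BLAS

print(f"theoretical ~{theo:.2f}, realistic {real_off:.2f} (BLAS off), {real_on:.2f} (BLAS on)")
```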
6 Conclusion
In this paper, we propose a unified framework to simultaneously accelerate and compress convolutional neural networks. We quantize network parameters to enable efficient test-phase computation. Extensive experiments are conducted on MNIST and ILSVRC-12, and our approach achieves outstanding speed-up and compression rates with only negligible loss in classification accuracy.
7 Acknowledgement
This work was supported in part by the National Natural Science Foundation of China (Grant No. 61332016) and the 863 Program (Grant No. 2014AA015105).
References
 [1] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference (BMVC), 2014.
 [2] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In International Conference on Machine Learning (ICML), pages 2285–2294, 2015.
 [3] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. High-performance neural networks for visual object classification. CoRR, abs/1102.0183, 2011.
 [4] Intel Corporation. Intel architecture instruction set extensions programming reference. Technical report, Intel Corporation, Feb 2016.
 [5] M. Courbariaux, Y. Bengio, and J. David. Training deep neural networks with low precision multiplications. In International Conference on Learning Representations (ICLR), 2015.
 [6] M. Denil, B. Shakibi, L. Dinh, M. A. Ranzato, and N. de Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems (NIPS), pages 2148–2156, 2013.
 [7] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems (NIPS), pages 1269–1277, 2014.
 [8] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
 [9] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
 [10] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.
 [11] Y. Gong, L. Liu, M. Yang, and L. D. Bourdev. Compressing deep convolutional networks using vector quantization. CoRR, abs/1412.6115, 2014.
 [12] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2015.
 [13] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In British Machine Vision Conference (BMVC), 2014.
 [14] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(1):117–128, Jan 2011.
 [15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. CoRR, abs/1408.5093, 2014.
 [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1106–1114, 2012.
 [17] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In International Conference on Learning Representations (ICLR), 2015.
 [18] V. Lebedev and V. S. Lempitsky. Fast convnets using group-wise brain damage. CoRR, abs/1506.02515, 2015.
 [19] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
 [20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [21] C. Leng, J. Wu, J. Cheng, X. Bai, and H. Lu. Online sketching hashing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2503–2511, 2015.
 [22] C. Leng, J. Wu, J. Cheng, X. Zhang, and H. Lu. Hashing for distributed data. In International Conference on Machine Learning (ICML), pages 1642–1650, 2015.
 [23] G. Levi and T. Hassner. Age and gender classification using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 34–42, 2015.
 [24] C. Li, Q. Liu, J. Liu, and H. Lu. Learning ordinal discriminative features for age estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2570–2577, 2012.
 [25] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. CoRR, abs/1603.05279, 2016.
 [26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), pages 1–42, 2015.
 [27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
 [28] S. Srinivas and R. V. Babu. Data-free parameter pruning for deep neural networks. In British Machine Vision Conference (BMVC), pages 31.1–31.12, 2015.
 [29] R. C. Whaley and A. Petitet. Minimizing development and maintenance costs in supporting persistently optimized BLAS. Software: Practice and Experience, 35(2):101–121, Feb 2005.
 [30] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. J. Smola, L. Song, and Z. Wang. Deep fried convnets. CoRR, abs/1412.7149, 2014.
 [31] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. CoRR, abs/1505.06798, 2015.
 [32] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun. Efficient and accurate approximations of nonlinear convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1984–1992, 2015.
Appendix A: Additional Results
In the submission, we report the performance after quantizing all the convolutional layers in AlexNet, and after quantizing all the fully-connected layers in CaffeNet. Here, we present experimental results for some other settings.
Quantizing Convolutional Layers in CaffeNet
We quantize all the convolutional layers in CaffeNet; the results are reported in Table 11. Furthermore, we fine-tune the quantized CNN model learned with error correction, and the increases of the top-1/5 error rates are 1.15% and 0.75%, compared to the original CaffeNet.
Method     | Para. | Speed-up | Top-1 Err. Increase | Top-5 Err. Increase
-----------|-------|----------|---------------------|--------------------
Q-CNN      | 4/64  | 3.32×    | 18.69%              | 16.73%
Q-CNN      | 6/64  | 4.32×    | 32.84%              | 33.55%
Q-CNN      | 6/128 | 3.71×    | 20.08%              | 18.31%
Q-CNN      | 8/128 | 4.27×    | 35.48%              | 37.82%
Q-CNN (EC) | 4/64  | 3.32×    | 1.22%               | 0.97%
Q-CNN (EC) | 6/64  | 4.32×    | 2.44%               | 1.83%
Q-CNN (EC) | 6/128 | 3.71×    | 1.57%               | 1.12%
Q-CNN (EC) | 8/128 | 4.27×    | 2.30%               | 1.71%
Quantizing Convolutional Layers in CNNS
We quantize all the convolutional layers in CNN-S; the results are reported in Table 12. Furthermore, we fine-tune the quantized CNN model learned with error correction, and the increases of the top-1/5 error rates are 1.24% and 0.63%, compared to the original CNN-S.
Method     | Para. | Speed-up | Top-1 Err. Increase | Top-5 Err. Increase
-----------|-------|----------|---------------------|--------------------
Q-CNN      | 4/64  | 3.69×    | 19.87%              | 16.77%
Q-CNN      | 6/64  | 5.17×    | 45.74%              | 48.67%
Q-CNN      | 6/128 | 4.78×    | 27.86%              | 25.09%
Q-CNN      | 8/128 | 5.92×    | 46.18%              | 50.26%
Q-CNN (EC) | 4/64  | 3.69×    | 1.60%               | 0.92%
Q-CNN (EC) | 6/64  | 5.17×    | 3.49%               | 2.32%
Q-CNN (EC) | 6/128 | 4.78×    | 2.07%               | 1.32%
Q-CNN (EC) | 8/128 | 5.92×    | 3.42%               | 2.17%
Quantizing Fullyconnected Layers in AlexNet
We quantize all the fully-connected layers in AlexNet; the results are reported in Table 13.
Method     | Para. | Compression | Top-1 Err. Increase | Top-5 Err. Increase
-----------|-------|-------------|---------------------|--------------------
Q-CNN      | 2/16  | 13.96×      | 0.25%               | 0.27%
Q-CNN      | 3/16  | 19.14×      | 0.77%               | 0.64%
Q-CNN      | 3/32  | 15.25×      | 0.54%               | 0.33%
Q-CNN      | 4/32  | 18.71×      | 0.71%               | 0.69%
Q-CNN (EC) | 2/16  | 13.96×      | 0.14%               | 0.20%
Q-CNN (EC) | 3/16  | 19.14×      | 0.40%               | 0.22%
Q-CNN (EC) | 3/32  | 15.25×      | 0.40%               | 0.21%
Q-CNN (EC) | 4/32  | 18.71×      | 0.46%               | 0.38%
Quantizing Fullyconnected Layers in CNNS
We quantize all the fully-connected layers in CNN-S; the results are reported in Table 14.
Method     | Para. | Compression | Top-1 Err. Increase | Top-5 Err. Increase
-----------|-------|-------------|---------------------|--------------------
Q-CNN      | 2/16  | 14.37×      | 0.22%               | 0.07%
Q-CNN      | 3/16  | 20.15×      | 0.45%               | 0.22%
Q-CNN      | 3/32  | 15.79×      | 0.21%               | 0.11%
Q-CNN      | 4/32  | 19.66×      | 0.35%               | 0.27%
Q-CNN (EC) | 2/16  | 14.37×      | 0.36%               | 0.14%
Q-CNN (EC) | 3/16  | 20.15×      | 0.43%               | 0.24%
Q-CNN (EC) | 3/32  | 15.79×      | 0.29%               | 0.11%
Q-CNN (EC) | 4/32  | 19.66×      | 0.56%               | 0.27%
Appendix B: Optimization in Section 3.3.2
Assume we have N images with which to learn the quantization of a convolutional layer. For image I_n, we denote its input feature maps as S_n ∈ R^{d_s × d_s × C_s} and its response feature maps as T_n ∈ R^{d_t × d_t × C_t}, where d_s, d_t are the spatial sizes and C_s, C_t are the numbers of feature map channels. We use p_s and p_t to denote spatial locations in the input and response feature maps, respectively. The spatial location in the convolutional kernels is denoted as p_k.
To learn quantization with error correction for the convolutional layer, we attempt to optimize:

    \min_{\{D^{(m)}\}, \{B^{(m)}_{p_k}\}} \sum_{n} \sum_{p_t} \Big\| T_n(p_t) - \sum_{m} \sum_{(p_k, p_s)} \big(D^{(m)} B^{(m)}_{p_k}\big)^{T} S^{(m)}_n(p_s) \Big\|_2^2    (14)

where D^{(m)} ∈ R^{(C_s/M) × K} is the m-th sub-codebook (the input channel dimension is split into M subspaces, each with K sub-codewords), B^{(m)}_{p_k} ∈ {0,1}^{K × C_t} is the corresponding sub-codeword assignment indicator for the convolutional kernels at spatial location p_k, S^{(m)}_n(p_s) is the m-th sub-vector of the input at location p_s, and the inner sum runs over all pairs (p_k, p_s) contributing to the response at p_t.
Similar to the fully-connected layer, we adopt a block coordinate descent approach to solve this optimization problem. For the m-th subspace, we first define its residual feature map as:

    \hat{T}^{(m)}_n(p_t) = T_n(p_t) - \sum_{m' \neq m} \sum_{(p_k, p_s)} \big(D^{(m')} B^{(m')}_{p_k}\big)^{T} S^{(m')}_n(p_s)    (15)

and then the optimization in the m-th subspace can be reformulated as:

    \min_{D^{(m)}, \{B^{(m)}_{p_k}\}} \sum_{n} \sum_{p_t} \Big\| \hat{T}^{(m)}_n(p_t) - \sum_{(p_k, p_s)} \big(D^{(m)} B^{(m)}_{p_k}\big)^{T} S^{(m)}_n(p_s) \Big\|_2^2    (16)

Update D^{(m)}. With the assignment indicators fixed, we let:

    \sum_{(p_k, p_s)} \big(D^{(m)} B^{(m)}_{p_k}\big)^{T} S^{(m)}_n(p_s) = \sum_{k} \sum_{(p_k, p_s)} \big(D^{(m)}(k)\, B^{(m)}_{p_k}(k, :)\big)^{T} S^{(m)}_n(p_s)    (17)

where D^{(m)}(k) denotes the k-th sub-codeword and B^{(m)}_{p_k}(k, :) its row of assignment indicators. We greedily update each sub-codeword in the m-th sub-codebook in a sequential style. For the k-th sub-codeword, we compute the corresponding residual feature map as:

    \hat{T}^{(m,k)}_n(p_t) = \hat{T}^{(m)}_n(p_t) - \sum_{k' \neq k} \sum_{(p_k, p_s)} \big(D^{(m)}(k')\, B^{(m)}_{p_k}(k', :)\big)^{T} S^{(m)}_n(p_s)    (18)

and then we can alternatively optimize:

    \min_{D^{(m)}(k)} \sum_{n} \sum_{p_t} \Big\| \hat{T}^{(m,k)}_n(p_t) - \sum_{(p_k, p_s)} \big(D^{(m)}(k)\, B^{(m)}_{p_k}(k, :)\big)^{T} S^{(m)}_n(p_s) \Big\|_2^2    (19)

which can be transformed into a least-squares problem. By solving it, we update the k-th sub-codeword.
Update B^{(m)}_{p_k}. We greedily update the sub-codeword assignment at each spatial location in the convolutional kernels in a sequential style. For the spatial location p_k, we compute the corresponding residual feature map as:

    \tilde{T}^{(m, p_k)}_n(p_t) = \hat{T}^{(m)}_n(p_t) - \sum_{p'_k \neq p_k} \sum_{p'_s} \big(D^{(m)} B^{(m)}_{p'_k}\big)^{T} S^{(m)}_n(p'_s)    (20)

and then the optimization can be rewritten as:

    \min_{B^{(m)}_{p_k}} \sum_{n} \sum_{p_t} \Big\| \tilde{T}^{(m, p_k)}_n(p_t) - \big(D^{(m)} B^{(m)}_{p_k}\big)^{T} S^{(m)}_n(p_s) \Big\|_2^2    (21)

Since each column of B^{(m)}_{p_k} is an indicator vector (only one non-zero entry), we can exhaustively try all K sub-codewords and select the one that minimizes the objective function.
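This exhaustive assignment search can be sketched as follows; the NumPy snippet below is a simplified view for a single kernel location and output channel (illustrative shapes and names, not the authors' code): compute the candidate response contribution of every sub-codeword, then keep the one minimizing the summed squared residual over the training samples.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes: N training samples, K sub-codewords, subspace dim d.
N, K, d = 32, 8, 16
D = rng.standard_normal((K, d))   # sub-codebook: K candidate sub-codewords
S = rng.standard_normal((N, d))   # input sub-vectors, one per sample
R = rng.standard_normal(N)        # residual responses at the output location

# Candidate contribution of each sub-codeword: <D_k, S_n> for all n, k.
cand = S @ D.T                    # shape (N, K)

# Exhaustive search: the optimal indicator assignment is the sub-codeword
# minimizing the summed squared residual across all samples.
errors = ((R[:, None] - cand) ** 2).sum(axis=0)   # shape (K,)
best = int(np.argmin(errors))
```

Because only one entry of the indicator is non-zero, this K-way scan is exact, and it is repeated for every subspace, kernel location, and output channel in turn during the block coordinate descent.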