UNIQ: Uniform Noise Injection for the Quantization of Neural Networks

Chaim Baskin    Eli Schwartz    Evgenii Zheltonozhskii    Natan Liss    Raja Giryes    Alex Bronstein    Avi Mendelson

We present a novel method for training deep neural networks amenable to inference in low-precision arithmetic with quantized weights and activations. The training is performed in full precision with random noise injection emulating quantization noise. In order to circumvent the need to simulate realistic quantization noise distributions, the weight and activation distributions are uniformized by a non-linear transformation, uniform noise is injected, and an inverse transformation is applied. This procedure emulates a non-uniform $k$-quantile quantizer at inference time, which is shown to achieve state-of-the-art results for training low-precision networks on the CIFAR-10 and ImageNet-1K datasets. In particular, we observe no degradation in accuracy for MobileNet and ResNet-18 on ImageNet with as low as 2-bit quantization of the activations, and minimal degradation with as little as 4 bits for the weights.

Keywords: Deep Learning, Quantization, Network Compression

Figure 1: Performance vs. complexity of different quantized neural networks. Performance is measured as top-1 accuracy on ImageNet; complexity is estimated in BOPs. Network architectures are denoted by different marker shapes; quantization methods are marked in different colors. The figure is best viewed in color.
Architecture   Method                                Bits     Accuracy

Less than ten GBOPs
MobileNet      UNIQ (Ours)                           2,4      51.01
MobileNet      UNIQ (Ours)                           2,2      49.27

Tens of GBOPs
ResNet-18      UNIQ (Ours)                           8,4      69.92
ResNet-18      UNIQ (Ours)                           8,2      69.63
MobileNet      UNIQ (Ours)                           8,4      68.30
MobileNet      UNIQ (Ours)                           8,2      68.00
ResNet-18      UNIQ (Ours)                           4,4      66.02
ResNet-18      UNIQ (Ours)                           4,2      65.32
MobileNet      UNIQ (Ours)                           4,4      64.40
MobileNet      UNIQ (Ours)                           4,2      64.10
AlexNet        XNOR (Rastegari et al., 2016)         1,32     60.10
ResNet-18      XNOR (Rastegari et al., 2016)         1,1      51.20
AlexNet        QNN (Hubara et al., 2016b)            1,2      51.03
AlexNet        XNOR (Rastegari et al., 2016)         1,1      44.20
AlexNet        QNN (Hubara et al., 2016b)            1,1      41.80

Hundreds of GBOPs
ResNet-18      Apprentice (Mishra & Marr, 2018)      4,8      70.40
ResNet-18      IQN (Zhou et al., 2017a)              5,32     68.89
ResNet-18      Apprentice (Mishra & Marr, 2018)      2,32     68.50
MobileNet      Baseline (full-precision)             32,32    68.20
ResNet-18      Apprentice (Mishra & Marr, 2018)      2,2      68.00
ResNet-18      Distillation (Polino et al., 2018)    4,32     64.20
AlexNet        WRPN (Mishra et al., 2018)            2,32     57.50
AlexNet        Deep Compression (Han et al., 2016)   8,32     57.21
AlexNet        WRPN (Mishra et al., 2018)            1,32     56.80
AlexNet        WRPN (Mishra et al., 2018)            4,32     55.34
VGG-16         Baseline (full-precision)             32,32    71.00
VGG-16         IQN (Zhou et al., 2017a)              5,32     70.82
ResNet-18      Baseline (full-precision)             32,32    69.60
AlexNet        Baseline (full-precision)             32,32    56.50
Table 1: Top-1 accuracy on ImageNet of networks quantized using various techniques. The number of bits is reported as (weights, activations). The rows are divided by the order of magnitude of complexity and internally sorted by decreasing accuracy.

1 Introduction

Deep learning has established itself as an important tool in the machine learning arsenal. Deep networks have shown spectacular success in a variety of tasks in a broad range of fields including computer vision, signal processing, computational imaging and image processing, and speech and language processing (Hinton et al., 2012; Chen et al., 2016; Lai et al., 2015).

A major drawback of deep learning models is their storage and computational cost. Typical deep networks comprise millions of parameters and require billions of multiply-accumulate (MAC) operations. In many cases, this cost renders them infeasible for running on mobile devices with limited resources. While some applications allow moving the inference to the cloud, such an architecture still incurs significant bandwidth and latency limitations.

A recent trend in research focuses on developing lighter deep models, both in their memory footprint and computational complexity. The main focus of most of these works, including the present one, is on alleviating the complexity at inference time rather than simplifying the training. While training a deep model requires even more resources and time, it is usually done offline with plentiful computational resources.

One way of reducing computational cost is quantization of weights and activations. Quantization of weights also reduces storage size and memory access. The bit widths of activations and weights affect the amount of required hardware logic linearly; reducing both widths by a factor of two reduces the amount of hardware by roughly a factor of four. The weights need to be stored in memory, and reducing their size allows fitting bigger networks into the device, which is especially critical for embedded devices and custom hardware. Activations, on the other hand, are passed between layers, and thus, if different layers are processed separately, reducing the activation size lowers the communication overhead. It should be noted that in typical architectures there are overwhelmingly more activations than weights.

Deep neural networks are usually trained and operate with both the weights and the activations represented in single-precision (32-bit) floating point. A straightforward uniform quantization of the pre-trained weights to shorter fixed-point representations has been shown to have a negligible effect on model accuracy (Gupta et al., 2015). In the majority of applications, further reduction of precision quickly degrades performance; hence, nontrivial techniques are required to carry it out.

Previous studies have investigated quantizing the network weights and, sometimes, the activations to as low as ternary or binary (1-bit) representations (Rastegari et al., 2016; Hubara et al., 2016b; Zhou et al., 2016; Mishra et al., 2018; Dong et al., 2017). Such extreme reduction of the range of possible parameter values can greatly affect accuracy. Recent works proposed to use a wider network, i.e., one with more filters per layer, to mitigate the accuracy loss (Zhu et al., 2016; Mishra et al., 2018; Polino et al., 2018). In some approaches, e.g., Zhu et al. (2016) and Zhou et al. (2017a), a learnable linear scaling layer is added after each quantized layer to improve expressiveness.

A general approach to learning a quantized model adopted in several recent papers (Hubara et al., 2016a, b; Zhou et al., 2016; Rastegari et al., 2016; Cai et al., 2017) is to perform the forward pass using the quantized values, while keeping another set of full-precision values for the backward pass updates. In this case, the backward pass remains almost everywhere differentiable, while the forward pass is quantized. In the aforementioned papers, a deterministic or stochastic function is used at training for weight and activation quantization. Another approach, introduced by Mishra & Marr (2018) and Polino et al. (2018), is based on a teacher-student setup for knowledge distillation from a full-precision (and larger) teacher model to a quantized student model. This method allows training highly accurate quantized models, but has the drawback that a larger model must be trained, e.g., ResNet-34 is used for training ResNet-18.

Most previous studies have used uniform quantization (i.e., all quantization bins have equal width), which is attractive due to its simplicity. However, unless the values are uniformly distributed, uniform quantization is not optimal in the $\ell_2$ sense, nor in any other reasonable metric. Unfortunately, neural network weights are not uniformly distributed and tend to assume bell-shaped distributions (Han et al., 2016).

Non-uniform quantization was utilized by Han et al. (2016), where the authors replaced the weight values with indexes pointing to a finite codebook of shared values. Another approach, adopted by Zhou et al. (2017a), learns the quantizer thresholds by iteratively grouping close values and re-training the weights. Lee et al. (2017) utilize logarithmic-scale quantization as an approximation of the $\ell_2$-optimal Lloyd quantizer (Lloyd, 1982). Cai et al. (2017) proposed to optimize the expectation of the quantization error; normally distributed weights and half-normally distributed activations were assumed, which enables using a pre-calculated $k$-means quantizer. In Zhou et al. (2017b), balanced bins are used, so that each bin contains the same number of samples. In some sense, this idea is the closest to our approach; yet, while Zhou et al. (2017b) force the values to have an approximately uniform distribution, we pose no such constraint. Also, since calculating percentiles is expensive in this setting, Zhou et al. (2017b) estimate them with means, while our method allows using the actual percentiles, as detailed in the sequel.

Contribution. This paper makes the following contributions. Firstly, we propose a $k$-quantile quantization method with balanced (equal probability mass) bins, which is particularly suitable for neural networks, where the outliers in the long tails of the bell curve are less important. We also show a simple and efficient way of reformulating this quantizer using a “uniformization trick”.

Secondly, we introduce a novel method for training a neural network that performs well with quantized values. This is achieved by adding noise at training time to emulate the quantization noise introduced at inference time. The uniformization trick makes the injection of uniform noise exact and alleviates the need to draw the noise from complicated, bin-dependent distributions. While we limit our attention to $k$-quantile quantization, the proposed scheme can work with any threshold configuration while still keeping the advantage of uniformly distributed noise in every bin.

Lastly, we report a major improvement over the state-of-the-art quantization techniques in the performance vs. complexity tradeoff. Unlike several leading methods, our approach can be applied directly to existing architectures without the need to modify them at training (as opposed, for example, to the teacher-student approaches that require training a bigger network, or the XNOR networks that typically increase the number of parameters by a significant factor in order to meet accuracy goals).

2 Method

In this section, we present our method for training a neural network amenable to operation in low-precision arithmetic. We start by outlining several common quantization schemes and discussing their suitability for deep neural networks. We then suggest a training procedure in which uniform random additive noise is injected into the weights and activations, simulating the quantization error. The scheme aims at improving the quantized network performance at inference time, when regular deterministic quantization is used.

2.1 Quantization

Let $x$ be a random variable drawn from some distribution described by the probability density function $f(x)$. Let $\{t_i\}_{i=0}^{k}$ be a set of thresholds partitioning the real line into $k$ disjoint intervals (bins) $B_i = (t_{i-1}, t_i]$, and let $\{q_i\}_{i=1}^{k}$ be a set of representation levels. A quantizer $Q$ is a function mapping each bin $B_i$ to the corresponding representation level $q_i$. We denote the quantization error by $e = Q(x) - x$. The effect of quantization can be modelled as the addition of random noise to $x$; the noise added to the $i$-th bin admits the conditional distribution $e \sim p(e \,|\, x \in B_i)$.

Most papers on neural network quantization focus on the uniform quantizer, which has a constant bin width $t_i - t_{i-1} = \Delta$ and $q_i = (t_{i-1} + t_i)/2$. A $k$-means quantizer is optimal in the sense of the mean squared error $\mathbb{E}\,(x - Q(x))^2$, where the expectation is taken with respect to the density $f$. Its name follows from the property that each representation level $q_i$ coincides with the $i$-th bin centroid (the mean w.r.t. $f$). While finding the optimal $k$-means quantizer is known to be an NP-hard problem, heuristic procedures such as the Lloyd-Max algorithm (Lloyd, 1982) usually produce a good approximation. The $k$-means quantizer coincides with the uniform quantizer when $x$ is uniformly distributed.
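For illustration, the Lloyd-Max heuristic mentioned above can be sketched in a few lines. This is our own minimal one-dimensional version; the fixed iteration count and the uniform initialization are assumptions, not the paper's setup:

```python
def lloyd_max(data, k, iters=50):
    """One-dimensional k-means (Lloyd-Max) quantizer: alternately recompute
    representation levels as bin centroids and thresholds as midpoints
    between adjacent levels (illustrative sketch)."""
    data = sorted(data)
    lo, hi = data[0], data[-1]
    # initialize levels evenly over the data range
    levels = [lo + (hi - lo) * (2 * i + 1) / (2 * k) for i in range(k)]
    for _ in range(iters):
        # thresholds are midpoints between adjacent representation levels
        thresholds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        bins = [[] for _ in range(k)]
        for x in data:
            i = sum(x > t for t in thresholds)  # bin index of x
            bins[i].append(x)
        # centroid update; keep the old level for empty bins
        levels = [sum(b) / len(b) if b else levels[i]
                  for i, b in enumerate(bins)]
    thresholds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
    return levels, thresholds
```

Each iteration is a full pass over the data, which is exactly why the text below argues this procedure is too expensive to run in every forward pass.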

While being a popular choice in signal processing, the $k$-means quantizer encounters severe obstacles in our problem of neural network quantization. Firstly, the Lloyd-Max algorithm has a prohibitively high complexity to be used in every forward pass. Secondly, it is not easily amenable to our scheme of modelling quantization as the addition of random noise, as the noise distribution in every bin is complex and varies as the quantization thresholds change. Finally, our experiments in Section 3.6 suggest that the use of the $\ell_2$ criterion for the quantization of a deep neural classifier does not produce the best classification results. The weights of such networks typically assume a bell-shaped distribution whose tails exert a great effect on the mean squared error, yet have little impact on the classification accuracy.

As an alternative to $k$-means, we propose the $k$-quantile quantizer characterized by the equiprobable bins property, that is, $\Pr(x \in B_i) = 1/k$ for every $i$. The property is realized by setting $t_i = F^{-1}(i/k)$, where $F$ denotes the cumulative distribution function of $x$ and, accordingly, its inverse $F^{-1}$ denotes the quantile function. The representation level of the $i$-th bin is set to the bin median, $q_i = F^{-1}\!\left(\frac{2i-1}{2k}\right)$.

It can be shown that the $k$-quantile quantizer minimizes the mean absolute error $\mathbb{E}\,|x - Q(x)|$. The use of this more robust error criterion limits the effect of the tails of the distribution and typically places more bins closer to the distribution mean. Based on empirical observations, we conjecture that the distribution tails are not essential for good model performance, at least in classification tasks. It can also be shown that in the case of a uniform $x$, the $k$-quantile quantizer coincides with the $k$-level uniform quantizer.
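To make the construction concrete, the quantile-based quantizer described above can be estimated directly from empirical samples. This is our own sketch; the crude empirical-quantile helper is an assumption, not the paper's estimator:

```python
def k_quantile_quantizer(samples, k):
    """Build a k-quantile quantizer from empirical samples: thresholds at
    the i/k empirical quantiles, representation levels at the (2i-1)/(2k)
    quantiles (the bin medians). Returns (quantize, thresholds, levels)."""
    s = sorted(samples)
    n = len(s)

    def q(p):
        # crude empirical quantile: value at rank floor(p * n)
        return s[min(int(p * n), n - 1)]

    thresholds = [q(i / k) for i in range(1, k)]          # k-1 inner thresholds
    levels = [q((2 * i - 1) / (2 * k)) for i in range(1, k + 1)]

    def quantize(x):
        i = sum(x > t for t in thresholds)  # count thresholds below x
        return levels[i]

    return quantize, thresholds, levels
```

By construction each bin receives roughly a 1/k fraction of the samples, which is the equiprobable-bins property.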

The cumulative distribution and the quantile function can be efficiently estimated empirically from the distribution of the weights and activations, and updated in every forward pass. Alternatively, one can rely on the empirical observation that $\ell_2$-regularized weights tend to assume an approximately normal distribution (Blundell et al., 2015) and use the normal quantile function.

Empirical observations also show that the network activations can be approximated as normally-distributed variables saturated by the ReLU units (Cai et al., 2017). The resulting distribution is a mixture of a delta at zero and a half-normal distribution, for which the quantile function also has a closed form.

The fact that a monotonically increasing transformation preserves quantiles allows an alternative construction of the $k$-quantile quantizer. We first apply the transformation $u = F(x)$ to the input, converting it into a uniform random variable on the interval $[0, 1]$. Then, a uniform $k$-level quantizer (coinciding with the $k$-quantile quantizer) is applied to $u$, producing $\tilde{u}$; the result is transformed back into $\tilde{x} = F^{-1}(\tilde{u})$ using the inverse transformation. We refer to this procedure as the uniformization trick. Its importance will become evident in the next section.
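Assuming normally distributed values, the uniformization trick described above can be sketched as follows. The bisection-based `phi_inv` is our own illustrative stand-in for the normal quantile function, not a production implementation:

```python
import math

def phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi_inv(p, lo=-10.0, hi=10.0):
    """Standard normal quantile function by bisection (illustrative)."""
    for _ in range(80):
        mid = (lo + hi) / 2
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def k_quantile_via_uniformization(x, k):
    """Quantile quantization of a ~N(0,1) value via the uniformization
    trick: map to [0,1] with the CDF, apply a uniform k-level quantizer
    (bin centers at (2i+1)/(2k)), then map back with the quantile function."""
    u = phi(x)                   # uniformize
    i = min(int(u * k), k - 1)   # uniform bin index
    u_q = (2 * i + 1) / (2 * k)  # bin center, i.e., the bin median
    return phi_inv(u_q)          # de-uniformize
```

For example, with k = 2 every positive input maps to the median of the positive half of the normal distribution (about 0.674), the equiprobable-bins behavior described in the text.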

2.2 Training quantizers by uniform noise injection

The lack of continuity, let alone smoothness, of the quantization operator renders its use in the backward pass impractical. As an alternative, at training time we replace the quantizer by the injection of random additive noise. This scheme suggests that instead of using the quantized value $Q(w)$ of a weight $w$ in the forward pass, $w + e$ is used, with $e$ drawn from the conditional distribution of the quantization error, described by the density

$p(e \,|\, w \in B_i) \propto f(w + e)$,

defined for $w + e \in B_i$ and vanishing elsewhere. The bin $B_i$ to which $w$ belongs is established according to its value and is fixed during the backward pass. Quantization of the network activations is performed in the same manner.

The fact that the parameters do not directly undergo quantization keeps the model differentiable. In addition, gradient updates in the backward pass have an immediate impact on the forward pass, in contrast to the directly quantized model, where small updates often leave a parameter in the same bin, leaving it effectively unchanged.

However, while in signal processing it is customary to model the quantization noise distribution as uniform, this approximation breaks in the extremely low precision regimes (small $k$) considered here. Hence, the injected noise has to be drawn from a potentially non-uniform distribution which, furthermore, changes as the network parameters and the quantizer thresholds are updated.

To overcome this difficulty, we resort to the uniformization trick outlined in the previous section. Instead of the $k$-quantile quantizer, we apply the equivalent uniform quantizer to the uniformized variable $u = F(w)$. The effect of the quantizer can again be modelled using noise injection,

$\tilde{u} = u + e$,

with the cardinal difference that now the noise $e$ is uniformly distributed on the interval $\left[-\frac{1}{2k}, \frac{1}{2k}\right]$.

Note that the proposed scheme, henceforth referred to as uniform noise injection quantization (UNIQ), is not restricted to the $k$-quantile quantizer discussed in this paper, but rather applies to any quantizer. In the general case, the noise injected into each bin is uniform, but its variance changes with the bin width. The latter requires finding the bin number in the forward pass, which has a negative impact on the training time.
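A minimal sketch of the UNIQ training-time forward pass for a single weight, assuming normally distributed weights; this is our own illustration (not the authors' released code), and the normal-CDF helpers and the clipping margin are assumptions:

```python
import math, random

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi_inv(p):
    """Standard normal quantile function by bisection (illustrative)."""
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if phi(mid) < p else (lo, mid)
    return (lo + hi) / 2

def noisy_forward_weight(w, k, rng=random):
    """Training-time surrogate for the k-quantile quantizer: uniformize the
    weight, add uniform noise on [-1/(2k), 1/(2k)] emulating the error of a
    k-level uniform quantizer, then de-uniformize. The perturbation in the
    uniformized domain is at most half a bin width, and the map stays
    differentiable in w."""
    u = phi(w)
    e = rng.uniform(-1.0 / (2 * k), 1.0 / (2 * k))
    u_noisy = min(max(u + e, 1e-6), 1 - 1e-6)  # keep inside (0, 1)
    return phi_inv(u_noisy)
```

At inference time the deterministic quantizer replaces the noise injection, so no random draws (or inverse-CDF evaluations) are needed when the network is deployed.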

Usually, quantization of neural networks is either used for training a model from scratch or applied post-training as a fine-tuning stage. Our method, as demonstrated in our experiments, works well in both cases. Our practice shows that the best results are obtained when the learning rate is reduced as the noise is added; we explain this by the need to compensate for noisier gradients.

2.3 Gradual quantization

The described method works well “as is” for small- to medium-sized neural networks. For deeper networks, the basic method does not perform well, most likely due to cumulative errors arising when applying a long sequence of operations where more noise is added at each step. We found that applying the scheme gradually to groups of layers works better in deep networks.

In order to perform gradual quantization, we split the network into $N$ blocks, each containing about the same number of consecutive layers, and split our budget of training epochs into $N$ stages. At the $i$-th stage, we quantize and freeze the parameters in blocks $1, \dots, i-1$, and inject noise into the parameters of block $i$; to the rest of the blocks, neither noise nor quantization is applied. This way, the number of parameters and activations into which the noise is injected simultaneously is reduced, which allows better convergence. The deeper blocks gradually adapt to the quantization error of the previous ones and thus tend to converge relatively fast when the noise is injected into them. For fine-tuning a pre-trained model, we use the same scheme, applying a single epoch per stage. An empirical analysis of the effect of the number of stages is presented in Section 3.5.
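The staging logic can be sketched as a simple schedule; this is our own skeleton under the description above, where the near-equal block partition and the dictionary layout are assumptions:

```python
def gradual_schedule(layers, num_stages):
    """Partition `layers` into `num_stages` consecutive blocks and, for each
    stage i, report which blocks are quantized-and-frozen, which block gets
    noise injection, and which remain untouched in full precision."""
    n = len(layers)
    # near-equal consecutive blocks
    bounds = [round(j * n / num_stages) for j in range(num_stages + 1)]
    blocks = [layers[bounds[j]:bounds[j + 1]] for j in range(num_stages)]
    schedule = []
    for i in range(num_stages):
        schedule.append({
            "frozen_quantized": [l for b in blocks[:i] for l in b],
            "noise_injected": blocks[i],
            "full_precision": [l for b in blocks[i + 1:] for l in b],
        })
    return schedule
```

With one stage per layer this reduces to the single-layer-at-a-time strategy that the experiments in Section 3.5 find to work best for ResNet-18.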

3 Experimental evaluation

We performed an extensive performance analysis of the UNIQ scheme compared to the most prominent recent methods for neural network quantization. The basis for comparison is the accuracy vs. the total number of bit operations in visual classification tasks on the ImageNet-1K (Russakovsky et al., 2015) dataset. The CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009) are further used to evaluate different design choices.

MobileNet (Howard et al., 2017) and ResNet-18 (He et al., 2016) architectures were used as the baseline for quantization. MobileNet was chosen as a representative of lighter models, which are more suited for a limited hardware setting where quantization is also most likely to be used. ResNet-18 was chosen due to its near state-of-the-art performance and popularity, which makes it an excellent reference for comparison to other results in the literature.

We compared the performance of different quantization methods, training from scratch vs. fine-tuning, and different partitions into stages in the gradual quantization strategy. We adopted the number of bit operations (BOPs) as the metric of arithmetic complexity. This metric is particularly informative about mixed-precision arithmetic, especially in hardware implementations on FPGAs and ASICs.

Training details

For training a quantized ResNet-18 model from random initialization on CIFAR, we used SGD with momentum and weight decay; the weight decay was set to zero and the learning rate was reduced after a fixed number of epochs. For fine-tuning a pre-trained model, we used the same settings with a shorter epoch budget. In both regimes, training proceeded in stages; noise was not injected into the first and last layers, and these layers were not quantized. The same setup was used when fine-tuning on ImageNet, but with lower learning rates for ResNet-18 and MobileNet; for MobileNet, we fine-tuned with a larger budget of epochs and stages. In all experiments, unless otherwise stated, we fine-tuned a pre-trained model.

3.1 Performance of quantized networks on ImageNet

Tables 2 and 3 report the top-1 accuracy of ResNet-18 and MobileNet with weights and activations quantized to several levels using UNIQ. As the baseline, we use a full-precision model with 32-bit weights and activations. We found the UNIQ method to perform especially well for extremely low-precision activations, with minor degradation in performance even with as little as 2-bit quantization. Although higher precision is required for the weights, we only observed degradation when using 4 bits or fewer.

Table 1 compares the performance of networks quantized using UNIQ to other leading approaches reported in the literature. Those methods are aimed at lower-precision weights rather than activations, while UNIQ is better suited for lower-precision activations. Nevertheless, UNIQ achieves comparable performance at a significantly lower overall complexity.

We found the UNIQ method to perform well also with the smaller MobileNet architecture, which comprises only a few million parameters. This is in contrast with previous works that resort to using larger models, e.g., by doubling the number of filters of ResNet-18 and thus roughly quadrupling its number of parameters (Mishra et al., 2018; Polino et al., 2018).

                        Activation bits
                       2        4        32
Weight bits    2     49.27    51.01
               4     64.15    64.43
               8     68.07    68.32
              32                       68.01
Table 2: Accuracy (top-1 in percent) on ImageNet with quantized MobileNet (Howard et al., 2017).
                        Activation bits
                       2        4        32
Weight bits    2     48.01    49.77
               4     65.32    66.02
               8     69.63    69.92
              32                       70.14
Table 3: Accuracy (top-1 in percent) on ImageNet with quantized ResNet-18 (He et al., 2016).

3.2 Accuracy vs. complexity trade-off

Since custom precision data types are used for the network weights and activations, the number of MAC operations is not an appropriate metric to describe the computational complexity of the model. Therefore, we use the BOP metric quantifying the number of bit operations. Given the bit width of two operands, it is possible to approximate the number of bit operations required for a basic arithmetic operation such as addition and multiplication. The proposed metric is useful when the inference is run on custom hardware like FPGAs or ASICs. Both of them are a natural choice for quantized networks, due to the use of lookup tables (LUTs) and dedicated MAC (or more general DSP) units, which are efficient with custom data types.

Another factor that must be incorporated into the BOPs calculation is the cost of fetching the parameters from an external memory. Two assumptions are made in the approximation of this cost: firstly, we assume that each parameter is only fetched once from the external memory; secondly, the cost of fetching a $b$-bit parameter is assumed to be $b$ BOPs. Given a neural network with $n$ parameters, all represented in $b$ bits, the memory access cost is simply $nb$.
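A rough sketch of this accounting follows; the per-MAC cost model (charging $b_w \cdot b_a$ bit operations per multiply-accumulate) is our simplification and not necessarily the paper's exact formula:

```python
def approx_bops(num_macs, num_params, b_w, b_a):
    """Very rough bit-operation estimate for a quantized network:
    each multiply-accumulate of a b_w-bit weight with a b_a-bit activation
    is charged b_w * b_a BOPs (a simplifying assumption), and each parameter
    fetched once from external memory is charged its own bit width."""
    compute = num_macs * b_w * b_a  # arithmetic cost
    memory = num_params * b_w       # one fetch per parameter, b_w BOPs each
    return compute + memory
```

The quadratic dependence of the compute term on the bit widths reflects the observation earlier in the paper that halving both operand widths cuts the required hardware roughly fourfold.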

Figure 1 and Table 1 display the performance-complexity tradeoffs of various neural networks trained using UNIQ and other methods to different levels of weight and activation quantization.

3.3 Accuracy vs. quantization level

Table 4 reports the accuracy of a narrow version of ResNet-18 on CIFAR-10 for various levels of weight and activation quantization. We observed that when using UNIQ, quantizing the activations has a smaller effect than quantizing the weights. The degradation in accuracy is significant for 1-bit weights, without a noticeable dependence on the activation precision. For wider weight representations, we observed only a minor degradation compared to full precision, even when using 1-bit (binary) activations.

                        Activation bits
                       1        2        4        32
Weight bits    1     78.74    78.74    78.78    78.77
               2     88.79    88.85    88.63    88.63
               4     90.57    90.60    90.71    90.71
              32     90.74    90.79    90.82    90.90
Table 4: Accuracy with UNIQ on CIFAR-10 for different bit widths of the activations and weights.

3.4 Training from scratch vs. fine-tuning

Both training from scratch (that is, from random initialization) and fine-tuning have their advantages and disadvantages. Training from scratch takes more time but requires a single training phase with no extra training epochs, at the end of which a quantized model is obtained. Fine-tuning, on the other hand, is useful when a pre-trained full-precision model is already available; it can then be quantized with a short re-training.

Table 5 compares the accuracy achieved in the two regimes on a narrow version of ResNet-18 trained on CIFAR-10 and 100. 5-bit quantization of the weights only and 5-bit quantization of both the weights and the activations were compared. We found that both regimes work equally well, reaching accuracy close to the full-precision baseline.

Dataset      Bits    Full training    Fine-tuning    Baseline
CIFAR-10     5,32        90.36           90.47         90.9
             5,5         90.57           90.53
CIFAR-100    5,32        65.56           65.73         66.3
             5,5         65.65           66.05
Table 5: Top-1 accuracy (in percent) on CIFAR-10 and 100 of a narrow version of ResNet-18 trained with UNIQ from random initialization vs. fine-tuning a full-precision model. The number of bits is reported as (weights, activations).

3.5 Accuracy vs. number of quantization stages

We found that injecting noise into all layers simultaneously does not perform well for deeper networks. As described in Section 2.3, we suggest splitting the training into stages, such that at each stage the noise is injected only into a subset of layers.

To determine the optimal number of stages, we fine-tuned ResNet-18 on CIFAR-10 with a fixed epoch budget. The same bit width was used for both the weights and the activations. The first and the last layers were not quantized; the other layers were partitioned evenly into stages. Two of the epochs were reserved for fine-tuning the first and the last layers after all other layers were quantized and fixed. The remaining epochs were allocated evenly to the stages.

Figure 2 reports the classification accuracy as a function of the number of quantization stages. Based on these results, we conclude that the best strategy is injecting noise into a single layer at each stage. We follow this strategy in all ResNet-18 experiments conducted in this paper. For MobileNet, which is deeper, we chose to inject noise into several consecutive layers at every stage.

Figure 2: Classification accuracy on CIFAR-10 of the ResNet-18 architecture quantized using UNIQ with different numbers of quantization stages during training.

3.6 Comparison of different quantizers

In the following experiment, different quantizers were compared within the uniform noise injection scheme.

The bins of the uniform quantizer were allocated evenly in a symmetric range whose width was determined by the standard deviation $\sigma$ of the parameters. For both the $k$-quantile and the $k$-means quantizers, a normal distribution of the weights was assumed, and the normal cumulative distribution and quantile functions were used for the uniformization and de-uniformization of the quantized parameters. The $k$-means and uniform quantizers used a pre-calculated set of thresholds translated to the uniformized domain. Since the resulting bins in the uniformized domain had different widths, the level of noise was different in each bin. This required an additional step of finding the bin index of each parameter, approximately doubling the training time.

The three quantizers were evaluated in a ResNet-18 network trained on the CIFAR-10 dataset with the weights quantized to 5 bits ($k = 32$) and the activations computed in full precision (32 bits). Table 6 reports the obtained top-1 classification accuracy. $k$-quantile quantization outperforms the other quantization methods and is only slightly inferior to the full-precision baseline. In terms of training time, the $k$-quantile quantizer requires about 1.6 times the training time of the unquantized baseline, compared to the almost fourfold increase required by the $k$-means quantizer. In addition, the $k$-quantile training time is independent of the number of quantization bins, as the noise distribution is the same in every bin, while the other methods require separate processing of each bin, increasing the training time for higher bit widths.

Quantization method       Accuracy    Training time [h]
Baseline (unquantized)     90.90            1.42
k-quantile                 90.47            2.28
k-means                    84.80            5.37
Uniform                    83.93            5.37
Table 6: Comparison of different quantization methods obtained by random noise injection into a ResNet-18 network trained and tested on CIFAR-10. Top-1 accuracy is reported in percent.

4 Conclusion

We introduced UNIQ – a training scheme for quantized neural networks. The scheme is based on the uniformization of the distribution of the quantized parameters, the injection of additive uniform noise, followed by the de-uniformization of the obtained result. The scheme is amenable to efficient training by back propagation in full-precision arithmetic, and achieves maximum efficiency with the $k$-quantile (balanced) quantizer investigated in this paper.

We reported state-of-the-art results for quantized neural networks on the ImageNet visual recognition task. In the high-accuracy regime (around 70% top-1 accuracy), our networks require tens of GBOPs, as opposed to the hundreds of GBOPs required by the best results obtained so far by the apprentice networks of Mishra & Marr (2018). The 2- to 5-fold improvement in complexity comes at a negligible drop in accuracy compared to both (Mishra & Marr, 2018) and the full-precision baseline, which is costlier still.

In the low-complexity regime, we achieve around 50% top-1 accuracy with a network requiring less than ten GBOPs. For comparison, the full-precision baseline requires hundreds of GBOPs, while the best extreme quantization schemes of comparable complexity, such as XNOR networks (Rastegari et al., 2016) and QNN (Hubara et al., 2016b), perform at around 42–51% accuracy.

UNIQ is straightforward to implement and can be used as a “plug-and-play” modification of existing architectures. It does not require tweaking the architecture, e.g., increasing the number of filters as done in a few previous works. It can also benefit from pre-trained parameters, requiring only a short re-training. Unlike the teacher-student scheme that requires training a much bigger teacher network, our method trains the target architecture directly.

One of the most surprising properties of UNIQ is that it allows extreme quantization of the activations without a significant drop in accuracy (see Table 4). This enables settings with very low-precision quantization of the weights (just a few bits) and extreme (even binary) quantization of the activations, without the need to modify the architecture.

While this paper considered a setting in which all MAC operations were performed in the same precision, more complicated bit allocations will be explored in follow up studies.


The research was funded by ERC StG RAPID and ERC StG SPADE.


  • Blundell et al. (2015) Blundell, Charles, Cornebise, Julien, Kavukcuoglu, Koray, and Wierstra, Daan. Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622, 2015.
  • Cai et al. (2017) Cai, Z, He, X, Sun, J, and Vasconcelos, N. Deep learning with low precision by half-wave gaussian quantization. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2017.
  • Chen et al. (2016) Chen, Liang-Chieh, Papandreou, George, Kokkinos, Iasonas, Murphy, Kevin, and Yuille, Alan L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. CoRR, abs/1606.00915, 2016. URL http://arxiv.org/abs/1606.00915.
  • Dong et al. (2017) Dong, Yinpeng, Ni, Renkun, Li, Jianguo, Chen, Yurong, Zhu, Jun, and Su, Hang. Learning accurate low-bit deep neural networks with stochastic quantization. In British Machine Vision Conference (BMVC’17), 2017.
  • Gupta et al. (2015) Gupta, Suyog, Agrawal, Ankur, Gopalakrishnan, Kailash, and Narayanan, Pritish. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1737–1746, 2015.
  • Han et al. (2016) Han, Song, Mao, Huizi, and Dally, William J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. International Conference on Learning Representations (ICLR), 2016.
  • He et al. (2016) He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • Hinton et al. (2012) Hinton, G., Deng, L., Yu, D., Dahl, G. E., r. Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, Nov 2012. ISSN 1053-5888. doi: 10.1109/MSP.2012.2205597.
  • Howard et al. (2017) Howard, Andrew G, Zhu, Menglong, Chen, Bo, Kalenichenko, Dmitry, Wang, Weijun, Weyand, Tobias, Andreetto, Marco, and Adam, Hartwig. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • Hubara et al. (2016a) Hubara, Itay, Courbariaux, Matthieu, Soudry, Daniel, El-Yaniv, Ran, and Bengio, Yoshua. Binarized neural networks. In Advances in neural information processing systems, pp. 4107–4115, 2016a.
  • Hubara et al. (2016b) Hubara, Itay, Courbariaux, Matthieu, Soudry, Daniel, El-Yaniv, Ran, and Bengio, Yoshua. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016b.
  • Krizhevsky (2009) Krizhevsky, Alex. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.
  • Lai et al. (2015) Lai, Siwei, Xu, Liheng, Liu, Kang, and Zhao, Jun. Recurrent convolutional neural networks for text classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pp. 2267–2273. AAAI Press, 2015. ISBN 0-262-51129-0. URL http://dl.acm.org/citation.cfm?id=2886521.2886636.
  • Lee et al. (2017) Lee, Edward H, Miyashita, Daisuke, Chai, Elaina, Murmann, Boris, and Wong, S Simon. Lognet: Energy-efficient neural networks using logarithmic computation. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 5900–5904. IEEE, 2017.
  • Lloyd (1982) Lloyd, Stuart. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
  • Mishra & Marr (2018) Mishra, Asit and Marr, Debbie. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1ae1lZRb.
  • Mishra et al. (2018) Mishra, Asit, Nurvitadhi, Eriko, Cook, Jeffrey J, and Marr, Debbie. WRPN: Wide reduced-precision networks. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1ZvaaeAZ.
  • Polino et al. (2018) Polino, Antonio, Pascanu, Razvan, and Alistarh, Dan. Model compression via distillation and quantization. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1XolQbRW.
  • Rastegari et al. (2016) Rastegari, Mohammad, Ordonez, Vicente, Redmon, Joseph, and Farhadi, Ali. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Springer, 2016.
  • Russakovsky et al. (2015) Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
  • Zhou et al. (2017a) Zhou, Aojun, Yao, Anbang, Guo, Yiwen, Xu, Lin, and Chen, Yurong. Incremental network quantization: Towards lossless CNNs with low-precision weights. In International Conference on Learning Representations (ICLR), 2017a.
  • Zhou et al. (2017b) Zhou, Shu-Chang, Wang, Yu-Zhi, Wen, He, He, Qin-Yao, and Zou, Yu-Heng. Balanced quantization: An effective and efficient approach to quantized neural networks. Journal of Computer Science and Technology, 32(4):667–682, 2017b.
  • Zhou et al. (2016) Zhou, Shuchang, Wu, Yuxin, Ni, Zekun, Zhou, Xinyu, Wen, He, and Zou, Yuheng. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
  • Zhu et al. (2016) Zhu, Chenzhuo, Han, Song, Mao, Huizi, and Dally, William J. Trained ternary quantization. International Conference on Learning Representations (ICLR), 2016.