FQ-Conv: Fully Quantized Convolution for Efficient and Accurate Inference

1 Introduction

In recent years, there has been a surge of interest in designing accelerator hardware for neural networks. These neural-network accelerators aim to improve the speed and energy efficiency with which the billions of operations in DNNs are performed. The design of NN accelerators often goes hand in hand with optimizations at the algorithmic level. Such algorithmic optimizations include changing the structure of the network [He et al., 2015, Howard et al., 2017], network pruning [LeCun et al., 1990, Han et al., 2015], dimensionality reduction of weight matrices [Xue et al., 2013], and combinations thereof. Moreover, each algorithm requires some form of quantization of its values before being mapped onto a chip, and quantization to low precision yields several additional hardware benefits. For instance, low-precision quantization significantly reduces the memory footprint of the network, thereby reducing the on-chip memory and memory transfers needed. Furthermore, extreme quantization can simplify computations considerably: e.g. a network with ternary weights (i.e. -1, 0 or 1) involves only additions and no computationally expensive multiplications. However, very low-precision quantized networks often suffer a reduction in accuracy. In addition, low-precision DNNs still entail some higher-precision computations, such as Batch Normalization (BN) and nonlinear activation functions, which require changing numerical formats between hardware operations and cost extra silicon area and energy. Hence, quantization techniques that maintain good accuracy are required.

Most of today's DNN accelerators operate in the digital domain [Jouppi et al., 2017, Moons et al., 2017], but recently there has been an increase in the number of analog designs [Ambrogio et al., 2018, Guo et al., 2017]. Some of these analog accelerators promise to mitigate the classical von Neumann bottleneck by performing most of the computations in memory. For example, one type of analog hardware implementation uses Ohm's law to perform multiplications in the memory elements of a crossbar array. Here, the weights are encoded as conductances, but only a limited number of conductance levels can be stored in each memory device. Following multiplication, the charges are accumulated on the summation line using Kirchhoff's current law, equivalent to the summation of weighted activations in dot products. Although promising, such analog compute-in-memory also poses several challenges: the devices that store the weights (memory cells), generate the input activations (e.g. DACs) and convert the analog summed signal back to the digital domain (ADCs) are often noisy and require low-precision quantization to be usable or efficient. For these analog accelerators to work, it is therefore crucial that neural networks perform accurately under noisy, low-precision conditions.
This paper makes the following contributions: (1) We propose an effective quantization technique that learns to optimally quantize the weights and activations of the network during training. (2) We train the network to low precision using a new training technique, called gradual quantization. (3) We show that our proposed quantization technique compares favorably to existing techniques. (4) We present a method to remove the higher-precision nonlinearities and BN from the network. (5) We demonstrate the potential of this approach on two additional datasets, the Google speech commands dataset and on CIFAR-100, showing that ternary-weight (2-bit) CNNs with low-precision in- and outputs and no higher-precision BN and nonlinearity perform comparably to their full-precision equivalents with BN. (6) We show that these networks can handle moderate amounts of noise on the weights, activations and outputs of the convolution, making them suitable for analog accelerators.

2 Related Work

Learned quantization.

In recent years, a wide range of quantization methods have been proposed, up to methods to ternarize the network weights [Li et al., 2016, Zhu et al., 2016] or binarize the weights and activations of DNNs [Courbariaux et al., 2016, Rastegari et al., 2016, Hubara et al., 2017]. Several of these methods utilized the statistical distributions of the weights and activations to propose good quantization methods [Li et al., 2016, Zhu et al., 2016, Cai et al., 2017]. This statistical approach is a sound approach from an information-theoretic point of view and often works well, but may be sub-optimal from a DNN perspective. For instance, the statistical-quantization approach may not result in the best quantized solution for the network as a whole. This is because the quantization often happens after the network has been trained in full precision, without querying the quantized network if it would choose these or other quantized values if it were allowed to. Furthermore, the distributional assumptions may not be fully accurate and may change for different datasets, across network layers and during training, thereby complicating the statistical approach. Consequently, recent studies have proposed to learn the optimal quantization during training [Zhang et al., 2018, Jung et al., 2018]. Our proposed method is most similar to the recently proposed PACT method for quantizing the outputs of ReLUs [Choi et al., 2019]. In that paper, the authors present a way to parametrically learn the clipping range of the ReLU function for optimal quantization. Our method differs from theirs in that our proposed learned quantization method does not have zero gradients for values in the clipping range and can be applied at any position inside the network, which includes the ReLUs, but also for quantizing the weights, quantizing the immediate linear outputs of convolutions and even for quantizing the inputs (e.g. images) into the DNN. In our experiments presented below we will demonstrate the generality of our quantization method.

Gradual quantization.

Quantizing DNNs to low precision puts strong constraints on the network, constraints that the network often finds hard to adjust to, as evidenced by decreased accuracy. Previous work has tried to ease the transition to low-precision by quantizing different parts of the network in different stages of the training, e.g. by quantizing the initial layers of a DNN before the later layers [Baskin et al., 2018] or by quantizing different parts within layers at different stages [Xu et al., 2018]. The justification for this process is to give the remaining full-precision parts of the network the chance to compensate for the quantization in other network parts. In contrast to these methods, our proposed gradual quantization method quantizes the entire network at once, but gradually lowers the bitwidth of the weights and activations inside the network. Our method is motivated by the observation that it is relatively easy to quantize at high bitwidths and that networks with lower precision can likely learn from networks with slightly higher precision.

Noise resilience of neural networks.

DNNs have a special relationship with noise. For example, both dropout [Srivastava et al., 2014] and batch normalization [Ioffe and Szegedy, 2015] add noise to the weights or activations during training, thereby improving generalization performance of the network. Importantly, both techniques deactivate the noise source during inference, which is different from what happens in analog accelerators where noise is inherent to the circuitry and thus also present during inference. Other studies have examined the effect of noise on the network weights [Merolla et al., 2016], finding that (weight-) quantized models better withstand weight noise during inference compared to their full-precision counterparts. Here, we examine the influence of noise on the weights, activations and convolution outputs (MACs) and propose a technique to improve network performance under noisy conditions.

3 The Proposed Approach

Overview.

We aim to train convolutional layers of a CNN in which inputs, weights and outputs are fully quantized to low-precision numerical values, in which no higher-precision BN and nonlinearities need to be computed, and for which the resulting CNN performs at high accuracy. To achieve these objectives, we combine three methods during network training: 1. We propose a novel learned quantization technique, 2. We present a new training technique that improves the accuracy of quantized networks, 3. We combine the previous methods with network distillation for best accuracy. In the following sections we will present each method in detail and finally discuss how we eliminate the higher-precision BN and nonlinearities from the network.

3.1 Learned quantization

We seek to quantize the inputs, outputs and weights of a convolutional layer in an optimal way using a quantization method that does not rely on any distributional assumptions, can be used for all elements of the network (weights and activations) and gives the network the chance to learn the quantization that is optimal for the entire network, i.e. end-to-end. We will now discuss this in detail.
Uniform quantization requires a range in which the values are quantized. Crucially, we do not know if the network relies most on extreme values (i.e. in the tails of the distribution) or small values. Also, the optimal quantization range may change across layers and may be different for weights and activations. We therefore instruct the network to learn the quantization range during training. To do so, we introduce a learnable scale factor in the quantization process. This scale factor can differ per layer and for weights and activations. Our method can be summarized by the following two equations. First, we employ a uniform quantization rule:

$$q(x) \;=\; \frac{1}{n}\,\mathrm{round}\!\big(n \cdot \mathrm{clip}(x,\, b,\, 1)\big) \qquad (1)$$

Here, x can be weights or activations; b is a lower bound, equal to -1 for quantizing weights/linear outputs of convolutions/inputs to CNNs, and equal to zero for quantized ReLUs (see below); and n is the number of positive quantization levels for a given bitwidth (e.g. n = 1 for ternary, 2-bit weights). Thus, we force the quantization to happen in the [b, 1]-range. The quantization scale is then parametrized as follows:

$$\tilde{x} \;=\; e^{\alpha}\, q\!\left(\frac{x}{e^{\alpha}}\right) \qquad (2)$$

where $\alpha$ is a learnable scale parameter. So, the learnable scale parameter first scales the non-quantized values so that they can be clipped into the standardized [b, 1]-range of the quantization function (1), after which the quantized result is scaled back to its original range. Therefore, the network can learn whichever range is most optimal for quantization. Note that we employ $e^{\alpha}$, i.e. the exponential of $\alpha$. This function is differentiable in $\alpha$ and forces the scaling to be positive. Positive scaling is preferred because otherwise the scaling could, in addition to the network weights, change the sign of the weights and activations, thereby causing training instabilities. Furthermore, positive scaling avoids division by zero.
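To make the two equations concrete, the following is a minimal sketch of the quantizer's forward pass in PyTorch. The function name, argument names and the choice n = 2^(k-1) - 1 are our own assumptions, not prescribed by the paper:

```python
import torch

def quantize(x, alpha, b, k):
    """Forward pass of the learned uniform quantizer (sketch of Eqs. 1-2).

    x     : tensor of weights or activations
    alpha : learnable log-scale parameter (a scalar tensor)
    b     : clipping lower bound (-1 for weights/linear outputs, 0 for ReLU-style outputs)
    k     : bitwidth; n = 2**(k-1) - 1 positive levels is an assumed mapping (n = 1 for ternary)
    """
    n = 2 ** (k - 1) - 1
    s = torch.exp(alpha)                       # positive scale, differentiable in alpha
    x_c = torch.clamp(x / s, min=b, max=1.0)   # clip into the standardized [b, 1] range
    x_q = torch.round(x_c * n) / n             # uniform quantization to the allowed levels
    return s * x_q                             # scale back to the original range
```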
The quantization function involves a non-differentiable rounding function, which causes problems when learning the scaling parameter with backpropagation. To mitigate this issue, we employ the straight-through-estimator (STE) approach, as introduced by [Courbariaux et al., 2016, Hinton, 2012]. The STE passes the gradient through the non-differentiable rounding function, basically ignoring it in the backward pass. In agreement with [Courbariaux et al., 2016] we also keep a copy of the non-quantized weights during training and update these non-quantized weights based on the gradient of the quantized weights. The final quantized weights are obtained by quantizing the copy of non-quantized weights. During the experiments described below, we will use this quantization procedure, allowing each convolutional layer to learn to optimally quantize its weights and activations.
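A hedged sketch of how such a quantizer with a straight-through estimator could be packaged as a trainable PyTorch module; the module and parameter names are ours, and the detach-based formulation is one common way to implement the STE:

```python
import torch
import torch.nn as nn

class LearnedQuantizer(nn.Module):
    """Learned uniform quantizer with a straight-through estimator (sketch).

    round() is non-differentiable, so the backward pass treats it as the
    identity by routing the quantization residual through a detached term.
    """
    def __init__(self, k_bits, lower_bound, init_scale=1.0):
        super().__init__()
        self.n = 2 ** (k_bits - 1) - 1            # assumed level count (1 for ternary)
        self.b = lower_bound                      # -1 for weights, 0 for ReLU-style outputs
        # alpha is learned; exp(alpha) keeps the effective scale positive
        self.alpha = nn.Parameter(torch.tensor(float(init_scale)).log())

    def forward(self, x):
        s = torch.exp(self.alpha)
        x_c = torch.clamp(x / s, min=self.b, max=1.0)
        x_q = torch.round(x_c * self.n) / self.n
        # Straight-through estimator: forward uses x_q, backward sees x_c.
        x_ste = x_c + (x_q - x_c).detach()
        return s * x_ste
```

During training this module would be applied to the full-precision weight copy (e.g. `w_q = quantizer(self.weight)`), consistent with keeping non-quantized weights and updating them with the gradients of the quantized ones.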


Figure 1: Gradual quantization procedure

3.2 Gradual quantization

It is well-known that poor hyper-parameter initialization can cause the network to learn slowly and converge to sub-optimal solutions, exhibiting low accuracy. We found this especially true for networks that are quantized at different positions (quantizing inputs, weights and outputs) and to very low precision (e.g. ternary weights). This is likely because the quantization function (1) is essentially a saturating nonlinearity, and because a too wide or too narrow initial quantization range effectively collapses all values onto a single quantized value. Both factors increase the likelihood of small gradients during training, and thus poor network convergence. To lessen these issues, and hence improve network convergence, we found it beneficial to gradually lower the bitwidth of the quantization. The general procedure is illustrated in Figure 1. Specifically, we start by training a full-precision network and use that network’s trained parameters to initialize a network that is subsequently trained with lower bitwidth (e.g. 8 bits) weights and activations. We then use the parameters of this network to initialize another network with even lower bitwidth (e.g. 5 bits). We continue this procedure until the desired low-bitwidth network is obtained.
Gradual quantization is akin to curriculum learning [Bengio et al., 2009], which has been shown to induce better network convergence and improved generalization behavior. Gradual quantization facilitates training, likely because the initialization with networks with similar bitwidths primes the network under training with effective quantization ranges, not too wide or too narrow, such that gradients are larger and learning can happen effectively.
Training with gradual quantization takes longer than training once from random initialization, but once the network has been quantized to low precision at high accuracy, it can be deployed and used indefinitely for inference without additional cost. Finally, note that similar gradual strategies can help to simplify networks in other ways besides decreasing the bitwidth (see 3.4).
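In code, the procedure might look like the following sketch; build_model and train_stage are user-supplied helpers (not from the paper), and the bitwidth schedule shown is illustrative (the per-dataset schedules used below are given in Tables 1, 4 and 6):

```python
import copy

def gradual_quantization(fp_model, build_model, train_stage,
                         schedule=((8, 8), (6, 6), (5, 5), (4, 4), (3, 3), (2, 2))):
    """Gradual-quantization loop (sketch).

    build_model(w_bits, a_bits) returns a fresh quantized network and
    train_stage(student, teacher) runs one full training stage; both are supplied by the user.
    """
    prev = fp_model                                   # start from a trained full-precision network
    for w_bits, a_bits in schedule:                   # progressively lower the bitwidths
        student = build_model(w_bits, a_bits)
        # initialize from the previously trained, slightly higher-precision network
        student.load_state_dict(copy.deepcopy(prev.state_dict()), strict=False)
        # the teacher could also be a fixed FP or best-so-far network (see Tables 1, 4, 6)
        train_stage(student, teacher=prev)
        prev = student
    return prev
```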

3.3 Network distillation

Gradual quantization is a form of transfer learning: it learns from the higher-bitwidth network how to deal with low bitwidths. To further improve the network accuracy, we also include another form of transfer learning, called network distillation. Network distillation was introduced by [Hinton et al., 2015] to train smaller networks based on larger networks and has subsequently also been used to improve quantized networks [Baskin et al., 2018, Polino et al., 2018, Wu et al., 2016, Leroux et al., 2019]. We use the same methods as in [Hinton et al., 2015]. In short, this technique uses a teacher network to train a student, in our case the low-precision quantized network, by supplying the student network with soft labels, i.e. the output probabilities of the teacher network. The soft labels contain more generalization information than the one-hot training labels (e.g. that a salmon and a goldfish are alike in the sense that they are both fish), which improves test accuracy. This is especially useful for datasets like CIFAR-100, which consist of classes that are subdivided into smaller subclasses. Note that the teacher network does not have to be the full-precision network.
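For reference, a minimal sketch of the Hinton-style distillation loss referred to above; the temperature and the soft/hard weighting shown are illustrative values, not numbers reported in the paper:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, w_soft=0.9):
    """Hinton-style knowledge-distillation loss (sketch).

    Combines the KL divergence to the teacher's temperature-softened probabilities
    (the soft labels) with the usual cross-entropy on the one-hot training labels.
    """
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)   # T^2 rescaling as in Hinton et al.
    hard = F.cross_entropy(student_logits, labels)
    return w_soft * soft + (1.0 - w_soft) * hard
```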

3.4 Removing the higher-precision BN and nonlinearity

Batch normalization is used to stabilize the first- and second-order statistics of the activations during training, giving the network the chance to focus learning on more interesting higher-order statistics, and improves network convergence during training and generalization behavior during testing [Ioffe and Szegedy, 2015]. Batch normalization is performed as $\mathrm{BN}(x) = \gamma\,\frac{x - \mu_B}{\sigma_B} + \beta$, where $\mu_B$ and $\sigma_B$ are the mean and standard deviation of the mini-batch respectively, and $\gamma$ and $\beta$ are a learned scale and shift parameter. For inference, $\mu_B$ and $\sigma_B$ are replaced by their corresponding estimates, $\mu$ and $\sigma$, based on the complete training dataset. For inference, the BN equation can therefore be simplified as:

$$\mathrm{BN}(x) \;=\; \gamma'\, x + \beta' \qquad (3)$$

where $\gamma' = \gamma / \sigma$ and $\beta' = \beta - \gamma\mu / \sigma$. In other words, for inference we only require one scale and one shift factor. Given that the learned quantization method described in 3.1 already has a scale factor, we can absorb the BN scale factor into the quantization scale factor. We further find that the shift factor does not contribute much to overall accuracy if we train the network to adapt to its absence (see below). This shows that it is possible to remove the BN computations for inference purposes.
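As a sketch, the inference-time scale and shift of equation (3) can be computed from a trained PyTorch BatchNorm layer as follows (the function name is ours); the resulting scale can then be absorbed into the learned quantization scale of the following quantizer:

```python
import torch

def fold_bn(bn):
    """Compute the inference-time affine form of BN (Eq. 3): y = gamma' * x + beta'.

    bn is a trained torch.nn.BatchNorm1d/2d layer; eps is included for numerical
    stability, as in the library's own inference path.
    """
    sigma = torch.sqrt(bn.running_var + bn.eps)
    gamma_p = bn.weight / sigma
    beta_p = bn.bias - bn.weight * bn.running_mean / sigma
    return gamma_p, beta_p
```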

Next, we observe that the quantization function (1) is a nonlinear function: when the clipping lower bound b is set to -1, the quantization function approximates a hard-tanh function, $\mathrm{clip}(x, -1, 1)$. On the other hand, when the lower bound b is set to 0, the quantization function approximates a (clipped) ReLU function, $\mathrm{clip}(x, 0, 1)$. This indicates that we can use the quantization function as a nonlinear activation function.

In practice, we have found it necessary to first train the network to low precision with BNs and nonlinearities in place. Then, once low precision has been obtained, we initialize a new network with these trained parameters and retrain it after replacing each BN+ReLU combination with the learned quantized ReLU (clipping lower bound set to 0), and each isolated BN with the learned quantization function with clipping lower bound set to -1 (Figures 3, 4). During retraining, the learned scale parameters are allowed to change, so as to compensate for the new network structure.

The resulting FQ-conv layers have quantized inputs, convolve with quantized weights and return quantized outputs, which in turn become the inputs into subsequent FQ-conv layers. No higher-precision BN and activation functions need to be computed. We observe further that the high-precision scale parameters are only needed during training to allow the network to learn its optimal quantization. During inference, we can perform integer-valued convolutions. This follows from the definition of the quantization function and the linearity of the dot product:

$$\sum_i \tilde{w}_i\, \tilde{a}_i \;=\; \frac{s_w\, s_a}{n_w\, n_a} \sum_i \bar{w}_i\, \bar{a}_i \qquad (4)$$

where w and a are weights and activations; $s_w$, $s_a$, $n_w$ and $n_a$ are the learned scales and numbers of positive quantization levels defined as in equations (1) and (2) (dropping the exponentials for clarity, i.e. writing $s$ for $e^{\alpha}$); and $\bar{w}_i$ and $\bar{a}_i$ are (signed) integer-valued weights and activations, i.e. $\bar{w}_i = \mathrm{round}\big(n_w \cdot \mathrm{clip}(w_i / s_w,\, b,\, 1)\big) \in \{-n_w, \dots, n_w\}$. Hence, the multiply-accumulates are performed with integer-valued numbers. Note that for ternary-weight convolutions, with $n_w = 1$, only additions/subtractions are performed, and no multiplications. Moreover, the remaining scaling factor is not needed for active computation as long as the hardware-supported quantization method (e.g. lookup tables or analog-to-digital converters) puts the integer-valued sum into the correct integer-valued quantized bin, which becomes the input into the next layer.
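A small numerical check of the factorization in equation (4), using illustrative scales, ternary weights ($n_w$ = 1) and, as an assumption, 5-bit activations with $n_a$ = 15:

```python
import torch

# Verify Eq. (4) for a single dot product: the dot product of the quantized values
# equals a constant scale factor times an integer-only multiply-accumulate.
torch.manual_seed(0)
w, a = torch.randn(64), torch.rand(64)
s_w, s_a, n_w, n_a = 0.7, 1.3, 1, 15                       # illustrative values

w_int = torch.round(torch.clamp(w / s_w, -1, 1) * n_w)     # signed integers in {-n_w, ..., n_w}
a_int = torch.round(torch.clamp(a / s_a, 0, 1) * n_a)      # integers in {0, ..., n_a}

lhs = torch.dot(s_w * w_int / n_w, s_a * a_int / n_a)      # dot product of quantized values
rhs = (s_w * s_a) / (n_w * n_a) * torch.dot(w_int, a_int)  # scale factor times integer MAC
assert torch.allclose(lhs, rhs)
```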
The only important scale factor for active computation during inference is the one from the output quantization of the final FQ-Conv layer. This scale factor is applied to the output of the final FQ-Conv layer to bring the activations back to the scale expected by the global average pooling layer, which is performed in higher precision.

4 Experiments

4.1 Effectiveness of the proposed quantization technique

We first examined the effectiveness of the proposed quantization technique. For this purpose, we first employed CIFAR-10 with ResNet-20, a configuration often used to benchmark quantization methods. We trained the network using standard hyper-parameters from previous related studies (learning rate: 0.1, weight decay: 5E-4, 200 epochs, batch size: 128, with standard data augmentation). For proper comparison with the existing literature, we did not quantize the first and last convolutional layer in this analysis (although we do so in subsequent analyses) and report results on the validation set. We also quantized the 1x1 convolutions in the residual paths.

We quantized the network to various bitwidths using gradual quantization, from full precision down to 2-bit ternary networks (Table 1). We observed test accuracies above (at precisions above 3 bits), equal to (at 3-bit precision), or slightly below (at 2-bit precision) that of a full-precision network trained from scratch (FP0).

We further compared test accuracy with and without gradual quantization (GQ). Without GQ, we initialized the network with FP0 parameters, used FP0 as the teacher, and quantized immediately to the given low precision. We observed that GQ significantly improves the accuracy of the lowest-precision 3-bit and especially 2-bit networks (Table 1). One could likely improve the 2-bit accuracy without GQ given enough hyper-parameter tuning; here we present gradual quantization as an alternative, less error-prone technique.

We next compared our results to the state of the art (Table 2). Our quantization method has the lowest degradation relative to its FP baseline (DoReFa accuracies taken from [Li et al., 2016]). For 2-bit networks, LQ-Net has slightly higher overall accuracy, despite a larger degradation relative to its own baseline than our method. The overall higher accuracy of LQ-Net may be explained by: 1. its higher FP baseline, 2. the fact that LQ-Net quantizes weights per channel (vs. per layer in our method), i.e. LQ-Net uses more learned parameters, and 3. the fact that LQ-Net uses non-uniform quantization (vs. uniform quantization in our method). In sum, we observed that our proposed quantization technique performs well at low precision and compares favorably to existing methods.

Network #bits / weight #bits / act. Init. net Trainer net Test acc. (%) Test acc. No GQ (%) Diff (%)
FP0 32 (float) 32 (float) - - 91.6 - -
Q88 8 8 FP0 FP0 92.6 - -
FP1 32 (float) 32 (float) Q88 Q88 92.3 - -
Q66 6 6 Q88 FP1 92.6 92.5 0.1
Q55 5 5 Q66 FP1 92.6 92.5 0.1
Q44 4 4 Q55 FP1 92.2 92.1 0.1
Q33 3 3 Q44 FP1 91.6 90.8 0.8
Q22 2 2 Q33 FP1 89.9 10.0 79.9
Table 1: Gradual Quantization of ResNet-20 on CIFAR-10
Name Baseline (%) Quantized (%) Diff (%)
PACT-SAWB (W2/A2) 91.5 89.2 2.3
LQ-Net (W2/A2) 92.1 90.2 1.9
DoReFa (W2/A2) 91.5 88.2 3.3
GQ (W2/A2) 91.6 89.9 1.7
LQ-Net (W3/A3) 92.1 91.6 0.5
GQ (W3/A3) 91.6 91.6 0.0
Table 2: CIFAR-10: Comparison of validation accuracy for ResNet-20. (W/A) gives # bits for weights/activations.

Thus far, we have examined the effectiveness of the proposed quantization technique on a relatively simple problem. We further extended the examination by quantizing DarkNet-19 [Redmon and Farhadi, 2016] on ImageNet/ILSVRC2012 [Deng et al., 2014]. We used the same training methods as described in [Redmon and Farhadi, 2016], but with only random crops and random horizontal flips as data augmentation. During quantization we used a trained full-precision ResNet-50 as the teacher and applied label refinery [Bagherinezhad et al., 2018] with it. Label refinery is similar to network distillation but avoids tuning a temperature hyper-parameter. Like before, the first and last layer were not quantized to low precision but left in full precision. In all other layers the weights and activations were quantized to low precision. Models were trained on eight V100 GPUs using distributed data parallel and the V100 tensor cores (mixed precision training).

We show the top-1 and top-5 accuracy for quantized models at different low precisions in Table 3. Thanks to the teacher model and the small effect of low-precision quantization on validation accuracy, all models except the ternary-weight model (Q25) achieve better accuracy than the full-precision model trained from scratch. Even for the ternary-weight model we observe only a moderate drop in accuracy (2.4%/1.3% in top-1/top-5).

Network #bits / weight #bits / act. Init. net Top-1 (%) Top-5 (%) Diff (%)
FP0 32 (float) 32 (float) - 72.3 90.7 0.0/0.0
Q88 8 8 FP0 73.7 91.6 -1.4/-0.9
Q77 7 7 Q88 73.8 91.7 -1.5/-1.0
Q66 6 6 Q77 73.8 91.6 -1.5/-0.9
Q55 5 5 Q66 73.4 91.4 -1.1/-0.7
Q45 4 5 Q55 73.0 91.3 -0.7/-0.6
Q35 3 5 Q45 72.6 90.9 -0.3/-0.2
Q25 2 5 Q35 69.9 89.4 2.4/1.3
Table 3: Quantized DarkNet-19 on ImageNet

4.2 Keyword spotting with the Google speech commands dataset

We first evaluate FQ-Conv layers on the Google speech commands dataset [Warden, 2017], a typical benchmark dataset for edge applications. The dataset consists of 65K audio clips, each 1 s long, of 30 keywords uttered by thousands of different people. The goal is to classify each audio clip into one of 10 keyword categories (e.g. "Yes", "No", "Left", "Right", etc.), a "silence" category (i.e. no spoken word, but background noise is possible) or an "unknown" category (i.e. a class consisting of the remaining 20 keywords from the dataset). The dataset was split into 80% training, 10% validation and 10% test data, based on the SHA1-hashed names of the audio clips. Following Google's preprocessing procedure, we add background noise to each training sample with a probability of 0.8, where the type of noise is randomly sampled from the background noises provided in the dataset. We also add random time shifts. From the augmented audio samples, 39-dimensional Mel-Frequency Cepstrum Coefficients (MFCCs; 13 MFCCs and their first- and second-order deltas) are then constructed using a 20 ms sliding window, shifted by 10 ms. These spectral features provide the inputs into the network.
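A sketch of this feature-extraction step; the use of librosa and the exact parameter names are our choices and are not prescribed by the paper:

```python
import numpy as np
import librosa

def mfcc_features(waveform, sr=16000):
    """39-dim features: 13 MFCCs plus first- and second-order deltas,
    computed on 20 ms windows shifted by 10 ms (sketch)."""
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13,
                                n_fft=int(0.020 * sr), hop_length=int(0.010 * sr))
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.concatenate([mfcc, d1, d2], axis=0)   # shape: (39, num_frames)
```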

Figure 2: Keyword spotting network architecture.

The network for this application is illustrated in Figure 2. It was developed to have low computational and memory complexity, while still being accurate. The MFCC components are first fed into a small full-precision fully-connected layer (N=100 units). This small (3.9K weights/MACs) layer serves as an expansive embedding of the spectral features, such that no input-feature information is lost after quantizing this layer's output. The output of this layer is batch normalized and quantized to 4 bits before entering the quantized CNN (QCNN) with 7 FQ-Conv layers. Each FQ-Conv layer is a 1D convolutional layer (45 filters, filter length 3), with no zero-padding applied. To widen the receptive fields of the units in the final FQ-Conv layer, we employ dilated filters with exponentially increasing dilation across layers, as shown in Figure 2. The output of the QCNN is global-average pooled before entering the final softmax layer. The network contains 50K parameters and computes 3.5M MACs per sample.
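For illustration, a full-precision sketch of this topology in PyTorch; quantization is omitted, and the exact dilation schedule is our assumption read off Figure 2, chosen so that roughly 100 input frames are not consumed entirely by the unpadded convolutions:

```python
import torch
import torch.nn as nn

class KWSNet(nn.Module):
    """Full-precision sketch of the keyword-spotting topology described above."""
    def __init__(self, n_mfcc=39, embed=100, channels=45, n_classes=12,
                 dilations=(1, 1, 2, 2, 4, 4, 8)):          # assumed schedule
        super().__init__()
        self.embed = nn.Linear(n_mfcc, embed)               # small FP embedding (~3.9K weights)
        layers, in_ch = [], embed
        for d in dilations:                                  # 7 conv layers, no zero-padding
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3, dilation=d),
                       nn.BatchNorm1d(channels), nn.ReLU()]
            in_ch = channels
        self.convs = nn.Sequential(*layers)
        self.classifier = nn.Linear(channels, n_classes)     # 10 keywords + silence + unknown

    def forward(self, x):                                    # x: (batch, frames, n_mfcc)
        x = self.embed(x).transpose(1, 2)                    # -> (batch, embed, frames)
        x = self.convs(x)
        x = x.mean(dim=2)                                    # global average pooling over time
        return self.classifier(x)                            # logits for the softmax layer
```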

The network was implemented in Pytorch and trained with the ADAM optimizer on Nvidia Tesla V100 GPUs for 600 epochs (batch size of 100). For the full-precision network, the initial learning rate was set to 0.01 and exponentially decayed (decay factor=0.98; network randomly initialized). The network with best performance on the validation set was retained (94.3% on test set). The full-precision (FP) network served as the initial teacher network and as initialization for the gradual quantization. Each time we obtained a more accurate network on the validation dataset, the more accurate network became the teacher for subsequent networks.

For gradual quantization, we used the quantization sequence presented in Table 4. The table shows the accuracy on the test dataset for each step in the gradual-quantization sequence, where each step is defined by the number of bits used for the weights and activations. The final quantized network has ternary weights and 4-bit activations and a 94.26% accuracy on the test set, on par with the full-precision network.

Figure 3: Replacing BN+ReLU by a learned quantized ReLU
Network #bits/weight #bits/activ. Initializing network Trainer network Test accuracy (%)
FP 32 (float) 32 (float) - - 94.3
Q66 6 6 FP FP 94.42
Q45 4 5 Q66 Q66 94.68
Q35 3 5 Q45 Q45 94.97
Q24 2 4 Q35 Q45 94.26
FQ24 2 4 Q24 Q45 93.81
Table 4: Quantized keyword spotting network training sequence

In the previous networks, each quantized convolution was followed by a BN+ReLU. Next, we replaced the BN+ReLUs with quantized ReLUs (3.1; Figure 3), turning each convolution into a fully quantized conv layer. To do so, we initialized the network consisting of FQ-Conv layers with the final parameters obtained with gradual quantization and fine-tuned the network (learning rate=0.0005; decay=0.98; 600 epochs). The final network with the fully quantized CNN has a test accuracy of 93.81% (Table 4), almost as good as the full-precision network with BN, and outperforming several of the larger and higher-bitwidth models in the literature [Zhang et al., 2017, Tang and Lin, 2018].

Model Test accuracy (%) # params Size (Byte) Mult.
trad-fpool13 90.5 1.37M 5.48M 125M
tpool2 91.7 1.09M 4.36M 103M
one-stride1 77.9 954K 3.82M 5.76M
res15 95.8 238K 952K 894M
res15-narrow 94.0 42.6K 170K 160M
Q35 94.97 50K 18.75K 3.5M
FQ24 93.81 50K 12.5K 3.5M
Table 5: Comparison of different keyword spotting models

In Table 5, we compare our best and final low-precision networks to some of the best full-precision models reported in the literature [Sainath and Parada, 2015, Tang and Lin, 2018]. Our models have a much smaller memory footprint, require significantly fewer operations and perform very competitively in terms of accuracy.

4.3 Visual object classification with CIFAR-100

Figure 4: Quantized ResNet architecture: A. General architecture, B. Fully quantized architecture.

We conducted further studies on the CIFAR-100 dataset, using a ResNet-32 network. The CIFAR-100 dataset comprises 50K training and 10K testing 32 x 32 RGB images in 100 classes. All images were normalized to zero mean and unit standard deviation. For data augmentation, we performed random horizontal flips and random crops from images zero-padded with 4 pixels on each side.
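A sketch of this augmentation pipeline with torchvision; the normalization statistics shown are the commonly used CIFAR-100 values, not numbers taken from the paper:

```python
from torchvision import transforms

# Training-time preprocessing: random crop from zero-padded images, random flip,
# then per-channel normalization to (approximately) zero mean and unit std.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),      # crops from 4-pixel zero-padded images
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5071, 0.4865, 0.4409),
                         std=(0.2673, 0.2564, 0.2762)),
])
```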

The ResNet-32 architecture is shown in Figure 4A. It consists of a first convolutional layer, followed by BN and ReLU. The output of this layer is fed into three ResBlocks with increasing numbers of filters (64 to 256). Each ResBlock consists of five subblocks with standard residual architecture [He et al., 2015], using a 1x1 convolution + BN for downsampling between ResBlocks. When quantizing the network, all convolutional layers were quantized (the pooling and softmax layer were left in full precision). The overall architecture of the fully quantized ResNet-32 is shown in Figure 4B. Note that we also quantized the first conv layer and the 1x1 convolutions in the residual connections. Moreover, the input images are also quantized to lower precision, using learned quantization, before entering the first quantized conv layer.

The network was implemented in Pytorch and trained with SGD with Nesterov Momentum (0.9 momentum) on V100 GPUs for 200 epochs (batch size of 128), applying a small amount of weight decay (5E-4). We decayed the learning rate by 0.2 after 60, 120 and 180 epochs and report the final test accuracy. For the initial full-precision network, the initial learning rate was set to 0.1, but an initial learning rate of 0.01 was used for gradual quantization and fine-tuning. To obtain a good teacher network, we first trained a full-precision (FP) network from random initialization (top-1: 77.94%; top-5: 94.43%), then trained an 8-bit network with the FP-network as initialization and teacher (top-1: 79.82%; top-5: 94.50%), and finally trained again an FP-network with the 8-bit network as initialization and teacher (top-1: 79.81%; top-5: 95.09%). This final FP-net served as teacher throughout subsequent analyses.
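A sketch of this optimizer and learning-rate schedule; the epoch-loop helper is user-supplied and not from the paper:

```python
import torch

def train_cifar100(model, train_one_epoch, epochs=200, base_lr=0.1):
    """SGD with Nesterov momentum, weight decay 5e-4, and the step schedule
    described above (x0.2 after epochs 60, 120 and 180). train_one_epoch is a
    user-supplied function that runs one epoch of updates."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9,
                                nesterov=True, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[60, 120, 180], gamma=0.2)
    for _ in range(epochs):
        train_one_epoch(model, optimizer)
        scheduler.step()
```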

For gradual quantization of ResNet-32, we used the quantization sequence presented in Table 6. The final quantized network has ternary weights and 5-bit activations with a top-1 accuracy of 76.80% and a top-5 accuracy of 93.53%.

Network #bits / weight #bits / act. Init. net Trainer net top 1 acc.(%) top 5 acc. (%)
FP0 32 (float) 32 (float) - - 77.94 94.43
Q88 8 8 FP0 FP0 79.82 94.50
FP1 32 (float) 32 (float) Q88 Q88 79.81 95.09
Q66 6 6 Q88 FP1 78.54 94.58
Q55 5 5 Q66 FP1 78.38 94.18
Q45 4 5 Q55 FP1 77.96 94.26
Q35 3 5 Q45 FP1 77.31 93.90
Q25 2 5 Q35 FP1 76.80 93.53
FQ25 2 5 Q25 FP1 76.89 94.32
Table 6: Gradual Quantization of ResNet-32 on CIFAR-100

In the previous networks, each quantized convolution was followed by a BN+ReLU or BN (Figure 4A). To obtain the fully quantized network structure presented in Figure 4B, we next replaced each BN+ReLU with a quantized ReLU and each isolated BN with a learned quantization function with clipping lower bound b set to -1 (3.1). Subsequently, we initialized the network consisting of FQ-Conv layers with the final parameters obtained from gradual quantization and fine-tuned the network. The final network with the fully quantized CNN, without high-precision BN and ReLUs, obtains a top-1 accuracy of 76.89% and a top-5 accuracy of 94.32% (Table 6), close to the FP network trained from random initialization.

It has been shown in previous studies that the first conv-layer (with quantized inputs) and the residual connections are not easily quantized to low-precision without sacrificing too much accuracy [Courbariaux et al., 2016, Rastegari et al., 2016, Cai et al., 2017, Choi et al., 2019, Baskin et al., 2018, Anderson and Berg, 2017]. Hence it is likely that one can quantize the activations to lower than 5 bits, while retaining high accuracy, if one were to give higher precision to these critical paths compared to the other conv-layers. In this work, we demonstrate the principle and use the extreme case of completely uniform conv-blocks, each with ternary weights and 5-bit activations, and obtain accuracies close to the full-precision network when trained from scratch. Depending on the particular application and hardware, one can adjust the bitwidths in different blocks for optimal performance.

4.4 Network performance with additional noise

In a final experiment, we examined the effect of adding noise (on top of the quantization noise) to the weights, activations and outcomes of the convolutions (MACs) on the accuracy of the KWS and CIFAR-100 networks. In the context of typical analog accelerators, adding noise to the weights, activations and MACs corresponds to noisy memory cells, DACs and ADCs, respectively. Exploring the entire noise space is impossible, so we restrict our exploration to a set of physically plausible noise values, including relatively low and high noise levels. Specifically, we added zero-mean Gaussian noise to the different elements of the network. The amount of noise is quantified by its standard deviation σ, which is expressed as a percentage of the least significant bit (LSB). In other words, σ is a percentage of the quantization interval. We used the ternary networks for these experiments.
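A sketch of how such LSB-relative noise could be injected; the function and argument names are ours:

```python
import torch

def add_lsb_noise(x, sigma_pct, lsb):
    """Add Gaussian noise whose standard deviation is a percentage of the LSB.

    x         : tensor of weights, activations or MAC results
    sigma_pct : noise level as a percentage of the LSB (e.g. 25.0)
    lsb       : size of one quantization interval (e.g. the learned scale divided by n)
    """
    sigma = (sigma_pct / 100.0) * lsb
    return x + sigma * torch.randn_like(x)
```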

Table 7 presents the network accuracy for different levels of weight, activation and MAC noise for the KWS and CIFAR-100 datasets. We examined network accuracy under two conditions: with or without training with noise. For each condition, we averaged accuracy across ten repetitions (with different noise realizations) of the test set.

As expected, small amounts of noise had little influence on network accuracy, but larger amounts of noise clearly lowered test accuracy. However, by training with noise, we could recover much of the accuracy drop (Table 7).

Dataset KWS CIFAR-100
Baseline (No added noise) 94.3% 76.8%
Test Condition Not trained with noise Trained with noise Not trained with noise Trained with noise
94.3% 94.4% 76.9% 77%
94.2% 94.6% 76.6% 76.9%
93.1% 94% 73.8% 75.4%
79.7% 91.6% 65.1% 72.5%
38.8% 87.7% 34.8% 69.2%
Table 7: Effect of noisy weights, activations and MAC results on the accuracy of ternary networks.

5 Conclusion

This paper presents a novel learned quantization method, a new gradual quantization training strategy and an approach to eliminate high-precision BN and nonlinearities from the network. The result is a network consisting of convolutional layers, in which the weights, inputs and outputs are fully quantized to low precision and high-precision BN and nonlinearities are removed. The accuracy of this low precision network closely approximates that of its full-precision equivalent, which includes BN and higher precision nonlinearities. These low-precision networks are ideal to run in a memory-, computationally-, and energy-efficient way on modern neural-network-accelerator hardware and microcontrollers. Although such low-precision networks can improve the efficiency of both digital and analog accelerator designs, we believe that analog designs especially will benefit from them. For example, in contrast to digital accelerators, the summation of weighted activations in the analog domain has virtually infinite precision and comes at no additional cost, i.e. no higher-precision accumulators are required. Thus, quantizing the inputs, weights and MAC results is sufficient. We further show that these quantized networks tolerate noise quite well. Consequently, these findings suggest that networks implemented on analog arrays can be accurate, fast and efficient.

References

  1. Equivalent-accuracy accelerated neural-network training using analogue memory. Nature 558 (7708), pp. 60.
  2. The high-dimensional geometry of binary neural networks. arXiv preprint arXiv:1705.07199.
  3. Label refinery: improving ImageNet classification through label progression. arXiv preprint arXiv:1805.02641.
  4. NICE: noise injection and clamping estimation for neural network quantization. arXiv preprint arXiv:1810.00162.
  5. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48.
  6. Deep learning with low precision by half-wave Gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5918–5926.
  7. Accurate and efficient 2-bit quantized neural networks. Technical report.
  8. Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
  9. ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575.
  10. Temperature-insensitive analog vector-by-matrix multiplier based on 55 nm NOR flash memory cells. In 2017 IEEE Custom Integrated Circuits Conference (CICC), pp. 1–4.
  11. Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
  12. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
  13. Neural networks for machine learning. Coursera [video lectures].
  14. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  15. MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
  16. Quantized neural networks: training neural networks with low precision weights and activations. The Journal of Machine Learning Research 18 (1), pp. 6869–6898.
  17. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
  18. In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1–12.
  19. Learning to quantize deep networks by optimizing quantization intervals with task loss. arXiv preprint arXiv:1808.05779.
  20. Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605.
  21. Training binary neural networks with knowledge transfer. Neurocomputing.
  22. Ternary weight networks. arXiv preprint arXiv:1605.04711.
  23. Deep neural networks are robust to weight binarization and other non-linear distortions. arXiv preprint arXiv:1606.01981.
  24. 14.5 Envision: a 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI. In 2017 IEEE International Solid-State Circuits Conference (ISSCC), pp. 246–247.
  25. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668.
  26. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542.
  27. YOLO9000: better, faster, stronger. arXiv preprint arXiv:1612.08242.
  28. Convolutional neural networks for small-footprint keyword spotting.
  29. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
  30. Deep residual learning for small-footprint keyword spotting. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5484–5488.
  31. Speech commands: a public dataset for single-word speech recognition. Dataset available from http://download.tensorflow.org/data/speech_commands_v0.01.
  32. Binarized neural networks on the ImageNet classification task. arXiv preprint arXiv:1604.03058.
  33. Deep neural network compression with single and multiple level quantization. In Thirty-Second AAAI Conference on Artificial Intelligence.
  34. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, pp. 2365–2369.
  35. LQ-Nets: learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 365–382.
  36. Hello edge: keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128.
  37. Trained ternary quantization. arXiv preprint arXiv:1612.01064.