FQConv: Fully Quantized Convolution for Efficient and Accurate Inference
1 Introduction
In recent years, there has been a surge of interest in designing accelerator hardware for neural networks. These neural-network accelerators aim to improve the speed and energy efficiency with which the billions of operations in DNNs are performed. The design of NN accelerators often goes hand in hand with optimizations at the algorithmic level. Such algorithmic optimizations include changing the structure of the network [He et al., 2015, Howard et al., 2017], network pruning [LeCun et al., 1990, Han et al., 2015], dimensionality reduction of weight matrices [Xue et al., 2013], and combinations thereof. Moreover, each algorithm requires some sort of quantization of its values before being mapped on chip and, if quantized to low precision, this can produce several additional hardware benefits. For instance, low-precision quantization significantly reduces the memory footprint of the network, thereby reducing the on-chip memory and memory transfers needed. Furthermore, extreme quantization can simplify computations considerably: e.g. a network with ternary weights (i.e. −1, 0 or 1) involves only additions, no computationally expensive multiplications. However, very low-precision quantized networks often imply a reduction in accuracy. Also, low-precision DNNs still entail some higher-precision computations, like Batch Normalization (BN) and nonlinear activation functions, which require changing numerical formats between hardware operations and cost extra silicon area and energy. Hence, quantization techniques that maintain good accuracy are required.
Most of today's DNN accelerators operate in the digital domain [Jouppi et al., 2017, Moons et al., 2017], but recently there has been an increase in the number of analog designs [Ambrogio et al., 2018, Guo et al., 2017]. Some of the analog accelerators promise to mitigate the classical von Neumann bottleneck by performing most of the computations in the memory. For example, one type of analog hardware implementation uses Ohm's law to perform multiplications in the memory elements of a crossbar array. Here, the weights are encoded as conductances, but only a limited number of conductance levels can be stored in each memory device. Following multiplication, the charges are accumulated on the summation line using Kirchhoff's current law, equivalent to the summation of the weighted activations in dot products. Although promising, such analog compute-in-memory also poses several challenges. This is because the devices that store the weights (memory cells), generate the input activations (e.g. DACs) and convert the analog summed signal back to the digital domain (ADCs) are often noisy and require low-precision quantization to be usable or efficient. For these analog accelerators to work, it is therefore crucial that neural networks perform accurately under noisy, low-precision conditions.
This paper makes the following contributions: (1) We propose an effective quantization technique that learns to optimally quantize the weights and activations of the network during training. (2) We train the network to low precision using a new training technique, called gradual quantization. (3) We show that our proposed quantization technique compares favorably to existing techniques. (4) We present a method to remove the higher-precision nonlinearities and BN from the network. (5) We demonstrate the potential of this approach on two additional datasets, the Google speech commands dataset and CIFAR100, showing that ternary-weight (2-bit) CNNs with low-precision inputs and outputs and no higher-precision BN or nonlinearity perform comparably to their full-precision equivalents with BN. (6) We show that these networks can handle moderate amounts of noise on the weights, activations and outputs of the convolution, making them suitable for analog accelerators.
2 Related Work
Learned quantization.
In recent years, a wide range of quantization methods have been proposed, including methods that ternarize the network weights [Li et al., 2016, Zhu et al., 2016] or binarize the weights and activations of DNNs [Courbariaux et al., 2016, Rastegari et al., 2016, Hubara et al., 2017]. Several of these methods use the statistical distributions of the weights and activations to derive good quantization schemes [Li et al., 2016, Zhu et al., 2016, Cai et al., 2017]. This approach is sound from an information-theoretic point of view and often works well, but may be suboptimal from a DNN perspective: it may not yield the best quantized solution for the network as a whole, because quantization often happens after the network has been trained in full precision, without letting the quantized network choose the quantized values it would prefer. Furthermore, the distributional assumptions may not be fully accurate and may change across datasets, across network layers and during training, complicating the statistical approach. Consequently, recent studies have proposed to learn the optimal quantization during training [Zhang et al., 2018, Jung et al., 2018]. Our proposed method is most similar to the recently proposed PACT method for quantizing the outputs of ReLUs [Choi et al., 2019]. In that paper, the authors parametrically learn the clipping range of the ReLU function for optimal quantization. Our method differs from theirs in that our learned quantization does not have zero gradients for values in the clipping range and can be applied at any position inside the network: to the ReLUs, but also to the weights, to the immediate linear outputs of convolutions, and even to the inputs (e.g. images) of the DNN.
In the experiments presented below we demonstrate the generality of our quantization method.
Gradual quantization.
Quantizing DNNs to low precision puts strong constraints on the network, constraints that the network often finds hard to adjust to, as evidenced by decreased accuracy. Previous work has tried to ease the transition to low precision by quantizing different parts of the network at different stages of training, e.g. by quantizing the initial layers of a DNN before the later layers [Baskin et al., 2018] or by quantizing different parts within layers at different stages [Xu et al., 2018]. The rationale is to give the remaining full-precision parts of the network the chance to compensate for the quantization in other network parts. In contrast to these methods, our proposed gradual quantization method quantizes the entire network at once, but gradually lowers the bitwidth of the weights and activations inside the network. Our method is motivated by the observation that it is relatively easy to quantize at high bitwidths and that networks with lower precision can likely learn from networks with slightly higher precision.
Noise resilience of neural networks.
DNNs have a special relationship with noise. For example, both dropout [Srivastava et al., 2014] and batch normalization [Ioffe and Szegedy, 2015] add noise to the weights or activations during training, thereby improving generalization performance of the network. Importantly, both techniques deactivate the noise source during inference, which is different from what happens in analog accelerators where noise is inherent to the circuitry and thus also present during inference. Other studies have examined the effect of noise on the network weights [Merolla et al., 2016], finding that (weight) quantized models better withstand weight noise during inference compared to their fullprecision counterparts. Here, we examine the influence of noise on the weights, activations and convolution outputs (MACs) and propose a technique to improve network performance under noisy conditions.
3 The Proposed Approach
Overview.
We aim to train convolutional layers of a CNN in which inputs, weights and outputs are fully quantized to low-precision numerical values, in which no higher-precision BN and nonlinearities need to be computed, and for which the resulting CNN performs at high accuracy. To achieve these objectives, we combine three methods during network training: (1) we propose a novel learned quantization technique; (2) we present a new training technique that improves the accuracy of quantized networks; (3) we combine the previous methods with network distillation for best accuracy. In the following sections we present each method in detail and finally discuss how we eliminate the higher-precision BN and nonlinearities from the network.
3.1 Learned quantization.
We seek to quantize the inputs, outputs and weights of a convolutional layer in an optimal way, using a quantization method that does not rely on any distributional assumptions, can be used for all elements of the network (weights and activations), and gives the network the chance to learn the quantization that is optimal for the entire network, i.e. end-to-end. We will now discuss this in detail.
Uniform quantization requires a range in which the values are quantized. Crucially, we do not know if the network relies most on extreme values (i.e. in the tails of the distribution) or small values. Also, the optimal quantization range may change across layers and may be different for weights and activations. We therefore instruct the network to learn the quantization range during training. To do so, we introduce a learnable scale factor in the quantization process. This scale factor can differ per layer and for weights and activations. Our method can be summarized by the following two equations. First, we employ a uniform quantization rule:
(1)  Q(x) = round(n · clip(x, b, 1)) / n

Here, x can be weights or activations; b is a lower bound, equal to −1 for quantizing weights, linear outputs of convolutions and inputs to CNNs, and equal to zero for quantized ReLUs (see below); and n is the number of positive quantization levels, which is 2^(k−1) − 1 for a bitwidth of k bits (e.g. n = 1 for 2-bit ternary weights). Thus, we force the quantization to happen in the range [b, 1]. The quantization scale is then parametrized as follows:

(2)  x_q = e^s · Q(x / e^s)

where s is a learnable scale parameter. So, the learnable scale parameter first scales the non-quantized values so that they can be clipped to the standardized range [b, 1] of the quantization function (1), after which the quantized result is scaled back to its original range. Therefore, the network can learn whichever range is most optimal for quantization. Note that we employ e^s, i.e. the exponential of s. This function is differentiable in s and forces the scaling to be positive. Positive scaling is preferred because otherwise the scaling could, in addition to the network weights, change the sign of the weights and activations, causing training instabilities. Furthermore, positive scaling avoids division by zero.
The quantization function involves a non-differentiable rounding function, which causes problems when learning the scaling parameter with backpropagation. To mitigate this issue, we employ the straight-through estimator (STE) approach, as introduced by [Courbariaux et al., 2016, Hinton, 2012]. The STE passes the gradient through the non-differentiable rounding function, essentially ignoring it in the backward pass. In agreement with [Courbariaux et al., 2016], we also keep a copy of the non-quantized weights during training and update these non-quantized weights based on the gradient of the quantized weights. The final quantized weights are obtained by quantizing this copy.
During the experiments described below, we will use this quantization procedure, allowing each convolutional layer to learn to optimally quantize its weights and activations.
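As a concrete illustration, the quantization of equations (1) and (2) can be sketched in a few lines of plain Python. This is a forward-pass sketch only; function and argument names are ours, and in training the round() would be bypassed in the backward pass via the STE described above.

```python
import math

def quantize(x, n, b):
    """Eq. (1) sketch: clip x to [b, 1], then round onto n uniform positive
    levels. b = -1 for weights/conv outputs/inputs, b = 0 for quantized ReLUs."""
    xc = min(max(x, b), 1.0)
    return round(xc * n) / n

def learned_quantize(x, s, n, b):
    """Eq. (2) sketch: scale by exp(s), quantize, scale back.
    s is the learnable log-scale; exp keeps the scale positive."""
    scale = math.exp(s)
    return scale * quantize(x / scale, n, b)

# 2-bit ternary example: n = 2**(2-1) - 1 = 1, levels {-1, 0, 1}.
print(learned_quantize(0.9, 0.0, 1, -1.0))            # quantized to 1.0
print(learned_quantize(1.2, math.log(2.0), 1, -1.0))  # learned range [-2, 2] -> 2.0
```

With s = 0 the quantization range is the standard [−1, 1]; with s = log 2 the same ternary levels cover [−2, 2], which is exactly the range-learning freedom the scale parameter provides.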
3.2 Gradual quantization
It is well known that poor initialization can cause the network to learn slowly and converge to suboptimal solutions, exhibiting low accuracy. We found this especially true for networks that are quantized at different positions (inputs, weights and outputs) and to very low precision (e.g. ternary weights). This is likely because the quantization function (1) is essentially a saturating nonlinearity, and because a too wide or too narrow initial quantization range effectively collapses all values onto a single quantized value. Both factors increase the likelihood of small gradients during training, and thus poor network convergence. To lessen these issues, and hence improve network convergence, we found it beneficial to gradually lower the bitwidth of the quantization. The general procedure is illustrated in Figure 1. Specifically, we start by training a full-precision network and use that network's trained parameters to initialize a network that is subsequently trained with lower-bitwidth (e.g. 8-bit) weights and activations. We then use the parameters of this network to initialize another network with an even lower bitwidth (e.g. 5 bits). We continue this procedure until the desired low-bitwidth network is obtained.
Gradual quantization is akin to curriculum learning [Bengio et al., 2009], which has been shown to induce better network convergence and improved generalization behavior. Gradual quantization facilitates training, likely because the initialization with networks with similar bitwidths primes the network under training with effective quantization ranges, not too wide or too narrow, such that gradients are larger and learning can happen effectively.
Training with gradual quantization takes longer than training once from random initialization, but once the network has been quantized to low precision at high accuracy, it can be deployed and used indefinitely for inference without any additional cost. Finally, note that similar gradual strategies can help simplify networks in other ways besides decreasing the bitwidth (see 3.4).
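The staged procedure can be sketched as a simple schedule, where each stage records its bitwidth and which network initializes it. The bit sequence and stage names below are illustrative, not prescribed by the paper.

```python
def positive_levels(bits):
    # n = 2**(k-1) - 1 positive quantization levels for k bits (Eq. 1);
    # 2 bits gives n = 1, i.e. the ternary levels {-1, 0, 1}.
    return 2 ** (bits - 1) - 1

def gradual_schedule(bit_seq=(8, 6, 5, 4, 3, 2)):
    """Each stage is initialized from the previous stage's trained weights;
    the first stage starts from a trained full-precision (FP) network."""
    stages, prev = [], "FP"
    for k in bit_seq:
        stages.append({"bits": k, "levels": positive_levels(k), "init_from": prev})
        prev = "Q%d" % k
    return stages

for stage in gradual_schedule():
    print(stage)
```

At each stage one would train to convergence before moving to the next, ending with the ternary (2-bit, one positive level) network.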
3.3 Network distillation
Gradual quantization is a form of transfer learning: it learns from the higherbitwidth network how to deal with low bitwidths. To further improve the network accuracy, we also include another form of transfer learning, called network distillation. Network distillation was introduced by [Hinton et al., 2015] to train smaller networks based on larger networks and has subsequently also been used to improve quantized networks [Baskin et al., 2018, Polino et al., 2018, Wu et al., 2016, Leroux et al., 2019]. We use the same methods as in [Hinton et al., 2015]. In short, this technique uses a teacher network to train a student, in our case the lowprecision quantized network, by supplying the student network with soft labels, i.e. the output probabilities of the teacher network. The soft labels contain more generalization information than the onehot training labels (e.g. that a salmon and a goldfish are alike in the sense that they are both fish), which improves test accuracy. This is especially useful for datasets like CIFAR100, which consist of classes that are subdivided into smaller subclasses. Note that the teacher network does not have to be the fullprecision network.
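The soft-label loss can be sketched as a plain-Python cross-entropy against the teacher's softened outputs, following the Hinton et al. [2015] recipe; the temperature T and the function names here are ours.

```python
import math

def softmax(logits, T=1.0):
    # Softened class probabilities; a higher temperature T spreads
    # probability mass over more classes, exposing class similarities.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy of the student's distribution against the teacher's
    soft labels; it is minimized when the student matches the teacher."""
    teacher_p = softmax(teacher_logits, T)
    student_p = softmax(student_logits, T)
    return -sum(t * math.log(s) for t, s in zip(teacher_p, student_p))
```

In practice this term is combined with the ordinary one-hot cross-entropy on the training labels.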
3.4 Removing the higherprecision BN and nonlinearity
Batch normalization is used to stabilize the first- and second-order statistics of the activations during training, giving the network the chance to focus learning on more interesting higher-order statistics, and it improves network convergence during training and generalization behavior during testing [Ioffe and Szegedy, 2015]. Batch normalization is performed as y = γ · (x − μ)/σ + β, where μ and σ are the mean and standard deviation of the minibatch, respectively, and γ and β are a learned scale and shift parameter. For inference, μ and σ are replaced by their corresponding estimates, μ̂ and σ̂, based on the complete training dataset. For inference, the BN equation can therefore be simplified as:
(3)  y = γ′ · x + β′

where γ′ = γ/σ̂ and β′ = β − γ · μ̂/σ̂. In other words, for inference we only require one scale and one shift factor. Given that the learned quantization method described in 3.1 already has a scale factor, we can absorb the BN scale factor into the quantization scale factor. We further find that the shift factor doesn't contribute much to overall accuracy if we train the network to adapt to its absence (see below). This shows that it is possible to remove the BN computations for inference purposes.
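As a sketch, the inference-time folding of BN into a single scale and shift can be written as follows (function and variable names are ours):

```python
def fold_bn(gamma, beta, mu_hat, sigma_hat):
    """Eq. (3): y = gamma_p * x + beta_p, with gamma_p = gamma / sigma_hat
    and beta_p = beta - gamma * mu_hat / sigma_hat."""
    gamma_p = gamma / sigma_hat
    beta_p = beta - gamma * mu_hat / sigma_hat
    return gamma_p, beta_p

# The folded form matches the full BN expression for any input x:
gamma, beta, mu_hat, sigma_hat = 0.5, 1.0, 1.0, 2.0
gamma_p, beta_p = fold_bn(gamma, beta, mu_hat, sigma_hat)
x = 2.0
assert abs((gamma_p * x + beta_p)
           - (gamma * (x - mu_hat) / sigma_hat + beta)) < 1e-12
```

The factor gamma_p can then be absorbed into the learned quantization scale of the preceding layer, leaving only the shift, which the retrained network learns to do without.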
Next, we observe that the quantization function (1) is a nonlinear function: when the clipping lower bound b is set to −1, the quantization function approximates a hard-tanh function, hardtanh(x) = max(−1, min(x, 1)). On the other hand, when the lower bound b is set to 0, the quantization function approximates a ReLU function clipped at 1, min(max(x, 0), 1). This indicates that we can use the quantization function as a nonlinear activation function.
In practice, we have found it necessary to first train the network to low precision with BNs and nonlinearities in place. Then, once low precision has been obtained, we initialize a new network with these trained parameters and retrain it after replacing each BN+ReLU combination with the learned quantized ReLU (clipping lower bound set to 0), and each isolated BN with the learned quantization function with clipping lower bound set to −1 (Figures 3, 4). During retraining, the learned scale parameters are allowed to change, so as to compensate for the new network structure.
The resulting FQConv layers have quantized inputs, convolve with quantized weights and return quantized outputs, which in turn become the inputs to subsequent FQConv layers. No higher-precision BN and activation functions need to be computed. We observe further that the high-precision scale parameters are only needed during training to allow the network to learn its optimal quantization. During inference, we can perform integer-valued convolutions. This follows from the definition of the quantization function and the linearity of the dot product:
(4)  Σ_i w_q,i · a_q,i = Σ_i s_w Q(w_i / s_w) · s_a Q(a_i / s_a) = (s_w · s_a)/(n_w · n_a) · Σ_i W_i · A_i

where w and a are weights and activations; b, n and s are defined as in equations (1) and (2) (writing s for the exponential e^s for clarity); and W_i and A_i are (signed) integer-valued weights and activations, i.e. W_i = round(n_w · clip(w_i / s_w, b, 1)), and analogously for A_i. Hence, the multiply-accumulates are performed with integer-valued numbers. Note that for ternary-weight convolutions, with W_i ∈ {−1, 0, 1}, only additions/subtractions are performed, and no multiplications. Moreover, the remaining scaling factor is not needed for active computation as long as the hardware-supported quantization method (e.g. look-up tables or analog-to-digital converters) puts the integer-valued sum into the correct integer-valued quantization bin, which becomes the input to the next layer.
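Equation (4) can be verified numerically: the dot product of the quantized values equals a single scale factor times a purely integer dot product. Below is a sketch with illustrative values, writing s_w and s_a for the (positive) scales and using b = −1 for weights and b = 0 for (ReLU-quantized) activations.

```python
import math

def q(x, n, b):
    # Eq. (1): clip to [b, 1] and round onto n positive levels.
    return round(min(max(x, b), 1.0) * n) / n

def float_dot(w, a, s_w, s_a, n_w, n_a):
    # Dot product of the quantized (real-valued) weights and activations.
    return sum(s_w * q(wi / s_w, n_w, -1.0) * s_a * q(ai / s_a, n_a, 0.0)
               for wi, ai in zip(w, a))

def int_dot(w, a, s_w, s_a, n_w, n_a):
    # Eq. (4): integer MACs, with one scale factor applied at the end.
    W = [round(min(max(wi / s_w, -1.0), 1.0) * n_w) for wi in w]  # in {-n_w..n_w}
    A = [round(min(max(ai / s_a, 0.0), 1.0) * n_a) for ai in a]   # in {0..n_a}
    return (s_w * s_a) / (n_w * n_a) * sum(Wi * Ai for Wi, Ai in zip(W, A))

# Ternary weights (n_w = 1), 3-bit-style activations (n_a = 7):
w, a = [0.7, -0.2, 0.9], [0.5, 0.8, 0.1]
assert math.isclose(float_dot(w, a, 1.0, 1.0, 1, 7),
                    int_dot(w, a, 1.0, 1.0, 1, 7))
```

The equality holds for any scales, since both sides round onto the same integer grid before the single final rescaling.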
The only scale factor that matters for active computation during inference is the one from the output quantization of the final FQConv layer. This scale factor is applied to the output of the final FQConv layer to bring the activations back to the scale expected by the global-average-pooling layer, which is performed in higher precision.
4 Experiments
4.1 Effectiveness of the proposed quantization technique
We first examined the effectiveness of the proposed quantization technique. For this purpose, we employed CIFAR10 with ResNet20, a configuration often used to benchmark quantization methods. We trained the network using standard hyperparameters from previous related studies (learning rate: 0.1, weight decay: 5e-4, 200 epochs, batch size: 128, with standard data augmentation). For proper comparison with the existing literature, we did not quantize the first and last convolutional layer in this analysis (although we do so in subsequent analyses) and report results on the validation set. We also quantized the 1x1 convolutions in the residual paths.
We quantized the network to various bitwidths using gradual quantization, from full precision down to 2-bit ternary networks (Table 1). We observed test accuracies above that of a full-precision network trained from scratch (FP0) at precisions above 3 bits, equal to it at 3-bit precision, and slightly below it at 2-bit precision.
We further compared test accuracy with and without gradual quantization (GQ). Without GQ, we initialized the network with FP0 parameters and used FP0 as teacher, then quantized immediately to the given low precision. We observed that GQ significantly improves the accuracy of the lowest-precision 3-bit and especially 2-bit networks (Table 1). It is likely that the 2-bit network accuracy without GQ could be improved with enough hyperparameter tuning; here we present an alternative, less error-prone technique: gradual quantization.
We next compared our results to the state of the art (Table 2). Our quantization method shows the lowest degradation relative to its FP baseline (DoReFa accuracies taken from [Li et al., 2016]). For 2-bit networks, LQ-Net has slightly higher overall accuracy, despite a larger degradation from its baseline than our method. The overall higher accuracy of LQ-Net may be caused by (1) its higher FP baseline, (2) the fact that LQ-Net quantizes weights per channel (vs. per layer in our method), i.e. it uses more learned parameters, and (3) its non-uniform quantization (vs. uniform quantization in our method). In sum, we observed that our proposed quantization technique performs well at low precision and compares favorably to existing methods.
Table 1: Gradual quantization (GQ) of ResNet20 on CIFAR10.

| Network | #bits/weight | #bits/act. | Init. net | Trainer net | Test acc. (%) | Test acc. no GQ (%) | Diff (%) |
|---|---|---|---|---|---|---|---|
| FP0 | 32 (float) | 32 (float) | – | – | 91.6 | – | – |
| Q88 | 8 | 8 | FP0 | FP0 | 92.6 | – | – |
| FP1 | 32 (float) | 32 (float) | Q88 | Q88 | 92.3 | – | – |
| Q66 | 6 | 6 | Q88 | FP1 | 92.6 | 92.5 | 0.1 |
| Q55 | 5 | 5 | Q66 | FP1 | 92.6 | 92.5 | 0.1 |
| Q44 | 4 | 4 | Q55 | FP1 | 92.2 | 92.1 | 0.1 |
| Q33 | 3 | 3 | Q44 | FP1 | 91.6 | 90.8 | 0.8 |
| Q22 | 2 | 2 | Q33 | FP1 | 89.9 | 10.0 | 79.9 |
Table 2: Comparison with the state of the art on CIFAR10/ResNet20.

| Name | Baseline (%) | Quantized (%) | Diff (%) |
|---|---|---|---|
| PACT-SAWB (W2/A2) | 91.5 | 89.2 | 2.3 |
| LQ-Net (W2/A2) | 92.1 | 90.2 | 1.9 |
| DoReFa (W2/A2) | 91.5 | 88.2 | 3.3 |
| GQ (W2/A2) | 91.6 | 89.9 | 1.7 |
| LQ-Net (W3/A3) | 92.1 | 91.6 | 0.5 |
| GQ (W3/A3) | 91.6 | 91.6 | 0.0 |
Thus far, we have examined the effectiveness of the proposed quantization technique on a relatively simple problem. We further extended the examination by quantizing DarkNet19 [Redmon and Farhadi, 2016] on ImageNet/ILSVRC2012 [Deng et al., 2014]. We used the same training methods as described in [Redmon and Farhadi, 2016], but with only random crops and random horizontal flips as data augmentation. During quantization we used a trained full-precision ResNet50 as the teacher and applied label refinery [Bagherinezhad et al., 2018] with it. Label refinery is similar to network distillation but avoids tuning a temperature hyperparameter. As before, the first and last layers were not quantized to low precision but left in full precision. In all other layers the weights and activations were quantized to low precision. Models were trained on eight V100 GPUs using distributed data parallelism and the V100 tensor cores (mixed-precision training).
We show the top-1 and top-5 accuracy for quantized models at different low precisions in Table 3. Thanks to the teacher model and the small effect of low-precision quantization on validation accuracy, all models except the ternary-weight model (Q25) achieve better accuracy than the full-precision model trained from scratch. Even for the ternary-weight model we observe only a moderate drop in accuracy (2.4% top-1 / 1.3% top-5).
Table 3: Gradual quantization of DarkNet19 on ImageNet (Diff = quantized − FP0, top-1/top-5).

| Network | #bits/weight | #bits/act. | Init. net | Top-1 (%) | Top-5 (%) | Diff (%) |
|---|---|---|---|---|---|---|
| FP0 | 32 (float) | 32 (float) | – | 72.3 | 90.7 | 0.0/0.0 |
| Q88 | 8 | 8 | FP0 | 73.7 | 91.6 | +1.4/+0.9 |
| Q77 | 7 | 7 | Q88 | 73.8 | 91.7 | +1.5/+1.0 |
| Q66 | 6 | 6 | Q77 | 73.8 | 91.6 | +1.5/+0.9 |
| Q55 | 5 | 5 | Q66 | 73.4 | 91.4 | +1.1/+0.7 |
| Q45 | 4 | 5 | Q55 | 73.0 | 91.3 | +0.7/+0.6 |
| Q35 | 3 | 5 | Q45 | 72.6 | 90.9 | +0.3/+0.2 |
| Q25 | 2 | 5 | Q35 | 69.9 | 89.4 | −2.4/−1.3 |
4.2 Keyword spotting with the Google speech commands dataset
We first evaluate FQConv layers on the Google speech commands dataset [Warden, 2017], a typical benchmark dataset for edge applications. The dataset consists of 65K audio clips, each 1 s long, of 30 keywords uttered by thousands of different people. The goal is to classify each audio clip into one of 10 keyword categories (e.g. "Yes", "No", "Left", "Right"), or a "silence" (i.e. no spoken word, but background noise is possible) or "unknown" (i.e. a class consisting of the remaining 20 keywords from the dataset) category. The dataset was split into 80% training, 10% validation and 10% test data, based on the SHA-1-hashed names of the audio clips. Following Google's preprocessing procedure, we add background noise to each training sample with a probability of 0.8, where the type of noise is randomly sampled from the background noises provided in the dataset. We also add random time shifts. From the augmented audio samples, 39-dimensional Mel-frequency cepstrum coefficients (MFCCs; 13 MFCCs and their first- and second-order deltas) are then constructed using a 20 ms sliding window shifted by 10 ms. These spectral features provide the inputs to the network.
The network for this application is illustrated in Figure 2. It was developed to have low computational and memory complexity, while still being accurate. The MFCC components are first fed into a small full-precision fully-connected layer (N=100 units). This small (3.9K weights/MACs) layer serves as an expansive embedding of the spectral features, such that no input-feature information is lost after quantizing this layer's output. The output of this layer is batch normalized and quantized to 4 bits before entering the quantized CNN (QCNN) with 7 FQConv layers. Each FQConv layer is a 1-D convolutional layer (45 filters, filter length 3), with no zero-padding applied. To widen the receptive fields of the units in the final FQConv layer, we employ dilated filters with exponentially increasing dilation across layers, as shown in Figure 2. The output of the QCNN is global-average pooled before entering the final softmax layer. The network contains 50K parameters and computes 3.5M MACs per sample.
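The receptive field gained from stacking dilated 1-D convolutions can be computed directly. The dilation pattern below is a hypothetical example of exponentially increasing dilation, not the exact pattern of Figure 2.

```python
def receptive_field(filter_len, dilations):
    """Receptive field (in input frames) of stacked dilated 1-D convolutions:
    each layer widens it by (filter_len - 1) * dilation frames."""
    rf = 1
    for d in dilations:
        rf += (filter_len - 1) * d
    return rf

# Hypothetical 7-layer pattern with doubling dilations:
rf = receptive_field(3, (1, 1, 2, 2, 4, 4, 8))
# With no zero-padding, each clip of ~99 frames (1 s, 20 ms window,
# 10 ms hop) shrinks by rf - 1 frames through the stack.
print(rf, 99 - (rf - 1))
```

This shows why dilation is needed: seven undilated length-3 filters would only reach 15 frames, far short of a 1 s utterance.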
The network was implemented in PyTorch and trained with the Adam optimizer on Nvidia Tesla V100 GPUs for 600 epochs (batch size 100). For the full-precision network, the initial learning rate was set to 0.01 and exponentially decayed (decay factor 0.98; network randomly initialized). The network with the best performance on the validation set was retained (94.3% on the test set). The full-precision (FP) network served as the initial teacher network and as the initialization for gradual quantization. Each time we obtained a more accurate network on the validation dataset, that network became the teacher for subsequent networks.
For gradual quantization, we used the quantization sequence presented in Table 4. The table shows the accuracy on the test dataset for each step in the gradualquantization sequence, where each step is defined by the number of bits used for the weights and activations. The final quantized network has ternary weights and 4bit activations and a 94.26% accuracy on the test set, on par with the fullprecision network.
Table 4: Gradual quantization sequence for the KWS network.

| Network | #bits/weight | #bits/act. | Initializing network | Trainer network | Test accuracy (%) |
|---|---|---|---|---|---|
| FP | 32 (float) | 32 (float) | – | – | 94.3 |
| Q66 | 6 | 6 | FP | FP | 94.42 |
| Q45 | 4 | 5 | Q66 | Q66 | 94.68 |
| Q35 | 3 | 5 | Q45 | Q45 | 94.97 |
| Q24 | 2 | 4 | Q35 | Q45 | 94.26 |
| FQ24 | 2 | 4 | Q24 | Q45 | 93.81 |
In the previous networks, each quantized convolution was followed by a BN+ReLU. Next, we replaced the BN+ReLUs with quantized ReLUs (Section 3.1) (Figure 3), turning each layer into a fully quantized conv layer. To do so, we initialized the network consisting of FQConv layers with the final parameters obtained with gradual quantization and fine-tuned it (learning rate 0.0005; decay 0.98; 600 epochs). The final network with the fully quantized CNN reaches 93.81% accuracy on the test set (Table 4), almost as good as the full-precision network with BN, and outperforms several of the larger and higher-bitwidth models in the literature [Zhang et al., 2017, Tang and Lin, 2018].
Table 5: Comparison with models from the literature on the speech commands dataset.

| Model | Test accuracy (%) | # params | Size (bytes) | Mults |
|---|---|---|---|---|
| tradfpool13 | 90.5 | 1.37M | 5.48M | 125M |
| tpool2 | 91.7 | 1.09M | 4.36M | 103M |
| onestride1 | 77.9 | 954K | 3.82M | 5.76M |
| res15 | 95.8 | 238K | 952K | 894M |
| res15narrow | 94.0 | 42.6K | 170K | 160M |
| Q35 | 94.97 | 50K | 18.75K | 3.5M |
| FQ24 | 93.81 | 50K | 12.5K | 3.5M |
In Table 5, we compare our best and final low-precision networks to some of the best full-precision models reported in the literature [Sainath and Parada, 2015, Tang and Lin, 2018]. Our models have a much smaller memory footprint, require significantly fewer operations, and are very competitive in accuracy.
4.3 Visual object classification with CIFAR100
We conducted further studies on the CIFAR100 dataset, using a ResNet32 network. The CIFAR100 dataset comprises 50K training and 10K test 32×32 RGB images in 100 classes. All images were normalized to zero mean and unit standard deviation. For data augmentation, we performed random horizontal flips and random crops from images zero-padded with 4 pixels on each side.
The ResNet32 architecture is shown in Figure 4A. It consists of a first convolutional layer, followed by BN and ReLU. The output of this layer is fed into three ResBlocks with increasing numbers of filters (64 to 256). Each ResBlock consists of five sub-blocks with the standard residual architecture [He et al., 2015], using a 1x1 convolution + BN for downsampling between ResBlocks. When quantizing the network, all convolutional layers were quantized (the pooling and softmax layers were left in full precision). The overall architecture of the fully quantized ResNet32 is shown in Figure 4B. Note that we also quantized the first conv layer and the 1x1 convolutions in the residual connections. Moreover, the input images are also quantized to lower precision, using learned quantization, before entering the first quantized conv layer.
The network was implemented in PyTorch and trained with SGD with Nesterov momentum (0.9) on V100 GPUs for 200 epochs (batch size 128), applying a small amount of weight decay (5e-4). We decayed the learning rate by 0.2 after 60, 120 and 180 epochs and report the final test accuracy. For the initial full-precision network, the initial learning rate was set to 0.1; an initial learning rate of 0.01 was used for gradual quantization and fine-tuning. To obtain a good teacher network, we first trained a full-precision (FP) network from random initialization (top-1: 77.94%; top-5: 94.43%), then trained an 8-bit network with the FP network as initialization and teacher (top-1: 79.82%; top-5: 94.50%), and finally trained another FP network with the 8-bit network as initialization and teacher (top-1: 79.81%; top-5: 95.09%). This final FP network served as teacher throughout subsequent analyses.
For gradual quantization of ResNet32, we used the quantization sequence presented in Table 6. The final quantized network has ternary weights and 5bit activations with a top1 accuracy of 76.80% and a top5 accuracy of 93.53%.
Table 6: Gradual quantization of ResNet32 on CIFAR100.

| Network | #bits/weight | #bits/act. | Init. net | Trainer net | Top-1 acc. (%) | Top-5 acc. (%) |
|---|---|---|---|---|---|---|
| FP0 | 32 (float) | 32 (float) | – | – | 77.94 | 94.43 |
| Q88 | 8 | 8 | FP0 | FP0 | 79.82 | 94.50 |
| FP1 | 32 (float) | 32 (float) | Q88 | Q88 | 79.81 | 95.09 |
| Q66 | 6 | 6 | Q88 | FP1 | 78.54 | 94.58 |
| Q55 | 5 | 5 | Q66 | FP1 | 78.38 | 94.18 |
| Q45 | 4 | 5 | Q55 | FP1 | 77.96 | 94.26 |
| Q35 | 3 | 5 | Q45 | FP1 | 77.31 | 93.90 |
| Q25 | 2 | 5 | Q35 | FP1 | 76.80 | 93.53 |
| FQ25 | 2 | 5 | Q25 | FP1 | 76.89 | 94.32 |
In the previous networks, each quantized convolution was followed by a BN+ReLU or BN (Figure 4A). To obtain the fully quantized network structure presented in Figure 4B, we next replaced each BN+ReLU with a quantized ReLU and each isolated BN with a learned quantization function with clipping lower bound b set to −1 (Section 3.1). Subsequently, we initialized the network consisting of FQConv layers with the final parameters obtained from gradual quantization and fine-tuned it. The final network with the fully quantized CNN, without high-precision BN and ReLUs, obtains a top-1 accuracy of 76.89% and a top-5 accuracy of 94.32% (Table 6), close to the FP network trained from random initialization.
Previous studies have shown that the first conv layer (with quantized inputs) and the residual connections are not easily quantized to low precision without sacrificing too much accuracy [Courbariaux et al., 2016, Rastegari et al., 2016, Cai et al., 2017, Choi et al., 2019, Baskin et al., 2018, Anderson and Berg, 2017]. Hence, it is likely that one could quantize the activations to fewer than 5 bits while retaining high accuracy by giving these critical paths higher precision than the other conv layers. In this work, we demonstrate the principle using the extreme case of completely uniform conv blocks, each with ternary weights and 5-bit activations, and obtain accuracies close to those of the full-precision network trained from scratch. Depending on the particular application and hardware, one can adjust the bit-widths in different blocks for optimal performance.
4.4 Network performance with additional noise
In a final experiment, we examined the effect of adding noise (on top of the quantization noise) to the weights, activations and outcomes of the convolutions (MACs) on the accuracy of the KWS and CIFAR-100 networks. In the context of typical analog accelerators, adding noise to the weights, activations and MACs corresponds to noisy memory cells, DACs and ADCs, respectively. Exploring the entire noise space is impossible, so we restricted our exploration to a set of physically plausible noise values, including relatively low and high noise levels. Specifically, we added zero-mean Gaussian noise to the different elements of the network. The amount of noise is quantified by its standard deviation σ, which is expressed as a percentage of the least significant bit (LSB); in other words, σ is a percentage of the quantization interval. We used the ternary networks for these experiments.
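The noise injection described above can be sketched as follows, with σ given as a percentage of the LSB; the function and parameter names are our own, and the same helper would be applied to weights, activations or MAC outputs.

```python
import random

def add_lsb_noise(values, lsb, sigma_pct, seed=0):
    """Add zero-mean Gaussian noise whose standard deviation is
    `sigma_pct` percent of the least significant bit `lsb`
    (i.e. of the quantization interval). Illustrative sketch."""
    rng = random.Random(seed)
    sigma = (sigma_pct / 100.0) * lsb
    return [v + rng.gauss(0.0, sigma) for v in values]
```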
Table 7 presents the network accuracy for different levels of weight noise (σ_w), activation noise (σ_a) and MAC noise (σ_MAC) for the KWS and CIFAR-100 datasets. We examined network accuracy under two conditions: with and without training with noise. For each condition, we averaged accuracy across ten repetitions of the test set (with different noise samples).
As expected, small amounts of noise had little influence on network accuracy, but larger amounts of noise clearly lowered test accuracy. However, by training with noise, we could recover much of the accuracy drop (Table 7).
Table 7: Test accuracy under additional Gaussian noise, ordered from the lowest (top row) to the highest (bottom row) noise level.

Dataset  KWS  CIFAR-100
Baseline (no added noise)  94.3%  76.8%

Noise level  KWS, not trained with noise  KWS, trained with noise  CIFAR-100, not trained with noise  CIFAR-100, trained with noise
--  94.3%  94.4%  76.9%  77.0%
--  94.2%  94.6%  76.6%  76.9%
--  93.1%  94.0%  73.8%  75.4%
--  79.7%  91.6%  65.1%  72.5%
--  38.8%  87.7%  34.8%  69.2%
5 Conclusion
This paper presents a novel learned quantization method, a new gradual quantization training strategy, and an approach to eliminate high-precision BN and nonlinearities from the network. The result is a network consisting of convolutional layers in which the weights, inputs and outputs are fully quantized to low precision, and high-precision BN and nonlinearities are removed. The accuracy of this low-precision network closely approximates that of its full-precision equivalent, which includes BN and higher-precision nonlinearities. These low-precision networks are ideal to run in a memory-, compute- and energy-efficient way on modern neural-network accelerator hardware and microcontrollers. Although such low-precision networks can improve the efficiency of both digital and analog accelerator designs, we believe that analog designs especially will benefit from them. For example, in contrast to digital accelerators, the summation of weighted activations in the analog domain has virtually infinite precision and comes at no additional cost, i.e. no higher-precision accumulators are required; quantizing the inputs, weights and MAC results is sufficient. We further show that these quantized networks tolerate noise quite well. Consequently, these findings suggest that networks implemented on analog arrays can be accurate, fast and efficient.
References
Ambrogio et al. (2018). Equivalent-accuracy accelerated neural-network training using analogue memory. Nature 558(7708), p. 60.
Anderson and Berg (2017). The high-dimensional geometry of binary neural networks. arXiv preprint arXiv:1705.07199.
Bagherinezhad et al. (2018). Label refinery: improving ImageNet classification through label progression. arXiv preprint arXiv:1805.02641.
Baskin et al. (2018). NICE: noise injection and clamping estimation for neural network quantization. arXiv preprint arXiv:1810.00162.
Bengio et al. (2009). Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48.
Cai et al. (2017). Deep learning with low precision by half-wave Gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5918–5926.
Choi et al. (2019). Accurate and efficient 2-bit quantized neural networks. Technical report.
Courbariaux et al. (2016). Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
Russakovsky et al. (2014). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575.
Guo et al. (2017). Temperature-insensitive analog vector-by-matrix multiplier based on 55 nm NOR flash memory cells. In 2017 IEEE Custom Integrated Circuits Conference (CICC), pp. 1–4.
Han et al. (2015). Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
He et al. (2015). Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
Hinton (2012). Neural networks for machine learning. Coursera [video lectures].
Hinton et al. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Howard et al. (2017). MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Hubara et al. (2017). Quantized neural networks: training neural networks with low precision weights and activations. The Journal of Machine Learning Research 18(1), pp. 6869–6898.
Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Jouppi et al. (2017). In-datacenter performance analysis of a tensor processing unit. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1–12.
Jung et al. (2018). Learning to quantize deep networks by optimizing quantization intervals with task loss. arXiv preprint arXiv:1808.05779.
LeCun et al. (1990). Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605.
Training binary neural networks with knowledge transfer. Neurocomputing.
Li et al. (2016). Ternary weight networks. arXiv preprint arXiv:1605.04711.
Merolla et al. (2016). Deep neural networks are robust to weight binarization and other non-linear distortions. arXiv preprint arXiv:1606.01981.
Moons et al. (2017). 14.5 Envision: a 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm FDSOI. In 2017 IEEE International Solid-State Circuits Conference (ISSCC), pp. 246–247.
Polino et al. (2018). Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668.
Rastegari et al. (2016). XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542.
Redmon and Farhadi (2016). YOLO9000: better, faster, stronger. arXiv preprint arXiv:1612.08242.
Sainath and Parada (2015). Convolutional neural networks for small-footprint keyword spotting. In Interspeech.
Srivastava et al. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1), pp. 1929–1958.
Tang and Lin (2018). Deep residual learning for small-footprint keyword spotting. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5484–5488.
Warden (2017). Speech commands: a public dataset for single-word speech recognition. Dataset available from http://download.tensorflow.org/data/speech_commands_v0.01.
Binarized neural networks on the ImageNet classification task. arXiv preprint arXiv:1604.03058.
Xu et al. (2018). Deep neural network compression with single and multiple level quantization. In Thirty-Second AAAI Conference on Artificial Intelligence.
Xue et al. (2013). Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, pp. 2365–2369.
Zhang et al. (2018). LQ-Nets: learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 365–382.
Zhang et al. (2017). Hello edge: keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128.
Zhu et al. (2016). Trained ternary quantization. arXiv preprint arXiv:1612.01064.