# Relaxed Quantization for Discretized Neural Networks

## Abstract

Neural network quantization has become an important research area due to its great impact on deployment of large models on resource constrained devices. In order to train networks that can be effectively discretized without loss of performance, we introduce a differentiable quantization procedure. Differentiability can be achieved by transforming continuous distributions over the weights and activations of the network to categorical distributions over the quantization grid. These are subsequently relaxed to continuous surrogates that can allow for efficient gradient-based optimization. We further show that stochastic rounding can be seen as a special case of the proposed approach and that under this formulation the quantization grid itself can also be optimized with gradient descent. We experimentally validate the performance of our method on MNIST, CIFAR 10 and Imagenet classification.


## 1 Introduction

Neural networks excel in a variety of large scale problems due to their highly flexible parametric nature. However, deploying big models on resource constrained devices, such as mobile phones, drones or IoT devices is still challenging because they require a large amount of power, memory and computation. Neural network compression is a means to tackle this issue and has therefore become an important research topic.

Neural network compression can be roughly divided into two categories that are not mutually exclusive: pruning and quantization. While pruning (lecun1990optimal; han2015deep) aims to make the model “smaller” by altering the architecture, quantization aims to reduce the precision of the arithmetic operations in the network. In this paper we focus on the latter. Most network quantization methods either simulate or enforce discretization of the network during training, e.g. via rounding of the weights and activations. Although seemingly straightforward, the discontinuity of the discretization makes gradient-based optimization infeasible, because the gradient of the loss with respect to the parameters is zero almost everywhere. A workaround for the discontinuity is the use of “pseudo-gradients” according to the straight-through estimator (bengio2013estimating), which have been successfully used for training low-bit width architectures in e.g. hubara2016quantized; zhu2016trained.

The purpose of this work is to introduce a novel quantization procedure, Relaxed Quantization (RQ). RQ can bypass the non-differentiability of the quantization operation during training by smoothing it appropriately. The contributions of this paper are four-fold: First, we show how to make the set of quantization targets part of the training process such that we can optimize them with gradient descent. Second, we introduce a way to discretize the network by converting distributions over the weights and activations to categorical distributions over the quantization grid. Third, we show that we can obtain a “smooth” quantization procedure by replacing the categorical distributions with concrete (maddison2016concrete; jang2016categorical) equivalents. Finally we show that stochastic rounding (gupta2015deep), one of the most popular quantization techniques, can be seen as a special case of the proposed framework. We present the details of our approach in Section 2, discuss related work in Section 3 and experimentally validate it in Section 4. Finally we conclude and provide fruitful directions for future research in Section 5.

## 2 Relaxed quantization for discretizing neural networks

The central element for the discretization of weights and activations of a neural network is a quantizer $q(\cdot)$. The quantizer receives a (usually) continuous signal $x$ as input and discretizes it to a countable set of values. This process is inherently lossy and non-invertible: given the output of the quantizer, it is impossible to determine the exact value of the input. One of the simplest quantizers is the rounding function:

$$q(x) = \alpha \left\lfloor \frac{x}{\alpha} + \frac{1}{2} \right\rfloor,$$

where $\alpha$ corresponds to the step size of the quantizer. With $\alpha = 1$, the quantizer rounds $x$ to its nearest integer number.
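As a minimal sketch of the rounding quantizer described above, the following function rounds an input to the nearest multiple of a step size. The names `round_quantizer` and `alpha` are illustrative choices, and rounding ties upwards is an assumed convention.

```python
import math


def round_quantizer(x: float, alpha: float = 1.0) -> float:
    """Round x to the nearest multiple of the step size alpha.

    Ties are rounded up (floor(x / alpha + 1/2)); this is an assumed convention.
    """
    return alpha * math.floor(x / alpha + 0.5)


# Distinct inputs collapse onto the same output, so the map cannot be inverted.
a = round_quantizer(0.26, alpha=0.5)
b = round_quantizer(0.49, alpha=0.5)
print(a == b)
```

Because many inputs share one output, the function is piecewise constant, which is why its derivative carries no useful training signal.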

Unfortunately, we cannot simply apply the rounding quantizer to discretize the weights and activations of a neural network. Because of the quantizer’s lossy and non-invertible nature, important information might be destroyed and lead to a decrease in accuracy. For this reason, it is preferable to train the neural network while simulating the effects of quantization during the training procedure. This encourages the weights and activations to be robust to quantization and therefore decreases the performance gap between a full-precision neural network and its discretized version.

However, the aforementioned rounding process is non-differentiable. As a result, we cannot directly optimize the discretized network with stochastic gradient descent, the workhorse of neural network optimization. In this work, we posit a “smooth” quantizer as a possible way for enabling gradient based optimization.

### 2.1 Learning (fixed point) quantizers via gradient descent

The proposed quantizer comprises four elements: a vocabulary, its noise model and the resulting discretization procedure, as well as a final relaxation step to enable gradient based optimization.

The first element of the quantizer is the vocabulary: it is the set of (countable) output values that the quantizer can produce. In our case, this vocabulary has an inherent structure, as it is a grid of ordered scalars. For fixed point quantization the grid is defined as

$$\mathcal{G} = \left[-2^{b-1}, \dots, 0, \dots, 2^{b-1} - 1\right], \tag{1}$$

where $b$ is the number of available bits that allow for $2^b$ possible integer values. By construction this grid of values is agnostic to the input signal $x$ and hence suboptimal; to allow for the grid to adapt to $x$ we introduce two free parameters, a scale $\alpha$ and an offset $\beta$. This leads to a learnable grid via $\hat{\mathcal{G}} = \alpha \mathcal{G} + \beta$ that can adapt to the range and location of the input signal.
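The grid construction can be sketched in a few lines. This is an illustration under assumptions: the integer grid is taken to be the signed range with `2**bits` points, and `alpha` and `beta` are illustrative names for the scale and the offset.

```python
def fixed_point_grid(bits: int) -> list[int]:
    """Signed integer grid with 2**bits points (assumed two's-complement range)."""
    half = 2 ** (bits - 1)
    return list(range(-half, half))


def learnable_grid(bits: int, alpha: float, beta: float = 0.0) -> list[float]:
    """Scale the integer grid by alpha and shift it by beta."""
    return [alpha * g + beta for g in fixed_point_grid(bits)]


print(fixed_point_grid(2))
print(learnable_grid(2, alpha=0.5, beta=1.0))
```

In a training setup, `alpha` and `beta` would be trainable parameters rather than plain floats.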

The second element of the quantizer is the assumption about the input noise $\epsilon$; it determines how probable it is for a specific value of the input signal $x$ to move to each grid point. Adding noise to $x$ will result in a quantizer that is, on average, a smooth function of its input. In essence, this is an application of variational optimization (staines2012variational) to the non-differentiable rounding function, which enables us to do gradient based optimization.

We model this form of noise as acting additively to the input signal $x$ and being governed by a distribution $p(\epsilon)$. This process induces a distribution $p(\tilde{x})$ where $\tilde{x} = x + \epsilon$. In the next step of the quantization procedure, we discretize $p(\tilde{x})$ according to the quantization grid $\hat{\mathcal{G}}$; this necessitates the evaluation of the cumulative distribution function (CDF). For this reason, we will assume that the noise is distributed according to a zero mean logistic distribution with a standard deviation $\sigma$, i.e. $L(0, \sigma)$, hence leading to $p(\tilde{x}) = L(x, \sigma)$. The CDF of the logistic distribution is the sigmoid function which is easy to evaluate and backpropagate through. Using Gaussian distributions proved to be less effective in preliminary experiments. Other distributions are conceivable and we will briefly discuss the choice of a uniform distribution in Section 2.3.

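The third element is, given the aforementioned assumptions, how the quantizer determines an appropriate assignment for each realization of the input signal $x$. Due to the stochastic nature of the noisy input $\tilde{x}$, a deterministic round-to-nearest operation will result in a stochastic quantizer for $x$. Quantizing $x$ in this manner corresponds to discretizing $p(\tilde{x})$ onto the learnable grid $\hat{\mathcal{G}}$ and then sampling grid points $g_i$ from it. More specifically, we construct a categorical distribution over the grid by adopting intervals of width equal to the grid step $\alpha$ centered at each of the grid points. The probability of selecting that particular grid point will now be equal to the probability of $\tilde{x}$ falling inside those intervals:

$$p(\hat{x} = g_i \mid x, \sigma) = P(\tilde{x} \le g_i + \alpha/2) - P(\tilde{x} < g_i - \alpha/2) \tag{2}$$

$$= \mathrm{Sigmoid}\big((g_i + \alpha/2 - x)/\sigma\big) - \mathrm{Sigmoid}\big((g_i - \alpha/2 - x)/\sigma\big), \tag{3}$$

where $\hat{x}$ corresponds to the quantized variable, $P(\cdot)$ corresponds to the CDF and the step from Equation 2 to Equation 3 is due to the logistic noise assumption. A visualization of the aforementioned process can be seen in Figure 1. For the first and last grid point, $g_0$ and $g_K$, we will assume that they reside within $(g_0 - \alpha/2, g_0 + \alpha/2]$ and $(g_K - \alpha/2, g_K + \alpha/2]$ respectively. Under this assumption we will have to truncate $p(\tilde{x})$ such that it only has support within $(g_0 - \alpha/2, g_K + \alpha/2]$. Fortunately this is easy to do, as it corresponds to just a simple modification of the CDF:

$$P\big(\tilde{x} \le c \mid \tilde{x} \in (g_0 - \alpha/2, g_K + \alpha/2]\big) = \frac{P(\tilde{x} \le c) - P(\tilde{x} < g_0 - \alpha/2)}{P(\tilde{x} \le g_K + \alpha/2) - P(\tilde{x} < g_0 - \alpha/2)}. \tag{4}$$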
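The discretization in Equations 2 to 4 can be illustrated with a small sketch. It assumes a sorted, uniformly spaced grid and logistic noise with scale `sigma` centered at the input; the function names are illustrative and not taken from the paper.

```python
import math


def sigmoid(t: float) -> float:
    """Numerically stable logistic CDF with zero location and unit scale."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def grid_probabilities(x: float, sigma: float, grid: list[float]) -> list[float]:
    """Mass of logistic noise around x that falls into each grid interval.

    The distribution is truncated to (grid[0] - alpha/2, grid[-1] + alpha/2],
    so the returned masses sum to one.
    """
    alpha = grid[1] - grid[0]  # grid step, assuming uniform spacing

    def cdf(c: float) -> float:
        return sigmoid((c - x) / sigma)

    norm = cdf(grid[-1] + alpha / 2) - cdf(grid[0] - alpha / 2)
    return [(cdf(g + alpha / 2) - cdf(g - alpha / 2)) / norm for g in grid]


probs = grid_probabilities(x=0.3, sigma=0.2, grid=[-1.0, -0.5, 0.0, 0.5])
print(probs)
```

Shrinking `sigma` concentrates the mass on the interval that contains `x`, which matches the intuition that less input noise yields a more deterministic quantizer.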

Armed with this categorical distribution over the grid, the quantizer proceeds to assign a specific grid value $g_i$ to $\hat{x}$ by drawing a random sample. This procedure emulates quantization noise, which prevents the model from fitting the data. This noise can be reduced in two ways: by clustering the weights and activations around the points of the grid and by reducing the logistic noise $\sigma$. As $\sigma \to 0$, the CDF converges towards the step function, prohibiting gradient flow. On the other hand, if $\sigma$ is too high, the optimization procedure is very noisy, prohibiting convergence. For this reason, during optimization we initialize $\sigma$ in a sensible range, such that $L(x, \sigma)$ covers a significant portion of the grid. Please confer Appendix A for details. We then let $\sigma$ be freely optimized via gradient descent such that the loss is minimized. Both effects reduce the gap between the function that the neural network computes during training time vs. test time. We illustrate this in Figure 2.

The fourth element of the procedure is the relaxation of the non-differentiable categorical distribution sampling. This is achieved by replacing the categorical distribution with a concrete distribution (maddison2016concrete; jang2016categorical). This relaxation procedure corresponds to adopting a “smooth” categorical distribution that can be seen as a “noisy” softmax. Let $\pi_i$ be the categorical probability of sampling grid point $i$, i.e. $\pi_i = p(\hat{x} = g_i \mid x, \sigma)$; the “smoothed” quantized value $\hat{x}$ can be obtained via:

$$u_i \sim \mathrm{Gumbel}(0, 1), \qquad z_i = \frac{\exp\big((\log \pi_i + u_i)/\lambda\big)}{\sum_j \exp\big((\log \pi_j + u_j)/\lambda\big)}, \qquad \hat{x} = \sum_{i} z_i g_i, \tag{5}$$

where $z_i$ is the random sample from the concrete distribution and $\lambda$ is a temperature parameter that controls the degree of approximation, since as $\lambda \to 0$ the concrete distribution becomes a categorical.
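The relaxation step can be sketched as a Gumbel-perturbed softmax over the grid probabilities, followed by a weighted sum of the grid points. This is a plain-Python illustration without automatic differentiation; the function names, the small constants that guard the logarithms, and the use of `random.Random` are assumptions made for the example.

```python
import math
import random


def sample_gumbel(rng: random.Random) -> float:
    """Draw one standard Gumbel(0, 1) sample via inverse transform sampling."""
    # Clamp away from 0 and 1 so that both logarithms stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def relaxed_quantize(pi, grid, temperature, rng):
    """Return (x_hat, z): a convex combination of grid points and its weights."""
    logits = [(math.log(p + 1e-20) + sample_gumbel(rng)) / temperature for p in pi]
    m = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    z = [e / total for e in exps]
    x_hat = sum(z_i * g_i for z_i, g_i in zip(z, grid))
    return x_hat, z


rng = random.Random(0)
x_hat, z = relaxed_quantize([0.1, 0.6, 0.3], [-1.0, 0.0, 1.0], temperature=1.0, rng=rng)
print(x_hat, z)
```

Lower temperatures push `z` towards a one-hot vector, so `x_hat` moves closer to a single grid point at the cost of higher-variance gradients.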

We have thus defined a fully differentiable “soft” quantization procedure that allows for stochastic gradients for both the quantizer parameters $\alpha, \beta, \sigma$ as well as the input signal $x$ (e.g. the weights or the activations of a neural network). We refer to this algorithm as Relaxed Quantization (RQ). We summarize its forward pass as performed during training in Algorithm 1. It is also worthwhile to notice that if there were no noise at the input then the categorical distribution would have non-zero mass only at a single value, thus prohibiting gradient based optimization for $x$ and $\sigma$.

One drawback of this approach is that the smoothed quantized values $\hat{x}$ defined in Equation 5 do not have to coincide with grid points, as $z$ is not a one-hot vector. Instead, these values can lie anywhere between the smallest and largest grid point, something which is impossible with e.g. stochastic rounding (gupta2015deep). In order to make sure that only grid points are sampled, we propose an alternative algorithm, RQ ST, in which we use the variant of the straight-through (ST) estimator proposed in jang2016categorical. Here we sample the actual categorical distribution during the forward pass but assume a sample from the concrete distribution for the backward pass. While this gradient estimator is obviously biased, in practice it works as the “gradients” seem to point towards a valid direction. We perform experiments with both variants.

After convergence, we can obtain a “hard” quantization procedure, i.e. select points from the grid, at test time by either reverting to a categorical distribution (instead of the continuous surrogate) or by rounding to the nearest grid point. In this paper we chose the latter as it is more aligned with the low-resource environments in which quantized models will be deployed. Furthermore, with this goal in mind, we employ two quantization grids with their own learnable scalar $\alpha$, $\sigma$ (and potentially $\beta$) parameters for each layer; one for the weights and one for the activations.
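The test-time behaviour, rounding to the nearest grid point, might look as follows. The sketch assumes a signed integer grid scaled by `alpha` and shifted by `beta`, and it clips inputs that fall outside the grid range; all names are illustrative.

```python
import math


def quantize_to_grid(x: float, alpha: float, beta: float, bits: int) -> float:
    """Deterministically round x to the nearest point of the grid alpha * G + beta.

    G is assumed to be the signed integer grid {-2**(bits-1), ..., 2**(bits-1) - 1};
    inputs outside the grid range are clipped to the first or last grid point.
    """
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    index = math.floor((x - beta) / alpha + 0.5)
    index = min(max(index, lo), hi)
    return alpha * index + beta


print(quantize_to_grid(0.3, alpha=0.25, beta=0.0, bits=4))
print(quantize_to_grid(10.0, alpha=0.25, beta=0.0, bits=4))
```

No random numbers are needed at this stage, which keeps inference cheap on low-resource hardware.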

### 2.2 Scalable quantization via a local grid

Sampling $\hat{x}$ based on drawing $2^b$ random numbers for the concrete distribution as described in Equation 5 can be very expensive for larger values of $b$. Firstly, drawing $2^b$ random numbers for every individual weight and activation in a neural network drastically increases the number of operations required in the forward pass. Secondly, it also requires keeping many more numbers in memory for gradient computations during the backward pass. Compared to a standard neural network or stochastic rounding approaches, the proposed procedure can thus be infeasible for larger models and datasets.

Fortunately, we can make sampling independent of the grid size by assuming zero probability for grid points that lie far away from the signal $x$. Specifically, by only considering grid points that are within $\delta$ standard deviations away from $x$, we truncate $p(\tilde{x})$ such that it lies within a “localized” grid around $x$.

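To simplify the computation required for determining the local grid elements, we choose the grid point closest to $x$, $\lfloor x \rceil$, as the center of the local grid (Figure 3). Since $\sigma$ is shared between all elements of the weight matrix or activation, the local grid has the same width for every element. The computation of the probabilities over the localized grid is similar to the truncation happening in Equation 4 and the smoothed quantized value is obtained via a manner similar to Equation 5:

$$P\big(\tilde{x} \le c \mid \tilde{x} \in (\lfloor x \rceil - \delta\sigma, \lfloor x \rceil + \delta\sigma]\big) = \frac{P(\tilde{x} \le c) - P(\tilde{x} < \lfloor x \rceil - \delta\sigma)}{P(\tilde{x} \le \lfloor x \rceil + \delta\sigma) - P(\tilde{x} < \lfloor x \rceil - \delta\sigma)}, \tag{6}$$

$$\hat{x} = \sum_{g_i \in (\lfloor x \rceil - \delta\sigma, \lfloor x \rceil + \delta\sigma]} z_i g_i. \tag{7}$$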
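A rough sketch of how such a local grid could be selected is given below. It assumes an unbounded uniform grid with zero offset, includes both endpoints of the window for simplicity, and uses `delta=3.0` purely as a placeholder default; none of these choices are taken from the text.

```python
import math


def local_grid(x: float, sigma: float, alpha: float, delta: float = 3.0):
    """Grid points within delta * sigma of the grid point nearest to x.

    Assumes an unbounded uniform grid {k * alpha} with zero offset; clipping
    to the first and last grid point is omitted for brevity.
    """
    center = math.floor(x / alpha + 0.5)        # index of the nearest grid point
    radius = math.floor(delta * sigma / alpha)  # number of points on each side
    return [alpha * k for k in range(center - radius, center + radius + 1)]


print(local_grid(x=0.3, sigma=0.1, alpha=0.25))
```

The number of returned points depends only on `delta * sigma / alpha` and not on the bit-width, which is what makes the sampling cost independent of the full grid size.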

### 2.3 Relation to Stochastic Rounding

One of the pioneering works in neural network quantization has been the work of gupta2015deep; it introduced stochastic rounding, a technique that is one of the most popular approaches for training neural networks with reduced numerical precision. Instead of rounding to the nearest representable value, the stochastic rounding procedure selects one of the two closest grid points with probability depending on the distance of the high precision input from these grid points. In fact, we can view stochastic rounding as a special case of RQ where $p(\tilde{x}) = U(x - \alpha/2, x + \alpha/2)$. This uniform distribution centered at $x$ of width equal to the grid width $\alpha$ generally has support only for the closest grid point. Discretizing this distribution to a categorical over the quantization grid however assigns probabilities to the two closest grid points as in stochastic rounding, following Equation 2:

$$p\big(\hat{x} = \lfloor x/\alpha \rfloor \alpha \mid x\big) = P\big(\tilde{x} \le \lfloor x/\alpha \rfloor \alpha + \alpha/2\big) - P\big(\tilde{x} < \lfloor x/\alpha \rfloor \alpha - \alpha/2\big) = \lfloor x/\alpha \rfloor + 1 - \frac{x}{\alpha}. \tag{8}$$
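The two-point distribution of stochastic rounding can be checked numerically. The sketch below assumes a uniform grid with step `alpha` and zero offset, and the function name is illustrative.

```python
import math


def stochastic_rounding_probs(x: float, alpha: float):
    """Return ((lower, p_lower), (upper, p_upper)) for uniform noise of width alpha."""
    k = math.floor(x / alpha)
    lower, upper = k * alpha, (k + 1) * alpha
    p_upper = x / alpha - k  # fraction of the way from lower to upper
    return (lower, 1.0 - p_upper), (upper, p_upper)


(lo, p_lo), (hi, p_hi) = stochastic_rounding_probs(0.3, alpha=0.25)
expected = lo * p_lo + hi * p_hi  # mean of the rounded value
print(lo, p_lo, hi, p_hi, expected)
```

The expectation of the rounded value equals the input, which is the familiar unbiasedness property of stochastic rounding in the forward pass.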

Stochastic rounding has proven to be a very powerful quantization scheme, even though it relies on biased gradient estimates for the rounding procedure. On the one hand, RQ provides a way to circumvent this estimator at the cost of optimizing a surrogate objective. On the other hand, RQ ST makes use of the unreasonably effective straight-through estimator as used in jang2016categorical to avoid optimizing a surrogate objective, at the cost of biased gradients. Compared to stochastic rounding, RQ ST further allows sampling of not only the two closest grid points, but also has support for more distant ones depending on the estimated input noise $\sigma$. Intuitively, this allows for larger steps in the input space without first having to decrease variance at the transition between grid sections.

## 3 Related Work

In this work we focus on hardware oriented quantization approaches. As opposed to methods that focus only on weight quantization and network compression for a reduced memory footprint, quantizing all operations within the network aims to additionally provide reduced execution time. The body of work that considers quantizing both weights and activations includes papers using stochastic rounding (gupta2015deep; hubara2016quantized; gysel2018ristretto; wu2018training). (wu2018training) also consider quantized backpropagation, which is out of scope for this work.

Furthermore, another line of work considers binarizing (courbariaux2015binaryconnect; zhou2018explicit) or ternarizing (li2016ternary; zhou2018explicit) weights and activations (hubara2016quantized; rastegari2016xnor; zhou2016dorefa) via the straight-through gradient estimator (bengio2013estimating); these allow for fast implementations of convolutions using only bit-shift operations. In a similar vein, the straight-through estimator has also been used in cai2017deep; faraone2018syq; jacob2017quantization; zhou2017incremental; mishra2017apprentice for quantizing neural networks to arbitrary bit-precision. In these approaches, the full precision weights that are updated during training correspond to the means of the logistic distributions that are used in RQ. Furthermore, jacob2017quantization maintains moving averages for the minimum and maximum observed values for activations while parameterizing the grids of the network’s weights via their minimum and maximum values directly. This fixed-point grid is therefore learned during training, but without gradient descent, unlike in the proposed RQ. Alternatively, instead of discretizing real valued weights, shayer2018learning directly optimize discrete distributions over them. While providing promising results, this approach does not generalize straightforwardly to activation quantization.

Another line of work quantizes networks through regularization. (louizos2017bayesian) formulate a variational approach that allows for heuristically determining the required bit-width precision for each weight of the model. Improving upon this work, (achterhold2018variational) proposed a quantizing prior that encourages ternary weights during training. Similarly to RQ, this method also allows for optimizing the scale of the ternary grid. In contrast to RQ, this is only done implicitly via the regularization term. One drawback of these approaches is that the strength of the regularization decays with the amount of training data, thus potentially reducing their effectiveness on large datasets.

Weights in a neural network are usually not distributed uniformly within a layer. As a result, performing non-uniform quantization is usually more effective. (baskin2018uniq) employ a stochastic quantizer by first uniformizing the weight or activation distribution through a non-linear transformation and then injecting uniform noise into this transformed space. (polino2018model) propose a version of their method in which the quantizer’s code book is learned by gradient descent, resulting in a non-uniformly spaced grid. Another line of work quantizes by clustering and therefore falls into this category; (han2015deep; ullrich2017soft) represent each of the weights by the centroid of its closest cluster. While such non-uniform techniques can indeed be effective, they do not allow for efficient implementations on today’s hardware.

Within the literature on quantizing neural networks there are many approaches that are orthogonal to our work and could potentially be combined for additional improvements. (mishra2017apprentice; polino2018model) use knowledge distillation techniques to good effect, whereas works such as (mishra2017wrpn) modify the architecture to compensate for lower precision computations. (zhou2017incremental; zhou2018explicit; baskin2018uniq) perform quantization in a step-by-step manner going from input layer to output, thus allowing the later layers to more easily adapt to the rounding errors introduced. polino2018model; faraone2018syq further employ “bucketing”, where small groups of weights share a grid, instead of one grid per layer. As an example from polino2018model, a bucket size of weights per grid on Resnet-18 translates to separate weight quantization grids as opposed to in RQ.

## 4 Experiments

For the subsequent experiments RQ will correspond to the proposed procedure that has concrete sampling and RQ ST will correspond to the proposed procedure that uses the Gumbel-softmax straight-through estimator (jang2016categorical) for the gradient. We did not optimize an offset for the grids in order to be able to represent the number zero exactly, which allows for sparsity and is required for zero-padding. Furthermore we assumed a grid that starts from zero when quantizing the outputs of ReLU. We provide further details on the experimental settings in Appendix A. We will also provide results of our own implementation of stochastic rounding (gupta2015deep) with the dynamic fixed point format (gysel2018ristretto) (SR+DR). Here we used the same hyperparameters as for RQ. All experiments were implemented with TensorFlow (tensorflow2015-whitepaper), using the Keras library (chollet2015keras).

### 4.1 LeNet-5 on MNIST and VGG7 on CIFAR 10

For the first task we considered the toy LeNet-5 network trained on MNIST with the 32C5 - MP2 - 64C5 - MP2 - 512FC - Softmax architecture and the VGG 2x(128C3) - MP2 - 2x(256C3) - MP2 - 2x(512C3) - MP2 - 1024FC - Softmax architecture on the CIFAR 10 dataset. Details about the hyperparameter settings can be found in Appendix A.

By observing the results in Table 1, we see that our method can achieve competitive results that improve upon several recent works on neural network quantization. Considering that we achieve lower test error for 8 bit quantization than the high-precision models, we can see how RQ has a regularizing effect. Generally speaking we found that the gradient variance for low bit-widths (i.e. 2-4 bits) in RQ needs to be kept in check through appropriate learning rates.

Method | # Bits weights/act. | MNIST | CIFAR 10 |
---|---|---|---|

Original | |||

SR+DR | |||

(gupta2015deep; gysel2018ristretto) | - | ||

- | |||

Deep Comp. (han2015deep) | - | ||

TWN (li2016ternary) | |||

BWN (rastegari2016xnor) | - | ||

XNOR-net (rastegari2016xnor) | - | ||

SWS (ullrich2017soft) | - | ||

Bayesian Comp. (louizos2017bayesian) | - | ||

VNQ (achterhold2018variational) | - | ||

WAGE (wu2018training) | |||

LR Net (shayer2018learning)^{1} |
|||

RQ (ours) | |||

RQ ST (ours) | |||

### 4.2 Resnet-18 and Mobilenet on Imagenet

In order to demonstrate the effectiveness of our proposed approach on large scale tasks we considered the task of quantizing a Resnet-18 (he2016deep) as well as a Mobilenet (howard2017mobilenets) trained on the Imagenet (ILSVRC2012) dataset. For the Resnet-18 experiment, we started from a pre-trained full precision model that was trained for 90 epochs. We provide further details about the training procedure in Appendix B. The Mobilenet was initialized with the pretrained model available on the TensorFlow GitHub repository^{2}.

Some of the existing quantization works do not quantize the first (and sometimes the last) layer. Doing so simplifies the problem but it can, depending on the model and input dimensions, significantly increase the amount of computation required. We therefore make use of the bit operations per second (BOPs) metric (baskin2018uniq), which can be seen as a proxy for the execution speed on appropriate hardware. In BOPs, the impact of not quantizing the first layer in, for example, the Resnet-18 model on Imagenet, becomes apparent: keeping the first layer in full precision requires roughly times as many BOPs for one forward pass through the whole network compared to quantizing all weights and activations to bits.

Figure 4 compares a wide range of methods in terms of accuracy and BOPs. We choose to compare only against methods that employ fixed-point quantization on Resnet-18 and Mobilenet, and hence do not compare with non-uniform quantization techniques, such as the one described in baskin2018uniq. In addition to our own implementation of (gupta2015deep) with the dynamic fixed point format (gysel2018ristretto), we also report results of “rounding”. This corresponds to simply rounding the pre-trained high-precision model followed by re-estimation of the batchnorm statistics. The grid in this case is defined as the initial grid used for fine-tuning with RQ. For batchnorm re-estimation and grid initialization, please confer Appendix A.

In Figure 4(a) we observe that on Resnet-18 the RQ variants form the “Pareto frontier” in the trade-off between accuracy and efficiency, along with SYQ, Apprentice and jacob2017quantization. SYQ, however, employs “bucketing” and Apprentice uses distillation, both of which can be combined with RQ and improve performance. jacob2017quantization does better than RQ with 8 bits; however, RQ improved w.r.t. its pretrained model, whereas jacob2017quantization decreased slightly. For experimental details with jacob2017quantization, please confer Appendix B.1. SR+DR underperforms in this setting and is worse than simple rounding for 5 to 8 bits.

For Mobilenet, Figure 4(b) shows that RQ is competitive with existing approaches. Simple rounding resulted in almost random chance for all of the bit configurations. SR+DR shows its strength for the bit scenario, while in the lower bit regime, RQ outperforms competing approaches.

## 5 Discussion

We have introduced Relaxed Quantization (RQ), a powerful and versatile algorithm for learning low-bit neural networks using a uniform quantization scheme. As such, the models trained by this method can be easily transferred and executed on low-bit fixed point chipsets. We have extensively evaluated RQ on various image classification benchmarks and have shown that it allows for better trade-offs between accuracy and bit operations per second.

Future hardware might enable us to cheaply do non-uniform quantization, for which this method can be easily extended. (lai2017deep; ortiz2018low) for example, show the benefits of low-bit floating point weights that can be efficiently implemented in hardware. The floating point quantization grid can be easily learned with RQ by redefining the grid $\mathcal{G}$. General non-uniform quantization, as described for example in (baskin2018uniq), is a natural extension to RQ, whose exploration we leave to future work. Currently, the bit-width of every quantizer is determined beforehand, but in future work we will explore learning the required bit precision within this framework. In our experiments, batch normalization was implemented as a sequence of convolution, batch normalization and quantization. On a low-precision chip, however, batch normalization would be “folded” (jacob2017quantization) into the kernel and bias of the convolution, the result of which is then rounded to low precision. In order to accurately reflect this folding at test time, future work on the proposed algorithm will emulate folded batchnorm at training time and learn the corresponding quantization grid of the modified kernel and bias. For fast model evaluation on low-precision hardware, quantization goes hand-in-hand with network pruning. The proposed method is orthogonal to pruning methods such as, for example, $L_0$ regularization (louizos2017learning), which allows for group sparsity and pruning of hidden units.

#### Acknowledgments

We would like to thank Peter O’Connor and Rianne van den Berg for feedback on a draft of this paper.

## References

## Appendix A Experimental details

The grid width of each grid was initialized according to the bit-width and the maximum and minimum values of the input to the quantizer^{3}.
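One plausible way to derive an initial grid width from the bit-width and the observed range is sketched below. The exact rule is not stated here, so the formula (spreading `2**bits` uniformly spaced points over the observed range) and the function name are assumptions made for illustration only.

```python
def init_grid_width(values, bits: int) -> float:
    """Step size so that 2**bits uniformly spaced points span [min, max].

    This rule is an assumed initialization, not one confirmed by the text.
    """
    lo, hi = min(values), max(values)
    return (hi - lo) / (2 ** bits - 1)


print(init_grid_width([-1.0, 0.5, 2.0], bits=2))
```

For activations, `values` would be collected from a minibatch of inputs rather than from the parameters themselves.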

The moving averages of layer statistics that are aggregated during the training phase for the batch normalization do not necessarily reflect the statistics of the quantized model accurately. Even though RQ aims to minimize the gap between training and testing phase, we found that the aggregated statistics in combination with the learned scale and shift parameters of batch normalization lead to decreased test performance. In order to avoid this drop in accuracy, we apply the insights from (peters2018probabilistic) and recompute the statistics of the quantized model before reporting the final test error rate. The final models were determined through early stopping using the validation loss computed with minibatch statistics, in case the model uses batch normalization.

For the MNIST experiment we rescaled the input to the [-1, 1] range, employed no regularization and the network was trained with Adam (kingma2014adam) and a batch size of 128. We used a local grid whenever the bit width was larger than 2 for both weights and biases (shared grid parameters), as well as for the outputs of the ReLU, with . For the 8 and 4 bit networks we used a temperature of 2 whereas for the 2 bit models we used a temperature of 1 for RQ. We trained the 8 and 4 bit networks for 100 epochs using a learning rate of 1e-3 and the 2 bit networks for 200 epochs with a learning rate of 5e-4. In all of the cases the learning rate was annealed to zero during the last 50 epochs.

For the CIFAR 10 experiment, the hyperparameters were chosen identically to the LeNet-5 experiments except for a few differences. We chose a learning rate of 1e-4 instead of 1e-3 for 8 and 4 bit networks and trained for 300 epochs with a batch size of 100. We also included a weight decay term of 1e-4 for the 8 bit networks. For the 2 bit model we started with a learning rate of 1e-3. The VGG model contains a batch normalization layer after every convolutional layer, but preceded by max pooling, if present.

## Appendix B Imagenet details

Each channel of the input images was preprocessed by subtracting the mean and dividing by the standard deviation of that channel across the training set. We then resized the images such that the shorter side is set to 256 and then applied random 224x224 crops and random horizontal flips for data augmentation. For evaluation we consider the center 224x224 crop of the images.

We trained the base Resnet-18 model with stochastic gradient descent, a batch size of 128, Nesterov momentum of 0.9 and a learning rate of 0.1 which was multiplied by 0.1 at the 30th and 60th epoch. We also applied weight decay with a strength of 1e-4. For the quantized model fine-tuning phase, we used Adam with a learning rate of , a batch size of 24 and a momentum of 0.99. We used a temperature of 2 for both RQ variants. Following the strategy in (jacob2017quantization), we did not quantize the biases.
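The step schedule used for the base model can be written as a small function. The sketch assumes zero-indexed epochs, so that the first decay applies from epoch index 30 onwards; the function name is illustrative.

```python
def base_learning_rate(epoch: int, initial: float = 0.1) -> float:
    """Step schedule: multiply by 0.1 at epoch 30 and again at epoch 60."""
    lr = initial
    for milestone in (30, 60):
        if epoch >= milestone:
            lr *= 0.1
    return lr


print([base_learning_rate(e) for e in (0, 29, 30, 59, 60, 89)])
```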

Table 2 contains the error rates for Resnet-18 and Mobilenet on which Figure 4 is based. Algorithm and architecture specific changes are mentioned explicitly through footnotes.

Resnet18 | Mobilenet | ||||

Method | # Bits weights/act. | Top-1 | Top-5 | Top-1 | Top-5 |

Original | |||||

SR+DR | |||||

(gupta2015deep; gysel2018ristretto) | |||||

Rounding | - | - | |||

- | - | ||||

- | - | ||||

- | - | ||||

(jacob2017quantization)^{4} |
|||||

- | - | ||||

- | - | ||||

LR Net (shayer2018learning) | - | - | |||

- | - | ||||

QSM (sheng2018quantization)^{5} |
- | - | - | ||

TWN (li2016ternary) | - | - | |||

INQ (zhou2017incremental) | - | - | |||

BWN (rastegari2016xnor) | - | - | |||

XNOR-net (rastegari2016xnor) | - | - | |||

HWGQ (cai2017deep)^{6} | - | - | |||

ELQ (zhou2018explicit) | - | - | |||

- | - | ||||

SYQ (faraone2018syq)^{6} |
- | - | |||

- | - | ||||

Apprentice (mishra2017apprentice)^{6} | |||||

RQ (ours) | |||||

- | - | ||||

RQ ST (ours) | |||||

- | - |

### B.1 jacob2017quantization for Resnet18

We used the code provided at https://github.com/tensorflow/models/tree/master/official/resnet and modified the construction of the training and evaluation graph by inserting quantization operations provided by the tensorflow.contrib.quantize package. In a first step, the unmodified code was used to train a high-precision Resnet18 model using the hyper-parameter settings for the learning rate scheduling that are provided in the GitHub repository. More specifically, the model was trained for epochs with a batch size of . The learning rate scheduling involved a “warm up” period in which the learning rate was annealed from zero to over the first steps, after which it was divided by after epochs and respectively. Gradients were modified using a momentum of . Final test performance under this procedure is top-1 error and top-5 error. From the high-precision model checkpoint, the final quantized model was then fine-tuned for epochs using a constant learning rate of and momentum of . We did not freeze the moving averages of the batch normalization layers. Finally, we found that re-estimating the batchnorm statistics was harmful for this algorithm. We hypothesise that this is due to the usage of folded batch normalization, which incorporates the statistics into the construction of the grid at training time.

### B.2 jacob2017quantization for Mobilenet

The bit results for quantizing Mobilenet provided in Table 2 are read off from Figure 4.1 in jacob2017quantization. The pre-trained models published at https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md originally reflected that number up until commit 4415c2613b0c74032a7c631769ef9fa7f5477d88, but have since been updated to improved error rates of and respectively. Unfortunately, there are several conflicting sources for quantized Mobilenet results and pretrained models within the TensorFlow GitHub repository. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/models.md#image-classification-quantized-models, for example, reports error rates of and , whereas at https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/quantize the reported top-1 error rate is .

We attempted to use the provided training scripts in the https://github.com/tensorflow/models/blob/master/research/slim repository to train lower-bit Mobilenet variants, but did not succeed in doing so. We experimented with learning rates in the range of for , and bit-width variants, but could not achieve significant accuracy improvements within the first epochs of fine-tuning of the high-precision model published at https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md. After epochs, the version achieved top-1 error with a learning rate of and as such is worse than the published results. We therefore chose to only include the published numbers for the bit model and leave additional hyperparameter tuning to future work.

### Footnotes

1. Last layer in full precision
2. https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md
3. For activations we computed the minimum and maximum on a random minibatch of inputs.
4. Includes folded batch normalization
5. Modified architecture
6. Weights of first and last layer not quantized