ProxQuant: Quantized Neural Networks via Proximal Operators
Abstract
To make deep neural networks feasible in resource-constrained environments (such as mobile devices), it is beneficial to quantize models by using low-precision weights. One common technique for quantizing neural networks is the straight-through gradient method, which enables back-propagation through the quantization mapping. Despite its empirical success, little is understood about why the straight-through gradient method works.
Building upon a novel observation that the straight-through gradient method is in fact identical to the well-known Nesterov's dual-averaging algorithm on a quantization-constrained optimization problem, we propose a more principled alternative approach, called ProxQuant, that formulates quantized network training as a regularized learning problem instead and optimizes it via the prox-gradient method. ProxQuant does back-propagation on the underlying full-precision vector and applies an efficient prox-operator in between stochastic gradient steps to encourage quantizedness. For quantizing ResNets and LSTMs, ProxQuant outperforms state-of-the-art results on binary quantization and is on par with state-of-the-art on multi-bit quantization. For binary quantization, our analysis shows both theoretically and experimentally that ProxQuant is more stable than the straight-through gradient method (i.e. BinaryConnect), challenging the indispensability of the straight-through gradient method and providing a powerful alternative.
1 Introduction
Deep neural networks (DNNs) have achieved impressive results in various machine learning tasks [7]. High-performance DNNs typically have tens of layers and millions of parameters, resulting in high memory usage and a high computational cost at inference time. However, these networks are often desired in environments with limited memory and computational power (such as mobile devices), in which case we would like to compress the network into a smaller, faster network with comparable performance.
A popular way of achieving such compression is through quantization: training networks with low-precision weights and/or activation functions. In a quantized neural network, each weight and/or activation is representable in $k$ bits, with a possible codebook of negligible additional size compared to the network itself. For example, in a binary neural network ($k=1$), the weights are restricted to be in $\{\pm 1\}$. Compared with a 32-bit single-precision float, a quantized net reduces the memory usage to $k/32$ of a full-precision net with the same architecture [8, 5, 18, 12, 23, 24]. In addition, the structuredness of the quantized weight matrix can often enable faster matrix-vector products, thereby also accelerating inference [12, 9].
Typically, training a quantized network involves (1) the design of a quantizer $\mathsf{q}$ that maps a full-precision parameter to a $k$-bit quantized parameter, and (2) the straight-through gradient method [5] that enables back-propagation from the quantized parameter back onto the original full-precision parameter, which is critical to the success of quantized network training. With quantizer $\mathsf{q}$, an iterate of the straight-through gradient method (see Figure 1(a)) proceeds as $\theta_{t+1} = \theta_t - \eta_t \tilde{\nabla} L(\theta)|_{\theta=\mathsf{q}(\theta_t)}$, and $\mathsf{q}(\hat{\theta})$ (for the converged $\hat{\theta}$) is taken as the output model. For training binary networks, choosing $\mathsf{q}(\cdot) = \mathrm{sign}(\cdot)$ gives the BinaryConnect method [5].
Though appealingly simple and empirically effective, it is information-theoretically rather mysterious why the straight-through gradient method works well, at least in the binary case: while the goal is to find a parameter $\theta \in \{\pm 1\}^d$ with low loss, the algorithm only has access to stochastic gradients at $\{\pm 1\}^d$. As this is a discrete set, a priori, gradients in this set do not necessarily contain any information about the function values. Indeed, a simple one-dimensional example (Figure 1(b)) shows that BinaryConnect fails to find the minimizer of fairly simple convex Lipschitz functions in $\{\pm 1\}$, due to a lack of gradient information in between.
In this paper, we formulate the problem of model quantization as a regularized learning problem and propose to solve it with a proximal gradient method. Our contributions are summarized as follows.

We present a unified framework for defining regularization functionals that encourage binary, ternary, and multi-bit quantized parameters, through penalizing the distance to quantized sets (see Section 3.1). For binary quantization, the resulting regularizer is a W-shaped non-smooth regularizer, which shrinks parameters towards either $-1$ or $1$ in the same way that $L_1$ norm regularization shrinks parameters towards $0$. We demonstrate that the prox-operators for regularizers that come out of our framework often admit linear-time solutions (or linear-time approximation heuristics) which result in numerically exact quantized parameters.

We propose training quantized networks using ProxQuant (Algorithm 1), a stochastic proximal gradient method with a homotopy scheme. Compared with the straight-through gradient method, ProxQuant has access to additional gradient information at non-quantized points, which avoids the problem in Figure 1(b), and its homotopy scheme prevents potential overshoot early in the training (Section 3.2). Algorithmically, ProxQuant involves just adding a simple proximal step with respect to a quantization-inducing regularizer after each stochastic gradient step (Figure 1(a)). It can thus be implemented efficiently in any major deep learning framework without incurring significant system overhead, and used as a modular component added to the training pipeline of any deep network to obtain a quantized network.

We demonstrate the effectiveness and flexibility of ProxQuant through systematic experiments on (1) image classification with ResNets (Section 4.1); (2) language modeling with LSTMs (Section 4.2). The ProxQuant method outperforms the state-of-the-art results on binary quantization and is comparable with the state-of-the-art on ternary and multi-bit quantization.

For binary nets, we show that BinaryConnect suffers from more optimization instability than ProxQuant through (1) a theoretical characterization of convergence for BinaryConnect (Section 5.1) and (2) a sign change experiment on CIFAR-10 (Section 5.2). Experimentally, ProxQuant finds better binary nets that are also closer to the initialization in the sign change metric.
1.1 Prior work
Methodologies
Han et al. [8] propose Deep Compression, which compresses a DNN via sparsification, nearest-neighbor clustering, and Huffman coding. This architecture is then made into specially designed hardware for efficient inference [9]. In a parallel line of work, Courbariaux et al. [5] propose BinaryConnect, which enables the training of binary neural networks, and Li and Liu [14] and Zhu et al. [24] extend this method to ternary quantization. Training and inference on quantized nets can be made more efficient by also quantizing the activation [12, 18, 23], and such networks have achieved impressive performance on large-scale tasks such as ImageNet classification [18, 24]. In NLP, quantized language models have been successfully trained using alternating multi-bit quantization [22].
Theories
Li et al. [15] prove the convergence rate of stochastic rounding and BinaryConnect on convex problems and demonstrate the advantage of BinaryConnect over stochastic rounding on non-convex problems. Anderson and Berg [1] demonstrate the effectiveness of binary networks through the observation that the angles between high-dimensional vectors are approximately preserved when binarized, and thus high-quality feature extraction with binary weights is possible. Ding et al. [6] show a universal approximation theorem for quantized ReLU networks.
Principled methods
Sun and Sun [19] perform model quantization through a Wasserstein regularization term and minimize it via the adversarial representation, similar to Wasserstein GANs [2]. Their method has the potential of generalizing to other generic requirements on the parameter, but might be hard to tune due to the instability of the inner maximization problem.
While preparing this manuscript, we discovered the independent work of Carreira-Perpiñán [3] and Carreira-Perpiñán and Idelbayev [4]. They formulate quantized network training as a constrained optimization problem and propose to solve it via augmented Lagrangian methods. From an optimization perspective, our views are largely complementary: they treat the quantization as a constraint, whereas we encourage quantization through a regularizer. Due to time constraints, we did not do an experimental comparison (they only reported results on VGG whereas we focus on ResNets). As they solve a full augmented Lagrangian minimization in between each compression step, successful training of their LC algorithm will at least require a careful tuning of this inner optimization procedure.
2 Preliminaries
The optimization difficulty of training quantized models is that they involve a discrete parameter space and hence efficient local-search methods are often prohibitive. For example, the problem of training a binary neural network is to minimize $L(\theta)$ for $\theta \in \{\pm 1\}^d$. Projected SGD on this set will not move unless the stepsize is unreasonably large [15], whereas greedy nearest-neighbor search requires $d$ forward passes, which is intractable for neural networks where $d$ is on the order of millions. Alternatively, quantized training can also be cast as minimizing $L(\mathsf{q}(\theta))$ for $\theta \in \mathbb{R}^d$ and an appropriate quantizer $\mathsf{q}$ that maps a real vector to a nearby quantized vector, but $\theta \mapsto \mathsf{q}(\theta)$ is often non-differentiable and piecewise constant (such as the binary case $\mathsf{q}(\cdot) = \mathrm{sign}(\cdot)$), and thus back-propagation through $\mathsf{q}$ does not work.
2.1 The straight-through gradient method
The pioneering work of BinaryConnect [5] proposes to solve this problem via the straight-through gradient method, that is, propagate the gradient with respect to $\mathsf{q}(\theta)$ unaltered to $\theta$, i.e. to let $\frac{\partial L}{\partial \theta} := \frac{\partial L}{\partial \mathsf{q}(\theta)}$. One iterate of the straight-through gradient method (with the SGD optimizer) is
$$\theta_{t+1} = \theta_t - \eta_t \tilde{\nabla} L(\theta)\big|_{\theta=\mathsf{q}(\theta_t)}.$$
This enables the real vector $\theta$ to move in the entire Euclidean space, and taking $\mathsf{q}(\theta)$ at the end of training gives a valid quantized model. Such a customized back-propagation rule yields good empirical performance in training quantized nets and has thus become standard practice [5, 24, 22]. However, as we have discussed, it is information-theoretically unclear how the straight-through method works, and it does fail on very simple convex Lipschitz functions (Figure 1(b)).
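2.2 Straight-through gradient as lazy projection
Our first observation is that the straight-through gradient method is equivalent to a dual-averaging method, or a lazy projected SGD [21]. In the binary case, we wish to minimize $L(\theta)$ over $Q = \{\pm 1\}^d$, and the lazy projected SGD proceeds as
$$\tilde{\theta}_t = \mathrm{Proj}_Q(\theta_t) = \mathrm{sign}(\theta_t) = \mathsf{q}(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta_t \tilde{\nabla} L(\tilde{\theta}_t). \tag{1}$$
Written compactly, this is $\theta_{t+1} = \theta_t - \eta_t \tilde{\nabla} L(\theta)|_{\theta=\mathsf{q}(\theta_t)}$, which is exactly the straight-through gradient method: take the gradient at the quantized vector and perform the update on the original real vector.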
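The lazy-projection update of eq. 1 can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names, the toy quadratic loss, and the convention that `sign(0)` returns `+1` are assumptions made here, not part of the original method description.

```python
def sign(x):
    """Binary quantizer q; the tie sign(0) = +1 is an assumed convention."""
    return 1.0 if x >= 0 else -1.0


def straight_through_step(theta, grad_fn, lr):
    """One lazy projected SGD step: gradient at q(theta), update applied to theta."""
    quantized = [sign(v) for v in theta]   # projection onto {-1, +1}^d
    grad = grad_fn(quantized)              # gradient is only queried at the quantized point
    return [v - lr * g for v, g in zip(theta, grad)]


# Hypothetical toy loss L(theta) = 0.5 * ||theta - target||^2, whose gradient is theta - target.
target = [0.3, -2.0]


def toy_grad(point):
    return [p - t for p, t in zip(point, target)]


theta = [0.1, 0.4]
theta_next = straight_through_step(theta, toy_grad, lr=0.1)
print(theta_next)
```

Note that the gradient function never sees the real-valued `theta`; it is only evaluated at `sign(theta)`, which is the limitation discussed in the text.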
2.3 Projection as a limiting proximal operator
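We take a broader point of view that a projection is also a limiting proximal operator with a suitable regularizer, to allow more generality and to motivate our proposed algorithm. Given any set $Q$, one could identify a regularizer $R : \mathbb{R}^d \to \mathbb{R}_{\ge 0}$ such that the following hold:
$$R(\theta) = 0,\ \forall \theta \in Q \quad \text{and} \quad R(\theta) > 0,\ \forall \theta \notin Q. \tag{2}$$
In the case $Q = \{\pm 1\}^d$ for example, one could take
$$R(\theta) = R_{\rm bin}(\theta) = \sum_{j=1}^d \min\{|\theta_j - 1|, |\theta_j + 1|\}. \tag{3}$$
The proximal operator (or prox operator) [17] with respect to $R$ and strength $\lambda > 0$ is
$$\mathrm{prox}_{\lambda R}(\theta) := \arg\min_{\tilde{\theta} \in \mathbb{R}^d} \left\{ \frac{1}{2} \|\tilde{\theta} - \theta\|_2^2 + \lambda R(\tilde{\theta}) \right\}.$$
In the limiting case $\lambda = \infty$, the argmin has to satisfy $R(\tilde{\theta}) = 0$, i.e. $\tilde{\theta} \in Q$, and the prox operator is to minimize $\|\tilde{\theta} - \theta\|_2^2$ over $\tilde{\theta} \in Q$, which is the Euclidean projection onto $Q$. Hence, projection is also a prox operator with $\lambda = \infty$, and the straight-through gradient estimate is equivalent to a lazy proximal gradient descent with such a regularizer $R$ and $\lambda = \infty$.
While the prox operator with $\lambda = \infty$ corresponds to "hard" projection onto the discrete set $Q$, when $\lambda < \infty$ it becomes a "soft" projection that moves towards $Q$. Compared with the hard projection, a finite $\lambda$ is less aggressive and has the potential advantage of avoiding overshoot early in training. Further, as the prox operator does not strictly enforce quantizedness, it is in principle able to query the gradients at every point in the space, and therefore has access to more information than the straight-through gradient method.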
3 Quantized net training via regularized learning
We propose the ProxQuant algorithm, which adds a quantization-inducing regularizer onto the loss and optimizes via the (non-lazy) prox-gradient method with a finite $\lambda$. The prototypical version of ProxQuant is described in Algorithm 1.
Algorithm 1 (ProxQuant). Given a regularizer $R$ that induces the desired quantizedness, an initialization $\theta_0$, learning rates $\{\eta_t\}_{t \ge 0}$, and regularization strengths $\{\lambda_t\}_{t \ge 0}$, repeat until convergence:
Perform the SGD step
$$\tilde{\theta}_t = \theta_t - \eta_t \tilde{\nabla} L(\theta_t). \tag{4}$$
Perform the prox step
$$\theta_{t+1} = \mathrm{prox}_{\eta_t \lambda_t R}(\tilde{\theta}_t). \tag{5}$$
Compared to usual full-precision training, ProxQuant only adds a prox step after each stochastic gradient step, hence it can be implemented straightforwardly on top of existing full-precision training. As the prox step does not need to know how the gradient step is performed, our method adapts to other stochastic optimizers as well, such as Adam. Further, each iteration is a prox-gradient step over the objective $L(\theta) + \lambda_t R(\theta)$ with learning rate $\eta_t$, and by choosing $(\eta_t, \lambda_t)$ we obtain joint control over the speed of training and of falling onto the quantized set.
In the remainder of this section, we define a flexible class of quantization-inducing regularizers through "distance to the quantized set", derive efficient algorithms for their corresponding prox operators, and propose a homotopy method for choosing the regularization strengths. Our regularization perspective subsumes most existing algorithms for model quantization (e.g., [5, 8, 22]) as limits of certain regularizers with strength $\lambda \to \infty$. Our proposed method can be viewed as a principled generalization of these methods to $\lambda < \infty$.
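A minimal sketch of one ProxQuant iteration (gradient step at the full-precision point, then a prox step of strength $\eta_t \lambda_t$) is given below. The prox operator is passed in as a function so that the gradient step and the prox step stay decoupled, as the paragraph above describes. The names `proxquant_step` and `prox_binary_l2`, the toy quadratic loss, and the tie-breaking of `sign(0)` are illustrative assumptions; the smooth binary prox used here is the closed form obtained for the squared $L_2$ distance to $\{\pm 1\}^d$.

```python
def sign(x):
    """Sign with the assumed convention sign(0) = +1."""
    return 1.0 if x >= 0 else -1.0


def prox_binary_l2(theta, strength):
    """Prox of strength * sum_j min((t_j - 1)^2, (t_j + 1)^2): pull each entry towards its sign."""
    return [(v + 2.0 * strength * sign(v)) / (1.0 + 2.0 * strength) for v in theta]


def proxquant_step(theta, grad_fn, prox_fn, lr, lam):
    """One ProxQuant iteration: SGD step at theta, then prox with strength lr * lam."""
    grad = grad_fn(theta)                                 # gradient at the full-precision point
    stepped = [v - lr * g for v, g in zip(theta, grad)]   # gradient step
    return prox_fn(stepped, lr * lam)                     # prox step


# Hypothetical toy loss L(theta) = 0.5 * ||theta - target||^2.
target = [0.5, -0.5]


def toy_grad(point):
    return [p - t for p, t in zip(point, target)]


theta_next = proxquant_step([0.2, -0.6], toy_grad, prox_binary_l2, lr=0.1, lam=1.0)
print(theta_next)
```

Swapping `prox_fn` for another regularizer's prox, or replacing the plain gradient step with another optimizer's update, leaves the rest of the loop unchanged.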
3.1 Regularization for model quantization
Let $Q \subset \mathbb{R}^d$ be a set of quantized parameter vectors. An ideal regularizer for quantization would vanish on $Q$ and reflect some type of distance to $Q$ when $\theta \notin Q$. To achieve this, we propose $L_1$ and $L_2$ regularizers of the form
$$R(\theta) = \inf_{\theta_0 \in Q} \|\theta - \theta_0\|_1 \quad \text{or} \quad R(\theta) = \inf_{\theta_0 \in Q} \|\theta - \theta_0\|_2^2. \tag{6}$$
This is a highly flexible framework for designing regularizers, as one could specify any $Q$ and choose between $L_1$ and $L_2$. Specifically, $Q$ encodes certain desired quantization structure. By appropriately choosing $Q$, we can specify which part of the parameter vector to quantize (see footnote 4), the number of bits to quantize to, whether we allow adaptively-chosen quantization levels, and so on.
Footnote 4: Empirically, it is advantageous to keep the biases of each layer and the BatchNorm layers at full precision; these are often a negligible fraction of the total number of parameters.
The choice of distance metric will result in distinct properties in the regularized solutions. For example, choosing the $L_1$ version leads to non-smooth regularizers that induce exact quantizedness in the same way that $L_1$ norm regularization induces sparsity [20], whereas choosing the squared $L_2$ version leads to smooth regularizers that induce quantizedness "softly".
In the following, we present a few examples of regularizers under our framework eq. 6 which induce binary weights, ternary weights and multi-bit quantization. We will also derive efficient algorithms (or approximation heuristics) for solving the prox operators corresponding to these regularizers, which generalize the projection operators used in the straight-through gradient algorithms.
Binary neural nets
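In a binary neural net, the entries of $\theta$ are in $\{\pm 1\}$. A natural choice would be taking $Q = \{-1, 1\}^d$. The resulting $L_1$ regularizer is
$$R(\theta) = \inf_{\theta_0 \in \{\pm 1\}^d} \|\theta - \theta_0\|_1 = \sum_{j=1}^d \min\{|\theta_j - 1|, |\theta_j + 1|\} = \big\| |\theta| - \mathbf{1} \big\|_1. \tag{7}$$
This is exactly the binary regularizer $R_{\rm bin}$ that we discussed earlier in eq. 3. Figure 2 plots the W-shaped one-dimensional component of $R_{\rm bin}$, from which we see its effect for inducing quantization, in analogy to $L_1$ regularization for inducing exact sparsity.
The prox operator with respect to $R_{\rm bin}$, despite being a non-convex optimization problem, admits a simple analytical solution:
$$\mathrm{prox}_{\lambda R_{\rm bin}}(\theta) = \mathrm{SoftThreshold}(\theta, \mathrm{sign}(\theta), \lambda) = \mathrm{sign}(\theta) + \mathrm{sign}(\theta - \mathrm{sign}(\theta)) \odot \big[ |\theta - \mathrm{sign}(\theta)| - \lambda \big]_+. \tag{8}$$
We note that the choice of the $L_1$ version is not unique: the squared $L_2$ version works as well, whose prox operator (with the prox definition of Section 2.3) is given by $(\theta + 2\lambda\, \mathrm{sign}(\theta)) / (1 + 2\lambda)$. See Appendix A.1 for the derivation of these prox operators and the definition of the soft thresholding operator.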
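The binary prox operator is a coordinate-wise soft threshold centered at $\mathrm{sign}(\theta_j)$ rather than at $0$. The sketch below implements it in plain Python; the function name `prox_binary_l1` and the convention `sign(0) = +1` are assumptions introduced for illustration.

```python
def sign(x):
    """Sign with the assumed convention sign(0) = +1."""
    return 1.0 if x >= 0 else -1.0


def prox_binary_l1(theta, lam):
    """Soft-threshold each entry towards its nearest point in {-1, +1} by at most lam."""
    out = []
    for v in theta:
        center = sign(v)                    # nearest binary value
        gap = v - center                    # signed distance to that value
        shrunk = max(abs(gap) - lam, 0.0)   # shrink the distance, clipping at zero
        out.append(center + sign(gap) * shrunk)
    return out


result = prox_binary_l1([0.3, -1.5, 0.95], lam=0.1)
print(result)
```

Entries that are already within `lam` of $\pm 1$ land exactly on $\pm 1$ (the third entry above), which is how the non-smooth regularizer produces numerically exact quantized values.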
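Multi-bit quantization with adaptive levels.
Following [22], we consider $k$-bit quantized parameters with a structured adaptively-chosen set of quantization levels, which translates into
$$Q = \left\{ \sum_{i=1}^k \alpha_i b_i : \{\alpha_1, \dots, \alpha_k\} \subset \mathbb{R},\ b_i \in \{\pm 1\}^d \right\} = \left\{ \theta = B\alpha : \alpha \in \mathbb{R}^k,\ B \in \{\pm 1\}^{d \times k} \right\}. \tag{9}$$
The squared $L_2$ regularizer for this structure is
$$R_{k\text{-bit}}(\theta) = \inf_{\alpha \in \mathbb{R}^k,\ B \in \{\pm 1\}^{d \times k}} \|\theta - B\alpha\|_2^2, \tag{10}$$
which is also the alternating minimization objective in [22].
We now derive the prox operator for the regularizer eq. 10. For any $\theta$, we have
$$\mathrm{prox}_{\lambda R_{k\text{-bit}}}(\theta) = \arg\min_{\tilde{\theta}} \left\{ \frac{1}{2} \|\tilde{\theta} - \theta\|_2^2 + \lambda \inf_{\alpha \in \mathbb{R}^k,\ B \in \{\pm 1\}^{d \times k}} \|\tilde{\theta} - B\alpha\|_2^2 \right\}. \tag{11}$$
This is a joint minimization problem in $(\tilde{\theta}, B, \alpha)$, and we adopt an alternating minimization schedule to solve it:

Minimize over $\tilde{\theta}$ given $(B, \alpha)$, which has a closed-form solution $\tilde{\theta} = \frac{\theta + 2\lambda B\alpha}{1 + 2\lambda}$.

Minimize over $(B, \alpha)$ given $\tilde{\theta}$, which does not depend on $\lambda$, and can be done via calling the alternating quantizer of [22]: $B\alpha = \mathsf{q}_{\rm alt}(\tilde{\theta})$.
Together, the prox operator generalizes the alternating minimization procedure in [22], as $\lambda$ governs a trade-off between quantization and closeness to $\theta$. To see that this is a strict generalization, note that for any $\lambda$ the solution of eq. 11 will be an interpolation between the input $\theta$ and its Euclidean projection onto $Q$. As $\lambda \to \infty$, the prox operator collapses to the projection.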
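The two alternating steps can be sketched for the simplest case $k = 1$, where the inner quantization step has an exact solution: $b = \mathrm{sign}(\tilde{\theta})$ and $\alpha = \mathrm{mean}(|\tilde{\theta}|)$. This stands in for the alternating quantizer of [22], which is not reproduced here; the function names and the default of two alternating rounds are assumptions made for illustration.

```python
def sign(x):
    """Sign with the assumed convention sign(0) = +1."""
    return 1.0 if x >= 0 else -1.0


def quantize_1bit(theta):
    """Exact minimizer of ||theta - alpha * b||^2 over alpha in R, b in {-1, +1}^d."""
    alpha = sum(abs(v) for v in theta) / len(theta)
    return [alpha * sign(v) for v in theta]


def prox_1bit(theta, lam, rounds=2):
    """Alternate between quantizing the current point and interpolating with the input."""
    current = list(theta)
    for _ in range(rounds):
        quantized = quantize_1bit(current)                    # step over (B, alpha)
        current = [(v + 2.0 * lam * q) / (1.0 + 2.0 * lam)    # closed-form step over theta-tilde
                   for v, q in zip(theta, quantized)]
    return current


soft = prox_1bit([1.0, -3.0], lam=0.5)
hard = prox_1bit([1.0, -3.0], lam=1e9)
print(soft, hard)
```

With `lam=0.5` the output sits halfway between the input `[1.0, -3.0]` and its one-bit quantization `[2.0, -2.0]`; with a very large `lam` it is numerically indistinguishable from the quantization itself, matching the interpolation and limiting behavior described above.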
Ternary quantization
Ternary quantization is a variant of 2-bit quantization, in which weights are constrained to be in $\{-\alpha, 0, \beta\}$ for real values $\alpha, \beta > 0$. We defer the derivation of the ternary prox operator to Appendix A.2.
3.2 Homotopy method for regularization strength
Recall that the larger $\lambda_t$ is, the more aggressively $\theta_{t+1}$ will move towards the quantized set. An ideal choice would (1) force the net to be exactly quantized upon convergence, and (2) not be so aggressive that the quantized net at convergence is sub-optimal.
We let $\lambda_t$ be a linearly increasing sequence, i.e. $\lambda_t := \lambda \cdot t$ for some hyper-parameter $\lambda > 0$ which we term the regularization rate. With this choice, the stochastic gradient steps will start off close to full-precision training and gradually move towards exact quantizedness, hence the name "homotopy method". The parameter $\lambda$ can be tuned by minimizing the validation loss, and controls the aggressiveness of falling onto the quantization constraint. There is nothing special about the linearly increasing scheme, but it is simple enough and works well, as we shall see in the experiments.
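To make the role of the regularization rate concrete, the sketch below runs a one-dimensional ProxQuant loop with the linear schedule $\lambda_t = \lambda \cdot t$ and the binary soft-threshold prox. The toy loss $L(\theta) = \frac{1}{2}(\theta - 0.3)^2$, the starting point, the learning rate, and the two rates compared are all hypothetical choices made for illustration; they are not taken from the experiments.

```python
def sign(x):
    """Sign with the assumed convention sign(0) = +1."""
    return 1.0 if x >= 0 else -1.0


def prox_binary_l1(v, strength):
    """Scalar binary soft threshold: move v towards sign(v) by at most `strength`."""
    center = sign(v)
    gap = v - center
    return center + sign(gap) * max(abs(gap) - strength, 0.0)


def run_proxquant(theta0, rate, lr=0.1, steps=200):
    """ProxQuant on the toy loss 0.5 * (theta - 0.3)^2 with lambda_t = rate * t."""
    theta = theta0
    for t in range(steps):
        lam_t = rate * t                            # linear homotopy schedule
        stepped = theta - lr * (theta - 0.3)        # gradient step on the toy loss
        theta = prox_binary_l1(stepped, lr * lam_t)
    return theta


gentle = run_proxquant(theta0=-0.1, rate=0.01)
aggressive = run_proxquant(theta0=-0.1, rate=100.0)
print(gentle, aggressive)
```

In this toy setting the gentle rate lets the iterate first follow the gradient across zero and then settle on $+1$, the better of the two binary values for this loss, while the aggressive rate snaps to $-1$ after the first step because the iterate is still negative at that point.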
4 Experiments
We evaluate the performance of ProxQuant on two tasks: image classification with ResNets, and language modeling with LSTMs. On both tasks, we show that the default straight-through gradient method is not the only choice, and our ProxQuant can achieve the same and often better results.
4.1 Image classification on CIFAR-10
Problem setup
We perform image classification on the CIFAR-10 dataset, which contains 50000 training images and 10000 test images of size 32x32. We apply a commonly used data augmentation strategy (pad by 4 pixels on each side, randomly crop to 32x32, do a horizontal flip with probability 0.5, and normalize). Our models are ResNets [10] of depth 20, 32, and 44 with weights quantized to binary or ternary.
Method
We use ProxQuant with regularizer eq. 3 in the binary case and eqs. 13 and 14 in the ternary case, which we respectively denote as PQ-B and PQ-T. The training is initialized at pre-trained full-precision nets (warm start). For the regularization strength we use the homotopy method $\lambda_t = \lambda \cdot t$ of Section 3.2, and we use the Adam optimizer with constant learning rate 0.01. To accelerate training in the final stage, we do a hard quantization at epoch 400 and keep training until the 600th epoch to stabilize the BatchNorm layers.
We compare with BinaryConnect (BC) for binary nets and Trained Ternary Quantization (TTQ) [24] for ternary nets. For BinaryConnect, we have not found reported results with ResNets on CIFAR-10, and we train with the recommended Adam optimizer with learning rate decay [5] (initial learning rate 0.01, multiplied by 0.1 at epochs 81 and 122, hard-quantize at epoch 400), which we find leads to the best result for BinaryConnect.
Result
The top-1 classification errors are reported in Table 1. For binary nets, our ProxQuant-Binary consistently yields better results than BinaryConnect. For ternary nets, our results are comparable with the reported results of TTQ (see footnote 5), and the best performance of our method over 4 runs (from the same initialization) is slightly better than TTQ.
Footnote 5: We note that our ProxQuant-Ternary and TTQ are not strictly comparable: we have the advantage of using better initializations; TTQ has the advantage of a stronger quantizer: they train the quantization levels whereas our quantizer eq. 14 pre-computes them from the current full-precision parameter.
Table 1: Top-1 classification error (%) on CIFAR-10, reported as mean (std) over 4 runs; Bo4 denotes the best of 4 runs.
| Model | Full-Precision | BC | PQ-B (ours) | TTQ | PQ-T (ours) | PQ-T (Bo4) |
| (Bits) | (32) | (1) | (1) | (2) | (2) | (2) |
|---|---|---|---|---|---|---|
| ResNet-20 | 8.06 | 9.49 (0.22) | 9.15 (0.21) | 8.87 | 8.40 (0.13) | 8.22 |
| ResNet-32 | 7.25 | 8.66 (0.36) | 8.40 (0.23) | 7.63 | 7.65 (0.15) | 7.53 |
| ResNet-44 | 6.96 | 8.26 (0.24) | 7.79 (0.06) | 7.02 | 7.05 (0.08) | 6.98 |
4.2 Language modeling with LSTMs
Problem setup
We perform language modeling with LSTMs [11] on the Penn Treebank (PTB) dataset [16], which contains 929K training tokens, 73K validation tokens, and 82K test tokens. Our model is a standard one-hidden-layer LSTM with embedding dimension 300 and hidden dimension 300. We train quantized LSTMs with the encoder, transition matrix, and the decoder quantized to $k$ bits for $k \in \{1, 2, 3\}$. The quantization is performed in a row-wise fashion, so that each row of the matrix has its own codebook $\{\alpha_1, \dots, \alpha_k\}$.
Method
We compare our multi-bit ProxQuant (eq. 11) to the state-of-the-art alternating minimization algorithm with straight-through gradients [22]. Training is initialized at a pre-trained full-precision LSTM. We use the SGD optimizer with initial learning rate 20.0, decayed by a factor of 1.2 when the validation error does not improve over an epoch. We train for 80 epochs with batch size 20, BPTT 30, dropout with probability 0.5, and gradient norm clipping. The regularization rate $\lambda$ is tuned by finding the best performance on the validation set. In addition to multi-bit quantization, we also report the results for binary LSTMs (weights in $\{\pm 1\}$), comparing BinaryConnect and our ProxQuant-Binary.
Result
We report the perplexity-per-word (PPW, lower is better) in Table 2. The performance of ProxQuant is comparable with the straight-through gradient method. On binary LSTMs, ProxQuant-Binary beats BinaryConnect by a large margin. These results demonstrate that ProxQuant offers a powerful alternative for training recurrent networks.
Table 2: Perplexity-per-word (PPW) of quantized LSTMs on Penn Treebank.
| Method / Number of Bits | 1 | 2 | 3 | FP (32) |
|---|---|---|---|---|
| BinaryConnect | 419.1 | – | – | 88.5 |
| ProxQuant-Binary (ours) | 321.8 | – | – | |
| ALT Straight-through (see footnote 6) | 104.7 | 90.2 | 86.1 | |
| ALT-ProxQuant (ours) | 106.2 | 90.0 | 87.2 | |
Footnote 6: We thank Xu et al. [22] for sharing the implementation of this method through a personal communication. There is a very clever trick not mentioned in their paper: after computing the alternating quantization $\mathsf{q}_{\rm alt}(\theta)$, they multiply by a constant 0.3 before taking the gradient; in other words, their quantizer is a rescaled alternating quantizer: $\theta \mapsto 0.3\, \mathsf{q}_{\rm alt}(\theta)$. This scaling step gives a significant gain in performance over the unscaled quantizer. In contrast, our ProxQuant does not involve a scaling step and achieves better PPW than this unscaled ALT straight-through method.
5 Stability analysis of binary quantization
5.1 Convergence characterization for BinaryConnect
We now show that BinaryConnect has a very stringent convergence condition. Consider the BinaryConnect method with batch gradients:
$$s_t = \mathrm{sign}(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta_t \nabla L(s_t). \tag{12}$$
Definition (Fixed point and convergence). We say that $s \in \{\pm 1\}^d$ is a fixed point of the BinaryConnect algorithm if $s_0 = s$ in eq. 12 implies that $s_t = s$ for all $t = 1, 2, \dots$. We say that the BinaryConnect algorithm converges if there exists $t < \infty$ such that $s_t$ is a fixed point.
Theorem 5.1. Assume that the learning rates satisfy $\sum_{t=0}^{\infty} \eta_t = \infty$. Then $s \in \{\pm 1\}^d$ is a fixed point for BinaryConnect eq. 12 if and only if $\mathrm{sign}(\nabla L(s)[i]) = -s[i]$ for all $i \in [d]$ such that $\nabla L(s)[i] \neq 0$. Such a point may not exist, in which case BinaryConnect does not converge for any initialization $\theta_0 \in \mathbb{R}^d$.
We have already seen that such a fixed point might not exist in the toy example in Figure 1(b). In the following sign change experiment on CIFAR-10, we are going to see that BinaryConnect indeed fails to converge to a fixed sign pattern, corroborating Theorem 5.1.
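The fixed-point condition is easy to check numerically. The sketch below tests it on a hypothetical one-dimensional loss $L(\theta) = |\theta - 0.5|$ (a convex Lipschitz function chosen here for illustration; it is not claimed to be the example of Figure 1(b)) and then runs BinaryConnect with a constant learning rate to count sign flips. All function names are assumptions.

```python
def sign(x):
    """Sign with the assumed convention sign(0) = +1."""
    return 1.0 if x >= 0 else -1.0


def is_fixed_point(s, grad):
    """Condition of the theorem: sign(grad[i]) == -s[i] wherever grad[i] != 0."""
    return all(g == 0 or sign(g) == -si for si, g in zip(s, grad))


def toy_grad(s):
    """Gradient of L(theta) = |theta - 0.5| evaluated at a binary point s (one coordinate)."""
    return [1.0 if v > 0.5 else -1.0 for v in s]


# Neither binary point satisfies the condition, so this loss has no fixed point.
fixed_points = [s for s in ([1.0], [-1.0]) if is_fixed_point(s, toy_grad(s))]

# Run BinaryConnect with a constant learning rate and count how often the sign flips.
theta, lr, steps = [0.05], 0.1, 20
flips = 0
for _ in range(steps):
    s = [sign(v) for v in theta]
    theta = [v - lr * g for v, g in zip(theta, toy_grad(s))]
    if sign(theta[0]) != s[0]:
        flips += 1

print(fixed_points, flips)
```

At $s = +1$ the gradient is positive and pushes $\theta$ down until the sign flips; at $s = -1$ it is negative and pushes $\theta$ back up, so the sign pattern never settles. By contrast, a loss whose gradient at $s$ points opposite to $s$ in every coordinate, such as $L(\theta) = -\theta$ at $s = +1$, does satisfy the condition.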
5.2 Sign change experiment
We experimentally compare the training dynamics of ProxQuant-Binary and BinaryConnect through the sign change metric. The sign change metric between any $\theta_1$ and $\theta_2$ is the proportion of their different signs, i.e. the (rescaled) Hamming distance:
$$\mathrm{SignChange}(\theta_1, \theta_2) = \frac{\|\mathrm{sign}(\theta_1) - \mathrm{sign}(\theta_2)\|_1}{2d} \in [0, 1].$$
In $\mathbb{R}^d$, the space of all full-precision parameters, the sign change is a natural distance metric that represents the closeness of the binarization of two parameters.
Recall that in our CIFAR-10 experiments (Section 4.1), for both BinaryConnect and ProxQuant, we initialize at a good full-precision net $\theta_0$ and stop at a converged binary network $\hat{\theta} \in \{\pm 1\}^d$. We are interested in $\mathrm{SignChange}(\theta_0, \theta_t)$ along the training path, as well as $\mathrm{SignChange}(\theta_0, \hat{\theta})$, i.e. the distance of the final output model to the initialization.
As ProxQuant converges to higher-performance solutions than BinaryConnect, we expect that if we run both methods from the same warm start, the sign change of ProxQuant should be higher than that of BinaryConnect, as in general one needs to travel farther to find a better net.
However, we find that this is not the case: ProxQuant produces binary nets with both lower sign changes and higher performance, compared with BinaryConnect. This finding is consistent in all layers, across different warm starts, and across different runs from the same warm start (see Figure 3 and Table 3 in Appendix B). This shows that for every warm start position, there is a good binary net nearby which can be found by ProxQuant but not BinaryConnect, suggesting that BinaryConnect, and in general the straight-through gradient method, suffers from higher optimization instability than ProxQuant. This result is also consistent with Theorem 5.1: the signs in BinaryConnect never stop changing until we manually freeze the signs at epoch 400.
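The sign change metric is just the fraction of coordinates whose signs disagree. A minimal sketch follows; the function name and the `sign(0) = +1` convention are assumptions made here.

```python
def sign(x):
    """Sign with the assumed convention sign(0) = +1."""
    return 1.0 if x >= 0 else -1.0


def sign_change(theta1, theta2):
    """Fraction of coordinates where sign(theta1) and sign(theta2) differ."""
    if len(theta1) != len(theta2):
        raise ValueError("parameter vectors must have the same length")
    differing = sum(1 for a, b in zip(theta1, theta2) if sign(a) != sign(b))
    return differing / len(theta1)


initial = [0.5, -0.2, 1.0, -3.0]
final = [1.0, 1.0, 1.0, -1.0]
print(sign_change(initial, final))  # 0.25: only the second coordinate changed sign
```

Counting disagreements and dividing by $d$ is the same quantity as $\|\mathrm{sign}(\theta_1) - \mathrm{sign}(\theta_2)\|_1 / (2d)$, since each disagreeing coordinate contributes $2$ to the $L_1$ norm.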
6 Conclusion
In this paper, we propose and experiment with the ProxQuant method for training quantized networks. Our results demonstrate that ProxQuant offers a powerful alternative to the straightthrough gradient method and suffers from less optimization instability. For future work, it would be of interest to propose alternative regularizers for ternary and multibit ProxQuant and experiment with our method on larger tasks.
Acknowledgement
We thank Tong He, Yifei Ma, and Zachary Lipton for their valuable feedback. We thank Chen Xu and Zhouchen Lin for the insightful discussion on multibit quantization and sharing the implementation of [22] with us.
References
[1] A. G. Anderson and C. P. Berg. The high-dimensional geometry of binary neural networks. arXiv preprint arXiv:1705.07199, 2017.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[3] M. A. Carreira-Perpiñán. Model compression as constrained optimization, with application to neural nets. Part I: General framework. arXiv preprint arXiv:1707.01209, 2017.
[4] M. A. Carreira-Perpiñán and Y. Idelbayev. Model compression as constrained optimization, with application to neural nets. Part II: Quantization. arXiv preprint arXiv:1707.04319, 2017.
[5] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.
[6] Y. Ding, J. Liu, and Y. Shi. On the universal approximability of quantized ReLU neural networks. arXiv preprint arXiv:1802.03646, 2018.
[7] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning, volume 1. MIT Press, Cambridge, 2016.
[8] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[9] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient inference engine on compressed deep neural network. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pages 243–254. IEEE, 2016.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[11] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[12] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18(187):1–30, 2017.
[13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] F. Li and B. Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
[15] H. Li, S. De, Z. Xu, C. Studer, H. Samet, and T. Goldstein. Training quantized nets: A deeper understanding. In Advances in Neural Information Processing Systems, pages 5811–5821, 2017.
[16] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[17] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
[18] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
[19] J. Sun and X. Sun. Adversarial probabilistic regularization. Unpublished draft, 2018.
[20] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
[21] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543–2596, 2010.
[22] C. Xu, J. Yao, Z. Lin, W. Ou, Y. Cao, Z. Wang, and H. Zha. Alternating multi-bit quantization for recurrent neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S19dR9x0b.
[23] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
[24] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.
Appendix A Additional results on Regularization
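A.1 Prox operators for binary nets
Here we derive the prox operators for the binary regularizer eq. 7 and its squared $L_2$ variant. Recall that
$$R_{\rm bin}(\theta) = \sum_{j=1}^d \min\{|\theta_j - 1|, |\theta_j + 1|\}.$$
By definition of the prox operator, we have for any $\lambda > 0$ that
$$\mathrm{prox}_{\lambda R_{\rm bin}}(\theta) = \arg\min_{\tilde{\theta} \in \mathbb{R}^d} \left\{ \frac{1}{2} \|\tilde{\theta} - \theta\|_2^2 + \lambda \sum_{j=1}^d \min\{|\tilde{\theta}_j - 1|, |\tilde{\theta}_j + 1|\} \right\}.$$
This minimization problem is coordinate-wise separable. For each $\tilde{\theta}_j$, the penalty term remains the same upon flipping the sign, but the quadratic term is smaller when $\mathrm{sign}(\tilde{\theta}_j) = \mathrm{sign}(\theta_j)$. Hence, the solution $\theta^\star$ to the prox satisfies $\mathrm{sign}(\theta^\star_j) = \mathrm{sign}(\theta_j)$, and the absolute value satisfies
$$|\theta^\star_j| = \arg\min_{t \ge 0} \left\{ \frac{1}{2} (t - |\theta_j|)^2 + \lambda |t - 1| \right\} = \mathrm{SoftThreshold}(|\theta_j|, 1, \lambda) = 1 + \mathrm{sign}(|\theta_j| - 1) \big[ \big| |\theta_j| - 1 \big| - \lambda \big]_+ .$$
Multiplying by $\mathrm{sign}(\theta^\star_j) = \mathrm{sign}(\theta_j)$, we have
$$\theta^\star_j = \mathrm{SoftThreshold}(\theta_j, \mathrm{sign}(\theta_j), \lambda),$$
which gives eq. 8.
For the squared $L_2$ version, by a similar argument, the corresponding regularizer is
$$R_{\rm bin}(\theta) = \sum_{j=1}^d \min\{(\theta_j - 1)^2, (\theta_j + 1)^2\}.$$
For this regularizer we have
$$\mathrm{prox}_{\lambda R_{\rm bin}}(\theta) = \arg\min_{\tilde{\theta} \in \mathbb{R}^d} \left\{ \frac{1}{2} \|\tilde{\theta} - \theta\|_2^2 + \lambda \sum_{j=1}^d \min\{(\tilde{\theta}_j - 1)^2, (\tilde{\theta}_j + 1)^2\} \right\}.$$
Using the same argument as in the $L_1$ case, the solution $\theta^\star$ satisfies $\mathrm{sign}(\theta^\star_j) = \mathrm{sign}(\theta_j)$, and
$$|\theta^\star_j| = \arg\min_{t \ge 0} \left\{ \frac{1}{2} (t - |\theta_j|)^2 + \lambda (t - 1)^2 \right\} = \frac{|\theta_j| + 2\lambda}{1 + 2\lambda}.$$
Multiplying by $\mathrm{sign}(\theta^\star_j) = \mathrm{sign}(\theta_j)$ gives
$$\theta^\star_j = \frac{\theta_j + 2\lambda\, \mathrm{sign}(\theta_j)}{1 + 2\lambda},$$
or, in vector form, $\theta^\star = (\theta + 2\lambda\, \mathrm{sign}(\theta)) / (1 + 2\lambda)$.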
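As a sanity check on the two closed forms, the sketch below compares them with a brute-force grid minimization of the scalar prox objective $\frac{1}{2}(t - v)^2 + \lambda \cdot \mathrm{penalty}(t)$. The grid range and resolution, the test points, and all function names are arbitrary choices made for illustration; the closed forms are the soft threshold around $\mathrm{sign}(v)$ for the $L_1$ penalty and $(v + 2\lambda\,\mathrm{sign}(v))/(1 + 2\lambda)$ for the squared $L_2$ penalty.

```python
def sign(x):
    """Sign with the assumed convention sign(0) = +1."""
    return 1.0 if x >= 0 else -1.0


def penalty_l1(t):
    return min(abs(t - 1.0), abs(t + 1.0))


def penalty_l2(t):
    return min((t - 1.0) ** 2, (t + 1.0) ** 2)


def closed_form_l1(v, lam):
    center = sign(v)
    gap = v - center
    return center + sign(gap) * max(abs(gap) - lam, 0.0)


def closed_form_l2(v, lam):
    return (v + 2.0 * lam * sign(v)) / (1.0 + 2.0 * lam)


def grid_prox(v, lam, penalty, lo=-3.0, hi=3.0, steps=60001):
    """Minimize 0.5 * (t - v)^2 + lam * penalty(t) over a uniform grid on [lo, hi]."""
    best_t, best_val = lo, float("inf")
    for i in range(steps):
        t = lo + (hi - lo) * i / (steps - 1)
        val = 0.5 * (t - v) ** 2 + lam * penalty(t)
        if val < best_val:
            best_t, best_val = t, val
    return best_t


lam = 0.1
points = [0.3, -1.5, 0.95, -0.2]
max_gap_l1 = max(abs(grid_prox(v, lam, penalty_l1) - closed_form_l1(v, lam)) for v in points)
max_gap_l2 = max(abs(grid_prox(v, lam, penalty_l2) - closed_form_l2(v, lam)) for v in points)
print(max_gap_l1, max_gap_l2)
```

Both gaps should be on the order of the grid spacing. The point $v = 0$ is deliberately left out, since the two signs tie there and the minimizer is not unique.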
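A.2 Prox operator for ternary quantization
For ternary quantization, we use an approximate version of the alternating prox operator eq. 11: compute $\tilde{\theta} = \mathrm{prox}_{\lambda R}(\theta)$ by initializing at $\tilde{\theta} = \theta$ and repeating
$$\hat{\theta} = \mathsf{q}(\tilde{\theta}) \quad \text{and} \quad \tilde{\theta} = \frac{\theta + 2\lambda \hat{\theta}}{1 + 2\lambda}, \tag{13}$$
where $\mathsf{q}$ is the ternary quantizer defined as
$$\mathsf{q}(\theta) = \theta^+ \mathbf{1}\{\theta \ge \Delta\} + \theta^- \mathbf{1}\{\theta \le -\Delta\}, \quad \Delta = \frac{0.7}{d} \|\theta\|_1, \quad \theta^+ = \mathrm{mean}(\theta_i : \theta_i \ge \Delta), \quad \theta^- = \mathrm{mean}(\theta_i : \theta_i \le -\Delta). \tag{14}$$
This is a straightforward extension of the TWN quantizer [14] that allows different levels for positives and negatives. We find that two rounds of alternating computation in eq. 13 achieve good performance, which is what we use in our experiments.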
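A rough sketch of this approximate ternary prox is given below. It assumes a TWN-style threshold of $0.7$ times the mean absolute value and takes the positive and negative levels to be the means of the entries beyond the threshold; these details, along with the function names and the handling of empty groups, are assumptions made for illustration rather than a faithful copy of the authors' implementation.

```python
def ternary_quantize(theta, threshold_factor=0.7):
    """Map each entry to one of {neg_level, 0, pos_level} using a magnitude threshold."""
    delta = threshold_factor * sum(abs(v) for v in theta) / len(theta)
    positives = [v for v in theta if v >= delta]
    negatives = [v for v in theta if v <= -delta]
    # Levels are the means of the entries beyond the threshold (0.0 if a group is empty).
    pos_level = sum(positives) / len(positives) if positives else 0.0
    neg_level = sum(negatives) / len(negatives) if negatives else 0.0
    return [pos_level if v >= delta else neg_level if v <= -delta else 0.0 for v in theta]


def prox_ternary(theta, lam, rounds=2):
    """Alternate quantization and interpolation, starting from the input itself."""
    current = list(theta)
    for _ in range(rounds):
        quantized = ternary_quantize(current)
        current = [(v + 2.0 * lam * q) / (1.0 + 2.0 * lam) for v, q in zip(theta, quantized)]
    return current


weights = [1.0, 0.1, -0.5, -1.5, 0.05]
print(ternary_quantize(weights))
print(prox_ternary(weights, lam=0.5))
```

For the example weights, the threshold is about $0.441$, so the small entries are zeroed, the single large positive entry keeps its value as the positive level, and the two negative entries share their mean as the negative level.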
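Appendix B Detailed sign change results on ResNet-20
Mean (std) over 4 runs for each initialization and method; the number in parentheses under each initialization is the top-1 error (%) of the full-precision net:
| Initialization | Method | Top-1 Error (%) | Sign change |
|---|---|---|---|
| FP-Net 1 (8.06) | BC | 9.489 (0.223) | 0.383 (0.006) |
| FP-Net 1 (8.06) | PQ-B | 9.146 (0.212) | 0.276 (0.020) |
| FP-Net 2 (8.31) | BC | 9.745 (0.422) | 0.381 (0.004) |
| FP-Net 2 (8.31) | PQ-B | 9.444 (0.067) | 0.288 (0.002) |
| FP-Net 3 (7.73) | BC | 9.383 (0.211) | 0.359 (0.001) |
| FP-Net 3 (7.73) | PQ-B | 9.084 (0.241) | 0.275 (0.001) |
Results of the individual runs:
| Initialization | Method | Top-1 Error (%) | Sign change |
|---|---|---|---|
| FP-Net 1 (8.06) | BC | 9.664, 9.430, 9.198, 9.663 | 0.386, 0.377, 0.390, 0.381 |
| FP-Net 1 (8.06) | PQ-B | 9.058, 8.901, 9.388, 9.237 | 0.288, 0.247, 0.284, 0.285 |
| FP-Net 2 (8.31) | BC | 9.456, 9.530, 9.623, 10.370 | 0.376, 0.379, 0.382, 0.386 |
| FP-Net 2 (8.31) | PQ-B | 9.522, 9.474, 9.410, 9.370 | 0.291, 0.287, 0.289, 0.287 |
| FP-Net 3 (7.73) | BC | 9.107, 9.558, 9.538, 9.328 | 0.360, 0.357, 0.359, 0.360 |
| FP-Net 3 (7.73) | PQ-B | 9.284, 8.866, 9.301, 8.884 | 0.275, 0.276, 0.276, 0.275 |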
Appendix C Proof of Theorem 5.1
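We start with the "$\Rightarrow$" direction. If $s$ is a fixed point, then by definition there exists $\theta_0$ such that $s_t = \mathrm{sign}(\theta_t) = s$ for all $t$. By the iterates eq. 12,
$$\theta_T = \theta_0 - \sum_{t=0}^{T-1} \eta_t \nabla L(s_t).$$
Taking signs on both sides and applying $s_t = s$ for all $t$, we get that
$$s = \mathrm{sign}(\theta_T) = \mathrm{sign}\left( \theta_0 - \nabla L(s) \sum_{t=0}^{T-1} \eta_t \right).$$
Taking the limit $T \to \infty$ and applying the assumption that $\sum_{t=0}^{\infty} \eta_t = \infty$, we get that for all $i \in [d]$ such that $\nabla L(s)[i] \neq 0$,
$$s[i] = \lim_{T \to \infty} \mathrm{sign}\left( \theta_0 - \nabla L(s) \sum_{t=0}^{T-1} \eta_t \right)[i] = -\mathrm{sign}(\nabla L(s)[i]).$$
Now we prove the "$\Leftarrow$" direction. If $s$ obeys $\mathrm{sign}(\nabla L(s)[i]) = -s[i]$ for all $i \in [d]$ such that $\nabla L(s)[i] \neq 0$, then if we take any $\theta_0$ such that $\mathrm{sign}(\theta_0) = s$, $\theta_t$ will move in a straight line in the direction of $-\nabla L(s)$, which does not change the sign of $\theta_0$. In other words, $s_t = \mathrm{sign}(\theta_t) = \mathrm{sign}(\theta_0) = s$ for all $t$. Therefore, by definition, $s$ is a fixed point.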