Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets
Abstract
Training activation quantized neural networks involves minimizing a piecewise constant function whose gradient vanishes almost everywhere, which is undesirable for the standard backpropagation or chain rule. An empirical way around this issue is to use a straight-through estimator (STE) (Bengio et al., 2013) in the backward pass, so that the “gradient” through the modified chain rule becomes nontrivial. Since this unusual “gradient” is certainly not the gradient of the loss function, the following question arises: why does searching in its negative direction minimize the training loss? In this paper, we provide a theoretical justification of the concept of STE by answering this question. We consider the problem of learning a two-linear-layer network with binarized ReLU activation and Gaussian input data. We shall refer to the unusual “gradient” given by the STE-modified chain rule as the coarse gradient. The choice of STE is not unique. We prove that if the STE is properly chosen, the expected coarse gradient correlates positively with the population gradient (which is not available for training), and its negation is a descent direction for minimizing the population loss. We further show that the associated coarse gradient descent algorithm converges to a critical point of the population loss minimization problem. Moreover, we show that a poor choice of STE leads to instability of the training algorithm near certain local minima, which is verified with CIFAR-10 experiments.
Penghang Yin^1, Jiancheng Lyu^2, Shuai Zhang^3, Stanley Osher^1, Yingyong Qi^3, Jack Xin^2
^1 Department of Mathematics, University of California, Los Angeles
yph@ucla.edu, sjo@math.ucla.edu
^2 Department of Mathematics, University of California, Irvine
jianchel@uci.edu, jxin@math.uci.edu
^3 Qualcomm AI Research, San Diego
{shuazhan,yingyong}@qti.qualcomm.com
1 Introduction
Deep neural networks (DNN) have achieved remarkable success in many machine learning applications such as computer vision (Krizhevsky et al., 2012; Ren et al., 2015), natural language processing (Collobert & Weston, 2008) and reinforcement learning (Mnih et al., 2015; Silver et al., 2016). However, the deployment of DNN typically requires hundreds of megabytes of memory storage for the trainable full-precision floating-point parameters, and billions of floating-point operations to make a single inference. To achieve substantial memory savings and energy efficiency at inference time, many recent efforts have been devoted to training coarsely quantized DNN while maintaining the performance of their float counterparts (Courbariaux et al., 2015; Rastegari et al., 2016; Cai et al., 2017; Hubara et al., 2018; Yin et al., 2018b).
Training fully quantized DNN amounts to solving a very challenging optimization problem. It calls for minimizing a piecewise constant and highly nonconvex empirical risk function subject to a discrete set-constraint that characterizes the quantized weights. In particular, weight quantization of DNN has been extensively studied in the literature; see, for example, (Li et al., 2016; Zhu et al., 2016; Li et al., 2017; Yin et al., 2016; 2018a; Hou & Kwok, 2018; He et al., 2018; Li & Hao, 2018). On the other hand, the gradient in training activation quantized DNN is almost everywhere (a.e.) zero, which makes the standard backpropagation inapplicable. Arguably the most effective way around this issue is to construct a nontrivial search direction by properly modifying the chain rule. Specifically, one can replace the a.e. zero derivative of the quantized activation function composited in the chain rule with a related surrogate. This proxy derivative, used in the backward pass only, is referred to as the straight-through estimator (STE) (Bengio et al., 2013). In the same paper, Bengio et al. (2013) proposed an alternative approach based on stochastic neurons. In addition, Friesen & Domingos (2017) proposed the feasible target propagation algorithm for learning hard-threshold (or binary activated) networks (Lee et al., 2015) via convex combinatorial optimization.
1.1 Related Works
The idea of STE originates from the celebrated perceptron algorithm (Rosenblatt, 1957; 1962) of the 1950s for learning single-layer perceptrons. The perceptron algorithm essentially does not calculate the “gradient” through the standard chain rule, but instead through a modified chain rule in which the derivative of the identity function serves as the proxy of the original derivative of the binary output function $1_{\{x>0\}}$. Its convergence has been extensively discussed in the literature; see, for example, (Widrow & Lehr, 1990; Freund & Schapire, 1999) and the references therein. Hinton (2012) extended this idea to train multi-layer networks with binary activations (a.k.a. binary neurons), namely, to backpropagate as if the activation had been the identity function. Bengio et al. (2013) proposed an STE variant which uses the derivative of the sigmoid function instead. In the training of DNN with weights and activations constrained to $\pm 1$, Hubara et al. (2016) substituted the derivative of the signum activation function with $1_{\{|x|\le 1\}}$ in the backward pass, known as the saturated STE. Later the idea of STE was readily employed in the training of DNN with general quantized ReLU activations (Hubara et al., 2018; Zhou et al., 2016; Cai et al., 2017; Choi et al., 2018; Yin et al., 2018b), where other proxies were used, including the derivatives of the vanilla ReLU and the clipped ReLU. Despite all the empirical success of STE, there is very limited theoretical understanding of it in training DNN with staircase activations.
Goel et al. (2018) considered the leaky ReLU activation of a one-hidden-layer network. They showed the convergence of the so-called Convotron algorithm, which uses the identity STE in the backward pass through the leaky ReLU layer. Other similar scenarios, where certain layers are not desirable for backpropagation, have been brought up recently by Wang et al. (2018) and Athalye et al. (2018). The former proposed an implicit weighted nonlocal Laplacian layer as the classifier to improve the generalization accuracy of DNN. In the backward pass, the derivative of a pre-trained fully-connected layer was used as a surrogate. To circumvent adversarial defense (Szegedy et al., 2013), Athalye et al. (2018) introduced the backward pass differentiable approximation, which shares the same spirit as STE, and successfully broke defenses at ICLR 2018 that rely on obfuscated gradients.
1.2 Main Contributions
Throughout this paper, we shall refer to the “gradient” of the loss function w.r.t. the weight variables through the STE-modified chain rule as the coarse gradient. The coarse gradient is certainly not the gradient of the loss function, and it is generally not the gradient of any function. Why does searching in its negative direction minimize the training loss, given that this is not the standard gradient descent? Moreover, the choice of STE is not unique, so what makes a good STE? From the optimization perspective, we take a step towards understanding STE in training quantized ReLU nets by addressing these questions.
On the theoretical side, we consider three representative STEs for learning a two-linear-layer network with binary activation and Gaussian data: the derivatives of the identity function (Rosenblatt, 1957; Hinton, 2012; Goel et al., 2018), the vanilla ReLU and the clipped ReLU (Cai et al., 2017; Hubara et al., 2016). We adopt the model of population loss minimization (Brutzkus & Globerson, 2017; Tian, 2017; Li & Yuan, 2017; Du et al., 2018). For the first time, we prove that proper choices of STE give rise to training algorithms with the descent property. Specifically, the negative expected coarse gradients based on the STEs of the vanilla and clipped ReLUs are provably descent directions for minimizing the population loss, which yield monotonically decreasing energy in the training. In contrast, this is not true for the identity STE. We further prove that the corresponding training algorithm can be unstable near certain local minima, because the coarse gradient may simply not vanish there.
Complementary to the analysis, we examine the empirical performances of the three STEs on MNIST and CIFAR-10 classifications with general quantized ReLU. While both vanilla and clipped ReLUs work very well on the relatively shallow LeNet-5, the clipped ReLU STE is arguably the best for the deeper VGG-11 and ResNet-20. In our CIFAR experiments in section 4.2, we observe that the training using the identity or ReLU STE can be unstable at good minima and repelled to an inferior one with substantially higher training loss and decreased generalization accuracy. This implies that poor STEs generate coarse gradients incompatible with the energy landscape.
To our knowledge, convergence guarantees of the perceptron algorithm (Rosenblatt, 1957; 1962) and the Convotron algorithm (Goel et al., 2018) were proved for the identity STE. It is worth noting that Convotron (Goel et al., 2018) makes weaker assumptions than this paper. These results, however, do not generalize to the network with two trainable layers studied here. As mentioned above, the identity STE is actually a poor choice in our case. Moreover, it is not clear whether their analyses can be extended to other STEs. Similar to Convotron with leaky ReLU, the monotonicity of the quantized activation function plays a role in coarse gradient descent. Indeed, all three STEs considered here exploit this property. But this is not the whole story. A good STE like the clipped ReLU matches the quantized ReLU at the extrema; otherwise the instability/incompatibility issue may arise.
Organization. In section 2, we study the energy landscape of a two-linear-layer network with binary activation and Gaussian data. We present the main results and sketch the mathematical analysis for STE in section 3. In section 4, we compare the empirical performances of different STEs in 2-bit and 4-bit activation quantization, and report the instability phenomena of the training algorithms associated with poor STEs observed in CIFAR experiments. Due to space limitations, all the technical proofs as well as some figures are deferred to the appendix.
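Notations. $\|\cdot\|$ denotes the Euclidean norm of a vector or the spectral norm of a matrix. $0_n \in \mathbb{R}^n$ represents the vector of all zeros, whereas $1_n \in \mathbb{R}^n$ is the vector of all ones. $I_n$ is the identity matrix of order $n$. For any $w, z \in \mathbb{R}^n$, $w^\top z = \langle w, z\rangle = \sum_i w_i z_i$ is their inner product. $w \odot z$ denotes the Hadamard product whose $i$-th entry is given by $(w \odot z)_i = w_i z_i$.
2 Learning Two-Linear-Layer CNN with Binary Activation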
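We consider a model similar to (Du et al., 2018) that outputs the prediction
$$y(Z, v, w) := \sum_{i=1}^m v_i\, \sigma(Z_i^\top w) = v^\top \sigma(Zw)$$
for some input $Z \in \mathbb{R}^{m\times n}$. Here $w \in \mathbb{R}^n$ and $v \in \mathbb{R}^m$ are the trainable weights in the first and second linear layer, respectively; $Z_i^\top$ denotes the $i$-th row vector of $Z$; the activation function $\sigma$ acts componentwise on the vector $Zw$, i.e., $\sigma(Zw)_i = \sigma((Zw)_i) = \sigma(Z_i^\top w)$. The first layer serves as a convolutional layer, where each row $Z_i^\top$ can be viewed as a patch sampled from $Z$ and the weight filter $w$ is shared among all patches, and the second linear layer is the classifier. The label is generated according to $y^*(Z) = (v^*)^\top \sigma(Zw^*)$ for some true (nonzero) parameters $v^*$ and $w^*$. Moreover, we use the following squared sample loss
$$\ell(v, w; Z) := \frac{1}{2}\big(y(Z, v, w) - y^*(Z)\big)^2 = \frac{1}{2}\big(v^\top \sigma(Zw) - y^*(Z)\big)^2. \tag{1}$$
Unlike in (Du et al., 2018), the activation function $\sigma$ here is not ReLU, but the binary function $\sigma(x) = 1_{\{x>0\}}$.
We assume that the entries of $Z \in \mathbb{R}^{m\times n}$ are i.i.d. sampled from the Gaussian distribution $\mathcal{N}(0,1)$ (Zhong et al., 2017; Brutzkus & Globerson, 2017). Since $\ell(v, w; Z) = \ell(v, w/c; Z)$ for any scalar $c > 0$, without loss of generality, we take $\|w^*\| = 1$ and cast the learning task as the following population loss minimization problem:
$$\min_{v \in \mathbb{R}^m,\, w \in \mathbb{R}^n} f(v, w) := \mathbb{E}_Z\left[\ell(v, w; Z)\right], \tag{2}$$
where the sample loss $\ell(v, w; Z)$ is given by (1).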
2.1 Backpropagation and Coarse Gradient Descent
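With the Gaussian assumption on $Z$, as will be shown in section 2.2, it is possible to find the analytic expressions of $f(v, w)$ and its gradient
$$\nabla f(v, w) := \begin{bmatrix} \frac{\partial f}{\partial v}(v, w) \\ \frac{\partial f}{\partial w}(v, w) \end{bmatrix}.$$
The gradient of the objective function, however, is not available for the network training. In fact, we can only access the expected sample gradient, namely,
$$\mathbb{E}_Z\left[\frac{\partial \ell}{\partial v}(v, w; Z)\right] \quad \text{and} \quad \mathbb{E}_Z\left[\frac{\partial \ell}{\partial w}(v, w; Z)\right].$$
We remark that $\mathbb{E}_Z\left[\frac{\partial \ell}{\partial w}(v, w; Z)\right]$ is not the same as $\frac{\partial f}{\partial w}(v, w) = \frac{\partial\, \mathbb{E}_Z[\ell(v, w; Z)]}{\partial w}$. By the standard backpropagation or chain rule, we readily check that
$$\frac{\partial \ell}{\partial v}(v, w; Z) = \sigma(Zw)\big(v^\top \sigma(Zw) - y^*(Z)\big) \tag{3}$$
and
$$\frac{\partial \ell}{\partial w}(v, w; Z) = Z^\top\big(\sigma'(Zw) \odot v\big)\big(v^\top \sigma(Zw) - y^*(Z)\big). \tag{4}$$
Note that $\sigma'$ is zero a.e., which makes (4) inapplicable to the training. The idea of STE is to simply replace the a.e. zero component $\sigma'$ in (4) with a related nontrivial function $\mu'$ (Hinton, 2012; Bengio et al., 2013; Hubara et al., 2016; Cai et al., 2017), which is the derivative of some (sub)differentiable function $\mu$. More precisely, backpropagation using the STE $\mu'$ gives the following nontrivial surrogate of $\frac{\partial \ell}{\partial w}(v, w; Z)$, to which we refer as the coarse (partial) gradient
$$g_\mu(v, w; Z) = Z^\top\big(\mu'(Zw) \odot v\big)\big(v^\top \sigma(Zw) - y^*(Z)\big). \tag{5}$$
Using the STE $\mu'$ to train the two-linear-layer convolutional neural network (CNN) with binary activation gives rise to the (full-batch) coarse gradient descent described in Algorithm 1: given an initialization $(v^0, w^0)$ and a learning rate $\eta$, iterate $v^{t+1} = v^t - \eta\, \mathbb{E}_Z\left[\frac{\partial \ell}{\partial v}(v^t, w^t; Z)\right]$ and $w^{t+1} = w^t - \eta\, \mathbb{E}_Z\left[g_\mu(v^t, w^t; Z)\right]$ for $t = 0, 1, \dots$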
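The coarse gradient descent of Algorithm 1 can be sketched with mini-batch averages standing in for the expectations. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the batch size, the learning rate, and the use of fresh Gaussian samples at every step are all hypothetical choices made for illustration. The residual is $v^\top\sigma(Zw) - y^*(Z)$, and the a.e. zero derivative of the binary activation is replaced by the derivative of the chosen STE.

```python
import numpy as np

def binary_act(x):
    # Binary activation: sigma(x) = 1 if x > 0, else 0.
    return (x > 0).astype(float)

# Derivatives mu' of the three STEs discussed in the text.
STE_DERIVATIVES = {
    "identity": lambda x: np.ones_like(x),
    "relu": lambda x: (x > 0).astype(float),
    "clipped_relu": lambda x: ((x > 0) & (x < 1)).astype(float),
}

def sample_gradients(v, w, Z, v_star, w_star, ste="relu"):
    """Batch averages of the sample gradient w.r.t. v and the coarse gradient w.r.t. w.

    Z has shape (N, m, n): N samples, each with m patches of dimension n.
    """
    pre = Z @ w                                   # pre-activations, shape (N, m)
    act = binary_act(pre)
    y_star = binary_act(Z @ w_star) @ v_star      # labels, shape (N,)
    resid = act @ v - y_star                      # residuals, shape (N,)
    grad_v = (act * resid[:, None]).mean(axis=0)  # sigma(Zw) * residual
    mu_prime = STE_DERIVATIVES[ste](pre)          # proxy derivative, shape (N, m)
    # Per sample: Z^T (mu'(Zw) * v) * residual, shape (N, n).
    g_w = np.einsum("nij,ni->nj", Z, mu_prime * v) * resid[:, None]
    return grad_v, g_w.mean(axis=0)

def coarse_gradient_descent(v0, w0, v_star, w_star, ste="relu",
                            lr=0.05, steps=200, batch=4096, seed=0):
    # Assumption: a large fresh Gaussian batch approximates the expectation.
    rng = np.random.default_rng(seed)
    v = np.array(v0, dtype=float)
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        Z = rng.standard_normal((batch, v.size, w.size))
        grad_v, g_w = sample_gradients(v, w, Z, v_star, w_star, ste)
        v -= lr * grad_v
        w -= lr * g_w
    return v, w
```

Switching the `ste` argument among `"identity"`, `"relu"` and `"clipped_relu"` changes only the backward proxy; the forward pass always uses the binary activation.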
2.2 Preliminaries
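Let us present some preliminaries about the landscape of the population loss function $f(v, w)$. To this end, we define the angle between $w$ and $w^*$ as $\theta(w, w^*) := \arccos\left(\frac{w^\top w^*}{\|w\|\,\|w^*\|}\right)$ for any $w \neq 0_n$. Recalling that the label is given by $y^*(Z) = (v^*)^\top \sigma(Zw^*)$ from (1), we elaborate on the analytic expressions of $f(v, w)$ and $\nabla f(v, w)$.
Lemma 1.
If $w \neq 0_n$, the population loss $f(v, w)$ is given by
$$\frac{1}{8}\left[ v^\top\big(I_m + 1_m 1_m^\top\big)v - 2\, v^\top\left(\Big(1 - \frac{2}{\pi}\theta(w, w^*)\Big) I_m + 1_m 1_m^\top\right) v^* + (v^*)^\top\big(I_m + 1_m 1_m^\top\big) v^* \right].$$
In addition, $f(v, w) = \frac{1}{8}(v^*)^\top\big(I_m + 1_m 1_m^\top\big) v^*$ for $w = 0_n$.
Lemma 2.
If $w \neq 0_n$ and $\theta(w, w^*) \in (0, \pi)$, the partial gradients of $f(v, w)$ w.r.t. $v$ and $w$ are
$$\frac{\partial f}{\partial v}(v, w) = \frac{1}{4}\big(I_m + 1_m 1_m^\top\big) v - \frac{1}{4}\left(\Big(1 - \frac{2}{\pi}\theta(w, w^*)\Big) I_m + 1_m 1_m^\top\right) v^* \tag{6}$$
and
$$\frac{\partial f}{\partial w}(v, w) = -\frac{v^\top v^*}{2\pi \|w\|}\, \frac{\Big(I_n - \frac{w w^\top}{\|w\|^2}\Big) w^*}{\Big\|\Big(I_n - \frac{w w^\top}{\|w\|^2}\Big) w^*\Big\|}, \tag{7}$$
respectively.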
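The closed forms in Lemmas 1 and 2 can be checked numerically. The sketch below is illustrative only: it assumes the expressions written in the comments (the $\frac{1}{8}$-scaled quadratic form with $I_m + 1_m 1_m^\top$ for the loss, and the projected direction for the gradient w.r.t. $w$), and all function names are hypothetical. It compares the closed-form loss against a Monte Carlo estimate, and the closed-form gradients against central finite differences.

```python
import numpy as np

def angle(w, w_star):
    # theta(w, w*) = arccos(w^T w* / (||w|| ||w*||)), clipped for numerical safety.
    c = w @ w_star / (np.linalg.norm(w) * np.linalg.norm(w_star))
    return np.arccos(np.clip(c, -1.0, 1.0))

def population_loss(v, w, v_star, w_star):
    # Assumed closed form of f(v, w) from Lemma 1.
    m = v.size
    M = np.eye(m) + np.ones((m, m))
    if not np.any(w):
        return v_star @ M @ v_star / 8.0
    theta = angle(w, w_star)
    cross = (1.0 - 2.0 * theta / np.pi) * np.eye(m) + np.ones((m, m))
    return (v @ M @ v - 2.0 * v @ cross @ v_star + v_star @ M @ v_star) / 8.0

def population_grads(v, w, v_star, w_star):
    # Assumed closed forms (6) and (7); valid for w != 0 and 0 < theta < pi.
    m, n = v.size, w.size
    theta = angle(w, w_star)
    M = np.eye(m) + np.ones((m, m))
    cross = (1.0 - 2.0 * theta / np.pi) * np.eye(m) + np.ones((m, m))
    grad_v = (M @ v - cross @ v_star) / 4.0
    w_norm = np.linalg.norm(w)
    proj = (np.eye(n) - np.outer(w, w) / w_norm**2) @ w_star
    grad_w = -(v @ v_star) / (2.0 * np.pi * w_norm) * proj / np.linalg.norm(proj)
    return grad_v, grad_w

def monte_carlo_loss(v, w, v_star, w_star, num_samples=200_000, seed=0):
    # Sample average of the squared loss (1) over Gaussian inputs Z.
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((num_samples, v.size, w.size))
    pred = (Z @ w > 0).astype(float) @ v
    label = (Z @ w_star > 0).astype(float) @ v_star
    return 0.5 * np.mean((pred - label) ** 2)
```

A Monte Carlo estimate only agrees with the closed form up to sampling error, so any comparison needs a tolerance that shrinks with the number of samples.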
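For any $v \in \mathbb{R}^m$, $(v, 0_n)$ cannot be a local minimizer. The only possible (local) minimizers of the model (2) are located at
1. Stationary points where the gradients given by (6) and (7) vanish simultaneously (which may not be possible), i.e.,
$$v^\top v^* = 0 \quad \text{and} \quad v = \big(I_m + 1_m 1_m^\top\big)^{-1}\left(\Big(1 - \frac{2}{\pi}\theta(w, w^*)\Big) I_m + 1_m 1_m^\top\right) v^*. \tag{8}$$
2. Non-differentiable points where $\theta(w, w^*) = 0$ and $v = v^*$, or $\theta(w, w^*) = \pi$ and $v = \big(I_m + 1_m 1_m^\top\big)^{-1}\big(1_m 1_m^\top - I_m\big) v^*$.
Among them, $\{(v, w) : v = v^*,\ \theta(w, w^*) = 0\}$ are obviously the global minimizers of (2). We show that the stationary points, if they exist, can only be saddle points, and $\{(v, w) : \theta(w, w^*) = \pi,\ v = (I_m + 1_m 1_m^\top)^{-1}(1_m 1_m^\top - I_m) v^*\}$ are the only potential spurious local minimizers.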
Proposition 1.
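We further prove that the population gradient $\nabla f(v, w)$ given by (6) and (7) is Lipschitz continuous when restricted to bounded domains.
Lemma 3.
For any differentiable points $(v, w)$ and $(\tilde v, \tilde w)$ with $\min\{\|w\|, \|\tilde w\|\} = c_w > 0$ and $\max\{\|v\|, \|\tilde v\|\} = C_v$, there exists a Lipschitz constant $L > 0$ depending on $C_v$ and $c_w$, such that
$$\|\nabla f(v, w) - \nabla f(\tilde v, \tilde w)\| \le L\, \|(v, w) - (\tilde v, \tilde w)\|.$$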
3 Main Results
We are most interested in the complex case where both the saddle points and spurious local minimizers are present. Our main results are concerned with the behaviors of the coarse gradient descent summarized in Algorithm 1 when the derivatives of the vanilla and clipped ReLUs as well as the identity function serve as the STE, respectively. We shall prove that Algorithm 1 using the derivative of vanilla or clipped ReLU converges to a critical point, whereas that with the identity STE does not.
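Theorem 1 (Convergence).
Let $\{(v^t, w^t)\}$ be the sequence generated by Algorithm 1 with ReLU $\mu(x) = \max\{x, 0\}$ or clipped ReLU $\mu(x) = \min\{\max\{x, 0\}, 1\}$. Suppose $\|w^t\| \ge c_w$ for all $t$ with some $c_w > 0$. Then if the learning rate $\eta > 0$ is sufficiently small, for any initialization $(v^0, w^0)$, the objective sequence $\{f(v^t, w^t)\}$ is monotonically decreasing, and $\{(v^t, w^t)\}$ converges to a saddle point or a (local) minimizer of the population loss minimization (2). In addition, if $1_m^\top v^* \neq 0$ and $m > 1$, the descent and convergence properties do not hold for Algorithm 1 with the identity function $\mu(x) = x$ near the local minimizers satisfying $\theta(w, w^*) = \pi$ and $v = (I_m + 1_m 1_m^\top)^{-1}(1_m 1_m^\top - I_m) v^*$.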
Remark 1.
The convergence guarantee for the coarse gradient descent is established under the assumption that there are infinitely many training samples. When there are only a few data samples, the empirical loss roughly descends, on a coarse scale, along the direction of the negative coarse gradient, as illustrated by Figure 1. As the sample size increases, the empirical loss gains monotonicity and smoothness. This explains why a (proper) STE works so well with massive amounts of data, as in deep learning.
Remark 2.
The same results hold if the Gaussian assumption on the input data is weakened to the assumption that their rows i.i.d. follow some rotation-invariant distribution. The proof is substantially similar.
In the rest of this section, we sketch the mathematical analysis for the main results.
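[Figure 1: behavior of the empirical loss along the direction of the negative coarse gradient. Panels: sample size = 10; sample size = 50; sample size = 1000.]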
3.1 Derivative of the Vanilla ReLU as STE
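If we choose the derivative of ReLU $\mu(x) = \max\{x, 0\}$ as the STE in (5), it is easy to see that $\mu'(x) = \sigma(x)$, and we have the following expressions of $\mathbb{E}_Z\left[\frac{\partial \ell}{\partial v}(v, w; Z)\right]$ and $\mathbb{E}_Z\left[g_{\mathrm{relu}}(v, w; Z)\right]$ for Algorithm 1.
Lemma 4.
The expected partial gradient of $\ell(v, w; Z)$ w.r.t. $v$ is
$$\mathbb{E}_Z\left[\frac{\partial \ell}{\partial v}(v, w; Z)\right] = \frac{\partial f}{\partial v}(v, w). \tag{10}$$
Let $\mu(x) = \max\{x, 0\}$ in (5). The expected coarse gradient w.r.t. $w$ is
$$\mathbb{E}_Z\left[g_{\mathrm{relu}}(v, w; Z)\right] = \frac{h(v, v^*)}{2\sqrt{2\pi}}\, \frac{w}{\|w\|} - \cos\left(\frac{\theta(w, w^*)}{2}\right) \frac{v^\top v^*}{\sqrt{2\pi}}\, \frac{\frac{w}{\|w\|} + w^*}{\left\|\frac{w}{\|w\|} + w^*\right\|}, \tag{11}$$
where $h(v, v^*) = \|v\|^2 + (1_m^\top v)^2 - (1_m^\top v)(1_m^\top v^*) + v^\top v^*$.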
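As stated in Lemma 5 below, the key observation is that the coarse partial gradient $\mathbb{E}_Z\left[g_{\mathrm{relu}}(v, w; Z)\right]$ has nonnegative correlation with the population partial gradient $\frac{\partial f}{\partial w}(v, w)$, and $-\mathbb{E}_Z\left[g_{\mathrm{relu}}(v, w; Z)\right]$ together with $-\mathbb{E}_Z\left[\frac{\partial \ell}{\partial v}(v, w; Z)\right]$ form a descent direction for minimizing the population loss.
Lemma 5.
If $w \neq 0_n$ and $\theta(w, w^*) \in (0, \pi)$, then the inner product between the expected coarse and population gradients w.r.t. $w$ is
$$\left\langle \mathbb{E}_Z\left[g_{\mathrm{relu}}(v, w; Z)\right], \frac{\partial f}{\partial w}(v, w) \right\rangle = \frac{\sin\big(\theta(w, w^*)\big)}{2(\sqrt{2\pi})^3 \|w\|}\, (v^\top v^*)^2 \ge 0.$$
Moreover, if further $\|v\| \le C_v$ and $\|w\| \ge c_w$, there exists a constant $A_{\mathrm{relu}} > 0$ depending on $C_v$ and $c_w$, such that
$$\left\|\mathbb{E}_Z\left[g_{\mathrm{relu}}(v, w; Z)\right]\right\|^2 \le A_{\mathrm{relu}} \left( \left\|\frac{\partial f}{\partial v}(v, w)\right\|^2 + \left\langle \mathbb{E}_Z\left[g_{\mathrm{relu}}(v, w; Z)\right], \frac{\partial f}{\partial w}(v, w) \right\rangle \right). \tag{12}$$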
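The correlation identity of Lemma 5 can be illustrated numerically. The following is a sketch under assumptions rather than a definitive implementation: it takes $\|w^*\| = 1$, uses the closed form of the expected ReLU coarse gradient with $h(v, v^*) = \|v\|^2 + (1_m^\top v)^2 - (1_m^\top v)(1_m^\top v^*) + v^\top v^*$ and the population gradient w.r.t. $w$ as written in the code comments, and all function names are hypothetical. A Monte Carlo average of the sample coarse gradient is included as a sanity check on the closed form.

```python
import numpy as np

def expected_coarse_grad_relu(v, w, v_star, w_star):
    # Assumed closed form of E_Z[g_relu]; requires ||w_star|| = 1 and 0 < theta < pi.
    w_hat = w / np.linalg.norm(w)
    theta = np.arccos(np.clip(w_hat @ w_star, -1.0, 1.0))
    sum_v, sum_vs = v.sum(), v_star.sum()
    h = v @ v + sum_v**2 - sum_v * sum_vs + v @ v_star
    bisector = (w_hat + w_star) / np.linalg.norm(w_hat + w_star)
    return (h / (2.0 * np.sqrt(2.0 * np.pi)) * w_hat
            - np.cos(theta / 2.0) * (v @ v_star) / np.sqrt(2.0 * np.pi) * bisector)

def population_grad_w(v, w, v_star, w_star):
    # Assumed closed form of df/dw: a multiple of the normalized projection of w*.
    w_norm = np.linalg.norm(w)
    proj = w_star - (w @ w_star) / w_norm**2 * w
    return -(v @ v_star) / (2.0 * np.pi * w_norm) * proj / np.linalg.norm(proj)

def lemma5_inner_product(v, w, v_star, w_star):
    # sin(theta) (v^T v*)^2 / (2 (sqrt(2 pi))^3 ||w||), nonnegative by construction.
    w_norm = np.linalg.norm(w)
    theta = np.arccos(np.clip(w @ w_star / w_norm, -1.0, 1.0))
    return np.sin(theta) * (v @ v_star) ** 2 / (2.0 * (2.0 * np.pi) ** 1.5 * w_norm)

def monte_carlo_coarse_grad_relu(v, w, v_star, w_star, num_samples=200_000, seed=0):
    # Sample average of Z^T (mu'(Zw) * v) * residual with mu'(x) = 1_{x > 0}.
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((num_samples, v.size, w.size))
    act = (Z @ w > 0).astype(float)
    resid = act @ v - (Z @ w_star > 0).astype(float) @ v_star
    g = np.einsum("nij,ni->nj", Z, act * v) * resid[:, None]
    return g.mean(axis=0)
```

Because the population gradient w.r.t. $w$ is orthogonal to $w$, only the bisector term of the coarse gradient contributes to the inner product, which is why the result depends on $v$ and $v^*$ only through $(v^\top v^*)^2$.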
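Clearly, when $\left\langle \mathbb{E}_Z\left[g_{\mathrm{relu}}(v, w; Z)\right], \frac{\partial f}{\partial w}(v, w) \right\rangle > 0$, $\mathbb{E}_Z\left[g_{\mathrm{relu}}(v, w; Z)\right]$ is roughly in the same direction as $\frac{\partial f}{\partial w}(v, w)$. Moreover, since by Lemma 4, $\mathbb{E}_Z\left[\frac{\partial \ell}{\partial v}(v, w; Z)\right] = \frac{\partial f}{\partial v}(v, w)$, we expect that the coarse gradient descent behaves like the gradient descent directly on $f(v, w)$. Here we would like to highlight the significance of the estimate (12) in guaranteeing the descent property of Algorithm 1. By the Lipschitz continuity of $\nabla f$ specified in Lemma 3, it holds that
$$\begin{aligned} f(v^{t+1}, w^{t+1}) - f(v^t, w^t) &\le -\eta \left( \left\|\frac{\partial f}{\partial v}(v^t, w^t)\right\|^2 + \left\langle \mathbb{E}_Z\left[g_{\mathrm{relu}}(v^t, w^t; Z)\right], \frac{\partial f}{\partial w}(v^t, w^t) \right\rangle \right) \\ &\quad + \frac{L}{2}\eta^2 \left( \left\|\frac{\partial f}{\partial v}(v^t, w^t)\right\|^2 + \left\|\mathbb{E}_Z\left[g_{\mathrm{relu}}(v^t, w^t; Z)\right]\right\|^2 \right) \\ &\overset{a)}{\le} -\left(\eta - (1 + A_{\mathrm{relu}})\frac{L}{2}\eta^2\right) \left( \left\|\frac{\partial f}{\partial v}(v^t, w^t)\right\|^2 + \left\langle \mathbb{E}_Z\left[g_{\mathrm{relu}}(v^t, w^t; Z)\right], \frac{\partial f}{\partial w}(v^t, w^t) \right\rangle \right), \end{aligned} \tag{13}$$
where a) is due to (12). Therefore, if $\eta$ is small enough, we have monotonically decreasing energy until convergence.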
Lemma 6.
3.2 Derivative of the Clipped ReLU as STE
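For the STE using clipped ReLU, $\mu(x) = \min\{\max\{x, 0\}, 1\}$ and $\mu'(x) = 1_{\{0 < x < 1\}}(x)$. We have results similar to Lemmas 5 and 6. That is, the coarse partial gradient using the clipped ReLU STE generally has positive correlation with the true partial gradient of the population loss (Lemma 7). Moreover, the coarse gradient vanishes, and only vanishes, at the critical points (Lemma 8).
Lemma 7.
If $w \neq 0_n$ and $\theta(w, w^*) \in (0, \pi)$, then, writing $\theta = \theta(w, w^*)$,
$$\mathbb{E}_Z\left[g_{\mathrm{crelu}}(v, w; Z)\right] = \frac{p(0, w)\, h(v, v^*)}{2}\, \frac{w}{\|w\|} - (v^\top v^*) \csc(\theta/2)\, q(\theta, w)\, \frac{\frac{w}{\|w\|} + w^*}{\left\|\frac{w}{\|w\|} + w^*\right\|} - (v^\top v^*) \big(p(\theta, w) - \cot(\theta/2)\, q(\theta, w)\big) \frac{w}{\|w\|},$$
where $h(v, v^*)$ is the same as in Lemma 4, and
$$p(\theta, w) := \frac{1}{2\pi} \int_{-\frac{\pi}{2} + \theta}^{\frac{\pi}{2}} \cos(\phi)\, \xi\!\left(\frac{\sec(\phi)}{\|w\|}\right) d\phi, \qquad q(\theta, w) := \frac{1}{2\pi} \int_{-\frac{\pi}{2} + \theta}^{\frac{\pi}{2}} \sin(\phi)\, \xi\!\left(\frac{\sec(\phi)}{\|w\|}\right) d\phi$$
with $\xi(x) := \int_0^x r^2 \exp\left(-\frac{r^2}{2}\right) dr$. The inner product between the expected coarse and true gradients w.r.t. $w$ is
$$\left\langle \mathbb{E}_Z\left[g_{\mathrm{crelu}}(v, w; Z)\right], \frac{\partial f}{\partial w}(v, w) \right\rangle = \frac{q(\theta, w)}{2\pi \|w\|}\, (v^\top v^*)^2 \ge 0.$$
Moreover, if further $\|v\| \le C_v$ and $\|w\| \ge c_w$, there exists a constant $A_{\mathrm{crelu}} > 0$ depending on $C_v$ and $c_w$, such that
$$\left\|\mathbb{E}_Z\left[g_{\mathrm{crelu}}(v, w; Z)\right]\right\|^2 \le A_{\mathrm{crelu}} \left( \left\|\frac{\partial f}{\partial v}(v, w)\right\|^2 + \left\langle \mathbb{E}_Z\left[g_{\mathrm{crelu}}(v, w; Z)\right], \frac{\partial f}{\partial w}(v, w) \right\rangle \right).$$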
3.3 Derivative of the Identity Function as STE
Now we consider the derivative of identity function. Similar results to Lemmas 5 and 6 are not valid anymore. It happens that the coarse gradient derived from the identity STE does not vanish at local minima, and Algorithm 1 may never converge there.
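Lemma 9.
Let $\mu(x) = x$ in (5). Then the expected coarse partial gradient w.r.t. $w$ is
$$\mathbb{E}_Z\left[g_{\mathrm{id}}(v, w; Z)\right] = \frac{1}{\sqrt{2\pi}} \left( \|v\|^2 \frac{w}{\|w\|} - (v^\top v^*)\, w^* \right). \tag{14}$$
If $\theta(w, w^*) = \pi$ and $v = (I_m + 1_m 1_m^\top)^{-1}(1_m 1_m^\top - I_m) v^*$,
$$\mathbb{E}_Z\left[g_{\mathrm{id}}(v, w; Z)\right] = \frac{2(m-1)}{\sqrt{2\pi}\,(m+1)^2}\, (1_m^\top v^*)^2\, \frac{w}{\|w\|},$$
i.e., $\mathbb{E}_Z\left[g_{\mathrm{id}}(v, w; Z)\right]$ does not vanish at the local minimizers if $1_m^\top v^* \neq 0$ and $m > 1$.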
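The non-vanishing coarse gradient of Lemma 9 can be made concrete with a small numerical example. This is an illustrative sketch, assuming $\|w^*\| = 1$, the closed form $\frac{1}{\sqrt{2\pi}}\big(\|v\|^2 \frac{w}{\|w\|} - (v^\top v^*) w^*\big)$ for the expected identity-STE coarse gradient, and the characterization of the spurious minimizers by $\theta(w, w^*) = \pi$ and $(I_m + 1_m 1_m^\top) v = (1_m 1_m^\top - I_m) v^*$; the function names and the example vectors are hypothetical.

```python
import numpy as np

def spurious_minimizer_v(v_star):
    # Solve (I + 1 1^T) v = (1 1^T - I) v*, the v-part of the spurious minimizers.
    m = v_star.size
    ones = np.ones((m, m))
    return np.linalg.solve(np.eye(m) + ones, (ones - np.eye(m)) @ v_star)

def expected_coarse_grad_identity(v, w, v_star, w_star):
    # Assumed closed form of E_Z[g_id]; requires ||w_star|| = 1 and w != 0.
    w_hat = w / np.linalg.norm(w)
    return ((v @ v) * w_hat - (v @ v_star) * w_star) / np.sqrt(2.0 * np.pi)

def nonvanishing_coefficient(v_star):
    # Coefficient 2 (m - 1) (1^T v*)^2 / (sqrt(2 pi) (m + 1)^2) multiplying w / ||w||.
    m = v_star.size
    return 2.0 * (m - 1) * v_star.sum() ** 2 / (np.sqrt(2.0 * np.pi) * (m + 1) ** 2)

# Example: m = 3 with 1^T v* = 2, and w pointing opposite to w* (theta = pi).
v_star = np.array([1.0, 2.0, -1.0])
w_star = np.array([1.0, 0.0, 0.0, 0.0])
w = -2.0 * w_star
v = spurious_minimizer_v(v_star)
g_id = expected_coarse_grad_identity(v, w, v_star, w_star)
```

In this example the gradient w.r.t. $v$ is zero at $(v, w)$, yet `g_id` is a nonzero multiple of $w / \|w\|$, so a coarse gradient step with the identity STE keeps changing the norm of $w$ instead of stopping.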
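Lemma 10.
If $w \neq 0_n$ and $\theta(w, w^*) \in (0, \pi)$, then the inner product between the expected coarse and true gradients w.r.t. $w$ is
$$\left\langle \mathbb{E}_Z\left[g_{\mathrm{id}}(v, w; Z)\right], \frac{\partial f}{\partial w}(v, w) \right\rangle = \frac{\sin\big(\theta(w, w^*)\big)}{(\sqrt{2\pi})^3 \|w\|}\, (v^\top v^*)^2 \ge 0. \tag{15}$$
When $\theta(w, w^*) \to \pi$ and $v \to (I_m + 1_m 1_m^\top)^{-1}(1_m 1_m^\top - I_m) v^*$, if $1_m^\top v^* \neq 0$ and $m > 1$, we have
$$\frac{\left\|\frac{\partial f}{\partial v}(v, w)\right\|^2 + \left\langle \mathbb{E}_Z\left[g_{\mathrm{id}}(v, w; Z)\right], \frac{\partial f}{\partial w}(v, w) \right\rangle}{\left\|\mathbb{E}_Z\left[g_{\mathrm{id}}(v, w; Z)\right]\right\|^2} \to 0. \tag{16}$$
Lemma 9 suggests that if $1_m^\top v^* \neq 0$, the coarse gradient descent will never converge near the spurious minimizers with $\theta(w, w^*) = \pi$ and $v = (I_m + 1_m 1_m^\top)^{-1}(1_m 1_m^\top - I_m) v^*$, because $\mathbb{E}_Z\left[g_{\mathrm{id}}(v, w; Z)\right]$ does not vanish there. By the positive correlation implied by (15) of Lemma 10, for some proper $(v^0, w^0)$, the iterates $\{(v^t, w^t)\}$ may move towards a local minimizer in the beginning. But when $\{(v^t, w^t)\}$ approaches it, the descent property (13) does not hold for $\mathbb{E}_Z\left[g_{\mathrm{id}}(v, w; Z)\right]$ because of (16); hence the training loss begins to increase and instability arises.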
4 Experiments
While our theory implies that both vanilla and clipped ReLUs learn a two-linear-layer CNN, their empirical performances on deeper nets are different. In this section, we compare the performances of the identity, ReLU and clipped ReLU STEs on MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky, 2009) benchmarks for 2-bit or 4-bit quantized activations. As an illustration, we plot the 2-bit quantized ReLU and its associated clipped ReLU in Figure 3 in the appendix. Intuitively, the clipped ReLU should be the best performer, as it best approximates the original quantized ReLU. We also report the instability issue of the training algorithm when using an improper STE in section 4.2. In all experiments, the weights are kept float.
The resolution $\alpha$ for the quantized ReLU needs to be carefully chosen to maintain the full-precision level accuracy. To this end, we follow (Cai et al., 2017) and resort to a modified batch normalization layer (Ioffe & Szegedy, 2015) without the scale and shift, whose output components approximately follow a unit Gaussian distribution. Then the $\alpha$ that best fits the input of the activation layer can be pre-computed by a variant of Lloyd’s algorithm (Lloyd, 1982; Yin et al., 2018a) applied to a set of simulated 1-D half-Gaussian data. Once determined, $\alpha$ is fixed during the whole training process. Since the original LeNet-5 does not have batch normalization, we add one prior to each activation layer. We emphasize that we are not claiming the superiority of the quantization approach used here, as it is nothing but the HWGQ (Cai et al., 2017), except that we consider uniform quantization.
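To make the role of the resolution concrete, the following sketch pairs a uniform quantized ReLU in the forward pass with the three STE proxies in the backward pass. It is an illustration under assumptions, not the quantizer used in the experiments: the staircase is assumed to round up to the next multiple of the resolution and to saturate at $(2^b - 1)$ times the resolution for $b$ bits, the clipped ReLU is assumed to saturate at the same level, and the function and parameter names (`alpha`, `bits`, `kind`) are hypothetical.

```python
import numpy as np

def quantized_relu(x, alpha, bits):
    # Assumed uniform staircase: 0 for x <= 0, k * alpha for (k - 1) * alpha < x <= k * alpha,
    # saturating at (2**bits - 1) * alpha.
    levels = 2**bits - 1
    return alpha * np.clip(np.ceil(x / alpha), 0, levels)

def ste_derivative(x, alpha, bits, kind="clipped_relu"):
    # Proxy derivative used only in the backward pass.
    if kind == "identity":
        return np.ones_like(x)
    if kind == "relu":
        return (x > 0).astype(float)
    if kind == "clipped_relu":
        upper = (2**bits - 1) * alpha
        return ((x > 0) & (x < upper)).astype(float)
    raise ValueError(f"unknown STE kind: {kind}")

def backward_through_activation(grad_output, pre_activation, alpha, bits, kind):
    # Modified chain rule: multiply the upstream gradient by the STE proxy
    # instead of the a.e. zero derivative of the staircase.
    return grad_output * ste_derivative(pre_activation, alpha, bits, kind)

# Example with a 2-bit activation and a hypothetical resolution of 0.5.
x = np.array([-1.0, 0.0, 0.2, 0.5, 0.7, 1.4, 3.0])
forward = quantized_relu(x, alpha=0.5, bits=2)
backward = backward_through_activation(np.ones_like(x), x, 0.5, 2, "clipped_relu")
```

Only the clipped ReLU proxy is zero on both flat ends of the staircase, which is the sense in which it matches the quantized ReLU at the extrema.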
The optimizer we use is the stochastic (coarse) gradient descent with momentum = 0.9 for all experiments. We train 50 epochs for LeNet-5 (LeCun et al., 1998) on MNIST, and 200 epochs for VGG-11 (Simonyan & Zisserman, 2014) and ResNet-20 (He et al., 2016) on CIFAR-10. The parameters/weights are initialized with those from their pre-trained full-precision counterparts. The schedule of the learning rate is specified in Table 2 in the appendix.
4.1 Comparison Results
The experimental results are summarized in Table 1, where we record both the training losses and validation accuracies. Among the three STEs, the derivative of clipped ReLU gives the best overall performance, followed by vanilla ReLU and then by the identity function. For deeper networks, clipped ReLU is the best performer. But on the relatively shallow LeNet-5 network, vanilla ReLU exhibits performance comparable to that of the clipped ReLU, which is somewhat in line with our theoretical finding that ReLU is a good STE for learning the two-linear-layer (shallow) CNN.
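Table 1: Training loss / validation accuracy for the three straight-through estimators.
Dataset | Network | Bit-width | identity | vanilla ReLU | clipped ReLU
MNIST | LeNet-5 | 2 | / | / | /
MNIST | LeNet-5 | 4 | / | / | /
CIFAR-10 | VGG-11 | 2 | 0.19/ | / | /
CIFAR-10 | VGG-11 | 4 | / | / | /
CIFAR-10 | ResNet-20 | 2 | / | / | /
CIFAR-10 | ResNet-20 | 4 | / | / | /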
4.2 Instability
We report the phenomenon of being repelled from a good minimum on ResNet-20 with 4-bit activations when using the identity STE, to demonstrate the instability issue predicted in Theorem 1. By Table 1, the coarse gradient descent algorithms using the vanilla and clipped ReLUs converge to the neighborhoods of minima with training losses of 0.25 and 0.04, respectively, whereas that using the identity STE gives a training loss of 1.38. Note that the landscape of the empirical loss function does not depend on which STE is used in the training. We then initialize training at the two improved minima and use the identity STE. To see whether the algorithm is stable there, we start the training with a tiny learning rate. For both initializations, the training loss and validation error increase significantly within the first 20 epochs; see Figure 2. To speed up training, at epoch 20 we switch to the normal learning rate schedule specified in Table 2 and run 200 additional epochs. The training using the identity STE ends up with a much worse minimum. This is because the coarse gradient with the identity STE does not vanish at the good minima in this case (Lemma 9). Similarly, the poor performance of the ReLU STE on 2-bit activated ResNet-20 is also due to the instability of the corresponding training algorithm at good minima, as illustrated by Figure 4 in Appendix C, although it diverges much more slowly.
5 Concluding Remarks
We provided the first theoretical justification for the concept of STE, showing that it gives rise to a descent training algorithm. We considered three STEs: the derivatives of the identity function, vanilla ReLU and clipped ReLU, for learning a two-linear-layer CNN with binary activation. We derived the explicit formulas of the expected coarse gradients corresponding to the STEs, and showed that the negative expected coarse gradients based on vanilla and clipped ReLUs are descent directions for minimizing the population loss, whereas the identity STE is not, since it generates a coarse gradient incompatible with the energy landscape. The instability/incompatibility issue was confirmed in CIFAR experiments for improper choices of STE. In future work, we aim to further understand coarse gradient descent for large-scale optimization problems with intractable gradients.
Acknowledgments
This work was partially supported by NSF grants DMS-1522383, IIS-1632935, ONR grant N00014-18-1-2527, AFOSR grant FA9550-18-1-0167, DOE grant DE-SC0013839 and STROBE STC NSF grant DMR-1548924.
References
 Athalye et al. (2018) Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
 Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 Brutzkus & Globerson (2017) Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.
 Cai et al. (2017) Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. Deep learning with low precision by half-wave Gaussian quantization. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 Choi et al. (2018) Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. PACT: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.
 Collobert & Weston (2008) Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning, pp. 160–167. ACM, 2008.
 Courbariaux et al. (2015) Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.
 Du et al. (2018) Simon S. Du, Jason D. Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minimum. arXiv preprint arXiv:1712.00779, 2018.
 Freund & Schapire (1999) Yoav Freund and Robert E Schapire. Large margin classification using the perceptron algorithm. Machine learning, 37(3):277–296, 1999.
 Friesen & Domingos (2017) Abram L Friesen and Pedro Domingos. Deep learning as a mixed convex-combinatorial optimization problem. arXiv preprint arXiv:1710.11573, 2017.
 Goel et al. (2018) Surbhi Goel, Adam Klivans, and Raghu Meka. Learning one convolutional layer with overlapping patches. arXiv preprint arXiv:1802.02547, 2018.
 He et al. (2018) J. He, L. Li, J. Xu, and C. Zheng. ReLU deep neural networks and linear finite elements. arXiv preprint arXiv:1807.03973, 2018.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 Hinton (2012) Geoffrey Hinton. Neural networks for machine learning, coursera. Coursera, video lectures, 2012.
 Hou & Kwok (2018) Lu Hou and James T Kwok. Loss-aware weight quantization of deep networks. arXiv preprint arXiv:1802.08635, 2018.
 Hubara et al. (2016) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
 Hubara et al. (2018) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18:1–30, 2018.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Tech Report, 2009.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Lee et al. (2015) Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio. Difference target propagation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 498–515. Springer, 2015.
 Li et al. (2016) Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.
 Li et al. (2017) Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, and Tom Goldstein. Training quantized nets: A deeper understanding. In Advances in Neural Information Processing Systems, pp. 5811–5821, 2017.
 Li & Hao (2018) Qianxiao Li and Shuji Hao. An optimal control approach to deep learning and applications to discrete-weight neural networks. In International Conference on Machine Learning, 2018.
 Li & Yuan (2017) Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pp. 597–607, 2017.
 Lloyd (1982) Stuart P. Lloyd. Least squares quantization in PCM. IEEE Trans. Info. Theory, 28:129–137, 1982.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Rastegari et al. (2016) Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Springer, 2016.
 Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99, 2015.
 Rosenblatt (1957) Frank Rosenblatt. The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical Laboratory, 1957.
 Rosenblatt (1962) Frank Rosenblatt. Principles of neurodynamics. Spartan Book, 1962.
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
 Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 Tian (2017) Yuandong Tian. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560, 2017.
 Wang et al. (2018) Bao Wang, Xiyang Luo, Zhen Li, Wei Zhu, Zuoqiang Shi, and Stanley J Osher. Deep neural nets with interpolating function as output activation. In Advances in Neural Information Processing Systems, 2018.
 Widrow & Lehr (1990) Bernard Widrow and Michael A Lehr. 30 years of adaptive neural networks: perceptron, madaline, and backpropagation. Proceedings of the IEEE, 78(9):1415–1442, 1990.
 Yin et al. (2016) Penghang Yin, Shuai Zhang, Yingyong Qi, and Jack Xin. Quantization and training of low bitwidth convolutional neural networks for object detection. arXiv preprint arXiv:1612.06052, 2016.
 Yin et al. (2018a) Penghang Yin, Shuai Zhang, Jiancheng Lyu, Stanley Osher, Yingyong Qi, and Jack Xin. BinaryRelax: A relaxation approach for training deep neural networks with quantized weights. arXiv preprint arXiv:1801.06313, 2018a.
 Yin et al. (2018b) Penghang Yin, Shuai Zhang, Jiancheng Lyu, Stanley Osher, Yingyong Qi, and Jack Xin. Blended coarse gradient descent for full quantization of deep neural networks. arXiv preprint arXiv:1808.05240, 2018b.
 Zhong et al. (2017) Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for onehiddenlayer neural networks. arXiv preprint arXiv:1706.03175, 2017.
 Zhou et al. (2016) Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
 Zhu et al. (2016) Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.
Appendix
A. The Plots of Quantized and Clipped ReLUs
[Figure 3: plots of the 2-bit quantized ReLU (left panel) and the associated clipped ReLU (right panel).]
B. The Schedule of Learning Rate
Table 2: The schedule of the learning rate.
Network | # epochs | Batch size | Initial learning rate | Decay rate | Milestones
LeNet-5 | 50 | 64 | 0.1 | 0.1 | [20, 40]
VGG-11 | 200 | 128 | 0.01 | 0.1 | [80, 140]
ResNet-20 | 200 | 128 | 0.01 | 0.1 | [80, 140]
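The schedule in the table above can be read as a step decay: the learning rate starts at its initial value and is multiplied by the decay rate at each milestone epoch. The sketch below encodes that reading; the function name, the dictionary layout, and the convention that the decay takes effect at the milestone epoch itself (counting epochs from 0) are assumptions, not details stated in the text.

```python
# Hyperparameters copied from the learning rate table; the layout is illustrative.
SCHEDULES = {
    "LeNet-5": {"epochs": 50, "batch_size": 64, "initial": 0.1, "decay": 0.1, "milestones": [20, 40]},
    "VGG-11": {"epochs": 200, "batch_size": 128, "initial": 0.01, "decay": 0.1, "milestones": [80, 140]},
    "ResNet-20": {"epochs": 200, "batch_size": 128, "initial": 0.01, "decay": 0.1, "milestones": [80, 140]},
}

def learning_rate(epoch, initial, decay, milestones):
    # Multiply by `decay` once for every milestone that has been reached.
    # Assumption: epochs are counted from 0 and the decay applies from the milestone epoch on.
    passed = sum(1 for milestone in milestones if epoch >= milestone)
    return initial * decay**passed

def schedule_for(network):
    # Full list of per-epoch learning rates for one of the networks in the table.
    cfg = SCHEDULES[network]
    return [learning_rate(e, cfg["initial"], cfg["decay"], cfg["milestones"])
            for e in range(cfg["epochs"])]
```

For example, `schedule_for("LeNet-5")` yields 20 epochs at 0.1, 20 epochs at roughly 0.01, and 10 epochs at roughly 0.001 under this convention.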
C. Instability of ReLU STE on ResNet-20 with 2-bit Activations
D. Additional Supporting Lemmas
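Lemma 11.
Let $z \in \mathbb{R}^n$ be a Gaussian random vector with entries i.i.d. sampled from $\mathcal{N}(0, 1)$. Given nonzero vectors $w, \tilde w \in \mathbb{R}^n$ with the angle $\theta$, we have
$$\mathbb{E}\left[1_{\{z^\top w > 0\}}\right] = \frac{1}{2}, \qquad \mathbb{E}\left[1_{\{z^\top w > 0,\ z^\top \tilde w > 0\}}\right] = \frac{\pi - \theta}{2\pi},$$
and
$$\mathbb{E}\left[z\, 1_{\{z^\top w > 0\}}\right] = \frac{1}{\sqrt{2\pi}}\, \frac{w}{\|w\|}, \qquad \mathbb{E}\left[z\, 1_{\{z^\top w > 0,\ z^\top \tilde w > 0\}}\right] = \frac{\cos(\theta/2)}{\sqrt{2\pi}}\, \frac{\frac{w}{\|w\|} + \frac{\tilde w}{\|\tilde w\|}}{\left\|\frac{w}{\|w\|} + \frac{\tilde w}{\|\tilde w\|}\right\|}.$$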
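Gaussian identities of this kind are easy to sanity-check by simulation. The sketch below is illustrative and rests on assumptions: it tests four standard identities for a standard Gaussian vector (the two orthant probabilities $\frac{1}{2}$ and $\frac{\pi - \theta}{2\pi}$, and the two truncated first moments along $w/\|w\|$ and along the bisector of the two directions), which may not be the complete list in the lemma; the function name and sample size are hypothetical, and the bisector formula requires $\theta < \pi$.

```python
import numpy as np

def gaussian_identities(w, w_tilde, num_samples=400_000, seed=0):
    """Return (Monte Carlo estimates, closed forms) for four Gaussian identities."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((num_samples, w.size))
    pos_w = z @ w > 0                       # event {z^T w > 0}
    pos_both = pos_w & (z @ w_tilde > 0)    # event {z^T w > 0, z^T w_tilde > 0}

    w_hat = w / np.linalg.norm(w)
    wt_hat = w_tilde / np.linalg.norm(w_tilde)
    theta = np.arccos(np.clip(w_hat @ wt_hat, -1.0, 1.0))
    # Unit vector bisecting the two directions (undefined when theta = pi).
    bisector = (w_hat + wt_hat) / np.linalg.norm(w_hat + wt_hat)

    estimates = (
        pos_w.mean(),
        pos_both.mean(),
        (z * pos_w[:, None]).mean(axis=0),
        (z * pos_both[:, None]).mean(axis=0),
    )
    closed_forms = (
        0.5,
        (np.pi - theta) / (2.0 * np.pi),
        w_hat / np.sqrt(2.0 * np.pi),
        np.cos(theta / 2.0) / np.sqrt(2.0 * np.pi) * bisector,
    )
    return estimates, closed_forms
```

With orthogonal directions, for instance, the second probability is $1/4$ and the second moment points along the diagonal between the two directions, which matches the product of independent one-dimensional half-Gaussian integrals.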