Towards Accurate Binary Convolutional Neural Network
Abstract
We introduce a novel scheme to train binary convolutional neural networks (CNNs), i.e., CNNs with weights and activations constrained to {-1,+1} at run-time. It is known that using binary weights and activations drastically reduces memory size and accesses, and can replace arithmetic operations with more efficient bitwise operations, leading to much faster test-time inference and lower power consumption. However, previous works on binarizing CNNs usually result in severe prediction accuracy degradation. In this paper, we address this issue with two major innovations: (1) approximating full-precision weights with the linear combination of multiple binary weight bases; (2) employing multiple binary activations to alleviate information loss. The implementation of the resulting binary CNN, denoted as ABC-Net, is shown to achieve much closer performance to its full-precision counterpart, and even reach comparable prediction accuracy on the ImageNet and forest trail datasets, given adequate binary weight bases and activations.
1 Introduction
Convolutional neural networks (CNNs) have achieved state-of-the-art results on real-world applications such as image classification (He et al., 2016) and object detection (Ren et al., 2015), with the best results obtained with large models and sufficient computation resources. Concurrently with this progress, the deployment of CNNs on mobile devices for consumer applications is gaining more and more attention, owing to its widespread commercial value and exciting prospects.
On mobile applications, it is typically assumed that training is performed on the server and test or inference is executed on the mobile devices (Courbariaux et al., 2016; Esser et al., 2016). In the training phase, GPUs enabled substantial breakthroughs because of their greater computational speed. In the test phase, however, GPUs are usually too expensive to deploy. Thus improving the test-time performance and reducing hardware costs are likely to be crucial for further progress, as mobile applications usually require real-time operation, low power consumption and full embeddability. As a result, there is much interest in research and development of dedicated hardware for deep neural networks (DNNs). Binary neural networks (BNNs) (Courbariaux et al., 2016; Rastegari et al., 2016), i.e., neural networks with weights and perhaps activations constrained to only two possible values (e.g., -1 or +1), would bring great benefits to specialized DNN hardware for three major reasons: (1) binary weights/activations reduce memory usage and model size by a factor of 32 compared to the single-precision version; (2) if weights are binary, then most multiply-accumulate operations can be replaced by simple accumulations, which is beneficial because multipliers are the most space and power-hungry components of the digital implementation of neural networks; (3) furthermore, if both activations and weights are binary, the multiply-accumulations can be replaced by the bitwise operations xnor and bitcount (Courbariaux et al., 2016). This could have a big impact on dedicated deep learning hardware. For instance, a 32-bit floating point multiplier costs about 200 Xilinx FPGA slices (Govindu et al., 2004), whereas a 1-bit xnor gate only costs a single slice. Semiconductor manufacturers like IBM (Esser et al., 2016) and Intel (Venkatesh et al., 2016) have been involved in the research and development of related chips.
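As a rough illustration of reason (1), the following sketch packs binary weights into single bits and compares the storage with 32-bit floats. The layer size and the bit encoding (1 for +1, 0 for -1) are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_weights = 1024  # hypothetical layer size, chosen only for illustration

# Full-precision weights: 32 bits each.
w_fp32 = rng.standard_normal(n_weights).astype(np.float32)

# Binary weights in {-1, +1}, stored as one bit each (1 encodes +1, 0 encodes -1).
bits = (w_fp32 >= 0).astype(np.uint8)
w_packed = np.packbits(bits)

# Ratio of storage sizes in bytes.
compression = w_fp32.nbytes / w_packed.nbytes

# Unpacking recovers the {-1, +1} weights exactly.
w_binary = np.unpackbits(w_packed)[:n_weights].astype(np.int8) * 2 - 1
```

With these sizes the float array occupies 4096 bytes and the packed array 128 bytes, which is the factor of 32 quoted above.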
However, binarization usually causes severe prediction accuracy degradation, especially on complex tasks such as classification on the ImageNet dataset. To take a closer look, Rastegari et al. (2016) show that binarizing weights causes the accuracy of ResNet-18 to drop from 69.3% to 60.8% on ImageNet. If the activations are binarized as well, the accuracy drops further to 51.2%. A similar phenomenon is reported elsewhere in the literature, e.g., by Hubara et al. (2016). Clearly there is a considerable gap between the accuracy of a full-precision model and that of a binary model.
This paper proposes a novel scheme for binarizing CNNs, which aims to alleviate, or even eliminate, the accuracy degradation, while still significantly reducing inference time, resource requirements and power consumption. The paper makes the following major contributions.

We approximate full-precision weights with the linear combination of multiple binary weight bases. The weight values of CNNs are constrained to {-1,+1}, which means convolutions can be implemented by only addition and subtraction (without multiplication), or by bitwise operations when activations are binary as well. We demonstrate that 3 to 5 binary weight bases are adequate to approximate the full-precision weights well.

We introduce multiple binary activations. Previous works have shown that the quantization of activations, especially binarization, is more difficult than that of weights (Cai et al., 2017; Courbariaux et al., 2016). By employing five binary activations, we have been able to reduce the Top-1 and Top-5 accuracy degradation caused by binarization to around 5% on ImageNet compared to the full-precision counterpart.
It is worth noting that the multiple binary weight bases/activations scheme is preferable to the fixed-point quantization in previous works. In those fixed-point quantized networks one still needs to employ arithmetic operations, such as multiplication and addition, on fixed-point values. Even though faster than floating point, they still require relatively complex logic and can consume a lot of power. Detailed discussions can be found in Section 5.2.
Ideally, combining more binary weight bases and activations always leads to better accuracy, which eventually gets very close to that of full-precision networks. We verify this on ImageNet using the ResNet network topology. This is the first time a binary neural network achieves prediction accuracy comparable to its full-precision counterpart on ImageNet.
2 Related work
Quantized Neural Networks: High-precision parameters are not strictly necessary to reach high performance in deep neural networks. Recent research efforts (e.g., Hubara et al., 2016) have considerably reduced memory requirements and computational complexity by using low bit-width weights and activations. Zhou et al. (2016) further generalized these schemes and proposed to train CNNs with low bit-width gradients. By performing the quantization after network training or using the “straight-through estimator” (STE) (Bengio et al., 2013), these works avoided the issues of non-differentiable optimization. While some of these methods have produced good results on datasets such as CIFAR-10 and SVHN, none has produced low-precision networks competitive with full-precision models on large-scale classification tasks, such as ImageNet. In fact, Zhou et al. (2016) and Hubara et al. (2016) experiment with different combinations of bit-width for weights and activations, and show that the performance of their highly quantized networks deteriorates rapidly when the weights and activations are quantized to less than 4-bit numbers. Cai et al. (2017) enhance the performance of a low bit-width model by addressing the gradient mismatch problem; nevertheless, there is still much room for improvement.
Binarized Neural Networks: The binary representation for deep models is not a new topic. In the early days of artificial neural networks, inspired by biology, the unit step function was used as the activation function (Toms, 1990). It is known that binary activations can use spiking responses for event-based computation and communication (consuming energy only when necessary) and are therefore energy-efficient (Esser et al., 2016). Recently, Courbariaux et al. (2016) introduced Binarized Neural Networks (BNNs), neural networks with binary weights and activations at run-time. Different from their work, Rastegari et al. (2016) introduced simple, efficient, and accurate approximations to CNNs by binarizing the weights and even the intermediate representations in CNNs. All these works drastically reduce memory consumption and replace most arithmetic operations with bitwise operations, which potentially leads to a substantial increase in power efficiency.
In all the above-mentioned works, binarization significantly reduces accuracy. Our experimental results on ImageNet show that we are close to filling the gap between the accuracy of a binary model and its full-precision counterpart. We rely on the idea of finding the best approximation of a full-precision convolution using multiple binary operations, and employing multiple binary activations to allow more information to pass through.
3 Binarization methods
In this section, we detail our binarization method, which is termed ABC-Net (Accurate-Binary-Convolutional) for convenience. Bear in mind that during training, the real-valued weights are retained and updated at every epoch, while at test-time only binary weights are used in convolution.
3.1 Weight approximation
Consider an $L$-layer CNN architecture. Without loss of generality, we assume the weights of each convolutional layer are tensors of dimension $(w, h, c_{in}, c_{out})$, which represent filter width, filter height, input-channel and output-channel respectively. We propose two variations of the binarization method for weights at each layer: 1) approximate weights as a whole and 2) approximate weights channel-wise.
3.1.1 Approximate weights as a whole
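At each layer, in order to constrain a CNN to have binary weights, we estimate the real-value weight filter $W \in \mathbb{R}^{w \times h \times c_{in} \times c_{out}}$ using the linear combination of $M$ binary filters $B_1, B_2, \cdots, B_M \in \{-1,+1\}^{w \times h \times c_{in} \times c_{out}}$ such that $W \approx \alpha_1 B_1 + \alpha_2 B_2 + \cdots + \alpha_M B_M$. To find an optimal estimation, a straightforward way is to solve the following optimization problem:
$$\min_{\alpha, B} J(\alpha, B) = \| w - B\alpha \|^2, \quad \text{s.t. } B_{ij} \in \{-1,+1\}, \qquad (1)$$
where $B = [\mathrm{vec}(B_1), \mathrm{vec}(B_2), \cdots, \mathrm{vec}(B_M)]$, $w = \mathrm{vec}(W)$ and $\alpha = [\alpha_1, \alpha_2, \cdots, \alpha_M]^\top$. Here the notation $\mathrm{vec}(\cdot)$ refers to vectorization.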
Although a local minimum solution to (1) can be obtained by numerical methods, one could not backpropagate through it to update the real-value weight filter $W$. To address this issue, assuming the mean and standard deviation of $W$ are $\mathrm{mean}(W)$ and $\mathrm{std}(W)$ respectively, we fix the $B_i$'s as follows:
$$B_i = F_{u_i}(W) := \mathrm{sign}(\bar{W} + u_i\,\mathrm{std}(W)), \quad i = 1, 2, \cdots, M, \qquad (2)$$
where $\bar{W} = W - \mathrm{mean}(W)$ and $u_i$ is a shift parameter. For example, one can choose the $u_i$'s to be $u_i = -1 + (i-1)\frac{2}{M-1}$, $i = 1, 2, \cdots, M$, to shift evenly over the range $[-\mathrm{std}(W), \mathrm{std}(W)]$, or leave them to be trained by the network. This is based on the observation that the full-precision weights tend to have a symmetric, non-sparse distribution, which is close to Gaussian. To gain more intuition and illustrate the approximation effectiveness, an example is visualized in Section S2 of the supplementary material.
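With the $B_i$'s chosen, (1) becomes a linear regression problem
$$\min_{\alpha} J(\alpha) = \| w - B\alpha \|^2, \qquad (3)$$
in which the $B_i$'s serve as the bases in the design/dictionary matrix. We can then back-propagate through the $B_i$'s using the “straight-through estimator” (STE) (Bengio et al., 2013). Let $c$ be the cost function, and $A$ and $O$ the input and output tensors of a convolution respectively. The forward and backward passes of an approximated convolution during training can be computed as follows:
Forward: $B_1, B_2, \cdots, B_M = F_{u_1}(W), F_{u_2}(W), \cdots, F_{u_M}(W)$, (4)
Solve (3) for $\alpha$, (5)
$O = \sum_{m=1}^{M} \alpha_m \mathrm{Conv}(B_m, A)$. (6)
Backward: $\frac{\partial c}{\partial W} = \frac{\partial c}{\partial O}\left(\sum_{m=1}^{M} \alpha_m \frac{\partial O}{\partial B_m}\frac{\partial B_m}{\partial W}\right) \overset{\mathrm{STE}}{=} \frac{\partial c}{\partial O}\left(\sum_{m=1}^{M} \alpha_m \frac{\partial O}{\partial B_m}\right) = \sum_{m=1}^{M} \alpha_m \frac{\partial c}{\partial B_m}$. (7)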
At test-time, only (6) is required. The block structure of this approximated convolution layer is shown on the left side in Figure 1. With suitable hardware and appropriate implementations, the convolution can be efficiently computed. For example, since the weight values are binary, we can implement the convolution with additions and subtractions (thus without multiplications). Furthermore, if the input is binary as well, we can implement the convolution with bitwise operations: xnor and bitcount (Rastegari et al., 2016). Note that the convolution with each binary filter can be computed in parallel.
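The weight approximation described in this subsection can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the training code used for the experiments: the weights are random Gaussian values, each base is taken as the sign of the mean-centered weights shifted by a multiple $u_i$ of their standard deviation, the shifts are spread evenly over [-1, 1], and the helper names `binary_bases` and `fit_alpha` are hypothetical.

```python
import numpy as np

def binary_bases(W, us):
    """Binary bases sign(W - mean(W) + u * std(W)), with sign(0) taken as +1."""
    W_bar = W - W.mean()
    return [np.where(W_bar + u * W.std() >= 0, 1.0, -1.0) for u in us]

def fit_alpha(W, bases):
    """Least-squares coefficients alpha for vec(W) ~ B @ alpha."""
    B = np.stack([b.ravel() for b in bases], axis=1)   # shape (n, M)
    alpha, *_ = np.linalg.lstsq(B, W.ravel(), rcond=None)
    return alpha

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3, 16, 32))   # (w, h, c_in, c_out), toy sizes

errors = {}
for M in (1, 3, 5):
    # Evenly spaced shifts in [-1, 1]; a single base uses no shift.
    us = np.linspace(-1.0, 1.0, M) if M > 1 else np.array([0.0])
    bases = binary_bases(W, us)
    alpha = fit_alpha(W, bases)
    W_hat = sum(a * b for a, b in zip(alpha, bases))
    errors[M] = float(np.sqrt(np.mean((W - W_hat) ** 2)))
```

Because the shift sets for M = 1, 3 and 5 are nested in this sketch, the RMSE stored in `errors` cannot increase as M grows.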
3.1.2 Approximate weights channel-wise
Alternatively, we can estimate the real-value weight filter $W_i \in \mathbb{R}^{w \times h \times c_{in}}$ of each output channel $i \in \{1, 2, \cdots, c_{out}\}$ using the linear combination of $M$ binary filters $B_{i1}, B_{i2}, \cdots, B_{iM} \in \{-1,+1\}^{w \times h \times c_{in}}$ such that $W_i \approx \alpha_{i1} B_{i1} + \alpha_{i2} B_{i2} + \cdots + \alpha_{iM} B_{iM}$. Again, to find an optimal estimation, we solve a linear regression problem analogous to (3) for each output channel. After convolution, the results are concatenated together along the output-channel dimension. If $M = 1$, this approach reduces to the Binary-Weights-Networks (BWN) proposed in (Rastegari et al., 2016).
Compared to approximating weights as a whole, the channel-wise approach approximates weights more elaborately, yet incurs no extra cost during inference. Since this approach requires more computational resources during training, we leave it as future work and focus on the former approximation approach in this paper.
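A possible sketch of the channel-wise variant is given below. It assumes the same sign-with-shift bases as the whole-tensor approach, applied per output channel; the function name `fit_channelwise` and the toy tensor sizes are hypothetical. The last lines illustrate the single-base case: for a zero-mean channel, the least-squares coefficient equals the mean absolute weight, which is the scaling factor used by BWN.

```python
import numpy as np

def fit_channelwise(W, us):
    """Fit each output channel W[..., i] with its own set of binary bases."""
    c_out = W.shape[-1]
    W_hat = np.empty_like(W)
    alphas = np.empty((c_out, len(us)))
    for i in range(c_out):
        Wi = W[..., i]
        Wi_bar = Wi - Wi.mean()
        # One column per base: sign of the centered, shifted channel weights.
        B = np.stack([np.where(Wi_bar + u * Wi.std() >= 0, 1.0, -1.0).ravel()
                      for u in us], axis=1)
        alphas[i], *_ = np.linalg.lstsq(B, Wi.ravel(), rcond=None)
        W_hat[..., i] = (B @ alphas[i]).reshape(Wi.shape)
    return alphas, W_hat

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3, 8, 4))          # (w, h, c_in, c_out), toy sizes
alphas, W_hat = fit_channelwise(W, us=[-1.0, 0.0, 1.0])

# Single base on zero-mean channels: alpha_i equals mean(|W_i|).
W0 = W - W.mean(axis=(0, 1, 2), keepdims=True)
alpha_bwn, _ = fit_channelwise(W0, us=[0.0])
```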
3.2 Multiple binary activations and bitwise convolution
As mentioned above, a convolution can be implemented without multiplications when weights are binarized. However, to utilize the bitwise operation, the activations must be binarized as well, as they are the inputs of convolutions.
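To make the bitwise operation concrete, the sketch below computes the dot product of two {-1,+1} vectors using only xnor and bitcount. The encoding (bit 1 for +1, bit 0 for -1) and the helper names `pack` and `xnor_dot` are assumptions of this example; a hardware implementation would operate on machine words rather than Python integers.

```python
import numpy as np

def pack(x):
    """Encode a {-1, +1} vector as an integer bit mask (bit set means +1)."""
    return sum(1 << i for i, v in enumerate(x) if v == 1)

def xnor_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1, +1} vectors from their bit encodings."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask       # xnor: 1 where the signs agree
    matches = bin(agree).count("1")         # bitcount
    return 2 * matches - n                  # (#agreements) - (#disagreements)

rng = np.random.default_rng(0)
n = 64
a = rng.choice([-1, 1], size=n)
b = rng.choice([-1, 1], size=n)
result = xnor_dot(pack(a), pack(b), n)
```

The value of `result` matches the ordinary dot product `a @ b`, since each agreeing position contributes +1 and each disagreeing position contributes -1.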
Similar to the activation binarization procedure in (Zhou et al., 2016), we binarize activations after passing them through a bounded activation function $h$, which ensures $h(x) \in [0,1]$. We choose the bounded rectifier as $h$. Formally, it can be defined as:
$$h_v(x) = \mathrm{clip}(x + v, 0, 1), \qquad (8)$$
where $v$ is a shift parameter. If $v = 0$, then $h_v$ is the clip activation function in (Zhou et al., 2016).
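We constrain the binary activations to either 1 or -1. In order to transform the real-valued activation $R$ into a binary activation, we use the following binarization function:
$$H_v(R) := 2\,\mathbb{I}_{h_v(R) \ge 0.5} - 1, \qquad (9)$$
where $\mathbb{I}$ is the indicator function. The conventional forward and backward passes of the activation can be given as follows:
Forward: $A = H_v(R)$.
Backward: $\frac{\partial c}{\partial R} = \frac{\partial c}{\partial A} \odot \mathbb{I}_{0 \le R + v \le 1}$ (using STE). (10)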
Here $\odot$ denotes the Hadamard product. As can be expected, binarizing activations as above is rather crude and leads to non-trivial losses in accuracy, as shown in Rastegari et al. (2016) and Hubara et al. (2016). While it is also possible to approximate activations with linear regression, as is done for weights, another critical challenge arises: unlike weights, the activations always vary in test-time inference. Luckily, this difficulty can be avoided by exploiting the statistical structure of the activations of deep networks.
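The forward and backward passes of a single binary activation can be sketched as follows. The sketch assumes the bounded rectifier has the form clip(x + v, 0, 1), that binarization thresholds it at 0.5, and that the straight-through estimator passes gradients only where the rectifier is not saturated; the function names are hypothetical.

```python
import numpy as np

def h(x, v):
    """Bounded rectifier with shift v (assumed form: clip(x + v, 0, 1))."""
    return np.clip(x + v, 0.0, 1.0)

def H(R, v):
    """Binarize to {-1, +1} by thresholding h_v(R) at 0.5."""
    return np.where(h(R, v) >= 0.5, 1.0, -1.0)

def H_backward(grad_A, R, v):
    """STE backward pass: keep the gradient where h_v is not saturated."""
    mask = ((R + v >= 0.0) & (R + v <= 1.0)).astype(grad_A.dtype)
    return grad_A * mask                     # element-wise (Hadamard) product

R = np.array([-1.0, 0.2, 0.6, 3.0])
A = H(R, v=0.0)
grad_R = H_backward(np.ones_like(R), R, v=0.0)
```

For this input, only the two middle entries lie in the unsaturated region, so only they receive a gradient.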
Our scheme can be described as follows. First of all, to keep the distribution of activations relatively stable, we resort to batch normalization (Ioffe and Szegedy, 2015). This is a widely used normalization technique, which forces the responses of each network layer to have zero mean and unit variance. We apply this normalization before activation. Secondly, we estimate the real-value activation $R$ using the linear combination of $N$ binary activations $A_1, A_2, \cdots, A_N$ such that $R \approx \beta_1 A_1 + \beta_2 A_2 + \cdots + \beta_N A_N$, where
$$A_1, A_2, \cdots, A_N = H_{v_1}(R), H_{v_2}(R), \cdots, H_{v_N}(R). \qquad (11)$$
Different from that of weights, the parameters $\beta_n$'s and $v_n$'s ($n = 1, \cdots, N$) here are both trainable, just like the scale and shift parameters in batch normalization. Without the explicit linear regression approach, the $\beta_n$'s and $v_n$'s are tuned by the network itself during training and fixed at test-time. They are expected to learn and utilize the statistical features of full-precision activations.
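The resulting network architecture outputs multiple binary activations $A_1, A_2, \cdots, A_N$ and their corresponding coefficients $\beta_1, \beta_2, \cdots, \beta_N$, which allows more information to pass through compared to the former one. Combined with the weight approximation, the whole convolution scheme is given by:
$$\mathrm{Conv}(W, R) \approx \mathrm{Conv}\left(\sum_{m=1}^{M} \alpha_m B_m, \sum_{n=1}^{N} \beta_n A_n\right) = \sum_{m=1}^{M} \sum_{n=1}^{N} \alpha_m \beta_n \mathrm{Conv}(B_m, A_n), \qquad (12)$$
which suggests that it can be implemented by computing $M \times N$ bitwise convolutions in parallel. An example of the whole convolution scheme is shown in Figure 1.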
3.3 Training algorithm
A typical block in a CNN contains several different layers, which are usually in the following order: (1) Convolution, (2) Batch Normalization, (3) Activation and (4) Pooling. The batch normalization layer (Ioffe and Szegedy, 2015) normalizes the input batch by its mean and variance. The activation is an element-wise non-linear function (e.g., Sigmoid, ReLU). The pooling layer applies any type of pooling (e.g., max, min or average) on the input batch. In our experiment, we observe that applying max-pooling on a binary input returns a tensor most of whose elements are equal to +1, resulting in a noticeable drop in accuracy. A similar phenomenon has been reported in Rastegari et al. (2016) as well. Therefore, we put the max-pooling layer before the batch normalization and activation.
Since our binarization scheme approximates full-precision weights, a full-precision pre-trained model serves as a perfect initialization. However, fine-tuning is always required for the weights to adapt to the new network structure. The training procedure of ABC-Net is summarized in Section S1 of the supplementary material.
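The effect of max-pooling on binary inputs is easy to reproduce in isolation. The sketch below assumes, for illustration only, a feature map whose entries are +1 or -1 with equal probability and a 2x2 pooling window with stride 2; in that setting a pooled element is -1 only if all four inputs are -1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=(64, 64))        # binary map, +1 and -1 equally likely

# 2x2 max-pooling with stride 2, implemented by reshaping into blocks.
pooled = x.reshape(32, 2, 32, 2).max(axis=(1, 3))

frac_before = float((x == 1).mean())          # close to 1/2
frac_after = float((pooled == 1).mean())      # close to 1 - (1/2)**4
```

Under these assumptions about 15/16 of the pooled elements are +1, so most of the information carried by the sign pattern is lost.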
It is worth noting that as $M$ increases, the shift parameters get closer and the bases of the linear combination become more correlated, which sometimes leads to rank deficiency when solving (3). This can be tackled with $\ell_2$ regularization.
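One way to realize such a regularized regression is ridge regression, sketched below under the assumption that the penalty is a squared $\ell_2$ norm on the coefficients; the function name `fit_alpha_ridge`, the penalty weight and the deliberately duplicated base are choices made only for this example.

```python
import numpy as np

def fit_alpha_ridge(B, w, lam):
    """Solve min ||w - B a||^2 + lam * ||a||^2 via the normal equations."""
    M = B.shape[1]
    return np.linalg.solve(B.T @ B + lam * np.eye(M), B.T @ w)

rng = np.random.default_rng(0)
w = rng.standard_normal(200)
b = np.where(w >= 0, 1.0, -1.0)
B = np.stack([b, b], axis=1)        # two identical bases: B^T B is singular

rank = np.linalg.matrix_rank(B.T @ B)
alpha = fit_alpha_ridge(B, w, lam=1e-3)
```

Without the penalty the two coefficients are not uniquely determined; with it, the weight is split evenly between the two identical bases.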
4 Experiment results
In this section, the proposed ABC-Net is evaluated on the ILSVRC12 ImageNet classification dataset (Deng et al., 2009), and on the visual perception of forest trails dataset for mobile robots (Giusti et al., 2016) in Section S6 of the supplementary material.
4.1 Experiment results on ImageNet dataset
The ImageNet dataset contains about 1.2 million high-resolution natural images for training that span 1000 categories of objects. The validation set contains 50k images. We use ResNet (He et al., 2016) as the network topology. The images are resized to 224x224 before being fed into the network. We report our classification performance using Top-1 and Top-5 accuracies.
4.1.1 Effect of weight approximation
We first evaluate the weight approximation technique by examining the accuracy improvement for a binary model. To eliminate variables, we keep the activations full-precision in this experiment. Table 1 shows the prediction accuracy of ABC-Net on ImageNet with different choices of $M$. For comparison, we add the results of Binary-Weights-Network (denoted ‘BWN’) reported in Rastegari et al. (2016) and the full-precision network (denoted ‘FP’). The BWN binarizes weights and keeps the activations full-precision as we do. All results in this experiment use ResNet-18 as the network topology. It can be observed that as $M$ increases, the accuracy of ABC-Net converges to that of its full-precision counterpart. The Top-1 gap between them reduces to about 1 percentage point (68.3% vs. 69.3%) when $M = 5$, which suggests that this approach nearly eliminates the accuracy degradation caused by binarizing weights.
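Table 1: Top-1 and Top-5 accuracy of ABC-Net on ImageNet with full-precision activations and different numbers of binary weight bases $M$.
Accuracy  BWN  $M=1$  $M=2$  $M=3$  $M=5$  FP 
Top-1  60.8%  62.8%  63.7%  66.2%  68.3%  69.3% 
Top-5  83.0%  84.4%  85.2%  86.7%  87.9%  89.2% 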
4.1.2 Configuration space exploration
We explore the configuration space of combinations of the number of weight bases and the number of activations. Table 2 presents the results of ABC-Net with different configurations. The parameter settings for these experiments are provided in Section S4 of the supplementary material.
Table 2: Results of ABC-Net on ImageNet with different numbers of binary weight bases ($M$) and binary activations ($N$). The gaps are measured against the full-precision network of the same topology.
Network  $M$ (weight bases)  $N$ (activation bases)  Top-1  Top-5  Top-1 gap  Top-5 gap 
res18  1  1  42.7%  67.6%  26.6%  21.6% 
res18  3  1  49.1%  73.8%  20.2%  15.4% 
res18  3  3  61.0%  83.2%  8.3%  6.0% 
res18  3  5  63.1%  84.8%  6.2%  4.4% 
res18  5  1  54.1%  78.1%  15.2%  11.1% 
res18  5  3  62.5%  84.2%  6.8%  5.0% 
res18  5  5  65.0%  85.9%  4.3%  3.3% 
res18  Full Precision  69.3%  89.2%  -  - 
res34  1  1  52.4%  76.5%  20.9%  14.8% 
res34  3  3  66.7%  87.4%  6.6%  3.9% 
res34  5  5  68.4%  88.2%  4.9%  3.1% 
res34  Full Precision  73.3%  91.3%  -  - 
res50  5  5  70.1%  89.7%  6.0%  3.1% 
res50  Full Precision  76.1%  92.8%  -  - 
As balancing multiple factors such as training time, inference time, model size and accuracy is a matter of practical trade-off, there is no definite conclusion as to which combination of ($M$, $N$) one should choose. In general, Table 2 shows that (1) the prediction accuracy of ABC-Net improves greatly as the number of binary activations increases, which is analogous to the weight approximation approach; (2) larger $M$ or $N$ gives better accuracy; (3) when $M = N = 5$, the Top-1 gap between the accuracy of a full-precision model and a binary one reduces to around 5%. To gain a visual understanding and show the possibility of extensions to other tasks such as object detection, we show a sample of feature maps in Section S3 of the supplementary material.
4.1.3 Comparison with the state-of-the-art
Table 3: Classification accuracy (Top-1 and Top-5) on ImageNet of ABC-Net and other models. W and A denote the bit-widths of weights and activations.
Model  W  A  Top-1  Top-5 
Full-Precision ResNet-18 [full-precision weights and activation]  32  32  69.3%  89.2% 
BWN [full-precision activation] Rastegari et al. (2016)  1  32  60.8%  83.0% 
DoReFa-Net [1-bit weight and 4-bit activation] Zhou et al. (2016)  1  4  59.2%  81.5% 
XNOR-Net [binary weight and activation] Rastegari et al. (2016)  1  1  51.2%  73.2% 
BNN [binary weight and activation] Courbariaux et al. (2016)  1  1  42.2%  67.1% 
ABC-Net [5 binary weight bases, 5 binary activations]  1  1  65.0%  85.9% 
ABC-Net [5 binary weight bases, full-precision activations]  1  32  68.3%  87.9% 
Table 3 presents a comparison between ABC-Net and several other state-of-the-art models, i.e., full-precision ResNet-18, BWN and XNOR-Net in (Rastegari et al., 2016), DoReFa-Net in (Zhou et al., 2016) and BNN in (Courbariaux et al., 2016). (Courbariaux et al. (2016) did not report their result on ImageNet; we implemented their method and present the result.) All comparative models use ResNet-18 as the network topology. The full-precision ResNet-18 achieves 69.3% Top-1 accuracy. Although the BWN model of Rastegari et al. (2016) and DoReFa-Net perform well, it should be noted that they use full-precision and 4-bit activations respectively. Models that use both binary weights and activations (XNOR-Net and BNN) achieve much less satisfactory accuracy, and are significantly outperformed by ABC-Net with multiple binary weight bases and activations. It can be seen that ABC-Net achieves state-of-the-art performance as a binary model.
One might argue that a 5-bit width quantization scheme could reach accuracy similar to that of ABC-Net with 5 weight bases and 5 binary activations. However, the former is less efficient and requires distinctly more hardware resources. More detailed discussions can be found in Section 5.2.
5 Discussion
5.1 Why does adding a shift parameter work?
Intuitively, the multiple binarized weight bases/activations scheme works because it allows more information to pass through. Consider the case where a real value, say 1.5, is passed to a binarization function $f(x) = \mathrm{sign}(x)$, where $\mathrm{sign}$ maps a positive $x$ to 1 and otherwise to -1. In that case, the output of $f(1.5)$ is 1, which suggests that the input value is positive. Now imagine that we have two binarization functions $f_1(x) = \mathrm{sign}(x)$ and $f_2(x) = \mathrm{sign}(x - 2)$. In that case $f_1$ outputs 1 and $f_2$ outputs -1, which suggests that the input value is not only positive, but also does not exceed 2. Clearly we see that each function contributes differently to representing the input, and more information is gained from $f_2$ compared to the former case.
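The example can be written out directly. The sketch below uses a sign function that maps positive inputs to +1 and everything else to -1, as in the text; the additional test values -1.0 and 3.0 are chosen only to show that two shifted binarizations separate three ranges of the input.

```python
def sign(x):
    """Map a positive x to +1 and anything else to -1."""
    return 1 if x > 0 else -1

x = 1.5
f = sign(x)          # single binarization: only tells us that x > 0
f1 = sign(x)         # first of two binarizations
f2 = sign(x - 2)     # second one, shifted by 2: tells us that x <= 2

# With both functions, three ranges of x receive three different codes.
codes = {value: (sign(value), sign(value - 2)) for value in (-1.0, 1.5, 3.0)}
```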
From another point of view, both the coefficients and the shift parameters are expected to learn and utilize the statistical features of full-precision tensors, just like the scale and shift parameters in batch normalization. If we have more binarized weight bases/activations, the network has the capacity to approximate the full-precision one more precisely. Therefore, it can be deduced that when $M$ or $N$ is large enough, the network learns to tune itself so that the combination of weight bases or binarized activations can act like the full-precision one.
5.2 Advantage over the fixed-point quantization scheme
It should be noted that there are key differences between the multiple binarization scheme ($K$ binarized weight bases or $K$ binarized activations) proposed in this paper and the fixed-point quantization scheme in previous works such as (Zhou et al., 2016; Hubara et al., 2016), though at first thought $K$-bit width quantization seems to share the same memory requirement with $K$ binarizations. Specifically, our scheme with $K$ binarized weight bases/activations is preferable to the fixed $K$-bit width scheme for the following reasons:
(1) The $K$ binarization scheme preserves binarization for bitwise operations. One or several bitwise operations are known to be more efficient than a fixed-point multiplication, which is a major reason that BNN/XNOR-Net was proposed.
(2) A $K$-bit width multiplier consumes more resources than $K$ 1-bit multipliers in a digital chip: it requires more than $K$ bits to store and compute, otherwise it could easily overflow/underflow. For example, if a real number is quantized to a 2-bit number, a possible choice is in the range {0,1,2,4}. In this 2-bit multiplication, when both numbers are 4, it outputs $4 \times 4 = 16$, which is not within the range. In (Zhou et al., 2016), the range of activations is constrained within [0,1], which seems to avoid this situation. However, fractional numbers do not solve this problem: severe precision deterioration will appear during the multiplication if there are no extra resources. The fact that the complexity of a multiplier is proportional to the square of the bit-width can be found in the literature (e.g., Sec. 3.1.1 in (Grabbe et al., 2003)). In contrast, our $K$ binarization scheme does not have this issue: it always outputs within the range {-1,1}. The saved hardware resources can be further used for parallel computing.
(3) A binary activation can use spiking response for event-based computation and communication (consuming energy only when necessary) and therefore is energy-efficient (Esser et al., 2016). This can be employed in our scheme, but not in the fixed $K$-bit width scheme. Also, we have mentioned the fact that a $K$-bit width multiplier consumes more resources than $K$ 1-bit multipliers. It is noteworthy that these resources include power.
To sum up, $K$-bit multipliers are the most space and power-hungry components of the digital implementation of DNNs. Our scheme could bring great benefits to specialized DNN hardware.
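The range argument in reason (2) can be checked with a few lines of arithmetic. The code book {0,1,2,4} is the one used in the example above; the second part, which counts the bits needed for the largest product of two plain unsigned $K$-bit integers, is an additional illustration and not taken from the cited works.

```python
from itertools import product

levels = [0, 1, 2, 4]                 # the 2-bit code book used in the text
products = {a * b for a, b in product(levels, repeat=2)}
out_of_range = sorted(p for p in products if p not in levels)

# For plain unsigned K-bit integers, the largest product needs 2K bits.
K = 2
max_product = (2 ** K - 1) ** 2       # 3 * 3 = 9
bits_needed = max_product.bit_length()

# Products of binary values never leave the set {-1, +1}.
binary_products = {a * b for a, b in product([-1, 1], repeat=2)}
```

Here `out_of_range` contains the products 8 and 16, which cannot be represented in the original code book, whereas `binary_products` stays within {-1,+1}.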
5.3 Further computation reduction in runtime
On specialized hardware, the following operations in our scheme can be integrated with other operations at run-time to further reduce the computation requirement.
(1) Shift operations. The existence of shift parameters seems to require extra additions/subtractions (see (2) and (8)). However, the binarization operation with a shift parameter can be implemented as a comparator where the shift parameter is the number for comparison, e.g., $H_v(R) = 1$ if $R \ge 0.5 - v$ and $H_v(R) = -1$ if $R < 0.5 - v$ ($0.5 - v$ is a constant), so no extra additions/subtractions are involved.
(2) Batch normalization. At run-time, a batch normalization is simply an affine function, say, $\mathrm{BN}(R) = aR + b$, whose scale and shift parameters $a, b$ are fixed and can be integrated with the $v_n$'s. More specifically, a batch normalization can be integrated into a binarization operation as follows: $H_v(\mathrm{BN}(R)) = 1$ if $aR + b \ge 0.5 - v$ and $-1$ otherwise, which for $a > 0$ is the single comparison $R \ge (0.5 - v - b)/a$ (the inequality flips for $a < 0$). Therefore, there will be no extra cost for the batch normalization.
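Both integrations can be verified numerically. The sketch below assumes a binarization of the form "+1 where clip(R + v, 0, 1) >= 0.5", a batch normalization reduced to the affine map a * R + b with a > 0, and illustrative values for a, b and v; the function names are hypothetical.

```python
import numpy as np

def H(R, v):
    """Binarization assumed in this sketch: +1 where clip(R + v, 0, 1) >= 0.5."""
    return np.where(np.clip(R + v, 0.0, 1.0) >= 0.5, 1, -1)

def folded(R, a, b, v):
    """Same result from one comparison against a precomputed threshold (a > 0)."""
    threshold = (0.5 - v - b) / a
    return np.where(R >= threshold, 1, -1)

rng = np.random.default_rng(0)
R = rng.standard_normal(1000) * 3.0
a, b, v = 1.7, -0.4, 0.25             # illustrative BN scale/shift and activation shift

reference = H(a * R + b, v)           # batch norm, then shift, clip and threshold
fast = folded(R, a, b, v)             # a single comparison per element
```

The threshold is computed once per channel, so at run-time the batch normalization, the shift and the clipping collapse into one comparison per element.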
6 Conclusion and future work
We have introduced a novel binarization scheme for weights and activations during forward and backward propagations, called ABC-Net. We have shown that it is possible to train a binary CNN with ABC-Net on ImageNet and achieve accuracy close to its full-precision counterpart. The binarization scheme proposed in this work is parallelizable and hardware friendly, and the impact of such a method on specialized hardware implementations of CNNs could be major, by replacing most multiplications in convolution with bitwise operations. The potential to speed up the test-time inference might be very useful for real-time embedded systems. Future work includes the extension of these results to other tasks such as object detection and other models such as RNNs. Also, it would be interesting to investigate using FPGAs/ASICs or other customized deep learning processors (Liu et al., 2016) to implement ABC-Net at run-time.
7 Acknowledgement
We acknowledge Mr Jingyang Xu for helpful discussions.
References
 Bengio et al. (2013) Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 Cai et al. (2017) Z. Cai, X. He, J. Sun, and N. Vasconcelos. Deep learning with low precision by half-wave Gaussian quantization. arXiv preprint arXiv:1702.00953, 2017.
 Courbariaux et al. (2016) M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
 Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
 Esser et al. (2016) S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch, et al. Convolutional networks for fast, energy-efficient neuromorphic computing. Proceedings of the National Academy of Sciences, page 201604850, 2016.
 Giusti et al. (2016) A. Giusti, J. Guzzi, D. Ciresan, F.-L. He, J. P. Rodriguez, F. Fontana, M. Faessler, C. Forster, J. Schmidhuber, G. Di Caro, D. Scaramuzza, and L. Gambardella. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 2016.
 Govindu et al. (2004) G. Govindu, L. Zhuo, S. Choi, and V. Prasanna. Analysis of high-performance floating-point arithmetic on FPGAs. In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, page 149. IEEE, 2004.
 Grabbe et al. (2003) C. Grabbe, M. Bednara, J. Teich, J. von zur Gathen, and J. Shokrollahi. FPGA designs of parallel high performance GF(2^233) multipliers. In ISCAS (2), pages 268–271. Citeseer, 2003.
 He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 Hubara et al. (2016) I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
 Ioffe and Szegedy (2015) S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 Kingma and Ba (2014) D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Liu et al. (2016) S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen. Cambricon: An instruction set architecture for neural networks. In Proceedings of the 43rd International Symposium on Computer Architecture, pages 393–405. IEEE Press, 2016.
 Qian (1999) N. Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
 Rastegari et al. (2016) M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
 Ren et al. (2015) S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
 Toms (1990) D. Toms. Training binary node feedforward neural networks by back propagation of error. Electronics letters, 26(21):1745–1746, 1990.
 Venkatesh et al. (2016) G. Venkatesh, E. Nurvitadhi, and D. Marr. Accelerating deep convolutional networks using low-precision and sparsity. arXiv preprint arXiv:1610.00324, 2016.
 Zhou et al. (2016) S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
Supplementary Material
S1 Summary of training algorithm in Section 3.3
S2 Weight approximation
In this section we explore how well the weight approximation can be achieved given adequate binary weight bases (Section 3.1). To gain a visual intuition, we randomly sample a slice of a weight tensor from a full-precision ResNet-18 model pre-trained on ImageNet. The sliced tensor is then vectorized, and we approximate it with $M$ bases using linear regression (see (3)). The results are presented in Figure S2, where the left subfigure shows the root mean square error (RMSE) of the estimated weights with an increasing number of bases, and the right one shows 5 fitting results, whose choices of $M$ are respectively 1 to 5 from top to bottom. The blue line in the right subfigure draws the ground-truth weights from the full-precision pre-trained model, and the red line is the estimated one. It can be observed that a small number of bases is adequate for a rough fit, and that the fit gets almost perfect as more bases are used.
S3 Feature map
It is also possible to perform more complex tasks beyond classification using ABC-Net, as long as the model is built upon a CNN, such as Faster R-CNN for object detection, in which the classification model serves as a pre-trained model. Thus, one might be interested in whether ABC-Net learns feature maps similar to those of its full-precision counterpart. Figure S3 shows several example images and the corresponding feature maps of these two models, from which we see that they are indeed similar. This shows the potential for ABC-Net to generalize to the more complex tasks mentioned above.
S4 Parameter settings for the experiment in Section 4.1.2
The parameters $u_i$'s and the initial values for the $v_n$'s and $\beta_n$'s can be treated as hyperparameters. At the beginning of our exploration we chose these initial values randomly. Gradually we began to find certain patterns that achieve good performance: for the $u_i$'s, usually symmetric; for the $v_n$'s, possibly shifted slightly towards the negative direction. These are based on tuning and also on the observation of the full-precision distribution of weights/activations. Table S4 provides the parameter settings for the experiment in Section 4.1.2. All ABC-Net models in the experiments are trained using SGD with momentum, and the initial learning rate is set to 0.01.
Network  $M$  $N$  shift parameters ($u_i$'s)  shift parameters ($v_n$'s)  $\beta_n$'s 
res18  1  1  0  0.0  1.0 
res18  3  1  -1, 0, 1  0.0  1.0 
res18  3  3  -1, 0, 1  -1.5, 0.0, 1.5  1.0, 1.0, 1.0 
res18  3  5  -1, 0, 1  -3.5, -2.5, -1.5, 0.0, 2.5  1.0, 1.0, 1.0, 1.0, 1.0 
res18  5  1  -2, -1, 0, 1, 2  0.0  1.0 
res18  5  3  -2, -1, 0, 1, 2  -0.9, 0.0, 0.9  1.0, 1.0, 1.0 
res18  5  5  -1, -0.5, 0, 0.5, 1  -3.5, -2.5, -1.5, 0.0, 2.5  1.0, 1.0, 1.0, 1.0, 1.0 
res34  3  3  -1, 0, 1  -3.0, 0.0, 3.0  1.0, 1.0, 1.0 
res34  5  5  -1, -0.5, 0, 0.5, 1  -3.5, -2.5, -1.5, 0.0, 2.5  1.0, 1.0, 1.0, 1.0, 1.0 
res50  5  5  -1, -0.5, 0, 0.5, 1  -3.5, -2.5, -1.5, 0.0, 2.5  1.0, 1.0, 1.0, 1.0, 1.0 
S5 Relationship between accuracy and number of binary weight bases
Figure S4 shows that the relationship between accuracy and the number of binary weight bases appears to be linear. Note that we keep the activations full-precision in this experiment.
S6 Application on visual perception of forest trails
S6.1 VGG-like Network Topology
A VGG-like network topology is used for visual perception of forest trails as illustrated in Figure S5.
S6.2 Experiment results on visual perception of forest trails dataset
Giusti et al. (2016) cast the forest or mountain trail perception problem for mobile robots as an image classification task based on deep neural networks. The dataset is composed of 8 hours of 30 fps video acquired by a hiker equipped with three head-mounted cameras. Each image is labelled with one of three classes: turn right, go straight, turn left. We evaluate ABC-Net against its full-precision counterpart using this dataset. The classification result is shown in Table S5, with both the number of weight bases and the number of activation bases fixed to 5.
Network  shift parameters ($u_i$'s)  shift parameters ($v_n$'s)  $\beta_n$'s  ABC-Net  FP 
VGG-like  -1, -0.5, 0.0, 0.5, 1  0, 0, 0, 0, 0  1, 1, 1, 1, 1  78.0%  77.7% 