Bi-Real Net: Binarizing Deep Network Towards Real-Network Performance
Abstract
In this paper, we study 1-bit convolutional neural networks (CNNs), in which both the weights and activations are binary. While efficient, the lack of representational capability and the training difficulty impede 1-bit CNNs from performing as well as real-valued networks. To this end, we propose Bi-Real net with a novel training algorithm to tackle these two challenges. To enhance the representational capability, we propagate the real-valued activations generated by each 1-bit convolution via a parameter-free shortcut. To address the training difficulty, we propose a training algorithm using a tighter approximation to the derivative of the sign function, a magnitude-aware binarization for weight updating, a better initialization method, and a two-step scheme for training a deep network. Experiments on ImageNet show that an 18-layer Bi-Real net with the proposed training algorithm achieves 56.4% top-1 classification accuracy, which is 10% higher than the state-of-the-art (e.g., XNOR-Net), with a greater memory saving and a lower computational cost. Bi-Real net is also the first to scale up 1-bit CNNs to an ultra-deep network with 152 layers, and achieves 64.5% top-1 accuracy on ImageNet. A 50-layer Bi-Real net shows comparable performance to a real-valued network on the depth estimation task, with merely a 0.3% accuracy gap.
Keywords:
1-bit CNNs · Binary convolution · Shortcut · 1-layer-per-block
1 Introduction
Deep Convolutional Neural Networks (CNNs) have achieved substantial advances in a wide range of vision tasks, such as object detection and recognition [21, 33, 37, 36, 11, 8, 31], depth perception [6, 26], visual relation detection [43, 44], face tracking and alignment [34, 48, 50, 40, 39], object tracking [28], etc. However, the superior performance of CNNs usually requires powerful hardware with abundant computing and memory resources, for example, highend graphics processing units (GPUs). Meanwhile, there are growing demands to run vision tasks, such as augmented reality and intelligent navigation, on mobile handheld devices and small drones. The CNN models are usually trained on GPUs and deployed on mobile devices for inference. Most mobile devices are equipped with neither powerful GPUs nor adequate memory to store and run expensive CNN models. Consequently, the high demand for computation and memory becomes the bottleneck of deploying powerful CNNs on most mobile devices. In general, there are two major approaches to alleviate this limitation. The first is to reduce the number of weights with more compact network design or pruning. The second is to quantize the weights or quantize both the weights and activations, with the extreme case of both the weights and activations being binary.
In this work, we study the extreme case of the second approach, i.e., the binary CNN. It is also called a 1-bit CNN, as each weight parameter and activation can be represented by a single bit. As demonstrated in [30], up to a 32× memory saving and a 58× speedup on CPUs have been achieved for a 1-bit convolutional layer, in which the computationally heavy matrix multiplication operations can be implemented using lightweight bitwise XNOR operations and popcount operations. The current binarization methods achieve comparable accuracy to real-valued networks on small datasets (e.g., CIFAR-10 and MNIST). However, on large-scale datasets (e.g., ImageNet), the binarization method based on AlexNet in [16] encounters a severe accuracy degradation, from 56.6% to 27.9% [30]. This suggests that the capability of conventional 1-bit CNNs is not sufficient to cover the great diversity in large-scale datasets like ImageNet. Another binary network called XNOR-Net [30] enhances the performance of 1-bit CNNs by utilizing the absolute mean of weights and activations. XNOR-Net improves the accuracy to 44.2% on AlexNet, which is encouraging, but there still remains a performance gap relative to real-valued networks.
The objective of this study is to further improve 1-bit CNNs, as we believe that their potential has not yet been fully explored. One important observation is that during the inference process, a 1-bit convolutional layer generates integer outputs, as a result of the popcount operations. The integer outputs become real values after a BatchNorm [19] layer. These real-valued activations are then binarized to −1 or +1 through the consecutive sign function, as shown in Fig. 1(a). Obviously, compared to binary activations, these integer or real activations contain more information, which is lost in conventional 1-bit CNNs [16]. Based on this observation, we propose to use parameter-free shortcut paths to collect all the real-valued outputs from each 1-bit convolutional layer, and add them to the nearest subsequent real-valued outputs with matched dimensions. The proposed network is dubbed Bi-Real net, with real-valued shortcut paths transmitting high-precision features and efficient 1-bit convolutional layers extracting new features, respectively. Because ResNet [11] is the most prevalent network structure, we verify our shortcut design principle on the shallow ResNet-based structures as well as the deep ResNet with the bottleneck structure.
For a shallow ResNet-based structure, we propose to add shortcuts forwarding the real activations to be added with the real-valued activations of the next block, as shown in Fig. 1(b). By doing so, the representational capability of the proposed model becomes much greater than that of the original 1-bit CNNs, with only a negligible computational cost incurred by the extra element-wise addition and without any additional memory cost. This shortcut design results in a so-called 1-(convolutional-)layer-per-block structure, which is more effective than the 2-layer-per-block structure proposed in ResNet. The original ResNet argues that a shortcut has to bypass at least two convolutional layers; however, we provide a mathematical explanation of the feasibility of the 1-layer-per-block structure in Sec. 3.2.
For a deep ResNet-based structure with the bottleneck, we propose to add another shortcut path in addition to the original shortcut path in ResNet. This shortcut path adds the input activations to the sign function before the 3×3 convolutional layer and the output activations of the 3×3 convolutional layers in series, as shown in Fig. 2. This additional shortcut works jointly with the original shortcut to collect all the real-valued outputs in the 1-bit CNN, and the representational capability is thus greatly enhanced.
We further propose a novel training algorithm for 1-bit CNNs, including four special technical features:

Approximation to the derivative of the sign function with respect to activations. As the sign function is non-differentiable, we approximate its derivative in the backward pass by that of a piecewise polynomial function, which is a tighter approximation than the widely adopted clip function and reduces the gradient mismatch.

Magnitude-aware binarization with respect to weights. As the gradient of the loss with respect to the binary weight will not be large enough to trigger a change of the sign of the binary weight, the binary weight cannot be directly updated using the standard gradient descent algorithm. In BinaryNet [16], the real-valued weight is first updated using gradient descent, and the new binary weight is then obtained by taking the sign of the updated real weight. However, we observed that the gradient with respect to the real weight only depends on the sign of the current real weight, while being independent of its magnitude. To derive a more effective gradient, we propose to use a magnitude-aware sign function during training, resulting in the desired dependence of the gradient with respect to the real weight on both the sign and the magnitude of the current real weight. After convergence, the binary weight (i.e., −1 or +1) is obtained through the sign function of the final real weight for inference.

Initialization. As a highly non-convex optimization problem, training 1-bit CNNs could be sensitive to initialization. In [24], the 1-bit CNN model is initialized using the real-valued CNN model with the ReLU function pretrained on ImageNet. We propose to replace ReLU by the clip function in pretraining, as the activation of the clip function is closer to the binary activation than that of ReLU.

A two-step training method for deep 1-bit CNNs with the bottleneck structure. As the network goes deeper, training becomes more difficult. Following the practice of multi-step training in quantizing the network [51], we customize a two-step training method for our deep Bi-Real net with the bottleneck structure to ease the training difficulty. We first binarize the weights and activations in the 1×1 convolutional layers and the activations in the 3×3 convolutional layers. After the network converges, we further binarize the weights in the 3×3 convolutional layers. This training procedure uses the real-valued weights in the 3×3 convolutional layers as a transition for training the deep Bi-Real net, helping the network converge to higher accuracy.
Experiments on ImageNet show that these ideas are useful for training 1-bit CNNs. With the dedicatedly designed shortcut and the proposed optimization techniques, our Bi-Real net, with only binary weights and activations inside each 1-bit convolutional layer, achieves 56.4% and 62.2% top-1 accuracy on the ImageNet dataset with 18-layer and 34-layer structures, respectively, with up to a 16.0× memory saving and a 19.0× computational cost reduction compared to the full-precision CNN. Compared with the state-of-the-art binary models (e.g., XNOR-Net), Bi-Real net achieves 10% higher top-1 accuracy on the 18-layer network. By using the shortcut propagating the real-valued feature map in the bottleneck structure, Bi-Real net achieves 64.5% top-1 accuracy with an ultra-deep 152-layer structure. We also apply Bi-Real net to a real-world application, the depth estimation task. The experimental results demonstrate that a 50-layer Bi-Real net achieves superior performance over BinaryNet [16] and comparable accuracy to the real-valued counterpart.
This paper extends the preliminary conference paper [27] in several aspects. 1) We generalize the idea of using the shortcut propagating the real-valued features to the ultra-deep ResNet structure with the bottleneck, enabling the application of our Bi-Real net to both shallow and deep ResNets. The idea of propagating real-valued activations in 1-bit CNNs with the shortcut is proven to be effective, with the general guideline that the real-valued outputs of each convolutional layer should be added to the shortcut for propagation. 2) We propose a two-step training method targeting the deep Bi-Real net with the bottleneck structure. By using the real-valued weights in the 3×3 convolutional layers as an intermediate step, the ultra-deep Bi-Real net with the bottleneck can converge to higher accuracy. 3) We conduct an ablation study on a higher-order approximation to the derivative of the sign function. We show that using the higher-order approximation rather than the piecewise linear function yields a marginal improvement in accuracy but induces computational overhead. 4) We apply our 50-layer Bi-Real net as a feature extractor in a depth estimation network and demonstrate comparable accuracy to a real-valued network.
2 Related Work
An overview of neural network compression and acceleration from both the hardware and software perspectives is provided in [35]. In computer vision, the network compression methods can be mainly divided into two major families.
Reducing the number of parameters. Several methods have been proposed to compress neural networks by reducing the number of parameters and neural connections. Previous works on compact neural network structure design have achieved a high compression rate with a negligible accuracy degradation. In SqueezeNet [18], some 3×3 convolutions are replaced with 1×1 convolutions, resulting in a 50× reduction in the number of parameters. As an extreme version of Inception-v4 [36], Xception [3] applies depthwise separable convolution to reduce the number of parameters, which brings a memory saving as well as a speedup in convolution with a negligible accuracy drop. Based on depthwise separable convolutions, MobileNets [15] builds lightweight deep neural networks and achieves a good trade-off between resources and accuracy. ResNeXt [41] uses group convolution to achieve higher accuracy with a limited parameter budget. Recently, ShuffleNet [45] utilizes both pointwise group convolution and channel shuffle to achieve about a 13× speedup over AlexNet with comparable accuracy.
Pruning is another effective solution for model compression and acceleration. Pruning filters [23] in the network removes filters together with the connected feature maps, which significantly reduces the computational cost. He et al. [13] pruned the channels in the network and achieved a 5× speedup with only a 0.3% increase in error on VGG-16. Guo et al. [9] performed on-the-fly connection pruning and efficiently compressed LeNet-5 and AlexNet by factors of 108× and 17.7×, respectively, without any accuracy loss. Li et al. proposed an ADMM-based method for filter pruning. In Sparse CNN [25], a sparse matrix multiplication operation is employed to zero out more than 90% of parameters to accelerate the learning process. Motivated by Sparse CNN, Han et al. proposed Deep Compression [10], which employs connection pruning, quantization with retraining, and Huffman coding to reduce the number of neural connections.
Parameter quantization. The study in [22] demonstrates that real-valued deep neural networks such as AlexNet [21], GoogLeNet [37], and VGG-16 [33] only encounter a marginal accuracy degradation when quantizing 32-bit parameters to 8-bit. In Incremental Network Quantization [46], Zhou et al. quantized the parameters incrementally and showed that it is even possible to further reduce the weight precision to 2–5 bits with slightly higher accuracy on the ImageNet dataset than a full-precision network. Based on that, Zhou et al. further proposed explicit loss-error-aware quantization [47], which obtains comparable accuracy to the real-valued network with very low-bit parameter values. In BinaryConnect [4], Courbariaux et al. more radically employed 1-bit precision weights (−1 and +1) while maintaining sufficiently high accuracy on the MNIST, CIFAR-10, and SVHN datasets. Hou et al. utilized a proximal Newton algorithm with a diagonal Hessian approximation that directly minimizes the loss with respect to the binarized weights [14].
Quantizing weights properly can achieve considerable memory savings with little accuracy degradation. However, acceleration via weight quantization alone is limited by the real-valued activations (i.e., the input to convolutional layers). Several recent studies have explored new network structures and/or training techniques for quantizing both weights and activations while minimizing the accuracy degradation. Successful attempts include DoReFa-Net [49] and QNN [17], which explore neural networks trained with 1-bit weights and 2-bit activations, and the accuracy drops by 6.1% and 4.9%, respectively, on the ImageNet dataset compared to the real-valued AlexNet. Recently, Zhuang et al. [51] proposed to jointly train a full-precision model alongside the low-precision one, which leads to no performance decrease in a 4-bit precision network compared with its full-precision counterpart. Zhang et al. [42] proposed an easy-to-train scheme of jointly training a quantized, bit-operation-compatible DNN and its associated quantizers, which can be applied to quantize weights and activations with arbitrary-bit precision. Additionally, BinaryNet [16] uses only 1-bit weights and 1-bit activations in a neural network and achieves comparable accuracy to a full-precision neural network on the MNIST and CIFAR-10 datasets. In XNOR-Net [30], Rastegari et al. further improved BinaryNet by multiplying the absolute mean of the weight filter and activation with the 1-bit weight and 1-bit activation to improve the accuracy. ABC-Net [24] enhances the accuracy by using more weight bases and activation bases. The results of these studies are encouraging, but the additional usage of real-valued weights and real-valued operations offsets the memory saving and speedup of binarizing the network. In [1], Bagherinezhad et al. proposed the label refinery technique for further improving the accuracy of quantized networks, which is orthogonal to other quantization methods.
In this study, we aim to design 1-bit CNNs aided with a real-valued shortcut to compensate for the accuracy loss of binarization. In contrast to the approaches mentioned above, adding real-valued shortcuts incurs neither non-trivial real-valued operations nor extra memory. We further design 1) optimization strategies for overcoming the gradient mismatch issue and discrete optimization difficulties in 1-bit CNNs, 2) a customized initialization method, and 3) a two-step training method for the ultra-deep network. The proposed solution enables us to fully explore the potential of 1-bit CNNs with limited resources.
3 Methodology
3.1 Standard 1-Bit CNN and Its Representational Capability
1-bit convolutional neural networks (CNNs) refer to CNN models with binary weight parameters and binary activations in the intermediate convolutional layers. Specifically, the binary activation and weight are obtained through a sign function,
$$a_b = \operatorname{Sign}(a_r) = \begin{cases} -1 & \text{if } a_r < 0 \\ +1 & \text{otherwise} \end{cases}, \qquad w_b = \operatorname{Sign}(w_r) = \begin{cases} -1 & \text{if } w_r < 0 \\ +1 & \text{otherwise}, \end{cases} \tag{1}$$
where $a_r$ and $w_r$ indicate the real activation and the real weight, respectively. The real-valued activation $a_r$ exists in both the training and inference processes of the 1-bit CNN, due to the convolution and batch normalization (if used). For example, given a binary activation map and a 3×3×32 binary weight kernel, the output activation could be an even integer between −288 and 288. If batch normalization is applied, as shown in Fig. 3, then the integer activation is transformed into real values. The real-valued weights are used to update the binary weights in the training process, which will be introduced later.
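As a minimal pure-Python sketch of the binarization in Eq. (1) (the function names are ours, chosen for illustration, not the paper's):

```python
def sign(x):
    """Binarize a real value to -1 or +1, as in Eq. (1)."""
    return -1 if x < 0 else 1

def binarize(tensor):
    """Elementwise binarization of a nested list of real values."""
    return [binarize(t) if isinstance(t, list) else sign(t) for t in tensor]

# Example: a 2x2 real-valued activation map becomes a -1/+1 map.
print(binarize([[0.3, -0.2], [-1.5, 0.0]]))  # -> [[1, -1], [-1, 1]]
```

Note that, following the convention in Eq. (1), zero is mapped to +1.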
Compared to real-valued CNN models with 32-bit weight parameters, the 1-bit CNNs gain up to a 32× memory saving. Moreover, as the activation is also binary, the convolution operation can be implemented by the bitwise XNOR operation followed by a popcount operation [30], i.e.,
$$\mathbf{a}_b \circledast \mathbf{w}_b = \operatorname{popcount}\big(\operatorname{XNOR}(a_b^i,\, w_b^i)\big), \tag{2}$$
where $\mathbf{a}_b$ and $\mathbf{w}_b$ indicate the vectors of binary activations $a_b^i$ and binary weights $w_b^i$, respectively, with $i$ being the entry index. In contrast, the convolution operation in real-valued CNNs is implemented by expensive real-valued multiplication. Consequently, the 1-bit CNNs can obtain up to a 64× computation saving.
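The XNOR/popcount trick above can be verified with a small pure-Python sketch. Encoding +1 as bit 1 and −1 as bit 0, the number of XNOR matches $p$ over $n$ entries recovers the ±1 dot product as $2p - n$ (the helper names and encoding are ours, for illustration):

```python
def popcount(x):
    """Count set bits of a non-negative integer."""
    return bin(x).count("1")

def dot_xnor_popcount(a_bits, w_bits, n):
    """Binary dot product via XNOR + popcount, in the spirit of Eq. (2).
    a_bits / w_bits are n-bit integers encoding +1 as bit 1 and -1 as bit 0."""
    matches = popcount(~(a_bits ^ w_bits) & ((1 << n) - 1))  # popcount(XNOR)
    return 2 * matches - n  # matches minus mismatches = +/-1 dot product

def encode(vec):
    """Pack a +/-1 vector into a bit mask (bit i set iff vec[i] == +1)."""
    return sum(1 << i for i, v in enumerate(vec) if v == 1)

a, w = [1, -1, 1, 1], [1, 1, -1, 1]
assert dot_xnor_popcount(encode(a), encode(w), 4) == sum(x * y for x, y in zip(a, w))
```

On real hardware the same idea operates on 64-bit words, which is the source of the large speedup cited above.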
However, it has been demonstrated in [16] that the classification performance of the 1-bit CNNs is much worse than that of real-valued CNN models on large-scale datasets like ImageNet. We believe that the poor performance of 1-bit CNNs is caused by their low representational capacity. We denote by $\mathcal{R}(x)$ the representational capability of $x$, i.e., the number of all possible configurations of $x$, where $x$ could be a scalar, vector, matrix, or tensor. For example, the representational capability of 32 channels of a binary feature map with spatial size $H \times W$ is $2^{32 \times H \times W}$. Given a 3×3×32 binary weight kernel, each entry of the bitwise convolution output can take any of the 289 even values from −288 to 288, as shown in Fig. 3; the representational capability of each output entry is thus 289. Note that since the BatchNorm layer is a unique mapping, it will not increase the number of different choices but will scale the range (−288, 288) to particular values. If the sign function is added behind the output, each entry in the feature map is binarized, and the representational capability of each entry shrinks to 2 again.
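The counting argument above can be checked in a couple of lines: a 3×3×32 bitwise convolution entry is a sum of 288 terms, each ±1, and with $k$ of them equal to +1 the sum is $2k - 288$, so the possible outputs are exactly the 289 even integers in [−288, 288] (a sanity-check sketch, not code from the paper):

```python
# Enumerate all possible sums of 288 terms that are each +1 or -1.
n = 288
possible = sorted({2 * k - n for k in range(n + 1)})  # k = number of +1 terms

assert len(possible) == 289                      # 289 distinct values
assert possible[0] == -288 and possible[-1] == 288
assert all(v % 2 == 0 for v in possible)         # all even
```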
3.2 Shallow Bi-Real Net Model and Its Representational Capability
We propose to preserve the real activations before the sign function to increase the representational capability of the 1-bit CNN through a simple shortcut. Specifically, as shown in Fig. 3(b), one block indicates the structure "Sign → 1-bit convolution → batch normalization → addition operator". The shortcut connects the input activations to the sign function in the current block to the output activations after the batch normalization in the same block; these two activations are added through an addition operator, and the combined activations are then input to the sign function in the next block. The representational capability of each entry in the added activations is the product of the 289 possible convolution outputs and the number of possible values carried by the shortcut. Consequently, the representational capability of each block in the 1-bit CNN grows multiplicatively, block by block, with the above shortcut. As both real and binary activations are retained, we call the proposed model Bi-Real net.
The representational capability of each block in the 1-bit CNN is significantly enhanced due to the simple identity shortcut. The only additional computational cost is the addition of two real activations, as these real activations already exist in the standard 1-bit CNN (i.e., without shortcuts). Moreover, as the activations are computed on the fly, no additional memory is needed.
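The block described above can be sketched in a few lines of pure Python. This is a toy 1-D analogue, not the paper's code: BatchNorm is omitted for brevity (it would rescale the convolution output before the addition), and all names are illustrative:

```python
def sign(x):
    return -1.0 if x < 0 else 1.0

def bireal_block(x, w):
    """One Bi-Real block on a 1-D signal:
    Sign -> 1-bit 'convolution' (binary dot product per position, 'same'
    zero padding) -> identity shortcut addition."""
    xb = [sign(v) for v in x]            # binarize activations
    wb = [sign(v) for v in w]            # binarize weights
    k, pad = len(wb), len(wb) // 2
    padded = [0.0] * pad + xb + [0.0] * pad
    out = [sum(padded[i + j] * wb[j] for j in range(k)) for i in range(len(x))]
    return [o + v for o, v in zip(out, x)]  # shortcut keeps real values flowing
```

Note that the returned activations are real-valued: the integer convolution outputs are summed with the real-valued shortcut, which is exactly what preserves information across blocks.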
It is mentioned in ResNet [11] that a residual block with only one convolutional layer will lose the superiority of the shortcut connections. As shown in Fig. 4(a), $y = Wx + x = (W + I)x$, which means the identity mapping can be absorbed into the weight matrix and the block acts as a plain layer. However, He et al., in [12], proposed to move the nonlinear function inside the block. Based on this, we find that the 1-layer-per-block structure with a nonlinear function inside the block performs residual learning and is distinct from a plain layer. As shown in Fig. 4(b), $y = W\operatorname{Sign}(x) + x$, which cannot be collapsed into a single linear layer. Thus, Bi-Real net with the shortcut connecting every layer's output extensively utilizes the shortcut, and this so-called 1-layer-per-block design brings a huge benefit for 1-bit CNNs.
3.3 Deep Bi-Real Net Model with the Bottleneck and Its Representational Capability
Binarizing an ultra-deep network is not trivial. Increased depth induces training difficulty and also raises a higher requirement for the network structure design. As ResNet is the most prevalent network backbone structure, we propose to binarize the deep ResNet with the bottleneck structure to verify the superiority of our shortcut design and the training algorithm on a deep network.
Although a deep ResNet bottleneck structure already has a shortcut for each block, binarizing the convolutional layers would discard the real-valued outputs of the intermediate layers in a bottleneck block. Thus, we suspect that it could be much more effective to use the shortcut to propagate all the real-valued outputs of each 1-bit convolutional layer. Based on this conjecture, we propose to use another shortcut to propagate real-valued features generated inside the bottleneck. As shown in Fig. 5, the newly added shortcut connects the input activations to the sign functions before the 3×3 convolutional layers and the output activations of the 3×3 convolutional layers in series, by an adding operation. This shortcut path together with the original shortcut path jointly collects all the real-valued outputs of the convolutional layers, greatly enhancing the representational capability. As illustrated in Fig. 5, the representational capability of each entry in the original shortcut path, i.e., the left path, grows from 65 to 65 × 65 with the added shortcut. The representational capability of each entry in the additional shortcut path, i.e., the right path, grows from 257 to 257 × 577 × 257. The representational capability of each entry after the adding operation is the product of those of the input entries, which grows exponentially with the network depth and greatly contributes to the final accuracy.
3.4 Training Bi-Real Net
As both activations and weight parameters are binary, the continuous optimization method, i.e., stochastic gradient descent (SGD), cannot be directly adopted to train the 1-bit CNN. There are two major challenges. One is how to compute the gradient of the sign function on activations, which is non-differentiable, while the other is that the gradient of the loss with respect to the binary weight is too small to change the sign of the weight. The authors of [16] proposed to adjust the standard SGD algorithm to approximately train the 1-bit CNN. Specifically, the gradient of the sign function on activations is approximated by the gradient of the piecewise linear function, as shown in Fig. 7(b). To tackle the second challenge, the method proposed in [16] updates the real-valued weights by the gradient computed with regard to the binary weight, and obtains the binary weight by taking the sign of the real weights. As the identity shortcut does not add training difficulty, the training algorithm proposed in [16] can also be adopted to train the Bi-Real net model. However, we propose a novel training algorithm to tackle the above two major challenges, which is more suitable for the Bi-Real net model as well as other 1-bit CNNs. Additionally, we also propose a novel initialization method.
We present a graphical illustration of the training of Bi-Real net in Fig. 6. The identity shortcut is omitted in the graph for clarity, as it does not change the main part of the training algorithm.
3.4.1 Approximation to the derivative of the sign function with respect to activations.
As shown in Fig. 7(a), the derivative of the sign function is an impulse function, which cannot be utilized in training. Instead, the derivative of the sign function on activations is approximated by that of a differentiable surrogate $F(a_r)$:

$$\frac{\partial \mathcal{L}}{\partial a_r} = \frac{\partial \mathcal{L}}{\partial a_b} \cdot \frac{\partial a_b}{\partial a_r} \approx \frac{\partial \mathcal{L}}{\partial a_b} \cdot \frac{\partial F(a_r)}{\partial a_r}, \tag{3}$$
where $F(a_r)$ is a differentiable approximation of the non-differentiable $\operatorname{Sign}(a_r)$. In [16], $F$ is set as the clip function, leading to a step-function derivative (see Fig. 7(b)). In this work, we utilize a piecewise polynomial function (see Fig. 7(c)) as the approximation function, as follows:
$$F(a_r) = \begin{cases} -1 & \text{if } a_r < -1 \\ 2a_r + a_r^2 & \text{if } -1 \le a_r < 0 \\ 2a_r - a_r^2 & \text{if } 0 \le a_r < 1 \\ 1 & \text{otherwise.} \end{cases} \tag{4}$$
As shown in Fig. 7, the shaded areas with blue slashes reflect the difference between the sign function and its approximation. The shaded area corresponding to the clip function is 1, while that corresponding to Eq. (4) is 2/3. We conclude that Eq. (4) is a closer approximation to the sign function than the clip function. The derivative of Eq. (4) is formulated as
$$\frac{\partial F(a_r)}{\partial a_r} = \begin{cases} 2 + 2a_r & \text{if } -1 \le a_r < 0 \\ 2 - 2a_r & \text{if } 0 \le a_r < 1 \\ 0 & \text{otherwise,} \end{cases} \tag{5}$$
which is a piecewise linear function.
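Eqs. (4) and (5) translate directly into code. Below is a pure-Python sketch of the forward approximation and the backward surrogate derivative (in a framework, the forward pass would still use the exact sign while only the backward pass uses this gradient):

```python
def approx_sign(x):
    """Piecewise polynomial approximation of Sign, Eq. (4)."""
    if x < -1:
        return -1.0
    if x < 0:
        return 2 * x + x * x
    if x < 1:
        return 2 * x - x * x
    return 1.0

def approx_sign_grad(x):
    """Derivative of Eq. (4), i.e., Eq. (5); replaces the impulse
    derivative of Sign in the backward pass."""
    if -1 <= x < 0:
        return 2 + 2 * x
    if 0 <= x < 1:
        return 2 - 2 * x
    return 0.0
```

Note the triangle-shaped gradient peaks at 2 around zero and vanishes outside [−1, 1], so activations far from the binarization threshold receive no gradient.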
Although we could use a higher-order approximation to obtain an even closer fit, the gain would be limited. We carry out an ablation study on a third-order approximation to the sign function, formulated as
$$F(a_r) = \begin{cases} -1 & \text{if } a_r < -1 \\ (1 + a_r)^3 - 1 & \text{if } -1 \le a_r < 0 \\ 1 - (1 - a_r)^3 & \text{if } 0 \le a_r < 1 \\ 1 & \text{otherwise.} \end{cases} \tag{6}$$
Its derivative is a piecewise quadratic function:
$$\frac{\partial F(a_r)}{\partial a_r} = \begin{cases} 3(1 + a_r)^2 & \text{if } -1 \le a_r < 0 \\ 3(1 - a_r)^2 & \text{if } 0 \le a_r < 1 \\ 0 & \text{otherwise.} \end{cases} \tag{7}$$
Intuitively, the decrease in gradient mismatch from the second-order approximation to the third-order approximation is small, as shown in Fig. 7, and the experimental results also show that the corresponding accuracy increase is very limited. Thus, we conclude that the second-order approximation is sufficient; the limited gain from a higher-order approximation does not justify the computational overhead.
3.4.2 Magnitude-aware binarization with respect to weights.
Updating the binary parameters is challenging. The naive solution induces a magnitude mismatch between the binary weights and the real-valued weights. This magnitude mismatch is in turn aggravated in the gradient by the presence of the BatchNorm layer and hampers the convergence of 1-bit networks. To tackle this, we use a magnitude-aware binarization scheme to match the magnitudes of the binary weights and the real-valued weights at training time. After training, we use the naive sign function for weight binarization to keep inference simple and fast.
Here we present how to update the binary weight parameters in a block. The standard gradient descent algorithm cannot be directly applied, as the gradient is not large enough to change the binary weights.
To tackle this problem, the method of [16] introduces a real-valued weight $W_r$ and the sign function during training. Hence the binary weight parameter can be seen as the output of the sign function, i.e., $W_b = \operatorname{Sign}(W_r)$, as shown in the upper subfigure of Fig. 6. $W_r$ is updated using the gradient calculated with respect to $W_b$ in the backward pass, as follows:
$$W_r \leftarrow W_r - \eta\, \frac{\partial \mathcal{L}}{\partial W_b}, \tag{8}$$

where $\eta$ is the learning rate.
This method solves the problem of updating the discrete binary weights, but binarizing the weights with the sign function induces a magnitude mismatch between the binary weights and the real-valued weights. As shown in Fig. 8(a), the magnitude of the binary weights equals 1, whereas, empirically, the magnitude of the real-valued weights is around 0.1.
Thus, in binary networks, we have

$$W_b = \operatorname{Sign}(W_r) \approx \lambda\, W_r, \tag{9}$$

where $\lambda$ is a non-negligible scaling factor (around 10 when the real-valued weights have magnitude around 0.1).
With the presence of the BatchNorm layer in prevalent architectures, the magnitude difference between $W_b$ and $W_r$ incurs a reciprocal magnitude difference in the gradient, which in turn harms the convergence of binary networks. We explain this phenomenon by starting with a simple lemma.
Lemma 1: With the presence of a BatchNorm layer in a convolutional neural network, if every element in the weight kernel is amplified by $\lambda$, the gradient will become $1/\lambda$ of the previous gradient.
Proof: For clarity, we assume that there is only one weight kernel, i.e., $W$ is a matrix.
Forward Pass: We consider a convolutional layer followed by a BatchNorm layer in the convolutional neural network, where the original parameters are denoted with superscript (1) and the parameters after scaling are denoted with superscript (2). When the weights in the convolutional layer are amplified by $\lambda$,
$$W^{(2)} = \lambda\, W^{(1)}, \tag{10}$$
and the output of the convolutional layer is
$$Y^{(2)} = W^{(2)} \ast X = \lambda\, W^{(1)} \ast X = \lambda\, Y^{(1)}. \tag{11}$$
For the BatchNorm layer, which normalizes $Y$, the mean ($\mu$) and standard deviation ($\sigma$) can be calculated as
$$\mu^{(1)} = \frac{1}{m}\sum_{i=1}^{m} y_i^{(1)}, \qquad \sigma^{(1)} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\big(y_i^{(1)} - \mu^{(1)}\big)^2}, \tag{12}$$
where $m$ is the number of entries of $Y$ over which the statistics are computed. Thus, we have
$$\mu^{(2)} = \lambda\, \mu^{(1)}, \qquad \sigma^{(2)} = \lambda\, \sigma^{(1)}. \tag{13}$$
The output $Z$ of the BatchNorm layer in the forward pass is thus independent of the scaling factor $\lambda$:
$$Z^{(2)} = \frac{Y^{(2)} - \mu^{(2)}}{\sigma^{(2)}} = \frac{\lambda Y^{(1)} - \lambda \mu^{(1)}}{\lambda \sigma^{(1)}} = Z^{(1)}. \tag{14}$$
Backward Pass: As we assume that other layers besides the concerned layer are the same, the gradient back propagated to the corresponding BatchNorm layer should be identical, that is,
$$\frac{\partial \mathcal{L}}{\partial Z^{(2)}} = \frac{\partial \mathcal{L}}{\partial Z^{(1)}}. \tag{15}$$
We can calculate the gradient with respect to the weights by the chain rule (treating $\mu$ and $\sigma$ as constants for clarity):
$$\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial Z} \cdot \frac{\partial Z}{\partial Y} \cdot \frac{\partial Y}{\partial W} = \frac{\partial \mathcal{L}}{\partial Z} \cdot \frac{1}{\sigma} \cdot X. \tag{16}$$
Since $\sigma^{(2)} = \lambda\, \sigma^{(1)}$, we have
$$\frac{\partial \mathcal{L}}{\partial W^{(2)}} = \frac{\partial \mathcal{L}}{\partial Z^{(2)}} \cdot \frac{1}{\sigma^{(2)}} \cdot X = \frac{1}{\lambda} \cdot \frac{\partial \mathcal{L}}{\partial W^{(1)}}. \tag{17}$$
The gradient with respect to the weight is scaled to $1/\lambda$ of the original. Obviously, this lemma also holds for a convolutional layer with multiple output channels.
Proof completed.
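Lemma 1 is easy to verify numerically. The sketch below builds a toy two-weight "convolution" followed by batch normalization and a fixed linear loss on the normalized output, and estimates gradients by central finite differences; all names and data are illustrative. Scaling the weights by $\lambda$ should leave the loss unchanged and shrink the gradient by $1/\lambda$:

```python
def loss(w, xs, gs):
    """Linear loss on BatchNorm(w1*a + w2*b) over a small batch."""
    ys = [w[0] * a + w[1] * b for a, b in xs]
    mu = sum(ys) / len(ys)
    var = sum((y - mu) ** 2 for y in ys) / len(ys)
    zs = [(y - mu) / (var ** 0.5) for y in ys]        # BatchNorm (no affine)
    return sum(z * g for z, g in zip(zs, gs))

def grad(w, xs, gs, eps=1e-6):
    """Central finite-difference gradient of loss w.r.t. w."""
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        g.append((loss(wp, xs, gs) - loss(wm, xs, gs)) / (2 * eps))
    return g

xs = [(0.5, 1.0), (-1.2, 0.3), (0.7, -0.8), (2.0, 0.1)]
gs = [1.0, -2.0, 0.5, 0.3]
w, lam = [0.4, -0.9], 10.0

# Forward output (and hence the loss) is unchanged by the scaling ...
assert abs(loss(w, xs, gs) - loss([lam * v for v in w], xs, gs)) < 1e-9
# ... while the gradient shrinks by 1/lambda, as Lemma 1 states.
g1, g2 = grad(w, xs, gs), grad([lam * v for v in w], xs, gs)
assert all(abs(b - a / lam) < 1e-4 for a, b in zip(g1, g2))
```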
According to Lemma 1, the magnitude difference between the binary weights and the real-valued weights rescales the gradient in the converse direction:
$$\frac{\partial \mathcal{L}}{\partial W_b} = \frac{1}{\lambda} \cdot \frac{\partial \mathcal{L}}{\partial W_r}. \tag{18}$$
This effect makes the update of the weights in the binary network imprecise and hinders the binary network from converging to a higher accuracy. To address this issue, we adopt the magnitude-aware binarization to match the magnitudes of the binary weights and the real-valued weights. Thus, we have
$$\bar{W}_b = \frac{\lVert W_r \rVert_1}{|W_r|}\, \operatorname{Sign}(W_r), \tag{19}$$

where $\lVert W_r \rVert_1$ is the $\ell_1$-norm of the real-valued weights and $|W_r|$ is the number of entries, so that the scaling factor is the average absolute value of $W_r$.
After training the binary network with the more accurate gradients, we no longer need to rescale the binary weights at inference time, as the output of the BatchNorm layer is independent of a positive scaling of the weights. By setting the inference weights to $W_b = \operatorname{Sign}(W_r)$ and letting the BatchNorm statistics absorb the scalar $\lVert W_r \rVert_1 / |W_r|$, we obtain the same value in the output,
$$Z = \frac{\bar{W}_b \ast X - \bar{\mu}}{\bar{\sigma}} = \frac{\operatorname{Sign}(W_r) \ast X - \mu}{\sigma}. \tag{20}$$
Thus we can simply use the sign function for binarizing the weights at the inference phase for easy deployment.
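A minimal sketch of the two binarization modes, assuming the scale in Eq. (19) is the average absolute value of the real-valued weights (function names are ours):

```python
def magnitude_aware_binarize(w_real):
    """Training-time binarization (a sketch of Eq. (19)): Sign(W_r) scaled
    by the mean absolute value of W_r, so the binarized weights match the
    real-valued weights in magnitude."""
    scale = sum(abs(v) for v in w_real) / len(w_real)
    return [scale * (-1.0 if v < 0 else 1.0) for v in w_real]

def inference_binarize(w_real):
    """Inference-time binarization: the plain sign, giving -1/+1 weights;
    the scalar above is absorbed by the following BatchNorm layer."""
    return [-1.0 if v < 0 else 1.0 for v in w_real]
```

The two functions always agree in sign; they differ only by the positive scalar that BatchNorm normalizes away at inference time, which is why deployment can use the plain sign.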
Using a scaling factor for weight binarization was first proposed in XNOR-Net [30] and inherited by subsequent works [24, 38]. However, previous works use this scaling factor to enhance the representational capability of the binary weights at both training time and inference time, which results in extra computation when deploying the binary model. To the best of our knowledge, we are the first to explicitly point out that this scaling factor can be used as an auxiliary parameter to help convergence at training time and be normalized away by the BatchNorm layer at inference time.
3.4.3 Initialization.
As discussed in the previous section, training the 1-bit CNN means updating the stored real-valued weights and using their signs to update the binary weights. The magnitude of a real-valued weight indicates how likely the corresponding binary weight is to change its sign. A good initialization of the real-valued weights is of great importance for rapid convergence and high accuracy of the model. Previous works proposed to fine-tune the 1-bit CNN from the corresponding real-valued network with the ReLU nonlinearity [24]. However, the activation of ReLU is non-negative, as shown in Fig. 9(a), while that of Sign is −1 or +1. Due to this difference, the real-valued CNNs with ReLU may not provide a suitable initial point for training the 1-bit CNNs. Instead, we propose to replace ReLU with the clip function, $\operatorname{clip}(-1, x, 1)$, to pretrain the real-valued CNN model. The activation of the clip function is closer to the sign function than that of ReLU, as shown in Fig. 9(b), and yields a better initialization, which will be further validated in the following experiments.
The initialization method and the magnitudeaware binarization with respect to weights can be viewed as a whole process to match the magnitude of activations and weights in the realvalued network with that in the 1bit CNNs. Replacing the ReLU nonlinearity with the clip function in initialization reshapes the activation distribution to be closer to 1 and +1. Then the magnitudeaware sign matches the magnitude of the binary weights and the realvalued weights to obtain more effective gradients. After training for convergence, we use the BatchNorm layer to normalize the scaling factor and scale back the binary weights to 1 and +1. In this process, we manage to convert the realvalued networks to 1bit CNNs.
3.4.4 Twostep training method for a deep 1bit CNN with the bottleneck structure.
To further alleviate the difficulty in training a deep 1bit CNN, we followed the progressive quantization methods [51] to binarize the entire network in a twostep manner, in which the realvalued weights in the 33 convolutional layers are used as auxiliary variables. Specifically, we first binarize the activations and weights in all the 11 convolutional layers and the activations in the 33 convolutional layers in the bottleneck blocks. We train the network till convergence, and then binarize the weights in the 33 convolutional layers. With the proposed twostep training method, we manage to decompose the big challenge of binarizing the deep network into two subproblems, facilitating the network converge to higher accuracy.
4 Experiments
In this section, we first introduce the dataset for experiments, and present implementation details in Sec 4.1. Then, we conduct an ablation study in Sec. 4.2 to investigate the effectiveness of the proposed techniques. This is followed by comparing our BiReal net with other stateoftheart binary networks regarding accuracy in Sec 4.3. Sec. 4.4 reports the memory usage and computation cost in comparison with other networks. In Sec. 4.5, we deploy the BiReal net to a realworld application, the depth estimation task.
4.1 Dataset and Implementation Details
The experiments are carried out on the ILSVRC12 ImageNet classification dataset [32]. ImageNet is a largescale dataset with 1.2 million training images and 50K validation images of 1,000 classes. Compared to other datasets like CIFAR10 [20] or MNIST [29], ImageNet is more challenging due to its large scale and great diversity. The study on this dataset will validate the superiority of the proposed BiReal network structure and the effectiveness of the novel training techniques for 1bit CNNs. In our experiment, we report both the top1 and top5 accuracies.
For each image in the ImageNet dataset, the lower dimension of the image is rescaled to 256 while keeping the aspect ratio intact. For training, a random crop of size 224 224 is selected from the rescaled image or its horizontal flip. For inference, we employ the 224 224 center crop from images.
Pretraining: We finetune the realvalued network with the clip nonlinear function from the corresponding network with the ReLU nonlinear function. For easy convergence, we use the network with the leaky clip as a transition, which has a negative slope of 0.1 instead of 0 outside the range of (1,1). The finetuning procedure is ReLU LeakyClip Clip.
Training: We train four instances of the BiReal net, including an 18layer, a 34layer, a 50layer, and a 152layer BiReal net. The training of them consists of two steps: training the 1bit convolutional layer and retraining the BatchNorm. In the first step, the weights in the 1bit convolutional layer are binarized using the magnitudeaware binarization with respect to the weights. We use the SGD solver with the momentum of 0.9 and set the weightdecay to 0, which means that we no longer encourage the weights to be close to 0. For the 18layer BiReal net, we run the training algorithm for 20 epochs with a batch size of 128. The learning rate starts at 0.01 and is decayed twice by multiplying 0.1 at the 10th and the 15th epoch. For the 34layer BiReal net, the training process includes 40 epochs and the batch size is set to 1024. The learning rate starts at 0.08 and is multiplied by 0.1 at the 20th and the 30th epoch respectively. For the 50layer BiReal net and 152layer BiReal net, the first step is further divided into two substeps: 1) binarizing the activations and weights in 11 convolutional layers along with the activations in 33 convolutional layers, and 2) binarizing the weights in 33 convolutional layers. For the 50layer BiReal net, each substep includes 100 epochs and the batch size is set to 800. The learning rate starts from 0.064 and is multiplied by 0.1 at the 50th and the 75th epoch respectively. For the 152layer BiReal net, each substep includes 120 epochs and the batch size is set to 4000. The learning rate starts at 0.1 and is multiplied by 0.1 at the 60th and the 90th epoch, respectively. In the second step, we constrain the weights to 1 and 1, and set the learning rate in all convolutional layers to 0 and retrain the BatchNorm layer for one epoch to absorb the scaling factor.
Inference: we use the trained model with binary weights and binary activations in the 1bit convolutional layers for inference.
4.2 Ablation Study
Three building blocks. The shortcut in our BiReal net transfers realvalued representation without additional memory cost, which plays an important role in improving its capability. To verify its importance, we implemented a PlainNet structure without shortcuts, as shown in Fig. 10 (d), for comparison. At the same time, as our network structure employs the same number of weight filters and layers as the standard ResNet, we also carry out a comparison with the standard ResNet shown in Fig. 10 (c). For fair comparison, we adopt the ReLUonly preactivation ResNet structure in [12], which differs from BiReal net only in the structure of two layers per block instead of one layer per block. The layer order and shortcut design in Fig. 10 (c) are also applicable for 1bit CNNs. The comparison can justify the benefit of implementing our BiReal net by specifically replacing the 2convlayerperblock ResNet structure with two 1convlayerperblock BiReal structures.
As discussed in Sec. 3, we propose to overcome the optimization challenges induced by discrete weights and activations by 1) approximation to the derivative of the sign function with respect to activations, 2) magnitudeaware binarization with respect to weights, and 3) clip initialization. To study how these proposals benefit the 1bit CNNs individually and collectively, we train the 18layer structure and the 34layer structure with a combination of these techniques on the ImageNet dataset. Thus, we derive pairs of values of top1 and top5 accuracies, which are presented in Table 1.
Initiali  Weight  Activation  BiReal18  Res18  Plain18  BiReal34  Res34  Plain34  

zation  update  backward  top1  top5  top1  top5  top1  top5  top1  top5  top1  top5  top1  top5 
ReLU  Original  Original  32.9  56.7  27.8  50.5  3.3  9.5  53.1  76.9  27.5  49.9  1.4  4.8 
Proposed  36.8  60.8  32.2  56.0  4.7  13.7  58.0  81.0  33.9  57.9  1.6  5.3  
Proposed  Original  40.5  65.1  33.9  58.1  4.3  12.2  59.9  82.0  33.6  57.9  1.8  6.1  
Proposed  47.5  71.9  41.6  66.4  8.5  21.5  61.4  83.3  47.5  72.0  2.1  6.8  
Realvalued Net  68.5  88.3  67.8  87.8  67.5  87.5  70.4  89.3  69.1  88.3  66.8  86.8  
Clip  Original  Original  37.4  62.4  32.8  56.7  3.2  9.4  55.9  79.1  35.0  59.2  2.2  6.9 
Proposed  38.1  62.7  34.3  58.4  4.9  14.3  58.1  81.0  38.2  62.6  2.3  7.5  
Proposed  Original  53.6  77.5  42.4  67.3  6.7  17.1  60.8  82.9  43.9  68.7  2.5  7.9  
Proposed  56.4  79.5  45.7  70.3  12.1  27.7  62.2  83.9  49.0  73.6  2.6  8.3  
Realvalued Net  68.0  88.1  67.5  87.6  64.2  85.3  69.7  89.1  67.9  87.8  57.1  79.9  
Fullprecision original ResNet[11]  69.3  89.2  73.3  91.3 
Based on Table 1, we can evaluate each technique’s individual contribution and collective contribution of each unique combination of these techniques towards the final accuracy.
1) Comparing the columns with the columns, both the proposed BiReal net and the binarized standard ResNet outperform their plain counterparts with a significant margin, which validates the effectiveness of the shortcut and the disadvantage of directly concatenating the 1bit convolutional layers. As Plain18 has a thin and deep structure, which has the same weight filters but no shortcut, binarizing it results in very limited network representational capacity in the last convolutional layer, and can thus hardly achieve good accuracy.
2) Comparing the and columns, the 18layer BiReal net structure improves the accuracy of the binarized standard ResNet18 by about 18%. This validates the conjecture that the BiReal net structure with more shortcuts further enhances the network capacity compared to the standard ResNet structure. Replacing the 2convlayerperblock structure employed in ResNet with two 1convlayerperblock structures, adopted by BiReal net, could even benefit a realvalued network.
3) All proposed techniques for initialization, weight update, and activation backward improve the accuracy to various extent. For the 18layer BiReal net structure, the improvement from the weight (about 23%, by comparing the and rows) is greater than the improvement from the activation (about 12%, by comparing the and rows) and the improvement from replacing ReLU with Clip for initialization (about 13%, by comparing the and rows). These three proposed training mechanisms are orthogonal to each other and can function collaboratively towards enhancing the final accuracy.
4) The proposed training methods can improve the final accuracy for all three networks in comparison with the original training method, which implies that these proposed three training methods are universally suitable for various networks.
5) The two implemented BiReal nets (i.e., the 18layer and 34layer structures) together with the proposed training methods achieve approximately 83% and 89% of the accuracy level of their corresponding fullprecision networks, but with a huge amount of speedup and computation cost saving.
In brief, the shortcut enhances the network representational capability, and the proposed training methods help the network approach the accuracy upper bound.
As discussed in Sec. 3.4.1, we investigate a higherorder approximation to the derivative of the sign function. Specifically, we carry out an ablation study to answer the question of how much increase in performance can we obtain by using a higher order approximation to the derivative of the sign function. As shown in Table 2, the gain from changing the secondorder approximation of the sign function to the thirdorder approximation is diminished, to only a 0.06% increase in top1 accuracy. Considering the computational overhead, we suggest using the secondorder approximation.
ApproxSign  Thirdorder ApproxSign  

Top1  56.40%  56.46% 
Top5  79.50%  79.74% 
BiReal net  BinaryNet[16]  ABCNet[24]  XNORNet[30]  Fullprecision[11]  

18layer  Top1  56.4%  42.2%  42.7%  51.2%  69.3% 
Top5  79.5%  67.1%  67.6%  73.2%  89.2%  
34layer  Top1  62.2%  –  52.4%  –  73.3% 
Top5  83.9%  –  76.5%  –  91.3% 
4.3 Accuracy Comparison with StateoftheArts
While the ablation study demonstrates the effectiveness of our 1layerperblock structure and the proposed techniques for optimal training, it is also necessary to make a comparison with other stateoftheart methods to evaluate BiReal net’s overall performance. To this end, we carry out a comparative study with three methods: BinaryNet [16], XNORNet [30], and ABCNet [24]. These three networks are representative methods of binarizing both weights and activations for CNNs and achieve stateoftheart results. Note that for fair comparison, our BiReal net contains the same number of weight filters as the corresponding ResNet that these methods attempt to binarize, differing only in the shortcut design.
Table 3 shows the results. The results of the three networks are quoted directly from the corresponding references, except the result of BinaryNet, which is quoted from ABCNet [24]. The comparison clearly indicates that the proposed BiReal net outperforms the three networks by a considerable margin in terms of both the top1 and top5 accuracies. Specifically, the 18layer BiReal net outperforms its 18layer counterparts BinaryNet and ABCNet with a relative 33% advantage and achieves a roughly 10% relative improvement over XNORNet. Similar improvements can be observed for the 34layer BiReal net. In short, our BiReal net is more competitive than the stateoftheart binary networks.
BiReal net  BinaryNet[16]  XNORNet[30]  Fullprecision ResNet[11]  

50layer  Top1  62.6%  9.4%  63.1%  74.7% 
Top5  83.9%  22.4%  83.6%  92.1%  
152layer  Top1  64.5%  8.9%  –  76.5% 
Top5  85.5%  21.0%  –  93.2% 
Table 4 shows the results of BiReal net on the deeper network with the bottleneck structure. Our BiReal net contains the same number of weight filters as the corresponding ResNet. We reimplement the method proposed in BinaryNet on ResNet50 and ResNet152. For fair comparison, both methods use the same data augmentation with the lower dimension of the image randomly sampled in [256,480] while keeping the aspect ratio intact. A random crop of size 224 224 is selected from the rescaled image or its horizontal flip. The results show that BiReal net with an ultradeep network structure outperforms the method in BinaryNet [16] by a large margin. The results also show that without adding extra shortcuts to preserve every layer’s realvalued output, the 1bit CNNs can hardly scale up to 152layers deep, while our proposed BiReal net can achieve 64.5% top1 accuracy. BiReal net also achieves comparable accuracy to XNORNet [30] on the 50layer structures. Compared to XNORNet, we do not need to store or compute the realvalued scaling factor for multiplying with the binary weights and activations, which makes our network easier for implementation and execution.
4.4 Efficiency and Memory Usage Analysis
In this section, we analyze the saving of memory usage and speedup in computing BiReal net by both theoretical analysis and realworld estimation on FPGA devices.
4.4.1 Resource computation
The memory usage is computed as the summation of 32 bits times the number of realvalued parameters and 1 bit times the number of binary parameters in the network. For efficiency comparison, following the suggestion of Uniq [2], we use BOPs to measure the total multiplication computation in BiReal net. BOP refers to the number of bitoperations in a neural network, where the calculation method is the same as calculating FLOPs for the floatingpoint operations in [11], excepting the operation calculated in BOPs is bitwise.
Memory usage  Memory saving  BOPs  Speedup  
18layer  BiReal net  33.6 Mbit  11.14  1.04  11.06 
XNORNet[30]  33.7 Mbit  11.10  1.07  10.86  
Fullprecision ResNet[11]  374.1 Mbit  –  1.16  –  
34layer  BiReal net  43.7 Mbit  15.97  124  18.99 
XNORNet[30]  43.9 Mbit  15.88  127  18.47  
Fullprecision ResNet[11]  697.3 Mbit  –  234  –  

We follow the suggestion in XNORNet [30], to keep the weights and activations in the first convolution and the last fullyconnected layers to be realvalued. We also adopt the same realvalued 11 convolution in the Type B shortcut [11], as implemented in XNORNet. Note that this 11 convolution is for the transition between two stages of ResNet and thus all information should be preserved. As the number of weights in those three kinds of layers accounts for only a very small proportion of the total number of weights, the limited memory saving for binarizing them does not justify the performance degradation caused by the information loss.
For both the 18layer and the 34layer networks, the proposed BiReal net reduces the memory usage by 11.1 times and 16.0 times, respectively, and achieves computation reduction of about 11.1 times and 19.0 times, in comparison with the fullprecision counterparts. Without using realvalued weights and activations for scaling binary ones during the inference time, our BiReal net requires fewer BOPs and uses less memory than XNORNet and is also much easier to implement.
4.4.2 Onboard speed estimation
We estimate the execution time of an 18layer BiReal net as well as its realvalued counterpart on FPGA (FieldProgrammable Gate Array) with the Vivado Design Suite [5], which is targeted at embedded applications. Table 7 shows the speed and resource usage comparison of the entire networks. The proposed BiReal net achieves a 6.07 speed up with the same or fewer resources compared to its realvalued counterpart on an FPGA board. Table 6 shows the speed estimation of each individual module. Binary convolutional layers achieve a 15.8 speed up compared with realvalued convolutions. By adding up the execution time of all the operations, 18layer BiReal net is able to achieve a 7.38 speedup than the realvalued network with the same structure.
Execution Time(ms)  Resource usage  

LUT  FF  BRAM  
32bit network  2394.6  256207  315644  2086 
1bit network  394.7  154036  197606  2086 
Time/Resource reduction ratio  6.07  1.66  1.60  1 
Execution time  Speedup ratio  

1bit network  32bit network  
3 3 Convolutional layers  2.55s  24.23s  15.8 
Downsampling layers  0.13s  0.13s  – 
First convolutional layer  0.08s  0.08s  – 
Fullyconnected layer  1.02ms  1.02ms  – 
BatchNorm layers  0.62s  0.62s  – 
All operations considered  3.4s  25.1s  7.38 
4.5 Application: Pixelwise Depth Estimation
Depth estimation is an important task for autonomous driving and drone navigation. Compressing a depth estimation CNN is crucial to deploying the powerful CNN to mobile devices which have limited memory and computational resources. In this section, we replace the realvalued Res50 network in [6] with a 50layer BiReal net for pixelwise depth estimation.
The experimental evaluations were carried out on the KITTI dataset [7], which contains pictures of the roads captured using a stereo camera mounted on a moving vehicle. We employed the same training/validation split and data preprocessing method as [6] for fair comparison.
In the training phase, the images were resized to 160 608 and no data augmentation was applied. We trained the network for 60K iterations with a minibatch size of 10. We started from a learning rate of 0.001 and divided it by 10 at every 30K iterations. We pretrained the realvalued network and then used it to initialize and finetune the binarized network.
In the testing phase, we evaluated our results on the same cropped region of interests as [6] and compared the depth prediction results with the corresponding groundtruth depth maps.
Depth Estimation Network  

BiReal net  BinaryNet [16]  Fullprecision network [6] 
84.9%  83.0%  85.2% 
The results show that BiReal net achieves comparable accuracy to the realvalued network proposed in [6] and is 2% higher than directly binarizing the original network with the method in [16]. The results provide a piece of evidence that the proposed BiReal net not only works well on classification tasks but can also be applied to other regression tasks like pixelwise depth estimation.
5 Conclusions
In this study, we proposed a novel 1bit CNN model, dubbed BiReal net. Compared to standard 1bit CNNs, BiReal net utilizes a simple yet effective shortcut to significantly enhance the representational capability of the 1bit CNNs. Furthermore, an advanced training algorithm was designed for training 1bit CNNs (including BiReal net), including a tighter approximation to the derivative of the sign function with respect to the activation, a magnitudeaware binarization with respect to the weight, as well as a novel initialization and a twostep training algorithm for deep 1bit CNNs. The extensive experimental results demonstrate that the proposed BiReal net and novel training algorithm achieve superior results over the stateoftheart methods and are viable for realworld applications.
6 Acknowledgements
The authors would like to acknowledge HKSAR RGC’s funding support under grant GRF16203918. We also would like to thank Zhuoyi Bai, Tian Xia, Prof. Zhenyan Wang and Xiaofeng Hu from Huazhong University of Science and Technology for their efforts in implementing BiReal net on FPGA and carrying out the onboard speed estimation.
References
 [1] Hessam Bagherinezhad, Maxwell Horton, Mohammad Rastegari, and Ali Farhadi. Label refinery: Improving imagenet classification through label progression. arXiv preprint arXiv:1805.02641, 2018.
 [2] Chaim Baskin, Eli Schwartz, Evgenii Zheltonozhskii, Natan Liss, Raja Giryes, Alex M Bronstein, and Avi Mendelson. Uniq: uniform noise injection for the quantization of neural networks. arXiv preprint arXiv:1804.10969, 2018.
 [3] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint, pages 1610–02357, 2017.
 [4] Matthieu Courbariaux, Yoshua Bengio, and JeanPierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131, 2015.
 [5] Tom Feist. Vivado design suite. White Paper, 5:30, 2012.
 [6] Ravi Garg, Vijay Kumar BG, Gustavo Carneiro, and Ian Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision, pages 740–756. Springer, 2016.
 [7] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
 [8] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
 [9] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In Advances In Neural Information Processing Systems, pages 1379–1387, 2016.
 [10] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
 [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
 [13] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In International Conference on Computer Vision (ICCV), volume 2, 2017.
 [14] Lu Hou, Quanming Yao, and James T Kwok. Lossaware binarization of deep networks. In Proceedings of the International Conference on Learning Representations, 2017.
 [15] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [16] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran ElYaniv, and Yoshua Bengio. Binarized neural networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4107–4115. Curran Associates, Inc., 2016.
 [17] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran ElYaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.
 [18] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnetlevel accuracy with 50x fewer parameters and¡ 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016.
 [19] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [20] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [22] Liangzhen Lai, Naveen Suda, and Vikas Chandra. Deep convolutional neural network inference with floatingpoint weights and fixedpoint activations. arXiv preprint arXiv:1703.03073, 2017.
 [23] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
 [24] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pages 345–353, 2017.
 [25] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015.
 [26] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian D Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell., 38(10):2024–2039, 2016.
 [27] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and KwangTing Cheng. Bireal net: Enhancing the performance of 1bit cnns with improved representational capability and advanced training algorithm. In Proceedings of the European Conference on Computer Vision (ECCV), pages 722–737, 2018.
 [28] Wenhan Luo, Peng Sun, Fangwei Zhong, Wei Liu, Tong Zhang, and Yizhou Wang. Endtoend active object tracking via reinforcement learning. ICML, 2018.
 [29] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
 [30] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnornet: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
 [31] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster rcnn: Towards realtime object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
 [32] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [33] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [34] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep convolutional network cascade for facial point detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3476–3483, 2013.
 [35] Vivienne Sze, YuHsin Chen, TienJu Yang, and Joel S Emer. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12):2295–2329, 2017.
 [36] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inceptionv4, inceptionresnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
 [37] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
 [38] Wei Tang, Gang Hua, and Liang Wang. How to train a compact binary neural network with high accuracy? In ThirtyFirst AAAI Conference on Artificial Intelligence, 2017.
 [39] Baoyuan Wu, BaoGang Hu, and Qiang Ji. A coupled hidden markov random field model for simultaneous face clustering and tracking in videos. Pattern Recognition, 64:361–373, 2017.
 [40] Baoyuan Wu, Siwei Lyu, BaoGang Hu, and Qiang Ji. Simultaneous clustering and tracklet linking for multiface tracking in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 2856–2863, 2013.
 [41] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
 [42] Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. Lqnets: Learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 365–382, 2018.
 [43] Hanwang Zhang, Zawlin Kyaw, ShihFu Chang, and TatSeng Chua. Visual translation embedding network for visual relation detection. In CVPR, volume 1, page 5, 2017.
 [44] Hanwang Zhang, Zawlin Kyaw, Jinyang Yu, and ShihFu Chang. Pprfcn: Weakly supervised visual relation detection via parallel pairwise rfcn. arXiv preprint arXiv:1708.01956, 2017.
 [45] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
 [46] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless cnns with lowprecision weights. In Proceedings of the International Conference on Learning Representations, 2017.
 [47] Aojun Zhou, Anbang Yao, Kuan Wang, and Yurong Chen. Explicit losserroraware quantization for lowbit deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9426–9435, 2018.
 [48] Erjin Zhou, Haoqiang Fan, Zhimin Cao, Yuning Jiang, and Qi Yin. Extensive facial landmark localization with coarsetofine convolutional network cascade. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 386–391, 2013.
 [49] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefanet: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
 [50] Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, and Stan Z Li. Face alignment across large poses: A 3d solution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 146–155, 2016.
 [51] Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, and Ian Reid. Towards effective lowbitwidth convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7920–7928, 2018.