Rotated Binary Neural Network

Abstract

Binary Neural Networks (BNNs) excel at reducing the complexity of deep neural networks, but they suffer severe performance degradation. One of the major impediments is the large quantization error between the full-precision weight vector and its binary vector. Previous works focus on compensating for the norm gap while leaving the angular bias hardly touched. In this paper, for the first time, we explore the influence of the angular bias on the quantization error and then introduce a Rotated Binary Neural Network (RBNN), which considers the angle alignment between the full-precision weight vector and its binarized version. At the beginning of each training epoch, we propose to rotate the full-precision weight vector towards its binary vector to reduce the angular bias. To avoid the high complexity of learning a large rotation matrix, we further introduce a bi-rotation formulation that learns two smaller rotation matrices. In the training stage, we devise an adjustable rotated weight vector for binarization to escape the potential local optimum. Our rotation leads to around 50% weight flips, which maximizes the information gain. Finally, we propose a training-aware approximation of the sign function for the gradient backward pass. Experiments on CIFAR-10 and ImageNet demonstrate the superiority of RBNN over many state-of-the-art methods. Our source code, experimental settings, training logs and binary models are available at https://github.com/lmbxmu/RBNN.

1 Introduction

The community has witnessed the remarkable performance improvements of deep neural networks (DNNs) in computer vision tasks, such as image classification Krizhevsky et al. (2012); He et al. (2016), object detection Redmon et al. (2016); He et al. (2017) and semantic segmentation Noh et al. (2015); Long et al. (2015). However, the massive number of parameters and the computational complexity make DNNs hard to deploy on resource-constrained and low-power devices.

To solve this problem, many compression techniques have been proposed, including network pruning Lin et al. (2020b); Ding et al. (2019b); Lin et al. (2020a), low-rank decomposition Denton et al. (2014); Yu et al. (2017); Hayashi et al. (2019), efficient architecture design Iandola et al. (2016); Wu et al. (2018); Chen et al. (2020) and network quantization Lin et al. (2016); Banner et al. (2018); Helwegen et al. (2019), etc. In particular, network quantization converts the weights and activations of a full-precision network to low-bit representations. In the extreme case, a binary neural network (BNN) restricts its weights and activations to only two possible values ($+1$ and $-1$) such that: 1) the network size is 32x smaller than its full-precision counterpart; 2) the multiply-accumulate convolution can be replaced with the efficient xnor and bitcount logics.
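As a quick illustration of the second point (our sketch, not from the paper), the dot product of two $\pm 1$ vectors can be computed with xnor and bitcount alone:

```python
import numpy as np

# Dot product of two {-1,+1} vectors via xnor + bitcount:
# if a, b in {-1,+1}^n are stored as bits (1 -> +1, 0 -> -1),
# then a.b = 2 * popcount(xnor(a_bits, b_bits)) - n.
rng = np.random.default_rng(0)
n = 64
a = rng.choice([-1, 1], size=n)
b = rng.choice([-1, 1], size=n)

a_bits = (a > 0)                      # encode +1 as True, -1 as False
b_bits = (b > 0)
xnor = ~(a_bits ^ b_bits)             # True where the signs agree
dot_via_bits = 2 * np.count_nonzero(xnor) - n

assert dot_via_bits == int(a @ b)     # matches the floating-point dot product
```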

Though BNNs have attracted great interest, it remains a challenge to close the accuracy gap between a full-precision network and its binarized version Rastegari et al. (2016); Cai et al. (2017). One of the major obstacles comes from the large quantization error between the full-precision weight vector $\mathbf{w} \in \mathbb{R}^n$ and its binary vector $\mathbf{b} \in \{-1,+1\}^n$ Courbariaux et al. (2015, 2016), as shown in Fig. 1(a). To solve this, state-of-the-art approaches Rastegari et al. (2016); Qin et al. (2020b) lessen the quantization error by introducing a per-channel learnable/optimizable scaling factor $\alpha$ that minimizes

Figure 1: (a) Early works Courbariaux et al. (2015, 2016) suffer from a large quantization error caused by both the norm gap and the angular bias between the full-precision weight vector and its binarized version. (b) Recent works Rastegari et al. (2016); Qin et al. (2020b) introduce a scaling factor to reduce the norm gap but cannot reduce the angular bias, i.e., $\theta$ remains unchanged. Therefore the quantization error is still large when $\theta$ is large.
Figure 2: Cosine similarity and quantization error in various layers of ResNet-20. (a) Our RBNN achieves a significantly higher cosine similarity between the full-precision weight and its binarization than XNOR-Net Rastegari et al. (2016) does, implying a smaller angular bias. (b) XNOR-Net suffers a large quantization error while RBNN leads to a much smaller one.
$\min_{\alpha,\,\mathbf{b}} \; \|\mathbf{w} - \alpha\mathbf{b}\|_2^2, \quad \text{s.t.}\;\; \alpha > 0,\;\; \mathbf{b}\in\{-1,+1\}^n.$   (1)

However, the introduction of $\alpha$ only partly mitigates the quantization error by compensating for the norm gap between the full-precision weight and its binarized version; it cannot reduce the quantization error caused by the angular bias $\theta$ between $\mathbf{w}$ and $\mathbf{b}$, as shown in Fig. 1(b). Apparently, with a fixed angular bias $\theta$, when $\mathbf{w} - \alpha\mathbf{b}$ is orthogonal to $\mathbf{b}$, Eq. (1) reaches its minimum and we have

$\min_{\alpha} \; \|\mathbf{w} - \alpha\mathbf{b}\|_2^2 = \|\mathbf{w}\|_2^2\,\sin^2\theta.$   (2)

Thus, $\|\mathbf{w}\|_2^2\sin^2\theta$ serves as the lower bound of the quantization error and cannot be diminished as long as the angular bias exists. This lower bound can be huge when the angular bias $\theta$ is large. Though the training process updates the weights and may close the angular bias, we experimentally observe that the possibility of this case is small, as shown in Fig. 2. Thus, it is desirable to reduce this angular bias for the sake of further reducing the quantization error. Moreover, the information of BNN learning is upper-bounded by $2^n$, where $n$ is the total number of weight elements and the base 2 denotes the two possible values in BNN Liu et al. (2018); Qin et al. (2020b). A weight flip refers to a positive weight being binarized to $-1$, and vice versa. It is easy to see that when the probability of a flip reaches 50%, the information reaches the maximum of $2^n$. However, the scaling factor results in a small ratio of flipped weights, thus leading to little information gain in the training process Helwegen et al. (2019); Qin et al. (2020b)¹.
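To make the lower bound in Eq. (2) concrete, here is a minimal NumPy check (our illustration, not part of the paper): for a fixed binary vector, the best scaling factor is the least-squares projection coefficient, and the remaining error equals $\|\mathbf{w}\|_2^2\sin^2\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128
w = rng.normal(size=n)                 # full-precision weight vector
b = np.sign(w)                         # its binary vector, entries in {-1, +1}

alpha = w @ b / (b @ b)                # optimal scaling factor (least squares)
err = np.sum((w - alpha * b) ** 2)     # minimal quantization error over alpha

cos_theta = w @ b / (np.linalg.norm(w) * np.linalg.norm(b))
bound = np.sum(w ** 2) * (1 - cos_theta ** 2)   # ||w||^2 * sin^2(theta)

assert np.isclose(err, bound)          # the angular bias sets the error floor
```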

In this paper, we propose a Rotated Binary Neural Network (RBNN) to further mitigate the quantization error caused by the intrinsic angular bias, as illustrated in Fig. 3. To the best of our knowledge, this is the first work that explores and reduces the influence of the angular bias on the quantization error in the field of BNNs. To this end, we devise an angle alignment scheme that learns a rotation matrix to rotate the full-precision weight vector towards its geometrical vertex of the binary hypercube at the beginning of each training epoch. Instead of directly learning a large rotation matrix, we introduce a bi-rotation formulation that learns two smaller matrices with a significantly reduced complexity. A series of optimization steps are then developed to learn the rotation matrices and the binarization alternately, which aligns the angle difference as shown in Fig. 2(a) and significantly reduces the quantization error as illustrated in Fig. 2(b). To escape a possible local optimum in the optimization, we dynamically adjust the rotated weights for binarization in the training stage. We show that the proposed rotation not only reduces the angular bias, which leads to less quantization error, but also achieves around 50% weight flips, thereby achieving the maximum information gain. Finally, we provide a training-aware approximation of the sign function for gradient backpropagation. We show the superiority of RBNN through extensive experiments.

2 Related Work

The pioneering BNN work dates back to Courbariaux et al. (2016), which binarizes the weights and activations to $+1$ or $-1$ with the sign function. The straight-through estimator (STE) Bengio et al. (2013) was adopted for gradient backpropagation. Following this, abundant works have been devoted to improving accuracy and to deploying BNNs on low-power and resource-constrained platforms. We refer readers to the survey papers Simons and Lee (2019); Qin et al. (2020a) for a more detailed overview.

XNOR-Net Rastegari et al. (2016) includes all the basic components of Courbariaux et al. (2016) but further introduces a per-channel scaling factor to reduce the quantization error. The scaling factor is obtained from the $\ell_1$-norm of the weights and activations before binarization. DoReFa-Net Zhou et al. (2016) introduces a changeable bit-width for the quantization of weights and activations, and even of the gradients in backpropagation. Its scaling factor is layer-wise and deduced from the network weights, which allows efficient inference since the weights do not change after training. XNOR++ Bulat and Tzimiropoulos (2019) fuses the separate activation and weight scaling factors of Rastegari et al. (2016) into a single one, which is then learned discriminatively via backpropagation.

Besides using a scaling factor to reduce the quantization error, more recent works focus on expanding the representational ability of BNNs to gain more information during learning, and on devising differentiable approximations of the sign function to enable gradient propagation. ABC-Net Lin et al. (2017) proposes multiple parallel binary convolution layers to enhance the model accuracy. Bi-Real Net Liu et al. (2018) adds ResNet-like shortcuts to reduce the information loss caused by binarization. Both ABC-Net and Bi-Real Net modify the network structure to strengthen the information of BNN learning. However, they ignore the probability of weight flips, so the actual learning capacities of ABC-Net and Bi-Real Net are smaller than $2^n$, as stressed in Sec. 1. To compensate, Helwegen et al. (2019); Qin et al. (2020b) strengthen the learning ability of BNNs during network training by increasing the probability of weight flips.

3 Approach

Figure 3: Framework of our RBNN. The weight vector $\mathbf{w}$ is rotated at the beginning of each training epoch such that the angular bias between the rotated weight vector and the geometrical binary vertex is smaller than that of the original $\mathbf{w}$. After rotation, the weights are either unflipped (b) or flipped (c), which increases the information gain. During training, the rotated weights are dynamically adjusted so that a weight vector with a much smaller angular bias is obtained, which is then binarized.

3.1 Binary Neural Networks

Given a CNN model, we denote $\mathbf{w}^i \in \mathbb{R}^{c_{out} \times c_{in} \times k_w \times k_h}$ and $\mathbf{a}^i \in \mathbb{R}^{c_{in} \times w_{in} \times h_{in}}$ as its weights and feature maps in the $i$-th layer. ($c_{out}$, $c_{in}$) represent the number of output and input channels, respectively. ($k_w$, $k_h$) and ($w_{in}$, $h_{in}$) are the width and height of the filters and feature maps, respectively. We then have

$\mathbf{a}^{i+1} = \mathbf{w}^i \circledast \mathbf{a}^i,$

where $\circledast$ denotes the standard convolution and we omit the activation layers for simplicity. A BNN aims to convert $\mathbf{w}^i$ and $\mathbf{a}^i$ into $\mathbf{b}_{\mathbf{w}}^i \in \{-1,+1\}^{c_{out}\times c_{in}\times k_w\times k_h}$ and $\mathbf{b}_{\mathbf{a}}^i \in \{-1,+1\}^{c_{in}\times w_{in}\times h_{in}}$, such that the convolution can be achieved using the efficient xnor and bitcount logics. Following Hubara et al. (2016); Bulat and Tzimiropoulos (2019), we binarize the activations with the sign function:

$\mathbf{a}^{i+1} = \mathbf{b}_{\mathbf{w}}^i \odot \text{sign}(\mathbf{a}^i),$

where $\odot$ represents the xnor and bitcount logics, and sign($\cdot$) denotes the sign function, which returns $+1$ if the input is larger than zero and $-1$ otherwise. Similar to Hubara et al. (2016); Zhou et al. (2016); Bulat and Tzimiropoulos (2019), $\mathbf{b}_{\mathbf{w}}^i$ can also be obtained by $\text{sign}(\mathbf{w}^i)$, and a scaling factor $\alpha$ can be applied to compensate for the norm difference.
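For concreteness, a simplified PyTorch sketch of this baseline binarized convolution is given below (our illustration; the function name is ours, the per-output-channel scaling factor follows the $\ell_1$-norm rule of XNOR-Net, and a real deployment would use xnor/bitcount kernels instead of a floating-point convolution):

```python
import torch
import torch.nn.functional as F

def binarize_conv2d(x, weight, stride=1, padding=1):
    """Baseline binary convolution sketch: sign() on activations and weights,
    with a per-output-channel scaling factor from the weights' l1-norm."""
    bx = torch.sign(x)                                   # binary activations
    bw = torch.sign(weight)                              # binary weights
    # per-channel scaling factor alpha = mean(|w|) over each output filter
    alpha = weight.abs().mean(dim=(1, 2, 3), keepdim=True)
    return F.conv2d(bx, alpha * bw, stride=stride, padding=padding)

x = torch.randn(1, 16, 8, 8)
w = torch.randn(32, 16, 3, 3)
y = binarize_conv2d(x, w)        # same shape as the full-precision conv output
```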

However, the existence of an angular bias between $\mathbf{w}^i$ and $\alpha\,\text{sign}(\mathbf{w}^i)$ could lead to a large quantization error, as analyzed in Sec. 1. Besides, it results in consistent signs between $\mathbf{w}^i$ and $\mathbf{b}_{\mathbf{w}}^i$, which lessens the information gain (see footnote 1). We aim to minimize this angular bias to reduce the quantization error, while increasing the probability of weight flips to increase the information gain.

3.2 Rotated Binary Neural Networks

As shown in Fig. 3, we consider applying a rotation matrix $\mathbf{R}\in\mathbb{R}^{n\times n}$ to the full-precision weight vector $\mathbf{w}\in\mathbb{R}^n$ (the flattened weights of a layer) at the beginning of each training epoch, such that the angle between the rotated weight vector $\mathbf{R}\mathbf{w}$ and its binary vector $\mathbf{b}$ is minimized. To this end, we derive the following formulation:

$\max_{\mathbf{R},\,\mathbf{b}}\; \dfrac{(\mathbf{R}\mathbf{w})^T\mathbf{b}}{\|\mathbf{R}\mathbf{w}\|_2\,\|\mathbf{b}\|_2}, \quad \text{s.t.}\;\; \mathbf{R}^T\mathbf{R} = \mathbf{I}_n,\;\; \mathbf{b}\in\{-1,+1\}^n,$   (3)

where $\mathbf{I}_n$ is the $n$-th order identity matrix. It is easy to know that $\|\mathbf{R}\mathbf{w}\|_2 = \|\mathbf{w}\|_2$ and $\|\mathbf{b}\|_2 = \sqrt{n}$, both of which are constant². Thus, Eq. (3) can be further simplified as

$\max_{\mathbf{R},\,\mathbf{b}}\; \text{tr}\big(\mathbf{R}\,\mathbf{w}\mathbf{b}^T\big), \quad \text{s.t.}\;\; \mathbf{R}^T\mathbf{R} = \mathbf{I}_n,\;\; \mathbf{b}\in\{-1,+1\}^n,$   (4)

where $\text{tr}(\cdot)$ returns the trace of the input matrix and $\text{tr}(\mathbf{R}\,\mathbf{w}\mathbf{b}^T) = \mathbf{b}^T\mathbf{R}\mathbf{w}$. However, Eq. (4) involves a large $n \times n$ rotation matrix, where $n$ can be up to millions in a neural network. Direct optimization of $\mathbf{R}$ would consume massive memory and computation. Besides, performing such a large rotation leads to $\mathcal{O}(n^2)$ complexity in both space and time.

To deal with this, inspired by the properties of the Kronecker product Laub (2005), we introduce a bi-rotation scenario where two smaller rotation matrices $\mathbf{R}_1 \in \mathbb{R}^{n_1\times n_1}$ and $\mathbf{R}_2\in\mathbb{R}^{n_2\times n_2}$ are used to reconstruct the large rotation matrix $\mathbf{R}$ with $n = n_1 \times n_2$. One of the basic properties of the Kronecker product Laub (2005) is that if two matrices $\mathbf{A}$ and $\mathbf{B}$ are orthogonal, then $\mathbf{A}\otimes\mathbf{B}$ is orthogonal as well, where $\otimes$ denotes the Kronecker product. Another basic property of the Kronecker product is:

$\big(\mathbf{R}_2^T \otimes \mathbf{R}_1\big)\,\text{Vec}(\mathbf{W}) = \text{Vec}\big(\mathbf{R}_1\mathbf{W}\mathbf{R}_2\big),$   (5)

where $\mathbf{W}\in\mathbb{R}^{n_1\times n_2}$ is the weight vector reshaped into a matrix, Vec($\cdot$) vectorizes its input and Vec($\mathbf{W}$) = $\mathbf{w}$. Thus, applying the bi-rotation $\mathbf{R}_1\mathbf{W}\mathbf{R}_2$ to $\mathbf{W}$ is equivalent to applying the $n\times n$ rotation $\mathbf{R}_2^T\otimes\mathbf{R}_1$ to $\mathbf{w}$. Learning the two smaller matrices $\mathbf{R}_1$ and $\mathbf{R}_2$ can thus well reconstruct the large rotation matrix $\mathbf{R}$. Moreover, performing the bi-rotation consumes only $\mathcal{O}(n_1^2 + n_2^2)$ space complexity and $\mathcal{O}(n_1^2 n_2 + n_1 n_2^2)$ time complexity, respectively, leading to a significant complexity reduction compared to the large rotation³. Accordingly, Eq. (4) can be reformulated as

$\max_{\mathbf{R}_1,\,\mathbf{R}_2,\,\mathbf{B}}\; \text{tr}\big(\mathbf{B}^T\mathbf{R}_1\mathbf{W}\mathbf{R}_2\big), \quad \text{s.t.}\;\; \mathbf{R}_1^T\mathbf{R}_1 = \mathbf{I}_{n_1},\;\; \mathbf{R}_2^T\mathbf{R}_2 = \mathbf{I}_{n_2},$   (6)

where $\mathbf{B}\in\{-1,+1\}^{n_1\times n_2}$ with Vec($\mathbf{B}$) = $\mathbf{b}$. Finally, we rewrite our optimization objective below:

$\mathbf{R}_1^*,\,\mathbf{R}_2^*,\,\mathbf{B}^* = \arg\max_{\mathbf{R}_1,\,\mathbf{R}_2,\,\mathbf{B}}\; \text{tr}\big(\mathbf{B}^T\mathbf{R}_1\mathbf{W}\mathbf{R}_2\big), \quad \text{s.t.}\;\; \mathbf{R}_1^T\mathbf{R}_1 = \mathbf{I}_{n_1},\;\; \mathbf{R}_2^T\mathbf{R}_2 = \mathbf{I}_{n_2},\;\; \mathbf{B}\in\{-1,+1\}^{n_1\times n_2}.$   (7)
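Both Kronecker-product facts used above are easy to verify numerically. The NumPy sketch below is our illustration (not from the paper); it uses column-major vectorization, the convention under which Eq. (5) holds.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 4, 6

# random orthogonal (rotation-like) matrices via QR decomposition
R1, _ = np.linalg.qr(rng.normal(size=(n1, n1)))
R2, _ = np.linalg.qr(rng.normal(size=(n2, n2)))
W = rng.normal(size=(n1, n2))                  # reshaped weight matrix

# 1) the Kronecker product of two orthogonal matrices is orthogonal
K = np.kron(R2.T, R1)                          # the implicit n x n rotation, n = n1 * n2
assert np.allclose(K.T @ K, np.eye(n1 * n2))

# 2) the bi-rotation on W equals one big rotation on the vectorized weights
vec = lambda M: M.flatten(order="F")           # column-major vectorization
assert np.allclose(K @ vec(W), vec(R1 @ W @ R2))
```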

3.3 Alternating Optimization

Eq. (7) is non-convex w.r.t. $\mathbf{R}_1$, $\mathbf{R}_2$ and $\mathbf{B}$. To find a feasible solution, we adopt an alternating optimization approach, i.e., updating one variable with the other two fixed until convergence.

1) $\mathbf{B}$-step: Fix $\mathbf{R}_1$ and $\mathbf{R}_2$, then learn the binarization $\mathbf{B}$. The sub-problem of Eq. (7) becomes:

$\max_{\mathbf{B}\in\{-1,+1\}^{n_1\times n_2}}\; \text{tr}\big(\mathbf{B}^T\mathbf{R}_1\mathbf{W}\mathbf{R}_2\big),$   (8)

which can be achieved by $\mathbf{B} = \text{sign}(\mathbf{R}_1\mathbf{W}\mathbf{R}_2)$.

2) $\mathbf{R}_1$-step: Fix $\mathbf{B}$ and $\mathbf{R}_2$, then update $\mathbf{R}_1$. The corresponding sub-problem is:

$\max_{\mathbf{R}_1}\; \text{tr}\big(\mathbf{R}_1\mathbf{G}_1\big), \quad \text{s.t.}\;\; \mathbf{R}_1^T\mathbf{R}_1 = \mathbf{I}_{n_1},$   (9)

where $\mathbf{G}_1 = \mathbf{W}\mathbf{R}_2\mathbf{B}^T$. The above maximum can be achieved by using the polar decomposition Lu and Chipman (1996): $\mathbf{R}_1 = \mathbf{V}_1\mathbf{U}_1^T$, where $\mathbf{U}_1\mathbf{S}_1\mathbf{V}_1^T$ is the SVD of $\mathbf{G}_1$.

3) $\mathbf{R}_2$-step: Fix $\mathbf{B}$ and $\mathbf{R}_1$, then update $\mathbf{R}_2$. The corresponding sub-problem becomes

$\max_{\mathbf{R}_2}\; \text{tr}\big(\mathbf{R}_2\mathbf{G}_2\big), \quad \text{s.t.}\;\; \mathbf{R}_2^T\mathbf{R}_2 = \mathbf{I}_{n_2},$   (10)

where $\mathbf{G}_2 = \mathbf{B}^T\mathbf{R}_1\mathbf{W}$. Similar to the updating rule for $\mathbf{R}_1$, the updating rule for $\mathbf{R}_2$ is $\mathbf{R}_2 = \mathbf{V}_2\mathbf{U}_2^T$, where $\mathbf{U}_2\mathbf{S}_2\mathbf{V}_2^T$ is the SVD of $\mathbf{G}_2$.

In the experiments, we iteratively update $\mathbf{B}$, $\mathbf{R}_1$ and $\mathbf{R}_2$, which reaches convergence after three cycles of updating. Therefore, the weight rotation can be efficiently implemented. A compact sketch of the procedure is given below.
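The following NumPy sketch (our code, using the notation reconstructed above; the released implementation may differ in details) assembles the three steps into one loop and reports the cosine similarity between the rotated weights and their binarization.

```python
import numpy as np

def learn_rotation(W, n_iters=3, seed=0):
    """Alternately update B, R1, R2 to maximize tr(B^T R1 W R2)."""
    rng = np.random.default_rng(seed)
    n1, n2 = W.shape
    R1, _ = np.linalg.qr(rng.normal(size=(n1, n1)))   # initialize as random rotations
    R2, _ = np.linalg.qr(rng.normal(size=(n2, n2)))
    for _ in range(n_iters):
        B = np.sign(R1 @ W @ R2)                      # B-step: closed form
        U, _, Vt = np.linalg.svd(W @ R2 @ B.T)        # R1-step: polar decomposition
        R1 = Vt.T @ U.T
        U, _, Vt = np.linalg.svd(B.T @ R1 @ W)        # R2-step: polar decomposition
        R2 = Vt.T @ U.T
    return R1, R2, np.sign(R1 @ W @ R2)

W = np.random.default_rng(1).normal(size=(8, 16))     # reshaped weight matrix
R1, R2, B = learn_rotation(W)
cos = np.sum((R1 @ W @ R2) * B) / (np.linalg.norm(W) * np.sqrt(W.size))
print(f"cosine similarity after rotation: {cos:.3f}")  # close to 1 => small angular bias
```

Each step never decreases the objective, which is bounded above by $\sqrt{n}\,\|\mathbf{W}\|_F$, so the procedure converges quickly, consistent with the three update cycles reported above.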

3.4 Adjustable Rotated Weight Vector

Figure 4: Illustration of the overshooting (a) and undershooting (b) cases, and the evolution of the adjustment parameter during training (c). Best viewed with zooming in.

We narrow the angular bias between the full-precision weights and their binarization using our bi-rotation at the beginning of each training epoch. Then, we can take the rotated weight vector $\mathbf{w}_r$ (i.e., the vectorized $\mathbf{R}_1\mathbf{W}\mathbf{R}_2$), which is fed to the sign function and followed by the standard gradient update of the network. However, the alternating optimization may get trapped in a local optimum that either overshoots (Fig. 4(a)) or undershoots (Fig. 4(b)) the binarization $\mathbf{B}$. To deal with this, we further propose to self-adjust the rotated weight vector as below:

$\mathbf{w}' = \mathbf{w}_r + \lambda\,(\mathbf{w}_r - \mathbf{w}),$   (11)

where $\mathbf{w}_r$ denotes the rotated weight vector and $\lambda$ is a learnable parameter.

As can be seen from Fig. 4, Eq. (11) constrains the final weight vector $\mathbf{w}'$ to move along the residual direction $\mathbf{w}_r - \mathbf{w}$ with step $\lambda$. It is intuitive that when overshooting, $\lambda < 0$; when undershooting, $\lambda > 0$. We empirically observe that overshooting is in a dominant position. Thus, we simply constrain $\lambda \in [-1, 0]$ to shrink the feasible region of $\mathbf{w}'$, which we find further reduces the quantization error and boosts the performance, as demonstrated in Table 4. The final value of $\lambda$ varies across different layers. In Fig. 4(c), we show a toy example of how $\lambda$ updates during training in ResNet-20 (layer2.2.conv2).
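Taking Eq. (11) as reconstructed above, the adjustment reduces to a single tensor operation. The PyTorch sketch below is our reading of that step (the variable names, the stand-in rotation and the clamping range come from the surrounding description, not from the released code):

```python
import torch

def adjust_rotated_weight(w, w_rot, lam):
    """Move the rotated weights along the residual direction (w_rot - w);
    lam is clamped to [-1, 0] to correct the dominant overshooting case."""
    lam = lam.clamp(-1.0, 0.0)                   # shrink the feasible region of w'
    return w_rot + lam * (w_rot - w)

w = torch.randn(256)                             # flattened full-precision weights
R = torch.linalg.qr(torch.randn(256, 256)).Q     # stand-in rotation matrix
w_rot = R @ w                                    # rotated weights from the bi-rotation
lam = torch.zeros(1, requires_grad=True)         # learnable, updated by backprop
w_adj = adjust_rotated_weight(w, w_rot, lam)     # fed to the sign function next
```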

At the beginning of each training epoch, with $\mathbf{w}$ fixed, we learn the rotation matrix $\mathbf{R}$ ($\mathbf{R}_1$ and $\mathbf{R}_2$, actually). In the training stage, with $\mathbf{R}$ fixed, we feed the sign function using Eq. (11) in the forward pass, and update $\mathbf{w}$ and $\lambda$ in the backward pass.

3.5 Gradient Approximation

The derivative of the sign function is zero almost everywhere, which makes training unstable and degrades accuracy. To solve this, various gradient approximations have been proposed in the literature to enable gradient updating, e.g., straight-through estimation Bengio et al. (2013), the piecewise polynomial function Liu et al. (2018), the annealing hyperbolic tangent function Ajanthan et al. (2019), the error decay estimator Qin et al. (2020b), and so on. Instead of simply using these existing approximations, in this paper we devise the following training-aware approximation function to replace the sign function:

(12)

with

where, in our implementation, the controlling coefficient is computed from the total number of training epochs and the current epoch. As can be seen, the shape of the approximation depends on this coefficient, which indicates the training progress. Then, the gradient of the approximation w.r.t. its input can be obtained by

(13)
Figure 5: Visualization of our approximation function and its derivative w.r.t. different values of the training-progress coefficient.

In Fig. 5, we visualize Eq. (12) and Eq. (13) for varying values of this coefficient. In the early training stage, the gradient exists almost everywhere, which overcomes the drawback of the sign function and enables the updating of the whole network. As training proceeds, our approximation gradually becomes a sign-like function, which ensures the binary property. Thus, our approximation is training-aware. So far, denoting the approximation in Eq. (12) by $F(\cdot)$, we can obtain the gradient of the loss function $\mathcal{L}$ w.r.t. the activation $\mathbf{a}$ and the weight $\mathbf{w}$:

$\dfrac{\partial \mathcal{L}}{\partial \mathbf{a}} = \dfrac{\partial \mathcal{L}}{\partial F(\mathbf{a})}\,\dfrac{\partial F(\mathbf{a})}{\partial \mathbf{a}}, \qquad \dfrac{\partial \mathcal{L}}{\partial \mathbf{w}} = \dfrac{\partial \mathcal{L}}{\partial F(\mathbf{w}')}\,\dfrac{\partial F(\mathbf{w}')}{\partial \mathbf{w}'}\,\dfrac{\partial \mathbf{w}'}{\partial \mathbf{w}},$   (14)

where

$\dfrac{\partial \mathbf{w}'}{\partial \mathbf{w}} = (1+\lambda)\,\big(\mathbf{R}_2^T \otimes \mathbf{R}_1\big) - \lambda\,\mathbf{I}_n.$   (15)

Besides, the gradient w.r.t. $\lambda$ in Eq. (11) can be obtained by

$\dfrac{\partial \mathcal{L}}{\partial \lambda} = \sum_i \dfrac{\partial \mathcal{L}}{\partial (\mathbf{w}')_i}\,\big(\mathbf{w}_r - \mathbf{w}\big)_i,$   (16)

where $(\cdot)_i$ denotes the $i$-th element of its input vector. Note that the error decay estimator in Qin et al. (2020b) can also be regarded as a training-aware approximation. However, our design in Eq. (12) is fundamentally different from that in Qin et al. (2020b), and its superiority is validated in Sec. 4.2.
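Since Eq. (12) is not reproduced here, the sketch below only illustrates the mechanism of such a training-aware surrogate: the forward pass stays binary, while the backward pass uses a smooth derivative whose sharpness grows with the training progress. The tanh-shaped surrogate, the schedule and all names are our stand-ins, not the paper's exact function.

```python
import torch

class TrainingAwareSign(torch.autograd.Function):
    """Binary forward pass, smooth progress-dependent backward pass (illustrative)."""

    @staticmethod
    def forward(ctx, x, t):
        ctx.save_for_backward(x, t)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        x, t = ctx.saved_tensors
        # surrogate gradient: derivative of tanh(t * x); sharper as t grows
        surrogate = t * (1.0 - torch.tanh(t * x) ** 2)
        return grad_out * surrogate, None

def sharpness(epoch, total_epochs, t_min=0.1, t_max=10.0):
    # grow the sharpness with training progress (illustrative schedule)
    progress = epoch / max(total_epochs - 1, 1)
    return torch.tensor(t_min * (t_max / t_min) ** progress)

x = torch.randn(8, requires_grad=True)
y = TrainingAwareSign.apply(x, sharpness(epoch=0, total_epochs=100))
y.sum().backward()      # early epochs: gradients exist almost everywhere
```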

4 Experiments

In this section, we evaluate our RBNN on CIFAR-10 Krizhevsky and Hinton (2009) using ResNet-18/20 He et al. (2016) and VGG-small Zhang et al. (2018), and on ImageNet Deng et al. (2009) using ResNet-18/34 He et al. (2016). Following the compared methods, all convolutional and fully-connected layers except the first and last ones are binarized. We implement RBNN in PyTorch and adopt SGD as the optimizer. Also, for a fair comparison, we only apply the classification loss during training.

4.1 Experimental Results

CIFAR-10

On CIFAR-10, we compare our RBNN with several SOTAs. For ResNet-18, we compare with RAD Ding et al. (2019a) and IR-Net Qin et al. (2020b). For ResNet-20, we compare with DoReFa Zhou et al. (2016), DSQ Gong et al. (2019), and IR-Net Qin et al. (2020b). For VGG-small, we compare with LAB Hou et al. (2016), XNOR-Net Rastegari et al. (2016), BNN Hubara et al. (2016), RAD Ding et al. (2019a), and IR-Net Qin et al. (2020b). We list the experimental results in Table 1. As can be seen, RBNN consistently outperforms the SOTAs. Compared with the best baseline Qin et al. (2020b), RBNN achieves 0.7%, 1.1% and 0.9% accuracy improvements with ResNet-18, ResNet-20 with the normal structure He et al. (2016), and VGG-small, respectively. Furthermore, binarizing the network with the Bi-Real structure Liu et al. (2018) achieves better accuracy than the normal structure, as shown by ResNet-20: with the Bi-Real structure, IR-Net obtains a 1.1% accuracy improvement while RBNN gains 1.3%. Other variants of network structures proposed in Bethge et al. (2020); Zhu et al. (2019); Gu et al. (2019a) and training losses in Hou et al. (2016); Ding et al. (2019a); Wang et al. (2019); Gu et al. (2019b) could be combined to further improve the final accuracy. Nevertheless, under the same structure, our RBNN performs the best (87.8% of RBNN vs. 86.5% of IR-Net for ResNet-20 with the Bi-Real structure). Hence, the superiority of the angle alignment is evident.

Table 1: Performance comparison with SOTAs on CIFAR-10. W/A denotes the bit length of weights and activations. * denotes the network with the Bi-Real structure Liu et al. (2018).
Network Method W/A Acc
ResNet-18 FP 32/32 93.0%
RAD Ding et al. (2019a) 1/1 90.5%
IR-Net Qin et al. (2020b) 1/1 91.5%
RBNN(Ours) 1/1 92.2%
ResNet-20 FP 32/32 91.7%
DoReFa Zhou et al. (2016) 1/1 79.3%
DSQ Gong et al. (2019) 1/1 84.1%
IR-Net Qin et al. (2020b) 1/1 85.4%
RBNN(Ours) 1/1 86.5%
IR-Net* Qin et al. (2020b) 1/1 86.5%
RBNN*(Ours) 1/1 87.8%
VGG-small FP 32/32 91.7%
LAB Hou et al. (2016) 1/1 87.7%
XNOR-Net Rastegari et al. (2016) 1/1 89.8%
BNN Hubara et al. (2016) 1/1 89.9%
RAD Ding et al. (2019a) 1/1 90.0%
IR-Net Qin et al. (2020b) 1/1 90.4%
RBNN(Ours) 1/1 91.3%
Table 2: Performance comparison with SOTAs on ImageNet. W/A denotes the bit length of weights and activations. We report the top-1 and top-5 accuracy performances.
Network Method W/A Top-1 Top-5
ResNet-18 FP 32/32 69.6% 89.2%
ABC-Net Lin et al. (2017) 1/1 42.7% 67.6%
XNOR-Net Rastegari et al. (2016) 1/1 51.2% 73.2%
BNN+ Darabi et al. (2018) 1/1 53.0% 72.6%
DoReFa Zhou et al. (2016) 1/2 53.4% -
Bi-Real Liu et al. (2018) 1/1 56.4% 79.5%
XNOR++ Bulat and Tzimiropoulos (2019) 1/1 57.1% 79.9%
IR-Net Qin et al. (2020b) 1/1 58.1% 80.0%
RBNN(Ours) 1/1 59.9% 81.9%
ResNet-34 FP 32/32 73.3% 91.3%
ABC-Net Lin et al. (2017) 1/1 52.4% 76.5%
Bi-Real Liu et al. (2018) 1/1 62.2% 83.9%
IR-Net Qin et al. (2020b) 1/1 62.9% 84.1%
RBNN(Ours) 1/1 63.1% 84.4%

ImageNet

We further show the experimental results on ImageNet in Table 2. For ResNet-18, we compare RBNN with ABC-Net Lin et al. (2017), XNOR-Net Rastegari et al. (2016), BNN+ Darabi et al. (2018), DoReFa Zhou et al. (2016), Bi-Real Liu et al. (2018), XNOR++ Bulat and Tzimiropoulos (2019), and IR-Net Qin et al. (2020b). For ResNet-34, ABC-Net Lin et al. (2017), Bi-Real Liu et al. (2018), and IR-Net Qin et al. (2020b) are compared. As shown in Table 2, RBNN beats all the compared binary models in both top-1 and top-5 accuracy. More specifically, with ResNet-18, RBNN achieves 59.9% top-1 and 81.9% top-5 accuracy, with 1.8% and 1.9% improvements over IR-Net, respectively. With ResNet-34, it achieves a top-1 accuracy of 63.1% and a top-5 accuracy of 84.4%, with 0.2% and 0.3% improvements over IR-Net, respectively.

4.2 Performance Study

In this section, we first show the benefit of our training-aware approximation over other recent advances Bengio et al. (2013); Liu et al. (2018); Qin et al. (2020b). We then show the effect of the different components of RBNN. All experiments are conducted with ResNet-20 with the Bi-Real structure on CIFAR-10.

Table 3 compares the performance of RBNN under various gradient approximations, including straight-through estimation (denoted by STE) Bengio et al. (2013), the piecewise polynomial function (denoted by PPF) Liu et al. (2018) and the error decay estimator (denoted by EDE) Qin et al. (2020b). As can be seen, STE yields the lowest accuracy. Though EDE also studies an approximation with dynamic changes, its performance is even worse than that of PPF, which is a fixed approximation. In contrast, our training-aware approximation achieves a 1.8% improvement over EDE, which validates the effectiveness of our approximation.

To further understand the effect of each component in our RBNN, we conduct an ablation study by starting with the binarization using XNOR-Net Rastegari et al. (2016) (denoted by B), and then gradually adding the training-aware approximation (denoted by T), the weight rotation (denoted by R) and the adjustable scheme (denoted by A). As shown in Table 4, the binarization using XNOR-Net suffers a large performance degradation of 8.0% compared with the full-precision model. By adding our weight rotation or training-aware approximation, the accuracy increases to 86.4% or 86.6%, respectively. The collective effort of weight rotation and training-aware approximation further raises it to 87.1%. Lastly, by adding the adjustable weight vector in the training process, our RBNN achieves the highest accuracy of 87.8%. Therefore, each part of RBNN plays its unique role in improving the performance.

Figure 6: Weight histograms (before binarization) of XNOR-Net and RBNN in ResNet-20.
Table 3: Gradient Approximation Analysis.
Method W/A Acc
FP 32/32 91.7%
STE Bengio et al. (2013) 1/1 84.9%
PPF Liu et al. (2018) 1/1 86.9%
EDE Qin et al. (2020b) 1/1 86.0%
Ours 1/1 87.8%
Table 4: Ablation Study of RBNN. B, T, R and A respectively denote binarization using XNOR-Net, training-aware approximation, weight rotation and adjustable scheme.
Method W/A Acc
FP 32/32 91.7%
B 1/1 83.7%
B + R 1/1 86.4%
B + T 1/1 86.6%
B + T + R 1/1 87.1%
B + T + R + A (RBNN) 1/1 87.8%
Figure 7: Weight flip rates of our RBNN and XNOR-Net in different layers of ResNet-20.

4.3 Weight Distribution and Flips

Fig. 6 shows the histograms of weights (before binarization) for XNOR-Net and our RBNN. It can be seen that the weight values of XNOR-Net are tightly clustered around zero and their magnitudes remain far less than 1, which causes a large quantization error when they are pushed to the binary values of $-1$ and $+1$. On the contrary, our RBNN results in a bimodal distribution with one mode on each side of zero. Besides, few weights remain around zero, which creates a clear boundary between the two modes. Thus, through the weight rotation, our RBNN effectively reduces the quantization error, as explained in Fig. 2.

As discussed in Sec. 2, the capacity of BNN learning is up to $2^n$, where $n$ is the total number of weight elements. When the probability of each element being $-1$ or $+1$ is equal during training, it reaches the maximum of $2^n$. We compare the initialization weights with the binary weights, and then show the weight flip rates of our RBNN and XNOR-Net across the layers of ResNet-20 in Fig. 7. As can be observed, XNOR-Net leads to a small flip rate, i.e., most positive weights are directly quantized to $+1$, and vice versa. In contrast, RBNN leads to around 50% weight flips in each layer due to the introduced weight rotation, as illustrated in Fig. 3, which maximizes the information gain during the training stage. A minimal sketch of this measurement follows.
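The flip rate in Fig. 7 can be measured by comparing the signs of the weights at initialization with the final binary weights. The sketch below is our illustration (the stand-in tensors are random, not trained weights):

```python
import torch

def flip_rate(w_init, w_binary):
    """Fraction of weights whose binarized sign differs from the sign at
    initialization (zeros treated as positive for simplicity)."""
    s0 = torch.where(w_init >= 0, torch.ones_like(w_init), -torch.ones_like(w_init))
    return (s0 != w_binary).float().mean().item()

w_init = torch.randn(1000)                 # weights at initialization
w_bin = torch.sign(torch.randn(1000))      # stand-in for the learned binary weights
print(f"flip rate: {flip_rate(w_init, w_bin):.2%}")
```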

5 Conclusion

In this paper, we analyzed the influence of the angular bias on the quantization error in binary neural networks and proposed a Rotated Binary Neural Network (RBNN) that aligns the angle between the rotated weight vector and the binary vector at the beginning of each training epoch. We have also introduced a bi-rotation scheme involving two smaller rotation matrices to reduce the complexity of learning a large rotation matrix. In the training stage, our method dynamically adjusts the rotated weight vector via backward gradient updating to overcome the potential sub-optimality in the bi-rotation optimization. Our rotation maximizes the information gain of BNN learning by achieving around 50% weight flips. To enable gradient propagation, we have devised a training-aware approximation of the sign function. Extensive experiments have demonstrated the efficacy of our RBNN in reducing the quantization error and its superiority over several SOTAs.

Acknowledgement

This work is supported by the National Natural Science Foundation of China (No.U1705262, No.61772443, No.61572410, No.61802324 and No.61702136), National Key R&D Program (No.2017YFC0113000, and No.2016YFB1001503), Key R&D Program of Jiangxi Province (No. 20171ACH80022), Natural Science Foundation of Guangdong Province in China (No.2019B1515120049) and National Key R&D Plan Project (No.2018YFC0830105 and No.2018YFC0830100).

Broader Impact

Benefit: The binary neural network community may benefit from our research. The proposed Rotated Binary Neural Network (RBNN) provides a novel perspective to lessen the quantization error by reducing the angular bias, which was ignored by previous works. With the code publicly available, our work will also help researchers quantize DNNs so that the deep models can be deployed on devices with limited resources such as mobile phones.

Disadvantage: The angular bias between the activation and its binarization remains an open problem. It may not be appropriate to apply our rotation to the activation vector, since it would add computation at inference time.

Consequence: The failure of network quantization will not bring serious consequences, as our RBNN causes a smaller accuracy drop than other SOTAs.

Data Biases: The proposed RBNN is independent of data selection, so it does not suffer from the data bias problem.

Footnotes

  1. The binarization in Eq. (1) is obtained by $\mathbf{b} = \text{sign}(\mathbf{w})$, which does not change the coordinate quadrant as shown in Fig. 2. Thus, only a small number of weight flips occur in the training stage. See Sec. 4.3 for our experimental validation.
  2. To stress, the rotation is applied at the beginning of each training epoch rather than during the training stage. Thus, $\|\mathbf{w}\|_2$ should be regarded as a constant.
  3. With $n = n_1 \times n_2$ fixed, our RBNN achieves the least complexity when $n_1 = n_2 = \sqrt{n}$, which is also our experimental setting.

References

  1. Mirror descent view for neural network quantization. arXiv preprint arXiv:1910.08237.
  2. Scalable methods for 8-bit training of neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pp. 5145–5153.
  3. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
  4. MeliusNet: can binary neural networks achieve mobilenet-level accuracy? arXiv preprint arXiv:2001.05936.
  5. XNOR-Net++: improved binary neural networks. In Proceedings of the British Machine Vision Conference (BMVC).
  6. Deep learning with low precision by half-wave gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5918–5926.
  7. AdderNet: do we really need multiplications in deep learning? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1468–1477.
  8. BinaryConnect: training deep neural networks with binary weights during propagations. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pp. 3123–3131.
  9. Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
  10. BNN+: improved binary network training. arXiv preprint arXiv:1812.11800.
  11. ImageNet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255.
  12. Exploiting linear structure within convolutional networks for efficient evaluation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pp. 1269–1277.
  13. Regularizing activation distribution for training binarized deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11408–11417.
  14. Global sparse momentum SGD for pruning very deep neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pp. 6379–6391.
  15. Differentiable soft quantization: bridging full-precision and low-bit neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4852–4861.
  16. Projection convolutional neural networks for 1-bit CNNs via discrete back propagation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vol. 33, pp. 8344–8351.
  17. Bayesian optimized 1-bit CNNs. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4909–4917.
  18. Exploring unexplored tensor network decompositions for convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pp. 5553–5563.
  19. Mask R-CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2961–2969.
  20. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  21. Latent weights do not exist: rethinking binarized neural network optimization. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pp. 7531–7542.
  22. Loss-aware binarization of deep networks. In Proceedings of the International Conference on Learning Representations (ICLR).
  23. Binarized neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pp. 4107–4115.
  24. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.
  25. Learning multiple layers of features from tiny images. Technical report, University of Toronto.
  26. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pp. 1097–1105.
  27. Matrix analysis for scientists and engineers. Vol. 91, SIAM.
  28. Fixed point quantization of deep convolutional networks. In Proceedings of the International Conference on Machine Learning (ICML), pp. 2849–2858.
  29. HRank: filter pruning using high-rank feature map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1529–1538.
  30. Channel pruning via automatic structure search. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 673–679.
  31. Towards accurate binary convolutional neural network. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), pp. 345–353.
  32. Bi-Real Net: enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 722–737.
  33. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440.
  34. Interpretation of Mueller matrices based on polar decomposition. Journal of the Optical Society of America (JOSA A) 13 (5), pp. 1106–1113.
  35. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1520–1528.
  36. Binary neural networks: a survey. Pattern Recognition (PR), pp. 107281.
  37. Forward and backward information retention for accurate binary neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2250–2259.
  38. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 525–542.
  39. You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788.
  40. A review of binarized neural networks. Electronics 8 (6), pp. 661.
  41. Learning channel-wise interactions for binary convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 568–577.
  42. Shift: a zero FLOP, zero parameter alternative to spatial convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9127–9135.
  43. On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7370–7379.
  44. LQ-Nets: learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 365–382.
  45. DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.
  46. Binary ensemble neural network: more bits per network or more networks per bit? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4923–4932.