# Balanced Binary Neural Networks with Gated Residual

## Abstract

Binary neural networks have attracted considerable attention in recent years. However, how to preserve network accuracy remains a critical issue, mainly due to the information loss stemming from biased binarization. In this paper, we attempt to maintain the information propagated in the forward process and propose Balanced Binary Neural Networks with Gated Residual (BBG for short). First, a balanced weight binarization is introduced so that the informative binary weights can capture more of the information contained in the activations. Second, for binary activations, a gated residual is appended to compensate for their information loss during the forward process, with only a slight overhead. Both techniques can be wrapped as a generic network module that supports various network architectures for different tasks, including classification and detection. The experimental results show that BBG-Net performs remarkably well across various network architectures such as VGG, ResNet and SSD, with superior performance over state-of-the-art methods.

Mingzhu Shen^1, Xianglong Liu^1†, Kai Han^2

^1 State Key Laboratory of Software Development Environment, Beihang University, China
^2 State Key Lab of Computer Science, Institute of Software, CAS; UCAS

shenmingzhu@buaa.edu.cn

{xlliu, gongruihao}@nlsde.buaa.edu.cn

hankai@ios.ac.cn
**Keywords:** model compression, binary neural networks, energy-efficient models

## 1 Introduction

Deep neural networks (DNNs), especially deep convolutional neural networks (CNNs), have been well demonstrated in a wide variety of computer vision applications. However, most of these advanced deep CNN models require expensive storage and computing resources, and cannot be easily deployed on portable devices such as mobile phones and cameras.

In recent years, a number of approaches have been proposed to learn portable deep neural networks, including multi-bit quantization [1], pruning [14], lightweight architecture design [18], and knowledge distillation [15]. Among them, quantization-based methods [1, 16] represent the weights and activations in the network with very low precision, and thus yield much more compact DNN models than their floating-point counterparts. In the extreme case where both the weights and activations are quantized into one-bit values, the conventional convolution operations can be efficiently achieved via bitwise operations [9]. The resulting decrease in storage and acceleration of inference are therefore appealing to the community. Since the introduction of binary neural networks, many works have attempted to address the performance drop and improve the expressive ability of binarized networks. Bireal-Net [8] proposes an additional shortcut and a different approximation of the sign function in the backward pass. PCNN [3] employs a projection matrix to help with network training. CircConv [6] rotates the binary weight three times and computes the feature map with the four resulting binary weights, merging them together. BENN [20] ensembles multiple standard binary networks to improve performance. AutoBNN [10] employs a genetic algorithm to search for binary neural network architectures. Most of these methods have complicated training pipelines and even increase the FLOPs to achieve better results.
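To make the bitwise-convolution claim concrete: when weights and activations are both one-bit, each inner product reduces to an XNOR followed by a popcount. Below is a minimal sketch of this idea on bit-packed {+1, -1} vectors; the packing convention and function name are illustrative, not from the paper.

```python
def binary_dot(x_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two n-element {+1, -1} vectors packed into bitmasks
    (bit i set means element i is +1). Matching bits contribute +1 and
    differing bits -1, so dot = 2 * popcount(XNOR(x, w)) - n."""
    mask = (1 << n) - 1
    matches = bin(~(x_bits ^ w_bits) & mask).count("1")
    return 2 * matches - n

# x = [+1, +1, -1, +1] -> 0b1011, w = [+1, -1, -1, +1] -> 0b1001
print(binary_dot(0b1011, 0b1001, 4))  # -> 2, same as the float dot product
```

On real hardware this maps to wide XNOR and popcount instructions, which is the source of the storage and inference savings cited above.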

Although much progress has been made on binarizing DNNs, the existing quantization methods still cause a significantly large accuracy drop compared with full-precision models. This observation indicates that the information propagated through the full-precision network is largely lost when the binary representation is used. To address this problem and maximally preserve the information, in this paper we propose the Balanced Binary Network with Gated Residual (BBG for short), which learns balanced binary weights and reduces the binarization loss of activations through a gated residual path. Fig. 1 illustrates the whole structure of our BBG. We re-parameterize the weights and devise a linear transformation to replace the standard one, which can be easily implemented and learned. To compensate for the information loss when binarizing activations, the gated residual path employs a lightweight operation that uses the channel attention information of the floating-point activations to reconstruct the information flow across layers. We evaluate our BBG method on image classification and object detection benchmarks, and the experimental results show that it performs remarkably well across various network architectures, outperforming the state-of-the-art.

## 2 The Proposed Method

### 2.1 Maximizing Entropy with Balanced Binary Weights

In binary neural networks, the most important training parameters are the non-differentiable, discrete binary weights. This discrete property brings troublesome problems to network training. To preserve the information of the binary weights, we propose to maximize their entropy during training. Directly optimizing this regularization is hard. Instead, we approximate its optimal solution by driving the expectation of the binary weights to zero, i.e., $\mathbb{E}[b_w] = 0$.

Hence, we first center the real-valued weights and then quantize them into binary codes. More specifically, we use a proxy parameter $w$ to obtain the centered weight $\hat{w}$ and then quantize it into binary codes $b_w$. In the convolutional layer, we express the centered weight in terms of the proxy parameter using:

$$\hat{w} = w - \frac{1}{n}\sum_{i=1}^{n} w_i \tag{1}$$

After this balanced weight normalization, we can perform binary quantization on the floating-point weight $\hat{w}$. The forward and backward passes for the binary weights are then as follows:

$$\text{Forward: } \quad b_w = \alpha \cdot \mathrm{sign}(\hat{w}), \qquad \alpha = \frac{1}{n}\|\hat{w}\|_1 \tag{2}$$

$$\text{Backward: } \quad \frac{\partial \ell}{\partial \hat{w}} = \frac{\partial \ell}{\partial b_w} \cdot \mathbf{1}_{|\hat{w}| \le 1}$$

where $\mathrm{sign}(\cdot)$ outputs $+1$ for positive numbers and $-1$ otherwise, and $\frac{1}{n}\|\hat{w}\|_1$ calculates the mean absolute value of $\hat{w}$.

In our balanced weight quantization, the parameter update is performed on the proxy parameter $w$, and thus the gradient signal can back-propagate through the normalization process. In the whole process, $w$ and $\hat{w}$ are floating-point; we update $w$ during training, and only the binary weight $b_w$ (with its scaling factor) is needed at inference.
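The balanced binarization above can be written in a few lines of NumPy. This is a minimal sketch under our own naming; the straight-through clipping threshold of 1 follows common practice (e.g. XNOR-Net-style estimators), and the gradient of the scaling factor $\alpha$ is ignored for brevity.

```python
import numpy as np

def balanced_binarize(w):
    """Forward pass of Eqs. (1)-(2): center the proxy weights, then
    binarize with a mean-absolute-value scaling factor."""
    w_hat = w - w.mean()                    # Eq. (1): zero-mean centering
    alpha = np.abs(w_hat).mean()            # scaling factor
    b = np.where(w_hat >= 0, 1.0, -1.0)     # sign(): +1 / -1
    return alpha * b, w_hat

def balanced_binarize_grad(grad_bw, w_hat, clip=1.0):
    """Backward pass: straight-through estimator, passing the gradient
    only where the proxy weight lies inside the clipping range."""
    return grad_bw * (np.abs(w_hat) <= clip)

w = np.array([0.2, 0.4, -0.1, 0.5])
bw, w_hat = balanced_binarize(w)
# After centering, the signs are balanced: two +1 codes and two -1 codes.
```

Note how centering pushes the binary codes toward an even split of $+1$ and $-1$, which is exactly the zero-expectation (maximum-entropy) condition motivated above.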

### 2.2 Reconstructing Information Flow with Gated Residual

In the binarization of activations, we first clip the activations into the range $[0, 1]$ and then use a round function to binarize them. The forward and backward passes for binary activations are as follows:

$$\text{Forward: } \quad b_a = \mathrm{round}\big(\mathrm{clip}(a, 0, 1)\big) \tag{3}$$

$$\text{Backward: } \quad \frac{\partial \ell}{\partial a} = \frac{\partial \ell}{\partial b_a} \cdot \mathbf{1}_{0 \le a \le 1}$$

where $\mathbf{1}_{0 \le a \le 1}$ equals 1 for elements of $a$ inside the range $[0, 1]$, and 0 otherwise.
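In the same sketch style as before (function names are ours), Eq. (3) and its straight-through backward pass can be expressed as:

```python
import numpy as np

def binarize_act(a):
    """Forward pass of Eq. (3): clip to [0, 1], then round to {0, 1}."""
    return np.round(np.clip(a, 0.0, 1.0))

def binarize_act_grad(grad_ba, a):
    """Backward pass: the gradient passes only where clip() did not saturate."""
    return grad_ba * ((a >= 0.0) & (a <= 1.0))

a = np.array([-0.3, 0.2, 0.6, 1.4])
print(binarize_act(a))                    # [0. 0. 1. 1.]
print(binarize_act_grad(np.ones(4), a))   # [0. 1. 1. 0.]
```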

Binarizing activations causes a much larger loss of precision than binarizing weights. All activation values across channels are quantized to 0 or 1, without considering the differences among the channels, and the quantization error of binary layers accumulates layer by layer. To address this problem, we further propose a new module, named gated residual, that reconstructs the information flow in a channel-wise manner during forward propagation. The gated residual employs the floating-point activations to reduce the quantization error and recalibrate the features.

Subsequently, we design a new layer with gated weights $g \in \mathbb{R}^{C}$ that learns the channel attention information of the floating-point input feature map $A \in \mathbb{R}^{C \times H \times W}$ of a binary convolution layer, where $C$, $H$ and $W$ denote channels, height and width respectively. The operation on the $c$-th channel of the input feature map is defined as follows:

$$G(A)_c = g_c \cdot A_c \tag{4}$$

Based on the gated residual, the output feature map is recalibrated, enhancing the representational power of the activations. The operation of the gated module can be written in the following form:

$$O = \mathcal{F}(A) + G(A) \tag{5}$$

where $\mathcal{F}(\cdot)$ denotes the operation of the main path, comprising activation binarization, balanced binary convolution and BatchNorm.

With the gated residual layer, the overall structure of the gated module is shown in Fig. 2(d). We initialize the gated weights $g$ with ones, which is the same as the identity shortcut of the vanilla ResNet. During training, the gated residual learns to distinguish the channels and suppress the less useful ones. Besides reconstructing the information flow in the forward pass, this path also acts as an auxiliary gradient path for the activations in backward propagation. Usually, the STE used in the backward pass to approximate discrete binary functions causes a severe gradient mismatch problem. Fortunately, with learnable gated weights, our gated residual can also reconstruct the gradient in the following way:

$$\frac{\partial \ell}{\partial A} = \frac{\partial \ell}{\partial O} \left( \frac{\partial \mathcal{F}(A)}{\partial A} + \frac{\partial G(A)}{\partial A} \right) = \frac{\partial \ell}{\partial O} \left( \frac{\partial \mathcal{F}(A)}{\partial A} + g \right) \tag{6}$$

Given the computational and memory constraints inherent to binary networks, any operation added by a new module must be as small as possible. The structure of HighwayNet [12] requires full-precision gating weights as large as the weights of the convolutional layer, which is unacceptable. Compared with SENet [5], whose SE module corrects the output after the convolution layer, we argue that it is better to make use of the unquantized information. Moreover, our gated residual needs only a channel-wise scaling of the input, far fewer FLOPs than the SE module. As the number of channels increases and the reduction ratio decreases, the FLOPs required by the SE module grow much larger.
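Putting Eqs. (4)-(6) together, the gated module can be sketched in NumPy as below. The main path is abstracted as a callable, the element-wise Jacobian argument is a simplification for illustration, and all names are ours.

```python
import numpy as np

def gated_module(A, g, main_path):
    """Eq. (5): output = main binary path + per-channel gated copy of the
    floating-point input A of shape (C, H, W); g has shape (C,)."""
    return main_path(A) + g[:, None, None] * A

def gated_module_grad_A(grad_O, g, main_path_jac):
    """Eq. (6): the learnable gate adds a direct gradient term g that
    bypasses the STE-approximated main branch (main_path_jac is the
    element-wise derivative of the main path, a simplification)."""
    return grad_O * (main_path_jac + g[:, None, None])

C, H, W = 2, 2, 2
A = np.ones((C, H, W))
g = np.array([1.0, 0.5])   # initialized to ones, as in the paper, then learned
out = gated_module(A, g, lambda x: np.zeros_like(x))  # trivial main path
```

Even with a zero Jacobian from the main path (the worst case of gradient mismatch), the gate contributes a non-zero gradient of $g$ per channel, which is the "auxiliary gradient path" described above.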

## 3 Experiments

In this section, to verify the effectiveness of our proposed BBG-Net, we conduct experiments on both image classification and object detection tasks.

### 3.1 Datasets and Implementation Details

Datasets. In our experiments, we adopt three common image classification datasets: CIFAR-10/100 and ILSVRC-2012 ImageNet. We also evaluate our proposed method on the object detection task with a standard detection benchmark, the Pascal VOC datasets VOC2007 and VOC2012.

Network Architectures and Setup. We conduct our experiments on the popular and powerful network architectures ResNet [4] and the Single Shot Detector (SSD) [7]. For a fair comparison with existing methods, we choose ResNet-20 and ResNet-18 as the baseline models on the CIFAR-10/100 and ImageNet datasets respectively, and we evaluate SSD with different backbones, i.e. VGG-16 [11] and ResNet-34, on the Pascal VOC datasets. As for hyper-parameters, we mostly follow the setups of the original papers, and all our models are trained from scratch. Note that, as suggested by much previous work [2], we do not quantize the first and last layers or the down-sample layers, and we only quantize the backbone network in SSD.

| W/A | Weight | Residual | Acc. (%) |
|---|---|---|---|
| 32/32 | FP | FP | 92.1 |
| 1/1 | Vanilla | Identity | 84.13 |
| 1/1 | Balanced | Identity | 84.71 |
| 1/1 | Vanilla | Gated | 84.89 |
| 1/1 | Balanced | Gated | 85.34 |

### 3.2 Ablation Study

Now we study how balanced weight quantization and the gated residual module affect the network's performance. In Table 1, we report the results of ResNet-20 on CIFAR-10 with and without balanced quantization and the gated residual. Comparing the two binary networks with an identity residual, the one with balanced quantization obtains about 0.6% higher accuracy. From the whole table, we can easily observe that balanced weights and the gated residual each bring an accuracy improvement, and together they work even better, with a 1.2% improvement. This reveals that our proposed method faithfully helps pursue a highly accurate binary network.

| Method | Kernel Stage | CIFAR-10 | CIFAR-100 |
|---|---|---|---|
| FP | 16-32-64 | 92.1 | 68.1 |
| Vanilla | 16-32-64 | 84.71 | 53.37 |
| Vanilla Gated | 16-32-64 | 84.96 | 55.24 |
| Bireal | 16-32-64 | 85.54 | 55.07 |
| Gated | 16-32-64 | 85.34 | 55.62 |
| Vanilla | 32-64-128 | 90.22 | 65.06 |
| Vanilla Gated | 32-64-128 | 90.71 | 66.15 |
| Bireal | 32-64-128 | 90.27 | 65.6 |
| Gated | 32-64-128 | 90.68 | 66.47 |
| Vanilla | 48-96-192 | 92.01 | 68.66 |
| Vanilla Gated | 48-96-192 | 92.31 | 69.11 |
| Bireal | 48-96-192 | 91.78 | 68.5 |
| Gated | 48-96-192 | 92.46 | 69.38 |

### 3.3 Comparison with the State-of-the-Art

CIFAR-10/100 Datasets. For ResNet-20 on the CIFAR-10/100 datasets, we further compare four different kinds of modules in Fig. 2. As shown in Table 2, on CIFAR-10 the accuracy improvement brought by the gated modules is less than 1%, while on the more challenging CIFAR-100 it adds up to over 2%. Across all experiments, Vanilla Gated and Gated show superiority over the Bireal and Vanilla modules. Especially when the network grows wider, the Bireal module even performs worse than Vanilla, while Vanilla Gated and Gated consistently perform better. With the kernel stage of 48-96-192, our binary network matches the accuracy of the full-precision network on both CIFAR-10 and CIFAR-100.

ResNet-18 on ImageNet Dataset. In Fig. 3, we compare our method with many recent binarization methods, including XNOR-Net [9], DoReFa-Net [19], Bireal-Net [8] and PCNN [3], which further reveals the stability of our proposed BBG-Net on larger datasets. Our method significantly outperforms all the other methods, with a 1.3% performance gain over the state-of-the-art ResNetE [2].

We also improve the accuracy of binary networks by exploring network width and resolution in a simple but effective way. In the calculation of FLOPs, the operation count of the binary layers is divided by 64, following Bireal-Net [8]. In the network width experiments, we expand all channels of the original ResNet-18 by fixed width multipliers. Compared with the full-precision network, our widened models achieve almost the same accuracy. We also compare with ABC-Net, which uses 5 binary bases for weights and 3 for activations; our methods consistently outperform ABC-Net by a large margin. In the resolution exploration, we employ a simple strategy: we remove the max pooling layer after the first convolution, which gives all hidden layers larger feature maps than the original ones. Compared with CircConv [6], which employs four times the binary weights, we obtain better results with a 1.6% accuracy improvement, at slightly higher FLOPs but the same memory. Our higher-resolution model's accuracy is 2.3% higher than that of DenseNet-28 [2] at only a fraction of its FLOPs. BENN [20] ensembles 6 standard binary ResNet-18 models, which requires nearly three times the FLOPs of our higher-resolution model while its accuracy is 2% lower.

| Method | Backbone | Resolution | FLOPs (M) | mAP (%) |
|---|---|---|---|---|
| FP | VGG-16 | 300 | 29986 | 74.3 |
| FP | ResNet-34 | 300 | 6850 | 75.5 |
| XNOR | ResNet-34 | 300 | 362 | 55.1 |
| TBN | ResNet-34 | 300 | 464 | 59.5 |
| BDN | DenseNet-37 | 512 | 1530 | 66.4 |
| BDN | DenseNet-45 | 512 | 1960 | 68.2 |
| BBG | ResNet-34 | 300 | 362 | 62.8 |
| BBG | VGG-16 | 300 | 1062 | 68.5 |

SSD on Pascal VOC Dataset. On the object detection task, we compare our method with XNOR-Net [9], TBN [13], and BDN [2], which includes DenseNet-37 and DenseNet-45. With ResNet-34 as the backbone, we outperform XNOR and TBN by 6.9% and 2.5% respectively. This shows that our 1-bit solution preserves stronger feature representations and maintains better generalization ability than ternary neural networks. As for VGG-16, we need only about half the FLOPs of binary DenseNet-45 to achieve a slightly higher result, and our model outperforms DenseNet-37 by 2% with one third fewer FLOPs. The accuracy of our VGG-16 is only 5.8% below its full-precision counterpart. This significant performance gain further demonstrates that our method better preserves the information propagated through the network and helps extract the most discriminative features for the detection task.

### 3.4 Deploying Efficiency

Finally, we implement our method on a mobile device using the daBNN framework [17]. The device is a Raspberry Pi 3B with a 1.2 GHz 64-bit quad-core ARM Cortex-A53. As shown in Table 4, we also implement DoReFa, which binarizes the down-sample layers while we do not; the resulting difference in inference time is only 2 ms, which can be ignored. Our proposed method runs nearly 6× faster than the full-precision counterpart.

| Model | Full-Precision | DoReFa | BBG |
|---|---|---|---|
| time (ms) | 1457 | 249 | 251 |

## 4 Conclusion

Binarization methods for building portable neural networks are urgently required so that networks with massive parameters and complex architectures can be deployed efficiently. In this work, we proposed a novel Balanced Binary Neural Network with Gated Residual, namely BBG-Net. Experiments conducted on benchmark datasets and architectures demonstrate the effectiveness of the proposed balanced weight quantization and gated residual for learning binary neural networks with higher performance but lower memory and computation consumption than state-of-the-art methods.

### Footnotes

- †: Corresponding author

### References

- (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, pp. 2704–2713. Cited by: §1.
- (2019) Back to simplicity: how to train accurate bnns from scratch?. arXiv preprint arXiv:1906.08637. Cited by: §3.1, §3.3, §3.3, §3.3.
- (2019) Projection convolutional neural networks for 1-bit cnns via discrete back propagation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8344–8351. Cited by: §1, §3.3.
- (2016) Deep residual learning for image recognition. CVPR, pp. 770–778. Cited by: §3.1.
- (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §2.2.
- (2019) Circulant binary convolutional networks: enhancing the performance of 1-bit dcnns with circulant back propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2691–2699. Cited by: §1, §3.3.
- (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §3.1.
- (2018) Bi-real net: enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In ECCV, Cited by: §1, §3.3, §3.3.
- (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §1, §3.3, §3.3.
- (2019) Searching for accurate binary neural architectures. arXiv preprint arXiv:1909.07378. Cited by: §1.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.1.
- (2015) Highway networks. arXiv preprint arXiv:1505.00387. Cited by: §2.2.
- (2018) Tbn: convolutional neural network with ternary inputs and binary weights. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 315–332. Cited by: §3.3.
- (2017) Towards evolutional compression. arXiv preprint arXiv:1707.08005. Cited by: §1.
- (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928. Cited by: §1.
- (2018) Lq-nets: learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 365–382. Cited by: §1.
- (2019) DaBNN: a super fast inference framework for binary neural networks on arm devices. arXiv preprint arXiv:1908.05858. Cited by: §3.4.
- (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856. Cited by: §1.
- (2016) Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §3.3.
- (2019) Binary ensemble neural network: more bits per network or more networks per bit?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4923–4932. Cited by: §1, §3.3.