Towards Accurate Binary Convolutional Neural Network


Xiaofan Lin                 Cong Zhao                 Wei Pan*
DJI Innovations Inc, Shenzhen, China
{xiaofan.lin, cong.zhao, wei.pan}@dji.com
Abstract

We introduce a novel scheme to train binary convolutional neural networks (CNNs) – CNNs with weights and activations constrained to {-1,+1} at run-time. It has been known that using binary weights and activations drastically reduce memory size and accesses, and can replace arithmetic operations with more efficient bitwise operations, leading to much faster test-time inference and lower power consumption. However, previous works on binarizing CNNs usually result in severe prediction accuracy degradation. In this paper, we address this issue with two major innovations: (1) approximating full-precision weights with the linear combination of multiple binary weight bases; (2) employing multiple binary activations to alleviate information loss. The implementation of the resulting binary CNN, denoted as ABC-Net, is shown to achieve much closer performance to its full-precision counterpart, and even reach the comparable prediction accuracy on ImageNet and forest trail datasets, given adequate binary weight bases and activations.

* indicates corresponding author.

1 Introduction

Convolutional neural networks (CNNs) have achieved state-of-the-art results on real-world applications such as image classification (He et al., 2016) and object detection (Ren et al., 2015), with the best results obtained with large models and sufficient computation resources. Concurrent to these progresses, the deployment of CNNs on mobile devices for consumer applications is gaining more and more attention, due to the widespread commercial value and the exciting prospect.

On mobile applications, it is typically assumed that training is performed on the server and test or inference is executed on the mobile devices (Courbariaux et al., 2016; Esser et al., 2016). In the training phase, GPUs enabled substantial breakthroughs because of their greater computational speed. In the test phase, however, GPUs are usually too expensive to deploy. Thus improving test-time performance and reducing hardware costs are likely to be crucial for further progress, as mobile applications usually require real-time inference, low power consumption and full embeddability. As a result, there is much interest in research and development of dedicated hardware for deep neural networks (DNNs). Binary neural networks (BNNs) (Courbariaux et al., 2016; Rastegari et al., 2016), i.e., neural networks with weights and perhaps activations constrained to only two possible values (e.g., -1 or +1), would bring great benefits to specialized DNN hardware for three major reasons: (1) binary weights/activations reduce memory usage and model size by a factor of 32 compared to the single-precision version; (2) if weights are binary, then most multiply-accumulate operations can be replaced by simple accumulations, which is beneficial because multipliers are the most space- and power-hungry components of the digital implementation of neural networks; (3) furthermore, if both activations and weights are binary, the multiply-accumulations can be replaced by the bitwise operations xnor and bitcount (Courbariaux et al., 2016). This could have a big impact on dedicated deep learning hardware. For instance, a 32-bit floating point multiplier costs about 200 Xilinx FPGA slices (Govindu et al., 2004), whereas a 1-bit xnor gate only costs a single slice. Semiconductor manufacturers like IBM (Esser et al., 2016) and Intel (Venkatesh et al., 2016) have been involved in the research and development of related chips.
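To make reason (3) concrete, the dot product of two {-1,+1} vectors can be recovered exactly from an xnor followed by a bitcount. The short NumPy sketch below is our own illustration (not from the paper); it encodes -1/+1 as 0/1 bits and checks the identity dot(a, b) = 2·(number of agreeing positions) − n.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
a = rng.choice([-1, 1], size=n)   # binary activations in {-1, +1}
b = rng.choice([-1, 1], size=n)   # binary weights in {-1, +1}

# Encode -1 -> 0 and +1 -> 1 so that xnor/bitcount hardware could be used.
a_bits = (a > 0).astype(np.uint8)
b_bits = (b > 0).astype(np.uint8)

# xnor marks the positions where the two bit vectors agree; bitcount counts them.
agreements = int(np.sum(a_bits == b_bits))
dot_bitwise = 2 * agreements - n   # recover the {-1,+1} dot product
dot_macc = int(np.dot(a, b))       # reference multiply-accumulate result

assert dot_bitwise == dot_macc
print(dot_bitwise, dot_macc)
```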

However, binarization usually causes severe prediction accuracy degradation, especially on complex tasks such as classification on the ImageNet dataset. To take a closer look, Rastegari et al. (2016) show that binarizing the weights causes the Top-1 accuracy of Resnet-18 to drop from 69.3% to 60.8% on ImageNet. If the activations are binarized as well, the accuracy further drops to 51.2%. A similar phenomenon can also be found in the literature, e.g., (Hubara et al., 2016). Clearly there is a considerable gap between the accuracy of a full-precision model and that of a binary model.

This paper proposes a novel scheme for binarizing CNNs, which aims to alleviate, or even eliminate the accuracy degradation, while still significantly reducing inference time, resource requirement and power consumption. The paper makes the following major contributions.

  • We approximate full-precision weights with the linear combination of multiple binary weight bases. The values of these bases are constrained to {-1,+1}, which means convolutions can be implemented by only addition and subtraction (without multiplication), or by bitwise operations when activations are binary as well. We demonstrate that 3 to 5 binary weight bases are adequate to well approximate the full-precision weights.

  • We introduce multiple binary activations. Previous works have shown that the quantization of activations, especially binarization, is more difficult than that of weights (Cai et al., 2017; Courbariaux et al., 2016). By employing five binary activations, we have been able to reduce the Top-1 and Top-5 accuracy degradation caused by binarization to around 5% on ImageNet compared to the full precision counterpart.

It is worth noting that the multiple binary weight bases/activations scheme is preferable to the fixed-point quantization in previous works. In those fixed-point quantized networks one still needs to employ arithmetic operations, such as multiplication and addition, on fixed-point values. Even though faster than floating point, they still require relatively complex logic and can consume a lot of power. Detailed discussions can be found in Section 5.2.

Ideally, combining more binary weight bases and activations always leads to better accuracy and will eventually get very close to that of full-precision networks. We verify this on ImageNet using Resnet network topology. This is the first time a binary neural network achieves prediction accuracy comparable to its full-precision counterpart on ImageNet.

2 Related work

Quantized Neural Networks: High-precision parameters are not strictly necessary to reach high performance in deep neural networks. Recent research efforts (e.g., (Hubara et al., 2016)) have considerably reduced the memory requirement and computational complexity by using low-bitwidth weights and activations. Zhou et al. (2016) further generalized these schemes and proposed to train CNNs with low-bitwidth gradients. By performing the quantization after network training or using the “straight-through estimator (STE)” (Bengio et al., 2013), these works avoid the issue of non-differentiable optimization. While some of these methods have produced good results on datasets such as CIFAR-10 and SVHN, none has produced low-precision networks competitive with full-precision models on large-scale classification tasks such as ImageNet. In fact, (Zhou et al., 2016) and (Hubara et al., 2016) experiment with different combinations of bitwidths for weights and activations, and show that the performance of their highly quantized networks deteriorates rapidly when the weights and activations are quantized to fewer than 4 bits. Cai et al. (2017) enhance the performance of a low-bitwidth model by addressing the gradient mismatch problem; nevertheless there is still much room for improvement.

Binarized Neural Networks: The binary representation of deep models is not a new topic. At the emergence of artificial neural networks, the biologically inspired unit step function was used as the activation function (Toms, 1990). It is known that binary activations can use spiking responses for event-based computation and communication (consuming energy only when necessary) and are therefore energy-efficient (Esser et al., 2016). Recently, Courbariaux et al. (2016) introduced Binarized Neural Networks (BNNs), neural networks with binary weights and activations at run-time. Different from their work, Rastegari et al. (2016) introduced simple, efficient, and accurate approximations to CNNs by binarizing the weights and even the intermediate representations in CNNs. All these works drastically reduce memory consumption, and replace most arithmetic operations with bitwise operations, which potentially leads to a substantial increase in power efficiency.

In all the above-mentioned works, binarization significantly reduces accuracy. Our experimental results on ImageNet show that we come close to closing the gap between the accuracy of a binary model and its full-precision counterpart. We rely on the idea of finding the best approximation of a full-precision convolution using multiple binary operations, and of employing multiple binary activations to allow more information to pass through.

3 Binarization methods

In this section, we detail our binarization method, termed ABC-Net (Accurate Binary Convolutional Network) for convenience. Bear in mind that during training, the real-valued weights are retained and updated at every epoch, while at test-time only the binary weights are used in convolution.

3.1 Weight approximation

Consider an $L$-layer CNN architecture. Without loss of generality, we assume the weights of each convolutional layer are tensors of dimension $(w, h, c_{in}, c_{out})$, which represent the filter width, filter height, number of input channels and number of output channels respectively. We propose two variations of the binarization method for the weights at each layer: 1) approximate the weights as a whole and 2) approximate the weights channel-wise.

3.1.1 Approximate weights as a whole

At each layer, in order to constrain a CNN to have binary weights, we estimate the real-valued weight filter $W \in \mathbb{R}^{w \times h \times c_{in} \times c_{out}}$ using the linear combination of $M$ binary filters $B_1, B_2, \ldots, B_M \in \{-1,+1\}^{w \times h \times c_{in} \times c_{out}}$ such that $W \approx \alpha_1 B_1 + \alpha_2 B_2 + \cdots + \alpha_M B_M$. To find an optimal estimation, a straightforward way is to solve the following optimization problem:

$\min_{\alpha, B} \; J(\alpha, B) = \| w - B\alpha \|^2, \quad \text{s.t. } B_{ij} \in \{-1, +1\},$   (1)

where $B = [\mathrm{vec}(B_1), \mathrm{vec}(B_2), \ldots, \mathrm{vec}(B_M)]$, $w = \mathrm{vec}(W)$ and $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_M]^{\top}$. Here the notation $\mathrm{vec}(\cdot)$ refers to vectorization.

Although a local minimum solution to (1) can be obtained by numerical methods, one could not backpropagate through it to update the real-valued weight filter $W$. To address this issue, assuming the mean and standard deviation of $W$ are $\mathrm{mean}(W)$ and $\mathrm{std}(W)$ respectively, we fix the $B_i$'s as follows:

$B_i = F_{u_i}(W) := \mathrm{sign}\big(\bar{W} + u_i \,\mathrm{std}(W)\big), \quad i = 1, 2, \ldots, M,$   (2)

where $\bar{W} = W - \mathrm{mean}(W)$ and $u_i$ is a shift parameter. For example, one can choose the $u_i$'s to be spread evenly over the range $[-\mathrm{std}(W), \mathrm{std}(W)]$, or leave them to be trained by the network. This is based on the observation that the full-precision weights tend to have a symmetric, non-sparse distribution, which is close to Gaussian. To gain more intuition and illustrate the approximation effectiveness, an example is visualized in Section S2 of the supplementary material.

With the $B_i$'s chosen, (1) becomes a linear regression problem

$\min_{\alpha} \; J(\alpha) = \| w - B\alpha \|^2,$   (3)

in which the $B_i$'s serve as the bases in the design/dictionary matrix. We can then back-propagate through the $B_i$'s using the “straight-through estimator” (STE) (Bengio et al., 2013). With $c$ the cost function, and $A$ and $O$ the input and output tensors of a convolution respectively, the forward and backward passes of an approximated convolution during training can be computed as follows:

Forward:  $B_m = F_{u_m}(W), \quad m = 1, \ldots, M$   (4)
          Solve (3) for $\alpha$   (5)
          $O = \sum_{m=1}^{M} \alpha_m \,\mathrm{Conv}(B_m, A)$   (6)
Backward: $\dfrac{\partial c}{\partial W} = \dfrac{\partial c}{\partial O} \sum_{m=1}^{M} \alpha_m \dfrac{\partial O}{\partial B_m}\dfrac{\partial B_m}{\partial W} \overset{\text{STE}}{\approx} \sum_{m=1}^{M} \alpha_m \dfrac{\partial c}{\partial B_m}$   (7)

At test-time, only (6) is required. The block structure of this approximated convolution layer is shown on the left side of Figure 1. With suitable hardware and appropriate implementations, the convolution can be computed efficiently. For example, since the weight values are binary, we can implement the convolution with additions and subtractions (thus without multiplications). Furthermore, if the input is binary as well, we can implement the convolution with the bitwise operations xnor and bitcount (Rastegari et al., 2016). Note that the convolutions with the different binary filters can be computed in parallel.
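As a concrete illustration of (2) and (3), the following NumPy sketch (our own, not the authors' code) builds $M$ binary bases with evenly spaced shifts $u_i$ and solves the least-squares problem for the $\alpha$'s; the reconstruction error shrinks as $M$ grows. The function names and the random test tensor are ours.

```python
import numpy as np

def binary_bases(W, M):
    """Eq. (2): B_i = sign(W - mean(W) + u_i * std(W)), with u_i spread evenly in [-1, 1]."""
    W_bar = W - W.mean()
    std = W.std()
    us = np.linspace(-1.0, 1.0, M) if M > 1 else np.array([0.0])
    bases = [np.sign(W_bar + u * std) for u in us]
    # NumPy's sign(0) is 0; map it to +1 so every entry stays in {-1, +1}.
    return [np.where(b == 0, 1.0, b) for b in bases]

def fit_alphas(W, bases):
    """Eq. (3): least-squares fit of alpha so that vec(W) ~ B @ alpha."""
    B = np.stack([b.ravel() for b in bases], axis=1)   # design matrix, one column per base
    alphas, *_ = np.linalg.lstsq(B, W.ravel(), rcond=None)
    return alphas

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(3, 3, 64, 64))          # a stand-in full-precision filter bank

for M in (1, 3, 5):
    bases = binary_bases(W, M)
    alphas = fit_alphas(W, bases)
    W_hat = sum(a * b for a, b in zip(alphas, bases))
    rmse = np.sqrt(np.mean((W - W_hat) ** 2))
    print(f"M={M}: RMSE of the approximation = {rmse:.4f}")
```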

Figure 1: An example of the block structure of the convolution in ABC-Net, here with $M = N = 3$. On the left is the structure of the approximated convolution (ApproxConv). ApproxConv is expected to approximate the conventional full-precision convolution with a linear combination of binary convolutions (BinConv), i.e., convolutions with binary weights. On the right is the overall block structure of the convolution in ABC-Net. The input is binarized using the different shifted binarization functions $H_{v_n}$, passed into the corresponding ApproxConv's and then summed up after multiplying by their corresponding $\beta_n$'s. With the input binarized, the BinConv's can be implemented with highly efficient bitwise operations. There are 9 BinConv's in this example and they can work in parallel.

3.1.2 Approximate weights channel-wise

Alternatively, we can estimate the real-valued weight filter $W_i$ of each output channel $i$ using a linear combination of $M$ binary filters $B_{i1}, \ldots, B_{iM}$ such that $W_i \approx \alpha_{i1} B_{i1} + \cdots + \alpha_{iM} B_{iM}$. Again, to find an optimal estimation, we solve a linear regression problem analogous to (3) for each output channel. After convolution, the results are concatenated along the output-channel dimension. If $M = 1$, this approach reduces to the Binary-Weights-Networks (BWN) proposed in (Rastegari et al., 2016).

Compared to approximating the weights as a whole, the channel-wise approach approximates the weights more finely, while no extra cost is incurred during inference. However, since this approach requires more computational resources during training, we leave it as future work and focus on the former approach in this paper.
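For completeness, a minimal sketch of the channel-wise variant is given below (our own illustration, assuming the same shifted-sign construction as (2) is applied per output channel); each output channel gets its own set of $\alpha$'s.

```python
import numpy as np

def fit_channelwise(W, us):
    """Per-output-channel fit (sketch of Sec. 3.1.2).

    W has shape (w, h, c_in, c_out); mean/std, bases and alphas are computed
    separately for each output channel.
    """
    c_out = W.shape[-1]
    alphas = np.zeros((c_out, len(us)))
    for i in range(c_out):
        Wi = W[..., i]
        bases = [np.sign(Wi - Wi.mean() + u * Wi.std()) for u in us]
        B = np.stack([np.where(b == 0, 1.0, b).ravel() for b in bases], axis=1)
        alphas[i], *_ = np.linalg.lstsq(B, Wi.ravel(), rcond=None)
    return alphas

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(3, 3, 16, 32))
print(fit_channelwise(W, us=[-1.0, 0.0, 1.0]).shape)   # (32, 3): one alpha vector per channel
```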

3.2 Multiple binary activations and bitwise convolution

As mentioned above, a convolution can be implemented without multiplications when weights are binarized. However, to utilize the bitwise operation, the activations must be binarized as well, as they are the inputs of convolutions.

Similar to the activation binarization procedure in (Zhou et al., 2016), we binarize activations after passing them through a bounded activation function $h$, which ensures $h(x) \in [0, 1]$. We choose the bounded rectifier as $h$. Formally, it can be defined as:

$h_v(x) = \mathrm{clip}(x + v,\, 0,\, 1),$   (8)

where $v$ is a shift parameter. If $v = 0$, then $h_v$ is the clip activation function used in (Zhou et al., 2016).

We constrain the binary activations to be either +1 or -1. In order to transform the real-valued activation $R$ into a binary activation, we use the following binarization function:

$H_v(R) := 2\,\mathbb{1}_{h_v(R) \geq 0.5} - 1,$   (9)

where $\mathbb{1}$ is the indicator function. The conventional forward and backward passes of this activation can be given as follows:

Forward:  $A = H_v(R)$   (10)
Backward: $\dfrac{\partial c}{\partial R} = \dfrac{\partial c}{\partial A} \circ \mathbb{1}_{0 \leq R + v \leq 1}$ (using the STE)

Here $\circ$ denotes the Hadamard product. As can be expected, binarizing activations as above is rather crude and leads to non-trivial losses in accuracy, as shown in Rastegari et al. (2016); Hubara et al. (2016). While it is also possible to approximate activations with linear regression, as done for the weights, another critical challenge arises: unlike the weights, the activations vary at test-time inference. Fortunately, this difficulty can be avoided by exploiting the statistical structure of the activations of deep networks.

Our scheme can be described as follows. First of all, to keep the distribution of activations relatively stable, we resort to batch normalization (Ioffe and Szegedy, 2015). This is a widely used normalization technique, which forces the responses of each network layer to have zero mean and unit variance. We apply this normalization before the activation. Secondly, we estimate the real-valued activation $R$ using a linear combination of $N$ binary activations $A_1, A_2, \ldots, A_N$ such that $R \approx \beta_1 A_1 + \beta_2 A_2 + \cdots + \beta_N A_N$, where

$A_n = H_{v_n}(R), \quad n = 1, 2, \ldots, N.$   (11)

Different from the weights, the parameters $\beta_n$'s and $v_n$'s ($n = 1, \ldots, N$) here are both trainable, just like the scale and shift parameters in batch normalization. Without the explicit linear regression approach, the $\beta_n$'s and $v_n$'s are tuned by the network itself during training and fixed at test-time. They are expected to learn and utilize the statistical features of the full-precision activations.
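The following NumPy sketch (our own illustration) spells out (8), (9) and (11) for a batch-normalized activation tensor; in ABC-Net the shifts $v_n$ and coefficients $\beta_n$ are learned, so the hand-picked values below are only for demonstration.

```python
import numpy as np

def h_v(x, v):
    """Bounded rectifier of Eq. (8): clip(x + v, 0, 1)."""
    return np.clip(x + v, 0.0, 1.0)

def H_v(R, v):
    """Binarization of Eq. (9): +1 where h_v(R) >= 0.5, otherwise -1."""
    return np.where(h_v(R, v) >= 0.5, 1.0, -1.0)

rng = np.random.default_rng(0)
# Activations after batch normalization: roughly zero mean, unit variance.
R = rng.normal(0.0, 1.0, size=10_000)

# Hand-picked shifts v_n and coefficients beta_n for N = 3 (trained in the real network).
vs = np.array([-0.5, 0.0, 0.5])
betas = np.array([0.4, 0.4, 0.4])

R_hat = sum(beta * H_v(R, v) for beta, v in zip(betas, vs))   # Eq. (11) combination
print("mean abs error, 3 binary activations:", np.mean(np.abs(R - R_hat)))
print("mean abs error, 1 binary activation :", np.mean(np.abs(R - H_v(R, 0.0))))
```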

The resulting network architecture outputs multiple binary activations $A_1, \ldots, A_N$ and their corresponding coefficients $\beta_1, \ldots, \beta_N$, which allows more information to pass through compared to a single binarization. Combined with the weight approximation, the whole convolution scheme is given by:

$\mathrm{Conv}(W, R) \approx \mathrm{Conv}\Big(\sum_{m=1}^{M} \alpha_m B_m,\; \sum_{n=1}^{N} \beta_n A_n\Big) = \sum_{m=1}^{M} \sum_{n=1}^{N} \alpha_m \beta_n \,\mathrm{Conv}(B_m, A_n),$   (12)

which suggests that it can be implemented by computing $M \times N$ bitwise convolutions in parallel. An example of the whole convolution scheme is shown in Figure 1.
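To make (12) concrete, the sketch below (our own, with a naive single-channel "valid" convolution standing in for Conv(·,·)) checks numerically that the weighted sum of the $M \times N$ binary convolutions equals the convolution of the two linear combinations; on hardware each inner call would be a bitwise xnor/bitcount kernel.

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive single-channel 'valid' cross-correlation; stands in for Conv(., .)."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def sign_pm1(x):
    return np.where(x >= 0, 1.0, -1.0)

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(3, 3))   # one full-precision 3x3 filter
R = rng.normal(0.0, 1.0, size=(8, 8))   # one full-precision input feature map

# Weight side: M binary bases (Eq. (2)) and least-squares alphas (Eq. (3)).
us = np.array([-1.0, 0.0, 1.0])
Bs = [sign_pm1(W - W.mean() + u * W.std()) for u in us]
design = np.stack([b.ravel() for b in Bs], axis=1)
alphas, *_ = np.linalg.lstsq(design, W.ravel(), rcond=None)

# Activation side: N binary activations (Eq. (11)) with hand-picked betas/shifts.
vs, betas = np.array([-0.5, 0.0, 0.5]), np.array([0.4, 0.4, 0.4])
As = [np.where(np.clip(R + v, 0, 1) >= 0.5, 1.0, -1.0) for v in vs]

# Eq. (12): sum of M*N binary convolutions ...
out_binary = sum(a * b * conv2d_valid(A_n, B_m)
                 for a, B_m in zip(alphas, Bs)
                 for b, A_n in zip(betas, As))

# ... equals the convolution of the two linear combinations.
W_hat = sum(a * B_m for a, B_m in zip(alphas, Bs))
R_hat = sum(b * A_n for b, A_n in zip(betas, As))
out_reference = conv2d_valid(R_hat, W_hat)

print(np.max(np.abs(out_binary - out_reference)))   # ~1e-15: the two agree
```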

3.3 Training algorithm

A typical block in a CNN contains several different layers, usually in the following order: (1) Convolution, (2) Batch Normalization, (3) Activation and (4) Pooling. The batch normalization layer (Ioffe and Szegedy, 2015) normalizes the input batch by its mean and variance. The activation is an element-wise non-linear function (e.g., Sigmoid, ReLU). The pooling layer applies some type of pooling (e.g., max, min or average) to the input batch. In our experiments, we observe that applying max-pooling to a binary input returns a tensor most of whose elements are equal to +1, resulting in a noticeable drop in accuracy. A similar phenomenon has been reported in Rastegari et al. (2016) as well. Therefore, we put the max-pooling layer before the batch normalization and activation.
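The reordering can be summarized with a small PyTorch-style forward-pass sketch (our own; the straight-through estimator and the ApproxConv weight bases are omitted, and the class name and hyperparameters are ours): max-pooling is applied to the real-valued convolution output before batch normalization and binarization.

```python
import torch
import torch.nn as nn

class ABCBlockSketch(nn.Module):
    """Layer ordering used in the paper: Conv -> MaxPool -> BatchNorm -> binarize.

    Forward-pass sketch only; the real ABC-Net replaces the convolution by a
    linear combination of binary convolutions and trains the shifts with the STE.
    """
    def __init__(self, c_in, c_out, n_act=3):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False)
        self.pool = nn.MaxPool2d(2)
        self.bn = nn.BatchNorm2d(c_out)
        # Hand-initialized shifts v_n and coefficients beta_n for N binary activations.
        self.v = nn.Parameter(torch.linspace(-0.5, 0.5, n_act))
        self.beta = nn.Parameter(torch.full((n_act,), 1.0 / n_act))

    def forward(self, x):
        x = self.bn(self.pool(self.conv(x)))      # pool BEFORE batch norm and binarization
        ones = torch.ones_like(x)
        # Multiple binary activations, Eq. (11); no STE here, so this is inference-style only.
        acts = [torch.where(torch.clamp(x + v, 0, 1) >= 0.5, ones, -ones) for v in self.v]
        return sum(b * a for b, a in zip(self.beta, acts))

x = torch.randn(2, 16, 32, 32)
print(ABCBlockSketch(16, 32)(x).shape)   # torch.Size([2, 32, 16, 16])
```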

Since our binarization scheme approximates full-precision weights, using a full-precision pre-trained model serves as a perfect initialization. However, fine-tuning is always required for the weights to adapt to the new network structure. The training procedure of ABC-Net is summarized in Section S1 of the supplementary material.

It is worth noting that as $M$ increases, the shift parameters get closer to each other and the bases of the linear combination become more correlated, which sometimes leads to rank deficiency when solving (3). This can be tackled with regularization (e.g., an $\ell_2$ penalty on the $\alpha$'s).
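A minimal way to handle such rank deficiency, assuming a standard ridge ($\ell_2$) penalty on the $\alpha$'s, is sketched below (our own illustration): adding $\lambda I$ to the normal equations keeps the solve well conditioned even when two shifted bases are nearly identical.

```python
import numpy as np

def fit_alphas_ridge(w_vec, B, lam=1e-3):
    """Regularized version of Eq. (3): argmin ||w - B a||^2 + lam * ||a||^2."""
    M = B.shape[1]
    return np.linalg.solve(B.T @ B + lam * np.eye(M), B.T @ w_vec)

# Two nearly identical basis columns make B.T @ B close to singular.
rng = np.random.default_rng(0)
w = rng.normal(size=1000)
b1 = np.sign(w)
b2 = np.sign(w + 1e-6 * rng.normal(size=1000))   # almost the same column as b1
B = np.stack([b1, b2], axis=1)
print(fit_alphas_ridge(w, B))                    # finite, stable alphas
```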

4 Experiment results

In this section, the proposed ABC-Net is evaluated on the ILSVRC12 ImageNet classification dataset (Deng et al., 2009) and, in Section S6 of the supplementary material, on the visual perception of forest trails dataset for mobile robots (Giusti et al., 2016).

4.1 Experiment results on ImageNet dataset

The ImageNet dataset contains about 1.2 million high-resolution natural images for training, spanning 1000 categories of objects. The validation set contains 50k images. We use Resnet (He et al., 2016) as the network topology. The images are resized to 224x224 before being fed into the network. We report our classification performance using Top-1 and Top-5 accuracies.

4.1.1 Effect of weight approximation

We first evaluate the weight approximation technique by examining the accuracy improvement for a binary model. To isolate its effect, we leave the activations at full precision in this experiment. Table 1 shows the prediction accuracy of ABC-Net on ImageNet with different choices of the number of weight bases $M$. For comparison, we add the results of the Binary-Weights-Network (denoted 'BWN') reported in Rastegari et al. (2016) and the full-precision network (denoted 'FP'). The BWN binarizes weights and leaves the activations at full precision, as we do here. All results in this experiment use Resnet-18 as the network topology. It can be observed that as $M$ increases, the accuracy of ABC-Net converges to that of its full-precision counterpart. The Top-1 gap between them reduces to only 0.9 percentage point when $M = 5$, which suggests that this approach nearly eliminates the accuracy degradation caused by binarizing the weights.

        BWN    M=1    M=2    M=3    M=5    FP
Top-1   60.8%  62.8%  63.7%  66.2%  68.3%  69.3%
Top-5   83.0%  84.4%  85.2%  86.7%  87.9%  89.2%
Table 1: Top-1 and Top-5 accuracy of ABC-Net on ImageNet, using full-precision activations and different choices of the number of binary weight bases $M$. 'BWN' denotes the binary-weights baseline of Rastegari et al. (2016) and 'FP' the full-precision network.

For interested readers, Figure S4 in Section S5 of the supplementary material shows that the relationship between accuracy and $M$ appears to be roughly linear. Also, in Section S2 of the supplementary material, a visualization of the approximated weights is provided.

4.1.2 Configuration space exploration

We explore the configuration space of combinations of the number of weight bases $M$ and the number of binary activations $N$. Table 2 presents the results of ABC-Net under different configurations. The parameter settings for these experiments are provided in Section S4 of the supplementary material.

Network   M (weight bases)   N (activation bases)   Top-1   Top-5   Top-1 gap   Top-5 gap
res18 1 1 42.7% 67.6% 26.6% 21.6%
res18 3 1 49.1% 73.8% 20.2% 15.4%
res18 3 3 61.0% 83.2% 8.3% 6.0%
res18 3 5 63.1% 84.8% 6.2% 4.4%
res18 5 1 54.1% 78.1% 15.2% 11.1%
res18 5 3 62.5% 84.2% 6.8% 5.0%
res18 5 5 65.0% 85.9% 4.3% 3.3%
res18 Full Precision 69.3% 89.2% - -
res34 1 1 52.4% 76.5% 20.9% 14.8%
res34 3 3 66.7% 87.4% 6.6% 3.9%
res34 5 5 68.4% 88.2% 4.9% 3.1%
res34 Full Precision 73.3% 91.3% - -
res50 5 5 70.1% 89.7% 6.0% 3.1%
res50 Full Precision 76.1% 92.8% - -
Table 2: Prediction accuracy (Top-1/Top-5) on ImageNet for different choices of $M$ and $N$ in ABC-Net (weights approximated as a whole). "res18", "res34" and "res50" are short for the Resnet-18, Resnet-34 and Resnet-50 network topologies respectively. $M$ and $N$ refer to the number of weight bases and the number of binary activations respectively.

Since balancing multiple factors such as training time, inference time, model size and accuracy is ultimately a practical trade-off, there is no definitive conclusion as to which combination of $(M, N)$ one should choose. In general, Table 2 shows that (1) the prediction accuracy of ABC-Net improves greatly as the number of binary activations increases, analogous to the weight approximation approach; (2) a larger $M$ or $N$ gives better accuracy; (3) when $M = N = 5$, the Top-1 gap between the accuracy of a full-precision model and a binary one reduces to around 5%. To give a visual understanding and show the possibility of extension to other tasks such as object detection, we show a sample of feature maps in Section S3 of the supplementary material.

4.1.3 Comparison with the state-of-the-art

Model W A Top-1 Top-5
Full-Precision Resnet-18 [full-precision weights and activation] 32 32 69.3% 89.2%
BWN [full-precision activation] Rastegari et al. (2016) 1 32 60.8% 83.0%
DoReFa-Net [1-bit weight and 4-bit activation] Zhou et al. (2016) 1 4 59.2% 81.5%
XNOR-Net [binary weight and activation] Rastegari et al. (2016) 1 1 51.2% 73.2%
BNN [binary weight and activation] Courbariaux et al. (2016) 1 1 42.2% 67.1%
ABC-Net [5 binary weight bases, 5 binary activations] 1 1 65.0% 85.9%
ABC-Net [5 binary weight bases, full-precision activations] 1 32 68.3% 87.9%
Table 3: Classification test accuracy of CNNs trained on ImageNet with Resnet-18 network topology. ‘W’ and ‘A’ refer to the weight and activation bitwidth respectively.

Table 3 presents a comparison between ABC-Net and several other state-of-the-art models, i.e., the full-precision Resnet-18, BWN and XNOR-Net in (Rastegari et al., 2016), DoReFa-Net in (Zhou et al., 2016) and BNN in (Courbariaux et al., 2016). (Courbariaux et al. (2016) did not report results on ImageNet; we implemented their method and present the result here.) All models compared use Resnet-18 as the network topology. The full-precision Resnet-18 achieves 69.3% Top-1 accuracy. Although Rastegari et al. (2016)'s BWN model and DoReFa-Net perform well, it should be noted that they use full-precision and 4-bit activations respectively. Models that binarize both weights and activations (XNOR-Net and BNN) achieve much less satisfactory accuracy and are significantly outperformed by ABC-Net with multiple binary weight bases and activations. It can be seen that ABC-Net achieves state-of-the-art performance as a binary model.

One might argue that a 5-bit fixed-point quantization scheme could reach accuracy similar to that of ABC-Net with 5 weight bases and 5 binary activations. However, the former is less efficient and requires distinctly more hardware resources. More detailed discussions can be found in Section 5.2.

5 Discussion

5.1 Why does adding a shift parameter work?

Intuitively, the multiple binarized weight bases/activations scheme works because it allows more information to pass through. Consider the case where a real value, say 1.5, is passed to a binarization function $H(x)$, where $H$ maps a positive input to 1 and a non-positive input to -1. The output $H(1.5) = 1$ tells us only that the input value is positive. Now imagine that we have two binarization functions, $H(x)$ and a shifted copy $H(x - 2)$. In that case $H(1.5)$ outputs 1 and $H(1.5 - 2)$ outputs -1, which tells us that the input value is not only positive but also smaller than 2. Clearly each function contributes differently to representing the input, and more information is gained compared to the former case.
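The toy example can be spelled out in a few lines (our own illustration; H denotes a sign-style binarization and the shifted copy compares against 2):

```python
def H(x, shift=0.0):
    """Binarize: +1 if (x - shift) is positive, else -1."""
    return 1 if x - shift > 0 else -1

x = 1.5
print(H(x))             #  1 -> the input is positive
print(H(x, shift=2.0))  # -1 -> the input is also no larger than 2
# Together the pair (+1, -1) narrows the input down to the interval (0, 2].
```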

From another point of view, both the coefficients (the $\alpha$'s and $\beta$'s) and the shift parameters are expected to learn and utilize the statistical features of the full-precision tensors, just like the scale and shift parameters in batch normalization. If we have more binarized weight bases/activations, the network has the capacity to approximate the full-precision one more precisely. Therefore, it can be deduced that when $M$ or $N$ is large enough, the network learns to tune itself so that the combination of weight bases or binarized activations acts like the full-precision one.

5.2 Advantage over the fixed-point quantization scheme

It should be noted that there are key differences between the multiple binarization scheme ($M$ binarized weight bases or $N$ binarized activations) proposed in this paper and the fixed-point quantization schemes in previous works such as (Zhou et al., 2016; Hubara et al., 2016), even though at first glance a $K$-bit fixed-point quantization seems to share the same memory requirement as $K$ binarizations. Specifically, our scheme of $M$ binarized weight bases / $N$ binarized activations is preferable to a fixed $K$-bit scheme for the following reasons:

(1) The binarization scheme preserves the binary form needed for bitwise operations. One or several bitwise operations are known to be more efficient than a fixed-point multiplication, which is a major reason that BNN/XNOR-Net were proposed in the first place.

(2) A $K$-bit ($K > 1$) multiplier consumes more resources than $K$ 1-bit multipliers in a digital chip: it requires more than $K$ bits to store and compute the product, otherwise it could easily overflow/underflow. For example, if a real number is quantized to a 2-bit number, a possible choice of range is {0,1,2,4}. In this 2-bit multiplication, when both numbers are 4 the product is 16, which is not within the range. In (Zhou et al., 2016), the range of activations is constrained within [0,1], which seems to avoid this situation. However, using fractional numbers does not solve the problem: severe precision deterioration will appear during the multiplication if no extra resources are used. The fact that the complexity of a multiplier is proportional to the square of the bit-width can be found in the literature (e.g., Sec. 3.1.1 in (Grabbe et al., 2003)). In contrast, our binarization scheme does not have this issue: the product of two binary values always stays within {-1,+1}, so no overflow can occur. The saved hardware resources can be further used for parallel computing.

(3) A binary activation can use a spiking response for event-based computation and communication (consuming energy only when necessary) and is therefore energy-efficient (Esser et al., 2016). This can be exploited in our scheme, but not in a fixed $K$-bit scheme. Also, as mentioned above, a $K$-bit multiplier consumes more resources than $K$ 1-bit multipliers, and these resources include power.

To sum up, $K$-bit multipliers are among the most space- and power-hungry components of the digital implementation of DNNs. Our scheme could therefore bring great benefits to specialized DNN hardware.

5.3 Further computation reduction in run-time

On specialized hardware, the following operations in our scheme can be integrated with other operations in run-time and further reduce the computation requirement.

(1) Shift operations. The existence of shift parameters seems to require extra additions/subtractions (see (2) and (8)). However, the binarization operation with a shift parameter can be implemented as a comparator in which the shift parameter determines the comparison threshold, e.g., outputting +1 if $x > c$ and -1 otherwise, where $c$ is a constant determined by the shift, so no extra additions/subtractions are involved.

(2) Batch normalization. In run-time, a batch normalization is simply an affine function, say $y = ax + b$, whose scale and shift parameters are fixed and can be integrated with the $v_n$'s. More specifically, a batch normalization followed by a binarization with threshold $t$ can be folded into a single comparison of the raw input against the pre-computed constant $(t - b)/a$. Therefore, there will be no extra cost for the batch normalization.
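A quick numerical check of this folding (our own sketch, assuming the run-time batch normalization has been reduced to y = a·x + b with a > 0): comparing a·x + b against a threshold t is equivalent to comparing x against the pre-computed constant (t − b)/a, so the normalization disappears into the comparator.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

a, b = 1.7, -0.3      # frozen batch-norm scale and shift at run-time (a > 0)
t = 0.5               # binarization threshold (e.g., the 0.5 cut-off of h_v)

with_bn = np.where(a * x + b >= t, 1, -1)    # batch norm followed by binarization
folded = np.where(x >= (t - b) / a, 1, -1)   # batch norm folded into the threshold

print("mismatching outputs:", int(np.sum(with_bn != folded)))   # 0 (up to borderline rounding)
print("folded comparator threshold:", (t - b) / a)
```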

6 Conclusion and future work

We have introduced a novel binarization scheme for weights and activations during forward and backward propagations called ABC-Net. We have shown that it is possible to train a binary CNN with ABC-Net on ImageNet and achieve accuracy close to its full-precision counterpart. The binarization scheme proposed in this work is parallelizable and hardware friendly, and the impact of such a method on specialized hardware implementations of CNNs could be major, by replacing most multiplications in convolution with bitwise operations. The potential to speed up test-time inference could be very useful for real-time embedded systems. Future work includes the extension of these results to other tasks such as object detection and to other models such as RNNs. Also, it would be interesting to investigate using FPGAs/ASICs or other customized deep learning processors (Liu et al., 2016) to implement ABC-Net at run-time.

7 Acknowledgement

We acknowledge Mr Jingyang Xu for helpful discussions.

References

Supplementary Material

S1 Summary of training algorithm in Section 3.3

Require: a minibatch of inputs and targets, number of binary weight bases M, number of binary activations N, previous weights W, and learning rate η.
Ensure: updated weights W and updated learning rate η.
  {1. Computing the parameter gradients:}
  {1.1. Forward propagation:}
  for k = 1 to L do
     Compute the binary bases B_{k,m} and coefficients α_{k,m} with (2) and (3)
     O_k ← approximated convolution of A_k using (12)
     Optionally apply max-pooling
     O_k ← BatchNorm(O_k)
     if k < L then
        A_{k+1} ← binarized activations of O_k (using (11))
     end if
  end for
  {1.2. Backward propagation:}
  {Note that the gradients are full-precision.}
  Compute g_{O_L} = ∂c/∂O_L
  for k = L to 1 do
     if k < L then
        g_{O_k} ← back-propagate g_{A_{k+1}} through the binarization (11) with the STE
     end if
     g_{O_k} ← BackBatchNorm(g_{O_k})
     (g_{A_k}, g_{B_{k,m}}) ← BackConv(g_{O_k})
  end for
  {2. Accumulating the parameter gradients:}
  for k = 1 to L do
     With g_{B_{k,m}} known, compute ∂c/∂W_k using (7)
     W_k ← Update(W_k, η, ∂c/∂W_k)
     β_k ← Update(β_k, η, g_{β_k})
     v_k ← Update(v_k, η, g_{v_k})
     η ← λη
  end for
Algorithm 1: Training an L-layer ABC-Net. c is the cost function for the minibatch and λ is the learning rate decay factor. ∘ indicates element-wise multiplication. BatchNorm() specifies how to batch-normalize the output of a convolution and BackBatchNorm() how to backpropagate through the normalization (Ioffe and Szegedy, 2015). Analogously, Conv() and BackConv() respectively specify how to do the convolution and how to backpropagate through it. Update() specifies how to update the parameters when their gradients are known, e.g., ADAM (Kingma and Ba, 2014) or Momentum (Qian, 1999).

S2 Weight approximation

In this section we explore how well the weight approximation can be achieved given adequate binary weight bases (Section 3.1). To gain a visual intuition, we randomly sample a slice of a weight tensor from a full-precision Resnet-18 model pretrained on ImageNet. The sliced tensor is then vectorized, and we approximate it with $M$ bases using linear regression (see (3)). The results are presented in Figure S2, where the left subfigure shows the root mean square error (RMSE) of the estimated weights with an increasing number of bases, and the right one shows 5 fitting results, whose choices of $M$ are respectively 1 to 5 from top to bottom. The blue line in the right subfigure draws the ground-truth weights from the full-precision pretrained model, and the red line is the estimated one. It can be observed that around 3 bases are adequate for a rough fit, and the approximation becomes almost perfect when $M = 5$.

Figure S2: Fitting a section of the weights of a full-precision Resnet-18 trained on ImageNet. The left side shows the RMSE of the estimated weights with an increasing number of bases; the right side shows 5 fitting results, whose choices of $M$ are respectively 1 to 5 from top to bottom (blue line: full-precision weights; red line: estimated weights).

S3 Feature map

It is also possible to perform more complex tasks beyond classification using ABC-Net, as long as the model is built upon a CNN, such as Faster R-CNN for object detection, in which the classification model serves as a pre-trained model. Thus, one might be interested in whether ABC-Net learns feature maps similar to those of its full-precision counterpart. Figure S3 shows several example images and the corresponding feature maps of the two models, from which we see that they are indeed similar. This shows the potential for ABC-Net to generalize further to the more complex tasks mentioned above.

Figure S3: Examples of feature maps. The feature maps from the first convolution layer of ABC-Net (above) look similar to those of its full-precision counterpart (below). The ABC-Net here uses Resnet-18 as the network topology.

S4 Parameter settings for the experiment in Section 4.1.2

The shift parameters $u_i$'s and the initial values for the $v_n$'s and $\beta_n$'s can be treated as hyperparameters. At the beginning of our exploration we chose these initial values randomly. Gradually we found certain patterns that achieve good performance: the $u_i$'s are usually symmetric; the $v_n$'s may be shifted slightly towards the negative direction. These choices are based on tuning and also on observing the distribution of the full-precision weights/activations. Table S4 provides the parameter settings for the experiments in Section 4.1.2. All ABC-Net models in the experiments are trained using SGD with momentum, and the initial learning rate is set to 0.01.

Network   M   N   weight shift parameters ($u_i$'s)   activation shift parameters ($v_n$'s)   initial $\beta_n$'s
res18 1 1 0 0.0 1.0
res18 3 1 -1,0,1 0.0 1.0
res18 3 3 -1,0,1 -1.5, 0.0, 1.5 1.0, 1.0, 1.0
res18 3 5 -1,0,1 -3.5, -2.5, -1.5, 0.0, 2.5 1.0, 1.0, 1.0, 1.0, 1.0
res18 5 1 -2,-1,0,1,2 0.0 1.0
res18 5 3 -2,-1,0,1,2 -0.9, 0.0, 0.9 1.0, 1.0, 1.0
res18 5 5 -1,-0.5,0,0.5,1 -3.5, -2.5, -1.5, 0.0, 2.5 1.0, 1.0, 1.0, 1.0, 1.0
res34 3 3 -1,0,1 -3.0, 0.0, 3.0 1.0, 1.0, 1.0
res34 5 5 -1,-0.5,0,0.5,1 -3.5, -2.5, -1.5, 0.0, 2.5 1.0, 1.0, 1.0, 1.0, 1.0
res50 5 5 -1,-0.5,0,0.5,1 -3.5, -2.5, -1.5, 0.0, 2.5 1.0, 1.0, 1.0, 1.0, 1.0
Table S4: Parameter settings for the experiments in Section 4.1.2. "res18", "res34" and "res50" are short for the Resnet-18, Resnet-34 and Resnet-50 network topologies respectively. $M$ and $N$ refer to the number of weight bases and the number of binary activations respectively.

S5 Relationship between accuracy and number of binary weight bases

Figure S4 shows that the relationship between accuracy and the number of binary weight bases $M$ appears to be roughly linear. Note that the activations are kept at full precision in this experiment.

Figure S4: Top-1 (left) and Top-5 (right) accuracy of ABC-Net on ImageNet, using full-precision activations and different choices of the number of binary weight bases $M$.

S6 Application on visual perception of forest trails

S6.1 VGG-like Network Topology

A VGG-like network topology is used for visual perception of forest trails as illustrated in Figure S5.

Figure S5: The network topology for visual perception of forest trails.

S6.2 Experiment results on visual perception of forest trails dataset

Giusti et al. (2016) cast the forest or mountain trail perception problem for mobile robots as an image classification task based on deep neural networks. The dataset is composed of 8 hours of 30 fps video acquired by a hiker equipped with three head-mounted cameras. Each image is labelled with one of three classes: turn right, go straight, turn left. We evaluate ABC-Net against its full-precision counterpart on this dataset. The classification results are shown in Table S5, obtained by fixing both the number of weight bases and the number of binary activations to 5.

Network   weight shift parameters ($u_i$'s)   activation shift parameters ($v_n$'s)   $\beta_n$'s   ABC-Net   FP
VGG-like   -1, -0.5, 0.0, 0.5, 1   0, 0, 0, 0, 0   1, 1, 1, 1, 1   78.0%   77.7%
Table S5: Classification accuracy on Forest Trails dataset. ‘FP’ stands for ‘Full Precision’.