Training Binary Neural Networks using the Bayesian Learning Rule
Abstract
Neural networks with binary weights are computation-efficient and hardware-friendly, but their training is challenging because it involves a discrete optimization problem. Surprisingly, ignoring the discrete nature of the problem and using gradient-based methods, such as the Straight-Through Estimator, still works well in practice. This raises the question: are there principled approaches which justify such methods? In this paper, we propose such an approach using the Bayesian learning rule. The rule, when applied to estimate a Bernoulli distribution over the binary weights, results in an algorithm which justifies some of the algorithmic choices made by the previous approaches. The algorithm not only obtains state-of-the-art performance, but also enables uncertainty estimation for continual learning to avoid catastrophic forgetting. Our work provides a principled approach for training binary neural networks which justifies and extends existing approaches.
1 Introduction
Deep neural networks (DNNs) have been remarkably successful in machine learning, but their training and deployment require a high energy budget, which hinders their application to resource-constrained devices such as mobile phones, wearables, and IoT devices. Binary neural networks (BiNNs), where weights and/or activations are restricted to binary values, are one promising solution to address this issue (Courbariaux et al., 2016, 2015). Compared to full-precision DNNs with (e.g., 32-bit) weights, BiNNs directly give a 32 times reduction in the model size. Further computational efficiency is obtained by using specialized hardware, e.g., by replacing the multiplication and addition operations with the bitwise xnor and bitcount operations (Rastegari et al., 2016; Mishra et al., 2017; Bethge et al., 2020). In the near future, BiNNs are expected to play an important role in energy-efficient and hardware-friendly deep learning.
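As a rough illustration of the bitwise arithmetic mentioned above, the following sketch computes the dot product of two vectors with entries in {-1, +1} using xnor and bitcount. The bit-packing convention (+1 stored as bit 1, -1 as bit 0) and the function names `pack_bits` and `binary_dot` are assumptions made for this example and are not taken from the cited works.

```python
def pack_bits(vec):
    """Pack a list of +1/-1 values into an int, using bit 1 for +1 and bit 0 for -1."""
    bits = 0
    for i, v in enumerate(vec):
        if v == 1:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n via xnor and bitcount."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask      # xnor: 1 where the two signs agree
    matches = bin(agree).count("1")        # bitcount
    return 2 * matches - n                 # (#agreements) - (#disagreements)


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))
result = binary_dot(pack_bits(a), pack_bits(b), len(a))
```

Here `result` agrees with the ordinary multiply-and-add `reference`, while the inner computation uses only bitwise operations.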
A problem with BiNNs is that their training is much more difficult than that of their continuous counterparts. BiNNs obtained by quantizing already trained DNNs do not work well, and it is preferable to optimize for binary weights directly. Such training is challenging because it involves a discrete optimization problem. Continuous optimization methods such as the Adam optimizer (Kingma and Ba, 2014) are not expected to perform well or even converge.
Despite such theoretical issues, a method called the Straight-Through Estimator (STE) (Bengio et al., 2013), which employs continuous optimization methods, works remarkably well (Courbariaux et al., 2015). The method is justified based on “latent” real-valued weights which are discretized at every iteration to get binary weights. The gradients used to update the latent weights, however, are computed at the binary weights (see Figure 1 (a) for an illustration). It is not clear why these gradients help the search for the minimum of the discrete problem (Yin et al., 2019; Alizadeh et al., 2019). Another recent work by Helwegen et al. (2019) dismisses the idea of latent weights and proposes a new optimizer called Binary Optimizer (Bop) based on inertia. Unfortunately, the steps used by their optimizer are also derived from intuition and are not theoretically justified using an optimization problem. Our goal in this paper is to address this issue and propose a principled approach to justify the algorithmic choices of these previous approaches.
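Table 1: The three steps of STE, our BayesBiNN method, and Bop.
Step 1: Get binary weights from the real-valued parameters.
Step 2: Compute the gradient at the binary weights.
Step 3: Update the real-valued parameters.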
In this paper, we present a Bayesian perspective to justify previous approaches. Instead of optimizing a discrete objective, the Bayesian approach relaxes it by using a distribution over the binary variable, resulting in a principled approach for discrete optimization. We use a Bernoulli approximation to the posterior and estimate it using a recently proposed variational inference method called the Bayesian learning rule (Khan and Rue, 2019). This results in an algorithm which justifies some of the algorithmic choices made by existing methods; see Table 1 for a summary of results. Since our algorithm is based on a principled derivation, it is easier to generalize. We show an application of uncertainty estimation in BiNNs to continual learning, where it helps avoid catastrophic forgetting (Kirkpatrick et al., 2017). To the best of our knowledge, there is no other work on continual learning of BiNNs, mostly because extending existing methods, like STE, to such tasks is not trivial. Overall, our work provides a principled approach for training BiNNs that justifies and extends previous approaches. The code to reproduce the results is available at https://github.com/teamapproxbayes/BayesBiNN.
1.1 Related Works
There are two main directions in the study of BiNNs: one involves the design of special network architectures tailored to binary operations (Courbariaux et al., 2015; Rastegari et al., 2016; Lin et al., 2017; Bethge et al., 2020) and the other concerns the training methods. The latter is the main focus of this paper.
Our algorithm is derived using the Bayesian learning rule, which was recently proposed by Khan and Rue (2019). They show that the rule can be used to derive many existing learning algorithms in fields such as optimization, Bayesian statistics, machine learning and deep learning. In particular, the Adam optimizer can also be derived as a special case of the Bayesian learning rule (Khan et al., 2018; Osawa et al., 2019). Our application is yet another example where the rule can be used to justify existing algorithms that perform well in practice but whose mechanisms are not well understood.
Instead of using the Bayesian learning rule, it is possible to use other types of variational inference methods, e.g., Shayer et al. (2017) use a variational optimization approach (Staines and Barber, 2012) along with the reparameterization trick. Unfortunately, such applications do not result in an update similar to either STE or Bop.
2 Training Binary Neural Networks (BiNN)
Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, the goal is to train a neural network $f_w(x)$ with binary weights $w \in \{-1,+1\}^{W}$, where $W$ is the number of weights. The challenge is in optimizing the following discrete optimization objective:
$$\min_{w \in \{-1,+1\}^{W}} \sum_{i \in \mathcal{D}} \ell(y_i, f_w(x_i)) \tag{1}$$
where $\ell(y_i, \hat{y}_i)$ is a loss function, e.g., cross-entropy loss for the model predictions $\hat{y}_i := f_w(x_i)$. It is clear that binarized weights obtained from pre-trained NNs with real-valued weights do not minimize equation (1) and are therefore not expected to give good performance. Optimizing the objective with respect to binary weights is difficult since gradient-based methods cannot be directly applied. The gradients with respect to the real-valued weights are not expected to help the search for the minimum of a discrete objective (Yin et al., 2019).
Despite such theoretical concerns, the Straight-Through Estimator (STE) (Bengio et al., 2013), which utilizes gradient-based methods, works extremely well. There have been many recent works that build upon this method, including BinaryConnect (Courbariaux et al., 2015), Binarized neural networks (Courbariaux et al., 2016), XNOR-Net (Rastegari et al., 2016), as well as the most recent MeliusNet (Bethge et al., 2020). The general approach of such methods is shown in Figure 1 in three steps. In step 1, we obtain binary weights $w_b$ from the real-valued parameters $w_r$. In step 2, we compute gradients at $w_b$ and, in step 3, update $w_r$ using the gradients. STE makes a particular choice for step 1, where a sign function is used to obtain the binary weights from the real-valued weights (see Table 1 for pseudocode). However, since the gradient through the sign function is zero almost everywhere, STE in effect uses the approximation $\nabla_{w_r} \approx \nabla_{w_b}$. This approximation can be justified in simple settings, but in general the reasons behind its effectiveness are not clear (Yin et al., 2019).
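The three steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in any of the cited works: the names `w_r`, `w_b`, `ste_step`, and `grad_fn` are chosen for this example, plain gradient descent is used in step 3, and the quadratic toy loss is purely hypothetical.

```python
def sign(x):
    """Sign function mapping to {-1, +1}; zero is mapped to +1 by convention."""
    return 1.0 if x >= 0 else -1.0


def ste_step(w_r, grad_fn, lr):
    """One STE iteration on a list of real-valued (latent) weights."""
    w_b = [sign(w) for w in w_r]                      # Step 1: binarize
    g = grad_fn(w_b)                                  # Step 2: gradient at the binary weights
    return [w - lr * gi for w, gi in zip(w_r, g)]     # Step 3: update the latent weights


# Toy loss 0.5 * sum((w - target)^2), whose gradient is w - target.
target = [-1.0, -1.0, 1.0]
grad_fn = lambda w: [wi - ti for wi, ti in zip(w, target)]

w_r = [0.3, -0.2, 0.1]
for _ in range(20):
    w_r = ste_step(w_r, grad_fn, lr=0.1)
w_b = [sign(w) for w in w_r]
```

On this toy problem the binarized weights `w_b` end up equal to `target`, even though the gradient is only ever evaluated at binary points.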
Recently, Helwegen et al. (2019) proposed a new method that goes against the justification behind STE. They argue that the “latent” weights used in STE-based methods do not exist. Instead, they provide a new perspective: the sign of each element of $w_r$ represents a binary weight, while its magnitude encodes some inertia against flipping the sign of the binary weight. With this perspective, they propose the Binary optimizer (Bop) method, which keeps track of an exponential moving average of the gradient during the training process and then decides whether to flip the sign of a binary weight when this average exceeds a certain threshold. The Bop algorithm is shown in Table 1. However, the derivation of Bop is also based on intuition and heuristics. It remains unclear why the exponential moving average of the gradient is used and what objective the algorithm is optimizing. The choice of the threshold is another difficulty in the algorithm.
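The flipping rule described above can be sketched as follows. This is an illustrative reading of the description, not the reference implementation of Helwegen et al. (2019): the names `bop_step`, `rate`, and `threshold` are hypothetical, and the sign-agreement condition (flip only when the averaged gradient has the same sign as the weight, so that a gradient-descent step would move the weight toward the other value) is an assumption of this sketch.

```python
def bop_step(w_b, m, grad, rate, threshold):
    """One Bop-style iteration: update the gradient average, then flip where it is large."""
    new_w, new_m = [], []
    for w, m_j, g in zip(w_b, m, grad):
        m_j = (1.0 - rate) * m_j + rate * g           # exponential moving average of gradients
        # Flip when the average exceeds the threshold and has the same sign as the
        # weight (assumed convention: descent would then push the weight across zero).
        if abs(m_j) > threshold and (m_j > 0) == (w > 0):
            w = -w
        new_w.append(w)
        new_m.append(m_j)
    return new_w, new_m


target = [-1.0, -1.0, 1.0]
w_b = [1.0, -1.0, 1.0]
m = [0.0, 0.0, 0.0]
for _ in range(20):
    grad = [w - t for w, t in zip(w_b, target)]       # gradient of 0.5 * sum((w - target)^2)
    w_b, m = bop_step(w_b, m, grad, rate=0.1, threshold=0.05)
```

Note that no latent weights appear: the only state besides the binary weights is the moving average `m`.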
Indeed, Bayesian methods do present a principled way to incorporate the ideas used in both STE and Bop. For example, the idea of “generating” binary weights from real-valued parameters can be thought of as sampling from a discrete distribution with real-valued parameters. In fact, the sign function used in STE is related to the “soft-thresholding” used in machine learning. Despite this, there exists no work on Bayesian training of BiNNs that gives an algorithm similar to STE or Bop. In this work, we fill this gap and show that, by using the Bayesian learning rule, we recover a method that justifies some of the steps of STE and Bop and enables us to extend their application. We will now describe our method.
3 BayesBiNN: Binary NNs with Bayes
We will now describe our approach based on a Bayesian formulation of the discrete optimization problem. A Bayesian formulation of a loss-based approach can be written as the following minimization problem with respect to a distribution $q(w)$ (Zellner, 1988; Bissiri et al., 2016)
$$\min_{q(w)} \; \mathbb{E}_{q(w)}\Big[\sum_{i=1}^{N} \ell(y_i, f_w(x_i))\Big] + \mathbb{D}_{KL}\big[\,q(w) \,\|\, p(w)\,\big] \tag{2}$$
where $p(w)$ is a prior distribution and $q(w)$ is the posterior distribution or its approximation. The formulation is general and does not require the loss to correspond to a probabilistic model. When the loss indeed corresponds to a log-likelihood function, this minimization results in the posterior distribution, which is equivalent to Bayes’ rule (Bissiri et al., 2016). When the space of $q(w)$ is restricted, this results in an approximation to the posterior, which is then equivalent to variational inference (Jordan et al., 1999). For our purpose, this formulation enables us to derive an algorithm that resembles existing methods such as STE and Bop.
3.1 BayesBiNN optimizer
To solve the optimization problem (2), the Bayesian learning rule (Khan and Rue, 2019) considers a class of minimal exponential family approximations
$$q(w) := h(w) \exp\big[\lambda^{\top} \phi(w) - A(\lambda)\big] \tag{3}$$
where $\lambda$ is the natural parameter, $\phi(w)$ is the vector of sufficient statistics, $A(\lambda)$ is the log-partition function, and $h(w)$ is the base measure. When the prior distribution $p(w)$ follows the same distribution as $q(w)$ in (3), and both have the base measure $h(w) = 1$, the Bayesian learning rule states that the posterior distribution can be learned by updating the natural parameter $\lambda$ as follows (Khan and Rue, 2019)
$$\lambda \leftarrow (1-\alpha)\lambda - \alpha\Big\{\nabla_{\mu}\, \mathbb{E}_{q(w)}\Big[\sum_{i=1}^{N} \ell(y_i, f_w(x_i))\Big] - \lambda_0\Big\} \tag{4}$$
where $\alpha$ is the learning rate, $\mu = \mathbb{E}_{q(w)}[\phi(w)]$ is the expectation parameter of $q(w)$, and $\lambda_0$ is the natural parameter of the prior distribution $p(w)$. The Bayesian learning rule is a natural-gradient variational inference algorithm (Khan and Lin, 2017; Khan and Rue, 2019). An interesting point of this update rule is that the gradient is computed with respect to the expectation parameter $\mu$ while the update is performed on the natural parameter $\lambda$.
For BiNNs, the forms of $p(w)$ and $q(w)$ are specified as follows. A priori, we assume that the weights are equally likely to be either $-1$ or $+1$, so $p(w)$ is a Bernoulli distribution with a probability of $1/2$ for each state. For the posterior approximation $q(w)$, we use the following mean-field Bernoulli distribution:
$$q(w) = \prod_{j=1}^{W} p_j^{\frac{1+w_j}{2}} (1-p_j)^{\frac{1-w_j}{2}} \tag{5}$$
where $p_j$ is the probability that $w_j = +1$ (otherwise $w_j = -1$), and $W$ is the number of parameters. The goal is then to learn the parameters $p_j$ of the approximation. The Bernoulli distribution defined in equation (5) is a special case of the minimal exponential family distribution, where the corresponding natural and expectation parameters of each weight $w_j$ are
$$\lambda_j := \frac{1}{2}\log\frac{p_j}{1-p_j}, \qquad \mu_j := 2p_j - 1. \tag{6}$$
The natural parameter of the prior $p(w)$ is $\lambda_0 = 0$. In principle, we could apply the Bayesian learning rule directly to learn the posterior Bernoulli distribution of the binary weights.
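The relationship between the probability, the natural parameter, and the expectation parameter of a Bernoulli distribution over {-1, +1} can be checked numerically. The sketch below uses the standard identities for this distribution (natural parameter equal to half the log-odds, mean equal to the hyperbolic tangent of the natural parameter); the function names are chosen for this example only.

```python
import math


def natural_from_prob(p):
    """Natural parameter lambda = 0.5 * log(p / (1 - p)) for p = P(w = +1)."""
    return 0.5 * math.log(p / (1.0 - p))


def prob_from_natural(lam):
    """Inverse map: p = sigmoid(2 * lambda)."""
    return 1.0 / (1.0 + math.exp(-2.0 * lam))


def mean_from_natural(lam):
    """Expectation parameter mu = E[w] = 2p - 1 = tanh(lambda)."""
    return math.tanh(lam)


p = 0.8
lam = natural_from_prob(p)
mu = mean_from_natural(lam)
prior_lam = natural_from_prob(0.5)   # uniform prior over {-1, +1}
```

In particular, the uniform prior corresponds to a natural parameter of zero, and `mu` coincides with `2 * p - 1`.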
However, as shown in equation (4), implementing the Bayesian learning rule requires the gradient of an expectation with respect to $\mu$. A straightforward solution is the REINFORCE method (Williams, 1992), which transforms gradients of expectations into expectations of gradients using the log-derivative trick, i.e., $\nabla_{\mu} \mathbb{E}_{q(w)}[\ell(y, f_w(x))] = \mathbb{E}_{q(w)}[\ell(y, f_w(x))\, \nabla_{\mu} \log q(w)]$. Nevertheless, the REINFORCE method does not use the gradient of the loss $\ell$, which is essential to show the similarity to STE and Bop. The REINFORCE method also suffers from high variance.
To address this problem, we resort to a popular reparameterization trick for discrete variables called the Gumbel-softmax trick (Maddison et al., 2016; Jang et al., 2016), which results in a form of gradient similar to those used in previous gradient-based methods such as STE and Bop. The idea of the Gumbel-softmax trick is to introduce a concrete distribution that relaxes the discrete random variables so that the reparameterization trick becomes applicable. The result is shown in Lemma 1.
Lemma 1
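Using the Gumbel-softmax trick, the gradient $\nabla_{\mu} \mathbb{E}_{q(w)}\big[\sum_{i=1}^{N} \ell(y_i, f_w(x_i))\big]$ can be approximated using the mini-batch gradient
$$\nabla_{\mu}\, \mathbb{E}_{q(w)}\Big[\sum_{i=1}^{N} \ell(y_i, f_w(x_i))\Big] \approx s \odot g \tag{7}$$
where
$$s := \frac{N\,(1 - w_b^{2})}{\tau\,(1 - \tanh(\lambda)^{2})} \tag{8}$$
$$g := \frac{1}{M} \sum_{i \in \mathcal{M}} \nabla_{w_b}\, \ell(y_i, f_{w_b}(x_i)) \tag{9}$$
with $w_b$ denoting the relaxed binary weights and all operations in (8) applied elementwise.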
Proof 1
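As shown in equation (5), each weight $w_j$ takes values in $\{-1,+1\}$ and follows a Bernoulli distribution. According to the Gumbel trick, the Bernoulli random variable following the distribution in equation (5) can be parameterized or relaxed as
$$w_b = \tanh\big((\lambda + \delta)/\tau\big) \tag{10}$$
where $\tanh(\cdot)$ is the elementwise tanh function, $\lambda$ is the natural parameter defined in equation (6), $\tau > 0$ is the temperature value, and $\delta$ is the additive noise term
$$\delta = \frac{1}{2}\log\frac{\epsilon}{1-\epsilon} \tag{11}$$
where $\epsilon$ is one random sample vector independently drawn from a uniform distribution $\mathcal{U}(0,1)$.
Thanks to the reparameterization of $w$ in equation (10), the gradient can now be written in the form of an expectation, i.e.,
$$\nabla_{\mu}\, \mathbb{E}_{q(w)}\Big[\sum_{i=1}^{N} \ell(y_i, f_w(x_i))\Big] \approx \mathbb{E}_{\epsilon}\Big[\nabla_{\mu} \sum_{i=1}^{N} \ell(y_i, f_{w_b}(x_i))\Big] \tag{12}$$
Based on the chain rule, each element of $\nabla_{\mu}\, \ell(y_i, f_{w_b}(x_i))$ can be calculated as
$$\nabla_{\mu_j}\, \ell(y_i, f_{w_b}(x_i)) = \nabla_{w_{b,j}}\, \ell(y_i, f_{w_b}(x_i)) \, \frac{d w_{b,j}}{d \mu_j} \tag{13}$$
According to the definition of the natural parameter and the expectation parameter in equation (6), it is easy to obtain the relationship between them, i.e., $\mu_j = \tanh(\lambda_j)$. Recalling the reparameterization of each element $w_{b,j}$ in equation (10), after some algebra, we get
$$\frac{d w_{b,j}}{d \mu_j} = \frac{d w_{b,j}}{d \lambda_j}\,\frac{d \lambda_j}{d \mu_j} = \frac{1 - w_{b,j}^{2}}{\tau\,(1 - \tanh(\lambda_j)^{2})} \tag{14}$$
Then, combining equations (12)-(14), and using a mini-batch of the whole dataset as well as one single Monte Carlo sample of $\epsilon$, the gradient term in equation (4) can be estimated as in equation (7), where $\mathcal{M}$ in equation (9) denotes a mini-batch of size $M$ drawn from the whole dataset $\mathcal{D}$, which completes the proof.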
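The derivative used in the proof can be verified numerically. The sketch below assumes the tanh relaxation with logistic noise described above, with the mean of the Bernoulli distribution equal to the hyperbolic tangent of the natural parameter; the helper names, the clamping constant `tiny`, and the default `n_data=1` are assumptions added for this illustration.

```python
import math
import random


def relaxed_weight(lam, delta, tau):
    """Relaxed binary weight tanh((lambda + delta) / tau), a value in (-1, 1)."""
    return math.tanh((lam + delta) / tau)


def sample_delta(rng, tiny=1e-10):
    """Noise delta = 0.5 * log(eps / (1 - eps)) with eps ~ Uniform(0, 1)."""
    eps = min(max(rng.random(), tiny), 1.0 - tiny)    # keep the logarithm finite
    return 0.5 * math.log(eps / (1.0 - eps))


def scale(lam, w, tau, n_data=1, tiny=1e-10):
    """Scale n_data * dw/dmu = n_data * (1 - w^2) / (tau * (1 - tanh(lam)^2))."""
    return n_data * (1.0 - w * w) / (tau * (1.0 - math.tanh(lam) ** 2) + tiny)


rng = random.Random(0)
lam, tau = 0.3, 0.5
delta = sample_delta(rng)
w = relaxed_weight(lam, delta, tau)

# Finite-difference check of dw/dmu, using lambda = atanh(mu).
mu, h = math.tanh(lam), 1e-6
fd = (relaxed_weight(math.atanh(mu + h), delta, tau)
      - relaxed_weight(math.atanh(mu - h), delta, tau)) / (2 * h)
analytic = scale(lam, w, tau)
```

The finite-difference estimate `fd` and the closed form `analytic` agree up to numerical error, which is the content of the "after some algebra" step.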
Substituting the result of Lemma 1 into the Bayesian learning rule in equation (4), we obtain the update equation in each iteration for the natural parameter $\lambda$
$$\lambda \leftarrow (1-\alpha)\lambda - \alpha\,\big[\,s \odot g - \lambda_0\,\big] \tag{15}$$
The resultant optimizer, which we call BayesBiNN, is shown in Table 1, where we use the fact that the natural parameter of our Bernoulli prior is $\lambda_0 = 0$ (since the probability of $w_j = +1$ is $1/2$). Also, for ease of comparison with other methods, the natural parameter $\lambda$ is replaced with $w_r$.
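Putting the pieces together, one iteration of the update can be sketched as below. This is a minimal single-sample illustration under stated assumptions and not the released implementation: the function name `bayesbinn_step`, the clamping constant `tiny`, the quadratic toy loss, and the choice `n_data=1` are all hypothetical.

```python
import math
import random


def bayesbinn_step(lam, grad_fn, lr, tau, rng, n_data=1, lam_prior=None, tiny=1e-10):
    """One natural-parameter update: lam <- (1 - lr) * lam - lr * (s * g - lam_prior)."""
    if lam_prior is None:
        lam_prior = [0.0] * len(lam)                  # uniform Bernoulli prior
    # Step 1: draw relaxed binary weights.
    w_b = []
    for l in lam:
        eps = min(max(rng.random(), tiny), 1.0 - tiny)
        delta = 0.5 * math.log(eps / (1.0 - eps))
        w_b.append(math.tanh((l + delta) / tau))
    # Step 2: gradient of the loss at the relaxed weights.
    g = grad_fn(w_b)
    # Step 3: scaled update of the natural parameters.
    new_lam = []
    for l, w, g_j, l0 in zip(lam, w_b, g, lam_prior):
        s = n_data * (1.0 - w * w) / (tau * (1.0 - math.tanh(l) ** 2) + tiny)
        new_lam.append((1.0 - lr) * l - lr * (s * g_j - l0))
    return new_lam


# Toy loss 0.5 * sum((w - target)^2), whose gradient is w - target.
target = [-1.0, -1.0, 1.0]
grad_fn = lambda w: [wi - ti for wi, ti in zip(w, target)]

rng = random.Random(0)
lam = [0.0, 0.0, 0.0]
for _ in range(200):
    lam = bayesbinn_step(lam, grad_fn, lr=0.1, tau=1.0, rng=rng)
mode = [1.0 if l >= 0 else -1.0 for l in lam]
```

On this toy problem the mode of the learned distribution, `mode`, matches `target`; the structure (binarize or relax, take a gradient, update a real-valued parameter) mirrors the three steps of STE and Bop.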
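After training a BiNN using BayesBiNN, for a new input $x$, one could either use the mean, i.e., $\hat{y} = \frac{1}{C}\sum_{c=1}^{C} f_{w^{(c)}}(x)$, which is the average of the outputs associated with $C$ different Monte Carlo (MC) samples $w^{(c)}$ drawn from $q(w)$, or directly use the mode of $q(w)$ to make predictions, i.e., $\hat{y} = f_{\hat{w}}(x)$ where $\hat{w} := \mathrm{sign}(\tanh(\lambda))$.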
3.2 Justification of Previous Approaches
In this section, we show how BayesBiNN justifies the steps of STE and Bop. A summary is shown in Table 1.
First of all, BayesBiNN justifies the use of gradient-based methods to solve the discrete optimization problem (1). As opposed to the problem (1), the new objective (2) from the Bayesian perspective is over the continuous parameter $\lambda$, so gradient-descent-based optimization can be performed. The underlying principle is similar to stochastic approaches to non-differentiable optimization (Lemaréchal, 1989), as well as variational optimization (Staines and Barber, 2012).
Second, some of the algorithmic choices of previous methods such as STE and Bop can be better understood using BayesBiNN. Specifically, as the temperature $\tau$ in BayesBiNN becomes small enough, the $\tanh(\cdot)$ function in Table 1 will behave like the $\mathrm{sign}(\cdot)$ function in Step 1 of STE; see Figure 1 (b). From this perspective, the latent weights $w_r$ in STE play a role similar to that of the natural parameter $\lambda$ of BayesBiNN. In particular, if there is no sampling, i.e., $\delta = 0$ in BayesBiNN, the two algorithms become very similar to each other, justifying STE from the Bayesian perspective.
On the other hand, as shown in Step 3 of BayesBiNN in Table 1, an exponential moving average of (scaled) gradients is kept during training, which is similar to Step 3 of Bop in Table 1. Specifically, we rewrite Step 3 of BayesBiNN in Table 1 in an equivalent manner, i.e., $\lambda \leftarrow (1-\alpha)\lambda + \alpha\,(-s \odot g)$. Note that apart from the scaling factor, there is also a minus sign before the gradients, as opposed to Bop. In fact, it makes no difference since one could obtain an equivalent form of Bop by simply changing the transform function, as discussed below. The original definition of the hysteresis function in Bop is (Helwegen et al., 2019)
(16) 
which corresponds to the update rule in Step 3 of Bop in Table 1. One could, however, also modify the update rule to use a minus sign before the gradient, which is equivalent to the former if the conditions in equation (16) are changed accordingly. The corresponding curve is simply an upside-down flipped version of the rightmost figure in Figure 1 (b). As a result, the momentum term in Bop plays a role similar to that of the natural parameter in BayesBiNN, which provides an alternative Bayesian explanation.
In Helwegen et al. (2019), the momentum is interpreted as a quantity related to inertia, which indicates the strength of the state of the weights. As the natural parameter $\lambda$ in the binary distribution (5) essentially indicates the strength of the probability of being $-1$ or $+1$ for each weight, BayesBiNN provides a more principled explanation for Bop. In addition, it also explains why the threshold in Bop is typically quite small: the scaling factor before the natural parameter (momentum), which is typically very large, is ignored.
A recent mirror descent view of training quantized neural networks proposed in Ajanthan et al. (2019) interprets the continuous parameters as the dual of the quantized ones. As there is an equivalence between the natural gradient descent and mirror descent (Raskutti and Mukherjee, 2015; Khan and Lin, 2017), the proposed BayesBiNN also provides an interesting perspective on the mirror descent framework for BiNN training.
3.3 Benefits of BayesBiNN
Apart from justifying previous methods, BayesBiNN has several other advantages. First of all, since its algorithmic form is very similar to existing deep learning optimizers, it is very easy to implement within current deep learning platforms. Second, as a Bayesian method, BayesBiNN is able to provide uncertainty estimates, which can be useful for a decision task and post training.
The uncertainty obtained using BayesBiNN enables us to perform continual learning within the variational continual learning (VCL) framework (Nguyen et al., 2017). To the best of our knowledge, this is the first work on continual learning for BiNNs. In the continual learning setting, the goal is to learn the parameters of the neural network from a set of sequentially arriving datasets $\mathcal{D}_t$, $t = 1, 2, \ldots, T$. While training the $t$-th task with dataset $\mathcal{D}_t$, we do not have access to the datasets of past tasks, i.e., $\mathcal{D}_1, \ldots, \mathcal{D}_{t-1}$. Our goal is to train the BiNN with $\mathcal{D}_t$ alone while maintaining the performance on the previous tasks. Using standard training leads to catastrophic forgetting (Kirkpatrick et al., 2016).
A common approach to solve this problem with full-precision networks is elastic weight consolidation (EWC) (Kirkpatrick et al., 2016), which regularizes the weights to combat catastrophic forgetting by minimizing the following loss
(17) 
where the regularizer is weighted by a scaling factor and a preconditioner, for which EWC uses the Fisher information matrix. However, for BiNNs, whose weights are binary values, it is still not clear how to implement EWC with STE/Bop. Since the optimization problem is not explicitly defined, it is nontrivial to include a proper regularization loss.
Our method BayesBiNN, as a principled approach, has a very clear definition of the objective function, which can be easily modified to enable continual learning within the VCL framework. Suppose that $q_{t-1}(w)$ is the posterior distribution estimated using BayesBiNN from previous tasks. Then, for a new task $t$ with dataset $\mathcal{D}_t$, we can replace the prior distribution $p(w)$ in equation (2) by $q_{t-1}(w)$, obtaining
$$\min_{q_t(w)} \; \mathbb{E}_{q_t(w)}\Big[\sum_{i \in \mathcal{D}_t} \ell(y_i, f_w(x_i))\Big] + \mathbb{D}_{KL}\big[\,q_t(w) \,\|\, q_{t-1}(w)\,\big] \tag{18}$$
As a result, the solution for task $t$ is obtained in a straightforward manner by adding the prior from previous tasks, where the update of the natural parameter $\lambda_t$ for the new task in equation (15) becomes
$$\lambda_t \leftarrow (1-\alpha)\lambda_t - \alpha\,\big[\,s \odot g - \lambda_{t-1}\,\big] \tag{19}$$
where $\lambda_{t-1}$ denotes the learned natural parameter of $q_{t-1}(w)$ from previous tasks. For the first task $t = 1$, no prior from previous tasks is available, so that $\lambda_0 = 0$, which is the same as in the single-task case.
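The effect of the previous-task posterior acting as a prior can be seen in isolation with a small sketch. The function name `continual_step` and the numbers below are hypothetical; the update form (a decay of the current natural parameter plus a pull toward the previous one, minus the scaled gradient) is an assumption based on the description above. With a zero gradient, the natural parameter relaxes toward the value learned on previous tasks rather than toward zero.

```python
def continual_step(lam, s, g, lam_prev, lr):
    """lam <- (1 - lr) * lam - lr * (s * g - lam_prev), applied elementwise."""
    return [(1.0 - lr) * l - lr * (s_j * g_j - lp)
            for l, s_j, g_j, lp in zip(lam, s, g, lam_prev)]


lam_prev = [2.0, -1.5]          # natural parameters learned on previous tasks
lam = [0.0, 0.0]
zeros, ones = [0.0, 0.0], [1.0, 1.0]
for _ in range(500):
    # With no gradient signal from the new task, only the prior term acts.
    lam = continual_step(lam, ones, zeros, lam_prev, lr=0.1)
```

This is the mechanism that counteracts forgetting: weights about which previous tasks were confident (large magnitude in `lam_prev`) need a correspondingly strong gradient signal to change sign.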
4 Experimental Results
In this section, we conduct a set of experiments to demonstrate the performance of BayesBiNN on both synthetic and real image data for different kinds of neural network architectures. We also show an application of uncertainty estimation in BiNNs to continual learning. The code to reproduce the results is available at https://github.com/teamapproxbayes/BayesBiNN.
4.1 Synthetic Data
First, we evaluate BayesBiNN on a toy binary classification task and a toy regression task, both on synthetic data. Adam with STE (Bengio et al., 2013) is used as a deterministic baseline. In both cases, we used a multilayer perceptron (MLP) with two hidden layers of 64 units and tanh activation functions.
Classification
As shown in Figure 3, we train both STE and BayesBiNN on the two moons dataset with 100 data points in each class and plot the predicted Bernoulli probabilities. When using the BayesBiNN mode or STE, we get a single deterministic prediction. While BayesBiNN’s fit is slightly worse than STE’s, it is overall much less overconfident, mainly in regions with no data. Averaging the predictions of 10 samples drawn from the posterior distribution $q(w)$, we obtain uncertainty estimates that are lower where data is available and higher when moving away from the training data. Experimental details of the training process are provided in Appendix A.1 in the supplementary material.
Regression
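Table 2: Results on MNIST, CIFAR-10, and CIFAR-100.
Dataset  Optimizer  Train Accuracy  Validation Accuracy  Test Accuracy
MNIST  STE Adam  %  %  98.85 %
MNIST  Bop  %  %  98.47 %
MNIST  PMF  %  %
MNIST  BayesBiNN (proposed)  %  %  98.86 %
MNIST  Full-precision  %  %  99.01 %
CIFAR-10  STE Adam  %  %  %
CIFAR-10  Bop  %  %  %
CIFAR-10  PMF  %  %
CIFAR-10  BayesBiNN (proposed)  %  %  %
CIFAR-10  Full-precision  %  %  %
CIFAR-100  STE Adam  %  %  %
CIFAR-100  Bop  %  %  %
CIFAR-100  PMF  %  %
CIFAR-100  BayesBiNN (proposed)  %  %  73.00 %
CIFAR-100  Full-precision  %  %  74.83 %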
In Figure 2, we train both algorithms on the Snelson dataset (Snelson and Ghahramani, 2005). In contrast to the classification problem, we add one Batch normalization (BN) (Ioffe and Szegedy, 2015) layer (but without learned gain or bias terms) after the last fully connected layer. As can be seen in Figure 2, BayesBiNN with predictive mean produces much smoother results than STE since multiple predictive results are averaged. Uncertainty is low in areas with little noise and plenty of data points and high in areas with no data. Experimental details of the training process are provided in the Appendix A.1 in supplementary material.
4.2 Real Image Classification
In this subsection, we perform experiments on three benchmark datasets widely used for image classification, i.e., MNIST (LeCun and Cortes, 2010), CIFAR-10 (Krizhevsky and Hinton, 2009) and CIFAR-100 (Krizhevsky and Hinton, 2009).
For comparison, we also report results of three other optimizers for BiNNs, i.e., STE Adam, Bop and PMF, as well as Adam for full-precision weights.
MNIST
MNIST (LeCun and Cortes, 2010) is a large database of handwritten digits that is commonly used for image classification. The MNIST database contains 60,000 training grayscale images and 10,000 testing grayscale images, each of size 28 × 28 pixels. The network adopted is a multilayer perceptron (MLP) with three hidden layers of 2048 units and rectified linear unit (ReLU) (Alizadeh et al., 2019) activations. Both batch normalization (BN) (Ioffe and Szegedy, 2015) and dropout are used. BN is used without learned parameters (otherwise, a conventional optimizer such as Adam could be applied to them separately). No data augmentation is performed. The details of the experiments, including the detailed network architecture, values of all hyperparameters, as well as the training curves, are provided in Appendix A.2 in the supplementary material. As shown in Table 2, BayesBiNN achieves a top-1 test-set accuracy of 98.86 %, which is competitive with STE (98.85 %) and better than Bop (98.47 %), and approaches the performance of a full-precision network (99.01 %) quite closely.
CIFAR-10
The CIFAR-10 dataset (Krizhevsky and Hinton, 2009) consists of natural colour images of size 32 × 32 pixels. Each image is classified into 1 of 10 classes, such as dog, cat, automobile, etc. The training set contains 50,000 images, while the test set contains 10,000 images.
For CIFAR-10, we use the BinaryConnect CNN network in Alizadeh et al. (2019), which has a VGG-like structure similar to that used in Helwegen et al. (2019). We perform standard data augmentation as follows (Graham, 2014): 4 pixels are padded on each side, a random 32 × 32 crop is applied, followed by a random horizontal flip. Note that no ZCA whitening is used as in Courbariaux et al. (2015); Alizadeh et al. (2019). The results are shown in Table 2. The details of the experiments, including the detailed network architecture, values of all hyperparameters, as well as the training curves, are provided in Appendix A.2 in the supplementary material.
CIFAR-100
The CIFAR-100 dataset (Krizhevsky and Hinton, 2009) is similar to CIFAR-10, except that it has 100 classes containing 600 images each, with 500 training images and 100 testing images per class. The network used is also the BinaryConnect CNN network in Alizadeh et al. (2019). The same standard data augmentation used for CIFAR-10 was used for CIFAR-100 (Graham, 2014). The results are shown in Table 2. The details of the experiments, including the detailed network architecture, values of all hyperparameters, as well as the training curves, are provided in Appendix A.2 in the supplementary material.
As shown in Table 2, BayesBiNN achieves the highest top-1 test-set accuracy of 73.00 %, which approaches the performance of a full-precision network (74.83 %) quite closely.
4.3 Continual learning with binary neural networks
The BayesBiNN approach not only obtains state-of-the-art performance but also generalizes to more applications. In this subsection, we show an application of uncertainty estimation in BiNNs to continual learning. We consider the popular benchmark of permuted MNIST (Goodfellow et al., 2013; Kirkpatrick et al., 2017; Nguyen et al., 2017; Zenke et al., 2017), where each dataset consists of labeled MNIST images whose pixels are permuted randomly. As in Nguyen et al. (2017), we used a fully connected single-head network with two hidden layers, each containing 100 hidden units with ReLU activations. No coreset is used in the current experiments. The details of the experiment, e.g., the network architecture and values of hyperparameters, are provided in Appendix A.3 in the supplementary material. As shown in Figure 4, BayesBiNN with prior, i.e., the posterior of previous tasks, achieves significant improvement in overcoming the catastrophic forgetting problem, which demonstrates the effectiveness of BayesBiNN in continual learning for BiNNs.
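For readers unfamiliar with the benchmark, the construction of permuted tasks can be sketched as follows. The seeding scheme, the helper names, and the synthetic stand-in image are assumptions for this illustration; the essential point is that each task applies one fixed pixel permutation to every image.

```python
import random


def make_permutation(n_pixels, seed):
    """Fixed random pixel permutation defining one task."""
    perm = list(range(n_pixels))
    random.Random(seed).shuffle(perm)
    return perm


def permute_image(flat_image, perm):
    """Apply the task permutation to a flattened image."""
    return [flat_image[j] for j in perm]


n_pixels = 28 * 28
perms = [make_permutation(n_pixels, seed=t) for t in range(3)]   # three tasks
image = [float(i % 256) for i in range(n_pixels)]                 # stand-in for an MNIST image
task_images = [permute_image(image, perm) for perm in perms]
```

Every task contains the same pixel values in a different arrangement, so the tasks are equally difficult but mutually interfering for a fully connected network.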
5 Conclusion
Binary neural networks (BiNNs) are computation-efficient and hardware-friendly, but their training is challenging since it involves a discrete optimization problem. In this paper, we propose a principled approach to train binary neural networks using the Bayesian learning rule. The resultant optimizer, which we call BayesBiNN, not only justifies some of the algorithmic choices made by existing methods such as STE and Bop but also facilitates their extension, e.g., enabling uncertainty estimation for continual learning to avoid catastrophic forgetting for BiNNs.
Appendix A Experimental details
In this section we list the exact training details for all experiments shown in the main text.
Note that after training BiNN with BayesBiNN, there are two ways to perform inference during test time:
(1). Mean: One method is to use the predictive mean, where we use Monte Carlo sampling to compute the predictive probabilities for each test sample $x$ as follows
$$\hat{p}_k \approx \frac{1}{C} \sum_{c=1}^{C} p\big(y = k \mid x, w^{(c)}\big) \tag{20}$$
where $w^{(c)} \sim q(w)$ are samples from the Bernoulli distributions with natural parameters $\lambda$ obtained by BayesBiNN.
(2). Mode: The other way is simply to use the mode of the posterior distribution $q(w)$, i.e., the sign of the posterior mean, $\hat{w} = \mathrm{sign}(\tanh(\lambda))$, to make predictions, i.e., $\hat{p}_k = p(y = k \mid x, \hat{w})$.
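The two inference modes can be contrasted on a toy model. In the sketch below, `toy_model` is a hypothetical linear softmax classifier standing in for the BiNN, and all names and numbers are assumptions for illustration; only the sampling rule (each weight is +1 with probability given by the sigmoid of twice its natural parameter) and the mode rule (sign of the natural parameter) reflect the description above.

```python
import math
import random


def sample_binary_weights(lam, rng):
    """Draw w_j in {-1, +1} with P(w_j = +1) = sigmoid(2 * lam_j)."""
    return [1.0 if rng.random() < 1.0 / (1.0 + math.exp(-2.0 * l)) else -1.0 for l in lam]


def toy_model(w, x, n_classes):
    """Hypothetical linear softmax classifier with weights stored row by row."""
    d = len(x)
    logits = [sum(w[k * d + i] * x[i] for i in range(d)) for k in range(n_classes)]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_mean(lam, x, n_classes, n_samples, rng):
    """Average the class probabilities over Monte Carlo weight samples."""
    probs = [0.0] * n_classes
    for _ in range(n_samples):
        p = toy_model(sample_binary_weights(lam, rng), x, n_classes)
        probs = [a + b / n_samples for a, b in zip(probs, p)]
    return probs


def predict_mode(lam, x, n_classes):
    """Use the single most likely weight vector, sign(tanh(lam)) = sign(lam)."""
    w_hat = [1.0 if l >= 0 else -1.0 for l in lam]
    return toy_model(w_hat, x, n_classes)


rng = random.Random(0)
lam = [1.5, 0.2, -0.1, -2.0]      # 2 classes x 2 inputs
x = [1.0, 0.5]
p_mean = predict_mean(lam, x, n_classes=2, n_samples=10, rng=rng)
p_mode = predict_mode(lam, x, n_classes=2)
```

The mode gives one deterministic prediction, whereas the mean averages over sampled weight configurations and therefore tends to be less extreme when some natural parameters are close to zero.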
A.1 Synthetic Data
Binary Classification
We used the Two Moons dataset with 100 data points in each class and added Gaussian noise with standard deviation 0.1 to each point. We trained a Multilayer Perceptron (MLP) with two hidden layers of 64 units and tanh activation functions for 3000 epochs, using Cross Entropy as the loss function. Additional train and test settings with respect to the optimizers are detailed in Table 3. The learning rate was decayed at fixed epochs by the specified learning rate decay rate. For the STE baseline, we used the Adam optimizer with standard settings.
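Table 3: Train and test settings for the binary classification experiment on the Two Moons dataset.
Setting  BayesBiNN  STE 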

Learning rate  
Learning rate decay  0.1  0.1 
Learning rate decay epochs  [1500, 2500]  [1500, 2500] 
Momentum(s)  0.99  0.9, 0.999 
MC train samples  5   
MC test samples  0/10   
Temperature  1   
Prior    
Initialization  randomly   
Regression
We used the Snelson dataset (Snelson and Ghahramani, 2005) with 200 data points to train a regression model. Similar to the binary classification experiment, we used an MLP with two hidden layers of 64 units and tanh activation functions, but trained it for 5000 epochs using mean squared error as the loss function. Additionally, we added a batch normalization layer (without learned gain or bias terms) after the last fully connected layer. The learning rate is adjusted after every epoch to slowly anneal from an initial learning rate $\alpha_0$ to a target learning rate $\alpha_T$ at the maximum epoch $T$ using
$$\alpha_{t+1} = \alpha_t \left(\frac{\alpha_T}{\alpha_0}\right)^{1/T} \tag{21}$$
The learning rates and other train and test settings are detailed in Table 4.
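One simple way to anneal between two learning rates is a geometric schedule, in which the rate is multiplied by the same factor after every epoch. The sketch below assumes such a constant-ratio schedule; the function name and the start and end values are hypothetical and do not come from Table 4.

```python
def annealed_learning_rates(lr_start, lr_end, n_epochs):
    """Learning rates for epochs 0..n_epochs under a constant per-epoch decay ratio."""
    ratio = (lr_end / lr_start) ** (1.0 / n_epochs)
    rates = [lr_start]
    for _ in range(n_epochs):
        rates.append(rates[-1] * ratio)
    return rates


# Hypothetical start and end values, chosen only to exercise the schedule.
rates = annealed_learning_rates(lr_start=1e-3, lr_end=1e-5, n_epochs=5000)
```

The schedule starts at `lr_start`, decreases monotonically, and reaches `lr_end` at the final epoch up to floating-point error.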
Table 4: Train and test settings for the regression experiment on the Snelson dataset.
Setting  BayesBiNN  STE 
Learning rate start  
Learning rate end  
Momentum(s)  0.99  0.9, 0.999 
MC train samples  1   
MC test samples  0/10   
Temperature  1   
Prior    
Initialization  randomly   
A.2 MNIST, CIFAR-10 and CIFAR-100
In this section, three well-known real image datasets are considered, i.e., the MNIST, CIFAR-10 and CIFAR-100 datasets. We compare the proposed BayesBiNN with four other popular algorithms, i.e., STE Adam, Bop and PMF for BiNNs, as well as Adam for full-precision weights. Here, we detail the dataset- and algorithm-specific settings; see Table 7.
MNIST
All algorithms have been trained using the same MLP detailed in Table 5 on mini-batches of size 100, for a maximum of 500 epochs. The loss used was categorical cross-entropy. We split the original training data into 90% train and 10% validation data, and no data augmentation except normalization has been done. We report the best accuracy (averaged over 5 random runs) on the test set corresponding to the highest validation accuracy achieved during training (we do not retrain using the validation set). In addition, we tune the hyperparameters such as the learning rate for all the methods including the baselines. The same search space for the learning rate is used for all methods.
Table 5: MLP architecture used for MNIST.
Dropout (p = 0.2) 
Fully Connected Layer (units = 2048, bias = False) 
ReLU 
Batch Normalization Layer (gain = 1, bias = 0) 
Dropout (p = 0.2) 
Fully Connected Layer (units = 2048, bias = False) 
ReLU 
Batch Normalization Layer (gain = 1, bias = 0) 
Dropout (p = 0.2) 
Fully Connected Layer (units = 2048, bias = False) 
ReLU 
Batch Normalization Layer (gain = 1, bias = 0) 
Dropout (p = 0.2) 
Fully Connected Layer (units = 2048, bias = False) 
Batch Normalization Layer (gain = 1, bias = 0) 
Softmax 
CIFAR10 and CIFAR100
We trained all algorithms on the Convolutional Neural Network (CNN) architecture detailed in Table 6 on mini-batches of size 50 for a maximum of 500 epochs. The loss was categorical cross-entropy. We split the original training data into 90% training and 10% validation data. For data augmentation during training, the images were normalized, a random 32 × 32 crop was selected from a 40 × 40 padded image, and finally a random horizontal flip was applied. Following Osawa et al. (2019), we consider this data augmentation as effectively increasing the dataset size by a factor of 10 (four corner crops and one central crop give a factor of 5, and horizontal flipping doubles it, for a total factor of 10). We report the test accuracy (averaged over 5 random runs) at the point of highest validation accuracy during training; we do not retrain using the validation set. In addition, we tune hyperparameters such as the learning rate for all methods, including the baselines. The search space for the learning rate is set to be for all methods.
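The crop-and-flip augmentation described above can be sketched in plain Python. This is a minimal illustration, not the code used in the experiments: it assumes symmetric zero padding of 4 pixels per side (which takes a 32 × 32 image to 40 × 40), a single channel, and a flip probability of 0.5; the helper names are hypothetical.

```python
import random

def pad_image(image, pad):
    """Zero-pad a 2D image (list of rows) by `pad` pixels on every side."""
    width = len(image[0])
    blank = [0] * (width + 2 * pad)
    rows = [[0] * pad + list(row) + [0] * pad for row in image]
    top_bottom = [list(blank) for _ in range(pad)]
    return top_bottom + rows + [list(blank) for _ in range(pad)]

def random_crop_and_flip(image, crop=32, pad=4, rng=random):
    """Pad, take a random crop x crop window, then flip with probability 0.5."""
    padded = pad_image(image, pad)
    top = rng.randint(0, 2 * pad)   # randint is inclusive at both ends
    left = rng.randint(0, 2 * pad)
    window = [row[left:left + crop] for row in padded[top:top + crop]]
    if rng.random() < 0.5:
        window = [row[::-1] for row in window]  # horizontal flip
    return window

# Counting used for the effective dataset size: (4 corners + 1 centre) x 2 flips.
augmentation_factor = (4 + 1) * 2

rng = random.Random(0)
image = [[r * 32 + c for c in range(32)] for r in range(32)]
augmented = random_crop_and_flip(image, rng=rng)
```

The factor of 10 is a counting convention for the effective dataset size; the random crop itself can start at any of 9 × 9 offsets under these assumptions.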
Convolutional Layer (channels = 128, kernel size = 3 × 3, bias = False, padding = same)
ReLU
Batch Normalization Layer (gain = 1, bias = 0)
Convolutional Layer (channels = 128, kernel size = 3 × 3, bias = False, padding = same)
ReLU
Max Pooling Layer (size = 2 × 2, stride = 2 × 2)
Batch Normalization Layer (gain = 1, bias = 0)
Convolutional Layer (channels = 256, kernel size = 3 × 3, bias = False, padding = same)
ReLU
Batch Normalization Layer (gain = 1, bias = 0)
Convolutional Layer (channels = 256, kernel size = 3 × 3, bias = False, padding = same)
ReLU
Max Pooling Layer (size = 2 × 2, stride = 2 × 2)
Batch Normalization Layer (gain = 1, bias = 0)
Convolutional Layer (channels = 512, kernel size = 3 × 3, bias = False, padding = same)
ReLU
Batch Normalization Layer (gain = 1, bias = 0)
Convolutional Layer (channels = 512, kernel size = 3 × 3, bias = False, padding = same)
ReLU
Max Pooling Layer (size = 2 × 2, stride = 2 × 2)
Batch Normalization Layer (gain = 1, bias = 0)
Fully Connected Layer (units = 1024, bias = False)
ReLU
Batch Normalization Layer (gain = 1, bias = 0)
Fully Connected Layer (units = 1024, bias = False)
ReLU
Batch Normalization Layer (gain = 1, bias = 0)
Fully Connected Layer (units = 1024, bias = False)
Batch Normalization Layer (gain = 1, bias = 0)
Softmax
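As a sanity check on the CNN listed above, the feature-map sizes can be traced through the convolutional stack. The sketch assumes 32 × 32 inputs (the crop size used for training) and that the output of the last pooling layer is flattened before the first fully connected layer; the function name is hypothetical.

```python
def trace_feature_map(input_size=32):
    """Follow the spatial size and channel count through the convolutional stack."""
    size, shapes = input_size, []
    for channels in (128, 256, 512):
        # Two 3x3 convolutions with 'same' padding leave the spatial size unchanged.
        # The 2x2 max pooling with stride 2 then halves it.
        size = size // 2
        shapes.append((channels, size, size))
    return shapes

shapes = trace_feature_map()
channels, height, width = shapes[-1]
flattened_features = channels * height * width
```

Under these assumptions the three blocks produce 128 × 16 × 16, 256 × 8 × 8 and 512 × 4 × 4 feature maps, so the first fully connected layer would receive 8192 inputs.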
Algorithm  Setting  MNIST  CIFAR10  CIFAR100 

BayesBiNN  Learning rate start  
Learning rate end  
Learning rate decay  Cosine  Cosine  Cosine  
MC train samples  1  1  1  
MC test samples  0  0  0  
Temperature  
Prior  
Initialization  randomly  randomly  randomly  
STE Adam  Learning rate start  
Learning rate end  
Learning rate decay  Cosine  Cosine  Cosine  
Gradient clipping  Yes  Yes  Yes  
Weights clipping  Yes  Yes  Yes  
Bop  Threshold  
Adaptivity rate  
decay type  Step  Step  Step  
decay rate  0.1  0.1  
decay interval (epochs)  1  100  100  
PMF  Learning rate start  
Learning rate decay type  Step  Step  Step  
LR decay interval (iterations)  7k  30k  30k  
LRscale  0.2  0.2  0.2  
Optimizer  Adam  Adam  Adam  
Weight decay  0  
1.2  1.05  1.05  
Adam (Fullprecision)  Learning rate start  
Learning rate end  
Learning rate decay  Cosine  Cosine  Cosine 
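The two decay types named in the settings above, cosine and step, can be sketched as follows. This assumes the standard cosine annealing form and a multiplicative step decay; the exact variants used in the experiments are not stated here, and the numeric rates below are placeholders rather than the tuned values.

```python
import math

def cosine_lr(epoch, total_epochs, lr_start, lr_end):
    """Cosine annealing from lr_start (epoch 0) to lr_end (epoch total_epochs)."""
    progress = epoch / total_epochs
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))

def step_lr(step, lr_start, decay_rate, decay_interval):
    """Multiply the rate by decay_rate once every decay_interval steps."""
    return lr_start * decay_rate ** (step // decay_interval)

# Placeholder rates, for illustration only.
lr_start, lr_end = 1e-3, 1e-5
schedule = [cosine_lr(e, 500, lr_start, lr_end) for e in range(501)]
```

With a decay rate of 0.1 and an interval of 100 epochs, for example, `step_lr(250, 1.0, 0.1, 100)` has applied the decay twice. The same function covers an iteration-based interval (such as 30k iterations with a scale of 0.2) if `step` counts iterations instead of epochs.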
A.3 Continual learning with binary neural networks
For the continual learning experiment, we used a three-layer MLP, detailed in Table 8, and trained it using the categorical cross-entropy loss. Specific training parameters are given in Table 9. In the continual learning case, the original MNIST training data is not split into training and validation sets. No data augmentation was used except normalization.
Fully Connected Layer (units = 100, bias = False) 
ReLU 
Batch Normalization Layer (gain = 1, bias = 0) 
Fully Connected Layer (units = 100, bias = False) 
ReLU 
Batch Normalization Layer (gain = 1, bias = 0) 
Fully Connected Layer (units = 100, bias = False) 
ReLU 
Batch Normalization Layer (gain = 1, bias = 0) 
Softmax 
Algorithm  Setting  Permuted MNIST 

BayesBiNN  Learning rate start  
Learning rate end  
Learning rate decay  Cosine  
MC train samples  1  
MC test samples  100  
Temperature  
Prior  learned of the previous task  
Initialization  randomly  
Batch size  100  
Number of epochs  100 
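The "MC test samples" setting above refers to Monte Carlo prediction at test time. One way to read it, sketched here on a toy linear classifier, is to average class probabilities over binary weights sampled from independent Bernoulli distributions. The model, the function names and the probabilities are hypothetical; only the idea of averaging over sampled weights in {-1, +1} is assumed from the setting.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def sample_binary_weights(prob_plus_one, rng):
    """Draw each weight as +1 with its own probability, otherwise -1."""
    return [[1 if rng.random() < p else -1 for p in row] for row in prob_plus_one]

def mc_predict(x, prob_plus_one, num_samples, rng):
    """Average class probabilities over num_samples sampled weight matrices."""
    mean_probs = [0.0] * len(prob_plus_one)
    for _ in range(num_samples):
        weights = sample_binary_weights(prob_plus_one, rng)
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
        for k, p in enumerate(softmax(logits)):
            mean_probs[k] += p / num_samples
    return mean_probs

rng = random.Random(0)
prob_plus_one = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.5]]  # hypothetical: 2 classes x 3 inputs
probs = mc_predict([1.0, -1.0, 0.5], prob_plus_one, num_samples=100, rng=rng)
```

Averaging probabilities rather than logits keeps the result a valid distribution, and the spread across samples gives a simple indication of predictive uncertainty.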
Footnotes
 We use the Bop code available at https://github.com/plumerai/rethinkingbnnoptimization and the PMF code available at https://github.com/tajanthan/pmf.
References
 Mirror descent view for neural network quantization. arXiv preprint arXiv:1910.08237. Cited by: §3.2.
 Proximal meanfield for neural network quantization. In IEEE International Conference on Computer Vision, pp. 4871–4880. Cited by: §4.2.
 An empirical study of binary neural networks’ optimisation. ICLR. Cited by: Table 5, Table 6, §1, §4.2, §4.2, §4.2, §4.2.
 Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: Table 1, §1, §2, §4.1, §4.2.
 MeliusNet: can binary neural networks achieve mobilenetlevel accuracy?. arXiv preprint arXiv:2001.05936. Cited by: §1.1, §1, §2.
 A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78 (5), pp. 1103–1130. Cited by: §3.
 Binaryconnect: training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pp. 3123–3131. Cited by: §1.1, §1, §1, §2, §4.2, §4.2.
 Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830. Cited by: §1, §2.
 An empirical investigation of catastrophic forgetting in gradientbased neural networks. arXiv preprint arXiv:1312.6211. Cited by: §4.3.
 Spatiallysparse convolutional neural networks. arXiv preprint arXiv:1409.6070. Cited by: §4.2, §4.2.
 Latent weights do not exist: rethinking binarized neural network optimization. arXiv preprint arXiv:1906.02107. Cited by: Table 1, §1, §2, §3.2, §3.2, §4.2, §4.2.
 Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §4.1, §4.2.
 Categorical reparameterization with gumbelsoftmax. arXiv preprint arXiv:1611.01144. Cited by: §3.1.
 An introduction to variational methods for graphical models. Machine learning 37 (2), pp. 183–233. Cited by: §3.
 Conjugatecomputation variational inference: converting variational inference in nonconjugate models to inferences in conjugate models. arXiv preprint arXiv:1703.04265. Cited by: §3.1, §3.2.
 Fast and scalable bayesian deep learning by weightperturbation in adam. arXiv preprint arXiv:1806.04854. Cited by: §1.1.
 Learning-algorithms from Bayesian principles. Note: Available online at https://emtiyaz.github.io/papers/learning_from_bayes.pdf Cited by: §1.1, §1, §3.1.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §1.
 Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America 114 13, pp. 3521–3526. Cited by: §3.3, §3.3.
 Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §1, §4.3.
 Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §4.2, §4.2, §4.2.
 Crafting papers on machine learning. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), P. Langley (Ed.), Stanford, CA, pp. 1207–1216. Cited by: §5.
 MNIST handwritten digit database. Note: http://yann.lecun.com/exdb/mnist/ External Links: Link Cited by: §4.2, §4.2.
 Nondifferentiable optimization. Handbooks in operations research and management science 1, pp. 529–572. Cited by: §3.2.
 Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, pp. 345–353. Cited by: §1.1.
 The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712. Cited by: §3.1, §4.2.
 WRPN: wide reducedprecision networks. arXiv preprint arXiv:1709.01134. Cited by: §1.
 Two moons datasets description. Note: Available online at https://scikitlearn.org/stable/modules/generated/sklearn.datasets.make_moons.html Cited by: §4.1.
 Variational continual learning. arXiv preprint arXiv:1710.10628. Cited by: Table 8, §3.3, §4.3.
 Practical deep learning with bayesian principles. In Advances in Neural Information Processing Systems, pp. 4289–4301. Cited by: §A.2, §1.1.
 The information geometry of mirror descent. IEEE Transactions on Information Theory 61 (3), pp. 1451–1457. Cited by: §3.2.
 Xnornet: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §1.1, §1, §2.
 Learning discrete weights using the local reparameterization trick. arXiv preprint arXiv:1710.07739. Cited by: §1.1.
 Sparse Gaussian processes using pseudo-inputs. In Proceedings of the 18th International Conference on Neural Information Processing Systems, NIPS'05, Cambridge, MA, USA, pp. 1257–1264. Cited by: §A.1, Table 4, Figure 2, §4.1.
 Variational optimization. arXiv preprint arXiv:1212.4507v2. Cited by: §1.1, §3.2.
 Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3–4), pp. 229–256. Cited by: §3.1.
 Understanding straightthrough estimator in training activation quantized neural nets. arXiv preprint arXiv:1903.05662. Cited by: §1, §2, §2.
 Optimal information processing and bayes’s theorem. The American Statistician 42 (4), pp. 278–280. Cited by: §3.
 Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 3987–3995. Cited by: §4.3.