Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
Abstract
We introduce a method to train Quantized Neural Networks (QNNs), neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At train-time the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy. Moreover, we quantize the parameter gradients to 6 bits as well, which enables gradient computation using only bit-wise operations. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved accuracy comparable to their 32-bit counterparts using only 4 bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.
Department of Electrical Engineering
Technion - Israel Institute of Technology
Haifa, Israel

Matthieu Courbariaux* matthieu.courbariaux@gmail.com
Department of Computer Science and Department of Statistics
Université de Montréal
Montréal, Canada

Daniel Soudry daniel.soudry@gmail.com
Department of Statistics
Columbia University
New York, USA

Ran El-Yaniv rani@cs.technion.ac.il
Department of Computer Science
Technion - Israel Institute of Technology
Haifa, Israel

Yoshua Bengio yoshua.umontreal@gmail.com
Department of Computer Science and Department of Statistics
Université de Montréal
Montréal, Canada
*Indicates first authors.
Keywords: Deep Learning, Neural Networks Compression, Energy Efficient Neural Networks, Computer vision, Language Models.
1 Introduction
Deep Neural Networks (DNNs) have substantially pushed Artificial Intelligence (AI) limits in a wide range of tasks, including but not limited to object recognition from images (Krizhevsky2012small; Szegedyetalarxiv2014), speech recognition (Hintonetal2012; SainathetalICASSP2013), statistical machine translation (DevlinetalACL2014; SutskeveretalNIPS2014; BahdanauetalICLR2015small), Atari and Go games (Mnihetal2015; Silveretal2016), and even computer generation of abstract art (Mordvintsevetal2015).
Training or even just using neural network (NN) algorithms on conventional general-purpose digital hardware (Von Neumann architecture) has been found highly inefficient due to the massive amount of multiply-accumulate operations (MACs) required to compute the weighted sums of the neurons’ inputs. Today, DNNs are almost exclusively trained on one or many very fast and power-hungry Graphic Processing Units (GPUs) (Coatesetal2013). As a result, it is often a challenge to run DNNs on target low-power devices, and substantial research efforts are invested in speeding up DNNs at run-time on both general-purpose (Vanhouckeetal2011; Gongetal2014; Romeroetal2014; Hanetal2015) and specialized computer hardware (Farabetetal2011a; Farabetetal2011b; Phametal2012; ChenetalACM2014; ChenetalIEEE2014; Esseretal2015).
The most common approach is to compress a trained (full precision) network. HashedNets (chen2015compressing) reduce model sizes by using a hash function to randomly group connection weights and force them to share a single parameter value. Gongetal2014 compressed deep convnets using vector quantization, which resulted in only a small accuracy loss. However, both methods focused only on the fully connected layers. A recent work by Han2015 successfully pruned several state-of-the-art large scale networks and showed that the number of parameters could be reduced by an order of magnitude.
Recent works have shown that more computationally efficient DNNs can be constructed by quantizing some of the parameters during the training phase. In most cases, DNNs are trained by minimizing some error function using Back-Propagation (BP) or related gradient descent methods. However, such an approach cannot be directly applied if the weights are restricted to binary values. SoudryetalNIPS2014small used a variational Bayesian approach with Mean-Field and Central Limit approximation to calculate the posterior distribution of the weights (the probability of each weight to be +1 or -1). During the inference stage (test phase), their method samples one binary network from this distribution and uses it to predict the targets of the test set (more than one binary network can also be used). Courbariaux2015 similarly used two sets of weights, real-valued and binary. They, however, updated the real-valued version of the weights by using gradients computed by applying forward and backward propagation with the set of binary weights (which was obtained by quantizing the real-valued weights to +1 and -1).
This study proposes a more advanced technique, referred to as Quantized Neural Network (QNN), for quantizing the neurons and weights during inference and training. In such networks, all MAC operations can be replaced with XNOR and population count (i.e., counting the number of ones in the binary number) operations. This is especially useful in QNNs with extremely low precision, for example when only 1 bit is used per weight and activation, leading to a Binarized Neural Network (BNN). The proposed method is particularly beneficial for implementing large convolutional networks whose neuron-to-weight ratio is very large.
This paper makes the following contributions:

We introduce a method to train Quantized Neural Networks (QNNs), neural networks with low precision weights and activations, at run-time, and when computing the parameter gradients at train-time. In the extreme case QNNs use only 1 bit per weight and activation (i.e., Binarized NN; see Section 2).

We conduct two sets of experiments, each implemented on a different framework, namely Torch7 and Theano, which show that it is possible to train BNNs on MNIST, CIFAR-10 and SVHN and achieve near state-of-the-art results (see Section 4). Moreover, we report results on the challenging ImageNet dataset using binary weights/activations as well as a quantized version of it (more than 1 bit).

We present preliminary results on quantized gradients and show that it is possible to use only 6 bits with only a small accuracy degradation.

We present results for the Penn Treebank dataset using language models (vanilla RNNs and LSTMs) and show that with 4-bit weights and activations, recurrent QNNs achieve accuracies similar to those of their 32-bit floating point counterparts.

We show that during the forward pass (both at run-time and train-time), QNNs drastically reduce memory consumption (size and number of accesses), and replace most arithmetic operations with bit-wise operations. A substantial increase in power efficiency is expected as a result (see Section 5). Moreover, a binarized CNN can lead to binary convolution kernel repetitions; we argue that dedicated hardware could exploit these repetitions to reduce the time complexity.

Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy (see Section 6).

The code for training and applying our BNNs is available online (both for the Theano framework, https://github.com/MatthieuCourbariaux/BinaryNet, and for the Torch framework, https://github.com/itayhubara/BinaryNet).
2 Binarized Neural Networks
In this section, we detail our binarization function, show how we use it to compute the parameter gradients, and how we backpropagate through it.
2.1 Deterministic vs Stochastic Binarization
When training a BNN, we constrain both the weights and the activations to either $+1$ or $-1$. Those two values are very advantageous from a hardware perspective, as we explain in Section 6. In order to transform the real-valued variables into those two values, we use two different binarization functions, as proposed by Courbariauxetal2015. The first binarization function is deterministic:

$$x^b = \mathrm{Sign}(x) = \begin{cases} +1 & \text{if } x \ge 0, \\ -1 & \text{otherwise,} \end{cases} \tag{1}$$

where $x^b$ is the binarized variable (weight or activation) and $x$ the real-valued variable. It is very straightforward to implement and works quite well in practice. The second binarization function is stochastic:

$$x^b = \begin{cases} +1 & \text{with probability } p = \sigma(x), \\ -1 & \text{with probability } 1 - p, \end{cases} \tag{2}$$

where $\sigma$ is the “hard sigmoid” function:

$$\sigma(x) = \mathrm{clip}\left(\frac{x+1}{2}, 0, 1\right) = \max\left(0, \min\left(1, \frac{x+1}{2}\right)\right). \tag{3}$$

This stochastic binarization is more appealing theoretically (see Section 4) than the sign function, but somewhat harder to implement as it requires the hardware to generate random bits when quantizing (torii2016asic). As a result, we mostly use the deterministic binarization function (i.e., the sign function), with the exception of activations at train-time in some of our experiments.
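As a rough illustration of the two binarization functions described above, the following NumPy sketch implements the sign rule and the hard-sigmoid-based stochastic rule. The function names, the use of NumPy, and the convention that an input of exactly zero maps to +1 are assumptions of this sketch rather than details taken from the released code.

```python
import numpy as np

def hard_sigmoid(x):
    """Hard sigmoid: clip((x + 1) / 2, 0, 1)."""
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def binarize_deterministic(x):
    """Sign binarization; zero is mapped to +1 (convention assumed here)."""
    return np.where(x >= 0, 1.0, -1.0)

def binarize_stochastic(x, rng):
    """Return +1 with probability hard_sigmoid(x), and -1 otherwise."""
    p = hard_sigmoid(x)
    return np.where(rng.random(x.shape) < p, 1.0, -1.0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
rng = np.random.default_rng(0)
det = binarize_deterministic(x)    # saturated and unsaturated inputs alike
sto = binarize_stochastic(x, rng)  # random only where |x| < 1
```

Inputs with magnitude at least 1 are binarized identically by both rules; only values inside (-1, 1) are affected by the random draw.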
2.2 Gradient Computation and Accumulation
Although our BNN training method utilizes binary weights and activations to compute the parameter gradients, the real-valued gradients of the weights are accumulated in real-valued variables, as per Algorithm 1. Real-valued weights are likely required for Stochastic Gradient Descent (SGD) to work at all. SGD explores the space of parameters in small and noisy steps, and that noise is averaged out by the stochastic gradient contributions accumulated in each weight. Therefore, it is important to maintain sufficient resolution for these accumulators, which at first glance suggests that high precision is absolutely required.
Moreover, adding noise to weights and activations when computing the parameter gradients provides a form of regularization that can help to generalize better, as previously shown with variational weight noise (Graves2011practical), Dropout (Srivastava14) and DropConnect (Wan+alICML2013small). Our method of training BNNs can be seen as a variant of Dropout, in which instead of randomly setting half of the activations to zero when computing the parameter gradients, we binarize both the activations and the weights.
2.3 Propagating Gradients Through Discretization
The derivative of the sign function is zero almost everywhere, making it apparently incompatible with back-propagation, since the exact gradients of the cost with respect to the quantities before the discretization (pre-activations or weights) are zero. Note that this limitation remains even if stochastic quantization is used. Bengioarxiv2013 studied the question of estimating or propagating gradients through stochastic discrete neurons. He found that the fastest training was obtained when using the “straight-through estimator,” previously introduced in Hinton’s lectures (HintonCoursera2012). We follow a similar approach but use the version of the straight-through estimator that takes into account the saturation effect, and uses deterministic rather than stochastic sampling of the bit. Consider the sign function quantization

$$q = \mathrm{Sign}(r),$$

and assume that an estimator $g_q$ of the gradient $\partial C / \partial q$ has been obtained (with the straight-through estimator when needed). Then, our straight-through estimator of $\partial C / \partial r$ is simply

$$g_r = g_q 1_{|r| \le 1}. \tag{4}$$

Note that this preserves the gradient information and cancels the gradient when $r$ is too large. Not cancelling the gradient when $r$ is too large significantly worsens performance. To better understand why the straight-through estimator works well, consider the stochastic binarization scheme in Eq. (2) and rewrite $\sigma(r) = (\mathrm{HT}(r) + 1)/2$, where $\mathrm{HT}(\cdot)$ is the well-known “hard tanh”,

$$\mathrm{HT}(r) = \begin{cases} +1 & r > 1, \\ r & r \in [-1, 1], \\ -1 & r < -1. \end{cases} \tag{5}$$

In this case the input to the next layer has the following form,

$$W^b h^b(r) = W^b \mathrm{HT}(r) + n(r),$$

where we use the fact that $\mathrm{HT}(r)$ is the expectation over $h^b(r)$ (see Eqs. (2) and (5)), and define $n(r)$ as binarization noise with mean equal to zero. When the layer is wide, we expect the deterministic mean term to dominate, because the noise term is a summation over many independent binarizations from all the neurons in the previous layer. Thus, we argue that the binarization noise can be ignored when performing differentiation in the backward propagation stage. Therefore, we replace $\partial h^b(r) / \partial r$ (which cannot be computed) with

$$\frac{\partial \mathrm{HT}(r)}{\partial r} = \begin{cases} 0 & r > 1, \\ 1 & r \in [-1, 1], \\ 0 & r < -1, \end{cases} \tag{6}$$

which is exactly the straight-through estimator defined in Eq. (4). The use of this straight-through estimator is illustrated in Algorithm 1.
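The saturating straight-through estimator can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, the forward pass uses the deterministic sign function, and the backward pass passes the incoming gradient through unchanged wherever the pre-activation has magnitude at most 1.

```python
import numpy as np

def sign_forward(r):
    """Forward pass: deterministic sign binarization (zero mapped to +1)."""
    return np.where(r >= 0, 1.0, -1.0)

def sign_backward_ste(r, g_q):
    """Backward pass: g_r = g_q * 1[|r| <= 1] (gradient cancelled when saturated)."""
    return g_q * (np.abs(r) <= 1.0)

r = np.array([-1.5, -0.3, 0.0, 0.8, 2.0])    # pre-activations
g_q = np.array([0.1, -0.2, 0.3, 0.4, -0.5])  # gradient w.r.t. the binarized output
q = sign_forward(r)
g_r = sign_backward_ste(r, g_q)              # first and last entries are cancelled
```

The mask is the derivative of the hard tanh, which is why the same rule can be read either as a gated identity or as back-propagating through a hard tanh.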
A similar binarization process was applied for weights in which we combine two ingredients:

Project each real-valued weight to $[-1, 1]$, i.e., clip the weights during training, as per Algorithm 1. The real-valued weights would otherwise grow very large without any impact on the binary weights.

When using a weight $w^r$, quantize it using $w^b = \mathrm{Sign}(w^r)$.
Projecting the weights to $[-1, 1]$ is consistent with the gradient cancelling when $|r| > 1$, according to Eq. (4).
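The two ingredients above (binarize on use, clip after the update) can be sketched for a single linear layer as follows. Plain SGD with a hypothetical learning rate `lr` is assumed purely for brevity, and the function names are illustrative; the gradient computed with the binary weights is applied directly to the real-valued weights, which is valid here because the clipping keeps them inside [-1, 1].

```python
import numpy as np

def binarize(w):
    """Sign binarization of the real-valued (latent) weights."""
    return np.where(w >= 0, 1.0, -1.0)

def sgd_step(w_real, x, grad_out, lr):
    """One update of the latent weights for the layer y = x @ binarize(w_real)."""
    w_bin = binarize(w_real)
    y = x @ w_bin                    # forward pass uses binary weights only
    grad_w = x.T @ grad_out          # gradient w.r.t. the binary weights
    # Accumulate in the real-valued weights, then project back to [-1, 1].
    w_new = np.clip(w_real - lr * grad_w, -1.0, 1.0)
    return y, w_new

w_real = np.array([[0.9, -0.2], [-0.95, 0.4]])
x = np.array([[1.0, -1.0]])
grad_out = np.array([[-1.0, 1.0]])   # made-up upstream gradient
y, w_new = sgd_step(w_real, x, grad_out, lr=0.2)
```

In this toy step two of the four latent weights hit the clipping boundary, while the binary weights seen by the forward pass stay in {-1, +1} throughout.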
2.4 Shiftbased Batch Normalization
Batch Normalization (BN) (Ioffe+Szegedy2015) accelerates the training and reduces the overall impact of the weight scale (Courbariauxetal2015). The normalization procedure may also help to regularize the model. However, at train-time, BN requires many multiplications (calculating the standard deviation and dividing by it, namely, dividing by the running variance, which is the weighted mean of the training set activation variance). Although the number of scaling calculations is the same as the number of neurons, in the case of ConvNets this number is quite large. For example, in the CIFAR-10 dataset (using our architecture), the first convolution layer, consisting of only $128 \times 3 \times 3$ filter masks, converts an image of size $3 \times 32 \times 32$ to size $128 \times 28 \times 28$, which is almost two orders of magnitude larger than the number of weights (87.1 times, to be exact). To achieve the results that BN would obtain, we use a shift-based batch normalization (SBN) technique, presented in Algorithm 4. SBN approximates BN almost without multiplications. Define $\mathrm{AP2}(z)$ as the approximate power-of-2 of $z$ (i.e., the index of the most significant bit (MSB)), and $\ll\gg$ as both left and right binary shift. SBN replaces almost all multiplications with power-of-2 approximations and shift operations:

$$x \times y \rightarrow x \ll\gg \mathrm{AP2}(y). \tag{7}$$

The only operation which is not a binary shift or an add is the inverse square root (see the normalization operation in Algorithm 4). From the early work of lomont2003fast we know that the inverse square root operation could be applied with approximately the same complexity as multiplication. There are also faster methods, which involve lookup table tricks that typically obtain lower accuracy (this may not be an issue, since our procedure already adds a lot of noise). However, the number of values on which we apply the inverse square root operation is rather small, since it is done after calculating the variance, i.e., after averaging (for a more precise calculation, see the BN analysis in Lin2015). Furthermore, the size of the standard deviation vectors is relatively small. For example, these values make up only a small fraction of the network size (i.e., the number of learnable parameters) in the CIFAR-10 network we used in our experiments.
In the experiment we observed no loss in accuracy when using the shiftbased BN algorithm instead of the vanilla BN algorithm.
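To make the shift-based multiplication concrete, here is a small NumPy sketch of an approximate power-of-2 function and of a multiplication replaced by a shift. It assumes that AP2 rounds the base-2 logarithm to the nearest integer (taking the MSB index would correspond to flooring instead), and it emulates the binary shift on floating point values with `np.ldexp`; the function names are hypothetical.

```python
import numpy as np

def ap2_exponent(z):
    """Integer e such that 2**e approximates |z| (z must be nonzero)."""
    return np.round(np.log2(np.abs(z))).astype(np.int32)

def ap2(z):
    """Signed approximate power of two of z."""
    return np.sign(z) * np.ldexp(1.0, ap2_exponent(z))

def shift_mul(x, y):
    """Approximate x * y by shifting x by the AP2 exponent of y."""
    return np.sign(y) * np.ldexp(x, ap2_exponent(y))

z = np.array([0.3, 3.0, -6.0, 1.0])
approx = ap2(z)                                  # nearest signed powers of two
prod = shift_mul(np.array([5.0, -2.0]), np.array([3.0, -0.3]))
```

With this rounding rule the multiplier is off by at most a factor of about 1.41 (the square root of 2), which is the kind of noise the text argues the training procedure tolerates.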
2.5 Shift Based AdaMax
The ADAM learning method (kingma2014adam) also reduces the impact of the weight scale. Since ADAM requires many multiplications, we suggest using instead the shift-based AdaMax outlined in Algorithm 3. In the experiment we conducted, we observed no loss in accuracy when using the shift-based AdaMax algorithm instead of the vanilla ADAM algorithm.
2.6 First Layer
In a BNN, only the binarized values of the weights and activations are used in all calculations. As the output of one layer is the input of the next, the inputs of all the layers are binary, with the exception of the first layer. However, we do not believe this to be a major issue. First, in computer vision, the input representation typically has far fewer channels (e.g., red, green and blue) than internal representations (e.g., 512). Consequently, the first layer of a ConvNet is often the smallest convolution layer, both in terms of parameters and computations (Szegedyetalarxiv2014). Second, it is relatively easy to handle continuous-valued inputs as fixed point numbers, with $m$ bits of precision. For example, in the common case of 8-bit fixed point inputs:

$$s = x \cdot w^b, \qquad s = \sum_{n=1}^{8} 2^{n-1} (x^n \cdot w^b), \tag{8}$$

where $x$ is a vector of 1024 8-bit inputs, $x_1^8$ is the most significant bit of the first input, $w^b$ is a vector of 1024 1-bit weights, and $s$ is the resulting weighted sum. This method is used in Algorithm 4.
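The bit-plane decomposition of the first layer can be checked numerically. The sketch below assumes unsigned 8-bit inputs and weights in {-1, +1}, and the helper name `bitplane_dot` is hypothetical; it splits the inputs into eight 1-bit planes, takes one dot product per plane, and recombines them with powers of two.

```python
import numpy as np

def bitplane_dot(x, w, bits=8):
    """Weighted sum of `bits`-bit unsigned inputs, computed one bit plane at a time."""
    s = 0
    for n in range(1, bits + 1):
        plane = (x >> (n - 1)) & 1            # n-th bit of every input (n = 1 is the LSB)
        s += (1 << (n - 1)) * int(plane @ w)  # one binary-input dot product per plane
    return s

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=1024)           # 1024 unsigned 8-bit inputs
w = rng.choice([-1, 1], size=1024)            # 1024 1-bit weights
s = bitplane_dot(x, w)
```

The result matches the ordinary integer dot product `x @ w`, at the cost of eight passes over a 1-bit kernel instead of one pass over an 8-bit one.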
3 Quantized Neural Networks: More than 1-bit
Observing Eq. (8), we can see that using 2-bit activations simply doubles the number of times we need to run our XNOR-popcount kernel (i.e., the cost is directly proportional to the activation bitwidth). This idea was recently proposed by zhou2016dorefa (DoReFa net) and miyashita2016convolutional (published on arXiv shortly after our preliminary technical report was published there). However, in contrast to zhou2016dorefa, we did not find it useful to initialize the network with weights obtained by training the network with full precision weights. Moreover, the zhou2016dorefa network did not quantize the weights of the first convolutional layer and the last fully-connected layer, whereas we binarized both. We followed the quantization schemes suggested by miyashita2016convolutional, namely, linear quantization:

$$\mathrm{LinearQuant}(x, \mathrm{bitwidth}) = \mathrm{Clip}\left(\mathrm{round}\left(\frac{x}{\mathrm{bitwidth}}\right) \times \mathrm{bitwidth}, \mathrm{minV}, \mathrm{maxV}\right) \tag{9}$$

and logarithmic quantization:

$$\mathrm{LogQuant}(x, \mathrm{bitwidth}) = \mathrm{Clip}\left(\mathrm{AP2}(x), \mathrm{minV}, \mathrm{maxV}\right) \tag{10}$$

where $\mathrm{minV}$ and $\mathrm{maxV}$ are the minimum and maximum scale range respectively, and $\mathrm{AP2}(x)$ is the approximate power-of-2 of $x$ as described in Section 2.4. In our experiments (detailed in Section 4) we applied the above quantization schemes on the weights, activations and gradients and tested them on the more challenging ImageNet dataset.
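A minimal NumPy sketch of the two quantizers follows. It is an illustration under stated assumptions rather than the paper's exact procedure: the linear quantizer rounds to multiples of a hypothetical `step` parameter (how the step relates to the bit width is not specified here), the logarithmic quantizer rounds the base-2 logarithm to the nearest integer, and both clip to a range `[min_v, max_v]`.

```python
import numpy as np

def linear_quant(x, step, min_v, max_v):
    """Round to the nearest multiple of `step`, then clip to [min_v, max_v]."""
    return np.clip(np.round(x / step) * step, min_v, max_v)

def ap2(x):
    """Signed nearest power of two (x must be nonzero)."""
    return np.sign(x) * 2.0 ** np.round(np.log2(np.abs(x)))

def log_quant(x, min_v, max_v):
    """Replace each value by a signed power of two, then clip to [min_v, max_v]."""
    return np.clip(ap2(x), min_v, max_v)

lin = linear_quant(np.array([0.12, 0.38, 0.9, -2.0]), step=0.25, min_v=-1.0, max_v=1.0)
log = log_quant(np.array([0.3, 3.0, -0.07]), min_v=-1.0, max_v=1.0)
```

The linear grid has uniform spacing, whereas the logarithmic grid is dense near zero and coarse for large magnitudes, which fits heavy-tailed quantities such as gradients.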
4 Benchmark Results
4.1 Results on MNIST, SVHN, and CIFAR-10
Table 1: Classification test error rates on MNIST, SVHN and CIFAR-10.

| Data set | MNIST | SVHN | CIFAR-10 |
|---|---|---|---|
| Binarized activations+weights, during training and test | | | |
| BNN (Torch7) | 1.40% | 2.53% | 10.15% |
| BNN (Theano) | 0.96% | 2.80% | 11.40% |
| Committee Machines’ Array (Baldassi2015) | 1.35% | - | - |
| Binarized weights, during training and test | | | |
| BinaryConnect (Courbariauxetal2015) | 1.29 ± 0.08% | 2.30% | 9.90% |
| Binarized activations+weights, during test | | | |
| EBP (Chengetal2015) | 2.2 ± 0.1% | - | - |
| Bitwise DNNs (Kimetal2016) | 1.33% | - | - |
| Ternary weights, binary activations, during test | | | |
| (hwangetal2014) | 1.45% | - | - |
| No binarization (standard results) | | | |
| No reg | 1.3 ± 0.2% | 2.44% | 10.94% |
| Maxout Networks (Goodfellow2013a) | 0.94% | 2.47% | 11.68% |
| Gated pooling (leeetal2015) | - | 1.69% | 7.62% |
We performed two sets of experiments, each based on a different framework, namely Torch7 and Theano. Other than the framework, the two sets of experiments are very similar:

In both sets of experiments, we obtain near state-of-the-art results with BNNs on the MNIST, CIFAR-10 and SVHN benchmark datasets.

In our Torch7 experiments, the activations are stochastically binarized at train-time, whereas in our Theano experiments they are deterministically binarized.
Results are reported in Table 1. Implementation details are reported in Appendix A.
MNIST
MNIST is an image classification benchmark dataset (LeCun+98). It consists of a training set of 60K and a test set of 10K 28 × 28 gray-scale images representing digits ranging from 0 to 9. The Multi-Layer-Perceptron (MLP) we train on MNIST consists of 3 hidden layers. In our Theano implementation we used hidden layers of size 4096, whereas in our Torch implementation we used a much smaller size of 2048. This difference explains the accuracy gap between the two implementations.
CIFAR-10
CIFAR-10 is an image classification benchmark dataset. It consists of a training set of size 50K and a test set of size 10K, where instances are 32 × 32 color images representing airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships and trucks. Both implementations share the same structure as reported in Appendix A. Since the Torch implementation uses stochastic binarization, it achieved slightly better results.
SVHN
Street View House Numbers (SVHN) is also an image classification benchmark dataset. It consists of a training set of size 604K examples and a test set of size 26K, where instances are 32 × 32 color images representing digits ranging from 0 to 9. Here again we obtained a small improvement in performance by using the stochastic binarization scheme.
4.2 Results on ImageNet
To test the strength of our method, we applied it to the challenging ImageNet classification task, which is probably the most important classification benchmark dataset. It consists of a training set of size 1.2M samples and a test set of size 50K. Each instance is labeled with one of 1000 categories including objects, animals, scenes, and even some abstract shapes. On ImageNet, it is customary to report two error rates: top-1 and top-5, where the top-$k$ error rate is the fraction of test images for which the correct label is not among the $k$ labels considered most probable by the model. Considerable research has been concerned with compressing ImageNet architectures while preserving high accuracy. Previous approaches include pruning near-zero weights (Gongetal2014; han2015deep), using matrix factorization techniques (zhang2015efficient), quantizing the weights (gupta2015deep), using shared weights (chen2015compressing) and applying Huffman codes (han2015deep), among others.
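For concreteness, the top-k error rate defined above can be computed as in the following NumPy sketch. The function name `top_k_error` and the toy scores are hypothetical; `scores` holds one row of class scores per test image.

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Fraction of rows whose true label is not among the k highest-scoring classes."""
    topk = np.argsort(-scores, axis=1)[:, :k]        # indices of the k best classes
    hit = (topk == labels[:, None]).any(axis=1)      # True where the label is in the top k
    return 1.0 - hit.mean()

scores = np.array([[0.1, 0.5, 0.3, 0.1],
                   [0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.2, 0.5, 0.1]])
labels = np.array([2, 0, 3])
err1 = top_k_error(scores, labels, k=1)
err2 = top_k_error(scores, labels, k=2)
```

The top-k error can only decrease as k grows, which is why top-5 figures are always at least as favorable as top-1 figures for the same model.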
To the best of our knowledge, before the first revision of this paper was published on arXiv, no one had reported on successfully quantizing the network’s activations. On the contrary, a recent work (han2015deep) showed that accuracy significantly deteriorates when trying to quantize convolutional layers’ weights below 4 bits (FC layers are more robust to quantization and can operate quite well with only 2 bits). In the present work we attempted to tackle the difficult task of binarizing both weights and activations. Employing the well-known AlexNet and GoogleNet architectures, we applied our techniques and achieved 41.8% top-1 and 67.1% top-5 accuracy using AlexNet and 47.1% top-1 and 69.1% top-5 accuracy using GoogleNet. While these performance results leave room for improvement (relative to full precision nets), they are by far better than all previous attempts to compress ImageNet architectures using less than 4-bit precision for the weights. Moreover, this advantage is achieved while also binarizing neuron activations.
4.3 Relaxing “hard tanh” boundaries
We discovered that after training the network it is useful to widen the “hard tanh” boundaries and retrain the network. As explained in Section 2.3, the straight-through estimator (which can be written as “hard tanh”) cancels gradients coming from neurons with absolute values higher than 1. Hence, towards the last training iterations most of the gradient values are zero and the weight values cease to update. By relaxing the “hard tanh” boundaries we allow more gradients to flow in the back-propagation phase and improve the top-1 accuracy of the AlexNet topology using vanilla BNN.
4.4 2bit activations
While training BNNs on the ImageNet dataset we noticed that we could not force the training set error rate to converge to zero. In fact the training error rate stayed fairly close to the validation error rate. This observation led us to investigate a more relaxed activation quantization (more than 1 bit). As can be seen in Table 2, the results are quite impressive: using only 1-bit weights and 2-bit activations, top-1 accuracy drops by approximately 5.6% relative to the floating point representation (51.03% versus 56.6%). Following miyashita2016convolutional, we also tried quantizing the gradients and discovered that only logarithmic quantization works. With 6-bit gradients we observed only a small degradation. Those results are presently state-of-the-art, surpassing those obtained by the DoReFa net (zhou2016dorefa). As opposed to DoReFa, we utilized a deterministic quantization process rather than a stochastic one. Moreover, it is important to note that while quantizing the gradients, DoReFa assigns each instance in a mini-batch its own scaling factor, which increases the number of MAC operations.
While AlexNet can be compressed rather easily, compressing GoogleNet is much harder due to its small number of parameters. When using vanilla BNNs, we observed a large degradation in the top-1 results. However, by using QNNs with 4-bit weights and activations, we were able to achieve 66.5% top-1 accuracy (a drop of roughly 5% compared to the 71.6% of our 32-bit floating point implementation), which is the current state-of-the-art compression result over GoogleNet. Moreover, by using QNNs with 6-bit weights, activations and gradients we achieved 66.4% top-1 accuracy. Full implementation details of our experiments are reported in Appendix A.6.
Table 2: ImageNet classification accuracy using AlexNet.

| Model | Top-1 | Top-5 |
|---|---|---|
| Binarized activations+weights, during training and test | | |
| BNN | 41.8% | 67.1% |
| Xnor-Nets (a) (rastegari2016xnor) | 44.2% | 69.2% |
| Binary weights and quantized activations, during training and test | | |
| QNN 2-bit activation | 51.03% | 73.67% |
| DoReFaNet 2-bit activation (a) (zhou2016dorefa) | 50.7% | 72.57% |
| Quantized weights, during test | | |
| Deep Compression 4/2-bit (conv/FC layer) (han2015deep) | 55.34% | 77.67% |
| (gysel2016hardware), 2-bit | 0.01% | - |
| No quantization (standard results) | | |
| AlexNet, our implementation | 56.6% | 80.2% |

(a) First and last layers were not binarized (i.e., using 32-bit precision weights and activations).
Table 3: ImageNet classification accuracy using GoogleNet.

| Model | Top-1 | Top-5 |
|---|---|---|
| Binarized activations+weights, during training and test | | |
| BNN | 47.1% | 69.1% |
| Quantized weights and activations, during training and test | | |
| QNN 4-bit | 66.5% | 83.4% |
| Quantized activations, weights and gradients, during training and test | | |
| QNN 6-bit | 66.4% | 83.1% |
| No quantization (standard results) | | |
| GoogleNet, our implementation | 71.6% | 91.2% |
4.5 Language Models
Recurrent neural networks (RNNs) are very demanding in memory and computational power in comparison to feed-forward networks. There is a large variety of recurrent models, with the Long Short-Term Memory networks (LSTMs) introduced by hochreiter1997long being the most popular. LSTMs are a special kind of RNN, capable of learning long-term dependencies using unique gating mechanisms. Recently, ott2016recurrent tried to quantize the RNN weight matrices using techniques similar to those described in Section 2. They observed that the weight binarization methods do not work with RNNs. However, by using 2 bits (i.e., $-1, 0, 1$), they were able to achieve similar and even higher accuracy on several datasets. Here we report on the first attempt to quantize both weights and activations, evaluating the accuracy of quantized recurrent models trained on the Penn Treebank dataset. The Penn Treebank Corpus (marcus1993building) contains 10K unique words. We followed the same setting as in (mikolov2012context), which resulted in 18.55K words for the training set, and 14.5K and 16K words in the validation and test sets respectively. We experimented with both vanilla RNNs and LSTMs. For our vanilla RNN model we used one hidden layer of size 2048 and ReLU as the activation function. For our LSTM model we used 1 hidden layer of size 300. Our RNN implementation was constructed to predict the next character, hence performance was measured using the bits-per-character (BPC) metric. In the LSTM model we tried to predict the next word, so performance was measured using the perplexity per word (PPW) metric. Similar to (ott2016recurrent), our preliminary results indicate that binarization of weight matrices leads to large accuracy degradation. However, as can be seen in Table 4, with 4-bit activations and weights we can achieve accuracies similar to those of the 32-bit floating point counterparts.
5 High Power Efficiency during the Forward Pass
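Table 5: Energy consumption of multiply (MUL) and add (ADD) operations (Horowitz2014, 45nm technology).

| Operation | MUL | ADD |
|---|---|---|
| 8-bit Integer | 0.2pJ | 0.03pJ |
| 32-bit Integer | 3.1pJ | 0.1pJ |
| 16-bit Floating Point | 1.1pJ | 0.4pJ |
| 32-bit Floating Point | 3.7pJ | 0.9pJ |

Table 6: Energy consumption of memory accesses (Horowitz2014, 45nm technology).

| Memory size | 64-bit Cache |
|---|---|
| 8K | 10pJ |
| 32K | 20pJ |
| 1M | 100pJ |
| DRAM | 1.3-2.6nJ |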
Computer hardware, be it general-purpose or specialized, is composed of memories, arithmetic operators and control logic. During the forward pass (both at run-time and train-time), BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which might lead to vastly improved power-efficiency. Moreover, a binarized CNN can lead to binary convolution kernel repetitions, and we argue that dedicated hardware could exploit these repetitions to reduce the time complexity.
Memory Size and Accesses
Improving computing performance has always been and remains a challenge. Over the last decade, power has been the main constraint on performance (Horowitz2014). This is why considerable research efforts have been devoted to reducing the energy consumption of neural networks. Horowitz2014 provides rough numbers for the energy consumed by the computation (the given numbers are for 45nm technology), as summarized in Tables 5 and 6. Importantly, we can see that memory accesses typically consume more energy than arithmetic operations, and memory access cost increases with memory size. In comparison with 32-bit DNNs, BNNs require 32 times smaller memory size and 32 times fewer memory accesses. This is expected to reduce energy consumption drastically (i.e., by a factor larger than 32).
XNOR-Count
Applying a DNN mainly involves convolutions and matrix multiplications. The key arithmetic operation of deep learning is thus the multiply-accumulate operation. Artificial neurons are basically multiply-accumulators computing weighted sums of their inputs. In BNNs, both the activations and the weights are constrained to either $-1$ or $+1$. As a result, most of the 32-bit floating point multiply-accumulations are replaced by 1-bit XNOR-count operations. This could have a big impact on dedicated deep learning hardware. For instance, a 32-bit floating point multiplier costs about 200 Xilinx FPGA slices (Govinduetal2004; Beauchampetal2006), whereas a 1-bit XNOR gate only costs a single slice.
When using a ConvNet architecture with binary weights, the number of unique filters is bounded by the filter size. For example, in our implementation we use filters of size $3 \times 3$, so the maximum number of unique 2D filters is $2^9 = 512$. However, this should not prevent expanding the number of feature maps beyond this number, since the actual filter is a 3D matrix. Assuming we have $M_\ell$ filters in the $\ell$-th convolutional layer, we have to store a 4D weight matrix of size $M_\ell \times M_{\ell-1} \times k \times k$. Consequently, the number of unique filters is $2^{k^2 M_{\ell-1}}$. When necessary, we apply each filter on the map and perform the required multiply-accumulate (MAC) operations (in our case, using XNOR and popcount operations). Since we now have binary filters, many 2D filters of size $k \times k$ repeat themselves. By using dedicated hardware/software, we can apply only the unique 2D filters on each feature map and sum the results to receive each 3D filter’s convolutional result. Note that an inverse filter (i.e., $[-1, 1, -1]$ is the inverse of $[1, -1, 1]$) can also be treated as a repetition; it is merely a multiplication of the original filter by $-1$. For example, in our ConvNet architecture trained on the CIFAR-10 benchmark, there are only 42% unique filters per layer on average. Hence we can reduce the number of XNOR-popcount operations by a factor of 3.
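The counting argument for repeated 2D filters can be sketched as follows. This is an illustration only: the helper name `count_unique_2d_filters` is hypothetical, and the example weights are random rather than taken from a trained network, so the resulting counts say nothing about the figures reported for the CIFAR-10 model. A filter and its inverse are merged by flipping signs so that the first entry is always +1.

```python
import numpy as np

def count_unique_2d_filters(w):
    """w has shape (M_out, M_in, k, k) with entries in {-1, +1}.

    Returns (number of distinct 2D filters up to a global sign, total number of 2D filters).
    """
    flat = w.reshape(-1, w.shape[-2] * w.shape[-1])
    canon = flat * flat[:, :1]        # canonical form: first entry is always +1
    return len(np.unique(canon, axis=0)), flat.shape[0]

rng = np.random.default_rng(0)
w = rng.choice([-1, 1], size=(128, 128, 3, 3))
n_unique, n_total = count_unique_2d_filters(w)
```

For 3 × 3 binary filters there are at most 512 distinct patterns, or 256 once inverses are merged, so a layer holding 16,384 2D filters necessarily contains many repetitions.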
The complexity of QNNs scales up linearly with the number of bits per weight/activation, since it requires the application of the XNOR kernel several times (see Section 3). As of now, QNNs still supply the best compression to accuracy ratio. Moreover, quantizing the gradients allows us to use the XNOR kernel for the backward pass, leading to fully fixed point layers with low bitwidth. By accelerating the training phase, QNNs can play an important role in future power-demanding tasks.
6 Seven Times Faster on GPU at Run-Time
It is possible to speed up GPU implementations of QNNs by using a method sometimes called SIMD (single instruction, multiple data) within a register (SWAR). The basic idea of SWAR is to concatenate groups of 32 binary variables into 32-bit registers, and thus obtain a 32-times speed-up on bitwise operations (e.g., XNOR). Using SWAR, it is possible to evaluate 32 connections with only 3 instructions:

$$a_1 \mathrel{+}= \mathrm{popcount}(\mathrm{xnor}(a_0^{32b}, w_1^{32b})), \tag{11}$$

where $a_1$ is the resulting weighted sum, and $a_0^{32b}$ and $w_1^{32b}$ are the concatenated inputs and weights. Those 3 instructions (accumulation, popcount, xnor) take $1 + 4 + 1 = 6$ clock cycles on recent Nvidia GPUs (and if they were to become a fused instruction, it would only take a single clock cycle). Consequently, we obtain a theoretical Nvidia GPU speed-up of a factor of $32/6 \approx 5.3$. In practice, this speed-up is quite easy to obtain as the memory bandwidth to computation ratio is also increased 6 times.
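The SWAR idea can be emulated on the CPU with plain Python integers. The sketch below is not the GPU kernel: the helper names and the bit ordering are assumptions, and it adds an affine correction (twice the number of matching bits, minus 32) so that the popcount result equals the dot product of the original vectors with entries in {-1, +1}.

```python
import numpy as np

MASK32 = 0xFFFFFFFF

def pack32(v):
    """Pack 32 values in {-1, +1} into one integer; bit i is set when v[i] == +1."""
    word = 0
    for i, s in enumerate(v):
        if s == 1:
            word |= 1 << i
    return word

def xnor_popcount_dot(a_word, w_word):
    """Dot product of two packed {-1, +1} vectors of length 32."""
    matches = bin(~(a_word ^ w_word) & MASK32).count("1")  # popcount of the XNOR
    return 2 * matches - 32                                # matches minus mismatches

rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=32)
w = rng.choice([-1, 1], size=32)
dot = xnor_popcount_dot(pack32(a), pack32(w))
```

Each packed word replaces 32 multiply-accumulate operations with one XNOR, one popcount and one accumulation, which is the source of the speed-up discussed above.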
In order to validate those theoretical results, we programmed two GPU kernels:

An unoptimized matrix multiplication kernel that serves as our baseline.

The XNOR kernel, which is nearly identical to the baseline, except that it uses the SWAR method, as in Equation (11).
The two GPU kernels return identical outputs when their inputs are constrained to -1 or +1 (but not otherwise). The XNOR kernel is about 23 times faster than the baseline kernel and 3.4 times faster than cuBLAS, as shown in Figure 3. Last but not least, the MLP from Section 4 runs 7 times faster with the XNOR kernel than with the baseline kernel, without suffering any loss in classification accuracy (see Figure 3). As MNIST’s images are not binary, the first layer’s computations are always performed by the baseline kernel. The last three columns of Figure 3 show that the MLP accuracy does not depend on which kernel is used.
7 Discussion and Related Work
Until recently, the use of extremely low-precision networks (binary in the extreme case) was believed to substantially degrade the network performance (courbariaux+alTR2014). SoudryetalNIPS2014small and Chengetal2015 proved the contrary by showing that good performance could be achieved even if all neurons and weights are binarized to ±1. This was done using Expectation BackPropagation (EBP), a variational Bayesian approach, which infers networks with binary weights and neurons by updating the posterior distributions over the weights. These distributions are updated by differentiating their parameters (e.g., mean values) via the backpropagation (BP) algorithm. Esseretal2015 implemented a fully binary network at run time using a very similar approach to EBP, showing significant improvement in energy efficiency. The drawback of EBP is that the binarized parameters are only used during inference.
The probabilistic idea behind EBP was extended in the BinaryConnect algorithm of Courbariauxetal2015. In BinaryConnect, the real-valued version of the weights is saved and used as a key reference for the binarization process. The binarization noise is independent between different weights, either by construction (by using stochastic quantization) or by assumption (a common simplification; see spang1962reduction). The noise would have little effect on the next neuron’s input because the input is a summation over many weighted neurons. Thus, the real-valued version could be updated using the backpropagated error by simply ignoring the binarization noise in the update. With this method, Courbariauxetal2015 were the first to binarize weights in CNNs and achieved near state-of-the-art performance on several datasets. They also argued that noisy weights provide a form of regularization, which could help to improve generalization, as previously shown by Wan+alICML2013small. This method binarized weights while still maintaining full-precision neurons.
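The update rule described above (binarize for the forward pass, then apply the gradient to the stored real-valued weights as if no binarization noise were present) can be sketched for a single linear unit. The sketch below is a toy illustration, not the BinaryConnect implementation: the hard-sigmoid probability, the clipping of the real-valued weights to [-1, 1], the squared-error loss and all function names are assumptions made for this example.

```python
import random

def hard_sigmoid(x):
    """Piecewise-linear squashing of x into [0, 1] (assumed form)."""
    return min(1.0, max(0.0, (x + 1.0) / 2.0))

def binarize_stochastic(w, rng):
    """Return +1 with probability hard_sigmoid(w), otherwise -1."""
    return 1.0 if rng.random() < hard_sigmoid(w) else -1.0

def binarize_deterministic(w):
    """Return the sign of w, mapping 0 to +1."""
    return 1.0 if w >= 0 else -1.0

def train_step(real_w, x, target, lr, rng=None):
    """One SGD step on squared error for a single linear unit with binary weights."""
    if rng is not None:
        bin_w = [binarize_stochastic(w, rng) for w in real_w]
    else:
        bin_w = [binarize_deterministic(w) for w in real_w]
    y = sum(wb * xi for wb, xi in zip(bin_w, x))  # forward pass uses binary weights
    err = y - target
    # The gradient with respect to the binary weights is applied directly to the
    # real-valued weights, i.e. the binarization noise is ignored in the update.
    new_w = [min(1.0, max(-1.0, w - lr * err * xi)) for w, xi in zip(real_w, x)]
    return new_w, 0.5 * err * err

w = [0.1, -0.2]
for step in range(3):
    w, loss = train_step(w, [1.0, 1.0], target=2.0, lr=0.1)
    print(step, w, loss)
```

Small real-valued updates accumulate until a weight crosses zero, at which point its binary version flips; this is why the real-valued copy has to be kept during training.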
Linetal2015 carried over the work of Courbariauxetal2015 to the backpropagation process by quantizing the representations at each layer of the network, converting some of the remaining multiplications into binary shifts by restricting the neurons’ values to be power-of-two integers. Linetal2015’s work and ours seem to share similar characteristics. However, their approach continues to use full-precision weights during the test phase. Moreover, Linetal2015 quantize the neurons only during the backpropagation process, and not during forward propagation.
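The reason power-of-two values remove multiplications is that multiplying an integer by 2^e is a bit shift. The following Python sketch illustrates this under assumptions of our own: rounding is done in the log domain, the operand being shifted is an integer, and the function names are hypothetical rather than taken from Linetal2015.

```python
import math

def nearest_power_of_two_exponent(x):
    """Exponent e such that 2**e is closest to |x| in the log domain."""
    if x == 0:
        raise ValueError("zero has no power-of-two approximation")
    return round(math.log2(abs(x)))

def shift_multiply(value, x):
    """Approximate value * x for an integer value by shifting instead of multiplying."""
    e = nearest_power_of_two_exponent(x)
    # Left shift for exponents >= 0, right shift (with flooring) for negative ones.
    shifted = value << e if e >= 0 else value >> -e
    return shifted if x > 0 else -shifted

print(shift_multiply(12, 4.3))    # 4.3 is rounded to 2**2
print(shift_multiply(12, -0.26))  # -0.26 is rounded to -(2**-2)
```

The approximation error comes entirely from rounding `x` to a power of two; the shift itself is exact up to the flooring of right shifts.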
Other research (Baldassi2015) showed that full binary training and testing is possible in an array of committee machines with randomized input, where only one weight layer is being adjusted. Gongetal2014 aimed to compress a fully trained high-precision network by using quantization or matrix factorization methods. These methods required training the network with full-precision weights and neurons, thus requiring numerous MAC operations (which the proposed QNN algorithm avoids). hwangetal2014 focused on a fixed-point neural network design and achieved performance almost identical to that of the floating-point architecture. Kimetal2016 retrained neural networks with binary weights and activations.
As far as we know, before the first revision of this paper was published on arXiv, no work had succeeded in binarizing weights and neurons at both the inference phase and the entire training phase of a deep network. This was achieved in the present work. We relied on the idea that binarization can be done stochastically, or be approximated as random noise. This was previously done for the weights by Courbariauxetal2015, but our BNNs extend this to the activations. Note that the binary activations are especially important for ConvNets, where there are typically many more neurons than free weights. This allows highly efficient operation of the binarized DNN at run time, and at the forward-propagation phase during training. Moreover, our training method has almost no multiplications, and therefore might be implemented efficiently in dedicated hardware. However, we have to save the value of the full-precision weights. This is a remaining computational bottleneck during training, since it is an energy-consuming operation.
Shortly after the first version of this paper was posted on arXiv, several papers tried to improve and extend it. rastegari2016xnor made a small modification to our algorithm (namely multiplying the binary weights and input by their norm) and published promising results on the ImageNet dataset. Note that their method, named XNOR-Net, requires additional multiplication by a different scaling factor for each patch in each sample (rastegari2016xnor, Section 3.2, Eq. 10 and Figure 2). This in itself requires many multiplications and prevents efficient implementation of XNOR-Net on known hardware designs. Moreover, rastegari2016xnor did not quantize the first and last layers, therefore XNOR-Nets are only partially binarized NNs. miyashita2016convolutional suggested a more relaxed quantization (more than 1-bit) for both the weights and activations. Their idea was to quantize both and use shift operations as in our Eq. (4). They proposed to quantize the parameters in their non-uniform, base-2 logarithmic representation. This idea was inspired by the fact that the weights and activations in a trained network naturally have non-uniform distributions. They moreover showed that they can quantize the gradients as well to 6-bit without significant losses in performance (on the CIFAR-10 dataset). zhou2016dorefa applied similar ideas to the ImageNet dataset and showed that by using 1-bit weights, 2-bit activations and 6-bit gradients they can achieve top-1 accuracies using the AlexNet architecture. They named this method DoReFa net. Here we outperform DoReFa net and achieve using a 1-2-6 bit quantization scheme (weight-activation-gradients) and using a 1-2-32 quantization scheme. These results confirm that we can achieve comparable results even on a large dataset by applying the XNOR kernel several times. merolla2016deep showed that DNNs can be robust to more than just weight binarization. They applied several different distortions to the weights, including additive and multiplicative noise, and a class of non-linear projections. This was shown to improve robustness to other distortions and even boost results. zhengbinarized tried to apply our binarization scheme to recurrent neural networks for language modeling and achieved comparable results as well. andri2016yodann even created a hardware implementation to speed up BNNs.
Conclusion
We have introduced BNNs, which binarize deep neural networks and can lead to dramatic improvements in both power consumption and computation speed. During the forward pass (both at run-time and train-time), BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bitwise operations. Our estimates indicate that power efficiency can be improved by more than one order of magnitude (see Section 5). In terms of speed, we programmed a binary matrix multiplication GPU kernel that enabled running an MLP on the MNIST dataset 7 times faster (than with an unoptimized GPU kernel) without any loss of accuracy (see Section 6).
We have shown that BNNs can handle MNIST, CIFAR-10 and SVHN while achieving nearly state-of-the-art accuracy. While our results for the challenging ImageNet are not on par with the best results achievable with full-precision networks, they significantly improve on all previous attempts to compress ImageNet-capable architectures. Moreover, by quantizing the weights and activations to more than 1-bit (i.e., QNNs), we have been able to achieve comparable results to the 32-bit floating-point architectures (see Section 4.4 and supplementary material, Appendix B). A major open research avenue would be to further improve our results on ImageNet. Substantial progress in this direction might go a long way towards facilitating DNN usability in low-power instruments such as mobile phones.
Acknowledgments
We would like to express our appreciation to Elad Hoffer for his technical assistance and constructive comments. We thank our fellow MILA lab members who took the time to read the article and give us some feedback. We thank the developers of Torch (Torch2011), a Lua-based environment, and Theano (bergstra+al:2010scipy; BastienTheano2012), a Python library that allowed us to easily develop fast and optimized code for GPU. We also thank the developers of Pylearn2 (pylearn2_arxiv_2013) and Lasagne (dielemanetal2015), two deep learning libraries built on top of Theano. We thank Yuxin Wu for helping us compare our GPU kernels with cuBLAS. We are also grateful for funding from NSERC, the Canada Research Chairs, Compute Canada, CIFAR, IBM and Samsung. This research was supported by The Israel Science Foundation (grant No. 1890/14).
A Implementation Details
In this section we give full implementation details for our experiments on the MNIST, SVHN, CIFAR-10 and ImageNet datasets.
A.1 MLP on MNIST (Theano)
MNIST is an image classification benchmark dataset (LeCun+98). It consists of a training set of 60K and a test set of 10K 28 × 28 grayscale images representing digits ranging from 0 to 9. In order for this benchmark to remain a challenge, we did not use any convolution, data-augmentation, preprocessing or unsupervised learning. The Multi-Layer-Perceptron (MLP) we train on MNIST consists of 3 hidden layers of 4096 binary units and an L2-SVM output layer; L2-SVM has been shown to perform better than Softmax on several classification benchmarks (Tangwkshp2013; Leeetal2014). We regularize the model with Dropout (Srivastava14). The square hinge loss is minimized with the ADAM adaptive learning rate method (kingma2014adam). We use an exponentially decaying global learning rate, as per Algorithm 1, and also scale the learning rates of the weights with their initialization coefficients from (GlorotAISTATS2010small), as suggested by Courbariauxetal2015. We use Batch Normalization with a minibatch of size 100 to speed up the training. As is typical, we use the last 10K samples of the training set as a validation set for early stopping and model selection. We report the test error rate associated with the best validation error rate after 1000 epochs (we do not retrain on the validation set).
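For readers unfamiliar with the square hinge loss used by the L2-SVM output layer, a minimal Python sketch for one sample follows. It assumes one output score per class and a target of +1 for the true class and -1 for all others; this target encoding, the averaging over classes and the function name are assumptions of the sketch, not details stated in this section.

```python
def square_hinge_loss(scores, label):
    """Mean squared hinge loss for one sample.

    scores: one real-valued output per class.
    label:  index of the true class (target +1; every other class gets -1).
    """
    total = 0.0
    for k, s in enumerate(scores):
        t = 1.0 if k == label else -1.0
        margin = max(0.0, 1.0 - t * s)  # zero once the score is on the right side by 1
        total += margin * margin
    return total / len(scores)

print(square_hinge_loss([2.0, -1.5, 0.0], label=0))
```

Only the third class contributes in this example, because its score of 0.0 is not yet at least 1 below zero.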
A.2 MLP on MNIST (Torch7)
We use an architecture similar to that of our Theano experiments, without dropout, and with 2048 binary units per layer instead of 4096. Additionally, we use the shift-based AdaMax and BN (with a minibatch of size 100) instead of the vanilla implementations, to reduce the number of multiplications. Likewise, we decay the learning rate by using a 1-bit right shift every 10 epochs.
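Decaying the learning rate with a 1-bit right shift halves it without any multiplication. The sketch below shows the schedule on a fixed-point learning rate; the 16 fractional bits, the initial value and the function names are hypothetical choices for illustration, not values taken from the experiments.

```python
FRACTION_BITS = 16  # assumed fixed-point format (hypothetical)

def decayed_lr_fixed(initial_lr_fixed, epoch, epochs_per_shift=10):
    """Fixed-point learning rate after one right shift per elapsed decay period."""
    return initial_lr_fixed >> (epoch // epochs_per_shift)

def to_float(lr_fixed):
    """Convert the fixed-point learning rate to a floating-point value for display."""
    return lr_fixed / (1 << FRACTION_BITS)

initial = 1 << 10  # 2**-6 in this fixed-point format (hypothetical starting value)
for epoch in (0, 9, 10, 25):
    print(epoch, to_float(decayed_lr_fixed(initial, epoch)))
```

Setting `epochs_per_shift=50` gives the analogous schedule for a decay every 50 epochs.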
A.3 ConvNet on CIFAR-10 (Theano)
CIFAR-10 is an image classification benchmark dataset. It consists of a training set of size 50K and a test set of size 10K, where instances are 32 × 32 color images representing airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships and trucks. We do not use data-augmentation (which can really be a game changer for this dataset; see Graham2014). The architecture of our ConvNet is identical to that used by Courbariauxetal2015 except for the binarization of the activations. The Courbariauxetal2015 architecture is itself mainly inspired by VGG (Simonyan2015). The square hinge loss is minimized with ADAM. We use an exponentially decaying learning rate, as we did for MNIST. We scale the learning rates of the weights with their initialization coefficients from (GlorotAISTATS2010small). We use Batch Normalization with a minibatch of size 50 to speed up the training. We use the last 5000 samples of the training set as a validation set. We report the test error rate associated with the best validation error rate after 500 training epochs (we do not retrain on the validation set).
CIFAR-10 ConvNet architecture

Input: 32 × 32 RGB image
128 convolution layer
BatchNorm and Binarization layers
128 convolution and max-pooling layers
BatchNorm and Binarization layers
256 convolution layer
BatchNorm and Binarization layers
256 convolution and max-pooling layers
BatchNorm and Binarization layers
512 convolution layer
BatchNorm and Binarization layers
512 convolution and max-pooling layers
BatchNorm and Binarization layers
1024 fully connected layer
BatchNorm and Binarization layers
1024 fully connected layer
BatchNorm and Binarization layers
10 fully connected layer
BatchNorm layer (no binarization)
Cost: Mean square hinge loss
A.4 ConvNet on CIFAR-10 (Torch7)
We use the same architecture as in our Theano experiments. We apply shift-based AdaMax and BN (with a minibatch of size 200) instead of the vanilla implementations to reduce the number of multiplications. Likewise, we decay the learning rate by using a 1-bit right shift every 50 epochs.
A.5 ConvNet on SVHN
SVHN is also an image classification benchmark dataset. It consists of a training set of 604K examples and a test set of 26K examples, where instances are 32 × 32 color images representing digits ranging from 0 to 9. In both sets of experiments, we follow the same procedure used for the CIFAR-10 experiments, with a few notable exceptions: we use half the number of units in the convolution layers, and we train for 200 epochs instead of 500 (because SVHN is a much larger dataset than CIFAR-10).
A.6 ConvNet on ImageNet
The ImageNet classification task consists of a training set of 1.2M samples and a test set of 50K samples. Each instance is labeled with one of 1000 categories including objects, animals, scenes, and even some abstract shapes.
AlexNet:
Our AlexNet implementation consists of 5 convolution layers followed by 3 fully connected layers (see Section 8). Additionally, we use Adam as our optimization method and batch-normalization layers (with a minibatch of size 512). Likewise, we decay the learning rate by 0.1 every 20 epochs.
GoogleNet:
Our GoogleNet implementation consists of 2 convolution layers followed by 10 inception layers, spatial-average-pooling and a fully connected classifier. We also used the 2 auxiliary classifiers. Additionally, we use Adam (Kingma2015) as our optimization method and batch-normalization layers (with a minibatch of size 64). Likewise, we decay the learning rate by 0.1 every 10 epochs.
AlexNet ConvNet architecture

Input: RGB image
64 convolution and max-pooling layers
BatchNorm and Binarization layers
192 convolution and max-pooling layers
BatchNorm and Binarization layers
384 convolution layer
BatchNorm and Binarization layers
256 convolution layer
BatchNorm and Binarization layers
256 convolution layer
BatchNorm and Binarization layers
4096 fully connected layer
BatchNorm and Binarization layers
4096 fully connected layer
BatchNorm and Binarization layers
1000 fully connected layer
BatchNorm layer (no binarization)
SoftMax layer (no binarization)
Cost: Negative log likelihood