Minimizing Classification Energy of Binarized Neural Network Inference for Wearable Devices
Abstract
In this paper, we propose a low-power hardware for efficient deployment of binarized neural networks (BNNs) that have been trained on physiological datasets. BNNs constrain weights and featuremap to 1 bit, can pack as many 1-bit weights as the width of a memory entry provides, and can execute multiple multiply-accumulate (MAC) operations with one fused bitwise XNOR and population-count instruction over aligned packed entries. Our proposed hardware is scalable in the number of processing engines (PEs) and in the memory width, both of which are adjustable for the most energy-efficient configuration given an application. We implement two real case studies, Physical Activity Monitoring and Stress Detection, on our platform, and for each case study we seek the optimal PE and memory configurations on the target platform. Our implementation results indicate that a configuration with a good choice of memory width and number of PEs can reduce energy consumption by up to 4× and 2.5× on an Artix-7 FPGA and on a 65nm CMOS ASIC implementation, respectively. We also show that, generally, wider memories make more energy-efficient BNN processing hardware. To further reduce the energy, we introduce the PoolSkipping technique, which can skip at least 25% of the operations accompanied by a MaxPool layer in BNNs, leading to a total operation reduction of 22% in the Stress Detection case study. Compared to related works using the same case studies on the same target platforms and with the same classification accuracy, our hardware is 4.5× and 250× more energy efficient for Stress Detection on FPGA and Physical Activity Monitoring on ASIC, respectively.
Index Terms—Binarized Neural Networks, Low-Energy Wearable Devices, Machine Learning, ASIC, FPGA
I. Introduction
In recent years, the demand for wearable technology has increased dramatically with the advancement of technology and the growth of the Internet of Things (IoT). Wearable devices are extending their applications across diverse domains, from consumer electronics, such as smartwatches and activity trackers for monitoring one's fitness and wellness, to medical devices that track a patient's vital physiological signs such as heart rate and brain activity. Such devices usually process real-time data read continuously from multimodal sensors, and suffer from resource constraints and a limited battery budget due to their small size, online monitoring, and portability. Therefore, minimizing the power dissipation of these devices while meeting real-time requirements is a subject of interest [1, 2, 3, 4, 5].
Deep Neural Networks (DNNs) are among the effective machine learning (ML) approaches employed in the processing units of such devices. DNNs can extract feature information from raw multimodal time-series signals without much prior knowledge of the signal [2]. However, DNNs with high-precision weights require high-energy operations and large blocks of memory for storing and processing the DNN model. To tackle the problem of memory and power consumption, weight pruning and weight quantization techniques [6, 7, 8] as well as low-precision weight networks [9, 10, 11] have been proposed in recent years as effective approaches to compressing DNN models. [9], [10], and [11] proposed Binarized Neural Networks (BNN), BinaryConnect (BC), and Binary-Weight-Networks (BWN), respectively, all of which constrain every weight of the neural network to either +1 or −1, which can be stored in the smallest possible unit of memory, i.e., 1 bit (1 for +1 and 0 for −1), thereby allowing as many weights to be packed as a memory entry can accommodate. In BNNs, not only the weights but also the featuremap is constrained to +1 or −1, thus resulting in bitwise operations between each neuron output and every synapse weight. Due to these minimal aspects, binary-weight DNNs have gained significant attention recently. BNNs work well on small datasets [9], and we show in this work that they can be employed on physiological time-series data as well.
Time-series data classification in ML applications is conventionally handled with techniques such as Dynamic Time Warping (DTW), Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) [12]. These techniques have complicated algorithms and hardware implementations, and mainly deal with complex input data such as text or speech. Nevertheless, DNNs have also been used in the classification of time-series data [2, 13]. In this approach, the real-time data is cut and buffered into frames of samples that may or may not contain specific patterns, e.g., a seizure pattern, which annotate the frame with specific labels. The frame of temporal data can be multichannel and can be fed to a DNN as raw data or processed in the frequency domain, or represented as an image-like frame whose sequence of image rows follows the sequence of the sensors that capture the data, e.g., EEG sensors on the skull can represent brain activity spatially and temporally [14].
In this paper, a low-power scalable hardware is proposed for multimodal physiological data classification using BNNs. In its Verilog HDL, the number of PEs and the memory width can be adjusted to reach minimal energy consumption for a given application. The hardware is designed to be minimal, and makes extensive use of resource sharing and on-chip memories.
The main contributions of this paper include:

- Training BNNs for two physiological case studies: Physical Activity Monitoring and Stress Detection.

- Improving BNN accuracy by increasing the number of parameters to achieve the same accuracy as full precision.

- Proposing a scalable parallel hardware for BNN inference that can be configured for different numbers of PEs and memory widths, both adjustable for minimal classification energy given a case study.

- Proposing an operation-skipping technique, referred to as PoolSkipping, for the MaxPool layer in BNNs, which can skip more than 25% of the effectless operations accompanying MP and 22% of the BNN operations in total.

- Analyzing hardware parameters, including energy vs. memory width and energy vs. number of PEs, to find the optimal configuration per case study on FPGA and on ASIC.

- Synthesizing and implementing the platform on FPGA and as a post place-and-route ASIC layout in 65nm CMOS technology, and providing the energy and time reports of the two case studies.
II. Related Work
Several multimodal data classification methods have been proposed recently. [2] proposed a complexity-reduced Deep Convolutional Neural Network (DCNN) for physiological case studies and a 16-bit hardware platform for their inference. [13] proposed an architecture that uses a CNN per variable and tested it on the Congestive Heart Failure Detection and Physical Activity Monitoring datasets with detection accuracies of 94% and 93%, respectively. [15] proposed a scalable system that addresses concurrent activity recognition. Their model extracts temporal and spatial features from multimodal sensory data by means of a 7-layer CNN followed by an LSTM, and is tested on three activity datasets. [16] introduced DeepSense, a deep learning framework that integrates DNNs and RNNs to address feature customization challenges in sensory datasets. [17] investigated the opportunity to use deep learning for identifying non-intuitive features from cross-sensor correlations by means of an RBM. In [1], a kernel decomposing scheme for binary-weight networks is proposed that skips redundant computations and achieves 22% energy reduction on image classification. BiNMAC, a programmable manycore accelerator for BNNs designed for physiological and image processing case studies, is proposed in [18].
III. Binarized Neural Networks
BNNs constrain both the weights and the featuremap to either +1 or −1. Similar to full-precision DNNs, BNNs consist of layers equivalent to those of standard DNNs, albeit implemented rather differently. For the first layer of a BNN, data from each input channel are full-precision values, but the model weights are trained and constrained to either +1 or −1. Therefore, Multiply-Accumulate (MAC) operations are replaced by a series of ADDs/SUBs. For the next layers, all input features are binarized using a binary activation function (AF) and packed inside registers. Therefore, a MAC operation between two vectors that hold such packed weights can be performed by summing (considering 0 as −1) over the result of the bitwise XNOR of the two vectors. The summation over the bits of a register is referred to as population-count and denoted by pcnt in this work.
In order to apply the AF to the real-valued variables and transform them to either +1 or −1, a deterministic binarization function is proposed for the AF by [10], which is essentially a sign function as shown in equation (1), where x is the real-valued variable and x^b is its binarized value:

x^b = sign(x) = +1 if x ≥ 0, −1 otherwise   (1)
Thus, the feedforward path of a layer, traditionally denoted y = Wx + b in standard DNNs, can be formulated for BNNs by equation (2):

y = sign(pcnt(XNOR(W, X)))   (2)
where X and W are respectively tensors of input (featuremap) and weights with the bias implicitly included, XNOR performs bitwise operations on rows from W and columns from X, pcnt sums over the result of this bitwise operation, and sign is the AF applied to the scalar value generated by pcnt. As explained earlier, from a hardware point of view it is more efficient for both binarized weights and featuremap to be packed in memory entries. For example, one row of a binarized 128×128 matrix can be stored in 4 memory entries of a 32-bit machine, and the whole matrix can be packed inside 512 entries (2KB). Therefore, practically, in an M-bit computing machine, the population-count part of equation (2) should be performed per every M-bit chunk of aligned bitwise XNOR. Hence, before applying the activation function, the population-count needs to be carried out several times until all patch elements that contribute to generating one output value are fetched completely. This accumulation, before applying the AF, can be denoted by the following equation [9]:
Σ_i pcnt(XNOR(W_i, X_i))   (3)

where W_i and X_i are M-bit chunks of W and X that are sequentially read from packed memory entries of an M-bit machine until they are fetched completely. Only then can one infer y from equations (2) and (3).
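To make the packed dataflow of equations (2) and (3) concrete, the following Python sketch (our own illustrative naming, not the hardware itself; memory width M = 32 is assumed) accumulates the popcount of chunk-wise XNORs and applies the sign AF. It can be checked against a full-precision dot product of the same ±1 vectors:

```python
# Software model of equations (2)-(3). M is the machine/memory width in bits.
M = 32
MASK = (1 << M) - 1

def pack(values):
    """Pack a list of +1/-1 values into M-bit words (1 encodes +1, 0 encodes -1).
    The length is assumed to be a multiple of M for simplicity."""
    words = []
    for i in range(0, len(values), M):
        word = 0
        for j, v in enumerate(values[i:i + M]):
            word |= (1 if v == 1 else 0) << j
        words.append(word)
    return words

def binary_mac(w, x):
    """sign(dot(w, x)) computed as accumulated popcounts of chunk-wise XNORs."""
    acc = 0
    for wc, xc in zip(pack(w), pack(x)):
        acc += bin(~(wc ^ xc) & MASK).count("1")  # pcnt(XNOR(W_i, X_i))
    # acc counts agreeing bit positions, so dot = matches - mismatches = 2*acc - N
    return 1 if 2 * acc - len(w) >= 0 else -1     # sign AF of equation (1)
```

For ±1 vectors the popcount counts agreeing positions, so the real-valued pre-activation is 2·acc − N; the hardware only ever needs the accumulated popcount and a threshold comparison, never the multiplications themselves.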
For the last layer, weights can either be constrained to +1 or −1, in which case the layer is treated similarly to the previous layers, or trained with full-precision values which, with respect to the binarized activation values coming from the previous layer, convert the MAC operation into ADD/SUB over the weight values, analogous to the first layer. Binarizing the last layer of BNNs can sometimes result in failure, and therefore different techniques have been proposed to deal with the last layer [19, 20]. We use high-precision (16-bit) weights for the last layer of the first case study in this paper because it resulted in a stabilized BNN accuracy during training over 100 epochs.
MaxPool (MP) is a frequently employed layer in DNNs that subsamples the featuremap and is traditionally placed before the AF. It can be shown that if the AF is a non-descending function of its input, then placing MP after the AF yields an equivalent outcome. In BNNs, therefore:

MP(sign(x)) = sign(MP(x))   (4)
Placing MP after the AF in BNNs has two advantages: 1) a bulky comparator, which would have to operate on the real-valued results from the previous Conv/FC layer in order to execute MP before the AF, can be converted to an OR gate if MP follows an AF with binarized output; because, if x_i ∈ {−1, +1}, then max(x_1, x_2, …, x_p) = OR(x_1, x_2, …, x_p), where x_i and p are the elements and the pool size of the MP. 2) As soon as one of the elements of the pool, over which the MP seeks the maximum, is equal to the maximum possible value (+1 in BNNs), the pool exploration can be asserted as 'done', and the rest of the exploration in the pool can be skipped, which is conducive to skipping a large portion of the computation imposed by the preceding Conv/FC layer. We refer to this property as PoolSkipping and in Section VI show that it can skip up to 22% of the total operations for the first case study, which has 3 MP layers.
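The two properties above can be sketched as follows (illustrative Python with our own naming; compute_element stands in for whatever Conv/FC + sign work produces one pool element). With activations coded as 0/1 bits, MP degenerates to a bitwise OR, and pool exploration can terminate as soon as a 1 is produced:

```python
def maxpool_as_or(pool_bits):
    """With binarized activations coded 0/1 (for -1/+1), max over a pool is OR."""
    out = 0
    for b in pool_bits:
        out |= b
    return out

def maxpool_with_skipping(compute_element, pool_size):
    """PoolSkipping sketch: compute_element(i) models the full Conv/FC + sign
    computation of the i-th pool element. Stop as soon as it yields 1, since
    the pool maximum (+1) is then already known; remaining work is skipped."""
    evaluated = 0
    for i in range(pool_size):
        evaluated += 1
        if compute_element(i) == 1:
            return 1, evaluated   # 'done' asserted early; rest of pool skipped
    return 0, evaluated           # every element was -1 (coded 0)
```

In hardware, the OR collapses to a single gate after the Output cache, and the early-out corresponds to flushing the pipeline and striding past the rest of the pool, as described in Section V.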
Since in neural networks one operation is usually considered to be either one multiplication or one addition, each time either a bitwise XNOR or a pcnt is executed, the number of carried-out operations equals the number of vector bits processed. Consequently, in a BNN process, by fusing and well pipelining the pcnt and bitwise XNOR operations in an M-bit machine, one can achieve 2M operations per execution of equation (3). This implies that the performance of an M-bit BNN processing machine can be proportional to M, which decides its registers, data bus, memory width, etc. We evaluate this feature in Section VI.
IV. BNN for Physiological Case Studies
BNN models can compress the model size by 8 to 64 times, at the expense of some accuracy loss. The accuracy loss can usually be compensated by augmenting the BNN model size by 2–11× [20]. In order to present a fair comparison between our hardware and related works, in this section we introduce two case studies that were used in [2] and [18]. We use the Torch framework from [9] to train the BNN models and to extract the weights for inference on hardware.
IV-A. Case Study 1: Stress Detection
The Stress Detection dataset [21] includes physiological non-EEG signals that are labeled with four neurological stress statuses. The DCNN in [2] utilizes a cascade of 5 Conv layers of 16, 16, 8, and 8 filter banks and 2 FC layers of size 64 and number-of-classes, respectively. The input is cut into frames of 64 samples. All filters have a shape of 1×5. We conducted a set of experiments on binarizing this baseline configuration, abiding by the same number of layers but increasing the number of layer parameters in each experiment. When binarizing the original 16-bit model, the BNN model is compressed by 16 times, but the accuracy drops from the reported 93.8% to 81.8%. In order to compensate for the accuracy loss back to the baseline level, we increased the number of parameters by approximately 9.5×; still, the final binarized model is 1.7× smaller than the 16-bit baseline. Fig. 1 shows the incremental augmentation, the impact of binarizing the baseline DCNN, and the compensation in accuracy with the increase of binarized model size. In total, the binarized DCNN for Stress Detection has 98K parameters and requires 13KB of memory to store the model. Every classification of this dataset requires 7.32 million operations. Fig. 2A shows the BNN configuration and Table I summarizes the details for each case study after augmentation.
TABLE I: Case-study details after augmentation

Dataset   Input Size   # Classes   # Params   Model Size   # Ops   Accuracy
Stress    1×64×7       4           98K        13KB         7.32M   94.1%
PAMAP2    40×1×1       16          145K       18KB         0.29M   97.8%
IV-B. Case Study 2: Physical Activity Monitoring
The Physical Activity Monitoring dataset (PAMAP2) [22] contains data from 9 subjects recorded from sensors such as IMUs and a heart rate monitor, totaling 40 valid channels, labeled with 12 physical activities. We use a 3-layer binarized Multi-Layer Perceptron (MLP) as proposed in [18] for classification of the PAMAP2 dataset, with an average accuracy of 97.8%. In total, the BNN MLP has 145K parameters, storable in 18KB, and requires 290K operations per classification. Fig. 2B shows the configuration and Table I summarizes the details.
V. Scalable Hardware Platform
In this section, we introduce a minimal hardware architecture, scalable in the number of PEs and in memory width, which can be adjusted to match the time and energy requirements of a given application. For the parallelization scheme, we use output-channel tiling as in [23], in which every PE contributes an independent channel of the featuremap. This scheme has been shown to result in the best computation-to-memory-communication ratio [24]. It also facilitates implementing the PoolSkipping method with simple control logic.
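As a minimal sketch of the output-channel tiling scheme (our own illustration, not the Verilog itself): each PE is assigned an independent subset of output channels, so PEs share the input featuremap but never write the same output channel.

```python
def tile_output_channels(num_out_channels, num_pes):
    """Output-channel tiling: assign every num_pes-th output channel to the
    same PE, so each PE produces independent featuremap channels."""
    return [list(range(pe, num_out_channels, num_pes))
            for pe in range(num_pes)]
```

Because each output channel needs the whole input featuremap but only its own filter, this split maximizes compute per memory fetch, which is why this scheme gives a favorable computation-to-communication ratio [24].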
The high-level abstraction of the hardware is depicted in Fig. 3. The main components of the hardware include a Filter Memory to store the model weights, and an Input Memory and a FeatureMap Memory that hold the intermediate data and alternately switch tasks, i.e., one memory buffers the input data/featuremap to be exported to the PEs while the other imports the output featuremap from the PEs. All three memories are shared between the PEs and addressed by a global controller. The global controller is the only unit of the hardware that uses a multiplier, for address calculations.
Each PE comprises a Filter cache to temporarily store an individual filter for processing, and an Output cache to concatenate the single output bits and prepare them to be sent to the FeatureMap Memory along with those of the other PEs. Once this cache is full, all the PEs concatenate the first bit of their Output caches and send the packed packet out to the FeatureMap Memory; the next bits are then sent out accordingly. Resource sharing is carefully taken into consideration such that the major logic of every PE is only a pipeline of a bitwise XNOR, a pcnt, and an accumulator to satisfy equation (2). The accumulator ADDs/SUBs over either the input data w.r.t. the binarized packed filter for the first BNN layer, the output of pcnt for mid layers, or the filter data w.r.t. the binarized packed data for the last BNN layer. Two shifters, not depicted in Fig. 3 to avoid clutter, shift the packed filter/data for the first/last layers. A local controller, using multiplexers to allocate the existing resources, handles the dataflow such that FC and Conv layers with different strides are executed according to the BNN layer.
A bitwise OR gate is implemented after the Output cache to perform MP, and is bypassed if there is no MP. PoolSkipping is implemented in such a way that, given an MP, whenever a +1 appears in the pool under exploration, a +1 is quickly written to its relevant location in the Output cache, the pipeline is flushed, and the operation skips ahead by a stride as wide as the pool size. The would-be adjacent bits in the Output cache are left as 'don't care', as their old values will be masked by the effect of the written +1.
VI. Evaluation
VI-A. Parallelization
As mentioned in Section III, BNN inference performance on our M-bit datawidth machine is proportional to M, omitting the first/last layers that may use full-precision values. Meanwhile, with the split and tiled threads dispatched to P concurrent PEs, the computation can be expedited by P times if memory communication is disregarded. The clock frequency, f, is another factor that decides the computation rate. Therefore, in our hardware the performance is proportional to these three metrics:
Performance ∝ M × P × f   (5)
The performance of the first and the last layer, similar to traditional accelerators, is proportional to P × f only.
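Equation (5) can be turned into a rough peak-throughput estimate (our own sketch; it ignores memory stalls and the full-precision first/last layers, so it yields an upper bound rather than a prediction of the measured latencies):

```python
def peak_ops_per_second(m_bits, num_pes, freq_hz):
    """Idealized peak for the mid layers: each PE retires one fused
    XNOR + pcnt over an m_bits-wide chunk per cycle, i.e. 2*m_bits
    operations (Section III), across num_pes PEs."""
    return 2 * m_bits * num_pes * freq_hz

def min_classification_time(total_ops, m_bits, num_pes, freq_hz):
    """Lower bound on latency for a BNN needing total_ops operations."""
    return total_ops / peak_ops_per_second(m_bits, num_pes, freq_hz)
```

For the 7.32M-op Stress Detection BNN on an 8-PE, 128-bit, 100MHz configuration this bound is a few tens of microseconds, well below the measured 0.19ms; the gap comes from memory traffic and the non-binarized layers.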
VI-B. Memory Energy Evaluation
We will seek a tradeoff between M and P in the next section to minimize energy consumption, but beforehand we analyze a set of ARM memories generated with the Artisan Memory Compiler in 65nm at a frequency of 128MHz with various data widths. For each generated memory we recorded the energy per entry read/write and calculated the energy per bit read/write. Fig. 4 plots the results of this analysis and indicates that reading 1 bit of information, such as a BNN weight, is more energy efficient when wider memories are used.
VI-C. PoolSkipping
As explained in Section III, PoolSkipping can skip a large portion of the operations that precede an MP. For the Stress Detection case study, 3 MPs with a pool size of 1×2 follow 3 Conv layers that contribute 76.5% of the total operations. Fig. 5 shows the impact of PoolSkipping on the number of operations per layer and in total for the Stress Detection BNN. Applying PoolSkipping, the average number of effective operations is reduced by 22%.
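A back-of-envelope check of these numbers (our own arithmetic, assuming a 1×2 pool): the second pool element is skipped exactly when the first evaluates to +1. If +1 and −1 activations were equally likely, the expected skipped fraction of the pooled layers' work would be

0.5 × 0.5 = 25%.

The measured 22% total reduction over layers contributing 76.5% of all operations implies an average in-layer skip rate of

0.22 / 0.765 ≈ 29%,

above the balanced-activation estimate, which is consistent with the "at least 25%" claim: in trained BNNs the activations are not exactly equiprobable.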
VII. Implementation Results and Analysis
VII-A. FPGA Implementation
In order to emulate the case studies on the proposed hardware, an FPGA is used as the evaluation setup. We choose the low-power Artix-7 family and select the smallest package whose Block RAMs (BRAMs) can accommodate the models of the case studies. Each case study was synthesized, placed-and-routed, and implemented using Xilinx Vivado with different numbers of PEs and different BRAM widths. Fig. 6 (left) depicts the classification energy for Stress Detection vs. the number of PEs for varying memory widths, and Fig. 6 (right) depicts the classification energy vs. memory width for varying numbers of PEs for Physical Activity. Table II provides the implementation results for the two cases with a memory width of 128 bits at 100MHz, with more implementation details. In Table II, for each case study the optimal (Opt.) configuration in terms of energy is marked for a specific number of PEs. For Stress Detection, the optimal configuration (8 PEs) requires 4× less energy, and for Physical Activity, the optimal configuration (4 PEs) consumes 2× less energy, compared to their serial configurations (1 PE).
TABLE II: Case-study implementation characteristics on FPGA (128-bit memory width, 100MHz)

                          Stress Detection           Physical Activity
                        Serial          Opt.       Serial    Opt.
# of PEs                  1      4      8            1      4      8
BRAM                      6      6      6            3      3      3
DSP                       1      1      1            1      1      1
# of slices             341   1076   2047          385   1329   2601
Latency (ms)           1.14   0.31   0.19         0.06   0.03   0.03
Throughput (k label/s) 0.87   3.23   5.22        17.17  36.15  38.12
Total Power (mW)         78     98    115           76     79     91
Energy (µJ)            89.3   30.3   22.0          4.4    2.2    2.4
VII-B. ASIC Implementation
Various configurations per case study were synthesized using RTL Compiler and placed-and-routed using SoC Encounter from Cadence in a 65nm standard-cell CMOS technology. The two case studies yield slightly different optimal configurations on ASIC. Fig. 7 (left) and Fig. 7 (right) respectively depict the classification energy for the first and second case studies w.r.t. the varying number of PEs and varying memory widths, and Table III provides the implementation results with a memory width of 128 bits at 128MHz in more detail. Both the figure and the table confirm the energy-per-bit analysis inferred from Fig. 4. From Table III it is concluded that for Stress Detection and Physical Activity, the optimal configurations are respectively 2× and 2.5× more energy efficient than their serial counterparts. The ASIC post-layout views for the optimal configurations of both cases are shown in Fig. 8. The reason the Physical Activity case has smaller PE logic is that in its HDL the MP and Conv logics are automatically removed and fewer SRAM cells are dedicated as cache memories.
TABLE III: Case-study implementation characteristics on ASIC (128-bit memory width, 128MHz)

                          Stress Detection           Physical Activity
                        Serial   Opt.              Serial          Opt.
# of PEs                  1      4      8            1      4      8
Area Util. (%)           94     91     90           94     93     92
Clock Freq. (MHz)                 128 (all configurations)
Max Freq. (MHz)                  1000 (all configurations)
Core Area (mm²)        0.49   0.56   0.66         0.18   0.19   0.20
Latency (ms)           2.13   0.55   0.29         0.04  0.018  0.014
Throughput (k label/s) 0.47   1.80   3.42        22.22  52.81  68.52
Leakage Power (mW)     5.77   6.42   7.27         3.61   3.66   3.71
Dynamic Power (mW)     5.37  13.83  32.85         0.80   1.23   1.62
Total Power (mW)      11.14  20.25  40.13         4.41   4.90   5.32
Energy (µJ)           23.81  11.25  11.73         0.20   0.09  0.077

VIII. Comparison
We compare our proposed hardware on the two case studies against two relevant works, [2] and [18], from which we obtained our BNN configurations. The comparison is summarized in Table IV. In summary, despite the 9.5× augmentation in the number of parameters, the optimal configuration for Stress Detection is 4.5× and 6.8× more energy efficient compared to [2] implemented respectively on Artix-7 FPGA and on 65nm ASIC. For Physical Activity, the minimal hardware, with no MP and Conv logics, is 250× more energy efficient than the programmable, yet low-power, manycore platform in [18].
TABLE IV: Comparison with related works

                      Stress Detection                                  Physical Activity
Metrics          [2]           [2]          This Work     This Work     [18]        This Work
Platform         Artix-7 FPGA  ASIC 65nm    Artix-7 FPGA  ASIC 65nm     ASIC 65nm   ASIC 65nm
Technique        16-bit DCNN   16-bit DCNN  BNN           BNN           BNN         BNN
Accuracy (%)     94            94           94.1          94.1          97.8        97.8
Freq. (MHz)      100           100          100           128           145         128
Latency (µs)     1000          800          190           550           77          14
Power (mW)       154           60           115           22            270         5.3
Energy (µJ)      150           50           22            11            20          0.08
Energy Imp.      –             –            6.8×          4.5×          –           250×
IX. Conclusion
We proposed a scalable hardware for inference of BNNs trained on physiological datasets. The proposed hardware offers flexibility in adjusting the memory width and the number of PEs, both adjustable for the most energy-efficient configuration while meeting the time requirements of a given application. Two case studies, Physical Activity Monitoring and Stress Detection, are trained and deployed on the hardware, and for each, the optimal number of PEs and memory width is sought. Our implementation results with the two case studies on Artix-7 FPGA and on a 65nm CMOS standard-cell ASIC indicate that the configuration parameters of the hardware, if adjusted well, can reduce the energy consumption by up to 4× and 2.5× respectively on FPGA and on ASIC. An operation-skipping method, named PoolSkipping, is also proposed for BNNs that skips effectless operations preceding a MaxPool layer, and can skip up to 22% of the operations in the Stress Detection case study. Compared to related works that use the same case studies on the same target platforms, FPGA and ASIC, and with the same classification accuracy, our hardware is 4.5× and 250× more energy efficient for Stress Detection on FPGA and Physical Activity Monitoring on ASIC, respectively.
References
 [1] H. Kim et al., “A kernel decomposition architecture for binaryweight convolutional neural networks.”
 [2] A. Jafari et al., “Sensornet: A scalable and lowpower deep convolutional neural network for multimodal data classification,” IEEE TCASI: Regular Papers, 2018.
 [3] A. Page, A. Kulkarni, and T. Mohsenin, “Utilizing deep neural nets for an embedded ecgbased biometric authentication system,” in BioCAS. IEEE, 2015.
 [4] T. Abtahi et al., “Accelerating convolutional neural network with fft on tiny cores,” in ISCAS. IEEE, 2017.
 [5] ——, “Accelerating convolutional neural network with fft on embedded hardware,” IEEE TVLSI Systems, 2018.
 [6] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, “Compressing neural networks with the hashing trick,” in International Conference on Machine Learning, 2015.
 [7] J. Shen and S. Mousavi, “Least sparsity of p-norm based optimization problems with p>1,” SIAM Journal on Optimization, 2018.
 [8] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
 [9] M. Courbariaux et al., “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1,” arXiv preprint arXiv:1602.02830, 2016.
 [10] ——, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in Advances in Neural Information Processing Systems, 2015, pp. 3123–3131.
 [11] M. Rastegari et al., “Xnornet: Imagenet classification using binary convolutional neural networks,” in European Conference on Computer Vision. Springer, 2016, pp. 525–542.
 [12] F. Karim et al., “Lstm fully convolutional networks for time series classification,” IEEE Access, 2018.
 [13] Y. Zheng et al., “Time series classification using multichannels deep convolutional neural networks,” in International Conference on WebAge Information Management, 2014.
 [14] M. Soleymani et al., “Analysis of eeg signals and facial expressions for continuous emotion detection,” IEEE Transactions on Affective Computing, no. 1, 2016.
 [15] X. Li et al., “Concurrent activity recognition with multimodal cnnlstm structure,” arXiv preprint arXiv:1702.01638, 2017.
 [16] S. Yao et al., “Deepsense: A unified deep learning framework for timeseries mobile sensing data processing,” in Proceedings of the 26th International Conference on World Wide Web, 2017.
 [17] V. Radu et al., “Towards multimodal deep learning for activity recognition on mobile devices,” in UbiComp. ACM, 2016.
 [18] A. Jafari et al., “BiNMAC: Binarized neural network manycore accelerator,” in ACM Proceedings of the 28th Edition of the Great Lakes Symposium on VLSI (GLSVLSI). ACM, 2018.
 [19] W. Tang et al., “How to train a compact binary neural network with high accuracy,” in AAAI, 2017, pp. 2625–2631.
 [20] Y. Umuroglu et al., “Finn: A framework for fast, scalable binarized neural network inference,” in Proceedings of the SIGDA. ACM, 2017.
 [21] J. Birjandtalab et al., “A noneeg biosignals dataset for assessment and visualization of neurological status,” in Signal Processing Systems (SiPS). IEEE, 2016.
 [22] A. Reiss and D. Stricker, “Introducing a new benchmarked dataset for activity monitoring,” in Wearable Computers (ISWC), 2012 16th International Symposium on. IEEE, 2012.
 [23] A. Page et al., “Sparcnet: A hardware accelerator for efficient deployment of sparse convolutional networks,” Journal of Emerging Technologies in Computing Systems, 2017.
 [24] C. Zhang et al., “Optimizing fpgabased accelerator design for deep convolutional neural networks,” in Proceedings of SIGDA. ACM, 2015.