Loom: Exploiting Weight and Activation Precisions to Accelerate Convolutional Neural Networks
Abstract
Loom (LM), a hardware inference accelerator for Convolutional Neural Networks (CNNs), is presented. In LM every bit of data precision that can be saved translates into proportional performance gains. Specifically, for convolutional layers LM’s execution time scales inversely proportionally with the precisions of both weights and activations, while for fully-connected layers it scales inversely proportionally with the precision of the weights. LM targets area-constrained System-on-a-Chip designs such as those found on mobile devices, which cannot afford the multi-megabyte buffers that would be needed to store each layer on-chip during processing. Experiments on image classification CNNs show that, on average across all networks studied and assuming that weights are supplied via a High Bandwidth Memory v2 (HBM2) interface, a configuration of LM outperforms a state-of-the-art bit-parallel accelerator [1] without any loss in accuracy while also being more energy efficient. Moreover, LM can trade off accuracy for additional improvements in execution performance and energy efficiency.
1 Introduction
Deep neural networks (DNNs) have become the state-of-the-art technique in many recognition tasks such as object [2] and speech recognition [3]. The high computational bandwidth demands and energy consumption of DNNs motivated several special-purpose architectures such as the state-of-the-art DaDianNao (DaDN) data-parallel accelerator [1]. To maximize performance DaDN, as proposed, uses 36MB of on-chip eDRAM to hold all input (weights and activations) and output data (activations) per layer. Unfortunately, such large on-chip buffers are beyond the reach of embedded and mobile system-on-chip (SoC) devices.
This work presents Loom (LM), a hardware accelerator for inference with Convolutional Neural Networks (CNNs) targeting embedded systems where the bulk of the data processed cannot be held on chip and has to be fetched from off-chip memories. LM exploits the precision requirement variability of modern CNNs to reduce the off-chip network footprint, increase bandwidth utilization, and deliver performance which scales inversely proportionally with precision for both convolutional (CVLs) and fully-connected (FCLs) layers. Ideally, compared to a conventional DaDN-like data-parallel accelerator that uses a fixed precision of 16 bits, LM achieves a speedup of (16/P_w) x (16/P_a) for CVLs and 16/P_w for FCLs, where P_w and P_a are the precisions of weights and activations respectively. LM processes both activations and weights bit-serially while compensating for the loss in computation bandwidth by exploiting parallelism. Judicious reuse of activations or weights enables LM to improve performance and energy efficiency over conventional bit-parallel designs without requiring a wider memory interface.
We evaluate LM on an SoC with a High Bandwidth Memory v2 (HBM2) interface, comparing against a DaDN-like accelerator (BASE). Both accelerators are configured so that they can utilize the full bandwidth of HBM2. On a set of image classification CNNs, LM on average outperforms BASE on the convolutional layers, the fully-connected layers, and over each network as a whole, and is more energy efficient than BASE on the same layers. LM also enables trading off accuracy for additional improvements in performance and energy efficiency; for example, accepting a 1% relative loss in accuracy, LM yields higher performance and greater energy efficiency than BASE.
The rest of this document is organized as follows: Section 2 illustrates the key concepts behind LM via an example. Section 3 reviews the BASE architecture and presents an equivalent Loom configuration. The evaluation methodology and experimental results are presented in Section 4. Section 5 reviews related work, and Section 6 concludes.
2 Loom: A Simplified Example
This section explains how LM would process CVLs and FCLs assuming 2-bit activations and weights.
Conventional Bit-Parallel Processing: Figure a shows a bit-parallel processing engine which multiplies two input activations with two weights, generating a single 2-bit output activation per cycle. The engine can process two new 2-bit weights and/or activations per cycle, for a throughput of two products per cycle.
Loom’s Approach: Figure b shows an equivalent LM engine comprising four subunits organized in a 2 x 2 array. Each subunit accepts 2 bits of input activations and 2 bits of weights per cycle. The subunits along the same column share the activation inputs, while the subunits along the same row share their weight inputs. In total, this engine accepts 4 activation and 4 weight bits, equaling the input bandwidth of the bit-parallel engine. Each subunit has two 1-bit Weight Registers (WRs), one 2-bit Output Register (OR), and can perform two single-bit products per cycle which it can accumulate into its OR.
Figure b through Figure f show how LM would process an FCL.
As Figure b shows, in cycle 1 the left-column subunits receive the least significant bits (LSBs) of the two input activations along with the LSBs of four weights from filters 0 and 1. Each of these two subunits calculates two products.
In total it took 4 + 1 cycles to process 32 products (4, 8, 8, 8, and 4 products in cycles 1 through 5, respectively). Notice that at the end of the fifth cycle the left-column subunits are idle, so another set of weights could have been loaded into the WRs, allowing a new set of outputs to commence computation. In the steady state, when the input activations and the weights are represented in two bits, this engine produces 8 single-bit terms every cycle, thus matching the throughput of the bit-parallel engine.
If the weights could be represented using only one bit, LM would be producing two output activations per cycle, twice the bandwidth of the bit-parallel engine. In general, if the bit-parallel hardware used P_base bits to represent the weights while only P_w bits were actually required, for the FCLs the LM engine would outperform the bit-parallel engine by P_base/P_w. Since there is no weight reuse in FCLs, two cycles are required to load a different set of weights to each of the two columns. Thus having activations that use fewer than two bits would not improve performance (but could improve energy efficiency).
Convolutional Layers: LM processes CVLs mostly as it does FCLs, but exploits weight reuse across different windows to exploit a reduction in precision for both weights and activations. Specifically, in CVLs the subunits across the same row share the same weight bits, which they load in parallel into their WRs in a single cycle. These weight bits are multiplied by the corresponding activation bits over P_a cycles. Another set of weight bits needs to be loaded every P_a cycles, where P_a is the input activation precision. Here LM exploits weight reuse across multiple windows by having each subunit column process a different set of activations. Assuming that the bit-parallel engine uses P_base bits to represent both input activations and weights, LM will outperform the bit-parallel engine by (P_base x P_base)/(P_w x P_a), where P_w and P_a are the weight and activation precisions respectively.
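The ideal speedups implied above can be summarized in a short sketch (a model of the ideal case only, ignoring pipeline fill and under-utilization; the 16-bit baseline corresponds to the DaDN-like engine):

```python
def cvl_speedup(p_w, p_a, base_bits=16):
    """Ideal Loom speedup over a bit-parallel engine on a convolutional
    layer: both the weight precision p_w and the activation precision
    p_a contribute, since weights are reused across windows."""
    return (base_bits / p_w) * (base_bits / p_a)

def fcl_speedup(p_w, base_bits=16):
    """Ideal speedup on a fully-connected layer: only the weight
    precision helps, because there is no weight reuse."""
    return base_bits / p_w

# e.g. 8-bit weights and activations in a convolutional layer:
assert cvl_speedup(8, 8) == 4.0
assert fcl_speedup(8) == 2.0
```

With full 16-bit precisions both expressions reduce to 1.0, i.e. LM merely matches the bit-parallel engine.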
3 Loom Architecture
This section describes the baseline DaDN-like design, how it was configured to work with an HBM2 memory, and finally the Loom architecture.
3.1 Data Supply and Baseline System
Our baseline design (BASE) is an appropriately configured data-parallel engine inspired by the DaDN accelerator [1]. DaDN uses 16-bit fixed-point activations and weights. A DaDN chip integrates 16 tiles where each tile processes 16 filters concurrently, and 16 weight and activation products per filter. In total, a DaDN chip processes 256 filters and 4K products concurrently, requiring 8KB of weight and 32B of activation inputs (16 activations are reused by all 256 filters) per cycle. Given the 1GHz operating frequency, sustaining DaDN’s compute bandwidth requires 8TB/s and 32GB/s of weight and input activation bandwidth respectively. DaDN uses 32MB weight and 4MB activation eDRAMs for this purpose. Such large on-chip memories are beyond the reach of modern embedded SoC designs. Given that there is no weight reuse in FCLs, all weights have to be supplied from an off-chip memory.
Figure a illustrates the BASE design, which processes eight filters concurrently calculating 16 input activation and weight products per filter, for a total of 128 products per cycle. Each cycle, the design reduces the 16 products of each filter into a single partial output activation, for a total of eight partial output activations for the whole chip. Internally, the chip has an input activation buffer (ABin) that provides 16 activations per cycle through 16 activation lanes, and an output activation buffer (ABout) that accepts eight partial output activations per cycle. In total, 128 multipliers calculate the 128 activation and weight products, and eight 16-input adder trees produce the partial output activations. All inter-layer activation outputs except for the initial input and the final output are stored in a 4MB Activation Memory (AM) which is connected to the ABin and ABout buffers. Off-chip accesses are needed only for: 1) reading the input image, 2) reading the weights, and 3) writing the final output.
3.2 Loom
Targeting a 1GHz clock frequency and an HBM2 interface, LM can expect to sustain an input bandwidth of up to 2K weight bits per cycle. Accordingly, LM is configured to process 128 filters concurrently and 16 weight bits per filter per cycle, for a total of 128 x 16 = 2K weight bits per cycle. LM also accepts 256 1-bit input activations, each of which it multiplies with 128 1-bit weights, thus matching the computation bandwidth of BASE in the worst case where both activations and weights need 16 bits. Figure b shows the Loom design. It comprises 2K Serial Inner-Product Units (SIPs) organized in a 128 x 16 grid. Every cycle, each SIP multiplies 16 single-bit input activations with 16 single-bit weights and reduces these products into a partial output activation. The SIPs along the same row share a common weight bus, and the SIPs along the same column share a common activation bus. Accordingly, as in BASE, the SIP array is fed by a 2K-bit weight bus and a 256-bit activation input bus. Similar to BASE, LM has an ABout and an ABin. LM processes both activations and weights bit-serially.
Reducing Memory Footprint and Bandwidth: Since both weights and activations are processed bit-serially, LM can store them in a bit-interleaved fashion using only as many bits as necessary, thus boosting the effective bandwidth and storage capacity of the external weight memory and the on-chip AM. For example, given 2K 13-bit weights to be processed in parallel, LM would pack their bit 0 onto consecutive rows first, then their bit 1, and so on up to bit 12; BASE would store them using 16 bits instead. A transposer can rotate the output activations prior to writing them to AM from ABout. Since each output activation entails an inner product with tens to hundreds of inputs, the demand on the transposer will be low. Next we explain how LM processes FCLs and CVLs.
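The bit-interleaved storage layout can be illustrated in a few lines of Python (an illustrative sketch for unsigned weights; `pack_bitserial` is a hypothetical helper, not part of the design):

```python
def pack_bitserial(weights, precision):
    """Pack a group of weights bit-interleaved: row b holds bit b of
    every weight, so a bit-serial engine streams one row per cycle and
    storage shrinks from 16 rows to `precision` rows per group."""
    return [[(w >> b) & 1 for w in weights] for b in range(precision)]

packed = pack_bitserial([5, 3, 12, 7], precision=4)
assert packed[0] == [1, 1, 0, 1]   # row 0: every weight's LSB
assert packed[3] == [0, 0, 1, 0]   # row 3: every weight's bit 3
```

Reading the rows back in order reproduces exactly the bit stream the SIP rows consume, which is why no format conversion is needed at the compute units.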
Convolutional Layers: Processing starts by reading in parallel 2K weight bits from the off-chip memory, loading 16 bits into the WRs of each SIP row. The loaded weight bits are multiplied by 16 corresponding activation bits per SIP column bit-serially over P_a cycles, where P_a is the activation precision for this layer. Then the second bit of the weights is loaded into the WRs and multiplied with the 16 activation bits per SIP column, and so on. In total, the bit-serial multiplication takes P_w x P_a cycles, where P_w is the weight precision for this layer. Whereas BASE would process 16 sets of 16 activations and 128 filters over 256 cycles, LM processes them concurrently but bit-serially over P_w x P_a cycles. If P_w and/or P_a are less than 16, LM will outperform BASE by 256/(P_w x P_a). Otherwise, LM will match BASE’s performance.
Fully-Connected Layers: Processing starts by loading the LSBs of a set of weights into the WR registers of the first SIP column and multiplying the loaded weights with the LSBs of the corresponding activations. In the second cycle, while the first column of SIPs is still busy multiplying the LSBs in its WRs by the second bit of the activations, the LSBs of a new set of weights can be loaded into the WRs of the second SIP column. Each weight bit is reused for 16 cycles, multiplying with bits 0 through 15 of the input activations. Thus, there is enough time for LM to keep any single column of SIPs busy while loading new sets of weights into the other 15 columns. For example, as shown in Figure b, LM can load a single bit of 2K weights into SIP(0,0)..SIP(0,127) in cycle 0, then load a single bit of the next 2K weights into SIP(1,0)..SIP(1,127) in cycle 1, and so on. After the first 15 cycles, all SIPs are fully utilized. It will take P_w x 16 cycles for LM to process 16 sets of 16 activations and 128 filters, while BASE processes them in 256 cycles. Thus, when P_w is less than 16, LM will outperform BASE by 16/P_w, and it will match BASE’s performance otherwise.
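The cycle counts above can be checked with a small model (a sketch; treating the 15-cycle pipeline fill as a one-off cost at the start of a layer is our reading of the description):

```python
def fcl_tile_cycles(p_w, first_tile=False, columns=16):
    """Cycles for Loom to process one tile of 16 activation sets x 128
    filters in a fully-connected layer: each weight bit is reused for 16
    cycles, so p_w bits take p_w * 16 cycles; the first tile of a layer
    also pays (columns - 1) cycles to fill the column pipeline."""
    return p_w * 16 + ((columns - 1) if first_tile else 0)

assert fcl_tile_cycles(16) == 256        # 16-bit weights: matches BASE
assert 256 / fcl_tile_cycles(8) == 2.0   # 8-bit weights: 2x over BASE
```

For a large FCL the fill term is amortized over many tiles, which is why the steady-state speedup is simply 16/P_w.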
Processing Layers with Few Outputs: For LM to keep all the SIPs busy, an output activation must be assigned to each SIP. This is possible as long as the layer has at least 2K outputs. However, in the networks studied some FCLs have only 1K output activations. To avoid underutilization, LM implements SIP cascading, in which the SIPs along each row form a daisy chain, where the output of one can feed into an input of the next via a multiplexer. This way, the computation of an output activation can be sliced along the bit dimension over the SIPs in the same row. In this case, each SIP processes only a portion of the input activations, resulting in several partial output activations along the SIPs on the same row. Over the next S - 1 cycles, where S is the number of slices used, the partial outputs can be reduced into the final output activation.
Other Layers: Similar to DaDN, LM processes the additional layers needed by the studied networks. To do so, LM incorporates units for max pooling as in DaDN. Moreover, to apply non-linear activation functions, an activation functional unit is present at the output of ABout. Given that each output activation typically takes several cycles to compute, it is not necessary to use more such functional units than BASE does.
Total computational bandwidth: In the worst case, where both activations and weights use 16-bit precision, a single product that would have taken BASE one cycle to produce now takes LM 256 cycles. Since BASE calculates 128 products per cycle, LM needs to calculate the equivalent of 128 x 256 = 32K products every 256 cycles. LM has 2K SIPs, each producing 16 single-bit products per cycle. Thus, over 256 cycles, LM produces the equivalent of 32K full products, matching BASE’s compute bandwidth.
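The worst-case bandwidth argument reduces to simple arithmetic, which the following check makes explicit:

```python
CYCLES = 256                     # 16-bit weights x 16-bit activations
base_full = 128 * CYCLES         # BASE: 128 full products per cycle
loom_bits = 2048 * 16 * CYCLES   # Loom: 2K SIPs x 16 bit-products/cycle
loom_full = loom_bits // 256     # 256 bit-products = one full product
assert base_full == loom_full == 32768
```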
Table 1: Convolutional layers: per-layer input activation precisions and network-wide weight precisions in bits.

100% Accuracy
Network     Per-Layer Activation Precision (bits)             Network Weight Precision (bits)
NiN         8-8-8-9-7-8-8-9-9-8-8-8                           11
AlexNet     9-8-5-5-7                                         11
GoogLeNet   10-8-10-9-8-10-9-8-9-10-7                         11
VGG_S       7-8-9-7-9                                         12
VGG_M       7-7-7-8-7                                         12
VGG_19      12-12-12-11-12-10-11-11-13-12-13-13-13-13-13-13   12

99% Accuracy
Network     Per-Layer Activation Precision (bits)             Network Weight Precision (bits)
NiN         8-8-7-9-7-8-8-9-9-8-7-8                           10
AlexNet     9-7-4-5-7                                         11
GoogLeNet   10-8-9-8-8-9-10-8-9-10-8                          10
VGG_S       7-8-9-7-9                                         11
VGG_M       6-8-7-7-7                                         12
VGG_19      9-9-9-8-12-10-10-12-13-11-12-13-13-13-13-13       12
Table 2: Fully-connected layers: per-layer weight precisions in bits.

Network     100% Accuracy    99% Accuracy
AlexNet     10-9-9           9-8-8
GoogLeNet   7                7
VGG_S       10-9-9           9-9-8
VGG_M       10-8-8           9-8-8
VGG_19      10-9-9           10-9-8
SIP: Bit-Serial Inner-Product Units: Figure 3 shows LM’s Bit-Serial Inner-Product Unit (SIP). Every clock cycle, each SIP multiplies 16 single-bit activations by 16 single-bit weights to produce a partial output activation. Internally, each SIP has 16 1-bit Weight Registers (WRs), 16 2-input AND gates to multiply the weights in the WRs with the incoming input activation bits, and a 16-input adder tree that sums these partial products. A first accumulator accumulates and shifts the output of the adder tree over P_a cycles. Every P_a cycles, a second accumulator shifts the output of the first and accumulates it into the OR. After P_a x P_w cycles the Output Register (OR) contains the inner product of an activation and weight set. In each SIP, a multiplexer at the OR input implements cascading. To support signed 2’s complement activations, a negation block is used to subtract the sum of the input activations corresponding to the most significant bit (MSB) of the weights from the partial sum when that MSB is 1. Each SIP also includes a comparator (max) to support max pooling layers.
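Functionally, a SIP computes a bit-serial inner product. The sketch below is an illustrative software model, not the RTL: the weights stream in LSB-first, one bit per step, and the MSB term is subtracted in the spirit of the negation block so that two's-complement signed operands come out right.

```python
def sip_inner_product(activations, weights, p_w):
    """Bit-serial inner-product model: weights stream one bit per step
    (LSB first); each step ANDs every activation with its weight bit and
    sums the survivors, as the SIP's AND gates and adder tree do.  The
    MSB term is subtracted to honour two's-complement signed weights."""
    acc = 0
    for b in range(p_w):
        partial = sum(a for a, w in zip(activations, weights)
                      if (w >> b) & 1)
        if b == p_w - 1:          # MSB of a 2's-complement value is negative
            acc -= partial << b
        else:
            acc += partial << b
    return acc

# Matches a plain signed inner product, e.g. weights -3 and +2 in 4 bits:
acts, wts = [5, 7], [0b1101, 0b0010]
assert sip_inner_product(acts, wts, 4) == 5 * (-3) + 7 * 2
```

Each loop iteration corresponds to one group of P_a hardware cycles in the description above; the shift by `b` is what the accumulator's shifter performs incrementally.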
Tuning the Performance, Area and Energy Trade-off: It is possible to trade off some of the performance benefits to reduce the number of SIPs and the respective area overhead by processing more than one activation bit per cycle. Using this method, LM requires fewer SIPs to match BASE’s throughput. The evaluation section considers 2-bit and 4-bit LM configurations (referred to here as LM 2-bit and LM 4-bit, with the full design as LM 1-bit), which need 8 and 4 SIP columns, respectively. Since activation precisions are now forced to be a multiple of 2 or 4 respectively, these configurations give up some of the performance potential. For example, reducing the activation precision from 8 to 5 bits produces no performance benefit for LM 4-bit, whereas for LM 2-bit it would improve performance by 8/6.
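The rounding effect of the multi-bit configurations is captured by this small sketch (`effective_precision` is a hypothetical helper name):

```python
import math

def effective_precision(p_a, bits_per_cycle):
    """Activation precision as seen by a multi-bit Loom configuration:
    rounded up to the next multiple of the bits processed per cycle."""
    return bits_per_cycle * math.ceil(p_a / bits_per_cycle)

# Reducing activations from 8 to 5 bits: no gain for the 4-bit config,
# but an 8/6 speedup for the 2-bit config.
assert effective_precision(5, 4) == 8   # same cost as 8 bits
assert effective_precision(5, 2) == 6   # 8/6 faster than 8 bits
```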
4 Evaluation
This section evaluates Loom’s performance, energy, and area, and explores the trade-off between accuracy and performance compared to BASE and Stripes*.
4.1 Methodology
Performance, Energy, and Area Methodology: The measurements were collected over layouts of all designs as follows. The designs were synthesized for worst-case, typical-case, and best-case corners with the Synopsys Design Compiler [6] using a TSMC 65nm library. Layouts were produced with Cadence Encounter [7] using the typical-corner synthesis results, which were more pessimistic for LM than the worst-case scenario. Power results are based on the actual data-driven activity factors. The clock frequency of all designs is set to 980 MHz, matching the original DaDianNao design [1]. The ABin and ABout SRAM buffers were modeled with CACTI [8], and the AM eDRAM area and energy were modeled with Destiny [9]. Execution time is modeled via a custom cycle-accurate simulator.
Weight and Activation Precisions: The methodology of Judd et al. [10] was used to generate per-layer precision profiles. Tables 1 and 2 report the precisions for the convolutional and fully-connected layers, respectively. Caffe [11] was used to measure how reducing the precision of each layer affects the network’s overall top-1 prediction accuracy over 5000 images. The network models and trained networks are taken from the Caffe Model Zoo [12]. Since LM’s performance for the CVLs depends on both P_w and P_a, we adjust them independently: we use per-layer activation precisions and a common weight precision across all CVLs (we found little inter-layer variability for weight precisions, but additional per-layer exploration is warranted). Since LM’s performance for FCLs depends only on P_w, we adjust only the weight precision for FCLs.
Table 1 reports the per-layer precisions of input activations and the network-wide precisions of weights for the CVLs. The precisions that guarantee no accuracy loss vary from 5 to 13 bits for input activations and from 10 to 12 bits for weights. When 99% of the original accuracy is still acceptable, the activation and weight precisions can be as low as 4 and 10 bits, respectively. Table 2 shows that the per-layer weight precisions for the FCLs vary from 7 to 10 bits.
Table 3: Speedup (Perf) and energy efficiency (Eff) relative to BASE with the 99% accuracy profiles.

            Fully-connected layers                      Convolutional layers
            1-bit         2-bit         4-bit           1-bit         2-bit         4-bit
Network     Perf   Eff    Perf   Eff    Perf   Eff      Perf   Eff    Perf   Eff    Perf   Eff
NiN         -      -      -      -      -      -        3.63   2.96   3.35   3.20   2.99   3.18
AlexNet     1.85   1.51   1.85   1.76   1.85   1.97     3.74   3.05   3.28   3.13   3.12   3.32
GoogLeNet   2.25   1.84   2.27   2.16   2.28   2.42     2.13   1.74   2.12   2.02   1.99   2.11
VGG_S       1.78   1.46   1.78   1.70   1.79   1.90     2.74   2.24   2.58   2.46   2.37   2.53
VGG_M       1.79   1.47   1.80   1.72   1.80   1.92     2.83   2.31   2.59   2.47   2.63   2.80
VGG_19      1.63   1.33   1.63   1.56   1.63   1.74     1.79   1.47   1.72   1.64   1.56   1.66
geomean     1.85   1.51   1.85   1.77   1.86   1.98     2.85   2.22   2.54   2.42   2.38   2.53
4.2 Results
Performance: Figure 4 shows the performance of Stripes* and the Loom configurations for CVLs relative to BASE with the precision profiles of Tables 1 and 2. With no accuracy loss (100% accuracy), LM 1-bit improves the performance of CVLs on average over BASE, exceeding the improvement with Stripes*. The LM 2-bit and LM 4-bit configurations likewise achieve average speedups over BASE on the CVLs. As expected, LM 2-bit and LM 4-bit offer slightly lower performance than LM 1-bit; however, given that their power consumption is also lower, this can be a good trade-off. The performance loss of LM 2-bit and LM 4-bit is due to rounding activation precisions up to a multiple of 2 or 4, respectively.
Figure 5 shows the performance of Stripes* and the Loom configurations for FCLs. Since for the FCLs the performance improvement comes only from the lower precision of the weights, rounding up the activation precision does not affect the performance of these designs. Hence all three configurations of LM outperform BASE on average by a factor of ~1.74x, while Stripes* merely matches BASE’s performance. However, due to its shorter initiation interval per layer, LM 4-bit performs slightly better than LM 1-bit and LM 2-bit on the FCLs. Since GoogLeNet has only one small fully-connected layer, the initiation interval has a higher effect on the performance of its fully-connected layer; thus the performance variation across the Loom configurations is higher for GoogLeNet.
Table 3 reports the performance and energy efficiency of LM 1-bit, LM 2-bit, and LM 4-bit on the FCLs and CVLs when up to a 1% relative loss in accuracy is acceptable (the 99% accuracy profiles).
Energy Efficiency: Figure 6 shows the energy efficiency of Stripes*, LM 1-bit, LM 2-bit, and LM 4-bit relative to BASE for CVLs using the 100% accuracy profiles of Table 1. Since the number of SIPs in LM 1-bit, LM 2-bit, and LM 4-bit is 2K, 1K, and 512, respectively, the power consumption of LM 4-bit is lower than that of LM 2-bit and LM 1-bit, so for all networks LM 4-bit has higher energy efficiency than LM 2-bit and LM 1-bit. On the CVLs, all three LM configurations improve average energy efficiency over BASE; Figure 6 also reports the corresponding improvement with Stripes*.
Figure 7 shows the energy efficiency of Stripes* and the Loom configurations for FCLs with no accuracy loss. Since Stripes* does not improve performance for FCLs yet consumes more energy than BASE, its energy efficiency for FCLs is less than one. All three configurations of Loom deliver the same performance improvement for FCLs. However, as the power consumption of LM 4-bit is lower than that of the other two configurations, it has the highest energy efficiency; similarly, the LM 2-bit design is more energy efficient than LM 1-bit. All three configurations improve energy efficiency over BASE on the FCLs.
With the 99% accuracy profiles, the energy efficiency of LM 1-bit, LM 2-bit, and LM 4-bit improves further for both the CVLs and the FCLs (Table 3). On average over whole networks, all three configurations improve energy efficiency over BASE.
These energy measurements do not include off-chip memory accesses, as an appropriate model for HBM2 is not available to us. However, since LM uses lower precisions to represent the weights, it transfers less data from off-chip. Thus our evaluation is conservative, and the efficiency of LM will be even higher.
Area Overhead: Post-layout measurements were used to measure the area of BASE and Loom. The LM 1-bit configuration requires more area than BASE while achieving an average speedup. The LM 2-bit and LM 4-bit configurations reduce this area overhead while still improving execution time. Thus LM exhibits better performance-vs-area scaling than BASE.
4.3 Dynamic Precisions
To further improve the performance of Loom, similar to [13], the precision required to represent the input activations and weights can be determined at runtime. This enables Loom to exploit smaller precisions without any accuracy loss, as it explores the weight and activation precisions at a finer granularity. In this experiment, the activation precisions are adjusted per group of 16 activations that are broadcast to the same column of SIPs. Figure 8 shows the performance of the Loom configurations relative to BASE. Exploiting the dynamic precision technique further improves the average performance of LM 1-bit, LM 2-bit, and LM 4-bit, beyond the average improvement with Stripes*.
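Per-group dynamic precision detection can be sketched as follows (illustrative, for unsigned activation magnitudes; the hardware would derive this with leading-bit detection rather than software):

```python
def dynamic_precision(group):
    """Smallest bit count that represents every (unsigned) activation in
    a group, e.g. the 16 activations broadcast to one SIP column."""
    return max(1, max(v.bit_length() for v in group))

# A group whose largest value is 13 needs only 4 bits, regardless of
# the layer's static profile:
assert dynamic_precision([0, 3, 13, 1]) == 4
```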
5 Related Work
Bit-serial neural network (NN) hardware has been proposed before [14, 15]. While its performance scales with the input data precision, it is slower than an equivalently configured bit-parallel engine. For example, one design [14] takes a number of cycles proportional to the precision of the weights for each weight and activation product.
In recent years, several DNN hardware accelerators have been proposed; however, in the interest of space we limit attention to those most related to this work. Stripes [5, 16] processes activations bit-serially and reduces execution time on CVLs only. Loom outperforms Stripes on both CVLs and FCLs: it exploits both weight and activation precisions in CVLs and weight precision in FCLs. Pragmatic’s performance for the CVLs depends only on the number of activation bits that are 1 [17], but it does not improve performance for FCLs. Further performance improvement may be possible by combining Pragmatic’s approach with LM’s. Proteus exploits per-layer precisions to reduce memory footprint and bandwidth, but requires crossbars per input weight to convert from the storage format to the one used by its bit-parallel compute engines [18]. Loom obviates the need for such a conversion and the corresponding crossbars. Hardwired NN implementations, where the whole network is implemented directly in hardware, naturally exploit per-layer precisions [19]. Loom does not require that the whole network fit on chip, nor does it hardwire the per-layer precisions at design time.
6 Conclusion
This work presented Loom, a hardware inference accelerator for DNNs whose execution time for the convolutional and fully-connected layers scales inversely proportionally with the precision used to represent the input data. LM can trade off accuracy for performance and energy efficiency on the fly. The experimental results show that, on average, LM is faster and more energy efficient than a conventional bit-parallel accelerator. We targeted the available HBM2 interface and devices; however, we expect that LM will scale well to future HBM revisions.
Footnotes
 In reality the product and accumulation would take place in the subsequent cycle. For clarity, we do not describe this in detail; it would only add an extra cycle to the processing pipeline per layer.
 Since there is weight reuse in CVLs, it may be possible to boost weight supply bandwidth with a smaller than 32MB on-chip WM for CVLs. However, off-chip memory bandwidth will remain a bottleneck for FCLs. The exploration of such designs is left for future work.
 Stripes* is a configuration of [5] that is appropriately scaled to match the 256GB/s bandwidth of the HBM2 interface.
References
[1] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “DaDianNao: A machine-learning supercomputer,” in Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, Dec 2014, pp. 609–622.
[2] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” CoRR, vol. abs/1311.2524, 2013.
[3] A. Y. Hannun, C. Case, J. Casper, B. C. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, “Deep speech: Scaling up end-to-end speech recognition,” CoRR, vol. abs/1412.5567, 2014.
[4] J. Hruska, “Samsung announces mass production of next-generation HBM2 memory,” https://www.extremetech.com/extreme/221473samsungannouncesmassproductionofnextgenerationhbm2memory, 2016.
[5] P. Judd, J. Albericio, T. Hetherington, T. Aamodt, and A. Moshovos, “Stripes: Bit-serial Deep Neural Network Computing,” in Proc. of the 49th Annual IEEE/ACM Intl’ Symposium on Microarchitecture, 2016.

[6] Synopsys, “Design Compiler,” http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DesignCompiler/Pages.
[7] Cadence, “Encounter RTL Compiler,” https://www.cadence.com/content/cadencewww/global/en_US/home/training/allcourses/84441.html.
[8] N. Muralimanohar and R. Balasubramonian, “CACTI 6.0: A tool to understand large caches.”
[9] M. Poremba, S. Mittal, D. Li, J. Vetter, and Y. Xie, “DESTINY: A tool for modeling emerging 3D NVM and eDRAM caches,” in Design, Automation & Test in Europe Conference & Exhibition, March 2015.
[10] P. Judd, J. Albericio, T. Hetherington, T. Aamodt, N. E. Jerger, R. Urtasun, and A. Moshovos, “Reduced-Precision Strategies for Bounded Memory in Deep Neural Nets,” arXiv:1511.05236v4 [cs.LG], 2015.
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
[12] Y. Jia, “Caffe model zoo,” https://github.com/BVLC/caffe/wiki/ModelZoo, 2015.
[13] A. Delmas, P. Judd, S. Sharify, and A. Moshovos, “Dynamic stripes: Exploiting the dynamic precision requirements of activation values in neural networks,” arXiv preprint arXiv:1706.00504, 2017.
[14] B. Svensson and T. Nordstrom, “Execution of neural network algorithms on an array of bit-serial processors,” in Pattern Recognition, 1990. Proceedings., 10th International Conference on, vol. 2. IEEE, 1990.
[15] A. F. Murray, A. V. Smith, and Z. F. Butler, “Bit-serial neural networks,” in Neural Information Processing Systems, 1988, pp. 573–583.
[16] P. Judd, J. Albericio, and A. Moshovos, “Stripes: Bit-serial Deep Neural Network Computing,” Computer Architecture Letters, 2016.
[17] J. Albericio, P. Judd, A. D. Lascorz, S. Sharify, and A. Moshovos, “Bit-pragmatic deep neural network computing,” arXiv:1610.06920 [cs.LG], 2016.
[18] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, N. E. Jerger, and A. Moshovos, “Proteus: Exploiting numerical precision variability in deep neural networks,” in Proceedings of the 2016 International Conference on Supercomputing. ACM, 2016, p. 23.
[19] T. Szabo, L. Antoni, G. Horvath, and B. Feher, “A full-parallel digital implementation for pre-trained NNs,” in IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, vol. 2, 2000, pp. 49–54.