Laconic Deep Learning Computing
Abstract
We motivate a method for transparently identifying ineffectual computations in unmodified Deep Learning models without affecting accuracy. Specifically, we show that if we decompose multiplications down to the bit level, the amount of work performed during inference for image classification models can be consistently reduced by two orders of magnitude. In the best case studied, a sparse variant of AlexNet, this approach can ideally reduce computation work even further. We present Laconic, a hardware accelerator that implements this approach to improve execution time and energy efficiency for inference with Deep Learning networks. Laconic judiciously gives up some of the work reduction potential to yield a low-cost, simple, and energy efficient design that outperforms other state-of-the-art accelerators. For example, a Laconic configuration that uses a weight memory interface with just 128 wires outperforms a conventional accelerator with a 2K-wire weight memory interface on average, while also being more energy efficient on average. A Laconic configuration that uses a 1K-wire weight memory interface outperforms the 2K-wire conventional accelerator by a wider margin and remains more energy efficient. Laconic does not require, but rewards, advances in model design such as a reduction in precision, the use of alternate numeric representations that reduce the number of bits that are “1”, or an increase in weight or activation sparsity.
Sayeh Sharify, Mostafa Mahmoud, Alberto Delmas Lascorz, Milos Nikolic, Andreas Moshovos 
Electrical and Computer Engineering, University of Toronto 
{sayeh, delmasl1, moshovos}@ece.utoronto.ca, 
{mostafa.mahmoud, milos.nikolic}@mail.utoronto.ca 
Modern computing hardware is energy-constrained, and thus developing techniques that reduce the amount of energy needed to perform a computation is essential for improving performance. The bulk of the work performed by convolutional neural networks during inference is due to 2D convolutions (see the background section). In turn, these convolutions entail numerous multiply-accumulate operations where most of the work is due to the multiplication of an activation $A$ and a weight $W$. To improve energy efficiency, a hardware accelerator can thus strive to perform only those multiplications that are effectual, which will also lead to fewer additions. We can approach a multiplication as a monolithic action which is either performed or avoided in its entirety. Alternatively, we can decompose it into a collection of simpler operations. For example, if $A$ and $W$ are 16b fixed-point numbers, $A \times W$ can be approached as 256 1b$\times$1b multiplications or as 16 16b$\times$1b ones.
The figure reports the potential reduction in work for several ineffectual-work-avoidance policies. The “A” policy avoids multiplications where the activation is zero. This is representative of the first generation of value-based accelerators that were motivated by the relatively large fraction of zero activations that occur in convolutional neural networks, e.g., Cnvlutin [1]. The “A+W” policy skips those multiplications where either the activation or the weight is zero and is representative of accelerators that target sparse models where a significant fraction of synaptic connections has been pruned, e.g., SCNN [2]. The “Ap” (e.g., Stripes [3] or Dynamic Stripes [4]) and “Ap+Wp” (e.g., Loom [5]) policies target the precision of the activations alone, or of the activations and the weights, respectively. It has been found that neural networks exhibit variable per-layer precision requirements. All aforementioned measurements corroborate past work on accelerator designs that exploited the respective properties.
However, we show that further potential for work reduction exists if we decompose the multiplications at the bit level. Specifically, for our discussion we can assume without loss of generality that these multiplications operate on 16b fixed-point values. The multiplication itself is given by:
$$A \times W = \sum_{i=0}^{15}\sum_{j=0}^{15} A_i\, W_j\, 2^{i+j} \quad (1)$$
where $A_i$ and $W_j$ are the bits of $A$ and $W$, respectively. When the multiplication is decomposed down to the individual 256 single-bit multiplications, one can observe that only those multiplications where both $A_i$ and $W_j$ are nonzero are effectual. Accordingly, the “Ab” (e.g., Pragmatic [6]) and “Ab+Wb” measurements show the potential reduction in work that is possible if we skip those single-bit multiplications where the activation bit is zero, or where either the activation or the weight bit is zero, respectively. The results show that the potential is far greater than with the policies discussed thus far.
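The counting argument above can be sketched in a few lines of Python (the function name is ours, for illustration): a dense 16b$\times$16b multiplier always performs 256 single-bit products, while only the product of the two population counts is effectual.

```python
def effectual_bit_products(a: int, w: int, bits: int = 16) -> int:
    """Count the 1b x 1b partial products where both the activation
    bit and the weight bit are nonzero (the "Ab+Wb" policy)."""
    mask = (1 << bits) - 1
    a_ones = bin(a & mask).count("1")
    w_ones = bin(w & mask).count("1")
    # Every '1' bit of a pairs with every '1' bit of w.
    return a_ones * w_ones

# A bit-parallel multiplier always performs bits * bits = 256 products;
# for a = 0b101 and w = 0b110 only 2 * 2 = 4 of them are effectual.
```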
Further to our discussion, rather than representing $A$ and $W$ as bit vectors, we can instead Booth-encode them as a series of signed powers of two, or terms (higher-radix Booth encoding is also possible). In this case the multiplication is given by:
$$A \times W = \left(\sum_{i} A_{t_i}\right) \times \left(\sum_{j} W_{t_j}\right) = \sum_{i}\sum_{j} A_{t_i}\, W_{t_j} \quad (2)$$
where the terms $A_{t_i}$ and $W_{t_j}$ are of the form $\pm 2^x$. As with the positional representation, only those products where both $A_{t_i}$ and $W_{t_j}$ are nonzero are effectual. Accordingly, the figure shows the potential reduction in work with “At”, where we skip the ineffectual terms for a Booth-encoded activation (e.g., Pragmatic [6]), and with “At+Wt”, where we calculate only those products where both the activation and the weight terms are nonzero. The results show that the reduction in work (and equivalently the performance improvement potential) with “At+Wt” is in most cases two orders of magnitude higher than with the zero-value or the precision-based approaches.
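A minimal sketch of the term decomposition, using the non-adjacent form (one canonical Booth-style encoding; the helper name is ours): it expresses a value as signed powers of two, so the number of effectual term pairs of a product is simply the product of the two term counts.

```python
def signed_power_terms(x: int):
    """Decompose x >= 0 into its non-adjacent form: a list of
    (sign, exponent) pairs such that x == sum(s * 2**e)."""
    terms, e = [], 0
    while x:
        if x & 1:
            d = 2 - (x & 3)        # +1 if x % 4 == 1, -1 if x % 4 == 3
            terms.append((d, e))
            x -= d
        x >>= 1
        e += 1
    return terms

# 7 == 2**3 - 2**0: two terms instead of three '1' bits, so a 7 x 7
# product needs 2 * 2 = 4 term pairs rather than 3 * 3 = 9 bit pairs.
```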
Based on these results, our goal is to develop a hardware accelerator that computes only the effectual terms. No other accelerator to date has exploited this potential. Moreover, by targeting “At+Wt” we can also exploit “Ab+Wb”, where the inputs are kept in a plain positional representation and are not Booth-encoded.
This section provides the required background information: we first review the operation of a Convolutional Neural Network and then describe our baseline system.
Convolutional Neural Networks (CNNs) usually consist of several convolutional layers (CVLs) followed by a few fully-connected layers (FCLs). In many image-related CNNs, most of the operation time is spent processing CVLs, in which a 3D convolution operation is applied to the input activations to produce the output activations. A CVL convolves a 3D input activation block with $K$ 3D filters: the layer computes the dot product of each filter with a subarray of the input activations, called a window, to generate a single output activation. In total, convolving the $K$ filters with one activation window results in $K$ outputs, which are passed to the input of the next layer. The convolution of activation windows and filters takes place in a sliding-window fashion with a constant stride $S$.
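The sliding-window computation described above can be sketched as follows (pure Python with list-of-lists tensors; the function name and the C/H/W layout are our illustrative choices):

```python
def conv_layer(acts, filts, stride=1):
    """acts: C x H x W input activations; filts: K x C x R x S filters.
    Returns the K x Ho x Wo output activations of one CVL."""
    C, H, W = len(acts), len(acts[0]), len(acts[0][0])
    K, R, S = len(filts), len(filts[0][0]), len(filts[0][0][0])
    Ho, Wo = (H - R) // stride + 1, (W - S) // stride + 1
    out = [[[0] * Wo for _ in range(Ho)] for _ in range(K)]
    for k in range(K):
        for y in range(Ho):
            for x in range(Wo):
                # Dot product of filter k with one activation window.
                out[k][y][x] = sum(
                    acts[c][y * stride + r][x * stride + s] * filts[k][c][r][s]
                    for c in range(C) for r in range(R) for s in range(S))
    return out
```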
Fully-connected layers can be implemented as convolutional layers in which the filters and the input activations have the same dimensions.
Our baseline design (BASE) is a data-parallel engine inspired by the DaDianNao accelerator [7] which uses 16-bit fixed-point activations and weights. Our baseline configuration has 8 inner-product units (IPs), each accepting 16 input activations and 16 weights as inputs. The 16 input activations are broadcast to all 8 IPs; however, each IP has its own 16 weights. Every cycle, each IP multiplies its 16 input activations by their 16 corresponding weights and reduces the products into a single partial output activation using a 16-input, 32-bit adder tree. The partial results are accumulated over multiple cycles to generate the final output activation. An activation memory provides the activations and a weight memory provides the weights. Other memory configurations are possible.
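Functionally, one BASE cycle amounts to the following behavioral sketch (names are ours; this models the arithmetic, not the hardware):

```python
def base_cycle(activations, ip_weights, partials=None):
    """One cycle of the baseline: 16 activations are broadcast to all
    8 inner-product units; each IP multiplies them by its own 16
    weights, reduces via an adder tree, and accumulates."""
    assert len(activations) == 16 and len(ip_weights) == 8
    partials = partials or [0] * 8
    return [p + sum(a * w for a, w in zip(activations, ws))
            for p, ws in zip(partials, ip_weights)]

# Two cycles accumulate into the same 8 partial output activations.
step1 = base_cycle([1] * 16, [[2] * 16] * 8)         # -> [32] * 8
step2 = base_cycle([1] * 16, [[1] * 16] * 8, step1)  # -> [48] * 8
```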
This section illustrates the key concepts behind Laconic via an example using 4bit activations and weights.
Bit-Parallel Processing: The first engine we consider is a bit-parallel engine multiplying two 4-bit activation and weight pairs, generating a single 4-bit output activation per cycle. Its throughput is two products per cycle.
Bit-Serial Processing: The second engine is an equivalent bit-serial engine representative of Loom (LM) [5]. To match the bit-parallel engine’s throughput, LM processes 8 input activations and 8 weights every cycle, producing 32 single-bit products. Since LM processes both activations and weights bit-serially, it produces 16 output activations in $p_a \times p_w$ cycles, where $p_a$ and $p_w$ are the activation and weight precisions, respectively. Thus, LM outperforms the bit-parallel engine by $16/(p_a \times p_w)$. In this example, since both activations and weights can be represented in three bits, the speedup of LM over the bit-parallel engine is $16/(3 \times 3) \approx 1.8\times$. However, LM still processes some ineffectual terms. For example, in the first cycle 27 of the 32 products are zero and thus ineffectual, and they could be removed.
Laconic: The third engine is a simplified Laconic engine in which both the activations and weights are represented as vectors of essential powers of two, or one-offsets. For example, an activation $A = 7 = (0111_2)$ is represented as a vector of its one-offsets $\{(+,3),(-,0)\}$, since $7 = 2^3 - 2^0$. Every cycle, each PE accepts a 4-bit one-offset of an input activation and a 4-bit one-offset of a weight and adds them to produce the power of the corresponding product term of the output activation. Since Laconic processes activations and weights “term”-serially, it takes $t_a \times t_w$ cycles for each PE to finish producing the product terms of an output activation, where $t_a$ and $t_w$ are the number of one-offsets in the corresponding input activation and weight. The engine processes the next set of activation and weight one-offsets after $T$ cycles, where $T$ is the maximum $t_a \times t_w$ among all the PEs. In this example, the maximum is 6 cycles; thus, the engine can start processing the next set of activations and weights after 6 cycles, achieving a $16/6 \approx 2.7\times$ speedup over the bit-parallel engine.
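The example above can be reproduced with a small sketch that multiplies using only effectual one-offset pairs, one pair per cycle (the function name is ours):

```python
def term_serial_product(a_terms, w_terms):
    """a_terms, w_terms: lists of (sign, exponent) one-offsets,
    e.g. 7 = 2**3 - 2**0 -> [(+1, 3), (-1, 0)].
    Returns (product, cycles): one effectual pair per cycle."""
    product = cycles = 0
    for sa, ea in a_terms:
        for sw, ew in w_terms:
            product += sa * sw * (1 << (ea + ew))  # add the exponents
            cycles += 1
    return product, cycles

# 7 x 3, with 7 -> 2 one-offsets and 3 -> 2 one-offsets: 4 cycles,
# versus p_a x p_w = 9 cycles for the bit-serial engine.
```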
This section presents the Laconic architecture, explaining its processing approach, the structure of its processing elements, and its high-level organization.
Laconic’s goal is to minimize the computation required for producing the products of input activations and weights by processing only the essential bits of both the input activations and the weights. To do so, LAC converts, on-the-fly, the input activations and weights into a representation which contains only the essential bits, and processes one pair of essential bits per cycle: one from an activation and one from a weight. The rest of this section first describes the activation and weight representations used by LAC and then explains how LAC calculates the product terms.
For clarity, we present a LAC implementation that processes the one-offsets, that is, the nonzero signed powers of two in a Booth-encoded representation of the activations and weights (however, LAC could be adjusted to process a regular positional representation or adapted to process representations other than fixed-point). LAC represents each activation or weight as a list of its one-offsets, each a $(sign, exponent)$ pair. For example, an activation with a Booth encoding of $(1\,0\,0\,\bar{1})$, that is $2^3 - 2^0 = 7$, would be represented as $\{(+,3),(-,0)\}$, and one with a Booth encoding of $(\bar{1}\,0\,1\,0)$, that is $-2^3 + 2^1 = -6$, would be represented as $\{(-,3),(+,1)\}$. The sign can be encoded using a single bit with, for example, 0 representing “+” and 1 representing “−”.
LAC calculates the product of a weight $W$ and an input activation $A$, where each term is a $(sign, exponent)$ pair, as follows:
$$W \times A = \left(\sum_{i} s^W_i\, 2^{t^W_i}\right) \times \left(\sum_{j} s^A_j\, 2^{t^A_j}\right) = \sum_{i}\sum_{j} s^W_i\, s^A_j\, 2^{t^W_i + t^A_j} \quad (3)$$

where $s^W_i, s^A_j \in \{+1,-1\}$ are the signs and $t^W_i, t^A_j$ the exponents of the one-offsets of $W$ and $A$.
That is, instead of processing the full product in a single cycle, LAC processes each product of a single term of the input activation and a single term of the weight individually. Since these terms are powers of two, so will be their products. Accordingly, LAC can first add the corresponding exponents $t^W_i + t^A_j$. If a single term product is processed per cycle, the final value can be calculated via a decoder. In the more likely configuration where more than one term pair is processed per cycle, LAC can use one decoder per term pair to calculate the individual products and then an efficient adder tree to accumulate them all. This is described in more detail in the next section.
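A sketch of the per-pair datapath: add the two 4-bit exponents and decode the 5-bit sum into a one-hot 32-bit value, negated when the sign bits differ (function and parameter names are ours):

```python
def decode_pair(s_w, t_w, s_a, t_a):
    """One term-pair product: s_w, s_a are sign bits (0 = '+', 1 = '-')
    and t_w, t_a are 4-bit exponents. Returns the signed one-hot
    32-bit value produced by the 5b-to-32b decoder."""
    assert 0 <= t_w < 16 and 0 <= t_a < 16
    onehot = 1 << (t_w + t_a)          # 5-bit exponent sum, decoded
    return -onehot if s_w ^ s_a else onehot

# (-2**3) * (+2**2) decodes to -(1 << 5) = -32.
```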
The LAC Processing Element (PE) calculates the product of a set of weights and their corresponding input activations. Without loss of generality, we assume that each PE multiplies 16 weights, $W^0,\dots,W^{15}$, by their 16 corresponding input activations, $A^0,\dots,A^{15}$. The PE calculates the 16 products in 6 steps:
In Step 1, the PE accepts 16 4-bit weight one-offsets, $t^W_0,\dots,t^W_{15}$, and their 16 corresponding sign bits, $s^W_0,\dots,s^W_{15}$, along with 16 4-bit activation one-offsets, $t^A_0,\dots,t^A_{15}$, and their signs, $s^A_0,\dots,s^A_{15}$, and calculates 16 one-offset pair products. Since all one-offsets are powers of two, their products will also be powers of two. Accordingly, to multiply the 16 activations by their corresponding weights, LAC adds their one-offsets to generate the 5-bit exponents $(t^W_0{+}\,t^A_0),\dots,(t^W_{15}{+}\,t^A_{15})$ and uses 16 XOR gates to determine the signs of the products.
In Step 2, for the $i$th pair of activation and weight, where $i \in \{0,\dots,15\}$, the PE calculates $2^{(t^W_i + t^A_i)}$ via a 5b-to-32b decoder which converts the 5-bit exponent result $(t^W_i{+}\,t^A_i)$ into its corresponding one-hot format, i.e., a 32-bit number with one “1” bit and 31 “0” bits. The single “1” bit in the $k$th position of a decoder output corresponds to a value of either $+2^k$ or $-2^k$, depending on the sign of the corresponding product.
Step 3: The PE generates the equivalent of a histogram of the decoder output values. Specifically, the PE accumulates the 16 32-bit numbers from Step 2 into 32 buckets, one per power of two, corresponding to the 32 possible values of the exponent sums. The signs of these numbers from Step 1 are also taken into account. At the end of this step, each bucket contains the count of the inputs that had the corresponding value. Since each bucket has 16 signed inputs, the resulting count is a value in $[-16, 16]$ and thus is represented by 6 bits in 2’s complement.
Step 4: Naïvely reducing the 32 6-bit counts $N_0,\dots,N_{31}$ into the final output would require first “shifting” the counts according to their weights, converting all of them to a common width, and then using a 32-input adder tree. Instead, LAC reduces cost and energy by exploiting the relative weighting of the counts, grouping and concatenating them in this stage. For example, rather than adding $N_0$ and $N_6 \times 2^6$, we can simply concatenate $N_6$ and $N_0$, as they are guaranteed to have no overlapping bits that are “1”. This is explained in more detail in a later section.
Step 5: As explained in more detail later, the concatenated values from Step 4 are added via a 6-input adder tree, producing the partial sum.
Step 6: The partial sum from the previous step is accumulated with the partial sum held in an accumulator. This way, the complete product can be calculated over multiple cycles, one effectual pair of one-offsets per cycle.
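Putting the six steps together behaviorally (the concatenation of Steps 4–5 is modeled as an arithmetically equivalent weighted sum; names and sign convention are ours):

```python
def pe_cycle(pairs, acc=0):
    """One PE cycle over up to 16 (s_a, e_a, s_w, e_w) one-offset pairs,
    with signs given as +1/-1. Steps 1-3 build the 32 signed bucket
    counts; Steps 4-5 reduce them; Step 6 accumulates."""
    buckets = [0] * 32                       # Step 3: histogram
    for s_a, e_a, s_w, e_w in pairs:
        buckets[e_a + e_w] += s_a * s_w      # Steps 1-2: sign and decode
    partial = sum(n << e for e, n in enumerate(buckets))  # Steps 4-5
    return acc + partial                     # Step 6: accumulate

# Feeding the 2 x 2 one-offset pairs of 7 x 3 in one cycle yields 21.
```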
The aforementioned steps are not meant to be interpreted as pipeline stages. They can be merged or split as desired.
Step 5 as described so far has to add 32 6-bit counts, each weighted by the corresponding power of two. This section presents an alternate design that replaces Steps 4 and 5. Specifically, it presents an equivalent, more area- and energy-efficient “adder tree” which takes advantage of the fact that the outputs of Step 4 contain groups of numbers that have no overlapping bits that are “1”. For example, in relation to the naïve adder tree, consider adding the 6-bit count $N_0$ (weight $2^0$) with the 6-bit count $N_6$ (weight $2^6$). We have to first shift $N_6$ by 6 bits, which amounts to appending six zeros as the least significant bits of the result. In this case, there will be no bit position in which both $N_6 \times 2^6$ and $N_0$ have a bit that is 1. Accordingly, adding $N_6 \times 2^6$ and $N_0$ is equivalent to concatenating either $N_6$ and $N_0$, or $(N_6 - 1)$ and $N_0$, based on the sign bit of $N_0$:
$$N_6 \times 2^6 + N_0 = \begin{cases} N_6 \,\|\, N_0 & \text{if } N_0 \geq 0 \\ (N_6 - 1) \,\|\, N_0 & \text{if } N_0 < 0 \end{cases} \quad (4)$$

where $\|$ denotes concatenation of the 2’s-complement bit patterns.
Accordingly, this process can be applied recursively by grouping those $N_i$ where $i \bmod 6$ is equal. That is, the input $N_0$ would be concatenated with $N_6$, $N_{12}$, and so on; one such unit handles the inputs where $i \bmod 6 = 0$. While the concatenation can be done as a stack, other arrangements are possible.
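The sign-adjusted concatenation can be checked exhaustively in a few lines (helper name ours); Python’s unbounded two’s-complement integers make the identity easy to verify:

```python
def concat_pair(n_hi, n_lo, bits=6):
    """Compute n_hi * 2**bits + n_lo by concatenation: keep the
    two's-complement bit pattern of n_lo and decrement n_hi when
    n_lo is negative (the borrow carried by n_lo's sign bit)."""
    lo_pattern = n_lo & ((1 << bits) - 1)
    hi = n_hi - 1 if n_lo < 0 else n_hi
    return (hi << bits) | lo_pattern

# Holds for every pair of 6-bit signed counts in [-16, 16].
```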
For the 16-product unit described here, the above process yields the following six groups:
$$\begin{aligned} G_0 &= N_{30} \| N_{24} \| N_{18} \| N_{12} \| N_{6} \| N_{0} \\ G_1 &= N_{31} \| N_{25} \| N_{19} \| N_{13} \| N_{7} \| N_{1} \\ G_2 &= N_{26} \| N_{20} \| N_{14} \| N_{8} \| N_{2} \\ G_3 &= N_{27} \| N_{21} \| N_{15} \| N_{9} \| N_{3} \\ G_4 &= N_{28} \| N_{22} \| N_{16} \| N_{10} \| N_{4} \\ G_5 &= N_{29} \| N_{23} \| N_{17} \| N_{11} \| N_{5} \end{aligned} \quad (5)$$

where each $\|$ applies the sign adjustment of Eq. (4).
The final partial sum is then given by the following:
$$\mathit{Sum} = \sum_{j=0}^{5} G_j \times 2^j \quad (6)$$
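The full grouped reduction can be modeled and checked against the plain weighted sum (names ours; each group is stacked from its most-significant field down):

```python
def grouped_reduce(counts):
    """Reduce 32 signed bucket counts: concatenate each (i mod 6)
    group into G_j with the sign-borrow adjustment, then add the six
    group values weighted by 2**j."""
    total = 0
    for j in range(6):
        idx = list(range(j, 32, 6))
        g = counts[idx[-1]]                  # most significant field
        for i in reversed(idx[:-1]):         # stack the lower fields
            borrow = 1 if counts[i] < 0 else 0
            g = ((g - borrow) << 6) | (counts[i] & 63)
        total += g << j
    return total

# Matches the naive reduction sum(counts[i] * 2**i) for any inputs.
```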
A LAC tile comprises a 2D array of PEs processing 16 windows of input activations and multiple filters every cycle. PEs along the same column share the same input activations and PEs along the same row receive the same weights. Every cycle, PE(i,j) receives the next one-offset from each input activation of its window and multiplies it by a one-offset of the corresponding weight of its filter. The tile starts processing the next set of activations and weights when all the PEs have finished processing the terms of the current set of 16 activations and their corresponding weights.
Since LAC processes both activations and weights term-serially, to match our BASE configuration it needs to process more filters or more windows concurrently. Here we consider implementations that process more filters. In the worst case, each activation and weight possesses 16 terms; thus, a LAC tile would have to process $16 \times 16 = 256$ times as many filters in parallel to always match the peak compute bandwidth of BASE. However, as the motivational measurements showed, with more filters LAC’s potential performance improvement over the baseline is more than two orders of magnitude. Thus, we can trade off some of this potential by using fewer filters.
To read weights from the WM, BASE requires 16 wires per weight, while LAC requires only one wire per weight, as it processes weights term-serially. Thus, with the same number of filters, LAC requires $16\times$ fewer wires. In this study we limit our attention to a BASE configuration with 8 filters and 16 weights per filter, and thus $8 \times 16 \times 16 = 2K$ weight wires, and to LAC configurations with 8, 16, 32, and 64 filters, and thus 128, 256, 512, and 1K weight wires, respectively. In all designs, the number of activation wires is kept the same. Alternatively, we could fix the number of filters, and accordingly the number of weight wires, and add more parallelism to the design by increasing the number of activation windows. The evaluation of such a design is not reported in this document.
This section evaluates LAC’s performance, energy, and area, and explores the different LAC configurations compared to BASE. We consider the configurations with 8, 16, 32, and 64 filters, which require 128, 256, 512, and 1K weight wires, respectively.
Execution time is modeled via a custom cycle-accurate simulator, and energy and area results are reported based on post-layout simulations of the designs. Synopsys Design Compiler [8] was used to synthesize the designs with the TSMC 65nm library. Layouts were produced with Cadence Innovus [9] using the synthesis results. Intel PSG ModelSim was used to generate data-driven activity factors for the power estimates. The clock frequency of all designs is set to 1GHz. The ABin and ABout SRAM buffers were modeled with CACTI [10] and the AM and WM were modeled as eDRAM with Destiny [11].
Per-layer activation precisions and per-network weight precisions for the convolutional layers (100% relative accuracy):

Network               | Activation Precision Per Layer   | Weight Precision Per Network
AlexNet               | 9-8-5-5-7                        | 11
GoogLeNet             | 10-8-10-9-8-10-9-8-9-10-7        | 11
VGG_S                 | 7-8-9-7-9                        | 12
VGG_M                 | 7-7-7-8-7                        | 12
AlexNet-Sparse [12]   | 8-9-9-9-8                        | 7
ResNet50-Sparse [13]  | 10866576677676768768658 868776975876876876887879610761087 | 13
We now report the performance of the LAC configurations relative to BASE for convolutional layers, using the 100% relative TOP-1 accuracy precision profiles of the table above.
Laconic targets both dense and sparse networks and improves performance by processing only the essential terms; however, the sparse networks benefit more, as they possess more ineffectual terms. On average, LAC outperforms BASE across all networks, with AlexNet-Sparse achieving the highest speedup over the baseline. Average performance on convolutional layers over all networks scales with the number of weight wires: the wider configurations achieve progressively higher speedups over BASE.
Energy efficiency of the LAC configurations relative to BASE:

Network         | LAC configurations
AlexNet         | 2.03  2.44  2.92  1.88
GoogLeNet       | 1.84  2.32  2.76  1.75
VGG_S           | 2.43  3.04  3.63  2.31
VGG_M           | 2.18  2.73  3.26  2.02
AlexNet-Sparse  | 2.81  2.91  3.49  2.19
ResNet-Sparse   | 1.69  2.02  2.41  1.61
Geomean         | 2.13  2.55  3.05  1.95
The table above summarizes the energy efficiency of the various LAC configurations over BASE. On average over all networks, the four configurations are $2.13\times$, $2.55\times$, $3.05\times$, and $1.95\times$ more energy efficient than BASE, respectively.
Post-layout measurements were used to measure the area of BASE and LAC. The three smaller LAC configurations require less area than BASE while still outperforming it, and the largest configuration incurs an area overhead that is smaller than its execution-time improvement over the baseline. Thus, LAC exhibits better performance vs. area scaling than BASE.
Thus far we considered designs with up to 1K-wire weight memory connections. For one of the most recent networks studied here, GoogLeNet, we also experimented with 2K- and 4K-wire configurations. Their relative performance continued to improve, albeit sublinearly, as with other accelerators. This is primarily due to inter-filter imbalance, which is aggravated because in these experiments we considered only increasing the number of filters when scaling up. Alternate designs may instead increase the number of simultaneously processed activations. In such configurations, minimal buffering across activation columns, as in Pragmatic [6], can also combat cross-activation imbalance, which we expect will worsen as we increase the number of concurrently processed activations.
We have shown that, compared to conventional bit-parallel processing, aiming to process only the nonzero bits (or terms in a Booth-encoded format) of the activations and weights has the potential to reduce work, and thus improve performance, by two orders of magnitude. We presented the first practical design, Laconic, that takes advantage of this approach, leading to best-of-class performance improvements. Laconic is naturally compatible with the compression approach of Delmas et al. [14] and thus, as per their study, we expect it to perform well with practical off-chip memory configurations and interfaces.
[1] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. Enright Jerger, and A. Moshovos, “Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing,” in Proceedings of the International Symposium on Computer Architecture, 2016.
[2] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An accelerator for compressed-sparse convolutional neural networks,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, (New York, NY, USA), pp. 27–40, ACM, 2017.
[3] P. Judd, J. Albericio, T. Hetherington, T. Aamodt, and A. Moshovos, “Stripes: Bit-serial Deep Neural Network Computing,” in Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-49, 2016.
[4] A. Delmas, P. Judd, S. Sharify, and A. Moshovos, “Dynamic stripes: Exploiting the dynamic precision requirements of activation values in neural networks,” CoRR, vol. abs/1706.00504, 2017.
[5] S. Sharify, A. D. Lascorz, P. Judd, and A. Moshovos, “Loom: Exploiting weight and activation precisions to accelerate convolutional neural networks,” CoRR, vol. abs/1706.07853, 2017.
[6] J. Albericio, A. Delmás, P. Judd, S. Sharify, G. O’Leary, R. Genov, and A. Moshovos, “Bit-pragmatic deep neural network computing,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-50 ’17, pp. 382–394, 2017.
[7] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “DaDianNao: A machine-learning supercomputer,” in Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pp. 609–622, Dec 2014.

[8] Synopsys, “Design Compiler.” http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DesignCompiler/Pages.
[9] Cadence, “Encounter RTL Compiler.” https://www.cadence.com/content/cadencewww/global/en_US/home/training/allcourses/84441.html.
[10] N. Muralimanohar and R. Balasubramonian, “CACTI 6.0: A tool to understand large caches,” 2015.
[11] M. Poremba, S. Mittal, D. Li, J. Vetter, and Y. Xie, “DESTINY: A tool for modeling emerging 3D NVM and eDRAM caches,” in Design, Automation Test in Europe Conference Exhibition (DATE), 2015, pp. 1543–1546, March 2015.
[12] T.-J. Yang, Y.-H. Chen, and V. Sze, “Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[13] J. Park, S. Li, W. Wen, P. T. P. Tang, H. Li, Y. Chen, and P. Dubey, “Faster CNNs with Direct Sparse Convolutions and Guided Pruning,” in 5th International Conference on Learning Representations (ICLR), 2017.
[14] A. Delmas, S. Sharify, P. Judd, M. Nikolic, and A. Moshovos, “DPRed: Making typical activation values matter in deep learning computing,” CoRR, vol. abs/1804.06732, 2018.