Tartan: Accelerating Fully-Connected and Convolutional Layers in Deep Learning Networks by Exploiting Numerical Precision Variability

Abstract

Tartan (TRT), a hardware accelerator for inference with Deep Neural Networks (DNNs), is presented and evaluated on Convolutional Neural Networks. TRT exploits the variable per layer precision requirements of DNNs to deliver execution time that is proportional to the precision in bits used per layer for convolutional and fully-connected layers. Prior art has demonstrated an accelerator with the same execution performance only for convolutional layers [1, 2]. Experiments on image classification CNNs show that, on average across all networks studied, TRT outperforms a state-of-the-art bit-parallel accelerator [3] without any loss in accuracy while being more energy efficient. TRT requires no network retraining while it enables trading off accuracy for additional improvements in execution performance and energy efficiency. For example, if a 1% relative loss in accuracy is acceptable, TRT is on average faster and more energy efficient than a conventional bit-parallel accelerator. A Tartan configuration that processes two bits at a time, requires less area than the 1-bit configuration, and improves efficiency over the bit-parallel baseline while being 73% faster for convolutional layers and 60% faster for fully-connected layers is also presented.

1 Introduction

It is only recently that commodity computing hardware in the form of graphics processors delivered the performance necessary for practical, large scale Deep Neural Network applications [4]. At the same time, the end of Dennard Scaling in semiconductor technology [5] makes it difficult to deliver further advances in hardware performance using existing general purpose designs. It seems that further advances in DNN sophistication will have to rely mostly on algorithmic innovations and, more generally, on innovations at the software level, which can be helped by innovations in hardware design. Accordingly, hardware DNN accelerators have emerged. The DianNao accelerator family was the first to use a wide single-instruction multiple-data (SIMD) architecture to process up to 4K operations in parallel on a single chip [6, 3], outperforming graphics processors by two orders of magnitude. Development in hardware accelerators has since proceeded in two directions: either toward more general purpose accelerators that can support more machine learning algorithms while keeping performance mostly on par with DaDianNao (DaDN) [3], or toward further specialization on specific layers or classes of DNNs with the goal of outperforming DaDN in execution time and/or energy efficiency, e.g., [7, 8, 1, 9, 10]. This work is along the second direction. While a DNN accelerator that is as general purpose as possible is desirable, further improving performance and energy efficiency for specific machine learning algorithms will provide us with the additional experience that is needed for developing the next generation of more general purpose machine learning accelerators. Section 6 reviews several other accelerator designs.

While DaDN’s functional units process 16-bit fixed-point values, DNNs exhibit varying precision requirements across and within layers, e.g., [11]. Accordingly, it is possible to use shorter, per layer representations for activations and/or weights. However, with existing bit-parallel functional units doing so does not translate into a performance or energy advantage as the values are expanded into the native hardware precision inside the unit. Some designs opt to hardwire the whole network on-chip by using tailored datapaths per layer, e.g., [12]. Such hardwired implementations are of limited appeal for many modern DNNs whose footprint ranges in the tens or hundreds of megabytes of weights and activations. Accordingly, this work targets accelerators that can translate any precision reduction into performance and that do not require that precisions be hardwired at implementation time.

This work presents Tartan (TRT), a massively parallel hardware accelerator whose execution time for fully-connected (FCLs) and convolutional (CVLs) layers scales with the precision used to represent the input values. TRT uses hybrid bit-serial/bit-parallel functional units and exploits the abundant parallelism of typical DNN layers with the following goals: 1) exceeding DaDN’s execution time performance and energy efficiency, 2) maintaining the same activation and weight memory interface and wire counts, and 3) maintaining wide, highly efficient accesses to weight and activation memories. Ideally, Tartan improves execution time over DaDN by 16/p, where p is the precision in bits used for the activations in CVLs and for the activations and weights in FCLs. Every single bit of precision that can be eliminated ideally reduces execution time and increases energy efficiency. For example, decreasing precision from 13 to 12 bits in an FCL can ideally boost the performance improvement over DaDN from 23% to 33%. TRT builds upon the Stripes (STR) accelerator [2, 1] which improves execution time and energy efficiency on CVLs only. While STR matches the performance of a bit-parallel accelerator on FCLs, its energy efficiency suffers considerably. TRT improves performance and energy efficiency over a bit-parallel accelerator for both CVLs and FCLs.

This work evaluates TRT on a set of convolutional neural networks (CNNs) for image classification. On average TRT reduces inference time by , and over DaDN for the fully-connected, the convolutional, and all layers respectively. Energy efficiency compared to DaDN with TRT is , and respectively. By comparison, efficiency with STR compared to DaDN is , and respectively. Additionally, TRT enables trading off accuracy for improving execution time and energy efficiency. For example, on average on FCLs, accepting a 1% loss in relative accuracy improves performance to and energy efficiency to compared to DaDN.

In detail this work makes the following contributions:

  • It extends the STR accelerator to offer performance improvements on FCLs. Not only does STR fail to improve performance on FCLs, but its energy efficiency also suffers compared to DaDN.

  • TRT cascades multiple serial inner-product (SIP) units to improve utilization when the number of filters or the dimensions of the filters are not a multiple of the datapath lane count.

  • It uses the methodology of Judd et al. [11] to determine per layer weight and activation precisions for the fully-connected layers of several modern image classification CNNs.

  • It evaluates a configuration of TRT that trades off some of the performance improvement for better energy and area efficiency. The evaluated configuration processes two activation bits per cycle and requires half the parallelism and half as many SIPs as the bit-serial TRT configuration.

  • It reports energy efficiency and area measurements derived from a layout of the TRT accelerator, demonstrating its benefits over the previously proposed STR and DaDN accelerators.

The rest of this document is organized as follows: Section 2 motivates TRT. Section 3 illustrates the key concepts behind TRT via an example. Section 4 reviews the DaDN architecture and presents an equivalent Tartan configuration. Section 5 presents the experimental results. Section 6 reviews related work and discusses the limitations of this study and the potential challenges with TRT. Section 7 concludes.

2 Motivation

This section motivates TRT by showing that: 1) the precisions needed for the FCLs of several modern image classification CNNs are far below the fixed 16-bit precision used by DaDN, and 2) the energy efficiency of STR is below that of DaDN for FCLs. Combined, these results motivate TRT, which improves performance and energy efficiency for both FCLs and CVLs compared to DaDN.

2.1 Numerical Representation Requirements Analysis

The experiments of this section corroborate past findings that the precision needed during inference varies per layer for several modern image classification CNNs. The section also shows that there is significant potential to improve performance if it were possible to exploit per layer precisions even for the FCLs. The per layer precision profiles presented here were found via the methodology of Judd et al. [11]. Caffe [13] was used to measure how reducing the precision of each FCL affects the network’s overall top-1 prediction accuracy over 5000 images. The network definitions and pre-trained synaptic weights are taken from the Caffe Model Zoo [14]. The networks are used as-is without retraining; further reductions in precision may be possible with retraining. As Section 3 will explain, TRT’s performance on a layer is bound by the maximum of the weight and activation precisions. Accordingly, precision exploration was limited to cases where both are equal. The search procedure is a greedy one where a given layer’s precision is iteratively decremented one bit at a time until the network’s accuracy drops. For weights, the fixed-point numbers are set to represent values between -1 and +1. For activations, the number of fractional bits is fixed to a previously-determined value known not to hurt accuracy, as per Judd et al. [11]. While both activations and weights use the same number of bits, their precisions and ranges differ. For CVLs only the activation precision is adjusted, as with the TRT design there is no benefit in adjusting the weight precisions as well; weights remain at 16 bits for CVLs. Reducing the weight precision for CVLs can reduce their memory footprint [15], an option we do not explore further in this work.
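
The per layer exploration just described can be summarized with the following Python sketch; evaluate_top1 and quantize_layer are hypothetical stand-ins for the Caffe-based measurement flow rather than real API calls, and the tolerance argument models the 100% (tolerance 0) and 99% (1% relative) profiles.

    def find_layer_precisions(layers, evaluate_top1, quantize_layer,
                              baseline_accuracy, tolerance=0.0, start_bits=16):
        # Greedily decrement each layer's precision until top-1 accuracy drops.
        precisions = {layer: start_bits for layer in layers}
        for layer in layers:
            bits = start_bits
            while bits > 1:
                quantize_layer(layer, bits - 1)          # try one bit fewer
                accuracy = evaluate_top1()               # accuracy over 5000 images
                if accuracy < baseline_accuracy - tolerance:
                    quantize_layer(layer, bits)          # restore the last good setting
                    break
                bits -= 1
            precisions[layer] = bits
        return precisions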

Table 1 reports the resulting per layer precisions separately for FCLs and CVLs. The ideal speedup columns report the performance improvement that would be possible if execution time could be reduced proportionally with precision compared to a 16-bit bit-parallel baseline. For the FCLs, the precisions required range from 8 to 10 bits and the corresponding potential for performance improvement ranges from 1.63x to 1.66x. If a 1% relative reduction in accuracy is acceptable, the performance improvement potential increases and ranges from 1.63x to as much as 1.85x. Given that the precision variability for FCLs is relatively low (ranges from 8 to 11 bits), one may be tempted to conclude that a bit-parallel architecture with 11 bits may be an appropriate compromise. However, note that the precision variability is much larger for the CVLs (the range is 5 to 13 bits) and thus performance with a fixed precision datapath would be far below the ideal. For example, speedup with a 13-bit datapath would be just 1.23x vs. the 2x that is possible with 8-bit precision. A key motivation for TRT is that its incremental cost over STR, which already supports variable per layer precisions for CVLs, is well justified given the benefits. Section 5 quantifies this cost and the resulting performance and energy benefits.

Network | Per Layer Activation Precision in Bits (Convolutional) | Ideal Speedup (Convolutional) | Per Layer Activation and Weight Precision in Bits (Fully-Connected) | Ideal Speedup (Fully-Connected)
100% Accuracy
AlexNet 9-8-5-5-7 2.38 10-9-9 1.66
VGG_S 7-8-9-7-9 2.04 10-9-9 1.64
VGG_M 7-7-7-8-7 2.23 10-8-8 1.64
VGG_19 12-12-12-11-12-10-11-11-13-12-13-13-13-13-13-13 1.35 10-9-9 1.63
99% Accuracy
AlexNet 9-7-4-5-7 2.58 9-8-8 1.85
VGG_S 7-8-9-7-9 2.04 9-9-8 1.79
VGG_M 6-8-7-7-7 2.34 9-8-8 1.80
VGG_19 9-9-9-8-12-10-10-12-13-11-12-13-13-13-13-13 1.57 10-9-8 1.63
Table 1: Per layer synapse precision profiles needed to maintain the same accuracy as in the baseline. Ideal: Potential speedup with bit-serial processing of activations over a 16-bit bit-parallel baseline.
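
To make the Ideal Speedup columns concrete, the sketch below computes the potential speedup over a 16-bit bit-parallel baseline from per layer precisions, assuming execution time is proportional to each layer's work times its precision; the per layer work values in the example are made up (Table 1's 1.66 for AlexNet FCLs reflects the real, non-uniform layer sizes).

    def ideal_speedup(precisions, work, base_bits=16):
        # Baseline time ~ sum(work); reduced-precision time ~ sum(work * p / base_bits).
        baseline = sum(work)
        reduced = sum(w * p / base_bits for w, p in zip(work, precisions))
        return baseline / reduced

    # Example with the AlexNet FCL profile (10-9-9 bits) and made-up, equal work:
    print(ideal_speedup([10, 9, 9], [1.0, 1.0, 1.0]))
    # ~1.71 with equal work; Table 1 reports 1.66 because real layers differ in size.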

2.2 Energy Efficiency with Stripes

Stripes (STR) uses hybrid inner-product units that process activations bit-serially and weights bit-parallel, exploiting the per layer precision variability of modern CNNs [1]. However, STR exploits precision reductions only for CVLs as it relies on weight reuse across multiple windows to keep the weight memory width the same as in DaDN (there is no weight reuse in FCLs). Figure 1 reports the energy efficiency of STR over that of DaDN for FCLs (Section 5.1 details the experimental methodology). While performance is virtually identical to DaDN, energy efficiency is lower than DaDN’s on average. This result, combined with the reduced precision requirements of FCLs, serves as motivation for extending STR to improve performance and energy efficiency compared to DaDN on both CVLs and FCLs.

2.3 Motivation Summary

This section showed that: 1) the per layer precisions for FCLs of several modern CNNs for image classification vary significantly and exploiting them has the potential to improve performance, and 2) STR, which exploits variable precision requirements only for CVLs, falls below the energy efficiency of the bit-parallel baseline on FCLs. Accordingly, an architecture that exploits precisions for FCLs as well as CVLs is worth investigating in the hope that it will eliminate this energy efficiency deficit, resulting in an accelerator that is higher performing and more energy efficient for both layer types. Combined, FCLs and CVLs account for more than 99% of the execution time in DaDN.

Figure 1: Energy Efficiency of Stripes compared to DaDN on Fully-Connected layers.

3 Tartan: A Simplified Example

This section illustrates at a high-level the way TRT operates by showing how it would process two purposely trivial cases: 1) a fully-connected layer with a single input activation producing two output activations, and 2) a convolutional layer with two input activations and one single-weight filter producing two output activations. The per layer calculations are:

f1 = w1 × a,  f2 = w2 × a   (fully-connected layer)
c1 = w × a1,  c2 = w × a2   (convolutional layer)

where f1, f2, c1, and c2 are output activations, w, w1, and w2 are weights, and a, a1, and a2 are input activations. For clarity all values are assumed to be represented in 2 bits of precision.

3.1 Conventional Bit-Parallel Processing

Figure 2a shows a bit-parallel processing engine representative of DaDN. Every cycle, the engine can calculate the product of two 2-bit inputs, (weight) and (activation) and accumulate or store it into the output register . Parts (b) and (c) of the figure show how this unit can calculate the example CVL over two cycles. In part (b) and during cycle 1, the unit accepts along the input bits 0 and 1 of (noted as and respectively on the figure), and along the input bits 0 and 1 of and produces both bits of output . Similarly, during cycle 2 (part (c)), the unit processes and to produce . In total, over two cycles, the engine produced two products. Processing the example FCL also takes two cycles: In the first cycle, and produce , and in the second cycle and produce . This process is not shown in the interest of space.
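
For later contrast with the bit-serial engine, a minimal functional model of this bit-parallel engine is sketched below in Python; values are treated as unsigned integers and the operand names are illustrative.

    def bit_parallel_engine(weights, activations):
        # One full weight x activation product per cycle.
        outputs = []
        for w, a in zip(weights, activations):
            outputs.append(w * a)
        return outputs

    # Example CVL: one weight reused with two activations over two cycles.
    w, a1, a2 = 3, 2, 1                            # 2-bit unsigned values for simplicity
    print(bit_parallel_engine([w, w], [a1, a2]))   # [6, 3] after two cycles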

Figure 2: Bit-Parallel Engine processing the convolutional layer over two cycles: a) Structure, b) Cycle 1, and c) Cycle 2.
(a) Engine Structure
(b) Cycle 1: Parallel Load on BRs
(c) Cycle 2: Multiply with bits 0 of the activations
(d) Cycle 3: Multiply with bits 1 of the activations
Figure 3: Processing the example Convolutional Layer Using TRT’s Approach.
(a) Cycle 1: Shift in bits 1 of weights into the ARs
(b) Cycle 2: Shift in bits 0 of weights into the ARs
(c) Cycle 3: Copy AR into BR
(d) Cycle 4: Multiply weights with first bit of
(e) Cycle 5: Multiply weights with second bit of
Figure 4: Processing the example Fully-Connected Layer using TRT’s Approach.

3.2 Tartan’s Approach

Figure 3 shows how a TRT-like engine would process the example CVL. Figure 3a shows the engine’s structure which comprises two subunits. The two subunits each accept one bit of an activation per cycle through their respective inputs and, as before, there is a common 2-bit weight input. In total, the number of input bits is 4, the same as in the bit-parallel engine.

Each subunit contains three 2-bit registers: a shift-register AR, a parallel load register BR, and a parallel load output register OR. Each cycle each subunit can calculate the product of its single bit input with BR, which it can write or accumulate into its OR. There is no bit-parallel multiplier since the subunits process a single activation bit per cycle. Instead, two AND gates, a shift-and-add functional unit, and OR form a shift-and-add multiplier/accumulator. Each AR can load a single bit per cycle from one of the weight wires, and BR can be parallel-loaded from AR or from the weight wires.

Convolutional Layer: Figure 3b through Figure 3d show how TRT processes the CVL. The figures abstract away the unit details showing only the register contents. As Figure 3b shows, during cycle 1, the weight is loaded in parallel to the BRs of both subunits via the weight inputs. During cycle 2, bit 0 of each of the two input activations is sent to the first and second subunit respectively. The subunits concurrently calculate the corresponding partial products and accumulate them into their ORs. Finally, in cycle 3, bit 1 of each input activation appears at the respective input. The subunits calculate the remaining partial products, accumulating the final output activations into their ORs.

In total, it took 3 cycles to process the layer. However, at the end of the third cycle another weight could have been loaded into the BRs (the weight inputs are idle), allowing a new set of outputs to commence computation during cycle 4. That is, loading a new weight can be hidden during the processing of the current output activation for all but the first time. In the steady state, when the input activations are represented in two bits, this engine will be producing two terms every two cycles, thus matching the bandwidth of the bit-parallel engine.

If the activations could be represented in just one bit, then this engine would be producing two output activations per cycle, twice the bandwidth of the bit-parallel engine. The latter is incapable of exploiting the reduced precision to reduce execution time. In general, if the bit-parallel hardware uses P_BASE bits to represent the activations while only P bits are enough, TRT would outperform the bit-parallel engine by P_BASE/P.
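
A minimal Python sketch of the two-subunit engine of Figure 3 follows; it models the BRs and ORs with plain unsigned integers, which is a simplification of the actual shift-and-add hardware.

    def trt_cvl_engine(w, activations, precision):
        # Each subunit holds the shared weight w in its BR and accumulates
        # (activation bit << bit position) * w into its OR, one bit per cycle.
        ors = [0] * len(activations)
        for i in range(precision):                 # one cycle per activation bit
            for s, a in enumerate(activations):
                bit = (a >> i) & 1
                ors[s] += (bit << i) * w           # shift-and-add partial product
        return ors

    print(trt_cvl_engine(3, [2, 1], precision=2))  # [6, 3], matching the bit-parallel result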

Fully-Connected Layer: Figure 4 shows how a TRT-like unit would process the example FCL. As Figure 4a shows, in cycle 1, bit 1 of each of the two weights appears on its respective weight line. The left subunit’s AR is connected to the first line while the right subunit’s AR is connected to the second. The ARs shift the corresponding bits into their least significant bit position, sign-extending into the vacant position (shown as a 0 bit in the example). During cycle 2, as Figure 4b shows, bits 0 of the two weights appear on the respective lines and the respective ARs shift them in. At the end of the cycle, the left subunit’s AR contains the full 2-bit first weight and the right subunit’s AR the full 2-bit second weight. In cycle 3, Figure 4c shows that each subunit copies the contents of AR into its BR. From the next cycle, calculating the products can proceed similarly to what was done for the CVL. In this case, however, each BR contains a different weight, whereas when processing the CVL in the previous section all BRs held the same value. The shift capability of the ARs coupled with having each subunit connect to a different wire allowed TRT to load a different weight bit-serially over two cycles. Figure 4d and Figure 4e show cycles 4 and 5 respectively. During cycle 4, bit 0 of the input activation appears on both inputs and is multiplied with the BR in each subunit. In cycle 5, bit 1 of the activation appears on both inputs and the subunits complete the calculation of both output activations. It takes two cycles to produce the two products once the correct inputs appear in the BRs.

While in our example no additional inputs or outputs are shown, it would have been possible to overlap the loading of a new set of inputs into the ARs while processing the current weights stored in the BRs. That is, loading into the ARs, copying into the BRs, and the bit-serial multiplication of the BRs with the activations form a 3-stage pipeline where each stage can take multiple cycles. In general, assuming that both activations and weights are represented using 2 bits, this engine would match the performance of the bit-parallel engine in the steady state. When both sets of inputs can be represented with fewer bits (1 in this example), the engine would produce two terms per cycle, twice the bandwidth of the bit-parallel engine of the previous section.
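
The FCL case can be sketched similarly: the weights are first shifted bit-serially into the ARs, copied to the BRs, and only then multiplied with the shared activation. The sketch again uses unsigned values and omits the sign extension shown in Figure 4.

    def trt_fcl_engine(weights, a, precision):
        # Phase 1: shift each weight into its subunit's AR, MSB first (Figure 4a-b).
        ars = [0] * len(weights)
        for i in range(precision - 1, -1, -1):
            for s, w in enumerate(weights):
                ars[s] = (ars[s] << 1) | ((w >> i) & 1)
        # Phase 2: copy AR into BR (Figure 4c).
        brs = list(ars)
        # Phase 3: multiply the BRs with the shared activation bit-serially (Figure 4d-e).
        ors = [0] * len(weights)
        for i in range(precision):
            bit = (a >> i) & 1
            for s in range(len(weights)):
                ors[s] += (bit << i) * brs[s]
        return ors

    print(trt_fcl_engine([3, 1], a=2, precision=2))   # [6, 2]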

Summary: In general, if P_BASE is the precision of the bit-parallel engine, and P_a and P_w are the precisions that can be used respectively for activations and weights of a layer, a TRT engine can ideally outperform an equivalent bit-parallel engine by P_BASE/P_a for CVLs, and by P_BASE/max(P_a, P_w) for FCLs. This example used the simplest TRT engine configuration. Since typical layers exhibit massive parallelism, TRT can be configured with many more subunits while exploiting weight reuse for CVLs and activation reuse for FCLs. The next section describes the baseline state-of-the-art DNN accelerator and presents an equivalent TRT configuration.

4 Tartan Architecture

This work presents TRT as a modification of the state-of-the-art DaDianNao accelerator. Accordingly, Section 4.1 reviews DaDN’s design and how it can process FCLs and CVLs. For clarity, in what follows the term brick refers to a set of 16 elements of a 3D activation or weight array that are contiguous along the input channel dimension. Bricks will be denoted by their origin element with a subscript. The size of a brick is a design parameter. Furthermore, an FCL can be thought of as a CVL where the input activation array has unit x and y dimensions, there are as many filters as output activations, and the filter dimensions are identical to the input activation array.

(a) DaDianNao
(b) Tartan
Figure 5: Processing Tiles.
Figure 6: Overview of the system components and their communication. a) DaDN. b) Tartan.

4.1 Baseline System: DaDianNao

Figure 5a shows a DaDN tile which processes 16 filters concurrently, calculating 16 activation and weight products per filter for a total of 256 products per cycle [3]. Each cycle the tile accepts 16 weights per filter, for a total of 256 weights, and 16 input activations. The tile multiplies each weight with only one activation whereas each activation is multiplied with 16 weights, one per filter. The tile reduces the 16 products per filter into a single partial output activation, for a total of 16 partial output activations for the tile. Each DaDN chip comprises 16 such tiles, each processing a different set of 16 filters per cycle. Accordingly, each cycle, the whole chip processes 16 activations and 256 × 16 = 4K weights, producing 256 partial output activations, 16 per tile.

Internally, each tile has: 1) a synapse buffer (SB) that provides 256 weights per cycle, one per weight lane, 2) an input neuron buffer (NBin) which provides 16 activations per cycle through 16 neuron lanes, and 3) a neuron output buffer (NBout) which accepts 16 partial output activations per cycle. In the tile’s datapath each activation lane is paired with 16 weight lanes, one from each filter. Each weight and activation lane pair feeds a multiplier, and an adder tree per filter lane reduces the 16 per filter 32-bit products into a partial sum. In all, each filter lane produces a partial sum per cycle, for a total of 16 partial output activations per tile. Once a full window is processed, the 16 resulting sums are fed through a non-linear activation function to produce the 16 final output activations. The multiplications and reductions needed per cycle are implemented via 256 multipliers, one per weight lane, and sixteen 17-input (16 products plus the partial sum from NBout) 32-bit adder trees, one per filter lane.
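
A compact functional model of this tile datapath is sketched below using NumPy; array shapes follow the 16-activation by 16-filter organization, and the code shows a single cycle's worth of products and reductions rather than the full NBout accumulation.

    import numpy as np

    def dadn_tile_cycle(activations, weights, partial_sums):
        # activations: (16,), weights: (16 filters, 16 weight lanes), partial_sums: (16,)
        products = weights * activations[np.newaxis, :]   # 256 products per cycle
        return partial_sums + products.sum(axis=1)        # sixteen adder trees

    acts = np.random.randint(-8, 8, size=16)
    wts = np.random.randint(-8, 8, size=(16, 16))
    psums = dadn_tile_cycle(acts, wts, np.zeros(16, dtype=np.int64))
    print(psums.shape)   # (16,) partial output activations per tile per cycle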

Figure 6a shows an overview of the DaDN chip. There are 16 processing tiles connected via an interconnect to a shared 2MB central eDRAM Neuron Memory (NM). DaDN’s main goal was minimizing off-chip bandwidth while maximizing on-chip compute utilization. To avoid fetching weights from off-chip, DaDN uses a 2MB eDRAM Synapse Buffer (SB) for weights per tile, for a total of 32MB of eDRAM for weight storage. All inter-layer activation outputs except for the initial input and the final output are stored in NM, which is connected via a broadcast interconnect to the 16 Input Neuron Buffers (NBin). All values are 16-bit fixed-point, hence a 256-bit wide interconnect can broadcast a full activation brick in one step. Off-chip accesses are needed only for reading the input image and the weights once per layer, and for writing the final output.

Processing starts by reading from external memory the first layer’s filter weights and the input image. The weights are distributed over the SBs and the input is stored into NM. Each cycle an input activation brick is broadcast to all units. Each unit reads 16 weight bricks from its SB and produces a partial output activation brick which it stores in its NBout. Once computed, the output activations are stored through NBout to NM and then fed back through the NBins when processing the next layer. Loading the next set of weights from external memory can be overlapped with the processing of the current layer as necessary.

4.2 Tartan

As Section 3 explained, TRT processes activations bit-serially, multiplying a single activation bit with a full weight per cycle. Each DaDN tile multiplies 16 16-bit activations with 256 weights each cycle. To match DaDN’s computation bandwidth, TRT needs to multiply 256 1-bit activations with 256 weights per cycle. Figure 5b shows the TRT tile. It comprises 256 Serial Inner-Product Units (SIPs) organized in a 16x16 grid. Similar to DaDN, each SIP multiplies 16 weights with 16 activations and reduces these products into a partial output activation. Unlike DaDN, each SIP accepts 16 single-bit activation inputs. Each SIP has two registers, each a vector of 16 16-bit subregisters: 1) the Serial Weight Register (SWR), and 2) the Weight Register (WR). These correspond to AR and BR of the example of Section 3. NBout remains as in DaDN, however, it is distributed along the SIPs as shown.

Convolutional Layers: Processing starts by reading in parallel 256 weights from the SB as in DaDN, and loading the 16 per SIP row weights in parallel to all SWRs in the row. Over the next p cycles, where p is the activation precision, the weights are multiplied by the bits of an input activation brick per column. TRT exploits weight reuse across 16 windows, sending a different input activation brick to each column. For example, for a CVL with a stride of 4, a TRT tile will process 16 activation bricks in parallel, a bit per cycle. Assuming that the tile processes 16 filters, after p cycles it would produce 256 partial output activations, that is, 16 output activation bricks contiguous along the output channel dimension. Whereas DaDN would process 16 activation bricks over 16 cycles, TRT processes them concurrently but bit-serially over p cycles. If p is less than 16, TRT will outperform DaDN by 16/p, and when p is 16, TRT will match DaDN’s performance.
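
The following back-of-the-envelope model contrasts the cycle counts of the two tiles on a CVL under the idealized assumptions above (full utilization, weight loading hidden); it is not the cycle-accurate simulator of Section 5.

    def dadn_cvl_cycles(num_bricks):
        # DaDN broadcasts one input activation brick per cycle.
        return num_bricks

    def trt_cvl_cycles(num_bricks, activation_bits):
        # TRT processes 16 bricks (one per SIP column) concurrently,
        # bit-serially over `activation_bits` cycles per group of 16.
        groups = -(-num_bricks // 16)   # ceil division
        return groups * activation_bits

    bricks = 256
    for p in (16, 8, 5):
        print(p, dadn_cvl_cycles(bricks) / trt_cvl_cycles(bricks, p))
    # speedups of 1.0, 2.0 and 3.2 for 16-, 8- and 5-bit activations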

Fully-Connected Layers: Processing starts by loading bit-serially and in parallel, over P_w cycles, 4K weights into the 256 SWRs, 16 per SIP. Each SWR per row gets a different set of 16 weights as each subregister is connected to one out of the 256 wires of the SB output bus for the SIP row (as in DaDN, there are 256 weight wires per row). Once the weights have been loaded, each SIP copies its SWR to its WR and multiplication with the input activations can then proceed bit-serially over P_a cycles. Assuming that there are enough output activations so that a different output activation can be assigned to each SIP, the same input activation brick can be broadcast to all SIP columns. For example, for an FCL a TRT tile will process one activation brick bit-serially to produce 16 output activation bricks, one per SIP column. Loading the next set of weights can be done in parallel with processing the current set, thus execution time is constrained by max(P_a, P_w). Thus, a TRT tile produces 256 partial output activations every max(P_a, P_w) cycles, a speedup of 16/max(P_a, P_w) over DaDN since a DaDN tile always needs 16 cycles to do the same.
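
A similarly simplified steady-state model for an FCL is shown below; it assumes at least 256 outputs per tile so every SIP has work, and that loading the next weight set over P_w cycles overlaps with the P_a compute cycles of the current set.

    def trt_fcl_tile_cycles(p_a, p_w):
        # Steady-state cycles per 256 partial output activations: serial weight
        # loading (p_w cycles) overlaps with the bit-serial multiply (p_a cycles).
        return max(p_a, p_w)

    DADN_CYCLES = 16     # a DaDN tile needs 16 cycles for the same 256 partial outputs

    p = 9                # e.g., a 9-bit FCL profile from Table 1
    print(DADN_CYCLES / trt_fcl_tile_cycles(p, p))   # ~1.78x ideal speedup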

Cascade Mode: For TRT to be fully utilized an FCL must have at least 4K output activations. Some of the networks studied have a layer with as few as 2K output activations. To avoid underutilization, the SIPs along each row are cascaded into a daisy-chain, where the output of one can feed into an input of the next via a multiplexer. This way, the computation of an output activation can be sliced over the SIPs along the same row. In this case, each SIP processes only a portion of the input activations, resulting in several partial output activations along the SIPs of the same row. Over the following cycles, the partial outputs can be reduced into the final output activation, one cascade step per slice used. The user can choose any number of slices up to 16, so that TRT can be fully utilized even with fully-connected layers of just 256 outputs. This cascade mode can be useful in other Deep Learning networks such as NeuralTalk [16] where the smallest FCLs can have 600 outputs or fewer.
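
To illustrate the cascade mode, the sketch below slices one output activation's inner product across several SIPs of a row and then reduces the partial results along the daisy-chain; the slicing bookkeeping is simplified and the names are illustrative.

    def cascade_output(weights, activations, num_slices):
        # Slice the inner product across `num_slices` SIPs of one row.
        n = len(weights)
        chunk = -(-n // num_slices)                     # ceil division
        partials = [sum(w * a for w, a in zip(weights[i:i + chunk],
                                              activations[i:i + chunk]))
                    for i in range(0, n, chunk)]
        # Daisy-chain reduction: each SIP feeds its partial to the next one.
        total = 0
        for p in partials:                              # one extra step per slice
            total += p
        return total

    w = list(range(32)); a = [1] * 32
    print(cascade_output(w, a, num_slices=4))   # 496, same as the unsliced inner product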

Other Layers: TRT, like DaDN, can process the additional layers needed by the studied networks. For this purpose the tile includes additional hardware support for max pooling similar to DaDN. An activation function unit is present at the output of NBout in order to apply nonlinear activations before the output neurons are written back to NM.

4.3 SIP and Other Components

SIP: Bit-Serial Inner-Product Units: Figure 7 shows TRT’s Bit-Serial Inner-Product Unit (SIP). Each SIP multiplies 16 activation bits, one bit per activation, by 16 weights to produce an output activation. Each SIP has two registers, a Serial Weight Register (SWR) and a Weight Register (WR), each containing 16 16-bit subregisters. Each SWR subregister is a shift register with a single bit connection to one of the weight bus wires that is used to read weights bit-serially for FCLs. Each WR subregister can be parallel loaded from either the weight bus or the corresponding SWR subregister, to process CVLs or FCLs respectively. Each SIP includes 256 2-input AND gates that multiply the weights in the WR with the incoming activation bits, and an adder tree that sums the partial products. A final adder plus a shifter accumulate the adder tree results into the output register OR. In each SIP, a multiplexer at the first input of the adder tree implements the cascade mode, supporting slicing the output activation computation along the SIPs of a single row. To support signed 2’s complement neurons, the SIP can subtract the weight corresponding to the most significant bit (MSB) of the activation from the partial sum when the MSB is 1. This is done with negation blocks for each weight before the adder tree. Each SIP also includes a comparator (max) to support max pooling layers.

Figure 7: TRT’s SIP.
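
The 2's complement handling described above can be sketched behaviourally as follows: the adder tree result for the activation's most significant bit is negated before accumulation, a standard way to evaluate a signed value bit-serially. This is a sketch of the behaviour, not the SIP netlist.

    def sip_signed_inner_product(weights, activations, precision):
        # Activations are 2's complement, processed bit-serially, LSB first.
        acc = 0
        for i in range(precision):
            # AND gates: select each weight whose activation has bit i set.
            partial = sum(w for w, a in zip(weights, activations) if (a >> i) & 1)
            # The adder tree result is shifted into place; the MSB's contribution
            # is subtracted instead of added (negation blocks before the tree).
            acc += (-partial if i == precision - 1 else partial) << i
        return acc

    # 4-bit example: the activation -3 is 0b1101 in 2's complement.
    print(sip_signed_inner_product([2], [0b1101], precision=4))   # -6 == 2 * (-3)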

Dispatcher and Reducers: Figure 6b shows an overview of the full TRT system. As in DaDN there is a central NM and 16 tiles. A Dispatcher unit is tasked with reading input activations from NM, always performing eDRAM-friendly wide accesses. It transposes each activation and communicates it one bit at a time over the global interconnect. For CVLs the dispatcher has to maintain a pool of multiple activation bricks, each from a different window, which may require fetching multiple rows from NM. However, since a new set of windows is only needed every p cycles, the dispatcher can keep up for the layers studied. For FCLs one activation brick is sufficient. A Reducer per tile is tasked with collecting the output activations and writing them to NM. Since output activations take multiple cycles to produce, there is sufficient bandwidth to sustain all 16 tiles.
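
The Dispatcher's transposition can be sketched as converting a brick of activations into bit-planes, one plane broadcast per cycle; the exact wire layout is an assumption, and with a reduced precision of p only the first p planes need to be sent.

    def transpose_to_bit_planes(brick, precision=16):
        # Bit j of plane i is bit i of activation j; planes are sent LSB first.
        planes = []
        for i in range(precision):
            plane = 0
            for j, a in enumerate(brick):
                plane |= ((a >> i) & 1) << j
            planes.append(plane)
        return planes      # one 16-bit plane broadcast per cycle

    brick = [3, 5, 0, 7] + [0] * 12       # a brick of 16 activations (unsigned here)
    print([bin(p) for p in transpose_to_bit_planes(brick, precision=3)])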

4.4 Processing Several Activation Bits at Once

In order to improve TRT’s area and power efficiency, the number of activation bits processed at once can be adjusted at design time. The chief advantage of these designs is that fewer SIPs are needed to achieve the same throughput; for example, processing two activation bits at once reduces the number of SIP columns from 16 to 8 and their total number to half. Although the total number of bus wires is similar, the distance they have to cover is significantly reduced. Likewise, the total number of adders required stays similar, but they are clustered closer together. A drawback of these configurations is that they forgo some of the performance potential as they force the activation precisions to be a multiple of the number of bits that they process per cycle. A designer can choose the configuration that best meets their area, energy efficiency and performance targets.

In these configurations the weights are multiplied with several activation bits at once, and the multiplication results are partially shifted before they are inserted into their corresponding adder tree. In order to load the weights on time, the SWR subregister has to be modified so it can load several bits in parallel, and shift that number of positions every cycle. The negation block (for 2’s complement support) will operate only over the most significant product result.
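
A behavioural sketch of the two-bits-per-cycle case follows: each cycle a 2-bit digit of the activation multiplies the weight and the product is shifted by the digit's position before accumulation. Sign handling (negating the most significant digit's product, as noted above) is omitted and values are unsigned.

    def serial_mac_2bit(w, a, precision):
        # Consume the activation in 2-bit digits, one digit per cycle.
        precision += precision % 2                 # round precision up to a multiple of 2
        acc = 0
        for i in range(0, precision, 2):
            digit = (a >> i) & 0b11
            acc += (w * digit) << i                # partial shift before accumulation
        return acc

    print(serial_mac_2bit(w=7, a=13, precision=4))   # 91 == 7 * 13, in two cycles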

5 Evaluation

This section evaluates TRT’s performance, energy and area compared to DaDN. It also explores the trade-off between accuracy and performance for TRT. Section 5.1 describes the experimental methodology. Section 5.2 reports the performance improvements with TRT. Section 5.3 reports energy efficiency and Section 5.4 reports TRT’s area overhead. Finally, Section 5.5 studies a TRT configuration that processes two activation bits per cycle.

5.1 Methodology

DaDN, STR and TRT were modeled using the same methodology for consistency. A custom cycle-accurate simulator models execution time. Computation was scheduled as described by [1] to maximize energy efficiency for DaDN.

The logic components of both systems were synthesized with the Synopsys Design Compiler [17] for a TSMC 65nm library to report power and area. The circuit is clocked at 980 MHz. The NBin and NBout SRAM buffers were modelled using CACTI [18]. The eDRAM area and energy were modelled with Destiny [19]. Three design corners were considered, as shown in Table 2, and the typical case was chosen for layout.

Design corner | Area overhead | Mean efficiency
Best case 39.40% 0.933
Typical case 40.40% 1.012
Worst case 45.30% 1.047
Table 2: Pre-layout results comparing TRT to DaDN. Efficiency values for FC layers.
Network | FCL Perf (100%) | FCL Eff (100%) | FCL Perf (99%) | FCL Eff (99%) | CVL Perf (100%) | CVL Eff (100%) | CVL Perf (99%) | CVL Eff (99%)
AlexNet 1.61 1.06 1.80 1.19 2.32 1.43 2.52 1.55
VGG_S 1.61 1.05 1.76 1.16 1.97 1.21 1.97 1.21
VGG_M 1.61 1.06 1.77 1.17 2.18 1.34 2.29 1.40
VGG_19 1.60 1.05 1.61 1.06 1.35 0.83 1.56 0.96
geomean 1.61 1.06 1.73 1.14 1.91 1.18 2.05 1.26
Table 3: Execution time and energy efficiency improvement with TRT compared to DaDN.

5.2 Execution Time

Table 3 reports TRT’s performance and energy efficiency relative to DaDN for the precision profiles in Table 1, separately for FCLs, CVLs, and the whole network. For the 100% profile, where no accuracy is lost, TRT yields, on average, a 1.61x speedup over DaDN on FCLs. With the 99% profile, it improves to 1.73x.

There are two main reasons the ideal speedup cannot be reached in practice: dispatch overhead and underutilization. Dispatch overhead occurs in the initial cycles of execution, where the serial weight loading process prevents any useful products from being computed. In practice, this overhead is less than 2% for any given network, although it can be as high as 6% for the smallest layers. Underutilization can happen when the number of output neurons is not a power of two, or is lower than 256. The last classifier layers of networks designed to perform recognition of ImageNet categories [20] all provide 1000 output neurons, which leads to 2.3% of the SIPs being idle.

Compared to STR, TRT matches its performance improvements on CVLs while offering performance improvements on FCLs. We do not report the detailed results for STR since they would have been identical to TRT for CVLs and within 1% of DaDN for FCLs.

We have also evaluated TRT on the NeuralTalk LSTM [16] which uses long short-term memory to automatically generate image captions. Precision can be reduced down to 11 bits without affecting the accuracy of the predictions (measured as the BLEU score when compared to the ground truth), resulting in an ideal performance improvement of 16/11 = 1.45x, which translates into a speedup with TRT. We do not include these results in Table 3 since we did not study the CVLs nor did we explore reducing precision further to obtain a 99% accuracy profile.

5.3 Energy Efficiency

This section compares the energy efficiency, or simply efficiency, of TRT and DaDN. Energy efficiency is the inverse of the relative energy consumption of the two designs. As Table 3 reports, TRT improves energy efficiency on average over DaDN across the networks studied for the 100% profile. In FCLs, TRT is on average 1.06x more efficient than DaDN. Overall, the efficiency gain primarily comes from the reduction in effective computation following the use of reduced precision arithmetic for the inner product operations. Furthermore, the amount of data that has to be transmitted from the SB and the traffic between the central eDRAM and the SIPs decrease proportionally with the chosen precision.

Component | TRT area | TRT 2-bit area | DaDN area
Inner-Product Units 57.27 (47.71%) 37.66 (37.50%) 17.85 (22.20%)
Synapse Buffer 48.11 (40.08%) 48.11 (47.90%) 48.11 (59.83%)
Input Neuron Buffer 3.66 (3.05%) 3.66 (3.64%) 3.66 (4.55%)
Output Neuron Buffer 3.66 (3.05%) 3.66 (3.64%) 3.66 (4.55%)
Neuron Memory 7.13 (5.94%) 7.13 (7.10%) 7.13 (8.87%)
Dispatcher 0.21 (0.17%) 0.21 (0.21%) -
Total 120.04 (100%) 100.43 (100%) 80.41 (100%)
Normalized Total 1.49 1.25 1.00
Table 4: Area Breakdown for TRT and DaDN

5.4 Area

Table 4 reports the area breakdown of TRT and DaDN. Over the full chip, TRT needs 1.49x the area of DaDN while delivering on average an improvement in speed. Generally, performance would scale sublinearly with area for DaDN due to underutilization. The 2-bit variant, which has a lower area overhead, is described in detail in the next section.

5.5 Processing Two Activation Bits at Once

This section evaluates the performance, energy efficiency and area of a multi-bit design as described in Section 4.4, where 2 bits are processed every cycle by half as many total SIPs. The precisions used are the same as indicated in Table 1 for the 100% accuracy profile, rounded up to the next multiple of two. Table 5 reports the resulting performance. The 2-bit TRT always improves performance compared to DaDN, as the “vs. DaDN” columns show. Compared to the 1-bit TRT, performance is slightly lower; however, given that the area of the 2-bit TRT is much lower, this can be a good trade-off. Overall, there are two forces at work that shape performance relative to the 1-bit TRT: there is performance potential lost due to rounding all precisions to an even number, and there is performance benefit from requiring less parallelism. The time needed to serially load the first bundle of weights is also reduced. In VGG_19 the performance benefit due to the lower parallelism requirement outweighs the performance loss due to precision rounding. In all other cases, the reverse is true.

Network | FCL vs. DaDN | FCL vs. 1b TRT | CVL vs. DaDN | CVL vs. 1b TRT
AlexNet +58% -2.06% +208% -11.71%
VGG_S +59% -1.25% +76% -12.09%
VGG_M +63% +1.12% +91% -13.78%
VGG_19 +59% -0.97% +29% -4.11%
geomean +60% -0.78% +73% -10.36%
Table 5: Relative performance of 2-bit TRT variation compared to DaDN and 1-bit TRT

A hardware synthesis and layout of both DaDN and TRT’s 2-bit variant using TSMC’s 65nm typical case libraries shows that the total area overhead can be as low as 24.9% (Table 4), with improved energy efficiency in fully-connected layers on average (Table 3).

6 Related Work and Limitations

The recent success of Deep Learning has led to several proposals for hardware acceleration of DNNs. This section reviews some of these recent efforts. However, specialized hardware design for neural networks is a field with a relatively long history. Relevant to TRT, bit-serial processing hardware for neural networks was proposed several decades ago, e.g., [21, 22]. While the performance of these designs scales with precision, it would be lower than that of an equivalently configured bit-parallel engine. For example, Svensson et al. use a bit-serial multiplier whose cycle count grows with the precision in bits [21]. Furthermore, as semiconductor technology has progressed, the number of resources that can be put on a chip and the trade-offs (e.g., relative speed of memory vs. transistors vs. wires) are today vastly different, facilitating different designs. However, truly bit-serial processing such as that used in the aforementioned proposals needs to be revisited with today’s technology constraints due to its potentially high compute density (compute bandwidth delivered per area).

In general, hardware acceleration for DNNs has recently progressed in two directions: 1) considering more general purpose accelerators that can support additional machine learning algorithms, and 2) considering further improvements primarily for convolutional neural networks and the two layer types that dominate execution time: convolutional and fully-connected. In the first category there are accelerators such as Cambricon [23] and Cambricon-X [24]. While targeting support for more machine learning algorithms is desirable, work on further optimizing performance for specific algorithms such as TRT is valuable and needs to be pursued as it will affect future iterations of such general purpose accelerators.

TRT is closely related to Stripes [2, 1] whose execution time scales with precision but only for CVLs. STR does not improve performance for FCLs. TRT improves upon STR by enabling: 1) performance improvements for FCLs, and 2) slicing the activation computation across multiple SIPs thus preventing under-utilization for layers with fewer than 4K outputs. Pragmatic uses an organization similar in spirit to STR but its performance on CVLs depends only on the number of activation bits that are 1 [25]. It should be possible to apply the TRT extensions to Pragmatic; however, performance in FCLs will still be dictated by weight precision. The area and energy overheads would need to be amortized by a commensurate performance improvement, necessitating a dedicated evaluation study.

The Efficient Inference Engine (EIE) uses synapse pruning, weight compression, zero activation elimination, and network retraining to drastically reduce the amount of computation and data communication when processing fully-connected layers [7]. An appropriately configured EIE will outperform TRT for FCLs, provided that the network is pruned and retrained. However, the two approaches attack a different component of FCL processing and there should be synergy between them. Specifically, EIE currently does not exploit the per layer precision variability of DNNs and relies on retraining the network. It would be interesting to study how EIE would benefit from a TRT-like compute engine where EIE’s data compression and pruning is used to create vectors of weights and activations to be processed in parallel. EIE uses single-lane units whereas TRT uses a coarser-grain lane arrangement and thus would be prone to more imbalance. A middle ground may be able to offer some performance improvement while compensating for cross-lane imbalance.

Eyeriss uses a systolic-array-like organization and gates off computations for zero activations [9], primarily targeting high energy efficiency. An actual prototype has been built and is in full operation. Cnvlutin is a SIMD accelerator that skips on-the-fly ineffectual activations such as those that are zero or close to zero [8]. Minerva is a DNN hardware generator which also takes advantage of zero activations and targets high energy efficiency [10]. Layer fusion can further reduce off-chip communication and create additional parallelism [26]. As multiple layers are processed concurrently, a straightforward combination with TRT would use the maximum of the precisions when layers are fused.

Google’s Tensor Processing Unit uses quantization to represent values using 8 bits [27] to support TensorFlow [28]. As Table 1 shows, some layers can use lower than 8 bits of precision which suggests that even with quantization it may be possible to use fewer levels and to potentially benefit from an engine such as TRT.

6.1 Limitations

As in DaDN this work assumed that each layer fits on-chip. However, as networks evolve it is likely that they will increase in size thus requiring multiple TRT nodes as was suggested in DaDN. However, some newer networks tend to use more but smaller layers. Regardless, it would be desirable to reduce the area cost of TRT most of which is due to the eDRAM buffers. We have not explored this possibility in this work. Proteus [15] is directly compatible with TRT and can reduce memory footprint by about 60% for both convolutional and fully-connected layers. Ideally, compression, quantization and pruning similar in spirit to EIE [7] would be used to reduce computation, communication and footprint. General memory compression [29] techniques offer additional opportunities for reducing footprint and communication.

We evaluated TRT only on CNNs for image classification. Other network architectures are important and the layer configurations and their relative importance varies. TRT enables performance improvements for two of the most dominant layer types. We have also provided some preliminary evidence that TRT works well for NeuralTalk LSTM [16]. Moreover, by enabling output activation computation slicing it can accommodate relatively small layers as well.

Applying some of the concepts that underlie the TRT design to other more general purpose accelerators such as Cambricon [23] or to graphics processors would certainly be preferable to a dedicated accelerator in most application scenarios. However, these techniques are best first investigated in specific designs and can then be generalized appropriately.

We have evaluated TRT for inference only. Using an engine whose performance scales with precision would provide another degree of freedom for network training as well. However, TRT would need to be modified accordingly to support all the operations necessary during training, and the training algorithms would need to be modified to take advantage of precision adjustments.

This section commented only on related work on digital hardware accelerators for DNNs. Advances at the algorithmic level would impact TRT as well or may even render it obsolete. For example, work on using binary weights [30] would obviate the need for an accelerator whose performance scales with weight precision. Investigating TRT’s interaction with other network types and architectures and other machine learning algorithms is left for future work.

7 Conclusion

This work presented Tartan, an accelerator for inference with Convolutional Neural Networks whose performance scales inversely linearly with the number of bits used to represent values in fully-connected and convolutional layers. TRT also enables on-the-fly accuracy vs. performance and energy efficiency trade offs and its benefits were demonstrated over a set of popular image classification networks. The new key ideas in TRT are: 1) Supporting both the bit-parallel and the bit-serial loading of weights into processing units to facilitate the processing of either convolutional or fully-connected layers, and 2) cascading the adder trees of various subunits (SIPs) to enable slicing the output computation thus reducing or eliminating cross-lane imbalance for relatively small layers.

TRT opens up a new direction for research in inference and training by enabling precision adjustments to translate into performance and energy savings. These precision adjustments can be done statically prior to execution or dynamically during execution. While we demonstrated TRT for inference only, we believe that TRT, especially if combined with Pragmatic, opens up a new direction for research in training as well. For systems level research and development, TRT with its ability to trade off accuracy for performance and energy efficiency enables a new degree of adaptivity for operating systems and applications.

References

  1. P. Judd, J. Albericio, T. Hetherington, T. Aamodt, and A. Moshovos, “Stripes: Bit-serial Deep Neural Network Computing ,” in Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-49, 2016.
  2. P. Judd, J. Albericio, and A. Moshovos, “Stripes: Bit-serial Deep Neural Network Computing ,” Computer Architecture Letters, 2016.
  3. Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “Dadiannao: A machine-learning supercomputer,” in Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pp. 609–622, Dec 2014.
  4. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States., pp. 1106–1114, 2012.
  5. H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA ’11, (New York, NY, USA), pp. 365–376, ACM, 2011.
  6. T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in Proceedings of the 19th international conference on Architectural support for programming languages and operating systems, 2014.
  7. S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: efficient inference engine on compressed deep neural network,” in 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18-22, 2016, pp. 243–254, 2016.
  8. J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: Ineffectual-neuron-free deep neural network computing,” in 2016 IEEE/ACM International Conference on Computer Architecture (ISCA), 2016.
  9. Chen, Yu-Hsin and Krishna, Tushar and Emer, Joel and Sze, Vivienne, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” in IEEE International Solid-State Circuits Conference, ISSCC 2016, Digest of Technical Papers, pp. 262–263, 2016.
  10. B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, “Minerva: Enabling low-power, highly-accurate deep neural network accelerators,” in Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, (Piscataway, NJ, USA), pp. 267–278, IEEE Press, 2016.
  11. P. Judd, J. Albericio, T. Hetherington, T. Aamodt, N. E. Jerger, R. Urtasun, and A. Moshovos, “Reduced-Precision Strategies for Bounded Memory in Deep Neural Nets, arXiv:1511.05236v4 [cs.LG] ,” arXiv.org, 2015.
  12. J. Kim, K. Hwang, and W. Sung, “X1000 real-time phoneme recognition VLSI using feed-forward deep neural networks,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7510–7514, May 2014.
  13. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
  14. Y. Jia, “Caffe model zoo,” https://github.com/BVLC/caffe/wiki/Model-Zoo, 2015.
  15. P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, N. E. Jerger, and A. Moshovos, “Proteus: Exploiting numerical precision variability in deep neural networks,” in Proceedings of the 2016 International Conference on Supercomputing, ICS ’16, (New York, NY, USA), pp. 23:1–23:12, ACM, 2016.
  16. A. Karpathy and F. Li, “Deep visual-semantic alignments for generating image descriptions,” CoRR, vol. abs/1412.2306, 2014.
  17. Synopsys, “Design Compiler.” http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DesignCompiler/Pages.
  18. N. Muralimanohar and R. Balasubramonian, “Cacti 6.0: A tool to understand large caches.”
  19. M. Poremba, S. Mittal, D. Li, J. Vetter, and Y. Xie, “Destiny: A tool for modeling emerging 3d nvm and edram caches,” in Design, Automation Test in Europe Conference Exhibition (DATE), 2015, pp. 1543–1546, March 2015.
  20. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” arXiv:1409.0575 [cs], Sept. 2014. arXiv: 1409.0575.
  21. B. Svensson and T. Nordstrom, “Execution of neural network algorithms on an array of bit-serial processors,” in Pattern Recognition, 1990. Proceedings., 10th International Conference on, vol. 2, pp. 501–505, IEEE, 1990.
  22. A. F. Murray, A. V. Smith, and Z. F. Butler, “Bit-serial neural networks,” in Neural Information Processing Systems, pp. 573–583, 1988.
  23. S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, “Cambricon: An instruction set architecture for neural networks,” in 2016 IEEE/ACM International Conference on Computer Architecture (ISCA), 2016.
  24. S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-x: An accelerator for sparse neural networks,” in Proceedings of the 49th International Symposium on Microarchitecture, 2016.
  25. J. Albericio, P. Judd, A. D. Lascorz, S. Sharify, and A. Moshovos, “Bit-pragmatic deep neural network computing,” Arxiv, vol. arXiv:1610.06920 [cs.LG], 2016.
  26. M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer cnn accelerators,” in 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016.
  27. N. Jouppi, “Google supercharges machine learning tasks with TPU custom chip.” https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html, 2016. [Online; accessed 3-Nov-2016].
  28. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. Software available from tensorflow.org.
  29. S. Mittal and J. S. Vetter, “A survey of architectural approaches for data compression in cache and main memory systems,” IEEE Trans. Parallel Distrib. Syst., vol. 27, pp. 1524–1536, May 2016.
  30. M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.