Memory Slices: A Modular Building Block for Scalable, Intelligent Memory Systems
Abstract
While reduction in feature size makes computation cheaper in terms of latency, area, and power consumption, performance of emerging dataintensive applications is determined by data movement, which is becoming more costly. These trends have introduced the concept of scalability as reaching a desirable performance per unit cost by using as few number of units as possible. Therefore, many proposals have moved compute closer to the memory, used modular and parallel architectures, and utilized interconnection networks. However, these efforts ignored maintaining a balance between bandwidth and compute rate of an architecture, with those of applications, which is a key principle in designing scalable large systems. This paper proposes the use of memory slices, a modular building block for scalable memory systems integrated with compute, in which performance scales with memory size (and volume of data). The slice architecture utilizes a programmable memory interface feeding a systolic compute engine with high reuse rate. The modularity feature of slicebased systems is exploited with a partitioning and data mapping strategy across allocated memory slices where training performance scales with the data size. These features enable shifting the most pressure to cheap compute units rather than expensive memory accesses or transfers via interconnection network. An application of the memory slices to a scaleout memory system is accelerating the training of recurrent, convolutional, and hybrid neural networks (RNNs and RNNs+CNN) that are forming cloud workloads. The results of our cyclelevel simulations show that memory slices exhibits a superlinear speedup when the number of slices increases. Furthermore, memory slices improve power efficiency to 747 GFLOPs/J for training LSTMs. While our current evaluation uses memory slices with 3D packaging, a major value is that slices can also be constructed with a variety of packaging options, for example with DDRbased memory units.
3
1 Introduction
As feature size of CMOS technology scales down, the cost of a billion floating point (GFLOPs) decreases at a rate of 35% operations per year [14]. Moreover, every five years, floating point units (FPUs) can provide 8X arithmetic operations per Watt, and similarly, FPUs in a given area perform 8X faster with the same cost [13]. However, the bandwidth cost is increasing with distance [14]. As a result, computing is at an inflection point where the escalating energy and latency costs of data movement are dominating that of computation [27, 5, 35, 13]. This has been amplified by the exponential growth of data and the explosive growth in new algorithmic approaches such as machine learning, graph analytics, relational processing, and stochastic programming to convert these exponentially growing data sets into useful information. This rising impact of data movement has spawned many proposals for moving compute closer to the memory system in the form of accelerators to reduce the data movement costs[19, 18, 2, 36, 28, 20, 37], which have included integrating compute logic onto the memory die as well as into the package, (e.g., DIMM, interposer, or 3D die stacks). In addition, studies such as highbandwidth DRAM architectures (e.g., FGDRAM) [37] for increasing memory bandwidth at low energy, modular architectures that utilize efficient interconnection networks for shrinking wire length [35], and stream processors for reducing the average distance that an operand must travel to reach an FPU [13] have been proposed contributions for dealing with scaling issues.
An uncovered factor in providing scalability by studies that leveraged onchip SRAM with integrated DRAM stacks connected to GPUs, FPGAs, TPUs, or any other compute unit, is maintaining the matched balance between compute rate and bandwidth, which in turn enables achieving a given desirable performance with fewer number of nodes. This paper addresses this gap by proposing Memory Slices  a modular building block for constructing scalable memory systems integrated with compute logic. Inspired by the bitslice microprocessors [26] of earlier decades, a memory slice is a unit of memory integrated with compute logic and an onchip switched interconnection network (ICN) interface. Slices are aggregated into large memory systems wherein local memory bandwidth to the compute logic scales linearly with compute bandwidth, and external bandwidth grows as a function of the number of slices and can be delivered through memory networks such as proposed in [29].
Slice memory controllers are augmented with programmable address generators for traversing memory data structures creating a data driven execution model for the compute logic. The specific slice microarchitecture proposed in this paper uses a pipelined, systolicstyle compute engine [31, 21] to balance local slice memory bandwidth with slice compute bandwidth. The collective finegrained distributed compute logic across slices can host complex computations in the memory system on large data sets leading to the notion of intelligent memory – for example when acting as a host for machine learning algorithms. Moreover, the finegrained implantation is not only highly efficient for applications with varied data sizes in their steps (e.g. neural networks), but also is compatible with finegrained packetswitched lowpower memories.
This paper reports on the application of the memory slice microarchitecture to the scaleout organization of memory systems to accelerate an important class of deep neural networks (DNNs) namely recurrent neural networks (RNNs) and hybrid neural networks composed of convolutional neural networks and RNNs (CNN+RNN). This class of ondemand applications are not only dense but also require high rate of reuse (i.e., high ratio of FLOPs:Byte). They also include many independent operations of same type. Because of these features, they favors finegrained parallel architectures, with minimum distance (no hierarchy of registers) between memory and FPUs, with high reuse rate, which is arranged in memory slice architecture. Note that while the majority of past efforts have focused on the acceleration of CNNs and mainly focused on supporting inference with training performed offline over GPU clusters [39, 20, 28, 8, 7, 9, 34, 17]. However, RNNs represent an emerging class of DNNs for learning temporally dependent patterns and are becoming the dominant workloads in the cloud [25]. Training of RNNs has unique computational characteristics that challenges acceleration, presents substantial data movement management challenges, and therefore impedes progress towards expanding its scope of application to future larger applications.
This paper shows how the modular microarchitecture of slices are exploited in the partitioning and mapping of data where training performance can scale with the size of data and at much lower energy than GPU clusters. For example, aggregation functions operate transparently across slices via the interslice network and can naturally encapsulate activation functions. Importantly, the scalable implementation of fine grained parallel dense matrix multiplication is a core kernel in the implementation described here. When viewed in this manner, we envision wider application to of applications in machine learning, graph analytics, and scientific computations. Mapping applications is now akin to memory allocation and data layout in memory memory systems and is key to scaling performance in proportion to size of memory.
The results of our cyclelevel simulations show that in average training CNNs on memory slices consisting of 3D DRAM stacks and integrated computation logic provides 6.3X as high throughput (images/sec) as Tesla^{®}P100 GPUs do. Moreover, our evaluation exhibits a superlinear speedup when the number of slices increases from 2 to 256. While our current evaluation uses memory slices with 3D packaging, the major value is that slices can be constructed with a variety of alternative packaging options including conventional DDR (e.g., to form intelligent DIMMs), High Bandwidth Memory (HBM) [43], Hybrid Memory Cube (HMC) [12],or other emerging memory standards. Memory slice seeks to make the following key contributions:

It is a modular inmemory building block of scalable acceleration of dataintensive applications, which shifts the most stress to cheap computations, located near data rather than accessing memory.

It tries to provide a best balance between the bandwidth and compute rate of architecture, with those of applications, to gain a desirable performance at lower cost (fewer nodes). This is achieved by exploiting the modular nature of a slice with a flexible data partitioning and mapping strategy supported by a data driven execution model.

It is comprised of i) a systolic compute engine, ii) programmable memory address generator, iii) wormhole switched network interface, and iv) aggregation engines, all of which are arranged to reach the key goal of scalability.

It demonstrates scaling performance with memory system size for an ondemand class of dense and compute intensive problems, training of RNNs and hybrid networks (CCNs + RNNs).
2 Motivation
Scaling performance of traditional parallelcomputing have been challenging mainly because of their power budget limitations and using slowgrowing memory bandwidth in combination of fastgrowing computational units which creates a mismatch between computation rate and memory bandwidth and leads to inefficiency. These challenges, which have not been design mistakes, worked well for traditional applications. However, they cannot meet the requirements of a category of new dense applications with high ratio of FLOPs:byte (i.e. reuse). As a result, the primary motivation of this work is to conceive a near data processing system architecture wherein the performance scales with the size of memory (and hence volume of data). From our perspective, this view has several consequences. First, it implies the maintenance of a fixed ratio of compute throughput per unit of memory as the size of memory increases. If we double the size of memory, the compute capacity also doubles. Second, to scale performance implies that memory bandwidth to the local compute logic should not reduce with system size (i.e., at least remain at a fixed ratio relative to the throughput of the local compute logic). Therefore, as memory size scales, so does the local memory bandwidth to compute logic. Less obvious, is the implication that this leads to the design of bandwidthoptimized memory systems rather than capacityoptimized systems we have today, where the memory bandwidth delivered to cores is 23 orders of magnitude less than available within the memory die. With the compute logic integrated in the memory system, maximizing performance is pursued by maximizing finegrained (wordlevel) concurrency. Data parallel algorithmic implementations that can maximally expose finegrained concurrency across large data structures can be exploited to scale performance with memory size. In this paper, we propose and evaluate a datadriven, streaming model of computing within and across memory slices in an attempt to maximize finegrained concurrency.
Finally, we wish to demonstrate the proposed approach with applications to the inmemory acceleration of an important class of computationally intensive, data parallel applications. We apply this architectural approach to specific class of stateoftheart DNN, namely RNNs and hybrid neural networks (RNNs + CNNs.) RNNs are ondemand networks used in applications such as speech recognition or machine translation in which the data is a sequence. For example, in a machine translation application, a neuron, which can be a gated recurrent unit (GRU) [10], or a long shortterm memory (LSTM) [23] processes a sequence of words and thus operates over temporal sequences. It has been observed that such workloads are forming the bulk of server class workloads (rather than the more popularly analyzed CNNs). Training of such networks involves computationally intense kernels dominated by dense matrix operations. We wish to demonstrate that memory slice designs can provide scalable acceleration for such problems, which today are largely pursued using GPU clusters. We will show that our solution realizes some unexpected advantages over the GPU clusters from the perspective of scaling.
The definition of the architecture of a memory slice involves balancing local memory bandwidth and the throughput of the associated compute logic. The desired balance is determined by the properties of the applications/algorithms to be hosted in the memory system. We can gain some insights as to appropriate balance by examining the Roofline model[51] for various memory technologies and slice memory bandwidth and compute throughputs. As an example, Figure 1 shows the Roofline model for three computeintensive LSTMs with various balance of compute throughput and memory bandwidth. The figure shows the rooflines for two memory technologies coupled with the peak throughput of companion compute logic. According to the characteristics of each application, the throughput can be improved by either increasing computation rate or memory bandwidth. In Figure 1, LSTM1 utilizes both memory bandwidth and the compute bandwidth more effectively than LSTM0 and LSTM2. The closer peak points of LSTM1 to the balance points (knee of the curve) underscores this observation. In the same systems, LSTM2, which requires more computation, underutilizes memory bandwidth. Therefore, increasing computations at even lower memory bandwidth improves throughput. Alternatively, performance of LSTM0, which is relatively (i.e., comparing to the other two LSTMs) memorybound, can be improved by increasing memory bandwidth.
All in all, this Roofline model emphasizes that based on the characteristics of an specific architecture (e.g., FLOPs per Byte ratio and memory bandwidth) an application with high FLOPs:Byte ratio may be labeled as either computebottlenecked, or bandwidth bottlenecked. The key point here is trying not only to match the compute rate to memory bandwidth of the hardware, but also to imitate the requirements of an application by characteristics of the hardware. For example, DNNs vary in their structure and computational requirements both within and across applications (e.g., based on their structural features such as number of connections, size of each layer, specific neuron model, training features such as batch and input sizes, etc). Structuring near data processing with slices produces memory systems with a high degree of concurrency  many concurrent memory channels between memory components and accompanying compute logic. As a result, for problems with a high degree of finegrained parallelism, data partitioning and mapping algorithms can be used to maximize the utilization of memory bandwidth and compute logic. Since these are memory systems, partitioning and mapping are analogous to memory allocation and data layout in memory systems and effective implementations balance performance across slices to fully and effectively exploit the fine grained parallelism available in data intensive computations. Thus, performance scales with the size of allocated data and hence memory.
3 Memory Slices
This section introduces the notion of memory slices and their aggregation into larger memory systems structured around NDP. We describe the microarchitecture of a specific type of memory slice and memory system, used in this paper to accelerate a specific class of DNNs.
3.1 Slice Organization
A slice is comprised of a bank of addressable memory, a programmable memory interface (PMI), compute logic, and a network interface to a wormhole switched interslice communication network (ICN). Depending on the packaging, the ICN may be a networkonchip, networkonpackage, or networkonboard. The specific slice microarchitecture used here is illustrated in Figure 2. An important characteristic of a slice is the ratio of the memory bandwidth to the peak compute bandwidth. The former depends on the specific memory technology (e.g., DDR, HBM) while the latter depends on the compute logic (e.g., simple cores, systolic cores, vector, etc). Computation is data driven  data can be streamed from the memory bank to the compute logic or driven by data packets from the network. The individual components of the specific slice architecture for accelerating a specific class of DNNs, are described in more detail in the following sections.
3.2 Inside a Slice
The multiplier array and the adder tree vector: The dense matrix applications of interest in this paper exhibit significant fine grained parallelism and a high ratio of compute operations to memory accesses (see Figure 1). We employ a slice microarchitecture that exploits finegrained parallelism and maximizes data reuse, while balancing compute throughput and memory bandwidth. We concurrently achieve these goals by adopting a systolic array design [21, 31]  a regular and cost efficient architecture, in which data flows from memory, rhythmically passes through processing elements (PEs) contributing to all partial product computations that require that element thereby minimizing accesses to each data element. As shown in Figure 2, we have a 2D array of multipliers, where each row is connected to an adder tree for computing the sum of the products in each row. The choice of the number of multipliers and their interconnection are tuning knobs for adjusting the required parallelism and computememory balance of the multiplier arrays and memory for an application. Figure 3 shows an example of a systolic multiplier array with a 32bit multiplier and two source registers per processing element. In this configuration, eight 256unit columns are connected in a manner wherein at each step i) all PEs compute the product of its inputs, ii) the row adder tree computes the sum of products, and ii) the values in Register A are shifted vertically down to the next PE. Register B maintains preloaded values (see example below). With a 2 GHz clock and 3 cycles per operation, such an organization can provide 1.28 TFLOPs/Sec with memory bandwidth of 10GB/s (this is a sample example without counting addition operations).
Figure 4 illustrates a simple example of multiplying two tiny matrices (A and B) and producing matrix C. At each step all floatingpoint multiplications work in parallel with the adders. Matrix B is preloaded and column elements of matrix A are streamed through the corresponding columns of the multiplier array. In the proposed implementation, we have a 256x8 array of multipliers per slice  8 per row. The adder actually represents an adder tree. The multiplier latency is 3 cycles (@2GHz) and the adder tree takes 3 cycles to add the sum of 8 products thereby maintaining the streaming behavior.
The aggregation engine: Large problems will be partitioned across slices necessitating aggregation of values across slices. For example, for large matrices partitioned across multiple slices, partial sums or inner products have to be accumulated from multiple slices. Other operations may have to be then applied to these global sums, for example the application of activation functions to the computed neuron state where the state computation may have been computed across multiple slices.
Memory controller: A key components of a slice that ease the modularity and scalability is the PMI. The PMI has a table that maps abstract information such as matrix indices to physical addresses. In general, such abstract information include indices, tags, IDs, etc. for identifying parameters of applications, dependencies, subsets of data corresponding to each step of an application, the length of vectors to be streamed, etc. The mapping table is populated during the programming phase, in which the host sends configuration packets via the ICN to setup the mapping tables of all memory slices. To guaranty the correctness of operations, all the operands of one row should be appeared before firing one shifting of operands in registers. For dense applications with high locality, we assume that memory access latency will not create a bottleneck. In other words, in such architecture memory bandwidth defines performance and not the memory latency. As registers themselves can play the role of buffers in permitting intermediate offdie memory latency, therefore, we do not use buffers. Streamprocessors such as [13, 35] supports those kind streamings.
Interconnection network: During run time, the sequencer accesses the PMI table. Then, required data is read and streamed from memory to the multiplier array. The network interface creates packets and sends them to the destination slice/slices using the ICN. The mapping algorithm helps in increasing the fraction of local data transfer (i.e., without accessing the ICN), locating communicated slices close to each other. In addition, the network interface implements a coalescing optimization where multiple data elements destined for the same destination are encapsulated in a single packet. As a result, the injection rate of packets is kept low, to prevent saturations in the 256node network. Note that besides these optimizations, memory slice architecture, which utilizes systolic arrays at NDP eliminates two typical types of traffic flow, data distribution from a global buffer to processing elements (PEs), and traffic from multiple PEs to the global buffer, which occur in most of the neural network accelerators [8, 17, 25, 1]. The mentioned factors, impacts the traffic flowing in the ICN and in turn the efficiency of the system.
The sequencer: While utilizing a costefficient systolic array of multipliers is the key contribution in maintaining a desirable balance between compute and memory to reach high throughput, an accurate sequencer should precisely adjust and synchronize the parallel stream inputs of arrays to guarantee the correct operations. In other words, all the inputs (e.g., eight parallel streams in Figure 3) should arrive to multipliers of a row at the same time, so the the vector of adder trees can sum correct values together. The sequencer at each module guarantees this synchronization. The sequencer is a programmable state machine that sequences all of the data flow within a slice. This includes i) streaming data from the memory to the multiplier array, ii) between the multiplier to the adder trees, and iii) from the adder trees to the network interface where it will be streamed back to memory or a remote slice depending on the destination address. In doing so, the sequencer ensures synchronization between units (e.g., all data is ready before initiating the multiplier array). The sequencer also manages the asynchronous interactions between the aggregation engine and the network interface.
4 Partitioning And Mapping
The slicebased memory system operates under the control of a host that is envisioned to program slices via memorymapped memory read/write operations much in the same way that peripherals are often configured. Considering the acceleration of training of RNNs described in Section 5 , compilation will produce a set of matrixmatrix operations corresponding to a batched training formulation. The host now i) partitions and maps these matrices across slices, ii) computes the mapping tables that will populate the PMI of each slice (the sequencer in each slice will use this to control local operation),iii) programs the PMI of each slice via configuration packets, iv) loads all of the matrices into memory, and iv) initiate computation. The sequencers are programmed to coordinate the execution of application phases, e.g., forward propagation, backpropagation, etc. Slices operate in parallel and asynchronously thus overlapping communication with computation. Partitioning and mapping of matrices is central to the performance gains. The approach is described in the following section followed by a stepbystep description of multislice operation that brings it all together
4.1 Partitioning Algorithm
The dominant compute kernel implemented in the applications analyzed here is matrix multiplication. The strategy for partitioning is illustrated in Figure 5. The basic idea is to partition the matrices across their common dimension as illustrated in the figure, for two slices to multiply two matrices A and B. At each step, a row partition of A can be aligned with a corresponding column partition of B at the multipliers in a slice to perform an elementbyelement multiplication. At successive steps, i) the products can be summed in the adder tree for that row, ii) the partial sum can be transmitted to the the other slice to be added to its locally computed partial sum, and iii) the row partitions are shifted down one row. These steps are repeated producing elements of the output matrix as shown in Figure 5. The indices are used to determine the network address of the destination of partial sums. Since the vertical streaming of elements of each column of A can be performed in parallel, we can partition A and B into as many partitions as necessary (even with distinct partition sizes as shown) and assign each partition to one slice. The systolic multiplier array favors such a partitioning scheme by preloading elements of one of the input matrices (usually the smaller one, B in Figure 5) into one of the registers of the multiplier array, and streaming the other matrix (e.g., A in Figure 5) via the second register.
The preceding basic approach is applied to map large matrices across a a fixed number of slices each with a fixed array size of multipliers and adder trees in each slice. The dimensions of the multiplier array determines the partition size (e.g., row partition size in Figure 5). Consequently, for very large matrices, one slice may in fact have to process more than one partition of the matrices with overheads for sequentially loading required data for each partition. In addition, in some cases, larger dimensions of the smaller matrix (e.g., in Figure 5) could preclude preloading into the registers of the multiplier array. In such cases, the second matrix should be partitioned both vertically and horizontally and be loaded iteratively to the registers. Alternatively, for smaller matrix dimensions, multiplier array elements will be under utilized. As long as the matrices are larger than the dimensions of the multiplier arrays of a single slice, this fragmented utilization of a multiplier array is limited to one slice due to the flexibility of the partitioning and mapping approach illustrated in Figure 5. In addition to partitioning matrices based on the size of slices, we define a mapping between the partitions and nodes (slices) which are connected in a network. Note that besides partitioning, the mapping approach is central to determining the performance of the system because it effect the traffic of the ICN directly. In the examples of dense applications, studied in this paper, we heuristically map the partitions sequentially to the slices.
4.2 Microarchitectural Support
The last section provided a description of how matrices are partitioned and mapped across slices. This section describes the details of microarchitectural operations at each memory slice in support of the example in Figure 5. Assume that partition 1 and 2 of Figure 5 are mapped to slice 1 and 2 in Figure 6, and that the output (matrix C) is mapped to slice 2. While both of the slices execute concurrently, Figure 5 describes the nine steps initiated in slice 1 to 2 as follows: ➶ To start a new matrixmatrix multiplication, the sequencer preloads the partition of matrix B to the systolic multiplier array and initiates streaming elements of matrix A via the memory interface to load the next elements to the first row of multiplier array (in general equal to width of multiplier array). ➷ Once the memory interface loads new values to registers of the first row of the multiplier, the previous values of those registers shift one row down. ➸ All the multipliers in the array operate in parallel, and send their results to the adder tree unit for that row. In this example, assume slice 1 is at a step that rows 2,3, and 4 of matrix A are at the multiplier array to be multiplied by rows 0,1,2 of matrix B to create a portion of elements (0,2), (1,3), and (2,4) of matrix C.
➹ The local adder tree tree partially sums the results of multipliers involved in creating these elements of the output. In general, partition of the output matrix may reside in the local or remote slice. As a result, the adder tree vector, which is only aware of the index of the output matrix (based on indices of its operands), but not the slice number, send its results to the network interface which has the information about the mapping of matrix partitions to slices. In this example, the three elements are transmitted to slice 2. ➺ The network interface packetizes data and sends them to their destinations. In the case that the source and the destination of a packet are the same, the network interface will route it to its local port. Since it is already known that the elements of destination matrix (C) are located diagonal, the network interface only sends the abstract address (indices) of the first element in a packets and the number of elements (e.g., (0,2) and 3). At the destination slice, the addresses of other elements will be made by using the first one. ➻ The ICN routes packets to their destination. ➼ The network interface of slice 2 receives packets and after extracting, sends them to the aggregation unit, which sums up partial sums of each element from the slices. ➽ If the received packet includes the last partial sum, then this unit applies other required functions to the results (e.g., the activation functions for neural networks). Otherwise, the result directly updates the value of memory. ➾ The memory controller has two tasks here. First, it fetches the other portion of the same element from memory, so that the aggregation engine adds the new portion to that. Second, it writes back the values at dedicated addresses based on the mapping table. While the nine steps are basic steps, some complex applications may require additional operations. For instance, for CNNs, converting convolution operations to matrixmatrix multiplications duplicates some elements of inputs. In such scenarios, the memory interface may create two or more copies of an output element at write them in specified elements based on the mapping table.
5 A Dense Application Example
This section explains memory slices by using an example of RNNs and describes two major points: First, how required operations for forward and backward propagations for training RNNs are converted to matrixmatrix multiplications, favored by systolic arrays at each memory slice. And, second, how memory slices run the training and inference efficiently and correctly by maximizing parallel operations required at each step of algorithms. To date, for solving various ranges of sequential applications, from speech and video processing to language modeling and machine translation, several structures have been proposed, all of which are unanimous in having the feedback connection from previous elements of a sequence in their structure, while are varied in details of their structure. Predicting next word of a sentence, or translating a sentence are applications of RNNs. In this section, we picked neural machine translation (NMT)[4] Figure 7a defines the fundamental parameters of a translator. Since processing sentences with random sizes is almost impossible for current translators, bucketing mechanism is used to justify sentences of various size by padding (i.e., adding special tokens to sentences for enlarging them). However, when variation in length of sentences is too much, padding all the sentences to a single size is very inefficient. Therefore, usually a group of buckets with various sizes are used In this example, we assume only one bucket pair of size (4,6).
Unlike inference, during training with teacher forcing technique[47], both input and output sentences are loaded into the translator. The size of batch is assumed as three in our example. Therefore, three sentences of a dataset will be processed in forward propagation, then the loss is calculated, and finally weights are updated and errors are back propagated. In RNNs error should be back propagated through time (BPTT)[50], which suffers from vanishing gradient problem when length of sequences enlarges. To overcome this issue, truncated back propagation is proposed[44], which back propagates errors to only a limited number of timesteps. Figure 7b shows back propagation of errors through four timesteps. In Figure 7b, the basic block of the translator, which translates three fourword sentences to three sixword sentences, is unrolled to four timesteps. At each timestep, the translator block processes a new batch by fetching three sentences of input and three corresponding sentences from output; and updates the loss by adding its own loss (step loss) to the backpropagated loss. All the mentioned dependencies will be defined at the table of PMI, so that the sequencer can keep the track of them.
A translator block can consist of a stack of RNN cells for increasing the accuracy. Figure 8a illustrates the block diagram of one translator block including five layers, four of which are LSTMs and one(the attention) is a feedforward network. This example is a simplified version of attentionbased translators, a mature version of encoderdecoderbased translators[11]. First, encoders process input sentence of first language. The input of an LSTM is an input word and a previous state, represented in one vector, which is multiplied by the weight matrix. The output of last decoder layer goes to the attention layer, which is added to encoderdecoderbased translators to enable translating long sentences[4]. Similar to encoders, each decoder cell is an LSTM.
Figure 8b displays the nodes and connections of the neural network inside two LSTM layers (the network of decoders are similar). The arrows show that the output of each layer contributes in creating the half input of next layer as well as the state part for next element in the sequence by the same layer. Figure 8c shows a high abstraction of partitioning across four slices. In this example, weight and input matrices of two layers are partitioned similar to those in Figure5. As Figure 8c shows, layer 1 is mapped to slices 0 and 1, and layer 2 is mapped to slices 2 and 3. As a result, in this simple scheme, slice 1 writes its results on its own memory section and on slice 2. Note that since operations of decoders will not be started until encoders are finished, the resources of encoders can be mapped to the same four slices.
5.1 Training Neural Networks by Memory Slices
This section explains how memory slices ease training, for proceeding which, each timestep shown in Figure 7b is unrolled to several microsteps, at each a word of sentence is processed. Figure 9 shows microsteps for translating a batch of three sentences at timestep=3 of Figure 7b. Since decoders cannot start their operations until encoders are done, the number of microsteps required for translating a fourword sentence to a sixword sentence is ten. This section continue by explaining the flow of operations for processing sentences A, B, C, and A’, B’, C’, performed at memory slices. In the next section, to simplify the example, we assume that all the sentences in both languages have two words, which cause four microsteps per timestep.
Forward Propagation
Figure 10 demonstrates inputs, weights, and output matrices of fivelayer translator mapped to four slices when all the microsteps of timestep=3 are completed. The upper and lower indices indicate the layer number and order of a word in a sentence. For instance, is the input of layer one (i.e., the first encoder). Each word is represented by a vector of length three, the socalled hidden unit size (H=3). Note that a layer, called embedding transfers onehot words into words of hidden unit size[32]. As Figure 10 shows, in addition to , participates in making the input matrix of the first layer. is the feedback state of layer one from previous elements. Figure 10 also shows that weights of LSTM layers (i.e., encoders and decoders) have size of , made by concatenating four group of filters for creating output, acquiring information from input, forgetting history, and updating the state[23]. At each microstep, an LSTM generates its output and next state by multiplying input matrix of size to its weight matrix of size , which results a matrix such as for the first microstep. Then, the first part of this matrix, which is the result of multiplying by the output weight, is activated and it creates the input of the next layer, . The remaining of matrix participates in creating , the state for processing the next element in the sequence. The same procedure repeats for layers 2, 4, and 5. However, the procedure of layer 3, the attention layer differs, which receives outputs of both group of elements (i.e., and that create and ). As a result, before the attention, three microsteps should be done: 1) , 2) and , and 3) . Then the attention stars working by multiplying its input matrix to its weight matrix of size .
Accordingly the following are some of the considerations for training. First, the memory space for a part of input matrices (those for feedback states) that are not ready during programming phase, kept reserved in PMI table, so that the sequencer knows where to write the result. Second, as matrix multiplications for next element of a sequence for RNNs cannot be proceed before the previous one are finished, to fill the gap between processing two elements, all the sameindex elements of all sequences at one batch are arranged to be streamed together during programming, to better utilize the compute unit reuse. Therefore, batch size matters for gaining throughput. Third, during programming phase, the order of loading weight matrices to registers of the systolic array defines the order of producing input of next layer and the next state of current layer (more importantly when the length of weight matrix is larger than length of multiplier array), it is considered to improve performance.
Error Back Propagation and Weight Update
Once all the microsteps of a timestep are done, the backpropagating error of a layer (), defined as scalar multiplication of activation function differentiation of layer output to matrixmultiplication of error from previous layer to matrix transpose of weight, is computed. In addition to error computation, weights of each layer is updated as , where is the learning rate. Therefore, a pair of matrixmatrix multiplications (for calculating error and updating weight of current layer) should be performed. At step five (Figure 11), the calculated error will be propagated to the next timestep. However, based on hardware constrains, memory slices can map them to less number of slices. For training, the mapping of matrices, mapped to slices in a forwardfriendly manner (i.e., in a way that they are streamed into the multiplier arrays for forward propagation) should be changed to become proper for back propagation. To handle two mappings of same matrices for forward and back propagation, two options is possible. First, remapping matrices of each layer right after the operations for that layer in forward propagation is over. This solution uses bandwidth of memory and ICN. The second solution is mapping two copies of each matrix during programming phase. As long as the memory space is not limited, the second approach is preferred.
6 Experimental Setup
For our evaluation, we configure memory slices based on models of 3D DRAM technologies  the Hybrid Memory Cube (HMC) [12] and High Bandwidth Memory (HBM) [43]. Other packaging options are also feasible (e.g., DDR, 2.5D, etc.) but are not evaluated here as we do not expect the main insights to differ although specific design choices should be affected. We evaluate a memory system constructed of slices using an inhouse cyclelevel simulator. For the RNNs and hybrid networks we simulate the forward and backpropagation. In addition, to estimate the power consumption integrate into the system the Kitfox1.1 library[42] at the 16nm technology node. Kitfox is a library that incorporates a number of public domain power models including CACTI and McPAT  we use the McPAT model[33] for the power modeling of compute units and estimates of access energy per bit for the DRAM based on 6pJ/bit for HBM [38] and 3.7pJ/bit for HMC [24]. Table 1 lists the details of the ICN and the configuration of a single slice which is comprised of an 16bit multiplier array. Each slice is assumed to be on the order of 1Gbytes in size and is large enough to host the partitioned data sets considered in these workloads.
We define eight system configurations for evaluation as listed in Table2, modeling two versions of HMCstyle and HBMstyle memories. The first four configurations employ the basic compute unit (i.e., that in Table1) and are referred to as the baseline configurations. The other four configurations, which we refer to as balanced configurations increase the compute throughput with the addition of up to 2.5x compute units/slice. This rebalances the ratio of memory bandwidth to compute bandwidth in a slice to be closer to the knee of the Roofline models of the evaluated configurations that employ these memory technologies and host these applications. Each slice nominally represents one HMCstyle vault, or HBMstyle channel. The memory system configurations that use balanced slice configurations have the same peak TFLOPs/S as the memory system configurations that use the baseline slice configurations, but with a smaller number of slices. We evaluate training performance of four LSTMs and four CNNs. The proposed partitioning and mapping algorithm optimizes each layer separately (each set of matrices). Consequently, it is straightforward to apply to hybrid networks with any number of convolutional layers combined with any number of recurrent layers. However, in this paper, we evaluate the CNN and RNN layers separately using wellknown workloads to enable comparisons with previous studies. The four LSTMs are multilayer translators with varied parameters listed in Table 3, trained on the WMT’15 dataset[48] and bucket sizes of (5, 10), (10, 15), (20, 25), (40, 50). LSTM0 has parameters similar to Google NMT (GNMT)[52]. Four CNNs are AlexNet[30], VGG16[41], ResNet152[22], and InceptionV3[45], trained on ImageNet dataset[16] and batch size of 128.
7 Performance Evaluation
In this study we seek to learn specific lessons from executing training of RNNs in the memory system, and the consequences for throughput, scalability, and energy efficiency. Note that we are studying memory systems and study specific choices of the balance between memory bandwidth and compute bandwidth in a slice. When we normalize the compute throughput of the memory system, different choices will produce different number of slices to construct the memory system. We assume the memory capacity of a slice can be varied to keep also the memory system size constant.
7.1 Throughput
We locate the throughput (measured from simulation) of LSTMs on the RoofLine model in Figure 1  note each Roofline has a number of slices and characteristics shown in Table 2. Figure 12a illustrates the results for baseline configurations, all of which have the same FLOPs/Byte since the configuration of each slice is fixed, the FLOPs/Byte is determined by the operation the dimensions of the multiplier arrays and adder trees. However the memory bandwidth differs across each configuration. In Figure 12a we see the LSTMs are clearly computebound. They do not achieve the peak throughput due to i) dependencies in the models, ii) nonuniform sizes of layers, iii) dimensions of matrices of a model are not perfectly matched to the multiplier array size in the slices, and iv) performance of the ICN, which connects the slices. Utilization can be moderated by the partitioning scheme or by modifying effective parameters. For example, increasing the batch size can reduce the gap with peak throughput, made as a result of sequential dependencies during forward propagation. The larger batch size of LSTM3 is the main reason for its higher throughput compared to that of LSTM2 in Figure 12.
The throughput gains in Figure 12a come at some inefficiency in bandwidth utilization. The balanced configurations, which combine lower memory bandwidths (i.e., HMC1.0, and HBM) with high computation rates (again note the number of slices are different as defined in Table 2) have different rate of data reuse (due to larger multiplier arrays) and therefore reduce the of accesses per a byte. As a result the working points in Figure 12b are distributed in xaxis closer to the knee of the curve  the balance point. A comparison between Figure 12a and b illustrates two insights: First, those applications that were limited by the low compute rate (e.g., yellow and green points of LSTM 1,2, and 3 in Figure 12a) reach higher throughput by using lower bandwidth memories but higher compute rate with consequent power advantages. Second, even though LSTM1, 2 and 3 may not gain significantly higher throughput by using balanced configurations, they are located closer to the balance point and improve memory utilization. For example, LSTM1, which reaches 1.2 PFLOPs/Sec by using 512 slices of HMC2.0, where it is 644 FLOPs/Byte from balance point, can reach the same throughput by using only 128 slices of HMC1.0 2x, where it is 144 FLOPs/Byte closer to the balanced point. We argue that such insights are very useful in determining how to construct memory systems where performance (perhaps most efficiently) scales with memory size for an application domain.
Figure 13 provides PFLOPs/S of all workloads for both training and inference. As it shows, using slices with 2X the computation rate of the baseline increases performance but it does not necessarily achieve 2X system throughput. However, it is not orthogonal with this fact that the target application of the case study (matrix multiplication) is heavily compute bound. For inference we use batch processing that is an approach for inference in server applications[25, 6]. However, the slow error back propagation causes lower performance of training compared to inference. Figure 13 also shows that the throughput of training LSTMs is higher than that of training CNNs. There are two reasons for such behavior. The first is uniformity in the structure of LSTMs, which helps create higher utilization in all columns of the multiplier arrays. The second is higher opportunity of concurrency in multilayer LSTMs. As mentioned earlier, once one layer of an LSTM is processing an input of a temporal sequence, the other layers can continue processing other elements of the sequence.
To better evaluate the throughput of memory slice systems, we compare the number of images that can be trained per second with Tesla^{®} P100 and K80 GPUs reported by the TensorFlow benchmarks[46]. Since in this part we compare simulation results with real system, we try to do so with as mush fairness as possible. For this reason, we have very similar peak throughput for each pair of this comparison  four memory slices with HMC1.0 2x has the same peak throughput as one P100 GPU. In addition, here we changed our batch sizes to match with those reported in [46]. As Figure 14 shows, memory slices perform similar to P100 for training InceptionV3. However, VGG16 can gain up to 41x higher throughput when it is trained using memory slice system. We also compare the throughput result of LSTM0 inference of HBM and HMC1.0based memory slices with that of the TPU, using DDR3 and GDDR5. Seeking a fair comparison we choose a number of slices to have close to the peak throughput of the TPU and we normalized results to the peak throughput of each device. As Figure 15 shows, memory slices offer better performance the main reason is that the throughput of TPU(more specifically for DDR3based TPU) is limited by the memory bandwidth[25].
7.2 Scalability
One of our goals was to understand how we can architect NDP memory systems where performance scales with memory size. Consequently, the analysis in this section is mostly unique to slicebased memory systems. There are three main outcomes from our analysis. First is the importance of finding the right balance between memory bandwidth and compute bandwidth within a slice via its effect on performance scaling. Figure 16 illustrates throughput for two sample applications, one from LSTMs and one from CNNs, when the number of slices varies from 8 to 128. For a fixed number of slices, the balanced configuration with 2X the computation throughput/slice provides twice the system throughput as the baseline configuration. Finding this balance will have to come from domain analysis. The number of slices for a specific application is addressed via memory allocation  use a subset of slices of the complete memory system much in the same way physical memory is allocated to processes.
The second lesson concerns how overheads scale. As we increase the number of slices, parallelism is exploited while overheads here have dropped and can lead to superlinear speedup. This is primarily due high overheads for a small number of slices and large networks as explained below. While a perfect balance at each slice (operating at the knee of the Roofline curve) is necessary for optimum scale performance, it is not enough without utilizing maximum parallelism across all the slices. Currently the wellknown approaches for training neural networks are data and model parallelism[15, 40], which in the best case (i.e., when a program have enough concurrency opportunity), can provide linear speedup[46]. On the other hand, a system can provide better than linear speedup, if it can prevent maintenance overhead operations, while using multiple resources. In sum, to better scale performance, the speed up should be gained not only from the parallelism, but also from eliminating unnecessary operations.
Consider when the number of memory slices are small and we have large matrices than number of required partitions, several parts of model and data should be processed sequentially in limited number of slices. Handling several partitions of a layer by one slice costs operations such as reloading weights to registers iteratively (tasks mentioned in ➶ of Figure 6). Consider the case where the number of Memory Slices are small and we have large matrices. Table 4 compares the weight matrix dimensions and required partitions for utilizing all the available fine grained parallelism of a slicebased system (when the multiplier array width is 8). Weight matrices with a length longer than the multiplier array length (256 in the baseline system) must be partitioned horizontally and be loaded iteratively. In this case, when the number of slices doubles, the latency required for those overhead operations at each slice is reduced up to half the time, which results up to 2.4X speedup (due cascading reductions elsewhere). Consider if such a rate of reduction were to continue, after seven times doubling the number of slices (e.g., from 2 to 256), the speed up will reach x that is 4.2 times as high as linear speedup (i.e., ). Figure 17 demonstrates the speedup when the number of slices increases from 2 to 256 (the lowest black line shows the linear speedup trend). As the figure shows, LSTM0 and 2, those with largest weight matrices, are better at utilizing a larger number of slices due to larger matrix size and more parallel operations. While generally all the eight workloads perform better on more slices, even limited number of slices perform better than comparable devices (see Figure 14 and 15). Having to preload the full multiplier array in a slice and 256 cycles (and less for partial matrix partitions). While superlinear speedups are substantively affected by this we keep in mind the usage model. The goal is to utilize memory size to scale performance and hence we expect larger number of slices allocated to these types of applications when parallelism is available and envisioned memory systems are tens to hundreds of Gbytes on a server/cloud system.
Third, slicebased memory systems benefit from the partitioning mechanism used here. Data parallelism, in which minibatches are split across workers to train the same model by updating shared parameters (the approach used in GPU clusters), has other limitations to scalability. One of them is transferring large amount of data to the central server for combining the results of SGDs. On completing minibatches, error matrices are transmitted to a central server to be combined and broadcast back to the individual GPUs. This date grows with the size of the problem and limits scalability. On the other hand, the approach of splitting outputs across memory slices avoid such bottleneck by partially computing error matrices. In addition, memory slices eliminate overheads of data parallelism, such as communication or storage overheads for broadcasting or duplicating the shared parameters.
7.3 Efficiency
Energy consumption is an important factor in defining the cost of a system, specifically when the power consumption budget is limited. The number of operations that can be committed by consuming one Joule of energy is an important metric. Figure 18 compares power efficiency of training (a) and inference (b) of workloads on various configurations. As Figure 18a shows, those LSTMs that most utilize the computations (i.e., LSTM1 and LSTM3), have highest performance efficiency during training, comparing to the other LSTMs. Furthermore, Figure 19 illustrates a breakdown of memory and compute power consumption for training for four configurations. As can be seen, a larger portion of power is dedicated to computations. HBMbased configurations consumes less power because they have a smaller number of slices to provide the required bandwidth. To the best of our knowledge, the only comprehensive architectural approach devoted to accelerating training (and for CNNs) is ScaleDeep[49], the hardware for which has peak performance of 1.35 PFLOPs/S for halfprecision(16bit) at 1.4KW. A comparison between power efficiency of a slicebased memory system and ScaleDeep shows approximately 330 GFLOPs/W for training AlexNet and VGG16 for ScaleDeep, which is in the same range of that of the slicebased system. Regarding the target of training RNNs, memory slices can substantially improve power efficiency to 747 GFLOPs/J for LSTMs.
8 Related Work
Several studies proposed solutions for inference[39, 20, 28] and training neural networks[49, 3]. Neurocube[28] that used 3D memory, TETRIS[20] with scheduling techniques and hybrid partitioning, and SCNN[39] that utilizes sparsity, can be used for efficient inference of CNNs, but they do not support training dense and largescale applications. ScaleDeep[49] trains CNNs by using heterogenous tile of chips for provide convolutional and fullyconnected layers requirement. Some of the recent studies, proposed using PE arrays for neural networks. The architectures used in Eyeriss[8], DianNao[9], TETRIS[20] devote a huge portion of each PE to SRAM memories. Therefore, they limited their PE array size to up to . Even Tetris, which optimizes the size of its PE array size to place it near vaults of HMC, has 512Byte register file per PE. Using a systolic array is efficient and uses only four bytes of register per compute unit (128x smaller than that for Tetris). TPU[25] uses a systolic array of 8bit integer MACs with 24MB unified buffer and 4MB of accumulator RAMs, for neural network inference. Comparing to TPU, memory slices gain higher throughput from employing systolic arrays by directly locating them near memory, and assigning arrays with smaller width to higher bandwidth (e.g., to 10GB/S). In addition matched balanced to requirements of dense applications, partially adding eightwidth (instead of 256) rows in parallel is more efficient. DianNao[7], DaDianNao[9], PuDianNao[34], and ShiDianNao[17] are generations of neuralnetwork accelerators, proposed for inference and training of CNNs. Their key contributions are reducing memory accesses by either utilizing NNspecific access pattern or employing eDRAMbased weight cache. Besides scalability, because of our balanced NDP design, programability, which enable training RNNs, differentiates memory slices from previous efforts.
9 Conclusion
Toward the growth in dataintensive applications, this paper proposed scalable and intelligent memory slices, the key microarchitectural components of which consists of a pipelined systolicbased compute logic, a programmable memory interface and an aggregation engine. A class of stateoftheart computeintensive applications including RNNs and hybrid DNNs can benefit from modularity and partitioning strategy of memory slices. The results of our cyclelevel simulations show that memory slices provide higher throughput for inference and training of RNNs and CNNs comparing to GPUs and TPUs do. Also, memory slices exhibits a superlinear speedup when the number of slices increases.
References
 Filipp Akopyan, Jun Sawada, Andrew Cassidy, Rodrigo AlvarezIcaza, John Arthur, Paul Merolla, Nabil Imam, Yutaka Nakamura, Pallab Datta, GiJoon Nam, et al. Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip. IEEE Transactions on ComputerAided Design of Integrated Circuits and Systems, 34(10):1537–1557, 2015.
 Hadi AsghariMoghaddam, Young Hoon Son, Jung Ho Ahn, and Nam Sung Kim. Chameleon: Versatile and practical neardram acceleration architecture for large memory systems. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on, pages 1–13. IEEE, 2016.
 Erfan Azarkhish, Davide Rossi, Igor Loi, and Luca Benini. Neurostream: Scalable and energy efficient deep learning with smart memory cubes. arXiv preprint arXiv:1701.06420, 2017.
 Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 Shekhar Borkar. Role of interconnects in the future of computing. Journal of Lightwave Technology, 31(24):3927–3933, 2013.
 Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.
 Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. Diannao: A smallfootprint highthroughput accelerator for ubiquitous machinelearning. In ACM Sigplan Notices, volume 49, pages 269–284. ACM, 2014.
 YuHsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energyefficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of SolidState Circuits, 52(1):127–138, 2017.
 Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, et al. Dadiannao: A machinelearning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pages 609–622. IEEE Computer Society, 2014.
 Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoderdecoder approaches. arXiv preprint arXiv:1409.1259, 2014.
 Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoderdecoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 Hybrid Memory Cube Consortium et al. Hybrid memory cube specification 1.0. Last Revision Jan, 2013.
 William J Dally, Francois Labonte, Abhishek Das, Patrick Hanrahan, JungHo Ahn, Jayanth Gummaraju, Mattan Erez, Nuwan Jayasena, Ian Buck, Timothy J Knight, et al. Merrimac: Supercomputing with streams. In Proceedings of the 2003 ACM/IEEE conference on Supercomputing, page 35. ACM, 2003.
 William J Dally and John W Poulton. Digital systems engineering. Cambridge university press, 2008.
 Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012.
 Jia Deng, Wei Dong, Richard Socher, LiJia Li, Kai Li, and Li FeiFei. Imagenet: A largescale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
 Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. Shidiannao: Shifting vision processing closer to the sensor. In ACM SIGARCH Computer Architecture News, volume 43, pages 92–104. ACM, 2015.
 Amin FarmahiniFarahani, Jung Ho Ahn, Katherine Morrow, and Nam Sung Kim. Nda: Neardram acceleration architecture leveraging commodity dram devices and standard memory modules. In High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on, pages 283–295. IEEE, 2015.
 Mingyu Gao, Grant Ayers, and Christos Kozyrakis. Practical neardata processing for inmemory analytics frameworks. In Parallel Architecture and Compilation (PACT), 2015 International Conference on, pages 113–124. IEEE, 2015.
 Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. Tetris: Scalable and efficient neural network acceleration with 3d memory. In Proceedings of the TwentySecond International Conference on Architectural Support for Programming Languages and Operating Systems, pages 751–764. ACM, 2017.
 W Morven Gentleman and HT Kung. Matrix triangularization by systolic arrays. In RealTime Signal Processing IV, volume 298, pages 19–27. International Society for Optics and Photonics, 1982.
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 Sepp Hochreiter and Jürgen Schmidhuber. Long shortterm memory. Neural computation, 9(8):1735–1780, 1997.
 Joe Jeddeloh and Brent Keeth. Hybrid memory cube new dram architecture increases density and performance. In VLSI Technology (VLSIT), 2012 Symposium on, pages 87–88. IEEE, 2012.
 Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. Indatacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 1–12. ACM, 2017.
 G Kanttaiah. Bitslice microprocessors. IETE Journal of Research, 24(34):124–131, 1978.
 Stephen W Keckler, William J Dally, Brucek Khailany, Michael Garland, and David Glasco. Gpus and the future of parallel computing. IEEE Micro, 31(5):7–17, 2011.
 Duckhwan Kim, Jaeha Kung, Sek Chai, Sudhakar Yalamanchili, and Saibal Mukhopadhyay. Neurocube: A programmable digital neuromorphic architecture with highdensity 3d memory. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pages 380–392. IEEE, 2016.
 Gwangsun Kim, John Kim, Jung Ho Ahn, and Jaeha Kim. Memorycentric system interconnect design with hybrid memory cubes. In Proceedings of the 22nd international conference on Parallel architectures and compilation techniques, pages 145–156. IEEE Press, 2013.
 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 Sun Yuan Kung. Vlsi array processors. Englewood Cliffs, NJ, Prentice Hall, 1988, 685 p. Research supported by the Semiconductor Research Corp., SDIO, NSF, and US Navy., 1988.
 Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pages 2177–2185, 2014.
 Sheng Li, Jung Ho Ahn, Richard D Strong, Jay B Brockman, Dean M Tullsen, and Norman P Jouppi. Mcpat: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In Microarchitecture, 2009. MICRO42. 42nd Annual IEEE/ACM International Symposium on, pages 469–480. IEEE, 2009.
 Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Teman, Xiaobing Feng, Xuehai Zhou, and Yunji Chen. Pudiannao: A polyvalent machine learning accelerator. In ACM SIGARCH Computer Architecture News, volume 43, pages 369–381. ACM, 2015.
 Ken Mai, Tim Paaske, Nuwan Jayasena, Ron Ho, William J Dally, and Mark Horowitz. Smart memories: A modular reconfigurable architecture. ACM SIGARCH Computer Architecture News, 28(2):161–171, 2000.
 Lifeng Nai, Ramyad Hadidi, Jaewoong Sim, Hyojong Kim, Pranith Kumar, and Hyesoon Kim. Graphpim: Enabling instructionlevel pim offloading in graph computing frameworks. In High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on, pages 457–468. IEEE, 2017.
 Mike O’Connor, Niladrish Chatterjee, Donghyuk Lee, John Wilson, Aditya Agrawal, Stephen W Keckler, and William J Dally. Finegrained dram: energyefficient dram for extreme bandwidth systems. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 41–54. ACM, 2017.
 Mike O’Connor. Highlights of the highbandwidth memory (hbm) standard. In Memory Forum Workshop, 2014.
 Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W Keckler, and William J Dally. Scnn: An accelerator for compressedsparse convolutional neural networks. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 27–40. ACM, 2017.
 Rajat Raina, Anand Madhavan, and Andrew Y Ng. Largescale deep unsupervised learning using graphics processors. In Proceedings of the 26th annual international conference on machine learning, pages 873–880. ACM, 2009.
 Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 William J Song, Saibal Mukhopadhyay, and Sudhakar Yalamanchili. Kitfox: Multiphysics libraries for integrated power, thermal, and reliability simulations of multicore microarchitecture. IEEE Transactions on Components, Packaging and Manufacturing Technology, 5(11):1590–1601, 2015.
 JEDEC Standard. High bandwidth memory (hbm) dram. JESD235, 2013.
 Ilya Sutskever. Training recurrent neural networks. University of Toronto, Toronto, Ont., Canada, 2013.
 Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
 TensorFlow. Performance Benchmarks. https://www.tensorflow.org/performance/benchmarks#details_for_google_compute_engine_nvidia_tesla_k80l, 2017. [Online; accessed 11/19/17].
 Nikzad Toomarian and J Barhen. Fast temporal neural learning using teacher forcing. In Neural Networks, 1991., IJCNN91Seattle International Joint Conference on, volume 1, pages 817–822. IEEE, 1991.
 TENTH WORKSHOP ON STATISTICAL MACHINE TRANSLATION. WMT’15. http://www.statmt.org/wmt15/translationtask.html, 2017. [Online; accessed 11/19/17].
 Swagath Venkataramani, Ashish Ranjan, Subarno Banerjee, Dipankar Das, Sasikanth Avancha, Ashok Jagannathan, Ajaya Durg, Dheemanth Nagaraj, Bharat Kaul, Pradeep Dubey, et al. Scaledeep: A scalable compute architecture for learning and evaluating deep networks. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 13–26. ACM, 2017.
 Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
 Samuel Williams, Andrew Waterman, and David Patterson. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009.
 Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.