Accelerating Bulk Bit-Wise X(N)OR Operation in Processing-in-DRAM Platform
With Von-Neumann computing architectures struggling to address computationally- and memory-intensive big data analytics tasks today, Processing-in-Memory (PIM) platforms are gaining growing interest. Among them, processing-in-DRAM architectures have achieved remarkable success by dramatically reducing data transfer energy and latency. However, the performance of such systems unavoidably diminishes when dealing with more complex applications seeking bulk bit-wise X(N)OR or addition operations, despite utilizing maximum internal DRAM bandwidth and in-memory parallelism. In this paper, we develop DRIM, a platform that harnesses DRAM as computational memory and transforms it into a fundamental processing unit. DRIM uses the analog operation of DRAM sub-arrays and elevates it to implement bit-wise X(N)OR operations between operands stored in the same bit-line, based on a new dual-row activation mechanism with a modest change to peripheral circuits such as sense amplifiers. The simulation results show that DRIM achieves on average 71× and 8.4× higher throughput for performing bulk bit-wise X(N)OR-based operations compared with CPU and GPU, respectively. Besides, DRIM outperforms recent processing-in-DRAM platforms with up to 3.7× better performance.
In the last two decades, Processing-in-Memory (PIM) architecture, as a potentially viable way to solve the memory wall challenge, has been well explored for different applications (chi2016prime; seshadri2017ambit; li2017drisa; angizi2017RIMPA; angizi2019mrima; angizi2018dima; angizi2018imce). The key concept behind PIM is to realize logic computation within memory, processing data by leveraging the inherent parallel computing mechanism and exploiting the large internal memory bandwidth. Proposals for SRAM-based PIM architectures (aga2017compute; eckert2018neural) can be found in recent literature. However, PIM in the context of main memory (DRAM (li2017drisa; seshadri2017ambit; dai2018graphh)) has drawn much more attention in recent years, mainly due to larger memory capacities and off-chip data transfer reduction as opposed to SRAM-based PIM. Such processing-in-DRAM platforms achieve significantly higher throughput by leveraging multi-row activation methods to perform bulk bit-wise operations, modifying the DRAM cell and/or sense amplifier. For example, Ambit (seshadri2017ambit) uses a triple-row activation method to implement majority-based AND/OR logic, outperforming an Intel Skylake CPU, an NVIDIA GeForce GPU, and even HMC (HMC) by 44.9×, 32.0×, and 2.4×, respectively. DRISA (li2017drisa) employs 3T1C- and 1T1C-based computing mechanisms and achieves 7.7× speedup and 15× better energy-efficiency over GPUs in accelerating convolutional neural networks. However, several challenges make such platforms inefficient acceleration solutions for X(N)OR- and addition-based applications such as DNA alignment and data encryption. Due to the intrinsic complexity of X(N)OR logic, current PIM designs are unable to offer a high-throughput X(N)OR-based operation despite utilizing the maximum internal bandwidth and memory-level parallelism.
This is because of the multi-cycle majority/AND/OR-based operations and the row initialization required in previous designs.
To overcome the memory bandwidth bottleneck and address the existing challenges, we propose a high-throughput and energy-efficient DRAM-based PIM accelerator called DRIM. DRIM exploits a new in-memory computing mechanism called Dual-Row Activation (DRA) to perform bulk bit-wise operations between operands stored in different word-lines. The DRA is developed based on the analog operation of DRAM sub-arrays with a modest change in the sense amplifier circuit, such that the X(N)OR operation can be efficiently realized on every memory bit-line. In addition, this design addresses the reliability concerns regarding voltage deviation on the bit-line and the multi-cycle operations of the triple-row activation method. We evaluate and compare DRIM's raw performance with conventional and PIM accelerators, including a Core-i7 Intel CPU (CPU), an NVIDIA GTX 1080Ti Pascal GPU (GPU1), Ambit (seshadri2017ambit), DRISA-1T1C (li2017drisa), and HMC 2.0 (HMC), in handling bulk bit-wise operations. We observe that DRIM achieves remarkable throughput compared to Von-Neumann computing systems (CPU/GPU) by unblocking the data movement bottleneck, with on average 71×/8.4× better throughput. DRIM outperforms other PIMs in performing X(N)OR-based operations with up to 3.7× higher throughput. We further show that a 3D-stacked DRAM built on top of DRIM can boost the throughput of the HMC by 13.5×. From the energy consumption perspective, DRIM reduces the DRAM chip energy by 2.4× compared with Ambit (seshadri2017ambit) and by 69× compared with copying data through the DDR4 interface.
To the best of our knowledge, this work is the first to design a high-throughput and energy-efficient X(N)OR-friendly PIM architecture exploiting DRAM arrays. We develop DRIM based on a set of novel microarchitectural and circuit-level schemes to realize a data-parallel computational unit for different applications.
2. Background and Motivation
2.1. Processing-in-DRAM Platforms
A DRAM hierarchy at the top level is composed of channels, modules, and ranks. Each memory rank, with a data bus typically 64 bits wide, includes a set of memory chips that are manufactured in a variety of configurations and operate in unison (kim2016ramulator; seshadri2017ambit). Each chip is further divided into multiple memory banks that contain 2D sub-arrays of memory cells, virtually organized into memory matrices (mats). Banks within the same chip share I/O and buffers, while banks in different chips work in lock-step. Each memory sub-array, as shown in Fig. 1a, has 1) a large number of rows (512 in this work) holding DRAM cells, 2) a row of Sense Amplifiers (SA), and 3) a Row Decoder (RD) connected to the cells. A DRAM cell basically consists of two elements, a capacitor (storage) and an Access Transistor (AT) (Fig. 1b). The drain and gate of the AT are connected to the Bit-line (BL) and Word-line (WL), respectively. A DRAM cell encodes binary data in the charge of the capacitor: it represents logic ‘1’ when the capacitor is fully charged, and logic ‘0’ when there is no charge.
Write/Read Operation: In the initial (precharged) state, both BL and BLB are set to VDD/2. Technically, accessing data in a DRAM sub-array (write/read) after the initial state is done through three consecutive commands (seshadri2017ambit; seshadri2015fast) issued by the memory controller: 1) During activation (i.e., ACTIVATE) of the target row, data is copied from the DRAM cells to the SA row. Fig. 1b shows how a cell is connected to an SA via a BL. The selected cell (storing ‘1’ or ‘0’) shares its charge with the BL, leading to a small deviation (ΔV) in the initial voltage of the BL. Then, by activating the sense amplifier’s enable signal, the SA senses and amplifies the ΔV toward the original value of the data through voltage amplification, according to the switching threshold of the SA’s inverter (seshadri2015fast). 2) The data can then be transferred from/to the SA to/from the DRAM bus by a READ/WRITE command. In addition, multiple READ/WRITE commands can be issued to one row. 3) The PRECHARGE command precharges both BL and BLB again and makes the sub-array ready for the next access.
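The activation sequence above amounts to a first-order charge-sharing calculation, which can be sketched as follows. This is a hedged illustrative model, not the paper's circuit simulation; the supply voltage and the bit-line/cell capacitance values are assumptions chosen only to show the sign of the ΔV deviation the SA amplifies.

```python
# Illustrative first-order model of a 1T1C DRAM read (assumed component values).
VDD = 1.2           # assumed supply voltage (V)
C_BL = 85e-15       # assumed bit-line capacitance (F)
C_CELL = 22e-15     # assumed cell capacitance (F)

def read_cell(stored_bit: int) -> int:
    """Model ACTIVATE: charge sharing on the BL, then sense amplification."""
    v_cell = VDD if stored_bit else 0.0
    v_precharge = VDD / 2                      # BL starts at VDD/2
    # Charge conservation once the access transistor opens:
    v_bl = (C_BL * v_precharge + C_CELL * v_cell) / (C_BL + C_CELL)
    delta_v = v_bl - v_precharge               # small deviation the SA senses
    # The SA amplifies the sign of the deviation to a full logic level:
    return 1 if delta_v > 0 else 0

assert read_cell(1) == 1 and read_cell(0) == 0
```

With these assumed capacitances, ΔV is only a few tens of millivolts, which is why the SA's amplification step is essential.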
Copy and Initialization Operations: To enable a fast in-memory copy operation within DRAM sub-arrays, rather than using the conventional copy operation of Von-Neumann computing systems, RowClone-Fast Parallel Mode (FPM) (seshadri2013rowclone) proposes a PIM-based mechanism that does not need to send the data to the processing units. In this scheme, issuing two back-to-back ACTIVATE commands to the source and destination rows, without a PRECHARGE command in between, leads to a multi-kilo-byte in-memory copy operation that completes in roughly the latency of the two activations (seshadri2013rowclone). This method has been further used for row initialization, where a preset DRAM row (either ‘0’ or ‘1’) can be readily copied to a destination row. RowClone imposes only a 0.01% overhead on DRAM chip area (seshadri2013rowclone).
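The RowClone-FPM semantics just described can be sketched as a tiny functional model. The `SubArray` class, the shrunken row width, and the method names are illustrative assumptions; real DRAM rows are multi-kilo-byte.

```python
# Hedged functional sketch of RowClone-FPM: two back-to-back ACTIVATEs with
# no intervening PRECHARGE copy a whole row through the sense amplifiers.
ROW_BITS = 8  # shrunk for illustration; a real row is multi-kilo-byte

class SubArray:
    def __init__(self, rows: int):
        self.rows = [[0] * ROW_BITS for _ in range(rows)]
        self.sa = None                      # sense-amp row (None = precharged)

    def activate(self, r: int):
        if self.sa is None:
            self.sa = list(self.rows[r])    # 1st ACTIVATE: row -> SA
        else:
            self.rows[r] = list(self.sa)    # 2nd ACTIVATE: SA overwrites row

    def precharge(self):
        self.sa = None                      # ready for the next access

def rowclone(mem: SubArray, src: int, dst: int):
    mem.activate(src)                       # ACTIVATE source
    mem.activate(dst)                       # ACTIVATE destination (no PRECHARGE)
    mem.precharge()

mem = SubArray(4)
mem.rows[0] = [1, 0, 1, 1, 0, 0, 1, 0]
rowclone(mem, 0, 2)                         # bulk in-memory copy
assert mem.rows[2] == mem.rows[0]
```

Row initialization is the same primitive with a preset all-‘0’ or all-‘1’ source row.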
NOT Operation: The NOT function has been implemented in different works employing Dual-Contact Cells (DCC), as shown in Fig. 1c. A DCC is mainly designed based on the typical DRAM cell, but equipped with one more AT connected to BLB. Such a hardware-friendly design (seshadri2017ambit; kang2010one; lu2015improving) can be developed for a small number of rows on top of existing DRAM cells to enable an efficient NOT operation by issuing two back-to-back ACTIVATE commands (seshadri2017ambit). In this way, the memory controller first activates the word-line of the input DRAM cell (Fig. 1c) and reads the data out to the SA through the BL. It then activates the DCC’s second word-line to connect BLB to the same capacitor, thereby writing the negated result back to the DCC.
Other Logic Operations: To realize logic functions in the DRAM platform, Ambit (seshadri2017ambit) extends the idea of RowClone by implementing 3-input majority (Maj3)-based operations in memory, issuing the ACTIVATE command to three rows simultaneously followed by a single PRECHARGE command, the so-called Triple-Row Activation (TRA) method. As shown in Fig. 2a, considering one row as a control row initialized to ‘0’/‘1’, Ambit can readily implement in-memory AND2/OR2 in addition to Maj3 functions through charge sharing among the three connected cells, writing the result back to the output cell. It also leverages the TRA mechanism along with DCCs to realize the complementary functions. However, although Ambit incurs only 1% area overhead over a commodity DRAM chip (seshadri2017ambit), it suffers from multi-cycle PIM operations when implementing other functions such as XOR2/XNOR2 based on TRA. Alternatively, the DRISA-3T1C method (li2017drisa) utilizes the early 3-transistor DRAM design (sideris1973intel), in which the cell consists of two separate read/write ATs and one more transistor that decouples the capacitor from the read bit-line, as shown in Fig. 2b. This transistor connects two DRAM cells in a NOR style on the read bit-line, naturally performing the functionally-complete NOR2 function. However, DRISA-3T1C imposes a very large area overhead (2T per cell) and still requires multi-cycle operations to implement more complex logic functions. The DRISA-1T1C method (li2017drisa) performs PIM by upgrading the SA unit, adding a CMOS logic gate in conjunction with a latch, as depicted in Fig. 2c. Such an inherently multi-cycle operation can enhance the performance of a single function through the add-on CMOS circuitry in two consecutive cycles: in the first cycle, one operand is read out and stored in the latch, and in the second cycle, the other operand is sensed to perform the computation. However, this design imposes excessive cycles to implement other logic functions and adds at least 12 transistors to each SA.
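Functionally, the TRA trick reduces to a majority vote with a preset control row, which a short sketch can make concrete. The function names are ours, and the charge-sharing physics is abstracted into `maj3`:

```python
# Hedged functional sketch of Ambit's TRA: charge sharing across three cells
# on one bit-line settles toward the majority value, so AND2/OR2 fall out by
# fixing a control row to '0' or '1'.
def maj3(a: int, b: int, c: int) -> int:
    return 1 if a + b + c >= 2 else 0

def tra_and2(a_row, b_row):
    ctrl = [0] * len(a_row)                   # control row initialized to '0'
    return [maj3(a, b, c) for a, b, c in zip(a_row, b_row, ctrl)]

def tra_or2(a_row, b_row):
    ctrl = [1] * len(a_row)                   # control row initialized to '1'
    return [maj3(a, b, c) for a, b, c in zip(a_row, b_row, ctrl)]

A, B = [1, 1, 0, 0], [1, 0, 1, 0]
assert tra_and2(A, B) == [1, 0, 0, 0]
assert tra_or2(A, B) == [1, 1, 1, 0]
```

Note that every result column depends on the control-row initialization, which is exactly the per-operation initialization cost criticized later as Challenge-2.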
Recently, Dracc (deng2018dracc) implements a carry look-ahead adder by enhancing Ambit (seshadri2017ambit) to accelerate convolutional neural networks.
There are three main challenges in the existing processing-in-DRAM platforms that make them inefficient acceleration solutions for XOR-based computations, which we aim to resolve:
Limited throughput (Challenge-1): Due to the intrinsic complexity of X(N)OR-based logic implementations, current PIM designs (such as Ambit (seshadri2017ambit), DRISA (li2017drisa), and Dracc (deng2018dracc)) are not able to offer a high-throughput and area-efficient X(N)OR or addition in-memory operation, despite utilizing maximum internal DRAM bandwidth and memory-level parallelism for NOT, (N)AND, (N)OR, and MAJ/MIN logic functions. Moreover, while the DRISA-1T1C method could implement either XNOR or XOR as the add-on logic gate, it requires at least two consecutive cycles to perform the computation, which in turn limits the implementation of other logic functions. We address this challenge by proposing the DRA mechanism in Sections 3.1 and 3.4.
Row initialization (Challenge-2): Given an R = A op B function (op ∈ {AND2, OR2}), the TRA-based method (seshadri2017ambit; seshadri2015fast) takes 4 consecutive steps to calculate one result, as it relies on row initialization: 1- RowClone the data of row A to a computation row (copying the first operand to avoid overwriting data), 2- RowClone row B to a second computation row, 3- RowClone the ctrl row to a third computation row (copying the initialized control row), 4- TRA and RowClone of the result to row R (computation and write-back of the result). Therefore, the TRA method needs roughly 360ns on average to perform such an in-memory operation. When it comes to XOR2/XNOR2, Ambit requires at least three row-initialization steps to process two input rows. Obviously, this row-initialization load can adversely impact the PIM's energy-efficiency, especially when dealing with big data problems. This challenge is addressed in Section 3.1 through the proposed sense amplifier, which entirely eliminates the need for initialization in performing X(N)OR-based logic.
Reliability concerns (Challenge-3): By simultaneously activating three cells in the TRA method, the deviation on the BL can be smaller than in a typical one-cell read operation. This can elongate the sense amplification state or even adversely affect the reliability of the result (seshadri2017ambit; seshadri2015fast). The problem is intensified when multiple TRAs are needed to implement X(N)OR-based computations. To explore and address this challenge, we perform an extensive Monte-Carlo simulation on our design in Section 3.3.
3. DRIM Design
DRIM is designed to be an independent, high-performance, energy-efficient accelerator based on the main memory architecture to accelerate different applications. The main memory organization of DRIM is shown in Fig. 3, based on a typical DRAM hierarchy. Each mat consists of multiple computational memory sub-arrays connected to a Global Row Decoder (GRD) and a shared Global Row Buffer (GRB). According to the physical address of the operands within memory, DRIM's Controller (Ctrl) configures the sub-arrays to perform data-parallel intra-sub-array computations. We divide DRIM's sub-array row space into two distinct regions, as depicted in Fig. 3: 1- Data rows (500 rows out of 512) that include the typical DRAM cells (Fig. 1b) connected to a regular Row Decoder (RD), and 2- Computation rows (12), connected to a Modified Row Decoder (MRD), which enables the multiple-row activation required for bulk bit-wise in-memory operations between operands. Eight computational rows include typical DRAM cells, and four rows are allocated to DCCs (Fig. 1c), enabling the NOT function in every sub-array. DRIM's computational sub-array is motivated by Ambit (seshadri2017ambit), but enhanced and optimized to perform both TRA and the proposed Dual-Row Activation (DRA) mechanisms, leveraging charge sharing among different rows to perform logic operations, as discussed below.
3.1. New In-Memory Operations
Dual-Row Single-Cycle In-Memory X(N)OR: With careful observation of the existing processing-in-DRAM platforms, we realized that they are not able to efficiently handle the two main functions prerequisite for accelerating a variety of applications (XNOR, addition). As a result, such platforms impose excessive latency and energy on the memory chip, which can be alleviated by rethinking the SA circuit. Our key idea is to perform in-memory XNOR2 through a DRA method that addresses the three challenges discussed in Section 2.3. To achieve this goal, we propose a new reconfigurable SA, as shown in Fig. 4a, developed on top of the existing DRAM circuitry. It consists of a regular DRAM SA equipped with add-on circuits, including three inverters and one AND gate, controlled by three enable signals. This design leverages the charge-sharing feature of the DRAM cell and elevates it to implement XNOR2 logic between two selected rows through static capacitive NAND/NOR functions in a single cycle. To implement capacitor-based logic, we use two different inverters with shifted Voltage Transfer Characteristics (VTC), as shown in Fig. 4b.
In this way, NAND/NOR logic can be readily carried out by high-switching-voltage/low-switching-voltage inverters, built with high-/low-threshold NMOS and low-/high-threshold PMOS transistors. It is worth mentioning that utilizing low-/high-threshold-voltage transistors along with normal-threshold transistors is a well-established technique, and many circuits have employed it in low-power design (allam2000high; mutoh19951; kuroda19960; navi2009novel).
Consider that the A and B operands have been RowCloned from data rows to two computational rows, and that both BL and BLB are precharged to VDD/2 (Precharged State in Fig. 5). To implement DRA, DRIM's ctrl first activates the two word-lines in the computational row space through the modified decoder for charge sharing, while all the enable signals are deactivated (Charge Sharing State). During the Sense Amplification State, by activating the corresponding enable signals tabulated in Table 1, the input voltage of both the low- and high-switching-voltage inverters in the reconfigurable SA can be derived as n·VDD/N, where n is the number of DRAM cells storing logic ‘1’ and N represents the total number of unit capacitors connected to the inverters (i.e., 2 in the DRA method).
Table 1. Enable-signal configuration of the reconfigurable SA (W/R / Copy / NOT / TRA: 1, 1, 0).
Now, the low-switching-voltage inverter acts as a threshold detector, amplifying the deviation of the shared-charge voltage according to its low switching threshold and realizing a NOR2 function, as tabulated in the truth table in Fig. 4b. At the same time, the high-switching-voltage inverter amplifies the deviation according to its high switching threshold and realizes a NAND2 function. Accordingly, the XOR2 function (XOR2 = NAND2 AND OR2, Equation (1)) and its complement XNOR2 are realized after the CMOS AND gate on the BL and BLB, respectively, in a single memory cycle.
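The sensing path just described can be sketched functionally. The inverter switching thresholds used below (VDD/4 and 3·VDD/4) are illustrative assumptions chosen so that the three possible shared-charge levels (0, VDD/2, VDD) are resolved as NOR2/NAND2; the actual thresholds come from transistor sizing.

```python
# Hedged sketch of the DRA sensing path (assumed thresholds, ideal capacitors).
VDD = 1.2

def dra_bitline(a: int, b: int):
    n = a + b                       # cells storing '1' on this bit-line
    N = 2                           # unit capacitors in the DRA method
    v_in = n * VDD / N              # shared-charge voltage seen by the inverters
    nor2 = 1 if v_in < VDD / 4 else 0        # low-threshold inverter
    nand2 = 1 if v_in < 3 * VDD / 4 else 0   # high-threshold inverter
    xor2 = nand2 & (1 - nor2)       # AND gate: XOR2 = NAND2 AND OR2
    return xor2, 1 - xor2           # XOR2 on BL, XNOR2 on BLB

for a in (0, 1):
    for b in (0, 1):
        xor2, xnor2 = dra_bitline(a, b)
        assert xor2 == a ^ b and xnor2 == 1 - (a ^ b)
```

The single-cycle property is visible here: one charge-sharing event and one amplification step produce both XOR2 and XNOR2, with no control-row initialization.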
DRIM's reconfigurable SA is especially optimized to accelerate X(N)OR2 operations, while also supporting the other memory and in-memory operations (i.e., Write/Read, Copy, NOT, and TRA). DRIM's ctrl activates the corresponding control bits simultaneously (with the third enable signal deactivated) to perform such operations. However, in this work, we only use Ambit's TRA mechanism to directly realize the in-memory majority function (Maj3).
The transient simulation results of the DRA method realizing a single-cycle in-memory XNOR2 operation are shown in Fig. 6. We can observe how the BL voltage, and accordingly the output cell's capacitor, is charged to VDD (when the inputs are 00/11) or discharged to GND (when the inputs are 01/10) during the sense amplification state. Therefore, the DRA method effectively provides single-cycle X(N)OR logic that addresses Challenge-1 and Challenge-2 discussed in Section 2.3, eliminating the need for multiple TRA- (seshadri2017ambit) or NOR-based (li2017drisa) operations as well as row-initialization steps.
In-Memory Adder: DRIM's sub-array can perform the addition/subtraction (add/sub) operation quite efficiently. Assuming A, B, and Cin as input operands, the carry-out (Cout) of a Full-Adder (FA) can be directly generated as Maj3(A, B, Cin) using the TRA method. Moreover, the Sum can be readily carried out through two back-to-back XOR2 operations based on the proposed DRA mechanism.
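A bit-serial sketch of this adder, with `xor2` standing in for one DRA cycle and `maj3` for one TRA cycle (the function names are ours, and the DRAM row bookkeeping is abstracted away):

```python
# Hedged sketch of DRIM's bit-wise add: carry-out via TRA majority,
# sum via two back-to-back DRA XOR2 operations, rippled across bit positions.
def maj3(a, b, c):
    return 1 if a + b + c >= 2 else 0    # one TRA cycle

def xor2(a, b):
    return a ^ b                         # one DRA cycle

def drim_add(a_bits, b_bits):
    """Ripple-add two little-endian bit vectors stored in DRAM rows."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        out.append(xor2(xor2(a, b), carry))   # Sum = A xor B xor Cin
        carry = maj3(a, b, carry)             # Cout = Maj3(A, B, Cin)
    return out, carry

s, c = drim_add([1, 1, 0], [1, 0, 1])    # 3 + 5, little-endian
assert (s, c) == ([0, 0, 0], 1)          # = 8
```

In the real sub-array the loop body runs column-parallel across an entire row, one bit position per command sequence.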
3.2. ISA Support
While DRIM is meant to be an independent, high-performance, and energy-efficient accelerator, we need to expose it to programmers and system-level libraries. From a programmer's perspective, DRIM is more of a third-party accelerator that can be connected directly to the memory bus or through PCI-Express lanes rather than a memory unit; thus, it is integrated similarly to GPUs. Therefore, a virtual machine and an ISA for general-purpose parallel thread execution need to be defined, similar to NVIDIA's PTX (GPU). Accordingly, programs are translated at install time to the DRIM hardware instruction set discussed here to realize the functions tabulated in Table 2. The micro- and control-transfer instructions are not discussed here.
Table 2. DRIM ISA (columns: Func., Operation, Command Sequence, AAP Type).
Complement functions and subtraction can be realized with the DCC rows.
DRIM is developed based on the ACTIVATE-ACTIVATE-PRECHARGE command sequence, a.k.a. the AAP primitive, and most bulk bit-wise operations involve a sequence of AAP commands. To enable the processor to efficiently communicate with DRIM, we developed four types of AAP-based instructions that differ only in the number of activated source or destination rows:
1- AAP (src, des, size) runs the following command sequence: 1) ACTIVATE a source address (src); 2) ACTIVATE a destination address (des); 3) PRECHARGE to prepare the array for the next access. The size of input vectors for in-memory computation must be a multiple of the DRAM row size; otherwise, the application must pad them with dummy data. The type-1 instruction is mainly used for the copy and NOT functions. 2- AAP (src, des1, des2, size): 1) ACTIVATE a source address; 2) ACTIVATE two destination addresses; 3) PRECHARGE. This instruction copies a source row simultaneously to two destination rows. 3- AAP (src1, src2, des, size) performs the DRA method by activating two source addresses and then writes the result back to a destination address. 4- AAP (src1, src2, src3, des, size) performs Ambit's TRA method (seshadri2017ambit) by activating three source rows and writing the Maj3 result back to a destination address.
For instance, to implement addition in memory, as shown in Table 2, three AAP-type2 commands double-copy the three input data rows to computational rows. Then, the Sum is realized through two back-to-back XOR2 operations with AAP-type3. The Cout is generated by AAP-type4 and written back to the designated data row.
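The command flow above can be sketched as a trace of AAP primitives. The row names and the bookkeeping are illustrative assumptions; only the sequence of AAP types mirrors Table 2.

```python
# Hedged sketch of the AAP command sequence for one full-add:
# three type-2 double-copies, two type-3 DRA XORs, one type-4 TRA majority.
trace = []

def aap(*rows, kind):
    """Record one ACTIVATE-...-ACTIVATE-PRECHARGE primitive."""
    trace.append((kind, rows))

A, B, CIN, SUM, COUT = "A", "B", "Cin", "Sum", "Cout"
X1, X2, X3, X4, X5, X6, T = "x1", "x2", "x3", "x4", "x5", "x6", "t"

aap(A, X1, X2, kind="AAP2")          # double-copy operand A into compute rows
aap(B, X3, X4, kind="AAP2")          # double-copy operand B
aap(CIN, X5, X6, kind="AAP2")        # double-copy carry-in
aap(X1, X3, T, kind="AAP3")          # DRA:  t = A xor B
aap(T, X5, SUM, kind="AAP3")         # DRA:  Sum = t xor Cin
aap(X2, X4, X6, COUT, kind="AAP4")   # TRA:  Cout = Maj3(A, B, Cin)

assert [k for k, _ in trace] == ["AAP2"] * 3 + ["AAP3"] * 2 + ["AAP4"]
```

The double-copies exist because DRA and TRA are destructive to their source rows, so each operand needs one private copy per consuming operation.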
We performed a comprehensive circuit-level simulation to study the effect of process variation on both the DRA and TRA methods, considering different noise sources and variation in all components, including the DRAM cell (BL/cell capacitance and the access transistor, shown in Fig. 7) and the SA (width/length of transistors). We ran a Monte-Carlo simulation in Cadence Spectre with the 45nm NCSU Product Development Kit (PDK) library (NCSU_PDK) (DRAM cell parameters were taken and scaled from Rambus (Rambus)) under 10000 trials and increased the amount of variation from 0% to 30% for each method. Table 3 shows the percentage of test errors at each variation level. We observe that even with a significant 10% variation, the percentage of erroneous DRA results across 10000 trials is 0%, whereas the TRA method shows a 0.18% failure rate. Therefore, DRIM offers a solution that alleviates Challenge-3 by showing an acceptable voltage margin when performing operations based on the DRA mechanism. As transistor sizes scale down, the process variation effect is expected to worsen (seshadri2013rowclone; seshadri2017ambit). Since DRIM is mainly developed based on the existing DRAM structure and operation with slight modifications, the methods currently used to tackle process variation can also be applied to DRIM. Besides, just like Ambit, DRIM chips that fail testing due to the DRA or TRA methods can potentially be sold as regular DRAM chips, alleviating the impact on DRAM yield.
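A simplified software analogue of this variation study (not the Cadence Spectre setup): perturb the unit capacitors and the two inverter thresholds by a Gaussian fraction and count XOR2 decision errors. The sigma, the threshold values, and the error model are assumptions for illustration only.

```python
# Hedged Monte-Carlo sketch of DRA robustness under component variation.
import random

VDD, TRIALS, SIGMA = 1.2, 10000, 0.10   # 10% assumed component variation
random.seed(0)

def noisy_dra(a: int, b: int) -> int:
    c1 = 1.0 * (1 + random.gauss(0, SIGMA))      # unit-capacitor variation
    c2 = 1.0 * (1 + random.gauss(0, SIGMA))
    v = (c1 * a + c2 * b) * VDD / (c1 + c2)      # shared-charge voltage
    th_lo = (VDD / 4) * (1 + random.gauss(0, SIGMA))   # assumed thresholds
    th_hi = (3 * VDD / 4) * (1 + random.gauss(0, SIGMA))
    nor2, nand2 = int(v < th_lo), int(v < th_hi)
    return nand2 & (1 - nor2)                    # XOR2 decision

errors = sum(
    noisy_dra(a, b) != (a ^ b)
    for _ in range(TRIALS)
    for a in (0, 1) for b in (0, 1)
)
print(f"XOR2 error rate at {SIGMA:.0%} variation: {errors / (4 * TRIALS):.4%}")
```

Even in this crude model the error rate stays small, because the worst-case sensing margin is a full quarter of VDD; the paper's circuit-level margins are what Table 3 actually reports.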
Throughput: We evaluate and compare DRIM's raw performance with conventional computing units, including a Core-i7 Intel CPU (CPU) and an NVIDIA GTX 1080Ti Pascal GPU (GPU1). A great number of PIM accelerators present reconfigurable platforms or application-specific logic in or close to the memory die (angizi2018design; bojnordi2016memristive; angizi2018cmp; ahn2016scalable; ahn2015pim; akin2015data; balasubramonian2014near; boroumand2017lazypim; farmahini2015nda; guo2015enabling; hsieh2016accelerating; kim2016neurocube; nair2015active; pattnaik2016scheduling; pugsley2014comparing; trancoso2015moving; tang2017data; zhang2014top; akerib2015non; angizi2018pima; parveen2018imcs2). Due to the lack of space, we restrict our comparison to four recent processing-in-DRAM platforms, Ambit (seshadri2017ambit), DRISA-1T1C (li2017drisa), DRISA-3T1C (li2017drisa), and HMC 2.0 (HMC), handling three main bulk bit-wise operations, i.e., NOT, XNOR2, and add. For a fair comparison, we report DRIM's and the other PIM platforms' raw throughput implemented with 8 banks of 512×256 computational sub-arrays. We further develop a 3D-stacked DRAM with 256 banks and 4GB capacity, similar to HMC 2.0, for DRIM (i.e., DRIM-S), considering its computational capability. The Intel CPU consists of 4 cores and 8 threads working with two 64-bit DDR4-1866/2133 channels. The Pascal GPU has 3584 CUDA cores running at 1.5GHz (GPU1) with a 352-bit GDDR5X interface. The HMC has 32 vaults, each with 10 GB/s bandwidth. Accordingly, we developed an in-house benchmark that runs the operations repeatedly on input vectors of increasing lengths and report the throughput of each platform, as shown in Fig. 8.
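As a back-of-envelope sanity check on why in-situ designs reach such throughput, one can multiply the row-wide parallelism by the number of concurrently active sub-arrays. All constants below are our assumptions, not the paper's measured configuration, except the 256-column sub-array width stated above.

```python
# Hedged throughput model: one row-wide X(N)OR result per DRA cycle per
# computational sub-array, with all sub-arrays operating in parallel.
BANKS = 8                 # per the evaluation setup above
SUBARRAY_COLS = 256       # 512x256 computational sub-arrays (from the text)
SUBARRAYS_PER_BANK = 1024 # assumed; not stated in this section
CYCLE_NS = 50             # assumed DRA cycle (ACTIVATE + sense + PRECHARGE)

bits_per_cycle = BANKS * SUBARRAYS_PER_BANK * SUBARRAY_COLS
throughput_gbps = bits_per_cycle / CYCLE_NS    # bits per ns == Gbit/s
print(f"~{throughput_gbps:,.0f} Gbit/s of bulk XNOR2 under these assumptions")
```

The model makes the scaling argument explicit: throughput grows linearly with active sub-arrays, which is exactly the lever the 3D-stacked DRIM-S configuration pulls.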
We observe that 1) either the external or the internal DRAM bandwidth limits the throughput of the CPU, GPU, and even HMC platforms. However, HMC outperforms the CPU and GPU with 25× and 6.5× higher performance on average for bulk bit-wise operations. Besides, PIM platforms achieve remarkable throughput compared to Von-Neumann computing systems (CPU/GPU) by unblocking the data movement bottleneck. Regular DRIM (DRIM-R) shows on average 71× and 8.4× better throughput compared to the CPU and GPU, respectively. 2) While DRIM-R, Ambit, and the DRISA platforms achieve almost the same performance on the bulk bit-wise NOT function, DRIM-R outperforms the other PIMs in performing X(N)OR2-based operations. Our platform improves the throughput by 2.3×, 1.9×, and 3.7× compared with Ambit (seshadri2017ambit), DRISA-1T1C (li2017drisa), and DRISA-3T1C (li2017drisa), respectively. 3) DRIM-S boosts the throughput of the HMC by 13.5×. To sum up, DRIM's DRA mechanism effectively addresses Challenge-1 by providing high-throughput bulk bit-wise X(N)OR-based operations.
Energy: We estimate the energy the DRAM chip consumes to perform the three bulk bit-wise operations per kilo-byte for DRIM, Ambit (seshadri2017ambit), DRISA-1T1C (li2017drisa), and the CPU (this energy does not include the energy the processor consumes to perform the operation). Note that other operations such as AND2/NAND2 and OR2/NOR2 in DRIM are built on top of the TRA method, with almost the same energy consumption as Ambit. Fig. 9 shows that DRIM achieves 2.4× and 1.6× energy reduction over Ambit (seshadri2017ambit) and DRISA-1T1C (li2017drisa), respectively, in performing the bulk bit-wise XNOR2 operation. Besides, compared with copying data through the DDR4 interface, DRIM reduces the energy by 69×. As for the bit-wise in-memory add operation, DRIM outperforms Ambit, DRISA-1T1C, and the CPU with 2×, 1.7×, and 27× reduction in energy consumption, respectively.
Area: To assess the area overhead of DRIM on top of a commodity DRAM chip, four hardware cost sources must be taken into consideration. First, the add-on transistors in the SAs; in our design, each SA requires 22 additional transistors connected to each BL. Second, two rows of DCCs, with two word-lines associated with each; based on the estimation made in (kang2010one), each DCC row imposes roughly one additional transistor over a regular DRAM cell on each BL. Third, the 4:12 MRD overhead (originally 4:16); we modify each driver by adding two more transistors in the typical buffer chain, as depicted in Fig. 4a. Fourth, the Ctrl's overhead to generate the enable bits; the ctrl generates the activation bits with MUX units of 6 transistors each. To sum up, DRIM roughly imposes the equivalent of 24 DRAM rows per sub-array, which translates to less than 10% of the DRAM chip area.
Virtual Memory: DRIM has its own ISA, with operations that can potentially use virtual addresses. To use virtual addresses, DRIM's ctrl must be able to translate them to physical addresses. While in theory this looks as simple as passing the address of the page table root to DRIM and giving DRIM's ctrl the ability to walk the page table, it is far more complicated in real-world designs. The main challenge is that the page table can be scattered across different DIMMs and channels, while DRIM operates within a memory module. Furthermore, page table coherence issues can arise. The other way to implement translation capabilities for DRIM is through memory-controller pre-processing of the instructions being written to DRIM's instruction registers. For instance, if the programmer writes the instruction AAP (src, des, 256), the memory controller intercepts the virtual addresses and translates them into physical addresses. Note that most systems have near-memory-controller translation capabilities, mainly to manage IOMMU and DMA accesses from I/O devices. One issue that can arise is that some operations are appropriate only if the resulting physical addresses lie within a specific plane, e.g., within the same bank. Accordingly, the compiler and the OS should work together to ensure that the operands of commands result in physical addresses suitable for the operation type.
Memory Layout and Interleaving: While high-performance memory systems rely on channel interleaving to maximize memory bandwidth, DRIM adopts a different approach, maximizing spatial locality and allocating memory as close to the corresponding operands as possible. The main goal is to reduce data movement across memory modules and hence reduce operation latency and energy costs. As exposing a programmer directly to the memory layout is challenging, the DRIM architecture can rely on compiler passes that take the memory layout and the program as input, then assign physical addresses adequate to each operation without impacting the semantics of the application.
Reliability: Many ECC-enabled DIMMs rely on calculating a Hamming code at the memory controller and using it to correct soft errors. Unfortunately, such a feature is not available to DRIM, as the data being processed is not visible to the memory controller. Note that this issue is common across all PIM designs. To overcome it, DRIM can potentially augment each row with additional ECC bits that are calculated and verified at the memory-module or bank level. Augmenting DRIM with reliability guarantees is left as future work.
Cache Coherence: When DRIM updates data directly in memory, there could be stale copies of the updated memory locations in the cache; thus, data inconsistency issues may arise. Similarly, if the processor updates cached copies of memory locations that DRIM will process later, DRIM could use wrong/stale values. There are several ways to solve such issues for off-chip accelerators; the most common is to rely on the operating system (OS) to unmap the physical pages accessible by DRIM from any process that can run while DRIM is computing.
In this work, we presented DRIM, a high-throughput and energy-efficient PIM architecture that addresses some of the existing issues in state-of-the-art DRAM-based acceleration solutions for performing bulk bit-wise X(N)OR-based operations, i.e., limited throughput, row initialization, and reliability concerns, while incurring less than 10% area overhead on top of a commodity DRAM chip.
- (1) P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “Prime: a novel processing-in-memory architecture for neural network computation in reram-based main memory,” in ACM SIGARCH Computer Architecture News, vol. 44, no. 3. IEEE Press, 2016, pp. 27–39.
- (2) V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-memory accelerator for bulk bitwise operations using commodity dram technology,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2017, pp. 273–287.
- (3) S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, “Drisa: A dram-based reconfigurable in-situ accelerator,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2017, pp. 288–301.
- (4) S. Angizi, Z. He, F. Parveen, and D. Fan, “Rimpa: A new reconfigurable dual-mode in-memory processing architecture with spin hall effect-driven domain wall motion device,” in VLSI (ISVLSI), 2017 IEEE Computer Society Annual Symposium on. IEEE, 2017, pp. 45–50.
- (5) S. Angizi, Z. He, A. Awad, and D. Fan, “Mrima: An mram-based in-memory accelerator,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019.
- (6) S. Angizi, Z. He, and D. Fan, “Dima: a depthwise cnn in-memory accelerator,” in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2018, pp. 1–8.
- (7) S. Angizi, Z. He, F. Parveen, and D. Fan, “Imce: Energy-efficient bit-wise in-memory convolution engine for deep neural network,” in Design Automation Conference (ASP-DAC), 2018 23rd Asia and South Pacific. IEEE, 2018, pp. 111–116.
- (8) S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and R. Das, “Compute caches,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2017, pp. 481–492.
- (9) C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. Blaauw, and R. Das, “Neural cache: Bit-serial in-cache acceleration of deep neural networks,” arXiv preprint arXiv:1805.03718, 2018.
- (10) G. Dai, T. Huang, Y. Chi, J. Zhao, G. Sun, Y. Liu, Y. Wang, Y. Xie, and H. Yang, “Graphh: A processing-in-memory architecture for large-scale graph processing,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.
- (11) “Hybrid memory cube specification 2.0.” [Online]. Available: http://www.hybridmemorycube.org/files/SiteDownloads/HMC-30G-VSR_HMCC_Specification_Rev2.0_Public.pdf.
- (12) A. Driskill-Smith, D. Apalkov, V. Nikitin, X. Tang, S. Watts, D. Lottis, K. Moon, A. Khvalkovskiy, R. Kawakami, X. Luo et al., “Latest advances and roadmap for in-plane and perpendicular stt-ram,” in Memory Workshop (IMW), 2011 3rd IEEE International. IEEE, 2011, pp. 1–3.
- (13) “6th generation intel core processor family datasheet.” [Online]. Available: https://www.intel.com/content/www/us/en/products/processors/core/core-vpro/i7-6700.html
- (14) “Geforce gtx 1080 ti.” [Online]. Available: https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/
- (15) Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A fast and extensible dram simulator,” IEEE Computer architecture letters, vol. 15, no. 1, pp. 45–49, 2016.
- (16) V. Seshadri, K. Hsieh, A. Boroum, D. Lee, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Fast bulk bitwise and and or in dram,” IEEE Computer Architecture Letters, vol. 14, no. 2, pp. 127–131, 2015.
- (17) V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch et al., “Rowclone: fast and energy-efficient in-dram bulk data copy and initialization,” in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2013, pp. 185–197.
- (18) H. B. Kang and S. K. Hong, “One-transistor type dram,” Apr. 20 2010, U.S. Patent 7,701,751.
- (19) S.-L. Lu, Y.-C. Lin, and C.-L. Yang, “Improving dram latency with dynamic asymmetric subarray,” in 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2015, pp. 255–266.
- (20) G. Sideris, “Intel 1103-mos memory that defied cores,” Electronics, vol. 46, no. 9, pp. 108–113, 1973.
- (21) Q. Deng, L. Jiang, Y. Zhang, M. Zhang, and J. Yang, “Dracc: a dram based accelerator for accurate cnn inference,” in Proceedings of the 55th Annual Design Automation Conference. ACM, 2018, p. 168.
- (22) M. W. Allam, M. H. Anis, and M. I. Elmasry, “High-speed dynamic logic styles for scaled-down cmos and mtcmos technologies,” in Proceedings of the 2000 international symposium on Low power electronics and design. ACM, 2000, pp. 155–160.
- (23) S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada, “1-v power supply high-speed digital circuit technology with multithreshold-voltage cmos,” IEEE Journal of Solid-state circuits, vol. 30, no. 8, pp. 847–854, 1995.
- (24) T. Kuroda, T. Fujita, S. Mita, T. Nagamatsu, S. Yoshioka, K. Suzuki, F. Sano, M. Norishima, M. Murota, M. Kako et al., “A 0.9-v, 150-mhz, 10-mw, 4 mm/sup 2/, 2-d discrete cosine transform core processor with variable threshold-voltage (vt) scheme,” IEEE Journal of Solid-State Circuits, vol. 31, no. 11, pp. 1770–1779, 1996.
- (25) K. Navi, V. Foroutan, M. R. Azghadi, M. Maeen, M. Ebrahimpour, M. Kaveh, and O. Kavehei, “A novel low-power full-adder cell with new technique in designing logical gates based on static cmos inverter,” Microelectronics Journal, vol. 40, no. 10, pp. 1441–1448, 2009.
- (26) (2018) Parallel thread execution isa version 6.1. [Online]. Available: http://docs.nvidia.com/cuda/parallel-thread-execution/index.html
- (27) (2011) Ncsu eda freepdk45. [Online]. Available: http://www.eda.ncsu.edu/wiki/FreePDK45:Contents
- (28) DRAM power model. [Online]. Available: https://www.rambus.com/energy/
- (29) S. Angizi, Z. He, N. Bagherzadeh, and D. Fan, “Design and evaluation of a spintronic in-memory processing platform for nonvolatile data encryption,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 9, pp. 1788–1801, 2018.
- (30) M. N. Bojnordi and E. Ipek, “Memristive boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning,” in High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on. IEEE, 2016, pp. 1–13.
- (31) S. Angizi, Z. He, A. S. Rakin, and D. Fan, “Cmp-pim: an energy-efficient comparator-based processing-in-memory neural network accelerator,” in Proceedings of the 55th Annual Design Automation Conference. ACM, 2018, p. 105.
- (32) J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” ACM SIGARCH Computer Architecture News, vol. 43, no. 3, pp. 105–117, 2016.
- (33) J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “Pim-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture,” in Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on. IEEE, 2015, pp. 336–348.
- (34) B. Akin, F. Franchetti, and J. C. Hoe, “Data reorganization in memory using 3d-stacked dram,” in Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on. IEEE, 2015, pp. 131–143.
- (35) R. Balasubramonian, J. Chang, T. Manning, J. H. Moreno, R. Murphy, R. Nair, and S. Swanson, “Near-data processing: Insights from a micro-46 workshop,” IEEE Micro, vol. 34, no. 4, pp. 36–42, 2014.
- (36) A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia, K. Hsieh, K. T. Malladi, H. Zheng, and O. Mutlu, “Lazypim: An efficient cache coherence mechanism for processing-in-memory,” IEEE Computer Architecture Letters, vol. 16, no. 1, pp. 46–50, 2017.
- (37) A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim, “Nda: Near-dram acceleration architecture leveraging commodity dram devices and standard memory modules,” in High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on. IEEE, 2015, pp. 283–295.
- (38) Q. Guo, T.-M. Low, N. Alachiotis, B. Akin, L. Pileggi, J. C. Hoe, and F. Franchetti, “Enabling portable energy efficiency with memory accelerated library,” in Microarchitecture (MICRO), 2015 48th Annual IEEE/ACM International Symposium on. IEEE, 2015, pp. 750–761.
- (39) K. Hsieh, S. Khan, N. Vijaykumar, K. K. Chang, A. Boroumand, S. Ghose, and O. Mutlu, “Accelerating pointer chasing in 3d-stacked memory: Challenges, mechanisms, evaluation,” in Computer Design (ICCD), 2016 IEEE 34th International Conference on. IEEE, 2016, pp. 25–32.
- (40) D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A programmable digital neuromorphic architecture with high-density 3d memory,” in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on. IEEE, 2016, pp. 380–392.
- (41) R. Nair, S. F. Antao, C. Bertolli, P. Bose, J. R. Brunheroto, T. Chen, C.-Y. Cher, C. H. Costa, J. Doi, C. Evangelinos et al., “Active memory cube: A processing-in-memory architecture for exascale systems,” IBM Journal of Research and Development, vol. 59, no. 2/3, pp. 17–1, 2015.
- (42) A. Pattnaik, X. Tang, A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, and C. R. Das, “Scheduling techniques for gpu architectures with processing-in-memory capabilities,” in Parallel Architecture and Compilation Techniques (PACT), 2016 International Conference on. IEEE, 2016, pp. 31–44.
- (43) S. H. Pugsley, J. Jestes, R. Balasubramonian, V. Srinivasan, A. Buyuktosunoglu, A. Davis, and F. Li, “Comparing implementations of near-data computing with in-memory mapreduce workloads,” IEEE Micro, vol. 34, no. 4, pp. 44–52, 2014.
- (44) P. Trancoso, “Moving to memoryland: in-memory computation for existing applications,” in Proceedings of the 12th ACM International Conference on Computing Frontiers. ACM, 2015, p. 32.
- (45) X. Tang, O. Kislal, M. Kandemir, and M. Karakoy, “Data movement aware computation partitioning,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2017, pp. 730–744.
- (46) D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski, “Top-pim: throughput-oriented programmable processing in memory,” in Proceedings of the 23rd international symposium on High-performance parallel and distributed computing. ACM, 2014, pp. 85–98.
- (47) A. Akerib and E. Ehrman, “Non-volatile in-memory computing device,” May 14 2015, U.S. Patent App. 14/588,419.
- (48) S. Angizi, Z. He, and D. Fan, “Pima-logic: a novel processing-in-memory architecture for highly flexible and energy-efficient logic computation,” in Proceedings of the 55th Annual Design Automation Conference. ACM, 2018, p. 162.
- (49) F. Parveen, S. Angizi, Z. He, and D. Fan, “Imcs2: Novel device-to-architecture co-design for low-power in-memory computing platform using coterminous spin switch,” IEEE Transactions on Magnetics, vol. 54, no. 7, pp. 1–14, 2018.