X-SRAM: Enabling In-Memory Boolean Computations in CMOS Static Random Access Memories


Amogh Agrawal*, Akhilesh Jaiswal*, Kaushik Roy,  School of Electrical and Computer Engineering, Purdue University, West Lafayette, US
(* Equal Contributors)
Email: {agrawa64, jaiswal, kaushik}@purdue.edu
Abstract

Silicon-based Static Random Access Memories (SRAM) and digital Boolean logic have been the workhorses of state-of-the-art computing platforms. Despite tremendous strides in scaling the ubiquitous metal-oxide-semiconductor transistor, the underlying von-Neumann computing architecture has remained unchanged. The limited throughput and energy-efficiency of state-of-the-art computing systems, to a large extent, result from the well-known von-Neumann bottleneck. The energy and throughput inefficiency of von-Neumann machines has been accentuated in recent times by the present emphasis on data-intensive applications like artificial intelligence and machine learning. A possible approach towards mitigating the overhead associated with the von-Neumann bottleneck is to enable in-memory Boolean computations. In this manuscript, we present an augmented version of the conventional SRAM bit-cells, called the X-SRAM, with the ability to perform in-memory, vector Boolean computations, in addition to the usual memory storage operations. We propose at least six different schemes for enabling in-memory vector computations, including NAND, NOR, IMP (implication) and XOR logic gates, with respect to two different bit-cell topologies: the 8T cell and the 8T Differential cell. In addition, we also present a novel ‘read-compute-store’ scheme, wherein the computed Boolean function can be directly stored in the memory without the need of latching the data and carrying out a subsequent write operation. The feasibility of the proposed schemes has been verified using predictive transistor models and Monte-Carlo variation analysis.

In-memory computing, SRAMs, sense amplifiers, von Neumann bottleneck.

I Introduction

Since the invention of the transistor switch [1], there has been an ever-increasing demand for speed and energy-efficiency in computing systems. Almost all state-of-the-art computing platforms are based on the well-known von-Neumann architecture, which is characterized by decoupled memory storage and computing cores. Running data-intensive applications, like artificial intelligence, search engines, neural networks, biological systems and financial analysis, on such von-Neumann machines is limited by the von-Neumann bottleneck [2]. This bottleneck results from frequent and large amounts of data transfer between the physically separate memory units and compute cores. Such to-and-fro data transfers incur large energy overheads in addition to limiting the overall throughput.

In order to overcome the von-Neumann bottleneck, there have been many efforts to develop new computing paradigms. One of the most promising approaches is in-memory computing, which aims to embed logic within the memory array in order to reduce memory-processor data transfers. Conceptually, the in-memory compute paradigm is illustrated in Fig. 1. It shows two physically separated blocks, the processor and the memory unit, and the associated computing bottleneck. In-memory techniques bypass the von-Neumann bottleneck by accomplishing computations right inside the memory array, as shown in the figure. In other words, in-memory-compute blocks store data exactly like a standard memory, but enable additional operations without expensive area or energy overheads. By enabling logic computations in-memory, significant improvements in both energy efficiency and throughput are expected [3, 4, 5, 6].

Fig. 1: Illustration of the von-Neumann bottleneck. Frequent to-and-fro data transfers between the processor and memory units incur large energy consumption and limit the throughput. Computing within the memory array enhances the memory functionality, thereby reducing the number of unnecessary data transfers for certain classes of operations like vector bit-wise Boolean logic.

Due to the potential impact of in-memory computing on future computing platforms, various proposals spanning from conventional complementary metal-oxide-semiconductor (CMOS) to beyond-CMOS technologies can be found in the literature. For example, Ref. [7] proposed integrating an ALU (arithmetic-logic-unit) close to the memory unit to exploit the wide memory bandwidth, while Ref. [3] reconfigures standard 6-transistor (6T) static random-access memory (SRAM) cells as content addressable memories (CAMs) and enables bit-wise logical operations. 6T-SRAM cells have also been used to implement machine learning classifiers [8], and dot-products in the analog domain for pattern recognition [5]. The underlying idea is to enable multiple rows of memory bit-cells and directly read out a voltage at the pre-charged bit-lines corresponding to the desired operation. However, the 6T-SRAM bit-cell has a coupled read-write path that imposes conflicting constraints on the design of the 6T cell, thereby raising the possibility of read-disturb failures. Moreover, activating multiple word-lines may create short-circuit paths, thereby flipping the cell states non-deterministically. The read-disturb failure is further accentuated by the fact that once the BL has discharged, activating subsequent word-lines performs a pseudo-write operation on the 6T cell, given the shared read-write path. A 6T-SRAM based on the deeply depleted channel (DDC) technology [9] was recently proposed for searching and in-memory computing applications, which has decoupled read-write paths. However, all of these proposals perform the computation in the peripheral circuits and read out the data; a subsequent memory-write operation is required to store the result back in the memory array. Thus, in our work, we use standard CMOS 8T and 8T Differential SRAM cells, which have decoupled read-write mechanisms, for performing in-memory computations. Moreover, we go a step further and propose a novel ‘read-compute-store’ scheme, where the computed result can be stored in-situ, within the memory array, without the need for latching the result and performing a subsequent memory-write instruction.

In addition, beyond-CMOS non-volatile technologies have been extensively explored for possible applications to in-memory computing [10]. These include works based on resistive RAMs [11], spin-based magnetic RAMs [12, 13, 14], and phase change materials [15]. Such emerging non-volatile technologies promise denser integration, energy-efficient operation and non-volatility as compared to CMOS based memories, and are suitable for in-memory computations [16]. However, these emerging technologies are still in the research and development phase, and their large-scale commercialization for on-chip memories is yet to materialize.

In this work, we explore in-memory vector operations in standard CMOS 8T and 8T Differential SRAM cells with minimal modifications in the peripheral circuitry. We call this augmented version of the SRAM bit-cells, with extra in-memory compute features, the X-SRAM. We propose at least six different techniques to enable Boolean computations. The 8T and 8T Differential cells lend themselves easily to in-memory computations because of the following three factors. 1) The read ports of the 8T and 8T Differential cells are isolated and can be easily configured to enable in-memory operations. 2) In sharp contrast to the 6T cells, the 8T and 8T Differential cells do not suffer from read disturb, and hence multiple read word-lines within the memory array can be simultaneously activated. 3) In addition, in this manuscript, we exploit the two-port structure of the 8T and 8T Differential cells to propose a novel read-compute-store operation, wherein the computed Boolean data can be stored into the memory array without first latching the data and performing a subsequent memory write operation. Later, in the Appendix, we describe in-memory computations in standard 6T-SRAMs using staggered activation of the word-lines, as was presented for analog computing in Ref. [5].

Fig. 2: A summary of the in-memory computing schemes proposed in this work. With respect to the 8T cell, we present bit-wise NAND, NOR and XOR operations using skewed inverter sensing. Further, we present the voltage-divider based operation of 8T cells for IMP and XOR gates. With respect to the 8T Differential cells, we present bit-wise NAND, NOR and XOR operations using asymmetric differential SAs. Moreover, a ‘read-compute-store’ operation has been presented for both types of bit-cells.

Some of the key highlights of the present work in comparison to previous works are enumerated below.

  1. We first leverage the fact that two simultaneously activated read word-lines of the standard 8T cells are inherently ‘wire-NORed’ through the read bit-line. By using a skewed inverter at the sensing output, we demonstrate that the NOR operation can be easily achieved. Further, we also show that NAND logic can similarly be accomplished using another skewed inverter. Note, unlike 6T cells, simultaneous activation of two read word-lines does not impose any read-disturb concerns, thereby opening up a wider design space for optimization.

  2. Further, by applying appropriate voltages, we show that two activated read ports of the 8T cell can be configured as a voltage divider. Based on this voltage divider scheme, we present in-memory vector IMP as well as XOR logic gates. The voltage divider scheme not only allows in-memory computations, but also augments the read mechanism by allowing a possible two-bit read operation under specific conditions.

  3. Subsequently, we also present in-memory NAND and NOR computations (along with XOR) in the recently proposed 8T Differential cells [17], using asymmetric sense amplifiers (SAs). The 8T Differential cells are more robust since they allow differential read sensing, as opposed to the standard 8T cells, which are characterized by single-ended sensing. The usual memory read/write functionality of the SRAM cell is not disturbed by the use of asymmetric sense amplifiers. We also show that the same hardware, including the SA, can be shared between the in-memory operation and the normal memory read operation. Moreover, the extra hardware enhances the memory read operation by acting as a check for read failures.

  4. We propose a novel ‘read-compute-store’ scheme for the 8T and 8T Differential bit-cells, wherein the computed data can directly be written into the desired memory location, without having to latch the output and perform a subsequent memory write operation. This exploits the decoupled read-write paths of the 8T and 8T Differential bit-cells.

  5. We perform Monte-Carlo simulations to verify the robustness of the proposed in-memory operations for the 8T and the 8T Differential bit-cells. Energy, delay and area numbers have been presented for each of the proposed schemes.

Fig. 3: a) Schematic of a standard 8T-SRAM bit-cell. In addition to the standard 6T cell, two additional transistors form the read path using a separate read bit-line (RBL). b) Single ended sensing of NAND/NOR using gated skewed inverters. Figure also shows the truth table for NAND/NOR/XOR operations. c) Timing diagram for reading NOR output of Cell 1 and Cell 2. d) Timing diagram for reading NAND output of Cell 1 and Cell 2.

II In-Memory Computations in 8-Transistor SRAM Bit-Cells

As discussed in the introduction, 8T cells have a favorable bit-cell structure for enabling in-memory computing. Specifically, we exploit the isolated read mechanism and the two-port cell topology to embed NAND, NOR, IMP and XOR logic within the memory array. Further, by leveraging the separate read and write ports of the 8T cell, we also propose a ‘read-compute-store’ scheme, wherein, with minimal changes in the peripheral circuits, the computed Boolean result can be stored in the desired row of the memory array in the same cycle, without the need of latching the result and performing a subsequent write operation.

Fig. 4: Monte-Carlo simulations in SPICE for NAND and NOR outputs for all possible input cases ‘00, 01, 10, 11’, in the presence of 30mV sigma variations in the threshold voltage.

II-A 8-Transistor SRAM: NOR Operation

The 8T SRAM cell is shown in Fig. 3(a). It consists of the usual 6T cell augmented by an additional read port formed by transistors M1-M2. The write operation is identical to that of the 6T cell, whereas for the read operation, RWL is activated (WWL is low). The RBL is initially pre-charged; if Q = ‘1’ the RBL discharges, otherwise it stays at its initial precharged condition. This decoupled read port allows a large, almost rail-to-rail, voltage swing on the RBL during the read operation without any concern of read-disturb failure.

The output of a NOR operation is ‘1’ only if both the inputs are ‘0’. For the memory implementation, this implies that the logic output should be ‘1’ only if both the bits corresponding to operands ‘A’ and ‘B’ store ‘0’; in all other cases the output should remain ‘0’. Consider activating the two RWLs corresponding to the rows storing vector operand ‘A’ and vector operand ‘B’, respectively, as shown in Fig. 3(b). Due to the decoupled read ports, both RWLs can be activated simultaneously without any read-disturb concerns, as opposed to the 6T cell. The precharged RBL retains its precharged state if and only if the bits Q corresponding to operands ‘A’ and ‘B’ are both ‘0’. In other words, as shown in Fig. 3(c), RBL remains high only if both the bits corresponding to operands ‘A’ and ‘B’ are ‘0’ (i.e. Q = ‘0’ for both ‘A’ and ‘B’). Thus, merely by activating the two RWLs, the data stored in the two bit-cells are ‘wire-NORed’. A gated inverter (INV1) is connected to the RBL such that the inverter output goes low if the RBL remains high. Thereby, the output of the cascaded inverter (INV2) is high only if the bits of operands ‘A’ and ‘B’ are both low, mimicking the NOR operation. Note, the NOR operation is the same as the usual read operation, except that we turn ON two RWLs instead of one. Thus, NOR can be easily achieved in the 8T bit-cell without any significant overhead. The timing diagram for the NOR operation is shown in Fig. 3(c).
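As a rough behavioral illustration (not the authors' circuit and not a SPICE model), the following Python sketch reduces the wired-NOR read to its logical essence: the precharged RBL is pulled low by any activated cell storing ‘1’, and the skewed inverter pair is abstracted as a threshold check. The constants VDD and TRIP_NOR are illustrative assumptions.

```python
# Behavioral sketch of the wired-NOR read on a precharged RBL (logic only, not SPICE).
# Any activated cell storing '1' pulls the shared read bit-line low; the skewed
# inverter pair (INV1 -> INV2) is reduced to a simple threshold check on RBL.

VDD = 1.0          # precharge voltage (illustrative)
TRIP_NOR = 0.5     # effective trip point of the INV1/INV2 chain (illustrative)

def in_memory_nor(bit_a: int, bit_b: int) -> int:
    """Wired-NOR of two bits whose RWLs are asserted simultaneously."""
    rbl = VDD                          # RBL starts precharged
    for q in (bit_a, bit_b):           # both read word-lines activated together
        if q == 1:
            rbl = 0.0                  # this cell's read port discharges the RBL
    return 1 if rbl > TRIP_NOR else 0  # output high only if RBL stayed high

assert [in_memory_nor(a, b) for a, b in ((0, 0), (0, 1), (1, 0), (1, 1))] == [1, 0, 0, 0]
```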

II-B 8-Transistor SRAM: NAND Operation

Let us now consider activating the two RWLs corresponding to vector operands ‘A’ and ‘B’, respectively. The precharged RBL will eventually discharge to 0V if Q for either of the input operands is ‘1’. However, the fall time of the RBL signal from the precharged value to 0V depends strongly on whether only one Q is high, or whether both the Q bits corresponding to operands ‘A’ and ‘B’ are high simultaneously. In other words, only if both Qs are ‘1’ is the discharge of the precharged RBL fast.

In Fig. 3(d), we schematically show the state of the RBL for the cases (0,0), (0,1) or (1,0), and (1,1), where the first number in the brackets corresponds to the state of the bit representing operand ‘A’ and the second number corresponds to the bit representing operand ‘B’. In order to exploit the different discharge rates of the RBL in the cases (0,1) (or (1,0)) and (1,1), the RWL signal has to be timed such that the RBL does not discharge completely in either of the cases (0,1) or (1,0). As shown in the timing diagram of Fig. 3(d), we activate the RWLs only for a short duration, such that the RBL does not discharge completely in the case of (0,1) or (1,0), thus creating a difference in the RBL voltage levels between the two cases ((0,1) or (1,0) versus (1,1)). The trip point of the inverter INV3 is chosen such that it goes high only for the case (1,1); thereby, the output of the inverter INV4 goes low only for (1,1), mimicking the NAND operation.

Fig. 4 demonstrates the robustness of the NAND and NOR proposals in the presence of 30mV sigma variations in the threshold voltage. We used 45-nm Predictive Technology Models (PTM) [18] for simulating the circuits. A BL and BLB capacitance of 10fF was assumed for all the simulations.

In addition, by NORing the outputs of the AND (INV3) and NOR (INV2) gates, the XOR operation can be easily achieved. In summary, we have shown that the very bit-cell topology of the 8T cell can be exploited to accomplish in-memory NOR, NAND and XOR computations. In the next sub-section, we discuss another proposal for embedding IMP as well as XOR gates within the 8T SRAM array by utilizing the proposed voltage divider scheme.
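To make the timing argument concrete, the sketch below models the RBL as a single RC node: with two conducting read ports the discharge is roughly twice as fast as with one, so a fixed-width RWL pulse plus a low-trip inverter separates (1,1) from (0,1)/(1,0). The 10fF bit-line capacitance follows the text; the on-resistance, pulse width and trip points are illustrative assumptions, and the XOR is formed by NORing the AND and NOR outputs as described above.

```python
# First-order RC sketch of the timed-pulse NAND sensing (illustrative numbers, not SPICE).
import math

VDD   = 1.0
C_RBL = 10e-15      # RBL capacitance, 10 fF as assumed in the text
R_ON  = 10e3        # effective pull-down resistance per conducting read port (assumed)
T_RWL = 100e-12     # RWL pulse width, chosen so one path leaves RBL above the AND trip point

def rbl_after_pulse(a: int, b: int) -> float:
    """RBL voltage at the end of the RWL pulse for stored bits (a, b)."""
    n_paths = a + b                        # number of conducting read ports
    if n_paths == 0:
        return VDD                         # RBL retains its precharge
    tau = (R_ON / n_paths) * C_RBL         # parallel paths shrink the time constant
    return VDD * math.exp(-T_RWL / tau)

TRIP_AND = 0.25                            # INV3 trips only for the fast (1,1) discharge
TRIP_NOR = 0.90                            # INV1/INV2 report that RBL stayed high

def and_out(a, b):  return 1 if rbl_after_pulse(a, b) < TRIP_AND else 0
def nor_out(a, b):  return 1 if rbl_after_pulse(a, b) > TRIP_NOR * VDD else 0
def nand_out(a, b): return 1 - and_out(a, b)
def xor_out(a, b):  return 1 - (and_out(a, b) | nor_out(a, b))   # NOR of AND and NOR

for a in (0, 1):
    for b in (0, 1):
        assert nand_out(a, b) == 1 - (a & b)
        assert xor_out(a, b) == (a ^ b)
```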

II-C 8-Transistor SRAM: Voltage Divider Scheme for IMP and XOR Gates

Fig. 5: a) Circuit schematic of the 8T-SRAM for implementing the voltage-divider scheme. b) Equivalent circuit traced by transistors while data is read from Cell 1 and Cell 2. c) Monte-Carlo simulations in SPICE for all possible input cases, showing the output of the two asymmetric inverters.

In this sub-section, we present a method of implementing the IMP and XOR operations in the 8T cell by exploiting the voltage divider principle. Consider the circuit shown in Fig. 5(a). Let us assume the first operand is stored in the upper bit-cell corresponding to the line RWL1, while the second operand is stored in the lower bit-cell corresponding to RWL2. In the conventional 8T cell, the sources of the read-port driver transistors (the transistors gated by the storage nodes) are connected to ground. In the presented circuit, these sources are instead connected to respective source lines (SL1 and SL2, shared along the respective rows). During normal operation, the SLs can be grounded, thereby accomplishing the usual 8T SRAM read and write operations.

During the in-memory computation mode, SL1 is pulled to VDD, while SL2 is grounded. RWL1 and RWL2 are initially grounded and RDBL is pre-charged to an intermediate voltage (chosen to be 400mV). After the precharge phase, the access transistors of the two read ports (gated by RWL1 and RWL2) are switched ON, so that the four read-port transistors of the two cells form a voltage divider with RDBL as its middle node (see Fig. 5(b)). Note, in this voltage divider configuration, the read-port transistors of ‘Cell 1’, whose source line is tied to VDD, are strongly source degenerated. In order to make sure they are sufficiently ON, we boosted the supply voltage of ‘Cell 1’ and the RWL1 voltage such that their gates have enough overdrive when ‘Cell 1’ is storing a digital ‘1’ (Q = ‘1’ and QB = ‘0’).

In this voltage divider configuration, RDBL retains its precharged voltage if both the bit-cells are storing a digital ‘0’ (i.e., both read-port driver transistors are OFF). Similarly, if both the cells are storing a digital ‘1’ (i.e., both driver transistors are ON), the voltage at RDBL stays close to its precharged value (400mV) due to the voltage divider effect. Thus, when the cells store (0,0) or (1,1) (where the first (second) number in the bracket indicates the data stored in Cell 1 (Cell 2)), the voltage at RDBL stays close to the precharged voltage. On the other hand, if the data stored is (1,0), the driver transistor of Cell 1 is ON while that of Cell 2 is OFF; as such, RDBL charges to VDD through the read port of Cell 1. In contrast, if the data stored is (0,1), the driver transistor of Cell 2 is ON while that of Cell 1 is OFF, and RDBL discharges to 0V through the read port of Cell 2. In summary, the voltage on RDBL stays close to the precharge voltage when both cells store the same data; RDBL charges to VDD for data (1,0) and discharges to 0V for data (0,1).

The state of the data stored in the two cells can be sensed through two skewed inverters. INV2 is skewed such that it goes high only when RDBL is much lower than the precharge voltage and is close to 0V, while INV1 is skewed so that it goes low only when RDBL is well above the precharge voltage and is close to VDD. In other words, a high output at INV2 indicates data (0,1), while a high output at INV3 (which inverts INV1) indicates data (1,0). Interestingly, INV1 directly implements ‘A IMP B’. By ORing the outputs of INV2 and INV3 we can obtain the XOR of inputs A and B.

Some key features of the voltage divider logic scheme are: 1) IMP, together with the trivial ability to write a ‘0’, forms a functionally complete set, and hence any arbitrary Boolean function can be implemented using the proposed scheme; 2) if either of the inverter outputs (INV2 or INV3) is high, it indicates that the data stored is (0,1) or (1,0), thereby allowing a two-bit read operation in addition to the desired in-memory computation. However, if neither output is high, a subsequent read operation would be required to ascertain whether the stored data is (0,0) or (1,1). As such, in the 50% of cases when the data stored is (0,1) or (1,0), we can accomplish a two-bit read operation along with the in-memory compute operation.
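A behavioral sketch of the voltage-divider sensing is given below; it captures only the decision logic (the settled RDBL levels and the two skewed inverters), with all voltages and thresholds as illustrative assumptions rather than simulated values.

```python
# Behavioral sketch of the voltage-divider IMP/XOR scheme (decision logic only, not SPICE).
# SL1 = VDD, SL2 = 0 V; after the read ports turn on, RDBL settles high for (1,0),
# low for (0,1), and near its ~400 mV precharge for (0,0) and (1,1).

VDD, V_PRE = 1.0, 0.4

def rdbl_voltage(d1: int, d2: int) -> float:
    """Settled RDBL voltage for data d1 in Cell 1 and d2 in Cell 2."""
    if (d1, d2) == (1, 0):
        return VDD      # only Cell 1's read port conducts: RDBL charges toward SL1
    if (d1, d2) == (0, 1):
        return 0.0      # only Cell 2's read port conducts: RDBL discharges toward SL2
    return V_PRE        # ports both off, or a full divider forms: RDBL stays near precharge

def sense(d1: int, d2: int):
    """Return (A IMP B, A XOR B) from the skewed-inverter outputs."""
    v = rdbl_voltage(d1, d2)
    detect_01 = 1 if v < 0.2 else 0    # skewed inverter: high only when RDBL is near 0 V
    imp       = 0 if v > 0.8 else 1    # skewed inverter: low only when RDBL is near VDD
    detect_10 = 1 - imp                # inverted IMP output flags the (1,0) case
    return imp, detect_01 | detect_10

for a in (0, 1):
    for b in (0, 1):
        imp, xor = sense(a, b)
        assert imp == ((1 - a) | b)    # A IMP B = (NOT A) OR B
        assert xor == (a ^ b)
```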

II-D Proposed ‘Read-Compute-Store’ (RCS) Scheme

We have seen that basic Boolean operations like NAND, NOR, IMP and XOR can be computed using 8T cells. We now show that the decoupled read and write ports of the 8T bit-cell can be used to enable a ‘read-compute-store’ (RCS) scheme. In the RCS scheme, while the data is being read from the two activated RWLs (corresponding to the two input operands), the WWL of a third row is simultaneously activated, such that the computed data gets stored in the third row while the actual Boolean computation is in progress. As such, the computed data is not required to be latched first and then written subsequently in a multi-cycle fashion. Note, writing into the 8T bit-cell is straightforward because its write port is specifically optimized for the write operation.

Fig. 6: a) Proposed ‘read-compute-store’ (RCS) scheme. RWL1 and RWL2 are enabled, corresponding to the data to be computed. The computation output is selectively passed to the write-driver of that column, while simultaneously enabling the WWL3, where data is to be stored. b) Block diagram showing the RCS blocks in the memory array. The NAND of row 1 and row 2 is to be stored in row 3. c) Monte-Carlo simulations in SPICE, showing the final state of Cell 3 stores the desired output.

Let us understand how the RCS scheme can be implemented with reference to Fig. 6. Assume that the input operands correspond to rows 1 and 2, while the result of the Boolean computation has to be stored in row 3. Note, this Boolean computation can be any of NAND/NOR/IMP/XOR. Let us take the example of the NAND operation. As shown in Fig. 6(a), the two read word-lines RWL1 and RWL2 are activated, and the compute block, which is essentially the abstracted view of the skewed inverters of Fig. 3(b), performs the logic computation. Now, since the read and write ports of the 8T cell are decoupled, we can simultaneously activate a third word-line, in this case the write word-line WWL3. The computed output can be selected through a multiplexer and fed to the write drivers for directly storing the Boolean result in the bit-cells corresponding to WWL3. Thus, the decoupled read-write ports of the 8T cells can be leveraged to accomplish the proposed ‘read-compute-store’ scheme. Fig. 6(b) schematically shows the array-level block diagram, where the three word-lines RWL1, RWL2 and WWL3 are activated simultaneously. In Fig. 6(c) we show the Monte-Carlo results for storing the computed NAND output into Cell 3. Note that a ‘copy’ operation can also be performed using the RCS scheme, by activating the RWL of the source row and the WWL of the destination row. In this case, the input to the RCS block is simply the SA output, which corresponds to the data stored in the bit-cells of the source row.
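The RCS flow itself is purely architectural, so a short functional sketch suffices: two rows are read, the column-wise compute block forms the result, and the write drivers store it into a third row whose WWL is asserted in the same cycle. The function and variable names below are illustrative, not part of the design.

```python
# Functional sketch of the 'read-compute-store' (RCS) flow (no timing or circuit detail).
from typing import Callable, List

def rcs(array: List[List[int]], src1: int, src2: int, dst: int,
        op: Callable[[int, int], int]) -> None:
    """Read rows src1/src2, compute `op` column-wise, and store the result into row dst."""
    a, b = array[src1], array[src2]               # RWL1 and RWL2 asserted together
    result = [op(x, y) for x, y in zip(a, b)]     # per-column compute block (skewed inverters)
    array[dst] = result                           # WWL of dst asserted: write drivers store it

mem = [[0, 1, 0, 1],    # row 0: operand A
       [0, 0, 1, 1],    # row 1: operand B
       [0, 0, 0, 0]]    # row 2: destination
rcs(mem, 0, 1, 2, lambda x, y: 1 - (x & y))       # in-memory NAND stored in row 2
assert mem[2] == [1, 1, 1, 0]

# A 'copy' uses the same flow, feeding the sensed source row straight to the write drivers.
rcs(mem, 0, 0, 2, lambda x, _: x)
assert mem[2] == mem[0]
```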

III 8-Transistor Differential Read SRAM

Fig. 7: a) Circuit schematic of an 8T Differential SRAM bit-cell [17]. b) Timing diagram used for in-memory computations on the 8T Differential SRAM. c) Circuit schematic of the proposed asymmetric differential sense amplifier.

Recently, an 8T Differential SRAM design was proposed in [17] to overcome the single-ended sensing of the conventional 8T-SRAM cell. The 8T Differential SRAM has decoupled read-write paths, with the added advantage of a differential read mechanism through the read bit-lines RBL/RBLB (see Fig. 7(a)), as opposed to the single-ended read mechanism of the 8T-SRAM. The ninth transistor, whose gate is connected to RWL in Fig. 7(a), is shared by all the bit-cells in the same row. The differential read operation is very similar to the read operation of a standard 6T-SRAM. The usual memory read operation is performed by pre-charging the bit-lines (RBL and RBLB) to VDD, and subsequently enabling the word-line corresponding to the row to be read out. Depending on whether the bit-cell stores ‘1’ or ‘0’, RBL or RBLB discharges. The difference in voltages on RBL and RBLB is sensed using a differential sense amplifier.

Let us consider words ‘A’ and ‘B’ stored in two rows of the memory array. Note that we can simultaneously enable the two corresponding RWLs without worrying about read-disturbs, since the bit-cell has decoupled read-write paths. RBL/RBLB are pre-charged to VDD. For the case ‘AB’ = ‘00’ (‘11’), RBL (RBLB) discharges to 0V, while RBLB (RBL) remains in the precharged state. However, for the cases ‘10’ and ‘01’, both RBL and RBLB discharge simultaneously. The four cases are summarized in Fig. 7(b).

Now, in order to sense the bit-wise NAND and NOR of ‘A’ and ‘B’, we propose an asymmetric SA (see Fig. 7(c)), obtained by skewing one of the transistors of the cross-coupled pair. The skew can be introduced in multiple ways, for example, through transistor sizing, threshold voltage, body bias etc. In Fig. 7(c), if one of the pull-down transistors is deliberately sized bigger than its counterpart, its current carrying capability increases. For the cases ‘01’ and ‘10’, both RBL and RBLB discharge simultaneously; however, since the up-sized transistor discharges its output node faster, the cross-coupled inverter pair of the SA stabilizes with that output at ‘0’. For the case ‘11’, RBLB starts to discharge while RBL stays at VDD; the SA amplifies the voltage difference between RBL and RBLB, and the same output resolves to ‘1’. Whereas for the case ‘00’, RBL starts to discharge while RBLB stays at VDD, giving an output of ‘0’.

Fig. 8: Monte-Carlo simulations in SPICE for the SA outputs for all possible input cases ‘00, 01, 10, 11’, in the presence of 30mV sigma variations in the threshold voltage.

Bit-Cell | Latency (ns) | Average Energy/bit (fJ)
8T-SRAM | 3 | 17.25
8T-SRAM (Voltage Divider) | 1 | 11.22
8T Differential SRAM | 1 | 29.67
6T-SRAM (Appendix) | 3 | 29.3
TABLE I: Average energy per-bit and latency for the proposed in-memory operations on various bit-cells.

Thus, it can be observed that this SA output generates an AND gate (and its complementary output a NAND gate). Similarly, by up-sizing the opposite pull-down transistor instead, OR/NOR gates are obtained at the SA outputs. Finally, two such SAs in parallel (one skewed each way) enable bit-wise AND/NAND and OR/NOR logic gates. Moreover, an XOR gate can be obtained by combining the AND/NAND and OR/NOR outputs using an additional NOR gate. Thus, in a single memory read cycle, we obtain a class of bit-wise Boolean operations, read directly from the asymmetrically sized SA outputs. Monte-Carlo simulations with 30mV sigma variations in the threshold voltage were repeated, and the outputs of the two SAs for all input data cases are summarized in Fig. 8.
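The decision behavior of the asymmetric SA can be summarized in a few lines of Python; the sketch below abstracts the skew into a parameter that decides the ambiguous ‘01’/‘10’ cases and follows the bit-line convention of Fig. 7(b) (a cell storing ‘0’ discharges RBL). It is a logic-level illustration, not a model of the actual latch.

```python
# Decision-level sketch of the asymmetric-SA sensing for the 8T Differential cell.
# 'skew' abstracts which pull-down of the cross-coupled pair is up-sized.

def bitline_states(a: int, b: int):
    """Which of RBL / RBLB discharge when both RWLs are asserted (convention of Fig. 7(b))."""
    rbl_low  = (a == 0) or (b == 0)    # a cell storing '0' pulls RBL low
    rblb_low = (a == 1) or (b == 1)    # a cell storing '1' pulls RBLB low
    return rbl_low, rblb_low

def asym_sa(a: int, b: int, skew: str) -> int:
    """Latched SA output; when both bit-lines fall ('01'/'10'), the up-sized side decides."""
    rbl_low, rblb_low = bitline_states(a, b)
    if rbl_low and rblb_low:
        return 0 if skew == 'and' else 1   # the deliberate asymmetry resolves the race
    return 0 if rbl_low else 1             # otherwise the SA follows the differential input

def xor_out(a: int, b: int) -> int:
    and_out = asym_sa(a, b, 'and')
    nor_out = 1 - asym_sa(a, b, 'or')
    return 1 - (and_out | nor_out)         # NOR of the AND and NOR outputs, as in the text

for a in (0, 1):
    for b in (0, 1):
        assert asym_sa(a, b, 'and') == (a & b)   # AND (complementary output gives NAND)
        assert asym_sa(a, b, 'or')  == (a | b)   # OR  (complementary output gives NOR)
        assert xor_out(a, b) == (a ^ b)
```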

It is worthwhile to note that the two SAs can be used for regular memory read operations as well. The two cases of a typical memory read operation are similar to the cases ‘11’ and ‘00’ in Fig. 7(b), and both SAs will generate the same output corresponding to the bit stored in the cell. Moreover, the output of the XOR gate inherently acts as an in-memory check for possible read failures. The RCS scheme described in Section II can also be applied to 8T Differential SRAMs due to their decoupled read-write paths. Along with the two RWLs from which the input operands are read, a WWL can also be enabled, which stores the Boolean output within the memory array in the same cycle.

Using 8T Differential cells is advantageous over the conventional 8T cells for in-memory bit-wise logic operations because of the better robustness offered by the differential read operation, in contrast to the single-ended read in 8T-SRAM cells.

IV Discussions

In Sections II and III, we have seen various ways of implementing basic Boolean operations using the 8T and the 8T Differential bit-cells. Table I presents the average energy per-bit and latency for each of the proposed in-memory compute techniques. The 8T cell has separate read and write ports, thereby alleviating any possible read-disturb failure concerns. In addition, it also supports the proposed RCS scheme. However, the 8T cell suffers from robustness concerns due to its single-ended sensing.

The 8T Differential cell, on the other hand, allows differential sensing like the conventional 6T cell, while also providing separate read and write ports. It thus combines the benefits of both the standard 6T and the 8T cells. Note, since the differential read scheme of the 8T Differential cell is functionally similar to that of the conventional 6T cell, NOR and NAND gates (along with the XOR gate) can also be implemented in a 6T based memory array. However, due to the shared read-write paths of the 6T cell, the word-lines cannot be simultaneously activated and require sequential activation. In addition, 6T cells are prone to read disturb and hence would exhibit much lower robustness than the proposed 8T and 8T Differential cells. Nevertheless, in the Appendix we include a description of how 6T cells can be used to accomplish NOR, NAND and XOR operations. We also show that an in-memory ‘copy’ operation can be easily achieved in the 6T cell due to its shared read/write paths.

Finally, it is worth noting that although we have proposed multiple in-memory techniques in this manuscript, the choice of the bit-cell and the associated Boolean function would heavily depend on the target application. The aim of the present manuscript is to demonstrate various possible techniques that can be utilized in conventional CMOS based memories for accomplishing in-memory Boolean computations.

V Conclusion

Von-Neumann machines have fueled the computing era for the past few decades. However, the recent emphasis on data-intensive applications like artificial intelligence, image recognition, the Internet-of-Things (IoT) etc. requires novel computing paradigms in order to fulfill the energy and throughput requirements. ‘In-memory’ computing has been proposed as a promising approach that could sustain the throughput and energy requirements of future computing platforms. In this paper, we have proposed multiple techniques to enable in-memory computing in standard CMOS bit-cells: the 8T cell and the 8T Differential cell. We have shown that Boolean functions like NAND, NOR, IMP and XOR can be obtained with minimal changes in the peripheral circuits and the associated read operation. Further, we have also proposed a ‘read-compute-store’ scheme that leverages the decoupled read and write ports of the 8T and 8T Differential cells, wherein the computed logic data can be directly stored in the desired row of the memory array. Our results are supported by rigorous Monte-Carlo simulations performed using predictive transistor models.

Appendix

V-A 6-Transistor SRAM: Bit-wise NOR/NAND/XOR Operation

Fig. 9: Schematic of a 6T-SRAM array along with two asymmetric SAs in parallel for reading bitwise NAND/NOR/XOR operation.

The most popular and widely used SRAM design is the standard 6T bit-cell, shown in Fig. 9. However, 6T bit-cells are inherently design-constrained due to their shared read and write paths. Nevertheless, with proper design choices, 6T cells can still be used to perform in-memory computations, although at reduced robustness due to the conflict between read and write operations in a standard 6T cell. The usual memory read operation in a 6T cell is performed by pre-charging the bit-lines (BL and BLB) to VDD, and enabling the word-line corresponding to the row to be read out. Depending on whether the bit-cell stores ‘1’ or ‘0’, BL or BLB discharges, as illustrated in Fig. 10(a). The difference in voltages on BL and BLB is sensed using a differential sense amplifier.

Consider a typical memory array shown in Fig. 9, with two words ‘A’ and ‘B’ stored in rows 1 and 2, respectively. Simultaneously enabling WL1 and WL2 introduces read-disturbs due to possible short-circuit paths. Hence, we employ a sequentially pulsed WL technique as a workaround, similar to the proposal in [5]. The address decoder sequentially turns WL1 and WL2 ON, corresponding to the rows storing ‘A’ and ‘B’, respectively, as illustrated in Fig. 10(b).

The WL pulse duration is chosen such that with the application of one WL pulse, BL/BLB drops to about VDD/2. If bits ‘A’ and ‘B’ both store ‘0’ (‘1’), BL (BLB) will finally discharge to 0V after the two consecutive pulses, whereas BLB (BL) remains at VDD. On the other hand, for the cases ‘AB’ = ‘10’ and ‘01’, the final voltages at BL and BLB would both be approximately VDD/2. Thus, for the cases ‘01’ and ‘10’, both BL and BLB end up at a voltage of about VDD/2, while for ‘00’ BL is lower than BLB by approximately VDD, and for ‘11’ BLB is lower than BL by approximately VDD.
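As with the schemes above, the sequentially pulsed 6T read can be illustrated with an idealized behavioral sketch: each pulse drops the discharging bit-line by about VDD/2, and the same asymmetric-SA decision rule from Section III resolves the resulting bit-line pair. The voltages and the equality margin are illustrative assumptions.

```python
# Behavioral sketch of the sequentially pulsed 6T scheme (idealized voltages, not SPICE).
VDD = 1.0

def bitlines_after_pulses(a: int, b: int):
    """(BL, BLB) after pulsing WL1 then WL2; a cell storing '0' pulls BL down."""
    bl, blb = VDD, VDD
    for q in (a, b):                        # one timed pulse per word-line
        if q == 0:
            bl = max(bl - VDD / 2, 0.0)     # this pulse drops BL by about VDD/2
        else:
            blb = max(blb - VDD / 2, 0.0)
    return bl, blb

def asym_sa_6t(a: int, b: int, skew: str) -> int:
    """Asymmetric SA on BL/BLB; the up-sized side decides when the bit-lines are equal."""
    bl, blb = bitlines_after_pulses(a, b)
    if abs(bl - blb) < 0.1:                 # '01'/'10': both near VDD/2, skew decides
        return 0 if skew == 'and' else 1
    return 1 if bl > blb else 0

for a in (0, 1):
    for b in (0, 1):
        assert asym_sa_6t(a, b, 'and') == (a & b)   # AND/NAND sense amplifier
        assert asym_sa_6t(a, b, 'or')  == (a | b)   # OR/NOR sense amplifier
```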

Fig. 10: a) Timing diagram for a typical memory read operation. The BL/BLB is pre-charged to VDD, and the final voltage is shown for the two cases when the bit-cell stores ‘0’ or ‘1’. b) Timing diagram for the proposed sequentially pulsed WL activation, and the resulting BL/BLB voltages for the four cases when the bit-cells store ‘00, 01, 10, 11’. c) Truth table for the NAND/NOR/XOR operation.
Fig. 11: Monte-Carlo simulations in SPICE of the SA outputs for all possible input cases ‘00, 01, 10, 11’, in the presence of 30mV sigma variations in the threshold voltage.

The four cases are summarized in Fig. 10(b). Using the two asymmetric SAs connected in parallel (in a similar fashion as proposed in Section III), we can obtain NAND/AND, NOR/OR and XOR bit-wise operations on ‘A’ and ‘B’. Note, although the sensing operation for in-memory computing with 6T cells seems similar to that of the 8T Differential cell, there are certain key differences. Firstly, two word-lines cannot be activated simultaneously in 6T cells; therefore, the WL pulses have to be properly timed and the pulse duration needs to be appropriately selected to achieve the desired functionality. Secondly, unlike the 6T cells, the read bit-lines of the 8T and 8T Differential cells can have a much larger voltage swing without any concern of read-disturb failures, thereby relaxing the constraints on the sense amplifier.

Monte-Carlo simulations with 30mV sigma variations in the threshold voltage were performed to demonstrate the functionality and robustness of the proposal. Fig. 11 shows the outputs of the two asymmetric SAs for the four possible input cases ‘00, 01, 10, 11’ in the presence of variations.

V-B 6T SRAM: Copy Operation

Fig. 12: a) Schematic of 6T-SRAM bit-cells Cell 1 and Cell 2. b) Timing diagram for performing a copy operation to copy data from Cell 1 to Cell 2. c) Monte-Carlo simulations in SPICE showing the final state of the Cell 2.

In this section, we describe a method of implementing ‘copy’ functionality within the 6T bit-cell. To copy data from one memory location to another, a typical instruction sequence performed by the processor would be to do a memory read from the source location, followed by a memory write to the destination. Thus, two memory transactions are performed. We exploit the coupled read-write paths of the 6T cell to perform a data copy operation from one row to another, since the same set of bit-lines BL/BLB are used to read from and write into the cell.

Let us consider bit-cells 1 and 2, connected to WL1 and WL2, respectively (see Fig. 12(a)). To copy data from cell 1 to cell 2, we perform the two steps illustrated in Fig. 12(b). Let us assume cell 1 stores a ‘1’, while cell 2 stores a ‘0’. The bit-lines BL/BLB are pre-charged to VDD, as usual. In step 1, WL1 is enabled, thereby turning the access transistors of cell 1 ON. Since node Q1 is at VDD and QB1 is at 0V (cell 1 stores ‘1’), BL remains at VDD, while BLB discharges to 0V. The pulse width is long enough for BLB to discharge fully to 0V. In step 2, WL2 is enabled, turning the access transistors of cell 2 ON, with BL at VDD and BLB at 0V. Since cell 2 stores a ‘0’, charge flows from BL to Q2, and from QB2 to BLB, thereby flipping the state of cell 2, such that cell 2 now stores a ‘1’. If cell 2 initially stored a ‘1’, nothing happens in step 2, and the state remains the same. Thus, we have implemented a data copy from cell 1 to cell 2 in a single memory transaction.
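The two-step copy can likewise be captured as a small state sketch; step 1 is the read of cell 1 that fully discharges one bit-line, and step 2 is the boosted-WL weak write that imposes that bit-line state on cell 2. The sketch assumes the boosted write succeeds, as verified by the Monte-Carlo results in Fig. 12(c); names are illustrative.

```python
# Step-level sketch of the 6T copy operation (charge sharing idealized, not SPICE).
VDD = 1.0

def copy_6t(cell1: int, cell2: int) -> int:
    """Copy cell 1 into cell 2 over the shared BL/BLB; returns cell 2's new state."""
    bl, blb = VDD, VDD                 # precharge phase
    # Step 1: WL1 pulse (a normal read of cell 1) fully discharges one bit-line.
    if cell1 == 1:
        blb = 0.0                      # QB1 = 0 pulls BLB down
    else:
        bl = 0.0                       # Q1 = 0 pulls BL down
    # Step 2: boosted WL2 pulse writes the BL/BLB state into cell 2,
    # overwriting its previous content (cell2) regardless of its value.
    return 1 if bl > blb else 0

for src in (0, 1):
    for dst in (0, 1):
        assert copy_6t(src, dst) == src
```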

Note that step 1 is a usual memory read operation, while step 2 is similar to a memory write operation. However, step 2 is a weak write mechanism since the charge stored on BL/BLB is used to switch the cell state. This may cause write failures. Thus, a boosted voltage on WL2 is required to ensure correct data is written into cell 2. A Monte-Carlo simulation with sigma threshold voltage variations of 30mV in 45-nm PTM models was performed to test the proposal, as shown in Fig. 12(c).

In order to implement a copy operation in 8T and 8T Differential SRAMs, the above scheme would not work due to their decoupled read-write paths. However, the RCS scheme proposed in Section II can be used: the RWL of the source row and the WWL of the destination row are enabled, and the output of the sense amplifier is fed to the RCS block. This copies the data from the source row to the destination row in a single-cycle operation.

References

  • [1] J. Bardeen and W. H. Brattain, “The transistor, a semi-conductor triode,” Physical Review, vol. 74, no. 2, p. 230, 1948.
  • [2] J. Backus, “Can programming be liberated from the von neumann style?: A functional style and its algebra of programs,” Commun. ACM, vol. 21, no. 8, pp. 613–641, Aug. 1978.
  • [3] S. Jeloka, N. B. Akesh, D. Sylvester, and D. Blaauw, “A 28 nm configurable memory (TCAM/BCAM/SRAM) using push-rule 6T bit cell enabling logic-in-memory,” IEEE Journal of Solid-State Circuits, vol. 51, no. 4, pp. 1009–1021, Apr. 2016.
  • [4] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and R. Das, “Compute caches,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).   IEEE, feb 2017.
  • [5] M. Kang, M.-S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz, “An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, may 2014.
  • [6] M. Kang, E. P. Kim, M. sun Keel, and N. R. Shanbhag, “Energy-efficient and high throughput sparse distributed memory architecture,” in 2015 IEEE International Symposium on Circuits and Systems (ISCAS).   IEEE, may 2015.
  • [7] W. M. Snelgrove, M. Stumm, D. Elliott, R. McKenzie, and C. Cojocaru, “Computational ram: Implementing processors in memory,” IEEE Design & Test of Computers, vol. 16, pp. 32–41, 1999.
  • [8] J. Zhang, Z. Wang, and N. Verma, “In-memory computation of a machine-learning classifier in a standard 6t SRAM array,” IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 915–924, apr 2017.
  • [9] Q. Dong, S. Jeloka, M. Saligane, Y. Kim, M. Kawaminami, A. Harada, S. Miyoshi, D. Blaauw, and D. Sylvester, “A 0.3v VDDmin 4+2t SRAM for searching and in-memory computing using 55nm DDC technology,” in 2017 Symposium on VLSI Circuits.   IEEE, jun 2017.
  • [10] H.-S. P. Wong and S. Salahuddin, “Memory leads the way to better computing,” Nature Nanotechnology, vol. 10, no. 3, pp. 191–194, mar 2015.
  • [11] S. Shirinzadeh, M. Soeken, P.-E. Gaillardon, and R. Drechsler, “Fast logic synthesis for RRAM-based in-memory computing using majority-inverter graphs,” in Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE).   Research Publishing Services, 2016.
  • [12] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, “Computing in memory with spin-transfer torque magnetic ram,” arXiv preprint arXiv:1703.02118, 2017.
  • [13] W. Kang, H. Wang, Z. Wang, Y. Zhang, and W. Zhao, “In-memory processing paradigm for bitwise logic operations in stt-mram,” IEEE Transactions on Magnetics, 2017.
  • [14] D. Lee, X. Fong, and K. Roy, “R-MRAM: A ROM-embedded STT MRAM cache,” IEEE Electron Device Letters, vol. 34, no. 10, pp. 1256–1258, oct 2013.
  • [15] A. Sebastian, T. Tuma, N. Papandreou, M. L. Gallo, L. Kull, T. Parnell, and E. Eleftheriou, “Temporal correlation detection using computational phase-change memory,” Nature Communications, vol. 8, no. 1, oct 2017.
  • [16] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, “Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories,” in Proceedings of the 53rd Annual Design Automation Conference (DAC ’16).   ACM Press, 2016.
  • [17] J. P. Kulkarni, A. Goel, P. Ndai, and K. Roy, “A read-disturb-free, differential sensing 1r/1w port, 8t bitcell array,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 9, pp. 1727–1730, sep 2011.
  • [18] Predictive Technology Models, 2016. [Online]. Available: http://ptm.asu.edu/