X-SRAM: Enabling In-Memory Boolean Computations in CMOS Static Random Access Memories
Abstract
Silicon-based Static Random Access Memories (SRAM) and digital Boolean logic have been the workhorses of state-of-the-art computing platforms. Despite tremendous strides in scaling the ubiquitous metal-oxide-semiconductor transistor, the underlying von Neumann computing architecture has remained unchanged. The limited throughput and energy efficiency of state-of-the-art computing systems result, to a large extent, from the well-known von Neumann bottleneck. The energy and throughput inefficiency of von Neumann machines has been accentuated in recent times by the emphasis on data-intensive applications such as artificial intelligence and machine learning. A possible approach towards mitigating the overhead associated with the von Neumann bottleneck is to enable in-memory Boolean computations. In this manuscript, we present an augmented version of the conventional SRAM bit-cells, called the X-SRAM, with the ability to perform in-memory, vector Boolean computations in addition to the usual memory storage operations. We propose at least six different schemes for enabling in-memory vector computations, including NAND, NOR, IMP (implication) and XOR logic gates, with respect to two bit-cell topologies: the 8T cell and the 8T Differential cell. In addition, we also present a novel ‘read-compute-store’ scheme, wherein the computed Boolean function can be directly stored in the memory without the need for latching the data and carrying out a subsequent write operation. The feasibility of the proposed schemes has been verified using predictive transistor models and Monte-Carlo variation analysis.
I Introduction
Since the invention of transistor switches [1], there has been an ever-increasing demand for speed and energy efficiency in computing systems. Almost all state-of-the-art computing platforms are based on the well-known von Neumann architecture, which is characterized by decoupled memory storage and computing cores. Running data-intensive applications, such as artificial intelligence, search engines, neural networks, biological systems and financial analysis, on such von Neumann machines is limited by the von Neumann bottleneck [2]. This bottleneck results from frequent and large data transfers between the physically separate memory units and compute cores. Moreover, these frequent to-and-fro data transfers incur large energy overheads in addition to limiting the overall throughput.
In order to overcome the von Neumann bottleneck, there have been many efforts to develop new computing paradigms. One of the most promising approaches is in-memory computing, which aims to embed logic within the memory array in order to reduce memory-processor data transfers. Conceptually, the in-memory compute paradigm is illustrated in Fig. 1. It shows two physically separated blocks, the processor and the memory unit, and the associated computing bottleneck. In-memory techniques tend to bypass the von Neumann bottleneck by accomplishing computations right inside the memory array, as shown in the figure. In other words, in-memory-compute blocks store data exactly like a standard memory; however, they enable additional operations without expensive area or energy overheads. By enabling logic computations in-memory, significant improvements in both energy efficiency and throughput are expected [3, 4, 5, 6].
Due to the potential impact of in-memory computing on future computing platforms, various proposals spanning from conventional complementary metal-oxide-semiconductor (CMOS) to beyond-CMOS technologies can be found in the literature. For example, Ref. [7] proposed integrating an ALU (arithmetic-logic unit) close to the memory unit to exploit the wide memory bandwidth, while Ref. [3] reconfigures standard 6-transistor (6T) static random-access memory (SRAM) cells as content-addressable memories (CAMs) and enables bit-wise logical operations. 6T-SRAM cells have also been used to implement machine learning classifiers [8] and dot-products in the analog domain for pattern recognition [5]. The underlying idea is to enable multiple rows of memory bit-cells and directly read out a voltage at the precharged bit-lines corresponding to the desired operation. However, the 6T-SRAM bit-cells have a coupled read-write path that imposes conflicting constraints on the design of the 6T cell, thereby raising issues of read-disturb failures. Moreover, activating multiple wordlines may cause short-circuit paths, thereby flipping the cell states non-deterministically. The read-disturb failure is further accentuated by the fact that, once the BL has discharged, activating subsequent wordlines performs a pseudo-write operation on the 6T cell, given the shared read-write path. A 6T-SRAM based on the deeply depleted channel (DDC) technology [9], which has decoupled read-write paths, was recently proposed for searching and in-memory computing applications. However, all of these proposals perform the computation in the peripheral circuits and read out the data. A subsequent memory-write operation is required to store the data back in the memory array. Thus, in our work, we use standard CMOS 8T and 8T Differential SRAM cells, due to their decoupled read-write mechanisms, for performing in-memory computations.
Moreover, we go a step further and propose the novel ‘read-compute-store’ scheme, where the computed result can be stored in-situ, within the memory array, without the need for latching the result and performing a subsequent memory-write instruction.
In addition, almost all beyond-CMOS non-volatile technologies have been extensively explored for possible applications to in-memory computing [10]. These include works based on resistive RAMs [11], spin-based magnetic RAMs [12, 13, 14], and phase-change materials [15]. Such emerging non-volatile technologies promise denser integration, energy-efficient operation and non-volatility as compared to CMOS-based memories, and are suitable for in-memory computations [16]. However, these emerging technologies are still in an extensive research and development phase, and their large-scale commercialization for on-chip memories remains distant.
In this work, we explore in-memory vector operations in standard CMOS 8T and 8T Differential SRAM cells with minimal modifications in the peripheral circuitry. We call the augmented version of the SRAM bit-cells, with extra in-memory compute features, the X-SRAM. We propose at least six different techniques to enable Boolean computations. The 8T and 8T Differential cells lend themselves easily to in-memory computations because of the following three factors. 1) The read ports of the 8T and 8T Differential cells are isolated and can be easily configured to enable in-memory operations. 2) In sharp contrast to the 6T cells, 8T and 8T Differential cells do not suffer from read disturb, and hence multiple read wordlines within the memory array can be simultaneously activated. 3) In this manuscript, we exploit the two-port structure of the 8T and 8T Differential cells to propose a novel read-compute-store operation, wherein the computed Boolean data can be stored into the memory array without actually latching the data followed by a subsequent memory write operation. Later, in the Appendix, we describe in-memory computations in standard 6T-SRAMs using staggered activation of wordlines, as was presented for analog computing in Ref. [5].
Some of the key highlights of the present work in comparison to previous works are enumerated below.

1) We first leverage the fact that two simultaneously activated read wordlines for the standard 8T cells are inherently ‘wire NORed’ through the read bit-line. By using a skewed inverter at the sensing output, we demonstrate that the NOR operation can be easily achieved. Further, we also show that NAND logic can similarly be accomplished using another skewed inverter. Note that, unlike in 6T cells, simultaneous activation of two read wordlines does not raise any read-disturb concerns, thereby opening up a wider design space for optimization.

2) Further, by applying appropriate voltages, we show that two activated read ports of the 8T cell can be configured as a voltage divider. Based on this voltage divider scheme, we present in-memory vector IMP as well as XOR logic gates. The voltage divider scheme not only allows in-memory computations, but also augments the read mechanism by allowing a possible two-bit read operation under specific conditions.

3) Subsequently, we also present in-memory NAND and NOR computations (along with XOR) in the recently proposed 8T Differential cells [17], using asymmetric sense amplifiers (SA). The 8T Differential cells are more robust since they allow differential read sensing, as opposed to the standard 8T cells that are characterized by single-ended sensing. The usual memory read/write functionality of the SRAM cell is not disturbed by the use of asymmetric sense amplifiers. We also show that the same hardware, including the SA, can be shared between an in-memory operation and the normal memory read operation. Moreover, the extra hardware enhances the memory read operation by acting as a check for read failures.

4) We propose a novel ‘read-compute-store’ scheme for the 8T and 8T Differential bit-cells, wherein the computed data can directly be written into the desired memory location, without having to latch the output and perform a subsequent memory write operation. This exploits the decoupled read-write paths of the 8T and 8T Differential bit-cells.

5) We perform Monte-Carlo simulations to verify the robustness of the proposed in-memory operations for the 8T and the 8T Differential bit-cells. Energy, delay and area numbers are presented for each of the proposed schemes.
II In-Memory Computations in 8-Transistor SRAM Bit-Cells
As discussed in the introduction, 8T cells have a favorable bit-cell structure to enable in-memory computing. Specifically, we exploit the isolated read mechanism and the two-port cell topology to embed NAND, NOR, IMP and XOR logic within the memory array. Further, by leveraging the separate read and write ports of the 8T cell, we also propose a ‘read-compute-store’ scheme, wherein, with minimal changes in the peripheral circuits, the computed Boolean result can be stored in the desired row of the memory array in the same cycle, without the need for latching the result and performing a subsequent write operation.
II-A 8-Transistor SRAM: NOR operation
The 8T SRAM cell is shown in Fig. 3(a). It consists of the usual 6T cell augmented by an additional read port constituted by transistors M1-M2. The write operation is similar to that of the 6T cell, whereas for the read operation, RWL is activated (WWL is low). The RBL is initially precharged; if Q = ‘1’ the RBL discharges, otherwise it stays at its initial precharged condition. This decoupled read port allows a large voltage swing (almost rail-to-rail) on the RBL during the read operation without any concerns of read-disturb failure.
The output of a NOR operation is ‘1’ only if both inputs are ‘0’. For the memory implementation, this implies that the logic output should be ‘1’ only if both the bits corresponding to the operands ‘A’ and ‘B’ store ‘0’. In all other cases the output should remain ‘0’. Consider that we activate two RWLs corresponding to the rows storing vector operand ‘A’ and vector operand ‘B’, respectively, as shown in Fig. 3(b). Due to the decoupled read ports, both RWLs can be activated simultaneously without any read-disturb concerns, as opposed to the 6T cell. As shown in Fig. 3(c), the precharged RBL retains its precharged state if and only if both the bits corresponding to operands ‘A’ and ‘B’ are ‘0’ (i.e. Q = ‘0’ for both ‘A’ and ‘B’). Thus, merely by activating the two RWLs, the data stored in the two bit-cells are ‘wire NORed’. A gated inverter (INV1) is connected to the RBL such that the inverter output goes low if the RBL remains high. Thereby, the output of the cascaded inverter (INV2) is high only if the bits of operands ‘A’ and ‘B’ are low simultaneously, mimicking the NOR operation. Note that the NOR operation is the same as the usual read operation, except that we have turned ON two RWLs instead of one. Thus, NOR can be easily achieved in the 8T bit-cell without any significant overhead. The timing diagram for the NOR operation is shown in Fig. 3(c).
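As an illustrative sketch only (a logic-level abstraction, not a circuit simulation), the wired-NOR behaviour described above can be modelled in Python. The function name `wired_nor` and the list-of-bits row representation are assumptions introduced here for illustration; INV1 and INV2 follow the roles described in the text.

```python
def wired_nor(row_a, row_b):
    """Bitwise NOR of two stored rows, modelled column by column."""
    result = []
    for qa, qb in zip(row_a, row_b):
        # RBL is precharged high; any activated cell storing Q = 1 discharges it.
        rbl_stays_high = not (qa or qb)
        inv1 = not rbl_stays_high  # INV1 output goes low when RBL remains high
        inv2 = not inv1            # cascaded INV2 output is the NOR result
        result.append(int(inv2))
    return result


if __name__ == "__main__":
    # Columns cover the four input cases (0,0), (0,1), (1,0), (1,1).
    print(wired_nor([0, 0, 1, 1], [0, 1, 0, 1]))
```

Each list index stands for one column of the array, so a single call corresponds to one vector NOR across the two activated rows.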
II-B 8-Transistor SRAM: NAND operation
Let us consider that we activate two RWLs corresponding to vector operands ‘A’ and ‘B’, respectively. The precharged RBL will eventually go to 0V if Q for any one of the input operands is ‘1’. However, the fall time of the signal at RBL from the precharged value to 0V depends strongly on whether only one of the Q bits is high or both the Q bits corresponding to operands ‘A’ and ‘B’ are high simultaneously. In other words, only if both Qs are ‘1’ is the discharge of the precharged RBL fast enough.
In Fig. 3(d), we show schematically the state of the RBL for the cases (0,0), (0,1) or (1,0), and (1,1), where the first number in the brackets corresponds to the state of the bit representing operand ‘A’ and the second number corresponds to the bit representing operand ‘B’. In order to exploit the different discharge rates of the RBL for (0,1) (or (1,0)) and (1,1), the RWL signal has to be timed such that the RBL does not discharge completely in either of the cases (0,1) or (1,0). As shown in the timing diagram of Fig. 3(d), we activate the RWL only for a short period of time such that it does not discharge the RBL completely in the case of (0,1) or (1,0), thus allowing a difference in voltage levels on RBL between the two cases ((0,1) or (1,0), and (1,1)). The trip point of the inverter INV3 is chosen such that its output goes high only for the case (1,1); thereby the output of the inverter INV4 goes low only for (1,1), mimicking the NAND operation.
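The timing argument above can be illustrated with a minimal behavioural sketch. It assumes a simple exponential discharge whose rate doubles when two read ports conduct; the constants `VDD`, `TAU`, `PULSE` and `TRIP` are hypothetical values chosen only to make the three cases separable, and are not taken from the simulations reported in this work.

```python
import math

VDD = 1.0    # assumed precharge level of RBL (V)
TAU = 1.0    # assumed RBL time constant with ONE conducting read port
PULSE = 1.0  # assumed RWL pulse width, in the same units as TAU
TRIP = 0.25  # assumed INV3 trip point (V), between the one- and two-path levels


def rbl_after_pulse(qa, qb):
    """RBL voltage at the end of the RWL pulse for stored bits qa, qb."""
    paths = qa + qb  # number of read ports that pull RBL down
    return VDD * math.exp(-paths * PULSE / TAU)


def nand_sense(qa, qb):
    """Return (INV3, INV4) outputs, i.e. (AND, NAND) of the two bits."""
    inv3 = rbl_after_pulse(qa, qb) < TRIP  # high only if RBL fell below trip
    inv4 = not inv3
    return int(inv3), int(inv4)


if __name__ == "__main__":
    for qa in (0, 1):
        for qb in (0, 1):
            print((qa, qb), round(rbl_after_pulse(qa, qb), 3), nand_sense(qa, qb))
```

In this sketch, a longer pulse would let the single-path case also fall below the trip point, which is why the RWL pulse width and the inverter trip point have to be chosen together.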
Fig. 4 demonstrates the robustness of the NAND and NOR proposals in the presence of 30mV sigma variations in the threshold voltage. We used 45nm Predictive Technology Models (PTM) [18] for simulating the circuits. A BL and BLB capacitance of 10fF was assumed for all the simulations.
In addition, by NORing the outputs of the AND (INV3) and the NOR (INV2) gates together, the XOR operation can be easily achieved. In summary, we have shown that the very bit-cell topology of the 8T cell can be exploited to accomplish in-memory NOR, NAND and XOR computations. In the next subsection, we discuss another proposal for embedding IMP as well as XOR gates within the 8T SRAM array by utilizing the proposed voltage divider scheme.
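The XOR construction can be checked exhaustively with a short sketch. The function name `xor_from_and_nor` is an assumption for illustration; the two intermediate signals stand for the INV3 (AND) and INV2 (NOR) outputs named in the text.

```python
def xor_from_and_nor(a, b):
    """XOR obtained by NORing the AND output and the NOR output."""
    and_out = a & b            # INV3 output: high only for (1,1)
    nor_out = 1 - (a | b)      # INV2 output: high only for (0,0)
    return 1 - (and_out | nor_out)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_from_and_nor(a, b))
```

The final NOR is high exactly when the inputs are neither both ‘1’ nor both ‘0’, which is the XOR condition.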
II-C 8-Transistor SRAM: Voltage Divider Scheme for IMP and XOR gates
In this subsection, we present a method of implementing IMP and XOR operations using the 8T cell by exploiting the voltage divider principle. Consider the circuit shown in Fig. 5(a). Let us assume the first operand is stored in the upper bit-cell corresponding to the line RWL1, while the second operand is stored in the lower bit-cell corresponding to RWL2. In the conventional 8T cell, the sources of the read-port pull-down transistors are connected to ground. In the presented circuit, these sources are instead connected to respective source lines (SL1 and SL2, shared along the respective rows). During normal operation, the SLs can be grounded, thereby accomplishing the usual 8T SRAM read and write operations.
During the in-memory computation mode, SL1 is pulled to VDD, while SL2 is grounded. RWL1 and RWL2 are initially grounded and RDBL is precharged to a fixed voltage (chosen to be 400mV). After the precharge phase, the read access transistors gated by RWL1 and RWL2 are switched ON, so that the four read-port transistors of the two cells form a voltage divider with RDBL as its middle node (see Fig. 5(b)). Note that, in the voltage divider configuration, the read-port transistors of ‘Cell 1’ are strongly source degenerated. In order to make sure they are sufficiently ON, we boosted the supply voltage of ‘Cell 1’ and the RWL1 voltage such that the gates of these transistors have enough overdrive when ‘Cell 1’ is storing a digital ‘1’ (Q = ‘1’ and QB = ‘0’).
In the voltage divider configuration, RDBL retains its precharged voltage if both bit-cells are storing a digital ‘0’ (i.e. the data-gated read transistors of both cells are OFF). Similarly, if both cells are storing a digital ‘1’ (i.e. both are ON), the voltage at RDBL stays close to its precharged value (400mV) due to the voltage divider effect. Thus, when the cells store (0,0) or (1,1) (where the first (second) number in the bracket indicates the data stored in Cell 1 (2)), the voltage at RDBL stays close to the precharged voltage. On the other hand, if the data stored is (1,0), the read transistor of Cell 1 is ON while that of Cell 2 is OFF. As such, RDBL charges to VDD through the read port of Cell 1. In contrast, if the data stored is (0,1), the read transistor of Cell 2 is ON while that of Cell 1 is OFF. Therefore, RDBL discharges to 0V through the read port of Cell 2. In summary, the voltage on RDBL stays close to the precharged voltage when both cells store the same data, charges to VDD for data (1,0), and discharges to 0V for data (0,1).
The state of the data stored in the two cells can be sensed through two skewed inverters. INV2 is skewed such that it goes high only when RDBL is much lower than the precharged voltage and is close to 0V, while INV1 is skewed so that it goes low only when RDBL is higher than the precharged voltage and is close to VDD. In other words, a high output at INV2 indicates data (0,1), while a high output at INV3 indicates data (1,0). Interestingly, INV1 implements ‘A IMP B’, since its output is low only for data (1,0). By ORing the outputs of INV2 and INV3, we can obtain the XOR of inputs A and B.
Some key features of the voltage divider logic scheme are: 1) IMP, together with a constant ‘0’, is functionally complete, and hence any arbitrary Boolean function can be implemented using the proposed scheme; 2) if either of the inverter outputs (INV2 or INV3) is high, it indicates that the data stored is (0,1) or (1,0), thereby allowing a two-bit read operation in addition to the desired in-memory computation. However, if neither of the inverters is high, then a subsequent read operation is required to ascertain whether the stored data is (0,0) or (1,1). As such, in the 50% of cases where the data stored is (0,1) or (1,0), we can accomplish a two-bit read operation along with the in-memory compute operation.
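A qualitative sketch of the voltage divider scheme is given below. It abstracts RDBL to three levels (‘low’, ‘mid’, ‘high’) rather than modelling actual voltages, and the names `rdbl_level`, `sense`, `two_bit_read`, `detect_01` and `detect_10` are hypothetical labels introduced for illustration (the two detectors play the role of the skewed inverters that fire for data (0,1) and (1,0), respectively).

```python
def rdbl_level(cell1, cell2):
    """Qualitative RDBL level after the read ports are enabled."""
    if cell1 == 1 and cell2 == 0:
        return "high"  # charged through Cell 1's read port (SL1 at supply)
    if cell1 == 0 and cell2 == 1:
        return "low"   # discharged through Cell 2's read port (SL2 at ground)
    return "mid"       # (0,0) or (1,1): stays near the precharge level


def sense(cell1, cell2):
    """Return (IMP, XOR) as derived from the two skewed detectors."""
    level = rdbl_level(cell1, cell2)
    detect_01 = int(level == "low")   # fires only when RDBL is near 0 V
    detect_10 = int(level == "high")  # fires only when RDBL is near supply
    imp = 1 - detect_10               # 'A IMP B' is false only for (1,0)
    xor = detect_01 | detect_10
    return imp, xor


def two_bit_read(cell1, cell2):
    """Return both stored bits when they differ, else None (extra read needed)."""
    level = rdbl_level(cell1, cell2)
    if level == "high":
        return (1, 0)
    if level == "low":
        return (0, 1)
    return None


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print((a, b), rdbl_level(a, b), sense(a, b), two_bit_read(a, b))
```

The `None` return value corresponds to the cases (0,0) and (1,1), where the scheme cannot distinguish the two data patterns without a subsequent read.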
II-D Proposed ‘read-compute-store’ (RCS) scheme
We have seen that basic Boolean operations like NAND, NOR, IMP and XOR can be computed using 8T cells. We now show that the decoupled read and write ports of the 8T bit-cell can be used to enable a ‘read-compute-store’ (RCS) scheme. The RCS scheme implies that, while the data is being read from the two activated RWLs (corresponding to the two input operands), the WWL of a third row can be activated simultaneously, such that the computed data gets stored in the third row while the actual Boolean computation is in progress. As such, the computed data is not required to be latched first and then written subsequently in a multi-cycle fashion. Note that writing into 8T bit-cells is much easier due to the fact that the write port of the 8T cell is specifically optimized for the write operation.
Let us understand how the RCS scheme can be implemented with reference to Fig. 6. Assume that the input operands correspond to rows 1 and 2, while the result of the Boolean computation has to be stored in row 3. Note that this Boolean computation can be any of NAND/NOR/IMP/XOR. Let us take the example of the NAND operation. As shown in Fig. 6(a), the two read lines RWL1 and RWL2 are activated, and the compute block, which is basically an abstracted view of the skewed inverters of Fig. 3(b), performs the logic computation. Now, since the read and write ports of the 8T cell are decoupled, we can simultaneously activate a third wordline, in this case the write wordline (WWL3). The computed output can be selected through a multiplexer and fed to the write drivers for directly storing the Boolean result in the bit-cells corresponding to WWL3. Thus, the fact that 8T cells have decoupled read-write ports can be leveraged to accomplish the proposed ‘read-compute-store’ scheme. Fig. 6(b) shows schematically the array-level block diagram where the three wordlines RWL1, RWL2 and WWL3 are activated simultaneously. In Fig. 6(c) we show the Monte-Carlo results for storing the computed NAND output into Cell3. Note that a ‘copy’ operation can also be performed using the RCS scheme, by activating the RWL of the source row and the WWL of the destination row. In this case, the input to the RCS block is simply the SA output, which corresponds to the data stored in the bit-cells of the source row.
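At the array level, the RCS scheme can be sketched as a single step that reads two rows, applies the selected Boolean function, and writes the result into a third row. The sketch below is a functional abstraction only; the names `OPS`, `read_compute_store` and `copy_row`, and the list-of-lists memory model, are assumptions introduced for illustration and do not model timing or the write drivers.

```python
# Boolean functions selectable by the (abstracted) output multiplexer.
OPS = {
    "NAND": lambda a, b: 1 - (a & b),
    "NOR": lambda a, b: 1 - (a | b),
    "IMP": lambda a, b: (1 - a) | b,
    "XOR": lambda a, b: a ^ b,
}


def read_compute_store(array, src1, src2, dst, op):
    """Read rows src1/src2, compute op column-wise, store into row dst."""
    compute = OPS[op]
    # The read ports of src1 and src2 feed the compute block while the
    # write port of dst is driven in the same step (no intermediate latch).
    array[dst] = [compute(a, b) for a, b in zip(array[src1], array[src2])]


def copy_row(array, src, dst):
    """Copy via RCS: the sensed data of row src is written into row dst."""
    array[dst] = list(array[src])


if __name__ == "__main__":
    memory = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0]]
    read_compute_store(memory, 0, 1, 2, "NAND")
    print(memory[2])
```

Because the source rows are only read, they are left unchanged; only the destination row is overwritten.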
III 8-Transistor Differential Read SRAM
Recently, an 8T Differential SRAM design was proposed in [17] to overcome the single-ended sensing of the conventional 8T-SRAM cell. The 8T Differential SRAM has decoupled read-write paths with the added advantage of a differential read mechanism through the read bit-lines RBL/RBLB (see Fig. 7(a)), as opposed to the single-ended read mechanism of the 8T-SRAM. The ninth transistor, whose gate is connected to RWL in Fig. 7(a), is shared by all the bit-cells in the same row. The differential read operation is very similar to the read operation of a standard 6T-SRAM. The usual memory read operation is performed by precharging the bit-lines (RBL and RBLB) to VDD, and subsequently enabling the wordline corresponding to the row to be read out. Depending on whether the bit-cell stores ‘1’ or ‘0’, RBL or RBLB discharges. The difference in voltages on RBL and RBLB is sensed using a differential sense amplifier.
Let us consider words ‘A’ and ‘B’ stored in two rows of the memory array. Note that we can simultaneously enable the two corresponding RWLs without worrying about read disturbs, since the bit-cell has decoupled read-write paths. The RBL/RBLB are precharged to VDD. For the case ‘AB’=‘00’ (‘11’), RBL (RBLB) discharges to 0V, but RBLB (RBL) remains in the precharged state. However, for cases ‘10’ and ‘01’, both RBL and RBLB discharge simultaneously. The four cases are summarized in Fig. 7(b).
Now, in order to sense the bit-wise NAND and NOR operations of ‘A’ and ‘B’, we propose an asymmetric SA (see Fig. 7(c)), obtained by skewing one of the transistors. Skewing the transistors can be done in multiple ways, for example through transistor sizing, threshold voltage, body bias etc. In Fig. 7(c), if one transistor of the SA is deliberately sized bigger than its counterpart in the other branch, its current carrying capability increases. For cases ‘01’ and ‘10’, both RBL and RBLB discharge simultaneously. However, since the current carrying capability of the upsized transistor is higher than that of its counterpart, the corresponding SA node discharges faster, and the cross-coupled inverter pair of the SA stabilizes with the SA output at ‘0’. For the case ‘11’, RBL starts to discharge, while RBLB is at VDD. The SA amplifies the voltage difference between RBL and RBLB, resulting in an SA output of ‘1’. Whereas for the case ‘00’, RBLB starts to discharge, while RBL is at VDD, giving an SA output of ‘0’.
Table I: Average energy per bit and latency of the proposed in-memory compute techniques

Bit-Cell                     Latency (ns)   Average Energy/bit (fJ)
8T-SRAM                      3              17.25
8T-SRAM (Voltage Divider)    1              11.22
8T Differential SRAM         1              29.67
6T-SRAM (Appendix)           3              29.3
Thus, it can be observed that the output of this SA implements an AND gate (and hence its complementary output implements a NAND gate). Similarly, by skewing the SA in the opposite direction, i.e. upsizing the counterpart transistor instead, OR/NOR gates are obtained at the SA outputs. Finally, two SAs in parallel, one skewed in each direction, enable bit-wise AND/NAND and OR/NOR logic gates. Moreover, an XOR gate can be obtained by combining the AND/NAND and OR/NOR outputs using an additional NOR gate. Thus, in a single memory read cycle, we obtain a class of Boolean bit-wise operations, read directly from the asymmetrically sized SA outputs. Monte-Carlo simulations with 30mV sigma variations in the threshold voltage were repeated, and the outputs of the two SAs for all input data cases are summarized in Fig. 8.
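The behaviour of the two asymmetric SAs can be sketched at the logic level as follows. To avoid depending on bit-line naming, the sketch labels the two read bit-lines by the stored value that discharges them (`zero_side`, `one_side`); these labels, the function names and the `tie_value` parameter are assumptions introduced for illustration. The skew is modelled only as the value the SA resolves to when both bit-lines discharge together.

```python
def bitlines(a, b):
    """Which of the two read bit-lines discharge for stored bits a, b."""
    zero_side = (a == 0) or (b == 0)  # discharged by any cell storing '0'
    one_side = (a == 1) or (b == 1)   # discharged by any cell storing '1'
    return zero_side, one_side


def asymmetric_sa(zero_side, one_side, tie_value):
    """SA output; the skew decides the result when both bit-lines discharge."""
    if zero_side and one_side:
        return tie_value
    return 1 if one_side else 0


def sense_pair(a, b):
    """Return (AND, OR, XOR) from two oppositely skewed SAs plus one NOR gate."""
    z, o = bitlines(a, b)
    and_out = asymmetric_sa(z, o, tie_value=0)  # SA skewed towards '0'
    or_out = asymmetric_sa(z, o, tie_value=1)   # SA skewed towards '1'
    nor_out = 1 - or_out
    xor_out = 1 - (and_out | nor_out)           # additional NOR gate
    return and_out, or_out, xor_out


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print((a, b), sense_pair(a, b))
```

When both activated cells hold the same value only one bit-line discharges, so both SAs agree; they differ only in the mixed cases, which is what the XOR output detects.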
It is worthwhile to note that the two SAs can be used for regular memory read operations as well. The two cases of a typical memory read operation are similar to the cases ‘11’ and ‘00’ in Fig. 7(b). Both SAs will generate the same output, corresponding to the bit stored in the cell. Moreover, the output of the XOR gate inherently acts as an in-memory check for possible read failures. The RCS scheme described in Section II can also be applied to 8T Differential SRAMs due to their decoupled read-write paths. Along with the two RWLs from which the input operands are read, a WWL can also be enabled, which stores the Boolean output within the memory array in the same cycle.
Using 8T Differential cells is advantageous over the conventional 8T cells for in-memory bit-wise logic operations because of the better robustness offered by the differential read operation, in contrast to the single-ended read in 8T-SRAM cells.
IV Discussions
In Sections II and III, we have seen various ways of implementing basic Boolean operations using the 8T and the 8T Differential bit-cells. Table I presents the average energy per bit and latency for each of the proposed in-memory compute techniques. The 8T cell has separate read and write ports, thereby alleviating any possible read-disturb failure concerns. In addition, it also supports the proposed RCS scheme. However, the 8T cell suffers from robustness concerns due to its single-ended sensing.
Using the 8T Differential cell, on the other hand, allows differential sensing like the conventional 6T cell, while also providing separate read and write ports. It thus combines the benefits of both the standard 6T and the 8T cells. Note that, since the differential read scheme for the 8T Differential cell is functionally similar to that of the conventional 6T cell, NOR and NAND gates (along with the XOR gate) can also be implemented in a 6T-based memory array. However, due to the shared read-write paths of the 6T cell, the wordlines cannot be simultaneously activated and require sequential activation. In addition, 6T cells are prone to read disturb and hence would exhibit much lower robustness than the proposed 8T and 8T Differential cells. Nevertheless, in the Appendix we have included a description of how 6T cells can be used to accomplish NOR, NAND and XOR operations. We also show that an in-memory ‘copy’ operation can be easily achieved in the 6T cell due to its shared read/write paths.
Finally, it is worth noting that although we have proposed multiple in-memory techniques in this manuscript, the choice of the bit-cell and the associated Boolean function would heavily depend on the target application. The aim of the present manuscript is to demonstrate various possible techniques that can be utilized in conventional CMOS-based memories for accomplishing in-memory Boolean computations.
V Conclusion
Von Neumann machines have fueled the computing era for the past few decades. However, the recent emphasis on data-intensive applications like artificial intelligence, image recognition and Internet-of-Things (IoT) requires novel computing paradigms in order to fulfill the energy and throughput requirements. ‘In-memory’ computing has been proposed as a promising approach that could sustain the throughput and energy requirements of future computing platforms. In this paper, we have proposed multiple techniques to enable in-memory computing in standard CMOS bit-cells: the 8T cell and the 8T Differential cell. We have shown that Boolean functions like NAND, NOR, IMP and XOR can be obtained with minimal changes in the peripheral circuits and the associated read operation. Further, we have also proposed a ‘read-compute-store’ scheme by leveraging the decoupled read and write ports of the 8T and 8T Differential cells, wherein the computed logic data can be directly stored in the desired row of the memory array. Our results are supported by rigorous Monte-Carlo simulations performed using predictive transistor models.
Appendix
A 6-Transistor SRAM: Bit-wise NOR/NAND/XOR Operation
The most popular and widely used SRAM design is the standard 6T bit-cell, shown in Fig. 9. However, 6T bit-cells are inherently design constrained due to the shared read and write paths. Nevertheless, with proper design choices, 6T cells can still be used to perform in-memory computations, although at reduced robustness due to the conflict between read and write operations in a standard 6T cell. The usual memory read operation in a 6T cell is performed by precharging the bit-lines (BL and BLB) to VDD, and enabling the wordline corresponding to the row to be read out. Depending on whether the bit-cell stores ‘1’ or ‘0’, BL or BLB discharges, as illustrated in Fig. 10(a). The difference in voltages on BL and BLB is sensed using a differential sense amplifier.
Consider a typical memory array shown in Fig. 9, with two words ‘A’ and ‘B’ stored in rows 1 and 2, respectively. Simultaneously enabling WL1 and WL2 introduces read disturbs due to possible short-circuit paths. Hence, we employ a sequentially pulsed WL technique as a workaround, similar to the proposal in [5]. The address decoder sequentially turns WL1 and WL2 ON, corresponding to the rows storing ‘A’ and ‘B’, respectively, as illustrated in Fig. 10(b).
The WL pulse duration is chosen such that, with the application of one WL pulse, BL/BLB drops to about VDD/2. If bits ‘A’ and ‘B’ both store ‘0’ (‘1’), BL (BLB) will finally discharge to 0V after the two consecutive pulses, whereas BLB (BL) remains at VDD. On the other hand, for cases where ‘AB’ = ‘10’ and ‘01’, the final voltages at BL and BLB would be approximately the same (VDD/2). Thus, for the cases ‘01’ and ‘10’ both BL and BLB would have a voltage of VDD/2, while for ‘00’ BL would be lower than BLB by VDD, and for ‘11’ BLB would be lower than BL by VDD.
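The bit-line voltages after the two sequential WL pulses can be tabulated with a small idealized sketch. It assumes each pulse removes exactly VDD/2 from the bit-line on the discharging side; the normalized `VDD` value and the function name `bitlines_after_pulses` are assumptions made for illustration.

```python
VDD = 1.0  # assumed (normalized) supply and precharge voltage


def bitlines_after_pulses(a, b):
    """Idealized (BL, BLB) voltages after pulsing WL1 (bit a) then WL2 (bit b)."""
    bl, blb = VDD, VDD          # both bit-lines precharged to VDD
    for bit in (a, b):          # WL1 is pulsed first, then WL2
        if bit == 0:
            bl -= VDD / 2       # a stored '0' discharges BL by about VDD/2
        else:
            blb -= VDD / 2      # a stored '1' discharges BLB by about VDD/2
    return bl, blb


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print((a, b), bitlines_after_pulses(a, b))
```

The mixed cases end with both bit-lines at the same level, which is why an asymmetric (skewed) SA is needed to resolve them deterministically.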
The four cases are summarized in Fig. 10(b). Using the two asymmetric SAs (in a similar fashion as proposed in Section III), connected in parallel, we can obtain NAND/AND, NOR/OR and XOR bit-wise operations on ‘A’ and ‘B’. Note that, although the sensing operation for in-memory computing with 6T cells seems similar to that of the 8T Differential cell, there are certain key differences. Firstly, two wordlines cannot be activated simultaneously in 6T cells; therefore the WL pulses have to be properly timed, and the pulse duration needs to be appropriately selected for achieving the desired functionality. Secondly, unlike in the 6T cells, the read bit-lines of the 8T Differential cells can have a much larger voltage swing without any concerns of possible read-disturb failures, thereby relaxing the constraints on the sense amplifier.
Monte-Carlo simulations with 30mV sigma variations in the threshold voltage were performed to demonstrate the functionality and robustness of the proposal. Fig. 11 shows the outputs of the two asymmetric SAs for the four possible input cases, ‘00’, ‘01’, ‘10’ and ‘11’, in the presence of variations.
B 6T SRAM: Copy Operation
In this section, we describe a method of implementing ‘copy’ functionality within the 6T bit-cell. To copy data from one memory location to another, a typical instruction sequence performed by the processor would be a memory read from the source location, followed by a memory write to the destination. Thus, two memory transactions are performed. We exploit the coupled read-write paths of the 6T cell to perform a data copy operation from one row to another, since the same set of bit-lines BL/BLB is used to read from and write into the cell.
Let us consider bit-cells 1 and 2, connected to WL1 and WL2, respectively (see Fig. 12(a)). To copy data from cell 1 to cell 2, we perform the two steps illustrated in Fig. 12(b). Let us assume cell 1 stores a ‘1’, while cell 2 stores a ‘0’. The bit-lines BL/BLB are precharged to VDD, as usual. In step 1, WL1 is enabled, thereby turning the access transistors of cell 1 ON. Since node Q1 is connected to VDD and QB1 is connected to 0V (cell 1 stores ‘1’), BL remains at VDD, while BLB discharges to 0V. The pulse width is long enough for BLB to discharge fully to 0V. In step 2, WL2 is enabled, turning the access transistors of cell 2 ON, with BL at VDD and BLB at 0V. Since cell 2 stores a ‘0’, charge flows from BL to Q2, and from QB2 to BLB, thereby flipping the state of cell 2 such that cell 2 now stores a ‘1’. If cell 2 initially stored a ‘1’, nothing happens in step 2, and the state remains the same. Thus, we have implemented a data copy from cell 1 to cell 2 in a single memory transaction.
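The two-step copy can be sketched as an idealized logic-level model. The sketch assumes that the bit-line state left by step 1 always succeeds in overwriting the destination cell in step 2 (i.e. it ignores the possibility of a weak write); the function name `copy_6t` and the list-based cell model are assumptions made for illustration.

```python
def copy_6t(cells, src, dst):
    """Idealized two-step copy of one bit between cells sharing BL/BLB."""
    # Precharge: both bit-lines start high (1 = VDD, 0 = 0 V).
    bl, blb = 1, 1
    # Step 1: enable the source wordline; the side holding '0' discharges.
    if cells[src] == 1:
        blb = 0
    else:
        bl = 0
    # Step 2: enable the destination wordline; the bit-line state is
    # assumed to be strong enough to set the destination cell.
    cells[dst] = 1 if (bl == 1 and blb == 0) else 0
    return cells


if __name__ == "__main__":
    print(copy_6t([1, 0], src=0, dst=1))
    print(copy_6t([0, 1], src=0, dst=1))
```

The source cell is left unchanged in this model, and a destination that already holds the source value is simply rewritten with the same value.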
Note that step 1 is a usual memory read operation, while step 2 is similar to a memory write operation. However, step 2 is a weak write mechanism, since only the charge stored on BL/BLB is available to switch the cell state, and this may cause write failures. Thus, a boosted voltage on WL2 is required to ensure that the correct data is written into cell 2. A Monte Carlo simulation with a sigma of 30 mV threshold voltage variation in 45 nm PTM models was performed to test the proposal, as shown in Fig. 12(c).
The proposed scheme cannot be used to implement a copy in 8T and 8T Differential SRAMs because their read and write paths are decoupled. However, the read-compute-store (RCS) scheme proposed in Section II can be used instead. The RWL of the source row and the WWL of the destination row are enabled, and the output of the sense amplifier is fed to the RCS block. This copies the data from the source row to the destination row in a single-cycle operation.
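The data flow of such an RCS-based copy can be sketched as follows. This Python model is a purely functional illustration under assumptions: the class `Array8T` and its methods are hypothetical names, the read port is treated as non-destructive, and the sense-amplifier outputs are assumed to be passed unchanged to the write port of the destination row.

```python
class Array8T:
    """Toy model of an array with decoupled read (RWL) and write (WWL) ports."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def read(self, rwl: int) -> list:
        # Read port: sensing a row does not disturb the stored data.
        return list(self.rows[rwl])

    def write(self, wwl: int, data: list) -> None:
        # Write port: overwrite the selected row with the supplied data.
        self.rows[wwl] = list(data)

    def rcs_copy(self, src_row: int, dst_row: int) -> None:
        # Source RWL and destination WWL are enabled together; the sensed
        # values are routed straight to the write port (assumed RCS behavior).
        sensed = self.read(src_row)
        self.write(dst_row, sensed)


array = Array8T([[1, 0, 1, 1], [0, 0, 0, 0]])
array.rcs_copy(src_row=0, dst_row=1)
print(array.rows)
```

The point of the sketch is that the sensed data never leaves the array to be latched and written back in a later access; the read and the write belong to the same operation.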
References
 [1] J. Bardeen and W. H. Brattain, “The transistor, a semiconductor triode,” Physical Review, vol. 74, no. 2, p. 230, 1948.
 [2] J. Backus, “Can programming be liberated from the von Neumann style?: A functional style and its algebra of programs,” Commun. ACM, vol. 21, no. 8, pp. 613–641, Aug. 1978.
 [3] “A 28 nm configurable memory (TCAM/BCAM/SRAM) using push-rule 6T bit cell enabling logic-in-memory,” IEEE Journal of Solid-State Circuits, vol. 51, no. 4, pp. 1009–1021, Apr. 2016.
 [4] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and R. Das, “Compute caches,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, Feb. 2017.
 [5] M. Kang, M.-S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz, “An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, May 2014.
 [6] M. Kang, E. P. Kim, M.-S. Keel, and N. R. Shanbhag, “Energy-efficient and high throughput sparse distributed memory architecture,” in 2015 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, May 2015.
 [7] W. M. Snelgrove, M. Stumm, D. Elliott, R. McKenzie, and C. Cojocaru, “Computational RAM: Implementing processors in memory,” IEEE Design & Test of Computers, vol. 16, pp. 32–41, 1999.
 [8] J. Zhang, Z. Wang, and N. Verma, “In-memory computation of a machine-learning classifier in a standard 6T SRAM array,” IEEE Journal of Solid-State Circuits, vol. 52, no. 4, pp. 915–924, Apr. 2017.
 [9] Q. Dong, S. Jeloka, M. Saligane, Y. Kim, M. Kawaminami, A. Harada, S. Miyoshi, D. Blaauw, and D. Sylvester, “A 0.3V VDDmin 4+2T SRAM for searching and in-memory computing using 55nm DDC technology,” in 2017 Symposium on VLSI Circuits. IEEE, Jun. 2017.
 [10] H.-S. P. Wong and S. Salahuddin, “Memory leads the way to better computing,” Nature Nanotechnology, vol. 10, no. 3, pp. 191–194, Mar. 2015.
 [11] S. Shirinzadeh, M. Soeken, P.-E. Gaillardon, and R. Drechsler, “Fast logic synthesis for RRAM-based in-memory computing using majority-inverter graphs,” in Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE). Research Publishing Services, 2016.
 [12] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan, “Computing in memory with spin-transfer torque magnetic RAM,” arXiv preprint arXiv:1703.02118, 2017.
 [13] W. Kang, H. Wang, Z. Wang, Y. Zhang, and W. Zhao, “In-memory processing paradigm for bitwise logic operations in STT-MRAM,” IEEE Transactions on Magnetics, 2017.
 [14] D. Lee, X. Fong, and K. Roy, “R-MRAM: A ROM-embedded STT MRAM cache,” IEEE Electron Device Letters, vol. 34, no. 10, pp. 1256–1258, Oct. 2013.
 [15] A. Sebastian, T. Tuma, N. Papandreou, M. L. Gallo, L. Kull, T. Parnell, and E. Eleftheriou, “Temporal correlation detection using computational phase-change memory,” Nature Communications, vol. 8, no. 1, Oct. 2017.
 [16] S. Li, C. Xu, Q. Zou, J. Zhao, Y. Lu, and Y. Xie, “Pinatubo,” in Proceedings of the 53rd Annual Design Automation Conference (DAC '16). ACM Press, 2016.
 [17] J. P. Kulkarni, A. Goel, P. Ndai, and K. Roy, “A read-disturb-free, differential sensing 1R/1W port, 8T bitcell array,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 9, pp. 1727–1730, Sep. 2011.
 [18] Predictive Technology Models. [Online]. http://ptm.asu.edu/, 2016.