Mitigate Parasitic Resistance in Resistive Crossbar-based Convolutional Neural Networks
Abstract.
Traditional computing hardware often encounters an on-chip memory bottleneck in large-scale Convolutional Neural Network (CNN) applications. With its unique in-memory computing feature, resistive crossbar-based computing has attracted researchers' attention as a promising solution to the memory bottleneck issue in von Neumann architectures. However, parasitic resistances in the crossbar deviate its behavior from the ideal weighted summation operation. In large-scale implementations, the impact of parasitic resistances must be carefully considered and mitigated to ensure the circuits' functionality. In this work, we implemented and simulated CNNs on resistive crossbar circuits with consideration of parasitic resistances. Moreover, we developed a new mapping scheme for high utilization of crossbar arrays in convolution, and a mitigation algorithm to compensate for parasitic resistances in CNN applications. The mitigation algorithm considers parasitic resistances as well as the data/kernel patterns of each layer to minimize the computing error in crossbar-based convolutions of CNNs. We demonstrated the proposed methods with implementations of a 4-layer CNN on MNIST, and ResNet (20, 32, and 56) on CIFAR-10. Simulation results show the proposed methods mitigate the parasitic resistances in crossbars well. With our methods, modern CNNs on crossbars can preserve ideal (software-level) classification accuracy with 6-bit ADC and DAC implementations.
1. Introduction
Convolutional Neural Networks (CNNs) have led to many performance breakthroughs in image classification, video object tracking, and audio processing applications (LeCun et al., 2015). Since AlexNet won the ILSVRC 2012 (Krizhevsky et al., 2012), CNNs have evolved into many different models, such as GoogLeNet (Szegedy et al., 2015), VGG (Simonyan and Zisserman, 2014), and ResNet (He et al., 2015). Using Graphics Processing Units (GPUs) and dedicated accelerators to accelerate convolutions plays an important role in CNNs' success, since convolutions dominate the overall computation. Interestingly, in many machine learning applications the throughput of convolutions takes priority over computing accuracy, especially in inference functions (Gupta et al., 2015). Inspired by this feature, Nvidia's recent GPUs (Ho and Wong, 2017), Cambricon-X (Zhang et al., 2016), and many other accelerators (such as the Cambricon MLU100 (wikichip, [n.d.]) and a Xilinx FPGA-based accelerator (Mei et al., 2017)) have started to support half-precision floating-point or even 8-bit integer precision to accelerate convolutions without performance degradation in many neural networks.
However, modern computing hardware often encounters an on-chip memory bottleneck when dealing with the high volume of convolutions in large-scale CNNs (Chandrasekhar et al., 2017; Rhu et al., 2016). State-of-the-art convolutional neural networks require a tremendous number of parameters to handle complex tasks. For example, AlexNet has 2.3 million parameters, VGG has 14.7 million parameters, and ResNet-152 has 25.5 million parameters (Chandrasekhar et al., 2017). Storing such an amount of parameters in limited caches is impossible. Therefore, frequent cache flushing and weight loading from off-chip memory (usually DRAM) are usually inevitable, which leads to significant delay and energy cost.
To overcome the memory bottleneck, many researchers have shown interest in resistive crossbar arrays for their computing-in-memory feature (Gao et al., 2016; Adam et al., 2017; Hu et al., 2018; Liu et al., 2016; Yu, 2018; Li et al., 2018). A resistive crossbar is a circuit structure with vertical and horizontal metal lines sandwiching a resistive switching material at their intersections. The cross-point material could be a memristor (Strukov et al., 2008), a phase change memory (PCM) device (Wong et al., 2010), a floating gate (Chen and Fong, 1999), a spintronic device (Wolf et al., 2001), RRAM (Wong et al., 2012), SRAM (Merolla et al., 2011), or any other device with programmable resistance. By utilizing Kirchhoff's current law (KCL) and Ohm's law, an ideal resistive crossbar array can carry out analog vector-matrix multiplication (VMM). The outputs of analog VMMs are represented as the analog output currents from the columns of the crossbar, with input voltage signals applied to the rows, and weights stored non-volatilely as conductances at the cross-points. In the inference stage, a VMM of any size can be done in a single step. Moreover, since weight storage and weighted multiplication/summation both happen at the same place, the crossbar array, it enables ultra-high computing efficiency on multiplications between changing vectors (data) and fixed matrices (weights), which is ideal for implementing ultra-low-power inference functions of neural networks.
However, as the array scales up, circuit parasitics deviate the crossbar from its ideal linear behavior and bring non-negligible error to the computing result. The impact of circuit parasitics, especially parasitic resistances such as wire resistance and interconnect resistance, has been observed and analyzed in many simulations and experiments (Hu et al., 2016; Ciprut and Friedman, 2017; Agarwal et al., 2017). Currently, the impact of parasitic capacitance and inductance can be ignored, since they mainly affect the transient behavior of the crossbar, while in-memory computing mainly depends on the crossbar's DC behavior. It is important to consider parasitic resistance in circuit simulations for practical and functional implementations, especially for neural network applications where many large-scale crossbars are used.
In this paper, we investigate crossbar-based large-scale CNN implementations with consideration of parasitic resistances and provide methods to mitigate their impact on convolution accuracy as well as CNN classification accuracy. Our major contributions include:

First, we present an efficient implementation method to densely map high-dimensional 4D kernels to 2D crossbar arrays. Using this method, a crossbar designed for vector-matrix multiplication can be easily adapted for convolution with near-zero hardware overhead.

Second, we model resistive crossbar arrays with consideration of parasitic resistances and realistic device models. The simulation result is also verified against experimental data up to a 128x64 crossbar size.

Third, we study the impact of parasitic resistances and provide a mitigation method to minimize the computing error due to parasitic resistances as well as data/kernel patterns in CNNs, without retraining.

Last but not least, we demonstrate our methods on a 4-layer CNN for MNIST, and ResNet-20, -32, and -56 for CIFAR-10.
Compared to other state-of-the-art crossbar simulators, our work conducts end-to-end full circuit simulation on modern CNN models and large datasets with consideration of circuit parasitics and realistic memristor models. Our work is the first to show how, with realistic crossbars and other peripheral circuits (DACs+ADCs), errors propagate in deep CNNs due to mixed-signal processing and nonlinear activations (ReLU).
Our results show that, with the proposed implementation and mitigation methods, 8-bit ADC/DAC resolution may be good enough to preserve software-level classification accuracy in deep CNNs.
The rest of the paper is organized as follows: Section II covers the background. Section III details the methodology of the implementation and mitigation methods. Section IV gives simulation results on a single crossbar as well as on crossbar-based CNNs. In the end, Section V concludes the paper.
2. Preliminary
2.1. Convolutional Neural Network
Fig. 1 shows a typical structure of CNN models. It has three key components: convolutional layers, pooling layers, and fully connected layers (LeCun et al., 2015). Each convolutional layer consists of a set of kernels of identical size, which detect local spatial correlations in input patterns. The output of a convolution is passed to a nonlinear activation function, such as ReLU or sigmoid, to form new feature maps. A pooling layer acts as a down-sampling filter for the output of the activation function. It not only reduces the size of the feature maps but also merges similar features; as a result, it helps to reduce the number of parameters as well as alleviate overfitting. The output of the pooling layer then feeds into the next convolutional layer. Fully connected (FC) layers usually appear in the last few layers of CNNs as the classifier. They weight every point in the high-level feature maps and generate the final classification result.
In this work, we target deep residual neural networks (ResNet) on resistive crossbar arrays, as ResNet is one of the state-of-the-art CNNs for image classification problems. ResNet was first introduced in ILSVRC 2015 (He et al., 2015). It keeps the benefit of deeper layers while addressing the degradation issue by optimizing the residual mapping with shortcut connections between layers. An ensemble of residual nets of up to 152 layers achieved a 3.57% error on the ImageNet test set and won 1st place in the ILSVRC 2015 classification competition. Fig. 3 shows the basic block diagram of ResNet. It combines multiple convolutions, batch normalizations, and rectified linear units (ReLU) as its basic building block. Different from other CNNs, ResNet uses a shortcut to directly add the input data to the output of a block. Since the two inputs may have different dimensions at the summation stage, a convolution layer is introduced in the shortcut to match the dimensions of the two inputs. The summation result is fed to a ReLU and passed to the next block. At the end of ResNet, a pooling layer, one or more FC layers, and a softmax layer are used in sequence to generate the final classification result. By studying and optimizing resistive crossbars for ResNet, we can gain more insight into the performance of resistive crossbar-based computing on modern CNNs.
2.2. Resistive crossbar circuit for VMM computing
Fig. 3 illustrates the general structure of a resistive crossbar array for VMM computing. In an ideal crossbar, when applying voltage inputs V at the rows simultaneously and reading current outputs I from the columns, the input-output relationship of the crossbar can be represented as:

I_j = sum_i V_i * G_{i,j}, or in matrix form, I = V G

In this way, the analog weighted summation is achieved through Kirchhoff's Current Law and Ohm's Law. By mapping the input vector X to the input voltages V, a positive matrix A to the conductance matrix G, and the output currents I back to the output result Y, a memristor crossbar can be regarded as an analog VMM module realizing Y = XA in one step. Note that I = VG is only valid for an ideal crossbar, where parasitic resistances can be ignored and device conductance is independent of voltage/current.
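The ideal relationship above can be sketched numerically; the conductance and voltage values here are illustrative only:

```python
import numpy as np

# Ideal crossbar: rows driven with voltages V, devices hold conductances G
# (rows x columns). By Ohm's law each device passes V_i * G_ij, and by KCL
# each column wire sums those currents: I_j = sum_i V_i * G_ij.
G = np.array([[1e-5, 2e-5],
              [3e-5, 4e-5],
              [5e-5, 6e-5]])   # siemens, 3 rows x 2 columns
V = np.array([0.1, 0.2, 0.3])  # volts applied to the 3 rows
I = G.T @ V                    # amps flowing out of the 2 columns
```

The whole weighted summation costs one analog read step, regardless of the matrix size.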
In realistic crossbar arrays, circuit parasitics, including wire resistance, input/output resistance, device I-V nonlinearity, and parasitic capacitance, deviate the crossbar's behavior from ideal vector-matrix multiplication. For instance, wire resistance degrades the signal along the row and column wires, so a device at the far corner receives a smaller signal than expected due to the accumulated wire resistance. Input/output resistances act similarly to wire resistance, but they can cause even larger signal degradation, since they usually come from pass transistors and are more resistive than wires. Meanwhile, the cross-point device's I-V nonlinearity affects its multiplication accuracy, as its conductance is no longer independent of voltage or current. The impact of capacitance and inductance can be ignored at the current stage, since they mainly affect the transient behavior of the crossbar, while we use the crossbar's DC behavior to realize analog VMM at relatively low frequency (10 to 100 MHz).
2.3. Mapping algorithms for crossbar array
When using a crossbar array as a VMM circuit, the first problem is how to map the matrix onto the crossbar. Since cross-point conductances and column output currents cannot be negative, the mapping method needs to be able to address this negative-value issue. One common way is using positive and negative power supplies and mapping matrices onto the crossbar array in absolute form: for a negative element in the matrix, a negative input voltage is applied at the corresponding position. Recently, Chris Yakopcic et al. proposed a way to implement convolution and CNNs on memristor-based crossbars (Yakopcic et al., 2016)(Yakopcic et al., 2015). In their approach, the convolution kernel has to be divided into two parts, a positive kernel and a negative kernel. All negative values in the positive kernel are replaced by zero, and those negative values are mapped into the negative kernel as positive conductances. By providing identical inputs with reversed sign to the negative kernel, a memristor-based crossbar can handle matrices that contain negative values.
(Gao et al., 2016) also maps the absolute values of the matrix onto the crossbar array. However, it needs two adjacent columns of the crossbar to represent a single column of the matrix: one represents the positive part, and the other represents the negative part. The two columns then produce output currents I+ and I- respectively, and the final output is I+ - I-.
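A minimal numerical sketch of this two-column scheme, treating the stored conductances as unitless weights for clarity:

```python
import numpy as np

# Differential-column mapping: each matrix column is split into a positive
# column and a negative column; the final output is I_plus - I_minus.
A = np.array([[ 0.5, -0.2],
              [-0.3,  0.4]])
G_plus  = np.maximum(A, 0)    # negative entries replaced by zero
G_minus = np.maximum(-A, 0)   # magnitudes of negative entries
x = np.array([1.0, 2.0])
I_plus, I_minus = G_plus.T @ x, G_minus.T @ x
y = I_plus - I_minus          # equals x @ A, at the cost of 2x columns
```

The scheme doubles the column count, which is exactly the overhead the dense mapping in Section III avoids.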
(Ji et al., 2019) uses another way to solve the negative-value issue in their ReRAM crossbar-based PE (processing element). The positive and negative parts of the matrix are still mapped onto two adjacent columns of a crossbar, but negative voltage inputs are unnecessary in this design. Instead, an extra subtractor component is needed for every pair of positive and negative columns. Another change in this design is that it uses a spiking scheme to improve the output precision, where a train of spikes is used to represent an n-bit number.
In the above mapping schemes, either extra overhead or redundant crossbar area is needed. Moreover, for all the above implementations, a rigorous full circuit simulation with consideration of parasitic resistances on a large enough CNN is still missing.
2.4. Crossbar-based neural network simulators
There exist many crossbar-based simulators for neural network applications that consider parasitic effects, stuck-at faults, thermal noise, and more. However, most of those simulators are not experiment-verified and even oversimplify the non-ideal effects. Table 1 compares state-of-the-art crossbar-based NN simulators. Our simulator is the only circuit simulator that is experimentally verified and works on large-scale datasets and NN models.
MNSIM (Xia et al., 2016) and NeuroSim (Chen et al., 2018) are two well-known ReRAM crossbar simulators. They take many non-ideal effects into consideration, but they do not focus on accurate crossbar computation output due to the lack of full analog circuit simulation. However, they can give a reasonable power/area estimation for crossbar-based designs. Moreover, they are designed for general crossbar-based neural networks and are not optimized for convolutional neural networks.
PytorX (He et al., 2019) and FTNNA (Liu et al., 2019) are designed for neural network applications with consideration of non-ideal effects. They also have the ability to alleviate the impact of those non-ideal effects and have been demonstrated on large datasets. However, such rescue ability comes from the error tolerance of the NN models rather than from the optimization of the simulators, since both need to retrain the model to assuage the non-ideal effects. In practice, we cannot guarantee that all crossbars have the same non-ideal effects, and those effects may vary with crossbar size, the stored conductance matrix, wire resistance, etc. Retraining for every crossbar device is a plausible way to address such non-ideal effects, but it may not always be practical due to time/energy limitations. Moreover, online training for large-scale NNs requires a deep understanding and very accurate modeling of memristor dynamic behaviors, which is still far from sufficient at this moment. Even if we train with existing memristor models, what we can learn from it is questionable, because the models are still very different from realistic device behaviors and do not account for the impact of realistic process variations and noise.
In this work, we developed an experiment-verified simulator to conduct end-to-end SPICE-level analog crossbar circuit simulation with consideration of circuit parasitics and realistic memristor models. Conversion and calibration methods are used to mitigate the parasitic effects in crossbar-based computing. Compared to other simulators, our work not only shows the impact of non-ideal effects on NN model accuracy, but also explains the reasons behind the accuracy degradation by investigating the error propagation in crossbar-based deep CNNs. In this way, our work provides a more solid upper bound for the inference performance of crossbar-based CNNs with respect to different design parameters.
Table 1. Comparison of state-of-the-art crossbar-based NN simulators.

Simulator | MNIST | CIFAR-10 | Non-ideal effects | IR drop | NN type | Method to rescue IR drop
MNSIM (Xia et al., 2016) | ✗ | ✗ | ✗ | ✗ | MLP, CNN | ✗
NeuroSim (Chen et al., 2018) | ✗ | ✗ | ✓ | ✗ | MLP | ✗
PytorX (He et al., 2019) | ✗ | ✗ | ✗ | ✓ | CNN | Retraining
FTNNA (Liu et al., 2019) | ✗ | ✗ | ✓ | ✓ | CNN | Retraining
DPE (Hu et al., 2016) | ✓ | ✓ | ✓ | ✗ | MLP | Conversion
Our simulator | ✓ | ✓ | ✓ | ✓ | CNN | Conversion + Calibration
3. Methodology
There are two challenges in using resistive crossbars for convolution operations. The first challenge is how to transform high-dimensional convolution kernels into 2D matrices for resistive crossbars. The most straightforward way converts convolution kernels into Toeplitz matrices. However, the Toeplitz matrix is sparse, and its sparsity grows significantly with larger kernels. Directly mapping this sparse matrix onto resistive crossbars causes large hardware overhead as well as computing accuracy concerns. The second challenge is how to mitigate parasitic resistances in crossbar-based computing. As introduced before, parasitic resistances in a crossbar cause nonlinear signal degradation at each cross-point device, and failing to consider this impact will cause unacceptable errors in the computing results of large crossbar arrays. To solve both challenges, we present dense mapping, a new mapping method to fully utilize crossbar arrays with near-zero hardware overhead, and a mitigation algorithm to compensate for parasitic resistances in large-scale crossbars.
3.1. Dense mapping for crossbar-based convolution
The critical step in mapping an arbitrary matrix A onto a crossbar is how to deal with negative values in A, as conductance cannot be negative. We first shift all elements in A by a constant c to make sure there is no negative value in A. Such a shift can be easily removed at the final step by subtracting c * sum(X), where sum(X) is the summation of the input vector. By doing so, we do not need to split the matrix into positive and negative sub-matrices, and any VMM can be performed on the crossbar with much less overhead.
When we use the crossbar for VMM computing in a digital circuit environment, ADCs/DACs are necessary to convert input data to analog voltages and output currents to digital form. This process is usually assumed to take 10 ns to 100 ns, which is typically limited by the ADC speed. To get sum(X), an accumulator works in parallel with the DAC-crossbar-ADC datapath at a similar speed. For small or medium-size crossbars, a digital accumulator is more efficient for calculating the summation of the input vectors; for large crossbars, an analog accumulator can be implemented using an additional column in the crossbar array whose devices are programmed to the same conductance and which is sensed by an additional ADC. With either accumulator design, the cost and hardware overhead of calculating sum(X) is much lower than using two crossbars or two devices to track negative values.
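A minimal sketch of this shift trick, with hypothetical matrix values (conductance units and quantization ignored):

```python
import numpy as np

# Shift-based mapping: add a constant c so the stored matrix is all-positive,
# then remove the shift digitally using c * sum(X) from the accumulator.
A = np.array([[ 0.5, -0.2],
              [-0.3,  0.4]])
c = -A.min()                 # smallest shift making every entry non-negative
G = A + c                    # all-positive matrix stored as conductances
x = np.array([1.0, 2.0])
y = x @ G - c * x.sum()      # recovers x @ A with a single crossbar
```

Only one extra scalar multiply-subtract per output is needed, instead of a second crossbar column per matrix column.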
In a CNN, each convolution layer has a 4D kernel. For example, a 3*3*3*16 kernel has 16 sets of 3*3*3 3D kernels, and each 3D kernel contains three 2D kernels for the three channels (RGB) of color images. To perform convolution on a memristor crossbar, we need to convert the high-dimensional convolution into a 2D VMM. It is well known that 2D convolution can be implemented with matrix multiplication by converting the kernel into a Toeplitz matrix. Toeplitz matrices are sparse matrices containing many zeros, so we name this method sparse mapping.
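To make the sparsity concrete, the following sketch (illustrative, not the paper's implementation) expands a 2D kernel into its Toeplitz matrix, so that the flattened input times the matrix equals the valid 2D correlation:

```python
import numpy as np

def kernel_to_toeplitz(k, H, W):
    """Expand a 2D kernel into the (H*W) x (Ho*Wo) Toeplitz matrix T so
    that flattened_input @ T equals the 'valid' 2D correlation."""
    kh, kw = k.shape
    Ho, Wo = H - kh + 1, W - kw + 1
    T = np.zeros((H * W, Ho * Wo))
    for p in range(Ho):            # one column of T per output position
        for q in range(Wo):
            for u in range(kh):
                for v in range(kw):
                    T[(p + u) * W + (q + v), p * Wo + q] = k[u, v]
    return T
```

Even for a 4x4 input and a 2x2 kernel, T is 16x9 with at most 36 of its 144 entries non-zero, and the zero fraction keeps growing with the input size.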
However, sparse mapping has three major issues. First, a memristor crossbar is by nature a dense array structure with positive values, not efficient for sparse matrix implementation. Mapping a sparse matrix onto a crossbar means that the memristors assigned zeros would do nothing but add error to the final result due to leakage current, as well as waste circuit area. Second, sparse mapping requires a huge crossbar array, which is not always feasible and is vulnerable to noise and defects. Third, sparse mapping requires a sizeable peripheral circuit to support the enormous array. Since the peripheral circuit dominates the total power/area, the hardware implementation of sparse mapping would be too costly to afford.
In contrast to sparse mapping, we developed a new method, named dense mapping, targeting the use of small and dense memristor crossbars to implement convolutions efficiently. Fig. 4 illustrates the concept of dense mapping. Each 2D kernel is unrolled into a vector and mapped to one column of the crossbar, so that one crossbar can implement multiple 2D kernels as long as it has enough columns. For the input signal, only data within the convolution window are fed to the row inputs of the crossbar arrays. When the input data has multiple feature channels, each output feature channel needs a 3D convolution kernel. A 3D convolution kernel can be treated as multiple 2D kernels, where every 2D kernel corresponds to one input channel. The output of the 3D convolution is given by performing convolution on all 2D input channels and adding the results together. Thus, we unroll every 2D kernel in the 3D kernel in the same order and cascade all the vectors together to form a single column on the crossbar. An input shift register stores the unrolled input data within the convolution window at the current iteration (multiple input channels cascaded like the kernel on the crossbar) to feed to the crossbar, then updates its storage as the window moves through the entire data space. The convolution results for data within the convolution window are collected at the column outputs of the crossbar. In this way, a single convolution kernel with a stride of one in both horizontal and vertical directions needs (H_d - H_k + 1) * (W_d - W_k + 1) iterations, where H_d/H_k and W_d/W_k are the data/kernel height and width respectively. Since multiple input channels are compressed into a single column on the crossbar, the number of input channels does not affect the iteration count.
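The dense mapping procedure described above can be sketched as a behavioral model (names and shapes are illustrative; each inner-loop step corresponds to one analog crossbar read):

```python
import numpy as np

def dense_map_conv(x, kernels):
    """Dense-mapping sketch: x is a (C, H, W) input, kernels is a
    (C, kh, kw, K) 4D kernel. Each 3D kernel is unrolled into one
    crossbar column; one window position = one crossbar read."""
    C, H, W = x.shape
    Ck, kh, kw, K = kernels.shape
    assert C == Ck
    G = kernels.reshape(C * kh * kw, K)   # the crossbar "conductance" matrix
    Ho, Wo = H - kh + 1, W - kw + 1       # (H_d - H_k + 1) x (W_d - W_k + 1)
    out = np.zeros((K, Ho, Wo))
    for p in range(Ho):                   # iterations: window positions only,
        for q in range(Wo):               # independent of the channel count C
            window = x[:, p:p + kh, q:q + kw].reshape(-1)  # shift register
            out[:, p, q] = window @ G     # one read computes all K kernels
    return out
```

Every device in the small crossbar is used at every iteration, in contrast to the mostly-zero Toeplitz matrix of sparse mapping.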
Comparing dense mapping to sparse mapping is a trade-off between time complexity and space complexity. Sparse mapping uses much more hardware to produce the result in parallel, without iteration. From the data movement perspective, sparse mapping requires the least movement: it uses extra space in the crossbar to not only store the weights but also perform the kernel window movement. However, its efficiency drops exponentially as the data/kernel size scales up, because more devices in the rectangular crossbar are left unused for increasing data/kernel sizes.
Much larger crossbars and more ADCs/DACs are also required when sparse-mapping a large data/kernel. Table 2 compares the overhead of dense and sparse mapping on commonly used data/kernel sizes from ResNet. In Table 2, the crossbar and ADC/DAC parameters are adopted from ISAAC (Shafiee et al., 2016). We exponentially scaled the DAC to 8-bit, then proportionally scaled the area and power for the crossbar, ADC, and DAC respectively. Although one could partition a huge matrix into multiple small matrices (Lin et al., 2019), sparse mapping still needs about 100x the number of DACs and ADCs compared to dense mapping.
Table 2. Overhead of dense vs. sparse mapping for commonly used kernel sizes from ResNet.

Kernel size | Mapping | Crossbar size | Crossbar area | Crossbar power | #ADC | #DAC | Total area | Power
3x3x3x16 | dense | 27 x 16 | 0.06 | 0.000005 | 1 | 27 | 29.06 | 0.001
3x3x3x16 | sparse | 3072 x 14400 | 6480 | 0.54 | 113 | 3072 | 9778 | 0.806
3x3x16x16 | dense | 144 x 16 | 0.34 | 0.000028 | 1 | 144 | 146.34 | 0.007
3x3x16x16 | sparse | 16384 x 14400 | 34560 | 2.88 | 113 | 16384 | 51170 | 3.712
3x3x32x32 | dense | 288 x 32 | 1.35 | 0.000113 | 1 | 288 | 291.35 | 0.014
3x3x32x32 | sparse | 8192 x 6272 | 7526.4 | 0.6272 | 49 | 8192 | 15816.4 | 1.034
3x3x64x64 | dense | 576 x 64 | 5.4 | 0.00045 | 1 | 576 | 583.4 | 0.026
3x3x64x64 | sparse | 4096 x 2304 | 1382.4 | 0.1152 | 18 | 4096 | 5514.4 | 0.311
Therefore, dense mapping is an adequate and more practical method compared to sparse mapping. It not only achieves 100% usage of devices but is also easy to implement and provides sufficient speed for CNN applications. From Fig. 5, one classification inference in ResNet-20 needs 9089 iterations in sequence if no parallel copies of the hardware are used. Note that the summation (sum#) runs in parallel with the convolutions, so it is not counted in the total iterations. Assuming the crossbar runs at 100 MHz (Hu et al., 2016), the convolution part takes only 0.09 ms per classification, which is fast enough for real-time classification requirements.
3.2. Crossbar simulation with parasitic resistances
Fig. 7 shows the circuit structure of the one-transistor-one-memristor (1T1M) crossbar for vector-matrix multiplication (VMM). In this structure, the memristor crossbar is the core component, as it enables analog current weighted summation, which leads to ultra-efficient VMM operation. A memristor crossbar is made of row and column metal wires with memristors formed at the intersecting points. Each memristor can be tuned to an arbitrary conductance within its programmable range. To enable precise, non-disturbing tuning in large crossbar arrays, the 1T1M cell is necessary so that the entire array can be programmed with an arbitrary conductance matrix G.
Besides memristor crossbars, DACs/ADCs are also essential to ensure accurate VMM operation. First, they are necessary to integrate the memristor-based analog VMM module into the digital environment, as functions like pooling, ReLU, and normalization still need digital implementations. Second, a calibration step can be performed at the ADC/DAC stage to further improve the crossbar result. Last but not least, the ADC provides quantization, which helps filter out the output error of memristor-based analog VMM modules and prevents error accumulation across and within layers in CNNs.
Compared to previous simulation work on memristor crossbar arrays, we consider parasitic resistances, including wire resistances, input/output resistances, and intersection resistances, in our simulation model. With consideration of parasitic resistances, we can observe the signal degradation along rows and columns in the array, as shown in Fig. 8. The lower input signals arriving at far-corner memristors also make the column current outputs lower than expected. To keep the actual output close to the ideal output, we adopt an experiment-verified method (Hu et al., 2018). Instead of mapping the conductance matrix G directly onto a crossbar, we map a tweaked conductance matrix G' that accounts for the parasitic resistances in the crossbar. To find G' for a crossbar, we first formulate all KCL equations from the crossbar circuit model to solve all node voltages, including cross-point top nodes, cross-point middle nodes (between memristor and access transistor), cross-point bottom nodes, and boundary nodes at both ends of the horizontal and vertical lines. Then, as the conductance matrix G' contains unknown variables, we add new equations as constraints to force each cross-point to pass the ideal current: I_ideal = V_ideal ∘ G', where I_ideal is the ideal cross-point current of each memristor, V_ideal is the ideal cross-point voltage between top and bottom nodes, and ∘ denotes the Hadamard product, i.e., element-wise multiplication. Finally, with these nonlinear equations, G' can be solved with HSPICE or any nonlinear equation solver. As shown in Fig. 7, the simulation result matches the experimental result of a 128x64 array well.
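As a rough illustration of where the signal degradation comes from, here is a simplified numpy sketch of a crossbar DC solver with wire and input/output resistances. It is not the paper's HSPICE model: it omits the access transistor (and hence the middle node), assumes one wire segment per cross-point, and all resistance values below are hypothetical:

```python
import numpy as np

def solve_crossbar(G, v_in, r_wire=1.0, r_in=1.0, r_out=1.0):
    """DC solve of an m x n resistive crossbar with parasitic resistances.

    Each cross-point has a top node (row wire) and a bottom node (column
    wire); rows are driven from the left through r_in, columns are sensed
    to virtual ground at the bottom through r_out. Returns column currents.
    """
    m, n = G.shape
    N = 2 * m * n
    A = np.zeros((N, N))
    b = np.zeros(N)
    top = lambda i, j: i * n + j          # top-node index of cross-point (i, j)
    bot = lambda i, j: m * n + i * n + j  # bottom-node index of cross-point (i, j)
    gw = 1.0 / r_wire
    for i in range(m):
        for j in range(n):
            r = top(i, j)  # KCL at the top node: net current leaving = 0
            if j == 0:
                A[r, top(i, j)] += 1.0 / r_in   # driven through input resistance
                b[r] += v_in[i] / r_in
            else:
                A[r, top(i, j)] += gw
                A[r, top(i, j - 1)] -= gw
            if j < n - 1:
                A[r, top(i, j)] += gw
                A[r, top(i, j + 1)] -= gw
            A[r, top(i, j)] += G[i, j]          # current through the memristor
            A[r, bot(i, j)] -= G[i, j]
            r = bot(i, j)  # KCL at the bottom node
            if i > 0:
                A[r, bot(i, j)] += gw
                A[r, bot(i - 1, j)] -= gw
            if i < m - 1:
                A[r, bot(i, j)] += gw
                A[r, bot(i + 1, j)] -= gw
            else:
                A[r, bot(i, j)] += 1.0 / r_out  # sensed to virtual ground
            A[r, bot(i, j)] += G[i, j]
            A[r, top(i, j)] -= G[i, j]
    v = np.linalg.solve(A, b)
    return v[N - n:] / r_out  # currents into the column sense circuits
```

With near-zero parasitic resistances the solver reproduces the ideal VMM; with realistic wire resistance every column current falls below its ideal value, which is exactly the deviation the conversion step compensates for by tweaking G into G'.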
3.3. Mitigation algorithm for parasitic resistance
Algorithm 1 summarizes the flow of crossbar-based convolution with the mitigation algorithm, and Table 3 explains the important functions in the algorithm. After initialization, if the kernel is already mapped and converted onto crossbars, the flow jumps directly to the computing step to simulate crossbar-based convolution, so we only need to map the conductance matrix once for CNN inference.
Function | Explanation
CrossbarSim | Experiment-verified crossbar simulator from (Hu et al., 2018).
Conversion | Solve the nonlinear node equations to get G'.
GetCaliPara | Get the 1st-order polynomial fitting result by fitting the crossbar output to the ideal output of the calibration samples.
Calibration | Use the fitting result and sum(X) to map the crossbar output back to the ideal output. sum(X) is needed because in a VMM Y = XA, if A contains negative values, Y can be calculated as Y = X(A + c) - c*sum(X), where c is a large enough scalar to shift A to all-positive.
Data and kernel pattern
Input data has high sparsity after the ReLU layer. Fig. 9 shows the data sparsity at each convolution layer of ResNet-20. The impact of data sparsity should be considered when choosing the conversion signal as well as when gathering calibration samples.
Similarly, we found that kernels in CNNs have different distributions. In Fig. 18 we list three typical kernel types with regard to their weight value distributions. Kernel type 1 refers to a weight distribution close to Gaussian; it usually occurs when the training algorithm puts no particular limitation on weight values, as in ResNet. Kernel type 2 refers to training algorithms that prevent weight values from going near zero (Han et al., 2015). Kernel type 3 refers to Ternary Neural Networks, where weights can only be -1, 0 (sparse), or +1 (Li et al., 2016). It is worth investigating how different kernel types in CNNs impact the quality/precision of crossbar-based convolution.
Optimize conversion signal
To better quantify the computing accuracy, we define the output error and relative error as:

OutputError = (ActualOutput - IdealOutput) / OutputRange
RelativeError = |OutputError|

where ActualOutput is the crossbar output in our simulator, IdealOutput is the expected correct result for the same kernel and input data, and OutputRange is the ideal convolution output range for each kernel. The relative error is the absolute value of the output error, and it can be converted to output bit accuracy as:

BitAccuracy = -log2(RelativeError)
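Under this reading of the definitions, the metrics can be sketched as follows (the helper names are ours, not from the paper):

```python
import numpy as np

def output_error(actual, ideal):
    """Signed output error, normalized by the ideal output range."""
    out_range = ideal.max() - ideal.min()
    return (actual - ideal) / out_range

def bit_accuracy(actual, ideal):
    """Worst-case equivalent output bits: -log2 of the max relative error."""
    rel = np.abs(output_error(actual, ideal))
    return -np.log2(rel.max())
```

For example, a worst-case relative error of 1/256 of the output range corresponds to 8 equivalent output bits.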
The conversion step fine-tunes the crossbar conductance from G to G' to compensate for the parasitic resistances. The original conversion algorithm (Hu et al., 2016) takes the maximum input vector as its conversion signal, which works well with dense matrices and dense input signals. However, in CNNs, we need to consider the sparsity of data/kernels to optimize the conversion signal. By testing different conversion signals across different kernels, we notice that the amplitude of the conversion signal is critical, while the sparsity of the conversion signal is not as important. Fig. 12 shows the relative error distribution with different conversion signal amplitudes. We found that a conversion signal with too large an amplitude (all 1) causes overcompensation of the crossbar conductance matrix due to circuit parasitics, while a too-small conversion signal (all 0.001) does not provide enough compensation; both result in obvious output error. A bad conversion signal may cause a relative error hundreds of times larger than a good one (the all-1 signal versus the all-0.1 signal). For this crossbar size, the all-0.1 signal appears to be the best conversion signal, and other conversion signals close to it generate similar error distributions.
Calibration
In addition to the original conversion algorithm, we add a calibration stage to further improve the result. It randomly picks ten samples from the input data set and runs a 1st-order polynomial fit to map the crossbar output to the ideal output. The generated fitting vector is fixed per crossbar and can easily be embedded in the ADC/DAC configurations. Fig. 12 shows the relative error with different calibration signals; shuffled input patterns achieve the best result among all four sets of calibration signals.
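A minimal sketch of such a calibration stage, with a made-up linear degradation standing in for the real crossbar error:

```python
import numpy as np

# Ten random calibration samples: ideal outputs vs. (toy) degraded crossbar
# outputs; the 0.9x gain and 0.01 offset are hypothetical stand-ins for
# parasitic-resistance effects.
rng = np.random.default_rng(0)
ideal = rng.uniform(0.0, 1.0, size=(10, 8))   # 10 samples x 8 columns
crossbar = 0.9 * ideal + 0.01                 # hypothetical crossbar readout
# 1st-order polynomial fit mapping crossbar output back to ideal output;
# the fixed (slope, offset) pair is then folded into the ADC/DAC config.
slope, offset = np.polyfit(crossbar.ravel(), ideal.ravel(), 1)
calibrated = slope * crossbar + offset        # applied to every later output
```

Because the fit is only first-order, it adds just one multiply and one add per output, which is cheap to absorb into the ADC reference scaling.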
4. Results
4.1. Simulation setup
In this work, convolution and fully-connected (FC) layers are implemented by analog crossbars with digital interfaces. Other functions, such as pooling, ReLU, and batch normalization, are processed by digital circuits. The CNN is trained offline, then its kernels are converted to crossbar conductances for inference. Similar to our previous work (Zhang and Hu, 2018), the crossbar parameters are as follows: lowest resistance = 15 kΩ, highest resistance = 300 kΩ, wire resistance per segment = 1 Ω, and input/output resistance of the crossbar = 1 Ω. The input voltage range is [0, 0.4 V], and the sensing voltage for device conductance is 0.2 V. The CNN framework used in this work is MatConvNet (Vedaldi and Lenc, 2015).
4.2. Individual convolution layer simulation
We first run circuit simulations on individual convolution layers to study the impact of input data sparsity, kernel type, and crossbar size on convolution accuracy.
Fig. 13 shows the output distribution with different input sparsities and different mapping algorithms in a crossbar storing a 3*3*16*16 convolution kernel. Using linear mapping without any consideration of circuit parasitics, the crossbar result deviates far away from the ideal outcome. Our mitigation algorithm outperforms the original conversion algorithm in all cases, and its results overlap well with the trend line across different input sparsities. The original conversion algorithm can force the output close to the ideal output, but as sparsity goes up it loses its ability to amend the result.
Figs. 14–17 illustrate the impact of input data sparsity on different crossbar sizes. There are three observations. First, our method provides 50% better overall accuracy than the original conversion algorithm. Second, our method gives a lower mean relative error when the ideal value is small. Third, our method minimizes the impact of data sparsity compared to the original conversion algorithm. Fig. 18 summarizes the mean/worst relative error across the aforementioned three kernel types with different input sparsities and crossbar sizes. In short, the results show that our method is independent of kernel type and data sparsity, and achieves the best accuracy overall.
4.3. End-to-end CNN circuit simulation
A CNN contains more than one layer, and any error in one layer will propagate and accumulate into the next. It is therefore essential to analyze how error propagates and accumulates in deep neural networks, and to evaluate its impact on the final classification result. For MNIST (Lecun et al., 1998), we trained a CNN with four convolution layers. Table 4 summarizes the kernel information, corresponding crossbar sizes, and the conversion signal for each layer. Note that for larger crossbars, a conversion signal with smaller amplitude tends to ease the calculation of G' and yields better quality (higher VMM accuracy). This is because, in the conversion process, conversion signals are used to initialize the node states in the crossbar simulation; for example, we initialize all top voltage nodes to their row input voltages. A large conversion signal is acceptable for small-to-medium crossbar arrays, since signal degradation along the wires is not significant. In large crossbar arrays, however, the node voltages of devices in the far corners differ significantly from the ideal values due to notable parasitic resistance. Initializing these nodes with ideal values not only makes the solver take more iterations to complete, but may also lead to non-convergence, observed as extremely large or even negative values in G'. In practice, we found that as the crossbar size grows, a lower conversion signal helps the algorithm compute G' quickly within the appropriate range.
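A minimal single-wordline model illustrates the signal degradation that makes ideal-value initialization poor for large arrays. The grounded-bitline approximation and parameter values are illustrative, not the paper's full solver:

```python
import numpy as np

def wordline_node_voltages(v_in, g, r_wire):
    """Solve the top-node voltages along one wordline driven at v_in.

    Each device conductance g[j] sinks current to a grounded bitline, and
    consecutive top nodes are linked by wire resistance r_wire. This is a
    deliberately simplified one-row picture of the crossbar parasitics.
    """
    n = len(g)
    gw = 1.0 / r_wire
    A = np.zeros((n, n))
    b = np.zeros(n)
    for j in range(n):
        A[j, j] = g[j] + gw          # device to ground + left wire segment
        if j > 0:
            A[j, j - 1] = -gw
        if j + 1 < n:
            A[j, j] += gw            # right wire segment
            A[j, j + 1] = -gw
    b[0] = gw * v_in                 # source drives node 0 through one segment
    return np.linalg.solve(A, b)

# 64 low-resistance (15 kOhm) devices, 1 Ohm per wire segment, 0.4 V drive
v = wordline_node_voltages(v_in=0.4, g=np.full(64, 1 / 15e3), r_wire=1.0)
```

The voltages decay monotonically along the line, so far-corner nodes sit below the ideal 0.4 V input, which is why initializing them at the ideal value is a poor starting point for large arrays.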
Fig. 19 shows the error propagation at each convolution layer for MNIST. Different ADC/DAC quantization bit-resolutions are applied to restrict error propagation. We can see that both 6-bit and 8-bit quantization prevent error accumulation after the third layer. Another observation is that no quantization (direct analog forwarding) makes the error even lower than 8-bit quantization.
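The ADC/DAC bit-resolutions swept here correspond to a uniform quantizer, which can be sketched as follows (the range values are illustrative, taken from the input voltage range above):

```python
import numpy as np

def quantize(x, bits, lo, hi):
    """Uniform n-bit ADC/DAC model: clip to [lo, hi], then round to one of
    2**bits evenly spaced levels across that range."""
    levels = 2 ** bits - 1
    x = np.clip(x, lo, hi)
    return lo + np.round((x - lo) / (hi - lo) * levels) / levels * (hi - lo)

x = np.array([0.0, 0.1234, 0.2, 0.4])
y6 = quantize(x, bits=6, lo=0.0, hi=0.4)   # 6-bit version of the signal
```

The worst-case rounding error is half of one step, i.e. (hi - lo) / (2 * (2**bits - 1)), which is what bounds the per-layer error injection at each interface.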
We further tested 4000 images from the MNIST validation dataset with different quantization settings. Testing results are given in Table 6. With 8-bit quantization, the final classification accuracy remains at 98.8%, less than a 1% difference from the ideal (software) result (99.1%).
We further implemented a state-of-the-art CNN, ResNet, on CIFAR10 (Krizhevsky and Hinton, 2009) to explore the impact of error propagation in modern deep neural networks. ResNet20, 32, and 56 all consist of the following types of convolution layers. First, a convolution layer connected to the input has kernel size 3*3*3*16, followed by groups of layers with kernel sizes 3*3*16*16, 3*3*32*32, and 3*3*64*64, respectively. Second, an additional convolution layer is inserted between different kernel sizes to match the matrix dimensions; its size depends on the neighboring convolution layers, but it uses a fixed 1-by-1 convolution window on the feature map. The last layer is the fully connected layer, with kernel size 1*1*64*10. Apart from the final FC layer and the dimension-match layers, the convolution layers in ResNet use the same 3*3 convolution window in every channel. This means that for each channel, only nine memristors in the same column are needed to perform the convolution operation. Due to such small convolution kernels and few feature channels, even in the 56-layer ResNet the largest crossbar needed is the same as in ResNet20 (576*64).
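The kernel-to-crossbar mapping described above can be sketched as an im2col-style reshape. The shapes follow the 3*3*16*16 layer (a 144*16 crossbar); the values are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
K = rng.normal(size=(3, 3, 16, 16))      # H * W * C_in * C_out kernel
G_matrix = K.reshape(3 * 3 * 16, 16)     # 144 rows * 16 columns -> one crossbar

x = rng.normal(size=(8, 8, 16))          # input feature map
patch = x[0:3, 0:3, :].reshape(-1)       # one 3*3*16 window, flattened to 144
y = patch @ G_matrix                     # convolution at this position as a VMM
```

Sliding the 3*3 window over the feature map and repeating the VMM reproduces the full convolution, with each output channel computed by one crossbar column.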
Table 5 gives the convolution kernel sizes, corresponding crossbar sizes, and the calibration signal amplitude for the ResNets. In ResNet, some bypass branches use 1*1 convolutions to match the matrix dimensions. For those dimension-match layers, since the kernels are tiny, we use the 0.1 calibration signal in all cases. Although the crossbars in ResNet are much smaller than in the CNN used for MNIST, the extra layers require more crossbars for inference, so circuit simulation for ResNet is slower than for the MNIST CNN.
Limited by long circuit simulation times (each test image takes about 5 minutes), we use a subset of CIFAR10 containing 150 images as our test set. As Table 6 shows, 8-bit quantization strikes a good balance between error suppression and information preservation, achieving an even slightly better classification result than software. 4-bit quantization lets the error accumulate through all layers and leads to a significant drop in final classification accuracy.
Table 4. MNIST CNN: kernel sizes, crossbar sizes, and conversion signals.

Layer | Kernel Size | Crossbar Size | Conversion signal
Conv1 | 5*5*1*20 | 25*20 | 0.1
Conv2 | 5*5*20*50 | 500*50 | 0.01
Conv3 | 4*4*50*500 | 800*500 | 0.001
Conv4 | 1*1*500*10 | 500*10 | 0.01
Table 5. ResNet layers, crossbar sizes, and calibration signal amplitudes.

ResNet20 | ResNet32 | ResNet56 | Kernel Size | Crossbar Size | Calibration signal
Conv1 | Conv1 | Conv1 | 3*3*3*16 | 27*16 | 0.1
Conv2–7 | Conv2–11 | Conv2–19 | 3*3*16*16 | 144*16 | 0.1
Conv8 | Conv12 | Conv20 | 3*3*16*32 | 144*32 | 0.1
Conv9–13 | Conv13–21 | Conv21–37 | 3*3*32*32 | 288*32 | 0.05
Conv14 | Conv22 | Conv38 | 3*3*32*64 | 288*64 | 0.05
Conv15–19 | Conv23–31 | Conv39–55 | 3*3*64*64 | 576*64 | 0.01
FC | FC | FC | 1*1*64*10 | 64*10 | 0.1
Table 6. Classification accuracy under different ADC/DAC quantization settings.

Network | Software | No quantization | 4 bit | 6 bit | 8 bit
ResNet20 (CIFAR10) | 89.3% | 88.7% | 11.3% | 82.7% | 90.7%
ResNet32 (CIFAR10) | 89.3% | 89.3% | N/A | 82% | 90%
ResNet56 (CIFAR10) | 90% | 88% | N/A | 88% | 89.3%
LeNet (MNIST) | 99.1% | 98.9% | 79.2% | 98.6% | 98.8%
4.4. Impact of programming error
We modelled the programming error as conductance variations following a zero-mean Gaussian distribution with sigma ranging from 0 to 1 µS (Hu et al., 2018). The devices' conductance ranges from 3.3 µS (300 kΩ) to 66.7 µS (15 kΩ). We would like to emphasize that since crossbar-based computing is based on Ohm's law and KCL, the computing result is carried by current. The error current is linearly proportional to the error in conductance, rather than the error in resistance. Thus, a large resistance variation at a high resistance state may not cause a large computing error, because the relative error in conductance is small. Fig. 21 shows how programming error affects the final classification accuracy of ResNet20 on CIFAR10 with different quantization levels. To conduct the simulation, we generate programming error patterns with different sigma settings and add them to the conductance matrices. The calibration step is then performed on the noise-contaminated conductance matrix, updating the fitting parameter P. These configurations are then fixed for all input test data, so every test uses the same crossbar setting.
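The error-injection step can be sketched as follows (sigma in microsiemens; clipping the noisy values back into the physical device range is our assumption, not stated in the text):

```python
import numpy as np

G_MIN, G_MAX = 3.33e-6, 66.7e-6          # device range: 300 kOhm .. 15 kOhm

def add_programming_error(G, sigma_uS, rng):
    """Add zero-mean Gaussian conductance noise and clip to the device range.

    The noise pattern is drawn once per crossbar and then held fixed, mirroring
    the fixed-pattern setup used for all test images.
    """
    noisy = G + rng.normal(0.0, sigma_uS * 1e-6, size=G.shape)
    return np.clip(noisy, G_MIN, G_MAX)

rng = np.random.default_rng(3)
G = rng.uniform(G_MIN, G_MAX, size=(144, 16))       # an example conductance matrix
G_noisy = add_programming_error(G, sigma_uS=0.4, rng=rng)
```

Because the noise is added in conductance, a sigma of 0.4 µS is a small relative error near G_MAX but a large one near G_MIN, matching the Ohm's-law argument above.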
When the programming error sigma < 0.4 µS, ResNet20 still maintains an accuracy higher than 80%. Since 0.4 µS is less than the 6-bit resolution under our configuration, 6-bit quantization sees minimal impact, while the no-quantization setting shows the maximum accuracy degradation. As the programming error sigma increases further, the quantization step can no longer prevent the error from propagating between layers, and the classification accuracy drops dramatically.
5. Conclusions
In this work, we investigated how modern CNNs perform on crossbar-based architectures through end-to-end circuit simulations with careful consideration of parasitic resistances. By studying CNNs layer by layer, we found that CNN characteristics such as data sparsity invalidate existing crossbar-based optimization algorithms. We proposed dense mapping to achieve an efficient convolution-kernel-to-crossbar mapping, and we adapted and improved the original conversion algorithm for CNNs, enabling 0.25% mean relative error (about 8.6 bits) or 1.2% worst-case relative error (about 6.4 bits) for crossbar size . We performed rigorous end-to-end circuit simulation of every convolution layer to give an accurate prediction of error propagation due to analog circuit errors. We find that 8-bit or even 6-bit ADC/DAC is sufficient to prevent error accumulation in deep CNNs of up to 50 layers and maintains the final classification accuracy. Simulation results also show that our method is independent of input data sparsity and kernel type, so it can be applied to general CNNs to improve their accuracy on crossbar-based architectures.
6. Acknowledgement
This project is supported by HUAWEI with confirmation number HIRPO2017050311. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of HUAWEI or its contractors.
Footnotes
 journal: JETC
 journalvolume: SI: Nanoelectronic Device, Circuit, Architecture Design
 article: 1
 ccs: Computing methodologies Neural networks
 ccs: Hardware Emerging simulation
References
 Gina C. Adam, Brian D. Hoskins, Mirko Prezioso, Farnood MerrikhBayat, Bhaswar Chakrabarti, and Dmitri B. Strukov. 2017. 3D Memristor Crossbars for Analog and Neuromorphic Computing Applications. IEEE Transactions on Electron Devices 64, 1 (Jan. 2017), 312–318. https://doi.org/10.1109/TED.2016.2630925
 Sapan Agarwal, Richard L Schiek, and Matthew J Marinella. 2017. Compensating for Parasitic Voltage Drops in Resistive Memory Arrays. In Memory Workshop (IMW), 2017 IEEE International. IEEE, 1–4.
 Vijay Chandrasekhar, Jie Lin, Qianli Liao, Olivier Morère, Antoine Veillard, Lingyu Duan, and Tomaso Poggio. 2017. Compression of Deep Neural Networks for Image Instance Retrieval. arXiv:1701.04923 [cs] (Jan. 2017). http://arxiv.org/abs/1701.04923 arXiv: 1701.04923.
 Jian Chen and Yupin Fong. 1999. High density nonvolatile Flash memory without adverse effects of electric field coupling between adjacent floating gates. US Patent 5,867,429.
 PaiYu Chen, Xiaochen Peng, and Shimeng Yu. 2018. NeuroSim: A CircuitLevel Macro Model for Benchmarking NeuroInspired Architectures in Online Learning. IEEE Transactions on ComputerAided Design of Integrated Circuits and Systems (2018), 1–1. https://doi.org/10.1109/TCAD.2018.2789723
 Albert Ciprut and Eby G Friedman. 2017. Modeling size limitations of resistive crossbar array with cell selectors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25, 1 (2017), 286–293.
 Ligang Gao, PaiYu Chen, and Shimeng Yu. 2016. Demonstration of Convolution Kernel Operation on Resistive CrossPoint Array. IEEE Electron Device Letters 37, 7 (July 2016), 870–873. https://doi.org/10.1109/LED.2016.2573140
 Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep Learning with Limited Numerical Precision. arXiv:1502.02551 [cs, stat] (Feb. 2015). http://arxiv.org/abs/1502.02551 arXiv: 1502.02551.
 Song Han, Huizi Mao, and William J. Dally. 2015. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv:1510.00149 [cs] (Oct. 2015). http://arxiv.org/abs/1510.00149 arXiv: 1510.00149.
 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs] (Dec. 2015). http://arxiv.org/abs/1512.03385 arXiv: 1512.03385.
 Zhezhi He, Jie Lin, Rickard Ewetz, JiannShiun Yuan, and Deliang Fan. 2019. Noise Injection Adaption: EndtoEnd ReRAM Crossbar Nonideal Effect Adaption for Neural Network Mapping. In Proceedings of the 56th Annual Design Automation Conference 2019 (DAC ’19). ACM, New York, NY, USA, Article 57, 6 pages. https://doi.org/10.1145/3316781.3317870
 NhutMinh Ho and WengFai Wong. 2017. Exploiting half precision arithmetic in Nvidia GPUs. IEEE, 1–7. https://doi.org/10.1109/HPEC.2017.8091072
 Miao Hu, Catherine E. Graves, Can Li, Yunning Li, Ning Ge, Eric Montgomery, Noraica Davila, Hao Jiang, R. Stanley Williams, J. Joshua Yang, Qiangfei Xia, and John Paul Strachan. 2018. MemristorBased Analog Computation and Neural Network Classification with a Dot Product Engine. Advanced Materials 30, 9 (March 2018), 1705914. https://doi.org/10.1002/adma.201705914
 Miao Hu, R. Stanley Williams, John Paul Strachan, Zhiyong Li, Emmanuelle M. Grafals, Noraica Davila, Catherine Graves, Sity Lam, Ning Ge, and Jianhua Joshua Yang. 2016. Dotproduct engine for neuromorphic computing: programming 1T1M crossbar to accelerate matrixvector multiplication. ACM Press, 1–6. https://doi.org/10.1145/2897937.2898010
 Yu Ji, Youyang Zhang, Xinfeng Xie, Shuangchen Li, Peiqi Wang, Xing Hu, Youhui Zhang, and Yuan Xie. 2019. FPSA: A Full System Stack Solution for Reconfigurable ReRAMbased NN Accelerator Architecture. arXiv preprint arXiv:1901.09904 (2019).
 A. Krizhevsky and G. Hinton. 2009. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto (2009).
 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 1097–1105. http://papers.nips.cc/paper/4824imagenetclassificationwithdeepconvolutionalneuralnetworks.pdf
 Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (May 2015), 436–444. https://doi.org/10.1038/nature14539
 Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradientbased learning applied to document recognition. Proc. IEEE 86, 11 (Nov 1998), 2278–2324. https://doi.org/10.1109/5.726791
 Can Li, Miao Hu, Yunning Li, Hao Jiang, Ning Ge, Eric Montgomery, Jiaming Zhang, Wenhao Song, Noraica Dávila, Catherine E. Graves, Zhiyong Li, John Paul Strachan, Peng Lin, Zhongrui Wang, Mark Barnell, Qing Wu, R. Stanley Williams, J. Joshua Yang, and Qiangfei Xia. 2018. Analogue signal and image processing with large memristor crossbars. Nature Electronics 1, 1 (Jan. 2018), 52–59. https://doi.org/10.1038/s419280170002z
 Fengfu Li, Bo Zhang, and Bin Liu. 2016. Ternary Weight Networks. arXiv:cs.CV/1605.04711
 Jilan Lin, Zhenhua Zhu, Yu Wang, and Yuan Xie. 2019. Learning the Sparsity for ReRAM: Mapping and Pruning Sparse Neural Network for ReRAM Based Accelerator. In Proceedings of the 24th Asia and South Pacific Design Automation Conference (ASPDAC ’19). ACM, New York, NY, USA, 639–644. https://doi.org/10.1145/3287624.3287715
 Tao Liu, Wujie Wen, Lei Jiang, Yanzhi Wang, Chengmo Yang, and Gang Quan. 2019. A FaultTolerant Neural Network Architecture. In Proceedings of the 56th Annual Design Automation Conference 2019 (DAC ’19). ACM, New York, NY, USA, Article 55, 6 pages. https://doi.org/10.1145/3316781.3317742
 Xiaoxiao Liu, Mengjie Mao, Beiye Liu, Boxun Li, Yu Wang, Hao Jiang, Mark Barnell, Qing Wu, Jianhua Yang, Hai Li, and Yiran Chen. 2016. Harmonica: A Framework of Heterogeneous Computing Systems With MemristorBased Neuromorphic Computing Accelerators. IEEE Transactions on Circuits and Systems I: Regular Papers 63, 5 (May 2016), 617–628. https://doi.org/10.1109/TCSI.2016.2529279
 C. Mei, Z. Liu, Y. Niu, X. Ji, W. Zhou, and D. Wang. 2017. A 200MHZ 202.4GFLOPS@10.8W VGG16 accelerator in Xilinx VX690T. In 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP). 784–788. https://doi.org/10.1109/GlobalSIP.2017.8309067
 P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. S. Modha. 2011. A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm. In 2011 IEEE Custom Integrated Circuits Conference (CICC). 1–4. https://doi.org/10.1109/CICC.2011.6055294
 Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W. Keckler. 2016. vDNN: Virtualized Deep Neural Networks for Scalable, MemoryEfficient Neural Network Design. arXiv:1602.08124 [cs] (Feb. 2016). http://arxiv.org/abs/1602.08124 arXiv: 1602.08124.
 A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar. 2016. ISAAC: A Convolutional Neural Network Accelerator with InSitu Analog Arithmetic in Crossbars. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 14–26. https://doi.org/10.1109/ISCA.2016.12
 Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for LargeScale Image Recognition. arXiv:1409.1556 [cs] (Sept. 2014). http://arxiv.org/abs/1409.1556 arXiv: 1409.1556.
 Dmitri B Strukov, Gregory S Snider, Duncan R Stewart, and R Stanley Williams. 2008. The missing memristor found. Nature 453, 7191 (2008), 80.
 Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. IEEE, 1–9. https://doi.org/10.1109/CVPR.2015.7298594
 Andrea Vedaldi and Karel Lenc. 2015. Matconvnet: Convolutional neural networks for matlab. In Proceedings of the 23rd ACM international conference on Multimedia. ACM, 689–692.
 wikichip. [n.d.]. MLU100  Cambricon. https://en.wikichip.org/wiki/cambricon/mlu/mlu100, accessed 20181031.
 SA Wolf, DD Awschalom, RA Buhrman, JM Daughton, S Von Molnar, ML Roukes, A Yu Chtchelkanova, and DM Treger. 2001. Spintronics: a spinbased electronics vision for the future. Science 294, 5546 (2001), 1488–1495.
 HS Philip Wong, HengYuan Lee, Shimeng Yu, YuSheng Chen, Yi Wu, PangShiu Chen, Byoungil Lee, Frederick T Chen, and MingJinn Tsai. 2012. Metal–oxide RRAM. Proc. IEEE 100, 6 (2012), 1951–1970.
 HS Philip Wong, Simone Raoux, SangBum Kim, Jiale Liang, John P Reifenberg, Bipin Rajendran, Mehdi Asheghi, and Kenneth E Goodson. 2010. Phase change memory. Proc. IEEE 98, 12 (2010), 2201–2227.
 Lixue Xia, Boxun Li, Tianqi Tang, Peng Gu, Xiling Yin, Wenqin Huangfu, PaiYu Chen, Shimeng Yu, Yu Cao, Yu Wang, Yuan Xie, and Huazhong Yang. 2016. MNSIM: Simulation Platform for Memristorbased Neuromorphic Computing System. (2016), 6.
 C. Yakopcic, M. Z. Alom, and T. M. Taha. 2016. Memristor crossbar deep network implementation based on a Convolutional neural network. In 2016 International Joint Conference on Neural Networks (IJCNN). 963–970. https://doi.org/10.1109/IJCNN.2016.7727302
 C. Yakopcic, R. Hasan, and T. M. Taha. 2015. Memristor based neuromorphic circuit for exsitu training of multilayer neural network algorithms. In 2015 International Joint Conference on Neural Networks (IJCNN). 1–7. https://doi.org/10.1109/IJCNN.2015.7280813
 Shimeng Yu. 2018. NeuroInspired Computing With Emerging Nonvolatile Memorys. Proc. IEEE 106, 2 (Feb. 2018), 260–285. https://doi.org/10.1109/JPROC.2018.2790840
 Fan Zhang and Miao Hu. 2018. Memristorbased Deep Convolution Neural Network: A Case Study. arXiv preprint arXiv:1810.02225 (2018).
 Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambriconx: An accelerator for sparse neural networks. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 20.