Mitigate Parasitic Resistance in Resistive Crossbar-based Convolutional Neural Networks
Traditional computing hardware often encounters an on-chip memory bottleneck in large-scale Convolutional Neural Network (CNN) applications. With its unique in-memory computing feature, resistive crossbar-based computing has attracted researchers' attention as a promising solution to the memory bottleneck in von Neumann architectures. However, parasitic resistances in a crossbar cause its behavior to deviate from the ideal weighted summation operation. In large-scale implementations, the impact of parasitic resistances must be carefully considered and mitigated to ensure circuit functionality. In this work, we implement and simulate CNNs on resistive crossbar circuits with consideration of parasitic resistances. Moreover, we propose a new mapping scheme that achieves high utilization of crossbar arrays for convolution, and a mitigation algorithm that compensates for parasitic resistances in CNN applications. The mitigation algorithm considers parasitic resistances as well as the data/kernel patterns of each layer to minimize the computing error in crossbar-based convolutions of CNNs. We demonstrate the proposed methods with implementations of a 4-layer CNN on MNIST, and ResNet (20, 32, and 56) on CIFAR-10. Simulation results show that the proposed methods mitigate the parasitic resistances in crossbars well. With our methods, modern CNNs on crossbars can preserve ideal (software-level) classification accuracy with 6-bit ADCs and DACs.
Convolutional Neural Networks (CNNs) have led to many performance breakthroughs in image classification, video object tracking, and audio processing applications (LeCun et al., 2015). Since AlexNet won ILSVRC 2012 (Krizhevsky et al., 2012), CNNs have evolved into many different models, such as GoogLeNet (Szegedy et al., 2015), VGG (Simonyan and Zisserman, 2014), and ResNet (He et al., 2015). Using Graphics Processing Units (GPUs) and dedicated accelerators to speed up convolutions plays an important role in the success of CNNs, since convolutions dominate the overall computation. Interestingly, in many machine learning applications, the throughput of convolutions takes priority over computing precision, especially for inference (Gupta et al., 2015). Inspired by this feature, Nvidia's recent GPUs (Ho and Wong, 2017), Cambricon-X (Zhang et al., 2016), and many other accelerators (such as Cambricon MLU100 (wikichip, [n.d.]) and a Xilinx FPGA-based accelerator (Mei et al., 2017)) have started to support half-precision floating point or even 8-bit integer precision to accelerate convolutions without performance degradation in many neural networks.
However, modern computing hardware often encounters an on-chip memory bottleneck when dealing with high-volume convolutions in large-scale CNNs (Chandrasekhar et al., 2017; Rhu et al., 2016). State-of-the-art convolutional neural networks require a tremendous number of parameters to handle complex tasks. For example, AlexNet has 2.3 million parameters, VGG has 14.7 million parameters, and ResNet-152 has 25.5 million parameters (Chandrasekhar et al., 2017). Storing this many parameters in limited on-chip caches is infeasible. Therefore, frequent cache flushing and weight loading from off-chip memory (usually DRAM) are usually inevitable, which leads to significant delay and energy cost.
To overcome the memory bottleneck, many researchers have shown interest in resistive crossbar arrays for their computing-in-memory feature (Gao et al., 2016; Adam et al., 2017; Hu et al., 2018; Liu et al., 2016; Yu, 2018; Li et al., 2018). A resistive crossbar is a circuit structure in which vertical and horizontal metal lines sandwich a resistive switching material at their intersections. The cross-point material could be a memristor (Strukov et al., 2008), a phase change memory (PCM) device (Wong et al., 2010), a floating gate (Chen and Fong, 1999), a spintronic device (Wolf et al., 2001), RRAM (Wong et al., 2012), SRAM (Merolla et al., 2011), or any other device with programmable resistance. By utilizing Kirchhoff's current law (KCL) and Ohm's law, an ideal resistive crossbar array can carry out analog vector-matrix multiplication (VMM). The outputs of an analog VMM are the analog output currents from the columns of the crossbar, the inputs are voltage signals applied to the rows, and the weights are stored in a non-volatile manner as conductances at the cross-points. In the inference stage, a VMM can be completed in a single step regardless of its size. Moreover, since weight storage and weighted multiplication/summation both happen in the same place, the crossbar array, it enables ultra-high computing efficiency on multiplications between changing vectors (data) and fixed matrices (weights), which is ideal for implementing ultra-low-power inference functions of neural networks.
However, as the array scales up, circuit parasitics cause the crossbar to deviate from its ideal linear behavior and introduce non-negligible error into the computing result. The impact of circuit parasitics, especially parasitic resistances such as wire resistance and interconnect resistance, has been observed and analyzed in many simulations and experiments (Hu et al., 2016; Ciprut and Friedman, 2017; Agarwal et al., 2017). Currently, the impact of parasitic capacitance and inductance can be ignored since they mainly affect the transient behavior of the crossbar, while in-memory computing mainly depends on the crossbar's DC behavior. It is important to consider parasitic resistance in circuit simulations for practical and functional implementations, especially for neural network applications where many large-scale crossbars are used.
In this paper, we investigate crossbar-based large-scale CNN implementation with consideration of parasitic resistances and provide methods to mitigate their impact on convolution accuracy as well as CNN classification accuracy. Our major contributions include:
First, we invent an efficient implementation method to densely map high dimension 4-D kernels to 2-D crossbar arrays. Using this method, a crossbar designed for vector-matrix multiplication can be easily adapted for convolution with near-zero hardware overhead.
Second, we model the resistive crossbar array with consideration of parasitic resistances and realistic device models. The simulation result is also verified with experimental data for crossbar sizes up to 128×64.
Third, we study the impact of parasitic resistances and provide a mitigation method to minimize computing error due to parasitic resistances as well as data/kernel patterns in CNNs without re-training.
Last but not least, we demonstrate our methods for a 4-layer CNN on MNIST, and ResNet-20, 32, and 56 on CIFAR-10.
Compared with other state-of-the-art crossbar simulators, our work conducts end-to-end full circuit simulation on modern CNN models and large datasets with consideration of circuit parasitics and realistic memristor models. Our work is the first to show how errors propagate in deep CNNs built on realistic crossbars and peripheral circuits (DAC+ADC), due to mixed-signal processing and nonlinear activations (ReLU).
Our result shows that, with the proposed implementation and mitigation methods, 8-bit ADC/DAC resolution may be good enough to preserve the software-level classification accuracy in deep CNNs.
The rest of the paper is organized as follows: Section II covers the background. Section III details the methodology of the implementation and mitigation methods. Section IV gives simulation results on a single crossbar as well as crossbar-based CNNs. Finally, Section V concludes the paper.
2.1. Convolutional Neural Network
Fig.1 shows a typical structure of CNN models. It has three key components: convolutional layers, pooling layers, and fully connected layers (LeCun et al., 2015). Each convolutional layer consists of a set of kernels of identical size, which are used to detect local spatial correlations in input patterns. The output of the convolution is passed to a non-linear activation function, such as ReLU or sigmoid, to form new feature maps. A pooling layer acts as a down-sampling filter for the output of the activation function. It not only reduces the size of the feature map but also merges similar features. As a result, it helps to reduce the number of parameters as well as alleviate over-fitting. The output from the pooling layer then feeds into the next convolutional layer. Fully connected (FC) layers usually appear as the last few layers of a CNN and act as the classifier. They weight every point in the high-level feature maps and generate the final classification result.
In this work, we target the deep residual neural network (ResNet) on resistive crossbar arrays, as it is one of the state-of-the-art CNNs for image classification problems. ResNet was first introduced in ILSVRC 2015 (He et al., 2015). It keeps the benefit of deeper layers while addressing the degradation issue by optimizing the residual mapping with shortcut connections between layers. An ensemble of residual nets with up to 152 layers achieved 3.57% error on the ImageNet test set and won first place in the ILSVRC 2015 classification competition. Fig.3 shows the basic block diagram of ResNet. It combines multiple convolutions, batch normalizations, and rectified linear units (ReLU) as its basic building block. Unlike other CNNs, ResNet uses a shortcut to directly add the input data to the output of a block. Since the two inputs may have different dimensions at the summation stage, a convolution layer is introduced in the shortcut to match their dimensions. The summation result is fed to a ReLU and passed to the next block. At the end of ResNet, a pooling layer, one or more FC layers, and a softmax layer are used in sequence to generate the final classification result. By studying and optimizing resistive crossbars for ResNet, we can gain more insight into the performance of resistive crossbar-based computing on modern CNNs.
2.2. Resistive crossbar circuit for VMM computing
Fig.3 illustrates the general structure of a resistive crossbar array for VMM computing. In an ideal crossbar, when voltage inputs are applied to the rows simultaneously and current outputs are read from the columns, the input-output relationship of the crossbar can be represented as below:
$$I_j = \sum_i V_i G_{ij}$$
In this way, the analog weighted summation is achieved through Kirchhoff's Current Law and Ohm's Law. By mapping the input vector $X$ to the input voltage $V$, the positive matrix $A$ to the conductance $G$, and the output current $I$ back to the output result $Y$, a memristor crossbar can be regarded as an analog VMM module that realizes $Y = XA$ in one step. Note that this relationship is only valid for an ideal crossbar where parasitic resistances can be ignored and device conductance is independent of voltage/current.
In realistic crossbar arrays, circuit parasitics, including wire resistance, input/output resistance, device I-V non-linearity, and parasitic capacitance, cause the crossbar's behavior to deviate from ideal vector-matrix multiplication. For instance, wire resistance degrades the signal along the row and column wires, so a device in the far corner receives a lower signal than expected due to the accumulated wire resistance. Input/output resistances act similarly to wire resistance, but they can cause even larger signal degradation since they are usually caused by pass transistors and are more resistive than wires. Meanwhile, the I-V non-linearity of the cross-point device affects its multiplication accuracy, as its conductance is no longer independent of voltage or current. The impact of capacitance and inductance can be ignored at the current stage since they mainly affect the transient behavior of the crossbar, while we use the crossbar's DC behavior to realize analog VMM at relatively low frequency (10 MHz to 100 MHz).
2.3. Mapping algorithms for crossbar array
When using a crossbar array as the VMM circuit, the first problem is how to map the matrix onto the crossbar. Since the cross-point conductance and the column output current cannot take negative values, the mapping method needs to be able to handle negative matrix elements. One common way is to use positive and negative power supplies, with matrices mapped onto the crossbar array in absolute-value form. For a negative element in the matrix, a negative input voltage is applied to the corresponding position. Recently, Yakopcic et al. proposed a way to implement convolution and CNNs on memristor-based crossbars (Yakopcic et al., 2016; Yakopcic et al., 2015). In their ex-situ process, the convolutional kernel has to be divided into two parts, a positive part and a negative part. All negative values in the positive part are replaced by zero, and those negative values are mapped in the negative part as positive conductances. By providing identical inputs to the negative part but with reversed sign, a memristor-based crossbar can handle a matrix that contains negative values.
(Gao et al., 2016) also maps the absolute value of the matrix onto the crossbar array. However, it needs two adjacent columns of the crossbar to represent a single column of that matrix. One column represents the positive part, and the other represents the negative part. The final output is the difference between the output currents of the two columns.
(Ji et al., 2019) uses another way to solve the negative-value issue in their ReRAM crossbar-based processing element (PE). The positive and negative parts of the matrix are still mapped onto two adjacent columns of a crossbar, but the negative voltage input is unnecessary in this design. Instead, an extra subtractor component is needed for every pair of positive and negative columns. Another change in this design is that it uses a spiking scheme to improve the output precision, in which an n-bit number is represented by a train of spikes.
The above mapping schemes require either extra overhead or redundant crossbar area. Moreover, for all of the above implementations, a rigorous full circuit simulation with consideration of parasitic resistances on a large enough CNN is still missing.
2.4. Crossbar based Neural Network simulator
Many crossbar-based simulators exist for neural network applications that consider parasitic effects, stuck-at faults, thermal noise, and more. However, most of those simulators are not experimentally verified, and some over-simplify the non-ideal effects. Table 1 compares state-of-the-art crossbar-based NN simulators. Our simulator is the only circuit simulator that is experimentally verified and works on large-scale datasets and NN models.
MNSIM (Xia et al., 2016) and NeuroSim (Chen et al., 2018) are two well-known ReRAM crossbar simulators. They take many non-ideal effects into consideration, but they do not focus on accurate crossbar computation output because they lack full analog circuit simulation. However, they can give a reasonable power/area estimation for crossbar-based designs. Moreover, they are designed for general crossbar-based neural networks and are not optimized for convolutional neural networks.
PytorX (He et al., 2019) and FTNNA (Liu et al., 2019) are designed for neural network applications with consideration of non-ideal effects. They are also able to alleviate the impact of those non-ideal effects and have been demonstrated on large datasets. However, this rescue ability comes from the error tolerance of NN models rather than from optimization in the simulators, since both need to re-train the model to offset the non-ideal effects. In practice, we cannot guarantee that all crossbars have the same non-ideal effects, and those effects may vary with crossbar size, stored conductance matrix, wire resistance, and so on. Re-training for every crossbar device is a plausible way to address such non-ideal effects, but it may not always be practical due to time/energy limitations. Moreover, online training for large-scale NNs requires deep understanding and very accurate modeling of memristor dynamic behaviors, which is still far from sufficient at this moment. Even if we train with existing memristor models, what we can learn from it is questionable, because the model is still very different from realistic device behavior and does not account for the impact of realistic process variations and noise.
In this work, we developed an experimentally verified simulator to conduct end-to-end SPICE-level analog crossbar circuit simulation with consideration of circuit parasitics and realistic memristor models. Conversion and calibration methods are used to mitigate the parasitic effects in crossbar-based computing. Compared with other simulators, our work not only shows the impact of non-ideal effects on NN model accuracy, but also explains the reasons behind such accuracy degradation by investigating the error propagation in crossbar-based deep CNNs. In this way, our work provides a more solid upper bound for the inference performance of crossbar-based CNNs with respect to different design parameters.
|MNIST||CIFAR-10||NN type||Non-ideal effects||Method to rescue IR-Drop|
|MNSIM(Xia et al., 2016)||✗||✗||✗||✗||MLP,CNN||
|NeuroSim(Chen et al., 2018)||✗||✗||✓||✗||MLP||
|PytorX(He et al., 2019)||✗||✗||✗||✓||CNN||
|FTNNA(Liu et al., 2019)||✗||✗||✓||✓||CNN||
|DPE(Hu et al., 2016)||✓||✓||✓||✗||MLP||
There are two challenges in using resistive crossbars for convolution operations. The first challenge is how to transform high-dimensional convolution kernels into 2-D matrices for resistive crossbars. The most straightforward way converts convolution kernels to Toeplitz matrices. However, the Toeplitz matrix is sparse, and its sparsity grows significantly with larger kernels. Directly mapping this sparse matrix to resistive crossbars causes large hardware overhead as well as computing accuracy concerns. The second challenge is how to mitigate parasitic resistances in crossbar-based computing. As introduced before, parasitic resistances in a crossbar cause nonlinear signal degradation at each cross-point device, and neglecting their impact leads to unacceptable errors in the computing results of large crossbar arrays. To solve both challenges, we present dense mapping, a new mapping method that fully utilizes crossbar arrays with near-zero hardware overhead, and a mitigation algorithm that compensates for parasitic resistance in large-scale crossbars.
3.1. Dense mapping for crossbar-based convolution
The critical step of mapping an arbitrary matrix $A$ onto a crossbar is how to deal with negative values in $A$, as conductance cannot be negative. We first shift all elements in $A$ by a constant $c$ to make sure that there is no negative value in $A$. Such a shift can be easily removed at the final step by subtracting $c \cdot \mathrm{sum}(X)$, where $\mathrm{sum}(X)$ is the summation of the input. By doing so, we do not need to partition the matrix into positive and negative sub-matrices, and any VMM can be performed on the crossbar with much less overhead.
When we use a crossbar for VMM computing in a digital circuit environment, ADCs/DACs are necessary to convert the input data to analog voltages and the output currents to digital form. This process is usually assumed to take 10 ns to 100 ns, which is usually limited by the ADC speed. To get sum(X), an accumulator works in parallel with the DAC->crossbar->ADC system at a similar speed. For small or medium-size crossbars, a digital accumulator would be more efficient for calculating the summation of input vectors. For large crossbars, an analog accumulator can be implemented by using an additional column in the crossbar array whose devices are programmed to the same conductance and which is sensed by an additional ADC. With either accumulator design, the cost and hardware overhead of calculating sum(X) is much lower than using two crossbars or two devices to track negative values.
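The shift-and-subtract step described above can be checked numerically. The following is a minimal sketch, assuming an ideal crossbar modeled as a plain matrix product; the array shape, the random values, and the names `A_shifted` and `y_restored` are illustrative assumptions, not part of the original design.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(27, 16))        # signed weights (hypothetical values)
X = rng.uniform(0.0, 1.0, size=27)   # non-negative input vector

# Shift every element by a constant c so that no entry is negative.
c = max(0.0, -A.min())
A_shifted = A + c                    # non-negative, so it can be stored as conductance

# Ideal crossbar VMM on the shifted matrix.
y_crossbar = X @ A_shifted

# sum(X) comes from the accumulator running in parallel with the crossbar;
# subtracting c * sum(X) removes the shift from every column output.
y_restored = y_crossbar - c * X.sum()
```

Because every column receives the same offset `c * sum(X)`, a single accumulator output is enough to correct all columns of the crossbar.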
In a CNN, each convolution layer has a 4-D kernel. For example, a 3*3*3*16 kernel consists of 16 sets of 3*3*3 3-D kernels, and each 3-D kernel contains three 2-D kernels for the three channels (RGB) of color images. To perform convolution on a memristor crossbar, we need to convert the high-dimensional convolution into a 2-D VMM. It is well known that 2-D convolution can be implemented using matrix multiplication by converting the kernel to a Toeplitz matrix. Toeplitz matrices are sparse matrices that contain many zeros, so we call this method sparse mapping.
However, sparse mapping has three major issues. First, a memristor crossbar is by nature a dense array structure with positive values, which is not efficient for sparse matrix implementation. Mapping a sparse matrix to a crossbar means that the memristors assigned zero do nothing but add error to the final result through leakage current, and waste circuit area. Second, sparse mapping requires a huge crossbar array, which is not always feasible and is vulnerable to noise and defects. Third, sparse mapping requires a sizeable peripheral circuit to support the enormous array. Since the peripheral circuit dominates the total power/area, the hardware implementation of sparse mapping would be too costly.
In contrast to sparse mapping, we developed a new method, named dense mapping, which uses small and dense memristor crossbars to implement convolutions efficiently. Fig.4 illustrates the concept of dense mapping. Each 2-D kernel is unrolled into a vector and mapped to one column of the crossbar, so that one crossbar can implement multiple 2-D kernels as long as it has enough columns. For the input signal, only the data within the convolution window are fed to the row inputs of the crossbar array. When the input data has multiple feature channels, each output feature channel needs a 3-D convolution kernel. A 3-D convolution kernel can be treated as a stack of 2-D kernels, each of which corresponds to one input channel. The output of the 3-D convolution is obtained by performing the convolution on every 2-D input channel and adding the results together. Thus, we unroll every 2-D kernel in the 3-D kernel in the same order and then cascade all the vectors together to form a single column on a crossbar. An input shift register stores the unrolled input data within the convolution window (with multiple input channels cascaded in the same way as the kernel on the crossbar) at the current iteration to feed the crossbar, and then updates its contents as the window moves through the entire data space. The convolution results for the data within the convolution window are collected at the column outputs of the crossbar. In this way, a single convolution kernel with a stride of one in both the horizontal and vertical directions needs $(H_d - H_k + 1) \times (W_d - W_k + 1)$ iterations, where $H_d$, $W_d$ / $H_k$, $W_k$ are the data/kernel height and width respectively. Since multiple input channels are compressed into a single column on a crossbar, the number of input channels does not affect the iteration count.
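The unrolling and windowing steps above can be sketched as follows. This is a minimal illustration, assuming an ideal crossbar (a plain matrix product), stride 1, no padding, and the cross-correlation form of convolution used in CNN frameworks; the function names `dense_map` and `dense_conv`, the (height, width, channel) data layout, and the random data are assumptions made for this example.

```python
import numpy as np

def dense_map(kernel):
    """Unroll a (kh, kw, c_in, c_out) kernel into a (c_in*kh*kw, c_out) matrix.

    Each column holds one 3-D kernel: its 2-D kernels are unrolled in the
    same order and cascaded channel after channel.
    """
    kh, kw, c_in, c_out = kernel.shape
    return kernel.transpose(2, 0, 1, 3).reshape(c_in * kh * kw, c_out)

def dense_conv(data, kernel):
    """Stride-1, no-padding convolution of (H, W, c_in) data, one window per iteration."""
    weights = dense_map(kernel)
    kh, kw, _, c_out = kernel.shape
    out_h, out_w = data.shape[0] - kh + 1, data.shape[1] - kw + 1
    out = np.empty((out_h, out_w, c_out))
    for r in range(out_h):
        for s in range(out_w):
            window = data[r:r + kh, s:s + kw, :]
            x = window.transpose(2, 0, 1).reshape(-1)  # same order as the columns
            out[r, s] = x @ weights                    # one crossbar VMM per window
    return out

rng = np.random.default_rng(0)
data = rng.uniform(size=(32, 32, 3))        # e.g. one 32x32 RGB image
kernel = rng.normal(size=(3, 3, 3, 16))     # a 3*3*3*16 kernel

weights = dense_map(kernel)
result = dense_conv(data, kernel)
iterations = result.shape[0] * result.shape[1]

# Reference result computed directly from sliding windows.
windows = np.lib.stride_tricks.sliding_window_view(data, (3, 3), axis=(0, 1))
reference = np.einsum("rscab,abco->rso", windows, kernel)
```

For this 32x32x3 input and 3*3*3*16 kernel, the dense matrix is 27 x 16 and the window visits 30 x 30 = 900 positions. A sparse (Toeplitz) mapping of the same layer would need 32*32*3 = 3072 rows and 30*30*16 = 14400 columns, which agrees with the array sizes listed for this kernel in Table 2.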
Comparing dense mapping with sparse mapping reveals a trade-off between time complexity and space complexity. Sparse mapping uses much more hardware to produce the result in parallel without iteration. From the data movement aspect, sparse mapping requires the least movement, since it uses extra space in the crossbar not only to store the weights but also to perform the kernel window movement. However, its efficiency drops rapidly as the data/kernel size scales up, because more devices in the rectangular crossbar are left unused, and a much larger crossbar and more ADCs/DACs are required. Table 2 compares the overhead of dense and sparse mapping on commonly used data/kernel sizes from ResNet. In Table 2, the crossbar and ADC/DAC parameters are adopted from ISAAC (Shafiee et al., 2016). We exponentially scaled the DAC to 8-bit, then proportionally scaled the area and power for the crossbar, ADC, and DAC respectively. Although we could partition a huge matrix into multiple small matrices (Lin et al., 2019), sparse mapping still needs about 100x as many DACs and ADCs as dense mapping.
|3x3x3x16||27 x 16||0.06||0.000005||1||27||29.06||0.001|
|3072 x 14400||6480||0.54||113||3072||9778||0.806|
|3x3x16x16||144 x 16||0.34||0.000028||1||144||146.34||0.007|
|16384 x 14400||34560||2.88||113||16384||51170||3.712|
|3x3x32x32||288 x 32||1.35||0.000113||1||288||291.35||0.014|
|8192 x 6272||7526.4||0.6272||49||8192||15816.4||1.034|
|3x3x64x64||576 x 64||5.4||0.00045||1||576||583.4||0.026|
|4096 x 2304||1382.4||0.1152||18||4096||5514.4||0.311|
Therefore, dense mapping is an adequate and more practical method compared with sparse mapping. It not only achieves 100% device usage but is also easy to implement and provides sufficient speed for CNN applications. From Fig.5, one classification inference in ResNet-20 needs 9089 sequential iterations if no parallel copies of the hardware are used. Note that the summation (sum#) runs in parallel with the convolutions, so it is not counted in the total iterations. Assuming the crossbar runs at 100 MHz (Hu et al., 2016), the convolution part of each classification takes only about 0.09 ms, which is fast enough for real-time classification requirements.
3.2. Crossbar simulation with parasitic resistances
Fig.7 shows the circuit structure of a one-transistor-one-memristor (1T1M) crossbar for vector-matrix multiplication (VMM). In this structure, the memristor crossbar is the core component, as it enables analog current weighted summation, which leads to ultra-efficient VMM operation. A memristor crossbar is made of row and column metal wires with memristors formed at the intersecting points. Each memristor can be tuned to an arbitrary conductance within its programmable range. To enable precise, non-disturbing tuning in large crossbar arrays, the 1T1M cell is necessary for programming the entire array with an arbitrary conductance matrix G.
Besides memristor crossbars, DACs/ADCs are also essential to ensure accurate VMM operation. First, they are necessary for integrating the memristor-based analog VMM module into the digital environment, as functions like pooling, ReLU, and normalization still need digital implementation. Second, the calibration step can be performed at the ADC/DAC stage to further improve the crossbar result. Last but not least, the ADC provides quantization, which helps filter out the output error of memristor-based analog VMM modules and prevents error accumulation within and across layers in CNNs.
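The filtering role of ADC quantization mentioned above can be illustrated with a small sketch. It assumes a uniform ADC over a normalized [0, 1] output range and analog errors smaller than half a quantization step; the function name `adc_quantize`, the 6-bit setting, and the noise model are illustrative assumptions rather than the circuit used in this work.

```python
import numpy as np

def adc_quantize(x, bits, x_min, x_max):
    """Uniform ADC model: clip to [x_min, x_max] and snap to 2**bits levels."""
    levels = 2 ** bits - 1
    step = (x_max - x_min) / levels
    codes = np.clip(np.round((x - x_min) / step), 0, levels)
    return x_min + codes * step

bits = 6
step = 1.0 / (2 ** bits - 1)
rng = np.random.default_rng(0)

# Ideal outputs that already sit on the ADC grid (hypothetical values).
ideal = rng.integers(0, 2 ** bits, size=100) * step
# Analog outputs with errors below half a quantization step.
noisy = ideal + rng.uniform(-0.4, 0.4, size=100) * step

cleaned = adc_quantize(noisy, bits, 0.0, 1.0)
```

In this toy setting every error smaller than half a step is removed by the quantizer, so it is not carried into the next layer. Larger analog errors would flip the output code and survive quantization.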
Compared with previous simulation work on memristor crossbar arrays, we consider parasitic resistances including wire resistances, input/output resistances, and intersection resistances in our simulation model. With consideration of parasitic resistances, we can observe the signal degradation along rows and columns in the array, as shown in Fig.8. The lowered input signals applied to the memristors in the far corner also make the column current output lower than expected. To keep the actual output close to the ideal output, we adopt an experimentally verified method (Hu et al., 2018). Instead of mapping the conductance $G$ to a crossbar, we map a tweaked conductance matrix $G'$ that takes the parasitic resistances in the crossbar into account. To find $G'$ for a crossbar, we first formulate all KCL equations from the crossbar circuit model to solve for all node voltages, including cross-point top nodes, cross-point middle nodes (between memristor and access transistor), cross-point bottom nodes, and boundary nodes on both ends of the horizontal and vertical lines. Then, since the conductance $G'$ introduces additional unknown variables, we add new equations as constraints to force each cross-point to pass the ideal current:
$$I_{cross} = V_{ideal} \circ G$$
where $I_{cross}$ is the cross-point current for each memristor, $V_{ideal}$ is the ideal cross-point voltage between top nodes and bottom nodes, and $\circ$ denotes the Hadamard product, i.e., element-wise multiplication. Finally, with these nonlinear equations, $G'$ can be solved with HSPICE or any nonlinear equation solver. As shown in Fig.7, the simulation result matches the experimental result of a 128×64 array well.
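To make the conversion idea concrete, the following is a simplified sketch rather than the simulator used in this work. It assumes linear devices, omits the access transistors and middle nodes, drives each row from one side, and senses each column at one end through `r_out` to ground. Instead of HSPICE it uses a plain fixed-point iteration to find G', and it ignores the programmable conductance range. The names `crossbar_sim` and `convert`, the 32 x 32 size, and the 0.25 V conversion signal are assumptions for illustration.

```python
import numpy as np

def crossbar_sim(G, v_in, r_wire=1.0, r_in=1.0, r_out=1.0):
    """DC nodal analysis of an M x N crossbar with wire and I/O resistances.

    Returns the column output currents and the voltage drop across each device.
    """
    M, N = G.shape
    top = lambda i, j: i * N + j             # node on the row wire
    bot = lambda i, j: M * N + i * N + j     # node on the column wire
    A = np.zeros((2 * M * N, 2 * M * N))
    b = np.zeros(2 * M * N)

    def connect(p, q, g):                    # stamp a conductance between two nodes
        A[p, p] += g
        A[q, q] += g
        A[p, q] -= g
        A[q, p] -= g

    for i in range(M):
        for j in range(N):
            connect(top(i, j), bot(i, j), G[i, j])            # cross-point device
            if j + 1 < N:
                connect(top(i, j), top(i, j + 1), 1.0 / r_wire)
            if i + 1 < M:
                connect(bot(i, j), bot(i + 1, j), 1.0 / r_wire)
        A[top(i, 0), top(i, 0)] += 1.0 / r_in                 # source behind r_in
        b[top(i, 0)] += v_in[i] / r_in
    for j in range(N):
        A[bot(M - 1, j), bot(M - 1, j)] += 1.0 / r_out        # column end to ground
    v = np.linalg.solve(A, b)
    v_top = v[:M * N].reshape(M, N)
    v_bot = v[M * N:].reshape(M, N)
    return v_bot[M - 1, :] / r_out, v_top - v_bot

def convert(G, v_conv, iters=10):
    """Fixed-point search for G' so that each device passes the ideal current."""
    i_ideal = v_conv[:, None] * G            # ideal currents: V_ideal (Hadamard) G
    G_new = G.copy()
    for _ in range(iters):
        _, dv = crossbar_sim(G_new, v_conv)
        G_new = i_ideal / dv                 # raise conductance where voltage is lost
    return G_new

rng = np.random.default_rng(0)
G = rng.uniform(1 / 300e3, 1 / 15e3, size=(32, 32))   # 300 kOhm to 15 kOhm devices
v_conv = np.full(32, 0.25)                             # assumed conversion signal
G_conv = convert(G, v_conv)

v_test = rng.uniform(0.1, 0.4, size=32)
ideal = v_test @ G
raw, _ = crossbar_sim(G, v_test)
comp, _ = crossbar_sim(G_conv, v_test)
err_raw = np.abs(raw - ideal).mean()
err_comp = np.abs(comp - ideal).mean()
```

In this sketch the uncompensated outputs `raw` fall below the ideal currents, while the outputs of the converted matrix track them more closely for inputs that differ from the conversion signal. The dense solver limits the sketch to small arrays; a sparse solver would be the natural choice for larger ones.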
3.3. Mitigation algorithm for parasitic resistance
Algorithm 1 summarizes the flow of crossbar-based convolution with the mitigation algorithm. Table 3 explains the important functions in the algorithm. After initialization, if the kernel has already been mapped and converted onto crossbars, the algorithm jumps directly to the computing step to simulate crossbar-based convolution. Thus, we only need to map the conductance matrix once for CNN inference.
|CrossbarSim||Experiment verified crossbar simulator from (Hu et al., 2018)|
|Conversion||Solve to get G’.|
|GetCaliPara||Get 1st order poly fitting result by fitting crossbar output to ideal output of calibration samples.|
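|Calibration||Use the fitting result and sum(X) to map the crossbar output to the calibrated output. Here sum(X) is needed because, in the VMM $Y = XA$, if $A$ contains negative values, $Y$ can be calculated by $X(A + c) - c \cdot \mathrm{sum}(X)$, where $c$ is a large enough scalar to shift $A$ to all positive.|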
Data and kernel pattern
Input data has high sparsity after the ReLU layer. Fig. 9 shows the data sparsity at each convolution layer of ResNet-20. The impact of data sparsity should be considered when choosing the conversion signal as well as gathering calibration samples.
Similarly, we found that kernels in CNNs have different distributions. In Fig.18 we list three typical kernel types according to their weight value distributions. Kernel type 1 refers to a weight distribution close to Gaussian. It usually occurs when the training algorithm puts no particular limitation on weight values, as in ResNet. Kernel type 2 refers to training algorithms that prevent weight values from going near zero (Han et al., 2015). Kernel type 3 refers to Ternary Neural Networks, where weights can only be -1, 0 (sparse), or 1 (Li et al., 2016). It is worth investigating how different kernel types in CNNs impact the quality/precision of crossbar-based convolution.
Optimize conversion signal
To better quantify the computing accuracy, we define Output Error and Relative Error as below:
Here, ActualOutput is the crossbar output in our simulator, IdealOutput is the expected correct result for the same kernel and input data, and Output Range is the ideal convolution output range for each kernel. Relative error is the absolute value of output error, and it can be converted to output bit accuracy as below:
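One plausible reading of these metrics can be sketched in code. The exact forms are assumptions: the output error is taken as the signed difference between actual and ideal output normalized by the output range, the relative error as its absolute value, and the bit accuracy as the negative base-2 logarithm of the relative error. The function names and sample values are hypothetical.

```python
import numpy as np

def output_error(actual, ideal, output_range):
    """Signed error normalized by the ideal output range (assumed form)."""
    return (actual - ideal) / output_range

def relative_error(actual, ideal, output_range):
    """Absolute value of the output error."""
    return np.abs(output_error(actual, ideal, output_range))

def bit_accuracy(rel_err):
    """Assumed conversion: an error of 2**-n of the range counts as n bits."""
    return -np.log2(rel_err)

ideal = np.array([0.0, 0.5, 1.0])
actual = np.array([0.0625, 0.5625, 0.9375])   # each off by 1/16 of the range
output_range = ideal.max() - ideal.min()

rel = relative_error(actual, ideal, output_range)
bits = bit_accuracy(rel)
```

Under these assumed definitions, an error of 1/16 of the output range corresponds to 4 bits of accuracy.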
The conversion step fine-tunes the crossbar conductance from G to G' to compensate for the parasitic resistances. The original conversion algorithm (Hu et al., 2016) takes the maximum input vector as its conversion signal, which works well with dense matrices and dense input signals. However, in CNNs, we need to consider the sparsity of the data/kernel to optimize the conversion signal. By testing different conversion signals across different kernels, we notice that the amplitude of the conversion signal is critical, while its sparsity is not as important. Fig.12 shows the relative error distribution with different conversion signal amplitudes in a crossbar. We found that a conversion signal with too large an amplitude (all 1) causes overcompensation of the crossbar conductance matrix, while a conversion signal that is too small (all 0.001) does not provide enough compensation, and both result in obvious output error. A bad conversion signal may cause hundreds of times more error than a good one (the all-1 signal versus the all-0.1 signal). So for this crossbar size, the all-0.1 signal appears to be the best conversion signal, and other conversion signals close to it generate similar error distributions.
In addition to the original conversion algorithm, we add a calibration stage to further improve the result. It randomly picks ten samples from the input data set and runs a first-order polynomial fit to fit the crossbar output to the ideal output. The generated fitting vector is fixed per crossbar and can be easily embedded in the ADC/DAC configurations. Fig.12 shows the relative error with different calibration signals; shuffled input patterns achieve the best result among all four sets of calibration signals.
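The calibration stage can be illustrated with a short sketch. It assumes a toy distortion model (a gain loss, an offset, and small noise) in place of the crossbar simulator, and fits a single gain/offset pair per crossbar with `np.polyfit`; the function `toy_crossbar`, its coefficients, and the sample counts other than the ten calibration samples are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_crossbar(ideal):
    """Hypothetical stand-in for the simulator: gain loss, offset, small noise."""
    return 0.9 * ideal - 0.02 + rng.normal(scale=1e-3, size=ideal.shape)

# Ten calibration samples, each with 16 column outputs.
ideal_cal = rng.uniform(0.0, 1.0, size=(10, 16))
actual_cal = toy_crossbar(ideal_cal)

# First-order polynomial fit that maps crossbar output to ideal output.
gain, offset = np.polyfit(actual_cal.ravel(), ideal_cal.ravel(), 1)

# The fitted pair stays fixed and is applied to later outputs.
ideal_new = rng.uniform(0.0, 1.0, size=(100, 16))
actual_new = toy_crossbar(ideal_new)
calibrated = gain * actual_new + offset

err_before = np.abs(actual_new - ideal_new).mean()
err_after = np.abs(calibrated - ideal_new).mean()
```

Because the correction is a fixed gain and offset, it could be folded into the ADC reference and offset settings, which is consistent with embedding the fitting vector in the ADC/DAC configuration.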
4.1. Simulation setup
In this work, convolution and fully connected (FC) layers are implemented by analog crossbars with a digital interface. Other functions, such as pooling, ReLU, and batch normalization, are processed by digital circuits. The CNN is trained offline, and its kernels are then converted to crossbar conductances for inference. Similar to our previous work (Zhang and Hu, 2018), our crossbar parameters are as follows: lowest resistance = 15 kΩ, highest resistance = 300 kΩ, wire resistance per segment = 1 Ω, and input/output resistance of the crossbar = 1 Ω. The input voltage range is [0, 0.4 V]. The sensing voltage for device conductance is 0.2 V. The CNN framework used in this work is MatConvNet (Vedaldi and Lenc, 2015).
4.2. Individual convolution layer simulation
We first run circuit simulations on individual convolution layers to study the impact of input data sparsity, kernel type, and crossbar size on convolution accuracy.
Fig. 13 shows the output distribution with different input sparsity and different mapping algorithms in a crossbar that stores a 3*3*16*16 convolution kernel. With linear mapping alone, without any consideration of circuit parasitics, the crossbar result always deviates far from the ideal outcome. Our mitigation algorithm performs better than the original conversion algorithm in all cases, and its result stays on the trend line across different input sparsity. The original conversion algorithm can pull the output close to the ideal output, but as the sparsity goes up, it loses its ability to correct the result.
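To see why uncompensated linear mapping deviates from the ideal weighted sum, the following is a minimal nodal-analysis sketch of a crossbar with wire, input, and output resistances. It is a simplified stand-in for a full circuit simulator: the function name `crossbar_currents`, the assumption that each column ends in a virtual ground behind the output resistance, and the 1 ohm default values are all choices made for this example.

```python
import numpy as np

def crossbar_currents(G, v_in, r_wire=1.0, r_in=1.0, r_out=1.0):
    """Column output currents of an n x m crossbar with parasitic resistances.

    G[i, j] is the device conductance at row i, column j (siemens).
    Every cell has a node on its row wire (top) and one on its column
    wire (bottom); neighbouring nodes are joined by r_wire segments.
    """
    n, m = G.shape
    g_w, g_i, g_o = 1.0 / r_wire, 1.0 / r_in, 1.0 / r_out
    size = 2 * n * m
    A = np.zeros((size, size))
    b = np.zeros(size)

    def top(i, j):
        return i * m + j

    def bot(i, j):
        return n * m + i * m + j

    def connect(p, q, g):
        # Stamp a conductance g between nodes p and q (KCL at both nodes).
        A[p, p] += g
        A[q, q] += g
        A[p, q] -= g
        A[q, p] -= g

    for i in range(n):
        for j in range(m):
            connect(top(i, j), bot(i, j), G[i, j])       # device
            if j + 1 < m:
                connect(top(i, j), top(i, j + 1), g_w)   # row wire segment
            if i + 1 < n:
                connect(bot(i, j), bot(i + 1, j), g_w)   # column wire segment
        # Row driver: voltage source v_in[i] behind the input resistance.
        A[top(i, 0), top(i, 0)] += g_i
        b[top(i, 0)] += g_i * v_in[i]
    for j in range(m):
        # Column sensed at the last row through r_out into virtual ground.
        A[bot(n - 1, j), bot(n - 1, j)] += g_o

    v = np.linalg.solve(A, b)
    return np.array([v[bot(n - 1, j)] * g_o for j in range(m)])

rng = np.random.default_rng(0)
G = rng.uniform(1 / 300e3, 1 / 15e3, size=(16, 8))
v_in = np.full(16, 0.4)
ideal = v_in @ G                      # ideal weighted summation
actual = crossbar_currents(G, v_in)   # with parasitic resistances
print(np.abs(actual - ideal) / ideal) # relative error per column
```

The dense solve is only practical for small arrays; larger crossbars would need a sparse or iterative solver.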
Fig. 15 and Fig. 17 illustrate the impact of input data sparsity on crossbars of different sizes. There are three observations. First, our method provides 50% better overall accuracy than the original conversion algorithm. Second, our method gives a lower mean relative error when the ideal value is small. Third, our method minimizes the impact of data sparsity compared to the original conversion algorithm. Fig. 18 summarizes the mean and worst relative error across the aforementioned three kernel types with different input sparsity and crossbar sizes. In short, the results show that our method is independent of kernel type and data sparsity, and achieves the best accuracy overall.
4.3. End-to-end CNN circuit simulation
A CNN contains more than one layer, so any error in one layer propagates and accumulates into the next layer. It is essential to analyze how error propagates and accumulates in deep neural networks, and to evaluate its impact on the final classification result. For MNIST (Lecun et al., 1998), we trained a CNN with four convolution layers. Table 4 summarizes the kernel information, corresponding crossbar sizes, and the conversion signal for each layer. Note that for larger crossbars, a conversion signal with smaller amplitude tends to ease the calculation of G’ and yields a G’ of better quality (higher VMM accuracy). This is because, in the conversion process, the conversion signal is used to initialize the node states in the crossbar simulation. For example, we initialize all top voltage nodes to their row input voltages. A large conversion signal is acceptable for small to medium size crossbar arrays, since the signal degradation along the wires is not too significant. However, in large crossbar arrays, the node voltages of devices in the far corners differ significantly from the ideal values due to notable parasitic resistance. Initializing these nodes with ideal values not only makes the solver take more iterations to complete, but may also lead to non-convergence due to poor initialization, observed as extremely large or even negative values in G’. In practice, we found that as the crossbar size goes up, a lower conversion signal helps the algorithm compute G’ quickly and within the appropriate range.
Fig. 19 shows the error propagation at each convolution layer for MNIST. Different ADC/DAC quantization bit resolutions are applied to restrict error propagation. We can see that both 6-bit and 8-bit quantization prevent error accumulation after the third layer. Another observation is that no quantization (direct analog forwarding) yields an even lower error than 8-bit quantization.
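The effect of an ADC/DAC stage between layers can be sketched as uniform quantization of each layer's output onto a fixed number of levels. The function below is an illustrative assumption (uniform levels over a clipped range, here the stated input voltage range), not the converter model used in the simulations.

```python
import numpy as np

def quantize(x, bits, x_min, x_max):
    """Uniformly quantize x onto 2**bits levels spanning [x_min, x_max].

    Values outside the range are clipped first, mimicking converter
    saturation. The uniform level spacing is an assumption of this sketch.
    """
    levels = 2 ** bits - 1
    step = (x_max - x_min) / levels
    clipped = np.clip(x, x_min, x_max)
    return x_min + np.round((clipped - x_min) / step) * step

signal = np.linspace(0.0, 0.4, 9)
for bits in (4, 6, 8):
    q = quantize(signal, bits, 0.0, 0.4)
    print(bits, np.max(np.abs(q - signal)))  # worst-case rounding error
```

Any analog deviation smaller than half a step is rounded away at each layer boundary, which is one way to read the observation that quantization limits error accumulation.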
We further tested 4000 images from the MNIST validation dataset with different quantization settings. The results are given in Table 6. With 8-bit quantization, the final classification accuracy remains at 98.8%, within 1% of the ideal (software) result (99.1%).
We further implement a state-of-the-art CNN, ResNet, on CIFAR-10 (Krizhevsky and Hinton, 2009) to explore the impact of error propagation in modern deep neural networks. ResNet-20, 32, and 56 all consist of the following types of convolution layers. First, a convolution layer connected to the input has kernel size 3*3*3*16, followed by groups of layers with kernel sizes 3*3*16*16, 3*3*32*32, and 3*3*64*64 respectively. Second, an additional convolution layer is added between groups of different kernel sizes to match the matrix dimensions. Its size depends on the neighboring convolution layers, but it uses a fixed 1 by 1 convolution window on the feature map. The last layer is the fully connected layer, which has kernel size 1*1*64*10. Apart from the last FC layer and the dimension-matching layers, convolution layers in ResNet use the same 3*3 convolution window in every channel. This means that for each channel, only nine memristors in the same column are needed to perform the convolution operation. Because of such small convolution kernels and few feature channels, the largest crossbar needed even in ResNet-56 is the same as in ResNet-20 (576 * 64).
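The relation between kernel size and crossbar size used in this section can be written as a one-line helper. The rule below (rows = window height * window width * input channels, columns = output channels) is inferred from the kernel and crossbar sizes quoted for the MNIST CNN; the function name `crossbar_shape` is hypothetical.

```python
def crossbar_shape(kernel_shape):
    """Return (rows, columns) of the crossbar holding one convolution kernel.

    kernel_shape is (height, width, in_channels, out_channels). Each output
    channel occupies one column; each column stores one flattened filter.
    """
    height, width, in_channels, out_channels = kernel_shape
    return height * width * in_channels, out_channels

resnet_kernels = [(3, 3, 3, 16), (3, 3, 16, 16), (3, 3, 32, 32),
                  (3, 3, 64, 64), (1, 1, 64, 10)]
for shape in resnet_kernels:
    print(shape, "->", crossbar_shape(shape))
```

Under this rule the largest ResNet kernel listed here, 3*3*64*64, needs 576 rows and 64 columns regardless of network depth.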
In Table 5, we give the convolution kernel sizes, corresponding crossbar sizes, and the calibration signal amplitude for the ResNets. In ResNet, some bypass branches use 1*1 convolutions to match the matrix dimensions. For those dimension-matching layers, since the kernels are tiny, we use the 0.1 calibration signal in all cases. Although the crossbars in ResNet are much smaller than those in the CNN we used for MNIST, the extra layers mean that more crossbars are needed for inference, so circuit simulation for ResNet is slower than for the MNIST CNN.
Limited by the long circuit simulation time (each test image takes about 5 minutes), we use a subset of CIFAR-10 containing 150 images as our test set. In Table 6, our experiments show that 8-bit quantization strikes a good balance between error suppression and information preservation, as it achieves an even slightly better classification result than software. 4-bit quantization lets the error accumulate through all layers and leads to a significant drop in final classification accuracy.
| Layer | Kernel size | Crossbar size | Conversion signal |
|---|---|---|---|
| Conv1 | 5*5*1*20 | 25 * 20 | 0.1 |
| Conv2 | 5*5*20*50 | 500 * 50 | 0.01 |
| Conv3 | 4*4*50*500 | 800 * 500 | 0.001 |
| Conv4 | 1*1*500*10 | 500 * 10 | 0.01 |

| Network | Software | Non-quantization | 4 bit | 6 bit | 8 bit |
4.4. Impact of programming error
We model the programming error as conductance variations following zero-mean Gaussian distributions with sigma ranging from 0 to 1 (Hu et al., 2018). The devices’ conductance ranges from 3.3 µS (300 kΩ) to 66.7 µS (15 kΩ). We would like to emphasize that since crossbar-based computing relies on Ohm’s law and KCL, the computing result is carried by current. The error current is linearly proportional to the error in conductance, rather than to the error in resistance. Thus, a large resistance variation at the high resistance state may not cause a large computing error, because the corresponding absolute error in conductance is small. Fig. 21 shows how programming error affects the final classification accuracy of ResNet-20 on CIFAR-10 with different quantization levels. To conduct the simulation, we generate programming error patterns with different sigma settings and add them to the conductance matrices. The calibration step is then performed on the error-contaminated conductance matrix to update the fitting parameter P. These configurations are then fixed for all input test data so that every test input uses the same crossbar setting.
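A minimal sketch of this error-injection step is shown below. It assumes sigma is expressed in the same unit as the device conductance and that perturbed values are clipped back into the device range; both choices, and the function name `add_programming_error`, are assumptions of this example rather than details given in the text.

```python
import numpy as np

G_MIN = 1.0 / 300e3  # about 3.3 microsiemens
G_MAX = 1.0 / 15e3   # about 66.7 microsiemens

def add_programming_error(G, sigma, rng=None, g_min=G_MIN, g_max=G_MAX):
    """Add zero-mean Gaussian noise (std sigma, in siemens) to conductances.

    Clipping to [g_min, g_max] keeps every device physically programmable;
    this clipping is an assumption of the sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = G + rng.normal(0.0, sigma, size=G.shape)
    return np.clip(noisy, g_min, g_max)

rng = np.random.default_rng(0)
G = rng.uniform(G_MIN, G_MAX, size=(576, 64))
for sigma_us in (0.0, 0.2, 0.4, 1.0):
    G_noisy = add_programming_error(G, sigma_us * 1e-6, rng)
    print(sigma_us, np.std(G_noisy - G) * 1e6)  # realized error in microsiemens
```

The conductance-versus-resistance point can be checked by hand: a 10% resistance increase at 300 kΩ changes the conductance by about 0.3 µS, while the same relative change at 15 kΩ changes it by about 6 µS.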
When the programming error sigma < 0.4, ResNet-20 still maintains an accuracy higher than 80%. Since 0.4 is less than the 6-bit resolution under our configuration, the 6-bit quantization setting sees minimal impact, while the non-quantization setting shows the largest accuracy degradation. As the programming error sigma increases, the quantization step can no longer prevent the error from propagating between layers, and the classification accuracy drops dramatically.
In this work, we investigate how modern CNNs perform on crossbar-based architectures using end-to-end circuit simulations with careful consideration of parasitic resistances. By studying CNNs layer by layer, we find that CNN characteristics such as data sparsity invalidate existing crossbar-based optimization algorithms. We propose dense mapping to achieve an efficient mapping from convolution kernels to crossbars. We also adapt and improve the original conversion algorithm for CNNs, which enables 0.25% mean relative error (about 8.6 bits) or 1.2% worst relative error (about 6.4 bits) for crossbar size . We performed a rigorous end-to-end circuit simulation for every convolution layer to give an accurate prediction of error propagation due to analog circuit errors. We find that 8-bit or even 6-bit ADC/DAC quantization is enough to prevent error accumulation in deep CNNs of up to 50 layers and to maintain the final classification accuracy. The simulation results also show that our method is independent of input data sparsity and kernel type. It can be applied to general CNNs to improve their accuracy on crossbar-based architectures.
This project is supported by HUAWEI with confirmation number HIRPO2017050311. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of HUAWEI or its contractors.
- journal: JETC
- journalvolume: SI: Nanoelectronic Device, Circuit, Architecture Design
- article: 1
- ccs: Computing methodologies Neural networks
- ccs: Hardware Emerging simulation
- Gina C. Adam, Brian D. Hoskins, Mirko Prezioso, Farnood Merrikh-Bayat, Bhaswar Chakrabarti, and Dmitri B. Strukov. 2017. 3-D Memristor Crossbars for Analog and Neuromorphic Computing Applications. IEEE Transactions on Electron Devices 64, 1 (Jan. 2017), 312–318. https://doi.org/10.1109/TED.2016.2630925
- Sapan Agarwal, Richard L Schiek, and Matthew J Marinella. 2017. Compensating for Parasitic Voltage Drops in Resistive Memory Arrays. In Memory Workshop (IMW), 2017 IEEE International. IEEE, 1–4.
- Vijay Chandrasekhar, Jie Lin, Qianli Liao, Olivier Morère, Antoine Veillard, Lingyu Duan, and Tomaso Poggio. 2017. Compression of Deep Neural Networks for Image Instance Retrieval. arXiv:1701.04923 [cs] (Jan. 2017). http://arxiv.org/abs/1701.04923 arXiv: 1701.04923.
- Jian Chen and Yupin Fong. 1999. High density non-volatile Flash memory without adverse effects of electric field coupling between adjacent floating gates. US Patent 5,867,429.
- Pai-Yu Chen, Xiaochen Peng, and Shimeng Yu. 2018. NeuroSim: A Circuit-Level Macro Model for Benchmarking Neuro-Inspired Architectures in Online Learning. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2018), 1–1. https://doi.org/10.1109/TCAD.2018.2789723
- Albert Ciprut and Eby G Friedman. 2017. Modeling size limitations of resistive crossbar array with cell selectors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25, 1 (2017), 286–293.
- Ligang Gao, Pai-Yu Chen, and Shimeng Yu. 2016. Demonstration of Convolution Kernel Operation on Resistive Cross-Point Array. IEEE Electron Device Letters 37, 7 (July 2016), 870–873. https://doi.org/10.1109/LED.2016.2573140
- Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep Learning with Limited Numerical Precision. arXiv:1502.02551 [cs, stat] (Feb. 2015). http://arxiv.org/abs/1502.02551 arXiv: 1502.02551.
- Song Han, Huizi Mao, and William J. Dally. 2015. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv:1510.00149 [cs] (Oct. 2015). http://arxiv.org/abs/1510.00149 arXiv: 1510.00149.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs] (Dec. 2015). http://arxiv.org/abs/1512.03385 arXiv: 1512.03385.
- Zhezhi He, Jie Lin, Rickard Ewetz, Jiann-Shiun Yuan, and Deliang Fan. 2019. Noise Injection Adaption: End-to-End ReRAM Crossbar Non-ideal Effect Adaption for Neural Network Mapping. In Proceedings of the 56th Annual Design Automation Conference 2019 (DAC ’19). ACM, New York, NY, USA, Article 57, 6 pages. https://doi.org/10.1145/3316781.3317870
- Nhut-Minh Ho and Weng-Fai Wong. 2017. Exploiting half precision arithmetic in Nvidia GPUs. IEEE, 1–7. https://doi.org/10.1109/HPEC.2017.8091072
- Miao Hu, Catherine E. Graves, Can Li, Yunning Li, Ning Ge, Eric Montgomery, Noraica Davila, Hao Jiang, R. Stanley Williams, J. Joshua Yang, Qiangfei Xia, and John Paul Strachan. 2018. Memristor-Based Analog Computation and Neural Network Classification with a Dot Product Engine. Advanced Materials 30, 9 (March 2018), 1705914. https://doi.org/10.1002/adma.201705914
- Miao Hu, R. Stanley Williams, John Paul Strachan, Zhiyong Li, Emmanuelle M. Grafals, Noraica Davila, Catherine Graves, Sity Lam, Ning Ge, and Jianhua Joshua Yang. 2016. Dot-product engine for neuromorphic computing: programming 1T1M crossbar to accelerate matrix-vector multiplication. ACM Press, 1–6. https://doi.org/10.1145/2897937.2898010
- Yu Ji, Youyang Zhang, Xinfeng Xie, Shuangchen Li, Peiqi Wang, Xing Hu, Youhui Zhang, and Yuan Xie. 2019. FPSA: A Full System Stack Solution for Reconfigurable ReRAM-based NN Accelerator Architecture. arXiv preprint arXiv:1901.09904 (2019).
- A. Krizhevsky and G. Hinton. 2009. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto (2009).
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 1097–1105. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
- Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (May 2015), 436–444. https://doi.org/10.1038/nature14539
- Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (Nov 1998), 2278–2324. https://doi.org/10.1109/5.726791
- Can Li, Miao Hu, Yunning Li, Hao Jiang, Ning Ge, Eric Montgomery, Jiaming Zhang, Wenhao Song, Noraica Dávila, Catherine E. Graves, Zhiyong Li, John Paul Strachan, Peng Lin, Zhongrui Wang, Mark Barnell, Qing Wu, R. Stanley Williams, J. Joshua Yang, and Qiangfei Xia. 2018. Analogue signal and image processing with large memristor crossbars. Nature Electronics 1, 1 (Jan. 2018), 52–59. https://doi.org/10.1038/s41928-017-0002-z
- Fengfu Li, Bo Zhang, and Bin Liu. 2016. Ternary Weight Networks. arXiv:cs.CV/1605.04711
- Jilan Lin, Zhenhua Zhu, Yu Wang, and Yuan Xie. 2019. Learning the Sparsity for ReRAM: Mapping and Pruning Sparse Neural Network for ReRAM Based Accelerator. In Proceedings of the 24th Asia and South Pacific Design Automation Conference (ASPDAC ’19). ACM, New York, NY, USA, 639–644. https://doi.org/10.1145/3287624.3287715
- Tao Liu, Wujie Wen, Lei Jiang, Yanzhi Wang, Chengmo Yang, and Gang Quan. 2019. A Fault-Tolerant Neural Network Architecture. In Proceedings of the 56th Annual Design Automation Conference 2019 (DAC ’19). ACM, New York, NY, USA, Article 55, 6 pages. https://doi.org/10.1145/3316781.3317742
- Xiaoxiao Liu, Mengjie Mao, Beiye Liu, Boxun Li, Yu Wang, Hao Jiang, Mark Barnell, Qing Wu, Jianhua Yang, Hai Li, and Yiran Chen. 2016. Harmonica: A Framework of Heterogeneous Computing Systems With Memristor-Based Neuromorphic Computing Accelerators. IEEE Transactions on Circuits and Systems I: Regular Papers 63, 5 (May 2016), 617–628. https://doi.org/10.1109/TCSI.2016.2529279
- C. Mei, Z. Liu, Y. Niu, X. Ji, W. Zhou, and D. Wang. 2017. A 200MHZ 202.4GFLOPS@10.8W VGG16 accelerator in Xilinx VX690T. In 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP). 784–788. https://doi.org/10.1109/GlobalSIP.2017.8309067
- P. Merolla, J. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. S. Modha. 2011. A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm. In 2011 IEEE Custom Integrated Circuits Conference (CICC). 1–4. https://doi.org/10.1109/CICC.2011.6055294
- Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W. Keckler. 2016. vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design. arXiv:1602.08124 [cs] (Feb. 2016). http://arxiv.org/abs/1602.08124 arXiv: 1602.08124.
- A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar. 2016. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 14–26. https://doi.org/10.1109/ISCA.2016.12
- Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs] (Sept. 2014). http://arxiv.org/abs/1409.1556 arXiv: 1409.1556.
- Dmitri B Strukov, Gregory S Snider, Duncan R Stewart, and R Stanley Williams. 2008. The missing memristor found. Nature 453, 7191 (2008), 80.
- Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. IEEE, 1–9. https://doi.org/10.1109/CVPR.2015.7298594
- Andrea Vedaldi and Karel Lenc. 2015. Matconvnet: Convolutional neural networks for matlab. In Proceedings of the 23rd ACM international conference on Multimedia. ACM, 689–692.
- wikichip. [n.d.]. MLU100 - Cambricon. https://en.wikichip.org/wiki/cambricon/mlu/mlu100, accessed 2018-10-31.
- SA Wolf, DD Awschalom, RA Buhrman, JM Daughton, S Von Molnar, ML Roukes, A Yu Chtchelkanova, and DM Treger. 2001. Spintronics: a spin-based electronics vision for the future. Science 294, 5546 (2001), 1488–1495.
- H-S Philip Wong, Heng-Yuan Lee, Shimeng Yu, Yu-Sheng Chen, Yi Wu, Pang-Shiu Chen, Byoungil Lee, Frederick T Chen, and Ming-Jinn Tsai. 2012. Metal–oxide RRAM. Proc. IEEE 100, 6 (2012), 1951–1970.
- H-S Philip Wong, Simone Raoux, SangBum Kim, Jiale Liang, John P Reifenberg, Bipin Rajendran, Mehdi Asheghi, and Kenneth E Goodson. 2010. Phase change memory. Proc. IEEE 98, 12 (2010), 2201–2227.
- Lixue Xia, Boxun Li, Tianqi Tang, Peng Gu, Xiling Yin, Wenqin Huangfu, Pai-Yu Chen, Shimeng Yu, Yu Cao, Yu Wang, Yuan Xie, and Huazhong Yang. 2016. MNSIM: Simulation Platform for Memristor-based Neuromorphic Computing System. (2016), 6.
- C. Yakopcic, M. Z. Alom, and T. M. Taha. 2016. Memristor crossbar deep network implementation based on a Convolutional neural network. In 2016 International Joint Conference on Neural Networks (IJCNN). 963–970. https://doi.org/10.1109/IJCNN.2016.7727302
- C. Yakopcic, R. Hasan, and T. M. Taha. 2015. Memristor based neuromorphic circuit for ex-situ training of multi-layer neural network algorithms. In 2015 International Joint Conference on Neural Networks (IJCNN). 1–7. https://doi.org/10.1109/IJCNN.2015.7280813
- Shimeng Yu. 2018. Neuro-Inspired Computing With Emerging Nonvolatile Memorys. Proc. IEEE 106, 2 (Feb. 2018), 260–285. https://doi.org/10.1109/JPROC.2018.2790840
- Fan Zhang and Miao Hu. 2018. Memristor-based Deep Convolution Neural Network: A Case Study. arXiv preprint arXiv:1810.02225 (2018).
- Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambricon-x: An accelerator for sparse neural networks. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 20.