Noisy Machines: Understanding noisy neural networks and enhancing robustness to analog hardware errors using distillation


Abstract

The success of deep learning has brought forth a wave of interest in computer hardware design to better meet the high demands of neural network inference. In particular, analog computing hardware has been heavily motivated specifically for accelerating neural networks, based on electronic, optical, or photonic devices, which may well achieve lower power consumption than conventional digital electronics. However, these proposed analog accelerators suffer from the intrinsic noise generated by their physical components, which makes it challenging to achieve high accuracy on deep neural networks. Hence, for successful deployment on analog accelerators, it is essential to be able to train deep neural networks to be robust to random continuous noise in the network weights, which is a somewhat new challenge in machine learning. In this paper, we advance the understanding of noisy neural networks. We outline how a noisy neural network has reduced learning capacity as a result of loss of mutual information between its input and output. To combat this, we propose using knowledge distillation combined with noise injection during training to achieve more noise-robust networks, which is demonstrated experimentally across different networks and datasets, including ImageNet. Our method achieves models with as much as two times greater noise tolerance compared with the previous best attempts, which is a significant step towards making analog hardware practical for deep learning.


1 Introduction

Deep neural networks (DNNs) have achieved unprecedented performance over a wide variety of tasks such as computer vision, speech recognition, and natural language processing. However, DNN inference is typically very demanding in terms of compute and memory resources (Li et al., 2019). Consequently, larger models are often not well suited for large-scale deployment on edge devices, which typically have meagre performance and power budgets, especially battery powered mobile and IoT devices. To address these issues, the design of specialized hardware for DNN inference has drawn great interest, and is an extremely active area of research (Whatmough et al., 2019). To date, a plethora of techniques have been proposed for designing efficient neural network hardware (Sze et al., 2017; Whatmough et al., 2019).

In contrast to the current status quo of predominantly digital hardware, there is significant research interest in analog hardware for DNN inference. In this approach, digital values are represented by analog quantities such as electrical voltages or light pulses, and the computation itself (e.g., multiplication and addition) proceeds in the analog domain, before eventually being converted back to digital. Analog accelerators take advantage of particular efficiencies of analog computation in exchange for losing the bit-exact precision of digital. In other words, analog compute is cheap but somewhat imprecise. Analog computation has been demonstrated in the context of DNN inference in electronic (Binas et al., 2016), photonic (Shen et al., 2017) and optical (Lin et al., 2018) systems. Analog accelerators promise to deliver at least two orders of magnitude better performance than a conventional digital processor for deep learning workloads in both speed (Shen et al., 2017) and energy efficiency (Ni et al., 2017). Electronic analog DNN accelerators are arguably the most mature technology and hence are the focus of this work.

The most common approach to electronic analog DNN acceleration is in-memory computing, which typically uses non-volatile memory (NVM) crossbar arrays to encode the network weights as analog values. The NVM itself can be implemented with memristive devices, such as metal-oxide resistive random-access memory (ReRAM) (Hu et al., 2018) or phase-change memory (PCM) (Le Gallo et al., 2018; Boybat et al., 2018; Ambrogio et al., 2018). The matrix-vector operations computed during inference are then performed in parallel inside the crossbar array, operating on analog quantities for weights and activations. For example, the addition of two quantities encoded as electrical currents can be achieved by simply connecting the two wires together, whereby the currents add linearly according to Kirchhoff’s current law. In this case, there is almost zero latency or energy dissipation for this operation.

Similarly, multiplication with a weight can be achieved by programming the NVM cell conductance to the weight value, which is then used to convert an input activation encoded as a voltage into a scaled current, following Ohm’s law. Therefore, the analog approach promises significantly improved throughput and energy efficiency. However, the analog nature of the weights makes the compute noisy, which can limit inference accuracy. For example, a simple two-layer fully-connected network with a baseline accuracy of on digital hardware, achieves only when implemented on an analog photonic array (Shen et al., 2017). This kind of accuracy degradation is not acceptable for most deep learning applications. Therefore, the challenge of imprecise analog hardware motivates us to study and understand noisy neural networks, in order to maintain inference accuracy under noisy analog computation.

The question of how to effectively learn and compute with a noisy machine is a long-standing problem of interest in machine learning and computer science (Stevenson et al., 1990; Von Neumann, 1956). In this paper, we study noisy neural networks to understand their inference performance. We also demonstrate how to train a neural network with distillation and noise injection to make it more resilient to computation noise, enabling higher inference accuracy for models deployed on analog hardware. We present empirical results that demonstrate state-of-the-art noise tolerance on multiple datasets, including ImageNet.

The remainder of the paper is organized as follows. Section 2 gives an overview of related work. Section 3 outlines the problem statement. Section 4 presents a more formal analysis of noisy neural networks. Section 5 gives a distillation methodology for training noisy neural networks, with experimental results. Finally, Section 6 provides a brief discussion and Section 7 closes with concluding remarks.

2 Related work

Previous work broadly falls under the following categories: studying the effect of analog computation noise, analysis of noise-injection for DNNs, and use of distillation in model training.

Analog Computation Noise Models

In Rekhi et al. (2019), the noise due to analog computation is modeled as additive parameter noise with a zero-mean Gaussian distribution, whose variance is a function of the effective number of bits of the output of an analog computation. Similarly, the authors in Joshi et al. (2019) also model analog computation noise as additive Gaussian noise on the parameters, where the variance is proportional to the range of values that their PCM device can represent. Some published noise models include a more detailed account of device-level interactions, such as the voltage drop across the analog array (Jain et al., 2018; Feinberg et al., 2018), but these are beyond the scope of this paper. In this work, we consider an additive Gaussian noise model on the weights, similar to Rekhi et al. (2019) and Joshi et al. (2019), and present a novel training method that outperforms the previous work in model noise resilience.

Noise Injection for Neural Networks

Several stochastic regularization techniques based on noise injection and dropout (Srivastava et al., 2014; Noh et al., 2017; Li and Liu, 2016) have been demonstrated to be highly effective at reducing overfitting. For generalized linear models, dropout and additive noise have been shown to be equivalent to adaptive regularization to first order (Wager et al., 2013). Training networks with Gaussian noise added to the weights or activations can also increase robustness to a variety of adversarial attacks (Rakin et al., 2018). Bayesian neural networks replace deterministic weights with distributions in order to optimize over the posterior distribution of the weights (Kingma and Welling, 2013). Many of these methods use noise injection at inference time to approximate the weight distribution; in Gal and Ghahramani (2016), a link between Gaussian processes and dropout is established in an effort to model the uncertainty of the output of a network. A theoretical analysis by Stevenson et al. (1990) showed that for neural networks with adaptive linear neurons, the probability of error of a noisy neural network classifier with weight noise increases with the number of layers, but is largely independent of the number of weights per neuron or neurons per layer.

Distillation in Training

Knowledge distillation (Hinton et al., 2015) is a well known technique in which the soft labels produced by a teacher model are used to train a student model which typically has reduced capacity. Distillation has shown merit for improving model performance across a range of scenarios, including student models lacking access to portions of training data (Micaelli and Storkey, 2019), quantized low-precision networks (Polino et al., 2018; Mishra and Marr, 2017), protection against adversarial attacks (Papernot et al., 2016; Goldblum et al., 2019), and in avoiding catastrophic forgetting for multi-task learning (Schwarz et al., 2018). To the best of our knowledge, our work is the first to combine distillation with noise injection in training to enhance model noise robustness.

3 Problem statement

Figure 1: Deploying a neural network layer on an analog in-memory crossbar involves first flattening the filters for a given layer into a weight matrix, which is then programmed into an array of NVM devices that provide differential conductances for analog multiplication. A random Gaussian perturbation on the weights is used to model the inherent imprecision in analog computation.

Without loss of generality, we model a general noisy machine after a simple memristive crossbar array, similar to Shafiee et al. (2016). Figure 1 illustrates how an arbitrary neural network layer, such as a typical convolution, can be mapped to this hardware substrate by first flattening the weights into a single large 2D matrix, and then programming each element of this matrix into a memristive cell in the crossbar array, which provides the required conductance (the reciprocal of resistance) to perform analog multiplication following Ohm’s law. Note that a differential pair of NVM devices is typically used to represent a signed weight. Subsequently, input activations, converted into continuous voltages, are streamed into the array rows from the left-hand side. The memristive devices connect rows with columns: the row voltages are converted into currents scaled by the programmed conductances, generating differential currents that represent both positive and negative quantities with unipolar signals. The currents from each memristive device essentially add up for free where they are connected in the columns, according to Kirchhoff’s current law. Finally, the differential currents are converted to bipolar voltages, which are then digitized before adding biases and performing batch normalization and ReLU operations, which are not shown in Figure 1.

However, the analog inference hardware of Figure 1 is subject to real-world non-idealities, typically attributed to variations in: 1) the manufacturing process, 2) the supply voltage, and 3) temperature, collectively known as PVT variation, all of which result in noise in the system. Below we discuss the two key components in terms of analog noise modeling.

Data Converters.

Digital-to-analog converter (DAC) and analog-to-digital converter (ADC) circuits are designed to be robust to PVT variation, but in practice these effects do degrade the resolution (i.e., the number of bits). Therefore, we consider the effective number of bits (ENOB), which is a lower bound on the resolution in the presence of non-idealities. Hence, we use activation and weight quantization with ENOB data converters and no additional converter noise modeling.

NVM cells.

Due to their analog nature, memristive NVM cells have limited precision, set by the read and write circuitry (Joshi et al., 2019). In between write and read operations, their stored value is prone to drift over time. Long-term drift can be corrected with periodic refresh operations. At shorter timescales, time-varying noise may be encountered. For most of the experiments in this paper, we model generic NVM cell noise as an additive zero-mean i.i.d. Gaussian error term on the weights of the model in each particular layer $l$. This simple model, described more concretely in Section 5, is similar to that used by Joshi et al. (2019), which was verified on real hardware. In addition, we also investigate spatially-varying and time-varying noise models in Section 5.2 (Table 1).
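
As a concrete illustration, the additive per-layer Gaussian weight-noise model used in most of our experiments can be simulated with a few lines of PyTorch. This is a minimal sketch for evaluating a pretrained model under weight noise; the helper name and the choice of referencing the noise standard deviation to each layer's weight range are illustrative assumptions consistent with Section 5.

```python
import copy
import torch

def add_weight_noise(model, eta):
    """Return a copy of `model` whose weights are perturbed by i.i.d. Gaussian noise.
    The noise standard deviation in each layer is referenced to that layer's weight
    range (assumed noise model): sigma_l = eta * (w_max_l - w_min_l)."""
    noisy = copy.deepcopy(model)
    with torch.no_grad():
        for module in noisy.modules():
            weight = getattr(module, "weight", None)
            if weight is None or not torch.is_tensor(weight):
                continue
            w_range = weight.max() - weight.min()
            sigma = eta * w_range
            weight.add_(torch.randn_like(weight) * sigma)
    return noisy
```

Repeating noisy inference with independently sampled copies of the model is how the noisy accuracies and loss terms reported later in the paper can be estimated.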

4 Analysis of noisy neural networks

4.1 Bias variance decomposition for noisy weights

Naively deploying an off-the-shelf pretrained model on a noisy accelerator will yield poor accuracy for a fundamental reason. Consider a neural network $\hat{y}(x; W)$ with weights $W$ that maps an input $x$ to an output $\hat{y}(x; W)$. In the framework of statistical learning, $x$ and the target $y$ are considered to be randomly distributed following a joint probability distribution $p(x, y)$. In a noisy neural network, the weights $W$ are also randomly distributed, with distribution $p(W)$. The expected Mean Squared Error (MSE) of this noisy neural network can be decomposed as

$$\mathbb{E}_{x,y,W}\!\left[\left(\hat{y}(x;W) - y\right)^2\right] = \mathbb{E}_{x,W}\!\left[\left(\hat{y}(x;W) - \mathbb{E}_W[\hat{y}(x;W)]\right)^2\right] + \mathbb{E}_{x,y}\!\left[\left(\mathbb{E}_W[\hat{y}(x;W)] - y\right)^2\right]. \quad (1)$$

The first term on the right hand side of Equation 1 is a variance loss term due to randomness in the weights, which we denote $\mathcal{L}_{\mathrm{var}}$. The second term is a squared bias loss term, which we call $\mathcal{L}_{\mathrm{bias}}$. However, a model is typically trained to minimize the empirical version of the expected loss $\mathcal{L}_0 = \mathbb{E}_{x,y}\!\left[(\hat{y}(x; W_0) - y)^2\right]$ evaluated with the noiseless pretrained weights $W_0$. We assume that the noise is centered, such that the pretrained weights $W_0$ are equal to $\mathbb{E}_W[W]$. A pretrained model is therefore optimized for the wrong loss function when deployed on a noisy accelerator. To show this in a more concrete way, a baseline LeNet model (32 filters in the first convolutional layer, 64 filters in the second convolutional layer and 1024 neurons in the fully-connected layer) (LeCun et al., 1998) is trained on the MNIST dataset and then exposed to Gaussian noise in its weights, so that numerical values of these loss terms can be estimated. The expected value $\mathbb{E}_W[\hat{y}(x)]$ of the network output is estimated by averaging over the outputs of $M$ different noisy instances of the network for the same input $x$. We perform inference on these $M$ instances of the network over $N$ test examples and estimate the loss terms as

$$\hat{\mathbb{E}}_W[\hat{y}(x_n)] = \frac{1}{M}\sum_{m=1}^{M} \hat{y}(x_n; W_m), \quad (2)$$
$$\hat{\mathcal{L}}_{\mathrm{var}} = \frac{1}{NM}\sum_{n=1}^{N}\sum_{m=1}^{M}\left(\hat{y}(x_n; W_m) - \hat{\mathbb{E}}_W[\hat{y}(x_n)]\right)^2, \quad (3)$$
$$\hat{\mathcal{L}}_{\mathrm{bias}} = \frac{1}{N}\sum_{n=1}^{N}\left(\hat{\mathbb{E}}_W[\hat{y}(x_n)] - y_n\right)^2, \quad (4)$$
$$\hat{\mathcal{L}}_{0} = \frac{1}{N}\sum_{n=1}^{N}\left(\hat{y}(x_n; W_0) - y_n\right)^2, \quad (5)$$

where $W_m$ denotes the $m$-th noise realization of the weights and $W_0$ the clean pretrained weights.

The above formulas are for a network with a scalar output. They can be easily extended to the vector output case by averaging over all outputs. In the LeNet example, we take the output of the softmax layer to calculate the squared losses. The noise is assumed i.i.d. Gaussian, centered around zero, with a fixed SNR in each layer $l$. The numerical values of the above losses are estimated using the entire test dataset for different noise levels. Results are shown in Figure 2(a). When there is no noise, $\mathcal{L}_{\mathrm{var}}$ is zero and $\mathcal{L}_{\mathrm{bias}}$ is equal to the pretrained loss $\hat{\mathcal{L}}_0$. However, as the noise level rises, both terms increase in magnitude and become much more important than $\hat{\mathcal{L}}_0$, with $\mathcal{L}_{\mathrm{var}}$ overtaking $\mathcal{L}_{\mathrm{bias}}$ to become the predominant loss term in a noisy LeNet beyond a certain noise level. It is useful to note that $\mathcal{L}_{\mathrm{bias}}$ increases with noise entirely due to the nonlinearity in the network, which is ReLU in the case of LeNet. In a linear model, $\mathcal{L}_{\mathrm{bias}}$ would remain equal to $\hat{\mathcal{L}}_0$, as we would have $\mathbb{E}_W[\hat{y}(x; W)] = \hat{y}(x; \mathbb{E}_W[W]) = \hat{y}(x; W_0)$. A model trained in a conventional manner is thus not optimized for the real loss it is going to encounter on a noisy accelerator. Special retraining is required to improve its noise tolerance. In Figure 2(a), we also show how the model accuracy degrades with a rising noise level for the baseline LeNet and its deeper and wider variants. The deeper network is obtained by stacking two more convolutional layers of width 16 in front of the baseline network, and the wider network is obtained by increasing the widths of each layer in the baseline to 128, 256 and 2048, respectively. Performance degradation due to noise is worse for the deeper variant and less severe for the wider one. A more detailed discussion of the effect of network architecture on performance under noise is offered in Section 4.2.
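
The Monte Carlo estimates above can be computed with a short script. The following is a minimal sketch, assuming a `noisy_forward(x)` helper that performs one forward pass with a freshly sampled noise realization (for example, using the `add_weight_noise` sketch from Section 3) and a `clean_forward(x)` helper that uses the pretrained weights; softmax outputs and one-hot targets are assumed, as in the LeNet example.

```python
import torch

def estimate_loss_terms(noisy_forward, clean_forward, x, y_onehot, num_instances=100):
    """Monte Carlo estimates of the variance term, the squared-bias term, and the
    clean pretrained-model loss used as the normalizer in Figure 2(a)."""
    with torch.no_grad():
        # Stack softmax outputs from `num_instances` independent noise realizations.
        outputs = torch.stack([noisy_forward(x) for _ in range(num_instances)])
        mean_output = outputs.mean(dim=0)                       # estimate of E_W[y_hat(x)]
        l_var = ((outputs - mean_output) ** 2).mean().item()    # Equation 3
        l_bias = ((mean_output - y_onehot) ** 2).mean().item()  # Equation 4
        l_clean = ((clean_forward(x) - y_onehot) ** 2).mean().item()  # Equation 5
    return l_var, l_bias, l_clean
```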

Figure 2: (a) Different loss terms on the test dataset and model test accuracy as a function of the noise standard deviation; the losses are normalized to the pretrained model loss $\hat{\mathcal{L}}_0$, calculated using clean weights. Accuracy is calculated by performing inference multiple times on the test set; error bars show the standard deviation. (b) Estimate of the normalized mutual information between the input and output of the baseline LeNet and its variants as a function of the noise standard deviation. A random subset of 200 training images is used for this estimate, with each inference repeated 100 times on a random realization of the network to estimate $H(Y|X)$. Mutual information decays with rising noise; deeper and narrower networks are more susceptible to this decay.

4.2 Loss of information in a noisy neural network

Information theory offers useful tools to study noise in neural networks. Mutual information $I(X; Y)$ characterizes the amount of information obtained about one random variable $X$ by observing another random variable $Y$. The mutual information between $X$ and $Y$ can be related to Shannon entropy by

$$I(X; Y) = H(Y) - H(Y|X). \quad (6)$$

Mutual information has been used to understand DNNs (Tishby and Zaslavsky, 2015; Saxe et al., 2018). Treating a noisy neural network as a noisy information channel, we can show how information about the input to the neural network diminishes as it propagates through the noisy computation. In this subsection, $X$ is the input to the neural network and $Y$ is the output. Mutual information is estimated for the baseline LeNet model and its variants using Equation 6. When there is no noise, the term $H(Y|X)$ is zero, as $Y$ is deterministic once the input to the network is known; therefore $I(X;Y)$ is just $H(Y)$ in this case. Shannon entropy can be estimated using a standard discrete binning approach (Saxe et al., 2018). In our experiment, $Y$ is the output of the softmax layer, which is a vector of length 10. Entropy is estimated using four bins per coordinate of $Y$ by

$$\hat{H}(Y) = -\sum_{i} p_i \log p_i, \quad (7)$$

where $p_i$ is the probability that an output falls in bin $i$. When noise is introduced to the weights, the conditional entropy $H(Y|X=x)$ is estimated by fixing the input $x$ and performing multiple noisy inferences to calculate $\hat{H}(Y|X=x)$ with the above binning approach. This is then averaged over different inputs $x$ to obtain $\hat{H}(Y|X)$. The estimate is performed for LeNet and its variants at different noise levels. Results are shown in Figure 2(b). The values are normalized to the estimate of $I(X;Y)$ at zero noise. Mutual information between the input and the output decays towards zero with increasing noise in the network weights. Furthermore, mutual information in a deeper and narrower network decays faster than in a shallower and wider network. Intuitively, information from the input undergoes more noisy compute when more layers are added to the network, while a wider network has more redundant paths for the information to flow, thus better preserving it. An information-theoretic bound on mutual information decay as a function of network depth and width in a noisy neural network will be treated in our follow-up work. Overall, noise damages the learning capacity of the network. When the output of the model contains no information about its input, the network loses all ability to learn. For a noise level that is not so extreme, a significant amount of mutual information remains, which indicates that useful learning is possible even with a noisy model.
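
The binning estimator can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: `noisy_forward(x)` draws a fresh noise realization per call and returns a softmax vector as a NumPy array, four bins are used per coordinate, and each distinct vector of bin indices is treated as one discrete symbol.

```python
import numpy as np

def binned_entropy(outputs, bins=4):
    """Discrete entropy (in bits) of softmax outputs after binning each coordinate."""
    binned = np.floor(np.clip(outputs, 0.0, 1.0 - 1e-9) * bins).astype(np.int64)
    _, counts = np.unique(binned, axis=0, return_counts=True)  # distinct binned vectors
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def estimate_mutual_information(noisy_forward, inputs, repeats=100):
    """I(X;Y) ~ H(Y) - H(Y|X), with H(Y|X) averaged over a subset of inputs."""
    all_outputs, conditional_entropies = [], []
    for x in inputs:
        outs = np.stack([noisy_forward(x) for _ in range(repeats)])
        conditional_entropies.append(binned_entropy(outs))      # H(Y | X = x)
        all_outputs.append(outs)
    h_y = binned_entropy(np.concatenate(all_outputs, axis=0))   # marginal H(Y)
    return h_y - float(np.mean(conditional_entropies))
```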

5 Combining noise injection and knowledge distillation

5.1 Methodology

Noise injection during training is one way of exposing network training to a more realistic loss, as randomly perturbing weights simulates what happens on a real noisy analog device and forces the network to adapt to noise during training. Noise injection only happens during forward propagation at training time, which can be considered an approximation for calculating the weight gradients with a straight-through estimator (STE) (Bengio et al., 2013). At each forward pass, the weight of layer $l$ is drawn from an i.i.d. Gaussian distribution $\mathcal{N}(W_l, \sigma_{N,l}^2)$, where $W_l$ denotes the clean weights. The noise is referenced to the range of representable weights in that particular layer,

$$\sigma_{N,l} = \eta \left( W_{\max}^{l} - W_{\min}^{l} \right), \quad (8)$$

where $\eta$ is a coefficient characterizing the noise level. During back propagation, gradients are calculated with the clean weights $W_l$, and only $W_l$ gets updated by applying the gradient. $W_{\max}^{l}$ and $W_{\min}^{l}$ are hyperparameters which can be chosen with information on the weight distributions.
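
In an autograd framework, this "noisy forward pass, clean-weight gradient" behaviour can be obtained by adding the sampled noise as a detached (non-differentiable) perturbation, so that gradients flow to the clean weights. The following is a minimal PyTorch sketch; the symmetric clipping range and the way sigma is referenced to that range follow Equation 8, but the class itself is illustrative rather than the exact implementation used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Linear):
    """Linear layer with Gaussian weight noise injected in the forward pass only."""

    def __init__(self, in_features, out_features, eta=0.02, w_clip=1.0):
        super().__init__(in_features, out_features)
        self.eta = eta        # noise coefficient (Equation 8)
        self.w_clip = w_clip  # weights constrained to [-w_clip, +w_clip]

    def forward(self, x):
        # Keep the clean weights inside the fixed representable range.
        self.weight.data.clamp_(-self.w_clip, self.w_clip)
        sigma = self.eta * (2.0 * self.w_clip)          # referenced to the weight range
        noise = torch.randn_like(self.weight) * sigma   # fresh sample each forward pass
        # Detached noise: gradients are taken with respect to the clean weights.
        return F.linear(x, self.weight + noise.detach(), self.bias)
```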

Knowledge distillation was introduced by Hinton et al. (2015) as a way of training a smaller student model using a larger model as the teacher. For an input $x$ to the neural network, the teacher model generates logits $z_i$, which are then turned into a probability vector $p_i$ by the softmax layer,

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}. \quad (9)$$

The temperature, $T$, controls the softness of the probabilities: the teacher network can generate softer labels for the student network by raising the temperature $T$. We propose to use a noise-free clean model as the teacher to train a noisy student network. The student network is trained with noise injection to match a mix of hard targets and soft targets generated by the teacher. Logits generated by the student network are denoted $v_i$. A loss function with distillation for the student model can be written as

$$\mathcal{L} = \alpha\, \mathcal{H}\!\left(y, \mathrm{softmax}(v)\right) + (1 - \alpha)\, T^2\, \mathcal{H}\!\left(\mathrm{softmax}(z/T), \mathrm{softmax}(v/T)\right) + \lambda \lVert W \rVert_2^2. \quad (10)$$

Here $\mathcal{H}$ is the cross-entropy loss, $y$ is the one-hot encoding of the ground truth, and $\lambda \lVert W \rVert_2^2$ is the $L_2$-regularization term. The parameter $\alpha$ balances the relative strength of the hard and soft targets. We follow the original implementation in Hinton et al. (2015), which includes a factor of $T^2$ in front of the soft-target loss to balance the gradients generated from the different targets. The student model is then trained with Gaussian noise injection using this distillation loss function. Vanilla noise injection training corresponds to the case $\alpha = 1$. If the range of weights is not constrained and the noise reference is fixed, the network soon learns that the most effective way to decrease the loss is to increase the amplitude of the weights, which increases the effective SNR. There are two possible ways to deal with this problem. Firstly, the noise reference could be re-calculated after each weight update, thus updating the noise power. Secondly, we can constrain the range of weights by clipping them to the range $[W_{\min}^{l}, W_{\max}^{l}]$ and use a fixed noise model during training. We found that, in general, the second method of fixing the range of weights and training for a specific noise level yields more stable training and better results. Therefore, this is the training method that we adopt in this paper. A schematic of our proposed method is shown in Figure 5 of the Appendix.
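
A sketch of the combined loss of Equation 10 in PyTorch follows. The convention that $\alpha$ weights the hard-target term matches the equation above; weight decay is assumed to be handled by the optimizer, and the soft-target cross-entropy is implemented as a KL divergence, which has the same gradients with respect to the student logits.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, temperature=4.0):
    """Mix of hard-target cross-entropy and temperature-softened soft-target loss.
    The T^2 factor compensates for the 1/T^2 scaling of soft-target gradients,
    following Hinton et al. (2015)."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    )
    return alpha * hard + (1.0 - alpha) * (temperature ** 2) * soft
```

Setting `alpha=1.0` recovers vanilla noise injection training without distillation.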

During training, a clean model is first trained to its full accuracy, and then weight clipping is applied to constrain the weights to the range $[W_{\min}^{l}, W_{\max}^{l}]$. The specific range is chosen based on statistics of the weights. Fine-tuning is then applied to bring the weight-clipped clean model back to full accuracy. This model is then used as the teacher to generate soft targets. The noisy student network is initialized with the same weights as the teacher, which can be considered a warm start to accelerate retraining. As discussed earlier, the range of weights is fixed during training, and the noise injected into the student model is referenced to this range.
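
The overall retraining recipe can be summarized in a short sketch. The student is assumed to inject weight noise in its own forward pass (for example, by using noisy layers such as the `NoisyLinear` sketch above), and `loss_fn` is a distillation loss such as the one sketched earlier; data loading and optimizer settings are placeholders.

```python
import torch

def retrain_noisy_student(teacher, student, loader, optimizer, loss_fn, epochs):
    """Retrain a warm-started student (initialized from the clipped teacher) with
    noise injected in its forward pass and soft targets from the clean teacher."""
    teacher.eval()
    student.train()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                teacher_logits = teacher(x)   # clean teacher produces soft targets
            student_logits = student(x)       # student forward pass injects weight noise
            loss = loss_fn(student_logits, teacher_logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```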

Our method also supports training low-precision noisy models. Quantization reflects the finite-precision conversion between the analog and digital domains in an analog accelerator. Weights are uniformly quantized in the range $[W_{\min}^{l}, W_{\max}^{l}]$ before being exposed to noise. In a given layer, the input activations are quantized before being multiplied by the noisy weights. The outputs of the matrix multiplication are also quantized before adding biases and performing batch normalization, which are considered to happen in the digital domain. When training with quantization, the straight-through estimator is assumed when calculating gradients with back propagation.
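
Uniform quantization with a straight-through estimator can be sketched as follows; the calibration range `[q_min, q_max]` is assumed to be fixed ahead of training, as described in Section 5.2, and the clipped-gradient variant of the STE shown here is one common choice rather than the only one.

```python
import torch

def quantize_ste(x, q_min, q_max, num_bits=4):
    """Uniformly quantize `x` to 2**num_bits levels over [q_min, q_max].
    Forward pass: quantized values. Backward pass: identity gradient inside the
    clipping range (straight-through estimator), zero gradient outside it."""
    levels = 2 ** num_bits - 1
    scale = (q_max - q_min) / levels
    x_clipped = x.clamp(q_min, q_max)
    x_quant = torch.round((x_clipped - q_min) / scale) * scale + q_min
    # The detached residual carries no gradient, so d(output)/dx follows x_clipped.
    return x_clipped + (x_quant - x_clipped).detach()
```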

5.2 Experimental results

In order to establish the effectiveness of our proposed method, experiments are performed for different networks and datasets. In this section we mainly focus on bigger datasets and models, while results on LeNet and its variants, with some discussion of the effect of network architecture, can be found in Figure 6 of the Appendix. ResNets are a family of convolutional neural networks proposed by He et al. (2016), which have gained great popularity in computer vision applications. In fact, many other deep neural networks also use ResNet-like cells as their building blocks, and ResNets are often used as industry-standard benchmark models to test hardware performance. The first set of experiments we present consists of a ResNet-32 model trained on the CIFAR10 dataset. In order to compare fairly with the previous work, we follow the implementation in Joshi et al. (2019), and consider a ResNet-32(v1) model on CIFAR10 with weight clipping in the range . The teacher model is trained to an accuracy of using stochastic gradient descent with cosine learning rate decay (Loshchilov and Hutter, 2016), and an initial learning rate of (batch size is ). The network is then retrained with noise injection to make it robust against noise. Retraining takes place for epochs; the initial learning rate is and decays with the same cosine profile. We performed two sets of retraining, one without distillation in the loss (), and another with distillation loss (). Everything else was kept equal in these retraining runs. Five different noise levels are tested with five different values of : .

Results are shown in Figure 3(a). Every retraining run was performed twice, and inference was performed times on the test dataset for each model, to generate statistically significant results. The temperature was set to for the runs with distillation; we found that an intermediate temperature between and produces better results. The pretrained model without any retraining performs very poorly at inference time when noise is present. Retraining with Gaussian noise injection can effectively recover some accuracy, which we confirm as reported in Joshi et al. (2019). Our method of combining noise injection with knowledge distillation from the clean model further improves noise resilience by about in terms of , which is an improvement of almost in terms of noise power .

Figure 3: (a) Test accuracy as a function of noise level, here we have , error bars show the standard deviation of different training and inference runs. Our method with distillation achieves the best robustness. (b) Comparison of model performance at noise levels different from the training level.

The actual noise level in a given device can only be estimated, and will vary from one device to another and even fluctuate depending on the physical environment in which it operates (Section 3). Therefore, it is important that any method to enhance noise robustness can tolerate a range of noise levels. Our method offers improved noise robustness even when the actual noise at inference time is different from that injected at training time. As shown in Figure 3(b), the model obtained from distillation is more accurate and less sensitive to noise level differences between training and inference time. This holds for a range of different inference noise levels around the training level. In the previous experiments, we assumed a fixed noise level parameterized by $\eta$. On real analog hardware, there could be additional non-idealities, such as variation in the noise level due to temperature fluctuations and a nonuniform noise profile across NVM cells due to statistical variation in the manufacturing process. We have conducted additional experiments to account for these effects.

Results from these experiments are shown in Table 1. Temporal fluctuation represents noise level variation over time: the noise level is randomly sampled from a Gaussian distribution centered at the nominal $\eta$ for each inference batch, and a noise temporal fluctuation level of, e.g., 10% means that the standard deviation of this distribution is 10% of $\eta$. Spatial noise level fluctuation introduces nonuniform diagonal terms in the noise covariance matrix. More concretely, the noise on each weight in our previous model is multiplied by a scale factor drawn from a Gaussian distribution centered at one, and a noise spatial fluctuation level of, e.g., 10% means that the standard deviation of these scale factors is 0.1. The scale factors are generated and then fixed when the network is instantiated, therefore the noise during network inference is non-i.i.d. in this case. Results from our experiments show that there is no significant deviation when a combination of these non-ideal noise effects is taken into account.
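
The two fluctuation models can be simulated with a few helper functions. This is a sketch under the assumptions just described; `eta` is the nominal noise coefficient, and `temporal_frac` / `spatial_frac` are the fluctuation levels (e.g., 0.1 for 10%).

```python
import torch

def sample_temporal_eta(eta, temporal_frac):
    """Noise level redrawn once per inference batch: eta_t ~ N(eta, (temporal_frac * eta)^2)."""
    return eta + torch.randn(()).item() * (temporal_frac * eta)

def make_spatial_scales(weight_shape, spatial_frac):
    """Per-weight scale factors c ~ N(1, spatial_frac^2), drawn once when the network
    is instantiated and then kept fixed, making the noise non-i.i.d. across cells."""
    return 1.0 + torch.randn(weight_shape) * spatial_frac

def perturb_weights(weight, eta, w_range, scales):
    """Apply spatially scaled Gaussian noise, with sigma referenced to the weight range."""
    sigma = eta * w_range
    return weight + scales * torch.randn_like(weight) * sigma
```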

Noise level 1:

| Method \ Non-ideal fluctuation type | Temporal 10%, Spatial 0% | Temporal 20%, Spatial 0% | Temporal 0%, Spatial 10% | Temporal 0%, Spatial 20% | Temporal 20%, Spatial 20% |
|---|---|---|---|---|---|
| No retraining | 93% +/- 0.14% | 92.98% +/- 0.18% | 92.98% +/- 0.15% | 92.95% +/- 0.15% | 92.94% +/- 0.15% |
| Noise injection | 93.18% +/- 0.13% | 93.03% +/- 0.15% | 93.1% +/- 0.14% | 93.15% +/- 0.15% | 93.11% +/- 0.13% |
| Distillation and noise injection | 93.56% +/- 0.12% | 93.55% +/- 0.11% | 93.55% +/- 0.13% | 93.51% +/- 0.12% | 93.53% +/- 0.12% |

Noise level 2:

| Method \ Non-ideal fluctuation type | Temporal 10%, Spatial 0% | Temporal 20%, Spatial 0% | Temporal 0%, Spatial 10% | Temporal 0%, Spatial 20% | Temporal 20%, Spatial 20% |
|---|---|---|---|---|---|
| No retraining | 90.46% +/- 0.19% | 90.22% +/- 0.27% | 90.5% +/- 0.2% | 90.4% +/- 0.23% | 90.1% +/- 0.3% |
| Noise injection | 91.87% +/- 0.17% | 91.93% +/- 0.2% | 91.91% +/- 0.2% | 91.79% +/- 0.18% | 91.81% +/- 0.17% |
| Distillation and noise injection | 92.83% +/- 0.18% | 92.77% +/- 0.14% | 92.88% +/- 0.14% | 92.89% +/- 0.14% | 92.86% +/- 0.15% |
Table 1: ResNet-32 on CIFAR10 with analog non-idealities: our method of combining distillation and noise injection consistently achieves the best accuracy under different analog non-ideal effects.

The performance of our training method is also validated with quantization. A ResNet-18(v2) model is trained with quantization to 4-bit precision (ENOB) for both weights and activations, corresponding to 4-bit precision conversions between the digital and analog domains. A subset of training data is passed through the full-precision model to calibrate the range for quantization – we choose the and percentiles as and for the quantizer. This quantization range is fixed throughout training. The quantized model achieves an accuracy of on the test dataset when no noise is present. The model is then retrained for noise robustness. The noise level is referenced to the range of quantization of weights in one particular layer, such that and . Results are shown for the same set of $\eta$ values in Figure 4(a). In the distillation retraining runs, the full-precision clean model with an accuracy of is used as the teacher, and the temperature is set to . Due to the extra loss in precision imposed by aggressive quantization, the accuracy of the pretrained quantized model drops sharply with noise. At , the model accuracy drops to without retraining and further down to at . Even retraining with noise injection struggles, and the model retrained with only noise injection achieves an accuracy of at . Our method of combining noise injection and distillation stands out by keeping the accuracy loss within of the baseline up to a noise level of .

Figure 4: (a) Test accuracy as a function of noise level for 4-bit ResNet-18, here we have , error bars show the standard deviation of different training and inference runs. Retraining with distillation and noise injection achieves the best results with quantization. (b) Test accuracy of different models during retraining with noise level .

One interesting aspect of using distillation loss during retraining with noise can be seen in Figure 4(b). The evolution of model accuracy on the test dataset is shown. When no distillation loss is used, the model suffers an accuracy drop (difference between blue and orange curves) around when tested with noise. The drop (difference between green and red curves) is significantly reduced to around when distillation loss is used. This observation indicates that training with distillation favors solutions that are less sensitive to noise. The final model obtained with distillation is actually slightly worse when there is no noise at inference time but becomes superior when noise is present.

Results on the ImageNet dataset for a ResNet-50(v1) network are shown in Table 2 to demonstrate that our proposed approach scales to a large-scale dataset and a deep model. A ResNet-50 model is first trained to an accuracy of with weight clipping in the range . This range is fixed as the reference for added noise. For ResNet-50 on ImageNet, only three different noise levels are explored, and the accuracy degrades very quickly beyond the noise level , as the model and the task are considerably more complex. Retraining runs for epochs with an initial learning rate of and cosine learning rate decay with a batch size of . For distillation, we used and as in previous experiments. Results are collected for two independent training runs in each setting and inference runs over the entire test dataset. The findings confirm that training with distillation and noise injection consistently delivers more noise robust models. The accuracy uplift benefit also markedly increases with noise.

| Training method \ Noise level | Baseline (no noise) | Noise level 1 | Noise level 2 | Noise level 3 |
|---|---|---|---|---|
| No retraining | 74.942% | 72.975% +/- 0.095% | 64.382% +/- 0.121% | 46.284% +/- 0.179% |
| Gaussian noise injection | 74.942% | 73.513% +/- 0.091% | 70.142% +/- 0.129% | 65.285% +/- 0.168% |
| Distillation and noise injection | 74.942% | 74.005% +/- 0.096% | 71.442% +/- 0.111% | 67.525% +/- 0.162% |
Table 2: ResNet-50 on ImageNet at different noise levels, showing the Top-1 accuracy on the test dataset, with no quantization applied. Uncertainty is the standard deviation of different training and inference runs.

6 Discussion

Effects of distillation

Knowledge distillation is a proven technique to transfer knowledge from a larger teacher model to a smaller, lower capacity student model. This paper shows, for the first time, that distillation is also an effective way to transfer knowledge between a clean model and its noisy counterpart, with the novel approach of combining distillation with noise injection during training. We give some intuition for understanding this effect with the help of Section 4.2: a noisy neural network can be viewed as a model with reduced learning capacity by the loss of mutual information argument. Distillation is therefore acting to help reduce this capacity gap.

In our experiments, distillation shows great benefit in helping the network converge to a good solution, even with a high level of noise injected in the forward propagation step. Here, we attempt to explain this effect by the reduced sensitivity of the distillation loss. An influential work by Papernot et al. (2016) showed that distillation can be used to reduce the model's sensitivity with respect to its input perturbations, thus defending against some adversarial attacks. We argue that distillation can achieve a similar effect for the weights of the network. Taking the derivative of the $i$-th output $p_i$ of the student network at temperature $T$ with respect to a weight $w$ yields

$$\frac{\partial p_i}{\partial w} = \frac{1}{T}\, p_i \left( \frac{\partial v_i}{\partial w} - \sum_j p_j \frac{\partial v_j}{\partial w} \right). \quad (11)$$

The $1/T$ scaling makes the output less sensitive to weight perturbations at higher temperatures, thus potentially stabilizing training when noise is injected into the weights during forward propagation. We plan to work on a more formal analysis of this argument in future work.
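
For completeness, here is a short derivation of Equation 11 under the notation used above (a standard softmax-Jacobian calculation): since $p_i = \exp(v_i/T) / \sum_j \exp(v_j/T)$, we have $\partial p_i / \partial v_j = \frac{1}{T}\, p_i (\delta_{ij} - p_j)$, and applying the chain rule over the student logits gives

$$\frac{\partial p_i}{\partial w} = \sum_j \frac{\partial p_i}{\partial v_j}\, \frac{\partial v_j}{\partial w} = \frac{1}{T}\, p_i \left( \frac{\partial v_i}{\partial w} - \sum_j p_j \frac{\partial v_j}{\partial w} \right),$$

so every weight-sensitivity term of the softened output carries an explicit $1/T$ factor.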

Hardware Performance Benefits

The improvements in noise tolerance of neural networks demonstrated in this work have a potential impact on the design of practical analog hardware accelerators for neural network inference. Increased robustness to noisy computation at the model training level potentially means that the specification of the analog hardware can be relaxed. In turn, this can make it easier to achieve the hardware specification, or even allow optimizations to further reduce the energy consumption. An in-depth discussion of the trade-off between compute noise performance and hardware energy dissipation is beyond the scope of this paper, but we refer the interested reader to Rekhi et al. (2019) for more details. In summary, we believe that machine learning research will be a key enabler for practical analog hardware accelerators.

7 Conclusion

Analog hardware holds the potential to significantly reduce the latency and energy consumption of neural network inference. However, analog hardware is imprecise and introduces noise during computation that limits accuracy in practice. This paper explored the training of noisy neural networks, which suffer from reduced capacity leading to accuracy loss. We propose a training methodology that trains neural networks via distillation and noise injection to increase the accuracy of models under noisy computation. Experimental results across a range of models and datasets, including ImageNet, demonstrate that this approach can almost double the network noise tolerance compared with the previous best reported values, without any changes to the model itself beyond the training method. With these improvements in the accuracy of noisy neural networks, we hope to enable the implementation of analog inference hardware in the near future.

Appendix A Appendix

Figure 5: Schematic of our retraining method combining distillation and noise injection.
Figure 6: Results on LeNet and its variants show that our method of combining distillation and noise injection improves noise robustness for different model architectures on MNIST. The benefit of our method is most significant when the network struggles to learn with the vanilla noise injection retraining method. This threshold noise level depends on the network architecture, as we have remarked for mutual information decay.

References

  1. Equivalent-accuracy accelerated neural-network training using analogue memory. Nature 558 (7708), pp. 60.
  2. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
  3. Precise deep neural network computation on imprecise low-power analog hardware. arXiv preprint arXiv:1606.07786.
  4. Neuromorphic computing with multi-memristive synapses. Nature Communications 9 (1), pp. 2514.
  5. Making memristive neural network accelerators reliable. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 52–65.
  6. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
  7. Adversarially robust distillation. CoRR abs/1905.09747.
  8. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  9. Distilling the knowledge in a neural network. Neural Information Processing Systems.
  10. Memristor-based analog computation and neural network classification with a dot product engine. Advanced Materials 30 (9).
  11. Rx-Caffe: framework for evaluating and training deep neural networks on resistive crossbars. arXiv preprint arXiv:1809.00072.
  12. Accurate deep neural network inference using computational phase-change memory. arXiv preprint arXiv:1906.03138.
  13. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  14. Mixed-precision in-memory computing. Nature Electronics 1 (4), pp. 246.
  15. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
  16. On-chip memory technology design space explorations for mobile deep neural network accelerators. In 2019 56th ACM/IEEE Design Automation Conference (DAC), pp. 1–6.
  17. Whiteout: Gaussian adaptive noise regularization in deep neural networks. arXiv preprint arXiv:1612.01490.
  18. All-optical machine learning using diffractive deep neural networks. Science 361 (6406), pp. 1004–1008.
  19. SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
  20. Zero-shot knowledge transfer via adversarial belief matching. Proceedings of Machine Learning Research.
  21. Apprentice: using knowledge distillation techniques to improve low-precision network accuracy. arXiv preprint arXiv:1711.05852.
  22. An energy-efficient digital ReRAM-crossbar-based CNN with bitwise parallelism. IEEE Journal on Exploratory Solid-State Computational Devices and Circuits 3, pp. 37–46.
  23. Regularizing deep neural networks by noise: its interpretation and optimization. In Advances in Neural Information Processing Systems, pp. 5109–5118.
  24. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597.
  25. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668.
  26. Parametric noise injection: trainable randomness to improve deep neural network robustness against adversarial attack. arXiv preprint arXiv:1811.09310.
  27. Analog/mixed-signal hardware error modeling for deep learning inference. In Proceedings of the 56th Annual Design Automation Conference 2019, pp. 81.
  28. On the information bottleneck theory of deep learning.
  29. Progress & compress: a scalable framework for continual learning. arXiv preprint arXiv:1805.06370.
  30. ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Computer Architecture News 44 (3), pp. 14–26.
  31. Deep learning with coherent nanophotonic circuits. Nature Photonics 11 (7), pp. 441.
  32. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
  33. Sensitivity of feedforward neural networks to weight errors. IEEE Transactions on Neural Networks 1 (1), pp. 71–80.
  34. Efficient processing of deep neural networks: a tutorial and survey. Proceedings of the IEEE 105 (12), pp. 2295–2329.
  35. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1–5.
  36. Probabilistic logics and the synthesis of reliable organisms from unreliable components. Automata Studies 34, pp. 43–98.
  37. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems, pp. 351–359.
  38. A 16nm 25mm2 SoC with a 54.5x flexibility-efficiency range from dual-core Arm Cortex-A53 to eFPGA and cache-coherent accelerators. In 2019 Symposium on VLSI Circuits, pp. C34–C35.
  39. FixyNN: efficient hardware for mobile computer vision via transfer learning. In Proceedings of the 2nd SysML Conference 2019, Stanford, California, USA.