Dithered backprop: A sparse and quantized backpropagation algorithm for more efficient deep neural network training

Abstract

Deep neural networks are successful but highly computationally expensive learning systems. One of the main sources of time and energy drain is the well known backpropagation (backprop) algorithm, which roughly accounts for 2/3 of the computational complexity of training. In this work we propose a method for reducing the computational cost of backprop, which we name dithered backprop. It consists of applying a stochastic quantization scheme to intermediate results of the method. The particular quantization scheme, called non-subtractive dither (NSD), induces sparsity which can be exploited by computing efficient sparse matrix multiplications. Experiments on popular image classification tasks show that it induces 92% sparsity on average across a wide set of models at no or negligible accuracy drop in comparison to state-of-the-art approaches, thus significantly reducing the computational complexity of the backward pass. Moreover, we show that our method is fully compatible with state-of-the-art training methods that reduce the bit-precision of training down to 8 bits, and as such is able to further reduce the computational requirements. Finally, we discuss and show potential benefits of applying dithered backprop in a distributed training setting, where both communication and compute efficiency may increase simultaneously with the number of participating nodes.

Keywords

Efficient Deep Learning · Quantization · Dither signals · Distributed Learning

1 Introduction

Deep neural networks (DNNs) are powerful machine learning systems for recognizing patterns in large amounts of data. They became very popular through recent successes in computer vision, language understanding and other areas of computer science [11]. However, DNNs need to undergo a highly computationally expensive training procedure in order to extract meaningful representations from large amounts of data. For instance, [23] showed that the training process of state-of-the-art neural network architectures can produce 284 tons of carbon dioxide, nearly five times the lifetime emissions of an average car. Therefore, in order to mitigate the impact of training and/or allow for models to be trained on resource-constrained devices, more efficient algorithms have to be designed.

The backpropagation (backprop) algorithm [18] is most often applied when gradient-based optimization techniques are selected for training DNNs. However, it involves the computation of many dot products between large tensors and therefore plays a major role in the computational cost of the training procedure. Techniques such as quantization and/or sparsification can be employed in order to reduce the complexity of these dot products; however, when applied in a naïve manner they may induce biased, non-linear errors which can have catastrophic effects on the convergence of the overall training algorithm.

In this work we aim to minimize the computational complexity of the backprop algorithm by carefully studying the error induced by quantization. Concretely, we propose to apply a particular type of stochastic quantization technique to the gradients of the preactivation values, known as non-subtractive dithering (NSD) [22]. NSD not only reduces the precision of the preactivation gradients, but also induces sparsity. As such, we attain sparse tensors with low-precision non-zero values, properties that can be leveraged in order to lower the computational cost of the dot products they are involved in. Our contributions can be summarized as follows:

  • We reduce the computational complexity of the most expensive components of the backprop algorithm by applying stochastic quantization techniques to the gradients of the preactivation values, inducing sparsity as well as low-precision non-zero values.

  • We show in extensive experiments that we can reach a significant amount of sparsity across a wide set of neural network models while maintaining the non-zero values at or below 8-bit precision, without affecting the final accuracy of the model or its convergence speed.

  • Finally, we discuss the positive properties that emerge when applying dithered backprop in a distributed setting. Concretely, we show that the computational cost of training at each node can be reduced by increasing the number of participating nodes.

2 Related Work

A lot of research has been dedicated to improving performance at inference time [6, 25, 29]. However, less work has focused on designing more efficient training algorithms, in particular a more efficient backward pass. In the following we discuss some of the proposed approaches.

Precision Quantization. Most preceding work on efficient neural network training uses precision quantization. In the context of deep learning this means transforming activation, weight and gradient values to representations of lower precision than the standard single-precision floating-point format. It has been shown that this can significantly reduce the time and space complexity of deep learning models [7, 9, 8, 15, 30, 14, 3].
[7] were among the first to show that it is feasible to quantize parts of state-of-the-art models with no or only negligible loss of accuracy using 10-bit multiplications. Subsequent work successfully quantized whole models to 16-bit representations [15, 12]. Later, even ternary and binary weight quantizations were applied, while keeping the gradients and errors in the backward pass in full precision [8, 26]. However, these approaches sacrifice accuracy compared to the baseline networks. [3] managed to quantize weights, activations and all gradient calculations, except for the weight updates, to 8 bits. A 16-bit copy of the backpropagated gradient is saved to compute a full-precision weight update. They argue that the extra time required for this matrix multiplication is comparably small relative to the time required to backpropagate the error gradient, and that in most layers these calculations can be performed in parallel.

Efficient Approximations. Other work has investigated the possible speedups gained from efficient approximations of the matrix multiplications in the backward pass. [1] reduce the complexity of the matrix multiplication through a form of column-row sampling. Using an efficient sampling heuristic, this approach achieves up to 80% reduced computation, but the authors provide no analysis of the noise variance induced in the weight gradients and its impact on the generalization performance. The meProp algorithm [24] sparsifies the pre-activation gradients by selecting the elements with the largest magnitude, and leverages sparse matrix multiplications for a more efficient backward pass. However, since this quantization function is deterministic and operates on vectors, it results in biased estimates of the weight updates, which can harm the convergence speed as well as the generalization performance of the trained model.

In contrast, we show how dither functions can be used to calculate unbiased weight updates efficiently, due to their sparsity-inducing property when applied to gradient values. Furthermore, we show how the approach can be combined with state-of-the-art precision quantization methods in order to boost the computational efficiency of the algorithm.

3 Dithered backpropagation

For fully-connected layers the operations that need to be performed per layer during one training iteration are the following (note that these equations are analogous for convolutional layers):

Forward pass
$$z_l = W_l \cdot a_{l-1} + b_l, \qquad a_l = \phi(z_l) \tag{2}$$
Backward pass
$$\delta a_{l-1} = W_l^T \cdot \delta z_l, \qquad \delta z_l = \delta a_l \odot \phi'(z_l) \tag{3}$$
$$\delta W_l = \delta z_l \cdot a_{l-1}^T \tag{4}$$

with $W_l$, $b_l$, $z_l$ and $a_l$ being the weight tensor, bias, preactivation and activation values respectively. Naturally, $\delta W_l$, $\delta z_l$ and $\delta a_l$ denote the errors or gradients of the respective quantities. With $\phi$ we denote the non-linear activation function, whereas $\phi'$ denotes its derivative. $l$ is an index referring to a particular layer and $(\cdot)^T$ denotes the transpose operation. Finally, the symbols $\cdot$ and $\odot$ denote the dot and Hadamard product respectively.

As one can see, there are three major matrix multiplications involved at each layer during one training iteration, namely, one in the forward pass (equation 2) and two in the backward pass (equation 3 and equation 4). Since up to 90% of the computing time is spent on performing these dot product operations [24], in this work we focus on reducing their computational cost. In particular, notice how the preactivation gradients $\delta z_l$ are present in both matrix multiplications of the backward pass. Hence, in order to save operations, we apply quantization functions that compress these gradients.
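To make the role of these three dot products concrete, the following NumPy sketch walks through one forward and backward pass of a single fully-connected layer as in equations 2–4. All shapes, names and the ReLU non-linearity are illustrative assumptions, not prescribed by the paper.

```python
# Illustrative NumPy sketch of one fully-connected layer, highlighting the
# three matrix multiplications of equations 2-4.
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 256, 512, 128             # output dim, input dim, batch size (arbitrary)
W = 0.01 * rng.normal(size=(n, k))  # weight tensor W_l
b = np.zeros((n, 1))                # bias b_l
a_prev = rng.normal(size=(k, m))    # activations a_{l-1} of the previous layer

# Forward pass (equation 2): one dense matmul
z = W @ a_prev + b                  # preactivations z_l
a = np.maximum(z, 0.0)              # a_l = phi(z_l), here phi = ReLU

# Backward pass: incoming gradient w.r.t. the activations of this layer
da = rng.normal(size=(n, m))        # stand-in for delta a_l
dz = da * (z > 0)                   # delta z_l = delta a_l (Hadamard) phi'(z_l)

# The two matmuls of the backward pass, both involving dz (equations 3 and 4)
da_prev = W.T @ dz                  # delta a_{l-1}, propagated to the previous layer
dW = dz @ a_prev.T                  # delta W_l, the weight gradient
db = dz.sum(axis=1, keepdims=True)  # bias gradient
```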

3.1 Non-subtractive dithered quantization (NSD)

For reasons that will become more apparent in the next section, in this work we propose to apply the following quantization function:

$$Q_{\Delta}(x) = \Delta \cdot \mathrm{round}\!\left(\frac{x + \nu}{\Delta}\right), \qquad \nu \sim \mathcal{U}\!\left(-\tfrac{\Delta}{2}, \tfrac{\Delta}{2}\right) \tag{5}$$

with $\Delta$ being the quantization step size and $x$ some input value. $\nu$ is a random number sampled from the uniform distribution over the open interval $(-\Delta/2, \Delta/2)$. The quantization function in equation 5 is sometimes referred to as non-subtractive dither (NSD) [22] in the source coding literature, with $\nu$ being a stochastic dither signal that is added to the input before quantization. The main motivation for adding a dither signal before quantization is to decouple the properties of the quantization error $\epsilon = Q_{\Delta}(x) - x$ from the input signal $x$. For instance, it is known that the quantization error of NSD is unbiased and has bounded variance

$$\mathbb{E}_{\nu}[\epsilon] = 0 \tag{6}$$
$$\mathbb{E}_{\nu}[\epsilon^2] \leq \frac{\Delta^2}{4} \tag{7}$$

3.2 Effects of applying NSD to the gradients

Figure 1: Distribution of preactivation gradient values before (left) and after (right) NSD quantization. The gradients have become more sparse (higher count of 0 values) and the non-zero values can be represented with low bitwidths (low number of non-zero “buckets”). For instance, in this example only 1 bit is required to represent all non-zero values.

Hence, at each layer $l$, we now apply NSD to the gradients of the preactivation values $\delta z_l$ before computing the respective dot products. For large enough stepsizes $\Delta_l$, NSD will induce sparsity (many zero values) as well as non-zero values with a low bitwidth representation (see figure 1).

To make this effect more clear, consider the convolution of a gaussian distribution with mean 0 and standard deviation $\sigma$ with a uniform distribution that samples values in the range $(-\Delta/2, \Delta/2)$. The induced average sparsity is given by the probability of sampling a value in that same interval, thus

$$P\big(Q_{\Delta}(x) = 0\big) = \int_{-\Delta/2}^{\Delta/2} (p_{\mathcal{N}} \ast p_{\mathcal{U}})(t)\, dt$$

with $p_{\mathcal{N}}$ and $p_{\mathcal{U}}$ denoting the gaussian and uniform density functions respectively. As figure 2 shows, the probability of a 0 value increases with the stepsize. Naturally, the same applies to the maximal bit-width of the non-zero values, since the probability of a large number appearing after quantization decreases as the stepsize increases.

Figure 2: (left) Shape of the probability distribution resulting from the convolution of a gaussian with a uniform distribution, where the uniform distribution samples values in the range $(-\Delta/2, \Delta/2)$. The shape depends on the stepsize $\Delta$ of the uniform distribution, which is chosen as $\Delta = c\sigma$, with $\sigma$ being the standard deviation of the gaussian distribution and $c$ a scaling factor. The dashed lines indicate the region of values between $-\Delta/2$ and $\Delta/2$. (right) The probability of a 0 value appearing after quantization at different scale factors, calculated by integrating the area between the dashed lines in the left plot. From both plots one can see that sparsity increases with the scaling factor $c$.
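The zero probability above can also be checked numerically. The following Python sketch estimates it by Monte Carlo simulation for a few scaling factors; the chosen values of $c$ and the sample size are arbitrary and need not match the exact setup of figure 2.

```python
# Monte Carlo sketch of the sparsity analysis: for Gaussian inputs with standard
# deviation sigma and stepsize Delta = c * sigma, estimate the probability that
# NSD quantizes a value to exactly zero. All names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
sigma, num_samples = 1.0, 1_000_000
x = rng.normal(0.0, sigma, size=num_samples)

for c in [0.5, 1.0, 2.0, 4.0]:
    delta = c * sigma
    nu = rng.uniform(-delta / 2, delta / 2, size=num_samples)  # dither signal
    q = delta * np.round((x + nu) / delta)                     # NSD, equation 5
    print(f"c = {c:>3}: P(0) ~ {np.mean(q == 0.0):.3f}")
```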

We can then exploit this sparsity to omit operations when computing the dot product between tensors. The altered equations for the backward pass at each layer are then given by:

$$\delta \tilde{z}_l = Q_{\Delta_l}(\delta z_l) \tag{8}$$
$$\delta a_{l-1} = W_l^T \cdot \delta \tilde{z}_l \tag{9}$$
$$\delta W_l = \delta \tilde{z}_l \cdot a_{l-1}^T \tag{10}$$

with $Q_{\Delta_l}$ being the NSD quantizer of equation 5 applied with a layer-wise stepsize $\Delta_l$, and $\delta \tilde{z}_l$ being the matrix of quantized pre-activation gradients.

Given the above analysis, we propose to choose the stepsize at each layer $l$ to be a multiple of the standard deviation, that is, $\Delta_l = c \cdot \sigma_l$, with $\sigma_l$ being the standard deviation of the preactivation gradients $\delta z_l$ and $c > 0$ a global scaling factor that controls the trade-off between compute complexity and learning performance. We name our proposed modification of the backprop method dithered backprop. Algorithm 1 shows pseudocode for the quantization procedure of the preactivation gradients. After quantization, the backward pass as well as the weight update steps remain identical to the usual algorithm.

1:procedure NSD($\delta z_l$, $c$)  Quantizes preactivation gradients of layer $l$
2:     $\sigma_l \leftarrow \mathrm{std}(\delta z_l)$  Computes standard deviation
3:     $\Delta_l \leftarrow c \cdot \sigma_l$
4:     $\delta \tilde{z}_l \leftarrow Q_{\Delta_l}(\delta z_l)$  As in equation 5
5:     return $\delta \tilde{z}_l$
6:end procedure
Algorithm 1 Dithered backprop quantization
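For concreteness, the following Python sketch mirrors Algorithm 1, assuming the rounding-to-nearest quantizer of equation 5. The function name and the example shapes are our own illustrative choices, not part of an official implementation.

```python
# A direct NumPy transcription of Algorithm 1 under the stated assumptions.
import numpy as np

def nsd_quantize(dz, c, rng=None):
    """Apply non-subtractive dithered quantization to preactivation gradients dz."""
    if rng is None:
        rng = np.random.default_rng()
    sigma = dz.std()                                    # line 2: standard deviation sigma_l
    delta = c * sigma                                   # line 3: stepsize Delta_l = c * sigma_l
    nu = rng.uniform(-delta / 2, delta / 2, dz.shape)   # dither signal nu
    return delta * np.round((dz + nu) / delta)          # line 4: quantize as in equation 5

# Example: the quantized gradients are highly sparse and take few distinct levels
dz = np.random.default_rng(0).normal(0, 1e-3, size=(256, 128))
dz_q = nsd_quantize(dz, c=2.0)
print("sparsity:", np.mean(dz_q == 0.0), "non-zero levels:", np.unique(np.abs(dz_q)).size)
```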

3.3 Error statistics and convergence of the method

Due to applying NSD to all $\delta z_l$, dithered backprop attains perturbed estimates of the weight updates

$$\delta \tilde{W}_l = \delta W_l + \epsilon_{W_l}$$

with $\epsilon_{W_l}$ being the perturbation error. Hence, this begs the question: how does this error influence the convergence of the training method?

From [4] we know that under mild assumptions regarding the loss function, if a stochastic operator is added to a training algorithm that already converges and generates unbiased estimates of the weight updates with bounded variance, then the respective training algorithm converges as well. Thus, we only need to show that the error of the weight updates is unbiased and has bounded variance, that is

$$\mathbb{E}[\epsilon_{W_l}] = 0 \tag{11}$$
$$\mathbb{E}\big[\|\epsilon_{W_l}\|^2\big] < \infty \tag{12}$$

Although in this work we do not provide a rigorous proof (mainly due to space constraints), it is relatively easy to see that equation 11 and equation 12 are satisfied by modelling the quantization error of the preactivation gradients also as additive noise $\epsilon_{z_l} = \delta \tilde{z}_l - \delta z_l$, and taking into consideration that this error satisfies equation 6 and equation 7.
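The unbiasedness part of this argument is easy to check numerically: averaging the dithered weight gradient over many independent dither draws should recover the exact weight gradient. The sketch below does exactly that; the shapes, seed, scaling factor and number of trials are arbitrary illustrative choices.

```python
# Numerical check that the dithered weight gradient is (approximately) unbiased.
import numpy as np

rng = np.random.default_rng(0)
dz = rng.normal(0, 1e-3, size=(64, 32))    # preactivation gradients delta z_l
a_prev = rng.normal(size=(16, 32))         # previous-layer activations a_{l-1}
dW_exact = dz @ a_prev.T                   # exact weight gradient (equation 4)

delta = 2.0 * dz.std()                     # stepsize with scaling factor c = 2
acc = np.zeros_like(dW_exact)
trials = 2000
for _ in range(trials):
    nu = rng.uniform(-delta / 2, delta / 2, dz.shape)
    dz_q = delta * np.round((dz + nu) / delta)   # NSD, equation 5
    acc += dz_q @ a_prev.T                       # dithered weight gradient (equation 10)

print("mean absolute bias:", np.abs(acc / trials - dW_exact).mean())
```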

3.4 Computational complexity

Theoretical analysis
When dithered backprop is used for training, some additional computational overhead comes from applying NSD to the gradients of the preactivation values. However, we argue that this cost is asymptotically negligible compared to the cost of performing the subsequent dot products. In the following we highlight the rationale for the case of fully-connected layers; we stress, however, that it applies analogously to convolutional layers.

Let $\delta z_l$ be an $n \times m$-dimensional matrix whose elements are the gradients of the preactivation values of a particular layer. As can be seen from equation 5 and algorithm 1, applying NSD to $\delta z_l$ requires, for each element:

  1. calculating the standard deviation of the preactivation gradients, which requires 1 multiplication + 1 addition per element,

  2. sampling $\nu$ from the uniform distribution over the interval $(-\Delta_l/2, \Delta_l/2)$, which requires about 2 multiplications + 2 additions + 1 modulo operation,

  3. quantizing the value, which requires 1 addition + 1 multiplication + truncation of the decimal bits.

Overall, the cost amounts to about 9 arithmetic operations per element. Thus, the computational complexity of applying NSD is of order $\mathcal{O}(nm)$. If we now include the cost of performing the subsequent sparse matrix-matrix dot product, the complexity becomes of order $\mathcal{O}(nm + p\,nmk)$, with $p$ being the empirical probability of non-zero values in $\delta \tilde{z}_l$ and $k$ the number of rows of the weight matrix involved in the product.

In contrast, the computational complexity of a dense matrix multiplication of the form $W_l^T \cdot \delta z_l$, with $W_l^T$ being, for instance, an arbitrary $k \times n$-dimensional weight matrix, is of order $\mathcal{O}(nmk)$. If we now measure the effective asymptotic savings of the dithered dot product versus the dense dot product by taking the ratio of both quantities, we get

$$\frac{nm + p\,nmk}{nmk} = \frac{1}{k} + p \tag{13}$$

Equation 13 states that the asymptotic computational savings depend inversely on the number of rows of the output matrix, as well as on the sparsity attained after applying NSD. Since the number of output rows is most often much larger than one, the computational savings are dominated by the sparsity achieved. Later, in the experimental section, we show that NSD is able to induce high sparsity ratios (between 75% and 99%) during the entire training procedure, thus in principle being able to achieve significant savings.
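To put equation 13 into numbers, consider as an illustration a layer with $k = 512$ output rows (an arbitrary choice) and the 92% average sparsity reported later in the experiments, i.e. $p = 0.08$:

$$\frac{1}{k} + p = \frac{1}{512} + 0.08 \approx 0.082,$$

so the dithered backward dot product would in theory require roughly 12 times fewer multiply-accumulate operations than its dense counterpart.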

Practical savings
Unfortunately, the above analysis does not translate directly to real-world speedups or energy savings, mainly due to the challenges that unstructured sparsity imposes at the hardware level. Nevertheless, it is worth mentioning that in recent years there has been significant progress in this field, with promising results in narrowing the gap between theory and practice. On the software level, [10] have shown that they can already attain up to x2.4 speedups for DNNs with 80%-90% sparsity by optimizing the sparse dot products so that they become more amenable to the underlying hardware. On the hardware side, many accelerators have been proposed [13, 16, 5, 17] that are able to successfully exploit structured and unstructured sparsity, sometimes achieving orders of magnitude higher compute efficiency. In particular, [17] attained about x1.5-x8 speedups and x1.5-x6 energy gains at sparsity ratios between 75%-95%, ratios that are typically induced by dithered backprop (see the experiments section). Finally, [2] proposed an accelerator that includes an efficient implementation of dithered quantization in order to execute DNNs at lower bit-precision. This progress motivates the study of methods akin to dithered backprop, since similar gains can be expected when such algorithms are implemented efficiently in software and run on similarly optimized hardware architectures.

3.5 Quantizing forward pass

So far we have only discussed the reduction of the computational cost of the backward pass. Although the backward pass accounts for roughly 2/3 of the computational complexity of a training iteration (see equation 2, equation 3 and equation 4), we are also interested in methods that reduce the computation of the forward pass. Fortunately, some research has already been done in this area.
[3], for example, quantize activation, weight and some gradient values in the backward pass to 8 bits and show that state-of-the-art results can still be achieved with their method. In addition, they introduce Range Batch-Normalization (BN), an approximated batch norm that scales a batch by dividing it by its range, which has a significantly higher tolerance to quantization noise and improved computational complexity.
Armed with this knowledge, we similarly quantize activation and weight values in the forward pass and apply dithered backprop in the backward pass, leaving only the weight update in full precision. Therefore, all computations, except for the weight update, can be performed with 8-bit arithmetic.
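As a rough illustration of an 8-bit forward matmul, the sketch below uses a simple symmetric per-tensor quantizer. This is a simplification for exposition only and not the exact scheme of [3], which additionally relies on Range Batch-Normalization and a more careful treatment of the gradients; all names and shapes are our own choices.

```python
# Minimal sketch of a low-precision forward matmul with symmetric 8-bit
# quantization of weights and activations (illustrative, not the scheme of [3]).
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization of x to 8-bit integers plus a scale."""
    scale = np.max(np.abs(x)) / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))        # weights
a_prev = rng.normal(size=(128, 32))   # activations
W_q, sw = quantize_int8(W)
a_q, sa = quantize_int8(a_prev)

# Forward matmul in integer arithmetic (accumulation in int32), rescaled afterwards
z = (W_q.astype(np.int32) @ a_q.astype(np.int32)).astype(np.float64) * (sw * sa)
print("max abs error vs. full precision:", np.abs(z - W @ a_prev).max())
```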

3.6 Usage in distributed training setting

A further interesting area of application of the dithered backprop method is distributed training. In distributed training, an algorithm called synchronous stochastic gradient descent (SSGD) is widely used [21]. It differs from single-threaded mini-batch SGD in that the mini-batch of size $B$ is distributed to $N$ workers in total, which locally compute sub-mini-batch gradients. These gradients are then communicated to a centralized server, called the parameter server, which updates the parameter vector and eventually sends it back. By increasing the number of training nodes and taking advantage of data parallelism, the total computation time of the forward-backward passes over the same amount of training data can thus be reduced dramatically.
As mentioned in the section above, dithered backprop adds unbiased noise with bounded variance to the weight updates. Therefore, if dithered backprop is applied at all $N$ nodes, most of the induced noise cancels out on the server side due to the averaging effect. Moreover, the variance of the noise decreases with $N$. Thus, dithered backprop promises to reduce the computational cost per node as the number of nodes grows, since stronger quantization can be applied without affecting the final performance of the model after training. This may be beneficial in scenarios where a large number of nodes with limited computational resources participate in the training procedure, e.g., a large number of mobile devices connected through a communication channel with high bandwidth such as 5G.
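The noise-averaging argument can be illustrated with a small simulation: each of $N$ simulated nodes quantizes the same gradient signal with an independent dither draw, the "server" averages the resulting weight gradients, and the error of the average shrinks as $N$ grows. This is a schematic sketch only, with no real data partitioning or communication stack, and arbitrary shapes and scaling factor.

```python
# Toy simulation of dither-noise averaging across N nodes in SSGD-style training.
import numpy as np

rng = np.random.default_rng(0)
dz = rng.normal(0, 1e-3, size=(64, 32))    # shared example preactivation gradients
a_prev = rng.normal(size=(16, 32))         # shared example activations
dW_exact = dz @ a_prev.T                   # exact weight gradient
delta = 4.0 * dz.std()                     # strong quantization (large scaling factor)

def node_update():
    """One node's dithered weight gradient with an independent dither draw."""
    nu = rng.uniform(-delta / 2, delta / 2, dz.shape)
    dz_q = delta * np.round((dz + nu) / delta)
    return dz_q @ a_prev.T

for num_nodes in [1, 4, 16, 64]:
    avg = sum(node_update() for _ in range(num_nodes)) / num_nodes  # server-side average
    err = np.mean((avg - dW_exact) ** 2)
    print(f"N = {num_nodes:>3}: MSE of averaged update = {err:.3e}")
```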

4 Experiments

| Model | Dataset | Baseline (acc% / sparsity%) | Dithered backprop (acc% / sparsity%) | 8-bit training [3] (acc% / sparsity%) | 8-bit + dith. backprop (acc% / sparsity%) |
|---|---|---|---|---|---|
| LeNet5 | MNIST | 99.31 / 2.05 | 99.35 / 97.52 | 99.34 / 2.09 | 99.35 / 97.18 |
| LeNet300100 | MNIST | 98.45 / 47.48 | 98.40 / 94.92 | 98.43 / 48.61 | 98.52 / 94.85 |
| AlexNet | CIFAR10 | 91.23 / 91.35 | 91.26 / 98.95 | 91.03 / 64.62 | 90.81 / 97.05 |
| ResNet18 | CIFAR10 | 92.67 / 24.36 | 92.35 / 91.86 | 92.22 / 34.88 | 92.10 / 92.10 |
| VGG11 | CIFAR10 | 92.35 / 8.47 | 92.17 / 94.10 | 92.44 / 4.82 | 92.29 / 94.24 |
| AlexNet | CIFAR100 | 67.98 / 92.23 | 67.78 / 97.35 | 68.37 / 64.39 | 67.63 / 89.51 |
| ResNet18 | CIFAR100 | 69.54 / 18.23 | 69.97 / 87.66 | 70.73 / 13.39 | 69.69 / 87.74 |
| VGG11 | CIFAR100 | 70.58 / 6.70 | 70.09 / 91.79 | 71.29 / 83.40 | 70.07 / 91.77 |
| ResNet18 | ImageNet | 71.40 / 6.44 | 71.10 / 75.80 | 71.25 / 3.27 | 71.23 / 75.48 |
| Average | - | 83.72 / 33.03 | 83.61 / 92.22 | 83.90 / 35.50 | 83.52 / 91.10 |
| Average diff. | - | 0 / 0 | 0.23 / 59.12 | 0 / 0 | 0.40 / 55.61 |

Table 1: Results of the experiments, where acc% denotes the accuracy in % on the test set and sparsity% the average sparsity of the gradients of the preactivation values in %, taken over all layers and training iterations.

Datasets. We conducted our experiments on four different benchmark datasets for image classification, namely MNIST, CIFAR10, CIFAR100 and ImageNet.

Training Setting. For the MNIST dataset, LeNet300100 and LeNet5 were evaluated; for CIFAR10 and CIFAR100 we used VGG11, AlexNet and ResNet18, and for ImageNet only ResNet18. For the CIFAR datasets, we reduced the capacity of the models to account for the smaller datasets. That is, for AlexNet we reduced the dimensionality of the last two hidden layers to 2048, and for VGG11 to 512. The last layers are adapted to account for the number of classes, respectively.

All models are trained via stochastic gradient descent with a momentum of 0.9, a weight decay of , and a batch size of 256 for ImageNet and 128 for the others. We used a learning rate (lr) of 0.05 for AlexNet and 0.1 for the rest of the models. For the CIFAR datasets, lr-decay settings of 0.1/100 and 0.1/45 are applied.

4.1 Accuracy & Induced Sparsity

For the listed datasets we conducted experiments with four different methods, according to the training setting described above. Besides the baseline method, which denotes training without quantization, we applied dithered backprop as described in the previous section, the precision quantization of [3] (8-bit training), which applies quantization in the forward and backward pass in order to perform training at 8-bit precision, and the combination of the latter two. Table 1 summarizes our findings.

Firstly, notice how the baseline training method exhibits vastly different sparsities across different models, ranging from 2% to 92%. Models trained without batchnorm, such as AlexNet, already exhibit high sparsity ratios due to the derivative of the ReLU activation function, which is often 0. However, batchnorm layers cancel out this effect, and therefore models such as LeNet5 or VGG11 exhibit high density (low sparsity). We see a similar effect for models trained with 8-bit precision. On average, the baseline backprop method induced only 33% sparsity across the different models, and similarly the 8-bit backprop method only 36%.

In contrast, after applying dithered backprop, sparsity becomes very high across all networks, ranging between 76%-99%. In particular, notice how dithered backprop is able to significantly increase the sparsity of networks trained with batchnorm layers. For instance, LeNet5 goes from 2.05% to 97.52%, a substantial increase of 95.47%. On average, dithered backprop was able to induce 92% sparsity across the models, increasing the sparsity ratio by 59% from the baseline. We get similar results when applied in combination with the 8-bit training method. Here, dithered backprop increased the sparsity by 56%, inducing an average sparsity of 91% across the networks. If we consider the speedups and energy gains reported in [17], these results may potentially translate to x5 speedups and x4.5 energy gains on average if dithered backprop is run on specialized hardware.

(a)
(b)
Figure 3: (a) Test error of VGG11 trained on CIFAR10 over the training epochs. (b) Average density (# non-zero values or 1-sparsity) of the preactivation gradients during training.

We stress that the accuracies were approximately maintained across the experiments, changing on average only by 0.3% between the dithered and non-dithered methods. Moreover, the number of training epochs also did not change, showing that dithered backprop did not harm the convergence speed. Figure 3a shows an example where the test error of VGG11 is plotted over the training epochs. As can be seen, there is no recognizable difference in convergence speed between the baseline model and the dithered model. More examples can be found in the appendix.

Additionally, we want to mention that the maximum bitwidth of the non-zero values was consistently at or below 8 bits across all experiments (see figure 6b). Thus, dithered backprop is fully compatible with training methods that limit the bit-precision of training to 8 bits, such as [3].

In Figure 3b we show the evolution of the density (# non-zero values, or 1 - sparsity) of the preactivation gradients over the entire training period of the VGG11 model. We can see that the density of the gradients is much lower when dithered backprop is applied, throughout the entire training procedure. Interestingly, we also see that the density decreases at the beginning of training and stays approximately constant afterwards. Coincidentally, this trend correlates weakly with the speed of the learning progress, which can be interpreted as the gradients carrying more information during that phase. However, dithered backprop appears to successfully eliminate redundant information that is not useful for learning, since its density is much lower.

4.2 Comparison to meProp

Figure 4: Learning performance at different levels of average sparsity of the preactivation gradients of a multilayer perceptron with two hidden layers (500, 500) trained on MNIST, using either regular backpropagation (Baseline), dithered backprop (D. backprop) or meProp [24]. Multiple runs with different random seeds were executed for each configuration. Points show the mean performance, with the standard deviation indicated as a span.

We now benchmark dithered backprop against meProp [24], arguably the closest related work. To recall, in one of its modes meProp sparsifies the pre-activation gradients by selecting the elements with the largest magnitude. This induces biased estimates of the weight updates, which we argue negatively affects the learning quality of the network.

Since meProp was only benchmarked on multilayer perceptrons, we chose a model with two fully-connected layers with hidden dimensions of (500, 500) and trained it on MNIST and CIFAR10 for the experiments. Figure 4 shows the final test accuracy of the model trained on MNIST at different levels of average sparsity of the preactivation gradients. In the appendix we show the results for CIFAR10. As one can see, dithered backprop clearly outperforms meProp at all levels of sparsity, reaching a higher average test accuracy at comparable or higher sparsity.

4.3 Distributed training

Figure 5: Accuracy of the final AlexNet model trained on CIFAR10 with dithered backprop in a distributed training setting, for different numbers of participating nodes. The accuracy stays approximately constant as the number of nodes increases.
(a)
(b)
Figure 6: (a) Average sparsity of the preactivation gradients of the fully-connected layers of AlexNet trained on CIFAR10 with dithered backprop in a distributed training setting, for different numbers of participating nodes. As the number of nodes increases, so does the sparsity at each node and therefore its computational savings for training. (b) Maximal, worst-case bit-precision of the fully-connected layers of AlexNet trained on CIFAR10 with dithered backprop in a distributed training setting, for different numbers of participating nodes. As the number of nodes increases, the number of bits necessary to represent the non-zero values decreases, and with it the computational cost of training at each node.

In the above section we argued that applying dithered backprop in a distributed training scenario may be beneficial. The rationale was that, since the noise induced by dithered backprop on the weight updates is unbiased with bounded variance, then this should cancel out as the number of nodes grows due to the averaging of the parameters on the server. In this section we try to show this effect experimentally.
To investigate this, we ran several experiments with the same model and different numbers of nodes $N$. While increasing $N$, we also increase the scaling factor $c$ of the dither method in order to increase the quantization strength. At each training iteration, each node runs one forward and one dithered backward pass with a batch size of 1, then sends its parameter gradients to the server, where they are subsequently averaged with the gradients of all other nodes. Finally, the averaged parameter gradients are broadcast back to each node, and a new training iteration starts. We then measure the final accuracy of the model, the average sparsity and the worst-case bit-precision for all configurations.

Figures 5 and 6 show the respective trends for the fully-connected layers of AlexNet trained on CIFAR10. In the appendix we show the same plots for the convolutional layers. Each plot shows the average trend and the standard deviation over 3 different runs of the same experiment. As one can see, we can increase the sparsity and lower the bit-precision as the number of participating nodes increases, while negligibly affecting the final accuracy of the model. In other words, dithered backprop allows us to reduce the computational cost of performing a training iteration at each node as the number of participating nodes increases.

As a side note, we want to remark that in this context high sparsities of the preactivation gradients do not necessarily translate to communication savings. For batch sizes larger than one, the weight updates are with high probability densely populated, so that the full model would have to be communicated at each iteration. Only when the batch size per node is equal to one (as was the case in our experimental setup) does sparsity of the preactivation gradients directly translate to sparsity of the weight updates and consequently to savings in communication cost.

5 Conclusion

In this work we proposed a method for reducing the computational complexity of the backpropagation (backprop) algorithm. Our method, called dithered backprop, is based on applying dithered quantization on the tensor of the preactivation gradients in order to induce sparsity and non-zero values with low bit-precision. It is also simple in that it has only one global hyperparameter which controls the trade-off between computational complexity and learning performance of the model.

Extensive experimental results show that dithered backprop is able to attain high sparsity ratios, between 75% and 99%, across a wide set of neural network models, boosting the sparsity by 59% on average over the original backprop method. In addition, we showed that dithered backprop keeps the bit-precision of the non-zero values at or below 8 bits during the entire training process, thus being fully compatible with methods that limit training to 8-bit precision. However, in its current form, dithered backprop induces unstructured sparsity, which is not amenable to conventional hardware such as CPUs or GPUs. In future work we will consider modifications that translate directly to speedups and energy gains without having to rely on specialized hardware. Moreover, we will also consider applying efficient compression algorithms to the gradients in order to reduce the memory complexity of training as well [28, 27].

We also showed that beneficial properties emerge when dithered backprop is applied in the context of distributed training. For instance, we showed experimentally that as the number of participating nodes increases, so do the computational savings per node. This effect can be advantageous when a large number of nodes with resource-constrained compute engines participate in the training procedure, such as mobile phones. A further interesting future work direction is to apply dithered backprop jointly with methods that drastically reduce the communication cost [19, 20], with the goal of minimizing both the communication and the computation cost of the distributed training system.


Appendix

More experimental results

In this section of the appendix we show further experimental results.

Convergence of dithered backprop

Figures 7 and 8 show the training curves of AlexNet and Resnet18 trained on CIFAR10 with the baseline method, dithered backprop, the reduced precision training method [3] and the combination of the latter two. As one can see, the training convergence is not affected by dithered backprop in any of the cases.

Figure 7: Test error of AlexNet and Resnet18 trained on CIFAR10 over the training epochs for baseline and dithered backpropagation.
Figure 8: Test error of AlexNet and Resnet18 trained on CIFAR10 over the training epochs for 8-bit quantization and dithered 8-bit quantization.

Comparison to meProp

Figure 9: Learning performance at different levels of average sparsity of the preactivation gradients of a multilayer perceptron with two hidden layers (500, 500) trained on CIFAR10, using either regular back propagation (Baseline), dithered backprop (D. backprop) or meProp [24]. Multiple runs with different random seeds were executed for each configuration. Points show mean performance with standard deviation indicated as span.

In figure 9 we show the learning performance of the multilayer perceptron when trained on CIFAR10. As one can see, meProp does not reach accuracies as high as those of dithered backprop. We attribute this to the biased nature of its gradient estimates, which negatively affects the learning performance of the model.

Distributed training

Here we show the trend of the computational complexity of the convolutional layers as the number of participating nodes increases. As can be seen in figures 10 and 11, the computational cost decreases as the number of nodes increases.

Figure 10: Average sparsity of the preactivation gradients of the convolutional layers of AlexNet trained on CIFAR10 with dithered backprop in a distributed training setting, for different numbers of participating nodes. As the number of nodes increases, so does the sparsity at each node and therefore its computational savings for training.
Figure 11: Maximal, worst-case bit-precision of the convolutional layers of AlexNet trained on CIFAR10 with dithered backprop in a distributed training setting, for different numbers of participating nodes. As the number of nodes increases, the number of bits necessary to represent the non-zero values decreases, and with it the computational cost of training at each node.

References

  1. M. Adelman and M. Silberstein (2018). arXiv:1805.08079.
  2. K. Ando, K. Ueyoshi, Y. Oba, K. Hirose, R. Uematsu, T. Kudo, M. Ikebe, T. Asai, S. Takamaeda-Yamazaki and M. Motomura (2018). Dither NN: an accurate neural network with dithering for low bit-precision hardware. In 2018 International Conference on Field-Programmable Technology (FPT), pp. 6–13.
  3. R. Banner, I. Hubara, E. Hoffer and D. Soudry (2018). Scalable methods for 8-bit training of neural networks. In Advances in Neural Information Processing Systems 31, pp. 5145–5153.
  4. L. Bottou (1998). Online learning and stochastic approximations.
  5. Y. Chen, T. Yang, J. Emer and V. Sze (2019). Eyeriss v2: a flexible accelerator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9(2), pp. 292–308.
  6. Y. Cheng, D. Wang, P. Zhou and T. Zhang (2018). Model compression and acceleration for deep neural networks: the principles, progress, and challenges. IEEE Signal Processing Magazine 35(1), pp. 126–136.
  7. M. Courbariaux, Y. Bengio and J. David (2014). arXiv:1412.7024.
  8. M. Courbariaux, Y. Bengio and J. David (2015). BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28, pp. 3123–3131.
  9. M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv and Y. Bengio (2016). arXiv:1602.02830.
  10. E. Elsen, M. Dukhan, T. Gale and K. Simonyan (2019). Fast sparse convnets. arXiv:1911.09723.
  11. I. Goodfellow, Y. Bengio and A. Courville (2016). Deep Learning. MIT Press, Cambridge.
  12. S. Gupta, A. Agrawal, K. Gopalakrishnan and P. Narayanan (2015). Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 1737–1746.
  13. S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz and W. J. Dally (2016). EIE: efficient inference engine on compressed deep neural network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254.
  14. I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv and Y. Bengio (2017). Quantized neural networks: training neural networks with low precision weights and activations. Journal of Machine Learning Research 18(1), pp. 6869–6898.
  15. P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh and H. Wu (2018). Mixed precision training. In International Conference on Learning Representations (ICLR).
  16. B. Moons, D. Bankman and M. Verhelst (2018). Embedded Deep Learning: Algorithms, Architectures and Circuits for Always-on Neural Network Processing. 1st edition, Springer.
  17. A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler and W. J. Dally (2017). SCNN: an accelerator for compressed-sparse convolutional neural networks. In 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 27–40.
  18. D. E. Rumelhart, G. E. Hinton and R. J. Williams (1986). Learning representations by back-propagating errors. Nature 323(6088), pp. 533.
  19. F. Sattler, S. Wiedemann, K. Müller and W. Samek (2019). Robust and communication-efficient federated learning from non-i.i.d. data. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14.
  20. F. Sattler, S. Wiedemann, K. Müller and W. Samek (2019). Sparse binary compression: towards distributed deep learning with minimal communication. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
  21. F. Sattler, T. Wiegand and W. Samek (2020). Trends and advancements in deep neural network communication. arXiv:2003.03320.
  22. L. Schuchman (1964). Dither signals and their effect on quantization noise. IEEE Transactions on Communication Technology 12(4), pp. 162–165.
  23. E. Strubell, A. Ganesh and A. McCallum (2019). Energy and policy considerations for deep learning in NLP. arXiv:1906.02243.
  24. X. Sun, X. Ren, S. Ma and H. Wang (2017). arXiv:1706.06197.
  25. V. Sze, Y. Chen, T. Yang and J. S. Emer (2017). Efficient processing of deep neural networks: a tutorial and survey. Proceedings of the IEEE 105(12), pp. 2295–2329.
  26. W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen and H. Li (2017). TernGrad: ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems 30, pp. 1509–1519.
  27. S. Wiedemann, H. Kirchhoffer, S. Matlage, P. Haase, A. Marban, T. Marinc, D. Neumann, T. Nguyen, H. Schwarz, T. Wiegand, D. Marpe and W. Samek (2020). DeepCABAC: a universal compression algorithm for deep neural networks. IEEE Journal of Selected Topics in Signal Processing.
  28. S. Wiedemann, K. Müller and W. Samek (2020). Compact and computationally efficient representation of deep neural networks. IEEE Transactions on Neural Networks and Learning Systems 31(3), pp. 772–785.
  29. S. Wiedemann, A. Marban, K. Müller and W. Samek (2019). Entropy-constrained training of deep neural networks. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
  30. S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen and Y. Zou (2016). arXiv:1606.06160.