Faster Neural Network Training with Approximate Tensor Operations
Abstract
We propose a novel technique for faster Neural Network (NN) training by systematically approximating all the constituent matrix multiplications and convolutions. This approach is complementary to other approximation techniques and requires no changes to the dimensions of the network layers, hence it is compatible with existing training frameworks. We first analyze the applicability of the existing methods for approximating matrix multiplication to NN training, and extend the most suitable column-row sampling algorithm to approximating multi-channel convolutions. We apply approximate tensor operations to training MLP, CNN and LSTM network architectures on MNIST, CIFAR-100 and Penn Tree Bank datasets and demonstrate a 30% to 80% reduction in the amount of computations while maintaining little or no impact on the test accuracy. Our promising results encourage further study of general methods for approximating tensor operations and their application to NN training.
Menachem Adelman, Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa, madelman@campus.technion.ac.il
Mark Silberstein, Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa, mark@ee.technion.ac.il
Preprint. Work in progress.
1 Introduction
Approximation techniques for faster inference and training of deep neural networks (DNNs) have received considerable attention. Examples include quantized numerical representation (Hubara et al., 2016; Micikevicius et al., 2017; Seide et al., 2014; Wen et al., 2017), low-rank models (Mamalet and Garcia, 2012; Kuchaiev and Ginsburg, 2017), weight extrapolations (Kamarthi and Pittner, 1999) and partial or delayed gradient updates (Recht et al., 2011; Strom, 2015; Sun et al., 2017a).
We propose a novel approach to performing approximate DNN training that reduces the amount of computations by approximating compute-intensive tensor operations. At a high level, the original matrix products and convolutions are replaced with their faster approximate versions that require fewer computations. The approximation is applied separately to each tensor operation, keeping the network architecture and dimensions intact, thereby facilitating the adoption of this technique in existing DNN training frameworks, potentially in combination with other approximation techniques.
We first focus on the existing methods for approximating matrix multiplication, noting that there is a rich literature on the topic. We analyze several known algorithms (Cohen and Lewis, 1999; Drineas and Kannan, 2001; Drineas et al., 2006; Magen and Zouzias, 2011; Sarlos, 2006; Clarkson and Woodruff, 2009; Pagh, 2013; Kutzkov, 2013), and find column-row sampling (CRS) (Drineas et al., 2006) to be the most suitable for approximating matrix multiplications in fully-connected layers. Given the product of two matrices $AB$, the algorithm samples the columns of $A$ and the corresponding rows of $B$, thus constructing smaller matrices, which are then multiplied as usual. This method incurs low sampling overheads linearly proportional to the size of the input matrices, and lends itself to an efficient implementation using existing dense matrix product routines. The algorithm minimizes the approximation error for the Frobenius norm of the resulting matrix while keeping the approximation unbiased.
We study the application of CRS and its variants to training Multi Layer Perceptron (MLP) on the MNIST dataset, and observe no impact on training accuracy while performing as little as 40% of computations.
Encouraged by these results, we turn to approximating the training of Convolutional Neural Networks (CNNs). We generalize CRS to approximating multi-channel convolutions and analyze the approximation error to derive the optimal sampling policy.
Inspired by prior works on gradient resilience to partial updates (Recht et al., 2011; Strom, 2015; Sun et al., 2017a), we apply more aggressive approximation of gradient computations. This allows us to further reduce the amount of computations when training on MNIST without affecting the resulting accuracy.
We demonstrate the utility of our approach on different network architectures and datasets as summarized in Table 1. CRS approximation saves 30-80% of the computational cost with little or no degradation in model accuracy. The compute reduction column shows the relative amount of computations saved by our method in matrix multiplications and convolutions, as we explain in detail in Section 3.3.
Network | Dataset | Compute reduction | Accuracy | Baseline accuracy
MLP | MNIST | 79% | 98.1% | 98.16%
CNN | MNIST | 77% | 99.25% | 99.29%
LSTM | PTB | 24% | 83.6 perplexity | 83.5 perplexity
WRN-28-10 (Zagoruyko and Komodakis, 2016) | CIFAR-100 | 50% | 77.1% | 78%
This paper makes the following contributions:

We explore the application of general approximation algorithms for tensor operations to DNN training,

We develop a novel algorithm for fast approximation of multichannel convolution,

We show that our approach can significantly reduce the computational cost of training several popular neural network architectures with little or no accuracy degradation.
2 Related work
To the best of our knowledge, we are the first to study the application of general approaches to approximating tensor operations to speed up DNN training. However, there have been several prior efforts to accelerate DNN computations via approximation which we survey below.
Several works employ model compression to accelerate inference (Denton et al., 2014; Jaderberg et al., 2014; Lebedev et al., 2014; Osawa et al., 2017; Gong et al., 2014; Han et al., 2015; Sun et al., 2017b). A large body of work is devoted to quantization and the use of low-precision datatypes (see for example Hubara et al., 2016; Micikevicius et al., 2017; Seide et al., 2014; Wen et al., 2017). Approximation has been used to extrapolate weight values instead of performing backpropagation iterations (Kamarthi and Pittner, 1999). Several works address communication and synchronization bottlenecks in distributed training by allowing delayed weight updates (Recht et al., 2011; Strom, 2015). Another approach enforces low-rank structure on the layers, resulting in lower computational cost both for training and inference (Mamalet and Garcia, 2012; Kuchaiev and Ginsburg, 2017). These methods are all different from ours and can potentially be applied in combination with approximate tensor operations.
Dropout(Srivastava et al., 2014) is related to CRS, but it has been primarily evaluated in the context of preventing overfitting rather than for approximate computations. We elaborate on the connection between CRS and dropout in Section 6.
Another closely related work is meProp (Sun et al., 2017a), which computes a small subset of the gradients during backpropagation. We note that the simple unified top-k variant of meProp is in fact a particular case of CRS, applied only to backpropagation and with a different sampling criterion. We show in Section 5 that CRS enables greater savings than meProp.
3 Approximating matrix multiplication for DNN training
There are several known algorithms for approximating matrix product; however, only the algorithms that meet the following requirements will be effective for general DNN training. First, the algorithm should apply to any input matrices regardless of the dimensions or their constituent values. Second, to be effective in reducing the training time, the total computational cost of the approximate multiplication, including input transformation, should be smaller than the cost of the original matrix product. Last, the algorithm should be amenable to efficient implementation on commodity hardware.
With these criteria in mind we now consider the following algorithms:
Random walk (Cohen and Lewis, 1999) This algorithm performs random walks on a graph representation of the input matrices. However, it is applicable to nonnegative matrices only, which is not the case for matrices typically encountered in DNN training.
Random projections (Sarlos, 2006; Clarkson and Woodruff, 2009; Magen and Zouzias, 2011) The two matrices to be multiplied are first projected into a lower-dimensional subspace by a scaled random sign matrix. These algorithms require both input matrices to be roughly square, otherwise the cost of projection will be similar to the original product. In DNNs, however, it is common for one dimension to be smaller than the other.
FFT (Pagh, 2013; Kutzkov, 2013) These algorithms represent each column-row outer product as a polynomial multiplication and then calculate it using Fast Fourier Transform. The complexity depends on the sparsity of the input matrices, decreasing as the sparsity increases. Therefore, these algorithms might not be effective for computing fully-connected layers, which are generally represented by dense matrices.
SVD (Drineas and Kannan, 2001; Denton et al., 2014; Osawa et al., 2017) Several algorithms replace one input matrix with its low-rank approximation using truncated SVD. These algorithms are suitable for inference where the weight matrix factorization can be precomputed offline, but are not applicable to training since the high cost of factorization is incurred in every matrix product.
Column-row sampling (CRS) (Drineas and Kannan, 2001; Drineas et al., 2006) The sampling algorithm approximates the matrix product $AB$ by sampling columns of $A$ and respective rows of $B$ to form smaller matrices, which are then multiplied as usual. We choose CRS as the basis for our current work, because it meets all the criteria above: it is applicable to fully-connected layers of any size, its effectiveness does not depend on the matrix contents, its sampling is computationally lightweight, and it may use regular matrix multiplication algorithms since the sampled sub-matrices remain dense.
3.1 CRS (Drineas and Kannan, 2001; Drineas et al., 2006)
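Let $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$. Their product $AB$ is approximated as a weighted sum of outer products between sampled columns of $A$ and corresponding rows of $B$:
$$AB \approx \sum_{t=1}^{k} \frac{1}{k p_{i_t}} A^{(i_t)} B_{(i_t)} \qquad (1)$$
where $A^{(i)}$, $B_{(i)}$ denote the matrix i'th column and row respectively, $k$ is the number of samples (satisfying $1 \le k \le n$), $\{p_i\}_{i=1}^{n}$ is a probability distribution over the column-row pairs of $A$ and $B$, and $i_t \in \{1, \dots, n\}$ is the index of the t'th sampled pair. This algorithm allows linear reduction in complexity from $O(mnp)$ to $O(mkp)$.
Eq. 1 can also be expressed as $AB \approx CDR$, where $C \in \mathbb{R}^{m \times k}$ is a matrix of the sampled columns of $A$, $R \in \mathbb{R}^{k \times p}$ comprises the respective rows of $B$ and $D \in \mathbb{R}^{k \times k}$ is a diagonal matrix with $D_{tt} = \frac{1}{k p_{i_t}}$ being the scaling factor that makes the approximation unbiased:
$$\mathbb{E}[CDR] = AB \qquad (2)$$
Drineas et al. (2006) derive the upper bounds for the spectral and Frobenius norms of the error matrix $AB - CDR$. They show that the error is minimized when the sampling probabilities are proportional to the product of the column-row Euclidean norms:
$$p_i = \frac{\lVert A^{(i)} \rVert_2 \, \lVert B_{(i)} \rVert_2}{\sum_{j=1}^{n} \lVert A^{(j)} \rVert_2 \, \lVert B_{(j)} \rVert_2} \qquad (3)$$
We refer to sampling using this probability distribution as Norm-Proportional Sampling (NPS).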
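As an illustration of Eq. 1 and Eq. 3, the following is a minimal NumPy sketch of CRS with replacement and norm-proportional probabilities. The function names `nps_probabilities` and `crs_matmul`, the matrix sizes and the normalization of the reported error are assumptions made for this example and are not taken from the paper's implementation.

```python
import numpy as np

def nps_probabilities(A, B):
    """Norm-proportional sampling probabilities over column-row pairs (Eq. 3)."""
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    return norms / norms.sum()

def crs_matmul(A, B, k, rng, probs=None):
    """Approximate A @ B from k column-row pairs sampled with replacement (Eq. 1)."""
    n = A.shape[1]
    if probs is None:
        probs = nps_probabilities(A, B)
    idx = rng.choice(n, size=k, replace=True, p=probs)
    scale = 1.0 / (k * probs[idx])   # diagonal entries of D
    C = A[:, idx] * scale            # sampled columns of A, already scaled (C @ D)
    R = B[idx, :]                    # corresponding rows of B
    return C @ R                     # dense (m x k) by (k x p) product

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 200))
B = rng.standard_normal((200, 32))
approx = crs_matmul(A, B, k=100, rng=rng)
rel_err = np.linalg.norm(A @ B - approx) / (np.linalg.norm(A) * np.linalg.norm(B))
```

The sampled sub-matrices stay dense, so the final product can reuse a regular matrix multiplication routine, which is the property that motivates the choice of CRS above.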
We consider different variants of the CRS algorithm:

Sampling with or without replacement

Sampling with uniform distribution versus the distribution given by Eq. 3 (NPS)

Using or omitting the scaling factor
See Drineas et al. (2006) for the derivation of error bounds.
In addition, we introduce a deterministic top-k sampling, which chooses the k column-row pairs with the largest product of their Euclidean norms without scaling. The intuition is that this policy chooses the samples that would be assigned the highest probabilities by NPS (Eq. 3) while avoiding the overhead of generating random samples.
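The deterministic top-k policy can be sketched in a few lines of NumPy. This is only an illustration under the definition above; the helper name `topk_matmul` and the toy matrices are hypothetical.

```python
import numpy as np

def topk_matmul(A, B, k):
    """Keep the k column-row pairs with the largest norm product, without scaling."""
    scores = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    idx = np.argpartition(scores, -k)[-k:]   # indices of the k largest scores
    return A[:, idx] @ B[idx, :]

# Toy example: the column norms of A are 1, 5 and 2, so k=2 keeps pairs 1 and 2.
A = np.diag([1.0, 5.0, 2.0])
B = np.eye(3)
approx = topk_matmul(A, B, k=2)
```

No random numbers are generated, and selecting the indices needs only one pass to compute the norms plus a partial sort.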
3.2 Approximate matrix multiplication on synthetic data
We study the approximation quality and the computational cost of the CRS variants for stand-alone matrix multiplication. We generate random matrices and evaluate all the CRS variants while computing per-element, spectral norm and Frobenius norm error. We report only the latter because we find that the error metrics are equivalent w.r.t. the relative quality of the algorithms. For example, the algorithm with the lowest Frobenius norm error is also the one that yields the lowest spectral norm and per-element error. Using matrices of other sizes yields similar results.
Figures 0(a),0(b) show the approximation error for different variants of the CRS algorithm and different sampling ratios, averaged over 1000 runs. We show the Frobenius norm error metric, as well as its data-independent upper bound for NPS with replacement (Theorem 1 in Drineas et al. (2006)).
We observe that (1) sampling without replacement outperforms sampling with replacement, which agrees with the theoretical bounds (for uniform sampling) in (Drineas and Kannan, 2001); (2) deterministic top-k produces similar results to NPS without replacement and without scaling; (3) when one matrix or both have i.i.d. entries with zero mean, random individual column-row products are unbiased estimators of the result. In this case, multiplying by the scaling factor only increases the error.
Therefore, the use of deterministic top-k sampling without scaling appears to be preferable as it results in lower error, yet it is simple and computationally lightweight.
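Observation (3) can be reproduced on a small scale with the sketch below. It compares uniform sampling without replacement with and without the scaling factor on zero-mean Gaussian matrices. The matrix sizes, the number of trials, the scaling factor n/k and the normalization of the error by the product of Frobenius norms are assumptions made for this example, not the paper's exact setup.

```python
import numpy as np

def sampled_product(A, B, idx, scale=1.0):
    """Product of the selected column-row pairs, optionally scaled."""
    return scale * (A[:, idx] @ B[idx, :])

def mean_relative_errors(m=64, n=200, p=32, k=100, trials=20, seed=0):
    rng = np.random.default_rng(seed)
    errs = {"unscaled": [], "scaled": []}
    for _ in range(trials):
        A = rng.standard_normal((m, n))   # i.i.d. zero-mean entries
        B = rng.standard_normal((n, p))
        exact = A @ B
        denom = np.linalg.norm(A) * np.linalg.norm(B)
        idx = rng.choice(n, size=k, replace=False)   # uniform, without replacement
        errs["unscaled"].append(
            np.linalg.norm(exact - sampled_product(A, B, idx)) / denom)
        errs["scaled"].append(
            np.linalg.norm(exact - sampled_product(A, B, idx, n / k)) / denom)
    return {name: float(np.mean(vals)) for name, vals in errs.items()}

errors = mean_relative_errors()
```

For zero-mean inputs the dropped outer products have zero expectation, so the unscaled estimate only loses their contribution, whereas scaling also amplifies the noise of the retained pairs.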
3.3 Approximate training of Multi Layer Perceptron
Evaluation methodology.
Here and in the rest of the paper we estimate the speedup as the amount of computations saved due to approximation. Specifically, we denote by compute reduction the proportion of the multiply-accumulate operations in the fully-connected and convolutional layers in the forward and backward passes saved due to approximation out of the total computations in these layers performed in the exact training. We neglect the cost of activation functions and other element-wise operations, as well as the cost of the sampling itself, since it involves a single pass over the input and has lower asymptotic complexity than the exact computation. We believe that compute reduction is a reliable estimate of the potential training time savings because tensor operations by far dominate the DNN training time. We leave the efficient implementation of the CRS approximation for future work.
Execution environment.
We perform our experiments in TensorFlow (Abadi et al., 2016) and replace exact tensor operations with their CRS-based approximations. Only column-row pairs sampled in the forward pass are used during backpropagation because the rest do not affect the loss function. Hence, sampling in the forward pass reduces the amount of computations in the backward pass by the same ratio. We apply approximations only during training, and use exact computations to evaluate the model on the test/validation sets.
We evaluate the impact of different CRS variants on the training accuracy of a 3-layer Multi Layer Perceptron (MLP) (98.16% accuracy on MNIST (LeCun et al., 1998) with exact computations). The model and training details are found in supplementary material. Figure 0(c) shows the results. Deterministic top-k performs the best along with NPS without replacement and without scaling. There is no loss in model accuracy when using only 40% of the original column-row pairs while training with the original hyperparameters. Overall, the results are similar to those of approximating matrix product (Figure 0(b)). This is not surprising given our empirical observation that the network weights are indeed close to being symmetrically distributed around zero during training.
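The compute reduction metric defined above can be written as a short helper. This is a sketch of the arithmetic only; the function name `compute_reduction` and the example layer costs are hypothetical and do not correspond to a specific model in the paper.

```python
def compute_reduction(layer_macs, sampling_ratios):
    """Fraction of multiply-accumulate operations saved by sampling.

    layer_macs: MACs of each tensor operation in exact training
                (forward and backward passes combined).
    sampling_ratios: fraction of column-row pairs (or input channels) kept
                     for each operation; 1.0 means it is computed exactly.
    """
    total = sum(layer_macs)
    kept = sum(macs * ratio for macs, ratio in zip(layer_macs, sampling_ratios))
    return 1.0 - kept / total

# Two equally expensive operations, only the first one sampled at 50%.
reduction = compute_reduction([1_000_000, 1_000_000], [0.5, 1.0])
```

Because sampling in the forward pass reduces the backward pass by the same ratio, one ratio per operation is enough to account for both passes.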
3.4 Approximate training of Recurrent Neural Networks
We consider a model similar to the medium Long Short-Term Memory (LSTM) model proposed by Zaremba et al. (2014) for language modeling on the Penn Tree Bank dataset (PTB) (Marcus et al., 1993). The model involves two weight matrices learned during training. One is the gates matrix that controls the state transfer; its dimensions depend on n, the hidden state size. The other transforms the LSTM output into log-probabilities of the next word to predict. See supplementary material for model and training details.
We use the CRS deterministic top-k algorithm that performs best for the MLP model. We apply 50% sampling to the matrix products of one of these two weight matrices only. This computation accounts for about half of the multiply-accumulate operations in the model, therefore the total compute reduction is 24%. The resulting test perplexity is 83.6 vs. 83.5 of the original model. The perplexity on the validation set is 87.9 vs. 87.5 of the original model. We adjust the learning rate to allow faster learning at early training stages (0.74 decay after 29 epochs instead of 0.8 after 6).
Applying 50% sampling to the entire model, however, results in degradation of test perplexity to 89.8. We also observe degradation when approximating each gate individually. Following ideas similar to (Lu et al., 2016; Bayer et al., 2014) we also check approximations of specific gates known to be more resilient to perturbations such as compression or dropout, but observe perplexity degradation as well.
4 Approximating convolutions
So far we considered approximation of fully-connected layers. In this section we extend the basic CRS algorithm to the approximation of multi-channel convolution.
We consider two approaches: (1) transforming convolution into matrix multiplication (e.g., as in cuDNN (Chetlur et al., 2014)) and then applying CRS; (2) devising a specialized algorithm for approximate convolution. We prefer the second approach to avoid introducing dependence of the approximation quality on a particular implementation of the transformation.
Our algorithm is a generalization of the CRS algorithm for matrices. In matrix multiplication, sampling is performed over the common dimension. The analogue for multi-channel convolution is to sample over the input channels dimension. As in the matrix case, the output dimensions remain the same.
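Formally, let $I \in \mathbb{R}^{B \times IH \times IW \times IC}$ be the input tensor, where B is the batch size, IH,IW are the input height and width respectively, and IC are the input channels. Let $K \in \mathbb{R}^{KH \times KW \times IC \times OC}$ be the kernels tensor, where KH,KW are the kernel height and width, and IC,OC are the input and output channels respectively. Let $O \in \mathbb{R}^{B \times OH \times OW \times OC}$ be the output tensor, where OH,OW are the output height and width.
The multi-channel convolution operation is defined as:
$$O_{b,oh,ow,oc} = \sum_{i=1}^{IC} \sum_{kh=1}^{KH} \sum_{kw=1}^{KW} I_{b,\,oh+kh-1,\,ow+kw-1,\,i} \, K_{kh,kw,i,oc} \qquad (4)$$
For notation simplicity, we assume zero padding. The inner sums in Eq. 4 can be written as convolutions of the input channels:
$$O = \sum_{i=1}^{IC} I^{[i]} * K_{[i]} \qquad (5)$$
where $I^{[i]} \in \mathbb{R}^{B \times IH \times IW \times 1}$ denotes a tensor with one input channel that corresponds to the i'th input channel of I, i.e. $I^{[i]}_{b,ih,iw,1} = I_{b,ih,iw,i}$. Similarly, $K_{[i]} \in \mathbb{R}^{KH \times KW \times 1 \times OC}$ corresponds to the i'th input channel of K.
This formulation immediately hints at the possibility to sample over the input channel dimension, similarly to sampling column-row pairs in matrices. We propose to approximate multi-channel convolution as a convolution of lower-rank tensors:
$$O \approx \sum_{t=1}^{k} \frac{1}{k p_{i_t}} I^{[i_t]} * K_{[i_t]} = \tilde{I} * \tilde{K} \qquad (6)$$
where $\{i_t\}_{t=1}^{k}$ are such that $i_t \in \{1, \dots, IC\}$ and $\{p_i\}_{i=1}^{IC}$ is a probability distribution over the input channels, $\tilde{I} \in \mathbb{R}^{B \times IH \times IW \times k}$ is a tensor composed of sampled input channels of I scaled by $\sqrt{\frac{1}{k p_{i_t}}}$, and $\tilde{K} \in \mathbb{R}^{KH \times KW \times k \times OC}$ is a tensor composed of corresponding sampled input channels of K scaled by the same factor.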
Computing the convolution of the smaller tensors can be done using standard efficient convolution implementations.
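The channel-sampling idea above can be sketched in NumPy as follows. The sketch assumes stride 1, no padding, an input layout of (batch, height, width, channels) and a kernel layout of (height, width, input channels, output channels). It uses a simple loop-based convolution in place of an optimized routine, and defaults to uniform probabilities. The names `conv2d` and `approx_conv2d` are hypothetical.

```python
import numpy as np

def conv2d(I, K):
    """Exact multi-channel convolution (cross-correlation), stride 1, no padding."""
    B, IH, IW, IC = I.shape
    KH, KW, _, OC = K.shape
    OH, OW = IH - KH + 1, IW - KW + 1
    O = np.zeros((B, OH, OW, OC))
    for kh in range(KH):
        for kw in range(KW):
            patch = I[:, kh:kh + OH, kw:kw + OW, :]   # (B, OH, OW, IC)
            O += patch @ K[kh, kw]                    # contract over input channels
    return O

def approx_conv2d(I, K, k, rng, probs=None):
    """Sample k input channels with replacement and convolve the smaller tensors."""
    IC = I.shape[-1]
    if probs is None:
        probs = np.full(IC, 1.0 / IC)                 # uniform sampling
    idx = rng.choice(IC, size=k, replace=True, p=probs)
    scale = np.sqrt(1.0 / (k * probs[idx]))           # applied to both tensors
    I_s = I[..., idx] * scale                         # (B, IH, IW, k)
    K_s = K[:, :, idx, :] * scale[:, None]            # (KH, KW, k, OC)
    return conv2d(I_s, K_s)

rng = np.random.default_rng(0)
I = rng.standard_normal((2, 6, 6, 4))
K = rng.standard_normal((3, 3, 4, 5))
exact = conv2d(I, K)
approx = approx_conv2d(I, K, k=2, rng=rng)
```

The output shape is unchanged by the sampling; only the contracted channel dimension shrinks from IC to k.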
The properties of the approximation in Eq. 6 can be derived similarly to the CRS derivations for matrix multiplication. In particular, we prove the following (see supplementary material):
Proposition 1
The approximation is unbiased, i.e., its expected value equals the exact output of the multi-channel convolution in Eq. 4.
Proposition 2
The expected Frobenius norm of the error tensor is minimized when the sampling probabilities are:
(7) 
where
The additional terms in Eq. 7 emerge for convolutions when the kernel spatial dimensions are greater than one. However, computing them is too expensive, precluding efficient implementation of the approximate version. We therefore omit them and verify empirically whether the resulting norm-proportional probabilities (with $I^{[i]}$ and $K_{[i]}$ denoting the i'th input channel of I and K respectively):
$$p_i = \frac{\lVert I^{[i]} \rVert_F \, \lVert K_{[i]} \rVert_F}{\sum_{j=1}^{IC} \lVert I^{[j]} \rVert_F \, \lVert K_{[j]} \rVert_F} \qquad (8)$$
yield better results than the uniform sampling. Intuitively, in some (common) cases the omitted terms are much smaller than the norm product retained in Eq. 8, so their omission does not significantly impact the final outcome. One omitted term amounts to the outer spatial dimensions of the input not being convolved with the entire kernel, so it is likely to be smaller than the Frobenius norm of the whole input. The other is the sum of products of different input and kernel entries. If different kernels are weakly correlated and weights are centered around zero, this sum will include terms of similar magnitudes but opposite signs.
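The simplified norm-proportional probabilities of Eq. 8, and the deterministic top-k channel selection built on them, can be sketched as below. The tensor layouts (channels last for the input, input channels on the third axis of the kernel) and the function names are assumptions for this illustration.

```python
import numpy as np

def channel_probabilities(I, K):
    """Norm-proportional probabilities over input channels (Eq. 8)."""
    in_norms = np.sqrt((I ** 2).sum(axis=(0, 1, 2)))   # Frobenius norm per input channel of I
    k_norms = np.sqrt((K ** 2).sum(axis=(0, 1, 3)))    # Frobenius norm per input channel of K
    scores = in_norms * k_norms
    return scores / scores.sum()

def topk_channels(I, K, k):
    """Keep the k input channels with the largest norm product, without scaling."""
    probs = channel_probabilities(I, K)
    idx = np.sort(np.argpartition(probs, -k)[-k:])
    return I[..., idx], K[:, :, idx, :]

# Toy tensors: channel 1 of the input is ten times larger than the others.
I = np.ones((2, 4, 4, 3))
I[..., 1] *= 10.0
K = np.ones((3, 3, 3, 5))
probs = channel_probabilities(I, K)
I_small, K_small = topk_channels(I, K, k=1)
```

The reduced tensors can then be passed to any standard convolution routine, since only the input channel dimension changes.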
4.1 Approximate training of Convolutional Neural Networks
We evaluate our extended CRS algorithm applied both to matrix multiplications and convolutions.
Small CNN.
We first consider a small CNN (99.35% accuracy on MNIST with exact computations). The training is performed using approximate tensor operations while keeping the original training hyperparameters unchanged. See the supplementary material for model and training details.
Figure 1(a) shows the results of different CRS variants. For low sampling ratios, top-k and NPS without replacement and without scaling outperform the other policies, as in the MLP case.
The learning curves in Figure 1(b) show that using only 30% of the channels via top-k sampling is almost equivalent to training with full computations, achieving final accuracy of 99.21%. This level of sampling results in about 68% compute reduction since the first convolution layer has only one input channel and therefore is computed exactly. The figure also shows that removing an additional 10% of the computations achieves the same accuracy but with slower convergence.
Large CNN.
We also consider the Wide ResNet-28-10 model (Zagoruyko and Komodakis, 2016) (WRN-28-10) trained on the CIFAR-100 dataset (Krizhevsky and Hinton, 2009). See supplementary material for model and training details. We evaluate only the deterministic top-k CRS that shows good results for smaller models. We use approximation for all the convolutional layers except for the input layer. Initial experiments show a significant drop in accuracy if the input layer is sampled, possibly because it has only three channels. Further, we do not approximate the single fully-connected layer since it amounts only to 0.01% of the total computations in WRN-28-10.
Our results show that by using only 50% of the channels the model achieves 77.1% accuracy on the CIFAR-100 dataset vs. 78% of the original model. Moreover, the accuracy reaches 77.5% if we sample only the larger convolutional layers with 320 and 640 filters. In total, the compute reduction is 50% and 33% respectively. Figure 1(c) shows that the learning curves of the approximate training closely follow the curve of the exact computations. Note that these results are obtained using the same hyperparameters as in the original model.
5 Aggressive approximation in backpropagation
Prior work shows that partial gradient computations do not affect the accuracy (Recht et al., 2011; Strom, 2015; Sun et al., 2017a). Specifically, meProp (Sun et al., 2017a) computes only a small subset of the gradients in fully-connected layers without accuracy loss. In fact, the computation of the gradient in the simple unified top-k variant of meProp can be viewed as a particular case of CRS that uses a different deterministic sampling criterion based on the gradient matrix only. This prompts us to apply more aggressive approximation to the tensor operations in backpropagation, on top of the compute reduction due to sampling in the forward pass.
Table 2 shows the results of training the MLP and CNN models on MNIST using CRS with deterministic top-k sampling. We retain at least 10 samples per tensor operation, thereby sampling larger layers more aggressively. We compare to the meProp simple unified top-k algorithm (Sun et al., 2017a) that also uses compute-efficient sampling patterns. The aggressive sampling in backpropagation achieves the accuracy of the exact training with the compute reduction of about 80%. The accuracy for the MLP is the same as reported for meProp (Sun et al., 2017a), but CRS further reduces the amount of computations compared to meProp (80% vs. 55% compute reduction in Table 2).
Network | Method | Forward sampling | Backprop sampling | Compute reduction | Accuracy | Baseline accuracy (no sampling)
MLP | meProp (Sun et al., 2017a) | | 6% | 55% | 98.08% | 98.01%
MLP | CRS | 40% | 5% | 80% | 98.1% |
CNN | meProp | | | | | 99.29%
CNN | CRS | 40% | 20% | 78% | 99.25% |
Larger networks, however, are more sensitive to additional sampling in backpropagation. Both the LSTM model and WRN-28-10 had a noticeable drop in accuracy, even when the additional backpropagation sampling ratio exceeds 50%.
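To make the backpropagation sampling in this section concrete, the sketch below applies deterministic top-k CRS to one gradient product of a fully-connected layer Y = XW, namely the input gradient dX = dY W^T, whose common dimension is the output features. Which backward products are sampled, the rounding rule and the way the floor of 10 samples is applied are assumptions of this example and not a description of the authors' code. All function names are hypothetical.

```python
import numpy as np

def num_samples(n, ratio, min_samples=10):
    """Number of retained pairs: ratio * n, but at least min_samples (capped at n)."""
    return min(n, max(min_samples, int(round(ratio * n))))

def topk_crs(A, B, k):
    """Deterministic top-k CRS: keep the k pairs with the largest norm product."""
    scores = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    idx = np.argpartition(scores, -k)[-k:]
    return A[:, idx] @ B[idx, :]

def approx_input_grad(dY, W, ratio):
    """Approximate dX = dY @ W.T for a layer Y = X @ W (assumed layout)."""
    k = num_samples(W.shape[1], ratio)   # common dimension: output features
    return topk_crs(dY, W.T, k)

rng = np.random.default_rng(0)
dY = rng.standard_normal((32, 500))      # gradient w.r.t. the layer output
W = rng.standard_normal((784, 500))      # layer weights
dX_approx = approx_input_grad(dY, W, ratio=0.05)
```

With a fixed floor on the number of samples, a 5% ratio keeps 25 of 500 pairs, whereas a layer with 100 output features would still keep 10 pairs. Under this reading, the floor affects only the narrower layers.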
6 Discussion
CRS interpretation in the context of DNN training.
The use of the CRS algorithm in DNN training results in choosing a different subset of features in each iteration. Both top-k and NPS sampling prioritize the features and weights that jointly dominate in the current minibatch and in the network, and therefore are more likely to influence the prediction. Thus, different minibatch compositions trigger learning of different features.
CRS and Dropout.
The CRS approximation is closely related to Dropout (Srivastava et al., 2014). In Dropout, activations of each minibatch example are randomly zeroed. Applying CRS is also equivalent to zeroing activations, albeit the same activations across the entire minibatch. Furthermore, CRS with replacement results in column-row sampling that follows a multinomial probability distribution, which has also been studied for Dropout (Li et al., 2016). This connection might serve as the basis for analyzing the properties of CRS in DNN training by leveraging similar analysis for Dropout (for example (Wager et al., 2013; Baldi and Sadowski, 2014)), which we leave for future work.
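The equivalence between CRS without scaling and zeroing the same activations for every example in the minibatch can be checked numerically. The shapes and the retained indices below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 6))   # minibatch of activations (8 examples, 6 features)
W = rng.standard_normal((6, 4))   # layer weights
idx = np.array([0, 2, 5])         # retained column-row pairs (hypothetical choice)

# CRS without scaling: multiply only the selected columns of X and rows of W.
sampled = X[:, idx] @ W[idx, :]

# Dropout-style view: zero the other features, identically for every example.
mask = np.zeros(6)
mask[idx] = 1.0
masked = (X * mask) @ W
```

The two results match, but the CRS form multiplies smaller dense matrices, which is where the compute saving comes from.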
7 Conclusions
This paper shows that general approximate tensor operations applied to DNN training are effective in significantly reducing the amount of computations without affecting the model accuracy. It generalizes the column-row sampling algorithms for approximating matrix products to multi-channel convolution and demonstrates promising results on several popular DNN architectures.
Our work opens new opportunities for systematic approximation of DNN training. Future research directions include efficient implementation of approximation algorithms for training, theoretical analysis of their properties, use of different approximation algorithms in suitable scenarios, combination with other approximation techniques and dynamic tuning of the approximation level for different layers or training stages.
References
 Hubara et al. [2016] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016.
 Micikevicius et al. [2017] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.
 Seide et al. [2014] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
 Wen et al. [2017] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1508–1518, 2017.
 Mamalet and Garcia [2012] Franck Mamalet and Christophe Garcia. Simplifying convnets for fast learning. In International Conference on Artificial Neural Networks, pages 58–65. Springer, 2012.
 Kuchaiev and Ginsburg [2017] Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for lstm networks. arXiv preprint arXiv:1703.10722, 2017.
 Kamarthi and Pittner [1999] Sagar V Kamarthi and Stefan Pittner. Accelerating neural network training using weight extrapolations. Neural networks, 12(9):1285–1299, 1999.
 Recht et al. [2011] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pages 693–701, 2011.
 Strom [2015] Nikko Strom. Scalable distributed dnn training using commodity gpu cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
 Sun et al. [2017a] Xu Sun, Xuancheng Ren, Shuming Ma, and Houfeng Wang. meprop: Sparsified back propagation for accelerated deep learning with reduced overfitting. In International Conference on Machine Learning, pages 3299–3308, 2017a.
 Cohen and Lewis [1999] Edith Cohen and David D Lewis. Approximating matrix multiplication for pattern recognition tasks. Journal of Algorithms, 30(2):211–252, 1999.
 Drineas and Kannan [2001] Petros Drineas and Ravi Kannan. Fast Monte-Carlo algorithms for approximate matrix multiplication. In FOCS, volume 1, pages 452–459, 2001.
 Drineas et al. [2006] Petros Drineas, Ravi Kannan, and Michael W Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal on Computing, 36(1):132–157, 2006.
 Magen and Zouzias [2011] Avner Magen and Anastasios Zouzias. Low rank matrix-valued Chernoff bounds and approximate matrix multiplication. In Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms, pages 1422–1436. SIAM, 2011.
 Sarlos [2006] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on, pages 143–152. IEEE, 2006.
 Clarkson and Woodruff [2009] Kenneth L Clarkson and David P Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 205–214. ACM, 2009.
 Pagh [2013] Rasmus Pagh. Compressed matrix multiplication. ACM Transactions on Computation Theory (TOCT), 5(3):9, 2013.
 Kutzkov [2013] Konstantin Kutzkov. Deterministic algorithms for skewed matrix products. In 30th International Symposium on Theoretical Aspects of Computer Science, page 466, 2013.
 Zagoruyko and Komodakis [2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
 Denton et al. [2014] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pages 1269–1277, 2014.
 Jaderberg et al. [2014] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
 Lebedev et al. [2014] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
 Osawa et al. [2017] Kazuki Osawa, Akira Sekiya, Hiroki Naganuma, and Rio Yokota. Accelerating matrix multiplication in deep learning by using low-rank approximation. In High Performance Computing & Simulation (HPCS), 2017 International Conference on, pages 186–192. IEEE, 2017.
 Gong et al. [2014] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
 Han et al. [2015] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
 Sun et al. [2017b] Xu Sun, Xuancheng Ren, Shuming Ma, Bingzhen Wei, Wei Li, and Houfeng Wang. Training simplification and model simplification for deep learning: A minimal effort back propagation method. arXiv preprint arXiv:1711.06528, 2017b.
 Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Abadi et al. [2016] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
 LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Zaremba et al. [2014] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
 Marcus et al. [1993] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational linguistics, 19(2):313–330, 1993.
 Lu et al. [2016] Zhiyun Lu, Vikas Sindhwani, and Tara N Sainath. Learning compact recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 5960–5964. IEEE, 2016.
 Bayer et al. [2014] Justin Bayer, Christian Osendorfer, Daniela Korhammer, Nutan Chen, Sebastian Urban, and Patrick van der Smagt. On fast dropout and its applicability to recurrent networks. In Proc. ICLR, 2014.
 Chetlur et al. [2014] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
 Krizhevsky and Hinton [2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
 Li et al. [2016] Zhe Li, Boqing Gong, and Tianbao Yang. Improved dropout for shallow and deep learning. In Advances in Neural Information Processing Systems, pages 2523–2531, 2016.
 Wager et al. [2013] Stefan Wager, Sida Wang, and Percy S Liang. Dropout training as adaptive regularization. In Advances in neural information processing systems, pages 351–359, 2013.
 Baldi and Sadowski [2014] Pierre Baldi and Peter Sadowski. The dropout learning algorithm. Artificial intelligence, 210:78–122, 2014.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Supplementary Material
Proofs
The following proofs follow the same lines as those of [Drineas et al., 2006], generalizing them to multi-channel convolutions (zero-padding assumed).
Proposition 1.
Suppose is a probability distribution over and are such that .
Proof.
We show that every satisfies .
For , define .
Using Eq. 6 we can write .
Taking the expectation, we get:
(10) 
∎
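The expectation argument has a simple matrix analogue in the column-row sampling of [Drineas et al., 2006], which these proofs generalize. The NumPy sketch below is illustrative only: the function names `column_row_sample` and `exact_expectation`, the matrix shapes, and the uniform sampling distribution are assumptions made for the example and are not taken from the paper. It draws k column-row pairs with replacement, rescales each pair by 1/(k p_i), and evaluates the expectation of a single-sample estimate exactly by summing over all indices.

```python
import numpy as np

def column_row_sample(A, B, k, probs, rng):
    """Estimate A @ B from k column-row pairs drawn i.i.d. from probs."""
    idx = rng.choice(A.shape[1], size=k, p=probs)
    scale = 1.0 / (k * probs[idx])  # rescaling that makes the estimator unbiased
    return (A[:, idx] * scale) @ B[idx, :]

def exact_expectation(A, B, probs):
    """Expectation of a one-sample estimate: sum_i p_i * (A[:, i] B[i, :] / p_i)."""
    n = A.shape[1]
    return sum(probs[i] * np.outer(A[:, i], B[i, :]) / probs[i] for i in range(n))

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 3))
probs = np.full(6, 1.0 / 6)  # any distribution with strictly positive entries works

# A single random estimate built from k = 3 of the 6 column-row pairs.
estimate = column_row_sample(A, B, k=3, probs=probs, rng=rng)
```

A single `estimate` generally differs from `A @ B`; only its expectation matches, which is what `exact_expectation` evaluates. The choice of `probs` affects the variance of the estimate but not its mean.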
Lemma 1.
Under the same assumptions as in Proposition 1:
(11) 
Proof.
(13) 
From Eq. 10 we get .
Substituting both expressions in Eq. 12 and expanding concludes the proof. ∎
Proposition 2.
Under the same assumptions as in Proposition 1:
(14) 
where
The expected error is minimized when the sampling probabilities are:
(15) 
Remark.
Here we use the generalization of the Frobenius norm to tensors. For a tensor T of rank r: \|T\|_F = \sqrt{\sum_{j_1, \dots, j_r} T_{j_1 \dots j_r}^2}
Proof.
Note that:
(16) 
Substituting the result from Lemma 1:
(17) 
This expression includes 3 terms. The first involves products between each element of and all the corresponding entries in , except for the upper and left edges of . We therefore add and subtract the correction term to get:
(18) 
The second term is .
The third term can be written as
To find that minimize the expression in Eq. 14 it is enough to minimize the function under the constraints and . We can write the numerator as because the expression in Eq. 13 is nonnegative.
This minimization problem has a straightforward solution in Lemma 4 of [Drineas et al., 2006], which is .
In our case, , and therefore the optimal probabilities are:
(19) 
∎
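The generic form of the minimization at the end of this proof can be checked numerically. For positive weights a_i, the sum of a_i^2 / p_i over probability vectors with positive entries is minimized by p_i proportional to a_i, a standard consequence of the Cauchy-Schwarz inequality. The sketch below is a sanity check under assumed values: the weights in `alphas` and the function names are hypothetical and do not correspond to any quantity computed in the paper.

```python
def objective(alphas, probs):
    """Value of sum_i alpha_i^2 / p_i for positive weights alpha_i."""
    return sum(a * a / p for a, p in zip(alphas, probs))

def proportional_probs(alphas):
    """Probabilities p_i = alpha_i / sum_j alpha_j."""
    total = sum(alphas)
    return [a / total for a in alphas]

alphas = [1.0, 2.0, 3.0, 4.0]         # hypothetical positive weights
optimal = proportional_probs(alphas)  # proportional to the weights
uniform = [0.25] * 4                  # a competing distribution
skewed = [0.4, 0.3, 0.2, 0.1]         # another competing distribution

# At the proportional distribution the objective equals (sum of alphas)^2.
best = objective(alphas, optimal)
```

Comparing `best` with `objective(alphas, uniform)` and `objective(alphas, skewed)` shows that both alternatives give a larger value.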
Implementation details
MLP for MNIST
The MNIST dataset [LeCun et al., 1998] includes 60K training examples and 10K test examples. We use 5K examples as a validation set. Each example is a grayscale image of a handwritten digit.
Our MLP model contains the following layers:

fully-connected layer with ReLU activations

fully-connected layer

fully-connected layer

Softmax
We use the Adam optimizer [Kingma and Ba, 2014] with default parameters (learning rate = 0.001, with the remaining hyperparameters at their defaults). The loss function is cross-entropy. We use a minibatch size of 50 and train the model for 20 epochs, reporting the test accuracy of the model with the highest accuracy on the validation set.
LSTM for PTB
The Penn Tree Bank dataset (PTB) [Marcus et al., 1993] has a vocabulary of 10K words and consists of 929K training words, 73K validation words, and 82K test words.
Our network is based on [Zaremba et al., 2014] with the implementation in https://github.com/tensorflow/models/tree/master/tutorials/rnn/ptb.
The network includes 2 layers of LSTM cells with a hidden size of 650, unrolled for 35 steps. The model is trained with a minibatch size of 20 and gradient clipping of 5. The perplexity is reported on the test set after 39 training epochs. The learning rate decay is 0.8, starting after 6 epochs, for the full computations, and 0.74, starting after 29 epochs, for the approximations. The non-recurrent connections are regularized with Dropout with .
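The two decay settings above can be sketched as a per-epoch schedule. This is a sketch under assumptions rather than the authors' code: it assumes the decay factor is applied once per epoch after the starting epoch, as is common for this kind of schedule, and the initial learning rate `BASE_LR` is a placeholder because the text does not state it. The function name is hypothetical.

```python
def decayed_learning_rate(epoch, base_lr, decay, start_epoch):
    """Learning rate for a 1-indexed epoch under per-epoch exponential decay.

    The rate stays at base_lr through start_epoch and is then multiplied
    by `decay` once for every further epoch.
    """
    return base_lr * decay ** max(epoch - start_epoch, 0)

BASE_LR = 1.0  # assumption: the initial learning rate is not given in the text

# Schedules over the 39 training epochs for the two settings described above.
full = [decayed_learning_rate(e, BASE_LR, 0.8, 6) for e in range(1, 40)]
approx = [decayed_learning_rate(e, BASE_LR, 0.74, 29) for e in range(1, 40)]
```

Under these assumptions the approximate runs keep the initial learning rate for most of training and decay it only over the last 10 epochs.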
CNN for MNIST
The model is based on the implementation in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/mnist/mnist_deep.py
The network is composed of the following layers:

convolution layer with ReLU activation, followed by max pooling.

convolution layer with ReLU activation, followed by max pooling.

fully connected layer with ReLU activation.

fully connected layer.

Dropout layer with

Softmax
The model is trained using the Adam optimizer with default parameters (learning rate = 0.001, with the remaining hyperparameters at their defaults) and cross-entropy loss. We use a minibatch size of 50 and train the model for 20K iterations.
Wide ResNet-28-10 for CIFAR-100
The CIFAR-100 dataset consists of color images from 100 classes, split into a training set of 50K images and a test set of 10K images.
For WRN-28-10 we use the implementation in https://github.com/tensorflow/models/tree/master/research/resnet.
WRN-28-10 includes the following layers:

conv1: input convolution layer

conv2: eight convolution layers

conv3: eight convolution layers

conv4: eight convolution layers

Batch normalization, average pooling, and fully connected + softmax layers.
Every two subsequent convolution layers are followed by a residual connection that adds the input of these layers to their result. The first convolution in conv3 and in conv4 has a stride of 2, halving the spatial dimensions. For additional details see [Zagoruyko and Komodakis, 2016].
Image preprocessing includes padding to 36x36 followed by a random crop, horizontal flipping, and per-image whitening. The optimizer is Momentum with momentum = 0.9. The learning rate is 0.1 for the first 40K iterations, 0.01 until 60K, and 0.001 afterwards. We use an L2 weight decay of 0.002. The batch size is 128.
We train the model for 75K iterations and compare the accuracy on the test set for different approximation algorithms.
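The learning-rate schedule described above is piecewise constant and can be sketched as follows. The function name `wrn_learning_rate` is hypothetical, and the handling of the boundaries (0-indexed steps, with the rate changing exactly at steps 40,000 and 60,000) is an assumption, since the text does not specify it.

```python
def wrn_learning_rate(step):
    """Piecewise-constant learning rate for a 0-indexed training step."""
    if step < 40_000:
        return 0.1
    if step < 60_000:
        return 0.01
    return 0.001

TOTAL_STEPS = 75_000  # number of training iterations stated in the text

# Steps at which the learning rate differs from that of the previous step.
schedule_changes = [
    s for s in range(1, TOTAL_STEPS)
    if wrn_learning_rate(s) != wrn_learning_rate(s - 1)
]
```

With these boundaries, the last 15K of the 75K iterations run at the smallest learning rate.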