SpecNet: Spectral Domain Convolutional Neural Network
Abstract
The memory consumption of most Convolutional Neural Network (CNN) architectures grows rapidly with increasing depth of the network, which is a major constraint for efficient network training and inference on modern GPUs with limited memory. Several studies show that the feature maps (as generated after the convolutional layers) are the main bottleneck in this memory problem. Often, these feature maps mimic natural photographs in the sense that their energy is concentrated in the spectral domain. This paper proposes a Spectral Domain Convolutional Neural Network (SpecNet) that performs both the convolution and the activation operations in the spectral domain to achieve memory reduction. SpecNet exploits a configurable threshold to force small values in the feature maps to zero, allowing the feature maps to be stored sparsely. Since convolution in the spatial domain is equivalent to a dot product in the spectral domain, the multiplications only need to be performed on the nonzero entries of the (sparse) spectral domain feature maps. SpecNet also employs a special activation function that preserves the sparsity of the feature maps while effectively encouraging the convergence of the network. The performance of SpecNet is evaluated on three competitive object recognition benchmark tasks (MNIST, CIFAR-10, and SVHN), and compared with four state-of-the-art implementations (LeNet, AlexNet, VGG, and DenseNet). Overall, SpecNet is able to reduce memory consumption by about without significant loss of performance for all tested network architectures.
Bochen Guan†, Jinnian Zhang†, William A. Sethares, Richard Kijowski and Fang Liu
University of Wisconsin-Madison, Madison, WI 53705, USA
gbochen@wisc.edu, jinnian.zhang@wisc.edu, sethares@wisc.edu, rkijowski@wisc.edu, fliu37@wisc.edu
†These authors contributed equally to this work.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Introduction
Deep convolutional neural networks have made significant progress on various tasks in recent years (?; ?; ?; ?; ?). Current successful deep CNNs such as ResNet (?) and DenseNet (?) typically include over 100 layers and require large amounts of training data. Training these models becomes computationally and memory intensive, especially when limited resources are available (?). Therefore, it is essential to reduce the memory requirements to allow better network training and deployment, such as applying deep CNNs to embedded systems and cell phones.
Several studies (?; ?) show that the intermediate layer outputs (feature maps) are the primary contributors to this memory bottleneck. Existing methods such as model compression (?; ?; ?; ?) and scheduling (?; ?; ?) do not directly address the storage of feature maps. By transforming the convolutions into the spectral domain, we target the memory requirements of the feature maps.
In contrast to (?), which proposes an efficient encoded representation of feature maps in the spatial domain, we exploit the property that the energy of feature maps is concentrated in the spectral domain (?). Values that are less than a configurable threshold are forced to zero, so that the feature maps can be stored sparsely. We call this approach the Spectral Domain Convolutional Neural Network (SpecNet). In this new architecture, the convolutional and activation layers are implemented in the spectral domain. The outputs of the convolutional layers are computed by multiplying only the nonzero entries of the inputs and kernels. The activation function is designed to preserve the sparsity and symmetry properties of the feature maps in the spectral domain, while also allowing effective derivative computation in backward propagation.
More specifically, this paper contributes the following:

A new CNN architecture (SpecNet) that performs convolution and activation in the spectral domain. Feature maps are thresholded and compressed, reducing model memory by computing and saving only the nonzero entries.

A spectral domain activation function is applied to both the real and imaginary parts of the input feature maps, preserving the sparsity property and ensuring effective network convergence during training.

Extensive experiments are conducted to show the effectiveness of SpecNet using different architectures on multiple computer vision tasks. For example, a SpecNet implementation of the DenseNet architecture can reach up to reduction of the memory consumption on the SVHN dataset without significant loss of accuracy ( testing accuracy compared with accuracy of the original implementation).
Related Work
Model Compression
Model compression can be achieved in several ways including quantization, pruning and weight decomposition.
With quantization, the values of filter kernels in the convolutional layers and weight matrices in fully-connected layers are quantized into a limited number of levels. This can decrease the computational complexity and reduce memory cost (?; ?). The extreme case of quantization is binarization (?; ?), which uses only $\pm 1$ to represent all values of the weights, resulting in dramatic memory reduction but risking potentially degraded performance.
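The extreme case described above can be sketched in a few lines; this is a toy illustration of sign-based binarization, not code from any of the cited works:

```python
import numpy as np

def binarize(w):
    # Extreme quantization: keep only the sign of each weight (+1 / -1),
    # so each entry needs a single bit instead of a 32-bit float.
    return np.where(w >= 0, 1.0, -1.0)

w = np.array([[0.7, -0.2], [-1.3, 0.05]])
print(binarize(w))
```

In practice the binarized tensor would be packed into a bitmask; the float output here is only for readability.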
Pruning and weight decomposition are other approaches to model compression. The key idea in pruning is to remove unimportant connections. Some initial work (?) focused on using weight decay to sparsify the connections in neural networks, while more recent work (?; ?) applied structured sparsity regularizers to the weights. Instead of selecting redundant connections, (?) proposed a compression technique that fixes a random connectivity pattern and requires the CNN to train around it. Weight decomposition is based on a low-rank decomposition of the weights in the network. SVD is an efficient decomposition method, and has proven successful in shrinking the size of the model (?). Other work (?; ?) uses PCA to make the rank selection. Pruning and weight decomposition reduce the size of the model so that it can be more easily deployed in embedded systems or smartphones. Overall, the aforementioned methods focus on compressing weights to reduce the model size, which only indirectly reduces the size of the feature maps. SpecNet is an orthogonal method that directly reduces memory consumption by sparsifying the feature maps and storing them efficiently. The two methods may be combined to save memory.
Memory Sharing
Since the 'lifetime' of feature maps (the amount of time data from a given layer must be stored) is different in each layer, it is possible to design data-reuse methods to reduce memory consumption. (?) observes that the feature maps in some layers that are responsible for most of the memory consumption are relatively cheap to compute. By storing the output of the concatenation, batch normalization and ReLU layers in shared memory, DenseNet can achieve more than memory saving, compared to the original implementation that allocates new memory for these layers. A more general algorithm for designing memory-sharing patterns can be found in (?). It can be applied to CNNs and RNNs with sublinear memory cost compared to the original implementations. Recently, an approach called SmartPool (?) has been proposed to provide an even more fine-grained memory-sharing strategy to improve the memory reduction.
Representation of Feature Maps in the Spatial Domain
The above methods are not focused on compressing feature maps directly. (?) employed two classes of layer-specific encoding schemes to encode and store the feature maps in the spatial domain, and to decode the data for back propagation. The additional encoding and decoding process increases the computational complexity. In SpecNet, the architecture is designed for sparse storage of feature maps in the spectral domain, which is more computationally efficient.
(?) proposed a method to extract intrinsic representations of the feature maps while preserving the discriminability of the features. It can achieve a high compression ratio, but the training process involves a pretrained CNN and solving an optimization problem. SpecNet does not require additional modules in the training process and is easier to implement.
CNN in the Spectral Domain
A few pilot studies have attempted to combine the Fast Fourier Transform (FFT) and wavelet transforms with CNNs (?; ?; ?). However, most of these works aim to make the training process faster by replacing the traditional convolution operation with an FFT and a dot product of the inputs and kernel in the spectral domain (?; ?). Wavelet CNN (?) concatenates the feature maps and multiresolution features captured by the wavelet transform of the input images to improve the classification accuracy. These methods do not attempt to reduce memory, and several works (such as the Wavelet CNN) require more memory in order to achieve optimal performance. In contrast, SpecNet uses the FFT to reduce memory consumption, and its computational complexity depends on certain input parameters, which is quite different from most FFT-based CNN implementations.
SpecNet
The key idea of SpecNet rests on the observation that feature maps, like most natural images, tend to have compact energy in the spectral domain. Compression can be achieved by retaining nontrivial values while zeroing out small entries. A threshold ($\beta$) can then be applied to configure the compression rate, where larger values of $\beta$ result in more zeros in the spectral domain feature maps.
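The energy-compaction observation can be checked numerically. In this sketch (an illustration under the assumption that a smooth 2-D Gaussian bump stands in for a "natural-image-like" feature map), a small fraction of the spectral coefficients carries essentially all of the energy:

```python
import numpy as np

# A smooth synthetic "feature map": a 2-D Gaussian bump (assumption for illustration).
n = 64
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
fmap = np.exp(-(((i - n / 2) ** 2 + (j - n / 2) ** 2) / (2 * 5.0 ** 2)))

F = np.fft.fft2(fmap)                      # spectral representation
energy = np.abs(F) ** 2
top = np.sort(energy.ravel())[::-1]        # coefficients by decreasing energy
share = top[: energy.size // 20].sum() / energy.sum()   # top 5% of entries
print(f"top 5% of spectral coefficients carry {share:.2%} of the energy")
```

For such a smooth map the top 5% of coefficients carry essentially all of the energy, which is what makes thresholding the remaining entries nearly lossless.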
As with existing CNNs (?; ?; ?; ?) which operate in the spatial domain, the elemental SpecNet, operating in the spectral domain, also consists of three layers: convolutional layers, activation layers, and pooling layers as shown at the bottom half of Fig. 1. In contrast to previous studies (?; ?; ?) that simply use the FFT to accelerate network training, SpecNet represents a new design of the network architecture for convolution, tensor compression, and activation in the spectral domain and can be applied to both forward and backward propagation in network training and inference.
Convolution in the Spectral Domain
Consider 2-D convolution with a stride of 1. In a standard convolutional layer, the output is computed by

$y = x \ast w,$  (1)

where $x$ is an input matrix of size $n \times n$, $w$ is the kernel with dimension $k \times k$, and $\ast$ indicates 2-D convolution. The output $y$ in the spatial domain has dimensions $m \times m$, where $m = n - k + 1$. This process involves $O(m^2 k^2)$ multiplications.
Convolution can be implemented more efficiently in the spectral domain as

$Y = X \odot W,$  (2)

where $X = \mathcal{F}(x)$ is the transformed input in the spectral domain by the FFT $\mathcal{F}$, and $W = \mathcal{F}(w)$ is the corresponding kernel in the spectral domain. $\odot$ represents element-by-element multiplication, which requires equal dimensions for $X$ and $W$. Therefore, $x$ and $w$ are zero-padded to match their dimensions to $m' \times m'$ with $m' = n + k - 1$. Since there are various hardware optimizations for the FFT (?; ?; ?), it requires $O(m'^2 \log m')$ complex multiplications. The computational complexity of (2) is $O(m'^2)$, and so the overall complexity in the spectral domain is $O(m'^2 \log m')$. Depending on the size of the inputs and kernels, SpecNet can have a computational advantage over spatial convolution in some cases (?; ?). However, SpecNet is focused on reducing memory consumption for applications that are primarily limited by the available memory.
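The equivalence between spatial convolution and an element-wise spectral product can be verified numerically. This sketch (NumPy, not the paper's MATLAB/TensorFlow code) compares a direct "full" 2-D convolution against its zero-padded FFT counterpart:

```python
import numpy as np

def conv2d_full_direct(x, w):
    # Direct 2-D linear convolution ('full' output, stride 1).
    n, k = x.shape[0], w.shape[0]
    m = n + k - 1
    y = np.zeros((m, m))
    for p in range(m):
        for q in range(m):
            for a in range(k):
                for b in range(k):
                    if 0 <= p - a < n and 0 <= q - b < n:
                        y[p, q] += w[a, b] * x[p - a, q - b]
    return y

def conv2d_full_fft(x, w):
    # Same convolution as an element-wise product of zero-padded FFTs (Eq. 2).
    n, k = x.shape[0], w.shape[0]
    m = n + k - 1
    X = np.fft.fft2(x, s=(m, m))   # zero-pad to (n+k-1) x (n+k-1), then FFT
    W = np.fft.fft2(w, s=(m, m))
    return np.real(np.fft.ifft2(X * W))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
w = rng.standard_normal((3, 3))
print(np.allclose(conv2d_full_direct(x, w), conv2d_full_fft(x, w)))
```

Zero-padding both operands to at least $n + k - 1$ per side is what turns the FFT's circular convolution into the desired linear one.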
The compression of $Y$ involves a configurable threshold $\beta$, which forces entries in $Y$ with small absolute values to zero. This allows the thresholded map ($\hat{Y}$) to be sparse, so that only the nonzero entries of $\hat{Y}$ need to be stored, thus saving memory.
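The thresholding step can be sketched as follows. The exact cutoff rule, here $\beta$ times the mean spectral magnitude, is an assumption for illustration; the text only states that entries with small absolute values are forced to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
smooth = np.cumsum(np.cumsum(x, axis=0), axis=1)  # a smooth-ish surrogate feature map
X = np.fft.fft2(smooth)

beta = 1.0                                # configurable threshold
cutoff = beta * np.abs(X).mean()          # cutoff rule is an illustrative assumption
Xt = np.where(np.abs(X) < cutoff, 0, X)   # force small spectral entries to zero

sparsity = 1.0 - np.count_nonzero(Xt) / Xt.size
print(f"fraction of zeroed spectral entries: {sparsity:.2f}")
```

The resulting array can then be stored in a sparse format (e.g. coordinate lists of the nonzero complex entries), which is where the memory saving comes from.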
The backward propagation step requires the calculation of the error for the previous layers, and the gradients for the kernels. Let $E$ be the error from the next layer, and $X$, $W$ be the input and kernel of the convolutional layer stored in the forward propagation, respectively. Then

$\frac{\partial L}{\partial X} = E \odot W, \qquad \frac{\partial L}{\partial W} = E \odot X,$  (3)

where $L$ is the loss function. After obtaining the gradient in the spectral domain, $\partial L / \partial W$, the IFFT is applied. Then the matrix $\Delta w$ for the update of the spatial kernel $w$ can be expressed as

$\Delta w = -\eta \, \mathcal{F}^{-1}\!\left(\frac{\partial L}{\partial W}\right),$  (4)

where $\eta$ is the learning rate. The kernels are thus updated after obtaining $\partial L / \partial W$, the gradient in the spectral domain, by using the inverse FFT and downsampling.
Note that after the gradient update of $w$, the kernel is converted from the spectral domain back into the spatial domain using the inverse FFT to save kernel storage.
A more general case of 2-D convolution with an arbitrary integer stride can be viewed as a combination of 2-D convolution with a stride of 1 and uniform downsampling. This can also be implemented in the spectral domain (?).
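The stride decomposition above can be checked directly in the spatial domain; this sketch shows that a stride-2 convolution equals a stride-1 convolution followed by taking every second output sample:

```python
import numpy as np

def conv2d_valid(x, w, stride=1):
    # 'Valid' 2-D cross-correlation (the convolution used in CNN layers).
    n, k = x.shape[0], w.shape[0]
    m = (n - k) // stride + 1
    y = np.zeros((m, m))
    for p in range(m):
        for q in range(m):
            y[p, q] = np.sum(x[p * stride:p * stride + k,
                               q * stride:q * stride + k] * w)
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
w = rng.standard_normal((3, 3))
# Stride-2 convolution == stride-1 convolution followed by 2x uniform downsampling.
print(np.allclose(conv2d_valid(x, w, stride=2),
                  conv2d_valid(x, w, stride=1)[::2, ::2]))
```

This is what allows a strided layer to reuse the stride-1 spectral convolution, with the downsampling handled separately.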
Activation Function in the Spectral Domain
In SpecNet, the activation function for the feature maps is designed to operate directly in the spectral domain. For each complex entry $z$ in the spectral feature map,

$g(z) = f(\mathrm{Re}(z)) + i\, f(\mathrm{Im}(z)),$  (5)

where $f(\cdot)$ is the scalar nonlinearity defined in (6). The function $f$ is used in (5) as a proof-of-concept design for this study. Other activation functions may also be used, but must fulfill the following:

They allow inexpensive gradient calculation.

The functions applied to the real and imaginary parts are monotonic non-decreasing.

The functions are odd, i.e. $f(-x) = -f(x)$.
The first and second rules are standard requirements for nearly all popular activation functions used in modern CNN design. The third rule in SpecNet is applied to preserve the conjugate symmetry structure of the spectral feature maps so that they can be converted back into real spatial features without generating pseudo phases. This can be seen from the 2-D FFT,

$X(u, v) = \sum_{p=0}^{M-1} \sum_{q=0}^{N-1} x(p, q)\, e^{-2\pi i \left(\frac{up}{M} + \frac{vq}{N}\right)},$  (7)

where $0 \le u < M$ and $0 \le v < N$. If $x$ is real, i.e. the conjugate of $x$ is itself ($\bar{x} = x$), then

$X(u, v) = \overline{X(M - u, N - v)}.$  (8)

Therefore, $f$ must be odd to retain the symmetry structure of the activation layer, ensuring that

$g(X(u, v)) = \overline{g(X(M - u, N - v))}.$  (9)
If the symmetry structure of the gradient in (3) is also maintained, the gradients in the spatial domain will be real after the inverse FFT in (4), and can be added to $w$ directly.
Let $Z$ be the input of the activation layer in forward propagation, and $E$ be the error from the next layer. The error for the previous layer in backward propagation can then be calculated as

$\frac{\partial L}{\partial Z} = \mathrm{Re}(E) \odot f'(\mathrm{Re}(Z)) + i\, \mathrm{Im}(E) \odot f'(\mathrm{Im}(Z)).$  (10)
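The symmetry argument can be verified numerically. In this sketch, tanh stands in for the odd nonlinearity $f$ (an illustrative choice, not necessarily the paper's); applying it separately to the real and imaginary parts of a conjugate-symmetric spectrum leaves the inverse FFT real up to floating-point error:

```python
import numpy as np

def spectral_activation(Z, f=np.tanh):
    # Apply an odd, monotonic scalar nonlinearity (tanh here, as an
    # illustrative choice) to real and imaginary parts separately.
    return f(Z.real) + 1j * f(Z.imag)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))   # real spatial feature map
Z = np.fft.fft2(x)                  # conjugate-symmetric spectrum
Y = spectral_activation(Z)
y = np.fft.ifft2(Y)
# Because f is odd, conjugate symmetry is preserved and y is (numerically) real.
print(np.abs(y.imag).max())
```

Replacing tanh with an even function (e.g. `np.abs`) breaks the symmetry and produces a spurious imaginary part, which is exactly the "pseudo phase" the third rule guards against.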
Pooling Layers
The pooling methods in SpecNet are implemented in the spatial domain after transforming the activated spectral feature maps back into the spatial domain using the IFFT. As a result of the convolution and activation function design (which preserves conjugate symmetry in the spectral domain feature maps), the corresponding spatial feature maps are real-valued, and the same pooling operations (max pooling or average pooling) used in standard CNNs can be used seamlessly in SpecNet. The calculation of the error in backward propagation follows the standard approach (?). Note that this error is in the spatial domain; if the previous layer is an activation or convolutional layer, the error should be transformed into the spectral domain by the FFT to obtain $E$ in (3) or (10).
Implementation Details
SpecNet stores the kernels in the spatial domain as matrices. Therefore, given the input feature maps in the spectral domain, each kernel must be upsampled to the size of the inputs by adding zeros to the right and bottom of its value matrix, and then transformed to the spectral domain with the FFT. The complete forward propagation of the convolutional block (including convolution and activation operations) in the spectral domain is shown in Algorithm 1.
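Algorithm 1 is not reproduced here, but the forward pass of one convolutional block can be sketched as follows. The tanh activation and the mean-magnitude cutoff rule are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def spec_conv_block_forward(x, w, beta=1.0, f=np.tanh):
    """Sketch of one SpecNet convolutional block (forward pass only).

    Assumptions beyond the text: tanh as the odd nonlinearity and a
    mean-magnitude-scaled cutoff for the threshold beta.
    """
    n, k = x.shape[0], w.shape[0]
    m = n + k - 1
    X = np.fft.fft2(x, s=(m, m))           # transform zero-padded input
    W = np.fft.fft2(w, s=(m, m))           # zero-pad kernel to input size, then FFT
    Y = X * W                              # convolution as element-wise product
    cutoff = beta * np.abs(Y).mean()
    Y = np.where(np.abs(Y) < cutoff, 0, Y) # sparsify: keep only nonzero entries
    return f(Y.real) + 1j * f(Y.imag)      # odd activation on Re and Im parts

rng = np.random.default_rng(0)
out = spec_conv_block_forward(rng.standard_normal((8, 8)),
                              rng.standard_normal((3, 3)))
print(out.shape, np.count_nonzero(out), out.size)
```

Because tanh maps 0 to 0, the zeros introduced by thresholding survive the activation, so the block's output stays sparse for the next layer.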
Experiments
We demonstrate the feasibility of SpecNet using several benchmark datasets, comparing the performance of SpecNet implementations of several state-of-the-art networks (LeNet, AlexNet, VGG-16 and DenseNet) with their standard implementations.
Datasets
MNIST is a dataset of handwritten digits with 28 × 28 pixels per image, and is widely used for training and testing image processing, machine learning, and deep learning algorithms (?; ?). In our experiment, 60,000 images were used for training and 10,000 images were used for testing. The images were preprocessed by normalizing all pixel values to [0, 1].
CIFAR-10 is a ten-class dataset of small colored natural images (?). In our experiment, 50,000 images were used for training and 10,000 images were used for testing. All images of CIFAR-10 were resized to 32 × 32 pixels, and each channel was normalized with respect to its mean and standard deviation (?). Standard data augmentation techniques (?) were also applied to the training set.
SVHN is a dataset consisting of colored digit images with 32 × 32 pixels per image (?). The dataset contains 99,289 images: 73,257 for training and 26,032 for testing. It is reported that state-of-the-art CNNs can achieve good performance on this dataset without data augmentation (?); therefore, we do not use data augmentation for training. Images were channel-normalized in mean and standard deviation.
Training
Training and evaluation of all networks were performed on a desktop computer running a 64-bit Linux operating system (Ubuntu 16.04) with an Intel Core i7-7700K CPU, 32 GB of DDR4 RAM, and two Nvidia GeForce GTX 1080 graphics cards (Nvidia driver 384.130) with 2560 CUDA cores and 8 GB of GDDR5 RAM. All networks and algorithms (including comparisons) were implemented in MATLAB 2018a and TensorFlow 1.09.
All the networks were trained by mini-batch stochastic gradient descent (SGD) with a batch size of 64 on MNIST, CIFAR-10, and SVHN. The initial learning rate was set to 0.1 and was halved every 50 epochs. The momentum of the optimizer was set to 0.95, and each network was trained for a total of 300 epochs to ensure convergence.
Results
Table 1: Architectures of DenseNet and SpecDenseNet.

DenseNet
Layers               | Output size | Structure
Input                | 32×32       | Input
Convolution          | 32×32       | 3×3 kernel, BN, ReLU
Pooling              | 16×16       | MaxPool (window size 2×2)
Dense Block          | 16×16       | [1×1 kernel, BN, ReLU; 3×3 kernel, BN, ReLU] × 6
Classification Layer | 1×1         | GlobalAveragePool
                     |             | Fully-connected, SoftMax (10 classes)

SpecDenseNet
Layers               | Output size | Structure
Input                | 32×32       | Input
Convolutional Block  | 35×35       | FFT
                     | 35×35       | FConv2D (3×3 kernels), Activation
                     | 32×32       | IFFT
Pooling              | 16×16       | MaxPool (window size 2×2)
Dense Block          | 19×19       | FFT
                     | 19×19       | [FConv2D (64 3×3 kernels), Activation; FConv2D (64 3×3 kernels), Activation] × 6
                     | 16×16       | IFFT
Classification Layer | 1×1         | GlobalAveragePool
                     |             | Fully-connected, SoftMax (10 classes)
First, we empirically show the impact when different thresholds are applied to the feature maps. The results are shown in Fig. 2. We apply two different thresholds ( and ). Observe that the feature maps in the spectral domain are compressed (and sparsified), but that this does not significantly impact the feature maps in the spatial domain.
Next, we evaluated the proposed SpecNet using four widely used CNN architectures: LeNet-5 (?), AlexNet (?), VGG (?) and DenseNet (?). We use the prefix 'Spec' to denote the SpecNet implementation of each network. To ensure fair comparisons, the SpecNet networks used network hyperparameters identical to those of the native spatial-domain implementations. The experiments also use the same conditions for image preprocessing, parameter initialization, and optimization settings. For example, the architectures for DenseNet and SpecDenseNet are given in Table 1. The other three networks are detailed in the supplementary material. The experimental results on MNIST, CIFAR-10 and SVHN are shown in Fig. 3.
Figures 3(a)-(c) compare the memory usage of the SpecNet implementations of the four networks over a range of $\beta$ values from 0.5 to 1.5. We compute relative memory consumption (and error) as the memory (error) of SpecNet divided by the memory (error) of the original implementation. When compared with their original models, all SpecNet implementations of the four networks can save at least memory with negligible loss of accuracy, indicating the feasibility of compressing feature maps within the SpecNet framework. With increasing $\beta$, all models show a monotonic reduction in memory usage. The rates of memory reduction differ between network architectures, which is likely caused by the different feature representations in the various network designs.
Figures 3(d)-(f) compare the relative error of the SpecNet implementations of the four networks over the same range of $\beta$ values from 0.5 to 1.5. While SpecNet typically compresses the models, there is a penalty in the form of increased error in comparison to the original model with full spatial feature maps. The average accuracy of SpecAlexNet, SpecVGG, and SpecDenseNet can be higher than when $\beta$ is smaller than .
Figures 4(a)-(c) show the relative memory use () during the training process. We recorded the average memory consumption of each epoch and compared it with the memory of the original implementations. The relative memory consumption gradually improves over the training epochs, and the peak value tends to occur once the model has converged.
Table 2 shows a comparison between SpecNet and other recently published memory-efficient algorithms. The experiments investigate memory usage when training VGG and DenseNet on the CIFAR-10 dataset. For each algorithm, we selected the most memory-efficient configuration that still retains a testing accuracy of at least . SpecNet outperformed all the listed algorithms, resulting in the lowest memory usage while maintaining high testing accuracy. It is notable that SpecNet is independent of the work listed in the table, and these techniques can be applied along with SpecNet to further reduce memory consumption.
Table 2: Relative memory usage when training VGG and DenseNet on CIFAR-10 (lower is better).

Method                         | VGG  | DenseNet
INPLACE-ABN (?)                | 0.52 | 0.58
Chen Meng et al. (?)           | 0.65 | 0.55
Memory-Efficient DenseNets (?) | N/A  | 0.44
vDNN (?)                       | 0.38 | 0.39
SpecNet                        | 0.37 | 0.37
Conclusion
We have introduced a new convolutional neural network architecture called SpecNet, which performs both the convolution and the activation operations in the spectral domain. By setting a configurable threshold that forces small values in the spectral-domain feature maps to zero, the feature maps of SpecNet can be stored sparsely. SpecNet also employs a special activation function that preserves the sparsity of the feature maps and helps ensure training convergence. We have evaluated SpecNet on three competitive object recognition benchmark tasks (MNIST, CIFAR-10, and SVHN), and demonstrated the performance of SpecNet implementations of state-of-the-art networks (LeNet, AlexNet, VGG-16 and DenseNet) to show the efficacy and efficiency of memory reduction. In some cases, SpecNet can reduce memory consumption by about without significant loss of performance.
It is worth noting that our experimental hyperparameter settings were identical to those of the corresponding CNNs in the spatial domain. Further memory reduction and performance improvement for SpecNet could be achieved by using more suitable data scaling, dedicated network architectures, and optimization settings tailored for SpecNet. Transform methods other than the FFT, such as the DCT or wavelet transform, can also be incorporated into SpecNet to promote energy concentration for memory reduction.
It is also notable that SpecNet is focused only on the sparse storage of feature maps in the spectral domain. In the future, we plan to apply the aforementioned methods, such as model compression and scheduling, to SpecNet for more efficient use of memory.
Acknowledgement
We thank Prof. Varun Jog and Prof. Dimitris Papailiopoulos for helpful discussions.
References
 [Andri et al. 2018] Andri, R.; Cavigelli, L.; Rossi, D.; and Benini, L. 2018. YodaNN: An architecture for ultra-low power binary-weight CNN acceleration. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37(1):48–60.
 [Atwood and Towsley 2016] Atwood, J., and Towsley, D. 2016. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, 1993–2001.
 [Changpinyo, Sandler, and Zhmoginov 2017] Changpinyo, S.; Sandler, M.; and Zhmoginov, A. 2017. The power of sparsity in convolutional neural networks. CoRR abs/1702.06257.
 [Chen et al. 2016] Chen, T.; Xu, B.; Zhang, C.; and Guestrin, C. 2016. Training deep nets with sublinear memory cost. CoRR abs/1604.06174.
 [Cheng et al. 2017] Cheng, Y.; Wang, D.; Zhou, P.; and Zhang, T. 2017. A survey of model compression and acceleration for deep neural networks. CoRR abs/1710.09282.
 [Courbariaux and Bengio 2016] Courbariaux, M., and Bengio, Y. 2016. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR abs/1602.02830.
 [Deng 2012] Deng, L. 2012. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine 29(6):141–142.
 [Denton et al. 2014] Denton, E. L.; Zaremba, W.; Bruna, J.; LeCun, Y.; and Fergus, R. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. In Ghahramani, Z.; Welling, M.; Cortes, C.; Lawrence, N. D.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 27. Curran Associates, Inc. 1269–1277.
 [Fujieda, Takayama, and Hachisuka 2017] Fujieda, S.; Takayama, K.; and Hachisuka, T. 2017. Wavelet convolutional neural networks for texture classification. CoRR abs/1707.07394.
 [Gupta et al. 2015] Gupta, S.; Agrawal, A.; Gopalakrishnan, K.; and Narayanan, P. 2015. Deep learning with limited numerical precision. CoRR abs/1502.02551.
 [Hanson and Pratt 1989] Hanson, S. J., and Pratt, L. Y. 1989. Comparing biases for minimal network construction with backpropagation. In Touretzky, D. S., ed., Advances in Neural Information Processing Systems 1. Morgan Kaufmann. 177–185.
 [He et al. 2015] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep residual learning for image recognition. CoRR abs/1512.03385.
 [Huang et al. 2017] Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4700–4708.
 [Jain et al. 2018] Jain, A.; Phanishayee, A.; Mars, J.; Tang, L.; and Pekhimenko, G. 2018. Gist: Efficient data encoding for deep neural network training. In 45th ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2018, Los Angeles, CA, USA, June 16, 2018, 776–789.
 [Jin, Lazarow, and Tu 2017] Jin, L.; Lazarow, J.; and Tu, Z. 2017. Introspective classification with convolutional nets. In Advances in Neural Information Processing Systems, 823–833.
 [Kappeler et al. 2017] Kappeler, A.; Ghosh, S.; Holloway, J.; Cossairt, O.; and Katsaggelos, A. 2017. PtychNet: CNN-based Fourier ptychography. In 2017 IEEE International Conference on Image Processing (ICIP), 1712–1716. IEEE.
 [Krizhevsky and Hinton 2009] Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical report, Citeseer.
 [Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Pereira, F.; Burges, C. J. C.; Bottou, L.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 25. Curran Associates, Inc. 1097–1105.
 [Krizhevsky 2014] Krizhevsky, A. 2014. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997.
 [LeCun, Bengio, and Hinton 2015] LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436.
 [LeCun et al. 1995] LeCun, Y.; Jackel, L.; Bottou, L.; Brunot, A.; Cortes, C.; Denker, J.; Drucker, H.; Guyon, I.; Muller, U.; Sackinger, E.; et al. 1995. Comparison of learning algorithms for handwritten digit recognition. In International conference on artificial neural networks, volume 60, 53–60. Perth, Australia.
 [Lee et al. 2018] Lee, J.-H.; Heo, M.; Kim, K.-R.; and Kim, C.-S. 2018. Single-image depth estimation based on Fourier domain analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 330–339.
 [Leonard and Kramer 1990] Leonard, J., and Kramer, M. 1990. Improvement of the backpropagation algorithm for training neural networks. Computers & Chemical Engineering 14(3):337–341.
 [Li et al. 2016] Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; and Graf, H. P. 2016. Pruning filters for efficient convnets. CoRR abs/1608.08710.
 [Li et al. 2018] Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; and Chen, B. 2018. PointCNN: Convolution on X-transformed points. In Advances in Neural Information Processing Systems, 820–830.
 [Liu et al. 2015] Liu, B.; Wang, M.; Foroosh, H.; Tappen, M.; and Pensky, M. 2015. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 806–814.
 [Liu et al. 2019] Liu, F.; Guan, B.; Zhou, Z.; Samsonov, A.; Rosas, H.; Lian, K.; Sharma, R.; Kanarek, A.; Kim, J.; Guermazi, A.; et al. 2019. Fully automated diagnosis of anterior cruciate ligament tears on knee mr images by using deep learning. Radiology: Artificial Intelligence 1(3):180091.
 [Mathieu, Henaff, and LeCun 2014] Mathieu, M.; Henaff, M.; and LeCun, Y. 2014. Fast training of convolutional networks through FFTs. In ICLR.
 [Meng et al. 2017] Meng, C.; Sun, M.; Yang, J.; Qiu, M.; and Gu, Y. 2017. Training deeper models by gpu memory optimization on tensorflow. In Proc. of ML Systems Workshop in NIPS.
 [Netzer et al. 2011] Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011.
 [Pleiss et al. 2017] Pleiss, G.; Chen, D.; Huang, G.; Li, T.; van der Maaten, L.; and Weinberger, K. Q. 2017. Memory-efficient implementation of DenseNets. arXiv preprint arXiv:1707.06990.
 [Pratt et al. 2017] Pratt, H.; Williams, B. M.; Coenen, F.; and Zheng, Y. 2017. FCNN: Fourier convolutional neural networks. In ECML/PKDD.
 [Rhu et al. 2016] Rhu, M.; Gimelshein, N.; Clemons, J.; Zulfiqar, A.; and Keckler, S. W. 2016. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, 18. IEEE Press.
 [Rota Bulò, Porzi, and Kontschieder 2018] Rota Bulò, S.; Porzi, L.; and Kontschieder, P. 2018. In-place activated BatchNorm for memory-optimized training of DNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5639–5647.
 [Simonyan and Zisserman 2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
 [Szegedy et al. 2017] Szegedy, C.; Ioffe, S.; Vanhoucke, V.; and Alemi, A. A. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence.
 [Wang et al. 2017a] Wang, Y.; Xie, L.; Liu, C.; Qiao, S.; Zhang, Y.; Zhang, W.; Tian, Q.; and Yuille, A. 2017a. SORT: Second-order response transform for visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, 1359–1368.
 [Wang et al. 2017b] Wang, Y.; Xu, C.; Xu, C.; and Tao, D. 2017b. Beyond filters: Compact feature map for portable deep model. In ICML, volume 70 of Proceedings of Machine Learning Research, 3703–3711. PMLR.
 [Wen et al. 2016] Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; and Li, H. 2016. Learning structured sparsity in deep neural networks. CoRR abs/1608.03665.
 [Wu et al. 2015] Wu, J.; Leng, C.; Wang, Y.; Hu, Q.; and Cheng, J. 2015. Quantized convolutional neural networks for mobile devices. CoRR abs/1512.06473.
 [Zhang et al. 2014] Zhang, X.; Zou, J.; Ming, X.; He, K.; and Sun, J. 2014. Efficient and accurate approximations of nonlinear convolutional networks. CoRR abs/1411.4229.
 [Zhang et al. 2019] Zhang, J.; Yeung, S.; Shu, Y.; He, B.; and Wang, W. 2019. Efficient memory management for GPU-based deep learning systems. CoRR abs/1903.06631.