Learning in the Frequency Domain
Abstract
Deep neural networks have achieved remarkable success in computer vision tasks. Existing neural networks mainly operate in the spatial domain with fixed input sizes. For practical applications, images are usually large and have to be downsampled to the predetermined input size of neural networks. Even though the downsampling operations reduce computation and the required communication bandwidth, they remove both redundant and salient information obliviously, which results in accuracy degradation. Inspired by digital signal processing theories, we analyze the spectral bias from the frequency perspective and propose a learning-based frequency selection method to identify the trivial frequency components which can be removed without accuracy loss. The proposed method of learning in the frequency domain leverages identical structures of well-known neural networks, such as ResNet-50, MobileNetV2, and Mask R-CNN, while accepting frequency-domain information as the input. Experiment results show that learning in the frequency domain with static channel selection can achieve higher accuracy than the conventional spatial downsampling approach while further reducing the input data size. Specifically, for ImageNet classification with the same input size, the proposed method achieves 1.60% and 0.63% top-1 accuracy improvements on ResNet-50 and MobileNetV2, respectively. Even with half the input size, the proposed method still improves the top-1 accuracy on ResNet-50 by 1.42%. In addition, we observe a 0.8 average precision improvement on Mask R-CNN for instance segmentation on the COCO dataset.
1 Introduction
Convolutional neural networks (CNNs) have revolutionized the computer vision community because of their exceptional performance on various tasks such as image classification [1, 2], object detection [3, 4], and semantic segmentation [5, 6].
Constrained by computing resources and memory limitations, most CNN models only accept RGB images at low resolutions (e.g., 224×224). However, images produced by modern cameras are usually much larger. For example, high-definition (HD) images are considered relatively small by modern standards. Even the average image resolution in the ImageNet dataset [7] is 482×415, which is roughly four times the size accepted by most CNN models. Therefore, a large portion of real-world images are aggressively downsized to 224×224 to meet the input requirement of classification networks. However, image downsizing inevitably incurs information loss and accuracy degradation [8].
Prior works [9, 10] aim to reduce information loss by learning task-aware downsizing networks. However, those networks are task-specific and require additional computation, which is unfavorable in practical applications. In this paper, we propose to reshape the high-resolution images in the frequency domain, i.e., the discrete cosine transform (DCT) domain.
Inspired by the observation that the human visual system (HVS) has unequal sensitivity to different frequency components [11], we analyze the image classification, detection, and segmentation tasks in the frequency domain and find that CNN models are more sensitive to low-frequency channels than to high-frequency channels, which coincides with the HVS. This observation is validated by a learning-based channel selection method that consists of multiple “on-off switches”. The DCT coefficients with the same frequency are packed as one channel, and each switch is stacked on a specific frequency channel to either allow the entire channel to flow into the network or not.
Using the decoded high-fidelity images for model training and inference poses significant challenges from both the data transfer and the computation perspectives [12, 13]. Due to the spectral bias of the CNN models, one can keep only the important frequency channels during inference without losing accuracy. In this paper, we also develop a static channel selection approach to preserve the salient channels rather than using the entire frequency spectrum for inference. Experiment results show that the CNN models still retain the same accuracy when the input data size is reduced by 87.5%.
The contributions of this paper are as follows:

We propose a method of learning in the frequency domain (using DCT coefficients as input), which requires little modification to existing CNN models that take RGB input. We validate our method on ResNet-50 and MobileNetV2 for the image classification task and on Mask R-CNN for the instance segmentation task.

We show that learning in the frequency domain better preserves image information in the preprocessing stage than the conventional spatial downsampling approach (spatially resizing the images to 224×224, the default input size of most CNN models) and consequently achieves improved accuracy on ResNet-50 and MobileNetV2 for the ImageNet classification task (Tables 1 and 2) and on Mask R-CNN for both the object detection and instance segmentation tasks.

We analyze the spectral bias from the frequency perspective and show that the CNN models are more sensitive to low-frequency channels than to high-frequency channels, similar to the human visual system (HVS).

We propose a learning-based dynamic channel selection method to identify the trivial frequency components for static removal during inference. Experiment results on ResNet-50 show that one can prune up to 87.5% of the frequency channels using the proposed channel selection method with little or no accuracy degradation in the ImageNet classification task.

To the best of our knowledge, this is the first work that explores learning in the frequency domain for object detection and instance segmentation. Experiment results on Mask R-CNN show that learning in the frequency domain can achieve a 0.8 average precision improvement for the instance segmentation task on the COCO dataset.
2 Related Work
Learning in the frequency domain: Compressed representations in the frequency domain contain rich patterns for image understanding tasks. [14, 15, 16] train dedicated autoencoder-based networks on compression and inference tasks jointly. [17] extracts features from the frequency domain to classify images. [18] proposes a model conversion algorithm to convert spatial-domain CNN models to the frequency domain. Our method differs from the prior works in two aspects. First, we avoid the complex model transition procedure from the spatial to the frequency domain. Thus, our method has a broader application scope. Second, we provide an analysis method to interpret the spectral bias of neural networks in the frequency domain.
Dynamic Neural Networks: Prior works [19, 20, 21, 22, 23] propose to selectively skip the convolutional blocks on the fly based on the activations of the previous blocks. These works adjust the model complexity in response to the input of each convolutional block. Only the intermediate features that are most relevant to the inputs are computed in the inference stage to reduce computation cost. In contrast, our method exclusively operates on the raw inputs and distills the salient frequency components to lower the communication bandwidth requirement for input data.
Efficient Network Training: There is substantial recent interest in training efficient networks [24, 25, 26, 27], focusing on network compression via kernel pruning, learned quantization, and entropy encoding. Another line of work aims to compress the CNN models in the frequency domain. [28] reduces the storage space by converting filter weights to the frequency domain and using a hash function to group the frequency parameters into hash buckets. [29] also transforms the kernels to the frequency domain and discards the low-energy frequency coefficients for high compression. [30] constrains the frequency spectra of CNN kernels to reduce memory consumption. These network compression works in the frequency domain all rely on FFT-based convolution, which is generally more effective on larger kernels. Nevertheless, the state-of-the-art CNN models use small kernels, e.g., 3×3 or 1×1. Extensive efforts need to be taken to optimize the computation efficiency of these FFT-based CNN models [31]. In contrast, our method makes little modification to the existing CNN models. Thus, our method requires no extra effort to improve its computation efficiency on CNN models with small kernels. Another fundamental difference is that our method aims to reduce the input data size rather than the model complexity.
3 Methodology
In this paper, we propose a generic method for learning in the frequency domain, including a data preprocessing pipeline as well as an input data size pruning method.
Figure 1 compares our method with the conventional approach. In the conventional approach, high-resolution RGB images are usually preprocessed on a CPU and transmitted to a GPU/AI accelerator for real-time inference. Because uncompressed images in the RGB format are usually large, the communication bandwidth required between the CPU and the GPU/AI accelerator is usually high. Such communication bandwidth can be the bottleneck of the system performance, as shown in Figure 1(a). To reduce both the computation cost and the communication bandwidth requirement, high-resolution RGB images are downsampled to smaller images, which often results in information loss and lower inference accuracy.
In our method, high-resolution RGB images are still preprocessed on a CPU. However, they are first transformed to the YCbCr color space and then to the frequency domain. This coincides with the most widely used image compression standards, such as JPEG. All components of the same frequency are grouped into one channel. In this way, multiple frequency channels are generated. As shown in Section 3.2, certain frequency channels have a bigger impact on the inference accuracy than the others. Thus, we propose to preserve and transmit only the most important frequency channels to the GPU/AI accelerator for inference. Compared to the conventional approach, the proposed method requires less communication bandwidth and achieves higher accuracy at the same time.
We demonstrate that the input features in the frequency domain can be applied to all existing CNN models developed in the spatial domain with minimal modification. Specifically, one just needs to remove the input CNN layer and retain the remaining residual blocks. The first residual layer is used as the input layer, and the number of input channels is modified to fit the dimension of the DCT coefficient inputs. As such, a modified model can maintain a parameter count and computational complexity similar to those of the original model.
Based on our frequency-domain model, we propose a learning-based channel selection method to explore the spectral bias of a given CNN model, i.e., which frequency components are more informative to the subsequent inference task. The findings motivate us to prune the trivial frequency components for inference, which significantly reduces the input data size and consequently reduces both the computational complexity of the domain transformation and the required communication bandwidth, while maintaining inference accuracy.
3.1 Data Preprocessing in the Frequency Domain
The data preprocessing flow is shown in Figure 2. We follow the well-developed preprocessing and augmentation flow in the spatial domain, consisting of image resizing, cropping, and flipping (spatial resize and crop in Figure 2). Then images are transformed to the YCbCr color space and converted to the frequency domain (DCT transform in Figure 2). After that, the two-dimensional DCT coefficients at the same frequency are grouped into one channel to form three-dimensional DCT cubes (DCT reshape in Figure 2). As discussed in Section 3.2, a subset of highly impactful frequency channels is selected (DCT channel select in Figure 2). The selected frequency channels in the YCbCr color space are concatenated to form one tensor (DCT concatenate in Figure 2). Lastly, every frequency channel is normalized by the mean and variance calculated from the training dataset.
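The color-space step of this flow can be sketched as follows. This is a minimal illustration that assumes the full-range YCbCr conversion used by JPEG/JFIF (BT.601 coefficients); the paper does not state which variant its pipeline uses, and the function names are hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr (JFIF coefficients, assumed)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def image_to_ycbcr_planes(image):
    """Split an H x W list of (r, g, b) tuples into three H x W planes (Y, Cb, Cr)."""
    y_plane, cb_plane, cr_plane = [], [], []
    for row in image:
        converted = [rgb_to_ycbcr(*px) for px in row]
        y_plane.append([c[0] for c in converted])
        cb_plane.append([c[1] for c in converted])
        cr_plane.append([c[2] for c in converted])
    return y_plane, cb_plane, cr_plane
```

Each of the three planes would then be passed separately through the DCT transform and DCT reshape steps.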
The DCT reshape operation in Figure 2 groups the two-dimensional DCT coefficients into a three-dimensional DCT cube. Since the JPEG compression standard applies the DCT transformation to 8×8 blocks in the YCbCr color space, we group the components of the same frequency in all the 8×8 blocks into one channel, maintaining their spatial relations at each frequency. Thus, each of the Y, Cb, and Cr components provides 8×8 = 64 channels, one for each frequency, with a total of 192 channels in the frequency domain. Suppose the shape of the original RGB input image is $H \times W \times C$, where $C = 3$ and the height and width of the image are denoted as $H$ and $W$, respectively. After converting to the frequency domain, the input feature shape becomes $H/8 \times W/8 \times 64C$, which maintains the same input data size.
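The DCT transform and DCT reshape steps can be illustrated with a small pure-Python sketch. It assumes 8×8 blocks as in JPEG, an orthonormal DCT-II, and a row-major frequency index `k = u * 8 + v`; the paper's actual channel ordering and scaling are not given here, so those choices (and the function names) are assumptions.

```python
import math

BLOCK = 8  # JPEG-style block size (assumed)

# 1-D orthonormal DCT-II basis: BASIS[u][x]
BASIS = [
    [
        math.sqrt((1.0 if u == 0 else 2.0) / BLOCK)
        * math.cos((2 * x + 1) * u * math.pi / (2 * BLOCK))
        for x in range(BLOCK)
    ]
    for u in range(BLOCK)
]


def dct2_block(block):
    """2-D DCT-II of one BLOCK x BLOCK tile given as a list of rows."""
    # Transform every row first, then combine the rows along the vertical axis.
    tmp = [[sum(BASIS[v][x] * row[x] for x in range(BLOCK)) for v in range(BLOCK)]
           for row in block]
    return [[sum(BASIS[u][y] * tmp[y][v] for y in range(BLOCK)) for v in range(BLOCK)]
            for u in range(BLOCK)]


def dct_reshape(plane):
    """Group same-frequency coefficients of all tiles into channels.

    plane: H x W list of rows, with H and W multiples of BLOCK.
    Returns channels[k][i][j] with k = u * BLOCK + v (row-major index, assumed),
    i.e. BLOCK*BLOCK channels of spatial size (H/BLOCK) x (W/BLOCK).
    """
    h, w = len(plane), len(plane[0])
    bh, bw = h // BLOCK, w // BLOCK
    channels = [[[0.0] * bw for _ in range(bh)] for _ in range(BLOCK * BLOCK)]
    for i in range(bh):
        for j in range(bw):
            tile = [plane[i * BLOCK + y][j * BLOCK:(j + 1) * BLOCK]
                    for y in range(BLOCK)]
            coeffs = dct2_block(tile)
            for u in range(BLOCK):
                for v in range(BLOCK):
                    channels[u * BLOCK + v][i][j] = coeffs[u][v]
    return channels
```

Applying `dct_reshape` to the Y, Cb, and Cr planes and concatenating the three results yields 3 × 64 = 192 channels, and the spatial position of each tile is preserved inside every channel.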
Since the input feature maps in the frequency domain are smaller in the $H$ and $W$ dimensions but larger in the $C$ dimension than the spatial-domain counterpart, we skip the input layer of a conventional CNN model, which is usually a stride-2 convolution. If a max-pooling operator immediately follows the input convolution (e.g., ResNet-50), we skip the max-pooling operator as well. Then we adjust the channel size of the next layer to match the number of channels in the frequency domain. This is illustrated in Figure 3. In this way, we minimally adjust the existing CNN models to accept the frequency-domain features as input.
In the image classification task, the CNN models usually take input features of the shape 224×224×3, which are usually downsampled from images with a much higher resolution. When the classification is performed in the frequency domain, larger images can be taken as input. Taking ResNet-50 as an example, the input features in the frequency domain are connected to the first residual block with the number of channels adjusted to 192, forming an input feature of the shape 56×56×192, as shown in Figure 2. This feature is DCT-transformed from input images of size 448×448×3, which preserves four times more information than the counterpart in the spatial domain, at the cost of 4 times the input feature size. Similarly, for MobileNetV2, the input feature shape is 112×112×192, reshaped from images of size 896×896×3. As discussed in Section 3.3, the majority of the frequency channels can be pruned without sacrificing accuracy. The frequency channel pruning operation is referred to as DCT channel select in Figure 2.
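The shape arithmetic in this section can be checked with a few lines of Python. The sketch assumes 8×8 DCT blocks and uses a 224×224×3 RGB input as the reference for the "Normalized Input Size" column of Tables 1 and 2; the helper names are hypothetical.

```python
def dct_input_shape(height, width, components=3, block=8):
    """Shape (H/block, W/block, block*block*components) after the DCT reshape."""
    return height // block, width // block, block * block * components


def normalized_input_size(h, w, channels, ref=(224, 224, 3)):
    """Number of input elements relative to a reference RGB input."""
    return (h * w * channels) / (ref[0] * ref[1] * ref[2])


# ResNet-50: a 448x448 image gives a 56x56x192 frequency-domain input.
print(dct_input_shape(448, 448))
# Keeping 192, 64, 48, or 24 channels at 56x56 changes the relative input size.
for kept in (192, 64, 48, 24):
    print(kept, round(normalized_input_size(56, 56, kept), 1))
```

Under these assumptions, keeping 48 of the 192 channels at 56×56 gives the same number of input elements as a 224×224 RGB image, and keeping 24 gives half of it.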
3.2 Learning-based Frequency Channel Selection
As different channels of the input feature are at different frequencies, we conjecture that some frequency channels are more informative to the subsequent tasks such as image classification, object detection, and instance segmentation, and that removing the trivial frequency channels shall not result in performance degradation. Thus, we propose a learning-based channel selection mechanism to exploit the relative importance of each input frequency channel. We employ a dynamic gate module that assigns a binary score to each frequency channel. The salient channels are rated as one, the others as zero. The input frequency channels with zero scores are detached from the network. Thus, the input data size is reduced, leading to reduced computation complexity of the domain transformation and a reduced communication bandwidth requirement. The proposed gate module is simple and can be part of the model to be applied in online inference.
Figure 4 describes our proposed gate module in detail. The input is of shape $H \times W \times C$ ($C = 192$ in this paper), with $C$ frequency channels (Tensor 1 in Figure 4). It is first converted to Tensor 2 in Figure 4 of shape $1 \times 1 \times C$ by average pooling. Then it is converted to Tensor 3 in Figure 4 of shape $1 \times 1 \times C$ by a $1 \times 1$ convolutional layer. The conversion from Tensor 1 to Tensor 3 is exactly the same as a two-layer squeeze-and-excitation block (SE-Block) [32], which utilizes the channel-wise information to emphasize the informative features and suppress the trivial ones. Then, Tensor 3 is converted to Tensor 4 in Figure 4 of shape $1 \times 1 \times C \times 2$ by multiplying every element in Tensor 3 with two trainable parameters. During inference, the two numbers for each of the $C$ channels in Tensor 4 are normalized and serve as the probabilities of being sampled as 0 or 1; the sampled value is then pointwise multiplied with the input frequency channels to obtain Tensor 5 in Figure 4. As an example, if the normalized pair for the $i$-th channel assigns probability $p$ to the value 0, the $i$-th gate is turned off with probability $p$. In other words, the $i$-th frequency channel in Tensor 5 becomes all zeros a fraction $p$ of the time, which effectively blocks this frequency channel from being used for inference.
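The path from Tensor 1 to the normalized pairs in Tensor 4 can be sketched in plain Python. This is an illustration under assumptions rather than the authors' implementation: the activations of the SE-Block are omitted, the normalization is taken to be a softmax over each pair, and the first entry of each pair is treated as the "off" state.

```python
import math


def gate_probabilities(x, conv_weight, conv_bias, pair_params):
    """Return per-channel (p_off, p_on) for an input x[c][h][w].

    conv_weight: C x C matrix of the 1x1 convolution (acts on pooled values).
    conv_bias:   length-C bias vector.
    pair_params: length-C list of (a_off, a_on) trainable multipliers.
    """
    # Tensor 1 -> Tensor 2: global average pooling, one scalar per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    # Tensor 2 -> Tensor 3: a 1x1 convolution on a 1x1 map is a matrix-vector product.
    mixed = [sum(w_row[k] * pooled[k] for k in range(len(pooled))) + b
             for w_row, b in zip(conv_weight, conv_bias)]
    # Tensor 3 -> Tensor 4: two numbers per channel.
    pairs = [(m * a_off, m * a_on) for m, (a_off, a_on) in zip(mixed, pair_params)]
    # Normalize each pair into probabilities (softmax is an assumption).
    probs = []
    for off, on in pairs:
        top = max(off, on)
        e_off, e_on = math.exp(off - top), math.exp(on - top)
        probs.append((e_off / (e_off + e_on), e_on / (e_off + e_on)))
    return probs
```

Sampling a 0 or 1 from each pair and multiplying it with the corresponding input channel would then give Tensor 5.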
Our gate module differs from the conventional SE-Block in two ways. First, the proposed gate module outputs a tensor of dimension $1 \times 1 \times C \times 2$, where the two numbers in the last dimension describe the probability of being on and off for each frequency channel, respectively. Thus we add another convolution layer for the conversion. Second, the number multiplied to each frequency channel is either 0 or 1, i.e., a binary decision of using the frequency channel or not. The decision is obtained by sampling a Bernoulli distribution $\mathrm{Bern}(p)$, where $p$ is calculated from the two numbers in the $1 \times 1 \times C \times 2$ tensor mentioned above.
One of the challenges in the proposed gate module is that the Bernoulli sampling process is not differentiable, which matters when one needs to update the weights in the gate module. [33, 34, 35] propose a reparameterization method, called the Gumbel-Softmax trick, which allows the gradients to backpropagate through a discrete sampling process (see Gumbel samples in Figure 4).
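A minimal sketch of Gumbel-based sampling for a single gate is given below. It only shows the forward computation in plain Python; in an automatic-differentiation framework the soft value would typically carry the gradient while the hard value is used as the gate, which is a common practice and an assumption here rather than a detail stated in the paper. The function names and the temperature default are hypothetical.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def gumbel_softmax_gate(logit_off, logit_on, rng, temperature=1.0):
    """Return (hard_decision, soft_on) for one frequency channel.

    hard_decision is 0 or 1 (a Gumbel-max sample);
    soft_on is the relaxed, differentiable probability of the "on" state.
    """
    z_off = (logit_off + sample_gumbel(rng)) / temperature
    z_on = (logit_on + sample_gumbel(rng)) / temperature
    top = max(z_off, z_on)
    e_off, e_on = math.exp(z_off - top), math.exp(z_on - top)
    soft_on = e_on / (e_off + e_on)
    hard = 1 if z_on > z_off else 0
    return hard, soft_on


rng = random.Random(0)
# Logits log(0.25) and log(0.75) correspond to an "on" probability of 0.75.
samples = [gumbel_softmax_gate(math.log(0.25), math.log(0.75), rng)[0]
           for _ in range(20000)]
print(sum(samples) / len(samples))  # empirical "on" frequency, close to 0.75
```

Adding independent Gumbel noise to the logits and taking the argmax draws from the same categorical distribution as direct Bernoulli sampling, while the softmax over the perturbed logits provides a smooth surrogate.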
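Let $x \in \mathbb{R}^{H \times W \times C}$ be the input channels in the frequency domain ($C = 192$) for a CNN model. Let $F$ denote the proposed gate module such that $F(x_i) \in \{0, 1\}$ for each frequency channel $x_i$. Then $x_i$ is selected if
$F(x_i) \odot x_i \neq 0$, (1)
where $\odot$ is the elementwise product.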
We add a regularization term to the loss function that balances the number of selected frequency channels, which is minimized together with the cross-entropy loss or another accuracy-related loss. Our loss function is thus as follows,
$\mathcal{L} = \mathcal{L}_{Acc} + \lambda \cdot \sum_{i=1}^{C} F(x_i)$, (2)
where $\mathcal{L}_{Acc}$ is the loss that is related to accuracy, and $\lambda$ is a hyperparameter indicating the relative weight of the regularization term.
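The gating and the regularized objective can be illustrated as follows. The sketch assumes that the regularization term is the weighted count of selected channels, which is one reading of the description above; the function names and the example values are hypothetical.

```python
def apply_gates(channels, gates):
    """Multiply every channel elementwise by its binary gate (0 or 1)."""
    return [[[g * v for v in row] for row in ch] for ch, g in zip(channels, gates)]


def total_loss(acc_loss, gates, lam):
    """Accuracy-related loss plus lam times the number of selected channels (assumed form)."""
    return acc_loss + lam * sum(gates)


channels = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]
gates = [1, 0, 1]
print(apply_gates(channels, gates))   # the second channel is zeroed out
print(total_loss(0.5, gates, lam=0.1))
```

A larger regularization weight makes every selected channel more expensive, so fewer gates stay on after training.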
3.3 Static Frequency Channel Selection
The learning-based channel selection provides a dynamic estimation of the importance of each frequency channel, i.e., different input images may have different subsets of the frequency channels activated.
To understand the pattern of frequency channel activation, we plot two heat maps, one for the classification task (Figure 5(a)) and one for the segmentation task (Figure 5(b)). The number in each box indicates the frequency index of the channel, with a lower index indicating a lower frequency and a higher index indicating a higher frequency. The heat map value indicates the likelihood of a frequency channel being selected for inference across all the validation images.
Based on the patterns in the heat maps shown in Figure 5, we make several observations:

The low-frequency channels (boxes with small indices) are selected much more often than the high-frequency channels (boxes with large indices). This demonstrates that low-frequency channels are in general more informative than high-frequency channels for vision inference tasks.

The frequency channels in luma component Y are selected more often than the frequency channels in chroma components Cb and Cr. This indicates that the luma component is more informative for vision inference tasks.

The heat maps share a common pattern between the classification and segmentation tasks. This indicates that the two observations above are not specific to one task and are very likely to generalize to more high-level vision tasks.

Interestingly, some lower frequency channels have lower probability of being selected than the slightly higher frequency channels. For example, in Cb and Cr components, both tasks favor Channel and over Channel and .
Those observations imply that the CNN models may indeed exhibit similar characteristics to the HVS, and the image compression standards (e.g., JPEG) targeting human eyes may be suitable for the CNN models as well.
The JPEG compression standard allocates more bits to the low-frequency and the luma components. Following the same principle, we statically select the lower frequency channels, with more emphasis on the luma component than on the chroma components. This ensures that the frequency channels with higher activation probabilities are fed into the CNN models. The rest of the frequency channels can be pruned by either the image encoder or decoder to reduce the required data transmission bandwidth and input data size.
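One way to turn heat-map probabilities into a static selection is sketched below. It assumes a per-component channel budget and picks the most frequently selected frequency indices within each component; the function name, the budget format, and the toy probabilities are hypothetical and not taken from the paper.

```python
def select_static_channels(selection_prob, budget):
    """Pick the most frequently selected channels per color component.

    selection_prob: {"Y": [...], "Cb": [...], "Cr": [...]}, one probability
                    per frequency index (e.g. read off a heat map).
    budget:         {"Y": k_y, "Cb": k_cb, "Cr": k_cr} channels to keep.
    Returns {component: sorted list of kept frequency indices}.
    """
    kept = {}
    for comp, probs in selection_prob.items():
        # Sort by descending probability; ties go to the lower frequency index.
        ranked = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
        kept[comp] = sorted(ranked[:budget[comp]])
    return kept


# Toy probabilities for four frequency indices per component (made-up values).
toy_prob = {
    "Y": [0.99, 0.80, 0.85, 0.10],
    "Cb": [0.90, 0.20, 0.30, 0.05],
    "Cr": [0.88, 0.25, 0.15, 0.02],
}
print(select_static_channels(toy_prob, {"Y": 3, "Cb": 1, "Cr": 1}))
```

Giving the Y component a larger budget than Cb and Cr mirrors the emphasis on luma described above.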
Table 1: ResNet-50 classification results on ImageNet.
ResNet-50       #Channels  Size Per Channel  Top-1   Top-5   Normalized Input Size
RGB             3          224×224           75.780  92.650  1.0
YCbCr           3          224×224           75.234  92.544  1.0
DCT-192 [17]    192        28×28             76.060  93.020  1.0
DCT-192 (ours)  192        56×56             77.194  93.454  4.0
DCT-64 (ours)   64         56×56             77.232  93.624  1.3
DCT-48 (ours)   48         56×56             77.384  93.554  1.0
DCT-24 (ours)   24         56×56             77.196  93.504  0.5
Table 2: MobileNetV2 classification results on ImageNet.
MobileNetV2    #Channels  Size Per Channel  Top-1   Top-5   Normalized Input Size
RGB            3          224×224           71.702  90.415  1.0
DCT-32 (ours)  32         112×112           72.282  90.592  2.7
DCT-24 (ours)  24         112×112           72.364  90.606  2.0
DCT-12 (ours)  12         112×112           72.328  90.644  1.0
DCT-6 (ours)   6          112×112           71.776  90.258  0.5
4 Experiment Results
We benchmark our proposed methodology on three different high-level vision tasks: image classification, detection, and segmentation.
4.1 Experiment Settings on Image Classification
We benchmark our method on image classification using the ImageNet 2012 Large-Scale Visual Recognition Challenge dataset (ILSVRC-2012) [36]. We use the stochastic gradient descent (SGD) optimizer with an initial learning rate of , a momentum of , and a weight decay of 4e-5. We choose ResNet-50 [37] and MobileNetV2 [38] as the CNN models because they contain important building blocks (e.g., residual blocks and depthwise separable convolutions) widely used in modern CNN models. Note that our method can be generally applied to any CNN model. We train and epochs and decay the learning rate by every and epochs for ResNet-50 and MobileNetV2, respectively.
For the normalization of the input channels, we compute the mean and variance of each of the frequency channels separately for all the images in the training dataset.
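The per-channel normalization can be sketched as follows. The sketch computes one mean and one standard deviation per frequency channel over a (toy) training set and standardizes each sample with them; the small `eps` term and the function names are assumptions added for illustration.

```python
import math


def channel_statistics(dataset):
    """Per-channel mean and standard deviation over a list of samples.

    dataset: list of samples, each indexed as sample[c][h][w].
    """
    num_channels = len(dataset[0])
    means, stds = [], []
    for c in range(num_channels):
        values = [v for sample in dataset for row in sample[c] for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        means.append(mean)
        stds.append(math.sqrt(var))
    return means, stds


def normalize(sample, means, stds, eps=1e-8):
    """Standardize every channel of one sample with training-set statistics."""
    return [[[(v - m) / (s + eps) for v in row] for row in ch]
            for ch, m, s in zip(sample, means, stds)]
```

Because the DC channel and the high-frequency channels have very different value ranges, computing the statistics separately per channel keeps all inputs on a comparable scale.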
As described in Section 3.1, the input features in the frequency domain are generated from images with a much higher resolution than the spatial-domain counterpart. However, some of the images in the ImageNet dataset have lower resolutions. We perform preprocessing steps similar to those in the spatial domain, including resizing and cropping to a larger image size, and perform upsampling when needed.
4.2 Experiment Results on Image Classification
We train the ResNet-50 model with 192 frequency channel inputs on the image classification task using the approach described in Section 3.2. The gate module for channel selection is trained together with the ResNet-50 model. Figure 5(a) shows a heat map of the selection results over the validation set for one setting of the regularization parameter. Note that different regularization parameters generate different numbers of activated frequency channels in the heat maps. A typical example is shown in Figure 5(a), in which most channels have a very low probability of being selected.
Based on the heat maps generated by using different regularization parameters, we statically pick the top 24, 48, and 64 high-probability channels from the 192 frequency channels to train three separate ResNet-50 models in the frequency domain. For each of DCT-24, DCT-48, and DCT-64, the channels are taken from the most probable frequency channels in Y, Cb, and Cr, with a separate count for each component. The results on the ImageNet dataset are shown in Table 1, along with the result of selecting all 192 frequency channels. In particular, compared with the baseline ResNet-50, the top-1 accuracy is improved by about 1.4% using all frequency channels. Note that DCT-48 and DCT-24 select 48 and 24 frequency channels, so their input data size is the same as and half of that of the baseline ResNet-50, respectively. For DCT-24 with half of the input data size, the top-1 accuracy is still improved by about 1.4%. One should also note that the accuracy drops by roughly 0.5% when the inputs are transformed from the RGB to the YCbCr color space (both in the spatial domain), so the improvement of our method (in the frequency domain) over the YCbCr case is even larger.
Similar experiments are performed using MobileNetV2 as the baseline CNN model, and the results are shown in Table 2. The top-1 accuracy is improved by 0.58% and 0.66% by selecting 32 and 24 frequency channels, respectively. Note that in order to preserve the MobileNetV2 architecture, the spatial input dimension is set to 112×112 with 32 or 24 channels. Thus the input data size is larger than that of the baseline model with RGB input.
Table 3: Bounding box (bbox) AP of Mask R-CNN on COCO.
Backbone             #Channels  Size Per Channel  AP    AP@0.5  AP@0.75  AP_S  AP_M  AP_L
ResNet-50-FPN (RGB)  3          800×1333          37.3  59.0    40.2     21.9  40.9  48.1
DCT-24 (ours)        24         200×334           37.7  59.2    40.9     21.7  41.4  49.1
DCT-48 (ours)        48         200×334           38.1  59.5    41.2     22.0  41.3  49.8
DCT-64 (ours)        64         200×334           38.1  59.6    41.1     22.5  41.6  49.7
Table 4: Mask AP of Mask R-CNN on COCO.
Backbone             #Channels  Size Per Channel  AP    AP@0.5  AP@0.75  AP_S  AP_M  AP_L
ResNet-50-FPN (RGB)  3          800×1333          34.2  55.9    36.2     15.8  36.9  50.1
DCT-24 (ours)        24         200×334           34.6  56.1    36.9     16.1  37.4  50.7
DCT-48 (ours)        48         200×334           35.0  56.6    37.2     16.3  37.5  52.3
DCT-64 (ours)        64         200×334           35.0  56.5    37.4     16.9  37.6  51.6
4.3 Experiment Settings on Instance Segmentation
We train our model on the COCO train2017 split containing about 118k images and evaluate on the val2017 split containing 5k images. We evaluate the bounding box (bbox) average precision (AP) for the object detection task and the mask AP for the instance segmentation task. Based on Mask R-CNN [39], our model consists of a frequency-domain ResNet-50 model as introduced in Section 4.1 and a feature pyramid network [43] as the backbone. The frequency-domain ResNet-50 model is fine-tuned with the bounding-box recognition head and the mask prediction head. Input images are resized to a maximum scale of 1600×2666 without changing the aspect ratio. The corresponding DCT coefficients have a maximum size of 200×334 and are fed into the ResNet-50-FPN [43] for feature extraction.
We train our networks for epochs with an initial learning rate of , which is decreased by after and epochs. The rest of the configurations follow those of MMDetection [44].
In Table 3 and Table 4, we report the Average Precision (AP) metric that averages APs across IoU thresholds from 0.5 to 0.95 with an interval of 0.05. Both the bbox AP and the mask AP are evaluated. We also report AP@0.5 and AP@0.75 at the IoU thresholds of 0.5 and 0.75, respectively, as well as AP_S, AP_M, and AP_L at different object scales.
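The averaging over IoU thresholds can be illustrated with a short sketch. It follows the usual COCO convention of ten thresholds from 0.50 to 0.95 in steps of 0.05 and deliberately leaves out the matching and precision-recall computation that a full evaluation tool performs; the helper names are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS = [round(0.5 + 0.05 * k, 2) for k in range(10)]


def coco_style_ap(ap_per_threshold):
    """Average the per-threshold APs (dict: threshold -> AP) over all thresholds."""
    return sum(ap_per_threshold[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)
```

A detection counts as correct at a given threshold only if its IoU with a ground-truth object reaches that threshold, so AP@0.75 is a stricter localization measure than AP@0.5.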
4.4 Experiment Results on Instance Segmentation
We train our Mask R-CNN model using the 192-channel inputs in the frequency domain for instance segmentation. The gate module for dynamic channel selection is trained together with the entire Mask R-CNN. Figure 5(b) shows the heat maps for the dynamic selection.
We further train our models using only the top 24, 48, and 64 high-probability frequency channels. The bbox and mask AP of our method in the different cases are reported in Table 3 and Table 4, respectively. The experiment results show that our method outperforms the RGB-based Mask R-CNN baseline with either an equal (DCT-48) or a smaller (DCT-24) input data size. Specifically, the 24-channel model (DCT-24) achieves an improvement of 0.4 in both bbox AP and mask AP with half of the input data size compared to the RGB-based Mask R-CNN baseline.
Figure 6 visually illustrates the segmentation results of the Mask R-CNN model trained and performing inference in the frequency domain.
5 Conclusion
In this paper, we propose a method of learning in the frequency domain (using DCT coefficients as input) and demonstrate its generality and superiority for a variety of tasks, including classification, detection, and segmentation. Our method requires little modification to the existing CNN models that take RGB input and thus can be generally applied to existing network training and inference methods. We show that learning in the frequency domain better preserves image information in the preprocessing stage than the conventional spatial downsampling approach and consequently achieves improved accuracy. We propose a learning-based dynamic channel selection method and empirically show that the CNN models are more sensitive to low-frequency channels than to high-frequency channels. Experiment results show that one can prune up to 87.5% of the frequency channels using the proposed channel selection method with little or no accuracy degradation in the classification, object detection, and instance segmentation tasks.
Acknowledgement. The work by Arizona State University is supported by an NSF grant (IIS/CPS1652038).
Supplementary Material for
Learning in the Frequency Domain
This document supplements our paper entitled Learning in the Frequency Domain by providing further quantitative and qualitative insights of the results.
Appendix A Instructions to Reproduce the Experiments
We have provided the source code to reproduce the experiments in the paper. The code is based on PyTorch and is available at https://github.com/calmevtime1990/supp. There are two folders in the repo, named “classification” and “segmentation”.
Appendix B Additional Instance Segmentation Results
More instance segmentation examples are shown in Figure 7.
Appendix C Object Detection Results on Faster R-CNN
In addition to the Mask R-CNN model provided in the paper, we train our model for object detection on the COCO train2017 split and evaluate on the val2017 split using the Faster R-CNN [42] model. Our model consists of a frequency-domain ResNet-50 model (introduced in Section 4.1 in the main paper) and a feature pyramid network [43] as the backbone. The frequency-domain ResNet-50 model is fine-tuned with the classification head and the bounding box regression head. Input images are resized to a maximum scale of 1600×2666 without changing the aspect ratio. The corresponding DCT coefficients have a maximum size of 200×334 and are fed into the ResNet-50-FPN for feature extraction. The rest of the configurations follow those of MMDetection [44].
In Table 5, we report the results on the object detection task using the frequency-domain Faster R-CNN. The proposed method achieves a 0.8 AP improvement compared to the baseline Faster R-CNN on the COCO dataset.
Table 5: Bounding box (bbox) AP of Faster R-CNN on COCO.
Backbone             #Channels  Size Per Channel  AP    AP@0.5  AP@0.75  AP_S  AP_M  AP_L
ResNet-50-FPN (RGB)  3          800×1333          36.4  58.4    39.1     21.5  40.0  46.6
DCT-24 (ours)        24         200×334           37.2  58.8    39.9     21.9  40.7  48.9
DCT-48 (ours)        48         200×334           37.1  58.6    40.2     21.7  40.9  48.8
DCT-64 (ours)        64         200×334           37.2  58.5    40.6     21.9  40.9  48.3
Footnotes
 We interchangeably use the terms frequency domain and DCT domain in the context of this paper.
 https://github.com/calmevtime1990/supp/tree/master/classification
 https://github.com/calmevtime1990/supp/tree/master/segmentation
References
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[2] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[3] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
[5] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[6] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[7] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. In IJCV, 2015.
[8] Y. Pei, Y. Huang, Q. Zou, X. Zhang, and S. Wang. Effects of image degradation and degradation removal to CNN-based image classification. In TPAMI, 2019.
[9] H. Kim, M. Choi, B. Lim, and K. Lee. Task-aware image downscaling. In ECCV, 2018.
[10] F. Saeedan, N. Weber, M. Goesele, and S. Roth. Detail-preserving pooling in deep networks. In CVPR, 2018.
[11] J. Kim and S. Lee. Deep learning of human visual sensitivity in image quality assessment framework. In CVPR, 2017.
[12] X. Wei, Y. Liang, P. Zhang, C. Yu, and J. Cong. Overcoming data transfer bottlenecks in DNN accelerators via layer-conscious memory management. In FPGA, 2019.
[13] Y. You, Z. Zhang, C. Hsieh, J. Demmel, and K. Keutzer. ImageNet training in minutes. In ICPP, 2018.
[14] R. Torfason, F. Mentzer, E. Ágústsson, M. Tschannen, R. Timofte, and L. Van Gool. Towards image understanding from deep compression without decoding. In ICLR, 2018.
[15] K. Xu, Z. Zhang, and F. Ren. LAPRAN: A scalable Laplacian pyramid reconstructive adversarial network for flexible compressive sensing reconstruction. In ECCV, 2018.
[16] C. Wu, M. Zaheer, H. Hu, R. Manmatha, A. Smola, and P. Krähenbühl. Compressed video action recognition. In CVPR, 2018.
[17] L. Gueguen, A. Sergeev, B. Kadlec, R. Liu, and J. Yosinski. Faster neural networks straight from JPEG. In NIPS, 2018.
[18] M. Ehrlich and L. Davis. Deep residual learning in the JPEG transform domain. In ICCV, 2019.
[19] A. Veit and S. Belongie. Convolutional networks with adaptive inference graphs. In ECCV, 2018.
[20] X. Wang, F. Yu, Z. Dou, T. Darrell, and J. Gonzalez. SkipNet: Learning dynamic routing in convolutional networks. In ECCV, 2018.
[21] Q. Guo, Z. Yu, Y. Wu, D. Liang, H. Qin, and J. Yan. Dynamic recursive neural network. In CVPR, 2019.
[22] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. Davis, K. Grauman, and R. Feris. BlockDrop: Dynamic inference paths in residual networks. In CVPR, 2018.
[23] Z. Chen, Y. Li, S. Bengio, and S. Si. You look twice: GaterNet for dynamic filter selection in CNNs. In CVPR, 2019.
[24] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.
[25] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz. Importance estimation for neural network pruning. In CVPR, 2019.
 K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han. Haq: Hardwareaware automated quantization with mixed precision. In CVPR, 2019.
 S. Han, H. Mao, and W. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. ICLR, 2016.
 W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen. Compressing convolutional neural networks in the frequency domain. In KDD, 2016.
 Y. Wang, C. Xu, C. Xu, and D. Tao. Packing convolutional neural networks in the frequency domain. TPAMI, 2019.
 A. Dziedzic, J. Paparrizos, S. Krishnan, A. Elmore, and M. Franklin. Bandlimited training and inference for convolutional neural networks. In ICML, 2019.
 A. Lavin and S. Gray. Fast algorithms for convolutional neural networks. In CVPR, 2016.
 J. Hu, L. Shen, and G. Sun. Squeezeandexcitation networks. In CVPR, 2018.
 E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbelsoftmax. In ICLR, 2017.
 G. Tucker, A. Mnih, C. Maddison, J. Lawson, and J. SohlDickstein. Rebar: Lowvariance, unbiased gradient estimates for discrete latent variable models. In NIPS, 2017.
 C. Maddison, A. Mnih, and Y. Teh. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.
 J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. ImageNet: A LargeScale Hierarchical Image Database. In CVPR, 2009.
 K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
 K. He, G. Gkioxari, P. DollÃ¡r, and R. Girshick. Mask rcnn. In ICCV, 2017.
 T. Lin, P. DollÃ¡r, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature Pyramid Networks for Object Detection. In CVPR, 2017.
 K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. Loy, and D. Lin. MMDetection: Open mmlab detection toolbox and benchmark. ArXiv:1906.07155, 2019.
 S. Ren, K. He, R. Girshick, and J. Sun. Faster RCNN: Towards Realtime Object Detection with Region Proposal Networks. In NIPS, 2015.
 T. Lin, P. DollÃ¡r, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature Pyramid Networks for Object Detection. In CVPR, 2017.
 K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. Loy, and D. Lin.