Wavelet Convolutional Neural Networks
Abstract
Spatial and spectral approaches are two major approaches for image processing tasks such as image classification and object recognition. Among many such algorithms, convolutional neural networks (CNNs) have recently achieved significant performance improvement in many challenging tasks. Since CNNs process images directly in the spatial domain, they are essentially spatial approaches. Given that spatial and spectral approaches are known to have different characteristics, it will be interesting to incorporate a spectral approach into CNNs. We propose a novel CNN architecture, wavelet CNNs, which combines a multiresolution analysis and CNNs into one model. Our insight is that a CNN can be viewed as a limited form of a multiresolution analysis. Based on this insight, we supplement missing parts of the multiresolution analysis via wavelet transform and integrate them as additional components in the entire architecture. Wavelet CNNs allow us to utilize spectral information which is mostly lost in conventional CNNs but useful in most image processing tasks. We evaluate the practical performance of wavelet CNNs on texture classification and image annotation. The experiments show that wavelet CNNs can achieve better accuracy in both tasks than existing models while having significantly fewer parameters than conventional CNNs.
1 Introduction
Convolutional neural networks (CNNs) [27, 26] are known to be good at capturing spatial features, while spectral analyses [38, 28] are good at capturing scale-invariant features based on the spectral information. It is thus preferable to consider both the spatial and spectral information within a single model, so that it captures both types of features simultaneously. While the connection between CNNs and spectral approaches has been considered unclear so far, we found that a CNN can be seen as a limited form of a multiresolution analysis. This observation shows that conventional CNNs are missing a large part of the spectral information available via a multiresolution analysis.
We thus propose to supplement those missing parts of a multiresolution analysis as novel additional components in a CNN architecture. Figure 1 shows an overview of our model, wavelet convolutional neural networks (wavelet CNNs). Besides its theoretical formulation, we demonstrate the practical benefit of wavelet CNNs in two challenging tasks: texture classification and image annotation. We demonstrate that wavelet CNNs achieve better or competitive accuracies with a significantly smaller number of trainable parameters than conventional CNNs. Our model is thus easier to train, less prone to overfitting, and consumes less memory than conventional CNNs. To summarize, our contributions are:

Combination of CNNs and a multiresolution analysis as one model.

Reformulation of CNNs as a limited form of a multiresolution analysis.

Accurate and efficient texture classification and image annotation using our model.
2 Related Work
Convolutional Neural Networks:
CNNs essentially replaced conventional handcrafted descriptors such as the Bag of Visual Words (BoVW) [8] due to their superior performance in various tasks [26]. The original network architecture has been extended to deeper architectures since then. One such architecture is Residual Networks (ResNets) [16], which make deeper networks easier to train by introducing shortcut connections: connections which skip a few layers and perform identity mappings. Inspired by ResNets, various networks with shortcut connections have been proposed [40, 41, 19].
Even with shortcut connections, however, deeper networks still have a problem that information about the input and gradient can quickly vanish as they propagate through the networks. To address this problem, Dense Convolutional Network (DenseNet) [18] further adds shortcut connections which connect each layer with all its previous layers. While these networks have achieved impressive results on many computer vision tasks, they require significant computational resources because they have a large number of parameters. We designed our network after DenseNet, but the combination with a multiresolution analysis allows us to significantly reduce the number of parameters compared to conventional networks.
Several recent works use CNNs as a feature extractor and a BoVW approach as pooling and encoding instead of the fully connected layers. Cimpoi et al. [6] demonstrated that a CNN in combination with Fisher Vectors (FV-CNN) can achieve much better accuracy than using a CNN alone. Their model uses a pretrained CNN to extract image features and this CNN part is not trained with existing datasets. Lin et al. [30] achieved a remarkable improvement in fine-grained visual recognition by replacing the fully connected layers with bilinear pooling. The dimension of the encoded features in the bilinear model is typically higher than 250,000, which makes the model difficult to train. To address this difficulty, Gao et al. [9] proposed compact bilinear pooling, which reduces the number of parameters of bilinear pooling by 90% while maintaining its performance. Despite this significant reduction, these models still inherit a large number of trainable parameters from conventional CNNs, which makes them difficult to train in practice. We show that our model achieves results competitive with compact bilinear pooling while further reducing the number of parameters in a texture classification task.
Spectral Approaches:
Spectral approaches transform images into the frequency domain using a set of spatial filters. The statistics of the spectral information at different scales and orientations define image features. This approach has been well studied in image processing and achieved practical results [38, 2, 23]. Feature extraction in the frequency domain has an advantage: a spatial filter can be easily made selective by enhancing certain frequencies while suppressing the others. This explicit selection of certain frequencies is difficult to control in CNNs. While CNNs are known to be universal approximators, in practice, it is unclear whether CNNs can learn to perform spectral analyses with available datasets. Rather than relying on CNNs to learn to perform spectral analysis, we propose to directly integrate spectral approaches into CNNs, particularly based on a multiresolution analysis using wavelet transform [33]. Our experiments show that a CNN with more parameters cannot be trained to become equivalent to our model with available datasets in practice.
3 Wavelet Convolutional Neural Networks
Overview:
We propose to formulate convolution and pooling in CNNs as filtering and downsampling. This formulation allows us to connect CNNs with a multiresolution analysis. In the following explanations, we use single-channel 1D data for the sake of brevity. Applications to 2D images with multiple channels are trivially possible, as is done in CNNs.
3.1 Convolutional Neural Networks
In addition to the use of an activation function and a fully connected layer, CNNs introduce convolution/pooling layers. Figure 2 illustrates the configuration we explain in the following.
Convolution Layers:
Given an input vector $x = (x_0, x_1, \ldots, x_{n-1}) \in \mathbb{R}^n$ with $n$ components, a convolution layer outputs a vector $y = (y_0, y_1, \ldots, y_{n-1}) \in \mathbb{R}^n$ with the same number of components:
$$y_i = \sum_{j \in N_i} w_j x_j \qquad (1)$$
where $N_i$ is a set of indices of neighbors at $x_i$ and $w_j$ is a weight. Following the notational convention in CNNs, we consider that $w_j$ includes the bias by having a constant input of $1$. The equation thus says that each output $y_i$ is a weighted sum of neighbors $x_j$ plus a constant.
Each layer defines the weights $w_j$ as constants over $i$. By sharing parameters, CNNs reduce the number of parameters and achieve translation invariance in the image space. The definition of $y_i$ in Equation 1 is equivalent to convolution of $x_i$ via a filtering kernel $w_j$, thus this layer is called a convolution layer. We can thus rewrite $y$ in Equation 1 using the convolution operator $*$ as
$$y = x * w \qquad (2)$$
where $w = (w_0, w_1, \ldots, w_{o-1}) \in \mathbb{R}^o$.
Pooling Layers:
Pooling layers are typically used immediately after convolution layers to simplify the information. We focus on average pooling, which allows us to see the connection with a multiresolution analysis. Given an input $x \in \mathbb{R}^n$, average pooling outputs a vector $y \in \mathbb{R}^m$ with fewer components as
$$y_i = \frac{1}{p} \sum_{j=0}^{p-1} x_{pi+j} \qquad (3)$$
where $p$ defines the support of pooling and $m = n/p$. For example, $p = 2$ means that we reduce the number of outputs to half of the inputs by taking pairwise averages. Using the standard downsampling operator $\downarrow$, we can rewrite Equation 3 as
$$y = (x * \mathbf{p}) \downarrow p \qquad (4)$$
where $\mathbf{p} = (1/p, \ldots, 1/p) \in \mathbb{R}^p$ represents the averaging filter. Average pooling mathematically involves convolution via $\mathbf{p}$ followed by downsampling with a stride of $p$.
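The equivalence between average pooling and "filter, then downsample" can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation; the function names and the example signal are illustrative assumptions, and the signal length is assumed to be a multiple of the pooling support.

```python
import numpy as np

def average_pool(x, support):
    """Average pooling: mean of each non-overlapping block of `support` samples."""
    x = np.asarray(x, dtype=float)
    return x.reshape(-1, support).mean(axis=1)

def filter_then_downsample(x, support):
    """Convolve with the averaging filter, then keep every `support`-th sample."""
    x = np.asarray(x, dtype=float)
    averaging_filter = np.full(support, 1.0 / support)
    # "valid" keeps only positions where the filter fully overlaps the signal.
    filtered = np.convolve(x, averaging_filter, mode="valid")
    return filtered[::support]

x = np.array([1.0, 3.0, 2.0, 6.0, 4.0, 4.0, 0.0, 8.0])  # hypothetical input
pooled = average_pool(x, 2)
via_filter = filter_then_downsample(x, 2)
```

Both routes give the same pairwise averages, which is what allows pooling to be treated as one more filtering operation.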
3.2 Generalized Convolution and Pooling
Equation 2 and Equation 4 can be combined into a generalized form of convolution and downsampling as
$$y = (x * k) \downarrow p \qquad (5)$$
The generalized weight $k$ is defined as

$k = w$ with $p = 1$ (convolution in Equation 2)

$k = \mathbf{p}$ with $p > 1$ (pooling in Equation 4)

$k = w * \mathbf{p}$ with $p > 1$ (convolution followed by pooling).
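The third case relies on the associativity of convolution: filtering with a kernel and then with the averaging filter equals filtering once with their convolution. A small sketch under stated assumptions (random data, a hypothetical 3-tap kernel, zero-padded "full" convolution, bias and activation omitted) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)   # hypothetical 1D input
w = rng.standard_normal(3)    # hypothetical learned kernel
support = 2
averaging_filter = np.full(support, 1.0 / support)

# Two steps: convolution layer, then average pooling (filter + downsample).
two_step = np.convolve(np.convolve(x, w), averaging_filter)[::support]

# One step: a single combined kernel, then downsample.
combined_kernel = np.convolve(w, averaging_filter)
one_step = np.convolve(x, combined_kernel)[::support]
```

The two results agree to floating-point precision, so a convolution layer followed by average pooling behaves like one generalized filtering-and-downsampling step.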
Our insight is that Equation 5 is equivalent to a part of a multiresolution analysis. To see this connection, let us consider convolution followed by pooling with $p = 2$ and a pair of convolution kernels $k_l$ and $k_h$:
$$x_l = (x * k_l) \downarrow 2, \qquad x_h = (x * k_h) \downarrow 2 \qquad (6)$$
At this point, it is nothing but convolution and pooling with two different kernels $k_l$ and $k_h$. The key idea is that a multiresolution analysis [7] decomposes $x_l$ further as follows.
By defining $x_{l,0} = x$, a multiresolution analysis performs a hierarchical decomposition of $x_{l,t}$ into $x_{l,t+1}$ and $x_{h,t+1}$ by repeatedly applying Equation 6 with different $k_{l,t}$ and $k_{h,t}$ at each $t$:
$$x_{l,t+1} = (x_{l,t} * k_{l,t}) \downarrow 2, \qquad x_{h,t+1} = (x_{l,t} * k_{h,t}) \downarrow 2 \qquad (7)$$
The number of applications $t$ is called a level in a multiresolution analysis. Based on our reformulation, CNNs essentially discard $x_{h,t}$ entirely and use only one set of kernels $k_{l,t}$:
$$x_{l,t+1} = (x_{l,t} * k_{l,t}) \downarrow p \qquad (8)$$
Therefore, CNNs can be seen as a limited form of a multiresolution analysis.
Figure 1 illustrates how CNNs and our wavelet CNNs differ under this formulation. We call $k_{l,t}$ and $k_{h,t}$ the low-pass and high-pass filters to follow the convention of multiresolution analyses. Note, however, that they are not necessarily low-pass and high-pass filters in the spectral domain. Conventional CNNs can be seen as a limited form of a multiresolution analysis that uses only $k_{l,t}$ without a characteristic hierarchical decomposition (Equation 8). Our model supplements the part that is missing because the high-frequency components $x_{h,t}$ are discarded, by introducing another set of kernels $k_{h,t}$ to form a multiresolution analysis via wavelet transform inside a neural network architecture. While this idea might look simple after the fact, our model is powerful enough to outperform the existing more complex models as we will show in the results.
Note that we cannot use an arbitrary pair of filters ($k_{l,t}$ and $k_{h,t}$) to perform multiresolution analysis. For wavelet transform, $k_{h,t}$ is known as the wavelet function and $k_{l,t}$ is known as the scaling function. We used Haar wavelets [12] for our experiments, but our model is not restricted to Haar. This constraint also suggests why it is difficult to train conventional CNNs to perform the same computation as wavelet CNNs do: weights in CNNs are ignorant of this important constraint and just try to learn it from datasets.
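To make the hierarchical decomposition concrete, the following is a minimal NumPy sketch of a multilevel 1D Haar decomposition. It is an illustration rather than the authors' code: the orthonormal scaling by $1/\sqrt{2}$, the sign convention of the high-pass branch, and the example signal are assumptions, and the signal length is assumed to be divisible by $2^{\text{levels}}$.

```python
import numpy as np

def haar_step(x):
    """One level: filter with the Haar scaling/wavelet pair and downsample by 2."""
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)    # low-pass branch (scaling function)
    high = (even - odd) / np.sqrt(2.0)   # high-pass branch (wavelet function)
    return low, high

def haar_decompose(x, levels):
    """Repeatedly decompose the low-pass output, keeping every high-pass band."""
    low = np.asarray(x, dtype=float)
    highs = []
    for _ in range(levels):
        low, high = haar_step(low)
        highs.append(high)
    return low, highs

signal = np.array([4.0, 2.0, 6.0, 8.0, 1.0, 3.0, 5.0, 7.0])  # hypothetical input
low, highs = haar_decompose(signal, levels=3)
```

The low-pass branch of one step is pairwise average pooling scaled by $\sqrt{2}$, which is the part a conventional CNN keeps; the `highs` list holds the bands that such a CNN would discard. With the orthonormal scaling, the total energy of `low` and `highs` equals that of the input.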
Rippel et al. [34] proposed a related approach of replacing convolution and pooling by discrete Fourier transform and truncation of the coefficients. This approach, called spectral pooling, is equivalent to Equation 8, thus it is not essentially different from conventional CNNs. Our model is also different from merely applying multiresolution analysis on input data and using CNNs afterward, since multiresolution analysis is built inside the network with skip connections.
3.3 Implementation
Network Structure:
Figure 1 illustrates our network structure. We designed our main network structure after a VGG network [37]. We use convolutional kernels of a single size exclusively, together with padding that keeps the output the same size as the input.
Instead of using pooling layers to reduce the size of the feature maps, we use convolution layers with an increased stride. With suitable padding, a layer with a stride of two produces an output half the size of its input. This approach can be used to replace max pooling without loss in accuracy [21]. In addition, since both the VGG-like architecture and image decomposition in multiresolution analysis successively halve the size of images, we combine each level of decomposed images with the feature maps of the layer that has the same size as those images.
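The halving behavior follows from the standard output-size formula for a strided convolution. The sketch below is only an illustration: the input size of 224, kernel size of 3, and padding of 1 are assumed values chosen for the example, not values taken from the text.

```python
def conv_output_size(input_size, kernel_size, padding, stride):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# Assumed example: four stride-2 layers applied to a 224-pixel-wide input.
sizes = [224]
for _ in range(4):
    sizes.append(conv_output_size(sizes[-1], kernel_size=3, padding=1, stride=2))
```

Under these assumptions the width is halved at every strided layer, mirroring how each level of a wavelet decomposition halves the image size.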
Furthermore, in order to use the information of decomposed images more efficiently, we use dense connections [18] and projection shortcuts [16]. Dense connections allow each level of decomposed images to be directly connected with all subsequent layers through channel-wise concatenation. With this connectivity, our network can propagate all the information effectively to the end of the network. Projection shortcuts can be used to increase the dimensions of the input with convolutional kernels. In our model, since the dimensions of feature maps are different before and after the shortcut path, we use projection shortcuts in every shortcut. We also use global average pooling [29] instead of fully connected layers to prevent overfitting.
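The three building blocks named above can be sketched in a few lines of NumPy. This is a shape-level illustration only, not the authors' network: the channel counts and spatial size are hypothetical, and the projection is assumed to be a 1x1 convolution, which reduces to a matrix product over the channel axis.

```python
import numpy as np

def concat_channels(feature_maps, wavelet_bands):
    """Dense-style connection: stack arrays of shape (C, H, W) along channels."""
    return np.concatenate([feature_maps, wavelet_bands], axis=0)

def project_1x1(x, weight):
    """Projection with a 1x1 kernel; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def global_average_pool(x):
    """Average each channel over its full spatial extent."""
    return x.mean(axis=(1, 2))

rng = np.random.default_rng(0)
features = rng.standard_normal((64, 28, 28))   # hypothetical CNN feature maps
bands = rng.standard_normal((9, 28, 28))       # hypothetical decomposed images
merged = concat_channels(features, bands)
weight = rng.standard_normal((128, 73))        # hypothetical projection weights
projected = project_1x1(merged, weight)
pooled = global_average_pool(projected)
```

Concatenation only works when the decomposed images and the feature maps share the same spatial size, which is why each decomposition level is attached to the layer of matching resolution.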
Learning:
Wavelet CNNs use global average pooling whose window matches the size of its input, so input images must have a fixed size. We thus train our proposed model exclusively with images of a single fixed size. These images are obtained by first scaling the training images to a fixed resolution and then applying random crops of the input size and random flipping. This random variation helps to prevent overfitting. For further robustness, we use batch normalization [20] throughout our network before activation layers during training. For the optimizer, we use the Adam optimizer [24] instead of SGD. We use the Rectified Linear Unit (ReLU) [10] as the activation function in all the experiments.
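The crop-and-flip augmentation can be sketched as follows. This is a minimal illustration under assumptions: the image is taken to be already rescaled, the sizes 256 and 224 are example values rather than values from the text, and the flip is assumed to be horizontal with probability 0.5.

```python
import numpy as np

def random_crop_and_flip(image, crop_size, rng):
    """Take a random square crop of an (H, W, C) image and flip it half the time."""
    height, width = image.shape[:2]
    top = rng.integers(0, height - crop_size + 1)
    left = rng.integers(0, width - crop_size + 1)
    patch = image[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # horizontal flip
    return patch

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))   # stand-in for a rescaled training image
patch = random_crop_and_flip(image, crop_size=224, rng=rng)
```

Every call returns a patch of the same fixed size, which is what the global average pooling layer requires.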
4 Experiments
We provide details of two applications of wavelet CNNs. We applied wavelet CNNs to texture classification to confirm that wavelet CNNs can capture small features of images. We also investigated how wavelet CNNs perform on natural images in an image annotation task.
4.1 Texture Classification
Texture classification is a challenging problem since textures often vary a lot within the same class, due to changes in viewpoints, scales, lighting configurations, etc. In addition, textures usually do not contain enough information about the shape of objects, which is what helps distinguish different objects in image classification tasks. Due to such difficulties, even the latest approaches based on convolutional neural networks have achieved limited success compared to other tasks such as image classification [13]. Andrearczyk et al. [1] proposed texture CNN (T-CNN), which is a CNN specialized for texture classification. T-CNN uses a novel energy layer in which each feature map is simply pooled by calculating the average of its activated output. This results in a single value for each feature map, similar to an energy response to a filter bank. This approach does not improve classification accuracy, but its simple architecture reduces the number of parameters.
Datasets:
For our experiments, we used two publicly available texture datasets: KTH-TIPS2-b [14] and DTD [5]. The KTH-TIPS2-b dataset contains 11 classes of 432 texture images each. Each class consists of four samples and each sample has 108 images. Each sample is used for training once while the remaining three samples are used for testing. The results for KTH-TIPS2-b are shown as the mean and the standard deviation over the four splits. The DTD dataset contains 47 classes of 120 images each, collected "in the wild", which means that the images are collected in uncontrolled conditions. This dataset includes 10 available annotated splits with 40 training images, 40 validation images, and 40 testing images for each class. The results for DTD are averaged over the 10 splits. We processed the images in each dataset by global contrast normalization. We calculated the accuracy as the percentage of images that are correctly labeled, which is a common metric in texture classification.
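The preprocessing and the metric can be sketched as follows. The text does not give the exact normalization formula, so this sketch assumes one common form of global contrast normalization (subtract the per-image mean, divide by the per-image standard deviation); the function names and the `eps` guard are illustrative.

```python
import numpy as np

def global_contrast_normalize(image, eps=1e-8):
    """Zero-center one image and scale it to unit standard deviation."""
    image = np.asarray(image, dtype=float)
    centered = image - image.mean()
    return centered / max(centered.std(), eps)

def accuracy_percent(predicted, actual):
    """Percentage of images whose predicted label matches the true label."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return 100.0 * float(np.mean(predicted == actual))

rng = np.random.default_rng(0)
normalized = global_contrast_normalize(rng.random((32, 32, 3)))
score = accuracy_percent([0, 1, 2, 2], [0, 1, 1, 2])
```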
Table 1: Results of our model with different levels of a multiresolution analysis.
Dataset       2-level  3-level  4-level  5-level
KTH-TIPS2-b
DTD
Table 2: Results of each model trained from scratch.
Dataset       AlexNet  T-CNN  Wavelet CNN
KTH-TIPS2-b
DTD
Training from scratch:
Table 1 shows the results of our model with different levels of a multiresolution analysis. For initialization of the parameters, we used a robust method for ReLU [15]. For both datasets, the network with 5-level decomposition performed the best, though the model with 4-level decomposition achieved almost the same accuracy. Figure 3 and Table 2 compare our model with AlexNet [26] and T-CNN [1], with each model trained from scratch on the texture datasets. Since the model with 5-level decomposition achieved the best accuracy in the previous experiment, we used this network in this and the following experiments as well. Since VGG networks tend to perform poorly due to overfitting if trained from scratch, we used AlexNet as an example of conventional CNNs for this experiment. For both datasets, our model performs better than AlexNet and T-CNN by a large margin.
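The initialization in [15] draws weights from a zero-mean Gaussian whose standard deviation is $\sqrt{2 / n}$, where $n$ is the fan-in of the layer. A minimal sketch is shown below; the weight layout (out_channels, in_channels, kernel_h, kernel_w) and the layer shape are assumptions made for the example.

```python
import numpy as np

def he_normal(shape, rng):
    """Gaussian initialization with std sqrt(2 / fan_in) for ReLU networks."""
    fan_in = int(np.prod(shape[1:]))   # in_channels * kernel_h * kernel_w
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=shape)

rng = np.random.default_rng(0)
weights = he_normal((64, 32, 3, 3), rng)   # hypothetical convolution layer
```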
Training with fine-tuning:
Figure 4 and Table 3 show the classification rates using the networks pretrained with the ImageNet 2012 dataset [35]. We compared our model with a spectral approach using shearlet transform [25], VGG-M [4], T-CNN [1], and VGG-M using compact bilinear pooling [9]. For compact bilinear pooling, we compared our model only with Tensor Sketch (TS) since it worked the best in practice. Our model again achieved the best performance for both datasets. While the improvement for the DTD dataset might be marginal (less than 1%), as we show later, this performance is achieved with significantly fewer parameters than other methods.
Table 3: Results of the networks pretrained with the ImageNet 2012 dataset.
Dataset       Shearlet  VGG-M  T-CNN  TS+VGG-M  Wavelet CNN
KTH-TIPS2-b
DTD
Visual comparisons of classified images:
Figure 5 shows some extracted images for several classes in our experiments. The images in the top row are from the KTH-TIPS2-b dataset, while the images in the bottom row are from the DTD dataset. A red square indicates an incorrectly classified texture. We can visually confirm that a spectral approach (shearlet) is insensitive to scale variation and extracts detailed features, whereas a spatial approach (VGG-M) is insensitive to distortion. For example, in Aluminium foil, a shearlet transform can correctly ignore the scale of wrinkles, but VGG-M failed to classify such an image into the same class. In Banded, VGG-M classifies distorted lines into the correct class, but a shearlet transform could not recognize this line-like structure well. Since our model is the combination of both approaches, it can assign texture images to the correct label in every variation above.
4.2 Image Annotation
The purpose of an image annotation task is to associate multiple labels with an image according to its content. This task is more natural than single-label image classification because a natural image usually includes various objects. The convolutional neural network and recurrent neural network (CNN-RNN) encoder-decoder model is a popular approach [39, 22, 31] for this task. In this model, a CNN encodes the image into a fixed-length vector, which is then fed into an RNN that decodes it into a list of tags. The existing models share this concept and differ slightly in how the CNN and RNN relate to each other.
Recurrent image annotator (RIA) [22] uses the image features output from the CNN as the RNN hidden states. Its authors focus on the order of the list of input tags and show that the rare-first order, which puts rarer tags first based on their frequency, improves the performance. Liu et al. proposed semantically regularized CNN-RNN (S-CNN-RNN) [31], where the CNN model is regularized by semantic concepts which serve as strong deep supervision to guide the learning of the CNN layers. The prediction layer of the CNN in this model is also used as the RNN initial states. Both models use VGG16 [37] as the CNN and the long short-term memory (LSTM) [17] as the RNN. In our experiment, we compared our model with RIA.
Datasets:
We used two benchmark image annotation datasets: IAPR-TC12 [11] and Microsoft COCO [29]. The IAPR-TC12 dataset contains 20,000 images of natural scenes with text captions in several languages. To use this dataset for image annotation, we extract common nouns from the captions, following the previous work [32]. This process results in a vocabulary size of 291. We used 17,665 images for training and the remaining images for testing. The Microsoft COCO (MS-COCO) dataset contains 82,783 training images and 40,504 testing images. Following the previous works [39, 31], we employed 80 object annotations as labels.
Training Details:
We replaced VGG16 in RIA by a wavelet CNN with 5-level decomposition. For the LSTM, the dimension of both the hidden states and the input is set to 1024 and the number of hidden layers is 1. When training, the hidden state and the cell state are initialized by image features from the CNN and by zero, respectively. Additionally, since the original RIA uses the 4096-dimensional output from the last fully connected layer of VGG16 as image features, we add a 2048-dimensional fully connected layer to our model just after an average pooling layer and use the output from this layer as image features. Even though this additional layer increases the number of trainable parameters in our model, our model still has only 18.3 million parameters while VGG16 has 138.4 million parameters. For the order of input tags, we use the rare-first order following the original paper [22].
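The rare-first order can be sketched as a sort by training-set frequency. This is a minimal illustration, not code from RIA; the tag lists are made up, and breaking ties alphabetically is an assumption added to make the order deterministic.

```python
from collections import Counter

def rare_first(tags, tag_counts):
    """Order tags from rarest to most frequent; ties are broken alphabetically."""
    return sorted(tags, key=lambda tag: (tag_counts[tag], tag))

# Hypothetical training annotations used only to count tag frequencies.
training_annotations = [
    ["sky", "tree"],
    ["sky", "person"],
    ["sky", "tree", "kite"],
]
tag_counts = Counter(tag for tags in training_annotations for tag in tags)
ordered = rare_first(["sky", "tree", "kite"], tag_counts)
```

Here "kite" appears once, "tree" twice, and "sky" three times, so the rarest tag is placed first in the target sequence for the RNN.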
Table 4: Results of RIA with VGG16 and with the wavelet CNN on IAPR-TC12.
Model         CP     CR     CF1    OP     OR     OF1
VGG16 [37]    22.97  27.39  24.99  33.87  34.93  34.40
Wavelet CNN   29.01  30.62  29.79  37.43  37.66  37.54
Table 5: Results of RIA with VGG16 and with the wavelet CNN on MS-COCO.
Model         CP     CR     CF1    OP     OR     OF1
VGG16 [37]    51.55  45.60  48.49  57.94  51.92  54.77
Wavelet CNN   53.17  46.69  49.72  58.68  52.04  55.16
Results:
We used per-class and overall metrics including precision (CP and OP), recall (CR and OR), and F1 score (CF1 and OF1) as evaluation metrics. Per-class metrics take the average over all classes while overall metrics take the average over all test images. As RIA can produce annotations of arbitrary length, we used the arbitrary-length results for comparison. Table 4 and Table 5 show the results using RIA models with VGG16 and the wavelet CNN for IAPR-TC12 and MS-COCO. For IAPR-TC12, RIA with our model obtained much better results than the original RIA. The improvement for MS-COCO was marginal in comparison. However, all these results are obtained with significantly fewer parameters; the wavelet CNN has more than seven times fewer parameters than VGG16.
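The six metrics can be sketched as follows. The exact formulas are not spelled out in the text, so this sketch assumes the definitions commonly used for multi-label annotation: per-class precision and recall are averaged over labels, overall precision and recall pool the counts over all images and labels, and each F1 score is the harmonic mean of the corresponding precision and recall (which is consistent, up to rounding, with the IAPR-TC12 values reported above). Function names and example data are illustrative.

```python
import numpy as np

def safe_ratio(numerator, denominator):
    """Element-wise ratio that returns 0 where the denominator is 0."""
    numerator = np.asarray(numerator, dtype=float)
    denominator = np.asarray(denominator, dtype=float)
    out = np.zeros_like(numerator)
    np.divide(numerator, denominator, out=out, where=denominator > 0)
    return out

def f1(precision, recall):
    total = precision + recall
    return 2.0 * precision * recall / total if total > 0 else 0.0

def annotation_metrics(y_true, y_pred):
    """y_true, y_pred: binary arrays of shape (num_images, num_labels)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    correct = (y_true & y_pred).sum(axis=0)   # correct predictions per label
    predicted = y_pred.sum(axis=0)            # predictions per label
    actual = y_true.sum(axis=0)               # ground-truth tags per label
    cp = float(safe_ratio(correct, predicted).mean())
    cr = float(safe_ratio(correct, actual).mean())
    op = correct.sum() / predicted.sum() if predicted.sum() else 0.0
    or_ = correct.sum() / actual.sum() if actual.sum() else 0.0
    return {"CP": cp, "CR": cr, "CF1": f1(cp, cr),
            "OP": float(op), "OR": float(or_), "OF1": f1(op, or_)}

# Hypothetical example: 3 images, 2 labels.
metrics = annotation_metrics([[1, 0], [1, 1], [0, 1]],
                             [[1, 1], [0, 1], [1, 1]])
```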
Figure 7 shows some results in image annotation. The images in the top row are from IAPR-TC12 while the images in the bottom row are from MS-COCO. GT indicates the ground-truth annotations, and they are organized in the rare-first order. VGG and ours show the results of RIA with VGG16 and our model, where the order of the predictions is preserved as output by the RNN.
4.3 Number of parameters
To assess the complexity of each model, we compared the number of trainable parameters such as weights and biases for classification into 1000 classes (Figure 6). Conventional CNNs such as VGG-M and AlexNet have a large number of parameters while being slightly shallower than our proposed model. Even compared to T-CNN, which aims at reducing the model complexity, the number of parameters in our model with 5-level decomposition is about half. Recall also that our model achieved higher accuracy than T-CNN in texture classification.
This result confirms that our model achieves better results with significantly fewer parameters than existing models. The memory consumption of each Caffe model is: 392 MB (VGG-M), 232 MB (AlexNet), 89.1 MB (T-CNN), and 53.9 MB (Ours). A small number of parameters generally suppresses overfitting of the model for small datasets.
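For reference, the parameter count of a single convolution layer and the ratio between two model sizes can be computed as below. This is a back-of-the-envelope sketch: the layer shape is a hypothetical example, the 4-bytes-per-parameter figure assumes 32-bit floats, and the 18.3 million and 138.4 million counts are the ones quoted for the annotation models in Section 4.2.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel for a square 2D kernel."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

def float32_megabytes(num_params):
    """Storage for the parameters alone, assuming 4 bytes each."""
    return num_params * 4 / 1e6

first_layer = conv_params(in_channels=3, out_channels=64, kernel_size=3)
ratio = 138.4 / 18.3   # VGG16 vs. wavelet CNN with the added fully connected layer
```

The ratio is roughly 7.6, which matches the statement that the wavelet CNN has more than seven times fewer parameters than VGG16.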
5 Discussion
Application to more general tasks:
We applied our model to two challenging tasks: texture classification and image annotation. However, since we do not assume anything regarding the input, our model is not necessarily restricted to these tasks. For example, we trained a wavelet CNN with 5-level decomposition and AlexNet from scratch on the ImageNet 2012 dataset to perform image classification. Our model obtained an accuracy of 59.4% whereas AlexNet resulted in 57.1%. Note that the number of parameters of our model is about four times smaller than that of AlexNet (Figure 6). Our model is thus also suitable for image classification with a smaller memory footprint. Other applications such as image recognition and object detection with our model should be similarly possible.
$L_p$ pooling:
An interesting generalization of max and average pooling is $L_p$ pooling [3, 36]. The idea of $L_p$ pooling is that max pooling can be thought of as computing the $L_\infty$ norm, while average pooling can be considered as computing the $L_1$ norm. In this case, Equation 4 cannot be written as linear convolution anymore due to the nonlinear transformation in the norm calculation. Our overall formulation, however, is not necessarily limited to a multiresolution analysis either; we can just replace the downsampling part by the corresponding norm computation to support $L_p$ pooling. This modification, however, will not retain all the frequency information of the input as it is no longer a multiresolution analysis. We focused on average pooling as it has a clear connection to a multiresolution analysis.
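The relation between the norm order and the two familiar pooling types can be illustrated with a short sketch. The normalization is an assumption: this version averages the powered magnitudes within each block (one common variant), so that order 1 reproduces average pooling on non-negative inputs and an infinite order reproduces max pooling.

```python
import numpy as np

def lp_pool(x, support, order):
    """Pool non-overlapping blocks of `support` samples with an L_p-style norm."""
    blocks = np.abs(np.asarray(x, dtype=float)).reshape(-1, support)
    if np.isinf(order):
        return blocks.max(axis=1)   # limit of large order: max pooling
    return np.mean(blocks ** order, axis=1) ** (1.0 / order)

x = np.array([1.0, 3.0, 2.0, 2.0])   # hypothetical non-negative activations
average_like = lp_pool(x, support=2, order=1)
max_like = lp_pool(x, support=2, order=np.inf)
```

Intermediate orders interpolate between the two, but the power and root make the operation nonlinear, which is why it can no longer be expressed as a linear convolution.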
Limitations:
We designed wavelet CNNs to put each high-frequency part between layers of the CNN. Since our network has four layers to reduce the size of feature maps, the maximum decomposition level is restricted to five. This design is likely to be less than ideal since we cannot tweak the decomposition level independently from the depth (and thereby the number of trainable parameters) of the network. A different network design might make this separation of hyperparameters possible.
Wavelet CNNs achieved the best accuracy in texture classification both when trained from scratch and with fine-tuning. With fine-tuning, however, our model outperforms other methods only by a slight margin, especially for DTD, albeit with a significantly smaller number of parameters. We speculate that this is partially because pretraining with the ImageNet 2012 dataset is simply not appropriate for texture classification. An exact reasoning about failure cases for texture classification, however, is generally difficult for any neural network model, and our model is not an exception.
6 Conclusion
We presented a novel CNN architecture which incorporates a spectral analysis into CNNs. We showed how to reformulate convolution and pooling layers in CNNs into a generalized form of filtering and downsampling. This reformulation shows how conventional CNNs perform a limited version of multiresolution analysis, which then allows us to integrate multiresolution analysis into CNNs as a single model called wavelet CNNs. We demonstrated that our model achieves better accuracy for texture classification and image annotation with a smaller number of trainable parameters than existing models. In particular, our model outperformed all the existing models with significantly more trainable parameters by a large margin when we trained each model from scratch. A wavelet CNN is a general learning model, and applications to other problems are interesting future work. Finally, for the wavelet transform in our model, we use fixed-weight kernels given by the Haar wavelet. It would be interesting to explore how wavelet kernels themselves can be trained in an end-to-end learning framework.
References
 [1] V. Andrearczyk and P. F. Whelan. Using filter banks in convolutional neural networks for texture classification. Pattern Recognition Letters, 84:63 – 69, 2016.
 [2] S. Arivazhagan, T. G. Subash Kumar, and L. Ganesan. Texture classification using curvelet transform. International Journal of Wavelets, Multiresolution and Information Processing, 05(03):451–464, 2007.
 [3] Y. Boureau, J. Ponce, and Y. Lecun. A theoretical analysis of feature pooling in visual recognition. In International Conference on Machine Learning (ICML10), pages 111–118. Omnipress, 2010.
 [4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014.
 [5] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
 [6] M. Cimpoi, S. Maji, and A. Vedaldi. Deep filter banks for texture recognition and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
 [7] J. L. Crowley. A representation for visual information. Technical report, 1981.
 [8] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, pages 1–22, 2004.
 [9] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell. Compact bilinear pooling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 317–326, 2016.
 [10] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS11), volume 15, pages 315–323, 2011.
 [11] M. Grubinger. Analysis and evaluation of visual information systems performance, 2007. Thesis (Ph. D.)–Victoria University (Melbourne, Vic.), 2007.
 [12] A. Haar. Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen, 69(3):331–371, 1910.
 [13] L. G. Hafemann, L. S. Oliveira, and P. Cavalin. Forest species recognition using deep convolutional neural networks. In International Conference on Pattern Recognition (ICPR), pages 1103–1107, 2014.
 [14] E. Hayman, B. Caputo, M. Fritz, and J.-O. Eklundh. On the Significance of Real-World Conditions for Material Classification, volume 4, pages 253–266. 2004.
 [15] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, 2015.
 [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.
 [17] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, 1997.
 [18] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [19] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep Networks with Stochastic Depth, pages 646–661. Springer International Publishing, 2016.
 [20] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
 [21] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. In International Conference on Learning Representations Workshop Track, 2015.
 [22] J. Jin and H. Nakayama. Annotation order matters: Recurrent image annotator for arbitrary length image tagging. In International Conference on Pattern Recognition (ICPR), pages 2452–2457, 2016.
 [23] M. Kanchana and P. Varalakshmi. Texture classification using discrete shearlet transform. International Journal of Scientific Research, 5, 2013.
 [24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [25] K. G. Krishnan, P. T. Vanathi, and R. Abinaya. Performance analysis of texture classification techniques using shearlet transform. In International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), pages 1408–1412, 2016.
 [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114. 2012.
 [27] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, Dec 1989.
 [28] W. Q. Lim. The discrete shearlet transform: A new directional transform and compactly supported shearlet frames. IEEE Transactions on Image Processing, 19(5):1166–1180, May 2010.
 [29] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2014.
 [30] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNNs for fine-grained visual recognition. In Transactions of Pattern Analysis and Machine Intelligence (PAMI), 2017.
 [31] F. Liu, T. Xiang, T. M. Hospedales, W. Yang, and C. Sun. Semantic Regularisation for Recurrent Image Annotation. ArXiv e-prints, 2016.
 [32] A. Makadia, V. Pavlovic, and S. Kumar. A new baseline for image annotation. In European Conference on Computer Vision (ECCV), pages 316–329, 2008.
 [33] S. G. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674–693, Jul 1989.
 [34] O. Rippel, J. Snoek, and R. P. Adams. Spectral representations for convolutional neural networks. In Neural Information Processing Systems (NIPS), pages 2449–2457, 2015.
 [35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [36] P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit classification. In International Conference on Pattern Recognition (ICPR), pages 3288–3291, 2012.
 [37] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
 [38] M. Unser. Texture classification and segmentation using wavelet frames. IEEE Transactions on Image Processing, 4(11):1549–1560, 1995.
 [39] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu. CNN-RNN: A unified framework for multi-label image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2285–2294, 2016.
 [40] S. Zagoruyko and N. Komodakis. Wide residual networks. CoRR, abs/1605.07146, 2016.
 [41] K. Zhang, M. Sun, X. Han, X. Yuan, L. Guo, and T. Liu. Residual networks of residual networks: Multilevel residual networks. IEEE Transactions on Circuits and Systems for Video Technology, PP(99):1–1, 2017.