Filter Bank Regularization of Convolutional Neural Networks
Regularization techniques are widely used to improve the generality, robustness, and efficiency of deep convolutional neural networks (DCNNs). In this paper, we propose a novel approach of regularizing DCNN convolutional kernels by a structured filter bank. In contrast to existing regularization methods, such as L1 or L2 minimization of DCNN kernel weights and kernel orthogonality, which ignore sample correlations within a kernel, the use of a filter bank in regularization of DCNNs can mold the DCNN kernels toward common spatial structures and features (e.g., edges or textures of various orientations and frequencies) of natural images. On the other hand, unlike making DCNN kernels fixed filters outright, filter bank regularization still allows the freedom of optimizing DCNN weights via deep learning. This new DCNN design strategy aims to combine the best of two worlds: the inclusion of structural image priors of traditional filter banks to improve the robustness and generality of DCNN solutions, and the capability of modern deep learning to model complex non-linear functions hidden in training data. Experimental results on object recognition tasks show that the proposed regularization approach guides DCNNs to faster convergence and better generalization than the existing regularization methods of weight decay and kernel orthogonality.
Deep convolutional neural networks (DCNNs) have rapidly matured as an effective tool for almost all computer vision tasks [30, 27, 8, 9, 7, 25], including object recognition, classification, segmentation, super-resolution, etc. Compared with traditional vision methods based on analytical models, DCNNs are able to learn far more complex, non-linear functions hidden in the training images. However, DCNNs are also known for their high model redundancy and susceptibility to data overfitting. With a very large number of parameters, DCNNs have a high Vapnik-Chervonenkis (VC) dimension. If trained on a limited number of samples from the data-generating distribution, DCNNs are less likely to choose the correct hypothesis from the large hypothesis space. In other words, there should be a balance between the information in the training examples and the complexity of the network. The simplest model that can perform the task and generalize well on real-world data is the best one. But choosing the simplest model is not an easy task; simply reducing the number of parameters in a network runs the risk of removing the true hypothesis from the hypothesis space. To prevent overfitting and improve the generalization capability, a common strategy is to use a complex model but put constraints on it so that it overlooks noise samples. In this way, reducing the model complexity is achieved not by reducing the number of free parameters in the network, but by controlling the variance of the model and its parameters. This strategy is known as regularization.
Regularization methods for DCNNs fall into two categories. The regularization methods of the first category are procedural. For example, Ioffe and Szegedy proposed batch normalization, which normalizes activations in each layer to reduce the internal covariate shift in the network and improve its generalization and performance.
Srivastava et al. used the dropout technique to stochastically regularize the network; they showed that dropout works like an ensemble of simpler models. Khan and Shah proposed a so-called Bridgeout stochastic regularization technique, and they proved that their method is equivalent to an Lq norm penalty on the weights for a generalized linear model, where the norm order q is a learned hyperparameter.
All regularization methods of the first category are implicit and quite weak in the sense that they do not directly act on the CNN loss function, nor do they require the convolutional kernels to have any spatial structure.
The methods in the second category explicitly add a regularization term in the loss function to penalize the CNN weights. One example is the weight decay method by A. Krogh and J. Hertz, which penalizes large weights via L2 norm minimization. Among the published methods, weight decay is the most common one for regularizing CNNs. For simple linear models it can be shown, using Bayesian inference, that weight decay statistically means that the weights obey a multivariate normal prior with diagonal covariance. In this case, maximum a posteriori (MAP) estimation with a Gaussian prior on the weights is equivalent to maximum likelihood estimation with the weight decay term. However, weight decay regularization can be justified only if the weights within a CNN kernel have no correlation with each other. This assumption is obviously false, as it is well known that CNN kernels, upon convergence, typically exhibit strong spatial structures. To illustrate our point, the kernels of AlexNet after training are shown in Figure 1, where the weights in a kernel are clearly correlated.
Another form of penalty term in the cost function for regularizing CNN weights is the orthogonality of the kernels. But the requirement of mutually orthogonal kernels also ignores the spatial structure within each kernel.
In summary, all existing CNN regularization methods overlook the spatial structures of images. This research sets out to rectify this common problem. Our solution is a novel approach of regularizing CNN convolutional kernels by a structured filter bank. The idea is to encourage the CNN kernels to conform to common spatial structures and features (e.g., edges or textures of various orientations and frequencies) of natural images. This is different from simply making the CNN kernels fixed structured filters; the filter bank regularization still allows the CNN filters to be fine-tuned on the input data. This new CNN design strategy aims to combine the best of two worlds: the inclusion of structural image priors of traditional filter banks to improve the robustness and generality of CNN solutions, and the capability of modern deep learning to model complex non-linear functions hidden in training data. More specifically, our technical innovations are:
Considering a convolutional kernel as a set of correlated weights and penalizing them based on their structural difference from adaptively chosen reference 2D filters.
Using filter banks as guidance for the convolutional kernels of DCNNs but at the same time allowing the kernel weights to deviate from the reference filters, if so required by data.
The remainder of the paper is structured as follows. The next section briefly reviews related works. Section 2 is the main technical body of this paper, in which we present the details and justifications of the proposed new regularization method. In Section 3, we report experimental results on object recognition tasks. The proposed regularization approach is shown to lead to faster convergence and better generalization than the existing regularization methods of weight decay and kernel orthogonality.
1.2 Related Work
Xie et al. proposed an orthonormal regularizer for each layer of a CNN as a means to improve the accuracy and convergence of the network. Although free of redundancy, orthogonal kernels do not consider the spatial correlation between the weights within a given kernel, and hence disregard spatial structure. Some attempts have been made to reduce model complexity by embedding priors in the kernels. Bruna and Mallat proposed convolutional scattering networks, in which fixed cascaded wavelets are used to decompose images. Although the method performed well on specific datasets, it reduces the capability of CNNs: the wavelet prior is too rigid to effectively characterize the great variety of unknown image structures. Similarly, Chan et al. proposed a network architecture called PCANet that creates filter banks in the layers based on a PCA decomposition of input images. This method can learn convolutional kernels from the inputs, but the output cannot affect the filter bank design. This conflicts with the design objective of DCNNs, which is to learn convolutional kernels with respect to outputs, not just inputs. For instance, for classification tasks, the goal is to learn conditional probabilities of output data in relation to the input.
To gain flexibility over the scattering network while still using wavelet features, Jacobsen et al. proposed a method called structured receptive fields. They make every kernel a weighted sum of filters in a fixed filter bank consisting of Gaussian derivative basis functions. However, this method works under the assumption that every kernel filter is smooth and can be decomposed into a compact filter basis, which may not hold in all layers. Keshari et al. published a method that learns a dictionary of filters in an unsupervised manner and then makes the DCNN convolutional kernels linear combinations of the dictionary words, with the combination weights optimized on training data. Although this method significantly reduces the number of parameters of the network, it limits the network performance, because the dictionary is not fine-tuned on the training dataset. Implementing this network is not easy, as it needs a customized backpropagation for updating the weights. By combining pre-determined filters to form DCNN kernels, both of the mentioned methods severely limit the solution space of convolutional kernels when optimizing the DCNN for the given task.
Sarwar et al. proposed a combination of Gabor filters and learnable filters for the convolutional kernels of a DCNN: in each layer, some filters are fixed Gabor filters and the others are trained. Luan et al. proposed a so-called Gabor convolutional network (GCN), in which Gabor filters modulate the convolutional kernels of the DCNN and the modulated kernels are optimized via backpropagation. Both methods try to take advantage of the spatial structures of Gabor filters, but neither uses them for regularization as we do in this paper.
2 Proposed Method
In this section, we explain our new filter bank regularization (FBR) technique for DCNNs in detail. In the FBR method, we include in the DCNN objective function a penalty term that encourages the convolutional kernels of the DCNN to approach some member filters of a filter bank. In addition to controlling the model complexity of DCNNs to prevent overfitting and expedite convergence, the FBR strategy has several other advantages: (1) it incorporates into the DCNN design priors of spatial structures that are interpretable and effective; (2) the filter bank approach allows the DCNN kernels to be chosen from a large pool of candidate 2D filters suitable for a given computer vision application; (3) it is a general regularization mechanism that can be applied to any DCNN architecture without modification.
2.1 Filter banks
Filter banks have proven their effectiveness in extracting useful features for many computer vision tasks. Being a set of different filters, a filter bank can serve as a basis (often overcomplete) to decompose images into meaningful construction elements. Arguably the best-known filter bank in computer vision is the Gabor filter bank, as the family of Gabor filters is noted for its power to characterize and discriminate textures and edges, thanks to its parameterization in orientation and frequency. This is why the Gabor filter bank is a main construct in the development of our FBR method. In addition to their mathematical properties, Gabor filters can, in the view of many vision researchers, model simple cells in the visual cortex of mammalian brains.
The generic formula of the Gabor filter is:
\[ g(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \exp\!\left(i\left(2\pi\frac{x'}{\lambda} + \psi\right)\right), \tag{1} \]
where \(x' = x\cos\theta + y\sin\theta\) and \(y' = -x\sin\theta + y\cos\theta\), \(\lambda\) is the wavelength of the sinusoidal carrier, \(\theta\) the orientation, \(\psi\) the phase offset, \(\sigma\) the standard deviation of the Gaussian envelope, and \(\gamma\) the spatial aspect ratio.
Typically, the real part of this filter is used for filtering images. The Gabor filter enables us to extract orientation-dependent frequency content of the image. Transforming an image with a Gabor filter bank decomposes it in a way that can enhance the ability of a machine learning model to separate different classes. Using the Gabor filter bank may also be justified cognitively, as some researchers have shown that simple visual cortex cells of mammals can be modeled by Gabor filters.
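For concreteness, the real part of the filter above can be sampled on a discrete grid to build a bank. The following NumPy sketch assumes an odd kernel size; the envelope-width heuristic and the frequency range used below are illustrative assumptions, not values specified in this paper:

```python
import numpy as np

def gabor_kernel(size, theta, freq, sigma=None, gamma=0.8, psi=0.0):
    """Real part of a Gabor filter sampled on a size x size grid (size odd).

    theta: orientation in radians; freq: carrier frequency in cycles/pixel;
    gamma: spatial aspect ratio; psi: phase offset of the carrier.
    """
    if sigma is None:
        sigma = 0.35 * size  # envelope-width heuristic (an assumption)
    c = np.arange(size) - size // 2
    y, x = np.meshgrid(c, c, indexing="ij")
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + gamma ** 2 * yr ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * freq * xr + psi)  # real part of the complex carrier
    return envelope * carrier

def gabor_bank(size=11, n_orientations=10, n_freqs=7):
    """Build a 10-orientation x 7-frequency (70-filter) bank, matching the
    filter count used in this paper; the range [0.1, 0.4] cycles/pixel is
    an illustrative choice."""
    thetas = np.linspace(0.0, np.pi, n_orientations, endpoint=False)
    freqs = np.linspace(0.1, 0.4, n_freqs)
    return np.stack([gabor_kernel(size, t, f) for t in thetas for f in freqs])
```

Each bank member has a unit response at the grid center for zero phase offset, and the orientations cover the half circle, since a Gabor filter at orientation θ+π duplicates the one at θ up to sign.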
To further enrich the DCNN model, we augment the Gabor filters in the regularization filter bank with the Leung-Malik (LM) filter bank, which has shown potential for extracting textures. The LM filter bank consists of first and second derivatives of Gaussians at 6 orientations and 3 scales, 8 Laplacian of Gaussian (LoG) filters, and 4 Gaussian filters. This filter bank is shown in Figure 2.
Using filter banks can increase the robustness of the model to small variations. In the context of deep learning, it has been shown that the first-layer kernels of DCNNs trained on relatively large datasets, such as VGGNet, ResNet and AlexNet, are very similar to Gabor filters. Scale-space theory gives a method for convolving an image with filters of different scales, which can be used to extract useful descriptors from general signals. Similarly, we use filter banks as guidance for CNN filters; since filter banks suit general signals, we only guide the DCNN kernels to be close to the filter bank rather than fixing them to it. In what follows, we give details about the implementation of this regularization.
2.2 Filter bank regularization as a maximum a posteriori (MAP) estimation problem
Using Bayesian statistics is a common approach to derive regularized loss functions. We use MAP estimation to make the Bayesian posterior tractable. Consider the simple case of regression, \(y = f(x; w)\), with model parameters \(w\). The dataset contains pairs of data points \(\{(x_i, y_i)\}_{i=1}^{N}\), in which the \(x_i\) are 1-D vectors and the \(y_i\) are scalars. In the presence of Gaussian noise we can write \(y = f(x; w) + \epsilon\), where \(\epsilon \sim \mathcal{N}(0, \sigma^2)\) is the Gaussian noise and \(f(x; w)\) is the model output. The conditional distribution of \(y\) can be written as follows:
\[ p(y \mid x; w) = \mathcal{N}\big(y;\, f(x; w),\, \sigma^2 I\big), \tag{2} \]
where \(I\) is the identity matrix. The MAP estimation for the model parameters is defined by:
\[ w_{\mathrm{MAP}} = \arg\max_{w}\; p(w \mid y, x) = \arg\max_{w}\; \big[\log p(y \mid x; w) + \log p(w)\big], \tag{3} \]
where \(p(w)\) is the prior distribution of the model parameters. Substituting (2) into (3) gives us:
\[ w_{\mathrm{MAP}} = \arg\max_{w}\; \Big[-\frac{1}{2\sigma^2}\sum_{i=1}^{N}\big(y_i - f(x_i; w)\big)^2 + \log p(w)\Big]. \tag{4} \]
Therefore, the cost function can be derived as follows:
\[ J(w) = \sum_{i=1}^{N}\big(y_i - f(x_i; w)\big)^2 - 2\sigma^2 \log p(w). \tag{5} \]
This result can easily be extended to 2-D datasets and parameters. As discussed earlier in this paper, many researchers use Gabor filters to model simple cells in the visual cortex of mammalian brains. We can use this information to presume a reasonable prior distribution for the model parameters: we assume that the model parameters have a Gaussian distribution around a vectorized filter \(f_g\) in the Gabor filter bank. In other words, we can write the prior distribution of the parameters as follows:
\[ p(w) = \mathcal{N}\big(w;\, f_g,\, \sigma_w^2 I\big). \tag{6} \]
Here \(\sigma_w\) determines the deviation of the model kernel from \(f_g\). So, we can write the loss function as:
\[ J(w) = \sum_{i=1}^{N}\big(y_i - f(x_i; w)\big)^2 + \lambda \lVert w - f_g \rVert_2^2, \qquad \lambda = \frac{\sigma^2}{\sigma_w^2}. \tag{7} \]
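The effect of this Gaussian prior can be checked on a toy linear model, where the regularized least-squares problem has a closed form: the MAP estimate is ridge regression shrunk toward the reference filter rather than toward zero. The data, the reference vector `w0`, and the regularization weight `lam` below are made up for illustration:

```python
import numpy as np

# Toy linear model y = X w + noise with prior N(w; w0, sigma_w^2 I).
# Minimizing ||y - X w||^2 + lam * ||w - w0||^2 has the closed form
#   w* = (X^T X + lam I)^{-1} (X^T y + lam w0),
# i.e. ridge regression shrunk toward the reference w0 instead of toward 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=50)

w0 = np.array([1.0, -2.0, 0.0, 0.0])   # hypothetical "reference filter"
lam = 5.0
w_map = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y + lam * w0)

# As lam grows, the MAP solution collapses onto the prior mean w0.
w_big = np.linalg.solve(X.T @ X + 1e8 * np.eye(4), X.T @ y + 1e8 * w0)
```

The limiting behavior makes the role of \(\lambda = \sigma^2 / \sigma_w^2\) concrete: a tight prior (small \(\sigma_w\)) pins the weights to the reference filter, while a loose prior recovers ordinary least squares.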
2.3 Kernel regularization using a filter bank
Denote by \(F = \{f_1, f_2, \ldots, f_N\}\) a 2D filter bank of \(N\) filters of dimension \(s \times s\). For a DCNN of \(L\) convolution layers, let \(K_l\) be the number of kernels in layer \(l\), \(l = 1, \ldots, L\). Denote the convolutional kernels of layer \(l\) by \(W_l^k\), \(k = 1, \ldots, K_l\). Each kernel in the DCNN is a three-dimensional tensor of dimension \(s \times s \times C_l\), where \(C_l\) is the number of channels in layer \(l\). In each iteration of the learning process, for layer \(l\) of the DCNN we find the filter in the filter bank that best matches kernel \(k\) in channel \(c\), namely,
\[ n^{*}(l, k, c) = \arg\min_{1 \le n \le N} \big\lVert W_{l,c}^{k} - f_n \big\rVert_F^2, \tag{8} \]
where \(W_{l,c}^{k}\) is the 2D cross section of the 3D kernel \(W_l^k\) in channel \(c\). Accordingly, the FBR regularizer produces the penalty term
\[ R_{\mathrm{FBR}} = \lambda \sum_{l=1}^{L} \sum_{k=1}^{K_l} \sum_{c=1}^{C_l} \big\lVert W_{l,c}^{k} - f_{n^{*}(l,k,c)} \big\rVert_F^2. \tag{9} \]
Therefore, the total loss function of the DCNN is
\[ \mathcal{L}(W) = \sum_{i} \ell(x_i, y_i; W) + R_{\mathrm{FBR}}, \tag{10} \]
where \(W\) denotes the total weights of the DCNN and \(\ell(x_i, y_i; W)\) is the per-sample classification loss for the input \(x_i\) and corresponding output \(y_i\).
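The per-layer FBR penalty can be sketched in NumPy as follows: each 2D channel slice of every kernel is matched against the bank (adaptive reference choice) and penalized by its squared Frobenius distance to the closest filter. Shapes and the regularization weight are illustrative; in actual training this would be computed on the framework's autograd tensors so that gradients flow through the penalty:

```python
import numpy as np

def fbr_penalty(kernels, bank, lam=1e-3):
    """Filter bank regularization penalty for one convolution layer.

    kernels: (K, C, s, s) array of convolutional kernels,
    bank:    (N, s, s) filter bank.
    Each 2D channel slice is matched to its nearest bank filter (squared
    Frobenius distance) and penalized by that distance, then summed.
    """
    K, C, s, _ = kernels.shape
    slices = kernels.reshape(K * C, s * s)        # one row per 2D slice
    filters = bank.reshape(len(bank), s * s)
    # squared distances between every slice and every bank filter
    d2 = ((slices[:, None, :] - filters[None, :, :]) ** 2).sum(-1)
    best = d2.min(axis=1)                         # adaptive reference choice
    return lam * best.sum()
```

When every kernel slice coincides with some bank filter, the penalty vanishes; any deviation from the nearest structured filter is charged quadratically, which is exactly the pull-toward-the-bank behavior described above.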
The interactions between the DCNN convolutional kernels and the regularization filter bank are depicted in Figure 3.
As one can see, in FBR the kernels choose their reference regularizers adaptively; moreover, the reference filters are well structured in the spatial domain. The proposed algorithm is summarized in Algorithm 1.
2.4 Adding orthogonality regularization
Due to the random initialization of the DCNN, it is likely that some kernels select the same reference filter from the filter bank \(F\). This can create redundant or correlated kernels after the DCNN training stage. To resolve this issue, we introduce an orthogonality regularization term to encourage uncorrelated kernels. Adding the orthogonality term can change the reference regularizer filters and, as a result, enables the DCNN to learn a richer set of kernels. Letting \(M_l\) be the kernel weight matrix of DCNN layer \(l\), in which each column is a vectorized kernel, the orthogonality regularization term for this layer can be written as
\[ R_{\mathrm{orth}}(l) = \big\lVert M_l^{\top} M_l - I \big\rVert_F^2, \tag{11} \]
where \(I\) is the identity matrix and \(\lVert \cdot \rVert_F\) denotes the Frobenius norm. By adding the orthogonality regularization, we can rewrite the final loss function as follows:
\[ \mathcal{L}(W) = \sum_{i} \ell(x_i, y_i; W) + R_{\mathrm{FBR}} + \gamma \sum_{l=1}^{L} R_{\mathrm{orth}}(l), \tag{12} \]
where \(\gamma R_{\mathrm{orth}}(l)\) is the orthogonality regularization for layer \(l\) of the DCNN.
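The per-layer orthogonality term is a small Gram-matrix computation; a NumPy sketch (the coefficient value is illustrative, and a real implementation would again use the framework's tensors):

```python
import numpy as np

def orthogonality_penalty(W, gamma=1e-3):
    """||W^T W - I||_F^2 scaled by gamma, for a layer weight matrix W
    whose columns are vectorized kernels. The penalty is zero exactly
    when the kernels are orthonormal, and grows as kernels become
    correlated (off-diagonal Gram entries) or badly scaled (diagonal)."""
    G = W.T @ W                        # Gram matrix of the kernels
    R = G - np.eye(G.shape[0])
    return gamma * (R ** 2).sum()
```

Driving this term down discourages two kernels from collapsing onto the same reference filter of the bank, which is precisely the redundancy problem it is introduced to fix.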
3 Experiments and Discussions
We implemented the proposed FBR method with classification DCNNs and evaluated its performance in comparison with existing regularization methods, including L1 and L2 penalty norms on the weights and pure orthogonality regularization. Two commonly used benchmark datasets, CIFAR-10 and Caltech-101, are used in our evaluations. In our experimental setup, the filter bank is the union of the Gabor and LM filter banks, shown in Figures 2 and 5. We designed the Gabor filter bank using 10 different orientations and 7 frequencies, resulting in 70 Gabor member filters.
3.1 Results on MNIST benchmark
First, we report our experimental results on the benchmark dataset MNIST. We apply Double Soft Orthogonality (DSO) regularization to the DCNN model. The model consists of several convolution layers, of which the first ones are regularized. The learning rate is halved every epoch. We compare our method with Gabor Convolutional Networks (GCNs). As one can see in Table 1, FBR achieves the best accuracy with far fewer parameters than the GCNs.
| Reg. Type | Err. (%) | #Params (M) |
| GCN4 (with ) | | |
| GCN4 (with ) | | |
| GCN4 (with ) | | |
3.2 Performance evaluation on CIFAR-10 benchmark
We report our experimental results on the benchmark dataset CIFAR-10, which contains 50,000 training images and 10,000 testing images from 10 different categories; the image dimensions are 32×32. The DCNN architecture we used is shown in Figure 4. We applied regularization on the convolution kernels and trained the model using the RMSProp optimizer with learning-rate decay. The batch size was fixed and data augmentation was used; we also used step decay to halve the learning rate at regular intervals. It is worth mentioning that we employed regularization only for the layers of the DCNN whose kernel dimensions were large enough to allow an effective filter bank with reasonable representational capability. To make the comparisons fair, we used the same weight initialization in all experiments.
The cross-entropy loss results of the different regularization methods on the test dataset are plotted in Figure 6. In the figure, the baseline curve is for the classification DCNN of Figure 4 without any regularization. As shown, the FBR method has the lowest cross-entropy loss among all tested methods. An interesting observation is that the reduction in cross entropy with respect to the baseline is almost the same for the orthogonal, L1, and L2 regularizations, and the FBR method reduces the cross entropy further, below those three methods, by approximately the same margin. Additionally, in Table 3 we tabulate the experimental results in more detail, including both the classification accuracy and the cross-entropy numbers in relation to the regularization hyperparameters. The table demonstrates that the FBR method outperforms all other regularization methods in both classification accuracy and cross-entropy loss for suitable hyperparameter values.
One can also see the effects of the orthogonality regularization on the spatial structures of the DCNN kernels in Figure 7. Emphasizing the orthogonality of the DCNN kernels can reduce the degree of kernel redundancy, i.e., prevent similar kernels from being chosen.
3.3 Effect of kernel size
In the FBR method, as opposed to other regularization techniques, the DCNN kernel size affects both the DCNN architecture and the regularization filter bank. Increasing the kernel size improves the spatial resolution of the filter bank while sacrificing the locality of the feature maps. In practice, we need to trade off spatial resolution against locality by varying the kernel size. Decreasing the kernel size reduces the representational power of the filter bank but, on the other hand, improves locality. To examine how much the kernel size affects the DCNN model, we trained the DCNN baseline with different kernel sizes. The results of these experiments are shown in Table 2.
| Kernel Size | Accuracy (%) | CE Loss |
As one can see, the aforementioned trade-off creates an optimal kernel size for the DCNN. In practice, this optimum can be found by treating the kernel size as a hyperparameter and tuning it via cross-validation.
| Reg. Type | Coeff. | Acc. (%) | CE Loss |
3.4 Results on Caltech-101
As discussed above, applying a large kernel to very small images like those of CIFAR-10 (32×32) can lead to poor locality. To avoid this problem and evaluate the performance of the FBR method with larger kernel sizes, we repeat the above experiments on the Caltech-101 dataset and compare the different regularization methods. Caltech-101 has 101 categories, and each class contains roughly 40 to 800 images. We resized all of the images to a common size and used the DCNN baseline architecture with two extra max-pooling layers at the first and third convolutional layers to control the number of DCNN parameters.
| Reg. Type | Coeff. | Acc. (%) | CE Loss |
The experimental results on the Caltech-101 dataset are presented in Table 4. Comparing Table 4 with Table 3 (CIFAR-10), we can see that not only does the FBR method achieve the best performance in the comparison group, but its performance gain over the others also increases by a significant margin with larger kernel sizes and higher-resolution images. In other words, the FBR method is more advantageous on high-resolution images of greater variation, because it can adapt the kernel size.
The classification accuracy results on the test dataset for the different methods are plotted in Figure 8. As one can observe, the L2 regularization improves the generalization of the model over the baseline by only a very small amount, while the L1 regularization performs much better. The orthogonal regularization outperforms both L1 and L2, because orthogonality prevents the choice of highly correlated kernels and promotes more diverse kernels that extract more novel features.
3.5 Large Size Image Classification
To show that FBR is an effective regularizer on large-scale images as well as small-scale ones, we use the ImageNet dataset, which contains over a million color images from 1,000 object classes. We use a subset of the classes to train the DCNN, with the ResNet-50 architecture as our baseline model. The training loss and the top-5 accuracy on the validation data are shown in Figures 9 and 10, respectively.
As shown in Figure 9, both models regularized with FBR have lower training loss than the L2-regularized model and the baseline model without regularization. For validation, as one can see in Figure 10, the model with FBR regularization has the best accuracy. In addition, the abrupt changes in validation accuracy are smaller for the FBR model than for the other models.
3.6 Results of using a VGG-derived filter bank as the regularizer
We can also construct the regularization filter bank from the lower-layer convolutional kernels of a pretrained DCNN, for example VGG16. As mentioned previously, Gabor filters do not work effectively when the filter kernel is small. One way of creating a regularization filter bank of small kernel size is to choose a subset of pretrained VGG convolutional kernels from the first few layers. Specifically, we randomly select 256 VGG16 kernels of the first two layers, pretrained on ImageNet, to form the regularization filter bank shown in Figure 11.
For comparison purposes, we train the DCNN of Figure 4 to classify the Caltech-101 dataset with the above VGG-derived filter bank regularization, with L2 regularization, and without any weight regularization at all (the baseline), and compare the performances of these methods.
The classification accuracy and cross-entropy results are displayed in Figures 12 and 13, respectively. As shown, the DCNNs regularized by the VGG16-derived filter bank outperform both the L2-regularized DCNN and the baseline model without regularization.
3.6.1 Comparison of different regularization methods based on feature maps
To understand how the proposed FBR method works and its advantages over the other methods, we examine the DCNN feature maps generated with the different regularizations and without regularization.
First, we compare the baseline model without regularization against the FBR regularization method. Let us examine two examples from the Caltech-101 test dataset that the baseline model misclassifies but the FBR method classifies correctly. The two test images after normalization (mean subtracted and then divided by the standard deviation) are shown in Figure 14.
The feature maps of layer 1 and layer 3 for the baseline and the FBR method are displayed in Figures 15 and 16, respectively. The similar feature maps of the baseline model are marked in these figures.
As can easily be observed in the figures, the feature maps of the FBR method are sparser than those of the baseline model without regularization. This increased sparsity improves the robustness of the FBR method. We also draw the reader's attention to the interpretability of the FBR feature maps in Figures 15 and 16: thanks to the Gabor filters included in the regularization filter bank, the FBR method extracts features of strong directionality and high frequency, which may explain its superior classification performance.
Next, we compare the FBR method against L2 regularization. Again, two sample images of the Caltech-101 dataset are selected, shown in Figure 17. The FBR method classifies these two images correctly, whereas the L2 regularization does not. The feature maps of layer 1 and layer 3 for the two methods are shown in Figures 18 and 19.
Here, the observations are very similar to those for Figures 15 and 16. The feature maps of the FBR method appear sparser and exhibit greater discriminating power in high frequency and directionality than those of the L2 regularization, which explains the superior performance of the former over the latter.
Regularization techniques are widely used to prevent DCNNs from overfitting. While the importance of regularization is generally accepted, no previously existing explicit regularization technique takes into account the spatial correlation of the weights within a convolution kernel of a DCNN. This oversight is rectified by our novel approach of filter bank regularization of DCNNs, which incorporates into the network training process interpretable feature extractors, such as Gabor filters, to improve the convergence, robustness and generality of DCNNs.
- N. Bansal, X. Chen, and Z. Wang (2018) Can we gain more from orthogonality regularizations in training deep CNNs? In Advances in Neural Information Processing Systems.
- J. Bruna and S. Mallat (2012) Invariant scattering convolution networks.
- R. Caruana, S. Lawrence, and C. L. Giles (2000) Overfitting in neural nets: backpropagation, conjugate gradient, and early stopping. In Proc. Neural Information Processing Systems, pp. 402–408.
- T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma (2014) PCANet: a simple deep learning baseline for image classification?
- J. G. Daugman (1985) Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America A 2(7), pp. 1160.
- I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep Learning. MIT Press. http://www.deeplearningbook.org
- K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition.
- A. G. Howard et al. (2017) MobileNets: efficient convolutional neural networks for mobile vision applications.
- G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2016) Densely connected convolutional networks.
- L. Huang et al. (2017) Orthogonal weight normalization: solution to optimization over multiple dependent Stiefel manifolds in deep neural networks.
- S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift.
- J.-H. Jacobsen, J. van Gemert, Z. Lou, and A. W. M. Smeulders (2016) Structured receptive fields in CNNs.
- R. Keshari, M. Vatsa, R. Singh, and A. Noore (2018) Learning structure and strength of CNN filters for small sample size training.
- N. Khan and J. Shah (2018) Bridgeout: stochastic bridge regularization for deep neural networks.
- A. Krizhevsky. CIFAR-10 (Canadian Institute for Advanced Research).
- A. Krizhevsky, I. Sutskever, and G. E. Hinton (2017) ImageNet classification with deep convolutional neural networks. Communications of the ACM 60(6), pp. 84–90.
- A. Krogh and J. A. Hertz (1992) A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann (Eds.), pp. 950–957.
- Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/
- T. Leung and J. Malik (2001) Representing and recognizing the visual appearance of materials using three-dimensional textons. International Journal of Computer Vision 43(1), pp. 29–44.
- L. Fei-Fei, R. Fergus, and P. Perona (2003) Caltech-101 image dataset.
- S. Luan, C. Chen, B. Zhang, J. Han, and J. Liu (2017) Gabor convolutional networks.
- S. Marcelja (1980) Mathematical description of the responses of simple cortical cells. Journal of the Optical Society of America 70(11), pp. 1297–1300.
- O. Russakovsky et al. (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3), pp. 211–252.
- S. S. Sarwar, P. Panda, and K. Roy (2017) Gabor filter assisted energy efficient fast learning convolutional neural networks. In IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), Taipei, pp. 1–6.
- K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition.
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958.
- C. Szegedy et al. (2014) Going deeper with convolutions.
- A. P. Witkin (1987) Scale-space filtering. In Readings in Computer Vision, pp. 329–332.
- D. Xie, J. Xiong, and S. Pu (2017) All you need is beyond a good init: exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation.
- B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2017) Learning transferable architectures for scalable image recognition.