Classification Driven Dynamic Image Enhancement

Vivek Sharma, Ali Diba, Davy Neven, Michael S. Brown, Luc Van Gool, and Rainer Stiefelhagen
CV:HCI, KIT, Karlsruhe, ESAT-PSI, KU Leuven, York University, Toronto, and CVL, ETH Zürich

Convolutional neural networks rely on image texture and structure as discriminative features to classify image content. Image enhancement techniques can be used as preprocessing steps to improve overall image quality and, in turn, the effectiveness of a CNN. Existing image enhancement methods, however, are designed to improve the perceptual quality of an image for a human observer. In this paper, we are interested in learning CNNs that can emulate image enhancement and restoration, but with the overall goal of improving image classification rather than human perception. To this end, we present a unified CNN architecture that uses a range of enhancement filters to enhance image-specific details via end-to-end dynamic filter learning. We demonstrate the effectiveness of this strategy on four challenging benchmark datasets for fine-grained, object, scene and texture classification: CUB-200-2011, PASCAL-VOC2007, MIT-Indoor, and DTD. Experiments using our proposed enhancement show promising results on all the datasets. In addition, our approach is capable of improving the performance of all generic CNN architectures.

1 Introduction

Image enhancement methods are commonly used as preprocessing steps applied to improve the visual quality of an image before high-level vision tasks such as classification and object recognition. Examples include enhancement to remove the effects of blur, noise, poor contrast, and compression artifacts, or to boost image details. Such enhancement methods include Gaussian smoothing, anisotropic diffusion, weighted least squares (WLS), and bilateral filtering. These methods are not simple filter operations (e.g. a 3×3 Sobel filter) but often involve complex optimization. In practice, their run times are high and can reach seconds or even minutes for high-resolution images.

Several recent works have shown that Convolutional Neural Networks (CNN) [2, 3, 23, 27, 37, 38] can successfully emulate a wide range of image enhancement by training on input and target output image pairs. These CNNs often have a significant advantage in terms of run-time performance. The current strategy, however, is to train these CNN-based image filters to approximate the output of their non-CNN counterparts.

In this paper, we propose to extend the training of CNN-based image enhancement to incorporate the high-level goal of image classification. Our contribution is a method that jointly optimizes a CNN for enhancement approximation and image classification. The main idea behind our work is to adaptively enhance image features, dynamically conditioned on the image itself. This allows the enhancement CNN to selectively enhance only those features that are useful for image classification.

Figure 1: Overview of the proposed unified CNN architecture using enhancement filters to improve classification tasks. Given an input RGB image, instead of directly applying the CNN on this image ([a]), we first enhance the image details by convolving the input image with a WLS filter (see Sec. 3.1), resulting in improved classification with high-confidence ([b]).

Given the critical role of selective feature enhancement, we propose to use the dynamic convolutional layer (or dynamic filter) [7] to dynamically enhance image-specific features with a classification objective; in contrast, the authors of [7] apply this module to transform an angle into a filter (steerable filter) using input-output image pairs. We use the same terminology as in [7]. The dynamic filters are a function of the input and therefore vary from one sample to another during training and testing: when presented with an image, the enhancement is tailored to that image, enhancing its texture patterns or sharpening its edges for discrimination. Specifically, our network learns the degree to which the various enhancement filters should be applied to an input image, such that the enhanced representation yields better classification accuracy. Our proposed approach is evaluated on four challenging benchmark datasets for fine-grained, object, scene and texture classification, respectively: CUB-200-2011 [35], PASCAL-VOC2007 [12], MIT-Indoor [26], and DTD [4]. We experimentally show that when CNNs are combined with the proposed dynamic enhancement technique (Sec. 3.1 and 3.3), one can consistently improve the classification performance of vanilla CNN architectures on all the datasets. In addition, our experiments demonstrate the full capability of the proposed method and show promising results in comparison to the state-of-the-art.

The remainder of this paper is organized as follows. Section 2 overviews related work. Section 3 describes our proposed enhancement architecture. Experimental results and their analysis are presented in Section 4. Finally, the paper is concluded in Section 5.

Figure 2: Dynamic enhancement filters. Inputs to the network are input-output image pairs, as well as image class labels for training. In this architecture, we learn a single enhancement filter for each enhancement method individually. The model operates on the luminance component of the RGB color space. The enhancement network (i.e. the filter-generating network) generates dynamic filter parameters that are sample-specific and conditioned on its input, with the overall goal of improving image classification. The figure in the upper-right corner shows the whole pipeline workflow.

2 Background & Related work

Considerable progress has been made in removing the effects of blur [2], noise [27], and compression artifacts [36] using CNN architectures. Reversing the effect of these degradations in order to obtain sharp images is currently an active area of research [2, 22, 37]. The investigated CNN frameworks [2, 3, 15, 22, 23, 27, 37, 38] are typically built on simple strategies: the networks are trained by minimizing a global objective function on input-output image pairs, encouraging the output to have a structure similar to the target image. After training, the CNN can transfer the learned enhancement to new images [37]. These frameworks act as filters specialized for a specific enhancement method.

For example, Xu et al. [37] learn a CNN architecture to approximate existing edge-aware filters from input-output image pairs. Chen et al. [3] learn a CNN that approximates several pixel-to-pixel image processing operations using a parameterization that is deeper and more context-aware. Yan et al. [38] learn a CNN to approximate image transformations for image adjustment. Fu et al. [15] learn a CNN architecture to remove rain streaks from an image; for CNN training, the authors use rainy and clean image detail-layer pairs rather than regular RGB images. Li et al. [22] propose a learning-based joint filter using three CNN architectures: two sub-networks take the target and guidance images respectively, while the third selectively transfers the main content structure and reconstructs the desired output. Remez et al. [27] propose a fully convolutional CNN architecture for image denoising using an image prior, i.e. class-aware information. The closest works to ours are by Chakrabarti et al. [2] and Liu et al. [23]. Chakrabarti et al. propose a CNN architecture to predict the complex Fourier coefficients of a deconvolution filter that is applied to individual image patches for restoration. Liu et al. use CNN+RNNs to learn enhancement filters, whereas we use only CNNs: our method produces one representative filter per enhancement method, while they produce 4-way directional propagation filters per method. Like [2, 3], their work targets low-level vision tasks, while our goal is enhancement for classification. In contrast to these prior works, our work differs substantially in scope and technical approach: we approximate different image enhancement filters with a classification objective in order to selectively extract informative features from enhancement techniques and improve classification, not necessarily to approximate the enhancement methods themselves.

Similar to our goal are the works [6, 9, 19, 25, 33, 34], where the authors also seek to ameliorate degradation effects for accurate classification. Dodge and Karam [9] analyzed how blur, noise, contrast, and compression hamper the performance of ConvNet architectures for image classification. Their findings showed that: (i) ConvNets are very sensitive to blur because it removes textures from the images; (ii) noise affects performance negatively, though the performance of deeper architectures falls off more slowly; and (iii) deep networks are resilient to compression distortions and contrast changes. A study by Karahan et al. [19] reports similar results for a face-recognition task. Ullman et al. [33] showed that minor changes to an image, barely perceptible to humans, can have drastic effects on computational recognition accuracy. Szegedy et al. [30] showed that applying an imperceptible non-random perturbation can cause ConvNets to produce erroneous predictions.

To help mitigate these problems, Costa et al. [6] designed separate models specialized for each noisy version of an augmented training set, which improved classification results on noisy data to some extent. Peng et al. [25] explored the potential of jointly training on low-resolution and high-resolution images in order to boost performance on low-resolution inputs. Similar to [25] is Vasiljevic et al.'s [34] work, where the authors augment the training set with degradations and fine-tune the network on a diverse mix of degraded and high-quality images to regain much of the accuracy lost on degraded images. In fact, with this approach the authors were able to learn degradation-invariant (particularly blur-invariant) representations in their hidden layers.

In contrast to previous works, we use high-quality images that are free of artifacts, and jointly learn a ConvNet that enhances the image for the purpose of improving classification.

3 Proposed Method

As previously mentioned, our aim is to learn a dynamic image enhancement network with the overall goal of improving classification, not necessarily approximating the enhancement methods exactly. To this end, we propose the three CNN architectures described in this section.

Our first architecture learns a single enhancement filter for each enhancement method in an end-to-end fashion (Sec. 3.1); by end-to-end we mean that each image is enhanced and recognized in one unified deep network with dynamic filters. Our second architecture takes the pre-learned enhancement filters from the first architecture and combines them in a weighted way in the CNN; there is no adaptation of the filter weights (Sec. 3.2). Our third architecture is similar to the first, but jointly learns multiple enhancement filters end-to-end (Sec. 3.3), again combining them in a weighted way in the CNN. All these setups are jointly optimized with a classification objective to selectively enhance image feature quality for improved classification. For network training, image-level class labels are used, while at test time the input image can have multiple labels.

3.1 Dynamic Enhancement Filters

In this section we describe our model for learning representative enhancement filters for different enhancement methods from input and target output enhanced image pairs in an end-to-end learning approach, with the goal of improving classification performance. Given an input RGB image I, we first transform it into a luminance-chrominance color space. Our enhancement methods operate on the luminance component [14] of the RGB image. This allows our filter to modify the overall tonal properties and sharpness of the image without affecting the color. The luminance image Y is then convolved with an image enhancement method E_i, resulting in an enhanced target output luminance image T of size h × w, where h and w denote the height and width of the input, respectively. We generate target images for a range of enhancement methods as a preprocessing step (see Section 4.2 for more details). For filter generation, we explicitly use a dataset of only one enhancement method at a time for learning the transformation. The scheme is illustrated in Figure 2.
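The luminance/chrominance round trip described above can be sketched as follows. The paper does not specify the exact color transform, so the standard ITU-R BT.601 YCbCr coefficients are assumed here for illustration:

```python
import numpy as np

def rgb_to_ycbcr(img):
    """Split an RGB image in [0, 1] into luminance Y and chrominance (Cb, Cr).
    BT.601 coefficients are an assumption; the paper only states that
    enhancement operates on the luminance component."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.5 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 0.5 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def ycbcr_to_rgb(y, cb, cr):
    """Recombine (possibly enhanced) luminance with the untouched
    chrominance and transform back to RGB."""
    r = y + 1.402 * (cr - 0.5)
    g = y - 0.344136 * (cb - 0.5) - 0.714136 * (cr - 0.5)
    b = y + 1.772 * (cb - 0.5)
    return np.clip(np.stack([r, g, b], axis=-1), 0.0, 1.0)
```

Because only Y is modified, the color of the image is preserved by construction, matching the motivation given in the text.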

First stage (Enhancement Stage): The enhancement network (EnhanceNet) is inspired by [7, 18, 20], and is composed of convolutional and fully-connected layers. The EnhanceNet maps the input to the filter: it takes the one-channel luminance image Y and outputs the filter parameters Θ of size s × s × n, generated dynamically by the enhancement network, where s is the filter size and n is the number of filters, with n = 1 for the single generated filter meant for the one-channel luminance image. The generated filter F_Θ is applied to the input image at every spatial position (x, y) to output the predicted image Ŷ = F_Θ(Y). The filters are image-specific and conditioned on Y. For generating the enhancement filter parameters Θ, the network is trained using the mean squared error (MSE) between the target image T and the network's predicted output image Ŷ. Note that the parameters Θ of the filter are obtained as the output of the EnhanceNet, which maps the input to a filter, and therefore vary from one sample to another. To compare the reconstructed image Ŷ with the ideal T, we use the MSE loss as a measure of image quality, although we note that more complex loss functions could be used [10].
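A minimal sketch of applying a sample-specific dynamic filter to a luminance image follows. The EnhanceNet itself (the CNN that regresses the s × s parameters Θ from the image) is omitted; the direct convolution below only illustrates how a generated Θ acts at every spatial position:

```python
import numpy as np

def apply_dynamic_filter(y, theta):
    """Apply a sample-specific s x s filter to a luminance image.

    y:     (h, w) luminance image.
    theta: (s, s) filter parameters produced per-image by the EnhanceNet.
    Returns the predicted enhanced luminance Y_hat of the same size
    (edge padding keeps the output size equal to the input)."""
    s = theta.shape[0]
    pad = s // 2
    yp = np.pad(y, pad, mode="edge")
    h, w = y.shape
    out = np.empty_like(y)
    for i in range(h):
        for j in range(w):
            # weighted sum of the s x s neighborhood around (i, j)
            out[i, j] = np.sum(yp[i:i + s, j:j + s] * theta)
    return out
```

During training, theta is regressed anew for each sample, so the same function implements a different enhancement for every input image.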

The chrominance component is then recombined, and the image is transformed back into RGB. We found that the filters learned the expected transformation and applied the correct enhancement to the image. Figure 5 shows qualitative results with dynamically enhanced image textures.

Second stage (Classification Stage): The predicted output image Ŷ from Stage 1 is fed as input to the classification network (ClassNet). As the classification network (e.g. AlexNet [21]) has fully-connected layers between the last convolutional layer and the classification layer, the parameters of the fully-connected layers and the C-way classification layer are learned when fine-tuning a pre-trained network.

End-to-End Learning: The Stage 1-2 cascade with two loss functions, the MSE loss (L_mse) and the softmax loss (L_cls), enables joint optimization by end-to-end propagation of gradients through both ClassNet and EnhanceNet using an SGD optimizer. The total loss function of the whole pipeline is given by:

L = L_mse(Ŷ, T) + L_cls(x, y),  with  L_cls(x, y) = − Σ_{c=1}^{C} y_c log( exp(x_c) / Σ_{k=1}^{C} exp(x_k) ),

where x is the output of the last fully-connected layer of ClassNet that is fed to a C-way softmax function, y is the vector of true labels for the image, and C is the number of classes.
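The joint objective can be sketched numerically as follows (single-sample case, with the one-hot label given as a class index; variable names are illustrative):

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Softmax classification loss for one sample.
    logits: (C,) output of the last fully-connected layer.
    label:  integer class index (one-hot label)."""
    z = logits - logits.max()            # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def total_loss(pred_y, target_y, logits, label):
    """Joint objective of Sec. 3.1: MSE between the EnhanceNet output and
    the target enhanced luminance, plus the ClassNet softmax loss."""
    mse = np.mean((pred_y - target_y) ** 2)
    return mse + softmax_cross_entropy(logits, label)
```

Both terms are differentiable, so SGD gradients from the classification loss also flow back into the filter-generating network, as described above.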

We fine-tune the whole pipeline until convergence, thus obtaining learned enhancement filters in the dynamic enhancement layer. The joint optimization allows the loss gradients from the ClassNet to also backpropagate through the EnhanceNet, so that the filter parameters are optimized for classification as well.

Figure 3: (Stat-CNN) In this architecture, we take the pre-learned filters from Sec. 3.1 (Figure 2) together with the original image and combine them in a weighted way in the CNN. There is no adaptation of the filter weights.
Figure 4: (Dyn-CNN) In this architecture, similar to Sec. 3.1 (Figure 2) but differently, we show end-to-end joint learning of multiple filters, combined in a weighted way in the CNN. Here the filter weights are adapted.

3.2 Static Filters for Classification

Here, we show how to integrate the pre-learned enhancement filters obtained from the first approach. Each static filter is computed by taking the mean of all the dynamic filters obtained for the images in the training set. The extracted static filters are convolved with the luminance component Y of the input RGB image I, the chrominance component is added back, and the image is transformed to RGB, which is then fed into the classification network. Figure 3 shows the schematic layout of the whole architecture.
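The construction of a static filter, and of the (N+1)-entry filter bank that appends the identity filter for the original image, can be sketched as:

```python
import numpy as np

def static_filter(dynamic_filters):
    """Collapse the per-image dynamic filters of one enhancement method
    (array-like of shape (n_images, s, s), from Sec. 3.1) into a single
    static filter by averaging over the training set."""
    return np.asarray(dynamic_filters).mean(axis=0)

def filter_bank(static_filters, s):
    """Stack the N pre-learned static filters together with an identity
    filter (the (N+1)-th entry), so the original image is also kept in
    case some learned enhancement performs worse than no enhancement."""
    identity = np.zeros((s, s))
    identity[s // 2, s // 2] = 1.0   # single center tap passes Y through
    return np.stack(list(static_filters) + [identity])
```

Because the static filters are fixed small kernels, applying the whole bank is a very cheap operation compared to the original enhancement algorithms.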

First stage (Enhancement Stage): We begin by extracting the pre-trained filters for the N image enhancement methods learned in the first approach. Given an input luminance image Y, these filters are convolved with the input image to generate N enhanced images. We also include an identity filter (the (N+1)-th filter) to reproduce the original image, as some learned enhancements may perform worse than the original image itself. We then investigate two different strategies to weight the enhancement methods: (i) giving equal weights of 1/N, and (ii) assigning weights on the basis of MSE, as discussed in Sec. 3.3.

The output of this stage is a set of enhanced luminance images and their corresponding weights, indicating their potential importance for the next stage of the classification pipeline. Chrominance is then recombined, and the images are transformed back to RGB.

Second stage (Classification Stage): The N enhanced images and the original image are fed as input to the classification network one by one, sequentially, with class labels and with the weights indicating the importance of each enhancement method for the input image. Similar to the previous approach, the network parameters of the fully-connected layers and the C-way classification layer are fine-tuned from a pre-trained network in an end-to-end learning approach.

End-to-End Training: The loss of the network training is the weighted sum of the individual softmax loss terms:

L = Σ_{i=1}^{N+1} w_i · L_cls^(i),

where w_i is the weight indicating the importance of the i-th enhancement method, with w_{N+1} = 1 for the original RGB image, each term contributing to the total loss of the whole pipeline.

3.3 Multiple Dynamic Filters for Classification

Here, we recycle the architectures from the previous Sections 3.1-3.2. Figure 4 shows the schematic layout of the whole architecture. Our architecture is similar to the one proposed in Sec. 3.1, but differently, we dynamically generate N filters using N enhancement networks, one for each enhancement method. In this proposed architecture, the loss associated with Stage 1 is the MSE between the predicted output images Ŷ_i and the target output images T_i.

For computing the weight of each enhancement method, the MSEs of the N enhanced images are transformed to weights by comparing their relative strengths: the errors are inverted and the weights scaled under the constraint that the weights of all N methods sum to 1. The enhanced images with the smallest errors obtain the highest weights, and vice versa. In addition, we also compare against giving equal weights to all enhancement methods. Of both weighting strategies, MSE-based weighting yielded the best results and was therefore selected as the default. Note that we also include the original image by simply convolving it with an identity filter (the (N+1)-th filter), as in approach 2; the weight for the RGB image is set to 1, i.e. w_{N+1} = 1. During training, the weights are estimated by cross-validation on the train/validation set, while at test time we use these pre-computed weights. Further, we observed that training the network without this regularization of the weights prevented the model from converging and led to overfitting with a significant drop in performance.
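The exact error-to-weight transform is not spelled out in the text, so the sketch below uses one plausible instantiation, inverse errors normalized to sum to 1, which satisfies the stated constraints (weights of the N methods sum to 1, the smallest error gets the largest weight, and the RGB weight is fixed at 1):

```python
import numpy as np

def mse_to_weights(mse_per_method):
    """Turn per-method reconstruction errors into importance weights.
    Assumption: inverse-error weighting; the paper only states that the
    weights compare relative strengths, sum to 1 over the N methods,
    and that the original RGB image keeps a fixed weight of 1."""
    inv = 1.0 / np.asarray(mse_per_method, dtype=float)
    w = inv / inv.sum()                 # N weights summing to 1
    return np.append(w, 1.0)            # (N+1)-th entry: RGB weight
```

Any monotone decreasing transform of the errors with the same normalization would fit the description equally well.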

End-to-End Training: Finally, we extend the loss of approach 2 by adding MSE terms for the joint optimization of the N enhancement networks with a classification objective. We learn all parameters of the network jointly in an end-to-end fashion. The weighted loss is sample-specific, and is given as:

L = Σ_{i=1}^{N+1} w_i · L_cls^(i) + Σ_{i=1}^{N} L_mse^(i).
We believe that training our network in this manner offers a natural way to encourage the filters to apply transformations that enhance image structures for accurate classification, as the classification network is regularized via the enhancement networks. Moreover, the joint optimization helps minimize the overall cost function of the whole architecture, leading to better results.

4 Experiments

In this section, we demonstrate the use of our enhancement filtering technique on four very different image classification tasks. First, we introduce the datasets, target output data generation, and implementation details, and explore the design choices of the proposed methods. Finally, we test and compare our proposed method with baseline methods and other current ConvNet architectures. Note that the purpose of this paper is to improve the baseline performance of generic CNN architectures using add-on enhancement filters, not to compete against state-of-the-art methods.

4.1 Datasets

We evaluate our proposed method on four visual recognition tasks: fine grained classification using CUB-200-2011 (CUB) [35], object classification using PASCAL-VOC2007 (PascalVOC) [12], scene recognition using MIT-Indoor-Scene (MIT) [26], and texture classification using Describable Textures Dataset (DTD) [4]. Table 1 shows the details of the datasets. For all of these datasets, we use the standard training/validation/testing protocols provided as the original evaluation scheme and report the classification accuracy.

Dataset          # train img   # test img   # classes
CUB [35]         5994          5794         200
PascalVOC [12]   5011          4952         20
MIT [26]         4017          1339         67
DTD [4]          1880          3760         47
Table 1: Details of the training and test sets for all datasets.

4.2 Target Output Data

We generate target output images for five (i.e. N = 5) enhancement methods E_i: (i) the weighted least squares (WLS) filter [13], (ii) the bilateral filter (BF) [11, 32], (iii) an image sharpening filter (Imsharp), (iv) the guided filter (GF) [16], and (v) histogram equalization (HistEq). Given an input image, we first transform the RGB color space into a luminance-chrominance color space, and then apply these enhancement methods to the luminance image to obtain an enhanced luminance image. This enhanced luminance image is then used as the target image for training. We used default parameters for WLS and Imsharp; the parameters of BF, GF and HistEq are adapted to each image and thus require no parameter setting. For a comprehensive discussion, we refer the readers to [11, 13, 16]. The source code for fast BF [11] and WLS [13] is publicly available, and the other methods are available in the Matlab framework.
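Two of the five target generators, HistEq and Imsharp, are simple enough to sketch on the luminance channel. Note the 3×3 box blur inside the sharpening filter is an assumption for illustration; the paper relies on Matlab defaults:

```python
import numpy as np

def hist_eq(y, bins=256):
    """Histogram equalization of a luminance image in [0, 1] (HistEq target):
    map each intensity through the empirical CDF of the image."""
    hist, edges = np.histogram(y, bins=bins, range=(0.0, 1.0))
    cdf = hist.cumsum().astype(float)
    cdf /= cdf[-1]
    return np.interp(y.ravel(), edges[:-1], cdf).reshape(y.shape)

def imsharp(y, amount=1.0):
    """Unsharp masking (Imsharp target): boost the detail layer y - blur(y).
    A 3x3 box blur stands in for the unspecified smoothing kernel."""
    pad = np.pad(y, 1, mode="edge")
    blur = sum(pad[i:i + y.shape[0], j:j + y.shape[1]]
               for i in range(3) for j in range(3)) / 9.0
    return np.clip(y + amount * (y - blur), 0.0, 1.0)
```

WLS, BF, and GF involve edge-aware optimization or guidance images and are the expensive targets that the learned filters approximate at a fraction of the cost.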

4.3 Implementation Details

We use the MatConvNet and Torch frameworks, and all the ConvNets are trained on a TitanX GPU. Here we discuss the implementation details for ConvNet training (i) with dynamic enhancement filter networks, (ii) with static enhancement filters, and (iii) without enhancement filters, as in the classic ConvNet training scenario.

We evaluate our design on AlexNet [21], GoogLeNet [29], VGG-16 (VGG-VD) [28], and BN-Inception [17]. In each case, the models are pre-trained on ImageNet [8] and then fine-tuned on the target datasets. To fine-tune a network, we replace the 1000-way classification layer with a C-way softmax layer, where C is the number of classes in the target dataset. Fine-tuning the different architectures took about 60-90 epochs (batch size 32), depending on the dataset, with a scheduled learning-rate decrease starting from a small learning rate. All ConvNet architectures are trained with identical optimization schemes, using an SGD optimizer with a fixed weight decay and a scheduled learning-rate decrease. We follow two steps to fine-tune the whole network. First, we fine-tune the last two fc layers of the ConvNet architecture using RGB images, and then embed it in Stat/Dyn-CNN for fine-tuning the whole network with enhancement filters, setting a small learning rate for all layers except the last two, which have a high learning rate. For example, BN-Inception requires a fixed input size of 224 × 224. The images are mean-subtracted before network training. We apply data augmentation [21, 28] by cropping the 4 corners and the center together with their x-axis flips, along with color jittering (with the cropping procedure repeated for each of these) for network training. Below, we provide more details for ConvNet training using BN-Inception.

Dynamic Enhancement Filters (Dyn-CNN): The enhancement network consists of 570k learnable model parameters, with the last fully-connected layer (i.e. the dynamic filter parameters) containing 36 neurons, i.e. a filter size of 6 × 6. We initialize the enhancement network's parameters randomly, except for the last fully-connected layer, which is initialized to regress the identity transform (zero weights, identity-transform bias), as suggested in [18]. We initialize the learning rate with a small value and decrease it by a factor of 10 after every 15k iterations. The maximum number of iterations is set to 90k. In terms of computation speed, training the enhancement network along with BN-Inception takes approx. 7% more training time until convergence compared to BN-Inception alone for approach 1 (Sec. 3.1). We use five enhancement networks to generate the five enhancement filters (one for each method) for approach 3 (Sec. 3.3), and also include the original RGB image.

Without Enhancement Filters (FC-CNN): As in the classical ConvNet fine-tuning scenario, we replace the last classification layer of a pre-trained model with a C-way classification layer before fine-tuning. The fully-connected layers and the classification layer are fine-tuned. We initialize the learning rate with a small value and decrease it by a factor of 10 after every 15k iterations. The maximum number of iterations is set to 45k.

Static Enhancement Filters (Stat-CNN): Similar to FC-CNN, here we have 5 enhanced images from the 5 static filters plus the original RGB image as the 6th input, which are fed to the ConvNets for network training. In practice, the static enhancement filters are very low-complexity operations. The optimization scheme is the same as for FC-CNN. We use all five learned static filters for approach 2 (Sec. 3.2).

Testing: As previously mentioned, the input RGB image is transformed into a luminance-chrominance color space, and the luminance image is convolved with the enhancement filter, yielding an enhanced luminance image. The chrominance is then recombined with the enhanced luminance image, and the image is transformed back to RGB. For ConvNet testing, an input frame, either an RGB image or an RGB image enhanced using the static or dynamic filters, is fed into the network. In total, five enhanced images (one for each filter) and the original RGB image are fed into the network sequentially. For the final image-label prediction, the predictions of all images are combined through a weighted sum, where the pre-computed weights are obtained from Dyn-CNN.
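The test-time fusion of per-input predictions can be sketched as follows (the scores and weights below are illustrative values, not measured outputs):

```python
import numpy as np

def fuse_predictions(scores, weights):
    """Combine per-input class scores by a weighted sum (test-time fusion).

    scores:  (N + 1, C) class scores, one row per enhanced image plus
             one row for the original RGB image.
    weights: (N + 1,)   pre-computed weights obtained from Dyn-CNN.
    Returns the index of the predicted class."""
    fused = (np.asarray(weights)[:, None] * np.asarray(scores)).sum(axis=0)
    return int(np.argmax(fused))
```

The RGB row carries the largest fixed weight (w = 1), so the enhanced images act as weighted corrections to the plain-RGB prediction rather than replacing it.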

4.4 Fine-Grained Classification

In this section, we use the CUB-200-2011 [35] dataset as a test bed to explore the design choices of our proposed method, and then compare our method with baseline and current methods.

Dataset: CUB [35] is a fine-grained bird-species classification dataset. The dataset contains 200 bird species with 11,788 images. For this dataset, we measure the accuracy of predicting the correct class for an image.

Ablation study: Here we explore four aspects of our proposed method: (i) the impact of different filter size; (ii) the impact of each enhancement method, separately; (iii) the impact of weighting strategies; and (iv) the impact of different ConvNet architectures.

Filter Size: In our experiments, we explore three different filter sizes. Specifically, we implement the enhancement network as a few convolutional and fully-connected layers with the last layer containing (i) 25 neurons (an enhancement filter of size 5 × 5), (ii) 36 neurons (6 × 6), and (iii) 49 neurons (7 × 7). From the literature [7, 18] we took insights about good filter sizes; the filter size determines the receptive field and is application dependent. We found that the 7 × 7 filter produces smoother images, dropping the classification performance by approx. 2% (WLS) in comparison to the 6 × 6 filter. Similarly, with the 5 × 5 filter the correct enhancement was not transferred, leading to a drop in performance of approx. 3% (WLS). The 6 × 6 filter learned the expected transformation and applied the correct enhancement to the input image with sharp, well-preserved edges.

Enhancement Method (E_i): Here, we compare the performance of the individual enhancement methods in three settings: (a) We employ AlexNet [21] pre-trained on ImageNet [8] and fine-tuned (last two fc layers) on CUB for each ground-truth enhancement method separately (GT-EMs). (b) Starting from the RGB AlexNet model fine-tuned on CUB in (a), we fine-tune the whole model with the GT-EMs, setting a small learning rate for all layers except the last two fc layers, which have a high learning rate. This improves the performance of the pre-trained RGB model by a small margin. (c) Similar to (b), but here we fine-tune the whole model using approach 1 (Sec. 3.1). We can see that our dynamic enhancement approach improves performance by a margin of 1-1.5% in comparison to the generic network fine-tuned on RGB images only. Table 2 summarizes the results.

In Fig. 5, as an example we show some qualitative results for the difference in textures that our enhancement method extracts from the GT-EMs, which is primarily responsible for improving the classification performance.

                     RGB        BF     WLS    GF     HistEq  Imsharp  LF
(a) GT-EMs           67.3 [24]  66.93  67.34  67.12  66.41   66.74    70.14
(b) RGB → GT-EMs     -          67.16  67.41  67.37  66.58   66.87    71.28
(c) Ours (Sec. 3.1)  -          68.21  68.73  68.50  67.62   67.86    72.16
Table 2: Individual accuracy (%) of each enhancement method using AlexNet on CUB. LF is late fusion, i.e. averaging the scores of the 5 enhancement methods.
BF         WLS        GF         HistEq     Imsharp    RGB
0.23±0.05  0.25±0.04  0.24±0.03  0.13±0.03  0.17±0.05  1.0
Table 3: Relative comparison of the weight strengths for each enhancement method, estimated by cross-validation on the CUB training set using Dyn-CNN with BN-Inception; the weight for the RGB image is set to 1 by default.
Figure 5: Qualitative results on CUB. Comparison between the target image T, the enhanced luminance image Ŷ, and the complement of the difference image (diff = T − Ŷ) obtained using approach 1 (Sec. 3.1) for all enhancement methods.

Weighting Strategies: Combining the enhancement methods by late fusion (LF), i.e. averaging the scores, gives further improvements, as shown in Table 2. With this observation, we realized that a more effective weighting strategy should be applied, giving more importance to the better methods when combining. In our evaluation, we explore two weighting strategies: (i) giving equal weights of 1/N, i.e. 0.2 for N = 5, and (ii) weights computed on the basis of MSE, estimated by cross-validation on the training set, as shown in Table 3.

Table 4 clearly shows that weighting adds a positive regularization effect. We found that training the network with the regularization of the MSE loss prevents the classification objective from diverging throughout learning. Table 3 shows that in Dyn-CNN the weight of each enhancement filter relates very well to its individual performance shown in Table 2. We observe that the MSE-based weighting performs best, and therefore choose it as the default weighting method.

            Stat-CNN (ours)  Dyn-CNN (ours)
Averaging   83.19            85.58
MSE-based   83.74            86.12
Table 4: Accuracy (%) comparison of the weighting strategies using BN-Inception on CUB.

ConvNet Architectures: Here, we compare different ConvNet architectures, specifically AlexNet [21], GoogLeNet [29], and BN-Inception [17]. Among all architectures shown in Table 5, BN-Inception exhibits the best classification accuracy. We therefore choose BN-Inception as the default architecture for this experiment.

               FC-CNN     Stat-CNN (ours)  Dyn-CNN (ours)
AlexNet        67.3 [24]  68.52            73.57
GoogLeNet      81.0 [24]  82.35            84.91
BN-Inception   82.3 [18]  83.74            86.12
Table 5: Accuracy (%) performance comparison of different architectures on CUB.

Results: In Table 6, we compare our static and dynamic CNNs with current methods. We consider BN-Inception using our two-step fine-tuning scheme with Stat-CNN and Dyn-CNN. Note that Dyn-CNN improves the generic BN-Inception performance by 3.82% using image enhancement. Our EnhanceNet takes a constant time of only 8 ms (GPU) to generate all enhanced images together, whereas generating the ground-truth target images is very time-consuming, taking 1-6 seconds for each image and method (BF, WLS, and GF). Testing time for the whole model is the EnhanceNet time (8 ms) plus the ClassNet inference time of the chosen architecture.

Further, we extend the baseline ST-CNN [18] to include static (Sec. 3.2) and dynamic filters (Sec. 3.3) immediately after the input, trained with the weighted loss. Following the ST-CNN work [18], we keep the training and evaluation setup the same for a fair comparison. Our results indicate that Dyn-CNN improves performance by 3.81%. Furthermore, our Stat-CNN with static filters is competitive too, performing 1.15% better than ST-CNN [18]. This means that static filters, when dropped into a network, can perform explicit enhancement of features, so accuracy gains can be expected in any ConvNet architecture.
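The static filters in the paper are learned; purely to illustrate the idea of a fixed enhancement layer placed in front of a classifier, the sketch below applies a hand-chosen 3x3 sharpening kernel (an assumed stand-in, not the paper's learned filter) to a single-channel image:

```python
import numpy as np

# Hand-chosen sharpening kernel (sums to 1, so flat regions are preserved);
# only an illustrative stand-in for a learned static enhancement filter.
SHARPEN = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)

def static_enhance(img, kernel=SHARPEN):
    """Convolve a 2-D image with a fixed enhancement kernel (edge padding)."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img, dtype=float)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = (padded[i:i + k, j:j + k] * kernel).sum()
    return out
```

The enhanced output would then be fed to the classifier in place of (or alongside) the raw RGB input.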

                     FC-CNN     Stat-CNN (ours)  Dyn-CNN (ours)
ST-CNN [18]: 448px   84.1 [18]  –                –
BN-Inception         82.3 [18]  83.74            86.12
ST-CNN [18]          83.1 [18]  84.25            86.91
Table 6: Fine-Grained Classification (CUB). Accuracy (%) performance comparison of Stat-CNN and Dyn-CNN with baseline methods and previous works on CUB.

4.5 Object Classification

Dataset: The PASCAL-VOC2007 [12] dataset contains 20 object classes with 9,963 images that contain a total of 24,640 annotated objects. For this dataset, we report the mean average precision (mAP), averaged over all classes.
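As a reference for the metric, average precision for one class can be computed from a descending-score ranking of images; the sketch below uses the simple rank-based AP rather than the 11-point interpolated variant used by the official VOC2007 evaluation:

```python
import numpy as np

def average_precision(labels, scores):
    """AP for one class: mean precision at each positive hit,
    over a ranking of images by descending score (binary labels)."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)
    precision = tp / (np.arange(len(labels)) + 1)
    return precision[labels == 1].mean()

def mean_average_precision(label_matrix, score_matrix):
    """mAP: AP averaged over all classes (one column per class)."""
    aps = [average_precision(label_matrix[:, c], score_matrix[:, c])
           for c in range(label_matrix.shape[1])]
    return float(np.mean(aps))
```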

Results: In Table 7, we show the results. Dyn-CNN is 4.58/6.16% better than Stat-CNN/FC-CNN using AlexNet, and 2.43/3.5% using VGG-16. One can observe that the smaller AlexNet network shows more improvement than the deeper VGG-16. Also, Stat-CNN is 1.58/1.07% better than FC-CNN using AlexNet/VGG-16. Furthermore, Bilen et al. [1], with 89.7% mAP, performs 3.1% lower than Dyn-CNN using VGG-16.

As an additional experiment, we used a subset of the enhancement methods (BF, WLS, GF, and RGB) in Dyn-CNN using AlexNet. This setup underperforms by 0.57% relative to using all methods together.

Dataset            ConvNet  FC-CNN      Stat-CNN (ours)  Dyn-CNN (ours)
PASCAL-VOC (mAP)   AlexNet  76.9 [31]   78.48            83.06
PASCAL-VOC (mAP)   VGG-16   89.3 [28]   90.37            92.8
MIT (Acc.)         AlexNet  56.79 [39]  58.24            62.9
MIT (Acc.)         VGG-16   64.87 [39]  65.94            68.67
DTD (mAP)          AlexNet  61.3 [5]    62.9             67.81
DTD (mAP)          VGG-VD   67.0 [5]    69.12            71.34
Table 7: Performance comparison in %. The table compares FC-CNN, Stat-CNN and Dyn-CNN on AlexNet and VGG networks trained on ImageNet and fine-tuned on target datasets using the standard training and testing sets.

4.6 Indoor Scene Recognition

Dataset: The MIT-Indoor scene dataset (MIT) [26] contains a total of 67 indoor scene categories with 5,356 images. For this dataset, we measure the accuracy of predicting a class for an image.

Results: In Table 7, we show the results. As expected, and consistent with our previous observations, Dyn-CNN is 4.66/6.11% better than Stat-CNN/FC-CNN using AlexNet, and 2.73/3.8% using VGG-16. Stat-CNN is 1.45/1.07% better than FC-CNN using AlexNet/VGG-16.

4.7 Texture Classification

Dataset: The Describable Textures Dataset (DTD) [4] contains 47 describable attributes with 5,640 images. For this dataset, we report the mAP, averaged over all classes.

Results: In Table 7, we show the results. The story is similar to our previous observations: Dyn-CNN outperforms Stat-CNN and FC-CNN by a significant margin, showing improvements of 6.51/4.34% over FC-CNN using AlexNet/VGG-VD.

5 Concluding Remarks

In this paper, we propose a unified CNN architecture that can emulate a range of enhancement filters with the overall goal of improving image classification in an end-to-end learning approach. We demonstrate our framework on four benchmark datasets: PASCAL-VOC2007, CUB-200-2011, MIT-Indoor Scene and the Describable Textures Dataset. In addition to improving the baseline performance of vanilla CNN architectures on all datasets, our method shows promising results in comparison to the state of the art using our static/dynamic enhancement filters. Moreover, our enhancement filters can be used with any existing network to perform explicit enhancement of image texture and structure features, giving CNNs higher-quality features to learn from, which in turn can lead to more accurate classification.

We believe our work opens many possibilities for further exploration. In future work, we plan to investigate additional enhancement methods as well as more complex loss functions appropriate for image enhancement tasks.

Acknowledgements: This work is supported by the DFG, German Research Foundation funded PLUMCOT project.


  • [1] H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In CVPR, 2016.
  • [2] A. Chakrabarti. A neural approach to blind motion deblurring. In ECCV, 2016.
  • [3] Q. Chen, J. Xu, and V. Koltun. Fast image processing with fully-convolutional networks. In ICCV, 2017.
  • [4] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In CVPR, 2014.
  • [5] M. Cimpoi, S. Maji, I. Kokkinos, and A. Vedaldi. Deep filter banks for texture recognition, description, and segmentation. IJCV, 2016.
  • [6] G. B. P. da Costa, W. A. Contato, T. S. Nazare, J. E. Neto, and M. Ponti. An empirical study on the effects of different types of noise in image classification tasks. arXiv:1609.02781, 2016.
  • [7] B. De Brabandere, X. Jia, T. Tuytelaars, and L. Van Gool. Dynamic filter networks. In NIPS, 2016.
  • [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • [9] S. Dodge and L. Karam. Understanding how image quality affects deep neural networks. In QoMEX, 2016.
  • [10] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In NIPS, 2016.
  • [11] F. Durand and J. Dorsey. Fast bilateral filtering for the display of high-dynamic-range images. In ACM TOG, 2002.
  • [12] M. Everingham, A. Zisserman, C. K. Williams, L. Van Gool, M. Allan, C. M. Bishop, O. Chapelle, N. Dalal, T. Deselaers, G. Dorkó, et al. The pascal visual object classes challenge 2007 (voc2007) results. 2007.
  • [13] Z. Farbman, R. Fattal, D. Lischinski, and R. Szeliski. Edge-preserving decompositions for multi-scale tone and detail manipulation. In ACM TOG, 2008.
  • [14] C. Fredembach and S. Süsstrunk. Colouring the near-infrared. In CIC, 2008.
  • [15] X. Fu, J. Huang, X. Ding, Y. Liao, and J. Paisley. Clearing the skies: A deep network architecture for single-image rain removal. arXiv:1609.02087, 2016.
  • [16] K. He, J. Sun, and X. Tang. Guided image filtering. In ECCV, 2010.
  • [17] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
  • [18] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
  • [19] S. Karahan, M. K. Yildirum, K. Kirtac, F. S. Rende, G. Butun, and H. K. Ekenel. How image degradations affect deep cnn-based face recognition? In BIOSIG, 2016.
  • [20] B. Klein, L. Wolf, and Y. Afek. A dynamic convolutional layer for short range weather prediction. In CVPR, 2015.
  • [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [22] Y. Li, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep joint image filtering. In ECCV, 2016.
  • [23] S. Liu, J. Pan, and M.-H. Yang. Learning recursive filters for low-level vision via a hybrid neural network. In ECCV, 2016.
  • [24] K. Matzen and N. Snavely. Bubblenet: Foveated imaging for visual discovery. In ICCV, pages 1931–1939, 2015.
  • [25] X. Peng, J. Hoffman, X. Y. Stella, and K. Saenko. Fine-to-coarse knowledge transfer for low-res image classification. In ICIP, 2016.
  • [26] A. Quattoni and A. Torralba. Recognizing indoor scenes. In CVPR, 2009.
  • [27] T. Remez, O. Litany, R. Giryes, and A. M. Bronstein. Deep class aware denoising. arXiv:1701.01698, 2017.
  • [28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [29] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [30] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv:1312.6199, 2013.
  • [31] P. Tang, X. Wang, B. Shi, X. Bai, W. Liu, and Z. Tu. Deep fishernet for object classification. arXiv:1608.00182, 2016.
  • [32] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In ICCV, 1998.
  • [33] S. Ullman, L. Assif, E. Fetaya, and D. Harari. Atoms of recognition in human and computer vision. National Academy of Sciences, 2016.
  • [34] I. Vasiljevic, A. Chakrabarti, and G. Shakhnarovich. Examining the impact of blur on recognition by convolutional networks. arXiv:1611.05760, 2016.
  • [35] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
  • [36] L. Xu, J. S. Ren, C. Liu, and J. Jia. Deep convolutional neural network for image deconvolution. In NIPS, 2014.
  • [37] L. Xu, J. S. Ren, Q. Yan, R. Liao, and J. Jia. Deep edge-aware filters. In ICML, 2015.
  • [38] Z. Yan, H. Zhang, B. Wang, S. Paris, and Y. Yu. Automatic photo adjustment using deep neural networks. ACM TOG, 2016.
  • [39] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva. Places: An image database for deep scene understanding. arXiv:1610.02055, 2016.