Effects of Image Degradations to CNN-based Image Classification
Like many other topics in computer vision, image classification has progressed significantly in recent years through deep neural networks, especially Convolutional Neural Networks (CNNs). Most existing work focuses on classifying very clear natural images, as evidenced by widely used image databases such as Caltech-256, PASCAL VOC and ImageNet. However, in many real applications, the acquired images may contain degradations that lead to various kinds of blurring, noise, and distortion. An important and interesting problem is the effect of such degradations on the performance of CNN-based image classification. More specifically, we ask whether image-classification performance drops with each kind of degradation, whether this drop can be avoided by including degraded images in training, and whether existing computer vision algorithms that attempt to remove such degradations can help improve image-classification performance. In this paper, we empirically study this problem for four kinds of degraded images: hazy images, underwater images, motion-blurred images and fish-eye images. For this study, we synthesize a large number of such degraded images by applying the respective physical models to clear natural images, and we collect a new hazy image dataset from the Internet. We expect this work to draw more interest from the community to the classification of degraded images.
Image classification, which associates an input image with one of a set of predefined image classes, is a fundamental and important problem in computer vision and artificial intelligence [1, 2]. While image classification has long been studied in different applications, its performance has improved substantially in recent years through supervised deep learning, e.g., Convolutional Neural Networks (CNNs) [3, 4, 5], which unify feature extraction and classification in a single end-to-end network. For example, on the ImageNet dataset, a recent CNN-based image-classification method achieved a top-5 accuracy of 96.4%.
However, most of these excellent results are achieved on clear natural images, such as those in the Caltech-256, PASCAL VOC and ImageNet databases. In many real applications, such as those related to autonomous driving, underwater robotics, video surveillance, and wearable cameras, the acquired images are not always clear. Instead, they suffer from various kinds of degradations. For example, images taken in hazy weather, images taken underwater by waterproof cameras, and images taken by moving cameras usually contain different levels of intensity blur. Images taken by fish-eye cameras usually show spatial distortions. Some examples are shown in Fig. 1. An important and interesting problem is whether the excellent classification performance obtained on clear natural images can be preserved on such degraded images by using the same deep learning techniques. In this paper, we empirically study this problem by constructing datasets of various kinds of degraded images and by quantitatively evaluating and comparing CNN image classification models on these datasets.
More specifically, in this paper we select four kinds of degraded images for our empirical study: hazy images, underwater images, motion-blurred images and fish-eye images. To quantify the classification performance under different levels of degradation, we use the respective physical models to synthesize a large number of degraded images, and we also collect real hazy images from the Internet. We then implement the CNN models using AlexNet and VGGNet-16 on Caffe and use them to classify images with different degradation levels. A CNN-based model employs supervised learning and requires a set of images for training. To explore the effect of degradations on image classification more comprehensively, we study not only training and testing on images with the same level of degradation, but also training and testing on images with different levels of degradation.
The main contributions of this paper are threefold. First, we conduct an empirical study of the effect of four typical kinds of image degradation on CNN-based image classification. Second, we propose the use of the respective physical models to construct synthetic images with different levels of degradation for quantitative evaluation. Third, we investigate whether existing degradation-removal algorithms can benefit degraded image classification. Note that the goal of this paper is not to develop a new method that improves classification performance on degraded images. Instead, we study whether degraded image classification is a more challenging problem than the classification of clear natural images and whether its performance can be improved by selecting or pre-processing the training/test data. We expect this work to attract more interest from the community in further studying the classification of degraded images, including the development of new methods for improving image-classification performance.
The remainder of the paper is organized as follows. Section II overviews the related work. Section III introduces the construction of degraded images, the image-classification evaluation metric, and the CNN-based image classification model. Section IV reports the training and testing datasets, experimental results, and further analysis of the experimental results, followed by a brief conclusion in Section V.
II Related Work
Like many other topics in computer vision, image classification has been significantly improved by deep learning techniques, especially Convolutional Neural Networks (CNNs). In 2012, a CNN-based method achieved a top-5 classification accuracy of 83.6% on the ImageNet dataset in the ImageNet Large-Scale Visual Recognition Challenge 2012 (ILSVRC 2012). This is about 9 percentage points higher than the traditional methods that achieved a top-5 accuracy of 74.3% on ImageNet in ILSVRC 2011. Almost all recent work and state-of-the-art results on image classification have been achieved by CNN-based methods. For example, VGGNet-16 increased the network depth using an architecture with very small ($3 \times 3$) convolution filters and achieved a top-1 accuracy of 75.2% and a top-5 accuracy of 92.5% on ImageNet in ILSVRC 2014. Image classification accuracy in ILSVRC 2014 was then further improved in 2015 by increasing the depth and width of the network. Residual learning was applied to address the vanishing-gradient problem and achieved a top-5 accuracy of 96.4% on ImageNet in ILSVRC 2015. An architectural unit based on channel relationships was also proposed, which adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels, resulting in a top-5 accuracy of 97.8% on ImageNet in ILSVRC 2017. In another line of work, image regions are aligned to gain spatial invariance and strongly localized features are learned, resulting in classification accuracies of 95.0%, 93.4%, 94.4% and 84.0% on the PASCAL VOC 2007, PASCAL VOC 2012, Scene-15 and MIT-67 datasets, respectively. Although these CNN-based methods have achieved excellent performance on image classification, most of them have only been applied to clear natural images.
Recognition and classification of degraded images have been studied in several recent works. One of them analyzed only the influence of specific degradations on face recognition when using deep CNN-based approaches. In this paper, we investigate the general problem of degraded image classification, covering hazy images, motion blur, underwater blur, and fish-eye distortion. Furthermore, the training data in the prior work are always clear images, while in this paper we study whether the direct use of degraded images for training is beneficial. Another work studied the specific degradation of low image resolution in face identification, digit recognition and font recognition. A CNN-based method has also been proposed for improving the recognition performance on low-quality images and video by using pre-training, data augmentation, and other strategies. In this paper, we conduct an empirical study to comprehensively understand the effects of various degradations on the performance of CNN-based image classification and to investigate whether using degraded images in training and applying degradation removal as pre-processing help image classification, which can provide useful guidance for future work.
For hazy images, many models and algorithms have been developed for removing the haze and restoring the original clear image. He et al. presented a single-image haze-removal method using the dark channel prior. Zhu et al. presented a single-image haze-removal algorithm using the color attenuation prior. Berman et al. introduced a haze-removal method based on haze-lines. Cai et al. adopted a CNN-based deep architecture whose layers are specially designed to embody the established priors in image dehazing; it consists of three convolution layers, a max-pooling layer, a Maxout unit and a BReLU activation function. Ren et al. proposed a multi-scale deep neural network for haze removal, which consists of a coarse-scale net for a holistic transmission map and a fine-scale net for local refinement. Li et al. designed an end-to-end network based on a re-formulated atmospheric scattering model, instead of estimating the transmission matrix and the atmospheric light separately. Recently, researchers have also investigated haze removal from images of nighttime hazy scenes. For example, Li et al. developed a method to remove nighttime haze with glow and multiple light colors. Zhang et al. proposed a fast nighttime haze-removal method using the maximum reflectance prior.
Unlike in standard indoor and outdoor environments, the visible distance in many underwater conditions is only a few meters. Underwater images taken by waterproof cameras or other imaging facilities are usually highly blurred, and recognizing objects in an underwater image is an important problem for both civil and military applications. A traditional feature-matching method using linear sparse coding has been developed for underwater object recognition/detection. Jordt et al. proposed a system for computing the camera path and 3D points from underwater images. Yau et al. extended existing work on physical refraction models by considering the dispersion of light, and derived new constraints on the model parameters for underwater camera calibration. Sheinin et al. generalized the next-best-view concept of robot vision to scattering media and cooperative movable lighting for underwater navigation. Akkaynak et al. introduced the space of attenuation coefficients that can be used for many underwater computer vision tasks. Wang et al. proposed a method for feeble object detection in underwater images through logical stochastic resonance with a delay loop. Moller et al. proposed an active learning method for the classification of species in underwater images from a fixed observatory. Rajeev et al. proposed a segmentation technique for underwater images based on K-means and local adaptive thresholding. Chen et al. proposed an underwater object segmentation method based on optical features.
The motion of the camera and/or the captured objects usually introduces motion blur into the acquired images. Liu et al. proposed a blurred image classification and analysis framework for detecting images containing blurred regions and recognizing the blur types of those regions without performing blur kernel estimation or image deblurring. Golestaneh et al. proposed a spatially-varying blur detection method. Kalalembang et al. presented a method of detecting unwanted motion blur effects. Gast et al. proposed a parametric object motion model combined with a segmentation mask to exploit localized, non-uniform motion blur. Lin et al. addressed the problem of matting motion-blurred objects from a single image. In addition, effective and efficient deblurring of such degraded images has become an important research topic in the past decades. An end-to-end algorithm has been developed to reconstruct motion-blur-free images. In another approach, the motion flow was estimated from a single degraded image and then compensated to remove the motion blur. Fan et al. proposed a new blur classification model using a convolutional neural network.
Fish-eye images can provide a wide-angle view of a scene, but they introduce distortions to the covered scene and objects. Kannala et al. proposed a generic camera model for both conventional and wide-angle lens cameras, and developed a calibration method for estimating the parameters of the model. Fu et al. discussed how to explicitly employ distortion cues to detect forged objects in fish-eye images. Hughes et al. proposed a method to estimate the intrinsic and extrinsic parameters of fish-eye cameras. Wei et al. proposed a fish-eye video correction method. Ying et al. presented a method to calibrate fish-eye lenses. Li et al. proposed a new fish-eye image rectification method that combines the physical spherical model and the digital distortion model. Krams et al. addressed the problem of people detection in top-view fish-eye imaging. Baek et al. proposed a method for real-time detection, tracking, and classification of moving and stationary objects using multiple fish-eye images.
Different from these prior works on hazy, underwater, motion-blur and fish-eye images, in this paper, we conduct a comprehensive empirical study to quantify the effects of these four kinds of degradations to image classification. Some of these prior works investigated the removal of degradations. Later in this paper, we will study whether the removal of degradations using these methods can help the CNN-based image classification or not.
III Proposed Method
In this section, we first discuss the construction of the images with different levels of degradations, followed by CNN-based image classification models and performance metric.
III-A Synthesis of Degraded Images
There are two difficulties in quantitatively and comprehensively evaluating CNN-based degraded image classification. First, it is very difficult to collect a large number of real degraded images with the desired class information for training the CNN models. Second, the degradation levels of the collected real images are usually unknown, and therefore such images cannot be used to quantify the effect of degradation level on image classification performance. To address these problems, we propose to first synthesize degraded images for large-scale CNN-based training and testing.
There are many available image datasets, such as Caltech-256, PASCAL VOCs and ImageNet, consisting of different classes of clear natural images, that have been widely used for evaluating image classification models. We can select one such dataset, take each image in this dataset as the original image without any degradation, and then synthesize its degraded versions by using available physical models. In this data synthesis, we can control the level of the added degradations. By using these synthesized degraded images for training and testing the CNN models, we can systematically study the effect of different levels of degradations to the performance of image classification.
We synthesize hazy images by
$$I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr),$$
where $x$ is the pixel coordinates, $I$ is the synthesized hazy image, $J$ is the original image and $A$ is the global atmospheric light. The scene transmission $t(x)$ is distance-dependent:
$$t(x) = e^{-\beta d(x)},$$
where $\beta$ is the atmospheric scattering coefficient and $d(x)$ is the normalized distance of the scene at pixel $x$. We obtain the depth map by following prior work. We can control the degradation levels of the synthesized hazy images by varying $\beta$.
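The haze synthesis step can be sketched in a few lines. The following is a minimal illustration rather than the authors' implementation. It assumes the standard atmospheric scattering model, I(x) = J(x) t(x) + A (1 - t(x)) with t(x) = exp(-β d(x)), images scaled to [0, 1], and a normalized depth map that is already available. The function name `synthesize_haze` and the default atmospheric light of 1.0 are hypothetical choices made for this example.

```python
import numpy as np


def synthesize_haze(clear, depth, beta, airlight=1.0):
    """Apply I = J*t + A*(1 - t) with t = exp(-beta * d).

    clear:    float array in [0, 1], shape (H, W, 3), the clear image J
    depth:    float array in [0, 1], shape (H, W), the normalized depth d
    beta:     scattering coefficient; beta = 0 leaves the image unchanged
    airlight: global atmospheric light A (hypothetical default of 1.0)
    """
    clear = np.asarray(clear, dtype=np.float64)
    depth = np.asarray(depth, dtype=np.float64)
    # Transmission map, with a trailing axis so it broadcasts over RGB.
    t = np.exp(-beta * depth)[..., np.newaxis]
    hazy = clear * t + airlight * (1.0 - t)
    return np.clip(hazy, 0.0, 1.0)


# Toy example: a 2x2 mid-gray image whose right column is farther away.
J = np.full((2, 2, 3), 0.5)
d = np.array([[0.0, 1.0],
              [0.0, 1.0]])
hazy = synthesize_haze(J, d, beta=2.0)
```

In this toy case the left column (zero depth) keeps its original value, while the right column is pulled toward the atmospheric light. Increasing `beta` strengthens the effect, which is how the degradation level is varied.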
where is the scattering angle, , with being the optical depth and being the single scattering albedo. We can control the degradation levels of the synthesized underwater images by varying and .
We synthesize motion-blurred images by following prior work, where a motion-blurred image is modeled as
$$g(x) = h(x) * f(x) + n(x),$$
where $g$ is the motion-blurred image, $h$ is the motion blur kernel, and $f$ is the clear natural image. $*$ is the convolution operator, $n$ is additive noise, and $x$ is the pixel coordinates. The degradation level is controlled by varying the kernel parameters $l$ and $\theta$, with $l$ being the length of the blur kernel and $\theta$ being the counterclockwise rotation angle of the object.
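A linear motion blur of this kind can be illustrated as follows. This is a rough sketch under stated assumptions, not the procedure used in the paper: the blurred image is taken to be the clear image convolved with a normalized line-shaped kernel plus optional Gaussian noise, the image is grayscale, and borders are handled by edge replication. The helper names `motion_blur_kernel` and `blur`, the rasterization of the line, and the noise model are all hypothetical.

```python
import numpy as np


def motion_blur_kernel(length, angle_deg):
    """Normalized line kernel of the given length, rotated counterclockwise."""
    if length <= 1:
        return np.ones((1, 1))                    # identity kernel: no blur
    size = length if length % 2 == 1 else length + 1   # keep the kernel odd-sized
    kernel = np.zeros((size, size))
    c = size // 2
    theta = np.deg2rad(angle_deg)
    half = (length - 1) / 2.0
    # Sample points along the line segment and mark the nearest pixels.
    for s in np.linspace(-half, half, 4 * length):
        col = int(round(c + s * np.cos(theta)))
        row = int(round(c - s * np.sin(theta)))   # image rows grow downward
        kernel[row, col] = 1.0
    return kernel / kernel.sum()


def blur(image, kernel, noise_sigma=0.0, rng=None):
    """Convolve a 2-D grayscale image with the kernel and add optional noise."""
    image = np.asarray(image, dtype=np.float64)
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="edge")
    flipped = kernel[::-1, ::-1]                  # true convolution flips the kernel
    out = np.zeros_like(image)
    for i in range(kh):
        for j in range(kw):
            out += flipped[i, j] * padded[i:i + image.shape[0],
                                          j:j + image.shape[1]]
    if noise_sigma > 0:
        rng = rng if rng is not None else np.random.default_rng(0)
        out += rng.normal(0.0, noise_sigma, size=out.shape)
    return out


# Toy example: a single bright pixel smeared horizontally over 3 pixels.
impulse = np.zeros((5, 5))
impulse[2, 2] = 1.0
k = motion_blur_kernel(3, 0)
blurred = blur(impulse, k)
```

Longer kernels spread each pixel over more neighbors, so the kernel length acts as the main knob for the degradation level, while the angle sets the blur direction.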
We synthesize fish-eye images by following , where the pixel coordinates of the fish-eye image is computed from
where is the distorted coordinates and and . with and . is the coordinates of the original image. is the principal point and and are the number of pixels per unit distance along horizontal and vertical directions, respectively. The degradation level is controlled by varying the exponent .
Figure 2 shows the examples of the synthesized images of different degradation levels.
III-B CNN-based Image Classifiers and Evaluation Metric
In this paper, we use AlexNet and VGGNet-16 on Caffe to implement the CNN-based image classification models. The architectures of AlexNet and VGGNet-16 are summarized in Fig. 3, where the convolution layers are shown in gray, the max-pooling layers in orange, and the fully-connected layers in light green.
AlexNet has 8 weight layers (5 convolutional layers and 3 fully-connected layers). The first convolutional layer has 96 kernels of size $11 \times 11$ and filters the input image with stride 4. The second convolutional layer filters the output of the first convolutional layer with 256 kernels of size $5 \times 5$. The third convolutional layer has 384 kernels of size $3 \times 3$ connected to the outputs of the second convolutional layer. The fourth convolutional layer has 384 kernels of size $3 \times 3$, and the fifth convolutional layer has 256 kernels of size $3 \times 3$. Each fully-connected layer has 4,096 channels, except for the last fully-connected layer, which has $K$ channels ($K$ is the number of classes).
VGGNet-16 has 16 weight layers (13 convolution layers and 3 fully-connected layers). It uses very small $3 \times 3$ receptive fields with stride 1 throughout the whole net. The number of channels is small: it starts from 64 in the first layer and increases by a factor of 2 after each max-pooling layer until it reaches 512. The input is also a fixed-size $224 \times 224$ RGB image during training. Spatial pooling is carried out by max-pooling layers, which follow some of the convolution layers. Max-pooling is performed over a $2 \times 2$ pixel window with stride 2. The first two fully-connected layers have 4,096 channels each and the third has 257 channels (one for each class). The final layer is the soft-max layer.
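The layer counts and sizes described above can be checked with a short trace. The snippet below is only an illustrative sketch: it assumes the standard VGG-16 layout (3×3 convolutions with padding 1, so that only the five 2×2 max-pooling layers change the spatial size) and a 224×224 input. The names `VGG16_CFG` and `trace_vgg16` are hypothetical.

```python
# Standard VGG-16 layout: integers are output channels of 3x3 convolutions,
# "M" marks a 2x2 max-pooling layer with stride 2.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def trace_vgg16(input_size=224, num_classes=257):
    """Return (number of conv layers, final spatial size, final channels, FC sizes)."""
    size, channels, conv_layers = input_size, 3, 0
    for item in VGG16_CFG:
        if item == "M":
            size //= 2              # pooling halves height and width
        else:
            channels = item         # padded 3x3 conv keeps the spatial size
            conv_layers += 1
    fc_dims = [size * size * channels, 4096, 4096, num_classes]
    return conv_layers, size, channels, fc_dims


conv_layers, size, channels, fc_dims = trace_vgg16()
print(conv_layers + 3, size, channels, fc_dims)
# prints: 16 7 512 [25088, 4096, 4096, 257]
```

The trace gives 13 convolution layers plus 3 fully-connected layers, that is 16 weight layers, with a 7×7×512 feature map feeding the first fully-connected layer and 257 outputs for Caltech-256.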
As detailed later, we use the Caltech-256 dataset to synthesize images with different degradation levels. We divide the synthesized image data into training and testing sets, and use the classification accuracy on the testing set as the performance evaluation metric. Specifically, the accuracy is defined as $N_c / N_t$, where $N_c$ is the number of correctly recognized images in the testing set and $N_t$ is the total number of testing images.
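The evaluation metric is simply the fraction of correctly recognized test images. A minimal sketch, with a hypothetical function name and toy labels that are not taken from the paper's data:

```python
def classification_accuracy(predicted, ground_truth):
    """Number of correctly recognized test images divided by the number of test images."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("the testing set must not be empty")
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)


# Toy example with four test images, one of which is misclassified.
acc = classification_accuracy(
    ["dog", "raccoon", "horse", "butterfly"],
    ["dog", "dog", "horse", "butterfly"],
)
print(acc)  # prints: 0.75
```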
III-C Training and Testing Datasets
For a comprehensive analysis of the effects of image degradations to CNN-based image classification, we vary the selection of training and testing image datasets for the CNN classifiers and design multiple experiments in our empirical study.
For each kind of degradation, train and test CNN classifiers using image data of different degradation levels.
For each kind of degradation, train and test CNN classifiers using image data of the same degradation level.
For each kind of degradation, train CNN classifiers by combining images of different degradation levels and test on images of each degradation level.
Finally, we collect a set of real hazy images from the Internet for CNN training and/or testing to support the findings drawn from our study on the synthetic data, although the exact degradation levels of these real images are unknown.
In this section, we first describe the datasets and experiment setup. After that, we report the experiment results on different datasets and visualize sample features extracted at each hidden layer to further analyze the experiment results. Finally, we conduct experiments to check whether a degradation-removal preprocessing step can improve the CNN-based image classification accuracy.
IV-A Datasets and Experiment Setup
We synthesize degraded images using all the images in the Caltech-256 dataset, which has been widely used for evaluating image classification algorithms. This dataset contains 30,607 images from 257 classes (256 object categories and a clutter class). For each level of each kind of degradation, we synthesize 30,607 images by applying the respective model to each of the images in Caltech-256. In the original Caltech-256, we follow prior practice and select 60 images per class as training images, and the rest are used for testing. Among the training images, 20% per class are used as a validation set. We follow this strategy to split the synthesized image data: an image is in the training set if it is synthesized from an image in the training set, and in the testing set otherwise. This way, for each level of each kind of degradation, we have a training set of 15,420 images (60 per class for 257 classes) and a testing set of 15,187 images.
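The split protocol can be sketched as follows. This is a hypothetical illustration, not the authors' script: it assumes images are identified by their index in the original dataset, that the 60 training images per class are drawn at random with a fixed seed, and that the validation images are a subset of those 60. Because a synthesized image inherits the split of its source image, the index lists are computed once and reused for every degradation kind and level.

```python
import random
from collections import defaultdict


def split_per_class(labels, train_per_class=60, val_fraction=0.2, seed=0):
    """Split image indices per class.

    labels: list of integer class ids, one per original image.
    Returns (fit, val, test); fit + val together form the training images.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    fit, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label][:]
        rng.shuffle(idxs)
        chosen, rest = idxs[:train_per_class], idxs[train_per_class:]
        n_val = int(round(val_fraction * len(chosen)))
        val.extend(chosen[:n_val])     # validation subset of the training images
        fit.extend(chosen[n_val:])     # images actually used to fit the model
        test.extend(rest)              # everything else is used for testing
    return fit, val, test


# Toy dataset: three classes with 80, 90 and 100 images.
toy_labels = [0] * 80 + [1] * 90 + [2] * 100
fit, val, test = split_per_class(toy_labels)
print(len(fit) + len(val), len(val), len(test))  # prints: 180 36 90
```

With the figures quoted above, 60 images for each of the 257 classes give 60 × 257 = 15,420 training images, which leaves 30,607 − 15,420 = 15,187 images for testing.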
In our experiments, for each kind of degraded image, we select seven different levels of degradation. More specifically, for hazy images, we set the parameter . For underwater images, we set parameters . For motion-blurred images, we set . For fish-eye images, we set parameter . For each kind of degradation, the first level, at which the corresponding parameters are all zero (or one for fish-eye images), corresponds to no degradation, which is equivalent to using the original images in Caltech-256 for the experiments.
While we can construct synthetic degraded images using well-acknowledged physical models, real image degradations can be much more complicated, and experiments on real images are still crucial. Given the relatively wide availability of hazy images, we collect a new dataset of real hazy images from the Internet. This new dataset contains 4,610 images from 20 classes, and we name it Haze-20. The 20 image classes are bird, boat, bridge, building, bus, car, chair, cow, dog, horse, people, plane, sheep, sign, street-lamp, tower, traffic-light, train, tree and truck. The number of images per class varies from 204 to 279, as given in Table I. Some examples (one image per class) in Haze-20 are shown in Fig. 4. For the collected real hazy images in Haze-20, we select 100 images from each class as training images, and the rest are used for testing. Among the training images, 20% per class are used as a validation set. So, we have a training set of 2,000 images and a testing set of 2,610 images.
One important experiment in our empirical study is to train on clear images and test on degraded images. This is not a problem for our synthetic data, since their underlying degradation-free clear images are available, i.e., the original Caltech-256 data. For the collected real images in Haze-20, we do not have the underlying clear images. To address this issue, we collect a new HazeClear-20 image dataset from the Internet, which consists of haze-free images that fall into the same 20 classes as Haze-20. HazeClear-20 consists of 3,000 images, with 150 images per class. For the HazeClear-20 dataset, we also select 100 images from each class as training images, and the rest are used for testing. Among the training images, 20% per class are used as a validation set. So, we have a training set of 2,000 images and a testing set of 1,000 images.
We implement AlexNet and VGGNet-16 on Caffe in this paper. The CNN architectures are pre-trained on the ImageNet dataset, which consists of 1,000 classes with 1.2 million training images. We use the pre-trained models for image classification by fine-tuning them on our training images. We change the number of channels in the last fully-connected layer from 1,000 to $K$, where $K$ is the number of classes in our datasets. Note that we use AlexNet and VGGNet-16 in this paper for their simplicity; other network structures, such as ResNet, could be used as well.
IV-B Results on Synthetic Images – Individual Degradation Levels
For each kind of synthesized degraded images, we exhaustively try the training using the training set of one level and testing on the testing data of the same or another level. The image classification accuracies by using VGGNet-16 are summarized in the four subtables in Table II, separated by thick lines. For each of the four subtables, the first column indicates the degradation level of the training set and the first row indicates the degradation level of the testing set. For example, the accuracy corresponding to the row and the column is 65.0%. This indicates the classification accuracy on the testing set at haze-degradation level is 65.0% when using the VGGNet-16 classifiers trained using the training set at the haze-degradation level . In these subtables, the accuracies along the diagonal are achieved by training and testing on the image data of the same degradation level. The accuracies at non-diagonal elements are achieved by training and testing on the image data of the different degradation levels. In these four subtables, we highlight the maximum accuracy in each column.
From the results reported in these four subtables, we can see that, the maximum values in each column are usually located along the diagonal of each subtable. This indicates that, to achieve the best possible accuracy in classifying the images with certain level of degradations, we need to collect the training images with the same kind of degradation and with the same or similar degradation levels. If the degradation levels of the training images and test images have large gaps, the testing accuracy can be very low. For example, if we use clear images () to train CNN classifiers, and then apply them to classify images with hazy level of , the accuracy will drop significantly from (training and testing both on clear images) or (training and testing both on hazy images) to .
By examining the accuracy values along the diagonal of each subtable, we can see that they also decrease from the top-left element to the bottom-right element. This indicates that, even if we train and test the CNN using images of the same degradation level, the classification performance still drops as the degradation level increases. For example, in the lower-left subtable of Table II, the classification accuracy drops to when we train and test both on -level motion-blurred images, while the accuracy is when both training and testing images have no degradation. This may be caused by the partial loss of discriminative image information in the degraded images.
We conduct the same experiments using AlexNet, and the results are shown in the four subtables of Table III. These results are largely consistent with those in Table II: in each subtable, the diagonal elements are usually larger than the non-diagonal ones, and the values along the diagonal drop from the top-left element to the bottom-right element. We also find that, in general, VGGNet-16 produces higher classification accuracy than AlexNet when using the same training set. This is not surprising, since VGGNet-16 is a deeper network. For each subtable in Table II and Table III, we compute the accuracy drop as the difference between the top-left and bottom-right elements. The accuracy drop with VGGNet-16 is smaller than that with AlexNet for three out of four kinds of degradation: 13.2%, 6.1% and 18.3% (VGGNet-16) vs. 19.6%, 14.3% and 25.4% (AlexNet) for hazy, underwater and fish-eye images, respectively. Only for motion-blurred images is the accuracy drop with VGGNet-16 (22.0%) slightly larger than that with AlexNet (20.3%). From this perspective, VGGNet-16 also outperforms AlexNet.
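The two summaries used in this analysis, the best training level for each testing level and the accuracy drop along the diagonal, are easy to compute from a train-by-test accuracy table. The sketch below uses a small made-up 3×3 table (rows are training levels, columns are testing levels); the numbers are purely illustrative and are not taken from Table II or Table III, and the function names are hypothetical.

```python
def diagonal_drop(acc):
    """Difference between the top-left and bottom-right diagonal entries."""
    return acc[0][0] - acc[-1][-1]


def best_train_level_per_test_level(acc):
    """For each column (testing level), the row (training level) with the highest accuracy."""
    n_train, n_test = len(acc), len(acc[0])
    return [max(range(n_train), key=lambda r: acc[r][c]) for c in range(n_test)]


# Hypothetical accuracies (%) for three degradation levels.
toy_table = [
    [80.0, 60.0, 30.0],   # trained on level 0 (clear)
    [75.0, 72.0, 55.0],   # trained on level 1
    [60.0, 68.0, 66.0],   # trained on level 2
]

print(diagonal_drop(toy_table))                    # prints: 14.0
print(best_train_level_per_test_level(toy_table))  # prints: [0, 1, 2]
```

In this toy table the best training level for every testing level lies on the diagonal, which mirrors the pattern reported for the real tables, and the diagonal itself still falls by 14 points from the clear level to the most degraded one.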
IV-C Results on Synthetic Images – Mixed Degradation Levels
In practice, we may not know exactly the degradation levels of real images, and it can be difficult to guarantee that the degradation levels of the testing images match those of the training images. Therefore, it is important to study the case where training images mix a wide range of image degradation levels.
For each kind of degradation, we combine the training images of all different degradation levels to generate a mixed training set for CNN training. Then, we test the CNN classifiers on testing images of the same degradation kind at each degradation level. Results are shown in Table IV: the four subtables from top to bottom are classification accuracy (%) of hazy images, underwater images, motion-blurred images and fish-eye images by using VGGNet-16, respectively. The first, third, fifth and seventh rows indicate the degradation level of the testing set and the second, fourth, sixth and eighth rows indicate the classification accuracy (%) on the corresponding degradation level of the testing set.
From Table IV, we can see that the classification accuracies on clear images in its four subtables are 79.7%, 80.5%, 76.2% and 77.2%, respectively, which are lower than the clear-image classification accuracy of 81% shown in Table II, where the training set contains only clear images. This indicates that including degraded images in training may reduce the classification accuracy on clear images. However, we can also see that the accuracy for each degradation kind and each degradation level in Table IV is usually higher than the corresponding accuracy along the diagonal of the four subtables in Table II, except for the degradation-free clear images. For example, the accuracy on hazy images is 73.7% when the training set mixes images of all degradation levels, while this accuracy is only 72.3% when training only on hazy images. This indicates that, if we know that the test images are degraded, we may want to include as many degraded images as possible in the training set, even if they have different degradation levels.
IV-D Results on Real Hazy Images
We conduct experiments on the Haze-20 and HazeClear-20 datasets using VGGNet-16 and AlexNet. The experimental results are shown in Table V. The first column indicates the kind of training images and the second row indicates the kind of test images, where “Combine” indicates the combination of the hazy and clear training images for training. We can see that when we train and test on clear images, the accuracy reaches 98.0% using VGGNet-16. However, when we train and test on real hazy images, the accuracy drops to only 81.2% using VGGNet-16. When the training set mixes hazy and clear images, the test accuracy is 97.7% on clear images and 76.7% on hazy images. In this experiment, training and testing on images of the same kind achieves the best performance.
IV-E Hidden-Layer Features
We can scrutinize the features extracted at each hidden layer to analyze the possible reasons for the performance drop in degraded image classification. For an input image of size $H \times W$, the activations of a convolution layer form an order-3 tensor with $h \times w \times d$ elements, where $d$ is the number of channels. The term “activations” refers to the feature maps of all the channels in a convolution layer. The activations obtained when classifying several sample images are displayed in Fig. 5, where the five columns on the right are the activations of the max-pooling of the first, second, third, fourth, and last convolution layer in VGGNet-16, respectively. They are labeled as , , , , and respectively in Fig. 5. For better visual effect, we resize those activations using bi-cubic interpolation, such that they have the same size as the input image.
We can see that, compared to clear images, the low-layer activations of the degraded images are not as discriminative. The low layers of a CNN are known to reflect color, texture, and other low-level image features. Poor low-level features can severely hamper the extraction of good high-level features in later layers, leading to lower image-classification accuracy. Besides, the high-level features of the degraded images show that the salient region is not accurately localized. For example, for the synthesized underwater image in Fig. 5, the degradation produces too many salient regions, which makes the dog difficult to localize. As a result, the dog in this degraded image is mistakenly recognized as a raccoon. Similarly, the butterfly in the motion-blurred image in Fig. 5 is mistakenly recognized as a mushroom. Distortion changes the shape and appearance of objects, leading to incorrect features in the CNN layers and, finally, to errors in image classification. For example, in the fish-eye image in Fig. 5, the distortion causes the horse to be incorrectly recognized as a rifle.
IV-F Does Degradation-Removal Pre-Processing Help?
As discussed earlier, for many kinds of image degradation, much research has been conducted on removing or reducing the degradation to restore the underlying clear images [15, 16, 17, 18, 19, 20, 21, 22, 25, 37, 38, 40, 42, 44, 45]. One interesting question is whether we can get better classification accuracy by training the CNN classifiers on clear images and testing on test images restored by degradation removal. In this section, on the synthetic data, we pick a haze-removal algorithm, a deblurring algorithm, and a distortion-correction algorithm to remove the corresponding degradations from the test images, and then run the CNN-based image classification. Results are shown in Table VI, which contains three subtables, separated by thick lines, for hazy images (top), motion-blurred images (middle) and fish-eye images (bottom), respectively. In each subtable, the row “w/o DM” gives the results of training on clear images and testing on degraded images without degradation removal, the row “w DM” gives the results of training on clear images and testing on degraded images after degradation removal, and the row “Diag.” gives the results of training and testing on images of the same degradation level, copied here from the diagonal of the corresponding subtable in Table II.
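The three rows of each subtable correspond to three evaluation settings, which can be sketched as follows. This is only an illustration of the protocol: `accuracy`, the two toy classifiers, and the toy `restore` function are hypothetical stand-ins, not the CNNs or restoration algorithms used in the experiments.

```python
def accuracy(classify, images, labels, restore=None):
    """Percentage of images labeled correctly, optionally restoring each one first."""
    correct = 0
    for image, label in zip(images, labels):
        if restore is not None:
            image = restore(image)  # degradation-removal pre-processing
        correct += int(classify(image) == label)
    return 100.0 * correct / len(labels)


# Toy stand-ins: an "image" is a single integer and degradation adds an offset.
clear_images, labels = [10, 40, 60, 90], [0, 0, 1, 1]
degraded_images = [x + 30 for x in clear_images]


def clear_model(x):      # hypothetical classifier trained on clear images
    return int(x > 50)


def degraded_model(x):   # hypothetical classifier trained on degraded images
    return int(x > 80)


def restore(x):          # hypothetical, imperfect degradation removal
    return x - 25


results = {
    "w/o DM": accuracy(clear_model, degraded_images, labels),
    "w DM": accuracy(clear_model, degraded_images, labels, restore=restore),
    "Diag.": accuracy(degraded_model, degraded_images, labels),
}
print(results)
```

The toy numbers only show how the three settings differ in what is trained and what is tested; they say nothing about the actual results in Table VI.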
We can see that, if we train the CNN model on clear images, degradation-removal pre-processing can sometimes improve the classification accuracy, especially for hazy and motion-blurred images. However, degradation removal never leads to a classification accuracy close to that of training and testing on the original clear images. Comparing the rows “w DM” and “Diag.”, we can also see that the accuracy of training and testing on the same degradation level is much higher than the accuracy of training on clear images and testing on images recovered by degradation removal. This shows that degradation-removal algorithms may make the images more visually pleasing to human eyes, but may not help much for CNN-based classification: for degraded images, training and testing directly on the degraded images, without degradation removal, actually produces the best accuracy. We believe this is reasonable, since degradation-removal algorithms do not introduce any new information for the CNN-based classification.
In this paper, we conducted an empirical study of the effect of four kinds of image degradation on the performance of CNN-based image classification. To facilitate quantitative evaluation, we proposed to synthesize a large number of images for training and testing. We considered the synthesis of hazy images, underwater images, motion-blurred images and fish-eye images, each with seven degradation levels, as well as the collection of real hazy images from the Internet. We found that image-classification performance does drop significantly when the image is degraded, especially when the training images do not reflect the degradation levels of the test images well. By visualizing the activations of the hidden layers of the CNN classifiers, we found that many important low-level features were not well discerned in the early layers, which might be a key factor in the reduced classification accuracy. We also found that existing algorithms for removing various kinds of degradation did not improve the CNN-based classification performance much. We hope this study can draw more interest from the community to work on degraded-image classification, which can benefit many important application domains such as autonomous driving, underwater robotics, video surveillance, and wearable cameras.
- [1] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in Computer Vision and Pattern Recognition. IEEE, 2009.
- [2] J. Sánchez and F. Perronnin, “High-dimensional signature compression for large-scale image classification,” in Computer Vision and Pattern Recognition. IEEE, 2011.
- [3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Neural Information Processing Systems, 2012.
- [4] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
- [5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Computer Vision and Pattern Recognition. IEEE, 2016.
- [6] G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category dataset,” 2007.
- [7] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International Journal of Computer Vision, 2010.
- [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition. IEEE, 2009.
- [9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich et al., “Going deeper with convolutions,” in Computer Vision and Pattern Recognition. IEEE, 2015.
- [10] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” arXiv preprint arXiv:1709.01507, 2017.
- [11] T. Durand, T. Mordan, N. Thome, and M. Cord, “Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation,” in Computer Vision and Pattern Recognition. IEEE, 2017.
- [12] S. Karahan, M. K. Yildirum, K. Kirtac, F. S. Rende, G. Butun, and H. K. Ekenel, “How image degradations affect deep cnn-based face recognition?” in Biometrics Special Interest Group (BIOSIG), 2016 International Conference of the. IEEE, 2016.
- [13] Z. Wang, S. Chang, Y. Yang, D. Liu, and T. S. Huang, “Studying very low resolution recognition using deep networks,” in Computer Vision and Pattern Recognition. IEEE, 2016.
- [14] D. Liu, B. Cheng, Z. Wang, H. Zhang, and T. S. Huang, “Enhance visual recognition under adverse conditions via deep networks,” arXiv preprint arXiv:1712.07732, 2017.
- [15] K. He, J. Sun, and X. Tang, “Single image haze removal using dark channel prior,” Pattern Analysis and Machine Intelligence, 2011.
- [16] Q. Zhu, J. Mai, and L. Shao, “A fast single image haze removal algorithm using color attenuation prior,” Transactions on Image Processing, 2015.
- [17] D. Berman, S. Avidan et al., “Non-local image dehazing,” in Computer Vision and Pattern Recognition. IEEE, 2016.
- [18] B. Cai, X. Xu, K. Jia, C. Qing, and D. Tao, “Dehazenet: An end-to-end system for single image haze removal,” Transactions on Image Processing, 2016.
- [19] W. Ren, S. Liu, H. Zhang, J. Pan, X. Cao, and M.-H. Yang, “Single image dehazing via multi-scale convolutional neural networks,” in European Conference on Computer Vision. Springer, 2016.
- [20] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng, “Aod-net: All-in-one dehazing network,” in International Conference on Computer Vision. IEEE, 2017.
- [21] Y. Li, R. T. Tan, and M. S. Brown, “Nighttime haze removal with glow and multiple light colors,” in International Conference on Computer Vision. IEEE, 2015.
- [22] J. Zhang, Y. Cao, S. Fang, Y. Kang, and C. W. Chen, “Fast haze removal for nighttime image using maximum reflectance prior,” in Computer Vision and Pattern Recognition. IEEE, 2017.
- [23] K. Oliver, W. Hou, and S. Wang, “Feature matching in underwater environments using sparse linear combinations,” in Computer Vision and Pattern Recognition Workshops. IEEE, 2010.
- [24] A. Jordt-Sedlazeck and R. Koch, “Refractive structure-from-motion on underwater images,” in International Conference on Computer Vision. IEEE, 2013.
- [25] T. Yau, M. Gong, and Y.-H. Yang, “Underwater camera calibration using wavelength triangulation,” in Computer Vision and Pattern Recognition. IEEE, 2013.
- [26] M. Sheinin and Y. Y. Schechner, “The next best underwater view,” in Computer Vision and Pattern Recognition. IEEE, 2016.
- [27] D. Akkaynak, T. Treibitz, T. Shlesinger, Y. Loya, R. Tamir, and D. Iluz, “What is the space of attenuation coefficients in underwater computer vision?” in Computer Vision and Pattern Recognition. IEEE, 2017.
- [28] N. Wang, B. Zheng, H. Zheng, and Z. Yu, “Feeble object detection of underwater images through lsr with delay loop,” Optics Express, 2017.
- [29] T. Möller, I. Nillsen, and T. W. Nattkemper, “Active learning for the classification of species in underwater images from a fixed observatory,” in Computer Vision and Pattern Recognition. IEEE, 2017.
- [30] A. A. Rajeev, S. Hiranwal, and V. K. Sharma, “Improved segmentation technique for underwater images based on k-means and local adaptive thresholding,” in Information and Communication Technology for Sustainable Development. Springer, 2018.
- [31] Z. Chen, Z. Zhang, Y. Bu, F. Dai, T. Fan, and H. Wang, “Underwater object segmentation based on optical features,” Sensors, 2018.
- [32] R. Liu, Z. Li, and J. Jia, “Image partial blur detection and classification,” in Computer Vision and Pattern Recognition. IEEE, 2008.
- [33] S. A. Golestaneh and L. J. Karam, “Spatially-varying blur detection based on multiscale fused and sorted transform coefficients of gradient magnitudes,” in Computer Vision and Pattern Recognition. IEEE, 2017.
- [34] E. Kalalembang, K. Usman, and I. P. Gunawan, “Dct-based local motion blur detection,” in Instrumentation, Communications, Information Technology, and Biomedical Engineering (ICICI-BME), 2009 International Conference on. IEEE, 2010.
- [35] J. Gast, A. Sellent, and S. Roth, “Parametric object motion from blur,” in Computer Vision and Pattern Recognition. IEEE, 2016.
- [36] H. T. Lin, Y.-W. Tai, and M. S. Brown, “Motion regularization for matting motion blurred objects,” Pattern Analysis and Machine Intelligence, 2011.
- [37] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in Computer Vision and Pattern Recognition. IEEE, 2016.
- [38] J. Sun, W. Cao, Z. Xu, and J. Ponce, “Learning a convolutional neural network for non-uniform motion blur removal,” in Computer Vision and Pattern Recognition. IEEE, 2015.
- [39] M. Fan, R. Huang, W. Feng, and J. Sun, “Image blur classification and blur usefulness assessment,” in Multimedia & Expo Workshops (ICMEW), 2017 IEEE International Conference on. IEEE, 2017.
- [40] J. Kannala and S. S. Brandt, “A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses,” Pattern Analysis and Machine Intelligence, 2006.
- [41] H. Fu and X. Cao, “Forgery authentication in extreme wide-angle lens using distortion cue and fake saliency map,” Transactions on Information Forensics and Security, 2012.
- [42] C. Hughes, P. Denny, M. Glavin, and E. Jones, “Equidistant fish-eye calibration and rectification by vanishing point extraction,” Pattern Analysis and Machine Intelligence, 2010.
- [43] J. Wei, C. F. Li, S. M. Hu, R. R. Martin, and C. L. Tai, “Fisheye video correction,” IEEE Transactions on Visualization and Computer Graphics, 2012.
- [44] X. Ying, Z. Hu, and H. Zha, “Fisheye lenses calibration using straight-line spherical perspective projection constraint,” in Asian Conference on Computer Vision. Springer, 2006.
- [45] X. Li, Y. Pi, Y. Jia, Y. Yang, Z. Chen, and W. Hou, “Fisheye image rectification using spherical and digital distortion models,” in MIPPR 2017: Multispectral Image Acquisition, Processing, and Analysis. International Society for Optics and Photonics, 2018.
- [46] O. Krams and N. Kiryati, “People detection in top-view fisheye imaging,” in Advanced Video and Signal Based Surveillance (AVSS), 2017 14th IEEE International Conference on. IEEE, 2017.
- [47] I. Baek, A. Davies, G. Yan et al., “Real-time detection, tracking, and classification of moving and stationary objects using multiple fisheye images,” arXiv preprint arXiv:1803.06077, 2018.
- [48] F. Liu, C. Shen, and G. Lin, “Deep convolutional neural fields for depth estimation from a single image,” in Computer Vision and Pattern Recognition. IEEE, 2015.
- [49] L. Dolin, G. Gilbert, I. Levin, and A. Luchinin, “Theory of imaging through wavy sea surface,” Russian Academy of Sciences, Institute of Applied Physics, Nizhniy Novgorod, 2006.
- [50] R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. Freeman, “Removing camera shake from a single photograph,” in ACM Transactions on Graphics (TOG). ACM, 2006.