UG2+ Track 2: A Collective Benchmark Effort for Evaluating and Advancing Image Understanding in Poor Visibility Environments


Ye Yuan , Wenhan Yang*, Wenqi Ren, Jiaying Liu, Walter J. Scheirer, and Zhangyang Wang
http://www.ug2challenge.org/
The first two authors contributed equally.
Abstract

The UG2+ challenge in IEEE CVPR 2019 aims to evoke a comprehensive discussion and exploration of how low-level vision techniques can benefit high-level automatic visual recognition in various scenarios. In its second track, we focus on object or face detection in poor visibility environments caused by bad weather (haze, rain) and low light conditions. While existing enhancement methods are empirically expected to help the high-level end task, that is observed to not always be the case in practice. To provide a more thorough examination and fair comparison, we introduce three benchmark sets collected in real-world hazy, rainy, and low-light conditions, respectively, with objects/faces annotated. To the best of our knowledge, this is the first and currently largest effort of its kind. Baseline results obtained by cascading existing enhancement and detection models are reported, indicating the highly challenging nature of our new data as well as the large room for further technical innovation. We expect large participation from the broad research community to address these challenges together.

1 Introduction

Background. Many emerging applications, such as unmanned aerial vehicles (UAVs), autonomous/assisted driving, search and rescue robots, environment monitoring, security surveillance, transportation, and inspection, hinge on computer vision-based sensing and understanding of outdoor environments [134]. Such systems concern a wide range of target tasks such as detection, recognition, segmentation, tracking, and parsing. However, the performance of visual sensing and understanding algorithms is largely jeopardized by various challenging conditions in unconstrained and dynamic degraded environments, e.g., moving platforms, bad weather, and poor illumination. These conditions can cause severe degradations of the visual input, such as reduced contrast, detail occlusion, abnormal illumination, faded surfaces, and color shift.

While most current vision systems are designed to perform in “clear” environments, i.e., where subjects are well observable without (significant) attenuation or alteration, a dependable vision system must reckon with the entire spectrum of complex unconstrained outdoor environments. Take autonomous driving as an example: industry players have been tackling the challenges posed by inclement weather; however, heavy rain, haze or snow will still obscure the vision of on-board cameras and create confusing reflections and glare, leaving state-of-the-art self-driving cars struggling (see a Forbes article). Another illustrative example can be found in city surveillance: even the commercial cameras adopted by governments appear fragile in challenging weather conditions (see a news article). Therefore, it is highly desirable to study to what extent, and in what sense, such challenging visual conditions can be coped with, toward the goal of achieving robust visual sensing and understanding in the wild, which would benefit security/safety, autonomous driving, robotics, and an even broader range of signal and image processing applications.

1.1 Challenges and Bottlenecks

Despite the blooming research on removing or alleviating the impacts of those challenges, such as dehazing [9, 90, 47, 33], deraining [15, 73, 58, 126, 11, 24, 22, 119], and illumination enhancement [54, 92, 117, 72, 97], current solutions still fall significantly short of addressing the above-mentioned pressing real-world challenges. A collective effort to identify and resolve the bottlenecks they commonly face has also been absent.

One primary challenge arises from the Data aspect. Those challenging visual conditions usually give rise to nonlinear and data-dependent degradations that are much more complicated than the well-studied noise or motion blur. The state-of-the-art deep learning methods are typically hungry for training data. The usage of synthetic training data has been prevailing, but may inevitably lead to domain shifts [70]. Fortunately, those degradations often follow some parameterized physical models a priori, which naturally motivates a combination of model-based and data-driven approaches. In addition to training, the lack of real-world test sets (and, consequently, the usage of potentially oversimplified synthetic sets) has limited the practical scope of the developed algorithms.

The other main challenge is found on the Goal side. Most restoration or enhancement methods cast the handling of those challenging conditions as a post-processing step of signal restoration or enhancement after sensing, and then feed the restored data to visual understanding. The performance of high-level visual understanding tasks will thus largely depend on the quality of restoration or enhancement. Yet it remains questionable whether restoration-based approaches actually boost the visual understanding performance, as the restoration/enhancement step is not optimized towards the target task and may also introduce misleading information and artifacts. For example, a recent line of research [47, 124, 61, 64, 65, 16, 98, 105, 100, 94, 83] discusses the intrinsic interplay between low-level vision and high-level recognition/detection tasks, showing that their goals are not always aligned.

1.2 Overview of UG2+ Track 2

UG2+ Challenge Track 2 aims to evaluate and advance the robustness of object detection algorithms in specific poor-visibility environmental situations, including challenging weather and lighting conditions. We structure Track 2 into three sub-challenges. Each sub-challenge features a different poor-visibility outdoor condition and a different training protocol (paired versus unpaired images, annotated versus unannotated, etc.). For each sub-challenge, we collect a new benchmark dataset captured in realistic poor-visibility environments, in which real image artifacts caused by rain, haze, or insufficient light are observed.

  • Sub-Challenge 2.1: (Semi-)Supervised Object Detection in the Haze. We provide 4,322 real-world hazy images collected from traffic surveillance, all labeled with object bounding boxes and categories (car, bus, bicycle, motorcycle, pedestrian), as the main training and/or validation sets. We also release another set of 4,807 unannotated real-world hazy images collected from the same sources (and containing the same classes of traffic objects, though not annotated), which may be used at the participants’ discretion. There will be a held-out test set of 3,000 real-world hazy images, with the same classes of objects annotated.

  • Sub-Challenge 2.2: (Semi-)Supervised Face Detection in the Low Light Condition. We provide 6,000 real-world low-light images captured during the nighttime, at teaching buildings, streets, bridges, overpasses, parks, etc., all labeled with human face bounding boxes, as the main training and/or validation sets. We also provide 10,400 unlabeled low-light images collected from the same setting. Additionally, we provide a unique set of 1,022 paired low-light/normal-light images captured in controllable real lighting conditions (but not necessarily containing faces), which can be used as part of the training data at the participants’ discretion. There will be a held-out test set of 4,000 low-light images, with human face bounding boxes annotated.

  • Sub-Challenge 2.3: Zero-Shot Object Detection with Raindrop Occlusions. We provide 1,010 pairs of raindrop images and corresponding clean ground truths (collected through physical simulations), as the training and/or validation sets. Different from Sub-Challenges 2.1 and 2.2, no semantic annotation will be available on the training/validation images. A held-out test set of 2,496 real-world raindrop images is collected from high-resolution driving videos, in diverse real locations and scenes during multiple drives. We label bounding boxes for selected traffic object categories: car, person, bus, bicycle, and motorcycle.

The ranking criterion will be the mean average precision (mAP) on each held-out test set, with the default Intersection-over-Union (IoU) threshold of 0.5. If the IoU of a detected region with an annotated object/face region is greater than 0.5, a score of 1 is assigned to the detected region, and 0 otherwise. When mAPs at IoU 0.5 are equal, the mAPs at higher IoU thresholds (0.6, 0.7, 0.8) will be compared sequentially.
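For concreteness, a minimal sketch of this matching rule is given below (in Python; the function names are ours and not part of any official evaluation kit): a detection counts as correct only when its IoU with an annotated box exceeds the threshold.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def detection_score(det_box, gt_box, thresh=0.5):
    # A detection scores 1 if its IoU with an annotated box exceeds the threshold, 0 otherwise.
    return 1 if iou(det_box, gt_box) > thresh else 0
```

Average precision is then computed per class from the resulting true/false positives and averaged across classes to obtain mAP.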

2 Related Work

2.1 Datasets

Most datasets used for image enhancement/processing mainly target evaluating the quantitative (PSNR, SSIM, etc.) or qualitative (subjective visual quality) differences of enhanced images w.r.t. the ground truths. Some early classical datasets include Set5 [6], Set14 [123], and LIVE [96]; they contain only small numbers of images with limited content. Subsequent datasets come with more diverse scene content, such as BSD500 [74] and Urban100 [36]. The popularity of deep learning methods has increased the demand for training and testing data. Therefore, many newer and larger datasets have been presented for image and video restoration, such as DIV2K [101] and MANGA109 [26] for image super-resolution, PolyU [115] and Darmstadt [82] for denoising, RawInDark [12] and the LOL dataset [113] for low-light enhancement, HazeRD [132], O-HAZE [3] and I-HAZE [2] for dehazing, Rain100L/H [119] and Rain800 [127] for rain streak removal, and RAINDROP [84] for raindrop removal. However, these datasets provide no integration with subsequent high-level tasks.
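As a point of reference, the full-reference metrics mentioned above can be computed with off-the-shelf tools such as scikit-image; the snippet below is a minimal sketch (assuming a recent scikit-image version) and is not tied to any particular dataset above.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def fidelity_scores(enhanced, ground_truth):
    """Full-reference quality of an enhanced image w.r.t. its clean ground truth.

    Both inputs are HxWx3 uint8 arrays; image loading is left to the caller.
    """
    psnr = peak_signal_noise_ratio(ground_truth, enhanced, data_range=255)
    ssim = structural_similarity(ground_truth, enhanced, channel_axis=-1, data_range=255)
    return psnr, ssim
```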

A few works [31, 95, 136] make preliminary attempts at event/action understanding, video summarization, or face recognition in unconstrained and potentially degraded environments. Several datasets have been collected by aerial vehicles, including the VIRAT Video Dataset [80] for event recognition, UAV123 [75] for UAV tracking, and a multi-purpose dataset [120]. In [76], an Unconstrained Face Detection Dataset (UFDD) is proposed for face detection in adverse conditions, including weather-based degradations, motion blur, focus blur and several others, containing a total of 6,425 images with 10,897 face annotations. However, few works specifically consider the impacts of image enhancement and object detection/recognition jointly. Prior to this UG2+ effort, a number of recent works have taken first steps. A large-scale hazy image dataset and comprehensive study, REalistic Single Image DEhazing (RESIDE) [50], including paired synthetic data and unpaired real data, was proposed to thoroughly examine visual reconstruction and visual recognition in hazy images. In [71], the Exclusively Dark (ExDARK) dataset is proposed, a collection of 7,363 images captured in very low-light environments with 12 object classes annotated at both the image level and with local object bounding boxes. In [56], the authors present a new large-scale deraining benchmark and a comprehensive study and evaluation of existing single image deraining algorithms, ranging from full-reference metrics, to no-reference metrics, to subjective evaluation and a novel task-driven evaluation. Those datasets and studies shed new light on the comparisons and limitations of state-of-the-art algorithms, and suggest promising future directions. In this work, we follow the footsteps of these predecessors to advance the field by proposing new benchmarks.

2.2 Poor Visibility Enhancement

There are numerous algorithms aiming to enhance the visibility of degraded imagery, such as image and video denoising/inpainting [109, 51, 86, 110, 66], deblurring [116, 91], super-resolution [112, 111, 62, 63] and interpolation [122]. Here we focus on dehazing, low-light enhancement, and deraining, which fall within the scope of UG2+ Track 2.

Dehazing. Early dehazing methods rely on the exploitation of natural image priors and depth statistics, e.g., locally constant constraints and decorrelation of the transmission [20], the dark channel prior [33], the color attenuation prior [135], and the non-local prior [5]. Lately, Convolutional Neural Network (CNN)-based methods have brought new momentum to dehazing. Several methods [9, 90] rely on various CNNs to learn the transmission fully from data. Beyond estimating the haze-related variables separately, successive works estimate them in a unified way. In [45, 79], the authors use a factorial Markov random field that integrates the estimation of transmission and atmospheric light. Some researchers focus on the more challenging night-time dehazing problem [57, 128]. The works in [133, 77] utilize Retinex theory to approximate the spectral properties of object surfaces by the ratio of the reflected light. AOD-Net [47, 48] re-formulates the haze generation model to realize one-step estimation of the inverse recovery and considers the joint interplay of dehazing and object detection. The idea is further applied to video dehazing by extending the model into a light-weight video dehazing framework [49]. In another recent work [88], a semantic prior is also injected to facilitate video dehazing.
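To make the classical prior-based pipeline concrete, the sketch below follows the dark channel prior of [33] under the standard scattering model I = J*t + A*(1 - t). It is only a rough sketch: it omits the guided-filter transmission refinement used in the original method, and the patch size and constants are commonly used defaults rather than values prescribed here.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(img, patch=15):
    # Per-pixel minimum over color channels, followed by a local minimum filter.
    return minimum_filter(img.min(axis=2), size=patch)

def dehaze_dcp(img, patch=15, omega=0.95, t0=0.1):
    """Haze removal following the scattering model I = J*t + A*(1 - t)."""
    img = img.astype(np.float64) / 255.0
    dark = dark_channel(img, patch)
    # Estimate atmospheric light A from the brightest 0.1% of dark-channel pixels.
    flat = dark.ravel()
    idx = np.argsort(flat)[-max(1, flat.size // 1000):]
    A = img.reshape(-1, 3)[idx].max(axis=0)
    # Transmission estimate from the dark channel of the normalized image.
    t = 1.0 - omega * dark_channel(img / A, patch)
    # Invert the scattering model, clamping the transmission to avoid amplification.
    J = (img - A) / np.maximum(t, t0)[..., None] + A
    return np.clip(J, 0, 1)
```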

Low Light Enhancement. Low-light enhancement methods can be categorized into three groups: hand-crafted methods, Retinex theory-based methods, and data-driven methods. Hand-crafted methods explore and apply various image priors to single image low-light enhancement, e.g., histogram equalization [81, 1]. Some methods [53, 130] regard the inverted low-light images as hazy images and enhance the visibility by applying dehazing. Retinex theory-based methods [46] treat the two signal components, reflectance and illumination, differently, so as to simultaneously suppress noise and preserve high-frequency details. Different decompositions [39, 40] and diverse priors [107, 25, 32, 23] are applied to realize better light adjustment and noise suppression. Li et al. [54] further extend the traditional Retinex model to a robust one with an explicit noise term, and make the first attempt to estimate a noise map from that model via an alternating direction minimization algorithm. A subsequent work [92] develops a fast sequential algorithm. Learning-based low-light image enhancement methods [117, 72, 97] have also been studied. In these works, the low-light images used for training are synthesized by applying random gamma transformations to natural normal-light images. Some recent works aim to build paired training data from real scenes. In [12], Chen et al. introduce the See-in-the-Dark (SID) dataset of short-exposure low-light raw images with corresponding long-exposure reference raw images. Cai et al. [10] build a dataset of under-/over-contrast and normal-contrast encoded image pairs, in which the reference normal-contrast images are generated by Multi-Exposure image Fusion (MEF) or High Dynamic Range (HDR) algorithms.
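As an illustration of the synthetic training protocol mentioned above, the sketch below darkens a normal-light image with a random gamma curve; the gamma range is an illustrative assumption, not the exact setting used in [117, 72, 97].

```python
import numpy as np

def synthesize_low_light(img, gamma_range=(2.0, 5.0), rng=None):
    """Darken a normal-light image with a random gamma curve (illustrative range).

    img: HxWx3 uint8 image; returns a uint8 pseudo low-light version.
    """
    rng = rng or np.random.default_rng()
    gamma = rng.uniform(*gamma_range)          # larger gamma -> darker output
    x = img.astype(np.float32) / 255.0
    return np.clip((x ** gamma) * 255.0, 0, 255).astype(np.uint8)
```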

Deraining. Single image deraining is a highly ill-posed problem. To address it, many models and priors have been used to perform signal separation and texture classification, including sparse coding [41], a generalized low-rank model [15], the non-local means filter [43], discriminative sparse coding [73], a Gaussian mixture model [58], a rain direction prior [126], and a transformed low-rank model [11]. The advent of deep learning has further promoted the development of single image deraining. In [24, 22], deep networks take the image detail layer as their input. Yang et al. [119] propose a deep joint rain detection and removal method to remove heavy rain streaks and accumulation. In [126], a novel density-aware multi-stream densely connected CNN is proposed for joint rain density estimation and removal. Video deraining can additionally make use of temporal context and motion information. Early works formulate rain streaks with flexible and intrinsic characteristics via rain modeling [29, 27, 30, 28, 131, 68, 4, 93, 8, 7, 15, 38]. Learning-based methods [13, 104, 103, 89, 55, 114, 44], with improved modeling capacity, bring new progress, and deep learning-based methods push the performance of video deraining to a new level. Chen et al. [14] integrate superpixel segmentation alignment, consistency among these segments, and a CNN-based detail compensation network into a unified framework. The work in [67] presents a recurrent network integrating rain degradation classification, deraining, and background reconstruction.
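To illustrate the detail-layer input used in [24, 22], the sketch below splits a rainy image into a smooth base layer and a high-frequency detail layer; the original works use an edge-preserving (guided) filter for the base layer, for which a Gaussian blur is substituted here purely to keep the example dependency-light.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def detail_layer(rainy, sigma=5.0):
    """Split a rainy HxWx3 uint8 image into base + detail layers.

    The detail layer, which carries most of the rain streaks and fine texture,
    is what the networks in [24, 22] take as input.
    """
    x = rainy.astype(np.float32) / 255.0
    base = gaussian_filter(x, sigma=(sigma, sigma, 0))   # smooth each channel spatially
    detail = x - base                                     # high-frequency residual
    return base, detail
```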

2.3 Visual Recognition under Adverse Conditions

A real-world visual detection/recognition system needs to handle a complex mixture of both low-quality and high-quality images. It is commonly observed that mild degradations, e.g., small noise or scaling with small factors, lead to almost no change in recognition performance. However, once the degradation level passes a certain threshold, there is a non-negligible or even very significant effect on system performance. In [102], Torralba et al. showed that there is a significant performance drop in object and scene recognition when the image resolution is reduced to 32×32 pixels. In [137], the boundary where face recognition performance is largely degraded is 16×16 pixels. Karahan et al. [42] found that the threshold standard deviation of Gaussian noise that causes a rapid decline ranges from 10 to 20. In [19], further impacts of contrast, brightness, sharpness, and out-of-focus blur on face recognition are analyzed.
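The threshold behavior described above can be probed with a simple sweep such as the following sketch; model_fn is a hypothetical classifier wrapper (not a specific system from the cited studies), and the noise levels echo the range discussed in [42].

```python
import numpy as np

def add_gaussian_noise(img, sigma):
    # img: HxWx3 uint8; sigma is on the same 0-255 intensity scale as in the cited studies.
    noisy = img.astype(np.float32) + np.random.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def accuracy_vs_noise(model_fn, images, labels, sigmas=(0, 5, 10, 15, 20, 30)):
    """Sweep noise levels and record the accuracy of a classifier.

    model_fn is a hypothetical callable returning one predicted label per image.
    """
    curve = {}
    for s in sigmas:
        preds = [model_fn(add_gaussian_noise(im, s)) for im in images]
        curve[s] = float(np.mean([p == y for p, y in zip(preds, labels)]))
    return curve
```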

In the era of deep learning, some methods [21, 118, 17] attempt to first enhance the input image and then forward the output to a classifier. However, this separate consideration of enhancement may not benefit the successive recognition task, because the first stage may incur artifacts that damage the second-stage recognition. In [137, 35], class-specific features are extracted as a prior and incorporated into the restoration model. In [124], Zhang et al. develop a joint image restoration and recognition method based on a sparse representation prior, which constrains the identity of the test image and guides better reconstruction and recognition. The work in [47] considers dehazing and object detection jointly. These joint optimization methods achieve better performance than earlier methods that treat the two stages separately. The works in [108, 61] examine the joint optimization pipeline for low-resolution recognition. The works in [65, 64] discuss the impact of denoising on semantic segmentation and advocate their mutual optimization. Lately, [106] thoroughly examines the algorithmic impact of enhancement algorithms on both visual quality and automatic object recognition, on a real image set with highly compound degradations. In our work, we take a further step to consider joint enhancement and detection in bad weather environments. Three large-scale datasets are collected to inspire new ideas and the development of novel methods in the related fields.

Figure 1: Sub-challenge 2.1: Basic statistics of the training/validation set (top row) and the held-out test set (bottom row). The first column shows the image size distribution (number of pixels per image), the second column the bounding box count distribution (number of bounding boxes per image), the third column the bounding box size distribution (number of pixels per bounding box), and the last column the ratio of bounding box size to frame size.
                      #images   #bounding boxes
training/validation     4,310            41,113
test (held-out)         2,987            24,201
Table 1: Sub-challenge 2.1: Image and object statistics of the training/validation, and the held-out test sets.
Categories          Car    Person    Bus    Bicycle    Motorcycle
RTTS             25,317    11,366  2,590        698         1,232
test (held-out)  18,074     1,562    536        225         3,804
Table 2: Sub-challenge 2.1: Class statistics of the training/validation, and the held-out test sets.

3 Introduction of the UG2+ Track 2 Datasets

3.1 (Semi-)Supervised Object Detection in the Haze

Figure 2: Sub-challenge 2.1: Examples of images in training/validation set (i.e., RESIDE RTTS [50]).
Figure 3: Sub-challenge 2.1: Examples of images in the held-out test set.

In Sub-challenge 2.1, we use the 4,322 annotated real-world hazy images of the RESIDE RTTS set [50] as the training and/or validation sets (the split is up to the participants). Five categories of objects (car, bus, bicycle, motorcycle, pedestrian) are labeled with tight bounding boxes. We provide another 4,807 unannotated real-world hazy images collected from the same traffic camera sources, for possible use in semi-supervised training.

The participants can optionally use pre-trained models (e.g., on ImageNet or COCO) or external data. However, if any pre-trained model, self-synthesized data, or self-collected data is used, that must be explicitly mentioned in their submissions, and the participants must ensure that all data they use is publicly available at the time of challenge submission, for reproducibility purposes.

There will be a held-out test set of 2,987 real-world hazy images, collected from the same sources, with the same classes of objects annotated. Fig. 1 shows the basic statistics of the RTTS set and the held-out test set. The held-out test set has distributions of the number of bounding boxes per image, bounding box size, and relative scale of bounding boxes to image size similar to those of the RTTS set, but has relatively larger image sizes. Samples from the RTTS set and the held-out set can be found in Fig. 2 and Fig. 3.

3.2 (Semi-)Supervised Face Detection in the Low Light Condition

Figure 4: Sub-challenge 2.2: Examples of images in DARK FACE collections.

In Sub-challenge 2.2, we use our self-curated DARK FACE dataset. It is composed of 10,000 images (6,000 for training and validation, and 4,000 for testing) taken in under-exposed conditions, with human faces manually annotated with bounding boxes, and 9,000 images taken with the same equipment in similar environments without human annotations. Additionally, we provide a unique set of 789 paired low-light/normal-light images captured in controllable real lighting conditions (but not necessarily containing faces), which can optionally be used as part of the training data.

The training and evaluation set includes 43,849 annotated faces and the held-out test set includes 32,571 annotated faces. Table 3 presents a summary of the dataset and Fig. 4 presents example images.

Figure 5: Sub-challenge 2.2: DARK FACE has a high degree of variability in scale, pose, occlusion, appearance and illumination.
Dataset     Training            Testing
            #Image    #Face     #Image    #Face
ExDark         400        -        209        -
UFDD             -        -        612        -
DarkFace     6,000   43,849      4,000   32,571
Table 3: Sub-challenge 2.2: Comparison of low-light image understanding datasets.

Collection and annotation. This collection consists of images recorded with digital single-lens reflex cameras, specifically Sony E-mount cameras with different capturing parameters, on several busy streets around Beijing, where faces of various scales and poses are captured. The images in this collection are open-source content tagged with a Creative Commons license. The resolution of these images is 1080×720 (down-sampled from 6K×4K). After filtering out images without sufficient information (lacking faces, too dark to see anything, etc.), we select 10,000 images for human annotation. Bounding boxes are labeled for all recognizable faces in our collection. We make the bounding boxes tight around the forehead, chin, and cheeks, using the LabelImg toolbox (https://github.com/tzutalin/labelImg). If a face is occluded, we only label the exposed skin region; if most of a face is occluded, we ignore it. In this collection, we observe other commonly seen degradations in addition to under-exposure, such as intensive noise. Each annotated image contains 1-34 human faces, with face resolutions ranging from 1×2 to 335×296. The face number and face resolution distributions are displayed in Fig. 6; the resolution of most faces in our dataset is below 300 pixels, and most images contain only a few faces.

(a) FN in Train
(b) FN in Test
(c) FR in Train
(d) FR in Test
Figure 6: Sub-challenge 2.2: Face resolution (FR) and face number (FN) distributions in the DARK FACE collection. Image number denotes the number of images belonging to a certain category. Face number denotes the total number of faces belonging to a certain category.

3.3 Zero-Shot Object Detection with Raindrop Occlusions

Categories     Car    Person    Bus    Bicycle    Motorcycle
test set     7,332     1,135    613        268           968
Table 4: Sub-challenge 2.3: Object statistics in the held-out test set.
Figure 7: Sub-challenge 2.3: Example images from the held-out test set.

In Sub-challenge 2.3, we release 1,010 pairs of realistic raindrop images and corresponding clean ground truths, collected through the physical simulation process described in [84], as the training and/or validation sets. Our held-out test set contains 2,495 real rainy images from high-resolution driving videos. As shown in Figure 7, all images are contaminated by raindrops on the camera lens. They were captured in diverse real traffic locations and scenes during multiple drives. We label bounding boxes for selected traffic objects that commonly appear on the roads in these images: car, person, bus, bicycle, and motorcycle. Most images are of 1920×990 resolution, with a few exceptions of 4023×3024 resolution.

The participants can optionally use pre-trained models (e.g., on ImageNet or COCO) or external data. However, if any pre-trained model, self-synthesized data, or self-collected data is used, that must be explicitly mentioned in their submissions, and the participants must ensure that the data they use is publicly available at the time of challenge submission, for reproducibility purposes.

4 Baseline Results and Analysis

For all three sub-challenges, we report results obtained by cascading off-the-shelf enhancement methods and popular pre-trained detectors. No joint training was performed, hence the baseline numbers are by no means competitive. We expect to see substantial performance boosts over these baselines from the competition participants.
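The baseline protocol is a plain cascade, sketched below; enhance_fn and detect_fn are hypothetical wrappers around an off-the-shelf enhancement model (e.g., a dehazer, low-light enhancer, or deraining network) and a COCO/VOC-pretrained detector, with no joint fine-tuning of either stage.

```python
def cascade_baseline(images, enhance_fn, detect_fn):
    """Enhance first, then detect; no joint training of the two stages.

    enhance_fn: image -> restored image (hypothetical wrapper around an enhancer).
    detect_fn: image -> list of (class, score, bounding box) from a pretrained detector.
    """
    results = []
    for img in images:
        restored = enhance_fn(img)        # e.g., dehazing / low-light enhancement / deraining
        detections = detect_fn(restored)  # detector is applied as-is, without adaptation
        results.append(detections)
    return results
```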

4.1 Sub-challenge 2.1 Baseline Results

4.1.1 Baseline Composition

We test four state-of-the-art object detectors: (1) Mask R-CNN (https://github.com/matterport/Mask_RCNN) [34]; (2) RetinaNet (https://github.com/fizyr/keras-retinanet) [60]; (3) YOLO-V3 (https://github.com/ayooshkathuria/pytorch-yolo-v3) [85]; and (4) the Feature Pyramid Network (FPN, https://github.com/DetectionTeamUCAS/FPN_Tensorflow) [59].

We also try three state-of-the-art dehazing approaches: (a) AOD-Net (https://github.com/Boyiliee/AOD-Net) [47]; (b) the Multi-Scale Convolutional Neural Network (MSCNN, https://github.com/rwenqi/Multi-scale-CNN-Dehazing) [90]; and (c) the Densely Connected Pyramid Dehazing Network (DCPDN, https://github.com/hezhangsprinter/DCPDN) [125]. All dehazing models adopt the officially released versions.

Figure 8: Examples of object detection on hazy images of the RESIDE RTTS set. The top row displays ground-truth bounding boxes; the bottom row displays bounding boxes detected by the pretrained Mask R-CNN.
Figure 9: Examples of object detection on hazy and dehazed images of the RESIDE RTTS set. The first row displays the ground-truth bounding boxes on hazy images, the second row displays bounding boxes detected on hazy images by the pretrained Mask R-CNN, and the remaining rows display Mask R-CNN detections on images dehazed by AOD-Net, MSCNN, and DCPDN, respectively.
mAP (%)                          hazy   AOD-Net [47]   DCPDN [125]   MSCNN [90]
validation
  RetinaNet [60]   Person       55.85          54.93         56.70        58.07
                   Car          41.19          37.61         42.68        42.77
                   Bicycle      39.61          37.80         36.96        38.16
                   Motorcycle   27.37          23.31         29.18        29.01
                   Bus          16.88          15.70         16.34        18.34
                   mAP          36.18          33.87         36.37        37.27
  Mask R-CNN [34]  Person       67.52          66.71         67.18        69.23
                   Car          48.93          47.76         52.37        51.93
                   Bicycle      40.81          39.66         40.40        40.42
                   Motorcycle   33.78          26.71         34.58        31.38
                   Bus          18.11          16.91         18.25        18.42
                   mAP          41.83          39.55         42.56        42.28
  YOLO-V3 [85]     Person       60.81          60.21         60.42        61.56
                   Car          47.84          47.32         48.17        49.75
                   Bicycle      41.03          42.22         40.18        42.01
                   Motorcycle   39.29          37.55         38.17        41.11
                   Bus          23.71          20.91         23.35        23.15
                   mAP          42.54          41.64         42.06        43.52
  FPN [59]         Person       51.85          52.35         51.04        54.50
                   Car          37.48          36.05         37.19        38.88
                   Bicycle      35.31          35.93         32.57        37.01
                   Motorcycle   23.65          21.07         22.97        23.86
                   Bus          12.95          13.68         12.07        15.83
                   mAP          32.25          31.82         31.17        34.02
test
  RetinaNet        Person       17.64          18.23         16.65        19.34
                   Car          31.41          29.30         33.31        32.97
                   Bicycle       0.42           0.84          0.38         0.75
                   Motorcycle    1.69           1.37          1.93         2.03
                   Bus          12.77          13.70         12.07        15.82
                   mAP          12.79          12.69         12.87        14.18
  Mask R-CNN       Person       25.60          26.63         24.59        27.94
                   Car          39.31          39.71         42.76        42.57
                   Bicycle       0.64           0.52          0.22         0.37
                   Motorcycle    3.37           2.81          2.83         2.99
                   Bus          15.66          15.41         16.69        16.55
                   mAP          16.92          17.02         17.42        18.09
  YOLO-V3          Person       20.64          21.41         21.42        22.11
                   Car          34.68          33.90         34.52        35.93
                   Bicycle       0.50           0.38          0.98         0.57
                   Motorcycle    4.26           4.10          4.72         5.27
                   Bus          13.55          14.35         13.75        15.04
                   mAP          14.69          14.83         15.08        15.78
  FPN              Person       12.65          12.57         11.13        14.19
                   Car          30.54          31.24         27.81        32.68
                   Bicycle       1.91           0.39          1.12         0.97
                   Motorcycle    2.25           1.70          1.96         1.89
                   Bus           6.08           7.93          7.39         8.31
                   mAP          10.69          10.77          9.88        11.61
  • RetinaNet, Mask R-CNN and YOLO-V3 are pretrained on the Microsoft COCO dataset.

  • FPN, with a ResNet-101 backbone, is pretrained on the PASCAL Visual Object Classes (VOC) dataset.

Table 5: Detection results (mAP) on the RTTS (train/validation dataset) and held-out test sets.

4.1.2 Results and Analysis

Fig. 8 shows the object detection performance on the original hazy images of the RESIDE RTTS set using Mask R-CNN. The detector is pretrained on Microsoft COCO, a large-scale object detection, segmentation, and captioning dataset. More detailed detection performance on the five object categories can be found in Table 5.

The results show that without preprocessing or dehazing, the object detectors pretrained on clean images fail to detect a large number of objects in the hazy images. The overall detection performance reaches an mAP of only 41.83% using Mask R-CNN and 42.54% using YOLO-V3. Among the five object categories, person has the highest detection AP, while bus has the lowest.

We also compare the validation and test set performance in Table 5. One possible reason for the performance gap between the validation and test sets is that the bounding boxes of the latter are smaller than those of the former, as shown in Fig. 1 and visualized in Fig. 9.

Effect of Dehazing

We further evaluate the current state-of-the-art dehazing approaches on the hazy dataset, with the pre-trained detectors subsequently applied without tuning or adaptation. Fig. 9 shows two examples where dehazing algorithms improve not only the visual quality of the images but also the detection accuracy. More detection results are included in Table 5. Detection mAPs on images dehazed with DCPDN and MSCNN are about 1% higher on average than those on the hazy images.

Finally, the choice of pre-trained detector also matters here: Mask R-CNN and YOLO-V3 outperform the other detectors on both the validation and test sets, both before and after applying dehazing.

4.2 Sub-challenge 2.2 Baseline Results

4.2.1 Baseline Composition

We test five state-of-the-art deep face detectors: (1) the Dual Shot Face Detector (DSFD) [52] (https://github.com/TencentYoutuResearch/FaceDetection-DSFD); (2) PyramidBox [99] (https://github.com/EricZgw/PyramidBox); (3) the Single Shot Scale-Invariant Face Detector (SFD) [129] (https://github.com/sfzhang15/SFD); (4) the Single Stage Headless Face Detector (SSH) [78] (https://github.com/mahyarnajibi/SSH.git); and (5) Faster R-CNN [37] (https://github.com/playerkk/face-py-faster-rcnn).

We also include seven state-of-the-art algorithms for light/contrast enhancement: (a) Bio-Inspired Multi-Exposure Fusion (BIMEF) [121] (https://github.com/baidut/BIMEF); (b) Dong [18]; (c) Low-light IMage Enhancement (LIME) [32] (https://sites.google.com/view/xjguo/lime); (d) MF [25]; (e) Multi-Scale Retinex (MSR) [40]; (f) Joint Enhancement and Denoising (JED) [92] (https://github.com/tonghelen/JED-Method); and (g) RetinexNet [113] (https://github.com/weichen582/RetinexNet).

4.2.2 Results and Analysis

Fig. 12 (a) depicts the precision-recall curves of the original face detection methods, without enhancement. The baseline methods are trained on WIDER FACE, a large dataset with large scale variations across diverse factors and conditions. The results demonstrate that without proper pre-processing or adaptation, the state-of-the-art methods cannot achieve desirable detection rates on DARK FACE. Result examples are illustrated in Fig. 10. The evidence implies that previous face datasets, though covering variations in pose, appearance, scale, etc., are still insufficient to capture facial features in highly under-exposed conditions.

Effect of Enhancement

We next use the enhancement algorithms to pre-process the annotated dataset and then apply the two pre-trained face detectors above (DSFD and PyramidBox) to the processed data. With the visual quality of the enhanced images improved, the detectors, as expected, perform better. As shown in Fig. 12 (b) and (c), in most instances the precision of the detectors notably increases compared to that on the data without enhancement. Most existing enhancement methods result in similar improvements here, except for JED, which leads to a performance drop. Although encouraging, the overall performance of the detectors still lags far behind that on normal-light datasets. The simple cascade of low-light enhancement and face detectors leaves much room for improvement.

Effect of Face Scale and Light Condition

We analyze the performance of the face detectors on subsets of different difficulty levels. We define the difficulty of the subsets based on two criteria: face scale and facial light condition. Face scale is divided into three levels based on the average size of the bounding boxes in an image: small faces (below 100 pixels), medium faces (100-300 pixels), and large faces (above 300 pixels). Facial illumination is also divided into three levels based on the average pixel value within the bounding boxes: low, medium, and high illumination. We present the results in Figs. 13 and 14. Clearly, the performance degrades for small faces and for those with low illumination. DSFD achieves the best performance, with average precision rates greater than 45% but still lower than 55%. The results suggest that current face detectors are limited when face scale and light conditions change.
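A sketch of how such difficulty subsets can be assigned is given below; the scale cut-offs mirror the 100/300-pixel levels above, while the illumination cut-offs are illustrative placeholders, since the exact values are not specified here.

```python
import numpy as np

def face_difficulty(image, boxes):
    """Assign scale and illumination difficulty levels to one annotated image.

    image: HxWx3 uint8 low-light image.
    boxes: list of integer (x1, y1, x2, y2) face annotations.
    'Size' is taken here as the longer box side; the 30/80 illumination cut-offs
    are hypothetical thresholds for illustration only.
    """
    sizes = [max(x2 - x1, y2 - y1) for x1, y1, x2, y2 in boxes]
    mean_size = float(np.mean(sizes))
    scale = "small" if mean_size < 100 else ("medium" if mean_size <= 300 else "large")

    mean_lum = float(np.mean([image[y1:y2, x1:x2].mean() for x1, y1, x2, y2 in boxes]))
    light = "low" if mean_lum < 30 else ("medium" if mean_lum < 80 else "high")
    return scale, light
```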

4.3 Sub-challenge 2.3 Baseline Results

4.3.1 Baseline Composition

We use four state-of-the-art object detection models: (1) Faster R-CNN (FRCNN) [87]; (2) YOLO-V3 [85]; (3) SSD-512 [69]; and (4) RetinaNet [60].

We employ five state-of-the-art deep learning-based deraining algorithms: (a) JOint Rain DEtection and Removal (JORDER) [119] (http://www.icst.pku.edu.cn/struct/Projects/joint_rain_removal.html); (b) the Deep Detail Network (DDN) [24] (https://github.com/XMU-smartdsp/Removing_Rain); (c) a Conditional Generative Adversarial Network (CGAN) [127] (https://github.com/TrinhQuocNguyen/Edited_Original_IDCGAN); (d) the Density-aware Image De-raining method using a Multistream Dense Network (DID-MDN) [126] (https://github.com/hezhangsprinter/DID-MDN); and (e) DeRaindrop [84] (https://github.com/rui1996/DeRaindrop). For fair comparison, we re-trained all deraining algorithms using the same provided training set.

4.3.2 Results and Analysis

Table 6 compares the mAP results of different deraining algorithms combined with different detection models on the held-out test set. Unfortunately, we find that almost all existing deraining algorithms degrade the object detection performance compared to directly using the rainy images for YOLO-V3, SSD-512, and RetinaNet (the only exception being the detection results of FRCNN). This could be because the deraining algorithms were not trained towards the end goal of object detection and thus do not necessarily help it; moreover, the deraining process itself may discard discriminative, semantically meaningful information and thereby hamper detection performance. In addition, Table 6 shows that YOLO-V3 achieves the best detection performance, independently of the deraining algorithm applied. We attribute this to the many small objects at relatively long distances from the camera in the test set, since YOLO-V3 is known to improve small object detection through its multi-scale prediction structure.

                 Rainy   JORDER [119]   DDN [24]   CGAN [127]   DID-MDN [126]   DeRaindrop [84]
FRCNN [87]       16.52          16.97      18.36        23.42           16.11             15.58
YOLO-V3 [85]     27.84          26.72      26.20        23.75           24.62             24.96
SSD-512 [69]     17.71          17.06      16.93        16.71           16.70             16.69
RetinaNet [60]   23.92          21.71      21.60        19.28           20.08             19.73
Table 6: Detection results (mAP) on the held-out test set.
Figure 10: Sample face detection results of pretrained baseline on the original images of the proposed DARK FACE dataset.
Figure 11: Sample face detection results of pretrained baseline on the enhanced images of the proposed DARK FACE dataset.
(a) Original
(b) DSFD
(c) PyramidBox
Figure 12: Evaluation results of pretrained baseline on original and enhanced images of the proposed DARK FACE dataset.
(a) Small face
(b) Medium face
(c) Large face
Figure 13: Comparison of detection accuracies for different face scales for DARK FACE.
(a) Low illumination of face
(b) Medium illumination of face
(c) High illumination of face
Figure 14: Comparison of detection accuracies for different face brightness for DARK FACE.

References

  • [1] M. Abdullah-Al-Wadud, M. H. Kabir, M. A. A. Dewan and O. Chae (2007-05) A dynamic histogram equalization for image contrast enhancement. IEEE Transactions on Consumer Electronics 53 (2), pp. 593–600. External Links: Document, ISSN 0098-3063 Cited by: §2.2.
  • [2] C. O. Ancuti, C. Ancuti, R. Timofte and C. De Vleeschouwer (2018-04) I-HAZE: a dehazing benchmark with real hazy and haze-free indoor images. arXiv e-prints, pp. arXiv:1804.05091. External Links: 1804.05091 Cited by: §2.1.
  • [3] C. O. Ancuti, C. Ancuti, R. Timofte and C. De Vleeschouwer (2018-04) O-HAZE: a dehazing benchmark with real hazy and haze-free outdoor images. arXiv e-prints, pp. arXiv:1804.05101. External Links: 1804.05101 Cited by: §2.1.
  • [4] P. C. Barnum, S. Narasimhan and T. Kanade (2010) Analysis of rain and snow in frequency space. International Journal of Computer Vision 86 (2-3), pp. 256–274. Cited by: §2.2.
  • [5] D. Berman, T. Treibitz and S. Avidan (2016-06) Non-local image dehazing. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, Vol. , pp. 1674–1682. External Links: Document, ISSN 1063-6919 Cited by: §2.2.
  • [6] M. Bevilacqua, A. Roumy, C. Guillemot and M. A. Morel (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proc. of the British Machine Vision Conf., pp. 135.1–135.10. External Links: ISBN 1-901725-46-4, Document Cited by: §2.1.
  • [7] J. Bossu, N. Hautière and J. Tarel (2011) Rain or snow detection in image sequences through use of a histogram of orientation of streaks. Int’l Journal of Computer Vision 93 (3), pp. 348–367. Cited by: §2.2.
  • [8] N. Brewer and N. Liu (2008) Using the shape characteristics of rain to identify and remove rain from video. In Joint IAPR International Workshops on SPR and SSPR, pp. 451–458. Cited by: §2.2.
  • [9] B. Cai, X. Xu, K. Jia, C. Qing and D. Tao (2016-11) DehazeNet: an end-to-end system for single image haze removal. IEEE Trans. on Image Processing 25 (11), pp. 5187–5198. External Links: Document, ISSN 1057-7149 Cited by: §1.1, §2.2.
  • [10] J. Cai, S. Gu and L. Zhang (2018-04) Learning a deep single image contrast enhancer from multi-exposure images. IEEE Trans. on Image Processing 27 (4), pp. 2049–2062. External Links: Document, ISSN 1057-7149 Cited by: §2.2.
  • [11] Y. Chang, L. Yan and S. Zhong (2017-10) Transformed low-rank model for line pattern noise removal. In Proc. IEEE Int’l Conf. Computer Vision, Cited by: §1.1, §2.2.
  • [12] C. Chen, Q. Chen, J. Xu and V. Koltun (2018-06) Learning to see in the dark. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, Vol. , pp. 3291–3300. External Links: Document, ISSN 2575-7075 Cited by: §2.1, §2.2.
  • [13] J. Chen and L. P. Chau (2014-03) A rain pixel recovery algorithm for videos with highly dynamic scenes. IEEE Trans. on Image Processing 23 (3), pp. 1097–1104. External Links: Document, ISSN 1057-7149 Cited by: §2.2.
  • [14] J. Chen, C. Tan, J. Hou, L. Chau and H. Li (2018-06) Robust video content alignment and compensation for rain removal in a cnn framework. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, Cited by: §2.2.
  • [15] Y. Chen and C. Hsu (2013) A generalized low-rank appearance model for spatio-temporally correlated rain streaks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1968–1975. Cited by: §1.1, §2.2.
  • [16] B. Cheng, Z. Wang, Z. Zhang, Z. Li, D. Liu, J. Yang, S. Huang and T. S. Huang (2017) Robust emotion recognition from low quality and low bit rate video: a deep learning approach. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 65–70. Cited by: §1.1.
  • [17] C. Dong, Y. Deng, C. C. Loy and X. Tang (2015-12) Compression artifacts reduction by a deep convolutional network. In Proc. IEEE Int’l Conf. Computer Vision, Vol. , pp. 576–584. External Links: Document, ISSN 2380-7504 Cited by: §2.3.
  • [18] X. Dong, G. Wang, Y. Pang, W. Li, J. Wen, W. Meng and Y. Lu (2011) Fast efficient algorithm for enhancement of low lighting video. In Proc. IEEE Int’l Conf. Multimedia and Expo, pp. 1–6. Cited by: §4.2.1.
  • [19] A. Dutta, R. Veldhuis and L. Spreeuwers (2012-05) The impact of image quality on the performance of face recognition. In Symposium on Information Theory in the Benelux and Joint WIC/IEEE Symposium on Information Theory and Signal Processing in the Benelux, Netherlands, pp. 141–148 (English). External Links: ISBN 978-90-365-3383-6 Cited by: §2.3.
  • [20] R. Fattal (2008-08) Single image dehazing. ACM Trans. Graph. 27 (3), pp. 72:1–72:9. External Links: ISSN 0730-0301, Link, Document Cited by: §2.2.
  • [21] R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis and W. T. Freeman (2006) Removing camera shake from a single photograph. In ACM Trans. Graphics, pp. 787–794. Cited by: §2.3.
  • [22] X. Fu, J. Huang, X. Ding, Y. Liao and J. Paisley (2017-06) Clearing the skies: a deep network architecture for single-image rain removal. IEEE Trans. on Image Processing 26 (6), pp. 2944–2956. External Links: Document, ISSN 1057-7149 Cited by: §1.1, §2.2.
  • [23] X. Fu, D. Zeng, Y. Huang, X. P. Zhang and X. Ding (2016-06) A weighted variational model for simultaneous reflectance and illumination estimation. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, Vol. , pp. 2782–2790. External Links: Document, ISSN Cited by: §2.2.
  • [24] X. Fu, J. Huang, D. Zeng, Y. Huang, X. Ding and J. Paisley (2017-07) Removing rain from single images via a deep detail network. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, Cited by: §1.1, §2.2, §4.3.1, Table 6.
  • [25] X. Fu, D. Zeng, Y. Huang, Y. Liao, X. Ding and J. Paisley (2016) A fusion-based enhancing method for weakly illuminated images. Signal Processing 129, pp. 82 – 96. External Links: ISSN 0165-1684, Document, Link Cited by: §2.2, §4.2.1.
  • [26] A. Fujimoto, T. Ogawa, K. Yamamoto, Y. Matsui, T. Yamasaki and K. Aizawa (2016) Manga109 dataset and creation of metadata. In Proc. of Int’l Workshop on coMics ANalysis, Processing and Understanding, pp. 1–5. Cited by: §2.1.
  • [27] K. Garg and S. K. Nayar (2004) Detection and removal of rain from videos. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, Vol. 1, pp. I–528. Cited by: §2.2.
  • [28] K. Garg and S. K. Nayar (2005) When does a camera see rain?. In Proc. IEEE Int’l Conf. Computer Vision, Vol. 2, pp. 1067–1074. Cited by: §2.2.
  • [29] K. Garg and S. K. Nayar (2006) Photorealistic rendering of rain streaks. In ACM Trans. Graphics, Vol. 25, pp. 996–1002. Cited by: §2.2.
  • [30] K. Garg and S. K. Nayar (2007) Vision and rain. Int’l Journal of Computer Vision 75 (1), pp. 3–27. Cited by: §2.2.
  • [31] M. Grgic, K. Delac and S. Grgic (2011-02) SCface — surveillance cameras face database. Multimedia Tools Appl. 51 (3), pp. 863–879. External Links: ISSN 1380-7501 Cited by: §2.1.
  • [32] X. Guo, Y. Li and H. Ling (2017-02) LIME: low-light image enhancement via illumination map estimation. IEEE Trans. on Image Processing 26 (2), pp. 982–993. External Links: Document, ISSN 1057-7149 Cited by: §2.2, §4.2.1.
  • [33] K. He, J. Sun and X. Tang (2011-12) Single image haze removal using dark channel prior. IEEE Trans. on Pattern Analysis and Machine Intelligence 33 (12), pp. 2341–2353. External Links: Document, ISSN 0162-8828 Cited by: §1.1, §2.2.
  • [34] K. He, G. Gkioxari, P. Dollár and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §4.1.1, Table 5.
  • [35] P. H. Hennings-Yeomans, S. Baker and B. V. K. V. Kumar (2008-06) Simultaneous super-resolution and feature extraction for recognition of low-resolution faces. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, Vol. , pp. 1–8. External Links: Document, ISSN 1063-6919 Cited by: §2.3.
  • [36] J. Huang, A. Singh and N. Ahuja (2015-06) Single image super-resolution from transformed self-exemplars. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, Vol. , pp. 5197–5206. External Links: Document, ISSN 1063-6919 Cited by: §2.1.
  • [37] H. Jiang and E. G. Learned-Miller (2017) Face detection with the faster r-cnn. IEEE Int’l Conf. on Automatic Face and Gesture Recognition, pp. 650–657. Cited by: §4.2.1.
  • [38] T. Jiang, T. Huang, X. Zhao, L. Deng and Y. Wang (2017-07) A novel tensor-based video rain streaks removal approach via utilizing discriminatively intrinsic priors. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, Cited by: §2.2.
  • [39] D. J. Jobson, Z. Rahman and G. A. Woodell (1997-03) Properties and performance of a center/surround retinex. IEEE Trans. on Image Processing 6 (3), pp. 451–462. External Links: Document, ISSN 1057-7149 Cited by: §2.2.
  • [40] D. J. Jobson, Z. Rahman and G. A. Woodell (1997-07) A multiscale retinex for bridging the gap between color images and the human observation of scenes. IEEE Trans. on Image Processing 6 (7), pp. 965–976. External Links: Document, ISSN 1057-7149 Cited by: §2.2, §4.2.1.
  • [41] L. W. Kang, C. W. Lin and Y. H. Fu (2012-04) Automatic single-image-based rain streaks removal via image decomposition. IEEE Trans. on Image Processing 21 (4), pp. 1742–1755. External Links: Document, ISSN 1057-7149 Cited by: §2.2.
  • [42] S. Karahan, M. Kilinc Yildirum, K. Kirtac, F. S. Rende, G. Butun and H. K. Ekenel (2016-Sep.) How image degradations affect deep cnn-based face recognition?. In Int’l Conf. of the Biometrics Special Interest Group, Vol. , pp. 1–5. External Links: Document, ISSN Cited by: §2.3.
  • [43] J. H. Kim, C. Lee, J. Y. Sim and C. S. Kim (2013-Sept) Single-image deraining using an adaptive nonlocal means filter. In Proc. IEEE Int’l Conf. Image Processing, pp. 914–917. External Links: Document, ISSN 1522-4880 Cited by: §2.2.
  • [44] J. H. Kim, J. Y. Sim and C. S. Kim (2015-Sept) Video deraining and desnowing using temporal correlation and low-rank matrix completion. IEEE Trans. on Image Processing 24 (9), pp. 2658–2670. External Links: Document, ISSN 1057-7149 Cited by: §2.2.
  • [45] L. Kratz and K. Nishino (2009-Sep.) Factorizing scene albedo and depth from a single foggy image. In Proc. IEEE Int’l Conf. Computer Vision, Vol. , pp. 1701–1708. External Links: Document, ISSN 2380-7504 Cited by: §2.2.
  • [46] E. H. Land (1977) The retinex theory of color vision. Sci. Amer, pp. 108–128. Cited by: §2.2.
  • [47] B. Li, X. Peng, Z. Wang, J. Xu and D. Feng (2017-10) AOD-net: all-in-one dehazing network. In Proc. IEEE Int’l Conf. Computer Vision, Vol. , pp. 4780–4788. External Links: Document, ISSN 2380-7504 Cited by: §1.1, §1.1, §2.2, §2.3, §4.1.1, Table 5.
  • [48] B. Li, X. Peng, Z. Wang, J. Xu and D. Feng (2017) An all-in-one network for dehazing and beyond. arXiv preprint arXiv:1707.06543. Cited by: §2.2.
  • [49] B. Li, X. Peng, Z. Wang, J. Xu and D. Feng (2018-Feb.) End-to-end united video dehazing and detection. In AAAI, Cited by: §2.2.
  • [50] B. Li, W. Ren, D. Fu, D. Tao, D. Feng, W. Zeng and Z. Wang (2019) Benchmarking single-image dehazing and beyond. IEEE Trans. on Image Processing 28 (1), pp. 492–505. Cited by: §2.1, Figure 2, §3.1.
  • [51] H. Li, Z. Lu, Z. Wang, Q. Ling and W. Li (2013) Detection of blotch and scratch in video based on video decomposition. IEEE Transactions on Circuits and Systems for Video Technology 23 (11), pp. 1887–1900. Cited by: §2.2.
  • [52] J. Li, Y. Wang, C. Wang, Y. Tai, J. Qian, J. Yang, C. Wang, J. Li and F. Huang (2019) DSFD: dual shot face detector. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, Cited by: §4.2.1.
  • [53] L. Li, R. Wang, W. Wang and W. Gao (2015-Sept) A low-light image enhancement method for both denoising and contrast enlarging. In Proc. IEEE Int’l Conf. Image Processing, Vol. , pp. 3730–3734. External Links: Document, ISSN Cited by: §2.2.
  • [54] M. Li, J. Liu, W. Yang, X. Sun and Z. Guo (2018-06) Structure-revealing low-light image enhancement via robust retinex model. IEEE Trans. on Image Processing 27 (6), pp. 2828–2841. External Links: Document, ISSN 1057-7149 Cited by: §1.1, §2.2.
  • [55] M. Li, Q. Xie, Q. Zhao, W. Wei, S. Gu, J. Tao and D. Meng (2018-06) Video rain streak removal by multiscale convolutional sparse coding. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, Cited by: §2.2.
  • [56] S. Li, I. B. Araujo, W. Ren, Z. Wang, E. K. Tokuda, R. H. Junior, R. Cesar-Junior, J. Zhang, X. Guo and X. Cao (2019) Single image deraining: a comprehensive benchmark analysis. arXiv preprint arXiv:1903.08558. Cited by: §2.1.
  • [57] Y. Li, R. T. Tan and M. S. Brown (2015-12) Nighttime haze removal with glow and multiple light colors. In Proc. IEEE Int’l Conf. Computer Vision, Vol. , pp. 226–234. External Links: Document, ISSN 2380-7504 Cited by: §2.2.
  • [58] Y. Li, R. T. Tan, X. Guo, J. Lu and M. S. Brown (2016) Rain streak removal using layer priors. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, pp. 2736–2744. Cited by: §1.1, §2.2.
  • [59] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125. Cited by: §4.1.1, Table 5.
  • [60] T. Lin, P. Goyal, R. Girshick, K. He and P. Dollár (2018) Focal loss for dense object detection. IEEE transactions on pattern analysis and machine intelligence. Cited by: §4.1.1, §4.3.1, Table 5, Table 6.
  • [61] D. Liu, B. Cheng, Z. Wang, H. Zhang and T. S. Huang (2017) Enhance visual recognition under adverse conditions via deep networks. arXiv preprint arXiv:1712.07732. Cited by: §1.1, §2.3.
  • [62] D. Liu, Z. Wang, Y. Fan, X. Liu, Z. Wang, S. Chang and T. Huang (2017) Robust video super-resolution with learned temporal dynamics. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2507–2515. Cited by: §2.2.
  • [63] D. Liu, Z. Wang, Y. Fan, X. Liu, Z. Wang, S. Chang, X. Wang and T. S. Huang (2018) Learning temporal dynamics for video super-resolution: a deep learning approach. IEEE Transactions on Image Processing 27 (7), pp. 3432–3445. Cited by: §2.2.
  • [64] D. Liu, B. Wen, J. Jiao, X. Liu, Z. Wang and T. S. Huang (2018) Connecting image denoising and high-level vision tasks via deep learning. arXiv preprint arXiv:1809.01826. Cited by: §1.1, §2.3.
  • [65] D. Liu, B. Wen, X. Liu, Z. Wang and T. S. Huang (2017) When image denoising meets high-level vision tasks: a deep learning approach. arXiv preprint arXiv:1706.04284. Cited by: §1.1, §2.3.
  • [66] J. Liu, S. Yang, Y. Fang and Z. Guo (2018) Structure-guided image inpainting using homography transformation. IEEE Transactions on Multimedia 20 (12), pp. 3252–3265. Cited by: §2.2.
  • [67] J. Liu, W. Yang, S. Yang and Z. Guo (2018-06) Erase or fill? deep joint recurrent rain removal and reconstruction in videos. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, Cited by: §2.2.
  • [68] P. Liu, J. Xu, J. Liu and X. Tang (2009) Pixel based temporal analysis using chromatic property for removing rain from videos. In Computer and Information Science, Cited by: §2.2.
  • [69] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §4.3.1, Table 6.
  • [70] Y. Liu, G. Zhao, B. Gong, Y. Li, R. Raj, N. Goel, S. Kesav, S. Gottimukkala, Z. Wang and W. Ren (2018) Improved techniques for learning to dehaze and beyond: a collective study. arXiv preprint arXiv:1807.00202. Cited by: §1.1.
  • [71] Y. P. Loh and C. S. Chan (2019) Getting to know low-light images with the exclusively dark dataset. Computer Vision and Image Understanding 178, pp. 30–42. External Links: Document Cited by: §2.1.
  • [72] K. G. Lore, A. Akintayo and S. Sarkar (2017) LLNet: a deep autoencoder approach to natural low-light image enhancement. Pattern Recognition 61, pp. 650 – 662. External Links: ISSN 0031-3203, Document, Link Cited by: §1.1, §2.2.
  • [73] Y. Luo, Y. Xu and H. Ji (2015) Removing rain from a single image via discriminative sparse coding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3397–3405. Cited by: §1.1, §2.2.
  • [74] D. Martin, C. Fowlkes, D. Tal and J. Malik (2001-07) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. IEEE Int’l Conf. Computer Vision, Vol. 2, pp. 416–423. Cited by: §2.1.
  • [75] M. Mueller, N. Smith and B. Ghanem (2016) A benchmark and simulator for uav tracking. Springer Nature. External Links: ISSN 0302-9743, Link Cited by: §2.1.
  • [76] H. Nada, V. A. Sindagi, H. Zhang and V. M. Patel (2018-04) Pushing the Limits of Unconstrained Face Detection: a Challenge Dataset and Baseline Results. arXiv e-prints, pp. arXiv:1804.10275. Cited by: §2.1.
  • [77] D. Nair, P. A. Kumar and P. Sankaran (2014) An effective surround filter for image dehazing. In Proc. of Int’l Conf. on Interdisciplinary Advances in Applied Computing, ICONIAAC ’14, New York, NY, USA, pp. 20:1–20:6. External Links: ISBN 978-1-4503-2908-8, Link, Document Cited by: §2.2.
  • [78] M. Najibi, P. Samangouei, R. Chellappa and L. S. Davis (2017-10) SSH: single stage headless face detector. In Proc. IEEE Int’l Conf. Computer Vision, Vol. , pp. 4885–4894. Cited by: §4.2.1.
  • [79] K. Nishino, L. Kratz and S. Lombardi (2012-07) Bayesian defogging. Int’l Journal of Computer Vision 98 (3), pp. 263–278. External Links: ISSN 0920-5691, Link, Document Cited by: §2.2.
  • [80] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C. Chen, J. T. Lee, S. Mukherjee, J. K. Aggarwal, H. Lee, L. Davis, E. Swears, X. Wang, Q. Ji, K. Reddy, M. Shah, C. Vondrick, H. Pirsiavash, D. Ramanan, J. Yuen, A. Torralba, B. Song, A. Fong, A. Roy-Chowdhury and M. Desai (2011-06) A large-scale benchmark dataset for event recognition in surveillance video. In CVPR 2011, Vol. , pp. 3153–3160. Cited by: §2.1.
  • [81] S. M. Pizer, R. E. Johnston, J. P. Ericksen, B. C. Yankaskas and K. E. Muller (1990-05) Contrast-limited adaptive histogram equalization: speed and effectiveness. In Proceedings of Conference on Visualization in Biomedical Computing, Vol. , pp. 337–345. External Links: Document, ISSN Cited by: §2.2.
  • [82] T. Plötz and S. Roth (2017-07) Benchmarking denoising algorithms with real photographs. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, Vol. , pp. 2750–2759. External Links: Document, ISSN 1063-6919 Cited by: §2.1.
  • [83] R. Prabhu, X. Yu, Z. Wang, D. Liu and A. Jiang (2018) U-finger: multi-scale dilated convolutional network for fingerprint image denoising and inpainting. arXiv preprint arXiv:1807.10993. Cited by: §1.1.
  • [84] R. Qian, R. T. Tan, W. Yang, J. Su and J. Liu (2018) Attentive generative adversarial network for raindrop removal from a single image. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1, §3.3, §4.3.1, Table 6.
  • [85] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §4.1.1, §4.3.1, Table 5, Table 6.
  • [86] J. Ren, J. Liu and Z. Guo (2013) Context-aware sparse decomposition for image denoising and super-resolution. IEEE Transactions on Image Processing 22 (4), pp. 1456–1469. Cited by: §2.2.
  • [87] S. Ren, K. He, R. Girshick and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99. Cited by: §4.3.1, Table 6.
  • [88] W. Ren, J. Zhang, X. Xu, L. Ma, X. Cao, G. Meng and W. Liu (2019-04) Deep video dehazing with semantic segmentation. IEEE Trans. on Image Processing 28 (4), pp. 1895–1908. External Links: Document, ISSN 1057-7149 Cited by: §2.2.
  • [89] W. Ren, J. Tian, Z. Han, A. Chan and Y. Tang (2017-07) Video desnowing and deraining based on matrix decomposition. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, Cited by: §2.2.
  • [90] W. Ren, S. Liu, H. Zhang, J. Pan, X. Cao and M. Yang (2016) Single image dehazing via multi-scale convolutional neural networks. In European Conference on Computer Vision, Cited by: §1.1, §2.2, §4.1.1, Table 5.
  • [91] W. Ren, J. Pan, X. Cao and M. Yang (2017) Video deblurring via semantic segmentation and pixel-wise non-linear kernel. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1077–1085. Cited by: §2.2.
  • [92] X. Ren, M. Li, W. Cheng and J. Liu (2018-05) Joint enhancement and denoising method via sequential decomposition. Cited by: §1.1, §2.2, §4.2.1.
  • [93] V. Santhaseelan and V. K. Asari (2015-03) Utilizing local phase information to remove rain from video. Int’l Journal of Computer Vision 112 (1), pp. 71–89. External Links: ISSN 0920-5691 Cited by: §2.2.
  • [94] C. Shan, S. Gong and P. W. McOwan (2005-09) Recognizing facial expressions at low resolution. In Proc. of IEEE Conf. on Advanced Video and Signal Based Surveillance, Vol. , pp. 330–335. External Links: Document, ISSN Cited by: §1.1.
  • [95] J. Shao, C. C. Loy and X. Wang (2014-06) Scene-independent group profiling in crowd. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, Vol. , pp. 2227–2234. External Links: Document, ISSN 1063-6919 Cited by: §2.1.
  • [96] H. R. Sheikh, M. F. Sabir and A. C. Bovik (2006-11) A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Trans. on Image Processing 15 (11), pp. 3440–3451. External Links: Document, ISSN 1057-7149 Cited by: §2.1.
  • [97] L. Shen, Z. Yue, F. Feng, Q. Chen, S. Liu and J. Ma (2017-11) MSR-net: Low-light Image Enhancement Using Deep Convolutional Network. ArXiv e-prints. External Links: 1711.02488 Cited by: §1.1, §2.2.
  • [98] L. Stasiak, A. Pacut and R. Vincente-Garcia (2009-10) Face tracking and recognition in low quality video sequences with the use of particle filtering. In Proc. of Annual Int’l Carnahan Conf. on Security Technology, Vol. , pp. 126–133. External Links: Document, ISSN 1071-6572 Cited by: §1.1.
  • [99] X. Tang, D. K. Du, Z. He and J. Liu (2018-09) PyramidBox: a context-assisted single shot face detector. In Proc. IEEE European Conf. Computer Vision, Cited by: §4.2.1.
  • [100] Y. Tian (2004-06) Evaluation of face resolution for expression analysis. In Proc. of Int’l Conf. on Computer Vision and Pattern Recognition Workshop, Vol. , pp. 82–82. Cited by: §1.1.
  • [101] R. Timofte, E. Agustsson, L. Van Gool, M. Yang and L. Zhang (2017) Ntire 2017 challenge on single image super-resolution: methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 114–125. Cited by: §2.1.
  • [102] A. Torralba, R. Fergus and W. T. Freeman (2008-11) 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 30 (11), pp. 1958–1970. External Links: ISSN 0162-8828 Cited by: §2.3.
  • [103] A. K. Tripathi and S. Mukhopadhyay (2012-03) Video post processing: low-latency spatiotemporal approach for detection and removal of rain. IET Image Processing 6 (2), pp. 181–196. External Links: Document, ISSN 1751-9659 Cited by: §2.2.
  • [104] A. K. Tripathi and S. Mukhopadhyay (2011) A probabilistic approach for detection and removal of rain from videos. IETE Journal of Research 57 (1), pp. 82–91. External Links: Document Cited by: §2.2.
  • [105] V. Vašek, V. Franc and M. Urban (2018-09) License plate recognition and super-resolution from low-resolution videos by convolutional neural networks. In Proc. of British Machine Vision Conference, Cited by: §1.1.
  • [106] R. G. VidalMata, S. Banerjee, B. RichardWebster, M. Albright, P. Davalos, S. McCloskey, B. Miller, A. Tambo, S. Ghosh and S. Nagesh (2019) Bridging the gap between computational photography and visual recognition. arXiv preprint arXiv:1901.09482. Cited by: §2.3.
  • [107] S. Wang, J. Zheng, H. M. Hu and B. Li (2013-09) Naturalness preserved enhancement algorithm for non-uniform illumination images. IEEE Trans. on Image Processing 22 (9), pp. 3538–3548. External Links: Document, ISSN 1057-7149 Cited by: §2.2.
  • [108] Z. Wang, S. Chang, Y. Yang, D. Liu and T. S. Huang (2016) Studying very low resolution recognition using deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4792–4800. Cited by: §2.3.
  • [109] Z. Wang, H. Li, Q. Ling and W. Li (2013) Robust temporal-spatial decomposition and its applications in video processing. IEEE Transactions on Circuits and Systems for Video Technology 23 (3), pp. 387–400. Cited by: §2.2.
  • [110] Z. Wang, D. Liu, S. Chang, Q. Ling, Y. Yang and T. S. Huang (2016) D3: deep dual-domain based fast restoration of JPEG-compressed images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2764–2772. Cited by: §2.2.
  • [111] Z. Wang, Y. Yang, Z. Wang, S. Chang, W. Han, J. Yang and T. Huang (2015) Self-tuned deep super resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–8. Cited by: §2.2.
  • [112] Z. Wang, Y. Yang, Z. Wang, S. Chang, J. Yang and T. S. Huang (2015) Learning super-resolution jointly from external and internal examples. IEEE Transactions on Image Processing 24 (11), pp. 4359–4371. Cited by: §2.2.
  • [113] C. Wei, W. Wang, W. Yang and J. Liu (2018) Deep retinex decomposition for low-light enhancement. In British Machine Vision Conference, pp. 155. Cited by: §2.1, §4.2.1.
  • [114] W. Wei, L. Yi, Q. Xie, Q. Zhao, D. Meng and Z. Xu (2017-10) Should we encode rain streaks in video as deterministic or stochastic?. In Proc. IEEE Int’l Conf. Computer Vision, Cited by: §2.2.
  • [115] J. Xu, H. Li, Z. Liang, D. Zhang and L. Zhang (2018-04) Real-world Noisy Image Denoising: A New Benchmark. arXiv e-prints, pp. arXiv:1804.02603. Cited by: §2.1.
  • [116] Y. Yan, W. Ren, Y. Guo, R. Wang and X. Cao (2017) Image deblurring via extreme channels prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4003–4011. Cited by: §2.2.
  • [117] J. Yang, X. Jiang, C. Pan and C. Liu (2016-12) Enhancement of low light level images with coupled dictionary learning. In Proc. IEEE Int’l Conf. Pattern Recognition, Vol. , pp. 751–756. External Links: Document, ISSN Cited by: §1.1, §2.2.
  • [118] J. Yang, J. Wright, T. S. Huang and Y. Ma (2010-11) Image super-resolution via sparse representation. IEEE Trans. on Image Processing 19 (11), pp. 2861–2873. External Links: Document, ISSN 1057-7149 Cited by: §2.3.
  • [119] W. Yang, R. T. Tan, J. Feng, J. Liu, Z. Guo and S. Yan (2017) Deep joint rain detection and removal from a single image. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.1, §2.1, §2.2, §4.3.1, Table 6.
  • [120] B. Z. Yao, X. Yang and S. Zhu (2007) Introduction to a large-scale general purpose ground truth database: methodology, annotation tool and benchmarks. In EMMCVPR, Cited by: §2.1.
  • [121] Z. Ying, G. Li and W. Gao (2017-11) A Bio-Inspired Multi-Exposure Fusion Framework for Low-light Image Enhancement. ArXiv e-prints. External Links: 1711.00591 Cited by: §4.2.1.
  • [122] Z. Yu, H. Li, Z. Wang, Z. Hu and C. W. Chen (2013) Multi-level video frame interpolation: exploiting the interaction among different levels. IEEE Transactions on Circuits and Systems for Video Technology 23 (7), pp. 1235–1248. Cited by: §2.2.
  • [123] R. Zeyde, M. Elad and M. Protter (2012) On single image scale-up using sparse-representations. In Proc. of the Int’l Conf. on Curves and Surfaces, Berlin, Heidelberg, pp. 711–730. External Links: ISBN 978-3-642-27412-1, Link, Document Cited by: §2.1.
  • [124] H. Zhang, J. Yang, Y. Zhang, N. M. Nasrabadi and T. S. Huang (2011) Close the loop: joint blind image restoration and recognition with sparse representation prior. In 2011 International Conference on Computer Vision, pp. 770–777. Cited by: §1.1, §2.3.
  • [125] H. Zhang and V. M. Patel (2018) Densely connected pyramid dehazing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3194–3203. Cited by: §4.1.1, Table 5.
  • [126] H. Zhang and V. M. Patel (2018) Density-aware single image de-raining using a multi-stream dense network. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.1, §2.2, §4.3.1, Table 6.
  • [127] H. Zhang, V. Sindagi and V. M. Patel (2017) Image de-raining using a conditional generative adversarial network. arXiv preprint arXiv:1701.05957. Cited by: §2.1, §4.3.1, Table 6.
  • [128] J. Zhang, Y. Cao, S. Fang, Y. Kang and C. W. Chen (2017-07) Fast haze removal for nighttime image using maximum reflectance prior. In Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition, Vol. , pp. 7016–7024. External Links: Document, ISSN 1063-6919 Cited by: §2.2.
  • [129] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang and S. Z. Li (2017-10) S3FD: single shot scale-invariant face detector. In Proc. IEEE Int’l Conf. Computer Vision, Vol. , pp. 192–201. Cited by: §4.2.1.
  • [130] X. Zhang, P. Shen, L. Luo, L. Zhang and J. Song (2012-11) Enhancement and noise reduction of very low light level images. In Proc. IEEE Int’l Conf. Pattern Recognition, Vol. , pp. 2034–2037. External Links: Document, ISSN 1051-4651 Cited by: §2.2.
  • [131] X. Zhang, H. Li, Y. Qi, W. K. Leow and T. K. Ng (2006) Rain removal in video by combining temporal and chromatic properties. In Proc. IEEE Int’l Conf. Multimedia and Expo, pp. 461–464. Cited by: §2.2.
  • [132] Y. Zhang, L. Ding and G. Sharma (2017) HazeRD: an outdoor scene dataset and benchmark for single image dehazing. In Proc. IEEE Int’l Conf. Image Processing, pp. 3205–3209. Cited by: §2.1.
  • [133] J. Zhou and F. Zhou (2013-12) Single image dehazing motivated by retinex theory. In Proc. of Int’l Symposium on Instrumentation and Measurement, Sensor Network and Automation, Vol. , pp. 243–247. External Links: Document, ISSN Cited by: §2.2.
  • [134] P. Zhu, L. Wen, D. Du, X. Bian, H. Ling, Q. Hu, Q. Nie, H. Cheng, C. Liu and X. Liu (2018) VisDrone-DET2018: the vision meets drone object detection in image challenge results. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops. Cited by: §1.
  • [135] Q. Zhu, J. Mai and L. Shao (2015-11) A fast single image haze removal algorithm using color attenuation prior. IEEE Trans. on Image Processing 24 (11), pp. 3522–3533. External Links: Document, ISSN 1057-7149 Cited by: §2.2.
  • [136] X. Zhu, C. C. Loy and S. Gong (2013-12) Video synopsis by heterogeneous multi-source correlation. In Proc. IEEE Int’l Conf. Computer Vision, Vol. , pp. 81–88. Cited by: §2.1.
  • [137] W. W. W. Zou and P. C. Yuen (2012-01) Very low resolution face recognition problem. IEEE Trans. on Image Processing 21 (1), pp. 327–340. External Links: Document, ISSN 1057-7149 Cited by: §2.3, §2.3.