Curriculum Model Adaptation with Synthetic and Real Data for Semantic Foggy Scene Understanding
This work addresses the problem of semantic scene understanding under fog. Although marked progress has been made in semantic scene understanding, it is mainly concentrated on clear-weather scenes. Extending semantic segmentation methods to adverse weather conditions such as fog is crucial for outdoor applications. In this paper, we propose a novel method, named Curriculum Model Adaptation (CMAda), which gradually adapts a semantic segmentation model from light synthetic fog to dense real fog in multiple steps, using both labeled synthetic foggy data and unlabeled real foggy data. The method is based on the fact that the results of semantic segmentation in moderately adverse conditions (light fog) can be bootstrapped to solve the same problem in highly adverse conditions (dense fog). CMAda is extensible to other adverse conditions and provides a new paradigm for learning with synthetic data and unlabeled real data. In addition, we present four other main stand-alone contributions: 1) a novel method to add synthetic fog to real, clear-weather scenes using semantic input; 2) a new fog density estimator; 3) a novel fog densification method to densify the fog in real foggy scenes without known depth; and 4) the Foggy Zurich dataset comprising real foggy images, with pixel-level semantic annotations for images under dense fog. Our experiments show that 1) our fog simulation and fog density estimator outperform their state-of-the-art counterparts with respect to the task of semantic foggy scene understanding (SFSU); 2) CMAda improves the performance of state-of-the-art models for SFSU significantly, benefiting from both our synthetic and real foggy data. The datasets and code are available at the project website.
Keywords: Semantic Foggy Scene Understanding, Fog Simulation, Learning with Synthetic and Real Data, Curriculum Model Adaptation, Network Distillation, Adverse Weather Conditions
Adverse weather or illumination conditions create visibility problems for both people and the sensors that power automated systems vision:atmosphere; vision:rain:07; SFSU_synthetic; daytime:2:nighttime. While sensors and the downstream vision algorithms are constantly getting better, their performance is mainly benchmarked with clear-weather images Cityscapes; drive:surroundview:route:planner. Many outdoor applications, however, can hardly escape from “bad” weather vision:atmosphere. One typical example of adverse weather conditions is fog, which degrades the visibility of a scene significantly contrast:weather:degraded; tan2008visibility. The denser the fog is, the more severe this problem becomes.
During the past years, the community has made tremendous progress on image dehazing (defogging) to increase the visibility of foggy images bayesian:defogging; dark:channel; dehazing:mscale:depth. The last few years have also witnessed a leap in object recognition. A great deal of effort has been devoted specifically to semantic road scene understanding road:scene:eccv12; Cityscapes. However, the extension of these techniques to other weather/illumination conditions has not received due attention, despite its importance in outdoor applications. For example, an automated car still needs to detect other traffic agents and traffic control devices in the presence of fog or rain. This work investigates the problem of semantic foggy scene understanding (SFSU).
The current ‘standard’ policy for addressing semantic scene understanding is to train a neural network with many annotations of real images pascal:2011; imagenet:2015; Cityscapes. While this trend of creating and using more human annotations may still continue, extending the same protocol to all conditions seems to be problematic, as the manual annotation part is hardly scalable. The problem is more pronounced for adverse weather conditions as the difficulty of data collection and data annotation increases significantly. To overcome this problem, a few streams of research have gained extensive attention: 1) learning with limited, weak supervision dai:EnPro:iccv13; Misra_2015_CVPR; 2) transfer learning LSDA:nips:14; DomainAdaptiveFasterRCNN and 3) learning with synthetic data Synthia:dataset; SFSU_synthetic.
Our method falls into the middle ground, and aims to combine the strengths of these kinds of methods. In particular, our method is developed to learn from 1) a dataset with high-quality synthetic fog and the corresponding human annotations, and 2) a dataset with a large number of unlabeled images of real fog. The goal of our method is to improve the performance of SFSU without requiring extra human annotations.
To this aim, this work proposes a novel fog simulator to add high-quality synthetic fog to real images that contain clear-weather outdoor scenes, and then leverages these partially synthetic foggy images for SFSU. The new fog simulator extends the recent work of Sakaridis et al. SFSU_synthetic by introducing a semantic-aware filter to exploit the structures of object instances. We show that learning with our synthetic foggy data improves the performance for SFSU. Furthermore, we learn a fog density estimator by using synthetic fog of varying densities, and rank the unlabeled real images by increasing fog density. The ranking forms the foundation for our novel learning method Curriculum Model Adaptation (CMAda) to gradually adapt a semantic segmentation model from clear weather to dense fog, through light fog. CMAda is based on the fact that object recognition in moderately adverse conditions (light fog) is easier and its results can be re-used via knowledge distillation to solve a harder problem, i.e. object recognition in very adverse conditions (dense fog).
CMAda is iterative in nature and can be implemented with different numbers of steps. The pipeline of a two-step implementation of CMAda is shown in Figure 1. CMAda has the potential to be used for other adverse weather conditions, and opens a new avenue for learning with synthetic data and unlabeled real data in general. Experiments show that CMAda yields the best results on two datasets with dense real fog and one dataset with real fog of varying fog densities.
A shorter version of this work was published at the European Conference on Computer Vision dense:SFSU:eccv18. Compared to that version, this paper makes the following six additional contributions:
An extension of the formulation of the learning method to accommodate multiple adaptation steps instead of only two steps, leading to improved performance over the conference paper as well.
A novel fog densification method to densify the fog in real foggy scenes. The fog densification method can close the domain gap between light real fog and dense real fog; using it in our learning method, Curriculum Model Adaptation, significantly increases the performance of semantic foggy scene understanding.
A method, Model Selection, for the task of semantic scene understanding in multiple weather conditions, where test images are a mixture of clear-weather images and foggy images. This extension is important for real-world applications, as weather conditions change constantly and semantic scene understanding methods need to be robust and/or adaptive to those changes.
An enlarged annotated dataset, obtained by increasing the number of pixel-wise annotated images with dense fog. Creating accurate, pixel-wise annotations for scenes under dense fog is very difficult.
More extensive experiments to diagnose the contribution of each component of the CMAda pipeline, to compare with more competing methods, and to comprehensively study the usefulness of image dehazing (defogging) for SFSU.
Other sections are also enhanced, including the introduction, related work, and the dataset collection and annotation parts.
The paper is structured as follows. Section 2 presents the related work. Section 3 is devoted to our method for simulating synthetic fog, which is followed by Section 4 for our learning approach. Section 5 summarizes our data collection and annotation. Finally, Section 6 presents our experimental results and Section 7 concludes this paper.
2 Related Work
Our work is relevant to image defogging/dehazing, joint image filtering, foggy scene understanding, and domain adaptation.
2.1 Image Defogging/Dehazing
Fog fades the color of observed objects and reduces their contrast. Extensive research has been conducted on image defogging (dehazing) to increase the visibility of foggy scenes contrast:weather:degraded; tan2008visibility; bayesian:defogging; fattal2008single; nonlocal:image:dehazing; fattal2014dehazing; dark:channel. Certain works focus particularly on enhancing foggy road scenes THC+12; exponential:contrast:restoration. Recent approaches also rely on trainable architectures TYW14, which have evolved to end-to-end models joint:transmission:estimation:dehazing; deep:transmission:network. For a comprehensive overview of defogging/dehazing algorithms, we point the reader to review:defogging:restoration; dehazing:survey:benchmarking. Our work is complementary and focuses on semantic foggy scene understanding. This work investigates the usefulness of image dehazing for semantic scene understanding under fog.
2.2 Joint Image Filtering
Using additional images as input for filtering a target image has been originally studied in settings where the target image has low photometric quality flash:enhance:crossbilateral:Eisemann; flash:enhance:crossbilateral:Petschnigg or low resolution joint:bilateral:upsampling. Compared to the bilateral filtering formulation of these approaches, subsequent works propose alternative formulations, such as the guided filter guided:filtering and mutual structure filtering mutual:structure:filtering, for better incorporating the reference image into the filtering process. In comparison, we extend the classical cross-bilateral filter to a dual-reference cross-bilateral filter by accepting two reference images, one of which is a discrete label image that helps our filter adhere to the semantics of the scene.
2.3 Foggy Scene Understanding
Typical examples of road scene understanding tasks include road and lane detection recent:progress:lane, traffic light detection traffic:light:survey:16, car and pedestrian detection kitti, and a dense, pixel-level segmentation of road scenes into most of the relevant semantic classes recognition:sfm:eccv08; Cityscapes. While deep recognition networks have been developed dilated:convolution; refinenet; pspnet; fast:rcnn; faster:rcnn and large-scale datasets have been presented kitti; Cityscapes, this research has mainly focused on clear weather. There is also a large body of work on fog detection fog:detection:cv:09; fog:detection:vehicles:12; night:fog:detection; fast:fog:detection. Classification of scenes into foggy and fog-free has been tackled as well fog:nonfog:classification:13. In addition, visibility estimation has been extensively studied for both daytime visibility:road:fog:10; visibility:detection:fog:15; fog:detection:visibility:distance and nighttime night:visibility:analysis:15, in the context of assisted and autonomous driving. The closest of these works to ours is visibility:road:fog:10, in which synthetic fog is generated and foggy images are segmented into free-space area and vertical objects. Our work differs in that our semantic scene understanding task is more complex and we tackle the problem via a different route, by learning jointly from synthetic fog and real fog.
2.4 Domain Adaptation
Our work bears resemblance to transfer learning and model adaptation. Model adaptation across weather conditions to semantically segment simple road scenes is studied in road:scene:2013. More recently, domain adversarial based approaches were proposed to adapt semantic segmentation models both at pixel level and feature level from simulated to real environments adversarial:training:simulated:17; synthetic:semantic:segmentation; CyCADA; incremental:adversarial:DA:18. Most of these works are based on adversarial domain adaptation. Our work is complementary to methods in this vein; we adapt the model parameters with carefully generated data, leading to an algorithm whose behavior is easy to understand and whose performance is more predictable. Combining our method and adversarial domain adaptation is a promising direction.
A very recent work daytime:2:nighttime on semantic nighttime image segmentation shows that real images captured at twilight are helpful for supervision transfer from daytime to nighttime via model adaptation. CMAda constitutes a more complex framework, since it leverages both synthetic foggy data and real foggy data jointly for adapting semantic segmentation models to dense fog, whereas the adaptation method in daytime:2:nighttime uses solely real data for the adaptation. Moreover, the assignment of real foggy images to the correct target foggy domain through fog density estimation is another crucial and nontrivial component of CMAda and it is a prerequisite for using these real images as training data in the method. By contrast, the partition of the real dataset in daytime:2:nighttime into subsets that correspond to different times of day from daytime to nighttime is trivially performed by using the time of capture of the images.
3 Fog Simulation on Real Scenes Using Semantics
3.1 Motivation
We draw our motivation for fog simulation on real scenes using semantic input from the pipeline that was used in SFSU_synthetic to generate the Foggy Cityscapes dataset, which primarily focuses on depth denoising and completion. This pipeline is denoted in Figure 2 with thin gray arrows and consists of three main steps: depth outlier detection, robust depth plane fitting at the level of SLIC superpixels slic:superpixels using RANSAC, and postprocessing of the completed depth map with guided image filtering guided:filtering. Our approach adopts the general configuration of this pipeline, but aims to improve its postprocessing step by leveraging the semantic annotation of the scene as additional reference for filtering, which is indicated in Figure 2 with the thick blue arrow.
The guided filtering step in SFSU_synthetic uses the clear-weather color image as guidance to filter the depth map. However, as previous works on image filtering mutual:structure:filtering have shown, guided filtering and similar joint filtering methods such as cross-bilateral filtering flash:enhance:crossbilateral:Eisemann; flash:enhance:crossbilateral:Petschnigg transfer every structure that is present in the guidance/reference image to the output target image. Thus, any structure that is specific to the reference image but irrelevant for the target image is transferred to the latter erroneously.
Whereas previous approaches such as mutual-structure filtering mutual:structure:filtering attempt to estimate the common structure between reference and target images, we identify this common structure with the structure that is present in the ground-truth semantic labeling of the image. In other words, we assume that edges which are shared by the color image and the depth map generally coincide with semantic edges, i.e. locations in the image where the semantic classes of adjacent pixels are different. Under this assumption, the semantic labeling can be used directly as the reference image in a classical cross-bilateral filtering setting, since it contains exactly the mutual structure between the color image and the depth map. In practice, however, the boundaries drawn by humans when creating semantic annotations are not pixel-accurate, and using the color image as additional reference helps to capture the precise location and orientation of edges better. As a result, we formulate the postprocessing step of the completed depth map in our fog simulation as a dual-reference cross-bilateral filter, with color and semantic reference.
Before delving into the formulation of our filter, we briefly argue against alternative uses of semantic annotations in our fog simulation pipeline that might seem attractive at first sight. First, replacing SLIC superpixels with superpixels induced by the semantic labeling for the depth plane fitting step is not viable, because it induces very large superpixels, for which the planarity assumption breaks completely. Second, we have experimented with omitting the robust depth plane fitting step altogether and applying our dual-reference cross-bilateral filter directly on the incomplete depth map which is output from the outlier detection step. This approach, however, is highly sensitive to outliers that have not been detected and invalidated in the preceding step. By contrast, these remaining outliers are handled successfully by robust RANSAC-based depth plane fitting.
3.2 Dual-reference Cross-bilateral Filter Using Color and Semantics
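Let us denote the RGB image of the clear-weather scene by $\mathbf{R}$ and its CIELAB counterpart by $\mathbf{J}$. We consider CIELAB, as it has been designed to increase perceptual uniformity and gives better results for bilateral filtering of color images bilateral:grid. The input image to be filtered in the postprocessing step of our pipeline constitutes a scalar-valued transmittance map $\hat{t}$. We provide more details on this transmittance map in Section 3.3. Last, we are given a labeling function
$$h: \mathcal{P} \rightarrow \{1, \dots, C\} \qquad (1)$$
which maps pixels to semantic labels, where $\mathcal{P}$ is the discrete domain of pixel positions and $C$ is the total number of semantic classes in the scene. We define our dual-reference cross-bilateral filter with color and semantic reference as
$$t(\mathbf{p}) = \frac{\sum_{\mathbf{q} \in \mathcal{N}(\mathbf{p})} G_{\sigma_s}(\|\mathbf{q} - \mathbf{p}\|) \left[ \delta(h(\mathbf{q}) - h(\mathbf{p})) + \mu\, G_{\sigma_c}(\|\mathbf{J}(\mathbf{q}) - \mathbf{J}(\mathbf{p})\|) \right] \hat{t}(\mathbf{q})}{\sum_{\mathbf{q} \in \mathcal{N}(\mathbf{p})} G_{\sigma_s}(\|\mathbf{q} - \mathbf{p}\|) \left[ \delta(h(\mathbf{q}) - h(\mathbf{p})) + \mu\, G_{\sigma_c}(\|\mathbf{J}(\mathbf{q}) - \mathbf{J}(\mathbf{p})\|) \right]} \qquad (2)$$
where $\mathbf{p}$ and $\mathbf{q}$ denote pixel positions, $\mathcal{N}(\mathbf{p})$ is the neighborhood of $\mathbf{p}$, $\delta$ denotes the Kronecker delta, $G_{\sigma_s}$ is the spatial Gaussian kernel, $G_{\sigma_c}$ is the color-domain Gaussian kernel and $\mu$ is a positive constant. The novel dual reference is demonstrated in the second factor of the filter weights, which constitutes a sum of the terms $\delta(h(\mathbf{q}) - h(\mathbf{p}))$ for semantic reference and $G_{\sigma_c}(\|\mathbf{J}(\mathbf{q}) - \mathbf{J}(\mathbf{p})\|)$ for color reference, weighted by $\mu$. The formulation of the semantic term implies that only pixels $\mathbf{q}$ with the same semantic label as the examined pixel $\mathbf{p}$ contribute to the output at $\mathbf{p}$ through this term, which prevents blurring of semantic edges. At the same time, the color term helps to better preserve true depth edges that do not coincide with any semantic boundary but are present in $\mathbf{J}$, e.g. due to self-occlusion of an object.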
The formulation of (2) enables an efficient implementation of our filter based on the bilateral grid bilateral:grid. More specifically, we construct two separate bilateral grids that correspond to the semantic and color domains respectively and operate separately on each grid to perform filtering, combining the results in the end. In this way, we handle a 3D bilateral grid for the semantic domain and a 5D grid for the color domain instead of a single joint 6D grid that would dramatically increase computation time bilateral:grid.
In our experiments, we set , , and .
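To make the structure of the filter weights concrete, the following is a minimal brute-force sketch in Python of a dual-reference cross-bilateral filter: each neighbor is weighted by a spatial Gaussian times the sum of a same-label indicator and a color Gaussian scaled by a positive constant. It is an illustration only, not the bilateral-grid implementation described above; the function names, the square window parameter `radius`, and the toy data layout (nested lists, CIELAB tuples) are assumptions made for the example.

```python
import math


def gaussian(distance, sigma):
    """Unnormalized Gaussian kernel evaluated at a given distance."""
    return math.exp(-(distance * distance) / (2.0 * sigma * sigma))


def dual_reference_filter(t_hat, labels, lab, mu, sigma_s, sigma_c, radius):
    """Brute-force dual-reference cross-bilateral filter (illustrative sketch).

    t_hat  : 2D list of floats, the initial transmittance map to be filtered
    labels : 2D list of ints, semantic (or instance) label of each pixel
    lab    : 2D list of (L, a, b) tuples, CIELAB color of each pixel
    radius : half-size of the square neighborhood (hypothetical parameter)
    """
    height, width = len(t_hat), len(t_hat[0])
    out = [[0.0] * width for _ in range(height)]
    for py in range(height):
        for px in range(width):
            numerator = 0.0
            denominator = 0.0
            for qy in range(max(0, py - radius), min(height, py + radius + 1)):
                for qx in range(max(0, px - radius), min(width, px + radius + 1)):
                    spatial = gaussian(math.hypot(qy - py, qx - px), sigma_s)
                    # Semantic term: 1 only if both pixels share the same label.
                    semantic = 1.0 if labels[qy][qx] == labels[py][px] else 0.0
                    # Color term: Gaussian of the CIELAB distance between pixels.
                    color = gaussian(math.dist(lab[qy][qx], lab[py][px]), sigma_c)
                    weight = spatial * (semantic + mu * color)
                    numerator += weight * t_hat[qy][qx]
                    denominator += weight
            # The center pixel always contributes, so the denominator is positive.
            out[py][px] = numerator / denominator
    return out
```

With the color weight set to zero, only pixels of the same label are averaged, so a step in transmittance that coincides with a semantic edge is left untouched; a positive color weight additionally lets color edges inside a semantic segment shape the result.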
3.3 Remaining Steps
Here we outline the remaining parts of our fog simulation pipeline of Figure 2. For more details, we refer the reader to SFSU_synthetic, with which most parts of the pipeline are common. The standard optical model for fog that forms the basis of our fog simulation was introduced in Koschmieder:optical:model and is expressed as
$$\mathbf{I}(\mathbf{x}) = \mathbf{R}(\mathbf{x})\, t(\mathbf{x}) + \mathbf{L}\, (1 - t(\mathbf{x})) \qquad (3)$$
where $\mathbf{I}(\mathbf{x})$ is the observed foggy image at pixel $\mathbf{x}$, $\mathbf{R}(\mathbf{x})$ is the clear scene radiance and $\mathbf{L}$ is the atmospheric light, which is assumed to be globally constant. The transmittance $t(\mathbf{x})$ determines the amount of scene radiance that reaches the camera. For homogeneous fog, transmittance depends on the distance $\ell(\mathbf{x})$ of the scene from the camera through
$$t(\mathbf{x}) = \exp(-\beta\, \ell(\mathbf{x})) \qquad (4)$$
The attenuation coefficient $\beta$ controls the density of the fog: larger values of $\beta$ mean denser fog. Fog decreases the meteorological optical range (MOR), also known as visibility, to less than 1 km by definition Federal:meteorological:handbook. For homogeneous fog, $\mathrm{MOR} = 2.996 / \beta$, which implies
$$\beta \geq 2.996 \times 10^{-3}\ \mathrm{m}^{-1} \qquad (5)$$
where the lower bound corresponds to the lightest fog configuration. In our fog simulation, the value that is used for $\beta$ always obeys (5).
The required inputs for fog simulation with (3) are the image $\mathbf{R}$ of the original clear scene, atmospheric light $\mathbf{L}$ and a complete transmittance map $t$. We use the same approach for atmospheric light estimation as that in SFSU_synthetic. Moreover, we adopt the stereoscopic inpainting method of SFSU_synthetic for depth denoising and completion to obtain an initial complete transmittance map $\hat{t}$ from a noisy and incomplete input disparity map $D$, using the recommended parameters. We filter $\hat{t}$ with our dual-reference cross-bilateral filter (2) to compute the final transmittance map $t$, which is used in (3) to synthesize the foggy image $\mathbf{I}$.
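The optical model above is simple enough to check numerically. The sketch below applies it to a single pixel in Python, assuming homogeneous fog: transmittance decays exponentially with distance, and the foggy color is a blend of the clear radiance and the atmospheric light. The function names and the per-pixel tuple representation are assumptions for illustration; the visibility helper uses the standard 5% contrast threshold of the meteorological optical range, for which -ln(0.05) is approximately 2.996.

```python
import math


def transmittance(distance_m, beta):
    """Homogeneous-fog transmittance: t = exp(-beta * distance)."""
    return math.exp(-beta * distance_m)


def add_fog(radiance, distance_m, beta, atmospheric_light):
    """Apply the optical model to one pixel: I = R * t + L * (1 - t).

    radiance and atmospheric_light are per-channel tuples of equal length.
    """
    t = transmittance(distance_m, beta)
    return tuple(r * t + l * (1.0 - t)
                 for r, l in zip(radiance, atmospheric_light))


def meteorological_optical_range(beta):
    """Visibility in meters for homogeneous fog, using the 5% contrast threshold."""
    return -math.log(0.05) / beta


def is_fog(beta):
    """Fog by definition means a visibility below 1 km."""
    return meteorological_optical_range(beta) < 1000.0
```

A pixel at zero distance keeps its clear-weather color, while a very distant pixel converges to the atmospheric light, which is the qualitative behavior expected from fog.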
Results of the presented pipeline for fog simulation on example images from Cityscapes Cityscapes are provided in Figure 3 for , which corresponds to visibility of ca. . We specifically leverage the instance-level semantic annotations that are provided in Cityscapes and set the labeling of (1) to a different value for each distinct instance of the same semantic class in order to distinguish adjacent instances. We compare our synthetic foggy images against the respective images of Foggy Cityscapes that were generated with the approach of SFSU_synthetic. Our synthetic foggy images generally preserve the edges between adjacent objects with large discrepancy in depth better than the images in Foggy Cityscapes, because our approach utilizes semantic boundaries, which usually encompass these edges. The incorrect structure transfer of color textures to the transmittance map, which deteriorates the quality of Foggy Cityscapes, is also reduced with our method.
In this work, the method is mainly applied to the Cityscapes dataset. The generated foggy dataset is named Foggy Cityscapes-DBF, with DBF standing for Dual-reference Cross-Bilateral Filter. Foggy Cityscapes-DBF has been made available on the server of Cityscapes.
4 Semantic Foggy Scene Understanding
In this section, we first present a standard supervised learning approach for semantic segmentation under dense fog using our synthetic foggy data with the novel fog simulation of Section 3, and then elaborate on our novel curriculum model adaptation approach using both synthetic and real foggy data.
4.1 Learning with Synthetic Fog
Generating synthetic fog from real clear-weather scenes grants the potential of inheriting the existing human annotations of these scenes, such as those from the Cityscapes dataset Cityscapes. This is a significant asset that enables training of standard segmentation models. Therefore, an effective way of evaluating the merit of a fog simulator is to adapt a segmentation model originally trained on clear weather to the synthesized foggy images and then evaluate the adapted model against the original one on real foggy images. The primary goal is to verify that the standard learning methods for semantic segmentation can benefit from our simulated fog in the challenging scenario of real fog. This evaluation policy has been proposed in SFSU_synthetic. We adopt this policy and fine-tune the RefineNet model refinenet on synthetic foggy images generated with our simulation. The performance of our adapted models on real fog is compared to that of the original clear-weather model as well as the models that are adapted on Foggy Cityscapes SFSU_synthetic, providing an objective comparison of our simulation method against SFSU_synthetic.
The learned model can be used as a standalone approach for semantic foggy scene understanding as shown in SFSU_synthetic, or it can be used as an initialization step for our curriculum model adaptation method described in the next section, which learns both from synthetic and real data.
4.2 Curriculum Model Adaptation (CMAda)
In the previous section, the proposed method learns to adapt semantic segmentation models from the domain of clear weather to the domain of foggy weather in a single step. While considerable improvement can be achieved (as shown in our experiments), the method falls short when it is presented with dense fog. This is because domain discrepancies become more accentuated for denser fog: 1) the domain discrepancy between synthetic foggy images and real foggy images increases with fog density; and 2) the domain discrepancy between real clear-weather images and real foggy images increases with fog density. This section presents a method to gradually adapt the semantic segmentation model which was originally trained with clear-weather images to images under dense fog by using both synthetic foggy images and unlabeled real foggy images. The method, which we term Curriculum Model Adaptation (CMAda), uses synthetic fog with a range of varying fog density, from light fog to dense fog, and a large dataset of unlabeled real foggy scenes with variable, unknown fog density. The goal is to improve the performance of state-of-the-art semantic segmentation models on dense foggy scenes without using any human annotations for foggy data. Below, we first present our fog density estimator and our method for densification of fog in real foggy images without depth information, and then proceed to the complete learning approach.
4.2.1 Fog Density Estimation
Fog density is usually determined by the visibility of the foggy scene. An accurate estimate of fog density can benefit many applications, such as image defogging fog:density:15. Since annotating images in a fine-grained manner regarding fog density is very challenging, previous methods are trained on a few hundred images divided into only two classes: foggy and fog-free fog:density:15. The performance of the system, however, is affected by the small amount of training data and the coarse class granularity.
In this paper, we leverage our fog simulation applied to Cityscapes Cityscapes for fog density estimation. Since simulated fog density is directly controlled through the attenuation coefficient $\beta$, we generate several versions of Foggy Cityscapes with varying $\beta$ and train AlexNet alexnet to regress the value of $\beta$ for each image, lifting the need to handcraft features relevant to fog and to collect human annotations as fog:density:15 did. The predicted fog density using our method correlates well with human judgments of fog density taken in a subjective study on a large foggy image database on Amazon Mechanical Turk (cf. the fog density estimation experiments for results). The fog density estimator is used to rank all unlabeled images in our new Foggy Zurich dataset, paving the way for our curriculum adaptation from light foggy images to dense foggy images. The fog density estimator is denoted by $f(\mathbf{x})$, where $\mathbf{x}$ is an image.
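The data flow around the estimator can be sketched as follows: every clear image is rendered at several attenuation coefficients, each rendering paired with its coefficient as a regression target, and the trained estimator is then used only as a sort key for the unlabeled real images. This Python sketch is a hedged illustration; `simulate_fog` and `estimate_density` are hypothetical callables standing in for the fog simulator and the trained regressor, and no network training is shown.

```python
def make_regression_pairs(clear_images, betas, simulate_fog):
    """Build (foggy image, attenuation coefficient) training pairs.

    simulate_fog(image, beta) is a hypothetical callable that renders
    synthetic fog of density beta onto a clear-weather image.
    """
    return [(simulate_fog(image, beta), beta)
            for image in clear_images
            for beta in betas]


def rank_by_fog_density(images, estimate_density):
    """Sort images from lightest to densest estimated fog.

    estimate_density(image) is a hypothetical callable that returns the
    attenuation coefficient predicted for a real foggy image.
    """
    return sorted(images, key=estimate_density)
```

Because the regression targets come for free from the simulator, the number of density levels is limited only by how many coefficients are rendered, not by annotation effort.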
4.2.2 CMAda with Synthetic and Real Fog
The learning algorithm has a source domain denoted by $\mathcal{S}$, an ultimate target domain denoted by $\mathcal{T}$, and an ordered list of intermediate target domains indicated by $(\mathcal{T}_1, \dots, \mathcal{T}_K)$, with $K$ being the number of intermediate domains. In this work, $\mathcal{S}$ is for clear-weather images, $\mathcal{T}$ is for images with dense fog, and the $\mathcal{T}_k$'s are for groups of images with fog density that increases with $k$, ranging between the density of $\mathcal{S}$ and $\mathcal{T}$. Our method adapts semantic segmentation models through a sequence of domains: $(\mathcal{S}, \mathcal{T}_1, \dots, \mathcal{T}_K, \mathcal{T})$. The intermediate target domains $\mathcal{T}_k$ are optional: when $K = 0$, the method is a one-stage adaptation approach as presented in Section 4.1. Similarly, $K = 1$ leads to a two-stage adaptation approach as presented in our short paper dense:SFSU:eccv18, $K = 2$ to a three-stage adaptation approach, and so on. They are abbreviated accordingly as CMAda1, CMAda2, CMAda3, and so on.
Let us denote by $z$, with $z \in \{1, \dots, K+2\}$, the index of the domains in this ordered sequence $(\mathcal{S}, \mathcal{T}_1, \dots, \mathcal{T}_K, \mathcal{T})$. In this work, the sequence of domains is sorted in ascending order with respect to fog density. For instance, it could be (clear weather, light fog, dense fog), with clear weather being the source domain, dense fog the ultimate target domain and light fog the intermediate target domain. The approach proceeds progressively and adapts the semantic segmentation model from one domain (i.e. one fog density) to the next by learning with the corresponding dataset of synthetic fog and the corresponding dataset of real fog. Once the model for the current domain is trained, its knowledge can be distilled on unlabeled real foggy images, and then used, along with a new version of synthetic foggy data, to adapt the current model to the next domain (i.e. the next denser fog).
Since the method proceeds in an iterative manner, we only present the algorithmic details for model adaptation from $z-1$ to $z$. Let us use $\beta_z$ to indicate the fog density for domain $z$. Specifically, we use the attenuation coefficient $\beta$ to indicate fog density. In order to adapt the semantic segmentation model from the previous domain $z-1$ to the current domain $z$, we generate synthetic fog of the exact fog density $\beta_z$ and inherit the human annotation directly from the original clear-weather images. Thus, we have
$$\mathcal{D}^{z}_{\text{syn}} = \{(\mathbf{x}^{\beta_z}_m, \mathbf{y}_m)\}_{m=1}^{M} \qquad (6)$$
where $M$ is the total number of synthetic foggy images and $y^{i,j}_m \in \{1, \dots, C\}$ is the label of pixel $(i,j)$ manually created for the clear-weather image $\mathbf{x}_m$, with $C$ being the total number of classes.
For real foggy images, since no human annotations are available, we rely on a strategy of self-learning or curriculum learning. Objects in lighter fog are easier to recognize than in denser fog, hence models trained for lighter fog are more generalizable to real data. The model $\phi^{z-1}$ of the previous domain $z-1$ can be applied to all real foggy images with fog density less than $\beta_{z-1}$ to generate supervisory labels, creating a dataset with semantic labels in order to train $\phi^{z}$. Specifically, the dataset of real foggy images for the adaptation at $z$ is
$$\mathcal{D}^{z}_{\text{real}} = \{(\bar{\mathbf{x}}_n, \bar{\mathbf{y}}_n) \mid f(\bar{\mathbf{x}}_n) \leq \beta_{z-1}\}_{n=1}^{N} \qquad (7)$$
where $\bar{\mathbf{y}}_n = \phi^{z-1}(\bar{\mathbf{x}}_n)$ denotes the predicted labels of image $\bar{\mathbf{x}}_n$ by the semantic segmentation model $\phi^{z-1}(\cdot)$ trained for the previous domain of lighter fog.
Once the two training sets are formed, the aim is to learn a mapping function $\phi^{z}$ from $\mathcal{D}^{z}_{\text{syn}}$ and $\mathcal{D}^{z}_{\text{real}}$. The proposed scheme for training semantic segmentation models for the current domain $z$ of foggy images is to learn a mapping function $\phi^{z}$ so that synthetic fog in density domain $z$ with human annotation and real fog in density domain $z-1$ with labels generated by $\phi^{z-1}$ are both taken into account:
$$\min_{\phi^{z}} \left( \frac{1}{M} \sum_{m=1}^{M} L(\phi^{z}(\mathbf{x}^{\beta_z}_m), \mathbf{y}_m) + \lambda\, \frac{1}{N_z} \sum_{(\bar{\mathbf{x}}_n, \bar{\mathbf{y}}_n) \in \mathcal{D}^{z}_{\text{real}}} L(\phi^{z}(\bar{\mathbf{x}}_n), \bar{\mathbf{y}}_n) \right) \qquad (8)$$
where $L(\cdot, \cdot)$ is the cross-entropy loss function, $\lambda = \frac{N_z}{M} \times w$ is a hyper-parameter balancing the weights of the two data sources, with $w$ serving as the relative weight of each real weakly labeled image compared to each synthetic labeled one and $N_z$ being the number of images in $\mathcal{D}^{z}_{\text{real}}$. We set $w$ empirically in our experiment, but an optimal value can be obtained via cross-validation if needed. The optimization of (8) is implemented by generating a hybrid data stream and feeding it to a CNN for standard supervised training. More specifically, during the training phase, training images are fetched from the randomly shuffled $\mathcal{D}^{z}_{\text{syn}}$ and the randomly shuffled $\mathcal{D}^{z}_{\text{real}}$ with a speed ratio of $1 : w$.
We now describe the initialization stage of our method, which is also a variant of our method when no intermediate target domains are used. For the case $z = 1$, the domain is the clear-weather condition and the model $\phi^{1}$ is directly trained with the original Cityscapes dataset. For the case $z = 2$, there are no real foggy images falling into domain $z - 1 = 1$, which is the clear-weather domain. So for $z = 2$, the model $\phi^{2}$ is trained with the synthetic dataset $\mathcal{D}^{2}_{\text{syn}}$ only, as specified in Section 4.1. For the remaining steps from $z = 3$ on, we use the adaptation approach defined above to adapt the model iteratively up to domain $z = K+2$, which is also the ultimate target domain $\mathcal{T}$. In this work, we evaluated three variants of our method with $K \in \{0, 1, 2\}$, leading to models named CMAda1, CMAda2, and CMAda3. The attenuation coefficients for the three models are and , respectively.
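One adaptation step, as described above, combines a pseudo-labeled real dataset with a mixed training stream. The Python sketch below illustrates both under stated assumptions and is not the authors' implementation: `estimate_density` and `previous_model` are hypothetical callables, real images are selected by thresholding at the fog density of the previous domain, the balancing weight is assumed to take the form lambda = (N / M) * w with both losses averaged over their dataset sizes, and the interleaving scheme for the 1 : w speed ratio is one possible realization.

```python
import itertools
import random


def build_real_dataset(real_images, estimate_density, previous_model, beta_prev):
    """Pseudo-label real images whose estimated fog density is at most beta_prev."""
    return [(image, previous_model(image))
            for image in real_images
            if estimate_density(image) <= beta_prev]


def balance_weight(num_real, num_synthetic, w):
    """Assumed form of the balancing weight: lambda = (N / M) * w."""
    return (num_real / num_synthetic) * w


def combined_loss(synthetic_losses, real_losses, w):
    """Mean synthetic loss plus lambda times mean real loss (assumed normalization)."""
    m, n = len(synthetic_losses), len(real_losses)
    if n == 0:
        return sum(synthetic_losses) / m
    lam = balance_weight(n, m, w)
    return sum(synthetic_losses) / m + lam * sum(real_losses) / n


def hybrid_stream(synthetic, real, w, num_synthetic, seed=0):
    """Yield num_synthetic synthetic samples, interleaved with real ones at ratio 1 : w."""
    rng = random.Random(seed)
    syn_pool, real_pool = list(synthetic), list(real)
    if not syn_pool:
        raise ValueError("the synthetic dataset must not be empty")
    rng.shuffle(syn_pool)
    rng.shuffle(real_pool)
    syn_iter = itertools.cycle(syn_pool)
    real_iter = itertools.cycle(real_pool) if real_pool else None
    credit = 0.0  # fractional number of real samples owed to the stream
    for _ in range(num_synthetic):
        yield ("synthetic", next(syn_iter))
        if real_iter is None:
            continue
        credit += w
        while credit >= 1.0:
            yield ("real", next(real_iter))
            credit -= 1.0
```

Under these assumptions the combined loss equals (1 / M) times the sum of all synthetic losses plus w times the sum of all real losses, which is why fetching real images w times as often as synthetic ones reproduces the same weighting with a plain supervised training loop.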
Figure 1 provides an overview of our method CMAda2. Below, we summarize the complete operations of our model CMAda2 to further aid understanding of the method. With the chosen attenuation coefficients, the whole procedure of CMAda2 is as follows:
1) generate a synthetic foggy dataset with multiple versions of varying fog density;
2) train a model for fog density estimation on the dataset of step 1;
3) rank the images in the real foggy dataset with the model of step 2 according to fog density;
4) generate a dataset with light synthetic fog, and train a segmentation model on it;
5) generate a dataset with dense synthetic fog;
CMAda adapts segmentation models from clear-weather conditions to dense fog and is inspired by curriculum learning curriculum:learning, in the sense that we first solve easier tasks with our synthetic data, i.e., fog density estimation and semantic scene understanding under light fog, and then acquire new knowledge from the already “solved” tasks in order to better tackle the harder task, i.e., semantic scene understanding under dense real fog. CMAda also exploits the direct control of fog density for synthetic foggy images.
This learning approach also bears resemblance to model distillation hinton2015distilling; supervision:transfer or imitation model:compression; dai:metric:imitation. The underpinnings of our proposed approach are the following: 1) objects are easier to recognize in light fog than in dense fog, hence models trained on synthetic data generalize better to real data when both data sources contain light rather than dense fog; and 2) models trained on the source domain can be successfully applied to the target domain when the domain gap is small, hence incremental (curriculum) domain adaptation can better propagate semantic knowledge from the source domain to the ultimate target domain than single-step domain adaptation approaches.
The goal of CMAda is to train a semantic segmentation model for the ultimate target domain . The standard recipe is to record foggy images ’s and then to manually create semantic labels ’s for those foggy images so that standard supervised learning can be applied. As discussed in Section 1, it is difficult to apply this recipe to all adverse weather conditions because manual creation of ’s is very time-consuming and expensive. To address this problem, this work develops methods to automatically create two proxies for . The two proxies are defined in (6) and in (7). These two proxies reflect different and complementary characteristics of . On the one hand, dense synthetic fog features a similar overall visibility obstruction to dense real fog, but includes artifacts. On the other hand, light real fog captures the true nonuniform and spatially varying structure of fog, but at a different density than dense fog. Combining the two mitigates the drawbacks of each.
The method CMAda presented in Section 4.2.2 extends the method proposed in our short paper dense:SFSU:eccv18 from a two-step approach to a more general multi-step approach. CMAda is a stand-alone approach, can be applied to real fog, and already outperforms competing methods, as shown in our experiments. In the next section, we present an extension of CMAda, dubbed CMAda+, which further boosts its performance.
4.3 CMAda+ with Synthetic and Densified Real Fog
As shown in (6), images in the synthetic dataset have exactly the same fog density as images in the (next) target domain. Images in the real dataset , however, have lower fog density than the target fog density as defined in (7). While the lower fog density of the real training images paves the way for the self-learning branch of CMAda on real foggy images, the remaining domain gap due to the difference in fog density hampers finding the ‘optimal’ solution. In Section 4.3.1, we present a method to densify real foggy images to the desired fog density. The fog densification method is general and can be applied beyond CMAda. In Section 4.3.2, we use our fog densification method to upgrade the dataset defined in (7) to a densified foggy dataset, which will be used along with the synthetic dataset to train the model .
4.3.1 Fog Densification of a Real Foggy Scene
We aim at synthesizing images with increased fog density compared to already foggy real input images for which no depth information is available. In this way, we can generate multiple synthetic versions of each split of our real Foggy Zurich dataset, where each synthetic version is characterized by a different, controlled range of fog densities, so that these densified foggy images can also be leveraged in our curriculum adaptation. To this end, we utilize our fog density estimator and propose a simple yet effective approach for increasing fog density when no depth information is available for the input foggy image, by using the assumption of constant transmittance in the scene.
More formally, we denote the input real foggy image by and assume that it can be expressed through the optical model (3). In contrast to our fog simulation on clear-weather scenes in Section 3, the clear scene radiance is unknown, and the input foggy image cannot be used directly as its substitute for synthesizing a foggy image with increased fog density, as does not correspond to clear weather. Since the scene distance , which determines the transmittance through (4), is also unknown, we make the simplifying assumption that the transmittance map for is globally constant, i.e.
and use the statistics for scene distance computed on Cityscapes, which features depth maps, to estimate . By using the distance statistics from Cityscapes, we implicitly assume that the distribution of distances in Cityscapes roughly matches that of our Foggy Zurich dataset, which is supported by the fact that both datasets contain similar road scenes. In particular, we apply our fog density estimator on to get an estimate of the input attenuation coefficient. The values for scene distance of all pixels in Cityscapes are collected into a histogram with distance bins, where are the bin centers and are the relative frequencies of the bins. We use each bin center as representative of all samples in the bin and compute as a weighted average of the transmittance values that correspond to the different bins through (4):
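The weighted average described above can be sketched in a few lines. The sketch assumes that (4) is the usual exponential attenuation law, i.e. that transmittance at distance d under attenuation coefficient beta equals exp(-beta * d); the function name, argument names and toy histogram are hypothetical and chosen only for illustration.

```python
import math


def global_transmittance(beta, bin_centers, rel_freqs):
    """Estimate a globally constant transmittance from a distance histogram.

    Each bin center d_i stands in for all samples of its bin, and the result
    is the frequency-weighted average of exp(-beta * d_i).
    Assumption: transmittance follows t = exp(-beta * distance).
    """
    if len(bin_centers) != len(rel_freqs):
        raise ValueError("bin_centers and rel_freqs must have equal length")
    total = sum(rel_freqs)
    if total <= 0:
        raise ValueError("relative frequencies must sum to a positive value")
    weighted = sum(p * math.exp(-beta * d) for d, p in zip(bin_centers, rel_freqs))
    # Dividing by the total is a safeguard in case the frequencies
    # are not exactly normalized.
    return weighted / total


if __name__ == "__main__":
    # Toy histogram (made-up values): distances in meters, relative frequencies.
    centers = [5.0, 20.0, 80.0, 300.0]
    freqs = [0.4, 0.3, 0.2, 0.1]
    print(global_transmittance(0.005, centers, freqs))
    print(global_transmittance(0.02, centers, freqs))
```

The same function can be called once with the estimated input attenuation coefficient and once with the target coefficient, which yields the two global transmittance values needed for densification.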
For the output densified foggy image , we select a target attenuation coefficient and again estimate the corresponding global transmittance value similarly to (10), this time plugging into the formula. The output image is finally computed via (3) as
Equation (13) implies that our fog densification method can bypass the explicit calculation of the clear scene radiance in (11), as the output image does not depend on . In this way, we completely avoid dehazing our input foggy image as an intermediate step, which would pose challenges as it constitutes an inverse problem, and reduce the inference problem just to the estimation of the attenuation coefficient by assuming a globally constant transmittance. Moreover, (13) implies that the change in the value of a pixel with respect to is linear in the difference . This means that distant parts of the scene, where , are not modified significantly in the output, i.e. . On the contrary, our fog densification modifies the appearance of those parts of the scene which are closer to the camera and shifts their color closer to that of the estimated atmospheric light irrespective of their exact distance from the camera. This can be observed in the example of Figure 4, where the closer parts of the input scene such as the red car on the left and the vegetation on the right have brighter colors in the synthesized output. The overall shift to brighter colors is verified by the accompanying RGB histograms of the input and output images in Figure 4.
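The densification step admits a compact sketch. Under the constant-transmittance assumption, substituting the clear scene radiance out of the optical model gives, in notation assumed here for illustration, output = (t_out / t_in) * input + (1 - t_out / t_in) * L, where t_in and t_out are the global transmittances for the input and target attenuation coefficients and L is the estimated atmospheric light. This is our reading of the properties stated above (no dependence on the clear scene radiance, change linear in the difference between input and atmospheric light), and all names below are hypothetical.

```python
def densify_pixel(pixel, atmospheric_light, t_in, t_out):
    """Blend one RGB pixel toward the atmospheric light.

    The clear scene radiance never appears: only the ratio of the two
    global transmittances is needed.
    """
    ratio = t_out / t_in
    return tuple(ratio * c + (1.0 - ratio) * a
                 for c, a in zip(pixel, atmospheric_light))


def densify_image(image, atmospheric_light, t_in, t_out):
    """Densify fog in an image given as rows of RGB tuples in [0, 1].

    Requires 0 < t_out <= t_in <= 1, i.e. the target fog is at least
    as dense as the input fog.
    """
    if not 0.0 < t_out <= t_in <= 1.0:
        raise ValueError("expected 0 < t_out <= t_in <= 1")
    return [[densify_pixel(p, atmospheric_light, t_in, t_out) for p in row]
            for row in image]


if __name__ == "__main__":
    light = (0.9, 0.9, 0.9)
    # One dark nearby pixel and one distant pixel already close to the light.
    image = [[(0.2, 0.1, 0.1), (0.88, 0.9, 0.9)]]
    print(densify_image(image, light, t_in=0.6, t_out=0.3))
```

In the toy example, the dark pixel moves noticeably toward the atmospheric light while the pixel that already resembles it barely changes, matching the behavior described for close and distant parts of the scene.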
4.3.2 Fog Densification of a Real Foggy Dataset
In principle, images in defined in (7) can be densified so that they all have fog density . This may completely close the domain gap due to differing fog density, but it adds heavy constant fog to images that initially contain very light fog. This introduces other domain discrepancies, as the fog densification method makes simplifying assumptions. Thus, we propose to use the method in a different way.
Given the dataset defined in (7), instead of densifying all to , we choose to perform a linear densification from to . In particular, given a real foggy image with its estimated fog density , the target fog density (expressed as an attenuation coefficient) is defined as
4.4 Semantic Scene Understanding in Multiple Weather Conditions
In Section 4.2.2 and Section 4.3, specialized approaches have been developed for semantic scene understanding under (dense) fog. In real-world applications, weather conditions change constantly; for instance, the weather can change from foggy to sunny or vice versa at any time. We argue that semantic scene understanding methods need to be robust and/or adaptive to those changes. To this end, we present Model Selection, a method to adaptively select the specialized model according to the weather condition.
4.4.1 Model Selection
In this work, two expert models are used, one for clear weather and one for foggy conditions. In particular, a two-class classifier is trained to distinguish clear-weather from foggy images, with images from the Cityscapes dataset used as the training data for the former class and images from three Foggy Cityscapes-DBF datasets (with three attenuation coefficients , , and ) as the training data for the latter. AlexNet alexnet is used for this task.
Denoting by the expert model for foggy conditions, the expert model for clear weather, and the classifier, the semantic labels of a given test image can be obtained by
where label indicates class clear-weather and foggy.
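Model Selection amounts to a simple dispatch, sketched below. The models are represented as plain callables, and the label encoding (`clear_label=0`) is an assumption made for this illustration; the function and argument names are hypothetical.

```python
def select_and_segment(image, classifier, clear_model, fog_model, clear_label=0):
    """Route an image to the expert model chosen by the weather classifier.

    `classifier` returns a class label for the image; any label other than
    `clear_label` is treated as foggy. The label encoding is an assumption.
    """
    label = classifier(image)
    expert = clear_model if label == clear_label else fog_model
    return expert(image)


if __name__ == "__main__":
    # Toy stand-ins: an "image" is a dict with a mean-brightness entry.
    def toy_classifier(img):
        return 1 if img["brightness"] > 0.7 else 0

    def toy_clear_model(img):
        return "labels from clear-weather expert"

    def toy_fog_model(img):
        return "labels from fog expert"

    print(select_and_segment({"brightness": 0.9}, toy_classifier,
                             toy_clear_model, toy_fog_model))
```

Extending the dispatch to more weather conditions would replace the two-way branch with a mapping from class labels to expert models.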
The method is not limited to these two conditions and has the potential to be applied to multiple adverse weather conditions such as and .
5 The Foggy Zurich Dataset
We present the Foggy Zurich dataset, which comprises images depicting foggy road scenes in the city of Zurich and its suburbs. We provide annotations for semantic segmentation for 40 of these scenes that contain dense fog.
5.1 Data Collection
Foggy Zurich was collected during multiple rides with a car inside the city of Zurich and its suburbs using a GoPro Hero 5 camera. We recorded four large video sequences, and extracted video frames corresponding to those parts of the sequences where fog is (almost) ubiquitous in the scene at a rate of one frame per second. The extracted images are manually cleaned by removing the duplicates (if any), resulting in foggy images in total. The resolution of the frames is pixels. We mounted the camera inside the front windshield, since we found that mounting it outside the vehicle resulted in significant deterioration in image quality due to blurring artifacts caused by dew.
In particular, the small water droplets that compose fog condense and form dew on the surface of the lens very shortly after the vehicle starts moving, which causes severe blurring artifacts and contrast degradation in the image, as shown in Figure 5 (camera mounted outside the windshield). On the contrary, mounting the camera inside the windshield, as we did when collecting Foggy Zurich, prevents these blurring artifacts and affords much sharper images, to which the windshield surface incurs minimal artifacts, as shown in Figure 5 (camera mounted inside the windshield).
5.2 Annotation of Images with Dense Fog
We use our fog density estimator presented in Section 4.2.1 to rank all images in Foggy Zurich according to fog density. Based on the ordering, we manually select 40 images with dense fog and diverse visual scenes, and construct the test set of Foggy Zurich therefrom, which we term Foggy Zurich-test. The aforementioned selection is performed manually in order to guarantee that the test set has high diversity, which compensates for its relatively small size in terms of statistical significance of evaluation results. We annotate these images with fine pixel-level semantic annotations using the 19 evaluation classes of the Cityscapes dataset Cityscapes: road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorcycle and bicycle. In addition, we assign the void label to pixels which do not belong to any of the above 19 classes, or the class of which is uncertain due to the presence of fog. Every such pixel is ignored for semantic segmentation evaluation. Comprehensive statistics for the semantic annotations of Foggy Zurich-test are presented in Figure 6. Furthermore, we note that individual instances of person, rider, car, truck, bus, train, motorcycle and bicycle are annotated separately, which additionally induces bounding box annotations for object detection for these 8 classes, although we focus solely on semantic segmentation in this paper.
We also distinguish the semantic classes that occur frequently in Foggy Zurich-test. These “frequent” classes are: road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, sky, and car. When performing evaluation on Foggy Zurich-test, we occasionally report the average score over this set of frequent classes, which feature plenty of examples, as a second metric to support the corresponding results.
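The two averages can be computed from a per-class score table as sketched below. The class lists are taken from the text; the helper name `mean_score` and the example scores are hypothetical and do not correspond to any reported result.

```python
CITYSCAPES_CLASSES = [
    "road", "sidewalk", "building", "wall", "fence", "pole", "traffic light",
    "traffic sign", "vegetation", "terrain", "sky", "person", "rider", "car",
    "truck", "bus", "train", "motorcycle", "bicycle",
]

FREQUENT_CLASSES = [
    "road", "sidewalk", "building", "wall", "fence", "pole", "traffic light",
    "traffic sign", "vegetation", "sky", "car",
]


def mean_score(per_class_score, classes):
    """Average a per-class score (e.g. IoU) over the given class subset."""
    if not classes:
        raise ValueError("class list must not be empty")
    return sum(per_class_score[c] for c in classes) / len(classes)


if __name__ == "__main__":
    # Made-up scores for illustration only.
    scores = {c: 0.5 for c in CITYSCAPES_CLASSES}
    scores.update({c: 0.6 for c in FREQUENT_CLASSES})
    print("all classes:     ", mean_score(scores, CITYSCAPES_CLASSES))
    print("frequent classes:", mean_score(scores, FREQUENT_CLASSES))
```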
Although there exist a number of prominent large-scale datasets for semantic road scene understanding, such as KITTI kitti, Cityscapes Cityscapes and Mapillary Vistas Mapillary, most of these datasets contain few or even no foggy scenes, which can be attributed partly to the rarity of fog and the difficulty of annotating foggy images. Through manual inspection, we found that even Mapillary Vistas, which was specifically designed to also include scenes with adverse conditions such as snow, rain or nighttime, in fact contains very few images with fog, i.e., on the order of 10 images out of 25000, with many more images depicting misty scenes, which have , i.e., significantly better visibility than foggy scenes Federal:meteorological:handbook.
To the best of our knowledge, the only previous dataset for semantic foggy scene understanding whose scale exceeds that of Foggy Zurich-test is Foggy Driving SFSU_synthetic, with 101 annotated images. However, we found that most images in Foggy Driving contain relatively light fog and most images with dense fog are annotated coarsely. Compared to Foggy Driving, Foggy Zurich comprises a much greater number of high-resolution foggy images. Its larger, unlabeled part is highly relevant for unsupervised or semi-supervised approaches such as the one we have presented in Section 4.2.2, while the smaller, labeled Foggy Zurich-test set features fine semantic annotations for the particularly challenging setting of dense fog, making a significant step towards evaluation of semantic segmentation models in this setting. In Table 5.2, we compare the overall annotation statistics of Foggy Zurich-test to some of the aforementioned existing datasets; we note that the comparison involves a test set (Foggy Zurich-test) and unions of training plus validation sets (KITTI and Cityscapes), which are much larger than the respective test sets. The comparatively lower number of humans and vehicles per image in Foggy Zurich-test is not a surprise, as the condition of dense fog that characterizes the dataset discourages road transportation and reduces traffic.