Adversarial Learning and Self-Teaching Techniques for Domain Adaptation in Semantic Segmentation

Abstract

Deep learning techniques have been widely used in autonomous driving systems for the semantic understanding of urban scenes. However, they need a huge amount of labeled data for training, which is difficult and expensive to acquire. A recently proposed workaround is to train deep networks using synthetic data, but the domain shift between real world and synthetic representations limits the performance. In this work, a novel Unsupervised Domain Adaptation (UDA) strategy is introduced to solve this issue. The proposed learning strategy is driven by three components: a standard supervised learning loss on labeled synthetic data; an adversarial learning module that exploits both labeled synthetic data and unlabeled real data; and, finally, a self-teaching strategy applied to unlabeled data. The last component exploits a region growing framework guided by the segmentation confidence. Furthermore, we weight this component on the basis of the class frequencies to enhance the performance on less common classes. Experimental results prove the effectiveness of the proposed strategy in adapting a segmentation network trained on synthetic datasets, like GTA5 and SYNTHIA, to real world datasets like Cityscapes and Mapillary.

I Introduction

A key component of any autonomous driving system is the capability of understanding the surrounding environment from visual data. Nowadays, this is achieved using semantic segmentation techniques mostly based on deep learning strategies. Deep networks have shown impressive performance on this task. However, they have the key drawback that a huge amount of labeled data is required for their training, especially when recent, highly complex architectures are used. In the autonomous driving scenario, the pixel-level annotation must be manually provided for a huge amount of frames acquired by cameras mounted on cars driving around. This annotation is expensive and requires a huge amount of work. Some recent papers [1, 2] introduced a workaround for this issue, using computer generated data to train the networks. The realistic rendering models developed by the video game industry can be used to produce a large amount of high quality rendered road scenes [1]. Despite the impressive realism of recent video game graphics, there is still a large domain shift between computer generated data and real world images acquired by video cameras on cars. To really exploit computer generated data in real world applications, the domain shift issue needs to be addressed. Various techniques have been introduced to tackle it: data augmentation techniques try to improve the generalization properties of the available data; transfer learning approaches allow the model to be adapted to the new domain but require some labeled data from the target domain; finally, Unsupervised Domain Adaptation (UDA) techniques improve the model exploiting unlabeled target data. We present a UDA strategy for road driving scenes able to adapt an initial learning performed on synthetic data to the real world scenario. The domain adaptation strategy presented in this work is based on adversarial learning and is an extension of our previous work [3]: here we further improve the self-teaching strategy and we present a more robust experimental evaluation. We focus on the training scenario where a large amount of annotated synthetic data is available but there are no labeled real world samples (or just a small amount that can be used for validation purposes but is not sufficient for training the deep network). The proposed method exploits a segmentation network based on the DeepLab v2 framework [4] that is trained using both labeled and unlabeled data in an adversarial learning framework with multiple components. The first component that controls the training is a standard cross-entropy loss exploiting ground truth annotations, used to perform a supervised training on synthetic data. The second is an adversarial learning scheme inspired by previous works on semi-supervised semantic segmentation, i.e., dealing with partially annotated datasets [5, 6]. We exploited a fully convolutional discriminator which produces a pixel-level confidence map distinguishing between data produced by the generator (both from real and synthetic data) and the ground truth segmentation maps. It allows the segmentation network to be trained in an adversarial setting using both synthetic labeled data and real world scenes without ground truth information. Finally, the third term is based on a self-teaching loss. This key component is based on the idea introduced in [5] that the output of the discriminator can also be used as a measure of the reliability of the network estimations.
This can in turn be exploited to select the reliable regions in a self-teaching framework. However, this component has been greatly improved in this work, both with respect to [5] and to [3]. First of all, the output of the discriminator is considered as a weight to be applied to the loss function of the self-teaching component at each location, in place of the hard threshold mask used in previous work [3]. Then, a novel region growing scheme is introduced in order to extend and better represent the shape of the reliable regions. This is a key difference because the previous approaches [5, 3] tend to almost always discard edge regions and small objects. Finally, since the various classes have different frequencies, we also weight the loss coming from unlabeled data on the basis of the frequency of the various classes in the synthetic dataset. This allows a better balance of the performance among the different classes. In particular, it avoids dramatic drops in performance on less common classes, such as small objects and structures.

The network has been trained on both synthetic labeled data (using the first and second component) and on unlabeled real world data (using the second and third component) and we were able to obtain accurate results on different real world datasets even without using labeled real world data. In particular, we used the synthetic datasets SYNTHIA and GTA5 for the supervised part and the real datasets Cityscapes and Mapillary (the latter has been introduced in this journal extension) for the unsupervised components and then tested on the respective validation sets, achieving state-of-the-art results on the unsupervised domain adaptation task.

II Related Work

Many different approaches for semantic segmentation of images have been proposed (see [7] for a recent review of the field). There are many different strategies for this task, but most current state-of-the-art approaches are based on encoder-decoder schemes and on the Fully Convolutional Network (FCN) model [8]. Some recent well-known and highly performing methods are DilatedNet [9], PSPNet [10] and DeepLab [4]. In particular, the latter is the model employed for the generator network in this work. All the approaches for generic images can also be applied to road scenes; however, since this is a very relevant application [11, 12], there has been a large effort both in the acquisition of datasets [13, 14, 15] and in the development of ad-hoc approaches [16, 17, 18].

These approaches show impressive performance but they all share the fundamental requirement of a large amount of labeled data for their training. They are typically trained on huge datasets with pixel-wise annotations (e.g., Cityscapes [13], CamVid [19] or Mapillary [14]), whose acquisition is highly expensive and time-consuming. Recent research, including the proposed work, focuses on how to deal with this issue, either by using only partially labeled data or by adapting a training performed on a different set of data with slightly different statistics to the problem of interest.

The first family of approaches we consider is semi-supervised methods. They can be divided into methods exploiting weakly annotated data (e.g., with only image-wise labels or only bounding boxes) [20, 21, 22, 23, 24, 25, 26, 27] and methods for which only part of the data is labeled while the rest is completely unlabeled [28, 29, 5, 21, 6]. The work of [30] opened the way to adversarial learning approaches for the semantic segmentation task, while [21] opened the way to their application to semi-supervised learning. The approaches of [5, 6] are also based on adversarial learning but exploit a Fully Convolutional Discriminator (FCD) trying to discriminate between the predicted probability maps and the ground truth segmentation distributions at the pixel level. These works targeted a scenario where only part of the dataset is labeled but the unlabeled data come from the same dataset and share the same domain distribution as the labeled ones.

The work of [3] starts from [5] but instead proposes to tackle a scenario where unlabeled data refers to a different dataset with a different domain distribution, i.e., it deals with the domain adaptation task. A common setting for this task is domain adaptation from synthetic data to real world scenes. Indeed, the development of advanced computer graphics techniques enabled the collection of huge synthetic datasets for semantic segmentation. Examples of synthetic semantic segmentation datasets for the autonomous driving scenario are the GTA5 [1] and SYNTHIA [2] datasets, which have been employed in this work. However, there is a cross-domain shift that has to be addressed when a neural network trained on synthetic data processes real-world images, since in this case training and test data are not drawn i.i.d. from the same underlying distribution as usually assumed [16, 31, 32, 33, 34].

Data augmentation techniques can be used to improve the generalization capabilities: some works propose to pre-process the synthetic images, i.e., the existing labeled data, to reduce the inherent discrepancy between real and synthetic domain distributions mainly using generative models based on Generative Adversarial Networks (GANs) [35, 36, 37, 38, 39]. Thus, these augmented labeled data are used to train the segmentation network to work in a more reliable way on the real domain.

Transfer learning techniques improve the generalization of the trained network, going from the source to the target domain, by using either weak supervision [40, 41, 27] or just a small amount of labeled target data [42, 43].

Differently, unsupervised domain adaptation techniques aim at training the networks in a supervised way on synthetic data and in an unsupervised way on real unlabeled data. This family of techniques has been widely investigated in classification tasks [44, 45, 46, 47] but its application to semantic segmentation is less explored. The first work to deal with cross-domain urban scene semantic segmentation is [48], where the adaptation is performed by aligning the features from the different domains during the adversarial training procedure. A curriculum-style learning approach is proposed in [16], where first the easier task of estimating global label distributions is learned and then the segmentation network is trained forcing the target label distribution to be aligned with the previously computed properties. Many other works addressed the domain adaptation problem with various techniques, including GANs [29, 49], output space alignment [50, 51], distillation loss [17, 52], class-balanced self-training [53], conservative loss [54], geometrical guidance [55], adaptation networks [56], entropy minimization [57] and cycle consistency [58, 11]. The latter technique also applies some kind of data augmentation to the synthetic data to make them more realistic.

Region growing techniques have been recently applied to domain adaptation in semantic segmentation [27, 26]. In particular, in [26] a semantic segmentation network is trained to segment the discriminative regions first and to progressively increase the pixel-level supervision by seeded region growing [59]. In [27] the authors propose a saliency guided weakly supervised segmentation network which utilizes salient information as guidance for weakly supervised segmentation through a seeded region growing procedure. In [60] the region growing problem is represented as a Markov Decision Process.

III Architecture of the Proposed Approach

Fig. 1: Architecture of the proposed framework. The optimization is guided by a discriminator loss L_D and 3 losses for the generator: a standard cross-entropy loss on synthetic data (L_{1,G}), an adversarial loss (L_{2,G}) and a self-teaching loss for unlabeled real data (L_{3,G}). Best viewed in color.

Our target is to train a semantic segmentation network in a supervised way on synthetic data and to adapt it in an unsupervised way to real data. In this paper, we name this network G, since it has the role of the generator in the proposed adversarial training framework. A supplementary discriminator network D is used to evaluate the reliability of G's output. This information can be employed to guide the adaptation of G to unlabeled real data. In this section, we detail the CNN architectures and the training procedure implementing the unsupervised domain adaptation. Our approach is agnostic to the architecture of G and in general any semantic segmentation network can be used. However, in our experiments G is a DeepLab v2 network [4]. This widely used model is based on the ResNet-101 backbone, whose weights were pre-trained [61] on the MSCOCO dataset [62].

Figure 1 shows the architecture of the proposed training framework. The optimization of the generator network G is driven by the minimization of three loss functions. The first loss function (L_{1,G}) is a standard multi-class cross-entropy. The segmentation network is trained to estimate, for each input pixel, the probability that it belongs to a class c inside the set of possible classes C. It is optimized only on labeled synthetic data since the ground truth is required. In the following, G(X_n^s) is used to represent the output of the segmentation network on the n-th input image X_n^s from the source (synthetic) domain, and Y_n is used to refer to the one-hot encoded ground truth segmentation related to input X_n^s. In this scenario, the multi-class cross-entropy loss is formulated as:

L_{1,G} = - \sum_{p} \sum_{c \in C} Y_n^{(p,c)} \log\left( G(X_n^s)^{(p,c)} \right)     (1)

where p is the index of a pixel in the considered image, c is a specific class belonging to C, and Y_n^{(p,c)} and G(X_n^s)^{(p,c)} are the values relative to pixel p and class c in the ground truth and in the generator (G) output, respectively. As mentioned above, this loss can be computed only on the source (synthetic) domain, where the semantic ground truth is available.
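
As an illustration of this first term, a minimal sketch in TensorFlow (the framework used later in Section V) could look as follows; the tensor layout [batch, height, width, classes], the function name and the assumption that the generator output is already softmax-normalized are ours, not taken from the paper:

```python
import tensorflow as tf

def supervised_ce_loss(probs_src, one_hot_gt, eps=1e-8):
    """Multi-class cross-entropy of Eq. (1) on labeled source (synthetic) data.

    probs_src:  [B, H, W, C] softmax output of the generator G on source images X^s.
    one_hot_gt: [B, H, W, C] one-hot encoded ground truth Y.
    """
    # -sum_c Y^(p,c) * log(G(X^s)^(p,c)) at every pixel p
    per_pixel = -tf.reduce_sum(one_hot_gt * tf.math.log(probs_src + eps), axis=-1)
    # average over pixels and batch (equivalent to the sum of Eq. (1) up to a constant factor)
    return tf.reduce_mean(per_pixel)
```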

The second and the third loss functions, minimized during the training of G, aim at adapting the semantic segmentation CNN to real data without using ground truth labels for real data. These loss functions are implemented by means of the discriminator network D, that is trained to distinguish segmentation maps produced by the generator from the ground truth ones. The peculiarity of this discriminator network is that it produces a per-pixel estimation, differently from traditional adversarial frameworks where the discriminator outputs a single binary value for the whole input image. The discriminator is made of a stack of convolutional layers with strided kernels and Leaky ReLU activation functions. The number of filters increases from the first layer to the last one, which has a single output channel, and the cascade is followed by a bilinear upsampling to match the original input image resolution. The discriminator is trained by minimizing the loss function L_D, that is a standard cross-entropy loss, between D's output and a binary label indicating if the input is produced by G (class 0) or if it is the ground truth one-hot encoded semantic segmentation (class 1). L_D can be formulated as:

L_D = - \sum_{p} \left[ (1 - y_n) \log\left( 1 - D(G(X_n))^{(p)} \right) + y_n \log\left( D(Y_n)^{(p)} \right) \right]     (2)

where y_n = 0 if the input of the discriminator is the generator output and y_n = 1 if it is a ground truth segmentation map.

Notice that the class 0, associated to G's output, can be produced both from an input X_n^s coming from the source domain and from a real world input X_n^t. This means that D can be trained on both synthetic and real data, trying to discriminate generated data from ground truth one. The segmentation maps of the source and target datasets share similar statistics, since the low level features of the color images are processed and leave place to the class statistics: for this reason the training of D on real and synthetic data is possible. Another possible issue in the training procedure could be related to the well distinguishable Dirac-distributed segmentation ground truth data. In principle, this could be easily distinguished from data produced by G. However, we have investigated this issue and in general G produces segmentation maps very close to the Dirac distribution after a few training steps. This forces D to capture also other statistical properties of the two different types of input data. Notice that this issue has been investigated also in [5, 3] with similar conclusions. The discriminator is used to implement the second loss function for the training of G, L_{2,G}. This loss function is an adversarial loss since G, the generator in the traditional adversarial training scheme, is updated in order to create an output that has to look similar to ground truth data from D's viewpoint. On a generic image this loss function can be formulated as:

L_{2,G} = - \sum_{p} \log\left( D(G(X_n))^{(p)} \right)     (3)

As for the training of D (Eq. 2), L_{2,G} can be optimized both on the source and on the target data. In case the input is coming from the source dataset, we will refer to the loss function of Eq. 3 with L_{2,G}^s, otherwise we will refer to it with L_{2,G}^t in case of target data as input. Notice that the generator is forced to adapt to the target real domain in an unsupervised way by minimizing L_{2,G}^t: G is forced to produce data similar to what D considers ground truth also on real data. Remember that the ground truth is not used for this loss.
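
For concreteness, a fully convolutional discriminator of this kind and the two losses L_D and L_{2,G} could be sketched as below. This is only an illustrative TensorFlow implementation: the kernel size, stride and channel progression are typical choices for this family of discriminators and are assumptions, not necessarily the exact values used in our experiments.

```python
import tensorflow as tf

def build_fcd(num_filters=(64, 128, 256, 512)):
    """Fully convolutional discriminator D: strided convolutions with Leaky ReLU
    activations, ending in a single-channel per-pixel confidence map (logits)."""
    layers = []
    for f in num_filters:  # assumed channel progression
        layers.append(tf.keras.layers.Conv2D(f, 4, strides=2, padding="same"))
        layers.append(tf.keras.layers.LeakyReLU(0.2))
    layers.append(tf.keras.layers.Conv2D(1, 4, strides=2, padding="same"))
    return tf.keras.Sequential(layers)

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(d_logits_fake, d_logits_real):
    """L_D (Eq. 2): maps produced by G are class 0, ground truth maps are class 1."""
    return bce(tf.zeros_like(d_logits_fake), d_logits_fake) + \
           bce(tf.ones_like(d_logits_real), d_logits_real)

def adversarial_loss(d_logits_fake):
    """L_{2,G} (Eq. 3): push G's output to be classified as ground truth by D."""
    return bce(tf.ones_like(d_logits_fake), d_logits_fake)

def upsample_to(d_logits, target_hw):
    """Bilinear upsampling of D's map to the input image resolution."""
    return tf.image.resize(d_logits, target_hw, method="bilinear")
```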

The third loss function is inspired by the work of Hung et al. [5]. The idea is to interpret the output of the discriminator as a measure of the reliability of the output of G in case of both synthetic and real data. This reliability measure is used to realize a self-training on real data. The predictions of G that are assumed to be reliable by D are converted to their one-hot encoding and are used as a self-taught ground truth to train G on unlabeled target real data. This loss can be formulated as

L_{3,G} = - \sum_{p} \sum_{c \in C} W_D^{(p)} W_C^{(c)} \hat{Y}_n^{(p,c)} \log\left( G(X_n^t)^{(p,c)} \right)     (4)

where \hat{Y}_n is the one-hot encoded ground truth derived from the per-class argmax of the generated probability map G(X_n^t). Each contribution to the loss is weighted by two terms. The first (W_D) is a weighting term dependent on the output of the discriminator, refined by a region growing procedure that exploits pixel aggregation to improve the confidence estimation. The second (W_C) is a weighting function proportional to the class frequencies on the source domain.
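
Before detailing the two weighting terms, the structure of this self-teaching loss can be sketched as follows (again an illustrative TensorFlow snippet under our own naming conventions; the two weight maps W_D and W_C are assumed to be already available and are discussed next):

```python
import tensorflow as tf

def self_teaching_loss(probs_tgt, w_d, w_c, eps=1e-8):
    """L_{3,G} (Eq. 4): weighted cross-entropy against self-generated labels.

    probs_tgt: [B, H, W, C] softmax output of G on unlabeled target (real) images X^t.
    w_d:       [B, H, W]    per-pixel confidence weights W_D (0 outside the reliable mask).
    w_c:       [C]          per-class weights W_C computed once on the source domain.
    """
    # pseudo ground truth: one-hot encoding of the per-pixel argmax of G's prediction
    pseudo = tf.one_hot(tf.argmax(probs_tgt, axis=-1), depth=probs_tgt.shape[-1])
    pseudo = tf.stop_gradient(pseudo)
    # per-pixel cross-entropy, weighted per class (W_C) and per pixel (W_D)
    ce = -tf.reduce_sum(w_c * pseudo * tf.math.log(probs_tgt + eps), axis=-1)
    return tf.reduce_mean(w_d * ce)
```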

More in detail, the first term finds the reliable locations in the segmented map and assigns to them a weight interpreted as a confidence measure. The module computing this weighting mask takes as input a real image X_n^t. In the first step, a mask M is computed selecting confident points by applying a threshold T_D to the output of the discriminator computed on G(X_n^t). In this phase, the discriminator output is interpreted as a confidence map related to the segmentation map estimated on X_n^t. Formally, at each pixel location p we have:

M^{(p)} = \begin{cases} 1 & \text{if } D(G(X_n^t))^{(p)} > T_D \\ 0 & \text{otherwise} \end{cases}     (5)

In the second step, for a generic confident pixel p in M, assigned by G to class c*, the algorithm expands the confident region to a generic adjacent pixel q if the output of the segmentation network for the class c* (i.e., the one selected for point p) is greater than a threshold T_G at location q. More formally, q is added to the mask if G(X_n^t)^{(q,c*)} > T_G. We will denote with M^{RG} the mask obtained by applying this region growing process to the original mask M. Finally, for each location selected by the updated mask M^{RG} the weight is given by the corresponding output of the discriminator D(G(X_n^t)). Thus, the resulting weights are:

W_D^{(p)} = \begin{cases} D(G(X_n^t))^{(p)} & \text{if } M^{RG,(p)} = 1 \\ 0 & \text{otherwise} \end{cases}     (6)

i.e., the weight is equal to the discriminator output for points selected by M^{RG} and to 0 for points not selected by the mask. Empirically we set the two thresholds T_D and T_G so as to achieve high reliability when expanding the confidence map.
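
A possible implementation of this confidence weighting is sketched below. Here the iterative region growing is approximated with a few class-aware morphological dilation steps of the seed mask; the threshold values and the number of growing iterations are placeholders chosen for illustration, not the values used in our experiments.

```python
import tensorflow as tf

def confidence_weights(d_conf, probs_tgt, t_d=0.2, t_g=0.9, steps=3):
    """Compute W_D (Eqs. 5-6): thresholded discriminator confidence expanded by a
    class-aware region growing implemented with morphological dilation.

    d_conf:    [B, H, W, 1] discriminator confidence in [0, 1] for G(X^t),
               upsampled to the image resolution.
    probs_tgt: [B, H, W, C] softmax output of G on the target image.
    t_d, t_g, steps: assumed thresholds and number of growing iterations.
    """
    num_classes = probs_tgt.shape[-1]
    pred = tf.one_hot(tf.argmax(probs_tgt, axis=-1), depth=num_classes)   # [B,H,W,C]
    seeds = tf.cast(d_conf > t_d, tf.float32) * pred      # Eq. (5): confident seeds, split per class
    eligible = tf.cast(probs_tgt > t_g, tf.float32)       # pixels with probability > T_G for each class
    grown = seeds
    for _ in range(steps):
        # expand each per-class region to its 3x3 neighborhood ...
        dilated = tf.nn.max_pool2d(grown, ksize=3, strides=1, padding="SAME")
        # ... and accept neighbors only where G is confident about that same class
        grown = tf.maximum(grown, dilated * eligible)
    m_rg = tf.reduce_max(grown, axis=-1, keepdims=True)    # union over classes -> M^RG
    return tf.squeeze(m_rg * d_conf, axis=-1)              # Eq. (6): W_D
```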

The second weighting function, W_C, is related to the class frequency on the source domain. It is defined as:

W_C^{(c)} = \frac{ \left| \{ p \in Y_n^s,\ n = 1,\dots,N^s : Y_n^{(p,c)} = 1 \} \right| }{ \left| \{ p \in Y_n^s,\ n = 1,\dots,N^s \} \right| }     (7)

where |·| represents the cardinality of the considered set and N^s is the number of source images.

This weighting function balances the overall loss when unlabeled data of the target set are used, avoiding that rare and tiny objects (e.g., traffic lights or poles) are forgotten and replaced by more frequent and large ones (such as road or building). Notice that W_C is estimated on source data since the ground truth of the target data is assumed to be unknown during the training phase. Furthermore, W_C does not change during the training process and so it is computed only once.
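
Since W_C depends only on the source annotations, it can be pre-computed once before training, for instance along these lines (a sketch with our own conventions, assuming integer label maps and the id 255 for unlabeled pixels):

```python
import numpy as np

def class_frequency_weights(label_maps, num_classes, ignore_id=255):
    """Pre-compute the per-class weights W_C of Eq. (7) from the source ground truth.

    label_maps:  iterable of 2D integer arrays with the source labels.
    num_classes: number of semantic classes |C|.
    Returns an array of length num_classes with the fraction of source pixels of each class.
    """
    counts = np.zeros(num_classes, dtype=np.float64)
    total = 0
    for labels in label_maps:
        valid = labels[labels != ignore_id]            # discard unlabeled pixels
        counts += np.bincount(valid.ravel(), minlength=num_classes)[:num_classes]
        total += valid.size
    return (counts / max(total, 1)).astype(np.float32)
```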

Finally, the overall loss function for the training of G is a weighted combination of the three losses, i.e.:

L_G = L_{1,G} + \lambda_{adv}^s L_{2,G}^s + \lambda_{adv}^t L_{2,G}^t + \lambda_{st} L_{3,G}     (8)

We empirically set the weighting parameters as specified in Section V. The discriminator is trained by minimizing L_D (Eq. 2) on ground truth labels and on the generator output computed on a mixed batch composed of both source and target data. During the first training steps, the self-teaching loss is disabled by setting λ_st = 0, allowing the discriminator to learn how to produce higher quality confidence maps before using them. After this initial phase, all the three components of the loss are enabled and the training ends after a fixed number of steps.
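
Putting the pieces together, a single training iteration could be organized as in the following sketch, which reuses the helper functions outlined above. The values of the three balancing weights and the way the two networks share a batch are assumptions for illustration only.

```python
import tensorflow as tf

@tf.function
def train_step(G, D, opt_G, opt_D, x_src, y_src, x_tgt, w_c,
               lam_adv_s=0.01, lam_adv_t=0.01, lam_st=0.1, enable_st=True):
    """One optimization step of the framework (lambda values are placeholders)."""
    with tf.GradientTape() as tape_g, tf.GradientTape() as tape_d:
        p_src = G(x_src, training=True)            # softmax maps on synthetic data
        p_tgt = G(x_tgt, training=True)            # softmax maps on real data
        d_src = D(p_src, training=True)            # per-pixel logits of D
        d_tgt = D(p_tgt, training=True)
        d_real = D(y_src, training=True)           # D applied to ground truth one-hot maps

        l1 = supervised_ce_loss(p_src, y_src)                                  # Eq. (1)
        l2 = lam_adv_s * adversarial_loss(d_src) + \
             lam_adv_t * adversarial_loss(d_tgt)                               # Eq. (3)
        if enable_st:                                                          # disabled in the warm-up phase
            d_conf = tf.sigmoid(tf.image.resize(d_tgt, tf.shape(p_tgt)[1:3]))
            w_d = confidence_weights(d_conf, p_tgt)                            # Eqs. (5)-(6)
            l3 = lam_st * self_teaching_loss(p_tgt, w_d, w_c)                  # Eq. (4)
        else:
            l3 = 0.0
        loss_g = l1 + l2 + l3                                                  # Eq. (8)
        loss_d = discriminator_loss(d_src, d_real) + \
                 discriminator_loss(d_tgt, d_real)                             # Eq. (2)

    opt_G.apply_gradients(zip(tape_g.gradient(loss_g, G.trainable_variables),
                              G.trainable_variables))
    opt_D.apply_gradients(zip(tape_d.gradient(loss_d, D.trainable_variables),
                              D.trainable_variables))
    return loss_g, loss_d
```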

IV Datasets

The proposed unsupervised domain adaptation framework has been trained and evaluated using 4 different datasets. Recall that the target is to train the semantic segmentation network using labeled synthetic road scenes while no labels are available for real world data. The supervised synthetic training exploits two publicly available datasets, i.e., GTA5 [1] and SYNTHIA [2]. The real world datasets used for the unsupervised adaptation and for the result evaluation are instead Cityscapes [13] and Mapillary [14]. Notice that the evaluation scenario is the same as that of recent competing approaches such as [48, 29, 16], in order to allow for a fair comparison. During the training stage all the images have been resized and cropped to a smaller resolution due to memory constraints. The testing on the real datasets, instead, has been carried out at their original resolution.

The GTA5 dataset [1] contains 24966 synthetic images with pixel level semantic annotation. The images have been rendered using the open-world video game Grand Theft Auto 5 and are all from the car perspective in the streets of American-style virtual cities. The images have an impressive visual quality and are very realistic since they come from a high budget commercial production. Most of the images have been used for the supervised training, while a small subset has been taken out for validation purposes. There are 19 semantic classes which are compatible with the ones of the exploited real world datasets.

The SYNTHIA-RAND-CITYSCAPES subset of the SYNTHIA dataset [2] contains 9400 synthetic images with pixel level semantic annotation. The images have been rendered with an ad-hoc engine, allowing to obtain a large variability of street scenes. In this case they come from virtual European-style towns in different environments under various light and weather conditions. On the other hand, the visual quality is not the same as that of the commercial video game GTA5. The semantic labels are compatible with 16 of the classes of Cityscapes (for the evaluation on the Cityscapes dataset, only the classes contained in both datasets are taken into consideration). Most of the images have been used for the supervised training, while a small subset has been taken out for validation purposes.

The Cityscapes dataset [13] contains color images with pixel level semantic annotation. The images have a resolution of 2048×1024 and have been captured on the streets of European cities. We used the labels only for experimental evaluation, since the domain adaptation procedure is unsupervised. The original training set of 2975 images (without the labels) has been used for unsupervised adaptation, while the 500 images in the original validation set have been used as a test set (as done by competing approaches, since the test set labels are not available).

The Mapillary dataset [14] contains 25000 high resolution images taken from different cameras in many different locations. The variability in classes, appearance, acquisition settings and geo-localization makes the dataset the most complete and of the highest quality in the field. As for Cityscapes, we used this dataset for unsupervised domain adaptation and testing. The semantic annotations have been remapped to the labels of the Cityscapes dataset following the mapping in [18]. We exploited the 18000 training images (without the labels) for unsupervised training and the 2000 images in the original validation set as test set (as done by competing approaches).

| | Hoffman et al. [48] | Hung et al. [5] | Zhang et al. [16] | Biasetton et al. [3] | Ours |
| Domain adaptation strategy | Adversarial feature alignment, label distributions | Adversarial learning and self-teaching | Labels distribution (global and on superpixels) | Adversarial learning and self-teaching | Adversarial learning and self-teaching |
| D input | Source features vs. target features | Ground truth vs. predicted maps | - | Ground truth vs. predicted maps | Ground truth vs. predicted maps |
| D output | Binary | Confidence map | - | Confidence map | Confidence map |
| G backbone | FCN-8s | Deeplab v2 | FCN-8s | Deeplab v2 | Deeplab v2 |
| Loss | CE, ADV-Feat., LD | CE, ADV, ST | CE, superpixel, LD | CE, ADV, ST | CE, ADV, ST |
| Self-teaching | No | Yes | No | Yes | Yes, with soft selection and region growing |
| Class-weighting | Yes | No | No | Yes | Yes |
| Pre/post processing | Label distribution and superpixel | - | Superpixel segmentation | - | - |
TABLE I: Summary of compared methodologies. Loss components are expressed as CE: cross entropy, ADV: adversarial loss, LD: labels distribution, ST: self-teaching.
GTA5 → Cityscapes. Per-class results; columns from left to right: road, sidewalk, building, wall, fence, pole, t light, t sign, veg, terrain, sky, person, rider, car, truck, bus, train, mbike, bike, mean.
Supervised (L_{1,G} only) 45.3 20.6 50.1 9.3 12.7 19.5 4.3 0.7 81.9 21.1 63.3 52.0 1.7 77.9 26.0 39.8 0.1 4.7 0.0 27.9
Ours (full) 81.0 19.6 65.8 20.7 12.9 20.9 6.6 0.2 82.4 33.0 68.2 54.9 6.2 80.3 28.1 41.6 2.4 8.5 0.0 33.3
Hoffman et al. [48] 70.4 32.4 62.1 14.9 5.4 10.9 14.2 2.7 79.2 21.3 64.6 44.1 4.2 70.4 8.0 7.3 0.0 3.5 0.0 27.1
Hung et al. [5] 81.7 0.3 68.4 4.5 2.7 8.5 0.6 0.0 82.7 21.5 67.9 40.0 3.3 80.7 34.2 45.9 0.2 8.7 0.0 29.0
Zhang et al. [16] 74.9 22.0 71.7 6.0 11.9 8.4 16.3 11.1 75.7 13.3 66.5 38.0 9.3 55.2 18.8 18.9 0.0 16.8 14.6 28.9
Biasetton et al. [3] 54.9 23.8 50.9 16.2 11.2 20.0 3.2 0.0 79.7 31.6 64.9 52.5 7.9 79.5 27.2 41.8 0.5 10.7 1.3 30.4
TABLE II: mIoU on the different classes of the Cityscapes validation set. The approaches have been trained in a supervised way on the GTA5 dataset and the unsupervised domain adaptation has been performed using the Cityscapes training set. The highest value in each column is highlighted in bold.
SYNTHIA → Cityscapes. Per-class results; columns from left to right: road, sidewalk, building, wall, fence, pole, t light, t sign, veg, sky, person, rider, car, bus, mbike, bike, mean.
Supervised (L_{1,G} only) 10.3 20.5 35.5 1.5 0.0 28.9 0.0 1.2 83.1 74.8 53.5 7.5 65.8 18.1 4.7 1.0 25.4
Ours (full) 80.7 0.3 75.0 0.0 0.0 19.5 0.0 0.4 84.0 79.4 46.6 0.8 80.8 32.8 0.5 0.5 31.3
Hoffman et al. [48] 11.5 19.6 30.8 4.4 0.0 20.3 0.1 11.7 42.3 68.7 51.2 3.8 54.0 3.2 0.2 0.6 20.1
Hung et al. [5] 72.5 0.0 63.8 0.0 0.0 16.3 0.0 0.5 84.7 76.9 45.3 1.5 77.6 31.3 0.0 0.1 29.4
Zhang et al. [16] 65.2 26.1 74.9 0.1 0.5 10.7 3.7 3.0 76.1 70.6 47.1 8.2 43.2 20.7 0.7 13.1 29.0
Biasetton et al. [3] 78.4 0.1 73.2 0.0 0.0 16.9 0.0 0.2 84.3 78.8 46.0 0.3 74.9 30.8 0.0 0.1 30.2
TABLE III: mIoU on the different classes of the Cityscapes validation set. The approaches have been trained in a supervised way on the SYNTHIA dataset and the unsupervised domain adaptation has been performed using the Cityscapes training set. The highest value in each column is highlighted in bold.

V Experimental Results

The target of the proposed approach is to adapt a deep network trained on synthetic data to real world scenes. To evaluate the performance on this task we used the 4 different datasets introduced in Section IV. We started by evaluating the performance on the validation set of Cityscapes. In the first experiment, we trained the network using the scenes from the GTA5 dataset to compute the supervised loss L_{1,G} and the adversarial loss L_{2,G}^s, while the training scenes of the Cityscapes dataset have been used for the unsupervised domain adaptation, i.e., to compute the losses L_{2,G}^t and L_{3,G}. Notice that no labels from the Cityscapes training set have been used. In the second experiment, we performed the same procedure but we replaced the GTA5 dataset with the SYNTHIA one.

Then, we switched to the Mapillary dataset and we repeated the two experiments using this dataset: we performed the supervised training with GTA5 or SYNTHIA and we used the training set of Mapillary, without any label, for the unsupervised domain adaptation. Similarly to the previous scenario, we evaluated the results on the validation split of Mapillary.

The proposed deep learning scheme has been implemented using the TensorFlow framework. The generator network G (we used a DeepLab v2 model) has been trained as proposed in [4] using the Stochastic Gradient Descent (SGD) optimizer with momentum and weight decay. The discriminator D has been trained using the Adam optimizer. The learning rate employed for both G and D started from an initial value and was progressively decreased by means of a polynomial decay. We trained the two networks for a fixed number of iterations on an NVIDIA GTX 1080 Ti GPU. The longest training inside this work, i.e., the one with all the loss components enabled, takes about 20 hours to complete. Further resources and the code of our approach are available at: https://lttm.dei.unipd.it/paper_data/semanticDA.
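
For reference, the optimizer setup with a polynomial learning-rate decay can be written as in the sketch below; the initial learning rate, the decay power and the total number of steps are placeholder values, not the exact figures used in our experiments.

```python
import tensorflow as tf

TOTAL_STEPS = 20000  # placeholder for the total number of training iterations

lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=2.5e-4,   # assumed starting learning rate
    decay_steps=TOTAL_STEPS,
    end_learning_rate=1e-6,         # assumed final learning rate
    power=0.9)                      # assumed decay power

# Generator (DeepLab v2): SGD with momentum; weight decay can be added through
# kernel regularizers on the model layers.
opt_G = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)
# Fully convolutional discriminator: Adam optimizer.
opt_D = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```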

We assess the quality of our approach by computing the mean Intersection over Union (mIoU), as done by all competing approaches. Moreover, we compared our results with some recent frameworks [48, 16, 5, 3], whose main design choices are briefly summarized in Table I.

V-A Evaluation on the Cityscapes Dataset

We started the experimental evaluation from the Cityscapes dataset. The performance has been computed by comparing the predictions on the Cityscapes validation set with the ground truth. Table II refers to the first experiment (i.e., using GTA5 for the supervised training). It shows the accuracy obtained with standard supervised training, with the proposed approach and with some state-of-the-art approaches. By simply training the network in a supervised way on the GTA5 dataset and then performing inference on real world data from the Cityscapes dataset, a mIoU of 27.9% can be obtained. The proposed unsupervised domain adaptation strategy allows to enhance the accuracy to 33.3%, with an improvement of 5.4%. By looking more in detail at the various class accuracies, it is possible to see that the accuracy has increased on almost all the classes (only on two of them the accuracy has slightly decreased). In particular, there is a large improvement on the most common classes corresponding to large structures, since the domain adaptation strategy allows to better learn their statistics in the new domain. At the same time the performance improves also on less frequent classes corresponding to small objects due to the usage of the class weights in the self-teaching loss component.

The method of Hung et al. [5], based on a similar framework, achieves a lower accuracy of 29.0%, mostly because it struggles with small structures and uncommon classes. The methods in [16, 48] also have lower performance; however, they are also based on a different generator network. The older version of our method, introduced in [3], achieves an accuracy of 30.4%, with a gap of almost 3% w.r.t. the proposed approach, proving that the newly introduced elements (i.e., the weighting in the self-teaching and the region growing strategy) have a relevant impact on the performance.

Figure 2a shows the output of the supervised training, of the methods of [5] and [3] and of our approach on some sample scenes, using the GTA5 dataset as source dataset and Cityscapes as target one. The supervised training leads to reasonable results, but some small objects get lost or the object contours are badly captured (e.g., the rider in row 1 or the poles in row 3). Furthermore, some regions of the street are corrupted by noise (e.g., see rows 1 and 2). The approach of [5] seems to lose some structures (e.g., the terrain in the third row) and presents issues with small objects (the poles in row 3 get completely lost), as pointed out before. The old version of the approach [3] has better performance, for example the people are better preserved and the structures have better defined edges, but there are still artifacts, e.g., on the road surface in rows 2 and 3. Finally, the proposed method has the best performance, showing a good capability of detecting small objects and structures and at the same time a reliable recognition of the road and of the larger elements in the scene: in all the selected images it obtains a cleaner representation of the road, removing the sidewalk class where it is not present but at the same time correctly localizing it in the second row, differently from the other methods. A similar discussion holds for the terrain class in row 3 and for the pole class, whose detection has been highly improved w.r.t. [5].

Columns of Figure 2 (left to right): Image, Annotation, Supervised (L_{1,G} only), Hung et al. [5], Biasetton et al. [3], Ours (full). Panel (a): adaptation to Cityscapes from GTA5 and from SYNTHIA; panel (b): adaptation to Mapillary from GTA5 and from SYNTHIA.
Fig. 2: Semantic segmentation of some sample scenes extracted from the Cityscapes (a) and Mapillary (b) validation sets. The first group of six rows is related to the Cityscapes dataset, the last six to the Mapillary dataset. For each group, the first three rows are related to the experiments in which the GTA5 dataset is used as source. The last three rows are related to the case in which the SYNTHIA dataset is used as source (best viewed in colors).

When adapting from SYNTHIA, the task is even more challenging than in the GTA5 case, since the computer generated graphics are less realistic. By training the network in a supervised way on SYNTHIA and performing inference on the real world Cityscapes dataset, a mIoU of 25.4% can be obtained (see Table III). This value is smaller than the mIoU of 27.9% obtained by training on the GTA5 dataset. The performance gap confirms that the GTA5 dataset has a smaller domain shift with respect to real world data, when compared with the SYNTHIA dataset. By exploiting the proposed approach an accuracy of 31.3% can be obtained. The improvement is very similar to the one obtained using GTA5 as source dataset, proving that the approach is able to generalize to different datasets. In this case, there is a larger variability among different classes, however notice the very large improvement on the road and building classes. The previous version of the method [3] has an accuracy of 30.2%.

Furthermore, our framework outperforms the compared state-of-the-art approaches. The method of Hung et al. [5], that exploits the same generator architecture as our approach, obtains a mIoU equal to 29.4%. The approach of [16] has an even lower mIoU of 29.0%. The method of [48] is the least performing approach and in this comparison it is even less accurate than our synthetically supervised trained network; however, it employs a different segmentation network.

The fourth, fifth and sixth rows of Figure 2a show the output on the same sample scenes discussed above when the SYNTHIA dataset is used as source instead of GTA5. The first thing that stands out is that by training on the SYNTHIA dataset some very common classes such as sidewalk and road are highly corrupted. This is caused by the not very realistic textures used for such classes in the SYNTHIA dataset. Furthermore, while the positioning of the camera in the Cityscapes dataset is always fixed and mounted on-board inside the car, in SYNTHIA the camera can be placed in different positions. For example, the pictures can be captured from inside the car, from the top or from the side of the road.

The approach of Hung et al. [5] is able to correctly recognize the class road, correcting the noise present in the synthetic supervised training. However, as mentioned earlier, it suffers on small classes, where it tends to lose small objects and to produce imprecise shapes. The method of [3] and the proposed one have slightly better performance: the last two columns of Figure 2a show how the unsupervised adaptation and the self-teaching component allow to avoid all the artifacts on the road surface. The segmentation network now captures the real nature of this class in the Cityscapes dataset. At the same time, our method is able to locate a bit more precisely small classes such as person and vegetation. However, in this setting the difference between the old and new version of the proposed method is limited.

GTA5 → Mapillary. Per-class results; columns from left to right: road, sidewalk, building, wall, fence, pole, t light, t sign, veg, terrain, sky, person, rider, car, truck, bus, train, mbike, bike, mean.
Supervised (L_{1,G} only) 66.5 24.4 46.1 17.9 21.6 24.8 11.8 5.9 70.7 25.6 66.1 57.3 10.2 79.7 37.3 39.8 4.6 10.1 1.7 32.7
Ours (full) 79.9 28.0 73.4 23.0 29.5 20.9 1.1 0.0 79.5 39.6 95.0 57.6 9.0 80.6 41.5 40.1 7.4 24.8 0.1 38.5
Hung et al. [5] 78.2 29.7 68.7 10.0 6.7 17.5 0.0 0.0 76.4 35.2 95.6 53.8 13.8 77.5 34.3 30.2 5.0 21.8 0.0 34.4
Biasetton et al. [3] 71.4 25.0 62.0 20.4 17.6 26.8 5.9 0.8 64.6 24.6 86.5 58.3 14.7 80.0 39.3 42.2 5.5 22.3 0.1 35.2
TABLE IV: mIoU on the different classes of the Mapillary validation set. The approaches have been trained in a supervised way on the GTA5 dataset and the unsupervised domain adaptation has been performed using the Mapillary training set. The highest value in each column is highlighted in bold.
SYNTHIA → Mapillary. Per-class results; columns from left to right: road, sidewalk, building, wall, fence, pole, t light, t sign, veg, sky, person, rider, car, bus, mbike, bike, mean.
Supervised (L_{1,G} only) 14.7 18.6 34.6 5.4 0.1 28.5 0.0 0.4 73.8 62.9 50.0 11.4 74.3 28.7 14.0 8.1 26.6
Ours (full) 57.6 18.3 62.1 0.4 0.0 23.7 0.0 0.0 79.4 94.8 52.4 9.2 74.2 28.3 4.0 6.9 32.0
Hung et al. [5] 36.8 20.1 53.9 0.0 0.0 23.7 0.0 0.0 73.9 95.6 43.4 0.1 64.6 19.0 0.4 0.5 27.0
Biasetton et al. [3] 16.4 19.1 42.2 2.7 0.0 33.1 0.0 1.3 76.5 88.0 50.4 10.9 69.9 25.5 6.1 9.2 28.2
TABLE V: mIoU on the different classes of the Mapillary validation set. The approaches have been trained in a supervised way on the SYNTHIA dataset and the unsupervised domain adaptation has been performed using the Mapillary training set. The highest value in each column is highlighted in bold.

V-B Evaluation on the Mapillary Dataset

To ensure that our approach can generalize to other real datasets, we performed the same experimental evaluation procedure also on the Mapillary dataset. We started by using the GTA5 dataset for the supervised training as before. By simply performing a supervised training on GTA5 and then testing on the Mapillary dataset, a mIoU of 32.7% can be obtained. The proposed approach allows to obtain a much more accurate classification with a mIoU of 38.5%. Notice that the gain of almost 6% is consistent with the results obtained on the Cityscapes dataset, proving that the performance of the approach is stable across different datasets. The improvement can also be appreciated on both small and large classes: the mIoU values of 14 out of 19 classes show a clear gain. This is also visible in the qualitative results depicted in Figure 2b, where most of the artifacts on the road surface present in the synthetically trained network disappear and the shape of the small objects is more accurate. The results of [48] and [16] are not available for this dataset, however notice how the approach outperforms by a large margin both [5] and the old version of the approach [3], which are able to only partially reduce the artifacts on the road surface (visible in all the images), on the cars (row 1) and on the buildings (row 3).

Furthermore, we can appreciate that also on Mapillary the accuracy is lower when adapting from SYNTHIA, leading to a mIoU of only 26.6%. As for Cityscapes, the road and sidewalk classes have an extremely low accuracy due to the poor texture representation (the visual results are reported in the last 3 rows of Figure 2b). By exploiting the proposed unsupervised domain adaptation strategy the mIoU increases to 32.0%, with an improvement of 5.4%, again consistent with the other experiments. In this case, the performance is more unstable across the various classes but the large gains on the road and building classes are noticeable. This is also confirmed by the qualitative results, for example we can appreciate that the proposed approach is the only one able to achieve an accurate and reliable recognition of the road. The method of Hung et al. [5] achieves a mIoU of 27.0%, with a very limited improvement w.r.t. the synthetic supervised training. It is strongly penalized by the poor performance on the small and uncommon classes. The approach of [3] has slightly better performance (28.2%), but it has a quite large gap with respect to the proposed method. The weighting scheme and the region growing strategy introduced in this work allowed to obtain a large improvement in this setting.

V-C Ablation Study

In this section, we analyze the contributions of the various components of the proposed approach. We focus on Cityscapes as target dataset for this study. The results of this analysis are shown in Table VI. By training the generator with a synthetic supervised approach, i.e., using only L_{1,G}, it is possible to obtain a mIoU of 27.9% when GTA5 is the source dataset and of 25.4% when adapting from SYNTHIA. As mentioned in the previous sections, this is the least performing approach. A slight improvement can be obtained by adding the adversarial term L_{2,G} in the loss function. In this case, the mIoU increases by 1.5% and 2.0% when the source datasets are GTA5 and SYNTHIA, respectively. The use of the self-teaching loss L_{3,G} is particularly useful when exploiting the SYNTHIA dataset, obtaining an improvement of almost 3%, probably because the domain shift from this dataset is larger. In the case of GTA5, the improvement is smaller but still significant. Moving to the new elements introduced in this work, the region growing strategy (i.e., masking with the extended mask M^{RG} instead of using M) allows a further performance enhancement, especially when using GTA5, with a 2.2% increase, mostly due to the improved handling of medium and large size objects. When starting from SYNTHIA the gain is more limited but still noticeable (almost 1%). The usage of the discriminator output as a weighting factor for the self-teaching loss, without masking with M^{RG}, has a more unstable behavior. This leads to a very good improvement of 2.4% when starting from GTA5 but has almost no impact when employed alone in the SYNTHIA case. Moreover, we can observe that by removing the class weighting term W_C in Eq. 4, i.e., assigning the same weight to all the classes, we obtain a mIoU of 33.1% and 30.2% when adapting from GTA5 and SYNTHIA, respectively. These values are not too far from the mIoU of the complete version of the approach (33.3% and 31.3%, respectively). This proves that the performance is quite stable with respect to the setting of the weights for the various classes: the approach can work well even if the class frequencies on real data do not accurately match the statistics of synthetic data. Notice how the complete version of our approach has the best performance. In particular, the discriminator-based weighting on the SYNTHIA dataset, that alone had a limited impact, is useful when combined with the region growing scheme.

Finally, Table VII analyzes the impact of the weights that control the relative relevance of the 3 losses. It is possible to notice that the most critical parameter is the weight of the adversarial loss on the source domain, λ_adv^s, which has the largest impact on the final accuracy, while the performance is more stable with respect to the other two parameters.

Configuration | mIoU GTA5 | mIoU SYNTHIA
L_{1,G} only (supervised) | 27.9 | 25.4
L_{1,G} + L_{2,G} (adversarial) | 29.4 | 27.4
L_{1,G} + L_{2,G} + L_{3,G} (hard mask, as in [3]) | 30.4 | 30.2
+ region growing | 32.6 | 31.0
+ discriminator weighting | 32.8 | 30.2
+ region growing + discriminator weighting (no class weighting) | 33.1 | 30.2
Full (region growing, discriminator weighting and class weighting W_C) | 33.3 | 31.3
TABLE VI: Ablation study on the Cityscapes validation set. We analyze the influence of the losses L_{1,G}, L_{2,G}, L_{3,G}, the region growing, the discriminator weighting W_D and the class weighting W_C.
mIoU from GTA5 | mIoU from SYNTHIA
TABLE VII: Ablation study on the Cityscapes validation set on the balancing hyper-parameters of Eq. 8. Different values are obtained by applying various scaling factors to each of the three parameters, starting from the default settings used when adapting from GTA5 and from SYNTHIA.

VI Conclusions

In this paper, a novel scheme to perform unsupervised domain adaptation from synthetic urban scenes to real world ones has been proposed. Two different strategies have been used to exploit unlabeled data. The first is an adversarial learning framework based on a fully convolutional discriminator. The second is a soft self-teaching strategy, based on the assumption that predictions labeled as highly confident by the discriminator are reliable. Additionally, we improved this approach with a region growing module that further refines the confidence maps on the basis of the segmentation output on real-world images. Experimental results on the Cityscapes and Mapillary datasets prove the effectiveness of the proposed approach. In particular, we obtained good results both on large sized classes, thanks to the region growing procedure, and on particularly challenging small and uncommon ones, thanks to the class frequency weighting of the self-teaching loss.

Further research will be devoted to test the proposed framework with different backbone networks and to the exploitation of generative models to produce more realistic and refined synthetic training data to be fed to the framework.

Umberto Michieli received the M.Sc. degree in Telecommunication Engineering from the University of Padova in 2018. He is currently a Ph.D. student at the Department of Information Engineering of the University of Padova. In 2018, he spent six months as a Visiting Researcher with the Technische Universität Dresden. His research focuses on transfer learning techniques for semantic segmentation, in particular on domain adaptation and on incremental learning.

Matteo Biasetton received the B.Sc. degree in Information Engineering and the M.Sc. degree in Computer Engineering from the University of Padova, Italy, in 2016 and 2019, respectively. His Master thesis focused on domain adaptation for semantic segmentation. Currently, he is working as a Software Engineer in image processing at Microtec srl.

Gianluca Agresti received the M.Sc. degree in Telecommunication Engineering from University of Padova in 2016. Currently, he is a PhD candidate at the Department of Information Engineering of University of Padova. His research focuses on deep learning for ToF sensor data processing and multiple sensor fusion for 3D acquisition.

Pietro Zanuttigh received a Master degree in Computer Engineering at the University of Padova in 2003 where he also got the Ph.D. degree in 2007. Currently he is an assistant professor at the Department of Information Engineering. His research activity focuses on 3D data processing, in particular ToF sensors data processing, multiple sensors fusion for 3D acquisition, semantic segmentation and hand gesture recognition.

References

  1. S. R. Richter, V. Vineet, S. Roth, and V. Koltun, “Playing for data: Ground truth from computer games,” in Proceedings of European Conference on Computer Vision, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., vol. 9906.   Springer International Publishing, 2016, pp. 102–118.
  2. G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, “The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3234–3243.
  3. M. Biasetton, U. Michieli, G. Agresti, and P. Zanuttigh, “Unsupervised Domain Adaptation for Semantic Segmentation of Urban Scenes,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  4. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, pp. 834–848, 2018.
  5. W.-C. Hung, Y.-H. Tsai, Y.-T. Liou, Y.-Y. Lin, and M.-H. Yang, “Adversarial learning for semi-supervised semantic segmentation,” in Proceedings of the British Machine Vision Conference, 2018.
  6. X. Liu, J. Cao, T. Fu, Z. Pan, W. Hu, K. Zhang, and J. Liu, “Semi-supervised automatic segmentation of layer and fluid region in retinal optical coherence tomography images using adversarial learning,” IEEE Access, vol. 7, pp. 3046–3061, 2019.
  7. A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, P. Martinez-Gonzalez, and J. Garcia-Rodriguez, “A survey on deep learning techniques for image and video semantic segmentation,” Applied Soft Computing, vol. 70, pp. 41 – 65, 2018.
  8. J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
  9. F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in International Conference on Learning Representations, 2016.
  10. H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890.
  11. Y.-C. Chen, Y.-Y. Lin, M.-H. Yang, and J.-B. Huang, “Crdoco: Pixel-level domain transfer with cross-domain consistency,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1791–1800.
  12. U. Michieli and L. Badia, “Game theoretic analysis of road user safety scenarios involving autonomous vehicles,” in 2018 IEEE 29th Annual International Symposium on Personal, Indoor and Mobile Radio Communications.   IEEE, 2018, pp. 1377–1381.
  13. M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes dataset for semantic urban scene understanding,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
  14. G. Neuhold, T. Ollmann, S. Rota Bulo, and P. Kontschieder, “The Mapillary vistas dataset for semantic understanding of street scenes,” in Proceedings of International Conference on Computer Vision, 2017, pp. 4990–4999.
  15. F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and T. Darrell, “Bdd100k: A diverse driving video database with scalable annotation tooling,” arXiv preprint arXiv:1805.04687, 2018.
  16. Y. Zhang, P. David, and B. Gong, “Curriculum domain adaptation for semantic segmentation of urban scenes,” in Proceedings of International Conference on Computer Vision, 2017, pp. 2020–2030.
  17. Y. Chen, W. Li, and L. Van Gool, “Road: Reality oriented adaptation for semantic segmentation of urban scenes,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7892–7901.
  18. J. Kim and C. Park, “Attribute dissection of urban road scenes for efficient dataset integration,” in International Joint Conference on Artificial Intelligence Workshops, 2018, pp. 8–15.
  19. G. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object classes in video: A high-definition ground truth database,” Pattern Recognition Letters, pp. 88–97, 2009.
  20. D. Pathak, P. Krahenbuhl, and T. Darrell, “Constrained convolutional neural networks for weakly supervised segmentation,” in Proceedings of International Conference on Computer Vision, 2015, pp. 1796–1804.
  21. N. Souly, C. Spampinato, and M. Shah, “Semi and weakly supervised semantic segmentation using generative adversarial network,” arXiv preprint arXiv:1703.09695, 2017.
  22. A. Vezhnevets and J. M. Buhmann, “Towards weakly supervised semantic segmentation by means of multiple instance and multitask learning,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2010, pp. 3249–3256.
  23. Y. Wei, X. Liang, Y. Chen, X. Shen, M.-M. Cheng, J. Feng, Y. Zhao, and S. Yan, “STC: A simple to complex framework for weakly-supervised semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 11, pp. 2314–2320, 2017.
  24. S. Hong, H. Noh, and B. Han, “Decoupled deep neural network for semi-supervised semantic segmentation,” in Advances in Neural Information Processing Systems, 2015, pp. 1495–1503.
  25. J. Dai, K. He, and J. Sun, “Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation,” in Proceedings of International Conference on Computer Vision, 2015, pp. 1635–1643.
  26. Z. Huang, X. Wang, J. Wang, W. Liu, and J. Wang, “Weakly-supervised semantic segmentation network with deep seeded region growing,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7014–7023.
  27. F. Sun and W. Li, “Saliency guided deep network for weakly-supervised image segmentation,” Pattern Recognition Letters, 2019.
  28. G. Papandreou, L.-C. Chen, K. P. Murphy, and A. L. Yuille, “Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation,” in Proceedings of International Conference on Computer Vision, 2015, pp. 1742–1750.
  29. S. Sankaranarayanan, Y. Balaji, A. Jain, S. Nam Lim, and R. Chellappa, “Learning from synthetic data: Addressing domain shift for semantic segmentation,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3752–3761.
  30. P. Luc, C. Couprie, S. Chintala, and J. Verbeek, “Semantic segmentation using adversarial networks,” in NIPS Workshop on Adversarial Training, 2016.
  31. T. Tommasi, N. Patricia, B. Caputo, and T. Tuytelaars, “A deeper look at dataset bias,” in Domain Adaptation in Computer Vision Applications.   Springer, 2017, pp. 37–55.
  32. A. Torralba and A. Efros, “Unbiased look at dataset bias,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.   IEEE Computer Society, 2011, pp. 1521–1528.
  33. B. Gong, F. Sha, and K. Grauman, “Overcoming dataset bias: An unsupervised domain adaptation approach,” in NIPS Workshop on Large Scale Visual Recognition and Retrieval, vol. 3.   Citeseer, 2012.
  34. A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Torralba, “Undoing the damage of dataset bias,” in Proceedings of European Conference on Computer Vision.   Springer, 2012, pp. 158–171.
  35. A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, “Learning from simulated and unsupervised images through adversarial training,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2107–2116.
  36. X. Peng and K. Saenko, “Synthetic to real adaptation with generative correlation alignment networks,” in 2018 IEEE Winter Conference on Applications of Computer Vision.   IEEE, 2018, pp. 1982–1991.
  37. X. Wang and A. Gupta, “Generative image modeling using style and structure adversarial networks,” in Proceedings of European Conference on Computer Vision.   Springer, 2016, pp. 318–335.
  38. J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros, “Generative visual manipulation on the natural image manifold,” in Proceedings of European Conference on Computer Vision.   Springer, 2016, pp. 597–613.
  39. D. J. Im, C. D. Kim, H. Jiang, and R. Memisevic, “Generating images with recurrent adversarial networks,” arXiv preprint arXiv:1602.05110, 2016.
  40. D. Li, J.-B. Huang, Y. Li, S. Wang, and M.-H. Yang, “Weakly supervised object localization with progressive domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3512–3520.
  41. N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa, “Cross-domain weakly-supervised object detection through progressive domain adaptation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5001–5009.
  42. S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto, “Unified deep supervised domain adaptation and generalization,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5715–5725.
  43. K. Saito, D. Kim, S. Sclaroff, T. Darrell, and K. Saenko, “Semi-supervised domain adaptation via minimax entropy,” arXiv preprint arXiv:1904.06487, 2019.
  44. Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in International Conference on Machine Learning, 2015, pp. 1180–1189.
  45. Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
  46. M. Long, Y. Cao, J. Wang, and M. Jordan, “Learning transferable features with deep adaptation networks,” in International Conference on Machine Learning, 2015, pp. 97–105.
  47. E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, “Simultaneous deep transfer across domains and tasks,” in Proceedings of International Conference on Computer Vision, 2015, pp. 4068–4076.
  48. J. Hoffman, D. Wang, F. Yu, and T. Darrell, “FCNs in the wild: Pixel-level adversarial and constraint-based adaptation,” arXiv preprint arXiv:1612.02649, 2016.
  49. G. Agresti, H. Schaefer, P. Sartor, and P. Zanuttigh, “Unsupervised domain adaptation for tof data denoising with adversarial learning,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5584–5593.
  50. Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker, “Learning to adapt structured output space for semantic segmentation,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7472–7481.
  51. Y. Luo, L. Zheng, T. Guan, J. Yu, and Y. Yang, “Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation,” Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  52. U. Michieli and P. Zanuttigh, “Incremental Learning Techniques for Semantic Segmentation,” in Proceedings of International Conference on Computer Vision Workshops, 2019.
  53. Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang, “Unsupervised domain adaptation for semantic segmentation via class-balanced self-training,” in Proceedings of European Conference on Computer Vision, 2018, pp. 289–305.
  54. X. Zhu, H. Zhou, C. Yang, J. Shi, and D. Lin, “Penalizing top performers: Conservative loss for semantic segmentation adaptation,” in Proceedings of European Conference on Computer Vision, 2018, pp. 568–583.
  55. Y. Chen, W. Li, X. Chen, and L. Van Gool, “Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach,” arXiv preprint arXiv:1812.05040, 2018.
  56. Y. Zhang, Z. Qiu, T. Yao, D. Liu, and T. Mei, “Fully convolutional adaptation networks for semantic segmentation,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6810–6818.
  57. T.-H. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez, “Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2517–2526.
  58. J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell, “Cycada: Cycle-consistent adversarial domain adaptation,” in Proceedings of the 35th International Conference on Machine Learning, 2018.
  59. R. Adams and L. Bischof, “Seeded region growing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 6, pp. 641–647, 1994.
  60. G. Song, H. Myeong, and K. Mu Lee, “Seednet: Automatic seed generation with deep reinforcement learning for robust interactive segmentation,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1760–1768.
  61. V. Nekrasov, “Pre-computed weights for ResNet-101,” https://github.com/DrSleep/tensorflow-deeplab-resnet, Accessed: 2019-07-04.
  62. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proceedings of European Conference on Computer Vision.   Springer, 2014, pp. 740–755.