Towards Adaptive Semantic Segmentation by Progressive Feature Refinement


As one of the fundamental tasks in computer vision, semantic segmentation plays an important role in real-world applications. Although numerous deep learning models have made notable progress on several mainstream datasets with the rapid development of convolutional networks, they still encounter various challenges in practical scenarios. Unsupervised adaptive semantic segmentation aims to obtain a robust classifier, trained with source domain data, that maintains stable performance when deployed to a target domain with a different data distribution. In this paper, we propose a progressive feature refinement framework, combined with domain adversarial learning, to boost the transferability of segmentation networks. Specifically, we first align the multi-stage intermediate feature maps of source and target domain images, and then adopt a domain classifier to discriminate the segmentation output. As a result, segmentation models trained with source domain images can be transferred to a target domain without significant performance degradation. Experimental results verify the effectiveness of our proposed method compared with state-of-the-art methods.


Bin Zhang, Shengjie Zhao, Rongqing Zhang
School of Software Engineering, Tongji University, Shanghai, China
Key Laboratory of Embedded System and Service Computing, Tongji University, Shanghai, China
Keywords: semantic segmentation, domain adaptation, feature refinement, deep learning

1 Introduction

Recent advances in deep learning [8] have revolutionized many computer vision tasks, such as person re-identification, depth estimation, and semantic segmentation. Semantic segmentation can be regarded as a dense prediction task, which aims to assign a category label (e.g., building, sidewalk, bus, train) to each image pixel. Thanks to the emergence of diverse large-scale datasets [5], a wide variety of deep learning models [1, 16, 2, 23] have achieved remarkable breakthroughs. Nevertheless, collecting high-resolution and well-annotated datasets is a laborious process, especially for tasks that require pixel-level annotation. An alternative solution is to take advantage of synthetic data collected from simulated environments, where a virtually unlimited number of computer-generated images is available.

Figure 1: Complementary utilization of semantic classifier and domain classifier for adaptive segmentation.

Despite this, generic segmentation models are commonly domain-specific and thus inevitably suffer from the dataset bias problem [17], which hinders their cross-domain generalization to novel scenes. Therefore, how to improve the adaptation ability of deep learning models is of great significance.

Unsupervised domain adaptation offers a formal framework for addressing the above-mentioned issues by bridging the gap between the source and target domains. The core idea is to transfer domain-invariant knowledge from a source domain with sufficient labels to a target domain without annotation. Most previous approaches [22, 9] either minimize the difference between the intermediate feature distributions of source and target data via adversarial learning, or explicitly translate source domain data into the target domain in the input space. For image-level tasks such as image classification, it is usually sufficient to align the feature maps extracted by deep convolutional neural networks across the source and target domains. However, these methods often fail on dense prediction tasks such as semantic segmentation, which encodes complicated relationships among diverse object categories.

To this end, we propose a learning methodology for cross-domain semantic segmentation, termed progressive feature refinement, which disentangles the style and content representations of source and target domain images. In this way, the source and target features can be aligned stage by stage, and the resulting feature maps are less sensitive to domain shift. Our basic concept is illustrated in Figure 1. In addition to the conventional segmentation network, we follow [18] and employ a domain classifier to obtain domain-invariant output. The proposed framework can be trained end-to-end in an adversarial manner. Evaluations on a mainstream benchmark demonstrate that our approach outperforms most state-of-the-art baselines.

2 Related Work

2.1 Semantic Segmentation

Semantic segmentation has long been a research hotspot in computer vision and is valuable for a large number of applications, including autonomous driving, robot scene understanding, and medical image analysis. Powered by high-capacity deep neural networks, [1] equipped ResNet-101 with an atrous spatial pyramid pooling module and reached high segmentation accuracy. In order to enlarge the receptive field while retaining high-resolution feature maps, [20] aggregated multi-scale context information through dilated convolution. Although traditional deep learning models have proven effective on the segmentation task, designing their configuration is non-trivial. Most recently, [13] adopted a neural architecture search (NAS) strategy with reinforcement learning to discover the network structure automatically rather than relying on tedious manual design.

2.2 Domain Adaptation for Semantic Segmentation

While semantic segmentation is a well-researched topic, fewer efforts have been made to explore the adaptation of segmentation models. The authors of [9] employed image-to-image translation networks to convert source domain images into the target style, followed by adversarial learning to ensure that the extracted feature maps are domain-agnostic. By introducing the concept of domain flow, [7] generated a series of intermediate domains and smoothly mitigated the domain gap. The method in [4] combined a GAN-based augmentation strategy with self-ensembling techniques and produced augmented labeled images by mimicking the target domain style. Different from the above global alignment methods, which ignore semantic consistency, [12] took category-level information into consideration and alleviated the negative transfer problem. The authors of [14] showed that most visual tasks are closely related to each other and proposed a unified learning scheme to investigate the latent relationships across distinct tasks and domains. Apart from domain adaptation, [21] also tackled the more difficult domain generalization case, where both the data and the labels of the target domain are unavailable.

3 Proposed Method

In this paper, we focus on the unsupervised cross-domain semantic segmentation setting, where we are given a labeled source domain dataset with pixel-level annotations and an unlabeled target domain dataset. Our goal is to use the source dataset to train a model that can provide precise segmentation maps for images in the target domain. Figure 2 depicts the overall architecture of our proposed framework, which is composed of the semantic segmentation network, the progressive feature refinement (PFR) module, and the domain adversarial learning module in the output space.

3.1 Semantic Segmentation Network

We adopt the same setting as [18] and use the DeepLab-v2 network with a pretrained ResNet-101 [8] backbone as our base model. We discard the last fully connected layer and set the strides of the last two convolution layers to 1. Given an image from the source domain dataset, the semantic segmentation network is optimized to produce a per-pixel segmentation output over the semantic categories. This is accomplished by minimizing a segmentation loss under the supervision of the corresponding ground truth label map.
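A common form of such a segmentation loss is the pixel-wise cross-entropy used by the baseline [18]. As a sketch, with symbols chosen here for illustration ($P_s$ is the softmax output of the network for a source image, $Y_s$ the one-hot ground truth label map, $H \times W$ the image size, and $C$ the number of categories), it would read:

```latex
\mathcal{L}_{seg} = -\sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} Y_s^{(h,w,c)} \log P_s^{(h,w,c)}
```

Only the term of the true class contributes at each pixel, so the loss penalizes low predicted probability for the annotated category.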


3.2 Progressive Feature Refinement

To align the feature maps of source and target domain data, we insert an additional PFR module into the existing semantic segmentation network. We are inspired by the fact that the style and content of an image are separable [11], and our main idea is to disentangle the style and content feature representations of source and target domain images in order to align them progressively.

More formally, at each stage of the ResNet-101 backbone we take the content features of the source and target domain images. Similarly, the source and target style features are given by the Gram matrix [6] computed from the corresponding content feature. We define a content loss and a style loss to match the content and style features of source domain images to those of target domain images.
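The relation between content features, Gram-matrix style features, and the two losses can be illustrated with a minimal sketch. It assumes a mean squared difference as the distance and a Gram matrix normalized by the number of spatial positions; both choices, and all function names, are illustrative assumptions rather than the exact definitions used in our implementation.

```python
def gram_matrix(feature):
    """Gram matrix of a feature map given as C channels of H*W flattened activations.

    Normalizing by the number of spatial positions is an assumption of this sketch.
    """
    n = len(feature[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in feature]
            for ci in feature]


def mean_squared_diff(x, y):
    """Mean squared difference between two equally shaped nested lists."""
    flat_x = [v for row in x for v in row]
    flat_y = [v for row in y for v in row]
    return sum((a - b) ** 2 for a, b in zip(flat_x, flat_y)) / len(flat_x)


def content_loss(f_src, f_tgt):
    """Distance between content features (assumed: mean squared difference)."""
    return mean_squared_diff(f_src, f_tgt)


def style_loss(f_src, f_tgt):
    """Distance between style features, i.e. Gram matrices of the content features."""
    return mean_squared_diff(gram_matrix(f_src), gram_matrix(f_tgt))


# Toy features with 2 channels and 2 spatial positions (made-up numbers).
# The target is the source with its spatial positions swapped.
f_src = [[1.0, 2.0], [0.0, 1.0]]
f_tgt = [[2.0, 1.0], [1.0, 0.0]]

print(content_loss(f_src, f_tgt))  # 1.0
print(style_loss(f_src, f_tgt))    # 0.0
```

The toy example shows why the two terms are complementary: the Gram matrix discards spatial layout, so two feature maps can share the same style statistics while differing in content.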


Figure 2: The overall pipeline of our adaptive semantic segmentation framework. We augment the general semantic segmentation network with the progressive feature refinement module, and incorporate domain adversarial learning to further improve the segmentation result in the output space.

The training objective of the PFR module is a combination of two parts, i.e., the style loss and the content loss.

3.3 Domain Adversarial Learning

On the one hand, the segmentation loss is optimized with source domain images only and thus does not contribute to narrowing the domain discrepancy. On the other hand, the PFR module only reduces the domain variance in the feature space, and there is no guarantee of domain-agnostic output. Therefore, we further integrate domain adversarial learning to rectify the segmentation results. We apply an auxiliary domain classifier to the predictions of both source and target domain images and train it to discriminate whether a segmentation output comes from the source domain or the target domain. The adversarial loss is defined on the output of this domain classifier.
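The two roles in this adversarial game can be sketched with a binary cross-entropy, in the style of the output-space adaptation of [18]. The label convention (1 for source, 0 for target), the function names, and the use of scalar probabilities instead of per-pixel maps are simplifying assumptions of this sketch.

```python
import math


def bce(prob, label, eps=1e-12):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def domain_classifier_loss(p_src, p_tgt):
    """Loss for the domain classifier.

    p_src / p_tgt are its predicted "source" probabilities for source and
    target segmentation outputs. Assumed convention: source = 1, target = 0.
    """
    losses = [bce(p, 1.0) for p in p_src] + [bce(p, 0.0) for p in p_tgt]
    return sum(losses) / len(losses)


def adversarial_loss(p_tgt):
    """Loss for the segmentation network: make target outputs look like source."""
    return sum(bce(p, 1.0) for p in p_tgt) / len(p_tgt)


# A confident, correct classifier has a low classifier loss ...
print(domain_classifier_loss([0.9], [0.1]) < domain_classifier_loss([0.5], [0.5]))
# ... while the segmentation network is rewarded when target outputs fool it.
print(adversarial_loss([0.9]) < adversarial_loss([0.1]))
```

The two losses pull in opposite directions on target predictions, which is what drives the segmentation output toward a domain-agnostic distribution.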


3.4 Network Optimization

We combine the aforementioned loss functions into the overall training objective of our framework, a weighted sum in which two adjustable hyper-parameters control the trade-off among the regularization terms. Both hyper-parameters are set to fixed values in our implementation.
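Written out, a weighted-sum objective of this kind could take the following form. The subscripted names $\lambda_{pfr}$ and $\lambda_{adv}$ are labels chosen here for illustration, and no particular values are implied:

```latex
\mathcal{L} = \mathcal{L}_{seg} + \lambda_{pfr}\,\mathcal{L}_{pfr} + \lambda_{adv}\,\mathcal{L}_{adv}
```

Setting either weight to zero would switch off the corresponding term, which makes the form convenient for ablating the PFR module and the adversarial module separately.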

4 Experiments

4.1 Datasets and Evaluation Metric

We verify the performance of our proposed approach on the GTA5 [15] → Cityscapes [5] domain adaptation task. Cityscapes is a large-scale dataset for evaluating the accuracy of semantic segmentation models, which covers urban scenes from several European countries. It is split into a training set with 2,975 samples, a testing set with 1,525 samples, and a validation set with 500 samples. The GTA5 dataset contains 24,966 high-definition images collected from the computer game Grand Theft Auto V. The dataset is automatically annotated into 19 categories, which are consistent with the Cityscapes dataset. As for the evaluation metric, we choose the commonly adopted Intersection over Union (IoU) for fair comparison:

$$\mathrm{IoU} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$$

where TP, FP, and FN stand for the number of true positives, false positives, and false negatives, respectively.
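As a small illustration of the metric, the sketch below counts TP, FP, and FN per class from flattened pixel labels and computes the IoU of each class. The function names and the toy labels are made up for this example.

```python
def iou(tp, fp, fn):
    """Intersection over Union from true positive, false positive and false negative counts."""
    denom = tp + fp + fn
    # A class absent from both prediction and ground truth has no defined IoU.
    return tp / denom if denom > 0 else float("nan")


def per_class_iou(pred, gt, num_classes):
    """IoU of every class, given two equally long sequences of pixel labels."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        fp = sum(1 for p, g in zip(pred, gt) if p == c and g != c)
        fn = sum(1 for p, g in zip(pred, gt) if p != c and g == c)
        scores.append(iou(tp, fp, fn))
    return scores


# Toy example: four pixels, two classes.
pred = [0, 0, 1, 1]
gt = [0, 1, 1, 1]
scores = per_class_iou(pred, gt, num_classes=2)
print(scores)  # class 0: 1 / (1 + 1 + 0), class 1: 2 / (2 + 0 + 1)
```

Averaging the per-class scores gives the mean IoU (mIoU) reported for segmentation benchmarks.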

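| Method | Backbone | road | sidewalk | building | wall | fence | pole | traffic light | traffic sign | vegetation | terrain | sky | person | rider | car | truck | bus | train | motorcycle | bicycle | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FCNs in the Wild [10] | VGG-16 | 70.4 | 32.4 | 62.1 | 14.9 | 5.4 | 10.9 | 14.2 | 2.7 | 79.2 | 21.3 | 64.6 | 44.1 | 4.2 | 70.4 | 8.0 | 7.3 | 0.0 | 3.5 | 0.0 | 27.1 |
| Curriculum [19] | VGG-16 | 72.9 | 30.0 | 74.9 | 12.1 | 13.2 | 15.3 | 16.8 | 14.1 | 79.3 | 14.5 | 75.5 | 35.7 | 10.0 | 62.1 | 20.6 | 19.0 | 0.0 | 19.3 | 12.0 | 31.4 |
| TGCF-DA [4] | VGG-16 | 90.2 | 51.5 | 81.1 | 15.0 | 10.7 | 37.5 | 35.2 | 28.9 | 84.1 | 32.7 | 75.9 | 62.7 | 19.9 | 82.6 | 22.9 | 28.3 | 0.0 | 19.3 | 12.0 | 42.5 |
| ROAD [3] | VGG-16 | 85.4 | 31.2 | 78.6 | 27.9 | 22.2 | 21.9 | 23.7 | 11.4 | 80.7 | 29.3 | 68.9 | 48.5 | 14.1 | 78.0 | 19.1 | 23.8 | 9.4 | 8.3 | 0.0 | 35.9 |
| CyCADA [9] | ResNet-101 | 86.7 | 35.6 | 80.1 | 19.8 | 17.5 | 38.0 | 39.9 | 41.5 | 82.7 | 27.9 | 73.6 | 64.9 | 19.0 | 65.0 | 12.0 | 28.6 | 4.5 | 31.1 | 42.0 | 42.7 |
| CLAN [12] | ResNet-101 | 87.0 | 27.1 | 79.6 | 27.3 | 23.3 | 28.3 | 35.5 | 24.2 | 83.6 | 27.4 | 74.2 | 58.6 | 28.0 | 76.2 | 33.1 | 36.7 | 6.7 | 31.9 | 31.4 | 43.2 |
| AdaptSegNet [18] | ResNet-101 | 86.5 | 36.0 | 79.9 | 23.4 | 23.3 | 23.9 | 35.2 | 14.8 | 83.4 | 33.3 | 75.6 | 58.5 | 27.6 | 73.7 | 32.5 | 35.4 | 3.9 | 30.1 | 28.1 | 42.4 |
| DLOW [7] | ResNet-101 | 87.1 | 33.5 | 80.5 | 24.5 | 13.2 | 29.8 | 29.5 | 26.6 | 82.6 | 26.7 | 81.8 | 55.9 | 25.3 | 78.0 | 33.5 | 38.7 | 0.0 | 22.9 | 34.5 | 42.3 |
| Ours | ResNet-101 | 90.8 | 40.8 | 81.9 | 28.4 | 24.4 | 24.2 | 32.0 | 17.6 | 83.8 | 36.6 | 72.4 | 59.3 | 29.0 | 82.1 | 35.5 | 45.6 | 3.3 | 29.1 | 28.6 | 44.5 |

Table 1: Performance comparison (per-class IoU and mIoU, in %) between baseline approaches and ours under the GTA5 → Cityscapes setting.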

Figure 3: Qualitative segmentation results on the GTA5 → Cityscapes domain adaptation task. We present (a) target image, (b) Non-Adapted, (c) Conventional Adapt [18], (d) Ours, and (e) Ground Truth. Details are highlighted by the black boxes.

4.2 Performance Comparison

We take the GTA5 dataset as the source domain and the Cityscapes dataset as the target domain. Table 1 summarizes the comparison between baseline approaches and ours on the GTA5 → Cityscapes domain adaptation task. Equipped with the ResNet-101 backbone, our approach reaches the best mIoU (44.5) among the compared methods. Looking into the per-category results, we observe that the improvement over other methods primarily comes from the “road”, “building”, “wall”, “fence”, “terrain”, “rider”, “truck”, and “bus” classes. Compared with input-space alignment baselines, which translate source domain images into the target style [9], our PFR is more memory-efficient since it requires no extra image-to-image translation networks. Although PFR achieves limited performance gains over class-wise alignment methods [12] on some less frequent objects, which may trigger negative transfer, it performs strongly on dominant classes such as “road” and “building”. Moreover, we visualize some qualitative segmentation examples and their corresponding ground truth in Figure 3. Our method segments object boundaries more precisely and produces smoother output.
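The mIoU column of Table 1 is the mean of the 19 per-class IoU values in each row. As a quick consistency check, the sketch below recomputes it for our row and for the AdaptSegNet [18] row; the variable names are chosen for this example, and the numbers are copied from Table 1.

```python
# Per-class IoU (%) in Table 1 order, from "road" to "bicycle".
ours = [90.8, 40.8, 81.9, 28.4, 24.4, 24.2, 32.0, 17.6, 83.8, 36.6,
        72.4, 59.3, 29.0, 82.1, 35.5, 45.6, 3.3, 29.1, 28.6]
adaptsegnet = [86.5, 36.0, 79.9, 23.4, 23.3, 23.9, 35.2, 14.8, 83.4, 33.3,
               75.6, 58.5, 27.6, 73.7, 32.5, 35.4, 3.9, 30.1, 28.1]


def mean_iou(per_class):
    """Mean IoU over classes, rounded to one decimal as in Table 1."""
    return round(sum(per_class) / len(per_class), 1)


print(mean_iou(ours))         # 44.5
print(mean_iou(adaptsegnet))  # 42.4
```

Both values agree with the reported mIoU entries.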

5 Conclusions

In this paper, we propose a progressive feature refinement method for cross-domain semantic segmentation. The proposed PFR offers a new perspective by incorporating content and style alignment into the segmentation network. The experimental results demonstrate that PFR outperforms most current state-of-the-art unsupervised domain adaptation methods on the GTA5 → Cityscapes task.

Acknowledgement. This work is supported in part by the National Key Research and Development Project under Grant 2019YFB2102300 and 2019YFB2102301, and in part by the National Natural Science Foundation of China under Grant 61936014 and 61901302.


References

  1. L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy and A. L. Yuille (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI 40 (4), pp. 834–848.
  2. L. C. Chen, Y. Zhu, G. Papandreou, F. Schroff and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV.
  3. Y. Chen, W. Li and L. Van Gool (2018) ROAD: reality oriented adaptation for semantic segmentation of urban scenes. In CVPR, pp. 7892–7901.
  4. J. Choi, T. Kim and C. Kim (2019) Self-ensembling with GAN-based data augmentation for domain adaptation in semantic segmentation. In ICCV, pp. 6830–6840.
  5. M. Cordts et al. (2016) The Cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223.
  6. L. A. Gatys, A. S. Ecker and M. Bethge (2016) Image style transfer using convolutional neural networks. In CVPR, pp. 2414–2423.
  7. R. Gong, W. Li, Y. Chen and L. Van Gool (2019) DLOW: domain flow for adaptation and generalization. In CVPR, pp. 2477–2486.
  8. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
  9. J. Hoffman, E. Tzeng, T. Park, J. Y. Zhu and T. Darrell (2018) CyCADA: cycle-consistent adversarial domain adaptation. In ICML, pp. 1994–2003.
  10. J. Hoffman, D. Wang, F. Yu and T. Darrell (2016) FCNs in the wild: pixel-level adversarial and constraint-based adaptation. arXiv:1612.02649.
  11. X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, pp. 1501–1510.
  12. Y. Luo, L. Zheng, T. Guan, J. Yu and Y. Yang (2019) Taking a closer look at domain shift: category-level adversaries for semantics consistent domain adaptation. In CVPR, pp. 2507–2516.
  13. V. Nekrasov, H. Chen, C. Shen and I. Reid (2019) Fast neural architecture search of compact semantic segmentation models via auxiliary cells. In CVPR, pp. 9126–9135.
  14. P. Z. Ramirez, A. Tonioni, S. Salti and L. D. Stefano (2019) Learning across tasks and domains. In ICCV, pp. 8110–8119.
  15. S. R. Richter, V. Vineet, S. Roth and V. Koltun (2016) Playing for data: ground truth from computer games. In ECCV, pp. 102–118.
  16. K. Sun, B. Xiao, D. Liu and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. In CVPR, pp. 5693–5703.
  17. A. Torralba and A. A. Efros (2011) Unbiased look at dataset bias. In CVPR, pp. 1521–1528.
  18. Y. H. Tsai, W. C. Hung, S. Schulter, K. Sohn and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In CVPR, pp. 7472–7481.
  19. Z. Yang, P. David and B. Gong (2017) Curriculum domain adaptation for semantic segmentation of urban scenes. In ICCV, pp. 2039–2049.
  20. F. Yu and V. Koltun (2016) Multi-scale context aggregation by dilated convolutions. In ICLR.
  21. X. Yue et al. (2019) Domain randomization and pyramid consistency: simulation-to-real generalization without accessing target domain data. In ICCV, pp. 2100–2110.
  22. W. Zhang, W. Ouyang, W. Li and D. Xu (2018) Collaborative and adversarial network for unsupervised domain adaptation. In CVPR, pp. 3801–3809.
  23. H. Zhao, J. Shi, X. Qi, X. Wang and J. Jia (2017) Pyramid scene parsing network. In CVPR, pp. 2881–2890.