Universal Semi-Supervised Semantic Segmentation
In recent years, the need for semantic segmentation has arisen across several different applications and environments. However, the expense and redundancy of annotation often limit the quantity of labels available for training in any domain, while deployment is easier if a single model works well across domains. In this paper, we pose the novel problem of universal semi-supervised semantic segmentation and propose a solution framework, to meet the dual needs of lower annotation and deployment costs. In contrast to counterpoints such as fine tuning, joint training or unsupervised domain adaptation, universal semi-supervised segmentation ensures that across all domains: (i) a single model is deployed, (ii) unlabeled data is used, (iii) performance is improved, (iv) only a few labels are needed and (v) label spaces may differ. To address this, we minimize supervised as well as within and cross-domain unsupervised losses, introducing a novel feature alignment objective based on pixel-aware entropy regularization for the latter. We demonstrate quantitative advantages over other approaches on several combinations of segmentation datasets across different geographies (Germany, England, India) and environments (outdoors, indoors), as well as qualitative insights on the aligned representations.
1 Introduction
Semantic segmentation is the task of pixel level classification of an image into a predefined set of categories. State-of-the-art semantic segmentation architectures [31, 3, 8] pre-train deep networks on a classification task on datasets like ImageNet [12, 44] and then fine-tune on finely annotated labeled examples [11, 54]. The availability of such large-scale labeled datasets has been crucial to achieve high accuracies for semantic segmentation in applications ranging from natural scene understanding  to medical imaging . However, performance often suffers even in the presence of a minor domain shift. For example, a segmentation model trained on a driving dataset from a specific geographic location may not generalize to a new city due to differences in weather, lighting or traffic density. Further, a segmentation model trained on traffic scenes for outdoor navigation may not be applicable for an indoor robot.
While such domain shift is a challenge for any machine learning problem, it is particularly acute for segmentation, where human annotation is prohibitively expensive and redundant across different locations and tasks. Thus, there is a growing interest in learning segmentation representations that may be shared across domains. A prominent line of work addresses this through unsupervised domain adaptation from a labeled source to an unlabeled target domain [23, 51, 9, 36, 6]. But there remain limitations. For instance, unsupervised domain adaptation usually does not leverage target domain data to improve source performance. Further, it is designed for the restrictive setting of a large-scale labeled source domain and an unlabeled target domain. While some applications such as self-driving have large-scale annotated datasets for particular source domains, the vast majority of applications only have limited data in any domain. Finally, most of the above works assume that the target label set matches the source one, which is often not the case in practice. For example, road scene segmentation across different countries, or segmentation across outdoor and indoor scenes, involves domain-specific label sets.
In this paper, we propose and address the novel problem of universal semi-supervised semantic segmentation as a practical setting for many real-world applications. It seeks to aggregate knowledge from several different domains during training, each of which has few labeled examples but several unlabeled examples. The goal is to simultaneously limit training cost through reduced annotations and deployment cost by obtaining a single model to be used across domains. Label spaces may be partially or fully non-overlapping across domains. While fine-tuning a source model on a small amount of target data is a possible counterpoint, it usually requires plentiful source labels and necessitates deployment of a separate model in every domain. Another option is joint training, which does yield a unified model across domains, but does not leverage unlabeled data available in each domain. Our semi-supervised universal segmentation approach leverages both limited labeled and larger-scale unlabeled data in every domain, to obtain a single model that performs well across domains.
In particular, we use the labeled examples in each domain to supervise the universal model, akin to multi-tasking [27, 34, 26], albeit with limited labels. We simultaneously make use of the large number of unlabeled examples to align pixel level deep feature representations from multiple domains using entropy regularization based objective functions. We calculate the similarity score vector between the features and the label embeddings (computed from class prototypes ) and minimize the entropy of this discrete distribution to positively align similar examples between the labeled and the unlabeled images. We do this unsupervised alignment both within domain, as well as across domains.
We believe such within and cross-domain alignment is fruitful even with non-overlapping label spaces, particularly so for semantic segmentation, since label definitions often encode relationships that may positively reinforce performance in each domain. For instance, two road scene datasets such as Cityscapes  and IDD  might have different labels, but share similar label hierarchies. Even an outdoor dataset like Cityscapes and an indoor one like SUN  may have label relationships, for example, between horizontal (road, floor) and vertical (building, wall) classes. Similar observations have been made for multi-task training .
We posit that our pixel wise entropy-based objective discovers such alignments to improve over joint training, as demonstrated quantitatively and qualitatively in our experiments. Specifically, our experiments lend insights across various notions of domain gaps. With Cityscapes  as one of the domains (road scenes in Germany), we derive universal models with respect to CamVid (roads in England) , IDD (roads in India)  and SUN (indoor rooms) . In each case, our semi-supervised universal model improves over fine-tuning and joint training, with visualizations of the learned feature representations demonstrating conceptually meaningful alignments. We use dilated residual networks in our experiments , but the framework is equally applicable to any of the existing deep encoder-decoder architectures for semantic segmentation.
In summary, we make the following contributions:
- We propose a universal segmentation framework to train a single joint model on multiple domains with disparate label spaces to improve performance on each domain.
- We introduce a pixel-level entropy regularization scheme to train semantic segmentation architectures using datasets with few labeled examples and larger quantities of unlabeled examples.
2 Related Work
Most of the state-of-the-art models for semantic segmentation [59, 31, 37, 3, 8, 42] have been possible largely due to breakthroughs in deep learning that have pushed the boundaries of performance in image classification [28, 21, 22] and related tasks. The pioneering work in  proposes an end-to-end trainable network for semantic segmentation by replacing the fully connected layers of pretrained AlexNet  and VGG Net  with fully convolutional layers that aggregate spatial information across various resolutions. Noh et al.  use transpose convolutions to build a learnable decoder module, while the DeepLab network  uses atrous convolutions along with atrous spatial pyramid pooling for better aggregation of spatial features. Segmentation architectures based on dilated convolutions  for real time semantic segmentation have also been proposed [59, 42].
Transfer Learning and Domain Adaptation
Transfer learning  involves transferring deep feature representations learned in one domain or task to another domain or task where labeled data availability is low. Previous works demonstrate transfer learning capabilities between related tasks [13, 61, 39, 40] or completely different tasks [19, 41, 31]. Unsupervised domain adaptation is a related paradigm which leverages labeled data from one or more source domains to learn a classifier for a new unsupervised target domain in the presence of a domain shift. Ganin et al. [16, 17], followed by [53, 52, 23, 51, 5, 9] propose various models for learning domain agnostic discriminative feature representations with the help of an auxiliary domain classifier trained using an adversarial loss. Most of these works in domain adaptation assume equal source and target dataset label spaces or a subset target label space, which is not the most general case for real world applications. Luo et al.  relax this assumption to perform few shot adaptation across arbitrary label spaces for image and action recognition tasks. While transfer learning or adaptation based methods are typically focused on using knowledge from a labeled source domain to improve performance on a specific target domain, we propose a joint training framework to train a single model that delivers good performance on both the domains.
Most tasks in computer vision perform prediction in a one-hot label space where each label is represented as a one-hot vector. However, following the success of metric learning based approaches , tasks such as fine grained classification [2, 1], latent hierarchy learning  and zero-shot prediction [38, 15, 29] have benefited greatly from projecting images and labels into a common subspace. Popular methods for obtaining vector representations for labels include hand crafted semantic attribute vectors  or semantic representations of labels [38, 15] from word2vec . More recent works propose using prototypes computed from class-wise feature representations as the label embeddings to perform few shot classification [49, 14], unsupervised domain adaptation  as well as few shot domain adaptation .
Multitask learning  is shown to help in improving performance for many tasks that share useful relationships between them in computer vision [47, 27, 60], natural language processing [10, 34, 26] and speech recognition . Universal segmentation builds on this idea by training a single joint network applicable across multiple semantic segmentation domains with possibly different label spaces to make use of transferable representations at lower levels of the network. Liang et al.  first propose the idea of universal segmentation by performing dynamic propagation through a label hierarchy graph constructed from an external knowledge source like WordNet. We propose an alternative method to perform universal segmentation without the need for any outside knowledge source or additional model parameters during inference, and instead make efficient use of the large set of unlabeled examples in each of the domains for unsupervised feature alignment.
3 Problem Description
In this section, we explain the framework used to train a single model across different segmentation datasets with possibly disparate label spaces using a novel pixel aware entropy regularization objective.
We have $d$ datasets $\{D^{(i)}\}_{i=1}^{d}$, each of which has few labeled examples and many unlabeled examples. The labeled images and corresponding labels from $D^{(i)}$ are denoted by $\{X_l^{(i)}, Y^{(i)}\}$, where $Y^{(i)}$ takes values in the label space $\mathcal{Y}_i$ and $N_l^{(i)}$ is the number of labeled examples. The unlabeled images are represented by $\{X_u^{(i)}\}$, where $N_u^{(i)}$ is the number of unlabeled examples. We work with domains with very few labeled examples, so $N_u^{(i)} \gg N_l^{(i)}$. We consider the general case of non-identical label spaces, that is, $\mathcal{Y}_p \neq \mathcal{Y}_q$ for any $p \neq q$. The label spaces might still have a partial overlap between them, which is common in the case of segmentation datasets. For ease of notation, we consider the special case of two datasets $\{D^{(1)}, D^{(2)}\}$, but a similar idea can be applied to the case of multiple datasets as well.
The proposed universal segmentation model is summarized in Figure 2. We first concatenate the datasets $D^{(1)}$ and $D^{(2)}$ and randomly select samples from this mixture in each mini-batch for training. Deep semantic segmentation architectures generally consist of an encoder module, which aggregates the spatial information across various resolutions, and a decoder module that upsamples the image to enable pixel wise predictions at a resolution that matches the input. In order to enable joint training with multiple datasets, we modify this encoder decoder architecture by having a shared encoder module $\mathcal{F}$ and different decoder layers $\mathcal{G}_1$, $\mathcal{G}_2$ for prediction in different label spaces. For a labeled input image $x_l^{(k)}$ from dataset $D^{(k)}$, the pixel wise predictions are denoted by $\hat{y}^{(k)} = \mathcal{G}_k(\mathcal{F}(x_l^{(k)}))$ for $k = 1, 2$, which, along with the labeled annotations, give us the supervised loss. To include the visual information from the unlabeled examples $x_u^{(k)}$, we propose an entropy regularization module $\mathcal{E}$. This entropy module takes as input the output of the encoder to give pixel wise embedding representations. The entropy of the pixel level similarity scores of these embedding representations with the label embeddings results in the unsupervised loss term. Each of these loss terms is explained in detail in the following sections.
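As an illustration only, the mini-batch construction described above can be sketched in Python as follows. The function name `mixed_batches`, the integer dataset tags and the fixed seed are assumptions made for this sketch; the text only states that samples are drawn at random from the concatenation of the two datasets.

```python
import random


def mixed_batches(dataset_a, dataset_b, batch_size, seed=0):
    """Yield mini-batches drawn from the concatenation of two datasets.

    Each sample is tagged with a dataset id (0 or 1) so that a training
    loop can route it to the decoder head of its own label space.
    Sampling without replacement within an epoch is an assumption.
    """
    pool = [(0, s) for s in dataset_a] + [(1, s) for s in dataset_b]
    rng = random.Random(seed)  # fixed seed only to keep the sketch reproducible
    rng.shuffle(pool)
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]
```

In such a loop, samples tagged 0 would be scored by the first decoder and samples tagged 1 by the second, while both pass through the same encoder.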
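The supervised loss is calculated separately at each output softmax layer as a masked cross entropy loss between the predicted segmentation mask at that layer and the corresponding pixel wise ground truth segmentation masks for all labeled examples. Specifically, for the output softmax layer $\mathcal{G}_k$, which corresponds to dataset $D^{(k)}$,

$$\mathcal{L}_S^{(k)} = \frac{1}{N_l^{(k)}} \sum_{(x, y)} \psi_k\big(y,\, \mathcal{G}_k(\mathcal{F}(x))\big), \tag{1}$$

where the sum runs over the $N_l^{(k)}$ labeled image and annotation pairs $(x, y)$ of dataset $D^{(k)}$, $\mathcal{F}$ is the shared encoder, and $\psi_k$ is the softmax cross entropy loss function over the label space $\mathcal{Y}_k$, which is averaged over all the pixels of the segmentation map. $\mathcal{L}_S^{(1)}$ and $\mathcal{L}_S^{(2)}$ together comprise the supervised loss term $\mathcal{L}_S$.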
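A minimal sketch of a masked, pixel-averaged cross entropy of this kind is given below, using plain Python lists in place of tensors. The ignore label 255, the function names, and the choice to add the per-dataset losses together are assumptions for illustration; the text does not specify them.

```python
import math

IGNORE = 255  # assumed id for unlabeled ("void") pixels; not given in the text


def masked_cross_entropy(logits, labels, ignore_index=IGNORE):
    """Mean softmax cross entropy over pixels whose label is not ignored.

    logits: list of per-pixel score lists (one score per class of this head)
    labels: list of integer class ids, one per pixel
    """
    total, count = 0.0, 0
    for scores, y in zip(logits, labels):
        if y == ignore_index:
            continue  # masked pixel: contributes nothing to the loss
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[y]
        count += 1
    return total / count if count else 0.0


def supervised_loss(heads):
    """Add the per-dataset losses; heads is a list of (logits, labels) pairs.

    Summation is an assumed way of combining the per-dataset terms.
    """
    return sum(masked_cross_entropy(lg, lb) for lg, lb in heads)
```

Each head only ever sees labels from its own label space, which is what allows the two datasets to have different class sets.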
The large number of unlabeled images in the training set provides us with rich information regarding the visual similarity between the domains and the label structure, which the existing methods on few shot segmentation  or universal segmentation  do not exploit. To address this issue, we propose using entropy regularization  to transfer the information from labeled images to the unlabeled images. We introduce a pixel level entropy regularization module, which helps in aligning visually similar pixel level features from both datasets, as computed by the segmentation network, close to each other in an unsupervised way.
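The entropy module $\mathcal{E}$ takes as input the encoder output $\mathcal{F}(x)$ and projects the representation at each pixel into a $d_e$ dimensional embedding space $\mathbb{R}^{d_e}$. A similarity metric $\phi(\cdot, \cdot)$, which operates on each pixel, is then used to calculate the similarity score of the embedding representations with each of the $d_e$ dimensional label embeddings using the equation

$$[v_{ij}]_k = \phi\big(\mathcal{E}(\mathcal{F}(x_u^{(i)})),\, c_k^{(j)}\big), \tag{2}$$

where $x_u^{(i)}$ is an image from the unlabeled set of dataset $D^{(i)}$, $c_k^{(j)}$ is the label embedding corresponding to the $k$-th label from the dataset $D^{(j)}$, and $v_{ij}$ collects the scores over all labels of $D^{(j)}$. When $i = j$, the scores correspond to the similarity scores within a dataset, and when $i \neq j$, they provide the cross dataset similarity scores.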
To calculate the vector representation of the labels of a dataset, we first train an end to end segmentation model with all the supervised training data available from that dataset. Using the output of the encoder of this network, we calculate the centroid of the vector representation of all the pixels belonging to a label to obtain the label embedding for that particular label. These centroids have then been used to initialize the label embeddings for the universal segmentation model. In our experiments, the label embeddings have been kept static over the course of training the joint segmentation network, since we found that the limited supervised data was not sufficient to jointly train a universal segmentation model as well as fine tune the label embeddings. More information regarding this is provided in the supplementary section.
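The centroid computation described above can be sketched as follows. The function name `class_prototypes`, the list-based feature layout and the ignore label 255 are assumptions made for this example; in practice the features would come from the encoder of the pretrained per-dataset network.

```python
from collections import defaultdict


def class_prototypes(features, labels, ignore_index=255):
    """Return {class_id: centroid} from per-pixel features and labels.

    features: list of per-pixel feature vectors (lists of floats)
    labels:   list of integer class ids, one per pixel
    """
    sums = {}
    counts = defaultdict(int)
    for f, y in zip(features, labels):
        if y == ignore_index:
            continue  # skip pixels without a valid annotation
        if y not in sums:
            sums[y] = [0.0] * len(f)
        sums[y] = [a + b for a, b in zip(sums[y], f)]
        counts[y] += 1
    # divide each accumulated sum by the number of pixels of that class
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}
```

The resulting vectors would then be held fixed as label embeddings while the joint network is trained, as stated in the text.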
As demonstrated in Figure 3, although the entropy module is similar to the decoder module, in practice two crucial differences exist between the two. First, the entropy module projects the encoder features into a large dimensional embedding space, while the decoder projects the features into a much smaller label space. Second, unlike the decoder, the entropy module projects features corresponding to both datasets into a common embedding space.
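We have two parts for the unsupervised entropy loss. The first part, the cross dataset entropy loss, is obtained by minimizing the entropy of the cross dataset similarity vectors,

$$\mathcal{L}_{US,c} = H\big(\sigma(v_{12})\big) + H\big(\sigma(v_{21})\big), \tag{3}$$

where $H(\cdot)$ is the entropy measure of a discrete distribution, $\sigma(\cdot)$ is the softmax operator and the similarity vector $v_{ij}$, which scores the pixels of unlabeled images from dataset $i$ against the label embeddings of dataset $j$, is from Eq (2). Minimizing $\mathcal{L}_{US,c}$ makes the probability distribution peaky over a single label from a dataset, which helps to align visually similar images across datasets close to each other, improving the overall prediction of the network. In addition, we also have a within dataset entropy loss given by

$$\mathcal{L}_{US,w} = H\big(\sigma(v_{11})\big) + H\big(\sigma(v_{22})\big), \tag{4}$$

which aligns the unlabeled examples within the same domain. The total loss combines the supervised and unsupervised terms,

$$\mathcal{L} = \mathcal{L}_S + \alpha\, \mathcal{L}_{US,c} + \beta\, \mathcal{L}_{US,w}, \tag{5}$$

where $\alpha$ and $\beta$ are hyperparameters that control the influence of the unsupervised loss in the total loss.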
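To make the entropy objective concrete, the following sketch computes the entropy of softmax-normalized dot-product similarities between pixel embeddings and label embeddings, and combines the terms with two weights. The names `alpha` and `beta`, the averaging over pixels, and the list-based data layout are assumptions of this sketch rather than details taken from the text.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def entropy_loss(pixel_embeddings, label_embeddings):
    """Mean entropy, over pixels, of softmax(similarity to each label)."""
    total = 0.0
    for e in pixel_embeddings:
        sims = [dot(e, c) for c in label_embeddings]
        total += entropy(softmax(sims))
    return total / len(pixel_embeddings)


def total_loss(supervised, cross, within, alpha, beta):
    """Weighted sum of the supervised and the two unsupervised terms."""
    return supervised + alpha * cross + beta * within
```

Calling `entropy_loss` with pixels of one dataset and the label embeddings of the other gives a cross dataset term, while using label embeddings of the same dataset gives a within dataset term. A pixel that is equally similar to all labels has maximal entropy, and one that matches a single label strongly has entropy close to zero.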
Many labels in a segmentation dataset often appear in more than one visual form or modality. For example, the road class can appear as dry road, wet road, shady road etc., and a class labeled as building can come in different structures and sizes. To better capture the multiple modalities involved in the visual information of the label, we propose using multiple embeddings for each label instead of a single mean centroid. This is analogous to polysemy in vocabulary, where many words can have multiple meanings and can occur in different contexts, and context specific word vector representations are used to capture this behavior. To calculate the multiple label embeddings, we perform K-means clustering of the pixel level encoder feature representations calculated from networks pretrained on the limited supervised data. We then calculate similarity scores with all of the resulting label embeddings, and include the results in the experiments.
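The following sketch illustrates how several embeddings per label could be obtained with a plain Lloyd-style K-means. The deterministic initialization from the first K points, the iteration cap and the function names are assumptions chosen to keep the example short and reproducible; the text does not describe the clustering configuration.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k, iters=20):
    """Lloyd's algorithm; assumes len(points) >= k.

    Centers start at the first k points (an assumption for determinism).
    """
    centers = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[j].append(p)
        # keep the old center if a cluster happens to be empty
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments have stabilized
        centers = new_centers
    return centers


def multi_prototypes(features_by_class, k):
    """Map each class id to k centroids of its pixel features."""
    return {y: kmeans(f, k) for y, f in features_by_class.items()}
```

With K=1 this reduces to the single mean centroid per label; larger K lets one label, such as road, keep separate embeddings for visually distinct modes.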
For a query image from dataset $D^{(k)}$ at test time, the output of the corresponding decoder $\mathcal{G}_k$ is used to obtain the segmentation map over that dataset's label set $\mathcal{Y}_k$, which gives us the pixel wise label predictions. Although we calculate feature and label embeddings in our method, and metric based inference schemes like nearest neighbor search might enable prediction in a label set agnostic manner, calculating pixel wise nearest neighbor predictions can prove very slow and costly for images with high resolution.
4 Results and Discussion
Cityscapes  is a standard autonomous driving dataset consisting of 2975 training images collected from various cities across Europe finely annotated with 19 classes. CamVid  dataset contains 367 training and 233 testing images from England taken from video sequences finely labeled with 32 classes, although we use the more popular 11 class version from . We also demonstrate results on IDD [25, 54] dataset, which is an in-the-wild dataset for autonomous navigation in unconstrained environments. It consists of 6993 training and 981 validation images finely annotated with 26 classes collected from 182 drive sequences on Indian roads, taken in highly varying weather and environment conditions. This is a challenging driving dataset since it contains images taken from largely unstructured environments.
While autonomous driving datasets typically offer many challenges, there is still limited variation with respect to the classes, object orientation or camera angles. Therefore, we also use SUN RGB-D  dataset for indoor segmentation, which contains 5285 training images along with 475 validation images finely annotated with 37 labels consisting of regular household objects like chair, table, desk, pillow etc. We use only the RGB information for our universal training and ignore the depth information provided.
Although the proposed framework is readily applicable to any state-of-the art encoder-decoder semantic segmentation framework, we use the openly available PyTorch implementation of dilated residual network  with a ResNet-18 backbone owing to its low latency in autonomous driving applications. We train every model on 2 Nvidia GeForce GTX 1080 GPUs with a uniform crop size of across datasets and a batch size of . We employ SGD learning algorithm with an initial learning rate of and a momentum of , along with a poly learning rate schedule  with a power of . We take the embedding dimension to be , and use dot product for the pixel level similarity metric as it can be implemented as a convolution on most of the modern deep learning packages. We use 100 labeled examples from each dataset as a standard setting for the experiments.
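For readers unfamiliar with the poly learning rate schedule mentioned above, a common formulation is sketched below. The function name and the numeric values in the usage note are hypothetical and chosen only for illustration; the actual initial learning rate and power used in the experiments are not recoverable from the text here.

```python
def poly_lr(base_lr, step, max_steps, power):
    """Polynomial decay: base_lr * (1 - step / max_steps) ** power."""
    return base_lr * (1.0 - step / max_steps) ** power
```

For example, with a hypothetical `base_lr` of 0.01 and `power` of 0.9, the rate starts at 0.01, decreases monotonically, and reaches zero at `step == max_steps`.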
We use the mean IoU (Intersection over Union) as the performance analysis metric. The IoU for each class is given by

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN},$$

where TP, FP, FN are the numbers of true positive, false positive and false negative pixels respectively, and mIoU is the mean of IoUs over all the classes. mIoUs are calculated separately for all datasets in a universal model and the models are compared against each other using the sum of mIoUs as the metric. All the mIoU values reported are on the publicly available validation sets of the respective datasets.
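The metric can be illustrated with a short sketch that counts TP, FP and FN per class from flat lists of predicted and ground truth pixel labels. The ignore label 255 and the choice to skip classes that never occur when averaging are assumptions of this example; in practice the counts would be accumulated over the whole validation set before the ratios are taken.

```python
def per_class_iou(pred, target, num_classes, ignore_index=255):
    """Return a list of IoU values (None for classes with no pixels)."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixels are excluded from the metric
        if p == t:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p where it is not present
            fn[t] += 1  # missed a pixel of the true class t
    ious = []
    for c in range(num_classes):
        denom = tp[c] + fp[c] + fn[c]
        ious.append(tp[c] / denom if denom else None)
    return ious


def mean_iou(ious):
    """Average the IoU over classes that have a defined value."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid)
```

For a universal model, this computation would be run once per dataset on the output of that dataset's decoder, and the resulting mIoUs added to compare models.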
We perform the following ablation studies in our experiments to provide insights into the various components of the proposed objective function. (i) supervised only: This result involves training two different segmentation models from scratch on the limited available supervised data from each dataset. (ii) finetune: We also compare against the fine tuning strategy commonly used in cases where there is a limited amount of training data. In our case, the network trained on the limited labeled data from one dataset is fine tuned on the other labeled dataset after changing the final softmax classification layer of the decoder. (iii) Univ-basic: To study the effect of the unsupervised losses, we set the weights of both unsupervised loss terms to zero and perform training using only the supervised loss term from Eq (1) and no entropy module at all. (iv) Univ-cross: To study the effect of the cross dataset loss term from Eq (3), we conduct experiments with the weight of the within dataset loss set to zero, so that the cross dataset loss is the only unsupervised term. (v) Univ-full: This is the proposed model, including all the supervised and unsupervised loss terms. We use in the loss function in Eq (5). Note that (i) and (ii) need to be trained separately for each domain while our method makes it possible to universally deploy a single model across different domains. For reference, the mIoU when using the full dataset as the labeled data is on Cityscapes and on CamVid, on IDD and on SUN-RGB-D.
4.1 Driving Datasets
Cityscapes + CamVid.
The results for jointly training a universal model on the Cityscapes and CamVid datasets are given in Table 2. For our standard setting of N=100, which corresponds to using 100 labeled examples from each domain, the proposed method gives an mIoU value of on Cityscapes and on CamVid, clearly outperforming the supervised only training method as well as the fine tuning based method, with the added advantage of deploying only a single joint model in both domains. Moreover, the universal segmentation method using the proposed unsupervised losses also performs better than using only supervised losses, which demonstrates the advantage of unsupervised entropy regularization in domains with little labeled data and plenty of unlabeled data.
In the case of semantic segmentation datasets, very low values of N offer challenges like limited representation for many of the smaller labels, but we notice that the proposed model for N=50 still manages to perform consistently better on both datasets, and the advantage is more pronounced in the case of Cityscapes, which has many more unlabeled examples compared to CamVid.
For N=300, we observe that our full universal segmentation model improves substantially upon all the baselines on Cityscapes with an mIoU of by making use of the many available unlabeled examples. But for CamVid, with a total dataset size of 367, this increase in the number of labeled examples to N=300 means sufficient supervision, and the univ-basic method gives better results, suggesting that the proposed universal segmentation model is particularly useful when labeled data is scarce.
From Table 2, we also observe that using multiple centroids helps for all the experiment settings irrespective of the number of labeled examples available, as the sum of mIoUs is highest when the models are trained using 3 centroids per label (K=3). This is because many of the labels occur in multiple modalities and more than one label embedding usually captures this variance better.
A comparison of class-wise mIoUs against other methods for N=100 is given in Table 1. The entropy based models perform better than the baselines on most of the labels, and we also observe that a few categories that show diversity in visual appearance, like wall, car, train in Cityscapes and pavement, tree in CamVid, benefit more from having more than one representation per label.
IDD + Cityscapes
The results for universal semantic segmentation using IDD and Cityscapes are shown in Table 3. This combination is a good candidate for validating the universal segmentation approach, as the images are from widely dissimilar domains in terms of geography, weather conditions as well as traffic setup, and capture the wide variety of road scenes one might encounter while training on autonomous driving datasets. Using 100 training examples from each domain, the proposed univ-full model gives an mIoU of on IDD and on Cityscapes, performing better than the univ-basic or univ-cross methods as well as the traditional fine tuning based methods. If we increase the number of supervised examples from 100 to 300, the univ-cross model performs better than the univ-full model, which implies that the within dataset unsupervised loss has a greater effect when labeled data is scarce in the case of the IDD + Cityscapes datasets.
4.2 Cross Domain Experiment
A useful advantage of the universal segmentation model is its ability to perform knowledge transfer between datasets used in completely different settings, due to its ability to effectively exploit useful visual relationships. We demonstrate this effect in the case of joint training between Cityscapes, which is a driving dataset with road scenes used for autonomous navigation, and SUN RGB-D, which is an indoor segmentation dataset with household objects used for high-level scene understanding.
The 19 labels in Cityscapes and the 37 labels in the SUN dataset are completely different (non overlapping), so simple joint training techniques generally give poor results. However, from Table 4, our universal segmentation model with unsupervised entropy regularization losses shows a relative improvement of mIoU on Cityscapes dataset and mIoU on the SUN RGB-D dataset compared to the simple joint training based method, demonstrating the effectiveness of having a label alignment module in addition to shared lower level representations for cross domain tasks.
4.3 Feature Visualization
A more intuitive understanding of the feature alignment performed by our universal model is obtained from the t-SNE embeddings  of the visual features. The pixel wise output of the encoder module is used to plot the t-SNE of selected labels in Figure 4. For the universal training between Cityscapes and CamVid in Figures 4(a) and 4(b), we can observe that large classes like Building and Sky from both datasets align with each other better when trained using a universal segmentation objective. For the universal training between Cityscapes and SUN in Figures 4(c) and 4(d), labels with similar visual attributes such as Road and Floor align close to each other in spite of the label sets themselves being completely non overlapping.
5 Conclusion
In this work, we demonstrate a simple and effective way to perform universal semi-supervised semantic segmentation. We train a joint model using the few labeled examples and large amounts of unlabeled examples from each domain with an entropy regularization based semantic transfer objective. We show this approach to be useful in better aligning the visual features corresponding to different domains. We demonstrate superior performance of the proposed approach when compared to supervised training or fine tuning based methods over a wide variety of segmentation datasets with varying degrees of label overlap. We hope that our work will address the growing concern in the deep learning community over the difficulty of collecting large numbers of labeled examples for dense prediction tasks such as semantic segmentation. In the future, we aim to extend the present method to other problems in computer vision such as object detection or instance aware segmentation.
References
-  Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for attribute-based classification. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 819–826, June 2013.
-  Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2927–2936, 2015.
-  V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
-  G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):88–97, 2009.
-  Z. Cao, L. Ma, M. Long, and J. Wang. Partial adversarial domain adaptation. In The European Conference on Computer Vision (ECCV), September 2018.
-  F. M. Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulo. Autodial: Automatic domain alignment layers. In International Conference on Computer Vision (ICCV), 2017.
-  R. Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018.
-  Y.-H. Chen, W.-Y. Chen, Y.-T. Chen, B.-C. Tsai, Y.-C. F. Wang, and M. Sun. No more discrimination: Cross city adaptation of road scene segmenters. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2011–2020. IEEE, 2017.
-  R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. Ieee, 2009.
-  J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. corr abs/1310.1531 (2013), 2013.
-  N. Dong and E. P. Xing. Few-shot semantic segmentation with prototype learning. In BMVC, 2018.
-  A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems, pages 2121–2129, 2013.
-  Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, pages 1180–1189, 2015.
-  Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
-  A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, P. Martinez-Gonzalez, and J. Garcia-Rodriguez. A survey on deep learning techniques for image and video semantic segmentation. Applied Soft Computing, 70:41–65, 2018.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
-  Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In Advances in neural information processing systems, pages 529–536, 2005.
-  K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  J. Hoffman, D. Wang, F. Yu, and T. Darrell. Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.
-  J. Hu, J. Lu, and Y.-P. Tan. Deep transfer metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 325–333, 2015.
-  C. Jawahar, A. Subramanian, A. Namboodiri, M. Chandraker, and S. Ramalingam. AutoNUE workshop and challenge at ECCV’18. http://cvit.iiit.ac.in/scene-understanding-challenge-2018/.
-  L. Kaiser, A. N. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit. One model to learn them all. arXiv preprint arXiv:1706.05137, 2017.
-  I. Kokkinos. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, volume 2, page 8, 2017.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, March 2014.
-  X. Liang, H. Zhou, and E. Xing. Dynamic-structured semantic propagation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 752–761, 2018.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
-  Z. Luo, Y. Zou, J. Hoffman, and L. F. Fei-Fei. Label efficient learning of transferable representations across domains and tasks. In Advances in Neural Information Processing Systems, pages 165–177, 2017.
-  L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
-  B. McCann, N. S. Keskar, C. Xiong, and R. Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
-  Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim. Image to image translation for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4500–4509, 2018.
-  H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1520–1528, 2015.
-  M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. International Conference on Learning Representations (ICLR), 2014.
-  M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1717–1724, 2014.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  E. Romera, J. M. Álvarez, L. M. Bergasa, and R. Arroyo. Erfnet: Efficient residual factorized convnet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems, 19(1):263–272, Jan 2018.
-  O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  S. Saha, G. Varma, and C. Jawahar. Class2Str: End to end latent hierarchy learning. In International Conference on Pattern Recognition (ICPR), 2018.
-  M. L. Seltzer and J. Droppo. Multi-task learning in deep neural networks for improved phoneme recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6965–6969. IEEE, 2013.
-  P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
-  S. Song, S. P. Lichtenberg, and J. Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 567–576. IEEE, 2015.
-  Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to adapt structured output space for semantic segmentation. arXiv preprint arXiv:1802.10349, 2018.
-  E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4068–4076, 2015.
-  E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  G. Varma, A. Subramanian, A. Namboodiri, M. Chandraker, and C. Jawahar. IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019.
-  K. Weiss, T. M. Khoshgoftaar, and D. Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016.
-  Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2018.
-  S. Xie, Z. Zheng, L. Chen, and C. Chen. Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning, pages 5419–5428, 2018.
-  F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
-  F. Yu, V. Koltun, and T. A. Funkhouser. Dilated residual networks. In CVPR, volume 2, page 3, 2017.
-  A. R. Zamir, A. Sax, W. Shen, L. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3712–3722, 2018.
-  M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
Appendix A Calculating the label embeddings
In this section, we describe the method used to obtain the vector representations for the labels. For each dataset separately, we train an end-to-end segmentation network from scratch using only the limited training data available in that dataset. We then use this trained network to compute the encoder output at each pixel of the training data. Typically, the dimension of the encoder output at each pixel (512 for a ResNet-18 or ResNet-34 backbone) is not equal to the dimension of the label embeddings (128 in our case). We therefore first apply a dimensionality reduction technique such as PCA to reduce the encoder outputs to the dimension of the label embeddings, and then upsample the result to match the resolution of the ground truth segmentation map. Finally, we compute the class-wise centroids of these per-pixel vectors to obtain the label embeddings.
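The procedure above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names (`pca_reduce`, `upsample_nearest`, `label_embeddings`) are hypothetical, PCA is computed with a plain SVD fitted jointly on all encoder pixels, and nearest-neighbour upsampling stands in for whichever interpolation was actually used.

```python
import numpy as np


def pca_reduce(features, out_dim):
    """Project (N, D) features onto their top `out_dim` principal directions."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:out_dim].T


def upsample_nearest(feat_map, out_h, out_w):
    """Nearest-neighbour upsampling of an (h, w, d) map to (out_h, out_w, d)."""
    h, w, _ = feat_map.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return feat_map[rows][:, cols]


def label_embeddings(encoder_maps, label_maps, num_classes, embed_dim):
    """Class-wise centroids of PCA-reduced, upsampled encoder outputs.

    encoder_maps: list of (h, w, D) float arrays (encoder output per image).
    label_maps:   list of (H, W) integer arrays (ground truth per image).
    Labels outside [0, num_classes) are ignored; classes that never occur
    get a zero vector.
    """
    shapes = [m.shape[:2] for m in encoder_maps]
    flat = np.concatenate([m.reshape(-1, m.shape[-1]) for m in encoder_maps])
    reduced = pca_reduce(flat, embed_dim)

    sums = np.zeros((num_classes, embed_dim))
    counts = np.zeros(num_classes)
    offset = 0
    for (h, w), labels in zip(shapes, label_maps):
        fmap = reduced[offset:offset + h * w].reshape(h, w, embed_dim)
        offset += h * w
        up = upsample_nearest(fmap, *labels.shape)
        for c in range(num_classes):
            mask = labels == c
            sums[c] += up[mask].sum(axis=0)
            counts[c] += mask.sum()
    return sums / np.maximum(counts, 1)[:, None]
```

With a ResNet-18 encoder one would call `label_embeddings(maps, labels, num_classes, embed_dim=128)`, where `maps` holds the 512-dimensional encoder outputs. For large training sets, the joint SVD would likely need to be replaced by an incremental or subsampled PCA.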
Appendix B Updating the label embeddings
In our original experiments, we kept the pretrained label embeddings fixed while training the universal model. Here, we present a method to jointly train the segmentation model and the label embeddings. We initialize the embeddings with the values computed from the pretrained networks, and use the following exponentially weighted average rule to update the centroids at the $t^{\text{th}}$ time step:
$$c_t = \delta\, c_{t-1} + (1 - \delta)\, \mathcal{C}(\mathcal{E}_t) \qquad (7)$$
In Eq. (7), $c_t$ denotes the centroids at the $t^{\text{th}}$ time step, $\mathcal{E}_t$ is the state of the encoder module at the $t^{\text{th}}$ time step, and $\mathcal{C}(\cdot)$ calculates the class-wise centroids. A value of $\delta = 1$ implies that the centroids are never updated from their initial state, and a value of $\delta = 0$ means that the centroids are calculated afresh at each update. We update the centroids after every epoch over the original training data.
However, we observe from Table 5 that the mIoU drops for both the Cityscapes and CamVid datasets when the centroids are updated during training, compared to keeping them fixed at their initial values. A possible reason is that the limited training data available is not sufficient to jointly train the segmentation network and update the centroids. This is supported by the fact that increasing the number of labeled examples presented to the network improves the accuracy obtained when jointly updating the embeddings.
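As an illustration of the exponentially weighted update described in this appendix, a minimal NumPy sketch is given below. The function name `update_centroids` and the parameter name `decay` are hypothetical, and the freshly computed class-wise centroids are assumed to be supplied by the caller (for example, from the current encoder after each epoch).

```python
import numpy as np


def update_centroids(prev_centroids, fresh_centroids, decay):
    """Exponentially weighted average of old and freshly computed centroids.

    decay = 1.0 keeps the previous centroids unchanged;
    decay = 0.0 replaces them with the freshly computed ones.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    prev = np.asarray(prev_centroids, dtype=float)
    fresh = np.asarray(fresh_centroids, dtype=float)
    return decay * prev + (1.0 - decay) * fresh
```

In a training loop, this would be called once per epoch, passing the centroids recomputed from the current encoder as `fresh_centroids`.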