Mix-and-Match Tuning for Self-Supervised Semantic Segmentation

Xiaohang Zhan  Ziwei Liu  Ping Luo  Xiaoou Tang  Chen Change Loy
Department of Information Engineering, The Chinese University of Hong Kong
{zx017, lz013, pluo, xtang, ccloy}@ie.cuhk.edu.hk

Deep convolutional networks for semantic image segmentation typically require large-scale labeled data, e.g., ImageNet and MS COCO, for network pre-training. To reduce annotation effort, self-supervised semantic segmentation has recently been proposed to pre-train a network without any human-provided labels. The key to this new form of learning is to design a proxy task (e.g., image colorization), from which a discriminative loss can be formulated on unlabeled data. Many proxy tasks, however, lack the critical supervision signals that could induce discriminative representations for the target image segmentation task. Thus the performance of self-supervision is still far from that of supervised pre-training. In this study, we overcome this limitation by incorporating a ‘mix-and-match’ (M&M) tuning stage in the self-supervision pipeline. The proposed approach is readily pluggable into many self-supervision methods and does not use more annotated samples than the original process. Yet, it is capable of boosting the performance of the target image segmentation task to surpass that of the fully-supervised pre-trained counterpart. The improvement is made possible by better harnessing the limited pixel-wise annotations in the target dataset. Specifically, we first introduce the ‘mix’ stage, which sparsely samples and mixes patches from the target set to reflect the rich and diverse local patch statistics of target images. A ‘match’ stage then forms a class-wise connected graph, which can be used to derive a strong triplet-based discriminative loss for fine-tuning the network. Our paradigm follows the standard practice in existing self-supervised studies, and no extra data or labels are required. With the proposed M&M approach, for the first time, a self-supervision method can achieve comparable or even better performance compared to its ImageNet pre-trained counterpart on both the PASCAL VOC2012 dataset and the CityScapes dataset.

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.


Semantic image segmentation is a classic computer vision task that aims at assigning each pixel in an image a class label such as “chair”, “person”, and “dog”. It enjoys a wide spectrum of applications, such as scene understanding (???) and autonomous driving (???). Deep convolutional neural networks (CNNs) are now the state-of-the-art technique for semantic image segmentation (????). The excellent performance, however, comes at the price of expensive and laborious label annotations. In most existing pipelines, a network is first pre-trained on millions of class-labeled images, e.g., ImageNet (?) and MS COCO (?), and subsequently fine-tuned with thousands of pixel-wise annotated images.

Self-supervised learning (project page: http://mmlab.ie.cuhk.edu.hk/projects/M&M/) is a new paradigm proposed for learning deep representations without extensive annotations. This new technique has been applied to the task of image segmentation (???). In general, self-supervised image segmentation can be divided into two stages: the proxy stage and the fine-tuning stage. The proxy stage does not need any labeled data but requires one to design a proxy or pretext task with self-derived supervisory signals on unlabeled data. For instance, learning by colorization (?) utilizes the fact that a natural image is composed of a luminance channel and chrominance channels. The proxy task is formulated with a cross-entropy loss to predict an image's chrominance from the luminance of the same image. In the fine-tuning stage, the learned representations are utilized to initialize the target semantic segmentation network. The network is then fine-tuned with pixel-wise annotations. It has been shown that, without large-scale class-labeled pre-training, semantic image segmentation can still gain encouraging performance over random initialization or from-scratch training.

Though promising, the performance of self-supervised learning is still far from that achieved by supervised pre-training. For instance, a VGG-16 network trained with the self-supervised method of (?) achieves 56.0% mean Intersection over Union (mIoU) on the PASCAL VOC 2012 segmentation benchmark (?), higher than a randomly initialized network that only yields 35.0% mIoU. However, an identical network pre-trained on ImageNet achieves 64.2% mIoU. There exists a considerable gap between self-supervised and purely supervised pre-training.

Figure 1: (a) shows samples of patches from categories ‘bus’ and ‘car’, and these two categories have similar color distributions but different patch statistics. (b) depicts deep feature distributions of ‘bus’ and ‘car’, before and after mix-and-match, visualized with t-SNE (?). Best viewed in color.

We believe that the performance discrepancy is mainly caused by the semantic gap between the proxy task and the target task. Take learning by colorization as an example: the goal of the proxy task is to colorize gray-scale images. The representations learned from colorization may be well suited to modeling color distributions, but are likely to be weak at discriminating high-level semantics. For instance, as shown in Fig. 1(a), a red car can be far more similar to a red bus than to a blue car. The features of the car and bus classes overlap heavily, as depicted by the feature embedding in the left plot of Fig. 1(b).

Improving the performance of self-supervised image segmentation requires one to improve the discriminative power of the representation tailored to the target task. This goal is non-trivial: the target's pixel-wise annotations are discriminative for the goal, but they are often available only in small quantities, typically thousands of labeled images. Existing approaches typically use a pixel-wise softmax loss to exploit pixel-wise annotations for fine-tuning a network. This strategy may be sufficient for a network that is well initialized by supervised pre-training but could be inadequate for a self-supervised network whose features are weak. We argue that the pixel-wise softmax loss is not the sole way of harnessing the information provided by pixel-wise annotations.

In this study, we present a new learning strategy called ‘mix-and-match’ (M&M), which can help harness the scarce labeled information of a target set to improve the performance of networks pre-trained by self-supervised learning. M&M learning is conducted after the proxy stage and before the usual target fine-tuning stage, serving as an intermediate step to bridge the gap between the proxy and target tasks. It is noteworthy that M&M only uses the target images and their labels, so no additional annotation is required.

The essence of M&M is inspired by metric learning. In the ‘mix’ step, we randomly sample a large number of local patches from the target set and mix them together. The patch set is formed across images, thus decoupling any intra-image dependency to faithfully reflect the diverse and rich target distribution. Extracting patches also allows us to generate a massive number of triplets from the small target image set to produce stable gradients for training our network. In the ‘match’ step, we form a graph with nodes defined by patches represented by their deep features. An edge between nodes is defined as attractive if the nodes share the same class label; otherwise, it is a rejective edge. We enforce a class-wise connected graph, that is, all nodes from the same class in the graph compose a connected subgraph, as shown in Fig. 3(c). This ensures global consistency in triplet selection that is coherent with the class labels. With the graph, we can derive a robust triplet loss that encourages the network to map each patch to a point in feature space so that patches belonging to the same class lie close together while patches of different classes are separated by a wide margin. The way we sample triplets from a class-wise connected graph differs significantly from the existing approach (?) that forms multiple disconnected subgraphs for each class.

We summarize our contributions as follows. 1) We formulate a novel ‘mix-and-match’ tuning method, which, for the first time, allows networks pre-trained with self-supervised learning to outperform their supervised learning counterpart. Specifically, with VGG-16 as the backbone network and image colorization as the proxy task, our M&M method achieves 64.5% mIoU on the PASCAL VOC2012 dataset, outperforming the ImageNet pre-trained network, which achieves 64.2% mIoU. Our method also obtains 66.4% mIoU on the CityScapes dataset, comparable to the 67.9% mIoU achieved by using an ImageNet pre-trained network. This improvement is significant considering that our approach is based on unsupervised pre-training. 2) Apart from the learning by colorization method, M&M also improves the learning by context method (?) by a large margin. 3) In the setting of random initialization, our method achieves significant improvements with both AlexNet and VGG-16, on both PASCAL VOC2012 and CityScapes. It makes training semantic segmentation from scratch possible. 4) In addition to the new notion of mix-and-match, we also present a triplet selection mechanism based on a class-wise connected graph, which is more robust than the conventional selection scheme for our task.

Related Work

Self-supervision. It is a standard and established practice to pre-train a deep network with large-scale class-labeled images (e.g., ImageNet) before fine-tuning the model for other visual tasks. Recent research efforts are gearing towards reducing the degree of or eliminating supervised pre-training altogether. Among various alternatives, self-supervised learning is gaining substantial interest. To enable self-supervised learning, proxy tasks are designed so that meaningful representations can be induced from the problem-solving process. Popular proxy tasks include sample reconstruction (?), temporal correlation (??), learning by context (??), cross-transform correlation (?) and learning by colorization (????). In this study, we do not design a new proxy task, but present an approach that could uplift the discriminative power of a self-supervised network tailored to the image segmentation task. We demonstrate the effectiveness of M&M on learning by colorization and learning-by-context.

Weakly-supervised segmentation. There exists a rich body of literature that investigates approaches for reducing annotations in learning deep models for the task of image segmentation. Alternative annotations such as points (?), bounding boxes (?) (?), scribbles (?) and videos (?) have been explored as “cheap” supervision to replace the pixel-wise counterpart. Note that these methods still require ImageNet classification as a pre-training task. Self-supervised learning is more challenging in that no image-level supervision is provided in the pre-training stage. The proposed M&M approach is dedicated to improving the weak representation learned by self-supervised pre-training.

Graph-based segmentation. Graph-based image segmentation (?) has long been explored. The main idea is to exploit the dependency between pixels. Different from the conventional graph on pixels or superpixels in a single image, the proposed method defines the graph on image patches sampled from multiple images. We do not partition an image by performing cuts on a graph, but use the graph to select triplets for the proposed discriminative loss.

Mix-and-Match Tuning

Figure 2 illustrates the proposed approach, where (a) and (c) depict the conventional stages for self-supervised semantic image segmentation, while (b) shows the proposed ‘mix-and-match’ (M&M) tuning. Specifically, in (a), a proxy task, e.g., learning by colorization, is designed to pre-train the CNN using unlabeled images. In (c), the pre-trained CNN is fine-tuned on images and the associated per-pixel labeled maps of a target task. This work inserts M&M tuning between the proxy task and the target task as shown in (b). It is noteworthy that M&M uses the same target images and label maps in (c), hence no additional data is required. As the name implies, M&M tuning consists of two steps, namely ‘mix’ and ‘match’. We explain these steps as follows.

The Mix Step – Patch Sampling

Recall that our goal is to better harness the information in the pixel-wise annotations of the target set. Image patches have long been considered a strong visual primitive (?) that incorporates both appearance and structure information. Visual patches have been successfully applied to various tasks in visual understanding (?). Inspired by these pioneering works, the first step of M&M tuning is designed to be a ‘mix’ step that aims at sampling patches across images. The relations between these patches can be exploited for optimization in the subsequent ‘match’ operation.

More precisely, a large number of image patches with various spatial sizes are randomly sampled from a batch of images. Heavily overlapping patches are discarded. These patches are represented by the features extracted from the CNN pre-trained in the stage of Fig. 2(a), and each is assigned a unique class label based on the corresponding label map. The patches across all images are mixed to decouple any intra-image dependency so as to reflect the diverse and rich target distribution. The mixed patches are subsequently utilized as the input for the ‘match’ operation.
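The sampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`box_iou`, `sample_patches`, `mix_patches`), the overlap threshold `max_iou=0.5`, and the retry budget `max_tries` are all hypothetical. Labeling a patch by its central pixel follows the implementation details given later in the paper; feature extraction by the CNN is omitted.

```python
import random


def box_iou(box_a, box_b):
    """Intersection over union of two square boxes given as (top, left, size)."""
    top_a, left_a, size_a = box_a
    top_b, left_b, size_b = box_b
    inter_h = max(0, min(top_a + size_a, top_b + size_b) - max(top_a, top_b))
    inter_w = max(0, min(left_a + size_a, left_b + size_b) - max(left_a, left_b))
    inter = inter_h * inter_w
    union = size_a * size_a + size_b * size_b - inter
    return inter / union


def sample_patches(label_map, num_patches, sizes, max_iou=0.5, rng=None, max_tries=1000):
    """Sample up to num_patches square patches from one label map.

    Returns a list of ((top, left, size), label) pairs. A candidate is discarded
    when it overlaps an already kept patch by more than max_iou (assumed value).
    """
    rng = rng if rng is not None else random.Random()
    height, width = len(label_map), len(label_map[0])
    kept = []
    for _ in range(max_tries):
        if len(kept) == num_patches:
            break
        size = rng.choice(sizes)
        if size > height or size > width:
            continue
        box = (rng.randint(0, height - size), rng.randint(0, width - size), size)
        if any(box_iou(box, other) > max_iou for other, _ in kept):
            continue  # heavily overlapping patch: discard
        top, left, _ = box
        # The patch label is the label of its central pixel.
        kept.append((box, label_map[top + size // 2][left + size // 2]))
    return kept


def mix_patches(label_maps, num_patches, sizes, rng=None):
    """Sample patches from every image in the batch and shuffle them together."""
    rng = rng if rng is not None else random.Random()
    mixed = []
    for image_id, label_map in enumerate(label_maps):
        for box, label in sample_patches(label_map, num_patches, sizes, rng=rng):
            mixed.append((image_id, box, label))
    rng.shuffle(mixed)  # decouple intra-image ordering
    return mixed
```

Each returned entry records the source image, the patch box, and the class label, which is all the ‘match’ step needs besides the patch features.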

Figure 2: An overview of the mix-and-match approach. Our approach starts with a self-supervised proxy task (a), and uses the learned CNN parameters to initialize the CNN in mix-and-match tuning (b). Given an image batch with label maps of the target task, we select and mix image patches and then match them according to their classes via a class-wise connected graph. The matching gives rise to a triplet loss, which can be optimized to tune the parameters of the network via back propagation. Finally, the modified CNN parameters are further fine-tuned to the target task (c).

The Match Step – Perceptual Patch Graph

Our next goal is to exploit the patches to generate stable gradients for tuning the network. This is possible since patches are of different classes, and such relations can be employed to form a massive number of triplets. A triplet is denoted as $\{P_a, P_p, P_n\}$, where $P_a$ is an anchor patch, $P_p$ is a positive patch that shares the same label as $P_a$, and $P_n$ is a negative patch with a different class label. With the triplets, one can formulate a discriminative triplet loss for fine-tuning the network.

A conventional way of sampling triplets is to follow the notion of ? (?). For convenience, we call this strategy ‘random triplets’. In this strategy, triplets are randomly picked from the input batch. For instance, as shown in Fig. 3(a), one pair of same-class nodes and an arbitrary negative patch form a triplet, and another pair of nodes from that class and another negative patch form another triplet. As can be seen, there is no positive connection between the two pairs even though they share a common class label. While the distances within each triplet are optimized locally, the boundary of the positive class can be loose since the global constraint (i.e., all nodes of a class must lie close together) is not enforced. We term this phenomenon global inconsistency. Empirically, we found that this approach tends to perform worse than the proposed method, which will be introduced next.

The proposed ‘match’ step draws triplets in a different way from the conventional approach (?). In particular, the ‘match’ step begins with graph construction based on the mixed patches. For each CNN learning iteration, we construct a graph on-the-fly given a batch of input images. The nodes of the graph are patches. Two types of edges are defined between nodes: a) “attractive” if two nodes have an identical class label and b) “rejective” if two nodes have different class labels. Different from (?), we enforce the graph to be connected, and importantly, the graph should be class-wise connected. That is, all nodes from the same class in the graph compose a connected subgraph via “attractive” edges. We adopt an iterative strategy to create such a graph. At first, the graph is initialized to be empty. Then, as shown in Fig. 3(b), patches are absorbed one by one into the graph as nodes, and each new node creates one “attractive” edge and one “rejective” edge with existing nodes in the graph.

An example of an established graph is shown in Fig. 3(c). Considering the same nodes again, unlike with ‘random triplets’, the nodes now form a connected subgraph. Different classes, represented by green nodes and pink nodes, also form coherent clusters based on their respective classes, imposing tighter constraints than random triplets. To fully realize such class-wise constraints, each node in the graph takes a turn to serve as an anchor for loss optimization. An added benefit of permitting all nodes to be anchor candidates is that patch relations are utilized more efficiently than with random triplets.
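The iterative graph construction and anchor selection described above can be sketched as follows. This is only an illustrative sketch under assumptions: the function names are hypothetical, nodes are identified by their index in a list of patch labels, and the fallback used when an anchor has no rejective neighbour (picking any node of a different class) is an assumption of this sketch rather than something the paper specifies. Duplicating a node as its own positive when its label is unique follows the paper's implementation details.

```python
import random


def build_class_wise_graph(labels, rng=None):
    """Absorb nodes one by one; each new node links to one earlier node of the
    same class (attractive) and one earlier node of another class (rejective)."""
    rng = rng if rng is not None else random.Random()
    attractive = {node: set() for node in range(len(labels))}
    rejective = {node: set() for node in range(len(labels))}
    for node, label in enumerate(labels):
        same = [j for j in range(node) if labels[j] == label]
        diff = [j for j in range(node) if labels[j] != label]
        if same:
            j = rng.choice(same)
            attractive[node].add(j)
            attractive[j].add(node)
        if diff:
            j = rng.choice(diff)
            rejective[node].add(j)
            rejective[j].add(node)
    return attractive, rejective


def triplets_from_graph(labels, attractive, rejective, rng=None):
    """Let every node serve as an anchor and return (anchor, positive, negative)."""
    rng = rng if rng is not None else random.Random()
    triplets = []
    for anchor, label in enumerate(labels):
        # A node with a unique label is duplicated as its own positive.
        positives = sorted(attractive[anchor]) or [anchor]
        # Assumed fallback: any node of a different class if no rejective edge exists.
        negatives = sorted(rejective[anchor]) or [
            j for j in range(len(labels)) if labels[j] != label
        ]
        if not negatives:
            continue  # the batch contains a single class, so no triplet can be formed
        triplets.append((anchor, rng.choice(positives), rng.choice(negatives)))
    return triplets


def is_class_wise_connected(labels, attractive):
    """Check that the nodes of every class form one connected subgraph."""
    by_class = {}
    for node, label in enumerate(labels):
        by_class.setdefault(label, []).append(node)
    for nodes in by_class.values():
        seen, stack = {nodes[0]}, [nodes[0]]
        while stack:
            for neighbour in attractive[stack.pop()]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        if seen != set(nodes):
            return False
    return True
```

Because every new node attaches to an earlier node of its own class, the attractive edges of each class form a spanning tree, which is what guarantees class-wise connectivity in this sketch.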

Figure 3: This figure shows different strategies of drawing triplets. The colors of the nodes represent their labels. Blue and red edges denote attractive and rejective edges, respectively. (a) depicts the random triplet strategy (?), where nodes from the same class do not necessarily form a connected subgraph. (b-i) and (b-ii) show the proposed triplet selection strategy. A class-wise connected graph is constructed to sample triplets, which enforces tighter constraints on the positive class boundary. Details are explained in the main text of the methodology section. Best viewed in color.

The Tuning Loss

Loss function. To optimize the semantic consistency within the graph, for any two nodes in the graph, if they are connected by an attractive edge, we seek to minimize their distance in the feature space; if they are connected by a rejective edge, the distance should be maximized. Consider a node that connects to two other nodes, one via an attractive edge and one via a rejective edge. We denote it as an “anchor”, while the two connected nodes are denoted as “positive” and “negative”, respectively. These three nodes are grouped into a “triplet”.

When constructing the graph, we ensure that each node can serve as an “anchor”, except for those nodes whose labels are unique among all the nodes. Thus, the number of nodes equals the number of triplets. Assume that in each iteration we discover $N$ triplets in the graph. By converting the graph optimization problem into “triplet ranking” (?), we formulate our loss function as follows:

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \max\left\{ \mathcal{D}(P_a^i, P_p^i) - \mathcal{D}(P_a^i, P_n^i) + \alpha,\ 0 \right\},$$

where $P_a^i$, $P_p^i$, $P_n^i$ denote the “anchor”, “positive”, and “negative” nodes in the $i$-th triplet, $\alpha$ is a regularization factor controlling the distance margin, and $\mathcal{D}(\cdot,\cdot)$ is a distance metric measuring patch relationship.

In this work, we leverage perceptual distance (?) to characterize the relationship between patches. This is different from previous works (?) (?) that define patch distance using low-level cues (e.g., colors and edges). Specifically, the perceptual representation can be formulated as $F = f_{\mathrm{CNN}}(P)$, where $f_{\mathrm{CNN}}(\cdot)$ denotes a convolutional neural network (CNN) and $F$ denotes the extracted representation. $\mathcal{D}(P_x, P_y)$ is the perceptual distance between two patches, which can be formulated as:

$$\mathcal{D}(P_x, P_y) = \left\| \frac{F_x}{\|F_x\|_2} - \frac{F_y}{\|F_y\|_2} \right\|_2,$$

where $F_x$ and $F_y$ are the CNN representations extracted from patches $P_x$ and $P_y$. $L_2$ normalization is used here for calculating Euclidean distances.

By optimizing the “triplet ranking” loss, our perceptual patch graph converges to both intra-class and inter-class semantic consistency.
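As a concrete illustration of the tuning loss, the sketch below computes the Euclidean distance between L2-normalized feature vectors and averages a hinge-style triplet ranking loss over a batch of triplets. It assumes the standard triplet ranking form (distance to the positive minus distance to the negative plus a margin, clipped at zero); the function names and the margin value used in the example are hypothetical, and plain Python lists stand in for the CNN features.

```python
import math


def l2_normalize(vector):
    """Scale a feature vector to unit Euclidean norm (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def perceptual_distance(feature_x, feature_y):
    """Euclidean distance between the L2-normalized features of two patches."""
    unit_x, unit_y = l2_normalize(feature_x), l2_normalize(feature_y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(unit_x, unit_y)))


def triplet_ranking_loss(features, triplets, margin):
    """Mean hinge loss over (anchor, positive, negative) index triplets."""
    losses = []
    for anchor, positive, negative in triplets:
        d_pos = perceptual_distance(features[anchor], features[positive])
        d_neg = perceptual_distance(features[anchor], features[negative])
        losses.append(max(d_pos - d_neg + margin, 0.0))
    return sum(losses) / len(losses)


# Toy example: patches 0 and 1 point in the same direction, patch 2 is orthogonal.
features = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
loss = triplet_ranking_loss(features, [(0, 1, 2)], margin=2.0)  # margin is illustrative
```

In the toy example the anchor and positive coincide after normalization, so the loss is driven only by how far the negative is pushed beyond the margin.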

M&M implementation details. We use both AlexNet (?) and VGG-16 (?) as our backbone CNN architectures, as illustrated in Fig. 2. For initialization, we try random initialization and two proxy tasks, namely Jigsaw Puzzles (?) and Colorization (?). From a batch of images in each CNN iteration, we sample a fixed number of patches per image at various sizes and resize them to a common fixed size. Then we extract the “pool5” features of these patches from the CNN for later use. We assign each patch a unique label, namely the label of its central pixel in the corresponding label map. Then we perform the iterative strategy to construct the graph as discussed in the methodology section. We make use of each node in the graph as an “anchor”, which is made possible by our graph construction strategy. If a node's label is unique among all the nodes, we duplicate the node to serve as its own “positive” counterpart. In this way, we obtain a batch of meaningful triplets whose number is equal to the number of nodes, and feed them into a triplet loss layer with a fixed margin. M&M tuning is conducted for 8000 iterations on the PASCAL VOC2012 or CityScapes training set. The learning rate is held fixed before iteration 6000 and is then dropped. We apply batch normalization to speed up convergence.

Segmentation fine-tuning details. Finally, we fine-tune the CNN on the semantic segmentation task. For AlexNet, we follow the same setting as presented in (?), and for VGG-16, we follow (?), whose architecture is equipped with hyper-columns (?). The fine-tuning process runs for 40k iterations, with an initial learning rate of 0.01 that is dropped by a factor of 10 at iterations 24k and 36k. We keep tuning the batch normalization layers before “pool5”. All experiments follow the same setting.
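The fine-tuning schedule stated above amounts to a simple step function of the iteration count. The helper below is a hypothetical sketch (the name `step_learning_rate` and the convention that the drop takes effect exactly at the listed iteration are assumptions), intended only to make the schedule explicit.

```python
def step_learning_rate(iteration, base_lr=0.01, drop_iterations=(24000, 36000), factor=10.0):
    """Learning rate at a given iteration: base_lr divided by `factor`
    once for every drop point that has been reached."""
    drops_passed = sum(1 for drop in drop_iterations if iteration >= drop)
    return base_lr / (factor ** drops_passed)


# Learning rates at a few points of the 40k-iteration fine-tuning run.
schedule = [step_learning_rate(it) for it in (0, 24000, 36000, 39999)]
```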


Settings. Different proxy tasks are combined with our M&M tuning to demonstrate its merits. In our experiments, as initialization, we use released models of different proxy tasks from learning by context (or Jigsaw Puzzles) (?) and learning by colorization (?). Both methods adopt 1.3 million unlabeled images in the ImageNet dataset (?) for training. Besides that, we also perform experiments on randomly initialized networks. In M&M tuning, we make use of the PASCAL VOC2012 dataset (?), which consists of 10,582 training samples with pixel-wise annotations. The same dataset is used in (??) for fine-tuning, so no additional data is used in M&M. For fair comparisons, all self-supervision methods are benchmarked on the PASCAL VOC2012 validation set, which comes with 1,449 images. We show the benefits of M&M tuning on different backbone networks, including AlexNet and VGG-16. To demonstrate the generalization ability of our learned model, we also report the performance of our VGG-16 full model on the PASCAL VOC2012 test set. We further apply our method to the CityScapes dataset (?), with 2,974 training samples, and report results on the 500 validation samples. All results are reported in mean Intersection over Union (mIoU), which is the standard evaluation criterion of semantic segmentation.
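For reference, mIoU can be computed by accumulating, for every class, the number of correctly labeled pixels (intersection) and the number of pixels that are either predicted as or annotated with that class (union) over the whole evaluation set. The sketch below is a minimal pure-Python version under assumptions: the function name is hypothetical, label maps are nested lists of integer class ids, pixels annotated with `ignore_label` (255 by assumption, as commonly used for void regions) are skipped, and classes that never occur are left out of the mean.

```python
def mean_iou(predictions, targets, num_classes, ignore_label=255):
    """Mean Intersection over Union accumulated over a set of label maps."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_map, target_map in zip(predictions, targets):
        for pred_row, target_row in zip(pred_map, target_map):
            for pred, target in zip(pred_row, target_row):
                if target == ignore_label:
                    continue
                if pred == target:
                    intersection[target] += 1
                    union[target] += 1
                else:
                    # A wrong pixel enlarges the union of both classes involved.
                    union[pred] += 1
                    union[target] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious)


# Toy example with two classes on a single 2x2 image.
score = mean_iou([[[0, 0], [1, 1]]], [[[0, 1], [1, 1]]], num_classes=2)
```

In the toy example class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.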


Overall. Existing self-supervision works report segmentation results on the PASCAL VOC2012 dataset. The highest performance among existing self-supervision methods is attained by learning by colorization (?), which achieves 38.4% mIoU and 56.0% mIoU with AlexNet and VGG-16 as the backbone network, respectively. Therefore, we adopt learning by colorization as our proxy task here. With our M&M tuning, we boost the performance to 42.8% mIoU and 64.5% mIoU with AlexNet and VGG-16 as the backbone network, respectively. As shown in Table 1, our method achieves state-of-the-art performance on semantic segmentation, outperforming (?) by 14.3% and (?) by 8.5% when using VGG-16 as the backbone network. Notably, our M&M self-supervision paradigm shows results comparable to (an advantage of 0.3 percentage points over) its ImageNet pre-trained counterpart. Furthermore, on the PASCAL VOC2012 test set, our approach achieves 64.3% mIoU, which is a record-breaking performance for self-supervision methods. Qualitative results of this model are shown in Fig. 6.

We additionally perform an ablation study on the AlexNet setting. As shown in Table 1, with the colorization task as pre-training, our class-wise connected graph outperforms ‘random triplets’ by 1.9% (42.8% vs. 40.9% mIoU), suggesting the importance of the class-wise connected graph. With random initialization, our model surprisingly performs even better than with colorization pre-training.

Method                                                    Arch.     mIoU (%)
ImageNet                                                  VGG-16    64.2
Random                                                    VGG-16    35.0
Larsson et al. (?)                                        VGG-16    50.2
Larsson et al. (?)                                        VGG-16    56.0
Ours (M&M + Graph, colorization pre-trained)              VGG-16    64.5
ImageNet                                                  AlexNet   48.0
Random                                                    AlexNet   23.5
k-means (?)                                               AlexNet   32.6
Pathak et al. (?)                                         AlexNet   29.7
Donahue et al. (?)                                        AlexNet   35.2
Zhang et al. (?)                                          AlexNet   35.6
Zhang et al. (?)                                          AlexNet   36.0
Noroozi et al. (?)                                        AlexNet   37.6
Larsson et al. (?)                                        AlexNet   38.4
Ours (M&M + Random Triplets, colorization pre-trained)    AlexNet   40.9
Ours (M&M + Graph, colorization pre-trained)              AlexNet   42.8
Ours (M&M + Graph, randomly initialized)                  AlexNet   43.6

Table 1: We test our model on the PASCAL VOC2012 validation set, which is the generally accepted benchmark for semantic segmentation with self-supervised pre-training. Our method achieves the state of the art with both VGG-16 and AlexNet architectures.

Method          aero   bike   bird   boat   bottle bus    car    cat    chair  cow    table  dog    horse  mbike  person plant  sheep  sofa   train  tv     mIoU
ImageNet        81.7   37.4   73.3   55.8   59.6   82.4   74.7   82.4   30.8   60.3   46.1   71.4   65.3   72.6   76.7   49.7   70.6   34.2   72.7   60.2   64.2
Colorization    73.6   28.5   67.5   55.5   50.2   78.3   66.1   78.3   26.8   60.8   50.6   70.6   64.9   62.2   73.5   38.2   66.8   38.8   68.1   55.1   60.2
M&M             83.1   37.0   69.6   56.1   62.9   84.4   76.4   82.8   33.4   61.5   44.7   67.3   68.5   68.0   78.5   42.2   72.7   37.2   75.7   58.6   64.5
ImageNet + M&M  84.5   39.4   76.3   60.3   64.6   85.4   77.7   84.1   35.6   63.6   50.4   70.6   72.0   73.6   80.1   50.2   73.7   37.6   77.8   66.6   67.4

Table 2: Per-class segmentation results (% IoU) on PASCAL VOC2012 val. The last row shows the additional results of our model combined with the ImageNet pre-trained model by averaging their prediction probabilities. The results suggest the complementary nature of our self-supervised method and the ImageNet pre-trained model.

Figure 4: Feature distribution with and without the proposed mix-and-match (M&M) tuning. We use 17,684 patches obtained from the PASCAL VOC2012 validation set to extract features, and map the high-dimensional features to a 2-D space with t-SNE, along with their categories. For clarity, we split the 20 classes into four parts in order. The first row shows the feature distribution of a naively fine-tuned model without M&M, and the second row depicts the feature distribution of a model additionally tuned with M&M. Note that the features are extracted from CNNs that have been fine-tuned on the segmentation task; in this case, the two CNNs have been trained on an identical amount of data and labels. Best viewed in color.

Per-class results. We analyze per-class results of M&M tuning on the PASCAL VOC2012 validation set. The results are summarized in Table 2. Compared with the baseline model that uses colorization as pre-training (we obtain higher performance than reported with the released pre-training model of (?)), our approach demonstrates significant improvements in classes including aeroplane, bike, bottle, bus, car, chair, motorbike, sheep, and train. A further attempt at combining our self-supervised model and the fully-supervised model (through averaging their predictions) leads to an even higher mIoU of 67.4%. The results suggest that self-supervision serves as a strong candidate complementary to the current fully-supervised paradigm.
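Combining two models by averaging their predictions can be illustrated with the following sketch. It assumes each model outputs, for every pixel, a list of class probabilities (stored here as nested lists of shape height x width x classes); the function name is hypothetical, and the final label is taken as the class with the highest averaged probability.

```python
def ensemble_labels(probs_a, probs_b):
    """Average two per-pixel class-probability maps and return the argmax label map."""
    labels = []
    for row_a, row_b in zip(probs_a, probs_b):
        label_row = []
        for pixel_a, pixel_b in zip(row_a, row_b):
            averaged = [(a + b) / 2.0 for a, b in zip(pixel_a, pixel_b)]
            # Index of the class with the largest averaged probability.
            label_row.append(max(range(len(averaged)), key=averaged.__getitem__))
        labels.append(label_row)
    return labels


# One row of two pixels, two classes; the models disagree on the first pixel.
combined = ensemble_labels(
    [[[0.9, 0.1], [0.2, 0.8]]],
    [[[0.4, 0.6], [0.3, 0.7]]],
)
```

On the first pixel the more confident model prevails, which is one intuition for why averaging can help when the two models make different errors.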

Applicability to different proxy tasks. Besides colorization (?), we also explore the possibility of using Jigsaw Puzzles (?) as our proxy task. Similarly, our M&M tuning boosts the segmentation performance from 36.5% mIoU to 44.5% mIoU (we use the released pre-training model of Jigsaw Puzzles (?) for fine-tuning and obtain a slightly lower baseline than the 37.6% mIoU reported in the paper). The result suggests that the proposed approach is widely applicable to other self-supervision methods. Our method can also be applied to randomly initialized cases. On PASCAL VOC 2012, M&M tuning boosts the performance from 19.8% mIoU to 43.6% mIoU with AlexNet and from 35.0% mIoU to 56.7% mIoU with VGG-16. The improvements of our method over different baselines are shown in Table 3 for PASCAL VOC 2012.

Benchmark        Backbone   Pre-train   Baseline   M&M            ImageNet
PASCAL VOC2012   AlexNet    Random      19.8       43.6           48.0
PASCAL VOC2012   AlexNet    Jigsaw      36.5       44.5           48.0
PASCAL VOC2012   AlexNet    Colorize    38.4       42.8           48.0
PASCAL VOC2012   VGG-16     Random      35.0       56.7           64.2
PASCAL VOC2012   VGG-16     Colorize    60.2       64.5 (64.3)    64.2
CityScapes       VGG-16     Random      42.5       49.1           67.9
CityScapes       VGG-16     Colorize    57.5       66.4 (65.6)    67.9

Table 3: The table shows the improvements of our method (in % mIoU) with different pre-training tasks. They are, respectively, Random (Xavier initialization) with AlexNet and VGG-16, Jigsaw Puzzles (?) with AlexNet, and Colorization (?) with AlexNet and VGG-16. Baselines are produced with naive fine-tuning. ImageNet pre-trained results for the same backbone and benchmark are regarded as the upper bound. Evaluations are conducted on the PASCAL VOC2012 validation set and the CityScapes validation set. Results on the test sets are shown in brackets.

Generalizability to CityScapes. We apply our method to the CityScapes dataset. With colorization as pre-training, naive fine-tuning yields 57.5% mIoU and M&M tuning improves it to 66.4% mIoU. The result is comparable with that of the ImageNet pre-trained counterpart, which yields 67.9% mIoU. With a randomly initialized network, M&M brings a large improvement from 42.5% mIoU to 49.1% mIoU. The comparison can be found in Table 3.

Further Analysis

Learned representations. To illustrate the learned representations enabled by M&M tuning, we visualize the sample distribution changes in the t-SNE embedding space. As shown in Fig. 4, after M&M tuning, samples from the same category tend to stay close while those from different categories are torn apart. Notably, this effect is more pronounced on categories of aeroplane, bike, bottle, bus, car, chair, motorbike, sheep, train and tv, which aligns with the per-class performance improvements listed in Table 2.

The effect of graph size. Here we investigate how the self-supervision performance is influenced by the graph size (the number of nodes in the graph), which defines the number of triplets that can be discovered. Specifically, we vary the image batch size, and hence the number of nodes in the graph, as shown in Fig. 5. The comparative study is performed on AlexNet with learning by colorization (?) as initialization. We have the following observations. On the one hand, a larger graph leads to higher performance, since it brings more diverse samples for more accurate metric learning. On the other hand, a larger graph takes longer to process, since a larger batch of images is fed in each iteration.

Efficiency. The previous study suggests that a trade-off between performance and speed can be achieved by adjusting the graph size. Nevertheless, our graph training process is very efficient. It takes a matter of hours on a single TITAN-X for both AlexNet and VGG-16, which is much faster than conventional ImageNet pre-training or any other self-supervised pre-training task.

Failure cases. We also include some failure cases of our method, as shown in Fig. 7. The failed examples can be explained as follows. Firstly, patches sampled from thin objects may fail to reflect the key characteristics of the object due to clutter, so the boat in the figure ends up as a false negative. Secondly, our M&M tuning method inherits the behavior of its base model (i.e., the colorization model) to some extent, which accounts for the case in the figure where the dog is falsely classified as a cat.

Figure 5: The figure shows that a larger graph brings better performance, but costs a longer time in each iteration. We train the model with the same hyper-parameters for different settings and test on PASCAL VOC2012 validation set.
Figure 6: Visual comparison on PASCAL VOC2012 validation set (top 4 rows) and CityScapes validation set (bottom 3 rows). (a) Image. (b) Ground Truth. (c) Results with ImageNet supervised pre-training. (d) Results with colorization pre-training. (e) Our results.
Figure 7: Our failure cases. (a) Image. (b) Ground Truth. (c) Results with ImageNet supervised pre-training. (d) Results with colorization pre-training. (e) Our results.


We have presented a novel ‘mix-and-match’ (M&M) tuning method for improving the performance of self-supervised learning on the semantic image segmentation task. Our approach effectively exploits mixed image patches to form a class-wise connected graph, from which triplets can be sampled to compute a discriminative loss for M&M tuning. It improves the performance of self-supervised semantic segmentation with different proxy tasks and different backbone CNNs on different benchmarks, achieving state-of-the-art results. It also outperforms its ImageNet pre-trained counterpart for the first time in the literature, shedding light on the enormous potential of self-supervised learning. M&M tuning can potentially be applied to various other tasks and is worth further exploration. Future work will focus on the essence and advantages of multi-step optimization such as M&M tuning.

Acknowledgement. This work is supported by SenseTime Group Limited and the General Research Fund sponsored by the Research Grants Council of the Hong Kong SAR (CUHK 14241716, 14224316, 14209217).


  • [Bearman et al. 2016] Bearman, A.; Russakovsky, O.; Ferrari, V.; and Fei-Fei, L. 2016. What’s the point: Semantic segmentation with point supervision. In ECCV.
  • [Cordts et al. 2016] Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; and Schiele, B. 2016. The Cityscapes dataset for semantic urban scene understanding. In CVPR.
  • [Dai, He, and Sun 2015] Dai, J.; He, K.; and Sun, J. 2015. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV.
  • [Deng et al. 2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In CVPR.
  • [Doersch et al. 2012] Doersch, C.; Singh, S.; Gupta, A.; Sivic, J.; and Efros, A. 2012. What makes Paris look like Paris? ACM Transactions on Graphics 31(4).
  • [Doersch, Gupta, and Efros 2015] Doersch, C.; Gupta, A.; and Efros, A. A. 2015. Unsupervised visual representation learning by context prediction. In ICCV.
  • [Donahue, Krähenbühl, and Darrell 2016] Donahue, J.; Krähenbühl, P.; and Darrell, T. 2016. Adversarial feature learning. arXiv:1605.09782.
  • [Dosovitskiy et al. 2015] Dosovitskiy, A.; Fischer, P.; Springenberg, J.; Riedmiller, M.; and Brox, T. 2015. Discriminative unsupervised feature learning with exemplar convolutional neural networks. arXiv:1506.02753.
  • [Everingham et al. 2010] Everingham, M.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2010. The PASCAL visual object classes (VOC) challenge. IJCV 88(2):303–338.
  • [Felzenszwalb and Huttenlocher 2004] Felzenszwalb, P. F., and Huttenlocher, D. P. 2004. Efficient graph-based image segmentation. IJCV 59(2):167–181.
  • [Gatys, Ecker, and Bethge 2015] Gatys, L. A.; Ecker, A. S.; and Bethge, M. 2015. A neural algorithm of artistic style. arXiv:1508.06576.
  • [Geiger et al. 2013] Geiger, A.; Lenz, P.; Stiller, C.; and Urtasun, R. 2013. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research 32(11):1231–1237.
  • [Hariharan et al. 2015] Hariharan, B.; Arbeláez, P.; Girshick, R.; and Malik, J. 2015. Hypercolumns for object segmentation and fine-grained localization. In CVPR.
  • [Hong et al. 2017] Hong, S.; Yeo, D.; Kwak, S.; Lee, H.; and Han, B. 2017. Weakly supervised semantic segmentation using web-crawled videos. arXiv:1701.00352.
  • [Krähenbühl et al. 2015] Krähenbühl, P.; Doersch, C.; Donahue, J.; and Darrell, T. 2015. Data-dependent initializations of convolutional neural networks. arXiv:1511.06856.
  • [Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS.
  • [Larsson, Maire, and Shakhnarovich 2016] Larsson, G.; Maire, M.; and Shakhnarovich, G. 2016. Learning representations for automatic colorization. In ECCV.
  • [Larsson, Maire, and Shakhnarovich 2017] Larsson, G.; Maire, M.; and Shakhnarovich, G. 2017. Colorization as a proxy task for visual understanding. arXiv:1703.04044.
  • [Li et al. 2017a] Li, X.; Liu, Z.; Luo, P.; Loy, C. C.; and Tang, X. 2017a. Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. In CVPR.
  • [Li et al. 2017b] Li, X.; Qi, Y.; Wang, Z.; Chen, K.; Liu, Z.; Shi, J.; Luo, P.; Tang, X.; and Loy, C. C. 2017b. Video object segmentation with re-identification. In CVPRW.
  • [Li, Socher, and Fei-Fei 2009] Li, L.-J.; Socher, R.; and Fei-Fei, L. 2009. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In CVPR.
  • [Li, Wu, and Tu 2013] Li, Q.; Wu, J.; and Tu, Z. 2013. Harvesting mid-level visual concepts from large-scale internet images. In CVPR.
  • [Lin et al. 2014] Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In ECCV.
  • [Lin et al. 2016] Lin, D.; Dai, J.; Jia, J.; He, K.; and Sun, J. 2016. Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR.
  • [Liu et al. 2015] Liu, Z.; Li, X.; Luo, P.; Loy, C.-C.; and Tang, X. 2015. Semantic image segmentation via deep parsing network. In ICCV.
  • [Liu et al. 2017] Liu, Z.; Li, X.; Luo, P.; Loy, C. C.; and Tang, X. 2017. Deep learning markov random field for semantic segmentation. TPAMI.
  • [Long, Shelhamer, and Darrell 2015] Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In CVPR.
  • [Maaten and Hinton 2008] Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-sne. JMLR 9:2579–2605.
  • [Noroozi and Favaro 2016] Noroozi, M., and Favaro, P. 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV.
  • [Papandreou et al. 2015] Papandreou, G.; Chen, L.-C.; Murphy, K. P.; and Yuille, A. L. 2015. Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In ICCV.
  • [Pathak et al. 2016a] Pathak, D.; Girshick, R.; Dollár, P.; Darrell, T.; and Hariharan, B. 2016a. Learning features by watching objects move. arXiv:1612.06370.
  • [Pathak et al. 2016b] Pathak, D.; Krähenbühl, P.; Donahue, J.; Darrell, T.; and Efros, A. A. 2016b. Context encoders: Feature learning by inpainting. In CVPR.
  • [Russakovsky et al. 2015] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. IJCV 115(3):211–252.
  • [Schroff, Kalenichenko, and Philbin 2015] Schroff, F.; Kalenichenko, D.; and Philbin, J. 2015. Facenet: A unified embedding for face recognition and clustering. In CVPR.
  • [Simonyan and Zisserman 2015] Simonyan, K., and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. In ICLR.
  • [Singh, Gupta, and Efros 2012] Singh, S.; Gupta, A.; and Efros, A. 2012. Unsupervised discovery of mid-level discriminative patches. In ECCV.
  • [Wang and Gupta 2015] Wang, X., and Gupta, A. 2015. Unsupervised learning of visual representations using videos. In ICCV.
  • [Zhang, Isola, and Efros 2016a] Zhang, R.; Isola, P.; and Efros, A. A. 2016a. Colorful image colorization. In ECCV.
  • [Zhang, Isola, and Efros 2016b] Zhang, R.; Isola, P.; and Efros, A. A. 2016b. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR.
  • [Zhao et al. 2017] Zhao, H.; Shi, J.; Qi, X.; Wang, X.; and Jia, J. 2017. Pyramid scene parsing network. In CVPR.