Distilling Localization for Self-Supervised Representation Learning


For high-level visual recognition, self-supervised learning defines and makes use of proxy tasks such as colorization and visual tracking to learn a semantic representation useful for distinguishing objects. In this paper, through visualizing and diagnosing classification errors, we observe that current self-supervised models are ineffective at localizing the foreground object, limiting their ability to extract discriminative high-level features. To address this problem, we propose a data-driven approach for learning invariance to backgrounds. It first estimates foreground saliency in images and then creates augmentations by copy-and-pasting the foreground onto a variety of backgrounds. The learning follows an instance discrimination approach which encourages the features of augmentations from the same image to be similar. In this way, the representation is trained to disregard background content and focus on the foreground. We study a variety of saliency estimation methods, and find that most methods lead to improvements for self-supervised learning. With this approach, strong performance is achieved for self-supervised learning on ImageNet classification, and also for transfer learning to object detection on PASCAL VOC 2007.

Self-supervised Learning, Transfer Learning, Saliency Estimation, Object Localization, Data Augmentation

1 Introduction


Visual recognition has been revolutionized by deep learning, which relies on assembling large amounts of labeled data [8] and training very deep neural networks [26]. However, the collection of supervisory signals, especially at a very large scale, is constrained by budget and time. As a result, there has been growing interest in self-supervised and unsupervised learning, which do not face this practical limitation.

For high-level visual recognition, previous approaches in self-supervised learning define proxy tasks which do not require human labeling but encode useful priors for object recognition. For example, objects in the same category tend to have similar colors [50] and similar spatial configurations [10]. A recent study shows that combining multiple self-supervised tasks by framing them within a multi-task learning problem improves the representation [11].

Figure 1: Motivation: for natural images with objects, backgrounds are usually shared across categories, while the distinctive region for determining the object is usually localized.

In this paper, by visualizing and diagnosing errors made by recent self-supervised models, we identify a strong pattern which is overlooked by prior works. Specifically, we find that current self-supervised models lack the ability to localize foreground objects, and the learned representation can be predominantly determined by background pixels. This is actually unsurprising, as self-supervised learning generally treats each spatial location as equally important, and it is well known that neural networks are prone to “cheat” [50] by taking advantage of unintended information. As a result, a network cannot be expected to discover objects unless it is driven to do so [1].

In supervised visual recognition, localization has been demonstrated to be a strong by-product of training on image-level labels. Strong object localization performance has been shown using the gradient of the class score in the pixel space [36]. It has also been found that adding precise localization information does not bring significant gains for PASCAL object classification when transferred from ImageNet [30]. Moreover, object segments have been estimated using only image-level labels via a class activation mapping method [52]. As suggested in Figure 1, we hypothesize that the learning signal that drives localization comes from the category-wise supervisory labels, because background contents (e.g., grass, sky, water) are usually shared among different categories while foreground objects are only salient within the same category.

The gap in the localization ability between self-supervised and supervised models motivates us to explore approaches for distilling localization of self-supervised representations. We study this problem by first estimating a foreground saliency mask for each training image. The training image and its corresponding saliency map are then used to create augmentations by pasting the foreground object onto various backgrounds. During representation learning, we follow the simple instance discrimination approach [44] using the augmentations for the same object on different backgrounds. This encourages the representation to become invariant to backgrounds, enabling localization of the foreground object.

For generating our augmentations, several saliency estimation methods are examined, including unsupervised non-learning based techniques [54, 45, 43], and a saliency network [33]. Our results show consistent improvements over the baseline. This demonstrates that object recognition benefits from representations with better localization, and that our approach is effective for addressing the localization problem. Due to its better localization ability, we also achieve strong transfer learning results for object detection on PASCAL VOC 2007.

In summary, this paper makes the following contributions:

1) A visualization-based study of recent self-supervised models that shows a limited capacity to localize objects.

2) A data-driven method that improves the localization modeling of self-supervised representations by learning invariance to backgrounds.

3) An investigation of different kinds of saliency estimation methods for improving localization, including traditional saliency and network-predicted saliency.

2 Related Work

Unsupervised and Self-Supervised Learning. Unsupervised learning aims to extract semantically meaningful representations without human labels [7]. Self-supervised learning is a sub-branch of unsupervised learning which automatically generates learning signals from the data itself. These learning signals have been derived from proxy tasks that involve semantic image understanding but do not require semantic labels for training. These tasks have been based on prediction of color [50], context [10, 32], rotation [17], and motion [31]. Auto-encoders [40] and GANs [18, 12] have also shown promising results for representation learning through reconstructing images.

Another direction of work for self-supervised learning is to achieve invariances in a data-driven fashion by image augmentation. Exemplar CNN [14] and instance discrimination [44] create augmentations of an image through changes in color, spatial location and scale. Transitive invariance [42] augments data by tracking the same object in a video sequence, and may possibly learn deformation invariance. Our paper is in line with these works, and we propose a non-trivial augmentation for distilling localization information.

Saliency Estimation. Saliency estimation refers to the task of estimating the locations of interesting objects consistent with human perception. For learning saliency, datasets [3] have been collected by tracking eye fixations over an image. Later works usually consider saliency as the full foreground object.

Previous non-learning based approaches [54, 46] rely on handcrafted features and use priors to find salient object regions. Useful priors include background priors [21], color contrast priors [5], and objectness [25]. Deep supervised methods [33] train a segmentation network to regress the foreground mask, outperforming all non-learning based methods. Recent research on saliency estimation also explores unsupervised learning methods. It integrates multiple non-learning based methods into a noise optimization framework [48], showing encouraging results that are on par with supervised methods.

In a network, the salient region corresponds to pixels that fire for the classification decision. Previous works study this in both the input space via gradient visualization [36] and the output space via activation mapping [52]. A prior work [51] also finds the salient region by optimizing a minimal region that determines the classification response.

Figure 2: Visualizing and analyzing the error patterns of self-supervised models. Given an input for each model, we visualize its top-3 nearest neighbors in the embedding space, as well as the gradient on the pixel space with respect to the classification signal. Compared with the supervised model, which is able to localize the salient objects, self-supervised models (Colorization, RotationNet, InstDisc) look holistically over the image and are prone to distraction by backgrounds. More visualizations are available in the appended supplement.

Copy-and-paste for Visual Recognition. Several works create data in a copy-and-paste fashion for visual recognition. A key insight of such an approach is that data being generated may not look realistic, but the trained model generalizes surprisingly well to real data. For example, Flying Chairs [13] renders chairs onto various backgrounds to generate data for optical flow estimation. Cut-paste-learn [15] randomly puts household object instances in an indoor environment for instance detection and segmentation. Instaboost [16] spatially shifts the foreground objects as a means of data augmentation for instance segmentation. Copy-pasting GAN [1] uses the copy-and-paste idea to discover objects in an unsupervised manner. However, their experiments are performed on toy examples, such as discovering artificial boxes. Moreover, they do not show how discovering objects may help recognition. Our work follows this path, but in contrast to these previous works our method is targeted to self-supervised representation learning. We note that our augmented images are extremely unrealistic, but provide useful information for learning a recognition model.

Image Augmentations. Data augmentation plays a key role in visual recognition. Recent works devise handcrafted augmentations [9] or learning-based methods [6, 34] to boost representation learning, especially in semi-supervised learning. Our copy-paste augmentation is the first introduced for self-supervised learning. From it, we seek to gain further understanding about the ineffective localization problem in self-supervised learning.

3 Visualizing and Diagnosing Self-Supervision

In contrast to supervised representation learning, which usually optimizes a cross-entropy loss between predictions and labels, prior works on self-supervised representation learning have employed a diverse range of approaches [50, 32, 10, 11, 44]. The behaviors of self-supervised models thus may vary considerably, and it is important to understand them in this work.

A variety of methods have been presented for visualizing the behavior of supervised convolutional neural networks, based on deconvolution [47], class-specific gradients [36], and class activation mapping [52, 35]. However, there is little work on visualizing and analyzing the error patterns of self-supervised models, particularly for understanding the relationship between a proxy task and the semantic labels.

In the following, we visualize some self-supervised models with a focus on understanding the salient regions when self-supervised networks extract features.

Methods. For model visualization, we employ two existing methods with minor adjustments.

  • Nearest Neighbors. A straightforward way to diagnose what a feature has learned is to find the nearest neighbors in the feature space. By identifying patterns on what draws neighbors close to each other, we gain insights about what the features represent.

  • Class-specific gradients. The magnitude of class-score gradients in the pixel space provides information about how important the pixels are for classification. This approach has proven to be strong for weakly-supervised object localization [36]. Since self-supervised models do not have classifiers for objects, we train a linear classifier on top of the extracted features. Then we do back-propagation through the linear classifier and the rest of the self-supervised network to calculate the gradients in the pixel space.
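The two diagnostics above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the pipeline used for Figure 2: the "network" is a hypothetical toy model (one `tanh` feature layer followed by a linear classifier) so that the pixel-space gradient can be written by hand, and the names `top_k_neighbors`, `gradient_saliency`, `w_feat`, and `w_cls` are invented for this example.

```python
import numpy as np

def top_k_neighbors(query, bank, k=3):
    """Indices and cosine similarities of the k rows of `bank` closest to `query`."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q
    order = np.argsort(-sims)[:k]          # highest similarity first
    return order, sims[order]

def gradient_saliency(x, w_feat, w_cls, class_idx):
    """|d score_c / d x| for the toy model score = w_cls @ tanh(w_feat @ x)."""
    h = np.tanh(w_feat @ x)                        # toy "self-supervised" features
    grad_h = w_cls[class_idx] * (1.0 - h ** 2)     # back-propagate through tanh
    grad_x = w_feat.T @ grad_h                     # back-propagate to the input
    return np.abs(grad_x)                          # gradient magnitude per input element

rng = np.random.default_rng(0)

# Nearest neighbors: the query is a slightly perturbed copy of bank row 4.
bank = rng.normal(size=(10, 8))
query = bank[4] + 0.01 * rng.normal(size=8)
idx, sims = top_k_neighbors(query, bank, k=3)

# Gradient saliency on a flattened 16-element "image" with 5 classes.
x = rng.normal(size=16)
w_feat = rng.normal(size=(8, 16))
w_cls = rng.normal(size=(5, 8))
sal = gradient_saliency(x, w_feat, w_cls, class_idx=2)
```

In practice the gradient would be obtained by automatic differentiation through the real backbone and the trained linear classifier; the hand-written backward pass here only makes the chain rule explicit.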

Investigated Models. We examine various self-supervised models, based on colorization [50], rotation [17], and instance discrimination [14, 44].

  • Colorization [50] learns a representation by predicting the ab channels (in Lab color space) of a grayscale image.

  • Rotation [17] generates rotated images at 0, 90, 180, and 270 degrees and learns a representation by predicting the rotations.

  • Instance discrimination [14, 44] treats each individual instance as a class and learns a representation by multi-class classification.

Error Patterns. Figure 2 illustrates our major findings. We observe that for a considerable number of error cases, the similarity between a query and its nearest neighbors exists mainly in their backgrounds. Gradient-based saliency visualization confirms such findings, as the salient regions for self-supervised models are in the background instead of the foreground. For comparison, we also show the corresponding results for the supervised models, which instead show similarities among the foregrounds.

These findings are not unexpected, because these self-supervised methods treat foreground and background pixels equally and do not enforce a loss that drives the model to discover objects. This lack of localization ability calls for salient region modeling in self-supervised learning.

Figure 3: Examples of saliency estimations methods. We show 6 saliency estimations, including traditional methods (GS [43], MC [24], RBD [54]), supervised methods (BASNet [33]), and class-specific methods visualized from a pretrained network (CAM [52], Gradient [36]).

4 Saliency Estimation

In distilling localization ability for self-supervised methods, our approach first estimates saliency masks. A saliency mask should depict the regions most relevant for classifying the object. Typically, it coincides with the foreground object region, as indicated by most saliency datasets [41].

Note that recent research on unsupervised saliency estimation has shown promising progress. However, these models [49, 28] heavily rely on ImageNet and semantic segmentation pretraining, which violates our experimental protocols. We avoided these methods in this paper, and instead consider the following saliency estimation techniques.

Non-learning Based Methods. Traditional saliency estimation methods use handcrafted features, and rely on priors and heuristics to find the dominant object in an image. Useful priors include the background prior (pixels on the image border are more likely to be background) and the color contrast prior (edges with high contrast tend to belong to the foreground). We investigate several high-performing methods: RBD [54], MC [24], and GS [43].

Supervised Saliency Networks. Recent methods for saliency estimation commonly employ deep learning on large annotated saliency datasets [41]. These deep models outperform non-learning based methods by a large margin. A state-of-the-art supervised saliency network BASNet [33] trained on the DUTS dataset [41] from scratch is included in the investigation.

Class-specific Saliency. The aforementioned methods estimate saliency as foreground object regions. However, it is not clear that this represents the discriminative part of an image (e.g., only the face of a person may be important for recognizing humans). To keep the problem open, we also compare with CAM [52] and a gradient-based method [36] through class-specific visualizations. For [36], we convert the gradients to a mask using a segmentation algorithm [19].

Summary. Figure 3 shows examples of the saliency visualizations. Traditional methods are seen to be noisy, while network-produced saliency is much cleaner. It can be noticed that class-specific saliency from a pretrained network tends to be more compact around discriminative regions. This indicates that the use of full foreground saliency may not be ideal.

5 Distilling Localization via Background Invariance

Our goal is to learn a representation from which the foreground object can be automatically localized, such that discriminative regions can be focused on to improve recognition. We propose to distill the ability for object localization by learning invariance against the background. We first revisit a recent data-driven method for learning invariance, and then describe our approach.


| Augmentations | Self-Supervised | Supervised |
|---|---|---|
| Center Crop | 5.8 | 67.1 |
| + Flipping | 7.3 | 69.5 |
| + Spatial Crop | 24.7 | 75.0 |
| + Scale Crop | 36.8 | 76.5 |
| + Color Jitter | 51.7 | 74.9 |
| + Random Gray | 56.5 | 75.0 |

Table 1: A comparison study of the role of data augmentations for learning self-supervised and supervised representations. Please refer to the main text for details.

5.1 Revisiting Instance Discrimination for Learning Invariance

Our work follows a prior non-parametric instance discrimination approach [44] for unsupervised learning. To promote instance discrimination, the algorithm generates image augmentations in the spatial domain, scale space, and color space. It encourages augmentations of the same image to have similar feature embeddings, and augmentations of different images to have dissimilar embeddings.

Let $x_i$ denote the $i$-th image and $v_i = f(x_i)$ be its feature embedding, where $f(\cdot)$ is the embedding function implemented as a convolutional neural network. Let $\hat{x}_i = T(x_i)$ represent an augmentation of image $x_i$, where $T(\cdot)$ is a random augmentation function, and let $\hat{v}_i = f(\hat{x}_i)$ be its embedding. The probability of the augmentation $\hat{x}_i$ being classified as the $i$-th identity is expressed as

$$P(i \mid \hat{x}_i) = \frac{\exp(v_i^\top \hat{v}_i / \tau)}{\sum_{j=1}^{n} \exp(v_j^\top \hat{v}_i / \tau)}, \tag{1}$$

where $\tau$ is a temperature parameter and $n$ is the total number of images in the dataset. The learning objective is to minimize the negative log-likelihood over the dataset:

$$\mathcal{L} = -\sum_{i=1}^{n} \log P(i \mid \hat{x}_i). \tag{2}$$

The effectiveness of such an instance discrimination approach for unsupervised learning strongly relies on the types of augmentations $T(\cdot)$, i.e., image transformation priors that do not change object identity. In Table 1, we summarize the role of data-driven augmentations for both the instance discrimination model and the supervised model. We gradually add each type of transformation to the set of augmentations. The performance is measured on the ImageNet validation set of 1000 classes, and evaluated by linear classifiers. The model is trained with the ResNet50 architecture. The unsupervised setting is trained with the default parameters as in [44].
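To make the instance discrimination objective concrete, the following NumPy sketch computes the softmax probability of each augmentation being assigned to its own instance and the resulting negative log-likelihood. It is an illustration under assumptions rather than the training code: it uses a full softmax over all instances instead of a memory bank with sampled negatives, averages the loss instead of summing it, and the default `tau=0.07` and the function name `instance_discrimination_nll` are hypothetical choices for this example.

```python
import numpy as np

def instance_discrimination_nll(aug_emb, bank, tau=0.07):
    """Mean negative log-likelihood of augmentation i being classified as instance i.

    aug_emb: (n, d) L2-normalized embeddings of one augmentation per image.
    bank:    (n, d) L2-normalized embeddings of the n images (one per instance).
    tau:     temperature (assumed illustrative value).
    """
    logits = aug_emb @ bank.T / tau                       # (n, n) scaled similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # subtract row max for stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))             # log-probability of the true instance

# Four instances with orthonormal embeddings.
bank = np.eye(4)
loss_matched = instance_discrimination_nll(bank, bank)                        # augmentations agree with their image
loss_mismatched = instance_discrimination_nll(np.roll(bank, 1, axis=0), bank)  # each augmentation resembles another image
loss_uniform = instance_discrimination_nll(bank, bank, tau=1e9)               # very high temperature
```

The loss is near zero when each augmentation is closest to its own instance, large when it resembles a different instance, and approaches log(n) when the temperature is so high that all instances look equally likely.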

We find that the unsupervised representation gains much more classification accuracy from the augmentations than the supervised representation. This indicates that the priors present in the augmentations strongly overlap with the modeling cues from semantic labels. Adding intense color jittering improves the unsupervised representation but hurts the supervised representation. This suggests that the color jitter prior expands beyond the original data distributions. Nevertheless, adding a prior that only partially relates to semantics improves self-supervised learning significantly.

Figure 4: Generated copy-paste augmentations using three kinds of background images.

5.2 Copy-and-paste for Background Augmentation

Based on the previous findings, we propose to copy the foreground object estimated from the saliency methods in Section 4, and paste that onto various backgrounds as a means of data-driven augmentation for learning localization.

Background Datasets. For this augmentation, we ablate three types of backgrounds.

  • Homogeneous grayscale images with a random grayscale level.

  • Texture images from the MIT Vision Texture dataset [27].

  • Image crops from ImageNet which have no saliency response using unsupervised saliency [49].

Figure 4 shows copy-and-pasted examples using various background images.

Blending. For pasting, we examine three techniques: directly copying the foreground object onto the background, copying with Gaussian blending on the object borders, and a mixture of the two approaches.

Accounting for Context. Context plays an important role in recognizing objects [39]. Though the surrounding context of an object may not be the most discriminative region for recognition, it may help to prune the set of candidates. For example, a tree is unlikely to be completely encompassed by sky. To account for this during augmentation, we set a probability of keeping the original full image without copy-and-paste augmentation.

Integrating other Augmentations. Since copy-paste augmentation is orthogonal to other previous augmentations, i.e. random scaling, cropping, color jittering, the order of copy-paste augmentation with respect to other augmentations does not matter. In our implementation, we first run copy-paste augmentation to replace the background, and then perform other augmentations.
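The copy-and-paste procedure described in this section can be sketched as follows. This is a minimal sketch under assumptions, not the authors' implementation: it covers only the homogeneous grayscale background, binarizes the saliency map at a hypothetical `threshold=0.5`, uses a hand-rolled separable Gaussian blur with an assumed `sigma` for the blended variant, and the names `copy_paste`, `gaussian_blur_mask`, and `p_aug` are invented for this example.

```python
import numpy as np

def gaussian_blur_mask(mask, sigma=2.0):
    """Separable Gaussian blur of a 2-D mask, used to soften the object border."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-t ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(mask, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def copy_paste(image, saliency, rng, p_aug=0.5, blend=False, threshold=0.5):
    """Paste the salient foreground of `image` onto a random gray background.

    image:    (H, W, 3) float array in [0, 1].
    saliency: (H, W) float array in [0, 1], higher means more salient.
    p_aug:    probability of applying the augmentation; otherwise the original
              image is kept so that context is still seen during training.
    """
    if rng.random() >= p_aug:
        return image
    mask = (saliency > threshold).astype(np.float64)
    if blend:
        mask = gaussian_blur_mask(mask)              # soft alpha near the border
    background = np.full_like(image, rng.random())   # homogeneous random gray level
    alpha = mask[..., None]
    return alpha * image + (1.0 - alpha) * background

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
saliency = np.zeros((32, 32))
saliency[8:24, 8:24] = 1.0                           # a square "object" in the center

out_hard = copy_paste(image, saliency, rng, p_aug=1.0)
out_soft = copy_paste(image, saliency, rng, p_aug=1.0, blend=True)
out_same = copy_paste(image, saliency, rng, p_aug=0.0)
```

The output of `copy_paste` would then be passed to the usual random crop, scale, and color augmentations, matching the order described above.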

6 Experiments

We conduct a series of experiments on model designs for self-supervised representation learning and their transfer learning abilities.

6.1 Baseline Settings

We largely follow the instance discrimination [44] settings as our baseline. Specifically, we use a temperature in Eqn. 1, and an embedding dimension of for each image. A memory bank [44] is used to accelerate discrimination. Unless otherwise noted, we use number of negatives for most experiments. The training loss is the softmax NCE [22] which proves to be more stable than the original NCE [20]. This leads to about in improvement for our baseline. Training takes 200 epochs with an initial learning rate of that is decayed at epochs 120 and 160.

During testing, performance is evaluated by a linear readout on the penultimate layer features. The optimization takes 100 epochs and starts with a learning rate of 30 that is decayed every 30 epochs.
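The two learning-rate schedules mentioned in this subsection can be written as small helper functions. This is a sketch under assumptions: the decay factor `gamma=0.1` is a common convention and is not stated in the text, the pretraining base learning rate is left as an argument because its value is not given here, and the function names are hypothetical.

```python
def milestone_lr(base_lr, epoch, milestones=(120, 160), gamma=0.1):
    """Pretraining schedule: multiply by gamma at each milestone epoch that has been reached."""
    return base_lr * gamma ** sum(epoch >= m for m in milestones)

def step_lr(base_lr, epoch, step=30, gamma=0.1):
    """Linear-evaluation schedule: multiply by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Learning rate of the linear classifier over its 100 training epochs.
linear_eval_lrs = [step_lr(30.0, epoch) for epoch in range(100)]
```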

(a) Saliency estimation methods

| Saliency | $F_\beta$ | MAE | Acc | Δ |
|---|---|---|---|---|
| Baseline [44] | - | - | 56.5 | - |
| GS | 0.557 | 0.173 | 57.8 | +1.3 |
| MC | 0.627 | 0.186 | 58.1 | +1.6 |
| RBD | 0.630 | 0.144 | 59.3 | +2.8 |
| BASNet | 0.805 | 0.056 | 62.9 | +6.4 |

(b) Augmentation ratio

| Aug Ratio | Linear | Δ |
|---|---|---|
| Baseline [44] | 56.5 | - |
| 20% | 59.3 | +2.8 |
| 50% | 59.3 | +2.8 |
| 80% | 57.5 | +1.0 |
| 100% | 44.5 | -12.0 |

(c) Background images

| Background | Linear | Δ |
|---|---|---|
| Baseline [44] | 56.5 | - |
| Texture | 57.1 | +0.6 |
| Imagenet | 59.2 | +2.7 |
| Grayscale | 59.3 | +2.8 |

(d) Blending options

| Blending | Linear | Δ |
|---|---|---|
| Baseline [44] | 56.5 | - |
| No blend | 59.3 | +2.8 |
| Gaussian | 58.9 | +2.4 |
| Mix | 59.3 | +2.8 |

Table 2: Ablation studies for investigating copy-and-paste augmentations: (a) saliency estimation methods, (b) the ratio of images receiving copy-and-paste augmentation, (c) background images, and (d) blending options.

6.2 Ablation Study

In this section, we first validate our data-driven approach of distilling localization through a series of ablation experiments on ImageNet ILSVRC2012 1K. All models are trained using the ResNet50 architecture and reported on the ImageNet validation set. Experimental settings follow the baseline, except that we distill the localization ability through copy-and-paste augmentations.

A naive approach. First of all, to demonstrate the necessity of a data-driven approach, we consider a naive approach that pools the final layer features by masking according to saliency. With this, the performance decreases sharply by , possibly because the model loses too much context. Moreover, by masking out the features, the model is still unable to localize the discriminative regions automatically.

Saliency Estimation. In Table 2 (a), we examine several class-agnostic saliency estimation methods. All of them are found to improve performance, even the noisy traditional approaches RBD [54], MC [24] and GS [43]. RBD improves the performance by 2.8 points and the saliency network BASNet by 6.4 points. The supervised BASNet [33] is trained on the DUTS dataset [41] from scratch with 10,053 training images, which is less than 1% of ImageNet images. Though human annotation is involved, it indicates potential room for developing better unsupervised saliency approaches. In Table 2 (a), we find a correlation between the saliency performance on the saliency benchmark (measured by $F_\beta$ and MAE on the DUT-OMRON dataset [46]) and the quality of the self-supervised representation: better saliency translates to better representations. We also note that more localized saliency may potentially bring additional improvements, since we observe that class-specific saliency in Figure 3 is centered around discriminative regions instead of the full foreground object.
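For readers unfamiliar with the two saliency metrics referred to here, the sketch below shows how MAE and an F-measure are commonly computed for a predicted saliency map against a binary ground-truth mask. It is an illustration under assumptions and not the benchmark protocol used for Table 2: the F-measure is evaluated at a single fixed `threshold` (benchmarks often report the maximum over thresholds), and `beta2=0.3` is the weighting conventionally used in salient object detection.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and its ground truth (lower is better)."""
    return float(np.mean(np.abs(pred - gt)))

def f_beta(pred, gt, threshold=0.5, beta2=0.3):
    """Weighted F-measure of the binarized prediction (higher is better)."""
    binary = pred >= threshold
    truth = gt > 0.5
    tp = np.logical_and(binary, truth).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max(truth.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return float((1 + beta2) * precision * recall / (beta2 * precision + recall))

gt = np.zeros((4, 4))
gt[0, :] = 1.0                  # four foreground pixels in the top row
pred = np.zeros((4, 4))
pred[0, :2] = 1.0               # two correct detections
pred[1, :2] = 1.0               # two false alarms

example_mae = mae(pred, gt)     # 4 wrong pixels out of 16
example_f = f_beta(pred, gt)    # precision = recall = 0.5
```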

Background Images. We ablate the use of various background images in Table 2 (c). Texture backgrounds improve the performance very marginally. This is possibly because textured images in the dataset [3] are outside of the ImageNet distribution. Homogeneous grayscale backgrounds and ImageNet backgrounds perform similarly well.

Amount of Augmentation. During dataloading, we apply the copy-and-paste augmentation to each image only with a given probability. We ablate this ratio in Table 2 (b). With only 20% to 50% of images receiving copy-and-pastes, we significantly improve the performance by 2.8 points. Always using the copy-and-paste augmentation hurts performance, dropping accuracy 12.0 points below the baseline.

Blending Options. When copy-and-pasting an object to a background, blending has proven to be important for object detection [15]. In our study in Table 2 (d), blending appears to be less important. Direct copy-and-pasting with no blending performs reasonably well. This difference is possibly because detection requires realistic boundaries, which prevents the network from taking shortcuts, while for classification, boundary cheating is not as significant.

Baseline Comparisons. In Table 3, we evaluate the copy-and-pasting augmentation on various network architectures: AlexNet [26], VGG16 [37], ResNet18 [23] and ResNet50. Following prior works, we insert Batch Normalization layers into AlexNet and VGG16 for faster convergence. RBD saliency consistently improves the performance by around 2.8 to 2.9 points, and BASNet by 5.3 to 6.4 points.

Figure 5: Successful examples where our model outperforms our baseline. The improvement is due to better localization and background invariance.
Figure 6: Failure examples where our model underperforms the supervised model. The model finds it difficult when multiple objects appear in the image, or the object is of a fine-grained category.

Visualizations. In Figure 5 and Figure 6, we visualize examples where our model outperforms the baseline, as well as some failure cases. For all the successful cases, our salient region on the gradient and the nearest neighbors focus on the discriminative object, while the baseline approach is distracted by the background. This validates the claim that our data-driven augmentation drives the model to learn to automatically localize the object. Such localization leads to better recognition performance.

For the failure cases, we compare our model with the supervised model. We find that there are two error patterns. First, multiple objects appear in a single image, and our model makes wrong decisions on where to focus. Second, the testing image is of a fine-grained class, too difficult to recognize without labels.

| Method | AlexNet Acc | Δ | VGG16 Acc | Δ |
|---|---|---|---|---|
| Baseline [44] | 42.2 | - | 48.7 | - |
| Ours RBD | 45.1 | +2.9 | 51.5 | +2.8 |
| Ours BASNet | 47.5 | +5.3 | 54.5 | +5.8 |

| Method | ResNet18 Acc | Δ | ResNet50 Acc | Δ |
|---|---|---|---|---|
| Baseline [44] | 45.9 | - | 56.5 | - |
| Ours RBD | 48.8 | +2.9 | 59.3 | +2.8 |
| Ours BASNet | 52.3 | +6.4 | 62.9 | +6.4 |

Table 3: Comparisons with baselines.

| Methods | Architecture | Linear |
|---|---|---|
| DC [4] | VGG | 48.4 |
| LA [55] | R50 | 58.8 |
| CPC [29] | R170-w | 65.9 |
| CMC [38] | 2xR50-w2 | 68.1 |
| BigBiGAN [12] | Rv50-w4 | 61.3 |
| AMDIM [2] | customized | 68.1 |
| MoCo [22] | R50 | 60.6 |
| MoCo [22] | R50-w2 | 65.4 |
| Ours RBD (K=65536) | R50 | 60.6 |
| Ours RBD (K=65536) | R50-w2 | 65.3 |

Table 4: State-of-the-art comparisons.

6.3 Transfer Learning Results

We evaluate the transfer learning ability of our model on object recognition, scene categorization, and object detection benchmarks, and compare with the state-of-the-art methods.

For our method, we employ the traditional RBD saliency [54], which is completely unsupervised. The background is modeled as random grayscale images. When loading each image, we turn on copy-and-pasting augmentation with a probability of 0.5. To fairly compare with recent works [22], we use K = 65536 negatives for training.

ImageNet Classification. To be consistent with prior works, we report performance using a single center crop. In Table 4, our method is competitive with state-of-the-art methods that use the same network architecture, while showing consistent improvement on larger networks. Furthermore, our method is orthogonal to prior works: plugging the copy-and-pasting augmentation into them is likely to bring additional improvements.



Method Super. MoCo Ours Super. MoCo Ours Super. MoCo Ours




Table 5: Transfer learning for object detection on VOC 2007 using Faster R-CNN with R50-C4. We present the gap to ImageNet supervised pre-training in the brackets for reference. For MoCo [22], we use the officially released model for finetuning. All numbers are the averages of three independent runs.


Pretrained Models Supervised ImageNet InstDisc [44] DC [4] LA [55] Ours


Accuracy 49.8 45.5 38.9 46.8


Table 6: Scene recognition on Places. LA reports 10-crop accuracy on this dataset.

Object Detection. We transfer our pretrained model to object detection by finetuning it on PASCAL VOC 2007 trainval and evaluating on the VOC 2007 test set. Following the state-of-the-art method MoCo [22], we use the exact same training protocol to finetune the Faster R-CNN with a Res50-C4 backbone as with the supervised counterpart. A critical BN layer is added after the stage (including global pooling) in the box prediction head. During training, we finetune all layers with synchronized BN. The finetuning takes 9k iterations. Results are summarized in Table 5. We close the gap of unsupervised pretraining and supervised pretraining under , and surpass the supervised pretraining by a large margin under and .

Scene Categorization. To test the generalization ability of the representations, we transfer the model learned on ImageNet to Places [53]. We compare our results with prior works using the ResNet50 architecture. In Table 6, our model improves the baseline and compares favorably with prior works. However, the improvement observed is much smaller than the result on ImageNet. We hypothesize that this is because the gap between scene recognition and object recognition requires a different feature representation. We further test the model pretrained on ImageNet in a supervised manner and find its transfer performance on Places to be marginally above ours. This confirms our hypothesis that the common knowledge shared between these two tasks is limited, which may limit further transferability.

7 Conclusion

In this work, we identify a strong error pattern among self-supervised models: their failure to localize foreground objects. We then propose a simple data-driven approach to distill localization via learning invariance against backgrounds. We achieve strong results on ImageNet classification and in transfer to object detection on PASCAL VOC 2007. The improvements achieved suggest that the localization problem for self-supervised representation learning is prevalent. However, our method may not be the ideal way to solve this localization problem. We are interested in finding a clever “proxy task” which can help distill such localization abilities.

Supplementary Materials

Appendix A1 Visualizing and Diagnosing Self-Supervision

We provide more visualization examples in this supplement. We use K-Nearest Neighbors (KNN) to diagnose what the features learned by self-supervised models capture. Given an input image for each model, we visualize its top-3 nearest neighbors in the embedding space, and show the results in Figs. 7 to 13. Compared with the supervised model, which is able to localize the salient objects/regions, self-supervised models (Colorize [50], RotNet [17], DC [4], LA [55], and InstDisc [44]) are prone to being distracted by the backgrounds.

Figure 7: Visualizing the error patterns of self-supervised models using KNN (K=3).
Figure 8: Visualizing the error patterns of self-supervised models using KNN (K=3).
Figure 9: Visualizing the error patterns of self-supervised models using KNN (K=3).
Figure 10: Visualizing the error patterns of self-supervised models using KNN (K=3).
Figure 11: Visualizing the error patterns of self-supervised models using KNN (K=3).
Figure 12: Visualizing the error patterns of self-supervised models using KNN (K=3).
Figure 13: Visualizing the error patterns of self-supervised models using KNN (K=3).


  1. Equal contribution.


  1. R. Arandjelović and A. Zisserman (2019) Object discovery with a copy-pasting gan. arXiv preprint arXiv:1905.11369. Cited by: §1, §2.
  2. P. Bachman, R. D. Hjelm and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910. Cited by: Table 4.
  3. Z. Bylinskii, T. Judd, A. Borji, L. Itti, F. Durand, A. Oliva and A. Torralba MIT saliency benchmark. Cited by: §2, §6.2.
  4. M. Caron, P. Bojanowski, A. Joulin and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: Appendix A1, Table 4, Table 6.
  5. M. Cheng, N. J. Mitra, X. Huang, P. H. Torr and S. Hu (2014) Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.
  6. E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan and Q. V. Le (2019) Autoaugment: learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 113–123. Cited by: §2.
  7. V. R. de Sa (1994) Learning classification with unlabeled data. In Advances in neural information processing systems, Cited by: §2.
  8. J. Deng, W. Dong, R. Socher, L. Li, K. Li and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, Cited by: §1.
  9. T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: §2.
  10. C. Doersch, A. Gupta and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §1, §2, §3.
  11. C. Doersch and A. Zisserman (2017) Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2051–2060. Cited by: §1, §3.
  12. J. Donahue and K. Simonyan (2019) Large scale adversarial representation learning. arXiv preprint arXiv:1907.02544. Cited by: §2, Table 4.
  13. A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers and T. Brox (2015) Flownet: learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, Cited by: §2.
  14. A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller and T. Brox (2015) Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2, 3rd item, §3.
  15. D. Dwibedi, I. Misra and M. Hebert (2017) Cut, paste and learn: surprisingly easy synthesis for instance detection. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §2, §6.2.
  16. H. Fang, J. Sun, R. Wang, M. Gou, Y. Li and C. Lu (2019) Instaboost: boosting instance segmentation via probability map guided copy-pasting. arXiv preprint arXiv:1908.07801. Cited by: §2.
  17. S. Gidaris, P. Singh and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728. Cited by: Appendix A1, §2, 2nd item, §3.
  18. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, Cited by: §2.
  19. V. Gulshan, C. Rother, A. Criminisi, A. Blake and A. Zisserman (2010) Geodesic star convexity for interactive image segmentation. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Cited by: §4.
  20. M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Cited by: §6.1.
  21. J. Han, D. Zhang, X. Hu, L. Guo, J. Ren and F. Wu (2014) Background prior-based salient object detection via deep reconstruction residual. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §2.
  22. K. He, H. Fan, Y. Wu, S. Xie and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §6.1, §6.3, §6.3, Table 4, Table 5.
  23. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §6.2.
  24. B. Jiang, L. Zhang, H. Lu, C. Yang and M. Yang (2013) Saliency detection via absorbing markov chain. In Proceedings of the IEEE international conference on computer vision, Cited by: Figure 3, §4, §6.2.
  25. P. Jiang, H. Ling, J. Yu and J. Peng (2013) Salient region detection by ufo: uniqueness, focusness and objectness. In Proceedings of the IEEE international conference on computer vision, Cited by: §2.
  26. A. Krizhevsky, I. Sutskever and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, Cited by: §1, §6.2.
  27. M. MediaLab (1995) VisTex texture database. Web site: http://vismod.media.mit.edu/vismod/imagery/VisionTexture/vistex.html. Cited by: 2nd item.
  28. T. Nguyen, M. Dax, C. K. Mummadi, N. Ngo, T. H. P. Nguyen, Z. Lou and T. Brox (2019) DeepUSPS: deep robust unsupervised saliency prediction via self-supervision. In Advances in Neural Information Processing Systems, Cited by: §4.
  29. A. v. d. Oord, Y. Li and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: Table 4.
  30. M. Oquab, L. Bottou, I. Laptev and J. Sivic (2015) Is object localization for free?-weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  31. D. Pathak, R. Girshick, P. Dollár, T. Darrell and B. Hariharan (2017) Learning features by watching objects move. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2701–2710. Cited by: §2.
  32. D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §2, §3.
  33. X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan and M. Jagersand (2019) BASNet: boundary-aware salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2, Figure 3, §4, §6.2.
  34. A. J. Ratner, H. Ehrenberg, Z. Hussain, J. Dunnmon and C. Ré (2017) Learning to compose domain-specific transformations for data augmentation. In Advances in neural information processing systems, pp. 3236–3246. Cited by: §2.
  35. R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §3.
  36. K. Simonyan, A. Vedaldi and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: §1, §2, Figure 3, 2nd item, §3, §4.
  37. K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §6.2.
  38. Y. Tian, D. Krishnan and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: Table 4.
  39. A. Torralba (2003) Contextual priming for object detection. International journal of computer vision. Cited by: §5.2.
  40. P. Vincent, H. Larochelle, Y. Bengio and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, Cited by: §2.
  41. L. Wang, H. Lu, Y. Wang, M. Feng, B. Yin and X. Ruan (2017) Learning to detect salient objects with image-level supervision. In CVPR, Cited by: §4, §4, §6.2.
  42. X. Wang, K. He and A. Gupta (2017) Transitive invariance for self-supervised visual representation learning. In Proceedings of the IEEE international conference on computer vision, Cited by: §2.
  43. Y. Wei, F. Wen, W. Zhu and J. Sun (2012) Geodesic saliency using background priors. In European conference on computer vision, Cited by: §1, Figure 3, §4, §6.2.
  44. Z. Wu, Y. Xiong, S. X. Yu and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Appendix A1, §1, §2, 3rd item, §3, §3, §5.1, §5.1, §6.1, Table 2, Table 2, Table 2, Table 2, Table 4, Table 6.
  45. Q. Yan, L. Xu, J. Shi and J. Jia (2013) Hierarchical saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  46. C. Yang, L. Zhang, H. Lu, X. Ruan and M. Yang (2013) Saliency detection via graph-based manifold ranking. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §2, §6.2.
  47. M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, Cited by: §3.
  48. D. Zhang, J. Han and Y. Zhang (2017) Supervision by fusion: towards unsupervised learning of deep salient object detector. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §2.
  49. J. Zhang, T. Zhang, Y. Dai, M. Harandi and R. Hartley (2018) Deep unsupervised saliency detection: a multiple noisy labeling perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4, 3rd item.
  50. R. Zhang, P. Isola and A. A. Efros (2016) Colorful image colorization. In European conference on computer vision, Cited by: Appendix A1, §1, §1, §2, 1st item, §3, §3.
  51. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva and A. Torralba (2014) Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856. Cited by: §2.
  52. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §1, §2, Figure 3, §3, §4.
  53. B. Zhou, A. Lapedriza, J. Xiao, A. Torralba and A. Oliva (2014) Learning deep features for scene recognition using places database. In Advances in neural information processing systems, Cited by: §6.3.
  54. W. Zhu, S. Liang, Y. Wei and J. Sun (2014) Saliency optimization from robust background detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §1, §2, Figure 3, §4, §6.2, §6.3.
  55. C. Zhuang, A. L. Zhai and D. Yamins (2019) Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: Appendix A1, Table 4, Table 6.