Inner-Scene Similarities as a Contextual Cue for Object Detection
Using image context is an effective approach for improving object detection. Previously proposed methods used contextual cues that rely on semantic or spatial information. In this work, we explore a different kind of contextual information: inner-scene similarity. We present the CISS (Context by Inner Scene Similarity) algorithm, which is based on the observation that two visually similar sub-image patches are likely to share semantic identities, especially when both appear in the same image. CISS uses base-scores provided by a base detector and operates as a post-detection stage. For each candidate sub-image (denoted anchor), the CISS algorithm finds a few similar sub-images (denoted supporters), and, using them, calculates a new enhanced score for the anchor. This is done by utilizing the base-scores of the supporters and a pre-trained dependency model. The new score is modeled as a linear function of the base-scores of the anchor and the supporters and is estimated using a minimum mean square error optimization. This approach results in: (a) improved detection of partly occluded objects (when there are similar non-occluded objects in the scene), and (b) fewer false alarms (when the base detector mistakenly classifies a background patch as an object). This work relates to Duncan and Humphreys’ “similarity theory,” a psychophysical study which suggested that the human visual system perceptually groups similar image regions and that the classification of one region is affected by the estimated identity of the other. Experimental results demonstrate the enhancement of a base detector’s scores on the PASCAL VOC dataset.
Figure 1: (a) The 7 top-scoring image regions according to a base detector (Fast R-CNN). The best detection for the ‘person’ category is shown in blue and the best 6 detections for the ‘sheep’ category in pink; (b) and (c) show two false positive detections in yellow, and their similar patches in red; (d) shows the 7 top-scoring image regions according to CISS.
1 Introduction
Object detection is composed of object localization and object classification, and is one of the major computer vision challenges. Traditional object detection and categorization used tailored appearance features to recognize objects in images. In recent years, appearance features have been generated automatically when training deep convolutional neural networks (CNNs) for classification or segmentation tasks. Until recently, it was common to train a classifier using a set of annotated images containing single objects, and afterwards to apply the classifier to sub-images of full scenes. The sub-images were usually extracted by a naive sliding window method, or by segmentation-based methods that extracted ‘proposals’ (candidate object locations). Some recent CNNs were designed to infer both the regions of interest and their detection scores using one forward pass [35, 37].
Several models that use contextual information for improving detection performance have been proposed. The most common context-based methods involve semantic cues and spatial cues. Some of those methods use semantic context by applying statistics on various object class co-occurrences in training data, while other methods model the probability of an object class to appear in the image given the image’s gist. Some methods use spatial cues by utilizing the statistics of the locations, or relative locations, for object classes in scenes.
This work explores a different kind of contextual information: inner-scene similarity. While intra-image patch similarities are used for other image processing and computer vision applications such as super-resolution [4, 21], texture classification [30, 40], segmentation and object search, information of this type has not been explored as a cue for improving object detection accuracy. We show that by utilizing inner-scene similarity, we can sometimes overcome two main difficulties that current object detectors do not handle very well: (1) detection in the presence of background clutter and (2) detection of partly-occluded objects. Moreover, we show that the cue of inner-scene similarity can reduce false alarms.
We present the CISS (Context by Inner Scene Similarity) algorithm. CISS is based on the observation that two visually similar sub-image patches are likely to share semantic identities, especially when both appear in the same image. A similar observation was used in [2], which followed psychophysical studies [13] and proposed a dynamic visual search attentional framework driven mainly by inner-scene similarity.
CISS uses ‘base-scores’ provided by a base detector and operates as a post-detection stage, proposing enhanced scores. It is generic in the sense that it does not depend on the detector used and can enhance any detector’s scores. In this work, we demonstrate CISS using Fast R-CNN [19] as the base detector.
We follow [2] and use the linear minimum mean square error (MMSE) estimator to calculate the new scores. However, unlike [2], which considered the dependencies between similarity and identity, we quantify (a) the dependencies between sub-image similarity and the correlation of the two corresponding base-scores, and (b) the dependencies between the base-score of one candidate object and the identity of another candidate, given the similarity between them. The statistics of these dependencies are learned from training data and utilized by the estimation process. We evaluate our algorithm on the well-known PASCAL VOC benchmark [14].
Figure 1 demonstrates CISS’s contribution. In Figure 1(a) and 1(d) we show the 7 image regions for which Fast R-CNN [19] provided the highest base-score, and the 7 top scoring regions by CISS, respectively. In Figure 1(b) and 1(c) we show two example anchors (in yellow) with the similar ‘supporters’ found by CISS (in red). Both examples depict false alarms of the base detector: in Figure 1(b) the false alarm is a background error, and in Figure 1(c) the false alarm is a “localization error” [24].
Object recognition and object detection methods usually aim to utilize features that are invariant to viewing conditions and are scene independent. In this work we wish to do the contrary: we exploit the fact that two objects (or background patches) that share an identity and appear in the same image are usually similar also in features that are non-invariant to viewing conditions such as weather, luminance, texture, and even pose. Moreover, given that both objects belong to the same class, they will usually belong to a common subclass, and will have more in common than any two objects sharing identity but captured at different times and locations. Therefore, while detectors recognize objects mainly on the basis of those parts that are meaningful for characterizing the objects’ classes, CISS uses a complementary similarity measure based on color and texture, and only weakly on the spatial pixel arrangement. As a result, CISS can increase the score of an object, given that other similar objects in the scene were assigned a higher base-score, and can decrease the score of a background patch given that other background patches with similar texture and color were assigned very low base-scores. It can therefore lead to better detection of objects of low visibility, or partly occluded objects, and can reduce false alarms.
The rest of this paper is organized as follows. In Section 2 we review related work. We describe the CISS algorithm in Section 3 and report experimental results in Section 4. Section 5 concludes the paper.
2 Related Work
2.1 Traditional Object Detection
Advances in object detection (and other vision tasks) during the previous decade have relied mainly on the use of SIFT [29] and HOG features [10]. Two dominant approaches that use those features are the Bag of Visual Words [8] and the Deformable Parts Model (DPM) [16]. The DPM represents objects using mixtures of a root filter and deformable part HOG filters, forming a ‘star’ model. The detector obtains scores for (almost) all locations in the image using a sliding window. On top of this model, the authors created a mixture of star models for each category. This detector was a leading detection method for some time, until neural network based image classifiers broke through computation and scalability barriers [25].
2.2 CNN Based Detectors
One of the first high performing deep CNN based object detectors is R-CNN [20]. It takes numerous candidate object locations, called “proposals” [39], and classifies each using a forward pass of an object recognition network such as the one in [25]. To achieve better running time and more accurate localization results, the author of [19] created Fast R-CNN, a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations. One forward pass is used to estimate all classes and locations of the candidates in the image.
Faster and more recent works make region proposals a part of the network. In [37], one network implements end-to-end detection, including region extraction. The YOLO model [35] also performs detection by a single forward pass. It predicts both bounding boxes and class probabilities that are associated with entries in a fixed image grid.
2.3 Object Detection and Context
Several models that use contextual information to improve detection performance have been proposed. The most common contextual cues involve semantic context and spatial context. The former take into account other object classes that may appear in the image and use statistics on co-occurrences in training data. The latter learn the expected locations for object classes in scenes and use them as priors for detection.
Contextual cues can also be categorized in other ways: those that use global context vs. those that use local context, or those that use object-scene interaction vs. those that use object-object interaction. The contextual cues may be integrated as a post-processing step given the detector output [6, 34], or integrated within the detector itself. A good example of an object detection method that uses context can be found in the work of Choi et al. [6]. The authors propose a graphical tree model that captures the contextual cues of co-occurrences and spatial relationships. They show that their model can improve the results of object detection tasks (both for object localization and for presence prediction), as well as other scene understanding tasks that local detectors alone cannot solve, such as detecting objects that are out-of-context.
2.4 Similarity Between Images or Image Patches
Traditionally, color and texture attributes are used to determine the similarity between images (for tasks such as texture classification [30, 40]) or between image parts (for image segmentation [28, 39], denoising [9], image completion and image extrapolation [7, 26], super-resolution [4, 21], texture synthesis [11] and saliency detection [3, 22]). It is common to describe each image patch by some feature vector and then to measure similarities by simple distance functions (e.g., Euclidean), or by histogram intersection (when the feature vectors are histograms). The features may be based on texture (e.g., SIFT-like [29] or based on responses of basis filters) or on color information. A well-tested approach to describe patch texture is the bag of textons proposed by Malik et al. [30] and used for texture segmentation.
Several CNN based works have managed to perform well using ‘Siamese’ networks. The ‘Siamese’ network acts as a feature extractor after which a comparison method can be applied. In [38], a simple distance metric is used to compare patches. In [23] and [42] an additional network is learned for the comparison stage. In MatchNet [23], the full network is disassembled during prediction, to allow faster and more efficient computations.
Patch texture similarity can also be measured by applying a material classification network to the patches and then comparing the outputs (we adopted this method in our work; see details in Section 3). Bell et al. [5] proposed a material classifier that uses large texture patches for training.
3 The Algorithm
3.1 Overview and Notations
A general description of CISS is given in Algorithm 1. CISS starts by applying a base detector on the input image. The base detector outputs a set of triplets $(r, c, s)$, where $r$ is a rectangle and $s$ is a score that quantifies the likelihood of the sub-image bounded by $r$ to depict an object of the category $c$. We refer to $s$ as the base-score, and to each $r$ with a base-score higher than a threshold $T$ as an anchor.
For each anchor $r_a$, CISS finds $K$ similar rectangular image regions $r_1, \dots, r_K$. We refer to these regions as supporters. The visual dissimilarity between an anchor and each of its supporters is measured by a function $d$ and denoted $d_k = d(r_a, r_k)$. How CISS finds the supporters is explained in detail in Section 3.2.
For each supporter $r_k$, CISS obtains the base-score $s_k$ associated with the likelihood of $r_k$ to depict an object of category $c$. The score of the anchor is then revised, taking into account $s_a$, $\{s_k\}$, $\{d_k\}$, and the dependency functions $f_{ss}$ and $f_{sl}$. We assume that the revised score $\hat{l}_a$ is a linear function of $s_a$ and $\{s_k\}$. Using this assumption, $\hat{l}_a$ is obtained by a minimum mean square error (MMSE) estimator. For details on the estimation process see Section 3.3.
3.2 Finding Similar Supporters
To quantify the appearance similarity of two patches (an anchor and a potential supporter), we extract color and texture descriptors from each. For color, we use an HSV color histogram with 10 bins per channel. To describe the texture of a patch we use a pre-trained neural network based material classifier [5]. This classifier provides 23 confidence values corresponding to 23 materials, for every image pixel. We describe the textural characteristics of a sub-image patch with a 23-dimensional vector, by averaging the activation maps over the patch’s pixels.
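We then compute the chi-square distances between the associated color and textural descriptors, and compute a weighted average of the two:
$$ d(r_a, r_k) = \alpha \, \chi^2\!\left(h^{c}_a, h^{c}_k\right) + (1 - \alpha) \, \chi^2\!\left(h^{t}_a, h^{t}_k\right), \tag{1} $$
where $h^{c}$ and $h^{t}$ denote the color and texture descriptors of a patch and $\alpha \in [0, 1]$ is the weight.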
In addition, we add a pyramid based comparison in order to roughly compare the spatial arrangement inside the patches.
Naturally, other descriptors and similarity measures can be used. We also experimented with histograms of textons [30] and the results were only slightly inferior.
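The weighted chi-square comparison described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `0.5` factor in the chi-square distance (one common convention), the histogram normalization, and the default weight `alpha=0.5` are all assumptions, and the pyramid-based spatial comparison is omitted.

```python
def chi_square(p, q, eps=1e-12):
    """Chi-square distance between two non-negative histograms.

    Uses the 0.5 * sum((p - q)^2 / (p + q)) convention (an assumption),
    which yields values in [0, 1] for normalized histograms.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same length")
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(p, q))


def normalize(hist):
    """Scale a histogram so that its entries sum to 1 (no-op if all zero)."""
    total = sum(hist)
    return [v / total for v in hist] if total > 0 else list(hist)


def patch_distance(color_a, tex_a, color_b, tex_b, alpha=0.5):
    """Weighted average of the color and texture chi-square distances.

    alpha is a hypothetical weight; the value used in the paper is not
    reproduced here.
    """
    d_color = chi_square(normalize(color_a), normalize(color_b))
    d_texture = chi_square(normalize(tex_a), normalize(tex_b))
    return alpha * d_color + (1.0 - alpha) * d_texture
```

With normalized descriptors, identical patches get distance 0 and patches with disjoint histograms approach 1, so a fixed threshold on the distance is meaningful across patches of different sizes.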
For every candidate anchor we iteratively choose the most similar supporting region which is of the same size as the anchor (up to 20% difference) and does not overlap with previously chosen supporters. This is done efficiently as described in Section 4.2. Some examples of the results of this procedure can be viewed in Figure 2.
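The iterative selection can be sketched as a greedy loop over candidates sorted by distance. This is only an illustration under stated assumptions: boxes are `(x, y, width, height)` tuples, the candidate distances are assumed to be precomputed, the default `k=3` is hypothetical, and the 0.25 distance cap and the exclusion of regions overlapping the anchor are taken from the implementation details in Section 4.2.

```python
def overlaps(a, b):
    """True if two (x, y, w, h) boxes share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def similar_size(anchor, box, tol=0.2):
    """True if the box width and height are within tol of the anchor's."""
    _, _, aw, ah = anchor
    _, _, bw, bh = box
    return abs(bw - aw) <= tol * aw and abs(bh - ah) <= tol * ah


def select_supporters(anchor, candidates, k=3, max_dist=0.25):
    """Greedily pick up to k supporters for an anchor.

    candidates is a list of (box, distance) pairs. A candidate is accepted
    if it is close enough in feature space, similar in size to the anchor,
    and does not overlap the anchor or any previously chosen supporter.
    """
    chosen = []
    for box, dist in sorted(candidates, key=lambda c: c[1]):
        if len(chosen) == k or dist > max_dist:
            break  # enough supporters, or all remaining ones are too dissimilar
        if not similar_size(anchor, box):
            continue
        taken = [anchor] + [b for b, _ in chosen]
        if any(overlaps(box, other) for other in taken):
            continue
        chosen.append((box, dist))
    return chosen
```

Sorting once and scanning in order of increasing distance gives the same result as repeatedly picking the most similar remaining region, because a rejected candidate can never become valid later.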
3.3 Re-evaluating Anchor Scores
3.3.1 Estimating scores with the linear MMSE
In the CISS approach, we consider each object identity as a binary random variable $l$ ($l = 1$ when $r$ depicts an object from category $c$ and $l = 0$ otherwise) and consider $s_a$ and $s_k$, $k = 1, \dots, K$, the base-scores, as continuous random variables. Once we have selected the supporters for the anchor $r_a$, we can re-estimate its score, or its probability to belong to the class $c$. We model $\hat{l}_a$, the estimate of $l_a$, as a linear function of $s_a$ and $\{s_k\}$,
$$ \hat{l}_a = a_0 s_a + \sum_{k=1}^{K} a_k s_k + b. \tag{2} $$
Using the common technique of MMSE estimation [33], we then calculate
$$ \hat{l}_a = E[l_a] + a_0 \left(s_a - E[s_a]\right) + \sum_{k=1}^{K} a_k \left(s_k - E[s_k]\right), \tag{3} $$
where $E[s_a]$ and $E[s_k]$ are set to the expected value of the category’s base-score for a random patch. $E[l_a]$ is the probability for a random patch to be associated with the category, and $a_0, \dots, a_K$ are the coefficients for which the mean square error is minimal. The value of the coefficients is given by:
$$ (a_0, a_1, \dots, a_K)^{T} = C^{-1} R, \tag{4} $$
where the $(K+1) \times (K+1)$ matrix $C$ includes the covariances between $s_a$ and the $s_k$’s in the first row and column, and the covariances between the $s_k$’s in rows and columns 2 to $K+1$. $R$ is a vector consisting of the covariances between $s_a$ and $l_a$ and between each $s_k$ and $l_a$.
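The estimation step can be sketched with NumPy as follows. This is a minimal illustration of a linear MMSE estimator, not the authors' code: the function names are hypothetical, the covariance entries are assumed to come from two dependency functions of the pairwise patch distance (passed in as `f_ss` and `f_sl`), and the variance of a base-score is assumed to equal the score-score function evaluated at distance zero.

```python
import numpy as np


def build_covariances(dists, f_ss, f_sl):
    """Build the covariance matrix C and the vector R.

    dists is a symmetric (K+1) x (K+1) matrix of pairwise patch distances
    in which index 0 is the anchor and indices 1..K are the supporters.
    f_ss(d) models the covariance between two base-scores, and f_sl(d)
    the covariance between a base-score and an identity (assumptions).
    """
    n = len(dists)
    C = np.array([[f_ss(dists[i][j]) for j in range(n)] for i in range(n)],
                 dtype=float)
    R = np.array([f_sl(dists[0][i]) for i in range(n)], dtype=float)
    return C, R


def mmse_score(scores, mean_score, prior, C, R):
    """Linear MMSE estimate of the anchor identity.

    scores = [s_a, s_1, ..., s_K]; mean_score is the expected base-score of
    a random patch; prior is the probability that a random patch belongs to
    the category.
    """
    coeffs = np.linalg.solve(C, R)  # solves C a = R without forming C^-1
    centered = np.asarray(scores, dtype=float) - mean_score
    return float(prior + coeffs @ centered)
```

Calling `mmse_score` with a single score and a 1 x 1 covariance matrix gives the estimate for an anchor with no supporters, which is useful as a baseline when judging how much the supporters changed the score.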
3.3.2 Characterizing the dependency behavior
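$C$ and $R$ are estimated using training data. We model the covariance between two base-scores and between a base-score and the identity as a descending function of the feature-space distance $d$ between the associated sub-images, and denote it with $f(d)$:
$$ \mathrm{cov}(s_i, s_j) = f_{ss}(d_{ij}), \qquad \mathrm{cov}(s_i, l_j) = f_{sl}(d_{ij}), $$
where $f_{ss}$ and $f_{sl}$ are functions that best fit the covariance values of the training data. As demonstrated in Section 4.1, we extract sub-images with known, manually annotated identities and obtain the base-scores for them. Then, using a large set of such sub-image pairs (with base-scores $s_i$ and $s_j$, and identities $l_i$ and $l_j$), we calculate the covariance values for each distance interval $D$:
$$ \widehat{\mathrm{cov}}_{ss}(D) = \frac{1}{|P_D|} \sum_{(i,j) \in P_D} \left(s_i - E[s]\right)\left(s_j - E[s]\right), \tag{5} $$
$$ \widehat{\mathrm{cov}}_{sl}(D) = \frac{1}{|P_D|} \sum_{(i,j) \in P_D} \left(s_i - E[s]\right)\left(l_j - E[l]\right), \tag{6} $$
where $P_D$ is the set of training pairs whose distance $d_{ij}$ falls in the interval $D$.
Then, we compute exponential descending functions $f_{ss}$ and $f_{sl}$ that are the best fit for $\widehat{\mathrm{cov}}_{ss}$ and $\widehat{\mathrm{cov}}_{sl}$.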
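The two steps above (per-interval covariance estimation followed by an exponential fit) can be sketched in plain Python. This is an illustration under assumptions: the function names and the equal-width binning are hypothetical, and the fit uses least squares on the logarithm of the covariance, a simple swapped-in technique that is not necessarily the fitting procedure used by the authors.

```python
import math


def binned_covariance(pairs, mean_x, mean_y, bin_width):
    """Average (x - mean_x) * (y - mean_y) within each distance interval.

    pairs is an iterable of (distance, x, y) triples, e.g. the two
    base-scores of a patch pair, or a base-score and an identity.
    Returns a dict mapping each bin center to its covariance estimate.
    """
    sums, counts = {}, {}
    for dist, x, y in pairs:
        b = int(dist // bin_width)
        sums[b] = sums.get(b, 0.0) + (x - mean_x) * (y - mean_y)
        counts[b] = counts.get(b, 0) + 1
    return {(b + 0.5) * bin_width: sums[b] / counts[b] for b in sorted(sums)}


def fit_exponential(points):
    """Fit cov(d) ~ a * exp(-b * d) to (distance, covariance) points.

    Only strictly positive covariances are used, since the fit is a
    linear regression of log(cov) on the distance.
    """
    xs = [d for d, c in points if c > 0]
    ys = [math.log(c) for d, c in points if c > 0]
    n = len(xs)
    mean_x = sum(xs) / n if n else 0.0
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if n < 2 or sxx == 0:
        raise ValueError("need at least two distinct distances with positive covariance")
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return math.exp(mean_y - slope * mean_x), -slope  # (a, b)
```

The fitted pair `(a, b)` defines a descending function `lambda d: a * math.exp(-b * d)` whenever `b` is positive, which can then play the role of a dependency function in the MMSE estimation.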
4 Experiments
In Section 4.1 we describe in detail the estimation of the dependency characteristics (the covariance functions). In Section 4.2 we discuss some CISS implementation issues and in Section 4.3 we report results for the PASCAL VOC detection challenge. In Section 4.4 we further discuss the advantages and weaknesses of CISS.
4.1 Characterization of the Dependency Behavior
The dependency functions between base-scores of two sub-images and between a base-score and sub-image identity, are inferred from training data. For this task we used the SUN09 database . We chose this dataset as it contains many fully annotated images (including the annotations of background regions). We extracted patches from the first 3000 images of the training set. The collected patches are either the bounding boxes of objects, or rectangular patches containing background. Background patches are either boxes contained inside a background segment (and then described by one background category), or randomly selected patches that may be associated with multiple background categories. The number of collected background patches roughly equaled the number of objects in the image. For each selected patch we obtained the base-score provided by Fast R-CNN . For each pair of patches coming from the same image, we measured the similarity using the distance described in equation (1) when . Each pair was classified as a ‘same’ pair if both patches are similarly annotated, or as a ‘not-same’ pair otherwise.
Figure 3 shows the distribution of the distances associated with ‘same’ pairs and the distribution of distances associated with ‘not-same’ pairs. As expected, we see a strong correlation between inner-scene similarity and the probability that the two patches share or don’t share an identity. Specifically, we see that if the measured visual distance between two patches is low (below 0.25), there is a very high probability that they describe objects (or background) from the same class.
Figure 4 displays in blue the estimated covariance between two base-scores and the estimated covariance between a base-score and an identity (using the data collected from SUN09 and equations (5) and (6)). As can be seen, the estimated covariances are indeed monotonically descending functions of the patch distance. The best-fitting monotonically descending exponential functions are displayed in red. These analytic functions were used as the dependency functions in the CISS experiments described in Section 4.3.
4.2 CISS Implementation Details
In this section we describe the details of the CISS implementation, when using the Fast R-CNN detector [19] as the base detector that provides the base-scores. Fast R-CNN returns for each image a set of detections, each defined by a rectangular region, a class and a base-score. In our implementation, anchors are regions whose base-scores are above the anchor threshold and whose width and height exceed 15 pixels. For evaluation purposes, other boxes provided by the detector are treated as anchors that have no supporters. Consequently, Eq. (3) is applied to all input boxes, and all have comparable ‘CISS-scores’.
To efficiently locate supporters, the color and textural features described above are extracted for all image locations and for all relevant scales using integral images. Given the anchor’s dimensions, we search for supporters of similar dimensions (within the size tolerance of Section 3.2), having the same aspect ratio as the anchor. The chi-square distance from the anchor’s descriptors is then computed in parallel for each possible image location. Given the distances between the anchor’s descriptors and all possible locations for supporters, we select up to $K$ supporters with distance not exceeding 0.25. We also limit the choice of supporters so that they do not overlap with the anchor or with each other. Thus, we iteratively select the most similar (lowest feature-wise distance) supporter not overlapping with previous selections. The search for supporters is run on a GeForce GTX TITAN X GPU.
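The role of integral images here can be illustrated with a short sketch: once a summed-area table is built for a feature channel (for example one histogram bin or one material confidence map), the average of that channel over any rectangle takes four lookups regardless of the rectangle's size. The function names and the list-of-lists image representation are assumptions made for the illustration; a practical implementation would use array operations and a GPU.

```python
def integral_image(channel):
    """Summed-area table of a 2D channel, padded with a zero row and column.

    table[y][x] holds the sum of channel[0:y][0:x].
    """
    height, width = len(channel), len(channel[0])
    table = [[0.0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0.0
        for x in range(width):
            row_sum += channel[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def box_mean(table, x, y, w, h):
    """Mean of the channel over the box with top-left (x, y) and size w x h."""
    total = (table[y + h][x + w] - table[y][x + w]
             - table[y + h][x] + table[y][x])
    return total / (w * h)
```

Evaluating `box_mean` for every channel of a candidate location yields its averaged descriptor, so scanning all locations and scales costs time proportional to the number of candidates rather than to their area.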
4.3 CISS Results on PASCAL VOC
In this section we report results of CISS on the PASCAL VOC2007 [15] benchmark detection challenge. The Fast R-CNN detector [19] with its CaffeNet model is used both as a baseline for comparison and as the base detector providing the base-scores.
Examples of CISS’s results can be seen in Figure 6. The base-score and the CISS-score cannot be directly compared, so for each anchor, we also apply Eq. (3) as if it had no supporters. The resulting revised base-score is comparable with the CISS-score, and it is therefore the value we report next to the CISS-score in the examples in Figure 6. Note that the transformation from the base-score to the revised base-score is monotonic and preserves the order between base-scores (for each category). The upper row of the figure shows cases in which CISS reduced the score of background patches or object parts (reducing false alarms). The second row of Figure 6 shows cases where CISS increased the scores of objects due to their similarity to other objects in the scene (increasing the detection rate).
In order to evaluate the CISS algorithm, we compared the precision vs. recall (P-R) curves of Fast R-CNN and CISS for the 20 PASCAL categories. See two examples in Figure 5(a) and 5(b). As demonstrated, CISS’s results are sometimes inferior for low recalls and usually improve precision for higher recalls. Note that for most applications, the low recall section of the P-R curve has no practical importance. Yet, the most common evaluation measure for object detection is mean average precision (mAP), the mean over categories of the average precision (AP), which is equal to the area under the P-R curve. This area averages the performance for low and high recall, providing the same weight for both. The F-score measure, on the other hand, compares the best performing point in the P-R curve of different detectors. We summarize the results for both criteria in Table 1. As can be seen, CISS improves the F-score results for 17 of the 20 PASCAL categories, while preserving the F-score for all other categories but one. The AP is also improved for 11 of the classes.
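The difference between the two criteria can be made concrete with a small sketch that computes both from a list of (recall, precision) points. Two assumptions are made here: the F-score is taken to be the F1 measure (the harmonic mean of precision and recall), and the area is computed with the trapezoidal rule, which is a simplification of the interpolated AP used by the PASCAL VOC evaluation. The function names are hypothetical.

```python
def f_score(precision, recall):
    """F1 measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def best_f_score(pr_points):
    """Best F1 over all (recall, precision) operating points of a curve."""
    return max(f_score(p, r) for r, p in pr_points)


def area_under_pr(pr_points):
    """Trapezoidal area under a P-R curve given as (recall, precision) points."""
    pts = sorted(pr_points)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area
```

Because the area integrates over the whole recall range, a loss of precision at low recall lowers it even if the best operating point improves, whereas `best_f_score` depends only on that single point.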
|Fast R-CNN + CISS||68.5||73.6||54.6||44.0||23.4||70.2||72.2||73.9||31.4||64.5||60.8||63.4||76.1||68.7||59.9||22.3||52.9||55.7||69.8||56.6||58.1|
|Fast R-CNN + CISS||69.1||72.7||58.4||50.3||31.1||70.2||73.6||74.5||38.0||66.0||64.1||64.8||74.5||70.2||63.3||31.8||57.1||59.2||71.5||60.1||61|
|Fast R-CNN + CISS||82.8||86.8||70.5||69.2||35.7||81.5||84.5||93.0||51.6||88.5||77.8||92.0||92.1||86.4||84.4||43.7||71.0||77.8||89.6||60.6||76|
|Fast R-CNN + CISS||81.3||81.6||68.3||66.5||38.0||78.1||80.1||88.9||51.6||82.9||73.8||87.3||87.9||81.2||77.3||45.7||67.0||76.0||84.7||62.5||73|
In previous works [24, 36], several other drawbacks of the mAP measure have been demonstrated, and alternative measures have been proposed. It was suggested there that localization and ‘similar-object’ errors are less important than other errors. We followed this observation and tested how CISS performs when ignoring the localization and ‘similar-object’ errors. As can be seen in Figure 5(c) and 5(d), CISS’s disadvantage for low recalls is significantly reduced, while the advantage for high recalls is maintained. See Table 2 for mAP and F-scores of CISS vs. Fast R-CNN when only considering the more crucial errors.
4.4 Failure Cases and Discussion
Figure 7 demonstrates some failure cases. In Figure 7(a) we demonstrate a case where the supporter is mistakenly classified by the base detector with a high base-score (as an instance of the class ‘bottle’), which incorrectly leads to an increase of the anchor’s score. In Figure 7(b) we demonstrate a case where an object part is found to be similar to a full object. As a result, CISS increased the score of the part. In Figure 7(c) we demonstrate a case where the anchor and supporter are both of the same class (a person), but the base detector assigns a low base-score to the supporter. As a result, CISS decreased the score of the anchor.
Several extensions to consider for avoiding such failures are described in Section 5. Nonetheless, it should be noted that the CISS algorithm improves detection accuracy more often than it degrades it. When the goal is a better operating point, incorporating CISS as a post-detection step is worthwhile.
5 Conclusions
In this work, we explored inner-scene similarities as a contextual cue for enhancing object detection performance. Our method relates to a psychophysical study [13] which suggested that the human visual system perceptually groups similar image regions and then estimates their identity simultaneously. Statistical analysis confirms that patches sharing visual properties also tend to share semantic identities. We have shown that the suggested CISS algorithm improves the operating point for the Fast R-CNN detector by reducing the number of false alarms and enhancing detection of partly occluded objects. Interestingly, we used a similarity measure that is based on descriptors that vary with pose and illumination, an approach which goes against the common computer vision wisdom. Yet comparing patches in this manner allows us to use the increased similarity between instances of the same object class in the same image, which is neglected by the classifiers.
In this work, we showed how the inner-scene similarity cue can contribute by itself, but combining it with other contextual cues may be helpful. While CISS uses the linear MMSE estimator, better results may be possible if a general estimator is learned. Detectors often confuse object parts with full objects. This is a side effect of the desire to detect partly occluded objects, for which only a part is visible. When the detector mistakenly assigns a high score to a part of a fully visible object, CISS can make things worse by increasing the score of more similar parts (see Figure 8(a)). Treating differently those anchors that are contained inside other highly scored anchors (and are therefore likely to be parts) may help prevent undesired effects of CISS. An interesting route to follow is the exploitation of inner-scene similarities for an intelligent non-maximum suppression (NMS) process. In order to avoid multiple detections of one object instance, NMS is usually applied to the output of the detector in a greedy process. The CISS algorithm can produce better localization by changing the scores of all detection windows before applying NMS. See Figure 8(b) and 8(c).
References
- [1] V. Arvis, C. Debain, M. Berducat, and A. Benassi. Generalization of the cooccurrence matrix for colour images: application to colour texture classification. Image Analysis & Stereology, 23(1):63–72, 2011.
- [2] T. Avraham and M. Lindenbaum. Attention-based dynamic visual search using inner-scene similarity: algorithms and bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 28(2):251–264, 2006.
- [3] T. Avraham and M. Lindenbaum. Esaliency (extended saliency): meaningful attention using stochastic image modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 32(4):693–708, 2010.
- [4] T. Avraham and Y. Y. Schechner. Ultrawide foveated video extrapolation. IEEE Journal of Selected Topics in Signal Processing, 5(2):321–334, 2011.
- [5] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material recognition in the wild with the materials in context database. In Computer Vision and Pattern Recognition (CVPR), pages 3479–3487, 2015.
- [6] M. J. Choi, A. Torralba, and A. S. Willsky. A tree-based context model for object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 34(2):240–252, 2012.
- [7] A. Criminisi, P. Pérez, and K. Toyama. Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing, 13(9):1200–1212, 2004.
- [8] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, volume 1, pages 1–22, 2004.
- [9] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
- [10] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886–893, 2005.
- [11] J. S. De Bonet. Multiresolution sampling procedure for analysis and synthesis of texture images. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH, pages 361–368, 1997.
- [12] C. Desai, D. Ramanan, and C. C. Fowlkes. Discriminative models for multi-class object layout. International Journal of Computer Vision, 95(1):1–12, 2011.
- [13] J. Duncan and G. W. Humphreys. Visual search and stimulus similarity. Psychological Review, 96(3):433–458, July 1989.
- [14] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
- [15] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
- [16] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 32(9):1627–1645, 2010.
- [17] C. Galleguillos and S. Belongie. Context based object categorization: A critical survey. Computer Vision and Image Understanding, 114(6):712–722, 2010.
- [18] S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware CNN model. In International Conference on Computer Vision (ICCV), pages 1134–1142, 2015.
- [19] R. Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV), pages 1440–1448, 2015.
- [20] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.
- [21] D. Glasner, S. Bagon, and M. Irani. Super-resolution from a single image. In International Conference on Computer Vision (ICCV), pages 349–356, 2009.
- [22] S. Goferman, L. Zelnik-Manor, and A. Tal. Context-aware saliency detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 34(10):1915–1926, 2012.
- [23] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. MatchNet: unifying feature and metric learning for patch-based matching. In Computer Vision and Pattern Recognition (CVPR), pages 3279–3286, 2015.
- [24] D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing error in object detectors. In European Conference on Computer Vision (ECCV), pages 340–353. Springer, 2012.
- [25] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
- [26] T.-H. Kwok, H. Sheung, and C. C. Wang. Fast query for exemplar-based image completion. IEEE Transactions on Image Processing, 19(12):3106–3115, 2010.
- [27] C. Liu, L. Sharan, E. H. Adelson, and R. Rosenholtz. Exploring features in a Bayesian framework for material recognition. In Computer Vision and Pattern Recognition (CVPR), pages 239–246, 2010.
- [28] X. Liu, L. Lin, and A. L. Yuille. Robust region grouping via internal patch statistics. In Computer Vision and Pattern Recognition (CVPR), pages 1931–1938, 2013.
- [29] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
- [30] J. Malik, S. Belongie, T. Leung, and J. Shi. Contour and texture analysis for image segmentation. International Journal of Computer Vision, 43(1):7–27, 2001.
- [31] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In Computer Vision and Pattern Recognition (CVPR), pages 891–898, 2014.
- [32] A. Oliva and A. Torralba. Building the gist of a scene: The role of global image features in recognition. Visual Perception, Progress in Brain Research, 155:23–36, 2006.
- [33] A. Papoulis and S. U. Pillai. Probability, Random Variables, and Stochastic Processes. McGraw-Hill Higher Education, 4th edition, 2002.
- [34] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In International Conference on Computer Vision (ICCV), pages 1–8, 2007.
- [35] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
- [36] C. Redondo-Cabrera, R. J. López-Sastre, Y. Xiang, T. Tuytelaars, and S. Savarese. Pose estimation errors, the ultimate diagnosis. In European Conference on Computer Vision (ECCV), pages 118–134. Springer, 2016.
- [37] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015.
- [38] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In International Conference on Computer Vision (ICCV), pages 118–126, 2015.
- [39] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
- [40] M. Varma and A. Zisserman. A statistical approach to texture classification from single images. International Journal of Computer Vision, 62(1-2):61–81, 2005.
- [41] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In Computer Vision and Pattern Recognition (CVPR), pages 3485–3492, 2010.
- [42] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), pages 4353–4361, 2015.
- [43] Y. Zhu, R. Urtasun, R. Salakhutdinov, and S. Fidler. segDeepM: Exploiting segmentation and context in deep neural networks for object detection. In Computer Vision and Pattern Recognition (CVPR), pages 4703–4711, 2015.