# Contextual Object Detection with a Few Relevant Neighbors

## Abstract

A natural way to improve the detection of objects is to consider the contextual constraints imposed by the detection of additional objects in a given scene. In this work, we exploit the spatial relations between objects in order to improve detection performance, and analyze various properties of the contextual object detection problem. To calculate context-based probabilities of objects precisely, we develop a model that examines the interactions between objects in an exact probabilistic setting, in contrast to previous methods that typically utilize approximations based on pairwise interactions. Such a scheme is facilitated by the single realistic assumption that the existence of an object at any given location is influenced by only a few informative locations in space. Based on this assumption, we suggest a method for identifying these relevant locations and integrating them into an exact calculation of probability based on their raw detector responses. This scheme is shown to improve detection results and provides unique insights about the process of contextual inference for object detection. We show that it is generally difficult to learn that a particular object *reduces* the probability of another, and that in cases where the context and the detector strongly disagree this learning becomes *virtually impossible* for the purposes of improving the results of an object detector. Finally, we demonstrate improved detection results through the use of our approach as applied to the PASCAL VOC dataset.

## 1 Introduction

The task of object detection entails the analysis of an image for the identification of all instances of objects from predefined categories [7]. While most methods employ local information, in particular the appearance of individual objects [5], the contextual relations between objects have also been shown to be a valuable source of information [24]. Thus, the challenge is to combine the local appearance at each image location with information regarding the other objects or detections.

Even when focusing on how context could influence the detection of a single object, one immediately realizes that the difficulty stems from the varying number of objects that can be used as predictors, their individual power of prediction, and more importantly, their *combined* effect as moderated by the complex interactions between the predictors themselves. Unfortunately, most previous works that employ relations between objects and their context focused on pairwise approximations [31], assuming that the different objects that serve as sources of contextual information do not interact among themselves.

More recent detectors based on convolutional neural networks are also able to reason about context, since the receptive field of neurons grows with depth, eventually covering the entire image. However, the extent to which such a network is able to incorporate context is still not entirely understood [20]. To include more explicit contextual reasoning in the detection process, several approaches suggest adding layers such as bidirectional recurrent neural networks (RNNs) [2] or attention mechanisms [18]. These methods have been shown to improve detection results, but the types of contextual information they can encode remain unclear. Additionally, such networks are not able to reason about object relations in a manner invariant to viewpoint, requiring training data in which all meaningful relations between all groups of objects are observed from all relevant viewpoints.

As mentioned above, some complications in addressing the full-fledged contextual inference problem may emerge from attempting to model the relations between all (detected or predicted) objects and *all* other detections or image locations. But in reality, such comprehensive contextual relations are rarely observed or needed, as the existence of objects (or lack thereof) is correlated to just a *few* other locations in space (and thus to the detected objects in those locations). This notion is exemplified in Figure 1. In this paper we employ exactly this assumption to calculate a score for any given query detection. To do so, we first define the relevant context for that detection as the (few) other most informative detections, and then we use only these detections to calculate (in an exact closed-form fashion) the probability that the query detection is indeed an object.

The suggested approach, facilitated by the decision to employ only a few informative detections, provides several contributions. First, it is shown to improve the results of state-of-the-art object detectors. Second, unlike previous methods that require costly iterative training procedures, training our model is as quick and simple as just counting. Third, we represent object relations in a framework that allows us to incorporate scale-invariant relations, reducing the number of needed examples and thus simplifying the training phase even further. Finally, using the derived calculation we observe various aspects and obtain novel insights related to the contextual inference of objects. In particular, we show that the effect of context is relative to the prior probability of the query object. As we show, this typically small quantity makes it difficult to infer when an object *reduces* the probability of another, and it practically prohibits the *improvement* of detection probability when the context *strongly disagrees* with the raw detector result. These and further observations and insights are analyzed in our results.^{1}

## 2 Relevant Work

Valuable information regarding image or scene elements may be obtained by examining their context. Indeed, different kinds of context have been employed for various inference problems [32]. In this work we focus on employing long-range spatial interactions between objects. This problem has gained attention as well, where the standard framework employs fully-connected Markov random fields (MRFs) and conditional random fields (CRFs) with pairwise potentials [31]. This pairwise assumption is at the heart of the decision process, as it entails that a decision about an object is made by employing information supplied by its neighbors but without considering interactions *between* these neighbors. Other notable works make a similar assumption, where all detections affect each other in a voting scheme that is weighted by confidence [27], as a sum weighted by the probability of an object given by the detector [24], or via a more complex voting mechanism that favors the most confident hypotheses and gives higher relevance to the object relations observed during training [25]. Yet another popular tool for reasoning with many elements is to employ linear classifiers [8] or more complicated set classifiers [4] in a pairwise interaction scheme.

Seeking more accurate representations, schemes that incorporate higher-order models have been suggested for different kinds of problems in computer vision other than contextual object detection [15]. One such noteworthy model was suggested for objects that were not detected at all, where new detection hypotheses are generated by sampling pairwise and higher-order relations using methods from topic modeling [23]. While hypothesis generation in this method was facilitated by context, context played no role in re-scoring existing or new detections. Other higher-order models include neural networks that implicitly or explicitly reason about context, as discussed in the Introduction.

A fundamental aspect of our work deals with finding the most relevant set of location variables for a prediction about another location. This problem can be abstracted as finding the structure of a graph (even if only locally), and algorithms for doing so can be grouped into roughly three types [16]. *Constraint-based methods* make local decisions to connect dependent variables, *score-based methods* penalize the entire graph according to an optimization criterion, and *model averaging methods* employ multiple graph structures. While our problem is better related to the constraint-based approach, most algorithms in this class seek to find the structure without considering current beliefs, a set of measures that change dramatically after observing the detections. To better support such cases, Chechetka and Guestrin [3] proposed to learn evidence-specific structures for CRFs, in which a new structure is chosen based on given evidence. This approach, however, is limited to trees. In a different approach, contextual information sources were dynamically separated into those that accept or reject each detection [34], but in a non-probabilistic framework. To facilitate structure learning in our extended graph configurations we therefore propose a different regime based on local structure exploration for each variable during the process of belief propagation. As will be discussed, our computational process is inspired by Komodakis and Tziritas [17], since it prioritizes variables for the message passing process according to their confidence regarding the labels they should be assigned.

## 3 Suggested Approach

The input to our algorithm is a set of detections, where each detection comprises a type, a location, a size, and a confidence. A random variable is created for each detection, denoting the probability of having an object of the detected type at the corresponding location and size. For brevity, in the remainder of this text we refer to a location as empty either if it is indeed empty or if it contains an object of a different type or size.
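For concreteness, such a detection record can be sketched as a small data structure; the field names below are illustrative assumptions, not the paper's notation:

```python
from dataclasses import dataclass

# Illustrative stand-in for the detection records described above.
@dataclass
class Detection:
    category: str      # object type assigned by the base detector
    x: float           # center x-coordinate of the bounding box
    y: float           # center y-coordinate
    height: float      # object size
    confidence: float  # raw detector score in [0, 1]

detections = [
    Detection("person", 210.0, 140.0, 180.0, 0.92),
    Detection("bottle", 250.0, 200.0, 40.0, 0.35),
]
```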

Our goal is to calculate a new confidence for each location variable using all detections. To do so, we calculate its probability in a belief propagation process, where the context of each variable is dynamically selected as the most informative *small* set of other location variables. The initial beliefs are determined according to the detector, followed by iterations in which an updated belief is calculated for each variable by identifying its best context set and employing the current beliefs of that set's variables. Alas, it turns out that this calculation can be very sensitive and can produce problematic results when the detector and the context strongly disagree. We therefore identify these cases first, then calculate the probability accordingly.
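The overall process can be sketched schematically. Here `select_context` and `update_belief` are placeholders for the components developed in the following subsections, and the toy update in the usage example is purely illustrative:

```python
# Schematic version of the inference loop: initial beliefs come from the
# detector and are repeatedly updated from a small, dynamically chosen
# context. All updates within an iteration use the previous iteration's
# beliefs (synchronous update).
def rescore(detector_beliefs, select_context, update_belief, iterations=3):
    beliefs = dict(detector_beliefs)
    for _ in range(iterations):
        beliefs = {
            q: update_belief(q, select_context(q, beliefs), beliefs)
            for q in beliefs
        }
    return beliefs

# Toy usage: every other variable serves as context, and the update simply
# averages the current belief with the mean context belief.
init = {"a": 0.9, "b": 0.1}
ctx = lambda q, b: [s for s in b if s != q]
avg = lambda q, c, b: 0.5 * b[q] + 0.5 * sum(b[s] for s in c) / len(c)
result = rescore(init, ctx, avg)
```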

### 3.1 Calculation of object probability

We assume that location and detection variables are connected according to the graph structure shown in Figure 2. More formally, we assume that each detection directly depends only on the existence of an object at its own location, so

We further assume that each location variable directly depends only on its own detection and on a small set of other location variables. Therefore

We note that variables in this context set may or may not directly depend on each other.

Employing these assumptions and the context set (identified as described in Section 3.4), we calculate the desired probability in the following way.

Applying Bayes’ rule we first obtain

Employing Equation 1 entails

Employing Equation 2 provides

and applying Bayes’ rule again provides

which results in

an expression reminiscent of the belief propagation process suggested by Pearl [26]. This is further developed by applying Bayes’ rule once more:

Finally, we denote the first term by a normalizing constant, which need not be explicitly calculated. We thus obtain:

an expression we assert is more informative than Equation 3, as it is now apparent that the way in which the context affects the query is relative to its prior probability. We note that instead of calculating the normalizing constant explicitly, we normalize the calculated values so that they sum to 1, as in a standard belief propagation process [26].

As can be seen, apart from the normalizing constant, this expression contains only functions over a small number of variables (for small context sets). We therefore restrict the size of the context set to the maximum that still enables these functions to be properly represented and learned, as employing more neighbors would require more memory and more training examples.

We are left with one term that is more complicated to calculate: the (joint) belief of the context variables given all detections but the query's, which can be seen as a weighting factor reflecting the extent to which we are confident about assignments to the context. We therefore suggest approximating it as the product of the individual beliefs of the context variables:

With this approximation, the representation based on Equations 4 and 5 now consists of simple functions (that are easily measured from data) and terms that can be seen as the messages in a standard belief propagation process.
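The display equations themselves were lost from this copy, so the following is a hedged reconstruction of the update they describe: detector likelihood times the context-based probability relative to the prior, summed over context assignments weighted by the product of individual neighbor beliefs. The exact form is an assumption consistent with the surrounding derivation:

```python
from itertools import product

def update_belief(likelihood, prior, context_prob, neighbor_beliefs):
    """One belief update for a binary query variable.

    likelihood[x]      -- P(detection | query state x), for x in {0, 1}
    prior              -- prior probability that the query location holds an object
    context_prob[c][x] -- measured probability of query state x given context assignment c
    neighbor_beliefs   -- current belief b_s that each context location holds an object
    The joint context belief is approximated by the product of the individual
    neighbor beliefs, as suggested above.
    """
    priors = {0: 1.0 - prior, 1: prior}
    unnorm = {}
    for x in (0, 1):
        total = 0.0
        for c in product((0, 1), repeat=len(neighbor_beliefs)):
            weight = 1.0  # approximate joint belief of assignment c
            for c_s, b_s in zip(c, neighbor_beliefs):
                weight *= b_s if c_s else 1.0 - b_s
            total += weight * context_prob[c][x] / priors[x]
        unnorm[x] = likelihood[x] * total
    return unnorm[1] / (unnorm[0] + unnorm[1])  # normalize instead of computing the constant

# If the context is independent of the query (context probability equals the
# prior), the update returns the detector-only probability.
independent = {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.9, 1: 0.1}}
p_ind = update_belief({0: 0.4, 1: 0.6}, 0.1, independent, [0.7])

# A context that raises the query probability above its prior strengthens a
# moderately confident detection.
supportive = {(0,): {0: 0.95, 1: 0.05}, (1,): {0: 0.7, 1: 0.3}}
p_sup = update_belief({0: 0.4, 1: 0.6}, 0.1, supportive, [1.0])
```

Note how the independence case recovers the raw detector probability, matching the behavior discussed in Section 3.3.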

Finally, given the detections provided by a base detector applied to an image, a new confidence is calculated for each detection. Its most informative neighbors are identified as explained in Section 3.4 and used for the calculation of the probability in a belief propagation process. The confidence assigned to each detection is then the resulting probability.

### 3.2 Scale-invariant representation

The term in Equation 4 representing relations between several locations must be calculated for each set of locations, requiring the observation of many examples of object groups in each location. To reduce this complexity we make the (very reasonable) assumption that relations between objects are independent of the viewer, and suggest a representation that is invariant to different object scales.

Similar to the spatial features employed by Cinbis and Sclaroff [4], we represent spatial relations with respect to the size of a reference object. The relative location of the query (and of any other non-reference object) is represented as

where the two locations are the center points of the objects, the normalizing length is the height of the reference, and a scaling factor is applied according to the reference's type as assigned by the base detector. Object scales are also represented relative to the reference.

Using this representation, the probability is measured for any assignment to the query and to the context variables, one of which contains an object used as a reference frame. Specifically, we count the occurrences of objects of each type at each location and scale relative to reference objects of each type that appear in the training data.
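A minimal sketch of this scale-invariant encoding, assuming objects are given as `(category, x, y, height)` tuples and `type_scale` is the assumed per-category scaling factor:

```python
def relative_pose(ref, other, type_scale):
    """Represent `other` in the scale-invariant frame of the reference `ref`.

    ref, other -- (category, x, y, height) tuples; x, y are center points
    type_scale -- assumed per-category normalization factors
    Returns (dx, dy, rel_scale), the location and scale of `other` relative
    to the reference, so that the encoding is unchanged when the whole scene
    is uniformly rescaled.
    """
    cat_r, xr, yr, hr = ref
    _cat_o, xo, yo, ho = other
    unit = type_scale[cat_r] * hr  # reference length unit
    return ((xo - xr) / unit, (yo - yr) / unit, ho / unit)

# Doubling every coordinate and size (the same scene viewed from closer up)
# leaves the representation unchanged.
scales = {"person": 1.0, "bottle": 0.2}
rep_a = relative_pose(("person", 100.0, 100.0, 200.0),
                      ("bottle", 150.0, 180.0, 40.0), scales)
rep_b = relative_pose(("person", 200.0, 200.0, 400.0),
                      ("bottle", 300.0, 360.0, 80.0), scales)
```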

A non-parametric representation requires a value for every possible assignment of the variables. The described measuring method requires at least one variable that is not empty (i.e., contains an object) with which to construct a reference frame. However, in some assignments there may be no such variable. Hence, for assignments in which all the context variables are empty while the query is not, we simply use the query itself as reference:

where the variable is arbitrarily picked from the context set. Notice that the terms of the quotient operate on one fewer variable.

Finally, the probability for the assignment in which all the variables are empty is calculated by subtracting the probability of the complementary event from one.

### 3.3 The implications of high probability derivative

To better understand the way contextual information is combined with the detector response, we revisit Equation 4 and examine its behavior when the context is known. Thus, for some fixed assignment to the context variables, Equation 4 reduces to:

A graph of Equation 6 for different detector responses is presented in Figure 3. As can be seen, the addition of context strengthens a detection when the context-based probability is greater than the prior, and weakens it when the opposite occurs. The red and blue curves, representing especially strong and weak detections respectively, *exhibit large derivative regions where the detector and context strongly disagree*. This may also be the case when the detector is confident and the context is independent, as in Figure 3. In these cases, the overall probability changes greatly with small perturbations of the context-based probability. Because this quantity is measured from data, large errors are to be expected when the number of samples does not suffice.

Owing to this, many cases are indeed observed where detections are incorrectly assigned low probabilities despite a confident detector and a seemingly independent context. Failing to address such cases leads to poor results, as we show in Section 4; a specific case can be seen in Figure 4.

Hence, to handle such cases, for each assignment to the context variables we calculate the derivative at the measured context-based probability, and if it exceeds a threshold we ignore the context by setting the context-based probability to the prior, essentially assuming that the query and its context are independent and thus that the context has no effect.
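Assuming the known-context combination has the form discussed above (each hypothesis weighted by its context-based probability relative to its prior), the gating rule can be sketched as follows; both the exact functional form and the threshold value are assumptions:

```python
def combine(like1, like0, prior, context_prob):
    """Probability of an object when the context is fully known: detector
    likelihoods reweighted by how the context-based probability compares to
    the prior (the Equation 6 behavior described above; exact form assumed).
    """
    a = like1 * context_prob / prior                  # object hypothesis
    b = like0 * (1.0 - context_prob) / (1.0 - prior)  # no-object hypothesis
    return a / (a + b)

def gated_context_prob(like1, like0, prior, context_prob, max_slope=10.0):
    """Fall back to the prior (i.e., assume independence from context) when
    the derivative of `combine` at the measured value exceeds a threshold."""
    a_coef = like1 / prior
    b_coef = like0 / (1.0 - prior)
    denom = a_coef * context_prob + b_coef * (1.0 - context_prob)
    slope = a_coef * b_coef / (denom * denom)  # d(combine)/d(context_prob)
    return prior if slope > max_slope else context_prob

# When the context-based probability equals the prior, the detector output
# is returned unchanged.
p_neutral = combine(0.6, 0.4, 0.1, 0.1)
# A confident detector with a tiny measured context probability lands in a
# large-derivative region, so the context is ignored.
gated = gated_context_prob(0.99, 0.01, 0.1, 0.001)
```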

Another way to identify these cases is to estimate the number of samples needed to ensure a low error. Using the quantities depicted in Figure 5, if we allow a maximal error in the overall probability, we require that:

where the first value is calculated using measured data. For the measured context-based probability to yield an overall error that does not exceed the allowed maximum, it must stay within the limits that correspond to the allowed deviations above and below. Thus, it is required that:

where the bound is the allowed measurement error for the context-based probability.

We then employ Hoeffding's inequality [30] to estimate the number of samples required to achieve this measurement error.

We assume that indicator variables for the relevant event are sampled and note that their expectation equals the probability being measured. Therefore, the probability of a measurement with large overall error can be expressed as:

Finally, to guarantee a maximal error with a given probability, the number of required samples is:

This expression enables us to calculate the needed number of samples in each case, and then to employ the context-based probability only when enough data is provided (even when the derivative is high). In our experiments the derivative was used to identify problematic cases, while the number of samples is used for the discussion in Section 4.
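The sample-size bound itself follows directly from Hoeffding's inequality; a small helper makes the calculation concrete (the specific error and confidence values used in the paper's examples are not reproduced here):

```python
import math

def required_samples(epsilon, delta):
    """Number of i.i.d. {0,1} samples so that, by Hoeffding's inequality,
    the empirical mean deviates from its expectation by more than epsilon
    with probability at most delta:  n >= ln(2/delta) / (2 * epsilon**2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

# A 5% error at 95% confidence needs well under a thousand samples...
n_easy = required_samples(0.05, 0.05)
# ...but tightening the allowed error quickly inflates the requirement,
# which is the effect analyzed in Section 4.1.
n_hard = required_samples(0.005, 0.05)
```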

### 3.4 Identification of relevant detections

The described algorithm requires identifying the most relevant set of locations to use as context for a decision regarding a query location, chosen from among the subsets of all other locations. To limit the search, we consider only subsets whose cardinality is equal to or smaller than a predefined number that is given as a parameter of the model.

The context set was introduced with the assumption that the query does not depend on other locations (or their detections) given that set:

Thus, the most suitable set would be

However, this calculation requires a function over many variables, which seems as complicated as calculating the full probability. We hence suggest a different way to determine the context set. Basically, we would like to employ the set that would be the best predictor for the query using our current beliefs about the different variables. For this reason, we pick as context those variables that are the least independent of the query, weighted according to the current beliefs:

where the belief is calculated as in Equation 5.
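A simplified stand-in for this selection criterion, assuming a precomputed pairwise dependence score (e.g., how far the conditional probability of the query strays from its prior) that is weighted by the current beliefs:

```python
def select_context(query, candidates, dependence, beliefs, k=2):
    """Pick the k candidate locations least independent of the query,
    weighted by the current belief that each candidate holds an object.

    dependence[(query, s)] -- assumed precomputed dependence score
    beliefs[s]             -- current belief for candidate location s
    """
    ranked = sorted(
        (s for s in candidates if s != query),
        key=lambda s: dependence[(query, s)] * beliefs[s],
        reverse=True,
    )
    return ranked[:k]

# A strongly dependent but almost certainly empty location ("a") loses to a
# weaker dependence backed by a confident belief ("b", "c").
dependence = {("q", "a"): 0.9, ("q", "b"): 0.5, ("q", "c"): 0.8}
beliefs = {"a": 0.1, "b": 1.0, "c": 0.9}
chosen = select_context("q", ["a", "b", "c"], dependence, beliefs, k=2)
```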

## 4 Results and Discussion

We evaluate the proposed approach using the PASCAL VOC 2007 dataset, where initial detections are provided by the Fast R-CNN detector [11]. Objects from the training and validation sets are used to measure the probability of an object given its context, as described in Section 3.2. For the probability of an object given its detection we simply use the confidence provided by the base detector. To determine the prior, we assume that the prior probability of an object is fixed regardless of image location and size but depends on the object type. The prior value for each type is found by an exhaustive search that maximizes the method's average precision (AP) on the training and validation sets.
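Such an exhaustive search can be sketched as a simple grid search; the grid resolution and the AP callback are assumptions standing in for the actual validation pipeline:

```python
def search_prior(average_precision, grid=None):
    """Exhaustive search for a per-category prior: evaluate each candidate
    value with an AP-scoring callback and keep the best."""
    if grid is None:
        grid = [i / 1000 for i in range(1, 201)]  # candidate priors in (0, 0.2]
    return max(grid, key=average_precision)

# Toy AP surface peaking at 0.07 (illustrative only).
toy_ap = lambda p: -(p - 0.07) ** 2
best_prior = search_prior(toy_ap)
```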

The results of our approach are summarized in Table ?, and specific examples, including the detections identified as most informative, can be seen in Figure ?. Included in the table are the base Fast R-CNN detector, our proposed model with two most informative detections and *no* treatment for large derivatives (dFNM), and finally, our contextual inference model *with* treatment for large derivatives and (at most) two most informative detections (FNM). We note that it is also possible for a single detection to be identified as the most informative, and that employing two detections to reason about a third constitutes a triple-wise model. In both variants of our computational model the results are reported after a *single* iteration of belief propagation, as usually just two or three iterations were needed for convergence. We do note that the actual results of our model are probably better than reported, as the objects whose detections are improved the most are usually small or occluded, and as such, were sometimes missed by PASCAL's human annotators and are thus considered false detections.


As can be seen, our suggested approach (FNM) indeed improves detection results, where greater improvement is observed for the detection of chairs, sheep, televisions and bottles. In addition, the model in which regions of large derivative are not handled (dFNM) behaves worse than the detector alone. This very much fits the analysis performed in Section 3.3.

### 4.1 Impracticality of learning certain properties

The graph of Equation 6 in Figure 3 sheds light on the decision process, or more specifically, on the way in which information supplied by the local detector is combined with that inferred from the context. As can be seen in Figure 3, the use of contextual information can either increase or decrease the probability of an object, where different probabilities of the local detector response affect the rate of change.

Using Equation 7, we examine the difficulty of learning different properties with regard to the number of samples needed to stay within the limits of an allowed error. In this section we show the impracticality of learning certain properties even under modest error requirements.

We first examine the requirements for learning relations that decrease the probability of objects, without necessarily seeking to improve detections. As a test case, we set a relatively high prior probability for the query location to contain an object. So, to decide with high certainty whether an assignment to the context reduces the probability of the query, a correspondingly small measurement error is required. According to Equation 7, this requires at least 3745 samples of that relation (or even more for the average prior probability). While this is not a problem for many modern datasets, smaller datasets may not suffice.

Seeking to improve detection results, let us examine one test case in which the context-based probability is half of the prior probability:

For confident detections, the overall probability of the query to contain an object in this case follows from Equation 6. If we require a modest overall error, then according to the construction in Section 3.3, a much smaller measurement error is required. In this case, for high certainty the number of required samples is 127,095. Similarly, for more accurate results, as many as 400,048 samples are required. Of course, in more extreme cases many more samples would be required. The need to collect datasets that large renders such relations impractical to learn by observing object occurrences.

The solution suggested in this paper was to handle these cases in which the detector and context strongly disagree by assuming independence from context when the derivative exceeds a threshold. This arbitrary decision to ignore the context (instead of the detector) corresponds to similar processes we observe in the human visual system, as exemplified in Figure ?. And yet, it is important to note that there are indeed cases in which our contextual computation successfully reduces the probability of an object. Two such cases can be seen in Figure ?b, in which the confidence of a false bottle detection was decreased, possibly based on its scale relative to the person detection next to it, and the confidence of a false tv/monitor detection was reduced due to its image location and scale relative to the chair.


Another important point is the rate of change of the curves in Figure 3 for different detector responses. The central curve behaves as a straight line. The red and yellow curves behave similarly to the blue and magenta curves; however, the scale of the latter is significantly smaller, reducing the high-derivative cases to those in which the detector is extremely confident that an object is not present. From this we conclude that it is generally simpler to learn relations that increase the probability of an object than those that decrease it.

### 4.2 On the generation of new detections

While outside the scope of this work, the use of context can also facilitate the generation of new detection hypotheses [23] for cases in which an object was missed by the detector. However, the use of such generated hypotheses in a detection framework requires that they be assigned with some confidence measure, and ideally, one based on appearance and context.

Oramas and Tuytelaars [23] have demonstrated such a system, in which object proposals are created by considering their context, and scored by a different classifier than the one originally used. To shed more light on this process we once again turn to Figure 3. Locations with no detection are essentially considered to have a zero or near zero probability, as exemplified by the blue curve. To assign a probability greater than 0.1, an exceptionally large context-based probability is required, which is still likely to fall in a large derivative region.

This behavior is the result of a detector which is confident that a certain location does not contain an object. Hence, to generate new detections the detector response should be reevaluated, possibly by further analysis with different classifiers or by employing additional information about occlusions or the non-maximum suppression (NMS) process.

## 5 Conclusions

The problem of including context in the process of object detection is important but difficult, as decisions over large numbers of locations are needed. We have suggested employing only a small number of locations for a more accurate decision, and presented a method for identifying the most informative set of locations together with a formulation that employs it to infer the probability of objects at different locations in an exact probabilistic fashion. A key benefit of our computational approach is that it facilitates a better understanding of certain aspects of the problem; in particular, it allowed us to conclude that it is impractical to infer relations that decrease or increase the probability of detections when the detector and the context strongly disagree, and that in general it is more difficult to infer relations that reduce the probability of an object than relations that increase it. Finally, we have demonstrated how our approach improves detection results on the challenging PASCAL dataset using a model that is quick and simple to train and to employ for context-based inference.

### Footnotes

- Source code is provided in the supplementary material and will be made available after publication.

### References

1. N. Arbel, T. Avraham, and M. Lindenbaum. Inner-scene similarities as a contextual cue for object detection. *arXiv preprint*, 2017.
2. S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In *CVPR*, pages 2874–2883, 2016.
3. A. Chechetka and C. Guestrin. Evidence-specific structures for rich tractable CRFs. In *Advances in Neural Information Processing Systems*, pages 352–360, 2010.
4. R. G. Cinbis and S. Sclaroff. Contextual object detection using set-based classification. In *ECCV*, pages 43–57. Springer, 2012.
5. N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In *CVPR*, pages 886–893, 2005.
6. C. Desai, D. Ramanan, and C. C. Fowlkes. Discriminative models for multi-class object layout. *Int. J. Comput. Vision*, 95(1):1–12, 2011.
7. P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In *CVPR*, pages 1–8, 2008.
8. P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. *IEEE Trans. Pattern Anal. Mach. Intell.*, 32(9):1627–1645, 2010.
9. C. Ferri, J. Hernández-Orallo, and R. Modroiu. An experimental comparison of performance measures for classification. *Pattern Recognition Letters*, 30(1):27–38, 2009.
10. C. Galleguillos, A. Rabinovich, and S. Belongie. Object categorization using co-occurrence, location and appearance. In *CVPR*, pages 1–8, 2008.
11. R. Girshick. Fast R-CNN. In *ICCV*, pages 1440–1448, 2015.
12. R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In *CVPR*, 2014.
13. G. Heitz and D. Koller. Learning spatial context: Using stuff to find things. In *ECCV*, pages 30–43, 2008.
14. D. Hoiem, A. A. Efros, and M. Hebert. Putting objects in perspective. *Int. J. Comput. Vision*, 80(1):3–15, 2008.
15. P. Kohli and C. Rother. Higher-order models in computer vision. In *Image Processing and Analysing with Graphs: Theory and Practice*, chapter 3, pages 65–92. CRC Press, 2012.
16. D. Koller and N. Friedman. *Probabilistic graphical models: principles and techniques*. MIT Press, 2009.
17. N. Komodakis and G. Tziritas. Image completion using efficient belief propagation via priority scheduling and dynamic pruning. *IEEE Trans. Image Processing*, 16(11):2649–2661, 2007.
18. J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, and S. Yan. Attentive contexts for object detection. *IEEE Transactions on Multimedia*, 19(5):944–954, 2017.
19. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In *ECCV*, pages 21–37, 2016.
20. W. Luo, Y. Li, R. Urtasun, and R. Zemel. Understanding the effective receptive field in deep convolutional neural networks. In *NIPS*, pages 4898–4906, 2016.
21. R. Mairon and O. Ben-Shahar. A closer look at context: From coxels to the contextual emergence of object saliency. In *ECCV*, pages 708–724. Springer, 2014.
22. R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In *CVPR*, pages 891–898, 2014.
23. J. Oramas and T. Tuytelaars. Recovering hard-to-find object instances by sampling context-based object proposals. *CVIU*, 2016.
24. J. Oramas M, L. De Raedt, and T. Tuytelaars. Allocentric pose estimation. In *ICCV*, 2013.
25. J. Oramas M, L. De Raedt, and T. Tuytelaars. Towards cautious collective inference for object verification. In *WACV*, 2014.
26. J. Pearl. *Probabilistic reasoning in intelligent systems: networks of plausible inference*. Morgan Kaufmann, 1988.
27. R. Perko and A. Leonardis. A framework for visual-context-aware object detection in still images. *CVIU*, 114(6):700–711, 2010.
28. A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In *ICCV*, pages 1–8. IEEE, 2007.
29. M. Schmidt. UGM: A Matlab toolbox for probabilistic undirected graphical models. http://www.cs.ubc.ca/~schmidtm/Software/UGM.html, 2007.
30. S. Shalev-Shwartz and S. Ben-David. *Understanding machine learning: From theory to algorithms*. Cambridge University Press, 2014.
31. A. Torralba, K. P. Murphy, and W. T. Freeman. Contextual models for object detection using boosted random fields. In *NIPS*, pages 1401–1408, 2004.
32. A. Torralba and P. Sinha. Statistical context priming for object detection. In *ICCV*, volume 1, pages 763–770. IEEE, 2001.
33. L. Wolf and S. Bileschi. A critical view of context. *Int. J. Comput. Vision*, 69(2):251–261, 2006.
34. R. Yu, X. Chen, V. I. Morariu, and L. S. Davis. The role of context selection in object detection. In *British Machine Vision Conference*, 2016.