Visual Chunking: A List Prediction Framework for Region-based Object Detection

Nicholas Rhinehart, Jiaji Zhou, Martial Hebert, and J. Andrew Bagnell
The Robotics Institute, Carnegie Mellon University
{nrhineha, jiajiz, hebert, dbagnell}

We consider detecting objects in an image by iteratively selecting from a set of arbitrarily shaped candidate regions. Our generic approach, which we term visual chunking, reasons about the locations of multiple object instances in an image while expressively describing object boundaries. We design an optimization criterion for measuring the performance of a list of such detections as a natural extension to a common per-instance metric. We present an efficient algorithm with provable performance for building a high-quality list of detections from any candidate set of region-based proposals. We also develop a simple class-specific algorithm to generate a candidate region instance in near-linear time in the number of low-level superpixels that outperforms other region generating methods. In order to make predictions on novel images at testing time without access to ground truth, we develop learning approaches to emulate these algorithms’ behaviors. We demonstrate that our new approach outperforms sophisticated baselines on benchmark datasets.

I Introduction

We consider the problem of object detection, where the goal is to identify parts of an image corresponding to objects of a particular semantic type, e.g. “car”. In recent years, machine learning-based approaches have become de rigueur for addressing this difficult problem; one classical approach is to transform the problem into one of binary classification, either on bounding boxes [1, 2] or on regions. Such approaches (see Section II for a detailed discussion) typically follow a two stage procedure:

  1. generate independent proposals to provide coverage across object instances

  2. improve precision and reduce redundancy by pruning out highly overlapping proposals

Intuitively, the first step returns a set of proposals with high recall and the second step improves the precision. For the second step, traditional approaches rely on a combination of thresholds and arbitration techniques like Non-Max Suppression (NMS) to produce a final output. Such methods, while remarkably effective at identifying sufficiently separated objects, still have difficulty simultaneously detecting objects that are close together or overlap while preventing multiple detections of the same object (see Fig. 6). While we provide contributions to both stages, our focus is on formalizing and improving the second stage.
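A common instantiation of the second stage's pruning step, greedy Non-Max Suppression, can be sketched as follows; the (x1, y1, x2, y2) box format and the 0.5 overlap threshold are illustrative choices, not values from the paper.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Keep boxes in decreasing score order, suppressing close overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        # a box survives only if it overlaps no already-kept box too much
        if all(box_iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

The failure mode discussed above is visible directly in this sketch: two correct detections of adjacent objects can overlap enough to suppress one of them.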

Fig. 1: Visual Chunking run on test data. The first prediction is shown in red, the second in green, the third in blue, and the fourth in yellow.

We formulate the objective of the second step as that of producing a diverse list of detections in the image. We propose an optimization criterion on this list of detections as a natural extension of the intersection over union metric (IoU) (described in Section III-A), and develop an algorithm that targets this criterion. This approach uses recent work on building performance-bounded lists of predictions [3, 4]. Our algorithm shares information across all candidate detections to build a list of detections, specifically exploiting this information to perform well even when object instances are adjacent. Each decision of appending to the list of detections is made with contextual information from all previous detections. Importantly, our list prediction algorithm is agnostic to the source of candidate detections. This provides our approach with the ability to use any candidate generating method as input for constructing a list.

Each candidate detection is treated as a union of superpixels with no adjacency constraints. We call these unions “chunks,” inspired by a well-known task in Natural Language Processing: “chunking,” which involves grouping many words together into meaningful semantic instances / entities. We use “region” to refer to a contiguous group of superpixels, and reserve “chunk” to refer to a group of superpixels corresponding to a single semantic instance. The analogy is particularly apt when object instances are adjacent, as in Fig. 1.

For the first step, we develop a class-specific supervised approach to region-based object proposal that iteratively groups superpixels produced by a low-level segmentation algorithm [5] to form chunks. This helps build a high-recall candidate set. The algorithm learns to “grow” chunks from class-specific ground-truth labelings by emulating an algorithm that optimizes a chunk’s IoU score with an object, which we present in Algorithm 2. This strategy follows imitation learning approaches [6, 7].

Our technique for building the list of detections can be run for arbitrary list lengths, or budgets. This enables several use cases: building very short lists of highly confident object predictions (high precision), long lists of many candidate regions (high recall), and dynamic length lists tuned by some heuristic(s) (e.g., the highest predicted IoU score of the remaining candidates).

II Related Work

Much work has been done in the combined areas of object detection and semantic labeling. Object detection approaches often seek to place bounding boxes around all instances of objects [1, 8]. [9] casts the multi-class (and multi-instance) detection problem as a structured prediction task, rather than relying on NMS as post-processing. However, the resulting detections are still bounding boxes.

Intermediate approaches deform the regions inside the output of a detector to produce object segmentations [10, 11, 12], or, conversely, adjust bounding boxes based on low-level features such as boundaries, texture, and color [13, 14]. Again, these approaches refine individual detections relying on the initial detector output. In contrast, we attempt to find the best list of detections given a large collection of candidate detections and regions. Closer to our work, [15] proposes to use a deformable shape model to represent object categories in order to extract region level object detections. This approach reasons about occluders and overlapping detection by using depth layering and is designed for one specific shape model for region-based representation, while our approach is agnostic to the source of region segments and detection boxes.

Direct region-based techniques, such as [16, 17, 18], use region-based information to formulate detections, but the produced detections are bounding boxes, and detection performance is analyzed using individual bounding box metrics. [19] produces region-wise segmentations, but assumes the existence of only one object in each image. [20] produces multiple region-wise segmentations, but contiguous and adjacent objects are not resolved, and inter-class context is ignored. Other region-based techniques are segmentation algorithms that combine low-level image features with class-specific models [21, 22, 23, 24], control segmentation parameters from object detection [25], or use box-level detections as features for segmentation [26, 27]. These approaches attempt to find regions that best agree with both the region segments and individual detections, but do not explicitly address the problem of finding the most consistent list of detections, as we do.

Semantic systems such as [28, 29, 30, 31] do produce region-level labels, which can be grouped into detections; however, there is no notion of separate detections, as connected components of the labeling are not grouped into their constituent object instances. [32] uses non-overlapping segmentation proposals in its first stage, thus allowing, in principle, the handling of multiple instances of the same class, without explicitly optimizing for multi-instance settings. Although the evaluation criterion in [32] focuses on per-class overlap without accounting for multiple instances, the authors do note the possibility of a multi-instance extension. Combining semantic labeling with object detectors has been explored in different ways. Several approaches were proposed to combine pixel-level classification labels and box-level detections into a single inference problem. For example, [33, 34, 35, 36] incorporate detections into a CRF model for semantic labeling. These techniques attempt to generate a holistic representation of the scene that combines objects and regions, and they rely on semantic segmentation. Our approach, while incorporating semantic segmentation, is agnostic to the input features, as well as to the source(s) from which candidate detections are generated.

Another group of approaches related to our work address the problem of generating proposals for regions or boxes that are likely to delineate objects, in a class-independent manner. The proposals can then be evaluated by a class-specific algorithm for object detection. They include, for example, generating regions by iterative superpixel grouping [37, 18], and ranking proposed regions [38] or boxes [39, 40] based on a learned objectness score. In [41], the authors investigate an iterative, class-specific region generation procedure that incorporates class-specific models at different scales, and requires bounding boxes as input. Our generation method, in comparison, directly optimizes the instance-based IoU metric, and we provide worst-case and probabilistic performance bounds. All of these approaches are complementary to our work in that we can potentially use any of them as input to our candidate generation step, thus, we incorporate and compare to several of them in our experiments.

III Approach

Our task is to output a list of chunks, i.e., a list of sets of superpixels as described in Section I, with high intersection over union (IoU) scores with each of the ground truth instances in the image. This metric is formalized in Section III-A. We decompose the task into two parts:

  • Generation of a set of candidate chunks containing some elements that cover individual object instances.

  • Iterative construction of a list of chunks by selecting from an arbitrarily generated set of candidate chunks so as to maximize a natural variant of intersection over union score for multiple object instances and multiple predictions.

In the second stage, the candidate chunks can be generated from any algorithm, providing our method with the ability to augment our set of grown candidates with candidates constructed by other means. We start by describing the method by which we build lists of detections for the second stage, and first define a natural scoring function to evaluate any input list of chunks given ground truth on the pixels corresponding to objects of interest in a scene. We provide an efficient greedy algorithm that is guaranteed to optimize this metric to within a constant factor given access to ground truth and this arbitrary set of (potentially overlapping) candidate chunks.

Our test-time approach, following recent work in structured prediction [4, 6], is to learn to emulate the sequential greedy strategy. The result is a predictor that takes a candidate set of chunks and iteratively builds a list of chunks that are likely to overlap well with separate objects in the scene.

We do not place assumptions on the given candidate set of chunks: the list predictor is agnostic to the way the candidate set of chunks is generated. Such a set can be heuristically generated in many ways, e.g., from the baseline approaches described in Section IV. In Section III-C, we provide an algorithm designed to generate a candidate set based on a fixed superpixel-based segmentation, and in Section III-D we extend this algorithm to the case of growing multiple chunks per image.

III-A Objective function and greedy optimization

We establish an objective function to evaluate the quality of any list, and devise a greedy algorithm to approximately maximize this objective function given access to the ground truth. This will lead to the development of a learning algorithm that produces a prediction procedure operating on novel images.

Given an image with ground truth instance set $\mathcal{G} = \{g_1, \dots, g_M\}$ and candidate chunk set $\mathcal{C}$, our goal is to sequentially build a list of chunks out of $\mathcal{C}$ so as to maximize the sum of IoUs with respect to ground truth instances. Denoting $L_N = (c_1, \dots, c_N)$ as a size-$N$ list of chunks, we first establish correspondences between candidate chunks and ground truth instances to enable pairwise IoU computation. Note that each $c_i$ is associated with at most one ground truth instance $g_j$, and each $g_j$ is associated with at most one $c_i$. For analytic convenience, we augment $\mathcal{G}$ with dummy ground truth instances to deal with the case in which the length of the list is larger than the number of ground truth instances ($N > M$). Every chunk has zero intersection with each dummy instance. Each feasible assignment corresponds to a permutation $\pi$ of $\{1, \dots, N\}$, and the sum of IoU scores for this permutation can be written as $\sum_{i=1}^{N} \mathrm{IoU}(c_i, g_{\pi(i)})$. It is natural to define the quality metric of a list to be the sum of IoU scores under the optimal assignment, i.e., $F(L_N) = \max_{\pi \in \Pi_N} \sum_{i=1}^{N} \mathrm{IoU}(c_i, g_{\pi(i)})$, where $\Pi_N$ denotes all permutations of $\{1, \dots, N\}$. With an abuse of notation, $L_N \subseteq \mathcal{C}$ indicates that all elements in $L_N$ belong to $\mathcal{C}$. Our goal during training is to find the list $L_N$ that maximizes $F$:

$$L_N^* = \operatorname*{arg\,max}_{L_N \subseteq \mathcal{C}} F(L_N) \qquad (1)$$
This scoring metric, which is a natural generalization of the IoU metric common in segmentation and single instance detection [42, 43], encourages lists of a fixed length that contain chunks that are relevant and diverse in covering multiple ground truth instances. Unfortunately, the metric as written does not possess a clear combinatorial structure, such as modularity or submodularity, that would make it easy to optimize.
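For small instances, the optimal-assignment list score can be computed exactly by brute force over assignments. The sketch below assumes chunks and ground-truth instances are represented as sets of superpixel ids; empty dummy instances pad the ground truth when the list is longer than the instance count, as described above.

```python
from itertools import permutations

def iou(a, b):
    """IoU of two sets of superpixel ids."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def list_score(L, G):
    """Sum of IoUs under the best one-to-one chunk/instance assignment."""
    # pad with dummy (empty) instances so every chunk can be assigned
    G = list(G) + [set()] * max(0, len(L) - len(G))
    return max(sum(iou(c, G[p]) for c, p in zip(L, perm))
               for perm in permutations(range(len(G)), len(L)))
```

A list with one perfect chunk and one half-covering chunk over two instances scores 1.5 under this metric, matching the sum-of-best-IoUs intuition.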

Interestingly, however, Problem (1) can be cast as an equivalent maximum-weight bipartite graph matching problem. This problem can be shown to be a submodular maximization problem under matroid partition constraints, and a greedy algorithm as shown in Algorithm 1 has multiplicative performance guarantees [44]. In addition to these guarantees, such a greedy algorithm is desirable as it is easily imitable at test time, and has a recursive solution: the length-$(t+1)$ list is exactly the length-$t$ list with the next greedily chosen item appended. The greedy algorithm behaves as follows: at each iteration, it chooses the chunk with the highest IoU with one of the remaining ground truth instances. More precisely, a chunk's best overlap with the remaining ground truth is defined as $b(c) = \max_{g \in \mathcal{G}_r} \mathrm{IoU}(c, g)$ (the “greedy marginal”), where $\mathcal{G}_r$ is the set of remaining unpaired ground truth instances. At each step, the algorithm chooses the chunk $c^* = \operatorname*{arg\,max}_{c \in \mathcal{C}} b(c)$ with the highest value, appends it to the list ($L \leftarrow L \oplus c^*$), and removes its associated ground truth from the set of remaining ground truth. This associated ground truth element is given by $g^* = \operatorname*{arg\,max}_{g \in \mathcal{G}_r} \mathrm{IoU}(c^*, g)$.

Input: Set of candidate chunks $\mathcal{C}$, set of ground truth instances $\mathcal{G}$, size of predicted list $N$
Output: A near-optimal list of chunks $L_N$
$L_0 \leftarrow ()$; $\mathcal{G}_r \leftarrow \mathcal{G}$
for $t = 1$ to $N$ do
     $(c^*, g^*) \leftarrow \operatorname*{arg\,max}_{c \in \mathcal{C},\, g \in \mathcal{G}_r} \mathrm{IoU}(c, g)$ . choose the highest scoring (chunk, GT) pair
     $L_t \leftarrow L_{t-1} \oplus c^*$ . append the chunk to the list
     $\mathcal{G}_r \leftarrow \mathcal{G}_r \setminus \{g^*\}$ . remove the associated GT
end for
Algorithm 1 Greedy List Generation with Ground-Truth Access
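A minimal executable sketch of this greedy selection with ground-truth access, assuming set-of-superpixel chunks:

```python
def iou(a, b):
    """IoU of two sets of superpixel ids."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def greedy_list(candidates, ground_truth, budget):
    """Greedily build a detection list: best (chunk, instance) pair first."""
    candidates, remaining = list(candidates), list(ground_truth)
    out = []
    for _ in range(budget):
        if not candidates or not remaining:
            break
        # highest-scoring (chunk, GT) pair = greedy marginal maximizer
        c, g = max(((c, g) for c in candidates for g in remaining),
                   key=lambda pair: iou(*pair))
        out.append(c)           # append the chunk to the list
        candidates.remove(c)
        remaining.remove(g)     # remove the associated GT
    return out
```

Note the recursive property: truncating the returned list to any prefix length gives exactly the list the algorithm would produce under that smaller budget.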

Critically, the greedy algorithm is recursive, meaning longer lists of predictions always include shorter lists, and is within a constant factor of optimal:¹

¹Although Problem (1) can be solved exactly, doing so requires knowledge of the instances to be matched and does not possess a recursive structure that enables simple creation of longer lists of predictions.

Theorem 1

Let $L_j$ be the list of the first $j$ elements in $L_N$ and $L_j^*$ be the optimal solution of Problem (1) among size-$j$ lists. Then

$$F(L_j) \ge \tfrac{1}{2}\, F(L_j^*), \quad \text{for all } j \in \{1, \dots, N\}.$$
See the appendix for the proof of Theorem 1, which invokes results from [44]. Theorem 1 implies that if we are given a budget $j$ to build the list, then each $L_j$ scores within a constant factor of the optimal list among all lists of budget $j$, for every $j \le N$. This is an important property for producing good predictions early in the list and for producing the list of chunks rapidly. The empirical performance is usually much better than this bound suggests.

III-B List prediction learning

In essence, the greedy strategy is a sequential list prediction process: at each round it maximizes the marginal benefit given the previous list of predictions and the ground truth association. Maximization of the marginal benefit at each position of the list yields chunks that have high IoU with ground truth instances and minimal overlap with each other. At test time, however, there is no access to the ground truth. Therefore, we take a learning approach to emulate the greedy algorithm. We train a predictor to imitate Algorithm 1, with the goal of preserving the ranking of all candidate chunks based on their greedy marginals. This predictor uses both information about the current chunk and information about the currently built list to inform its predictions. In our experiments, we train random forests as our regressor, with features that are a function of each chunk and the currently built list (as described in Section IV-B), and with the greedy marginal as the regression target (i.e., how much a chunk candidate covers a new object instance). This regression of “region IoU” is similar to that explored in [45], except that it explicitly reasons about multiple objects, as well as the current contents of the predicted list. The prediction procedure is the same as the greedy list generation in Algorithm 1, with the difference that there is no access to the ground truth: the learned regressor stands in for the greedy marginal.
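The collection of imitation-learning training pairs can be sketched as follows. The two-element feature tuple here is a toy stand-in for the real chunk/list features of Section IV-B, and any regressor (the paper uses random forests) could then be fit to the resulting (feature, greedy-marginal) pairs.

```python
def iou(a, b):
    """IoU of two sets of superpixel ids."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def greedy_marginal(c, remaining):
    """Best overlap of chunk c with any remaining unpaired instance."""
    return max((iou(c, g) for g in remaining), default=0.0)

def make_targets(candidates, ground_truth, budget):
    """Roll out the greedy oracle, recording regression pairs per round."""
    remaining, data = list(ground_truth), []
    for _ in range(budget):
        if not remaining:
            break
        for c in candidates:
            feats = (len(c), len(remaining))  # toy stand-in features
            data.append((feats, greedy_marginal(c, remaining)))
        # advance the rollout exactly as the greedy oracle would
        best = max(candidates, key=lambda c: greedy_marginal(c, remaining))
        remaining.remove(max(remaining, key=lambda g: iou(best, g)))
    return data
```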

III-C Growing for a single instance

To generate a set of diverse chunks (the output of stage 1 in the detection process), we develop a class-specific algorithm that “grows” chunks via iterative addition of superpixels, with the goal of producing diverse candidate detections that cover each ground truth object instance. We first analyze the case where there is only a single object $g$ of interest in the image. We consider a chunk $c$ to be a union of superpixels, i.e., $c = \bigcup_{s \in S_c} s$ for some subset $S_c$ of the image's superpixel set $S$. Let $\mathrm{IoU}(c, g)$ denote the IoU score between $c$ and $g$.

To grow a chunk, Algorithm 2 starts with an empty chunk (no superpixels) and adds single superpixels to the current chunk sequentially. After each addition, the resulting chunk is copied and added to the set of candidate chunks. Let $f^*(s) = |s \cap g| / |s|$ be the ratio of a superpixel's intersection area with the ground truth to its size. The set of chunks generated by the greedy algorithm described in Algorithm 2 is guaranteed to contain the optimal chunk if the input predictor $f$ returns the exact value of $f^*$, i.e., $f = f^*$.

Input: Set of superpixels $S$, grower predictor $f$.
Output: A set of chunks, $\mathcal{A}$.
Sort elements in $S$ by decreasing order of $f(s)$, obtaining $(s_1, \dots, s_n)$
$c_0 \leftarrow \emptyset$; $\mathcal{A} \leftarrow \emptyset$
for $i = 1$ to $n$ do
     $c_i \leftarrow c_{i-1} \cup s_i$; $\mathcal{A} \leftarrow \mathcal{A} \cup \{c_i\}$
end for
Algorithm 2 Single Instance Chunk Growing Algorithm
Theorem 2

Let $f^*$ be an oracle growing predictor, i.e., $f^*(s) = |s \cap g| / |s|$. The output set $\mathcal{A}$ from Algorithm 2 with $f = f^*$ contains the best chunk attainable from the set of superpixels $S$.

See the appendix for the proof. At test time, we must estimate $f^*$ with a learned predictor $\hat{f}$. We train a random forest regressor as our predictor, using the grower features described in Section IV-B. We analyze the performance of Algorithm 2 under approximation by relating the squared regression error of $\hat{f}$ to the IoU score of the best chunk grown in the chain. We note that the test-time performance depends on both the size of the squared error and the number of predictions made. Notably, the error bound has no explicit dependence on the areas of the ground truth object instances or the image. See the appendix for proofs.

Theorem 3

Given a regressor $\hat{f}$ that achieves no worse than $\epsilon$ absolute error uniformly across all superpixels, let $\hat{c}$ be the best chunk in the predicted set $\hat{\mathcal{A}}$. The IoU score of $\hat{c}$ is no worse than a fraction, determined by $\epsilon$, of the IoU score of the optimal chunk $c^*$ (see the appendix for the precise bound).

Corollary 1

Suppose the regressor $\hat{f}$ has expected squared error $\epsilon^2$ over the distribution of superpixels, and let $n$ be the number of superpixels in the image. Then, for any $\delta > 0$, with probability at least $1 - \delta$, an analogous guarantee to Theorem 3 holds, with the uniform error replaced by a term depending on $\epsilon$, $n$, and $\delta$ (see the appendix for the precise bound).
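Algorithm 2 with oracle access to $f^*$ can be sketched as follows, representing superpixels as disjoint sets of pixel ids and the ground truth $g$ as a pixel set; the best chunk in the resulting chain matches the best achievable union in this toy instance.

```python
def grow_chain(superpixels, g):
    """Sort superpixels by f*(s) = |s ∩ g| / |s|, grow in that order,
    and snapshot the chunk after every addition."""
    ratio = lambda s: len(s & g) / len(s)
    chunks, cur = [], set()
    for s in sorted(superpixels, key=ratio, reverse=True):
        cur = cur | s
        chunks.append(set(cur))  # copy the chunk into the candidate set
    return chunks

def iou(a, b):
    """IoU of two pixel sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0
```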

III-D Growing for multiple instances

We run the growing algorithm more than once to cover multiple objects. Instead of making predictions based solely on features of each individual superpixel, we augment the information available to the predictor by including features of the currently grown chunk (see Section IV-B for more information about grower features). This yields predictors that prefer choosing superpixels in close proximity to the currently growing chunk, and allows us to avoid explicitly encoding contiguity requirements, as objects may be partially occluded in a way that renders them discontiguous. We also modify Algorithm 2 by “seeding” the chunks at a set of superpixel locations (the initialization step $c_0 \leftarrow \emptyset$ becomes $c_0 \leftarrow s_{\text{seed}}$), and running the growing procedure on each of these seeds separately. See the appendix for the pseudo-code, modified from Algorithm 2. In practice, we choose a seeding grid interval and a maximum chunk size cutoff, which together bound the size of the candidate pool. In Fig. 2, we visualize the sequential growth of the best chunks for each object instance.
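The seeded, multi-instance variant might look like the following sketch; the `score` callback is a stand-in for the learned predictor, which takes the current chunk as context, and the numeric proximity score in the test is purely illustrative.

```python
def grow_from_seeds(superpixels, seeds, score, max_size):
    """Grow one chunk chain per seed, scoring each candidate superpixel
    against the current chunk, with a size cutoff."""
    pools = []
    for seed in seeds:
        cur, chunks = set(seed), [set(seed)]
        avail = [s for s in superpixels if s != seed]
        while avail and len(cur) < max_size:
            # the predictor sees the current chunk as context
            s = max(avail, key=lambda s: score(s, cur))
            avail.remove(s)
            cur = cur | s
            chunks.append(set(cur))
        pools.append(chunks)
    return pools
```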

Fig. 2: Selected images of the best grown chunks for images with single and multiple objects. Each chunk grows independently of the others. Given the initial seed, superpixels are iteratively added to the growing chunk. The predictor greedily adds superpixels that it believes make the highest contribution to the overall class-specific IoU score of the currently growing chunk.

IV Experiments

We describe our experiments and features in the next two sections, and discuss the results of each experiment in their respective captions.

Fig. 3: LM+Sun Adjacent Cars candidate quality, measured by Average Best Overlap (ABO). We find that our grown chunks generally outperform the Selective Search and SCALPEL methods. On average, SCALPEL generated regions per image, Selective Search generated regions per image, SP  DPM generated 8 chunks per image, and our grower generated chunks per image.
Fig. 4: Performance of lists constructed from the best candidates from each prediction pool (dashed lines) and predicted candidates (solid lines) on a 50/50 split of the LM+Sun Adjacent Cars dataset. Our chunk generating method (dark blue) generates candidates of similar quality to those of SCALPEL (light blue). On this dataset, our DPM-based baselines (magenta and dark cyan) perform quite well, but the best performing list prediction method (green line) is our list predictor that uses the SP DPM chunks as the candidate pool, essentially learning how to reorder them. This demonstrates how our approach can utilize and improve different candidate sources.

IV-A Datasets and baseline algorithms

We perform experiments on imagery from 3 different datasets. We refine the Stanford Background Dataset [46] labeling to include a vehicle class with instance labeling. We also perform experiments on PASCAL VOC 2012 (Fig. 5 and Tables I and II). This dataset possesses relatively few images containing adjacent and/or overlapping instances of the same class. Therefore, we created a subset of the LM+Sun dataset [31] of images containing at least 2 adjacent cars, consisting of 1,042 images.

Fig. 5: Example single predictions on PASCAL VOC2012
TABLE I: Average instance-based accuracy (a metric proposed by [15]), and first slot scores (corresponding to the average overlap of the first prediction in each image) for systems trained and tested on the standard PASCAL 2012 train and val sets. We find that the very small amount of co-occurring instance training data was not sufficient to enable our system to perform as well as it did in our other experiments on images with co-occurring instances. While [15] provides experimental results of average instance-based accuracy on PASCAL 2010, their results are confined to verified correct DPM detections, rendering a fair comparison difficult.
aeroplane bicycle bird boat bottle bus car cat chair cow
.157 .066 .105 .132 .079 .228 .097 .155 .071 .211
.521 .148 .375 .335 .186 .439 .190 .445 .141 .494
.530 .158 .394 .347 .202 .509 .195 .461 .171 .581
diningtable dog horse motorbike person pottedplant sheep sofa train tvmonitor
.098 .165 .197 .166 .193 .078 .182 .139 .170 .109
.260 .430 .437 .407 .362 .133 .466 .260 .403 .270
.261 .454 .479 .441 .456 .160 .582 .272 .409 .278

IV-B Features

As discussed in Section III-B, the features should encode the quality of a chunk (e.g., “Does the chunk look like a vehicle?”) and its similarity to the currently predicted list (e.g., “Is this chunk similar to previously predicted chunks?”). One of the quality features is built upon the superpixel-wise multi-class label distribution from [30]: we compute a label distribution for each chunk by aggregating the histograms of its constituent superpixels. The other quality features are shape features, including central moments of the chunk, area, and scale relative to the image. The similarity features we use primarily encode spatial information between predictions. We use a candidate’s overlap with previous predictions, the spatial histogram used in [9], and the size of the current list. Chunks with high similarity to previously predicted chunks in the list are less favored.
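The label-distribution quality feature can be sketched as an area-weighted aggregation; the per-superpixel histograms here are illustrative inputs, not actual outputs of the labeling system of [30].

```python
def chunk_label_distribution(superpixel_hists, areas):
    """Aggregate per-superpixel class histograms into one chunk-level
    distribution, weighting each superpixel by its pixel area."""
    k = len(superpixel_hists[0])  # number of classes
    total = sum(areas)
    return [sum(h[c] * a for h, a in zip(superpixel_hists, areas)) / total
            for c in range(k)]
```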

The features for the grower encode the quality of the chunk proposed by adding superpixel $s$ to the current chunk. The grower uses the same quality features used by the list predictor, as well as several of the class-agnostic features described in [18]: specifically, color similarity (color histogram intersection), which encourages chunks to be built from similarly colored regions, and region fill, which encourages growing compact chunks. See [18] for further details. As each superpixel is added to the chunk, the similarity of the remaining candidate superpixels to the growing chunk is recomputed.

TABLE II: List prediction and baseline performance on VOC2012 Person validation data and an 80/20 split of SBD Vehicles. Our list prediction outperforms all baselines in both experiments. In SBD Vehicles, the most competitive baseline is the Scene Parsing intersected with DPM Bounding boxes. In VOC2012 Person, scene parsing was lower quality, which resulted in the DPM Boxes outperforming the other baselines.

                                                   SBD Vehicle                  VOC2012 Person
(mean optimal chunks given segmentation)           0.82 1.43 1.87 2.19 2.44     0.83 1.17 1.33
(mean best grown chunks)                           0.69 1.14 1.45 1.66 1.81     0.52 0.71 0.79
List Prediction with Grown and Baseline Chunks     0.58 0.89 1.08 1.18 1.25     0.38 0.50 0.53
List Prediction with Selective Search              -    -    -    -    -        0.27 0.36 0.41
Scene Parsing DPM Baseline                         0.56 0.79 0.91 1.02 1.07     0.16 0.19 0.21
Connected Components Baseline                      0.37 0.53 0.60 0.65 0.66     0.19 0.24 0.27
DPM Baseline                                       0.28 0.39 0.43 0.45 0.47     0.29 0.38 0.41
Fig. 6: Comparison of list prediction versus other baselines. Each group of images contains, from left to right, the results of DPM, DPM intersected with Scene Parsing, and Visual Chunking. Note that while the Scene Parsing intersected with a bounding box detector can perform well, it fails in the case of poor NMS performance (group 1), and requires highly accurate Scene Parsing. Visual Chunking outperforms this baseline by instead building a list of detections.

We evaluate three methods² leveraging existing bounding box detections and a superpixel-wise semantic labeling algorithm, all of which serve as our baseline systems for building lists of predictions:

  1. Bounding box detector output after NMS filtering

  2. Connected components of scene parsing (“SP”) / semantic labeling

  3. A combination of 1) and 2): the intersection of connected components with bounding boxes, which creates a chunk for every bounding box by extracting the labeled region inside it (“SP DPM”)

The third baseline is intended to capitalize on the desirable properties of each component while avoiding their less desirable ones: boxes usually violate object boundaries, and semantic labeling does not separate adjacent instances. The downside of this baseline is that it can compound both detector and scene parsing errors. See Fig. 6 for a visual comparison.

²We use the semantic labeling algorithm of [30] and the DPM detection method of [1] for bounding box output, with the default SVM threshold, and NMS threshold of . To generate the superpixels, we use the segmentation algorithm of [5]. For each experiment, separate semantic labeling systems and chunk growers were trained.
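The SP DPM baseline can be sketched as follows; centroid-in-box containment is an illustrative stand-in for the actual region-box intersection, and the dict-based superpixel representation is assumed for the sketch.

```python
def sp_dpm_chunks(superpixels, boxes):
    """For each detector box, form a chunk from the positively-labeled
    superpixels whose centroids fall inside the box."""
    chunks = []
    for (x1, y1, x2, y2) in boxes:
        chunk = set()
        for sp in superpixels:  # sp: {"pixels", "centroid", "label"}
            cx, cy = sp["centroid"]
            if sp["label"] and x1 <= cx <= x2 and y1 <= cy <= y2:
                chunk |= sp["pixels"]
        chunks.append(chunk)
    return chunks
```

This makes the failure modes above concrete: a missed box or a mislabeled superpixel each independently corrupts the resulting chunk.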

We investigate the region generating methods of SCALPEL [41] and Selective Search [18]; in Fig. 4 we compare our chunk generating method against them on our LM+Sun Adjacent Cars dataset using the Average Best Overlap (ABO) metric suggested by [41], and we additionally train our system using these methods to fill the candidate pool. In Table II, we compare different list prediction methods on vehicle and person data, respectively.

V Conclusion

We provide a novel method for producing region-based object detections in images, treating the problem as list prediction from a set of candidate region proposals. We formulate a scoring criterion for multiple object instances and multiple predictions, and develop a list prediction algorithm that directly optimizes this criterion. Our approach is agnostic to the proposal generation method and provides a recursive solution for all list lengths, enabling it to easily produce any number of best guesses for objects. We also provide a class-specific candidate generation algorithm, yielding good coverage of objects. We demonstrate that our list prediction is a useful method for improving arbitrary candidate pools.


  • [1] P. F. Felzenszwalb et al., “Object detection with discriminatively trained part-based models,” PAMI, 2010.
  • [2] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
  • [3] D. Dey et al., “Contextual sequence prediction with application to control library optimization,” ICML, 2013.
  • [4] S. Ross et al., “Learning policies for contextual submodular prediction,” in ICML, 2013.
  • [5] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-based image segmentation,” IJCV, 2004.
  • [6] H. Daumé III et al., “Search-based structured prediction,” Machine Learning, 2009.
  • [7] S. Ross et al., “A reduction of imitation learning and structured prediction to no-regret online learning,” in AISTATS, 2011.
  • [8] A. Vedaldi et al., “Multiple kernels for object detection,” in CVPR, 2009.
  • [9] C. Desai et al., “Discriminative models for multi-class object layout,” IJCV, 2011.
  • [10] V. Lempitsky et al., “Image segmentation with a bounding box prior,” in ICCV, 2009.
  • [11] A. Monroy and B. Ommer, “Beyond bounding-boxes: Learning object shape by model-driven grouping,” in ECCV, 2012.
  • [12] J. Z. Wang et al., “Simplicity: Semantics-sensitive integrated matching for picture libraries,” PAMI, 2001.
  • [13] Q. Dai and D. Hoiem, “Learning to localize detected objects,” in CVPR, 2012.
  • [14] R. Mottaghi, “Augmenting deformable part models with irregular-shaped object patches,” 2012.
  • [15] Y. Yang et al., “Layered object models for image segmentation,” PAMI, 2012.
  • [16] B. Leibe et al., “Robust object detection with interleaved categorization and segmentation,” IJCV, 2008.
  • [17] C. Gu et al., “Recognition using regions,” in CVPR, 2009.
  • [18] J. Uijlings et al., “Selective search for object recognition,” IJCV, 2013.
  • [19] E. Borenstein and S. Ullman, “Class-specific, top-down segmentation,” in ECCV, 2002.
  • [20] J. Carreira et al., “Object recognition by sequential figure-ground ranking,” IJCV, 2012.
  • [21] B. Leibe et al., “Combined object categorization and segmentation with an implicit shape model,” in Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
  • [22] S. X. Yu et al., “Concurrent object recognition and segmentation by graph partitioning,” in NIPS, 2002.
  • [23] A. Levin and Y. Weiss, “Learning to combine bottom-up and top-down segmentation,” in ECCV, 2006.
  • [24] Z. Tu et al., “Image parsing: Unifying segmentation, detection, and recognition,” IJCV, 2005.
  • [25] M. P. Kumar et al., “OBJ CUT,” in CVPR, 2005.
  • [26] J. M. Gonfaus et al., “Harmony potentials for joint classification and segmentation,” in CVPR, 2010.
  • [27] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in NIPS, 2011.
  • [28] G. Heitz and D. Koller, “Learning spatial context: Using stuff to find things,” in ECCV, 2008.
  • [29] S. Gould et al., “Decomposing a scene into geometric and semantically consistent regions,” in ICCV, 2009.
  • [30] D. Munoz et al., “Stacked hierarchical labeling,” in ECCV, 2010.
  • [31] J. Tighe and S. Lazebnik, “Superparsing: scalable nonparametric image parsing with superpixels,” in ECCV, 2010.
  • [32] A. Ion et al., “Probabilistic joint image segmentation and labeling.” in NIPS, 2011.
  • [33] L. Ladickỳ et al., “What, where and how many? combining object detectors and crfs,” in ECCV, 2010.
  • [34] S. Fidler et al., “Bottom-up segmentation for top-down detection,” in CVPR, 2013.
  • [35] G. Heitz et al., “Cascaded classification models: Combining models for holistic scene understanding,” in NIPS, 2008.
  • [36] J. Yao et al., “Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation,” in CVPR, 2012.
  • [37] A. Levinshtein et al., “Optimal contour closure by superpixel grouping,” in ECCV, 2010.
  • [38] I. Endres and D. Hoiem, “Category independent object proposals,” in ECCV, 2010.
  • [39] B. Alexe et al., “What is an object?” in CVPR, 2010.
  • [40] ——, “Measuring the objectness of image windows,” PAMI, 2012.
  • [41] D. Weiss and B. Taskar, “Scalpel: Segmentation cascades with localized priors and efficient learning,” in CVPR, 2013.
  • [42] M. Everingham et al., “The pascal visual object classes (voc) challenge,” IJCV, 2010.
  • [43] P. Arbeláez et al., “Semantic segmentation using regions and parts,” in CVPR, 2012.
  • [44] M. L. Fisher et al., “An analysis of approximations for maximizing submodular set functions II,” in Polyhedral Combinatorics, 1978.
  • [45] J. Carreira and C. Sminchisescu, “Cpmc: Automatic object segmentation using constrained parametric min-cuts,” PAMI, 2012.
  • [46] S. Gould et al., “Decomposing a scene into geometric and semantically consistent regions,” in ICCV, 2009.

Appendix A

This appendix contains proofs of the theoretical results stated in the paper, along with additional pseudo-code descriptions.

A-A Proof of Theorem 1

Given an image with ground truth entities G = {g_1, ..., g_m}, a candidate chunk set C, and a list size budget B, our goal is to select the optimal B chunks out of C and associate each with a ground truth entity so as to maximize the sum of intersection-over-union (I/U) scores under the association. This problem can be cast as maximum weighted bipartite graph matching, a classic assignment problem in combinatorial optimization. The edge set of the bipartite graph is the Cartesian product of C and G, i.e., E = C × G. The weight of each edge (c, g) is the I/U score between chunk c and ground truth g. The quality metric defined in Section 3.1 is equal to the optimal assignment score for the subgraph induced by the selected chunks.

Let C*_B ⊆ C be the optimal size-B subset of chunks, which can be computed in cubic time by the Hungarian algorithm [Munkres, 1957]. Algorithm 1 can be seen as a greedy approach to maximum bipartite graph matching with a 1/2 approximation guarantee [Preis, 1999]. Further, let M be the greedy matching on the subgraph induced by C*_B; we can show that for any augmented graph induced by a superset C′ ⊇ C*_B, the matching obtained by running Algorithm 1 on C′ for B iterations is no worse than M. Hence, running Algorithm 1 on C carries a 1/2 approximation guarantee with respect to the optimal size-B subset of C. Together with the fact that the greedy solution has a recursive structure, i.e., the greedy list under a smaller budget is a prefix of the greedy list under any larger budget, this proves Theorem 1.
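The greedy matching step can be sketched as follows. This is an illustrative sketch, not the paper's implementation: `iou` is a hypothetical precomputed score matrix, and the function repeatedly commits the highest-scoring unmatched chunk/ground-truth pair, which yields the standard 1/2-approximate greedy matching.

```python
def greedy_match(iou):
    """Greedy maximum-weight bipartite matching.

    iou[i][j] is the I/U score between candidate chunk i and ground-truth
    entity j. Returns (chunk, gt, score) triples in the order they are
    chosen; the total score is at least half the optimal assignment score.
    """
    # Enumerate all edges and scan them in decreasing score order.
    pairs = sorted(
        ((s, i, j) for i, row in enumerate(iou) for j, s in enumerate(row)),
        key=lambda t: -t[0],
    )
    used_chunks, used_gt, match = set(), set(), []
    for s, i, j in pairs:
        if i not in used_chunks and j not in used_gt and s > 0:
            used_chunks.add(i)
            used_gt.add(j)
            match.append((i, j, s))
    return match
```

Because the greedy list has the prefix property noted above, truncating the returned list to its first B elements gives the greedy size-B solution.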

A-B Proof of Theorem 2

Consider a superpixel s and ground truth g. Let a(s) = |s ∩ g| be the area of s inside g and d(s) = |s ∖ g| the area outside, so that the overlap fraction is a(s)/(a(s) + d(s)) = |s ∩ g| / |s|. Since a/(a + d) = 1/(1 + d/a) is a monotonic transformation of a/d, we have a(s)/|s| ≥ a(s′)/|s′| if and only if a(s)/d(s) ≥ a(s′)/d(s′). Therefore the rankings of superpixels based on |s ∩ g| / |s| or on a(s)/d(s) are the same.

Using the fact that superpixels are non-overlapping, given any set of superpixels S, we have I/U(S, g) = Σ_{s∈S} a(s) / (|g| + Σ_{s∈S} d(s)). Further, adding a superpixel s to S increases the I/U score if and only if a(s)/d(s) > I/U(S, g), and decreases it otherwise, since (I + a)/(U + d) > I/U exactly when a/d > I/U. Therefore, letting S* be the optimal solution, any superpixel with a(s)/d(s) > I/U(S*, g) must belong to S*, and any superpixel with a(s)/d(s) < I/U(S*, g) must not. This implies the optimal set of superpixels consists of the first k elements under a sorting of superpixels by |s ∩ g| / |s| in decreasing order, where k is the smallest integer such that adding the (k+1)-th superpixel would not increase the score.
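The sorting argument above suggests a simple selection routine. The sketch below is illustrative only: superpixels are summarized by hypothetical (area, overlap) pairs rather than pixel masks, and the routine returns the I/U-maximizing prefix of the sorted order.

```python
def best_prefix(superpixels, gt_area):
    """Select the I/U-maximizing subset of non-overlapping superpixels.

    superpixels: list of (area, overlap) pairs, where overlap is the area
    shared with the ground truth. Sorts by overlap fraction and returns
    the prefix maximizing intersection-over-union, with its I/U score.
    """
    order = sorted(superpixels, key=lambda p: p[1] / p[0], reverse=True)
    inter, union = 0.0, float(gt_area)
    best, best_iou = [], 0.0
    for k, (area, ov) in enumerate(order, 1):
        inter += ov
        union += area - ov  # superpixels are disjoint, so areas add up
        if inter / union > best_iou:
            best_iou, best = inter / union, order[:k]
    return best, best_iou
```

For example, with a ground truth of area 10 and superpixels (5, 5), (5, 2), (10, 0), the first superpixel alone gives I/U 5/10, and adding the second improves it to 7/13, consistent with its a/d ratio 2/3 exceeding the current score 1/2.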

A-C Proof of Theorem 3 and Corollary

Let f(s) = |s ∩ g| / |s| and let the optimal set of superpixels maximizing I/U with the ground truth be the first k* superpixels under the ranking by f, i.e., S* = {s_1, ..., s_{k*}}. Suppose the regressor f̂ makes bounded uniform error ε, i.e., |f̂(s) − f(s)| ≤ ε for all s, and let m be the largest number such that f(s_i) ≥ f(s_{k*}) + 2ε for all i ≤ m. Under bounded uniform error ε, the worst case is that the regressor underestimates the scores of superpixels in S* by ε and overestimates the scores of superpixels outside S* by ε. Therefore, some elements outside S* may rank higher than elements of S* whose true scores lie within 2ε of the boundary score f(s_{k*}), while the first m superpixels are guaranteed to retain their positions. Denote by Ŝ the best solution among the chain of sets induced by the ranking output of the regressor f̂.

To bound the loss of Ŝ relative to S*, we use the fact that a/(a + d) ≥ a′/(a′ + d′) implies a/d ≥ a′/d′: each superpixel erroneously swapped into the ranking displaces a superpixel whose true score is at most 2ε larger. Rearranging the terms, and using the monotonicity of the I/U increment condition from the proof of Theorem 2, we obtain an additive bound of the form

    I/U(Ŝ, g) ≥ I/U(S*, g) − O(ε),

where the constant depends on the areas of the superpixels near the ranking boundary.

A more natural assumption is an expected squared error over the distribution of all superpixels, E[(f̂(s) − f(s))²] ≤ ε². By Jensen's inequality, since the square root is concave, the expected absolute error satisfies E|f̂(s) − f(s)| ≤ √(E[(f̂(s) − f(s))²]) ≤ ε. By Markov's inequality, for any δ ∈ (0, 1), with probability at least 1 − δ we have |f̂(s) − f(s)| ≤ ε/δ.

Substituting ε/δ for ε in the additive bound above yields the corollary: for any δ, with probability at least 1 − δ,

    I/U(Ŝ, g) ≥ I/U(S*, g) − O(ε/δ).
A-D Pseudocode for Multiple Grower Algorithm

Input: Set of superpixels S, grower predictor f, seeding superpixel s_0
Output: A set of chunks C = {c_1, ..., c_|S|}.
c_1 ← {s_0}; C ← {c_1}
for t = 2 to |S| do
  s* ← argmax_{s ∈ S ∖ c_{t−1}} f(φ(s, c_{t−1}))
  c_t ← c_{t−1} ∪ {s*}
  C ← C ∪ {c_t}
end for
Algorithm 3 Multiple Instance Chunk Growing Algorithm

The pseudocode in Algorithm 3 describes the growing algorithm with a specified seeding superpixel s_0. We run this algorithm once for each s_0 ∈ S_seed, where S_seed is the set of seeding superpixels, in order to increase diversity. Two major differences from the single instance chunk growing algorithm in Section 3.2 are described below:

  1. A seeding superpixel s_0 must be given as input to initialize the chain of growth, i.e., c_1 = {s_0}.

  2. Instead of using features based only on the superpixel itself, we also consider features of both the superpixel s and the currently growing chunk c, replacing f(φ(s)) with f(φ(s, c)). These features not only encode information about the quality of a superpixel but also encourage the grower to grow spatially compact chunks.
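The growing loop can be sketched as follows. This is a minimal illustration, not the trained model from the paper: `f` is a hypothetical scoring predictor and `phi` a hypothetical feature map over a candidate superpixel and the current chunk.

```python
def grow_chunks(superpixels, f, phi, seed):
    """Greedily grow a nested chain of chunks from a seed superpixel.

    superpixels: iterable of superpixel ids; f scores a candidate
    superpixel given the current chunk via features phi(s, chunk).
    Returns the list of nested chunks c_1 ⊂ c_2 ⊂ ... produced
    along the way, one per growth step.
    """
    chunk = [seed]
    chunks = [tuple(chunk)]
    remaining = set(superpixels) - {seed}
    while remaining:
        # Add the superpixel the predictor scores highest given
        # the chunk grown so far.
        s_best = max(remaining, key=lambda s: f(phi(s, chunk)))
        remaining.remove(s_best)
        chunk.append(s_best)
        chunks.append(tuple(chunk))
    return chunks
```

Running this for every seed in S_seed and pooling the resulting chains yields the diverse candidate chunk set described above.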
