Visual Chunking: A List Prediction Framework for Region-based Object Detection
Abstract
We consider detecting objects in an image by iteratively selecting from a set of arbitrarily shaped candidate regions. Our generic approach, which we term visual chunking, reasons about the locations of multiple object instances in an image while expressively describing object boundaries. We design an optimization criterion for measuring the performance of a list of such detections as a natural extension of a common per-instance metric. We present an efficient algorithm with provable performance for building a high-quality list of detections from any candidate set of region-based proposals. We also develop a simple class-specific algorithm that generates candidate regions in near-linear time in the number of low-level superpixels and outperforms other region-generating methods. In order to make predictions on novel images at test time without access to ground truth, we develop learning approaches to emulate these algorithms' behaviors. We demonstrate that our new approach outperforms sophisticated baselines on benchmark datasets.
I Introduction
We consider the problem of object detection, where the goal is to identify parts of an image corresponding to objects of a particular semantic type, e.g. "car". In recent years, machine learning-based approaches have become de rigueur for addressing this difficult problem; one classical approach is to transform the problem into one of binary classification, either on bounding boxes [1, 2] or on regions. Such approaches (see Section II for a detailed discussion) typically follow a two-stage procedure:

1) generate independent proposals to provide coverage across object instances;

2) improve precision and reduce redundancy by pruning out highly overlapping proposals.
Intuitively, the first step returns a set of proposals with high recall, and the second step improves the precision. For the second step, traditional approaches rely on a combination of thresholds and arbitration techniques like Non-Max Suppression (NMS) to produce a final output. Such methods, while remarkably effective at identifying sufficiently separated objects, still have difficulty simultaneously detecting objects that are close together or overlapping while preventing multiple detections of the same object (see Fig. 6). While we provide contributions to both stages, our focus is on formalizing and improving the second stage.
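As a point of reference for this second stage, the following is a minimal sketch of the standard greedy NMS procedure the box-based baselines rely on; the box format, score source, and threshold value here are illustrative assumptions rather than details from this paper:

```python
# Minimal sketch of greedy Non-Max Suppression (NMS). Boxes are (x1, y1, x2, y2).
import numpy as np

def box_iou(a, b):
    """IoU of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.5):
    """Keep highest-scoring boxes, suppressing any box whose IoU with an
    already-kept box exceeds `thresh`. Returns kept indices."""
    order = np.argsort(scores)[::-1]            # descending by score
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```

Note that suppression is a purely local, pairwise decision; the list prediction approach below instead shares information across all candidates when building the output.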
We formulate the objective of the second step as that of producing a diverse list of detections in the image. We propose an optimization criterion on this list of detections as a natural extension of the intersection-over-union (IoU) metric (described in Section III-A), and develop an algorithm that targets this criterion. This approach uses recent work on building performance-bounded lists of predictions [3, 4]. Our algorithm shares information across all candidate detections to build a list of detections, specifically exploiting this information to perform well even when object instances are adjacent. Each decision to append to the list of detections is made with contextual information from all previous detections. Importantly, our list prediction algorithm is agnostic to the source of candidate detections; this lets our approach use any candidate-generating method as input for constructing a list.
Each candidate detection is treated as a union of superpixels with no adjacency constraints. We call these unions "chunks," inspired by the well-known Natural Language Processing task of "chunking," which groups words together into meaningful semantic instances or entities. We use "region" to refer to a contiguous group of superpixels, and reserve "chunk" for a group of superpixels corresponding to a single semantic instance. The analogy is particularly apt when object instances are adjacent, as in Fig. 1.
For the first step, we develop a class-specific supervised approach to region-based object proposal generation that iteratively groups superpixels produced by a low-level segmentation algorithm [5] to form chunks. This helps build a high-recall candidate set. The algorithm learns to "grow" chunks from class-specific ground-truth labelings by emulating a procedure, presented in Algorithm 2, that optimizes a chunk's IoU score with an object. This strategy follows imitation learning approaches [6, 7].
Our technique for building the list of detections can be run for arbitrary list lengths, or budgets. This enables several use cases: building very short lists of highly confident object predictions (high precision), long lists of many candidate regions (high recall), and dynamic-length lists tuned by one or more heuristics (e.g., the highest predicted IoU score among the remaining candidates).
II Related Work
Much work has been done in the combined areas of object detection and semantic labeling. Object detection approaches often seek to place bounding boxes around all instances of objects [1, 8]. [9] casts the multi-class (and multi-instance) detection problem as a structured prediction task rather than applying NMS as post-processing; however, the resulting detections are still bounding boxes.
Intermediate approaches deform the regions inside the output of a detector to produce object segmentations [10, 11, 12], or, conversely, adjust bounding boxes based on low-level features such as boundaries, texture, and color [13, 14]. Again, these approaches refine individual detections relying on the initial detector output. In contrast, we attempt to find the best list of detections given a large collection of candidate detections and regions. Closer to our work, [15] proposes to use a deformable shape model to represent object categories in order to extract region-level object detections. This approach reasons about occluders and overlapping detections by using depth layering, but it is tied to one specific shape model for its region-based representation, while our approach is agnostic to the source of region segments and detection boxes.
Direct region-based techniques, such as [16, 17, 18], use region-based information to formulate detections, but the resulting detections are bounding boxes, and detection performance is analyzed using individual bounding-box metrics. [19] produces region-wise segmentations; however, it assumes the existence of only one object per image. [20] produces multiple region-wise segmentations, but contiguous and adjacent objects are not resolved, and inter-class context is ignored. Other region-based techniques are segmentation algorithms that combine low-level image features with class-specific models [21, 22, 23, 24], control segmentation parameters from object detection [25], or use box-level detections as features for segmentation [26, 27]. These approaches attempt to find regions that best agree with both the region segments and the individual detections, but they do not explicitly address the problem of finding the most consistent list of detections as we do.
Semantic labeling systems such as [28, 29, 30, 31] do produce region-level labels, which can be grouped into detections; however, there is no notion of separate detections: connected components of the labeling are not split into their constituent object instances. [32] uses non-overlapping segmentation proposals in its first stage, thus allowing, in principle, the handling of multiple instances of the same class, though without explicit optimization for multi-instance settings. Although the evaluation criterion in [32] focuses on per-class overlap without accounting for multiple instances, the authors do note the possibility of a multi-instance extension. Combining semantic labeling with object detectors has been explored in different ways; several approaches combine pixel-level classification labels and box-level detections into a single inference problem. For example, [33, 34, 35, 36] incorporate detections into a CRF model for semantic labeling. These techniques attempt to generate a holistic representation of the scene that combines objects and regions, and they rely on semantic segmentation. Our approach, while incorporating semantic segmentation, is agnostic to the input features, as well as to the source(s) from which candidate detections are generated.
Another group of approaches related to our work addresses the problem of generating proposals for regions or boxes that are likely to delineate objects, in a class-independent manner; the proposals can then be evaluated by a class-specific algorithm for object detection. These include, for example, generating regions by iterative superpixel grouping [37, 18] and ranking proposed regions [38] or boxes [39, 40] by a learned objectness score. In [41], the authors investigate an iterative, class-specific region generation procedure that incorporates class-specific models at different scales and requires bounding boxes as input. Our generation method, in comparison, directly optimizes the instance-based IoU metric, and we provide worst-case and probabilistic performance bounds. All of these approaches are complementary to our work in that we can potentially use any of them as input to our candidate generation step; thus, we incorporate and compare to several of them in our experiments.
III Approach
Our task is to output a list of chunks, i.e., a list of sets of superpixels as described in Section I, with high intersection-over-union (IoU) scores with each of the ground truth instances in the image. This metric is formalized in Section III-A. We decompose the task into two parts:

1) Generation of a set of candidate chunks containing some elements that cover individual object instances.

2) Iterative construction of a list of chunks by selecting from an arbitrarily generated set of candidate chunks so as to maximize a natural variant of the intersection-over-union score for multiple object instances and multiple predictions.
In the second stage, the candidate chunks can be generated by any algorithm, which lets our method augment the set of grown candidates with chunks constructed by other means. We begin with the second stage, describing how we build lists of detections: we first define a natural scoring function to evaluate any input list of chunks given ground truth on the pixels corresponding to objects of interest in a scene, and then provide an efficient greedy algorithm that is guaranteed to optimize this metric to within a constant factor given access to ground truth and this arbitrary set of (potentially overlapping) candidate chunks.
Our test-time approach, following recent work in structured prediction [4, 6], is to learn to emulate the sequential greedy strategy. The result is a predictor that takes a candidate set of chunks and iteratively builds a list of chunks that are likely to overlap well with separate objects in the scene.
We place no assumptions on the given candidate set of chunks: the list predictor is agnostic to the way the candidate set is generated. Such a set can be heuristically generated in many ways, e.g., from the baseline approaches described in Section IV. In Section III-C, we provide an algorithm designed to grow a candidate from a fixed superpixel-based segmentation, and in Section III-D we extend this algorithm to growing multiple chunks per image.
III-A Objective function and greedy optimization
We establish an objective function to evaluate the quality of any list, and devise a greedy algorithm to approximately maximize this objective function given access to the ground truth. This leads to the development of a learning algorithm that produces a prediction procedure for novel images.
Given an image with ground truth instance set $\mathcal{G} = \{g_1, \dots, g_M\}$ and candidate chunk set $\mathcal{C}$, our goal is to sequentially build a list of chunks out of $\mathcal{C}$ so as to maximize the sum of IoUs with respect to the ground truth instances. Denoting by $L_N = (c_1, \dots, c_N)$ a size-$N$ list of chunks, we first establish correspondences between candidate chunks and ground truth instances to enable pairwise IoU computation. Note that each $c_i$ is associated with at most one ground truth instance $g_j$, and each $g_j$ is associated with at most one $c_i$. For analytic convenience, we augment $\mathcal{G}$ with dummy ground truth instances $g_{M+1}, \dots, g_N$ to deal with the case in which the length of the list is larger than the number of ground truth instances ($N > M$); every chunk has zero intersection with each dummy instance. Each feasible assignment corresponds to a permutation $\pi$ of $\{1, \dots, N\}$, and the sum of IoU scores for this permutation can be written as $\sum_{i=1}^{N} \mathrm{IoU}(c_i, g_{\pi(i)})$. It is natural to define the quality metric of a list to be the sum of IoU scores under the optimal assignment, i.e., $F(L_N) = \max_{\pi \in \Pi_N} \sum_{i=1}^{N} \mathrm{IoU}(c_i, g_{\pi(i)})$, where $\Pi_N$ denotes all permutations of $\{1, \dots, N\}$. With a slight abuse of notation, $L_N \subseteq \mathcal{C}$ indicates that all elements of $L_N$ belong to $\mathcal{C}$. Our goal during training is to find the list that maximizes $F$:

$$L_N^{*} = \operatorname*{argmax}_{L_N \subseteq \mathcal{C},\ |L_N| = N} F(L_N). \qquad (1)$$
This scoring metric, a natural generalization of the IoU metric common in segmentation and single-instance detection [42, 43], encourages lists of a fixed length that contain chunks that are relevant and diverse in covering multiple ground truth instances. Unfortunately, the metric as written does not possess a clear combinatorial structure, such as modularity or submodularity, that would make it easy to optimize.
Interestingly, however, Problem (1) can be cast as an equivalent maximum weighted bipartite graph matching problem. This problem can be shown to be a submodular maximization problem under matroid partition constraints, and the greedy algorithm shown in Algorithm 1 has multiplicative performance guarantees [44]. In addition to these guarantees, such a greedy algorithm is desirable because it is easily imitable at test time and has a recursive solution: the length-$N$ list is exactly the length-$(N-1)$ list with the next greedily chosen item appended. The greedy algorithm behaves as follows: at each iteration, it chooses the chunk with the highest IoU with one of the remaining ground truth instances. More precisely, a chunk's best overlap with the remaining ground truth is its "greedy marginal" $\delta(c) = \max_{g \in \mathcal{G}_r} \mathrm{IoU}(c, g)$, where $\mathcal{G}_r$ is the set of remaining unpaired ground truth instances. At each step, the algorithm chooses the chunk $c_t$ with the highest $\delta$ value, appends it to the list ($L_t = L_{t-1} \oplus c_t$), and removes its associated ground truth, $g_t = \operatorname*{argmax}_{g \in \mathcal{G}_r} \mathrm{IoU}(c_t, g)$, from the set of remaining ground truth instances.
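For concreteness, here is a minimal sketch of this greedy oracle, assuming a precomputed IoU matrix between candidate chunks and ground truth instances; the function and variable names are illustrative, not the paper's:

```python
# Sketch of the ground-truth greedy list builder (in the spirit of Algorithm 1).
import numpy as np

def greedy_list(iou, budget):
    """iou: (num_chunks, num_gt) matrix with iou[i, j] = IoU(chunk i, gt j).
    Returns a list of (chunk index, matched ground-truth index) pairs."""
    remaining_gt = set(range(iou.shape[1]))
    used = set()
    out = []
    for _ in range(budget):
        if not remaining_gt:
            break  # only dummy (zero-IoU) instances would remain
        best_c, best_g, best_v = None, None, -1.0
        for c in range(iou.shape[0]):
            if c in used:
                continue
            # greedy marginal: best IoU over the unpaired ground truths
            g = max(remaining_gt, key=lambda j: iou[c, j])
            if iou[c, g] > best_v:
                best_c, best_g, best_v = c, g, iou[c, g]
        if best_c is None:
            break
        out.append((best_c, best_g))
        used.add(best_c)
        remaining_gt.discard(best_g)
    return out
```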
Critically, the greedy algorithm is recursive, meaning longer lists of predictions always include shorter lists, and it is within a constant factor of optimal (although Problem (1) can be solved exactly, doing so requires knowledge of the instances to be matched and does not possess a recursive structure that enables simple creation of longer lists of predictions):
Theorem 1
Let $L_t$ be the list of the first $t$ elements of the greedy list $L_N$, and let $L_t^{*}$ be the optimal solution of Problem (1) among size-$t$ lists. Then

$$F(L_t) \ge \tfrac{1}{2}\, F(L_t^{*}), \quad \forall\, t \in \{1, \dots, N\}. \qquad (2)$$
See the appendix for the proof of Theorem 1, which invokes results from [44]. Theorem 1 implies that if we are given a budget for building the list, then each prefix $L_t$ scores within a constant factor of the optimal list among all lists of budget $t$, for every $t \le N$. This is an important property both for producing good predictions early in the list and for producing the list of chunks rapidly. The empirical performance is usually much better than this bound suggests.
III-B List prediction learning
In essence, the greedy strategy is a sequential list prediction process that at each round maximizes the marginal benefit given the previous list of predictions and the ground truth association. Maximizing the marginal benefit at each position of the list yields chunks that have high IoU with ground truth instances and minimal overlap with each other. At test time, however, there is no access to the ground truth. Therefore, we take a learning approach to emulate the greedy algorithm. We train a predictor to imitate Algorithm 1, with the goal of preserving the ranking of all candidate chunks by their greedy marginals $\delta$. This predictor uses both information about the current chunk and information about the currently built list to inform its predictions. In our experiments, we train random forests as our regressor, with features $f(c, L)$ (each chunk's feature vector is a function of the chunk itself and the currently built list, as described in Section IV-B) and regression targets $\delta(c)$ (the target for a chunk at each iteration is its greedy marginal, i.e., how much the candidate covers a new object instance). This regression of "region IoU" is similar to that explored in [45], except that it explicitly reasons about multiple objects as well as the current contents of the predicted list. The prediction procedure is similar to the greedy list generation of Algorithm 1, with the difference that there is no access to the ground truth.
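A minimal sketch of this imitation setup, assuming per-image IoU matrices and a user-supplied feature function; the names and the scikit-learn regressor configuration are illustrative assumptions, not the paper's released code:

```python
# Roll out the ground-truth greedy policy and fit a regressor to its marginals.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def collect_examples(images, features, budget):
    """images: iterable of (chunks, iou) pairs, iou[i, j] = IoU(chunk i, gt j).
    Records one (feature vector, greedy marginal) training pair per candidate
    per round, then advances the list with the greedy choice."""
    X, y = [], []
    for chunks, iou in images:
        chosen, used = [], set()
        gt_left = set(range(iou.shape[1]))
        for _ in range(min(budget, iou.shape[1])):
            best_i, best_m = None, -1.0
            for i in range(len(chunks)):
                if i in used:
                    continue
                marginal = max(iou[i, j] for j in gt_left)  # regression target
                X.append(features(chunks[i], chosen))       # f(c, current list)
                y.append(marginal)
                if marginal > best_m:
                    best_i, best_m = i, marginal
            if best_i is None:
                break
            chosen.append(chunks[best_i])
            used.add(best_i)
            gt_left.discard(max(gt_left, key=lambda j: iou[best_i, j]))
    return np.array(X), np.array(y)

# Hypothetical usage, with `train_images` and `chunk_features` supplied elsewhere:
# X, y = collect_examples(train_images, chunk_features, budget=5)
# model = RandomForestRegressor(n_estimators=200).fit(X, y)
```

At test time the learned model replaces the marginal computation: the predictor scores every remaining candidate given the current list and appends the highest-scoring one.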
III-C Growing for a single instance
To generate a set of diverse chunks (the output of stage 1 in the detection process), we develop a class-specific algorithm that "grows" chunks via iterative addition of superpixels, with the goal of producing diverse candidate detections that cover each ground truth object instance. We first analyze the case where there is only a single object of interest in the image. We consider a chunk $c$ to be a union of superpixels $s_i$, i.e., $c = \bigcup_i s_i$. Let $\mathrm{IoU}(c, g)$ denote the IoU score between chunk $c$ and ground truth instance $g$.
To grow a chunk, Algorithm 2 starts with an empty chunk (no superpixels) and adds single superpixels to the current chunk sequentially. After each addition, the resulting chunk is copied and added to the set of candidate chunks. Let $r(s) = |s \cap g| / |s|$ be the ratio of a superpixel's intersection area with the ground truth to its size. The set of chunks generated by the greedy procedure described in Algorithm 2 is guaranteed to contain the optimal chunk if the input predictor $\hat{r}$ returns the exact value of $r$, i.e., $\hat{r}(s) = r(s)$.
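A minimal sketch of this growth loop; because in the single-instance case a superpixel's score does not depend on the current chunk, greedily picking the maximum at each step is equivalent to sorting once (names and interfaces are assumptions):

```python
# Sketch of single-instance chunk growing in the spirit of Algorithm 2:
# repeatedly append the superpixel with the highest (predicted) intersection
# ratio and snapshot the chain of chunks. `ratio` would be the oracle r(s)
# during analysis and the learned regressor's estimate at test time.
def grow_chunk(superpixels, ratio, max_size=None):
    """Return the chain of chunks produced by greedy growth."""
    order = sorted(superpixels, key=ratio, reverse=True)
    chain, chunk = [], []
    for s in order[: max_size or len(order)]:
        chunk = chunk + [s]          # a chunk is a union of superpixels
        chain.append(list(chunk))    # snapshot after every addition
    return chain
```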
Theorem 2
Let $\hat{r}$ be an oracle growing predictor, i.e., $\hat{r}(s) = r(s)$ for every superpixel $s$. The output set of Algorithm 2 with predictor $\hat{r}$ contains the best chunk $c^{*} = \operatorname*{argmax}_{c}\, \mathrm{IoU}(c, g)$ over all chunks formable from the given set of superpixels.
See the appendix for the proof. At test time, we must form an estimate $\hat{r}(s)$ of $r(s)$. We train a random forest regressor as our predictor, with features $f(s)$, for this estimation. We analyze the performance of Algorithm 2 under approximation by relating the regression error of $\hat{r}$ to the IoU score of the best chunk grown in the chain. We note that the test-time performance depends on both the size of the squared error and the number of predictions made. Notably, the error bound has no explicit dependence on the areas of the ground truth object instances or of the image. See the appendix for proofs of the following results.
Theorem 3
Given a regressor $\hat{r}$ that achieves no worse than $\epsilon$ absolute error uniformly across all superpixels, i.e., $|\hat{r}(s) - r(s)| \le \epsilon$ for all $s$, let $\hat{c}$ be the best chunk in the predicted set. The IoU score of $\hat{c}$ is no worse than a $(1 - O(\epsilon))$ fraction of the IoU score of the optimal chunk $c^{*}$: $\mathrm{IoU}(\hat{c}, g) \ge (1 - O(\epsilon))\, \mathrm{IoU}(c^{*}, g)$.
Corollary 1
Suppose the regressor $\hat{r}$ has expected squared error $\epsilon_{sq}^{2}$ over the distribution of superpixels, and let $n$ be the number of superpixels in the image. Then for any $\delta > 0$, with probability at least $1 - \delta$: $\mathrm{IoU}(\hat{c}, g) \ge \left(1 - O\!\left(\epsilon_{sq}\sqrt{n}/\delta\right)\right) \mathrm{IoU}(c^{*}, g)$.
III-D Growing for multiple instances
We run the growing algorithm more than once to cover multiple objects. Instead of making predictions based solely on features of each individual superpixel, we augment the information available to the predictor with features of the currently grown chunk $c$ (see Section IV-B for more information about grower features). This yields predictors that prefer choosing superpixels in close proximity to the currently growing chunk, while allowing us not to explicitly encode contiguity requirements, as objects may be partially occluded in a way that renders them discontiguous. We also modify Algorithm 2 by "seeding" the chunks at a set of superpixel locations $\mathcal{D}$ (the initialization step $c \leftarrow \emptyset$ becomes $c \leftarrow \{s_{\text{seed}}\}$) and running the growing procedure on each of these seeds separately. See the appendix for the pseudocode, modified from Algorithm 2. In practice, we choose a seeding grid interval and a maximum chunk-size cutoff. We visualize the sequential growth of the best chunks for each object instance in the accompanying figure.
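A minimal sketch of the multi-seed variant; unlike the single-instance sketch above, the score must be recomputed after every addition because it depends on the partially grown chunk (the `score` interface and seed handling are assumptions):

```python
# Sketch of multi-instance chunk growing: seed on a set of superpixels and
# rescore remaining superpixels after every addition.
def grow_from_seed(seed, superpixels, score, max_size):
    """score(s, chunk) estimates how much superpixel s improves the chunk."""
    chunk, chain = [seed], [[seed]]
    remaining = [s for s in superpixels if s is not seed]
    while remaining and len(chunk) < max_size:
        s = max(remaining, key=lambda s: score(s, chunk))
        remaining.remove(s)
        chunk = chunk + [s]
        chain.append(list(chunk))
    return chain

def grow_all(seeds, superpixels, score, max_size):
    candidates = []
    for seed in seeds:                     # e.g., seeds on a regular grid
        candidates.extend(grow_from_seed(seed, superpixels, score, max_size))
    return candidates
```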
IV Experiments
We describe our experiments and features in the next two sections, and discuss the results of each experiment in their respective captions.
IV-A Datasets and baseline algorithms
We perform experiments on imagery from three different datasets. We refine the Stanford Background Dataset [46] labeling to include a vehicle class with instance labeling. We also perform experiments on PASCAL VOC 2012 (Fig. 5 and Tables I and II). This dataset contains relatively few images with adjacent and/or overlapping instances of the same class; therefore, we created a subset of the LM+Sun dataset [31] consisting of 1,042 images, each containing at least two adjacent cars.
aeroplane  bicycle  bird  boat  bottle  bus  car  cat  chair  cow  
.157  .066  .105  .132  .079  .228  .097  .155  .071  .211  
.521  .148  .375  .335  .186  .439  .190  .445  .141  .494  
.530  .158  .394  .347  .202  .509  .195  .461  .171  .581 
diningtable  dog  horse  motorbike  person  pottedplant  sheep  sofa  train  tvmonitor  
.098  .165  .197  .166  .193  .078  .182  .139  .170  .109  
.260  .430  .437  .407  .362  .133  .466  .260  .403  .270  
.261  .454  .479  .441  .456  .160  .582  .272  .409  .278 
IV-B Features
As discussed in Section III-B, the features should encode both the quality of a chunk (e.g., "Does the chunk look like a vehicle?") and its similarity to the currently predicted list (e.g., "Is this chunk similar to previously predicted chunks?"). One of the quality features is built upon the superpixel-wise multi-class label distribution from [30]: we compute a label distribution for each chunk by aggregating the histograms of its constituent superpixels. The other quality features are shape features, including the central moments of the chunk, its area, and its scale relative to the image. The similarity features we use primarily encode spatial information between predictions: a candidate's IoU with previous predictions, the spatial histogram used in [9], and the size of the current list. Chunks with high similarity to previously predicted chunks in the list are less favored.
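A minimal sketch of how such chunk features could be assembled, assuming each superpixel carries a per-class label histogram from the semantic labeling system; the array layout and feature ordering are illustrative assumptions:

```python
# Assemble a chunk's feature vector from quality and similarity components.
import numpy as np

def chunk_features(superpixel_hists, superpixel_areas, prev_chunk_ious, list_len):
    """Quality: area-weighted class label distribution plus total area.
    Similarity: overlap with previously predicted chunks and list size."""
    areas = np.asarray(superpixel_areas, dtype=float)
    hists = np.asarray(superpixel_hists, dtype=float)           # (n_sp, n_classes)
    label_dist = (hists * areas[:, None]).sum(0) / areas.sum()  # aggregated histogram
    quality = np.concatenate([label_dist, [areas.sum()]])
    similarity = [max(prev_chunk_ious, default=0.0), float(list_len)]
    return np.concatenate([quality, similarity])
```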
The features for the grower encode the quality of the chunk proposed by adding superpixel $s$ to the current chunk $c$. The grower uses the same quality features used by the list predictor to characterize chunks, as well as several of the class-agnostic features described in [18]: specifically, color similarity (color histogram intersection), which encourages chunks to be built from similarly colored regions, and region fill, which encourages growing compact chunks. See [18] for further details. As each superpixel is iteratively added to the chunk, the similarity of each remaining candidate superpixel to the growing chunk is recomputed.
We evaluate three methods leveraging existing bounding-box detections and a superpixel-wise semantic labeling algorithm, all of which serve as our baseline systems for building lists of predictions: 1) bounding-box detector output after NMS filtering; 2) connected components of scene parsing / semantic labeling ("SP"); 3) a combination of 1) and 2): the intersection of connected components with bounding boxes, which creates a chunk for every bounding box by extracting the labeled region inside it ("SP∩DPM"). (We use the semantic labeling algorithm of [30] and the DPM detection method of [1] for bounding-box output, with the default SVM and NMS thresholds. To generate the superpixels, we use the segmentation algorithm of [5]. For each experiment, separate semantic labeling systems and chunk growers were trained.) The third baseline is intended to capitalize on the desirable properties of each component while avoiding their less desirable ones: boxes usually violate object boundaries, and semantic labeling does not separate adjacent instances. The downside of this baseline is that it can compound both detector and scene-parsing errors. See Fig. 6 for a visual comparison.
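A minimal sketch of the third baseline's chunk construction, assuming a boolean per-class pixel mask from the semantic labeling and integer box coordinates:

```python
# Sketch of the "SP ∩ DPM" baseline: one chunk per detection box, consisting
# of the class-labeled pixels that fall inside the box.
import numpy as np

def sp_dpm_chunks(class_mask, boxes):
    """class_mask: boolean (H, W) mask for the target class.
    boxes: list of (x1, y1, x2, y2) detections. Returns one mask per box."""
    chunks = []
    for (x1, y1, x2, y2) in boxes:
        inside = np.zeros_like(class_mask)
        inside[y1:y2, x1:x2] = class_mask[y1:y2, x1:x2]
        if inside.any():
            chunks.append(inside)
    return chunks
```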
We investigate the region-generating methods of SCALPEL [41] and Selective Search [18]: in Fig. 4 we compare our chunk-generating method against them on our LM+Sun Adjacent Cars dataset using the Average Best Object method suggested by [41], and we additionally train our system using these methods to fill the candidate pool. In Table II, we compare different list prediction methods on vehicle and person data.
V Conclusion
We provide a novel method for producing region-based object detections in images, treating the problem as list prediction from a set of candidate region proposals. We formulate a scoring criterion for multiple object instances and multiple predictions, and develop a list prediction algorithm that directly optimizes this criterion. Our approach is agnostic to the proposal generation method and provides a recursive solution for all list lengths, enabling it to easily produce any number of best guesses for objects. We also provide a class-specific candidate generation algorithm that yields good coverage of objects. We demonstrate that our list prediction is a useful method for improving arbitrary candidate pools.
References
 [1] P. F. Felzenszwalb et al., “Object detection with discriminatively trained part-based models,” PAMI, 2010.
 [2] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
 [3] D. Dey et al., “Contextual sequence prediction with application to control library optimization,” ICML, 2013.
 [4] S. Ross et al., “Learning policies for contextual submodular prediction,” in ICML, 2013.
 [5] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-based image segmentation,” IJCV, 2004.
 [6] H. Daumé III et al., “Search-based structured prediction,” Machine Learning, 2009.
 [7] S. Ross et al., “A reduction of imitation learning and structured prediction to no-regret online learning,” in AISTATS, 2011.
 [8] A. Vedaldi et al., “Multiple kernels for object detection,” in CVPR, 2009.
 [9] C. Desai et al., “Discriminative models for multi-class object layout,” IJCV, 2011.
 [10] V. Lempitsky et al., “Image segmentation with a bounding box prior,” in ICCV, 2009.
 [11] A. Monroy and B. Ommer, “Beyond bounding boxes: Learning object shape by model-driven grouping,” in ECCV, 2012.
 [12] J. Z. Wang et al., “Simplicity: Semantics-sensitive integrated matching for picture libraries,” PAMI, 2001.
 [13] Q. Dai and D. Hoiem, “Learning to localize detected objects,” in CVPR, 2012.
 [14] R. Mottaghi, “Augmenting deformable part models with irregular-shaped object patches,” 2012.
 [15] Y. Yang et al., “Layered object models for image segmentation,” PAMI, 2012.
 [16] B. Leibe et al., “Robust object detection with interleaved categorization and segmentation,” IJCV, 2008.
 [17] C. Gu et al., “Recognition using regions,” in CVPR, 2009.
 [18] J. Uijlings et al., “Selective search for object recognition,” IJCV, 2013.
 [19] E. Borenstein and S. Ullman, “Class-specific, top-down segmentation,” in ECCV, 2002.
 [20] J. Carreira et al., “Object recognition by sequential figure-ground ranking,” IJCV, 2012.
 [21] B. Leibe et al., “Combined object categorization and segmentation with an implicit shape model,” in Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
 [22] S. X. Yu et al., “Concurrent object recognition and segmentation by graph partitioning,” in NIPS, 2002.
 [23] A. Levin and Y. Weiss, “Learning to combine bottom-up and top-down segmentation,” in ECCV, 2006.
 [24] Z. Tu et al., “Image parsing: Unifying segmentation, detection, and recognition,” IJCV, 2005.
 [25] M. P. Kumar et al., “OBJ CUT,” in CVPR, 2005.
 [26] J. M. Gonfaus et al., “Harmony potentials for joint classification and segmentation,” in CVPR, 2010.
 [27] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in NIPS, 2011.
 [28] G. Heitz and D. Koller, “Learning spatial context: Using stuff to find things,” in ECCV, 2008.
 [29] S. Gould et al., “Decomposing a scene into geometric and semantically consistent regions,” in ICCV. IEEE, 2009.
 [30] D. Munoz et al., “Stacked hierarchical labeling,” in ECCV, 2010.
 [31] J. Tighe and S. Lazebnik, “Superparsing: scalable nonparametric image parsing with superpixels,” in ECCV, 2010.
 [32] A. Ion et al., “Probabilistic joint image segmentation and labeling.” in NIPS, 2011.
 [33] L. Ladickỳ et al., “What, where and how many? combining object detectors and crfs,” in ECCV, 2010.
 [34] S. Fidler et al., “Bottomup segmentation for topdown detection,” in CVPR, 2013.
 [35] G. Heitz et al., “Cascaded classification models: Combining models for holistic scene understanding,” in NIPS, 2008.
 [36] J. Yao et al., “Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation,” in CVPR, 2012.
 [37] A. Levinshtein et al., “Optimal contour closure by superpixel grouping,” in ECCV, 2010.
 [38] I. Endres and D. Hoiem, “Category independent object proposals,” in ECCV, 2010.
 [39] B. Alexe et al., “What is an object?” in CVPR, 2010.
 [40] ——, “Measuring the objectness of image windows,” PAMI, 2012.
 [41] D. Weiss and B. Taskar, “Scalpel: Segmentation cascades with localized priors and efficient learning,” in CVPR, 2013.
 [42] M. Everingham et al., “The pascal visual object classes (voc) challenge,” IJCV, 2010.
 [43] P. Arbeláez et al., “Semantic segmentation using regions and parts,” in CVPR, 2012.
 [44] M. L. Fisher et al., “An analysis of approximations for maximizing submodular set functions—II,” in Polyhedral Combinatorics, 1978.
 [45] J. Carreira and C. Sminchisescu, “Cpmc: Automatic object segmentation using constrained parametric mincuts,” PAMI, 2012.
 [46] S. Gould et al., “Decomposing a scene into geometric and semantically consistent regions,” in ICCV, 2009.
Appendix A
This appendix contains proofs of theoretical results and additional pseudocode descriptions presented in the paper.
A-A Proof of Theorem 1
Given an image with ground truth entities $\mathcal{G} = \{g_1, \dots, g_M\}$, candidate chunk set $\mathcal{C}$, and list size budget $N$, our goal is to select the optimal $N$ chunks out of $\mathcal{C}$ and associate each with a ground truth entity so as to maximize the sum of intersection-over-union scores under the association. This problem can be cast as maximum weighted bipartite graph matching, a classic assignment problem in combinatorial optimization. The edge set of the bipartite graph is the Cartesian product of $\mathcal{C}$ and $\mathcal{G}$, i.e., $E = \mathcal{C} \times \mathcal{G}$. The weight of each edge $(c, g)$ is the IoU score between chunk $c$ and ground truth $g$. The quality metric defined in Section III-A is equal to the optimal assignment score for the subgraph induced by the selected chunks.
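For reference, the exact score of a fixed list can be computed with an off-the-shelf assignment solver; the following sketch uses SciPy's Hungarian-algorithm implementation, with our own zero-padding to stand in for the dummy ground truth instances:

```python
# Evaluate F(L) for a fixed list of chunks as a maximum-weight assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

def list_score(iou):
    """iou: (N, M) matrix for the N chunks in the list vs. M ground truths."""
    n, m = iou.shape
    if n > m:                                   # pad with zero-IoU dummies
        iou = np.hstack([iou, np.zeros((n, n - m))])
    rows, cols = linear_sum_assignment(iou, maximize=True)
    return iou[rows, cols].sum()
```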
Let $L_t^{*}$ be the optimal size-$t$ subset of chunks, which can be computed in cubic time by the Hungarian algorithm. Algorithm 1 can be seen as a greedy approach to maximum bipartite graph matching, which carries a 1/2 approximation guarantee. Further, letting $L_t$ be the greedy match on the graph $G$, one can show that for any augmented graph containing additional dummy ground truth vertices, the list obtained by running Algorithm 1 for $t$ iterations is no worse than the greedy match on $G$. Hence, running Algorithm 1 on $\mathcal{C}$ has a 1/2 approximation guarantee with respect to the optimal size-$t$ subset of $\mathcal{C}$. Together with the fact that the greedy solution has a recursive structure, i.e., a shorter greedy list is a prefix of a longer greedy list under a larger budget, this proves Theorem 1.
A-B Proof of Theorem 2
Consider a superpixel $s$ and ground truth $g$. Let $r(s) = |s \cap g| / |s|$ and $u(s) = |s \cap g| / |s \setminus g|$. Then $u$ is a monotonic transformation of $r$, i.e., $u(s) > u(s')$ if and only if $r(s) > r(s')$; therefore the rankings based on $u$ or $r$ are the same. This follows from the fact that $u(s) = r(s) / (1 - r(s))$.

Using the fact that superpixels are non-overlapping, given any superpixel $s$ and a set of superpixels $c$, we have $\mathrm{IoU}(c \cup \{s\}, g) = \frac{|c \cap g| + |s \cap g|}{|c \cup g| + |s \setminus g|}$. Further, if $u(s) > \mathrm{IoU}(c, g)$, adding $s$ to $c$ increases the IoU, and vice versa, since $\frac{a + x}{b + y} > \frac{a}{b}$ if and only if $\frac{x}{y} > \frac{a}{b}$. Therefore, letting the optimal solution be $c^{*}$: if $u(s) > \mathrm{IoU}(c^{*}, g)$, then it must be true that $s \in c^{*}$, and otherwise $s \notin c^{*}$. This also implies that the optimal set of superpixels consists of the first $k^{*}$ elements under a sorting of superpixels by $r$, where $k^{*}$ is the smallest integer such that adding the next element no longer increases the IoU.
A-C Proof of Theorem 3 and Corollary 1
Let the superpixels be sorted in decreasing order of $r$, and let the optimal set of superpixels maximizing IoU with the ground truth $g$ be the first $k^{*}$ superpixels, i.e., $c^{*} = \{s_1, \dots, s_{k^{*}}\}$. Suppose the regressor makes bounded uniform error $\epsilon$, i.e., $|\hat{r}(s) - r(s)| \le \epsilon$ for all $s$, and let $m$ be the largest number such that $r(s_i) \ge r(s_{k^{*}}) - 2\epsilon$ for all $i \le m$. In the worst case, the regressor underestimates each element of $c^{*}$ by $\epsilon$ and overestimates every other superpixel by $\epsilon$; therefore, some of the elements in $\{s_{k^{*}+1}, \dots, s_m\}$ may rank higher than elements of $c^{*}$. Denote by $\hat{c}$ the best solution among the chain of sets induced by the ranking output of the regressor $\hat{r}$.
Under this worst case, every element of $c^{*}$ still ranks above every superpixel outside the first $m$, so the chain induced by $\hat{r}$ contains a prefix that includes all of $c^{*}$ together with at most $m - k^{*}$ extra superpixels, each satisfying $r(s) \ge r(s_{k^{*}}) - 2\epsilon$; here we are using the fact that $\hat{r}(s) \ge \hat{r}(s')$ implies $r(s) \ge r(s') - 2\epsilon$. Because adding a superpixel $s$ to a chunk $c$ yields $\mathrm{IoU}(c \cup \{s\}, g) \ge \min\left(\mathrm{IoU}(c, g),\, u(s)\right)$, this prefix, and hence $\hat{c}$, has an IoU score that falls below $\mathrm{IoU}(c^{*}, g)$ by an amount controlled by $2\epsilon$. Rearranging the terms yields the multiplicative bound of Theorem 3; a further rearrangement, using $\mathrm{IoU}(c^{*}, g) \le 1$, gives an additive bound.
A more natural assumption is an expected squared error over the distribution of all superpixels. Denote this expected squared error by $\epsilon_{sq}^{2}$ and the number of superpixels in the image by $n$. Applying Jensen's inequality, along with the fact that the square root is concave, the expected uniform error satisfies $\mathbb{E}\left[\max_{i} |\hat{r}(s_i) - r(s_i)|\right] \le \sqrt{n}\,\epsilon_{sq}$. Using Markov's inequality, for any $\delta > 0$, with probability at least $1 - \delta$, the uniform error is at most $\sqrt{n}\,\epsilon_{sq} / \delta$. Together with the additive bound above, this proves Corollary 1: for any $\delta > 0$, with probability at least $1 - \delta$, $\mathrm{IoU}(\hat{c}, g) \ge \left(1 - O\!\left(\epsilon_{sq}\sqrt{n}/\delta\right)\right) \mathrm{IoU}(c^{*}, g)$.
A-D Pseudocode for the Multiple-Grower Algorithm
The pseudocode in Algorithm 3 describes the growing algorithm with a specified seeding superpixel $s_{\text{seed}}$. We run this algorithm for each $s_{\text{seed}} \in \mathcal{D}$ in order to increase diversity, where $\mathcal{D}$ is the set of seeding superpixels. Two major differences from the single-instance chunk growing algorithm of Section III-C are addressed below:

1) A seeding superpixel must be given as input to initialize the chain of growth, i.e., the initialization becomes $c \leftarrow \{s_{\text{seed}}\}$.

2) Instead of using features based only on the superpixel itself, we also use features that consider both the superpixel $s$ and the currently growing chunk $c$: we replace $f(s)$ with $f(s, c)$. These features not only encode the quality of a superpixel but also encourage the grower to grow spatially compact chunks.