Objects as context for detecting their semantic parts
We present a semantic part detection approach that effectively leverages object information. We use the object appearance and its class as indicators of what parts to expect. We also model the expected relative location of parts inside the objects based on their appearance. We achieve this with a new network module, called OffsetNet, that efficiently predicts a variable number of part locations within a given object. Our model incorporates all these cues to detect parts in the context of their objects. This leads to considerably higher performance for the challenging task of part detection compared to using part appearance alone (+5 mAP on the PASCAL-Part dataset). We also compare to other part detection methods on both PASCAL-Part and CUB200-2011 datasets.
1 Introduction
Semantic parts play an important role in visual recognition. They offer many advantages such as lower intra-class variability than whole objects, higher robustness to pose variation, and their configuration provides useful information about the aspect of the object. For these reasons, part-based models have gained attention for tasks such as fine-grained recognition [simon15iccv, liu12eccv, parkhi12cvpr, huang16cvpr, xiao15cvpr, lin15cvpr, zhang14eccv, zhang16cvpr, gavves13iccv, goering14cvpr], object class detection and segmentation [chen14cvpr, azizpour12eccv, wang15iccv], or attribute prediction [gkioxari15iccv, zhang13iccv, vedaldi14cvpr]. Moreover, part localizations deliver a more comprehensive image understanding, enabling reasoning about object-part interactions in semantic terms. Despite their importance, many part-based models detect semantic parts based only on their local appearance [parkhi12cvpr, chen14cvpr, liu12eccv, zhang13iccv, gkioxari15iccv, simon15iccv, xiao15cvpr, huang16cvpr, yang15iccv]. While some works [lin15cvpr, zhang14eccv, zhang16cvpr, azizpour12eccv, vedaldi14cvpr, gavves13iccv, goering14cvpr] leverage other types of information, they use parts mostly as support for other tasks. Part detection is rarely their focus. Here we take part detection one step further and provide a specialized approach that exploits the unique nature of this task.
Parts are highly dependent on the objects that contain them. Hence, objects provide valuable cues to help detect parts, creating an advantage over detecting them independently. First, the class of the object gives a firm indication of what parts should be inside it, i.e. only those belonging to that object class. For example, a dark round patch should be more confidently classified as a wheel if it is on a car, rather than on a dog (fig. 1). Furthermore, by looking at the object appearance we can determine in greater detail which parts might be present. For example, a profile view of a car suggests the presence of a car door, and the absence of the licence plate. This information comes mostly through the viewpoint of the object, but also from other factors, such as the type of object (e.g. van), or whether the object is truncated (e.g. no wheels if the lower half is missing). Second, objects also provide information about the location and shape of the parts they contain. Semantic parts appear in very distinctive locations within objects, especially given the object appearance. Moreover, they appear in characteristic relative sizes and aspect ratios. For example, wheels tend to be near the lower corners of car profile views, often in a square aspect ratio, and appear rather small.
In this work, we propose a dedicated part detection model that leverages all of the above object information. We start from a popular Convolutional Neural Network (CNN) detection model [girshick15iccv], which considers the appearance of local image regions only. We extend this model to incorporate object information that complements part appearance by providing context in terms of object appearance, class and the relative locations of parts within the object.
We evaluate our part detection model on all 16 object classes in the PASCAL-Part dataset [chen14cvpr]. We demonstrate that adding object information is greatly beneficial for the difficult task of part detection, leading to considerable performance improvements. Compared to a baseline detection model that considers only the local appearance of parts, our model achieves a +5 mAP improvement. We also compare to methods that report part localization in terms of bounding-boxes [chen14cvpr, gkioxari15iccv, zhang14eccv, lin15cvpr, zhang16cvpr] on PASCAL-Part and CUB200-2011 [WahCUB_200_2011]. We outperform [chen14cvpr, gkioxari15iccv, zhang14eccv, lin15cvpr] and match the performance of [zhang16cvpr]. We achieve this by an effective combination of the different object cues considered, demonstrating their complementarity. Moreover our approach is general as it works for a wide range of object classes: we demonstrate it on 16 classes, as opposed to 1-7 in [chen14cvpr, parkhi12cvpr, wang15iccv, gkioxari15iccv, zhang14eccv, zhang14cvpr, lin15cvpr, zhang16cvpr, liu14eccv, hariharan15cvpr, sun11iccv_art, liang16eccv, ukita12cvpr, yang16cvpr, xia16eccv, gavves13iccv, goering14cvpr, zhang13iccv, simon15iccv, xiao15cvpr, xia17cvpr] (only animals and person). Finally, we perform fully automatic object and part detection, without using ground-truth object locations at test time [chen14cvpr, wang15iccv, lin15cvpr, zhang16cvpr]. We released code for our method at [gonzalez-garcia18cvpr-code].
2 Related work
DPM-based part-based models.
The Deformable Part Model (DPM) [felzenszwalb10pami] detects objects as collections of parts, which are localized by local part appearance using HOG [dalal05cvpr] templates. Most models based on DPM [felzenszwalb10pami, pandey11iccv, endres13cvpr, Ott11cvpr, nguyen15acomp, sadeghi2014eccv, divvala2012eccv, drayer14eccv, pedersoli11cvpr] consider parts as any image patch that is discriminative for the object class. In our work, instead, we are interested in semantic parts, i.e. object regions interpretable and nameable by humans (e.g. ‘saddle’).
Among DPM-based works, [azizpour12eccv] is especially related as they also simultaneously detect objects and their semantic parts. Architecturally, our work is very different: [azizpour12eccv] builds on DPM [felzenszwalb10pami], whereas our model is based on modern CNNs and offers a tighter integration of part appearance and object context. Moreover, the focus of [azizpour12eccv] is object detection, with part detection being only a by-product. They train their model to maximize object detection performance, and thus they require parts to be located only roughly near their ground-truth box. This results in inaccurate part localization at test time, as confirmed by the low part localization results reported (table 5 of [azizpour12eccv]). Finally, they only localize those semantic parts that are discriminative for the object class. Our model, instead, is trained for precise part localization and detects all object parts.
CNN-based part-based models.
In recent years, CNN-based representations are quickly replacing hand-crafted features [dalal05cvpr, lowe04ijcv] in many domains, including semantic part-based models [bulat16eccv, chen14nips, gkioxari15iccv, hariharan15cvpr, huang16cvpr, liang16eccv, wang15cvpr, wang15iccv, xia16eccv, zhang14eccv, yang15iccv, simon15iccv, xiao15cvpr, modolo17pami]. Our work is related to those that explicitly train CNN models to localize semantic parts using bounding-boxes [gkioxari15iccv, zhang14eccv, lin15cvpr, zhang16cvpr], as opposed to keypoints [huang16cvpr, simon15iccv] or segmentation masks [hariharan15cvpr, liang16eccv, wang15cvpr, wang15iccv, xia16eccv, yang15iccv]. Many of these works [gkioxari15iccv, xiao15cvpr, simon15iccv, huang16cvpr, yang15iccv] detect the parts used in their models based only on local part appearance, independently of their objects. Moreover, they use parts as a means for object, action, or attribute recognition; they are not interested in part detection itself.
Several fine-grained recognition works [gavves13iccv, goering14cvpr, zhang16cvpr] use nearest-neighbors to transfer part location annotations from training objects to test objects. They do not perform object detection, as ground-truth object bounding-boxes are used at both training and test time. Here, instead, at test time we jointly detect objects and their semantic parts.
A few works [zhang14eccv, wang15iccv, xia17cvpr] use object information to refine part detections as a post-processing step. Part-based R-CNN [zhang14eccv] refines R-CNN [girshick14cvpr] part detections by using nearest-neighbors from training samples. Our model integrates object information also within the network, which allows us to deal with several object classes simultaneously, as opposed to only one [zhang14eccv]. Additionally, we refine part detections with a new network module, OffsetNet, which is more efficient than nearest-neighbors. Xia et al. [xia17cvpr] refine person part segmentations using the estimated pose of the person. We propose a more general method that gathers information from multiple complementary object cues. The method of [wang15iccv] is demonstrated only on 5 very similar classes from PASCAL-Part [chen14cvpr] (all quadrupeds), and on fully visible object instances from a manually selected subset of the test set (10% of the full test set). Instead, we show results on 105 parts over all 16 classes of PASCAL-Part, using the entire dataset. Moreover, [wang15iccv] uses manually defined object locations at test time, whereas we detect both objects and their parts fully automatically at test time.
3 Method
We define a new detection model specialized for parts which takes into account the context provided by the objects that contain them. This is the key advantage of our model over traditional part detection approaches, which detect parts based on their local appearance alone, independently of the objects [gkioxari15iccv, simon15iccv, huang16cvpr, yang15iccv]. We build on top of a baseline part detection model (sec. 3.1) and include various cues based on object class (sec. 3.2.2), object appearance (sec. 3.2.3), and the relative location of parts on the object (sec. 3.3). Finally, we combine all these cues to achieve more accurate part detections (sec. 3.4).
Fig. 2 gives an overview of our model. First, we process the input image through a series of convolutional layers. Then, the Region of Interest (RoI) pooling layer produces feature representations from two different kinds of region proposals, one for parts (red) and one for objects (blue). Each part region gets associated with a particular object region that contains it (sec. 3.2.1). Features for part regions are passed on to the part appearance branch, which contains two Fully Connected (FC) layers (sec. 3.1). Features for object regions are sent to both the object class (sec. 3.2.2) and object appearance (sec. 3.2.3) branches, with three and two FC layers, respectively.
For each part proposal, we concatenate the output of the part appearance branch with the outputs of the two object branches for its associated object proposal. We pass this refined part representation (purple) on to a part classification layer and a bounding-box regression layer (sec. 3.4).
Simultaneously, the relative location branch (green) also produces classification scores for each part region based on its relative location within the object (sec. 3.3). We combine the above part classification scores with those produced by relative location (big symbol, sec. 3.4), obtaining the final part classification scores. The model outputs these and regressed bounding-boxes.
3.1 Baseline model: part appearance only
As baseline model we use the popular Fast R-CNN [girshick15iccv], which was originally designed for object detection. It is based on a CNN that scores a set of region proposals [uijlings13ijcv] by processing them through several layers of different types. The first layers are convolutional and they process the whole image once. Then, the RoI pooling layer extracts features for each region proposal, which are later processed by several FC layers. The model ends with two sibling output layers, one for classifying each proposal into a part class, and one for bounding-box regression, which refines the proposal shape to match the extent of the part more precisely. The model is trained using a multi-task loss which combines these two objectives. This baseline corresponds to the part appearance branch in fig. 2.
We follow the usual approach [girshick15iccv] of fine-tuning for the used dataset on the current task, part detection, starting from a network pre-trained for image classification [krizhevsky12nips]. The classification layer of our baseline model has as many outputs as part classes, plus one output for a generic background class. Note how we have a single network for all part classes in the dataset, spanning across all object classes.
3.2 Adding object appearance and class
The baseline model tries to recognize parts based only on the appearance of individual region proposals. In our first extension, we include object appearance and class information by integrating it inside the network. We can see this as selecting an adequate contextual spatial support for the classification of each proposal into a part class.
3.2.1 Supporting proposal selection
Our models use two types of region proposals (sec. 4). Part proposals are candidate regions that might cover parts. Analogously, object proposals are candidates to cover objects. The baseline model uses only part proposals. In our models, instead, each part proposal $p$ is accompanied by a supporting object proposal $o(p)$, which must fulfill two requirements (fig. 3). First, it needs to contain the part proposal, i.e. at least 90% of $p$ must be inside $o(p)$. Second, it should tightly cover the object that contains the part, if any. For example, if the part proposal is on a wheel, the supporting proposal should be on the car that contains that wheel. To achieve this, we select the highest scored proposal among all object proposals containing $p$, where the score is the object classification score for any object class.
Formally, let $p$ be a part proposal and $O(p)$ the set of object proposals that contain $p$. Let $s_k(o)$ be the classification score of proposal $o$ for object class $k$. These scores are obtained by first passing all object proposals through three FC layers as in the object detector [girshick15iccv]. We select the supporting proposal $o(p)$ for $p$ as

$$o(p) = \arg\max_{o \in O(p)} \, \max_{k \in \{1, \ldots, K\}} s_k(o), \quad (1)$$

where $K$ is the total number of object classes in the dataset.
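The supporting-proposal selection described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the box format (x1, y1, x2, y2), the 90% containment test, and all function names are assumptions.

```python
import numpy as np

def contains(o, p, min_frac=0.90):
    """True if at least min_frac of part proposal p's area lies inside
    object proposal o. Boxes are (x1, y1, x2, y2)."""
    ix1, iy1 = max(o[0], p[0]), max(o[1], p[1])
    ix2, iy2 = min(o[2], p[2]), min(o[3], p[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_p = (p[2] - p[0]) * (p[3] - p[1])
    return inter >= min_frac * area_p

def supporting_proposal(part, obj_proposals, obj_scores):
    """Among object proposals containing the part proposal, pick the one whose
    best classification score over all object classes is highest.
    obj_scores: (num_proposals, num_classes) array of per-class scores."""
    candidates = [i for i, o in enumerate(obj_proposals) if contains(o, part)]
    if not candidates:
        return None
    return max(candidates, key=lambda i: obj_scores[i].max())
```

Note that the inner max over classes makes the selection class-agnostic: the supporting proposal is simply the most object-like container, whatever its class.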
3.2.2 Object class
The class of the object provides cues about what part classes might be inside it. For example, a part proposal on a dark round patch cannot be confidently classified as a wheel based solely on its appearance (fig. 1). If the corresponding supporting object proposal is a car, the evidence towards it being a wheel grows considerably. On the other hand, if the supporting proposal is a dog, the patch should be confidently classified as not a wheel.
Concretely, we process convolutional features pooled from the supporting object proposal through three FC layers (fig. 2). The third layer performs object classification and outputs scores for each object class, including a generic background class. These scores can be seen as object semantic features, which complement part appearance.
3.2.3 Object appearance
The appearance of the object might bring even more detailed information about what part classes it might contain. For example, the side view of a car indicates that we can expect to find wheels, but not a licence plate. We model object appearance by processing the convolutional features of the supporting proposal through two FC layers (fig. 2). Features of this type have been shown to successfully capture the appearance of objects [donahue14icml, ge15cvpr].
3.3 Adding relative location
We now add another type of information that could be highly beneficial: the relative location of the part with respect to the object. Parts appear in very distinct and characteristic relative locations and sizes within the objects. Fig. 4a shows examples of prior relative location distributions for some part classes as heatmaps. These are produced by accumulating all part ground-truth bounding-boxes from the training set, in the normalized coordinate frame of the bounding-box of their object. Moreover, this part location distribution can be sharper if we condition it on the object appearance, especially its viewpoint. For example, the car-wheel distribution on profile views of cars will only have two modes (fig. 4b) instead of the three shown in fig. 4a.
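The prior heatmaps described above come from accumulating part ground-truth boxes in the normalized coordinate frame of their object box. A minimal sketch of that accumulation, assuming (x1, y1, x2, y2) boxes and an illustrative bin count:

```python
import numpy as np

def normalize_part_box(part, obj):
    """Express a part bounding-box in the normalized [0, 1] coordinate frame
    of its object bounding-box. Boxes are (x1, y1, x2, y2)."""
    ox1, oy1, ox2, oy2 = obj
    w, h = ox2 - ox1, oy2 - oy1
    return ((part[0] - ox1) / w, (part[1] - oy1) / h,
            (part[2] - ox1) / w, (part[3] - oy1) / h)

def relative_location_heatmap(parts, objects, bins=20):
    """Accumulate normalized part boxes into a bins x bins prior heatmap,
    one (part, object) ground-truth pair at a time."""
    heat = np.zeros((bins, bins))
    for part, obj in zip(parts, objects):
        nx1, ny1, nx2, ny2 = normalize_part_box(part, obj)
        x1 = int(np.clip(nx1 * bins, 0, bins - 1))
        y1 = int(np.clip(ny1 * bins, 0, bins - 1))
        x2 = int(np.clip(nx2 * bins, 0, bins - 1))
        y2 = int(np.clip(ny2 * bins, 0, bins - 1))
        heat[y1:y2 + 1, x1:x2 + 1] += 1.0
    return heat / max(len(parts), 1)
```

Multimodal priors such as the three wheel clusters of fig. 4a emerge directly from this accumulation.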
Our relative location model is specific to each part class within each object class (e.g. a model for car-wheel, another model for cat-tail). Below we explain the model for one particular object and part class. Given an object proposal $o$ of that object class, our model suggests windows for the likely position of each part inside the object. Naturally, these windows will also depend on the appearance of $o$. For example, given a car profile view, our model suggests square windows on the lower corners as likely to contain wheels (fig. 4b top). Instead, an oblique view of a car will also suggest a wheel towards the lower central region, as well as a more elongated aspect ratio for the wheels on the side (fig. 4b bottom). We generate the suggested windows using a special kind of CNN, which we dub OffsetNet (see fig. 2, Relative location branch). Finally, we score each part proposal according to its overlap with the suggested windows. This indicates the probability that a proposal belongs to a certain part class, based on its relative location within the object, and on the object appearance (but it does not depend on part appearance).
OffsetNet directly learns to regress from the appearance of an object proposal $o$ to the relative location of a part class within it. Concretely, it learns to produce a 4D offset vector $\delta$ that points to the part inside $o$. In fact, OffsetNet produces a set of vectors $\Delta = \{\delta_m\}$, as some objects have multiple instances of the same part inside (e.g. cars with multiple wheels). Intuitively, a CNN is a good framework to learn this regressor, as the activation maps of the network contain localized information about the parts of the object [simon14accv, zeiler14eccv, gonzalez-garcia17ijcv].
OffsetNet generates each offset vector $\delta_m$ in $\Delta$ through a regression layer. To enable OffsetNet to output multiple vectors we build multiple parallel regression layers. We set the number of parallel layers to the number of modes of the prior distribution for each part class (fig. 4). For example, the prior car-wheel has three modes, leading to three offset regression layers in OffsetNet (fig. 2). On the other hand, OffsetNet only has one regression layer for person-head, as its prior distribution is unimodal.
In some cases, however, not all modes are active for a particular object instance (e.g. profile views of cars only have two active modes out of the three, fig. 4b). For this reason, each regression layer in OffsetNet has a sibling layer that predicts the presence of that mode in the input object proposal $o$, and outputs a presence score $\rho_m$. This way, even if the network outputs multiple offset vectors, only those with a high presence score will be taken into account. This construction effectively enables OffsetNet to produce a variable number of output offset vectors, depending on the input $o$.
We train one OffsetNet for all part classes simultaneously by arranging $N$ parallel regression layers, where $N$ is the maximum number of modes over all part classes (4 for PASCAL-Part). If a part class has fewer than $N$ modes, we simply ignore its regression output units in the extra layers. We train the offset regression layers using a smooth-L1 loss, on the training samples described in sec. 4 (analog to the bounding-box regression of Fast R-CNN [girshick15iccv]). We train the presence score layer using a logistic log loss: $L(\rho, y) = \log(1 + e^{-y\rho})$, where $\rho$ is the score produced by the network, and $y$ is a binary label indicating whether the current mode is present ($y = 1$) or not ($y = -1$). We generate $y$ using annotated ground-truth bounding-boxes (sec. 4). This loss implicitly normalizes the score $\rho$ using the sigmoid function. After training, we add a sigmoid layer to explicitly normalize the output presence score: $\bar{\rho} = \sigma(\rho) = 1 / (1 + e^{-\rho})$.
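The presence-score loss and its sigmoid normalization can be written compactly. A minimal sketch, assuming labels y in {-1, +1} as in the logistic log loss described above:

```python
import math

def presence_log_loss(rho, y):
    """Logistic log loss for a mode-presence score rho with label y in
    {-1, +1}: L = log(1 + exp(-y * rho)). Low loss when the sign of rho
    matches y and its magnitude is large."""
    return math.log(1.0 + math.exp(-y * rho))

def sigmoid(rho):
    """Normalized presence score in (0, 1), applied on top of the raw score
    after training."""
    return 1.0 / (1.0 + math.exp(-rho))
```

A strongly positive score for a present mode gives near-zero loss, while a score of zero gives the maximal-uncertainty loss log 2, which matches a normalized presence score of 0.5.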
Generating suggested windows.
At test time, given an input object proposal $o$, OffsetNet generates $M$ pairs $(\delta_m, \bar{\rho}_m)$ of offset vectors and presence scores for each part class, where $M$ is the number of modes in the prior distribution. We apply the offset vectors to $o$, producing a set of suggested windows $W = \{w_m\}$.
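Applying a 4D offset vector to an object box can be sketched as follows. The parametrization here, (dx, dy, dw, dh) with a center shift scaled by the box size and a log-scale size change, is the Fast R-CNN box-regression convention; treating OffsetNet's offsets this way is an assumption for illustration.

```python
import math

def apply_offset(obj, delta):
    """Apply a 4D offset (dx, dy, dw, dh) to an object box (x1, y1, x2, y2).
    The center moves by (dx * w, dy * h) and the size scales by exp(dw),
    exp(dh), as in Fast R-CNN box regression (an assumption here)."""
    x1, y1, x2, y2 = obj
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = delta
    ncx, ncy = cx + dx * w, cy + dy * h
    nw, nh = w * math.exp(dw), h * math.exp(dh)
    return (ncx - 0.5 * nw, ncy - 0.5 * nh, ncx + 0.5 * nw, ncy + 0.5 * nh)
```

A zero offset returns the object box unchanged; a negative dw/dh shrinks the window, as needed for small parts like wheels.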
Scoring part proposals.
At test time, we score all part proposals of an image by generating windows with OffsetNet for all the detected objects in the image. Let $D$ be the set of object detections in the image, i.e. object proposals with high score after non-maxima suppression [felzenszwalb10pami]. We produce these automatically using standard Fast R-CNN [girshick15iccv]. Let $s_d$ be the score of detection $d$ for the considered object class. We compute the relative location score $s_{rl}(p)$ for part proposal $p$ using its overlap with all windows suggested by OffsetNet:

$$s_{rl}(p) = \max_{d \in D} \, \max_{w_m \in W_d} s_d \cdot \bar{\rho}_m \cdot \mathrm{IoU}(p, w_m), \quad (2)$$

where we use Intersection-over-Union (IoU) to measure overlap. Here, $W_d$ is the set of suggested windows output by OffsetNet for object detection $d$, and $\bar{\rho}_m$ is the associated presence score for each individual window $w_m$. Suggested windows with a higher presence score have a higher weight in the overall relative location score $s_{rl}(p)$. The score $s_d$ of the object detection is also used to weight all suggested windows based on it. Consequently, object detections with higher score provide stronger cues through higher relative location scores $s_{rl}(p)$. Fig. 5 depicts this scoring procedure.
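This scoring can be sketched directly: each suggested window's IoU with the part proposal is weighted by the window's presence score and by its object detection's score. Taking the max over detections and windows is one plausible reading of the combination rule; the function names and box format are illustrative.

```python
def iou(a, b):
    """Intersection-over-Union of boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def rel_loc_score(part, detections):
    """Relative-location score of a part proposal: overlap with suggested
    windows, weighted by each window's presence score and its detection's
    object score. detections: list of (obj_score, [(window, presence), ...]).
    Taking the max over windows and detections is an assumption here."""
    best = 0.0
    for obj_score, windows in detections:
        for window, presence in windows:
            best = max(best, obj_score * presence * iou(part, window))
    return best
```

A part proposal far from every suggested window scores zero, regardless of how confident the object detections are.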
Fig. 6 shows examples of windows suggested by OffsetNet, along with their presence score and a heatmap generated by scoring part proposals using eq. (2). We can see how the suggested windows cover very likely areas for part instances on the input objects, and how the presence scores are crucial to decide which windows should be relied on.
3.4 Cue combination
We have presented multiple cues that can help part detection. These cues are complementary, so our model needs to effectively combine them.
We concatenate the output of the part appearance, object class and object appearance branches and pass them on to a part classification layer that combines them and produces initial part classification scores (purple in fig. 2). Therefore, we effectively integrate object context into the network, resulting in the automatic learning of object-aware part representations. We argue that this type of context integration has greater potential than just a post-processing step [wang15iccv, zhang14eccv].
The relative location branch, however, is special as its features have a different nature and much lower dimensionality (4 vs 4096). To facilitate learning, instead of directly concatenating them, this branch operates independently of the others and computes its own part scores. Therefore, we linearly combine the initial part classification scores with those delivered by the relative location branch (big symbol in fig. 2). For some part classes, the relative location might not be very indicative due to high variance in the training samples (e.g. cat-nose). In some other cases, relative location can be a great cue (e.g. the position of cow-torso is very stable across all its instances). For this reason, we learn a separate linear combination for each part class. We do this by maximizing part detection performance on the training set, using grid search on the mixing weight in the [0, 1] range. We define the measure of performance in sec. 5.
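The per-class grid search over the mixing weight can be sketched as follows. The paper maximizes part detection performance on the training set; here a simple score-margin proxy stands in for that objective, and all names and the step count are illustrative assumptions.

```python
import numpy as np

def combine_scores(app_scores, loc_scores, alpha):
    """Per-class linear combination of appearance-based part classification
    scores and relative-location scores, with mixing weight alpha in [0, 1]."""
    return (1.0 - alpha) * app_scores + alpha * loc_scores

def best_mixing_weight(app_scores, loc_scores, labels, steps=11):
    """Grid-search alpha in [0, 1] for one part class. The objective here is
    the mean score margin between positive and negative samples, a stand-in
    for the detection performance the paper optimizes."""
    def proxy(scores):
        return scores[labels == 1].mean() - scores[labels == 0].mean()
    alphas = np.linspace(0.0, 1.0, steps)
    return max(alphas, key=lambda a: proxy(combine_scores(app_scores, loc_scores, a)))
```

When appearance is uninformative for a class but relative location separates positives from negatives, the search drives alpha toward 1, and vice versa; this is exactly why a per-class weight helps (cat-nose vs cow-torso).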
4 Implementation details
Object proposals [alexe12pami, dollar14eccv, uijlings13ijcv] are designed to cover whole objects, and sometimes fail to find small parts. To alleviate this issue, we changed the standard settings of Selective Search [uijlings13ijcv], decreasing the minimum box size to 10. This results in adequate proposals even for parts: reaching 71.4% recall with 3000 proposals (IoU ≥ 0.5). For objects, we keep the standard settings (minimum box size 20), resulting in 2000 proposals.
Training the part detection network.
Our networks are pre-trained on ILSVRC12 [ilsvrc12] for image classification and fine-tuned on PASCAL-Part [chen14cvpr] for part detection, or on PASCAL VOC 2010 [everingham10ijcv] for object detection, using MatConvNet [vedaldi15mm]. Fine-tuning for object detection follows the Fast R-CNN procedure [girshick15iccv]. For part detection fine-tuning we changed the following settings. Positive samples for parts are those that overlap any part ground-truth above an IoU threshold, whereas negative samples overlap less. We train for 12 epochs with learning rate 10^{-3}, and then for 4 epochs with 10^{-4}.
We jointly train the part appearance, object appearance, and object class branches for a multi-task part detection loss. We modify the RoI pooling layer to pool convolutional features from both the part proposal and the supporting object proposal. Backpropagation through this layer poses a problem, as (1) is not differentiable. We address this by backpropagating the gradients only through the area of the convolutional map covered by the object proposal selected by the argmax in (1). We obtain the object scores used in (1) from the object class branch, which is previously initialized using the standard Fast R-CNN object detection loss, in order to provide reliable object scores when joint training starts.
We need object samples and part samples to train OffsetNet. Our object samples are all object ground-truth bounding-boxes and object proposals with IoU ≥ 0.7 in the training set. Our part samples are only part ground-truth bounding-boxes. We split the horizontal axis in $M$ regions, where $M$ is the number of modes in the part class prior relative location distribution. We assign each part ground-truth bounding-box in the object to the closest mode. If a mode has more than one part bounding-box assigned, we pick one at random. In case a mode has no instance assigned (e.g. occluded wheel) for a particular training sample, the loss function omits the contribution of that mode. All layers except the top ones are initialized with a Fast R-CNN network trained for object detection. Similarly to the other networks, we train it for 16 epochs, but with learning rates 10^{-3} and 10^{-4}.
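The mode-assignment step above, splitting the object's horizontal extent into equal regions and assigning each ground-truth part box to the closest one, can be sketched as follows (box format and names are illustrative assumptions):

```python
def assign_modes(part_boxes, obj, num_modes):
    """Assign each ground-truth part box to one of num_modes modes, obtained
    by splitting the object's horizontal extent into equal regions and
    picking the region containing the part's center. Boxes are
    (x1, y1, x2, y2); returns {mode_index: part_box}."""
    ox1, ox2 = obj[0], obj[2]
    region_w = (ox2 - ox1) / num_modes
    assignment = {}
    for box in part_boxes:
        cx = 0.5 * (box[0] + box[2])
        mode = min(int((cx - ox1) / region_w), num_modes - 1)
        # If several boxes fall in the same mode, keep one
        # (the paper picks one at random).
        assignment.setdefault(mode, box)
    return assignment
```

Modes absent from the returned dictionary (e.g. an occluded wheel) are the ones whose loss contribution is omitted for that training sample.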
5 Results and conclusions
5.1 Validation of our model
We present results on PASCAL-Part [chen14cvpr], which has pixel-wise part annotations for the images of PASCAL VOC 2010 [everingham10ijcv]. For our experiments we fit a bounding-box to each part segmentation mask. We pre-process the set of part classes as follows. We discard additional information on semantic part annotations, such as ‘front’ or ‘left’ (e.g. both “car wheel front left” and “car wheel back right” become car-wheel). We merge continuous subdivisions of the same semantic part (“horse lower leg” and “horse upper leg” become horse-leg). Finally, we discard tiny parts, whose average width and height over the training set are below 15 pixels (e.g. “bird eye”), and rare parts that appear too few times in the training set (e.g. “bicycle headlight”). After this pre-processing, we obtain a total of 105 part classes for 16 object classes. We train our methods on the train set and test them on the val set (the test set is not annotated in PASCAL-Part). We note that we are the first work to present fully automatic part detection results on the whole PASCAL-Part dataset.
Just before measuring performance we remove duplicate detections using non-maxima suppression [felzenszwalb10pami]. We measure part detection performance using Average Precision (AP), following the PASCAL VOC protocol [everingham10ijcv]. We consider a part detection to be correct if its IoU with a ground-truth part bounding-box is ≥ 0.5.
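The duplicate-removal step is standard greedy non-maxima suppression. A minimal sketch (the IoU threshold and function names are illustrative, not the paper's settings):

```python
def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maxima suppression: repeatedly keep the highest-scored
    detection and drop the remaining ones overlapping it above iou_thresh.
    Returns the indices of kept detections, in descending score order."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thresh]
    return keep
```

The surviving detections are then matched to ground-truth part boxes at IoU ≥ 0.5 to compute AP.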
| Model | Obj. App. | Obj. Cls. | Rel. Loc. | mAP |
|---|---|---|---|---|
| AlexNet Baseline [girshick15iccv] (part appearance only) | | | | 22.1 |
| Obj. app + cls | ✓ | ✓ | | 25.7 |