Semantic Part Detection via Matching:
Learning to Generalize to Novel Viewpoints from Limited Training Data

Yutong Bai1, Qing Liu2*, Lingxi Xie2, Yan Zheng3, Weichao Qiu2, Alan Yuille2
1Northwestern Polytechnical University  2Johns Hopkins University  3Beihang University
{198808xc, yan.zheng.mat, qiuwch, alan.l.yuille}
*These authors contributed equally to this work.

Detecting semantic parts of an object is a challenging task in computer vision, particularly because it is hard to construct large annotated datasets due to the difficulty of annotating semantic parts. In this paper we present an approach which learns from a small training dataset of annotated semantic parts, where the object is seen from a limited range of viewpoints, but generalizes to detect semantic parts from a much larger range of viewpoints. Our approach is based on a matching algorithm for finding accurate spatial correspondence between two images, which enables semantic parts annotated on one image to be transplanted to another. In particular, this enables images in the training dataset to be matched to a virtual 3D model of the object (for simplicity, we assume that the object viewpoint can be estimated by standard techniques). Then a clustering algorithm is used to annotate the semantic parts of the 3D virtual model. This virtual 3D model can be used to synthesize annotated images from a large range of viewpoints. These can be matched to images in the test set, using the same matching algorithm, to detect semantic parts in novel viewpoints of the object. Our algorithm is very simple, intuitive, and contains very few parameters. We evaluate our approach on the car subclass of the VehicleSemanticPart dataset. We show that it outperforms standard deep network approaches and, in particular, performs much better on novel viewpoints.

1 Introduction

Object detection has been a long-standing challenge in computer vision which has attracted a lot of research attention [8][9]. Recently, with the development of deep networks, this research area has been dominated by a strategy which starts by extracting a number of regional proposals and then determines if each of them belongs to a specific set of object classes [11][42][3][30][41]. The success of these approaches [8][28] motivates researchers to address the more challenging task of understanding objects at a finer level and, in particular, parsing them into semantic parts, which we define to be those components of an object that have semantic meaning and can be verbally described [57]. A particular challenge is that annotating semantic parts is very time-consuming and much more difficult than annotating objects, which makes it harder to directly apply deep networks to this task.

Figure 1: The flowchart of our approach (best viewed in color). The key module is the matching algorithm which finds spatial correspondence between images with similar viewpoints. This enables us to match the training data to a virtual 3D model and thereby annotate it. The virtual 3D model can then be used to detect the semantic parts on images in the test set, by re-using the matching algorithm.

In this paper, we address the problem of semantic part detection when there is a small amount of training data and when the object is seen from a limited range of viewpoints. Our strategy is to define a matching algorithm which finds correspondences between images of the same object seen from roughly the same viewpoint. This can be used to match the training images to the rendered images of a virtual 3D model, enabling us to annotate the semantic parts of the 3D model. The same matching algorithm can then be used to find correspondences between the virtual 3D model and the test images, which enables us to detect the semantic parts of the test images, even though they may correspond to viewpoints not seen in the training set. The overall framework is illustrated in Figure 1.

Mathematically, we have a training set , in which each image has a viewpoint and a semantic part set . Given a testing image and the corresponding viewpoint (provided by ground-truth or some state-of-the-art approaches such as [46]), the goal is to predict a set of matches for each training sample, i.e., , so that is transplanted to according to and finally all these transplanted sets are summarized into the final prediction . In this pipeline, the key component is the matching algorithm . For simplicity, we assume that only deals with two images with similar viewpoints (which could be found by an off-the-shelf viewpoint detector). In order to combine information from the training images with different viewpoints, we introduce a viewpoint transfer function which converts each training image with viewpoint into with viewpoint , following which is computed. In practice, this is implemented by introducing a virtual model (e.g., represented by point clouds) with 3D semantic part annotations assigned to it. A graphics algorithm is used for rendering . is a hidden variable, so we apply a clustering-based algorithm to estimate it in the training stage. An additional benefit of this strategy is that the costly matching algorithm is executed only once in the testing stage.

The major contribution of this work is to provide a simple and intuitive algorithm for semantic part detection which works using little training data and can generalize to novel viewpoints. It is an illustration of how virtual data can be used to reduce the need for time-consuming semantic part annotation. Experiments are performed on the VehicleSemanticPart (VSP) dataset [51], which is currently the largest corpus for semantic part detection. In the car subclass, our approach achieves significantly better performance than standard end-to-end methods such as Faster R-CNN [42] and DeepVoting [57]. The advantages become even bigger when the amount of training data is limited.

The remainder of this paper is organized as follows. Section 2 briefly reviews the prior literature, and Section 3 presents our framework. After experiments are shown in Section 4, we conclude this work in Section 5.

2 Related Work

In the past years, deep learning [26] has advanced the research and applications of computer vision to a higher level. With the availability of large-scale image datasets [5] as well as powerful computational devices, researchers designed very deep neural networks [25][43][45] to accomplish complicated vision tasks. The fundamental idea of deep learning is to organize neurons (the basic units that perform specified mathematical functions) in a hierarchical manner, and tune the parameters by fitting a dataset. Aided by learning algorithms that alleviate numerical stability issues [34][44][22], researchers developed deep learning in two major directions, namely, increasing the depth of the network towards higher recognition accuracy [16][20][19], and transferring pre-trained models to various tasks, including feature extraction [6][40], object detection [11][10][42], semantic segmentation [31][2], pose estimation [35], boundary detection [54], etc.

For object detection, the most popular pipeline, in the context of deep learning, involves first extracting a number of bounding boxes named regional proposals [1][49][42], and then determining if each of them belongs to the target class [11][10][42][3][30][41]. To improve spatial accuracy, the techniques of bounding-box regression [23] and non-maximum suppression [17] are widely used for post-processing. Boosted by high-quality visual features and end-to-end optimization, this framework significantly outperforms the conventional deformable part-based model [9], which was trained on top of handcrafted features [4]. Despite its success, this framework still suffers from weak explainability, as both the object proposal extraction and classification modules are black boxes, and it is thus easily confused by occlusion [57] and adversarial attacks [53]. There have also been research efforts on using mid-level or high-level contextual cues to detect objects [51] or semantic parts [57]. These methods, while limited to rigid objects such as vehicles, often benefit from better transferability and work reasonably well on partially occluded data [57].

Another way of visual recognition is to find correspondence between features or images, so that annotations from one (training) image can be transplanted to another (testing) image [15][24][36][47][48]. This topic was noticed in the early age of vision [37] and later built upon handcrafted features [33][18][55]. There were efforts in introducing semantic information into matching [29], and also improving the robustness against noise [32]. Recently, deep learning has brought a significant boost to these problems by improving both features [40][58] and matching algorithms [7][14][59][21], while a critical part of these frameworks still lies in end-to-end optimizing deep networks.

Training a vision system requires a large amount of data. To alleviate this issue, researchers sought help from the virtual world, mainly because annotating virtual data is often easy and cheap [39]. Another solution is to perform unsupervised or weakly-supervised training using consistency cues that naturally exist [59][13][60]. This paper investigates both of these possibilities.

3 Our Approach

3.1 Problem: Semantic Part Detection

Figure 2: Two examples of annotated semantic parts in the class car. For better visualization, we only show part of the annotations.

The goal of this work is to detect semantic parts on an image. We use to denote the image-independent set of semantic parts, each element in which indicates a verbally describable component of an object [51]. For example, there are tens of semantic parts in the class car, including wheel, headlight, license plate, etc. Two car examples with semantic parts annotated are shown in Figure 2.

Let the training set contain images, and each image, , has a width of and a height of . A set of semantic parts are annotated for each image, and each semantic part appears as a bounding box and a class label , where is the index.

3.2 Framework: Detection by Matching

We desire a function which receives an image and outputs the corresponding semantic part set . In the context of deep learning, researchers designed end-to-end models [42] which start with an image, pass the signal through a series of layers, and output the prediction in an encoded form. With ground-truth annotation , a loss function is computed between and , and the gradient of with respect to is computed in order to update accordingly. Despite their success, these approaches often require a large number of annotations to avoid over-fitting, yet their ability to explain their predictions is relatively weak. DeepVoting [57] went a step further by simplifying the high-level inference layers as well as reducing the number of parameters in them, so that the prediction can be partly explained by a feature voting process. However, as we shall see in experiments, it still suffers a significant accuracy drop when training data are scarce.

This paper investigates the problem of semantic part detection from another perspective. Instead of directly optimizing , we adopt an indirect approach which assumes that semantic part detection can be achieved by finding semantic correspondence between two (or more) images. That is to say, if a training image is annotated with a set of semantic parts, and we know that is densely matched to a testing image , then we can transplant to by projecting each element of onto the corresponding element of via a spatially-constrained mapping function.

This approach suffers from a major drawback: for every testing image , there needs to be at least one training image with a very similar viewpoint. The argument is twofold. First, the semantic parts that appear in an image vary among different viewpoints, so we should not expect the annotation to transfer between two images with a large viewpoint difference. Second, for simplicity, we aim at using a relatively simple algorithm, e.g., the max-clique algorithm, for image matching. However, in scenarios with few (e.g., tens of) training images, it is not realistic to expect the existence of such training samples.

Our solution, as well as the main contribution of this work, is to introduce auxiliary cues by borrowing information from those training images with clearly different viewpoints. These auxiliary cues are named viewpoint consistency, which suggests that when an object and its semantic parts are observed from one viewpoint, they should be predictable under another viewpoint, e.g., the viewpoint of the testing image. Therefore, our idea is to transfer each training image to the same viewpoint as the testing image with semantic parts preserved, so that we can make predictions and transplant semantic parts with a lightweight feature matching algorithm. In what follows, we formulate this idea mathematically and solve it with an efficient algorithm.

3.3 The Matching Algorithm

We start with defining a generation function which takes a source (training) image as well as its semantic part annotation , refers to the source and target viewpoints and , and outputs a transferred image and the corresponding semantic parts:


Throughout the remaining part, a prime in superscript indicates elements in a generated image. Next, by assuming that and have the same viewpoint (see footnote 1), we build a regional feature matching between them, denoted by . We assume that both images are represented by a set of regional features, each of which describes the appearance of a patch. In the context of deep learning, this is often achieved by extracting mid-level neural responses from a pre-trained deep network [51][58]. Although it is possible to fine-tune the network with an alternative head for object detection [57], we do not take this option, in order to preserve the model's explainability (see footnote 2). Let an image have a set consisting of regional features, the -th of which is denoted by . We assume that all these feature vectors have a fixed length, e.g., all of them are -dimensional vectors corresponding to specified positions at the pool-4 layer of VGGNet [43][51]. Each is also associated with a 2D coordinate .

Footnote 1: We expect and to have exactly the same viewpoint, though in practice has the true viewpoint of while is generated by the estimated , so the actual viewpoints of these two images can be slightly different; e.g., [46] reported a median viewpoint prediction error of less than on the car subclass.

Footnote 2: We follow the argument that low-level features, e.g., edges and basic geometric shapes, can be automatically learned by deep networks on a large dataset, but in order to achieve explainability, the high-level inference part should be assigned clear meanings, e.g., visual word counting or feature voting.

Then, takes the form of:


which implies that a feature vector with a coordinate on is matched to a feature at with a coordinate on . and are the matched feature indices on and , respectively. We use both unary and binary relationship terms to evaluate the quality of in terms of appearance and spatial consistency. The penalty function of , , is defined as:


where denotes the oriented distance between two features, i.e., and . Thus, the first term on the right-hand side measures the similarity in appearance of the matched patches, and the second term measures the spatial consistency of the connection between all patch pairs.
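For concreteness, the penalty above can be sketched as follows. This is a minimal NumPy version in which `lam`, the weight balancing the unary (appearance) and binary (spatial-consistency) terms, is an assumed hyper-parameter since the paper's exact weighting is not reproduced here; features are taken to be L2-normalized.

```python
import numpy as np

def matching_penalty(feats_a, coords_a, feats_b, coords_b, matches, lam=1.0):
    """Penalty of a candidate matching between two regional feature sets.

    matches: list of (i, j) pairs, feature i on image A matched to j on B.
    The unary term measures appearance dissimilarity of matched features;
    the binary term measures how well oriented distances between every
    pair of matched features are preserved across the two images.
    """
    # Unary: appearance dissimilarity (features assumed L2-normalized).
    unary = sum(1.0 - float(feats_a[i] @ feats_b[j]) for i, j in matches)

    # Binary: spatial consistency over all pairs of matched features.
    binary = 0.0
    for (i1, j1) in matches:
        for (i2, j2) in matches:
            if i1 == i2:
                continue
            d_a = coords_a[i2] - coords_a[i1]  # oriented distance on A
            d_b = coords_b[j2] - coords_b[j1]  # oriented distance on B
            binary += float(np.linalg.norm(d_a - d_b))
    return unary + lam * binary
```

A perfectly consistent matching (identical features and coordinates) incurs zero penalty; permuting matches raises the unary term.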

Based on , we can compute a coordinate transformation function mapping the bounding box of each semantic part of to the corresponding region on :


where is the index of a semantic part.

3.4 Optimization with Multiple Viewpoints

After the annotations of all training images are collected and transplanted to , we apply the final term named cross-viewpoint consistency to confirm that these annotations align with each other. For simplicity, we denote as the -th semantic part transferred to , regardless of its source image index . For all pairs with the same semantic part index, i.e., , we compute the intersection-over-union (IOU) between the two bounding boxes and penalize those pairs with similar but clearly different positions:


where we set , i.e., two annotations are considered to be different instances if their IOU is smaller than .
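A minimal sketch of this consistency check follows. The exact penalty form and the IoU threshold are elided in the text above, so the linear penalty and `thresh=0.5` below are assumptions; the idea is only to penalize same-part box pairs that overlap yet sit at clearly different positions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def consistency_penalty(boxes, labels, thresh=0.5):
    """Cross-viewpoint consistency: penalize pairs of boxes with the
    same semantic part label whose IoU is positive but below `thresh`
    (i.e., overlapping yet clearly offset). Disjoint pairs are treated
    as different instances and not penalized."""
    pen = 0.0
    for p in range(len(boxes)):
        for q in range(p + 1, len(boxes)):
            if labels[p] != labels[q]:
                continue
            o = iou(boxes[p], boxes[q])
            if 0.0 < o < thresh:
                pen += thresh - o  # assumed linear penalty form
    return pen
```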

The final solution comes from minimizing the sum of the matching penalty defined in Eqn (3) and the consistency penalty defined in Eqn (5):


In this overall loss function, three modules can be optimized, namely, the generator , the matching function and the coordinate transformation function .

Directly optimizing Eqn (6) is computationally intractable for two reasons. First, it is not a convex function. Second, Eqn (6) involves matching to all training images, which makes it time-consuming in practice. This subsection provides a practical solution, which addresses the first issue with iteration and the second with a 3D model, while preserving the possibility of designing other algorithms to optimize this generalized objective function.

The major motivation is to accelerate computation. We use a simplified version of Eqn (1) which assumes that all images and annotations are generated by the same model and are thus identical to each other:


In practice, is a 3D virtual model (e.g., a point cloud) and indicates the vertices in that correspond to semantic parts. Provided a viewpoint , is a rendering algorithm (e.g., UnrealCV [39]) that generates a 2D image from with projected onto the image as . Thus, we relate each training data to the testing data by:


and then compute the relationship between and , i.e., , using Eqns (3) and (5) accordingly.

Finally, we assume that and do not change with testing data. Under this assumption, Eqn (6) is partitioned into two individual parts, namely, a training stage which optimizes and , and a testing stage which infers , i.e., and . Though the first part is often time-consuming, it is executed only once and does not slow down testing.

3.4.1 Training: Optimizing the Hidden Variables

It is difficult to construct directly by optimization, so we pre-define a model set , and each is an instance in it. We purchase high-resolution models from the Unreal Engine Marketplace, and enumerate through the set to achieve the best matching to each individually. We denote as the index of the model that corresponds to. Given a specified , we first render it into , and then extract the feature sets and with and elements, respectively. In practice, this is achieved by rescaling each image so that the short axis of the object is -pixel long [51], feeding it into a pre-trained -layer VGGNet [43], and extracting all -dimensional feature vectors at the pool-4 layer. All feature vectors are -normalized.

Now, Eqn (6), in the training stage, is simplified as:


where is the matching between and , and and are the corresponding losses to Eqns (3) and (5). Note that both of these terms depend on hidden variables , i.e., computes the loss from to , and only sums up the data that are assigned to . We use an iterative algorithm to optimize them. The starting point is that all are randomly sampled from .

In the first step of each iteration, we fix all as well as and minimize for each , which is equivalent to finding the optimal . Note that each is equipped with an individual . So, we collect semantic parts from all images that are assigned to and optimize Eqn (5). As an approximation, each collected is considered a candidate if at least boxes, including itself, have the same semantic part index as well as an IOU of at least with it, where is a hyper-parameter and is the number of training images assigned to . The average of all boxes overlapping a candidate forms a semantic part of . This algorithm is able to filter out false positives (e.g., those generated by incorrect matching) because such samples are mostly isolated. Moreover, the true positives are averaged towards more accurate localization.
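This aggregation step can be sketched as below. Since the paper's hyper-parameter symbols are elided, `tau` and `iou_thresh` are assumed stand-ins, and the greedy de-duplication of supporting boxes is an assumed detail of this sketch.

```python
import numpy as np

def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def aggregate_parts(boxes, labels, n_images, tau=0.5, iou_thresh=0.5):
    """Aggregate semantic parts transplanted from many training images
    onto one rendered view of the 3D model.

    A box is a candidate if at least tau * n_images boxes (itself
    included) share its label and overlap it with IoU >= iou_thresh.
    The supporting boxes of each candidate are averaged into one part;
    isolated boxes (likely false positives from bad matches) drop out.
    """
    boxes = np.asarray(boxes, dtype=float)
    used, kept = set(), []
    for i in range(len(boxes)):
        if i in used:
            continue
        support = [j for j in range(len(boxes))
                   if labels[j] == labels[i]
                   and _iou(boxes[i], boxes[j]) >= iou_thresh]
        if len(support) >= tau * n_images:
            kept.append((labels[i], boxes[support].mean(axis=0)))
            used.update(support)
    return kept
```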

In the second step, we instead fix all , and minimize each by enumerating and finding the best solution, meanwhile updating . Again, this is done in an approximate manner. For each of the feature pairs , we compute the distance between and and use a threshold to filter them. On the surviving features , we further enumerate all quadruples with matched to and matched to , compute and , and again filter them with a threshold . Finally, we apply the Bron-Kerbosch algorithm to find the max-clique on both images, whose members are matched with each other. In practice, the hyper-parameters and do not impact performance.
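The second step can be sketched with a plain Bron-Kerbosch max-clique search. In this minimal pure-Python version, each node is a candidate match that survived the appearance threshold, and the `compatible` predicate stands in for the (elided) spatial-consistency thresholds; edges connect mutually consistent matches, so the maximum clique is the largest mutually consistent matching.

```python
def _bron_kerbosch(R, P, X, adj, best):
    """Basic Bron-Kerbosch recursion (no pivoting); tracks largest clique."""
    if not P and not X:
        if len(R) > len(best[0]):
            best[0] = list(R)
        return
    for v in list(P):
        _bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, best)
        P.remove(v)
        X.add(v)

def max_clique_matching(pairs, compatible):
    """Return the largest set of mutually consistent candidate matches.

    pairs: candidate matches (any hashable objects).
    compatible(p, q): True when two matches preserve oriented distances
    within the spatial threshold (assumed to be supplied by the caller).
    """
    n = len(pairs)
    adj = [set() for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            if compatible(pairs[a], pairs[b]):
                adj[a].add(b)
                adj[b].add(a)
    best = [[]]
    _bron_kerbosch(set(), set(range(n)), set(), adj, best)
    return [pairs[k] for k in sorted(best[0])]
```

The basic recursion is exponential in the worst case but fast on the sparse consistency graphs produced after thresholding.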

3.4.2 Testing: Fast Inference

In the testing stage, all and are fixed, so the viewpoint consistency term vanishes and the goal becomes:


Thus, we enumerate over all possible and find the best solution. There are no new components introduced in this part: the feature matching process is executed, based on which a coordinate transformation function is built that transplants into and obtains the desired results.

Compared to Eqn (6) that computes feature matching between the testing image and all training images, Eqn (10) only performs the computation for each of the models. Most often, we have and so this strategy saves a large amount of computation at the testing stage.

Table 1: Semantic part detection accuracy (by mAP, %) of Faster R-CNN [42], DeepVoting [57], and our approach, using different numbers of training samples. L0 through L3 indicate occlusion levels, with L0 being non-occlusion and L3 the heaviest occlusion.

3.5 Implementation Details

The rendering function is implemented with standard rasterization in a game engine. We place the 3D model against a regular background with road and sky, and use two directional light sources to reduce shadows in the rendered images (this improves image matching performance).

The transformation of the semantic part annotations is learned using their nearby matched features. For each semantic part, a weighted average of its neighboring features’ relative translation is applied to the annotation, where the weights are proportional to the inverse of the 2-D Euclidean distances between the semantic part and the features in the source image.
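The inverse-distance weighting just described can be sketched in a few lines of NumPy; `eps`, an assumed guard against zero distances, is the only detail not stated above.

```python
import numpy as np

def transfer_annotation(part_xy, feat_src_xy, feat_dst_xy, eps=1e-6):
    """Transfer one semantic-part location from a source image to a
    target image using its nearby matched features: each matched
    feature votes with its own translation, weighted by the inverse
    2D Euclidean distance between the part and that feature in the
    source image."""
    part_xy = np.asarray(part_xy, dtype=float)
    src = np.asarray(feat_src_xy, dtype=float)
    dst = np.asarray(feat_dst_xy, dtype=float)
    shifts = dst - src                                   # per-feature translation
    w = 1.0 / (np.linalg.norm(src - part_xy, axis=1) + eps)
    return part_xy + (w[:, None] * shifts).sum(axis=0) / w.sum()
```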

The basis of our approach is the feature vectors extracted from a pre-trained deep network. However, these features, being computed at a mid-level layer, often suffer from a low resolution in the original image plane. For example, the pool-4 features of VGGNet [43] used in this paper have a spatial stride of , which leads to inaccuracy in feature coordinates and, consequently, in the transformed locations of semantic parts. To improve matching accuracy, we apply a hierarchical way of extracting regional features. The idea is to first use higher-level (e.g., pool-4) features for semantic matching, and then fine-tune the matching using lower-level (e.g., pool-3) features, which have a smaller spatial stride and thus allow better alignment.
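The coarse-to-fine refinement can be sketched as follows; the search `radius` and the cosine-similarity criterion are assumed details of this sketch, with all features taken to be L2-normalized.

```python
import numpy as np

def refine_match(coarse_xy, feat_query, fine_feats, fine_coords, radius=16):
    """Refine a coarse (e.g., pool-4, stride 16) match using a finer
    (e.g., pool-3, stride 8) feature grid: restrict the search to fine
    positions within `radius` pixels of the coarse match, then pick the
    fine feature most similar in appearance to the query feature."""
    fine_coords = np.asarray(fine_coords, dtype=float)
    d = np.linalg.norm(fine_coords - np.asarray(coarse_xy, dtype=float), axis=1)
    cand = np.where(d <= radius)[0]          # fine positions near the coarse match
    sims = fine_feats[cand] @ feat_query     # cosine similarity (L2-normalized)
    return fine_coords[cand[np.argmax(sims)]]
```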

3.6 Discussions

Compared to prior methods on object detection [42] or parsing [57], our approach enjoys higher explainability, as shown in experiments (see Figure 5). Here, we inherit the argument that low-level or mid-level features can be learned by deep networks, as they often lead to better local [56] or regional [40] descriptions, but the high-level inference stage should be semantically meaningful, so that we can manipulate either expert knowledge or training data to improve recognition performance or transfer the pipeline to other tasks. Moreover, our approach requires much less training data to be optimized, and generalizes well to novel viewpoints which are not seen in the training data.

The training process of our approach could be largely simplified if we fixed , e.g., by manually labeling semantic parts on each 3D model . However, the amount of human labor required increases with the complexity of the annotation as well as the number of 3D models. Our approach serves as a good balance: the annotation on each 2D image can be transferred to different 3D models. In addition, 2D images are often annotated by different users, who provide complementary information via crowd-sourcing [5]. Therefore, learning 3D models from the ensemble of 2D annotations is a safer option.

4 Experiments

4.1 Settings and Baselines

We perform experiments on the VehicleSemanticPart (VSP) dataset [51]. We choose sedan, a prototype of car, which aligns with the purchased 3D models, and investigate whether our approach generalizes to other prototypes. There are training and testing images, all coming from the Pascal3D+ dataset [52], and the authors manually labeled semantic parts covering a large fraction of the surface of each car (examples in Figure 2). There are semantic parts related to the wheel, at the center and others around the rim. We only consider the center one, as the others are annotated less consistently. Moreover, we did not investigate other classes due to the difficulty of defining 3D models, i.e., airplane, bus, and train images suffer substantial intra-class appearance variation, and bike and motorbike are not perfectly rigid. The same setting was used in some prior work [38][12][27], which focused on broad applications of car parsing.

We use the ground-truth azimuth angle to categorize training images into bins, centered at , respectively. We randomly sample images in each bin, leading to three training sets with , and images, which are much smaller than the standard training set ( images). In the testing stage, we provide the ground-truth viewpoint and bounding-box for each image, which often requires less than seconds to annotate; in comparison, labeling all semantic parts costs more than one minute. We add quantization noise to the ground-truth azimuth and polar angles by assigning them into bins with fixed widths. This is to reduce the benefit the algorithm gains from using very accurate viewpoints. In addition, following the same setting as [57], different levels of occlusion are added to each testing image.
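For illustration, the azimuth binning can be sketched as below. The paper's actual bin count and centers are elided above, so the eight bins centered at multiples of 45 degrees in this sketch are an assumption.

```python
def azimuth_bin(azimuth_deg, n_bins=8):
    """Assign a ground-truth azimuth (degrees) to a viewpoint bin.
    Bins are assumed to be centered at multiples of 360 / n_bins
    degrees (e.g., 0, 45, ..., 315 for n_bins = 8)."""
    width = 360.0 / n_bins
    return int(round((azimuth_deg % 360.0) / width)) % n_bins
```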

The competitors of our approach include DeepVoting [57], a recent approach towards explainable semantic part detection, as well as Faster R-CNN [42] (also used in [57]), an end-to-end object detection algorithm. Other approaches (e.g., [51] and [50]) are not listed as they have been verified weaker than DeepVoting.

4.2 Quantitative Results

Results are summarized in Table 1. One can observe that our approach outperforms both DeepVoting and Faster R-CNN, especially in the scenarios of (i) fewer training data and (ii) heavier occlusion.

4.2.1 Impact of the Amount of Training Data

One major advantage of our method is the ability to learn from just a few training samples by preserving 3D geometric consistency across images from different viewpoints. As shown in Table 1, when using as few as training images, which provide no more than training samples for most semantic parts, our method still gives reasonable predictions and outperforms the other baseline methods by a large margin. With more training samples, our method also benefits from learning more accurate annotations on the 3D model, resulting in higher mAP in the 2D detection task. By contrast, both Faster R-CNN and DeepVoting fail easily given a small number of training samples.

Since the viewpoint is provided for each testing image, we also design a naive overlaying algorithm which transfers semantic parts between two images with similar viewpoints, i.e., each semantic part is transferred to the same relative position within the bounding box. With training samples, it produces an mAP of , about lower than our approach. This reveals the effectiveness of using feature matching to guide coordinate transformation.

4.2.2 Ability of Dealing with Occlusion

To evaluate how robust our method is to occlusion, we apply the models learned from the occlusion-free dataset to images with different levels of occlusion. Compared to DeepVoting [57], which learns the spatial relationship between semantic parts and their characteristic deep features in occlusion-free settings, our method directly models the spatial relationship of parts by projecting them into 3D space, and then matches them back to the occluded testing images. Under light occlusion, our method consistently beats the baseline methods. In cases of heavier occlusion, due to the scarcity of accurately matched features, the performance of our method deteriorates. As expected, Faster R-CNN lacks the ability to deal with occlusion and its performance drops quickly, while DeepVoting is affected less.

It is interesting to see that the robustness to fewer training data and occlusion is negatively related to the number of extra parameters. For example, DeepVoting has less than parameters compared to Faster R-CNN, and our approach, being a stage-wise one, only requires a few hyper-parameters to be set. This largely alleviates the risk of over-fitting (to small datasets) and the difficulty of domain adaptation (to occluded data).

4.2.3 Predicting on Unseen Viewpoints

Table 2: Semantic part detection accuracy (by mAP, %) of different approaches on unseen viewpoints.
Figure 3: Example of predicting on unseen viewpoint. Left: test image with transferred annotation (red) and ground truth (green). Note that no ground truth for wheels is given for the current image, which is an annotation error. Right: viewpoint matched synthetic image with semantic part annotations learned from training data.

To show that our approach has the ability to work on unseen viewpoints, we train the models using sedan images with various azimuth angles and elevation angle , then test them on sedans with an elevation angle equal to or larger than . The results are shown in Table 2. Our method maintains roughly the same mAP as when tested on all viewpoints, while Faster R-CNN and DeepVoting deteriorate heavily. In Figure 3, we show predictions made by our method on one sample with an unseen viewpoint (elevation angle equal to ). The predicted locations are very close to the annotated ground truth, and may help to fix the annotation error in the dataset.

4.2.4 Transfer across Different Prototypes

Table 3: Semantic part detection accuracy (by mAP, %) of different approaches on other car prototypes (sedan, SUV, minivan, hatchback). All models are trained using 32 sedan images.

In order to evaluate how sensitive our method is to the car prototype used during training (and 3D reconstruction), we transfer the sedan model with semantic parts to other prototypes of cars. Results are summarized in Table 3. As expected, our method generalizes well to prototypes with similar appearance (e.g., SUV). For minivan and hatchback, due to the variation of their 3D structures and semantic part definitions, the performance drops more. Similar results are observed for Faster R-CNN and DeepVoting, with DeepVoting being slightly more robust across prototypes.

4.3 Qualitative Studies

4.3.1 Viewpoint Consistency in Training

Figure 4: Two examples of how viewpoint consistency improves the semantic part annotation on 3D model. The red circles represent incorrectly transferred semantic part annotations that get eliminated during our aggregation process using 3-D geometry constraints. The green circles are the reasonable annotations that are used to get the final annotation for the targeted semantic part, which is represented by the blue stars with bounding-boxes.

In Figure 4, we show examples of how viewpoint consistency improves the stability of the training stage. Although we apply 2D geometric coherence as one of the criteria when matching individual training samples to their viewpoint-paired synthetic images, it is possible to get wrongly matched features at inaccurate positions. Therefore, the semantic part annotations transferred from an individual training image can be far off the ground-truth area (e.g., the outliers shown by the red circles). With viewpoint consistency, the incorrect annotations are eliminated during aggregation, and our method stably outputs the right position for the targeted semantic parts (e.g., the final annotations shown by blue stars) based on the reasonably transferred annotations (e.g., the inliers shown by green circles).

4.3.2 Interpreting Semantic Part Detection

Figure 5: Interpreting semantic part detection. Stars with bounding-boxes represent the semantic parts transferred from the synthetic image (right) to the testing image (left), based on a transformation learned from features (green circles; matched features are linked by red lines) in their neighborhood.

Next, we provide qualitative results to demonstrate the explainability of our approach. In Figure 5, we show examples of how we locate the semantic parts in two image pairs. Each pair consists of a testing image and its viewpoint-matched synthetic image. The stars represent the locations of the semantic parts in each image (learned during training on the synthetic images and transferred to the testing images), and their colors indicate part identities. The transformation is estimated from nearby matched features, shown by green circles (matched features are linked by red lines). For better visualization, we only display the three nearest features for each semantic part. This reveals which features are used to transfer the annotation from synthetic images to testing images, and helps us understand what happens during inference.
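The transfer step described above can be sketched as fitting a local transform to the nearby matched features and applying it to the part location. The snippet below is a simplified illustration under our own assumptions (a local affine model fit by least squares to the k nearest matches; function and variable names are hypothetical), not the paper's exact procedure.

```python
import numpy as np

def transfer_part_location(part_syn, feats_syn, feats_test, k=3):
    """Transfer a semantic part location from a synthetic image to a
    testing image via an affine transform fit to nearby matched features.

    part_syn   : (2,) part location in the synthetic image
    feats_syn  : (N, 2) matched feature positions in the synthetic image
    feats_test : (N, 2) corresponding feature positions in the test image
    k          : number of nearest features used to fit the transform
    """
    part_syn = np.asarray(part_syn, dtype=float)
    feats_syn = np.asarray(feats_syn, dtype=float)
    feats_test = np.asarray(feats_test, dtype=float)
    # Use the k matched features closest to the part in the synthetic image.
    idx = np.argsort(np.linalg.norm(feats_syn - part_syn, axis=1))[:k]
    src, dst = feats_syn[idx], feats_test[idx]
    # Fit a 2D affine transform  dst ~ [src, 1] @ M  by least squares.
    A = np.hstack([src, np.ones((len(idx), 1))])
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)
    # Apply the transform to the part location in homogeneous coordinates.
    return np.append(part_syn, 1.0) @ M

# Example: features related by a pure translation of (10, 5).
syn = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
tst = syn + np.array([10.0, 5.0])
loc = transfer_part_location([0.5, 0.5], syn, tst)
```

With exactly k = 3 non-collinear matches the affine fit is exact; with more matches the least-squares fit averages out small localization noise.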

5 Conclusions

In this paper, we present a novel framework for semantic part detection. The pipeline starts by extracting regional features and applying robust matching algorithms to find correspondences between images with similar viewpoints. To deal with limited training data, an additional consistency loss term is added, which measures how well semantic part annotations transfer across different viewpoints. By introducing a 3D model and its viewpoints as hidden variables, we can optimize the loss function with an iterative algorithm. At testing time, we directly apply the same matching algorithm to transfer the semantic parts from the 3D model back to each 2D image, which is highly efficient. Experiments are performed on detecting semantic parts of cars in the VSP dataset. Our approach works especially well with very few (e.g., tens of) training images, on which other competitors [42][57] often heavily over-fit and generalize badly.

Our approach provides an alternative solution to object parsing, with three major advantages: (i) it can be trained on a limited amount of data; (ii) it can be trained on a subset of viewpoints and then transferred to novel ones; and (iii) it can be assisted by virtual data in both training and testing. However, it still suffers from the difficulty of hand-designing parameters, a common weakness of stepwise methods compared to end-to-end learning methods. In other words, much work remains to be done to balance learning ability and explainability.

Many researchers believe that 3D is a key future direction of computer vision. In future research, we will try to learn one or more 3D models directly from 2D data, or allow the 3D model to deform slightly to fit 2D data. More importantly, it is an intriguing yet challenging topic to generalize this idea to non-rigid objects, which would greatly extend its range of application.


  • [1] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2189–2202, 2012.
  • [2] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. In International Conference on Learning Representations, 2016.
  • [3] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, 2016.
  • [4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005.
  • [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
  • [6] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, 2014.
  • [7] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In International Conference on Computer Vision, 2015.
  • [8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
  • [9] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
  • [10] R. Girshick. Fast r-cnn. In Computer Vision and Pattern Recognition, 2015.
  • [11] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.
  • [12] D. Glasner, M. Galun, S. Alpert, R. Basri, and G. Shakhnarovich. Viewpoint-aware object detection and continuous pose estimation. Image and Vision Computing, 30(12):923–933, 2012.
  • [13] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Computer Vision and Pattern Recognition, 2017.
  • [14] B. Ham, M. Cho, C. Schmid, and J. Ponce. Proposal flow. In Computer Vision and Pattern Recognition, 2016.
  • [15] K. Han, R. S. Rezende, B. Ham, K. Y. K. Wong, M. Cho, C. Schmid, and J. Ponce. Scnet: Learning semantic correspondence. In International Conference on Computer Vision, 2017.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
  • [17] J. H. Hosang, R. Benenson, and B. Schiele. Learning non-maximum suppression. In International Conference on Computer Vision, 2017.
  • [18] A. Hosni, C. Rhemann, M. Bleyer, C. Rother, and M. Gelautz. Fast cost-volume filtering for visual correspondence and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(2):504–511, 2013.
  • [19] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Computer Vision and Pattern Recognition, 2018.
  • [20] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In Computer Vision and Pattern Recognition, 2017.
  • [21] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Computer Vision and Pattern Recognition, 2017.
  • [22] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
  • [23] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang. Acquisition of localization confidence for accurate object detection. In European Conference on Computer Vision, 2018.
  • [24] S. Kim, D. Min, B. Ham, S. Jeon, S. Lin, and K. Sohn. Fcss: Fully convolutional self-similarity for dense semantic correspondence. In Computer Vision and Pattern Recognition, 2017.
  • [25] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
  • [26] Y. LeCun, Y. Bengio, and G. E. Hinton. Deep learning. Nature, 521(7553):436, 2015.
  • [27] C. Li, M. Z. Zia, Q. H. Tran, X. Yu, G. D. Hager, and M. Chandraker. Deep supervision with shape concepts for occlusion-aware 3d object parsing. In Computer Vision and Pattern Recognition, 2017.
  • [28] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.
  • [29] C. Liu, J. Yuen, and A. Torralba. Sift flow: Dense correspondence across scenes and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):978–994, 2011.
  • [30] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, 2016.
  • [31] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition, 2015.
  • [32] J. Ma, W. Qiu, J. Zhao, Y. Ma, A. L. Yuille, and Z. Tu. Robust l2e estimation of transformation for non-rigid registration. IEEE Transactions on Signal Processing, 63(5):1115–1129, 2015.
  • [33] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22(10):761–767, 2004.
  • [34] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In International Conference on Machine Learning, 2010.
  • [35] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, 2016.
  • [36] D. Novotnỳ, D. Larlus, and A. Vedaldi. Anchornet: A weakly supervised network to learn geometry-sensitive features for semantic matching. In Computer Vision and Pattern Recognition, 2017.
  • [37] M. Okutomi and T. Kanade. A multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(4):353–363, 1993.
  • [38] M. Ozuysal, V. Lepetit, and P. Fua. Pose estimation for category specific multiview object localization. In Computer Vision and Pattern Recognition, 2009.
  • [39] W. Qiu and A. L. Yuille. Unrealcv: Connecting computer vision to unreal engine. In European Conference on Computer Vision, 2016.
  • [40] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In Computer Vision and Pattern Recognition, 2014.
  • [41] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Computer Vision and Pattern Recognition, 2016.
  • [42] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
  • [43] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
  • [44] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • [45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. In Computer Vision and Pattern Recognition, 2015.
  • [46] R. Szeto and J. J. Corso. Click here: Human-localized keypoints as guidance for viewpoint estimation. In International Conference on Computer Vision, 2017.
  • [47] J. Thewlis, H. Bilen, and A. Vedaldi. Unsupervised learning of object landmarks by factorized spatial embeddings. In International Conference on Computer Vision, 2017.
  • [48] N. Ufer and B. Ommer. Deep semantic feature matching. In Computer Vision and Pattern Recognition, 2017.
  • [49] J. R. R. Uijlings, K. E. A. Van De Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
  • [50] J. Wang, C. Xie, Z. Zhang, J. Zhu, L. Xie, and A. L. Yuille. Detecting semantic parts on partially occluded objects. In British Machine Vision Conference, 2017.
  • [51] J. Wang, Z. Zhang, C. Xie, V. Premachandran, and A. Yuille. Unsupervised learning of object semantic parts from internal states of cnns by population encoding. arXiv preprint arXiv:1511.06855, 2015.
  • [52] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond pascal: A benchmark for 3d object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision, 2014.
  • [53] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. L. Yuille. Adversarial examples for semantic segmentation and object detection. In International Conference on Computer Vision, 2017.
  • [54] S. Xie and Z. Tu. Holistically-nested edge detection. In International Conference on Computer Vision, 2015.
  • [55] H. Yang, W. Y. Lin, and J. Lu. Daisy filter flow: A generalized discrete approach to dense correspondences. In Computer Vision and Pattern Recognition, 2014.
  • [56] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. Lift: Learned invariant feature transform. In European Conference on Computer Vision, 2016.
  • [57] Z. Zhang, C. Xie, J. Wang, L. Xie, and A. L. Yuille. Deepvoting: An explainable framework for semantic part detection under partial occlusion. In Computer Vision and Pattern Recognition, 2018.
  • [58] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene cnns. In International Conference on Learning Representations, 2015.
  • [59] T. Zhou, P. Krahenbuhl, M. Aubry, Q. Huang, and A. A. Efros. Learning dense correspondence via 3d-guided cycle consistency. In Computer Vision and Pattern Recognition, 2016.
  • [60] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision, 2017.