Siamese Regression Networks with Efficient mid-level Feature Extraction for 3D Object Pose Estimation

Andreas Doumanoglou, Vassileios Balntas, Rigas Kouskouridas and Tae-Kyun Kim All authors are with the Imperial Computer Vision and Learning Lab (ICVL), at the Department of Electrical and Electronic Engineering, Imperial College London, UK, {a.doumanoglou12, v.balntas15, r.kouskouridas, tk.kim}@imperial.ac.uk
Abstract

In this paper we tackle the problem of estimating the 3D pose of object instances using convolutional neural networks. State of the art methods usually solve the challenging problem of regression in angle space indirectly, focusing on learning discriminative features that are later fed into a separate architecture for 3D pose estimation. In contrast, we propose an end-to-end learning framework for directly regressing object poses by exploiting Siamese Networks. For a given image pair, we enforce a similarity measure between the representations of the sample images in feature space and in pose space, which is shown to boost regression performance. Furthermore, we argue that the pose-guided feature learning of our Siamese Regression Network generates more discriminative features that outperform the state of the art. Last, our feature learning formulation provides the ability to learn features that perform under severe occlusions, demonstrating high performance on our novel hand-object dataset.

1 Introduction

Detecting objects and estimating their 3D pose is one of the most challenging tasks in computer vision, since severe occlusions, background clutter and large scale changes dramatically affect the performance of any contemporary solution. State of the art methods make use of Hough Forests for casting patch votes in the 3D space tejani2014latent ; andoum2016recovering , or train CNNs to either perform classification into the quantized 3D space johns2016pairwise or learn features that are later fed to a Nearest Neighbor scheme for 3D object pose template matching wohlhart2015learning . The lack of research on deep networks for angle regression suggests that directly regressing object poses in angle space is not trivial, with the objective function appearing to have many local minima. Despite the recent advances in classification tasks using CNNs, a framework that is able to perform direct regression of angles, while jointly learning discriminative features, has yet to be built.

Towards this end, and contrary to the state of the art methods, we are interested in learning an end-to-end framework that directly regresses object pose angles. Recent works sun2014deep ; hoffer2015deep demonstrated successful results using siamese networks, which can improve the network's learning capabilities by exploiting additional information about the relationship between training samples. Inspired by this, we present the Siamese Regression Network, which enforces a relationship between feature and pose space by applying a novel loss function that boosts the performance of a regression network layer. This network is able to perform single-shot, end-to-end regression for object pose estimation, without requiring pairs of inputs at testing time. In addition, we experimentally evaluate the effect of other factors that play an important role in successful regression, such as feature normalization and batch formation. Compared to a state of the art feature learning technique using CNNs wohlhart2015learning , our Siamese Regression Network is shown to learn more discriminative features, optimized for our particular problem. Finally, we are interested in handling severe occlusions in the object pose estimation task, a particularly interesting problem that constantly arises in real-life applications. Estimating an accurate pose when a significant portion of the object is missing is a very challenging task which drastically degrades the performance of previous arts wohlhart2015learning ; hinterstoisser2012accv ; brachmann2014learning ; lim2014fpm ; wu20153d ; mottaghi_cvpr15 . We show how our loss function can be easily modified to handle cases of severe occlusion. To evaluate our regressor in such cases, we built our own challenging dataset, which demonstrates an object being manipulated by a human hand. Results show that our method handles severe occlusion very well, reaching the accuracy levels of non-occluded objects.

In summary our paper offers the following contributions:

  • We present the Siamese Regression Network, which, to the best of our knowledge, is the first CNN-based framework for regressing object poses in angle space.

  • We boost the performance of our system by introducing a novel loss function for feature-guided pose regression.

  • In turn, we show that pose-guided feature learning results in more discriminative features than those of wohlhart2015learning , which are, as experimentally proven, optimized for the particular task of 3D object pose estimation.

  • We show how our loss function can be adapted to deal with severe occlusions, and evaluate our system on a new challenging dataset containing an object captured under severe occlusions. Furthermore, experimental evaluation on a benchmark dataset hinterstoisser2012accv provides evidence of our system outperforming the state of the art.

The remainder of the paper is organized as follows. In Section 2 we provide an overview of the related work, while our proposed approach is introduced in Section 3. In Section 4 we present an evaluation of our method compared to the state of the art on two datasets. Finally, in Section 5 we conclude with final remarks and an outlook on future work.

2 Related Work

Recognizing and detecting objects along with estimating their 3D pose has received a lot of attention in the literature. Early works made use of point clouds to facilitate point-to-point matching drost2010model ; rusu2009fast , while the advent of low-cost depth sensors hinterstoisser2012accv ; rios2013discriminatively provided additional data in favor of textureless objects. Hinterstoisser et al. hinterstoisser2012accv designed a powerful holistic template matching method (LINEMOD) based on RGB-D data that suffers in cases of occlusion. Tejani et al. tejani2014latent integrated LINEMOD into Hough Forests to tackle the problem of occlusions and clutter. The work of Brachmann et al. brachmann2014learning , along with its recent extension to RGB-only images brachmann2016uncertainty , employs a new representation framework that jointly maps 3D object coordinates and class labels. Hodan et al. hodan2015detection presented a method that tackles the complexity of sliding window approaches via a fast-filtering technique followed by a voting procedure for hypotheses generation, while fine 3D pose estimation is performed via a stochastic, population-based optimization scheme. In turn, in song2014sliding , exemplar SVMs are slid in the 3D space to perform object pose classification based on depth images.

Deep learning has only recently found application to the 3D object pose estimation problem. Doumanoglou et al. andoum2016recovering suggested using a network of stacked sparse autoencoders to automatically learn features in an unsupervised manner that are fed to Hough Forests for 6D object pose recovery and next-best-view estimation. In johns2016pairwise Johns et al. employed a CNN-based end-to-end learning framework for classification of object poses in the 3D space and next-best-view prediction. In turn, in crivellaro2015novel a CNN was used to learn projections of 3D control points for accurate 3D object tracking, while in krull2015learning a CNN is utilized in a probabilistic framework to perform analysis-by-synthesis as a final refinement step for object pose estimation. In wohlhart2015learning 3D pose estimation is performed by a scalable Nearest Neighbor method on discriminative feature descriptors learned by a CNN.

To the best of our knowledge, this paper presents the first CNN-based framework for regressing object poses in the continuous 3D space. The work of Kendall et al. kendall2015posenet regresses camera poses in the continuous 3D space (camera pose estimation being the inverse of object pose estimation), but does not offer any end-to-end learning, since it makes use of the pretrained VGG network, adding just a softmax layer for regressing the four quaternion components of the camera pose. The method of Sun et al. sun2014deep offers a learning framework that learns a new face representation by joint identification-verification. As far as feature learning for 3D object pose estimation is concerned, our work shares similar ideas with the method of Wohlhart et al. wohlhart2015learning , which learns feature descriptors with pairs and triplets. However, we argue that our learned features are pose-guided and, as the experiments prove, more discriminative, which in fact suggests that they are optimized for the particular task of 3D object pose estimation.

3 Siamese Regression Network

3.1 Object Pose Estimation Using Regression CNN

We first formulate the problem of object pose estimation as a regression problem. Let $x \in \mathbb{R}^{W \times H \times 4}$ be an RGBD (4 channels) image depicting a centered object of width $W$ and height $H$. Pose estimation is the problem of learning a regressor $f : \mathbb{R}^{W \times H \times 4} \rightarrow \mathbb{R}^{d}$, where $d$ is the dimensionality of the pose representation used. For example, Euler angles require three angles to be defined ($d = 3$), whereas quaternions suggest $d = 4$. Regressing Euler angles directly can be problematic due to several issues, such as periodicity yi2015learning and the non-continuous nature of the Euler angle space kendall2015posenet . For example, poses that are very similar visually might be far apart in Euler angle space, making regression harder. Therefore, similar to previous work kendall2015posenet on regressing camera pose, we also use the quaternion representation, which does not suffer from these problems.
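To make the periodicity issue concrete, the following minimal sketch (ours, not from the paper; the euler_to_quaternion helper and the ZYX convention are assumptions) compares two nearly identical rotations in both representations:

```python
import numpy as np

def euler_to_quaternion(roll, pitch, yaw):
    """Euler angles (radians, ZYX convention) -> unit quaternion (w, x, y, z)."""
    cr, sr = np.cos(roll / 2), np.sin(roll / 2)
    cp, sp = np.cos(pitch / 2), np.sin(pitch / 2)
    cy, sy = np.cos(yaw / 2), np.sin(yaw / 2)
    return np.array([cr * cp * cy + sr * sp * sy,
                     sr * cp * cy - cr * sp * sy,
                     cr * sp * cy + sr * cp * sy,
                     cr * cp * sy - sr * sp * cy])

# Two rotations about z that differ by only 2 degrees as rotations.
q1 = euler_to_quaternion(0.0, 0.0, np.deg2rad(179.0))
q2 = euler_to_quaternion(0.0, 0.0, np.deg2rad(-179.0))
print(abs(179.0 - (-179.0)))  # 358.0: a huge gap in Euler angle space
# Small quaternion distance, once the q / -q double cover is accounted for.
print(min(np.linalg.norm(q1 - q2), np.linalg.norm(q1 + q2)))  # ~0.017
```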

For the task of estimating the regression function we train a convolutional neural network (CNN). We use a simple architecture similar to wohlhart2015learning that consists of 2 convolutional layers and 2 fully connected layers (we have removed the max-pooling layers, as we saw that they slightly degraded the performance). On top of that, we added another fully connected layer that outputs $d$ units to estimate the object pose. If we consider the layer just before the last regression layer as the features learned by the network, we can describe the output of our network as:

$\hat{q}(x) = r(\phi(x)) \qquad (1)$

where $\phi(x)$ is the output of the feature layer, $r$ the regression layer function and $\hat{q}(x)$ is a pose vector returned by the network for the input image $x$. Given a training set that contains combinations of training samples of the form $(x_i, q_i)$, the most commonly used method of training a regression network is by minimizing the Mean Square Error (MSE) between the estimation and the ground truth and back-propagating the error. If we split the training set into mini-batches of $n$ samples each, the regression loss can be written as:

$L_{reg} = \frac{1}{n} \sum_{i=1}^{n} \left\| \hat{q}(x_i) - q_i \right\|_2^2 \qquad (2)$
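As a hedged illustration of Eq. 2, a NumPy sketch of this per-batch loss could look as follows (the array shapes and names are our own):

```python
import numpy as np

def regression_loss(q_pred, q_true):
    """MSE regression loss of Eq. 2 (sketch).

    q_pred, q_true: (n, d) arrays holding the network estimates and the
    ground-truth pose vectors (unit quaternions, d = 4) of one mini-batch.
    """
    return np.mean(np.sum((q_pred - q_true) ** 2, axis=1))
```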
Figure 1: (left) Our training and testing architectures. We enforce a siamese architecture for regressing relative distance between feature and pose spaces. During testing, we extract a branch of the network, and use it for regression. (right) Illustration of our feature-guided pose regression loss. The loss seeks to associate distances in the normalised feature space with the normalised pose space.

3.2 Siamese Regression Objective

Previous work has shown that the feature layer is able to learn representations that can be successfully applied in nearest neighbour matching wohlhart2015learning or face identification-verification sun2014deep . However, end-to-end regression learning with a CNN in angle space has proved to be a very challenging task, with researchers resorting to indirect solutions, such as Nearest Neighbor template matching wohlhart2015learning or ad-hoc angle estimation methods like yi2015learning . Therefore, inspired by sun2014deep , we want to enhance the feature learning process by using additional information available during training, in order to help the end-to-end regressor converge to a better minimum. Thus, our goal is to enforce a loss function in the feature layer $\phi$, in a way that the learned features are more appropriate and useful for the regression task in the last layer $r$.

In order to enforce a second loss function in this layer, we utilize the siamese architecture that has been very successful for learning non-linear feature embeddings with convolutional neural networks hadsell2006dimensionality . The siamese architecture consists of two (or more) branches of the same CNN that share weights and encode two inputs processed in parallel. Subsequently, a loss function can be introduced based on both outputs, which makes it possible to compare different samples of our training data passing through our network in a meaningful way.

Our study of the regression problem concluded that there is a relation between the feature and the angle space which helps a regression network layer perform much better. The relationship is the following: during training, the Euclidean distance between two sample images represented in feature space should match the distance between the same samples as represented in angle space. Fig. 1 shows an illustration of our idea. In order to enforce such a relation, we use a siamese network to pass pairs of samples through the network and apply an objective term on them. The pairs have the form $\{(x_1, q_1), (x_2, q_2)\}$, where $x_1, x_2$ represent the raw inputs and $q_1, q_2$ the pose vector ground truth. We enforce the following loss function for feature-guided regression:

$L_{feat} = \left( \left\| \phi(x_1) - \phi(x_2) \right\|_2 - \left\| q_1 - q_2 \right\|_2 \right)^2 \qquad (3)$

Intuitively, minimizing this loss enforces the distance between the features in the sample pair to be close to the distance between the ground truth of their poses. In order to avoid weighting either of the above parts of the objective loss term, we normalize the output of the feature layer as well as the output of the pose layer to have unit norm (if using quaternions as pose representation, they already have unit norm). In fact, as we will show in the experiments, this normalization has a positive effect on training angle regression even without using our extra feature term.
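The term of Eq. 3 can be sketched in NumPy as below; the unit-normalisation step mirrors the description above, while the variable names and the batch-averaged form are our assumptions:

```python
import numpy as np

def l2_normalize(v, axis=1, eps=1e-8):
    """Scale each row of v to unit Euclidean norm."""
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)

def feature_loss(f1, f2, q1, q2):
    """Feature-guided regression term of Eq. 3 (sketch): penalise the gap
    between pairwise feature distance and pairwise pose distance.

    f1, f2: (n, k) feature-layer outputs for the two siamese branches
    q1, q2: (n, 4) ground-truth unit quaternions of the same pairs
    """
    f1, f2 = l2_normalize(f1), l2_normalize(f2)
    d_feat = np.linalg.norm(f1 - f2, axis=1)
    d_pose = np.linalg.norm(q1 - q2, axis=1)
    return np.mean((d_feat - d_pose) ** 2)
```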

It should be mentioned that the siamese network is only used during training to help the regression task. During testing, a single image produces a pose estimation, without the need to provide a paired image.

3.3 Feature Guided Pose Regression

Combining the regression loss with the feature loss, we get

$L = L_{reg} + L_{feat} + \lambda \left\| W \right\|_2^2 \qquad (4)$

where $\lambda \left\| W \right\|_2^2$ is a term to regularise the weights $W$ of the convolutional neural network. By enforcing this loss in the proposed siamese regression network, we are able to simultaneously focus both on features that work well in a nearest neighbour framework, and on the fully connected last layer that regresses the poses directly. Indeed, in our experiments we show that enforcing the feature term in the loss leads to better pose estimation in the final layer.
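Combining the pieces, a sketch of the full objective of Eq. 4 follows; it reuses the regression_loss and feature_loss sketches above, and both the application of the regression loss to the two branches and the value of lam are our assumptions:

```python
import numpy as np

def total_loss(q_pred1, q1, q_pred2, q2, f1, f2, weights, lam=1e-4):
    """Combined objective of Eq. 4 (sketch): regression loss on both siamese
    branches, the feature term of Eq. 3, and L2 weight regularisation."""
    l_reg = regression_loss(q_pred1, q1) + regression_loss(q_pred2, q2)
    l_feat = feature_loss(f1, f2, q1, q2)
    l_w = lam * sum(float(np.sum(w ** 2)) for w in weights)
    return l_reg + l_feat + l_w
```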

In Table 1, we describe the relationship between the two loss functions $L_{reg}$ and $L_{feat}$, and the parts of the CNN weights that are updated. We note that the weights related to the feature learning are updated using information from both losses, while the weights related to the pose regression are only updated based on the $L_{reg}$ loss.

3.4 Pose Guided Feature Learning

Despite the fact that the loss function of Eq. 4 mainly aims to learn a better regressor for the pose in the final layer of the convolutional network, it can be argued that the features learned in the feature layer $\phi$ of the network also become more discriminative.

Previous work on 3D feature learning with siamese networks has focused on optimising the feature embeddings using triplets. Triplet training samples contain an anchor, a positive sample and a negative sample. The authors of wohlhart2015learning form the triplet by using two close views of the object as the anchor and positive samples, and a view with a significantly different pose as the negative. Their objective forces the anchor and the positive sample to be closer in feature space than the anchor and the negative one. They also use pairs of images with similar pose but different appearance and minimize their distance in feature space, in order to learn features immune to different lighting conditions and noise.

On the other hand, our loss focuses on forcing the feature distance between a pair to be equivalent to the pose distance. Thus, it is more appropriate for a nearest neighbour framework, since the features are optimized to be relative to the pose distance. Indeed, our experiments show that enforcing our loss from Eq. 4 results in features that are more suited for nearest neighbour matching.

We should note that using the objective function of wohlhart2015learning instead of $L_{feat}$ in order to help regression did not work, with the network showing convergence behavior similar to simple regression without the extra term. This is a clue that our objective does indeed help regression, while at the same time regression helps build more discriminative features appropriate for pose estimation.

input: training set $T$, CNN feature parameters $W_\phi$, CNN pose parameters $W_r$, learning rate $\eta$
for epoch e=1:N do
 sample M mini-batches from $T$ using pairs with both similar and different poses
for mini-batch b=1:M do
  compute $L_{reg}$ and $L_{feat}$ on mini-batch b
  compute the gradients $\partial(L_{reg} + L_{feat}) / \partial W_\phi$ and $\partial L_{reg} / \partial W_r$
  update $W_\phi \leftarrow W_\phi - \eta \, \partial(L_{reg} + L_{feat}) / \partial W_\phi$
  update $W_r \leftarrow W_r - \eta \, \partial L_{reg} / \partial W_r$
end
end
output $W_\phi, W_r$
Table 1: Our learning algorithm.

3.5 Siamese Pair Sampling

Considering a dataset of $N$ training samples of the form $(x_i, q_i)$, there exist $N(N-1)/2$ possible pairs to be used in the siamese training process described above. Since the number of pairs can become very large, several authors have explored different techniques of sampling or mining hard negative pairs wang2014learning ; simo2015discriminative .

Although we do not explicitly have positive and negative pairs, since training is done on the same object, we approximate such pairs by splitting a batch of size $n$ between pairs that are close in pose space and pairs that have very large pose differences. We found that choosing pairs at random with respect to their pose difference performs worse than this formation. Interestingly, forming batches using both similar and different pose pairs is a factor that improves regression on its own, even without enforcing any constraints on such pairs. In the experiments we show the relative performance gain of using well-formed batches compared to simple regression and to enforcing our sample pair objective.
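A minimal sketch of this batch formation follows; the 15-degree threshold and the exact 50/50 split are hypothetical choices of ours, not values stated in the paper:

```python
import numpy as np

def sample_pair_batch(Q, n_pairs, thresh_deg=15.0, rng=np.random):
    """Return index pairs for one siamese mini-batch: half close in pose,
    half with very different poses (sketch).

    Q: (N, 4) array of ground-truth unit quaternions.
    Assumes both kinds of pairs exist in the data; otherwise the loop
    would never terminate.
    """
    sim, diff = [], []
    half = n_pairs // 2
    while len(sim) < half or len(diff) < half:
        i, j = rng.randint(len(Q), size=2)
        if i == j:
            continue
        # Rotation angle between the two poses; abs() handles q = -q.
        dot = np.clip(abs(np.dot(Q[i], Q[j])), 0.0, 1.0)
        angle = np.degrees(2.0 * np.arccos(dot))
        if angle < thresh_deg and len(sim) < half:
            sim.append((i, j))
        elif angle >= thresh_deg and len(diff) < half:
            diff.append((i, j))
    return sim + diff
```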

3.6 Handling occlusions

Tackling occlusions, that is, estimating the object pose when a major part of the object is missing or occluded, requires features that are robust to such conditions, and one should explicitly enforce this property. We note that the form of Eq. 4 makes it convenient to build such features: if we generate training samples with the object occluded and, using the annotation, render a clean object in the same pose, we can enforce a term similar to Eq. 3 between the occluded and the clean images:

$L_{occ} = \left\| \phi(x_{occ}) - \phi(x_{cl}) \right\|_2^2 \qquad (5)$

where $x_{occ}$ and $x_{cl}$ are images depicting the occluded and the clean object respectively. Fig. 3 in the experiments section shows examples of such images. This term can be added to the loss $L$ in order to tackle the severe occlusion problem.
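In sketch form (our naming; the exact penalty in Eq. 5 may differ in detail), since the occluded and clean renderings share the same pose, the pose distance of the pair is zero and the term reduces to pulling the two feature vectors together:

```python
import numpy as np

def occlusion_loss(f_occ, f_clean):
    """Occlusion term of Eq. 5 (sketch): features of the occluded image
    should match those of the clean rendering in the same pose.

    f_occ, f_clean: (n, k) unit-normalised feature vectors.
    """
    return np.mean(np.sum((f_occ - f_clean) ** 2, axis=1))
```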

4 Experiments

Figure 2: a) End-to-end regression evaluation compared to our learned features using nearest neighbor on various network configurations: S is simple regression; B is simple regression with a properly formed batch; N means that the network contains a normalization layer after the feature and final layers; F means that the network is trained using our new feature-guided pose regression objective. b) Evaluation of the number of pairs compared inside a mini-batch. c) Evaluation of the length of the feature layer.

Our convolutional regressor is a simple convolutional neural network with an architecture similar to wohlhart2015learning , consisting of two convolutional layers followed by two fully connected layers. Note that the feature layer is of variable length, which allows a trade-off between feature extraction size and performance. We use the same non-linearity in all our convolutional and fully connected layers, apart from the last layer that produces the pose estimation. For training the network we use stochastic gradient descent bottou2012stochastic with momentum, and we decay the learning rate in each epoch to avoid oscillations around local minima.
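As a concrete reading of this architecture, the PyTorch sketch below uses layer sizes in the spirit of wohlhart2015learning (16 filters of 8x8, 7 filters of 5x5, a 256-unit fully connected layer) and a 64x64 RGBD input; all of these concrete numbers, and the choice of ReLU activations, are assumptions since the extracted text does not preserve them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseRegressionNet(nn.Module):
    """Sketch of the regression CNN: two conv layers (no max-pooling),
    two fully connected layers ending in the feature layer, plus the
    final pose regression layer. All layer sizes are assumptions."""

    def __init__(self, feat_dim=32, pose_dim=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8), nn.ReLU(),   # 4-channel RGBD input
            nn.Conv2d(16, 7, kernel_size=5), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc1 = nn.Linear(7 * 53 * 53, 256)            # for 64x64 inputs
        self.feature = nn.Linear(256, feat_dim)           # feature layer phi
        self.pose = nn.Linear(feat_dim, pose_dim)         # regression layer r

    def forward(self, x):
        h = F.relu(self.fc1(self.conv(x)))
        f = F.normalize(self.feature(h), dim=1)           # unit-norm features
        q = F.normalize(self.pose(f), dim=1)              # unit quaternion
        return f, q
```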

In order to evaluate our method we used two datasets. The first one, which is also used for our parameter analysis, is LINEMOD hinterstoisser2012accv . More specifically, we worked with a variant of the RGBD images as used by wohlhart2015learning , where the objects are centered in the image so that no localization is required. This dataset contains about 3000 training images and 1000 test images per object.

The second dataset is constructed by us and depicts a small object (a car belt) being manipulated by a human hand, as seen in Fig. 3. The focus of this dataset is to introduce realistic occlusions that are severe and have a stronger effect on the learning process than using different object instances or types. To construct the dataset, we recorded an RGBD video, using an Asus Xtion, of a human manipulating the object, and used a particle swarm optimization tracker tompson2014real ; oikonomidis2011efficient to track both the hand pose and the object pose. With this information we can easily generate the pairs needed for our occlusion term (Eq. 5). Such a scenario appears in autonomous learning of object manipulation by robots, where the task is demonstrated by a human. This dataset is very challenging, since the human hand introduces a high level of occlusion which can significantly degrade the accuracy of pose estimation. Moreover, our dataset is larger, with about 21000 training and 5000 testing images.

Regarding the evaluation metric, we use the average Euler angle error, which is the average of the absolute difference in angle (in degrees) between the estimated pose and the ground truth, measured over the three principal axes. This metric is more appropriate for our regression task and matches the one used in wohlhart2015learning . Since we are using quaternions, we transform them into Euler angles after the estimation in order to perform the comparisons.
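A sketch of this metric, where the ZYX Euler convention and the explicit wrap-around handling are our assumptions:

```python
import numpy as np

def quat_to_euler(q):
    """Unit quaternion (w, x, y, z) -> Euler angles (roll, pitch, yaw) in degrees."""
    w, x, y, z = q
    roll = np.arctan2(2 * (w * x + y * z), 1 - 2 * (x * x + y * y))
    pitch = np.arcsin(np.clip(2 * (w * y - z * x), -1.0, 1.0))
    yaw = np.arctan2(2 * (w * z + x * y), 1 - 2 * (y * y + z * z))
    return np.degrees(np.array([roll, pitch, yaw]))

def average_euler_error(q_pred, q_true):
    """Average absolute per-axis Euler angle error (degrees) over a test set."""
    errs = []
    for qp, qt in zip(q_pred, q_true):
        d = np.abs(quat_to_euler(qp) - quat_to_euler(qt))
        errs.append(np.minimum(d, 360.0 - d))  # handle angle wrap-around
    return float(np.mean(errs))
```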

In the following subsections, we first evaluate different parameters of our network, showing the relative importance of each of our contributions. We then compare our siamese regression network with baseline and state of the art methods, showing the superiority of our method.

4.1 Parameter Evaluation

Fig. 2(a) shows the evaluation of the different parts of our network, starting from a simple regression network and gradually adding elements of our siamese regression network. We show both the performance of our end-to-end pose estimation and the performance of the produced features using nearest neighbor template matching. The simple regression network performs worst, and performance gradually increases by using a better formed batch, then the normalization layers, with the best performance achieved when adding our feature learning term to the objective function. Interestingly, we notice that our features using nearest neighbor exhibit similar behavior, but the increase in performance is more significant. When using our feature learning term, nearest neighbor slightly outperforms the end-to-end regression. As we will see in the next subsection, the regression is more affected by overfitting, given the small size of the LINEMOD dataset we used for the analysis.

We also experiment with the number of pairs required for our feature term. Fig. 2(b) shows the regression performance for different numbers of pairs used. Using a batch that contains 300 training samples, we see that the more pairs we use, the better the performance. However, above 100 pairs we did not get any further significant improvement.

Last, we evaluated our network with different feature sizes, shown in Fig. 2(c). Again we see that the more features used, the better the performance achieved. Above a size of 32, however, there is no significant improvement, which is on par with what was reported in wohlhart2015learning .

4.2 State Of The Art Comparisons

Last, we performed a final evaluation of our siamese regression network against the method of wohlhart2015learning , which is the most relevant work to ours and directly comparable. This work uses triplets and pairs formed from the training samples and learns to minimize an objective function using a convolutional neural network. This objective only tries to increase the Euclidean distance in feature space between dissimilar samples, while enforcing similar samples to be close. The idea behind this is to build a mapping appropriate for nearest neighbor matching with a set of templates.

Results are shown in Table 2. We see that both the end-to-end regression and our learned features outperform the previous work. One reason for this is that the objective used by wohlhart2015learning does not take into account the actual task objective, which is pose estimation. Our method has the ultimate goal of learning the object pose directly, and therefore constructs features more appropriate for this task. On the other hand, we see that on the small dataset of hinterstoisser2012accv , nearest neighbor performs slightly better than the end-to-end regression, which is prone to overfitting given the dataset size. When experimenting on our larger occlusion dataset, we see that the end-to-end regression is able to converge to a better minimum. It is clear that both our features and our regression significantly outperform wohlhart2015learning . Furthermore, our formulation gives us another opportunity to further improve performance when we can synthetically generate the occluded and the clean image of an object. We see that by using Eq. 5 the pose error decreases even further, reaching the accuracy levels of the LINEMOD dataset, which does not contain occlusions. We note that the method of wohlhart2015learning was also trained with both clean and occluded images.

Fig. 3 illustrates images and results from our novel hand-object occlusion dataset. From left to right, the columns show: a real RGBD image; a synthetic one rendered using our tracker result; the rendered non-occluded object that corresponds to the exact pose of the respective occluded real RGBD image; and our network's final estimation.

Our implementation is written in Theano. Training one epoch on an Nvidia Titan X takes about 15 minutes for our dataset and about 20 seconds on LINEMOD. Evaluating one image of our dataset takes 4ms for regression and 6ms for NN; for the LINEMOD dataset, evaluating an image takes about 2ms for regression and 4ms for NN.

Figure 3: Our occlusion dataset. The first column shows a real RGBD image, the second column shows the synthetic image rendered using the tracking ground truth annotation, the third column shows the rendered non-occluded object corresponding to the occluded image, and the fourth column shows our network's final estimation.
Object | Nearest Neighbor wohlhart2015learning | Siamese Regression Network (End-to-End) | Siamese Regression Features + NN | Siamese Regression + Occlusion Term
ape | 15 | 12.3 | 11.8 | -
benchviseblue | 15.5 | 15.6 | 13.2 | -
camera | 12 | 10.9 | 10.1 | -
can | 15.5 | 14.5 | 12.3 | -
cat | 14 | 12.1 | 10.4 | -
driller | 17.8 | 16.7 | 13.2 | -
duck | 13.9 | 13.1 | 10.9 | -
holepuncher | 13.2 | 12.9 | 11.4 | -
iron | 11.4 | 11.6 | 10.2 | -
lamp | 13.3 | 12.6 | 11.1 | -
phone | 18.2 | 12.9 | 11.7 | -
average | 14.5 | 13.2 | 11.4 | -
belt (occlusion dataset) | 25.2 | 13.2 | 14.3 | 11.8
Table 2: State of the art and self comparisons of our method against the one of Wohlhart et al. wohlhart2015learning on the LINEMOD dataset hinterstoisser2012accv and our novel hand-object dataset (average Euler angle error, in degrees).

5 Conclusion

We presented Siamese Regression Networks, a convolutional network that is able to perform object pose regression directly in angle space, by enforcing distance similarity between feature and pose space among the training samples. Such a network is able to learn more discriminative features that are optimal for the pose regression task and outperform the state of the art. Last, our feature-guided pose estimation can be easily modified to learn features that are robust to occlusions, achieving accuracy comparable to occlusion-free images when tested on our own severe occlusion-by-hand dataset. As future work, we would like to investigate how this network can be extended to simultaneously tackle object localization as well as object classification.

References

  • (1) L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, 2012.
  • (2) E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother. Learning 6d object pose estimation using 3d object coordinates. In ECCV, 2014.
  • (3) E. Brachmann, F. Michel, A. Krull, M. Ying Yang, S. Gumhold, and C. Rother. Uncertainty-driven 6d pose estimation of objects and scenes from a single rgb image. In CVPR, 2016.
  • (4) A. Crivellaro, M. Rad, Y. Verdie, K. Moo Yi, P. Fua, and V. Lepetit. A novel representation of parts for accurate 3d object detection and tracking in monocular images. In ICCV, 2015.
  • (5) A. Doumanoglou, R. Kouskouridas, S. Malassiotis, and T.-K. Kim. Recovering 6d object pose and predicting next-best-view in the crowd. In CVPR, 2016.
  • (6) B. Drost, M. Ulrich, N. Navab, and S. Ilic. Model globally, match locally: Efficient and robust 3d object recognition. In CVPR, 2010.
  • (7) R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
  • (8) S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In ACCV, 2012.
  • (9) T. Hodan, X. Zabulis, M. Lourakis, S. Obdrzalek, and J. Matas. Detection and fine 3d pose estimation of texture-less objects in rgb-d images. In IROS, 2015.
  • (10) E. Hoffer and N. Ailon. Deep metric learning using triplet network. In Similarity-Based Pattern Recognition, 2015.
  • (11) E. Johns, S. Leutenegger, and A. J. Davison. Pairwise decomposition of image sequences for active multi-view recognition. In CVPR, 2016.
  • (12) A. Kendall, M. Grimes, and R. Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In ICCV, 2015.
  • (13) A. Krull, E. Brachmann, F. Michel, M. Ying Yang, S. Gumhold, and C. Rother. Learning analysis-by-synthesis for 6d pose estimation in rgb-d images. In ICCV, 2015.
  • (14) J. J. Lim, A. Khosla, and A. Torralba. Fpm: Fine pose parts-based model with 3d cad models. In ECCV, 2014.
  • (15) R. Mottaghi, Y. Xiang, and S. Savarese. A coarse-to-fine model for 3d pose estimation and sub-category recognition. In CVPR, 2015.
  • (16) I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Efficient model-based 3d tracking of hand articulations using kinect. In BMVC, 2011.
  • (17) R. Rios-Cabrera and T. Tuytelaars. Discriminatively trained templates for 3d object detection: A real time scalable approach. In ICCV, 2013.
  • (18) R. B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (fpfh) for 3d registration. In ICRA, 2009.
  • (19) E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In ICCV, 2015.
  • (20) S. Song and J. Xiao. Sliding shapes for 3d object detection in depth images. In ECCV, 2014.
  • (21) Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, 2014.
  • (22) A. Tejani, D. Tang, R. Kouskouridas, and T.-K. Kim. Latent-class hough forests for 3d object detection and pose estimation. In ECCV, 2014.
  • (23) J. Tompson, M. Stein, Y. Lecun, and K. Perlin. Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics (TOG), 2014.
  • (24) J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In CVPR, 2014.
  • (25) P. Wohlhart and V. Lepetit. Learning descriptors for object recognition and 3d pose estimation. In CVPR, 2015.
  • (26) Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, 2015.
  • (27) K. M. Yi, Y. Verdie, P. Fua, and V. Lepetit. Learning to assign orientations to feature points. In CVPR, 2016.