Siamese Regression Networks with Efficient Mid-level Feature Extraction for 3D Object Pose Estimation
Abstract
In this paper we tackle the problem of estimating the 3D pose of object instances using convolutional neural networks. State-of-the-art methods usually solve the challenging problem of regression in angle space indirectly, focusing on learning discriminative features that are later fed into a separate architecture for 3D pose estimation. In contrast, we propose an end-to-end learning framework for directly regressing object poses by exploiting siamese networks. For a given image pair, we enforce a similarity measure between the representations of the sample images in the feature and pose space respectively, which is shown to boost regression performance. Furthermore, we argue that the pose-guided feature learning of our Siamese Regression Network generates more discriminative features that outperform the state of the art. Last, our feature learning formulation provides the ability to learn features that can perform under severe occlusions, demonstrating high performance on our novel hand-object dataset.
1 Introduction
Detecting objects and estimating their 3D pose is one of the most challenging tasks in computer vision, since severe occlusions, background clutter and large scale changes dramatically affect the performance of any contemporary solution. State-of-the-art methods make use of Hough Forests for casting patch votes in the 3D space tejani2014latent ; andoum2016recovering , or train CNNs to either perform classification into the quantized 3D space johns2016pairwise or learn features that are later fed to a nearest neighbor scheme for 3D object pose template matching wohlhart2015learning . The lack of research on deep networks for angle regression suggests that directly regressing object poses in angle space is not trivial, with the objective function appearing to have many local minima. Despite the recent advances in classification tasks using CNNs, a framework that performs direct regression in angle space while jointly learning discriminative features has yet to be built.
Towards this end, and contrary to the state-of-the-art methods, in this paper we are interested in learning an end-to-end framework that directly regresses object pose angles. Recent works sun2014deep ; hoffer2015deep demonstrated successful results using siamese networks, which can improve a network's learning capabilities by exploiting additional information about the relationship between training samples. Inspired by this, we present the Siamese Regression Network, which enforces a relationship between feature and pose space by applying a novel loss function that boosts the performance of a regression network layer. This network is thus able to perform single-shot, end-to-end regression for object pose estimation, without requiring pairs of inputs at test time. Apart from that, we experimentally evaluate the effect of other factors that play an important role in successful regression, such as feature normalization and batch formation. Moreover, our Siamese Regression Network proves to learn more discriminative features, optimized for our particular problem, compared to a state-of-the-art feature learning technique using CNNs wohlhart2015learning . Finally, we are interested in handling severe occlusions in the object pose estimation task, a particularly interesting problem constantly arising in real-life applications. Estimating an accurate pose when a significant portion of the object is missing is a very challenging task which drastically degrades the performance of prior art wohlhart2015learning ; hinterstoisser2012accv ; brachmann2014learning ; lim2014fpm ; wu20153d ; mottaghi_cvpr15 . We show how our loss function can be easily modified to handle cases of severe occlusion. To evaluate our regressor in such cases, we built our own challenging dataset, which shows an object being manipulated by a human hand. Results show that our method handles severe occlusion very well, reaching the accuracy levels of non-occluded objects.
In summary, our paper offers the following contributions:


We present the Siamese Regression Network which, to the best of our knowledge, is the first CNN-based framework for regressing object poses in angle space.

We boost the performance of our system by introducing a novel loss function for feature-guided pose regression.

In turn, we show that pose-guided feature learning results in features that are more discriminative than those of wohlhart2015learning and, as experimentally shown, optimized for the particular task of 3D object pose estimation.

We show how our loss function can be adapted to deal with severe occlusions, and evaluate our system on a new challenging dataset containing an object captured under severe occlusions. Furthermore, experimental evaluation on a benchmark dataset hinterstoisser2012accv provides evidence of our system outperforming the state of the art.
The remainder of the paper is organized as follows. In Section 2 we provide an overview of the related work, while our proposed approach is introduced in Section 3. In Section 4 we present an evaluation of our method compared to the state of the art on two datasets. Finally, in Section 5 we conclude with final remarks and an outlook on future work.
2 Related Work
Recognizing and detecting objects along with estimating their 3D pose has received a lot of attention in the literature. Early works made use of point clouds to facilitate point-to-point matching drost2010model ; rusu2009fast , while the advent of low-cost depth sensors hinterstoisser2012accv ; rios2013discriminatively provided additional data in favor of texture-less objects. Hinterstoisser et al. hinterstoisser2012accv designed a powerful holistic template matching method (LINEMOD) based on RGB-D data that suffers in cases of occlusion. Tejani et al. tejani2014latent integrated LINEMOD into Hough Forests to tackle the problems of occlusion and clutter. The work of Brachmann et al. brachmann2014learning , along with its recent extension to RGB-only images brachmann2016uncertainty , employs a representation framework that jointly maps 3D object coordinates and class labels. Hodan et al. hodan2015detection presented a method that tackles the complexity of sliding window approaches via a fast-filtering technique followed by a voting procedure for hypothesis generation, while fine 3D pose estimation is performed via a stochastic, population-based optimization scheme. In turn, in song2014sliding exemplar SVMs are slid in the 3D space to perform object pose classification based on depth images.
Deep learning has only recently found application in the 3D object pose estimation problem. Doumanoglou et al. andoum2016recovering suggested using a network of stacked sparse autoencoders to automatically learn features in an unsupervised manner, which are fed to Hough Forests for 6D object pose recovery and next-best-view estimation. In johns2016pairwise , Johns et al. employed a CNN-based end-to-end learning framework for classification of object poses in the 3D space and next-best-view prediction. In turn, in crivellaro2015novel a CNN was used to learn projections of 3D control points for accurate 3D object tracking, while in krull2015learning a CNN is utilized in a probabilistic framework to perform analysis-by-synthesis as a final refinement step for object pose estimation. In wohlhart2015learning , 3D pose estimation is performed by a scalable nearest neighbor method on discriminative feature descriptors learned by a CNN.
To the best of our knowledge, this paper presents the first CNN-based framework for regressing object poses in the continuous 3D space. The work of Kendall et al. kendall2015posenet regresses camera poses in the continuous 3D space (camera pose estimation being the inverse of object pose estimation), but does not offer end-to-end learning, since it makes use of a pretrained VGG network, adding just a final layer for regressing the four quaternion components of the camera pose. The method of Sun et al. sun2014deep offers a framework that learns a new face representation by joint identification-verification. As far as feature learning for 3D object pose estimation is concerned, our work shares similar ideas with the method of Wohlhart et al. wohlhart2015learning , which learns feature descriptors using pairs and triplets. However, we argue that our learned features are pose-guided and, as our experiments show, more discriminative, which suggests that they are optimized for the particular task of 3D object pose estimation.
3 Siamese Regression Network
3.1 Object Pose Estimation Using Regression CNN
We first formulate object pose estimation as a regression problem. Let $x \in \mathbb{R}^{w \times h \times 4}$ be an RGB-D (4-channel) image depicting a centered object of width $w$ and height $h$. Pose estimation is the problem of learning a regressor $g: \mathbb{R}^{w \times h \times 4} \rightarrow \mathbb{R}^{d}$, where $d$ is the dimensionality of the pose representation used. For example, Euler angles require three angles to be defined ($d = 3$), whereas quaternions suggest $d = 4$. Regressing Euler angles directly can be problematic due to issues such as periodicity yi2015learning and the non-continuous nature of the Euler angle space kendall2015posenet . For example, poses that are very similar visually might be far apart in Euler angle space, making regression harder. Therefore, similar to previous work on regressing camera pose kendall2015posenet , we also use the quaternion representation, which does not suffer from these problems.
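To make the Euler-angle pathology concrete, the sketch below (a NumPy illustration using the standard ZYX conversion formulas, not code from the paper) converts Euler angles to a unit quaternion and shows that two visually near-identical poses are far apart in Euler space but almost coincide in quaternion space (up to the usual sign ambiguity):

```python
import numpy as np

def euler_to_quaternion(roll, pitch, yaw):
    """Convert Euler angles (radians, ZYX convention) to a unit quaternion (w, x, y, z)."""
    cr, sr = np.cos(roll / 2), np.sin(roll / 2)
    cp, sp = np.cos(pitch / 2), np.sin(pitch / 2)
    cy, sy = np.cos(yaw / 2), np.sin(yaw / 2)
    return np.array([
        cr * cp * cy + sr * sp * sy,
        sr * cp * cy - cr * sp * sy,
        cr * sp * cy + sr * cp * sy,
        cr * cp * sy - sr * sp * cy,
    ])

# Periodicity: yaw = -179 deg and yaw = +179 deg are only 2 deg apart visually,
# yet 358 deg apart in Euler space. Their quaternions are nearly identical
# (q and -q encode the same rotation, hence the min over both signs).
q1 = euler_to_quaternion(0.0, 0.0, np.deg2rad(-179))
q2 = euler_to_quaternion(0.0, 0.0, np.deg2rad(179))
quat_gap = min(np.linalg.norm(q1 - q2), np.linalg.norm(q1 + q2))
```

Here a Euclidean regression target built on quaternions penalizes the two poses as near-identical, while a Euler-angle target would treat them as maximally different.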
To estimate the regression function $g$ we train a convolutional neural network (CNN). We use a simple architecture similar to wohlhart2015learning , consisting of 2 convolutional layers and 2 fully connected layers (we removed the max-pooling layers as we found that they slightly degraded performance). On top of that, we added another fully connected layer that outputs $d$ units to estimate the object pose. If we consider the layer just before the last regression layer as the features learned by the network, we can describe the output of our network as:
$\hat{y} = r(f(x))$ (1)
where $f(x)$ is the output of the feature layer, $r(\cdot)$ the regression layer function, and $\hat{y}$ is the pose vector returned by the network for the input image $x$. Given a training set that contains training samples of the form $(x_i, y_i)$, the most common way of training a regression network is minimizing the Mean Square Error (MSE) between the estimation and the ground truth and backpropagating the error. If we split the training set into mini-batches of $N$ samples each, the regression loss can be written as:
$L_{pose} = \frac{1}{N} \sum_{i=1}^{N} \left\| r(f(x_i)) - y_i \right\|_2^2$ (2)
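As an illustration, the batch regression loss of Eq. 2 can be sketched in a few lines of NumPy (a hedged stand-in for the actual Theano implementation):

```python
import numpy as np

def pose_regression_loss(pred, gt):
    """Batch MSE between predicted and ground-truth pose vectors (Eq. 2).
    Each row is one pose vector, e.g. a unit quaternion."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    # squared L2 distance per sample, averaged over the mini-batch
    return float(np.mean(np.sum((pred - gt) ** 2, axis=1)))
```

A perfect prediction gives zero loss; any per-sample deviation contributes its squared Euclidean norm to the batch average.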
3.2 Siamese Regression Objective
Previous work has shown that the feature layer is able to learn representations that can be successfully applied to nearest neighbor matching wohlhart2015learning or face identification-verification sun2014deep . However, end-to-end regression learning with CNNs in angle space has proved to be a very challenging task, with researchers resorting to indirect solutions, such as nearest neighbor template matching wohlhart2015learning or ad-hoc angle estimation methods like yi2015learning . Therefore, inspired by sun2014deep , we want to enhance the feature learning process by using additional information available during training, in order to help the end-to-end regressor converge to a better minimum. Thus, our goal is to enforce a loss function on the feature layer $f$, such that the learned features are more appropriate and useful for the regression task in the last layer $r$.
In order to enforce a second loss function on this layer, we utilize the siamese architecture, which has been very successful for learning nonlinear feature embeddings with convolutional neural networks hadsell2006dimensionality . The siamese architecture consists of two (or more) branches of the same CNN that share weights and encode two inputs processed in parallel. Subsequently, a loss function can be introduced based on both outputs, which makes it possible to compare different samples of our training data passing through the network in a meaningful way.
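The weight-sharing idea can be illustrated with a toy NumPy branch (a stand-in for the actual CNN; the layer sizes here are arbitrary assumptions): both branch outputs are produced by the very same parameters, so a loss defined over both outputs still trains a single network:

```python
import numpy as np

def forward(x, params):
    """A toy two-layer network standing in for one branch of the siamese CNN."""
    W1, b1, W2, b2 = params
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer
    return h @ W2 + b2

rng = np.random.default_rng(0)
params = (rng.standard_normal((8, 16)), np.zeros(16),
          rng.standard_normal((16, 4)), np.zeros(4))

# Two inputs, ONE set of weights: this is the weight sharing of the siamese setup.
x1, x2 = rng.standard_normal(8), rng.standard_normal(8)
f1, f2 = forward(x1, params), forward(x2, params)
```

Because the branches are identical, gradients from a pairwise loss on (f1, f2) accumulate into the same parameters, and at test time a single branch suffices.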
Our study of the regression problem concluded that there is a relation between the feature and the angle space which helps a regression network layer perform much better. The relationship is the following: during training, the Euclidean distance between two sample images represented in feature space should be kept the same as the distance between the same samples represented in angle space. Fig. 1 shows an illustration of this idea. In order to enforce such a relation, we use a siamese network to pass pairs of samples through the network and apply an objective term on them. The pairs have the form $\{(x_i, y_i), (x_j, y_j)\}$, where $x$ represents the raw input and $y$ the ground-truth pose vector. We enforce the following loss function for feature-guided regression:
$L_{feat} = \sum_{(i,j)} \left( \left\| f(x_i) - f(x_j) \right\|_2 - \left\| y_i - y_j \right\|_2 \right)^2$ (3)
Intuitively, minimizing this loss enforces the distance between the features of the sample pair to be close to the distance between their ground-truth poses. In order to avoid weighting either part of the objective term, we normalize the output of the feature layer as well as the output of the pose layer to have unit norm (when using quaternions as the pose representation, the ground-truth poses already have unit norm). In fact, as we show in the experiments, this normalization has a positive effect on training angle regression even without our extra feature term.
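A minimal NumPy sketch of this feature-guided term (Eq. 3), including the unit normalization of the features discussed above; this is an illustration of the loss on raw vectors, not the paper's Theano code, and it assumes the pose vectors are already unit quaternions:

```python
import numpy as np

def l2_normalize(v, eps=1e-8):
    """Normalize vectors to (approximately) unit length along the last axis."""
    v = np.asarray(v, float)
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def siamese_feature_loss(f1, f2, y1, y2):
    """Eq. 3 sketch: squared mismatch between the pair's feature distance
    (after unit normalization) and its pose distance."""
    d_feat = np.linalg.norm(l2_normalize(f1) - l2_normalize(f2), axis=-1)
    d_pose = np.linalg.norm(np.asarray(y1, float) - np.asarray(y2, float), axis=-1)
    return float(np.mean((d_feat - d_pose) ** 2))
```

The loss vanishes whenever the feature-space and pose-space distances agree, regardless of the absolute feature values, which is exactly the relation the siamese pair enforces.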
It should be mentioned that the siamese setup is only used during training to help the regression task. During testing, a single image produces a pose estimation, without the need to provide a paired image.
3.3 Feature Guided Pose Regression
Combining the regression loss $L_{pose}$ with the feature loss $L_{feat}$, we get:
$L = L_{pose} + \lambda L_{feat} + \gamma \left\| W \right\|_2^2$ (4)
where the last term regularizes the weights $W$ of the convolutional neural network and $\lambda$ balances the two losses. By enforcing this loss in the proposed siamese regression network, we are able to simultaneously focus both on features that work well in a nearest neighbor framework and on the fully connected last layer that regresses the poses directly. Indeed, our experiments show that enforcing the feature term in the loss leads to better pose estimation in the final layer.
In Table 1 we describe the relationship between the two loss functions, $L_{pose}$ and $L_{feat}$, and the parts of the CNN weights that they update. We note that the weights related to feature learning are updated using information from both losses, while the weights related to pose regression are updated based on the $L_{pose}$ loss only.
3.4 Pose Guided Feature Learning
Although the loss function $L$ from Eq. 4 mainly aims to learn a better regressor for the pose in the final layer of the convolutional network, it can be argued that the features learned in the feature layer $f$ of the network also become more discriminative.
Previous work on 3D feature learning with siamese networks has focused on optimizing feature embeddings using triplets. Triplet training samples contain an anchor, a positive sample and a negative sample. The authors of wohlhart2015learning form the triplet by using two close views of the object as anchor and positive samples, and a view with a significantly different pose as the negative. Their objective pushes the anchor and the positive sample to be closer in feature space than the anchor and the negative one. They also use pairs of images with similar pose but different appearance and minimize their distance in feature space, in order to learn features immune to different lighting conditions and noise.
Our loss, on the other hand, focuses on forcing the feature distance between a pair to match the pose distance. It is thus more appropriate for a nearest neighbor framework, since the features are optimized to be relative to the pose distance. Indeed, our experiments show that enforcing our loss from Eq. 4 results in features that are better suited for nearest neighbor matching.
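For completeness, nearest-neighbor pose lookup on learned features, the setting used in these comparisons, can be sketched as follows (the feature and pose arrays are hypothetical; this is not the paper's implementation):

```python
import numpy as np

def nn_pose_estimate(query_feat, template_feats, template_poses):
    """Return the pose of the template whose learned feature lies closest
    (in Euclidean distance) to the query feature."""
    d = np.linalg.norm(np.asarray(template_feats, float)
                       - np.asarray(query_feat, float), axis=1)
    return template_poses[int(np.argmin(d))]
```

With pose-guided features, nearby features imply nearby poses by construction, so this lookup directly benefits from the loss in Eq. 4.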
We should note that using the objective function of wohlhart2015learning instead of $L_{feat}$ in order to help regression did not work, with the network showing convergence behavior similar to simple regression without the extra term. This indicates that our objective does indeed help regression, while at the same time regression helps build more discriminative features appropriate for pose estimation.
input: training set $S$, CNN feature parameters $W_f$, CNN pose parameters $W_r$, learning rate $\eta$
for epoch $e = 1{:}N$ do
    sample $M$ mini-batches from $S$ using pairs with both similar and different poses
    for mini-batch $b = 1{:}M$ do
        update $W_f \leftarrow W_f - \eta \nabla_{W_f} (L_{pose} + \lambda L_{feat})$
        update $W_r \leftarrow W_r - \eta \nabla_{W_r} L_{pose}$
    end
end
output: $W_f$, $W_r$
3.5 Siamese Pair Sampling
Considering a dataset of $K$ training samples of the form $(x_i, y_i)$, there exist $K(K-1)/2$ possible pairs to be used in the siamese training process described above. Since the number of pairs can become very large, several authors have explored different techniques for sampling or mining hard negative pairs wang2014learning ; simo2015discriminative .
Although we do not explicitly have positive and negative pairs, since training is performed on the same object, we approximate such pairs by splitting a batch between pairs that are close in pose space and pairs with very large pose differences. We found that choosing pairs randomly with respect to their pose difference performs worse than this formation. Interestingly, forming batches using both similar and different pose pairs is a factor that improves regression on its own, even without enforcing any constraints on such pairs. In the experiments we show the relative performance gain of well-formed batches compared to simple regression and to enforcing our sample pair objective.
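The batch-formation heuristic can be sketched as follows; the 30-degree similarity threshold and the even close/far split are our own assumptions for illustration, as the paper does not state exact values:

```python
import numpy as np

def quaternion_angle(q1, q2):
    """Rotation angle (radians) between two unit quaternions."""
    return 2.0 * np.arccos(np.clip(abs(float(np.dot(q1, q2))), -1.0, 1.0))

def form_pairs(poses, n_pairs, rng, thresh=np.deg2rad(30)):
    """Sample index pairs split evenly between 'similar pose' and 'different
    pose' pairs (the threshold is an illustrative assumption)."""
    close, far = [], []
    while len(close) + len(far) < n_pairs:
        i, j = rng.integers(0, len(poses), size=2)
        if i == j:
            continue
        if quaternion_angle(poses[i], poses[j]) < thresh:
            if len(close) < n_pairs // 2:
                close.append((int(i), int(j)))
        elif len(far) < n_pairs - n_pairs // 2:
            far.append((int(i), int(j)))
    return close + far
```

Rejection sampling keeps drawing random pairs until both the close and the far bucket are filled, yielding a batch with controlled pose-difference statistics.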
3.6 Handling occlusions
Tackling occlusions, that is, estimating the object pose when a major part of the object is missing or occluded, requires features that are robust under such conditions, and one should explicitly enforce this property. The form of Eq. 4 makes it convenient to build such features: if we generate training samples with the object occluded and, using the annotation, render a clean object in the same pose, we can enforce a similar term between the occluded and the clean images:
$L_{occ} = \sum_i \left\| f(x_i^{occ}) - f(x_i^{cl}) \right\|_2^2$ (5)
where $x^{occ}$ and $x^{cl}$ are images depicting the occluded and the clean object respectively. Fig. 3 in the experiments section shows examples of such images. This term can be added to the loss $L$ in order to tackle the severe occlusion problem.
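A sketch of the occlusion term of Eq. 5 on (hypothetical) feature vectors of occluded and clean renderings of the same pose, with the unit normalization used elsewhere in the loss; an illustration, not the paper's implementation:

```python
import numpy as np

def l2_normalize(v, eps=1e-8):
    v = np.asarray(v, float)
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def occlusion_feature_loss(f_occluded, f_clean):
    """Eq. 5 sketch: pull the feature of an occluded view towards the feature
    of a clean rendering of the object in the same pose."""
    diff = l2_normalize(f_occluded) - l2_normalize(f_clean)
    return float(np.mean(np.sum(diff ** 2, axis=-1)))
```

Minimizing this term makes the feature of an occluded view collapse onto the feature of its clean counterpart, which is what allows accuracy under occlusion to approach the occlusion-free case.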
4 Experiments
Our convolutional regressor is a simple convolutional neural network with an architecture similar to wohlhart2015learning , consisting of convolutional layers followed by fully connected layers. Note that the feature layer is of variable length, which allows a trade-off between feature extraction size and performance. We use the same nonlinearity in all our convolutional and fully connected layers, apart from the last layer that produces the pose estimation. For training the network we use stochastic gradient descent bottou2012stochastic with momentum, and we decay the initial learning rate in each epoch to avoid oscillations around local minima.
To evaluate our method we used two datasets. The first, which is also used for our parameter analysis, is LINEMOD hinterstoisser2012accv . More specifically, we worked with a variant of the RGB-D images as used by wohlhart2015learning , where the objects are centered in the image so that no localization is required. This dataset contains about 3000 training images and 1000 test images per object.
The second dataset is constructed by us and depicts a small object (a car belt) being manipulated by a human hand, as seen in Fig. 3. The focus of this dataset is to introduce realistic occlusions that are severe and have a stronger effect on the learning process than using different object instances or types. To construct it, we recorded an RGB-D video of a human manipulating the object using an Asus Xtion sensor, and used a particle swarm optimization tracker tompson2014real ; oikonomidis2011efficient to track both the hand pose and the object pose. With this information we can easily generate the pairs needed for our occlusion term (Eq. 5). Such a scenario appears in autonomous learning of object manipulation by robots, where the task is demonstrated by a human. This dataset is very challenging, since the human hand introduces high levels of occlusion which can significantly degrade the accuracy of pose estimation. Moreover, our dataset is larger, with about 21000 training and 5000 testing images.
Regarding the evaluation metric, we use the average Euler angle error, i.e., the average absolute difference in angle (in degrees) between the estimated pose and the ground truth, measured about the three principal axes. This metric is appropriate for our regression task and matches the one used in wohlhart2015learning . Since we are using quaternions, we transform the estimates into Euler angles before performing the comparisons.
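The evaluation metric can be sketched as below, reconstructed from its description; the quaternion-to-Euler conversion uses the standard ZYX formulas, and the wrap-around handling is our assumption:

```python
import numpy as np

def quaternion_to_euler(q):
    """Unit quaternion (w, x, y, z) -> Euler angles (roll, pitch, yaw) in radians."""
    w, x, y, z = q
    roll = np.arctan2(2 * (w * x + y * z), 1 - 2 * (x * x + y * y))
    pitch = np.arcsin(np.clip(2 * (w * y - z * x), -1.0, 1.0))
    yaw = np.arctan2(2 * (w * z + x * y), 1 - 2 * (y * y + z * z))
    return np.array([roll, pitch, yaw])

def mean_euler_error_deg(pred_quats, gt_quats):
    """Average absolute per-axis Euler angle error in degrees."""
    errs = []
    for qp, qg in zip(pred_quats, gt_quats):
        diff = np.abs(quaternion_to_euler(qp) - quaternion_to_euler(qg))
        diff = np.minimum(diff, 2 * np.pi - diff)  # wrap differences into [0, pi]
        errs.append(np.degrees(diff).mean())
    return float(np.mean(errs))
```

For example, a prediction rotated 10 degrees about a single axis relative to the ground truth yields an average per-axis error of 10/3 degrees.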
In the following subsections we first evaluate different parameters of our network, showing the relative importance of each of our contributions. We then compare our Siamese Regression Network with baseline and state-of-the-art methods, showing the superiority of our method.
4.1 Parameter Evaluation
Fig. 1(a) shows the evaluation of the different parts of our network, starting from a simple regression network and gradually adding the elements of our Siamese Regression Network. We show both the performance of our end-to-end pose estimation and the performance of the produced features using nearest neighbor template matching. The simple regression network performs worst; performance gradually increases by using better-formed batches, then the normalization layers, and the best performance is achieved when adding our feature learning term to the objective function. Interestingly, our features used with nearest neighbor matching exhibit similar behavior, but the increase in performance is more significant: with our feature learning term, nearest neighbor matching slightly outperforms the end-to-end regression. As we will see in the next subsection, the regression is more affected by overfitting due to the small size of the LINEMOD dataset used for this analysis.
We also experimented with the number of pairs required for our feature term. Fig. 1(b) shows the regression performance for different numbers of pairs. Using a batch that contains 300 training samples, we see that the more pairs we use, the better the performance. However, above 100 pairs we did not observe any further significant improvement.
Last, we evaluated our network with different feature sizes, shown in Fig. 1(c). Again, the more features used, the better the performance achieved. Above a size of 32, however, there is no significant improvement, which is on par with what was reported in wohlhart2015learning .
4.2 State Of The Art Comparisons
Finally, we evaluated our Siamese Regression Network against the method of wohlhart2015learning , which is the most relevant work to ours and directly comparable. This work uses triplets and pairs formed from the training samples and learns to minimize an objective function using a convolutional neural network. This objective only tries to increase the Euclidean distance in feature space between dissimilar samples, while enforcing similar samples to be close. The idea behind this is to build a mapping appropriate for nearest neighbor matching against a set of templates.
Results are shown in Table 2. Both the end-to-end regression and our learned features outperform the previous work. One reason for this is that the objective used by wohlhart2015learning does not take into account the actual task objective, which is pose estimation. Our method has the ultimate goal of learning the object pose directly, and therefore constructs features more appropriate for this task. On the other hand, we see that on the small dataset of hinterstoisser2012accv , nearest neighbor matching performs slightly better than the end-to-end regression, which is prone to overfitting given the dataset size. On our larger occlusion dataset, the end-to-end regression is able to converge to a better minimum. It is clear that both our features and our regression significantly outperform wohlhart2015learning . Furthermore, our formulation offers a further opportunity to improve performance when we can synthetically generate the occluded and the clean image of an object: by using Eq. 5 the pose error decreases even further, reaching the accuracy levels of the LINEMOD dataset, which does not contain occlusions. We note that the method of wohlhart2015learning was also trained with both clean and occluded images.
Fig. 3 illustrates images and results from our novel hand-object occlusion dataset. From left to right, the columns represent: a real RGB-D image; a synthetic one rendered using our tracker result; the rendered non-occluded object that corresponds to the exact pose of the respective occluded real RGB-D image; and our network's final estimation.
Our implementation was written in Theano. Training one epoch on an Nvidia Titan X takes about 15 minutes for our dataset and about 20 seconds on LINEMOD. Evaluating one image of our dataset takes 4 ms for regression and 6 ms for NN matching, while evaluating a LINEMOD image takes about 2 ms for regression and 4 ms for NN matching.







Table 2: Average Euler angle error (in degrees) on the LINEMOD dataset hinterstoisser2012accv (per object) and on our hand-object occlusion dataset (last row).

object | wohlhart2015learning | Ours (regression) | Ours (NN) | Ours (regression + Eq. 5)
ape | 15 | 12.3 | 11.8 | -
benchviseblue | 15.5 | 15.6 | 13.2 | -
camera | 12 | 10.9 | 10.1 | -
can | 15.5 | 14.5 | 12.3 | -
cat | 14 | 12.1 | 10.4 | -
driller | 17.8 | 16.7 | 13.2 | -
duck | 13.9 | 13.1 | 10.9 | -
holepuncher | 13.2 | 12.9 | 11.4 | -
iron | 11.4 | 11.6 | 10.2 | -
lamp | 13.3 | 12.6 | 11.1 | -
phone | 18.2 | 12.9 | 11.7 | -
average | 14.5 | 13.2 | 11.4 | -

hand-object (occluded) | 25.2 | 13.2 | 14.3 | 11.8
5 Conclusion
We presented Siamese Regression Networks, a convolutional network that is able to perform object pose regression in angle space directly, by enforcing distance similarity in feature and pose space among the training samples. Such a network learns more discriminative features, optimal for the pose regression task, which outperform the state of the art. Last, our feature-guided pose estimation can easily be modified to learn features that are robust to occlusions, achieving accuracy comparable to occlusion-free images when tested on our own severe occlusion-by-hand dataset. As future work, we would like to investigate how this network can be extended to simultaneously tackle object localization as well as object classification.
References
 (1) L. Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade. 2012.
 (2) E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother. Learning 6d object pose estimation using 3d object coordinates. In ECCV. 2014.
 (3) E. Brachmann, F. Michel, A. Krull, M. Ying Yang, S. Gumhold, and C. Rother. Uncertaintydriven 6d pose estimation of objects and scenes from a single rgb image. In CVPR. 2016.
 (4) A. Crivellaro, M. Rad, Y. Verdie, K. Moo Yi, P. Fua, and V. Lepetit. A novel representation of parts for accurate 3d object detection and tracking in monocular images. In ICCV, 2015.
 (5) A. Doumanoglou, R. Kouskouridas, S. Malassiotis, and T.K. Kim. Recovering 6d object pose and predicting nextbestview in the crowd. In CVPR. 2016.
 (6) B. Drost, M. Ulrich, N. Navab, and S. Ilic. Model globally, match locally: Efficient and robust 3d object recognition. In CVPR, 2010.
 (7) R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
 (8) S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab. Model based training, detection and pose estimation of textureless 3d objects in heavily cluttered scenes. In ACCV, 2012.
 (9) T. Hodan, X. Zabulis, M. Lourakis, S. Obdrzalek, and J. Matas. Detection and fine 3d pose estimation of textureless objects in rgbd images. In IROS, 2015.
 (10) E. Hoffer and N. Ailon. Deep metric learning using triplet network. In SimilarityBased Pattern Recognition. 2015.
 (11) E. Johns, S. Leutenegger, and A. J. Davison. Pairwise decomposition of image sequences for active multiview recognition. In CVPR, 2016.
 (12) A. Kendall, M. Grimes, and R. Cipolla. Posenet: A convolutional network for realtime 6dof camera relocalization. In ICCV, 2015.
 (13) A. Krull, E. Brachmann, F. Michel, M. Ying Yang, S. Gumhold, and C. Rother. Learning analysisbysynthesis for 6d pose estimation in rgbd images. In ICCV, 2015.
 (14) J. J. Lim, A. Khosla, and A. Torralba. Fpm: Fine pose partsbased model with 3d cad models. In ECCV. 2014.
 (15) R. Mottaghi, Y. Xiang, and S. Savarese. A coarsetofine model for 3d pose estimation and subcategory recognition. In CVPR, 2015.
 (16) I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Efficient modelbased 3d tracking of hand articulations using kinect. In BMVC, 2011.
 (17) R. RiosCabrera and T. Tuytelaars. Discriminatively trained templates for 3d object detection: A real time scalable approach. In ICCV, 2013.
 (18) R. B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (fpfh) for 3d registration. In ICRA, 2009.
 (19) E. SimoSerra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. MorenoNoguer. Discriminative learning of deep convolutional feature point descriptors. In ICCV, 2015.
 (20) S. Song and J. Xiao. Sliding shapes for 3d object detection in depth images. In ECCV. 2014.
 (21) Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identificationverification. In NIPS, 2014.
 (22) A. Tejani, D. Tang, R. Kouskouridas, and T.K. Kim. Latentclass hough forests for 3d object detection and pose estimation. In ECCV. 2014.
 (23) J. Tompson, M. Stein, Y. Lecun, and K. Perlin. Realtime continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics (TOG), 2014.
 (24) J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning finegrained image similarity with deep ranking. In CVPR, 2014.
 (25) P. Wohlhart and V. Lepetit. Learning descriptors for object recognition and 3d pose estimation. In CVPR, 2015.
 (26) Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, 2015.
 (27) K. M. Yi, Y. Verdie, P. Fua, and V. Lepetit. Learning to assign orientations to feature points. In CVPR, 2016.