
# BPnP: Further Empowering End-to-End Learning with Back-Propagatable Geometric Optimization

Bo Chen, Tat-Jun Chin
School of Computer Science, The University of Adelaide
Nan Li
College of Mathematics and Statistics, Shenzhen University
Shenzhen, 518060, China
nan.li@szu.edu.cn
###### Abstract

In this paper we present BPnP, a novel method for back-propagating through a PnP solver. We show that the gradients of such a geometric optimization process can be computed using the Implicit Function Theorem, as if the solver were differentiable. Furthermore, we develop a residual-conformity trick that makes end-to-end pose regression with BPnP smooth and stable. We also propose a “march in formation” algorithm which successfully uses BPnP for keypoint regression.

Our method opens the door to vast possibilities. The ability to incorporate geometric optimization into end-to-end learning will greatly boost its power and promote innovation in various computer vision tasks.

## 1 Introduction

In learning problems, geometric errors often cannot be used directly to guide training. For example, let h be a regression model which predicts 2D keypoint locations x from an input image I with trainable parameters θ, i.e.,

 x=h(I;θ), (1)

Let g be a PnP solver, which computes the 6DOF camera pose y from the 2D keypoint locations x, the corresponding 3D coordinates M of the structural keypoints expressed in the world frame, and the camera intrinsic matrix K, i.e.,

 y=g(x,M,K). (2)

For simplicity we omit the inputs M and K and use

 y=g(x) (3)

for the rest of the paper. Finally, let l be a loss function which computes the loss ℓ of the estimated pose:

 ℓ=l(y). (4)

A typical training scheme is to train the regression model h by updating θ with the gradients

 ∂ℓ/∂θ = (∂ℓ/∂y)(∂y/∂x)(∂x/∂θ). (5)

However, the problem with such a scheme is that a PnP solver is an optimization process, not simply a continuous or differentiable function. In other words, we have no way to compute ∂y/∂x in Equation 5.

To fill this gap, we introduce Back-propagatable PnP (BPnP), the first method able to back-propagate through a geometric optimization process using implicit gradients.

## 2 BPnP: the back-propagatable PnP with implicit gradients

In this section we describe step by step how BPnP is created.

### 2.1 The implicit function theorem

###### Theorem 1

Let f : ℝⁿ × ℝᵐ → ℝᵐ be a continuously differentiable function with input (x, y), where x ∈ ℝⁿ and y ∈ ℝᵐ. For a point (x0, y0), if

 f(x0,y0)=0 (6)

and the Jacobian matrix ∂f/∂y(x0, y0) is invertible, then there exists an open set U ⊆ ℝⁿ such that x0 ∈ U and a continuously differentiable function g : U → ℝᵐ such that y0 = g(x0) and

 f(x′,g(x′))=0,∀x′∈U. (7)

Moreover, the Jacobian matrix of g is given by

 ∂g/∂x(x′) = −[∂f/∂y(x′, g(x′))]⁻¹ [∂f/∂x(x′, g(x′))], ∀x′ ∈ U. (8)

For a function y = g(x), the implicit function theorem provides a way of computing the derivatives of y with respect to x without knowing the explicit form of g, provided that x and y are constrained by a continuously differentiable function f. Note that Equation 6 can be relaxed to

 f(x0,y0)=c, (9)

where c is a constant vector that does not depend on x or y, because for each such f there exists f̃(x, y) = f(x, y) − c which satisfies Equation 6 with ∂f̃/∂x = ∂f/∂x and ∂f̃/∂y = ∂f/∂y.
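To make the theorem concrete, here is a small worked example of our own (not part of the paper's derivation): the unit-circle constraint, for which Equation 8 reproduces the known derivative.

```latex
f(x, y) = x^2 + y^2 - 1 = 0, \quad y_0 > 0
\;\Rightarrow\; y = g(x) = \sqrt{1 - x^2}.

\frac{\partial g}{\partial x}
  = -\left[\frac{\partial f}{\partial y}\right]^{-1}\frac{\partial f}{\partial x}
  = -(2y)^{-1}(2x)
  = -\frac{x}{y}
  = \frac{-x}{\sqrt{1 - x^2}}
  = g'(x).
```

The implicit derivative agrees with differentiating the explicit form directly, which is exactly the property BPnP exploits when g has no explicit form.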

### 2.2 Constructing the constraint function f

To utilize the implicit function theorem, we first need to construct the constraint function f such that f(x, y) is constant. Recall that here x is the 2D keypoint locations and y is the camera pose; a natural idea is to utilize the projection error in constructing f.

Let [R|t] denote the extrinsic matrix defined by pose y and P = K[R|t] denote the 3-by-4 projection matrix. For the i-th keypoint, the projection equation is

 [uᵢ, vᵢ, wᵢ]ᵀ = P [Xᵢ, Yᵢ, Zᵢ, 1]ᵀ, (10)

where Xᵢ, Yᵢ and Zᵢ are the 3D coordinates of the i-th keypoint obtained from M, and its projected 2D coordinates in the image plane are uᵢ/wᵢ and vᵢ/wᵢ. Let xᵢ denote the 2D image coordinates of keypoint i provided by x; we define the residual as

 rᵢ = wᵢxᵢ − [uᵢ, vᵢ]ᵀ. (11)

This definition of the residual is numerically more stable than using wᵢ as a denominator.
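The residual of Equations 10 and 11 can be sketched as follows. All numerical values (K, the pose, the 3D point and the predicted keypoint) are made up for illustration only.

```python
import numpy as np

# Hypothetical intrinsics and pose, for illustration only.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                       # rotation part of the pose
t = np.array([0.1, -0.2, 2.0])      # translation part of the pose
P = K @ np.hstack([R, t[:, None]])  # 3x4 projection matrix (Equation 10)

X_i = np.array([0.3, 0.1, 1.0, 1.0])  # homogeneous 3D keypoint from M
u_i, v_i, w_i = P @ X_i               # projected homogeneous coordinates

x_i = np.array([0.0, 0.0])            # predicted 2D keypoint (placeholder)
# Residual of Equation 11: scale the predicted point by w_i instead of
# dividing the projection by w_i, which avoids a potentially tiny denominator.
r_i = w_i * x_i - np.array([u_i, v_i])
```

Note that r_i vanishes exactly when the predicted keypoint coincides with the projection (uᵢ/wᵢ, vᵢ/wᵢ), just as the divided form would.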

For j = 1, …, m, define fⱼ as

 fⱼ = ∑ᵢ₌₁ⁿ Cⱼ,ᵢ rᵢ, (12)

where each Cⱼ,ᵢ is a 1-by-2 vector of Gaussian random numbers, for j = 1, …, m and i = 1, …, n. Finally, we construct the function f as

 f(x,y)=[f1,…,fm]T. (13)

The function f defined above has output in ℝᵐ and is differentiable with respect to both x and y. Thus the implicit gradients of g with respect to x, i.e., ∂g/∂x, can be calculated using Equation 8. The number of keypoints n is arbitrary. The dimensionality m of y depends on the representation of the rotation part of the pose: m = 6 for the axis-angle representation, m = 7 for the quaternion representation and m = 12 for the rotation matrix representation.
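A minimal sketch of the construction in Equations 12-13, with random placeholder residuals standing in for the real projection residuals of Equation 11 (the shapes and the Gaussian coefficients Cⱼ,ᵢ follow the text; the values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 6                     # n keypoints; m = 6 for an axis-angle pose

# Residuals r_i from Equation 11, stacked as an (n, 2) array.
# Random placeholders here; in BPnP these come from the projection.
r = rng.normal(size=(n, 2))

# C[j, i] is a 1-by-2 vector of Gaussian random numbers (Equation 12).
C = rng.normal(size=(m, n, 2))

# f_j = sum_i C[j, i] . r_i (Equations 12-13), giving f(x, y) in R^m.
f = np.einsum('jik,ik->j', C, r)
```

Because f is a linear combination of residuals that are themselves differentiable in x and y, f inherits differentiability in both arguments, as required by Theorem 1.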

### 2.3 Normalizing the implicit gradients

Because the calculation of the implicit gradients involves inverting the Jacobian ∂f/∂y, it can sometimes produce large values, making the gradients unstable for training. To address this issue, we normalize the gradients by their Frobenius norm:

 ∂g/∂x ← (∂g/∂x) / ‖∂g/∂x‖F.
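The normalization step is a one-liner; here is a sketch with a random placeholder Jacobian (the scale and shape are made up for illustration):

```python
import numpy as np

# Implicit Jacobian dg/dx; a random placeholder with deliberately large
# entries, standing in for an ill-conditioned result of Equation 8.
rng = np.random.default_rng(1)
J = rng.normal(scale=1e6, size=(6, 16))

# Normalize by the Frobenius norm, as in Section 2.3, so the gradient
# direction is kept but its magnitude is bounded.
J_normed = J / np.linalg.norm(J, 'fro')
```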

### 2.4 Implementation note

We find that the axis-angle representation for the rotation part of the pose works best with BPnP, possibly because with this representation the dimensionality of the pose equals its degrees of freedom.

When calculating the implicit gradients in the backward pass, the solver g is treated as a black box. It may seem that g in the forward pass can be any PnP solver. However, contrary to common practice, it is important in this case to choose sensitive solvers over robust ones. For the implementation of g we therefore use OpenCV’s SOLVEPNP_ITERATIVE method, a Levenberg-Marquardt (LM) optimization that minimizes the sum of squared distances between the 2D keypoints and the projected 3D keypoints. In an iterative training algorithm, the initial pose for the LM optimization is obtained with RANSAC in the first iteration; in all subsequent iterations we use the output pose of the previous iteration as the initial pose.

In the backward pass, the computation of Equation 8 is implemented using the PyTorch autograd package [1].
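To see Equation 8 in action on a case with a known answer, consider this toy stand-in of our own for a solver (not the PnP solver itself): a one-dimensional least-squares fit, whose first-order optimality condition plays the role of the constraint f.

```python
import numpy as np

# Toy "solver": g(x) = argmin_y ||a*y - x||^2 for scalar y.
# Its optimality condition is the constraint f(x, y) = a.T @ (a*y - x) = 0.
a = np.array([1.0, 2.0, 3.0])
x = np.array([0.5, -1.0, 2.0])
y = a @ x / (a @ a)              # closed-form solver output

# Equation 8: dg/dx = -(df/dy)^{-1} (df/dx)
df_dy = a @ a                    # Jacobian of f w.r.t. y (scalar here)
df_dx = -a                       # Jacobian of f w.r.t. x
dg_dx = -(1.0 / df_dy) * df_dx   # implicit gradient
```

The implicit gradient aᵀ/(aᵀa) matches what one gets by differentiating the closed form y = aᵀx/(aᵀa) directly; BPnP applies the same identity when no closed form exists.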

## 3 End-to-end learning with BPnP

Given a function h with input I, output x and trainable parameters θ, a typical baseline algorithm that uses the BPnP solver g to optimize a loss function l looks like Algorithm 1.

There are two issues with this baseline algorithm. Firstly, and most importantly, the constraint of Theorem 1 as shown in Equation 9 is not always met when training with Algorithm 1. Secondly, if BPnP is used for 2D keypoint regression, i.e., regressing x, then the prediction is unlikely to approach the ground truth. We explain these issues and propose remedies below.

### 3.1 The residual-conformity trick

The key for Algorithm 1 to work smoothly is that in each iteration, g should produce an improved pose from the updated x, provided a small enough learning rate. Whether this happens depends on whether Equation 9 is upheld while computing ∂g/∂x. The implicit function theorem assumes that x and y are constrained by Equation 9, i.e., in our case the linear combinations of the projection residuals must remain the same. In other words, when x changes slightly, y should change accordingly to keep the residuals unchanged. The problem is that PnP solvers are designed to minimize the projection errors instead of keeping them unchanged. This mismatch causes the training process to become unstable.

To align the objective of the BPnP solver with the objective of maintaining the projection residuals, and thereby uphold Equation 9 as much as possible, we use a residual-conformity trick to boost the stability of the training process. Let πy denote the projective transformation of 3D points into the image plane with pose y and known camera intrinsics, i.e., πy(M) outputs the projected 2D image coordinates of the 3D points M. Let detach(·) denote an operation that copies and detaches a variable from its computational graph, i.e., any variable after such an operation is treated as a constant in any differentiation. Algorithm 2 describes the training procedure with the residual-conformity trick.

The key difference in Algorithm 2 is that at each iteration the projection residual r = x − πy(M) based on the previous pose is detached and subtracted from x before passing it to g for computing the current pose. This aligns the objective of the BPnP solver with the objective of upholding Equation 9. To see this, recall that g is an optimization process which seeks to minimize the projection error. If y is the output of g(x − r), it means g has made its best effort to let πy(M) ≈ x − r, which is equivalent to letting x − πy(M) ≈ r, i.e., trying to keep the residual, and thus Equation 9, unchanged.
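The mechanics of the trick can be sketched in PyTorch with a deliberately simplified one-dimensional stand-in for πy(M) (the function pi, the values and the scalar "pose" are all hypothetical, chosen only to show the detach-and-subtract step):

```python
import torch

# 1-D stand-in for the projective transform pi_y(M); hypothetical,
# used only to illustrate the residual-conformity trick.
def pi(y, M):
    return y * M

M = torch.tensor([1.0, 2.0, 3.0])
x = torch.tensor([1.1, 2.3, 3.2], requires_grad=True)  # predicted "keypoints"
y_prev = torch.tensor(1.0)                             # previous "pose"

# r = x - pi_{y_prev}(M): projection residual under the previous pose,
# detached so that it is treated as a constant in differentiation.
r = (x - pi(y_prev, M)).detach()

# Subtract r from x before calling the solver g; the solver's best effort
# pi_y(M) ~ x - r then amounts to keeping the residual unchanged (Eq. 9).
x_conf = x - r
```

In value, x − r equals the previous projection πy(M), yet gradients still flow through x, which is exactly what upholding Equation 9 requires.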

### 3.2 March in formation

With either Algorithm 1 or 2, BPnP is still not good enough for 2D keypoint regression. Suppose the output pose y has reached the ground truth y* during training; it is still highly unlikely that x equals πy*(M), because the PnP solver does not guarantee zero projection residuals; it merely outputs the pose with minimum residuals. Define a feasible formation as a configuration of x that satisfies the following condition:

 ∃y ∈ SE(3) s.t. x = πy(M). (14)

A fix for this issue is to encourage x to keep a feasible formation during training.

Algorithm 3 presents such a training scheme, where lines 3-7 encourage x to maintain a feasible formation by approaching the projection πy(M), while lines 8-12 guide the pose y toward the ground truth y*. Intuitively, x marches to its destination while maintaining a certain formation.

## 4 Experiment

To show the effectiveness of BPnP, we conduct experiments in two tasks: pose regression and keypoint regression.

Fix a set of n keypoints with 3D coordinates M, the camera intrinsic matrix K and an arbitrary feasible pose y* as the ground-truth pose. We use a modified VGG-11 [2] model as the function h that outputs the 2D keypoint locations x:

 x=h(I;θ) (15)

with input image I and parameters θ. The final layer of the model is modified to output the coordinates of the n keypoints. The BPnP solver g is as described in Section 2 and computes the pose

 y=g(x;K,M), (16)

and the loss function is defined as

 l(y,y∗)=ℓR+ℓT, (17)

where

 ℓR = (2 cos⁻¹(|⟨yR, y*R⟩|))² (18)

is the rotation loss and

 ℓT=∥yT−y∗T∥22 (19)

is the translation loss. Here yR is the rotation part of y converted to the quaternion representation and yT is the translation part of y, and similarly for y*R and y*T.
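The loss of Equations 17-19 is straightforward to compute from unit quaternions and translation vectors; a sketch with made-up example poses (the helper name pose_loss is ours):

```python
import numpy as np

def pose_loss(qR, tT, qR_star, tT_star):
    """Loss of Equations 17-19: rotation (quaternion) + translation terms."""
    # Rotation loss (Eq. 18): squared angular distance between unit
    # quaternions; |<q, q*>| handles the q/-q double cover.
    l_R = (2.0 * np.arccos(np.abs(qR @ qR_star))) ** 2
    # Translation loss (Eq. 19): squared Euclidean distance.
    l_T = np.sum((tT - tT_star) ** 2)
    return l_R + l_T

# Hypothetical example: identical rotations, translations 0.1 apart in z.
q = np.array([1.0, 0.0, 0.0, 0.0])   # identity rotation as a quaternion
t = np.array([0.0, 0.0, 2.0])
t_star = np.array([0.0, 0.0, 2.1])
loss = pose_loss(q, t, q, t_star)
```

Taking the absolute value of the quaternion inner product before the arccos ensures that q and −q, which represent the same rotation, incur zero rotation loss against each other.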

### 4.1 Pose regression

We fix the input I and train using Algorithms 1 and 2 respectively, with a fixed learning rate.

Figures 1 and 2 provide the pose regression results of 3 different runs using Algorithms 1 and 2, respectively. They show that BPnP is effective in both algorithms: the implicit gradients are able to reduce the loss to near 0 and guide the model to output a pose close to the ground truth.

Figures 1 and 2 also show the importance of the residual-conformity trick. In Figure 1 the loss curves are bumpy and the pose trajectories are messy because Equation 9 is often violated in Algorithm 1. In contrast, Figure 2 shows smooth loss curves and clean trajectories, verifying the effectiveness of the residual-conformity trick in stabilizing the training process.

### 4.2 Keypoint regression

We conduct two sets of keypoint regression experiments using Algorithms 2 and 3, respectively. Figures 3 and 4 provide the results.

Figure 3 demonstrates the issue discussed in Section 3.2: although the pose y has converged to y*, the output 2D keypoints x remain far from πy*(M). In contrast, as shown in Figure 4 (c) and (f), both experiments successfully converge to πy*(M). This shows that the march-in-formation algorithm can be used for predicting 2D keypoint locations.

## 5 Conclusion

In this paper we presented BPnP, a novel method for back-propagating through a PnP solver. We showed that the gradients of such a geometric optimization process can be computed using the Implicit Function Theorem, as if the solver were differentiable. Furthermore, we developed a residual-conformity trick that makes end-to-end pose regression with BPnP smooth and stable, and proposed a “march in formation” algorithm which successfully uses BPnP for keypoint regression.

Our method opens the door to vast possibilities. The ability to incorporate geometric optimization into end-to-end learning will greatly boost its power and promote innovation in various computer vision tasks.

## References

• [1] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS Workshop, 2017.
• [2] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.