# Robust SfM with Little Image Overlap

Yohann Salaün, Renaud Marlet, and Pascal Monasse
LIGM, UMR 8049, École des Ponts, UPE, Champs-sur-Marne, France
CentraleSupélec, Châtenay-Malabry, France
{yohann.salaun,renaud.marlet,pascal.monasse}@imagine.enpc.fr
## Abstract

Usual Structure-from-Motion (SfM) techniques require at least trifocal overlaps to calibrate cameras and reconstruct a scene. We consider here scenarios of reduced image sets with little overlap, possibly as low as two images at most seeing the same part of the scene. We propose a new method, based on line coplanarity hypotheses, for estimating the relative scale of two independent bifocal calibrations sharing a camera, without the need of any trifocal information or Manhattan-world assumption. We use it to compute SfM in a chain of up-to-scale relative motions. For accuracy, we however also make use of trifocal information for line and/or point features, when present, relaxing usual trifocal constraints. For robustness to wrong assumptions and mismatches, we embed all constraints in a parameterless RANSAC-like approach. Experiments show that we can calibrate datasets that previously could not, and that this wider applicability does not come at the cost of inaccuracy.

## 1 Introduction

Structure-from-Motion (SfM) has made spectacular improvements concerning scalability [9, 43, 15], accuracy [26, 6, 3, 40] and robustness [42, 28]. However, most approaches assume a significant amount of overlap between images. In this work, we consider the case where images have only little overlap, possibly as low as two images at most seeing the same part of the scene.

This situation commonly occurs when a scene is photographed by people with little or no knowledge of photogrammetry. Even when informed, they can make occasional mistakes, widening a baseline too much, or shooting a low-quality image that cannot be exploited and has to be skipped. Another example is when exploiting street views, e.g., taken from a vehicle, if the viewpoints are too distant from one another, or if the camera is too close to the facades because of narrow sidewalks. In such cases, SfM methods may yield partial or fragmented calibrations.

It also applies to situations where only a small number of pictures can be shot, because of physical or time constraints. It may be the case when digitizing a building which is still in use, not to disturb occupants. Reducing the number of images, to only a few per room for indoor scenes, is also a way to reduce the cost and time for acquiring and processing information, as long as a minimum level of accuracy can still be reached for the targeted application.

Another issue concerns the lack of texture in environments such as building interiors, as it greatly reduces the amount of feature points detected in images, also leading to uneven feature distributions. Besides, if the number of images is small, it is likely that the baselines are wide, as well as the view angles between overlapping images. Consequently, because of perspective distortion, point descriptors are harder to match; relaxing matching thresholds does not fix the problem, as it introduces outliers. Moreover, detected points may unknowingly lie on a single plane, possibly giving rise to degenerate configurations for camera registration. On the contrary, lines do not suffer from lack of texture and are prevalent in interiors, where they often occur at plane intersections and object boundaries, see Fig. 1. Furthermore, lines are robust to significant changes of viewpoints, although their matching eventually also degrades.

Our approach thus leverages lines for pose estimation and reconstruction, although it also exploits points when available. Our main contributions are as follows:

• Given two successive bifocal calibrations, i.e., sharing a middle camera, we propose a novel method for computing their relative scale without the need for trifocal features. It is based on line coplanarity hypotheses, which is relevant in man-made scenes (cf. Sect. 4).

• To however exploit trifocality when present, we show how to relax usual trifocal constraints, and integrate them with coplanarity constraints in a unified framework. These relaxed constraints only need one feature triplet and 2 of the bifocal calibrations (cf. Sect. 5).

• Robustness to outliers in this context (cf. Sect. 6) requires parameters that may be hard to set given the variety of scenes and constraint types. We propose a parameterless method that integrates all constraints and adapts automatically to the scene diversity (cf. Sect. 7).

• Using both standard SfM datasets with ground truth and difficult interior scenes with little overlap and little texture, we validate empirically our technical choices and we show (1) that we can calibrate images that cannot be calibrated with existing point- or line-based SfM methods, and (2) that our accuracy is on par with state-of-the-art SfM methods on less hard datasets.

Our method does not assume any Manhattan configuration.

## 2 Related Work

Incremental SfM methods [37, 25, 43] register a new image to a partial model already constructed, using 3D-2D correspondences. A number of Perspective-n-Point (PnP) algorithms [20, 11] have been proposed for solving this resection problem. Whatever the method, at least 3 correspondences common to 3 images are required: three 3D points visible in the same 3 images, and up to a minimum of 6 [49].

Hierarchical SfM methods, that additionally merge partial models, also have similar constraints. In [14], the two models to merge are overlapping in the sense that they share one or several images, and pairs of 3D points projecting to the same 2D features in both models are used to relate the models, which implies tracks of length 3 or more. In [41], models are merged via feature matches between images separately associated to each model, requiring that the same four 3D points are reconstructed in both models, which implies in turn tracks of length 4 (connections between 2 tracks of length at least 2). Even with relaxed requirements where merging uses 4 points that are seen but not necessarily reconstructed in the other model [10], feature tracks of length 3 are required.

As for global SfM methods, their main objective is merging relative motions between two cameras into a consistent graph of all cameras. Besides robustness concerns, to get rid of outlier edges, and various approaches to average rotations, one of the main issues is that the relative translations are only given up to an unknown scale factor; only their directions are known. Most methods to infer global translations rely on information redundancy assuming a densely-connected graph [12, 6], or on additional information from trifocal tensors [36, 26] (hence requiring 4 tracked points across 3 views). A number of other methods [18, 31, 2] compute the global translations, possibly along with the 3D points, by solving equations relating points visible in two images; however, they implicitly assume that enough points are visible in at least 3 images to cancel the degrees of freedom of the relative scale factors. Besides, they do not all address point match outliers.

The situation is similar with line-based SfM. In [47], an initial image triplet with common line matches is required. Then, given a partial model, Perspective-n-Line (PnL) methods estimate the pose of a camera in which three 3D lines reproject [23, 48], which implies that at least 3 lines are visible in 3 views. A minimum of 6 lines is sometimes even desirable to prevent noise sensitivity [23], if not 9 for applicability and 25 for accuracy [29].

More generally, when associating both points and lines in a “Perspective-n-Features” framework, a minimum of 3 features visible in 3 views is still required [30, 44].

Our approach for relating the scale of two bifocal calibrations is based on coplanar line pairs. To our knowledge, coplanar lines have been used for pose estimation, but only in a two-view context and with a Manhattan-world assumption, to identify planar structures [19]. A related topic is plane-based SfM, but it has mostly been studied assuming prior knowledge (user-given) about the scene planes [39, 4] or in tracking scenarios with videos [50]. In [32, 18], a reference plane is used to estimate both the global translations and 3D points, but it must be visible in all images.

A related work regarding the estimation of scale factors and the identification of planar structures concerns direct structure estimation (DES) via homography estimation, with the computation of coplanar point clusters; however, it does not estimate poses and it also relies on trifocal points [17].

Line triangulation and line bundle adjustment have been well studied given an initial global pose estimation [5], but not in association with coplanarity issues.

## 3 From Relative to Global Pose Estimation

We consider the case where the epipolar graph mainly contains long chains or long cycles with little or no trifocal relations. For every edge in the graph between camera $i$ and camera $j$, we assume the relative pose $(R_{ij}, t_{ij})$ known (estimated), where $R_{ij}$ is the relative rotation and $t_{ij}$ is the unit-norm translation direction. We are interested in estimating the scale factors $\lambda_{ij}$ relating the translation directions $t_{ij}$ to the global relative translations $T_{ij} = \lambda_{ij} t_{ij}$, that is, up to a single global scale factor.

In the following, we only consider the case of a single chain or cycle of cameras, for which we estimate global poses $(R_j, T_j)$, where rotations $R_j$, translations $T_j$ as well as camera centers $C_j$ are defined in the same reference frame. Yet, our pose estimation method can be integrated in a general global SfM framework for arbitrary graphs, e.g., as described in [35] to evenly distribute errors over the whole graph in trying to satisfy the coherency constraints:

$$R_j = R_{ij}\, R_i, \qquad T_j = R_{ij}\, T_i + T_{ij}, \qquad C_j = -R_j^\top T_j. \tag{1}$$

This could be associated to a method to remove outlier edges and enforce cycle consistency [13, 45, 8].

When considering a single chain or cycle of cameras, global motions are recursively defined as:

$$R_1 = I \quad\text{and}\quad R_{j+1} = R_{j,j+1}\, R_j \tag{2}$$

$$T_1 = 0 \quad\text{and}\quad T_{j+1} = R_{j,j+1}\, T_j + \lambda_{j,j+1}\, t_{j,j+1} \tag{3}$$

As the global pose remains defined up to an unknown scale factor, we additionally set $\lambda_{12} = 1$: distances are thus defined with $\|C_2 - C_1\| = 1$ as unit length. (In case of a cycle, we could also include epipolar constraints to close the loop and distribute errors as in [35], but we do not in our implementation.) Finally, a bundle adjustment refines the initial pose estimation.
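The recursion (2)-(3) composes global poses along the chain; below is a minimal numpy sketch, where the function and argument names (`compose_chain`, `R_rel`, `t_rel`, `scales`) are ours:

```python
import numpy as np

def compose_chain(R_rel, t_rel, scales):
    """Compose global poses along a camera chain from relative motions.

    R_rel[j], t_rel[j]: relative rotation and unit translation direction
    from camera j to camera j+1; scales[j] is the estimated lambda_{j,j+1}.
    Implements R_1 = I, T_1 = 0, R_{j+1} = R_{j,j+1} R_j and
    T_{j+1} = R_{j,j+1} T_j + lambda_{j,j+1} t_{j,j+1}.
    """
    n = len(R_rel) + 1
    R = [np.eye(3)]
    T = [np.zeros(3)]
    for j in range(n - 1):
        R.append(R_rel[j] @ R[j])
        T.append(R_rel[j] @ T[j] + scales[j] * t_rel[j])
    # Camera centers C_j = -R_j^T T_j
    C = [-Rj.T @ Tj for Rj, Tj in zip(R, T)]
    return R, T, C
```

With $\lambda_{12} = 1$, the first inter-camera distance is the unit length, and all subsequent centers are scaled by the estimated ratios.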

In the following, to simplify notations, we assume that features are normalized, i.e., premultiplied by the inverse camera intrinsic parameter matrices $K_j^{-1}$. We also denote by 1-2-3 an arbitrary triplet of successive cameras rather than $j{-}1$, $j$, $j{+}1$.

## 4 Coplanarity Constraint

Let $L_a$ and $L_b$ be two non-parallel 3D line segments in a plane $\mathcal{P}$. (Coplanarity here is actually just a hypothesis, to be validated in a RANSAC-like manner, see Sect. 6.) Suppose $L_a$ only appears on cameras 1 and 2, whereas $L_b$ only appears on cameras 2 and 3. Let $l_{ki}$ be the projection of $L_i$ on camera $k$. (See Fig. 2.)

The coordinates of a 3D point $P$ in the global reference frame relate to its projection $p_j$ on camera $j$ and its depth $z_{jp}$:

$$P = R_j^\top (z_{jp}\, p_j - T_j) \tag{4}$$

Assuming $P$ lies on line $L_i$, we also have for any camera $k$ seeing $L_i$:

$$l_{ki} \cdot (R_k P + T_k) = 0, \tag{5}$$

where "$\cdot$" denotes the scalar product. Combining (4) and (5), and relating global to relative motions using (1), we can express the depth $z_{jp}$ in terms of the relative pose information:

$$l_{ki} \cdot \big(R_k R_j^\top (z_{jp}\, p_j - T_j) + T_k\big) = l_{ki} \cdot (R_{jk}\, z_{jp}\, p_j + T_{jk}) = 0 \;\Leftrightarrow\; z_{jp} = -\frac{l_{ki} \cdot T_{jk}}{l_{ki} \cdot (R_{jk}\, p_j)}. \tag{6}$$

We also know that the normal $n_\mathcal{P}$ to plane $\mathcal{P}$ is given by:

$$n_\mathcal{P} = d_{L_a} \times d_{L_b} = (R_1^\top l_{1a} \times R_2^\top l_{2a}) \times (R_2^\top l_{2b} \times R_3^\top l_{3b}) \tag{7}$$

where $d_{L_i}$ is the 3D direction of line $L_i$. Let $M$ be a point in $\mathcal{P}$. As any line in $\mathcal{P}$ is orthogonal to $n_\mathcal{P}$, we have:

$$P \in \mathcal{P} \;\Leftrightarrow\; n_\mathcal{P} \cdot (P - M) = 0 \tag{8}$$

Since $P \in \mathcal{P}$, using (4) and (6) yields:

$$n_\mathcal{P} \cdot \big(R_j^\top (z_{jp}\, p_j - T_j) - M\big) = 0 \;\Leftrightarrow\; n_\mathcal{P} \cdot \Big(-\frac{l_{ki} \cdot T_{jk}}{l_{ki} \cdot (R_{jk}\, p_j)}\, R_j^\top p_j + C_j - M\Big) = 0 \;\Leftrightarrow\; n_\mathcal{P} \cdot (C_j - M) = \frac{(l_{ki} \cdot T_{jk})\, \big(n_\mathcal{P} \cdot (R_j^\top p_j)\big)}{l_{ki} \cdot (R_{jk}\, p_j)} \tag{9}$$

Now using (9) both for a point $P_a \in L_a$, with $(j,k) = (2,1)$ and $i = a$, and a point $P_b \in L_b$, with $(j,k) = (2,3)$ and $i = b$, we get:

$$\frac{(l_{3b} \cdot T_{23})\, \big(n_\mathcal{P} \cdot (R_2^\top p_{2b})\big)}{l_{3b} \cdot (R_{23}\, p_{2b})} = \frac{(l_{1a} \cdot T_{21})\, \big(n_\mathcal{P} \cdot (R_2^\top p_{2a})\big)}{l_{1a} \cdot (R_{21}\, p_{2a})} \;\Leftrightarrow\; \frac{l_{3b} \cdot T_{23}}{l_{1a} \cdot T_{21}} = \frac{\big(l_{3b} \cdot (R_{23}\, p_{2b})\big)\, \big(n_\mathcal{P} \cdot (R_2^\top p_{2a})\big)}{\big(l_{1a} \cdot (R_{21}\, p_{2a})\big)\, \big(n_\mathcal{P} \cdot (R_2^\top p_{2b})\big)} \;\Leftrightarrow\; \frac{\lambda_{23}}{\lambda_{21}} = \frac{\big(l_{3b} \cdot (R_{23}\, p_{2b})\big)\, \big(n_\mathcal{P} \cdot (R_2^\top p_{2a})\big)\, (l_{1a} \cdot t_{21})}{\big(l_{1a} \cdot (R_{21}\, p_{2a})\big)\, \big(n_\mathcal{P} \cdot (R_2^\top p_{2b})\big)\, (l_{3b} \cdot t_{23})} \tag{10}$$

Eq. (10) expresses the signed ratio $\lambda_{23}/\lambda_{21}$ of the distances between 3 successive camera centers in the same reference frame, in terms of only 2-view relative pose information, plus weak information about 2 coplanar lines that do not have to be visible in all 3 views. It has 3 degenerate configurations (vanishing denominators) that can be assessed and controlled with an angular threshold.
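As a concrete sketch, Eq. (10) can be evaluated directly from bifocal data. The function below (names ours) assumes normalized coordinates, line coefficients $l_{ki}$ expressed in the respective camera frames, and that degenerate denominators have been checked beforehand:

```python
import numpy as np

def scale_ratio(R1, R2, R3, R21, t21, R23, t23,
                l1a, l2a, l2b, l3b, p2a, p2b):
    """Signed ratio lambda_23 / lambda_21 from a coplanar line pair (Eq. 10).

    l_ki: coefficients of line i seen in camera k (normalized coords);
    p2a, p2b: homogeneous image points on lines a and b in camera 2
    (the formula is homogeneous in p2a and p2b, so any scale works).
    """
    # Plane normal from the two 3D line directions (Eq. 7)
    d_a = np.cross(R1.T @ l1a, R2.T @ l2a)
    d_b = np.cross(R2.T @ l2b, R3.T @ l3b)
    n = np.cross(d_a, d_b)
    num = (l3b @ (R23 @ p2b)) * (n @ (R2.T @ p2a)) * (l1a @ t21)
    den = (l1a @ (R21 @ p2a)) * (n @ (R2.T @ p2b)) * (l3b @ t23)
    return num / den
```

On a synthetic scene with identity rotations and collinear camera centers spaced 1 and 2 apart, this returns a ratio of 2, as expected.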

Note that this criterion does not depend on the endpoints of line segments, which are notoriously inaccurate. Line segment matches can actually be grossly wrong because of over-segmentation, which is a common weakness of line segment detectors. On the contrary, our criterion only relies on (infinite) lines, which are much more stable. It can cope with line segment matches that are wrong w.r.t. endpoints but correct w.r.t. the supporting 3D line.

## 5 Simplified Trifocal Constraints, When Any

In case a line or point is visible in three successive views 1-2-3, we consider a corresponding simplified trifocal constraint, which also involves the scale factors $\lambda_{12}$ and $\lambda_{23}$. For both kinds of features, we do as follows:

1. We arbitrarily set the first two camera poses given their relative pose: $R_1 = I$, $T_1 = 0$, $\lambda_{12} = 1$.

2. We compute the approximate 3D reconstruction of the feature from the first 2 camera poses, as well as its projection on the third camera.

3. We compute the scale factor $\lambda_{23}$ such that the projected feature on camera 3 corresponds best to the detection.

As in step 1 we set $\lambda_{12} = 1$, the computed scale factor $\lambda_{23}$ is actually also the ratio $\lambda_{23}/\lambda_{12}$. For a more robust and accurate estimation, we can symmetrize the setting: permuting the role of cameras and averaging the resulting scale factors. As we just consider here a chain of successive views, it makes sense to actually only take into account the case where cameras 1 and 3 are swapped, keeping camera 2 as central. It prevents camera configurations 1-3 or 3-1, which are likely to feature wider changes of viewpoints, i.e., wider baselines and view angles. Bifocal calibration 1-3 may even fail and not be available. (In our implementation, we thus only compute the ratios $\lambda_{23}/\lambda_{12}$ and $\lambda_{21}/\lambda_{32}$, retaining the average of the former and the inverse of the latter.) Note that, as opposed to usual trifocal constraints that require at least three trifocal features, one is enough here.

In step 3, as detailed below, the scale factor is expressed in the form:

$$\lambda_{23} = \arg\min_{\lambda \in \mathbb{R}} \frac{\|u \times (v + \lambda w)\|}{\|u\|\, \|v + \lambda w\|}, \tag{11}$$

where $u$, $v$, $w$ are known vectors in $\mathbb{R}^3$. To find the minimum, we look for the values of $\lambda$ that make the derivative vanish. It leads to a third-degree polynomial, which simplifies into a second-degree polynomial, whose roots can be tested to find the minimum of the original expression. As above, this formula has a degenerate configuration that can be checked and discarded using an angular threshold.
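The quadratic can be made explicit: writing $a(\lambda)=\|u\times(v+\lambda w)\|^2$ and $b(\lambda)=\|v+\lambda w\|^2$, the stationarity condition $a'b - ab' = 0$ has cancelling cubic terms. A numpy sketch (function name ours):

```python
import numpy as np

def solve_scale(u, v, w):
    """Minimize ||u x (v + lam w)|| / (||u|| ||v + lam w||) over lam (Eq. 11).

    Setting the derivative to zero yields a quadratic in lam; we evaluate
    the objective at its real roots and keep the best one.
    Assumes a non-degenerate configuration (otherwise the quadratic
    may have no real root).
    """
    uv, uw = np.cross(u, v), np.cross(u, w)
    A0, A1, A2 = uv @ uv, uv @ uw, uw @ uw   # coefficients of a(lam)
    B0, B1, B2 = v @ v, v @ w, w @ w         # coefficients of b(lam)
    # a'b - ab' = 0 reduces to this quadratic:
    coeffs = [A2 * B1 - A1 * B2, A2 * B0 - A0 * B2, A1 * B0 - A0 * B1]
    roots = [r.real for r in np.atleast_1d(np.roots(coeffs))
             if abs(r.imag) < 1e-9]

    def f(lam):
        x = v + lam * w
        return np.linalg.norm(np.cross(u, x)) / (np.linalg.norm(u) * np.linalg.norm(x))

    return min(roots, key=f)
```

For instance, with $u = v + 3w$ the objective vanishes at $\lambda = 3$, which the solver recovers.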

Trifocal point constraint. Given a triplet of matched points $(p_1, p_2, p_3)$ in cameras 1-2-3, we can estimate the corresponding 3D point $\tilde P$ from $(p_1, p_2)$, assuming $\lambda_{12} = 1$, and reproject it as $\tilde p_3$ on camera 3, for some given $\lambda_{23}$:

$$\tilde p_3 = R_3(\tilde P - C_3) = R_3(\tilde P - C_2) - \lambda_{23}\, t_{23}$$

We look for the scale factor $\lambda_{23}$ that makes $\tilde p_3$ the closest to the observation $p_3$. Rather than minimizing the distance between the two 2D points, we minimize the angle between $p_3$ and $\tilde p_3$:

$$\lambda_{23}^{*} = \arg\min_{\lambda_{23} \in \mathbb{R}} \frac{\|p_3 \times \tilde p_3\|}{\|p_3\|\, \|\tilde p_3\|} = \arg\min_{\lambda_{23} \in \mathbb{R}} \frac{\big\|p_3 \times \big(R_3(\tilde P - C_2) - \lambda_{23}\, t_{23}\big)\big\|}{\|p_3\|\, \big\|R_3(\tilde P - C_2) - \lambda_{23}\, t_{23}\big\|}, \tag{12}$$

which is in the form (11).

Trifocal line constraint. Given a triplet of matched line segments $(l_1, l_2, l_3)$ in cameras 1-2-3, we can estimate the corresponding 3D line $\tilde L$ from $(l_1, l_2)$, assuming $\lambda_{12} = 1$, and reproject it as $\tilde l_3$ on camera 3, for some given $\lambda_{23}$. As $\tilde l_3$ also is the normal of the plane defined by $C_3$ and $\tilde L$, we have for any point $\tilde P$ on $\tilde L$:

$$\tilde l_3 \propto R_3\big[d_{\tilde L} \times (\tilde P - C_3)\big]$$

We look for the scale factor $\lambda_{23}$ that makes $\tilde l_3$ the most colinear with the observation $l_3$, i.e., that minimizes the relative angle between $l_3$ and $\tilde l_3$:

$$\lambda_{23}^{*} = \arg\min_{\lambda_{23} \in \mathbb{R}} \frac{\|l_3 \times \tilde l_3\|}{\|l_3\|\, \|\tilde l_3\|} = \arg\min_{\lambda_{23} \in \mathbb{R}} \frac{\big\|l_3 \times \big(R_3[d_{\tilde L} \times (\tilde P - C_3)]\big)\big\|}{\|l_3\|\, \big\|R_3[d_{\tilde L} \times (\tilde P - C_3)]\big\|} = \arg\min_{\lambda_{23} \in \mathbb{R}} \frac{\big\|l_3 \times \big(R_3[d_{\tilde L} \times (\tilde P - C_2)] - \lambda_{23}\, R_3[d_{\tilde L} \times t_{23}]\big)\big\|}{\|l_3\|\, \big\|R_3[d_{\tilde L} \times (\tilde P - C_2)] - \lambda_{23}\, R_3[d_{\tilde L} \times t_{23}]\big\|} \tag{13}$$

which is in the form of (11). As for coplanarity (Sect. 4), this constraint does not depend on line segment endpoints.

## 6 Robust Estimation

As there is no oracle to safely pick non-parallel coplanar lines (cf. Sect. 4), we adopt a RANSAC-like approach to sample candidate line pairs and select the associated scale with which the largest number of pairs agree. More generally, we also sample and check agreement w.r.t. trifocal points and lines when any (cf. Sect. 5), which provides robustness to wrong detections and matching too. For this, we have to define a measure of residual error for the three different kinds of features (hypothesized coplanar line pairs, trifocal points, trifocal lines), given a presumed scale $\lambda$ obtained from the sample (discarding degenerate cases).

Coplanar lines. For any quadruplet of line segments $(l_{1a}, l_{2a}, l_{2b}, l_{3b})$ s.t. $l_{1a}, l_{2a}$ match in cameras 1-2 and $l_{2b}, l_{3b}$ match in cameras 2-3, we estimate the associated 3D lines $L_a, L_b$ and consider the 3D point $P_a$ on $L_a$ (resp. $P_b$ on $L_b$) that is the closest to $L_b$ (resp. $L_a$) in 3D, with $P_a = P_b$ if $L_a, L_b$ are coplanar. We then consider their reprojections $p_{2a}, p_{2b}$ on the shared camera 2. The residual error is their pixel distance: $d_{co}(L_a, L_b, \lambda) = \|p_{2a} - p_{2b}\|$.

Considering $\|p_{2a} - p_{2b}\|$ rather than $\|P_a - P_b\|$ removes one dimension of error, along the viewing ray of camera 2, possibly constraining less. But positioning by triangulation along this direction is generally less accurate and thus less meaningful, leading people to rather use reprojected distances and rely on other views to capture any error along this direction.

To avoid degenerate cases with mostly parallel lines, we also discard line pairs whose 3D directions are similar (in our experiments, we use a fixed angular threshold). Note that these directions can be computed from the global rotations only, before global translations are estimated. As there can be many candidate pairs for coplanarity, we only consider, for each segment in camera 2 having a match in camera 1 or 3, its closest neighbors having a match in the other camera, with the distance defined as the minimum distance between segment endpoints.
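The coplanarity residual above can be sketched as follows, parameterizing each triangulated 3D line by a point and a unit direction; names are ours, camera 2 is given by $(R_2, T_2)$ with intrinsics $K$, and mostly parallel pairs are assumed to have been discarded already:

```python
import numpy as np

def coplanarity_residual(Qa, da, Qb, db, R2, T2, K):
    """Residual d_co for a hypothesized coplanar line pair (sketch of Sect. 6).

    Pa on line a and Pb on line b are the mutually closest 3D points
    (Pa == Pb iff the lines are coplanar); the residual is the pixel
    distance between their projections in the shared camera 2.
    """
    da, db = da / np.linalg.norm(da), db / np.linalg.norm(db)
    # Closest points of two (non-parallel) 3D lines: solve for abscissas s, t
    r = Qb - Qa
    c = da @ db
    s = (r @ da - (r @ db) * c) / (1.0 - c * c)
    t = ((r @ da) * c - r @ db) / (1.0 - c * c)
    Pa, Pb = Qa + s * da, Qb + t * db

    def project(P):
        x = K @ (R2 @ P + T2)
        return x[:2] / x[2]

    return np.linalg.norm(project(Pa) - project(Pb))
```

Note how a pair offset purely along the viewing ray of camera 2 yields a zero residual, which is exactly the dimension of error this measure deliberately ignores.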

Trifocal point error. For each triplet of matched points $(p_1, p_2, p_3)$ in cameras 1-2-3, we estimate the corresponding 3D point $\tilde P$ from $(p_1, p_2)$, and reproject it as $\tilde p_3$ on camera 3 (assuming the candidate scale $\lambda$), as in Sect. 5. The residual error is the pixel distance of the reprojection $\tilde p_3$ to the observed detection $p_3$. For robustness, we actually symmetrize this measure by swapping cameras 1 and 3, and averaging the distances: $d_{pt} = \frac{1}{2}(\|p_3 - \tilde p_3\| + \|p_1 - \tilde p_1\|)$.

Trifocal line error. For each triplet of matched line segments $(l_1, l_2, l_3)$ in cameras 1-2-3, we estimate the corresponding 3D line $\tilde L$ from $(l_1, l_2)$, and reproject it as $\tilde l_3$ on camera 3 (assuming the candidate scale $\lambda$), as in Sect. 5. The residual error is the average of the pixel distances between the reprojected (infinite) line $\tilde l_3$ and the two endpoints of the detected segment $l_3$, as in [5, 47]. For robustness, we also swap cameras 1 and 3, and define $d_{seg}$ as the average of both errors.

Sampling. In practice, the total number of features (hypothesized coplanar line pairs, trifocal points, trifocal lines) is generally low enough (usually less than 10,000) for all features to be tried rather than sampled.

## 7 Parameter-Free Robust Estimation

Rather than depend on fixed arbitrary error thresholds to define feature agreement with a model, we actually resort to an a contrario (AC) approach [24, 25]: we compute the expectation of the number of false alarms (NFA), which measures the statistical meaningfulness (actually the converse, i.e., the insignificance) of a candidate scale $\lambda$ w.r.t. the features to test, and select the scale with the lowest NFA. It allows an automatic optimization of the inlier-outlier threshold, and thus more accurate results too. This is also consistent with the method we use for two-view pose estimation [34].

Coplanar line NFA. For coplanarity, we follow [34] and define a line error $d_{co}(L_i, \lambda)$ from the error for line pairs:

$$d_{co}(L_i, \lambda) = \min_{L_j \text{ coplanar with } L_i} d_{co}(L_i, L_j, \lambda) \tag{14}$$

where $d_{co}(L_i, L_j, \lambda)$ is the coplanarity distance defined in Sect. 6. The NFA for our sampling is then, following [25]:

$$\mathrm{NFA}_{co}(\lambda) = (n_2 - 2) \min_{k \in [3, n_2]} \binom{n_2}{k} \binom{k}{2} \left[\frac{\pi\, d_{co}(L_k, \lambda)^2}{A}\right]^{k-2} \tag{15}$$

where $n_2$ is the number of lines in camera 2 that have at least a match in camera 1 or 3, $A$ is the image area, and $L_k$ is the $k$-th best inlier (with lowest error).

Trifocal point NFA. For triplets of matched points, following [25], we have:

$$\mathrm{NFA}_{pt}(\lambda) = (n_{pt} - 1) \min_{k \in [2, n_{pt}]} \binom{n_{pt}}{k}\, k \left[\frac{\pi\, d_{pt}(\hat p_k, \lambda)^2}{A}\right]^{k-1} \tag{16}$$

where $n_{pt}$ is the total number of matched point triplets in cameras 1-2-3, $A$ is the image area, $d_{pt}(\hat p_k, \lambda)$ is the residual error of triplet $\hat p_k$ assuming scale $\lambda$ (cf. Sect. 6), and $\hat p_k$ is the triplet with $k$-th lowest error.
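The minimization in (16) is straightforward to implement in log space; below is a sketch with our own function name, following the formula as written here (the exact combinatorial factors should be checked against [25]):

```python
import numpy as np
from math import lgamma, log, pi

def nfa_point(errors, A):
    """Trifocal-point NFA of Eq. (16), computed in log form for stability.

    errors: residuals d_pt of the n_pt matched triplets for one candidate
    scale; A: image area in pixels^2. Returns (log NFA, best inlier count);
    the candidate scale with the lowest (log) NFA wins.
    """
    e = np.sort(np.asarray(errors, dtype=float))
    n = len(e)

    def log_comb(n, k):  # log of the binomial coefficient C(n, k)
        return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

    best_log_nfa, best_k = float('inf'), 0
    for k in range(2, n + 1):
        d = max(e[k - 1], 1e-12)  # k-th lowest error
        log_nfa = (log(n - 1) + log_comb(n, k) + log(k)
                   + (k - 1) * log(pi * d * d / A))
        if log_nfa < best_log_nfa:
            best_log_nfa, best_k = log_nfa, k
    return best_log_nfa, best_k
```

On a set of residuals with 8 sub-pixel errors and 2 gross outliers, the minimum is reached at $k = 8$, i.e., the inlier-outlier threshold is found automatically.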

Trifocal line NFA. For triplets of matched line segments, also following [25], we have:

$$\mathrm{NFA}_{seg}(\lambda) = (n_{seg} - 1) \min_{k \in [2, n_{seg}]} \binom{n_{seg}}{k}\, k \left[\frac{2 D\, d_{seg}(\hat l_k, \lambda)}{A}\right]^{k-1} \tag{17}$$

where $n_{seg}$ is the total number of matched line triplets, $D$ is the diagonal length of the picture, $d_{seg}(\hat l_k, \lambda)$ is the residual error of triplet $\hat l_k$ assuming scale $\lambda$ (cf. Sect. 6), and $\hat l_k$ is the triplet with $k$-th lowest error.

Global NFA. As these NFAs correspond to expectations of independent events, the global NFA is their product:

$$\mathrm{NFA}(\lambda) = \mathrm{NFA}_{co}(\lambda) \cdot \mathrm{NFA}_{pt}(\lambda) \cdot \mathrm{NFA}_{seg}(\lambda). \tag{18}$$

It estimates the insignificance of a candidate ratio $\lambda$ based on coplanarity and trifocal constraints, when any. For robust estimation, we retain the $\lambda$ with the overall lowest NFA. This defines a parameterless AC-RANSAC variant of the approach of Sect. 6.

Note that this pose estimation method can be heavily parallelized, not only to evaluate the different constraints but also to estimate the global motion for any three consecutive views, to be later combined in a chain as defined in Sect. 3.

## 8 Bundle Adjustment

The last step of our SfM method is a bundle adjustment (BA) that refines simultaneously the structure and the poses.

Reconstructed structure. In our context, we could consider as structures not only points and lines, but also planes corresponding to coplanar lines. Indeed, just as tracks of points or lines across images represent single structures, single planes could be associated to sets of lines sharing a coplanarity constraint. More precisely, only coplanar line pairs with similar plane orientations should be clustered together, as a line can belong to two different planes at edges, e.g., where a wall meets another wall, floor or ceiling.

Yet, we observed on real scenes that such a clustering of coplanar lines into single planes tends to degrade the accuracy of pose estimation. Our interpretation is that individual pairs of (real 3D) lines can be coplanar enough to robustly estimate sensible scale ratios but, when grouped in a single plane, their global coplanarity deteriorates.

In fact, contrary to 3D points, that are well determined although possibly misdetected or mismatched in images, there is no such thing in the real world as perfect 3D lines, perfect 3D planes, nor exact line parallelism, orthogonality or coplanarity. (Optical distortion and detection noise just come on top of it.) This is especially true of edges and surfaces in a building, as tolerances of straightness and flatness in the construction industry range typically from 0.2 to 1%. Line-based calibration is thus prone to be less accurate in practice than point-based calibration, and even less when two lines are involved in a feature, as in line coplanarity.

Buildings also present many cases of near colinearity, hence near line coplanarity, due to close edges at the boundary of thin surfaces, e.g., baseboards, conduits, moldings, picture frames, whiteboards, window and door frames, etc. Furniture edges also tend to be almost but not exactly coplanar with wall edges. Due to small errors, such nearly coplanar lines are considered as inliers, but degrade accuracy.

This consideration is consistent with [34] where the authors observe that, in real data, lines that should “logically” be parallel turn out not to be as much as expected, leading to a number of close but different-enough vanishing points (VPs). They found however that treating parallel line pairs independently leads to more accurate calibrations than merging them into a single VP. Similarly, in our bundle adjustment, we do not consider planes as the support of many coplanar lines; line pairs that were determined as coplanar, i.e., RANSAC inliers, are treated individually.

Residuals and optimization. The parameters of our BA are tracked points and lines, as well as camera positions and orientations. The error to minimize is the sum of the squared (pixel) distances of the reprojected points and lines to their detections (cf. Sect. 6), in all the cameras that see (i.e., detect and match) them, plus the sum of the squared coplanarity residuals (cf. Sect. 6), for all line pairs found as inliers, in all images that see both lines.

Concretely, we initialize the bundle adjustment with the pose found by composing the scaled relative motions (cf. Sect. 3) and with triangulated features. As BA can be sensitive to initialization and given that rotations are often better estimated than translations, we first refine the structure and motion with fixed rotations, then refine all parameters (as in [27, 26]). We use the Ceres solver [1] for minimization.

## 9 Experiments

Feature detection and matching. Points are detected and matched with SIFT [21]. Lines are detected with MLSD [33] and matched with LBD [46]. Both kinds of features are tracked across consecutive pictures to identify match triplets (hence trifocal overlaps), when any. The code is available on GitHub.

Bifocal calibration. We implemented our SfM approach on top of the two-view relative pose estimation of [34]. Besides being parameterless, it has the advantage of robustly combining both line and point features, when any, providing state-of-the-art accuracy even in textureless environments and with wide baselines. When points are not available, it only assumes that both images contain at least two pairs of matched lines that are parallel in 3D; this constraint is most often met in indoor scenes. If it is not met, our assumption (seeing lines in views 1-2 coplanar with lines seen in views 2-3) is likely not to be met either.

Using method [34] is not intrinsic to our approach. We could have used just as well any other bifocal calibration method, e.g., based on points assuming enough (5 or 7) are available on all image pairs, or based on lines as in [7], although it assumes a Manhattan scene, contrary to [34].

Datasets. We consider both difficult interior scenes and a standard SfM dataset of outdoor scenes with ground truth.

• The indoor dataset pictures various office rooms with little texture and little image overlap. Office-P19 (cf. Fig. 1), Meeting-P31 and Trapezoid-P17 (cf. Fig. 3) consist of cycles of images (P means pictures). Trapezoid-P17 does not belong to a Manhattan world. Resolution is . As can be seen in Fig. 1, points are dense on some textured objects, like the door, but scarce in large other parts of the scene, e.g., white walls.

• Strecha et al.'s dataset [38] is a de facto standard for assessing the accuracy of SfM. It consists of 6 outdoor scenes with ground truth for camera poses. We consider both the full dataset as well as subsets of images to reduce image overlap. Resolution is .

All images have been corrected for radial distortion.

RANSAC with and without parameters. Table 1 (left) compares RANSAC with a single fixed threshold for all three kinds of features (cf. Sect. 6) to the parameterless AC-RANSAC (cf. Sect. 7). Although AC-RANSAC does not always yield the best results, it works better on average. For a given scene, a better accuracy could be achieved by setting the different thresholds for each kind of features, but it would not be practical, hence the interest of AC-RANSAC. In the following, all experiments rely on AC-RANSAC.

Contribution of the different kinds of features. Table 1 (right) reports the accuracy of the different kinds of features alone, or when used jointly. When studying coplanarity features alone, the residual includes the reprojection error of coplanar lines (which may or may not be trifocal).

Trifocal line features provide just a little less accuracy than trifocal points, which are prevalent in this textured dataset. Used together, they lead to a slightly better accuracy. Coplanarity features alone yield a little less accuracy than trifocal lines; only Fountain-P11 has a poor calibration with coplanarity, probably because it contains fewer planes than other scenes. Yet, after being merged with other features, coplanarity does not degrade accuracy in general, and often even improves it.

State-of-the-art accuracy. Tab. 2 shows that our general approach outperforms incremental point-based SfM methods [37, 43] regarding accuracy, and is on par with state-of-the-art global SfM methods [26, 27, 6, 2, 16]. Wider applicability was thus not traded for accuracy, even in scenarios with dense features and much overlap.

It also shows that our relaxed trifocal constraints retain the most relevant part of the trifocality contribution to calibration. Compared to Tab. 1 (right), we can see that coplanarity constraints alone provide in general comparable or better accuracy than incremental SfM methods [37, 43], except again on Fountain-P11. Residuals are shown in Tab. 3.

We could not compare with line-based SfM methods [47] as their code is unavailable and as they do not measure their accuracy against standard datasets with ground truth.

Succeeding when others fail. Our method can calibrate difficult datasets that other methods fail to calibrate:

In Office-P19, some triplets do not have overlapping features (neither points nor lines), which makes it impossible to calibrate with existing SfM methods. Our method succeeds. Fig. 1 illustrates some (consecutive) views of the dataset with line detections, and point, line and plane reconstructions. Apart from a few line outliers due to mismatches, line and plane reconstructions are qualitatively good. We can also calibrate Meeting-P31 and Trapezoid-P17, whereas OpenMVG fails; the residuals and number of calibrated cameras are given in Tab. 3.

Removing 3/19 images from Castle-P19 can be enough to cause other methods to fail to calibrate all cameras, thus also reducing reconstructed structures and ability to model the whole scene (cf. missing wall parts in Fig. 6, left). Our method calibrates all cameras, yielding an average location error of 13.2 cm. It is not as good as the state-of-the-art error of 2.3 cm when all 19 images are available (cf. Tab. 2), but still better than Bundler (34.4 cm) and VSfM (25.8 cm). We can actually remove up to 11/19 images and still estimate all camera poses with average error 18.4 cm, which remains better than incremental methods with all 19 images.

Likewise, keeping only 3 images (left-most, middle, right-most) from Herz-Jesu-P8 (Fig. 6) leads other methods to totally fail, while we calibrate these very-wide-baseline images with 6 mm accuracy, the same as full P8.

More generally, disconnected components in the trifocal graph make other SfM methods fail to calibrate all cameras (in the same reference frame), even if the bifocal graph is connected. (For other methods to work, connections in the trifocal graph actually even have to be supported by at least 3 features.) However, provided there are 3D lines in the configuration described in Fig. 2, we can hope to calibrate all cameras in the dataset.

## 10 Conclusion

We have presented a novel SfM method that exploits lines and coplanarity constraints to estimate camera poses in difficult settings with little image overlap and possibly little or no texture, causing other methods to fail.

Our experiments show that our method combines the accuracy of usual trifocal constraints, although we relax them, with the robustness (in the sense of wider applicability) of coplanarity constraints. It can thus be blindly used to calibrate scenes that other methods fail to calibrate, while still providing state-of-the-art results on less difficult scenes, with smaller baselines, more overlap or more texture.

The coplanarity constraint we have defined is actually not intrinsically specific to lines. It could apply to points as well, as long as they can be associated to planes, e.g., using a method to fit multiple homographies in two views [22]. Future work also includes the use of coplanarity constraints for dense surface reconstruction, even in textureless scenes.

## References

• [1] S. Agarwal, K. Mierle, and Others. Ceres solver.
• [2] M. Arie-Nachimson, S. Z. Kovalsky, I. Kemelmacher-Shlizerman, A. Singer, and R. Basri. Global motion estimation from point matches. In 3DIMPVT, 2012.
• [3] F. Arrigoni, B. Rossi, and A. Fusiello. On computing the translations norm in the epipolar graph. In International Conference on 3D Vision (3DV 2015), 2015.
• [4] A. Bartoli and P. Sturm. Constrained Structure and Motion from multiple uncalibrated views of a piecewise planar scene. International Journal of Computer Vision (IJCV 2003), 52(1):45–64, 2003.
• [5] A. Bartoli and P. F. Sturm. Structure-from-motion using lines: Representation, triangulation, and bundle adjustment. Computer Vision and Image Understanding, 100(3):416–441, 2005.
• [6] Z. Cui, N. Jiang, C. Tang, and P. Tan. Linear global translation estimation with feature tracks. In British Machine Vision Conference (BMVC 2015), 2015.
• [7] A. Elqursh and A. Elgammal. Line-based relative pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3049–3056, June 2011.
• [8] O. Enqvist, F. Kahl, and C. Olsson. Non-sequential structure from motion. In ICCV Workshops, pages 264–271, 2011.
• [9] J.-M. Frahm, P. Fite-Georgel, D. Gallup, T. Johnson, R. Raguram, C. Wu, Y.-H. Jen, E. Dunn, B. Clipp, S. Lazebnik, and M. Pollefeys. Building Rome on a cloudless day. In K. Daniilidis, P. Maragos, and N. Paragios, editors, 11th European Conference on Computer Vision (ECCV 2010), pages 368–381. Springer Berlin Heidelberg, 2010.
• [10] A. Fusiello, F. Crosilla, and F. Malapelle. Procrustean point-line registration and the NPnP problem. In International Conference on 3D Vision (3DV 2015), pages 250–255, Oct. 2015.
• [11] V. Garro, F. Crosilla, and A. Fusiello. Solving the PnP problem with anisotropic orthogonal Procrustes analysis. In International Conference on 3D Imaging, Modeling, Processing, Visualization Transmission (3DIMPVT 2012), pages 262–269, Oct. 2012.
• [12] V. M. Govindu. Combining two-view constraints for motion estimation. In Conference on Computer Vision and Pattern Recognition (CVPR 2001), 2001.
• [13] V. M. Govindu. Robustness in motion averaging. In ACCV, 2006.
• [14] M. Havlena, A. Torii, and T. Pajdla. Efficient Structure from Motion by graph optimization. In 11th European Conference on Computer Vision (ECCV 2010), pages 100–113, Berlin, Heidelberg, 2010.
• [15] J. Heinly, J. L. Schönberger, E. Dunn, and J.-M. Frahm. Reconstructing the world* in six days *(as captured by the Yahoo 100 million image dataset). In Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015.
• [16] N. Jiang, Z. Cui, and P. Tan. A global linear method for camera pose registration. In IEEE International Conference on Computer Vision (ICCV 2013), pages 481–488, 2013.
• [17] N. Jiang, W. Lin, M. N. Do, and J. Lu. Direct structure estimation for 3D reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), pages 2655–2663, 2015.
• [18] F. Kahl and R. I. Hartley. Multiple-view geometry under the L∞-norm. IEEE Trans. PAMI, 30(9):1603–1617, 2008.
• [19] C. Kim and R. Manduchi. Planar structures from line correspondences in a Manhattan world. In 12th Asian Conference on Computer Vision (ACCV 2014), pages 509–524, 2014.
• [20] V. Lepetit, F. Moreno-Noguer, and P. Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision (IJCV 2008), 81(2):155–166, 2008.
• [21] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91–110, Nov. 2004.
• [22] L. Magri and A. Fusiello. T-linkage: A continuous relaxation of J-linkage for multi-model fitting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), pages 3954–3961, 2014.
• [23] F. M. Mirzaei and S. I. Roumeliotis. Globally optimal pose estimation from line correspondences. In IEEE International Conference on Robotics and Automation (ICRA 2011), pages 5581–5588. IEEE, 2011.
• [24] L. Moisan and B. Stival. A probabilistic criterion to detect rigid point matches between two images and estimate the fundamental matrix. Int. J. Comput. Vision, 57(3):201–218, May 2004.
• [25] P. Moulon, P. Monasse, and R. Marlet. Adaptive Structure from Motion with a contrario model estimation. In 11th Asian Conference on Computer Vision (ACCV 2012) - Volume Part IV, ACCV’12, pages 257–270, Berlin, Heidelberg, 2013. Springer-Verlag.
• [26] P. Moulon, P. Monasse, and R. Marlet. Global fusion of relative motions for robust, accurate and scalable structure from motion. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3248–3255, 2013.
• [27] C. Olsson and O. Enqvist. Stable structure from motion for unordered image collections. In SCIA, pages 524–535, 2011. LNCS 6688.
• [28] O. Özyeşil and A. Singer. Robust camera location estimation by convex programming. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), June 2015.
• [29] B. Přibyl, P. Zemčík, and M. Čadik. Camera pose estimation from lines using Plücker coordinates. In X. Xie, M. W. Jones, and G. K. L. Tam, editors, British Machine Vision Conference (BMVC 2015), pages 45.1–45.12. BMVA Press, Sept. 2015.
• [30] S. Ramalingam, S. Bouaziz, and P. F. Sturm. Pose estimation using both points and lines for geo-localization. In IEEE International Conference on Robotics and Automation (ICRA 2011), pages 4716–4723, 2011.
• [31] A. L. Rodríguez, P. E. López-de Teruel, and A. Ruiz. Reduced epipolar cost for accelerated incremental SfM. In Conference on Computer Vision and Pattern Recognition (CVPR 2011), 2011.
• [32] C. Rother. Linear multi-view reconstruction of points, lines, planes and cameras using a reference plane. In 9th IEEE International Conference on Computer Vision (ICCV 2003) - Volume 2, ICCV ’03, pages 1210–1217, Washington, DC, USA, 2003. IEEE Computer Society.
• [33] Y. Salaün, R. Marlet, and P. Monasse. Multiscale line segment detector for robust and accurate SfM. In 23rd International Conference on Pattern Recognition (ICPR), 2016.
• [34] Y. Salaün, R. Marlet, and P. Monasse. Robust and accurate line- and/or point-based pose estimation without Manhattan assumptions. In European Conference on Computer Vision (ECCV), 2016.
• [35] G. C. Sharp, S. W. Lee, and D. K. Wehe. Toward multiview registration in frame space. In IEEE International Conference on Robotics and Automation (ICRA 2001), 2001.
• [36] K. Sim and R. Hartley. Recovering camera motion using L∞ minimization. In Conference on Computer Vision and Pattern Recognition (CVPR 2006), volume 1, 2006.
• [37] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3D. ACM Transations on Graphics (TOG 2006), 25(3):835–846, July 2006.
• [38] C. Strecha, W. von Hansen, L. Van Gool, P. Fua, and U. Thoennessen. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), pages 1–8, June 2008.
• [39] P. F. Sturm. Algorithms for plane-based pose estimation. In Conference on Computer Vision and Pattern Recognition (CVPR 2000), pages 1706–1711, 2000.
• [40] C. Sweeney, T. Sattler, T. Hollerer, M. Turk, and M. Pollefeys. Optimizing the viewing graph for Structure-from-Motion. In IEEE International Conference on Computer Vision (ICCV 2015), pages 801–809, Washington, DC, USA, 2015.
• [41] R. Toldo, R. Gherardi, M. Farenzena, and A. Fusiello. Hierarchical structure-and-motion recovery from uncalibrated images. Computer Vision and Image Understanding (CVIU 2015), 140(C):127–143, Nov. 2015.
• [42] K. Wilson and N. Snavely. Robust global translations with 1DSfM. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, 13th European Conference on Computer Vision (ECCV 2014), pages 61–75, 2014.
• [43] C. Wu. Towards linear-time incremental structure from motion. In International Conference on 3D Vision (3DV 2013), pages 127–134, 2013.
• [44] C. Xu, L. Zhang, L. Cheng, and R. Koch. Pose estimation from line correspondences: A complete analysis and a series of solutions. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI 2016), June 2016.
• [45] C. Zach, M. Klopschitz, and M. Pollefeys. Disambiguating visual relations using loop constraints. In Conference on Computer Vision and Pattern Recognition (CVPR 2010), 2010.
• [46] L. Zhang and R. Koch. An efficient and robust line segment matching approach based on LBD descriptor and pairwise geometric consistency. J. Vis. Comun. Image Represent., 24(7):794–805, Oct. 2013.
• [47] L. Zhang and R. Koch. Structure and motion from line correspondences: Representation, projection, initialization and sparse bundle adjustment. Journal of Visual Communication and Image Representation, 25(5):904–915, 2014.
• [48] L. Zhang, C. Xu, K.-M. Lee, and R. Koch. Robust and efficient pose estimation from line correspondences. In K. M. Lee, Y. Matsushita, J. M. Rehg, and Z. Hu, editors, 11th Asian Conference on Computer Vision (ACCV 2012), pages 217–230. Springer, 2012.
• [49] Y. Zheng, Y. Kuang, S. Sugimoto, K. Åström, and M. Okutomi. Revisiting the PnP problem: A fast, general and optimal solution. In IEEE International Conference on Computer Vision (ICCV 2013), pages 2344–2351, Dec. 2013.
• [50] Z. Zhou, H. Jin, and Y. Ma. Robust plane-based Structure from Motion. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), pages 1482–1489, Washington, DC, USA, 2012.