View Selection with Geometric Uncertainty Modelling
Abstract
Estimating positions of world points from features observed in images is a key problem in 3D reconstruction, image mosaicking, simultaneous localization and mapping (SLAM), and structure from motion. We consider a special instance in which there is a dominant ground plane viewed from a parallel viewing plane above it. Such instances commonly arise, for example, in aerial photography.
Consider a world point and the worst-case uncertainty of its reconstruction obtained by merging all possible views chosen from the viewing plane. We first show that one can pick two views such that the uncertainty obtained using only these two views is almost as good as (i.e., within a small constant factor of) the uncertainty obtained from all views. Next, we extend the result to the entire ground plane and show that one can pick a small subset of views (which grows only linearly with the area of the ground plane) and still obtain, for every ground point, a constant factor approximation to the minimum worst-case estimate obtained by merging all views. Finally, we present a multi-resolution view selection method which extends our techniques to non-planar scenes. We show that the method can produce rich and accurate dense reconstructions with a small number of views.
Our results provide a view selection mechanism with provable performance guarantees which can drastically increase the speed of scene reconstruction algorithms. In addition to theoretical results, we demonstrate their effectiveness in an application where aerial imagery is used for monitoring farms and orchards.
I Introduction
Consider a scenario where a plane flying at a fixed altitude is capturing images of a ground plane below so as to reconstruct the scene (Figure 1). Over the course of its flight, the plane may capture thousands of images which can easily overwhelm image reconstruction algorithms. Our goal in this paper is to answer the question of whether we can select a small number of images and focus only on them without reducing the reconstruction quality.
We first study a basic version where we focus on a single world point. The goal is to select a small number of images from which the 3D position of the world point can be accurately estimated (Problem 1). We then present a general version where the goal is to minimize the error for the entire scene (Problem 2) from a small set of images. Note that in the latter case, the same set of images must be used for every scene point. We also extend our approach to a multi-resolution view selection scheme to accommodate non-planar scenes.
In order to formalize these two problems, we first need to formalize the error model and the uncertainty objective. Consider a world point and an image taken from a camera at a known position and orientation. Let the observed projection of the point and its unobserved true projection be represented as vectors originating from the camera center. We will employ a bounded uncertainty model in which the angle between the observed and true projection vectors is assumed to be bounded by a known (or desired) quantity. Therefore, the 3D location of the world point is contained inside a cone apexed at the camera center, with its symmetry axis along the observed projection vector and the given cone angle. See Figure 3.
Merging measurements: In order to estimate the true location of a world point from multiple measurements, we simply intersect the corresponding cones. The diameter of the intersection is used as an uncertainty measure. We chose diameter over volume so as to avoid degenerate cases where the intersection has almost zero volume but a large diameter, which could still generate large triangulation error.
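The merging step can be made concrete with a small sketch. The code below (our own illustrative implementation, not the paper's) represents each 2D uncertainty wedge as a triangle of finite reach, intersects the wedges with standard convex polygon clipping, and reports the diameter of the intersection; the reach, camera positions, and angular error are arbitrary assumed values.

```python
import math

def wedge(apex, axis_angle, half_angle, reach=100.0):
    """Triangle approximating a 2D uncertainty cone (wedge) of finite reach."""
    ax, ay = apex
    p1 = (ax + reach * math.cos(axis_angle - half_angle),
          ay + reach * math.sin(axis_angle - half_angle))
    p2 = (ax + reach * math.cos(axis_angle + half_angle),
          ay + reach * math.sin(axis_angle + half_angle))
    return [apex, p1, p2]            # counter-clockwise for half_angle < pi/2

def clip(subject, clipper):
    """Sutherland-Hodgman: clip polygon `subject` by a convex CCW `clipper`."""
    out = subject
    for i in range(len(clipper)):
        if not out:
            break
        px, py = clipper[i]
        qx, qy = clipper[(i + 1) % len(clipper)]
        a, b = -(qy - py), qx - px   # interior of CCW polygon: a*x + b*y >= c
        c = a * px + b * py
        nxt, n = [], len(out)
        for j in range(n):
            s, e = out[j], out[(j + 1) % n]
            s_in = a * s[0] + b * s[1] >= c
            e_in = a * e[0] + b * e[1] >= c
            if s_in:
                nxt.append(s)
            if s_in != e_in:         # edge crosses the clipping line
                t = (c - a * s[0] - b * s[1]) / (a * (e[0] - s[0]) + b * (e[1] - s[1]))
                nxt.append((s[0] + t * (e[0] - s[0]), s[1] + t * (e[1] - s[1])))
        out = nxt
    return out

def diameter(poly):
    """Uncertainty measure: largest pairwise distance between vertices."""
    return max((math.dist(u, v) for u in poly for v in poly), default=0.0)

# Two cameras viewing the origin at 90 degrees, angular error 0.05 rad.
w1 = wedge((-5.0, 5.0), math.atan2(-5.0, 5.0), 0.05)
w2 = wedge((5.0, 5.0), math.atan2(-5.0, -5.0), 0.05)
region = clip(w1, w2)
print(round(diameter(region), 3))
```

With perpendicular viewing rays the intersection is close to a small square, so its diameter is on the order of the wedge width at the target rather than the wedge length.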
Uncertainty as worst-case reconstruction error: Rather than associating a single cone with a specific measurement, our formulation considers a possibly infinite set of viable cones for a given true camera pose and world point pair. To do this, we consider all possible perturbations of relevant quantities (projection, location or pose). When merging measurements, we consider the worst-case scenario which maximizes the reconstruction uncertainty. This formulation gives us a deterministic worst-case error model. It also allows us to factor out unknown or uncontrollable quantities such as camera orientation.
II Contributions and Related Work
The importance of view selection for scene reconstruction is well established. One of the first view selection schemes for multi-view stereo is presented in [5]. The works of Maver and Bajcsy [17] and Kutulakos and Dyer [14] use contour information to choose viewing locations. A 2003 paper by Scott et al. [21] surveys view selection methods. More recently, Furukawa et al. [6] proposed a view selection scheme to enable large-scale 3D reconstruction. Their method relies on clustering images based on overlap, and the resulting optimization problem is solved iteratively. The method of Hornung et al. [9] incrementally selects images and uses a proxy to ensure coverage. Mauro et al. resort to linear programming to solve the view selection problem [16]. Submodular optimization [13] has also been considered to jointly optimize coverage and accuracy; however, it requires repeated visits to the same region. Both [13] and [8] use surface meshes as a geometric reference to reason about optimal view selection. View selection has also been involved in image-based modeling [24], object retrieval [7] and target localization [10].
In the general reconstruction domain, keyframe methods [12], [18], [4] implement heuristics, such as the number of visible map features or the distance between keyframes, to decide whether the current frame should be used for mapping. The main idea is to reduce the number of frames for bundle adjustment so as to make the system work in real time. Mur-Artal et al. [18] introduced the "essential graph", which builds a spanning tree from the image graph to achieve real-time performance. Snavely et al. [22] proposed a method called the "skeleton set" that selects a subset of frames from the image graph to achieve similar reconstruction accuracy. However, these methods do not consider the geometry of the mapped environments. In Kaucic et al. [11], the environment is assumed to be planar and the factorization method [23] is used to speed up the bundle adjustment.
In the present work, we consider an abstraction of the problem: cameras on a viewing plane observing a planar world scene. We present a novel uncertainty model which allows us to characterize the worst-case reconstruction error in a way that is independent of particular measurements. What differentiates our work from the previous body of work is that we present a view selection mechanism with theoretical performance guarantees. Specifically, our contributions are the following.

We show that one can select two good views and obtain a reconstruction which is almost as good as merging all possible views from the entire viewing plane.

We also show that a coarse camera grid (of resolution proportional to the scene depth) can provide a good reconstruction of the entire world plane.

We present a multi-resolution view selection method which can be used for more general environments that are not strictly planar.
Our work is also related to error analysis in stereo [20, 2]. There are also many different uncertainty models. Bayram et al. [1] model the bearing measurement's uncertainty as a function of a linearized intersection area. Davison [3] approximates the uncertainty as a Gaussian distribution. We contribute to this line of work by analyzing the reconstruction error for the two best cameras with respect to the reconstruction error achievable by using all possible cameras for the particular geometry we consider.
III Problem Definition
In this section, we introduce the general sensor selection problem. Consider a world point and a camera specified by its projection center and orientation. Suppose we have a set of measurements, each expressed as a unit vector pointing towards the observed pixel and anchored at the corresponding camera center. We need an estimator that maps the measurements to an estimate of the world point's location. By choosing an error measure, we can then define the estimation error as the discrepancy between the estimate and the true location.
In this paper, we will consider the following "bounded uncertainty" characterization of the error. Consider the true measurement given by the projection of the world point onto the camera, also represented as a vector from the camera center pointing toward the point. We make the assumption that the angle between a measurement and the true projection is bounded by a fixed threshold. For a given measurement, the rays corresponding to all possible true projections form a cone, as shown in Fig. 3, which is a function of both the camera parameters and the measurement. For the rest of the paper, we will assume a fixed angular threshold and drop the subscript. By intersecting the cones from measurements in multiple views, we can get an estimate of the true target location. The uncertainty is given by the diameter of the intersection.
For sensor selection purposes, rather than a single cone, it is beneficial to associate a set of cones with each measurement. This will allow us to replace the randomness in the measurement process with a deterministic worst-case analysis. To do this, for a given true target location and camera pose, we generate the true projection. Then, for every possible measurement within the angular threshold of the true projection, we define the corresponding cone and include it in the set associated with this world point/camera pair. Note that each cone in the set includes the true location. We can further eliminate the dependency on camera orientation by taking the union of these sets over all allowable orientations, with the same requirement imposed on each cone included in the union.
We can now define the worst case uncertainty for a given set of camera centers and a ground point as:
In other words, for each camera location, a cone is chosen such that the chosen cones jointly maximize the intersection diameter. The advantage of this formulation is that the computation implicitly generates all possible measurements for a given camera location and world point, yielding a worst case uncertainty independent of specific measurements and camera rotations. We are now ready to define the first problem.
Problem 1
For a given world point, the set of all possible viewpoints, a projection error bound, and an error tolerance parameter, choose a minimum-cardinality subset of viewpoints such that its worst case uncertainty is within the tolerance factor of that of the entire viewing set.
In Problem 1, the goal is to choose a small subset of camera locations whose worst case uncertainty when reconstructing a given point is at most a given factor of the worst-case uncertainty of the entire viewing set. Problem 2 generalizes it to multiple points.
Problem 2
For a set of points, the set of all possible viewpoints, a projection error bound, and an error tolerance parameter, choose a minimum-cardinality subset of viewpoints such that the worst case uncertainty of every point is within the tolerance factor of that of the entire viewing set.
In this paper, we study a specific geometric instance of these problems in which the ground plane and the viewing plane are parallel, separated by a fixed distance. For a given world point, we define the minimum worst case uncertainty as that obtained using the entire viewing plane.
IV Sensor Selection for a Single Point
In this section, we study Problem 1 where the goal is to choose cameras to reconstruct a single point. We will start with the two dimensional (2D) case where the ground and viewing planes reduce to lines, and the uncertainty cones become wedges.
Our key result in this section is that for any ground point, one can choose two cameras whose worst case uncertainty is almost as good as the minimum worst case uncertainty, which is obtained by merging the views from all cameras. The key ideas in obtaining this result are: (1) if we choose two cameras which view the point symmetrically at 90 degrees, the diagonals of the worst-case uncertainty polygon (the intersection of the two wedges) are roughly of equal length; (2) any other camera added to the sensor set can be rotated so that its wedge contains the horizontal diagonal, and therefore it does not reduce the uncertainty drastically.
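As a quick sanity check on idea (1), the sketch below (illustrative code with our own function names) places the two cameras on the viewing line at horizontal offsets equal to the viewing height, so that the two rays to the target meet at exactly 90 degrees.

```python
import math

def optimal_pair(target_x, h):
    """Place two cameras on the viewing line y = h, symmetric about the
    target, so that their viewing rays meet at the target at a right angle."""
    return (target_x - h, h), (target_x + h, h)

def viewing_angle(c1, c2, target):
    """Angle between the two rays from the cameras to the target."""
    v1 = (target[0] - c1[0], target[1] - c1[1])
    v2 = (target[0] - c2[0], target[1] - c2[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return math.acos(dot / (math.hypot(*v1) * math.hypot(*v2)))

c1, c2 = optimal_pair(target_x=2.0, h=10.0)
angle = viewing_angle(c1, c2, (2.0, 0.0))
print(math.degrees(angle))   # 90.0
```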
IV-A The Solution of Problem 1 in 2D
Consider the set of wedges which yields the minimum worst case uncertainty. For every point on the viewing plane, this set contains a wedge which (i) is apexed at that point, (ii) has the given wedge angle, and (iii) contains the target. By definition, the wedges are rotated so as to maximize the diagonal of the intersection.
Theorem IV.1
Consider a target on the ground line and a viewing set composed of all camera locations on a line parallel to it. There exist two cameras which guarantee that
(1) 
where the bound compares the worst case uncertainty of the two chosen cameras with the minimum worst case uncertainty of the entire viewing set, and the error threshold is measured in radians.
We will prove the theorem directly by providing the two cameras, computing their worst-case uncertainty, and comparing it with the minimum possible worst-case uncertainty. First, we present the notation and the setup used in the computations. We set a coordinate system whose origin is at the target, with one axis along the ground line and the other pointing "up" toward the viewing plane. The two cameras are placed symmetrically about the target on the viewing line, with their cone orientations chosen accordingly (Fig. 4(a)). We use the angle of the bisector of a wedge to describe its orientation. Of the two half-planes whose intersection yields the wedge, the inner half-plane is the one whose measured angle is smaller, while the other is the outer half-plane, as shown in Fig. 4(a).
Their worst-case uncertainty is given by
(2) 
Consider the two wedges which give the worst-case uncertainty. Let their intersection be the polygon with the vertices and edges labeled in Fig. 4(a); we denote the lengths of the edges and of the diagonals accordingly.
We now compute these quantities.
IV-A1 Computing
In order to maximize the uncertainty over the wedge orientations, we first establish closed-form solutions for the edges and diagonals as functions of the camera positions, wedge angles, and orientations.
Using the law of cosines, the vertical diagonal can be calculated as
(3) 
Similarly, the horizontal diagonal can be calculated as
(4) 
The detailed derivation is shown in Appendix C-A.
We now consider the vertical diagonal, whose length is given in Equation 3, and maximize it over the wedge orientations. Fig. 5 shows its length as a function of the two wedge angles for a fixed error threshold. At the maximizing configuration, the inner half-planes of the two wedges intersect at the target.
We can therefore write the length of the vertical diagonal in this configuration as a function of the wedge angles. Using the law of sines on the two triangles formed, we obtain:
This establishes the maximum length of the diagonal in the worst-case configuration.
We now compare this quantity with the minimum worst-case uncertainty of the entire viewing set.
Lemma IV.2
Consider the two cameras in the optimal configuration described above and their worst-case uncertainty polygon. Any cone apexed at another viewing location can be rotated so that its uncertainty wedge contains both endpoints of the horizontal diagonal.
Now that we have established that two cameras suffice, we compute the uncertainty value:
Lemma IV.3
For the two cameras in the optimal configuration and their intersection polygon, the maximum diagonal length, and hence the worst case uncertainty, is given by
(5) 
Now we can conclude by presenting the proof of Theorem IV.1.
In this section, we showed that there exist two cameras whose worst case uncertainty is within a constant factor of the minimum. We will refer to this pair of cameras as the optimal pair for the rest of the paper, and to this configuration as the optimal configuration.
IV-B The Solution of Problem 1 in 3D
The results of the previous section readily extend to 3D.
Theorem IV.4
Given a target and a set of cameras on a viewing plane at a fixed distance from the ground plane, where the number of cameras is unbounded, we claim that the optimal pair gives
(6) 
where the bound relates the worst case uncertainty of the two cameras to the minimum worst case uncertainty in 3D.
V Sensor Selection for the Entire Scene
In the previous section, we established that for a world point, the optimal pair of cameras can produce a reconstruction within a constant factor of the optimal reconstruction (Theorem IV.4). However, if we use the dedicated pair directly for every scene point, we may end up choosing two cameras per scene point, which in turn might result in a large number of cameras.
In this section, we show that a coarse grid of cameras provides a good reconstruction for every scene point. Recall that the ground plane and the viewing plane are parallel with a fixed distance between them. Let a square grid of resolution proportional to this distance be imposed on the viewing plane (Fig. 7(a)); the same grid is also imposed on the ground plane. To demonstrate the main strategy at a high level, consider a ground point whose optimal pair of cameras lies on the camera grid. We will show that this pair can still provide a "good" reconstruction for all points in a region around the ground point.
Using this result, we will show in Theorem V.5 that a camera grid whose size grows only with the area of the ground plane can be used to achieve a small approximation ratio.
V-A Problem 2 in 2D
For cameras in the grid and a target, we define the grid uncertainty as the uncertainty obtained using only the best two cameras in the grid.
As mentioned earlier, we will choose the grid resolution to be equal to the scene depth (the distance between the two planes) for the following analysis.
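To illustrate how a target is served by the grid, the sketch below (our own naming and 1D simplification; the formal bound is Theorem V.3) snaps the ideal optimal-pair camera locations to the nearest grid cameras, with grid resolution equal to the plane separation, so each snapped camera is at most half a grid cell from its ideal position.

```python
def grid_pair(target_x, h, grid_origin=0.0):
    """Snap the ideal optimal-pair locations (target_x - h, target_x + h)
    to the nearest cameras on a 1D grid of resolution h."""
    r = h  # grid resolution proportional to the scene depth
    ideal = (target_x - h, target_x + h)
    snapped = tuple(grid_origin + round((x - grid_origin) / r) * r for x in ideal)
    return ideal, snapped

ideal, snapped = grid_pair(target_x=3.3, h=10.0)
# snapping error never exceeds half the grid resolution
errs = [abs(i - s) for i, s in zip(ideal, snapped)]
print(snapped, max(errs))
```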
Now, we define the geometry for Lemmas V.1, V.2, B.1, and B.2. Consider a grid location at the given height from the viewing plane, and choose the optimal pair of cameras for the target as shown in Fig. 7(b). Consider also the line segments obtained by intersecting a horizontal line through the target with the cones generated by the sensors.
In order to bound the uncertainty of any target using the camera grid, we need to explore the uncertainty of targets within the grid cells (Fig. 7(b)). Therefore, we fix a grid point and define a range of targets generated by moving along the axis of the grid. We now show that the worst-case uncertainty is achieved at the endpoints of this interval (i.e., the midpoints between two grid locations), and we bound it using the lengths of the intersection segments defined in Fig. 6.
Lemma V.1
The worst-case uncertainty over the interval of targets is attained at its endpoints, i.e., at the midpoints between two grid locations.
Lemma V.2
The uncertainty is maximized when the inner half-planes of both cones intersect the target.
It is clear that one of these two segment lengths is always greater than or equal to the uncertainty diameter, which can be used to generate the worst-case bound.
Theorem V.3
For all targets and a sensor grid with the chosen resolution, the worst-case grid uncertainty using only two cameras from the grid is bounded as follows:
V-B Relaxing planar scene and viewing plane assumptions
So far, our analyses of the uncertainty bound are based on the parallel plane assumptions. Such assumptions are reasonable for some applications such as high altitude aerial imagery.
In this section, we relax these assumptions so that the theorem can be applied to more general environments. Define horizontal and vertical variations of the camera positions relative to the nominal viewing plane. We will analyze the change in uncertainty when adding variation in both directions: the new camera location is generated by perturbing the nominal location by a bounded amount vertically and horizontally. We analyze both effects in the Appendix and obtain the following result.
Theorem V.4
For all targets and a sensor grid with the chosen resolution and bounded variation, the worst-case grid uncertainty using only two cameras from the grid is bounded as follows:
We can see that small deviations of the camera positions or the ground plane do not introduce significant additional uncertainty.
V-C Problem 2 in 3D
In 3D, we use the same grid resolution, which is half the distance between the optimal pair of cameras. The main result is:
Theorem V.5
For all targets and a sensor grid with the chosen resolution and bounded variation, the worst-case grid uncertainty using only two cameras from the grid is bounded as follows:
The proof is similar to the 2D case. It is extended to include perturbations in both axis directions, which slightly increases the bounds, as shown in the Appendix and Figure 8.
Theorem V.5 allows us to bound the geometric error even in the presence of variations in both viewing and scene planes. However, it does not address visibility: variations in the scene can cause occlusions which can block camera views. In the next section, we address this issue.
VI Multi-Resolution View Selection
In this section, we explore how to extend our camera view grid approach to non-planar regions such as orchards and forests. The parallel plane assumption can produce good results at high altitudes, but is insufficient to model non-planar regions. For this purpose, we propose a multi-resolution approach, which generates multiple camera view grids in a coarse-to-fine manner, to reconstruct more general regions.
The input to our method is a surface mesh generated using sparse point clouds from a SLAM method such as ORB-SLAM [18]. The method outputs a subset of the views such that each face of the mesh is well-covered, that is, covered by at least 3 cameras separated by the current grid resolution. To ensure coverage quality, we double the grid resolution at each iteration so that the minimum distance between cameras is bounded. We present the details in Section VI-B.
As the scene becomes more complex, the multi-resolution approach is able to adapt to the terrain. For a given grid resolution, we iterate through all triangles, and if a triangle is well-covered by the current subset of views, those views are added to the solution. However, the potential views that can see a triangle are limited by occlusion and matching quality. Therefore, we introduce a visibility cone for each triangle in Section VI-A to limit the search space.
Similar to [13] and [8], we also generate scene meshes to reason about the geometry. The main differences of our work are that, first, we do not require a secondary visit to the scene: the existing trajectory of views is typically sufficient to cover the environment. Second, we generalize the visibility of each mesh triangle so that well-covering views can be predicted, instead of relying on the histogram method of [8], which is strongly case-sensitive.
VI-A Visibility Cone
A camera is defined to be visible to a mesh triangle when its image contains a 2D feature of a point around the triangle. A viewing vector for a triangle is defined as the vector pointing from the center of the triangle to the corresponding camera, as shown in Figure 9. The mesh vector is then the average of all viewing vectors for that triangle. We also define the visibility angle of each triangle as the average angle between the viewing vectors. We can therefore predict the visibility of a triangle using both the visibility angle and the mesh vector. Essentially, we generate a visibility cone whose direction is the mesh vector and whose aperture is the visibility angle. We do not consider the effects of viewing angles since all the views are assumed to be facing downwards, which can easily be maintained with a gimbal stabilizer. Unlike the approach of [8], which extracts a histogram for each mesh triangle, we bound the region of possibly visible camera views using the visibility cone.
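The visibility cone construction can be sketched as follows. This is our own illustrative code: we interpret the aperture as the average angle between each viewing vector and the mesh vector (one plausible reading of the definition above), and all function names and sample coordinates are assumptions.

```python
import math

def _unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def _angle(u, w):
    return math.acos(max(-1.0, min(1.0, sum(a * b for a, b in zip(u, w)))))

def visibility_cone(tri_center, cameras):
    """Mesh vector = normalized average of the viewing vectors;
    aperture = average angle of the viewing vectors to the mesh vector."""
    views = [_unit(tuple(c - t for c, t in zip(cam, tri_center))) for cam in cameras]
    mesh_vec = _unit(tuple(sum(v[i] for v in views) for i in range(3)))
    aperture = sum(_angle(v, mesh_vec) for v in views) / len(views)
    return mesh_vec, aperture

def predicted_visible(tri_center, cone, camera, slack=0.0):
    """A camera is predicted visible if its viewing vector lies in the cone."""
    mesh_vec, aperture = cone
    v = _unit(tuple(c - t for c, t in zip(camera, tri_center)))
    return _angle(v, mesh_vec) <= aperture + slack

tri = (0.0, 0.0, 0.0)
cone = visibility_cone(tri, [(0, 0, 10), (1, 0, 10), (-1, 0, 10)])
print(predicted_visible(tri, cone, (0.5, 0, 10)))   # True
print(predicted_visible(tri, cone, (50, 0, 1)))     # False
```

A small `slack` term could be added in practice to avoid rejecting cameras just outside the averaged aperture.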
VI-B Coarse-to-Fine View Selection
After identifying the visibility cone for each triangle, we utilize the camera grid proposed earlier in a coarse-to-fine manner.
For a given grid resolution, we iterate through all faces of the mesh and check their visibility cones against the current subset of views. For each face, if the visibility cone contains at least 3 camera views from the current subset, those views are added to the solution, as shown in Figure 10. Faces covered by 3 or more cameras are not considered in the next iteration. To ensure the quality of the selected views, we require that at least 3 views be visible to each face so that feature matching error can be reduced. Since we also double the grid resolution at each iteration, the chosen views for a given face are guaranteed a minimum spacing. Given the grid spacing at the first iteration, the minimum spacing between all views after any number of iterations remains bounded below, rather than becoming arbitrarily small, which would reduce reconstruction quality.
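The iteration above can be sketched as a small 1D toy version. The view and face identifiers, visibility sets, and spacing values below are hypothetical; `min_views=3` follows the well-covered criterion, and halving the spacing doubles the grid resolution each pass.

```python
def coarse_to_fine_select(face_cones, view_pos, r0, iters=4, min_views=3):
    """face_cones: face id -> set of view ids inside that face's visibility cone.
    view_pos:   view id -> 1D position of the camera along the trajectory.
    Starting from grid spacing r0, halve the spacing (double the resolution)
    each iteration until every face is covered by >= min_views views."""
    selected, uncovered = set(), set(face_cones)
    r = r0
    for _ in range(iters):
        # candidate views lying on the current grid of spacing r
        on_grid = {v for v, x in view_pos.items()
                   if abs(x / r - round(x / r)) < 1e-9}
        covered = set()
        for f in uncovered:
            cand = face_cones[f] & on_grid
            if len(cand) >= min_views:
                selected |= cand       # keep these views in the solution
                covered.add(f)
        uncovered -= covered           # well-covered faces are done
        if not uncovered:
            break
        r /= 2.0
    return selected, uncovered

views = {i: float(i) for i in range(17)}            # cameras every 1 m
faces = {"flat": {0, 4, 8}, "tree": {1, 2, 3}}      # hypothetical visibility sets
sel, missed = coarse_to_fine_select(faces, views, r0=4.0)
print(sorted(sel), missed)
```

In this toy run the "flat" face is covered at the coarsest grid, while the "tree" face only becomes well-covered once the grid is refined to 1 m spacing.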
VII Evaluation
| Dataset | Original Frames | Avg Reproj Err | SfM time (min) | Camera Grid Frames | Avg Reproj Err | SfM time (min) |
|---|---|---|---|---|---|---|
| Orchard: 30 meters flight | 416 | 0.842 | 313.6 | 76 | 0.934 | 4.1 |
| Orchard: 10 meters flight | 375 | 0.724 | 374.7 | 84 | 0.842 | 4.4 |

| Dataset | Original Frames | Avg Reproj Err | Dense Recon (min) | Multi-Resol Method | Avg Reproj Err | Dense Recon (min) |
|---|---|---|---|---|---|---|
| Orchard: 30 meters flight | 875 | 0.863 | 1463 | 209 | 0.931 | 115 |
| Orchard: 10 meters flight | 893 | 0.944 | 1522 | 266 | 1.243 | 167 |
In this section, we present simulation results validating the uncertainty model, followed by real-world reconstruction performance using the coarse-to-fine view selection method.
VII-A Simulations
We used the intrinsic parameters (resolution, field of view, and calibration error in pixels) of a GoPro Hero 3 for the simulations. All simulations were run on an iMac with a quad-core Intel Core i5.
Model justification: We consider the following sources of uncertainty: finite resolution, calibration errors, camera center location, camera orientation.
The first two sources are less than one pixel. To investigate the role of camera location and orientation, we perturbed both with uniform noise. Figure 11(b) reports the resulting triangulation error from two cameras in the optimal configuration, with the viewing plane at a fixed height. Each simulation was repeated many times; the target location is computed by triangulation and the error is reported. The various noise levels are shown in the captions. If we choose a bound of 10 pixels for the measurement error, it corresponds to an angular error bound in radians. The solid red line shows the predicted worst-case error using our model. In general, the reprojection error will be less than 10 pixels; otherwise the measurement is discarded as an outlier. Based on the accuracy of state-of-the-art SLAM [25], we set the camera position error to a small fraction of the height while bounding the orientation error correspondingly. The histogram shows that the distance to the true target location is well bounded by the worst-case uncertainty, indicated by the vertical red line. This means that our uncertainty cone model is relatively robust to system noise.
Next, we study the effect of using two cameras vs. all cameras. We estimate the target position using least squares from all cameras and report the ratio of the error obtained using only the optimal pair to the error obtained using all cameras, plotted in Fig. 11(a). The simulation was repeated many times. The ratio in Fig. 11(a) is less than 3.5, which means that triangulating the target using the optimal pair of cameras is at most 3.5 times worse than triangulation using all cameras. This is because using only two views constrains the target less tightly than using all views, but the resulting penalty is bounded by a factor of at most 3.5.
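The two-camera vs. all-camera comparison can be reproduced with a small least-squares triangulation from bearing rays. This is our own minimal 2D sketch; the camera layout, noise level, and random seed are arbitrary assumptions, not the paper's experimental settings.

```python
import math
import random

def triangulate(cams, dirs):
    """Least-squares point closest to all rays:
    minimize sum ||(I - d d^T)(p - c)||^2 over p, solved via 2x2 normal equations."""
    A00 = A01 = A11 = b0 = b1 = 0.0
    for (cx, cy), (dx, dy) in zip(cams, dirs):
        m00, m01, m11 = 1 - dx * dx, -dx * dy, 1 - dy * dy   # I - d d^T
        A00 += m00; A01 += m01; A11 += m11
        b0 += m00 * cx + m01 * cy
        b1 += m01 * cx + m11 * cy
    det = A00 * A11 - A01 * A01
    return ((b0 * A11 - b1 * A01) / det, (A00 * b1 - A01 * b0) / det)

def noisy_dir(cam, target, delta, rng):
    """Bearing from camera to target, perturbed by a uniform angular error."""
    theta = math.atan2(target[1] - cam[1], target[0] - cam[0]) + rng.uniform(-delta, delta)
    return (math.cos(theta), math.sin(theta))

rng = random.Random(0)
h, delta, target = 10.0, 0.01, (0.0, 0.0)
all_cams = [(float(x), h) for x in range(-20, 21)]   # dense viewing line
pair = [(-h, h), (h, h)]                             # the optimal pair
all_dirs = [noisy_dir(c, target, delta, rng) for c in all_cams]
pair_dirs = [noisy_dir(c, target, delta, rng) for c in pair]
err_all = math.dist(triangulate(all_cams, all_dirs), target)
err_pair = math.dist(triangulate(pair, pair_dirs), target)
print(err_pair, err_all)
```

Both errors stay small relative to the viewing distance; averaging over many noise draws would give an empirical error ratio analogous to the one plotted in Fig. 11(a).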
VII-B Real Experiments
We collected two data sets using a GoPro Hero 3 on a UAV flying over the same region at different altitudes. The altitude ranges from 10 meters to 30 meters, and the covered areas range from planar to more general orchard scenes. The orchard contains trees that are around 3 meters tall, with ground elevation differences of around 1 meter. We recorded around 5 minutes of video, roughly 10000 frames. In order to speed up the reconstruction, we subsampled the videos for mosaicking, resulting in around 400 frames. We used the commercial AgiSoft software for structure from motion, dense reconstruction, and mosaicking to investigate the effect of view selection on reconstruction quality and reprojection error.
VII-B1 Mosaic Quality
We use the original selected frames for reconstruction and mosaicking [15]. Then, we use a grid resolution equal to the flight altitude, as shown in Figure 12, to select a subset of the frames for reconstruction and mosaicking. This means that if the drone is 10 meters above the ground plane, we select camera frames every 10 meters, which significantly reduces the number of cameras required. The total time required to reconstruct the same region decreased significantly while the reprojection error of each reconstruction remained low, as shown in Table I. For qualitative evaluation, we stitched the images together using the output poses from SfM and orthorectified the views to compare the quality of the final mosaic. The resulting mosaics are comparable, indicating that the proposed view selection mechanism does indeed perform comparably to the original input set, as shown in Figure 12.
VII-B2 Dense Reconstruction Quality
We also examine the performance of the multi-resolution camera grid approach on the orchard data sets. For dense reconstruction, such data sets should be considered as a general scene and cannot be treated as a planar region; otherwise, features at different heights cannot be covered. We first use ORB-SLAM [18] to extract camera poses and sparse point clouds. Since the point clouds still contain many inconsistent points, a filter is applied to remove noisy points too far from their surroundings. Then a mesh with a maximum of 10,000 faces is built upon those points. We extracted the visibility cones of the mesh from the given trajectory and sampled a coarse-to-fine camera view grid along the same trajectory. The original data sets last around 5 minutes and contain more than 9000 images. Using the keyframe selection method from ORB-SLAM, more than 3000 images are selected for reconstruction, which is infeasible due to computational limitations. Therefore, we subsampled the frames, retaining around 900. As shown in Figure 2, the view selection algorithm selects relatively sparser views in flat regions compared to the densely packed views in more complex regions. The view selection algorithm terminates when at least a set fraction of the surfaces are covered; therefore, there remain a few faces that are not visible to the selected views after the last iteration. The initial grid spacing is set to the height between the camera view plane and the dominant ground plane. The reconstruction time and reprojection error comparison is shown in Table I. It is clear that the computation time decreased by more than an order of magnitude, while the reprojection error increases only slightly. Essentially, our multi-resolution approach takes the scene geometry into consideration and removes redundant views that do not contribute much to the results.
Visually, we can see that the dense reconstruction quality is very comparable, as shown in Figure 13 (taken from 30 meters above) and in Figure 1 (taken from 10 meters above). Both results show that the reconstruction quality is almost identical. There is also an interesting observation: it is not necessarily beneficial to have as many views as possible for dense reconstruction. As shown in Figure 13(a), more views actually smooth out the distinct geometry of the trees, leaving edges blending into each other. At a lower altitude, as shown in Figure 1, the dense reconstruction results are almost indistinguishable.
VIII Conclusion
In this paper, we studied view selection for a specific but common setting where a ground plane is viewed from above from a parallel viewing plane. We showed that for a given world point, two views can be chosen so as to guarantee a reconstruction quality which is almost as good as one that can be obtained by using all possible views. Next, by fixing these two views and studying perturbations of the world point, we showed that one can put a coarse grid on the viewing plane and ensure good reconstructions everywhere. Even though the reconstruction quality can be improved by increasing the grid resolution, we showed that a grid resolution proportional to the scene depth suffices to guarantee a constant factor deviation from the optimal reconstruction. We then showed how to extend the bound in the presence of perturbations of the viewing or scene planes. However, as the scene geometry gets more sophisticated, occlusions must be addressed. For this purpose, we presented a multi-resolution view selection mechanism. We also presented an application of these results to image mosaicking and scene reconstruction from (low altitude) aerial imagery.
Our results provide a foundation for multiple avenues of future research. An immediate extension is to scenes which can be represented as surfaces composed of multiple planes. Giving guarantees in the presence of occlusions raises “art gallery” type research problems [19]. Furthermore, rather than selecting views a priori and in one shot, the view selection can be informed by the reconstruction process as is commonly done in existing literature. Our multiresolution view selection method provides the starting point for a batch scheme where a coarse grid is used for reconstruction under the planar scene assumption and further refined based on the intermediate reconstruction.
References
 Bayram et al. [2016] H. Bayram, J. V. Hook, and V. Isler. Gathering bearing data for target localization. IEEE Robotics and Automation Letters, 1(1):369–374, Jan 2016. ISSN 2377-3766. doi: 10.1109/LRA.2016.2521387.
 Cheong et al. [1998] Loong-Fah Cheong, Cornelia Fermüller, and Yiannis Aloimonos. Effects of errors in the viewing geometry on shape estimation. Computer Vision and Image Understanding, 71(3):356–372, 1998.
 Davison [2003] Andrew J. Davison. Real-time simultaneous localisation and mapping with a single camera. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), page 1403. IEEE, 2003.
 Engel et al. [2014] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision, pages 834–849. Springer, 2014.
 Farid et al. [1994] H. Farid, S. Lee, and R. Bajcsy. View selection strategies for multi-view, wide-base stereo. Technical Report MS-CIS-94-18, University of Pennsylvania, 1994.
 Furukawa et al. [2010] Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski. Towards internet-scale multi-view stereo. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1434–1441. IEEE, 2010.
 Gao et al. [2011] Yue Gao, Meng Wang, Zheng-Jun Zha, Qi Tian, Qionghai Dai, and Naiyao Zhang. Less is more: Efficient 3D object retrieval with query view selection. IEEE Transactions on Multimedia, 13(5):1007–1018, 2011.
 Hoppe et al. [2012] Christof Hoppe, Andreas Wendel, Stefanie Zollmann, Katrin Pirker, Arnold Irschara, Horst Bischof, and Stefan Kluckner. Photogrammetric camera network design for micro aerial vehicles. In Computer Vision Winter Workshop (CVWW), volume 8, pages 1–3, 2012.
 Hornung et al. [2008] Alexander Hornung, Boyi Zeng, and Leif Kobbelt. Image selection for improved multi-view stereo. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
 Isler and Magdon-Ismail [2008] Volkan Isler and Malik Magdon-Ismail. Sensor selection in arbitrary dimensions. IEEE Transactions on Automation Science and Engineering, 5(4):651–660, 2008.
 Kaucic et al. [2001] R. Kaucic, R. Hartley, and N. Dano. Plane-based projective reconstruction. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 1, pages 420–427 vol. 1, 2001. doi: 10.1109/ICCV.2001.937548.
 Klein and Murray [2007] Georg Klein and David Murray. Parallel tracking and mapping for small AR workspaces. In Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on, pages 225–234. IEEE, 2007.
 Krause and Golovin [2014] Andreas Krause and Daniel Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, 2014.
 Kutulakos and Dyer [1994] Kiriakos N. Kutulakos and Charles R. Dyer. Recovering shape by purposive viewpoint adjustment. International Journal of Computer Vision, 12(2-3):113–136, 1994.
 Li and Isler [2016] Z. Li and V. Isler. Large scale image mosaic construction for agricultural applications. IEEE Robotics and Automation Letters, 1(1):295–302, Jan 2016. ISSN 2377-3766. doi: 10.1109/LRA.2016.2519946.
 Mauro et al. [2014] Massimo Mauro, Hayko Riemenschneider, Alberto Signoroni, Riccardo Leonardi, and Luc Van Gool. An integer linear programming model for view selection on overlapping camera clusters. In 3D Vision (3DV), 2014 2nd International Conference on, volume 1, pages 464–471. IEEE, 2014.
 Maver and Bajcsy [1993] Jasna Maver and Ruzena Bajcsy. Occlusions as a guide for planning the next view. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(5):417–433, 1993.
 Mur-Artal and Tardós [2016] Raúl Mur-Artal and Juan D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo and RGB-D cameras. arXiv preprint arXiv:1610.06475, 2016.
 O’Rourke [1987] Joseph O’Rourke. Art Gallery Theorems and Algorithms, volume 57. Oxford University Press, Oxford, 1987.
 Sahabi and Basu [1996] Hossein Sahabi and Anup Basu. Analysis of error in depth perception with vergence and spatially varying sensing. Computer Vision and Image Understanding, 63(3):447–461, 1996.
 Scott et al. [2003] William R. Scott, Gerhard Roth, and Jean-François Rivest. View planning for automated three-dimensional object reconstruction and inspection. ACM Computing Surveys (CSUR), 35(1):64–96, 2003.
 Snavely et al. [2008] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Skeletal graphs for efficient structure from motion. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
 Sturm and Triggs [1996] Peter Sturm and Bill Triggs. A factorization based algorithm for multi-image projective structure and motion. In Computer Vision – ECCV ’96, pages 709–720. Springer, 1996.
 Vázquez et al. [2003] Pere-Pau Vázquez, Miquel Feixas, Mateu Sbert, and Wolfgang Heidrich. Automatic view selection using viewpoint entropy and its application to image-based modelling. In Computer Graphics Forum, volume 22, pages 689–700. Wiley Online Library, 2003.
 Zhang and Singh [2015] Ji Zhang and Sanjiv Singh. Visual-lidar odometry and mapping: Low-drift, robust, and fast. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 2174–2181. IEEE, 2015.
Appendix A More Reconstruction Results
Appendix B Lemmas and Theorems
B-A Proof of Lemma IV.2
{proof}We prove the lemma by contradiction. Suppose there exists a camera such that intersects at a point , where and , as shown in Fig. 4(b). Since must contain the target , . Since lie on the vertical line passing through , we can formulate using the law of sines in the triangle .
Since , we want to find the minimum over the choice of , which is equivalent to minimizing with respect to . Thus, is minimized when , which results in . Substituting , we obtain . This means that either or , both of which contradict our assumption.
B-B Proof of Lemma IV.3
{proof}Using the small-angle approximation, we get and and . The angles are constrained such that .
and
and . Therefore, and will not be negative since must be less than for the small-angle approximation to hold. Given that , we can conclude
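For reference, the small-angle approximation invoked in this proof can be made precise through the standard Taylor expansions (generic identities, not specific to the symbols of this lemma):

```latex
\sin\theta = \theta - \frac{\theta^{3}}{6} + O(\theta^{5}),
\qquad
\tan\theta = \theta + \frac{\theta^{3}}{3} + O(\theta^{5}),
```

so for $\theta \ll 1$ both $\sin\theta$ and $\tan\theta$ may be replaced by $\theta$ with relative error $O(\theta^{2})$, which justifies dropping the higher-order terms above.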
B-C Proof of Lemma V.1
{proof}We add two more line segments and to form an isosceles trapezoid (Fig. 6). When the angle , the diagonal is the longest line segment in the trapezoid . Therefore, when , that is , is satisfied, .
B-D Proof of Lemma V.2
{proof}First, when the inner half planes of both and intersect above , it is clear that moving the intersection down to increases . Now assume the target moves along the axis (Fig. 8) by some length , where . We can formulate as a function of and the distance between the cameras as
We can get the derivative as
Since , keeps increasing and is maximized at the target .
B-E Proof of Theorem V.3
{proof}The intersection length is obtained using the law of sines.
When the inner half planes of and intersect at , is maximized. We can now directly compute the worst case uncertainty when rad, which gives the desired result.
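To give numerical intuition for how the intersection length behaves, the sketch below (our illustration, not the paper's code) applies the law of sines to the error triangle of two bearing measurements: each side opposite the small angular-uncertainty angle scales as the range times the sine of that angle, divided by the sine of the angle at which the two viewing rays meet. The function name and the additive combination of the two cameras' contributions are simplifying assumptions.

```python
import math

def wedge_intersection_length(r1, r2, gamma, alpha):
    """Approximate extent of the overlap of two viewing wedges.

    r1, r2: ranges from each camera to the target
    gamma:  angle at which the two viewing rays meet at the target
    alpha:  angular uncertainty of each bearing measurement (radians)

    Law of sines on each error triangle: the side opposite the small
    angle alpha has length r * sin(alpha) / sin(gamma); the two
    cameras' contributions are summed as a simple upper estimate.
    """
    return (r1 * math.sin(alpha) + r2 * math.sin(alpha)) / math.sin(gamma)

# The estimate is smallest when the rays meet at a right angle
# (sin(gamma) = 1) and grows as the vergence angle becomes acute.
u_90 = wedge_intersection_length(30.0, 30.0, math.pi / 2, 0.01)
u_30 = wedge_intersection_length(30.0, 30.0, math.pi / 6, 0.01)
```

Here `u_30` is exactly twice `u_90`, since halving `sin(gamma)` doubles the estimate; this is the same qualitative dependence on the intersection angle that drives the worst-case bound above.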
B-F Proof of Lemma B.1
First, we analyze the effects of horizontal variation .
Lemma B.1
Let be a camera location in an optimal pair for the target . Let be obtained by perturbing in the horizontal direction. Let and .
.
B-G Proof of Lemma B.2
Then, we add vertical perturbation in between the viewing plane and the ground plane.
Lemma B.2
Let be a camera location in an optimal pair for the target . Let be obtained by perturbing in the vertical direction. Let and .
Appendix C Derivations
C-A Wedge Intersection
Using the law of sines in the triangle , we get . We also have . From , we know that . Combining both equations, we obtain: Using the same method, we have: From , we get: Similarly, from
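For completeness, the law of sines applied repeatedly in this derivation is the standard identity: in any triangle with angles $A$, $B$, $C$ opposite sides $a$, $b$, $c$,

```latex
\frac{a}{\sin A} \;=\; \frac{b}{\sin B} \;=\; \frac{c}{\sin C} \;=\; 2R,
```

where $R$ is the circumradius of the triangle. Each step above equates two such ratios in the relevant triangle and solves for the unknown side.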