Leveraging Vision Reconstruction Pipelines for Satellite Imagery
Abstract
Reconstructing 3D geometry from satellite imagery is an important topic of research. However, disparities exist between how this 3D reconstruction problem is handled in the remote sensing context and how multi-view reconstruction pipelines have been developed in the computer vision community. In this paper, we explore whether state-of-the-art reconstruction pipelines from the vision community can be applied to satellite imagery. Along the way, we address several challenges in adapting vision-based structure from motion and multi-view stereo methods. We show that vision pipelines can offer competitive speed and accuracy in the satellite context.
1 Introduction
Satellite imagery finds applications in a variety of domains, including ecological monitoring, 3D urban modelling, and navigation. The growing number of academic and commercial satellites, and the increasing accessibility of satellite imagery, have sparked greater interest in the use of satellite imagery for large-scale 3D reconstruction of the Earth’s surface [25, 11, 4, 5]. Satellites can capture the same geographic area over the course of seconds, days, months, and years, yielding a wealth of imagery available for 3D reconstruction.
Much of the research on 3D reconstruction from satellite imagery has progressed in the remote sensing community. At the same time, the computer vision community has also seen great advances in 3D reconstruction, primarily using ground-level views of objects, buildings, and interiors [32, 33, 1, 28]. As a result, several software packages for high-quality reconstruction are available [22, 14, 28, 26]. While developments in the remote sensing and computer vision communities have proceeded largely independently over the past few decades, it seems natural that advances in one area should benefit the other, since the fundamental challenges involved are similar. However, the solutions produced in each community have diverged in several ways:

In remote sensing, recent satellite-based reconstruction methods often focus on the minimal two-view scenario [8, 2, 27] within a paradigm of fusing multiple two-view reconstructions. This strategy works well empirically [36], e.g., in the state-of-the-art S2P satellite reconstruction pipeline [11]. In contrast, full multi-view stereo methods have long been explored in computer vision due to their advantages in areas of occlusion, repeated patterns, low image signal, etc.

While the full set of satellite images available for a certain location might span years, remote sensing methods tend to conservatively select pairs of images within a much shorter time span for reconstruction, due to the associated diversity in illumination, season, weather, etc. In contrast, recent vision pipelines such as COLMAP [28, 29] are designed for Internet photos captured in the wild under highly diverse conditions.

Accordingly, methods like COLMAP focus on scalability to thousands or millions of views, and hence strive for efficient algorithms suitable for GPU computing.

Finally, there are significant differences in terminology and assumptions between the communities. For example, satellite cameras are typically described by complex black-box models, while computer vision pipelines assume simpler pinhole cameras.
In this paper, we explore the question of whether these gaps can be bridged, and in particular, whether state-of-the-art reconstruction pipelines developed in the vision community can be leveraged for satellite reconstruction problems. If so, then the satellite stereo problem can benefit from progress made in the computer vision community.
Specifically, we take COLMAP, a computer vision pipeline that performs both structure from motion (SfM) and multi-view stereo (MVS) and is a top performer in recent ground-level reconstruction benchmarks [24], and adapt it to the satellite image scenario. Along the way, we address a number of key technical challenges, described in this paper. Our adapted stereo pipeline is much more efficient than S2P, and can indeed operate on large numbers of diverse images. This method achieves sub-meter accuracy, although its accuracy is lower than that of S2P, in part due to its prioritization of efficiency during the MVS stage. Hence, we also develop a simple plane-sweep stereo alternative that performs comparably to S2P in terms of accuracy.
Our adapted reconstruction pipeline is shown in Fig. 1, which highlights some key enhancements to COLMAP. The same adaptation techniques can also be applied to other MVS pipelines. In particular, the challenges we faced include:
For SfM: a main difficulty arises from the complex Rational Polynomial Camera (RPC) model [16, 34, 20] that is widely used for satellite images. We propose a method to reduce the RPC model locally to a simpler pinhole camera model, which in turn allows us to utilize state-of-the-art computer vision SfM pipelines [28] to bundle adjust the camera parameters.
For MVS: a major issue involves the lack of numerical precision due to depths that concentrate in an interval far from the sensor plane, a result of the large distance between satellites and the Earth. Numerical precision issues are particularly severe on consumer GPUs, which are limited to single-precision floating point computation. Our solution is to reparameterize depth using plane-plus-parallax [21].
In addition to these contributions, we have also released our code implementing these changes. We hope that this software can help make it easier for 3D computer vision researchers to explore the growing domain of satellite imagery.
2 Background and related work
Satellite images. Research dating back to the early 2000s [34, 20] has promoted the RPC camera model as a generic replacement for a physical sensor model in geo-mapping. The RPC model commonly used includes 78 coefficients and 10 normalization constants:

(1)  c = c_s · R_c((φ − φ_o)/φ_s, (λ − λ_o)/λ_s, (h − h_o)/h_s) + c_o
(2)  r = r_s · R_r((φ − φ_o)/φ_s, (λ − λ_o)/λ_s, (h − h_o)/h_s) + r_o

where (φ, λ, h) denote latitude, longitude, and altitude, R_c and R_r are ratios of two cubic polynomials parameterized by 39 coefficients each, and (φ_o, φ_s, λ_o, λ_s, h_o, h_s) and (c_o, c_s, r_o, r_s) are normalization constants (offsets and scales). While lacking physical interpretability, the RPC model has been successful in downstream photogrammetry tasks such as orthophoto generation and DEM extraction [20]. It has consequently become standard practice over the past decade for satellite image vendors to deliver an RPC model to accompany each satellite image.
Based on the RPC model, prior work has explored parameter estimation, bundle adjustment, and various minimal solvers. Parameter estimation [16] requires knowledge of the physical sensor, which is often inaccessible to end users, and so such estimation is generally performed by the satellite image vendors. However, efforts have sought to understand the residual error in vendor-provided RPCs, leading to the notion of the bias-corrected RPC model [13]. This work observes that the RPC model can be subject to a bias equivalent to a 2D image-space translation, and that this bias can be corrected with the aid of human-surveyed ground control points. Grodecki and Dial [15] propose an adjustable image-space translation term to facilitate bundle adjustment of multiple RPCs. Later, de Franchis et al. [9] propose a similar method to fix the relative pointing error in a stereo pair. Regarding minimal solvers, inverse projection and triangulation are critical for geometry-related tasks. For RPC models, they involve solving nontrivial third-order polynomial systems [39].
In the remote sensing community, multiple satellite stereo pipelines are based on the RPC model [8, 2, 27, 31]. These often focus on a minimal two-view stereo setting rather than utilizing multiple views at once. The two-view stereo components rely on disparity [8, 27, 2] or optical flow estimation [4] to find dense correspondences, followed by RPC-based triangulation. A representative recent work, S2P [11], heuristically ranks all pairs of images in an input set, then aligns and merges independent reconstructions produced via two-view stereo on the top 50 image pairs [8]. We compare our approach primarily to S2P, as it was the winner of the IARPA Multi-View Stereo 3D Mapping Challenge [4].
Ground-level images. Pipelines developed in the computer vision community adopt the easy-to-interpret pinhole camera model, in contrast to the RPC model. Using homogeneous coordinates, the pinhole model is written as:¹

(3)  λ [u; v; 1] = K [R, t] [x; y; z; 1]

The intrinsics matrix takes the form K = [f, s, c_x; 0, αf, c_y; 0, 0, 1], while the extrinsics consist of a camera rotation R and translation t. In many scenarios, one can assume zero skew (s = 0), unit aspect ratio (α = 1), and a centered principal point ((c_x, c_y) at the image center in normalized image coordinates).

¹In this paper, we follow MATLAB notation for denoting column vectors, row vectors, and matrices. Vectors and matrices are in boldface.
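As a concrete illustration of the pinhole projection in Eq. (3), the following minimal numpy sketch builds K and projects a scene point; all parameter values here are illustrative, not taken from the paper:

```python
import numpy as np

# Pinhole projection: lambda * [u; v; 1] = K [R, t] [x; y; z; 1].
# Symbols follow the text: focal length f, skew s, aspect ratio alpha, principal point (cx, cy).
f, s, alpha, cx, cy = 1000.0, 0.0, 1.0, 500.0, 500.0   # zero skew, unit aspect ratio
K = np.array([[f,         s, cx],
              [0.0, alpha * f, cy],
              [0.0,     0.0, 1.0]])
R = np.eye(3)                       # camera rotation
t = np.array([0.0, 0.0, 10.0])      # camera translation

def project(x):
    """Project a 3D scene point to pixel coordinates."""
    xc = R @ x + t                  # scene frame -> camera frame
    h = K @ xc                      # homogeneous pixel coordinates (lambda*u, lambda*v, lambda)
    return h[:2] / h[2]

u, v = project(np.array([0.0, 0.0, 0.0]))
# a point on the optical axis lands at the principal point (cx, cy)
```
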
Most modern computer vision reconstruction pipelines consist of SfM and MVS stages. SfM aims to obtain accurate camera parameters by iteratively optimizing these parameters along with a triangulated sparse point cloud. We refer readers to [28] for recent progress in SfM. MVS reconstructs a dense point cloud or mesh given a set of images and their corresponding camera parameters. Key issues in MVS include view selection based on camera baseline, overlap, etc., and robust recovery of dense depth maps [29]. From a computational perspective, many MVS pipelines can scale to thousands of images thanks to the use of GPU hardware.
3 Adapting SfM to satellite imagery
The goal of SfM is to recover accurate camera parameters for use in subsequent steps such as dense reconstruction. A major factor distinguishing satellite stereo pipelines from typical vision pipelines is the camera model (RPC vs. pinhole). In this section, we bridge this gap by locally approximating the RPC model with a pinhole model. Following this approximation, modern SfM pipelines, e.g., VisualSFM [38] and COLMAP SfM [28], can be utilized to bundle adjust the camera parameters.
3.1 Coordinate system
Our Earth is approximately ellipsoidal. To locate a point on the Earth’s surface, the RPC model uses a global coordinate frame consisting of (latitude, longitude, altitude). This coordinate frame is defined with respect to a nominal reference ellipsoid, e.g., World Geodetic System 1984 (WGS84).
The (latitude, longitude, altitude) coordinate system is not a Cartesian frame as assumed by most vision pipelines. In real-world applications, however, we almost always focus on a much smaller geographic area, either because our reconstruction task is regional or because a large area is divided into smaller patches for parallel processing. Hence, we adopt a simpler local Cartesian coordinate system for a specific reconstruction problem. In particular, we use the East-North-Up (ENU) coordinate system, defined by first choosing an observer point; the “east”, “north”, and “up” directions at this point then form the three axes. The “east-north” plane is parallel to the tangent plane of the reference ellipsoid at the observer point. With a careful choice of the observer point, such a local Cartesian frame not only opens the door to a pinhole model approximation, but also improves numerical stability by bringing the scene close to the origin.
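The (latitude, longitude, altitude) to ENU conversion described above can be sketched as follows, going through Earth-centered Cartesian (ECEF) coordinates on the WGS84 ellipsoid; the observer location used below is illustrative:

```python
import numpy as np

# WGS84 reference ellipsoid constants
A = 6378137.0                 # semi-major axis (m)
F = 1.0 / 298.257223563       # flattening
E2 = F * (2.0 - F)            # eccentricity squared

def geodetic_to_ecef(lat, lon, alt):
    """(lat, lon in degrees, alt in meters) -> Earth-centered Cartesian coordinates."""
    lat, lon = np.radians(lat), np.radians(lon)
    n = A / np.sqrt(1.0 - E2 * np.sin(lat) ** 2)   # prime vertical radius of curvature
    x = (n + alt) * np.cos(lat) * np.cos(lon)
    y = (n + alt) * np.cos(lat) * np.sin(lon)
    z = (n * (1.0 - E2) + alt) * np.sin(lat)
    return np.array([x, y, z])

def geodetic_to_enu(lat, lon, alt, lat0, lon0, alt0):
    """Coordinates of (lat, lon, alt) in the ENU frame anchored at observer (lat0, lon0, alt0)."""
    d = geodetic_to_ecef(lat, lon, alt) - geodetic_to_ecef(lat0, lon0, alt0)
    sl, cl = np.sin(np.radians(lat0)), np.cos(np.radians(lat0))
    so, co = np.sin(np.radians(lon0)), np.cos(np.radians(lon0))
    rot = np.array([[-so,       co,      0.0],   # east axis
                    [-sl * co, -sl * so, cl],    # north axis
                    [cl * co,   cl * so, sl]])   # up axis
    return rot @ d

# a point 100 m directly above the observer maps to approximately (0, 0, 100)
enu = geodetic_to_enu(-34.49, -58.59, 100.0, -34.49, -58.59, 0.0)
```

Because the subtraction happens in ECEF before rotating into the local frame, the resulting scene coordinates stay close to the origin, as the text notes.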
3.2 3D bounding cube generation
In the (latitude, longitude, altitude) coordinate system, an area of interest (AOI) is specified by a 2D bounding box in latitude and longitude. To bound the AOI along the altitude axis, we use public SRTM data describing the global terrain altitude up to an accuracy of 30m [23], together with the assumption that most vegetation and buildings are less than a few hundred meters in height. We assume that the input satellite images are cropped to this AOI.
3.3 Tonemapping
Satellite images typically have high dynamic range (HDR). To accelerate computation, the input HDR satellite images are tonemapped to LDR. We observe that the original pixel intensities exhibit a long-tailed distribution, hence direct scaling results in dark, low-contrast LDR images. Instead, we use a standard tonemapping operation consisting of a gamma correction I → I^γ (where I denotes pixel intensity) followed by a scale factor.
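A minimal sketch of such a tonemapping step is below. The gamma value (1/2.2) and the sample intensities are illustrative assumptions; the text only specifies the shape of the operation (gamma correction plus scaling):

```python
import numpy as np

def tonemap(hdr, gamma=1.0 / 2.2):
    """Map a high-dynamic-range image (non-negative floats) to uint8 LDR."""
    x = hdr.astype(np.float64) / hdr.max()   # normalize to [0, 1]
    x = x ** gamma                           # gamma correction lifts the dark long tail
    return (255.0 * x).round().astype(np.uint8)

hdr = np.array([[0.0, 10.0, 100.0, 2047.0]])  # long-tailed, 11-bit-style intensities
ldr = tonemap(hdr)
# direct linear scaling would map the value 10.0 to ~1; gamma correction keeps it visible
```
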
3.4 Perspective camera approximation
We now mathematically justify the practice of locally approximating an RPC camera with a perspective camera, then provide a numerical method to estimate camera parameters.
Mathematical justification. The physical satellite imaging process can be modelled by an orbiting linear pushbroom camera, such that each row of a satellite image is captured at a slightly different time instant. For satellite images, the depth values of scene points lie in the interval [d̄ − Δd, d̄ + Δd], with Δd ≪ d̄. Under this condition, we show that both perspective cameras and linear pushbroom cameras can be accurately approximated by the so-called weak perspective camera, and consequently we can approximately transform a linear pushbroom camera into a perspective camera.

First, as a result of the small scene depth variation and very large average scene depth, a perspective camera reduces to a weak perspective camera that takes the form

u = (f x_c + s y_c)/d̄ + c_x,    v = αf y_c/d̄ + c_y,

where x_c = [x_c; y_c; z_c] is a point in the camera coordinate frame [12].²

Next, we show that a linear pushbroom camera can also be reduced to a weak perspective camera under the same condition. Projection under a linear pushbroom camera can be modeled as [u; λv; λ] = M [x; 1], where x is a point in the scene coordinate frame and the first row of M involves no perspective division. In other words, satellite images exhibit linearity along the row axis u, while having a (weak) perspective effect along the column axis v. Let K₀, [R₀, t₀] be the intrinsics and extrinsics (assuming zero skew and unit aspect ratio) of the perspective camera along the column axis, with focal length f₀ and principal point c₀. We then use [R₀, t₀] to transform scene coordinates x to camera coordinates x_c, and rewrite the linear pushbroom camera model, the row u becoming an affine function a₁x_c + a₂y_c + a₃z_c + a₄ of the camera coordinates. With some approximation (z_c ≈ d̄), the result is again a weak perspective camera:

(4)  u ≈ a₁ x_c + a₂ y_c + a₃ d̄ + a₄
(5)  v ≈ (f₀/d̄) y_c + c₀

Comparing the two weak perspective cameras, we have f = a₁d̄, s = a₂d̄, c_x = a₃d̄ + a₄, αf = f₀, and c_y = c₀. Solving for f, s, c_x, together with K₀ and [R₀, t₀], gives us the perspective camera approximation. Note that Eqs. 4 and 5 assume that the satellite moves with constant orientation and velocity, which is approximately true because of the extremely short time needed to capture a small local area.

²In this paper, x refers to a 3D point in the scene coordinate frame, while x_c refers to a 3D point in a given camera coordinate frame.
Numerical solution. The derivation above only establishes the existence of perspective approximations to satellite cameras in a local area. To actually solve for an approximated perspective camera model, we perform three steps: sampling the RPC model, fitting a projection matrix, and factorizing into a standard form.
We first sample the RPC model by generating a set of 3D-2D correspondences. Specifically, we uniformly discretize the ENU 3D bounding cube into a finite grid, with each axis evenly divided into a fixed number of sample locations. The resulting grid samples are then converted from ENU coordinates to (latitude, longitude, altitude) and projected via the RPC model into pixel coordinates; this gives us a set of 3D-2D correspondences. We filter out correspondences whose pixel coordinates lie outside the image boundary.
Next, we solve for a projection matrix P. For simplicity, we denote x̃ = [x; 1], and then the projection equation of our perspective camera becomes λ [u; v; 1] = P x̃. Given the 3D-2D correspondences, we use the standard direct linear transformation method to solve for P [17]. Finally, we factor P into the standard form K [R, t] [17]. Note that this numerical solution outputs a full intrinsic matrix with a nonzero skew, which is unusual for ground-level images but can be explained in the satellite case by our derivation above. The physical intuition is that satellite images are produced by stitching image rows captured at slightly different instants; hence rows may not be perfectly aligned.
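The "fit a projection matrix" step can be sketched with the standard direct linear transformation (DLT). Here the correspondences are synthesized from a known camera with nonzero skew (as arises in the satellite case) so that the fit can be checked; in the pipeline they would come from sampling the RPC model, and all numbers below are illustrative:

```python
import numpy as np

def fit_projection_matrix(pts3d, pts2d):
    """DLT: recover the 3x4 matrix P up to scale from n >= 6 non-degenerate correspondences."""
    rows = []
    for (x, y, z), (u, v) in zip(pts3d, pts2d):
        xh = np.array([x, y, z, 1.0])
        rows.append(np.concatenate([xh, np.zeros(4), -u * xh]))  # u * (p3 . xh) = p1 . xh
        rows.append(np.concatenate([np.zeros(4), xh, -v * xh]))  # v * (p3 . xh) = p2 . xh
    _, _, vt = np.linalg.svd(np.asarray(rows))
    return vt[-1].reshape(3, 4)      # null vector = right singular vector of smallest singular value

def reproject(p, pts3d):
    h = (p @ np.c_[pts3d, np.ones(len(pts3d))].T).T
    return h[:, :2] / h[:, 2:]

k = np.array([[800.0, 5.0, 512.0],   # nonzero skew, as recovered for satellite images
              [0.0, 800.0, 512.0],
              [0.0,   0.0,   1.0]])
p_true = k @ np.c_[np.eye(3), np.array([0.1, -0.2, 10.0])]
rng = np.random.default_rng(0)
pts3d = rng.uniform(-1.0, 1.0, (20, 3))
pts2d = reproject(p_true, pts3d)
p_fit = fit_projection_matrix(pts3d, pts2d)
max_err = np.abs(reproject(p_fit, pts3d) - pts2d).max()   # reprojection error of the fit
# P can then be factored into K [R, t], e.g., via an RQ decomposition (not shown)
```
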
Advantages. Locally approximating the RPC model with a perspective camera yields several benefits beyond physical interpretability, efficiency, and simplicity of minimal solvers. First, it enables us to easily convert satellite images for use in existing SfM and MVS pipelines. For instance, Section 3.6 describes how to correct for the skew effect in satellite images. Second, this approximation shows that linear epipolar geometry approximately holds locally in satellite imagery, which avoids the need to handle the complex epipolar lines described by the RPC model as in [11]. This is important for standard vision-based reconstruction pipelines. Third, bundle adjustment is simplified, as demonstrated in Section 3.5, whereas bundle adjusting RPC models is unintuitive and requires expert or proprietary knowledge.
3.5 Bundle adjustment
Bundle adjustment is a method for refining the parameters of a set of cameras in order to improve their global consistency and accuracy, via minimization of the reprojection error of a set of sparse 3D points given their measured projections in the image set [37]. As shown in prior work [13, 15], the RPC model is subject to a bias equivalent to a translation in the image domain. We observe that the principal point in the perspective camera model naturally captures any image-space translation, and therefore we can bundle adjust just the principal points of all cameras, while keeping the other intrinsic and extrinsic parameters fixed. To minimize the distortion caused by projective ambiguity, to fix the gauge ambiguity, and to constrain the reconstruction to the same geographic area, we propose to add a regularization term on the sparse 3D points’ coordinates such that they do not drift too far from their original coordinates during bundle adjustment. Let {(x_j, u_{ij})} be the 3D-2D correspondences (point j observed in image i), and let P_i be the corresponding projection matrices. Our proposed regularized bundle adjustment objective is

(6)  min_{{c_i}, {x_j}}  Σ_{i,j} ‖ π(P_i, c_i, x_j) − u_{ij} ‖₂² + μ Σ_j ‖ x_j − x̄_j ‖₂²

where π(P_i, c_i, x_j) denotes the projection of point x_j by camera i with principal point c_i, x̄_j are the point coordinates triangulated with the camera parameters before bundle adjustment, μ is a regularization weight tuned empirically in our experiments, ‖·‖₂ is the Euclidean norm, and {c_i}, {x_j} are our optimization variables.
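The structure of the regularized objective in Eq. (6) can be sketched as follows. Only the principal points {c_i} and points {x_j} are variables; the value of μ and all camera and point values below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def objective(principal_points, points, p_fixed, observations, points_init, mu=0.01):
    """Eq. (6)-style cost. p_fixed[i]: 3x4 projection of camera i with zero principal point;
    observations[i][j]: measured pixel of point j in image i; mu is illustrative."""
    cost = 0.0
    for i, p in enumerate(p_fixed):
        for j, x in enumerate(points):
            h = p @ np.append(x, 1.0)
            uv = h[:2] / h[2] + principal_points[i]   # principal point acts as an image-space offset
            cost += np.sum((uv - observations[i][j]) ** 2)
    return cost + mu * np.sum((points - points_init) ** 2)  # regularizer against point drift

# tiny synthetic setup: two cameras, three points, known principal-point offsets
f = 100.0
p_fixed = [np.hstack([np.diag([f, f, 1.0]), np.array([[0.0], [0.0], [5.0]])]),
           np.hstack([np.diag([f, f, 1.0]), np.array([[0.5], [0.0], [5.0]])])]
points = np.array([[0.0, 0.0, 0.0], [0.4, -0.3, 0.2], [-0.5, 0.1, -0.1]])
cc_true = np.array([[2.0, 1.0], [-1.5, 0.5]])
observations = [[(p @ np.append(x, 1.0))[:2] / (p @ np.append(x, 1.0))[2] + cc
                 for x in points] for p, cc in zip(p_fixed, cc_true)]

cost_at_truth = objective(cc_true, points, p_fixed, observations, points)
cost_perturbed = objective(cc_true + 1.0, points, p_fixed, observations, points)
# the cost vanishes at the true parameters and grows when the principal points are perturbed
```

In the actual pipeline this objective would be minimized with a nonlinear least-squares solver; the sketch only demonstrates the cost structure.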
3.6 Skew correction
The skew parameter s in an intrinsic matrix is typically assumed to be zero in computer vision approaches, and hence existing MVS pipelines often do not model it. We design a simple skew-correction step to remove skew so that we can use standard pipelines. Our approach decomposes an intrinsic matrix as follows:

(7)  K = [f, s, c_x; 0, αf, c_y; 0, 0, 1] = [1, s/(αf), 0; 0, 1, 0; 0, 0, 1] · [f, 0, c_x − s c_y/(αf); 0, αf, c_y; 0, 0, 1]

With this decomposition, we then apply the inverse of the first matrix on the right-hand side to transform the images themselves, and use the second matrix on the right-hand side as the new, skew-free intrinsics. The resulting image warp resamples the image at the original resolution, so resolution is preserved by this step. Other modifications of the intrinsics, such as normalizing a non-unit aspect ratio, can be performed in a similar way.
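The decomposition in Eq. (7) can be sketched directly; the sample intrinsics are illustrative:

```python
import numpy as np

def remove_skew(k):
    """Split intrinsics K into (shear, K_skew_free) with K = shear @ K_skew_free (Eq. 7)."""
    fx, s, cx = k[0]
    fy, cy = k[1, 1], k[1, 2]
    shear = np.array([[1.0, s / fy, 0.0],
                      [0.0, 1.0,    0.0],
                      [0.0, 0.0,    1.0]])
    k_free = np.array([[fx, 0.0, cx - s * cy / fy],
                       [0.0, fy, cy],
                       [0.0, 0.0, 1.0]])
    return shear, k_free

k = np.array([[800.0, 4.0, 510.0],
              [0.0, 805.0, 498.0],
              [0.0,   0.0,   1.0]])
shear, k_free = remove_skew(k)
# the image is resampled with np.linalg.inv(shear) as a projective warp,
# and k_free becomes the new skew-free pinhole intrinsics
```
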
4 Adapting MVS to satellite imagery
In this section, we identify and resolve key issues that prevent direct application of standard MVS pipelines, tailored for ground-level images, to the satellite domain.
4.1 Depth reparametrization
MVS pipelines in computer vision often begin by estimating per-view depth maps. Depth is usually defined as the z component of a 3D point in the reference camera’s coordinate frame. Satellite cameras are distant (hundreds or thousands of km) from the Earth, leading to a depth distribution with a large mean and comparatively small variance. This introduces numerical precision issues when computationally intensive MVS algorithms are run on consumer GPUs with only single-precision floating point, which provides only about 7 significant decimal digits.
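The precision problem is easy to demonstrate in single-precision arithmetic; the 700 km figure below is merely illustrative of satellite viewing distances:

```python
import numpy as np

# At satellite range, float32 cannot resolve sub-meter depth differences, while
# plane-plus-parallax depths (bounded by hundreds of meters) retain ample resolution.
far = np.float32(700_000.0)    # conventional depth: hundreds of km from the sensor
near = np.float32(100.0)       # reparametrized depth: height above a reference plane

step_far = np.spacing(far)     # smallest representable float32 increment at 700 km
step_near = np.spacing(near)   # smallest representable float32 increment at 100 m
# step_far is 0.0625 m, so a 1 cm depth change is rounded away entirely;
# step_near is ~7.6e-6 m, plenty for sub-meter reconstruction
```
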
We tackle this challenge by reparametrizing depth using plane-plus-parallax [21]: we first choose a plane close to the scene and parallel to the ground as our reference plane. The distance between a 3D point and this reference plane is then defined as the reparametrized depth, which is usually bounded by hundreds of meters. Concretely, this amounts to adding a fourth row to the 3 by 4 projection matrix P:

(8)  λ [u; v; 1; λ′/d̄] = [P; nᵀ, d] x̃,    λ′ = d̄ (nᵀx + d)/λ ≈ nᵀx + d,

where λ is the conventional depth; the plane (n, d)³ is chosen to lie below the scene, i.e., nᵀx + d > 0 holds for all scene points x; and d̄ is the average conventional depth of all the sparse scene points from the SfM step. With these careful choices, the reparametrized depth λ′ is a positive variable generally bounded by hundreds of meters.⁴ The augmented projection matrix is finally scaled such that its largest entry is 1.

³We denote the plane equation as nᵀx + d = 0 for convenience.
⁴The second equality is approximate because λ has a large mean but small variance, so that λ ≈ d̄; nᵀx + d is the distance between the point and the plane (n, d).
4.2 Stable homography computation
To measure photo-consistency between two views, many modern MVS systems use homographies to relate an image patch in a reference view to one in a source view. The homography matrix is classically computed as H = K₂ (R − t nᵀ/d) K₁⁻¹, where (n, d) describes a plane in camera one’s coordinate frame, and [R, t] is camera two’s relative pose with respect to camera one. We find that this expression for computing a homography can be numerically unstable under single precision, due to large intermediate values. To address this issue, we derive an alternative way to compute the homography directly from the projection matrices of Eq. 8, without involving large numbers. For convenience, denote pixel coordinates as ũ = [u; v; 1], scene coordinates as x̃ = [x; 1], and the projection matrices, each augmented with the fourth row [nᵀ, d] as in Eq. 8, as P₁⁺ and P₂⁺. Given a plane (n, d) and the projection equations of the two cameras, our goal is to derive a homography matrix H such that ũ₂ ≅ H ũ₁.

Derivation. The plane equation [nᵀ, d] x̃ = 0 implies that

(9)  λ₁ [u₁; v₁; 1; 0] = P₁⁺ x̃

Let us write the vector [u₁; v₁; 1; 0] as [ũ₁; 0], with ũ₁ = [u₁; v₁; 1]. Then x̃ = λ₁ (P₁⁺)⁻¹ [ũ₁; 0], i.e., λ₂ [ũ₂; 0] = P₂⁺ x̃ = λ₁ P₂⁺ (P₁⁺)⁻¹ [ũ₁; 0]. Hence we have

(10)  ũ₂ ≅ [I, 0] P₂⁺ (P₁⁺)⁻¹ [I; 0ᵀ] ũ₁

where I is a 3 × 3 identity matrix. Let us define M = P₂⁺ (P₁⁺)⁻¹. Then the homography matrix is H = [I, 0] M [I; 0ᵀ], i.e., the top-left 3 × 3 block of M. One can easily check that Eq. 10 involves numbers with limited magnitude, given that P₁⁺ and P₂⁺ are pre-scaled. Analysis of its numerical stability can be found in the supplemental document.
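As a sanity check of the classical plane-induced homography that opens this subsection (the starting point for the stable variant), the following sketch verifies H = K₂ (R − t nᵀ/d) K₁⁻¹ on synthetic cameras and a synthetic plane; all values are illustrative:

```python
import numpy as np

k1 = np.diag([500.0, 500.0, 1.0])
k2 = np.diag([520.0, 520.0, 1.0])
theta = 0.1
r = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])   # relative rotation of camera two
t = np.array([0.3, -0.1, 0.05])                        # relative translation
n, d = np.array([0.0, 0.0, 1.0]), -5.0                 # plane z = 5 (n.x + d = 0) in camera one

h = k2 @ (r - np.outer(t, n) / d) @ np.linalg.inv(k1)  # plane-induced homography

# project a point lying on the plane into both cameras
x1 = np.array([0.2, -0.4, 5.0])                        # satisfies n.x + d = 0
u1 = k1 @ x1; u1 = u1 / u1[2]
x2 = r @ x1 + t
u2 = k2 @ x2; u2 = u2 / u2[2]
u2_via_h = h @ u1; u2_via_h = u2_via_h / u2_via_h[2]
# u2_via_h coincides with u2: the homography transfers plane points between the views
```

The numerically stable alternative replaces this expression by products of the pre-scaled augmented projection matrices, avoiding the large intermediate values that arise at satellite scale.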
4.3 Plane-sweep stereo
In addition to integrating the depth reparametrization and stable homography computation into COLMAP’s MVS (which is based on PatchMatch stereo [3]), we also evaluate standard plane-sweep stereo, with ground-parallel planes, as an alternate MVS module. This usage is motivated by the observation that the fronto-parallel assumption often holds very strongly for urban scenes if we sweep planes parallel to the ground. We adopt the census cost as a photo-consistency measure and compute a cost volume. We further smooth each cost slice using a guided filter, as in [19]. Plane sweeping and cost volume filtering are very efficient; if accuracy is not a top concern, then the argmin at each pixel of the filtered cost volume can serve as the height map. However, we can also compute a refined height map using Markov Random Field (MRF) optimization techniques, which are especially helpful in areas of weak texture, as we demonstrate in our experiments.
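A toy winner-take-all sketch of the sweep is below. For a rectified pair, sweeping fronto-parallel planes reduces to testing a discrete set of disparities; the census cost, guided filtering, and MRF refinement from the text are replaced here by a bare absolute difference and an argmin on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
ref = rng.random((40, 60))                # reference view
true_disp = 3
src = np.roll(ref, true_disp, axis=1)     # source view: reference shifted by a known disparity

num_planes = 8                            # one hypothesis per swept plane
cost_volume = np.stack([np.abs(ref - np.roll(src, -dsp, axis=1))  # photo-consistency cost
                        for dsp in range(num_planes)])
disp_map = cost_volume.argmin(axis=0)     # winner-take-all height/disparity map
# on this noise-free example the argmin recovers true_disp at every pixel
```

In the real module, each cost slice would be smoothed (guided filtering) before the argmin, or the volume handed to an MRF solver for refinement in weakly textured areas.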
5 Experiments
In this section, we provide empirical evidence for the validity of our adapted pipeline on a benchmark we created from the MVS3DM dataset [4]. We first evaluate the SfM module, and then the MVS module.
5.1 Dataset and metrics
We tested our methods on three sites from the IARPA MVS3DM dataset [4]. Tab. 1 summarizes the data. The images were captured over a span of two years by the WorldView-3 satellite, with a resolution of 30cm per pixel in nadir views. 3D ground truth is captured by airborne lidar.
Table 1: Dataset summary.

          # Views   Avg. # Pix   Area (km²)
site 1    47        4.22M        0.464
site 2    47        1.27M        0.138
site 3    40        1.44M        0.150
During evaluation, we follow previous practice [4] and flatten the ground-truth point cloud and the reconstructed one onto the same pre-specified geographic grid (with a fixed cell size) to obtain two height maps. The maps are then aligned and compared pixel-wise to compute metric scores, namely median height error and completeness (the percentage of points with error less than a threshold of 1m).
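The two metrics can be sketched directly on a pair of height maps; the threshold follows the text, while the tiny sample grids are illustrative:

```python
import numpy as np

def height_map_metrics(gt, pred, threshold=1.0):
    """Return (completeness in %, median height error in meters) over jointly valid pixels."""
    valid = np.isfinite(gt) & np.isfinite(pred)   # ignore cells missing in either map
    err = np.abs(gt[valid] - pred[valid])
    completeness = 100.0 * np.mean(err < threshold)
    return completeness, np.median(err)

gt = np.array([[10.0, 12.0], [11.0, np.nan]])     # ground-truth heights (m), one cell missing
pred = np.array([[10.2, 13.5], [11.1, 9.0]])      # reconstructed heights (m)
cp, me = height_map_metrics(gt, pred)
# errors over valid cells are {0.2, 1.5, 0.1}: two of three are below 1 m
```
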
5.2 Structure from motion
We evaluate our perspective camera approximation using three metrics: accuracy of forward projection, accuracy of triangulation, and consistency between our bundle-adjusted perspective cameras and S2P’s pointing-error-corrected RPC models. For concision, the following experiments are reported on site 1, but the conclusions generalize to the other sites.
As mentioned above, our approximated perspective cameras are derived from the grid points sampled in the 3D bounding volume. The average per-view maximum forward projection error relative to the RPC model on these grid points is 0.194px. Further, we apply both the RPC model and the approximate perspective camera to points in the ground-truth point cloud. The average per-view maximum forward projection error is 0.126px, which corresponds to a maximum accuracy loss of only 3.8cm given an image resolution of 30cm per pixel.
Second, we evaluate the triangulation accuracy of our perspective cameras. We first run standard feature detection and matching across all views, then triangulate the resulting feature tracks using both the approximate perspective and RPC models. The distribution of the number of views across 34K feature tracks is shown in Fig. 3. Perspective camera triangulation is performed with COLMAP, and for RPC triangulation we implemented a solver based on [35]. We then compute the difference in point locations between the perspective- and RPC-triangulated 3D points. The distribution of differences is shown in Fig. 3; the majority of the differences are below 5cm, with a few large differences caused by incorrect correspondences.
Third, we investigate the effect of bundle adjustment. The median reprojection error over 34K sparse points reduces from 1.36px to 0.864px after bundle adjustment; the distribution of reprojection errors is shown in Fig. 4. This reduction in reprojection error indicates better consistency among cameras, and translates to improved final reconstruction accuracy because our bundle adjustment scheme is free of projective ambiguity. We also compare the bundle-adjusted perspective cameras to the RPC models whose pointing error is corrected by S2P. We take the correspondence map between two views computed by S2P, and re-triangulate with our bundle-adjusted perspective cameras. We then compare our triangulated point cloud to S2P’s point cloud triangulated with the RPC model, using the evaluation code. The median height difference is just 2.72cm. The error map is shown in Fig. 6.
The above three aspects demonstrate the high accuracy of the approximated perspective cameras relative to the RPC models in a local area. The entire SfM procedure takes about 1.68 minutes⁵ for this site with 46 images, which is very efficient thanks to the fact that the SfM pipeline was initially designed to handle large-scale Internet photo collections.

⁵All runtimes are reported on a machine with 16 CPUs, 60GB memory, and 3 GPUs.
[Figure: height maps on a stereo pair produced by S2P (CP = 64.2%), COLMAP (CP = 61.2%), PSS+CVF (CP = 67.5%), and PSS+CVF+MRF (CP = 68.3%).]
[Figure: example view, lidar ground truth, COLMAP reconstruction, and error map for Sites 1–3.]
5.3 Multi-view stereo
We now evaluate our adapted MVS module in two settings: 1) COLMAP’s MVS enhanced with our proposed depth reparametrization and stable homography computation, run on all views; and 2) a comparison of different methods, including plane-sweep stereo, on stereo pairs.
We report the accuracy of our enhanced COLMAP MVS, run on all images of each of sites 1–3, in Tab. 2. Qualitative results are shown in Fig. 8. Considering COLMAP MVS’s greater efficiency and scalability, these results are quite good, despite the accuracy being lower than that of the S2P-based method [8]. We believe this gap in accuracy is mainly due to the fact that COLMAP MVS’s core stereo component is efficiency-driven. In particular, it sometimes struggles at object boundaries and in weakly textured regions, as can be seen from the errors surrounding buildings, on the ground, and on building tops. Similar observations have also been reported recently on ground-level images [30]. However, an interesting aspect of this experiment is that COLMAP leverages all images at once, which is quite distinct from the practice of S2P, which only considers pairs of images at a time. We believe that these results are promising and suggest that further exploration of diverse handling of images in the satellite domain will be fruitful.
Next, we compare different algorithms in the minimal setting of stereo pairs. In particular, we compare the standard S2P to COLMAP configured to run on just two views, as well as to various settings of plane-sweep stereo. For each site, we select 10 image pairs on which the baseline S2P performs very well; all algorithms are tested on these same image pairs, so that the comparison focuses on stereo accuracy, isolated from the influence of view selection and aggregation. The average metric performance over the selected pairs is shown in Tab. 3. The experiments show that COLMAP MVS is the least accurate while being the fastest, while our PSS+CVF+MRF⁶ outputs results comparable to S2P. Furthermore, we found that both COLMAP and PSS+CVF, which lack global optimization, suffer from errors in weakly textured areas, e.g., the ground and large building tops, while the MRF helps reduce these errors (see Fig. 7).

⁶We use graph cuts in these experiments [7, 6]; the system can be further accelerated with faster MRF solvers, e.g., SGM-like solvers [18, 10].
Table 2: Enhanced COLMAP MVS run on all views of each site.

          CP (%)   ME (m)   Time (mins)
site 1    72.5     0.315    18.7
site 2    66.8     0.450    7.37
site 3    63.4     0.393    6.94
Table 3: Two-view comparison, averaged over the 10 selected pairs per site.

                        CP (%)   ME (m)   Time (mins)
Site 1   S2P            65.6     0.432    16.8
         COLMAP         58.0     0.648    0.374
         PSS+CVF        68.1     0.442    4.40
         PSS+CVF+MRF    69.1     0.395    29.7
Site 2   S2P            62.8     0.435    2.63
         COLMAP         59.3     0.689    0.196
         PSS+CVF        63.8     0.502    1.19
         PSS+CVF+MRF    64.0     0.478    6.75
Site 3   S2P            53.9     0.421    2.78
         COLMAP         42.1     0.949    0.208
         PSS+CVF        57.1     0.591    2.39
         PSS+CVF+MRF    58.8     0.521    14.8
6 Conclusion
In this work, we propose a set of mechanisms to adapt the satellite-image-based 3D reconstruction problem to one that can be solved with computer vision pipelines designed for ground-level images. We show the effectiveness of the proposed methods by incorporating them into COLMAP, a state-of-the-art SfM and MVS system. We evaluate this pipeline, as well as plane-sweep stereo, on a remote sensing dataset. We will publish our code to encourage future research in bridging the gap between satellite stereo and vision reconstruction.
Acknowledgements. The research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via DOI/IBC Contract Number D17PC00287. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. We thank Zhengqi Li and Charles Herrmann for helpful discussions.
References
 [1] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski. Building Rome in a day. In Proc. Int. Conf. on Computer Vision (ICCV), pages 72–79, 2009.
 [2] R. A. Beyer, O. Alexandrov, and S. McMichael. The ames stereo pipeline: Nasa’s open source software for deriving and processing terrain data. Earth and Space Science, 5(9):537–548, 2018.
 [3] M. Bleyer, C. Rhemann, and C. Rother. PatchMatch Stereo  Stereo Matching with Slanted Support Windows. In Proc. British Machine Vision Conf. (BMVC), 2011.
 [4] M. Bosch, Z. Kurtz, S. Hagstrom, and M. Brown. A multiple view stereo benchmark for satellite imagery. In Applied Imagery Pattern Recognition Workshop (AIPR), 2016 IEEE, pages 1–9, 2016.
 [5] M. Bosch, A. Leichtman, D. Chilcott, H. Goldberg, and M. Brown. Metric evaluation pipeline for 3d modeling of urban scenes. International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences, 42, 2017.
 [6] Y. Boykov and V. Kolmogorov. An experimental comparison of mincut/maxflow algorithms for energy minimization in vision. Trans. Pattern Analysis and Machine Intelligence, 26(9):1124–1137, 2004.
 [7] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. In Proc. Int. Conf. on Computer Vision (ICCV), pages 377–384, 1999.
 [8] C. De Franchis, E. Meinhardt-Llopis, J. Michel, J.-M. Morel, and G. Facciolo. An automatic and modular stereo pipeline for pushbroom images. In ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2014.
 [9] C. de Franchis, E. Meinhardt-Llopis, J. Michel, J.-M. Morel, and G. Facciolo. Automatic sensor orientation refinement of Pléiades stereo images. In IEEE Int. Geoscience and Remote Sensing Symposium (IGARSS), pages 1639–1642, 2014.
 [10] G. Facciolo, C. De Franchis, and E. Meinhardt. MGM: A significantly more global matching for stereovision. In Proc. British Machine Vision Conf. (BMVC), 2015.
 [11] G. Facciolo, C. De Franchis, and E. Meinhardt-Llopis. Automatic 3D reconstruction from multi-date satellite images. In Proc. Computer Vision and Pattern Recognition Workshops (CVPRW), pages 57–66, 2017.
 [12] D. A. Forsyth and J. Ponce. Computer Vision: A Modern Approach, pages 88–101. Prentice Hall, 2003.
 [13] C. S. Fraser and H. B. Hanley. Bias-compensated RPCs for sensor orientation of high-resolution satellite imagery. Photogrammetric Engineering and Remote Sensing, 71(8):909–915, 2005.
 [14] S. Fuhrmann, F. Langguth, and M. Goesele. MVE – A multi-view reconstruction environment. In Proc. of the Eurographics Workshop on Graphics and Cultural Heritage, 2014.
 [15] J. Grodecki and G. Dial. Block adjustment of high-resolution satellite images described by rational polynomials. Photogrammetric Engineering and Remote Sensing, 69(1):59–68, 2003.
 [16] R. I. Hartley and T. Saxena. The cubic rational polynomial camera model. In Image Understanding Workshop, volume 649, page 653, 1997.
 [17] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.
 [18] H. Hirschmüller. Stereo processing by semiglobal matching and mutual information. Trans. Pattern Analysis and Machine Intelligence, 30(2):328–341, 2008.
 [19] A. Hosni, C. Rhemann, M. Bleyer, C. Rother, and M. Gelautz. Fast cost-volume filtering for visual correspondence and beyond. Trans. Pattern Analysis and Machine Intelligence, 35(2):504–511, 2013.
 [20] Y. Hu, V. Tao, and A. Croitoru. Understanding the rational function model: methods and applications. International Archives of Photogrammetry and Remote Sensing, 20(6), 2004.
 [21] M. Irani, P. Anandan, and D. Weinshall. From reference frames to reference planes: Multiview parallax geometry and applications. In Proc. European Conf. on Computer Vision (ECCV), 1998.
 [22] M. Jancosek and T. Pajdla. Multi-view reconstruction preserving weakly-supported surfaces. In Proc. Computer Vision and Pattern Recognition (CVPR), 2011.
 [23] A. Jarvis, H. I. Reuter, A. Nelson, and E. Guevara. Hole-filled SRTM for the globe version 4, 2008.
 [24] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun. Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction. In ACM Trans. Graphics (SIGGRAPH), 2017.
 [25] G. Kuschk. Large scale urban reconstruction from remote sensing imagery. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 5:W1, 2013.
 [26] P. Moulon, P. Monasse, R. Marlet, and Others. OpenMVG. An Open Multiple View Geometry library. https://github.com/openMVG/openMVG.
 [27] E. Rupnik, M. Daakir, and M. P. Deseilligny. MicMac – a free, open-source solution for photogrammetry. Open Geospatial Data, Software and Standards, 2(1):14, 2017.
 [28] J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016.
 [29] J. L. Schönberger, E. Zheng, J.-M. Frahm, and M. Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In Proc. European Conf. on Computer Vision (ECCV), pages 501–518, 2016.
 [30] T. Schöps, J. L. Schönberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proc. Computer Vision and Pattern Recognition (CVPR), pages 3260–3269, 2017.
 [31] D. E. Shean, O. Alexandrov, Z. M. Moratto, B. E. Smith, I. R. Joughin, C. Porter, and P. Morin. An automated, open-source pipeline for mass production of digital elevation models (DEMs) from very-high-resolution commercial stereo satellite imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 116:101–117, 2016.
 [32] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3D. In ACM Trans. Graphics (SIGGRAPH), 2006.
 [33] N. Snavely, S. M. Seitz, and R. Szeliski. Modeling the world from internet photo collections. Int. J. of Computer Vision, 80(2):189–210, 2008.
 [34] C. V. Tao and Y. Hu. A comprehensive study of the rational function model for photogrammetric processing. Photogrammetric Engineering and Remote Sensing, 67(12):1347–1358, 2001.
 [35] C. V. Tao and Y. Hu. 3D reconstruction methods based on the rational function model. Photogrammetric Engineering and Remote Sensing, 68(7):705–714, 2002.
 [36] T. Toutin. Geometric processing of remote sensing images: models, algorithms and methods. Int. J. of Remote Sensing, 25(10):1893–1924, 2004.
 [37] B. Triggs, P. McLauchlan, R. I. Hartley, and A. Fitzgibbon. Bundle adjustment – a modern synthesis. In Vision Algorithms: Theory and Practice, volume 1883 of Lecture Notes in Computer Science, pages 298–372, 2000.
 [38] C. Wu. Towards linear-time incremental structure from motion. In Int. Conf. on 3D Vision (3DV), pages 127–134, 2013.
 [39] E. Zheng, K. Wang, E. Dunn, and J.-M. Frahm. Minimal solvers for 3D geometry from satellite imagery. In Proc. Int. Conf. on Computer Vision (ICCV), pages 738–746, 2015.