Primitive-based 3D Building Modeling, Sensor Simulation, and Estimation
Abstract
As we begin to model large, realistic 3D building scenes, a representation more compact than the polygonal mesh model becomes necessary. Because the large amounts of annotated training data that such a system requires are costly to obtain, we leverage synthetic data to train our system for the satellite image domain. Using the synthetic data, we formulate building decomposition as an application of instance segmentation and primitive fitting that decomposes a building into a set of primitive shapes. Experimental results on a WorldView-3 satellite image dataset demonstrate the effectiveness of our 3D building modeling approach.
Xia Li*, Yen-Liang Lin*, James Miller*, Alex Cheon, Walt Dixon
GE Global Research, Niskayuna, USA
*The first three authors contributed equally.
1 Introduction
Reconstructing realistic 3D building models from remote sensor data benefits several tasks, including physical security vulnerability assessment, mission planning, and urban visualization. A primitive-based representation provides several advantages over the polygonal mesh representation, such as regularization through prior knowledge, compactness, and a symbolic description of the structure. However, building modeling and primitive fitting remain challenging, and several questions need to be addressed: how many primitives are needed to represent the structure, how those primitives are arranged, and how to determine the best fit.
Existing work [1] uses random sample consensus (RANSAC) to estimate the planes of building walls. However, RANSAC must satisfy many constraints and can become unstable when these constraints are noisy. Convex decomposition is another possible approach to shape decomposition. Ren et al. [2] decompose arbitrary 2D and 3D shapes into a minimum number of near-convex parts. However, the resulting parts are not guaranteed to be primitive shapes. A recent approach [3] learns to assemble objects from volumetric primitives. The parameters of the primitives (cuboids), such as their number, size, and orientation, are estimated by a deep network, and the resulting reconstruction gives an interpretable representation of the input object. However, it is an unsupervised approach that requires large-scale training images for each category and cannot accurately fit the input 3D data. Moreover, the method applies only to cuboid representations, which limits its ability to model more complex building shapes.
We propose to synthesize training data for primitive-based 3D building modeling, which avoids costly annotation and allows deep learning models to learn the shape decomposition in a data-driven manner. In particular, we present a synthesis pipeline that generates varied building shapes and types. Using the synthetic data, we formulate building decomposition as an application of instance segmentation and primitive fitting that decomposes a building into a set of sections. Each section is then classified as a certain primitive type, and model fitting is applied to adjust the pose and scale of the predicted section.
The main contributions of this work include:

We show that leveraging synthetic data is an effective approach for building decomposition and primitive fitting. In particular, we achieve promising performance on the WorldView-3 satellite image dataset.

We propose a synthesis pipeline that generates building shapes and types in an iterative manner: it partitions the simulation region into randomly sized non-overlapping regions and synthesizes different heights and primitive types for each region.

We formulate the problem of primitive-based 3D building modeling as an application of instance segmentation and primitive fitting that decomposes a building into a set of primitive shapes.
2 Proposed method
The system pipeline for building 3D primitive models is shown in Figure 2. A baseline building elevation (digital terrain model) is estimated, and the building model is referenced to this datum. We then formulate the decomposition problem as a cascaded application of instance segmentation, which decomposes a building into a set of sections. To improve the decomposition, a correction step fills gaps in the building interior between individual sections. For each section, primitive classification and fitting are applied under multiple building height hypotheses. The best-fitting model is selected and the 3D model is generated.
2.1 Building Simulation
To simulate a building shape, we define a region of space and recursively partition it at random. In a manner similar to constructing a quadtree, we randomly sample a point within the region and divide the region into four rectangular regions. We iterate to partition the region into randomly sized non-overlapping rectangles. We then randomly select a subset of the rectangles to form the building and discard the others. The selected rectangles form a realistic building footprint but typically contain more primitives than are necessary to represent the building. We therefore simplify the collection by merging adjacent rectangles that completely share an edge (cf. Figure 3). Finally, we assign a random height and a roof model to each building section.
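The partition, select, and merge steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the recursion depth, the minimum cell size, the keep probability, the height range, and the four-entry `ROOF_TYPES` list are all hypothetical choices made only to keep the example small.

```python
import random

# Illustrative subset only; this short list is an assumption of the sketch.
ROOF_TYPES = ["flat", "gable", "hip", "shed"]


def partition(rect, depth, rng, min_size=2.0):
    """Quadtree-like split of rect=(x0, y0, x1, y1) at a random interior point."""
    x0, y0, x1, y1 = rect
    if depth == 0 or x1 - x0 < 2 * min_size or y1 - y0 < 2 * min_size:
        return [rect]
    # Keep the split point away from the border so no child is thinner than min_size.
    sx = rng.uniform(x0 + min_size, x1 - min_size)
    sy = rng.uniform(y0 + min_size, y1 - min_size)
    children = [(x0, y0, sx, sy), (sx, y0, x1, sy),
                (x0, sy, sx, y1), (sx, sy, x1, y1)]
    return [leaf for c in children for leaf in partition(c, depth - 1, rng, min_size)]


def try_merge(a, b):
    """Return the union rectangle if a and b completely share an edge, else None."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    if (ay0, ay1) == (by0, by1) and (ax1 == bx0 or bx1 == ax0):  # side by side
        return (min(ax0, bx0), ay0, max(ax1, bx1), ay1)
    if (ax0, ax1) == (bx0, bx1) and (ay1 == by0 or by1 == ay0):  # one above the other
        return (ax0, min(ay0, by0), ax1, max(ay1, by1))
    return None


def merge_all(rects):
    """Repeatedly merge pairs of rectangles until no pair fully shares an edge."""
    rects = list(rects)
    merged = True
    while merged:
        merged = False
        for i in range(len(rects)):
            for j in range(i + 1, len(rects)):
                union = try_merge(rects[i], rects[j])
                if union is not None:
                    rects[j] = union
                    del rects[i]
                    merged = True
                    break
            if merged:
                break
    return rects


def simulate_building(seed=0, size=40.0, depth=2, keep_prob=0.6):
    """Partition a square region, keep a random subset, merge, assign height and roof."""
    rng = random.Random(seed)
    cells = partition((0.0, 0.0, size, size), depth, rng)
    kept = [c for c in cells if rng.random() < keep_prob] or [cells[0]]
    return [{"rect": r, "height": rng.uniform(3.0, 30.0), "roof": rng.choice(ROOF_TYPES)}
            for r in merge_all(kept)]
```

Exact float comparison is safe in `try_merge` because neighboring cells reuse the same split coordinates. Calling `simulate_building(seed=3)` returns a list of sections, each with a footprint rectangle, a height, and a roof label.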
2.2 Stereo Simulation
We simulate how a building will appear in a stereo reconstruction to produce a new height map (cf. Figure 4). First, we generate an ideal height map for the building. Then, we add random Gaussian noise to the simulated heights. We perturb the boundaries of the building and the boundaries between building sections by randomly dilating points along the height map to model the boundary properties observed in stereo reconstructions from tools like s2p. Finally, we smooth the noisy and perturbed height map to model the correlation typically seen in the output of satellite stereo reconstruction tools like s2p.
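A small sketch of these four steps (ideal height map, Gaussian noise, random dilation, smoothing) is given below. It is only an illustration: the noise level `sigma`, the dilation probability, the 3x3 maximum used as the dilation, and the 3x3 box filter used for smoothing are assumptions of the sketch, since the text does not specify these details.

```python
import random


def rasterize(sections, width, height):
    """Ideal height map: each (x0, y0, x1, y1, h) section fills its pixels with h."""
    grid = [[0.0] * width for _ in range(height)]
    for x0, y0, x1, y1, h in sections:
        for y in range(y0, y1):
            for x in range(x0, x1):
                grid[y][x] = h
    return grid


def neighborhood(grid, x, y):
    """Values in the 3x3 window around (x, y), clipped at the image border."""
    rows, cols = len(grid), len(grid[0])
    return [grid[j][i]
            for j in range(max(0, y - 1), min(rows, y + 2))
            for i in range(max(0, x - 1), min(cols, x + 2))]


def simulate_stereo(ideal, rng, sigma=0.3, dilate_prob=0.3):
    """Degrade an ideal height map with noise, random dilation, and smoothing."""
    rows, cols = len(ideal), len(ideal[0])
    # 1) Additive Gaussian noise on the heights.
    noisy = [[v + rng.gauss(0.0, sigma) for v in row] for row in ideal]
    # 2) Random dilation: at randomly chosen pixels take the 3x3 maximum,
    #    which pushes building and section boundaries outward irregularly.
    dilated = [row[:] for row in noisy]
    for y in range(rows):
        for x in range(cols):
            if rng.random() < dilate_prob:
                dilated[y][x] = max(neighborhood(noisy, x, y))
    # 3) 3x3 box smoothing to mimic spatially correlated stereo output.
    smoothed = []
    for y in range(rows):
        row = []
        for x in range(cols):
            window = neighborhood(dilated, x, y)
            row.append(sum(window) / len(window))
        smoothed.append(row)
    return smoothed
```

For example, `simulate_stereo(rasterize([(2, 2, 8, 6, 10.0)], 12, 8), random.Random(0))` returns a degraded 8-row by 12-column height map for a single 10-unit-high section.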
2.3 Building Decomposition
To generate training data for instance segmentation [4], we use the simulation method described in Section 2.1. We include both the idealized and the noisy boundary images in the training data. The simulated buildings are randomly rotated by between 0 and 45 degrees so that the method is exposed to primitives with arbitrary orientations. A total of 10,000 simulated buildings are generated. The network is initialized from a model pre-trained on the COCO dataset. First, all layers of the CNN feature extractor are frozen and the remaining layers are trained for 60 epochs. Then all layers are fine-tuned for another 60 epochs.
To decompose a building into a set of shapes, we cascade the application of Mask R-CNN [4] to partition the building into a set of parts. At each iteration, we select the instance with the largest intersection over union (IoU) with respect to the original mask and then remove that instance from the data. This is a greedy approach to decomposing the building into a set of shapes. Figure 5 demonstrates the decomposition procedure, including the bounding box after the first iteration, the image after removing the selected instance, and the final decomposition.
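The greedy cascade can be sketched as below. Masks are represented as sets of pixel coordinates, and `segment` is a hypothetical stand-in for the trained instance segmentation network (any callable that returns a list of instance masks). Comparing each proposal against the mask that remains at the current iteration is one reading of "original mask" and is an assumption of this sketch.

```python
def rect_mask(x0, y0, x1, y1):
    """Set of pixel coordinates covered by an axis-aligned rectangle."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


def iou(a, b):
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cascade_decompose(building_mask, segment, max_iters=20):
    """Greedily peel off the best-matching instance until nothing is left."""
    remaining = set(building_mask)
    parts = []
    for _ in range(max_iters):
        if not remaining:
            break
        # Keep only the portion of each proposal that lies on the remaining mask.
        candidates = [inst & remaining for inst in segment(remaining)]
        candidates = [c for c in candidates if c]
        if not candidates:
            break
        best = max(candidates, key=lambda c: iou(c, remaining))
        parts.append(best)
        remaining -= best  # remove the selected instance, then segment again
    return parts, remaining


# Toy usage: an L-shaped footprint and a stub "network" with two fixed proposals.
wing_a, wing_b = rect_mask(0, 0, 4, 2), rect_mask(0, 2, 2, 4)
parts, leftover = cascade_decompose(wing_a | wing_b, lambda mask: [wing_a, wing_b])
```

In the toy example the larger wing is selected first because it has the higher IoU with the full footprint, and the smaller wing is recovered in the second iteration.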
2.4 Primitive Fitting
Our primitive fitting pipeline consists of two parts: primitive classification and primitive fitting, where the primitive classification estimates the roof types, and primitive fitting aligns the estimated roof primitive to the input point cloud. The primitive fitting procedure is illustrated in Figure 6.
2.4.1 Synthetic point clouds for different roof types
We use 15 primitive types as our primitive set, which covers the most common roof types. For certain roof types we include different directions, e.g., four directions for the shed roof. Example roof primitives are shown in Figure 7. We sample a fixed number of points (e.g., 2048) for each primitive. To simulate the digital height model, we add uniform random noise to the rotation angle about the z-axis (from -45 to 45 degrees) and to the height values (±0.1, with heights in the range [0, 1]). We randomly sample 500 point clouds for each primitive, so a total of 7,500 synthetic point clouds are used for training and validation.
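As an illustration of this sampling procedure, the sketch below draws a noisy point cloud from a roof primitive defined over a unit footprint. The three roof height functions are hypothetical stand-ins for the primitive set, and applying one rotation per cloud and independent height noise per point is an assumption, since the text does not state the granularity of the noise.

```python
import math
import random


def roof_height(kind, x, y):
    """Roof height at (x, y) in [-0.5, 0.5]^2 for a few illustrative roof types."""
    if kind == "flat":
        return 0.5
    if kind == "shed":   # single slope rising along x
        return 0.5 + 0.4 * (x + 0.5)
    if kind == "gable":  # ridge along the x-axis at y = 0
        return 0.5 + 0.4 * (1.0 - 2.0 * abs(y))
    raise ValueError(f"unknown roof type: {kind}")


def sample_primitive(kind, rng, n_points=2048, max_angle_deg=45.0, height_noise=0.1):
    """Sample a point cloud with a random z-rotation and uniform height noise."""
    theta = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    c, s = math.cos(theta), math.sin(theta)
    points = []
    for _ in range(n_points):
        x, y = rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)
        z = roof_height(kind, x, y) + rng.uniform(-height_noise, height_noise)
        z = min(1.0, max(0.0, z))  # keep heights inside [0, 1]
        points.append((c * x - s * y, s * x + c * y, z))
    return points


def make_dataset(kinds, clouds_per_kind, seed=0):
    """Build (point cloud, label) pairs, clouds_per_kind clouds for each roof type."""
    rng = random.Random(seed)
    return [(sample_primitive(k, rng), k) for k in kinds for _ in range(clouds_per_kind)]
```

With 15 roof types and 500 clouds per type, `make_dataset` would yield the 7,500 labeled clouds mentioned above; the sketch itself defines only three types.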
2.4.2 Primitive classification
We use PointNet [5] for primitive classification. Given an input point cloud, our method first rectifies the point cloud into a canonical pose, fills in the walls and bottom, and normalizes the 3D point cloud to a unit cube. The normalized point cloud is fed into the primitive classification model to estimate the roof type. Classifying the primitive first has two advantages: 1) it is more robust to noise in the input point cloud, and 2) it runs faster because it avoids fitting every primitive to the point cloud.
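The normalization step might look like the following minimal sketch. It assumes a single uniform scale factor (the largest extent), so the aspect ratio of the section is preserved; per-axis scaling would be an equally plausible reading of "normalize to a unit cube".

```python
def normalize_to_unit_cube(points):
    """Translate and uniformly scale 3D points so they fit inside [0, 1]^3."""
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    # A single scale keeps the shape's proportions; guard against a degenerate cloud.
    scale = max(hi - lo for lo, hi in zip(mins, maxs)) or 1.0
    return [tuple((p[i] - mins[i]) / scale for i in range(3)) for p in points]
```

For instance, a section spanning 4 x 8 x 2 units is mapped to a box of size 0.5 x 1.0 x 0.25 anchored at the origin.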
2.4.3 Primitive fitting
Once the primitive class is estimated, we apply Coherent Point Drift (CPD) [6] to align the predicted primitive to the target 3D point cloud. We assume that the transform is rigid, so the parameter space involves only rotation, translation, and scale. We compare the fitting result with the initial flat roof model and select the primitive model with the smallest fitting error.
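The final selection step can be illustrated as follows. This sketch does not implement CPD; it assumes the candidate models have already been aligned to the target and only shows how the flat-roof baseline and the predicted primitive could be compared. The error measure used here (mean distance from each target point to its nearest model point) is an assumption, as the text does not define the fitting error.

```python
import math


def fitting_error(model_pts, target_pts):
    """Mean distance from each target point to its nearest model point (brute force)."""
    return sum(min(math.dist(t, m) for m in model_pts) for t in target_pts) / len(target_pts)


def select_best(candidates, target_pts):
    """candidates maps a model name to its already aligned points; pick the lowest error."""
    errors = {name: fitting_error(pts, target_pts) for name, pts in candidates.items()}
    return min(errors, key=errors.get), errors


# Toy usage: a gable-shaped target compared against a flat and a gable model.
grid = [(i / 4 - 0.5, j / 4 - 0.5) for i in range(5) for j in range(5)]
flat_model = [(x, y, 0.5) for x, y in grid]
gable_model = [(x, y, 1.0 - 2.0 * abs(y)) for x, y in grid]
best_name, all_errors = select_best({"flat": flat_model, "gable": gable_model}, gable_model)
```

The brute-force nearest-neighbor search is quadratic in the number of points, which is acceptable for a sketch but would normally be replaced by a spatial index.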
2.5 Texturing
The last stage of the system maps the texture coordinates of the true orthographic color image onto the output 3D model. Texturing is currently limited to the overhead view and simply wraps the roof texture onto the building sides.
3 Experimental results
3.1 Dataset
We evaluate our method on WorldView-3 satellite images. The testing areas comprise four regions: AOI1 and AOI2 (Wright-Patterson Air Force Base), AOI3 (University of California, San Diego), and AOI4 (Jacksonville, Florida), which cover 0.358, 0.614, 0.962, and 1.813 square kilometers, respectively. We first apply our 2D-based building instance segmentation and 3D reconstruction algorithms to obtain the digital height model for each building region, and use it as the input to our primitive-based 3D building modeling approach. Figure 8 shows the reconstructed digital height models for the four testing areas.
3.2 Evaluation Criteria
We evaluate our method using the metrics described in [7]. The metrics are completeness, correctness, and the Jaccard index in both 2D and 3D space, defined as $\text{completeness} = \frac{TP}{TP + FN}$, $\text{correctness} = \frac{TP}{TP + FP}$, and $\text{Jaccard} = \frac{TP}{TP + FP + FN}$, where $TP$, $FP$, and $FN$ denote the true-positive, false-positive, and false-negative areas (2D) or volumes (3D). We compute the 3D scores from the intersection between our 3D primitive reconstruction and the ground truth digital shape model. For the 2D scores, we project the 3D primitive reconstruction into 2D image space and compare it with the ground truth 2D building mask.
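As a concrete illustration, the three scores can be computed from a predicted and a ground truth mask as sketched below. The sketch assumes the usual definitions in terms of true positives (TP), false positives (FP), and false negatives (FN): completeness = TP / (TP + FN), correctness = TP / (TP + FP), and Jaccard = TP / (TP + FP + FN). Masks are sets of pixel coordinates in 2D; the same function works for sets of voxels in 3D.

```python
def evaluation_scores(predicted, truth):
    """Completeness, correctness, and Jaccard index of two pixel (or voxel) sets."""
    tp = len(predicted & truth)   # correctly reconstructed cells
    fp = len(predicted - truth)   # reconstructed but not in the ground truth
    fn = len(truth - predicted)   # in the ground truth but missed
    completeness = tp / (tp + fn) if tp + fn else 0.0
    correctness = tp / (tp + fp) if tp + fp else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return completeness, correctness, jaccard


# Toy usage: two 4x4 squares offset by one pixel along x.
pred_mask = {(x, y) for x in range(0, 4) for y in range(4)}
true_mask = {(x, y) for x in range(1, 5) for y in range(4)}
scores = evaluation_scores(pred_mask, true_mask)
```

Here 12 cells overlap and 4 cells are unique to each mask, so completeness and correctness are both 0.75 and the Jaccard index is 0.6.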
3.3 Experimental Results
Table 1: Evaluation results (%) on the four testing AOIs.
Metrics  AOI1  AOI2  AOI3  AOI4
2D completeness  93.9  91.5  90.7  96.0
2D correctness  97.8  91.4  89.5  88.9
2D Jaccard  92.0  84.3  82.1  85.8
3D completeness  93.7  87.5  90.0  96.0
3D correctness  96.2  88.5  86.7  80.4
3D Jaccard  90.4  78.6  79.1  77.8
Table 2: Vertex and face counts of the U3D baseline mesh and of our primitive-based models.
Method  Count  AOI1  AOI2  AOI3  AOI4
U3D baseline  Vertices  1,432,900  2,458,640  3,849,248  7,253,844
U3D baseline  Faces  2,863,407  4,914,139  7,694,575  15,076,271
Our method  Vertices  994  4,822  25,803  18,736
Our method  Faces  1,680  7,748  38,600  29,072
Table 1 shows the evaluation results on the four testing AOIs. We continue to work on improving the performance of the system. Primitive reconstruction results are shown in Figure 8. By representing the polygonal mesh model as a set of primitives, our method reduces the vertex and face counts of the original dense triangulation by two to three orders of magnitude (e.g., from 1,432,900 to 994 vertices on AOI1), yielding a much more compact representation (cf. Table 2).
4 Discussions
Based on the fitting results for our four AOIs, we observe the following main challenges. First, the current building decomposition strategy does not handle stacked building structures; the next step is to incorporate stacked structures into both the simulation and the decomposition strategy. Second, errors in the box decomposition propagate to primitive fitting. We will seek further improvement by applying constructive solid geometry modeling to our primitive representation.
5 Acknowledgement
The research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via DOI/IBC Contract Number D17PC00287. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
References
 [1] Ruwen Schnabel, Roland Wahl, and Reinhard Klein, “Efficient RANSAC for point-cloud shape detection,” Computer Graphics Forum, 2007.
 [2] Zhou Ren, Junsong Yuan, Chunyuan Li, and Wenyu Liu, “Minimum near-convex decomposition for robust shape representation,” in ICCV, 2011.
 [3] Shubham Tulsiani, Hao Su, Leonidas J. Guibas, Alexei A. Efros, and Jitendra Malik, “Learning shape abstractions by assembling volumetric primitives,” in CVPR, 2017.
 [4] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, “Mask R-CNN,” in ICCV, 2017.
 [5] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” in CVPR, 2017.
 [6] Andriy Myronenko and Xubo Song, “Point set registration: Coherent point drift,” TPAMI, 2010.
 [7] M. Bosch, A. Leichtman, D. Chilcott, H. Goldberg, and M. Brown, “Metric evaluation pipeline for 3D modeling of urban scenes,” in ISPRS Archives, 2017.