Primitive-based 3D Building Modeling, Sensor Simulation, and Estimation


As we begin to model large, realistic 3D building scenes, a representation more compact than the polygonal mesh becomes necessary. Because learning-based methods require large amounts of annotated training data, which is costly to obtain, we leverage synthetic data to train our system for the satellite image domain. Using the synthetic data, we formulate building decomposition as an application of instance segmentation and primitive fitting that decomposes a building into a set of primitive shapes. Experimental results on a WorldView-3 satellite image dataset demonstrate the effectiveness of our 3D building modeling approach.

Xia Li*, Yen-Liang Lin*, James Miller*, Alex Cheon, Walt Dixon (*the first three authors contributed equally)
GE Global Research, Niskayuna, USA

1 Introduction

Reconstructing realistic 3D building models from remote sensor data benefits several tasks, including physical security vulnerability assessment, mission planning, and urban visualization. A primitive-based representation offers several advantages over a polygonal mesh: regularization through prior knowledge, compactness, and a symbolic description of the structure. However, building modeling and primitive fitting remain challenging, and several questions need to be addressed: how many primitives are needed to represent a structure, how those primitives are arranged, and how to determine the best fit.

Fig. 1: The goal for our system is to represent buildings and other man-made structures from the reconstructed digital height models (DHM) with a collection of geometric primitives.

Existing work [1] uses random sample consensus (RANSAC) to estimate the planes of building walls. However, RANSAC requires solving many constraints and can become unstable when these constraints are noisy. Convex decomposition is another possible approach to shape decomposition: Ren et al. [2] decompose arbitrary 2D and 3D shapes into a minimum number of near-convex parts. However, the resulting parts are not guaranteed to be primitive shapes. A recent approach [3] learns to assemble objects from volumetric primitives. The parameters of the primitives (cuboids), such as their number, size, and orientation, are estimated by a deep network, and the resulting reconstruction gives an interpretable representation of the input object. However, it is an unsupervised approach that requires large-scale training images for each category and cannot fit the input 3D data accurately. Moreover, the method handles only cuboids, which limits its ability to represent more complex building shapes.

We propose to synthesize training data for primitive-based 3D building modeling, which does not incur costly annotations and allows deep learning models to learn the shape decomposition in a data-driven manner. In particular, we present a synthesis pipeline to generate varied building shapes and types. By utilizing the synthetic data, we formulate the building decomposition as an application of instance segmentation and primitive fitting to decompose a building into a set of sections. Each section is then classified as a certain primitive type, and model fitting is applied for adjusting pose and scale of the predicted section.

Fig. 2: System pipeline of our approach.

The main contributions of this work include:

  • We show that leveraging synthetic data is an effective approach for building decomposition and primitive fitting. In particular, we achieve promising performance on a WorldView-3 satellite image dataset.

  • We propose a synthesis pipeline that generates building shapes and types iteratively: it partitions the simulation region into randomly sized non-overlapping regions and synthesizes different heights and primitive types for each region.

  • We formulate the problem of primitive-based 3D building modeling as an application of instance segmentation and primitive fitting to decompose a building into a set of primitive shapes.

2 Proposed method

The system pipeline for building 3D primitive models is shown in Figure 2. A baseline building elevation (digital terrain model) is estimated, and the building model is referenced to this datum. We then formulate the decomposition problem as a cascaded application of instance segmentation, which decomposes a building into a set of sections. To improve the decomposition, a correction step fills gaps between individual sections in the interior of a building. For each section, primitive classification and fitting are applied under multiple building height hypotheses. The best-fitting model is selected and the 3D model is generated.

2.1 Building Simulation

Fig. 3: Building section generation.
Fig. 4: Simulating buildings with multiple primitive types (rectangular prisms, triangular prisms, elliptical cylinders), heights and boundary distortions. Left to right: binary mask footprint, reference primitives, flat roof height map simulation, boundary distortion modeling stereo reconstructions, same with primitives overlaid.

To simulate a building shape, we define a region of space and recursively randomly partition the region. In a manner similar to constructing a quadtree, we randomly sample a point within the region and divide the region into 4 rectangular regions. We iterate to partition the region into randomly sized non-overlapping rectangles. We then randomly select a subset of the rectangles to form the building, and discard other rectangles. The building rectangles, while forming a realistic footprint for a building, typically have more primitives than necessary to represent the building. We simplify the selected collection of rectangles by merging adjacent rectangles that completely share an edge (cf. Figure 3). We assign random heights to each building section and assign roof models to each section.
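The footprint simulation can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the recursion depth, the minimum cell size, the keep probability, and the height range are all hypothetical, and roof-model assignment is omitted.

```python
import random

def partition(rect, depth, rng, min_size=4):
    """Quadtree-like split of rect=(x0, y0, x1, y1) at a random interior point."""
    x0, y0, x1, y1 = rect
    if depth == 0 or x1 - x0 < 2 * min_size or y1 - y0 < 2 * min_size:
        return [rect]
    sx = rng.randint(x0 + min_size, x1 - min_size)
    sy = rng.randint(y0 + min_size, y1 - min_size)
    quadrants = [(x0, y0, sx, sy), (sx, y0, x1, sy),
                 (x0, sy, sx, y1), (sx, sy, x1, y1)]
    cells = []
    for quad in quadrants:
        cells.extend(partition(quad, depth - 1, rng, min_size))
    return cells

def try_merge(a, b):
    """Return the union of two rectangles that completely share an edge, else None."""
    if a[1] == b[1] and a[3] == b[3] and (a[2] == b[0] or b[2] == a[0]):
        return (min(a[0], b[0]), a[1], max(a[2], b[2]), a[3])
    if a[0] == b[0] and a[2] == b[2] and (a[3] == b[1] or b[3] == a[1]):
        return (a[0], min(a[1], b[1]), a[2], max(a[3], b[3]))
    return None

def merge_sections(rects):
    """Repeatedly merge pairs of rectangles until no pair shares a full edge."""
    rects = list(rects)
    merged_any = True
    while merged_any:
        merged_any = False
        for i in range(len(rects)):
            for j in range(i + 1, len(rects)):
                merged = try_merge(rects[i], rects[j])
                if merged is not None:
                    rects = [r for k, r in enumerate(rects) if k not in (i, j)]
                    rects.append(merged)
                    merged_any = True
                    break
            if merged_any:
                break
    return rects

def simulate_building(size=64, depth=2, keep_prob=0.6, seed=0):
    """Return a list of (rectangle, height) building sections."""
    rng = random.Random(seed)
    cells = partition((0, 0, size, size), depth, rng)
    kept = [r for r in cells if rng.random() < keep_prob]  # random subset
    sections = merge_sections(kept)
    # Assumed height range in arbitrary units; the paper does not state one.
    return [(r, rng.uniform(5.0, 30.0)) for r in sections]
```

Heights are drawn after merging so that each simplified section receives a single height.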

2.2 Stereo Simulation

We simulate how a building will appear in a stereo reconstruction to produce a new height map (cf. Figure 4). First, we generate an ideal height map for the building. Then, we add random Gaussian noise to the simulated heights. We perturb the boundaries of the building and the boundaries between building sections by randomly dilating points along the height map to model the boundary properties observed in stereo reconstructions from tools like s2p. Finally, we smooth the noisy and perturbed height map to model the correlation typically seen in the output of satellite stereo reconstruction tools like s2p.
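The three degradation steps (noise, boundary dilation, smoothing) might look like the sketch below. It is illustrative only: the 3x3 window, the box filter used for smoothing, and the parameter values are assumptions, since the text does not specify the kernels or noise levels.

```python
import random

def window(grid, r, c):
    """Values in the 3x3 neighbourhood of (r, c), clipped at the borders."""
    rows, cols = len(grid), len(grid[0])
    return [grid[i][j]
            for i in range(max(0, r - 1), min(rows, r + 2))
            for j in range(max(0, c - 1), min(cols, c + 2))]

def simulate_stereo(ideal, noise_sigma=0.2, dilate_prob=0.3, seed=0):
    """Turn an ideal height map (list of rows) into a stereo-like height map."""
    rng = random.Random(seed)
    rows, cols = len(ideal), len(ideal[0])
    # 1) Additive Gaussian noise on every height sample.
    noisy = [[h + rng.gauss(0.0, noise_sigma) for h in row] for row in ideal]
    # 2) Random dilation: a pixel takes the local maximum with some probability,
    #    which pushes building and section boundaries outward irregularly.
    dilated = [[max(window(noisy, r, c)) if rng.random() < dilate_prob
                else noisy[r][c]
                for c in range(cols)] for r in range(rows)]
    # 3) Box-filter smoothing to introduce spatial correlation (assumed kernel).
    smoothed = []
    for r in range(rows):
        row = []
        for c in range(cols):
            values = window(dilated, r, c)
            row.append(sum(values) / len(values))
        smoothed.append(row)
    return smoothed
```

Setting `noise_sigma=0.0` and `dilate_prob=0.0` isolates the smoothing step, which is convenient when checking each stage separately.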

2.3 Building Decomposition

Fig. 5: Cascading building decomposition.

To generate training data for instance segmentation [4], we use the simulation method described in Section 2.1. We include both the idealized and the noisy-boundary images in the training data. The simulated buildings are randomly rotated by between 0 and 45 degrees so that the method is exposed to primitives with arbitrary orientations. A total of 10,000 simulated buildings are generated. To train the network, we start from a model pre-trained on the COCO dataset: the CNN feature-extraction layers are first frozen while the remaining layers are trained for 60 epochs, and then all layers are fine-tuned for another 60 epochs.

To decompose a building into a set of shapes, we cascade applications of Mask R-CNN [4] to partition the building into a set of parts. At each iteration, we select the instance with the largest intersection over union (IoU) with the original mask and then remove that instance from the data. This is a greedy approach to decomposing the building into a set of shapes. Figure 5 demonstrates the decomposition procedure, including the bounding box after the first iteration, the image after removing the selected instance, and the final decomposition.
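The greedy cascade can be sketched as follows, with masks represented as sets of pixel coordinates. The `propose_instances` callback is a hypothetical stand-in for the trained instance segmentation network, and comparing each candidate against the mask fed into the current iteration is one reading of the text rather than a confirmed detail.

```python
def iou(a, b):
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def cascade_decompose(mask, propose_instances, max_iters=10):
    """Greedily peel the best-matching instance off the mask until it is empty.

    mask: set of (x, y) building pixels.
    propose_instances: hypothetical callable, remaining mask -> list of pixel sets.
    Returns (parts, leftover_pixels).
    """
    remaining = set(mask)
    parts = []
    for _ in range(max_iters):
        if not remaining:
            break
        candidates = [inst & remaining for inst in propose_instances(remaining)]
        candidates = [c for c in candidates if c]
        if not candidates:
            break
        best = max(candidates, key=lambda c: iou(c, remaining))
        parts.append(best)
        remaining -= best  # remove the selected instance before the next pass
    return parts, remaining

def rect(x0, y0, x1, y1):
    """Pixel set of an axis-aligned rectangle (helper for the toy example)."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}

# Toy usage: an L-shaped building and a fixed proposer that "detects" its two wings.
wing_a, wing_b = rect(0, 0, 8, 4), rect(0, 4, 3, 8)
parts, leftover = cascade_decompose(wing_a | wing_b, lambda m: [wing_a, wing_b])
```

In the toy example the larger wing is selected first because it has the higher IoU with the full mask; the smaller wing is recovered in the second pass.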

2.4 Primitive Fitting

Fig. 6: Given the bounding box derived from the building decomposition, our primitive fitting pipeline first performs primitive classification to estimate the roof type, and then fits the selected primitive model to the input 3D point cloud.

Our primitive fitting pipeline consists of two parts: primitive classification and primitive fitting, where the primitive classification estimates the roof types, and primitive fitting aligns the estimated roof primitive to the input point cloud. The primitive fitting procedure is illustrated in Figure 6.

Fig. 7: Example roof models in our primitive set.

2.4.1 Synthetic point clouds for different roof types

We use 15 primitive types as our primitive set, which covers the most common roof types. For certain roof types we include several orientations, e.g., four directions for the shed roof. Example roof primitives are shown in Figure 7. We sample a fixed number of points (e.g., 2048) for each primitive. To simulate the digital height model, we add uniform random noise to the rotation angle about the z-axis (from -45 to 45 degrees) and to the height values (+/-0.1, with heights in the range [0, 1]). We randomly sample 500 point clouds for each primitive, giving a total of 7,500 synthetic point clouds for training and validation.
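A sketch of how such a synthetic point cloud might be sampled is given below. The three roof height functions are hypothetical placeholders for the 15 primitives, and applying the height noise independently per point is an assumption; only the point count, the rotation range, and the noise magnitude come from the text.

```python
import math
import random

# Hypothetical roof height functions over the unit square (heights in [0, 1]).
ROOFS = {
    "flat": lambda x, y: 0.5,
    "shed": lambda x, y: 0.3 + 0.4 * x,
    "gable": lambda x, y: 0.3 + 0.4 * (1.0 - abs(2.0 * y - 1.0)),
}

def sample_roof_cloud(roof, n_points=2048, max_angle_deg=45.0,
                      height_noise=0.1, rng=None):
    """Sample one noisy, randomly rotated point cloud for a roof primitive."""
    rng = rng or random.Random()
    height = ROOFS[roof]
    # One uniform random rotation about the z-axis per cloud.
    theta = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cloud = []
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        # Uniform height noise, applied per point (assumption).
        z = height(x, y) + rng.uniform(-height_noise, height_noise)
        dx, dy = x - 0.5, y - 0.5  # rotate about the footprint centre
        cloud.append((0.5 + cos_t * dx - sin_t * dy,
                      0.5 + sin_t * dx + cos_t * dy,
                      z))
    return cloud

def build_dataset(clouds_per_roof=500, seed=0):
    """Return a list of (label, cloud) pairs covering every roof type."""
    rng = random.Random(seed)
    return [(name, sample_roof_cloud(name, rng=rng))
            for name in ROOFS for _ in range(clouds_per_roof)]
```

With 15 primitives and 500 clouds each, `build_dataset` would yield the 7,500 samples mentioned above; the sketch defines only three primitives.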

2.4.2 Primitive classification

We use PointNet [5] for primitive classification. Given an input point cloud, our method first rectifies the point cloud into a canonical pose, fills in the walls and bottom, and normalizes the 3D point cloud to a unit cube. The normalized point cloud is fed into the primitive classification model to estimate the roof type. Classifying the primitive first has two advantages: 1) it is more robust to noise in the input point cloud, and 2) it runs faster because it avoids fitting every primitive to the point cloud.
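The unit-cube normalization step can be illustrated with the short sketch below. It assumes a single uniform scale taken from the largest axis extent, which preserves the aspect ratio of the roof; the text does not say whether the axes are scaled jointly or independently. Pose rectification and wall filling are not shown.

```python
def normalize_to_unit_cube(points):
    """Translate and uniformly scale 3D points so they fit inside [0, 1]^3."""
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    # Largest extent over the three axes; fall back to 1.0 for a degenerate cloud.
    extent = max(hi - lo for lo, hi in zip(mins, maxs)) or 1.0
    return [tuple((p[i] - mins[i]) / extent for i in range(3)) for p in points]

# Example: the y-axis has the largest extent (8), so it alone spans the full [0, 1].
normalized = normalize_to_unit_cube([(10.0, 20.0, 5.0),
                                     (14.0, 22.0, 5.0),
                                     (12.0, 28.0, 7.0)])
```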

2.4.3 Primitive fitting

After we obtain the estimated primitive class, we apply Coherent Point Drift (CPD) [6] to align the predicted primitive to the target 3D point cloud. We assume a rigid transform, so the parameter space involves only rotation, translation, and scale. We compare the fitting result with the initial flat roof model and select the model with the smallest fitting error.
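The final selection between the fitted primitive and the flat roof can be sketched as below. CPD itself is not implemented here; the sketch only shows how a transform with scale, rotation, and translation is applied and how candidates could be ranked. The error measure (mean nearest-neighbour distance) and the restriction to rotation about the vertical axis are assumptions, since the text names neither the error metric nor the rotation parameterization.

```python
import math

def apply_similarity(points, scale, yaw, translation):
    """Apply scale, rotation about the z-axis (yaw, radians), then translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [(scale * (c * x - s * y) + tx,
             scale * (s * x + c * y) + ty,
             scale * z + tz)
            for x, y, z in points]

def fitting_error(model, target):
    """Mean distance from each target point to its nearest model point."""
    return sum(min(math.dist(t, m) for m in model) for t in target) / len(target)

def select_best(candidates, target):
    """Return the name of the candidate point set with the smallest fitting error.

    candidates: dict mapping a model name to its already-aligned point list.
    """
    return min(candidates, key=lambda name: fitting_error(candidates[name], target))
```

The brute-force nearest-neighbour search is quadratic in the number of points; a spatial index would be the natural replacement for clouds of a few thousand points.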

2.5 Texturing

The last stage of the system maps the texture coordinates of the true orthographic color image onto the output 3D model. Texturing is currently limited to the overhead view and simply wraps the roof texture onto the building sides.

3 Experimental results

3.1 Dataset

We evaluate our method on WorldView-3 satellite images. The testing areas comprise four regions: AOI-1 and AOI-2 (Wright-Patterson Air Force Base), AOI-3 (University of California, San Diego), and AOI-4 (Jacksonville, Florida), which cover 0.358, 0.614, 0.962, and 1.813 square kilometers, respectively. We first apply our 2D instance-based building segmentation and 3D reconstruction algorithms to obtain the digital height model for each building region, and use it as the input to our primitive-based 3D building modeling approach. Figure 8 shows the reconstructed digital height models for the four testing areas.

3.2 Evaluation Criteria

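We evaluate our method using the metrics described in [7]. The metrics include completeness, correctness, and Jaccard index in both 2D and 3D space, which are defined as

$$\text{completeness} = \frac{TP}{TP + FN}, \qquad \text{correctness} = \frac{TP}{TP + FP}, \qquad \text{Jaccard index} = \frac{TP}{TP + FP + FN},$$

where $TP$, $FP$, and $FN$ denote the true-positive, false-positive, and false-negative area (2D) or volume (3D). We compute the 3D scores from the intersection between our 3D primitive reconstruction and the ground truth digital shape model. We project the 3D primitive reconstruction into 2D image space and compute the 2D scores against the ground truth 2D building mask.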
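As a small worked example, the three scores can be computed from sets of occupied cells (pixels in 2D, voxels in 3D) using the usual true-positive, false-positive, and false-negative counts. The function name and the set-based representation are illustrative assumptions and are not taken from the evaluation pipeline of [7].

```python
def evaluation_scores(predicted, truth):
    """Completeness, correctness, and Jaccard index for two sets of occupied cells."""
    tp = len(predicted & truth)   # cells in both reconstruction and ground truth
    fp = len(predicted - truth)   # reconstructed cells absent from ground truth
    fn = len(truth - predicted)   # ground truth cells that were missed
    completeness = tp / (tp + fn) if tp + fn else 0.0
    correctness = tp / (tp + fp) if tp + fp else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return completeness, correctness, jaccard

# Toy example: 2 shared cells, 2 spurious cells, 4 missed cells.
scores = evaluation_scores({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
```

Completeness corresponds to recall and correctness to precision, so the Jaccard index can never exceed either of them.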

Fig. 8: Our building decomposition and primitive fitting results on four testing areas. Left: input digital height model. Right: building decomposition and primitive fitting. Textures are mapped from the true orthographic color images. Zoom-in for better resolution.

3.3 Experimental Results

Metric              AOI-1   AOI-2   AOI-3   AOI-4
2D completeness      93.9    91.5    90.7    96.0
2D correctness       97.8    91.4    89.5    88.9
2D Jaccard           92.0    84.3    82.1    85.8
3D completeness      93.7    87.5    90.0    96.0
3D correctness       96.2    88.5    86.7    80.4
3D Jaccard           90.4    78.6    79.1    77.8
Table 1: Completeness, correctness, and Jaccard index (in percent) for both 2D and 3D metrics on the four testing areas.

Method         Count          AOI-1       AOI-2       AOI-3        AOI-4
U3D baseline   Vertices   1,432,900   2,458,640   3,849,248    7,253,844
U3D baseline   Faces      2,863,407   4,914,139   7,694,575   15,076,271
Our method     Vertices         994       4,822      25,803       18,736
Our method     Faces          1,680       7,748      38,600       29,072
Table 2: Vertex and triangle face count comparison.

Table 1 shows the evaluation results on the four testing AOIs: the 2D Jaccard index ranges from 82.1 to 92.0 and the 3D Jaccard index from 77.8 to 90.4. The team will continue working to improve the performance of the system. Primitive reconstruction results are shown in Figure 8. By representing the polygonal mesh model as a set of primitives, our method reduces the vertex and face counts of the original dense triangulation by two to three orders of magnitude, yielding a far more compact representation (cf. Table 2).

4 Discussions

Based on the fitting results in our four AOIs, we observe two main challenges. First, the current building decomposition strategy does not handle stacked building structures; the next step is to incorporate stacked structures into both the simulation and the decomposition strategy. Second, errors in the box decomposition propagate to primitive fitting. We seek further improvement by applying constructive solid geometry modeling to our primitive representation.

5 Acknowledgement

The research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via DOI/IBC Contract Number D17PC00287. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.


  • [1] Ruwen Schnabel, Roland Wahl, and Reinhard Klein, “Efficient ransac for point-cloud shape detection,” in Computer graphics forum, 2007.
  • [2] Zhou Ren, Junsong Yuan, Chunyuan Li, and Wenyu Liu, “Minimum near-convex decomposition for robust shape representation,” in ICCV, 2011.
  • [3] Shubham Tulsiani, Hao Su, Leonidas J. Guibas, Alexei A. Efros, and Jitendra Malik, “Learning shape abstractions by assembling volumetric primitives,” in CVPR, 2017.
  • [4] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, “Mask r-cnn,” in ICCV, 2017.
  • [5] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in CVPR, 2017.
  • [6] Andriy Myronenko and Xubo Song, “Point set registration: Coherent point drift,” TPAMI, 2010.
  • [7] M. Bosch, A. Leichtman, D. Chilcott, H. Goldberg, and M. Brown, “Metric evaluation pipeline for 3d modeling of urban scenes,” in ISPRS Archives, 2017.