Unthule: An Incremental Graph Construction Process for
Robust Road Map Extraction from Aerial Images
The availability of highly accurate maps has become crucial due to the increasing importance of location-based mobile applications as well as autonomous vehicles. However, mapping roads is currently an expensive and human-intensive process. High-resolution aerial imagery provides a promising avenue to automatically infer a road network. Prior work uses convolutional neural networks (CNNs) to detect which pixels belong to a road (segmentation), and then uses complex post-processing heuristics to infer graph connectivity [4, 10]. We show that these segmentation methods have high error rates (poor precision) because noisy CNN outputs are difficult to correct. We propose a novel approach, Unthule, to construct highly accurate road maps from aerial images. In contrast to prior work, Unthule uses an incremental search process guided by a CNN-based decision function to derive the road network graph directly from the output of the CNN. We train the CNN to output the direction of roads traversing a supplied point in the aerial imagery, and then use this CNN to incrementally construct the graph. We compare our approach with a segmentation method on fifteen cities, and find that Unthule has a 45% lower error rate in identifying junctions across these cities.
Creating and updating road maps is a tedious, expensive, and often manual process today. Accurate and up-to-date maps are especially important given the growing popularity of location-based mobile services and the impending arrival of autonomous vehicles. Several companies are investing hundreds of millions of dollars in mapping the world, but despite this investment, error rates are not small in practice, with map providers receiving many tens of thousands of error reports per day, which are often processed manually. (See, e.g., https://www.huffingtonpost.com/2013/01/21/google-maps-editor-ground-truth-team_n_2516924.html for a day in the life of a maps editor, and https://productforums.google.com/forum/#!topic/maps/dwtCso9owlU for an example of a city (Doha, Qatar) where maps have been missing entire subdivisions for years.) In fact, even obtaining “ground truth” maps in well-traveled areas may be difficult; recent work reported that the discrepancy between OpenStreetMap and a TorontoCity dataset was 14% (the recall according to a certain metric for OSM was 0.86).
Aerial imagery provides a promising avenue to automatically infer the road network graph. In practice, however, extracting maps from aerial images is difficult even when the images have high resolution (e.g., see Figure 1 for common examples of occlusion). Prior approaches do not handle these problems well. Almost universally, they begin by segmenting the image, classifying each pixel in the input as either road or non-road [4, 10]. They then implement a complex post-processing pipeline to interpret the segmentation output and extract topological structure to construct a map. As we will demonstrate, noise frequently appears in the segmentation output, making it hard for the post-processing steps to produce an accurate result.
The fundamental problem with a segmentation-based approach is that the CNN is trained only to provide local information about the presence of roads. Key decisions on how road segments are inter-connected to each other are delegated to an (error-prone) post-processing stage that relies on heuristics instead of machine learning or principled algorithms. Rather than rely on an intermediate image representation, we seek an approach that produces the road network directly from the CNN. However, it is not obvious how to train a CNN to learn to produce a graph from aerial images.
We propose Unthule, an approach that leverages an incremental graph construction process for extracting graph structures from images to solve this problem. Our approach constructs the road network by adding individual road segments one at a time, using a novel CNN architecture to decide on the next segment to add given as input the portion of the network constructed so far. In this way, we eliminate the intermediate image representation of the road network, and thus avoid the need for extensive post-processing that limits the accuracy of existing systems.
Training the CNN decision function is challenging because the input to the CNN at each step of the search depends on the (partial) road network generated using the CNN up to that step. As a result, standard approaches that use a static set of labeled training examples are inadequate. Instead, we develop a dynamic labeling approach to produce training examples on the fly as the CNN evolves during training. This procedure resembles reinforcement learning, but we use it in an efficient supervised training procedure.
We evaluate our approach using aerial images covering 24 square km areas of 15 cities, after training the model on 25 other cities. Figure 2 shows the results pictorially for LA, NYC, and Chicago. The supplementary material submitted with this paper includes a demonstration of our method in action. Across the 15 cities, our main experimental finding is that Unthule has an average error rate of 7.5% compared to 13.7% for segmentation (45% lower), while the recall numbers for the two schemes are within 5% of each other. These numbers are for a junction metric, which quantifies how accurately the algorithm recovers the local topology around each road junction. Because false junctions have a significant adverse impact on applications like navigation, these results suggest that Unthule is an important step forward in fully automating map construction from aerial images.
2 Related Work
Classifying pixels in an aerial image as “road” or “non-road” is a well-studied problem, with solutions generally using probabilistic models. Barzohar et al. build geometric-probabilistic models of road images based on assumptions about local road-like features, such as road geometry and color intensity, and draw inferences with MAP estimation. Wegner et al. use higher-order conditional random fields (CRFs) to model the structures of the road network by first segmenting aerial images into superpixels, and then adding paths to connect these superpixels. More recently, CNNs have been applied to road segmentation [12, 5]. However, the output of road segmentation, consisting of a probability of each pixel being part of a road, cannot be directly used as a road network graph.
To extract a road network graph from the segmentation output, Cheng et al. apply binary thresholding and morphological thinning to produce single-pixel-width road centerlines. A graph can then be obtained by tracing these centerlines. Máttyus et al. propose a similar approach, but add post-processing stages to enhance the graph by reasoning about missing connections and applying heuristics. This solution yields promising results when the road segmentation has modest error. However, as we will show in Section 3.1, heuristics do not perform well when there is uncertainty in segmentation, which can arise due to occlusion, ambiguous topology, or a wide range of other reasons.
Rather than extract the road graph from the result of segmentation, some solutions directly extract a graph from images. Hinz et al. produce a road network using a complex road model that is built using detailed knowledge about roads and their context, such as nearby buildings and vehicles. Hu et al. introduce road footprints, which are detected based on shape classification of the homogeneous region around a pixel. A road tree is then grown by tracking these road footprints. Although these approaches do not use segmentation, they involve numerous heuristics and assumptions that resemble those in the post-processing pipeline of segmentation-based approaches, and thus are susceptible to similar issues.
Inferring road maps from GPS data has also been studied [3, 14, 13]. However, collecting enough GPS data to cover the entire map in both space and time is challenging, especially when the region of the map is large and far from the city core. Meanwhile, GPS accuracy in downtown areas suffers from noise due to urban canyons, where GPS signals reflect off of neighboring buildings and lead to location errors.
3 Automatic Map Inference
The goal of automatic map inference is to produce a road network map, i.e., an undirected graph where vertices are annotated with spatial coordinates (latitude and longitude), and edges correspond to straight-line road segments. Vertices with three or more incident edges correspond to road junctions (e.g. intersections or forks).
In Section 3.1, we detail a segmentation-based map-inference method that is representative of current state-of-the-art techniques [4, 10] to construct a road network map from aerial images. We describe problems in the maps inferred by the segmentation approach to motivate our alternative solution. Then, in Section 3.2, we introduce our novel incremental map construction method. Finally, in Section 4, we discuss the procedure used to train the CNN used in our solution.
3.1 Segmentation Approach
Typically, segmentation-based approaches have two steps. First, each pixel is labeled as either “road”, indicating that it belongs to a road, or “non-road”. Then, a post-processing step applies a set of heuristics to convert the segmentation output to a road network graph.
To understand and evaluate this approach, we prepare a training dataset consisting of input-output pairs of satellite images with corresponding segmentation labels. For ground truth, we use OpenStreetMap (OSM), an openly licensed map dataset developed through collaborative mapping. Then, to generate segmentation labels, we render lines along the edges in the OSM dataset, and apply a Gaussian blur.
We use a 13-layer CNN, where the output layer uses softmax activation and is scaled to half the size of the input on each dimension. We train the CNN using batch gradient descent with cross-entropy loss evaluated independently on each pixel. For the post-processing stage, we first threshold the segmentation output to obtain a binary image. Then, we apply morphological thinning, which produces an image where roads are represented as one-pixel-wide centerlines. We use the Douglas-Peucker method to convert this image into a graph.
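To make the graph-extraction step concrete, the following is a minimal, self-contained sketch of the Douglas-Peucker simplification used when tracing centerlines into polylines. This is an illustration, not the paper's implementation; the `epsilon` tolerance and the function name are assumptions.

```python
import math

def douglas_peucker(points, epsilon):
    """Simplify a polyline: keep the endpoints, and recursively keep the
    interior point farthest from the endpoint chord if its perpendicular
    distance exceeds epsilon."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    dx, dy = x2 - x1, y2 - y1
    norm = math.hypot(dx, dy) or 1.0
    best_i, best_d = 0, -1.0
    for i in range(1, len(points) - 1):
        px, py = points[i]
        # Perpendicular distance from (px, py) to the chord.
        d = abs(dy * (px - x1) - dx * (py - y1)) / norm
        if d > best_d:
            best_i, best_d = i, d
    if best_d <= epsilon:
        return [points[0], points[-1]]
    # Recurse on both halves, sharing the split point.
    left = douglas_peucker(points[:best_i + 1], epsilon)
    right = douglas_peucker(points[best_i:], epsilon)
    return left[:-1] + right
```

Near-collinear runs of centerline pixels collapse to their endpoints, while genuine corners survive.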
Because the CNN is trained with a loss function evaluated independently on each pixel, it will yield a noisy output in regions where it is unsure about the presence of a road. As shown in Figure 3(a) and (b), noise in the segmentation output will be reflected in the extracted graph. We perform a series of additional cleaning steps to resolve common types of noise. These steps are representative of techniques in state-of-the-art segmentation-based map-inference methods:
Prune short segments that dangle off road centerlines.
Remove small, isolated connected components.
Extend dead-end segments if there is a nearby road on the opposite side of the dead-end.
Merge junctions that are close together.
Figure 3(c) shows the graph after cleaning.
Although cleaning is sufficient to remove basic types of noise, we find that most forms of noise are too extensive to compensate for. Consider the examples in Figure 4. In the top example, blur around the intersection makes it difficult for the method to determine the topology. This noise is amplified during the graph extraction process, resulting in an unusable road network map. In the bottom example, the segmentation output drops off when shadows occlude the road, yielding disconnected road segments.
In these examples, even a human would find it difficult to accurately map the road network given the segmentation output. Because the CNN is trained only to classify individual pixels in an image as roads, it leaves us with an untenable jigsaw puzzle of deciding which pixels form the road centerlines, and where these centerlines should be connected.
These findings convinced us that we need a different approach that can produce a road network directly, without going through the noisy intermediate image representation of the road network. We propose an incremental graph construction architecture to do exactly this. By breaking down the mapping process into a series of steps that build a road network graph incrementally, we will show that we can derive a road network from the CNN, thereby eliminating the requirement of a complex post-processing pipeline and yielding more accurate maps.
3.2 Unthule: Incremental Graph Construction
In contrast to the segmentation approach, our approach consists of a search algorithm, guided by a decision function implemented via a CNN, to compute the graph incrementally. The search walks along roads starting from a single location known to be on the road network. Vertices and edges are added in the path that the search follows. The decision function is invoked at each step to determine the best action to take: either add an edge to the road network, or step back to the previous vertex in the search tree. Algorithm 1 shows the pseudocode for the search procedure.
Search algorithm. We input a region (v0, B), where v0 is the known starting location, and B is a bounding box defining the area in which we want to infer the road network. The search algorithm maintains a graph G and a stack S of vertices that both initially contain only the single vertex v0. v_top, the vertex at the top of S, represents the current location of the search.
At each step, the decision function is presented with G, S, and an aerial image centered at v_top's location. It can decide either to walk a fixed distance D forward from v_top along a certain direction, or to stop and return to the vertex preceding v_top in S. When walking, the decision function selects the direction from a set of angles that are uniformly distributed in [0, 2π). Then, the search algorithm adds a vertex u at the new location (i.e., D away from v_top along the selected angle), along with an edge (v_top, u), and u is pushed onto S (in effect moving the search to u).
If the decision process decides to “stop” at any step, we pop v_top from S. Stopping indicates that there are no more unexplored roads (directions) adjacent to v_top. Note that because only new vertices are ever pushed onto S, a “stop” at a vertex means that the search will never visit that vertex again.
Figure 5 shows an example of how the search proceeds at an intersection. When we reach the intersection, we first follow (say) the left branch, and once we reach the end of this branch, the decision function selects the “stop” action. Then, the search returns to each vertex previously explored along the left branch. Because there are no roads coming off of the left branch, the decision function continues to select the stop action until we come back to the intersection. At the intersection, the decision function leads the search down the right branch. Once we reach the end of the right branch, the decision function repeatedly selects the stop action until we come back to v0 and S becomes empty. When S is empty, the construction of the road network is complete.
Since road networks contain cycles, it is also possible that we will turn back onto an earlier explored path. The search algorithm includes a simple merging step to handle this: when processing a walk action, if the new vertex u is within a small merging distance of an existing vertex v, but the shortest distance in G from v_top to v exceeds a larger threshold, then we add an edge (v_top, v) and don't push v onto S. This heuristic prevents small loops from being created, e.g. if a road forks into two at a small angle.
Lastly, we may walk out of our bounding box B. To avoid this, when processing a walk action, if u is not contained in B, then we treat it as a stop action.
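The search procedure above can be sketched as a short Python loop. This is an illustrative sketch, not the paper's Algorithm 1: the step distance `STEP`, the bounding-box representation, and the `decide` callback interface are assumptions, and the loop-merging heuristic is omitted for brevity.

```python
import math

STEP = 12.0  # assumed step distance D; the paper's exact value is not restated here

def incremental_search(v0, bbox, decide):
    """Incrementally construct a road graph. decide(vertex, (vertices, edges))
    returns an angle in radians to walk along, or None for a stop action.
    bbox is ((xmin, ymin), (xmax, ymax))."""
    vertices = [v0]
    edges = []
    stack = [0]  # stack S of vertex indices; the top is the current search position
    while stack:
        top = stack[-1]
        angle = decide(vertices[top], (vertices, edges))
        if angle is None:
            stack.pop()  # stop action: backtrack to the preceding vertex
            continue
        x, y = vertices[top]
        u = (x + STEP * math.cos(angle), y + STEP * math.sin(angle))
        (xmin, ymin), (xmax, ymax) = bbox
        if not (xmin <= u[0] <= xmax and ymin <= u[1] <= ymax):
            stack.pop()  # walking out of the bounding box is treated as a stop
            continue
        vertices.append(u)
        ui = len(vertices) - 1
        edges.append((top, ui))  # add edge (v_top, u)
        stack.append(ui)         # move the search to u
    return vertices, edges
```

With a decision function that walks east twice and then always stops, the search builds a two-edge path and then backtracks to empty the stack.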
CNN decision function. A crucial component of our algorithm is the decision function, which we implement with a CNN. The input layer consists of a window centered on v_top. This window has five channels:
The first three channels are from the portion of aerial imagery around v_top.
The fourth channel is the graph constructed so far, G. We render G by drawing anti-aliased lines along the edges of G that fall inside the window. Including G in the input to the CNN is a noteworthy aspect of our method. First, this allows the CNN to understand which roads in the aerial imagery have been explored earlier in the search, in effect moving the problem of excluding these roads from post-processing to the CNN. Second, it provides the CNN with useful context; e.g., when encountering a portion of aerial imagery occluded by a tall building, the CNN can use the presence or absence of edges on either side of the building to help determine whether the building occludes a road.
The fifth channel identifies the search position. This channel is static: all values are 0 except the pixel at v_top.
The output layer is a sigmoid layer with n neurons, o_1, ..., o_n, each corresponding to an angle to walk in. We use a threshold T to decide between walking and stopping. If max_i o_i >= T, then we walk in the angle corresponding to argmax_i o_i. Otherwise, we stop.
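The walk-or-stop rule reduces to a few lines. In this sketch the default threshold value is an assumption, not the paper's setting:

```python
def choose_action(outputs, threshold=0.4):
    """Map the CNN's sigmoid output vector to an action: return the index
    of the highest-scoring angle if it clears the threshold, else None
    (a stop action). The default threshold is an assumed value."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return best if outputs[best] >= threshold else None
```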
Figure 6 summarizes the interaction between the search algorithm and the decision function, and the implementation of the decision function with a CNN.
We noted earlier that our solution does not require complex post-processing heuristics, unlike segmentation-based methods where CNN outputs are noisy. The only post-processing required in our decision function is to check a threshold on the CNN outputs and select the index of the maximum output. Thus, our method enables the CNN to directly produce a road network graph.
We now discuss the training procedure for the decision function.
4 Training a CNN for Incremental Graph Construction
We assume we have a ground truth map G* (e.g., from OpenStreetMap). Training the CNN is non-trivial: the CNN takes as input a partial graph G (generated by the search algorithm) and outputs the desirability of walking at various angles, but we only have this ground truth map. How might we use G* to generate training examples?
4.1 Static Training Dataset
We initially attempted to generate a static set of training examples. For each training example, we sample a region and a step count n, and initialize a search. We run n steps of the search using an “oracle” decision function that uses G* to always make optimal decisions. The state of the search algorithm immediately preceding the nth step is the input for the training example, while the action taken by the oracle on the nth step is used to create a target output y. We can then train a CNN using gradient descent by back-propagating a mean-squared loss between the CNN output and y.
However, we found that although the CNN can achieve high performance in terms of the loss function on the training examples, it performs poorly during inference. This is because the partial graph G is essentially perfect in every example that the CNN sees during training, as it is constructed by the oracle based on the ground truth map. During inference, however, the CNN may choose angles that are slightly off from the ones predicted by the oracle, resulting in small errors in G. Then, because the CNN has not been trained on imperfect inputs, these small errors lead to larger prediction errors, which in turn result in even larger errors.
Figure 7 shows a typical example of this snowball effect. The CNN does not output the ideal angle at the turn; this causes it to quickly veer off the actual road because it never saw such deviations from the road during training, and hence it cannot correct course. We tried to mitigate this problem by using various methods to introduce noise on G in the training examples. Although this reduces the scale of the problem, the CNN still yields low performance at inference time, because the noise that we introduce does not match the characteristics of the noise introduced inherently by the CNN during inference. Thus, we conclude that a static training dataset is not suitable.
4.2 Dynamic Labels Approach
We instead generate training examples dynamically by running the search algorithm with the CNN as the decision function. As the CNN model evolves during training, we generate new training examples as well.
Given a region, training begins by initializing an instance of the search algorithm. On each training step, as during inference, we feed-forward the CNN to decide on an action based on the output layer, and update G and S based on that action.
In addition to deciding on the action, we also determine the action that an oracle would take, and train the CNN to learn that action. The key difference from the static dataset approach is that, here, G and S are updated based on the CNN output and not the oracle output; the oracle is only used to compute a label for back-propagation.
The basic strategy is similar to before. On each training step, based on G*, we first identify the set R of angles where there are unexplored roads from v_top. Next, we convert R into a target output vector y. If R is empty, then y is all zeros, indicating that the CNN should stop. Otherwise, for each angle a in R, we set y_i = 1, where i indexes the closest walkable angle to a. Lastly, we derive a loss from the mean-squared error between the CNN output and y, and apply back-propagation to update the CNN parameters.
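Constructing the target vector can be sketched as follows, assuming the n walkable angles are evenly spaced over [0, 2π) as in the search algorithm; the function name and interface are illustrative:

```python
import math

def target_vector(unexplored_angles, n):
    """Build a training target y of length n: for each ground-truth road
    angle, set the output bin whose walkable angle is closest to 1.
    An empty angle set yields all zeros (the stop action)."""
    y = [0.0] * n
    for a in unexplored_angles:
        # Walkable angles are 2*pi*i/n; snap to the nearest bin, wrapping
        # angles near 2*pi back to bin 0.
        i = round((a % (2 * math.pi)) / (2 * math.pi) * n) % n
        y[i] = 1.0
    return y
```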
A key challenge is how to decide where to start the walk in G* to pick the next vertex. The naive approach is to start the walk from the closest location in G* to v_top. However, as the example in Figure 8 illustrates, this approach can direct the system towards the wrong road when G differs from G*.
To solve this problem, we apply a map-matching algorithm to find a path p* in G* that is most similar to a path p in G ending at v_top. To obtain p, we perform a random walk in G starting from v_top. We stop the random walk when we have traversed a configurable number of vertices, or when there are no vertices adjacent to the current vertex that haven't already been traversed earlier in the walk. Then, we match p to the path p* in G* to which it is most similar. We use a standard map-matching method based on the Viterbi algorithm. If v* is the endpoint of the last edge in p*, we start our walk in G* at v*.
Finally, we maintain a set E containing edges of G* that have already been explored during the walk. E is initially empty. On each training step, after deriving p* from map-matching, we add each edge in p* to E. Then, when performing the walk in G*, we avoid traversing edges that are in E again.
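The random walk with an explored-edge set can be sketched as below. This is a simplified illustration: the adjacency-list representation, the `max_steps` parameter, and passing the explored set E in as `visited_edges` are assumptions, not the paper's interface.

```python
import random

def random_walk(adj, start, max_steps, visited_edges):
    """Walk from `start` in an undirected graph `adj` (vertex -> neighbor
    list), skipping edges already in `visited_edges`, until `max_steps`
    steps are taken or no unvisited edge leaves the current vertex.
    Traversed edges are added to `visited_edges` (the set E)."""
    path = [start]
    cur = start
    for _ in range(max_steps):
        options = [v for v in adj.get(cur, [])
                   if frozenset((cur, v)) not in visited_edges]
        if not options:
            break
        nxt = random.choice(options)
        visited_edges.add(frozenset((cur, nxt)))
        path.append(nxt)
        cur = nxt
    return path
```

On a simple path graph the walk is forced forward, since the edge just traversed is excluded.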
5 Evaluation

Dataset. To evaluate our approach, we assemble a large corpus of high-resolution satellite imagery and ground truth road network graphs covering the urban core of forty cities in the U.S. and U.K. For each city, our dataset covers a region of approximately 24 sq km around the city center. We obtain satellite imagery from Google at 60 cm/pixel resolution, and the road network from OSM (we exclude certain roads that appear in OSM such as pedestrian paths and parking lots). We convert the coordinate system of the road network so that the vertex spatial coordinate annotations correspond to pixels in the satellite images.
We split our dataset into a training set with 25 cities and a test set with 15 cities. To our knowledge, we conduct the first evaluation of automatic mapping approaches where systems are trained and evaluated on entirely separate cities, and not merely different regions of one city, and also the first large-scale evaluation over aerial images from several cities. As Figure 1 shows, many properties of roads vary greatly from city to city. The ability of an automatic mapping approach to perform well even on cities that are not seen during training is crucial, as the regions where automatic mapping holds the most potential are the regions where existing maps are non-existent or inaccurate.
Metrics. We use two metrics to compare the road network inferred by an automatic mapping approach against the ground truth road network from OSM.
The first metric, TOPO, evaluates a combination of topology (connections between roads) and geometry (alignment of individual roads), and has been used before in the GPS-based mapping literature. TOPO simulates a car driving a certain distance from a selected source location, and compares the destinations that can be reached in the inferred map with those that can be reached in the ground truth map. Given a source location in the ground truth map, we identify the closest location in the inferred map, and perform a search in each graph from those locations. During the searches, we place markers at regular intervals; the search ends after walking a fixed distance from the source. We compute a precision and recall between the markers placed by the two searches, where two markers match if the distance between them is below a matching threshold. This procedure is repeated over a large number of source locations, and precision and recall are averaged over the iterations.
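The marker comparison at the heart of TOPO can be illustrated with a small sketch. The greedy one-to-one matching below is a simplification for illustration; TOPO's exact matching procedure and distance parameters are not restated here.

```python
import math

def marker_precision_recall(inferred, truth, match_dist):
    """Greedily match each inferred marker to an unmatched ground-truth
    marker within match_dist; precision is over inferred markers,
    recall over ground-truth markers."""
    unmatched = list(truth)
    matched = 0
    for p in inferred:
        best = None
        for q in unmatched:
            if math.dist(p, q) < match_dist:
                best = q
                break
        if best is not None:
            unmatched.remove(best)  # enforce one-to-one matching
            matched += 1
    precision = matched / len(inferred) if inferred else 1.0
    recall = matched / len(truth) if truth else 1.0
    return precision, recall
```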
Figure 10 illustrates one iteration of TOPO comparing two inferred maps against OSM. We also use Figure 10 to highlight a case where TOPO gives misleading scores. We argue that in a practical mapping scenario, the first inferred map is unusable as too many edges are placed incorrectly, whereas the missing edge in the second map can be quickly corrected manually. However, TOPO penalizes the second map heavily on recall. The score for the first map is 0.82, but for the second map it is only 0.80.
Thus, we propose a new evaluation metric with two goals: (a) to give a score that is representative of the inferred map's practical usability, and (b) to be interpretable. Our metric compares the ground truth and inferred maps junction-by-junction, where a junction is any vertex with three or more incident edges. We first identify pairs of corresponding junctions (a, b), where a is in the ground truth map and b is in the inferred map. Then, f_correct(a) is the fraction of incident edges of a that are captured around b, and f_error(b) is the fraction of incident edges of b that appear around a. For each unpaired ground truth junction a, f_correct(a) = 0, and for each unpaired inferred-map junction b, f_error(b) = 1. Finally, aggregating f_correct over ground truth junctions gives the correct junction fraction, and aggregating f_error over inferred-map junctions gives the error rate. In Figure 10, the first map captures most junctions but incurs a high error rate, while the second map has a near-zero error rate but a lower correct junction fraction.
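The aggregation can be sketched as follows. Note that averaging the per-junction fractions is an assumption here; the paper's exact aggregation formula is not restated in this text.

```python
def junction_scores(pairs, unpaired_truth, unpaired_inferred):
    """pairs: list of (f_correct, f_error) for matched junction pairs.
    Unpaired ground-truth junctions contribute f_correct = 0; unpaired
    inferred-map junctions contribute f_error = 1. Returns the averaged
    correct junction fraction and error rate (an assumed aggregation)."""
    f_correct = [c for c, _ in pairs] + [0.0] * unpaired_truth
    f_error = [e for _, e in pairs] + [1.0] * unpaired_inferred
    correct_fraction = sum(f_correct) / len(f_correct) if f_correct else 1.0
    error_rate = sum(f_error) / len(f_error) if f_error else 0.0
    return correct_fraction, error_rate
```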
Isolated City Regions. Several cities have road network maps that include large connected components that are linked to the rest of the network by only a few roads. For example, rivers and bays often separate different parts of a city. We find that our approach may yield low recall on these cities if it is unable to follow any of the few roads that lead out of the component that contains the known initial vertex.
We solve this problem by enhancing Unthule with a two-phase inference strategy. In the first phase, we use a low stopping threshold so that the search is likely to extend across the city. However, this low threshold may produce many false positive edges. Thus, we prune the resulting graph and only retain high-confidence sequences of edges, yielding a pruned graph G_p. In the second phase, we restart the search with a new, empty graph from an arbitrary edge in G_p, using the normal stopping threshold. After the search terminates, we remove edges from G_p that are close to some edge in the new graph. If G_p is not empty, then we add an arbitrary edge in G_p to the new graph and restart the search from there; this is repeated until G_p is empty.
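The phase-2 outer loop can be sketched abstractly. The callback interface below is entirely illustrative: `run_search(edge)` stands in for restarting the search from a seed edge, and `close_to` for the edge-proximity check; the sketch assumes each search covers at least its own seed edge, so the loop terminates.

```python
def two_phase_cover(seed_edges, run_search, close_to):
    """Phase-2 outer loop: repeatedly restart the search from an arbitrary
    remaining high-confidence edge (from the pruned phase-1 graph G_p),
    removing seed edges covered by each inferred graph.
    run_search(edge) -> list of inferred edges; close_to(e, inferred) -> bool."""
    remaining = list(seed_edges)
    result = []
    while remaining:
        seed = remaining[0]
        inferred = run_search(seed)
        result.extend(inferred)
        # Drop seed edges that the new graph already covers.
        remaining = [e for e in remaining if not close_to(e, inferred)]
    return result
```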
We find that two-phase inference boosts the average correct junction fraction across the test cities by 21%. We use this approach in the results below.
Results. We report performance in terms of the correct junction fraction and error rate over each city in the test set in Figure 9. On the junction metric, our method correctly captures a similar number of junctions as the segmentation-based approach, but has a significantly lower error rate (45% lower on average) than segmentation across the 15 cities. Unthule's much lower error rate is key to making automatic mapping systems practical: if the error rate is too high, then it is more efficient to map the roads manually from scratch rather than first remove incorrect segments from the inferred map.
Table 1 shows the TOPO performance. Precision increases by 9.9%, corresponding to a halving of the error rate (1 − precision). As with the junction metric, Unthule yields similar recall with far fewer errors. As shown in Figure 2, the lower error rate results in a noticeable improvement in the quality of the inferred map.
6 Conclusion

On the face of it, using deep learning to infer a road network graph seems straightforward: train a CNN to recognize which pixels belong to a road, produce the polylines, and then connect them. But occlusions and lighting conditions pose challenges, and such a segmentation-based approach requires complex post-processing heuristics. By contrast, our incremental graph construction method uses a CNN-guided search to directly output a graph. We showed how to construct training examples dynamically for this method, and evaluated it on 15 cities, having trained on aerial imagery from 25 entirely different cities. To our knowledge, this is the largest map-inference evaluation to date, and the first that fully separates the training and test cities. Our principal experimental finding is that Unthule has an average error rate of 7.5% compared to 13.7% for segmentation (45% lower), while the recall numbers for the two schemes are comparable. Hence, we believe that our work presents an important step forward in fully automating map construction from aerial images.
-  M. Barzohar and D. B. Cooper. Automatic finding of main roads in aerial images by using geometric-stochastic models and estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(7):707–721, 1996.
-  J. Biagioni and J. Eriksson. Inferring road maps from global positioning system traces. Transportation Research Record: Journal of the Transportation Research Board, 2291(1):61–71, 2012.
-  J. Biagioni and J. Eriksson. Map inference in the face of noise and disparity. In Proceedings of the 20th International Conference on Advances in Geographic Information Systems, pages 79–88. ACM, 2012.
-  G. Cheng, Y. Wang, S. Xu, H. Wang, S. Xiang, and C. Pan. Automatic road detection and centerline extraction via cascaded end-to-end convolutional neural network. IEEE Transactions on Geoscience and Remote Sensing, 55(6):3322–3337, 2017.
-  D. Costea and M. Leordeanu. Aerial image geolocalization from recognition and matching of roads and intersections. arXiv preprint arXiv:1605.08323, 2016.
-  D. H. Douglas and T. K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica: The International Journal for Geographic Information and Geovisualization, 10(2):112–122, 1973.
-  M. Haklay and P. Weber. OpenStreetMap: User-generated street maps. IEEE Pervasive Computing, 7(4):12–18, 2008.
-  S. Hinz and A. Baumgartner. Automatic extraction of urban road networks from multi-view aerial imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 58(1):83–98, 2003.
-  J. Hu, A. Razdan, J. C. Femiani, M. Cui, and P. Wonka. Road network extraction and intersection detection from aerial images by tracking road footprints. IEEE Transactions on Geoscience and Remote Sensing, 45(12):4144–4157, 2007.
-  G. Máttyus, W. Luo, and R. Urtasun. DeepRoadMapper: Extracting road topology from aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3438–3446, 2017.
-  G. Miller. The Huge, Unseen Operation Behind the Accuracy of Google Maps. https://www.wired.com/2014/12/google-maps-ground-truth/, Dec. 2014.
-  V. Mnih and G. E. Hinton. Learning to detect roads in high-resolution aerial images. In European Conference on Computer Vision, pages 210–223. Springer, 2010.
-  Z. Shan, H. Wu, W. Sun, and B. Zheng. COBWEB: A robust map update system using GPS trajectories. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 927–937. ACM, 2015.
-  R. Stanojevic, S. Abbar, S. Thirumuruganathan, S. Chawla, F. Filali, and A. Aleimat. Kharita: Robust map inference using graph spanners. arXiv preprint arXiv:1702.06025, 2017.
-  A. Thiagarajan, L. Ravindranath, K. LaCurts, S. Madden, H. Balakrishnan, S. Toledo, and J. Eriksson. VTrack: accurate, energy-aware road traffic delay estimation using mobile phones. In Proceedings of the 7th ACM Conference on Embedded Networked Sensor Systems, pages 85–98. ACM, 2009.
-  J. D. Wegner, J. A. Montoya-Zegarra, and K. Schindler. Road networks as collections of minimum cost paths. ISPRS Journal of Photogrammetry and Remote Sensing, 108:128–137, 2015.
-  T. Zhang and C. Y. Suen. A fast parallel algorithm for thinning digital patterns. Communications of the ACM, 27(3):236–239, 1984.