Streetify: Using Street View Imagery And Deep Learning For Urban Streets Development
The classification of streets on road networks has traditionally focused on the vehicular transportational features of streets, labeling them as arterials, major roads, minor roads and so forth according to their transportational use. City authorities, on the other hand, have been shifting to more inclusive urban planning of streets that combines the side use of a street with its transportational features. In such classification schemes, streets are labeled, for example, as commercial throughway, residential neighborhood or park. This modern approach to urban planning has been adopted by the city of San Francisco and the states of Florida and Pennsylvania, among many others. Currently, the process of labeling streets according to their contexts is manual and hence tedious and time consuming. In this paper, we propose an approach to collect and label imagery data and then deploy advancements in computer vision towards modern urban planning. We collect and label street imagery, then train deep convolutional neural networks (CNNs) to classify street context. We show that CNN models perform well, achieving accuracies in the range of 81% to 87%. We then visualize samples from the embedding space of streets using the t-SNE method and apply class activation mapping methods to interpret the features in street imagery that contribute to a model's output classification.
There have been several interesting applications of machine learning, and particularly deep learning, to imagery data in the domains of urban computing and remote sensing. Researchers in remote sensing have focused on developing models that infer land use and land cover from satellite imagery[albert2017using, Buslaev_2018_CVPR_Workshops, Zhou_2018_CVPR_Workshops, Hamaguchi_2018_CVPR_Workshops, Aich_2018_CVPR_Workshops, albert2017modeling]. Recently, researchers organized the DeepGlobe challenge as part of the CVPR conference; the challenge focused on common tasks in remote sensing including road extraction, building detection and land cover classification [demir2018deepglobe]. Participants produced a myriad of approaches and deep learning techniques to address these tasks[demir2018deepglobe, zhou2018d, hamaguchi2018building, sun2018stacked, kuo2018deep]. Furthermore, researchers have used street view imagery in conjunction with satellite imagery to enhance the process of digitizing maps. Cao et al. integrated satellite imagery and street view imagery to infer land use and then classify buildings and points of interest (POIs)[cao2018integrating].
Researchers have shown that street view imagery can be a predictor of several urban socio-economic measures in cities. Gebru et al. developed an object detection model to extract cars from street view images, then classified the make, model and age of the cars seen in a neighborhood[gebru2017using]. The prices of the detected cars are then looked up in a database of expected prices. The resulting car-price metric was tested for correlation against several demographic features of urban areas, namely U.S. Census and presidential voting data, and the paper shows statistically significant correlations between demographics and the prices of cars in an area. Naik et al. used street view imagery to develop models that predict the perceived safety of a street. They trained the models on perceived safety scores collected by surveying 7000 participants, each of whom scored perceived safety given a street view image[naik2014streetscore]. Naik et al. extended this work and developed a computer vision model to measure changes in the physical appearance of neighborhoods from street-level imagery over time. They studied how the magnitude of change correlates with neighborhood demographic characteristics and found that education level and population density are strong predictors of the magnitude of physical change in an area[naik2017computer]. Kidzinski et al. used street view imagery of houses and apartment complexes to train a model against car insurance data. The paper shows that visual features of houses have predictive strength for the risk of car accidents. Given the address of an insurance beneficiary, the model takes street imagery of the home address as input and outputs a prediction of the individual's car accident risk[kita2019google].
The visual features in street view images are very rich, enabling many applications in urban computing. We explore their potential for urban planning, towards better classification of streets in a city.
In this section, we first discuss the manual process of street context classification in urban planning developed by the city of San Francisco [SF_BSP]. Then, we discuss how we sample labeled street imagery data for the purpose of training a deep convolutional neural network to perform street context classification.
2.1 The manual street context classification process
The city of San Francisco developed a manual process by which street contexts are determined[SF_BSP]. The inputs to this process are the shapefiles of the unlabeled streets in a city and the use of the parcels on the sides of the streets (e.g. commercial or residential). We first collect the streets shapefile for the city of interest; this data is usually made available by city councils. The shapefile data for the streets of Boston and San Francisco were graciously provided by the city councils (San Francisco streets were already labeled by the city council). The city of San Francisco also shared the manual process used by its urban planners to classify street contexts. We then followed this process to label streets in Boston. The labeling scheme results in 11 classes for San Francisco and 10 classes for Boston, because there is little to no presence of the Downtown Residential class in downtown Boston.
The streets are labeled based on conditions pertaining to transportational functionality and the use of land on the sides. The San Francisco classification scheme developed by the city’s urban planners is a multistage scheme for classification of streets that can be summarized in the following steps:
Determine the side use context: streets are distinguished by their side use using parcel information, with side contexts including commercial and residential.
Determine the transportation context: we then assign labels that pertain to the transportational features of the street. These include throughway, highway and highway ramp. Streets with lower flows are labeled downtown or neighborhood for their transportation context.
Identify special conditions: certain streets have special classifications, including alleys, parks and industrial streets.
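The three steps above can be sketched as a small rule-based function. This is only an illustrative sketch and not the city's actual procedure: the `Street` record, its field values, and the order in which the rules take precedence are assumptions made for the example, chosen so that the resulting labels match the 11 class names used in this paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Street:
    # Hypothetical record; field names and values are assumptions.
    side_use: str                  # "commercial" or "residential", from parcel data
    transport: str                 # "throughway", "highway", "highway ramp",
                                   # "downtown" or "neighborhood"
    special: Optional[str] = None  # "alley", "park" or "industrial", if any


def street_context(street: Street) -> str:
    """Return a street context label from the three staged attributes."""
    # Assumption: a special condition overrides the other two stages.
    if street.special is not None:
        return street.special.title()
    # Assumption: highways and ramps are labeled by transportation context alone.
    if street.transport in ("highway", "highway ramp"):
        return street.transport.title()
    # Throughways put the side use first, e.g. "Commercial Throughway".
    if street.transport == "throughway":
        return f"{street.side_use.title()} Throughway"
    # Lower-flow streets put the transportation context first,
    # e.g. "Neighborhood Residential".
    return f"{street.transport.title()} {street.side_use.title()}"
```

For instance, `street_context(Street("commercial", "throughway"))` yields the label Commercial Throughway under these assumed rules.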
2.2 Sampling labeled street imagery
Figure 1 shows the 11 classes for the city of San Francisco. Streets of the class Neighborhood Residential constitute the majority of the streets. The classification for San Francisco was conducted by the city council. We followed the same process to label a subset of streets in Boston for this study.
Once a set of streets in a city is labeled according to the manual process discussed in the previous subsection, we use the streets shapefile to sample labeled street view imagery. First, we randomly sample a street segment, without replacement, from the subset of labeled street segments. Then, we sample a random lat/lon along the sampled street segment. We then collect images that capture the sides of the street as well as the road ahead. Images provided by the Google Street View API cover an angle of approximately 90 degrees. To cover the sides of the street while maintaining a view of the road, we collect two images, one tilted towards the right of the direction of traffic on the street and the other tilted towards the left. The pair of images is labeled with the street context label given in the shapefile.
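The sampling procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact code used in the study: a street segment is assumed to be a list of (lat, lon) vertices ordered in the direction of traffic, the point is interpolated linearly along one leg of the segment, and the 45 degree tilt (`tilt_deg`) is a hypothetical value, since the tilt angle is not specified in the text.

```python
import math
import random


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2, in degrees [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(x, y)) % 360.0


def sample_segments(segments, k, rng):
    """Draw k street segments without replacement."""
    return rng.sample(segments, k)


def sample_view_pair(segment, rng, tilt_deg=45.0):
    """Sample a point on a segment and return (lat, lon, left, right).

    `left` and `right` are camera headings tilted to either side of the
    direction of travel. The tilt angle is an assumption of this sketch.
    """
    # Pick one leg of the polyline, then a random position along it.
    i = rng.randrange(len(segment) - 1)
    (lat1, lon1), (lat2, lon2) = segment[i], segment[i + 1]
    t = rng.random()
    lat = lat1 + t * (lat2 - lat1)  # linear interpolation, fine for short legs
    lon = lon1 + t * (lon2 - lon1)
    heading = bearing_deg(lat1, lon1, lat2, lon2)
    left = (heading - tilt_deg) % 360.0
    right = (heading + tilt_deg) % 360.0
    return lat, lon, left, right
```

Both returned headings would then be paired with the street context label of the sampled segment.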
The Google Street View images API [streetViewApi] provides street view imagery as a service. The service provides a free quota, renewed every month, which we use for the purpose of this study. The API takes the lat/lon of a location as well as an angle of view and returns images of varying sizes, up to a maximum of 640x640 pixels. Figure 2 shows sample points on the street network and the corresponding views from the Street View API. Images framed in orange are taken at an angle towards the left of the direction of the street, and images framed in blue are taken towards the right. From left to right, the figure shows a Commercial Throughway, a Downtown Commercial street and an Alley from Boston. Each sample image has a map below it showing the lat/lon and view angle of the image.
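As an illustration of such a request, the sketch below builds a URL for the Street View Static API using its `size`, `location`, `heading`, `fov` and `key` query parameters. The helper name `street_view_url` and the placeholder API key are assumptions for the example; the function only constructs the URL and does not perform the request.

```python
from urllib.parse import urlencode

STREET_VIEW_ENDPOINT = "https://maps.googleapis.com/maps/api/streetview"


def street_view_url(lat, lon, heading, api_key, size=640, fov=90):
    """Build a Street View Static API request URL (no network call)."""
    params = {
        "size": f"{size}x{size}",            # 640x640 is the maximum noted above
        "location": f"{lat:.6f},{lon:.6f}",  # sampled lat/lon
        "heading": f"{heading % 360:.1f}",   # compass heading of the camera
        "fov": fov,                          # horizontal field of view, degrees
        "key": api_key,                      # placeholder, supply your own key
    }
    return f"{STREET_VIEW_ENDPOINT}?{urlencode(params)}"
```

Calling this once with the left heading and once with the right heading gives the two image requests for a sampled location.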
In this section we discuss the conceptual framework of the process and the architectures of the deep Convolutional Neural Networks (CNN) in our framework.
3.1 The general framework
Street context classification incorporates the side use of a street, that is, the land use on its sides, in addition to the transportational attributes of the street. The side use of a street is influenced by the cultural and socio-economic functions the street serves (which is a subjective judgment by experts). Figure 3-a illustrates the framework of street context classification for a city. The framework samples street view imagery from the labeled streets shapefile. The CNN model is then trained to perform the classification. The CNN outputs a feature map in the embedding space of street contexts.
We take the view that street context classes are a useful discretization of a more continuous spectrum of patterns in the organization of the fabric of streets in an urban setting. This viewpoint is illustrated in Figure 3-b. While some attributes (e.g., amount of built structures or vegetation) are directly interpretable, others may not be. Nevertheless, these patterns influence, and are influenced by, socio-economic factors (e.g., economic activity) and dynamic human behavior (e.g., mobility, parking occupancy). We see the work put forth in this paper, on cheaply curating a large-scale street view classification dataset and comparing streets using deep representations, as a necessary first step towards a granular understanding of urban settings in data-poor regions.
3.2 Convolutional Neural Network (CNN) architectures
In 2012, Krizhevsky et al. applied a CNN to the ImageNet dataset [krizhevsky2012imagenet]. It was the first time such an architecture was more successful than traditional, hand-crafted features on ImageNet. AlexNet laid the foundations for the traditional CNN: a convolutional layer followed by an activation function followed by a max pooling operation. Much of the success of deep neural networks has been attributed to these stacked layers. The intuition is that the layers progressively learn more complex features: the first layer learns edges, the second layer learns shapes, the third layer learns objects, and so on. In this paper, we explore various CNN architectures, starting with AlexNet and then moving to more recent ones including ResNet and Inception [krizhevsky2012imagenet, szegedy2016rethinking, he2016deep].
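To make the convolution, activation and max pooling sequence concrete, the following is a toy, dependency-free sketch of one such block applied to a single-channel image. It is for illustration only and is not the implementation used for the models in this paper; the example kernel is a hand-picked vertical edge detector, whereas a trained network learns its kernels from data.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation, as used by CNN convolutional layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(fmap):
    """Element-wise rectified linear activation."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [[max(fmap[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]


def conv_block(image, kernel):
    """Convolution, then activation, then max pooling."""
    return max_pool(relu(conv2d(image, kernel)))


# A 5x5 image with a dark-to-light vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
edge_kernel = [[-1, 1], [-1, 1]]  # responds to dark-to-light vertical edges
features = conv_block(image, edge_kernel)
```

Here `features` is a 2x2 map whose left column is active, which marks where the edge was found. Stacking several such blocks is what allows later layers to respond to progressively larger and more complex patterns.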
4 Results and Validation
In this section, we show the accuracies of the models per city, per model architecture. We show the confusion matrices of the Inception-v3 model on the validation set of Boston and San Francisco. The data was split 80% for training/testing and 20% for validation.
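An 80/20 split of this kind can be sketched as below. The function name, the fixed seed and the use of a simple random shuffle are assumptions for the example; the text does not state how the split was drawn.

```python
import random


def split_dataset(samples, val_fraction=0.2, seed=0):
    """Shuffle samples and hold out a fraction of them for validation."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    # The first n_val shuffled samples are held out, the rest are for training.
    return shuffled[n_val:], shuffled[:n_val]
```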
Table I shows the accuracies of the different architectures trained on labeled images from the cities of Boston and San Francisco. The AlexNet architecture achieves the lowest accuracy on the validation set for both cities, and the Inception-v3 model achieves the highest. For Boston, AlexNet has an accuracy of 83.16% and Inception-v3 has an accuracy of 87.79%. For San Francisco, AlexNet has an accuracy of 81.69% and Inception-v3 has an accuracy of 84.17%. We notice that, for the same model architecture, accuracy is generally lower for San Francisco than for Boston. We attribute this to the number of street classes: there are 11 classes in San Francisco and 10 in Boston, since the Downtown Residential context is absent in Boston, as discussed earlier.
4.2 Visualizing the embedding space of street contexts
Figure 5 visualizes the t-SNE projection of the feature vectors for each image in a sample from the training dataset. The feature vectors are the output values of the penultimate layer of the neural network (in our case we used the AlexNet model architecture) and are 4096-dimensional. The t-SNE method helps visualize the feature vector space by projecting it into a lower dimensional space while preserving the neighborhood structure of the original space[maaten2008visualizing]. Figure 5 shows the projection onto two dimensions for a sample set of street images in Boston.
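For reference, the standard t-SNE formulation of [maaten2008visualizing] can be summarized as follows. The notation is introduced here for illustration: $x_i$ denotes the feature vector of image $i$, $y_i$ its two-dimensional projection, $N$ the number of images, and $\sigma_i$ a per-point bandwidth set from the chosen perplexity.

```latex
% Neighborhood similarities in the original feature space
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Neighborhood similarities in the two-dimensional projection
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Cost minimized over the projected points y_1, ..., y_N
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Minimizing $C$ places images with similar feature vectors close together in the plot, which is why nearby points can be read as visually similar streets.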
The t-SNE visualization shows, in two dimensions, the neighborhood structure of the high dimensional space of feature vectors. The visualization shows the variations among streets in their contexts and geometrical features. Alleys, on the top right side of the plot, are narrow and surrounded by red-bricked building walls. Streets passing through Park contexts are on the lower side of the plot, with a dense presence of greenery. Streets of type Downtown Commercial and Neighborhood Commercial share visual attributes with Alleys in that they are usually surrounded by more buildings, which have commercial signs; Neighborhood Commercial streets sometimes have more vegetation. They reside on the top right side of the plot, showing more red-bricked buildings on the sides of the streets and little to no vegetation. Neighborhood Residential and Residential Throughway streets are more similar to Parks and have a significant presence of greenery in Boston; they reside in the lower right to middle of the plot. Highways, highway ramps and industrial contexts reside on the left side of the plot. They are usually wide and have less greenery and fewer buildings on the sides. Commercial Throughways lie closer to the middle of the t-SNE visualization: they have some greenery as well as businesses on the sides, which places them between Downtown-like streets and Park-like streets.
Figure 4 shows the confusion matrices for the Inception-v3 model. The accuracy of the model varies by street context. In Boston, Inception-v3 has the highest accuracy for the Park context and the lowest for the Neighborhood Commercial context. We also notice a few cells where the model confuses street contexts. For example, the model confuses Highways with Highway Ramps, which share similar visual features. The model also confuses Residential Throughways, Parks and Neighborhood Residential, which share the visual feature of dense greenery on street sides. Generally, we notice that the confusion patterns of the model are consistent with our t-SNE visualization, where confused classes are usually in close proximity in the embedding space.
For San Francisco, the model has the highest accuracy for Alleys and the lowest accuracy for Residential Throughways. The model confuses Downtown Commercial and Downtown Residential. Similar to Boston, the model confuses Highways and Highway Ramps. These confused classes share similar visual characteristics.
The patterns of confusion between classes in Boston and San Francisco are similar. This is clear in the bottom right corner of the confusion matrices for the classes Residential Throughway, Park and Neighborhood Residential. The same holds for Highways and Highway Ramps, as well as for Neighborhood Commercial and Neighborhood Residential.
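The per-class accuracies and confused class pairs discussed above can be read off a confusion matrix. The sketch below shows one way to compute them from lists of true and predicted labels; the function names are hypothetical, and this is an illustrative helper rather than the evaluation code used in this study.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def per_class_accuracy(counts):
    """Fraction of samples of each true class that were predicted correctly."""
    totals, correct = Counter(), Counter()
    for (true, predicted), n in counts.items():
        totals[true] += n
        if true == predicted:
            correct[true] += n
    return {label: correct[label] / totals[label] for label in totals}


def most_confused_pairs(counts, top=3):
    """Class pairs with the most mutual errors, counted in both directions."""
    errors = Counter()
    for (true, predicted), n in counts.items():
        if true != predicted:
            errors[frozenset((true, predicted))] += n
    return [(tuple(sorted(pair)), n) for pair, n in errors.most_common(top)]
```

Treating a pair as unordered reflects the observation that confusion tends to be mutual, for example between Highways and Highway Ramps.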
4.3 Class Activation Mapping of street contexts
To better understand the features the model uses to classify streets, we investigate which image regions drive the activations using the ideas of Class Activation Mapping (CAM) proposed by Zhou et al.[zhou2016learning]. The method constructs heat maps indicative of the features in an image that are responsible for activating the predicted class. A heat map is generated as a weighted sum of the last set of convolutional outputs. The weights are those of the last layer that correspond to the output class node of the network.
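The weighted sum can be sketched in a few lines. In this illustrative example, `feature_maps` stands for the K spatial maps produced by the last convolutional layer for one image, and `class_weights` for the K last-layer weights connected to the predicted class node; both names, and the min-max normalization used to turn the result into a heat map, are assumptions of the sketch.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) for one class."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    return cam


def normalize(cam):
    """Rescale a map to [0, 1] so it can be rendered as a heat map."""
    flat = [v for row in cam for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant map
    return [[(v - lo) / span for v in row] for row in cam]
```

In practice the normalized map is upsampled to the input image size and overlaid on the street view image, so that hot regions indicate where the evidence for the predicted class is located.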
Figure 6 shows CAM applied to the ResNet18 model for the city of Boston. We show a sample of the heat maps illustrating some of the features in the street view images that activated a given classification. The model captures several features in the images, and we discuss some of those shown in Figure 6. For the Alleys class, the heat map highlights red-bricked walls, trash bins and back-side parking spots, which are typical features of Alleys in the city of Boston.
For Commercial Throughways, the activation map highlights stores on the side of the street, the sidewalk and traffic lights. For Downtown Commercial, the activation map is hot on the high-rise buildings typical of downtown areas. For Industrial streets, the model is activated by corrugated steel walls and a cargo truck. For Neighborhood Residential, the model is activated by houses on the side of the street as well as greenery and cars parked on the side. For Neighborhood Commercial, the model is activated by buildings with stores in them; we also notice the zebra crossing on the street under the heated activation area. For Parks, the model is activated by trees on the sides of the street. For Residential Throughways, the heated area is often wide, capturing the throughway nature of the streets as well as the residential houses on the sides. For Highways, the heat map highlights the road area of the image, on both lanes where they exist. For Highway Ramps, the model looks at the road ahead, which usually has fewer lanes. It is also slightly activated by the presence of a neighboring highway or bridge onto which the ramp will merge. These are sample activation features, among others, that the ResNet18 model highlights in activating the respective predicted classes. In Figure 6 we show a sample of two images per class for illustration and validation.
The authors would like to thank the city councils of San Francisco for the street context labeled data. The authors would also like to thank the collaboration program (CCES) between MIT and KACST for their continuous support. In addition, the authors would like to thank the following individuals: Mohamed Alhajri for valuable input, and Abdulaziz Alhassan and Abdulaziz Aldawood for interesting discussions and critique.