Sky pixel detection in outdoor imagery using an adaptive algorithm and machine learning.
Computer vision techniques allow automated detection of sky pixels in outdoor imagery. Multiple applications exist for this information across a large number of research areas. In urban climate, sky detection is an important first step in gathering information about urban morphology and sky view factors. However, capturing accurate results remains challenging and becomes even more complex using imagery captured under a variety of lighting and weather conditions.
To address this problem, we present a new sky pixel detection system demonstrated to produce accurate results using a wide range of outdoor imagery types. Images are processed using a selection of mean-shift segmentation, K-means clustering, and Sobel filters to mark sky pixels in the scene. The algorithm for a specific image is chosen by a convolutional neural network, trained with 25,000 images from the Skyfinder data set, reaching 82% accuracy with the top three classes. This selection step allows the sky marking to follow an adaptive process and to use different techniques and parameters to best suit a particular image. An evaluation of fourteen different techniques and parameter sets shows that no single technique can perform with high accuracy across varied Skyfinder and Google Street View data sets. However, by using our adaptive process, large increases in accuracy are observed. The resulting system is shown to perform better than other published techniques.
Keywords: sky view factor, Google Street View, machine learning, WUDAPT, sky pixel detection, Skyfinder
Sky pixel detection in images is an ongoing computer vision challenge with a large range of applications such as autonomous vehicle or drone navigation (Shen2013), real time weather classification (Roser2008), image editing (Laffont2014; Tao2009), sky replacement (Tsai2016), and scene parsing (Tighe2010; Hoiem2005). It is also an important tool in urban climate research. The use of fisheye photography to calculate sky view factors (SVF), the fraction of sky visible to a point, at individual locations has long been used in urban climate. Numerous techniques exist to process this sort of imagery (Grimmond2001; Chapman2004; Ali-Toudert2007).
In computer vision research, sky detection techniques have largely followed two main paths: either finding the pixels associated with the sky on a pixel-by-pixel basis, or finding a sky-ground boundary and labelling the sky as everything above that boundary. The first approach focuses on finding individual pixels associated with the sky. Luo2002 employed a physics-based approach using the changes of sky colours from zenith to horizon. Gallagher2004 generated sky pixel probability maps based on colour values and a two-dimensional polynomial for each colour channel. Zafarifar2007 added texture, gradients, and vertical position to colour values to generate their probability map; however, only blue sky is detected accurately, with clouds marked with low probabilities. Schmitt2009 added an analysis of position and shape and were reportedly also able to perform sky detection accurately under cloudy conditions.
Other approaches have focused on finding a sky/ground boundary. Straight lines were first used to define a horizon located via an energy function (Ettinger2003). Using an improved energy function optimisation and gradient information from the image, Shen2013 allowed the horizon line to follow the boundary instead of being restricted to a straight line. Additional variations allowed increasingly difficult sky regions (i.e. regions separated from the main sky region by buildings, flags, or other obstructions) to be detected (Zhijie2014; Zhijie2015). Another approach, not specifically designed for sky pixel identification, attempts to classify around a dozen classes (such as sky, buildings, trees, and cars) in outdoor imagery through semantic segmentation using trained convolutional neural networks (CNN), such as SegNet (Badrinarayanan2017), and other variations (Holder2016; Middel2019).
Comparing existing approaches is difficult, as most studies do not report accuracy metrics, and when they do, they are reported using different metrics and benchmark data sets. Luo2002 reports 90.4% correct detection of blue sky pixels and 13% misclassifications. Chapman2004 reports a RMSE of 0.06 in marking SVF in fisheye images. Schmitt2009 reports an accuracy of 0.90 (error rate 0.1) for 80% of their validation images and an accuracy of 0.95 (error rate 0.05) for 75% of the images. Liang2017, using SegNet, reports an accuracy of 0.96 for sky pixel identification. Shen2018, using an off-the-shelf version of SegNet, report an accuracy of 0.83 with lateral views. Finally, Middel2019 reports an accuracy of 0.95 with lateral views.
An effort is ongoing to provide worldwide databases of standardised urban morphology information. Now with the widespread availability of urban imagery, an opportunity exists to expand the range of research that can be conducted without the requirement of manually collecting urban morphology parameters and to accelerate the population of databases such as The World Urban Database and Portal Tool (WUDAPT) (Mills2015).
Recent studies have started to utilise automated methods to build SVF data sets from Google Street View (GSV) imagery (Middel2018; Gong2018). The results are promising but show some accuracy problems. Urban areas with large numbers of street trees in particular are cited by Gong2018 as a key source of inaccuracies in their system. In addition, these systems are highly dependent on GSV imagery, which as of 2018 has become more restrictive to license and more expensive to obtain. This makes it necessary to expand the types of imagery used to include imagery that, unlike GSV, might vary more in lighting, weather conditions, camera angles, and aspect ratios.
With these factors in mind, we present a sky pixel detection system that has been tested using various types of outdoor imagery collected under a wide range of lighting and weather conditions, camera angles, and aspect ratios. This system, built on artificial intelligence training, is adaptive and uses a range of algorithms and combinations of parameters to locate the sky pixels to ensure the highest accuracy for each individual class of images. We evaluate a number of existing and new techniques for sky pixel classification and demonstrate that this new adaptive system performs with greater accuracy than any of these individual techniques on their own.
Our sky pixel identification system used data from two main sources, the Skyfinder data set (Mihail2016) and GSV (GoogleMaps2017b). Three computer vision techniques and a number of parameter variations were used to process the data (see Table 1). The overall process flow is shown in Figure 1. Finally, a previously published fourth technique (i.e. Sobel/flood-fill) was used as a benchmark test for our system.
|Technique||Description||Variations|
|Sobel||Implementation of Wang2015a’s Sobel operator/hybrid probability model||6|
|Mean||Algorithm developed by the authors based on mean shift segmentation||4|
|K-means||Algorithm developed by the authors based on K-means clustering and HSL colour filtering||3|
|Sobel/flood-fill||Middel2018’s Sobel operator/flood-fill algorithm used as benchmark||1|
2.1.1 Skyfinder data
This data set was built from 90,000 long-term timelapse images from 53 outdoor webcams over a variety of lighting and weather conditions. Images are of a wide range of sizes and aspect ratios, including 640×489, 857×665, 960×600, 1,280×720, and 1,280×960. For each location, a binary sky mask was created for validation purposes. All of these images are available from the Skyfinder website (Mihail2015).
In this study, we selected 38,115 images from 40 locations. Night-time images and images with heavy fog were removed, as these conditions are unlikely to be encountered in imagery used to calculate SVF. The selection was split into two data sets: 28,586 images for neural network training and 9,529 for validation.
2.1.2 GSV data
Panoramas for 406 locations in a variety of cities (Adelaide, Brisbane, Paris, Sydney, Tokyo, Perth, and Melbourne) were retrieved using the Google Maps API. Images were retrieved as six 640×640 tiles (one each for up, down, left, right, front, and back directions). The six images were stitched together into a 1,280×960 cubic image using Java 8 (Oracle2018) and OpenCV (Bradski2000). Validation images were created by hand marking sky regions in each image using the GNU Image Manipulation Program (GIMP2019). Figure 2 shows an example of a GSV panorama image and the corresponding hand-marked validation image. This data set was only used as part of the validation data set.
2.2 Techniques and parameters
2.2.1 Wang2015a Sobel operator/hybrid probability model
The sky detection algorithm presented in Wang2015a was implemented using OpenCV and Java 8. This method proceeds by calculating grey scale gradient images using x- and y-directional Sobel operators to estimate sky colour. An optimised objective function attempts to find the best sky-ground boundary in the gradient image using the covariance matrices of a first calculation of sky and ground regions. Using this best sky boundary, probability models are created from i) the centre and standard deviations of the colours, ii) the gradient values, and iii) the vertical position of each pixel (vertically higher pixels are more likely to be sky). An overall probability model, ranking each pixel’s probability (0 to 1) of being sky, is generated from these three probability models. Wang2015a reports an error average (in percent of sky) of 0.051 and standard deviation of 0.058 in their evaluation using human-labelled images.
Wang2015a did not recommend a probability threshold, so a number of thresholds were tested (0.50, 0.60, 0.70, 0.80, 0.90, and 0.95) and given the designations of Sobel_50, Sobel_60, Sobel_70, Sobel_80, Sobel_90, and Sobel_95 respectively. The algorithm was applied to each image and pixels that exceeded the chosen threshold were marked as sky pixels (using blue, RGB 0,0,255). Results from our implementation are shown in Figure 3.
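The thresholding step described above can be sketched as follows. This is a minimal illustration, not the code used in this study: it assumes a per-pixel sky probability map is already available as a nested list, and the function names (`mark_sky`, `sky_fraction`) and the toy probability values are hypothetical.

```python
# Illustrative sketch only: thresholding a per-pixel sky probability map.
# Function names and the toy probability values are hypothetical.

THRESHOLDS = {
    "Sobel_50": 0.50, "Sobel_60": 0.60, "Sobel_70": 0.70,
    "Sobel_80": 0.80, "Sobel_90": 0.90, "Sobel_95": 0.95,
}

def mark_sky(prob_map, threshold):
    """Return a binary mask that is True where the probability exceeds the threshold."""
    return [[p > threshold for p in row] for row in prob_map]

def sky_fraction(mask):
    """Fraction of all pixels that are marked as sky."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total

# Toy 3x4 probability map (higher rows are more likely to be sky).
prob_map = [
    [0.99, 0.97, 0.92, 0.85],
    [0.90, 0.75, 0.65, 0.55],
    [0.40, 0.20, 0.10, 0.05],
]

fractions = {
    name: sky_fraction(mark_sky(prob_map, t)) for name, t in THRESHOLDS.items()
}
```

Raising the threshold can only shrink the marked region, which is why the six designations trade recall for precision on the same probability map.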
2.2.2 Mean shift segmentation algorithm
Mean shift is an algorithm often used for image segmentation (Comaniciu1997; Comaniciu2002). Image segmentation involves decomposing images into homogeneous contiguous regions of pixels of similar colours or grey levels. Mean shift uses an iterative algorithm to pick search windows (spatial and range) of a certain radius in an initial location in an image, then compute a mean shift vector and translate the search window by that amount until convergence (Comaniciu1997). Segmentation results are highly dependent on input parameters for the algorithm, which include the spatial radius of the search window, colour range radius of the search window, and minimum density (the minimum number of pixels to constitute a region). The mean shift used in this project is based on a Java port by Pangburn2002 of the C++ based EDISON vision toolkit (Christoudias2002).
Four different variations of the input parameters were used, determined experimentally through a sensitivity test to work across the widest variety of images. For example, in an effort to shift the entire sky to a single colour, images with patchy multi-coloured clouds are more accurately segmented when the radius and density parameters are increased (Figure 4, row 1, a-d). However, this can create false positives in other images; for example, buildings (Figure 4, row 3, a-d, centre left background) are increasingly segmented into the sky. The technique designations and parameters are detailed in Table 2. Mean shift is applied to each image with the chosen set of parameters, and pixels of the most common colour (in the top half of the segmented image) are marked as sky. Results are shown in Figure 5.
|Designation||Spatial radius||Range radius||Min. density|
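The sky-marking step that follows mean shift segmentation can be sketched as below. This is a simplified illustration rather than the published implementation: it assumes the segmentation has already reduced each region to a single flat colour, and it assumes that every pixel of the winning colour is marked as sky wherever it occurs in the image. The function name and toy image are hypothetical.

```python
from collections import Counter

def mark_most_common_top_half(segmented):
    """Mark as sky every pixel whose segment colour is the most common
    colour found in the top half of the segmented image."""
    height = len(segmented)
    top_half = segmented[: max(1, height // 2)]
    counts = Counter(colour for row in top_half for colour in row)
    sky_colour, _ = counts.most_common(1)[0]
    return [[colour == sky_colour for colour in row] for row in segmented]

# Toy 4x4 "segmented" image: each tuple stands for one flat segment colour.
BLUE, GREY, GREEN = (90, 140, 220), (120, 120, 120), (40, 110, 50)
segmented = [
    [BLUE, BLUE, BLUE, GREY],
    [BLUE, BLUE, GREY, GREY],
    [BLUE, GREY, GREY, GREEN],
    [GREEN, GREEN, GREEN, GREEN],
]

sky_mask = mark_most_common_top_half(segmented)
```

The sketch also shows why over-aggressive segmentation produces false positives: any building merged into the dominant sky segment inherits its colour and is marked with it.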
2.2.3 K-means clustering and HSL colour filtering
A third sky segmentation technique was designed using K-means clustering and hue, saturation, and lightness (HSL) colour filtering. K-means clustering iteratively splits an image into a number of clusters, terminating when a specified criterion is met (i.e. maximum iterations and/or desired accuracy). The K-means clustering was performed using the K-means method from the OpenCV library. Three different input parameter settings were used, determined experimentally through a sensitivity test to work on a wide variety of images. The technique designations and parameters are detailed in Table 3. K-mean_12 was found to work best when the sky is mostly obscured by a building or bridge. K-mean_6 was more accurate with cloudy skies. K-mean_14 handles sky scenes broken up by tree canopies.
K-means clustering was performed on each image, splitting the image into the chosen number of clusters. Cluster regions were then filtered on their HSL values: a colour region was added to a list of possible sky regions only when conditions on its hue, saturation, and lightness were met.
Of these possible sky clusters, only clusters with a number of pixels greater than the Skyreq threshold (percent of all pixels) in the image were finally marked as sky regions. Example results are shown in Figure 6.
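The cluster filtering described above can be sketched as follows. The hue, saturation, and lightness limits and the `SKY_REQ` value in this sketch are hypothetical placeholders chosen only for illustration; they are not the conditions or parameters used in this study. The sketch assumes the clustering has already produced one centre colour and one pixel count per cluster.

```python
import colorsys

# Hypothetical limits for illustration only (not the values used in the study).
HUE_RANGE = (0.50, 0.72)   # roughly cyan to blue, as a fraction of the hue circle
MIN_SATURATION = 0.20
MIN_LIGHTNESS = 0.35
SKY_REQ = 0.10             # minimum share of all pixels for a cluster to count

def is_sky_like(rgb):
    """Hypothetical HSL test on a cluster's centre colour (8-bit RGB)."""
    r, g, b = (c / 255.0 for c in rgb)
    h, l, s = colorsys.rgb_to_hls(r, g, b)  # colorsys returns H, L, S order
    return (HUE_RANGE[0] <= h <= HUE_RANGE[1]
            and s >= MIN_SATURATION
            and l >= MIN_LIGHTNESS)

def select_sky_clusters(cluster_centres, cluster_sizes, sky_req=SKY_REQ):
    """Keep clusters that pass the colour test and cover more than sky_req of the image."""
    total = sum(cluster_sizes)
    return [
        i
        for i, (centre, size) in enumerate(zip(cluster_centres, cluster_sizes))
        if is_sky_like(centre) and size / total > sky_req
    ]

# Toy clusters: blue (large), grey, blue (small), green.
centres = [(100, 150, 230), (60, 60, 60), (110, 160, 240), (30, 120, 40)]
sizes = [5000, 3000, 500, 1500]
sky_clusters = select_sky_clusters(centres, sizes)
```

In this toy case the small blue cluster passes the colour test but is rejected by the size requirement, which is the role the Skyreq threshold plays in suppressing small sky-coloured objects.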
2.2.4 Middel2018 Sobel operator/flood-fill algorithm
For benchmark comparisons, we used an algorithm developed by Middel2018. This process is based on a Sobel filter (Sobel1968) and flood-fill algorithm (Laungrungthip2008; Middel2017). This method was designed to calculate SVF from GSV image cubes that were projected into upwards facing fisheye views. Note, this algorithm also rescales the imagery to 512×512. All 38,521 images in the combined training and validation data set were processed with this algorithm (sky pixels marked with white, RGB 255,255,255), compared to validation images, and results saved for a comparison with our process flow results. These results were kept separate from the other 13 techniques and were not included in the NN training process (see Section 2.3.2). Also, in this benchmark comparison, we used this system to process a varied outdoor imagery data set, not the fisheye imagery (cropped below the horizon) this algorithm was originally designed to process. Results are shown in Figure 7.
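The general idea of combining a Sobel filter with a flood fill can be sketched as below. This is a simplified stand-in and not Middel2018's implementation: seeding the fill from the top row, the replicated-border handling, and the edge threshold are all assumptions made for illustration.

```python
from collections import deque

def sobel_magnitude(grey):
    """Gradient magnitude from 3x3 Sobel kernels, replicating border pixels."""
    h, w = len(grey), len(grey[0])

    def px(y, x):
        return grey[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    mag = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = (px(y - 1, x + 1) + 2 * px(y, x + 1) + px(y + 1, x + 1)
                  - px(y - 1, x - 1) - 2 * px(y, x - 1) - px(y + 1, x - 1))
            gy = (px(y + 1, x - 1) + 2 * px(y + 1, x) + px(y + 1, x + 1)
                  - px(y - 1, x - 1) - 2 * px(y - 1, x) - px(y - 1, x + 1))
            mag[y][x] = (gx * gx + gy * gy) ** 0.5
    return mag

def flood_fill_sky(grey, edge_threshold):
    """Flood-fill from the top row through pixels whose gradient is below the threshold."""
    h, w = len(grey), len(grey[0])
    mag = sobel_magnitude(grey)
    sky = [[False] * w for _ in range(h)]
    queue = deque((0, x) for x in range(w))
    while queue:
        y, x = queue.popleft()
        if not (0 <= y < h and 0 <= x < w):
            continue
        if sky[y][x] or mag[y][x] >= edge_threshold:
            continue
        sky[y][x] = True
        queue.extend(((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)))
    return sky

# Toy 6x5 grey image: three bright "sky" rows above three dark "ground" rows.
grey = [[200] * 5 for _ in range(3)] + [[50] * 5 for _ in range(3)]
sky = flood_fill_sky(grey, edge_threshold=100)
```

Because the fill stops at strong edges, a single missed edge lets it leak into the ground, and a strong edge near the seed stops it early; either failure marks an image as nearly all sky or nearly no sky.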
2.3 Neural network
2.3.1 Inception V3
The Microsoft Cognitive Toolkit (CNTK) (Yu2015; Agarwal2016), with the Inception V3 network (Szegedy2015a), was used in this project to route images through our adaptive algorithm. This artificial neural network (NN) is a widely used model for image classification across a large variety of fields (Xia2017; Hassannejad2016). The model was trained with a list of images assigned to categories (in our case, the technique that performed most accurately for each image), and the training process was run until the model reached peak accuracy (convergence) at recognising images from these classifications.
2.3.2 Neural network training
The Skyfinder data set of 38,115 images was split into two data sets of 75% training and 25% validation. All training and validation images (which consisted of images of a wide variety of sizes and aspect ratios) were rescaled to 300×300. The network was calibrated using supervised learning with the generated data set to identify which of the 13 sky detection technique variations performed with the highest accuracy for each image. Note, none of the GSV imagery was used in the NN training process.
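The construction of the training labels and the 75%/25% split can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in this study: the per-image accuracy measure is assumed to be the absolute error in sky fraction, the file names and error values are invented, and a seeded random shuffle stands in for however the split was actually drawn.

```python
import random

def best_technique(errors_by_technique):
    """Class label for one image: the technique with the smallest error."""
    return min(errors_by_technique, key=errors_by_technique.get)

def split_dataset(items, train_fraction=0.75, seed=0):
    """Shuffle and split items into training and validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical per-image errors (absolute error in sky fraction).
errors = {
    "img_001.jpg": {"Sobel_70": 0.04, "Mean_7_6_100": 0.01, "K-mean_6": 0.12},
    "img_002.jpg": {"Sobel_70": 0.02, "Mean_7_6_100": 0.09, "K-mean_6": 0.30},
    "img_003.jpg": {"Sobel_70": 0.20, "Mean_7_6_100": 0.15, "K-mean_6": 0.03},
    "img_004.jpg": {"Sobel_70": 0.05, "Mean_7_6_100": 0.02, "K-mean_6": 0.08},
}

labels = {name: best_technique(e) for name, e in errors.items()}
train, val = split_dataset(sorted(labels))
```

Applied to 38,115 items, the same 75% cut yields 28,586 training and 9,529 validation images, matching the counts given in Section 2.1.1.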
2.3.3 Neural network inference
Using the trained model, inferences were performed using the images from the validation data set (the 25% of images from Section 2.3.2) as well as the 406 GSV panoramas. The techniques and parameters picked by the NN as the most appropriate for that image were used to mark the sky pixels. The marked sky pixels were compared to the ground truth to assess accuracy.
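The routing performed at inference time can be sketched as a simple dispatch. Everything in this sketch is a hypothetical stand-in: the classifier is a placeholder function rather than the trained Inception V3 model, and the two toy techniques replace the 13 technique variations.

```python
def adaptive_sky_mask(image, classify, techniques):
    """Run whichever technique the classifier picks for this image."""
    chosen = classify(image)
    return chosen, techniques[chosen](image)

# Toy techniques operating on a small grey image (nested list of 0-255 values).
def bright_pixels(image):
    return [[value > 128 for value in row] for row in image]

def top_half(image):
    cut = len(image) // 2
    return [[y < cut for _ in row] for y, row in enumerate(image)]

TECHNIQUES = {"bright_pixels": bright_pixels, "top_half": top_half}

def stand_in_classifier(image):
    """Placeholder for the trained NN: picks a technique from mean brightness."""
    values = [value for row in image for value in row]
    mean = sum(values) / len(values)
    return "bright_pixels" if mean > 100 else "top_half"

image = [[200, 210], [40, 50]]
chosen, mask = adaptive_sky_mask(image, stand_in_classifier, TECHNIQUES)
```

Keeping the techniques in a lookup table means a new algorithm or parameter set only needs a new entry and a retrained classifier, without changes to the routing itself.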
Three sets of results are reported in this section. The first is a comparison of all of the techniques and parameters run individually against the Skyfinder and GSV datasets (a total of 38,521 images). The second presents the results of 9,636 validation images using our process flow with the techniques and parameters chosen by the trained NN. The third presents a comparison to two benchmark models: a) the Wang2015a Sobel operator/hybrid probability model and b) the Middel2018 Sobel operator/flood-fill algorithm.
3.1 Results from all techniques
All the technique and parameter variations were used to process the two data sets of the 38,115 Skyfinder and 406 GSV images. A summary of index of agreement (Willmott1981), R, and root-mean-square error (RMSE) statistics for evaluations against the two datasets is presented in Table 4. Plots of a number of the better performing techniques are presented in Figure 8. Note, the strong horizontal lines in these figures are due to the nature of the Skyfinder data set that contains large groups of the same scenes (with the same percentage of sky) under different lighting and weather conditions, often resulting in a wide range of calculated results.
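The three summary statistics can be computed as in the sketch below, where each observation is the true sky fraction of an image and each prediction is the calculated sky fraction. The index of agreement follows the standard Willmott (1981) form, d = 1 - Σ(P - O)² / Σ(|P - Ō| + |O - Ō|)². Treating R as the Pearson correlation coefficient is an assumption of this sketch, and the toy values are invented.

```python
from math import sqrt

def rmse(obs, pred):
    """Root-mean-square error between observed and predicted values."""
    return sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))

def pearson_r(obs, pred):
    """Pearson correlation coefficient (assumed meaning of R in this sketch)."""
    mean_o = sum(obs) / len(obs)
    mean_p = sum(pred) / len(pred)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov / sqrt(var_o * var_p)

def index_of_agreement(obs, pred):
    """Willmott's index of agreement d (1 means perfect agreement)."""
    mean_o = sum(obs) / len(obs)
    numerator = sum((p - o) ** 2 for o, p in zip(obs, pred))
    denominator = sum((abs(p - mean_o) + abs(o - mean_o)) ** 2
                      for o, p in zip(obs, pred))
    return 1.0 - numerator / denominator

# Toy sky fractions: predictions are uniformly 0.1 too high.
observed = [0.2, 0.4, 0.6, 0.8]
predicted = [0.3, 0.5, 0.7, 0.9]

stats = (rmse(observed, predicted),
         pearson_r(observed, predicted),
         index_of_agreement(observed, predicted))
```

The toy case illustrates why the statistics are reported together: a constant bias leaves the correlation at 1 while RMSE and d both register the error.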
|Technique||Skyfinder R||Skyfinder RMSE||Skyfinder d||GSV R||GSV RMSE||GSV d|
|Adaptive NN process||0.882||0.063||0.967||0.485||0.072||0.840|
Finally, precision, recall and F1 statistics for each technique against the 9,636 validation images are shown in Table 5.
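|Technique||Skyfinder precision||Skyfinder recall||Skyfinder F1||GSV precision||GSV recall||GSV F1|
|Adaptive NN process||0.946||0.965||0.952||0.933||0.918||0.918|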
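Precision, recall, and F1 for a single image can be computed from a predicted sky mask and its ground truth mask as sketched below. Counting at the pixel level within one image is an assumption of this sketch (how the scores were aggregated across images is not shown here), and the toy masks are invented.

```python
def precision_recall_f1(truth, predicted):
    """Pixel-level precision, recall, and F1 for the sky class."""
    tp = fp = fn = 0
    for truth_row, pred_row in zip(truth, predicted):
        for t, p in zip(truth_row, pred_row):
            if p and t:
                tp += 1      # sky correctly marked
            elif p and not t:
                fp += 1      # non-sky marked as sky
            elif t and not p:
                fn += 1      # sky that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy masks (1 = sky): the prediction over-marks two non-sky pixels.
truth = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
predicted = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]

precision, recall, f1 = precision_recall_f1(truth, predicted)
```

Here all true sky is found (recall of 1.0) but a quarter of the marked pixels are false positives (precision of 0.75), the pattern of a technique that overestimates sky.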
3.2 Results from neural network classified techniques
Figure 9 presents a theoretical best case. If the NN were 100% accurate in picking the best technique from the thirteen possible combinations based on its training, a RMSE of 0.026 and 0.020 (and index of agreement of 0.994 and 0.988) would be possible against the 38,115 Skyfinder images and 406 GSV images respectively.
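The theoretical best case can be sketched as an oracle that, for every image, keeps the result of whichever technique came closest to the true sky fraction. The sky fractions below are invented for illustration, and using absolute error in sky fraction to define the best technique is an assumption of this sketch.

```python
from math import sqrt

def technique_rmse(true_fractions, predictions):
    """RMSE of one technique across all images."""
    squared = [(p - t) ** 2 for t, p in zip(true_fractions, predictions)]
    return sqrt(sum(squared) / len(squared))

def oracle_rmse(true_fractions, predicted_by_technique):
    """RMSE if the technique with the smallest absolute error were chosen per image."""
    squared = []
    for i, truth in enumerate(true_fractions):
        best = min(abs(preds[i] - truth)
                   for preds in predicted_by_technique.values())
        squared.append(best ** 2)
    return sqrt(sum(squared) / len(squared))

# Hypothetical sky fractions for three images and three techniques.
true_fractions = [0.30, 0.50, 0.70]
predicted_by_technique = {
    "Sobel_70": [0.32, 0.70, 0.40],
    "Mean_7_6_100": [0.50, 0.51, 0.90],
    "K-mean_6": [0.10, 0.20, 0.69],
}

single = {name: technique_rmse(true_fractions, preds)
          for name, preds in predicted_by_technique.items()}
oracle = oracle_rmse(true_fractions, predicted_by_technique)
```

In this toy case each technique fails badly on two of the three images, yet the oracle error is an order of magnitude lower than any single technique, because each technique is the best choice for one image.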
Samples of the imagery used in training for selected classifications are shown in Figure 10. As can be seen in this figure, there are no strong visual themes in each of the classifications (i.e. all very cloudy, clear blue sky, or multi-coloured sky), however the NN is able to pick up on more subtle features not readily visible to the eye.
The NN was trained for 250 epochs on an Nvidia GeForce GTX 1080 GPU, requiring about 12 hours. The NN reached a peak accuracy rate of 52.6% in choosing the optimal algorithm from the 13 options. However, as some techniques did only slightly better than others, the error introduced by picking the second or third best algorithm is generally limited. Breaking down the accuracy with this in mind, the NN picked the best method 52.6% of the time, the second best 18.9%, and the third best 10.5% (for a total across the three of 82.0%).
Figure 9b shows the overall accuracy of the NN-chosen process flow against the 9,636 validation images. For the Skyfinder and GSV images from the validation data set (respectively), the RMSE is 0.063 and 0.072, R is 0.882 and 0.485, and the index of agreement is 0.967 and 0.840. The imperfect accuracy of the NN has reduced the overall accuracy of the system (i.e. it does not reach the theoretical RMSE of 0.026 or 0.020), but the results on out-of-sample images are still very good. In addition, precision, recall, and F1-scores for the Skyfinder and GSV validation data sets show good results. Precision is 0.946 and 0.933, recall is 0.965 and 0.918, while the F1-scores are 0.952 and 0.918 (all respectively).
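The top-three breakdown can be sketched as a rank-based accuracy. The per-image errors and the list of ranks below are invented for illustration; only the three reported percentages come from the results above.

```python
def rank_of_choice(errors_by_technique, chosen):
    """1 if the chosen technique was the best for this image, 2 if second best, and so on."""
    ordered = sorted(errors_by_technique, key=errors_by_technique.get)
    return ordered.index(chosen) + 1

def top_k_accuracy(ranks, k):
    """Share of images for which the chosen technique ranked k-th or better."""
    return sum(rank <= k for rank in ranks) / len(ranks)

# Hypothetical errors for one image; the NN is imagined to have picked Sobel_70.
example_rank = rank_of_choice(
    {"Sobel_70": 0.04, "Mean_7_6_100": 0.01, "K-mean_6": 0.12}, "Sobel_70"
)

# Hypothetical ranks of the NN's choice for ten images.
ranks = [1, 1, 2, 3, 5, 1, 2, 4, 1, 1]
top1 = top_k_accuracy(ranks, 1)
top3 = top_k_accuracy(ranks, 3)

# Reported shares for best, second best, and third best picks (percent).
reported_top3 = round(52.6 + 18.9 + 10.5, 1)
```

Top-three accuracy is the more informative figure here because a second or third best pick usually costs little when the leading techniques perform similarly on an image.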
3.3 Benchmark results
3.3.1 Results from the Wang2015a Sobel operator/hybrid probability model
These results have previously been presented in Table 4 and Figure 8. The best performing variation (Sobel_70) for the Skyfinder data resulted in a RMSE of 0.132, R of 0.401, and d of 0.683, while the best performing variation (Sobel_80) for the GSV data resulted in a RMSE of 0.07, R of 0.433, and d of 0.655. Precision, recall, and F1-scores show that Sobel_70 has a lower precision than Sobel_80 with the Skyfinder validation images (0.869 vs. 0.920) while Sobel_70 has a better recall score than Sobel_80 (0.914 vs. 0.781). Similar patterns are seen with the GSV imagery. Overall, the F1-scores are in the range of 0.82 to 0.87 for both techniques and data sets.
3.3.2 Results from the Middel2018 Sobel operator/flood-fill algorithm evaluation
In the evaluation of the Sobel/flood-fill algorithm, results from the Skyfinder and GSV datasets and 9,636 validation images are shown in Figure 11 and Table 4. This algorithm yields a RMSE of 0.205, R of 0.150, and d of 0.663 against the Skyfinder images from the validation data set and a RMSE of 0.312, R of 0.067, and d of 0.304 against the GSV images from the validation data set. The results from the evaluation of GSV imagery showed a number of images mis-marked as 100% sky, inflating the error rate for this data set. A similar problem was seen with the Skyfinder images, several of which were mis-marked as 0% sky. This is also reflected in the precision, recall, and F1-scores. Precision for the Skyfinder and GSV data sets is low (0.840 and 0.761) while recall is much higher (0.900 and 0.948), resulting in low F1-scores (0.856 and 0.813).
4 Discussion and conclusion
In comparison to published methods, our adaptive process performs well. Our accuracy against the Skyfinder images of RMSE of 0.063 compares well to the RMSE of 0.205 for the Middel2018 Sobel/flood-fill algorithm and the best performing Wang2015a Sobel variations, Sobel_70 and Sobel_80, which achieved an RMSE of 0.132 and 0.177. Similarly, our adaptive process also performs well with the GSV images from the validation dataset with results of RMSE of 0.072 compared to 0.07 and 0.312 for Sobel_80 and Sobel/flood-fill respectively. Finally, our adaptive process performs with the best precision, recall, and F1-scores for the Skyfinder data set compared to any of the evaluated techniques.
The results from Section 3.1 show that no single technique and parameter combination performs sky pixel identification with high accuracy across the data sets used by this project. These data sets contain a wide variety of outdoor scenes with various lighting and weather conditions (as can be seen in some of the sample images in Figure 10), challenging many of the techniques.
It was expected that algorithms would perform better with GSV imagery, due to its regularity. These images were captured with the same type of equipment, using the same camera angles (horizon at 50% image height), under clear sky or partly cloudy conditions. The results show that almost all of the variations perform better with the GSV data than the Skyfinder data. Some of the variations even approach the accuracy of our system with the GSV data, for example Sobel_80.
However, the Skyfinder data set challenged all of the variations, with the Mean and Sobel based methods achieving no better than 0.100 to 0.200 RMSE. Even so, for some individual images, even the poorest performing techniques excelled compared to all of the other techniques. Conversely, some techniques perform poorly for certain images. In Figure 8d, the results for the Sobel_70 method show wide variations between the Skyfinder and GSV datasets while the RMSE values are roughly similar. In the case of Sobel_70, sky fractions for images with low sky fractions (a small number of images in the dataset) are systematically overestimated, and this has a significant impact on the GSV values. Both of these cases validate the need for an adaptive process that can respond to the specific challenges each image presents and deliver better overall results than any single algorithm, while also allowing certain techniques not to be chosen in cases where they will perform poorly.
Further, having a range of combinations of techniques was important for the overall accuracy. Experimentation was performed to reduce the number of classifications, with the aim of increasing the accuracy of the NN (by reducing the number of required classification choices). However, in removing some of the worst performing methods (many of the K-means variations), the overall accuracy degraded. While some of the variations had very low accuracy overall, they were the best choice for some images, and having them available outweighed the lower accuracy of the NN in picking the exact best choice.
While the theoretical accuracy from the 13 approaches was a RMSE of 0.026 and 0.020 (Skyfinder and GSV respectively), combinations of the best performing three or four classes saw theoretical accuracy reduced to RMSEs of 0.039 (for Mean_7_6_100, K-mean_6, and Sobel_70), 0.045 (for Mean_3_6_100, K-mean_6, and Sobel_70), 0.036 (for Mean_7_6_100, Mean_7_6_100, K-mean_6, and Sobel_70), or 0.095 (for all Mean combinations).
This attempt to increase the accuracy of the NN predictions highlights a limitation of this study. An accuracy of 53% in picking between 13 approaches shows that there is room for improvement. NNs perform best when they are trained with large amounts of data. In this study, the training data set only included 28,000 images. With a larger training data set, resulting in a lower NN error rate, it is possible to come closer to the theoretical RMSE of 0.026 for the sky pixel identification.
One difficulty in this study was comparing different sky pixel detection schemes in an objective manner. As noted in the introduction, most studies either do not provide metrics, or provide differing types of metrics. Also, with our results evaluating the Skyfinder and GSV data sets showing some large differences between the two, a lack of standardised benchmarks makes comparisons less meaningful. We attempted to overcome these difficulties by implementing some of the other methods and including them in our evaluation against common data sets. In addition, the needs of the eventual application should be kept in mind. As precision and recall scores often varied widely for each algorithm, the impact of either higher false positives or false negatives should guide algorithm choice or at least be considered in the results. For example, an algorithm with higher false positives will overestimate sky pixels, leading to a higher SVF estimate and possibly higher maximum temperatures in urban canyon modelling.
In our linked Data in Brief article (Nice2019Data), we provide our trained NN model (and configuration files), which can be used to infer the best algorithm for any type of outdoor imagery, as well as all the training and validation imagery used in this study. This system can then be used out of the box. With our flexible framework, new algorithms and variations of existing algorithms can be added to the system to handle new imagery with greater accuracy. Using our system, it will be possible to incorporate wider sets of imagery, with greater accuracy, to populate databases (such as WUDAPT) of urban morphology information. This also provides a standardised data set to reproduce our results and allow benchmark comparisons with other sky pixel detection systems.
In conclusion, we present a system of sky pixel identification that shows high accuracy rates with varied and challenging outdoor imagery. This system sits between algorithms that can be quickly set up and run but are not as accurate with challenging datasets (e.g. Middel2018), and more complex systems such as Gong2018 that require a more complex, trained deep learning algorithm. Our adaptive system uses the best elements of each in pursuit of the most accurate results possible.
5 Code availability and licensing
Code and data are available from the corresponding author on request. Data is also provided in the linked Data in Brief article (Nice2019Data) and at https://doi.org/10.5281/zenodo.2562396. Code is also available at https://bitbucket.org/politemadness/skypixeldetection (Nice2019SkyCode) and is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International licence (CC BY-NC-SA 4.0).
The support of the Commonwealth of Australia through the Cooperative Research Centre program is acknowledged. At Monash University, Kerry Nice was funded by the Cooperative Research Centre for Water Sensitive Cities, an Australian Government initiative. At the University of Melbourne, Kerry Nice was funded by the Transport, Health, and Urban Design (THUD) Hub and a Graham Treloar Fellowship for Early Career Researchers.