The UAVid Dataset for Video Semantic Segmentation
Video semantic segmentation has recently been one of the research focuses in computer vision. It serves as a perception foundation for many fields such as robotics and autonomous driving. The fast development of semantic segmentation owes much to large scale datasets, especially for deep learning based methods. Several semantic segmentation datasets for complex urban scenes already exist, such as the Cityscapes and CamVid datasets, and they have become the standard datasets for comparing semantic segmentation methods. In this paper, we introduce a complementary high resolution UAV video semantic segmentation dataset, UAVid. Our UAV dataset consists of 30 video sequences capturing high resolution images. In total, 300 images have been densely labelled with 8 classes for the urban scene understanding task. Our dataset brings out new challenges. We provide several deep learning baseline methods, among which the proposed novel Multi-Scale-Dilation net performs the best via multi-scale feature extraction. We have also explored the usability of sequence data by leveraging a CRF model in both the spatial and temporal domains.
Visual scene understanding, which serves as a perception foundation for many fields such as robotics and autonomous driving, has been advancing in recent years. The most effective and successful methods for scene understanding tasks adopt deep learning as their cornerstone, as it can distil high level semantic knowledge from the training data. The drawback, however, is that deep learning requires a tremendous number of training samples to learn useful knowledge instead of noise, especially for real world applications. Semantic segmentation, as part of scene understanding, assigns a label to each pixel in the image. To make the best of deep learning methods, a large number of densely labelled images are required. At present, only a few public semantic segmentation datasets are available, and each focuses on certain applications. MS COCO [MSCoco] provides a semantic segmentation dataset of common objects in common scenes, and its semantic labelling task focuses on person, car, animal and various stuff classes. The Pascal VOC dataset [PascalVOC] also provides objects such as bus, car, cow and dog for the semantic segmentation task. Other semantic segmentation datasets are designed for street scene object recognition. Their target objects include pedestrians, cars, roads, lanes, traffic lights, trees and other street scene related objects. In particular, CamVid [CamVid] provides continuously labelled driving frames, which can be used for temporal consistency evaluation. The Highway Driving dataset [HighwayDriving] provides 30Hz labels that are even denser in the temporal domain, and it is designed for semantic video segmentation of driving scenes. The Daimler Urban Segmentation dataset [Daimler_Urban_Segmentation] is also a video dataset for street scene understanding, but its labels are sparser in the temporal domain.
The Cityscapes dataset [Cordts2016Cityscapes] focuses more on data variation: it has a much larger number of labelled frames, collected from 50 cities, which brings it closer to real world complexity. Each frame is also much larger than those in CamVid. The newly published Berkeley Deep Drive dataset [deepdrive] has even more image labels with medium image size across multiple street scenes. The KITTI Vision Benchmark Suite [kitti] also provides images of medium size for the task. To help learning models generalize well across different scenes, the ADE20K dataset [ADE20k] spans more diverse scenes and labels objects from many more categories, bringing more variability and complexity to general object representations in images. For the remote sensing community, an aerial image dataset is provided for the ISPRS 2D semantic labelling contest [ISPRSbenchmark14]. All of the datasets above have had a great impact on the development of current state-of-the-art semantic segmentation methods.
Dynamic scene understanding is another interesting topic. There are several video datasets for moving foreground object segmentation, such as the Video Segmentation Benchmark (VSB100) [VSB1, VSB2], the Freiburg-Berkeley Motion Segmentation dataset (MoSeg) [MoSeg1, MoSeg2] and the Densely Annotated VIdeo Segmentation dataset (DAVIS) [DAVIS]. In these datasets, foreground objects are labelled densely in both the spatial and temporal domains. The challenge for continuous foreground segmentation is that the predictions across highly correlated frames should be consistent. Segmenting foreground objects of interest consistently is difficult, but useful for surveillance and monitoring.
At present, most modern visual semantic segmentation tasks use information acquired on the ground. However, another data acquisition platform, the unmanned aerial vehicle (UAV), is being used more and more. Compact and lightweight UAVs are a trend for future data acquisition. UAVs make image acquisition over large areas cheaper and more convenient, which allows quick access to useful information around a certain area. In contrast to satellites, UAVs capture images from the sky with a flexible flying schedule and higher resolution, making it possible to monitor and analyze the landscape swiftly at a specific location and time. These abilities make UAVs an effective means of data collection for various applications.
The fundamental applications for UAVs are surveillance [UAV_Surveillance1, UAV_Surveillance2] and monitoring [UAV_Monitoring] in a target area. They have already been used for smart farming [smartfarm], precision agriculture [Algriculture] and weed monitoring [UAV_monitor_weed]. To make such systems more intelligent, they could rely on techniques like semantic segmentation and video object segmentation. In this respect, the UAV is a great platform for combining the two tasks. These two visual understanding tasks could also be the main foundations for higher level smart applications. As the data from UAVs has its own characteristics, semantic segmentation and video object segmentation tasks using UAV data deserve more attention. There are existing UAV datasets for detection and behaviour analysis [UAV_behavior], but to the best of our knowledge, public datasets for UAV video semantic segmentation do not exist.
In this paper, we present a new high resolution UAV video semantic segmentation dataset, UAVid, which covers semantic segmentation and video object segmentation as a single video semantic segmentation task. In total, 300 images from 30 video sequences are densely labelled with 8 object classes. All the labels are acquired with our in-house video labeller tool. To test the usability of our dataset, several typical deep neural networks (DNNs) designed for image semantic segmentation, together with CRF based video semantic segmentation methods, are evaluated as baselines. In addition, we show that our novel multi-scale-dilation net model is useful for dealing with multi-scale problems in UAV images.
Designing a UAV video dataset requires careful thought about the data acquisition strategy, the UAV flying protocol and the selection of object classes for annotation. The whole process is designed with the usefulness and effectiveness for UAV video semantic segmentation research in mind.
II-A Data Specification
Our data acquisition and annotation methodology is designed for UAV video semantic segmentation in complex scenes, featuring both static and moving object recognition. To capture data that contributes the most to research on UAV scene understanding, the following features of the dataset are taken into consideration.
High resolution. We adopt 4K resolution video recording mode with safe flying height of 30 to 50 meters. In this setting, it is visually clear enough to differentiate most of the objects, and objects that are horizontally far away could also be detected. In addition, it is even possible to detect humans that are not too far away.
Consecutive labelling. As our dataset is designed for video semantic segmentation, it is preferable to label images in sequence so that prediction stability can be evaluated. As it is too expensive to label densely in the temporal domain, we label 10 images at 5 second intervals for each sequence.
Complex and dynamic scenes with diverse objects. Our dataset aims at achieving real world complexity, where there are both static and moving objects. Scenes near streets are chosen for the UAVid dataset as they are complex enough with more dynamic human activities. A variety of objects appear in the scene such as cars, pedestrians, buildings, roads, vegetation, billboards, light poles, traffic lights and so on. We fly UAVs with slant view along the streets or across different street blocks to acquire such scenes.
Data variation. In total, 30 small UAV video sequences are captured in 30 different places to bring variance to the dataset, preventing learning algorithms from overfitting. Data acquisition is done in good weather condition with sufficient illumination. We believe that data acquired in dark environment or other weather conditions like snowing or raining require special processing techniques, which are not the focus of our dataset.
II-B Class Definition and Statistical Analysis
Fully labelling all types of objects in the street scene in a 4K UAV image is very expensive. As a result, only the most common and representative types of objects are labelled in the current version of the dataset. In total, 8 classes are selected for video semantic segmentation: building, road, tree, low vegetation, static car, moving car, human and clutter. Example instances from different classes are shown in Fig. 2.
We deliberately divide the car class into moving car and static car classes. Moving car is a special class designed for moving object segmentation. Other classes can be inferred from their appearance and context, while the moving car class may need additional temporal information in order to be separated properly from the static car class. Achieving high accuracy for both the static and moving car classes is one possible research goal for our dataset.
The number of pixels for each class is reported in Fig. 3. It clearly shows the unbalanced pixel number distribution of the different classes. Most of the pixels are from classes like building, tree, clutter, road and low vegetation. Fewer pixels are from the moving car and static car classes, each of which accounts for less than 2% of the total pixels. The human class is almost zero, at less than 0.2% of the total pixels. A smaller pixel number does not necessarily result from fewer instances, but rather from the size of each instance. A single building can take more than 10k pixels, while a human instance in the image may take fewer than 100 pixels. Normally, classes with too few pixels are ignored in both training and evaluation for the semantic segmentation task [Cordts2016Cityscapes]. However, we believe humans and cars are important classes that should be kept in street scenes rather than being ignored.
II-C Annotation Method
We provide densely labelled fine annotations for high resolution UAV images. All the labels are acquired with our own video labeller tool. Pixel level, super-pixel level and polygon level annotation methods are provided for users. For super-pixel annotation, we adopt the SLIC method [slic] to achieve super-pixel segmentation at 4 different scales, which can be useful for objects with fuzzy boundaries like trees. Polygon annotation is used for regularly shaped objects like buildings, while pixel level annotation serves as a basic annotation method. Our tool also provides video playback around certain frames to help inspect whether certain objects are moving or not. As there might be overlapping objects, we label overlapping pixels with the class that is closer to the camera.
II-D Dataset Splits
All 30 densely labelled video sequences are divided into training, validation and test splits. We do not split the data completely randomly, but in a way that makes each split representative of the variability of different scenes. All three splits should contain all classes. Our data is split at the sequence level, and each sequence comes from a different place. Following this scheme, we get 15 training sequences (150 labelled images) and 5 validation sequences (50 labelled images) for the training and validation splits respectively, whose annotations will be made publicly available. The test split consists of the remaining 10 sequences (100 labelled images), whose labels are withheld for benchmarking purposes. The ratio among the training, validation and test splits is 3:1:2.
III Video Semantic Labelling
The task for UAVid dataset is to predict per-pixel semantic labelling for the UAV video sequences. The original video file for each sequence is provided together with the labelled images.
III-A Tasks and Metrics
The semantic labelling performance is assessed based on the standard IoU metric [PascalVOC]. The goal for this task is to achieve an IoU score that is as high as possible. In the UAVid dataset, the clutter class has a relatively large pixel number ratio and consists of meaningful objects, so it is treated as one class for both training and evaluation rather than being ignored.
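As a minimal sketch of how the per-class IoU and mean IoU can be computed from label maps, the following NumPy code accumulates a confusion matrix and evaluates TP / (TP + FP + FN) per class. The function names and the toy 3-class example are illustrative assumptions and not part of the benchmark's evaluation code.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels per (truth, prediction) pair; rows are truth, columns are prediction."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    idx = y_true * num_classes + y_pred
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def per_class_iou(cm):
    """IoU = TP / (TP + FP + FN) for each class; NaN for classes that never occur."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    denom = tp + fp + fn
    return np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)

# Toy example with 3 hypothetical classes on a 2x3 label map.
truth = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
cm = confusion_matrix(truth, pred, 3)
iou = per_class_iou(cm)
mean_iou = np.nanmean(iou)
print(iou, mean_iou)
```

For a whole split, the confusion matrices of all images would be summed before the IoU is evaluated, so that classes with few pixels per image, such as human, are scored over the full set of labelled pixels.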
III-B Networks for Baselines
To test the usability of our UAVid dataset for the semantic labelling task, we have evaluated the performance of several deep learning models for single image prediction. Although static cars and moving cars cannot be differentiated by their appearance in a single image, it is still possible to predict them based on their context. We start with 3 typical deep fully convolutional neural networks: FCN-8s [FCN8s], Dilation net [dilationNet] and U-Net [UNet].
FCN-8s [FCN8s] has been a good baseline candidate for semantic segmentation. It is a large model with strong and effective feature extraction ability, yet it is simple in structure. It uses a series of simple 3x3 convolutional layers to form the main parts for high level semantic information extraction. This simplicity in structure also makes FCN-8s popular and widely used for semantic segmentation.
Dilation net [dilationNet] has a front end structure similar to that of FCN-8s, but it removes the last two pooling layers in VGG16. Instead, convolutions in all following layers from the conv5 block are dilated by a factor of 2 to compensate for the removed pooling layers. Dilation net also applies a multi-scale context aggregation module at the end, which expands the receptive field to boost prediction performance. The module consists of a series of dilated convolutional layers whose dilation rate gradually grows as the layers go deeper.
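To illustrate why growing dilation rates expand the receptive field so quickly, the sketch below computes the receptive field of a stack of stride-1 convolutions, where each layer with kernel size k and dilation d adds (k - 1) * d pixels. The dilation rates used here are an illustrative assumption and are not taken from this paper.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (in pixels, per axis) of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Hypothetical context module: dilation doubles with depth, then returns to 1.
growing_rates = [1, 1, 2, 4, 8, 16, 1]
plain_rates = [1] * len(growing_rates)

print(receptive_field(growing_rates))  # 67
print(receptive_field(plain_rates))    # 15
```

With the same number of 3x3 layers, the dilated stack sees a 67-pixel window per axis, compared with 15 pixels for undilated layers, without any pooling or loss of resolution.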
U-Net [UNet] is a typical symmetric encoder-decoder network originally designed for segmentation on medical images. The encoder extracts features, which are gradually decoded through the decoder. The features from each convolutional block in the encoder are concatenated to the corresponding convolutional block in the decoder to gradually acquire features of higher and higher resolution for prediction. U-Net is also simple in structure but good at preserving object boundaries.
III-C Multi-Scale-Dilation Net
For a high resolution image captured by a UAV in slant view, the size of objects at different horizontal distances can vary dramatically. Such large scale variation in a UAV image can affect prediction accuracy. In a network, each output pixel in the final prediction layer has a fixed receptive field, which is formed by the pixels in the original image that can affect the final prediction of that output pixel. When the objects are too small, the neural network may learn noise from the background. When the objects are too big, the model may not acquire enough information to infer the label correctly. This is a long standing problem in computer vision. To reduce the effect of such large scale variation, a novel multi-scale-dilation net (MS-Dilation net) is proposed in this paper.
One way to expand the receptive field of a network is to use dilated convolution. Dilated convolution can be implemented in different ways, one of which is to leverage the space to batch operation (S2B) and the batch to space operation (B2S) provided in the TensorFlow API. The space to batch operation outputs a copy of the input tensor where values from the height and width dimensions are moved to the batch dimension. The batch to space operation does the inverse. Applying a standard 2D convolution on the image after S2B is the same as applying a dilated convolution on the original image. A single dilated convolution can be performed as $\text{S2B} \rightarrow \text{convolution} \rightarrow \text{B2S}$. This implementation of dilated convolution is efficient when there is a cascade of dilated convolutions, where intermediate S2B and B2S operations cancel out. For instance, 2 consecutive dilated convolutions with the same dilation rate can be performed as $\text{S2B} \rightarrow \text{convolution} \rightarrow \text{convolution} \rightarrow \text{B2S}$.
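The equivalence between a dilated convolution and a standard convolution wrapped in space to batch and batch to space operations can be checked numerically. The following is a minimal single-channel NumPy sketch; `space_to_batch`, `batch_to_space` and `conv2d_same` are hand-written stand-ins written for this illustration (they are not the TensorFlow operations and need not match their exact memory layout), and the image height and width are assumed to be divisible by the dilation rate.

```python
import numpy as np

def space_to_batch(x, r):
    """(n, h, w) -> (r*r*n, h/r, w/r): pixel (h'*r+i, w'*r+j) goes to sub-image (i, j)."""
    n, h, w = x.shape
    x = x.reshape(n, h // r, r, w // r, r)   # axes: n, h', i, w', j
    x = x.transpose(2, 4, 0, 1, 3)           # axes: i, j, n, h', w'
    return x.reshape(r * r * n, h // r, w // r)

def batch_to_space(y, r):
    """Inverse of space_to_batch."""
    rrn, hs, ws = y.shape
    n = rrn // (r * r)
    y = y.reshape(r, r, n, hs, ws)           # axes: i, j, n, h', w'
    y = y.transpose(2, 3, 0, 4, 1)           # axes: n, h', i, w', j
    return y.reshape(n, hs * r, ws * r)

def conv2d_same(x, k, dilation=1):
    """Zero-padded, stride-1 cross-correlation of each (h, w) image with an odd-sized kernel k."""
    kh, kw = k.shape
    ph, pw = dilation * (kh // 2), dilation * (kw // 2)
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    h, w = x.shape[1:]
    out = np.zeros(x.shape, dtype=float)
    for a in range(kh):
        for b in range(kw):
            out += k[a, b] * xp[:, a * dilation:a * dilation + h,
                                b * dilation:b * dilation + w]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 12))
k1, k2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
r = 2

# Single dilated convolution: direct versus S2B -> convolution -> B2S.
direct = conv2d_same(x, k1, dilation=r)
via_s2b = batch_to_space(conv2d_same(space_to_batch(x, r), k1), r)

# Cascade of two dilated convolutions: only one S2B and one B2S are needed.
direct2 = conv2d_same(direct, k2, dilation=r)
cascade = batch_to_space(conv2d_same(conv2d_same(space_to_batch(x, r), k1), k2), r)

print(np.allclose(direct, via_s2b), np.allclose(direct2, cascade))
```

The cascade case shows why the approach is efficient: the rearrangement cost is paid once at each end, however many dilated layers sit in between.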
By utilizing space to batch operation and batch to space operation, semantic segmentation can be done in different scales. In total, three streams are created for three scales as shown in Fig. 4. For each stream, a modified FCN-8s is used as the main structure, where the depth for each convolutional block is reduced due to the memory limitation. Here, filter depth is sacrificed for more scales. To reduce detail loss in feature extraction, the pooling layer in the fifth convolutional block is removed to keep a smaller receptive field. Instead, features with larger receptive field from other streams are concatenated to higher resolution features through skip connection in conv7 layers. Note that these skip connections need batch to space operation to retain spatial and batch number alignment. In this way, each stream handles feature extraction in its own scale and features from larger scales are aggregated to boost prediction for higher resolution streams.
Multiple scales may also be achieved by down sampling images directly [pyramid_images]. However, there are 3 advantages for our multi-scale processing. First, every pixel is assigned to one batch in space to batch operation and all the labelled pixels shall be used for each scale with no waste. Second, there is strict alignment between image and label pairs in each scale as there is no mixture of image pixels or mixture of label pixels. Finally, the concatenated features in the conv7 layer are also strictly aligned.
For each scale, corresponding ground truth labels can also be generated through the space to batch operation, in the same way as the input images for the different streams. With ground truth labels for each scale, deeply supervised training can be done. The losses in the three scales are all cross entropy losses. The loss in stream1 is the target loss, while the losses in stream2 and stream3 are auxiliary losses, which we call the deep supervision losses. The final loss to be optimized is the weighted mean of the three losses, shown in the equation below.

$$L = \frac{\sum_{i=1}^{3} w_i \, CE_i}{\sum_{i=1}^{3} w_i}, \qquad CE_i = -\frac{1}{m_i} \sum_{n} \sum_{t} p_t^n \cdot \log q_t^n$$

$w_1, w_2, w_3$ are the loss weights of the three streams, and $m_1, m_2, m_3$ are the numbers of pixels of an image in each stream. $n$ is the batch index and $t$ is the pixel index. $p$ is the target probability distribution of a pixel, while $q$ is the predicted probability distribution.
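The weighted mean of the per-stream cross entropy losses described above can be sketched in NumPy as follows. This is an illustration under stated assumptions: the function names are hypothetical, each cross entropy is averaged over all pixels of its stream (the exact normalisation used by the authors may differ), the default weights are the ones reported for training from scratch, and `subsample_labels` reproduces the pixel grouping of a space to batch operation with strided slicing.

```python
import numpy as np

def cross_entropy(target, pred, eps=1e-12):
    """Mean over pixels of -sum_c p_c * log(q_c); the last axis holds the classes."""
    return float(np.mean(-np.sum(target * np.log(pred + eps), axis=-1)))

def deep_supervision_loss(targets, preds, weights=(1.8, 0.8, 0.4)):
    """Weighted mean of the per-stream cross entropies (stream1 first)."""
    ce = [cross_entropy(t, q) for t, q in zip(targets, preds)]
    w = np.asarray(weights, dtype=float)
    return float(np.dot(w, ce) / w.sum()), ce

def subsample_labels(onehot, r):
    """Move every r-th pixel into its own batch entry: (n, h, w, c) -> (r*r*n, h/r, w/r, c)."""
    return np.concatenate(
        [onehot[:, i::r, j::r] for i in range(r) for j in range(r)], axis=0)

# Toy data: one 8x8 label map with 8 classes and uniform predictions in every stream.
num_classes = 8
label_ids = np.random.default_rng(0).integers(0, num_classes, size=(1, 8, 8))
labels = np.eye(num_classes)[label_ids]
targets = [labels, subsample_labels(labels, 2), subsample_labels(labels, 4)]
preds = [np.full(t.shape, 1.0 / num_classes) for t in targets]

loss, per_stream = deep_supervision_loss(targets, preds)
print(loss, per_stream)
```

Because every labelled pixel lands in exactly one sub-image, the three targets contain the same labels rearranged, which is what allows all streams to be supervised without interpolating the ground truth.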
It is also interesting to note that every layer becomes a dilated version of itself in stream2 and stream3. In particular, the pooling layers and transposed convolutional layers turn into dilated pooling layers and dilated transposed convolutional layers respectively. Compared to layers in stream1, layers in stream2 are dilated by a rate of 2 and layers in stream3 are dilated by a rate of 4. These 3 streams together form the MS-Dilation net.
III-D Fine-tune Pre-trained Networks
Due to the limited size of our UAVid dataset, training from scratch may not be enough for the networks to learn diverse features for better label prediction. Pre-training a network has proved very useful on various benchmarks [pretrain_isprs, pretrain_davis, pretrain_cityscape, pretrain_ade20k], boosting performance by utilizing more data from other datasets. To reduce the effect of limited training samples, we also explore how much pre-training a network can boost the score for the UAVid semantic labelling task. We pre-train all the networks on the Cityscapes dataset [Cordts2016Cityscapes], which comprises many more images for training.
III-E Video Semantic Segmentation
For video semantic labelling task, it is ideal to output prediction consistently for the same objects observed in multiple different images. Taking advantage of temporal information effectively is valuable for video sequence label prediction. Normally, deep neural networks trained on individual images cannot provide completely consistent predictions spanning several frames. However, different frames provide observations from different viewing positions, through which multiple clues can be collected for object prediction. To utilize temporal information in UAVid dataset, we adopt feature space optimization(FSO) [FSO] method for sequence data prediction. It smooths the final label prediction for the whole sequence by applying 3D CRF covering both spatial and temporal domain. It is the optical flows and tracks in the method that link the images in temporal domain.
Our experiments are divided into 3 parts. Firstly, we compare semantic segmentation results obtained by training deep neural networks from scratch. These results serve as the basic baselines. Secondly, we analyse how pre-trained models can be useful for the UAVid semantic labelling task by fine-tuning deep neural networks that are pre-trained on the Cityscapes dataset [Cordts2016Cityscapes]. Finally, we explore the influence of spatio-temporal regularization by using video sequence data for semantic video segmentation.
It should be noted that the resolution of our UAV images is quite high. The size of each image is 4096×2160 or 3840×2160, which requires too much GPU memory for intermediate feature storage in deep neural networks. As a result, we clip each UAV image into 9 evenly distributed, overlapping smaller images that together cover the whole image for training. Each clipped image is of size 2048×1024. We keep such a moderate image size in order to reduce the ratio between the zero padding area and the valid image area. A bigger image size also resembles a larger batch size if each pixel is taken as a training sample.
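One way to obtain 9 evenly distributed, overlapping crops is to place them on a 3×3 grid whose first and last tiles touch the image borders. The sketch below assumes such a grid (the exact layout is not specified in the text), and `tile_origins` and `crop_boxes` are hypothetical helper names.

```python
import numpy as np

def tile_origins(size, tile, count):
    """Evenly spaced start offsets so that `count` tiles of length `tile` span `size`."""
    return [int(v) for v in np.linspace(0, size - tile, count).round()]

def crop_boxes(width, height, tile_w=2048, tile_h=1024, grid=(3, 3)):
    """Return (left, top, right, bottom) boxes in row-major order."""
    xs = tile_origins(width, tile_w, grid[0])
    ys = tile_origins(height, tile_h, grid[1])
    return [(x, y, x + tile_w, y + tile_h) for y in ys for x in xs]

for width, height in [(4096, 2160), (3840, 2160)]:
    boxes = crop_boxes(width, height)
    covered = np.zeros((height, width), dtype=bool)
    for left, top, right, bottom in boxes:
        covered[top:bottom, left:right] = True
    print(width, height, len(boxes), boxes[0], boxes[-1], bool(covered.all()))
```

Under this assumption, neighbouring crops overlap by roughly half of their extent in both directions, so no crop needs zero padding at the image border.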
| Model | Building | Tree | Clutter | Road | Low Vegetation | Static Car | Moving Car | Human | Mean IoU |

| Method | Building | Tree | Clutter | Road | Low Vegetation | Static Car | Moving Car | Human | Mean IoU |
|---|---|---|---|---|---|---|---|---|---|
| fine-tune w/o deep supervision | 78.5 | 72.2 | 44.0 | 65.3 | 43.5 | 17.4 | 51.5 | 1.2 | 46.7 |
| fine-tune w deep supervision | 79.2 | 72.5 | 44.8 | 64.6 | 44.3 | 17.0 | 52.8 | 3.4 | 47.3 |
| fine-tune w+w/o deep supervision | 79.4 | 73.1 | 43.7 | 65.5 | 45.3 | 21.3 | 55.8 | 6.3 | 48.8 |
IV-A Train from Scratch
To have a fair comparison among the different deep neural networks, we re-implement all the networks in TensorFlow [tensorflow], and all networks are trained on an NVIDIA Titan X GPU. To fit the networks into 12 GB of GPU memory, the depth of some layers in Dilation net, U-Net and MS-Dilation net is reduced. The model configuration details of the different networks are shown in Fig. 5 in the appendix.
The neural networks share similar hyper-parameters for training from scratch. All models are trained with Adam optimizer for 27K iterations(20 epochs). The base learning rate is set to exponentially decaying to . Weight decay for all weights in convolutional kernels is set to . Training is done with one image per batch. For data augmentation in training, we apply random left and right flip. We also apply a series of color augmentation, including random hue operation, random contrast operation, random brightness operation, random saturation operation.
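The iteration count quoted above is consistent with the dataset figures given elsewhere in the paper, as the small check below shows. It assumes that every one of the 9 crops of each of the 150 labelled training images is visited once per epoch; `iterations` is a hypothetical helper.

```python
def iterations(num_images, crops_per_image, epochs, batch_size=1):
    """Total optimizer steps when each crop is seen once per epoch."""
    steps_per_epoch = (num_images * crops_per_image) // batch_size
    return steps_per_epoch * epochs

# 150 training images, 9 crops each, 20 epochs, one crop per batch.
print(iterations(150, 9, 20))  # 27000
```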
Deep supervision losses are used for our MS-Dilation net. The loss weights for three streams are 1.8, 0.8 and 0.4 respectively. The loss weights for stream2 and stream3 are set smaller than stream1 as the main goal is to minimize the loss in stream1. For Dilation net, basic context aggregation module is used and initialized as it is in [dilationNet]. All networks are trained end-to-end and their mean IoU scores are reported in percentage as shown in Tab. I.
All four networks are better at discriminating the building, road and tree classes, achieving IoU scores higher than 50%. The scores for the car, vegetation and clutter classes are relatively lower. All four networks completely fail to discriminate the human class. Normally, classes with larger pixel numbers have relatively higher IoU scores. However, the IoU score for the moving car class is much higher than that for the static car class, even though the two classes have similar pixel numbers. The reason may be that static cars appear in various contexts such as parking lots, garages and sidewalks, or are partially occluded by trees, while moving cars normally drive in the middle of the road with a very clear view.
Our model achieves the best mean IoU score and the best IoU score for most of the classes among the four networks. It shows the effectiveness of multi-scale feature extraction.
IV-B Fine-tune Pre-trained Models
For fine-tuning pre-trained networks, all the networks are pre-trained with cityscapes dataset [Cordts2016Cityscapes]. Finely annotated data from both training and validation splits are used, that is 3,450 densely labelled images in total. Hyper-parameters and data augmentation are set the same as they are in section IV-A, except that the iteration is set to 52K. Next, all the networks are fine-tuned with data from UAVid dataset. As there is still large heterogeneity between these two datasets, all layers are trained for all networks. We only initialize feature extraction parts of the networks with pre-trained models, while the prediction parts are initialized the same as training from scratch. The learning rate is set to exponentially decaying to for FCN-8s, and exponentially decaying to for other 3 networks as they are easily stuck at local minimum with initial learning rate to be during training. The rest of the hyper-parameters are set the same as training from scratch. The performance is also shown in Tab. I.
To find out whether deep supervision losses are important, we have fine-tuned MS-Dilation net with 3 different deep supervision plans. For the first plan, we fine-tune MS-Dilation net without deep supervision losses for 30 epochs by setting loss weights to 0 in stream2 and stream3. For the second plan, we fine-tune MS-Dilation net with deep supervision losses for 30 epochs. For the final plan, we fine-tune MS-Dilation net with deep supervision losses for 20 epochs and without deep supervision losses for another 10 epochs. The IoU scores for three plans are shown in Tab. II. As it is shown, the best mean IoU score is achieved by the third plan. The better result for MS-Dilation net+PRT in Tab. I is achieved by fine-tuning 20 epochs without deep supervision losses after fine-tuning 20 epochs with deep supervision losses.
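The three fine-tuning plans differ only in when the auxiliary loss weights are switched off, which can be written as a small schedule. This is a sketch under assumptions: the plan names and the `stream_weights` helper are hypothetical, epochs are counted from 0, and the base weights are the ones reported for training from scratch.

```python
BASE_WEIGHTS = (1.8, 0.8, 0.4)  # stream1, stream2, stream3

def stream_weights(plan, epoch):
    """Loss weights for a given 0-based fine-tuning epoch (30 epochs per plan)."""
    main_only = (BASE_WEIGHTS[0], 0.0, 0.0)
    if plan == "without":
        return main_only
    if plan == "with":
        return BASE_WEIGHTS
    if plan == "with_then_without":
        # 20 epochs with deep supervision, then 10 epochs without.
        return BASE_WEIGHTS if epoch < 20 else main_only
    raise ValueError(f"unknown plan: {plan}")

for plan in ("without", "with", "with_then_without"):
    print(plan, stream_weights(plan, 0), stream_weights(plan, 29))
```

Since the final loss is a weighted mean, setting the stream2 and stream3 weights to 0 leaves the stream1 cross entropy as the only term being optimized.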
Clearly, deep supervision losses are very important for MS-Dilation net. However, neither purely fine-tuning the MS-Dilation net with deep supervision losses nor without achieves the best score. It is the combination of these two fine-tuning processes that brings the best score. Deep supervision losses are important as they can guide the multi-scale feature learning process, but the network needs to be further fine-tuned without deep supervision losses to get the best multi-scale filters for prediction.
By fine-tuning the pre-trained models, the performance boost is huge for all networks across all classes except human class. The networks still struggle to differentiate human class. Nevertheless, the improvement is evident for MS-Dilation net with 8% improvement. Decoupling the filters with different scales can be very beneficial when objects appear in large scale difference.
IV-C Video Semantic Segmentation
For video semantic segmentation, we apply the methods used in feature space optimization (FSO) [FSO]. As FSO processes a block of images simultaneously, 5 consecutive frames at an interval of 15 frames (0.5s-0.7s gap) are extracted from the provided video files. They form a block spanning 2s to 3s, and the test image is located at the center of each block. The gap between consecutive frames is kept small so that good flow extraction is possible. A longer sequence would give longer temporal regularization, but due to memory limitations, it is not possible to support more than 5 images in 30 GB of memory without sacrificing the image size.
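The block construction described above amounts to picking frame indices symmetrically around the labelled test frame. A minimal sketch, with a hypothetical helper name and an assumed frame rate of 30 frames per second for the duration check:

```python
def block_frame_indices(center, num_frames=5, step=15):
    """Indices of `num_frames` frames spaced `step` apart with `center` in the middle."""
    half = num_frames // 2
    return [center + (i - half) * step for i in range(num_frames)]

frames = block_frame_indices(300)
print(frames)  # [270, 285, 300, 315, 330]

# Time covered by the block at an assumed 30 frames per second.
span_seconds = (frames[-1] - frames[0]) / 30.0
print(span_seconds)  # 2.0
```

At lower frame rates the same 60-frame span lasts longer, which matches the 2s to 3s range quoted in the text.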
The FSO process in each block requires several ingredients. The contour strength for each image is calculated according to [edge]. The unary for each image is set to the softmax layer output of each fine-tuned network. Forward flows and backward flows are calculated according to [LDOF1, LDOF2]. As computing optical flow at the original image scale is extremely slow, the images to be processed are downsized by a factor of 8 in both width and height, and the final flows at the original scale are obtained through bicubic interpolation and magnification. Then, point trajectories can be calculated according to [track] with the forward and backward flows. Finally, a dense 3D CRF is applied after feature space optimization as described in [FSO].
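Bringing a flow field computed at 1/8 resolution back to the original scale involves two steps: resampling the field onto the larger grid and multiplying the displacement vectors by the same factor, since a shift of one low resolution pixel corresponds to 8 full resolution pixels. The sketch below illustrates this with nearest-neighbour upsampling, used here only as a dependency-free stand-in for the bicubic interpolation mentioned in the text; `upscale_flow` is a hypothetical helper.

```python
import numpy as np

def upscale_flow(flow, factor=8):
    """flow: (h, w, 2) displacements in low-res pixels -> (h*factor, w*factor, 2) in full-res pixels."""
    # Resample onto the larger grid (nearest neighbour stands in for bicubic here).
    up = np.repeat(np.repeat(flow, factor, axis=0), factor, axis=1)
    # Magnify the displacement vectors so they are expressed in full-res pixels.
    return up * factor

low_res_flow = np.ones((2, 3, 2))          # every pixel moves by (1, 1) at low resolution
full_res_flow = upscale_flow(low_res_flow)
print(full_res_flow.shape, full_res_flow[0, 0])
```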
The IoU scores for FSO results with unaries from the different fine-tuned networks are reported in Tab. I. For each model, the mean IoU score and the IoU score of each individual class improve, except for the human and moving car classes. FSO favors classes whose instances cover more image pixels: the IoU score improves less for classes with smaller instances such as static car, and it drops for the moving car and human classes. The human class IoU score for MS-Dilation net drops by a large margin, nearly 4%.
V Conclusion and Outlook
In this paper, we present the new UAVid dataset to advance the development of video semantic segmentation. It captures complex street scenes in slant view with very high resolution videos. Classes for the video semantic labelling task have been defined and labelled. The usability of our UAVid dataset has also been demonstrated with several deep convolutional neural networks, among which the proposed Multi-Scale-Dilation net performs the best via multi-scale feature extraction. It has also been shown that pre-training the network is beneficial for all classes in the UAVid semantic labelling task. In the future, we will continue to collect new UAV video data, which will be labelled densely in the temporal domain. We will extend the labelling from the current classes to more classes including window, door, balcony, etc. The benchmark, together with our labelling tool, will be published online.