Primary Object Segmentation in Aerial Videos via Hierarchical Temporal Slicing and Co-Segmentation

Pengcheng Yuan, Jia Li, Daxin Gu, Yonghong Tian
State Key Laboratory of Virtual Reality Technology and Systems, SCSE, Beihang University; International Research Institute for Multidisciplinary Science, Beihang University; School of Electronics Engineering and Computer Science, Peking University; Cooperative Medianet Innovation Center, China

Primary object segmentation plays an important role in understanding videos generated by unmanned aerial vehicles. In this paper, we propose a large-scale dataset APD with 500 aerial videos, in which the primary objects are manually annotated on 5,014 sparsely sampled frames. To the best of our knowledge, it is the largest dataset to date for the task of primary object segmentation in aerial videos. From this dataset, we find that most aerial videos contain large-scale scenes, small-sized primary objects as well as consistently varying scales and viewpoints. Inspired by this, we propose a novel hierarchical temporal slicing approach that repeatedly divides a video into two sub-videos formed by the odd and even frames, respectively. In this manner, an aerial video can be represented by a set of hierarchically organized short video clips, and the primary objects they share can be segmented by training end-to-end co-segmentation CNNs and finally refined within the neighborhood reversible flows. Experimental results show that our approach remarkably outperforms 24 state-of-the-art methods in segmenting primary objects in various types of aerial videos.

Primary object segmentation, unmanned aerial vehicles, video co-segmentation, CNNs
*Corresponding author: Jia Li.

1. Introduction

Figure 1. Representative challenging scenarios in aerial videos. Frames and ground-truth (GT) masks are taken from our dataset APD that cover challenging cases of (a) large-scale scenes, (b) small-sized primary objects, (c) scale variation, (d) viewpoint variation. We also demonstrate the results of state-of-the-art image-based and video-based models for salient/primary object segmentation, including DHSNet (Liu and Han, 2016), DSS (Hou et al., 2017), FST (Papazoglou and Ferrari, 2013), NRF (Li et al., 2017) and our HTS approach.

Recently, unmanned aerial vehicles (drones) have become very popular since they provide a new way to observe and explore the world. As a result, aerial videos generated by drones have been growing explosively. For these videos, one of the key tasks is to segment the primary objects, which can be used to facilitate subsequent tasks such as event understanding (Shu et al., 2015), scene reconstruction (Mancini et al., 2013), drone navigation (Zhang et al., 2010) and visual tracking (Cavaliere et al., 2017).

Figure 2. Representative video frames from the APD and their ground-truth masks of primary objects. The 500 videos in APD are divided into five categories according to the types of primary objects they contain, including (a) APD-Human (95 videos), (b) APD-Building (121 videos), (c) APD-Vehicle (56 videos), (d) APD-Boat (180 videos) and (e) APD-Other (48 videos).

Hundreds of models have been proposed in the past decade to segment primary objects, which can be roughly divided into two categories. The first category contains image-based models that focus on detecting salient (primary) objects in images. In this category, classic models such as (Cheng et al., 2015; Zhang et al., 2015; Tu et al., 2016) focus on designing rules to pop-out salient targets and suppress distractors, while recent models (Chen et al., 2017; Hou et al., 2017; Liu and Han, 2016; Wang et al., 2015a) usually adopt the deep learning framework due to the availability of large-scale image datasets like DUT-OMRON (Yang et al., 2013), MSRA-B (Liu et al., 2011; Jiang et al., 2013; Achanta et al., 2009) and XPIE (Xia et al., 2017). The second category contains video-based models that aim to segment a sequence of primary/foreground objects that consistently pop-out in the whole video. Similar to the image-based category, classic video-based models such as (Papazoglou and Ferrari, 2013; Wang et al., 2015c; Rahtu et al., 2010; Wang et al., 2015b) also design rules to segment primary objects by jointly considering the per-frame accuracy and inter-frame consistency. Recently, with the presence of large-scale video datasets like VOS (Li et al., 2018) and DAVIS (Perazzi et al., 2016), several deep learning models have been proposed as well. For example, Li et al. (Li et al., 2017) adopted complementary CNNs to initialize per-frame predictions of primary objects, which were then refined along neighborhood reversible flows that reflected the most reliable temporal correspondences between far-away frames. Caelles et al. (Caelles et al., 2017) proposed a semi-supervised video object segmentation approach that adopted CNNs to transfer generic semantic information to the task of foreground segmentation. In (Li et al., 2018), saliency-guided stacked autoencoders were adopted to encode multiple saliency cues into spatiotemporal saliency scores for video object segmentation. 
In addition, some video object co-segmentation models (Fu et al., 2015; Zhang et al., 2014) have been proposed as well to simultaneously segment a common category of objects from two or more videos. For example, Fu et al. (Fu et al., 2015) proposed a co-selection graph to formulate correspondences between different videos, and extended this framework to handle multiple objects using a multi-state selection graph model.

Generally speaking, most existing models from the two categories can perform impressively on generic images and videos taken on the ground. However, their capability in processing aerial videos, which often contain large-scale scenes, small-sized primary objects as well as consistently varying scales and viewpoints, may not be satisfactory (see Fig. 1 for some representative examples). The main reasons are twofold: 1) the heuristic rules and learning frameworks may not perfectly fit the characteristics of aerial videos, and 2) there is a lack of large-scale aerial video datasets for model training and benchmarking. Toward this end, this paper proposes a large-scale dataset APD with 500 aerial videos (76,221 frames). Based on the types of primary objects, these videos can be divided into five subsets, including humans, buildings, vehicles, boats and others. From these videos, 5,014 frames are sparsely sampled, in which the primary objects are manually annotated (see Fig. 2 for representative frames and their ground-truth masks).

Based on the aerial video dataset APD, we propose a hierarchical temporal slicing approach for primary object segmentation in aerial videos. We first divide a long aerial video into two sub-videos formed by the odd and even frames, respectively. By repeatedly conducting such temporal slicing operations to the sub-videos, a long video can be represented by a set of hierarchically organized sub-videos. As a result, the object segmentation problem in a long aerial video can be resolved by co-segmenting the objects shared by much shorter sub-videos. By learning end-to-end CNNs for co-segmenting two frames, a mask can be initialized for each frame by co-segmenting frames from sub-videos that have the same parent node in the hierarchy. These masks are then refined within the neighborhood reversible flows so that the primary video objects can consistently pop-out in the video. Extensive experimental results on the proposed aerial dataset as well as a generic video dataset show that our approach is efficient and outperforms 24 state-of-the-art models, including 11 image-based non-deep models, 8 image-based deep models and 5 video-based models. The results also show that APD is a very challenging dataset for existing object segmentation models.

The contributions of this work are summarized as follows: 1) we propose a dataset for primary object segmentation in aerial videos, which, to the best of our knowledge, is currently the largest one. This dataset can be used to further investigate the problem of primary video object segmentation from a completely new perspective; 2) we propose a hierarchical temporal slicing framework that can efficiently and accurately segment primary objects in aerial videos with co-segmentation CNNs; and 3) we provide a benchmark of our approach and massive state-of-the-art models on the proposed dataset, and such benchmarking results can be re-used by subsequent works to facilitate the development of new models.

Dataset  #Video  Max Res.  #Frames  #MaxF  #MinF  #Annot  #Avg-Obj  Avg-Area (%)
SegTrack (Tsai et al., 2010)  6    244  71  21  244    
SegTrack V2 (Li et al., 2013a)  14    1,065  279  21  1,065    
FBMS (Brox and Malik, 2010)  50    13,860  800  19  720    
DAVIS (Perazzi et al., 2016)  90    3,455  104  25  3,455    
ViSal (Wang et al., 2015c)  17    963  100  30  193    
VOS (Li et al., 2018)  200    116,103  2,249  71  7,467    
APD-Human  95    14,638  271  61  966    
APD-Building  121    17,749  271  31  1,170    
APD-Vehicle  56    7,665  280  31  505    
APD-Boat  180    28,085  280  16  1,851    
APD-Other  48    8,084  284  86  522    
APD  500    76,221  284  16  5,014    
Table 1. Comparison between our dataset and another six generic video object segmentation datasets. #Annot: the number of annotated frames; #MaxF and #MinF: the max and min numbers of frames in a video; #Avg-Obj: the average number of objects per video; Avg-Area: the average area of primary objects per video.

2. APD: A Dataset for Primary Object Segmentation in Aerial Video

Towards primary object segmentation in aerial videos, we construct a large-scale video dataset for model training and benchmarking, denoted as APD. The construction process and dataset statistics are described as follows.

Dataset Construction. In constructing the dataset, we first collect 2,402 long aerial videos (107 hours in total) shared on the Internet. After that, we manually divide the long videos into 52,712 shots and remove shots that are unlikely to be taken by drones or contain no obvious primary objects (determined through voting by three volunteers). In this way, we obtain 21,395 video clips, from which we randomly sample 500 clips for the subsequent annotation process (76,221 frames in total). According to the types of primary objects, these videos are further divided into five subsets, including humans (95 videos), buildings (121 videos), vehicles (56 videos), boats (180 videos) and others (48 videos). From these videos, we uniformly sample one keyframe out of every 15 frames and manually annotate the resulting 5,090 keyframes.

The annotation process is conducted by three annotators with the LabelMe toolbox (Russell et al., 2008). Each annotator is requested to first watch the videos to obtain an initial impression of what the primary video objects are. Based on this impression, they then annotate the primary objects in the sparsely sampled keyframes with polygons. After that, the annotation quality of each frame is independently assessed by two other subjects. Flawed annotations are then corrected by the three annotators through majority voting, while frames with confusing annotations are discarded. Finally, we obtain 5,014 binary masks that indicate the location of primary video objects in keyframes.

Dataset Statistics. To demonstrate the major characteristics of APD, we show the detailed statistics of APD and its five subsets in Table 1. In addition, to facilitate the comparison between APD and previous datasets, we also show the information of six representative video datasets with massive ground-based videos for primary semantic/salient object segmentation, including SegTrack (Tsai et al., 2010), SegTrack V2 (Li et al., 2013a), FBMS (Brox and Malik, 2010), DAVIS (Perazzi et al., 2016), ViSal (Wang et al., 2015c) and VOS (Li et al., 2018).

Figure 3. Average annotation maps of APD and its subsets.

As shown in Table 1, the primary objects in APD are remarkably smaller than those in most previous datasets (except SegTrack, which is a small dataset with only one object annotated per frame). Such small-sized objects make the segmentation of primary objects very difficult. In particular, the primary objects in the human and vehicle categories are even smaller than those in the other three categories. Considering that many ground-based approaches already exist for the detection, segmentation and recognition of humans and vehicles, the APD dataset provides an opportunity to explore transferring ground-based knowledge of humans and vehicles to aerial videos. Moreover, the number of videos in APD is larger than in most previous datasets, and all these videos come from five clearly defined object categories. In this sense, it is possible to directly train video-based deep learning models on APD with less risk of over-fitting.

Beyond the quantitative statistics, we also show the average annotation maps of APD and its subsets in Fig. 3. As stated in (Li et al., 2018), an average annotation map is computed by resizing all annotated masks to the same resolution, summing them pixel by pixel, and normalizing the cumulative map to a maximum value of 1. From Fig. 3, we find that the distribution of primary objects has a strong center-bias tendency, implying that many rules and models for generic primary/salient object segmentation can be re-used for segmenting primary objects in aerial videos (e.g., the boundary prior (Zhang et al., 2015)). Moreover, the degrees of center-bias in the five subsets differ from each other, indicating that there may exist several different ways to optimally segment primary objects in aerial videos if their semantic attributes are known or predictable.
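The average annotation map described above can be sketched in a few lines. The target resolution (64×64) and the nearest-neighbor resizing are illustrative choices, not taken from the paper:

```python
import numpy as np

def nn_resize(mask, h, w):
    """Nearest-neighbor resize of a mask to (h, w)."""
    rows = (np.arange(h) * mask.shape[0] / h).astype(int)
    cols = (np.arange(w) * mask.shape[1] / w).astype(int)
    return mask[rows][:, cols]

def average_annotation_map(masks, h=64, w=64):
    """Resize all masks to a common resolution, sum them pixel by
    pixel, and normalize the cumulative map to a maximum of 1."""
    acc = np.zeros((h, w), dtype=np.float64)
    for m in masks:
        acc += nn_resize(m.astype(np.float64), h, w)
    return acc / acc.max() if acc.max() > 0 else acc

# Two toy masks of different resolutions with overlapping centers:
# the overlap region peaks at 1 after normalization.
m1 = np.zeros((10, 10)); m1[3:7, 3:7] = 1
m2 = np.zeros((20, 20)); m2[8:14, 8:14] = 1
amap = average_annotation_map([m1, m2], 64, 64)
```

A strong center bias then shows up as large values concentrated near the middle of the map.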

3. A Hierarchical Temporal Slicing Approach for Primary Object Segmentation in Aerial Videos

Large-scale scenes, small-sized objects and consistently varying scales and viewpoints make the segmentation of primary objects in aerial videos very challenging. Fortunately, we find that primary objects last for a long period in the majority of aerial video sequences, which may be caused by the fact that aerial videos usually have less or slower camera motion. Inspired by this fact, we propose a novel approach that addresses primary video object segmentation in aerial videos by turning a complex task into several simple ones. The framework of our approach is shown in Fig. 4, which consists of three major stages: 1) hierarchical temporal slicing of aerial videos, 2) mask initialization via video object co-segmentation and 3) mask refinement within neighborhood reversible flows. Details of these three stages are described as follows.

3.1. Stage 1: Hierarchical Temporal Slicing of Aerial Videos

In the first stage, we divide a long aerial video into two sub-videos formed by the odd and even frames, respectively. In this manner, the content similarity between the two sub-videos is maximally preserved. By repeatedly conducting such temporal slicing operations on all sub-videos, a hierarchy of short video clips can be efficiently constructed. Assuming that primary objects last for at least $2^D$ frames in an aerial video, we can build a tree structure with a depth of $D$ and $2^{D+1}-1$ nodes. Here we empirically set $D=5$. The short video clips at each leaf node then have at least one frame that contains the primary objects. As a result, primary objects in the original video can be segmented by solving a set of simpler tasks: co-segmenting the objects shared by many much shorter video clips.
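A minimal sketch of the odd-even slicing (frame indices stand in for frames; the depth used here is illustrative):

```python
def slice_video(frames, depth):
    """Recursively split a frame sequence into odd/even sub-videos.

    Returns the leaves of the slicing tree: 2**depth short clips.
    """
    if depth == 0:
        return [frames]
    odd, even = frames[0::2], frames[1::2]
    return slice_video(odd, depth - 1) + slice_video(even, depth - 1)

# A 32-frame video sliced to depth 2 yields 4 clips of 8 frames each;
# sliced to depth 5 it would yield 32 single-frame leaves.
clips = slice_video(list(range(32)), 2)
```

Each leaf clip subsamples the video at a regular stride, so frames within a clip span the whole original timeline rather than one short segment.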

Figure 4. Framework of our approach. In this framework, videos are first hierarchically divided into much shorter video clips, which are then co-segmented to initialize per-frame masks. Primary objects are then segmented via mask refinement within neighborhood reversible flows.
Figure 5. The network architecture of CoSegNet. The symbol $(k, d)$ indicates kernel size $k$ and dilation parameter $d$.

3.2. Stage 2: Mask Initialization via Video Object Co-segmentation

In the second stage, we aim to initialize a mask of primary objects for each video frame by co-segmenting the objects shared by the short video clips at leaf nodes. To speed up this process, the co-segmentation is conducted only between two sub-videos that have the same parent node. Let $\mathcal{U}=\{u_i\}_{i=1}^{N_u}$ and $\mathcal{V}=\{v_j\}_{j=1}^{N_v}$ be two short video clips, where $N_u$ and $N_v$ denote the numbers of frames in $\mathcal{U}$ and $\mathcal{V}$, respectively. For these two short videos, we assume that there exists a model $\phi$ that can segment the objects shared by the $i$th frame of $\mathcal{U}$ and the $j$th frame of $\mathcal{V}$:

$(\mathbf{P}^u_{ij}, \mathbf{P}^v_{ji}) = \phi(u_i, v_j),$   (1)

where $\mathbf{P}^u_{ij}$ is a probability map for the frame $u_i$ that depicts the objects it shares with the frame $v_j$. By co-segmenting all frame pairs between $\mathcal{U}$ and $\mathcal{V}$, the mask of primary objects for a frame $u_i$ can be initialized as the per-pixel average of all such co-segmentation results with respect to all frames from $\mathcal{V}$:

$\mathbf{P}^u_i = \frac{1}{N_v}\sum_{j=1}^{N_v}\mathbf{P}^u_{ij}.$   (2)
For the probability map produced by (2), we can see that each frame is actually co-segmented with multiple non-adjacent frames with increasing temporal distances. The advantages of such co-segmentation between far-away frames are at least fourfold. First, far-away frames can provide more useful cues of the primary objects in the co-segmentation process than adjacent frames, which are full of redundant visual stimuli. In other words, far-away frames form a global picture of what the primary video object is. Second, most co-segmentation operations can pop-out primary objects since they appear in a large portion of video frames; thus the primary objects can be repeatedly enhanced through the additive fusion in (2). Third, the hierarchical framework ensures that each frame can be co-segmented with at least one frame with the same primary objects. Last but not least, the computational cost of co-segmenting frame pairs from two short videos is remarkably smaller than that from two long videos, which improves the efficiency of the proposed approach.
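The mask initialization of (2) reduces to a per-pixel average of pairwise co-segmentation results. A minimal sketch, with a toy stand-in for the co-segmentation CNN:

```python
import numpy as np

def initialize_masks(clip_u, clip_v, cosegment):
    """Initialize a mask for every frame of clip_u by averaging its
    co-segmentation results against all frames of clip_v.

    cosegment(u, v) returns a probability map for u depicting the
    objects u shares with v (the role of the model in (1)).
    """
    masks = []
    for u in clip_u:
        maps = [cosegment(u, v) for v in clip_v]
        masks.append(np.mean(maps, axis=0))
    return masks

# Toy stand-in for the co-segmentation CNN: per-pixel intersection.
coseg = lambda u, v: u * v
clip_u = [np.array([[1.0, 0.0], [1.0, 1.0]])]
clip_v = [np.array([[1.0, 1.0], [0.0, 1.0]]),
          np.array([[1.0, 0.0], [0.0, 1.0]])]
init = initialize_masks(clip_u, clip_v, coseg)
```

Regions shared with many far-away frames accumulate high scores, while transient distractors are averaged down.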

In practice, the model in (1) can be set to any co-segmentation algorithm. In this study, we simply train a two-stream fully convolutional neural network, denoted as CoSegNet, to justify the effectiveness of the proposed hierarchical temporal slicing and co-segmentation framework. As shown in Fig. 5, CoSegNet takes two frames as input and produces two probability maps as output. Features from the two frames are extracted with two separate branches, which are initialized with the architecture and parameters of the first several layers of VGG16 (Simonyan and Zisserman, 2014). After that, the output feature maps of these two streams are concatenated and fused into a shared trunk for extracting the common features of the two frames. The network then splits into two separate branches that predict a probability map of shared objects for each input frame. Note that a skip connection from each input stream is also incorporated into the corresponding output branch so as to regularize the generation of each probability map by introducing frame-specific low-level features.

In training CoSegNet, we sample pairs of annotated frames from the training set of APD. For a pair of frames $u$ and $v$ with ground-truth binary masks $\mathbf{G}^u$ and $\mathbf{G}^v$, we train CoSegNet by minimizing the two losses $\ell(\mathbf{P}^u, \mathbf{G}^u)$ and $\ell(\mathbf{P}^v, \mathbf{G}^v)$, where $\ell$ is the cross-entropy loss and $\mathbf{P}^u$ and $\mathbf{P}^v$ are the predicted probability maps. All input frames and output predictions are resized to a fixed resolution. The learning rate is decayed after the first 50,000 iterations. The Caffe platform (Jia et al., 2014) is adopted for training the network with a batch size of 8 frame pairs. The optimization algorithm is set to Adam, the gamma is set to 0.1 and the momentum is set to 0.9.
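The two-term objective can be sketched as follows; the helper names and the numpy formulation are illustrative, not the paper's Caffe implementation:

```python
import numpy as np

def bce(pred, gt, eps=1e-7):
    """Per-pixel binary cross-entropy between a probability map and a
    ground-truth binary mask, averaged over pixels."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(gt * np.log(pred) + (1 - gt) * np.log(1 - pred)))

def coseg_loss(pred_u, gt_u, pred_v, gt_v):
    """Total training loss: one cross-entropy term per output branch,
    so both frames of a pair supervise the shared trunk."""
    return bce(pred_u, gt_u) + bce(pred_v, gt_v)
```

A perfect pair of predictions drives both terms to (near) zero, while a wrong branch contributes a large penalty.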

3.3. Stage 3: Mask Refinement within Neighborhood Reversible Flows

After co-segmenting two short videos $\mathcal{U}$ and $\mathcal{V}$, each video frame obtains an initial object mask represented by a probability map. Recall that the videos $\mathcal{U}$ and $\mathcal{V}$ under the same parent node are generated from the odd and even frames of a longer sub-video $\mathcal{W}=\{w_t\}$; as a result, each frame $w_t$ is initialized with a probability map $\mathbf{P}_t$.

To enhance inter-frame consistency and correct probable errors in $\mathbf{P}_t$, a key challenge is to derive reliable inter-frame correspondences. Considering that frames in the sub-video $\mathcal{W}$ may actually be far away from each other in the original video, pixel-based optical flow may fail to handle the large pixel displacement. To address this problem, we construct neighborhood reversible flows (Li et al., 2017) based on superpixels. We first apply the SLIC algorithm (Achanta et al., 2012) to divide two frames $w_t$ and $w_{t'}$ into $N_t$ and $N_{t'}$ superpixels denoted as $\{s_{t,i}\}$ and $\{s_{t',j}\}$, respectively. Similar to (Li et al., 2017), we compute the pair-wise distances $d(s_{t,i}, s_{t',j})$ between superpixels from $w_t$ and $w_{t'}$, where a superpixel is represented by its average RGB, Lab and HSV colors as well as the horizontal and vertical positions. Supposing that $s_{t,i}$ and $s_{t',j}$ reside in the $k$ nearest neighbors of each other, they are $k$-nearest neighborhood reversible, with the correspondence measured by

$f(s_{t,i}, s_{t',j}) = \exp\left(-\lambda \cdot d(s_{t,i}, s_{t',j})\right),$   (3)

where $\lambda$ is a constant empirically set to suppress weak correspondences. Such superpixel-based inter-frame correspondence between $w_t$ and $w_{t'}$ is denoted as the neighborhood reversible flow $\mathbf{F}_{t,t'}$, in which the component at $(i,j)$ equals $f(s_{t,i}, s_{t',j})$ (and 0 for non-reversible pairs). Note that we further normalize $\mathbf{F}_{t,t'}$ so that each row sums up to 1. Based on such flows, we refine the initial mask $\mathbf{P}_t$ according to its correlations with other frames. To speed up the refinement, we only refer to the previous mask $\mathbf{P}_{t-1}$ and the subsequent mask $\mathbf{P}_{t+1}$. We first turn the pixel-based map $\mathbf{P}_t$ into a vectorized superpixel-based map $\mathbf{p}_t$ by averaging the scores of all pixels inside each superpixel. After that, $\mathbf{p}_t$ is updated as

$\hat{\mathbf{p}}_t = \mathbf{p}_t + \alpha\, \mathbf{F}_{t,t-1}\, \mathbf{p}_{t-1} + \beta\, \mathbf{F}_{t,t+1}\, \mathbf{p}_{t+1},$   (4)

where $\alpha$ and $\beta$ are two constants that balance the influence of the previous and subsequent frames and are set empirically. After the temporal propagation, we turn the superpixel-based scores back into pixel-based ones as

$\hat{\mathbf{P}}_t(x) = \sum_{i} \hat{p}_{t,i}\, \delta(x \in s_{t,i}),$   (5)

where $\hat{\mathbf{P}}_t$ is the refined probability map of $w_t$ that depicts the presence of primary objects, and $\delta(\cdot)$ is an indicator function which equals 1 if $x \in s_{t,i}$ and 0 otherwise. An adaptive threshold is then used to segment the primary objects in the frame $w_t$.
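The flow construction and the propagation from the previous and subsequent masks can be sketched as below. The exponential weighting of correspondences and the alpha/beta values are illustrative assumptions (the exact constants follow (Li et al., 2017) and are not specified here), and superpixel descriptors are abstracted as feature vectors:

```python
import numpy as np

def neighborhood_reversible_flow(feat_a, feat_b, k=3, lam=2.0):
    """Build a flow matrix between the superpixels of two frames.

    feat_a: (Na, D) and feat_b: (Nb, D) superpixel descriptors (e.g.,
    mean colors and positions). A pair (i, j) gets a nonzero weight
    only if i and j lie in each other's k nearest neighbors
    ("k-nearest neighborhood reversible"); the weight decays with
    feature distance. Rows are normalized to sum to 1.
    """
    d = np.linalg.norm(feat_a[:, None, :] - feat_b[None, :, :], axis=-1)
    knn_ab = np.argsort(d, axis=1)[:, :k]      # b-neighbors of each a
    knn_ba = np.argsort(d, axis=0)[:k, :].T    # a-neighbors of each b
    flow = np.zeros_like(d)
    for i in range(d.shape[0]):
        for j in knn_ab[i]:
            if i in knn_ba[j]:                 # reversibility check
                flow[i, j] = np.exp(-lam * d[i, j])
    sums = flow.sum(axis=1, keepdims=True)
    return np.divide(flow, sums, out=np.zeros_like(flow), where=sums > 0)

def refine(p_t, p_prev, p_next, f_prev, f_next, alpha=0.3, beta=0.3):
    """Propagate superpixel scores from the previous and subsequent
    frames through their flows (alpha/beta weights are assumptions)."""
    return p_t + alpha * f_prev @ p_prev + beta * f_next @ p_next
```

Because only mutually nearest superpixels contribute, unreliable long-range matches are dropped instead of propagating errors.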

Models APD-Human APD-Building APD-Vehicle APD-Boat APD-Other APD
 mIoU  wFM  mIoU  wFM  mIoU  wFM  mIoU  wFM  mIoU  wFM  mIoU  wFM
[I+N]  CB (Jiang et al., 2011)  .050  .101  .162  .260  .145  .265  .111  .183  .252  .448  .126  .228
 BSCA (Qin et al., 2015)  .044  .107  .186  .277  .177  .233  .120  .186  .222  .347  .131  .219
 DSR (Li et al., 2013b)  .149  .243  .250  .334  .219  .310  .232  .318  .327  .425  .222  .329
 RBD (Zhu et al., 2014)  .162  .275  .232  .319  .257  .351  .275  .363  .373  .491  .243  .357
 SMD (Peng et al., 2017)  .244  .304  .240  .312  .306  .354  .329  .376  .416  .481  .294  .365
 GMR (Yang et al., 2013)  .138  .200  .211  .279  .250  .269  .193  .239  .315  .302  .202  .258
 GP (Jiang et al., 2015)  .033  .053  .172  .238  .124  .169  .123  .174  .177  .243  .119  .177
 MB+ (Zhang et al., 2015)  .115  .189  .219  .276  .241  .325  .247  .313  .315  .406  .220  .300
 HS (Yan et al., 2013)  .037  .091  .145  .273  .132  .210  .119  .186  .274  .398  .123  .218
 HDCT (Kim et al., 2014)  .137  .288  .241  .342  .234  .413  .249  .413  .314  .542  .221  .396
 ELE+ (Xia et al., 2017)  .268  .295  .330  .360  .425  .453  .411  .451  .521  .452  .371  .417
[I+D]  DCL (Li and Yu, 2016)  .349  .433  .341  .394  .503  .566  .511  .554  .561  .583  .444  .515
 DHSNet (Liu and Han, 2016)  .394  .472  .387  .438  .523  .601  .572  .655  .626  .701  .493  .581
 DSS (Hou et al., 2017)  .277  .409  .326  .389  .474  .564  .462  .572  .509  .668  .400  .517
 ELD (Lee et al., 2016)  .165  .247  .289  .353  .269  .326  .320  .417  .429  .521  .294  .389
 FSN (Chen et al., 2017)  .286  .324  .363  .393  .507  .554  .519  .580  .589  .664  .443  .505
 LEGS (Wang et al., 2015a)  .064  .093  .260  .331  .225  .292  .172  .228  .334  .415  .193  .261
 MCDL (Zhao et al., 2015)  .147  .049  .243  .138  .283  .147  .277  .141  .422  .155  .262  .129
 RFCN (Wang et al., 2016)  .338  .378  .360  .398  .467  .521  .521  .561  .603  .671  .451  .510
[V]  SSA (Li et al., 2018)  .284  .348  .263  .331  .366  .432  .350  .421  .489  .551  .333  .414
 NRF (Li et al., 2017)  .393  .433  .423  .449  .507  .540  .552  .598  .677  .741  .496  .551
 FST (Papazoglou and Ferrari, 2013)  .272  .308  .190  .213  .535  .596  .342  .399  .375  .440  .319  .382
 MSG (Fu et al., 2015)  .080  .088  .160  .186  .216  .232  .143  .172  .361  .410  .153  .182
 RMC (Zhang et al., 2014)  .123  .136  .202  .226  .312  .336  .208  .231  .230  .265  .205  .233
 HTS  .485  .555  .533  .594  .700  .753  .678  .744  .696  .781  .617  .695
Table 2. Performance benchmark of our approach HTS and 24 state-of-the-art models on APD and its five subsets. The best and runner-up models of each column are marked with bold and underline, respectively.

4. Experiments

In the experiments, we compare the proposed approach (denoted as HTS) with 24 state-of-the-art models on APD to demonstrate 1) the effectiveness of the proposed approach and dataset, and 2) the key challenges in aerial-based primary video object segmentation. The 24 models to be compared with can be divided into three groups:

1) The [I+N] group contains 11 image-based non-deep models, including CB (Jiang et al., 2011), BSCA (Qin et al., 2015), DSR (Li et al., 2013b), RBD (Zhu et al., 2014), SMD (Peng et al., 2017), GMR (Yang et al., 2013), GP (Jiang et al., 2015), MB+ (Zhang et al., 2015), HS (Yan et al., 2013), HDCT (Kim et al., 2014) and ELE+ (Xia et al., 2017).

2) The [I+D] group contains 8 image-based deep models, including DCL (Li and Yu, 2016), DHSNet (Liu and Han, 2016), DSS (Hou et al., 2017), ELD (Lee et al., 2016), FSN (Chen et al., 2017), LEGS(Wang et al., 2015a), MCDL (Zhao et al., 2015) and RFCN (Wang et al., 2016).

3) The [V] group contains 5 video-based models, including FST (Papazoglou and Ferrari, 2013), SSA (Li et al., 2018), NRF (Li et al., 2017), MSG (Fu et al., 2015) and RMC (Zhang et al., 2014). Note that SSA and NRF also utilize deep learning components such as stacked autoencoders and CNNs, while MSG and RMC are video co-segmentation models.

In the comparisons, we divide APD into three subsets: 50% for training, 25% for validation and 25% for testing. On the testing subset with 125 videos, we evaluate model performance with two metrics, the mean Intersection-over-Union (mIoU) and the weighted F-Measure (wFM). The mIoU score is computed following the protocol proposed in (Li et al., 2018): the IoU score is first computed at each frame and then averaged step-wise over each video and over the whole dataset to balance the influence of short and long videos. Note that the thresholds for turning probability maps into binary masks are set to 20% of the maximal probability scores, as suggested in (Li et al., 2017). Similarly, the wFM score is computed with the source code provided by (Margolin et al., 2014), which assesses the overall segmentation performance by jointly considering completeness and exactness.
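The mIoU protocol (per-frame IoU, averaged per video and then over videos, with the 20%-of-maximum threshold) can be sketched as:

```python
import numpy as np

def binarize(prob_map):
    """Threshold a probability map at 20% of its maximal score, as in
    the evaluation protocol adopted from (Li et al., 2017)."""
    return prob_map >= 0.2 * prob_map.max()

def dataset_miou(videos):
    """videos: list of videos; each video is a list of (prob_map, gt)
    pairs. IoU is computed per frame, averaged per video, then
    averaged over videos, so short and long videos weigh equally."""
    video_means = []
    for frames in videos:
        ious = []
        for prob, gt in frames:
            pred = binarize(prob)
            inter = np.logical_and(pred, gt).sum()
            union = np.logical_or(pred, gt).sum()
            ious.append(inter / union if union > 0 else 1.0)
        video_means.append(np.mean(ious))
    return float(np.mean(video_means))
```

The two-level average is what distinguishes this protocol from pooling all frames of the dataset into a single mean.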

Figure 6. Results of our HTS approach on representative video frames from APD.

4.1. Comparison with State-of-the-art Models

The performance scores of HTS and the other 24 state-of-the-art models on APD and its five subsets (only the 125 testing videos) are shown in Table 2. In addition, we also illustrate the results of HTS on representative frames in Fig. 6.

From Table 2, we can see that HTS outperforms the other 24 models on the whole dataset and all five subsets. We also find that APD is a very challenging dataset for most existing salient/primary object segmentation models. The image-based non-deep models perform far from satisfactorily, especially on the APD-Human subset, since the primary objects cover only 1.5% of the video frame area on average (see Table 1). Most image-based deep models outperform non-deep ones, indicating that learned features are more robust than heuristic rules when the application scenario is transferred from ground-based to aerial. In particular, among the image-based deep models, LEGS, MCDL and ELD have the worst overall performance, which may be caused by the fact that such models often rely heavily on pre-segmented superpixels; for small-sized primary objects such as humans and vehicles, the superpixels may be too inaccurate for feature extraction. Moreover, among the video-based models, NRF achieves impressive performance scores that are much higher than those of SSA and FST. This implies that CNNs learned on generic image datasets can be partially reused in aerial videos, while the predictions can be further refined by using inter-frame correspondences (e.g., in NRF and HTS). Moreover, some models, such as NRF and DHSNet, rank differently in terms of mIoU and wFM. This phenomenon implies that mIoU and wFM reveal model performance from two different perspectives. Therefore, we suggest using both metrics for model evaluation on APD.

Beyond the direct performance comparisons, we fine-tune three top-performing deep models, DSS, NRF and DHSNet, on the same APD videos used to train our HTS model and test the fine-tuned models again for a fairer comparison. As shown in Table 3, we find that HTS still performs slightly better than the three deep models fine-tuned on APD. This indicates that NRF, DSS and DHSNet, as state-of-the-art salient/primary object segmentation models, can learn some useful cues for primary object segmentation in aerial videos. However, in many aerial videos the primary objects are very small and thus not always salient in all frames; they are just consistently shared by the majority of video frames and keep capturing human attention throughout the videos (e.g., the buildings in Fig. 6). These results imply that the task of segmenting primary objects from a video is not equivalent to separately segmenting salient objects from each frame. By contrast, CoSegNet resolves the problem from the perspective of co-segmentation: even when a scene contains rich content and small-sized primary objects, the co-segmentation framework forces CoSegNet to learn features from the objects shared by different frames, leading to higher performance than these deep models.

Models Before Fine-tuning After Fine-tuning
mIoU wFM mIoU wFM
DSS (Hou et al., 2017) 0.400 0.517 0.575 0.688
NRF (Li et al., 2017) 0.496 0.551 0.609 0.687
DHSNet (Liu and Han, 2016) 0.493 0.581 0.587 0.685
HTS - - 0.617 0.695
Table 3. Model performance after being fine-tuned on APD.

4.2. Detailed Performance Analysis

Beyond the comparisons with state-of-the-arts, we also conduct several experiments to provide an in-depth performance analysis from the perspective of effectiveness, generalization ability, computational complexity and framework rationality.

Effectiveness test of odd-even temporal slicing. To validate the effectiveness of the odd-even temporal slicing framework, we test HTS again by hierarchically dividing the testing videos into the same number of sub-videos formed by consecutive frames rather than odd and even frames. Note that the same HTS model pre-trained on the training set of APD is used for co-segmentation. In this case, the mIoU of HTS decreases from 0.617 to 0.608, and the wFM decreases from 0.695 to 0.689, implying that the odd-even slicing framework provides better frame pairs for co-segmentation.

Generalization ability test. To validate that the HTS model generalizes to various scenes, we test HTS on VOS (Li et al., 2018), a dataset for primary object segmentation that mainly contains ground-based videos. On this dataset, we re-train HTS and compare it with NRF, which was also previously fine-tuned to achieve its best performance on VOS. In the experiments, we equally divide the 200 videos of VOS into two subsets. By training HTS on one subset and testing on the other, and vice versa, we obtain segmentation results for all 200 VOS videos. On VOS, our HTS model achieves an mIoU of 0.717 and a wFM of 0.771, matching NRF in mIoU (0.717) and slightly outperforming it in wFM (0.765) on these ground-based videos. As a result, we can safely claim that HTS is a generic framework that has the potential of segmenting primary objects in both aerial and ground-level videos.

Complexity and rationality. Another concern about the proposed approach may be the complexity and rationality of the hierarchical temporal slicing framework. Theoretically, a deeper hierarchy can speed up the co-segmentation process: since CoSegNet operates on all frame pairs formed by possible combinations of frames from two sub-videos, deeper slicing yields shorter sub-videos and thus fewer pairs. However, a deeper hierarchy also leads to over-segmented videos, in which a frame is co-segmented only with far-away frames. To validate these two points, we conduct another experiment that temporally slices the testing videos to depths ranging from 2 to 6. The performance scores and co-segmentation times can be found in Table 4. We can see that the experimental results well match the theoretical analysis: a deeper hierarchy leads to almost stable performance but remarkably lower computational complexity. Such results justify the rationality of using the hierarchical temporal slicing framework. In most experiments, we adopt a depth of 5, which corresponds to the assumption that a primary object will consistently appear for at least 32 frames (i.e., about 1s). Considering that one co-segmentation operation in CoSegNet takes only 4ms on GPU (NVIDIA GTX 1080) and refinement via neighborhood reversible flow takes 1.32s on CPU, HTS takes 1.82s per frame when processing a video with 181 frames.
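The depth-versus-complexity trade-off can be made concrete with a back-of-the-envelope count. The pairing scheme below is an assumption inferred from the text (every frame of a leaf clip paired with every frame of its sibling clip), and the 160-frame video length is an arbitrary example; the sketch reproduces the halving trend of #Co-Seg in Table 4, not its exact numbers, which depend on the real per-video lengths.

```python
def coseg_pairs(n_frames, depth):
    """Rough count of co-segmentation operations for one video, assuming
    CoSegNet pairs every frame of a leaf clip with every frame of its
    sibling: 2^(depth-1) sibling pairs of clips holding ~n/2^depth frames
    each, i.e. roughly n^2 / 2^(depth+1) frame pairs in total."""
    clip_len = n_frames / 2 ** depth
    sibling_pairs = 2 ** (depth - 1)
    return int(sibling_pairs * clip_len * clip_len)

# Each extra level of depth roughly halves the number of frame pairs,
# while leaf clips shrink from n/4 to n/64 frames.
counts = [coseg_pairs(160, d) for d in range(2, 7)]
```

Under this assumption, depth 5 on a 160-frame video needs 400 pairings against 3200 at depth 2, an 8x reduction, which is why deeper slicing is cheaper while per-frame evidence stays sufficient for stable accuracy.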

5. Conclusion

In this paper, we propose a large-scale dataset for primary object segmentation in aerial videos. The dataset consists of 500 videos from five semantic categories and is currently the largest aerial video dataset in this area. We believe this dataset will be helpful for the development of video object segmentation techniques. Based on the dataset, we propose a hierarchical temporal slicing approach for primary object segmentation in aerial videos, which repeatedly divides a video into short sub-videos that are assumed to share the same primary objects. As a result, the original segmentation task is converted into a set of co-segmentation tasks, which are resolved by training CNNs to co-segment frame pairs and refining the results within neighborhood reversible flows. Experimental results show that the proposed dataset is very challenging and that the proposed approach outperforms 24 state-of-the-art models.

In future work, we will explore the differences between the visual patterns extracted from ground-based and aerial videos so as to facilitate the design of better models for primary video object segmentation. In addition, we will explore the possibility of constructing CNNs that can directly co-segment two short videos.

Depth | mIoU  | wFM   | #Co-Seg
2     | 0.613 | 0.693 | 785K
3     | 0.614 | 0.693 | 379K
4     | 0.616 | 0.694 | 177K
5     | 0.617 | 0.695 | 76K
6     | 0.615 | 0.693 | 25K
Table 4. Performance and complexity of HTS when the 125 testing videos are temporally sliced to different depths. #Co-Seg indicates the number of co-segmentation operations.


  • Achanta et al. (2009) R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk. 2009. Frequency-tuned salient region detection. In CVPR.
  • Achanta et al. (2012) R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. 2012. SLIC Superpixels Compared to State-of-the-Art Superpixel Methods. IEEE TPAMI (2012).
  • Brox and Malik (2010) Thomas Brox and Jitendra Malik. 2010. Object segmentation by long term analysis of point trajectories. In ECCV.
  • Caelles et al. (2017) Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool. 2017. One-shot video object segmentation. In CVPR.
  • Cavaliere et al. (2017) D. Cavaliere, V. Loia, A. Saggese, S. Senatore, and M. Vento. 2017. Semantically Enhanced UAVs to Increase the Aerial Scene Understanding. IEEE TSMC:Systems (2017).
  • Chen et al. (2017) Xiaowu Chen, Anlin Zheng, Jia Li, and Feng Lu. 2017. Look, Perceive and Segment: Finding the Salient Objects in Images via Two-Stream Fixation-Semantic CNNs. In ICCV.
  • Cheng et al. (2015) M. M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S. M. Hu. 2015. Global Contrast Based Salient Region Detection. IEEE TPAMI (2015).
  • Fu et al. (2015) Huazhu Fu, Dong Xu, Bao Zhang, Stephen Lin, and Rabab Kreidieh Ward. 2015. Object-based multiple foreground video co-segmentation via multi-state selection graph. IEEE TIP (2015).
  • Hou et al. (2017) Qibin Hou, Ming-Ming Cheng, Xiao-Wei Hu, Ali Borji, Zhuowen Tu, and Philip Torr. 2017. Deeply supervised salient object detection with short connections. In CVPR.
  • Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia.
  • Jiang et al. (2011) Huaizu Jiang, Jingdong Wang, Zejian Yuan, Tie Liu, Nanning Zheng, and Shipeng Li. 2011. Automatic salient object segmentation based on context and shape prior. In BMVC.
  • Jiang et al. (2013) Huaizu Jiang, Jingdong Wang, Zejian Yuan, Yang Wu, Nanning Zheng, and Shipeng Li. 2013. Salient Object Detection: A Discriminative Regional Feature Integration Approach. In CVPR.
  • Jiang et al. (2015) Peng Jiang, Nuno Vasconcelos, and Jingliang Peng. 2015. Generic Promotion of Diffusion-Based Salient Object Detection. In ICCV.
  • Kim et al. (2014) Jiwhan Kim, Dongyoon Han, Yu-Wing Tai, and Junmo Kim. 2014. Salient region detection via high-dimensional color transform. In CVPR.
  • Lee et al. (2016) Gayoung Lee, Yu-Wing Tai, and Junmo Kim. 2016. Deep saliency with encoded low level distance map and high level features. In CVPR.
  • Li et al. (2013a) Fuxin Li, Taeyoung Kim, Ahmad Humayun, David Tsai, and James M Rehg. 2013a. Video segmentation by tracking many figure-ground segments. In ICCV.
  • Li and Yu (2016) Guanbin Li and Yizhou Yu. 2016. Deep contrast learning for salient object detection. In CVPR.
  • Li et al. (2018) Jia Li, Changqun Xia, and Xiaowu Chen. 2018. A Benchmark Dataset and Saliency-guided Stacked Autoencoders for Video-based Salient Object Detection. IEEE TIP (2018).
  • Li et al. (2017) Jia Li, Anlin Zheng, Xiaowu Chen, and Bin Zhou. 2017. Primary Video Object Segmentation via Complementary CNNs and Neighborhood Reversible Flow. In ICCV.
  • Li et al. (2013b) Xiaohui Li, Huchuan Lu, Lihe Zhang, Xiang Ruan, and Ming-Hsuan Yang. 2013b. Saliency detection via dense and sparse reconstruction. In ICCV.
  • Liu and Han (2016) Nian Liu and Junwei Han. 2016. Dhsnet: Deep hierarchical saliency network for salient object detection. In CVPR.
  • Liu et al. (2011) T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H. Y. Shum. 2011. Learning to Detect a Salient Object. IEEE TPAMI (2011).
  • Mancini et al. (2013) Francesco Mancini, Marco Dubbini, Mario Gattelli, Francesco Stecchi, Stefano Fabbri, and Giovanni Gabbianelli. 2013. Using Unmanned Aerial Vehicles (UAV) for High-Resolution Reconstruction of Topography: The Structure from Motion Approach on Coastal Environments. Remote Sensing (2013).
  • Margolin et al. (2014) Ran Margolin, Lihi Zelnik-Manor, and Ayellet Tal. 2014. How to evaluate foreground maps?. In CVPR.
  • Papazoglou and Ferrari (2013) Anestis Papazoglou and Vittorio Ferrari. 2013. Fast object segmentation in unconstrained video. In ICCV.
  • Peng et al. (2017) Houwen Peng, Bing Li, Haibin Ling, Weiming Hu, Weihua Xiong, and Stephen J Maybank. 2017. Salient object detection via structured matrix decomposition. IEEE TPAMI (2017).
  • Perazzi et al. (2016) Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. 2016. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR.
  • Qin et al. (2015) Yao Qin, Huchuan Lu, Yiqun Xu, and He Wang. 2015. Saliency detection via cellular automata. In CVPR.
  • Rahtu et al. (2010) Esa Rahtu, Juho Kannala, Mikko Salo, and Janne Heikkilä. 2010. Segmenting Salient Objects from Images and Videos. In ECCV.
  • Russell et al. (2008) Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. 2008. LabelMe: A Database and Web-Based Tool for Image Annotation. IJCV (2008).
  • Shu et al. (2015) Tianmin Shu, D. Xie, B. Rothrock, S. Todorovic, and S. C. Zhu. 2015. Joint inference of groups, events and human roles in aerial videos. In CVPR.
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  • Tsai et al. (2010) D Tsai, M Flagg, and J Rehg. 2010. Motion coherent tracking with multi-label mrf optimization, algorithms. In BMVC.
  • Tu et al. (2016) Wei-Chih Tu, Shengfeng He, Qingxiong Yang, and Shao-Yi Chien. 2016. Real-Time Salient Object Detection With a Minimum Spanning Tree. In CVPR.
  • Wang et al. (2015a) Lijun Wang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. 2015a. Deep networks for saliency detection via local estimation and global search. In CVPR.
  • Wang et al. (2016) Linzhao Wang, Lijun Wang, Huchuan Lu, Pingping Zhang, and Xiang Ruan. 2016. Saliency detection with recurrent fully convolutional networks. In ECCV.
  • Wang et al. (2015b) Wenguan Wang, Jianbing Shen, and Fatih Porikli. 2015b. Saliency-Aware Geodesic Video Object Segmentation. In CVPR.
  • Wang et al. (2015c) Wenguan Wang, Jianbing Shen, and Ling Shao. 2015c. Consistent video saliency using local gradient flow optimization and global refinement. IEEE TIP (2015).
  • Xia et al. (2017) Changqun Xia, Jia Li, Xiaowu Chen, Anlin Zheng, and Yu Zhang. 2017. What Is and What Is Not a Salient Object? Learning Salient Object Detector by Ensembling Linear Exemplar Regressors. In CVPR.
  • Yan et al. (2013) Qiong Yan, Li Xu, Jianping Shi, and Jiaya Jia. 2013. Hierarchical saliency detection. In CVPR.
  • Yang et al. (2013) Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang. 2013. Saliency Detection via Graph-Based Manifold Ranking. In CVPR.
  • Zhang et al. (2014) Dong Zhang, Omar Javed, and Mubarak Shah. 2014. Video object co-segmentation by regulated maximum weight cliques. In ECCV.
  • Zhang et al. (2015) Jianming Zhang, Stan Sclaroff, Zhe Lin, Xiaohui Shen, Brian Price, and Radomir Mech. 2015. Minimum barrier salient object detection at 80 fps. In ICCV.
  • Zhang et al. (2010) J. Zhang, Y. Wu, W. Liu, and X. Chen. 2010. Novel Approach to Position and Orientation Estimation in Vision-Based UAV Navigation. IEEE Trans. Aerospace Electron. Systems (2010).
  • Zhao et al. (2015) Rui Zhao, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. 2015. Saliency detection by multi-context deep learning. In CVPR.
  • Zhu et al. (2014) Wangjiang Zhu, Shuang Liang, Yichen Wei, and Jian Sun. 2014. Saliency optimization from robust background detection. In CVPR.