Detecting Tiny Moving Vehicles in Satellite Videos

Wei Ao, Yanwei Fu, Feng Xu

Wei Ao and Feng Xu are with the Key Lab for Information Science of Electromagnetic Waves (MoE), Fudan University, Shanghai 200433, China. Email: {wao16, fengxu}

Yanwei Fu is with the School of Data Science, Fudan University, Shanghai 200433, China. Email:

In recent years, satellite videos have been captured from moving satellite platforms. In contrast to consumer, movie, and common surveillance videos, a satellite video can record a snapshot of a city-scale scene. Within the broad field-of-view of a satellite video, each moving target is very tiny and usually composed of only several pixels per frame. Even worse, noise signals also exist in the video frames, since the background exhibits subpixel-level and uneven motion due to the motion of the satellite. We argue that this is a new type of computer vision task, since previous technologies are unable to detect such tiny vehicles efficiently. This paper proposes a novel framework that can identify small moving vehicles in satellite videos. In particular, we offer a novel detection algorithm based on local noise modeling. We differentiate potential vehicle targets from noise patterns by an exponential probability distribution. Subsequently, a multi-morphological-cue based discrimination strategy is designed to further distinguish correct vehicle targets from the few remaining noises. Another significant contribution is a series of evaluation protocols that systematically measure the performance of tiny moving vehicle detection. We manually annotate a satellite video and use it to test our algorithms under different evaluation criteria. The proposed algorithm is also compared with state-of-the-art baselines, and the comparison demonstrates the advantages of our framework.


Tiny Object Detection, Probabilistic Noise Modeling, Evaluation, Vehicle Detection.

1 Introduction

With recent advances in earth observation (EO) technology, satellite videos are captured by optical sensors that record consecutive images from a moving satellite platform. Satellite videos enable many potential applications, such as city-scale traffic surveillance, 3D reconstruction of urban buildings, and quake-relief efforts. For instance, Figure 1 (a) shows a frame of a satellite video of the city of Valencia, Spain. As visualized in Fig. 1 (b), the corresponding areal map of each video frame is about square kilometers. Satellite video can thus facilitate monitoring dynamic scenes at city scale. To supervise a city-scale scene efficiently, one primary and critical task is to detect and track the moving vehicles captured in satellite videos. However, no previous technique exists for detecting the very tiny moving vehicles in satellite videos, due to the following challenges.

Figure 1: An example of satellite video. (a) A frame of a satellite video of Valencia, Spain. (b) Its corresponding optical map downloaded from Google Earth.

In the video, vehicles are moving and very tiny. In satellite videos, only several pixels represent each vehicle; thus, essentially, we have to detect tiny moving vehicles. Fig. 2 shows two enlarged areas of the panorama in Fig. 1 (a). In these enlargements, a vehicle is composed of only several pixels without any distinctive color or texture. Consequently, state-of-the-art detection algorithms, such as deep learning, can easily overfit the training vehicle data but fail to detect or describe the patterns of these moving vehicles. We observe that the most robust features of these moving vehicles come from their motion patterns, which, however, are easily obscured by the background motion.

Figure 2: The enlargements of two scenes in Fig. 1 (a), where some vehicle targets are denoted with red circles.

The frames of satellite videos cover a large-scale area and provide a dynamic scenario. In terms of the distance between the camera and the observed objects, there are near-field, medium-field, and far-field surveillance videos, and the extremely far-field satellite videos [1, 2]. A satellite video not only provides a broad field-of-view but also presents a very complex background. As shown in Fig. 4, the visual content of satellite videos may include roads, buildings, vegetation, and football fields. Furthermore, it also covers a wide variety of traffic conditions, e.g. straight arteries, intersections, and roundabouts.

The background of satellite videos presents sub-pixel-level and uneven motion. The optical flow field [3, 4] of the above satellite video is shown in Fig. 3. It shows that the background is continuously moving and that the optical flow field is very uneven. Even worse, the relative motion in the satellite video is very complicated, since the video frames are intrinsically 2D projections of a sophisticated 3D movement of the satellite platform. On the other hand, since the satellite is very far away from the earth plane, we can only observe very slow motion between consecutive video frames. Such slow motion leads to small variations of the stationary pixels. Critically, we notice that the motion between two successive satellite video frames is always at the sub-pixel level, which also challenges frame-by-frame video stabilization and registration techniques. Overall, one key difficulty in detecting tiny moving vehicles is to differentiate the background motion from the moving vehicles; otherwise, the moving background would negatively affect the detection.

Figure 3: Optical flow field obtained from frames 1 and 100 of the satellite video. (a) shows the magnitude and orientation of the optical flow field through vectors, while (b) further illustrates the magnitude distribution of the optical flow field.
Figure 4: Varied parts of a frame of the satellite video. (a) roads, (b) buildings, (c) vegetation and (d) a football field.

This paper focuses on detecting and tracking tiny vehicles of only a few pixels in satellite videos, which are very hard to identify and easily affected by noise. The patterns of moving vehicles may also be confused with the noise patterns caused by the complex moving background. Such noise patterns may appear as regular motion of stationary corners or edges, and thus further hinder the detection of the tiny moving vehicles.

To tackle the problems mentioned above, we propose, for the first time, a framework that addresses the challenging task of detecting tiny moving vehicles in satellite videos. The whole framework is built upon a series of statistical tools. In particular, we propose a motion-based detection algorithm using a novel local modeling. We decompose each frame into two parts, i.e., an original image and an additive random 2D noise signal map. A probability distribution is fitted to the noise patterns, which allows us to distinguish potential vehicle targets. A local tactic is applied to address intra-frame variations and discern inter-frame variations simultaneously. Then, a region growing algorithm and a discrimination algorithm based on multiple morphological cues are designed to remove the remaining noise. The Kalman filter (KF) [5] is further used to track the vehicles. Extensive experiments are conducted on a real-world satellite video dataset to evaluate our framework and show the efficacy of the proposed models over the baselines.

The major contributions of this paper are fourfold:

  1. To the best of our knowledge, the task of detecting tiny moving objects is studied for the first time. To further study this problem, we contribute a satellite video dataset with labeled ground truth of tiny moving objects.

  2. We propose a motion-based detection algorithm using a novel local modeling. The noise pattern is modeled by probabilistic distributions.

  3. A region growing algorithm and a discrimination algorithm based on multiple morphological cues are further designed.

  4. We propose, for the first time, a set of evaluation protocols that systematically measure the performance of algorithms for analyzing tiny moving vehicles.

The remainder of the paper is organized as follows. First, Sec. 2 reviews some related work. Then, Sec. 3 details the proposed algorithms, including the overall framework and the two major contributions, the detection and discrimination algorithms. Subsequently, the evaluation metrics used and the proposed evaluation algorithm are presented in Sec. 4. Sec. 5 shows experiments undertaken on a satellite video. Finally, Sec. 6 concludes the paper.

2 Related Work

2.1 Earth Observation Technology and Satellite Videos

Nowadays, many observation technologies of enormous significance have been developed, including optical satellite images, space-borne synthetic aperture radar (SAR) images, and aerial videos. Such technologies play a critical role in both civil and military areas, such as city traffic systems, maritime surveillance, aerial reconnaissance, and battlefield monitoring. Both optical satellite imaging and space-borne SAR can observe a large area at high resolution. However, optical satellite imaging is very susceptible to illumination and weather conditions. Although SAR has the unique capability of all-weather earth imaging regardless of day and night [6], SAR images are difficult to interpret [7, 8]. Another weakness of optical satellite images and SAR is that they cannot observe dynamics due to stationary imaging, which narrows down their applications.

Satellite videos have many advantages over other conventional videos, such as aerial videos captured by unmanned aerial vehicles (UAVs). Aerial videos often suffer from undesirable dramatic motions of the platform and have to resort to complex stabilization preprocessing [9]. Thus, to track objects, image registration has to be done at the first stage so that camera egomotion can be separated from object motion [10, 11, 12]. Aerial videos can only cover a small part of a city, while satellite videos can easily supervise a city-scale scene. Furthermore, new civil aviation safety legislation has forbidden or restricted the usage of UAVs in many cities.

Recently, some commercial companies have been able to operate their own satellites. For example, Satellite Imaging Corporation (SIC) successfully launched the video satellites SkySat-1 and SkySat-2 on November 21, 2013 and July 8, 2014, respectively. Chang Guang Satellite Technology CO., LTD (CGSTL) successfully launched two video satellites on October 7, 2015. Up to now, CGSTL has 8 on-orbit video satellites, which are part of the ongoing Jilin No. 1 satellite constellation. The temporal resolution of EO of the Jilin No. 1 satellite constellation will be shortened to half an hour when the constellation is completed by 2020.

Compared with the conventional EO technologies, satellite video covers the largest scope and is very stable. First, we can understand and forecast the dynamics of the earth. Second, a video satellite can keep its lens pointed towards the region of interest (ROI) throughout its flight so long as this area is within the field-of-view of the satellite, which yields very good image quality; satellite videos are therefore more stable than aerial videos. Third, the high altitude of a video satellite results in a broader field-of-view, which can even cover a city-scale area. Another important point is that a video satellite is a free platform that can record anywhere on the earth without any restrictions.

Figure 5: The overall framework of tiny vehicles detection.

2.2 Moving Target Detection and Tracking

Moving target detection can be taken as a special case of foreground segmentation. Such tasks can be solved by the Gaussian Mixture Model (GMM) [13, 14, 15] and the state-of-the-art ViBe [16, 17] algorithm. GMM is a representative parametric model, while ViBe uses a non-parametric method to describe the dynamic patterns of a pixel. GMM utilizes a weighted mixture of Gaussian distributions to model the variation of a pixel value over time. However, the dynamics of pixel values may not follow Gaussian distributions, and in most cases we cannot use a definite parametric model to represent the variation of pixel values. ViBe instead regards the values of a pixel at different time steps as samples of one space, in order to represent the patterns of the pixel.

Previous algorithms such as GMM and ViBe have several serious drawbacks when applied to our task. First, they require heavy computation and resources when processing satellite videos, since pixel-based modeling has a heavy computational load. Second, they are relatively inefficient at telling the difference between a moving target and the ego-motion of the satellite. To this end, we propose a noise model that isolates the moving background and detects the potential moving targets simultaneously. Moreover, the proposed tiny moving vehicle detection is implemented in the spatiotemporal domain. In view of the sub-pixel-level motion and the neighborhood similarity, we model the pixels of inter-frame differences spatially rather than temporally.

Figure 6: Some cases of nearest searching, wherein the black rectangles denote the tracked vehicles. (a), (c) and (e) are previous positions of the vehicles, while (b), (d) and (f) are the corresponding current positions. Please note that it is hard to recognize vehicles by the naked eye in the original RGB images due to the low contrast; so, the original images are converted to pseudo-color images for viewing only.

3 Methodology of Detecting Tiny Moving Vehicles

This section is divided into three subsections. First, we discuss the overall framework of the proposed tiny moving vehicle detection algorithms in Sec. 3.1. The proposed framework is built on the Kalman filter (KF) tracking framework shown in Fig. 5. The tiny moving vehicles are detected in each local region in Sec. 3.2. We subsequently propose the discrimination algorithm that removes falsely detected components in Sec. 3.3.

3.1 Overview of the Proposed Algorithms

Key concepts. Before fully developing our framework, we explain some key concepts. Detection, or a detector, is the potential-vehicle detecting procedure that embodies the proposed local tactic and noise modeling algorithm. Candidates are the outputs of the detection, composed of true vehicles and some noises. Discrimination, or a discriminator, is the procedure that distinguishes true vehicles from existing noises, including the proposed region growing and multi-morphological-cue based discrimination algorithms. Hypotheses are the outputs of the discrimination, composed of true vehicles and a few noises. In addition, the final outputs of the detecting and tracking framework are also defined as hypotheses. The state is a vector that includes the position, velocity and acceleration of a vehicle at a time step. A track is a sequence of states of a vehicle in the temporal domain. Each track is marked with a unique ID and assigned a KF. Association is a matching procedure that minimizes the total cost. A prediction is the current position of a track inferred by its KF.

Widely used object tracking methods include the Kalman filter, the particle filter [5], and mean shift [18]. As shown in Fig. 5, our whole vehicle tracking pipeline is based on the Kalman filter, one of the most classical tracking algorithms. In Fig. 5, the KF is the central module of the processing framework, which includes several interacting branches, as follows.

(1) Initialization. Initialization determines the initial state of a track. The hypotheses of the current frame and the previous frame are associated using the Hungarian algorithm [19, 20], from which we derive their velocities and positions; their initial accelerations are taken as zero.

(2) Prediction. The current state of a tracked vehicle is inferred from the previous observation.

(3) Hypothesis-to-Track Association. The discriminator yields hypotheses, while the tracks yield predictions. In this stage, hypotheses are matched with predictions so as to minimize the total cost, where the cost between a hypothesis and a prediction is defined as their Euclidean distance. The Hungarian algorithm is employed to derive an optimal association between hypotheses and predictions. Hypothesis-to-track association yields assignments, unassigned tracks and unassigned hypotheses. Assignments are the optimal matches. Unassigned tracks are those tracks that are not successfully matched with any hypothesis; likewise, unassigned hypotheses are those not matched with any track. Then, assignments are used to update the states; unassigned tracks are further processed in the nearest searching stage; unassigned hypotheses are used to initialize new tracks.

(4) Update. The state vectors of hypotheses of the assignments are used to update the state of their corresponding KFs.

(5) Nearest searching, correction and termination. We do not simply discard the unassigned tracks, because their corresponding hypotheses may have been missed by the detector or the discriminator. So, a nearest searching strategy is applied to find out whether a connected region resembling the tracked vehicle exists around the previous position of the vehicle. The matching resorts to the structural similarity index (SSIM) [21]. If a similar region is found, the track is updated; otherwise the track is terminated. The nearest searching using SSIM is illustrated in Fig. 6: (a), (c) and (e) are from the previous frame, where the vehicles are marked by black rectangles, while (b), (d) and (f) are from the current frame, where the black rectangles denote the results of nearest searching. These experimental results demonstrate the effectiveness of such a nearest searching strategy.
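As an illustration of the hypothesis-to-track association in step (3), the following minimal sketch matches predictions to hypotheses by minimum total Euclidean cost. It is not the authors' implementation: for clarity it finds the optimum by exhaustive search, which gives the same optimal assignment that the Hungarian algorithm would find but is only practical for a handful of targets. The function name `associate` and the gating distance `max_dist` are assumptions introduced here; the paper does not state how unmatched pairs are decided.

```python
from itertools import permutations
from math import hypot, inf


def associate(predictions, hypotheses, max_dist=5.0):
    """Match track predictions to hypotheses with minimum total Euclidean cost.

    max_dist is a hypothetical gate (in pixels): pairs farther apart than
    this are treated as unassigned. Exhaustive search stands in for the
    Hungarian algorithm and is only suitable for small inputs.
    """
    n, m = len(predictions), len(hypotheses)
    cost = [[hypot(p[0] - h[0], p[1] - h[1]) for h in hypotheses]
            for p in predictions]
    size = max(n, m)  # column indices >= m are dummy "no hypothesis" slots

    def pair_cost(i, j):
        # A dummy slot or an out-of-gate pair costs exactly max_dist.
        return cost[i][j] if j < m and cost[i][j] <= max_dist else max_dist

    best_total, best = inf, ()
    for perm in permutations(range(size), n):
        total = sum(pair_cost(i, j) for i, j in enumerate(perm))
        if total < best_total:
            best_total, best = total, perm

    assignments = [(i, j) for i, j in enumerate(best)
                   if j < m and cost[i][j] <= max_dist]
    matched_tracks = {i for i, _ in assignments}
    matched_hyps = {j for _, j in assignments}
    unassigned_tracks = [i for i in range(n) if i not in matched_tracks]
    unassigned_hyps = [j for j in range(m) if j not in matched_hyps]
    return assignments, unassigned_tracks, unassigned_hyps
```

The three returned lists correspond to the assignments, unassigned tracks and unassigned hypotheses described above; in a full pipeline they would feed the update, nearest searching and initialization branches, respectively.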

Figure 7: Flow diagram of the motion-based detection algorithm via noise modeling.

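We will fully explain the steps of detecting the tiny moving objects in the next few sections. Once the objects are detected, the Kalman filter is adopted to fit the motion of these vehicles; it is the optimal solution in the linear and Gaussian case. Although the motions of vehicles in the real world are very complex, a non-linear procedure can be approximately decomposed into a series of linear procedures. So, the KF is a simple but effective tool to measure, predict and track the motion of a moving vehicle. The evolution function of the system is defined as

$$\mathbf{x}_k = \mathbf{F}\,\mathbf{x}_{k-1} + \mathbf{w}_k$$

where $\mathbf{x}_k$, $\mathbf{F}$ and $\mathbf{w}_k$ denote the state vector, the evolution matrix and the process noise vector, respectively, and the subscript $k$ indicates the time step of a frame. The position $\mathbf{p}_k$, velocity $\dot{\mathbf{p}}_k$ and acceleration $\ddot{\mathbf{p}}_k$ of a vehicle constitute $\mathbf{x}_k$, namely

$$\mathbf{x}_k = \begin{bmatrix} \mathbf{p}_k^{\top} & \dot{\mathbf{p}}_k^{\top} & \ddot{\mathbf{p}}_k^{\top} \end{bmatrix}^{\top}$$

Without loss of generality, we assume that vehicle targets move with constant acceleration along a straight line during each fixed interval $\Delta t$. So, the evolution matrix can be written as

$$\mathbf{F} = \begin{bmatrix} \mathbf{I} & \Delta t\,\mathbf{I} & \tfrac{1}{2}\Delta t^{2}\,\mathbf{I} \\ \mathbf{0} & \mathbf{I} & \Delta t\,\mathbf{I} \\ \mathbf{0} & \mathbf{0} & \mathbf{I} \end{bmatrix}$$

where $\mathbf{I}$ and $\mathbf{0}$ are the $2 \times 2$ identity and zero matrices. The measurement function can be written as

$$\mathbf{z}_k = \mathbf{H}\,\mathbf{x}_k + \mathbf{v}_k$$

where $\mathbf{z}_k$, $\mathbf{H}$ and $\mathbf{v}_k$ denote the measurement vector, the measurement matrix and the measurement noise, respectively. The definition of $\mathbf{H}$ in this study is


Assuming that a set of measurements is obtained, i.e. $\mathbf{Z}_k = \{\mathbf{z}_1, \dots, \mathbf{z}_k\}$, the KF recursively derives the posterior PDF of the state vector via Bayes' theorem, i.e.

$$p(\mathbf{x}_k \mid \mathbf{Z}_k) = \frac{p(\mathbf{z}_k \mid \mathbf{x}_k)\, p(\mathbf{x}_k \mid \mathbf{Z}_{k-1})}{p(\mathbf{z}_k \mid \mathbf{Z}_{k-1})}$$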
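The constant-acceleration Kalman filter described here can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the state is ordered as 2D position, velocity, acceleration; the measurement matrix is assumed to observe position only (the paper's own definition is not recoverable from the text); and the noise covariances `Q` and `R` are placeholders chosen by the caller. The function names are hypothetical.

```python
import numpy as np


def make_ca_model(dt=1.0):
    """Build a constant-acceleration evolution matrix F and a measurement matrix H.

    State layout (an assumption): [px, py, vx, vy, ax, ay].
    H observes position only, which is also an assumption of this sketch.
    """
    I2, Z2 = np.eye(2), np.zeros((2, 2))
    F = np.block([[I2, dt * I2, 0.5 * dt ** 2 * I2],
                  [Z2, I2, dt * I2],
                  [Z2, Z2, I2]])
    H = np.hstack([I2, Z2, Z2])
    return F, H


def predict(x, P, F, Q):
    """Propagate the state mean x and covariance P one time step."""
    return F @ x, F @ P @ F.T + Q


def update(x, P, z, H, R):
    """Correct the predicted state with a measurement z."""
    y = z - H @ x                      # innovation
    S = H @ P @ H.T + R                # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new
```

In the tracking loop, `predict` would supply the prediction used for association, and `update` would be called with the position of the assigned hypothesis.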


3.2 Motion-Based Detection Using Local Noise Modeling

Inter-frame difference is a conventional but effective tool to discern changes between two frames. In contrast to the pixel-based ViBe and GMM, it has two notable merits: high efficiency and low memory consumption. However, the traditional inter-frame difference [22] relies on a predefined threshold to separate moving pixels from the background. Specifically, the grey-level difference image is converted to a binary image wherein ones denote moving pixels and zeros denote stationary background pixels. This procedure is termed binarization in our paper. Essentially, binarization differentiates moving pixels from the rest of the inter-frame difference image. But a fixed binarization threshold cannot adapt to the large intra-frame variations of satellite videos.

To address the aforementioned challenges, we propose a local tactic and a novel detecting method, conceptualized as motion-based detection using local noise modeling. An adaptive binarization threshold is derived by noise modeling in each local area, which handles the variations among local areas by exploiting the neighborhood similarity. Motion-based means that inter-frame differences are processed to search for moving pixels.

  • Local tactic. A local tactic is designed to tackle the dramatic intra-frame variation. Specifically, a 2D rasterization is implemented along the vertical and horizontal directions of a frame, so that the original frame is converted into tiled local areas. The size of a local area is empirically set as square pixels. For one thing, a local area has a much lower degree of heterogeneity than the whole frame. For another, it integrates local information to reduce the interference of the moving background. This greatly facilitates the subsequent detection.

  • Detecting method. The proposed detecting method is composed of four stages, as shown in Fig. 7: (1) deriving inter-frame difference images, (2) estimating the noise distribution, (3) binarization and (4) logical AND operations to obtain the final detection results. Our major contribution is step (2), estimating the noise distribution, which yields an adaptive binarization threshold for each local area in step (3). Each step is detailed as follows.

Figure 8: Amplitude histogram of noises and some fitted probabilistic distributions. Note that "Histogram" and "Difference" are abbreviated as "Hist" and "Diff", respectively.

Frame   Distance   Exponential   Gamma    Weibull   GD
50      KL         0.0959        0.0914   0.0919    0.0544
50      KS         0.0959        0.1018   0.0891    0.0813
100     KL         0.0864        0.0812   0.0862    0.0579
100     KS         0.0875        0.0988   0.0896    0.0865
500     KL         0.0846        0.0800   0.0845    0.0531
500     KS         0.0854        0.0964   0.0859    0.0816

Table I: KL and KS Distances of Some Noise Probability Models. Smaller values indicate better fitting performance.

(1) Deriving inter-frame difference images. Inter-frame difference is a trivial operation. Here, we provide a novel viewpoint on inter-frame difference images by taking a frame as a 2D signal consisting of the original optical signal and additive random noise, i.e.

$$f_t(i, j) = s_t(i, j) + n_t(i, j) \tag{8}$$

where $f_t(i, j)$ denotes the grey-level value of pixel $(i, j)$ in frame $t$ (grey-level images are more common in satellite videos, and RGB frames are converted to grey-level images), $s_t(i, j)$ denotes the original amplitude of the pixel in frame $t$, while $n_t(i, j)$ denotes the corresponding noise signal. Accordingly, since the original signals of stationary pixels cancel, the absolute inter-frame difference of two registered frames can be regarded as a set of random noise, i.e.,

$$d_t(i, j) = \left| f_{t+k}(i, j) - f_t(i, j) \right| = \left| n_{t+k}(i, j) - n_t(i, j) \right| \tag{9}$$

where $d_t(i, j)$ denotes the absolute inter-frame difference and $k$ denotes the frame interval. From Eq. (9), the inter-frame difference image depends only on noise when the two frames are registered. However, some outliers still exist. These outliers are composed of tiny moving vehicles and non-vehicle targets. Thus, the next issue is how to differentiate these outliers from random noise.

(2) Estimating noise distribution. Detecting the pattern of tiny moving vehicles is challenging, since noise patterns blur the underlying patterns of the tiny moving vehicles. Thus, in this step, the key idea is to fit the noise patterns, namely the noise term in Eq. (8), by probabilistic distributions.

Intuitively, the value differences of the same pixel in two consecutive frames should be approximately zero, while the value differences of pixels belonging to noise patterns or moving vehicles should be larger than zero. Figure 8 shows the histogram of the value differences of pixels of two consecutive frames. The amplitude histogram of noises exhibits notable regularities, such as smooth decay and a heavy tail. In the pattern of noise, true noise pixels are inliers, and the other pixels are outliers. Thus, the heavy tail in Fig. 8 should correspond to the outliers.

A probabilistic distribution is adopted to fit the histogram and to derive a binarization threshold for a given probability. Several widely used heavy-tailed distributions, namely the exponential distribution, the Gamma distribution, the Weibull distribution and the generalized Gamma distribution (GD), are tested and compared in Fig. 8. Quantitatively, the Kullback-Leibler (KL) distance [23] and the Kolmogorov-Smirnov (KS) distance (also known from the Kolmogorov-Smirnov test [24]) are introduced to quantify the fitting performance of the different distributions, as shown in Table I. Smaller KL and KS distances indicate a better fit.
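To make the two goodness-of-fit measures concrete, the sketch below fits an exponential distribution to a set of absolute difference values and computes a histogram-based KL distance and a KS distance against the fitted model. It is an illustrative sketch only: the maximum-likelihood rate estimate (one over the sample mean), the number of histogram bins, and the function name `kl_ks_exponential` are assumptions, and the paper does not state how its distances were discretised.

```python
import numpy as np


def kl_ks_exponential(samples, bins=64):
    """Fit an exponential model and return (rate, KL distance, KS distance).

    The rate is the maximum-likelihood estimate 1 / mean. The bin count is a
    hypothetical choice for this sketch.
    """
    samples = np.asarray(samples, dtype=float)
    lam = 1.0 / samples.mean()

    # KL distance between the empirical histogram p and the model mass q
    # falling into the same bins (both normalised over the histogram range).
    counts, edges = np.histogram(samples, bins=bins)
    p = counts / counts.sum()
    q = np.diff(1.0 - np.exp(-lam * edges))
    q = q / q.sum()
    mask = p > 0
    kl = float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    # KS distance: largest gap between the empirical CDF and the model CDF.
    xs = np.sort(samples)
    n = xs.size
    model_cdf = 1.0 - np.exp(-lam * xs)
    upper = np.arange(1, n + 1) / n - model_cdf
    lower = model_cdf - np.arange(0, n) / n
    ks = float(max(upper.max(), lower.max()))
    return lam, kl, ks
```

Both distances are close to zero when the samples really are exponentially distributed and grow as the histogram departs from the fitted model, which is the sense in which smaller values in Table I indicate a better fit.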

The quantitative experiment further supports the above hypothesis. It also shows that the three-parameter distribution, GD, with its higher degree of freedom (DoF), outperforms the other distributions. The one-parameter exponential distribution also fits the noise distribution very well. Nevertheless, the parameter estimation of GD is more difficult, computationally heavier and more time consuming than that of the exponential distribution. To balance accuracy and computational load, the exponential distribution is adopted to fit the noise; its cumulative distribution function (CDF) is

$$F(x) = 1 - e^{-\lambda x}, \quad x \ge 0 \tag{10}$$

where $\lambda$ is the rate parameter of the distribution.

(3) Binarization. Once the distribution of pixel value differences of consecutive frames is fitted, we can use an adaptive binarization threshold to determine the outliers. In particular, we introduce a predefined probability ($P$) to derive the binarization threshold, namely [6]

$$T = F^{-1}(P)$$

where $F^{-1}$ denotes the inverse function of the distribution in Eq. (10). We set $P$ as here. If the pixel value difference is bigger than the binarization threshold $T$, the pixel is taken as an outlier. By virtue of such a binarization algorithm, we turn the difference image into a binary image in which inliers are zeros and outliers are ones.
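For an exponential model with rate λ, the inverse CDF has the closed form T = -ln(1 - P) / λ, so the adaptive threshold of each local area follows directly from the mean of its difference values. The sketch below applies this tile by tile. It is a minimal sketch, not the authors' implementation: the tile size `tile` and probability `p` are hypothetical placeholders, because the values used in the paper are not recoverable from the text, and the rate is estimated as one over the tile mean.

```python
import numpy as np


def exponential_threshold(diff_tile, p):
    """Return the threshold T = -ln(1 - p) / lam for one local area.

    lam is the maximum-likelihood rate of an exponential fit (1 / mean).
    """
    lam = 1.0 / max(float(np.mean(diff_tile)), 1e-12)
    return -np.log(1.0 - p) / lam


def binarize_local(diff, tile=64, p=0.99):
    """Binarize an absolute difference image with one threshold per tile.

    tile and p are hypothetical values chosen for illustration only.
    """
    diff = np.asarray(diff, dtype=float)
    out = np.zeros(diff.shape, dtype=bool)
    for r in range(0, diff.shape[0], tile):
        for c in range(0, diff.shape[1], tile):
            block = diff[r:r + tile, c:c + tile]
            out[r:r + tile, c:c + tile] = block > exponential_threshold(block, p)
    return out
```

Because each tile gets its own threshold, the same difference value can be an outlier in a quiet local area and an inlier in a noisy one, which is the behaviour a single fixed threshold cannot provide.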

Figure 9: Visualization of regular noises: the stationary corners or edges. The rectangles denote the starting locations, while the filled blue points are their terminal positions.

(4) Logical AND. These outliers comprise the vehicles and other noise. In a binary difference image, a true vehicle target appears as two symmetrical blobs: one blob indicates the current position, and the other indicates the previous or future position. We take the intersection of two binary inter-frame difference images (explicitly, the difference between frame $t-k$ and frame $t$ versus the difference between frame $t$ and frame $t+k$, wherein $k$ is set as 10) to determine the current positions of vehicles. It is a Boolean operation in which only one and one yield one, and is therefore named "Logical AND". In addition to eliminating this ambiguity, the logical AND also reduces the remaining noise, since noise appears randomly.
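The logical AND step can be sketched as follows: a moving vehicle leaves a blob at its current position in both the backward and the forward difference image, so intersecting the two binary images keeps only the current position. This is an illustrative sketch; the function name and the `binarize` callback (any function that maps a difference image to a Boolean mask) are assumptions introduced here.

```python
import numpy as np


def detect_current_positions(prev_frame, cur_frame, next_frame, binarize):
    """Intersect the binarized backward and forward difference images.

    Frames are cast to float first so that unsigned 8-bit subtraction does
    not wrap around. `binarize` is a caller-supplied thresholding function.
    """
    prev_f = np.asarray(prev_frame, dtype=float)
    cur_f = np.asarray(cur_frame, dtype=float)
    next_f = np.asarray(next_frame, dtype=float)
    backward = np.abs(cur_f - prev_f)   # blobs at previous and current position
    forward = np.abs(next_f - cur_f)    # blobs at current and future position
    return binarize(backward) & binarize(forward)
```

Random noise rarely exceeds the threshold at the same pixel in both difference images, so the intersection also suppresses much of it.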

3.3 Region Growing and Multi-Morphological-Cue Based Discrimination

Figure 10: Detection and region growing results of vehicle targets and a noise. (a) - (d) are pseudo-color images converted from the original RGB images for viewing, where real vehicles are marked with black rectangles, and (c), (d) are falsely detected edges or corners of buildings. (e) - (h) are the corresponding foregrounds generated by the detector. (i) - (l) are the reconstructed geometries.

Some noises still exist among the candidates. They include irregular noises and regular noises. Irregular noises result from dramatic illumination variations or slight deviations between frames, and may randomly appear in some consecutive frames. Generally, it is not necessary to design an algorithm for pruning the irregular noises, since KF tracking gradually eliminates this type of noise. In contrast, we term the background moving patterns regular noises. In particular, these noises are caused by the slight deviation due to the satellite motion. Such deviation may appear, and be detected, at the edges or corners of some static objects in the frames. Even worse, these detected corners or edges exhibit a relative moving pattern with respect to the moving background. The regular noises have to be pruned by our algorithms. We visualize the regular noises in Fig. 9.

To this end, we propose a novel discrimination algorithm using geometrical and neighborhood information. It includes two parts, i.e., region growing to reconstruct the candidate geometry, and multi-morphological-cue based discrimination to distinguish noises from vehicles. The key idea is that a vehicle target is a singular point in the 2D temporal domain, whereas regular noises share similar temporal distributions with their neighborhood in the frames. If the candidates can be connected with their similar neighboring pixels, we can differentiate vehicles from regular noises in terms of shape: a vehicle target approximates a rectangle, while regular noises can take arbitrary shapes.

(1) Region growing. A region growing algorithm is proposed to connect a candidate with its similar neighboring pixels, that is, to reconstruct the whole geometry of the candidate. As Fig. 10 shows, the detector yields only a partial geometry of a candidate, because the positions of the candidate in two adjacent frames overlap. Region growing uses the detected partial geometry to restore the whole geometry of the candidate. The neighborhood area is defined as a pixel window in the candidate. A Gaussian distribution is employed to measure the similarity between a neighboring pixel and the candidate. The CDF of the Gaussian distribution is

$$\Phi(x) = \frac{1}{2}\left[ 1 + \operatorname{erf}\left( \frac{x - \mu}{\sigma \sqrt{2}} \right) \right]$$

where $\operatorname{erf}$, $\mu$ and $\sigma$ denote the error function, the mean and the standard deviation, respectively. The parameter values of the Gaussian distribution are estimated from the values of the pixels of the candidate. Given the predefined lower bound probability $P_l$ and upper bound probability $P_u$, a range $[T_l, T_u]$ is obtained, i.e.

$$T_l = \Phi^{-1}(P_l), \qquad T_u = \Phi^{-1}(P_u)$$

where $T_l$ and $T_u$ represent the lower and upper bound thresholds, respectively. $P_l$ and $P_u$ are set as and , symmetrically. If the grey-level value of a pixel inside the searching window is in $[T_l, T_u]$, the pixel is re-classified as a candidate pixel. Those new candidate pixels connected with the original candidate are retained. The result of region growing is shown in Fig. 10 (i) - (l): the detector captures only partial geometries of the candidates, and the proposed region growing algorithm then reconstructs their whole geometries. The region growing result also demonstrates the feasibility of discrimination in terms of shape.
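The region growing step can be sketched as a breadth-first search that starts from the detected candidate pixels and absorbs connected neighbors whose grey level lies inside the Gaussian probability range. This is a minimal sketch under stated assumptions, not the authors' implementation: the bound probabilities, the window margin `window`, the 4-connectivity, and the floor `min_sigma` on the standard deviation (to avoid a degenerate range when the seed pixels are nearly identical) are all hypothetical choices, since the paper's values are not recoverable from the text.

```python
from collections import deque
from statistics import NormalDist

import numpy as np


def grow_region(gray, seed_mask, p_low=0.01, p_high=0.99, window=3, min_sigma=1.0):
    """Grow a candidate from its seed pixels using a Gaussian similarity range.

    All default parameter values are hypothetical placeholders.
    """
    gray = np.asarray(gray, dtype=float)
    grown = np.asarray(seed_mask, dtype=bool).copy()
    values = gray[grown]
    # Gaussian parameters estimated from the candidate's own pixels.
    dist = NormalDist(float(values.mean()), max(float(values.std()), min_sigma))
    t_low, t_high = dist.inv_cdf(p_low), dist.inv_cdf(p_high)

    # Searching window: bounding box of the seed, enlarged by `window` pixels.
    rows, cols = np.nonzero(grown)
    r0, r1 = max(rows.min() - window, 0), min(rows.max() + window, gray.shape[0] - 1)
    c0, c1 = max(cols.min() - window, 0), min(cols.max() + window, gray.shape[1] - 1)

    queue = deque(zip(rows.tolist(), cols.tolist()))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if (r0 <= rr <= r1 and c0 <= cc <= c1 and not grown[rr, cc]
                    and t_low <= gray[rr, cc] <= t_high):
                grown[rr, cc] = True   # connected and similar: keep it
                queue.append((rr, cc))
    return grown
```

Only pixels reachable from the seed through similar pixels are added, which mirrors the requirement that new candidate pixels be connected with the original candidate.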

Figure 11: Some cases of hypothesis-to-ground-truth associations. In panel (a), the black dashed line, the red solid line with filled circles, and the grey filled polygon denote the accurate trajectory of a vehicle target, the manually annotated ground truth of the trajectory, and the area where an output belongs to the trajectory, respectively. (a) illustrates that the manually annotated ground truth cannot completely fit the accurate trajectory. (b) shows a hypothesis whose trajectory fits the ground truth in panel (a). (c) shows that two hypotheses cover the same ground truth, wherein IDSW is counted. In panel (d), the hypothesis with ID 7 outperforms the hypothesis with ID 18 during the first three frames. However, the former gradually loses the main pattern of the ground truth, while the latter follows the ground truth more closely.

(2) Multi-Morphological-Cue Based Discrimination. After region growing, we adopt a series of morphological properties to differentiate vehicle targets from regular noises. The employed morphological cues include area, extent, major axis length and eccentricity, as follows.

Area. The number of the pixels of a candidate.

Extent. The ratio of pixels of a candidate to the area of the bounding box of the candidate.

Major Axis Length. If an ellipse has the same normalized second central moments as the connected region of a candidate, the major axis length of the ellipse is defined as the major axis length of the candidate.

Eccentricity. The eccentricity of a candidate is equal to the eccentricity of the above ellipse.

The area and the major axis length cues represent the size of a candidate, while the extent and the eccentricity cues measure the similarity between a candidate and a rectangle. The pixel spacing of the satellite video corresponds to about 1 meter on the ground. Therefore, these morphological cues reflect the real shape of a vehicle. They constitute a robust feature of vehicles because vehicles are rigid bodies that do not deform in satellite videos.
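The four cues can be computed directly from the pixel coordinates of a candidate. The sketch below is illustrative only: the function name `morphological_cues` is an assumption, the ellipse is derived from the normalized second central moments as described above, and any pixel-size correction that image-processing toolboxes may apply to those moments is omitted.

```python
import math

def morphological_cues(pixels):
    """Area, extent, major axis length and eccentricity of one candidate.

    pixels: iterable of (row, col) coordinates of a connected region.
    """
    pts = list(pixels)
    area = len(pts)
    rows = [r for r, _ in pts]
    cols = [c for _, c in pts]

    # Extent: candidate pixels divided by the bounding-box area.
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    extent = area / (height * width)

    # Normalized second central moments of the region.
    mr, mc = sum(rows) / area, sum(cols) / area
    urr = sum((r - mr) ** 2 for r in rows) / area
    ucc = sum((c - mc) ** 2 for c in cols) / area
    urc = sum((r - mr) * (c - mc) for r, c in pts) / area

    # Eigenvalues of the 2x2 moment matrix [[urr, urc], [urc, ucc]].
    common = math.sqrt((urr - ucc) ** 2 + 4.0 * urc ** 2)
    lam_max = (urr + ucc + common) / 2.0
    lam_min = (urr + ucc - common) / 2.0

    # An ellipse with semi-axis a has variance a^2 / 4 along that axis,
    # so the full major axis length is 4 * sqrt(lam_max).
    major_axis_length = 4.0 * math.sqrt(lam_max)
    if lam_max > 0:
        eccentricity = math.sqrt(max(0.0, 1.0 - lam_min / lam_max))
    else:
        eccentricity = 0.0  # a single pixel has no elongation
    return {
        "area": area,
        "extent": extent,
        "major_axis_length": major_axis_length,
        "eccentricity": eccentricity,
    }

# A 2 x 4 pixel rectangle, roughly the footprint of a small vehicle at 1 m spacing.
rect = [(r, c) for r in range(2) for c in range(4)]
cues = morphological_cues(rect)
print(cues)
```

A filled rectangle gives an extent of 1, whereas irregular noise blobs typically fill less of their bounding box, which is what makes extent useful as a shape cue.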

4 Metrics and Protocols in Performance Evaluation

The widely used evaluation metrics for object detection are the precision/recall curve and average precision (AP) [25]. These metrics are standard in traditional visual object benchmarks, e.g. PASCAL VOC [25] and MOT [26]. For the novel task of detecting tiny moving objects, however, these metrics alone are inadequate for evaluating performance. The key challenge, again, is that each vehicle occupies only several pixels in each video frame. To this end, we systematically introduce a complete evaluation protocol for measuring algorithm performance on our detection task in Sec. 4.2. The protocol is built upon the existing evaluation metrics described in Sec. 4.1.

4.1 Evaluation Metrics

Generally, a single criterion cannot assess detection and tracking performance objectively and comprehensively. To the best of our knowledge, this is the first time a systematic series of evaluation metrics, including precision, recall, Jaccard similarity, etc., has been introduced for this task. Their definitions are as follows.

Precision. In evaluating detection performance, the key step is to determine whether a hypothesis is a true positive (TP), i.e. a real target correctly covered by an output, or a false positive (FP), i.e. a non-target falsely covered by an output. Real targets that are missed are called false negatives (FNs). Precision is the ratio of correctly detected targets to all detected targets, i.e.

$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$

Recall. Recall measures the ability of a detector to capture true targets. It is equal to the ratio of TP to the number of all existing true targets, namely

$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$

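$F_1$-score. The $F_1$-score is a traditional criterion for binary classification between targets of interest and non-targets. It is equal to the harmonic mean of Precision and Recall, i.e.

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$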

Jaccard Similarity. Jaccard similarity (JS) is a criterion for evaluating tracking performance, which integrates TP, FP and FN as follows [27]:

$$JS = \frac{TP}{TP + FP + FN}.$$
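The four count-based metrics above depend only on the numbers of TPs, FPs and FNs. A minimal sketch follows; the function name `detection_scores` and the convention of returning 0 for empty denominators are assumptions made for illustration.

```python
def detection_scores(tp, fp, fn):
    """Return (precision, recall, f1, jaccard) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2.0 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, jaccard

# Toy example: 6 vehicles found, 2 false alarms, 4 vehicles missed.
p, r, f1, js = detection_scores(tp=6, fp=2, fn=4)
print(p, r, round(f1, 4), js)
```

Because Jaccard similarity penalizes FPs and FNs in the same denominator, it is never larger than either Precision or Recall.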

| District | Frame Rate (fps) | Resolution (m) | Duration (s) | Height×Width (pixel) | Latitude and Longitude of Frame Corner (Top Left, Top Right, Bottom Left, Bottom Right) |
|---|---|---|---|---|---|
| Valencia, Spain | 20 | 1.0 | 29 | 3072×4096 | |

Table II: Information of the experimental satellite video.

MOTA. Multiple object tracking accuracy (MOTA) quantifies the overall performance of multiple object tracking. It is defined as [26, 28]

$$\mathrm{MOTA} = 1 - \frac{\sum_i \left( FN_i + FP_i + IDSW_i \right)}{\sum_i GT_i},$$

where $FN_i$, $FP_i$ and $GT_i$ represent the number of FNs, FPs and ground truths, respectively, in frame i. $IDSW_i$ counts the identity switches of the trajectories associated with a ground truth; please refer to [26] for more details. The MOTA score ranges from $-\infty$ to 1, and the larger it is, the better the detection and tracking performance.

Figure 12: Some vehicle samples and a noise. These images are scaled for viewing. (a), (b), (c) and (d) are RGB images where red rectangles represent real vehicles, while a noise signal exists in image (d). Greyscale images (e), (f), (g) and (h) are corresponding enhanced images of (a), (b), (c) and (d) to improve contrast for viewing.

MOTP. Multiple object tracking precision (MOTP) measures the positioning precision of the detection algorithms. It can be written as [26, 28, 20]

$$\mathrm{MOTP} = \frac{\sum_i \mathrm{IoU}_i}{\sum_i c_i},$$

where $\mathrm{IoU}_i$ (intersection over union [25]) denotes the sum of the overlap ratios between matched hypotheses and ground truths in frame i, and $c_i$ is the number of matches between ground truths and hypotheses in frame i. From this definition, the MOTP score ranges from 0 to 1, and the larger it is, the more precise the derived locations are.
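Both tracking metrics accumulate per-frame quantities before taking a ratio. The sketch below illustrates this under an assumed data layout: each frame is a dictionary with hypothetical keys `fn`, `fp`, `idsw`, `gt`, `iou_sum` and `matches`.

```python
def mota(frames):
    """1 - (sum of FN + FP + IDSW over frames) / (sum of ground truths)."""
    errors = sum(f["fn"] + f["fp"] + f["idsw"] for f in frames)
    total_gt = sum(f["gt"] for f in frames)
    return 1.0 - errors / total_gt

def motp(frames):
    """(sum of IoUs of matched pairs) / (number of matched pairs)."""
    total_iou = sum(f["iou_sum"] for f in frames)
    total_matches = sum(f["matches"] for f in frames)
    return total_iou / total_matches

frames = [
    {"fn": 1, "fp": 0, "idsw": 0, "gt": 4, "iou_sum": 1.5, "matches": 3},
    {"fn": 0, "fp": 2, "idsw": 1, "gt": 4, "iou_sum": 2.2, "matches": 4},
]
print(mota(frames), round(motp(frames), 4))

# A detector that floods the scene with false alarms gets a negative MOTA.
noisy = [{"fn": 2, "fp": 30, "idsw": 0, "gt": 8, "iou_sum": 2.4, "matches": 6}]
print(mota(noisy))
```

The second example shows why MOTA has no lower bound: the number of FPs is not limited by the number of ground truths.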

Figure 13: Annotated areas of the satellite video.

4.2 An Evaluation Protocol

The IoU of the bounding boxes of a hypothesis and a ground truth is adopted as the similarity between them. Similar to the aforementioned hypothesis-to-track association, the matching of multiple hypotheses and ground truths also resorts to the Hungarian algorithm [19], applied in the spatiotemporal domain rather than in each frame separately.

Some cases of associations are shown in Fig. 11 (b)-(d); the actual situation is more complicated than these. The protocol for evaluating the proposed detection and tracking algorithm is as follows:

  1. Compute the IoU of each hypothesis and each ground truth. The reciprocal of an IoU is taken as the distance between a hypothesis and a ground truth. Note that a very small value is added to all IoUs to avoid zero denominators. The matching distance threshold is empirically set to 50, which corresponds to a very small IoU value, 0.02. It means that if a hypothesis and a ground truth overlap, the hypothesis is regarded as covering the ground truth. In contrast, the IoU threshold is set to 0.5 for the detection of general objects, such as pedestrians, aeroplanes, bicycles, etc., which is not suitable for the tiny vehicles in satellite videos. The smaller the distance, the smaller the cost of the association between the hypothesis and the ground truth. Given M hypotheses and N ground truths, all distances constitute a cost matrix, i.e.

    $$C^{t} = \begin{bmatrix} d^{t}_{11} & \cdots & d^{t}_{1N} \\ \vdots & \ddots & \vdots \\ d^{t}_{M1} & \cdots & d^{t}_{MN} \end{bmatrix}, \qquad d^{t}_{mn} = \frac{1}{\mathrm{IoU}^{t}_{mn} + \epsilon},$$

    where $\epsilon$ is the small added value and t indicates the frame number.

  2. Repeat the first step for K consecutive frames. The cost matrices of these frames then constitute a cost tensor, namely

    $$\mathcal{C} = \left[ C^{1}, C^{2}, \ldots, C^{K} \right] \in \mathbb{R}^{M \times N \times K}. \tag{22}$$

  3. The optimal associations of hypotheses and ground truths are obtained using the Hungarian algorithm. A time window is employed to reduce the computational load and memory consumption. Explicitly, the association is carried out over ten consecutive frames rather than all frames, namely the K in Eq. (22) is set to 10.

  4. Repeat steps 1-3.

  5. Finally, the metrics are calculated based on the associations.
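The first step of the protocol, building a reciprocal-IoU cost matrix and gating it at a distance of 50, can be sketched for a single frame as follows. This is an illustration under stated assumptions, not the evaluation code used in the experiments: boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, `EPS` is an assumed value for the small constant, and the minimum-cost one-to-one assignment is found by exhaustive search as a stand-in for the Hungarian algorithm, which is only practical for a handful of boxes. The aggregation over K frames is not shown.

```python
from itertools import permutations

EPS = 1e-6           # assumed small constant that avoids zero denominators
MAX_DISTANCE = 50.0  # gating threshold from the text (IoU of about 0.02)

def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def cost_matrix(hyps, gts):
    """M x N matrix of reciprocal-IoU distances for one frame."""
    return [[1.0 / (iou(h, g) + EPS) for g in gts] for h in hyps]

def match(hyps, gts):
    """Minimum-cost one-to-one matching, then gating (tiny inputs only)."""
    cost = cost_matrix(hyps, gts)
    m, n = len(hyps), len(gts)
    if m <= n:
        candidates = ([(i, p[i]) for i in range(m)]
                      for p in permutations(range(n), m))
    else:
        candidates = ([(p[j], j) for j in range(n)]
                      for p in permutations(range(m), n))
    best_pairs, best_total = [], float("inf")
    for pairs in candidates:
        total = sum(cost[i][j] for i, j in pairs)
        if total < best_total:
            best_pairs, best_total = pairs, total
    # Discard associations whose distance exceeds the threshold.
    return sorted((i, j) for i, j in best_pairs if cost[i][j] <= MAX_DISTANCE)

hyps = [(0, 0, 4, 4), (10, 10, 14, 14), (100, 100, 104, 104)]
gts = [(11, 11, 15, 15), (1, 0, 5, 4)]
print(match(hyps, gts))  # pairs of (hypothesis index, ground-truth index)
```

Matched pairs count as TPs, unmatched hypotheses as FPs, and unmatched ground truths as FNs, which is the input needed by the metrics of Sec. 4.1.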

5 Experiments and Discussion

5.1 Experimental Setups

Figure 14: One shot of ground truth annotation. (a) shows the locations of vehicles, while (b) represents their corresponding IDs.
Figure 15: Tiny vehicle detection results for four frames: (a) Frame 50, (b) Frame 100, (c) Frame 150, (d) Frame 200. Please refer to the enlargements of the scenes in Fig. 20.

Dataset. The experimental satellite video is provided by CGSTL. The video was captured on March 7, 2017, when a video satellite recorded a region in Valencia, Spain. The information of the satellite video is detailed in Table II. The study video is provided free of charge by CGSTL for scientific research; satellite videos can also be purchased from their official website. The annotated dataset can be downloaded from the first author's website.

Competitors. We compare the proposed algorithm with two other algorithms for detecting tiny moving vehicles, GMM and ViBe. For a fair comparison, GMM and ViBe are linked with the same KF tracking framework.

Figure 16: The tracks of four vehicles: (a) ID = 19, (b) ID = 359, (c) ID = 92, (d) ID = 161. The yellow lines indicate the tracks detected by our algorithm. The yellow rectangles mark the initial positions, while the filled red points denote the terminal locations. In contrast, the other two methods, GMM and ViBe, cannot be used to detect or track these vehicles.
Figure 17: Detecting scores of the proposed algorithm and baselines frame by frame.
Figure 18: Foreground segmentation results of Frame 100. (a) the proposed algorithm, (b) ViBe, (c) GMM.

Ground-truth. We annotated the experimental satellite video to quantitatively evaluate the proposed algorithm. Annotating satellite videos is arduous work, because the vehicles lack distinctive features and are hard to distinguish from the background with the naked eye, as illustrated in Fig. 12. In Fig. 12 (a), (b) and (c), there are no significant differences between real vehicle targets and the background, especially for dark vehicles such as the one in Fig. 12 (b). Fig. 12 (d) shows that a noise signal may look like a stationary vehicle and resemble a true positive, which results in ambiguities in the interpretation work. To tackle this problem, we inspect the previous and following frames to determine whether some connected pixels form a moving vehicle. In other words, we go through a short sequence of consecutive frames to seek out a moving region. Besides the difficulty of identifying vehicles, we can hardly annotate all vehicles in a scene covering about 3 km × 4 km frame by frame. We therefore seek a trade-off between workload and annotation accuracy: first, three areas of 500×500 pixels are randomly selected from the video for annotation, as illustrated in Fig. 13; second, we manually annotate the vehicles every 10 frames, while the ground truth of the other frames is obtained by linear interpolation. Fig. 14 shows a representative scene of the annotation. The numbers of ground-truth vehicles in area 1, area 2 and area 3 are 49, 41 and 29, respectively. We used the Ground Truth Labeler app in MATLAB 2018a to help annotate the satellite video.
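The interpolation of annotations between manually labelled frames can be sketched as follows. The function name `interpolate_boxes` and the box layout `(x_min, y_min, x_max, y_max)` are assumptions made for illustration; the labelling tool may store annotations differently.

```python
def interpolate_boxes(keyframes):
    """Linearly interpolate boxes between annotated frames.

    keyframes: dict mapping frame index -> (x_min, y_min, x_max, y_max)
    Returns a dict with one box for every frame between the first and
    last annotated frame (inclusive).
    """
    frames = sorted(keyframes)
    if not frames:
        return {}
    out = {}
    for f0, f1 in zip(frames, frames[1:]):
        b0, b1 = keyframes[f0], keyframes[f1]
        for f in range(f0, f1):
            w = (f - f0) / (f1 - f0)  # 0 at f0, approaching 1 at f1
            out[f] = tuple((1.0 - w) * a + w * b for a, b in zip(b0, b1))
    last = frames[-1]
    out[last] = tuple(float(v) for v in keyframes[last])
    return out

# One vehicle annotated at frames 0 and 10, moving 10 pixels to the right.
track = interpolate_boxes({0: (0, 0, 4, 2), 10: (10, 0, 14, 2)})
print(track[5])
```

Linear interpolation assumes roughly constant velocity within each 10-frame gap, i.e. half a second at 20 fps, so the interpolated boxes are only approximate when a vehicle accelerates or turns inside the gap.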

5.2 Experimental Results and Discussion

Figure 19: Vehicle candidates yielded by our method, ViBe and GMM in frame 100. The first, second and third columns show the results produced by our method, ViBe and GMM, respectively. The first, second and third rows correspond to the vehicle candidates in Area 1, Area 2 and Area 3, respectively.

Quantitative results. The quantitative results of the comparison experiments are detailed in Table III, for each area and on average. Table III shows that the proposed framework outperforms ViBe and GMM on every criterion. Specifically, ViBe and GMM detect around 50% of the vehicles, but about 90% of their outputs are FPs. This high false positive rate severely degrades their performance. The reason is that ViBe and GMM neglect the background motion entirely and also fail to separate moving vehicles from noise. On the other hand, they are both sensitive to variations of pixel values, so they report only a moderate Recall of about 50%. In contrast, our method perceives pixel motion rather than mere pixel variation. Owing to its high Precision, our method reports good $F_1$-score, Jaccard Similarity and MOTA, all of which are closely related to Precision. The Recall of our method is relatively low, although it is about 10 percentage points higher than that of ViBe and GMM, and this affects Jaccard Similarity and MOTA to some extent. The location precision metric, MOTP, shows that the average overlap between ground truths and hypotheses is 0.52. The very small size of the vehicle targets makes it difficult to locate them precisely. Nevertheless, we think the location precision and pixel-level deviation meet the demands of applications, especially urban traffic surveillance.

Qualitative results. We give some visualization results of detecting multiple tiny moving vehicles. Fig. 15 shows four frames: 50, 100, 150 and 200. Intuitively, many vehicles on the main arteries are detected by the proposed algorithm, and there are a few FPs, because most of the annotated labels lie on the main arteries. In addition, the number of detected vehicles grows gradually frame by frame, which illustrates that the tracking also facilitates the detection.

To show the detection and tracking details clearly, we provide four enlargements in Fig. 20. Fig. 20 reveals the dynamics of not only the vehicle motion but also the detection and tracking procedure. Once a vehicle has been detected in two consecutive frames, a unique ID is assigned to it and a KF is allotted to it at the same time. The position and velocity provided by the detector are used to initialize the KF. Subsequently, according to the designed framework, the detector provides the current state frame by frame, while the KF continuously maintains the latest state of the tracked vehicle and updates its system model. Therefore, the proposed processing workflow can detect the tiny moving vehicles accurately and precisely.

Fig. 16 shows four trajectories tracked by the proposed algorithm. These trajectories include linear and curved tracks, which supports the above argument that a series of linear procedures can closely approximate a non-linear procedure. The four trajectories also cover different traffic scenarios, including a straight artery in Fig. 16(a), a right turn in Fig. 16(b) and two roundabouts in Fig. 16(c)-(d). This shows that the proposed algorithm can not only handle simple traffic conditions but also adapt to complex traffic scenarios.

| Area | Method | R. (%) | P. (%) | $F_1$ | JS | MA | MP |
|---|---|---|---|---|---|---|---|
| 1 | Ours | 64.15 | 81.71 | 0.72 | 0.56 | 0.46 | 0.50 |
| 1 | ViBe | 51.72 | 15.10 | 0.23 | 0.13 | -2.45 | 0.39 |
| 1 | GMM | 43.82 | 12.29 | 0.19 | 0.11 | -2.75 | 0.37 |
| 2 | Ours | 62.80 | 82.23 | 0.71 | 0.55 | 0.47 | 0.52 |
| 2 | ViBe | 61.70 | 9.14 | 0.16 | 0.09 | -5.56 | 0.45 |
| 2 | GMM | 61.83 | 7.5 | 0.13 | 0.07 | -7.08 | 0.39 |
| 3 | Ours | 60.42 | 77.26 | 0.68 | 0.51 | 0.41 | 0.56 |
| 3 | ViBe | 41.53 | 6.76 | 0.12 | 0.06 | -5.35 | 0.47 |
| 3 | GMM | 46.10 | 6.34 | 0.11 | 0.06 | -6.41 | 0.42 |
| Avg. | Ours | 63.06 | 81.04 | 0.71 | 0.55 | 0.46 | 0.52 |
| Avg. | ViBe | 52.86 | 10.74 | 0.18 | 0.10 | -3.92 | 0.43 |
| Avg. | GMM | 49.66 | 8.79 | 0.15 | 0.08 | -4.72 | 0.39 |

Table III: Evaluation scores of the proposed algorithm and baseline algorithms. R., P., $F_1$, JS, MA and MP are short for Recall, Precision, $F_1$-score, Jaccard Similarity, MOTA and MOTP.
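As a consistency check on Table III, the $F_1$-score and Jaccard Similarity can be recomputed from Precision (P) and Recall (R) alone. Since TP + FP = TP/P and TP + FN = TP/R, the Jaccard similarity satisfies JS = PR / (P + R - PR). The sketch below applies this to the Area 1 rows; the function names are illustrative.

```python
def f1_from_pr(p, r):
    """Harmonic mean of precision and recall."""
    return 2.0 * p * r / (p + r)

def jaccard_from_pr(p, r):
    """JS = TP / (TP + FP + FN), rewritten in terms of P and R."""
    return p * r / (p + r - p * r)

# (Precision, Recall) of the Area 1 rows of Table III, as fractions.
area1 = {
    "Ours": (0.8171, 0.6415),
    "ViBe": (0.1510, 0.5172),
    "GMM": (0.1229, 0.4382),
}
for name, (p, r) in area1.items():
    print(name, round(f1_from_pr(p, r), 2), round(jaccard_from_pr(p, r), 2))
```

The recomputed values agree with the tabulated $F_1$ and JS entries for Area 1 to two decimal places.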

Fig. 18 shows the foreground segmentation results generated by our algorithm and the baseline algorithms. In Fig. 18 (a), our detector yields hypotheses composed of true vehicles and a little noise, while noise vastly outnumbers true vehicles in Fig. 18 (b) and (c). The candidate vehicles generated by the proposed algorithm and its competitors are shown in Fig. 19. As Fig. 19 shows, our detector mainly perceives the moving vehicle pixels on the roads and yields limited false positives, while ViBe and GMM indiscriminately detect any variation of the pixels. This illustrates that ViBe and GMM are unable to separate the motion of vehicles from the slow and slight motion of the background. ViBe and GMM estimate the pattern of each pixel using a non-parametric model and a Gaussian mixture distribution, respectively. This strategy is very sensitive to pixel variation and works well in common video processing, especially with a stationary camera. However, it cannot cope with satellite videos because it neglects local or neighborhood information. Our detection algorithm focuses on a local area rather than a single pixel, and can therefore adapt to the moving background of satellite videos.

Frame-by-frame quantitative results. To give a detailed picture of the detection performance of the proposed algorithm and the baselines, some scores are presented frame by frame in Fig. 17. Fig. 17 shows the most widely used detection metrics: $F_1$-score, Precision and Recall. It further illustrates that the proposed algorithm outperforms the baselines mainly through its high Precision. In Fig. 17, the $F_1$-score of ViBe and GMM decays rapidly at the beginning and then stays low as the detectors become trapped by the moving background. For ViBe and GMM, the background motion completely masks the vehicle motion, leading to their failure. Likewise, this experiment demonstrates the ability of the proposed algorithm to separate the background motion from the vehicle motion.

6 Conclusion

Satellite videos have the unique capability of observing a city-scale region. This paper addresses the detection of tiny vehicles in satellite videos, and we design a practical framework for detecting and tracking tiny moving vehicles. To our knowledge, it is the first time a probabilistic distribution has been adopted to represent the patterns of noise in the spatiotemporal domain, which helps us to differentiate candidates from noise. We further propose a multi-morphological-cue based discrimination algorithm to distinguish true vehicle targets from the remaining noise. Another contribution is the introduction of a series of evaluation metrics and a complete evaluation protocol. The proposed algorithm is tested on three manually annotated areas of a satellite video and compared with baseline algorithms. These experiments demonstrate the good performance of our algorithm.

7 Acknowledgment

We are grateful to CGSTL for providing the satellite video data used in this study.

Figure 20: The enlargements of four scenes of Fig. 15.


  • [1] Y. Tian, R.S. Feris, H. Liu, A. Hampapur, and M.-T. Sun. Robust detection of abandoned and removed objects in complex surveillance videos. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 41(5):565–576, 2011.
  • [2] Chia-Chih Chen and J. K. Aggarwal. Recognizing human action from a far field of view. In Motion and Video Computing, 2009. WMVC ’09. Workshop on, 2009.
  • [3] Bruce D Lucas, Takeo Kanade, et al. An iterative image registration technique with an application to stereo vision. 1981.
  • [4] A. A. Efros, A. C. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In IEEE International Conference on Computer Vision, pages 726–733, 2003.
  • [5] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.
  • [6] Wei Ao, Feng Xu, Yongchen Li, and Haipeng Wang. Detection and discrimination of ship targets in complex background from spaceborne ALOS-2 SAR images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 11(2):536–550, 2018.
  • [7] Yu Zhou, Haipeng Wang, Feng Xu, and Ya-Qiu Jin. Polarimetric SAR image classification using deep convolutional neural networks. IEEE Geoscience and Remote Sensing Letters, 13(12):1935–1939, 2016.
  • [8] Feng Xu, Ya-Qiu Jin, and Alberto Moreira. A preliminary study on SAR advanced information retrieval and scene reconstruction. IEEE Geoscience and Remote Sensing Letters, 13(10):1443–1447, 2016.
  • [9] Ahlem Walha, Ali Wali, and Adel M Alimi. Video stabilization with moving object detecting and tracking for aerial video surveillance. Multimedia Tools and Applications, 74(17):6745–6767, 2015.
  • [10] Brian P Jackson and A Ardeshir Goshtasby. Registering aerial video images using the projective constraint. IEEE Transactions on image processing, 19(3):795–804, 2010.
  • [11] Heng Guo, Shuaicheng Liu, Tong He, Shuyuan Zhu, Bing Zeng, and Moncef Gabbouj. Joint video stitching and stabilization from moving cameras. IEEE Transactions on Image Processing, 25(11):5491–5503, 2016.
  • [12] Edgardo Molina and Zhigang Zhu. Persistent aerial video registration and fast multi-view mosaicing. IEEE Transactions on Image Processing, 23(5):2184–2192, 2014.
  • [13] Nan Jiang and Wenyu Liu. Data-driven spatially-adaptive metric adjustment for visual tracking. IEEE Transactions on Image Processing, 23(4):1556–1568, 2014.
  • [14] Pakorn Kaewtrakulpong and Richard Bowden. An improved adaptive background mixture model for realtime tracking with shadow detection. 2001.
  • [15] Chris Stauffer and W Eric L Grimson. Adaptive background mixture models for real-time tracking. In Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on., volume 2, pages 246–252. IEEE, 1999.
  • [16] Olivier Barnich and Marc Van Droogenbroeck. ViBe: a powerful random technique to estimate the background in video sequences. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 945–948. IEEE, 2009.
  • [17] Olivier Barnich and Marc Van Droogenbroeck. ViBe: A universal background subtraction algorithm for video sequences. IEEE Transactions on Image Processing, 20(6):1709–1724, 2011.
  • [18] K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Info. Theory, 1975.
  • [19] M. L. Miller, H. S. Stone, and I. J. Cox. Optimizing Murty’s ranked assignment method. IEEE Transactions on Aerospace and Electronic Systems, 33(32):851–862, 1997.
  • [20] Rainer Stiefelhagen, Keni Bernardin, Rachel Bowers, John Garofolo, Djamel Mostefa, and Padmanabhan Soundararajan. The CLEAR 2006 evaluation. In International Evaluation Workshop on Classification of Events, Activities and Relationships, pages 1–44. Springer, 2006.
  • [21] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • [22] Jiangjian Xiao, Hui Cheng, Harpreet Sawhney, and Feng Han. Vehicle detection and tracking in wide field-of-view aerial video. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 679–684. IEEE, 2010.
  • [23] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
  • [24] William H Press, Saul A Teukolsky, William T Vetterling, and Brian P Flannery. Numerical recipes in C, volume 2. Cambridge university press Cambridge, 1996.
  • [25] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
  • [26] Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
  • [27] Liang Liang, Hongying Shen, Pietro De Camilli, and James S Duncan. A novel multiple hypothesis based particle tracking method for clathrin mediated endocytosis analysis using fluorescence microscopy. IEEE Transactions on Image Processing, 23(4):1844–1857, 2014.
  • [28] Rangachar Kasturi, Dmitry Goldgof, Padmanabhan Soundararajan, Vasant Manohar, John Garofolo, Rachel Bowers, Matthew Boonstra, Valentina Korzhova, and Jing Zhang. Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):319–336, 2009.