Detecting Tiny Moving Vehicles in Satellite Videos
Abstract
In recent years, satellite videos have been captured by moving satellite platforms. In contrast to consumer, movie, and common surveillance videos, a satellite video can record a snapshot of a city-scale scene. In the broad field-of-view of satellite videos, each moving target is very tiny and usually composed of only several pixels per frame. Even worse, noise signals also exist in the video frames, since the background of each frame undergoes subpixel-level and uneven motion caused by the motion of the satellite. We argue that this is a new type of computer vision task, since previous technologies are unable to detect such tiny vehicles efficiently. This paper proposes a novel framework that can identify small moving vehicles in satellite videos. In particular, we offer a novel detection algorithm based on local noise modeling. We differentiate potential vehicle targets from noise patterns by an exponential probability distribution. Subsequently, a multi-morphological-cue based discrimination strategy is designed to further distinguish correct vehicle targets from the few remaining noises. Another significant contribution is a series of evaluation protocols that systematically measure the performance of tiny moving vehicle detection. We manually annotate a satellite video and use it to test our algorithms under different evaluation criteria. The proposed algorithm is also compared with state-of-the-art baselines, and the comparison demonstrates the advantages of our framework over these benchmarks.
Keywords: tiny object detection, probabilistic noise modeling, evaluation, vehicle detection.
1 Introduction
With recent advances in earth observation (EO) technology, satellite videos are captured by optical sensors that record consecutive images from a moving satellite platform. Satellite videos enable many potential applications, such as city-scale traffic surveillance, 3D reconstruction of urban buildings, and quake-relief efforts. For instance, Fig. 1 (a) shows a frame of a satellite video of the city of Valencia, Spain. As visualized in Fig. 1 (b), the areal map corresponding to each video frame covers about 12 square kilometers (3072 × 4096 pixels at a resolution of 1.0 m). Satellite videos can thus facilitate monitoring dynamic scenes at city scale. To supervise a city-scale scene efficiently, one primary and critical task is to detect and track the moving vehicles captured in satellite videos. However, there is no previous technique for detecting very tiny moving vehicles in satellite videos, due to the following challenges.
In the video, vehicles are moving and very tiny. In satellite videos, each vehicle is represented by only several pixels, so we essentially have to detect tiny moving vehicles. Fig. 2 shows two enlarged areas of the panorama in Fig. 1 (a). In these enlargements, a vehicle is composed of only several pixels without any distinctive color or texture. As a result, state-of-the-art detection algorithms, such as deep learning, can easily overfit the training vehicle data but fail to detect or describe the patterns of these moving vehicles. We observe that the most robust features of these moving vehicles come from their motion patterns, which, however, are very easily obscured by the moving background.
The frames of satellite videos cover a large-scale area and provide a dynamic scenario. In terms of the distance between the camera and the observed objects, there are near-field, medium-field, and far-field surveillance videos, and the extremely far-field satellite videos [1, 2]. A satellite video not only provides a broad field-of-view but also presents a very complex background. As shown in Fig. 4, the visual content of satellite videos may include roads, buildings, vegetation, football fields, etc. Furthermore, it covers a wide variety of traffic conditions, e.g., straight arteries, intersections, and roundabouts.
The background of satellite videos presents subpixel-level and uneven motion. The optical flow field [3, 4] of the above satellite video is shown in Fig. 3. It shows that the background is continuously moving and that the optical flow field is very uneven. Even worse, the relative motion in the satellite video is very complicated, since the satellite video frames are intrinsically the 2D projection of a sophisticated 3D movement of the satellite platform. On the other hand, since the satellites are very far away from the earth plane, we can only observe very slow motion among consecutive video frames. Such slow motion leads to small variations of the stationary pixels. Critically, we notice that the motion between two successive satellite video frames is always at the subpixel level, which also challenges the techniques of frame-by-frame video stabilization and registration. Overall, one key difficulty in detecting tiny moving vehicles is to differentiate the background motion from the moving vehicles; otherwise, the moving background would negatively affect the detection.
This paper focuses on detecting and tracking tiny vehicles of only a few pixels in satellite videos, which are very hard to identify and easily affected by noise. The patterns of moving vehicles may also be confused with the noise patterns caused by the complex moving background. Such noise patterns may appear as regular motion of stationary corners or edges, and thus further hinder the detection of tiny moving vehicles.
To tackle the problems mentioned above, we propose, for the first time, a framework that addresses the challenging task of detecting tiny moving vehicles in satellite videos. The whole framework is built upon a series of statistical tools. In particular, we propose a motion-based detection algorithm using a novel local noise modeling. We decompose each frame into two parts, i.e., an original image and an additive random 2D noise signal map. A probability distribution is used to fit the noise patterns, which helps us distinguish potential vehicle targets. A local tactic is applied to address intra-frame variations and discern inter-frame variations simultaneously. Then, a region growing algorithm is designed, and a discrimination algorithm based on multiple morphological cues is proposed to remove the remaining noise. The Kalman filter (KF) [5] is further used to track the vehicles. Extensive experiments are conducted on a real-world satellite video dataset to evaluate our framework and show the efficacy of the proposed models over the baselines.
The major contributions of this paper are fourfold:

To the best of our knowledge, the task of detecting tiny moving objects in satellite videos is studied here for the first time. To support further study of this problem, we contribute a satellite video dataset with labeled ground truth of tiny moving objects.

We propose a motion-based detection algorithm using a novel local noise modeling. The noise pattern is modeled by probabilistic distributions.

A region growing algorithm and a discrimination algorithm based on multiple morphological cues are further designed.

We propose, for the first time, a set of evaluation protocols that can systematically measure the performance of algorithms for analyzing tiny moving vehicles.
The remainder of the paper is organized as follows. First, Sec. 2 reviews related work. Then, Sec. 3 details the proposed algorithms, including the overall framework and the two major contributions, i.e., the detection and discrimination algorithms. Subsequently, the evaluation metrics and the proposed evaluation protocol are presented in Sec. 4. Sec. 5 reports the experiments undertaken on a satellite video. Finally, Sec. 6 concludes the paper.
2 Related Work
2.1 Earth Observation Technology and Satellite Videos
Many observation technologies have been developed and are of enormous significance, including optical satellite images, spaceborne synthetic aperture radar (SAR) images, and aerial videos. Such technologies play a critical role in both civil and military areas, such as city traffic systems, maritime surveillance, aerial reconnaissance, and battlefield monitoring. Both optical satellite imaging and spaceborne SAR can observe a large area at high resolution. However, optical satellite imaging is very susceptible to illumination and weather conditions. Although SAR has the unique capability of imaging the earth in all weather, day and night [6], SAR images are difficult to interpret [7, 8]. Another weakness of optical satellite images and SAR images is that they cannot observe dynamics due to stationary imaging, which narrows down their applications.
Satellite videos have many advantages over other conventional videos, such as aerial videos captured by unmanned aerial vehicles (UAVs). Aerial videos often suffer from undesirable dramatic motion of the platform and have to resort to complex stabilization preprocessing [9]. Thus, to track objects, image registration has to be done at the first stage. One can then separate camera ego-motion from object motion [10, 11, 12]. Aerial videos can only cover a small part of a city, while satellite videos can easily supervise a city-scale scene. Furthermore, new civil aviation safety legislation has forbidden or restricted the usage of UAVs in many cities.
Recently, some commercial companies have begun to operate their own satellites. For example, Satellite Imaging Corporation (SIC) successfully launched the video satellites SkySat-1 and SkySat-2 on November 21, 2013 and July 8, 2014, respectively. Chang Guang Satellite Technology Co., Ltd. (CGSTL) successfully launched two video satellites on October 7, 2015. Up to now, CGSTL has 8 on-orbit video satellites, which are part of the ongoing Jilin No. 1 satellite constellation. The time resolution of EO of the Jilin No. 1 satellite constellation will be shortened to half an hour when the constellation is completed by 2020.
Compared with conventional EO technologies, satellite video covers the largest scope and is very stable. First, it allows us to understand and forecast the dynamics of the earth. Second, a video satellite can keep its lens pointed at the region of interest (ROI) throughout its flight as long as this area is within the field-of-view of the satellite, which yields very good image quality. Satellite videos are therefore more stable than aerial videos. Third, the high altitude of a video satellite results in a broader field-of-view, which can even cover a city-scale area. Another important point is that a video satellite is a free platform that can record anywhere on the earth without any restrictions.
2.2 Moving Target Detection and Tracking
Moving target detection can be taken as a special case of foreground segmentation. Such tasks can be solved by the Gaussian Mixture Model (GMM) [13, 14, 15] and the state-of-the-art ViBe [16, 17] algorithm. GMM is a representative parametric model, while ViBe uses a non-parametric method to describe the dynamic patterns of a pixel. GMM utilizes a weighted mixture of Gaussian distributions to model how a pixel value varies over time. However, the dynamics of pixel values may not follow Gaussian distributions. In most cases, we cannot use a definite parametric model to represent the variation of pixel values. ViBe instead regards the values of a pixel at different time steps as samples of one space, which are used to represent the pattern of that pixel.
Previous algorithms such as GMM and ViBe have several serious drawbacks if we apply them to our task. First, they require heavy computational cost and resources in processing satellite videos, since pixel-based modeling has a heavy computational load. Second, they are relatively inefficient in telling the difference between a moving target and the ego-motion of the satellite. To this end, we propose a noise model that isolates the moving background and detects potential moving targets simultaneously. Moreover, the proposed tiny moving vehicle detection is implemented in the spatio-temporal domain. Given the subpixel-level motion and the neighborhood similarity, we model the pixels of inter-frame differences spatially rather than temporally.
3 Methodology of Detecting Tiny Moving Vehicles
This section is divided into three subsections. First, we discuss the overall framework of the proposed tiny moving vehicle detection algorithms in Sec. 3.1. The proposed framework is built on a Kalman filter (KF) tracking framework, as shown in Fig. 5. The tiny moving vehicles are detected in each local region, as described in Sec. 3.2. We subsequently propose a discrimination algorithm that removes falsely detected components in Sec. 3.3.
3.1 Overview of the Proposed Algorithms
Key concepts. Before fully developing our framework, we explain some key concepts. Detection, or a detector, is the procedure of detecting potential vehicles, which embodies the proposed local tactic and noise modeling algorithm. Candidates are the outputs of the detection and are composed of true vehicles and some noises. Discrimination, or a discriminator, is the procedure of distinguishing true vehicles from the remaining noises, including the proposed region growing and multi-morphological-cue based discrimination algorithms. Hypotheses are the outputs of the discrimination and are composed of true vehicles and a few noises. In addition, the final outputs of the detecting and tracking framework are also defined as hypotheses. A state is a vector that includes the position, velocity and acceleration of a vehicle at a time step. A track is a sequence of states of a vehicle in the temporal domain. Each track is marked with a unique ID and assigned a KF. Association is a matching procedure that achieves the minimum cost. A prediction is the current position of a track as inferred by its KF.
Widely used object tracking methods include the Kalman filter, the particle filter [5], and mean shift [18]. As shown in Fig. 5, our whole vehicle tracking pipeline is based on the Kalman filter, which is one of the most classical tracking algorithms. In Fig. 5, the KF is the central module of the processing framework, which includes several interacting branches, as follows.
(1) Initialization. Initialization determines the initial state of a track. The hypotheses of the current frame and the previous frame are associated using the Hungarian algorithm [19, 20]. We can thus derive their positions and velocities, while their initial accelerations are set to zero.
(2) Prediction. The current state of a tracked vehicle is inferred from the previous observations.
(3) Hypothesis-to-Track Association. The discriminator yields hypotheses, while the tracks yield predictions. In this stage, hypotheses are matched with predictions so as to achieve the minimum cost. Here, the cost between a hypothesis and a prediction is defined as their Euclidean distance. The Hungarian algorithm is employed to derive an optimal association between hypotheses and predictions. Hypothesis-to-track association yields assignments, unassigned tracks and unassigned hypotheses. Assignments are the optimal matches. Unassigned tracks are tracks that are not successfully matched with any hypothesis, and unassigned hypotheses are defined likewise. Then, assignments are utilized to update the states; unassigned tracks are further processed in the nearest searching stage; unassigned hypotheses are used to initialize new tracks.
(4) Update. The state vectors of the hypotheses in the assignments are used to update the states of their corresponding KFs.
(5) Nearest searching, correction and termination. We do not simply discard the unassigned tracks, because their corresponding hypotheses may have been missed by the detector or the discriminator. Instead, a nearest searching strategy is applied to find out whether there exists, around the previous position of the vehicle, a connected region that resembles the tracked vehicle. The matching resorts to the structural similarity index (SSIM) [21]. If a similar region is found, the track is updated; otherwise the track is terminated. The nearest searching using SSIM is illustrated in Fig. 6. Panels (a), (c) and (e) are from the previous frame, where the vehicles are marked by black rectangles, while (b), (d), and (f) are from the current frame, where the black rectangles denote the results of nearest searching. These experimental results demonstrate the effectiveness of such a nearest searching strategy.
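As an illustration of the hypothesis-to-track association in step (3), the following minimal sketch matches hypotheses to predictions by minimum total Euclidean distance. It is not the implementation used in this paper: for brevity it enumerates all matchings by brute force, which returns the same optimum as the Hungarian algorithm on small inputs but does not scale. The function name `associate` and the gating parameter `max_cost` are assumptions introduced only for this example.

```python
from itertools import permutations
from math import dist


def associate(hypotheses, predictions, max_cost=float("inf")):
    """Match hypotheses to track predictions with minimum total Euclidean cost.

    Returns (assignments, unassigned_tracks, unassigned_hypotheses), where
    assignments is a sorted list of (hypothesis_index, track_index) pairs.
    Brute force stands in for the Hungarian algorithm (small inputs only).
    """
    swap = len(hypotheses) > len(predictions)
    small, large = (predictions, hypotheses) if swap else (hypotheses, predictions)

    # Try every way of matching the smaller set into the larger one.
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(large)), len(small)):
        cost = sum(dist(small[i], large[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm

    # Keep only matched pairs whose distance passes the (assumed) gate.
    assignments = []
    for i, j in enumerate(best_perm):
        h, t = (j, i) if swap else (i, j)
        if dist(hypotheses[h], predictions[t]) <= max_cost:
            assignments.append((h, t))
    assignments.sort()

    matched_h = {h for h, _ in assignments}
    matched_t = {t for _, t in assignments}
    unassigned_tracks = [t for t in range(len(predictions)) if t not in matched_t]
    unassigned_hypotheses = [h for h in range(len(hypotheses)) if h not in matched_h]
    return assignments, unassigned_tracks, unassigned_hypotheses
```

In this sketch, unassigned tracks would be passed on to the nearest searching stage and unassigned hypotheses would initialize new tracks, as described in step (3).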
We fully explain the steps of detecting the tiny moving objects in the next few sections. Once the objects are detected, the Kalman filter is adopted to fit the motion of these vehicles; it is the optimal solution in linear and Gaussian situations. Although the motion of vehicles in the real world is very complex, a nonlinear procedure can be approximately decomposed into a series of linear procedures. So, the KF is a simple but effective tool to measure, predict and track the motion of a moving vehicle. The evolution function of the system is defined as
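$$\mathbf{x}_k = \mathbf{F}\,\mathbf{x}_{k-1} + \mathbf{w}_{k-1} \tag{1}$$
where $\mathbf{x}_k$, $\mathbf{F}$ and $\mathbf{w}_k$ denote the state vector, the evolution matrix and the process noise vector, respectively, and the subscript $k$ indicates the time step of a frame. The position $\mathbf{p}_k$, velocity $\mathbf{v}_k$ and acceleration $\mathbf{a}_k$ of a vehicle constitute $\mathbf{x}_k$, namely
$$\mathbf{x}_k = \left[\mathbf{p}_k^{\top},\ \mathbf{v}_k^{\top},\ \mathbf{a}_k^{\top}\right]^{\top} \tag{2}$$
Without loss of generality, we assume that vehicle targets move with constant acceleration along a straight line during each fixed interval $\Delta t$. So, the evolution matrix can be written as
$$\mathbf{F} = \begin{bmatrix} \mathbf{I} & \Delta t\,\mathbf{I} & \tfrac{1}{2}\Delta t^{2}\,\mathbf{I} \\ \mathbf{0} & \mathbf{I} & \Delta t\,\mathbf{I} \\ \mathbf{0} & \mathbf{0} & \mathbf{I} \end{bmatrix} \tag{3}$$
where $\mathbf{I}$ and $\mathbf{0}$ are identity and zero matrices of matching size. The measurement function can be written as
$$\mathbf{z}_k = \mathbf{H}\,\mathbf{x}_k + \mathbf{n}_k \tag{4}$$
where $\mathbf{z}_k$, $\mathbf{H}$ and $\mathbf{n}_k$ denote the measurement vector, the measurement matrix and the measurement noise, respectively. The definition of $\mathbf{H}$ in this study is
$$\mathbf{H} = \begin{bmatrix} \mathbf{I} & \mathbf{0} & \mathbf{0} \end{bmatrix} \tag{5}$$
Assuming that a set of measurements is obtained, i.e. $\mathbf{z}_{1:k} = \{\mathbf{z}_1, \ldots, \mathbf{z}_k\}$, the KF recursively derives the posterior PDF of the state vector via Bayes' theorem, i.e.
$$p(\mathbf{x}_k \mid \mathbf{z}_{1:k}) = \frac{p(\mathbf{z}_k \mid \mathbf{x}_k)\, p(\mathbf{x}_k \mid \mathbf{z}_{1:k-1})}{p(\mathbf{z}_k \mid \mathbf{z}_{1:k-1})} \tag{6}$$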
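To make the constant-acceleration motion model concrete, the sketch below propagates the state of a single image axis one step forward. It is only an illustration under stated assumptions: the state is taken as [position, velocity, acceleration] for one axis, the process noise and covariance propagation are omitted, and the names `transition_matrix`, `predict`, `measure` and the interval `dt` are introduced here for the example.

```python
def transition_matrix(dt):
    """Constant-acceleration evolution matrix for the state [p, v, a] of one axis."""
    return [
        [1.0, dt, 0.5 * dt * dt],  # p' = p + v*dt + 0.5*a*dt^2
        [0.0, 1.0, dt],            # v' = v + a*dt
        [0.0, 0.0, 1.0],           # a' = a (constant acceleration)
    ]


def predict(state, dt):
    """Noise-free prediction step: multiply the state by the evolution matrix."""
    f = transition_matrix(dt)
    return [sum(f[r][c] * state[c] for c in range(3)) for r in range(3)]


def measure(state):
    """Assumed measurement model: only the position is observed."""
    return state[0]


if __name__ == "__main__":
    state = [0.0, 2.0, 1.0]          # position 0, velocity 2, acceleration 1
    nxt = predict(state, dt=1.0)
    print(nxt, measure(nxt))
```

A full Kalman filter would additionally propagate the state covariance and correct the prediction with each new measurement; those steps are left out to keep the example short.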
3.2 Motion-Based Detection Using Local Noise Modeling
Inter-frame differencing is a conventional but effective tool to discern changes between two frames. In contrast to the pixel-based ViBe and GMM, it has two notable merits: high efficiency and low memory consumption. However, traditional inter-frame differencing [22] relies on a predefined threshold to separate moving pixels from the background. Specifically, the grey-level difference image is converted to a binary image wherein ones denote moving pixels, while zeros denote stationary background pixels. This procedure is termed binarization in our paper. Essentially, binarization differentiates moving pixels from the rest of the inter-frame difference image. But a fixed binarization threshold cannot adapt to the large intra-frame variations of satellite videos.
In order to address the aforementioned challenges, we propose a local tactic and a novel detection method, conceptualized as motion-based detection using local noise modeling. An adaptive binarization is derived by noise modeling in each local area, dealing with the variation among local areas by exploiting the neighborhood similarity. Motion-based means that inter-frame differencing is used to search for moving pixels.

Local tactic. A local tactic is designed to tackle the dramatic intra-frame variation. Specifically, a 2D rasterization is implemented along the vertical and horizontal directions of a frame, so that the original frame is converted into tiled local areas. The size of a local area is set empirically. On the one hand, a local area has a much lower degree of heterogeneity than the whole frame. On the other hand, it integrates local information to reduce the interference of the moving background. It greatly facilitates the subsequent detection.

Detecting method. The proposed detecting method is composed of four stages, as shown in Fig. 7: (1) deriving inter-frame difference images, (2) estimating the noise distribution, (3) binarization and (4) logical AND operations that yield the final detection results. Our major contribution is step (2), estimating the noise distribution, which yields an adaptive binarization threshold for each local area in step (3). Each step is detailed as follows.
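Table I. Fitting performance of different distributions, measured by KL and KS distances.

| Frame | Distance | Exponential | Gamma | Weibull | GD |
|-------|----------|-------------|--------|---------|--------|
| 50 | KL | 0.0959 | 0.0914 | 0.0919 | 0.0544 |
| 50 | KS | 0.0959 | 0.1018 | 0.0891 | 0.0813 |
| 100 | KL | 0.0864 | 0.0812 | 0.0862 | 0.0579 |
| 100 | KS | 0.0875 | 0.0988 | 0.0896 | 0.0865 |
| 500 | KL | 0.0846 | 0.0800 | 0.0845 | 0.0531 |
| 500 | KS | 0.0854 | 0.0964 | 0.0859 | 0.0816 |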
(1) Deriving inter-frame difference images. Inter-frame differencing is a simple operation. Here, we provide a novel viewpoint on inter-frame difference images by taking a frame as a 2D signal consisting of the original optical signal and additive random noise, i.e.
$$I_t(x, y) = S_t(x, y) + N_t(x, y) \tag{7}$$
where $I_t(x, y)$ denotes the grey-level value of pixel $(x, y)$ in frame $t$, since grey-level images are more common in satellite videos (RGB frames are converted to grey-level images). $S_t(x, y)$ denotes the original amplitude of the pixel in frame $t$, while $N_t(x, y)$ denotes the corresponding noise signal. Accordingly, the absolute inter-frame difference of two registered frames can be regarded as a set of random noise, i.e.,
$$D_{t,k}(x, y) = \left| I_t(x, y) - I_{t+k}(x, y) \right| \tag{8}$$
$$\phantom{D_{t,k}(x, y)} = \left| N_t(x, y) - N_{t+k}(x, y) \right| \tag{9}$$
where $D_{t,k}$ denotes the absolute inter-frame difference and $k$ denotes the frame interval. From Eq. (9), the inter-frame difference image depends only on the noise when the two frames are registered, because the original signals $S_t$ and $S_{t+k}$ cancel for stationary pixels. However, there still exist some outliers, which are composed of tiny moving vehicles and non-vehicle targets. Thus the next issue is how to differentiate these outliers from random noise.
(2) Estimating noise distribution. Detecting the pattern of tiny moving vehicles is a challenge, since noise patterns blur the underlying patterns of the tiny moving vehicles. Thus in this step, the key idea is to fit the noise patterns, namely the absolute inter-frame difference in Eq. (8), by probabilistic distributions.
Intuitively, the value differences of the same pixel in two consecutive frames should be approximately zero, while the value differences of pixels belonging to noise patterns or moving vehicles should be larger than zero. Figure 8 shows the histogram of the pixel value differences between two consecutive frames. The amplitude histogram of the noise exhibits notable regularities, such as a smooth decay and a heavy tail. In the noise pattern, true noise pixels are inliers, and the other pixels are outliers. Thus the heavy tail in Fig. 8 should correspond to the outliers.
A probabilistic distribution is adopted to fit the histogram and to derive a binarization threshold for a given probability. Several widely used heavy-tailed distributions, namely the exponential distribution, the Gamma distribution, the Weibull distribution and the generalized Gamma distribution (GD), are tested and compared in Fig. 8. Quantitatively, the Kullback-Leibler (KL) distance [23] and the Kolmogorov-Smirnov (KS) distance (also known as the Kolmogorov-Smirnov test [24]) are introduced to quantify the fitting performance of the different distributions, as shown in Table I. Smaller KL and KS distances indicate a better fit.
The quantitative experiment supports the above hypothesis. It also shows that the three-parameter distribution, GD, with its higher degree of freedom (DoF), outperforms the other distributions. The one-parameter exponential distribution also fits the noise distribution very well. Nevertheless, the parameter estimation of GD is more difficult, computationally heavier and more time consuming than that of the exponential distribution. To strike a balance between accuracy and computational load, the exponential distribution is adopted to fit the noise; its cumulative distribution function (CDF) is
$$F(x; \lambda) = 1 - e^{-\lambda x}, \quad x \ge 0 \tag{10}$$
where $\lambda$ is the rate parameter estimated from the pixel value differences of the local area.
(3) Binarization. Once we fit the distribution of the pixel value differences of consecutive frames, we can utilize an adaptive binarization threshold to determine the outliers. In particular, we introduce a predefined probability ($P$) to derive the binarization threshold, namely [6]
$$T = F^{-1}(P; \lambda) \tag{11}$$
where $F^{-1}$ denotes the inverse function of the distribution in Eq. (10). The value of $P$ is fixed in advance. If the pixel value difference is larger than the binarization threshold $T$, the pixel is taken as an outlier. By virtue of such a binarization algorithm, we turn the original difference image into a binary image in which inliers have zero pixel values and outliers have ones.
(4) Logical AND. These outliers comprise the vehicles and other noise. In the binary difference image, a true vehicle target appears as two symmetrical blobs. One blob indicates the current position, and the other indicates the previous or future position. We derive the intersection of two binary inter-frame difference images (explicitly, the difference between frame $t-k$ and frame $t$ versus the difference between frame $t$ and frame $t+k$, where the frame interval $k$ is set as 10) to determine the current positions of vehicles. It is a Boolean operation in which only one and one yield one, and is therefore named "logical AND". In addition to eliminating ambiguities, the logical AND also reduces the existing noise, since such noise appears randomly.
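The four stages above can be sketched on toy data as follows. This is a simplified illustration rather than the implementation evaluated in this paper: frames are plain nested lists, the exponential rate is estimated by maximum likelihood as the reciprocal of the local mean (an assumed estimator), and the tile size `tile` and probability `prob` are placeholder parameters, not the values used in our experiments.

```python
import math


def abs_diff(frame_a, frame_b):
    """Stage 1: absolute inter-frame difference of two registered frames."""
    return [[abs(a - b) for a, b in zip(ra, rb)] for ra, rb in zip(frame_a, frame_b)]


def exponential_threshold(values, prob):
    """Stages 2-3: fit an exponential (rate = 1/mean) and return F^{-1}(prob)."""
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    return -mean * math.log(1.0 - prob)


def binarize_local(diff, tile, prob):
    """Stage 3: binarize each tile x tile local area with its own threshold."""
    h, w = len(diff), len(diff[0])
    out = [[0] * w for _ in range(h)]
    for r0 in range(0, h, tile):
        for c0 in range(0, w, tile):
            rows = range(r0, min(r0 + tile, h))
            cols = range(c0, min(c0 + tile, w))
            t = exponential_threshold([diff[r][c] for r in rows for c in cols], prob)
            for r in rows:
                for c in cols:
                    out[r][c] = 1 if diff[r][c] > t else 0
    return out


def logical_and(mask_a, mask_b):
    """Stage 4: keep pixels that are outliers in both difference images."""
    return [[a & b for a, b in zip(ra, rb)] for ra, rb in zip(mask_a, mask_b)]


def detect(prev, cur, nxt, tile, prob):
    """Candidate mask for the current frame from its two neighbouring frames."""
    back = binarize_local(abs_diff(prev, cur), tile, prob)
    fwd = binarize_local(abs_diff(cur, nxt), tile, prob)
    return logical_and(back, fwd)
```

For a bright pixel moving one column per frame over a flat background, each difference image contains two blobs (old and new position), and only the current position survives the logical AND.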
3.3 Region Growing and Multi-Morphological-Cue Based Discrimination
There still exist some noises among the candidates. These noises include irregular noises and regular noises. Irregular noises result from dramatic illumination variations or slight deviations between frames. They may randomly appear in some consecutive frames. Generally, it is not necessary to design an algorithm for pruning the irregular noises, since KF tracking can gradually eliminate this type of noise. In contrast, we term the background moving patterns the regular noises. In particular, these noises are caused by the slight deviation of the satellite motion. Such deviation may be detected as the edges or corners of some static objects in the frames. Even worse, these detected corners or edges exhibit relative motion with respect to the moving background. The regular noises have to be pruned by our algorithms. We visualize the regular noises in Fig. 9.
To this end, we propose a novel discrimination algorithm using geometrical and neighborhood information. This discrimination algorithm includes two parts, i.e., region growing to reconstruct the candidate geometry and multi-morphological-cue based discrimination to distinguish noises from vehicles. The key idea is that a vehicle target is a singular point in the 2D temporal domain. By contrast, the regular noises share similar temporal distributions with their neighborhood in the frames. If the candidates can be connected with their similar neighborhood pixels, we can differentiate vehicles from regular noises in terms of shape: the vehicle targets approximate a rectangle, while regular noises take arbitrary shapes.
(1) Region growing. A region growing algorithm is proposed to connect a candidate with its similar neighborhood pixels, that is, to reconstruct the whole geometry of a candidate. As shown in Fig. 10, the detector only yields a partial geometry of a candidate, because the positions of a candidate in two adjacent frames overlap. The region growing utilizes the detected partial geometry to restore the whole geometry of the candidate. The neighborhood area is defined as a pixel window around the candidate. A Gaussian distribution is employed to measure the similarity between a neighbor pixel and the candidate. The CDF of the Gaussian distribution is
$$F(x; \mu, \sigma) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right] \tag{12}$$
where $\operatorname{erf}$, $\mu$ and $\sigma$ denote the error function, the mean and the standard deviation, respectively. The parameter values of the Gaussian distribution can be estimated from the values of the pixels of the candidate. A range can be obtained given a predefined lower bound probability $P_l$ and upper bound probability $P_u$, i.e.
$$T_l = F^{-1}(P_l; \mu, \sigma) \tag{13}$$
$$T_u = F^{-1}(P_u; \mu, \sigma) \tag{14}$$
where $T_l$ and $T_u$ represent the lower and upper bound thresholds, respectively. $P_l$ and $P_u$ are set symmetrically. If the grey-level value of a pixel inside the searching window lies in $[T_l, T_u]$, the pixel is reclassified as a candidate pixel. The new candidate pixels that are connected with the original candidate are retained. Finally, the result of region growing is shown in Fig. 10 (i)-(l). As Fig. 10 shows, the detector initially captures only partial geometries of the candidates, and the proposed region growing algorithm then reconstructs their whole geometries. The region growing result also demonstrates the feasibility of discrimination in terms of shape.
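The region growing step can be sketched as below. This is a hedged illustration, not the exact procedure of this paper: the bound probabilities `p_low` and `p_high`, the window margin `window`, the floor `min_sigma` on the standard deviation, and the use of 4-connectivity are all assumptions made for the example. The inverse Gaussian CDF is taken from Python's `statistics.NormalDist`.

```python
from collections import deque
from statistics import NormalDist, mean, pstdev


def grow_region(image, seeds, p_low=0.05, p_high=0.95, window=2, min_sigma=1.0):
    """Grow a candidate from its detected pixels (seeds) to similar neighbours.

    image: 2D list of grey-level values; seeds: list of (row, col) pixels.
    Returns the set of (row, col) pixels of the reconstructed candidate.
    """
    # Fit a Gaussian to the grey levels of the detected candidate pixels.
    values = [image[r][c] for r, c in seeds]
    sigma = max(pstdev(values), min_sigma)  # avoid a degenerate zero-width range
    gauss = NormalDist(mean(values), sigma)
    t_low, t_high = gauss.inv_cdf(p_low), gauss.inv_cdf(p_high)

    # Searching window: bounding box of the seeds enlarged by `window` pixels.
    rows = [r for r, _ in seeds]
    cols = [c for _, c in seeds]
    r_min, r_max = max(min(rows) - window, 0), min(max(rows) + window, len(image) - 1)
    c_min, c_max = max(min(cols) - window, 0), min(max(cols) + window, len(image[0]) - 1)

    # Breadth-first growth keeps only pixels connected to the original candidate.
    region = set(seeds)
    queue = deque(seeds)
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (nr, nc) in region:
                continue
            if not (r_min <= nr <= r_max and c_min <= nc <= c_max):
                continue
            if t_low <= image[nr][nc] <= t_high:
                region.add((nr, nc))
                queue.append((nr, nc))
    return region
```

Because growth proceeds only through connected pixels, a similar-valued pixel elsewhere in the window that is not connected to the candidate is not absorbed.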
(2) Multi-Morphological-Cue Based Discrimination. After region growing, we adopt a series of morphological properties to differentiate vehicle targets from regular noises. The employed morphological cues include area, extent, major axis length and eccentricity, as follows.
Area. The number of the pixels of a candidate.
Extent. The ratio of pixels of a candidate to the area of the bounding box of the candidate.
Major Axis Length. If an ellipse has the same normalized second central moments as the connected region of a candidate, the major axis length of the ellipse is defined as the major axis length of the candidate.
Eccentricity. The eccentricity of a candidate is equal to the eccentricity of the above ellipse.
The area and the major axis length cues represent the size of a candidate, while the extent and the eccentricity cues measure the similarity between a candidate and a rectangle. The pixel spacing in the satellite videos corresponds to about 1 meter in the real world. Thereby, these morphological cues indicate the real shape of a vehicle. They constitute a robust feature of vehicles, because vehicles are rigid bodies without any deformation in satellite videos.
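The four cues can be computed from the pixel coordinates of a candidate, as in the following sketch. The ellipse is derived from the normalized second central moments with the customary 1/12 correction for the extent of a unit pixel, which follows common region-property implementations and is an assumption here; the function name `morphological_cues` is likewise introduced only for illustration.

```python
import math


def morphological_cues(pixels):
    """Area, extent, major axis length and eccentricity of a connected candidate.

    pixels: list of (row, col) coordinates belonging to the candidate.
    """
    n = len(pixels)
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]

    # Extent: candidate pixels divided by the bounding-box area.
    box_area = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
    extent = n / box_area

    # Normalized second central moments (1/12 accounts for the pixel's own size).
    r_mean, c_mean = sum(rows) / n, sum(cols) / n
    u_rr = sum((r - r_mean) ** 2 for r in rows) / n + 1.0 / 12.0
    u_cc = sum((c - c_mean) ** 2 for c in cols) / n + 1.0 / 12.0
    u_rc = sum((r - r_mean) * (c - c_mean) for r, c in pixels) / n

    # Axes of the ellipse that has the same second moments as the region.
    common = math.sqrt((u_rr - u_cc) ** 2 + 4.0 * u_rc ** 2)
    major = 2.0 * math.sqrt(2.0) * math.sqrt(u_rr + u_cc + common)
    minor = 2.0 * math.sqrt(2.0) * math.sqrt(max(u_rr + u_cc - common, 0.0))
    eccentricity = math.sqrt(max(1.0 - (minor / major) ** 2, 0.0))

    return {
        "area": n,
        "extent": extent,
        "major_axis_length": major,
        "eccentricity": eccentricity,
    }
```

A filled rectangle has an extent of 1, whereas an L-shaped or ragged region has a lower extent, which is the property the discrimination relies on to separate vehicle-like candidates from arbitrarily shaped noise.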
4 Metrics and Protocols in Performance Evaluation
The widely used evaluation metrics for object detection tasks are the precision/recall curve and average precision (AP) [25]. These metrics are widely used in traditional visual object benchmarks, e.g., PASCAL VOC [25] and MOT [26]. For the novel task of detecting tiny moving objects, however, these metrics are ill-suited to evaluating performance. The key challenge, again, is that each vehicle occupies only several pixels in each video frame. To this end, we systematically introduce in Sec. 4.2 a complete evaluation protocol for measuring algorithm performance on our novel detection task. Our evaluation protocol is built upon the existing evaluation metrics described in Sec. 4.1.
4.1 Evaluation Metrics
Generally, a single criterion cannot assess the performance of detecting and tracking objectively and comprehensively. To the best of our knowledge, this is the first time that a systematic series of evaluation metrics, including precision, recall, Jaccard similarity, etc., has been introduced for this task. Their definitions are detailed as follows.
Precision. With respect to detection performance evaluation, the most important issue is to determine whether a hypothesis is a true positive (TP), i.e., a true target correctly covered by an output, or a false positive (FP), i.e., a non-target falsely covered by an output. The missed true targets are called false negatives (FNs). Precision is the ratio of correctly detected targets to all detected targets, i.e.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{15}$$
Recall. Recall measures the ability of a detector to capture true targets, which is equal to the ratio of TP to the number of all existing true targets, namely
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{16}$$
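$F_1$ score. The $F_1$ score is a traditional criterion of binary classification between targets of interest and non-targets, which is equal to the harmonic mean of Precision and Recall, i.e.
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{17}$$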
Jaccard Similarity. Jaccard similarity is a criterion for evaluating tracking performance, which integrates TP, FP and FN as follows [27]:
$$\mathrm{Jaccard} = \frac{TP}{TP + FP + FN} \tag{18}$$
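| District | Frame Rate (fps) | Resolution (m) | Duration (s) | Height × Width (pixel) | Latitude and Longitude of Frame Corner (Top Left, Top Right, Bottom Left, Bottom Right) |
|----------|------------------|----------------|--------------|------------------------|------|
| Valencia, Spain | 20 | 1.0 | 29 | 3072 × 4096 | |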
MOTA. Multiple object tracking accuracy (MOTA) is a metric that quantifies multiple object tracking performance. The definition of MOTA is [26, 28]
$$\mathrm{MOTA} = 1 - \frac{\sum_i \left(FN_i + FP_i + IDSW_i\right)}{\sum_i GT_i} \tag{19}$$
where $FN_i$, $FP_i$, $IDSW_i$ and $GT_i$ represent the numbers of false negatives, false positives, identity switches and ground truths, respectively, in frame $i$. IDSW means an identity switch of the trajectories associated with a ground truth; please refer to [26] for more details. The MOTA score ranges from $-\infty$ to 1, and the larger it is, the better the detection and tracking performance.
MOTP. Multiple object tracking precision (MOTP) measures the positioning precision of the detection algorithms, and can be written as [26, 28, 20]
$$\mathrm{MOTP} = \frac{\sum_i \mathrm{IoU}_i}{\sum_i c_i} \tag{20}$$
where $\mathrm{IoU}_i$ (intersection over union [25]) denotes the sum of the overlap ratios of hypotheses to ground truths, and $c_i$ is the number of matches between ground truths and hypotheses, in frame i. From the above definition, MOTP ranges from 0 to 1, and the larger it is, the more precise the derived locations are.
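The two tracking metrics can be sketched in Python as follows, assuming per-frame counts have already been collected after association. The function names and the example numbers are hypothetical; the formulas follow the CLEAR MOT definitions cited above (one minus the ratio of accumulated errors to accumulated ground truths for MOTA, and accumulated IoU over accumulated matches for MOTP).

```python
def mota(fn, fp, idsw, gt):
    """MOTA from per-frame lists of FNs, FPs, identity switches and ground truths."""
    total_gt = sum(gt)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without any ground-truth object")
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / total_gt


def motp(iou_sums, matches):
    """MOTP from per-frame summed IoU of matched pairs and per-frame match counts."""
    total_matches = sum(matches)
    if total_matches == 0:
        raise ValueError("MOTP is undefined without any match")
    return sum(iou_sums) / total_matches


# Hypothetical three-frame sequence with 10 ground-truth vehicles per frame.
good = mota(fn=[2, 1, 3], fp=[1, 0, 1], idsw=[0, 1, 0], gt=[10, 10, 10])
# A detector flooded with false positives drives MOTA below zero.
noisy = mota(fn=[1], fp=[25], idsw=[0], gt=[10])
loc = motp(iou_sums=[4.0, 4.5, 3.5], matches=[8, 9, 7])
print(good, noisy, loc)
```

The second call illustrates why MOTA has no lower bound: once the number of FPs exceeds the number of ground truths, the score becomes negative.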
4.2 An Evaluation Protocol
The IoU between the bounding boxes (bboxes) of a hypothesis and a ground truth is adopted as their similarity. Similar to the aforementioned hypothesis-to-track association, the matching of multiple hypotheses and ground truths also resorts to the Hungarian algorithm [19], applied in the spatiotemporal domain rather than in each frame independently.
Some association cases are illustrated in Fig. 11 (b)-(d); the actual situation is more complicated. The protocol for evaluating the proposed detection and tracking algorithm is detailed as follows:

1. The IoU between each hypothesis and each ground truth is computed. The reciprocal of the IoU is then taken as the distance between a hypothesis and a ground truth. Note that a very small value is added to every IoU to avoid zero denominators. The matching distance threshold is empirically set to 50, which corresponds to a very small IoU value of 0.02. This means that if a hypothesis and a ground truth overlap even slightly, the hypothesis is regarded as covering the ground truth. In contrast, the IoU threshold is usually set to 0.5 for the detection of general objects, such as pedestrians, aeroplanes, and bicycles, which is not suitable for the tiny vehicles in satellite videos. Obviously, the smaller the distance, the smaller the cost of associating the hypothesis with the ground truth. Given M hypotheses and N ground truths, all distances constitute a cost matrix, i.e.,
$$C^{t} = \left[ d^{t}_{mn} \right]_{M \times N}, \qquad d^{t}_{mn} = \frac{1}{\mathrm{IoU}^{t}_{mn} + \epsilon} \tag{21}$$
where t indicates the frame number, $\mathrm{IoU}^{t}_{mn}$ is the IoU between the m-th hypothesis and the n-th ground truth, and $\epsilon$ is the small value mentioned above.

2. Repeat the first step for K consecutive frames. The cost matrices of these frames then constitute a cost tensor, namely
$$\mathcal{C} = \left[ C^{1}, C^{2}, \ldots, C^{K} \right] \tag{22}$$

3. The optimal associations of hypotheses and ground truths are obtained using the Hungarian algorithm. A time window is employed to reduce the computational load and memory consumption. Explicitly, the association is carried out over ten consecutive frames rather than all frames, i.e., K in Eq. (22) is set to 10.

4. Repeat steps 1–3 until all frames are processed.

5. Finally, the metrics are calculated based on the associations.
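The distance construction and thresholding in the protocol can be sketched in Python for a single frame. This is a simplified illustration under several assumptions: boxes are given as (x1, y1, x2, y2), the small constant `EPS` is a hypothetical choice, the association is done per frame rather than over the K-frame cost tensor, and an exhaustive search over assignments stands in for the Hungarian algorithm (it returns the same minimum-cost assignment but is only practical for a handful of boxes).

```python
from itertools import permutations

EPS = 1e-6            # hypothetical small value added to every IoU
MAX_DISTANCE = 50.0   # matching threshold from the text (IoU of about 0.02)


def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def cost_matrix(hyps, gts):
    """M x N matrix of reciprocal-IoU distances for one frame."""
    return [[1.0 / (iou(h, g) + EPS) for g in gts] for h in hyps]


def match(hyps, gts):
    """Minimum-cost one-to-one pairs (hyp_idx, gt_idx) within MAX_DISTANCE."""
    m, n = len(hyps), len(gts)
    if m == 0 or n == 0:
        return []
    cost = cost_matrix(hyps, gts)
    if m <= n:   # every hypothesis picks a distinct ground truth
        candidates = (list(enumerate(p)) for p in permutations(range(n), m))
    else:        # every ground truth picks a distinct hypothesis
        candidates = ([(i, j) for j, i in enumerate(p)]
                      for p in permutations(range(m), n))
    best_pairs, best_total = [], float("inf")
    for pairs in candidates:
        total = sum(cost[i][j] for i, j in pairs)
        if total < best_total:
            best_total, best_pairs = total, pairs
    # Gate the optimal assignment: pairs beyond the threshold are not matches.
    return [(i, j) for i, j in best_pairs if cost[i][j] <= MAX_DISTANCE]


# Hypothetical frame: three hypotheses, two ground truths.
hyps = [(0, 0, 3, 3), (10, 10, 13, 13), (50, 50, 52, 52)]
gts = [(1, 1, 4, 4), (11, 10, 14, 13)]
pairs = match(hyps, gts)
tp = len(pairs)
fp = len(hyps) - tp   # unmatched hypotheses
fn = len(gts) - tp    # unmatched ground truths
print(pairs, tp, fp, fn)
```

In this example the third hypothesis overlaps no ground truth, so its distance is far above the threshold and it is counted as an FP.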
5 Experiments and Discussion
5.1 Experimental Setups
Dataset. The experimental satellite video is provided by CGSTL. The video was captured on March 7, 2017, by a video satellite over a region of Valencia, Spain. Details of the satellite video are given in Table II. The video is provided free of charge by CGSTL for scientific research (satellite videos can also be purchased from their official website, http://mall.charmingglobe.com/videoIndex.html). The annotated dataset can be downloaded from the first author's website.
Competitors. We compare the proposed algorithm with other algorithms for detecting tiny moving vehicles, namely GMM and ViBe. For a fair comparison, GMM and ViBe are coupled with the same KF tracking framework.
Ground truth. We annotated the experimental satellite video to quantitatively evaluate the proposed algorithm. Annotating satellite videos is arduous, because the vehicles lack distinctive features and are hard to distinguish from the background with the naked eye, as illustrated in Fig. 12. In Fig. 12 (a), (b) and (c), there are no significant differences between real vehicle targets and the background, especially for dark vehicles such as the one in Fig. 12 (b). Fig. 12 (d) shows that a noise signal may look like a stationary vehicle and resemble a true positive, which makes the interpretation ambiguous. To tackle this problem, we inspect the previous and subsequent frames to determine whether a group of connected pixels forms a moving vehicle. In other words, we go through a short sequence of consecutive frames to seek out a moving region. Besides the difficulty of identifying vehicles, it is hardly feasible to annotate all vehicles frame by frame over a scene spanning about 3×4 square kilometers. We therefore seek a tradeoff between workload and annotation accuracy: first, three areas of 500×500 pixels are randomly selected from the video for annotation, as illustrated in Fig. 13; second, we manually annotate the vehicles every 10 frames, while the ground truth of the other frames is obtained by linear interpolation. Fig. 14 shows a representative scene of the annotation. The numbers of ground-truth vehicles in area 1, area 2 and area 3 are 49, 41 and 29, respectively. We used the Ground Truth Labeler app in MATLAB 2018a to help annotate the satellite video.
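The interpolation of annotations between manually labeled frames can be sketched in Python as follows. This is only an illustration: the box format (x, y, w, h) and the function name `interpolate_boxes` are assumptions, and the sketch handles a single vehicle whose keyframe boxes are spaced `step` frames apart.

```python
def interpolate_boxes(key_boxes, step=10):
    """Linearly interpolate one vehicle's boxes between annotated keyframes.

    key_boxes: list of (x, y, w, h) boxes annotated every `step` frames.
    Returns one box per frame from the first to the last keyframe, inclusive.
    """
    if not key_boxes:
        return []
    frames = []
    for a, b in zip(key_boxes, key_boxes[1:]):
        for k in range(step):
            t = k / step  # fraction of the way from keyframe a to keyframe b
            frames.append(tuple((1 - t) * p + t * q for p, q in zip(a, b)))
    frames.append(tuple(float(v) for v in key_boxes[-1]))
    return frames


# Hypothetical vehicle annotated at frame 0 and frame 10.
keys = [(100, 200, 4, 3), (120, 210, 4, 3)]
boxes = interpolate_boxes(keys)
print(len(boxes), boxes[5])
```

Linear interpolation implicitly assumes roughly constant velocity between keyframes, which is reasonable over 10 frames at 20 fps (half a second) but may drift slightly on sharp turns.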
5.2 Experimental Results and Discussion
Quantitative results. The quantitative results of the comparison experiments are detailed in Table III, including the results for each area and the average results. Table III shows that the proposed framework clearly outperforms ViBe and GMM on every criterion. Specifically, ViBe and GMM detect around 50% of the vehicles, but about 90% of their detections are FPs. This high false positive rate severely degrades the performance of ViBe and GMM. The reason is that ViBe and GMM completely neglect the background motion and also fail to separate moving vehicles from noise. On the other hand, both are sensitive to pixel variations, so they still reach a Recall of about 50%. In contrast, our method perceives pixel motion rather than mere pixel variation. Owing to its very high Precision, our method achieves good F1 score, Jaccard Similarity and MOTA, all of which are closely related to Precision. The Recall of our method is relatively low, although it is about 10% higher than that of ViBe and GMM, and this affects the related criteria, Jaccard Similarity and MOTA, to some extent. The location precision metric, MOTP, shows that the average overlap between ground truths and hypotheses is 0.52. The very small size of the vehicle targets makes it difficult to locate them precisely. Nevertheless, we believe that this location precision, with its pixel-level deviation, meets the demands of applications, especially urban traffic surveillance.
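As a quick consistency check on the reported averages, the F1 score and Jaccard Similarity can be recomputed from Precision and Recall alone, since both depend only on TP, FP and FN. The Python sketch below uses the averaged Recall (63.06%) and Precision (81.04%) reported for our method in Table III; the function names are hypothetical.

```python
def f1_from_pr(p: float, r: float) -> float:
    """Harmonic mean of Precision and Recall."""
    return 2 * p * r / (p + r)


def jaccard_from_pr(p: float, r: float) -> float:
    """Jaccard similarity TP / (TP + FP + FN) rewritten in terms of P and R.

    From 1/JS = 1/P + 1/R - 1 it follows that JS = P*R / (P + R - P*R).
    """
    return p * r / (p + r - p * r)


# Averaged Precision and Recall of the proposed method from Table III.
p, r = 0.8104, 0.6306
print(round(f1_from_pr(p, r), 2), round(jaccard_from_pr(p, r), 2))
```

Both rounded values agree with the averaged F1 score (0.71) and Jaccard Similarity (0.55) listed in the table.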
Qualitative results. We visualize some results of detecting multiple tiny moving vehicles. Figure 15 shows four frames: frames 50, 100, 150, and 200. Intuitively, many vehicles on the main arteries are detected by the proposed algorithm, and there are only a few FPs because most of the annotated labels lie on the main arteries. Moreover, the number of detected vehicles grows gradually frame by frame, which illustrates that the tracking also facilitates the detection.
In order to clearly observe the detection and tracking details, we provide four enlargements in Fig. 20. Fig. 20 reveals the dynamics of not only the vehicle motion but also the detection and tracking procedure. Once a vehicle has been detected in two consecutive frames, it is assigned a unique ID and, simultaneously, a KF. The position and velocity provided by the detector are used to initialize the KF. Subsequently, following the designed framework, the detector provides the current state frame by frame, while the KF continuously maintains the latest state of the tracked vehicle and updates its system model. In this way, the proposed workflow can detect the tiny moving vehicles accurately and precisely.
Fig. 16 shows four trajectories tracked by the proposed algorithm. They include both linear and curved tracks, which supports the earlier argument that a series of linear procedures can closely approximate a nonlinear procedure. The four trajectories also cover different traffic scenarios, including a straight artery in Fig. 16(a), a right turn in Fig. 16(b), and two roundabouts in Fig. 16(c)(d). This indicates that the proposed algorithm can handle not only simple traffic conditions but also complex traffic scenarios.
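| Area | Method | Recall (%) | Precision (%) | F1 | JS | MOTA | MOTP |
|---|---|---|---|---|---|---|---|
| 1 | Ours | 64.15 | 81.71 | 0.72 | 0.56 | 0.46 | 0.50 |
| 1 | ViBe | 51.72 | 15.10 | 0.23 | 0.13 | −2.45 | 0.39 |
| 1 | GMM | 43.82 | 12.29 | 0.19 | 0.11 | −2.75 | 0.37 |
| 2 | Ours | 62.80 | 82.23 | 0.71 | 0.55 | 0.47 | 0.52 |
| 2 | ViBe | 61.70 | 9.14 | 0.16 | 0.09 | −5.56 | 0.45 |
| 2 | GMM | 61.83 | 7.5 | 0.13 | 0.07 | −7.08 | 0.39 |
| 3 | Ours | 60.42 | 77.26 | 0.68 | 0.51 | 0.41 | 0.56 |
| 3 | ViBe | 41.53 | 6.76 | 0.12 | 0.06 | −5.35 | 0.47 |
| 3 | GMM | 46.10 | 6.34 | 0.11 | 0.06 | −6.41 | 0.42 |
| Avg. | Ours | 63.06 | 81.04 | 0.71 | 0.55 | 0.46 | 0.52 |
| Avg. | ViBe | 52.86 | 10.74 | 0.18 | 0.10 | −3.92 | 0.43 |
| Avg. | GMM | 49.66 | 8.79 | 0.15 | 0.08 | −4.72 | 0.39 |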
Figure 18 shows the foreground segmentation results generated by our algorithm and the baseline algorithms. In Fig. 18 (a), our detector yields hypotheses composed of true vehicles and a small amount of noise, whereas noise vastly outnumbers true vehicles in Fig. 18 (b) and (c). The candidate vehicles generated by the proposed algorithm and the competitors are shown in Fig. 19. As Fig. 19 shows, our detector mainly perceives the moving vehicle pixels on the roads and yields few false positives, while ViBe and GMM indiscriminately detect any pixel variation. This illustrates that ViBe and GMM are unable to separate the motion of vehicles from the slow and slight motion of the background. ViBe and GMM estimate the pattern of each pixel using a nonparametric model and a Gaussian mixture distribution, respectively. This strategy is very sensitive to pixel variations and works well in common video processing, especially with a stationary camera. However, it cannot cope with satellite video, because it neglects local or neighborhood information. Our detection algorithm focuses on a local area rather than a single pixel, and can therefore adapt to the moving background of satellite videos.
Frame-by-frame quantitative results. To give a detailed picture of the detection performance of the proposed algorithm and the baselines, some frame-by-frame scores are presented in Fig. 17. Fig. 17 shows the most widely used detection metrics: F1 score, Precision and Recall. It further illustrates that the proposed algorithm outperforms the baselines mainly through its high Precision. In Fig. 17, the F1 score of ViBe and GMM decays rapidly at the beginning, after which both methods are completely overwhelmed by the moving background. For ViBe and GMM, the background motion thus obscures the vehicle motion, leading to their failure. This experiment likewise demonstrates the ability of the proposed algorithm to separate the background motion from the vehicle motion.
6 Conclusion
Satellite videos have the unique capability of observing a city-scale region. This paper addresses the detection of tiny vehicles in satellite videos and designs a practical framework for detecting and tracking tiny moving vehicles. To the best of our knowledge, it is the first work to adopt a probabilistic distribution to represent the patterns of noise in the spatiotemporal domain, which allows us to differentiate candidates from noise. We further propose a multi-morphological-cue based discrimination algorithm to distinguish true vehicle targets from the small amount of remaining noise. Another important contribution is the introduction of a series of evaluation metrics and a complete evaluation protocol. The proposed algorithms are tested on three manually annotated areas of a satellite video and compared with baseline algorithms. These experiments demonstrate the good performance of our algorithms.
7 Acknowledgment
We are grateful to CGSTL for providing the satellite video data used in this study.
References
 [1] Y. Tian, R.S. Feris, H. Liu, A. Hampapur, and M.T. Sun. Robust detection of abandoned and removed objects in complex surveillance videos. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 41(5):565–576, 2011.
 [2] Chia-Chih Chen and J. K. Aggarwal. Recognizing human action from a far field of view. In Motion and Video Computing, 2009. WMVC ’09. Workshop on, 2009.
 [3] Bruce D Lucas, Takeo Kanade, et al. An iterative image registration technique with an application to stereo vision. 1981.
 [4] A. A. Efros, A. C. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In IEEE International Conference on Computer Vision, pages 726–733, 2003.
 [5] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.
 [6] Wei Ao, Feng Xu, Yongchen Li, and Haipeng Wang. Detection and discrimination of ship targets in complex background from spaceborne ALOS-2 SAR images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 11(2):536–550, 2018.
 [7] Yu Zhou, Haipeng Wang, Feng Xu, and Ya-Qiu Jin. Polarimetric SAR image classification using deep convolutional neural networks. IEEE Geoscience and Remote Sensing Letters, 13(12):1935–1939, 2016.
 [8] Feng Xu, Ya-Qiu Jin, and Alberto Moreira. A preliminary study on SAR advanced information retrieval and scene reconstruction. IEEE Geoscience and Remote Sensing Letters, 13(10):1443–1447, 2016.
 [9] Ahlem Walha, Ali Wali, and Adel M Alimi. Video stabilization with moving object detecting and tracking for aerial video surveillance. Multimedia Tools and Applications, 74(17):6745–6767, 2015.
 [10] Brian P Jackson and A Ardeshir Goshtasby. Registering aerial video images using the projective constraint. IEEE Transactions on image processing, 19(3):795–804, 2010.
 [11] Heng Guo, Shuaicheng Liu, Tong He, Shuyuan Zhu, Bing Zeng, and Moncef Gabbouj. Joint video stitching and stabilization from moving cameras. IEEE Transactions on Image Processing, 25(11):5491–5503, 2016.
 [12] Edgardo Molina and Zhigang Zhu. Persistent aerial video registration and fast multi-view mosaicing. IEEE Transactions on Image Processing, 23(5):2184–2192, 2014.
 [13] Nan Jiang and Wenyu Liu. Data-driven spatially-adaptive metric adjustment for visual tracking. IEEE Transactions on Image Processing, 23(4):1556–1568, 2014.
 [14] Pakorn Kaewtrakulpong and Richard Bowden. An improved adaptive background mixture model for real-time tracking with shadow detection. 2001.
 [15] Chris Stauffer and W Eric L Grimson. Adaptive background mixture models for real-time tracking. In Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on., volume 2, pages 246–252. IEEE, 1999.
 [16] Olivier Barnich and Marc Van Droogenbroeck. ViBe: a powerful random technique to estimate the background in video sequences. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 945–948. IEEE, 2009.
 [17] Olivier Barnich and Marc Van Droogenbroeck. ViBe: A universal background subtraction algorithm for video sequences. IEEE Transactions on Image Processing, 20(6):1709–1724, 2011.
 [18] K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Info. Theory, 1975.
 [19] M. L. Miller, H. S. Stone, and I. J. Cox. Optimizing Murty’s ranked assignment method. IEEE Transactions on Aerospace and Electronic Systems, 33(3):851–862, 1997.
 [20] Rainer Stiefelhagen, Keni Bernardin, Rachel Bowers, John Garofolo, Djamel Mostefa, and Padmanabhan Soundararajan. The CLEAR 2006 evaluation. In International Evaluation Workshop on Classification of Events, Activities and Relationships, pages 1–44. Springer, 2006.
 [21] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
 [22] Jiangjian Xiao, Hui Cheng, Harpreet Sawhney, and Feng Han. Vehicle detection and tracking in wide field-of-view aerial video. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 679–684. IEEE, 2010.
 [23] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
 [24] William H Press, Saul A Teukolsky, William T Vetterling, and Brian P Flannery. Numerical recipes in C, volume 2. Cambridge university press Cambridge, 1996.
 [25] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
 [26] Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
 [27] Liang Liang, Hongying Shen, Pietro De Camilli, and James S Duncan. A novel multiple hypothesis based particle tracking method for clathrin mediated endocytosis analysis using fluorescence microscopy. IEEE Transactions on Image Processing, 23(4):1844–1857, 2014.
 [28] Rangachar Kasturi, Dmitry Goldgof, Padmanabhan Soundararajan, Vasant Manohar, John Garofolo, Rachel Bowers, Matthew Boonstra, Valentina Korzhova, and Jing Zhang. Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):319–336, 2009.