MOT20: A benchmark for multi object tracking in crowded scenes
Standardized benchmarks are crucial for the majority of computer vision applications. Although leaderboards and ranking tables should not be over-claimed, benchmarks often provide the most objective measure of performance and are therefore important guides for research. The benchmark for Multiple Object Tracking, MOTChallenge, was launched with the goal of establishing a standardized evaluation of multiple object tracking methods. The challenge focuses on multiple people tracking, since pedestrians are well studied in the tracking community, and precise tracking and detection have high practical relevance. Since the first release, MOT15, MOT16, and MOT17 have tremendously contributed to the community by introducing a clean dataset and precise framework to benchmark multi-object trackers. In this paper, we present our MOT20 benchmark, consisting of 8 new sequences depicting very crowded challenging scenes. The benchmark was first presented at the 4th BMTT MOT Challenge Workshop at the Computer Vision and Pattern Recognition Conference (CVPR) 2019, and gives the chance to evaluate state-of-the-art methods for multiple object tracking when handling extremely crowded scenarios.
Since its first release in 2014, MOTChallenge has attracted a large number of active users who have successfully submitted their trackers and detectors to five different challenges, spanning sequences annotated with bounding boxes over a substantial total length of video. As evaluating and comparing multi-target tracking methods is not trivial (cf. e.g. [?]), MOTChallenge provides carefully annotated datasets and clear metrics to evaluate the performance of tracking algorithms and pedestrian detectors. Parallel to the MOTChallenge all-year challenges, we organize workshop challenges on multi-object tracking for which we often introduce new data.
In this paper, we introduce the MOT20 benchmark, consisting of 8 novel sequences depicting very crowded scenes. All sequences have been carefully selected and annotated according to the evaluation protocol of previous challenges [2, 3]. This benchmark addresses the challenge of very crowded scenes, in which the density can reach values of 246 pedestrians per frame. The sequences were filmed in both indoor and outdoor locations, and include day and night time shots. Figure 1 shows the split of the sequences of the three scenes into training and testing sets. The testing data consists of sequences from known scenes as well as from an unknown scene in order to measure the generalization capabilities of both detectors and trackers. We make available the images for all sequences, the ground truth annotations for the training set, as well as a set of public detections (obtained from a Faster R-CNN trained on the training data) for the tracking challenge.
The MOT20 challenges and all data, current rankings and submission guidelines can be found at: https://motchallenge.net
2 Annotation rules
For the annotation of the dataset, we follow the protocol introduced in MOT16, ensuring that every moving person or vehicle within each sequence is annotated with a bounding box as accurately as possible. In the following, we define a clear protocol that was obeyed throughout the entire dataset to guarantee consistency.
2.1 Target class
In this benchmark, we are interested in tracking moving objects in videos. In particular, we are interested in evaluating multiple people tracking algorithms, hence, people will be the center of attention of our annotations.
We divide the pertinent classes into three categories:
(i) moving pedestrians;
(ii) people that are not in an upright position, not moving, or artificial representations of humans; and
(iii) vehicles and occluders.
In the first group, we annotate all moving pedestrians that appear in the field of view and can be determined as such by the viewer. Furthermore, if a person briefly bends over or squats, e.g., to pick something up or to talk to a child, they shall remain in the standard pedestrian class. The algorithms that submit to our benchmark are expected to track these targets.
In the second group, we include all people-like objects whose exact classification is ambiguous and can vary depending on the viewer, the application at hand, or other factors. We annotate all static people, e.g., those sitting, lying down, or standing still at the same place over the whole sequence. The idea is to use these annotations in the evaluation such that an algorithm is neither penalized nor rewarded for tracking, e.g., a sitting or non-moving person.
In the third group, we annotate all moving vehicles such as cars, bicycles, motorbikes and non-motorized vehicles (e.g., strollers), as well as other potential occluders. These annotations will not play any role in the evaluation, but are provided to the users both for training purposes and for computing the level of occlusion of pedestrians. Static vehicles (parked cars, bicycles) are not annotated as long as they do not occlude any pedestrians.
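The occluder annotations can be used to estimate how much of a pedestrian is hidden. A minimal sketch of such a computation, assuming boxes in (left, top, width, height) format; the function names are illustrative, not part of the benchmark toolkit:

```python
def box_intersection(a, b):
    # Boxes as (left, top, width, height); returns the intersection area.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def occlusion_level(pedestrian, occluders):
    # Fraction of the pedestrian box covered by occluder boxes
    # (an upper bound: overlapping occluders are counted per box).
    area = pedestrian[2] * pedestrian[3]
    if area <= 0:
        return 0.0
    covered = sum(box_intersection(pedestrian, occ) for occ in occluders)
    return min(1.0, covered / area)
```

For example, a pedestrian box half covered by a parked vehicle yields an occlusion level of 0.5.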
3 Datasets
The dataset for the new benchmark has been carefully selected to challenge trackers and detectors on extremely crowded scenes. In contrast to previous challenges, some of the new sequences show a pedestrian density of up to 246 pedestrians per frame. In Fig. 1 and Tab. I, we show an overview of the sequences included in the benchmark.
| Sequence | FPS | Resolution | Length (frames) | Tracks | Boxes | Density | Conditions | Type |
|---|---|---|---|---|---|---|---|---|
| MOT20-03 | 25 | 1173x880 | 2,405 (01:36) | 702 | 313,658 | 130.42 | outdoor, night | new |
| MOT20-04 | 25 | 1545x1080 | 2,080 (01:23) | 669 | 274,084 | 131.77 | outdoor, night | new |
| MOT20-05 | 25 | 1654x1080 | 3,315 (02:13) | 1,169 | 646,344 | 194.98 | outdoor, night | new |
| MOT20-06 | 25 | 1920x734 | 1,008 (00:40) | 271 | 132,757 | 131.70 | outdoor, day | new |
| MOT20-08 | 25 | 1920x734 | 806 (00:32) | 191 | 77,484 | 96.13 | outdoor, day | new |
[Tab. II: per-sequence annotation counts for pedestrians, non-motorized vehicles, static persons, occluders on the ground, and crowd regions; the table body is not reproduced here.]
3.1 MOT20 sequences
We have compiled a total of 8 sequences, of which we use half for training and half for testing. The annotations of the testing sequences will not be released in order to avoid (over)fitting of the methods to the specific sequences. The sequences are filmed in three different scenes. Several sequences are filmed per scene and distributed across the train and test sets. One of the scenes, though, is reserved for test time in order to challenge the generalization capabilities of the methods.
The new data contains circa 3 times more bounding boxes for training and testing compared to MOT17. All sequences are filmed in high resolution from an elevated viewpoint, and the mean crowd density reaches 246 pedestrians per frame, many times denser than in the first benchmark release. Hence, we expect the new sequences to be more challenging for the tracking community and to push the models to their limits when it comes to handling extremely crowded scenes. In Tab. I, we give an overview of the training and testing sequence characteristics for the challenge, including the number of bounding boxes annotated.
3.2 Detections
We trained a Faster R-CNN with a ResNet101 backbone on the MOT20 training sequences, obtaining the detection results presented in Table VI. This evaluation follows the standard protocol for the MOT20 challenge and only accounts for pedestrians. Static persons and other classes are not considered and are filtered out from both the detections and the ground truth.
A detailed breakdown of detection bounding boxes on individual sequences is provided in Tab. III.
[Tab. III: per-sequence number of detections (nDet.), detections per frame (nDet./fr.), and minimum and maximum detection heights; the table body is not reproduced here.]
For the tracking challenge, we provide these public detections as a baseline to be used for training and testing of the trackers. For the MOT20 challenge, we will only accept results on public detections. When the benchmark later opens for continuous submissions, we will accept both public and private detections.
3.3 Data format
All images were converted to JPEG and named sequentially with a 6-digit file name (e.g., 000001.jpg). Detection and annotation files are simple comma-separated value (CSV) files. Each line represents one object instance and contains 9 values, as shown in Tab. IV.
The first number indicates in which frame the object appears, while the second number identifies that object as belonging to a trajectory by assigning a unique ID (set to -1 in a detection file, as no ID is assigned yet). Each object can be assigned to only one trajectory. The next four numbers indicate the position of the bounding box of the pedestrian in 2D image coordinates. The position is indicated by the top-left corner as well as the width and height of the bounding box. This is followed by a single number, which in the case of detections denotes their confidence score. The last two numbers for detection files are ignored (set to -1).
| Position | Name | Description |
|---|---|---|
| 1 | Frame number | Indicates at which frame the object is present |
| 2 | Identity number | Each pedestrian trajectory is identified by a unique ID (-1 for detections) |
| 3 | Bounding box left | x-coordinate of the top-left corner of the pedestrian bounding box |
| 4 | Bounding box top | y-coordinate of the top-left corner of the pedestrian bounding box |
| 5 | Bounding box width | Width in pixels of the pedestrian bounding box |
| 6 | Bounding box height | Height in pixels of the pedestrian bounding box |
| 7 | Confidence score | DET: indicates how confident the detector is that this instance is a pedestrian. GT: acts as a flag whether the entry is to be considered (1) or ignored (0) |
| 8 | Class | GT: indicates the type of object annotated |
| 9 | Visibility | GT: visibility ratio, a number between 0 and 1 that says how much of that object is visible; it can be reduced due to occlusion or image border cropping |
| Label | Class ID |
|---|---|
| Person on vehicle | 2 |
| Non-motorized vehicle | 6 |
| Occluder on the ground | 10 |
An example of such a 2D detection file is:
1, -1, 794.2, 47.5, 71.2, 174.8, 67.5, -1, -1
1, -1, 164.1, 19.6, 66.5, 163.2, 29.4, -1, -1
1, -1, 875.4, 39.9, 25.3, 145.0, 19.6, -1, -1
2, -1, 781.7, 25.1, 69.2, 170.2, 58.1, -1, -1
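Such a detection file can be read with a few lines of standard Python; this is an illustrative sketch (the field names in the returned dictionaries are our own), not the official development kit:

```python
import csv
import io

def parse_detections(text):
    # Each row: frame, id (-1), left, top, width, height, conf, -1, -1
    dets = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        vals = [float(v) for v in row]
        dets.append({
            "frame": int(vals[0]),
            "bbox": tuple(vals[2:6]),  # (left, top, width, height)
            "conf": vals[6],
        })
    return dets
```

Applied to the four example lines above, this yields four detections, three in frame 1 and one in frame 2.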
For the ground truth and results files, the 7th value (confidence score) acts as a flag whether the entry is to be considered. A value of 0 means that this particular instance is ignored in the evaluation, while a value of 1 is used to mark it as active. The 8th value indicates the type of object annotated, following the convention of Tab. V. The last number shows the visibility ratio of each bounding box, which can be reduced due to occlusion by another static or moving object, or due to image border cropping.
An example of such a 2D annotation file is:
1, 1, 794.2, 47.5, 71.2, 174.8, 1, 1, 0.8
1, 2, 164.1, 19.6, 66.5, 163.2, 1, 1, 0.5
2, 4, 781.7, 25.1, 69.2, 170.2, 0, 7, 1.0
In this case, there are 2 pedestrians in the first frame of the sequence, with identity tags 1, 2. In the second frame, we can see a static person (class 7), which is to be considered by the evaluation script and will neither count as a false negative, nor as a true positive, independent of whether it is correctly recovered or not. Note that all values including the bounding box are 1-based, i.e. the top-left corner of the image corresponds to (1, 1).
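The flag-and-class logic above can be sketched as follows; the class IDs come from the example (pedestrian = 1, static person = 7), and the function name is illustrative, not part of the benchmark toolkit:

```python
PEDESTRIAN = 1  # target class, evaluated
# Any other class, or a considered-flag of 0, is ignored: such entries
# count neither as true positives nor as false negatives.

def split_ground_truth(rows):
    # rows: tuples (frame, id, left, top, w, h, flag, class, visibility)
    evaluate, ignore = [], []
    for r in rows:
        flag, cls = int(r[6]), int(r[7])
        if flag == 1 and cls == PEDESTRIAN:
            evaluate.append(r)
        else:
            ignore.append(r)
    return evaluate, ignore
```

On the example annotation file above, the two pedestrians are evaluated and the static person is ignored.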
To obtain a valid result for the entire benchmark, a separate CSV file following the format described above must be created for each sequence and called ‘‘Sequence-Name.txt’’. All files must be compressed into a single ZIP file that can then be uploaded to be evaluated.
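Packaging a set of per-sequence result files can be done with the standard library; a minimal sketch (the function name and directory layout are assumptions for illustration):

```python
import zipfile
from pathlib import Path

def package_results(result_dir, out_zip="submission.zip"):
    # Compress every per-sequence "Sequence-Name.txt" in result_dir
    # into a single ZIP at the archive root, as required for upload.
    result_dir = Path(result_dir)
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for txt in sorted(result_dir.glob("*.txt")):
            zf.write(txt, arcname=txt.name)
    return out_zip
```

Note that the files sit at the root of the archive rather than inside a subfolder.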
4 Evaluation
Our framework is a platform for fair comparison of state-of-the-art tracking methods. By providing authors with standardized ground truth data, evaluation metrics and scripts, as well as a set of precomputed detections, all methods are compared under the exact same conditions, thereby isolating the performance of the tracker from everything else. In the following paragraphs, we detail the set of evaluation metrics that we provide in our benchmark.
4.1 Evaluation metrics
In the past, a large number of metrics for quantitative evaluation of multiple target tracking have been proposed [?, ?, ?, ?, ?, ?]. Choosing “the right” one is largely application dependent and the quest for a unique, general evaluation metric is still ongoing. On the one hand, it is desirable to summarize the performance into one single number to enable a direct comparison. On the other hand, one might not want to lose information about the individual errors made by the algorithms and provide several performance estimates, which precludes a clear ranking.
Following a recent trend [?, ?, ?], we employ two sets of measures that have established themselves in the literature: the CLEAR metrics proposed by Stiefelhagen et al. [?], and a set of track quality measures introduced by Wu and Nevatia [?]. The evaluation scripts used in our benchmark are publicly available.
There are two common prerequisites for quantifying the performance of a tracker. One is to determine for each hypothesized output whether it is a true positive (TP) that describes an actual (annotated) target, or whether the output is a false alarm (or false positive, FP). This decision is typically made by thresholding based on a defined distance (or dissimilarity) measure (see Sec. 4.1.2). A target that is not covered by any hypothesis is a false negative (FN). A good result is expected to have as few FPs and FNs as possible. Next to the absolute numbers, we also show the false positive ratio measured by the number of false alarms per frame (FAF), sometimes also referred to as false positives per image (FPPI) in the object detection literature.
Obviously, it may happen that the same target is covered by multiple outputs. The second prerequisite before computing the numbers is then to establish the correspondence between all annotated and hypothesized objects under the constraint that a true object should be recovered at most once, and that one hypothesis cannot account for more than one target.
For the following, we assume that each ground truth trajectory has one unique start and one unique end point, i.e., that it is not fragmented. Note that the current evaluation procedure does not explicitly handle target re-identification. In other words, when a target leaves the field-of-view and then reappears, it is treated as an unseen target with a new ID. As proposed in [?], the optimal matching is found using Munkres' (a.k.a. Hungarian) algorithm. However, dealing with video data, this matching is not performed independently for each frame, but rather considering a temporal correspondence. More precisely, if a ground truth object i is matched to hypothesis j at time t-1, and the distance (or dissimilarity) between i and j in frame t is below t_d, then the correspondence between i and j is carried over to frame t even if there exists another hypothesis that is closer to the actual target. A mismatch error (or equivalently an identity switch, IDSW) is counted if a ground truth target i is matched to track j and the last known assignment was k != j. Note that this definition of ID switches is more similar to [?] and stricter than the original one [?]. Also note that, while it is certainly desirable to keep the number of ID switches low, their absolute number alone is not always expressive enough to assess the overall performance, but should rather be considered in relation to the number of recovered targets. The intuition is that a method that finds twice as many trajectories will almost certainly produce more identity switches. For that reason, we also state the relative number of ID switches, i.e., IDSW / Recall.
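A minimal sketch of this per-frame assignment with temporal carry-over, assuming boxes keyed by ID and an externally supplied IoU function; a complete implementation would remember the last known assignment across occlusion gaps rather than only the previous frame:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_frame(gt_boxes, hyp_boxes, prev_match, iou, t_d=0.5):
    # gt_boxes / hyp_boxes: {id: box}; prev_match: {gt_id: hyp_id} at t-1.
    # Returns (matches, id_switches) for frame t.
    matches, switches = {}, 0
    # 1. Carry over correspondences whose overlap still exceeds t_d,
    #    even if another hypothesis is now closer.
    for g, h in prev_match.items():
        if g in gt_boxes and h in hyp_boxes and iou(gt_boxes[g], hyp_boxes[h]) >= t_d:
            matches[g] = h
    # 2. Hungarian matching for the remaining boxes on IoU cost.
    free_g = [g for g in gt_boxes if g not in matches]
    used_h = set(matches.values())
    free_h = [h for h in hyp_boxes if h not in used_h]
    if free_g and free_h:
        cost = np.array([[1.0 - iou(gt_boxes[g], hyp_boxes[h])
                          for h in free_h] for g in free_g])
        rows, cols = linear_sum_assignment(cost)
        for r, c in zip(rows, cols):
            if cost[r, c] <= 1.0 - t_d:  # overlap above the threshold
                g, h = free_g[r], free_h[c]
                matches[g] = h
                if g in prev_match and prev_match[g] != h:
                    switches += 1  # mismatch error (IDSW)
    return matches, switches
```

Running this frame by frame and accumulating the unmatched ground truth (FN), unmatched hypotheses (FP), and switches yields the raw CLEAR error counts.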
These relationships are illustrated in Fig. 3. For simplicity, we plot ground truth trajectories with dashed curves, and the tracker output with solid ones, where the color represents a unique target ID. The grey areas indicate the matching threshold (see next section). Each true target that has been successfully recovered in one particular frame is represented with a filled black dot with a stroke color corresponding to its matched hypothesis. False positives and false negatives are plotted as empty circles. See figure caption for more details.
After determining true matches and establishing correspondences, it is now possible to compute the metrics. We do so by concatenating all test sequences and evaluating on the entire benchmark. This is in general more meaningful than averaging per-sequence figures, due to the large variation in the number of targets.
In the most general case, the relationship between ground truth objects and a tracker output is established using bounding boxes on the image plane. Similar to object detection [?], the intersection over union (a.k.a. the Jaccard index) is usually employed as the similarity criterion, with the threshold t_d set to 0.5 (50%).
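The intersection-over-union criterion is a short computation; a sketch for boxes in the benchmark's (left, top, width, height) format:

```python
def iou(a, b):
    # Intersection over union (Jaccard index) of two boxes
    # given as (left, top, width, height).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0
```

Two identical boxes score 1.0; two boxes sharing half their area score 1/3, which falls below the 0.5 matching threshold.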
People are a common object class present in many scenes, but should we track all people in our benchmark? For example, should we track static people sitting on a bench? Or people on bicycles? How about people behind a glass? We define the target class of CVPR19 as all upright walking people that are reachable along the viewing ray without a physical obstacle, i.e., reflections or people behind a transparent wall or window are excluded. We also exclude from our target class people on bicycles or other vehicles. For all these cases where the class is very similar to our target class (see Figure 4), we adopt a similar strategy as in [?]. That is, a method is neither penalized nor rewarded for tracking or not tracking those similar classes. Since a detector is likely to fire in those cases, we do not want to penalize a tracker with a set of false positives for properly following that set of detections, e.g., a person on a bicycle. Likewise, we do not want to penalize with false negatives a tracker that is based on motion cues and therefore does not track a sitting person.
In order to handle these special cases, we adapt the tracker-to-target assignment algorithm to perform the following steps:
(i) at each frame, all bounding boxes of the result file are matched to the ground truth via the Hungarian algorithm;
(ii) in contrast to MOT17, we account for the very crowded scenes and exclude result boxes that overlap with one of the ambiguous classes (distractor, static person, reflection, person on vehicle), removing them from the solution in the detection challenge; and
(iii) during the final evaluation, only those boxes that are annotated as pedestrians are used.
Multiple Object Tracking Accuracy
The MOTA [?] is perhaps the most widely used metric to evaluate a tracker’s performance. The main reason for this is its expressiveness as it combines three sources of errors defined above:
MOTA = 1 − ( Σ_t (FN_t + FP_t + IDSW_t) ) / ( Σ_t GT_t ),
where t is the frame index and GT_t is the number of ground truth objects in frame t. We report the percentage MOTA in our benchmark. Note that MOTA can also be negative in cases where the number of errors made by the tracker exceeds the number of all objects in the scene.
Even though the MOTA score gives a good indication of the overall performance, it is highly debatable whether this number alone can serve as a single performance measure.
Robustness. One incentive behind compiling this benchmark was to reduce dataset bias by keeping the data as diverse as possible. The main motivation is to challenge state-of-the-art approaches and analyze their performance in unconstrained environments and on unseen data. Our experience shows that most methods can be heavily overfitted on one particular dataset, and may not be general enough to handle an entirely different setting without a major change in parameters or even in the model.
To indicate the robustness of each tracker across all benchmark sequences, we show the standard deviation of their MOTA score.
Multiple Object Tracking Precision
The Multiple Object Tracking Precision is the average dissimilarity between all true positives and their corresponding ground truth targets. For bounding box overlap, this is computed as
MOTP = ( Σ_{t,i} d_{t,i} ) / ( Σ_t c_t ),
where c_t denotes the number of matches in frame t and d_{t,i} is the bounding box overlap of target i with its assigned ground truth object. MOTP thereby gives the average overlap between all correctly matched hypotheses and their respective objects and ranges between t_d := 50% and 100%.
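The corresponding computation is a simple average over all matches; a minimal sketch:

```python
def motp(overlaps_per_frame):
    # overlaps_per_frame: for each frame t, the list of overlaps d_{t,i}
    # of every matched hypothesis with its ground truth object.
    total_matches = sum(len(f) for f in overlaps_per_frame)  # sum_t c_t
    if total_matches == 0:
        return 0.0
    return sum(sum(f) for f in overlaps_per_frame) / total_matches
```

For example, matches with overlaps 0.8 and 0.6 in one frame and 1.0 in the next yield a MOTP of 0.8.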
It is important to point out that MOTP is a measure of localization precision, not to be confused with the positive predictive value or relevance in the context of precision / recall curves used, e.g., in object detection.
In practice, it mostly quantifies the localization accuracy of the detector, and therefore, it provides little information about the actual performance of the tracker.
Track quality measures
Each ground truth trajectory can be classified as mostly tracked (MT), partially tracked (PT), and mostly lost (ML). This is done based on how much of the trajectory is recovered by the tracking algorithm. A target is mostly tracked if it is successfully tracked for at least 80% of its life span. Note that it is irrelevant for this measure whether the ID remains the same throughout the track. If a track is only recovered for less than 20% of its total length, it is said to be mostly lost (ML). All other tracks are partially tracked. A higher number of MT and fewer ML is desirable. We report MT and ML as ratios of mostly tracked and mostly lost targets to the total number of ground truth trajectories.
In certain situations one might be interested in obtaining long, persistent tracks without gaps of untracked periods. To that end, the number of track fragmentations (FM) counts how many times a ground truth trajectory is interrupted (untracked). In other words, a fragmentation is counted each time a trajectory changes its status from tracked to untracked and tracking of that same trajectory is resumed at a later point. Similarly to the ID switch ratio (cf. Sec. 4.1.1), we also provide the relative number of fragmentations as FM / Recall.
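These track quality measures can be sketched from per-trajectory tracked/untracked flags, assuming the 80%/20% thresholds above; the function name is illustrative:

```python
def track_quality(tracked_flags):
    # tracked_flags: for each ground truth trajectory, a list of booleans
    # over its life span (True where the target was successfully tracked).
    mt = pt = ml = fm = 0
    for flags in tracked_flags:
        ratio = sum(flags) / len(flags)
        if ratio >= 0.8:
            mt += 1          # mostly tracked
        elif ratio < 0.2:
            ml += 1          # mostly lost
        else:
            pt += 1          # partially tracked
        # A fragmentation each time tracking resumes after a gap,
        # i.e. the status went tracked -> untracked -> tracked.
        seen_tracked = False
        for prev, cur in zip(flags, flags[1:]):
            seen_tracked = seen_tracked or prev
            if cur and not prev and seen_tracked:
                fm += 1
    return mt, pt, ml, fm
```

A trajectory tracked for 9 of 10 frames is MT with no fragmentation; one tracked in frames 2 and 4 only is PT with one fragmentation.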
As we have seen in this section, there are a number of reasonable performance measures to assess the quality of a tracking system, which makes it rather difficult to reduce the evaluation to one single number. To nevertheless give an intuition on how each tracker performs compared to its competitors, we compute and show the average rank for each one by ranking all trackers according to each metric and then averaging across all performance measures.
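The average-rank aggregation can be sketched as follows, assuming every metric has been oriented so that higher is better (error metrics inverted beforehand); the function name is illustrative:

```python
def average_ranks(scores):
    # scores: {tracker_name: [metric_1, metric_2, ...]},
    # every metric oriented so that higher is better.
    names = list(scores)
    n_metrics = len(next(iter(scores.values())))
    ranks = {n: [] for n in names}
    for m in range(n_metrics):
        # Rank all trackers on metric m (1 = best), then record the rank.
        ordered = sorted(names, key=lambda n: scores[n][m], reverse=True)
        for pos, n in enumerate(ordered, start=1):
            ranks[n].append(pos)
    # Average each tracker's ranks across all metrics.
    return {n: sum(r) / len(r) for n, r in ranks.items()}
```

A tracker that is best on every metric gets an average rank of 1.0; note this simple sketch breaks ties by sort order rather than assigning shared ranks.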
5 Conclusion and Future Work
We have presented a new challenging set of sequences within the MOTChallenge benchmark. These sequences contain a large number of targets to be tracked, and the scenes are substantially more crowded compared to previous MOTChallenge releases. The scenes were carefully chosen and include indoor/outdoor and day/night scenarios.
We believe that the MOT20 release within the already established MOTChallenge benchmark provides a fairer comparison of state-of-the-art tracking methods, and challenges researchers to develop more generic methods that perform well in unconstrained environments and on very crowded scenes.
- K. He, X. Zhang, S. Ren, and J. Sun (2015). Deep Residual Learning for Image Recognition. arXiv:1512.03385. Cited by: §3.2.
- L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler (2015). MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking. arXiv:1504.01942. Cited by: §1.
- A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler (2016). MOT16: A Benchmark for Multi-Object Tracking. arXiv:1603.00831. Cited by: §1.
- S. Ren, K. He, R. Girshick, and J. Sun (2015). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv:1506.01497. Cited by: §3.2.