An Empirical Analysis of Visual Features for Multiple Object Tracking in Urban Scenes
This paper addresses the problem of selecting appearance features for multiple object tracking (MOT) in urban scenes. Over the years, a large number of features have been used for MOT. However, it is not clear whether some of them are better than others. Commonly used features are color histograms, histograms of oriented gradients, deep features from convolutional neural networks and re-identification (ReID) features. In this study, we assess how good these features are at discriminating objects enclosed by a bounding box in urban scene tracking scenarios. Several affinity measures, namely the $L_1$, $L_2$ and Bhattacharyya distances, Rank-1 counts and the cosine similarity, are also assessed for their impact on the discriminative power of the features. Results on several datasets show that features from ReID networks are the best for discriminating instances from one another regardless of the quality of the detector. If a ReID model is not available, color histograms may be selected if the detector has a good recall and there are few occlusions; otherwise, deep features are more robust to detectors with lower recall. The project page is www.mehdimiah.com/visual_features.
I Introduction
Cities are faced with many challenges, including how to move people safely and efficiently for their daily activities. Data on the movement of all road users is therefore necessary. Such data can be collected automatically through various kinds of sensors, including video cameras with computer vision algorithms. The main task is to detect and track all road users, which is also called multiple object tracking (MOT). This is one example, among many, of the use of MOT.
Many state-of-the-art MOT methods rely on the strategy called “tracking-by-detection”: first, they detect objects of interest, such as vehicles or pedestrians, then they link the detections between frames to create trajectories. For the second step, various features are used: appearance, spatial information and motion. Even if MOT is a well-studied problem [26, 16, 41, 19], there are still many unsolved challenges limiting the quality of the results. One of them is describing objects’ appearance. It should be possible to distinguish every tracked object from the others, while at the same time considering that the appearance of an object might change over time because of viewpoint changes and illumination variations. Therefore, selecting the most discriminative features and finding a proper way to compare them become two key elements in the process of visual appearance modeling.
Should handcrafted features be used, or should the object appearance be learned? Given the fact that in MOT, several aspects, such as appearance, spatial and motion information, are usually investigated at the same time, it is difficult to tell whether a method is better because of the chosen feature, or the data association method, or the method to predict where the object should be in the future.
Recently, Kornblith et al. [17] studied how models with high performance on ImageNet [7, 29] actually performed for classification on other datasets. The answer is comforting: the better the models are on ImageNet, the better they are on other datasets. However, can the same conclusion be drawn on a downstream task such as tracking? Indeed, MOT encounters atypical challenges such as the need for instance classification, deformations, illumination changes, occlusions, blur, etc. Some of these challenges are absent from the ImageNet dataset. Moreover, the classification required in tracking is more fine-grained to distinguish all object instances: models trained on ImageNet succeed when they correctly classify persons as persons whereas for the MOT task, these persons must be discriminated from one another.
In this paper, we assess the performance of popular visual features to describe objects in MOT in various urban scene scenarios. These are among the most popular scenarios in MOT, and the focus of the most popular MOT datasets. Therefore, the objects of interest to describe are mostly pedestrians and various vehicles. To avoid interference from other MOT components, we only focus on the visual appearance description and comparison for image regions enclosed by bounding boxes (BB). No spatial or motion information is used in this paper. The results suggest that re-identification (ReID) features are the best visual features for MOT tasks. When these features are not available, deep features may be used and give better performance than the color histogram when objects are further apart in time. The HOG features critically degrade when the detector provides imprecise BBs.
The main contributions of this paper are:
- a comparison of visual descriptors on four MOT datasets;
- a new methodology to compare features for the MOT task;
- an analysis of descriptors and affinity measures performance according to the size of objects, the precision of the bounding boxes and the elapsed time between observations.
II Related Work
A large variety of appearance features have been used for tracking. Some of them are briefly reviewed in this section.
One of the most popular features for MOT is the color histogram. Among others, color histograms were used in the work of [28, 46, 33]. In the work of Riahi et al. [28], color histograms are combined with other features such as optical flow and a sparse representation. Optical flow calculates the motion vector of pixels between two frames, while a sparse representation reconstructs an image region using templates from the image regions of a model object and trivial templates, which contain only one non-zero value. If the visual appearance of an object is different from the model object, the reconstruction will require many trivial templates. In the case of the work of Zhu et al. [46] and Sun et al. [33], color histograms are combined with histograms of oriented gradients (HOG).
While the color histogram focuses on the general color appearance of an object, HOG focuses on the texture of an object (spatial arrangement of the colors). Because HOG features are calculated using gradient magnitudes as weights, they often also capture the general shape of an object. Therefore a HOG feature can be seen both as a texture and a shape descriptor. The MOT methods presented in [46, 13, 33] for example rely on HOG. In the work of Heimbach et al. [13], HOG is used solely in combination with a Kalman filter to predict object position.
Many works use deep features as universal descriptors for MOT [36, 30, 35, 23, 6]. The object appearance is described with features from VGG-16 in both the work of Tang et al. [35] and Sadeghian et al. [30], while the work of Wang et al. [36] uses a two-layer custom Convolutional Neural Network (CNN). Class labels (e.g. car, pedestrian, bike) were also used recently as a coarse description of an object appearance [26]. As for [23], the authors worked with VGG-19 from which multiple outputs from different layers were extracted. Finally, recent works include ReID features [4, 38, 45, 44]. These features are computed by learning a model to predict if two detections from two points of view are instances of the same object.
Surprisingly, we could only find one work that compared features for MOT [10]. Because it dates back to 1996, the features that were compared are the distance between the centers of gravity, the size of the BBs, and the correlation between object pixels.
III Tested visual features, affinity measures and datasets
Figure 1 gives an overview of our visual feature evaluation strategy. Given a bounding box-enclosed object extracted from a frame, it is first described with a feature descriptor. Then, we select another frame where this object is present, describe all objects in this frame with the same feature descriptor, and compute an affinity measure to select the most similar object in terms of visual appearance. In the following, we describe the visual features, affinity measures and datasets selected for our evaluation. Given the available MOT datasets, we assume that objects are either pedestrians or vehicles.
III-A Visual features
There are many ways to obtain a description (a numerical vector) from an image of an object enclosed by a BB. We selected eight popular visual descriptors among four categories: color histogram-based, gradient-based, CNN-based and ReID-based models.
Color and grayscale histograms: the RGB color histogram descriptor consists in counting the number of occurrences of each color inside a BB. We elected to use quantized histograms because they are more resilient to noise. For the grayscale 1D histogram, we used 32 bins. For the color histogram, similarly, each channel of the image gives a 32-bin histogram; after concatenation, we obtain a vector of size 96.
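As an illustration of the 32-bin quantization described above, the following minimal sketch builds both descriptors in plain Python. The function names, the nested-list image layout and the grayscale conversion (a plain channel average) are assumptions made for this example; the experiments themselves rely on OpenCV.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each value in 0..255


def channel_histogram(values: Sequence[int], bins: int = 32) -> List[int]:
    """Count 8-bit intensities into `bins` equal-width bins."""
    hist = [0] * bins
    for v in values:
        hist[v * bins // 256] += 1  # maps 0..255 onto 0..bins-1
    return hist


def rgb_histogram(crop: Sequence[Sequence[Pixel]], bins: int = 32) -> List[int]:
    """Concatenate the R, G and B histograms of a cropped BB (3 * bins values)."""
    pixels = [p for row in crop for p in row]
    descriptor: List[int] = []
    for channel in range(3):
        descriptor += channel_histogram([p[channel] for p in pixels], bins)
    return descriptor


def gray_histogram(crop: Sequence[Sequence[Pixel]], bins: int = 32) -> List[int]:
    """Grayscale histogram; the channel average is an assumed conversion."""
    gray = [sum(p) // 3 for row in crop for p in row]
    return channel_histogram(gray, bins)


if __name__ == "__main__":
    crop = [[(0, 0, 0), (255, 128, 7)], [(8, 8, 8), (255, 255, 255)]]
    print(len(rgb_histogram(crop)), len(gray_histogram(crop)))  # 96 32
```

With 32 bins, each bin covers 8 consecutive intensity levels, so small intensity fluctuations usually fall in the same bin.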
Histograms of oriented gradients (HOG): contrary to color-based histograms, gradient-based descriptors are more robust to illumination changes. HOG [5] is one famous example. It captures texture information in the image as well as shape information about the object. To obtain a HOG descriptor, the image is first convolved with kernels to extract vertical and horizontal gradients. From these gradients, their angles and magnitudes can be obtained. The angles are typically quantized between 0 and 180°, as it was shown experimentally that ignoring the sign of the angle gives better results. Histograms are then constructed for several overlapping cells by counting the occurrences of quantized angles weighted by their gradient magnitudes. Since in our experimental setup, the datasets contain either vehicles or pedestrians, the HOG vectors are the same size for all candidate objects in each dataset: the BB is resized to pixels for pedestrians and for vehicles, with cells of size and blocks of size . By quantizing angles into nine bins every 20°, this results in a vector of 1764 elements for vehicles and 3528 for pedestrians.
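The reported descriptor lengths can be checked with the usual HOG bookkeeping. The sketch below assumes 8×8-pixel cells, blocks of 2×2 cells and a block stride of one cell; these are common defaults and are not confirmed by the text. Under those assumptions, a 64×64 window gives 1764 values and a 64×120 window gives 3528, so these window sizes are one configuration consistent with the reported lengths, not necessarily the one used.

```python
def hog_length(win_w: int, win_h: int, cell: int = 8,
               block_cells: int = 2, stride_cells: int = 1,
               bins: int = 9) -> int:
    """Length of a HOG descriptor for a window of win_w x win_h pixels.

    Assumed layout: square cells of `cell` pixels, square blocks of
    `block_cells` x `block_cells` cells, blocks sliding by `stride_cells`.
    """
    cells_x = win_w // cell
    cells_y = win_h // cell
    blocks_x = (cells_x - block_cells) // stride_cells + 1
    blocks_y = (cells_y - block_cells) // stride_cells + 1
    # Each block concatenates one `bins`-bin histogram per cell it contains.
    return blocks_x * blocks_y * block_cells ** 2 * bins


if __name__ == "__main__":
    print(hog_length(64, 64))    # 1764
    print(hog_length(64, 120))   # 3528
```

Because every block histogram is tied to a fixed position in the resized window, a shifted BB moves image content from one cell to another, which changes the descriptor.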
CNN-based features: since their breakthrough in 2012 [18], CNNs are commonly used in computer vision for classification tasks. We used four different architectures which have great performance on ImageNet to extract visual features. For each network, we removed the last fully-connected layer to obtain a descriptor [8]. We evaluated the VGG-19 [31], ResNet-18 [12], DenseNet-121 [14] and EfficientNet-B0 [34] architectures, which respectively provide description vectors of size 4096, 512, 1024 and 1280, have 140, 11, 7 and 4 million parameters, and reach ImageNet top-1 error rates of 27.6%, 30.2%, 25.4% and 23.4%.
Re-identification network: in the case of pedestrian tracking, we evaluated a re-identification method named OSNet-AIN [44], which contains 3 million parameters and gives ReID features of size 512. For vehicles, we extracted ReID features from the model of [39], which contains 24 million parameters and provides vectors of size 2048.
III-B Descriptor affinity measures
Each object BB in a frame is described with a numerical vector $x$. So, given another image containing object BBs, each of them described by a vector $y_j$, the aim is now to find the “most similar” vector to $x$ among $\{y_j\}$, hoping that the two vectors are instances of the same object.
The following five different affinity measures were used to compare the visual feature vectors.
$L_1$ and $L_2$ distances: two common ways to compute affinity between vectors are the $L_1$ and $L_2$ distances. Given two vectors $x$ and $y$, for $p \in \{1, 2\}$, the $L_p$ distance is given by

$$L_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

where $i$ refers to the $i$-th element of each vector and $n$ is the length of the vectors to compare. The smaller the distance is, the more similar the vectors are.
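A minimal sketch of the sum-of-absolute-differences ($L_1$) and Euclidean ($L_2$) distances follows; the function name is chosen for this example only.

```python
from typing import Sequence


def lp_distance(x: Sequence[float], y: Sequence[float], p: int = 1) -> float:
    """L_p distance between two descriptors of equal length (p = 1 or 2)."""
    if len(x) != len(y):
        raise ValueError("descriptors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


if __name__ == "__main__":
    print(lp_distance([0, 0], [3, 4], p=1))  # 7.0
    print(lp_distance([0, 0], [3, 4], p=2))  # 5.0
```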
Rank-1 counts: to compare feature vectors from a CNN, the Rank-1 counts measure was proposed in [15]. It was shown to be efficient to compare deep features. It works by comparing a pair of vectors ($x$, $y$) to other possible pairs ($x$, $z$) such that $z \neq y$. The underlying principle is to find the vector whose elements are closer to a query vector. It is computed as follows:

$$R_1(x, y) = \sum_{i=1}^{n} \mathbb{1}\left( |x_i - y_i| < |x_i - z_i| \;\; \forall z \neq y \right)$$

where $\mathbb{1}(\cdot)$ is an indicator function that takes value 1 if the expression in the argument is true and 0 otherwise. The expression verifies whether the $i$-th element of $x$ is strictly closer to the corresponding element of $y$ compared to all other candidate vectors $z$. The larger $R_1(x, y)$ is, the more similar the objects are.
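The counting rule can be sketched as follows, assuming the other candidates of the same frame are passed explicitly. Unlike a pairwise distance, the score of one candidate depends on all the others. The function names are illustrative.

```python
from typing import List, Sequence


def rank1_count(x: Sequence[float], y: Sequence[float],
                others: Sequence[Sequence[float]]) -> int:
    """Number of dimensions where y is strictly closer to x than every other candidate."""
    count = 0
    for i, (xi, yi) in enumerate(zip(x, y)):
        d = abs(xi - yi)
        if all(d < abs(xi - z[i]) for z in others):
            count += 1
    return count


def rank1_scores(x: Sequence[float],
                 candidates: Sequence[Sequence[float]]) -> List[int]:
    """Rank-1 count of each candidate against all remaining candidates."""
    return [
        rank1_count(x, y, [z for k, z in enumerate(candidates) if k != j])
        for j, y in enumerate(candidates)
    ]


if __name__ == "__main__":
    query = [1, 5, 9]
    print(rank1_scores(query, [[1, 0, 9], [2, 5, 0]]))  # [2, 1]
```

In this toy case the first candidate wins two of the three dimensions, so it would be selected as the match.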
Bhattacharyya distance: this distance [1] measures the affinity between two distributions as follows:
The smaller the distance is, the more similar the objects are.
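For reference, the sketch below shows two common histogram distances built on the Bhattacharyya coefficient $\sum_i \sqrt{p_i q_i}$ of the normalized histograms $p$ and $q$: the classical form $-\ln(\cdot)$ and the Hellinger-type form $\sqrt{1 - (\cdot)}$, which is what OpenCV computes under the name Bhattacharyya. Which variant was used in the experiments is an assumption left open here; both are zero for identical histograms and grow as the overlap shrinks.

```python
import math
from typing import Sequence


def bhattacharyya_coefficient(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Overlap between two histograms after normalizing each to sum to 1."""
    s1, s2 = sum(h1), sum(h2)
    if s1 <= 0 or s2 <= 0:
        raise ValueError("histograms must have a positive total count")
    return sum(math.sqrt(a * b) for a, b in zip(h1, h2)) / math.sqrt(s1 * s2)


def bhattacharyya_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Classical form: -ln(BC); infinite when the histograms do not overlap."""
    bc = bhattacharyya_coefficient(h1, h2)
    return -math.log(bc) if bc > 0 else math.inf


def hellinger_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Hellinger-type form: sqrt(1 - BC), bounded between 0 and 1."""
    bc = bhattacharyya_coefficient(h1, h2)
    return math.sqrt(max(0.0, 1.0 - bc))


if __name__ == "__main__":
    print(hellinger_distance([1, 3], [1, 3]))  # 0.0
    print(hellinger_distance([1, 0], [0, 1]))  # 1.0
```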
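Cosine similarity: for two vectors $x$ and $y$, the cosine similarity is given by:

$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert_2 \, \lVert y \rVert_2} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$$

The smaller the cosine similarity is, the less similar the objects are.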
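A short sketch of the cosine similarity is given below. Because it only depends on the angle between the two descriptors, rescaling a descriptor does not change the score. The function name is illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero descriptors."""
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(x, y)) / (norm_x * norm_y)


if __name__ == "__main__":
    print(cosine_similarity([1, 0], [0, 1]))   # 0.0
    print(cosine_similarity([1, 0], [-1, 0]))  # -1.0
```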
III-C Datasets
We tested the visual features on four datasets commonly used in MOT. Two of them focus on pedestrians and the two others on vehicles. Table I summarizes some of their characteristics.
IV Experimental methodology
To evaluate the performance of 35 descriptor-affinity pairs (each pair composed of a feature and an affinity measure, except pairs between a non-histogram-based feature and the Bhattacharyya distance), we try to link two bounding boxes referring to the same object throughout a video. For that, given a BB-enclosed object extracted from a frame, we describe it with a feature descriptor. Then we select another frame where this object is present, describe all objects in this frame with the same feature descriptor, and compute an affinity measure to select the most similar object to the one in the first frame (Figure 1). We then verify whether the match is correct based on the ground truth.
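The pair count and the matching step can be sketched as follows. The feature and measure labels are shorthand for this example, and treating HOG as histogram-based is an assumption: it is the reading under which eight features and five measures give exactly 35 pairs. The matching helper covers pairwise measures only; Rank-1 counts would need the full candidate set.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

FEATURES = ["RGB", "GR", "HOG", "VGG-19", "ResNet-18",
            "DenseNet-121", "EfficientNet-B0", "ReID"]
MEASURES = ["L1", "L2", "Rank-1", "Bhattacharyya", "cosine"]
# Assumption: HOG counts as histogram-based, which yields the total of 35.
HISTOGRAM_BASED = {"RGB", "GR", "HOG"}


def descriptor_affinity_pairs() -> List[Tuple[str, str]]:
    """All (feature, measure) pairs, skipping Bhattacharyya for non-histogram features."""
    return [(f, m) for f, m in product(FEATURES, MEASURES)
            if m != "Bhattacharyya" or f in HISTOGRAM_BASED]


def l1(x: Sequence[float], y: Sequence[float]) -> float:
    return sum(abs(a - b) for a, b in zip(x, y))


def best_match(query: Sequence[float],
               candidates: Sequence[Sequence[float]],
               affinity: Callable[[Sequence[float], Sequence[float]], float] = l1,
               larger_is_better: bool = False) -> int:
    """Index of the candidate most similar to the query under `affinity`."""
    scores = [affinity(query, c) for c in candidates]
    choose = max if larger_is_better else min
    return choose(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    print(len(descriptor_affinity_pairs()))                # 35
    print(best_match([1, 2], [[5, 5], [1, 3], [0, 0]]))    # 1
```

A match is then counted as correct when the identity of the selected candidate equals the identity of the query object in the ground truth.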
IV-A Data preparation
It should be noted that working with the true BBs is too ideal a scenario. In practice, when “tracking-by-detection” is applied, the detection algorithm may omit an object or detect non-object elements, and the predicted BBs may be slightly shifted. In order to simulate the BBs returned by a detector, we introduced noise in two ways.
Noisy coordinates: the first way is to add white Gaussian noise to each coordinate independently. Given a BB $(x_1, y_1, x_2, y_2)$, where $(x_1, y_1)$ is the top-left coordinate of the BB and $(x_2, y_2)$ the bottom-right one, noisy coordinates $(\tilde{x}_1, \tilde{y}_1, \tilde{x}_2, \tilde{y}_2)$ are obtained by sampling as follows:

$$\tilde{x}_1 \sim \mathcal{N}\left(x_1, (\sigma w)^2\right), \qquad \tilde{y}_1 \sim \mathcal{N}\left(y_1, (\sigma h)^2\right)$$

where $w$ and $h$ are the BB width and height. $\tilde{x}_2$ and $\tilde{y}_2$ are calculated similarly from $x_2$ and $y_2$. The parameter $\sigma$ controls the variance of the Gaussian. Figure 2 illustrates the effect of $\sigma$ on BB coordinates. By introducing noise in this way, it is still possible to get access to the true identity of each object, which would not be possible if we used a detector. Indeed, a detector may miss some hard-to-detect objects, resulting in a biased analysis.
Note that adding Gaussian noise is not sufficient: we have to make sure that the new coordinates are valid (integers such that $0 \le \tilde{x}_1 < \tilde{x}_2 \le W$ and $0 \le \tilde{y}_1 < \tilde{y}_2 \le H$, where $W$ and $H$ represent the width and the height of the frame, respectively). We chose the following values for $\sigma$: 0, 0.05, 0.1 and 0.2.
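A possible implementation of this perturbation is sketched below. It assumes that the standard deviation of the noise is the noise parameter (called `sigma` here) times the BB width or height, and that validity is enforced by rounding, clipping to the frame and resampling degenerate boxes. The text only states that the coordinates must be valid, so the enforcement strategy is an assumption, as are the function names.

```python
import random
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), top-left and bottom-right


def _clip(value: int, low: int, high: int) -> int:
    return max(low, min(high, value))


def noisy_box(box: Box, sigma: float, frame_w: int, frame_h: int,
              rng: Optional[random.Random] = None, max_tries: int = 100) -> Box:
    """Perturb each coordinate with Gaussian noise scaled by the box size."""
    rng = rng or random.Random()
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    for _ in range(max_tries):
        nx1 = _clip(round(rng.gauss(x1, sigma * w)), 0, frame_w)
        ny1 = _clip(round(rng.gauss(y1, sigma * h)), 0, frame_h)
        nx2 = _clip(round(rng.gauss(x2, sigma * w)), 0, frame_w)
        ny2 = _clip(round(rng.gauss(y2, sigma * h)), 0, frame_h)
        if nx1 < nx2 and ny1 < ny2:  # keep only non-degenerate boxes
            return nx1, ny1, nx2, ny2
    return box  # fall back to the clean box if no valid sample was drawn


if __name__ == "__main__":
    rng = random.Random(0)
    for sigma in (0.0, 0.05, 0.1, 0.2):
        print(sigma, noisy_box((10, 20, 50, 80), sigma, 100, 100, rng))
```

With `sigma = 0` the box is returned unchanged, which corresponds to a detector with perfectly localized BBs.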
Sampling step: the second way to simulate noise is by skipping frames: instead of comparing two consecutive frames, we increase the sampling step. Therefore, the visual appearance of objects changes more, which simulates the case where the detector missed some objects or the object was not visible for several frames. We chose the following sampling steps: 1, 2, 4, 8, 16 and 32 frames. Note that this can result in different temporal skips in seconds depending on the video frame rate. This should simply be viewed as gradually including more and more missing detections, making the matching more difficult.
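The frame pairing can be illustrated as follows. Pairing every frame `t` with frame `t + step` is an assumption of this sketch (the text does not say whether pairs overlap), and the frame rate in the usage example is hypothetical.

```python
from typing import List, Tuple


def frame_pairs(num_frames: int, step: int) -> List[Tuple[int, int]]:
    """(query frame, candidate frame) index pairs separated by `step` frames."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return [(t, t + step) for t in range(num_frames - step)]


def skip_in_seconds(step: int, fps: float) -> float:
    """Elapsed time between the two frames of a pair."""
    return step / fps


if __name__ == "__main__":
    print(frame_pairs(5, 2))        # [(0, 2), (1, 3), (2, 4)]
    print(skip_in_seconds(32, 25))  # 1.28 (hypothetical 25 fps video)
```

The same step therefore corresponds to a longer elapsed time on a low frame rate video than on a high frame rate one.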
IV-B Performance measure
For a given descriptor-affinity pair, a $\sigma$-step configuration and a sequence of a dataset, we evaluated the average precision for pairs (query object, set of candidate objects) by calculating the ratio of the number of correct matches (when we find the same object among the set) over the total number of tested query objects. We reported the mean average precision over sequences on each dataset.
The configuration ($\sigma = 0$, step = 1) refers to the case where the detector can perfectly detect all objects.
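The measure amounts to a per-sequence ratio followed by an unweighted mean over sequences. The sketch below assumes that each sequence is summarized by the identities selected for its queries and the corresponding ground-truth identities; the function names and data layout are illustrative.

```python
from typing import Hashable, List, Sequence, Tuple


def sequence_precision(predicted_ids: Sequence[Hashable],
                       true_ids: Sequence[Hashable]) -> float:
    """Ratio of correct matches over the number of tested query objects."""
    if not true_ids or len(predicted_ids) != len(true_ids):
        raise ValueError("need one prediction per query, and at least one query")
    correct = sum(p == t for p, t in zip(predicted_ids, true_ids))
    return correct / len(true_ids)


def dataset_score(sequences: List[Tuple[Sequence[Hashable], Sequence[Hashable]]]) -> float:
    """Mean of the per-sequence precisions (each sequence weighs the same)."""
    scores = [sequence_precision(pred, true) for pred, true in sequences]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(sequence_precision([1, 2, 3, 4], [1, 2, 0, 4]))            # 0.75
    print(dataset_score([([1, 2], [1, 2]), ([1, 2], [2, 1])]))       # 0.5
```

Averaging over sequences instead of over all queries prevents a single long sequence from dominating the score of a dataset.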
IV-C Implementation details
For HOG, color and grayscale histograms, we used the implementations from OpenCV [2]. Models and weights for VGG-19, ResNet-18 and DenseNet-121 come directly from PyTorch [27]. As for EfficientNet-B0, we used the implementation provided by [24]. We relied on pretrained models learned on ImageNet. When using CNN-based models, as recommended by PyTorch, RGB BBs are resized to and normalized. OSNet-AIN weights were pretrained by [44] on ImageNet and fine-tuned on Market1501 [42], and are available in the torchreid library [43]. Weights for vehicle ReID were trained by [39] on VeRi [20, 21], CompCars Surveillance [40] and BoxCars [32], and fine-tuned in an unsupervised manner on the AI City Challenge dataset [25].
V Results and analysis
V-A General feature performance
We summarized all results from the four datasets into four figures to rank the descriptor-affinity pairs according to 24 $\sigma$-step configurations (combinations of $\sigma$ and sampling step). For each case and for each dataset, we only reported the best five descriptor-affinity pairs and, for categories of features which were not in the top-5, the best model among them.
[Figure legend, partially recovered: Color histogram (RGB), black; Grayscale histogram (GR), gray; Vehicle ReID (VID), pink; $L_1$ distance (L1), marked by a hatch pattern.]
Unsurprisingly, the four figures show that increasing the parameter $\sigma$ and/or the sampling step decreases the matching performance of the best descriptor-affinity model.
Color and grayscale histograms
Color histograms are competitive features, in particular for vehicle-centered datasets, for a low sampling step and a low $\sigma$, especially when combined with the Bhattacharyya distance. For these configurations, depending on the datasets, this model is almost always in the top-5. On WildTrack, due to the low frame rate, it is not able to discriminate pedestrians when their BBs are separated by more than two seconds, meaning that the objects should not be occluded for too long, or that the detector should have a good recall. We explain this by their color appearance changing too much between these two frames. However, when the BBs do not enclose the object precisely (high $\sigma$), these models almost always rank in the top-5. This is due to the excessive loss of semantic information when BB coordinates are imprecise: low-level characteristics such as colors are in that case relevant.
Histograms of oriented gradients
HOG is a good appearance feature descriptor when the BB coordinates correspond to the ground truth. As soon as they get imprecise, the performance of HOG decreases dramatically. Regardless of the dataset, if the BB corresponds to the ground truth, the HOG descriptor is among the best models (when it is not in the top-5, the deviation in absolute value from the best model is small). But when $\sigma$ increases, its performance falls on average to the 30th position amongst 35 candidates. In the case of very imprecise BBs, the gap between this feature and the others is significant. This is due to the construction of HOG: feature vectors are computed over cells of pixels, so a small shift in the BB changes the content of every cell and the feature is not robust.
CNN-based features
This category of models is competitive when the sampling step is high and $\sigma$ is moderate. When $\sigma$ is less than and the sampling step over , a CNN-based model is often in the top-5 ranking. Features computed from a VGG-19 descriptor are not competitive against the three other CNN-based models, as these features never rank in the top-5. Moreover, cosine similarity is sometimes a good affinity measure, but not consistently.
Re-identification network
In the case of pedestrian tracking, OSNet-AIN is generally the best visual feature regardless of the performance of the detector. In almost all configurations, this model ranks first, with either the $L_1$ or $L_2$ distance or the cosine similarity. Since this model is trained to discriminate pedestrians, it is made to extract meaningful instance-specific characteristics from images. So, even if the BB of the image is corrupted, it is able to discriminate persons. As for vehicles, the model from [39] ranks in the top-5 when $\sigma$ is over 0.1 or when the sampling step is over 4. For BBs more similar to the ground truth, the deviation in absolute value from the best model is low. Cosine similarity is systematically the best affinity measure for this descriptor.
V-B Feature performance according to size of objects
In addition to two characteristics of the detector (its ability to predict correctly the coordinates of the BBs and to avoid missed detections), the size of objects might influence the choice of the visual feature descriptor. Smaller objects are commonly the hardest targets to track in MOT. But it is unclear how visual features are affected by the size of BBs.
Figure 7 gives the average precision with regard to the query object size, on the UAVDT dataset where there are few occlusions. The selected $\sigma$-step configuration corresponds to the hardest one (0.2-32), where differences are more meaningful. For a fair comparison, only the distance is used.
Firstly, for any feature, the larger the query object is, the easier it is to get the correct match. RGB histograms are among the best visual features for the smallest objects (with an area below approximately 250 pixels), where it is difficult to extract semantics. But for larger objects, ReID features give the best performance. Then, the tested CNN-based models, except VGG-19 which performs more poorly, yield similar results, but lower than ReID, which indicates that performing well on ImageNet does not necessarily produce better features for MOT. Similar conclusions can be drawn from other datasets, with the exception of WildTrack because of its small scale in terms of available data (cf. appendix).
VI Conclusion
In this paper, we compared several feature descriptors in the context of MOT in urban scenes. Our experiments show that features perform differently given the quality of bounding boxes. ReID features, combined with cosine similarity, are among the best descriptors for pedestrians and vehicles, regardless of the performance of the detector. If these models are not available, color histograms with the Bhattacharyya distance are competitive when the boxes are not too noisy. But, as soon as the bounding boxes get noisier, these methods are not able to compete against deep features. Moreover, the size of objects matters for the choice of visual features: in difficult cases, compared to RGB histograms and modern deep features, ReID features particularly stand out on medium-sized objects.
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [CRDPJ 528786 - 18], [DG 2017-06115], and the support of Arcturus Networks.
In the following, we provide some results on other datasets related to the statements about the effect of the size of an object.
References
[1] (1943) On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematical Society 35, pp. 99–109.
[2] (2000) The OpenCV library. Dr. Dobb’s Journal of Software Tools.
[3] (2018) WILDTRACK: A Multi-Camera HD Dataset for Dense Unscripted Pedestrian Detection. In CVPR.
[4] (2017) Beyond Triplet Loss: A Deep Quadruplet Network for Person Re-Identification. In CVPR.
[5] (2005-06) Histograms of oriented gradients for human detection. In CVPR.
[6] (2016) Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking. In ECCV.
[7] (2009) ImageNet: A large-scale hierarchical image database. In CVPR.
[8] (2014) DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In ICML.
[9] (2018) The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking. In ECCV.
[10] (1996-08) Comparing features for target tracking in traffic scenes. Pattern Recognition 29 (8), pp. 1285–1296.
[11] (2016) Deep Motion Features for Visual Tracking. In International Conference on Pattern Recognition (ICPR).
[12] (2016) Deep Residual Learning for Image Recognition. In CVPR.
[13] (2017-10) Resolving occlusion ambiguity by combining Kalman tracking with feature tracking for image sequences. In 2017 51st Asilomar Conference on Signals, Systems, and Computers, pp. 144–147.
[14] (2017) Densely Connected Convolutional Networks. In CVPR.
[15] (2017) End-To-End Face Detection and Cast Grouping in Movies Using Erdos-Renyi Clustering. In ICCV.
[16] (2016-11) Tracking All Road Users at Multimodal Urban Traffic Intersections. IEEE Transactions on Intelligent Transportation Systems 17 (11), pp. 3241–3251.
[17] (2019) Do Better ImageNet Models Transfer Better?. In CVPR.
[18] (2012) ImageNet Classification with Deep Convolutional Neural Networks. In NIPS.
[19] (2015-04) MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking. arXiv:1504.01942.
[20] (2016) Large-scale vehicle re-identification in urban surveillance videos. In International Conference on Multimedia and Expo (ICME).
[21] (2016) A Deep Learning-Based Approach to Progressive Vehicle Re-identification for Urban Surveillance. In ECCV.
[22] (2017-05) Multiple Object Tracking: A Literature Review. arXiv:1409.7618.
[23] (2015) Hierarchical Convolutional Features for Visual Tracking. In ICCV.
[24] (2020-03) Lukemelas/EfficientNet-PyTorch.
[25] (2018) The 2018 NVIDIA AI City Challenge. In CVPR Workshops.
[26] (2018) Multiple Object Tracking in Urban Traffic Scenes with a Multiclass Object Detector. In International Symposium on Visual Computing (ISVC).
[27] (2019) PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS.
[28] (2015) Multiple object tracking based on sparse generative appearance modeling. In IEEE International Conference on Image Processing (ICIP).
[29] (2015-12) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
[30] (2017) Tracking the Untrackable: Learning to Track Multiple Cues With Long-Term Dependencies. In ICCV.
[31] (2015) Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR.
[32] (2018) BoxCars: Improving Fine-Grained Recognition of Vehicles Using 3-D Bounding Boxes in Traffic Surveillance. IEEE Transactions on Intelligent Transportation Systems 20 (1), pp. 97–108.
[33] (2013-10) Multiple pedestrians tracking algorithm by incorporating histogram of oriented gradient detections. IET Image Processing 7 (7), pp. 653–659.
[34] (2019) EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In ICML.
[35] (2017) Multiple People Tracking by Lifted Multicut and Person Re-Identification. In CVPR.
[36] (2014) Learning deep features for multiple object tracking by using a multi-task learning strategy. In IEEE International Conference on Image Processing (ICIP).
[37] (2020-04) UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking. Computer Vision and Image Understanding 193, pp. 102907.
[38] (2017) Simple online and realtime tracking with a deep association metric. In IEEE International Conference on Image Processing (ICIP).
[39] (2018) Vehicle Re-Identification With the Space-Time Prior. In CVPR Workshops.
[40] (2015) A Large-Scale Car Dataset for Fine-Grained Categorization and Verification. In CVPR.
[41] (2017-05) Multiple Object Tracking with Kernelized Correlation Filters in Urban Mixed Traffic. In Conference on Computer and Robot Vision (CRV).
[42] (2015) Scalable Person Re-Identification: A Benchmark. In ICCV.
[43] (2019) Torchreid: A library for deep learning person re-identification in pytorch. arXiv:1910.10093.
[44] (2019-10) Learning Generalisable Omni-Scale Representations for Person Re-Identification. arXiv:1910.06827.
[45] (2019) Omni-Scale Feature Learning for Person Re-Identification. In ICCV.
[46] (2012) A real-time and robust approach for short-term multiple objects tracking. In International Conference on Computer Science and Information Processing (CSIP).