Online Multi-Object Tracking with Historical Appearance Matching and
Scene Adaptive Detection Filtering
In this paper, we propose methods to handle temporal errors during multi-object tracking. Temporal errors occur when objects are occluded or noisy detections appear near the target. In such situations, tracking may fail and various errors such as drift or ID-switching occur. It is hard to overcome temporal errors using only motion and shape information. We therefore propose a historical appearance matching method and a joint-input Siamese network trained by a two-step process. These can prevent tracking failure even when objects are temporarily occluded or the last matching information is unreliable. We also provide a useful technique to remove noisy detections effectively according to the scene condition. Tracking performance, especially identity consistency, is greatly improved by attaching our methods.
The current paradigm of multi-object tracking is the tracking-by-detection approach. Most trackers assume that detections are already given and focus on labeling each detection with a specific ID. This labeling is done by data association. For an online tracker, the data association problem can be simplified to a bipartite matching problem, and the Hungarian algorithm has frequently been adopted to solve it. Before solving the data association problem, a cost matrix has to be defined. Each element of the cost matrix is the affinity (similarity) between a specific target and a detection (observation). Because data association simply finds 1-to-1 matches on the cost matrix, it is important to derive accurate affinity scores for better performance.
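The association step above can be sketched in a few lines; this is a minimal illustration (the affinity values and the `min_affinity` cutoff are made up for the example, not the paper's settings):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(affinity, min_affinity=0.3):
    # The Hungarian algorithm finds the 1-to-1 matching that maximizes
    # the total affinity over the target-observation matrix.
    rows, cols = linear_sum_assignment(affinity, maximize=True)
    # Matches whose individual affinity is too low are discarded.
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if affinity[r, c] >= min_affinity]

# 2 existing targets (rows) x 3 observations (columns)
affinity = np.array([[0.9, 0.1, 0.0],
                     [0.2, 0.8, 0.1]])
print(associate(affinity))  # → [(0, 0), (1, 1)]
```

The third observation is left unmatched and would be a candidate for target birth.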
Motion is a basic factor of affinity. It is the only information we can exploit in a simple tracking environment (e.g. tracking dots, which are signals from specific objects like ships or airplanes, on a 2D field). The Kalman filter has frequently been adopted for motion modeling. It can handle temporal errors by adaptively predicting and updating position according to the tracking condition. But it is insufficient for tracking objects in more complex situations. Scenes taken directly from an RGB camera contain many difficulties. As described in Figure 1 (temporal occlusion), objects are occluded by other objects and by obstacles in the scene. To overcome this, we can exploit appearance information. There have been many works [2, 10, 3, 1, 11, 21] that tried to derive accurate appearance affinity. Several works [2, 10, 1] designed appearance models without deep learning; those trackers achieved better performance, but the improvement was not significant. With the rapid development of deep learning, several works [3, 11, 21] applied deep learning to calculate appearance affinity, mostly using a Siamese network. Although the Siamese network has a strong discriminative ability, it only sees cropped patches, which contain limited information. If an imperfect detector like the Deformable Part-based Model [8, 17] is used to extract detections, the detections themselves contain inaccurate information. Such detections are ambiguous, as described in Figure 1 (noisy detections), and may lead to inaccurate appearance affinity.
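The Kalman predict/update cycle mentioned above can be sketched for a constant-velocity position model; the noise magnitudes here are illustrative assumptions, not tuned values:

```python
import numpy as np

dt = 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], float)   # constant-velocity state transition
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], float)    # we observe position only
Q = 0.01 * np.eye(4)                   # process noise (assumed)
R = 1.0 * np.eye(2)                    # measurement noise (assumed)

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    y = z - H @ x                          # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P

x, P = np.array([0.0, 0.0, 1.0, 0.0]), np.eye(4)  # position (0,0), velocity (1,0)
x, P = predict(x, P)
print(x[:2])  # predicted position after one step: [1. 0.]
```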
We propose several methods to tackle the problems above (noisy detections, temporal occlusion). First, it is hard to match a target to an observation when the recent target appearance is ambiguous. To address this, we save reliable historical appearances. With the support of reliable previous appearances, we can get an accurate affinity score even in ambiguous situations. It is also necessary to reduce noisy detections as much as possible for better performance. Many trackers have used a constant detection threshold (e.g. 30) for all sequences. Instead, we propose a method that decides the detection threshold according to the scene condition. To the best of our knowledge, this is the first work that filters out detections according to the scene condition. In summary, our main contributions are:
- We propose the historical appearance matching method to break matching ambiguity;
- We provide the detail of our network structure and the 2-step learning method, which outperforms learning from a single sequence;
- We propose a better method to decide the detection confidence threshold; it decides the threshold according to the scene condition and performs better than a constant threshold.
In the experiments section, we show that each of our methods improves the tracking performance.
2 Proposed methods
In this section, we describe our tracking framework and the proposed methods. Our framework, drawn in Figure 2, is based on a simple online multi-object tracking pipeline. It first associates existing targets with observations; then it updates each target state with the associated observation and processes the birth and death of targets. Our main contributions lie in designing the appearance cue and in preprocessing the given detections, as explained in the following subsections.
2.1 Affinity models
Our affinity model consists of three cues: appearance, shape and motion. The affinity between target t_i and observation o_j is calculated by multiplying the scores from each cue:

Λ(t_i, o_j) = A_app(t_i, o_j) · A_shp(t_i, o_j) · A_mot(t_i, o_j)

where A_app, A_shp and A_mot indicate the appearance, shape and motion scores respectively. The score from each cue is calculated as described below.
The appearance affinity score is calculated by our proposed method, explained in Sections 2.2 and 2.3. Different from other tracking methods, we predict the state of each target not only for motion but also for shape. Although we modeled the motion and appearance affinities to be robust to error, tracking may still fail because of noisy detections with different sizes. The Kalman filter can be applied to the shape state in a similar way as to the motion state: for the shape score, we calculate the relative difference of height and width between the predicted target state and the observation. The motion affinity score is calculated by the Mahalanobis distance between the position of the predicted state and the observation, with a predefined covariance matrix. If the camera is fixed, as in the PETS09-S2 dataset, it is better to use the prediction covariance matrix calculated in the Kalman filtering process. But the MOT challenge dataset contains many scenes captured from a moving camera. Because of the linear assumption of the Kalman filter, a fluctuating Kalman prediction covariance may aggravate the performance, so we take a constant matrix which generally shows good performance in any scene.
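One plausible numpy realization of the shape and motion scores described above; the constant covariance values and the exponential mapping of distances into (0, 1] are our assumptions, not the paper's exact formulas:

```python
import numpy as np

SIGMA = np.diag([15.0, 15.0])          # assumed constant motion covariance
SIGMA_INV = np.linalg.inv(SIGMA)

def motion_affinity(pred_pos, obs_pos):
    d = np.asarray(pred_pos, float) - np.asarray(obs_pos, float)
    m2 = float(d @ SIGMA_INV @ d)      # squared Mahalanobis distance
    return np.exp(-0.5 * m2)           # map distance into (0, 1]

def shape_affinity(pred_hw, obs_hw):
    (h1, w1), (h2, w2) = pred_hw, obs_hw
    rel = abs(h1 - h2) / (h1 + h2) + abs(w1 - w2) / (w1 + w2)
    return np.exp(-rel)                # relative height/width difference

def total_affinity(app, shp, mot):
    return app * shp * mot             # product of the three cues

print(motion_affinity((0.0, 0.0), (0.0, 0.0)))  # → 1.0
```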
2.2 Joint-input siamese network
There are various forms of Siamese network that can be considered for multi-object tracking. From the experiments of prior works, the joint-input Siamese network outperforms the other forms. It is also important to keep the output range of the Siamese network between 0 and 1 to balance it with the other affinities (motion and shape); the softmax layer of the joint-input Siamese network naturally does so. Our network structure is drawn in Figure 3. Different from prior works, we use batch normalization for better accuracy. It prevents overfitting and improves convergence, so it is useful for training a network with a small amount of training data. Thanks to the convolutional neural network, which can extract rich appearance features, ours outperforms the color histogram even without the historical matching method (Table 1(a)). We trained our network in two steps, pre-training and transfer learning. The details of network training are explained in Figure 6 and Section 3.1.
| layer | kernel | input size | output size |
| conv & bn & relu | 9x9x12 | 128x64x6 | 120x56x12 |
| conv & bn & relu | 5x5x16 | 60x28x12 | 56x24x16 |
| conv & bn & relu | 5x5x24 | 28x12x16 | 24x8x24 |
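The spatial sizes in the table are consistent with valid (no-padding) convolutions; the 120→60 and 56→28 transitions between rows suggest 2x2 pooling between the conv layers, which is our inference rather than something stated above. A quick arithmetic check:

```python
def conv_out(size, k):
    return size - k + 1        # 'valid' convolution output size

def pool_out(size):
    return size // 2           # assumed 2x2 pooling with stride 2

h, w = 128, 64                 # joint input patch size from the table
h, w = conv_out(h, 9), conv_out(w, 9)    # conv1, 9x9 kernel
print(h, w)                    # 120 56, matching row 1
h, w = pool_out(h), pool_out(w)          # 120x56 -> 60x28, input of row 2
h, w = conv_out(h, 5), conv_out(w, 5)    # conv2, 5x5 kernel
print(h, w)                    # 56 24, matching row 2
h, w = pool_out(h), pool_out(w)          # 56x24 -> 28x12, input of row 3
h, w = conv_out(h, 5), conv_out(w, 5)    # conv3, 5x5 kernel
print(h, w)                    # 24 8, matching row 3
```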
2.3 Historical Appearance Matching
Because of temporary occlusion or inaccurate detection, tracklet information may be unreliable. As mentioned in Section 2.1, the shape and motion cues can handle temporal errors using the Kalman filter. Different from those cues, however, the appearance feature is large and hard to model considering temporal errors. Before explaining our method, we revisit the adaptive color histogram update, which updates the target color histogram adaptively according to the current matching affinity score. It can be represented as:
H^i_t = α · O^i_t + (1 − α) · H^i_{t−1}

H^i_t denotes the saved color histogram of target i at time t, and O^i_t is the matched observation color histogram of target i at time t. The update ratio is easily controlled by α: it is large if the current matching affinity score is high, and vice versa. However, even with the adaptive update, the color histogram is unreliable because it is sensitive to changes of light, background and target pose. The joint-input Siamese network produces a much more reliable affinity score, but its features cannot be updated adaptively like a color histogram, because the input images are concatenated and jointly inferred through the network. So, we propose historical appearance matching (HAM), formulated in the following equations.
Before calculating the appearance affinity, we first create the shape and motion affinity matrix using Eq. 2, where t_i and o_j are the i-th target and the j-th observation, and N and M are the total numbers of targets and observations respectively. Then, we calculate the final affinity matrix as:
Λ(t_i, o_j) = A_app(t_i, o_j) · A_shp(t_i, o_j) · A_mot(t_i, o_j) if A_shp(t_i, o_j) · A_mot(t_i, o_j) > τ_asc, and 0 otherwise.

The final affinity matrix is calculated by multiplying in the appearance affinity only when the product of the motion and shape affinities is higher than the predefined association threshold τ_asc. Even if a target and an observation are associated by the Hungarian algorithm, pairs with affinity lower than τ_asc are ignored. So, we only calculate the appearance affinity score for pairs whose shape and motion affinity is higher than τ_asc, which saves a lot of processing time. The appearance affinity is calculated as follows.
For comparison, we set up two kinds of appearance matching. One is the baseline method, which simply compares the recently matched appearance of the target with the observation; it is simply the output of the Siamese network on those two inputs. The other is our proposed method, historical appearance matching (HAM), which is robust against confusion and ambiguity, as described in Figure 4. It is calculated as:
A_app(t_i, o_j) = c^i · S(a^i, o_j) + (1 − c^i) · Σ_{k=1}^{L_i} w^i_k · S(h^i_k, o_j)

c^i is the recent matching confidence of target i, and the relative weights of the two terms are controlled by it: if the recent matching is unreliable (c^i is low), the reliable historical appearances take a bigger portion in the appearance affinity, and vice versa. L_i is the length of the historical appearance of target i, and w^i_k is the relative weight of the k-th saved appearance h^i_k of target i. It is defined as:

w^i_k = c^i_k / Σ_{l=1}^{L_i} c^i_l

Each weight of a historical appearance is calculated by dividing its matching confidence c^i_k by the sum of all matching confidences, so that the weights sum to 1. c^i_k indicates the affinity score at the time the k-th historical appearance was matched with target i. Since the weights sum to 1, this naturally assures that A_app is in the range 0-1. As described in Figure 2, the historical appearance of each target is updated when the matching appearance affinity is bigger than a threshold. We keep at most 10 historical appearances per target, with the oldest one within 30 frames of the current frame. MOTA scores of the color histogram, the baseline (w/o HAM) and the proposed method are drawn in Table 1(a).
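A minimal sketch of the HAM score described above. Here `siamese` stands in for the joint-input network's output score, and appearances are opaque objects it can compare; all names are illustrative:

```python
import numpy as np

def ham_affinity(siamese, recent_app, recent_conf, history, obs):
    """history: list of (saved appearance, its matching confidence) pairs."""
    base = siamese(recent_app, obs)        # baseline: recent appearance only
    if not history:
        return base
    confs = np.array([c for _, c in history])
    w = confs / confs.sum()                # weights sum to 1
    hist_score = sum(wk * siamese(ak, obs)
                     for (ak, _), wk in zip(history, w))
    # low recent confidence shifts the weight onto the reliable history
    return recent_conf * base + (1.0 - recent_conf) * hist_score

# toy scorer: 1 if appearances are identical, else 0 (illustration only)
siamese = lambda a, b: 1.0 if a == b else 0.0
print(ham_affinity(siamese, 2, 0.0, [(1, 0.5), (3, 0.5)], 1))  # → 0.5
```

With `recent_conf = 0.0` the recent (unreliable) appearance is ignored entirely and the score comes from the saved history alone.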
2.4 Scene Adaptive Detection Filtering
In popular public benchmarks [12, 15], detections extracted by the Deformable Part Model (DPM) [8, 17] are given by default. DPM is not a deep learning based detector and produces a lot of false-positive detections, so it is necessary to filter out noisy detections for better performance. It is common to do so using the detection confidences produced by the detector: a lot of previous works simply filter out detections whose confidence is lower than a predefined constant threshold. But the distribution of detection confidences varies depending on the tracking environment. We show an example in Figure 5. The average detection confidence is high in PETS09-S2L1, which is taken by a static camera and where object sizes are constant. In contrast, the average confidence is low in ETH-SUNNYDAY, which is taken from a moving camera in a highly illuminated environment and where objects are large. If the detection threshold is fixed to work well on PETS09-S2L1, a lot of true-positive detections are filtered out on ETH-SUNNYDAY (see Figure 5(b)). The distribution even varies within the same scene over time. So, we propose a simple but robust method that adaptively decides the threshold depending on the scene:
τ_t = (1 − γ^t) · τ^s_t + γ^t · τ^c

τ_t is the detection threshold at frame t; detections with confidence lower than τ_t are eliminated before tracking in frame t. The first part of the equation is the scene adaptive threshold τ^s_t, which considers the inter-scene and intra-scene differences (described in Figure 5(a)). τ^s_t is defined as follows.
The scene adaptive threshold τ^s_t is obtained from two cumulative distribution functions of Gaussian variables, combined through a smoothness weight λ. Each Gaussian variable is derived from the average μ and standard deviation σ of detection confidences: the recent distribution is calculated from the detection confidences of the last 10 frames, while the overall distribution, calculated from all detection confidences collected up to the current frame, is needed for smoothness because the recent one is estimated from only a small number of samples. λ controls the degree of smoothness. The constant c decides where τ^s_t falls on the combined distribution; we found that 0.4 generally works best (Table 2(b)). The second part of Eq. 9 is the predefined threshold τ^c. Because our tracker operates in a fully online way, this predefined threshold is needed for the first few frames, when the number of detection samples is insufficient to estimate the distributions. Its proportion γ^t gets smaller as the frame index t gets bigger; we heuristically selected γ as 0.95.
3 Experiments

In this section, we explain the implementation details and show the improvement in tracking performance obtained by attaching our methods one by one. We also compare the performance of our tracker with other published trackers. The y-axis label in Tables 1 and 2 is Multi-Object Tracking Accuracy (MOTA).
3.1 Two-step siamese network training
The network may be confused if it learns directly from tracking sequences. As described in Figure 6 (2DMOT2015), many occluded or noisy objects are marked as ground-truth targets, and training from those samples may decrease the performance. For this reason, it is better to teach the network the general concept of appearance comparison before training it on examples from real tracking sequences. We separated the training process into two steps: pre-training on the CUHK02 dataset and transfer learning on tracking sequences. The CUHK02 dataset was developed for the person re-identification task and contains 1816 identities, each with 4 different samples. All images are clear and not occluded, so it is a proper dataset for learning the general concept of appearance comparison. First, we pre-trained our network for 300 epochs (until training converges). In each epoch, 3000 pairs with a positive:negative ratio of 1:1 are trained with mini-batch size 100. After pre-training, the network learns from the 2DMOT2015 training sequences: we decrease the learning rate and train it in the same way as in the pre-training step.
3.2 Performance evaluation
| Method | MOTA | IDF1 | IDSW | Frag | FP | FN | MT(%) | ML(%) | Hz |
| Rezatofighi et al. | 23.8 | 33.8 | 365 | 869 | 6373 | 40084 | 5.0 | 58.1 | 32.6 |
| Milan et al. | 22.5 | 31.5 | 697 | 737 | 7890 | 39020 | 5.8 | 63.9 | 0.2 |
| Fagot-Bouquet et al. | 24.5 | 34.8 | 298 | 744 | 5864 | 40207 | 5.5 | 64.6 | 7.5 |
| Manen et al. | 25.7 | 32.7 | 383 | 600 | 4779 | 40511 | 4.3 | 57.4 | 5.0 |
| Bewley et al. | 17.0 | 17.3 | 1872 | 1872 | 9233 | 39933 | 3.9 | 52.4 | 3.7 |
| Taixe et al. | 29.0 | 34.3 | 639 | 1316 | 5160 | 37798 | 8.5 | 48.4 | 52.8 |
| Sanchez-Matilla et al. (O) | 22.3 | 32.8 | 833 | 1485 | 7924 | 38982 | 5.4 | 52.7 | 12.2 |
| Fagot-Bouquet et al. (O) | 15.8 | 27.9 | 514 | 1010 | 7597 | 43633 | 1.8 | 61.0 | 28.1 |
Performance improvement: To prove the contribution of our methods to tracking performance, we provide several experimental results (Tables 1 and 2). In Table 1, we compare results obtained by sequentially attaching our methods related to appearance affinity, tested on two training datasets (2DMOT2015, MOT16). On the 2DMOT2015 dataset, we tested the validity of our historical appearance matching method. For a fair experiment, we used the network pre-trained on the CUHK02 dataset without additional training on the 2015 training set. As shown in Table 1(a), it is clear that historical appearance matching improves the overall tracking performance. On the MOT16 training set, we tested the validity of the 2-step training method. As shown in Table 1(b), it achieves the highest MOTA score, outperforming the networks trained on each single dataset. In Table 2, experimental results prove the necessity of scene adaptive detection filtering: we compare the MOTA score of ours with the scores obtained by other filtering methods.
Comparison with other trackers: We compared our method with several trackers on the 2DMOT2015 benchmark. To verify the strength of our method on the ID consistency metrics (IDSW, IDF1), we carefully chose the trackers to compare against. Although all competitors use DPM-based public detections, the numbers of false positives (FP) and false negatives (FN) fluctuate according to the detection filtering method. Because IDSW is not calculated for missing targets, IDSW usually decreases as FN increases; this means we cannot simply compare the IDSW or IDF1 metric across all trackers. For a fair comparison, we chose trackers that have numbers of FP and FN similar to ours. The overall comparison is in Table 3. It is clear that our method shows the lowest ID-switching and the highest IDF1. IDF1 was proposed to compensate for the limitation of the ID-switching metric; high performance on both IDF1 and ID-switching proves that our tracker can maintain target IDs consistently. We attribute this to our historical appearance matching method. The table also shows the effect of batch processing and the gating technique (Eq. 5): although our tracker uses a deep neural network, which is usually a time bottleneck, it runs in real time (20 Hz) on a set of complex sequences.
We proposed several methods to overcome temporal errors which occur because of occlusion and noisy detections. First, we filter out noisy detections according to the scene condition. Second, we designed a joint-input Siamese network for appearance matching and trained it using a 2-step learning method. Finally, with the historical appearance matching method, our tracker showed significantly improved performance, especially on the ID consistency metrics. A limitation of our work is that the network only takes cropped patches as input and so lacks contextual information. In future work, we will try to exploit contextual information instead of directly cropping patches from the image.
This work was supported by the ICT R&D program of MSIP/IITP. [2014-0-00077, Development of global multi-target tracking and event prediction techniques based on real-time large-scale video analysis]
-  N. T. L. Anh, F. M.Khan, F. Negin, and F. Bremond. Multi-object tracking using multi-channel part appearance representation. In AVSS, 2017.
-  S.-H. Bae and K.-J. Yoon. Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. In CVPR, 2014.
-  S.-H. Bae and K.-J. Yoon. Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking. IEEE transactions on pattern analysis and machine intelligence, 40(3), 2018.
-  A. Bewley, L. Ott, F. Ramos, and B. Upcroft. ALExTRAC: Affinity Learning by Exploring Temporal Reinforcement within Association Chains. In ICRA, 2016.
-  G. Chang. Robust kalman filtering based on mahalanobis distance as outlier judging criterion. Journal of Geodesy, 88(4), 2014.
-  L. Fagot-Bouquet, R. Audigier, Y. Dhome, and F. Lerasle. Online multi-person tracking based on global sparse collaborative representations. In ICIP, 2015.
-  L. Fagot-Bouquet, R. Audigier, Y. Dhome, and F. Lerasle. Improving multi-frame data association with sparse representations for robust near-online multi-object tracking. In ECCV, 2016.
-  P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE transactions on pattern analysis and machine intelligence, 32(9), 2010.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
-  H. Izadinia, I. Saleemi, W. Li, and M. Shah. Multiple people multiple parts tracker. In ECCV, 2012.
-  L. Leal-Taixe, C. Canton-Ferrer, and K. Schindler. Learning by tracking: Siamese cnn for robust target association. In DeepVision workshop in conjunction with CVPR, 2016.
-  L. Leal-Taixe, A. Milan, I. Reid, S. Roth, and K. Schindler. Motchallenge 2015: Towards a benchmark for multi-target tracking. arXiv:1504.01942, 2015.
-  W. Li and X. Wang. Locally aligned feature transforms across views. In CVPR, 2013.
-  S. Manen, R. Timofte, D. Dai, and L. Gool. Leveraging single for multi-target tracking using a novel trajectory overlap affinity measure. In WACV, 2016.
-  A. Milan, L. Leal-Taixe, I. Reid, S. Roth, and K. Schindler. Mot16: A benchmark for multi-object tracking. arXiv:1603.00831, 2016.
-  A. Milan, L. Leal-Taixé, K. Schindler, and I. Reid. Joint tracking and segmentation of multiple targets. In CVPR, 2015.
-  R. B. Girshick, P. F. Felzenszwalb, and D. McAllester. Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/~rbg/latent-release5/.
-  H. Rezatofighi, A. Milan, Z. Zhang, Q. Shi, A. Dick, and I. Reid. Joint probabilistic data association revisited. In ICCV, 2015.
-  E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In BMTT workshop in conjunction with ECCV, 2016.
-  J. Sanchez-Matilla, F. Poiesi, and A. Cavallaro. Multi-target tracking with strong and weak detections. In BMTT-Workshop in conjunction with ECCV, 2016.
-  S. Tang, M. Andriluka, B. Andres, and B. Schiele. Multiple people tracking by lifted multicut and person re-identification. CVPR, pages 3701–3710, 2017.