Online Multi-Object Tracking with Historical Appearance Matching and Scene Adaptive Detection Filtering


Young-chul Yoon  Abhijeet Boragule  Kwangjin Yoon  Moongu Jeon
Gwangju Institute of Science and Technology
{zerometal9268, abhijeet, yoon28, mgjeon}

In this paper, we propose methods to handle temporal errors during multi-object tracking. Temporal errors occur when objects are occluded or noisy detections appear near the target. In such situations, tracking may fail and errors such as drift or ID switching occur. It is hard to overcome temporal errors using only motion and shape information. We therefore propose a historical appearance matching method and a joint-input siamese network trained by a 2-step process. Together they can prevent tracking failure even when objects are temporarily occluded or the last matching information is unreliable. We also provide a technique that removes noisy detections effectively according to scene condition. Tracking performance, especially identity consistency, is considerably improved by adding our methods.

1 Introduction

The current paradigm of multi-object tracking is the tracking-by-detection approach. Most trackers assume that detections are already given and focus on labeling each detection with a specific ID. This labeling is done by data association. For an online tracker, the data association problem can be simplified to a bipartite matching problem, and the Hungarian algorithm has frequently been adopted to solve it. Before solving the data association problem, a cost matrix has to be defined. Each element of the cost matrix is the affinity (similarity) between a specific target and a detection (observation). Because data association simply finds 1-to-1 matches on the cost matrix, it is important to derive accurate affinity scores for better performance.
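The 1-to-1 matching on an affinity matrix can be illustrated with a minimal sketch. It uses exhaustive search over permutations as a stand-in for the Hungarian algorithm (same optimum, but only practical for tiny matrices); the function name `best_assignment` and the example values are hypothetical and not taken from the paper.

```python
from itertools import permutations


def best_assignment(affinity):
    """Return (target, observation) pairs that maximize the total affinity.

    Exhaustive search over all permutations of a square matrix. This is a
    stand-in for the Hungarian algorithm and is only usable for tiny inputs.
    """
    n = len(affinity)
    best_score, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        score = sum(affinity[i][perm[i]] for i in range(n))
        if score > best_score:
            best_score, best_perm = score, perm
    return list(enumerate(best_perm))


# Rows are targets, columns are observations (illustrative values).
affinity = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.2, 0.7, 0.1],
]
print(best_assignment(affinity))  # [(0, 0), (1, 2), (2, 1)]
```

Because the matching only sees the numbers in the matrix, an inaccurate affinity in a single cell can flip the whole assignment, which is why the affinity model matters so much.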

Motion is a basic factor of affinity. It is the only information available in simple tracking environments (e.g. tracking dots, which are signals from specific objects like ships or airplanes, on a 2D field). The Kalman filter [5] has frequently been adopted for motion modeling. It can model temporal errors by adaptively predicting and updating position according to the tracking condition. But it is insufficient for tracking objects in more complex situations. Scenes taken directly from an RGB camera contain many difficulties. As shown in Figure 1 (temporal occlusion), objects are occluded by other objects and by obstacles in the scene. To overcome this, we can exploit appearance information. Many works [2, 10, 3, 1, 11, 21] have tried to derive accurate appearance affinity. Several works [2, 10, 1] designed appearance models without using deep learning. Those trackers achieved better performance, but the improvement was not significant. With the rapid development of deep learning, several works [3, 11, 21] applied deep learning to calculate appearance affinity. Most of them used a siamese network to calculate the affinity score. Although a siamese network has a strong ability of discrimination, it can only see cropped patches, which contain limited information. If imperfect detectors like the Deformable Part-based Model [8, 17] are used to extract detections, the detections themselves contain inaccurate information. Those detections are ambiguous, as shown in Figure 1 (noisy detections), and may lead to inaccurate appearance affinity.

Figure 1: Examples of temporal errors. The DPM detector creates noisy detections which include several objects simultaneously or only a small part of an object. Also, many temporal occlusions occur because of complex scene conditions.
Figure 2: Our tracking framework.

We propose several methods to tackle the problems above (noisy detections, temporal occlusion). First, it is hard to match a target to an observation when the recent target appearance is ambiguous. To resolve this, we save reliable historical appearances. With the support of reliable previous appearances, we can get an accurate affinity score even in ambiguous situations. Second, it is necessary to reduce noisy detections as much as possible for better performance. Many trackers have used a constant detection threshold (e.g. 30) for all sequences. Instead of a constant threshold, we propose a method to decide the detection threshold according to scene condition. To the best of our knowledge, this is the first work that filters out detections according to scene condition. In summary, our main contributions are:

We propose a historical appearance matching method to break matching ambiguity;
We provide details of our network structure and 2-step learning method. The 2-step learning method outperforms learning from a single dataset;
We propose a better method to decide the detection confidence threshold. It decides the threshold according to scene condition and performs better than a constant threshold.

The experiments section shows that each of our methods improves the tracking performance.

2 Proposed methods

In this section, we describe our tracking framework and proposed methods. Our framework, shown in Figure 2, is based on a simple framework for online multi-object tracking. It first associates existing targets with observations. Then it updates each target state with the associated observation and processes the birth and death of targets. Our main contributions are in designing the appearance cue and in preprocessing the given detections. They are explained in the following sub-sections.

2.1 Affinity models

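Our affinity model consists of three cues: appearance, shape and motion. The affinity matrix is calculated by multiplying the scores from each cue:

$$\Lambda(i,j) = \Lambda^{A}(X_i, Z_j)\,\Lambda^{S}(X_i, Z_j)\,\Lambda^{M}(X_i, Z_j) \qquad (1)$$

$\Lambda^{A}$, $\Lambda^{S}$ and $\Lambda^{M}$ indicate the appearance, shape and motion affinities between target $X_i$ and observation $Z_j$, respectively. The score from each cue is calculated as below: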


The appearance affinity score is calculated by our proposed method, which is explained in Sections 2.2 and 2.3. Unlike other tracking methods, we predict the state of each target not only for motion but also for shape. Even if the motion and appearance affinities are modeled to be robust to errors, tracking may fail because of noisy detections of different sizes. We therefore apply a Kalman filter to the shape state in the same way as it is used to predict the motion state. The shape affinity is calculated from the relative difference of height and width between the predicted target state and the observation. The motion affinity score is calculated from the Mahalanobis distance between the position of the predicted state and that of the observation, with a predefined covariance matrix. If the camera is fixed, as in the PETS02-S2 dataset, it is better to use the prediction covariance matrix calculated in the Kalman filtering process. But the MOT challenge dataset contains many scenes captured from a moving camera. Because of the linear assumption of the Kalman filter, a fluctuating Kalman prediction covariance may degrade performance. So we use a constant matrix, which generally shows good performance in any scene.
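The shape and motion cues described above can be sketched as follows. This is only an illustration under stated assumptions: the mapping from a distance to a score through `exp(-x)`, the scale factors `xi` and `eta`, and the diagonal covariance values `var_x` and `var_y` are all hypothetical choices, not values given in the text. Only the ingredients (relative height and width difference, Mahalanobis distance with a constant covariance, and the product of the three cues) come from the description.

```python
import math


def shape_affinity(pred_wh, obs_wh, xi=1.0):
    """Score from the relative width/height difference (exp mapping is assumed)."""
    pw, ph = pred_wh
    ow, oh = obs_wh
    rel_diff = abs(ph - oh) / (ph + oh) + abs(pw - ow) / (pw + ow)
    return math.exp(-xi * rel_diff)


def motion_affinity(pred_xy, obs_xy, var_x=100.0, var_y=100.0, eta=1.0):
    """Score from the squared Mahalanobis distance under a constant,
    diagonal covariance (both the values and the exp mapping are assumed)."""
    dx = pred_xy[0] - obs_xy[0]
    dy = pred_xy[1] - obs_xy[1]
    maha_sq = dx * dx / var_x + dy * dy / var_y
    return math.exp(-eta * maha_sq)


def total_affinity(appearance, shape, motion):
    """Product of the three cue scores, each expected to lie in [0, 1]."""
    return appearance * shape * motion


# Predicted target state vs. one observation (illustrative numbers).
s = shape_affinity(pred_wh=(40.0, 100.0), obs_wh=(44.0, 110.0))
m = motion_affinity(pred_xy=(200.0, 150.0), obs_xy=(205.0, 152.0))
print(total_affinity(0.9, s, m))
```

Keeping every cue in the range 0-1 means no single cue dominates the product, which is the balance the appearance network's softmax output is meant to preserve.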

2.2 Joint-input siamese network

There are various forms of siamese network that can be considered for multi-object tracking. According to the experiments of prior works [11, 21], the joint-input siamese network outperforms the other forms. It is also important to keep the output range of the siamese network between 0 and 1 to balance it with the other affinities (motion and shape). The softmax layer of the joint-input siamese network naturally sets the output range between 0 and 1. Our network structure is shown in Figure 3. Unlike prior works, we use batch normalization [9] for better accuracy. It prevents overfitting and improves convergence, so it is useful for training a network with a small amount of training data. Thanks to the convolutional neural network, which can extract rich appearance features, our network outperforms the color histogram even without the historical matching method (Table 1(a)). We trained our network in two steps, pre-training and transfer learning. The details of network training are explained in Figure 6 and Section 3.1.

layer            | filter size | input     | output
conv & bn & relu | 9x9x12      | 128x64x6  | 120x56x12
max pool         | 2x2         | 120x56x12 | 60x28x12
conv & bn & relu | 5x5x16      | 60x28x12  | 56x24x16
max pool         | 2x2         | 56x24x16  | 28x12x16
conv & bn & relu | 5x5x24      | 28x12x16  | 24x8x24
max pool         | 2x2         | 24x8x24   | 12x4x24
flatten          | -           | 12x4x24   | 1x1152
dense            | -           | 1x1152    | 1x150
dense            | -           | 1x150     | 1x2
softmax          | -           | 1x2       | 1x2
Figure 3: Our joint-input siamese network structure. bn indicates a batch normalization layer. The two final outputs are the probabilities that the two inputs are identical or different.
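The layer sizes listed in Figure 3 can be checked with a short calculation. The sketch assumes stride-1 convolutions without padding and non-overlapping 2x2 pooling, which is what the listed input and output sizes imply; the helper names are hypothetical. The 6 input channels correspond to two 3-channel patches concatenated along the channel axis.

```python
def conv_valid(shape, kernel, filters):
    """Output shape of a stride-1 convolution without padding."""
    h, w, _ = shape
    return (h - kernel + 1, w - kernel + 1, filters)


def max_pool(shape, size=2):
    """Output shape of non-overlapping max pooling."""
    h, w, c = shape
    return (h // size, w // size, c)


shape = (128, 64, 6)  # two 128x64 RGB patches stacked channel-wise
for kernel, filters in [(9, 12), (5, 16), (5, 24)]:
    shape = conv_valid(shape, kernel, filters)
    shape = max_pool(shape)
    print(shape)  # (60, 28, 12), then (28, 12, 16), then (12, 4, 24)

flat = shape[0] * shape[1] * shape[2]
print(flat)  # 1152
```

The flattened size of 1152 matches the input of the first dense layer in the table.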

2.3 Historical Appearance Matching

Because of temporal occlusion or inaccurate detection, tracklet information may be unreliable. As mentioned in Section 2.1, the shape and motion cues can handle temporal errors using a Kalman filter. Unlike those cues, however, the appearance feature is large and is hard to model in a way that accounts for temporal errors. Before explaining our method, we revisit adaptive color histogram update. The color histogram of a target can be updated adaptively according to the current matching affinity score. It can be represented as:

$$H^{i}_{t} = (1-\alpha)\,H^{i}_{t-1} + \alpha\,H^{z}_{t} \qquad (3)$$

$H^{i}_{t}$ is the saved color histogram of target $i$ at time $t$, and $H^{z}_{t}$ is the color histogram of the observation matched to target $i$ at time $t$. The update ratio is controlled by $\alpha$, which is large if the current matching affinity score is high and small otherwise. Even with adaptive update, however, the color histogram is unreliable because it is sensitive to changes of light, background and target pose. The joint-input siamese network produces a much more reliable affinity score. But its features cannot be updated adaptively like a color histogram, because the input images are concatenated and jointly inferred through the network. So we propose historical appearance matching (HAM), defined by the following equations:


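$$\Lambda^{S,M}(i,j) = \Lambda^{S}(X_i, Z_j)\,\Lambda^{M}(X_i, Z_j), \quad i \in \{1,\dots,m\},\ j \in \{1,\dots,n\} \qquad (4)$$

Before calculating the appearance affinity, we first create the shape and motion affinity matrix $\Lambda^{S,M}$ using Eq. 2. $X_i$ and $Z_j$ are the $i$-th target and the $j$-th observation, respectively, and $m$ and $n$ are the total numbers of targets and observations. Then we calculate the final affinity matrix as:

$$\Lambda(i,j) = \begin{cases} \Lambda^{S,M}(i,j)\,\Lambda^{A}(X_i, Z_j) & \text{if } \Lambda^{S,M}(i,j) > \tau_{asc} \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$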

Figure 4: Example of breaking ambiguity using historical appearance matching. (Black arrow): It is hard to choose which of observations (1) and (2) corresponds to the target because the recent target appearance is ambiguous. (Red arrow): With the support of historical appearances, the target can be matched to the correct observation (2).

The final affinity is calculated by multiplying in the appearance affinity only when the shape and motion affinity is higher than $\tau_{asc}$, the predefined association threshold. Even when a target and an observation are associated by the Hungarian algorithm, pairs with affinity lower than $\tau_{asc}$ are ignored. So we calculate the appearance affinity score only for pairs whose shape and motion affinity is higher than $\tau_{asc}$. This saves a lot of processing time. The appearance affinity is calculated as follows:


$$\Lambda^{A}_{base}(X_i, Z_j) = SN(X^{recent}_i, Z_j) \qquad (6)$$

For comparison, we define two kinds of appearance matching. One is the baseline method (Eq. 6), which simply compares the most recently matched appearance $X^{recent}_i$ with the observation $Z_j$. It is the output of the siamese network ($SN$) for those two inputs. The other is our proposed method, historical appearance matching (HAM). It is robust against confusion and ambiguity, as illustrated in Figure 4. It is calculated as:

$$\Lambda^{A}_{ham}(X_i, Z_j) = c^{recent}_i\,SN(X^{recent}_i, Z_j) + (1 - c^{recent}_i)\sum_{k=1}^{n_i} w^{k}_i\,SN(X^{k}_i, Z_j) \qquad (7)$$

$c^{recent}_i$ is the recent matching confidence of target $i$. The relative weights of the two terms are controlled by $c^{recent}_i$: if the recent matching is unreliable (low $c^{recent}_i$), the reliable historical appearances take a bigger portion of the appearance affinity, and vice versa. $n_i$ is the number of historical appearances stored for target $i$, $X^{k}_i$ is the $k$-th of them, and $w^{k}_i$ is its relative weight. It is defined as:

$$w^{k}_i = \frac{c^{k}_i}{\sum_{l=1}^{n_i} c^{l}_i} \qquad (8)$$

Each weight is calculated by dividing the matching confidence $c^{k}_i$ of a historical appearance by the sum of all matching confidences, where $c^{k}_i$ is the affinity score at the time the $k$-th historical appearance was matched with target $i$. The weights therefore sum to 1, which ensures that $\Lambda^{A}_{ham}$ is in the range 0-1. As described in Figure 2, the historical appearances of each target are updated when the matching affinity is higher than a predefined threshold. We keep at most 10 historical appearances per target, and the oldest one must be within 30 frames of the current frame. The MOTA scores of the color histogram, the baseline (without HAM) and the proposed method are shown in Table 1(a).
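The matching procedure of this section can be sketched in a few lines. The sketch assumes that the siamese network scores are already available as plain numbers, that stored confidences are positive, and that a target without stored history falls back to the baseline score; the function names and the zero value returned for gated-out pairs are hypothetical choices made for illustration.

```python
def history_weights(confidences):
    """Normalize stored matching confidences so that the weights sum to 1."""
    total = sum(confidences)
    return [c / total for c in confidences]


def ham_affinity(recent_score, recent_conf, hist_scores, hist_confs):
    """Blend the recent-appearance score with confidence-weighted history.

    recent_score: network similarity of the last matched appearance
    recent_conf:  matching confidence of that last match, in [0, 1]
    hist_scores:  network similarities of the stored historical appearances
    hist_confs:   matching confidences saved with those appearances
    """
    if not hist_scores:
        # Assumption: with no stored history, use the baseline score.
        return recent_score
    weights = history_weights(hist_confs)
    hist_term = sum(w * s for w, s in zip(weights, hist_scores))
    return recent_conf * recent_score + (1.0 - recent_conf) * hist_term


def gated_affinity(shape_motion, appearance_fn, tau_asc):
    """Evaluate the (expensive) appearance score only for pairs that pass
    the shape and motion gate; other pairs get 0 (assumed)."""
    if shape_motion <= tau_asc:
        return 0.0
    return shape_motion * appearance_fn()


# Ambiguous recent appearance: both observations score 0.5 against it,
# and the last match was unreliable (confidence 0.2).
hist_confs = [0.8, 0.9]
obs1 = ham_affinity(0.5, 0.2, [0.2, 0.3], hist_confs)
obs2 = ham_affinity(0.5, 0.2, [0.9, 0.8], hist_confs)
print(obs1 < obs2)  # True
```

In this toy case the recent appearance cannot separate the two observations, and the stored history decides the match, which is the situation drawn in Figure 4. Passing the appearance score as a callable makes the saving from gating explicit: the network is never run for pairs that fail the gate.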

2.4 Scene Adaptive Detection Filtering

In popular public benchmarks [12, 15], detections extracted by Deformable Part Models (DPM) [8, 17] are given by default. DPM is not a deep-learning-based detector and produces many false-positive detections. So it is necessary to filter out noisy detections for better performance. It is common to filter them using the detection confidences that are also produced by the detector. Many previous works simply filter out detections whose confidence is lower than a predefined constant threshold. But the distribution of detection confidence varies with the tracking environment. We show an example in Figure 5. The average detection confidence is high in PETS09-S2L1, which is taken by a static camera and in which the size of objects is constant. In contrast, the average confidence is low in ETH-Sunnyday, which is taken from a moving camera in a highly illuminated environment and in which objects are large. If the detection threshold is fixed to work well in PETS09-S2L1, many true-positive detections are filtered out in ETH-Sunnyday (see Figure 5(b)). The distribution even varies within the same scene as time passes. So we propose a simple but robust method to decide the threshold adaptively depending on the scene:


$$\tau_t = (1-\beta^{t})\,\tau_{sa} + \beta^{t}\,\tau_{pre} \qquad (9)$$

$\tau_t$ is the detection threshold at frame $t$. Detections with confidence lower than $\tau_t$ are eliminated before tracking in frame $t$. The first part of the equation is the scene-adaptive threshold $\tau_{sa}$, which considers inter-scene and intra-scene differences (illustrated in Figure 5(a)). $\tau_{sa}$ is defined by Eq. 10, which works as follows.

Two cumulative distribution functions of Gaussian variables are combined through a weighting constant. Each Gaussian variable is derived from the average and standard deviation of detection confidences. The first is calculated from the detection confidences of the most recent 10 frames. Because it is usually calculated from a small number of samples, a second one, calculated from all detection confidences collected up to the current frame, is needed for smoothness. The weighting constant controls the degree of smoothness. A further constant, which determines $\tau_{sa}$, is important; we found that 0.4 generally works best (Table 2(b)). The second part of Eq. 9 is the pre-defined threshold $\tau_{pre}$. Because our tracker operates in a fully online way, this pre-defined threshold is needed for the first few frames, when the number of detection samples is insufficient to calculate the distribution. Its proportion $\beta^{t}$ gets smaller as the frame index $t$ gets bigger. We heuristically selected $\beta$ as 0.95.
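One way to read the threshold computation above is sketched below. Several points are assumptions made only for illustration: the constant 0.4 is treated as a probability level at which each Gaussian's cumulative distribution is inverted, the two resulting values are mixed with a weight `gamma` whose value 0.5 is a placeholder, and the pre-defined threshold is blended in with weight `beta ** t`. Only the 10-frame window, the values 0.4 and 0.95, and the example threshold 30 appear in the text; all function names are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def gaussian_quantile(confidences, p):
    """Value below which a fraction p of a fitted Gaussian lies."""
    mu = mean(confidences)
    sigma = max(pstdev(confidences), 1e-6)  # guard against zero spread
    return NormalDist(mu, sigma).inv_cdf(p)


def scene_adaptive_threshold(recent, overall, p=0.4, gamma=0.5):
    """Mix the quantile of recent-frame confidences (intra-scene variation)
    with the quantile of all confidences seen so far (smoothness)."""
    return (gamma * gaussian_quantile(recent, p)
            + (1.0 - gamma) * gaussian_quantile(overall, p))


def detection_threshold(t, tau_sa, tau_pre=30.0, beta=0.95):
    """Rely on the pre-defined threshold early, on the adaptive one later."""
    w = beta ** t
    return (1.0 - w) * tau_sa + w * tau_pre


def filter_detections(detections, threshold):
    """Keep (box, confidence) pairs whose confidence reaches the threshold."""
    return [d for d in detections if d[1] >= threshold]


overall = [12.0, 18.0, 25.0, 31.0, 22.0, 15.0, 28.0, 20.0]
recent = overall[-4:]  # stands in for the last 10 frames
tau_sa = scene_adaptive_threshold(recent, overall)
for t in (0, 10, 100):
    print(t, round(detection_threshold(t, tau_sa), 2))
```

At `t = 0` the result equals the pre-defined threshold, and it approaches the scene-adaptive value as more frames are observed.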

(a) Various scene conditions
(b) Average detection confidence comparison
Figure 5: (a) The upper row shows the different scene conditions of ETH-Sunnyday and PETS09-S2L1. The lower row shows the varying scene condition between different frames of ETH-Pedcross2. (b) Comparison of average detection confidence between two scenes (ETH-Sunnyday, PETS09-S2L1).

3 Experiments

In this section, we explain the implementation details and show the improvement in tracking performance obtained by adding our methods one by one. We also compare the performance of our tracker with other public trackers. The y-axis in Tables 1 and 2 is Multi-Object Tracking Accuracy (MOTA).

3.1 Two-step siamese network training

The network may be confused if it learns directly from tracking sequences. As shown in Figure 6 (2DMOT2015), many occluded or noisy objects are marked as ground-truth targets. Training on those samples may decrease performance. For this reason, it is better to teach the network the general concept of appearance comparison before training it on examples from real tracking sequences. We separated the training process into two steps: pre-training on the CUHK02 dataset and transfer learning on tracking sequences. The CUHK02 dataset [13] was developed for the person re-identification task and contains 1816 identities, each of which has 4 different samples. All images are clear and not occluded, so it is a proper dataset for learning the general concept of appearance comparison. First, we pre-trained our network for 300 epochs, by which point training converges. In each epoch, 3000 pairs with a positive:negative ratio of 1:1 are used with a mini-batch size of 100. After pre-training, the network learns from the 2DMOT2015 training sequences with a lower learning rate; apart from the decreased learning rate, it is trained in the same way as in the pre-training step.
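The per-epoch sampling described above (3000 pairs, positive:negative ratio 1:1, mini-batches of 100) can be sketched as follows. The data layout (a dictionary from identity to image names), the function names and the toy dataset are hypothetical; the actual sampling procedure used for training is not specified beyond the counts.

```python
import random


def sample_pairs(images_by_id, num_pairs, rng):
    """Draw labeled image pairs with a 1:1 positive:negative ratio.

    Label 1: both images show the same identity. Label 0: different identities.
    """
    ids = list(images_by_id)
    pairs = []
    for k in range(num_pairs):
        if k % 2 == 0:
            pid = rng.choice(ids)
            a, b = rng.sample(images_by_id[pid], 2)
            pairs.append((a, b, 1))
        else:
            pid_a, pid_b = rng.sample(ids, 2)
            a = rng.choice(images_by_id[pid_a])
            b = rng.choice(images_by_id[pid_b])
            pairs.append((a, b, 0))
    rng.shuffle(pairs)
    return pairs


def mini_batches(pairs, batch_size):
    """Split the pair list into consecutive mini-batches."""
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]


# Toy stand-in for a re-identification dataset: 20 identities, 4 images each.
toy = {pid: [f"id{pid}_img{k}" for k in range(4)] for pid in range(20)}
epoch_pairs = sample_pairs(toy, 3000, random.Random(0))
batches = mini_batches(epoch_pairs, 100)
print(len(batches), sum(label for _, _, label in epoch_pairs))  # 30 1500
```

The same sampler can serve both training steps; only the source of the images and the learning rate change between pre-training and transfer learning.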

Figure 6: Our 2-step learning process. The CUHK02 dataset is clean, so it is good for learning the general concept of similarity. In contrast, 2DMOT2015 contains occluded and noisy samples, so it is good for learning real-world tracking situations.

(a) 2DMOT2015 training
(b) MOT16 training
Table 1: MOTA improvement from applying our methods. We tested the contribution of our methods on two different training datasets. In each graph, the left-most bar shows the result of the color-histogram-based appearance model of Eq. 3. (a) 2DMOT2015 training set. The middle bar shows that our joint-input siamese network outperforms the color-histogram-based appearance model even without HAM. The right-most bar shows that performance is further improved by historical appearance matching. (b) MOT16 training set. The second and third bars from the left show the results of networks trained on a single dataset, CUHK02 or the 2DMOT2015 training set. The right-most bar confirms the higher accuracy obtained with 2-step training.

3.2 Performance evaluation

(a) Baseline methods (pre-defined thresholds 20, 30, 40)
(b) Comparison
Table 2: (a) MOTA scores on the 2DMOT2015 training set for two different kinds of detection threshold: the pre-defined threshold (20, 30, 40 from left to right), and a special case of Eq. 10 that does not consider intra-scene differences. (b) MOTA score comparison between the proposed method and the two baseline methods. For each baseline we chose its best score in (a).
Tracker                         | MOTA | IDF1 | IDSW | Frag | FP   | FN    | MT  | ML   | Hz
Rezatofighi et al. [18]         | 23.8 | 33.8 | 365  | 869  | 6373 | 40084 | 5.0 | 58.1 | 32.6
Milan et al. [16]               | 22.5 | 31.5 | 697  | 737  | 7890 | 39020 | 5.8 | 63.9 | 0.2
Fagot-Bouquet et al. [7]        | 24.5 | 34.8 | 298  | 744  | 5864 | 40207 | 5.5 | 64.6 | 7.5
Manen et al. [14]               | 25.7 | 32.7 | 383  | 600  | 4779 | 40511 | 4.3 | 57.4 | 5.0
Bewley et al. [4]               | 17.0 | 17.3 | 1872 | 1872 | 9233 | 39933 | 3.9 | 52.4 | 3.7
Leal-Taixe et al. [11]          | 29.0 | 34.3 | 639  | 1316 | 5160 | 37798 | 8.5 | 48.4 | 52.8
Sanchez-Matilla et al. [20] (O) | 22.3 | 32.8 | 833  | 1485 | 7924 | 38982 | 5.4 | 52.7 | 12.2
Fagot-Bouquet et al. [6] (O)    | 15.8 | 27.9 | 514  | 1010 | 7597 | 43633 | 1.8 | 61   | 28.1
Ours (O)                        | 24.8 | 36.8 | 297  | 673  | 6844 | 39071 | 5.7 | 61.4 | 20.0
Table 3: Comparison with other trackers which have a similar number of FP and FN. Best scores in the table are marked in bold. Online trackers are marked with (O).

Performance improvement: To show the contribution of our methods to tracking performance, we provide several experimental results (Table 1, Table 2). In Table 1, we compare results obtained by sequentially adding our methods related to appearance affinity. We tested on two training datasets (2DMOT2015, MOT16). On the 2DMOT2015 dataset, we tested the validity of our historical appearance matching method. For a fair experiment, we used the network pre-trained on the CUHK02 dataset without additional training on the 2DMOT2015 training set. As Table 1(a) shows, historical appearance matching improves the overall tracking performance. On the MOT16 training set, we tested the validity of the 2-step training method. As Table 1(b) shows, the 2-step training method gives the highest MOTA score, outperforming the networks trained on either single dataset. Table 2 gives experimental results that show the necessity of scene adaptive detection filtering. We compared the MOTA score of our method with the scores of the other filtering methods.

Comparison with other trackers: We compared our methods with several trackers on the 2DMOT2015 benchmark. To verify the strength of our methods in the ID consistency metrics (IDSW, IDF1), we carefully chose the trackers to compare against. Although all competitors use DPM-based public detections, the numbers of false positives (FP) and false negatives (FN) fluctuate according to the detection filtering method. Because IDSW is not counted for missing targets, IDSW usually decreases as FN increases. This means that we cannot simply compare the IDSW or IDF1 metric with every tracker. For a fair comparison, we chose trackers which have numbers of FP and FN similar to ours. The overall comparison is given in Table 3. Our method shows the lowest ID switching and the highest IDF1. IDF1 [19] was proposed to compensate for the limitation of the ID-switching metric. High performance in both IDF1 and ID switching indicates that our tracker can manage target IDs consistently. We attribute this to our historical appearance matching method. The results also reflect the effect of batch processing and the gating technique (Eq. 5). Although our tracker uses a deep neural network, which is usually a time bottleneck, it runs at real-time speed (20 Hz) on a set of complex sequences.
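The relation between the error counts and MOTA in Table 3 can be checked with the standard CLEAR MOT definition, MOTA = 1 - (FP + FN + IDSW) / GT. The sketch below rests on two assumptions not stated in the text: that the third, fifth and sixth numeric entries of each row are IDSW, FP and FN, and that the 2DMOT2015 test set contains 61,440 ground-truth boxes.

```python
def mota(fp, fn, idsw, num_gt):
    """CLEAR MOT accuracy: one minus the error count per ground-truth box."""
    return 1.0 - (fp + fn + idsw) / num_gt


NUM_GT = 61440  # assumed number of ground-truth boxes in the benchmark test set

# (FP, FN, IDSW) as read from Table 3 under the column assumption above.
rows = {
    "Ours": (6844, 39071, 297),
    "Rezatofighi et al.": (6373, 40084, 365),
    "Milan et al.": (7890, 39020, 697),
}
for name, (fp, fn, idsw) in rows.items():
    print(name, round(100 * mota(fp, fn, idsw, NUM_GT), 1))
# Ours 24.8
# Rezatofighi et al. 23.8
# Milan et al. 22.5
```

Under these assumptions the computed values agree with the first numeric entry of each row. The formula also shows why FN dominates the score here: it is one to two orders of magnitude larger than IDSW, so MOTA alone says little about ID consistency.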

4 Conclusion

We proposed several methods to overcome temporal errors caused by occlusion and noisy detections. First, we filter out noisy detections according to scene condition. Second, we designed a joint-input siamese network for appearance matching and trained it using a 2-step learning method. Finally, with the historical appearance matching method, our tracker showed significantly improved performance, especially in ID consistency metrics. Our work has a limitation, however: the network takes only cropped patches as input and lacks contextual information. In future work, we will try to exploit contextual information instead of directly cropping patches from the image.

5 Acknowledgement

This work was supported by the ICT R&D program of MSIP/IITP. [2014-0-00077, Development of global multi-target tracking and event prediction techniques based on real-time large-scale video analysis]


  • [1] N. T. L. Anh, F. M. Khan, F. Negin, and F. Bremond. Multi-object tracking using multi-channel part appearance representation. In AVSS, 2017.
  • [2] S.-H. Bae and K.-J. Yoon. Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. In CVPR, 2014.
  • [3] S.-H. Bae and K.-J. Yoon. Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking. IEEE transactions on pattern analysis and machine intelligence, 40(3), 2018.
  • [4] A. Bewley, L. Ott, F. Ramos, and B. Upcroft. ALExTRAC: Affinity Learning by Exploring Temporal Reinforcement within Association Chains. In ICRA, 2016.
  • [5] G. Chang. Robust kalman filtering based on mahalanobis distance as outlier judging criterion. Journal of Geodesy, 88(4), 2014.
  • [6] L. Fagot-Bouquet, R. Audigier, Y. Dhome, and F. Lerasle. Online multi-person tracking based on global sparse collaborative representations. In ICIP, 2015.
  • [7] L. Fagot-Bouquet, R. Audigier, Y. Dhome, and F. Lerasle. Improving multi-frame data association with sparse representations for robust near-online multi-object tracking. In ECCV, 2016.
  • [8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE transactions on pattern analysis and machine intelligence, 32(9), 2010.
  • [9] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [10] H. Izadinia, I. Saleemi, W. Li, and M. Shah. Multiple people multiple parts tracker. In ECCV, 2012.
  • [11] L. Leal-Taixe, C. Canton-Ferrer, and K. Schindler. Learning by tracking: Siamese cnn for robust target association. In DeepVision workshop in conjunction with CVPR, 2016.
  • [12] L. Leal-Taixe, A. Milan, I. Reid, S. Roth, and K. Schindler. Motchallenge 2015: Towards a benchmark for multi-target tracking. In arXiv:1504.01942.
  • [13] W. Li and X. Wang. Locally aligned feature transforms across views. In CVPR, 2013.
  • [14] S. Manen, R. Timofte, D. Dai, and L. Van Gool. Leveraging single for multi-target tracking using a novel trajectory overlap affinity measure. In WACV, 2016.
  • [15] A. Milan, L. Leal-Taixe, I. Reid, S. Roth, and K. Schindler. Mot16: A benchmark for multi-object tracking. In arXiv:1603.00831.
  • [16] A. Milan, L. Leal-Taixé, K. Schindler, and I. Reid. Joint tracking and segmentation of multiple targets. In CVPR, 2015.
  • [17] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester. Discriminatively trained deformable part models, release 5. rbg/latent-release5/.
  • [18] H. Rezatofighi, A. Milan, Z. Zhang, Q. Shi, A. Dick, and I. Reid. Joint probabilistic data association revisited. In ICCV, 2015.
  • [19] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In BMTT workshop in conjunction with ECCV, 2016.
  • [20] J. Sanchez-Matilla, F. Poiesi, and A. Cavallaro. Multi-target tracking with strong and weak detections. In BMTT-Workshop in conjunction with ECCV, 2016.
  • [21] S. Tang, M. Andriluka, B. Andres, and B. Schiele. Multiple people tracking by lifted multicut and person re-identification. In CVPR, pages 3701–3710, 2017.