FAMNet: Joint Learning of Feature, Affinity and Multi-dimensional Assignment for Online Multiple Object Tracking
Abstract
Data association-based multiple object tracking (MOT) involves multiple separate modules that are processed or optimized differently, which results in complex method design and requires non-trivial parameter tuning. In this paper, we present an end-to-end model, named FAMNet, where Feature extraction, Affinity estimation and Multi-dimensional assignment are refined in a single network. All layers in FAMNet are designed to be differentiable and thus can be optimized jointly to learn discriminative features and a higher-order affinity model for robust MOT, supervised by a loss computed directly from the assignment ground truth. We also integrate a single object tracking technique and a dedicated target management scheme into the FAMNet-based tracking system to further recover false negatives and inhibit noisy target candidates generated by the external detector. The proposed method is evaluated on a diverse set of benchmarks including MOT2015, MOT2017, KITTI-Car and UA-DETRAC, and achieves promising performance on all of them in comparison with state-of-the-art methods.
1 Introduction
Tracking multiple objects in video is critical for many applications, ranging from vision-based surveillance to autonomous driving. A currently popular framework for multiple object tracking (MOT) uses the tracking-by-detection strategy, where target candidates generated by an external detector are associated and connected to form the target trajectories across frames [1, 14, 21, 34, 37, 45, 49]. At the core of the tracking-by-detection strategy lies the data association problem, which is usually treated as three separate parts: feature extraction for candidate representation, an affinity metric to evaluate the cost of each association hypothesis, and an association algorithm to find the optimal association. These parts involve multiple individual data-processing steps and are optimized differently from each other, which results in a complex method design and extensive parameter tuning to adapt to different target categories and tracking scenarios.
Recently, deep neural networks (DNNs) have been investigated intensively to learn the association cost function in a unified architecture combining both feature extraction and affinity metric [10, 26, 42]. Through training, the candidate representation and estimation metric automatically adapt to the task and scenario prior without manual tuning of hyperparameters. However, the association algorithm still stands outside the network, which requires dedicated affinity samples to be manually fabricated from the ground truth association for the training process. It is not guaranteed that the training and inference phases share the same data distribution; consequently, the generalizability of the trained model may degrade. Moreover, crowded targets, similar appearance and fast motion make the association highly ambiguous when only pairs of neighboring frames are considered. Successful association requires global optimization across multiple frames, where higher-order discriminative clues such as appearance changes over time and motion context can be included. Learning a robust representation and affinity criteria without cooperation from the association procedure is even more complicated in this circumstance.
Our objective in this paper is to formulate an end-to-end model for MOT: the Feature representation, Affinity model and Multi-dimensional assignment (MDA) are refined in a single deep network named FAMNet, which is optimized jointly to learn the task prior. In particular, a feature subnetwork is used to extract features for candidates on each frame, after which an affinity subnetwork estimates the higher-order affinity for all association hypotheses. With the affinity, the MDA subnetwork optimizes globally to obtain the optimal assignments. Because all layers in FAMNet are designed to be differentiable, the feature and affinity subnetworks can be trained directly against the assignment ground truth. To realize this, we make the following contributions to FAMNet and the tracking system built on it:

We design an affinity subnetwork that fuses discriminative higher-order appearance and motion information into the affinity estimation.

We propose an MDA subnetwork, in which a modified rank-1 tensor approximation power iteration is designed to be differentiable and adapted to the deep learning architecture.

We integrate single object tracking into data association-based MOT. Detections and tracking predictions are merged and selected optimally through MDA to construct the target trajectories.

We employ a target management scheme in which a dedicated CNN is used to refine the target bounding boxes and eliminate the noisy candidates generated by the external detector.
To show the effectiveness of the proposed approach, it is evaluated on popular multiple pedestrian and vehicle tracking challenge benchmarks including MOT2015, MOT2017, KITTI-Car and UA-DETRAC. Our results show promising performance in comparison with other published works.
2 Related Work
Multiple object tracking (MOT) has been an active research area for decades, and many methods have been investigated for this topic. Recently, the most popular framework for MOT is tracking-by-detection. Traditional methods primarily focus on solving the data association problem using techniques such as the Hungarian algorithm [3, 15, 19], network flow [12, 53, 55] and multiple hypotheses tracking [6, 23] on top of various affinity estimation schemes. Higher-order affinity provides global and discriminative information that is not available in pairwise association. In order to utilize it, MOT is usually treated as an MDA problem. Collins [11] proposes a block ICM-like method for MDA to incorporate higher-order motion models. The method iteratively solves the bipartite assignments one at a time while keeping the other assignments fixed. In [41], MDA is formulated as a rank-1 tensor approximation problem where a dedicated power iteration with unit normalization is proposed to find the optimal solution. Our work is closely related to the MDA formulation, especially [41].
Recently, deep learning has been explored with increasing popularity in MOT with great success. Most recent solutions rely on it as a powerful discriminative technique [1, 27, 42, 56]. Tang et al. [43] propose to use DNN-based Re-ID techniques for affinity estimation. They include lifted edges that connect two candidates spanning multiple frames to capture the long-term affinity. In [39], recurrent neural networks (RNN) and long short-term memory (LSTM) are adapted to model the higher-order discriminative clues. Those methods learn the networks in a separate process with manually fabricated affinity training samples.
Some recent works have gone further and attempt to solve MOT in an entirely end-to-end fashion. Ondruska and Posner [35] introduce an RNN to estimate the candidate state. Although this work is demonstrated on synthetic sensor data and no explicit data association is applied, it first demonstrates the efficacy of using an RNN for an end-to-end solution. Milan et al. [31] propose an RNN-LSTM based online framework to integrate both motion affinity estimation and bipartite association into the deep learning network. They use an LSTM to solve the data association target by target at each frame, where the constraints in data association are not explicitly built into the network but learned from training data. In both works, only the occupancy status of targets is considered; the informative appearance clue is not utilized. Different from these methods, we propose an MDA subnetwork which handles both the data association and the constraints, and our affinity fuses both appearance and motion clues for better discriminability.
3 Overview
In this section, we first formulate the multiple object tracking (MOT) problem in a multi-dimensional assignment (MDA) form, and then provide an overview of our FAMNet-based tracking system (overview in Fig. 1).
3.1 Problem Formulation
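Following the notation in [41], the input for MOT is denoted by $\mathbb{O} = \{\mathbf{O}^{(0)}, \mathbf{O}^{(1)}, \dots, \mathbf{O}^{(K)}\}$, which contains $K+1$ target candidate sets from $K+1$ frames. For frame $k$, $\mathbf{O}^{(k)} = \{o^{(k)}_{i_k}\}_{i_k=1}^{I_k}$ is the set of $I_k$ candidates to be matched or associated, where $o^{(k)}_{i_k}$ represents the status of the candidate such as its center coordinate $(x, y)$ on the image frame.
With the input candidate set $\mathbb{O}$, MOT is to find a multi-dimensional association that maximizes the overall affinity subject to the association constraints. In detail, $c_{i_0 i_1 \dots i_K}$ denotes the affinity for one possible association, or in terms of MOT, one hypothesis trajectory composed of candidates $\{o^{(0)}_{i_0}, o^{(1)}_{i_1}, \dots, o^{(K)}_{i_K}\}$. We use $z_{i_0 i_1 \dots i_K}$ to indicate whether a hypothesis trajectory is true ($z_{i_0 i_1 \dots i_K} = 1$) or not ($z_{i_0 i_1 \dots i_K} = 0$). If we further denote the tensors $\mathcal{C} = (c_{i_0 i_1 \dots i_K})$ and $\mathcal{Z} = (z_{i_0 i_1 \dots i_K})$, MOT can be formulated as the following MDA problem, which solves for $\mathcal{Z}$ based on $\mathcal{C}$:
$$\arg\max_{\mathcal{Z}} \ \|\mathcal{C} \odot \mathcal{Z}\|_1 \tag{1}$$
$$\text{s.t.} \quad \sum_{\{i_0 i_1 \dots i_K\} \setminus \{i_k\}} z_{i_0 i_1 \dots i_K} = 1 \quad \forall k,\ \forall i_k, \qquad z_{i_0 i_1 \dots i_K} \in \{0, 1\} \tag{2}$$
where $\odot$ denotes the element-wise product, $\|\cdot\|_1$ is the matrix 1-norm, and $\sum_{\{i_0 i_1 \dots i_K\} \setminus \{i_k\}}$ stands for summation over all subscripts except for $i_k$.
To solve Eq. 1, we follow the Rank-1 Tensor Approximation (R1TA) framework [41]. The multi-dimensional assignments are first decomposed as the product of a series of local assignments $x^{(k)}_{i_{k-1} i_k}$, which represent the assignments between candidates in adjacent frames, i.e. $z_{i_0 i_1 \dots i_K} = x^{(1)}_{i_0 i_1} x^{(2)}_{i_1 i_2} \cdots x^{(K)}_{i_{K-1} i_K}$ and $\mathbf{X}^{(k)} = (x^{(k)}_{i_{k-1} i_k})$. If we further rewrite the local assignment matrix $\mathbf{X}^{(k)}$ into vector form $\mathbf{x}^{(k)} = (x^{(k)}_{j_k})$, where $j_k = (i_{k-1} - 1) \times I_k + i_k$, the optimization problem defined in Eq. 1 can be rewritten as follows. (For notational convenience, the same symbol is used for elements in $\mathbf{X}^{(k)}$ and $\mathbf{x}^{(k)}$, with double subscripts and a single subscript respectively.)
$$\arg\max_{\mathbb{X}} \ \mathcal{A} \times_1 \mathbf{x}^{(1)} \times_2 \mathbf{x}^{(2)} \cdots \times_K \mathbf{x}^{(K)} \tag{3}$$
where $\mathcal{A}$ is the affinity tensor reshaped from the $(K+1)$th order tensor $\mathcal{C}$ to a $K$th order tensor following the rules defined in [41], $\times_k$ is the $k$-mode tensor product, and $\mathbb{X} = \{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(K)}\}$ is the set of local assignment vectors which we are optimizing for.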
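As a small illustration of the decomposition used by R1TA, the sketch below builds the trajectory indicator for a three-frame batch as the product of two local assignment matrices and evaluates the total affinity that the MDA objective maximizes. The function names and the toy affinity values are assumptions made for this example; they are not taken from the original implementation.

```python
def trajectory_indicator(x1, x2):
    """z[i0][i1][i2] = x1[i0][i1] * x2[i1][i2] for a three-frame batch."""
    n0, n1, n2 = len(x1), len(x2), len(x2[0])
    return [[[x1[i0][i1] * x2[i1][i2] for i2 in range(n2)]
             for i1 in range(n1)]
            for i0 in range(n0)]


def total_affinity(c, z):
    """Sum of affinities over the hypothesis trajectories selected by z."""
    return sum(c[i0][i1][i2] * z[i0][i1][i2]
               for i0 in range(len(c))
               for i1 in range(len(c[0]))
               for i2 in range(len(c[0][0])))


# Toy example (hypothetical numbers): two candidates per frame.
x1 = [[1, 0], [0, 1]]          # frame 0 -> frame 1: identity assignment
x2 = [[0, 1], [1, 0]]          # frame 1 -> frame 2: swapped assignment
c = [[[0.1, 0.1], [0.1, 0.1]], [[0.1, 0.1], [0.1, 0.1]]]
c[0][0][1] = 0.9               # trajectory (0, 0, 1)
c[1][1][0] = 0.8               # trajectory (1, 1, 0)

z = trajectory_indicator(x1, x2)
score = total_affinity(c, z)   # adds only the two selected trajectories
```

With binary local assignments that each satisfy the one-to-one constraint, exactly one trajectory passes through every candidate, which is what Eq. 2 requires.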
3.2 Architecture Overview and Tracking Pipeline
For each association batch, the FAMNet-based tracking system takes the image frames and the corresponding detections provided by an external detector as input. Detection candidates are first used to generate the hypothesis trajectories. Image patches of target candidates together with the trajectory hypotheses are passed into FAMNet to compute the set of local assignments, as shown in Fig. 1. Inside FAMNet, features of candidate patches are extracted through a feature subnetwork. The affinity subnetwork then calculates the affinity for all hypothesis trajectories on those features to form the affinity tensor, as described in Sec. 4.1. With the affinity tensor, the optimal multi-dimensional assignments are estimated by the MDA subnetwork, as explained in Sec. 4.2 and 4.3.
During training, the assignment ground truth is directly compared with the network output to compute the loss. The loss signal then backpropagates throughout the network to the feature and affinity subnetworks for learning, which is illustrated by the red paths in Fig. 1 and detailed in Sec. 4.4. In the tracking phase, the output assignments together with single object tracking (SOT) predictions are used to update the trajectories of tracked targets through the target management scheme, as described in Sec. 4.5 and 4.6.
We design our method in the online tracking framework intended for more causal applications. Under the constant velocity assumption, three frames are the minimum temporal span needed to calculate the motion affinity. Therefore, in the rest of the paper, three-frame association batches ($K=2$) with two frames overlapping between consecutive batches are used to balance the computation cost against an association depth sufficient to include the higher-order discriminative clues.
4 FAMNet
4.1 Affinity SubNetwork
The affinity subnetwork takes the features of candidates and hypothesis trajectories as input, and generates the affinity tensor as output.
For each association batch, the feature subnetwork, which is a Siamese-style network, is first used to extract spatially aligned features for the candidates from all frames in the batch. The candidates in the middle frame of the batch are treated as anchor candidates, and the middle frame is referred to as the anchor frame. E.g., for a three-frame batch, the anchor frame is the second frame. For each anchor candidate, one feature is extracted from each frame in the batch. On every frame, these features are centered at the same image location as the anchor candidate, as illustrated on the left of Fig. 2. This way, the features share the same coordinate origin and thus can encode motion clues when concatenated along the channel dimension. The spatial dimension of the anchor-frame feature is determined by the bounding box of the anchor candidate, while the others are multiples of it in order to include enough candidates on adjacent frames in the same view. Note that hypothesis trajectories sharing the same anchor candidate have the same set of features. Therefore, only one set of features per anchor candidate is extracted for each association batch.
Two levels of affinities are calculated for each hypothesis trajectory using the extracted feature set, as shown in Fig. 2. In detail, the affinity tensor is calculated as follows:
(4) 
where is the set of valid hypothesis trajectories, is the bounding box associated with , calculates the pairwise affinity and evaluates the longterm affinity of a hypothesis trajectory, is a spatial subset of centered at with the same spatial dimension as . The actual center coordinates of in need to be converted accordingly. We use here for convenience.
For the pairwise affinity, the cross-correlation operation is used, such that
(5) 
where is the convolution operation. Due to the fact that hypothesis trajectories sharing the same anchor candidates have the same set of , we can calculate first then take the value at from the crosscorrelation result, which is referred as .
We use a convolutional neural network (CNN) with spatial attention to evaluate the higher-order affinity of a hypothesis trajectory. For this purpose, the features from the non-anchor frames are multiplied with spatial masks generated from the bounding boxes of the candidates on those frames, as shown in Fig. 2. In particular, we create a binary mask of the same spatial size as the corresponding feature for each candidate. Inside each mask, the region within the candidate's bounding box is set to 1 and the rest to 0. Each time, the actual position of the bounding box in the mask is converted from the image frame to the coordinates centered at the anchor candidate. After encoding the spatial-temporal information, the features in the set are concatenated along the channel dimension to form the input of a CNN that estimates the long-term affinity. The final affinity for a hypothesis trajectory is the summation of the two levels of affinity according to Eq. 4.
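The pairwise affinity above can be pictured with a minimal single-channel sketch: the anchor feature is slid over the larger search feature, the response map is computed once per anchor candidate, and the affinity of a particular candidate is then read off at its location. The helper names and the tiny integer feature maps are assumptions for illustration only; the actual network operates on multi-channel deep features.

```python
def cross_correlate(search, template):
    """'Valid' 2D cross-correlation of a template over a search map."""
    th, tw = len(template), len(template[0])
    sh, sw = len(search), len(search[0])
    response = []
    for r in range(sh - th + 1):
        row = []
        for c in range(sw - tw + 1):
            row.append(sum(search[r + i][c + j] * template[i][j]
                           for i in range(th) for j in range(tw)))
        response.append(row)
    return response


def peak_location(response):
    """(row, col) of the largest response, i.e. the most similar location."""
    best = max((v, r, c) for r, row in enumerate(response)
               for c, v in enumerate(row))
    return best[1], best[2]


# Hypothetical anchor feature and search feature containing a copy of it.
anchor_feat = [[1, 2],
               [3, 4]]
search_feat = [[0, 0, 0],
               [0, 1, 2],
               [0, 3, 4]]

resp = cross_correlate(search_feat, anchor_feat)
affinity_at_candidate = resp[1][1]   # value read at a candidate's offset
peak = peak_location(resp)
```

Computing the response map once and indexing it per candidate mirrors the observation that all hypothesis trajectories with the same anchor candidate share the same features.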
4.2 R1TA Power Iteration Layer
With the affinity tensor, we use R1TA power iteration to estimate the set of optimal assignments in Eq. 3. Finding the global optimum for MDA is NP-hard in general. A suboptimal approximation can be obtained with a power iteration algorithm, which can be expressed in purely mathematical form.
In order to fit this process into the deep network framework, we adopt an iteration scheme different from the one in [41], where the row/column normalization is applied to the local assignments after each iteration to enforce the constraints defined in Eq. 2. In our design, the tensor power iteration and the row/column normalization are separated into two independent layers. This avoids accumulating too many operations in a single layer and alleviates potential gradient vanishing. The downside of this scheme is that we cannot expect the same convergence property as in [41]. However, thanks to the end-to-end training, this can be compensated for by the more discriminative learned features and affinity metric.
In detail, the optimal solution to Eq. 3 subject to Eq. 2 is approximated iteratively by:
Forward pass. At each iteration, the elements of a local assignment vector are calculated from the update below. (The second superscript indicates the round of iteration. Moreover, the derivation in this subsection is written for one local assignment vector, but is the same for the others.)
(6) 
where is the normalization factor. At initialization, elements in all local assignment vectors are set to 1.
Backward pass. The R1TA power iteration layer computes the loss gradient of affinity tensor , denoted by , as backward output. The input of the backward pass is the loss gradient of all local assignment vectors at the last iteration, e.g. , where is the total number of iterations performed in the forward pass. The gradient output is calculated as follow:
(7)  
where is a unit vector of the same dimension as and has elements equal to 1 only at and otherwise 0. In order to calculate Eq. 7 for all iterations, the loss gradients of assignment vectors at each iteration are also needed, which follows:
(8)  
where can be calculated as, e.g. for , .
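To make the forward pass more concrete, the following is a minimal sketch of a multiplicative power iteration for a three-frame batch, where the reshaped affinity tensor reduces to a matrix between the two local assignment vectors. It assumes non-negative affinities and an L1 normalization factor, and it omits the row/column normalization that the paper places in a separate layer; the function name and toy matrix are hypothetical.

```python
def power_iteration(affinity, n_iters=10):
    """Multiplicative rank-1 power iteration on a 2nd-order affinity tensor.

    affinity[j1][j2] is the affinity of pairing entry j1 of the first local
    assignment vector with entry j2 of the second one (assumed non-negative).
    """
    n1, n2 = len(affinity), len(affinity[0])
    # Initialization: all elements of the assignment vectors are set to 1.
    x1 = [1.0] * n1
    x2 = [1.0] * n2
    for _ in range(n_iters):
        # Each element is scaled by the affinity mass it collects from the
        # other vector; both updates use the values of the previous round.
        y1 = [x1[i] * sum(affinity[i][j] * x2[j] for j in range(n2))
              for i in range(n1)]
        y2 = [x2[j] * sum(affinity[i][j] * x1[i] for i in range(n1))
              for j in range(n2)]
        s1, s2 = sum(y1), sum(y2)      # assumed L1 normalization factors
        x1 = [v / s1 for v in y1]
        x2 = [v / s2 for v in y2]
    return x1, x2


# Hypothetical affinity matrix with one dominant pairing at (0, 0).
x1, x2 = power_iteration([[5.0, 1.0], [1.0, 1.0]], n_iters=10)
```

Every operation here is a product, a sum or a division, so the same computation can be expressed with differentiable tensor operations, which is the property the R1TA power iteration layer relies on.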
4.3 Normalization Layer
To satisfy the constraints defined in Eq. 2 required by MDA, row/column normalization is applied to the result of the power iteration. The assignment vectors from the R1TA power iteration layer are first reshaped back to their matrix form.
Then, the normalization is performed on rows and columns alternately through multiple iterations.
Forward pass. For each pair of iterations, we start from row normalization. In the and th iteration:
(9)  
where is a vector of with all elements being 1, here and below represents the diagonal matrix with as diagonal elements.
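The alternating scheme of the forward pass can be sketched as follows for a single, strictly positive local assignment matrix. Each pair of iterations starts with row normalization and ends with column normalization. The function names, the number of iteration pairs and the toy matrix are assumptions for this example, and the special handling of virtual candidates described in Sec. 4.5 is not included.

```python
def normalize_rows(x):
    """Scale every row so that it sums to 1."""
    return [[v / sum(row) for v in row] for row in x]


def normalize_cols(x):
    """Scale every column so that it sums to 1."""
    col_sums = [sum(row[j] for row in x) for j in range(len(x[0]))]
    return [[v / col_sums[j] for j, v in enumerate(row)] for row in x]


def alternate_normalize(x, n_pairs=50):
    """Alternate row and column normalization, starting with the rows."""
    for _ in range(n_pairs):
        x = normalize_rows(x)
        x = normalize_cols(x)
    return x


# Hypothetical soft assignment matrix from the power iteration layer.
soft = alternate_normalize([[0.9, 0.1], [0.2, 0.8]])
row_sums = [sum(row) for row in soft]
col_sums = [sum(row[j] for row in soft) for j in range(2)]
```

For a positive square matrix, repeating the two steps drives both row and column sums towards 1, which is a relaxed version of the one-to-one constraint in Eq. 2.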
Backward pass. Given a starting gradient , we iteratively compute the gradients as:
4.4 Training
During the training, the total loss is measured by the binary cross entropy between all predicted assignments and assignment ground truth , which is written as
With the total loss, gradients are calculated throughout the network back to the affinity and feature subnetworks.
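As a reference for the loss described above, a minimal binary cross entropy between predicted soft assignments and binary ground truth assignments might look as follows. Averaging over elements and clamping with a small epsilon are assumptions of this sketch; the paper's exact reduction is not shown in this text.

```python
import math


def binary_cross_entropy(pred, target, eps=1e-12):
    """Mean binary cross entropy between predicted and true assignments."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


# Flattened local assignment vectors (hypothetical values).
ground_truth = [1.0, 0.0, 0.0, 1.0]
good_pred = [0.9, 0.1, 0.2, 0.8]
bad_pred = [0.2, 0.8, 0.9, 0.1]

loss_good = binary_cross_entropy(good_pred, ground_truth)
loss_bad = binary_cross_entropy(bad_pred, ground_truth)
```

Predictions that agree with the ground truth assignments yield the smaller loss, so minimizing it pushes the affinity and feature subnetworks towards affinities that make the correct trajectories win the assignment.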
We use each association batch as one mini-batch during training and tracking. For each batch, the consecutive frames and the target candidates on each frame provided by an external detector serve as the input. The candidate set is first used to generate the hypothesis trajectories. We set a bound for the hypothesis trajectory generation: two candidates from two consecutive frames can be connected only when they are spatially close to each other and have similar bounding box sizes. We set adaptive thresholds for this strategy. In detail, if a candidate cannot connect with any candidate at a given threshold, it searches again with relaxed thresholds for a possible connection. This strategy allows connections for both fast- and slow-moving targets, while rejecting obvious false positive connections. Hypothesis trajectories are generated greedily by iterating through all valid connections to form trajectories starting with candidates in the first frame of the batch and terminating in the last frame. Generated trajectories are sorted by their anchor candidates and, together with the image frames, fed into our networks for training and tracking.
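The gating and adaptive-threshold idea can be sketched as below for a three-frame batch. Candidates are given as (center x, center y, width, height); the threshold levels, the distance and size criteria, and all function names are hypothetical choices for illustration, since the actual values are not given in the text.

```python
import math


def compatible(a, b, dist_factor, size_ratio):
    """True if two candidates are spatially close and similar in size."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    close = math.hypot(ax - bx, ay - by) <= dist_factor * max(aw, ah)
    area_a, area_b = aw * ah, bw * bh
    similar = max(area_a, area_b) <= size_ratio * min(area_a, area_b)
    return close and similar


def connect(prev, nxt, levels=((0.5, 1.5), (1.0, 2.0), (2.0, 3.0))):
    """Link candidates of two consecutive frames, relaxing thresholds
    only for candidates that found no partner at a stricter level."""
    links = {}
    for i, a in enumerate(prev):
        for dist_factor, size_ratio in levels:
            found = [j for j, b in enumerate(nxt)
                     if compatible(a, b, dist_factor, size_ratio)]
            if found:
                links[i] = found
                break
    return links


def hypotheses(frames):
    """Enumerate trajectories (i0, i1, i2) through all valid connections."""
    l01 = connect(frames[0], frames[1])
    l12 = connect(frames[1], frames[2])
    return [(i0, i1, i2)
            for i0, mids in sorted(l01.items())
            for i1 in mids
            for i2 in l12.get(i1, [])]


frames = [
    [(10, 10, 10, 10), (100, 100, 10, 10)],
    [(12, 10, 10, 10), (104, 100, 10, 10)],
    [(14, 10, 10, 10), (118, 100, 10, 10)],   # second target moves faster
]
trajs = hypotheses(frames)
```

In this toy batch the slow target is linked at the strictest level, while the faster one is only linked after the thresholds are relaxed twice.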
4.5 Tracking by Integrating Detection and SOT
In the tracking phase, predictions using SOT techniques are included to recover missing candidates from the external detector. We add a virtual candidate to each candidate set to represent missing candidates and allow it to connect with any candidate in consecutive frames, as shown in Fig. 3. Both real and virtual candidates are used to generate trajectory hypotheses. When calculating the affinity, we choose the location maximizing the affinity in Eq. 5 as the center of the virtual candidate for each anchor candidate, such that
(10) 
Therefore, if an anchor candidate misses its detection in a consecutive frame, it will connect with the virtual candidate, which represents the location most similar to it in that frame, or, in terms of SOT, its tracking prediction. Each anchor candidate may have a different location predicted by SOT. A single symbol is used in Eq. 10 to refer to the virtual candidate, although its center coordinates may vary across anchor candidates.
To prevent MDA from always choosing the virtual candidates, since their affinities are no smaller than that of any real candidate, a coefficient is used to scale down their affinity. The virtual candidates are handled specially in the normalization layer: for a row (column) of the assignment matrix representing a virtual candidate, only column (row) normalization is applied. This way, each real candidate can be assigned to only one candidate in the consecutive frame, including the virtual ones, while a virtual candidate can be assigned to multiple candidates. During optimization, if the affinities of the real candidates are smaller than that of the virtual candidate, which represents the tracking prediction, the anchor candidate will be automatically associated with the tracking prediction. This way, our tracking system integrates detection and SOT naturally.
4.6 Target Management
On receiving the assignment results, target management handles target entering, exiting and updating. If multiple anchor candidates choose to associate with a virtual candidate in the assignment results, new candidates are added to the candidate sets accordingly. A virtual candidate not associated with any anchor candidate in the assignment results is dropped from the candidate set. Furthermore, if the virtual candidate associated with an anchor candidate in this batch appears as an anchor candidate in the next batch, the appearance feature of the original anchor candidate is reused in the next batch, in case the missing detection is caused by occlusion. This SOT process continues until a confident real candidate is associated.
For anchor candidates associated with real candidates, we train a CNN to further refine their bounding boxes. During MDA, associations are made mainly based on the object center of each candidate. Most MOT tasks, however, are concerned with the actual bounding box enclosing each target. When an anchor candidate is assigned to a real candidate, two bounding boxes are associated with it: one from the real candidate itself and the other from the SOT prediction of the anchor candidate. If the Intersection over Union (IoU) between the two bounding boxes is smaller than a threshold, a CNN is used to evaluate the bounding box quality. In detail, a CNN-based binary classifier is trained to decide whether a bounding box has an IoU larger than a threshold, e.g. 0.5, with an object of the target category. The detailed procedure of the target management is listed in Alg. 1.
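The IoU test that triggers the quality check can be sketched as follows. Boxes are written as (x1, y1, x2, y2) corner coordinates, and the function names and the default threshold value are assumptions for illustration; the CNN classifier itself is not modeled here.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def needs_quality_check(det_box, sot_box, threshold=0.5):
    """True when detection and SOT boxes disagree enough that the
    bounding box classifier should be consulted (threshold is assumed)."""
    return iou(det_box, sot_box) < threshold


det_box = (0.0, 0.0, 2.0, 2.0)   # box from the real candidate (detection)
sot_box = (1.0, 1.0, 3.0, 3.0)   # box from the SOT prediction
overlap = iou(det_box, sot_box)
check = needs_quality_check(det_box, sot_box)
```

When the two boxes agree closely, no extra evaluation is needed; only disagreeing pairs are passed to the classifier.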


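Table 1: Tracking results on the MOT2015 benchmark.

| Mode | Method | MOTA | MOTP | MT | ML | FP | FN | IDS |
|---|---|---|---|---|---|---|---|---|
| Offline | CEM [32] | 19.3 | 70.7 | 8.5% | 46.5% | 14180 | 34591 | 813 |
| Offline | R1TA [41] | 24.3 | 68.2 | 5.5% | 46.6% | 6664 | 38582 | 1271 |
| Offline | SCNN [27] | 29.0 | 71.2 | 8.5% | 48.4% | 5160 | 37798 | 639 |
| Offline | DAM [23] | 32.4 | 71.8 | 16.0% | 43.8% | 9064 | 32060 | 435 |
| Offline | JMC [21] | 35.6 | 71.9 | 23.2% | 39.3% | 10580 | 28508 | 457 |
| Online | RNN [31] | 19.0 | 71.0 | 5.5% | 45.6% | 11578 | 36706 | 1490 |
| Online | oICF [22] | 27.1 | 70.0 | 6.4% | 48.7% | 7594 | 36757 | 454 |
| Online | SCEA [51] | 29.1 | 71.1 | 8.9% | 47.3% | 6060 | 36912 | 604 |
| Online | AP [7] | 38.5 | 71.3 | 8.7% | 37.4% | 4005 | 33203 | 586 |
| Online | proposed | 40.6 | 71.1 | 12.5% | 34.4% | 4678 | 31018 | 778 |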



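Table 2: Tracking results on the MOT2017 benchmark.

| Mode | Method | MOTA | MOTP | MT | ML | FP | FN | IDS |
|---|---|---|---|---|---|---|---|---|
| Offline | IOU17 [4] | 45.5 | 76.9 | 15.7% | 40.5% | 19993 | 281643 | 5988 |
| Offline | bLSTM [24] | 47.5 | 77.5 | 18.2% | 41.7% | 25981 | 268042 | 2069 |
| Offline | TLMHT [40] | 50.6 | 77.6 | 17.6% | 43.4% | 22213 | 255030 | 1407 |
| Offline | jCC [43] | 51.2 | 75.9 | 20.9% | 37.0% | 25937 | 247822 | 1802 |
| Online | GMPHD [25] | 39.6 | 74.5 | 8.8% | 43.3% | 50903 | 284228 | 5811 |
| Online | DMAN [56] | 48.2 | 75.7 | 19.3% | 38.3% | 26218 | 263608 | 2194 |
| Online | MOTDT [8] | 50.9 | 76.6 | 17.5% | 35.7% | 24069 | 250768 | 2474 |
| Online | proposed | 52.0 | 76.5 | 19.1% | 33.4% | 14138 | 253616 | 3072 |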

5 Experiment
We conduct experiments on four popular MOT datasets: MOT2015 [28] and MOT2017 [30] for pedestrian tracking, and KITTI-Car [17] and UA-DETRAC [47] for vehicle tracking. All datasets are provided with reference detections from real detectors.
5.1 Experiment Setting
The proposed approach is implemented in PyTorch and runs on a desktop with a 6-core CPU at 3.60 GHz and a Titan X GPU. We adapt the SiamFC proposed in [39] as our feature subnetwork and use its weights, trained on the ILSVRC15 dataset for object detection in video, as the pretrained model. The CNN that estimates the long-term affinity is constructed with three convolutional layers that map the concatenated, spatially aligned features to an affinity score. A ResNet-101 with binary output is adapted for the bounding box classifier; the pretrained weights from Mask R-CNN [18] on the COCO dataset are used for initialization. The proposed method runs at 0.6 fps on average on the MOT2017 dataset in the tracking phase.
For each test sequence in MOT2015 and MOT2017, following their protocol, one or more similar sequences in the training set are used to train a separate set of networks to best adapt to the scenario prior. Sequences in the KITTI and UA-DETRAC datasets are all recorded in similar settings; therefore, all training sequences in each dataset are used together to train one set of networks for all test sequences. To train FAMNet, ground truth bounding boxes are used as input target candidates. Therefore, no virtual candidate or SOT process is enabled during training. When training the bounding box classifier, the training samples are collected from both the external detections and the ground truth bounding boxes after random shift and scale. Bounding boxes whose IoU with the ground truth is larger than 0.5 are selected as positive samples, while those with IoU smaller than 0.4 are used as negative samples.
To evaluate the performance of the proposed method, the widely accepted CLEAR MOT metrics [2] are reported, which include multiple object tracking precision (MOTP) and multiple object tracking accuracy (MOTA), which combines false positives (FP), false negatives (FN) and identity switches (IDS). Additionally, we report the percentage of mostly tracked targets (MT) and the percentage of mostly lost targets (ML).
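For reference, MOTA in the CLEAR MOT metrics [2] combines the three error types relative to the total number of ground truth objects over all frames. The short sketch below computes it; the function name and the toy counts are hypothetical and are not results from this paper.

```python
def mota(false_positives, false_negatives, id_switches, num_gt_objects):
    """MOTA = 1 - (FP + FN + IDS) / total number of ground truth objects."""
    errors = false_positives + false_negatives + id_switches
    return 1.0 - errors / num_gt_objects


# Toy counts for illustration only.
score = mota(false_positives=10, false_negatives=20, id_switches=5,
             num_gt_objects=100)
```

Because all three error counts enter the numerator with equal weight, a tracker can trade false positives against false negatives without changing MOTA, which is why the individual counts are reported alongside it.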
5.2 Evaluation Results
MOT2015. MOT2015 [28] contains 11 different indoor and outdoor scenes of public places with pedestrians as the objects of interest, where camera motion, camera angle and imaging condition vary greatly. The dataset provides detections generated by the ACF-based detector [13]. The numerical results on its test set are reported in Tab. 1. Our approach achieves the highest MOTA among the compared methods. In particular, our method achieves better performance in most metrics than the RNN-based end-to-end online methods, owing to our discriminative higher-order affinity and the adapted optimization method. Our method also surpasses the R1TA-based method, which relies on handcrafted features and affinity metrics.
MOT2017. Similar to MOT2015, MOT2017 [30] contains seven different sequences in each of the training and test sets, but with a higher average target density (31.8 vs 10.6 on the test set), and is thus more challenging. MOT2017 also focuses on evaluating tracker performance under different detection qualities. It provides three different detection inputs from DPM [16], Faster R-CNN [38] and SDP [50], ranked in ascending order by AP. We train seven different sets of networks according to the different scenes, without further fitting to the different detections. The numerical results are reported in Tab. 2. The performance of our method is better than or on par with other published state-of-the-art methods.
KITTI-Car. The KITTI dataset [17] contains 21 video sequences in the training set and 29 in the test set for multiple vehicle tracking in street view, where videos are recorded through a camera mounted at the front of a moving vehicle. Reference detections from the regionlet [46] detector are used in our experiment. The numerical results of our method on this dataset, along with other methods using the same detections, are summarized in Tab. 3. Our method again surpasses the handcrafted feature-based R1TA method, despite the fact that the latter uses a much larger association batch for offline tracking. It is worth mentioning that motion affinity plays a more important role in KITTI than in MOT2015 and MOT2017, since both targets and camera move faster and more regularly in KITTI.
UA-DETRAC. The UA-DETRAC dataset [47] is another multiple vehicle tracking dataset, with 60 sequences for training and 40 sequences for testing. All sequences are recorded with a static camera at an elevated position near different driveways in various weather conditions. We use the reference detections from the CompACT [5] detector in our experiment. UA-DETRAC reports the average of each MOT metric over a series of results using different detection confidence thresholds (from 0 to 1.0 with a step of 0.1). Comparisons with other methods using the same detections are reported in Tab. 4. The proposed method achieves state-of-the-art performance among the published works. Our method also surpasses the IOU tracker, which is an offline method and uses a private detector.


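Table 3: Tracking results on the KITTI-Car benchmark.

| Mode | Method | MOTA | MOTP | MT | ML | FP | FN | IDS |
|---|---|---|---|---|---|---|---|---|
| Offline | DCOX [33] | 68.1 | 78.9 | 37.5% | 14.1% | 2588 | 8063 | 318 |
| Offline | R1TA [41] | 71.2 | 79.2 | 47.9% | 11.7% | 1915 | 7579 | 418 |
| Offline | LPSSVM [44] | 77.6 | 77.8 | 56.3% | 8.5% | 1239 | 6393 | 62 |
| Offline | NOMT [9] | 78.1 | 79.5 | 57.2% | 13.2% | 1061 | 6421 | 31 |
| Online | RMOT [52] | 65.8 | 75.4 | 40.2% | 9.7% | 4148 | 7396 | 209 |
| Online | mbodSSP [29] | 72.7 | 78.8 | 48.8% | 8.7% | 1918 | 7360 | 114 |
| Online | CIWT [36] | 75.4 | 79.4 | 49.9% | 10.3% | 954 | 7345 | 165 |
| Online | proposed | 77.1 | 78.8 | 51.4% | 8.9% | 760 | 6998 | 123 |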



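Table 4: Tracking results on the UA-DETRAC benchmark.

| Method | MOTA | MOTP | MT | ML | FP | FN | IDS |
|---|---|---|---|---|---|---|---|
| CEM [32] | 5.1 | 35.2 | 3.0% | 35.3% | 12341 | 260390 | 267 |
| [48] | 12.4 | 35.7 | 14.8% | 19.4% | 51765 | 173899 | 852 |
| CMOT [1] | 12.6 | 36.1 | 16.1% | 18.6% | 57886 | 167111 | 285 |
| GOG [37] | 14.2 | 37.0 | 13.9% | 19.9% | 32093 | 180184 | 3335 |
| IOU [4] * | 19.4 | 28.9 | 17.7% | 18.4% | 14796 | 171806 | 2311 |
| proposed | 19.8 | 36.7 | 17.1% | 18.2% | 14989 | 164433 | 617 |

* Private detector used.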
5.3 Ablation Study



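Table 5: Ablation study, tested on ETH-Pedcross2 and ETH-Sunnyday and trained on ETH-Bahnhof.

| Method | MOTA | FP | FN | IDS |
|---|---|---|---|---|
| No training | 35.5 | 240 | 4799 | 202 |
| Training from scratch | 44.1 | 281 | 4160 | 97 |
| Without  | 40.5 | 518 | 4227 | 87 |
| Without SOT | 42.0 | 200 | 4412 | 99 |
| Finetuning (proposed) | 45.2 | 259 | 4105 | 87 |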



We justify the effectiveness of the different modules in the proposed method through an ablation study, as shown in Tab. 5. We conduct the study using the sequences ETH-Pedcross2 and ETH-Sunnyday for testing and ETH-Bahnhof for training. All sequences are from the training set of MOT2015. We start from FAMNet with randomly initialized weights, whose tracking performance is referred to as “No training” in Tab. 5. Then the network is trained on the sequence ETH-Bahnhof; “Training from scratch” stands for the results of this scheme. Training with the limited MOT sequences may lead to overfitting of the feature and affinity subnetworks. To increase the generalizability and further boost the performance, we use the weights trained on the ILSVRC15 dataset as initialization and then perform finetuning on the MOT sequence, which is referred to as “Finetuning” and is the scheme used in the other experiments in this paper. “Without ” shows the configuration where the detection score is used for bounding box quality estimation instead of the dedicated CNN classifier. Target management without this classifier cannot efficiently prevent FPs from merging into the tracking results. “Without SOT” in Tab. 5 stands for the case where only detections from the external detector are used for association and no SOT prediction is included. By contrast, in the proposed solution, although the SOT predictions introduce some FPs, they recover many more missing candidates and reduce FN considerably (4105 vs 4412).
6 Conclusion
In this paper we proposed a novel deep architecture for MOT, which jointly learns, in an end-to-end fashion, features and higher-order affinity directly from the ground truth trajectories. During tracking, predictions from SOT and a dedicated target management scheme are included to further boost tracking robustness. Experiments on the MOT2015, MOT2017, KITTI-Car and UA-DETRAC datasets show the effectiveness of the proposed approach.
References
 [1] S.H. Bae and K.J. Yoon. Confidencebased data association and discriminative deep appearance learning for robust online multiobject tracking. TPAMI, 2018.
 [2] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: the clear mot metrics. JIVP, 2008.
 [3] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. In ICIP, 2016.
 [4] E. Bochinski, V. Eiselein, and T. Sikora. Highspeed trackingbydetection without using image information. In AVSS, 2017.
 [5] Z. Cai, M. Saberian, and N. Vasconcelos. Learning complexityaware cascades for deep pedestrian detection. In CVPR, 2015.
 [6] J. Chen, H. Sheng, Y. Zhang, and Z. Xiong. Enhancing detection model for multiple hypothesis tracking. In CVPRw, 2017.
 [7] L. Chen, H. Ai, C. Shang, Z. Zhuang, and B. Bai. Online multi-object tracking with convolutional neural networks. In ICIP, 2017.
 [8] L. Chen, H. Ai, Z. Zhuang, and C. Shang. Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In ICME, 2018.
 [9] W. Choi. Near-online multi-target tracking with aggregated local flow descriptor. In ICCV, 2015.
 [10] Q. Chu, W. Ouyang, H. Li, X. Wang, B. Liu, and N. Yu. Online multi-object tracking using CNN-based single object tracker with spatial-temporal attention mechanism. In ICCV, 2017.
 [11] R. T. Collins. Multitarget data association with higher-order motion models. In CVPR, 2012.
 [12] A. Dehghan, Y. Tian, P. H. Torr, and M. Shah. Target identity-aware network flow for online multiple target tracking. In CVPR, 2015.
 [13] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. TPAMI, 2014.
 [14] L. Fagot-Bouquet, R. Audigier, Y. Dhome, and F. Lerasle. Improving multi-frame data association with sparse representations for robust near-online multi-object tracking. In ECCV, 2016.
 [15] K. Fang, Y. Xiang, X. Li, and S. Savarese. Recurrent autoregressive networks for online multi-object tracking. In WACV, 2018.
 [16] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010.
 [17] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
 [18] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
 [19] C. Huang, B. Wu, and R. Nevatia. Robust object tracking by hierarchical association of detection responses. In ECCV, 2008.
 [20] M. Keuper, E. Levinkov, N. Bonneel, G. Lavoué, T. Brox, and B. Andres. Efficient decomposition of image and mesh graphs by lifted multicuts. In ICCV, 2015.
 [21] M. Keuper, S. Tang, B. Andres, T. Brox, and B. Schiele. Motion segmentation & multiple object tracking by correlation co-clustering. TPAMI, 2018.
 [22] H. Kieritz, S. Becker, W. Hübner, and M. Arens. Online multi-person tracking using integral channel features. In AVSS, 2016.
 [23] C. Kim, F. Li, A. Ciptadi, and J. M. Rehg. Multiple hypothesis tracking revisited. In ICCV, 2015.
 [24] C. Kim, F. Li, and J. M. Rehg. Multi-object tracking with neural gating using bilinear LSTM. In ECCV, 2018.
 [25] T. Kutschbach, E. Bochinski, V. Eiselein, and T. Sikora. Sequential sensor fusion combining probability hypothesis density and kernelized correlation filters for multi-object tracking in video data. In AVSS, 2017.
 [26] L. Lan, X. Wang, S. Zhang, D. Tao, W. Gao, and T. S. Huang. Interacting tracklets for multi-object tracking. TIP, 2018.
 [27] L. Leal-Taixé, C. Canton-Ferrer, and K. Schindler. Learning by tracking: Siamese CNN for robust target association. In CVPRw, 2016.
 [28] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler. MOTChallenge 2015: Towards a benchmark for multi-target tracking. arXiv:1504.01942, 2015.
 [29] P. Lenz, A. Geiger, and R. Urtasun. FollowMe: Efficient online min-cost flow tracking with bounded memory and computation. In ICCV, 2015.
 [30] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler. MOT16: A benchmark for multi-object tracking. arXiv:1603.00831, 2016.
 [31] A. Milan, S. H. Rezatofighi, A. R. Dick, I. D. Reid, and K. Schindler. Online multi-target tracking using recurrent neural networks. In AAAI, 2017.
 [32] A. Milan, S. Roth, and K. Schindler. Continuous energy minimization for multitarget tracking. TPAMI, 2014.
 [33] A. Milan, K. Schindler, and S. Roth. Detection- and trajectory-level exclusion in multiple object tracking. In CVPR, 2013.
 [34] A. Milan, K. Schindler, and S. Roth. Multi-target tracking by discrete-continuous energy minimization. TPAMI, 2016.
 [35] P. Ondruska and I. Posner. Deep tracking: Seeing beyond seeing using recurrent neural networks. In AAAI, 2016.
 [36] A. Osep, W. Mehner, M. Mathias, and B. Leibe. Combined image- and world-space tracking in traffic scenes. In ICRA, 2017.
 [37] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR, 2011.
 [38] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
 [39] A. Sadeghian, A. Alahi, and S. Savarese. Tracking the untrackable: Learning to track multiple cues with long-term dependencies. arXiv:1701.01909, 2017.
 [40] H. Sheng, J. Chen, Y. Zhang, W. Ke, Z. Xiong, and J. Yu. Iterative multiple hypothesis tracking with tracklet-level association. CSVT, 2018.
 [41] X. Shi, H. Ling, Y. Pang, W. Hu, P. Chu, and J. Xing. Rank-1 tensor approximation for high-order association in multi-target tracking. IJCV, 2019.
 [42] J. Son, M. Baek, M. Cho, and B. Han. Multi-object tracking with quadruplet convolutional neural networks. In CVPR, 2017.
 [43] S. Tang, M. Andriluka, B. Andres, and B. Schiele. Multiple people tracking by lifted multicut and person re-identification. In CVPR, 2017.
 [44] S. Wang and C. C. Fowlkes. Learning optimal parameters for multi-target tracking with contextual interactions. IJCV, 2017.
 [45] X. Wang, E. Türetken, F. Fleuret, and P. Fua. Tracking interacting objects using intertwined flows. TPAMI, 2016.
 [46] X. Wang, M. Yang, S. Zhu, and Y. Lin. Regionlets for generic object detection. In ICCV, 2013.
 [47] L. Wen, D. Du, Z. Cai, Z. Lei, M.-C. Chang, H. Qi, J. Lim, M.-H. Yang, and S. Lyu. UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking. arXiv:1511.04136, 2015.
 [48] L. Wen, W. Li, J. Yan, Z. Lei, D. Yi, and S. Z. Li. Multiple target tracking based on undirected hierarchical relation hypergraph. In CVPR, 2014.
 [49] Y. Xiang, A. Alahi, and S. Savarese. Learning to track: Online multi-object tracking by decision making. In ICCV, 2015.
 [50] F. Yang, W. Choi, and Y. Lin. Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. In CVPR, 2016.
 [51] J. H. Yoon, C.-R. Lee, M.-H. Yang, and K.-J. Yoon. Online multi-object tracking via structural constraint event aggregation. In CVPR, 2016.
 [52] J. H. Yoon, M.-H. Yang, J. Lim, and K.-J. Yoon. Bayesian multi-object tracking using motion context from multiple objects. In WACV, 2015.
 [53] A. R. Zamir, A. Dehghan, and M. Shah. GMCP-Tracker: Global multi-object tracking using generalized minimum clique graphs. In ECCV, 2012.
 [54] A. Zanfir and C. Sminchisescu. Deep learning of graph matching. In CVPR, 2018.
 [55] L. Zhang, Y. Li, and R. Nevatia. Global data association for multi-object tracking using network flows. In CVPR, 2008.
 [56] J. Zhu, H. Yang, N. Liu, M. Kim, W. Zhang, and M.-H. Yang. Online multi-object tracking with dual matching attention networks. In ECCV, 2018.