# Pedestrian Tracking by Probabilistic Data Association and Correspondence Embeddings

###### Abstract

This paper studies the interplay between kinematics (position and velocity) and appearance cues for establishing correspondences in multi-target pedestrian tracking. We investigate tracking-by-detection approaches based on a deep learning detector, joint integrated probabilistic data association (JIPDA), and appearance-based tracking of deep correspondence embeddings. We first addressed the fixed-camera setup by fine-tuning a convolutional detector for accurate pedestrian detection and combining it with kinematic-only JIPDA. The resulting submission ranked first on the 3DMOT2015 benchmark. However, in sequences with a moving camera and unknown ego-motion, we achieved the best results by replacing kinematic cues with global nearest neighbor tracking of deep correspondence embeddings. We trained the embeddings by fine-tuning features from the second block of ResNet-18 using angular loss extended by a margin term. We note that integrating deep correspondence embeddings directly in JIPDA did not bring significant improvement. It appears that geometry of deep correspondence embeddings for soft data association needs further investigation in order to obtain the best from both worlds.

## I Introduction

Multi-target tracking (MTT) algorithms are applied today in many different areas, such as autonomous vehicles, robotics, air-traffic control, video surveillance and many others. Tracking and state estimation of multiple moving objects gives rise to many challenges compared to classical estimation, such as a time-varying number of targets, false alarms, occlusions, and missed detections. An additional level of complexity is brought by the unknown association between measurements and targets. In visual MTT this can be alleviated by leveraging the appearance of the detected targets in images. Thus a visual tracking approach usually combines the following three steps: (i) detection of objects in images, in our case pedestrians, (ii) computation of appearance-based metrics for association, and (iii) a tracking algorithm consolidating the previous two steps into the MTT framework.

Pedestrian MTT methods require detections as inputs and deep convolutional models are particularly suitable for the task. Pre-training deep models on large datasets has been shown to have a strong regularization effect. Datasets like ImageNet [1] and COCO [2] hold great generalization potential which becomes available through pre-training on such large collections of annotated data. However, fine-tuning a multi-class object detector to detect pedestrians is not a straightforward task. Limitations in the vertical receptive field of the model, noise in bounding box annotations, and annotation errors are common issues [3]. Fine-tuning on homogeneous video sequences incurs a high risk of overfitting, thus diversity in training data should be targeted to improve generalization. Given that, fine-tuning a multi-class object detector for the task of pedestrian detection is possible by training on a dataset like CityPersons [4]. This dataset is diverse along multiple axes, such as person identity, clothing, pose, occlusion level etc.

Correspondence embeddings can be useful for a target association problem such as pedestrian tracking, since they are trained to measure similarity between images. Pioneering approaches to deep metric learning used siamese networks [5], while triplet networks are considered an improvement [6]. This is due to a slight but impactful modification of the loss, which ensures better alignment between the similarity in the embedded space and the likelihood of correspondence. A body of work analyses and improves triplet loss functions. Convergence issues of metric learning with the triplet loss are alleviated by the N-pair loss, which compares a positive example to N-1 negatives [7]. Rather than focusing on pairwise distances in metric space, angular loss [8] minimizes the angle at the negative point of the triplet. This has a positive effect on the quality of learning, since angles are insensitive to changes in scale. Our approach follows previous work which utilizes segmentation masks to make the appearance embedding less sensitive to occlusions and changes in the background [9].

A good overview of the current state of the art in MTT algorithms can be found in [10], where the authors consider three different approaches to the MTT problem: (i) probabilistic data association, (ii) multiple hypothesis tracking, and (iii) the random finite set approach. In probabilistic data association (PDA) [11, 12, 13] tracking methods, the measurement association uncertainty is untangled by soft assignment. The first such methods for single and multiple target tracking were the PDA filter [11] and the joint probabilistic data association (JPDA) filter [12], respectively. However, both approaches assume a known and constant number of targets and need some heuristics for track initialisation and termination. The integrated PDA (IPDA) [14] and joint integrated PDA (JIPDA) [13] alleviate this issue by estimating the target existence probability together with the target state, thus providing a natural method for automatic track initialisation and termination. Unlike PDA methods, multiple hypothesis tracking (MHT) algorithms [15, 16] generate hypotheses for different associations, and the decision about which of the hypotheses is correct is postponed until new data is collected. Somewhat more recent approaches are based on the random finite set (RFS) paradigm [17]. Based on the RFS theory, a closed-form first moment approximation of the RFS filter, the Gaussian mixture probability hypothesis density (GM-PHD) filter, was presented in [18], and since then other novel approaches have been proposed [19, 20, 21].

Visual pedestrian tracking can be implemented by consolidating an object detector, an appearance-based association metric, and a suitable MTT approach. In [22] the authors track multiple pedestrians using a Rao-Blackwellized particle filter with track management based on detection association likelihoods. Therein, the authors augment the state vector of tracked objects by an appearance-based deep person re-identification vector [23] and compute the data association probability by multiplying conditionally independent position and appearance association likelihoods. The authors report that adding appearance information reduced the number of identity switches and slightly increased the overall tracking score; however, tracking using just the appearance, without position information, was shown to perform quite poorly. In [24] probabilistic models were incorporated into a tracking-by-detection approach using prior knowledge of a static scene, describing the pedestrian state using position, height and width in world coordinates. Such an approach lacks information on pedestrian appearance to correctly handle interactions between pedestrians in crowded scenes. The MOANA approach [25] uses hand-crafted features to represent appearance and resolve situations when a candidate observation is spatially close to other objects.

In this paper we present a pedestrian tracking-by-detection approach based on a deep learning detector combined with JIPDA and an appearance-based tracker using deep correspondence embeddings. A convolutional neural network detector was pretrained on the COCO dataset and fine-tuned for accurate pedestrian detection. It serves as the input for the JIPDA-based tracking algorithm where the state consists only of pedestrian kinematic cues (positions and velocities). The proposed pedestrian tracker with kinematic cues currently ranks first on the 3DMOT2015 online benchmark [26] that contains sequences with a static camera (cf. Fig. 1). In order to enable pedestrian tracking in sequences containing camera motion, under the assumption that camera ego-motion is not available, kinematic parameters in 3D need to be exchanged for appearance cues based on a deep correspondence metric in the image space. We therefore learn a correspondence embedding and leverage it for association across video frames using the global nearest neighbor (GNN) approach. In the end, we compare GNN tracking of correspondence embeddings with the JIPDA tracker based on kinematic cues (position and velocity).

This paper is organized as follows. In Section II we describe models we used to detect pedestrians as well as calculate correspondence embeddings. We outline our method for probabilistic association in Section III. Results on the online MOT benchmark as well as validation experiments are presented in Section IV. Finally, we give a summary of our accomplishments and findings in Section V.

## II Detection and appearance representation

We detected pedestrians with the Mask R-CNN algorithm trained on a suitable blend of public datasets. We cropped and scaled the bounding boxes, applied instance segmentation masks and processed them with a separate model trained with a metric loss. This resulted in correspondence embeddings which we used as descriptors in appearance-only tracking. We describe the details in the following subsections.

### II-A Pedestrian detection

Mask R-CNN [27] is an extension of the Faster R-CNN [28] object detector. It consists of two stages: (i) finding regions of interest (RoIs) using region proposal network (RPN) and (ii) classification of the proposed RoIs and bounding box regression. Mask R-CNN enhances the second stage by predicting segmentation masks of RoIs provided by RPN. By utilizing the RoIAlign operation and attaining better representations through learning segmentation masks, Mask R-CNN surpasses Faster R-CNN on the task of object detection. We adapted the multi-class Mask R-CNN for pedestrian detection.

We chose the most suitable transfer-learning strategy by performing validation experiments with a Mask R-CNN detector trained on different datasets. Fine-tuning from COCO to CityPersons turned out to be the most appropriate course of action as shown in detail in Section IV. We believe this can be explained as follows. Firstly, CityPersons includes annotations with fixed aspect ratio (BB-full) which are suitable to train occlusion invariant bounding box regression. Secondly, we noticed that COCO people are much more diverse than MOT pedestrians due to numerous other contexts such as riding, driving, sitting down etc. Furthermore, CityPersons inherits ground truth pixel-level masks from the Cityscapes dataset [29]. Presence of ground truth pixel level masks is suitable for fine-tuning Mask R-CNN’s mask head.

We adapted the pre-trained multi-class Mask R-CNN [27] for pedestrian detection in two steps. Firstly, we adapted the Mask R-CNN classification, bounding box regression and mask prediction heads to have two possible outputs: background and pedestrian. We sliced the weights of the last layer of the classification head so as to keep only the logits for the background and pedestrian classes, which were thereby initialized with the weights of the corresponding COCO classes. Secondly, we fine-tuned the resulting model with ground truth bounding boxes and segmentation masks from CityPersons.
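The weight slicing in the first step can be illustrated with a minimal NumPy sketch. The head shape (81 logits over 1024 inputs) and the class indices (background at 0, person at 1) are assumptions made for illustration and should be checked against the actual checkpoint; `slice_classifier_head` is a hypothetical helper, not part of any Mask R-CNN library.

```python
import numpy as np

def slice_classifier_head(weight, bias, keep_classes):
    """Keep only the output rows (logits) of the requested class indices."""
    keep = np.asarray(keep_classes)
    return weight[keep].copy(), bias[keep].copy()

# Hypothetical COCO-style classification head: 81 logits, 1024 input features.
rng = np.random.default_rng(0)
coco_weight = rng.normal(size=(81, 1024))
coco_bias = rng.normal(size=81)

# Assumed index convention; verify it for the checkpoint at hand.
BACKGROUND, PERSON = 0, 1
ped_weight, ped_bias = slice_classifier_head(
    coco_weight, coco_bias, [BACKGROUND, PERSON]
)
print(ped_weight.shape, ped_bias.shape)
```

The two-logit head starts from the COCO background and person weights, so fine-tuning begins from a detector that already responds to people.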

### II-B Deep correspondence embedding

We represented pedestrian appearance with a metric embedding provided by a deep correspondence model, so that each pedestrian is described by a high-dimensional vector in a metric space. Selection of the correspondence model is not straightforward. We started from the ResNet-18 [30] classification architecture, which consists of four residual blocks, RB1 to RB4. Features from RB4 are suitable for discriminating between different classes. However, we found that features from earlier blocks are more beneficial for differentiating between different person identities. Therefore, we calculated embeddings from features in the last convolutional layer of RB2. Validation experiments suggest that these features contain more information about person appearance than features in any other residual block. Furthermore, the last two blocks hold the large majority of the ResNet-18 parameters (cf. the parameter counts in Table II). By removing them, we decreased the susceptibility to overfitting. At the same time, it is possible to initialize the first two blocks with pre-trained weights and profit from the regularization induced by ImageNet. We demonstrate the effectiveness of this approach in more detail in Section IV.

Furthermore, we investigated the possibility of using segmentation masks provided by Mask R-CNN to generate descriptors which are robust to changes in object background and occlusions. We experimented with two approaches for incorporating the segmentation mask into the correspondence embedding. The first approach applies the mask to the input image. The second approach uses the mask to suppress the background in the convolutional features. The two approaches are not equivalent, since the latter preserves some background influence due to the receptive field of the convolutional features. Despite this, applying the mask to ResNet features performed better in experiments. We conjecture that this is due to the low resolution of the Mask R-CNN mask, which results in poor accuracy when upsampled to RoI resolution. Note that a segmentation mask can be interpreted as a dense map of probabilities that the corresponding pixel is foreground. Therefore, one can suppress the background by elementwise multiplication with the segmentation mask.

As mentioned before, we adapted the ImageNet pre-trained architecture by taking only the first two residual blocks. The features of the last retained residual block (RB2) were passed to a convolutional layer and masked using the output of Mask R-CNN’s segmentation head. Finally, the correspondence embedding was produced by global average pooling.
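The masking and pooling steps can be sketched as follows. This is only an illustration under stated assumptions: the convolutional layer is taken to be a 1x1 convolution (written as a per-pixel linear map), the mask is assumed to be already resized to the feature resolution, and the final L2 normalisation is added so that embeddings are unit vectors. The function name `masked_embedding` and all shapes are hypothetical.

```python
import numpy as np

def masked_embedding(features, mask, proj):
    """features: (C, H, W) backbone activations,
    mask: (H, W) foreground probabilities in [0, 1],
    proj: (D, C) weights of an assumed 1x1 convolution."""
    # A 1x1 convolution is a linear map over channels at every pixel.
    projected = np.einsum("dc,chw->dhw", proj, features)
    # Suppress the background by elementwise multiplication with the mask.
    masked = projected * mask[None, :, :]
    # Global average pooling over the spatial dimensions.
    pooled = masked.mean(axis=(1, 2))
    # L2-normalise so that the embedding is a unit vector.
    return pooled / (np.linalg.norm(pooled) + 1e-12)

rng = np.random.default_rng(0)
features = rng.normal(size=(128, 16, 8))   # hypothetical RB2 feature map
mask = np.zeros((16, 8))
mask[4:12, 2:6] = 1.0                      # hypothetical foreground region
proj = rng.normal(size=(64, 128))          # 64-dimensional output embedding
embedding = masked_embedding(features, mask, proj)
print(embedding.shape)
```

Because the mask multiplies the features before pooling, activations at positions where the mask is zero do not contribute to the embedding.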

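The model was trained using angular loss [8]. We extended the angular loss by adding the margin term. For a given reference embedding $\mathbf{a}$, a corresponding embedding $\mathbf{p}$ of the same identity and a negative embedding $\mathbf{n}$, we calculated the angular loss (1), where $m$ is the margin hyperparameter, $\alpha$ is the angle bound of [8] and $\mathbf{c} = (\mathbf{a} + \mathbf{p})/2$:

$$L_{\mathrm{ang}} = \max\left(0,\; \lVert \mathbf{a} - \mathbf{p} \rVert^2 - 4 \tan^2\alpha \, \lVert \mathbf{n} - \mathbf{c} \rVert^2 + m\right) \tag{1}$$

Gradients of the angular loss push the negative example $\mathbf{n}$ away from the center $\mathbf{c}$ of the $\mathbf{a}$ and $\mathbf{p}$ examples in the $\mathbf{n} - \mathbf{c}$ direction. This also reduces the distance between $\mathbf{a}$ and $\mathbf{p}$ ($\mathbf{a}$, $\mathbf{p}$ and $\mathbf{n}$ are unit vectors).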
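As a rough illustration, the angular loss of [8] with an additive margin could be computed as in the sketch below. The exact placement of the margin inside the hinge, the default angle of 45 degrees and the margin value are assumptions made for this example, not values taken from the text.

```python
import numpy as np

def angular_loss_with_margin(a, p, n, alpha_deg=45.0, margin=0.1):
    """Angular loss of a single triplet with an assumed additive margin.
    a, p, n: reference, positive and negative embeddings."""
    c = 0.5 * (a + p)                          # center of reference and positive
    tan_sq = np.tan(np.deg2rad(alpha_deg)) ** 2
    value = (np.sum((a - p) ** 2)
             - 4.0 * tan_sq * np.sum((n - c) ** 2)
             + margin)
    return max(0.0, float(value))              # hinge

a = np.array([1.0, 0.0])
p = np.array([0.0, 1.0])
far_negative = np.array([-1.0, 0.0])
hard_negative = np.array([1.0, 0.0])           # coincides with the reference
easy_loss = angular_loss_with_margin(a, a, far_negative)
hard_loss = angular_loss_with_margin(a, p, hard_negative)
print(easy_loss, hard_loss)
```

A negative far from the center of the positive pair yields zero loss, while a negative close to the pair yields a positive loss that the margin increases further.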

### II-C Details of training correspondence embedding

We trained the correspondence model on MOT2016 [31]. We refrain from training on 2D MOT 2015 since it does not include precise ground truth data regarding occlusion level. During training, we removed all training samples with occlusion level greater than . We incorporated the following method for generating positive and negative samples. We generated positive examples by taking random detections less than 5 frames away from the reference example frame. We generated negative examples by taking random identity from the same sequence. We sampled random easy negatives as bounding boxes which do not intersect any ground truth bounding boxes. This made the correspondence model more robust to pedestrian detector’s false negative outputs. We chose the following sequences to serve as validation data: MOT16-02, MOT16-04, MOT16-05. The validation data was used for early stopping and tuning of hyperparameters. The output embedding vectors had 64 dimensions. We used the Adam [32] optimizer with fixed learning rate of . Weight decay was set to for all parameters and the model was trained for 10 epochs. During training and testing, we did not use whole images. Instead, we cropped the detection bounding boxes and resized them to the fixed resolution .
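The sampling scheme described above can be sketched in plain Python. The data layout (a dictionary from identity to per-frame boxes), the exclusion of the anchor frame itself from the positives, and the helper names are assumptions made for illustration.

```python
import random

def sample_triplet(tracks, rng, max_gap=4):
    """tracks: dict identity -> dict frame -> box, all from one sequence.
    Returns (anchor, positive, negative) as (identity, frame) pairs,
    or None when the anchor has no nearby frame."""
    identities = sorted(tracks)
    anchor_id = rng.choice(identities)
    frames = sorted(tracks[anchor_id])
    anchor_frame = rng.choice(frames)
    # Positive: same identity, fewer than 5 frames away from the anchor.
    near = [f for f in frames
            if f != anchor_frame and abs(f - anchor_frame) <= max_gap]
    if not near:
        return None
    pos_frame = rng.choice(near)
    # Negative: a random different identity from the same sequence.
    neg_id = rng.choice([i for i in identities if i != anchor_id])
    neg_frame = rng.choice(sorted(tracks[neg_id]))
    return (anchor_id, anchor_frame), (anchor_id, pos_frame), (neg_id, neg_frame)

def boxes_intersect(a, b):
    """Boxes are (x1, y1, x2, y2)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def is_easy_negative(box, gt_boxes):
    """Easy negatives do not intersect any ground truth box."""
    return not any(boxes_intersect(box, g) for g in gt_boxes)

toy_tracks = {
    1: {0: (0, 0, 10, 20), 1: (1, 0, 11, 20), 10: (9, 0, 19, 20)},
    2: {0: (50, 0, 60, 20), 3: (52, 0, 62, 20)},
}
print(sample_triplet(toy_tracks, random.Random(0)))
```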

## III Joint Integrated Probabilistic Data Association

Probabilistic data association algorithms use soft assignment to update the state of each individual target with all available measurements. However, this results in a large number of possible joint association events that have to be considered. The challenge is further aggravated if the targets are not well separated, in which case the calculation of the a posteriori association probabilities may become intractable for practical application. However, the number of events can be significantly reduced by a validation process in which very unlikely association hypotheses are discarded. Furthermore, an efficient approximation of the JPDA based on the $m$ best joint associations was proposed in [33]; nevertheless, in this paper we consider the exact JIPDA algorithm since the targets are well separated and the clutter rate is low.

JIPDA predicts the target state individually for each track. If we construct an appropriate target motion model, then the state of each target can be propagated using the standard Kalman filter prediction step. Additionally, the target existence probability prediction is given by

$$P(\chi_t^{k} \mid Z^{k-1}) = p_S \, P(\chi_t^{k-1} \mid Z^{k-1}), \tag{2}$$

where $p_S$ is the target survival probability and $Z_k = \{\mathbf{z}_j^k\}_{j=1}^{m_k}$ is the set of all observations at time $k$, where $m_k$ is the number of observations at time $k$. $Z^k = \{Z_1, \dots, Z_k\}$ is the set of all observations up to and including time $k$ and $\chi_t^k$ is the hypothesis that track $t$ exists.
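A minimal sketch of the prediction step follows, assuming a constant velocity model with state ordering `[x, y, vx, vy]` and a white-acceleration process noise; the ordering, the noise model and all numeric values are illustrative assumptions.

```python
import numpy as np

def predict(x, P, existence, dt, q, p_survival):
    """Kalman prediction for a constant velocity model together with
    the existence prediction: survival probability times prior existence."""
    F = np.array([[1.0, 0.0, dt, 0.0],
                  [0.0, 1.0, 0.0, dt],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    # Assumed white-acceleration noise with standard deviation q.
    G = np.array([[0.5 * dt ** 2, 0.0],
                  [0.0, 0.5 * dt ** 2],
                  [dt, 0.0],
                  [0.0, dt]])
    Q = (q ** 2) * (G @ G.T)
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    existence_pred = p_survival * existence
    return x_pred, P_pred, existence_pred

x0 = np.array([0.0, 0.0, 1.0, 2.0])
x_pred, P_pred, e_pred = predict(x0, np.eye(4), existence=0.8,
                                 dt=1.0, q=0.5, p_survival=0.99)
print(x_pred, e_pred)
```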

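Let $\boldsymbol{\nu}_{t,j} = \mathbf{z}_j - \hat{\mathbf{z}}_t$ denote the innovation of the $j$-th measurement to the track $t$, where $\mathbf{z}_j$ is the measurement and $\hat{\mathbf{z}}_t$ is the predicted measurement for track $t$. Time superscripts are omitted here for clarity. The target state is then corrected by the Kalman filter update equation

$$\hat{\mathbf{x}}_t = \hat{\mathbf{x}}_t^{-} + \mathbf{K}_t \boldsymbol{\nu}_t, \tag{3}$$

where $\hat{\mathbf{x}}_t^{-}$ is the predicted state, $\boldsymbol{\nu}_t = \sum_{j} \beta_{t,j} \boldsymbol{\nu}_{t,j}$ is the weighted innovation, $\beta_{t,j}$ are posterior association probabilities, and $\mathbf{K}_t$ is the Kalman gain for target $t$. The update of the covariance matrix slightly differs from the original Kalman update step [12]

$$\mathbf{P}_t = \mathbf{P}_t^{-} - (1 - \beta_{t,0}) \, \mathbf{K}_t \mathbf{S}_t \mathbf{K}_t^{\mathsf{T}} + \tilde{\mathbf{P}}_t, \tag{4}$$

$$\tilde{\mathbf{P}}_t = \mathbf{K}_t \Bigl( \sum_{j} \beta_{t,j} \boldsymbol{\nu}_{t,j} \boldsymbol{\nu}_{t,j}^{\mathsf{T}} - \boldsymbol{\nu}_t \boldsymbol{\nu}_t^{\mathsf{T}} \Bigr) \mathbf{K}_t^{\mathsf{T}}, \tag{5}$$

where $\mathbf{P}_t^{-}$ is the predicted covariance, $\beta_{t,0}$ is the probability that no measurement originated from target $t$, and $\mathbf{S}_t$ is the innovation covariance of the target $t$.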
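The soft update can be sketched with NumPy as follows. This follows the standard PDA form of the state and covariance update from the tracking literature; the variable names, the linear measurement model `H` and the example values are assumptions for illustration.

```python
import numpy as np

def pda_update(x_pred, P_pred, H, R, measurements, betas, beta0):
    """PDA-style update: betas[j] is the association probability of
    measurements[j], beta0 the probability that no measurement is correct."""
    S = H @ P_pred @ H.T + R                   # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)        # Kalman gain
    dim = H.shape[0]
    nu = np.zeros(dim)                         # weighted innovation
    second_moment = np.zeros((dim, dim))
    for z, beta in zip(measurements, betas):
        v = z - H @ x_pred
        nu += beta * v
        second_moment += beta * np.outer(v, v)
    x = x_pred + K @ nu
    # Spread-of-innovations term that inflates the covariance.
    spread = K @ (second_moment - np.outer(nu, nu)) @ K.T
    P = P_pred - (1.0 - beta0) * (K @ S @ K.T) + spread
    return x, P

H = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
x_pred = np.array([0.0, 0.0, 1.0, 0.0])
P_pred = np.eye(4)
R = 0.25 * np.eye(2)
z = np.array([0.5, -0.2])
x_post, P_post = pda_update(x_pred, P_pred, H, R, [z], [1.0], 0.0)
print(x_post)
```

With a single measurement of association probability one, the spread term vanishes and the update reduces to the ordinary Kalman filter update.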

The combinatorial computational complexity of the JIPDA can be alleviated by discarding the assignment hypotheses that are unlikely. Since the innovation of the measurement is zero-mean and normally distributed, the measurement validation can be achieved by selecting only those measurements that lie in the confidence ellipsoid of the target [11]. The a priori likelihood function of a measurement $\mathbf{z}_j$ given the state $\mathbf{x}_t$ of a target, after validating with the gating probability $P_G$, is given by

$$p(\mathbf{z}_j \mid \mathbf{x}_t) = \frac{1}{P_G} \, \mathcal{N}(\mathbf{z}_j; \hat{\mathbf{z}}_t, \mathbf{S}_t) \tag{6}$$

when $\mathbf{z}_j$ is inside the validation gate and zero otherwise.
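For two-dimensional measurements the gate can be sketched without any external library, since the squared Mahalanobis distance then follows a chi-square distribution with two degrees of freedom whose quantile has a closed form. The restriction to 2D and the function names are assumptions of this example.

```python
import math

def gate_threshold_2d(p_gate):
    """Chi-square quantile for 2 degrees of freedom:
    P(d2 <= g) = 1 - exp(-g / 2)."""
    return -2.0 * math.log(1.0 - p_gate)

def gated_likelihood(z, z_pred, S, p_gate):
    """Gaussian likelihood divided by the gating probability inside
    the validation gate, zero outside. S is a 2x2 nested list."""
    (a, b), (c, d) = S
    det = a * d - b * c
    vx, vy = z[0] - z_pred[0], z[1] - z_pred[1]
    # Squared Mahalanobis distance using the explicit 2x2 inverse.
    d2 = (d * vx * vx - (b + c) * vx * vy + a * vy * vy) / det
    if d2 > gate_threshold_2d(p_gate):
        return 0.0
    gauss = math.exp(-0.5 * d2) / (2.0 * math.pi * math.sqrt(det))
    return gauss / p_gate

S_demo = [[1.0, 0.0], [0.0, 1.0]]
inside = gated_likelihood((0.0, 0.0), (0.0, 0.0), S_demo, 0.99)
outside = gated_likelihood((10.0, 0.0), (0.0, 0.0), S_demo, 0.99)
print(inside, outside)
```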

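To calculate the a posteriori association probabilities $\beta_{t,j}$, all possible joint association events must be considered. In each event, one target can be associated with at most one detection, and each detection cannot be assigned to more than one target. Let $\Xi$ denote the set of all joint association events. Since those events are exhaustive and mutually exclusive, the probability of the joint event $\varepsilon \in \Xi$ is given by [13]

$$P(\varepsilon \mid Z^k) = c^{-1} \prod_{t \in T_0(\varepsilon)} \Bigl(1 - P_D P_G \, P(\chi_t^k \mid Z^{k-1})\Bigr) \prod_{t \in T_1(\varepsilon)} \frac{P_D P_G \, P(\chi_t^k \mid Z^{k-1}) \, p(\mathbf{z}_{j(t,\varepsilon)} \mid \mathbf{x}_t)}{\rho}, \tag{7}$$

where $c$ is the normalization constant, $P_D$ is the detection probability, $\rho$ is the clutter density, $T_0(\varepsilon)$ and $T_1(\varepsilon)$ are the sets of tracks assigned with no measurements and with one measurement in $\varepsilon$, and $j(t,\varepsilon)$ is the index of the measurement assigned to track $t$ in $\varepsilon$.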

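Let $\theta_{t,j}$ be the hypothesis that the measurement $j$ belongs to target $t$ and $\theta_{t,0}$ the hypothesis that the target was not detected. The a posteriori probabilities of individual track existence and measurement association can be obtained by [13]

$$P(\chi_t^k, \theta_{t,j} \mid Z^k) = \sum_{\varepsilon \in \Xi(t,j)} P(\varepsilon \mid Z^k), \tag{8}$$

$$P(\chi_t^k, \theta_{t,0} \mid Z^k) = \frac{(1 - P_D P_G) \, P(\chi_t^k \mid Z^{k-1})}{1 - P_D P_G \, P(\chi_t^k \mid Z^{k-1})} \sum_{\varepsilon \in \Xi(t,0)} P(\varepsilon \mid Z^k), \tag{9}$$

where $\Xi(t,j)$ is the set of all events that assign measurement $j$ to track $t$, while $\Xi(t,0)$ is the set of all events in which track $t$ was missed. Given probabilities (8) and (9), the a posteriori track existence probability is computed as

$$P(\chi_t^k \mid Z^k) = P(\chi_t^k, \theta_{t,0} \mid Z^k) + \sum_{j=1}^{m_k} \omega_{t,j} \, P(\chi_t^k, \theta_{t,j} \mid Z^k), \tag{10}$$

where $\omega_{t,j}$ is the element of the validation matrix, while a posteriori association probabilities are given by

$$\beta_{t,j} = \frac{P(\chi_t^k, \theta_{t,j} \mid Z^k)}{P(\chi_t^k \mid Z^k)}, \quad j = 0, \dots, m_k. \tag{11}$$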
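The exact enumeration of joint events can be sketched for a handful of tracks and measurements. The sketch follows the standard JIPDA formulation of [13] as we understand it; the function name, the argument layout (a likelihood matrix with zeros outside the gate) and the numeric values are illustrative assumptions, and the brute-force enumeration is only practical for small problems.

```python
from itertools import product

def jipda(prior_exist, lik, p_d, p_g, clutter):
    """prior_exist[t]: predicted existence probability of track t.
    lik[t][j]: gated likelihood of measurement j for track t (0 outside gate).
    Returns posterior existence probabilities and association probabilities,
    where betas[t][0] is the missed-detection case."""
    n_tracks, n_meas = len(lik), len(lik[0])
    # Index 0 means "missed"; index j + 1 means "assigned measurement j".
    options = [[0] + [j + 1 for j in range(n_meas) if lik[t][j] > 0.0]
               for t in range(n_tracks)]
    joint = [[0.0] * (n_meas + 1) for _ in range(n_tracks)]
    total = 0.0
    for event in product(*options):
        used = [j for j in event if j > 0]
        if len(used) != len(set(used)):
            continue                      # a measurement is used twice
        weight = 1.0
        for t, j in enumerate(event):
            detect = p_d * p_g * prior_exist[t]
            weight *= (1.0 - detect) if j == 0 else detect * lik[t][j - 1] / clutter
        total += weight
        for t, j in enumerate(event):
            joint[t][j] += weight
    posterior_exist, betas = [], []
    for t in range(n_tracks):
        row = [w / total for w in joint[t]]
        # Missed track: account for the chance that it does not exist at all.
        row[0] *= ((1.0 - p_d * p_g) * prior_exist[t]
                   / (1.0 - p_d * p_g * prior_exist[t]))
        exist = sum(row)
        posterior_exist.append(exist)
        betas.append([r / exist for r in row])
    return posterior_exist, betas

exist, betas = jipda([0.9, 0.9], [[5.0], [0.01]], p_d=0.9, p_g=0.99, clutter=0.1)
print(exist, betas)
```

In this toy case the single measurement is far more likely under the first track, so that track receives most of the association probability and keeps a high existence probability, while the existence probability of the second track drops.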

## IV Experimental Results

### IV-A Validating the detection and correspondence model

For validation experiments we studied the impact of training Mask R-CNN on different combinations of training datasets and we carefully analyzed the design possibilities to find the most suitable correspondence embedding. Here, we describe several validation studies and comment on the results.

#### Fine tuning Mask R-CNN

After having little success in transfer learning from COCO to MOT in our preliminary experiments, we performed validation experiments by training just on CityPersons, just on COCO, and on both datasets, achieving average precision of 45.1, 53.3, and 57.0, respectively. Fine-tuning on CityPersons helps to distinguish between pedestrians and other people. Also, bounding boxes generated by Mask R-CNN trained using BB-full annotations from CityPersons are a better fit for detection of MOT pedestrians. All our detection experiments feature Mask R-CNN based on ResNet-50 FPN.

#### Using segmentation maps

The impact of using segmentation masks is shown in Table I, where IDs denotes the number of identity switches, while IDs† shows evaluation on ground truth bounding boxes. There are more IDs when evaluating on ground truth because no fragmentations are present. The models were trained on the MOT2016 train dataset, while evaluation was performed using an appearance-based GNN on 2DMOT2015 train. The results show that segmentation masks generated by Mask R-CNN benefit the correspondence model by alleviating the impact of background and occlusions. First, we trained a baseline correspondence model which did not use segmentation masks. Secondly, we trained two correspondence models improved by segmentation masks. The first model masks the input image. The second model masks the final feature map before the global average pooling operation. We observed the largest improvement in tracking with the latter approach.

Table I: Impact of segmentation masks on appearance-based GNN tracking (2DMOT2015 train).

| masked tensor | IDs† | IDs | MOTA |
|---|---|---|---|
| – | 507 | 404 | 53.6 |
| input image | 420 | 337 | 53.8 |
| final conv features | 328 | 291 | 53.9 |
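The appearance-based GNN association used in these experiments can be illustrated with a small sketch that finds the assignment of detections to tracks with the minimal total cosine distance. The brute-force search over permutations, the `max_cost` gate and the function names are assumptions made for this example; a practical implementation would rather use the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment`).

```python
from itertools import permutations

def cosine_distance(u, v):
    """Cosine distance, assuming both embeddings are unit vectors."""
    return 1.0 - sum(a * b for a, b in zip(u, v))

def gnn_assign(track_embs, det_embs, max_cost=0.5):
    """Global nearest neighbor: the one-to-one assignment with minimal
    total cost. Pairs with cost >= max_cost are left unmatched."""
    n, m = len(track_embs), len(det_embs)
    size = max(n, m)
    # Pad to a square matrix; dummy rows and columns cost max_cost.
    cost = [[max_cost] * size for _ in range(size)]
    for i in range(n):
        for j in range(m):
            cost[i][j] = min(cosine_distance(track_embs[i], det_embs[j]), max_cost)
    best = min(permutations(range(size)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(size)))
    return [(i, best[i]) for i in range(n)
            if best[i] < m and cost[i][best[i]] < max_cost]

tracks = [[1.0, 0.0], [0.0, 1.0]]
detections = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
matches = gnn_assign(tracks, detections)
print(matches)
```

Detections that remain unmatched, such as the third one in this example, would be candidates for initializing new tracks.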

#### Residual blocks

Our final model uses only the first two residual blocks of an ImageNet pre-trained ResNet-18. This design choice is supported by the experiments shown in Table II. In each experiment, we used one additional residual block. We trained the model on MOT2016 and evaluated tracking using a position-agnostic GNN approach. The results support our initial hypothesis that abstract features, like the ones at the output of a full ResNet model, may not be beneficial for describing appearance. Fig. 2 shows how the appearance similarity is distributed across frames, in terms of the similarity score between appearance vectors of the same object separated in time. The results show that even for a separation of five time steps the similarity of most appearance vectors is preserved, with clear separation from the other objects.

Table II: Validation of the number of residual blocks used in the correspondence model.

| RB1 | RB2 | RB3 | RB4 | #params | IDs† | IDs |
|---|---|---|---|---|---|---|
| ✓ | | | | 161.7K | 393 | 458 |
| ✓ | ✓ | | | 691.3K | 328 | 291 |
| ✓ | ✓ | ✓ | | 2.8M | 416 | 398 |
| ✓ | ✓ | ✓ | ✓ | 11.2M | 1271 | 687 |

### IV-B Pedestrian tracking evaluation

To track the states of individual targets we used the constant velocity motion model with the state vector of the targets given by . We set the process and measurement noise deviations to and . Target survival and detection probabilities were and . Measurement gating probability was . False alarm rate was set to , where is the surveillance area. New targets were initialized for the measurements whose a posteriori association probability of not being associated to any of the existing targets was

Initial existence probability for new targets was set to and the target was confirmed when its existence probability exceeded threshold . Since nothing could be inferred about the new target’s velocity from only one measurement, it was assumed to be zero, but the initial covariance matrix of the target was inflated so that the state of the target converges to the actual value when the new measurements arrived. Targets were terminated when their existence probability fell below the threshold . To improve tracking performance we discarded all detections with confidence score below the threshold .

Table III: Tracking results on the 3DMOT2015 test sequence.

| Tracker | MOTA | MOTP | FP | FN | IDs |
|---|---|---|---|---|---|
| MCN_JIPDA | 55.9 | 64.0 | 2,910 | 4,011 | 486 |
| MOANA [25] | 52.7 | 56.3 | 2,226 | 5,551 | 167 |
| DBN [24] | 51.1 | 61.0 | 2,077 | 5,746 | 380 |
| GPDBN [34] | 49.8 | 62.2 | 1,813 | 6,300 | 311 |
| GustavHX | 42.5 | 56.2 | 2,735 | 6,623 | 302 |

The tracking results are shown in Table III, where we can see that the proposed JIPDA based on kinematic cues, combined with the Mask R-CNN detector, ranked first on the 3DMOT2015 dataset that contains static camera sequences. The table shows results for the test sequence, while on the train sequences the tracker obtained MOTA 80.6 and MOTP 69.1. Our method did produce a higher number of identity switches compared to MOANA, since we did not use appearance cues and our detector has higher recall than public detections. The tracking performance could be further improved by using an interacting multiple model filter [35] instead of a constant velocity Kalman filter and by taking unresolved measurements into account as proposed in [36].

Table IV compares the JIPDA based on kinematic cues and deep detections with the GNN based on the deep correspondence metric. Both trackers show roughly the same performance for static camera sequences and tracking in the image space, while the kinematic JIPDA is not appropriate for a moving camera with unknown motion. Augmenting the state space with deep correspondence embeddings directly within a soft data association approach such as JIPDA did not result in increased tracking accuracy in our experiments. It remains an interesting avenue for future work to investigate the geometry of the correspondence embedding space and utilize the findings in soft data association approaches.

Table IV: Comparison of the kinematic JIPDA and the appearance-based GNN on static and moving camera sequences.

| Cam | Sequence | JIPDA | Appearance GNN |
|---|---|---|---|
| Static | ADL-Rundle-6 | 58.4 | 58.4 |
| | KITTI-17 | 58.3 | 56.1 |
| | PETS09-S2L1 | 79.8 | 78.8 |
| | TUD-Campus | 78.3 | 79.4 |
| | TUD-Stadtmitte | 81.0 | 81.6 |
| | Venice-2 | 46.0 | 47.1 |
| Moving | ADL-Rundle-8 | – | 49.5 |
| | ETH-Bahnhof | – | 29.4 |
| | ETH-Pedcross2 | – | 58.0 |
| | ETH-Sunnyday | – | 62.8 |
| | KITTI-13 | – | 40.8 |
| Total | | – | 53.8 |

## V Conclusion

In this work we have proposed an online pedestrian tracking method based on JIPDA and deep models for pedestrian detection and correspondence embedding. We have demonstrated how a COCO pre-trained Mask R-CNN can be adapted for accurate pedestrian detection. Furthermore, we incorporated segmentation masks to improve the correspondence model embeddings. Our correspondence embedding uses masked features from the second residual block of ResNet-18 in order to focus on low-level foreground appearance and reduce the parameter count. The features are pre-trained on ImageNet and fine-tuned with the angular loss. We achieve our best results on the 3DMOT2015 benchmark by combining Mask R-CNN detection and JIPDA. Our submission achieves MOTA 55.9 and ranks #1 at the time of writing this manuscript. Suitable directions for future work include integrating correspondence embeddings within JIPDA and investigating the geometry of such soft data association.

## Acknowledgement

This work has been supported in part by the European Regional Development Fund under the project “System for increased driving safety in public urban rail traffic (SafeTRAM)” under Grant KK.01.2.1.01.0022 and in part by the Ministry of Science and Education of the Republic of Croatia under the project “Rethinking Robotics for the Robot Companion of the future (RoboCom++)”. The Titan X used in experiments was donated by NVIDIA Corporation.

## References

- [1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” IJCV, 2015.
- [2] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” CoRR, 2014.
- [3] S. Zhang, R. Benenson, M. Omran, J. H. Hosang, and B. Schiele, “How far are we from solving pedestrian detection?” in CVPR, 2016.
- [4] S. Zhang, R. Benenson, and B. Schiele, “Citypersons: A diverse dataset for pedestrian detection,” in CVPR, 2017.
- [5] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a siamese time delay neural network,” in NIPS, 1993.
- [6] E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in Similarity-Based Pattern Recognition - Third International Workshop, SIMBAD, 2015.
- [7] K. Sohn, “Improved deep metric learning with multi-class n-pair loss objective,” in NIPS, 2016.
- [8] J. Wang, F. Zhou, S. Wen, X. Liu, and Y. Lin, “Deep metric learning with angular loss,” in ICCV, 2017.
- [9] C. Song, Y. Huang, W. Ouyang, and L. Wang, “Mask-guided contrastive attention model for person re-identification,” in CVPR, 2018.
- [10] B.-n. Vo, M. Mallick, Y. Bar-shalom, S. Coraluppi, R. Osborne, R. Mahler, and B.-t. Vo, “Multitarget Tracking,” in Wiley Encyclopedia of Electrical and Electronics Engineering, 2015.
- [11] Y. Bar-Shalom and E. Tse, “Tracking in a cluttered environment with probabilistic data association,” Automatica, 1975.
- [12] T. Fortmann, Y. Bar-Shalom, and M. Scheffe, “Sonar tracking of multiple targets using joint probabilistic data association,” IEEE Journal of Oceanic Engineering, 1983.
- [13] D. Mušicki and R. Evans, “Joint Integrated Probabilistic Data Association - JIPDA,” in Proceedings of the Fifth International Conference on Information Fusion (FUSION), 2002.
- [14] D. Mušicki, R. Evans, and S. Stankovic, “Integrated probabilistic data association,” IEEE Transactions on Automatic Control, 1994.
- [15] D. Reid, “An algorithm for tracking multiple targets,” IEEE Transactions on Automatic Control, 1979.
- [16] S. S. Blackman, “Multiple hypothesis tracking for multiple target tracking,” IEEE Aerospace and Electronic Systems Magazine, 2004.
- [17] I. R. Goodman, R. P. S. Mahler, and H. T. Nguyen, Mathematics of Data Fusion, Dordrecht, 1997.
- [18] B.-N. Vo and W.-K. Ma, “The Gaussian Mixture Probability Hypothesis Density Filter,” IEEE Transactions on Signal Processing, 2006.
- [19] R. P. Mahler, Statistical Multisource-Multitarget Information Fusion, 2007.
- [20] S. Reuter, B. T. Vo, B. N. Vo, and K. Dietmayer, “The labeled multi-Bernoulli filter,” IEEE Transactions on Signal Processing, 2014.
- [21] K. Krishanth, X. Chen, R. Tharmarasa, T. Kirubarajan, and M. McDonald, “The Social Force PHD Filter for Tracking Pedestrians,” IEEE Transactions on Aerospace and Electronic Systems, 2017.
- [22] B. H. Wang, Y. Wang, K. Q. Weinberger, and M. Campbell, “Deep Person Re-identification for Probabilistic Data Association in Multiple Pedestrian Tracking,” in arXiv:1810.08565, 2018.
- [23] Y. Wang, L. Wang, Y. You, X. Zou, V. Chen, S. Li, G. Huang, B. Hariharan, and K. Q. Weinberger, “Resource Aware Person Re-identification across Multiple Resolutions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [24] T. Klinger, F. Rottensteiner, and C. Heipke, “Probabilistic multi-person tracking using dynamic bayes networks,” in ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2015.
- [25] Z. Tang and J. Hwang, “Moana: An online learned adaptive appearance model for robust multiple object tracking in 3d,” IEEE Access, 2019.
- [26] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler, “MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking,” 2015.
- [27] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask R-CNN,” in ICCV, 2017.
- [28] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in NIPS, 2015.
- [29] M. Cordts, M. Omran, S. Ramos, T. Scharwächter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset,” in CVPRW, 2015.
- [30] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
- [31] A. Milan, L. Leal-Taixe, I. Reid, S. Roth, and K. Schindler, “MOT16: A Benchmark for Multi-Object Tracking,” 2016.
- [32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, 2014.
- [33] S. H. Rezatofighi, A. Milan, Z. Zhang, Q. Shi, A. Dick, and I. Reid, “Joint Probabilistic Data Association Revisited,” in 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
- [34] T. Klinger, F. Rottensteiner, and C. Heipke, “Probabilistic multi-person localisation and tracking in image sequences,” ISPRS Journal of Photogrammetry and Remote Sensing, 2017.
- [35] M. de Feo, A. Graziano, R. Miglioli, and A. Farina, “IMMJPDA versus MHT and Kalman filter with NN correlation: performance comparison,” IEE Proceedings - Radar, Sonar and Navigation, vol. 144, no. 2, 1997.
- [36] D. Svensson, M. Ulmke, and L. Hammarstrand, “Multitarget sensor resolution model and joint probabilistic data association,” IEEE Transactions on Aerospace and Electronic Systems, vol. 48, no. 4, 2012.