End-to-End Recurrent Multi-Object Tracking and Trajectory Prediction with Relational Reasoning



The majority of contemporary object-tracking approaches do not model interactions between objects. This contrasts with the fact that objects’ paths are not independent: a cyclist might abruptly deviate from a previously planned trajectory in order to avoid colliding with a car. Building upon hart, a neural class-agnostic single-object tracker, we introduce a multi-object tracking method (mohart) capable of relational reasoning. Importantly, the entire system, including the understanding of interactions and relations between objects, is class-agnostic and learned simultaneously in an end-to-end fashion. We explore a number of relational reasoning architectures and show that permutation-invariant models outperform non-permutation-invariant alternatives. We also find that architectures using a single permutation invariant operation like DeepSets, despite, in theory, being universal function approximators, are nonetheless outperformed by a more complex architecture based on multi-headed attention. The latter better accounts for complex physical interactions in a challenging toy experiment. Further, we find that modelling interactions leads to consistent performance gains in tracking as well as future trajectory prediction on three real-world datasets (MOTChallenge, UA-DETRAC, and Stanford Drone dataset), particularly in the presence of ego-motion, occlusions, crowded scenes, and faulty sensor inputs.




1 Introduction

It is imperative that any autonomous agent acting in the real world be capable of accounting for a variety of present objects and for interactions between these objects. This motivates the need for tracking algorithms which are class-agnostic and able to model the dynamics of multiple objects, properties not yet supported by current state-of-the-art visual object trackers. These often use detectors or region proposal networks such as Faster R-CNN (RCNN; FasterRCNN). Algorithms from this family can achieve high accuracy, provided there is sufficient labelled data to train the object detector and all encountered objects can be associated with known classes, but they fail when faced with objects from unseen categories.


Hierarchical attentive recurrent tracking (HART; Kosiorek17) is a recently proposed alternative method for single-object tracking (SOT), which can track arbitrary objects indicated by the user. As is common in visual object tracking (VOT), HART is provided with a bounding box in the first frame. In the following frames, HART efficiently processes just the relevant part of an image using spatial attention; it also integrates object detection, feature extraction, and motion modelling into one end-to-end network. Contrary to most methods, which process video frames one at a time, end-to-end learning in HART allows for discovering complex visual and spatio-temporal patterns in videos, which is conducive to inferring what an object is and how it moves. It is also class-agnostic as it does not rely on a pre-trained detector.

Figure 1: MOHART. A glimpse is extracted for each object using a (fully differentiable) spatial attention mechanism. These glimpses are further processed with a CNN and fed into a relational reasoning module. A recurrent module, which iterates over time steps, allows for capturing complex motion patterns. It also outputs spatial attention parameters and a feature vector per object for the relational reasoning module. Dashed lines indicate temporal connections (from one time step to the next). The entire pipeline operates in parallel for the different objects; only the relational reasoning module allows for exchange of information between the tracking states of the objects. MOHART is an extension of HART (a single-object tracker), which features the same pipeline without the relational reasoning module.

In the original formulation, HART is limited to the single-object modality, as are other existing end-to-end trackers (Kahou15; Danesh19; Gordon2018). In this work, we present multi-object hierarchical attentive recurrent tracking (MOHART), a class-agnostic tracker with relational reasoning capabilities. MOHART infers the latent state of every tracked object in parallel, informing per-object states about other tracked objects using self-attention (Vaswani17; Lee2019settransformer). This helps to avoid performance loss under self-occlusions of tracked objects or strong camera motion. Moreover, since the model is trained end-to-end, it is able to learn how to manage faulty or missing sensor inputs. See Figure 1 for a high-level illustration of MOHART. The relational-reasoning module receives a list of feature vectors, one per object, as input. This part of the problem is permutation invariant, as the list order of object representations carries no meaning for the task at hand.

In order to track objects, MOHART estimates their states, which can naturally be used to predict future trajectories over short temporal horizons. This is especially useful for planning in the context of autonomous agents. MOHART can be trained simultaneously for object tracking and trajectory prediction, thereby increasing the statistical efficiency of learning. In contrast to prior art, where these two tasks are usually addressed as separate problems with unrelated solutions, our work shows that trajectory prediction and object tracking are best addressed jointly.

2 Related Work

Visual Object Tracking (VOT)

In VOT, a ground-truth bounding box is provided to the algorithm in the first frame and the tracker is evaluated on all future frames. The performance is typically measured as intersection-over-union (IoU) averaged across all frames in which the object is present. On many VOT datasets, Siam R-CNN (SiamRCNN) is currently state-of-the-art (SOTA). In each frame, it employs a pre-trained region proposal network (RPN) providing candidate bounding boxes, which are then compared to the bounding box in the first frame. Previous SOTA models include SiamRPN++ (SiamRPNpp), ECO (ECO), and DiMP (DIMP). These models are highly engineered and achieve excellent results: they often use RPNs pre-trained on large amounts of data and fine-tune their model online on the target dataset. All of these models track one object at a time (although Siam R-CNN does track distractor objects) and therefore do not perform relational reasoning. They also do not have internal motion models of the objects.

End-to-End VOT

A newly established and much less explored stream of work approaches region proposal, feature extraction and tracking in an end-to-end fashion, with gradients propagated through all parts of the model and across the time axis. This allows for complex motion models as well as efficiency (learning where to look). A key difficulty here is that extracting an image crop (according to bounding boxes provided by a detector) is non-differentiable and results in high-variance gradient estimators. Kahou15 propose an end-to-end tracker with soft spatial attention using a 2D grid of Gaussians instead of a hard bounding box. HART draws inspiration from this idea, employs an additional attention mechanism, and shows promising performance on the real-world KITTI dataset (Kosiorek17). HART forms the foundation of this work. It has also been extended to incorporate depth information from RGB-D cameras (Danesh19). Gordon2018 propose an approach in which the crop corresponds to the scaled-up previous bounding box. This simplifies the approach but does not allow the model to learn where to look, i.e. no gradient is backpropagated through crop coordinates. To the best of our knowledge, there are no successful implementations of any such end-to-end approaches for multi-object tracking from vision beyond generative modelling works (Kosiorek2018sqair; Steenkiste2018; Jiang2020scalor), which work only on relatively simple datasets, and Frossard2018, which relies on costly lidar sensors. On real-world data, the only end-to-end approaches correspond to applying multiple single-object trackers in parallel, a method which does not leverage the potential of scene context or inter-object interactions.


In tracking-by-detection, objects are traditionally first detected in each frame independently. A tracking algorithm then links the detections from different frames to propose a coherent trajectory (Zhang2008; Milan2014; bae2017confidence; keuper2018motion). Often, tracking-by-detection benchmark suites provide external detections for both training and test sets, turning tracking into a data association task. Recently, some approaches have started reusing the same networks for detecting objects and generating re-identification features (Zhou2020objectsaspoints; Zhang2020asb). MOHART currently does not use external detections (provided or from a private detector) and hence cannot be quantitatively compared to these approaches. However, incorporating external detections via the proposed attention framework is a promising direction for future research.

Pedestrian trajectory prediction

We draw inspiration from this stream of work for developing a relational reasoning module. Social-LSTM (social-lstm) employs an LSTM to predict pedestrian trajectories and uses max-pooling to model global social context. Attention mechanisms have also been employed to query the most relevant information, such as neighbouring pedestrians, in a learnable fashion (su2016crowd; fernando2018soft; sadeghian2019sophie). Our work stands apart from this prior art by not relying on ground-truth tracklets. It addresses the more challenging task of working directly with visual input, performing tracking, modelling interactions, and, depending on the application scenario, simultaneously predicting future motions.

3 Recurrent Multi-Object Tracking with Self-Attention

Figure 2: The relational reasoning module in MOHART based on multi-headed self-attention. Here, we show the computation of the interaction of the red object with all other objects. Object representations are computed using visual features, positional encoding and the hidden state from the recurrent module. These are linearly projected onto keys (k), queries (q), and values (v) to compute a weighted sum of interactions between objects, yielding an interaction vector. Subscripts are dropped from all variables for clarity of presentation, as is the splitting into multiple heads.

This section describes the model architecture in Figure 1. We start by describing the hierarchical attentive recurrent tracking (HART) algorithm (Kosiorek17), and then follow with an extension of HART to tracking multiple objects, where multiple instances of HART communicate with each other using multi-headed attention to facilitate relational reasoning. We also explain how this method can be extended to trajectory prediction instead of just tracking.

3.1 Hierarchical Attentive Recurrent Tracking (HART)

HART is an attention-based recurrent algorithm, which can efficiently track single objects in a video. It uses a spatial attention mechanism to extract a glimpse, which corresponds to a small crop of the image at the current time-step, containing the object of interest. This allows it to dispense with the processing of the whole image and can significantly decrease the amount of computation required. HART uses a CNN to convert the glimpse into features, which then update the hidden state of an LSTM core. The hidden state is used to estimate the current bounding box, the spatial attention parameters for the next time-step, as well as object appearance. Importantly, the recurrent core can learn to predict complicated motion conditioned on the past history of the tracked object, which leads to relatively small attention glimpses: contrary to CNN-based approaches (Held2016goturn; Valmadre2017corr), HART does not need to analyse large regions-of-interest to search for tracked objects. In the original paper, HART processes the glimpse with an additional ventral and dorsal stream. Early experiments have shown that this does not improve performance on the MOTChallenge dataset, presumably due to the oftentimes small objects and overall small amount of training data. Further details are provided in Appendix B.

The algorithm is initialised with a bounding box for the first time-step, and operates on a sequence of raw images. For all subsequent time-steps, it recursively outputs bounding-box estimates for the current time-step and predicted attention parameters for the next time-step. The performance is measured as IoU averaged over all time steps in which an object is present, excluding the first time step.
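The evaluation metric described above can be illustrated with a minimal sketch. It assumes boxes are given as corner coordinates `(x_min, y_min, x_max, y_max)` and that presence is a per-step boolean flag; both conventions, and the function names, are chosen here for illustration and are not taken from the original implementation.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap region (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def mean_sequence_iou(predicted, ground_truth, present):
    """Average IoU over time steps in which the object is present, skipping step 0."""
    scores = [
        iou(p, g)
        for t, (p, g, is_present) in enumerate(zip(predicted, ground_truth, present))
        if t > 0 and is_present
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

The first time step is skipped because the tracker is initialised with the ground-truth box there, so including it would inflate the score.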


HART is limited to tracking one object at a time. While it can be deployed on several objects in parallel, different HART instances have no means of communication. This results in performance loss, as it is more difficult to identify occlusions, ego-motion and object interactions. Below, we propose an extension of HART which remedies these shortcomings.

3.2 Multi-Object Hierarchical Attentive Recurrent Tracking

Multi-object support in HART requires the following modifications. Firstly, in order to handle a dynamically changing number of objects, we apply HART to multiple objects in parallel, where all parameters are shared between HART instances. We refer to each HART instance as a tracker. Secondly, we introduce a presence variable for each object. It is used to mark whether an object should interact with other objects, as well as to mask the loss function (described in Kosiorek17) for the given object when it is not present. In this setup, parallel trackers cannot exchange information and are conceptually still single-object trackers, which we use as a baseline, referred to as HART (despite it being an extension of the original algorithm). Finally, to enable communication between trackers, we augment HART with an additional step between feature extraction and the LSTM.
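The role of the presence variable in masking the loss can be sketched as follows. This is only an illustration: the 0/1 encoding of presence, the averaging over present objects, and the name `masked_mean_loss` are assumptions made here, and the per-object loss itself (described in Kosiorek17) is treated as a given number.

```python
def masked_mean_loss(per_object_losses, presence):
    """Average per-object losses, counting only objects marked as present.

    per_object_losses: list of floats, one per tracker.
    presence: list of 0/1 flags, one per tracker (assumed encoding).
    """
    if len(per_object_losses) != len(presence):
        raise ValueError("one presence flag is needed per object")
    # Absent objects contribute nothing to the sum or to the normaliser.
    total = sum(loss * flag for loss, flag in zip(per_object_losses, presence))
    count = sum(presence)
    return total / count if count > 0 else 0.0
```

With this masking, a tracker slot whose object has left the scene does not produce gradients, which is what allows a fixed number of parallel trackers to handle a changing number of objects.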

For each object, a glimpse is extracted and processed by a CNN (see Figure 1). Furthermore, spatial attention parameters are linearly projected onto a vector of the same size and added to this representation, acting as a positional encoding. This is then concatenated with the hidden state of the recurrent module of the respective object (see Figure 2). The result is one feature vector per object, and the features of all objects together form a set. Since different objects can interact with each other, it is necessary to use a method that can inform each object about the effects of their interactions with other objects. Moreover, since features extracted from different objects comprise a set, this method should be permutation-equivariant, i.e. the results should not depend on the order in which object features are processed. Therefore, we use the multi-head self-attention block (SAB; Lee2019settransformer), which is able to account for higher-order interactions between set elements when computing their representations. Intuitively, in our case, SAB allows any of the trackers to query other trackers about attributes of their respective objects, e.g. the distance between objects, their direction of movement, or their relation to the camera. This is implemented in three steps (Eqs. 1 to 3).


The relational reasoning module produces one output vector per object. Time-step subscripts are dropped to decrease clutter. In Eq. 1, each of the extracted features is linearly projected into a triplet of key (k), query (q) and value (v) vectors. Stacked over all objects, they comprise the key, query and value matrices K, Q and V, with one row per object. K, Q and V are then split up into multiple heads, which allows querying different attributes by comparing and aggregating different projections of the features. Multiplying the queries with the transposed keys in Eq. 2 compares every query vector to all key vectors, where the value of the corresponding dot-product represents the degree of similarity. Similarities are then normalised via a softmax operation and used to aggregate the values. Finally, the outputs of the different attention heads are concatenated in Eq. 3. SAB produces one output vector for each input; these are then concatenated with the corresponding inputs and fed into separate LSTMs for further processing, as in HART (see Figure 1).
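The three steps described above (projection, per-head attention, concatenation) can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: the function names, the weight matrices `w_q`, `w_k`, `w_v`, and the use of plain Python lists (to keep the sketch dependency-free) are all assumptions, and the common scaling of dot-products by the square root of the key dimension is omitted because the text does not mention it.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of similarity scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(matrix, num_heads):
    """Split the columns of a matrix into num_heads equally sized blocks."""
    width = len(matrix[0]) // num_heads
    return [[row[h * width:(h + 1) * width] for row in matrix] for h in range(num_heads)]


def self_attention_block(features, w_q, w_k, w_v, num_heads):
    """Return one interaction vector per object (one per row of `features`)."""
    # Step 1: linear projection of every object feature into query, key, value.
    q, k, v = matmul(features, w_q), matmul(features, w_k), matmul(features, w_v)
    head_outputs = []
    for q_h, k_h, v_h in zip(split_heads(q, num_heads),
                             split_heads(k, num_heads),
                             split_heads(v, num_heads)):
        # Step 2: compare each query with all keys, normalise, aggregate values.
        k_transposed = [list(col) for col in zip(*k_h)]
        weights = [softmax(row) for row in matmul(q_h, k_transposed)]
        head_outputs.append(matmul(weights, v_h))
    # Step 3: concatenate the outputs of all heads for each object.
    return [sum((head[m] for head in head_outputs), []) for m in range(len(features))]
```

Because every object attends to the full set and the same weights are used for all objects, reordering the input rows only reorders the output rows, which is the permutation-equivariance property the text asks for.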


MOHART is trained fully end-to-end, contrary to other approaches. It maintains a hidden state, which can contain information about the object’s motion. One benefit is that one can simply feed black frames into the model to predict future trajectories. Our experiments show that the model learns to fall back on the motion model captured by the LSTM in this case.

4 Validation on Simulated Data

Figure 3: hart single object tracking applied four times in parallel and trained to predict the location of each circle three time steps into the future. Dashed lines indicate spatial attention, solid lines are predicted bounding boxes, faded circles show ground truth location at . Each circle exerts repulsive forces on each other, where the force scales with , being their distance.

We first evaluate the relational reasoning capabilities of the proposed algorithms on a toy domain—a two-dimensional square box filled with bouncing balls. We train the model to predict future object locations (in contrast to simply tracking), to see how well the models understand motion patterns and interactions between objects. For details about the experimental setup, see Appendix C.

4.1 First Experiment: Deterministic Domain

In the first experiment in the toy domain (Figure 3), four balls, which can be thought of as ‘protons in a box’, repel each other. Hart is applied four times in parallel and is trained to predict the location of each ball three time steps into the future. Different forces from different objects lead to a non-trivial force field at each time step. Accurately predicting the future location of an object using only its previous motion is therefore challenging (Figure 3 shows that each attention glimpse covers only the current object). Surprisingly, the single object tracker solves this task with an average of IoU over sequences of 15 time steps, which shows the efficacy of end-to-end tracking to capture complex motion patterns and use them to predict future locations. This, of course, could also be used to generate good-quality bounding boxes for a tracking task.

4.2 Second Experiment: Stochastic Domain

Figure 4: A scenario constructed to be impossible to solve without relational reasoning. Circles of the same colour repel each other, circles of different colour attract each other. Crucially, each circle is randomly assigned its identity in each time step. Hence, the algorithm can not infer the forces exerted on one object without knowledge of the state of the other objects in the current time step. The forces in this scenario scale with and the algorithm was trained to predict one time step into the future. hart (top) is indeed unable to predict the future location of the objects accurately. The achieved average IoU is , which is only slightly higher than predicting the objects to have the same position in the next time step as in the current one (). Using the relational reasoning module, mohart (bottom) is able to make meaningful predictions ( IoU). The numbers in the bottom row indicate the self-attention weights from the perspective of the top left tracker (yellow number box). Interestingly, the attention scores have a strong correlation with the interaction strength (which scales with distance) without receiving supervision.
Figure 5: Left: average IoU over sequence length for different implementations of relational reasoning on the toy domain shown in Figure 4 (). Right: performance depends on randomness—the frequency with which ball identities are randomly changed (sequence length 15). Higher randomness puts more pressure on relational reasoning. For , identities still have to be reassigned in some cases in order to prevent deadlocks, this leads to a performance loss for all models, which explains lower performance of self-attention for .

In the second experiment, we introduce randomness, rendering the scenario unsolvable for a single-object tracker, as it requires knowledge about the state of the other objects and relational reasoning (see Figure 4). In each time step, we assign a colour-coded identity to the objects. Objects of the same identity repel each other, objects of different identities attract each other (the objects can be thought of as electrons and protons). We compare our proposed attention-based relational reasoning module to the following baselines:


MLP. In this version, the representations of all objects are concatenated and fed into a fully connected layer followed by ELU activations. The output is then again concatenated to the unaltered feature vector of each object. This concatenated version is then fed to the recurrent module of HART. This way of exchanging information allows for universal function approximation (in the limit of infinite layer sizes) but does not impose permutation invariance.


DeepSets. Here, the learned representations of the different objects are summed up instead of concatenated and then divided by the total number of objects. This is closely related to DeepSets (Zaheer2017) and allows for universal function approximation of all permutation invariant functions (Wagstaff2019).


Max-Pooling. Similar to DeepSets, but using max-pooling as the permutation invariant operation. This way of exchanging information is used, e.g., by social-lstm, who predict future pedestrian trajectories from ground-truth tracklets in coordinate space.
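The difference between the three baselines comes down to how the per-object features are aggregated before being handed back to each tracker. The sketch below shows only that aggregation step, under stated simplifications: the learned layers (the fully connected layer and ELU of the MLP variant, and any learned embeddings) are omitted, and all function names are chosen here for illustration.

```python
def mean_pool(features):
    """DeepSets-style aggregation: sum over objects, divided by their number."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]


def max_pool(features):
    """Element-wise maximum over objects."""
    return [max(column) for column in zip(*features)]


def concatenate(features):
    """Input of the MLP baseline: all object features in list order."""
    return [value for feature in features for value in feature]


def exchange(features, aggregate):
    """Append the shared context vector to each object's unaltered feature."""
    context = aggregate(features)
    return [feature + context for feature in features]
```

For example, with `features = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0]]`, reversing the list leaves `mean_pool` and `max_pool` unchanged but changes the result of `concatenate`, so any layer applied after concatenation sees a different input for the same set of objects.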

Figure 5 (left) shows a quantitative comparison of augmenting HART with different relational reasoning modules when identities are re-assigned in every time step. Exchanging information between different trackers with an MLP leads to slightly worse performance than the baseline, while simple max-pooling performs significantly better. This can be explained through the permutation invariance of the problem: latent representations of different objects have no meaningful order; therefore the output of the model should be invariant to the ordering of the objects. The MLP is in itself not permutation invariant and therefore prone to overfitting to the (meaningless) order of the objects in the training data. Max-pooling, however, is permutation invariant and can in theory, despite its simplicity, be used to approximate any permutation invariant function given a sufficiently large latent space (Wagstaff2019). Max-pooling is often used to exchange information between different tracklets, e.g. in the trajectory prediction domain (social-lstm; Gupta2019). However, self-attention, allowing for learned querying and encoding of information, solves the relational reasoning task much more accurately.

In Figure 5 (right), the frequency with which object identities are reassigned randomly is varied. The results show that, in a deterministic environment, tracking does not necessarily profit from relational reasoning, even in the presence of long-range interactions. The less random the environment, the more static the force field is, and a static force field can be inferred from a small number of observations (see Figure 3). This does of course not mean that all stochastic environments profit from relational reasoning. What these experiments indicate is that tracking cannot be expected to profit from relational reasoning by default in any environment, but only in environments which feature (potentially non-deterministic) dynamics and predictable interactions.

5 Relational Reasoning in Real-World Tracking

Figure 6: Camera blackout experiment on a street scene from the MOTChallenge dataset with strong ego-motion. Solid boxes are mohart predictions (for ), faded bounding boxes indicate object locations in the first frame. As the model is trained end-to-end, mohart learns to fall back onto its internal motion model if no new observations are available (black frames). As soon as new observations come in, the model ’snaps’ back onto the tracked objects.

Having established that mohart is capable of performing complex relational reasoning, we now test the algorithm on three real-world datasets and analyse the effects of relational reasoning on performance depending on dataset and task. We find consistent improvements of mohart compared to hart throughout. Relational reasoning yields particularly high gains for scenes with ego-motion, crowded scenes, and simulated faulty sensor inputs.

5.1 Experimental Details

We investigate three qualitatively different datasets: the MOTChallenge dataset (MOT16), the UA-DETRAC dataset (Wen15), and the Stanford Drone dataset (DroneDataset). To increase scene dynamics and make the tracking/prediction problems more challenging, we sub-sample some of the high framerate scenes with a stride of two, resulting in scenes with 7-15 frames per second. Training and architecture details are given in Appendices B and A. We conduct experiments in three different modes:

Tracking. The model is initialised with the ground-truth bounding boxes for a set of objects in the first frame. It then consecutively sees the following frames and predicts the bounding boxes. The sequence length is 30 time steps and the performance is measured as intersection-over-union (IoU) averaged over the entire sequence excluding the first frame. This algorithm is either applied to the entire dataset or subsets of it to study the influence of certain properties of the data.

Camera Blackout. This simulates unsteady or faulty sensor inputs. The setup is the same as in Tracking, but sub-sequences of the input are replaced with black images. The algorithm is expected to recognise that no new information is available and that it should resort to its internal motion model.

Prediction. Testing mohart’s ability to capture motion patterns, only two frames are shown to the model followed by three black frames. IoU is measured separately for each time step.
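The inputs for the Camera Blackout and Prediction modes can be constructed with a small helper. This is a sketch under assumptions: frames are represented as nested lists of pixel values, black is taken to be the value 0, and the function names and the choice of which sub-sequence to black out are illustrative rather than taken from the original code.

```python
def black_frame_like(frame):
    """Return an all-zero frame with the same shape as `frame`."""
    return [[0 for _ in row] for row in frame]


def black_out(frames, start, stop):
    """Return a copy of `frames` with frames[start:stop] replaced by black images."""
    return [
        black_frame_like(frame) if start <= t < stop else [list(row) for row in frame]
        for t, frame in enumerate(frames)
    ]


def prediction_input(frames, num_observed=2, num_black=3):
    """Keep the first observed frames and follow them with black frames."""
    clip = frames[:num_observed + num_black]
    return black_out(clip, num_observed, len(clip))
```

The defaults of `prediction_input` mirror the Prediction mode described above (two observed frames followed by three black frames); the ground-truth boxes for the blacked-out steps are still needed to measure IoU per time step.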


Method | Entire Dataset | Only Ego-Motion | No Ego-Motion | Crowded Scenes | Camera Blackout
--- | --- | --- | --- | --- | ---
MOHART | 68.5% | 66.9% | 64.7% | 69.1% | 63.6%
HART | 66.6% | 64.0% | 62.9% | 66.9% | 60.6%
Δ | 1.9% | 2.9% | 1.8% | 2.2% | 3.0%

Table 1: Tracking performance on the MOTChallenge dataset measured in IoU.
Method | All | Crowded Scenes | Camera Blackout
--- | --- | --- | ---
MOHART | 68.1% | 69.5% | 64.2%
HART | 68.4% | 68.6% | 53.8%
Δ | -0.3% | 0.9% | 0.4%

Table 2: UA-DETRAC Dataset
Method | All | Camera Blackout | Camera Blackout (Bikes)
--- | --- | --- | ---
MOHART | 57.3% | 53.3% | 53.3%
HART | 56.1% | 52.6% | 50.7%
Δ | 1.2% | 0.7% | 2.6%

Table 3: Stanford Drone Data

5.2 Results and Analysis

On the MOTChallenge dataset, HART achieves 66.6% IoU (see Table 1), which in itself is impressive given the small amount of training data (only 5225 training frames) and no pre-training. MOHART achieves 68.5% (both numbers are averaged over 5 runs and were compared with an independent-samples t-test). The performance gain increases when only considering ego-motion data. This is readily explained: movements of objects in the image space due to ego-motion are correlated and can therefore be better understood when combining information from movements of multiple objects, i.e. performing relational reasoning. In another ablation, we filtered for only crowded scenes by requiring five objects to be present for, on average, 90% of the frames in a sub-sequence. For the MOTChallenge dataset, this only leads to a minor increase of the performance gain of MOHART, indicating that the dataset exhibits a sufficient density of objects to learn interactions. The biggest benefit from relational reasoning can be observed in the camera blackout experiments (setup explained in Section 5.1). Both HART and MOHART learn to rely on their internal motion models when confronted with black frames and propagate the bounding boxes according to the previous movement of the objects. It is unsurprising that this scenario profits particularly from relational reasoning. Qualitative tracking and camera blackout results are shown in Figure 6 and Appendix E.

Tracking performance on the UA-DETRAC dataset only profits from relational reasoning when filtering for crowded scenes (see Table 2). The fact that the performance of MOHART is slightly worse on the vanilla dataset (-0.3%) can be explained by more overfitting. In HART, as there is no exchange between the trackers of the individual objects, each object constitutes an independent training sample.

The Stanford Drone dataset (see Table 3) is different from the other two: it is filmed from a bird's-eye view. The scenes are more crowded and each object covers a small number of pixels, rendering it a difficult problem for tracking. The dataset was designed for trajectory prediction, a setup where an algorithm is typically provided with ground-truth tracklets in coordinate space and potentially an image as context information. The task is then to extrapolate these tracklets into the future. The performance gain of MOHART on the camera blackout experiments is particularly strong when only considering cyclists.

In the prediction experiments (see Appendix D), mohart consistently outperforms hart. On both datasets, the model outperforms a baseline which uses momentum to linearly extrapolate the bounding boxes from the first two frames. This shows that even from just two frames, the model learns to capture motion models which are more complex than what could be observed from just the bounding boxes (i.e. momentum), suggesting that it uses visual information (hart & mohart) as well as relational reasoning (mohart).
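The momentum baseline mentioned above can be sketched as a constant-velocity extrapolation. The details are assumptions made for illustration: boxes are taken as corner coordinates `(x_min, y_min, x_max, y_max)`, all four coordinates are extrapolated independently, and the function name is hypothetical.

```python
def extrapolate_boxes(first_box, second_box, num_steps):
    """Linearly extrapolate a bounding box from two observations.

    Boxes are (x_min, y_min, x_max, y_max); the per-step velocity is the
    coordinate-wise difference between the two observed boxes.
    """
    velocity = [b - a for a, b in zip(first_box, second_box)]
    return [
        tuple(c + step * v for c, v in zip(second_box, velocity))
        for step in range(1, num_steps + 1)
    ]
```

Such a baseline uses only the two observed boxes, so any improvement over it has to come from information that is not contained in the box coordinates alone.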

5.3 Visual Object Tracking Performance

We compare our method to SOTA visual object trackers on the MOTChallenge dataset. While the evaluation is identical for all methods (see Appendix A), the training data differs. MOHART was trained on the MOTChallenge dataset (excluding the evaluation frames), while the other methods were trained on different datasets, with three orders of magnitude more data in total, but not on any part of the MOTChallenge. Fine-tuning the baselines on the target dataset is therefore likely to improve their performance. Table 4 shows that MOHART performs worse than, but comparably to, previous SOTA models (SiamRPNpp; ECO; DIMP). It is, however, significantly outperformed by the concurrently developed Siam R-CNN (SiamRCNN).


Method | MOHART | SiamRPN++ (SiamRPNpp) | ECO (ECO) | DiMP-50 (DIMP) | Siam R-CNN (SiamRCNN)
--- | --- | --- | --- | --- | ---
IoU | 72.1% | 75.8% | 74.6% | 76.2% | 85.1%

Table 4: Comparison to non-end-to-end SOTA algorithms for VOT on the MOTChallenge dataset. Here, the original framerate was used, leading to higher performance than in Table 1.

6 Conclusion

With mohart, we introduce an end-to-end multi-object tracker that is capable of capturing complex interactions and leveraging these for precise predictions, as experiments on both toy and real-world data show. However, the experiments also show that the benefit of relational reasoning strongly depends on the nature of the data. In particular, the toy experiments showed that in an entirely deterministic world, relational reasoning was much less important than in a stochastic environment. Amongst the real-world datasets, the highest performance gains from relational reasoning were achieved on the MOTChallenge dataset, which features crowded scenes, ego-motion and occlusions. Compared to state-of-the-art visual object trackers, mohart achieves inferior but comparable performance. The relational reasoning toy experiments, the camera blackout and prediction experiments, and the training on a fraction of the data show the flexibility and potential of our approach compared to the non-end-to-end (see footnote 4) state-of-the-art visual object trackers. We see two potential routes for further improving the performance of mohart in the future: (1) leveraging pre-training on other datasets (this point might seem trivial, but our initial experiments demonstrated no performance improvements) and (2) incorporating external detections, which would also allow for a fair comparison against tracking-by-detection methods.


We thank Stefan Saftescu for his contributions, particularly for integrating the Stanford Drone Dataset. We thank Jonathon Luiten for providing baselines on the MOTChallenge dataset and Adam Golinski as well as Stefan Saftescu for proof-reading. This research was funded by the EPSRC AIMS Centre for Doctoral Training at Oxford University and an EPSRC Programme Grant (EP/M019918/1). We acknowledge use of Hartree Centre resources in this work. The STFC Hartree Centre is a research collaboratory in association with IBM providing High Performance Computing platforms funded by the UK’s investment in e-Infrastructure. The Centre aims to develop and demonstrate next generation software, optimised to take advantage of the move towards exa-scale computing.


Appendix A Experimental Details

The MOTChallenge and UA-DETRAC datasets discussed in this section are intended to be used as a benchmark suite for multi-object tracking in a tracking-by-detection paradigm. Therefore, ground-truth bounding boxes are only available for the training datasets. The user is encouraged to upload their model, which performs tracking in a data association paradigm leveraging the provided bounding box proposals from an external object detector. As we are interested in a different analysis (IoU given initial bounding boxes), we divide the training data further into training and test sequences. To make up for the smaller training data, we extend the MOTChallenge 2017 dataset with three sequences from the 2015 dataset (ETH-Sunnyday, PETS09-S2L1, ETH-Bahnhof). We use the first 70% of the frames of each of the ten sequences for training and the rest for testing. Sequences with high frame rates (30Hz) are sub-sampled with a stride of two. For the UA-DETRAC dataset, we split the 60 available sequences into 44 training sequences and 16 test sequences. For the considerably larger Stanford Drone dataset, we took three videos of the scene deathCircle for training and the remaining two videos from the same scene for testing. The videos of the drone dataset were also sub-sampled with a stride of two to increase scene dynamics.
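The per-sequence split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_sequence`, the integer-percentage argument, and the choice to sub-sample before splitting are all assumptions, and the 600-frame sequence is hypothetical.

```python
def split_sequence(num_frames, train_percent=70, stride=1):
    """Return (train_indices, test_indices) for a single video sequence.

    Assumption: frames are sub-sampled with `stride` first, and the first
    `train_percent` percent of the remaining frames are used for training.
    """
    kept = list(range(0, num_frames, stride))
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = (len(kept) * train_percent) // 100
    return kept[:cut], kept[cut:]


# Hypothetical 30 Hz sequence with 600 frames, sub-sampled with a stride of two.
train_idx, test_idx = split_sequence(600, train_percent=70, stride=2)
print(len(train_idx), len(test_idx))
```

Splitting along the time axis keeps training and test frames of a sequence contiguous, so no test frame is interleaved with training frames.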

For Section 5.3, all methods were evaluated on the first 30 frames of sequences ’02’, ’05’, ’09’, ’11’ of the MOTChallenge dataset using all objects visible in the first frame. We used the original framerate of the dataset. For details about architecture and training set-up of the baselines, please see SiamRCNN.
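For reference, the intersection-over-union metric used in this evaluation can be computed as in the sketch below. The corner-based box format `(x_min, y_min, x_max, y_max)` and the plain average over all objects and frames in `mean_iou` are assumptions made for illustration; the text does not specify how scores are aggregated.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pred_tracks, gt_tracks, num_frames=30):
    """Average IoU over all objects and the first `num_frames` frames.

    Each track is a list of boxes, one per frame (assumed layout).
    """
    scores = [
        iou(p, g)
        for pred, gt in zip(pred_tracks, gt_tracks)
        for p, g in zip(pred[:num_frames], gt[:num_frames])
    ]
    return sum(scores) / len(scores) if scores else 0.0
```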

Appendix B Architecture Details

The architecture details were chosen to optimise hart performance on the MOTChallenge dataset. They deviate from the original hart implementation (Kosiorek17) as follows. A presence variable predicts whether an object is in the scene and successfully tracked; it is trained with a binary cross-entropy loss. The maximum number of objects tracked simultaneously was set to 5 for the UA-DETRAC and MOTChallenge datasets and to 10 for the more crowded Stanford Drone dataset. The feature extractor, which converts the initial glimpse into a feature representation, is a three-layer convolutional network with a kernel size of 5, a stride of 2 in the first and last layers, 32 channels in the first two layers, 64 channels in the last layer, ELU activations, and skip connections. It is followed by a fully connected layer with a 128-dimensional output and an ELU activation. The spatial attention parameters are linearly projected onto 128 dimensions and added to this feature representation, serving as a positional encoding. The LSTM has a hidden state size of 128. The self-attention unit in mohart linearly projects its inputs to dimensionality 128 for each of keys, queries and values. For the real-world experiments, the hidden states from the previous LSTM step are fed as an additional input alongside the features extracted from the glimpse, by concatenating them with the features. In all cases, the output of the attention module is concatenated to the input features of the respective object.
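To make the relational step concrete, the sketch below implements a single attention head over a set of per-object feature vectors and concatenates the result to each object's input features. It is an illustration under assumptions rather than the mohart implementation: the text describes multi-headed attention with 128-dimensional projections, whereas this sketch uses one head, toy dimensions, random weights, and the common scaling by the square root of the key dimension. All function names are hypothetical.

```python
import math
import random


def linear(x, w):
    """Multiply a feature vector x (length d_in) by a matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features, w_q, w_k, w_v):
    """Single-head dot-product self-attention over a set of objects.

    Returns, per object, the input features concatenated with the
    attention output.
    """
    queries = [linear(f, w_q) for f in features]
    keys = [linear(f, w_k) for f in features]
    values = [linear(f, w_v) for f in features]
    scale = math.sqrt(len(keys[0]))  # assumed scaling, not stated in the text
    outputs = []
    for f, q in zip(features, queries):
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        attended = [sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))]
        outputs.append(f + attended)  # list concatenation
    return outputs


def random_matrix(rng, d_in, d_out):
    """Random projection matrix standing in for learned weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(d_out)] for _ in range(d_in)]


rng = random.Random(0)
d_in, d_att = 8, 4  # toy sizes; the text uses 128-dimensional projections
w_q, w_k, w_v = (random_matrix(rng, d_in, d_att) for _ in range(3))
objects = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(5)]
out = self_attention(objects, w_q, w_k, w_v)
print(len(out), len(out[0]))
```

Because every object attends to the whole set through a weighted sum, reordering the input objects only reorders the outputs, which is the permutation property the relational module relies on.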

As an optimizer, we used RMSProp with momentum set to and learning rate . For the MOTChallenge dataset and the UA-DETRAC dataset, the models were trained for 100,000 iterations of batch size 10 and the reported IoU is exponentially smoothed over iterations to achieve lower variance. For the Stanford Drone dataset, the batch size was increased to 32, reducing time to convergence and hence model training to 50,000 iterations.
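The exponential smoothing of the reported IoU can be sketched as below. The smoothing factor `alpha` is a placeholder chosen for illustration, since the text does not state the value that was used, and the function name is hypothetical.

```python
def exponential_smoothing(values, alpha=0.1):
    """Exponentially smooth a sequence of per-iteration scores.

    `alpha` weights the newest value; the default is an assumed placeholder.
    """
    smoothed = []
    current = None
    for v in values:
        current = v if current is None else alpha * v + (1.0 - alpha) * current
        smoothed.append(current)
    return smoothed


# Hypothetical noisy IoU readings from consecutive evaluation steps.
print(exponential_smoothing([0.50, 0.70, 0.40, 0.65], alpha=0.5))
```

A smaller `alpha` averages over more iterations and therefore lowers the variance of the reported number at the cost of reacting more slowly to changes.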

Appendix C Bouncing Balls

The toy domain consists of a square two-dimensional box filled with bouncing balls. The balls move and can collide with each other with approximated elastic collisions (energy and momentum conservation). Additionally, balls exert either repulsive force (first experiment) or repulsive/attractive force (second experiment, colour-coded), which scales with , being the distance between centres of the balls.
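The collision rule can be illustrated with the following sketch for two balls. It assumes equal masses and an instantaneous, perfectly elastic contact, neither of which is stated in the text, and it leaves out the forces between balls. The function name and argument layout are hypothetical.

```python
def elastic_collision(p1, v1, p2, v2):
    """Velocities of two equal-mass balls after an elastic collision in 2D.

    p1, p2 are centre positions and v1, v2 velocities, each as (x, y).
    The velocity components along the line connecting the centres are
    exchanged, which conserves both momentum and kinetic energy.
    """
    nx, ny = p2[0] - p1[0], p2[1] - p1[1]
    dist2 = nx * nx + ny * ny
    if dist2 == 0:
        return v1, v2
    # Relative velocity projected onto the line between the centres.
    rel = ((v1[0] - v2[0]) * nx + (v1[1] - v2[1]) * ny) / dist2
    if rel <= 0:
        return v1, v2  # balls are already moving apart
    new_v1 = (v1[0] - rel * nx, v1[1] - rel * ny)
    new_v2 = (v2[0] + rel * nx, v2[1] + rel * ny)
    return new_v1, new_v2


# Oblique collision between two balls touching along the diagonal.
print(elastic_collision((0.0, 0.0), (1.0, 0.5), (1.0, 1.0), (-0.5, 0.0)))
```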

Appendix D Prediction Experiments

(a) Prediction results on the MOTChallenge dataset MOT16.
(b) Prediction results on the UA-DETRAC dataset (crowded scenes only) Wen15.
Figure 7: Peeking into the future. Only the first two frames are shown to the tracking algorithm followed by three black frames. mohart learns to fall back on its internal motion model when no observation (i.e. only a black frame) is available. The reported IoU scores show the performance for the respective frames 0, 1, 2, and 3 time steps into the future.

In the prediction experiments (see Figure 7), mohart consistently outperforms hart. On both datasets, the model outperforms a baseline which uses momentum to linearly extrapolate the bounding boxes from the first two frames. This shows that even from just two frames, the model learns to capture motion models which are more complex than what could be observed from just the bounding boxes (i.e. momentum), suggesting that it uses visual information (hart & mohart) as well as relational reasoning (mohart). The strong performance gain of mohart compared to hart on the UA-DETRAC dataset, despite the small differences for tracking on this dataset, can be explained as follows: this dataset features few interactions but strong correlations in motion. Hence, when only the first two frames are available, mohart profits from estimating the velocities of multiple cars simultaneously.
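The momentum baseline mentioned above amounts to constant-velocity extrapolation of the box coordinates. A minimal sketch follows; the box parametrisation `(x_min, y_min, x_max, y_max)` and the function name are assumptions, and the baseline in the experiments may parametrise boxes differently.

```python
def extrapolate_boxes(box_t0, box_t1, num_steps=3):
    """Linearly extrapolate a bounding box from two observed frames.

    The per-coordinate difference between the two frames is treated as a
    constant velocity and applied for `num_steps` future frames.
    """
    velocity = [b1 - b0 for b0, b1 in zip(box_t0, box_t1)]
    return [
        [b1 + k * v for b1, v in zip(box_t1, velocity)]
        for k in range(1, num_steps + 1)
    ]


# Hypothetical box moving one pixel to the right per frame.
for box in extrapolate_boxes([0, 0, 2, 2], [1, 0, 3, 2], num_steps=3):
    print(box)
```

Since this baseline sees only box coordinates, any improvement over it has to come from information that the coordinates alone do not carry.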

Appendix E Qualitative Tracking Results

Figure 8: Tracking examples of both hart and mohart. Coloured boxes are bounding boxes predicted by the model, arrows point at challenging aspects of the scenes. (A) & (C): Each person being tracked is temporarily occluded by a woman walking across the scene (blue arrows). mohart, which includes a relational reasoning module, handles this more robustly (compare red arrows).
Figure 9: Camera blackout experiment on a pedestrian street scene from the MOTChallenge dataset without ego-motion. Subsequent frames are displayed going from top left to bottom right. Shown are the inputs to the model (some of them being black frames, i.e. arrays of zeroes) and bounding boxes predicted by MOHART (coloured boxes). This scene is particularly challenging as occlusion and missing sensor input coincide (fourth row).

In Section 5, we tested mohart on three different real world data sets and in a number of different setups. Figure 8 shows qualitative results both for hart and mohart on the MOTChallenge dataset.

Furthermore, we conducted a set of camera blackout experiments to test mohart's capability of dealing with faulty sensor inputs. While traditional pipeline methods require careful consideration of different types of corner cases to properly handle erroneous sensor inputs, mohart is able to capture these automatically, especially when confronted with similar issues in the training scenarios. To simulate this, we replace subsequences of the images with black frames. Figure 9 and Figure 6 show two such examples from the test data together with the model's prediction. mohart learns not to update its internal model when confronted with black frames and instead uses the LSTM to propagate the bounding boxes. When proper sensor input is available again, the model uses this to make a rapid adjustment to its predicted location and 'snap' back onto the object. This works remarkably well in the presence of both occlusion (Figure 9) and ego-motion (Figure 6). The results in Table 3 show that the benefit of relational reasoning is particularly high in these scenarios. These experiments can also be seen as a proof of concept of mohart's capability to predict future trajectories, and of how this profits from relational reasoning.
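The blackout simulation can be sketched as follows. Frames are represented as nested lists of pixel intensities purely for illustration; the function name, the single-channel layout, and the way the blacked-out subsequence is chosen are assumptions.

```python
def black_out(frames, start, length):
    """Return a copy of `frames` with frames[start:start + length] set to zero.

    Each frame is a list of rows of pixel values (assumed single channel).
    The input sequence is left unmodified.
    """
    result = []
    for t, frame in enumerate(frames):
        if start <= t < start + length:
            result.append([[0 for _ in row] for row in frame])
        else:
            result.append([list(row) for row in frame])
    return result


# Hypothetical four-frame clip of 2x2 images with frames 1 and 2 blacked out.
clip = [[[1, 2], [3, 4]] for _ in range(4)]
for frame in black_out(clip, start=1, length=2):
    print(frame)
```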


  1. This is an extended version of the paper "Permutation Invariance and Relational Reasoning in Multi-Object Tracking" presented at the Sets and Partitions Workshop at the 33rd Conference on Neural Information Processing Systems, Vancouver 2019
  2. We can use either a ground-truth bounding-box or one provided by an external detector; the only requirement is that it contains the object of interest.
  3. as specified in Tab. 10 of SiamRCNN; up to million frames compared to 8k used for mohart
  4. end-to-end meaning gradients are propagated through all parts of the model and across the time axis