A Labeled Random Finite Set Online Multi-Object Tracker for Video Data

Du Yong Kim, Ba-Ngu Vo, and Ba-Tuong Vo

This paper proposes an online multi-object tracking algorithm for image observations using a top-down Bayesian formulation that seamlessly integrates state estimation, track management, clutter rejection, occlusion and mis-detection handling into a single recursion. This is achieved by modeling the multi-object state as a labeled random finite set and using the Bayes recursion to propagate the multi-object filtering density forward in time. The proposed filter updates tracks with detections but switches to image data when mis-detection occurs, thereby exploiting the efficiency of detection data and the accuracy of image data. Furthermore, the labeled random finite set framework enables the incorporation of prior knowledge that mis-detections in the middle of the scene are likely to be due to occlusions. Such prior knowledge can be exploited to improve occlusion handling, especially for long occlusions that can lead to premature track termination in online multi-object tracking. Tracking performance is compared to state-of-the-art algorithms on synthetic data and well-known benchmark video datasets.


online multi-object tracking, Track-before-detect, random finite set

I Introduction

In a multiple object setting, not only do the states of the objects vary with time, but the number of objects also changes due to objects appearing and disappearing. In this work, we consider the problem of jointly estimating the time-varying number of objects and their trajectories from a stream of noisy images. In particular, we are interested in multi-object tracking (MOT) solutions that compute estimates at a given time using only data up to that time. These so-called online solutions are better suited for time-critical applications.

A critical function of a multi-object tracker is track management, which concerns track initiation/termination and track labeling or identifying trajectories of individual objects. Track management is more challenging for online algorithms than for batch algorithms. Usually, track initiation/termination in online MOT algorithms is performed by examining consecutive detections in time [1], [2]. However, false positives generated by the background, compounded by false negatives from object occlusions and mis-detections, can result in false tracks and lost tracks, especially in online algorithms. False negatives also cause track fragmentation in batch algorithms as reported in [3], [4], [5]. With the exception of the recent network flow techniques [6], track labels are assigned upon track initiation, and maintained over time until termination. An online multi-object Bayesian filter that provides systematic track labeling using a labeled random finite set (RFS) was proposed in [7].

In most video MOT approaches, each image in the data sequence is compressed into a set of detections before a filtering operation is applied to keep track of the objects (including undetected ones) [2, 8]. Typically, in the filtering module, motion correspondence or data association is first determined followed by the application of standard filtering techniques such as Kalman or sequential Monte Carlo [8]. The main advantage of performing detection before filtering is the computational efficiency in the compression of images into relevant detections. The main disadvantage is the loss of information, in addition to mis-detection and false alarms, especially in low signal to noise ratio (SNR) applications.

Track-before-detect (TBD) is an alternative approach, which by-passes the detection module and exploits the spatio-temporal information directly from the image sequence. The TBD methodology is often required in tracking applications for low SNR image data [9], [10], [11], [12], [13], [14]. In visual tracking applications, perhaps the most well-known TBD MOT algorithm is BraMBLe [15]. Other visual MOT algorithms that can be categorized as TBD include [16], [17], which exploit color-based observation models, [18], [2], [19], which exploit multi-modality of distributions, and [20], [21], which use multi-Bernoulli random finite set models. While the TBD approach minimizes information loss, it is computationally more expensive. A balance between tractability and fidelity is important in the design of the measurement model.

In this paper, we present an efficient online MOT algorithm for video data that exploits the advantages of both detection-based and TBD approaches to improve performance while reducing the computational cost. The proposed MOT filter updates the tracks adaptively with detections for efficiency, or with local regions on the image to minimize information loss. In addition it seamlessly integrates state estimation, track management, clutter rejection, mis-detection and occlusion handling in a single Bayesian recursion.
Specifically, using the RFS framework we propose a hybrid multi-object likelihood function that accommodates both detections and image observations, thereby generalizing the standard multi-object likelihood [22] and the separable likelihood for image data [10]. Further, we establish conjugacy of the Generalized Labeled Multi-Bernoulli (GLMB) distributions with respect to the proposed likelihood function, which then yields an analytic solution to the multi-object Bayes recursion since the GLMB family is closed under the Chapman-Kolmogorov equation. The proposed MOT filter exploits the efficiency of the detection-based approach, which avoids updating with the entire image, while at the same time exploiting relevant information at the image level by using only small regions of the image where mis-detected objects are expected.

The labeled RFS formulation [7] addresses state estimation, track management, clutter rejection, mis-detection and occlusion handling, in one single Bayes recursion. Generally, an online MOT algorithm would terminate a track that has not been detected over several frames. In many video MOT applications however, it is observed that away from designated exit regions such as scene edges, the longer an object is in the scene, the less likely it is to disappear. Intuitively, this observation can be used to delay the termination of tracks that have been occluded over an extended period, so as to improve occlusion handling. The use of labeled RFS in our proposed filter provides a principled and inexpensive means to exploit this observation for improved occlusion handling.

The remainder of the paper is structured as follows. The Bayesian filtering formulation of the MOT problem using labeled RFS is given in Section II, followed by details of the proposed solution in Section III. Performance evaluation of the proposed MOT filter against state-of-the-art trackers is presented in Section IV, and concluding remarks are given in Section V.

Fig. 1: 1D multi-object trajectories with labeling

II Bayesian Multiple Object Tracking

This section outlines the RFS framework for MOT that accommodates uncertainty in the number of objects, the states of the objects and their trajectories. The salient feature of this framework is that it admits direct parallels between traditional Bayesian filtering and MOT. The modeling of the multi-object state as an RFS in Subsection II-A enables Bayesian filtering concepts to be directly translated to the multi-object case in Subsection II-B. Subsection II-C examines the MOT problem in the presence of occlusion.

II-A Multi-object State

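To distinguish different object trajectories in a multi-object setting, each object is assigned a unique label $\ell_k$ that consists of an ordered pair $(t,i)$, where $t$ is the time of birth and $i$ is the index of individual objects born at the same time [7]. For example, if two objects appear in the scene at time 1, one is assigned label (1,1) while the other is assigned label (1,2), see Fig. 1. A trajectory or track is the sequence of states with the same label.
Formally, the state of an object at time $k$ is a vector $\mathbf{x}_k=(x_k,\ell_k)\in\mathbb{X}\times\mathbb{L}_k$, where $\mathbb{L}_k$ denotes the label space for objects at time $k$ (including those born prior to $k$). Note that $\mathbb{L}_k$ is given by $\mathbb{B}_k\cup\mathbb{L}_{k-1}$, where $\mathbb{B}_k$ denotes the label space for objects born at time $k$ (and is disjoint from $\mathbb{L}_{k-1}$). Suppose that there are $N_k$ objects at time $k$, with states $\mathbf{x}_{k,1},\dots,\mathbf{x}_{k,N_k}$. In the context of MOT, the collection of states, referred to as the multi-object state, is naturally represented as a finite set

$$\mathbf{X}_k=\{\mathbf{x}_{k,1},\dots,\mathbf{x}_{k,N_k}\}\in\mathcal{F}(\mathbb{X}\times\mathbb{L}_k),$$

where $\mathcal{F}(\mathbb{X}\times\mathbb{L}_k)$ denotes the space of finite subsets of $\mathbb{X}\times\mathbb{L}_k$. We denote the cardinality (number of elements) of $\mathbf{X}$ by $|\mathbf{X}|$ and the set of labels of $\mathbf{X}$, $\{\ell:(x,\ell)\in\mathbf{X}\}$, by $\mathcal{L}(\mathbf{X})$. Note that since the label is unique, no two objects have the same label, i.e. $\delta_{|\mathbf{X}|}[|\mathcal{L}(\mathbf{X})|]=1$. Hence $\Delta(\mathbf{X})\triangleq\delta_{|\mathbf{X}|}[|\mathcal{L}(\mathbf{X})|]$ is called the distinct label indicator.
For the rest of the paper, we follow the convention that single-object states are represented by lower-case letters (e.g. $x$, $\mathbf{x}$), while multi-object states are represented by upper-case letters (e.g. $X$, $\mathbf{X}$), symbols for labeled states and their distributions are bold-faced to distinguish them from unlabeled ones (e.g. $\mathbf{x}$, $\mathbf{X}$, $\boldsymbol{\pi}$, etc.), and spaces are represented by blackboard bold (e.g. $\mathbb{X}$, $\mathbb{Z}$, $\mathbb{L}$, $\mathbb{N}$, etc.). The list of variables $X_m,X_{m+1},\dots,X_n$ is abbreviated as $X_{m:n}$. We denote a generalization of the Kronecker delta that takes arbitrary arguments such as sets, vectors, integers etc., by

$$\delta_Y[X]\triangleq\begin{cases}1, & \text{if } X=Y,\\ 0, & \text{otherwise.}\end{cases}$$

For a given set $S$, $1_S(\cdot)$ denotes the indicator function of $S$, and $\mathcal{F}(S)$ denotes the class of finite subsets of $S$. For a finite set $X$, the multi-object exponential notation $f^X$ denotes the product $\prod_{x\in X}f(x)$, with $f^{\emptyset}=1$. The inner product $\int f(x)g(x)\,dx$ is denoted by $\langle f,g\rangle$.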
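
As an illustration of the notation in this subsection, the following Python sketch represents a labeled multi-object state as a list of (state, label) pairs and evaluates the distinct label indicator and the multi-object exponential. The function names and the toy states are assumptions made for this example only; they are not taken from any reference implementation.

```python
from math import prod


def labels(X):
    """Set of labels of a labeled multi-object state X, given as (state, label) pairs."""
    return {label for _, label in X}


def distinct_label_indicator(X):
    """Return 1 if all labels in X are distinct, 0 otherwise."""
    X = list(X)
    return int(len(labels(X)) == len(X))


def multi_object_exponential(f, X):
    """Product of f over the elements of X; equals 1 for the empty set."""
    return prod(f(x) for x in X)


# Hypothetical example: two objects born at time 1 and one born at time 3.
# Each label is the pair (time of birth, index among objects born at that time).
X = [((0.5, 1.0), (1, 1)), ((2.0, -1.0), (1, 2)), ((4.0, 0.0), (3, 1))]
```

With this representation, a track is recovered by collecting, over time, the states that carry the same label.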

II-B Multi-object Bayes filter

From a Bayesian estimation viewpoint the multi-object state is naturally modeled as an RFS or a simple-finite point process [23]. While the space $\mathcal{F}(\mathbb{X}\times\mathbb{L}_k)$ does not inherit the Euclidean notion of probability density, Mahler’s Finite Set Statistics (FISST) provides a suitable notion of integration/density for RFSs [22, 24]. This approach is mathematically consistent with measure theoretic integration/density but circumvents measure theoretic constructs [25].
At time $k$, the multi-object state $\mathbf{X}_k$ is observed as an image $y_k$. All information on the set of object trajectories conditioned on the observation history $y_{1:k}$ is captured in the multi-object posterior density

$$\boldsymbol{\pi}_{0:k}(\mathbf{X}_{0:k}\mid y_{1:k})\propto\boldsymbol{\pi}_0(\mathbf{X}_0)\prod_{j=1}^{k}g_j(y_j\mid\mathbf{X}_j)\,\mathbf{f}_{j|j-1}(\mathbf{X}_j\mid\mathbf{X}_{j-1}),$$

where $\boldsymbol{\pi}_0$ is the initial prior, $g_j(\cdot\mid\cdot)$ is the multi-object likelihood function at time $j$, and $\mathbf{f}_{j|j-1}(\cdot\mid\cdot)$ is the multi-object transition density to time $j$. The multi-object likelihood function encapsulates the underlying observation model while the multi-object transition density encapsulates the underlying models for motions, births and deaths of objects. Note that track management is incorporated into the Bayes recursion via the multi-object state with distinct labels.
MCMC approximations of the posterior density have been proposed in [26, 27] for detection measurements and image measurements respectively. Results on satellite imaging applications reported in [27] are very impressive. However, these techniques are still expensive and not suitable for on-line application.
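For real-time tracking, a more tractable alternative is the multi-object filtering density, a marginal of the multi-object posterior. For notational compactness, herein we omit the dependence on data in the multi-object densities. The multi-object filtering density can be recursively propagated by the multi-object Bayes filter [23], [28] according to the following prediction and update

$$\boldsymbol{\pi}_{k+1|k}(\mathbf{X}_{k+1})=\int\mathbf{f}_{k+1|k}(\mathbf{X}_{k+1}\mid\mathbf{X}_k)\,\boldsymbol{\pi}_k(\mathbf{X}_k)\,\delta\mathbf{X}_k, \tag{1}$$

$$\boldsymbol{\pi}_{k+1}(\mathbf{X}_{k+1})=\frac{g_{k+1}(y_{k+1}\mid\mathbf{X}_{k+1})\,\boldsymbol{\pi}_{k+1|k}(\mathbf{X}_{k+1})}{\int g_{k+1}(y_{k+1}\mid\mathbf{X})\,\boldsymbol{\pi}_{k+1|k}(\mathbf{X})\,\delta\mathbf{X}}, \tag{2}$$

where the integral is a set integral defined for any function $\mathbf{f}:\mathcal{F}(\mathbb{X}\times\mathbb{L}_k)\to\mathbb{R}$ by

$$\int\mathbf{f}(\mathbf{X})\,\delta\mathbf{X}=\sum_{i=0}^{\infty}\frac{1}{i!}\int\mathbf{f}(\{\mathbf{x}_1,\dots,\mathbf{x}_i\})\,d(\mathbf{x}_1,\dots,\mathbf{x}_i).$$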

Bayes optimal multi-object estimators can be formulated by minimizing the Bayes risk with ordinary integrals replaced by set integrals as in [24]. One such estimator is the marginal multi-object estimator [22].
A generic particle implementation of the Bayes multi-object filter (1)-(2) was proposed in [25] and applied to labeled multi-object states in [11]. The Generalized Labeled Multi-Bernoulli (GLMB) filter is an analytic solution to the Bayes multi-object filter, under the standard multi-object dynamic and observation models [7].

II-B1 Standard multi-object dynamic model

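Given the multi-object state $\mathbf{X}_k$ (at time $k$), each state $(x_k,\ell_k)\in\mathbf{X}_k$ either survives with probability $P_S(x_k,\ell_k)$ and evolves to a new state $(x_{k+1},\ell_{k+1})$ (at time $k+1$) with probability density $f_{k+1|k}(x_{k+1}\mid x_k,\ell_k)\,\delta_{\ell_k}[\ell_{k+1}]$ or dies with probability $1-P_S(x_k,\ell_k)$. The set $\mathbf{B}_{k+1}$ of new objects born at time $k+1$ is distributed according to the labeled multi-Bernoulli (LMB)

$$\Delta(\mathbf{B}_{k+1})\left[1_{\mathbb{B}_{k+1}}\,r_{B,k+1}\right]^{\mathcal{L}(\mathbf{B}_{k+1})}\left[1-r_{B,k+1}\right]^{\mathbb{B}_{k+1}\setminus\mathcal{L}(\mathbf{B}_{k+1})}\,p_{B,k+1}^{\mathbf{B}_{k+1}},$$

where $r_{B,k+1}(\ell)$ is the probability that a new object with label $\ell$ is born, and $p_{B,k+1}(\cdot,\ell)$ is the distribution of its kinematic state [7]. The multi-object state $\mathbf{X}_{k+1}$ (at time $k+1$) is the superposition of surviving objects and new born objects. It is assumed that, conditional on $\mathbf{X}_k$, objects move, appear and die independently of each other. The expression for the multi-object transition density can be found in [7, 29]. The standard multi-object dynamic model enables the Bayes multi-object filter to address motion, births and deaths of objects.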
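
The standard dynamic model can be simulated directly. The sketch below is a minimal illustration under assumed names (`predict_step`, `p_survive`, `move` and `birth_params` are hypothetical): each existing object independently survives and moves while keeping its label, or dies, and each candidate birth label appears independently with its own birth probability.

```python
import random


def predict_step(X, k, p_survive, move, birth_params, rng):
    """Sample the labeled multi-object state at time k+1 given the state X at time k.

    X            : dict mapping label -> kinematic state
    p_survive    : function (state, label) -> survival probability
    move         : function (state, rng) -> new state (single-object transition)
    birth_params : dict mapping birth index i -> (birth probability, state sampler)
    """
    X_next = {}
    for label, x in X.items():
        if rng.random() < p_survive(x, label):
            # The object survives; its label is carried over unchanged.
            X_next[label] = move(x, rng)
        # Otherwise the object dies and its label is never reused.
    for i, (r_birth, sample_state) in birth_params.items():
        if rng.random() < r_birth:
            # Independent Bernoulli birth with label (time of birth, index).
            X_next[(k + 1, i)] = sample_state(rng)
    return X_next
```

Because labels of surviving objects never change and birth labels embed the birth time, the sampled states at successive times can be linked into trajectories without any extra bookkeeping.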

II-B2 Standard multi-object observation model

In most applications a designated detection operation is applied to $y_k$ resulting in a set of points

$$Z_k=\{z_{k,1},\dots,z_{k,|Z_k|}\}\in\mathcal{F}(\mathbb{Z}).$$

Since the detection process is not perfect, false positives and false negatives are inevitable. Hence only a subset of $Z_k$ corresponds to objects in the scene (not all objects are detected) while the remainder are false positives. The most popular detection-based observation model is described in the following.
For a given multi-object state $\mathbf{X}_k$, each $(x,\ell)\in\mathbf{X}_k$ is either detected with probability $P_D(x,\ell)$ and generates a detection $z\in Z_k$ with likelihood $g_D(z\mid x,\ell)$ or missed with probability $1-P_D(x,\ell)$. The multi-object observation $Z_k$ is the superposition of the observations from detected objects and Poisson clutter with intensity $\kappa$. The ratio

$$\frac{g_D(z\mid x,\ell)}{\kappa(z)} \tag{4}$$

can be interpreted as the detection signal to noise ratio (SNR).
Assuming that, conditional on $\mathbf{X}_k$, detections are independent of each other and clutter, the multi-object likelihood function is given by [22], [7, 29]

$$g_k(Z_k\mid\mathbf{X}_k)\propto\sum_{\theta\in\Theta_k}1_{\Theta_k(\mathcal{L}(\mathbf{X}_k))}(\theta)\prod_{(x,\ell)\in\mathbf{X}_k}\psi_{Z_k}^{(\theta(\ell))}(x,\ell), \tag{5}$$

where: $\Theta_k$ is the set of positive 1-1 maps $\theta:\mathbb{L}_k\to\{0,1,\dots,|Z_k|\}$, i.e. maps such that no two distinct arguments are mapped to the same positive value; $\Theta_k(I)$ is the subset of $\Theta_k$ with domain $I$; and

$$\psi_{Z_k}^{(j)}(x,\ell)=\begin{cases}\dfrac{P_D(x,\ell)\,g_D(z_{k,j}\mid x,\ell)}{\kappa(z_{k,j})}, & j\in\{1,\dots,|Z_k|\},\\[1ex] 1-P_D(x,\ell), & j=0.\end{cases} \tag{6}$$

The map $\theta$ specifies which objects generated which detections, i.e. object $\ell$ generates detection $z_{k,\theta(\ell)}\in Z_k$, with undetected objects assigned to $0$. The positive 1-1 property means that $\theta$ is 1-1 on $\{\ell:\theta(\ell)>0\}$, the set of labels that are assigned positive values, and ensures that any detection in $Z_k$ is assigned to at most one object.
The standard multi-object observation model enables the Bayes multi-object filter to address mis-detection and false detection, but not occlusion. It assumes that objects are detected independently of each other, and that a detection cannot be assigned to more than one object. This is clearly not valid in occlusions.
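
To make the role of the association maps concrete, the following Python sketch enumerates all positive 1-1 maps for a small problem and sums the corresponding products of per-object terms (detection probability times detection SNR for an assigned detection, miss probability otherwise). It is a brute-force illustration only: the function names are hypothetical, the enumeration grows exponentially, and a practical filter would truncate this sum rather than enumerate it.

```python
from itertools import product


def positive_one_to_one_maps(labels, num_detections):
    """Yield dicts label -> j in {0, ..., num_detections} in which no
    positive value j is assigned to more than one label."""
    labels = list(labels)
    for values in product(range(num_detections + 1), repeat=len(labels)):
        positive = [j for j in values if j > 0]
        if len(positive) == len(set(positive)):
            yield dict(zip(labels, values))


def standard_likelihood(X, Z, p_detect, g_detect, clutter_intensity):
    """Unnormalised standard multi-object likelihood by exhaustive enumeration.

    X : dict mapping label -> state, Z : list of detections.
    """
    total = 0.0
    for theta in positive_one_to_one_maps(X.keys(), len(Z)):
        term = 1.0
        for label, x in X.items():
            j = theta[label]
            if j == 0:
                # Object is mis-detected.
                term *= 1.0 - p_detect(x, label)
            else:
                # Object generated detection number j (1-based).
                z = Z[j - 1]
                term *= p_detect(x, label) * g_detect(z, x, label) / clutter_intensity(z)
        total += term
    return total
```

For two objects and two detections the generator yields seven maps: the nine unconstrained assignments minus the two in which both objects claim the same detection.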

II-C Bayes Optimal Occlusion Handling

Fig. 2: Examples of partitions for five objects

By relaxing the assumption that each object is independently detected, a multi-object observation model that explicitly addresses occlusion (as well as mis-detections and false positives) was proposed in [30]. The main difference between this so-called merged-measurement model and the standard model is the idea that each group of objects (instead of each object) in the multi-object state generates at most one detection [30]. Fig. 2 shows various partitions or groupings of a multi-object state with five objects.
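A partition $\mathcal{U}_{\mathbf{X}}$ of a finite set $\mathbf{X}$ is a collection of mutually exclusive subsets of $\mathbf{X}$, whose union is $\mathbf{X}$. The collection of all partitions of $\mathbf{X}$ is denoted by $\mathcal{P}(\mathbf{X})$. It is assumed that given a partition $\mathcal{U}_{\mathbf{X}}$, each group $\mathbf{Y}\in\mathcal{U}_{\mathbf{X}}$ generates at most one detection with probability $P_D(\mathbf{Y})$, independent of other groups, and that conditional on detection $\mathbf{Y}$ generates $z$ with likelihood $g_D(z\mid\mathbf{Y})$.
Let $\mathcal{L}(\mathcal{U}_{\mathbf{X}})$ denote the collection of labels of the partition $\mathcal{U}_{\mathbf{X}}$, i.e. $\mathcal{L}(\mathcal{U}_{\mathbf{X}})=\{\mathcal{L}(\mathbf{Y}):\mathbf{Y}\in\mathcal{U}_{\mathbf{X}}\}$ (note that $\mathcal{L}(\mathcal{U}_{\mathbf{X}})$ forms a partition of $\mathcal{L}(\mathbf{X})$). Let $\Theta_k(\mathcal{L}(\mathcal{U}_{\mathbf{X}}))$ denote the class of all positive 1-1 mappings $\theta:\mathcal{L}(\mathcal{U}_{\mathbf{X}})\to\{0,1,\dots,|Z_k|\}$. Then, the likelihood that a given partition $\mathcal{U}_{\mathbf{X}}$ of a multi-object state $\mathbf{X}$ generates the detection set $Z_k$ is

$$g_k(Z_k\mid\mathcal{U}_{\mathbf{X}})\propto\sum_{\theta\in\Theta_k(\mathcal{L}(\mathcal{U}_{\mathbf{X}}))}\prod_{\mathbf{Y}\in\mathcal{U}_{\mathbf{X}}}\psi_{Z_k}^{(\theta(\mathcal{L}(\mathbf{Y})))}(\mathbf{Y}), \tag{7}$$

$$\psi_{Z_k}^{(j)}(\mathbf{Y})=\begin{cases}\dfrac{P_D(\mathbf{Y})\,g_D(z_{k,j}\mid\mathbf{Y})}{\kappa(z_{k,j})}, & j\in\{1,\dots,|Z_k|\},\\[1ex] 1-P_D(\mathbf{Y}), & j=0,\end{cases}$$

with $g_D(z_{k,j}\mid\mathbf{Y})/\kappa(z_{k,j})$ denoting the detection SNR for group $\mathbf{Y}$. The merged-measurement likelihood function is obtained by summing the group likelihoods (7) over all partitions of $\mathbf{X}$ [30]:

$$g_k(Z_k\mid\mathbf{X})\propto\sum_{\mathcal{U}_{\mathbf{X}}\in\mathcal{P}(\mathbf{X})}g_k(Z_k\mid\mathcal{U}_{\mathbf{X}}).$$

The multi-object filter (1)-(2) with merged-measurement likelihood is Bayes optimal in the sense that the filtering density contains all information on the current multi-object state in the presence of false positives, mis-detections and occlusions. Unfortunately, this filter is numerically intractable due to the sum over all partitions of the multi-object state in the merged-measurement likelihood. At present, there is no polynomial time technique for truncating sums over partitions. Moreover, given a partition, computations involving the joint detection probability $P_D(\mathbf{Y})$, joint likelihood $g_D(z\mid\mathbf{Y})$ and associated joint densities are intractable except for scenarios with a few objects.
A GLMB approximation that reduces the number of partitions using the cluster structure of the predicted multi-object state and the sensor’s resolution capabilities was proposed in [30]. Also, joint densities are approximated by products of independent densities that minimise the Kullback-Leibler divergence [12]. Case studies on MOT with bearings-only measurements show very good tracking performance. Nonetheless, at present, this filter is still computationally demanding and therefore not suitable for online MOT with image data.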
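
The combinatorial growth behind this intractability is easy to observe numerically. The sketch below (function name hypothetical) enumerates every partition of a small set of objects; the number of partitions of an n-element set is the Bell number, so five objects already give 52 groupings and eight give 4140.

```python
def partitions(items):
    """Yield every partition of a list as a list of groups (each group a list)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Insert `first` into each existing group in turn ...
        for i, group in enumerate(smaller):
            yield smaller[:i] + [[first] + group] + smaller[i + 1:]
        # ... or let `first` form a group of its own.
        yield [[first]] + smaller


# Number of partitions for 1 to 8 objects.
counts = [sum(1 for _ in partitions(list(range(n)))) for n in range(1, 9)]
```

Each partition would additionally require a sum over association maps between groups and detections, which is why the merged-measurement likelihood is only practical for scenarios with a few objects.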

III GLMB filter for tracking with image data

The GLMB filter (with the standard measurement likelihood) is a suitable candidate for online MOT [29, 31]. However, it is neither designed to handle occlusion nor image data. Even though occluded objects share the observations of the occluding objects, this situation is not permitted in the standard multi-object likelihood. Consequently, uncertainties in the states of occluded objects grow, while their existence probabilities quickly diminish to zero, leading to possible hi-jacking, and premature track termination in longer occlusions.
This section proposes an efficient GLMB filter that exploits information from image data and addresses false positives, mis-detections and occlusions. Subsection III-A extends the standard observation model to allow occluded objects to share observations at the image level while Subsection III-B incorporates, into the death model, domain knowledge that mis-detected tracks with long durations are unlikely to disappear. The GLMB filter for image data and an efficient implementation are then described in Subsections III-C and III-D.

III-A Hybrid Multi-Object Measurement Likelihood

Fig. 3: Example of shared measurement in occlusion

While the detection set $Z_k$ is an efficient compression of the image observation $y_k$, mis-detected (including occluded) objects will not be updated by the filter. On the other hand the uncompressed observation $y_k$ contains relevant information about all objects, but updating with $y_k$ is computationally expensive. Conceptually, we can have the best of both worlds by updating detected objects with the associated detections and mis-detected objects with the image observations localised to regions where these objects are expected. More importantly, this strategy exploits the fact that occluded objects share measurements with the objects occluding them as illustrated in Fig. 3.
A hybrid tractable multi-object likelihood function that accommodates both detection and image observations can be obtained as follows. For tractability, it is assumed that objects generate observations independently of each other (similar to the standard observation model).
Given an object with state $(x,\ell)$ the likelihood of observing the local image $T(y_k)$ (some transformation of the image $y_k$) is $g_T(T(y_k)\mid x,\ell)$. On the other hand, given that there are no objects, the likelihood of observing $T(y_k)$ is $g_{T,0}(T(y_k))$. The ratio

$$\frac{g_T(T(y_k)\mid x,\ell)}{g_{T,0}(T(y_k))} \tag{8}$$

can be interpreted as the image SNR (c.f. detection SNR (4)).
For a given association map $\theta$ in the likelihood function (5), an object with state $(x,\ell)$ is mis-detected if $\theta(\ell)=0$, in which case the value of $\psi_{Z_k}^{(\theta(\ell))}(x,\ell)$ is $1-P_D(x,\ell)$, the probability of a miss. Consequently, after the Bayes update, track $\ell$ has no dependence on the observation $y_k$. In order for track $\ell$ to be updated with the local image $T(y_k)$, the value of $\psi_{Z_k}^{(0)}(x,\ell)$ should be scaled by the image SNR $g_T(T(y_k)\mid x,\ell)/g_{T,0}(T(y_k))$. Note that the value of $\psi_{Z_k}^{(\theta(\ell))}(x,\ell)$ should remain unchanged for $\theta(\ell)>0$. Formally, this can be achieved by defining an extension of (6) as follows

$$\varphi_{y_k}^{(j)}(x,\ell)=\begin{cases}\dfrac{P_D(x,\ell)\,g_D(z_{k,j}\mid x,\ell)}{\kappa(z_{k,j})}, & j\in\{1,\dots,|Z_k|\},\\[1ex] \dfrac{\bigl(1-P_D(x,\ell)\bigr)\,g_T(T(y_k)\mid x,\ell)}{g_{T,0}(T(y_k))}, & j=0.\end{cases} \tag{9}$$

In other words, for $j=0$, $\varphi_{y_k}^{(j)}(x,\ell)$ is equal to the image SNR (8) scaled by the mis-detection probability, otherwise it is equal to the detection SNR (4) scaled by the detection probability.
Given a state $(x,\ell)$, $\varphi_{y_k}^{(j)}(x,\ell)$ plays the same role as $\psi_{Z_k}^{(j)}(x,\ell)$, but accommodates both detection measurements and image measurements. Moreover, since objects generate observations independently of each other, the hybrid multi-object likelihood function has the same form as (5), but with $\psi_{Z_k}^{(\theta(\ell))}$ replaced by $\varphi_{y_k}^{(\theta(\ell))}$, i.e.

$$g_k(y_k\mid\mathbf{X}_k)\propto\sum_{\theta\in\Theta_k}1_{\Theta_k(\mathcal{L}(\mathbf{X}_k))}(\theta)\prod_{(x,\ell)\in\mathbf{X}_k}\varphi_{y_k}^{(\theta(\ell))}(x,\ell). \tag{10}$$

In visual occlusions, the hybrid likelihood allows occluded objects to share the image observations of the objects that occlude them. Moreover, when integrated into the Bayes recursion (1)-(2), consideration for a track-length-dependent survival probability in combination with information from the image observation reduces uncertainties in the states of occluded objects and maintains their existence probabilities to keep the tracks alive. Hence, hi-jacking and premature track termination in longer occlusions will be avoided.

Remark: The hybrid multi-object likelihood function (10) is a generalization of both the standard multi-object likelihood and the separable likelihood in [10]. When $P_D(x,\ell)=1$ for each $(x,\ell)\in\mathbf{X}_k$, i.e. there is no mis-detection, the hybrid likelihood (10) is the same as the standard likelihood (5). On the other hand, if $P_D(x,\ell)=0$ for each $(x,\ell)\in\mathbf{X}_k$, i.e. there is no detection, then the only non-zero term in the hybrid likelihood (10) is the one with $\theta(\ell)=0$ for all $\ell\in\mathcal{L}(\mathbf{X}_k)$. In this case, the hybrid likelihood (10) reduces to the separable likelihood in [10]. For a general detection profile $P_D$, the hybrid likelihood (10) reduces to the standard likelihood (5) when $g_T(T(y_k)\mid x,\ell)=g_{T,0}(T(y_k))$ for each $(x,\ell)\in\mathbf{X}_k$.
Note that a hybrid likelihood function can also be developed for the merged-measurement model. However, the resulting multi-object filter still suffers from the same intractability as the merged-measurement filter.
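
The difference between the standard and the hybrid per-object terms can be sketched in a few lines of Python. The names `psi`, `phi` and `image_snr` are hypothetical and the callables stand in for whatever detection and image models are used; the only point illustrated is that the two terms agree for assigned detections and differ by the image SNR factor for a miss.

```python
def psi(j, x, label, Z, p_detect, g_detect, clutter):
    """Standard per-object term: miss probability for j == 0, otherwise the
    detection probability times the detection SNR of detection j (1-based)."""
    if j == 0:
        return 1.0 - p_detect(x, label)
    z = Z[j - 1]
    return p_detect(x, label) * g_detect(z, x, label) / clutter(z)


def phi(j, x, label, Z, p_detect, g_detect, clutter, image_snr):
    """Hybrid per-object term: identical to psi for j > 0; for a miss (j == 0)
    the miss probability is additionally scaled by the image SNR."""
    value = psi(j, x, label, Z, p_detect, g_detect, clutter)
    if j == 0:
        value *= image_snr(x, label)
    return value
```

Setting `image_snr` to a constant 1 recovers the standard term, which mirrors the remark that the hybrid likelihood reduces to the standard one when the local image is uninformative.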

Fig. 4: Example of scene mask for the proposed probability of survival

III-B Death model

In most video MOT applications, if an object stays in the scene for a long time, then it is more likely to continue to do so, provided it is not close to the designated exit regions. Such prior empirical knowledge can be used to improve occlusion handling, especially long occlusions that can lead to premature track termination in on-line MOT algorithms. In general, the GLMB filter would terminate an object that has not been detected over several frames. However, if this object has been in the scene for some time and is not in the proximity of designated exit regions, then it is highly likely to be occluded and track termination should be delayed. The labeled RFS formulation enables such prior information to be incorporated into track termination in a principled manner, via the survival probability.

The labeled RFS formulation accommodates survival probabilities that depend on track lengths since a labeled state contains the time of birth in its label, and the track length is simply the difference between the current time and the time of birth. In practice, it is unlikely for an object to disappear in the middle of the visual scene (even if mis-detected or occluded) whereas it is more likely to disappear near designated exit regions due to the scene structure (e.g. the borders of the scene). Hence, we require the survival probability to be large (close to one) in the middle of the scene and small (close to zero) on the edges or designated death regions. Furthermore, since objects staying in the scene for a long time are more certain to continue existing, we require the survival probability to increase to one as its track length increases.

An explicit form of the survival probability that satisfies these requirements is given by

$$P_S(x,\ell)=\frac{b(x)}{1+\exp\bigl(-\gamma\,(k-t)\bigr)}, \tag{11}$$

where $t$ is the time of birth in the label $\ell=(t,i)$, $b(x)$ is a scene mask that represents the scene structure, e.g., entrance or exit as illustrated in Fig. 4, and $\gamma$ is a control parameter of the sigmoid function. The scene mask can be learned from a set of training data or designed from the known scene structure.
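
A minimal Python sketch of a survival probability with these properties is given below. It assumes the form "scene mask value times a sigmoid in track length"; the rectangular `border_mask` and all numerical values (scene size, margin, mask levels, sigmoid gain) are hypothetical choices for illustration and are not taken from the experiments.

```python
import math


def survival_probability(position, label, k, scene_mask, gamma):
    """Scene mask value at the object's position multiplied by a sigmoid of
    the track length k - (time of birth), with gain gamma."""
    birth_time, _ = label
    track_length = k - birth_time
    return scene_mask(position) / (1.0 + math.exp(-gamma * track_length))


def border_mask(position, width=100.0, height=100.0, margin=10.0,
                inside=0.99, edge=0.1):
    """Hypothetical scene mask: high in the middle of the scene, low within
    `margin` of the image border (treated here as the exit region)."""
    x, y = position
    near_edge = (x < margin or y < margin or
                 x > width - margin or y > height - margin)
    return edge if near_edge else inside
```

With this form, a long-lived track in the middle of the scene has a survival probability close to the mask value, while a newly born track at the same position starts at half that value and a track near the border is readily terminated.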

III-C GLMB Recursion

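A GLMB density can be written in the following form

$$\boldsymbol{\pi}(\mathbf{X})=\Delta(\mathbf{X})\sum_{\xi\in\Xi}\sum_{I\in\mathcal{F}(\mathbb{L})}\omega^{(I,\xi)}\,\delta_I[\mathcal{L}(\mathbf{X})]\,\bigl[p^{(\xi)}\bigr]^{\mathbf{X}}, \tag{12}$$

where each $\xi\in\Xi$ represents a history of association maps $\xi=(\theta_{1:k})$, each $p^{(\xi)}(\cdot,\ell)$ is a probability density on $\mathbb{X}$, and each $\omega^{(I,\xi)}$ is non-negative with $\sum_{\xi\in\Xi}\sum_{I\in\mathcal{F}(\mathbb{L})}\omega^{(I,\xi)}=1$. The cardinality distribution of a GLMB is given by

$$\Pr(|\mathbf{X}|=n)=\sum_{\xi\in\Xi}\sum_{I\in\mathcal{F}(\mathbb{L})}\delta_n[|I|]\,\omega^{(I,\xi)},$$

while the existence probability and probability density of track $\ell\in\mathbb{L}$ are respectively

$$r(\ell)=\sum_{\xi\in\Xi}\sum_{I\in\mathcal{F}(\mathbb{L})}1_I(\ell)\,\omega^{(I,\xi)},\qquad p(x,\ell)=\frac{1}{r(\ell)}\sum_{\xi\in\Xi}\sum_{I\in\mathcal{F}(\mathbb{L})}1_I(\ell)\,\omega^{(I,\xi)}\,p^{(\xi)}(x,\ell).$$

Given the GLMB density (12), an intuitive multi-object estimator is the multi-Bernoulli estimator, which first determines the set of labels $L\subseteq\mathbb{L}$ with existence probabilities above a prescribed threshold, and second the MAP/mean estimates from the densities $p(\cdot,\ell)$, $\ell\in L$, for the states of the objects. A popular estimator is a suboptimal version of the Marginal Multi-object Estimator [22], which first determines the pair $(I,\xi)$ with the highest weight $\omega^{(I,\xi)}$ such that $|I|$ coincides with the MAP cardinality estimate, and second the MAP/mean estimates from $p^{(\xi)}(\cdot,\ell)$, $\ell\in I$, for the states of the objects.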
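
The statistics used by these estimators depend only on the label sets and weights of the GLMB components, so they can be sketched without any reference to the single-object densities. In the Python example below each component is a (label set, weight) pair; the function names and the three toy components are assumptions made for illustration.

```python
from collections import defaultdict


def cardinality_distribution(components):
    """Map n -> probability that the number of objects equals n."""
    dist = defaultdict(float)
    for label_set, weight in components:
        dist[len(label_set)] += weight
    return dict(dist)


def existence_probabilities(components):
    """Map label -> sum of the weights of the components containing it."""
    r = defaultdict(float)
    for label_set, weight in components:
        for label in label_set:
            r[label] += weight
    return dict(r)


def best_component(components):
    """Highest-weight component whose cardinality equals the MAP cardinality."""
    dist = cardinality_distribution(components)
    n_map = max(dist, key=dist.get)
    candidates = [c for c in components if len(c[0]) == n_map]
    return max(candidates, key=lambda c: c[1])


# Hypothetical GLMB with three components; weights sum to one.
components = [
    (frozenset({(1, 1), (1, 2)}), 0.5),
    (frozenset({(1, 1)}), 0.3),
    (frozenset({(1, 1), (2, 1)}), 0.2),
]
```

In this toy example the MAP cardinality is two, track (1,1) exists with probability one, and the suboptimal marginal multi-object estimator would report the states of tracks (1,1) and (1,2).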

The GLMB family enjoys a number of nice analytical properties. The void probability functional (a necessary and sufficient statistic) of a GLMB, the Cauchy-Schwarz divergence between two GLMBs, and the $L_1$-distance between a GLMB and its truncation can all be computed in closed form [29]. The GLMB is flexible enough to approximate any labeled RFS density with matching intensity function and cardinality distribution [12]. More importantly, the GLMB family is closed under the prediction equation (1) and conjugate with respect to the standard observation likelihood [7].
In the following we show that the GLMB family is conjugate with respect to the hybrid observation likelihood function. Hence, starting from an initial GLMB prior, all multi-object predicted and updated densities propagated by the Bayes recursion (1)-(2) are GLMBs. For notational compactness, we drop the subscript for the current time; the next time is indicated by the subscript ‘$+$’.

Proposition 1.

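Suppose that the multi-object prediction density to time $k+1$ is a GLMB of the form

$$\boldsymbol{\pi}_+(\mathbf{X}_+)=\Delta(\mathbf{X}_+)\sum_{\xi,I_+}\omega_+^{(I_+,\xi)}\,\delta_{I_+}[\mathcal{L}(\mathbf{X}_+)]\,\bigl[p_+^{(\xi)}\bigr]^{\mathbf{X}_+},$$

where $\xi\in\Xi$, $I_+\in\mathcal{F}(\mathbb{L}_+)$. Then under the hybrid observation likelihood function (10), the filtering density at time $k+1$ is a GLMB of the form

$$\boldsymbol{\pi}_{y_+}(\mathbf{X}_+)\propto\Delta(\mathbf{X}_+)\sum_{\xi,I_+,\theta_+}\omega_+^{(I_+,\xi)}\,\bigl[\bar\varphi_{y_+}^{(\xi,\theta_+)}\bigr]^{I_+}\,1_{\Theta_+(I_+)}(\theta_+)\,\delta_{I_+}[\mathcal{L}(\mathbf{X}_+)]\,\bigl[p_{y_+}^{(\xi,\theta_+)}\bigr]^{\mathbf{X}_+},$$

where $\theta_+\in\Theta_+$, and

$$\bar\varphi_{y_+}^{(\xi,\theta_+)}(\ell)=\bigl\langle p_+^{(\xi)}(\cdot,\ell),\,\varphi_{y_+}^{(\theta_+(\ell))}(\cdot,\ell)\bigr\rangle,\qquad p_{y_+}^{(\xi,\theta_+)}(x,\ell)=\frac{p_+^{(\xi)}(x,\ell)\,\varphi_{y_+}^{(\theta_+(\ell))}(x,\ell)}{\bar\varphi_{y_+}^{(\xi,\theta_+)}(\ell)}.$$

Note that the likelihood function (10) at time $k+1$ can be written as

$$g(y_+\mid\mathbf{X}_+)\propto\sum_{\theta_+\in\Theta_+}1_{\Theta_+(\mathcal{L}(\mathbf{X}_+))}(\theta_+)\,\bigl[\varphi_{y_+}^{(\theta_+)}\bigr]^{\mathbf{X}_+},$$

where $\varphi_{y_+}^{(\theta_+)}(x,\ell)\triangleq\varphi_{y_+}^{(\theta_+(\ell))}(x,\ell)$. Using Bayes rule

$$\begin{aligned}\boldsymbol{\pi}_{y_+}(\mathbf{X}_+)&\propto g(y_+\mid\mathbf{X}_+)\,\boldsymbol{\pi}_+(\mathbf{X}_+)\\&=\Delta(\mathbf{X}_+)\sum_{\xi,I_+,\theta_+}\omega_+^{(I_+,\xi)}\,1_{\Theta_+(I_+)}(\theta_+)\,\delta_{I_+}[\mathcal{L}(\mathbf{X}_+)]\,\bigl[p_+^{(\xi)}\varphi_{y_+}^{(\theta_+)}\bigr]^{\mathbf{X}_+}\\&=\Delta(\mathbf{X}_+)\sum_{\xi,I_+,\theta_+}\omega_+^{(I_+,\xi)}\,\bigl[\bar\varphi_{y_+}^{(\xi,\theta_+)}\bigr]^{I_+}\,1_{\Theta_+(I_+)}(\theta_+)\,\delta_{I_+}[\mathcal{L}(\mathbf{X}_+)]\,\bigl[p_{y_+}^{(\xi,\theta_+)}\bigr]^{\mathbf{X}_+}.\end{aligned}$$

In this work we adopt the joint prediction and update strategy [31] for the proposed video MOT GLMB filter. The same line of arguments as in [31] yields the following recursion.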

Proposition 2.

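Given the GLMB filtering density (12) at time $k$, the filtering density at time $k+1$ is:

$$\boldsymbol{\pi}_{y_+}(\mathbf{X}_+)\propto\Delta(\mathbf{X}_+)\sum_{I,\xi,I_+,\theta_+}\omega^{(I,\xi)}\,\omega_{y_+}^{(I,\xi,I_+,\theta_+)}\,\delta_{I_+}[\mathcal{L}(\mathbf{X}_+)]\,\bigl[p_{y_+}^{(\xi,\theta_+)}\bigr]^{\mathbf{X}_+}, \tag{21}$$

where $I\in\mathcal{F}(\mathbb{L})$, $\xi\in\Xi$, $I_+\in\mathcal{F}(\mathbb{L}_+)$, $\theta_+\in\Theta_+$, and

$$\omega_{y_+}^{(I,\xi,I_+,\theta_+)}=1_{\Theta_+(I_+)}(\theta_+)\,\bigl[1-\bar P_S^{(\xi)}\bigr]^{I\setminus I_+}\bigl[\bar P_S^{(\xi)}\bigr]^{I\cap I_+}\bigl[1-r_{B,+}\bigr]^{\mathbb{B}_+\setminus I_+}\,r_{B,+}^{\mathbb{B}_+\cap I_+}\,\bigl[\bar\varphi_{y_+}^{(\xi,\theta_+)}\bigr]^{I_+}, \tag{22}$$

$$\bar P_S^{(\xi)}(\ell)=\bigl\langle p^{(\xi)}(\cdot,\ell),\,P_S(\cdot,\ell)\bigr\rangle, \tag{23}$$

$$\bar\varphi_{y_+}^{(\xi,\theta_+)}(\ell_+)=\bigl\langle\bar p_+^{(\xi)}(\cdot,\ell_+),\,\varphi_{y_+}^{(\theta_+(\ell_+))}(\cdot,\ell_+)\bigr\rangle, \tag{24}$$

$$\bar p_+^{(\xi)}(x_+,\ell_+)=1_{\mathbb{L}}(\ell_+)\,\frac{\bigl\langle P_S(\cdot,\ell_+)\,f_+(x_+\mid\cdot,\ell_+),\,p^{(\xi)}(\cdot,\ell_+)\bigr\rangle}{\bar P_S^{(\xi)}(\ell_+)}+1_{\mathbb{B}_+}(\ell_+)\,p_{B,+}(x_+,\ell_+), \tag{25}$$

$$p_{y_+}^{(\xi,\theta_+)}(x_+,\ell_+)=\frac{\bar p_+^{(\xi)}(x_+,\ell_+)\,\varphi_{y_+}^{(\theta_+(\ell_+))}(x_+,\ell_+)}{\bar\varphi_{y_+}^{(\xi,\theta_+)}(\ell_+)}. \tag{26}$$

The summation in (21) can be interpreted as an enumeration of all possible combinations of births, deaths and survivals together with associations of new measurements to hypothesized tracks. Observe that (21) does indeed take on the same form as (12) when rewritten as a sum over $I_+,\xi,\theta_+$ with weights

$$\omega_{y_+}^{(I_+,\xi,\theta_+)}\propto\sum_{I}\omega^{(I,\xi)}\,\omega_{y_+}^{(I,\xi,I_+,\theta_+)}. \tag{27}$$

Hence at the next iteration we only propagate forward the components $(I_+,\xi,\theta_+)$ with weights $\omega_{y_+}^{(I_+,\xi,\theta_+)}$.
Remark: It is also possible to approximate the resulting GLMB filtering density by an LMB with matching 1st moment and cardinality distribution [32]. This so-called LMB filtering strategy considerably reduces computation since an LMB is a GLMB with one term. However, tracking performance tends to degrade, especially in scenarios with many closely spaced objects.

Note that for high SNR scenarios the detection probability is high, hence the recursion (21)-(27) would process detections mostly. On the other hand, when the detection probability is low it would process the image mostly. Since in practice the SNR varies between different regions in the observation space as well as with time, the recursion (21)-(27) adaptively processes detections and image data to improve performance while reducing the computational cost.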

III-D GLMB Filter Implementation

The number of terms in the GLMB filtering density grows super-exponentially, and it is necessary to truncate these terms without exhaustive enumeration. A two-stage implementation of the GLMB filter truncates the prediction and filtering densities using the K-shortest path and the ranked assignment algorithms, respectively [29]. In [31] the joint prediction and update was designed to improve the truncation efficiency of the two-stage implementation. Further, the GLMB truncation can be performed via Gibbs sampling with linear complexity in the number of detections (the reader is referred to [31] for derivations and analysis). Fortuitously, this implementation can be readily adapted for the video MOT GLMB filter (21)-(27).
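
To give a flavour of the Gibbs-based truncation, the following Python sketch samples association vectors one coordinate at a time, in the spirit of [31]. It is a simplified illustration under stated assumptions and not the authors' implementation: entry `gamma[n]` is assumed to be -1 for a dead or unborn track, 0 for a mis-detected track and j > 0 for a track assigned detection j, and `eta[n][j + 1]` is assumed to hold the unnormalised weight of that choice (non-empty, with positive weights for -1 and 0).

```python
import random


def gibbs_assignments(gamma0, num_samples, eta, rng):
    """Gibbs sampler over association vectors.

    gamma0      : initial association vector (one entry per track)
    num_samples : number of vectors to return (including gamma0)
    eta         : eta[n][j + 1] is the weight of assigning value j to track n,
                  for j = -1, 0, 1, ..., M
    A positive detection index may be used by at most one track.
    """
    num_tracks = len(eta)
    num_detections = len(eta[0]) - 2
    values = list(range(-1, num_detections + 1))
    gamma = list(gamma0)
    samples = [tuple(gamma)]
    for _ in range(num_samples - 1):
        for n in range(num_tracks):
            # Detections currently claimed by the other tracks are excluded.
            taken = {g for m, g in enumerate(gamma) if m != n and g > 0}
            weights = [0.0 if j in taken else eta[n][j + 1] for j in values]
            gamma[n] = rng.choices(values, weights=weights)[0]
        samples.append(tuple(gamma))
    return samples
```

Each coordinate update costs time proportional to the number of detections, and duplicate samples can be discarded afterwards so that only distinct, typically high-weight, hypotheses are retained.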

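The GLMB filtering density (12) at time $k$ is completely characterized by the parameters $(\omega^{(I,\xi)},p^{(\xi)})$, $(I,\xi)\in\mathcal{F}(\mathbb{L})\times\Xi$, which can be enumerated as $\{(I^{(h)},\xi^{(h)},\omega^{(h)},p^{(h)})\}_{h=1}^{H}$, where

$$\omega^{(h)}\triangleq\omega^{(I^{(h)},\xi^{(h)})},\qquad p^{(h)}\triangleq p^{(\xi^{(h)})}.$$

Since (12) can now be rewritten as

$$\boldsymbol{\pi}(\mathbf{X})=\Delta(\mathbf{X})\sum_{h=1}^{H}\omega^{(h)}\,\delta_{I^{(h)}}[\mathcal{L}(\mathbf{X})]\,\bigl[p^{(h)}\bigr]^{\mathbf{X}},$$

implementing the GLMB filter amounts to propagating the component set $\{(I^{(h)},\omega^{(h)},p^{(h)})\}_{h=1}^{H}$ (there is no need to store $\xi^{(h)}$) forward in time using (21)-(27). The procedure for computing the component set at the next time is summarized in Algorithm 1. Note that to be consistent with the indexing by $h$ instead of $(I,\xi)$, we also abbreviate

$$\bar P_S^{(h)}(\ell)\triangleq\bar P_S^{(\xi^{(h)})}(\ell),\qquad\bar p_+^{(h)}(x,\ell)\triangleq\bar p_+^{(\xi^{(h)})}(x,\ell),\qquad\bar\varphi_{y_+}^{(h,j)}(\ell)\triangleq\bigl\langle\bar p_+^{(h)}(\cdot,\ell),\,\varphi_{y_+}^{(j)}(\cdot,\ell)\bigr\rangle.$$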

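Algorithm 1. Joint Prediction and Update

  • input: $\{(I^{(h)},\omega^{(h)},p^{(h)})\}_{h=1}^{H}$, $Z_+$, $y_+$, $H_+^{\max}$,

  • input: $\{(r_{B,+}^{(\ell)},p_{B,+}^{(\ell)})\}_{\ell\in\mathbb{B}_+}$, $P_S$, $f_+(\cdot\mid\cdot)$, $\kappa_+$, $P_{D,+}$, $g_{D,+}(\cdot\mid\cdot)$, $g_{T,+}(\cdot\mid\cdot)$

  • output: $\{(I_+^{(h_+)},\omega_+^{(h_+)},p_+^{(h_+)})\}_{h_+=1}^{H_+}$

sample counts $[T_+^{(h)}]_{h=1}^{H}$ from multinomial distribution with parameters $H_+^{\max}$ trials and weights $[\omega^{(h)}]_{h=1}^{H}$

for $h=1:H$

  initialize $\gamma^{(h,1)}$

  compute $\eta^{(h)}=[\eta_i^{(h)}(j)]$ using (28)

  $\{\gamma^{(h,t)}\}_{t=1}^{\tilde T_+^{(h)}}:=\mathrm{Unique}(\mathrm{Gibbs}(\gamma^{(h,1)},T_+^{(h)},\eta^{(h)}))$

  for $t=1:\tilde T_+^{(h)}$

    compute $I_+^{(h,t)}$ from $I^{(h)}$ and $\gamma^{(h,t)}$ using (29)

    compute $\omega_+^{(h,t)}$ from $\omega^{(h)}$ and $\gamma^{(h,t)}$ using (30)

    compute $p_+^{(h,t)}$ from $p^{(h)}$ and $\gamma^{(h,t)}$ using (31)

  end

end

$(\{(I_+^{(h_+)},p_+^{(h_+)})\}_{h_+=1}^{H_+},\sim,[U_{h,t}]):=\mathrm{Unique}(\{(I_+^{(h,t)},p_+^{(h,t)})\}_{h,t})$

for $h_+=1:H_+$

  $\omega_+^{(h_+)}:=\sum_{h,t:\,U_{h,t}=h_+}\omega_+^{(h,t)}$

end

normalize weights $\{\omega_+^{(h_+)}\}_{h_+=1}^{H_+}$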



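Algorithm 1a. Gibbs

  • input: $\gamma^{(1)}$, $T$, $\eta=[\eta_i(j)]$

  • output: $\gamma^{(1)},\dots,\gamma^{(T)}$

$P:=\mathrm{size}(\eta,1)$; $M:=\mathrm{size}(\eta,2)-2$;

for $t=2:T$

  $\gamma^{(t)}:=[\,]$;

  for $n=1:P$

    for $j=1:M$

      $\tilde\eta_n(j):=\eta_n(j)\,\bigl(1-1_{\{\gamma_{1:n-1}^{(t)},\,\gamma_{n+1:P}^{(t-1)}\}}(j)\bigr)$;

    end

    $\gamma_n^{(t)}\sim\mathrm{Categorical}([-1:M],\tilde\eta_n)$; $\gamma^{(t)}:=[\gamma^{(t)},\gamma_n^{(t)}]$;

  end

end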

There are three main steps in one iteration of the GLMB filter. First, the Gibbs sampler (Algorithm 1a) is used to generate the auxiliary vectors $\gamma^{(h,t)}$, $h=1{:}H$, $t=1{:}\tilde T_+^{(h)}$, with the most significant weights $\omega_+^{(h,t)}$ (note that $\gamma^{(h,t)}$ is an equivalent representation of the hypothesis $(I_+^{(h,t)},\xi^{(h)},\theta_+^{(h,t)})$, with components