A Labeled Random Finite Set Online Multi-Object Tracker for Video Data
This paper proposes an online multi-object tracking algorithm for image observations using a top-down Bayesian formulation that seamlessly integrates state estimation, track management, clutter rejection, occlusion and mis-detection handling into a single recursion. This is achieved by modeling the multi-object state as a labeled random finite set and using the Bayes recursion to propagate the multi-object filtering density forward in time. The proposed filter updates tracks with detections but switches to image data when mis-detection occurs, thereby exploiting the efficiency of detection data and the accuracy of image data. Furthermore, the labeled random finite set framework enables the incorporation of prior knowledge that mis-detections in the middle of the scene are likely to be due to occlusions. Such prior knowledge can be exploited to improve occlusion handling, especially long occlusions that can lead to premature track termination in online multi-object tracking. Tracking performance is compared to state-of-the-art algorithms on synthetic data and well-known benchmark video datasets.
online multi-object tracking, Track-before-detect, random finite set
In a multiple object setting, not only do the states of the objects vary with time, but the number of objects also changes due to objects appearing and disappearing. In this work, we consider the problem of jointly estimating the time-varying number of objects and their trajectories from a stream of noisy images. In particular, we are interested in multi-object tracking (MOT) solutions that compute estimates at a given time using only data up to that time. These so-called online solutions are better suited for time-critical applications.
A critical function of a multi-object tracker is track management, which
concerns track initiation/termination and track labeling or identifying
trajectories of individual objects. Track management is more challenging for
online algorithms than for batch algorithms. Usually, track
initiation/termination in online MOT algorithms is performed by examining
consecutive detections in time , . However,
false positives generated by the background, compounded by false negatives
from object occlusions and mis-detections, can result in false tracks and
lost tracks, especially in online algorithms. False negatives also cause
track fragmentation in batch algorithms as reported in , , . With the exception of the recent network flow
 techniques, track labels are assigned upon track
initiation, and maintained over time until termination. An online
multi-object Bayesian filter that provides systematic track labeling using
labeled random finite set (RFS) was proposed in .
In most video MOT approaches, each image in the data sequence is compressed into a set of detections before a filtering operation is applied to keep track of the objects (including undetected ones) [2, 8]. Typically, in the filtering module, motion correspondence or data association is first determined followed by the application of standard filtering techniques such as Kalman or sequential Monte Carlo . The main advantage of performing detection before filtering is the computational efficiency in the compression of images into relevant detections. The main disadvantage is the loss of information, in addition to mis-detection and false alarms, especially in low signal to noise ratio (SNR) applications.
Track-before-detect (TBD) is an alternative approach, which bypasses the detection module and exploits the spatio-temporal information directly from the image sequence. The TBD methodology is often required in tracking applications for low SNR image data , , , , , . In visual tracking applications, perhaps the most well-known TBD MOT algorithm is BraMBLe . Other visual MOT algorithms that can be categorized as TBD include ,  which exploit color-based observation models, , , , which exploit multi-modality of distributions, and ,  which use multi-Bernoulli random finite set models. While the TBD approach minimizes information loss, it is computationally more expensive. A balance between tractability and fidelity is important in the design of the measurement model.
In this paper, we present an efficient online MOT algorithm for
video data that exploits the advantages of both detection-based and TBD
approaches to improve performance while reducing the computational cost. The
proposed MOT filter updates the tracks adaptively with detections for
efficiency, or with local regions on the image to minimize information loss.
In addition, it seamlessly integrates state estimation, track management,
clutter rejection, mis-detection and occlusion handling in a single Bayesian
recursion.
Specifically, using the RFS framework we propose a hybrid multi-object likelihood function that accommodates both detections and image observations, thereby generalizing the standard multi-object likelihood  and the separable likelihood for image . Further, we establish conjugacy of the Generalized Labelled Multi-Bernoulli (GLMB) distributions with respect to the proposed likelihood function, which then yields an analytic solution to the multi-object Bayes recursion since the GLMB family is closed under the Chapman-Kolmogorov equation. The proposed MOT filter exploits the efficiency of the detection-based approach which avoids updating with the entire image, while at the same time exploiting relevant information at the image level by using only small regions of the image where mis-detected objects are expected.
The labeled RFS formulation  addresses state estimation, track management, clutter rejection, mis-detection and occlusion handling, in one single Bayes recursion. Generally, an online MOT algorithm would terminate a track that has not been detected over several frames. In many video MOT applications however, it is observed that away from designated exit regions such as scene edges, the longer an object is in the scene, the less likely it is to disappear. Intuitively, this observation can be used to delay the termination of tracks that have been occluded over an extended period, so as to improve occlusion handling. The use of labeled RFS in our proposed filter provides a principled and inexpensive means to exploit this observation for improved occlusion handling.
The remainder of the paper is structured as follows. The Bayesian filtering formulation of the MOT problem using labeled RFS is given in Section II, followed by details of the proposed solution in Section III. Performance evaluation of the proposed MOT filter against state-of-the-art trackers is presented in Section IV, and concluding remarks are given in Section V.
II Bayesian Multiple Object Tracking
This section outlines the RFS framework for MOT that accommodates uncertainty in the number of objects, the states of the objects and their trajectories. The salient feature of this framework is that it admits direct parallels between traditional Bayesian filtering and MOT. The modeling of the multi-object state as an RFS in Subsection II-A enables Bayesian filtering concepts to be directly translated to the multi-object case in Subsection II-B. Subsection II-C examines the MOT problem in the presence of occlusion.
II-A Multi-object State
To distinguish different object trajectories in a multi-object
setting, each object is assigned a unique label $\ell$ that consists of
an ordered pair $(k,i)$, where $k$ is the time of birth and $i$ is the index
of individual objects born at the same time $k$. For example, if two
objects appear in the scene at time 1, one is assigned label (1,1) while the
other is assigned label (1,2), see Fig. 1. A trajectory or
track is the sequence of states with the same label.
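The labeling convention can be illustrated with a minimal Python sketch. The data layout (a dictionary from time step to a list of `(state, label)` pairs), the function name `tracks_from_estimates` and all numeric values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# A label is the ordered pair (time_of_birth, index). A labeled state pairs a
# kinematic vector with its label. All numbers below are made up.
estimates = {
    1: [((0.0, 0.0), (1, 1)), ((5.0, 5.0), (1, 2))],
    2: [((0.5, 0.1), (1, 1)), ((5.2, 4.9), (1, 2)), ((9.0, 1.0), (2, 1))],
}


def tracks_from_estimates(estimates):
    """Group labeled states by label: a track is the sequence of states
    that share the same label, ordered by time."""
    tracks = defaultdict(list)
    for time in sorted(estimates):
        for state, label in estimates[time]:
            tracks[label].append((time, state))
    return dict(tracks)


tracks = tracks_from_estimates(estimates)
```

Here the two objects born at time 1 keep labels (1,1) and (1,2) at time 2, while the object that appears at time 2 receives label (2,1), so no data association step is needed to read off trajectories.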
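Formally, the state of an object at time $k$ is a vector $\mathbf{x}=(x,\ell)\in\mathbb{X}\times\mathbb{L}_k$, where $\mathbb{L}_k$ denotes the label space for objects at time $k$ (including those born prior to $k$). Note that $\mathbb{L}_k$ is given by $\mathbb{B}_k\cup\mathbb{L}_{k-1}$, where $\mathbb{B}_k$ denotes the label space for objects born at time $k$ (and is disjoint from $\mathbb{L}_{k-1}$). Suppose that there are $N_k$ objects at time $k$, with states $\mathbf{x}_{k,1},\dots,\mathbf{x}_{k,N_k}$. In the context of MOT, the collection of states, referred to as the multi-object state, is naturally represented as a finite set
\[ \mathbf{X}_k=\{\mathbf{x}_{k,1},\dots,\mathbf{x}_{k,N_k}\}\in\mathcal{F}(\mathbb{X}\times\mathbb{L}_k), \]
where $\mathcal{F}(\mathbb{X}\times\mathbb{L}_k)$ denotes the space of finite
subsets of $\mathbb{X}\times\mathbb{L}_k$. We denote cardinality (number of
elements) of $\mathbf{X}$ by $|\mathbf{X}|$ and the set of labels of $\mathbf{X}$, $\{\ell:(x,\ell)\in\mathbf{X}\}$, by $\mathcal{L}(\mathbf{X})$. Note that since the label is unique, no two objects have the same
label, i.e. $\delta_{|\mathbf{X}|}(|\mathcal{L}(\mathbf{X})|)=1$. Hence $\Delta(\mathbf{X})\triangleq\delta_{|\mathbf{X}|}(|\mathcal{L}(\mathbf{X})|)$ is called the distinct label indicator.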
For the rest of the paper, we follow the convention that single-object states are represented by lower-case letters (e.g. $x$, $\mathbf{x}$), while multi-object states are represented by upper-case letters (e.g. $X$, $\mathbf{X}$), symbols for labeled states and their distributions are bold-faced to distinguish them from unlabeled ones (e.g. $\mathbf{x}$, $\mathbf{X}$, $\boldsymbol{\pi}$, etc.), and spaces are represented by blackboard bold (e.g. $\mathbb{X}$, $\mathbb{Z}$, $\mathbb{L}$, $\mathbb{N}$, etc.). The list of variables $X_m,X_{m+1},\dots,X_n$ is abbreviated as $X_{m:n}$. We denote a generalization of the Kronecker delta that takes arbitrary arguments such as sets, vectors, integers etc., by
\[ \delta_Y[X]\triangleq\begin{cases}1, & \text{if } X=Y\\ 0, & \text{otherwise.}\end{cases} \]
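For a given set $S$, $1_S(\cdot)$ denotes the indicator function of $S$, and $\mathcal{F}(S)$ denotes the class of finite subsets of $S$. For a finite set $X$, the multi-object exponential notation $f^X$ denotes the product $\prod_{x\in X}f(x)$, with $f^{\emptyset}=1$. The inner product $\int f(x)g(x)\,dx$ is denoted by $\langle f,g\rangle$.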
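The notation introduced in this subsection maps directly onto a few small helper functions. The following is a sketch only; the function names and the representation of a labeled multi-object state as a list of `(state, label)` pairs are assumptions made for illustration.

```python
import math


def delta(y, x):
    """Generalized Kronecker delta: 1 if x equals y, else 0.
    Works for any comparable arguments (sets, tuples, integers)."""
    return 1 if x == y else 0


def indicator(s, x):
    """Indicator function of the set s evaluated at x."""
    return 1 if x in s else 0


def multi_object_exp(f, xs):
    """Multi-object exponential: product of f(x) over x in xs.
    The empty product is 1, matching the convention for the empty set."""
    return math.prod(f(x) for x in xs)


def labels(X):
    """Set of labels of a labeled multi-object state X."""
    return {label for _, label in X}


def distinct_label_indicator(X):
    """1 if all labels in X are distinct, i.e. |X| equals the number of labels."""
    return delta(len(X), len(labels(X)))


valid = [((0.0,), (1, 1)), ((1.0,), (1, 2))]
invalid = [((0.0,), (1, 1)), ((1.0,), (1, 1))]  # two states share a label
```

With these definitions `distinct_label_indicator(valid)` evaluates to 1 and `distinct_label_indicator(invalid)` to 0, which is how a labeled density assigns zero mass to states with repeated labels.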
II-B Multi-object Bayes filter
From a Bayesian estimation viewpoint the multi-object state is naturally
modeled as an RFS or a simple-finite point process . While
the space does not inherit the
Euclidean notion of probability density, Mahler’s Finite Set Statistics
(FISST) provides a suitable notion of integration/density for RFSs [22, 24]. This approach is mathematically consistent
with measure theoretic integration/density but circumvents measure theoretic
constructs.
At time , the multi-object state is observed as an image . All information on the set of object trajectories conditioned on the observation history , is captured in the multi-object posterior density
where is the initial prior,
is the multi-object likelihood function at time , is the multi-object transition density to
time . The multi-object likelihood function encapsulates the underlying
observation model while the multi-object transition density encapsulates the
underlying models for motions, births and deaths of objects. Note that track
management is incorporated into the Bayes recursion via the multi-object
state with distinct labels.
MCMC approximations of the posterior density have been proposed in [26, 27] for detection measurements and image measurements respectively. Results on satellite imaging applications reported in  are very impressive. However, these techniques are still computationally expensive and not suitable for online applications.
For real-time tracking, a more tractable alternative is the multi-object filtering density, a marginal of the multi-object posterior. For notational compactness, herein we omit the dependence on data in the multi-object densities. The multi-object filtering density can be recursively propagated by the multi-object Bayes filter ,  according to the following prediction and update
where the integral is a set integral defined for any function by
Bayes optimal multi-object estimators can be formulated by minimizing the
Bayes risk with ordinary integrals replaced by set integrals as in . One such estimator is the marginal multi-object estimator
A generic particle implementation of the Bayes multi-object filter (1)-(2) was proposed in  and applied to labeled multi-object states in . The Generalized Labeled Multi-Bernoulli (GLMB) filter is an analytic solution to the Bayes multi-object filter, under the standard multi-object dynamic and observation models .
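For orientation, the prediction and update steps of the multi-object Bayes filter, together with the set integral for labeled states, can be written in the notation commonly used in the labeled RFS literature. The symbols below are assumptions of this sketch: $\boldsymbol{\pi}_k$ for the filtering density at time $k$, $\mathbf{f}_{k+1|k}$ for the multi-object transition density, $g_{k+1}$ for the multi-object likelihood, $y_{k+1}$ for the observation, and $\mathbb{X}$, $\mathbb{L}$ for the kinematic state space and label space.

```latex
\begin{align*}
\boldsymbol{\pi}_{k+1|k}(\mathbf{X}_{k+1})
  &= \int \mathbf{f}_{k+1|k}(\mathbf{X}_{k+1}\mid\mathbf{X}_{k})\,
     \boldsymbol{\pi}_{k}(\mathbf{X}_{k})\,\delta\mathbf{X}_{k}, \\
\boldsymbol{\pi}_{k+1}(\mathbf{X}_{k+1})
  &= \frac{g_{k+1}(y_{k+1}\mid\mathbf{X}_{k+1})\,
           \boldsymbol{\pi}_{k+1|k}(\mathbf{X}_{k+1})}
          {\int g_{k+1}(y_{k+1}\mid\mathbf{X})\,
           \boldsymbol{\pi}_{k+1|k}(\mathbf{X})\,\delta\mathbf{X}}, \\
\int f(\mathbf{X})\,\delta\mathbf{X}
  &= \sum_{n=0}^{\infty}\frac{1}{n!}
     \sum_{(\ell_{1},\dots,\ell_{n})\in\mathbb{L}^{n}}
     \int_{\mathbb{X}^{n}}
     f(\{(x_{1},\ell_{1}),\dots,(x_{n},\ell_{n})\})\,d(x_{1},\dots,x_{n}).
\end{align*}
```

The first line is the Chapman-Kolmogorov prediction, the second is Bayes rule with the set integral as normalizing constant, and the third sums over all cardinalities and label tuples, with the $1/n!$ factor accounting for the fact that a set has no ordering.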
II-B1 Standard multi-object dynamic model
Given the multi-object state (at time ), each state either survives with probability and evolves to a new state (at time ) with probability density or dies with probability . The set of new objects born at time is distributed according to the labeled multi-Bernoulli (LMB)
where , is the probability that a new object with label is born, and is the distribution of its kinematic state . The multi-object state (at time ) is the superposition of surviving objects and new born objects. It is assumed that, conditional on , objects move, appear and die independently of each other. The expression for the multi-object transition density can be found in [7, 29]. The standard multi-object dynamic model enables the Bayes multi-object filter to address motion, births and deaths of objects.
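One draw from the standard multi-object dynamic model can be sketched as follows. This is an illustrative simulation, not part of the filter itself; the function `predict_step`, its argument names, and the representation of the LMB birth model as a dictionary from birth label to `(birth_probability, sampler)` are assumptions.

```python
import random


def predict_step(X, p_survive, move, birth_model, rng):
    """Draw the next multi-object state under the standard dynamic model.

    X           : list of (state, label) pairs at the current time
    p_survive   : function (state, label) -> survival probability
    move        : function (state, rng) -> new state (single-object transition)
    birth_model : dict label -> (birth probability, sampler(rng) -> state)
    """
    # Each object survives independently and keeps its label.
    survivors = [(move(x, rng), l) for x, l in X if rng.random() < p_survive(x, l)]
    # Each candidate birth label is realized independently (labeled multi-Bernoulli).
    births = [(sampler(rng), l)
              for l, (r, sampler) in birth_model.items() if rng.random() < r]
    # The next state is the superposition of survivors and new births.
    return survivors + births


rng = random.Random(0)
X = [((0.0, 0.0), (1, 1)), ((5.0, 5.0), (1, 2))]
birth_model = {(2, 1): (0.1, lambda rng: (rng.uniform(0, 10), rng.uniform(0, 10)))}
X_next = predict_step(
    X,
    p_survive=lambda x, l: 0.95,
    move=lambda x, rng: tuple(v + rng.gauss(0.0, 1.0) for v in x),
    birth_model=birth_model,
    rng=rng,
)
```

Because survivors keep their labels and births carry labels from a space disjoint from the existing one, the resulting state always has distinct labels.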
II-B2 Standard multi-object observation model
In most applications a designated detection operation is applied to resulting in a set of points
Since the detection process is not perfect,
false positives and false negatives are inevitable. Hence only a subset of correspond to some objects in the scene (not all objects are
detected) while the remainder are false positives. The most popular
detection-based observation model is described in the following.
For a given multi-object state , each is either detected with probability and generates a detection with likelihood or missed with probability . The multi-object observation is the superposition of the observations from detected objects and Poisson clutter with intensity . The ratio
can be interpreted as the detection signal to noise ratio (SNR).
Assuming that, conditional on , detections are independent of each other and clutter, the multi-object likelihood function is given by , [7, 29]
where: is the set of positive 1-1 maps :, i.e. maps such that no two distinct arguments are mapped to the same positive value; and
The map specifies which objects generated which detections, i.e.
object generates detection , with
undetected objects assigned to . The positive 1-1 property means that is 1-1 on , the set of labels that are
assigned positive values, and ensures that any detection in is
assigned to at most one object.
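The positive 1-1 property can be made concrete by brute-force enumeration for a small example. The sketch below is for illustration only, since exhaustive enumeration is exactly what practical implementations avoid. The function names and the callables `p_detect`, `g` and `kappa` are hypothetical stand-ins for the detection probability, single-object likelihood and clutter intensity, and the returned sum omits the clutter factor that multiplies the likelihood.

```python
from itertools import product


def positive_one_to_one_maps(labels, num_detections):
    """Yield every map theta: label -> {0, 1, ..., M} in which no positive
    value is used twice (0 means the object is not detected)."""
    for values in product(range(num_detections + 1), repeat=len(labels)):
        positive = [v for v in values if v > 0]
        if len(positive) == len(set(positive)):
            yield dict(zip(labels, values))


def association_sum(X, Z, p_detect, g, kappa):
    """Sum over association maps of the product of per-object terms:
    the miss probability for undetected objects, and the detection
    probability times likelihood over clutter intensity for detected ones."""
    label_list = [l for _, l in X]
    total = 0.0
    for theta in positive_one_to_one_maps(label_list, len(Z)):
        term = 1.0
        for x, l in X:
            j = theta[l]
            if j == 0:
                term *= 1.0 - p_detect(x, l)
            else:
                z = Z[j - 1]
                term *= p_detect(x, l) * g(z, x, l) / kappa(z)
        total += term
    return total


maps = list(positive_one_to_one_maps([(1, 1), (1, 2)], 2))
```

For two labels and two detections, `maps` contains 7 of the 9 possible assignments; the two excluded ones give the same detection to both objects.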
The standard multi-object observation model enables the Bayes multi-object filter to address mis-detection and false detection, but not occlusion. It assumes that each object is detected independently of the others, and that a detection cannot be assigned to more than one object. These assumptions are clearly not valid under occlusion.
II-C Bayes Optimal Occlusion Handling
By relaxing the assumption that each object is independently detected, a
multi-object observation model that explicitly addresses occlusion (as well
as mis-detections and false positives) was proposed in . The
main difference between this so-called merged-measurement model and
the standard model is the idea that each group of objects (instead of each
object) in the multi-object state generates at most one detection . Fig. 2 shows various partitions or groupings
of a multi-object state with five objects.
A partition of a finite set is a collection of mutually exclusive subsets of , whose union is . The collection of all partitions of is denoted by . It is assumed that given a partition , each group generates at most one detection with probability , independent of other groups, and that conditional on detection generates with likelihood .
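The growth in the number of partitions, which underlies the intractability of the merged-measurement model, can be checked with a short recursive enumeration. This is a generic sketch; the function name `partitions` is chosen here for illustration.

```python
def partitions(items):
    """Yield every partition of a list as a list of groups (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Either `first` forms a group of its own ...
        yield [[first]] + part
        # ... or it joins one of the existing groups.
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


five_objects = ["a", "b", "c", "d", "e"]
num_partitions = sum(1 for _ in partitions(five_objects))
```

A multi-object state with five objects already admits 52 partitions (the fifth Bell number), and ten objects admit 115975, before any association of groups to detections is considered.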
Let denote the collection of labels of the partition , i.e. (note that forms a partition of ). Let denote the class of all positive 1-1 mappings :. Then, the likelihood that a given partition of a multi-object state , generates the detection set is
The multi-object filter (1)-(2)
with merged-measurement likelihood is Bayes optimal in the sense that the
filtering density contains all information on the current multi-object state
in the presence of false positives, mis-detections and occlusions.
Unfortunately, this filter is numerically intractable due to the sum over
all partitions of the multi-object state in the merged-measurement
likelihood. At present, there is no polynomial time technique for truncating
sums over partitions. Moreover, given a partition, computations involving
the joint detection probability , joint likelihood and associated joint densities are intractable except
for scenarios with a few objects.
A GLMB approximation that reduces the number of partitions using the cluster structure of the predicted multi-object state and the sensor’s resolution capabilities was proposed in . Also, joint densities are approximated by products of independent densities that minimise the Kullback-Leibler divergence . Case studies on MOT with bearings-only measurements show very good tracking performance. Nonetheless, at present, this filter is still computationally demanding and therefore not suitable for online MOT with image data.
III GLMB filter for tracking with image data
The GLMB filter (with the standard measurement likelihood) is a suitable
candidate for online MOT [29, 31]. However, it is designed to handle
neither occlusion nor image data. Even though occluded objects share the
observations of the occluding objects, this situation is not permitted in
the standard multi-object likelihood. Consequently, uncertainties in the
states of occluded objects grow, while their existence probabilities quickly
diminish to zero, leading to possible hi-jacking, and premature track
termination in longer occlusions.
This section proposes an efficient GLMB filter that exploits information from image data and addresses false positives, mis-detections and occlusions. Subsection III-A extends the standard observation model to allow occluded objects to share observations at the image level while Subsection III-B incorporates, into the death model, domain knowledge that mis-detected tracks with long durations are unlikely to disappear. The GLMB filter for image data and an efficient implementation are then described in Subsections III-C and III-D.
III-A Hybrid Multi-Object Measurement Likelihood
While the detection set is an efficient compression of the image
observation , mis-detected (including occluded) objects will not be
updated by the filter. On the other hand the uncompressed observation
contains relevant information about all objects, but updating with
is computationally expensive. Conceptually, we can have the best of both
worlds by updating detected objects with the associated detections and
mis-detected objects with the image observations localised to regions where
these objects are expected. More importantly, this strategy exploits the
fact that occluded objects share measurements with the objects occluding
them as illustrated in Fig. 3.
A tractable hybrid multi-object likelihood function that accommodates both detection and image observations can be obtained as follows. For tractability, it is assumed that objects generate observations independently of one another (similar to the standard observation model).
Given an object with state the likelihood of observing the local image (some transformation of the image ) is . On the other hand, given that there are no objects, the likelihood of observing is . The ratio
can be interpreted as the image SNR (c.f. detection SNR (4)).
For a given association map in the likelihood function (5), an object with state is mis-detected if , in which case the value of is , the probability of a miss. Consequently, after the Bayes update, track has no dependence on the observation . In order for track to be updated with the local image , the value of should be scaled by the image SNR . Note that the value of should remain unchanged for . Formally, this can be achieved by defining an extension of (6) as follows
In other words, for , is equal to the
image SNR (8) scaled by the mis-detection probability, otherwise
it is equal to the detection SNR (4) scaled by the detection
probability.
Given a state , plays the same role as , but accommodates both detection measurements and image measurements. Moreover, since each object generates observation independently from each other, the hybrid multi-object likelihood function has the same form as (5), but with replaced by , i.e.
In visual occlusions, the hybrid likelihood allows occluded objects to share the image observations of the objects that occlude them. Moreover, when integrated into the Bayes recursion (1)-(2), consideration for a track-length-dependent survival probability in combination with information from the image observation, reduces uncertainties in the states of occluded objects and maintains their existence probabilities to keep the tracks alive. Hence, hi-jacking and premature track termination in longer occlusions will be avoided.
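The per-object term of the hybrid likelihood described in this subsection can be sketched as a single function: for a mis-detected object it returns the image SNR scaled by the mis-detection probability, otherwise the detection SNR scaled by the detection probability. The function name `hybrid_psi` and the callables `p_detect`, `g`, `kappa` and `image_snr` are hypothetical placeholders; in particular `image_snr(x, label)` stands for the ratio of the local-image likelihood given the object to the local-image likelihood given no object.

```python
def hybrid_psi(x, label, theta, Z, p_detect, g, kappa, image_snr):
    """Per-object factor of the hybrid multi-object likelihood.

    theta maps each label to 0 (mis-detected) or to the 1-based index of
    the detection in Z assigned to that object.
    """
    j = theta[label]
    if j == 0:
        # Mis-detected (possibly occluded): fall back on the local image.
        return (1.0 - p_detect(x, label)) * image_snr(x, label)
    # Detected: use the associated detection, as in the standard model.
    z = Z[j - 1]
    return p_detect(x, label) * g(z, x, label) / kappa(z)


state, label = (2.0, 3.0), (1, 1)
missed = hybrid_psi(state, label, {label: 0}, [],
                    lambda x, l: 0.9, None, None, lambda x, l: 2.0)
detected = hybrid_psi(state, label, {label: 1}, [(2.1, 3.0)],
                      lambda x, l: 0.9, lambda z, x, l: 0.5,
                      lambda z: 0.1, lambda x, l: 2.0)
```

Setting `image_snr` to the constant 1 recovers the standard per-object term, in which a mis-detected object contributes only its miss probability and is therefore not updated by the observation.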
Remark: The hybrid multi-object likelihood function (10) is a generalization of both the standard multi-object
likelihood and the separable likelihood in . When for each , i.e. there is
no mis-detection, the hybrid likelihood (10) is the
same as the standard likelihood (5). On the other
hand, if for each , i.e.
there is no detection, then the only non-zero term in the hybrid likelihood (10) is one with for all . In this case, the hybrid likelihood (10) reduces to the separable likelihood in .
For a general detection profile , the hybrid likelihood (10) reduces to the standard likelihood (5) when for each .
Note that a hybrid likelihood function can be also developed for the merged-measurement model. However, the resulting multi-object filter still suffers from the same intractability as the merged-measurement filter.
III-B Death model
In most video MOT applications, if an object stays in the scene for a long time, then it is more likely to continue to do so, provided it is not close to the designated exit regions. Such prior empirical knowledge can be used to improve occlusion handling, especially long occlusions that can lead to premature track termination in online MOT algorithms. In general, the GLMB filter would terminate an object that has not been detected over several frames. However, if this object has been in the scene for some time and is not in the proximity of designated exit regions, then it is highly likely to be occluded and track termination should be delayed. The labeled RFS formulation enables such prior information to be incorporated into track termination in a principled manner, via the survival probability.
The labeled RFS formulation accommodates survival probabilities that depend on track lengths since a labeled state contains the time of birth in its label, and the track length is simply the difference between the current time and the time of birth. In practice, it is unlikely for an object to disappear in the middle of the visual scene (even if mis-detected or occluded) whereas it is more likely to disappear near designated exit regions due to the scene structure (e.g. the borders of the scene). Hence, we require the survival probability to be large (close to one) in the middle of the scene and small (close to zero) on the edges or designated death regions. Furthermore, since objects staying in the scene for a long time are more certain to continue existing, we require the survival probability to increase to one as its track length increases.
An explicit form of the survival probability that satisfies these requirements is given by
where is a scene mask that represents the scene structure, e.g., entrance or exit as illustrated in Fig. 4, and is a control parameter of the sigmoid function. The scene mask can be learned from a set of training data or designed from the known scene structure.
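A track-length-dependent survival probability with the stated properties can be sketched as a scene-mask value multiplied by a sigmoid of the track length. The functional form below is an assumption chosen to satisfy the requirements in the text (small near exit regions, large in the middle of the scene, increasing with track length); the function name, the argument names and the default value of `gamma` are hypothetical.

```python
import math


def survival_probability(scene_mask_value, birth_time, current_time, gamma=0.5):
    """Survival probability as scene mask times a sigmoid of track length.

    scene_mask_value : value in [0, 1] of the scene mask at the object's
                       position (near 0 in exit regions, near 1 elsewhere)
    birth_time       : first component of the object's label
    current_time     : current time step
    gamma            : control parameter of the sigmoid (assumed value)
    """
    track_length = current_time - birth_time
    return scene_mask_value / (1.0 + math.exp(-gamma * track_length))


new_track = survival_probability(1.0, birth_time=10, current_time=10)
old_track = survival_probability(1.0, birth_time=10, current_time=60)
at_exit = survival_probability(0.05, birth_time=10, current_time=60)
```

Under this form a newly born track in the middle of the scene has survival probability 0.5, a long-lived one approaches 1, and any track inside an exit region is capped by the small mask value regardless of its length.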
III-C GLMB Recursion
A GLMB density can be written in the following form
where each represents a history of association maps , each is a probability density on , and each is non-negative with . The cardinality distribution of a GLMB is given by
while the existence probability and probability density of track are respectively
Given the GLMB density (12), an intuitive multi-object estimator is the multi-Bernoulli estimator, which first determines the set of labels with existence probabilities above a prescribed threshold, and second the MAP/mean estimates from the densities , for the states of the objects. A popular estimator is a suboptimal version of the Marginal Multi-object Estimator , which first determines the pair with the highest weight such that coincides with the MAP cardinality estimate, and second the MAP/mean estimates from , for the states of the objects.
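The two label-selection rules described above can be sketched on a toy set of GLMB components. The representation of a component as a `(label_set, weight)` pair, the function names and the numeric weights are assumptions for illustration; the single-object densities needed for the MAP/mean state estimates are omitted.

```python
from collections import defaultdict


def cardinality_distribution(components):
    """Probability of n objects: total weight of components with n labels."""
    rho = defaultdict(float)
    for label_set, weight in components:
        rho[len(label_set)] += weight
    return dict(rho)


def existence_probabilities(components):
    """Existence probability of a label: total weight of components containing it."""
    r = defaultdict(float)
    for label_set, weight in components:
        for label in label_set:
            r[label] += weight
    return dict(r)


def multi_bernoulli_labels(components, threshold=0.5):
    """Labels whose existence probability exceeds the threshold."""
    return {l for l, r in existence_probabilities(components).items() if r > threshold}


def map_cardinality_component(components):
    """Highest-weight component whose size equals the MAP cardinality estimate."""
    rho = cardinality_distribution(components)
    n_map = max(rho, key=rho.get)
    candidates = [c for c in components if len(c[0]) == n_map]
    return max(candidates, key=lambda c: c[1])


components = [
    (frozenset({(1, 1), (1, 2)}), 0.6),
    (frozenset({(1, 1)}), 0.3),
    (frozenset(), 0.1),
]
best = map_cardinality_component(components)
```

In this toy example label (1,1) has existence probability of about 0.9 and label (1,2) of 0.6, so a threshold of 0.7 keeps only (1,1), whereas the cardinality-based rule selects the two-object component with weight 0.6.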
The GLMB family enjoys a number of nice analytical properties. The void
probability functional–a necessary and sufficient statistic–of a GLMB, the
Cauchy-Schwarz divergence between two GLMBs, the $L_1$-distance between a
GLMB and its truncation, can all be computed in closed form .
The GLMB is flexible enough to approximate any labeled RFS density with
matching intensity function and cardinality distribution .
More importantly, the GLMB family is closed under the prediction equation (1) and conjugate with respect to the standard
observation likelihood .
In the following we show that the GLMB family is conjugate with respect to the hybrid observation likelihood function. Hence, starting from an initial GLMB prior, all multi-object predicted and updated densities propagated by the Bayes recursion (1)-(2) are GLMBs. For notational compactness, we drop the subscript for the current time; the next time is indicated by the subscript ‘+’.
Suppose that the multi-object prediction density to time is a GLMB of the form
where , . Then under the hybrid observation likelihood function (10), the filtering density at time is a GLMB of the form
where , and
Note that the likelihood function (10) at time can be written as
where . Using Bayes rule
Given the GLMB filtering density (12) at time , the filtering density at time is:
where , , , , and
The summation in (21) can be interpreted as an enumeration of all possible combinations of births, deaths and survivals together with associations of new measurements to hypothesized tracks. Observe that (21) does indeed take on the same form as (12) when rewritten as a sum over with weights
Hence at the next iteration we only propagate forward the components with weights .
Remark: It is also possible to approximate the resulting GLMB filtering density by an LMB with matching 1st moment and cardinality distribution . This so-called LMB filtering strategy considerably reduces computation since an LMB is a GLMB with one term. However, tracking performance tends to degrade, especially in scenarios with many closely spaced objects.
Note that for high SNR scenarios the detection probability is high, hence the recursion (21)-(27) would mostly process detections. On the other hand, when the detection probability is low, it would mostly process the image. Since, in practice, the SNR varies between different regions of the observation space as well as with time, the recursion (21)-(27) adaptively processes detections and image data to improve performance while reducing the computational cost.
III-D GLMB Filter Implementation
The number of terms in the GLMB filtering density grows super-exponentially, and it is necessary to truncate these terms without exhaustive enumeration. A two-stage implementation of the GLMB filter truncates the prediction and filtering densities using the K-shortest path and the ranked assignment algorithms, respectively . In  the joint prediction and update was designed to improve the truncation efficiency of the two-staged implementation. Further, the GLMB truncation can be performed via Gibbs sampling with linear complexity in the number of detections (the reader is referred to  for derivations and analysis). Fortuitously, this implementation can be readily adapted for the video MOT GLMB filter (21)-(27).
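The idea of truncating by Gibbs sampling can be illustrated with a minimal sampler over association vectors. This is a sketch in the spirit of the strategy mentioned above and should not be read as the authors' Algorithm 1a. The weight table `eta`, the encoding of options (-1 for a track that dies, 0 for a missed track, j > 0 for detection j), and the all-missed initialization are assumptions.

```python
import random


def gibbs_assignments(eta, num_detections, num_sweeps, rng):
    """Sample distinct positive 1-1 association vectors.

    eta[n][j + 1] is the nonnegative weight of giving track n the option j,
    for j in {-1, 0, 1, ..., num_detections}. The weights for -1 and 0 must
    not both be zero, so that every conditional has positive total mass.
    """
    num_tracks = len(eta)
    options = list(range(-1, num_detections + 1))
    gamma = [0] * num_tracks  # start with every track missed
    samples = set()
    for _ in range(num_sweeps):
        for n in range(num_tracks):
            # Detections already claimed by the other tracks are excluded.
            taken = {gamma[i] for i in range(num_tracks) if i != n and gamma[i] > 0}
            weights = [0.0 if j in taken else eta[n][j + 1] for j in options]
            gamma[n] = rng.choices(options, weights=weights)[0]
        samples.add(tuple(gamma))
    return samples


# Two tracks, one detection; columns are [dies, missed, detection 1].
eta = [[0.05, 0.1, 5.0], [0.05, 0.1, 4.0]]
samples = gibbs_assignments(eta, num_detections=1, num_sweeps=200, rng=random.Random(0))
```

Each coordinate is resampled conditional on the others, so a detection is never assigned to two tracks, and only the distinct vectors visited are kept as candidate components.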
The GLMB filtering density (12) at time is completely characterized by the parameters , , which can be enumerated as , where
Since (12) can now be rewritten as
implementing the GLMB filter amounts to propagating the component set (there is no need to store ) forward in time using (21)-(27). The procedure for computing the component set at the next time is summarized in Algorithm 1. Note that to be consistent with the indexing by instead of , we also abbreviate
Algorithm 1. Joint Prediction and Update
input: , , , ,
input: , , , , ,
sample counts from multinomial distribution
with parameters trials and weights
compute using (28)
compute from and using (29)
compute from and using (30)
compute from and using (31)
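The multinomial sampling step at the start of Algorithm 1 can be sketched with the Python standard library. The function name `multinomial_counts` and the interpretation of the counts as the number of new samples allotted to each existing component are assumptions made for illustration.

```python
import random
from collections import Counter


def multinomial_counts(weights, num_trials, rng):
    """Draw counts from a multinomial distribution with the given number of
    trials and (unnormalized) component weights."""
    draws = rng.choices(range(len(weights)), weights=weights, k=num_trials)
    counts = Counter(draws)
    return [counts.get(h, 0) for h in range(len(weights))]


component_weights = [0.7, 0.2, 0.1]
counts = multinomial_counts(component_weights, num_trials=1000, rng=random.Random(0))
```

Components with larger weights receive proportionally more of the sampling budget, so most of the computation is spent on the hypotheses that matter most.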
Algorithm 1a. Gibbs
There are three main steps in one iteration of the GLMB filter. First, the Gibbs sampler (Algorithm 1a) is used to generate the auxiliary vectors , :, :, with the most significant weights (note that is an equivalent representation of the hypothesis , with components