A Labeled Random Finite Set Online Multi-Object Tracker for Video Data
Abstract
This paper proposes an online multiobject tracking algorithm for image observations using a top-down Bayesian formulation that seamlessly integrates state estimation, track management, clutter rejection, occlusion and misdetection handling into a single recursion. This is achieved by modeling the multiobject state as a labeled random finite set and using the Bayes recursion to propagate the multiobject filtering density forward in time. The proposed filter updates tracks with detections but switches to image data when misdetection occurs, thereby exploiting the efficiency of detection data and the accuracy of image data. Furthermore, the labeled random finite set framework enables the incorporation of prior knowledge that misdetections in the middle of the scene are likely to be due to occlusions. Such prior knowledge can be exploited to improve occlusion handling, especially for long occlusions that can lead to premature track termination in online multiobject tracking. Tracking performance is compared to state-of-the-art algorithms on synthetic data and well-known benchmark video datasets.
online multiobject tracking, track-before-detect, random finite set
I Introduction
In a multiple object setting, not only do the states of the objects vary with time, but the number of objects also changes due to objects appearing and disappearing. In this work, we consider the problem of jointly estimating the time-varying number of objects and their trajectories from a stream of noisy images. In particular, we are interested in multiobject tracking (MOT) solutions that compute estimates at a given time using only data up to that time. These so-called online solutions are better suited for time-critical applications.
A critical function of a multiobject tracker is track management, which
concerns track initiation/termination and track labeling or identifying
trajectories of individual objects. Track management is more challenging for
online algorithms than for batch algorithms. Usually, track
initiation/termination in online MOT algorithms is performed by examining
consecutive detections in time [1], [2]. However,
false positives generated by the background, compounded by false negatives
from object occlusions and misdetections, can result in false tracks and
lost tracks, especially in online algorithms. False negatives also cause
track fragmentation in batch algorithms as reported in [3], [4], [5]. With the exception of the recent network flow
[6] techniques, track labels are assigned upon track
initiation, and maintained over time until termination. An online
multiobject Bayesian filter that provides systematic track labeling using
labeled random finite set (RFS) was proposed in [7].
In most video MOT approaches, each image in the data sequence is compressed into a set of detections before a filtering operation is applied to keep track of the objects (including undetected ones) [2, 8]. Typically, in the filtering module, motion correspondence or data association is first determined, followed by the application of standard filtering techniques such as Kalman or sequential Monte Carlo [8]. The main advantage of performing detection before filtering is the computational efficiency gained by compressing images into relevant detections. The main disadvantage is the loss of information, in addition to misdetections and false alarms, especially in low signal-to-noise ratio (SNR) applications.
Track-before-detect (TBD) is an alternative approach, which bypasses the detection module and exploits the spatio-temporal information directly from the image sequence. The TBD methodology is often required in tracking applications for low SNR image data [9], [10], [11], [12], [13], [14]. In visual tracking applications, perhaps the most well-known TBD MOT algorithm is BraMBLe [15]. Other visual MOT algorithms that can be categorized as TBD include [16], [17], which exploit color-based observation models, [18], [2], [19], which exploit multi-modality of distributions, and [20], [21], which use multi-Bernoulli random finite set models. While the TBD approach minimizes information loss, it is computationally more expensive. A balance between tractability and fidelity is important in the design of the measurement model.
In this paper, we present an efficient online MOT algorithm for
video data that exploits the advantages of both detection-based and TBD
approaches to improve performance while reducing the computational cost. The
proposed MOT filter updates the tracks adaptively with detections for
efficiency, or with local regions on the image to minimize information loss.
In addition it seamlessly integrates state estimation, track management,
clutter rejection, misdetection and occlusion handling in a single Bayesian
recursion.
Specifically, using the RFS framework we propose a hybrid
multiobject likelihood function that accommodates both detections and image
observations, thereby generalizing the standard multiobject likelihood [22] and the separable likelihood for image observations [10].
Further, we establish conjugacy of the Generalized Labeled Multi-Bernoulli
(GLMB) distributions with respect to the proposed likelihood function, which
then yields an analytic solution to the multiobject Bayes recursion since
the GLMB family is closed under the Chapman-Kolmogorov equation. The
proposed MOT filter exploits the efficiency of the detection-based approach
which avoids updating with the entire image, while at the same time
exploiting relevant information at the image level by using only small
regions of the image where misdetected objects are expected.
The labeled RFS formulation [7] addresses state estimation, track management, clutter rejection, misdetection and occlusion handling, in one single Bayes recursion. Generally, an online MOT algorithm would terminate a track that has not been detected over several frames. In many video MOT applications however, it is observed that away from designated exit regions such as scene edges, the longer an object is in the scene, the less likely it is to disappear. Intuitively, this observation can be used to delay the termination of tracks that have been occluded over an extended period, so as to improve occlusion handling. The use of labeled RFS in our proposed filter provides a principled and inexpensive means to exploit this observation for improved occlusion handling.
The remainder of the paper is structured as follows. The Bayesian filtering formulation of the MOT problem using labeled RFS is given in Section II, followed by details of the proposed solution in Section III. Performance evaluation of the proposed MOT filter against state-of-the-art trackers is presented in Section IV, and concluding remarks are given in Section V.
II Bayesian Multiple Object Tracking
This section outlines the RFS framework for MOT that accommodates uncertainty in the number of objects, the states of the objects and their trajectories. The salient feature of this framework is that it admits direct parallels between traditional Bayesian filtering and MOT. The modeling of the multiobject state as an RFS in Subsection II-A enables Bayesian filtering concepts to be directly translated to the multiobject case in Subsection II-B. Subsection II-C examines the MOT problem in the presence of occlusion.
II-A Multiobject State
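To distinguish different object trajectories in a multiobject setting, each object is assigned a unique label $\ell$ that consists of an ordered pair $(k,i)$, where $k$ is the time of birth and $i$ is the index of individual objects born at the same time [7]. For example, if two objects appear in the scene at time 1, one is assigned label (1,1) while the other is assigned label (1,2), see Fig. 1. A trajectory or track is the sequence of states with the same label.
Formally, the state of an object at time $k$ is a vector $\mathbf{x}=(x,\ell)\in\mathbb{X}\times\mathbb{L}_k$, where $\mathbb{L}_k$ denotes the label space for objects at time $k$ (including those born prior to $k$). Note that $\mathbb{L}_k$ is given by $\mathbb{B}_k\cup\mathbb{L}_{k-1}$, where $\mathbb{B}_k$ denotes the label space for objects born at time $k$ (and is disjoint from $\mathbb{L}_{k-1}$). Suppose that there are $N_k$ objects at time $k$, with states $\mathbf{x}_{k,1},\ldots,\mathbf{x}_{k,N_k}$. In the context of MOT, the collection of states, referred to as the multiobject state, is naturally represented as a finite set
$$\mathbf{X}_k=\{\mathbf{x}_{k,1},\ldots,\mathbf{x}_{k,N_k}\}\in\mathcal{F}(\mathbb{X}\times\mathbb{L}_k),$$
where $\mathcal{F}(\mathbb{X}\times\mathbb{L}_k)$ denotes the space of finite subsets of $\mathbb{X}\times\mathbb{L}_k$. We denote the cardinality (number of elements) of $\mathbf{X}$ by $|\mathbf{X}|$ and the set of labels of $\mathbf{X}$, $\{\ell:(x,\ell)\in\mathbf{X}\}$, by $\mathcal{L}(\mathbf{X})$. Note that since the label is unique, no two objects have the same label, i.e. $\delta_{|\mathbf{X}|}[|\mathcal{L}(\mathbf{X})|]=1$. Hence $\Delta(\mathbf{X})\triangleq\delta_{|\mathbf{X}|}[|\mathcal{L}(\mathbf{X})|]$ is called the distinct label indicator.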
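The labeling convention and the distinct label indicator can be illustrated with a minimal sketch. The `LabeledState` container and the function names are hypothetical, chosen only for illustration; a multiobject state is passed as a plain list so that duplicate labels remain visible.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class LabeledState(NamedTuple):
    """Hypothetical container for one labeled single-object state."""
    x: Tuple[float, ...]    # unlabeled state vector (layout is an assumption)
    label: Tuple[int, int]  # (time of birth, index among objects born then)


def label_set(states: Iterable[LabeledState]) -> Set[Tuple[int, int]]:
    """Return the set of labels carried by a collection of labeled states."""
    return {s.label for s in states}


def distinct_label_indicator(states: Iterable[LabeledState]) -> int:
    """Return 1 if every state carries a different label, otherwise 0."""
    states = list(states)
    return int(len(label_set(states)) == len(states))


# Two objects born at time 1 receive labels (1, 1) and (1, 2).
scene = [LabeledState((0.0, 0.0), (1, 1)), LabeledState((3.0, 1.0), (1, 2))]
print(distinct_label_indicator(scene))
```

A collection in which two states share a label is not a valid multiobject state, and the indicator returns 0 for it.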
For the rest of the paper, we follow the convention that single-object states are represented by lowercase letters (e.g. $x$, $\mathbf{x}$), while multiobject states are represented by uppercase letters (e.g. $X$, $\mathbf{X}$), symbols for labeled states and their distributions are boldfaced to distinguish them from unlabeled ones (e.g. $\mathbf{x}$, $\mathbf{X}$, $\boldsymbol{\pi}$, etc.), and spaces are represented by blackboard bold (e.g. $\mathbb{X}$, $\mathbb{Z}$, $\mathbb{L}$, $\mathbb{N}$, etc.). The list of variables $X_m,X_{m+1},\ldots,X_n$ is abbreviated as $X_{m:n}$. We denote a generalization of the Kronecker delta that takes arbitrary arguments such as sets, vectors, integers etc., by
$$\delta_Y[X]\triangleq\begin{cases}1, & \text{if } X=Y\\ 0, & \text{otherwise.}\end{cases}$$
For a given set $S$, $1_S(\cdot)$ denotes the indicator function of $S$, and $\mathcal{F}(S)$ denotes the class of finite subsets of $S$. For a finite set $X$, the multiobject exponential notation $f^X$ denotes the product $\prod_{x\in X}f(x)$, with $f^{\emptyset}=1$. The inner product $\int f(x)g(x)\,dx$ is denoted by $\langle f,g\rangle$.
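Two of these notational devices translate directly into code. The sketch below is illustrative only, and the function names are hypothetical: it implements the generalized Kronecker delta as an equality test and the multiobject exponential as a product over a finite set, with the empty product equal to one.

```python
import math
from typing import Callable, Hashable, Iterable


def kronecker_delta(y: Hashable, x: Hashable) -> int:
    """Generalized Kronecker delta: 1 if the two arguments are equal, else 0."""
    return int(x == y)


def multiobject_exponential(f: Callable[[float], float], X: Iterable[float]) -> float:
    """Product of f over the elements of a finite set; equals 1 for the empty set."""
    return math.prod(f(x) for x in X)


print(kronecker_delta(frozenset({1, 2}), frozenset({2, 1})))
print(multiobject_exponential(lambda x: 2.0 * x, {1.0, 2.0, 3.0}))
```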
II-B Multiobject Bayes Filter
From a Bayesian estimation viewpoint the multiobject state is naturally modeled as an RFS or a simple finite point process [23]. While the space $\mathcal{F}(\mathbb{X}\times\mathbb{L}_k)$ does not inherit the Euclidean notion of probability density, Mahler's Finite Set Statistics (FISST) provides a suitable notion of integration/density for RFSs [22, 24]. This approach is mathematically consistent with measure theoretic integration/density but circumvents measure theoretic constructs [25].
At time $k$, the multiobject state $\mathbf{X}_k$ is observed as an image $y_k$. All information on the set of object trajectories conditioned on the observation history $y_{1:k}$ is captured in the multiobject posterior density
$$\boldsymbol{\pi}_{0:k}(\mathbf{X}_{0:k}|y_{1:k})\propto\boldsymbol{\pi}_0(\mathbf{X}_0)\prod_{j=1}^{k}g_j(y_j|\mathbf{X}_j)\,\mathbf{f}_{j|j-1}(\mathbf{X}_j|\mathbf{X}_{j-1}),$$
where $\boldsymbol{\pi}_0$ is the initial prior, $g_j(\cdot|\cdot)$ is the multiobject likelihood function at time $j$, and $\mathbf{f}_{j|j-1}(\cdot|\cdot)$ is the multiobject transition density to time $j$. The multiobject likelihood function encapsulates the underlying observation model while the multiobject transition density encapsulates the underlying models for motions, births and deaths of objects. Note that track management is incorporated into the Bayes recursion via the multiobject state with distinct labels.
MCMC approximations of the posterior density have been proposed in
[26, 27] for detection measurements and image measurements
respectively. Results on satellite imaging applications reported in [27] are very impressive. However, these techniques are still expensive
and not suitable for online application.
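For real-time tracking, a more tractable alternative is the multiobject filtering density, a marginal of the multiobject posterior. For notational compactness, herein we omit the dependence on data in the multiobject densities. The multiobject filtering density can be recursively propagated by the multiobject Bayes filter [23], [28] according to the following prediction and update
$$\boldsymbol{\pi}_{k+1|k}(\mathbf{X}_{k+1})=\int\mathbf{f}_{k+1|k}(\mathbf{X}_{k+1}|\mathbf{X}_k)\,\boldsymbol{\pi}_k(\mathbf{X}_k)\,\delta\mathbf{X}_k, \tag{1}$$
$$\boldsymbol{\pi}_{k+1}(\mathbf{X}_{k+1})=\frac{g_{k+1}(y_{k+1}|\mathbf{X}_{k+1})\,\boldsymbol{\pi}_{k+1|k}(\mathbf{X}_{k+1})}{\int g_{k+1}(y_{k+1}|\mathbf{X})\,\boldsymbol{\pi}_{k+1|k}(\mathbf{X})\,\delta\mathbf{X}}, \tag{2}$$
where the integral is a set integral defined for any function $\mathbf{f}:\mathcal{F}(\mathbb{X}\times\mathbb{L}_k)\to\mathbb{R}$ by
$$\int\mathbf{f}(\mathbf{X})\,\delta\mathbf{X}=\sum_{i=0}^{\infty}\frac{1}{i!}\int\mathbf{f}(\{\mathbf{x}_1,\ldots,\mathbf{x}_i\})\,d(\mathbf{x}_1,\ldots,\mathbf{x}_i).$$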
Bayes optimal multiobject estimators can be formulated by minimizing the
Bayes risk with ordinary integrals replaced by set integrals as in [24]. One such estimator is the marginal multiobject estimator
[22].
A generic particle implementation of the Bayes multiobject filter (1)-(2) was proposed in [25] and applied to labeled multiobject states in [11].
The Generalized Labeled Multi-Bernoulli (GLMB) filter is an analytic
solution to the Bayes multiobject filter, under the standard multiobject
dynamic and observation models [7].
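II-B1 Standard multiobject dynamic model
Given the multiobject state $\mathbf{X}_k$ (at time $k$), each state $(x_k,\ell_k)\in\mathbf{X}_k$ either survives with probability $P_S(x_k,\ell_k)$ and evolves to a new state $(x_{k+1},\ell_{k+1})$ (at time $k+1$) with probability density $f_{k+1|k}(x_{k+1}|x_k,\ell_k)\,\delta_{\ell_k}[\ell_{k+1}]$, or dies with probability $1-P_S(x_k,\ell_k)$. The set $\mathbf{B}_{k+1}$ of new objects born at time $k+1$ is distributed according to the labeled multi-Bernoulli (LMB)
$$\Delta(\mathbf{B}_{k+1})\,\big[1_{\mathbb{B}_{k+1}}\,r_{B,k+1}\big]^{\mathcal{L}(\mathbf{B}_{k+1})}\,\big[1-r_{B,k+1}\big]^{\mathbb{B}_{k+1}-\mathcal{L}(\mathbf{B}_{k+1})}\,p_{B,k+1}^{\mathbf{B}_{k+1}},$$
where $r_{B,k+1}(\ell)$ is the probability that a new object with label $\ell$ is born, and $p_{B,k+1}(\cdot,\ell)$ is the distribution of its kinematic state [7]. The multiobject state $\mathbf{X}_{k+1}$ (at time $k+1$) is the superposition of surviving objects and new born objects. It is assumed that, conditional on $\mathbf{X}_k$, objects move, appear and die independently of each other. The expression for the multiobject transition density can be found in [7, 29]. The standard multiobject dynamic model enables the Bayes multiobject filter to address motion, births and deaths of objects.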
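The generative side of the standard dynamic model can be sketched as a single simulation step. This is an illustration under stated assumptions rather than part of the filter itself: the multiobject state is held as a dictionary from labels to unlabeled states, and `predict_step`, `p_survive`, `move` and `births` are hypothetical names.

```python
import random
from typing import Callable, Dict, List, Tuple

Label = Tuple[int, int]


def predict_step(
    X: Dict[Label, float],
    k: int,
    p_survive: Callable[[float, Label], float],
    move: Callable[[float, random.Random], float],
    births: List[Tuple[float, Callable[[random.Random], float]]],
    rng: random.Random,
) -> Dict[Label, float]:
    """Simulate one step of survival, motion and labeled multi-Bernoulli birth."""
    X_next: Dict[Label, float] = {}
    # Each existing object survives independently and keeps its label.
    for label, x in X.items():
        if rng.random() < p_survive(x, label):
            X_next[label] = move(x, rng)
    # Each birth component independently produces at most one new object,
    # labeled by the birth time k + 1 and the component index.
    for i, (r_birth, sample_state) in enumerate(births, start=1):
        if rng.random() < r_birth:
            X_next[(k + 1, i)] = sample_state(rng)
    return X_next


X1 = predict_step(
    {(1, 1): 0.0, (1, 2): 5.0},
    k=1,
    p_survive=lambda x, label: 1.0,
    move=lambda x, rng: x + 1.0,
    births=[(1.0, lambda rng: 10.0)],
    rng=random.Random(0),
)
print(X1)
```

Because surviving objects keep their labels and newborn labels carry the new birth time, the labels in the predicted state remain distinct by construction.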
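II-B2 Standard multiobject observation model
In most applications a designated detection operation $D$ is applied to $y_k$, resulting in a set of points
$$Z_k=D(y_k). \tag{3}$$
Since the detection process is not perfect, false positives and false negatives are inevitable. Hence only a subset of $Z_k$ corresponds to some objects in the scene (not all objects are detected) while the remainder are false positives. The most popular detection-based observation model is described in the following.
For a given multiobject state $\mathbf{X}_k$, each $(x,\ell)\in\mathbf{X}_k$ is either detected with probability $P_D(x,\ell)$ and generates a detection $z\in Z_k$ with likelihood $g_D(z|x,\ell)$, or missed with probability $1-P_D(x,\ell)$. The multiobject observation $Z_k$ is the superposition of the observations from detected objects and Poisson clutter with intensity $\kappa$. The ratio
$$\sigma_D(z|x,\ell)\triangleq\frac{g_D(z|x,\ell)}{\kappa(z)} \tag{4}$$
can be interpreted as the detection signal-to-noise ratio (SNR).
Assuming that, conditional on $\mathbf{X}_k$, detections are independent of each other and clutter, the multiobject likelihood function is given by [22], [7, 29]
$$g(Z_k|\mathbf{X}_k)\propto\sum_{\theta\in\Theta(\mathcal{L}(\mathbf{X}_k))}\ \prod_{(x,\ell)\in\mathbf{X}_k}\psi_{Z_k}^{(\theta(\ell))}(x,\ell), \tag{5}$$
where: $\Theta$ is the set of positive 1-1 maps $\theta:\mathbb{L}_k\to\{0{:}|Z_k|\}$, i.e. maps such that no two distinct arguments are mapped to the same positive value, with $\Theta(I)$ denoting the subset of such maps with domain $I$; and
$$\psi_{\{z_{1:M}\}}^{(j)}(x,\ell)=\begin{cases}P_D(x,\ell)\,\sigma_D(z_j|x,\ell), & j=1{:}M\\ 1-P_D(x,\ell), & j=0.\end{cases} \tag{6}$$
The map $\theta$ specifies which objects generated which detections, i.e. object $\ell$ generates detection $z_{\theta(\ell)}\in Z_k$, with undetected objects assigned to $0$. The positive 1-1 property means that $\theta$ is 1-1 on $\{\ell:\theta(\ell)>0\}$, the set of labels that are assigned positive values, and ensures that any detection in $Z_k$ is assigned to at most one object.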
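To make the association maps concrete, the following sketch enumerates all positive 1-1 maps by brute force and evaluates the unnormalized sum over associations described above. It assumes the per-object factor is the miss probability for an undetected object and the detection probability times the detection SNR otherwise; all function names are hypothetical, and exhaustive enumeration is only feasible for very small problems.

```python
from itertools import product
from typing import Callable, Dict, Hashable, Iterator, List, Sequence


def positive_one_to_one_maps(labels: Sequence[Hashable], num_detections: int) -> Iterator[Dict[Hashable, int]]:
    """Yield every map label -> {0, ..., M} in which no positive value repeats."""
    labels = list(labels)
    for values in product(range(num_detections + 1), repeat=len(labels)):
        positive = [v for v in values if v > 0]
        if len(positive) == len(set(positive)):
            yield dict(zip(labels, values))


def standard_likelihood(
    X: Dict[Hashable, float],
    Z: List[float],
    p_detect: Callable[[float, Hashable], float],
    g_detect: Callable[[float, float, Hashable], float],
    clutter_intensity: Callable[[float], float],
) -> float:
    """Unnormalized detection likelihood: sum over maps of per-object factors."""
    total = 0.0
    for theta in positive_one_to_one_maps(list(X), len(Z)):
        term = 1.0
        for label, x in X.items():
            j = theta[label]
            pd = p_detect(x, label)
            if j == 0:
                term *= 1.0 - pd  # object missed
            else:
                z = Z[j - 1]      # detections are indexed from 1
                term *= pd * g_detect(z, x, label) / clutter_intensity(z)
        total += term
    return total


print(sum(1 for _ in positive_one_to_one_maps(["a", "b"], 2)))
```

With two labels and two detections there are nine maps into {0, 1, 2}, of which the two that assign the same detection to both objects are excluded.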
The standard multiobject observation model enables the Bayes
multiobject filter to address misdetection and false detection, but not
occlusion. It assumes that each object is detected independently from each
other, and that a detection cannot be assigned to more than one object. This
is clearly not valid in occlusions.
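The generative process behind the standard observation model can be simulated in a few lines. This is a sketch under assumptions: clutter is drawn with a Poisson-distributed count using Knuth's multiplication method (adequate for small means), and `simulate_detections`, `observe` and `sample_clutter` are hypothetical names.

```python
import math
import random
from typing import Callable, Dict, Hashable, List


def sample_poisson(mean: float, rng: random.Random) -> int:
    """Draw a Poisson count with Knuth's multiplication method (small means only)."""
    limit, count, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return count
        count += 1


def simulate_detections(
    X: Dict[Hashable, float],
    p_detect: Callable[[float, Hashable], float],
    observe: Callable[[float, random.Random], float],
    clutter_rate: float,
    sample_clutter: Callable[[random.Random], float],
    rng: random.Random,
) -> List[float]:
    """Superpose detections from independently detected objects with Poisson clutter."""
    Z = [observe(x, rng) for label, x in X.items() if rng.random() < p_detect(x, label)]
    Z.extend(sample_clutter(rng) for _ in range(sample_poisson(clutter_rate, rng)))
    rng.shuffle(Z)  # the detection set carries no information about its origin
    return Z


Z = simulate_detections(
    {(1, 1): 0.0, (1, 2): 4.0},
    p_detect=lambda x, label: 0.9,
    observe=lambda x, rng: x + rng.gauss(0.0, 0.1),
    clutter_rate=2.0,
    sample_clutter=lambda rng: rng.uniform(-10.0, 10.0),
    rng=random.Random(0),
)
print(len(Z))
```

Each object contributes at most one detection here, which is exactly the assumption that fails when one object occludes another.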
II-C Bayes Optimal Occlusion Handling
By relaxing the assumption that each object is independently detected, a multiobject observation model that explicitly addresses occlusion (as well as misdetections and false positives) was proposed in [30]. The main difference between this so-called merged-measurement model and the standard model is the idea that each group of objects (instead of each object) in the multiobject state generates at most one detection [30]. Fig. 2 shows various partitions or groupings of a multiobject state with five objects.
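A partition $\mathcal{U}_S$ of a finite set $S$ is a collection of mutually exclusive subsets of $S$, whose union is $S$. The collection of all partitions of $S$ is denoted by $\mathcal{P}(S)$. It is assumed that given a partition $\mathcal{U}_{\mathbf{X}}$, each group $\mathbf{Y}\in\mathcal{U}_{\mathbf{X}}$ generates at most one detection with probability $P_D(\mathbf{Y})$, independent of other groups, and that conditional on detection generates $z$ with likelihood $g_D(z|\mathbf{Y})$.
Let $\mathcal{L}(\mathcal{U}_{\mathbf{X}})$ denote the collection of labels of the partition $\mathcal{U}_{\mathbf{X}}$, i.e. $\{\mathcal{L}(\mathbf{Y}):\mathbf{Y}\in\mathcal{U}_{\mathbf{X}}\}$ (note that $\mathcal{L}(\mathcal{U}_{\mathbf{X}})$ forms a partition of $\mathcal{L}(\mathbf{X})$). Let $\Theta(\mathcal{L}(\mathcal{U}_{\mathbf{X}}))$ denote the class of all positive 1-1 mappings $\theta:\mathcal{L}(\mathcal{U}_{\mathbf{X}})\to\{0{:}|Z|\}$. Then, the likelihood that a given partition $\mathcal{U}_{\mathbf{X}}$ of a multiobject state $\mathbf{X}$ generates the detection set $Z$ is
$$g(Z|\mathcal{U}_{\mathbf{X}})\propto\sum_{\theta\in\Theta(\mathcal{L}(\mathcal{U}_{\mathbf{X}}))}\ \prod_{\mathbf{Y}\in\mathcal{U}_{\mathbf{X}}}\psi_{Z}^{(\theta(\mathcal{L}(\mathbf{Y})))}(\mathbf{Y}), \tag{7}$$
where
$$\psi_{\{z_{1:M}\}}^{(j)}(\mathbf{Y})=\begin{cases}P_D(\mathbf{Y})\,\sigma_D(z_j|\mathbf{Y}), & j=1{:}M\\ 1-P_D(\mathbf{Y}), & j=0,\end{cases}$$
with $\sigma_D(z_j|\mathbf{Y})=g_D(z_j|\mathbf{Y})/\kappa(z_j)$ denoting the detection SNR for group $\mathbf{Y}$. The merged-measurement likelihood function is obtained by summing the group likelihoods (7) over all partitions of $\mathbf{X}$ [30]:
$$g(Z|\mathbf{X})=\sum_{\mathcal{U}_{\mathbf{X}}\in\mathcal{P}(\mathbf{X})}g(Z|\mathcal{U}_{\mathbf{X}}).$$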
The multiobject filter (1)-(2) with merged-measurement likelihood is Bayes optimal in the sense that the filtering density contains all information on the current multiobject state in the presence of false positives, misdetections and occlusions. Unfortunately, this filter is numerically intractable due to the sum over all partitions of the multiobject state in the merged-measurement likelihood. At present, there is no polynomial-time technique for truncating sums over partitions. Moreover, given a partition, computations involving the joint detection probability, the joint likelihood and the associated joint densities are intractable except for scenarios with a few objects.
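The growth of the sum over partitions is easy to see by enumeration. The recursive generator below is an illustrative sketch (the function name is hypothetical): it places the first element either into each block of a partition of the remaining elements or into a block of its own, so the number of partitions it yields is the Bell number of the set size.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def partitions(items: Sequence[T]) -> Iterator[List[List[T]]]:
    """Yield every partition of a finite collection as a list of blocks."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Add the first element to each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or give it a block of its own.
        yield [[first]] + part


for n in (3, 5, 8):
    print(n, sum(1 for _ in partitions(range(n))))
```

Five objects already admit 52 groupings and eight objects admit 4140, before any association of groups to detections is considered.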
A GLMB approximation that reduces the number of partitions using the cluster structure of the predicted multiobject state and the sensor's resolution capabilities was proposed in [30]. Also, joint densities are approximated by products of independent densities that minimise the Kullback-Leibler divergence [12]. Case studies on MOT with bearings-only measurements show very good tracking performance. Nonetheless, at present, this filter is still computationally demanding and therefore not suitable for online MOT with image data.
III GLMB Filter for Tracking with Image Data
The GLMB filter (with the standard measurement likelihood) is a suitable
candidate for online MOT [29, 31]. However, it is neither designed
to handle occlusion nor image data. Even though occluded objects share the
observations of the occluding objects, this situation is not permitted in
the standard multiobject likelihood. Consequently, uncertainties in the
states of occluded objects grow, while their existence probabilities quickly
diminish to zero, leading to possible hijacking, and premature track
termination in longer occlusions.
This section proposes an efficient GLMB filter that exploits
information from image data and addresses false positives, misdetections
and occlusions. Subsection III-A extends the standard
observation model to allow occluded objects to share observations at the
image level while Subsection III-B incorporates, into the death
model, domain knowledge that misdetected tracks with long durations are
unlikely to disappear. The GLMB filter for image data and an efficient
implementation are then described in Subsections III-C and III-D.
III-A Hybrid Multi-Object Measurement Likelihood
While the detection set $Z$ is an efficient compression of the image observation $y$, misdetected (including occluded) objects will not be updated by the filter. On the other hand, the uncompressed observation $y$ contains relevant information about all objects, but updating with $y$ is computationally expensive. Conceptually, we can have the best of both worlds by updating detected objects with the associated detections and misdetected objects with the image observations localised to regions where these objects are expected. More importantly, this strategy exploits the fact that occluded objects share measurements with the objects occluding them, as illustrated in Fig. 3.
A hybrid tractable multiobject likelihood function that
accommodates both detection and image observations can be obtained as
follows. For tractability, it is assumed that each object generates
observation independently from each other (similar to the standard
observation model).
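Given an object with state $(x,\ell)$, the likelihood of observing the local image $T(y)$ (some transformation of the image $y$) is $g_T(T(y)|x,\ell)$. On the other hand, given that there are no objects, the likelihood of observing $T(y)$ is $g_{T,0}(T(y))$. The ratio
$$\sigma_T(T(y)|x,\ell)\triangleq\frac{g_T(T(y)|x,\ell)}{g_{T,0}(T(y))} \tag{8}$$
can be interpreted as the image SNR (cf. detection SNR (4)).
For a given association map $\theta$ in the likelihood function (5), an object with state $(x,\ell)$ is misdetected if $\theta(\ell)=0$, in which case the value of $\psi_Z^{(\theta(\ell))}(x,\ell)$ is $1-P_D(x,\ell)$, the probability of a miss. Consequently, after the Bayes update, track $\ell$ has no dependence on the observation $y$. In order for track $\ell$ to be updated with the local image $T(y)$, the value of $\psi_Z^{(0)}(x,\ell)$ should be scaled by the image SNR $\sigma_T(T(y)|x,\ell)$. Note that the value of $\psi_Z^{(\theta(\ell))}(x,\ell)$ should remain unchanged for $\theta(\ell)>0$. Formally, this can be achieved by defining an extension of (6) as follows
$$\varphi_y^{(j)}(x,\ell)=\begin{cases}\sigma_T(T(y)|x,\ell)\,\big(1-P_D(x,\ell)\big), & j=0\\ \sigma_D(z_j|x,\ell)\,P_D(x,\ell), & j>0,\end{cases} \tag{9}$$
where $\{z_{1:M}\}=D(y)$. In other words, for $j=0$, $\varphi_y^{(j)}(x,\ell)$ is equal to the image SNR (8) scaled by the misdetection probability, otherwise it is equal to the detection SNR (4) scaled by the detection probability.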
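The per-object factor of the hybrid likelihood can be written as a small function. This is a minimal sketch following the verbal description above: a missed object contributes the image SNR scaled by the misdetection probability, and a detected object contributes the detection SNR scaled by the detection probability. The function and argument names are hypothetical, and the SNR functions are supplied by the caller.

```python
from typing import Callable, Hashable, Sequence


def hybrid_factor(
    j: int,
    x: float,
    label: Hashable,
    detections: Sequence[float],
    local_image: object,
    p_detect: Callable[[float, Hashable], float],
    detection_snr: Callable[[float, float, Hashable], float],
    image_snr: Callable[[object, float, Hashable], float],
) -> float:
    """Per-object factor: image-based when missed (j == 0), detection-based otherwise."""
    pd = p_detect(x, label)
    if j == 0:
        # Missed or occluded: fall back on the local image evidence.
        return (1.0 - pd) * image_snr(local_image, x, label)
    # Detected: use the associated detection (indexed from 1).
    return pd * detection_snr(detections[j - 1], x, label)


miss = hybrid_factor(0, 1.0, (1, 1), [0.9], None,
                     lambda x, l: 0.75, lambda z, x, l: 2.0, lambda img, x, l: 4.0)
hit = hybrid_factor(1, 1.0, (1, 1), [0.9], None,
                    lambda x, l: 0.75, lambda z, x, l: 2.0, lambda img, x, l: 4.0)
print(miss, hit)
```

When the image SNR is identically one, the missed-object factor reduces to the plain miss probability of the standard model.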
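Given a state $(x,\ell)$, $\varphi_y^{(j)}(x,\ell)$ plays the same role as $\psi_Z^{(j)}(x,\ell)$, but accommodates both detection measurements and image measurements. Moreover, since each object generates observations independently of the others, the hybrid multiobject likelihood function has the same form as (5), but with $\psi_Z^{(\theta(\ell))}$ replaced by $\varphi_y^{(\theta(\ell))}$, i.e.
$$g(y|\mathbf{X})\propto\sum_{\theta\in\Theta(\mathcal{L}(\mathbf{X}))}\ \prod_{(x,\ell)\in\mathbf{X}}\varphi_y^{(\theta(\ell))}(x,\ell). \tag{10}$$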
In visual occlusions, the hybrid likelihood allows occluded objects to share the image observations of the objects that occlude them. Moreover, when integrated into the Bayes recursion (1)-(2), a track-length-dependent survival probability, in combination with information from the image observation, reduces uncertainties in the states of occluded objects and maintains their existence probabilities to keep the tracks alive. Hence, hijacking and premature track termination in longer occlusions are mitigated.
Remark: The hybrid multiobject likelihood function (10) is a generalization of both the standard multiobject likelihood and the separable likelihood in [10]. When $P_D(x,\ell)=1$ for each $(x,\ell)\in\mathbf{X}$, i.e. there is no misdetection, the hybrid likelihood (10) is the same as the standard likelihood (5). On the other hand, if $P_D(x,\ell)=0$ for each $(x,\ell)\in\mathbf{X}$, i.e. there is no detection, then the only nonzero term in the hybrid likelihood (10) is the one with $\theta(\ell)=0$ for all $\ell$. In this case, the hybrid likelihood (10) reduces to the separable likelihood in [10]. For a general detection profile $P_D$, the hybrid likelihood (10) reduces to the standard likelihood (5) when the image SNR (8) equals one for each $(x,\ell)\in\mathbf{X}$.
Note that a hybrid likelihood function can also be developed for the merged-measurement model. However, the resulting multiobject filter still suffers from the same intractability as the merged-measurement filter.
III-B Death Model
In most video MOT applications, if an object stays in the scene for a long time, then it is more likely to continue to do so, provided it is not close to the designated exit regions. Such prior empirical knowledge can be used to improve occlusion handling, especially long occlusions that can lead to premature track termination in online MOT algorithms. In general, the GLMB filter would terminate an object that has not been detected over several frames. However, if this object has been in the scene for some time and is not in the proximity of designated exit regions, then it is highly likely to be occluded and track termination should be delayed. The labeled RFS formulation enables such prior information to be incorporated into track termination in a principled manner, via the survival probability.
The labeled RFS formulation accommodates survival probabilities that depend on track lengths since a labeled state contains the time of birth in its label, and the track length is simply the difference between the current time and the time of birth. In practice, it is unlikely for an object to disappear in the middle of the visual scene (even if misdetected or occluded) whereas it is more likely to disappear near designated exit regions due to the scene structure (e.g. the borders of the scene). Hence, we require the survival probability to be large (close to one) in the middle of the scene and small (close to zero) on the edges or designated death regions. Furthermore, since objects staying in the scene for a long time are more certain to continue existing, we require the survival probability to increase to one as its track length increases.
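An explicit form of the survival probability that satisfies these requirements is given by
$$P_S(x,\ell)=\frac{b(x)}{1+\exp\!\big(-\gamma\,(k-\ell[1,0]^T)\big)}, \tag{11}$$
where $b(x)$ is a scene mask that represents the scene structure, e.g., entrance or exit as illustrated in Fig. 4, $\gamma$ is a control parameter of the sigmoid function, and $k-\ell[1,0]^T$ is the track length, i.e. the current time minus the time of birth stored in the first entry of the label. The scene mask can be learned from a set of training data or designed from the known scene structure.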
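One way to realize a survival probability with these properties is a scene mask multiplied by a sigmoid of the track length. The sketch below is an assumption about the functional form, not a transcription of (11): the function name, the mask and the parameter values are hypothetical, and the track length is computed from the birth time stored in the label.

```python
import math
from typing import Callable, Tuple


def survival_probability(
    x: float,
    label: Tuple[int, int],
    k: int,
    scene_mask: Callable[[float], float],
    gamma: float,
) -> float:
    """Scene mask times a sigmoid in the track length (assumed form)."""
    birth_time = label[0]          # the label is (time of birth, index)
    track_length = k - birth_time
    return scene_mask(x) / (1.0 + math.exp(-gamma * track_length))


def mask(x: float) -> float:
    """Hypothetical 1-D scene of width 100 with exit regions near both borders."""
    return 0.05 if (x < 5.0 or x > 95.0) else 1.0


young = survival_probability(50.0, (10, 1), k=12, scene_mask=mask, gamma=0.5)
old = survival_probability(50.0, (10, 1), k=60, scene_mask=mask, gamma=0.5)
edge = survival_probability(2.0, (10, 1), k=60, scene_mask=mask, gamma=0.5)
print(young, old, edge)
```

A long-lived track in the middle of the scene gets a survival probability close to one, so a few missed frames do not terminate it, while the same track near an exit region is allowed to die quickly.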
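III-C GLMB Recursion
A GLMB density can be written in the following form
$$\boldsymbol{\pi}(\mathbf{X})=\Delta(\mathbf{X})\sum_{\xi\in\Xi}\sum_{I\subseteq\mathbb{L}}\omega^{(I,\xi)}\,\delta_I[\mathcal{L}(\mathbf{X})]\,\big[p^{(\xi)}\big]^{\mathbf{X}}, \tag{12}$$
where each $\xi\in\Xi$ represents a history of association maps $\xi=(\theta_{1:k})$, each $p^{(\xi)}(\cdot,\ell)$ is a probability density on $\mathbb{X}$, and each $\omega^{(I,\xi)}$ is non-negative with $\sum_{\xi\in\Xi}\sum_{I\subseteq\mathbb{L}}\omega^{(I,\xi)}=1$. The cardinality distribution of a GLMB is given by
$$\Pr(|\mathbf{X}|=n)=\sum_{\xi\in\Xi}\sum_{I\subseteq\mathbb{L}}\delta_n[|I|]\,\omega^{(I,\xi)}, \tag{13}$$
while the existence probability and probability density of track $\ell\in\mathbb{L}$ are respectively
$$r(\ell)=\sum_{\xi\in\Xi}\sum_{I\subseteq\mathbb{L}}1_I(\ell)\,\omega^{(I,\xi)}, \tag{14}$$
$$p(x,\ell)=\frac{1}{r(\ell)}\sum_{\xi\in\Xi}\sum_{I\subseteq\mathbb{L}}1_I(\ell)\,\omega^{(I,\xi)}\,p^{(\xi)}(x,\ell). \tag{15}$$
Given the GLMB density (12), an intuitive multiobject estimator is the multi-Bernoulli estimator, which first determines the set of labels $L\subseteq\mathbb{L}$ with existence probabilities above a prescribed threshold, and second the MAP/mean estimates from the densities $p(\cdot,\ell)$, $\ell\in L$, for the states of the objects. A popular estimator is a suboptimal version of the Marginal Multiobject Estimator [22], which first determines the pair $(I,\xi)$ with the highest weight $\omega^{(I,\xi)}$ such that $|I|$ coincides with the MAP cardinality estimate, and second the MAP/mean estimates from $p^{(\xi)}(\cdot,\ell)$, $\ell\in I$, for the states of the objects.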
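The cardinality distribution, the existence probabilities and the first step of the suboptimal marginal multiobject estimator depend only on the label sets and weights of the GLMB components. The sketch below assumes the components are stored as a list of (label set, weight) pairs with weights summing to one; the function names and this storage layout are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Label = Tuple[int, int]
Component = Tuple[FrozenSet[Label], float]  # (label set, weight)


def cardinality_distribution(components: List[Component]) -> Dict[int, float]:
    """Sum the weights of all components that have the same number of labels."""
    rho: Dict[int, float] = defaultdict(float)
    for labels, weight in components:
        rho[len(labels)] += weight
    return dict(rho)


def existence_probabilities(components: List[Component]) -> Dict[Label, float]:
    """Sum, for each label, the weights of the components that contain it."""
    r: Dict[Label, float] = defaultdict(float)
    for labels, weight in components:
        for label in labels:
            r[label] += weight
    return dict(r)


def best_component_at_map_cardinality(components: List[Component]) -> Component:
    """Pick the highest-weight component whose size equals the MAP cardinality."""
    rho = cardinality_distribution(components)
    n_map = max(rho, key=rho.get)
    return max((c for c in components if len(c[0]) == n_map), key=lambda c: c[1])


glmb = [
    (frozenset({(1, 1)}), 0.4375),
    (frozenset({(1, 1), (1, 2)}), 0.3125),
    (frozenset({(1, 2), (2, 1)}), 0.25),
]
print(cardinality_distribution(glmb))
print(best_component_at_map_cardinality(glmb))
```

In this toy example the single heaviest component contains one object, but the MAP cardinality is two, so the estimator selects the heaviest two-object component instead.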
The GLMB family enjoys a number of nice analytical properties. The void probability functional (a necessary and sufficient statistic) of a GLMB, the Cauchy-Schwarz divergence between two GLMBs, and the $L_1$-distance between a GLMB and its truncation can all be computed in closed form [29]. The GLMB is flexible enough to approximate any labeled RFS density with matching intensity function and cardinality distribution [12]. More importantly, the GLMB family is closed under the prediction equation (1) and conjugate with respect to the standard observation likelihood [7].
In the following we show that the GLMB family is conjugate with respect to the hybrid observation likelihood function. Hence, starting from an initial GLMB prior, all multiobject predicted and updated densities propagated by the Bayes recursion (1)-(2) are GLMBs. For notational compactness, we drop the subscript $k$ for the current time; the next time is indicated by the subscript '$+$'.
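Proposition 1.
Suppose that the multiobject prediction density to time $k+1$ is a GLMB of the form
$$\boldsymbol{\pi}_+(\mathbf{X}_+)=\Delta(\mathbf{X}_+)\sum_{\xi,I_+}\omega_+^{(I_+,\xi)}\,\delta_{I_+}[\mathcal{L}(\mathbf{X}_+)]\,\big[p_+^{(\xi)}\big]^{\mathbf{X}_+}, \tag{16}$$
where $\xi\in\Xi$, $I_+\subseteq\mathbb{L}_+$. Then under the hybrid observation likelihood function (10), the filtering density at time $k+1$ is a GLMB of the form
$$\boldsymbol{\pi}_+(\mathbf{X}_+|y_+)\propto\Delta(\mathbf{X}_+)\sum_{\xi,I_+,\theta_+}\omega_{y_+}^{(I_+,\xi,\theta_+)}\,\delta_{I_+}[\mathcal{L}(\mathbf{X}_+)]\,\big[p_{y_+}^{(\xi,\theta_+)}\big]^{\mathbf{X}_+}, \tag{17}$$
where $\theta_+\in\Theta_+$, and
$$\omega_{y_+}^{(I_+,\xi,\theta_+)}=1_{\Theta_+(I_+)}(\theta_+)\,\big[\bar{\varphi}_{y_+}^{(\xi,\theta_+)}\big]^{I_+}\,\omega_+^{(I_+,\xi)}, \tag{18}$$
$$\bar{\varphi}_{y_+}^{(\xi,\theta_+)}(\ell)=\big\langle p_+^{(\xi)}(\cdot,\ell),\,\varphi_{y_+}^{(\theta_+(\ell))}(\cdot,\ell)\big\rangle, \tag{19}$$
$$p_{y_+}^{(\xi,\theta_+)}(x,\ell)=\frac{p_+^{(\xi)}(x,\ell)\,\varphi_{y_+}^{(\theta_+(\ell))}(x,\ell)}{\bar{\varphi}_{y_+}^{(\xi,\theta_+)}(\ell)}. \tag{20}$$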
Proof.
In this work we adopt the joint prediction and update strategy [31] for the proposed video MOT GLMB filter. Using the same line of arguments as in [31], yields the following recursion
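Proposition 2.
Given the GLMB filtering density (12) at time $k$, the filtering density at time $k+1$ is:
$$\boldsymbol{\pi}_+(\mathbf{X}_+|y_+)\propto\Delta(\mathbf{X}_+)\sum_{I,\xi,I_+,\theta_+}\omega^{(I,\xi)}\,\omega_{y_+}^{(I,\xi,I_+,\theta_+)}\,\delta_{I_+}[\mathcal{L}(\mathbf{X}_+)]\,\big[p_{y_+}^{(\xi,\theta_+)}\big]^{\mathbf{X}_+}, \tag{21}$$
where $I\in\mathcal{F}(\mathbb{L})$, $\xi\in\Xi$, $I_+\in\mathcal{F}(\mathbb{L}_+)$, $\theta_+\in\Theta_+$, and
$$\omega_{y_+}^{(I,\xi,I_+,\theta_+)}=1_{\Theta_+(I_+)}(\theta_+)\,\big[1-\bar{P}_S^{(\xi)}\big]^{I-I_+}\big[\bar{P}_S^{(\xi)}\big]^{I\cap I_+}\big[1-r_{B,+}\big]^{\mathbb{B}_+-I_+}\,r_{B,+}^{\mathbb{B}_+\cap I_+}\,\big[\bar{\varphi}_{y_+}^{(\xi,\theta_+)}\big]^{I_+}, \tag{22}$$
$$\bar{P}_S^{(\xi)}(\ell)=\big\langle p^{(\xi)}(\cdot,\ell),\,P_S(\cdot,\ell)\big\rangle, \tag{23}$$
$$\bar{p}_+^{(\xi)}(x_+,\ell_+)=1_{\mathbb{L}}(\ell_+)\,\frac{\big\langle P_S(\cdot,\ell_+)\,f_+(x_+|\cdot,\ell_+),\,p^{(\xi)}(\cdot,\ell_+)\big\rangle}{\bar{P}_S^{(\xi)}(\ell_+)}+1_{\mathbb{B}_+}(\ell_+)\,p_{B,+}(x_+,\ell_+), \tag{24}$$
$$\bar{\varphi}_{y_+}^{(\xi,\theta_+)}(\ell_+)=\big\langle\bar{p}_+^{(\xi)}(\cdot,\ell_+),\,\varphi_{y_+}^{(\theta_+(\ell_+))}(\cdot,\ell_+)\big\rangle, \tag{25}$$
$$p_{y_+}^{(\xi,\theta_+)}(x_+,\ell_+)=\frac{\bar{p}_+^{(\xi)}(x_+,\ell_+)\,\varphi_{y_+}^{(\theta_+(\ell_+))}(x_+,\ell_+)}{\bar{\varphi}_{y_+}^{(\xi,\theta_+)}(\ell_+)}. \tag{26}$$
The summation in (21) can be interpreted as an enumeration of all possible combinations of births, deaths and survivals together with associations of new measurements to hypothesized tracks. Observe that (21) does indeed take on the same form as (12) when rewritten as a sum over $I_+,\xi,\theta_+$ with weights
$$\omega_{y_+}^{(I_+,\xi,\theta_+)}\propto\sum_{I}\omega^{(I,\xi)}\,\omega_{y_+}^{(I,\xi,I_+,\theta_+)}. \tag{27}$$
Hence at the next iteration we only propagate forward the components $(I_+,\xi,\theta_+)$ with weights $\omega_{y_+}^{(I_+,\xi,\theta_+)}$.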
Remark: It is also possible to approximate the resulting GLMB filtering density by an LMB with matching first moment and cardinality distribution [32]. This so-called LMB filtering strategy considerably reduces computation since an LMB is a GLMB with one term. However, tracking performance tends to degrade, especially in scenarios with many closely spaced objects.
Note that for high SNR scenarios the detection probability is high, hence the recursion (21)-(27) would mostly process detections. On the other hand, when the detection probability is low it would mostly process the image. In practice the SNR varies between different regions in the observation space as well as with time, and the recursion (21)-(27) adaptively processes detections and image data to improve performance while reducing the computational cost.
III-D GLMB Filter Implementation
The number of terms in the GLMB filtering density grows super-exponentially, and it is necessary to truncate these terms without exhaustive enumeration. A two-stage implementation of the GLMB filter truncates the prediction and filtering densities using the K-shortest path and the ranked assignment algorithms, respectively [29]. In [31] the joint prediction and update was designed to improve the truncation efficiency of the two-stage implementation. Further, the GLMB truncation can be performed via Gibbs sampling with linear complexity in the number of detections (the reader is referred to [31] for derivations and analysis). Fortuitously, this implementation can be readily adapted for the video MOT GLMB filter (21)-(27).
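The GLMB filtering density (12) at time $k$ is completely characterized by the parameters $(\omega^{(I,\xi)},p^{(\xi)})$, $(I,\xi)\in\mathcal{F}(\mathbb{L})\times\Xi$, which can be enumerated as $\{(I^{(h)},\xi^{(h)},\omega^{(h)},p^{(h)})\}_{h=1}^{H}$, where
$$\omega^{(h)}\triangleq\omega^{(I^{(h)},\xi^{(h)})},\qquad p^{(h)}\triangleq p^{(\xi^{(h)})}.$$
Since (12) can now be rewritten as
$$\boldsymbol{\pi}(\mathbf{X})=\Delta(\mathbf{X})\sum_{h=1}^{H}\omega^{(h)}\,\delta_{I^{(h)}}[\mathcal{L}(\mathbf{X})]\,\big[p^{(h)}\big]^{\mathbf{X}},$$
implementing the GLMB filter amounts to propagating the component set $\{(I^{(h)},\omega^{(h)},p^{(h)})\}_{h=1}^{H}$ (there is no need to store $\xi^{(h)}$) forward in time using (21)-(27). The procedure for computing the component set at the next time is summarized in Algorithm 1. Note that to be consistent with the indexing by $h$ instead of $(I,\xi)$, we also abbreviate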
(28) 
Algorithm 1. Joint Prediction and Update

input: , , , ,

input: , , , , ,

output:
sample counts from multinomial distribution
with parameters trials and weights
for
compute using (28)
initialize
for
compute from and using (29)
compute from and using (30)
compute from and using (31)
end
end
for
end
normalize weights
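The first step of the listing, drawing per-component sample counts from a multinomial distribution, can be sketched with the standard library alone. The function name and the example weights are hypothetical; the counts decide how many Gibbs samples are spent on each existing component, so heavier components receive more of the sampling budget.

```python
import random
from collections import Counter
from typing import List, Sequence


def multinomial_counts(weights: Sequence[float], trials: int, rng: random.Random) -> List[int]:
    """Draw `trials` component indices with the given weights and count them."""
    draws = rng.choices(range(len(weights)), weights=weights, k=trials)
    counts = Counter(draws)
    return [counts.get(h, 0) for h in range(len(weights))]


counts = multinomial_counts([0.6, 0.3, 0.1, 0.0], trials=1000, rng=random.Random(0))
print(counts)
```

The weights do not need to be normalized beforehand, and a component with zero weight is never selected.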
Algorithm 1a. Gibbs

input:

output:
for
for
for
end
end
end
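A Gibbs sampler of this kind can be sketched as follows, in the spirit of the truncation scheme described in [31]. Everything concrete here is an assumption made for illustration: each track takes the value -1 (death), 0 (survives but is missed) or a positive detection index, `eta[n][v + 1]` is a non-negative weight for track `n` taking value `v`, and a detection index already held by another track is given zero weight so that every sample is a positive 1-1 assignment. At least one of the first two weights in each row must be positive so that a draw is always possible.

```python
import random
from typing import List, Sequence, Tuple


def gibbs_associations(
    gamma_init: Sequence[int],
    num_samples: int,
    eta: Sequence[Sequence[float]],
    rng: random.Random,
) -> List[Tuple[int, ...]]:
    """Sample assignment vectors one coordinate at a time (assumed layout of eta)."""
    num_tracks = len(eta)
    num_detections = len(eta[0]) - 2          # columns: death, miss, detections 1..M
    values = list(range(-1, num_detections + 1))
    gamma = list(gamma_init)
    samples = [tuple(gamma)]
    for _ in range(1, num_samples):
        for n in range(num_tracks):
            # Detections currently held by the other tracks are unavailable.
            taken = {gamma[m] for m in range(num_tracks) if m != n and gamma[m] > 0}
            weights = [0.0 if (v > 0 and v in taken) else eta[n][v + 1] for v in values]
            gamma[n] = rng.choices(values, weights=weights, k=1)[0]
        samples.append(tuple(gamma))
    return samples


eta = [[0.1, 0.2, 5.0, 1.0],
       [0.1, 0.2, 4.0, 1.0]]
samples = gibbs_associations([0, 0], num_samples=50, eta=eta, rng=random.Random(0))
print(len(set(samples)))
```

Duplicate samples are expected; keeping only the distinct ones gives the set of significant hypotheses that are carried forward.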
There are three main steps in one iteration of the GLMB filter. First, the Gibbs sampler (Algorithm 1a) is used to generate the auxiliary vectors , :, :, with the most significant weights (note that is an equivalent representation of the hypothesis , with components