Part-Based Tracking by Sampling
We propose a novel part-based method for tracking an arbitrary object in challenging video sequences, focusing on robust tracking under the effects of camera motion and object motion change. Each of a group of tracked image patches on the target is represented by pairs of RGB pixel samples and counts of how many pixels in the patch are similar to them. This empirically characterises the underlying colour distribution of the patches and allows for matching using the Bhattacharyya distance. Candidate patch locations are generated by applying non-shearing affine transformations to the patches’ previous locations, followed by local optimisation. Experiments using the VOT2016 dataset show that our tracker outperforms all other part-based trackers in terms of robustness to camera motion and object motion change.
George De Ath
gd259@exeter.ac.uk
Department of Computer Science
University of Exeter
Exeter, United Kingdom
Visual tracking of arbitrary objects is a very active research topic in computer vision, with applications across many different fields including video surveillance, activity analysis, robot vision, and human-computer interfaces. Its main goal is to determine the location of an unknown object, specified only with a bounding box in the first frame of a sequence, in all subsequent frames. Despite strong progress in recent years, such as the emergence of popular benchmark datasets [Wu et al.(2013), Kristan et al.(2016)] leading to increased performance year on year, it remains a challenge to design trackers that deal with difficult tracking scenarios. Particular challenges include camera motion, illumination change, motion change, occlusion, and object size change; test datasets now label individual frames of test sequences [Kristan et al.(2016)] based on these difficult tracking attributes.
Object tracking methods can roughly be divided into two broad areas: those based on global appearance models and part-based approaches. Global appearance-based approaches model the object with a single representation, such as Correlation Filters [Henriques et al.(2015), Bertinetto et al.(2016), Danelljan et al.(2015)] or Convolutional Neural Networks [Danelljan et al.(2016), Nam and Han(2016), Wang et al.(2014)], and generally achieve good performance across test datasets. However, if large amounts of deformation and occlusion occur then global model approaches can fail to track the target robustly [Du et al.(2017)]. A promising way of countering these types of problems is to use part-based methods, which divide the object into smaller parts to be tracked. These create a model for the constituent parts of the object using techniques such as Correlation Filters [Du et al.(2016), Du et al.(2017), Lukežič et al.(2018), Čehovin et al.(2016), Akin et al.(2016)], structured SVMs [Battistone et al.(2017)], displacement regressors [Wang et al.(2015)] and general histogram-based methods [Čehovin et al.(2013), Xiao et al.(2015), Maresca and Petrosino(2015)].
Part-based methods vary in how their parts are geometrically related to one another; while some enforce no geometric relationships explicitly [Xiao et al.(2015), Kwon and Lee(2009)], others use the more popular star-based topology [Maresca and Petrosino(2013), Kwon and Lee(2014), Cai et al.(2014), Zhu et al.(2015)], and some use fully-connected sets of parts [Artner et al.(2011), Lukežič et al.(2018)]. Other trackers enforce a local geometric model, connecting patches that are close to each other, with closeness defined in terms of being in neighbouring superpixels [Du et al.(2016)], being connected by edges in a Delaunay triangulated mesh [Čehovin et al.(2013)], or higher order geometric relationships [Du et al.(2017)]. Trackers may also employ implicit geometric constraints resulting from limiting the potential patch locations searched in subsequent frames, such as only searching for rigid motion followed by localised non-rigidity. The method of [Yi et al.(2015)] creates multiple sets of copies of the object’s parts and moves each part in a particular set by the same random translation, sampled from a Gaussian. Each part is then separately randomly diffused, before the likelihoods of all sets of part positions are combined with a particle filter to find the best set of part locations. Alternatively, [Čehovin et al.(2013)] attempts to learn the optimal part positions in the current frame by first using the cross-entropy method to find the optimal affine transform of the parts’ locations in the previous frame, and then applying the cross-entropy method to each local part in turn, fixing the locations of all others.
As part-based trackers have been shown to track objects well in the presence of deformation and occlusion [Čehovin et al.(2016), Kwon and Lee(2009), Du et al.(2017)], and are typically composed of multiple smaller versions of the models used in the global appearance-based trackers, one might expect their performance to be as good as or better than that of the global models. Their recent performance on tracking benchmarks [Kristan et al.(2016), Wu et al.(2013)], however, has shown them to be less accurate, particularly in cases of rigid motion, both in terms of camera motion and object motion change. We believe that there are two main factors affecting the performance of part-based trackers: video noise and model complexity. Noisy regions of a video are proportionally larger relative to the small models used in part-based tracking than to the larger, singular models used in global tracking, leading to a lack of robustness to noise. Model complexity is also a problem in part-based tracking: the more parts used in a tracker, the more parameters need to be learnt, so that sacrifices have to be made in the size of the image region in which to search for the object.
This paper presents a novel technique called Part-Based Tracking by Sampling (PBTS). It is designed to tackle the issues presented above by using a novel visual model for parts that is robust to the noise introduced via motion, and also makes use of a novel scheme for object localisation. We take inspiration from the ViBe [Barnich and Droogenbroeck(2011)] background subtraction technique, which models the colour distribution of each background pixel in an image by randomly sampling from its own and neighbouring pixels across frames of video. A pixel in a future frame of video is labelled as a match to its background model if it is sufficiently close (in colour space) to a sufficient proportion of samples in the model. Pixels matching their background models are labelled as background, and as foreground otherwise. This has proved a simple, but extremely effective, approach to background subtraction and has achieved a good level of performance on benchmarks [Xu et al.(2016)] since its inception.
In contrast to ViBe, we model the colour distributions of image patches, rather than individual pixels, by building an empirical distribution of the colours. This is achieved by randomly sampling a set of pixel values from each patch and counting the number of pixels within the patch that match the samples. A pixel matches a sample if it lies within a fixed radius of that sample in colour space. Samples are selected such that they are sufficiently far away from one another to not match each other. A patch is therefore represented by spheres of this radius in colour space, centred on the model’s samples, and the number of pixels in the patch that lie within each sphere. This model, a collection of samples and match counts, can be viewed as an empirical characterisation of the underlying colour distribution of the image patch.
Image patches on an object are modelled based on the given bounding box in the first frame of a sequence and matching patches are sought in subsequent frames. The object search process comprises two steps. First, non-shearing affine transformations of the patch locations in the previous frame are generated by sampling from a Gaussian distribution over the transform’s parameters. Secondly, the sets of patch locations with the highest likelihood, i.e. those that most closely match their models, are locally optimised. This involves selecting each patch’s new location, constrained to a small window around the patch, as the location which has the highest colour similarity to the patch’s model. The set of patches with the highest likelihood is then chosen as the object’s position in the current frame. The patches’ models are updated to take into account new information about the patch in the current frame, and a bounding box is generated to give the object’s estimated location.
While the methods of [Yi et al.(2015), Čehovin et al.(2013)] share some similarities with our search scheme, we highlight some important differences. In contrast to [Yi et al.(2015)], who randomly move each patch, we select each patch’s location, after an initial global non-shearing affine transform, to be its optimal location within a small local neighbourhood. The start of the search for the optimal affine transform in [Čehovin et al.(2013)] is based on the previous time-step’s optimal transform and its estimated uncertainty, meaning that the search is biased to look in the direction in which the object was travelling, which introduces an implicit motion model. Our method, on the other hand, makes no assumptions as to the motion of the object, as we have found empirically that objects tend to move in an unpredictable manner and a motion model is detrimental to tracking accuracy, particularly for countering the effects of camera motion. One reason for this is that if an object is travelling across the field of view, then the operator of the camera will typically recentre the object once it approaches the edge of the field of view. Recentring the object involves moving the camera in the direction the object is travelling, meaning that the object’s relative position in the frame moves in the opposite direction to that in which it is travelling.
In summary, we propose a part-based visual tracking framework based on sampling which has the following novel contributions:
- A part-based visual model that empirically characterises the underlying colour distribution of the image patches it represents with samples and match counts.
- An object localisation scheme that combines global and local search, modelling a deforming object’s motion using global non-shearing affine transforms followed by localised patch location optimisation.
Given an object defined by a bounding box in the first frame of a sequence, we construct its model. The object is represented by a group of patches each of by pixels. Each patch is characterised by a patch model, , so that the object is modelled as . Patch models contain pairs of RGB samples and their match counts , indicating how many pixels within the patch match the corresponding sample. An RGB pixel vector matches a sample in the model if . If multiple matches occur, i.e. the matching samples in the model lie within of one another, the pixel is assigned to the closest sample. During construction of the model, the number of samples is limited to the pairs with the largest counts, because in practice the vast majority of pixels in a patch are characterised by a few samples. Constraining the number of samples effectively reduces overfitting by excluding rare samples that do not characterise the patch.
The object is then tracked in subsequent frames. Groups of candidate patch locations are generated by sampling from a Gaussian distribution centred on the object’s patch locations at the previous time-step. An object model is built for each group of candidate patches, and their likelihood is evaluated by comparing each candidate patch’s model to the actual patch model using the Bhattacharyya distance. The groups of patches with the highest likelihood are then locally optimised, and the group with the highest likelihood is selected as the patches’ new locations. Lastly, the patch models are updated and a bounding box is generated indicating the tracker’s estimate of the object location, based on the patches’ current locations. Each stage of the process is discussed in more detail in the following section.
2.1 Online Tracking
Given the positions of the patches at the previous time-step (Fig. (a)), we apply affine transformations to them in the current frame (Fig. (b)), generating candidate sets of patch locations (Fig. (c)). We restrict scalings to be isotropic and exclude shears from the class of transformations considered. The potential motions are limited to this class as we have observed that frame-to-frame motion is well represented in this manner and it is only when considering multiple frames that a more general class is needed. Parameters describing each transform are therefore sampled from a Gaussian with a mean centred on the current location, scale and rotation and with a diagonal covariance matrix.
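The candidate generation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated here: the function names, the choice to rotate and scale about the centroid of the patch centres, and the standard deviations passed in are all hypothetical.

```python
import math
import random


def sample_similarity_transform(sigma_t, sigma_r, sigma_s, rng):
    """Draw (dx, dy, angle, scale) from independent Gaussians.

    Independent draws correspond to a diagonal covariance matrix. The mean is
    the identity transform: no translation, no rotation, unit scale.
    """
    dx = rng.gauss(0.0, sigma_t)
    dy = rng.gauss(0.0, sigma_t)
    angle = rng.gauss(0.0, sigma_r)
    scale = rng.gauss(1.0, sigma_s)
    return dx, dy, angle, scale


def apply_transform(centres, transform):
    """Apply an isotropic scale, a rotation and a translation (no shear).

    Rotation and scaling are taken about the centroid of the patch centres,
    which is an assumption made for this sketch.
    """
    dx, dy, angle, scale = transform
    cx = sum(x for x, _ in centres) / len(centres)
    cy = sum(y for _, y in centres) / len(centres)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    moved = []
    for x, y in centres:
        rx, ry = x - cx, y - cy
        moved.append((
            cx + dx + scale * (cos_a * rx - sin_a * ry),
            cy + dy + scale * (sin_a * rx + cos_a * ry),
        ))
    return moved


def generate_candidates(centres, n, sigma_t, sigma_r, sigma_s, seed=0):
    """Generate n candidate sets of patch centres from the previous centres."""
    rng = random.Random(seed)
    return [
        apply_transform(
            centres,
            sample_similarity_transform(sigma_t, sigma_r, sigma_s, rng),
        )
        for _ in range(n)
    ]
```

Because every patch in a candidate set shares one transform, the relative geometry of the patches is preserved up to rotation and isotropic scale; any non-rigid motion is left to the later local optimisation.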
A set of patch models is then created for each of the sets of locations, such that the RGB sample points in are identical to those in but the counts in reflect the number of matching pixels in the patch’s transformed location; that is, . Using the same sample points to generate counts for each patch’s candidate location allows for the direct comparison of how well each of them matches the actual modelled patch.
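As a rough illustration of how counts can be recomputed at a candidate location while keeping the sample points fixed, consider the sketch below. The function name `count_matches`, the use of Euclidean distance with a strict inequality, and the radius used in any call are assumptions made for illustration.

```python
def count_matches(pixels, samples, radius):
    """Count, for each fixed sample, the pixels that match it.

    A pixel matches a sample when their Euclidean distance in RGB space is
    below `radius`. A pixel matching several samples is assigned to the
    closest one. Pixels that match no sample are not counted anywhere.
    """
    counts = [0] * len(samples)
    radius_sq = radius * radius
    for pixel in pixels:
        best_index, best_dist_sq = None, radius_sq
        for index, sample in enumerate(samples):
            dist_sq = sum((a - b) ** 2 for a, b in zip(pixel, sample))
            if dist_sq < best_dist_sq:
                best_index, best_dist_sq = index, dist_sq
        if best_index is not None:
            counts[best_index] += 1
    return counts
```

The returned list is aligned with `samples`, so the counts of the model and of every candidate can be compared position by position.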
We make use of the Bhattacharyya distance (BD) to evaluate how similar the sets of candidate patch locations’ models , are to the object’s model . The BD is based on the Bhattacharyya coefficient and is commonly used to measure histogram similarity in tracking [Comaniciu et al.(2003)]. The squared BD between a patch’s candidate model and actual model is
where is the number of pixels in the patch. Normalising pixel counts by the number of pixels in a patch reflects the fact that the model is an empirical (histogram) representation of the probability distribution of colours in its corresponding patch. Patches containing pixels that do not match any modelled RGB samples are implicitly down-weighted by this normalisation. This is a desirable property as it leads to larger distances being assigned to patches with non-matching pixels than would be given if they were normalised by the sum of their counts. The likelihood of a candidate set of patch location’s models being the correct location of the object (relative to ) is therefore defined as
Local optimisation of a patch’s location is applied to the sets of candidate patches with the highest likelihoods (Fig. (d)). This is carried out by, for each patch in each of the models, selecting the location within an by window, centred on the patch’s centroid, that has the highest likelihood of being the patch’s true location (i.e. that has the lowest BD). Ties in BD are broken by selecting the patch closest to its unoptimised position in order to limit patches drifting in homogeneous regions of images; ties with identical distances are broken randomly. The non-shearing affine transform search can be viewed as a form of global model, with the local optimisation allowing small independent motion of the individual parts. This closely models our observations that an object’s movement is fairly rigid between frames, with only small deviations from rigidity occurring at the local level.
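The scoring and local search described above can be sketched as follows. This is an illustration only: it assumes the squared BD takes the form of one minus the Bhattacharyya coefficient, as in [Comaniciu et al.(2003)], with counts normalised by the number of pixels in the patch. The function names, the `half_window` parameterisation and the caller-supplied `distance_at` function are hypothetical.

```python
import math
import random


def squared_bhattacharyya(model_counts, candidate_counts, n_pixels):
    """Squared Bhattacharyya distance between two count vectors.

    Counts are normalised by the number of pixels in the patch, not by their
    own sums, so candidate pixels that match no sample lower the coefficient.
    """
    coefficient = sum(
        math.sqrt((m / n_pixels) * (c / n_pixels))
        for m, c in zip(model_counts, candidate_counts)
    )
    return max(0.0, 1.0 - coefficient)  # clamp tiny negative rounding errors


def local_optimise(centre, half_window, distance_at, rng=random):
    """Pick the location in a square window around `centre` with lowest distance.

    `distance_at(location)` is a caller-supplied function returning the
    squared BD of the patch when placed at `location`.
    """
    cx, cy = centre
    scored = []
    for dy in range(-half_window, half_window + 1):
        for dx in range(-half_window, half_window + 1):
            # Sort key: distance first, then squared offset from the start.
            key = (distance_at((cx + dx, cy + dy)), dx * dx + dy * dy)
            scored.append((key, (cx + dx, cy + dy)))
    best_key = min(key for key, _ in scored)
    # Remaining ties (same distance, same offset length) are broken randomly.
    return rng.choice([loc for key, loc in scored if key == best_key])
```

In a homogeneous region every location has the same distance, so the zero offset wins the tie-break and the patch stays where the global transform placed it.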
Lastly, an axis-aligned bounding box is tightly fitted around the patches that generated the model with the highest likelihood , and its width and height are expanded by a proportion (Fig. (e)). This bounding box is used as the object’s predicted location in the current frame. The expansion of the bounding box is needed as the initialisation procedure tends not to place patches at the extremities of the object and so the expansion counteracts this.
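A minimal sketch of the bounding box step is given below. It assumes that "expanded by a proportion" means multiplying the width and height by one plus that proportion while keeping the box centre fixed; the function name, argument names and the `(x, y, width, height)` return convention are hypothetical.

```python
def bounding_box(centres, patch_w, patch_h, expand):
    """Fit a tight axis-aligned box around the patches, then expand it.

    `centres` are patch centre coordinates; the tight box encloses the full
    extent of every patch. Returns (x, y, width, height) with (x, y) the
    top-left corner.
    """
    xs = [x for x, _ in centres]
    ys = [y for _, y in centres]
    x0, x1 = min(xs) - patch_w / 2.0, max(xs) + patch_w / 2.0
    y0, y1 = min(ys) - patch_h / 2.0, max(ys) + patch_h / 2.0
    # Expand about the box centre (an assumption of this sketch).
    width = (x1 - x0) * (1.0 + expand)
    height = (y1 - y0) * (1.0 + expand)
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    return (cx - width / 2.0, cy - height / 2.0, width, height)
```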
2.2 Model initialisation
The procedure for initialising the tracker, given the object’s bounding box in the first frame of a sequence (Fig. (a)), can be split into two parts: selecting locations for the patches that are likely to be on the object itself; and then creating patch models at these locations. We first crop the image to times the width and height of an axis-aligned version of the object’s bounding box, centred on the object, thus allowing background pixels to aid in the object segmentation process. We use the alpha-matting object segmentation technique described in [De Ath and Everson(2018)] to segment the object (Fig. (b)). It assumes that a small region of pixels in the centre of the given bounding box belongs to the object and that pixels sufficiently far outside the box are background, and learns the labels (foreground/background) for pixels in between these regions.
The cropped image is then superpixeled using SLICO [Achanta et al.(2012)], a less-parametrised version of the SLIC [Achanta et al.(2012)] superpixeling algorithm (Fig. (c)), designed to provide regularly shaped superpixels without having to specify a required compactness for the superpixels. SLIC oversegments the image into a pre-defined number of regions that are both similarly sized and roughly uniform in colour and texture. We set the number of superpixels required to lie within the masked region, supplied by the segmentation, of the cropped image as being equal to . Patch locations are then evaluated at the centroid of each superpixel (Fig. (d)) to exploit their uniformity of colour and texture. Experimentation shows that more accurate and robust tracking is achieved by centring patches on homogeneous regions rather than on heterogeneous ones (e.g. on the boundaries of superpixels). The patch locations are evaluated in descending order of superpixel size, placing a patch at a superpixel’s centroid if it overlaps all other patches by less than a proportion of its area, thus limiting the amount of repeated information stored in the model.
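The greedy placement rule can be sketched as follows. The sketch assumes square patches of a common size and takes the superpixels as precomputed `(area, centroid)` pairs; it does not run SLICO itself. The function names and the overlap threshold in any call are hypothetical.

```python
def overlap_fraction(a, b, size):
    """Fraction of a square patch's area overlapped by another of equal size.

    `a` and `b` are patch centres; `size` is the patch side length in pixels.
    """
    overlap_x = max(0.0, size - abs(a[0] - b[0]))
    overlap_y = max(0.0, size - abs(a[1] - b[1]))
    return (overlap_x * overlap_y) / float(size * size)


def place_patches(superpixels, size, max_overlap):
    """Greedily place patches at superpixel centroids, largest superpixel first.

    A centroid is accepted only if the patch there overlaps every patch
    placed so far by less than `max_overlap` of its area.
    """
    placed = []
    ordered = sorted(superpixels, key=lambda sp: sp[0], reverse=True)
    for _, centroid in ordered:
        if all(overlap_fraction(centroid, p, size) < max_overlap for p in placed):
            placed.append(centroid)
    return placed
```

Visiting superpixels from largest to smallest means that, when two centroids compete for nearly the same location, the one from the larger region is kept.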
Each patch’s colour model is initialised by examining all pixels in the patch in random order. If a pixel matches one of the model’s sample points then its corresponding count is incremented; otherwise the pixel is included in the model as a new sample with a count of 1. We also explored more sophisticated sampling techniques, such as using the pixel’s distance from the closest sample in the model as its likelihood of being selected to be evaluated [Arthur and Vassilvitskii(2007)]. No statistically significant performance gains were obtained over random selection for any of the more complex techniques evaluated.
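A minimal sketch of this initialisation, combined with the restriction to the samples with the largest counts described earlier, is given below. The function name, the Euclidean matching rule with a strict inequality, and the `seed` argument are assumptions made for illustration.

```python
import random


def build_patch_model(pixels, radius, max_samples, seed=0):
    """Build (samples, counts) for one patch from its RGB pixels.

    Pixels are visited in random order. A pixel that matches an existing
    sample increments the closest matching sample's count; otherwise it
    becomes a new sample with a count of 1. Only the `max_samples` samples
    with the largest counts are kept.
    """
    rng = random.Random(seed)
    order = list(pixels)
    rng.shuffle(order)
    samples, counts = [], []
    radius_sq = radius * radius
    for pixel in order:
        best_index, best_dist_sq = None, radius_sq
        for index, sample in enumerate(samples):
            dist_sq = sum((a - b) ** 2 for a, b in zip(pixel, sample))
            if dist_sq < best_dist_sq:
                best_index, best_dist_sq = index, dist_sq
        if best_index is None:
            samples.append(tuple(pixel))
            counts.append(1)
        else:
            counts[best_index] += 1
    ranked = sorted(zip(counts, samples), key=lambda cs: cs[0], reverse=True)
    ranked = ranked[:max_samples]
    return [s for _, s in ranked], [c for c, _ in ranked]
```

A new sample is only created when a pixel matches no existing sample, so by construction no two samples lie within the matching radius of each other.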
2.3 Model update
Once the optimal set of patch locations has been found, is updated. The samples in each patch model are updated to be a convex combination of its current value and the average RGB value of matching pixels in the new patch location:
where the smoothing parameter controls the rate at which samples are moved towards the regions of colour space most heavily occupied by the matching pixels. Updating the samples in this way allows the patch colour model to evolve over time, and provides extra robustness against noise as the model is adapted to the currently observed image data rather than assuming that the object’s appearance is constant throughout the entire sequence.
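The update can be sketched as below. The text specifies only that the new sample is a convex combination of the current sample and the mean of the matching pixels; the convention that the smoothing parameter (here `alpha`) weights the new mean, and the choice to leave a sample unchanged when no pixels match it, are assumptions of this sketch.

```python
def update_sample(sample, matching_pixels, alpha):
    """Move a sample towards the mean RGB value of the pixels matching it.

    Returns (1 - alpha) * sample + alpha * mean(matching_pixels), computed
    per channel. With alpha = 0 the sample never changes.
    """
    if not matching_pixels:
        return tuple(sample)  # nothing observed, keep the sample as it is
    n = len(matching_pixels)
    mean = [sum(p[channel] for p in matching_pixels) / n for channel in range(3)]
    return tuple((1.0 - alpha) * s + alpha * m for s, m in zip(sample, mean))
```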
3 Experimental results and evaluation
The tracker was evaluated on the recent VOT2016 benchmark [Kristan et al.(2016)], which provides a tracking dataset with fully annotated frames, and details the results of a large number of state-of-the-art trackers. The dataset consists of 60 sequences, containing difficult tracking scenarios such as occlusion, scale variation, camera motion, object motion change, and illumination changes. The official protocol of the VOT challenge was followed, with the tracker initialised on the first frame of a sequence using the ground-truth bounding box provided, and reinitialised if the tracker drifted away from the target. Trackers were evaluated in terms of their accuracy (target localisation), robustness (failure frequency), and Expected Average Overlap (EAO). For full details see [Kristan et al.(2016)]. PBTS was compared to 19 part-based trackers [Kristan et al.(2016)]; for brevity we report the results of the 12 best trackers in terms of EAO, including PBTS.
3.1 Implementation details
The tracker was implemented in Python 3.6 and run on an Intel i5-4690 CPU with 32 GB of memory, at approximately 20 frames per second. Its parameters were chosen empirically and fixed for all experimental runs.
Model initialisation: We cropped the image to be times the object’s axis-aligned bounding box. Using the notation of [De Ath and Everson(2018)], we set model initialisation parameters as , , and . We placed patches of size on objects, with a maximum patch area overlap of , and limited samples in each model to be ; performance is roughly constant for . The matching distance was set to [Barnich and Droogenbroeck(2011)].
Online tracking: Candidate transforms were sampled from a Gaussian distribution centred on the last frame’s location, scale and rotation. Standard deviations of components were: translation where is half the sum of the height and width of the axis aligned bounding box located at the previous time-step; rotation ; and scale . non-shearing affine transforms were sampled, with the best locally optimised. We found no performance gain by increasing or . During local optimisation a patch may move pixels in any direction per time-step (), allowing sets of patches limited, but sufficient, non-affine movement between frames. The predicted bounding box was the minimal axis-aligned bounding box enclosing the optimised patches scaled by .
Model update: The colour model update rate was set to , and played a strong role in the trade-off between robustness and accuracy. Increasing tended to decrease the tracking accuracy, but increase robustness.
3.2 Qualitative evaluation
Examples of the visual tracking results of PBTS and the top 3 performing trackers (SHCT [Du et al.(2016)], GGTv2 [Du et al.(2017)] and DPT [Lukežič et al.(2018)]) on 3 challenging sequences can be seen in Fig. 3. Note that only PBTS has been able to successfully track the football for more than 7 frames in the top sequence, which may be due to the object’s small size as the football in the sequence is only 20 by 20 pixels. The three top trackers all use Correlation Filters, which are known to suffer performance problems when they are sufficiently small [Lukežič et al.(2018)]. The search range of a Correlation Filter is limited by its size, as it can only look for correlations up to half the filter’s width or height away from where it is centred in the image. PBTS is able to successfully track the football through the sequence as its search mechanism allows it to locate objects further away than within the bounds of the object’s previous position.
In the gymnastics sequence all four methods successfully track the object for the majority of the sequence, but as the gymnast rotates, the difference between the methods becomes apparent (e.g. frame #141). The three Correlation Filter-based techniques fail to correctly track the rotation of the gymnast, predicting a vertical bounding box although the gymnast is approximately horizontal, leading them to update their part models on the background and lose track of the gymnast in later frames.
In the third sequence the fish moves with both in-plane and out-of-plane rotations, with all four trackers managing to successfully track it in the earlier frames. In later frames (frame #253 onwards), the fish rotates again out-of-plane and is partially occluded; this time both GGTv2 and DPT struggle to track the fish and predict its location as being large portions of the image (frame #281), whereas PBTS gives the most meaningful bounding box prediction.
3.3 Quantitative Evaluation
Accuracy and robustness plots for the video characteristics in which we are mainly interested are shown in Fig. 4. PBTS is the most robust (lowest average rank) in frames with camera motion (Fig. 4(a)) and object motion change (Fig. 4(b)). It also achieved joint second place in robustness for frames with no difficult attributes (Fig. 4(c)) and third place in object size change (Fig. 4(d)). However, for the two other measured attributes, illumination change and occlusion, PBTS was ranked joint 10th in robustness, meaning that overall the tracker was the third most robust of the 19 part-based trackers evaluated.
The tracker’s performance with regard to accuracy was lower, with the tracker ranking roughly 9th place for each of the five difficult attributes and 8th for frames with no difficult attributes. This was mainly due to underestimation of the object’s size, rather than patches drifting away from the object leading to overly-large bounding box predictions, as tended to be the case with the other trackers depicted in Fig. 3. An example of this can be seen in Fig. 5, where the tracked object goes into shade over several frames, causing the object to get darker, starting with the lower region and working up the object. This leads to the patches being pushed towards the region of the object most similar to its model, favouring transforms that increase the density of the patches and local optimisation that moves towards the brighter regions of the object that are more similar to the original patch colour models.
Table 1: Expected Average Overlap (EAO) of the 12 best part-based trackers, including PBTS.
| Tracker | Reference | EAO |
|---|---|---|
| SHCT | [Du et al.(2016)] | 0.2661 |
| GGTv2 | [Du et al.(2017)] | 0.2377 |
| DPT | [Lukežič et al.(2018)] | 0.2358 |
| ANT | [Čehovin et al.(2016)] | 0.2045 |
| BST | [Battistone et al.(2017)] | 0.1997 |
| TricTRACK | [Wang et al.(2015)] | 0.1995 |
| PBTS | | 0.1793 |
| DPCF | [Akin et al.(2016)] | 0.1787 |
| LGT | [Čehovin et al.(2013)] | 0.1682 |
| CDTT | [Xiao et al.(2015)] | 0.1644 |
| MatFlow | [Maresca and Petrosino(2013)] | 0.1545 |
| HT | [Godec et al.(2013)] | 0.1500 |
Occlusion and illumination change both produce the pushing effect. This is a particular problem with occlusion, because if an occluding object completely cuts across the object and moves from one side to another, it can push patches off the object if they are unable to jump across the occlusion. This effect explains both the smaller overlap with the true bounding boxes and the lack of robustness with respect to occlusion and illumination changes. Expected Average Overlap is shown in Table 1, where PBTS is ranked 7th of the 19 evaluated.
We have proposed a novel part-based representation and object localisation scheme for tracking objects in challenging video sequences. The empirical sample-based representation of image patches, combined with the novel global and local search procedure, leads to accurate and robust performance, particularly in sequences with object motion changes and camera motion. However, the tracker demonstrates lower performance when illumination changes occur and also when the object is occluded.
Current work addresses tracking in the face of illumination changes and occlusion. As only the features of a patch and their match counts are modelled, the RGB features can be directly replaced with those more robust to illumination variation, including texture-based features (e.g. HOG and LBP) or colour features such as the Colour Names features. Occlusion handling can be introduced directly by limiting the size of scale change further between frames, as well as by only using the better matching patches to contribute towards the overall set of candidate patch locations.
- [Achanta et al.(2012)Achanta, Shaji, Smith, Lucchi, Fua, and Süsstrunk] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.
- [Akin et al.(2016)Akin, Erdem, Erdem, and Mikolajczyk] O. Akin, E. Erdem, A. Erdem, and K. Mikolajczyk. Deformable part-based tracking by coupled global and local correlation filters. Journal of Visual Communication and Image Representation, 38:763–774, 2016.
- [Arthur and Vassilvitskii(2007)] D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In Symposium on Discrete Algorithms, pages 1027–1035, 2007.
- [Artner et al.(2011)Artner, Ion, and Kropatsch] N. M. Artner, A. Ion, and W. G. Kropatsch. Multi-scale 2d tracking of articulated objects using hierarchical spring systems. Pattern Recognition, 44(4):800–810, 2011.
- [Barnich and Droogenbroeck(2011)] O. Barnich and M. Van Droogenbroeck. ViBe: A universal background subtraction algorithm for video sequences. IEEE Transactions on Image Processing, 20(6):1709–1724, 2011.
- [Battistone et al.(2017)Battistone, Petrosino, and Santopietro] F. Battistone, A. Petrosino, and V. Santopietro. Watch Out: Embedded video tracking with BST for unmanned aerial vehicles. Journal of Signal Processing Systems, 2017.
- [Bertinetto et al.(2016)Bertinetto, Valmadre, Golodetz, Miksik, and Torr] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. S. Torr. Staple: Complementary learners for real-time tracking. In Computer Vision and Pattern Recognition, pages 1401–1409, 2016.
- [Cai et al.(2014)Cai, Wen, Lei, Vasconcelos, and Li] Z. Cai, L. Wen, Z. Lei, N. Vasconcelos, and S. Z. Li. Robust deformable and occluded object tracking with dynamic graph. IEEE Transactions on Image Processing, 23(12):5497–5509, 2014.
- [Čehovin et al.(2013)Čehovin, Kristan, and Leonardis] L. Čehovin, M. Kristan, and A. Leonardis. Robust visual tracking using an adaptive coupled-layer visual model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4):941–953, 2013.
- [Čehovin et al.(2016)Čehovin, Leonardis, and Kristan] L. Čehovin, A. Leonardis, and M. Kristan. Robust visual tracking using template anchors. In IEEE Winter Conference on Applications of Computer Vision, pages 1–8, 2016.
- [Comaniciu et al.(2003)Comaniciu, Ramesh, and Meer] D. Comaniciu, V. Ramesh, and P. Meer. Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):564–577, 2003.
- [Danelljan et al.(2015)Danelljan, Häger, Khan, and Felsberg] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In International Conference on Computer Vision, pages 4310–4318, 2015.
- [Danelljan et al.(2016)Danelljan, Robinson, Khan, and Felsberg] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In European Conference on Computer Vision, pages 472–488, 2016.
- [De Ath and Everson(2018)] G. De Ath and R. Everson. Visual object tracking: The initialisation problem. In Conference on Computer and Robot Vision, pages 142–149, 2018.
- [Du et al.(2016)Du, Qi, Li, Wen, Huang, and Lyu] D. Du, H. Qi, W. Li, L. Wen, Q. Huang, and S. Lyu. Online deformable object tracking based on structure-aware hyper-graph. IEEE Transactions on Image Processing, 25(8):3572–3584, 2016.
- [Du et al.(2017)Du, Qi, Wen, Tian, Huang, and Lyu] D. Du, H. Qi, L. Wen, Q. Tian, Q. Huang, and S. Lyu. Geometric hypergraph learning for visual tracking. IEEE Transactions on Cybernetics, PP(99):1–14, 2017.
- [Godec et al.(2013)Godec, Roth, and Bischof] M. Godec, P. M. Roth, and H. Bischof. Hough-based tracking of non-rigid objects. Computer Vision and Image Understanding, 117(10):1245–1256, 2013.
- [Henriques et al.(2015)Henriques, Caseiro, Martins, and Batista] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2015.
- [Kristan et al.(2016)Kristan, Leonardis, Matas, Felsberg, Pflugfelder, and Čehovin] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, and L. Čehovin. The visual object tracking VOT2016 challenge results. In European Conference on Computer Vision, pages 777–823, 2016.
- [Kwon and Lee(2009)] J. Kwon and K. M. Lee. Tracking of a non-rigid object via patch-based dynamic appearance modeling and adaptive Basin Hopping Monte Carlo sampling. In Computer Vision and Pattern Recognition, pages 1208–1215, 2009.
- [Kwon and Lee(2014)] J. Kwon and K. M. Lee. Tracking by sampling and integrating multiple trackers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1428–1441, 2014.
- [Lukežič et al.(2018)Lukežič, Zajc, and Kristan] A. Lukežič, L. Č. Zajc, and M. Kristan. Deformable parts correlation filters for robust visual tracking. IEEE Transactions on Cybernetics, pages 1–13, 2018.
- [Maresca and Petrosino(2015)] M. E. Maresca and A. Petrosino. Clustering local motion estimates for robust and efficient object tracking. In European Conference on Computer Vision, pages 244–253, 2015.
- [Maresca and Petrosino(2013)] M. E. Maresca and A. Petrosino. Matrioska: A multi-level approach to fast tracking by learning. In Image Analysis and Processing – ICIAP 2013, pages 419–428. Springer Berlin Heidelberg, 2013.
- [Nam and Han(2016)] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In Computer Vision and Pattern Recognition, pages 4293–4302, 2016.
- [Wang et al.(2014)Wang, Pham, Ng, Wang, Chan, and Leman] L. Wang, N. T. Pham, T. T. Ng, G. Wang, K. L. Chan, and K. Leman. Learning deep features for multiple object tracking by using a multi-task learning strategy. In International Conference on Image Processing, pages 838–842, 2014.
- [Wang et al.(2015)Wang, Valstar, Martinez, Khan, and Pridmore] X. Wang, M. Valstar, B. Martinez, M. H. Khan, and T. Pridmore. Tric-track: Tracking by regression with incrementally learned cascades. In International Conference on Computer Vision, pages 4337–4345, 2015.
- [Wu et al.(2013)Wu, Lim, and Yang] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In Computer Vision and Pattern Recognition, pages 2411–2418, 2013.
- [Xiao et al.(2015)Xiao, Stolkin, and Leonardis] J. Xiao, R. Stolkin, and A. Leonardis. Single target tracking using adaptive clustered decision trees and dynamic multi-level appearance models. In Computer Vision and Pattern Recognition, pages 4978–4987, 2015.
- [Xu et al.(2016)Xu, Dong, Zhang, and Xu] Y. Xu, J. Dong, B. Zhang, and D. Xu. Background modelling methods in video analysis: A review and comparative evaluation. CAAI Transactions on Intelligence Technology, 1(1):43–60, 2016.
- [Yi et al.(2015)Yi, Jeong, Kim, Yin, Oh, and Choi] K. M. Yi, H. Jeong, S. W. Kim, S. Yin, S. Oh, and J. Y. Choi. Visual tracking of non-rigid objects with partial occlusion through elastic structure of local patches and hierarchical diffusion. Image and Vision Computing, 39:23–37, 2015.
- [Zhu et al.(2015)Zhu, Wang, Zhao, and Lu] G. Zhu, J. Wang, C. Zhao, and H. Lu. Weighted part context learning for visual tracking. IEEE Transactions on Image Processing, 24(12):5140–5151, 2015.