Tracking-by-Trackers with a Distilled and Reinforced Model
Visual object tracking has generally been tackled by reasoning independently about fast processing algorithms, accurate online adaptation methods, and the fusion of trackers. In this paper, we unify such goals by proposing a novel tracking methodology that takes advantage of other visual trackers, both offline and online. A compact student model is trained via the marriage of knowledge distillation and reinforcement learning. The first allows the transfer and compression of the tracking knowledge of other trackers. The second enables the learning of evaluation measures which are then exploited online. After learning, the student can ultimately be used to build (i) a very fast single-shot tracker, (ii) a tracker with a simple and effective online adaptation mechanism, and (iii) a tracker that performs the fusion of other trackers. Extensive validation shows that the proposed algorithms compete with real-time state-of-the-art trackers.
Visual object tracking corresponds to the persistent recognition and localization –by means of bounding boxes– of a target object in consecutive video frames. This problem comes with several challenges, including object occlusion, fast motion, light changes, and motion blur. Additionally, real-time constraints are often posed by many practical applications, such as video surveillance, behavior understanding, autonomous driving, and robotics.
In the past, the community has proposed solutions emphasizing different aspects of the problem. Processing speed was pursued by algorithms like correlation filters [6, 32, 16, 3, 50] or offline methods such as siamese convolutional neural networks (CNNs) [31, 26, 4, 43, 42, 84, 82]. Improved performance was attained by online target adaptation methods [57, 37, 14, 15, 5]. Higher tracking accuracy and robustness were achieved by methods built on top of other trackers [81, 79, 71, 1, 70]. All these characteristics belong to an optimal tracker, but they have been studied independently from one another. The community currently lacks a general framework to tackle them jointly. In this view, a single model should be able to (i) track an object in a fast way, (ii) implement simple and effective online adaptation mechanisms, and (iii) apply decision-making strategies to select tracker outputs.
It is a matter of fact that a large number of tracking algorithms have been produced so far, exploiting different principles. Preliminary solutions were based on mean shift algorithms, key-point or part-based methods [8, 58], or SVM learning. Later, correlation filters gained popularity thanks to their fast processing times [6, 32, 16, 3, 50]. More recently, CNNs have been exploited to extract efficient image features. This kind of representation has been included in deep regression networks [31, 26], online tracking-by-detection methods [57, 37], solutions that treat visual tracking as a reinforcement learning (RL) problem [80, 66, 12, 63, 9, 20], CNN-based discriminative correlation filters [17, 14, 15, 5], and siamese CNNs [4, 43, 42, 82, 73, 19]. Other methods tried to take advantage of the output produced by multiple trackers [81, 79, 71, 1, 70]. Thus, one can imagine that different trackers incorporate different knowledge, and this may constitute a valuable resource to leverage during tracking.
Lately, the knowledge distillation (KD) framework was introduced in the deep learning panorama as a paradigm for, among other goals [29, 69, 45, 60], knowledge transfer between models and model compression [10, 35, 61]. The idea boils down to considering a student model and one or more teacher models to learn from. Teachers express their knowledge through demonstrations on a previously unseen transfer set. Through specific loss functions, the student is set to learn a task by matching the teachers' outputs and the ground-truth labels. As visual tracking requires fast and accurate methods, KD can be a valuable tool to transfer the tracking ability of more accurate teacher trackers to more compact and faster student ones. However, the standard KD setup provides methods to exploit teachers only offline, not online. This makes the methodology unsuitable for tracking, which has been shown to benefit from both offline and online methods [15, 5, 14, 80, 9]. To address this issue, RL techniques offer established methodologies to optimize not only policies but also policy evaluation functions [75, 39, 68, 54, 53], which are then used to extract decision strategies. Along with this, RL also gives the possibility to maximize arbitrary and non-differentiable performance measures, so that more tracking-oriented objectives can be defined.
For the aforementioned motivations, the contribution of this paper is a novel tracking methodology where a student model exploits off-the-shelf trackers offline and online (tracking-by-trackers). The student is first trained via an effective strategy that combines KD and RL. After that, the model’s compressed knowledge can be used interchangeably depending on the application’s needs. We will show how to exploit the student in three setups which result in, respectively, (i) a fast tracker (TRAS), (ii) a tracker with a simple online mechanism (TRAST), and (iii) a tracker capable of expert tracker fusion (TRASFUST). Through extensive evaluation procedures, it will be demonstrated that each of the algorithms competes with the respective state-of-the-art class of trackers while performing in real-time.
2 Related Work
Here we review the trackers most related to ours. The network architecture implemented by the proposed student model takes inspiration from GOTURN and RE3. These regression-based CNNs were shown to capture the target's motion while performing very fast. However, the learning strategy employed optimizes parameters just for coordinate differences. Moreover, a great amount of data is needed to make such models achieve good accuracy. In contrast, our KD-RL-based method offers parameter optimization for overlap maximization and extracts previously acquired knowledge from other trackers, requiring less labeled data. Online adaptation methods like discriminative model learning [38, 57, 37] or discriminative correlation filters [14, 15, 5] have been studied extensively to improve tracking accuracy. These procedures are time-consuming and require particular assumptions and careful design. We propose a simple online update strategy where an off-the-shelf tracker is used to correct the performance of the student model. Our method does not make any assumption on such a tracker, which can thus be freely selected to fit application needs. Existing fusion models exploit trackers in the form of discriminative trackers, CNN feature layers, correlation filters, or out-of-the-box tracking algorithms [79, 71, 1, 70]. However, such models work only online and do not take advantage of the great amount of offline knowledge that expert trackers can provide. Furthermore, they are not able to track objects without the underlying trackers. Our student model addresses these issues thanks to the decision-making strategy learned via KD and RL.
KD and RL.
We review the learning strategies most related to ours. KD techniques have been used for transferring knowledge between teacher and student models [7, 33], where the supervised learning setting has been employed more often [25, 29, 69, 45] than the setup that uses RL [64, 59]. In the context of computer vision, KD was employed for action recognition [24, 74], object detection [10, 65], semantic segmentation [48, 30], and person re-identification. In the visual tracking panorama, KD was explored in [72, 49] to compress CNN representations for correlation filter trackers and siamese network architectures, respectively. However, these works offer methods involving teachers specifically designed as correlation filter and siamese trackers, and so cannot be adapted to generic visual trackers as we propose in this paper. Moreover, to the best of our knowledge, no method mixing KD and RL is currently present in the computer vision literature. Our learning procedure is also related to the strategies that use deep RL to learn tracking policies [80, 66, 63, 9, 52]. Our formulation shares some characteristics with such methods in the Markov decision process (MDP) definition, but our proposed learning algorithm is different, as no existing method leverages teachers to learn the policy.
The key point of this paper is to learn a simple and fast student model with versatile tracking abilities. KD is used to transfer the tracking knowledge of off-the-shelf trackers to a compressed model. However, as both offline and online strategies are necessary for tracking [15, 5, 80, 9], we propose to augment the KD framework with an RL optimization objective. RL techniques deliver unified optimization strategies to directly maximize a desired performance measure (in our case, the overlap between the predicted and ground-truth bounding boxes) and to predict the expectation of such a measure. We use the latter as the basis for an online evaluation and selection strategy. Put in other words, combining KD and RL lets the student model extract a tracking policy from the teachers, improve it, and express its quality through an overlap-based objective.
Given a transfer set of videos $\mathcal{D} = \{\mathcal{V}_1, \dots, \mathcal{V}_M\}$, we consider the $j$-th video as a sequence of frames $\mathcal{V}_j = \{F_t\}_{t=0}^{T}$, where each $F_t$ belongs to the space $\mathcal{F}$ of RGB images. Let $b_t = [x_t, y_t, w_t, h_t]$ be the $t$-th bounding box, defining the coordinates of the top-left corner, and the width and height of the rectangle that contains the target object. At time $t$, given the current frame $F_t$, the goal of the tracker is to predict the bounding box $b_t$ that best fits the target in $F_t$. We formally consider the student model as a function $\mathcal{S}$ which, when inputted with frame $F_t$, outputs the relative motion $a_t$ between $b_{t-1}$ and $b_t$, and the performance evaluation $v_t$. Similarly, we define the set of tracking teachers as $\mathcal{T} = \{T^{(i)}\}_{i=1}^{n}$, where each $T^{(i)}$ is a function that, given a frame image, produces a bounding box estimate for that frame.
3.2 Visual Tracking as an MDP
In our setting, the student $\mathcal{S}$ is treated as an artificial agent which interacts with an MDP defined over a video $\mathcal{V}$. The interaction happens through a temporal sequence of states $s_t$, actions $a_t$ and rewards $r_t$. In the $t$-th frame, the student is provided with the state $s_t$ and outputs the continuous action $a_t$, which consists in the relative motion of the target object, i.e. it indicates how its bounding box, which is known in frame $F_{t-1}$, should move to enclose the target in frame $F_t$. The student is rewarded by the measure of its quality $r_t$. We refer to this interaction process as the episode, whose dynamics are defined by the MDP.
Every state $s_t$ is defined as a pair of image patches obtained by cropping $F_{t-1}$ and $F_t$ using $b_{t-1}$. Specifically, $s_t = (\rho(F_{t-1}, b_{t-1}), \rho(F_t, b_{t-1}))$, where $\rho$ crops a frame within the area of the bounding box that has the same center coordinates as $b_{t-1}$, but whose width and height are scaled by a context factor $k$. By selecting $k > 1$, we can control the amount of additional image context information provided to the student.
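As an illustration of the state definition, the cropping procedure can be sketched as follows (a minimal sketch: function names, the NumPy image layout, and the clipping behavior are our assumptions, not specified by the paper):

```python
import numpy as np

def crop_state(frame, box, k=1.5):
    """Crop `frame` (H x W x 3) around `box` = (x, y, w, h): the crop
    shares the box center, but its width and height are scaled by k."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0        # box center
    cw, ch = w * k, h * k                    # enlarged crop size
    x0, y0 = int(max(0, cx - cw / 2.0)), int(max(0, cy - ch / 2.0))
    x1 = int(min(frame.shape[1], cx + cw / 2.0))
    y1 = int(min(frame.shape[0], cy + ch / 2.0))
    return frame[y0:y1, x0:x1]

def make_state(prev_frame, frame, prev_box, k=1.5):
    """A state s_t is the pair of crops from frames t-1 and t, both
    centered on the previously known bounding box b_{t-1}."""
    return crop_state(prev_frame, prev_box, k), crop_state(frame, prev_box, k)
```

With $k > 1$, the crop contains extra context around the target, as described above.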
Actions and State Transition.
Each action $a_t$ consists in a vector $[\Delta x, \Delta y, \Delta w, \Delta h]$ which defines the relative horizontal and vertical translations ($\Delta x$ and $\Delta y$, respectively) and the width and height scale variations ($\Delta w$ and $\Delta h$, respectively) that have to be applied to $b_{t-1}$ to predict $b_t$. The latter step is obtained through $b_t = f(b_{t-1}, a_t)$.
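A possible instantiation of the motion model $f$ is sketched below (the exact parameterization of translations and scales is our assumption, not the paper's definition):

```python
def apply_action(box, action):
    """Apply a relative-motion action (dx, dy, dw, dh) to the previous
    box (x, y, w, h): translations are expressed in units of the box
    size, and the scale terms multiply width and height."""
    x, y, w, h = box
    dx, dy, dw, dh = action
    return (x + dx * w, y + dy * h, w * dw, h * dh)
```

Under this parameterization, the identity action $[0, 0, 1, 1]$ leaves the box unchanged.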
The reward function expresses the quality of the action $a_t$ taken in state $s_t$ and is used to give feedback to the student. Our reward definition is based on the Intersection-over-Union (IoU) metric computed between $b_t$ and the ground-truth bounding box, denoted as $g_t$, i.e. $\text{IoU}(b_t, g_t) = |b_t \cap g_t| / |b_t \cup g_t|$.
At every interaction step $t$, the reward is formally defined as $r_t = w(\text{IoU}(b_t, g_t))$,
with $w(\cdot)$ that floors its input to the closest decimal digit and shifts the range from $[0, 1]$ to $[-1, 1]$.
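The reward can be sketched as follows (boxes are (x, y, w, h) tuples; the helper names are ours):

```python
import math

def iou(a, b):
    """Intersection-over-Union between two (x, y, w, h) boxes."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def reward(pred, gt):
    """w(.): floor the IoU to the closest decimal digit, then shift
    the range from [0, 1] to [-1, 1]."""
    return 2.0 * math.floor(10.0 * iou(pred, gt)) / 10.0 - 1.0
```

A perfect prediction thus yields reward 1, while a prediction with no overlap yields -1.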
3.3 Learning Tracking from Teachers
The student is first trained in an offline stage. Through KD, knowledge is transferred from the teachers $\mathcal{T}$ to the student $\mathcal{S}$. By means of RL, such knowledge is improved and the ability to evaluate its quality is also acquired. All the gained knowledge will be used for online tracking. We implement $\mathcal{S}$ as a parameterized function $\mathcal{S}_\theta$ that, given $s_t$, outputs at the same time the action $a_t$ and the state value $v_t$. In RL terms, $\mathcal{S}_\theta$ maintains representations of both the policy and the state-value functions. The proposed learning framework, which is depicted in Figure 3, provides a single offline end-to-end learning stage. Students are distributed as parallel and independent learning agents. Each one owns a set of learnable weights $\theta'$ that are used to generate experience by interacting with the MDP. The obtained experience, in the form of gradients, is used to update asynchronously a shared set of weights $\theta$. After ending an episode, each student updates its $\theta'$ by copying the values of the currently available $\theta$. The entire procedure is repeated until convergence. This learning architecture follows the recent trends in RL that make use of distributed algorithms to speed up the training phase [56, 53, 21]. We devote half of the students, which we refer to as distilling students, to acquiring knowledge from the teachers' tracking policy. The other half, called autonomous students, learn to track by interacting with the MDP autonomously.
Each distilling student interacts with the MDP by observing states, performing actions and receiving rewards, just as an autonomous student does. However, to distill knowledge independently of the teachers' inner structure, we propose that the student learn from the actions of the teachers, which are executed in parallel. In particular, the following loss function is exploited every $t_{max}$ steps,
which is the L1 loss between the actions performed by the student and the actions $\hat{a}_t$ that the teacher would take to move the student's bounding box into the teacher's prediction.
The best-performing teacher is selected at each step, as we would like to always learn from the best teacher. The absolute values are multiplied by the indicators $m_t$. Each of these is computed along the interaction and determines the steps at which the student performed worse ($m_t = 1$) or better ($m_t = 0$) than the teacher in terms of the rewards $r_t$ and $\hat{r}_t$. The whole Eq. (3) is similar to what was proposed for KD from the bounding-box predictions of object detectors. However, here we provide a temporal formulation of such an objective, and we swap the L2 loss with the L1, which was shown to work better for regression-based trackers [31, 26]. By optimizing Eq. (3), the weights are changed only if the student's performance is lower than that of the teacher. In this way, the teacher transfers its knowledge by suggesting actions only in badly performing cases. In the other cases, we leave the student free to follow its current tracking policy, since it is superior.
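The masked distillation loss can be sketched as follows (a sketch: the array shapes and the normalization over the selected steps are our assumptions):

```python
import numpy as np

def distillation_loss(a_student, a_teacher, r_student, r_teacher):
    """L1 distance between student and teacher actions, counted only
    at the steps where the teacher obtained a higher reward (m_t = 1);
    at the other steps the student keeps its own policy."""
    m = (r_teacher > r_student).astype(float)[:, None]  # per-step mask m_t
    return (m * np.abs(a_student - a_teacher)).sum() / max(m.sum(), 1.0)
```

When the student beats the teacher at every step, the mask zeroes the loss and no distillation gradient is produced.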
The learning process performed by the autonomous students follows the standard RL method for continuous control. Each student interacts with the MDP for a maximum of $t_{max}$ steps. At each step $t$, the students sample actions from a normal distribution $\mathcal{N}(\mu, \sigma)$, where the mean $\mu$ is defined as the student's predicted action $a_t$, and the standard deviation is obtained as $\sigma = |a_t - a^*_t|$ (the absolute value of the difference between the student's action and the action $a^*_t$ that obtains, by shifting $b_{t-1}$, the ground-truth bounding box $g_t$). Intuitively, $\sigma$ shrinks when $a_t$ is close to the ground-truth action $a^*_t$, reducing the chance of choosing potentially wrong actions when approaching the correct one. On the other hand, when $a_t$ is distant from $a^*_t$, $\sigma$ spreads, letting the student explore more. The students also predict the state value $v_t$, which is the cumulative reward that the student expects to receive from $s_t$ to the end of the interaction. Since the proposed reward definition is a direct measure of the IoU between the predicted and ground-truth bounding boxes, $v_t$ gives an estimate of the total amount of IoU that the student expects to obtain from state $s_t$ onwards. Thus, this function can be exploited as a future-performance evaluator. After $t_{max}$ steps of interaction, the gradient to update the shared weights $\theta$ is built as in standard advantage actor-critic methods, combining a policy-gradient term weighted by the advantage with a value-regression term.
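The exploration scheme of the autonomous students can be sketched as follows (a sketch; the element-wise standard deviation follows the description above, while the function name is ours):

```python
import numpy as np

def sample_action(a_pred, a_gt, rng=None):
    """Sample an exploratory action from N(a_pred, |a_pred - a_gt|):
    the noise shrinks as the predicted action approaches the
    ground-truth action, and spreads when it is far from it."""
    if rng is None:
        rng = np.random.default_rng()
    sigma = np.abs(np.asarray(a_pred) - np.asarray(a_gt))
    return rng.normal(a_pred, sigma)
```

Note the limiting case: when the prediction already matches the ground-truth action, the sampled action is deterministic.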
3.4 Student Architecture
The architecture used to maintain the representation of both the policy and the state-value functions, which is pictured in Figure 3, is simple and presents a structure similar to the one proposed in [31, 26]. The network takes as input two image patches that pass through two ResNet-18-based convolutional branches that share weights. The feature maps produced by the branches are first linearized, then concatenated together, and finally fed to two consecutive fully connected layers with ReLU activations. After that, the features are given to an LSTM layer. Both the fully connected layers and the LSTM are composed of 512 neurons. The output of the LSTM is ultimately fed to two separate fully connected heads, one that outputs the action $a_t$ and the other that outputs the value of the state $v_t$.
3.5 Tracking after Learning
After the learning process, the student is ready to be used for tracking. Here we describe three different ways in which can be exploited:
the student’s learned policy is used to predict bounding boxes independently from the teachers. We call this setting TRAS (TRAcking Student).
the learned policy and value function are used to, respectively, predict bounding boxes and evaluate the student's and the teacher's tracking behaviors, in order to correct the former's performance. We refer to this setup as TRAST (TRAcking Student and Teacher).
the learned state-value function is used to evaluate the performance of the pool of teachers in order to choose the best and perform tracker fusion. We call this setup TRASFUST (TRAcking by Student FUSing Teachers).
In the following, we provide more details about the three settings. For a better understanding, the setups are visualized in Figure 3.
In this setting, each tracking sequence, with the target object outlined by the initial bounding box $b_0$, is considered as described in Section 3.2. States are extracted from the frames, and actions are performed by means of the student's learned policy and used to output the bounding boxes. This setup is fast, as it requires just a forward pass through the network to obtain a bounding box prediction.
In this setup, the student makes use of the learned model to predict $b_t$ and to evaluate its own running tracking quality as well as that of the teacher, which is run in parallel. In particular, at each time step $t$, the expected returns $v_t$ and $\hat{v}_t$ are obtained as performance evaluations for the student and the teacher respectively. The teacher state $\hat{s}_t$ is obtained by cropping around the teacher's previous prediction. By comparing the two expected returns, TRAST decides whether to output the student's or the teacher's bounding box. More formally, if $v_t \geq \hat{v}_t$ then the student's box is output, otherwise the teacher's. This assignment has the side effect of correcting the tracking behaviour of the student as, at the successive time step, the previously known bounding box becomes the previous prediction of the teacher. Thus, the online adaptation consists in a very simple procedure that evaluates the teacher's performance to eventually pass control to it. Notice that, at every $t$, the execution of the teacher is independent from that of the student, as the evaluations are done based on the predictions given at $t-1$; neither needs the other to finish first. Hence, the two executions can be run in parallel, with the overall speed of TRAST being the lowest between that of the student and that of the teacher.
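The TRAST selection rule amounts to a single comparison per frame (a sketch with assumed names):

```python
def trast_select(b_student, b_teacher, v_student, v_teacher):
    """Output the student's box unless the teacher's expected future
    IoU (its state value as predicted by the student) is higher; the
    chosen box becomes the reference box at the next step, which is
    what corrects the student online."""
    return b_student if v_student >= v_teacher else b_teacher
```

The tie-breaking in favor of the student is our assumption.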
In this tracking setup, just the student's learned state-value function is exploited. At each step $t$, the teachers are executed following their standard methodologies. The states $\hat{s}^{(i)}_t$ are obtained from their predictions. The performance evaluation of the teachers is obtained through the student as $\hat{v}^{(i)}_t$. The output bounding box is selected by considering the teacher that achieves the highest expected return, i.e. $b_t = b^{(i^*)}_t$ with $i^* = \arg\max_i \hat{v}^{(i)}_t$.
This procedure consists in fusing sequence-wise the predictions of the teachers and, similarly to TRAST, the execution of teachers and student can be run in parallel. In this setting, the speed of TRASFUST is the lowest among those of the student and of each teacher.
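The TRASFUST fusion step reduces to an argmax over the student's value estimates for the teachers (a sketch with assumed names):

```python
def trasfust_select(teacher_boxes, teacher_values):
    """Emit, at each step, the bounding box of the teacher to which
    the student assigns the highest expected return."""
    best = max(range(len(teacher_values)), key=lambda i: teacher_values[i])
    return teacher_boxes[best]
```

Note that the student never produces a box itself here; it only acts as the evaluator of the pool.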
4 Experimental Results
4.1 Experimental Setup
The tracking teachers selected for this work are KCF, MDNet, ECO, SiamRPN, ATOM, and DiMP, due to their established popularity in the visual tracking panorama. Moreover, since they tackle visual tracking with different approaches, they can provide knowledge of various quality. In the experiments, we considered exploiting either a single teacher or a pool of teachers, and several sets of teachers were examined.
The selected transfer set was the training set of the GOT-10k dataset, due to its large scale. Only the teachers that had not been trained on this dataset were used for offline learning. This is an important point, because unbiased examples of the trackers' behavior should be exploited to train the student. Moreover, predictions that exhibit meaningful knowledge should be retained. Therefore, we filtered out all the videos in which the teacher predictions did not satisfy a minimum IoU threshold $\tau$ on all frames, the minimum for a prediction to be considered positive; we then varied $\tau$ among 0.6, 0.7, 0.8, 0.9 for more precise predictions. To produce more training samples, videos and filtered trajectories were split into five randomly indexed sequences of 32 frames and bounding boxes, similarly to prior work. In Table 1 a summary of the transfer set is presented. The number of positive trajectories, the average overlap (AO) on the transfer set, and the total number of sequences are reported per teacher and per $\tau$.
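The filtering step can be sketched as follows (a sketch; the per-frame teacher IoUs with the ground truth are assumed to be precomputed, and the names are ours):

```python
def filter_videos(videos, teacher_ious, tau):
    """Keep only the videos in which the teacher's per-frame IoU with
    the ground truth never drops below the threshold tau."""
    return [v for v, ious in zip(videos, teacher_ious) if min(ious) >= tau]
```

Raising tau keeps only the trajectories where the teacher tracked precisely, at the cost of discarding training data.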
Benchmarks and Performance Measures.
We performed performance evaluations on the GOT-10k test set, UAV123, LaSOT, OTB-100 and VOT2019 datasets. These offer videos of various nature and difficulty, and are all popular benchmarks in the visual tracking community. The evaluation protocol used for GOT-10k is the one-pass evaluation (OPE), along with the metrics: AO, and success rates (SR) with overlap thresholds 0.5 and 0.75. For UAV123, LaSOT, and OTB-100, the OPE method was considered with the area under the curve (AUC) of the success and precision plots, referred to as success score (SS) and precision score (PS) respectively. Evaluation on VOT2019 is performed in terms of expected average overlap (EAO), accuracy (A), and robustness (R). Further details about the benchmarks are given in Appendix B.0.1.
The image crops of the states were resized to a fixed input size and standardized by the mean and standard deviation calculated on the ImageNet dataset. The ResNet-18 weights were pre-trained for image classification on the same dataset. The image context factor was kept fixed across experiments. The training videos were processed in chunks of 32 frames. At test time, every 32 frames, the LSTM's hidden state is reset to the one obtained after the first student prediction.
Due to hardware constraints, the training students were distributed on 4 NVIDIA TITAN V GPUs of a machine with an Intel Xeon E5-2690 v4 @ 2.60GHz CPU and 320 GB of RAM.
The discount factor was set to 1. The length of the interaction before an update was defined as $t_{max}$ steps.
The RAdam optimizer was employed, with the same learning rate for both distilling and autonomous students. A weight decay term was also added as regularization. To control the magnitude of the gradients and stabilize learning, the gradient was scaled by a constant factor.
The student was trained until the validation performance on the GOT-10k validation set stopped improving. Longest trainings took around 10 days.
The speed of the parallel setups of TRAST and TRASFUST was computed by considering the speed of the slowest tracker (student or teacher) plus an overhead.
The code was implemented in Python and is publicly available.
In the following sections, when not specified, the three tracker setups regard the student trained with the default transfer set and teacher pool, paired with a single teacher in TRAST, and managing a pool of teachers in TRASFUST.
In Table 2 the performance of TRAS, TRAST and TRASFUST is reported, while the performance of the teachers is presented in the first six rows of Table 5. TRAS results in a very fast method with good accuracy. Combining KD and RL yields the best performance, outperforming the baselines that use for training just the ground-truth (TRAS-GT), KD and ground-truth (TRAS-KD-GT), or just KD (TRAS-KD). We did not report the performance of the student trained only by RL because convergence was not attained, due to the large state and action spaces.
Exploiting the teacher during tracking is an effective online procedure. Indeed, TRAST improves over TRAS by 24% on average, and a qualitative example of the ability to pass control to the teacher is given in Figure 5.
The performance of TRASFUST confirms the student's evaluation ability. It is the most accurate and robust tracker, thanks to the effective fusion of the underlying trackers.
Overall, all three trackers show balanced performance across the benchmarks, thus demonstrating good generalization.
Removing the curriculum learning strategy (TRAS-no-curr, TRAST-no-curr, TRASFUST-no-curr) slightly decreases the performance of all three trackers.
In Figure 5 the performance of the trackers is reported for different classes of sequences of VOT2019, while in Figure 8 some qualitative examples are presented.
Impact Of Teachers.
In Table 3 the performance of the proposed trackers in different student-teacher setups is reported. The general trend of the three trackers reflects the increasing tracking capabilities of the teachers. Indeed, on every considered benchmark, the tracking ability of the student increases as a stronger teacher is employed. For TRAST, this is also proven by Figure 8 (a), where we show that better teachers are exploited more. Moreover, using more than one teacher during training leads to superior tracking policies and to better exploitation of them during tracking. Although, in general, student models cannot outperform their teachers due to their simple and compressed architectures, TRAS and TRAST show such behavior on benchmarks where the teachers are weak. Using two teachers during tracking is the best TRASFUST configuration, as in this setup it outperforms the best teacher by more than 2% on all the considered benchmarks. When weaker teachers are added to the pool, the performance tends to decrease, suggesting a behavior similar to the one previously pointed out in the literature. Part of the error committed by TRAST and TRASFUST on benchmarks like OTB-100 and VOT2019 is explained by Figure 8. In situations of ambiguous ground-truth, these trackers make predictions that are qualitatively better but quantitatively worse. TRAST and TRASFUST show to be unbiased toward the training teachers, as their capabilities generalize also to teachers which are not exploited during training.
In Table 4 we present the performance of the proposed trackers while considering different qualities of teacher actions. Increasing the quality, and thus reducing the number of videos, results in decreasing performance for all three trackers. The loss is not significant at the lower thresholds, while, considering more precise actions, TRAS suffers the most, suggesting that more data is a key factor for an autonomous tracking policy. Interestingly, TRAST and TRASFUST are able to perform tracking even if the student is trained with limited training samples. The plot (b) in Figure 8 confirms that the student relies effectively on its teacher, as the latter's output is selected more often when the student loses performance.
Running the student takes just 11ms on our machine. TRAS performs at 90 FPS. The speed of TRAST and TRASFUST depends on the chosen teacher and varies between 5 and 40 FPS, as shown in Table 3. In parallel setups, TRAST and TRASFUST run in real-time if the teachers do so.
State of the Art Comparison.
In Table 5 we report the results of the proposed trackers against the state-of-the-art. In the following comparisons, we consider the results of the best configurations proposed in the above analysis.
TRAS outperforms GOTURN and RE3, which employ a similar DNN architecture but different learning strategies. On GOT-10k and LaSOT it also surpasses the recent GradNet and ROAM, and GCT on UAV123. TRAST outperforms ATOM and SiamCAR on GOT-10k, UAV123, and LaSOT, while losing little performance to DiMP. The performance is better than that of RL-based trackers [80, 9, 63] on UAV123 and comparable on OTB-100. Finally, TRASFUST outperforms all the trackers on all the benchmarks (where the pool was used). Remarkable results are obtained on UAV123 and OTB-100, with SS of 0.679 and 0.701 and PS of 0.873 and 0.931, respectively. Large improvements are achieved over all the methods that include expert trackers in their pipelines.
(Table 5 excerpt for fusion approaches: Zhu et al. — 0.587, 0.788, 36 FPS; Li et al. — 0.621, 0.864, 6 FPS; dashes in the original table denote benchmarks not reported.)
In this paper, a novel methodology for visual tracking is proposed. KD and RL are joined in a novel framework where off-the-shelf tracking algorithms are employed to compress knowledge into a CNN-based student model. After learning, the student can be exploited in three different tracking setups, TRAS, TRAST and TRASFUST, depending on application needs. An extensive validation shows that the proposed trackers TRAS and TRAST compete with the state-of-the-art, while TRASFUST outperforms recently published methods and fusion approaches. All trackers can run in real-time.
This work is supported by the ACHIEVE-ITN project.
Supplementary Material of Tracking-by-Trackers with a Distilled and Reinforced Model
Appendix A Methodology
a.1 MDP Auxiliary Functions
The function $f$ used to obtain the bounding box $b_t$, given the action $a_t$ and the previous bounding box $b_{t-1}$, is defined such that the relative translations and scale variations of $a_t$ are applied to the coordinates of $b_{t-1}$. The function employed to obtain the expert action $\hat{a}_t$ given the teacher bounding boxes is defined as the inverse of this motion model.
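Under the same assumed motion parameterization used earlier (relative translations and scale ratios, our assumption rather than the paper's exact definition), the expert-action extraction inverts $f$:

```python
def teacher_action(prev_box, teacher_box):
    """The action that moves prev_box = (x, y, w, h) onto teacher_box:
    translations are relative to the previous box size, and scales are
    ratios of widths and heights (assumed parameterization)."""
    x, y, w, h = prev_box
    tx, ty, tw, th = teacher_box
    return ((tx - x) / w, (ty - y) / h, tw / w, th / h)
```

Applying the resulting action to `prev_box` with the forward motion model recovers `teacher_box`.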
a.2 Curriculum Learning Strategy
A curriculum learning strategy is designed to further facilitate and improve the student's learning. After terminating each episode, a success counter $c_K$ for the current terminal video $\mathcal{V}_K$ is increased if the student performs better than the teacher in that interaction, i.e. if the former's cumulative reward, received up to the end of the episode, is greater than or equal to the one obtained by the latter. In formal terms, $c_K$ is updated if the condition $\sum_t r_t \geq \sum_t \hat{r}_t$ holds.
The counter update is done by testing students that interact with the MDP by exploiting the current shared weights. The terminal video index $K$ is successively increased during the training procedure by a central process which checks whether $c_K$ has reached a predefined threshold. After each update of $K$, $c_K$ is reset to zero. By this setup, we ensure that, at every increase of $K$, students face a simpler learning problem in which they are likely to succeed, and in a shorter time, since they have already developed a tracking policy that, up to $\mathcal{V}_K$, is at least as good as the teacher's. We found this strategy to work well in practice.
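The curriculum bookkeeping can be sketched as follows (the success threshold and the cap on $K$ are our assumptions; the paper only states that a central process checks the counter):

```python
def update_curriculum(K, counter, student_return, teacher_return,
                      threshold=10, K_max=1000):
    """Increase the success counter when the student's cumulative
    reward matches or beats the teacher's on the current terminal
    video; once the counter reaches the (assumed) threshold, advance
    the terminal video index K and reset the counter."""
    if student_return >= teacher_return:
        counter += 1
    if counter >= threshold and K < K_max:
        K, counter = K + 1, 0
    return K, counter
```

Each advance of K exposes the students to one more video while preserving the policy already learned on the previous ones.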
Appendix B Experimental Setup
Benchmarks and Performance Measures.
In this subsection we offer more details about the benchmark datasets and the relative performance measures employed to validate our methodology.
GOT-10k Test Set. The GOT-10k test set is composed of 180 videos. Target objects belong to 84 different classes, and 32 forms of object motion are present. An interesting note is that, except for the class person, there is no overlap between the object classes in the training and test splits. For the person class, there is no overlap in the type of motion. The evaluation protocol proposed by the authors is the one-pass evaluation (OPE), while the metrics used are the average overlap (AO) and the success rates (SR) with overlap thresholds 0.5 and 0.75.
OTB-100. The OTB-100 benchmark is a set of 100 challenging videos and is widely used in the tracking literature. The standard evaluation procedure for this dataset is the OPE method, while the area under the curve (AUC) of the success and precision plots, referred to as success score (SS) and precision score (PS) respectively, is utilized to quantify trackers' performance.
LaSOT. A performance evaluation was also performed on the test set of LaSOT benchmark . This dataset is composed of 280 videos, with a total of more than 650k frames and an average sequence length of 2500 frames, that is higher than the lengths of the videos contained in the aforementioned benchmarks. The same methodology and metrics used for the OTB  experiments are employed.
Vot2019. The VOT benchmarks are datasets used in the annual VOT tracking competitions. These sets change year by year, introducing increasingly challenging tracking scenarios. We evaluated our trackers on the set of the VOT2019 challenge , which provides 60 highly challenging videos. Within the framework used by the VOT committee, trackers are evaluated based on Expected Average Overlap (EAO), Accuracy (A) and Robustness (R) . Differently from the OPE, the VOT evaluation protocol presents the automatic re-initialization of the tracker when the IoU between its estimated bounding box and the ground-truth becomes zero.
Appendix C Additional Results
C.1 Impact of Transfer Set
We evaluated how performance changes when other sources of video data are considered. In keeping with the idea that unbiased demonstrations of the teachers should be employed, we used the training set of the LaSOT benchmark. This dataset is smaller than the training set of GOT-10k and contains 1120 videos with approximately 2.83M frames. After filtering the trajectories, we obtained the transfer set whose specifications are given in Table 6.
The results are shown in Table 7. The amount of training samples is lower than the amount obtained by filtering the GOT-10k transfer set, and the proposed trackers exhibit a behaviour that reflects the loss of data (as seen in Table 4). This experiment suggests that the quantity of data has more impact than its quality.
C.2 Success and Precision Plots on OTB-100
At this link https://youtu.be/uKtQgPk3nCU we provide a video showing the tracking abilities of the proposed trackers. For each sequence, the predictions of TRAS, TRAST and TRASFUST are shown. For TRAST and TRASFUST, we also report which tracker's prediction was chosen as the output (marked with the label ”CONTROLLING” next to the tracker's name).
- (2014) A superior tracking approach: Building a strong tracker through fusion. In European Conference on Computer Vision, Vol. 8695 LNCS, pp. 170–185.
- (2009) Curriculum learning. In 26th International Conference on Machine Learning, ICML ’09, New York, NY, USA, pp. 1–8.
- (2016) Staple: Complementary learners for real-time tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1401–1409.
- (2016) Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision, Vol. 9914 LNCS, pp. 850–865.
- (2019) Learning Discriminative Model Prediction for Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
- (2010) Visual object tracking using adaptive correlation filters. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2544–2550.
- (2006) Model compression. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541.
- (2013) Robust visual tracking using an adaptive coupled-layer visual model. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (4), pp. 941–953.
- (2018) Real-time ’Actor-Critic’ Tracking. In European Conference on Computer Vision, pp. 318–334.
- (2017) Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems, pp. 743–752.
- (2019) On the efficacy of knowledge distillation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4793–4801.
- (2018) Real-time visual tracking by deep reinforced decision making. Computer Vision and Image Understanding 171, pp. 10–19.
- (2000) Real-time tracking of non-rigid objects using mean shift. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 142–149.
- (2017) ECO: Efficient Convolution Operators for Tracking. In IEEE Conference on Computer Vision and Pattern Recognition.
- (2019) ATOM: Accurate Tracking by Overlap Maximization. In IEEE Conference on Computer Vision and Pattern Recognition.
- (2017) Discriminative Scale Space Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (8), pp. 1561–1575.
- (2016) Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking. In European Conference on Computer Vision.
- (2009) ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- (2020) Siam-U-Net: encoder-decoder siamese network for knee cartilage tracking in ultrasound images. Medical Image Analysis.
- (2019) Visual Tracking by means of Deep Reinforcement Learning and an Expert Demonstrator. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops.
- (2018) IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. In 35th International Conference on Machine Learning, ICML 2018, pp. 2263–2284.
- (2019) LaSOT: A High-quality Benchmark for Large-scale Single Object Tracking. In IEEE Conference on Computer Vision and Pattern Recognition.
- (2019) Graph Convolutional Tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4649–4659.
- (2018) Modality distillation with multiple stream networks for action recognition. In European Conference on Computer Vision, Vol. 11212 LNCS, pp. 106–121.
- (2015) Blending LSTMs into CNNs.
- (2018) Re3: Real-time recurrent regression networks for visual tracking of generic objects. IEEE Robotics and Automation Letters 3 (2), pp. 788–795.
- (2020) SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking. In IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- (2016) Struck: Structured Output Tracking with Kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 2096–2109.
- (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (2019) Knowledge Adaptation for Efficient Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 578–587.
- (2016) Learning to Track at 100 FPS with Deep Regression Networks. In European Conference on Computer Vision.
- (2015) High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3), pp. 583–596.
- (2014) Distilling the Knowledge in a Neural Network. In NIPS 2014 Deep Learning Workshop.
- (1997) Long Short-Term Memory. Neural Computation 9 (8), pp. 1735–1780.
- (2017) MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.
- (2019) GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- (2018) Real-Time MDNet. In European Conference on Computer Vision.
- (2012) Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (7), pp. 1409–1422.
- (2000) Actor-Critic Algorithms. In Advances in Neural Information Processing Systems.
- (2019) The Seventh Visual Object Tracking VOT2019 Challenge Results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops.
- (2016) A Novel Performance Evaluation Methodology for Single-Target Trackers. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (11), pp. 2137–2155.
- (2019) SiamRPN++: Evolution of siamese visual tracking with very deep networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4277–4286.
- (2018) High Performance Visual Tracking with Siamese Region Proposal Network. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8971–8980.
- (2019) GradNet: Gradient-Guided Network for Visual Object Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
- (2017) Learning from Noisy Labels with Distillation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1928–1936.
- (2019) Online Multi-Expert Learning for Visual Tracking. IEEE Transactions on Image Processing 29, pp. 934–946.
- (2019) On the Variance of the Adaptive Learning Rate and Beyond.
- (2019) Structured Knowledge Distillation for Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2599–2608.
- (2019) Teacher-Students Knowledge Distillation for Siamese Trackers.
- (2018) Discriminative Correlation Filter Tracker with Channel and Spatial Reliability. International Journal of Computer Vision 126 (7), pp. 671–688.
- (2013) MATRIOSKA: A multi-level approach to fast tracking by learning. In International Conference on Image Analysis and Processing, Vol. 8157 LNCS, pp. 419–428.
- (2019) Long and short memory balancing in visual co-tracking using Q-learning. In IEEE International Conference on Image Processing (ICIP), pp. 3970–3974.
- (2016) Asynchronous methods for deep reinforcement learning. In 33rd International Conference on Machine Learning, ICML 2016, pp. 2850–2869.
- (2013) Playing Atari with Deep Reinforcement Learning. CoRR abs/1312.5.
- (2016) A Benchmark and Simulator for UAV Tracking. In European Conference on Computer Vision, pp. 445–461.
- (2015) Massively Parallel Methods for Deep Reinforcement Learning.
- (2016) Learning Multi-domain Convolutional Neural Networks for Visual Tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4293–4302.
- (2014) Online graph-based tracking. In European Conference on Computer Vision, Vol. 8693 LNCS, pp. 112–126.
- (2016) Actor-mimic: Deep multitask and transfer reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016.
- (2019) Distillation-Based Training for Multi-Exit Architectures. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
- (2018) Model compression via distillation and quantization. In International Conference on Learning Representations.
- (2016) Hedged Deep Tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4303–4311.
- (2018) Deep Reinforcement Learning with Iterative Shift for Visual Tracking. In European Conference on Computer Vision, pp. 684–700.
- (2016) Policy distillation. In 4th International Conference on Learning Representations, ICLR 2016.
- (2017) Incremental Learning of Object Detectors without Catastrophic Forgetting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3420–3429.
- (2017) Tracking as Online Decision-Making: Learning a Policy from Streaming Videos with Reinforcement Learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 322–331.
- (2018) Reinforcement Learning: An Introduction. 2nd edition, MIT Press, Cambridge, MA, USA.
- (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.
- (2016) Recurrent neural network training with dark knowledge transfer. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5900–5904.
- (2016) Online adaptive hidden Markov model for multi-tracker fusion. Computer Vision and Image Understanding 153, pp. 109–119.
- (2014) Ensemble-based tracking: Aggregating crowdsourced structured time series data. In 31st International Conference on Machine Learning, ICML 2014, pp. 2807–2817.
- (2020) Real-Time Correlation Tracking Via Joint Model Compression and Transfer. IEEE Transactions on Image Processing 29, pp. 6123–6135.
- (2019) Fast Online Object Tracking and Segmentation: A Unifying Approach. In IEEE Conference on Computer Vision and Pattern Recognition.
- (2019) Progressive Teacher-student Learning for Early Action Prediction. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3556–3565.
- (1992) Q-learning. Machine Learning 8 (3), pp. 279–292.
- (2019) Distilled Person Re-identification: Towards a More Scalable System. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1187–1196.
- (2013) Online object tracking: A benchmark. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2411–2418.
- (2020) ROAM: Recurrently Optimizing Tracking Model. In IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- (2012) Visual tracking via adaptive tracker selection with multiple features. In European Conference on Computer Vision, Vol. 7575 LNCS, pp. 28–41.
- (2017) Action-decision networks for visual tracking with deep reinforcement learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1349–1358.
- (2014) MEEM: Robust tracking via multiple experts using entropy minimization. In European Conference on Computer Vision, Vol. 8694 LNCS, pp. 188–203.
- (2019) Deeper and Wider Siamese Networks for Real-Time Visual Tracking. In IEEE Conference on Computer Vision and Pattern Recognition.
- (2018) Visual Tracking with Dynamic Model Update and Results Fusion. In IEEE International Conference on Image Processing, pp. 2685–2689.
- (2018) Distractor-aware siamese networks for visual object tracking. In European Conference on Computer Vision, Vol. 11213 LNCS, pp. 103–119.