Tracking-by-Trackers with a Distilled and Reinforced Model



Visual object tracking has generally been tackled by reasoning independently on fast processing algorithms, accurate online adaptation methods, and fusion of trackers. In this paper, we unify such goals by proposing a novel tracking methodology that takes advantage of other visual trackers, offline and online. A compact student model is trained via the marriage of knowledge distillation and reinforcement learning. The first allows the tracking knowledge of other trackers to be transferred and compressed. The second enables the learning of evaluation measures, which are then exploited online. After learning, the student can be used to build (i) a very fast single-shot tracker, (ii) a tracker with a simple and effective online adaptation mechanism, and (iii) a tracker that performs fusion of other trackers. Extensive validation shows that the proposed algorithms compete with real-time state-of-the-art trackers.

1 Introduction

Visual object tracking corresponds to the persistent recognition and localization (by means of bounding boxes) of a target object in consecutive video frames. This problem comes with several challenges, including object occlusion, fast motion, light changes, and motion blur. Additionally, real-time constraints are often imposed by the many practical applications, such as video surveillance, behavior understanding, autonomous driving, and robotics.

In the past, the community has proposed solutions emphasizing different aspects of the problem. Processing speed was pursued by algorithms like correlation filters [6, 32, 16, 3, 50] or offline methods such as siamese convolutional neural networks (CNNs) [31, 26, 4, 43, 42, 84, 82]. Improved performance was attained by online target adaptation methods [57, 37, 14, 15, 5]. Higher tracking accuracy and robustness were achieved by methods built on top of other trackers [81, 79, 71, 1, 70]. All these characteristics belong to an optimal tracker, but they have been studied independently of one another. The community currently lacks a general framework to tackle them jointly. In this view, a single model should be able to (i) track an object quickly, (ii) implement simple and effective online adaptation mechanisms, and (iii) apply decision-making strategies to select tracker outputs.

A large number of tracking algorithms have been produced so far, exploiting different principles. Early solutions were based on mean shift algorithms [13], key-point [51] or part-based methods [8, 58], or SVM learning [28]. Later, correlation filters gained popularity thanks to their fast processing times [6, 32, 16, 3, 50]. More recently, CNNs have been exploited to extract efficient image features. This kind of representation has been included in deep regression networks [31, 26], online tracking-by-detection methods [57, 37], solutions that treat visual tracking as a reinforcement learning (RL) problem [80, 66, 12, 63, 9, 20], CNN-based discriminative correlation filters [17, 14, 15, 5], and siamese CNNs [4, 43, 42, 82, 73, 19]. Other methods tried to take advantage of the output produced by multiple trackers [81, 79, 71, 1, 70]. Thus, one can imagine that different trackers incorporate different knowledge, and this may constitute a valuable resource to leverage during tracking.

Lately, the knowledge distillation (KD) framework [33] was introduced in deep learning as a paradigm for, among other uses [29, 69, 45, 60], knowledge transfer between models [25] and model compression [10, 35, 61]. The idea boils down to considering a student model and one or more teacher models to learn from. Teachers express their knowledge through demonstrations on a previously unseen transfer set. Through specific loss functions, the student is set to learn a task by matching the teachers’ output and the ground-truth labels. As visual tracking requires fast and accurate methods, KD can be a valuable tool to transfer the tracking ability of more accurate teacher trackers to more compact and faster student ones. However, the standard KD setup provides methods to exploit teachers only offline, not online. This makes the methodology unsuitable for tracking, which has been shown to benefit from both offline and online methods [15, 5, 14, 80, 9]. In contrast, RL techniques offer established methodologies to optimize not only policies but also policy evaluation functions [75, 39, 68, 54, 53], which are then used to extract decision strategies. In addition, RL makes it possible to maximize arbitrary and non-differentiable performance measures, and thus more tracking-oriented objectives can be defined.

For the aforementioned motivations, the contribution of this paper is a novel tracking methodology where a student model exploits off-the-shelf trackers offline and online (tracking-by-trackers). The student is first trained via an effective strategy that combines KD and RL. After that, the model’s compressed knowledge can be used interchangeably depending on the application’s needs. We will show how to exploit the student in three setups which result in, respectively, (i) a fast tracker (TRAS), (ii) a tracker with a simple online mechanism (TRAST), and (iii) a tracker capable of expert tracker fusion (TRASFUST). Through extensive evaluation procedures, it will be demonstrated that each of the algorithms competes with the respective state-of-the-art class of trackers while performing in real-time.

2 Related Work

Visual Tracking.

Here we review the trackers most related to ours. The network architecture implemented by the proposed student model takes inspiration from GOTURN [31] and RE3 [26]. These regression-based CNNs were shown to capture the target’s motion while running very fast. However, the learning strategy they employ optimizes parameters only for coordinate difference. Moreover, a great amount of data is needed for such models to achieve good accuracy. In contrast, our KD-RL-based method offers parameter optimization for overlap maximization and extracts previously acquired knowledge from other trackers, requiring less labeled data. Online adaptation methods like discriminative model learning [38, 57, 37] or discriminative correlation filters [14, 15, 5] have been studied extensively to improve tracking accuracy. These procedures are time-consuming and require particular assumptions and careful design. We propose a simple online update strategy where an off-the-shelf tracker is used to correct the performance of the student model. Our method does not make any assumption about such a tracker, and thus it can be freely selected to adapt to application needs. Present fusion models exploit trackers in the form of discriminative trackers [81], CNN feature layers [62], correlation filters [46] or out-of-the-box tracking algorithms [79, 71, 1, 70]. However, such models work only online and do not take advantage of the great amount of offline knowledge that expert trackers can provide. Furthermore, they are not able to track objects without those trackers. Our student model addresses these issues thanks to the decision-making strategy learned via KD and RL.

KD and RL.

We review the learning strategies most related to ours. KD techniques have been used for transferring knowledge between teacher and student models [7, 33], where the supervised learning setting was employed more often [25, 29, 69, 45] than the setup that uses RL [64, 59]. In the context of computer vision, KD was employed for action recognition [24, 74], object detection [10, 65], semantic segmentation [48, 30], and person re-identification [76]. In visual tracking, KD was explored in [72, 49] to compress CNN representations for correlation filter trackers and siamese network architectures, respectively. However, these works offer methods involving teachers specifically designed as correlation filter and siamese trackers, and so cannot be adapted to generic visual trackers as we propose in this paper. Moreover, to the best of our knowledge, no method mixing KD and RL is currently present in the computer vision literature. Our learning procedure is also related to the strategies that use deep RL to learn tracking policies [80, 66, 63, 9, 52]. Our formulation shares some characteristics with such methods in the Markov decision process (MDP) definition, but our proposed learning algorithm is different, as no existing method leverages teachers to learn the policy.

3 Methodology

The key point of this paper is to learn a simple and fast student model with versatile tracking abilities. KD is used for transferring the tracking knowledge of off-the-shelf trackers to a compressed model. However, as both offline and online strategies are necessary for tracking [15, 5, 80, 9], we propose to augment the KD framework with an RL optimization objective. RL techniques deliver unified optimization strategies to directly maximize a desired performance measure (in our case the overlap between prediction and ground-truth bounding boxes) and to predict the expectation of such measure. We use the latter as the basis for an online evaluation and selection strategy. In other words, combining KD and RL lets the student model extract a tracking policy from teachers, improve it, and express its quality through an overlap-based objective.

3.1 Preliminaries

Given a transfer set of videos , we consider the -th video as a sequence of frames , where is the space of RGB images. Let be the -th bounding box defining the coordinates of the top left corner, and the width and height of the rectangle that contains the target object. At time , given the current frame , the goal of the tracker is to predict that best fits the target in . We formally consider the student model as that is a function which outputs the relative motion between and , and the performance evaluation , when inputted with frame . Similarly, we define the set of tracking teachers as where each is a function that, given a frame image, produces a bounding box estimate for that frame.

Figure 1: Scheme of the proposed KD-RL-based learning framework. students interact independently with . After each done, a copy of the shared weights is sent to each one. Every steps each student sends the computed gradients to apply an update on . The distilling students (highlighted by the orange dashed contour) extract knowledge from teachers by optimizing . Autonomous students (circled with the blue dashed contour) learn an autonomous tracking policy by jointly optimizing and .
Figure 2: The student architecture is composed of two branches of convolutional layers (gray boxes) with shared weights, followed by two fully-connected layers (orange boxes), an LSTM layer (in green), and two parallel fully connected layers for the prediction of and respectively.
Figure 3: Visual representation showing how the student and teachers are employed in the proposed trackers at every frame . (a) represents TRAS, (b) TRAST, and (c) TRASFUST.

3.2 Visual Tracking as an MDP

In our setting, is treated as an artificial agent which interacts with an MDP defined over a video . The interaction happens through a temporal sequence of states , actions and rewards . In the -th frame, the student is provided with the state and outputs the continuous action which consists in the relative motion of the target object, i.e. it indicates how its bounding box, which is known in frame , should move to enclose the target in the frame . is rewarded by the measure of its quality . We refer to this interaction process as the episode , whose dynamics are defined by the MDP .


State.

Every is defined as a pair of image patches obtained by cropping and using . Specifically, , where crops the frames within the area of the bounding box that has the same center coordinates as but whose width and height are scaled by . By selecting , we can control the amount of additional image context information to be provided to the student.
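As a rough illustration of the cropping area described above, the following sketch scales a box around its center by a context factor. The `(x, y, w, h)` tuple layout follows the top-left-corner convention of Section 3.1; the function name `context_box` and the example factor are hypothetical and not taken from the paper.

```python
def context_box(box, k):
    """Return the box with the same center as `box` but width/height scaled by k.

    box: (x, y, w, h) with (x, y) the top-left corner.
    k:   context factor (k > 1 adds surrounding context). Hypothetical name.
    """
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0      # center of the original box
    new_w, new_h = k * w, k * h            # scaled size
    return (cx - new_w / 2.0, cy - new_h / 2.0, new_w, new_h)


# Example: doubling the context around a 40x20 box.
area = context_box((10.0, 20.0, 40.0, 20.0), 2.0)
```

The returned area may extend beyond the image borders; how such regions are padded is left open here.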

Actions and State Transition.

Each consists in a vector which defines the relative horizontal and vertical translations (, respectively) and width and height scale variations (, respectively) that have to be applied to to predict . The latter step is obtained through . After performing , the student moves through from to which is defined as the pair of cropped images obtained from and using .
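The exact update rule is not reproduced in this text, so the sketch below only illustrates one plausible reading: translations and scale variations are expressed as fractions of the previous box size, a common choice in regression-based trackers. This parameterization, and the names `apply_action` and `action_between`, are assumptions made for illustration.

```python
def apply_action(prev_box, action):
    """Move prev_box = (x, y, w, h) by a relative action (dx, dy, dw, dh).

    Assumed parameterization: center shifts are fractions of the previous
    width/height; size changes are fractions of the previous size.
    """
    x, y, w, h = prev_box
    dx, dy, dw, dh = action
    cx = x + w / 2.0 + dx * w
    cy = y + h / 2.0 + dy * h
    new_w = w + dw * w
    new_h = h + dh * h
    return (cx - new_w / 2.0, cy - new_h / 2.0, new_w, new_h)


def action_between(src_box, dst_box):
    """Inverse of apply_action: the action that moves src_box onto dst_box."""
    sx, sy, sw, sh = src_box
    tx, ty, tw, th = dst_box
    dx = ((tx + tw / 2.0) - (sx + sw / 2.0)) / sw
    dy = ((ty + th / 2.0) - (sy + sh / 2.0)) / sh
    return (dx, dy, (tw - sw) / sw, (th - sh) / sh)
```

Under this assumption, `action_between` gives the action that would carry a box onto a ground-truth or teacher box, which is the quantity the learning objectives in Section 3.3 compare against.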


Reward.

The reward function expresses the quality of taken at and is used as feedback for the student. Our reward definition is based on the Intersection-over-Union (IoU) metric computed between and the ground-truth bounding box, denoted as , i.e.,


At every interaction step , the reward is formally defined as


with that floors to the closest digit and shifts the input range from to .
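A minimal sketch of an IoU-based reward of this kind is given below. The IoU computation is the standard one for axis-aligned boxes. The flooring step (`step=0.05`) and the mapping from [0, 1] to [-1, 1] are assumptions chosen for illustration, since the exact constants are not legible in this text; `shaped_reward` is a hypothetical name.

```python
import math


def iou(a, b):
    """Intersection-over-Union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def shaped_reward(overlap, step=0.05):
    """Floor the overlap to a multiple of `step`, then shift [0, 1] to [-1, 1].

    `step` and the target range are assumed values, not taken from the paper.
    """
    floored = math.floor(overlap / step) * step
    return 2.0 * floored - 1.0
```

Because the flooring makes the reward piecewise constant, it is not differentiable with respect to the predicted box, which is one reason an RL objective is convenient here.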

3.3 Learning Tracking from Teachers

The student is first trained in an offline stage. Through KD, knowledge is transferred from to . By means of RL, such knowledge is improved and the ability to evaluate its quality is also acquired. All the gained knowledge will be used for online tracking. We implement as a parameterized function that given outputs at the same time the action and state-value . In RL terms, maintains representations of both the policy and the state value functions. The proposed learning framework, which is depicted in Figure 1, provides a single offline end-to-end learning stage. students are distributed as parallel and independent learning agents. Each one owns a set of learnable weights that are used to generate experience by interacting with . The obtained experience, in the form of , is used to update asynchronously a shared set of weights . After ending , each student updates its by copying the values of the currently available . The entire procedure is repeated until convergence. This learning architecture follows the recent trends in RL that make use of distributed algorithms to speed up the training phase [56, 53, 21]. We devote half of the students, which we refer to as distilling students, to acquiring knowledge from the teachers’ tracking policy. The other half, called autonomous students, learn to track by interacting with autonomously.

Distilling Students.

Each distilling student interacts with by observing states, performing actions and receiving rewards just as an autonomous student. However, to distill knowledge independently of the teachers’ inner structure, we let the student learn from the actions of , which are executed in parallel. In particular, is exploited every steps with the following loss function


which is the L1 loss between the actions performed by the student and the actions that the teacher would take to move the student’s bounding box into the teacher’s prediction . At every , is selected as


as we would like to learn always from the best teacher. The absolute values are multiplied by . Each of these is computed along the interaction and determines the status in which performed worse than () or better () in terms of the rewards and . The whole Eq. (3) is similar to what was proposed in [10] for KD from bounding-box predictions of object detectors. However, here we provide a temporal formulation of such objective and we replace the L2 loss with the L1, which was shown to work better for regression-based trackers [31, 26]. By optimizing Eq. (3), the weights are changed only if the student’s performance is lower than the performance of the teacher. In this way, the teacher transfers its knowledge by suggesting actions only in badly performing cases. In the others, we leave the student free to follow its current tracking policy since it is superior.
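The masked L1 objective described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the mask is taken to be 1 when the student's reward is strictly lower than the teacher's and 0 otherwise, and the best teacher is assumed to be the one with the highest cumulative reward. The names `select_teacher` and `distillation_loss` are hypothetical.

```python
def select_teacher(teacher_returns):
    """Index of the teacher with the highest cumulative reward (assumed criterion)."""
    return max(range(len(teacher_returns)), key=lambda i: teacher_returns[i])


def distillation_loss(student_actions, teacher_actions,
                      student_rewards, teacher_rewards):
    """Masked L1 loss over a sequence of steps.

    Each action is a 4-tuple (dx, dy, dw, dh). A step contributes only when
    the student was rewarded less than the teacher at that step.
    """
    loss = 0.0
    for a_s, a_t, r_s, r_t in zip(student_actions, teacher_actions,
                                  student_rewards, teacher_rewards):
        mask = 1.0 if r_s < r_t else 0.0   # learn only where the teacher did better
        loss += mask * sum(abs(p - q) for p, q in zip(a_s, a_t))
    return loss
```

Steps where the student already matches or beats the teacher contribute zero loss, so they produce no gradient and leave the student's current policy untouched.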

Autonomous Students.

The learning process performed by the autonomous students follows the standard RL method for continuous control [67]. Each student interacts with for a maximum of steps. At each step , the students sample actions from a normal distribution , where the mean is defined as the student’s predicted action, , and the standard deviation is obtained as (which is the absolute value of the difference between the student’s action and the action that obtains, by shifting , the ground-truth bounding box ). Intuitively, shrinks when is close to the ground-truth action , reducing the chance of choosing potentially wrong actions when approaching the correct one. On the other hand, when is distant from , spreads, letting the student explore more. The students also predict which is the cumulative reward that the student expects to receive from to the end of the interaction. Since the proposed reward definition is a direct measure of the IoU occurring between the predicted and the ground-truth bounding boxes, gives an estimate of the total amount of IoU that expects to obtain from state onwards. Thus, this function can be exploited as a future-performance evaluator. After steps of interaction, the gradient to update the shared weights is built as


where (6) is the policy loss and (7) is the value loss. These definitions follow the standard advantage actor-critic objective [68].
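For readers unfamiliar with the advantage actor-critic objective, the sketch below computes its two loss terms in plain Python. It follows the textbook formulation, not necessarily the exact equations of this paper: the small floor `eps` on the standard deviation and all function names are assumptions added for illustration.

```python
import math


def exploration_std(predicted_action, gt_action, eps=1e-3):
    """Per-dimension std as |predicted - ground-truth|; eps floor is an assumption."""
    return [max(abs(p - g), eps) for p, g in zip(predicted_action, gt_action)]


def gaussian_log_prob(action, mean, std):
    """Log-density of `action` under independent normals N(mean, std^2)."""
    return sum(-0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
               for a, m, s in zip(action, mean, std))


def n_step_returns(rewards, bootstrap_value, gamma=1.0):
    """Discounted returns R_i = r_i + gamma * R_{i+1}, bootstrapped at the end."""
    returns, running = [], bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def actor_critic_losses(log_probs, values, returns):
    """Policy loss -sum(log_prob * advantage) and value loss 0.5 * sum(advantage^2)."""
    advantages = [R - v for R, v in zip(returns, values)]
    policy_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages))
    value_loss = 0.5 * sum(adv ** 2 for adv in advantages)
    return policy_loss, value_loss
```

In a real training loop the advantage would be treated as a constant in the policy term, so that only the log-probability carries gradient; the plain-Python version above only shows the arithmetic.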

To further facilitate and improve the learning, a curriculum learning strategy [2] is built for each parallel student. During learning, the length of the interaction is increased as performs better than . Details are given in Appendix A.2.

3.4 Student Architecture

The architecture used to maintain the representation of both the policy and the state value functions, which is pictured in Figure 2, is simple and presents a structure similar to the one proposed in [31, 26]. The network gets as input two image patches that pass through two ResNet-18 based [29] convolutional branches that share weights. The feature maps produced by the branches are first linearized, then concatenated together and finally fed to two consecutive fully connected layers with ReLU activations. After that, features are given to an LSTM [34] layer. Both the fully connected layers and the LSTM are composed of 512 neurons. The output of the LSTM is ultimately fed to two separate fully connected heads, one that outputs the action and the other that outputs the value of the state .

3.5 Tracking after Learning

After the learning process, the student is ready to be used for tracking. Here we describe three different ways in which can be exploited:

  1. the student’s learned policy is used to predict bounding boxes independently from the teachers. We call this setting TRAS (TRAcking Student).

  2. the learned policy and value function are used to, respectively, predict and evaluate and tracking behaviors, in order to correct the former’s performance. We refer to this setup as TRAST (TRAcking Student and Teacher).

  3. the learned state-value function is used to evaluate the performance of the pool of teachers in order to choose the best and perform tracker fusion. We call this setup TRASFUST (TRAcking by Student FUSing Teachers).

In the following, we provide more details about the three settings. For a better understanding, the setups are visualized in Figure 3.


TRAS.

In this setting, each tracking sequence , with target object outlined by , is considered as described in Section 3.2. States are extracted from frames , actions are performed by means of the student’s learned policy and are used to output the bounding boxes . This setup is fast as it requires just a forward pass through the network to obtain a bounding box prediction.


TRAST.

In this setup, the student makes use of the learned to predict and to evaluate its running tracking quality and that of which is run in parallel. In particular, at each time step , and are obtained as performance evaluation for and respectively. The teacher state is obtained as . By comparing the two expected returns, TRAST decides whether to output the student’s or the teacher’s bounding box. More formally, if then otherwise . This assignment has the side effect of correcting the tracking behaviour of the student as, at the successive time step, the previously known bounding box becomes the previous prediction of the teacher. Thus, the online adaptation consists in a very simple procedure that evaluates ’s performance to eventually pass control to it. Notice that, at every , the execution of is independent from as the second does not need the first to finish because the evaluations are done based on the predictions given at . Hence, the executions of the two can be put in parallel, with the overall speed of TRAST resulting in the lowest between the one of and .
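The selection rule of TRAST can be sketched in a few lines. The direction of the comparison and the tie-breaking in favor of the student are assumptions, since the formal condition is not legible in this text; `trast_select` is a hypothetical name.

```python
def trast_select(student_box, student_value, teacher_box, teacher_value):
    """Pick the box whose expected return (as estimated by the student) is higher.

    Ties go to the student (assumed). Returns the chosen box and its source,
    so the caller can use the chosen box as the previous box at the next step.
    """
    if student_value >= teacher_value:
        return student_box, "student"
    return teacher_box, "teacher"


# Example: the teacher is judged more promising, so control passes to it.
box, source = trast_select((10, 10, 50, 40), 3.2, (12, 11, 48, 42), 4.1)
```

Feeding the returned box back as the previous box at the next frame is what corrects the student after the teacher takes over.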


TRASFUST.

In this tracking setup, only the student’s learned state-value function is exploited. At each step , teachers are executed following their standard methodology. States are obtained. The performance evaluation of the teachers is obtained through the student as . The output bounding box is selected as by considering the teacher that achieves the highest expected return, i.e.


This procedure consists in fusing sequence-wise the predictions of and, similarly to TRAST, the execution of teachers and student can be put in parallel. In such a setting, the speed of TRASFUST is the lowest among the ones of and of each .
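The fusion step amounts to an argmax over the student's value estimates, one per teacher. The sketch below illustrates this, together with the parallel-speed reasoning of the paragraph above; both function names are hypothetical, and any parallelization overhead is ignored.

```python
def trasfust_select(teacher_boxes, teacher_values):
    """Return the box (and index) of the teacher with the highest estimated return."""
    best = max(range(len(teacher_values)), key=lambda i: teacher_values[i])
    return teacher_boxes[best], best


def parallel_fps(fps_values):
    """Speed of trackers run in parallel is bounded by the slowest one."""
    return min(fps_values)


# Example with two teachers evaluated by the student.
box, idx = trasfust_select([(5, 5, 30, 30), (6, 4, 32, 29)], [2.7, 3.9])
```

When several teachers receive the same value, `max` keeps the first one in the list, so the pool order acts as an implicit tie-break in this sketch.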

# traj  AO      # seq   # traj  AO      # seq   # traj  AO      # seq   # traj  AO      # seq   # traj  AO      # seq
1884    0.798   9225    1097    0.836   5349    439     0.873   2122    73      0.914   356     0       0.0     0
1600    0.767   7859    781     0.808   3831    216     0.851   1052    18      0.898   86      0       0.0     0
2754    0.808   13526   1659    0.843   8122    720     0.879   3507    160     0.915   773     1       0.954   4
3913    0.829   19259   2646    0.854   12997   1447    0.878   7080    431     0.908   2097    9       0.947   42
4519    0.840   22252   3092    0.863   15195   1698    0.887   8307    496     0.915   2414    10      0.948   46
Table 1: Teacher-based statistics of the transfer set.

4 Experimental Results

4.1 Experimental Setup


Teachers.

The tracking teachers selected for this work are KCF [32], MDNet [57], ECO [14], SiamRPN [43], ATOM [15], and DiMP [5], due to their established popularity in visual tracking. Moreover, since they tackle visual tracking with different approaches, they can provide knowledge of various quality. In experiments, we considered exploiting a single teacher or a pool of teachers. In particular, the following sets of teachers were examined .

Transfer Set.

The selected transfer set was the training set of the GOT-10k dataset [36], due to its large scale. Just were used for offline learning, as none of these was trained on this dataset. This is an important point because unbiased examples of the trackers’ behavior should be exploited to train the student. Moreover, predictions that exhibit meaningful knowledge should be retained. Therefore, we filtered out all the videos whose teacher predictions did not satisfy for all . We considered as minimum threshold for a prediction to be considered positive, and we then varied among 0.6, 0.7, 0.8, 0.9 for more precise predictions. To produce more training samples, videos and filtered trajectories were split into five randomly indexed sequences of 32 frames and bounding boxes, similarly to [26]. In Table 1 a summary of is presented. The number of positive trajectories, the average overlap (AO) [36] on the transfer set, and the total number of sequences are reported per teacher and per .
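The filtering and splitting steps can be sketched as follows. The use of a non-strict inequality, the handling of videos shorter than a chunk, and the names `is_positive_trajectory` and `sample_chunks` are assumptions made for illustration.

```python
import random


def is_positive_trajectory(per_frame_overlaps, threshold):
    """Keep a teacher trajectory only if every frame meets the IoU threshold."""
    return all(o >= threshold for o in per_frame_overlaps)


def sample_chunks(num_frames, chunk_len=32, num_chunks=5, rng=None):
    """Draw `num_chunks` randomly placed windows of `chunk_len` frames.

    Returns (start, end) index pairs with end exclusive. Videos shorter than
    one chunk yield no windows (an assumed policy).
    """
    rng = rng or random.Random(0)
    if num_frames < chunk_len:
        return []
    last_start = num_frames - chunk_len
    starts = [rng.randint(0, last_start) for _ in range(num_chunks)]
    return [(s, s + chunk_len) for s in starts]
```

A stricter threshold keeps only trajectories whose every frame is well localized, which is why the number of positive trajectories in Table 1 falls as the threshold rises.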

Benchmarks and Performance Measures.

We evaluated performance on the GOT-10k test set [36], UAV123 [55], LaSOT [22], OTB-100 [77] and VOT2019 [40] datasets. These offer videos of various nature and difficulty, and are all popular benchmarks in the visual tracking community. The evaluation protocol used for GOT-10k is the one-pass evaluation (OPE) [77], along with the metrics: AO, and success rates (SR) with overlap thresholds and . For UAV123, LaSOT, and OTB-100 the OPE method was considered with the area-under-the-curve (AUC) of the success and precision plots, referred to as success score (SS) and precision score (PS) respectively [77]. Evaluation on VOT2019 is performed in terms of expected average overlap (EAO), accuracy (A), and robustness (R) [41]. Further details about the benchmarks are given in Appendix B.0.1.
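As a small illustration of the overlap-based metrics named above, the sketch below computes the average overlap and a success rate from a list of per-frame IoU values. The function names are hypothetical, the threshold is left as a parameter, and the strict comparison is an assumption.

```python
def average_overlap(overlaps):
    """Mean IoU over all evaluated frames."""
    return sum(overlaps) / len(overlaps)


def success_rate(overlaps, threshold):
    """Fraction of frames whose IoU exceeds the given threshold."""
    return sum(1 for o in overlaps if o > threshold) / len(overlaps)


# Example on a short synthetic sequence of per-frame overlaps.
per_frame = [0.4, 0.6, 0.8]
ao = average_overlap(per_frame)
sr = success_rate(per_frame, 0.5)
```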

Implementation Details.

The image crops of were resized to pixels and standardized by the mean and standard deviation calculated on the ImageNet dataset [18]. The ResNet-18 weights were pre-trained for image classification on the same dataset [18]. The image context factor was set to . The training videos were processed in chunks of 32 frames. At test time, every 32 frames, the LSTM’s hidden state is reset to the one obtained after the first student prediction (i.e. ), following [26]. Due to hardware constraints, a maximum of training students were distributed on 4 NVIDIA TITAN V GPUs of a machine with an Intel Xeon E5-2690 v4 @ 2.60GHz CPU and 320 GB of RAM. The discount factor was set to 1. The length of the interaction before an update was defined in steps. The RAdam optimizer [47] was employed and the learning rate for both distilling and autonomous students was set to . A weight decay of was also added to as regularization term. To control the magnitude of the gradients and stabilize learning, was multiplied by . The student was trained until the performance on the GOT-10k validation set stopped improving. The longest training runs took around 10 days. The speed of the parallel setups of TRAST and TRASFUST was computed by considering the speed of the slowest tracker (student or teacher) plus an overhead. Code was implemented in Python and is publicly available. Publicly available source code was used to implement the teacher trackers. Default configurations were respected as much as possible. For a fair comparison, we report the results of such implementations, which perform slightly differently from what is stated in the original papers.

                  GOT-10k                 UAV123          LaSOT           OTB-100
Contribution      AO      SR      SR      SS      PS      SS      PS      SS      PS
TRAS-GT           0.444   0.495   0.286   0.483   0.616   0.331   0.271   0.438   0.581
TRAS-KD-GT        0.448   0.499   0.305   0.491   0.630   0.354   0.298   0.448   0.606
TRAS-KD           0.422   0.481   0.239   0.494   0.634   0.340   0.276   0.457   0.635
TRAS-no-curr      0.474   0.547   0.307   0.501   0.644   0.385   0.323   0.447   0.600
TRAS              0.484   0.556   0.326   0.515   0.655   0.386   0.330   0.481   0.644
TRAST-no-curr     0.530   0.630*  0.347*  0.602   0.770   0.484   0.464   0.595   0.794
TRAST             0.531*  0.626   0.345   0.603   0.773   0.490   0.470   0.604   0.818
TRASFUST-no-curr  0.506   0.599   0.278   0.627   0.819   0.496   0.484   0.665*  0.879
TRASFUST          0.519   0.616   0.287   0.628*  0.823*  0.510*  0.505*  0.660   0.890*
Table 2: Performance of the proposed trackers. Results of removing some components of our methodology are also reported. Best values, per contribution, are marked with an asterisk.
Figure 4: Visual example of how TRAST relies effectively on the teacher, passing control to and saving the simple student (TRAS) from the drift.
Figure 5: Analysis of the accuracy (A) and robustness (R) on VOT2019 over different classes of tracking sequences.

4.2 Results

In the following sections, when not specified, the three tracker setups regard the student trained using and , paired with in TRAST, and managing in TRASFUST.

General Remarks.

In Table 2 the performance of TRAS, TRAST, and TRASFUST is reported, while the performance of the teachers is presented in the first six rows of Table 5. TRAS results in a very fast method with good accuracy. Combining KD and RL results in the best performance, outperforming the baselines that use for training just the ground-truth (TRAS-GT), KD and ground-truth (TRAS-KD-GT), and just KD (TRAS-KD). We did not report the performance of trained only by RL because convergence was not attained due to the large state and action spaces. Benefiting from the teacher during tracking is an effective online procedure. Indeed, TRAST improves TRAS by 24% on average, and a qualitative example of the ability to pass control to the teacher is given in Figure 4. The performance of TRASFUST confirms the student’s evaluation ability. This is the most accurate and robust tracker thanks to the effective fusion of the underlying trackers. Overall, all three trackers show balanced performance across the benchmarks, thus demonstrating good generalization. Removing the curriculum learning strategy (TRAS-no-curr, TRAST-no-curr, TRASFUST-no-curr) slightly decreases the performance of all. In Figure 5 the performance of the trackers is reported for different classes of sequences of VOT2019, while in Figure 8 some qualitative examples are presented. These results demonstrate the effectiveness of our methodology and that the proposed student model meets, respectively, the goals (i), (ii), (iii) introduced in Section 1.

Training  Tracking  GOT-10k                 UAV123          LaSOT           OTB-100         FPS
Teachers  Teachers  AO      SR      SR      SS      PS      SS      PS      SS      PS

          -         0.371   0.418   0.178   0.464   0.598   0.321   0.241   0.390   0.524   90
          -         0.414   0.473   0.214   0.462   0.606   0.336   0.262   0.390   0.545
          -         0.422   0.484   0.232   0.507   0.652†  0.357   0.286   0.422   0.567
          -         0.441†  0.499†  0.290†  0.517*  0.646   0.377†  0.310†  0.447†  0.599†
          -         0.484*  0.556*  0.326*  0.515†  0.655*  0.386*  0.330*  0.481*  0.644*

                    0.390   0.440   0.191   0.526   0.682   0.388   0.319   0.495   0.660   90
                    0.452   0.521   0.223   0.572   0.776   0.433   0.386   0.569   0.793   5
                    0.491   0.571   0.249   0.580   0.768   0.442   0.397   0.583   0.786   15
                    0.532   0.632   0.354   0.605   0.779   0.485   0.457   0.601   0.806   40
                    0.469   0.541   0.297   0.562   0.727   0.422   0.376   0.560   0.760   90
                    0.494   0.573   0.302   0.604   0.798   0.466   0.431   0.596   0.815   5
                    0.521   0.607   0.307   0.606   0.795   0.456   0.419   0.608   0.822   15
                    0.531   0.626   0.345   0.603   0.773   0.490   0.470   0.604   0.818   40
                    0.557†  0.640†  0.393†  0.634†  0.823†  0.513†  0.488†  0.623†  0.838†  20
                    0.604*  0.708*  0.469*  0.647*  0.837*  0.545*  0.524*  0.643*  0.865*  25

                    0.317   0.319   0.105   0.493   0.720   0.396   0.372   0.666   0.901†  5
                    0.384   0.398   0.131   0.563   0.791   0.422   0.392   0.701*  0.931*  5
                    0.526†  0.624†  0.305†  0.634†  0.815   0.507   0.500   0.670   0.877   15
                    0.617*  0.729*  0.490*  0.679*  0.873*  0.576*  0.574*  0.692†  0.895   20
                    0.517   0.615   0.294   0.633   0.823†  0.513†  0.504   0.682   0.897   5
                    0.519   0.616   0.287   0.628   0.823†  0.510   0.505†  0.660   0.890   5
Table 3: Performance of the proposed trackers while considering different teacher setups for training and tracking. Best results per tracker are marked with *, second-best with †.

Impact Of Teachers.

In Table 3 the performance of the proposed trackers in different student-teacher setups is reported. The general trend of the three trackers reflects the increasing tracking capabilities of the teachers. Indeed, on every considered benchmark, the tracking ability of the student increases as a stronger teacher is employed. For TRAST, this is also supported by Figure 8 (a), where we show that better teachers are exploited more. Moreover, using more than one teacher during training leads to superior tracking policies and to better exploitation of them during tracking. Although in general student models cannot outperform their teachers due to their simple and compressed architecture [11], TRAS and TRAST show such behavior on benchmarks where teachers are weak. Using two teachers during tracking is the best TRASFUST configuration, as in this setup it outperforms the best teacher by more than 2% on all the considered benchmarks. When weaker teachers are added to the pool, the performance tends to decrease, suggesting a behavior similar to the one pointed out in [1]. Part of the error committed by TRAST and TRASFUST on benchmarks like OTB-100 and VOT2019 is explained by Figure 8. In situations of ambiguous ground-truth, such trackers make predictions that are qualitatively better but quantitatively worse. TRAST and TRASFUST prove to be unbiased toward the training teachers, as their capabilities generalize also to and which are not exploited during training.

Columns: tracker; GOT-10k AO, SR0.50, SR0.75; UAV123 SS, PS; LaSOT SS, PS; OTB-100 SS, PS
TRAS \tblbest0.484 \tblbest0.556 \tblbest0.326 \tblbest0.515 \tblbest0.655 \tblbest0.386 \tblbest0.330 \tblbest0.481 \tblbest0.644
TRAST \tblbest0.532 \tblbest0.632 \tblbest0.354 \tblbest0.605 \tblbest0.779 \tblbest0.485 \tblbest0.457 \tblsecondbest0.601 \tblsecondbest0.806
TRASFUST \tblbest0.519 \tblbest0.616 0.287 0.628 0.823 0.510 \tblsecondbest0.505 0.660 0.890
TRAS \tblsecondbest0.426 \tblsecondbest0.488 \tblsecondbest0.244 \tblsecondbest0.481 \tblsecondbest0.609 \tblsecondbest0.343 \tblsecondbest0.277 \tblsecondbest0.452 \tblsecondbest0.617
TRAST \tblsecondbest0.518 \tblsecondbest0.616 \tblsecondbest0.326 \tblsecondbest0.599 \tblsecondbest0.768 0.475 0.452 \tblbest0.608 \tblbest0.809
TRASFUST \tblsecondbest0.507 \tblsecondbest0.599 \tblbest0.295 \tblbest0.639 \tblbest0.827 \tblbest0.514 \tblbest0.510 \tblbest0.683 \tblbest0.901
TRAS 0.404 0.449 0.231 0.430 0.552 0.334 0.260 0.390 0.522
TRAST 0.513 0.603 0.310 0.594 0.766 \tblsecondbest0.478 \tblsecondbest0.456 0.586 0.781
TRASFUST \tblsecondbest0.507 \tblsecondbest0.599 \tblsecondbest0.289 \tblsecondbest0.638 \tblbest0.827 \tblsecondbest0.513 \tblsecondbest0.505 \tblsecondbest0.675 \tblsecondbest0.894
TRAS 0.326 0.344 0.155 0.387 0.489 0.243 0.170 0.323 0.414
TRAST 0.505 0.598 0.297 0.592 0.764 0.457 0.426 0.589 0.774
TRASFUST 0.494 0.575 0.260 0.624 0.815 0.494 0.482 0.672 0.888
TRAS 0.140 0.070 0.014 0.064 0.045 0.086 0.019 0.132 0.104
TRAST 0.471 0.541 0.250 0.547 0.697 0.445 0.409 0.574 0.746
TRASFUST 0.403 0.425 0.169 0.534 0.743 0.401 0.374 0.626 0.836
Table 4: Results of the proposed trackers considering teacher predictions of increasing quality. Best values, per tracker, are highlighted in red, second-best in blue.

Table 4 presents the performance of the proposed trackers for different quality levels of the teacher actions. Increasing the quality, and thus reducing the number of videos, lowers the performance of all three trackers. The loss is not significant between and , whereas with more precise actions TRAS suffers the most, suggesting that more data is a key factor for an autonomous tracking policy. Interestingly, TRAST and TRASFUST are able to perform tracking even if the student is trained with limited training samples. Plot (b) in Figure 8 confirms that the student relies effectively on its teacher, as the latter’s output is selected more often as the student loses performance.

Running the student takes just 11 ms on our machine, so TRAS runs at about 90 FPS. The speed of TRAST and TRASFUST depends on the chosen teacher and varies between 5 and 40 FPS, as shown in Table 3. In parallel setups, TRAST and TRASFUST run in real-time if the teachers do so.
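The relation between per-frame latency and frame rate can be made concrete with a short calculation. The sketch below is only illustrative: the function name `fps_from_latencies` and the 25 FPS teacher are assumptions, and the model simply takes the sum of the latencies for sequential execution and the slowest component for parallel execution.

```python
from typing import Sequence


def fps_from_latencies(latencies_ms: Sequence[float], parallel: bool = False) -> float:
    """Frames per second for components that each need `latencies_ms` per frame."""
    if not latencies_ms or any(ms <= 0 for ms in latencies_ms):
        raise ValueError("latencies must be a non-empty sequence of positive values")
    # Sequential execution pays the sum of the latencies for every frame;
    # parallel execution is bounded by the slowest component.
    frame_ms = max(latencies_ms) if parallel else sum(latencies_ms)
    return 1000.0 / frame_ms


student_ms = 11.0            # student latency reported in the text
teacher_ms = 1000.0 / 25.0   # hypothetical teacher running at 25 FPS

print(fps_from_latencies([student_ms]))                              # student alone
print(fps_from_latencies([student_ms, teacher_ms]))                  # student then teacher
print(fps_from_latencies([student_ms, teacher_ms], parallel=True))   # side by side
```

Under this simplified model, a parallel setup inherits the frame rate of the slowest component, which is in line with the statement that TRAST and TRASFUST run in real-time if the teachers do so.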

Figure 6: Per benchmark fractions of predictions attributed to in the TRAST setup.
Figure 7: Qualitative examples of the proposed trackers.
Figure 8: Behaviour of TRAST and TRASFUST with ambiguous ground-truths. In the presented frames, TRAST selects the bounding box predicted by the student, while TRASFUST selects the one given by a teacher. Those outputs are qualitatively better but have a much lower IoU (quantified by the colored numbers) with respect to the ground-truth. This impacts the overall quantitative performance.

State of the Art Comparison.

In Table 5 we report the results of the proposed trackers against the state-of-the-art. In the following comparisons, we consider the results of the best configurations proposed in the above analysis.

TRAS outperforms GOTURN and RE3, which employ a similar DNN architecture but different learning strategies. On GOT-10k and LaSOT it also surpasses the recent GradNet and ROAM, and on UAV123 it surpasses GCT. TRAST outperforms ATOM and SiamCAR on GOT-10k, UAV123 and LaSOT, while losing little performance to DiMP. Its performance is better than that of RL-based trackers [80, 9, 63] on UAV123 and comparable on OTB-100. Finally, TRASFUST outperforms all the trackers on all the benchmarks where the pool was used. Remarkable results are obtained on UAV123 and OTB-100, with SS of 0.679 and 0.701 and PS of 0.873 and 0.931, respectively. A large improvement is achieved over all the methods that rely on expert trackers.

Columns: tracker; GOT-10k AO, SR0.50, SR0.75; UAV123 SS, PS; LaSOT SS, PS; OTB-100 SS, PS; VOT2019 EAO, A, R; FPS
KCF [32] 0.203 0.177 0.065 0.331 0.503 0.178 0.166 0.477 0.693 0.110 0.441 1.279 105
MDNet [57] 0.299 0.303 0.099 0.489 0.718 0.397 0.373 0.673 0.909 0.151 0.507 0.782 5
ECO [14] 0.316 0.309 0.111 0.532 0.726 0.324 0.301 0.668 0.896 0.262 0.505 0.441 15
SiamRPN [43] 0.508 0.604 0.308 0.616 0.785 0.508 0.492 0.649 0.851 0.259 0.554 0.572 43
ATOM [15] 0.556 0.634 0.402 0.643 0.832 0.516 0.506 0.660 0.867 \tblsecondbest0.292 \tblbest0.603 \tblsecondbest0.411 20
DiMP [5] \tblsecondbest0.611 \tblsecondbest0.717 \tblbest0.492 \tblsecondbest0.653 \tblsecondbest0.839 \tblsecondbest0.570 \tblsecondbest0.569 0.681 0.888 \tblbest0.379 0.594 \tblbest0.278 25
GOTURN [31] 0.347 0.375 0.124 0.389 0.548 0.214 0.175 0.395 0.534 - - - 100
RE3 [26] - - - 0.514 0.667 0.325 0.301 0.464 0.582 0.152 0.458 0.940 150
ADNet [80] - - - - - - - 0.646 0.880 - - - 3
ACT [9] - - - 0.415 0.636 - - 0.625 0.859 - - - 30
DRL-IS [63] - - - - - - - 0.671 0.909 - - - 10
SiamRPN++ [42] - - - 0.613 0.807 0.496 - \tblsecondbest0.696 \tblsecondbest0.914 0.285 \tblsecondbest0.599 0.482 35
GCT [23] - - - 0.508 0.732 - - 0.648 0.854 - - - 50
GradNet [44] - - - - - 0.365 0.351 0.639 0.861 - - - 80
SiamCAR [27] 0.569 0.670 0.415 0.614 0.760 0.507 0.510 - - - - - 52
ROAM [78] 0.436 0.466 0.164 - - 0.368 0.390 0.681 0.908 - - - 13
MEEM [81] 0.253 0.235 0.068 0.392 0.627 0.280 0.224 0.566 0.830 - - - 10
HMMTxD [70] - - - - - - - - - 0.163 0.499 1.073 -
HDT [62] - - - - - - - 0.562 0.844 - - - 10
Zhu et al. [83] - - - - - - - 0.587 0.788 - - - 36
Li et al. [46] - - - - - - - 0.621 0.864 - - - 6
TRAS 0.484 0.556 0.326 0.515 0.655 0.386 0.330 0.481 0.644 0.131 0.400 1.020 90
TRAST 0.604 0.708 0.469 0.647 0.837 0.545 0.524 0.643 0.865 0.203 0.517 0.693 25
TRASFUST \tblbest0.617 \tblbest0.729 \tblsecondbest0.490 \tblbest0.679 \tblbest0.873 \tblbest0.576 \tblbest0.574 \tblbest0.701 \tblbest0.931 0.266 0.592 0.597 20
Table 5: Performance of the proposed trackers (in the last block of rows) in comparison with the state-of-the-art. The first block of rows reports the performance of the selected teachers; the second block shows the performance of generic-approach trackers; the third presents trackers that exploit experts or perform fusion. Best results are highlighted in red, second-best in blue.

5 Conclusions

In this paper, a novel methodology for visual tracking is proposed. KD and RL are joined in a single framework where off-the-shelf tracking algorithms are employed to compress knowledge into a CNN-based student model. After learning, the student can be exploited in three different tracking setups, TRAS, TRAST and TRASFUST, depending on application needs. An extensive validation shows that the proposed trackers TRAS and TRAST compete with the state-of-the-art, while TRASFUST outperforms recently published methods and fusion approaches. All trackers can run in real-time.


This work is supported by the ACHIEVE-ITN project.

Supplementary Material of Tracking-by-Trackers with a Distilled and Reinforced Model

Appendix A Methodology

A.1 MDP Auxiliary Functions

The function used to obtain the bounding box given and the previous bounding box is defined such that


. The function employed to obtain the expert action given the teacher bounding boxes is defined as
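To illustrate what such a pair of auxiliary functions can look like, the sketch below assumes a commonly used parameterization in which an action is a 4-vector of displacements relative to the previous box size. This is an assumption made for illustration and is not taken from the definitions of this paper; the names `apply_action` and `expert_action`, and the `(x, y, w, h)` box layout, are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]     # (x, y, w, h)
Action = Tuple[float, float, float, float]  # (dx, dy, dw, dh), relative to the previous size


def apply_action(action: Action, prev_box: Box) -> Box:
    """Move and resize the previous box by relative offsets (assumed parameterization)."""
    dx, dy, dw, dh = action
    x, y, w, h = prev_box
    return (x + dx * w, y + dy * h, w + dw * w, h + dh * h)


def expert_action(teacher_box: Box, prev_box: Box) -> Action:
    """Inverse mapping: the relative action that turns `prev_box` into `teacher_box`."""
    xt, yt, wt, ht = teacher_box
    x, y, w, h = prev_box
    return ((xt - x) / w, (yt - y) / h, (wt - w) / w, (ht - h) / h)


prev = (10.0, 20.0, 40.0, 80.0)
target = (14.0, 12.0, 44.0, 72.0)
action = expert_action(target, prev)
print(action)
print(apply_action(action, prev))
```

With this choice the two functions are inverses of each other for a fixed previous box, so replaying the expert action on the previous box recovers the teacher box.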



A.2 Curriculum Learning Strategy

A curriculum learning strategy [2] is designed to further facilitate and improve the student’s learning. After terminating each , a success counter for is increased if performs better than in that interaction, i.e. if the cumulative reward of the first, received up to , is greater than or equal to the one obtained by the second. In formal terms, is updated if the following condition holds


The counter update is done by testing students that interact with by exploiting . The terminal video index is successively increased during the training procedure by a central process which checks if . After each update of , is reset to zero. With this setup, we ensure that, at every increase of , students face a simpler learning problem in which they are likely to succeed, and in a shorter time, since they have already developed a tracking policy that, up to , is at least as good as the one of . We found to work well in practice.
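The counter-based schedule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `Curriculum`, the field names, the fixed `step`, and the rule "advance once the counter reaches `required_successes`" are all hypothetical stand-ins for quantities whose exact definitions are given by the formulas of this appendix.

```python
from dataclasses import dataclass


@dataclass
class Curriculum:
    terminal_index: int        # current terminal video index
    max_index: int             # largest terminal index allowed
    step: int                  # assumed increment applied at each update
    required_successes: int    # assumed threshold checked by the central process
    successes: int = 0         # success counter

    def record_interaction(self, student_return: float, teacher_return: float) -> None:
        """Count a success when the student's cumulative reward matches or beats the teacher's."""
        if student_return >= teacher_return:
            self.successes += 1

    def maybe_advance(self) -> bool:
        """Central check: lengthen the interaction and reset the counter when enough successes accrued."""
        if self.successes < self.required_successes or self.terminal_index >= self.max_index:
            return False
        self.terminal_index = min(self.terminal_index + self.step, self.max_index)
        self.successes = 0
        return True


curriculum = Curriculum(terminal_index=5, max_index=20, step=5, required_successes=2)
curriculum.record_interaction(student_return=3.0, teacher_return=2.5)  # success
curriculum.record_interaction(student_return=1.0, teacher_return=2.0)  # failure
curriculum.record_interaction(student_return=4.0, teacher_return=4.0)  # success (tie counts)
print(curriculum.maybe_advance(), curriculum.terminal_index, curriculum.successes)
```

Resetting the counter after each update means the student has to prove itself again on the longer interactions before the terminal index grows further.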

Appendix B Experimental Setup

Benchmarks and Performance Measures.

In this subsection we offer more details about the benchmark datasets and the relative performance measures employed to validate our methodology.

GOT-10k Test Set. The GOT-10k [36] test set is composed of 180 videos. Target objects belong to 84 different classes, and 32 forms of object motion are present. Notably, except for the class person, there is no overlap between the object classes in the training and test splits. For the person class, there is no overlap in the type of motion. The evaluation protocol proposed by the authors is the one-pass evaluation (OPE) [77], while the metrics used are the average overlap (AO) and the success rates (SR) with overlap thresholds 0.50 and 0.75.
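As a rough illustration of these measures, the sketch below computes AO and SR from the per-frame overlaps of a single sequence. The `(x, y, w, h)` box layout and the function names are assumptions made for the example, and the official GOT-10k toolkit may aggregate over sequences differently; the thresholds 0.50 and 0.75 are the ones commonly used with this benchmark.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h): top-left corner plus size


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(ious: Sequence[float]) -> float:
    """AO: mean IoU over the evaluated frames."""
    return sum(ious) / len(ious)


def success_rate(ious: Sequence[float], threshold: float) -> float:
    """SR: fraction of frames whose IoU exceeds the threshold."""
    return sum(o > threshold for o in ious) / len(ious)


ground_truth = [(0.0, 0.0, 10.0, 10.0)] * 3
predictions = [(0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 10.0, 10.0), (20.0, 20.0, 10.0, 10.0)]
overlaps = [iou(p, g) for p, g in zip(predictions, ground_truth)]
print(average_overlap(overlaps), success_rate(overlaps, 0.50), success_rate(overlaps, 0.75))
```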

OTB-100. The OTB-100 [77] benchmark is a set of 100 challenging videos that is widely used in the tracking literature. The standard evaluation procedure for this dataset is the OPE method, while the Area Under the Curve (AUC) of the success and precision plots, referred to as success score (SS) and precision score (PS) respectively, are used to quantify trackers’ performance.
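The success score can be sketched as the mean height of the success plot. The example below is a simplified illustration, assuming 21 evenly spaced IoU thresholds in [0, 1] and a strict "greater than" comparison; the function names are hypothetical and the official OTB toolkit may differ in such details. A precision plot is built in the same way from center location errors instead of overlaps.

```python
from typing import List, Sequence, Tuple


def success_curve(ious: Sequence[float], num_thresholds: int = 21) -> Tuple[List[float], List[float]]:
    """Fraction of frames whose IoU exceeds each threshold in [0, 1]."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o > t for o in ious) / len(ious) for t in thresholds]
    return thresholds, rates


def success_score(ious: Sequence[float], num_thresholds: int = 21) -> float:
    """Area under the success plot, approximated by the mean success rate."""
    _, rates = success_curve(ious, num_thresholds)
    return sum(rates) / len(rates)


print(success_score([0.9, 0.7, 0.4, 0.0]))
```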

UAV123. The UAV123 benchmark [55] provides 123 videos that are inherently different from traditional visual tracking benchmarks like OTB and VOT, since it offers sequences acquired from low-altitude UAVs. To evaluate trackers, the standard OTB methodology [77] is used.

LaSOT. A performance evaluation was also carried out on the test set of the LaSOT benchmark [22]. This dataset is composed of 280 videos, with a total of more than 650k frames and an average sequence length of 2500 frames, which is higher than the lengths of the videos contained in the aforementioned benchmarks. The same methodology and metrics used for the OTB [77] experiments are employed.

VOT2019. The VOT benchmarks are datasets used in the annual VOT tracking competitions. These sets change year by year, introducing increasingly challenging tracking scenarios. We evaluated our trackers on the set of the VOT2019 challenge [40], which provides 60 highly challenging videos. Within the framework used by the VOT committee, trackers are evaluated based on Expected Average Overlap (EAO), Accuracy (A) and Robustness (R) [41]. Differently from the OPE, the VOT evaluation protocol includes the automatic re-initialization of the tracker when the IoU between its estimated bounding box and the ground-truth becomes zero.
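The difference from the OPE can be sketched with a small reset-based evaluation loop. This is a simplified illustration and not the official VOT toolkit: the tracker interface (`init`, `update`), the function name `run_with_resets`, and the number of frames skipped before re-initialization (`skip=5`) are assumptions made for the example, and the loop only returns the mean overlap and the failure count rather than the EAO, A and R measures.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def run_with_resets(tracker, frames: Sequence, gt_boxes: Sequence[Box], skip: int = 5):
    """Track a sequence, re-initializing the tracker whenever the overlap drops to zero."""
    overlaps, failures = [], 0
    tracker.init(frames[0], gt_boxes[0])
    i = 1
    while i < len(frames):
        overlap = iou(tracker.update(frames[i]), gt_boxes[i])
        if overlap == 0.0:
            failures += 1
            i += skip  # assumed gap before the tracker is started again
            if i < len(frames):
                tracker.init(frames[i], gt_boxes[i])
                i += 1
        else:
            overlaps.append(overlap)
            i += 1
    mean_overlap = sum(overlaps) / len(overlaps) if overlaps else 0.0
    return mean_overlap, failures


class ReplayTracker:
    """Toy tracker that replays precomputed boxes; frames are plain indices here."""

    def __init__(self, boxes):
        self.boxes = boxes

    def init(self, frame, box):
        pass

    def update(self, frame):
        return self.boxes[frame]


gt = [(0.0, 0.0, 10.0, 10.0)] * 10
predicted = list(gt)
predicted[3] = (100.0, 100.0, 10.0, 10.0)  # simulated failure at frame 3
print(run_with_resets(ReplayTracker(predicted), list(range(10)), gt))
```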

Appendix C Additional Results

C.1 Impact of Transfer Set

We evaluated how performance changes when other sources of video data are considered. Respecting the idea that unbiased demonstrations of the teachers should be employed, we used the training set of the LaSOT benchmark [22]. This dataset is smaller than the training set of GOT-10k and contains 1120 videos with approximately 2.83M frames. After filtering the trajectories, we obtained the transfer set whose specifications are given in Table 6.

# traj AO
16 0.835 80
32 0.830 160
44 0.817 220
87 0.852 435
106 0.856 530
Table 6: Teacher-based statistics of the LaSOT transfer set.

The results are shown in Table 7. The amount of training samples is lower than the amount obtained by filtering the GOT-10k transfer set with , and the proposed trackers present a behaviour that reflects the loss of data (as seen in Table 4). This experiment suggests that the quantity of data has more impact than the quality of data.

Columns: tracker; GOT-10k AO, SR0.50, SR0.75; UAV123 SS, PS; LaSOT SS, PS; OTB-100 SS, PS
TRAS 0.242 0.252 0.086 0.329 0.437 0.222 0.166 0.254 0.337
TRAST 0.475 0.552 0.248 0.553 0.746 0.463 0.432 0.577 0.760
TRASFUST 0.468 0.529 0.221 0.594 0.803 0.470 0.452 0.666 0.885
Table 7: Performance of the proposed trackers considering the training set of LaSOT as transfer set.

C.2 Success and Precision Plots on OTB-100

In Figures 9 and 10 the success plots and precision plots for different sequence categories of the OTB-100 benchmark are presented.

Figure 9: Success plots on OTB-100 presenting the performance of the proposed trackers and the teachers on tracking situations with: occlusion (OCC); background clutter (BC); out of view (OV); motion blur (MB); low resolution (LR); fast motion (FM).
Figure 10: Precision plots on OTB-100 presenting the performance of the proposed trackers and the teachers on tracking situations with: occlusion (OCC); background clutter (BC); out of view (OV); motion blur (MB); low resolution (LR); fast motion (FM).

C.3 Video

At this link, we provide a video showing the tracking abilities of our proposed trackers. For each video, the predictions of TRAS, TRAST and TRASFUST are shown. For TRAST and TRASFUST, we also report the tracker whose prediction was chosen as output (marked with the term “CONTROLLING” next to the tracker’s name).


  1. Please refer to Appendix A.1 for the definition of .
  2. Please refer to Appendix A.1 for the definition of .
  4. For more, please see


  1. C. Bailer, A. Pagani and D. Stricker (2014) A superior tracking approach: Building a strong tracker through fusion. In European Conference on Computer Vision, Vol. 8695 LNCS, pp. 170–185. ISBN 9783319105833, ISSN 16113349.
  2. Y. Bengio, J. Louradour, R. Collobert and J. Weston (2009) Curriculum learning. In 26th International Conference on Machine Learning, ICML ’09, New York, New York, USA, pp. 1–8. ISBN 9781605585161.
  3. L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik and P. H.S. Torr (2016) Staple: Complementary learners for real-time tracking. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2016-Decem, pp. 1401–1409. arXiv:1512.01355, ISBN 9781467388504, ISSN 10636919.
  4. L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi and P. H.S. Torr (2016) Fully-convolutional siamese networks for object tracking. European Conference on Computer Vision 9914 LNCS, pp. 850–865. arXiv:1606.09549, ISBN 9783319488806, ISSN 16113349.
  5. G. Bhat, M. Danelljan, L. Van Gool and R. Timofte (2019) Learning Discriminative Model Prediction for Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision. arXiv:1904.07220.
  6. D. S. Bolme, J. R. Beveridge, B. A. Draper and Y. M. Lui (2010) Visual object tracking using adaptive correlation filters. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2544–2550.
  7. C. Bucilǎ, R. Caruana and A. Niculescu-Mizil (2006) Model compression. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Vol. 2006, pp. 535–541. ISBN 1595933395.
  8. L. Čehovin, M. Kristan and A. Leonardis (2013) Robust visual tracking using an adaptive coupled-layer visual model. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (4), pp. 941–953. ISSN 01628828.
  9. B. Chen, D. Wang, P. Li, S. Wang and H. Lu (2018) Real-time ’Actor-Critic’ Tracking. In European Conference on Computer Vision, pp. 318–334.
  10. G. Chen, W. Choi, X. Yu, T. Han and M. Chandraker (2017) Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems, Vol. 2017-Decem, pp. 743–752. ISSN 10495258.
  11. J. H. Cho and B. Hariharan (2019) On the efficacy of knowledge distillation. In Proceedings of the IEEE International Conference on Computer Vision, Vol. 2019-Octob, pp. 4793–4801. arXiv:1910.01348, ISBN 9781728148038, ISSN 15505499.
  12. J. Choi, J. Kwon and K. M. Lee (2018) Real-time visual tracking by deep reinforced decision making. Computer Vision and Image Understanding 171, pp. 10–19. arXiv:1702.06291, ISSN 1090235X.
  13. D. Comaniciu, V. Ramesh and P. Meer (2000) Real-time tracking of non-rigid objects using mean shift. IEEE Conference on Computer Vision and Pattern Recognition 2, pp. 142–149. ISSN 10636919.
  14. M. Danelljan, G. Bhat, F. S. Khan and M. Felsberg (2017) ECO: Efficient Convolution Operators for Tracking. In IEEE Conference on Computer Vision and Pattern Recognition. arXiv:1611.09224.
  15. M. Danelljan, G. Bhat, F. S. Khan and M. Felsberg (2019) ATOM: Accurate Tracking by Overlap Maximization. In IEEE Conference on Computer Vision and Pattern Recognition. arXiv:1811.07628.
  16. M. Danelljan, G. Hager, F. S. Khan and M. Felsberg (2017) Discriminative Scale Space Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (8), pp. 1561–1575. arXiv:1609.06141, ISSN 01628828.
  17. M. Danelljan, A. Robinson, F. S. Khan and M. Felsberg (2016) Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking. In European Conference on Computer Vision. arXiv:1608.03773.
  18. J. Deng, W. Dong, R. Socher, L. Li, K. Li and L. Fei-Fei (2009) ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. ISBN 978-1-4244-3992-8.
  19. M. Dunnhofer, M. Antico, F. Sasazawa, Y. Takeda, S. Camps, N. Martinel, C. Micheloni, G. Carneiro and D. Fontanarosa (2020) Siam-U-Net: encoder-decoder siamese network for knee cartilage tracking in ultrasound images. Medical Image Analysis. ISSN 13618423.
  20. M. Dunnhofer, N. Martinel, G. L. Foresti and C. Micheloni (2019) Visual Tracking by means of Deep Reinforcement Learning and an Expert Demonstrator. In Proceedings of The IEEE/CVF International Conference on Computer Vision Workshops. arXiv:1909.08487.
  21. L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg and K. Kavukcuoglu (2018) IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. In 35th International Conference on Machine Learning, ICML 2018, Vol. 4, pp. 2263–2284. arXiv:1802.01561, ISBN 9781510867963.
  22. H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao and H. Ling (2019) LaSOT: A High-quality Benchmark for Large-scale Single Object Tracking. In IEEE Conference on Computer Vision and Pattern Recognition. arXiv:1809.07845.
  23. J. Gao, T. Zhang and C. Xu (2019) Graph Convolutional Tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4649–4659.
  24. N. C. Garcia, P. Morerio and V. Murino (2018) Modality distillation with multiple stream networks for action recognition. In European Conference on Computer Vision, Vol. 11212 LNCS, pp. 106–121. arXiv:1806.07110, ISBN 9783030012366, ISSN 16113349.
  25. K. J. Geras, A. Mohamed, R. Caruana, G. Urban, S. Wang, O. Aslan, M. Philipose, M. Richardson and C. Sutton (2015) Blending LSTMs into CNNs. arXiv:1511.06433.
  26. D. Gordon, A. Farhadi and D. Fox (2018) Re3: Real-time recurrent regression networks for visual tracking of generic objects. IEEE Robotics and Automation Letters 3 (2), pp. 788–795. arXiv:1705.06368, ISSN 23773766.
  27. D. Guo, J. Wang, Y. Cui, Z. Wang and S. Chen (2020) SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. arXiv:1911.07241.
  28. S. Hare, S. Golodetz, A. Saffari, V. Vineet, M. M. Cheng, S. L. Hicks and P. H.S. Torr (2016) Struck: Structured Output Tracking with Kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 2096–2109. ISSN 01628828.
  29. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2016-Decem, pp. 770–778. arXiv:1512.03385, ISBN 9781467388504, ISSN 10636919.
  30. T. He, C. Shen, Z. Tian, D. Gong, C. Sun and Y. Yan (2019) Knowledge Adaptation for Efficient Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 578–587. arXiv:1903.04688.
  31. D. Held, S. Thrun and S. Savarese (2016) Learning to Track at 100 FPS with Deep Regression Networks. In European Conference on Computer Vision. arXiv:1604.01802.
  32. J. F. Henriques, R. Caseiro, P. Martins and J. Batista (2015) High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3), pp. 583–596. arXiv:1404.7584, ISSN 01628828.
  33. G. Hinton, O. Vinyals and J. Dean (2014) Distilling the Knowledge in a Neural Network. In Deep Learning Workshop NIPS 2014. arXiv:1503.02531.
  34. S. Hochreiter and J. Schmidhuber (1997) Long Short-Term Memory. Neural Computation 9 (8), pp. 1735–1780. ISSN 08997667.
  35. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto and H. Adam (2017) MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861.
  36. L. Huang, X. Zhao and K. Huang (2019) GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. arXiv:1810.11981, ISSN 0162-8828.
  37. I. Jung, J. Son, M. Baek and B. Han (2018) Real-Time MDNet. In European Conference on Computer Vision.
  38. Z. Kalal, K. Mikolajczyk and J. Matas (2012) Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (7), pp. 1409–1422. ISSN 01628828.
  39. V. R. Konda and J. N. Tsitsiklis (2000) Actor-Critic Algorithms. In Advances in Neural Information Processing Systems.
  40. M. Kristan, J. Matas, A. Leonardis, M. Felsberg, R. Pflugfelder, J. Kämäräinen, L. Zajc, O. Drbohlav, A. Lukežič, A. Berg, A. Eldesokey, J. Käpylä, G. Fernández, et al. (2019) The Seventh Visual Object Tracking VOT2019 Challenge Results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops.
  41. M. Kristan, J. Matas, A. Leonardis, T. Vojir, R. Pflugfelder, G. Fernandez, G. Nebehay, F. Porikli and L. Čehovin (2016-11) A Novel Performance Evaluation Methodology for Single-Target Trackers. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (11), pp. 2137–2155. External Links: Document, ISSN 0162-8828 Cited by: §B.0.1, §4.1.3.
  42. B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing and J. Yan (2019) SIAMRPN++: Evolution of siamese visual tracking with very deep networks. IEEE Conference on Computer Vision and Pattern Recognition 2019-June, pp. 4277–4286. External Links: Document, 1812.11703, ISBN 9781728132938, ISSN 10636919, Link Cited by: §1, §1, Table 5.
  43. B. Li, J. Yan, W. Wu, Z. Zhu and X. Hu (2018-06) High Performance Visual Tracking with Siamese Region Proposal Network. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8971–8980. External Links: Document, ISBN 9781538664209, ISSN 10636919, Link Cited by: §1, §1, §4.1.1, Table 5.
  44. P. Li, B. Chen, W. Ouyang, D. Wang, X. Yang and H. Lu (2019) GradNet: Gradient-Guided Network for Visual Object Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, External Links: 1909.06800, Link Cited by: Table 5.
  45. Y. Li, J. Yang, Y. Song, L. Cao, J. Luo and L. J. Li (2017) Learning from Noisy Labels with Distillation. In Proceedings of the IEEE International Conference on Computer Vision, Vol. 2017-Octob, pp. 1928–1936. External Links: Document, 1703.02391, ISBN 9781538610329, ISSN 15505499, Link Cited by: §1, §2.0.2.
  46. Z. Li, W. Wei, T. Zhang, M. Wang, S. Hou and X. Peng (2019) Online Multi-Expert Learning for Visual Tracking. IEEE Transactions on Image Processing 29, pp. 934–946. External Links: Document, ISSN 19410042 Cited by: §2.0.1, Table 5.
  47. L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao and J. Han (2019-08) On the Variance of the Adaptive Learning Rate and Beyond. External Links: 1908.03265, Link Cited by: §4.1.4.
  48. Y. Liu, K. Chen, C. Liu, Z. Qin, Z. Luo and J. Wang (2019) Structured Knowledge Distillation for Semantic Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2599–2608. External Links: Document Cited by: §2.0.2.
  49. Y. Liu, X. Dong, X. Lu, F. S. Khan, J. Shen and S. Hoi (2019-07) Teacher-Students Knowledge Distillation for Siamese Trackers. External Links: 1907.10586, Link Cited by: §2.0.2.
  50. A. Lukežič, T. Vojíř, L. Čehovin Zajc, J. Matas and M. Kristan (2018) Discriminative Correlation Filter Tracker with Channel and Spatial Reliability. International Journal of Computer Vision 126 (7), pp. 671–688. External Links: Document, 1611.08461, ISSN 15731405 Cited by: §1, §1.
  51. M. E. Maresca and A. Petrosino (2013) MATRIOSKA: A multi-level approach to fast tracking by learning. In International Conference on Image Analysis and Processing, Vol. 8157 LNCS, pp. 419–428. External Links: Document, ISBN 9783642411830, ISSN 03029743 Cited by: §1.
  52. K. Meshgi, M. S. Mirzaei and S. Oba (2019) Long and short memory balancing in visual co-tracking using q-learning. In 2019 IEEE International Conference on Image Processing (ICIP), Vol. , pp. 3970–3974. Cited by: §2.0.2.
  53. V. Mnih, A. P. Badia, L. Mirza, A. Graves, T. Harley, T. P. Lillicrap, D. Silver and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. 33rd International Conference on Machine Learning, ICML 2016 4, pp. 2850–2869. External Links: 1602.01783, ISBN 9781510829008, Link Cited by: §1, §3.3.
  54. V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra and M. Riedmiller (2013) Playing Atari with Deep Reinforcement Learning. CoRR abs/1312.5. External Links: 1312.5602, Link Cited by: §1.
  55. M. Mueller, N. Smith and B. Ghanem (2016) A Benchmark and Simulator for UAV Tracking. In European Conference on Computer Vision, pp. 445–461. External Links: Document, Link Cited by: §B.0.1, §4.1.3.
  56. A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, S. Legg, V. Mnih, K. Kavukcuoglu and D. Silver (2015-07) Massively Parallel Methods for Deep Reinforcement Learning. External Links: 1507.04296, Link Cited by: §3.3.
  57. H. Nam and B. Han (2016) Learning Multi-domain Convolutional Neural Networks for Visual Tracking. IEEE Conference on Computer Vision and Pattern Recognition 2016-Decem, pp. 4293–4302. External Links: Document, 1510.07945, ISBN 9781467388504, ISSN 10636919 Cited by: §1, §1, §2.0.1, §4.1.1, Table 5.
  58. H. Nam, S. Hong and B. Han (2014) Online graph-based tracking. In European Conference on Computer Vision, Vol. 8693 LNCS, pp. 112–126. External Links: Document, ISSN 16113349 Cited by: §1.
  59. E. Parisotto, J. Ba and R. Salakhutdinov (2016-11) Actor-mimic deep multitask and transfer reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, External Links: 1511.06342, Link Cited by: §2.0.2.
  60. M. Phuong and C. H. Lampert (2019) Distillation-Based Training for Multi-Exit Architectures. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Cited by: §1.
  61. A. Polino, R. Pascanu and D. Alistarh (2018-02) Model compression via distillation and quantization. In International Conference on Learning Representations, External Links: 1802.05668, Link Cited by: §1.
  62. Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim and M. H. Yang (2016) Hedged Deep Tracking. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2016-Decem, pp. 4303–4311. External Links: Document, ISBN 9781467388504, ISSN 10636919 Cited by: §2.0.1, Table 5.
  63. L. Ren, X. Yuan, J. Lu, M. Yang and J. Zhou (2018) Deep Reinforcement Learning with Iterative Shift for Visual Tracking. In European Conference on Computer Vision, pp. 684–700. External Links: Link Cited by: §1, §2.0.2, §4.2.3, Table 5.
  64. A. A. Rusu, S. G. Colmenarejo, Ç. Gülçehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu and R. Hadsell (2016-11) Policy distillation. In 4th International Conference on Learning Representations, ICLR 2016, External Links: 1511.06295, Link Cited by: §2.0.2.
  65. K. Shmelkov, C. Schmid and K. Alahari (2017) Incremental Learning of Object Detectors without Catastrophic Forgetting. In Proceedings of the IEEE International Conference on Computer Vision, Vol. 2017-Octob, pp. 3420–3429. External Links: Document, 1708.06977, ISBN 9781538610329, ISSN 15505499 Cited by: §2.0.2.
  66. J. Supancic and D. Ramanan (2017) Tracking as Online Decision-Making: Learning a Policy from Streaming Videos with Reinforcement Learning. Proceedings of the IEEE International Conference on Computer Vision 2017-Octob, pp. 322–331. External Links: Document, 1707.04991, ISBN 9781538610329, ISSN 15505499, Link Cited by: §1, §2.0.2.
  67. R. S. Sutton and A. G. Barto (2018) Reinforcement Learning: An Introduction. 2nd edition, MIT Press, Cambridge, MA, USA. Cited by: §3.3.2.
  68. R. S. Sutton, D. McAllester, S. Singh and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063. External Links: ISBN 0262194503, ISSN 10495258, Link Cited by: §1, §3.3.2.
  69. Z. Tang, D. Wang and Z. Zhang (2016-05) Recurrent neural network training with dark knowledge transfer. In IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 2016-May, pp. 5900–5904. External Links: Document, 1505.04630, ISBN 9781479999880, ISSN 15206149, Link Cited by: §1, §2.0.2.
  70. T. Vojir, J. Matas and J. Noskova (2016-12) Online adaptive hidden Markov model for multi-tracker fusion. Computer Vision and Image Understanding 153, pp. 109–119. External Links: Document, ISSN 1090235X Cited by: §1, §1, §2.0.1, Table 5.
  71. N. Wang and D. Y. Yeung (2014) Ensemble-based tracking: Aggregating crowdsourced structured time series data. In 31st International Conference on Machine Learning, ICML 2014, Vol. 4, pp. 2807–2817. External Links: ISBN 9781634393973 Cited by: §1, §1, §2.0.1.
  72. N. Wang, W. Zhou, Y. Song, C. Ma and H. Li (2020-07) Real-Time Correlation Tracking Via Joint Model Compression and Transfer. IEEE Transactions on Image Processing 29, pp. 6123–6135. External Links: Document, 1907.09831, ISSN 19410042, Link Cited by: §2.0.2.
  73. Q. Wang, L. Zhang, L. Bertinetto, W. Hu and P. H. S. Torr (2019) Fast Online Object Tracking and Segmentation: A Unifying Approach. In IEEE Conference on Computer Vision and Pattern Recognition, External Links: 1812.05050, Link Cited by: §1.
  74. X. Wang, J. Hu, J. Lai, J. Zhang and W. Zheng (2019) Progressive Teacher-student Learning for Early Action Prediction. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3556–3565. External Links: Link Cited by: §2.0.2.
  75. C. J. C. H. Watkins and P. Dayan (1992) Q-learning. Machine Learning 8 (3), pp. 279–292. External Links: Document, ISSN 1573-0565, Link Cited by: §1.
  76. A. Wu, W. Zheng, X. Guo and J. Lai (2019) Distilled Person Re-identification: Towards a More Scalable System. IEEE Conference on Computer Vision and Pattern Recognition (3), pp. 1187–1196. Cited by: §2.0.2.
  77. Y. Wu, J. Lim and M. H. Yang (2013) Online object tracking: A benchmark. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2411–2418. External Links: Document, ISBN 978-0-7695-4989-7, ISSN 10636919, Link Cited by: §B.0.1, §B.0.1, §B.0.1, §B.0.1, §4.1.3.
  78. T. Yang, P. Xu, R. Hu, H. Chai and A. B. Chan (2020-07) ROAM: Recurrently Optimizing Tracking Model. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, External Links: 1907.12006, Link Cited by: Table 5.
  79. J. H. Yoon, D. Y. Kim and K. J. Yoon (2012) Visual tracking via adaptive tracker selection with multiple features. In European Conference on Computer Vision, Vol. 7575 LNCS, pp. 28–41. External Links: Document, ISBN 9783642337642, ISSN 03029743 Cited by: §1, §1, §2.0.1.
  80. S. Yun, J. Choi, Y. Yoo, K. Yun and J. Y. Choi (2017-07) Action-decision networks for visual tracking with deep reinforcement learning. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2017-Janua, pp. 1349–1358. External Links: Document, ISBN 9781538604571, Link Cited by: §1, §1, §2.0.2, §3, §4.2.3, Table 5.
  81. J. Zhang, S. Ma and S. Sclaroff (2014) MEEM: Robust tracking via multiple experts using entropy minimization. In European Conference on Computer Vision, Vol. 8694 LNCS, pp. 188–203. External Links: Document, ISBN 9783319105987, ISSN 16113349 Cited by: §1, §1, §2.0.1, Table 5.
  82. Z. Zhang and H. Peng (2019-01) Deeper and Wider Siamese Networks for Real-Time Visual Tracking. In IEEE Conference on Computer Vision and Pattern Recognition, External Links: 1901.01660, Link Cited by: §1, §1.
  83. Y. Zhu, J. Wen, L. Zhang and Y. Wang (2018-08) Visual Tracking with Dynamic Model Update and Results Fusion. In Proceedings - International Conference on Image Processing, pp. 2685–2689. External Links: Document, ISBN 9781479970612, ISSN 15224880 Cited by: Table 5.
  84. Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan and W. Hu (2018-08) Distractor-aware siamese networks for visual object tracking. In European Conference on Computer Vision, Vol. 11213 LNCS, pp. 103–119. External Links: Document, 1808.06048, ISBN 9783030012397, ISSN 16113349, Link Cited by: §1.