Describe and Attend to Track: Learning Natural Language guided Structural Representation and Visual Attention for Object Tracking

Xiao Wang, Chenglong Li, Rui Yang, Tianzhu Zhang, Jin Tang, Bin Luo
School of Computer Science and Technology, Anhui University, Hefei, China 230601
{wangxiaocvpr, lcl1314, yangruiahu}, {luobin, tj},

The tracking-by-detection framework requires a set of positive and negative training samples to learn robust tracking models for precise localization of target objects. However, existing tracking models mostly treat different samples independently and ignore the relationship information among them. In this paper, we propose a novel structure-aware deep neural network to overcome this limitation. In particular, we construct a graph to represent the pairwise relationships among training samples, and additionally take natural language as supervisory information to learn both feature representations and classifiers robustly. To refine the state of the target and re-track it when it returns to view after heavy occlusion or after leaving the field of view, we design a novel subnetwork that learns target-driven visual attention under the guidance of both visual and natural language cues. Extensive experiments on five tracking benchmark datasets validate the effectiveness of our proposed method.

1 Introduction

Figure 1: The definition of tracking by natural language specification (left sub-figure) and loss functions we used for the optimization of our tracker (right sub-figure).

As a classical and challenging task in computer vision, visual tracking has been widely used in various applications, such as intelligent surveillance and autonomous driving. Although appealing results have been achieved, visual tracking remains challenging, partly due to extreme factors including heavy occlusion, abrupt changes, large deformation and out-of-view targets.

Most successful visual trackers follow the tracking-by-detection framework, in which a set of positive and negative samples is used to train the parameters of classifiers and deep representations. However, existing tracking models mostly treat different training samples independently and ignore the relationship information among them, which is crucial to the robustness of feature representation and classifier learning. For instance, when estimating the response score of one sample, existing approaches only consider the relationship between positive and negative samples. Other relations among different sample pairs are ignored. As a result, it is difficult to obtain a proper response score for some hard positive or negative samples, since only limited relationship information among samples is used for the overall estimation.

Recently, natural language has been introduced into visual tracking and has improved tracking performance [25]. For example, Li et al. [25] propose three models, namely natural language only, visual target specification based on language, and a joint model leveraging both, to help visual trackers resist model drift. However, they use a Recurrent Neural Network (RNN) to encode the input sentences and generate a dynamic filter, and the RNN module imposes a heavy computational burden that reduces the tracking speed. Moreover, how to use natural language to guide the learning of graph-based structural feature representations has not been studied yet.

To handle the above problems, we propose a novel structure-aware deep neural network that is trained end-to-end for visual tracking. On the one hand, we utilize a graph convolutional network (GCN) to model the relations among training samples. Specifically, we first take the samples as graph nodes and use a standard convolutional network to extract their features. To fully utilize the spatial and temporal relations among samples (i.e., graph nodes), the deeply learned messages are then propagated among nodes via the GCN to update and refine the pairwise relation feature of each node. After that, we form the final feature representation of each proposal by concatenating the enhanced feature with the original one. On the other hand, we treat the natural language embedding as high-level semantic information to guide the structural feature learning in the training phase with a triplet loss function, as shown in Figure 1 (b).

In visual tracking, targets are easily lost when they are heavily occluded or leave the field of view. It is difficult to re-track them when they return to view, because the online update scheme adopted in most tracking methods contaminates the tracking model and the local search strategy has limited ability to recover the targets. Although some trackers employ target re-detection [27], judging whether a tracking failure has occurred is itself a challenging problem, and the re-detection models are too weak to recover the targets effectively when they reappear. To handle this problem, we design a novel subnetwork that learns target-driven visual attention under the guidance of both visual and natural language cues. Specifically, we use convolutional networks to encode all the input data, i.e., the whole video frame, the target object patch and the natural language specification, for more efficient computation. The features are concatenated and input to an upsample module to generate the target-driven attention maps. Global proposals can be extracted from the attention regions and then input to the classifier together with local proposals. Therefore, in addition to complementing the local proposals, the global proposals can cover the targets well when they return to view after heavy occlusion or after leaving the field of view.

In short, our proposed algorithm mines more structure information and generates high-quality global proposals via target-driven visual attention. The contributions of this paper can be summarized in the following three aspects:

  • We propose an effective approach to handle the challenges of significant appearance changes, heavy occlusion and out-of-view targets in visual tracking. Extensive experiments on five tracking benchmarks against recent state-of-the-art trackers demonstrate that our proposed tracker is more robust to the aforementioned challenging factors.

  • We propose a novel structure-aware deep neural network that makes the best use of the structures between training sample pairs and thus enhances the discriminative ability of feature representations. To make the feature representations more discriminative, we introduce the natural language description of target objects to assist visual feature learning via a triplet loss function.

  • We design a novel global proposal generation network that learns target-driven visual attention under the guidance of both visual and natural language cues. Benefiting from the global proposals, our tracker is able to re-track target objects that are lost due to heavy occlusion or leaving the field of view.

Figure 2: The pipeline of our proposed tracking algorithm.

2 Related Works

We briefly review the tracking algorithms related to this paper as follows.

Structure based Trackers. Tracking non-rigid objects has attracted great attention in recent years. Regular trackers can hardly handle extreme deformations; therefore, some researchers have begun to exploit part information and have achieved promising performance. Son et al. [38] apply online gradient boosting decision trees to individual patches to achieve robust visual tracking. Yeo et al. [46] use a Markov chain on a superpixel graph; however, information propagation through the graph can be slow depending on its structure. Ting and Yang et al. propose patch-based trackers based on correlation filters and combine the patches within a particle filter framework in [24] and [28], respectively. The major issue with these trackers is that they learn a correlation filter separately for each part and record the relative positions between each part and the target center. Besides, these part-based trackers crudely divide the target object into a fixed number of fragments. Hence, their models contain little discriminative local structure information due to such rough patch division, and little semantic information is maintained in the features of such patches. It is also very hard to design a reasonable updating strategy for these trackers, and they are rather sensitive to model drift when drastic deformation occurs. Zhang et al. [48] propose a structure-constrained part-based model for visual tracking that uses DNNs and does not explicitly divide the target into parts. Their tracker can suppress the aforementioned issues to some extent; however, it may still not handle drastic deformation or heavy occlusion well because it does not consider global information. Our model takes the target object and its language description as conditions and estimates target-driven attention maps for global proposal generation, which handles this issue well.

Multi-domain based Trackers. The idea of using multi-domain layers for training a CNN was first proposed by Nam et al. [34]. They pretrain a CNN using a large set of videos with tracking ground truths to obtain a generic target representation. Their network is composed of shared layers and multiple branches of domain-specific layers, where domains correspond to individual training sequences and each branch is responsible for binary classification to identify the target in its domain. Their tracker performs very well, and many trackers have been developed based on this idea, such as BranchOut [16], Meta-tracker [35] and Real-time MDNet [18]. Although these trackers attempt to improve MDNet from different perspectives, none of them considers structure information when pretraining the model. In addition, these trackers still adopt the local search strategy, which may make them sensitive to the challenging factors mentioned above. Our tracker utilizes the GCN and natural language to take structure information into consideration, and jointly uses global and local proposals for classification, which makes the baseline tracker more robust to these challenging factors.

Visual Attention based Trackers. To handle the influence of video noise and/or tracker noise in extremely challenging conditions, there have been several attempts to combine attention maps with visual tracking. Choi et al. [8] present an attention-modulated visual tracking algorithm that decomposes an object into multiple cognitive units and trains multiple elementary trackers to modulate the distribution of attention based on various features and kernel types. Han et al. [17] propose an online visual tracking algorithm that learns a discriminative saliency map using a CNN and directly searches for the target object from the attention locations. The spatial weights that are widely used by DCF trackers to suppress the boundary effect, for example the cosine window map [5] and the Gaussian window map [13] [12], can also be interpreted as one type of visual attention. Recently, a number of efforts [9] [10] [45] have been made to exploit visual attention within deep models. These approaches emphasize attentive features and resort to additional attention modules to generate feature weights. However, feature weights learned from single frames are unlikely to enable classifiers to concentrate on robust features over a long temporal span. Moreover, slight inaccuracies in the feature weights will exacerbate the misclassification problem. This calls for an in-depth investigation of how to best exploit the visual attention of deep classifiers so that they can attend to target objects over time. Similar views can also be found in [37]. In contrast, our proposed target-driven attention network takes the video frames, the initial target object and natural language as inputs. The generated attention maps are video-specific and can provide high-quality global proposals for visual tracking.

Tracking by Natural Language. Integrating natural language into computer vision has become a new trend, and many new tasks have been proposed, such as image captioning, visual question answering and segmentation with natural language. The bridge connecting natural language and computer vision is the embedding technique, which has made great progress in recent years with methods such as word2vec [32] and GloVe [36]. Usually, these approaches utilize memory networks (RNN, LSTM, GRU or SRU) or CNNs to further learn feature representations on top of the embedded vectors. They also integrate attention mechanisms into their deep models to further improve the final performance. Improving tracking performance with natural language has been studied in [25], where three kinds of models are proposed to illustrate possible combinations of visual tracking and natural language specifications. Different from their work, we embed the natural language with a CNN and use the embedding features to guide the generation of global target-driven attention maps. In addition, we also utilize the language embedding as high-level semantic information for shared feature learning.

3 The Proposed Method

The motivation of our method lies in two main questions: (i) how to learn a more robust deep feature representation by considering the correlations between extracted proposals, and (ii) how to obtain high-quality global proposals for visual tracking. In this paper, we propose a unified deep visual tracking algorithm guided by natural language specification, as shown in Figure 2. We give a detailed introduction to our tracker in the following sections, including the network architecture, loss functions, online tracking procedure and implementation details.

3.1 Network Architecture

Our tracker contains two sub-networks, i.e. structure-aware local search sub-network (SALNet) and global proposal generation sub-network (GPGNet).

3.1.1 SALNet

The SALNet is a binary classification based visual tracker that follows the regular tracking-by-detection framework. Following MDNet, we use a deep convolutional network architecture as shown in Figure 2. It takes an RGB image patch as input and contains five hidden layers: three convolutional layers and two fully connected layers. The convolutional layers are identical to the corresponding parts of the VGG-M network [6], except that the feature map sizes are adjusted to our input size. The next two fully connected layers contain 4608 and 512 output units and are combined with ReLUs and dropouts. We adopt the 512-D feature from the second fully connected layer to represent the corresponding image patch.

The major difference between our SALNet and existing binary classification based trackers is that we take the correlations between training samples into consideration. Specifically, we formulate the deep feature learning problem for visual tracking as a node-focused graph application. Given the features of the extracted training samples, we construct an undirected complete graph $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ denotes the set of nodes. Each node represents the feature vector of an extracted image patch. The edge set $\mathcal{E}$ represents the relationships between different nodes, so that pairs of semantically related nodes are connected. More specifically, we assign each edge a weight based on the Euclidean distance between the paired proposals. We use $A_{ij}$ to denote the relation importance between node $i$ and node $j$, which can be represented as follows:

$$A_{ij} = f(x_i, x_j), \tag{1}$$

where $x_i$ and $x_j$ are the $i$-th and $j$-th nodes, and $f(\cdot, \cdot)$ is a pairwise similarity estimation function that estimates the similarity score between $x_i$ and $x_j$.

After the affinity matrix $A$ is computed according to Eq. 1, we normalize each row of the matrix so that the edge values connected to one proposal sum to 1. Following [41], we adopt the softmax function for the normalization:

$$\hat{A}_{ij} = \frac{\exp(A_{ij})}{\sum_{k=1}^{N} \exp(A_{ik})}, \tag{2}$$

where $N$ is the number of nodes. The normalized $\hat{A}$ is taken as the adjacency matrix representing the similarity graph.
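
The affinity computation and row-wise softmax normalization described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the negative squared Euclidean distance is assumed as the similarity function (the text only states that the weights are based on the Euclidean distance), and the names `pairwise_affinity`, `row_softmax` and the toy features are hypothetical.

```python
import math


def pairwise_affinity(features):
    """Affinity between every pair of nodes.

    Assumption: similarity = negative squared Euclidean distance, so that
    closer feature vectors receive a larger affinity.
    """
    n = len(features)
    return [
        [-sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
         for j in range(n)]
        for i in range(n)
    ]


def row_softmax(matrix):
    """Normalize each row with a numerically stable softmax (rows sum to 1)."""
    normalized = []
    for row in matrix:
        row_max = max(row)
        exps = [math.exp(value - row_max) for value in row]
        total = sum(exps)
        normalized.append([e / total for e in exps])
    return normalized


# Toy 2-D features for three proposals (hypothetical values).
features = [[0.0, 0.0], [0.1, 0.0], [3.0, 4.0]]
affinity = pairwise_affinity(features)
adjacency = row_softmax(affinity)
```

In this toy example, the first two proposals are close in feature space, so they receive most of each other's edge weight, while the distant third proposal contributes almost nothing to their rows.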

Different from standard convolutions, which operate on a local region of an image, graph convolutions compute the response at a node from its neighboring nodes as defined by the adjacency graph. Mathematically, the convolutional operation for each layer in the network is represented as:

$$Z = \hat{A} X W, \tag{3}$$

where $\hat{A}$ is a normalized version of the adjacency matrix of the graph, with $N \times N$ dimensions, $X \in \mathbb{R}^{N \times d}$ is the input feature matrix from the previous layer, and $W$ is the weight matrix of the layer with dimension $d \times d'$, where $d'$ is the number of output channels. Therefore, the input to a convolutional layer is of size $N \times d$, and the output is a matrix $Z \in \mathbb{R}^{N \times d'}$. The convolution operations can be stacked one after another, and a non-linear operation (ReLU) can be applied after each convolutional layer. For the final convolutional layer, the number of output channels is the number of label classes in supervised learning. In this paper, we only need the output features, so we do not integrate this final layer into our network. More detailed introductions can be found in [19] [2].

After we obtain the enhanced feature via the GCN, we concatenate it with the original input as the final feature representation of each proposal. Following MDNet, we also introduce domain-specific layers to model the correlations between different video sequences in the training dataset. We refer readers to MDNet [34] for a deeper understanding of this algorithm.
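
A single graph convolution followed by the concatenation step can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names (`matmul`, `relu`, `gcn_layer`), the toy adjacency matrix and the weights are all hypothetical, and a real implementation would use learned weights and a tensor library.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def relu(matrix):
    """Element-wise ReLU."""
    return [[max(0.0, value) for value in row] for row in matrix]


def gcn_layer(adjacency, features, weights):
    """One graph convolution: aggregate neighbors, project, apply ReLU."""
    return relu(matmul(matmul(adjacency, features), weights))


# Row-normalized adjacency for three nodes (hypothetical values).
adjacency = [[0.5, 0.5, 0.0],
             [0.5, 0.5, 0.0],
             [0.0, 0.0, 1.0]]
# Node features: 3 nodes, 2 channels each.
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
# Layer weights mapping 2 input channels to 3 output channels.
weights = [[1.0, -1.0, 0.5],
           [0.0, 1.0, 0.5]]

enhanced = gcn_layer(adjacency, features, weights)
# Final representation: original feature concatenated with the enhanced one.
final_features = [orig + enh for orig, enh in zip(features, enhanced)]
```

Nodes 0 and 1 share the same neighborhood here, so they obtain identical enhanced features, while the concatenated original features keep them distinguishable.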

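In the shared feature learning phase (i.e., the SALNet), the loss function used for binary classification (i.e., the BCE loss) can be formulated as:

$$L_{BCE} = -\frac{1}{N} \sum_{n=1}^{N} \big[ y_n \log p_n + (1 - y_n) \log (1 - p_n) \big], \tag{4}$$

where $N$ is the mini-batch size, $y_n$ is the ground truth label of the $n$-th sample, and $p_n$ is the prediction for the corresponding sample from the deep neural network.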
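
As a small worked example, the binary cross-entropy over a mini-batch can be computed as follows. This is a minimal sketch assuming the standard mean-reduced form of the loss; the function name `bce_loss`, the clipping constant `eps` and the toy labels and predictions are hypothetical.

```python
import math


def bce_loss(labels, predictions, eps=1e-12):
    """Mean binary cross-entropy over a mini-batch.

    labels: 0/1 ground truth labels; predictions: probabilities in [0, 1].
    Predictions are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# Toy mini-batch: two positive and two negative samples (hypothetical scores).
labels = [1, 0, 1, 0]
predictions = [0.9, 0.2, 0.6, 0.4]
loss = bce_loss(labels, predictions)
```

Confident correct predictions (0.9 for a positive sample) contribute little to the loss, whereas uncertain ones (0.6 for a positive sample) contribute more.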

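In addition to the BCE loss, a triplet loss is adopted to ensure that all positive samples are close to the high-level semantic vector $V$ (the anchor) while all negative samples are kept at a distance from the anchor, as illustrated in Figure 1 (b). Formally, we require:

$$\| V - x_i^{p} \|_2^2 + \alpha < \| V - x_i^{n} \|_2^2, \quad \forall\, (V, x_i^{p}, x_i^{n}) \in \mathcal{T}, \tag{5}$$

where $x_i^{p}$ and $x_i^{n}$ denote the features of the $i$-th positive and negative samples, $\alpha$ is a margin enforced between positive and negative pairs, which we set to 1.0 in our implementation, $\mathcal{T}$ is the set of all possible triplets in the training set, and the number of triplets in a mini-batch is denoted by $N$. Hence, the loss function for the mini-batch can be formulated as:

$$L_{tri} = \sum_{i=1}^{N} \Big[ \| V - x_i^{p} \|_2^2 - \| V - x_i^{n} \|_2^2 + \alpha \Big]_{+}. \tag{6}$$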
Therefore, the final loss function for the optimization of the SALNet can be formulated as:

$$L = L_{BCE} + \lambda L_{tri}, \tag{7}$$

where $L_{BCE}$ is the BCE loss, $L_{tri}$ is the triplet loss, and $\lambda$ is a tradeoff parameter, which we experimentally set to 0.1.
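
The triplet term and its combination with the classification loss can be illustrated with the short sketch below. It is only a sketch under assumptions: squared Euclidean distances and a sum over triplets are assumed, the language embedding is used as the single anchor for every triplet, and all names (`squared_distance`, `triplet_loss`, `total_loss`) and toy values are hypothetical. The margin 1.0 and the tradeoff value 0.1 are the settings stated in the text.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positives, negatives, margin=1.0):
    """Hinge-style triplet loss summed over (anchor, positive, negative) triplets."""
    return sum(
        max(0.0, squared_distance(anchor, p) - squared_distance(anchor, n) + margin)
        for p, n in zip(positives, negatives)
    )


def total_loss(bce, triplet, tradeoff=0.1):
    """Classification loss plus the weighted triplet loss."""
    return bce + tradeoff * triplet


# Toy language embedding used as the anchor (hypothetical 2-D vectors).
anchor = [1.0, 0.0]
positives = [[0.9, 0.1], [0.0, 1.0]]
negatives = [[-1.0, 0.0], [0.5, 0.5]]

tri = triplet_loss(anchor, positives, negatives)
loss = total_loss(bce=0.3, triplet=tri)
```

The first triplet already satisfies the margin and contributes zero, while the second triplet has a positive sample farther from the anchor than the negative one and is therefore penalized.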

3.1.2 GPGNet

Although the proposed SALNet already achieves good performance on some video sequences, it still cannot avoid the issues caused by the local search strategy under the tracking-by-detection framework. In this paper, we propose the global proposal generation network (GPGNet) to complement the local proposals for robust visual tracking.

Figure 3: The illustration of convolutional network for natural language embedding.

As shown in Figure 2, the inputs of this module are the target object patch, the video frame and the natural language specification. For all the input data, we only use convolutional networks to obtain features, owing to the efficiency of CNNs. Specifically, we resize each video frame and the target object patch to a fixed resolution and obtain their feature maps via VGG-Net. For the natural language, we first embed each word into a 512-D feature vector. Different from the regular practice of using an RNN model to embed the language [25] [31] [7], we adopt a CNN to encode it with more efficient parallel convolutions, which have been widely used in many tasks [15] [1]. The detailed configuration of the convolutional network can be found in Figure 3. The visual features of the video frame and the target object are concatenated along the channel dimension to form a joint visual feature map. The maximum length of a given sentence is 16 in this paper, from which we obtain the sentence embedding. We then expand this embedding to the spatial size of the visual feature map and concatenate it with the visual features. The concatenated features obtained in the encoding phase are input into the upsample network (which is a reversed VGG-Net).
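
The fusion of visual and language features described above can be sketched as follows. This is a toy illustration with hypothetical, deliberately tiny shapes (the actual feature dimensions are not reproduced here); feature maps are stored as nested lists indexed as `[channel][row][column]`, and the helper names are assumptions.

```python
def constant_map(channels, height, width, value):
    """Create a [C][H][W] feature map filled with one value (toy stand-in for CNN features)."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


def tile_embedding(embedding, height, width):
    """Repeat a C-dimensional vector at every spatial position -> [C][H][W]."""
    return [[[value] * width for _ in range(height)] for value in embedding]


def concat_channels(*feature_maps):
    """Concatenate [C][H][W] feature maps along the channel axis."""
    merged = []
    for feature_map in feature_maps:
        merged.extend(feature_map)
    return merged


H, W = 2, 3  # hypothetical spatial size of the feature maps
frame_features = constant_map(2, H, W, 0.5)   # features of the whole video frame
target_features = constant_map(2, H, W, 1.5)  # features of the target object patch
sentence_embedding = [0.1, 0.2, 0.3]          # toy sentence embedding

visual = concat_channels(frame_features, target_features)
fused = concat_channels(visual, tile_embedding(sentence_embedding, H, W))
```

Tiling the sentence embedding makes the same language cue available at every spatial location, so the subsequent upsample network can relate it to local visual evidence.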

Following [31], we adopt the binary cross-entropy loss for the optimization of GPGNet.

3.2 Online Tracking

When tracking the target object in a new sequence under the guidance of target-driven attention maps, the shared layers of the pre-trained CNN and a new binary classification layer are combined to construct a new network. The GCN module is only used for shared feature learning in the training phase, which makes online tracking more efficient. To estimate the target state in each frame, target candidates randomly sampled around the previous target state and proposals extracted from the current attention regions are evaluated using the network. We obtain their positive scores $f^{+}(x^i)$ and negative scores $f^{-}(x^i)$ from the network. During the sampling phase, our target-driven attention maps provide proposals with more accurate location and scale information, which makes the search more efficient and effective. Hence, the optimal target state $x^{*}$ is given by the candidate with the maximum positive score:

$$x^{*} = \arg\max_{x^i} f^{+}(x^i). \tag{8}$$
We adopt the same strategy as MDNet, i.e., long-term and short-term updates, to update our model. Long-term updates are performed at regular intervals using the positive samples collected over a long period, while short-term updates are conducted whenever potential tracking failures are detected (when the positive score of the estimated target is less than 0.5) using the positive samples from a short-term period.
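
The per-frame decision described in this section can be sketched as follows. This is a minimal sketch, not the authors' implementation: the candidate boxes and positive scores are made-up values standing in for the network outputs, and the function names are hypothetical. The 0.5 failure threshold is the one stated in the text.

```python
def select_target(candidates, positive_scores):
    """Return the candidate with the maximum positive score, and that score."""
    best_index = max(range(len(candidates)), key=lambda i: positive_scores[i])
    return candidates[best_index], positive_scores[best_index]


def needs_short_term_update(best_score, threshold=0.5):
    """Flag a potential tracking failure when the best positive score is below 0.5."""
    return best_score < threshold


# Hypothetical candidates as (x, y, w, h) boxes.
local_candidates = [(50, 60, 30, 40), (52, 61, 30, 40)]   # sampled around the previous state
global_candidates = [(200, 80, 32, 42)]                   # extracted from attention regions
candidates = local_candidates + global_candidates
positive_scores = [0.35, 0.42, 0.91]                      # made-up network outputs

target_state, best_score = select_target(candidates, positive_scores)
update_now = needs_short_term_update(best_score)
```

In this example the local candidates all score below 0.5, as would happen after the target reappears far from its previous position, and the global proposal is selected instead.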

3.3 Implementation Details

The generation of global proposals from attention maps can be summarized in the following three steps: 1) given the attention map, obtain the attention regions and the center location of each region; 2) obtain a BBox that covers each attention region; 3) apply a Gaussian sampling strategy to these bounding boxes to generate proposals.
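
The three steps above can be sketched in plain Python as follows. This is an illustrative sketch under several assumptions that are not specified in the text: attention regions are taken as 4-connected components above a threshold of 0.5, and the Gaussian sampling perturbs the box center and scale with hypothetical standard deviations. All function names are hypothetical.

```python
import random
from collections import deque


def attention_regions(att_map, threshold=0.5):
    """Steps 1-2: find 4-connected regions above the threshold and return
    their bounding boxes as (x_min, y_min, x_max, y_max)."""
    height, width = len(att_map), len(att_map[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or att_map[y][x] < threshold:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:  # breadth-first flood fill of one region
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx] and att_map[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


def gaussian_proposals(box, count=8, shift_std=0.1, scale_std=0.05, seed=0):
    """Step 3: sample (cx, cy, w, h) proposals around a box with Gaussian noise."""
    rng = random.Random(seed)
    x0, y0, x1, y1 = box
    width, height = x1 - x0 + 1, y1 - y0 + 1
    cx, cy = x0 + width / 2.0, y0 + height / 2.0  # center location of the region
    proposals = []
    for _ in range(count):
        scale = max(0.1, 1.0 + rng.gauss(0.0, scale_std))
        px = cx + rng.gauss(0.0, shift_std) * width
        py = cy + rng.gauss(0.0, shift_std) * height
        proposals.append((px, py, width * scale, height * scale))
    return proposals


# Toy attention map with two separate high-attention regions.
attention = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.9, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.6, 0.7],
    [0.0, 0.0, 0.0, 0.0, 0.8, 0.9],
]
boxes = attention_regions(attention)
global_proposals = [p for box in boxes for p in gaussian_proposals(box)]
```

Each detected region yields its own set of proposals, so distractor regions are also covered and the classifier is left to decide which proposal contains the target.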

The GPGNet is used to generate the target-driven attention maps for global proposal extraction. We train this network with binary masks, which can be obtained directly from the BBox annotations. As shown in Figure 4, we first generate a black mask with the same resolution as the video frame, and then set the target object region to white according to the annotated BBox in the training dataset. The binary masks are used as the ground truth attention maps to optimize the GPGNet. Following regular semantic segmentation practice, we adopt the binary cross-entropy loss to measure the difference between the generated attention maps and the ground truth.
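
The mask generation step can be sketched as below, following the description in this section (black background, white target region). The `(x, y, w, h)` box format with a top-left origin and the function name `bbox_to_mask` are assumptions made for illustration.

```python
def bbox_to_mask(frame_width, frame_height, bbox):
    """Create a binary mask: 0 (black) everywhere, 255 (white) inside the box.

    bbox is assumed to be (x, y, w, h) with (x, y) the top-left corner;
    parts of the box outside the frame are clipped.
    """
    x, y, w, h = bbox
    mask = [[0] * frame_width for _ in range(frame_height)]
    for row in range(max(0, y), min(frame_height, y + h)):
        for col in range(max(0, x), min(frame_width, x + w)):
            mask[row][col] = 255
    return mask


# Toy 8x6 frame with a 3x2 target box (hypothetical annotation).
mask = bbox_to_mask(frame_width=8, frame_height=6, bbox=(2, 1, 3, 2))
white_pixels = sum(value == 255 for row in mask for value in row)
```

Dividing the mask by 255 gives per-pixel targets in {0, 1}, which is the form expected by a binary cross-entropy loss.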

Figure 4: The pipeline of pre-processing training samples for GPGNet.

The SALNet is trained with an initial learning rate of 0.0001 and a batch size of 8, using Adam for optimization. For the GPGNet, the initial learning rate is 5e-5, the batch size is 20, and Adagrad is used for optimization. Three convolutional layers are used to encode the natural language, and the maximum sentence length is 16. We train this network for 50 epochs. All the experiments are implemented in PyTorch on a desktop computer with Ubuntu 16.04, an I7-6700k CPU, an NVIDIA TITAN Xp GPU with 12G VRAM, and 32G RAM.

4 Experiments

We will first introduce the dataset and evaluation criteria used in this paper. After that, we will compare our tracking results with other state-of-the-art visual trackers on several public benchmarks. Then, we conduct ablation studies to validate the effectiveness of each component. Finally, we will discuss the difference with existing trackers.

4.1 Datasets and Evaluation Criterion

The training datasets used in this paper for SALNet and GPGNet are TLP50 [33] and DTB70 [23], which together contain 120 (50 + 70) video sequences, and LaSOT [14], which contains 1400 video sequences, respectively. The baseline method pyMDNet used in this paper is implemented in PyTorch and pre-trained on the two long-term datasets TLP [33] and DTB70 [23] for all our experiments. LaSOT provides both BBox and natural language annotations of the target object, which makes it suitable for our natural language guided tracking task. Specifically, we generate a binary mask for each video frame by setting the target object pixels as zero and background pixels as 255. We use these masks as ground truth attention maps to optimize the GPGNet. It is also worth noting that we only select 44660 images from the LaSOT dataset (which contains 3.52 million frames in total) for the training of GPGNet, in order to quickly validate our proposed method. The testing is conducted on five benchmark datasets: OTB-2013 [43], OTB-100 [44], VOT-2014 [21], VOT-2016 [20] and TC128 [26]. The natural language specification of OTB100 is borrowed from Lang-Tracker [25], and the other testing datasets are annotated by one person to maintain consistency.

Two widely used evaluation protocols are utilized in this paper: success rate and precision rate. Both criteria measure the percentage of successfully tracked frames. For the success rate, a frame is declared successfully tracked if the estimated bounding box and the ground truth box have an intersection-over-union overlap larger than a certain threshold. For the precision rate, tracking on a frame is considered successful if the distance between the centers of the predicted box and the ground truth box is under some threshold.
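
Both criteria can be computed with a few lines of Python, as sketched below. The `(x, y, w, h)` box format, the function names and the toy boxes are assumptions; the thresholds of 0.5 for overlap and 20 pixels for center distance are common choices on OTB-style benchmarks and are used here only as example values.

```python
import math


def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def center_distance(box_a, box_b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2.0) - (bx + bw / 2.0),
                      (ay + ah / 2.0) - (by + bh / 2.0))


def success_rate(predictions, ground_truths, overlap_threshold=0.5):
    """Fraction of frames whose IoU exceeds the overlap threshold."""
    hits = sum(iou(p, g) > overlap_threshold for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)


def precision_rate(predictions, ground_truths, distance_threshold=20.0):
    """Fraction of frames whose center error is under the distance threshold."""
    hits = sum(center_distance(p, g) < distance_threshold
               for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)


# Toy sequence of three frames (hypothetical boxes).
ground_truths = [(0, 0, 10, 10)] * 3
predictions = [(0, 0, 10, 10), (5, 0, 10, 10), (100, 100, 10, 10)]
sr = success_rate(predictions, ground_truths)
pr = precision_rate(predictions, ground_truths)
```

The second frame shows why the two criteria differ: its center error of 5 pixels counts as a precision hit, while its overlap of one third falls below the 0.5 success threshold.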

Figure 5: The attention maps generated by our GPGNet.
Figure 6: The difference between target object only, natural language only and joint based global attention estimation.
Table 1: The tracking results on OTB-2013 and OTB-100 Benchmark (The top three results are highlighted in red, green and blue, respectively).

| Algorithm | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| OTB-2013 | 0.839/0.624 | 0.880/0.638 | 0.908/0.673 | 0.899/0.672 | 0.786/0.583 | 0.892/0.670 | 0.870/0.653 | -/- | 0.860/0.607 |
| OTB-100 | 0.768/0.574 | 0.851/0.621 | 0.838/0.623 | 0.898/0.671 | 0.778/0.581 | -/0.641 | 0.825/0.627 | 0.733/0.587 | 0.799/0.573 |

| Algorithm | ADNet | Meta-Tracker | MemTrack | Staple | CFNet | Lang-Tracker | SiamFC | pyMDNet | Ours |
|---|---|---|---|---|---|---|---|---|---|
| OTB-2013 | 0.903/0.659 | -/- | 0.849/0.642 | 0.793/0.600 | 0.785/0.589 | -/0.578 | 0.809/0.607 | 0.880/0.655 | 0.925/0.676 |
| OTB-100 | 0.880/0.646 | 0.856/0.637 | 0.820/0.626 | 0.784/0.581 | 0.777/0.586 | -/- | 0.771/0.582 | 0.866/0.643 | 0.889/0.646 |
Figure 7: The tracking results on VOT-2014 (left two sub-figures) and VOT-2016 dataset (right two sub-figures).
Figure 8: The tracking results on TC128 dataset.

4.2 Comparison with State-of-the-art Trackers

To fully demonstrate the effectiveness of our proposed algorithm, we compare with many recent and popular trackers, including: MDNet [34], CCOT [13], ECO [11], RASNet [42], SINT [39], SINT++ [42], CSR-DCF [30], ADNet [47], Meta-Tracker [35], MemTrack [45], Staple [3], CFNet [40], Lang-Tracker [25], SiamFC [4], ReGLe [22], StructSiam [48] and AFCN [9].

As shown in Table 1, our proposed tracker achieves comparable or better performance than several recent trackers on the OTB-2013 and OTB-100 benchmark datasets. Compared with the baseline method pyMDNet, our method improves on both OTB-2013 and OTB-100. Specifically, our algorithm improves the tracking accuracy (precision plot/success plot) from 0.880/0.655 to 0.925/0.676 on OTB-2013 and from 0.866/0.643 to 0.889/0.646 on OTB-100. Our tracker does not perform as well as the top performing tracker CCOT on the OTB100 dataset, for two reasons: (i) CCOT crops samples in a continuous space for scale estimation, while our tracker only randomly draws a sparse set of samples; (ii) CCOT uses multiple features (e.g., color names, HOG and deep features), while we only use deep features. We will explore these ideas in our future work.

We also report the tracking results on VOT-2016, evaluated with its own default metrics, in Table 2. Our tracker achieves comparable or even better performance when compared with other trackers. Illustrations of the tracking results and target-driven attention maps can be found in Figures 5, 6 and 11.

Table 2: Comparison with other trackers on VOT-2016 dataset with default metrics.

| Algorithm | CCOT | EBT | Staple | SRDCF | HCF | SiamRN | DSST | MDNet | Ours |
|---|---|---|---|---|---|---|---|---|---|
| EAO | 0.3310 | 0.2913 | 0.2952 | 0.2471 | 0.2203 | 0.2766 | 0.1814 | 0.2572 | 0.3045 |
| FPS | 82.18 | 2.87 | 14.43 | 503.18 | 328.73 | 7.05 | 13.90 | 2.66 | 2.27 |
Table 3: Tracking performance without or with global proposals on OTB100 dataset. The tracking results on precision plot and success plot are listed as follows.

| Algorithm | Our-SALNet | Our-TO | Our-NL | Our-JTNL |
|---|---|---|---|---|
| OTB100 | 0.876/0.644 | 0.886/0.647 | 0.884/0.643 | 0.889/0.646 |

4.3 Ablation Studies

The Effect of GCN. To demonstrate the effectiveness of the structured information from other proposals, we evaluate this component (i.e., pyMDNet+GCN) on the OTB-2013 dataset. As shown in Table 4, pyMDNet+GCN improves the tracking result from 0.880/0.655 to 0.905/0.671 on the precision and success plots, respectively, compared with the baseline method pyMDNet. This result validates the effectiveness of the structured information from other nodes; that is, the GCN helps to learn more discriminative deep features for visual tracking.

The Effect of Triplet Loss. To validate the effectiveness of natural language guided feature learning, we test a model with only the triplet module, i.e., pyMDNet+Language. As we can see from Table 4, this prior knowledge also improves feature learning. This experiment demonstrates the effectiveness of the introduced natural language specification in guiding the shared feature learning.

Table 4: Component Analysis on OTB-2013 dataset. The tracking results on precision plot and success plot are listed as follows.

| Algorithm | pyMDNet | + GCN | + Language |
|---|---|---|---|
| OTB-2013 | 0.880/0.655 | 0.905/0.671 | 0.913/0.668 |

The Effect of Global Search Strategy. As shown in Table 3, we conduct object tracking without global proposals (Our-SALNet), and also jointly use the local and global search strategies for robust visual tracking. Specifically, we estimate the attention maps with the following three versions: target object patch based (Our-TO), natural language based (Our-NL) and joint target and language based (Our-JTNL). The global search strategy improves the tracking results over the baseline method Our-SALNet, e.g., from 0.876/0.644 to 0.889/0.646 for Our-JTNL. We also visualize some of these global attention maps in Figure 5.

The Generality of Target-driven Attention Maps. We show that our target-driven attention maps can also be integrated with other trackers, such as CSR-DCF [29]. We take the attention maps generated by our GPGNet as a kind of feature representation and integrate them with CSR-DCF for robust visual tracking. For example, CSR-DCF uses [gray, color name, HOG] as its original features. After integration with our attention maps, its feature tuple becomes [gray, color name, HOG, attention map]. As shown in Figure 7 (a), the tracking results of CSR-DCF can be improved with our attention maps on the VOT dataset.

Influence of Tradeoff Parameter $\lambda$. As shown in Eq. 7, our loss function contains a hyperparameter $\lambda$ that trades off the classification loss and the triplet loss. In this section, we conduct experiments on this parameter (setting $\lambda$ to 0, 0.1, 0.2, 0.3, 0.5, 0.8 and 1) to show its influence on the final tracking results. The curve showing the variation of tracking performance can be found in Figure 9. We find that our proposed deep model is not sensitive to $\lambda$.

Figure 9: The analysis of tradeoff parameter (left sub-figure) and node numbers (right sub-figure). The red line denotes the variation of precision plots, and the blue line denotes the success plots.

Influence of Node Numbers for GCN. To test the influence of the number of training samples used as graph nodes, we extract different numbers of samples (i.e., 0, 20, 28, 32, 43, 50 and 70) and compare the final results. As shown in Figure 9, integrating GCN significantly improves the tracking results, since every non-zero setting outperforms the zero-node setting (i.e., the baseline method pyMDNet). We also find that the best results are obtained when the number of nodes lies between 30 and 50.

Influence of the Number of GCN Layers. To check the influence of the number of graph convolutional layers, we conduct an ablation study: we pretrain the model with 2, 3, 5 and 8 GCN layers and test on the OTB2013 dataset. As shown in Table 5, the tracking results improve as the number of layers grows from 2 to 5 and level off at 8 layers. However, adding more layers also increases the training time. Hence, we use 3 GCN layers in our experiments to achieve a better tradeoff between accuracy and training time.

Layers    2        3        5        8
SR        0.654    0.663    0.671    0.670
Table 5: Tracking results (SR) with different numbers of GCN layers on the OTB2013 dataset.
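
For reference, the effect of stacking graph convolutional layers can be sketched with the propagation rule of Kipf and Welling [19], in which each layer computes ReLU(D^{-1/2} (A + I) D^{-1/2} H W). The code below is a plain-Python illustration under stated assumptions: the chain-shaped adjacency matrix, the feature size, and the random (untrained) weights are placeholders, and the way the paper builds the graph over training samples is not reproduced here.

```python
from math import sqrt
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, the propagation matrix of Kipf and Welling."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcn_forward(features, adj, weights):
    """Apply one ReLU graph-convolution layer per weight matrix in `weights`."""
    prop = normalize_adjacency(adj)
    h = features
    for w in weights:
        h = matmul(matmul(prop, h), w)
        h = [[max(v, 0.0) for v in row] for row in h]
    return h

# Toy graph: 4 training samples (nodes) with 3-D features, chain-connected.
rng = random.Random(0)
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
x = [[rng.random() for _ in range(3)] for _ in range(4)]
num_layers = 3  # the depth chosen in the text; 2, 5 and 8 were also tried
weights = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
           for _ in range(num_layers)]
out = gcn_forward(x, adj, weights)
```

Each additional layer lets a node aggregate information from neighbors one hop further away, at the cost of one more pair of matrix products per forward pass.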

Tracking results on similar appearance videos. To validate the performance on videos with similar target objects, we also test our tracker on 46 video sequences selected from the OTB100 dataset (listed in Footnote 5). Each of these videos contains at least one object similar in appearance to the target object. This experiment is intended to validate the robustness of our target-driven attention maps. As shown in Figure 10, our tracker still achieves good performance on these challenging videos. Specifically, we achieve 91.8/65.2 on this sub-dataset in terms of the PR/SR evaluation criteria. Our results are better than those of the baseline method pyMDNet (86.5/64.2) and some other recent visual trackers, such as SINT++ [42] and ReGLe [22].

Footnote 5: The selected video list: Basketball, Bird1, Girl2, BlurCar1, BlurCar2, BlurCar4, Bolt, Bolt2, Walking, Walking2, BlurCar3, Freeman3, Car1, Car2, Car24, Car4, CarDark, Couple, Coupon, Crossing, Crowds, Deer, Football, Football1, Human3, Human4, Human5, Human6, Human7, Human8, Human9, Ironman, Jogging-1, Jogging-2, Jumping, KiteSurf, Liquor, Shaking, Singer1, Singer2, Skating1, Skating2-1, Skating2-2, Soccer, Subway, Suv.

Figure 10: The tracking results on 46 videos selected from the OTB100 dataset (all of these videos contain objects similar in appearance to the target object).
Figure 11: The tracking results of our method and other trackers.

4.4 Discussion

Difference Between Regular Saliency Estimation and Our Attention Maps. Saliency maps usually highlight the objects that humans attend to; however, these may not be the target we want to track in practical videos. Therefore, saliency maps cannot be used directly in practical tracking algorithms. In contrast, our attention maps are generated from the initial target object and the natural language specification. They focus only on the target object we want to track in each video; in other words, our attention maps are video-specific.

Difference with Existing Trackers. The works most relevant to ours are Lang-Tracker [25] and MDNet. Lang-Tracker uses language to detect the target object in the first frame and then tracks it in subsequent frames according to the image patch and the language description; in contrast, we use the natural language for shared feature learning and global attention estimation, which also improves the final tracking results significantly. Compared with MDNet: (i) MDNet does not consider the structural information among training samples or the language; we model this information with GCN and the triplet loss function when designing our network. (ii) MDNet adopts only the local search strategy of the tracking-by-detection framework, which makes it rather sensitive to challenging factors; our tracker jointly uses local and global search strategies for robust visual tracking. Extensive experiments on five tracking benchmarks validated the effectiveness of our proposed algorithm.

5 Conclusion

In this paper, we propose a novel visual tracker, named DAT, to track the target object based on the provided BBox and its natural language specification. Our tracker can be divided into two main subnetworks: SALNet and GPGNet. SALNet is a novel structure-aware deep neural network that takes into consideration both the correlations between video sequences and the correlations among training samples in each video. We adopt the softmax and triplet loss functions to train this subnetwork. We also propose GPGNet, a novel target-driven global attention estimation network that estimates the locations the tracker should focus on. The proposals extracted from the attention regions are fed into the binary classifier together with the proposals extracted from the local search window. The proposal with the maximum response score is chosen as the tracking result of the current frame. Extensive experiments on several public tracking benchmarks validated the effectiveness of our proposed method.


  • [1] J. Aneja, A. Deshpande, and A. G. Schwing. Convolutional image captioning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [2] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • [3] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. S. Torr. Staple: Complementary learners for real-time tracking. 38(2):1401–1409, 2016.
  • [4] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr. Fully-Convolutional Siamese Networks for Object Tracking. Springer International Publishing, 2016.
  • [5] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2544–2550. IEEE, 2010.
  • [6] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
  • [7] D. Chen, H. Li, X. Liu, Y. Shen, J. Shao, Z. Yuan, and X. Wang. Improving deep visual representation for person re-identification by global and local image-language association. In The European Conference on Computer Vision (ECCV), September 2018.
  • [8] J. Choi, H. J. Chang, J. Jeong, Y. Demiris, and Y. C. Jin. Visual tracking using attention-modulated disintegration and integration. In Computer Vision and Pattern Recognition, pages 4321–4330, 2016.
  • [9] J. Choi, H. J. Chang, S. Yun, T. Fischer, Y. Demiris, J. Y. Choi, et al. Attentional correlation filter network for adaptive visual tracking. In CVPR, volume 2, page 7, 2017.
  • [10] Q. Chu, W. Ouyang, H. Li, X. Wang, B. Liu, and N. Yu. Online multi-object tracking using cnn-based single object tracker with spatial-temporal attention mechanism. In IEEE International Conference on Computer Vision (ICCV), pages 4846–4855, 2017.
  • [11] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. Eco: Efficient convolution operators for tracking. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 6931–6939. IEEE, 2017.
  • [12] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 4310–4318, 2015.
  • [13] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In European Conference on Computer Vision, pages 472–488, 2016.
  • [14] H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling. Lasot: A high-quality benchmark for large-scale single object tracking. arXiv preprint arXiv:1809.07845, 2018.
  • [15] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence learning. In International Conference on Machine Learning, pages 1243–1252, 2017.
  • [16] B. Han, J. Sim, and H. Adam. Branchout: Regularization for online ensemble tracking with convolutional neural networks. In Proceedings of IEEE International Conference on Computer Vision, pages 2217–2224, 2017.
  • [17] S. Hong, T. You, S. Kwak, and B. Han. Online tracking by learning discriminative saliency map with convolutional neural network. Computer Science, pages 597–606, 2015.
  • [18] I. Jung, J. Son, M. Baek, and B. Han. Real-time mdnet. In The European Conference on Computer Vision (ECCV), September 2018.
  • [19] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. 2016.
  • [20] M. Kristan, J. Matas, A. Leonardis, T. Vojir, R. Pflugfelder, G. Fernandez, G. Nebehay, F. Porikli, and L. Čehovin. A novel performance evaluation methodology for single-target trackers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(11):2137–2155, Nov 2016.
  • [21] M. Kristan, R. Pflugfelder, A. Leonardis, J. Matas, L. Čehovin Zajc, G. Nebehay, T. Vojir, G. Fernandez, A. Lukežič, A. Dimitriev, A. Petrosino, A. Saffari, B. Li, B. Han, C. Heng, C. Garcia, D. Pangeršič, G. Häger, F. S. Khan, F. Oven, H. Possegger, H. Bischof, H. Nam, J. Zhu, J. Li, J. Y. Choi, J.-W. Choi, J. F. Henriques, J. van de Weijer, J. Batista, K. Lebeda, K. Öfjäll, K. M. Yi, L. Qin, L. Wen, M. E. Maresca, M. Danelljan, M. Felsberg, M.-M. Cheng, P. Torr, Q. Huang, R. Bowden, S. Hare, S. Y. Lim, S. Hong, S. Liao, S. Hadfield, S. Z. Li, S. Duffner, S. Golodetz, T. Mauthner, V. Vineet, W. Lin, Y. Li, Y. Qi, Z. Lei, and Z. Niu. The visual object tracking vot2014 challenge results, 2014.
  • [22] C. Li, X. Wu, Z. Bao, and J. Tang. Regle: Spatially regularized graph learning for visual tracking. pages 252–260, 2017.
  • [23] S. Li and D.-Y. Yeung. Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models. In AAAI, pages 4140–4146, 2017.
  • [24] Y. Li, J. Zhu, and S. C. Hoi. Reliable patch trackers: Robust visual tracking by exploiting reliable patches. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 353–361, 2015.
  • [25] Z. Li, R. Tao, E. Gavves, C. G. M. Snoek, and A. W. M. Smeulders. Tracking by natural language specification. In Computer Vision and Pattern Recognition, pages 7350–7358, 2017.
  • [26] P. Liang, E. Blasch, and H. Ling. Encoding color information for visual tracking: Algorithms and benchmark. IEEE Transactions on Image Processing A Publication of the IEEE Signal Processing Society, 24(12):5630–5644, 2015.
  • [27] C. Liu, P. Liu, W. Zhao, and X. Tang. Robust tracking and redetection: Collaboratively modeling the target and its context. IEEE Transactions on Multimedia, 20(4):889–902, 2018.
  • [28] T. Liu, G. Wang, and Q. Yang. Real-time part-based visual tracking via adaptive correlation filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4902–4912, 2015.
  • [29] A. Lukežič, T. Vojíř, L. Čehovin Zajc, J. Matas, and M. Kristan. Discriminative correlation filter tracker with channel and spatial reliability. International Journal of Computer Vision, 2018.
  • [30] A. Lukezic, T. Vojir, L. C. Zajc, J. Matas, and M. Kristan. Discriminative correlation filter with channel and spatial reliability. International Journal of Computer Vision, pages 1–18, 2016.
  • [31] E. Margffoy-Tuay, J. C. Perez, E. Botero, and P. Arbelaez. Dynamic multimodal instance segmentation guided by natural language queries. In The European Conference on Computer Vision (ECCV), September 2018.
  • [32] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. Computer Science, 2013.
  • [33] A. Moudgil and V. Gandhi. Long-term visual object tracking benchmark. arXiv preprint arXiv:1712.01358, 2017.
  • [34] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. pages 4293–4302, 2015.
  • [35] E. Park and A. C. Berg. Meta-tracker: Fast and robust online adaptation for visual object trackers. In The European Conference on Computer Vision (ECCV), September 2018.
  • [36] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, 2014.
  • [37] S. Pu, Y. Song, C. Ma, H. Zhang, and M.-H. Yang. Deep attentive tracking via reciprocative learning. arXiv preprint arXiv:1810.03851, 2018.
  • [38] J. Son, I. Jung, K. Park, and B. Han. Tracking-by-segmentation with online gradient boosting decision tree. In Proceedings of the IEEE International Conference on Computer Vision, pages 3056–3064, 2015.
  • [39] R. Tao, E. Gavves, and A. W. Smeulders. Siamese instance search for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1420–1429, 2016.
  • [40] J. Valmadre, L. Bertinetto, J. F. Henriques, A. Vedaldi, and P. H. S. Torr. End-to-end representation learning for correlation filter based tracking. 2017.
  • [41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • [42] X. Wang, C. Li, B. Luo, and J. Tang. Sint++: Robust visual tracking via adversarial positive instance generation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [43] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
  • [44] Y. Wu, J. Lim, and M.-H. Yang. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1834–1848, 2015.
  • [45] T. Yang and A. B. Chan. Learning dynamic memory networks for object tracking. In The European Conference on Computer Vision (ECCV), September 2018.
  • [46] D. Yeo, J. Son, B. Han, and J. H. Han. Superpixel-based tracking-by-segmentation using markov chains. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 511–520. IEEE, 2017.
  • [47] S. Yun, J. Choi, Y. Yoo, K. Yun, and Y. C. Jin. Action-decision networks for visual tracking with deep reinforcement learning. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1349–1358, 2017.
  • [48] Y. Zhang, L. Wang, J. Qi, D. Wang, M. Feng, and H. Lu. Structured siamese network for real-time visual tracking. In The European Conference on Computer Vision (ECCV), September 2018.