CyLKs: Unsupervised Cycle Lucas-Kanade Network for Landmark Tracking
Most modern learning-based tracking systems need expensive annotations to achieve state-of-the-art performance. In contrast, the Lucas-Kanade (LK) algorithm works well without any annotation. However, LK makes a strong assumption of photometric (brightness) consistency in image intensity and drifts easily under large motion, occlusion, and the aperture problem. To relax this assumption and alleviate drift, we propose CyLKs, a data-driven way of training Lucas-Kanade in an unsupervised manner. CyLKs learns a feature transformation through CNNs, mapping the input images to a feature space that is especially favorable to LK tracking. During training, we perform differentiable Lucas-Kanade tracking forward and backward on the convolutional feature maps and then minimize the re-projection error. During testing, we perform LK tracking on the learned features. We apply our model to the task of landmark tracking and perform experiments on the THUMOS and 300VW datasets.
Xinshuo Weng Carnegie Mellon University firstname.lastname@example.org Wentao Han Carnegie Mellon University email@example.com
CMU 10707 Deep Learning Course Project (2017).
Landmark tracking, also known as key-point tracking, has been an active area of computer vision for decades. The quality of landmark tracking lays the foundation for many other vision tasks, such as face recognition, blendshape modeling, face animation, face reenactment, and human action recognition. For example, in face animation and reenactment, 2D landmarks can be used as control points (fixed constraints) while deforming one mesh to another. In action recognition, one can define the human skeleton as a set of landmarks and classify the action based on the motion of the skeleton.
With the advent of convolutional neural networks (CNNs), many modern data-driven tracking systems trained in a supervised manner achieve state-of-the-art performance [11, 8, 20, 23]. The main problem with these methods is that they need extensive training annotations, which are very expensive. By contrast, optical flow methods such as the Lucas-Kanade algorithm do not need any annotation and work well in general cases. However, the Lucas-Kanade algorithm is limited on images with large illumination changes, the aperture problem, occlusion, etc.
To overcome this, we propose CyLKs, a trainable Lucas-Kanade network. After training on a large amount of video data, CyLKs is expected to alleviate the problems caused by illumination changes, the aperture problem, etc. Unlike the existing work in [15, 21, 7], which also combines the Lucas-Kanade algorithm with convolutional neural networks, CyLKs can be trained in an unsupervised manner without any human annotation, and thus can leverage a very large amount of unlabeled video data and have better generalization capability.
Specifically, CyLKs learns a feature representation that is favorable to LK tracking. This is achieved by: i) extracting features from the source (input) and template images using a CNN; ii) given initial point positions in the template image, tracking forward (forward pass) to estimate their positions in the input image; iii) tracking backward (backward pass) from the points estimated in the forward pass to points in the template image; iv) calculating the Euclidean loss between the initial points and the backward-tracked points, then passing the gradients back to update the CNN parameters. The overall architecture is shown in Figure 1.
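The four steps above can be sketched as a single forward-backward pass. This is only an illustration of the control flow, not the authors' implementation: `cycle_pass`, `identity_features` and `make_tracker` are hypothetical names, the "images" are toy stand-ins that carry nothing but a known global offset, and the loss is assumed to be the mean squared Euclidean distance. The gradient update of step iv) is left to an autodiff framework and is not shown.

```python
import math

def cycle_pass(extract, track, template, image, points):
    """Run steps i) to iv) once and return (forward, backward, cycle_loss)."""
    feat_t, feat_i = extract(template), extract(image)   # i) shared feature extractor
    forward = track(feat_t, feat_i, points)              # ii) template -> input
    backward = track(feat_i, feat_t, forward)            # iii) input -> template
    # iv) mean squared Euclidean distance between initial and back-tracked points
    loss = sum(math.dist(p, b) ** 2 for p, b in zip(points, backward)) / len(points)
    return forward, backward, loss

# Toy stand-ins (hypothetical): each "image" is just its known global offset,
# and the tracker moves points by the offset difference plus a constant drift in x.
def identity_features(image):
    return image

def make_tracker(drift):
    def track(src, dst, pts):
        dx, dy = dst[0] - src[0] + drift, dst[1] - src[1]
        return [(x + dx, y + dy) for x, y in pts]
    return track

points = [(10.0, 10.0), (20.0, 5.0)]
_, _, exact = cycle_pass(identity_features, make_tracker(0.0), (0, 0), (3, -2), points)
_, _, drifting = cycle_pass(identity_features, make_tracker(0.5), (0, 0), (3, -2), points)
```

With a consistent tracker the points return to where they started and the cycle loss is zero. A tracker that drifts by half a pixel per pass accumulates one pixel over the cycle, which the loss exposes without any annotation.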
Contributions. 1) CyLKs can be trained without using any human annotation, and thus can learn variations from a large number of unlabeled videos. 2) CyLKs can learn a feature transformation, which can improve the performance of Lucas-Kanade based trackers.
2 Related Work
Unsupervised Tracking. Before the explosion of deep learning techniques, most unsupervised trackers can be classified into three categories: direct methods, feature-based methods and a hybrid of direct and feature-based methods.
Direct methods, mostly based on the Lucas-Kanade algorithm [5], operate on pixel intensities to estimate the motion between images. Such methods are computationally efficient and have been shown to achieve competitive results in SLAM [10, 2] and visual odometry [13]. However, direct methods assume photometric consistency across frames and are thus not robust to illumination changes, occlusion, and out-of-plane motion.
Instead of working on raw images, feature-based methods extract robust features and estimate motion by matching feature descriptors between images. Robust features such as SIFT [16] and ORB [18] are usually used. Because they do not assume photometric consistency, feature-based methods are more robust to illumination changes. However, their performance relies heavily on the localization capability and matching accuracy of the features.
Recently, hybrids of direct and feature-based methods have been explored in [3, 4, 6], which apply direct approaches to robust features. Unfortunately, although these approaches have been shown to improve significantly on direct methods, especially with better data quality (e.g., higher resolution and frame rate), they are still limited in the presence of large motion.
Supervised Tracking. With the superior representation capabilities of CNNs, many CNN-based tracking methods outperform unsupervised tracking methods. Held et al. [14] proposed GOTURN, which applies a deep regression network to predict object locations based on deep features. Nam and Han [17] proposed a classification-based multi-domain tracker, which tries to separate domain-independent information from domain-specific information in order to capture shared representations. C-COT [9] introduced multi-resolution fusion and continuous-domain learning for visual tracking to achieve accurate sub-pixel feature point tracking. ECO [8] proposed a factorized convolution operator to reduce the number of parameters and an efficient model update strategy, achieving significant improvement in both speed and robustness. Wang et al. [22] designed a two-stream CNN to handle drastic appearance change and distinguish the target object from similar distracters during tracking. Feichtenhofer et al. [12] set up a CNN architecture for simultaneous detection and tracking, and introduced correlation features that represent object co-occurrences across time to aid tracking.
However, CNN-based trackers usually work in a specific domain and have limited generalization capabilities. Also, the performance of such trackers drops significantly on the sequences which have a different appearance from the training data. To achieve good generalization, a large number of high-quality annotations are necessary, which turns out to be very expensive.
3.1 Inverse Compositional Lucas-Kanade (IC-LK) Layer
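LK Algorithm. Given a template image $T$ and an input image $I$, we first extract the image features with a CNN to obtain feature maps $F_T$ and $F_I$. Starting from an initial point $x$ in the template image, the LK algorithm estimates the translation parameters $p$ by minimizing
$$\min_{p} \; \| F_I(x + p) - F_T(x) \|_2^2,$$
where $F_I(x + p)$ is a feature patch of $F_I$ centered at $x + p$. This nonlinear objective is optimized iteratively. Starting from an initial translation $p$ and taking the first-order Taylor expansion of $F_I(x + p + \Delta p)$, the translation update $\Delta p$ is estimated by minimizing
$$\min_{\Delta p} \; \| F_I(x + p) + \nabla F_I(x + p)\, \Delta p - F_T(x) \|_2^2,$$
where $\nabla F_I(x + p)$ is the gradient patch of $F_I$ centered at $x + p$. The derivative of this objective w.r.t. $\Delta p$ is then given by
$$2\, \nabla F_I(x + p)^\top \big( F_I(x + p) + \nabla F_I(x + p)\, \Delta p - F_T(x) \big);$$
setting this term to zero gives the least-squares solution of $\Delta p$ as
$$\Delta p = H^{-1}\, \nabla F_I(x + p)^\top \big( F_T(x) - F_I(x + p) \big),$$
where $H = \nabla F_I(x + p)^\top \nabla F_I(x + p)$ is the Hessian matrix of $F_I$ at $x + p$. $p$ is then updated by $p \leftarrow p + \Delta p$.
IC-LK Algorithm. Unlike the feature map gradients, which can be pre-computed, $H$ has to be recomputed at each iteration. To avoid this computation, the inverse-compositional method is applied to the LK algorithm by transforming the objective function to
$$\min_{\Delta p} \; \| F_I(x + p) - F_T(x + \Delta p) \|_2^2 \;\approx\; \min_{\Delta p} \; \| F_I(x + p) - F_T(x) - \nabla F_T(x)\, \Delta p \|_2^2,$$
which gives the least-squares solution of $\Delta p$ as
$$\Delta p = H_T^{-1}\, \nabla F_T(x)^\top \big( F_I(x + p) - F_T(x) \big),$$
where $H_T = \nabla F_T(x)^\top \nabla F_T(x)$ is the Hessian matrix of $F_T$ at $x$. Note that $H_T$ can now be pre-computed and is fixed throughout the optimization process. Then $p$ is updated by $p \leftarrow p - \Delta p$ at each iteration. After convergence, the forward-predicted point location in the input image is obtained as $x' = x + p$.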
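To make the inverse-compositional update concrete, here is a minimal translation-only sketch on a single-channel synthetic map. It is an illustration under stated assumptions, not the differentiable layer used in the paper: the function names are hypothetical, the "feature map" is a Gaussian blob, gradients come from central differences, and the patch size, iteration cap and tolerance are arbitrary.

```python
import math

def gaussian_image(cx, cy, size=32, sigma=4.0):
    """Synthetic single-channel 'feature map': one Gaussian blob at (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(size)] for y in range(size)]

def sample(img, x, y):
    """Bilinear interpolation at real-valued (x, y), clamped to the image."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.001)
    y = min(max(y, 0.0), h - 1.001)
    x0, y0 = int(x), int(y)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x0 + 1]
    bottom = (1 - ax) * img[y0 + 1][x0] + ax * img[y0 + 1][x0 + 1]
    return (1 - ay) * top + ay * bottom

def ic_lk_translation(template, image, x, y, half=5, iters=50, tol=1e-4):
    """Estimate the translation p such that image(x + p) matches template(x)."""
    offsets = [(dx, dy) for dy in range(-half, half + 1)
               for dx in range(-half, half + 1)]
    # Pre-computed once: template patch, its gradients, and the 2x2 Hessian.
    t_patch = [sample(template, x + dx, y + dy) for dx, dy in offsets]
    gx = [(sample(template, x + dx + 1, y + dy)
           - sample(template, x + dx - 1, y + dy)) / 2 for dx, dy in offsets]
    gy = [(sample(template, x + dx, y + dy + 1)
           - sample(template, x + dx, y + dy - 1)) / 2 for dx, dy in offsets]
    hxx = sum(a * a for a in gx)
    hxy = sum(a * b for a, b in zip(gx, gy))
    hyy = sum(b * b for b in gy)
    det = hxx * hyy - hxy * hxy
    px = py = 0.0
    for _ in range(iters):
        # Residual between the warped input patch and the fixed template patch.
        err = [sample(image, x + dx + px, y + dy + py) - t
               for (dx, dy), t in zip(offsets, t_patch)]
        bx = sum(g * e for g, e in zip(gx, err))
        by = sum(g * e for g, e in zip(gy, err))
        # Solve the 2x2 system H * dp = b in closed form.
        dpx = (hyy * bx - hxy * by) / det
        dpy = (hxx * by - hxy * bx) / det
        # Inverse composition of a pure translation: p <- p - dp.
        px, py = px - dpx, py - dpy
        if math.hypot(dpx, dpy) < tol:
            break
    return px, py

template = gaussian_image(16, 16)
image = gaussian_image(18, 15)       # same blob moved by (+2, -1)
shift = ic_lk_translation(template, image, 16, 16)
```

Only the residual is recomputed inside the loop; the template gradients and Hessian are built once, which is the saving the inverse-compositional form is meant to provide. On this toy pair the estimate should land close to the true shift of (+2, -1).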
3.2 Cycle IC-LK Network
Similarly, we apply the algorithm described in 3.1 in a backward pass to track the forward-predicted points back to the template image and obtain the backward-tracked points. The parameters of the CNN are then updated by minimizing the patch loss and the cycle loss, which are described in 3.3. Since the forward-predicted and backward-tracked points are functions of the two feature maps, this approach is end-to-end differentiable and we can propagate the error back to update the learnable parameters. Details of the derivation are given in 3.4.
3.3 Loss Function
Cycle Loss. The cycle loss is the implicit criterion of the tracker’s performance. As previously mentioned, our approach tries to learn the trainable parameters in an unsupervised manner by minimizing the cycle loss.
Patch Loss. Similar to the objective function of the IC-LK algorithm, a desirable representation of an image should approximately satisfy the photometric consistency assumption, i.e., the feature patch centered at the initial point in the template feature map and the feature patch centered at the forward-tracked point in the input feature map should be similar. We enforce this by introducing a patch loss that penalizes the difference between these two feature patches.
Overall Loss. The overall loss is a weighted combination of the cycle loss and the patch loss, with a scalar factor balancing the two loss terms.
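As a rough illustration of how the two terms combine, the sketch below assumes squared Euclidean distances for both losses and places the balancing factor on the patch term. Neither choice is confirmed by the text, and all function names are hypothetical.

```python
import math

def cycle_loss(initial, backward):
    """Mean squared Euclidean distance between initial and back-tracked points."""
    return sum(math.dist(p, q) ** 2 for p, q in zip(initial, backward)) / len(initial)

def patch_loss(template_patches, tracked_patches):
    """Mean squared difference between corresponding flattened feature patches."""
    total, count = 0.0, 0
    for t, f in zip(template_patches, tracked_patches):
        total += sum((a - b) ** 2 for a, b in zip(t, f))
        count += len(t)
    return total / count

def overall_loss(initial, backward, template_patches, tracked_patches, weight=1.0):
    """Cycle loss plus `weight` times patch loss (placement of the weight is assumed)."""
    return (cycle_loss(initial, backward)
            + weight * patch_loss(template_patches, tracked_patches))

initial = [(10.0, 10.0), (20.0, 5.0)]
backward = [(10.0, 11.0), (20.0, 5.0)]                  # first point drifted by one pixel
template_patches = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]   # toy flattened patches
tracked_patches = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
loss = overall_loss(initial, backward, template_patches, tracked_patches, weight=0.5)
```

Here the cycle term is 0.5 (one of two points is off by one pixel) and the patch term is 1/6 (one of six patch entries differs by one), so the weighted sum is 0.5 + 0.5/6.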
3.4 Full Derivation
The computation graph of Cycle IC-LK Network is shown in Figure 3.1. In this section, the full derivation of Cycle IC-LK network is given, following the computation flow as shown in the computation graph.
Starting from an initial point in , suppose p is updated times to get the final point , during which a series of temporary points and corresponding updates in warp parameters are obtained, then we have , where
in which is the Hessian matrix of at and is the warp parameters after the th update. Then we further have
Similarly, for the backward pass, suppose the p is updated times to convergence and are obtained, we can derive
where . Combining the two equations, we can derive the cycle loss as
To propagate the error back to the CNN, we need the gradients of the loss with respect to the two feature maps, which are obtained by summing the gradients of all the feature patches sampled at the corresponding locations. This requires computing the derivative of the cycle loss w.r.t. all feature patches, both the template patches and the non-template patches of the forward and backward passes, which is a complicated computation flow. In practice, back-propagation is done with the automatic differentiation functionality of PyTorch [1].
In addition, in 3.3 we introduced patch loss to enforce photometric consistency in the feature space, the derivative of which can be simply computed as
Finally, the gradients of the two feature maps are obtained by summing the gradients of all feature patches at the locations from which the patches are sampled. The feature map gradients are then back-propagated to the CNN to update the learnable parameters.
THUMOS 2015 Dataset. A large video dataset for action recognition, consisting of 13,320 trimmed videos for training, 2,104 untrimmed videos for validation and 5,613 untrimmed videos for testing. The resolution varies from to . No landmark annotation is provided.
300-VW. A video dataset for face alignment, consisting of 114 videos with each approximately one minute in length. 68 facial landmarks with semantic interpretation are annotated across all the videos.
5.1 Implementation Details
We use the first four convolutional layers of VGG-16 [19] with pre-trained weights as our feature extractor. To maintain the resolution of the input images without reducing the receptive field, we remove the pooling layers and increase the dilation of the convolutional layers accordingly. During training, we randomly sample 50 landmarks per image pair as the initial landmarks. We train the proposed cycle LK network for 20 epochs with a batch size of 1. The Adam optimizer is used with an initial learning rate of 0.0001, betas of 0.9 and 0.999, and a weight decay of 0.000001. Basic data augmentation, including scaling, horizontal flipping, and rotation, is applied to the input images. All models are implemented in PyTorch [1].
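The trade-off between pooling and dilation can be checked with a small receptive-field calculation. The layer lists below are assumptions for illustration: the standard VGG-16 prefix (two 3x3 convolutions, a 2x2 max-pool, two more 3x3 convolutions) and one plausible modification in which the pool is dropped and the second block uses dilation 2. The exact dilation settings are not stated in the text.

```python
def receptive_field(layers):
    """Return (receptive_field, output_stride) for a stack of 2-D layers.

    Each layer is a (kernel_size, stride, dilation) tuple.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump   # growth measured in input pixels
        jump *= stride                         # spacing of output samples
    return rf, jump

# First four conv layers of VGG-16, with the pooling layer between the blocks.
vgg_prefix = [(3, 1, 1), (3, 1, 1), (2, 2, 1), (3, 1, 1), (3, 1, 1)]
# Assumed modification: pooling removed, dilation 2 in the second block.
dilated_prefix = [(3, 1, 1), (3, 1, 1), (3, 1, 2), (3, 1, 2)]

print(receptive_field(vgg_prefix))      # (14, 2)
print(receptive_field(dilated_prefix))  # (13, 1)
```

Under these assumptions the dilated stack keeps nearly the same receptive field (13 versus 14 pixels) while producing features at full input resolution (stride 1 instead of 2), which is what per-pixel LK tracking needs.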
5.2 Evaluation Metrics
One-pass evaluation (OPE) is used to evaluate tracking performance. We run the tracker throughout a test sequence, initialized from the ground-truth position in the first frame. We then measure the average pixel distance error $e$ in every frame. Whenever tracking of a landmark fails in a frame, i.e., $e > \tau$, where $\tau$ is a threshold, we re-initialize this landmark from the ground truth at the failed frame. The overall performance of the tracker is reported as the mean error over all frames and the success rate (i.e., how frequently the tracker does not fail).
The error term $e$ can be defined differently depending on whether ground-truth landmarks exist. When evaluating on the THUMOS dataset, where we do not have ground-truth landmarks, we use the re-projection error,
$$e_{\text{reproj}} = \| x - x'' \|_2,$$
where $x$ and $x''$ are the initialization and the backward estimate, respectively. When evaluating on a dataset where we do have annotations, we can define the forward error,
$$e_{\text{fwd}} = \| x_{\text{gt}} - x' \|_2,$$
where $x_{\text{gt}}$ is the ground-truth location and $x'$ is the forward location estimated by the tracker. In addition, the success rate is defined as
$$\text{SR} = \frac{N_s}{N},$$
where $N_s$ is the total number of landmarks successfully tracked over all frames and $N$ is the total number of landmarks over all frames. In the following experiments, the threshold is specified in pixels.
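The mean error and the success rate can be computed with a few lines. This is a minimal sketch with hypothetical names and an arbitrary threshold of 2 pixels; it assumes the mean is taken over every landmark in every frame and does not model re-initialization after a failure. The reference points are the ground truth for the forward error, or the initial points for the re-projection error.

```python
import math

def evaluate(frames, threshold):
    """Return (mean landmark error, success rate) over a sequence.

    frames: list of (estimated_points, reference_points) pairs, one per frame.
    """
    errors = [math.dist(est, ref)
              for est_pts, ref_pts in frames
              for est, ref in zip(est_pts, ref_pts)]
    successes = sum(1 for e in errors if e <= threshold)
    return sum(errors) / len(errors), successes / len(errors)

frames = [
    ([(10.0, 10.0), (23.0, 9.0)], [(10.0, 10.0), (20.0, 5.0)]),  # errors 0 and 5
    ([(11.0, 10.0), (20.0, 5.0)], [(10.0, 10.0), (20.0, 5.0)]),  # errors 1 and 0
]
mean_error, success_rate = evaluate(frames, threshold=2.0)
```

For this toy sequence the four landmark errors are 0, 5, 1 and 0 pixels, giving a mean error of 1.5 pixels and a success rate of 0.75.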
5.3 Qualitative Results
5.4 Quantitative Results
The quantitative results on the THUMOS and 300-VW datasets are shown in Tables 1 and 2. Three metrics, the forward error, the re-projection error, and the success rate, are used for evaluation. On THUMOS, we only evaluate the re-projection error and the success rate, as the dataset has no landmark annotations. We evaluate on sample sequences from each dataset. The results show that the re-projection error (i.e., the difference between the backward-tracked points and the initialization) is substantially lower than that of the Lucas-Kanade algorithm. This demonstrates that the cycle loss helps the network learn transformed features on which bi-directional tracking can be performed. In other words, we obtain consistent results whether tracking from one frame to the next or in the reverse direction. Also, we observe that the forward error of CyLKs is lower than that of Lucas-Kanade by a large margin in most sequences. This demonstrates that the patch loss does enforce the tracked patch in the forward pass to be close to the template patch. In brief, we show that, without using any manually labeled annotations, we can achieve a significant improvement in both forward tracking and re-projection.
6 Future Work
In this work, we propose a Cycle Lucas-Kanade (CyLKs) Network to achieve landmark tracking in an unsupervised way. Using the proposed cycle loss and patch loss, we achieve a significant improvement on landmark tracking without using any manual labels. Given the success of CyLKs, it would be interesting to investigate how the learned model can improve existing LK-based tracking systems, for example by using an off-the-shelf CyLKs model to extract features for LK-based trackers instead of using raw images. Another interesting direction is to extend the work from landmark tracking to general object tracking by replacing the current 2-DoF (degrees of freedom) translation parameters with a 6-DoF affine transformation.
- [1] PyTorch: Tensors and dynamic neural networks in Python with strong GPU acceleration. https://github.com/pytorch/pytorch.
- [2] H. Alismail, B. Browning, and S. Lucey. Photometric bundle adjustment for vision-based SLAM. Asian Conference on Computer Vision, 2016.
- [3] H. Alismail, B. Browning, and S. Lucey. Robust tracking in low light and sudden illumination changes. 3D Vision (3DV), 2016 Fourth International Conference on, 2016.
- [4] E. Antonakos, J. Alabort-i Medina, G. Tzimiropoulos, and S. P. Zafeiriou. Feature-based Lucas-Kanade and active appearance models. IEEE Transactions on Image Processing, 2015.
- [5] S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision, 2004.
- [6] H. Bristow and S. Lucey. In defense of gradient-based alignment on densely sampled sparse features. Dense Image Correspondences for Computer Vision, 2016.
- [7] C.-N. Chang, C.-N. Chou, and E. Chang. CLKN: Cascaded Lucas-Kanade Networks for Image Alignment. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [8] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. ECO: Efficient Convolution Operators for Tracking. arXiv, 2016.
- [9] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. arXiv preprint arXiv:1608.03773, 2016.
- [10] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. European Conference on Computer Vision, 2014.
- [11] H. Fan and H. Ling. Parallel Tracking and Verifying: A Framework for Real-Time and High Accuracy Visual Tracking. arXiv, 2017.
- [12] C. Feichtenhofer, A. Pinz, and A. Zisserman. Detect to Track and Track to Detect. Proceedings of the IEEE International Conference on Computer Vision, 2017.
- [13] C. Forster, M. Pizzoli, and D. Scaramuzza. SVO: Fast semi-direct monocular visual odometry. Robotics and Automation (ICRA), 2014 IEEE International Conference on, 2014.
- [14] D. Held, S. Thrun, and S. Savarese. Learning to track at 100 fps with deep regression networks. European Conference on Computer Vision, 2016.
- [15] C.-H. Lin and S. Lucey. Inverse Compositional Spatial Transformer Networks. arXiv, 2016.
- [16] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004.
- [17] H. Nam and B. Han. Learning Multi-Domain Convolutional Neural Networks for Visual Tracking. arXiv preprint arXiv:1510.07945, 2015.
- [18] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. Computer Vision (ICCV), 2011 IEEE International Conference on, 2011.
- [19] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ImageNet Challenge, 2014.
- [20] J. Valmadre, L. Bertinetto, J. F. Henriques, A. Vedaldi, and P. H. S. Torr. End-to-end representation learning for Correlation Filter based tracking. arXiv, 2017.
- [21] C. Wang, H. K. Galoogahi, C.-H. Lin, and S. Lucey. Deep-LK for Efficient Adaptive Object Tracking. arXiv, 2017.
- [22] L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. Proceedings of the IEEE International Conference on Computer Vision, 2016.
- [23] L. Č. Zajc, A. Lukežič, A. Leonardis, and M. Kristan. Beyond standard benchmarks: Parameterizing performance evaluation in visual object tracking. arXiv, 2016.