Content-Aware Unsupervised Deep Homography Estimation

Jirong Zhang1,2    Chuan Wang2    Shuaicheng Liu1,2    Lanpeng Jia2    Jue Wang2    Ji Zhou1
1University of Electronic Science and Technology of China
2Megvii Technology
Joint Corresponding Author

Robust homography estimation between two images is a fundamental task which has been widely applied to various vision applications. Traditional feature-based methods detect image features and fit a homography to the matched features with RANSAC outlier removal. However, the quality of the homography heavily relies on the quality of the image features, which are prone to errors in low-light and low-texture images. On the other hand, previous deep homography approaches either synthesize images for supervised learning or adopt aerial images for unsupervised learning, both ignoring the importance of depth disparities in homography estimation. Moreover, they treat the image content equally, including regions of dynamic objects and near-range foregrounds, which further decreases the quality of estimation. In this work, to overcome such problems, we propose an unsupervised deep homography method with a new architecture design. We learn a mask during the estimation to reject outlier regions. In addition, we calculate the loss with respect to our learned deep features instead of directly comparing the image contents as done previously. Moreover, a comprehensive dataset is presented, covering both regular and challenging cases, such as poor textures and non-planar interferences. The effectiveness of our method is validated through comparisons with both feature-based and previous deep-based methods. Code will soon be available on GitHub.

1 Introduction

A homography is the fundamental image alignment model that is widely used in computer vision [13]. It can align images taken from different perspectives, as long as the camera undergoes a pure rotational motion or the scene is close to a planar surface. It has been widely used in vision applications such as image mosaicing [5], monocular SLAM [27], Co-SLAM [45], image stitching [4], video stitching [12], augmented reality [33], and camera calibration [44].

Figure 1: Our deep homography estimation on challenging cases. Yellow: traditional feature-based method. Blue: ours. Red: ground-truth. (a) An easy example with rich textures and a flat scene. (b) An example with a dominant moving foreground. (c) A low-texture example. (d) A low-light example.

Homography alignment holds a pivotal position among image registration techniques, such as content-preserving warps (CPW) that adopt mesh warps to compensate for non-planar motions [21], and optical flow for dynamic objects or depth discontinuities [17]. Even in such difficult cases, it is quite helpful to align images by a homography first, before applying more advanced models. Moreover, scenes whose objects are far from the viewer can be regarded as planar surfaces, and are thus well served by a single homography model.

Traditional homography estimation approaches can be divided into two categories: one follows the Lucas-Kanade algorithm [2], and the other relies on matched image feature points [24]. In feature-based methods, a set of feature correspondences is obtained, and the homography is estimated using the DLT method [13] with RANSAC outlier rejection [10]. Compared with direct methods, feature-based methods have achieved better performance. However, their quality highly relies on the quality of the image features. Estimation can be inaccurate when the number of matched points is insufficient or when the distribution of features is poor. High-quality features should be evenly distributed and cover the entire image. However, this is challenging due to the existence of textureless regions (e.g., blue sky and white walls), repetitive patterns and illumination variations.
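The DLT-plus-RANSAC pipeline above can be sketched in a few lines of NumPy. This is an illustrative toy, not a reference implementation: the function names, the iteration count and the inlier threshold are our own choices.

```python
import numpy as np

def dlt_homography(src, dst):
    """Fit a 3x3 homography H mapping src -> dst (N >= 4 matches,
    given as (N, 2) arrays) with the Direct Linear Transform."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # H is the null vector of A, i.e. the last right-singular vector.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def ransac_homography(src, dst, iters=500, thresh=3.0, rng=None):
    """Toy RANSAC loop: sample 4 matches, fit a model, and keep the one
    with the largest inlier count under a reprojection-error threshold."""
    rng = np.random.default_rng(rng)
    n = len(src)
    best_H, best_inliers = None, -1
    for _ in range(iters):
        idx = rng.choice(n, 4, replace=False)
        H = dlt_homography(src[idx], dst[idx])
        # Project all source points and measure the reprojection error.
        pts = np.hstack([src, np.ones((n, 1))]) @ H.T
        proj = pts[:, :2] / pts[:, 2:3]
        err = np.linalg.norm(proj - dst, axis=1)
        inliers = int((err < thresh).sum())
        if inliers > best_inliers:
            best_H, best_inliers = H, inliers
    return best_H
```

A production pipeline would additionally normalize the point coordinates before the DLT and refit on all inliers at the end.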

Figure 1 shows some examples that compare our deep homography with the traditional SIFT-plus-RANSAC method. For images with rich textures (Figure 1(a)), our method performs equally well as the feature-based method. However, for images suffering from textureless regions (Figure 1(c)), only a limited number of feature points can be extracted, leading to problematic homography fitting. In contrast, our deep solution is more robust in this situation. Moreover, it is challenging for RANSAC to behave correctly for scenes containing a large moving foreground (Figure 1(b)) or two dominant planes (Figure 1(d)). The failure of RANSAC leads to the failure of the homography. Our deep solution is free from such problems.

DeTone et al. proposed the first deep homography approach [7]. The method takes two images as input and produces a homography from the source image to the target image. It requires a ground-truth homography to supervise the training. Therefore, a random homography is applied to the source image to generate the target image, forming the training pair. However, training data generated by homography warping cannot reflect real depth disparities. As such, the performance of DeTone et al. on real images is unsatisfactory. To solve this problem, Nguyen et al. [28] proposed an unsupervised approach which minimizes a photometric loss on real image pairs. However, there are two problems. First, a loss calculated on intensities is less effective than one computed in feature space. Second, all image regions are considered equally, ignoring the effect of 'RANSAC'. Some image regions, such as moving objects or non-planar objects, should be excluded during the loss calculation; without this, the outlier regions decrease the estimation accuracy. Therefore, Nguyen et al. had to work on aerial images, whose content is far away from the camera, to minimize the influence of depth variations and parallax.

In this work, we propose an unsupervised approach with a new architecture for content-awareness learning. In particular, we learn a content mask to reject outlier regions, mimicking the traditional RANSAC procedure. To realize this, we introduce a novel triple loss for effective optimization. Moreover, instead of comparing intensity values directly, we calculate the loss with respect to our learned deep features, which is more effective. In addition, we introduce a comprehensive homography dataset, in which the testing set contains manually labeled ground-truth point matches for the purpose of quantitative comparison. The dataset consists of 5 categories of scenes: regular, low-texture, low-light, small-foreground, and large-foreground. We show the advantages of our method over both traditional feature-based approaches and previous deep-based solutions. In summary, our main contributions are:

  • A new unsupervised network structure that enables content-aware robust homography estimation from two images.

  • A triple loss designed for training the network, so that a deep feature map for alignment and a mask highlighting the alignment inliers could be learned.

  • A comprehensive dataset covering various scenes for both training and testing.

2 Related Work

Traditional homography.

A homography is a matrix that compensates for plane motions between two images. It consists of 8 degrees of freedom, with 2 for scale, 2 for translation, 2 for rotation and 2 for perspective [13]. To solve for a homography, traditional approaches often detect and match image features, e.g., SIFT [24], SURF [3] and ORB [32]. Two sets of correspondences are established between the two images, after which robust estimation, such as RANSAC [10] and IRLS [16], is adopted for outlier rejection during model estimation.

A homography can also be solved directly without image features. Direct methods, such as the seminal Lucas-Kanade algorithm [25], compute the sum of squared differences (SSD) between pixels of the two images. The differences guide the shift and warp of the images, yielding homography updates. A randomly initialized homography is optimized in this way iteratively [2]. Moreover, the SSD can be replaced by the enhanced correlation coefficient (ECC) for robustness [9].
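To illustrate the direct, intensity-based idea, the following toy NumPy sketch searches integer translations by SSD. Real Lucas-Kanade iterates an analytic update over a full warp model, so this exhaustive search is only a simplified stand-in with illustrative names.

```python
import numpy as np

def ssd_translate(src, tgt, max_shift=4):
    """Exhaustively search the integer translation (dy, dx) minimizing the
    mean SSD between the overlapping regions of src and tgt. A toy stand-in
    for the iterative Lucas-Kanade update restricted to pure translation."""
    h, w = src.shape
    best, best_ssd = (0, 0), np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # Crop both images to their common overlap under shift (dy, dx).
            a = src[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
            b = tgt[max(0, -dy):h + min(0, -dy), max(0, -dx):w + min(0, -dx)]
            ssd = np.mean((a - b) ** 2)
            if ssd < best_ssd:
                best_ssd, best = ssd, (dy, dx)
    return best
```

The mean (rather than sum) normalizes for the varying overlap area, so larger shifts are not unfairly penalized.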

Deep homography.

Following the success of various deep image alignment methods, such as optical flow [39, 17], dense matching [31] and deep features [1], deep homography estimation was first proposed by [7] in 2016. The network takes the source and target images as input and outputs displacement vectors at the four corners of the source image, which then yield the homography. It used a ground-truth homography to supervise the training. However, the training images with GT homographies are generated without depth disparity. To overcome this issue, [28] proposed an unsupervised approach that computes a photometric loss between the two images and adopts a Spatial Transformer Network (STN) [18] for image warping.

Figure 2: The overall structure of our deep homography estimation network (a) and the triple loss we design to train the network (b). In (a), two input patches I_a and I_b are fed into two branches, each consisting of a feature extractor and a mask predictor, generating features F_a, F_b and masks M_a, M_b. The features and masks are then fed into a homography estimator to produce the 8 values that compose the homography matrix H_ab. To train the network in (a), we design a triple loss, i.e. minimizing the sum of distances between the warped feature maps and their target counterparts, and maximizing the feature distance before warping.

Image stitching.

Our approach is also related to image stitching methods. These are traditional methods that target registering images under large disparities [42] for the purpose of constructing panoramas [4]. The stitched images are captured under dramatic viewpoint differences. Various methods have been proposed along this direction, such as Dual-Homography for scenes containing two dominant planes [11], As-Projective-As-Possible (APAP) [41] and MeshFlow [23] for non-rigid mesh motion compensation, Direct Photometric Alignment (DPA) for low-textured images [20], and Shape-Preserving Half-Projective (SPHP) warps for shape rigidity in non-overlapping regions [6]. In this work, we do not target image stitching examples. We focus on images with sufficient overlap, moderate viewpoint differences and reasonable depth disparities, which can be aligned within the capability of a single homography.

3 Algorithm

3.1 Network Structure

Our method is built upon convolutional neural networks. It takes two grayscale image patches I_a and I_b as input, and produces a homography matrix H_ab from I_a to I_b as output. The entire structure can be divided into three modules: a feature extractor f(·), a mask predictor m(·) and a homography estimator h(·). f(·) and m(·) are fully convolutional networks which accept inputs of arbitrary size, while h(·) utilizes a ResNet-34 backbone [14] and produces 8 values. Figure 2(a) illustrates the network structure.

Feature extractor.

Unlike previous DNN-based methods that directly utilize the pixel values as the feature, our network automatically learns a feature from the input for robust feature alignment. To this end, we build an FCN that takes an input of size H × W × 1 and produces a feature map of the same resolution. For inputs I_a and I_b, the feature extractor shares weights and produces feature maps F_a and F_b, i.e.

F_a = f(I_a),   F_b = f(I_b).   (1)

Mask predictor.

In non-planar scenes, especially those including moving objects, no single homography can align the two views. In traditional algorithms, RANSAC is widely applied to find the inliers for homography estimation, so as to solve for the matrix that best approximates the scene alignment. Following a similar idea, we build a sub-network m(·) to automatically learn the inliers' positions. Specifically, it learns to produce an inlier probability map, or mask, highlighting the content in the feature maps that contributes most to the homography estimation:

M_a = m(I_a),   M_b = m(I_b).   (2)

The size of the mask is the same as that of the feature map. With the masks, we further weight the features extracted by f(·) before feeding them to the homography estimator, obtaining two weighted feature maps G_a and G_b as

G_a = F_a ⊙ M_a,   G_b = F_b ⊙ M_b,   (3)

where ⊙ denotes element-wise multiplication.

Homography estimator.

Given the weighted feature maps G_a and G_b, we concatenate them to build a feature map of size H × W × 2. It is then fed to the homography estimator network, which produces four 2D offset vectors (8 values). With the offset vectors, it is straightforward to obtain the homography matrix, with its 8 degrees of freedom, by solving a linear system. We use h(·) to represent the whole process, i.e.

H_ab = h(G_a, G_b).   (4)

The backbone of the homography estimator network follows the ResNet-34 structure. It contains 34 layers of strided convolutions followed by an adaptive pooling layer, which generates a feature vector of fixed length (8 in our case) regardless of the input feature dimensions.
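Recovering the homography from the four predicted corner offsets amounts to solving an 8×8 linear system. A minimal NumPy sketch, with illustrative names and the usual convention of fixing the bottom-right entry of H to 1, is:

```python
import numpy as np

def corners_to_homography(corners, offsets):
    """Solve for the 3x3 homography (bottom-right entry fixed to 1) that
    moves the four patch corners by the predicted 2D offsets, by stacking
    two linear equations per corner into an 8x8 system."""
    A, b = [], []
    for (x, y), (du, dv) in zip(corners, offsets):
        u, v = x + du, y + dv  # displaced corner position
        # u = (h1 x + h2 y + h3) / (h7 x + h8 y + 1), similarly for v.
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.asarray(A, dtype=float), np.asarray(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)
```

Since the four corners of a patch are never collinear, the system is well-conditioned for reasonable offsets.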

We list the layer details of the three modules above in Table 1. Note that we use a sigmoid at the last layer of the mask predictor to ensure the output values range from 0 to 1.

3.2 Triple Loss for Robust Homography Estimation

With the homography matrix H_ab estimated, we warp image I_a to I_a' and then further extract its feature map as F_a' = f(I_a'). Intuitively, if the homography matrix is accurate enough, F_a' should be well aligned with F_b, causing a low loss between them. Considering that in real scenes a single homography matrix cannot fully satisfy the transformation between the two views, we also normalize the loss by the masks M_a' and M_b. Here M_a' is the warped version of M_a. So the loss between the warped I_a' and I_b is as follows,

L_n(I_a', I_b) = Σ_i M_a'(i) M_b(i) · ||F_a'(i) − F_b(i)||_1 / Σ_i M_a'(i) M_b(i),   (5)

where i indicates a pixel location in the masks and feature maps. Here we utilize a spatial transformer network [18] to achieve the warping operation.
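The normalized, mask-weighted distance of Eq. 5 can be sketched as follows for single-channel feature maps. This is a simplified NumPy illustration of the loss term, not our training code.

```python
import numpy as np

def masked_feature_loss(f_warp, f_b, m_warp, m_b, eps=1e-8):
    """Normalized, mask-weighted L1 distance between a warped feature map
    f_warp and a target feature map f_b (all arrays of shape H x W). The
    product of the two masks downweights pixels that either view marks as
    outliers; the denominator normalizes by the total mask weight."""
    w = m_warp * m_b
    return float(np.sum(w * np.abs(f_warp - f_b)) / (np.sum(w) + eps))
```

A region that either mask assigns zero weight contributes nothing to the loss, which is exactly how outlier content is excluded from the estimation.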

Directly minimizing Eq. 5 may easily lead to trivial solutions, where the feature extractor produces all-zero maps, i.e. F_a = F_b = 0. In this case, the learned features indeed describe the fact that I_a' and I_b are well aligned, but they fail to reflect the fact that the original images I_a and I_b are mis-aligned. To address this, we involve another loss between I_a and I_b, i.e.

L(I_a, I_b) = ||F_a − F_b||_1,   (6)

and further maximize it when minimizing Eq. 5. This strategy avoids the trivial solutions, and enables the network to learn a discriminative feature map for image alignment.

In practice, we swap the roles of I_a and I_b and produce another homography matrix H_ba. Following Eq. 5, we involve a loss L_n(I_b', I_a) between the warped I_b' and I_a. We also add a constraint that enforces H_ab and H_ba to be inverse to each other. So, the optimization of the network can be written as follows,

min  L_n(I_a', I_b) + L_n(I_b', I_a) − λ L(I_a, I_b) + μ ||H_ab H_ba − I||_2^2,   (7)

where λ and μ are balancing hyper-parameters, and I is the 3 × 3 identity matrix. We set λ and μ empirically in our experiments. We illustrate the loss formulations in Figure 2(b).

Figure 3: Our predicted masks for various scenes. (a) contains complex foreground motions. (b) and (c) contain large dynamic foregrounds. (d) contains few textures and (e) is a night example.
(a) Feature extractor
Layer No. Type Kernel Stride Channel
1 conv 3 1 4
2 conv 3 1 8
3 conv 3 1 1
(b) Mask predictor
Layer No. Type Kernel Stride Channel
1 conv 3 1 4
2 conv 3 1 8
3 conv 3 1 16
4 conv 3 1 32
5 conv 3 1 1
(c) Homography estimator
Layer No. Type Kernel Stride Channel
1 conv 7 2 64
2 max pool 3 2 -
3–8 conv 3 1 64
9 conv 3 2 128
10–16 conv 3 1 128
17 conv 3 2 256
18–28 conv 3 1 256
29 conv 3 2 512
30–34 conv 3 1 512
35 adapt pool - 1 -
36 fc - - 8
Table 1: Network architecture of feature extractor (a), mask predictor (b) and homography estimator (c).

3.3 Unsupervised Content-Awareness Learning

As mentioned above, our network contains a sub-network to predict an inlier probability map, or mask. It is designed so that our network achieves content-awareness through two effects. First, we use the masks to explicitly weight the features, so that only the highlighted features are fully fed into the homography estimator h(·). Meanwhile, the masks are also implicitly involved in the normalized distance between each warped feature map and its original counterpart, meaning that only those regions truly suitable for alignment are taken into account. Areas containing low texture or moving foregrounds are non-distinguishable or misleading for alignment, so they are naturally excluded from the homography estimation when optimizing the proposed triple loss. Such content-awareness is achieved by a fully unsupervised learning scheme, without any ground-truth mask data as supervision.

To demonstrate the effectiveness of the mask, we illustrate several examples in Figure 3. In Figure 3(a)-(c), where the scenes contain dynamic objects, our network successfully rejects moving objects, even when the movements are inapparent, as with the fountain in (c), or when the objects occupy a large space, as in (a) and (b). These cases make it very difficult for RANSAC to find robust inliers. The most challenging case is Figure 3(a), in which the moving foregrounds are complex, including both people and the train. Our method successfully locates the useful background for the homography estimation. Figure 3(d) is a low-texture example, in which the blue sky occupies half of the image. It is challenging for traditional methods because the sky provides no features and the sea causes matching ambiguities. Our predicted mask concentrates on the horizon, with only sparse weights on the sea waves. Figure 3(e) is a low-light example, where only the visible areas receive weights. We also conduct an ablation study to reveal the influence of disabling the mask prediction. As seen in Table 2, the accuracy decreases significantly when the mask is removed.

4 Experimental Results

4.1 Dataset and Implementation Details

Figure 4: A glance at our dataset. (a) regular examples (RE). (b) examples with low textures (LT). (c) low-light examples (LL). (d) examples with foregrounds of small sizes (SF). (e) examples of large foregrounds (LF).
Figure 5: Comparison with existing approaches. Supervised [7], Unsupervised [28], Ours and Ground-truth are shown by green, yellow, blue and red rectangles, respectively. (a) A single-frame synthesized example. (b) Real consecutive frames of (a). (c) An example with a dominant plane. (d) A flash/no-flash example with illumination differences between the two frames caused by the camera flash. (e) An example with near-range foreground at the corners. (f) An example with two dominant planes plus a large moving foreground. (g) An example with poor textures. (h) A low-light example. Please refer to the webpage for more examples, where we toggle the images with GIF animations for clearer illustration.

Previously, there was no dedicated dataset designed to evaluate the performance of homography fitting. The supervised method [7] synthesized homographies from single images, so it cannot reflect disparities and occlusions. The unsupervised method [28] adopted aerial images, which lack generalization. Therefore, we propose our own dataset for comprehensive evaluation.

Our dataset contains 5 categories: regular (RE), low-texture (LT), low-light (LL), small-foreground (SF), and large-foreground (LF) image pairs. Each category contains around 80 image pairs, for roughly 400 image pairs in total. Figure 4 shows some examples. Specifically, we collect these images from 291 video clips, each of which lasts 15–20 seconds. For each frame, we randomly sample 5 frames from its 8 consecutive later frames. For the testing data, we randomly choose 100 image pairs from all categories. For each pair, we manually marked 6–8 matching points for the purpose of quantitative comparison. The marked points are evenly distributed over the image.

The category partition is based on an understanding of traditional homography registration. For regular examples (Figure 4(a)), image features can be extracted easily due to rich textures, and the scene is flat, which is friendly to a homography. For low-texture and low-light examples (Figure 4(b) and (c)), only a few image features can be extracted, which causes trouble for traditional homography fitting. For scenes containing foregrounds or dynamic objects (Figure 4(d)), the scene is no longer a single plane. In such cases, the best-fitting homography aligns the most dominant planar structure of the scene, with other non-planar objects excluded. This can be achieved by RANSAC outlier rejection in traditional methods, but it may cause trouble for the two previous deep methods [7, 28], which treat the image content equally. The most challenging case is a scene with a large foreground (Figure 4(e)), which even RANSAC cannot handle easily. We show in the subsequent experiments that our method is robust over all categories.

Our network is trained with an Adam optimizer [19], and the learning rate is decayed by a fixed factor at regular intervals. The detailed network configuration of the feature extractor, mask predictor and homography estimator is summarized in Table 1. The implementation is based on PyTorch and the network is trained on an NVIDIA RTX 2080 Ti. To augment the training data and avoid black boundaries appearing in the warped image, we randomly crop patches from the original images to form I_a and I_b.
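The paired random-crop augmentation can be sketched as below; the function name and seeding convention are illustrative, not our exact data loader.

```python
import numpy as np

def random_paired_crop(img_a, img_b, size, rng=None):
    """Crop the same randomly-placed window from both frames, so the pair
    stays registered and no black warp borders enter the training patch."""
    rng = np.random.default_rng(rng)
    h, w = img_a.shape[:2]
    ch, cw = size
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    return img_a[y:y + ch, x:x + cw], img_b[y:y + ch, x:x + cw]
```

Because both patches come from the same window, the ground-truth alignment between them is unchanged by the augmentation.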

4.2 Comparisons with Existing Methods

Qualitative comparison.

We first compare our method with the two existing deep homography approaches, i.e. the supervised [7] and the unsupervised [28] one. Figure 5 shows the qualitative comparison.

In Figure 5(a), we synthesize an example from a single frame, so no disparities are introduced. In this case, all methods perform equally well, as indicated by the coincident rectangles. However, when we test consecutive frames of the same footage (Figure 5(b)), the supervised approach fails, because it cannot handle large disparities or moving objects in the scene. Note that, except for Figure 5(a), all examples come from real, different frames. In Figure 5(c), the building surface is a plane, and all methods work well. Interestingly, this footage contains camera flashes, which cause illumination variations across images. We choose the flash and no-flash images for alignment (Figure 5(d)). All three methods drift from the ground-truth rectangle, but ours is the closest, indicating that our method is robust to a certain amount of illumination change. Figure 5(e) contains near-range objects at the corners, and Figure 5(f) contains two dominant planes with moving objects at the corners. Figure 5(g) is a low-texture example and Figure 5(h) is a low-light example. In all of these scenarios, our method is the best among the candidates.

Figure 6: Comparison with existing methods (left: with previous DNN based methods, right: with feature-based methods).

Quantitative comparison.

Beyond qualitative comparisons, we also verify the effectiveness of our method by comparing with the two deep methods quantitatively on our dataset. In particular, the testing set for each category contains ground-truth labels: for each pair of images, we manually marked 6 to 8 correspondences. We use the estimated homography to transform the source points and measure their distances to the target points. The averaged distance is recorded as the evaluation metric, as in [38, 37, 8, 22, 26]. We report the performance for each category as well as the overall averaged scores in Figure 6 left. We use an identity matrix as a special reference homography. As seen, our method outperforms the others in all categories.
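The evaluation metric, i.e. the average distance between the homography-transformed source points and the marked ground-truth points, can be sketched as:

```python
import numpy as np

def point_transfer_error(H, src_pts, gt_pts):
    """Average L2 distance between homography-transformed source points
    and their manually marked ground-truth targets. src_pts and gt_pts
    are (N, 2) arrays; H is a 3x3 homography matrix."""
    pts = np.hstack([src_pts, np.ones((len(src_pts), 1))]) @ H.T
    proj = pts[:, :2] / pts[:, 2:3]  # back to inhomogeneous coordinates
    return float(np.mean(np.linalg.norm(proj - gt_pts, axis=1)))
```

The identity matrix serves as the reference: its score measures how far apart the two frames are before any alignment.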

Specifically, in Figure 6 left, our method performs best on the regular (RE) category compared with the other categories, as the difficulty of RE is the smallest. Compared with the supervised method, large improvements are achieved on the low-texture (LT) and large-foreground (LF) categories, which indicates that the selection of content is crucial for these two categories. In the low-texture class, only a small area is informative and it must be accurately identified. In the large-foreground class, whether the foreground is static or dynamic, it confuses the estimation: if the foreground is static, its closeness to the camera magnifies the disparity issue; if it is dynamic, it violates the camera motion. In both cases, the foreground must be rejected correctly to estimate an accurate homography on the background. Note that the observations above also apply to the low-light and small-foreground categories.

We further compare our method with traditional feature-based methods; the results are shown in Figure 6 right. We evaluate three popular features: SIFT [24], SURF [3] and ORB [32]. For the regular class (RE), rich texture delivers sufficient high-quality features, and both SIFT and SURF are only slightly better than ours, with SURF achieving the best performance on this category. For all other categories, our method significantly outperforms the others.

Note that, for the LT, LL and LF categories, the traditional feature-based methods frequently fail badly. The failures have various causes, such as a limited number of detected features with poor distributions, or the failure of RANSAC due to inlier features located on both the foreground and the background. Figure 7 shows some examples. This type of failure often leads to huge errors. Therefore, for examples that fail completely, we fall back to the reference scores (Figure 6 left) to produce reasonable values for a relatively fair comparison. Such total failures occur most frequently in the LT, LL and LF categories.

Figure 7: The failure examples of feature-based method. Red: ground-truth. Blue: ours. Yellow: SIFT. We also show the failure image caused by the incorrect homography of the feature-based method on the right column.

4.3 Ablation Studies

Content-aware mask.

Content-awareness is the most important feature of our network. Therefore, we compare the performance with and without the mask. Table 2, 'w/o. Mask', shows the results: the results with the mask are consistently better. It is clear that the mask plays an important role in deep homography, much like RANSAC does for feature-based methods. The mask not only rejects dynamic regions, but also selects reliable areas for deep homography estimation [40, 35, 30].

Triple loss.

We examine the effectiveness of our triple loss by removing the term of Eq. 6 from Eq. 7. Table 2, 'w/o. Triple loss', shows the result. It is clear that the triple loss not only avoids trivial solutions, but also facilitates better optimization.


Feature backbone.

The feature backbone is another important aspect that should be studied. Here, we examine several popular backbones, including VGG [34], ResNet-18, ResNet-34 [14], and ShuffleNet-v2 [43]. As seen in Table 2, ResNet-18 achieves performance similar to our best result, obtained with ResNet-34. The VGG backbone is slightly worse than ResNet-18 and ResNet-34. Interestingly, the light-weight ShuffleNet-v2 backbone performs on par with the larger backbones, which indicates that our method can be deployed on portable or embedded systems, facilitating a wide application scope.

Training strategy.

We adopt a separate training strategy to train our network, as in [36, 15, 29]: at the very beginning, we train the feature extractor only, without the mask predictor involved, by setting the mask to all ones. Once stable features have been learned by the extractor, we finetune the network with the mask predictor involved. To validate the effectiveness of this training strategy, we also ran an experiment training the feature extractor and mask predictor simultaneously, both from scratch. Table 2 compares the two training strategies; as reported, our separate training yields better performance.

Method RE LT LL SF LF Avg
w/o. Mask 1.16 1.85 1.45 1.38 1.25 1.42
w/o. Triple loss 1.37 2.78 2.20 1.53 1.98 1.97
VGG 1.08 1.90 1.49 1.36 1.38 1.44
ResNet-18 1.10 1.54 1.54 1.17 1.24 1.32
ShuffleNet-v2 1.18 1.55 1.40 1.29 1.26 1.34
Train from scratch 1.10 1.73 1.45 1.20 1.34 1.36
Ours(ResNet-34) 1.02 1.48 1.29 1.15 1.20 1.23
Table 2: Ablation studies on the mask, triple loss, training strategy and network backbones. Each entry is the average distance between transformed points and the marked ground-truth points.

5 Conclusions

We have presented a new architecture for deep homography estimation with content-aware capability. Traditional feature-based methods heavily rely on the quality of image features, which are vulnerable to low-texture and low-light scenes; large foregrounds also cause trouble for RANSAC outlier removal. Previous DNN-based methods pay less attention to the depth disparity issue: they treat the image content equally, and can thus be influenced by non-planar structures or dynamic objects. Our network learns a mask during estimation to reject outlier regions for robust homography estimation. In addition, we calculate the loss with respect to our learned deep features instead of directly comparing image intensities. Moreover, we have provided a comprehensive homography dataset, divided into 5 categories (regular, low-texture, low-light, small-foreground, and large-foreground) to evaluate performance under different aspects. The comparisons with previous methods show the effectiveness of our method.


  • [1] H. Altwaijry, A. Veit, S. J. Belongie, and C. Tech (2016) Learning to detect and match keypoints with deep architectures. In Proc. BMVC, Cited by: §2.
  • [2] S. Baker and I. Matthews (2004) Lucas-kanade 20 years on: a unifying framework. International journal of computer vision 56 (3), pp. 221–255. Cited by: §1, §2.
  • [3] H. Bay, T. Tuytelaars, and L. Van Gool (2006) SURF: speeded up robust features. In Proc. ECCV, pp. 404–417. Cited by: §2, §4.2.
  • [4] M. Brown, D. G. Lowe, et al. (2003) Recognising panoramas. In Proc. ICCV, Vol. 3, pp. 1218. Cited by: §1, §2.
  • [5] D. Capel (2004) Image mosaicing. In Image Mosaicing and super-resolution, pp. 47–79. Cited by: §1.
  • [6] C. Chang, Y. Sato, and Y. Chuang (2014) Shape-preserving half-projective warps for image stitching. In Proc. CVPR, pp. 3254–3261. Cited by: §2.
  • [7] D. DeTone, T. Malisiewicz, and A. Rabinovich (2016) Deep image homography estimation. arXiv preprint arXiv:1606.03798. Cited by: §1, §2, Figure 5, §4.1, §4.1, §4.2.
  • [8] Y. Ding, C. Wang, H. Huang, J. Liu, J. Wang, and L. Wang (2019) Frame-recurrent video inpainting by robust optical flow inference. arXiv preprint arXiv:1905.02882. Cited by: §4.2.
  • [9] G. D. Evangelidis and E. Z. Psarakis (2008) Parametric image alignment using enhanced correlation coefficient maximization. IEEE Trans. on Pattern Analysis and Machine Intelligence 30 (10), pp. 1858–1865. Cited by: §2.
  • [10] M. A. Fischler and R. C. Bolles (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395. Cited by: §1.
  • [11] J. Gao, S. J. Kim, and M. S. Brown (2011) Constructing image panoramas using dual-homography warping. In Proc. CVPR, pp. 49–56. Cited by: §2.
  • [12] H. Guo, S. Liu, T. He, S. Zhu, B. Zeng, and M. Gabbouj (2016) Joint video stitching and stabilization from moving cameras. IEEE Trans. on Image Processing 25 (11), pp. 5491–5503. Cited by: §1.
  • [13] R. Hartley and A. Zisserman (2003) Multiple view geometry in computer vision. Cambridge university press. Cited by: §1, §1, §2.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. CVPR, pp. 770–778. Cited by: §3.1, §4.3.
  • [15] Y. He, J. Shi, C. Wang, H. Huang, J. Liu, G. Li, R. Liu, and J. Wang (2019) Semi-supervised skin detection by network with mutual guidance. arXiv preprint arXiv:1908.01977. Cited by: §4.3.
  • [16] P. W. Holland and R. E. Welsch (1977) Robust regression using iteratively reweighted least-squares. Communications in Statistics-theory and Methods 6 (9), pp. 813–827. Cited by: §2.
  • [17] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox (2017) Flownet 2.0: evolution of optical flow estimation with deep networks. In in Proc. CVPR, pp. 2462–2470. Cited by: §1, §2.
  • [18] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025. Cited by: §2, §3.2.
  • [19] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • [20] K. Lin, N. Jiang, S. Liu, L. Cheong, M. Do, and J. Lu (2017) Direct photometric alignment by mesh deformation. In in Proc. CVPR, pp. 2405–2413. Cited by: §2.
  • [21] F. Liu, M. Gleicher, H. Jin, and A. Agarwala (2009) Content-preserving warps for 3d video stabilization. In ACM Trans. Graph. (Proc. of SIGGRAPH), Vol. 28, pp. 44. Cited by: §1.
  • [22] J. Liu, C. Wu, Y. Wang, Q. Xu, Y. Zhou, H. Huang, C. Wang, S. Cai, Y. Ding, H. Fan, et al. (2019) Learning raw image denoising with bayer pattern unification and bayer preserving augmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §4.2.
  • [23] S. Liu, P. Tan, L. Yuan, J. Sun, and B. Zeng (2016) Meshflow: minimum latency online video stabilization. In in Proc. ECCV, pp. 800–815. Cited by: §2.
  • [24] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), pp. 91–110. Cited by: §1, §2, §4.2.
  • [25] B. D. Lucas, T. Kanade, et al. (1981) An iterative image registration technique with an application to stereo vision. Cited by: §2.
  • [26] X. Meng, X. Deng, S. Zhu, S. Liu, C. Wang, C. Chen, and B. Zeng (2018) Mganet: a robust model for quality enhancement of compressed video. arXiv preprint arXiv:1811.09150. Cited by: §4.2.
  • [27] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos (2015) ORB-slam: a versatile and accurate monocular slam system. IEEE Trans. on robotics 31 (5), pp. 1147–1163. Cited by: §1.
  • [28] T. Nguyen, S. W. Chen, S. S. Shivakumar, C. J. Taylor, and V. Kumar (2018) Unsupervised deep homography: a fast and robust homography estimation model. IEEE Robotics and Automation Letters 3 (3), pp. 2346–2353. Cited by: §1, §2, Figure 5, §4.1, §4.1, §4.2.
  • [29] H. Qiu, C. Wang, H. Zhu, X. Zhu, J. Gu, and X. Han (2019) Two-phase hair image synthesis by self-enhancing generative model. arXiv preprint arXiv:1902.11203. Cited by: §4.3.
  • [30] J. Ren, Y. Hu, Y. Tai, C. Wang, L. Xu, W. Sun, and Q. Yan (2016) Look, listen and learn a multimodal lstm for speaker identification. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: §4.3.
  • [31] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid (2016) Deepmatching: hierarchical deformable dense matching. International Journal of Computer Vision 120 (3), pp. 300–323. Cited by: §2.
  • [32] E. Rublee, V. Rabaud, K. Konolige, and G. R. Bradski (2011) ORB: an efficient alternative to sift or surf.. In in Proc. ICCV, Vol. 11, pp. 2564–2571. Cited by: §2, §4.2.
  • [33] G. Simon, A. W. Fitzgibbon, and A. Zisserman (2000) Markerless tracking using planar structures in the scene. In in Proc. International Symposium on Augmented Reality, pp. 120–128. Cited by: §1.
  • [34] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.3.
  • [35] C. Wang, Y. Guo, J. Zhu, L. Wang, and W. Wang (2014) Video object co-segmentation via subspace clustering and quadratic pseudo-boolean optimization in an mrf framework. IEEE Transactions on Multimedia 16 (4), pp. 903–916. Cited by: §4.3.
  • [36] C. Wang, H. Huang, X. Han, and J. Wang (2019) Video inpainting by jointly learning temporal structure and spatial details. In Proceedings of the 33th AAAI Conference on Artificial Intelligence, Cited by: §4.3.
  • [37] C. Wang, J. Zhu, Y. Guo, and W. Wang (2017) Video vectorization via tetrahedral remeshing. IEEE Transactions on Image Processing 26 (4), pp. 1833–1844. Cited by: §4.2.
  • [38] Y. Wang, H. Huang, C. Wang, T. He, J. Wang, and M. Hoai (2019) Gif2video: color dequantization and temporal interpolation of gif images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1419–1428. Cited by: §4.2.
  • [39] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid (2013) DeepFlow: large displacement optical flow with deep matching. In in Proc. CVPR, pp. 1385–1392. Cited by: §2.
  • [40] P. Yan, G. Li, Y. Xie, Z. Li, C. Wang, T. Chen, and L. Lin (2019) Semi-supervised video salient object detection using pseudo-labels. arXiv preprint arXiv:1908.04051. Cited by: §4.3.
  • [41] J. Zaragoza, T. Chin, M. S. Brown, and D. Suter (2013) As-projective-as-possible image stitching with moving dlt. In in Proc. CVPR, pp. 2339–2346. Cited by: §2.
  • [42] F. Zhang and F. Liu (2014) Parallax-tolerant image stitching. In in Proc. CVPR, pp. 3262–3269. Cited by: §2.
  • [43] X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In in Proc. CVPR, pp. 6848–6856. Cited by: §4.3.
  • [44] Z. Zhang (2000) A flexible new technique for camera calibration. IEEE Trans. on Pattern Analysis and Machine Intelligence 22. Cited by: §1.
  • [45] D. Zou and P. Tan (2012) Coslam: collaborative visual slam in dynamic environments. IEEE Trans. on Pattern Analysis and Machine Intelligence 35 (2), pp. 354–366. Cited by: §1.