Deep Image Homography Estimation
Abstract
We present a deep convolutional neural network for estimating the relative homography between a pair of images. Our feedforward network has 10 layers, takes two stacked grayscale images as input, and produces an 8-degree-of-freedom homography which can be used to map the pixels from the first image to the second. We present two convolutional neural network architectures for HomographyNet: a regression network which directly estimates the real-valued homography parameters, and a classification network which produces a distribution over quantized homographies. We use a 4-point homography parameterization which maps the four corners from one image into the second image. Our networks are trained in an end-to-end fashion using warped MS-COCO images. Our approach works without the need for separate local feature detection and transformation estimation stages. Our deep models are compared to a traditional homography estimator based on ORB features, and we highlight the scenarios where HomographyNet outperforms the traditional technique. We also describe a variety of applications powered by deep homography estimation, thus showcasing the flexibility of a deep learning approach.
I Introduction
Sparse 2D feature points are the basis of most modern Structure from Motion and SLAM techniques [9]. These sparse 2D features are typically known as corners, and in all geometric computer vision tasks one must balance the errors in corner detection against geometric estimation errors. Even the simplest geometric methods, like estimating the homography between two images, rely on error-prone corner detection.
Estimating a 2D homography (or projective transformation) from a pair of images is a fundamental task in computer vision. The homography is an essential part of monocular SLAM systems in scenarios such as:

- Rotation-only movements
- Planar scenes
- Scenes in which objects are very far from the viewer
It is well-known that the transformation relating two images undergoing a rotation about the camera center is a homography, and it is not surprising that homographies are essential for creating panoramas [3]. To deal with planar and mostly-planar scenes, the popular SLAM algorithm ORB-SLAM [14] uses a combination of homography estimation and fundamental matrix estimation. Augmented Reality applications based on planar structures and homographies have been well-studied [16]. Camera calibration techniques using planar structures [20] also rely on homographies.
The traditional homography estimation pipeline is composed of two stages: corner estimation and robust homography estimation. Robustness is introduced into the corner detection stage by returning a large and over-complete set of points, while robustness in the homography estimation step shows up as heavy use of RANSAC or robustification of the squared loss function. Since corners are not as reliable as man-made linear structures, the research community has put considerable effort into adding line features [18] and more complicated geometries [8] into the feature detection step. What we really want is a single robust algorithm that, given a pair of images, simply returns the homography relating the pair. Instead of manually engineering corner-ish features, line-ish features, and so on, is it possible for the algorithm to learn its own set of primitives? We want to go even further and make transformation estimation the last step of a deep learning pipeline, thus giving us the ability to learn the entire homography estimation pipeline in an end-to-end fashion.
Recent research in "dense" or "direct" featureless SLAM algorithms such as LSD-SLAM [6] indicates promise in using a full image for geometric computer vision tasks. Concurrently, deep convolutional networks are setting state-of-the-art benchmarks in semantic tasks such as image classification, semantic segmentation, and human pose estimation. Additionally, recent works such as FlowNet [7], Deep Semantic Matching [1], and Eigen et al.'s Multi-Scale Deep Network [5] present promising results for dense geometric computer vision tasks like optical flow and depth estimation. Even robotic tasks like visual odometry are being tackled with convolutional neural networks [4].
In this paper, we show that the entire homography estimation problem can be solved by a deep convolutional neural network (see Figure 1). Our contributions are as follows: we present a new VGG-style [17] network for the homography estimation task. We show how to use the 4-point parameterization [2] to get a well-behaved deep estimation problem. Because deep networks require a lot of data to be trained from scratch, we share our recipe for creating a seemingly infinite dataset of training triplets from an existing dataset of real images, such as the MS-COCO dataset. We present an additional formulation of the homography estimation problem as classification, which produces a distribution over homographies and can be used to determine the confidence of an estimated homography.
II The 4-Point Homography Parameterization
The simplest way to parameterize a homography is with a 3x3 matrix and a fixed scale. The homography maps (u, v), the pixels in the left image, to (u', v'), the pixels in the right image, and is defined up to scale (see Equation 1).

\begin{equation}
\begin{pmatrix} u' \\ v' \\ 1 \end{pmatrix} \sim
\begin{pmatrix}
H_{11} & H_{12} & H_{13} \\
H_{21} & H_{22} & H_{23} \\
H_{31} & H_{32} & H_{33}
\end{pmatrix}
\begin{pmatrix} u \\ v \\ 1 \end{pmatrix}
\tag{1}
\end{equation}
However, if we unroll the 8 (or 9) parameters of the homography into a single vector, we'll quickly realize that we are mixing both rotational and translational terms. For example, the submatrix [H_{11} H_{12}; H_{21} H_{22}] represents the rotational terms in the homography, while the vector [H_{13}, H_{23}] is the translational offset. Balancing the rotational and translational terms as part of an optimization problem is difficult.
We found that an alternate parameterization, one based on a single kind of "location" variable, namely the corner location, is more suitable for our deep homography estimation task. The 4-point parameterization has been used in traditional homography estimation methods [2], and we use it in our modern deep manifestation of the homography estimation problem (see Figure 2). Letting Δu_1 = u'_1 - u_1 be the u-offset for the first corner, the 4-point parameterization represents a homography as follows:
\begin{equation}
H_{4point} =
\begin{pmatrix}
\Delta u_1 & \Delta v_1 \\
\Delta u_2 & \Delta v_2 \\
\Delta u_3 & \Delta v_3 \\
\Delta u_4 & \Delta v_4
\end{pmatrix}
\tag{2}
\end{equation}
Equivalently to the matrix formulation of the homography, the 4-point parameterization uses eight numbers. Once the displacement of the four corners is known, one can easily convert H_{4point} to H_{matrix}. This can be accomplished in a number of ways, for example using the normalized Direct Linear Transform (DLT) algorithm [9], or the function getPerspectiveTransform() in OpenCV.

Data Generation for Homography Estimation

Training deep convolutional networks from scratch requires a large amount of data. To meet this requirement, we generate a nearly unlimited number of labeled training examples by applying random projective transformations to a large dataset of natural images (in our experiments, we used cropped MS-COCO [13] images, although any large-enough dataset could be used for training). The process is illustrated in Figure 3 and described below.
To generate a single training example, we first randomly crop a square patch from the larger image I at position p (we avoid the borders to prevent bordering artifacts later in the data generation pipeline). This random crop is Patch A. Then, the four corners of Patch A are randomly perturbed by values within the range [-ρ, ρ]. The four correspondences define a homography H^AB. Then, the inverse of this homography, H^BA = (H^AB)^(-1), is applied to the large image to produce image I'. A second patch, Patch B, is cropped from I' at the same position p. The two grayscale patches, Patch A and Patch B, are then stacked channel-wise to create the 2-channel image which is fed directly into our ConvNet. The 4-point parameterization of H^AB is then used as the associated ground-truth training label.
Managing the training image generation pipeline gives us full control over the kinds of visual effects we want to model. For example, to make our method more robust to motion blur, we can apply such blurs to the images in our training set. If we want the method to be robust to occlusions, we can insert random occluding shapes into our training images. We experimented with inpainting random occluding rectangles into our training images as a simple mechanism to simulate real occlusions.
III ConvNet Models
Our networks use 3x3 convolutional blocks with BatchNorm [10] and ReLUs, and are architecturally similar to Oxford's VGG Net [17] (see Figure 1). Both networks take as input a two-channel grayscale image sized 128x128x2. In other words, the two input images, which are related by a homography, are stacked channel-wise and fed into the network. We use 8 convolutional layers with a max pooling layer (2x2, stride 2) after every two convolutions. The 8 convolutional layers have the following number of filters per layer: 64, 64, 64, 64, 128, 128, 128, 128. The convolutional layers are followed by two fully connected layers. The first fully connected layer has 1024 units. Dropout with a probability of 0.5 is applied after the final convolutional layer and the first fully connected layer. Our two networks share the same architecture up to the last layer, where the first network produces real-valued outputs and the second network produces discrete quantities (see Figure 4).
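As a sketch, the architecture described above might look as follows in PyTorch (our experiments used Caffe; the placement of only three pooling layers, leaving a 16x16 feature map before the fully connected layers, and other unstated details are assumptions here):

```python
# A sketch of HomographyNet: 8 conv layers (64,64,64,64,128,128,128,128)
# with 3x3 kernels, BatchNorm, ReLU, 2x2 max pooling, then two fully
# connected layers. The final layer differs between the two variants.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class HomographyNet(nn.Module):
    def __init__(self, classification=False, bins=21):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(2, 64), conv_block(64, 64), nn.MaxPool2d(2, 2),
            conv_block(64, 64), conv_block(64, 64), nn.MaxPool2d(2, 2),
            conv_block(64, 128), conv_block(128, 128), nn.MaxPool2d(2, 2),
            conv_block(128, 128), conv_block(128, 128),
        )
        # Regression head: 8 real values. Classification head: 8 x 21 bins.
        out_units = 8 * bins if classification else 8
        # After three 2x2 poolings a 128x128 input becomes 16x16.
        self.head = nn.Sequential(
            nn.Dropout(0.5),          # dropout after the final conv layer
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, 1024), nn.ReLU(inplace=True),
            nn.Dropout(0.5),          # dropout after the first FC layer
            nn.Linear(1024, out_units),
        )

    def forward(self, x):  # x: (N, 2, 128, 128)
        return self.head(self.features(x))
```

The regression variant would be trained with an L2 loss on the 8 outputs, and the classification variant with a cross-entropy loss over each group of 21 logits.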
The regression network directly produces 8 real-valued numbers and uses the Euclidean (L2) loss as the final layer during training. The advantage of this formulation is its simplicity; however, without producing any kind of confidence value for the prediction, such a direct approach could be prohibitive in certain applications.
The classification network uses a quantization scheme, has a softmax at the last layer, and we use the cross-entropy loss function during training. While quantization means that there is some inherent quantization error, the network is able to produce a confidence for each of the corners it predicts. We chose to use 21 quantization bins for each of the 8 output dimensions, which results in a final layer with 168 output neurons. Figure 6 is a visualization of the corner confidences produced by our method; notice how the confidence is not equal for all corners.
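To illustrate the quantization scheme, one possible binning maps each of the 8 corner displacements in [-ρ, ρ] to one of 21 uniform bins (the exact bin edges are not specified in the text, so the uniform scheme below is an assumption):

```python
# Illustrative quantization for the classification head: each corner
# displacement in [-rho, rho] maps to one of 21 bins, and bin indices
# map back to bin-center values.
import numpy as np

def quantize(offsets, rho=32, bins=21):
    """Map real-valued displacements to integer bin indices 0..bins-1."""
    step = 2 * rho / (bins - 1)
    idx = np.round((np.asarray(offsets, dtype=float) + rho) / step)
    return np.clip(idx, 0, bins - 1).astype(int)

def dequantize(indices, rho=32, bins=21):
    """Map bin indices back to displacement values (bin centers)."""
    step = 2 * rho / (bins - 1)
    return np.asarray(indices, dtype=float) * step - rho
```

With ρ = 32 and 21 bins the bin width is 3.2 pixels, so the worst-case quantization error per corner coordinate is 1.6 pixels.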
IV Experiments
We train both of our networks for about 8 hours on a single Titan X GPU, using stochastic gradient descent (SGD) with a momentum of 0.9. We use a base learning rate of 0.005 and decrease the learning rate by a factor of 10 after every 30,000 iterations. The networks are trained for 90,000 total iterations using a batch size of 64. We use Caffe [11], a popular open-source deep learning package, for all experiments.
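The optimization schedule above can be expressed, for example, with PyTorch's step learning-rate scheduler (our experiments used Caffe; the stand-in model here is purely illustrative):

```python
# SGD with momentum 0.9, base learning rate 0.005, decayed by 10x every
# 30,000 iterations over 90,000 total iterations.
import torch

model = torch.nn.Linear(8, 8)  # stand-in for a HomographyNet
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                            step_size=30000, gamma=0.1)
```

In a training loop, `scheduler.step()` would be called once per iteration, yielding learning rates of 0.005, 0.0005, and 0.00005 across the three 30,000-iteration phases.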
To create the training data, we use the MS-COCO training set. All images are resized to 320x240 and converted to grayscale. We then generate 500,000 pairs of image patches sized 128x128 related by a homography using the data generation method described above. We choose ρ = 32, which means that each corner of the 128x128 grayscale image can be perturbed by a maximum of one quarter of the total image edge size. We avoid larger random perturbations to avoid extreme transformations. We did not use any form of pretraining; the weights of the networks were initialized to random values and trained from scratch. We use the MS-COCO validation set to monitor overfitting, of which we found very little.
To our knowledge there are no large, publicly available homography estimation test sets, thus we evaluate our homography estimation approach on our own Warped MS-COCO 14 Test Set. To create this test set, we randomly chose 5000 images from the test set, resized each image to grayscale 640x480, and generated a pair of image patches sized 256x256 with a corresponding ground truth homography for each image, using the approach described in Figure 3 with ρ = 64 (we found that very few ORB features were detected when the patches were sized 128x128, while the HomographyNets had no issues working at the smaller scale).
We compare the Classification and Regression variants of the HomographyNet with two baselines. The first baseline is a classical ORB [15] descriptor + RANSAC + getPerspectiveTransform() OpenCV homography computation. We use the default OpenCV parameters in the traditional homography estimator. This estimates ORB features at multiple scales and uses the top 25 scoring matches as input to the RANSAC estimator. In scenarios where too few ORB features are computed, the ORB+RANSAC approach outputs an identity estimate. In scenarios where the ORB+RANSAC estimate is too extreme, the 4-point homography estimate is clipped to [-64, 64]. The second baseline uses a 3x3 identity matrix for every pair of images in the test set.
Since the HomographyNets expect a fixed-size 128x128x2 input, the image pairs from the Warped MS-COCO 14 Test Set are resized from 256x256x2 to 128x128x2 before being passed through the network. The 4-point parameterized homography output by the network is then multiplied by a factor of two to account for this. When evaluating the Classification HomographyNet, the corner displacement with the highest confidence is chosen.
The results are reported in Figure 5. We report the Mean Average Corner Error for each approach. To measure this metric, one first computes the L2 distance between the ground truth corner position and the estimated corner position. The error is averaged over the four corners of the image, and the mean is computed over the entire test set. While the regression network performs the best, the classification network can produce confidences and thus offers a meaningful way to visually debug the results. In certain applications, it may be critical to have this measure of certainty.
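The metric can be written directly, assuming corner positions are stored as (N, 4, 2) arrays of pixel coordinates:

```python
# Mean Average Corner Error: L2 distance per corner, averaged over the
# four corners of each image, then averaged over the test set.
import numpy as np

def mean_average_corner_error(gt_corners, est_corners):
    """gt_corners, est_corners: arrays of shape (N, 4, 2) in pixels."""
    gt = np.asarray(gt_corners, dtype=float)
    est = np.asarray(est_corners, dtype=float)
    per_corner = np.linalg.norm(gt - est, axis=-1)  # L2 per corner
    per_image = per_corner.mean(axis=-1)            # average over 4 corners
    return per_image.mean()                         # mean over the test set
```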
We visualize homography estimations in Figure 7. The blue squares in column 1 are mapped to blue quadrilaterals in column 2 by random homographies generated from our data generation process. The green quadrilateral is the estimated homography. The more closely the blue and green quadrilaterals align, the better. The red lines show the top scoring matches of ORB features across the image patches. A similar visualization is shown in columns 3 and 4, except that the Deep Homography Estimator is used.
V Applications
Our Deep Homography Estimation system enables a variety of interesting applications. Firstly, our system is fast. It runs at over 300 fps with a batch size of one (i.e., real-time inference mode) on an NVIDIA Titan X GPU, which enables a host of applications that are simply not possible with a slower system. The recent emergence of specialized embedded hardware for deep networks will enable applications on many embedded systems or platforms with limited computational power which cannot afford an expensive and power-hungry desktop GPU. Such embedded systems are capable of running much larger networks such as AlexNet [12] in real-time, and should have no problem running the relatively lightweight HomographyNets.
Secondly, by formulating homography estimation as a machine learning problem, one can build application-specific homography estimation engines. For example, a robot that navigates an indoor factory floor using planar SLAM via homography estimation could be trained solely with images captured from the robot's image sensor of the indoor factory. While it is possible to optimize a feature detector such as ORB to work in specific environments, it is not straightforward. Environment and sensor-specific noise, motion blur, and occlusions which might restrict the ability of a homography estimation algorithm can be tackled in a similar fashion using a ConvNet. Other classical computer vision tasks such as image mosaicing (as in [19]) and markerless camera tracking systems for augmented reality (as in [16]) could also benefit from HomographyNets trained on image pair examples created from the target system's sensors and environment.
VI Conclusion
In this paper we asked whether one of the most essential computer vision estimation tasks, namely homography estimation, could be cast as a learning problem. We presented two Convolutional Neural Network architectures that are able to perform well on this task. Our end-to-end training pipeline contains two additional insights: using a 4-point "corner parameterization" of homographies, which makes the parameterization's coordinates operate on the same scale, and using a large dataset of real images to synthetically create a seemingly unlimited training set for homography estimation. We hope that more geometric problems in vision will be tackled using learning paradigms.
References
 Bai et al. [2016] M. Bai, W. Luo, K. Kundu, and R. Urtasun. Deep Semantic Matching for Optical Flow. CoRR, abs/1604.01827, April 2016.
 Baker et al. [2006] Simon Baker, Ankur Datta, and Takeo Kanade. Parameterizing homographies. Technical Report CMU-RI-TR-06-11, Robotics Institute, Pittsburgh, PA, March 2006.
 Brown and Lowe [2007] Matthew Brown and David G Lowe. Automatic panoramic image stitching using invariant features. International journal of computer vision, 74(1):59–73, 2007.
 Costante et al. [2016] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia. Exploring representation learning with CNNs for frame-to-frame ego-motion estimation. ICRA, 2016.
 Eigen et al. [2014] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multiscale deep network. CoRR, abs/1406.2283, 2014.
 Engel et al. [2014] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. 2014.
 Fischer et al. [2015] Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philip Häusser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. ICCV, 2015.
 Gee et al. [2008] A. P. Gee, D. Chekhlov, A. Calway, and W. MayolCuevas. Discovering higher level structure in visual slam. IEEE Transactions on Robotics, 2008.
 Hartley and Zisserman [2004] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.
 Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
 Jia et al. [2014] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
 Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS. 2012.
 Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
 Mur-Artal et al. [2015] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 2015.
 Rublee et al. [2011] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, 2011.
 Simon et al. [2000] G. Simon, A. Fitzgibbon, and A. Zisserman. Markerless tracking using planar structures in the scene. In Proc. International Symposium on Augmented Reality, pages 120–128, October 2000.
 Simonyan and Zisserman [2014] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. CoRR, abs/1409.1556, 2014.
 Smith et al. [2006] Paul Smith, Ian Reid, and Andrew Davison. Realtime monocular SLAM with straight lines. In Proc. British Machine Vision Conference, 2006.
 Szeliski [1996] Richard Szeliski. Video mosaics for virtual environments. IEEE Computer Graphics and Applications, 1996.
 Zhang [2000] Zhengyou Zhang. A flexible new technique for camera calibration. PAMI, 22(11):1330–1334, 2000.