# GIFT: Learning Transformation-Invariant Dense Visual Descriptors via Group CNNs

###### Abstract

Finding local correspondences between images with different viewpoints requires local descriptors that are robust against geometric transformations. An approach for transformation invariance is to integrate out the transformations by pooling the features extracted from transformed versions of an image. However, the feature pooling may sacrifice the distinctiveness of the resulting descriptors. In this paper, we introduce a novel visual descriptor named Group Invariant Feature Transform (GIFT), which is both discriminative and robust to geometric transformations. The key idea is that the features extracted from the transformed versions of an image can be viewed as a function defined on the group of the transformations. Instead of feature pooling, we use group convolutions to exploit underlying structures of the extracted features on the group, resulting in descriptors that are both discriminative and provably invariant to the group of transformations. Extensive experiments show that GIFT outperforms state-of-the-art methods on several benchmark datasets and practically improves the performance of relative pose estimation.

## 1 Introduction

Establishing local feature correspondences between images is a fundamental problem in many computer vision tasks such as structure from motion Hartley and Zisserman (2003), visual localization Filliat (2007), SLAM Mur-Artal et al. (2015), image stitching Brown and Lowe (2007) and image retrieval Philbin et al. (2010). Finding reliable correspondences requires image descriptors that effectively encode distinctive image patterns while being invariant to geometric and photometric image transformations caused by viewpoint and illumination changes.

To achieve invariance to viewpoint changes, traditional methods Lowe (2004); Luo et al. (2018) use patch detectors Lindeberg (1998); Mikolajczyk and Schmid (2004) to extract transformation-covariant local patches, which are then normalized for transformation invariance. Invariant descriptors can then be extracted on the detected local patches. However, a typical image may have very few pixels for which viewpoint-covariant patches can be reliably detected Hassner et al. (2012). Also, "hand-crafted" detectors such as DoG Lowe (2004) and Affine-Harris Mikolajczyk and Schmid (2004) are sensitive to image artifacts and lighting conditions. Reliably detecting covariant regions is still an open problem Lenc and Vedaldi (2016); DeTone et al. (2018) and a performance bottleneck in the traditional pipeline of correspondence estimation.

Instead of relying on a sparse set of covariant patches, some recent works Hariharan et al. (2015); Choy et al. (2016); DeTone et al. (2018) propose to extract dense descriptors by feeding the whole image into a convolutional neural network (CNN) and constructing pixel-wise descriptors from the feature maps of the CNN. However, the CNN-based descriptors are usually sensitive to viewpoint changes as convolutions are inherently not invariant to geometric transformations. While augmenting training data with warped images improves the robustness of learned features, the invariance is not guaranteed and a larger network is typically required to fit the augmented datasets.

In order to explicitly improve invariance to geometric transformations, some works Yang et al. (2016); Wang et al. (2014); Hassner et al. (2012) resort to integrating out the transformations by pooling the features extracted from transformed versions of the original images. But the distinctiveness of extracted features may degenerate due to the pooling operation.

In this paper, we propose a novel CNN-based dense descriptor, named Group Invariant Feature Transform (GIFT), which is both discriminative and invariant to a group of transformations. The key idea is that, if an image is regarded as a function defined on the translation group, the CNN features extracted from multiple transformed images can be treated as a function defined on the transformation group. Analogous to local image patterns, such features on the group also have discriminative patterns, which are neglected by previous methods that use pooling for invariance. We argue that exploiting the underlying structures of the group features is essential for building discriminative descriptors. It can be theoretically demonstrated that transforming the input image with any element in the group results in a permutation of the group features. Such a permutation preserves local structures of the group features. Thus, we propose to use group convolutions to encode the local structures of the group features, resulting in feature representations that are not only discriminative but also equivariant to the transformations in the group. Finally, the intermediate representations are bilinearly pooled to obtain provably invariant descriptors. This transformation-invariant dense descriptor simplifies correspondence estimation because detecting covariant patches can be avoided. Without the need for patch detectors, the proposed descriptor can be combined with any interest point detector for sparse feature matching, or even with a uniformly sampled grid for dense matching.

We evaluate the performance of GIFT on the HPSequences Balntas et al. (2017); Lenc and Vedaldi (2018) dataset and the SUN3D Xiao et al. (2013) dataset for correspondence estimation. The results show that GIFT outperforms both traditional descriptors and recent learned descriptors. We further demonstrate the robustness of GIFT to extremely large scale and orientation changes on several new datasets. The current unoptimized implementation of GIFT runs at 15 fps on a GTX 1080 Ti GPU, which is sufficiently fast for practical applications.

## 2 Related work

Existing pipelines for feature matching usually rely on a feature detector and a feature descriptor. Feature detectors Lindeberg (1998); Lowe (2004); Mikolajczyk and Schmid (2004) detect local patches which are covariant to geometric transformations brought by viewpoint changes. Then, invariant descriptors can be extracted on the normalized local patches via traditional patch descriptors Lowe (2004); Calonder et al. (2012); Rublee et al. (2011); Bay et al. (2006) or deep metric learning based patch descriptors Zagoruyko and Komodakis (2015); Han et al. (2015); Mishchuk et al. (2017); Tian et al. (2017); Luo et al. (2018); Balntas et al. (2016); G et al. (2015); He et al. (2018); Zhang et al. (2017). The robustness of detectors can be guaranteed theoretically, e.g., by the scale-space theory Lindeberg (2013). However, a typical image often has very few pixels for which viewpoint-covariant patches can be reliably detected Hassner et al. (2012). The scarcity of reliably detected patches becomes a performance bottleneck in the traditional pipeline of correspondence estimation. Some recent works Ono et al. (2018); Yi et al. (2016a); Lenc and Vedaldi (2016); Mishkin et al. (2018); Zhang et al. (2017); Doiphode et al. (2018); Yi et al. (2016b) try to learn such viewpoint-covariant patch detectors with CNNs. However, the definition of a canonical scale or orientation is ambiguous, and detecting a consistent scale or orientation for every pixel remains challenging.

To alleviate the dependency on detectors, A-SIFT Morel and Yu (2009) warps original image patches by affine transformations and exhaustively searches for the best match. Some other methods Yang et al. (2016); Hassner et al. (2012); Wang et al. (2014); Dong and Soatto (2015) follow similar pipelines but pool features extracted from these transformed patches to obtain invariant descriptors. GIFT also transforms images, but instead of using feature pooling, it applies group convolutions to further exploit the underlying structures of features extracted from the group of transformed images to retain distinctiveness of the resulting descriptors.

Feature map based descriptor. Descriptors can also be directly extracted from feature maps of CNNs Hariharan et al. (2015); DeTone et al. (2018); Choy et al. (2016). However, CNNs are not invariant to geometric transformations naturally. The common strategy to make CNNs invariant to geometric transformations is to augment the training data with such transformations. However, data augmentation cannot guarantee the invariance on unseen data. The Universal Correspondence Network (UCN) Choy et al. (2016) uses a convolutional spatial transformer Jaderberg et al. (2015) in the network to normalize the local patches to a canonical shape. However, learning an invariant spatial transformer is as difficult as learning a viewpoint covariant detector. Our method also uses CNNs to extract features on transformed images but applies subsequent group convolutions to construct transformation-invariant descriptors.

Equivariant or invariant CNNs. Some recent works Cohen and Welling (2016); Marcos et al. (2017); Khasanova and Frossard (2017); Cohen and Welling (2017); Worrall et al. (2017); Esteves et al. (2018b); Kaisheng et al. (2019); Cohen et al. (2018); Khasanova and Frossard (2017); Weiler et al. (2018); Marcos et al. (2017); Henriques and Vedaldi (2017); Esteves et al. (2018a, 2019); Bekkers et al. (2018) design special architectures to make CNNs equivariant to specific transformations. The most related work is the Group Equivariant CNN Cohen and Welling (2016) which uses group convolution and subgroup pooling to learn equivariant feature representations. It applies group convolutions directly on a large group which is the product of the translation group and the geometric transformation group. In contrast, GIFT uses a vanilla CNN to process images, which can be regarded as features defined on the translation group, and separate group CNNs to process the features on the geometric transformation group, which results in a more efficient model than the original Group Equivariant CNN.

## 3 Method

Preliminary. Assuming the observed 3D surfaces are smooth, the transformation between corresponding image patches under different viewpoints is approximately in the affine group. In this paper, we only consider its subgroup $G$, which consists of rotations and scalings. The key intermediate feature representation in the pipeline of GIFT is a map $f: G \to \mathbb{R}^{n}$ from the group $G$ to a feature space $\mathbb{R}^{n}$, which is referred to as a group feature.

Overview. As illustrated in Fig. 1, the proposed method consists of two modules: group feature extraction and group feature embedding. The group feature extraction module takes an image $I$ as input, warps the image with a grid of sampled elements in $G$, separately feeds the warped images through a vanilla CNN, and outputs a set of feature maps where each feature map corresponds to an element in $G$. For any interest point $p$ in the image, a feature vector can be extracted from each feature map. The feature vectors corresponding to $p$ in all the feature maps form a group feature $f_0: G \to \mathbb{R}^{n_0}$. Next, the group feature embedding module embeds the group feature $f_0$ of every interest point into two features $f_{l,\alpha}$ and $f_{l,\beta}$ by two group CNNs, both of which have $l$ group convolution layers. Finally, $f_{l,\alpha}$ and $f_{l,\beta}$ are pooled by a bilinear pooling operator Lin et al. (2015) to obtain a GIFT descriptor $d$.

### 3.1 Group feature extraction

Given an input image $I$ and a point $p$ on the image, this module aims to extract a transformation-equivariant group feature $f_0: G \to \mathbb{R}^{n_0}$ at this point. To get the feature vector $f_0(g)$ for a specific transformation $g \in G$, we begin by transforming the input image with $g$. Then, we process the transformed image $T_g \circ I$ with a vanilla CNN $\eta$. The output feature map is denoted by $\eta(T_g \circ I)$. Since the image is transformed, the corresponding location of $p$ on the output feature map also changes to $T_g(p)$. We use the feature vector located at $T_g(p)$ on the feature map $\eta(T_g \circ I)$ as the value of $f_0(g)$. Because the coordinates of $T_g(p)$ may not be integers, we apply bilinear interpolation to get the feature vector at that location. The whole process can be expressed by

$$f_0(g) = \eta(T_g \circ I)\,[T_g(p)]. \tag{1}$$

The extracted group feature is equivariant to transformations in the group, as illustrated in Fig. 2.
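The bilinear interpolation step used to read a feature vector at a non-integer location can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function name `bilinear_sample`, the nested-list layout `feature_map[row][col][channel]`, and the clamping at the border are all assumptions made for the example.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Read a feature vector at the real-valued location (x, y).

    feature_map[row][col] is a list of channel values (hypothetical layout);
    x indexes columns and y indexes rows. Coordinates are clamped to the map.
    """
    height, width = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    ax, ay = x - x0, y - y0  # fractional offsets inside the cell
    channels = len(feature_map[0][0])
    out = []
    for k in range(channels):
        top = (1 - ax) * feature_map[y0][x0][k] + ax * feature_map[y0][x1][k]
        bottom = (1 - ax) * feature_map[y1][x0][k] + ax * feature_map[y1][x1][k]
        out.append((1 - ay) * top + ay * bottom)
    return out


# A 2x2 single-channel map; the centre is the mean of the four corners.
fmap = [[[0.0], [1.0]], [[2.0], [3.0]]]
center = bilinear_sample(fmap, 0.5, 0.5)
```

In a GPU implementation the same operation would normally be done in a batched way for all interest points and all warped images at once.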

###### Lemma 1.

The group feature of a point $p$ in an image $I$ extracted by Eq. (1) is denoted by $f$. If the image is transformed by an element $h \in G$ and the group feature extracted at the corresponding point $T_h(p)$ in this transformed image is denoted by $f'$, then for any $g \in G$, $f'(g) = f(gh)$, which means that the transformation of the input image results in a permutation of the group feature. ∎
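The permutation property in Lemma 1 can be checked on a toy example. The sketch below is an illustration under strong simplifying assumptions, not the paper's pipeline: the "image" is a ring of eight samples around the interest point, the group is the cyclic group of 45-degree rotations (so it wraps around and has no boundary), and `toy_extractor` is a hypothetical stand-in for the vanilla CNN that is deliberately not rotation invariant.

```python
def rotate(ring, steps):
    """Rotate the ring of samples by `steps` unit rotations (cyclic shift)."""
    n = len(ring)
    return [ring[(k + steps) % n] for k in range(n)]


def toy_extractor(ring):
    """Hypothetical stand-in for the CNN: depends on sample positions,
    so its output changes when the ring is rotated."""
    return (2.0 * ring[0] + ring[1], max(ring[0], ring[2]))


def group_feature(ring):
    """Evaluate the extractor on every rotated copy of the input."""
    return [toy_extractor(rotate(ring, g)) for g in range(len(ring))]


ring = [0.3, 1.7, 0.2, 0.9, 1.1, 0.4, 2.5, 0.8]
h = 3  # transformation applied to the input "image"

f = group_feature(ring)
f_prime = group_feature(rotate(ring, h))

# Lemma 1 in this toy setting: the new group feature is a shifted copy of the old one.
shifted = [f[(g + h) % len(ring)] for g in range(len(ring))]
```

Here `f_prime` equals `shifted` entry by entry, while it differs from `f` itself: the values are unchanged and only their positions on the group move.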

Lemma 1 provides a novel and strict criterion for matching two feature points. Traditional methods usually detect a canonical scale and orientation for an interest point in each view and match points across views by descriptors extracted at the canonical scale and orientation. In terms of the group features $f$ and $f'$ of two points, this can be interpreted as follows: if the two points are matched, then there exist $g$ and $g'$ such that $f(g) = f'(g')$. However, the canonical $g$ and $g'$ are ambiguous and hard to detect reliably. Lemma 1 shows that, if two points are matched, then there exists an $h \in G$ such that for all $g \in G$, $f'(g) = f(gh)$. In other words, the group features of two matched points are related by a permutation. This provides a strict matching criterion between two group features. Even though $h$ can hardly be determined when extracting descriptors, the permutation caused by $h$ preserves the structures of group features and only changes their locations. Encoding local structures of group features allows us to construct distinctive and transformation-invariant descriptors.

### 3.2 Group convolution layer

After group feature extraction, we apply the discrete group convolution originally proposed in Cohen and Welling (2016) to encode local structures of group features, which is defined as

$$[f_l(g)]_i = \sigma\Big(\sum_{h \in H} f_{l-1}^{\top}(hg)\, W_i(h) + b_i\Big), \tag{2}$$

where $f_l$ and $f_{l-1}$ are group features of layer $l$ and layer $l-1$ respectively, $[\cdot]_i$ means the $i$-th dimension of the vector, $g$ and $h$ are elements in the group, $H \subset G$ is a set of transformations around the identity transformation, $W$ are learnable parameters which are defined on $H$, $b_i$ is a bias term and $\sigma$ is a non-linear activation function. If $G$ is the 2D translation group, the group convolution in Eq. (2) becomes the conventional 2D convolution. Similar to conventional CNNs, which are able to encode local patterns of images, group CNNs are able to encode local structures of group features. For more discussion about the relationship between the group convolution and the conventional convolution, please refer to Cohen and Welling (2016).
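A group convolution layer and its equivariance can be illustrated on a small example. The sketch below assumes a cyclic group of eight rotations instead of the rotation-scaling grid used in the paper, a neighbourhood of three elements around the identity, and hand-picked weights; `group_conv` and `shift` are hypothetical helper names.

```python
def group_conv(f_in, weights, bias, offsets=(-1, 0, 1)):
    """One group convolution layer on a cyclic group, followed by ReLU.

    f_in[g]       : input feature vector at group element g
    weights[k][i] : weight vector for neighbour offsets[k] and output channel i
    bias[i]       : bias of output channel i
    """
    n = len(f_in)
    f_out = []
    for g in range(n):
        vec = []
        for i in range(len(bias)):
            acc = bias[i]
            for h, w_h in zip(offsets, weights):
                neighbour = f_in[(g + h) % n]  # group element "hg" (cyclic case)
                acc += sum(x * w for x, w in zip(neighbour, w_h[i]))
            vec.append(max(acc, 0.0))  # ReLU activation
        f_out.append(vec)
    return f_out


def shift(f, k):
    """Permutation of a group feature caused by transforming the input by k."""
    return [f[(g + k) % len(f)] for g in range(len(f))]


f0 = [[float(g), float((g * g) % 5)] for g in range(8)]
weights = [
    [[0.5, -1.0], [0.25, 0.0]],    # offset -1
    [[1.0, 0.5], [-0.5, 1.0]],     # offset 0 (identity)
    [[-0.25, 0.75], [0.0, -1.0]],  # offset +1
]
bias = [0.1, -0.2]

conv_then_shift = shift(group_conv(f0, weights, bias), 3)
shift_then_conv = group_conv(shift(f0, 3), weights, bias)
```

The two results coincide, so permuting the input group feature permutes the output in the same way. On the sampled, non-cyclic grid used in practice this only holds away from the boundary of the sampled range.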

The group convolution actually preserves the equivariance of group features:

###### Lemma 2.

### 3.3 Group bilinear pooling

In GIFT, we actually construct two group CNNs $\alpha$ and $\beta$, both of which consist of $l$ group convolution layers, to process the input group feature $f_0$. The outputs of the two group CNNs are denoted by $f_{l,\alpha}$ and $f_{l,\beta}$, respectively. Finally, we obtain the GIFT descriptor $d$ by applying the bilinear pooling operator Lin et al. (2015) to $f_{l,\alpha}$ and $f_{l,\beta}$, which can be described as

$$d_{i,j} = \int_{G} [f_{l,\alpha}(g)]_i\, [f_{l,\beta}(g)]_j \, dg, \tag{3}$$

where $[f_{l,\alpha}(g)]_i$ is the $i$-th element of the feature vector $f_{l,\alpha}(g)$, and likewise for $f_{l,\beta}$. Based on Lemma 1 and Lemma 2, we can prove the invariance of GIFT as stated in Proposition 1. The proof is given in the supplementary material.

###### Proposition 1.

Let $d$ denote the GIFT descriptor of an interest point in an image. If the image is transformed by any transformation $h \in G$ and the GIFT descriptor extracted at the corresponding point in the transformed image is denoted by $d'$, then $d' = d$.

In fact, many pooling operators such as average pooling and max pooling can achieve such invariance. We adopt bilinear pooling for two reasons. First, it collects second-order statistics of features and thus produces more informative descriptors. Second, the statistics used in many previous methods for invariant descriptors Hassner et al. (2012); Wang et al. (2014); Yang et al. (2016) can be written as special forms of bilinear pooling, as proved in the supplementary material. The proposed GIFT is therefore a more general form of these methods.
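Two of the claims above can be checked numerically on a toy group feature: bilinear pooling over the group does not change when the group elements are permuted, and average pooling is recovered when the second feature is a constant. The sketch below is an illustration with made-up values, and `bilinear_pool` is a hypothetical helper that replaces the integral over the group by a sum over four sampled elements.

```python
import math


def bilinear_pool(f_a, f_b):
    """d[i][j] = sum over group elements g of f_a[g][i] * f_b[g][j]."""
    n_a, n_b = len(f_a[0]), len(f_b[0])
    return [[sum(a[i] * b[j] for a, b in zip(f_a, f_b)) for j in range(n_b)]
            for i in range(n_a)]


def matrices_close(u, v, tol=1e-9):
    return all(math.isclose(x, y, abs_tol=tol)
               for row_u, row_v in zip(u, v) for x, y in zip(row_u, row_v))


# Two toy group features sampled on four group elements (made-up values).
f_a = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]]
f_b = [[0.0, 1.0, 2.0], [1.0, 1.0, -1.0], [2.0, 0.5, 0.0], [-1.0, 3.0, 1.0]]

# The same permutation of group elements applied to both features.
perm = [2, 0, 3, 1]
d = bilinear_pool(f_a, f_b)
d_perm = bilinear_pool([f_a[k] for k in perm], [f_b[k] for k in perm])

# Average pooling as a special case: the second feature is the constant 1/n.
constant = [[1.0 / len(f_a)] for _ in f_a]
avg_via_bilinear = [row[0] for row in bilinear_pool(f_a, constant)]
avg_direct = [sum(v[i] for v in f_a) / len(f_a) for i in range(2)]
```

`d` and `d_perm` agree up to floating-point rounding, and `avg_via_bilinear` matches `avg_direct`.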

### 3.4 Implementation details

Sampling from the group. Due to limited computational resources, we sample a range of elements in $G$ to compute group features. We sample evenly in the scale group and the rotation group separately. The unit transformations are defined as 1/4 downsampling and 45 degree clockwise rotation and denoted by $s$ and $r$, respectively. Then, the sampled elements in the group form a grid $\{(s^i, r^j)\}$. Considering computational complexity, we choose $n_s = 5$ scales ranging from $s^0$ to $s^4$ and $n_r = 5$ orientations ranging from $r^{-2}$ to $r^{2}$. In this case, the group feature of an interest point is a tensor $f \in \mathbb{R}^{n_s \times n_r \times n}$, where $n$ is the dimension of the feature space.
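The sampling grid can be sketched as a table of 2x2 similarity matrices indexed by scale and rotation exponents. Everything numeric below is an assumption for illustration: the unit scale factor `0.5`, the choice of five scale exponents and five rotation exponents, and the sign convention of the rotation (which depends on the image coordinate system) are placeholders rather than values confirmed by the text.

```python
import math


def similarity_matrix(scale, angle_deg):
    """2x2 matrix of a rotation by angle_deg combined with uniform scaling."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    return [[scale * c, -scale * s], [scale * s, scale * c]]


def sample_group_grid(unit_scale, unit_angle_deg, scale_exps, rot_exps):
    """Map each exponent pair (i, j) to the matrix of s^i combined with r^j."""
    grid = {}
    for i in scale_exps:
        for j in rot_exps:
            grid[(i, j)] = similarity_matrix(unit_scale ** i, unit_angle_deg * j)
    return grid


# Hypothetical settings: 5 scales and 5 orientations with a 45-degree unit rotation.
grid = sample_group_grid(unit_scale=0.5, unit_angle_deg=45.0,
                         scale_exps=range(5), rot_exps=range(-2, 3))
```

Each matrix in `grid` would be used to warp the input image once, so the number of CNN forward passes equals the number of grid entries.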

Because only a finite range of the group is sampled, Lemma 1 and Lemma 2 do not hold rigorously near the boundary of the selected range. Empirically, however, this boundary effect does not noticeably affect the final matching performance as long as the scale and rotation changes stay within a reasonable range.

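Bilinear pooling. The integral in Eq. (3) is approximated by the summation over the sampled group elements. Suppose the output group features of the two group CNNs are denoted by $f_{l,\alpha}$ and $f_{l,\beta}$, respectively, and reshaped as two matrices $\tilde{f}_{l,\alpha} \in \mathbb{R}^{m \times n_\alpha}$ and $\tilde{f}_{l,\beta} \in \mathbb{R}^{m \times n_\beta}$, where $m = n_s \times n_r$ is the number of sampled group elements ($n_s$ scales and $n_r$ orientations). Then, the GIFT descriptor $d \in \mathbb{R}^{n_\alpha \times n_\beta}$ can be written as

$$d = \tilde{f}_{l,\alpha}^{\top}\, \tilde{f}_{l,\beta}. \tag{4}$$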

Network architecture. The vanilla CNN has four convolution layers and an average pooling layer to enlarge receptive fields. In the vanilla CNN, we use instance normalization Ulyanov et al. (2016) instead of batch normalization Ioffe and Szegedy (2015). The output feature dimension of the vanilla CNN is 32. In both group CNNs, $H$ defined in Eq. (2) is $\{r, r^{-1}, s, s^{-1}, rs, rs^{-1}, r^{-1}s, r^{-1}s^{-1}, e\}$, where $s$ and $r$ are the unit scaling and rotation and $e$ is the identity transformation. ReLU Nair and Hinton (2010) is used as the nonlinear activation function. The number of group convolution layers is $l = 1$ in the ablation studies unless stated otherwise and $l = 6$ in subsequent comparisons to state-of-the-art methods. The output feature dimensions $n_\alpha$ and $n_\beta$ of the two group CNNs are 8 and 16 respectively, which results in a 128-dimensional descriptor after bilinear pooling. The output descriptors are L2-normalized so that $\|d\|_2 = 1$.
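The dimension bookkeeping of the final descriptor can be sketched as follows: two group features with 8 and 16 channels are pooled over the sampled group elements, flattened to 8 x 16 = 128 values, and L2-normalized. The function name `gift_descriptor`, the use of 25 sampled group elements, and the synthetic input values are assumptions made for this illustration.

```python
import math


def gift_descriptor(f_alpha, f_beta):
    """Bilinear pooling over group elements, flattened and L2-normalized.

    f_alpha[g] and f_beta[g] are the feature vectors of the two group CNNs
    at sampled group element g (hypothetical nested-list layout).
    """
    n_a, n_b = len(f_alpha[0]), len(f_beta[0])
    d = [sum(a[i] * b[j] for a, b in zip(f_alpha, f_beta))
         for i in range(n_a) for j in range(n_b)]
    norm = math.sqrt(sum(v * v for v in d))
    return [v / norm for v in d] if norm > 0.0 else d


num_elements = 25  # assumed number of sampled group elements
f_alpha = [[math.sin(0.3 * g + i) for i in range(8)] for g in range(num_elements)]
f_beta = [[math.cos(0.2 * g - j) for j in range(16)] for g in range(num_elements)]

descriptor = gift_descriptor(f_alpha, f_beta)
```

The descriptor length depends only on the two channel counts, not on how many group elements are sampled.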

Loss function. The model is trained by minimizing a triplet loss Schroff et al. (2015) defined by

$$\ell = \max\big(\|d^{a} - d^{p}\|_2 - \|d^{a} - d^{n}\|_2 + \gamma,\; 0\big), \tag{5}$$

where $d^{a}$, $d^{p}$ and $d^{n}$ are descriptors of an anchor point in an image, its true match in the other image, and a false match selected by hard negative mining, respectively. The margin $\gamma$ is set to 0.5 in all experiments. The hard negative mining is a modified version of that proposed in Choy et al. (2016).
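A triplet loss with margin 0.5 can be sketched in a few lines. The version below assumes plain (non-squared) Euclidean distances between descriptors, which is one common form of the triplet loss and may differ in detail from the exact expression used for training; `l2_distance` and `triplet_loss` are hypothetical helper names.

```python
import math


def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge on (distance to true match) - (distance to false match) + margin."""
    return max(l2_distance(anchor, positive) - l2_distance(anchor, negative) + margin, 0.0)


# Easy triplet: the true match coincides with the anchor, the false match is far away.
easy = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])

# Hard triplet: the false match is closer to the anchor than the true match.
hard = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

The easy triplet already satisfies the margin and contributes zero loss, while the hard one is penalized, which is why hard negative mining focuses training on informative triplets.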

## 4 Experiments

### 4.1 Datasets and Metrics

HPSequences Balntas et al. (2017); Lenc and Vedaldi (2018) is a dataset that contains 580 image pairs for evaluation which can be divided into two splits, namely Illum-HP and View-HP. Illum-HP contains only illumination changes while View-HP contains mainly viewpoint changes. The viewpoint changes in the View-HP cause homography transformations because all observed objects are planar.

SUN3D Xiao et al. (2013) is a dataset that contains 500 image pairs of indoor scenes. The observed objects are not planar so that it introduces self-occlusion and perspective distortion, which are commonly-considered challenges in correspondence estimation.

ES-* and ER-*. To fully evaluate the correspondence estimation performance under extreme scale and orientation changes, we create extreme scale (ES) and extreme rotation (ER) datasets by artificially scaling and rotating the images in HPSequences and SUN3D. For a pair of images, we manually add large orientation or scale changes to the second image. The range of rotation angle is . The range of scaling factor is . Examples are shown in Fig. 3.

MVS dataset Strecha et al. (2008) contains six image sequences of outdoor scenes. All images have accurate ground-truth camera poses which are used to evaluate the descriptors for relative pose estimation.

Training data. The proposed GIFT is trained on a synthetic dataset. We randomly sample images from MS-COCO Lin et al. (2014) and warp images with reasonable homographies defined in Superpoint DeTone et al. (2018) to construct image pairs for training. When evaluating on the task of relative pose estimation, we further finetune GIFT on the GL3D Shen et al. (2018) dataset which contains real image pairs with ground truth correspondences given by a standard Structure-from-Motion (SfM) pipeline.

Metrics. To quantify the performance of correspondence estimation, we use the Percentage of Correctly Matched Keypoints (PCK) Long et al. (2014); Zhou et al. (2015), which is defined as the ratio between the number of correct matches and the total number of interest points. All matches are found by nearest-neighbor search. A matched point is declared correct if it is within five pixels of the ground truth location. To evaluate relative pose estimation, we use the rotation error as the metric, which is defined as the angle of $R_{err} = R_{pr} \cdot R_{gt}^{\top}$ in the axis-angle form, where $R_{pr}$ is the estimated rotation and $R_{gt}$ is the ground truth rotation. All testing images are resized to $480 \times 360$ in all experiments.
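The two metrics can be sketched as below. This is an illustrative implementation under stated assumptions: matches come from a brute-force nearest-neighbor search on descriptor distance, a match counts as correct when it lies at most five pixels from the ground truth (the text does not say whether the bound is inclusive), and the rotation error is reported in degrees. All function names are hypothetical.

```python
import math


def nearest_neighbor(desc_a, desc_b):
    """For each descriptor in desc_a, index of the closest descriptor in desc_b."""
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    return [min(range(len(desc_b)), key=lambda k: sq_dist(d, desc_b[k])) for d in desc_a]


def pck(matched_xy, gt_xy, threshold=5.0):
    """Fraction of interest points whose match lies within `threshold` pixels."""
    correct = sum(1 for (mx, my), (gx, gy) in zip(matched_xy, gt_xy)
                  if math.hypot(mx - gx, my - gy) <= threshold)
    return correct / len(gt_xy)


def rotation_error_deg(r_est, r_gt):
    """Angle of r_est * r_gt^T in axis-angle form, via trace = 1 + 2 cos(angle)."""
    trace = sum(r_est[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


matches = nearest_neighbor([[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.9], [0.9, 0.1]])
score = pck([(10, 10), (20, 20), (100, 100)], [(12, 11), (20, 26), (100, 100)])

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
err = rotation_error_deg(rot_z_90, identity)
```

The trace of the product of one matrix with the transpose of another equals the sum of their element-wise products, which is why no explicit matrix multiplication is needed.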

[Figure 3: qualitative results; panels, left to right: Superpoint DeTone et al. (2018)+GIFT, Superpoint DeTone et al. (2018), DoG Lowe (2004)+GIFT, DoG Lowe (2004)+GeoDesc Luo et al. (2018)]

### 4.2 Ablation study

We conduct ablation studies on HPSequences, ES-HP and ER-HP in three aspects, namely comparison to baseline models, choice of pooling operators and different numbers of group convolution layers. In all ablation studies, we use the keypoints detected by Superpoint DeTone et al. (2018) as interest points for evaluation. We denote the proposed method by GIFT-$l$, where $l$ is the number of group convolution layers. Architectures of the compared models can be found in the supplementary material. All tested models are trained with the same loss function and training data.

Baseline models. We consider three baseline models which all produce 128-dimensional descriptors, namely Vanilla CNN (VCNN), Group Fully Connected network (GFC) and Group Attention Selection network (GAS). VCNN has four vanilla convolution layers with three average pooling layers and outputs a 128-channel feature map. Descriptors are directly interpolated from the output feature map. GFC and GAS have the same group feature extraction module as GIFT. GFC replaces the group CNN in GIFT-1 with a two-layer fully connected network. GAS is similar to the model proposed in Wang et al. (2017), which tries to learn attention weights by CNNs to select a scale for each keypoint. GAS first transforms input group features to 128-dimension by a convolution layer. Then, it applies a two-layer fully connected network on the input group feature to produce attention weights. Finally, GAS uses the average of 128-dimensional embedded group features weighted by the attention weights as descriptors.

Dataset | VCNN | GFC | GAS | GIFT-1 |
---|---|---|---|---|
Illum-HP | 59.15 | 60.63 | 59.2 | 59.61 |
View-HP | 61.7 | 62.5 | 62.2 | 63.71 |
ES-HP | 14.9 | 16.58 | 18.28 | 21.74 |
ER-HP | 28.86 | 26.89 | 30.72 | 39.68 |

Dataset | avg | max | subspace | bilinear |
---|---|---|---|---|
Illum-HP | 57.72 | 54.31 | 47.21 | 59.61 |
View-HP | 62.52 | 58.16 | 49.36 | 63.71 |
ES-HP | 19.08 | 19.37 | 14.85 | 21.74 |
ER-HP | 36.15 | 32.57 | 29.12 | 39.68 |

Table 1 summarizes the results of the proposed method and the baseline models. The proposed method achieves the best performance on all datasets except Illum-HP. The Illum-HP dataset contains no viewpoint changes, which means that there is no permutation between the group features of two matched points. In this case, the GFC model, which directly compares the elements of two group features, achieves better performance. Compared to the baseline models, the significant improvements of GIFT-1 on ES-HP and ER-HP demonstrate the benefit of the proposed method in dealing with large scale and orientation changes.

Pooling operators. To illustrate the necessity of bilinear pooling, we test three other commonly used pooling operators, namely average pooling, max pooling and subspace pooling Hassner et al. (2012); Wang et al. (2014); Wei et al. (2018). For all these models, we apply the same group feature extraction module as GIFT. For average pooling and max pooling, the input group feature is fed into group CNNs to produce 128-dimensional group features, which are subsequently pooled with average pooling or max pooling to construct descriptors. For subspace pooling, we use a group CNN to produce a feature map with 16 channels, which results in 256-dimensional descriptors after subspace pooling. Results are listed in Table 2, which shows that bilinear pooling outperforms all other pooling operators.

Dataset | GIFT-1 | GIFT-3 | GIFT-6 |
---|---|---|---|
Illum-HP | 59.61 | 61.33 | 62.49 |
View-HP | 63.71 | 64.91 | 67.15 |
ES-HP | 21.74 | 23.9 | 27.29 |
ER-HP | 39.68 | 43.37 | 48.93 |

Number of group convolution layers. To further demonstrate the effect of group convolution layers, we test on different numbers of group convolution layers. All models use the same vanilla CNN but different group CNNs with 1, 3 or 6 group convolution layers. The results in the Table 3 show that the performance increases with the number of group convolution layers. In subsequent experiments, we use GIFT-6 as the default model and denote it with GIFT for short.

### 4.3 Comparison with state-of-the-art methods

Detector | Superpoint DeTone et al. (2018) | Superpoint DeTone et al. (2018) | DoG Lowe (2004) | DoG Lowe (2004) | DoG Lowe (2004) | LF-Net Ono et al. (2018) | LF-Net Ono et al. (2018) |
---|---|---|---|---|---|---|---|
Descriptor | GIFT | Superpoint DeTone et al. (2018) | GIFT | SIFT Lowe (2004) | GeoDesc Luo et al. (2018) | GIFT | LF-Net Ono et al. (2018) |
Illum-HP | 62.49 | 61.13 | 56.58 | 28.38 | 34.41 | 52.17 | 34.55 |
View-HP | 67.15 | 53.66 | 62.53 | 34.33 | 42.75 | 15.93 | 1.22 |
SUN3D | 27.32 | 26.4 | 19.97 | 15.2 | 14.53 | 21.73 | 12.93 |
ES-HP | 27.29 | 12.16 | 22.07 | 18.25 | 19.63 | 7.89 | 0.3 |
ER-HP | 48.93 | 24.77 | 44.44 | 29.39 | 37.36 | 12.50 | 0.05 |
ES-SUN3D | 12.37 | 5.94 | 7.40 | 4.09 | 3.42 | 7.61 | 0.55 |
ER-SUN3D | 22.29 | 14.01 | 15.77 | 15.16 | 15.39 | 15.98 | 10.59 |

We compare the proposed GIFT with three state-of-the-art methods, namely Superpoint DeTone et al. (2018), GeoDesc Luo et al. (2018) and LF-Net Ono et al. (2018). For all methods, we use their released pretrained models for comparison. Superpoint DeTone et al. (2018) localizes keypoints and interpolates descriptors of these keypoints directly on a feature map of a vanilla CNN. GeoDesc Luo et al. (2018) is a state-of-the-art patch descriptor which is usually incorporated with DoG detector for correspondence estimation. LF-Net Ono et al. (2018) provides a complete pipeline of feature detection and description. The detector network of LF-Net not only localizes keypoints but also estimates their scales and orientations. Then the local patches are fed into the descriptor network to generate descriptors. For fair comparison, we use the same keypoints as the compared method for evaluation. Results are summarized in Table 4, which shows that GIFT outperforms all other state-of-the-art methods. Qualitative results are shown in Fig. 3.

To further validate the robustness of GIFT to scaling and rotation, we add synthetic scaling and rotation to images in HPatches and report the matching performances under different scaling and rotations. The results are plotted in Fig. 5, which show that the PCK of GIFT drops slowly with the increase of scaling and rotation.

### 4.4 Performance for dense correspondence estimation

[Figure 4: qualitative results of dense correspondence estimation; panels, left to right: Reference, GIFT, VCNN, Daisy Philbin et al. (2010)]

Dataset | GIFT | VCNN | Daisy Philbin et al. (2010) |
---|---|---|---|
Illum-HP | 27.82 | 26.96 | 17.08 |
View-HP | 37.92 | 32.92 | 19.6 |
ES-HP | 12.52 | 4.64 | 1.05 |
ER-HP | 26.61 | 14.02 | 5.69 |

We also evaluate GIFT for the task of dense correspondence estimation on HPSequence, ES-HP and ER-HP. The quantitative results are listed in Table 5 and qualitative results are shown in Fig. 4. The proposed GIFT outperforms the baseline Vanilla CNN and the traditional method Daisy Philbin et al. (2010), which demonstrates the ability of GIFT for dense correspondence estimation.

### 4.5 Performance for relative pose estimation

We also evaluate GIFT for the task of relative pose estimation of image pairs on the MVS dataset Strecha et al. (2008). For a pair of images, we estimate the relative pose of cameras by matching descriptors and computing essential matrix. Since the estimated translations are up-to-scale, we only evaluate the estimated rotations using the metric of rotation error as mentioned in Section 4.1. We further finetune GIFT on the outdoor GL3D dataset Shen et al. (2018) and denote the finetuned model with GIFT-F. The results are listed in Table 6. GIFT-F outperforms all other methods on most sequences, which demonstrates the applicability of GIFT to real computer vision tasks.

Detector | DoG Lowe (2004) | DoG Lowe (2004) | DoG Lowe (2004) | Superpoint DeTone et al. (2018) | Superpoint DeTone et al. (2018) | Superpoint DeTone et al. (2018) |
---|---|---|---|---|---|---|
Descriptor | GIFT | GIFT-F | SIFT Lowe (2004) | GIFT | GIFT-F | Superpoint DeTone et al. (2018) |
Herz-Jesus-P8 | 0.656 | 0.582 | 0.662 | 0.848 | 0.942 | 1.072 |
Herz-Jesus-P25 | 4.968 | 2.756 | 5.296 | 4.484 | 2.891 | 2.87 |
Fountain-P11 | 0.821 | 1.268 | 0.587 | 1.331 | 1.046 | 1.071 |
Entry-P10 | 1.368 | 1.259 | 3.844 | 1.915 | 1.059 | 1.076 |
Castle-P30 | 3.431 | 1.741 | 2.706 | 1.526 | 1.501 | 1.588 |
Castle-P19 | 1.887 | 1.991 | 3.018 | 1.739 | 1.500 | 1.814 |
Average | 2.189 | 1.600 | 2.686 | 1.974 | 1.490 | 1.583 |

### 4.6 Running time

Given a $480 \times 360$ image and 1024 randomly distributed interest points in the image, the PyTorch Paszke et al. (2017) implementation of GIFT-6 takes about 65.2 ms on a desktop with an Intel i7 3.7GHz CPU and a GTX 1080 Ti GPU. Specifically, it takes 32.5 ms for image warping, 27.5 ms for processing all warped images with the vanilla CNN and 5.2 ms for group feature embedding by the group CNNs.
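The timing breakdown can be turned into a frame rate and per-stage shares with a few lines of arithmetic. The sketch below only uses the three stage timings quoted above; the variable names are illustrative.

```python
# Stage timings in milliseconds, as reported for GIFT-6.
stages_ms = {
    "image warping": 32.5,
    "vanilla CNN": 27.5,
    "group feature embedding": 5.2,
}

total_ms = sum(stages_ms.values())
frames_per_second = 1000.0 / total_ms

# Share of the total time spent in each stage, in percent.
shares = {name: 100.0 * ms / total_ms for name, ms in stages_ms.items()}
slowest_stage = max(stages_ms, key=stages_ms.get)
```

The total corresponds to roughly 15 frames per second, and image warping is the largest single cost, followed closely by the vanilla CNN, while the group CNNs add comparatively little.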

## 5 Conclusion

We introduced a novel dense descriptor named GIFT with provable invariance to a certain group of transformations. We showed that the group features, which are extracted on the transformed images, contain structures which are stable under the transformations and discriminative among different interest points. We adopted group CNNs to encode such structures and applied bilinear pooling to construct transformation-invariant descriptors. We reported state-of-the-art performance on the task of correspondence estimation on the HPSequences dataset, the SUN3D dataset and several new datasets with extreme scale and orientation changes.

Acknowledgement. The authors would like to acknowledge support from NSFC (No. 61806176), Fundamental Research Funds for the Central Universities and ZJU-SenseTime Joint Lab of 3D Vision.

## References

- [1] (2017) HPatches: a benchmark and evaluation of handcrafted and learned local descriptors. In CVPR, Cited by: §1, §4.1.
- [2] (2016) Learning local feature descriptors with triplets and shallow convolutional neural networks. In BMVC, Cited by: §2.
- [3] (2006) Surf: speeded up robust features. In ECCV, Cited by: §2.
- [4] (2018) Roto-translation covariant convolutional networks for medical image analysis. In MICCAI, Cited by: §2.
- [5] (2007) Automatic panoramic image stitching using invariant features. In ICCV, Cited by: §1.
- [6] (2012) BRIEF: computing a local binary descriptor very fast. T-PAMI 34 (7), pp. 1281–1298. Cited by: §2.
- [7] (2016) Universal correspondence network. In NeurIPS, Cited by: §1, §2, §3.4.
- [8] (2018) Spherical cnns. In ICLR, Cited by: §2.
- [9] (2017) Steerable cnns. In ICLR, Cited by: §2.
- [10] (2016) Group equivariant convolutional networks. In ICML, Cited by: §2, §3.2.
- [11] (2018) Superpoint: self-supervised interest point detection and description. In CVPR Workshops, Cited by: §1, §1, §2, Figure 3, §4.1, §4.2, §4.3, Table 4, Table 6.
- [12] (2018) An improved learning framework for covariant local feature detection. CoRR abs/1811.00438. Cited by: §2.
- [13] (2015) Domain-size pooling in local descriptors: dsp-sift. In CVPR, Cited by: §2.
- [14] (2018) Learning so (3) equivariant representations with spherical cnns. In ECCV, Cited by: §2.
- [15] (2018) Polar transformer networks. In ICLR, Cited by: §2.
- [16] (2019) Equivariant multi-view networks. arXiv preprint arXiv:1904.00993. Cited by: §2.
- [17] (2007) A visual bag of words method for interactive qualitative localization and mapping. In ICRA, Cited by: §1.
- [18] (2015) Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions. In CVPR, Cited by: §2.
- [19] (2015) MatchNet: unifying feature and metric learning for patch-based matching. In CVPR, Cited by: §2.
- [20] (2015) Hypercolumns for object segmentation and fine-grained localization. In CVPR, Cited by: §1, §2.
- [21] (2003) Multiple view geometry in computer vision. Cambridge University Press. Cited by: §1.
- [22] (2012) On SIFTs and their scales. In CVPR, Cited by: §1, §1, §2, §2, §3.3, §4.2.
- [23] (2018) Local descriptors optimized for average precision. In CVPR, Cited by: §2.
- [24] (2017) Warped convolutions: efficient invariance to spatial transformations. In ICML, Cited by: §2.
- [25] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §3.4.
- [26] (2015) Spatial transformer networks. In NeurIPS, Cited by: §2.
- [27] (2019) Equivariant transformer networks. In ICML, Cited by: §2.
- [28] (2017) Graph-based isometry invariant representation learning. In ICML, Cited by: §2.
- [29] (2016) Learning covariant feature detectors. In ECCV, Cited by: §1, §2.
- [30] (2018) Large scale evaluation of local image feature detectors on homography datasets. In BMVC, Cited by: §1, §4.1.
- [31] (2014) Microsoft COCO: common objects in context. In ECCV, Cited by: §4.1.
- [32] (2015) Bilinear CNN models for fine-grained visual recognition. In ICCV, Cited by: §3.3, §3.
- [33] (1998) Feature detection with automatic scale selection. IJCV 30 (2), pp. 79–116. Cited by: §1, §2.
- [34] (2013) Scale-space theory in computer vision. Vol. 256, Springer Science & Business Media. Cited by: §2.
- [35] (2014) Do convnets learn correspondence? In NeurIPS, Cited by: §4.1.
- [36] (2004) Distinctive image features from scale-invariant keypoints. IJCV 60 (2), pp. 91–110. Cited by: §1, §2, Figure 3, Table 4, Table 6.
- [37] (2018) GeoDesc: learning local descriptors by integrating geometry constraints. In ECCV, Cited by: §1, §2, Figure 3, §4.3, Table 4.
- [38] (2017) Rotation equivariant vector field networks. In ICCV, Cited by: §2.
- [39] (2004) Scale & affine invariant interest point detectors. IJCV 60 (1), pp. 63–86. Cited by: §1, §2.
- [40] (2017) Working hard to know your neighbor’s margins: local descriptor learning loss. In NeurIPS, Cited by: §2.
- [41] (2018) Repeatability is not enough: learning affine regions via discriminability. In ECCV, Cited by: §2.
- [42] (2009) ASIFT: a new framework for fully affine invariant image comparison. SIAM Journal on Imaging Sciences 2 (2), pp. 438–469. Cited by: §2.
- [43] (2015) ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans. Robot. 31 (5), pp. 1147–1163. Cited by: §1.
- [44] (2010) Rectified linear units improve restricted Boltzmann machines. In ICML, Cited by: §3.4.
- [45] (2018) LF-Net: learning local features from images. In NeurIPS, Cited by: §2, §4.3, Table 4.
- [46] (2017) Automatic differentiation in PyTorch. In NeurIPS Workshops, Cited by: §4.6.
- [47] (2010) Descriptor learning for efficient retrieval. In ECCV, Cited by: §1, Figure 4, §4.4, Table 5.
- [48] (2011) ORB: an efficient alternative to SIFT or SURF. In ICCV, Cited by: §2.
- [49] (2015) FaceNet: a unified embedding for face recognition and clustering. In CVPR, Cited by: §3.4.
- [50] (2018) Matchable image retrieval by learning from surface reconstruction. In ACCV, Cited by: §4.1, §4.5.
- [51] (2008) On benchmarking camera calibration and multi-view stereo for high resolution imagery. In CVPR, Cited by: §4.1, §4.5, Table 6.
- [52] (2017) L2-Net: deep learning of discriminative patch descriptor in Euclidean space. In CVPR, Cited by: §2.
- [53] (2016) Instance normalization: the missing ingredient for fast stylization. CoRR abs/1607.08022. Cited by: §3.4.
- [54] (2017) AutoScaler: scale-attention networks for visual correspondence. In BMVC, Cited by: §4.2.
- [55] (2014) Affine subspace representation for feature description. In ECCV, Cited by: §1, §2, §3.3, §4.2.
- [56] (2018) Kernelized subspace pooling for deep local descriptors. In CVPR, Cited by: §4.2.
- [57] (2018) Learning steerable filters for rotation equivariant CNNs. In CVPR, Cited by: §2.
- [58] (2017) Harmonic networks: deep translation and rotation equivariance. In CVPR, Cited by: §2.
- [59] (2013) SUN3D: a database of big spaces reconstructed using SfM and object labels. In ICCV, Cited by: §1, §4.1.
- [60] (2016) Accumulated stability voting: a robust descriptor from descriptors of multiple scales. In CVPR, Cited by: §1, §2, §3.3.
- [61] (2016) LIFT: learned invariant feature transform. In ECCV, Cited by: §2.
- [62] (2016) Learning to assign orientations to feature points. In CVPR, Cited by: §2.
- [63] (2015) Learning to compare image patches via convolutional neural networks. In CVPR, Cited by: §2.
- [64] (2017) Learning discriminative and transformation covariant local feature detectors. In CVPR, Cited by: §2.
- [65] (2015) FlowWeb: joint image set alignment by weaving consistent, pixel-wise correspondences. In CVPR, Cited by: §4.1.