# GLAMpoints: Greedily Learned Accurate Match points

## Abstract

We introduce a novel CNN-based feature point detector - Greedily Learned Accurate Match Points (GLAMpoints) - learned in a semi-supervised manner. Our detector extracts repeatable, stable interest points with a dense coverage, specifically designed to maximize the correct matching in a specific domain, which is in contrast to conventional techniques that optimize indirect metrics. In this paper, we apply our method on challenging retinal slitlamp images, for which classical detectors yield unsatisfactory results due to low image quality and insufficient amount of low-level features. We show that GLAMpoints significantly outperforms classical detectors as well as state-of-the-art CNN-based methods in matching and registration quality for retinal images. Our method can also be extended to other domains, such as natural images.

## 1 Introduction

Digital fundus images of the human retina are widely used to diagnose a variety of eye diseases, such as Diabetic Retinopathy (DR), glaucoma, and Age-related Macular Degeneration (AMD) [41, 52]. For retinal images acquired during the same session and presenting small overlaps, image registration can be used to create mosaics depicting larger areas of the retina. Through image mosaicking, ophthalmologists can display the retina in one large picture, which is helpful during diagnosis and treatment planning. Besides, mosaicking of retinal images taken at different time points has been shown to be important for monitoring the progression or identification of eye diseases. More importantly, fundus image registration has been explored in eye laser treatment for DR. It allows real-time tracking of the vessels during surgical operations to ensure accurate application of the laser on the retina and minimal damage to the healthy tissues.

Mosaicking usually relies on extracting repeatable interest points from the images, matching the correspondences and searching for transformations relating them. As a result, the keypoint detection is the first and the most crucial stage of this pipeline, as it conditions all further steps and therefore the success of the registration.

At the same time, classical feature detectors are general-purpose and manually optimized for outdoor, in-focus, low-noise images with sharp edges and corners. They usually fail to work with medical images, which can be distorted, noisy, have no guarantee of focus and depict soft tissue with no sharp edges (see Figure 3). Traditional methods perform sub-optimally on such images, making more sophisticated optimization necessary at a later step in the registration, such as Random Sample Consensus (RanSaC) [21], bundle adjustment [43] and Simultaneous Localization and Mapping (SLAM) [16] techniques. Besides, supervised learning methods for keypoint detection fail or are not applicable, due to missing ground truths for feature points.

In this paper we present a method for learning feature points in a semi-supervised manner. Learned feature detectors were shown to outperform the heuristics-based methods, but they are usually optimized for repeatability, which is only a proxy for the matching quality, and as a result they may underperform during the final matching. In contrast, our keypoints - GLAMpoints - are trained for the final matching accuracy and, when associated with the Scale-Invariant Feature Transform (SIFT) [36] descriptor, they outperform the state-of-the-art in matching performance and registration quality on retinal images. As shown in Figure 1, GLAMpoints produces significantly more correct matches than the SIFT detector.

Registration based on feature points is inherently non-differentiable due to point matching and transformation estimations. We take inspiration from the loss formulation in Reinforcement Learning (RL), using a reward to compute the suitability of the detected keypoints based on the final registration quality. This makes it possible to use the key performance measure, i.e. matching power, to directly train a Convolutional Neural Network (CNN). Our contribution is therefore a formulation for keypoint detection that is directly optimized for the final matching performance in an image domain. Both training code and model weights are available at https://gitlab.com/retinai_sandro/glampoints.

The remainder of this paper is organized as follows: we introduce the current state-of-the-art feature detection methods in section 2, our training procedure and loss in section 3, followed by experimental comparison of previous methods in section 4 and conclusion in section 5.

## 2 Related Work

Existing registration algorithms can be classified as area-based and feature-based approaches. The former typically rely on a similarity metric such as cross-correlation [14], mutual information [37, 28] or phase correlation [47] to compare the intensity patterns of an image pair and estimate the transformation. However, in the case of changes in illumination or small overlapping areas, the application of area-based approaches becomes challenging or infeasible. Conversely, feature-based methods extract corresponding points on pairs of images along with a set of features and search for a transformation that minimizes the distance between the detected key points. Compared with area-based registration techniques, they are more robust to changes of intensity, scale and rotation and therefore, they are considered more appropriate for problems such as medical image registration.

Typically, feature extraction and matching of two images comprise four steps: detection of interest points, computation of a feature descriptor for each of them, matching of corresponding keypoints and estimation of a transformation between the images using the matches. As can be seen, the detection step influences every further step and is therefore crucial for a successful registration. It requires a high image coverage and stable key points in low-contrast images.

In the literature, local interest point detectors have been thoroughly studied. SIFT [31] is probably the most well known detector/descriptor in computer vision. It computes corners and blobs on different scales to achieve scale invariance and extracts descriptors using the local gradients. Speeded-Up Robust Features (SURF) [10] is a faster alternative, using Haar filters and integral images, while KAZE [3] exploits non-linear scale space for more accurate keypoint detection.

In the field of fundus imaging, a widely used technique relies on vascular trees and branch point analysis [30, 23]. However, accurate segmentation of the vascular trees is challenging and registration often fails on images with few vessels. Alternative registration techniques are based on matching repeatable local features; Chen et al. [13] detected Harris corners [24] on low quality multi-modal retinal images and assigned them a partial intensity invariant feature (Harris-PIIFD) descriptor. They achieved good results on low quality images with an overlapping area greater than 30%, but the method is characterised by low repeatability. Wang et al. [46] used SURF features to increase the repeatability and introduced a new method for point matching to reject a large number of outliers, but the success rate drops significantly when the overlapping area diminishes below 50%. Cattin et al. [11] also demonstrated that SURF can be efficiently used to create mosaics of retina images even for cases with no discernible vascularisation. However, this technique only appeared successful in the case of highly self-similar images. The D-saddle detector/descriptor [38] was shown to outperform the previous methods in terms of rate of successful registration on the Fundus Image Registration (FIRE) Dataset [25], enabling the detection of interest points on low quality regions.

Recently, with the advent of deep learning, learned detectors based on CNN architectures were shown to outperform state-of-the-art computer vision detectors [20, 18, 50, 34, 8]. Learned Invariant Feature Transform (LIFT) [50] uses patches to train a fully differentiable deep CNN for interest point detection, orientation estimation and descriptor computation based on supervision from classical Structure from Motion (SfM) systems. SuperPoint [18] introduced a self-supervised framework for training interest point detectors and descriptors. It achieves state-of-the-art homography estimation results on HPatches [9] when compared to SIFT, LIFT and Oriented Fast and Rotated Brief (ORB) [40]. The training procedure is, however, complicated and their self-supervision implies that the network can only find corner points. Altwaijry et al. [5] proposed a two-step CNN for matching aerial image patches, which is a particularly challenging task due to ultra-wide baseline. Altwaijry et al. [6] also introduced a method to detect keypoint locations on different scales, utilizing high activations in recursive network feature maps. KCNN [19] was shown to emulate hand-crafted detectors by training small networks using keypoints detected by other methods as ground-truth. Local Feature Network (LF-NET) [34] is the closest to our method: a keypoint detector and descriptor is trained end-to-end in a two branch set-up, one being differentiable and feeding on the output of the other non-differentiable branch. However, they optimized their detector for repeatability between image pairs, not taking into account the matching performance.

Truong et al. [44] presented an evaluation of SURF, KAZE, ORB, Binary Robust Invariant Scalable Keypoints (BRISK) [29], Fast Retina Keypoint (FREAK) [2], LIFT, SuperPoint and LF-NET both in terms of image matching and registration quality on retinal fundus images. They found that while SuperPoint outperforms all the others relative to the matching performance, LIFT demonstrates the highest results in terms of registration quality, closely followed by KAZE and SIFT. The highlighted issue was that even the best-performing detectors produce feature points which are densely positioned and as a result may be associated with a similar descriptor. This can lead to false matches and thus inaccurate or failed registrations.

Our goal is to tackle this problem by introducing a novel semi-supervised learned method for keypoint detection. Detectors are often optimized for repeatability (such as LF-NET [34]) and not for the quality of the associated matches between image pairs. Our training procedure uses a reward concept akin to RL to extract repeatable, stable interest points with a uniform coverage and it is specifically designed to maximize correct matching on a specific domain, as shown for challenging retinal slit lamp images.

## 3 Methods

Our trained network predicts the location of stable interest points, called GLAMpoints, on a full-sized gray-scale image. In this section, we explain how our training set was produced and describe our training procedure. As we used a standard convolutional network architecture, we only briefly discuss it at the end.

### 3.1 Dataset

We trained our model on a dataset from the ophthalmic field, namely slit lamp fundus videos, used in laser treatment (examples in Figure 3). In this application, live registration is required for an accurate ablation of the retinal tissue. Our training dataset consists of 1336 images with different resolutions, ranging from   to   by   to   . These images were acquired with multiple cameras and devices to cover large variability of appearances. They come from eye examination of 10 different patients, who were healthy or with diabetic retinopathy.

From the original fundus images, image pairs are synthetically created and used for training. Let $B$ be a particular base image from the training dataset. At every training step, an image pair $(I, I')$ is generated from $B$ by applying two separate, randomly sampled homography transforms. Images $I$ and $I'$ are thus related by a homography $H_{I,I'}$ (see supplementary material). On top of the geometric transformations, standard data augmentation methods are used: Gaussian noise, changes of contrast, illumination, gamma, motion blur and image inversion. A subset of these appearance transformations is randomly chosen for each image of the pair.
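The relation between the two images of a pair can be made concrete with a small sketch. It assumes that the two sampled transforms are called `g` and `g_prime` (hypothetical names) and that each maps base-image coordinates to coordinates in the generated image; under that assumption the homography relating the pair is `g_prime` composed with the inverse of `g`. The matrices below are illustrative values, not the ones used for training, and plain Python lists are used only to keep the example dependency-free.

```python
def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def inverse3(m):
    """Invert a 3x3 matrix with the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]


def apply_h(h, pt):
    """Apply a homography to a 2D point using homogeneous coordinates."""
    x, y = pt
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (u / w, v / w)


def relating_homography(g, g_prime):
    """Homography mapping I = g(B) onto I' = g'(B), i.e. g' composed with g^-1."""
    return matmul3(g_prime, inverse3(g))


# Two example transforms of the base image B (illustrative values only).
g = [[0.98, -0.17, 12.0], [0.17, 0.98, -5.0], [1e-4, 0.0, 1.0]]
g_prime = [[1.05, 0.10, -8.0], [-0.10, 1.05, 15.0], [0.0, -1e-4, 1.0]]
H = relating_homography(g, g_prime)

base_pt = (120.0, 80.0)                    # a pixel location in B
pt_in_I = apply_h(g, base_pt)              # the same point seen in I
pt_in_I_prime = apply_h(g_prime, base_pt)  # the same point seen in I'
mapped = apply_h(H, pt_in_I)               # I -> I' through the relating homography
```

Here `mapped` coincides with `pt_in_I_prime` up to floating-point error, which is what allows a ground truth correspondence to be known for every pixel of a synthetic pair.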

### 3.2 Training

We define our learned function $f_\theta(I)$, whose output is the pixel-wise feature point probability map, of the same size as the input image $I$. Lacking a direct ground truth of keypoint locations, a delayed reward can be computed instead. We base this reward on the matching success, computed after registration. The training proceeds as follows:

1. Given a pair of images $I$ and $I'$ related by the ground truth homography $H_{I,I'}$, our model provides a score map for each image, $f_\theta(I)$ and $f_\theta(I')$.

2. The locations of interest points are extracted on both score maps using standard non-differentiable Non-Max-Suppression (NMS) with a fixed window size.

3. A 128-dimensional root-SIFT [7] feature descriptor is computed for each detected keypoint.

4. The keypoints from image $I$ are matched to those of image $I'$ and vice versa using a brute force matcher [35]. Only the matches that are found in both directions are kept.

5. The matches are checked according to the ground truth homography $H_{I,I'}$. A match is defined as true positive if the corresponding keypoint $x$ in image $I$ falls into an $\varepsilon$-neighborhood of the point $x'$ in $I'$ after applying $H_{I,I'}$. This is formulated as $\|H_{I,I'} \ast x - x'\| < \varepsilon$, where $\varepsilon$ is a fixed distance threshold in pixels.
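Steps 4 and 5 above can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used for training: the function names, the toy descriptors and the tolerance `eps=3.0` are assumptions made for the example.

```python
import math


def nearest(desc, candidates):
    """Index of the candidate descriptor closest to desc (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda j: math.dist(desc, candidates[j]))


def mutual_matches(desc_a, desc_b):
    """Brute-force matching that keeps only pairs found in both directions."""
    a_to_b = [nearest(d, desc_b) for d in desc_a]
    b_to_a = [nearest(d, desc_a) for d in desc_b]
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


def apply_h(h, pt):
    """Apply a 3x3 homography (nested lists) to a 2D point."""
    x, y = pt
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def true_positives(matches, kp_a, kp_b, h, eps):
    """Matches whose keypoint in I lands within eps of its partner in I' under h."""
    return [(i, j) for i, j in matches
            if math.dist(apply_h(h, kp_a[i]), kp_b[j]) < eps]


# Toy example: I' is I shifted by 5 px in x; the third keypoint is a wrong match.
h_gt = [[1.0, 0.0, 5.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
kp_a = [(10.0, 10.0), (50.0, 20.0), (30.0, 40.0)]
kp_b = [(15.0, 10.0), (55.0, 20.0), (90.0, 90.0)]
desc_a = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
desc_b = [(0.9, 0.1), (0.1, 0.9), (1.0, 0.9)]

matches = mutual_matches(desc_a, desc_b)
tps = true_positives(matches, kp_a, kp_b, h_gt, eps=3.0)
```

In this toy case all three descriptor pairs are mutual nearest neighbours, but only the first two survive the geometric check, so only those two keypoints would be rewarded.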

Let $T$ denote the set of true positive key points. If a given detected feature point ends up in the set of true positive points, it gets a positive reward. All other points/pixels are given a reward of 0. Consequently, the reward matrix $R$ can be defined for each pixel $(x, y)$ as follows:

$$R_{x,y} = \begin{cases} 1, & \text{for } (x,y) \in T \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

This leads to the following loss function:

$$L_{simple}(\theta, I) = \sum \left(f_\theta(I) - R\right)^2 \tag{2}$$

However, a major drawback of this formulation is the large class imbalance between positively rewarded points and null-rewarded ones, where the latter prevail by far, especially in the first stages of training. Given a reward with mostly zero values, $f_\theta(I)$ converges to a zero output. Hard mining has been shown to boost training of descriptors [42]. Thus, negative hard mining on the false positive matches might also enhance performance in our method, but has not been investigated in this work.

Instead, to counteract the imbalance, we use sample mining: we select all true positive points and randomly sample an equal number of additional points from the set of false positives. We only back-propagate through the true positive feature points and mined false positive key points. If there are more true positives than false positives, gradients are backpropagated through all found matches. This mining is mathematically formulated as a binary pixel-wise mask $M$, equal to 1 at the locations of the true positive key points and of the subset of mined feature points, and equal to 0 otherwise. The final loss is thus formulated as follows:

$$L(\theta, I) = \frac{\sum \left(f_\theta(I) - R\right)^2 \cdot M}{\sum M} \tag{3}$$

where $\cdot$ denotes the element-wise multiplication.
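To make the reward, the mask and the masked loss concrete, here is a minimal sketch on a tiny score map. It assumes integer pixel coordinates given as `(x, y)` and assumes that as many false positives as true positives are mined (all of them if fewer are available); the function names are hypothetical, and a real training loop would express the same computation on tensors.

```python
import random


def reward_and_mask(shape, true_pos, false_pos, rng):
    """Build the reward map R (Eq. 1) and the binary mining mask M."""
    height, width = shape
    R = [[0.0] * width for _ in range(height)]
    M = [[0.0] * width for _ in range(height)]
    for x, y in true_pos:
        R[y][x] = 1.0
        M[y][x] = 1.0
    # Assumption: mine as many false positives as there are true positives,
    # or all of them when fewer false positives exist.
    n_mined = min(len(true_pos), len(false_pos))
    for x, y in rng.sample(false_pos, n_mined):
        M[y][x] = 1.0
    return R, M


def masked_loss(score, R, M):
    """Eq. 3: squared error restricted to the mask, normalised by the mask sum."""
    num = sum((s - r) ** 2 * m
              for s_row, r_row, m_row in zip(score, R, M)
              for s, r, m in zip(s_row, r_row, m_row))
    den = sum(sum(row) for row in M)
    return num / den if den > 0 else 0.0


score = [[0.5] * 4 for _ in range(4)]   # an untrained, uninformative score map
true_pos = [(1, 1)]                     # keypoints that produced correct matches
false_pos = [(2, 3), (0, 0)]            # keypoints that produced wrong matches
R, M = reward_and_mask((4, 4), true_pos, false_pos, random.Random(0))
loss = masked_loss(score, R, M)
```

Only two pixels contribute here (one true positive, one mined false positive), each with a squared error of 0.25, so the unrewarded background no longer dominates the loss.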

An overview of the training steps is given in Figure 2. Importantly, only step 1 is differentiable with respect to the loss. We learn directly on a reward which is the result of non-differentiable actions, without supervision.

It should be noted that the descriptor we used is the root-SIFT version without rotation invariance. The reason is that it performs better on slitlamp images than root-SIFT detector/descriptor with rotation invariance (see supplementary material for details). The aim of this paper is to investigate the detector only and therefore we used rotation-dependent root-SIFT for consistency.
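For readers unfamiliar with root-SIFT, the descriptor transform itself is small: the SIFT vector is L1-normalised and the element-wise square root is taken, so that Euclidean distances between the results correspond to Hellinger distances between the original histograms. The sketch below assumes a non-negative input vector, as SIFT histograms are; the function name and the `eps` guard are choices made for the example.

```python
import math


def root_sift(descriptor, eps=1e-7):
    """L1-normalise a non-negative SIFT vector, then take element-wise square roots."""
    l1 = sum(abs(v) for v in descriptor) + eps  # eps guards against an all-zero vector
    return [math.sqrt(v / l1) for v in descriptor]


example = root_sift([4.0, 0.0, 12.0, 0.0])
l2_norm = math.sqrt(sum(v * v for v in example))
```

The output has unit L2 norm up to the `eps` guard, so it can be fed directly to a matcher that uses Euclidean distance.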

### 3.3 Network

A standard 4-level deep Unet [39] with a final sigmoid activation was used to learn $f_\theta$. It comprises 3x3 convolution blocks with batch normalization and Rectified Linear Unit (ReLU) activations (see Figure 2,c). Since the task of keypoint detection is similar to pixel-wise binary segmentation (class interest point or not), Unet was a promising choice due to its past successes in binary and semantic segmentation tasks.

## 4 Results

In this section, we describe the testing dataset and the evaluation protocol. We then compare state-of-the-art detectors, quantitatively and qualitatively to our proposed GLAMpoints.

### 4.1 Testing datasets

In this study we used the following test datasets:

1. The slit lamp dataset: from retinal videos of 3 patients (different from the ones used for training), a random set of 206 frame pairs was selected as testing samples, with size   to   by   to   . Examples are shown in Figure 3. The pairs were selected to have an overlap ranging from 20 to 100%. They are related by affine transformations and rotations up to 15 degrees. Using a dedicated software tool, all pairs of images were manually annotated following common procedures [12] with at least 5 corresponding points, which were then used to estimate the ground truth homographies relating the pairs. As the slit lamp images depict small area of retina, it is justified to apply the planar assumption in generating homographies [11, 22].

2. The FIRE dataset [25]: a publicly available retinal image registration dataset with ground truth annotations. It consists of 129 retinal images forming 134 image pairs. The original images of 2912x2912 pixels were down-scaled to 15% of their original size, to match the resolution of the training set. Examples of such images are shown in Figure 5.

As a pre-processing step for testing on fundus images, we isolated the green channel, applied adaptive histogram equalization and a bilateral filter to reduce noise and enhance the appearance of edges as proposed in [17]. The effect of pre-processing can be seen in Figure 1.

Even though the focus of this paper is on the retinal images, we also tested the generalization capabilities of our model by evaluating it on natural images. We used the  [33],  [53],  [45, 26] and  [51] datasets. More details are provided in the supplementary material.

### 4.2 Evaluation criteria

We evaluated the performance using the following metrics:

1. Repeatability describes the percentage of detected points $x \in P$ in image $I$ that are within an $\varepsilon$-distance of points $x' \in P'$ in $I'$ after transformation with $H_{I,I'}$, where $P$ and $P'$ are the sets of extracted points on both images:

   $$\frac{\left|\{x \in P,\ x' \in P' \mid \|H_{I,I'} \ast x - x'\| < \varepsilon\}\right|}{|P| + |P'|} \tag{4}$$

2. Matching performance. Matches were found using the Nearest Neighbor Distance Ratio (NNDR) strategy, as proposed in [31]: two keypoints are matched if the descriptor distance ratio between the first and the second nearest neighbor is below a certain threshold. Then, the following metrics were evaluated:

   1. AUC, the area under the ROC curve created by varying the value of this threshold, following [15, 49, 48].

   2. M.score, the ratio of correct matches over the total number of keypoints extracted by the detector in the shared viewpoint region [32].

   3. Coverage fraction, which measures the coverage of an image by correctly matched key points. A coverage mask was generated from true positive key points, each one adding a disk of fixed radius (25px) as in [4].

We computed the homography relating the reference to the transformed image by applying RanSaC algorithm to remove outliers from the detected matches.

3. Registration success rate. We furthermore evaluated the registration accuracy achieved after using key points computed by different detectors as in [13, 46]. To do so, we compared the reprojection error of six fixed points of the reference image onto the other. For each image pair for which a homography was found, the quality of the registration was assessed with the median error (MEE) and the maximum error (MAE) of the distances between corresponding points after transformation.

Using these metrics, we defined different thresholds on MEE and MAE that define failed, acceptable and inaccurate registrations. We consider a registration failed if not enough keypoints or matches were found to compute a homography (minimum 4), if it involves a flip or if the estimated scaling component is greater than 4 or smaller than a fixed lower bound. We classified the result as acceptable when both MEE and MAE fall below their respective thresholds and as inaccurate otherwise. The values for the thresholds were found empirically by post-viewing the results. Using the above definitions, we calculated the success rate of each class, equal to the percentage of image pairs for which the registration falls into each category. These metrics are the most important quantitative evaluation criteria of the overall performance in a real-world setting.
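The repeatability and registration-error computations can be sketched as below. Several points are assumptions made for the example: repeatability follows Eq. 4 read literally (counting close pairs), the reprojection error is taken as the distance between control points mapped by the estimated and by the ground truth homography, and the thresholds passed to `classify` are placeholders rather than the values used in the evaluation.

```python
import math
from statistics import median


def apply_h(h, pt):
    """Apply a 3x3 homography (nested lists) to a 2D point."""
    x, y = pt
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def repeatability(points, points_prime, h, eps):
    """Eq. 4 read literally: pairs (x, x') with ||H x - x'|| < eps over |P| + |P'|."""
    total = len(points) + len(points_prime)
    if total == 0:
        return 0.0
    close = sum(1 for x in points for xp in points_prime
                if math.dist(apply_h(h, x), xp) < eps)
    return close / total


def registration_errors(h_est, h_gt, control_points):
    """Median (MEE) and maximum (MAE) distance between control points mapped
    by the estimated and by the ground truth homography."""
    d = [math.dist(apply_h(h_est, p), apply_h(h_gt, p)) for p in control_points]
    return median(d), max(d)


def classify(mee, mae, mee_max, mae_max):
    """Label a non-failed registration; thresholds are supplied by the caller."""
    return "acceptable" if mee < mee_max and mae < mae_max else "inaccurate"


h_gt = [[1.0, 0.0, 5.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # true shift: 5 px
h_est = [[1.0, 0.0, 6.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # estimate: 6 px
control = [(0, 0), (100, 0), (0, 100), (100, 100), (50, 50), (25, 75)]

rep = repeatability([(0, 0), (10, 0)], [(5, 0), (40, 40)], h_gt, eps=3.0)
mee, mae = registration_errors(h_est, h_gt, control)
label = classify(mee, mae, mee_max=2.0, mae_max=3.0)  # placeholder thresholds
```

With a pure one-pixel translation error every control point is off by exactly one pixel, so MEE and MAE are both 1 and the pair is labelled acceptable under the placeholder thresholds.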

### 4.3 Baselines and implementation details

To evaluate the performance of our GLAMpoints detector associated with the root-SIFT descriptor, we compared its matching ability and registration quality against well known detectors and descriptors. Among them, SIFT [36], KAZE [3] and LIFT [50] were shown to perform well on fundus images by Truong et al. [44]. Moreover, we compared our method to other CNN-based detectors-descriptors: LF-NET [34] and SuperPoint [18]. We used the authors’ implementation of LIFT (pretrained on Piccadilly), SuperPoint and LF-NET (pretrained on indoor data, which gives better results on fundus images than the version pretrained on outdoor data) and the OpenCV implementation for SIFT and KAZE. A rotation-dependent version of the root-SIFT descriptor is used due to its better performance on our test set compared to the rotation invariant version. For the remainder of the paper, SIFT descriptor refers to rotation-dependent root-SIFT, unless otherwise stated.

Training of GLAMpoints was performed using Tensorflow [1] with a mini-batch size of 5 and the Adam optimizer [27] with $(\beta_1, \beta_2) = (0.9, 0.999)$ for 35 epochs. For each batch we randomly cropped patches of the full-resolution image to speed up the computation. GLAMpoints (NMS10) was trained and tested with an NMS window equal to 10px. It must be noted that other NMS windows can be applied, which obtain similar performance.
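As an illustration of the NMS step, the following sketch keeps a pixel only if it is the maximum of its local neighbourhood. It is a minimal plain-Python version under stated assumptions: `window` is interpreted as a radius around each pixel (the text does not say whether the window is a radius or a full width), tied maxima are all kept, and the `threshold` parameter is an addition for the example.

```python
def non_max_suppression(score, window, threshold=0.0):
    """Return (x, y) of pixels that are the maximum of their local neighbourhood."""
    height, width = len(score), len(score[0])
    keypoints = []
    for y in range(height):
        for x in range(width):
            s = score[y][x]
            if s <= threshold:
                continue  # ignore pixels with no response
            neighbourhood = (
                score[j][i]
                for j in range(max(0, y - window), min(height, y + window + 1))
                for i in range(max(0, x - window), min(width, x + window + 1))
            )
            if s >= max(neighbourhood):
                keypoints.append((x, y))
    return keypoints


score_map = [[0.0] * 5 for _ in range(5)]
score_map[1][1] = 0.9   # strong peak
score_map[1][2] = 0.5   # weaker neighbour, suppressed by the peak
score_map[4][4] = 0.7   # isolated peak

small_window = non_max_suppression(score_map, window=1)
large_window = non_max_suppression(score_map, window=3)
```

A larger window suppresses the isolated peak at `(4, 4)` as well, which shows how the window size trades keypoint density against spatial spread.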

### 4.4 Quantitative results on the slit lamp dataset

Table 1 presents the success rate of registration evaluated on the slit lamp dataset. Without pre-processing, most detectors show lower performance compared to the pre-processed images, but GLAMpoints performs well even on raw images. While the success rate of acceptable registrations of SIFT, KAZE and SuperPoint drops by 20 to 30% between pre-processed and raw images, GLAMpoints as well as LIFT and LF-NET show only a decrease of 3 to 6%. Besides, LF-NET, LIFT and GLAMpoints detect a steady average number of keypoints (around 485 for pre-processed and 350 for non-pre-processed images) independently of the pre-processing, whereas the other detectors see a reduction by half. In general, GLAMpoints shows the highest performance for both raw and pre-processed images in terms of registration success rate. The robust results of our method indicate that while our detector performs as well or better on good quality images compared to the heuristic-based methods, its performance does not drop on lower quality images.

While SIFT extracts a large number of keypoints (205.69 on average for unprocessed images and 431.03 for pre-processed), they appear in clusters as shown in Figure 1. As a result, even if the repeatability is relatively high, the close positioning of the interest points leads to a large number of rejected matches, as the nearest-neighbours are very close to each other. This is evidenced by the low coverage fraction, M.score and AUC (Figure 4). With a similar value of repeatability, our approach extracts interest points widely spread and trained for their matching ability (highest coverage fraction), resulting in more true positive matches (second highest M.score and AUC), as shown in Figure 4.
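The effect of clustering on the coverage fraction can be illustrated with a short sketch. It assumes integer keypoint coordinates given as `(x, y)` and rasterises one disk of the stated 25 px radius per correctly matched keypoint; the function name and the toy keypoints are made up for the example.

```python
def coverage_fraction(shape, keypoints, radius=25):
    """Fraction of pixels within `radius` of at least one correctly matched keypoint."""
    height, width = shape
    covered = [[False] * width for _ in range(height)]
    r2 = radius * radius
    for kx, ky in keypoints:
        for y in range(max(0, ky - radius), min(height, ky + radius + 1)):
            for x in range(max(0, kx - radius), min(width, kx + radius + 1)):
                if (x - kx) ** 2 + (y - ky) ** 2 <= r2:
                    covered[y][x] = True
    return sum(map(sum, covered)) / (height * width)


clustered = coverage_fraction((100, 100), [(50, 50), (51, 50), (50, 51)])
spread = coverage_fraction((100, 100), [(25, 25), (75, 25), (50, 75)])
```

Both calls use three keypoints, yet the clustered set covers far less of the image because its disks overlap almost entirely.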

LF-NET, similar to SIFT, shows high repeatability, which can be explained by its training strategy, which favours repeatability over an accurate matching objective. However, its M.score and AUC are in the bottom part of the ranking (Figure 4). While the performance of LF-NET might increase if it were trained on fundus images, its training procedure requires image pairs with their relative pose and corresponding depth maps, which would be extremely difficult - if not impossible - to obtain for fundus images.

It is worth noting that SuperPoint scored the highest M.score and AUC, but in this case the metrics are artificially inflated because very few keypoints are detected (35.88 and 59.21 on average for raw and pre-processed images respectively). This translates to a relatively small coverage fraction and one of the lowest repeatability values, leading to few possible correct matches.

As part of an ablation study, we trained GLAMpoints with different descriptors (Table 1b, top). While it performs best with the SIFT descriptor, the results show that for every considered descriptor (SIFT, ORB, BRISK), GLAMpoints improves upon the corresponding original detector.

To benchmark the detection results, we used the descriptors that were developed/trained jointly with the given detector and thus can be considered as optimal. For instance in [50], the combination of the LIFT/LIFT detector/descriptor outperformed LIFT/SIFT. For completeness, we present the registration results of baseline detectors combined with root-SIFT descriptor in Table 1b, center. As can be seen, using root-SIFT descriptor does not improve the result compared to the original descriptor.

Finally, to verify that the performance gain of GLAMpoints does not come solely from the uniform and dense spread of the detected keypoints, we computed the success rate for keypoints in a random, uniformly distributed grid (Table 1b, bottom), which underperforms in comparison. This shows that our detector predicts not only uniform but also significant points.

### 4.5 Quantitative results on FIRE dataset

Table 2 shows the results for success rates of registrations on FIRE. Our method outperforms baselines both in terms of success rate and global accuracy of non-failed registrations. As all the images in FIRE dataset present good quality with highly contrasted vascularization, we did not apply pre-processing. We also did not find it necessary to use the available background masks to filter out keypoints detected outside of the retina as generally they were not matched and did not contribute to the final registration.

It is interesting to note the gap of 33.6% in the success rate of acceptable registrations between GLAMpoints and SIFT. As both use the same descriptor, this difference can be only explained by the quality of the detector. As can be seen in Figure 5, SIFT detects a restricted number of keypoints densely positioned solely on the vascular tree and in the image borders, while GLAMpoints extracts interest points over the entire retina, including challenging areas such as the fovea and avascular zones, leading to a substantial rise in the number of correct matches.

Even though GLAMpoints outperforms all other detectors, LIFT and SuperPoint also present high performance on the FIRE dataset. This dataset contains images with well-defined corners on a clearly contrasted vascular tree and LIFT extracts keypoints spread over the entire image, while SuperPoint was trained to detect corners on synthetic primitive shapes. However, as evidenced on the slit lamp dataset, the performance of SuperPoint strongly deteriorates on images with less clear features.

### 4.6 Results on natural images

To further demonstrate a possible extension of our method to other image domains, we computed its predictions on natural images. Note that we used the same GLAMpoints model trained on slit lamp images.

Globally, GLAMpoints reaches a success rate of 75.38% for acceptable registrations, against 85.13% for the best performing detector - SIFT with rotation invariance - and 83.59% for SuperPoint. In terms of AUC, M.score and coverage fraction it scores respectively second, second and first best. In contrast, the repeatability of GLAMpoints is only second to last after SIFT, KAZE and LF-NET even though it successfully registers more images. This result shows once again that repeatability is not the most adequate metric to measure the performance of a detector. The detailed results can be found in the supplementary material.

Finally, it should be noted that the outdoor images of this dataset are significantly different from medical fundus images and contain much greater variability of structures, which indicates a promising generalization of our model to unseen image domains.

### 4.7 Qualitative results

In case of slit lamp videos, the end goal is to create retinal mosaics. Using 10 videos containing 25 to 558 images, we generated mosaics by registering consecutive frames using keypoints detected by different methods. We calculated the average number of frames before the registration failed (due to the lack of extracted keypoints or correct matches between a pair of images). Over those 10 videos, the average number of registered frames before failure is 9.98 for GLAMpoints and only 1.04 for SIFT.

Example mosaics are presented in Figure 6. For the same video, SIFT failed after 34 frames when the data was pre-processed and after only 11 frames on the original data. In contrast, GLAMpoints successfully registered 53 consecutive raw images, without visual errors. The mosaics were created with frame to frame matching with the blending method of [17] and without bundle adjustment.

### 4.8 Run time

The run time of detection is computed over 84 pairs of images with a resolution of 660px by 350px. The GLAMpoints architecture was run on an Nvidia GeForce GTX 1080 GPU while NMS and SIFT used CPU. Mean and standard deviation of run time for GLAMpoints and SIFT are presented in Table 3. GLAMpoints is on average significantly faster than SIFT. Importantly, it does not require any time-consuming pre-processing step.

## 5 Conclusion

In this paper we introduce GLAMpoints - a keypoint detector optimized for matching performance. This is in contrast to other detectors that are optimized for repeatability of keypoints, ignoring their correctness for matching. GLAMpoints detects significantly more keypoints that lead to correct matches even in low textured images, which do not present many features. As a result, no explicit pre-processing of the images is required. We train our detector on generated image pairs avoiding the need for ground truth correspondences. Our method produces state-of-the-art matching and registration results of medical fundus images and our experiments show that it can be further extended to other domains, such as natural images.

## Supplementary material

In this supplementary material, we first provide additional details on the training methodology in Section A. We then give additional qualitative and quantitative evaluation results on fundus images in Section B. Finally, in Section C, we show the generalization capabilities of GLAMpoints by presenting evaluation results on natural images. Importantly, for all results, we use the same model weights trained on fundus images. For the entire supplementary material, SIFT descriptor [31] refers to root-SIFT [7].

## Appendix A Supplementary details on the training method

### A.1 Performance comparison between SIFT descriptor with or without rotation invariance

The GLAMpoints detector was trained and tested in association with the rotation-dependent SIFT descriptor, because the SIFT descriptor without rotation invariance performs better than the rotation invariant version on fundus images. The details of the metrics evaluated on the pre-processed slitlamp dataset for both versions of the SIFT descriptor are shown in Table 4.

### A.2 Method for homography generation

For training of our GLAMpoints, we rely on pairs of images synthetically created by applying randomly sampled homography transformations to a set of base slitlamp images. Let $B$ denote a particular base image from the training dataset. At every training step, an image pair $(I, I')$ is generated from $B$ by applying two separate, randomly sampled homography transforms. Each of those homographies is a composition of rotation, shearing, perspective, scaling and translation elements. The minimum and maximum values of the geometric transformation parameters are given in Table 5.
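One way to build such a homography is to multiply elementary 3x3 matrices, as in the sketch below. The composition order, the parameter names and the ranges in `RANGES` are all assumptions made for illustration; the actual minimum and maximum values are the ones listed in Table 5.

```python
import math
import random


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def compose_homography(angle, shear, scale, tx, ty, px, py):
    """Compose rotation, shearing, scaling, perspective and translation (one possible order)."""
    c, s = math.cos(angle), math.sin(angle)
    rotation = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    shearing = [[1.0, shear, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    scaling = [[scale, 0.0, 0.0], [0.0, scale, 0.0], [0.0, 0.0, 1.0]]
    perspective = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [px, py, 1.0]]
    translation = [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]
    h = rotation
    for step in (shearing, scaling, perspective, translation):
        h = matmul3(step, h)  # left-multiply: each step acts after the previous ones
    return h


# Hypothetical (min, max) ranges, chosen only so that the example runs.
RANGES = {
    "angle": (-0.5, 0.5), "shear": (-0.1, 0.1), "scale": (0.8, 1.2),
    "tx": (-30.0, 30.0), "ty": (-30.0, 30.0),
    "px": (-1e-4, 1e-4), "py": (-1e-4, 1e-4),
}


def sample_homography(rng, ranges=RANGES):
    """Draw every parameter uniformly from its range and compose the result."""
    params = {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
    return compose_homography(**params)


rng = random.Random(0)
g, g_prime = sample_homography(rng), sample_homography(rng)
shift_only = compose_homography(0.0, 0.0, 1.0, 3.0, 4.0, 0.0, 0.0)
```

Sampling two such matrices per step gives the two transforms applied to the base image; with neutral parameters the composition reduces to a pure translation, as `shift_only` shows.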

## Appendix B Details of results on fundus images

Here, we provide more detailed quantitative experiments on fundus images as well as additional qualitative results.

### B.1 Details of MEE and RMSE per registration class on the retinal images dataset

Tables 6 and 7 show the mean and standard deviation of the median error and the root mean squared error for the FIRE dataset and the dataset, respectively. In both cases, GLAMpoints (NMS10) presents the highest registration accuracy for inaccurate registrations and globally.
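For readers unfamiliar with these two statistics, the sketch below computes a median error and a root mean squared error for a single image pair. It assumes that the errors are Euclidean distances between source points warped by the estimated homography and their ground-truth positions; the function names and the toy points are hypothetical, and the registration-class thresholds of the main paper are not reproduced here.

```python
import math
import statistics


def apply_homography(homography, point):
    """Map an (x, y) point through a 3x3 homography given as nested lists."""
    x, y = point
    xh = homography[0][0] * x + homography[0][1] * y + homography[0][2]
    yh = homography[1][0] * x + homography[1][1] * y + homography[1][2]
    w = homography[2][0] * x + homography[2][1] * y + homography[2][2]
    return xh / w, yh / w


def registration_errors(estimated_homography, source_points, target_points):
    """Return (median error, RMSE) of the point-wise registration distances."""
    distances = [
        math.dist(apply_homography(estimated_homography, src), dst)
        for src, dst in zip(source_points, target_points)
    ]
    median_error = statistics.median(distances)
    rmse = math.sqrt(sum(d * d for d in distances) / len(distances))
    return median_error, rmse


# Toy case: the true motion is a shift of (3, 4) pixels, but the estimate is
# the identity, so every point is off by exactly 5 pixels.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
source = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
target = [(3.0, 4.0), (13.0, 4.0), (3.0, 14.0)]
median_error, rmse = registration_errors(identity, source, target)
```

Per-dataset figures such as those in the tables would then be obtained by taking the mean and standard deviation of these per-pair values, for example with `statistics.mean` and `statistics.stdev`.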

### B.2 Supplementary examples of matching on the FIRE dataset

We show additional examples of matches obtained by GLAMpoints, SIFT, KAZE, SuperPoint, LIFT and LF-Net on two pairs of images from the FIRE dataset in Figure 8. Again, the GLAMpoints keypoints are homogeneously spread and lead to substantially more true-positive matches (in green in the figure) than any other method.

## Appendix C Generalization of the model on natural images

GLAMpoints was designed for application to medical retinal images. However, to show its generalization properties, we also evaluate our network on natural images. Importantly, here we use the model weights trained on slitlamp images.

Our method was tested on several natural image datasets, with the following specifications:

1. Oxford dataset [33]: 8 sequences with 45 pairs in total. The dataset contains various imaging changes, including viewpoint, rotation, blur, illumination, scale and JPEG compression changes. We evaluated on six of these sequences, excluding the ones showing rotation (boat and bark), because we trained our model together with the SIFT descriptor without rotation invariance. To be consistent, the rotation-dependent SIFT descriptor was also used for testing.

2. Viewpoints dataset [51]: 5 sequences with 25 pairs in total. It exhibits large viewpoint changes and in-plane rotations of up to 45 degrees.

3. EF dataset [53]: 3 sequences with 17 pairs in total. The dataset exhibits drastic lighting changes as well as daytime changes and viewpoint changes.

4. Webcam dataset [45, 26]: 6 sequences with 124 pairs in total. It shows seasonal changes as well as daytime changes of scenes taken from far away.

For all of the aforementioned datasets, the image pairs are related by homography transforms. Indeed, either the scenes are planar, the camera motion is a pure rotation, or the images are taken from a sufficient distance for the planar assumption to hold. Figure 7 shows examples of image pairs from the Oxford dataset.

The metrics computed on the aforementioned datasets are shown in Figure 9. We use the same thresholds as in the main paper to determine acceptable, inaccurate and failed registrations. We used the LF-Net model pretrained on outdoor data, since most images in those datasets show outdoor scenes. It is worth mentioning the gap in performance between SIFT descriptor with or without rotation invariance on the and the datasets. Those images exhibit large rotations and therefore require a rotation-invariant descriptor, which our detector, associated with the rotation-dependent SIFT descriptor, currently does not provide. This explains why GLAMpoints performs poorly on those datasets.

It is also interesting to note that on the Viewpoints dataset, LF-Net scores extremely low in all metrics except for repeatability. On those images, even though the extracted keypoints are repeatable, most of them are useless for matching. LF-Net therefore finds very few true-positive matches compared to the number of detected keypoints and matches, leading to poor evaluation results. This emphasizes the importance of designing a detector specifically optimized for matching.

Finally, on the dataset, GLAMpoints outperforms all other detectors in terms of , coverage fraction and while scoring second in repeatability.

### References

1. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu and X. Zheng (2015) TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Cited by: §4.3.
2. A. Alahi, R. Ortiz and P. Vandergheynst (2012) FREAK: Fast Retina Keypoint. In cvpr, Cited by: §2.
3. P. F. Alcantarilla, A. Bartoli and A. J. Davison (2012) KAZE Features. In eccv, Cited by: §2, §4.3.
4. J. Aldana-Iuit, D. Mishkin, O. Chum and J. Matas (2016) In the Saddle: Chasing Fast and Repeatable Features. In icpr, pp. 675–680. Cited by: item 2c.
5. H. Altwaijry, E. Trulls, S. Belongie, J. Hays and P. Fua (2016) Learning to Match Aerial Images with Deep Attentive Architecture. In cvpr, Cited by: §2.
6. H. Altwaijry, A. Veit and S. Belongie (2016) Learning to Detect and Match Keypoints with Deep Architectures. In bmvc, Cited by: §2.
7. R. Arandjelovic and A. Zisserman (2012) Three things everyone should know to improve object retrieval. In cvpr, pp. 2911–2918. Cited by: Supplementary material, item 3.
8. V. Balntas, E. Johns, L. Tang and K. Mikolajczyk (2016) PN-Net: Conjoined Triple Deep Network for Learning Local Image Descriptors. CoRR abs/1601.05030. Cited by: §2.
9. V. Balntas, K. Lenc, A. Vedaldi and K. Mikolajczyk (2017) HPatches: A Benchmark and Evaluation of Handcrafted and Learned Local Descriptors. In cvpr, pp. 3852–3861. Cited by: §2.
10. H. Bay, T. Tuytelaars and L. Van Gool (2006) Surf: Speeded up robust features. In eccv, pp. 404–417. Cited by: §2.
11. P. C. Cattin, H. Bay, L. Van Gool and G. Székely (2006) Retina Mosaicing Using Local Features. In miccai, pp. 185–192. Cited by: §2, item 1.
12. J. Chen, J. Tian, N. Lee, J. Zheng, R. T. Smith and A. F. Laine (2010) A Partial Intensity Invariant Feature Descriptor for Multimodal Retinal Image Registration. tbme 57 (7), pp. 1707–1718. Cited by: item 1.
13. J. Chen, J. Tian, N. Lee, J. Zheng, T. R. Smith and A. F. Laine (2010) A Partial Intensity Invariant Feature Descriptor for Multimodal Retinal Image Registration. tbme 57 (7), pp. 1707–1718. Cited by: §2, item 3.
14. A. V. Cideciyan (1995) Registration of Ocular Fundus Images: an Algorithm Using Cross-correlation of Triple Invariant Image Descriptors. IEEE Engineering in Medicine and Biology Magazine 14 (1), pp. 52–58. Cited by: §2.
15. A. L. Dahl, H. Aanæs and K. S. Pedersen (2011) Finding the Best Feature Detector-Descriptor Combination. In International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission, pp. 318–325. Cited by: item 2a.
16. A. Davison (2003) Real-Time Simultaneous Localisation and Mapping with a Single Camera. In iccv, Cited by: §1.
17. S. De Zanet, T. Rudolph, R. Richa, C. Tappeiner and R. Sznitman (2016) Retinal Slit Lamp Video Mosaicking. International Journal of Computer Assisted Radiology and Surgery 11 (6), pp. 1035–1041. Cited by: §4.1, §4.7.
18. D. DeTone, T. Malisiewicz and A. Rabinovich (2018) SuperPoint: Self-Supervised Interest Point Detection and Description. In cvprw, pp. 224–236. Cited by: §2, §4.3.
19. P. Di Febbo, C. Dal Mutto, K. Tieu and S. Mattoccia (2018) KCNN: Extremely-Efficient Hardware Keypoint Detection With a Compact Convolutional Neural Network. In cvprw, Cited by: §2.
20. P. Fischer, A. Dosovitskiy and T. Brox (2014-05) Descriptor Matching with Convolutional Neural Networks: a Comparison to SIFT. Technical Report 1405.5769, arXiv. Cited by: §2.
21. M. A. Fischler and R. C. Bolles (1981-06) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395. Cited by: §1.
22. L. Giancardo, F. Meriaudeau, T. Karnowski, E. Grisan, P. Favaro, A. Ruggeri and E. Chaum (2011) Textureless Macula Swelling Detection With Multiple Retinal Fundus Images. tbme 58 (3), pp. 795–799. Cited by: item 1.
23. Y. Hang, X. Zhang, Y. Shao, H. Wu and W. Sun (2017) Retinal Image Registration Based on the Feature of Bifurcation Point. In International Congress on Image and Signal Processing, BioMedical Engineering and Informatics, Cited by: §2.
24. C. Harris and M. Stephens (1988) A Combined Corner and Edge Detector. In Fourth Alvey Vision Conference, Cited by: §2.
25. C. Hernandez-Matas, X. Zabulis, A. Triantafyllou, P. Anyfanti, S. Douma and A. Argyros (2017) FIRE: Fundus Image Registration dataset. Journal for Modeling in Ophthalmology 4, pp. 16–28. Cited by: §2, item 2.
26. N. Jacobs, N. Roman and R. Pless (2007) Consistent Temporal Variations in Many Outdoor Scenes. In cvpr, Cited by: item 4, §4.1.
27. D. P. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. In iclr, Cited by: §4.3.
28. P. Legg, P. Rosin, D. Marshall and J. Morgan (2013) Improving Accuracy and Efficiency of Mutual Information for Multi-modal Retinal Image Registration using Adaptive Probability Density Estimation. Computerized Medical Imaging and Graphics 37 (7-8), pp. 597–606. Cited by: §2.
29. S. Leutenegger, M. Chli and R. Siegwart (2011) BRISK: Binary Robust Invariant Scalable Keypoints. In iccv, Cited by: §2.
30. P. Li, Q. Chen, W. Fan and S. Yuan (2017) Registration of OCT Fundus Images with Color Fundus Images Based on Invariant Features. In Cloud Computing and Security, pp. 471–482. Cited by: §2.
31. D. G. Lowe (2004-11) Distinctive Image Features from Scale-Invariant Keypoints. ijcv 60 (2), pp. 91–110. Cited by: Supplementary material, §2, item 2.
32. K. Mikolajczyk and C. Schmid (2005) A Performance Evaluation of Local Descriptors. pami 27 (10), pp. 1615–1630. Cited by: item 2b.
33. K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir and L. Van Gool (2005) A Comparison of Affine Region Detectors. ijcv 65 (1/2), pp. 43–72. Cited by: item 1, §4.1.
34. Y. Ono, E. Trulls, P. Fua and K. M. Yi (2018) LF-Net: Learning Local Features from Images. In nips, pp. 6237–6247. Cited by: §2, §2, §4.3.
35. OpenCV: cv::BFMatcher Class Reference. Cited by: item 4.
36. OpenCV: cv::xfeatures2d::SIFT Class Reference. Cited by: §1, §4.3.
37. J. P. W. Pluim, J. B. A. Maintz and M. A. Viergever (2003) Mutual Information Based Registration of Medical Images: A Survey. tmi 22 (8), pp. 986–1004. Cited by: §2.
38. R. Ramli, M. Y. I. Idris, K. Hasikin, N. K. A Karim, A. W. Abdul Wahab, I. Ahmedy, F. Ahmedy, N. A. Kadri and H. Arof (2017-10) Feature-Based Retinal Image Registration Using D-Saddle Feature. Journal of Healthcare Engineering 2017, pp. 1–15. Cited by: §2.
39. O. Ronneberger, P. Fischer and T. Brox (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In miccai, pp. 234–241. Cited by: §3.3.
40. E. Rublee, V. Rabaud, K. Konolige and G. Bradski (2011) ORB: An Efficient Alternative to SIFT or SURF. In iccv, Cited by: §2.
41. C. A. Sánchez-Galeana, C. Bowd, E. Z. Blumenthal, P. A. Gokhale, L. M. Zangwill and R. N. Weinreb (2001) Using Optical Imaging Summary Data to Detect Glaucoma. Ophthalmology, pp. 1812–1818. Cited by: §1.
42. E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua and F. Moreno-Noguer (2015) Discriminative Learning of Deep Convolutional Feature Point Descriptors. In iccv, Cited by: §3.2.
43. B. Triggs, P. F. McLauchlan, R. I. Hartley and A. W. Fitzgibbon (2000) Bundle Adjustment – A Modern Synthesis. In Vision Algorithms: Theory and Practice, pp. 298–372. Cited by: §1.
44. P. Truong, S. De Zanet and S. Apostolopoulos (2019) Comparison of Feature Detectors for Retinal Image Alignment. In ARVO, Cited by: §2, §4.3.
45. Y. Verdie, K. M. Yi, P. Fua and V. Lepetit (2015) TILDE: A Temporally Invariant Learned DEtector. cvpr, pp. 5279–5288. Cited by: item 4, §4.1.
46. G. Wang, Z. Wang, Y. Chen and W. Zhao (2015) Robust Point Matching Method for Multimodal Retinal Image Registration. Biomedical Signal Processing and Control 19, pp. 68–76. Cited by: §2, item 3.
47. Y. Wang (2005) Phase Correlation-based Iris Image Registration Model. Journal of Computer Science and Technology 20 (3), pp. 419–425. Cited by: §2.
48. S. Winder and M. Brown (2007-06) Learning Local Image Descriptors. In cvpr, Cited by: item 2a.
49. S. Winder, G. Hua and M. Brown (2009) Picking the Best DAISY. In cvpr, pp. 178–185. Cited by: item 2a.
50. K. M. Yi, E. Trulls, V. Lepetit and P. Fua (2016) LIFT: Learned Invariant Feature Transform. In eccv, pp. 467–483. Cited by: §2, §4.3, §4.4.
51. K. M. Yi, Y. Verdie, P. Fua and V. Lepetit (2016) Learning to Assign Orientations to Feature Points. In cvpr, Cited by: item 2, §4.1.
52. L. Zhou, M. S. Rzeszotarski, L. J. Singerman and J. M. Chokreff (1994) The Detection and Quantification of Retinopathy Using Digital Angiograms. tmi 13 (4), pp. 619–626. Cited by: §1.
53. L. Zitnick and K. Ramnath (2011) Edge Foci Interest Points. In iccv, Cited by: item 3, §4.1.