Enhancing Remote Sensing Image Retrieval with Triplet Deep Metric Learning Network
With the rapid growth of remotely sensed imagery data, there is a high demand for effective and efficient image retrieval tools to manage and exploit such data. In this letter, we present a novel content-based remote sensing image retrieval method based on a Triplet deep metric learning convolutional neural network (CNN). By constructing a Triplet network with a metric learning objective function, we extract the representative features of the images in a semantic space in which images from the same class are close to each other while those from different classes are far apart. In such a semantic space, simple metric measures such as Euclidean distance can be used directly to compare the similarity of images and effectively retrieve images of the same class. We also investigate a supervised and an unsupervised learning method for reducing the dimensionality of the learned semantic features. We present comprehensive experimental results on two publicly available remote sensing image retrieval datasets and show that our method significantly outperforms the state of the art.
With the development of remote sensing technologies, the capacity to acquire remote sensing images has grown substantially. The quantity and quality of remote sensing images have increased dramatically. Consequently, remote sensing image retrieval (RSIR) has received increasing interest in the remote sensing community.
Early remote sensing image retrieval systems search images by geographical location, acquisition time, or sensor type, none of which is directly related to the visual content of images. Owing to the development of content-based image retrieval (CBIR), which employs features extracted directly from the visual content of images for retrieval tasks, content-based remote sensing image retrieval has also witnessed great advances in recent years [1, 2, 3].
One of the major issues in CBIR is to find discriminative and robust features from images. Traditional methods rely on handcrafted features, the design of which requires sufficient expert knowledge and is time-consuming. Handcrafted features are also widely exploited as remote sensing image representations in RSIR works [1, 2]. Popular handcrafted features include global features such as spectral (color), texture, and shape features, and aggregated local features such as bag of visual words (BoVW), vector of locally aggregated descriptors (VLAD), and Fisher vector (FV).
The development of deep learning has advanced many areas including content-based image retrieval. Deep convolutional neural network (CNN) features are highly abstract and contain high-level semantic information, and they have been shown to outperform traditional handcrafted features in remote sensing image retrieval [3, 4, 5]. Furthermore, deep features are learned automatically from data and require no human effort in feature design, which makes deep learning techniques extremely valuable in large-scale remote sensing image retrieval.
Deep metric learning (DML) is an emerging technique that combines deep learning and metric learning. It exploits the discriminative power of deep neural networks to embed the images into a metric space in which simple metrics such as the Euclidean distance can be used directly to measure the semantic similarity between images. Deep metric learning has proven effective in fields like face recognition, person re-identification, and natural image retrieval. Although remote sensing images are very different from ordinary natural images, DML still shows promising potential for content-based remote sensing image retrieval.
In this letter, we present a novel metric learning method based on a Triplet deep neural network to enhance RSIR. Using deep convolutional neural networks, we embed the remote sensing images into a semantic space in which images from the same class are close to each other and those from different classes are far apart. We also investigate methods based on the use of a fully-connected layer of the CNN (supervised learning) and PCA (unsupervised learning) for reducing the dimensionality of the learned semantic features. We present comprehensive experiments on the popular UCMD and the large PatternNet RSIR datasets, and the results demonstrate that our method significantly outperforms the state of the art.
In this section, we describe the details of the proposed Triplet network for RSIR, including the architecture and loss function of the network, as well as an effective and efficient method for selecting image triplets to train the network. We also present two methods, one based on a fully-connected layer of the CNN (supervised learning) and the other based on PCA (unsupervised learning), to reduce the dimension of extracted features from the proposed network for retrieval.
II-A Triplet Network for Image Retrieval
II-A1 Network Architecture
The overall architecture of our proposed Triplet network is shown in Fig. 1. The network is composed of three identical convolutional neural networks which share the same weights. In the training phase, the three images of a triplet are fed into the three networks respectively: one image serves as the anchor, and the other two are the positive and negative samples. The positive sample is of the same class as the anchor, while the negative sample belongs to a different class. The three identical networks take the input image triplet and output three corresponding feature maps. The feature maps are then vectorized by the global average pooling (GAP) operation to obtain fixed-length embedding features, which are further $\ell_2$-normalized. Finally, the distances between the three extracted vectors are used to compute the loss to train the network. In the testing phase, images can be fed into any one of the three identical networks to extract the fixed-length feature vectors for retrieval.
It should be noted that the CNNs used are “fully convolutional”, i.e., they consist of only convolutional layers, so that the networks can extract features regardless of image size. This prevents the loss of information caused by resizing or cropping input images. Besides, “fully convolutional” CNNs have few parameters and are fast to run.
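To illustrate why global average pooling yields a fixed-length embedding regardless of input size, the following is a minimal sketch in plain Python. The toy feature maps, the function names `global_average_pool` and `l2_normalize`, and the use of unit-length normalization are assumptions made for illustration only and are not taken from the authors' implementation.

```python
import math

def global_average_pool(feature_maps):
    """Average each channel's H x W map down to a single value."""
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]

# Hypothetical feature maps: 2 channels each, but different spatial sizes.
small = [[[1.0, 3.0], [5.0, 7.0]],                               # channel 0, 2 x 2
         [[2.0, 2.0], [2.0, 2.0]]]                               # channel 1, 2 x 2
large = [[[0.0, 3.0, 6.0], [3.0, 3.0, 3.0], [0.0, 6.0, 3.0]],    # channel 0, 3 x 3
         [[4.0, 4.0, 4.0], [4.0, 4.0, 4.0], [4.0, 4.0, 4.0]]]    # channel 1, 3 x 3

emb_small = l2_normalize(global_average_pool(small))
emb_large = l2_normalize(global_average_pool(large))

# Both embeddings have one entry per channel, whatever the spatial size.
print(len(emb_small), len(emb_large))
```

The embedding length depends only on the number of channels of the last convolutional layer, which is why no resizing or cropping of the input is needed.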
II-A2 Loss Function
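The goal of the Triplet network is to learn a metric embedding function $f_\theta(x): \mathbb{R}^F \rightarrow \mathbb{R}^D$ that maps the input images $x \in \mathbb{R}^F$ to a feature space $\mathbb{R}^D$ so that semantically similar images in $\mathbb{R}^F$ are metrically close in $\mathbb{R}^D$, where the function $f$ parameterized by $\theta$ represents the CNN-based feature extractor. This is achieved by designing a triplet loss.
For an arbitrary image $x_i$, $f_\theta(x_i)$ is the corresponding feature vector in the metric embedding space. Let the metric to measure the similarity of images in the embedding space be $D$. In this letter, squared Euclidean distance is used as the metric. Then, for a triplet of images $(x_a, x_p, x_n)$, where $x_a$, $x_p$, $x_n$ are the anchor, positive, and negative images respectively, and $y_a$, $y_p$, $y_n$ ($y_a = y_p \neq y_n$) are their corresponding class labels, the corresponding embedding vectors are $f_\theta(x_a)$, $f_\theta(x_p)$, $f_\theta(x_n)$, and thereby the triplet loss function can be formulated as follows:
$$L_{\mathrm{triplet}}(x_a, x_p, x_n) = \left[\, m + D_{a,p} - D_{a,n} \,\right]_+$$
where $[\cdot]_+$ represents $\max(\cdot, 0)$ and $m$ denotes the margin. $D_{a,p} = D(f_\theta(x_a), f_\theta(x_p))$ and $D_{a,n} = D(f_\theta(x_a), f_\theta(x_n))$ are the distances between the anchor-positive and the anchor-negative pairs respectively, measured by the metric $D$ in the embedding space $\mathbb{R}^D$.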
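As a worked illustration of the triplet loss with squared Euclidean distance, the sketch below evaluates it on hand-picked two-dimensional unit vectors. The vectors and the helper names are hypothetical; the margin of 0.2 is the value reported later in the experiment setup.

```python
def squared_euclidean(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on (margin + d(anchor, positive) - d(anchor, negative))."""
    d_ap = squared_euclidean(anchor, positive)
    d_an = squared_euclidean(anchor, negative)
    return max(margin + d_ap - d_an, 0.0)

# Toy unit-length embeddings (illustrative values only).
anchor = [1.0, 0.0]
positive = [0.8, 0.6]          # d(anchor, positive) = 0.4
easy_negative = [-1.0, 0.0]    # d(anchor, negative) = 4.0, already far enough away
hard_negative = [0.96, 0.28]   # d(anchor, negative) = 0.08, closer than the positive

loss_easy = triplet_loss(anchor, positive, easy_negative)
loss_hard = triplet_loss(anchor, positive, hard_negative)
print(loss_easy, loss_hard)
```

The easy negative contributes no loss because it is already farther from the anchor than the positive by more than the margin, whereas the hard negative violates the margin and yields a positive loss that drives the update.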
The intuition of the triplet loss is to pull the positive sample closer to the anchor while pushing the negative sample farther away. The process is illustrated in Fig. 2.
II-A3 Triplet Selection for Effective Training
It is crucial to select informative triplets for Triplet network training. The naive way is to select triplets offline, by randomly picking positive samples from the same class as the anchor (excluding the anchor itself), and negative samples from any other class. However, this method is both inefficient and ineffective. In a mini-batch with $B$ images, there would only be $B/3$ triplets.
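In this letter, we use the batch all triplet mining technique to select triplets. The key idea is to exhaust all the valid triplets within a mini-batch to compute the loss in the training phase. In a mini-batch, let $P$ denote the number of classes and $K$ denote the number of images per class; then there would be $PK(K-1)(PK-K)$ image triplets within the mini-batch. Let $x_a^i$ (anchor) and $x_p^i$ (positive) be the $a$-th and $p$-th images in class $i$, while $x_n^j$ (negative) is the $n$-th image in class $j$; then the loss function within the mini-batch can be formulated as follows:
$$L_{\mathrm{BA}} = \sum_{i=1}^{P} \sum_{a=1}^{K} \sum_{\substack{p=1 \\ p \neq a}}^{K} \sum_{\substack{j=1 \\ j \neq i}}^{P} \sum_{n=1}^{K} \left[\, m + D\big(f_\theta(x_a^i), f_\theta(x_p^i)\big) - D\big(f_\theta(x_a^i), f_\theta(x_n^j)\big) \,\right]_+$$
where $f_\theta(x_a^i)$, $f_\theta(x_p^i)$, $f_\theta(x_n^j)$ are the corresponding embedding vectors of image triplets $(x_a^i, x_p^i, x_n^j)$, and $m$, $D$, and $[\cdot]_+$ denote the margin, the squared Euclidean distance, and $\max(\cdot, 0)$, respectively.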
Compared with the naive offline triplet-selection strategy, the triplet mining method greatly increases the efficacy of training by exploiting all the valid triplets online within a training mini-batch. This makes full use of the hard samples of each training batch to compute loss, and thus makes the training process easier to converge.
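The difference between offline selection and batch all mining can be made concrete with a short sketch. The split of a 30-image batch into 6 classes of 5 images is a hypothetical choice for illustration (the letter specifies the batch size but this sketch does not rely on any reported class/image split), and the loss is summed over triplets here; averaging over triplets is an equally plausible convention.

```python
def squared_euclidean(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def count_batch_all_triplets(num_classes, images_per_class):
    """Anchors (P*K) x positives (K-1) x negatives (P*K - K)."""
    p, k = num_classes, images_per_class
    return p * k * (k - 1) * (p * k - k)

def batch_all_triplet_loss(embeddings, labels, margin=0.2):
    """Sum the hinge loss over every valid (anchor, positive, negative) in the batch."""
    total, n_triplets = 0.0, 0
    n = len(embeddings)
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue  # a positive must be a different image of the anchor's class
            d_ap = squared_euclidean(embeddings[a], embeddings[p])
            for neg in range(n):
                if labels[neg] == labels[a]:
                    continue  # a negative must come from another class
                d_an = squared_euclidean(embeddings[a], embeddings[neg])
                total += max(margin + d_ap - d_an, 0.0)
                n_triplets += 1
    return total, n_triplets

# Hypothetical batch composition: 6 classes x 5 images = 30 images.
naive_triplets = 30 // 3
mined_triplets = count_batch_all_triplets(num_classes=6, images_per_class=5)
print(naive_triplets, mined_triplets)

# Tiny batch with 2 classes x 2 images to exercise the loss itself.
toy_embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
toy_labels = [0, 0, 1, 1]
toy_loss, toy_count = batch_all_triplet_loss(toy_embeddings, toy_labels)
print(toy_loss, toy_count)
```

Under these assumptions the same 30 images provide 3000 triplets instead of 10, which is the gain in training signal that online mining is meant to deliver.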
II-B Feature Dimension Reduction
Deep features directly extracted from the proposed Triplet network are of high dimensions, which impacts the efficiency of similarity search in image retrieval and requires large storage. To address this issue, we use a method based on the fully-connected (FC) layer of the CNN (supervised learning), and a method based on the traditional PCA (unsupervised learning), to reduce the dimensionality of the semantic features.
The process of employing a fully-connected layer of the CNN for dimensionality reduction is illustrated in Fig. 3. In the training phase, the embedding vectors (extracted from the network in Fig. 1) are fed to a fully-connected layer of a lower dimension, whose outputs are then $\ell_2$-normalized to compute the final loss. In the testing phase, the $\ell_2$-normalized condensed vector can be exploited as the representative feature for retrieval.
For PCA dimension reduction, the training images are first fed into the trained DML network to obtain the embedding feature vectors. The covariance matrix of these vectors is computed and its eigenvectors are then used to project the feature vectors onto a lower dimensional space. The features with dimension reduced by PCA are further $\ell_2$-normalized and then used for image retrieval.
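The PCA steps described above (covariance matrix, leading eigenvectors, projection, normalization) can be sketched in plain Python as follows. This is an illustrative toy, not the authors' implementation: the eigenvectors are found by power iteration with deflation, the features are mean-centered before projection (an assumption, as the text does not state it), and the three-dimensional feature values are made up. In practice a numerical library would normally be used.

```python
import math

def center(vectors):
    """Subtract the per-dimension mean; returns (centered vectors, mean)."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return [[v[d] - mean[d] for d in range(dim)] for v in vectors], mean

def covariance(centered):
    """Sample covariance matrix (dim x dim) of mean-centered vectors."""
    n, dim = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
            for i in range(dim)]

def matvec(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def top_eigenvectors(matrix, k, iters=200):
    """Leading k eigenvectors of a symmetric PSD matrix (power iteration + deflation)."""
    dim = len(matrix)
    work = [row[:] for row in matrix]
    components = []
    for _ in range(k):
        start = [float(i + 1) for i in range(dim)]  # fixed, non-uniform start vector
        start_norm = math.sqrt(sum(x * x for x in start))
        vec = [x / start_norm for x in start]
        for _ in range(iters):
            nxt = matvec(work, vec)
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-15:
                break  # no variance left in the deflated matrix
            vec = [x / norm for x in nxt]
        eigval = sum(a * b for a, b in zip(vec, matvec(work, vec)))
        components.append(vec)
        # Deflate: remove the direction just found before searching for the next.
        work = [[work[i][j] - eigval * vec[i] * vec[j] for j in range(dim)]
                for i in range(dim)]
    return components

def pca_reduce(vectors, mean, components):
    """Project onto the components, then L2-normalize each reduced vector."""
    reduced = []
    for v in vectors:
        shifted = [x - m for x, m in zip(v, mean)]
        projected = [sum(c * s for c, s in zip(comp, shifted)) for comp in components]
        norm = math.sqrt(sum(p * p for p in projected))
        reduced.append([p / max(norm, 1e-12) for p in projected])
    return reduced

# Made-up 3-D "embeddings" reduced to 2 dimensions.
features = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.1], [3.0, 3.0, -0.1], [4.0, 4.0, 0.0]]
centered, mean = center(features)
components = top_eigenvectors(covariance(centered), k=2)
reduced = pca_reduce(features, mean, components)
print(len(reduced), len(reduced[0]))
```

The mean and components would be estimated once on the training embeddings and then reused unchanged to reduce query and database features at retrieval time.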
Two publicly available RSIR datasets, UCMD and PatternNet, are used to evaluate the proposed method. The University of California, Merced (UCMD) land use dataset is the most widely used RSIR dataset. It includes 21 classes, with 100 images per class. All the images are 256 × 256 pixels, and the pixel resolution is about 0.3 m. The images are extracted from the USGS National Map and cover various US urban areas. PatternNet is currently the largest publicly available dataset for remote sensing image retrieval. It consists of 38 classes, with 800 images per class, and the size of each image is 256 × 256 pixels. The spatial resolution of the images ranges from 0.062 to 4.693 meters. The images are collected from Google Earth and Google Maps imagery of US cities.
III-B Experiment Setup
For UCMD, we follow the data split that yields the best performance in previous work, which randomly selects 50% of the images of each class for training and the remaining 50% for performance evaluation. For PatternNet, we follow the 80%/20% training and testing data split used in previous work.
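A per-class random split of this kind can be sketched as below. The function name `split_per_class`, the fixed seed, and the synthetic label lists are assumptions for illustration; the class and image counts are those stated for the two datasets.

```python
import random
from collections import defaultdict

def split_per_class(labels, train_fraction, seed=0):
    """Randomly split image indices so each class keeps the same train fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        cut = int(round(train_fraction * len(indices)))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test

# UCMD-like labels: 21 classes x 100 images, split 50% / 50%.
ucmd_labels = [c for c in range(21) for _ in range(100)]
ucmd_train, ucmd_test = split_per_class(ucmd_labels, 0.5)

# PatternNet-like labels: 38 classes x 800 images, split 80% / 20%.
pattern_labels = [c for c in range(38) for _ in range(800)]
pattern_train, pattern_test = split_per_class(pattern_labels, 0.8)

print(len(ucmd_train), len(ucmd_test), len(pattern_train), len(pattern_test))
```

Splitting within each class, rather than over the whole dataset at once, keeps the class balance identical in the training and evaluation sets.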
Three convolutional neural networks are employed as the basic networks for feature extraction, i.e. AlexNet, VGG16, and ResNet50. For each network, only the convolutional layers are used to extract features, and global average pooling is applied to the output feature maps to obtain the final fixed-length feature vectors. The number of convolutional layers of the three CNNs and the dimension of their corresponding output features are listed in Table I.
| Network | Conv. Layers | Feature Dimension |
In the experiments, the networks are implemented with the PyTorch framework. The Adam optimizer is used to train the Triplet network, with a learning rate of 0.00001 and a batch size of 30. The parameters of the networks are initialized with the corresponding network weights pretrained on ImageNet. Training runs for a maximum of 30 epochs. The margin of the triplet loss is empirically set to 0.2, which was found to work consistently well in all experimental settings on both datasets. Random horizontal and vertical image flips are applied as data augmentation.
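For reference, the reported settings and the flip augmentation can be summarized in a small framework-independent sketch. The dictionary keys, the function `random_flip`, and the flip probability of 0.5 are assumptions for illustration; the letter does not state the flip probability, and the authors' own pipeline is implemented in PyTorch rather than in this form.

```python
import random

# Hyperparameters as reported in the text (key names are illustrative).
TRAINING_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 1e-5,
    "batch_size": 30,
    "max_epochs": 30,
    "triplet_margin": 0.2,
}

def random_flip(image, rng, probability=0.5):
    """Flip an image (a list of pixel rows) horizontally and/or vertically.

    Each flip is applied independently; the probability is an assumed value.
    The input image is left unmodified.
    """
    if rng.random() < probability:
        image = [row[::-1] for row in image]   # horizontal flip: mirror each row
    if rng.random() < probability:
        image = image[::-1]                    # vertical flip: reverse row order
    return image

toy_image = [[1, 2, 3],
             [4, 5, 6]]
augmented = random_flip(toy_image, random.Random(0))
print(augmented)
```

Flips are a natural augmentation for overhead imagery because a scene's class does not depend on its orientation.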
For comparison, the features extracted by the pretrained and the fine-tuned CNNs (with convolutional layers only) are used as baselines. For fine-tuning, the fully-connected layers and some of the convolutional layers (the last layer for AlexNet, the last two layers for VGG16, and the last three layers for ResNet50) are retrained on the training sets of UCMD and PatternNet respectively.
III-C Performance Evaluation Metrics
To evaluate image retrieval performance, the average normalized modified retrieval rank (ANMRR), mean average precision (mAP), and precision at $k$ (P@$k$, the precision of the top-$k$ retrieval results) are used. It should be noted that the lower the value of ANMRR, the better the retrieval performance, while the opposite holds for mAP and P@$k$.
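The ranking step and two of these metrics can be sketched as follows. The toy query, database vectors, and labels are made up, and average precision is normalized by the number of relevant images in the full ranking, which is one common convention; ANMRR is omitted from this sketch because its definition involves additional conventions not spelled out in the text.

```python
def squared_euclidean(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def rank_database(query, database):
    """Database indices sorted by increasing distance to the query embedding."""
    return sorted(range(len(database)),
                  key=lambda i: squared_euclidean(query, database[i]))

def precision_at_k(relevant, k):
    """Fraction of the top-k results that are relevant (relevant: ranked booleans)."""
    return sum(relevant[:k]) / k

def average_precision(relevant):
    """Mean of the precision values measured at each relevant result."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Toy example: one query of class "a" against a five-image database.
query, query_label = [1.0, 0.0], "a"
database = [[0.9, 0.1], [0.7, 0.4], [0.8, 0.3], [0.0, 1.0], [0.4, 0.6]]
db_labels = ["a", "b", "a", "b", "a"]

ranking = rank_database(query, database)
relevant = [db_labels[i] == query_label for i in ranking]

p_at_3 = precision_at_k(relevant, 3)
ap = average_precision(relevant)
print(ranking, p_at_3, ap)
```

mAP is then the mean of `average_precision` over all queries, and P@$k$ is likewise averaged over queries.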
III-D Results and Analysis
We evaluate our deep metric learning (DML)-based features by comparing results with baseline features from pretrained (PT) and fine-tuned (FT) CNNs, and results reported by previous works.
III-D1 Overall Results
| Fc7_W (50) | 0.0673 |
| Pool5_W (50) | 0.0404 |
| Gabor Texture | 0.6439 | 0.2773 | 0.6855 | 0.6278 | 0.4461 | 0.3552 | 0.0899 |
The overall results on the UCMD and PatternNet datasets are shown in Table II and Table III respectively. As can be seen, in general, DML-based features achieve the best performance, fine-tuned CNN features achieve competitive results, and pretrained features show the worst results. In addition, the deeper the network is, the better the image retrieval performance. DML-based ResNet50 features achieve the best results on both datasets, significantly outperforming all the other methods on all the evaluation metrics.
It is interesting to note that the best performance on PatternNet is significantly better than that on the UCMD dataset. One probable reason is that deep metric learning is data hungry and the amount of training data influences the learning of representative features. Since PatternNet is much larger than UCMD, the network for the PatternNet dataset is better trained than that for the UCMD dataset.
III-D2 Per-class Results
The per-class performances in terms of ANMRR of different ResNet50-based deep features on the two datasets are shown in Fig. 4 and Fig. 5 respectively. As presented in Fig. 4, for almost every class, DML-based features outperform fine-tuned features, and both of them perform much better than pretrained features. Pretrained ResNet50-based features have particular difficulty in retrieving images of the buildings, dense residential, and intersection classes, with an average ANMRR of 0.60, much higher than those of their counterparts: 0.13 for the fine-tuned features and less than 0.08 for the DML-based features. It can be seen from Fig. 5 that DML-based features outperform fine-tuned features significantly, while the latter perform much better than pretrained features for all the classes. Pretrained ResNet50-based features perform poorly on classes like basketball court, ferry terminal, and nursing home, with an average ANMRR of 0.59. This value is 0.11 for the fine-tuned features and less than 0.006 for the DML-based features, which further demonstrates the superior performance of DML-based features for content-based remote sensing image retrieval.
III-D3 Qualitative Retrieval Results
As can be seen in Fig. 4 and Fig. 5, DML-based features perform significantly better than pretrained and fine-tuned CNN features for almost all the classes on both datasets. To examine the details, qualitative results of several difficult query cases are presented in Fig. 6, which shows the top-5 retrieved images that are most similar to the query image, using features extracted by the pretrained, fine-tuned, and DML-based ResNet50 networks respectively. In Fig. 6, the left column shows the results on the UCMD dataset, with query images from the classes of buildings, dense residential, intersection, and medium residential; the right column shows the results on the PatternNet dataset, with query images from the classes of basketball court, ferry terminal, nursing home, and sparse residential. DML-based features noticeably outperform pretrained and fine-tuned features in these cases. The results indicate that deep metric learning can learn high-level semantic features that distinguish images with high within-class variance.
III-E Analysis of Feature Dimension Reduction
Fully-connected (FC) layers and principal component analysis (PCA) are used to reduce the dimension of extracted features. The retrieval performances of DML-based features with different dimensions (i.e. the original dimension and the reduced dimensions of 128, 64, 32, 16, and 8) using the two methods are shown in Fig. 7. For PCA, reducing the feature size from the original dimension to 32 has a relatively small impact on retrieval performance on both UCMD and PatternNet. The best performances are generally achieved at 32, and performance drops significantly when the dimension is further reduced, as the ANMRR values increase noticeably. For FC-based dimension reduction, the ANMRR values generally increase with the reduction of feature dimension on the UCMD dataset, and the performance is worse than that of PCA at the same dimension. However, FC-based reduction performances are generally stable across different dimensions on the PatternNet dataset, and the performance is much better than that of PCA when the feature dimension is reduced to less than 16, which implies that FC-based feature dimension reduction performs better with sufficient training data.
Content-based remote sensing image retrieval is key to the effective use of the ever-growing volume of remote sensing images. In this letter, we use a deep metric learning-based Triplet network to learn deep metric embeddings from positive and negative sample images, and thereby enhance the retrieval performance of remote sensing images. We test our method on two publicly available datasets and achieve state-of-the-art performance on both. We also investigate a supervised method based on a CNN fully-connected layer and an unsupervised method based on PCA to further reduce the dimension of the extracted features. The results demonstrate the effectiveness of deep metric learning for remote sensing image retrieval.
-  Y. Yang and S. Newsam, “Geographic Image Retrieval Using Local Invariant Features,” IEEE Trans. Geosci. Remote Sens., vol. 51, no. 2, pp. 818–832, 2013.
-  S. Özkan, T. Ateş, E. Tola, M. Soysal, and E. Esen, “Performance Analysis of State-of-the-Art Representation Methods for Geographical Image Retrieval and Categorization,” IEEE Geosci. Remote Sens. Lett., vol. 11, no. 11, pp. 1996–2000, 2014.
-  P. Napoletano, “Visual descriptors for content-based retrieval of remote-sensing images,” Int. J. Remote Sens., vol. 39, no. 5, pp. 1343–1376, 2018.
-  F. Ye, H. Xiao, X. Zhao, M. Dong, W. Luo, and W. Min, “Remote Sensing Image Retrieval Using Convolutional Neural Network Features and Weighted Distance,” IEEE Geosci. Remote Sens. Lett., vol. 15, no. 10, pp. 1535–1539, 2018.
-  W. Zhou, S. Newsam, C. Li, and Z. Shao, “PatternNet: A benchmark dataset for performance evaluation of remote sensing image retrieval,” ISPRS J. Photogramm. Remote Sens., vol. 145, pp. 197–209, 2018.
-  F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2015, pp. 815–823.
-  A. Hermans, L. Beyer, and B. Leibe, “In Defense of the Triplet Loss for Person Re-Identification,” arXiv:1703.07737 [cs], 2017.
-  A. Gordo, J. Almazán, J. Revaud, and D. Larlus, “End-to-End Learning of Deep Visual Representations for Image Retrieval,” Int. J. Comput. Vis., vol. 124, no. 2, pp. 237–254, 2017.
-  S. Roy, E. Sangineto, B. Demir, and N. Sebe, “Deep Metric and Hash-Code Learning for Content-Based Retrieval of Remote Sensing Images,” in Proc. IEEE Int. Geosci. Remote Sens. Symp., 2018, pp. 4539–4542.
-  Y. Yang and S. D. Newsam, “Bag-of-visual-words and spatial extensions for land-use classification,” in Proc. ACM Int. Conf. Adv. Geogr. Inf. Syst., Nov. 2010, pp. 270–279.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Proc. Int. Conf. Neural Inf. Process. Syst., Dec. 2012, pp. 1106–1114.
-  K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in Proc. Int. Conf. Learn. Represent., Sep. 2015, pp. 1–14.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2016, pp. 770–778.
-  J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 248–255.