# Acceleration of Large Margin Metric Learning for Nearest Neighbor Classification Using Triplet Mining and Stratified Sampling

## Abstract

Metric learning is one of the techniques in manifold learning with the goal of finding a projection subspace for increasing and decreasing the inter- and intra-class variances, respectively. Some metric learning methods are based on triplet learning with anchor-positive-negative triplets. Large margin metric learning for nearest neighbor classification is one of the fundamental methods of this kind. Recently, Siamese networks have been introduced with the triplet loss. Many triplet mining methods have been developed for Siamese networks; however, these techniques have not been applied to the triplets of large margin metric learning for nearest neighbor classification. In this work, inspired by the mining methods for Siamese networks, we propose several triplet mining techniques for large margin metric learning. Moreover, a hierarchical approach is proposed for acceleration and scalability of the optimization, where triplets are selected by stratified sampling in hierarchical hyper-spheres. We analyze the proposed methods on three publicly available datasets: Fisher Iris, ORL faces, and MNIST.


## I Introduction

Distance metric learning is one of the fundamental and most competitive techniques in machine and manifold learning [20]. The goal of metric learning is to find a proper metric whose subspace discriminates the classes by increasing and decreasing the inter- and intra-class variances, respectively [14, 13] (e.g., see Fig. 1). This goal was first introduced by Fisher Discriminant Analysis (FDA) [14, 11].

Some metric learning methods make use of anchor-positive-negative triplets where the positive and negative instances are the data points having the same and different class labels with respect to an anchor instance, respectively. One of the first metric learning methods based on triplets was large margin metric learning for nearest neighbor classification [30, 32]. This method uses Semi-Definite Programming (SDP) optimization [29] as SDP has been found to be useful for metric learning [30, 32, 1, 31]. Later, the concept of a triplet cost function was proposed in the field of neural networks by introducing Siamese networks [16, 24, 18]. The triplet loss can be either in the form of Hinge loss [24] or softmax [35]. The examples of the former and latter are [30, 32, 24] and [15, 23, 34], respectively.

Solving SDP problems requires the interior point method [5], which is iterative and slow, especially for big data. The optimization can be accelerated by selecting the most important data points for embedding [33]. For example, we care more about the nearest or farthest positives and negatives than about all the data points. This technique is referred to as triplet mining in the literature, where the positive and negative instances with respect to an anchor form a triplet [24].

After the introduction of Siamese networks in the literature, different triplet mining techniques were developed for Siamese training using triplets. However, these mining methods have not been implemented or proposed for the previously developed concept of large margin metric learning for nearest neighbor classification. In this work, inspired by the mining techniques for Siamese networks, we propose different triplet mining methods for large margin metric learning. By only considering the most valuable points of the dataset with respect to anchors, the SDP optimization speeds up while preserving an acceptable classification accuracy in large margin metric learning.

In addition to proposing triplet mining techniques for the optimization, we propose a hierarchical approach for further acceleration of metric learning. This approach includes iterative selection of data subsets by hierarchical stratified sampling [3] to train the embedding subspace. Not only does this approach accelerate the SDP optimization by reducing time complexity, but also it improves performance in some cases due to effectiveness of model averaging [7] and reduction of estimation variance by stratified sampling [12]. We also used the proposed triplet mining techniques in combination with the proposed hierarchical approach for the sake of acceleration.

The remainder of the paper is organized as follows. In Section II, we review the foundations of large margin metric learning, triplet loss, and Siamese networks. We discuss the triplet mining methods that have already been proposed for Siamese triplet training, i.e., batch all [9], batch hard [17], batch semi-hard [24], easiest/hardest positives and easiest/hardest negatives, and negative sampling [33]. Section III proposes how to use the triplet mining techniques in the SDP optimization of large margin metric learning. The hierarchical approach is proposed in Section IV. We report the experimental results in Section V. Finally, Section VI concludes the paper and outlines a possible future direction.

## II Background

### II-A Large Margin Metric Learning for Nearest Neighbor Classification

$k$-Nearest Neighbor ($k$-NN) classification is highly impacted by the distance metric utilized for measuring the differences between data points. Euclidean distance does not weight the points and it values them equally. A general distance metric can be viewed as the Euclidean distance after projection of points onto a discriminative subspace. This projection can be viewed as a linear transformation with a projection matrix denoted by $L$ [19]. We call this general metric the Mahalanobis distance [20, 14]:

$$\|x_i - x_j\|_M^2 := \|L^\top (x_i - x_j)\|_2^2 = (x_i - x_j)^\top M\, (x_i - x_j), \tag{1}$$

where $M := L L^\top$. The matrix $M$ must be positive semi-definite, i.e., $M \succeq 0$, for the metric to satisfy convexity and the triangle inequality [5].
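As a quick numerical check of Eq. (1), the following minimal Python sketch compares the two ways of computing the squared distance: projecting the difference vector and taking its squared Euclidean norm, versus evaluating the quadratic form with the Gram matrix of the projection. The helper names and the toy projection matrix are assumptions made for illustration only.

```python
def gram(L):
    """Return M = L L^T for a d-by-p matrix L given as a list of rows."""
    d, p = len(L), len(L[0])
    return [[sum(L[a][c] * L[b][c] for c in range(p)) for b in range(d)]
            for a in range(d)]


def sq_dist_projected(L, xi, xj):
    """Squared Euclidean distance after projection: ||L^T (xi - xj)||^2."""
    diff = [a - b for a, b in zip(xi, xj)]
    p = len(L[0])
    projected = [sum(L[r][c] * diff[r] for r in range(len(diff))) for c in range(p)]
    return sum(v * v for v in projected)


def sq_dist_mahalanobis(M, xi, xj):
    """Quadratic form (xi - xj)^T M (xi - xj)."""
    diff = [a - b for a, b in zip(xi, xj)]
    d = len(diff)
    return sum(diff[a] * M[a][b] * diff[b] for a in range(d) for b in range(d))


# Arbitrary illustrative projection matrix and points (not from the paper).
L_demo = [[1.0, 0.0], [0.5, 2.0]]
xi_demo, xj_demo = [1.0, 2.0], [0.0, 0.0]
M_demo = gram(L_demo)
print(sq_dist_projected(L_demo, xi_demo, xj_demo))    # 20.0
print(sq_dist_mahalanobis(M_demo, xi_demo, xj_demo))  # 20.0
```

Both calls agree because the Gram matrix of any real projection matrix is positive semi-definite by construction.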

In order to improve the $k$-NN classification performance, we should decrease and increase the intra- and inter-class variances of data, respectively [13]. As can be seen in Fig. 1, one way to achieve this goal is to pull the data points of the same class toward one another while pushing the points of different classes away.

Let $y_{ij}$ be one (zero) if the data points $x_i$ and $x_j$ are (are not) from the same class. Moreover, let $\eta_{ij}$ be one if $x_j$ is amongst the $k$-nearest neighbors of $x_i$ with the same class label; otherwise, it is zero. To pull together the points of a class and push different classes away, the following cost function can be minimized [30]:

$$\sum_{i,j} \eta_{ij}\, \|L^\top (x_i - x_j)\|_2^2 + c \sum_{i,j,l} \eta_{ij}\, (1 - y_{il}) \Big[ 1 + \|L^\top (x_i - x_j)\|_2^2 - \|L^\top (x_i - x_l)\|_2^2 \Big]_+, \tag{2}$$

where $c > 0$ is a constant weighting the two terms and $[\cdot]_+ := \max(\cdot, 0)$ is the standard Hinge loss. The first term in Eq. (2) pulls the same-class points towards each other. The second term, on the other hand, is a triplet loss [24] which increases and decreases the inter- and intra-class variances, respectively.
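The cost in Eq. (2) can be evaluated directly for a candidate projection matrix. The sketch below is a plain-Python illustration under two assumptions that are not spelled out in the text: the same-class nearest neighbors are found with the Euclidean distance in the input space (as in [30]), and a trade-off weight `c` multiplies the hinge term. All function names and the toy data are hypothetical.

```python
def sq_dist(L, xi, xj):
    """||L^T (xi - xj)||^2 for a projection matrix L given as a list of rows."""
    diff = [a - b for a, b in zip(xi, xj)]
    cols = range(len(L[0]))
    return sum(sum(L[r][c] * diff[r] for r in range(len(diff))) ** 2 for c in cols)


def target_neighbors(X, y, i, k):
    """Indices of the k nearest same-class points of X[i] (input-space Euclidean)."""
    same = [j for j in range(len(X)) if j != i and y[j] == y[i]]
    same.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
    return same[:k]


def lmnn_cost(L, X, y, k=1, c=1.0):
    """Pull term plus c times the hinge (push) term of the large margin cost."""
    pull, push = 0.0, 0.0
    for i in range(len(X)):
        for j in target_neighbors(X, y, i, k):       # pairs whose indicator is one
            d_ij = sq_dist(L, X[i], X[j])
            pull += d_ij
            for l in range(len(X)):
                if y[l] != y[i]:                      # negatives of the anchor
                    push += max(0.0, 1.0 + d_ij - sq_dist(L, X[i], X[l]))
    return pull + c * push


# Toy one-dimensional data (illustrative only) with the identity projection.
X_demo = [[0.0], [1.0], [1.5], [4.0]]
y_demo = [0, 0, 1, 1]
print(lmnn_cost([[1.0]], X_demo, y_demo, k=1, c=1.0))  # 28.25
```

In this toy case the pull term contributes 14.5 and the hinge term 13.75; the hinge is active only for triplets whose negative lies within the unit margin of the anchor-positive distance.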

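Inspired by support vector machines, the cost function (2) can be restated using slack variables:

$$\begin{aligned}
\underset{M,\, \xi}{\text{minimize}} \quad & \sum_{i,j} \eta_{ij}\, \|x_i - x_j\|_M^2 + c \sum_{i,j,l} \eta_{ij}\, (1 - y_{il})\, \xi_{ijl} \\
\text{subject to} \quad & \|x_i - x_l\|_M^2 - \|x_i - x_j\|_M^2 \geq 1 - \xi_{ijl}, \\
& \xi_{ijl} \geq 0, \\
& M \succeq 0,
\end{aligned} \tag{3}$$

which is an SDP problem [29]. The first terms in the objective functions of Eqs. (2) and (3) are equivalent because of Eq. (1). The Hinge loss in Eq. (2) can be approximated using non-negative slack variables, denoted by $\xi_{ijl}$. The second term of the objective function in Eq. (3), together with the first and second constraints, plays the role of the Hinge loss.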

### II-B Triplet Loss and Siamese Network

As explained for Eq. (2), the second term in that equation is the triplet loss, which pushes the classes away and pulls the points of a class together. In Eq. (2), $x_i$, $x_j$, and $x_l$ are the anchor, positive, and negative instances, respectively. The goal of the triplet loss is to bring the anchor and positive instances closer and push the negative instances away, as also seen in Fig. 1.

Recently, the triplet loss has been used for training neural networks which are called Siamese or triplet networks [24]. A Siamese network is composed of three sub-networks which share their weights. The anchor, positive, and negative instances are fed to these sub-networks and the triplet loss is used to tune their weights. Siamese networks are usually used for learning a discriminative embedding space. In this work, we propose several triplet mining methods inspired by the triplet mining techniques already existing in the literature for the Siamese nets.

## III Proposed Triplet Mining

The optimization problem in Eq. (3) considers all the negative instances, even in large datasets. Solving Problem (3) as an SDP is very time-consuming [29]. Hence, Problem (3) becomes intractable for large datasets, as has been noted in [30]. This motivated us to use triplet mining on the data for further improvement upon [30, 32]. Several triplet mining methods have been proposed for Siamese network training. Inspired by those, we propose triplet mining techniques for the objective function of Eq. (3) to facilitate the optimization process. In the following, we propose $k$-batch all, $k$-batch hard, $k$-batch semi-hard, extreme distances, and negative sampling for large margin metric learning.

### III-A $k$-Batch All

One of the mining methods to be considered is batch all, which takes all the positives and negatives of the data batch into account for a Siamese neural network [9]. The method proposed in [30, 32] is a batch-all version which takes only the $k$ nearest positives and all the negatives. This makes sense because the SDP is slow and cannot handle all possible combinations of positive and negative instances. Here, we call this method $k$-batch all ($k$-BA), where the objective is the one in Eq. (3):

$$\sum_{i,j} \eta_{ij}\, \|x_i - x_j\|_M^2 + c \sum_{i,j,l} \eta_{ij}\, (1 - y_{il})\, \xi_{ijl}. \tag{4}$$
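To give a feel for why keeping all the negatives is costly, the following sketch counts the triplets (and hence slack variables and margin constraints) generated when every anchor keeps a fixed number of positives and either all negatives or only a fixed number of them, as the mined variants in the following subsections do. The class sizes and the parameter name `k_neg` are hypothetical and are not taken from the experiments.

```python
def count_triplets_batch_all(class_sizes, k):
    """Triplets when each anchor keeps k positives and all negatives."""
    n = sum(class_sizes)
    return sum(size * min(k, size - 1) * (n - size) for size in class_sizes)


def count_triplets_mined(class_sizes, k, k_neg):
    """Triplets when each anchor keeps k positives and at most k_neg negatives."""
    n = sum(class_sizes)
    return sum(size * min(k, size - 1) * min(k_neg, n - size) for size in class_sizes)


sizes_demo = [35, 35, 35]  # hypothetical class sizes, not the paper's split
print(count_triplets_batch_all(sizes_demo, k=3))       # 22050
print(count_triplets_mined(sizes_demo, k=3, k_neg=3))  # 945
```

The batch-all count grows with the number of negatives per anchor, whereas the mined count grows only with the number of anchors.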

### III-B $k$-Batch Hard

Another mining method for Siamese networks is batch hard, in which the farthest positive and nearest negative with respect to the anchor are considered [17]. The farthest positive is the hardest one to be classified as a neighbor of the anchor. Likewise, the nearest negative is the hardest one to be separated from the anchor's class. In this work, we consider $k$ positive and $k'$ negative instances and we call this $k$-batch hard ($k$-BH), where the objective in Eq. (3) becomes:

$$\sum_{i,j} \eta^{\mathrm{far}}_{ij}\, \|x_i - x_j\|_M^2 + c \sum_{i,j,l} \eta^{\mathrm{far}}_{ij}\, \gamma^{\mathrm{near}}_{il}\, (1 - y_{il})\, \xi_{ijl}, \tag{5}$$

where $\eta^{\mathrm{far}}_{ij}$ is one (zero) if $x_j$ is (is not) amongst the $k$-farthest neighbors of $x_i$ with the same class label. Similarly, $\gamma^{\mathrm{near}}_{il}$ is one (zero) if $x_l$ is (is not) amongst the $k'$-nearest neighbors of $x_i$ with a different class label.

### III-C $k$-Batch Semi-Hard

Batch semi-hard is another method for Siamese networks, in which the hardest negatives (closest to the anchor) that are still farther from the anchor than the positive are taken into account [24]. In our work, we have $k$ positive instances and, for each, we consider $k'$ negatives. We call this method $k$-batch semi-hard ($k$-BSH), in which the cost in Eq. (3) can be modeled as:

$$\sum_{i,j} \eta_{ij}\, \|x_i - x_j\|_M^2 + c \sum_{i,j,l} \eta_{ij}\, \gamma^{\mathrm{sh}}_{ijl}\, (1 - y_{il})\, \xi_{ijl}, \tag{6}$$

where $\eta_{ij}$, as defined before, is one (zero) if $x_j$ is (is not) amongst the $k$-nearest neighbors of $x_i$ with the same class label, and $\gamma^{\mathrm{sh}}_{ijl}$ is one (zero) if $x_l$ is (is not) amongst the $k'$-nearest neighbors of $x_i$ that have a different class label and are farther from $x_i$ than $x_j$ is.
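The semi-hard selection rule can be sketched as follows. This is a minimal illustration rather than the authors' implementation: distances are taken as squared Euclidean by default, `k_neg` stands for the number of negatives kept per anchor-positive pair, and all names are hypothetical.

```python
def sq_euclidean(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def batch_semi_hard(X, y, k=1, k_neg=1, dist=sq_euclidean):
    """Return (anchor, positive, negative) index triplets for semi-hard mining."""
    triplets = []
    n = len(X)
    for i in range(n):
        # k nearest positives of the anchor
        pos = sorted((j for j in range(n) if j != i and y[j] == y[i]),
                     key=lambda j: dist(X[i], X[j]))[:k]
        # all negatives, nearest first
        neg = sorted((l for l in range(n) if y[l] != y[i]),
                     key=lambda l: dist(X[i], X[l]))
        for j in pos:
            d_ij = dist(X[i], X[j])
            # nearest negatives that are still farther away than the positive
            farther = [l for l in neg if dist(X[i], X[l]) > d_ij]
            triplets.extend((i, j, l) for l in farther[:k_neg])
    return triplets


# Toy one-dimensional data (illustrative only).
X_demo = [[0.0], [2.0], [1.0], [3.6], [5.0]]
y_demo = [0, 0, 1, 1, 1]
print(batch_semi_hard(X_demo, y_demo, k=1, k_neg=1))
# [(0, 1, 3), (1, 0, 4), (3, 4, 1), (4, 3, 1)]
```

Anchor 2 yields no triplet in this toy example because both of its negatives are closer than its nearest positive, so no negative satisfies the semi-hard condition.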

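### III-D Extreme Distances

Considering that every instance could be chosen based on its distance to the anchor (whether it is nearest or farthest), we have four different cases [25]. Easy and hard positives correspond to the nearest and farthest positives, respectively; easy and hard negatives correspond to the farthest and nearest negatives, respectively. Easy Positive-Easy Negative (EPEN), Easy Positive-Hard Negative (EPHN), Hard Positive-Easy Negative (HPEN), and Hard Positive-Hard Negative (HPHN) are the four possible cases. HPHN is equivalent to the batch hard method explained in Section III-B. Since we are taking $k$ and $k'$ instances from the positive and negative sets, respectively, the costs in Eq. (3) for the other three cases (EPEN, EPHN, and HPEN) are as follows:

$$\sum_{i,j} \eta_{ij}\, \|x_i - x_j\|_M^2 + c \sum_{i,j,l} \eta_{ij}\, \gamma^{\mathrm{far}}_{il}\, (1 - y_{il})\, \xi_{ijl}, \tag{7}$$

$$\sum_{i,j} \eta_{ij}\, \|x_i - x_j\|_M^2 + c \sum_{i,j,l} \eta_{ij}\, \gamma^{\mathrm{near}}_{il}\, (1 - y_{il})\, \xi_{ijl}, \tag{8}$$

$$\sum_{i,j} \eta^{\mathrm{far}}_{ij}\, \|x_i - x_j\|_M^2 + c \sum_{i,j,l} \eta^{\mathrm{far}}_{ij}\, \gamma^{\mathrm{far}}_{il}\, (1 - y_{il})\, \xi_{ijl}, \tag{9}$$

where $\eta_{ij}$ and $\eta^{\mathrm{far}}_{ij}$ indicate that $x_j$ is amongst the $k$-nearest and $k$-farthest same-class neighbors of $x_i$, respectively, $\gamma^{\mathrm{near}}_{il}$ indicates that $x_l$ is amongst the $k'$-nearest neighbors of $x_i$ with a different class label, and $\gamma^{\mathrm{far}}_{il}$ is one (zero) if $x_l$ is (is not) amongst the $k'$-farthest neighbors of $x_i$ with a different class label. The hardest cases are useful due to the concept of opposition learning [27] and the fact that data points that are harder to separate deserve more emphasis. Moreover, the easiest cases have been found to be effective in the literature [34].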
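The four extreme-distance cases differ only in the direction in which positives and negatives are sorted, which the following sketch makes explicit. It is an illustrative implementation under the assumption of squared Euclidean distances; the function name, the flags, and `k_neg` (the number of negatives per anchor) are hypothetical.

```python
def sq_euclidean(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def mine_extreme(X, y, k=1, k_neg=1, hard_positive=False, hard_negative=True,
                 dist=sq_euclidean):
    """Return (anchor, positive, negative) triplets for one extreme-distance case.

    hard_positive=True picks the farthest positives, False the nearest ones;
    hard_negative=True picks the nearest negatives, False the farthest ones.
    """
    triplets = []
    n = len(X)
    for i in range(n):
        pos = sorted((j for j in range(n) if j != i and y[j] == y[i]),
                     key=lambda j: dist(X[i], X[j]), reverse=hard_positive)[:k]
        neg = sorted((l for l in range(n) if y[l] != y[i]),
                     key=lambda l: dist(X[i], X[l]), reverse=not hard_negative)[:k_neg]
        triplets.extend((i, j, l) for j in pos for l in neg)
    return triplets


def for_anchor(triplets, anchor):
    """Keep only the triplets of one anchor (helper for the demo below)."""
    return [t for t in triplets if t[0] == anchor]


# Toy one-dimensional data (illustrative only).
X_demo = [[0.0], [1.0], [3.0], [4.5], [6.0], [10.0]]
y_demo = [0, 0, 0, 1, 1, 1]
epen = mine_extreme(X_demo, y_demo, hard_positive=False, hard_negative=False)
ephn = mine_extreme(X_demo, y_demo, hard_positive=False, hard_negative=True)
hpen = mine_extreme(X_demo, y_demo, hard_positive=True, hard_negative=False)
hphn = mine_extreme(X_demo, y_demo, hard_positive=True, hard_negative=True)
print(for_anchor(epen, 0), for_anchor(ephn, 0), for_anchor(hpen, 0), for_anchor(hphn, 0))
# [(0, 1, 5)] [(0, 1, 3)] [(0, 2, 5)] [(0, 2, 3)]
```

Setting both flags to `True` reproduces the batch hard selection of Section III-B.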

### III-E Negative Sampling

In negative sampling, another mining method proposed for Siamese networks, each negative's probability of selection is calculated for every positive instance using a probability distribution over distances. The distribution of pairwise distances, denoted by $q(d)$, of two points can be estimated as [33]:

$$q(d) \propto d^{\,p-2} \left[ 1 - \frac{1}{4} d^2 \right]^{\frac{p-3}{2}}, \tag{10}$$

where $p$ is the dimensionality of data and the distance $d$ is defined by Eq. (1). For an anchor $x_i$, the probability of a negative instance $x_l$, with distance $\|x_i - x_l\|_M$ from $x_i$, can be calculated as [33]:

$$\mathbb{P}(x_l \mid x_i) \propto \min\!\big( \lambda,\; q^{-1}(\|x_i - x_l\|_M) \big), \tag{11}$$

where $q^{-1}(\cdot) := 1 / q(\cdot)$ and the constant $\lambda$ is for giving all the negatives a minimum chance of selection. One can use a roulette wheel strategy for selecting negative instances using the probability in Eq. (11) [26].

In this work, we select the $k$-nearest positives and sample $k'$ negatives for every anchor-positive pair. We call this method $k$-negative sampling ($k$-NS) and its cost function in Eq. (3) is:

$$\sum_{i,j} \eta_{ij}\, \|x_i - x_j\|_M^2 + c \sum_{i,j,l} \eta_{ij}\, \gamma^{\mathrm{ns}}_{ijl}\, (1 - y_{il})\, \xi_{ijl}, \tag{12}$$

where $\gamma^{\mathrm{ns}}_{ijl}$ is one (zero) if $x_l$ is (is not) a sampled negative for the anchor-positive pair $(x_i, x_j)$.
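The distance-weighted sampling of negatives can be sketched in a few lines. The sketch assumes the form of the distance distribution given in [33], which holds for points on a unit hyper-sphere, so distances must lie strictly between 0 and 2; it also draws with replacement, which is one possible reading of roulette wheel selection. The names `p` (dimensionality), `lam` (the cap on the weights), and `k_neg` (the number of draws) are illustrative.

```python
import math
import random


def log_q(d, p):
    """Unnormalised log-density of pairwise distances on a unit hyper-sphere.

    Valid for d in the open interval (0, 2) and dimensionality p.
    """
    return (p - 2) * math.log(d) + 0.5 * (p - 3) * math.log(1.0 - 0.25 * d * d)


def negative_weights(distances, p, lam):
    """Unnormalised selection weights min(lam, 1 / q(d)) for each negative."""
    return [min(lam, math.exp(-log_q(d, p))) for d in distances]


def sample_negatives(neg_indices, distances, p, lam, k_neg, rng):
    """Roulette-wheel selection (with replacement) of k_neg negatives."""
    weights = negative_weights(distances, p, lam)
    return rng.choices(neg_indices, weights=weights, k=k_neg)


# Illustrative values only: two negatives at distances 0.1 and 1.0 in 4 dimensions.
weights_demo = negative_weights([0.1, 1.0], p=4, lam=5.0)
print(weights_demo)  # first weight is capped at lam = 5.0
picked_demo = sample_negatives([7, 8], [0.1, 1.0], p=4, lam=5.0, k_neg=3,
                               rng=random.Random(0))
```

Working in the log domain avoids overflow when the density is tiny for very close negatives, and the cap keeps such negatives from dominating the draw.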

## IV Proposed Hierarchical Large Margin Metric Learning with Stratified Sampling

The triplet mining methods introduced in the previous section are promising techniques for better and faster performance of large margin metric learning; however, they can be further improved. We propose a hierarchical approach for accelerating large margin metric learning. The main idea is to solve the optimization on portions of the training data in order to tackle the slow pace of SDP. To take the whole training data into account, however, these portions are introduced to the optimization problem hierarchically. This technique works in a divide-and-conquer manner to accelerate the training phase [8]. It can also improve the performance of the embedding model due to model averaging [7, 6] and reduction of estimation variance by stratified sampling [12].

The procedure of this hierarchical approach can be found in Algorithm 1. As can be seen in this algorithm, the approach is iterative. In every iteration, several hyper-spheres are considered in the space of data and the triplets are sampled from inside the hyper-spheres (see the sampling step in Algorithm 1). We employ stratified sampling [3], where the classes of data are considered to be the strata. The SDP optimization, Eq. (3), is solved at every iteration using merely the sampled triplets rather than the whole data (see the optimization step in Algorithm 1). We factorize the matrix $M$ in Eq. (1) into $L L^\top$ using eigenvalue decomposition:

$$M = V \Sigma V^\top = V \Sigma^{1/2}\, \Sigma^{1/2} V^\top = L L^\top, \quad \text{with } L := V \Sigma^{1/2}, \tag{13}$$

which can be done because $M \succeq 0$. As Eq. (1) shows, metric learning can be viewed as Euclidean distance after projection onto a subspace spanned by the columns of $L$, i.e., the column space of $L$. Hence, the whole data are projected into the metric subspace trained by the sampled triplets (see the projection step in Algorithm 1). Note that, to prevent the data from collapsing in subspaces with low ranks, one can slightly strengthen the diagonal of $M$, which results in larger eigenvalues without affecting the projection directions [22].
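As a small worked illustration of this factorization, consider a $2 \times 2$ positive semi-definite matrix chosen here purely for illustration (it does not come from the experiments). Writing $M$ for the metric matrix, $V$ and $\Sigma$ for its eigenvector and eigenvalue matrices, and $L := V \Sigma^{1/2}$ for the resulting factor:

$$M = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}, \qquad V = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix},$$

$$L = V \Sigma^{1/2} = \frac{1}{\sqrt{2}} \begin{bmatrix} \sqrt{3} & 1 \\ \sqrt{3} & -1 \end{bmatrix}, \qquad L L^\top = \frac{1}{2} \begin{bmatrix} 3 + 1 & 3 - 1 \\ 3 - 1 & 3 + 1 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} = M.$$

Projecting a point $x$ as $L^\top x$ and then measuring the Euclidean distance therefore reproduces the distance induced by $M$.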

At every iteration, the number of hyper-spheres and their radius are determined by functions of the iteration index that are decreasing and increasing, respectively. This is because, as the algorithm progresses, we want to make the hyper-spheres coarser to see more of the data while, at the same time, reducing their number to avoid much overlap between the sampling areas. The size of stratified sampling in every hyper-sphere can also change with the iteration index because, in the late iterations, there is no need to consider all the data in the hyper-sphere but only a part of them. For the stratified sampling size, we sample a portion of each available class (i.e., stratum) within the hyper-sphere.
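The sampling step inside one hyper-sphere can be sketched as below. This is a minimal illustration under stated assumptions: membership is decided with the Euclidean distance to a center supplied by the caller (how centers are chosen is not covered by this sketch), at least one point is kept per class present in the hyper-sphere, and all names are hypothetical.

```python
import random


def stratified_sample_in_sphere(X, y, center, radius, portion, rng):
    """Sample a fraction `portion` of every class found within `radius` of `center`."""
    strata = {}
    for idx, (x, label) in enumerate(zip(X, y)):
        inside = radius ** 2 >= sum((a - b) ** 2 for a, b in zip(x, center))
        if inside:
            strata.setdefault(label, []).append(idx)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        size = max(1, int(portion * len(members)))  # keep at least one per stratum
        sample.extend(rng.sample(members, size))
    return sorted(sample)


# Toy two-dimensional data (illustrative only).
X_demo = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.2, 0.2], [1.0, 1.0], [5.0, 5.0]]
y_demo = [0, 0, 1, 1, 1, 0]
picked_demo = stratified_sample_in_sphere(X_demo, y_demo, center=[0.0, 0.0],
                                          radius=1.0, portion=0.5,
                                          rng=random.Random(0))
print(picked_demo)  # one index from each class among the four points inside
```

Sampling per class rather than over the pooled points keeps every class present in the hyper-sphere represented among the sampled triplets, which is the property stratified sampling is used for here.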

We initialize the radius, number of hyper-spheres, and the portion of sampling by , (clipped to ), and . The updates of these variables are performed as , , and , where and is the average standard deviation along features.

Table I: Accuracy and run-time of the triplet mining methods in the non-hierarchical and hierarchical approaches on the Iris, ORL faces, and MNIST datasets.

| Dataset | Approach | Measure | $k$-BA | $k$-BH | $k$-BSH | $k$-HPEN | $k$-EPEN | $k$-EPHN | $k$-NS |
|---|---|---|---|---|---|---|---|---|---|
| Iris | Non-Hierarchical | Accuracy (%) | 72.73 | 100 | 86.36 | 95.45 | 81.82 | 95.45 | 72.73 |
| | | Time (sec) | 832.85 | 5.51 | 6.62 | 4.77 | 5.34 | 5.11 | 5.06 |
| | Hierarchical | Accuracy (%) | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| | | Time (sec) | 23.73 | 9.72 | 4.54 | 7.25 | 4.73 | 5.05 | 4.64 |
| ORL Faces | Non-Hierarchical | Accuracy (%) | – | 85.00 | 78.75 | 72.50 | 75.00 | 85.00 | 77.50 |
| | | Time (sec) | – | 16.13 | 18.61 | 19.59 | 19.19 | 16.31 | 19.05 |
| | Hierarchical | Accuracy (%) | 76.25 | 76.25 | 81.25 | 78.75 | 78.75 | 81.25 | 63.75 |
| | | Time (sec) | 0.39 | 0.93 | 0.79 | 4.36 | 1.07 | 0.95 | 0.39 |
| MNIST | Non-Hierarchical | Accuracy (%) | – | 82.00 | 79.00 | 82.00 | 78.00 | 82.00 | 78.00 |
| | | Time (sec) | – | 122.21 | 182.13 | 152.89 | 173.18 | 135.64 | 170.33 |
| | Hierarchical | Accuracy (%) | 71.00 | 77.00 | 79.00 | 81.00 | 75.00 | 78.00 | 79.00 |
| | | Time (sec) | 27.17 | 1.56 | 1.55 | 0.49 | 1.02 | 1.70 | 1.55 |

## V Experimental Results and Analysis

### V-A Datasets and Setup

In this work, we use three publicly available datasets. The first dataset is the Fisher Iris data [10], which includes 150 data points in three classes with dimensionality 4. The second dataset is the ORL faces data [2], with 40 subjects (classes) each having 10 images. The facial images are of size $112 \times 92$ pixels. The third dataset is the MNIST digits data [21] with $28 \times 28$-pixel images.

The Iris dataset was randomly split into train, validation, and test sets. In the ORL dataset, the first six faces of every subject made up the training data and the remaining images were split into test and validation sets. A subset of MNIST with 400-100-100 images was also taken for train-validation-test. Note that the SDP in large margin metric learning cannot handle very large datasets because the optimization is slow. The ORL dataset was further projected onto the 15 leading eigenfaces [28] as pre-processing [19]. The MNIST data were also projected onto the principal component analysis subspace with dimensionality 30. The validation set was used for tuning the hyperparameters of the methods.

### V-B Comparison of Triplet Mining Methods in the Non-Hierarchical and Hierarchical Approaches

For each dataset, we report the accuracy of $k$-nearest neighbor classification using the Mahalanobis distance for the different triplet mining methods. Table I reports the accuracies and run-times for the Iris, ORL faces, and MNIST datasets.
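For reference, nearest neighbor classification under a learned metric can be sketched as follows. This is a generic illustration, not the evaluation code used for the experiments; the metric matrices, training points, and query below are toy values chosen to show that changing the metric can change the prediction.

```python
from collections import Counter


def knn_predict(X_train, y_train, x, M, k=1):
    """Majority vote among the k training points nearest to x under (a-b)^T M (a-b)."""
    def sq_dist(a, b):
        diff = [u - v for u, v in zip(a, b)]
        d = len(diff)
        return sum(diff[r] * M[r][c] * diff[c] for r in range(d) for c in range(d))

    order = sorted(range(len(X_train)), key=lambda i: sq_dist(X_train[i], x))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data (illustrative only).
X_train_demo = [[0.0, 0.0], [0.0, 3.0], [1.0, 0.0]]
y_train_demo = [0, 0, 1]
query_demo = [0.9, 1.6]
identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[0.01, 0.0], [0.0, 1.0]]   # hypothetical metric down-weighting feature 0
print(knn_predict(X_train_demo, y_train_demo, query_demo, identity))   # 1
print(knn_predict(X_train_demo, y_train_demo, query_demo, stretched))  # 0
```

With the identity matrix the query is closest to the class-1 point, while the second matrix nearly ignores the first feature and the nearest neighbor becomes a class-0 point.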

In all datasets, $k$-BH obtains the highest accuracy in the non-hierarchical approach (tied with other methods on ORL and MNIST). In the hierarchical approach, however, $k$-BSH obtains top or near-top accuracy. The reason for the acceptable performance of $k$-BH and $k$-BSH is the use of hard (near) negative instances in training, which helps avoid overfitting to the training data. In the ORL faces data, the best non-hierarchical accuracy belongs to $k$-BH and $k$-EPHN. This is because both of these methods use the hardest negative instances for training, which again helps avoid overfitting. For the same reason, $k$-BSH has the second best performance in this dataset. Moreover, we see that the result of $k$-NS is acceptable in this data, which is due to the effectiveness of the probability distribution used for sampling from the negative instances. This distribution was recently proposed for Siamese training [33]; however, the results show that it is also effective for triplet mining in large margin metric learning.

In the case of the Iris data, due to the small size and simplicity of the dataset, the accuracies are all perfect in the hierarchical approach. In this approach, $k$-BSH obtains the highest accuracy on the ORL dataset (tied with $k$-EPHN) and is second only to $k$-HPEN on MNIST; the strong results of $k$-BSH can be interpreted as explained above. As Table I shows, the hierarchical approach either outperforms the non-hierarchical approach (due to model averaging) or has comparable results, with much less run-time on the larger ORL and MNIST datasets.

In the non-hierarchical approach, we tested $k$-BA merely on the Iris dataset because the two other datasets are too large for $k$-BA, as it considers all the negative instances. For the same reason, it is very time-consuming; hence, the longest time belongs to $k$-BA in Table I. For the ORL and MNIST datasets, the longest time belongs to $k$-HPEN and $k$-BSH, respectively, mainly due to handling the hard cases in optimization. As the table shows, the hierarchical approach is scalable and much faster because of sampling. For this reason, we could run $k$-BA efficiently for all three datasets in this approach. Note that the computer used for the simulations had an Intel Core-i7 processor at 1.80GHz with 16GB RAM.

### V-C Comparison of Triplet Mining Methods By Ghost Faces

As Eq. (13) shows, metric learning can be viewed as Euclidean distance after projection onto a subspace spanned by the columns of $L$. In the eigenvalue decomposition, the eigenvectors and eigenvalues are sorted from the leading to the trailing.

Inspired by eigenfaces [28] and Fisherfaces [4], for large margin metric learning, we can visualize the eigen-subspaces (column spaces of $L$) for the facial dataset in order to display the ghost faces. Here, we consider the top ten columns of $L$. The ghost faces of the ORL face dataset are depicted in Fig. 2. As seen in this figure, the $k$-NS features are more discriminative and distinguish the different classes using various extracted features including eye, eyebrow, cheeks (for eyeglasses), chin, hair, and nose. In second place after $k$-NS, the $k$-BSH, $k$-HPEN, and $k$-EPEN features are diverse enough (including eye, cheek, nose, and hair) for discriminating the classes. The $k$-BH and $k$-EPHN features have mostly concentrated on the eye and eyebrow. This makes sense because many of the subjects in the ORL face dataset wear eyeglasses.

## VI Conclusion and Future Direction

Large margin metric learning for nearest neighbor classification makes use of SDP optimization, which is very slow and computationally expensive because of the interior point optimization method, especially when the data scale up. In this paper, inspired by the state-of-the-art triplet mining techniques for Siamese network training, we proposed and analyzed several triplet mining methods for large margin metric learning. These triplet mining methods make the set of triplets smaller by limiting the instances to the most important ones. This speeds up the optimization and makes it more efficient. The proposed triplet mining techniques were $k$-BA, $k$-BH, $k$-BSH, $k$-HPEN, $k$-EPEN, $k$-EPHN, and $k$-NS. Moreover, we suggested a new hierarchical approach which, in combination with the triplet mining methods, reduces the training time considerably and makes the method scalable. Our experiments on three publicly available datasets verified the effectiveness of the proposed approaches. A possible future direction is to try the proposed hierarchical approach using stratified sampling on other subspace learning methods.

### References

1. (2008) Distance metric learning vs. Fisher discriminant analysis. In Proceedings of the 23rd National Conference on Artificial Intelligence, Vol. 2, pp. 598–603.
2. (2001) ORL face dataset. AT&T Laboratories Cambridge.
3. (1974) Elements of sampling theory. English Universities Press, London.
4. (1997) Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (7), pp. 711–720.
5. (2004) Convex optimization. Cambridge University Press.
6. (1996) Bagging predictors. Machine Learning 24 (2), pp. 123–140.
7. (2008) Model selection and model averaging. Cambridge Books.
8. (2009) Introduction to algorithms. MIT Press.
9. (2015) Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition 48 (10), pp. 2993–3003.
10. (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
11. (2001) The elements of statistical learning. Vol. 1, Springer Series in Statistics, New York.
12. (2019) The theory behind overfitting, cross validation, regularization, bagging, and boosting: tutorial. arXiv preprint arXiv:1905.12787.
13. (2020) Fisher discriminant triplet and contrastive losses for training Siamese networks. In IEEE International Joint Conference on Neural Networks (IJCNN).
14. (2006) Metric learning by collapsing classes. In Advances in Neural Information Processing Systems, pp. 451–458.
15. (2005) Neighbourhood components analysis. In Advances in Neural Information Processing Systems, pp. 513–520.
16. (2006) Dimensionality reduction by learning an invariant mapping. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 1735–1742.
17. (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
18. (2015) Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92.
19. (2011) Principal component analysis. Springer.
20. (2013) Metric learning: a survey. Foundations and Trends® in Machine Learning 5 (4), pp. 287–364.
21. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
22. (1999) Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop, pp. 41–48.
23. (2017) No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 360–368.
24. (2015) FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
25. (2020) Offline versus online triplet mining based on extreme distances of histopathology patches. arXiv preprint arXiv:2007.02200.
26. (2009) Metaheuristics: from design to implementation. Vol. 74, John Wiley & Sons.
27. (2005) Opposition-based learning: a new scheme for machine intelligence. In International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC’06), Vol. 1, pp. 695–701.
28. (1991) Face recognition using eigenfaces. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 586–587.
29. (1996) Semidefinite programming. SIAM Review 38 (1), pp. 49–95.
30. (2006) Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems, pp. 1473–1480.
31. (2006) Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision 70 (1), pp. 77–90.
32. (2009) Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10 (Feb), pp. 207–244.
33. (2017) Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2840–2848.
34. (2020) Improved embeddings with easy positive triplet mining. In The IEEE Winter Conference on Applications of Computer Vision, pp. 2474–2482.
35. (2019) Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6210–6219.