Graph Metric Learning via Gershgorin Disc Alignment
Abstract
We propose a general projection-free metric learning framework, where the minimization objective Q(M) is a convex differentiable function of the metric matrix M, and M resides in the set 𝒮 of generalized graph Laplacian matrices for connected graphs with positive edge weights and node degrees. Unlike low-rank metric matrices common in the literature, 𝒮 includes the important positive-diagonal-only matrices as a special case in the limit. The key idea for fast optimization is to rewrite the positive definite cone constraint in 𝒮 as signal-adaptive linear constraints via Gershgorin disc alignment, so that the alternating optimization of the diagonal and off-diagonal terms in M can be solved efficiently as linear programs via Frank-Wolfe iterations. We prove that the Gershgorin discs can be aligned perfectly using the first eigenvector v of M, which we update iteratively using Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) with warm start as diagonal / off-diagonal terms are optimized. Experiments show that our efficiently computed graph metric matrices outperform metrics learned using competing methods on classification tasks.
Cheng Yang, Gene Cheung, Wei Hu
Department of Electrical Engineering & Computer Science, York University, Toronto, Canada
Wangxuan Institute of Computer Technology, Peking University, Beijing, China
Keywords: metric learning, graph signal processing
1 Introduction
Given a feature vector f_i per sample i, a metric matrix M defines the feature distance—the Mahalanobis distance [13]—between two samples i and j in the feature space as d(i, j) = (f_i - f_j)^T M (f_i - f_j), where M is commonly assumed to be positive definite (PD). Metric learning—identifying the best metric M minimizing a chosen objective Q(M) subject to M being PD—has been the focus of many recent machine learning research efforts [23, 20, 11, 12, 25].
One key challenge in metric learning is to satisfy the positive (semi-)definite (PSD) cone constraint when minimizing Q(M) in a computation-efficient manner. A standard approach is iterative gradient descent / projection (e.g., proximal gradient (PG) [19]), where a descent step from the current solution at iteration t in the direction of the negative gradient is followed by a projection back to the PSD cone. However, the projection typically requires eigen-decomposition of the iterate and soft-thresholding of its eigenvalues, which is computation-expensive.
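For concreteness, the projection step described above can be sketched as follows; this is a generic PSD projection (eigen-decomposition plus clipping of negative eigenvalues), not part of the proposed method:

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone: eigen-decompose,
    then zero out the negative eigenvalues (the expensive step)."""
    M = (M + M.T) / 2                 # guard against numerical asymmetry
    vals, vecs = np.linalg.eigh(M)
    vals = np.clip(vals, 0.0, None)   # threshold negative eigenvalues to 0
    return vecs @ np.diag(vals) @ vecs.T

# Indefinite example: eigenvalues are -1 and +1.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
P = project_psd(A)
print(P)  # only the non-negative spectral part survives
```

Each call costs a full eigen-decomposition (O(K^3) for a K×K metric), which is exactly the per-iteration overhead that the linear-constraint reformulation below avoids.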
Recent methods consider alternative search spaces of matrices, such as sparse or low-rank matrices, to ease optimization [20, 11, 12, 16, 26]. While efficient, the assumed restricted search spaces often degrade the quality of the sought metric in defining the Mahalanobis distance. For example, low-rank methods explicitly assume reducibility of the available features to a lower dimension, and hence exclude the simple yet important weighted-feature metric case where M contains only positive diagonal entries [24], i.e., M_ii > 0, ∀i, and M_ij = 0, ∀i ≠ j. We show in our experiments that metrics computed by these methods may result in inferior performance for selected applications.
In this paper, we propose a metric learning framework that is both general and projection-free, capable of optimizing any convex differentiable objective Q(M). Compared to low-rank methods, our framework is more encompassing and includes positive-diagonal metric matrices as a special case in the limit.
Assuming M ∈ 𝒮, we next rewrite the PD cone constraint as signal-adaptive linear constraints via Gershgorin disc alignment [2, 3]: first compute scalars s_i from the previous solution M^{t-1} that align the Gershgorin disc left-ends of the similarity-transformed matrix B = S M^{t-1} S^{-1}, where S = diag(s_1, …, s_K), then derive scaled linear constraints using the s_i's to ensure PD-ness of the next computed metric M^t via the Gershgorin Circle Theorem (GCT) [6]. Linear constraints mean that our proposed alternating optimization of the diagonal and off-diagonal terms in M can be solved speedily as linear programs [18] via Frank-Wolfe iterations [9]. We prove that for any metric M in 𝒮, the scalars s_i = 1/v_i perfectly align the Gershgorin disc left-ends of B at the smallest eigenvalue λ_min(M), where v is the first eigenvector of M. We efficiently update v iteratively using Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) [10] with warm start as diagonal / off-diagonal terms are optimized. Experiments show that our computed graph metrics outperform metrics learned using competing methods in a range of applications.
2 Review of Spectral Graph Theory
We consider an undirected graph G = (V, E, W) composed of a node set V of cardinality N, an edge set E connecting nodes, and a weighted adjacency matrix W. Each edge (i, j) ∈ E has a positive weight w_ij > 0 which reflects the degree of similarity between nodes i and j. Specifically, it is common to compute the edge weight w_ij as the exponential of the negative feature distance d(i, j) between nodes i and j [21]:
(1) w_ij = exp(-d(i, j))
Using (1) means 0 < w_ij ≤ 1 for d(i, j) ≥ 0. We discuss the feature distance d(i, j) in the next section.
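As a quick numerical check of (1), the snippet below computes w_ij = exp(-d(i, j)) for a hypothetical 2×2 PD metric M and toy features (all values here are our own illustration, not from the paper):

```python
import numpy as np

M = np.array([[2.0, -1.0],
              [-1.0, 2.0]])          # hypothetical PD metric matrix
feats = [np.array([0.0, 0.0]),
         np.array([1.0, 0.0]),
         np.array([0.0, 0.0])]       # node 2 duplicates node 0's features

def weight(i, j):
    """Edge weight (1): w_ij = exp(-d(i, j)), Mahalanobis distance d."""
    d = feats[i] - feats[j]
    return float(np.exp(-(d @ M @ d)))

print(weight(0, 1), weight(0, 2))  # weight(0, 2) = 1: identical features
```

Since d(i, j) ≥ 0 for a PD metric, every weight indeed lands in (0, 1].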
There may be self-loops in graph G, i.e., (i, i) ∈ E with loop weight u_i = W_ii > 0, so the corresponding diagonal entries of W are positive. The combinatorial graph Laplacian [21] is defined as L = D - W, where D is the degree matrix—a diagonal matrix with D_ii = Σ_j W_ij. A generalized graph Laplacian [4] accounts for self-loops in G also and is defined as L = D - W + diag(W), where diag(W) extracts the diagonal entries of W. Alternatively we can write L = D_g - W, where the generalized degree matrix D_g = D + diag(W) is diagonal.
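The definitions above translate directly into code; the small weighted graph below (with a self-loop at node 0) is our own example:

```python
import numpy as np

# Weighted adjacency with one self-loop: W[0, 0] stores the loop weight u_0.
W = np.array([[0.5, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])

D = np.diag(W.sum(axis=1))           # degree matrix, D_ii = sum_j W_ij
L = D - W                            # combinatorial graph Laplacian
L_gen = D - W + np.diag(np.diag(W))  # generalized Laplacian (keeps self-loops)
print(L_gen)
```

The combinatorial Laplacian has zero row sums, while the generalized Laplacian retains the self-loop weight on its diagonal.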
3 Graph Metric Learning
3.1 Graph Metric Matrices
We first define the search space of metric matrices for our optimization framework. We assume that associated with each sample i is a length-K feature vector f_i ∈ R^K. A metric matrix M ∈ R^{K×K} defines the feature distance d(i, j)—the Mahalanobis distance [13]—between samples i and j as:

(2) d(i, j) = (f_i - f_j)^T M (f_i - f_j)
We require M to be a positive definite (PD) matrix.
Definition 1.
A PD symmetric matrix M is a graph metric if it is a generalized graph Laplacian matrix with positive edge weights and node degrees for an irreducible graph.
For a generalized graph Laplacian to have positive degrees, each node may have a self-loop, but its loop weight u_i must satisfy u_i > -Σ_{j≠i} w_ij. An irreducible graph [15] essentially means that any graph node can communicate with any other node, i.e., the graph is connected.
3.2 Problem Formulation
Denote by 𝒮 the set of all graph metric matrices. We pose an optimization problem for M: find the optimal graph metric M in 𝒮—leading to the inter-sample distances in (2)—that yields the smallest value of a convex differentiable objective Q(M):

(3) min_{M ∈ 𝒮} Q(M) s.t. tr(M) ≤ C

where C > 0 is a chosen parameter. The constraint tr(M) ≤ C is needed to avoid pathological solutions with infinite feature distances, i.e., d(i, j) → ∞. For stability, we assume also that the objective is lower-bounded, i.e., Q(M) > c for some constant c.
Our strategy to solve (3) is to optimize M's diagonal and off-diagonal terms alternately using Frank-Wolfe iterations [9], where each iteration is solved as an LP until convergence. We discuss first the initialization of M, then the two optimizations in order. For notational convenience, we will write the objective simply as Q(M), with the understanding that metric M affects first the feature distances d(i, j), which in turn determine the objective Q.
3.3 Initialization of
We first initialize a valid graph metric M^0 as follows:

1. Initialize each diagonal term M_ii^0 = C/K.

2. Initialize off-diagonal terms M_ij^0, i ≠ j, as:

(4) M_ij^0 = -ε, ∀i ≠ j

where ε > 0 is a small parameter. Initialization of the diagonal terms ensures that constraints M_ii > 0, ∀i, and tr(M^0) ≤ C are satisfied. Initialization of the off-diagonal terms ensures that M^0 is symmetric and irreducible, and constraint M_ij ≤ 0, ∀i ≠ j, is satisfied; i.e., M^0 is a Laplacian matrix for a graph with non-negative edge weights. For sufficiently small ε, M^0 is also diagonally dominant with positive diagonal entries and hence PD. We can thus conclude that the initial M^0 is a graph metric, i.e., M^0 ∈ 𝒮.
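A minimal sketch of one such initialization, assuming the trace budget C is split evenly across the K diagonal terms and every off-diagonal is set to a small negative constant; the particular choice eps = C / (2 K^2) is our own, picked only so that all Gershgorin disc left-ends stay positive:

```python
import numpy as np

def init_graph_metric(K, C=10.0, eps=None):
    """Return a symmetric, irreducible, diagonally dominant M0 with
    positive diagonal (C / K) and negative off-diagonals (-eps)."""
    if eps is None:
        eps = C / (2.0 * K * K)   # assumed choice; keeps disc left-ends > 0
    M = -eps * np.ones((K, K))    # fully connected graph -> irreducible
    np.fill_diagonal(M, C / K)
    return M

M0 = init_graph_metric(4)
# Gershgorin disc left-ends: diagonal minus sum of off-diagonal magnitudes.
left = np.diag(M0) - (np.abs(M0).sum(axis=1) - np.abs(np.diag(M0)))
print(left, np.linalg.eigvalsh(M0)[0])
```

Positive disc left-ends certify PD-ness by GCT, so M0 lands in the feasible set without any eigen-decomposition.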
3.4 Optimization of Diagonal Terms
When optimizing M's diagonal terms M_ii with the off-diagonal terms fixed, (3) becomes

(5) min_{M_ii} Q(M) s.t. tr(M) ≤ C, M PD

where the off-diagonal terms M_ij, i ≠ j, retain their values from the previous iteration. Because the diagonal terms do not affect the irreducibility of matrix M, the only requirements for M to be a graph metric are: i) M must be PD, and ii) the diagonal terms must be strictly positive.
Gershgorin-based Reformulation
To efficiently enforce the PD constraint, we derive sufficient (but not necessary) linear constraints using the Gershgorin Circle Theorem (GCT) [6]. By GCT, each eigenvalue λ of a real matrix M resides in at least one Gershgorin disc Ψ_i, corresponding to row i of M, with center c_i = M_ii and radius r_i = Σ_{j≠i} |M_ij|, i.e.,

(6) c_i - r_i ≤ λ ≤ c_i + r_i

Thus a sufficient condition to ensure M is PD (smallest eigenvalue λ_min > 0) is to ensure that all discs' left-ends are strictly positive, i.e.,

(7) min_i { c_i - r_i } > 0

This translates to a linear constraint for each row i:

(8) M_ii - Σ_{j≠i} |M_ij| ≥ ρ

where ρ > 0 is a sufficiently small parameter. Note that since the off-diagonal terms of a graph metric satisfy M_ij ≤ 0, we have |M_ij| = -M_ij, and so (8) is indeed linear in the entries of M.
However, the GCT lower bound for λ_min is often loose. When optimizing M's diagonal terms, enforcing (8) directly means that we are searching in a smaller space than the original feasible space of (5), resulting in an inferior solution. As an illustration, consider the following example matrix M:
(9) 
The smallest Gershgorin disc left-end c_i - r_i of this matrix is negative. Thus the diagonal terms do not meet constraints (8). However, M is PD, since its smallest eigenvalue λ_min is strictly positive.
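The looseness of the plain GCT bound is easy to reproduce. The 3×3 matrix below is our own toy stand-in for (9): a PD generalized Laplacian whose smallest disc left-end is nonetheless negative, so it violates (8):

```python
import numpy as np

# Toy PD generalized Laplacian (our own example, standing in for (9)).
M = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  1.2, -1.0],
              [ 0.0, -1.0,  2.0]])

radius = np.abs(M).sum(axis=1) - np.abs(np.diag(M))   # r_i = sum_{j!=i} |M_ij|
left_ends = np.diag(M) - radius                       # c_i - r_i
lam_min = np.linalg.eigvalsh(M)[0]
print(left_ends, lam_min)  # one left-end is negative, yet lam_min > 0
```

Enforcing (8) on this matrix would wrongly reject it, even though it is a valid PD graph metric.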
Gershgorin Disc Alignment
To derive more appropriate linear constraints—and thus a more suitable search space when solving (5)—we examine instead the Gershgorin discs of a matrix B similarity-transformed from M, i.e.,

(10) B = S M S^{-1}

where S = diag(s_1, …, s_K) is a diagonal matrix with scalars s_i > 0 along its diagonal. B has the same eigenvalues as M, and thus the smallest Gershgorin disc left-end of B is also a lower bound for M's smallest eigenvalue λ_min. Our goal is then to derive tight lower bounds—by appropriately choosing the scalars s_i used to define S in (10)—adapted to good solutions of (5).

Specifically, given scalars s_1, …, s_K, a disc Ψ_i for B has center c_i = B_ii = M_ii and radius r_i = s_i Σ_{j≠i} |M_ij| / s_j. Thus to ensure B is PD (and hence M is PD), we can write similar linear constraints as (8):

(11) M_ii - s_i Σ_{j≠i} |M_ij| / s_j ≥ ρ, ∀i
It turns out that given a graph metric M, there exist scalars s_i > 0 such that all disc left-ends of B are aligned at the same value λ_min. We state this formally as a theorem.
Theorem 1.
Let M be a graph metric matrix. There exist strictly positive scalars s_1, …, s_K such that all Gershgorin disc left-ends of B = S M S^{-1}, where S = diag(s_1, …, s_K), are aligned exactly at the smallest eigenvalue, i.e., c_i - r_i = λ_min(M), ∀i.
In other words, for matrix B the Gershgorin lower bound is exactly λ_min(M), and the bound is the tightest possible. An important corollary is the following:
Corollary 1.
For any graph metric M, which by definition is PD, there exist scalars s_1, …, s_K > 0 such that M is feasible under the linear constraints in (11) for sufficiently small ρ > 0.
Proof.
By Theorem 1, let s_1, …, s_K be scalars such that all Gershgorin disc left-ends of B = S M S^{-1} align at λ_min(M). Thus

(12) M_ii - s_i Σ_{j≠i} |M_ij| / s_j = λ_min(M), ∀i

where λ_min(M) > 0 since M is PD. Hence M must also satisfy (11) for all i for sufficiently small ρ. ∎
Continuing our earlier example, with suitable scalars s_1, s_2 and s_3, B = S M S^{-1} for M in (9) has all disc left-ends aligned at λ_min > 0. Hence, using these scalars and constraints (11), the diagonal terms of M now constitute a feasible solution.
To prove Theorem 1, we first establish the following lemma.
Lemma 1.
There exists a first eigenvector—an eigenvector corresponding to the smallest eigenvalue λ_min—with strictly positive entries for a graph metric matrix M.
Proof.
By definition, graph metric matrix M is a generalized graph Laplacian with positive edge weights in its off-diagonal terms and positive degrees along its diagonal. Let v be the first eigenvector of M, i.e., M v = λ_min v, which we can rewrite as

v = λ_min M^{-1} v

where λ_min > 0 since M is PD. Since the matrix on the right, M^{-1}, contains only non-negative entries (M, a symmetric PD matrix with non-positive off-diagonal terms, is a nonsingular M-matrix) and is an irreducible matrix, v is a positive eigenvector by the Perron-Frobenius Theorem [7]. ∎
We now prove Theorem 1 as follows.
Proof.
Denote by v a strictly positive eigenvector corresponding to graph metric matrix M's smallest eigenvalue λ_min, which exists by Lemma 1. Define S = diag(s_1, …, s_K) with s_i = 1/v_i. Then,

(13) B = S M S^{-1}, where B_ij = s_i M_ij / s_j = M_ij v_j / v_i

Let 1 = [1, …, 1]^T, so that 1 = S v. Then,

(14) B 1 = S M S^{-1} S v = S M v = λ_min S v = λ_min 1

(14) means that each row of B sums to λ_min, i.e., Σ_j B_ij = λ_min, ∀i. Note that the off-diagonal terms B_ij = M_ij v_j / v_i ≤ 0, since i) v is strictly positive and ii) the off-diagonal terms of graph metric M satisfy M_ij ≤ 0. Thus,

(15) c_i - r_i = B_ii - Σ_{j≠i} |B_ij| = B_ii + Σ_{j≠i} B_ij = λ_min, ∀i

Thus defining s_i = 1/v_i means B has all its Gershgorin disc left-ends aligned at λ_min. ∎
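The alignment in the proof is straightforward to verify numerically. Below, for a small graph metric M (a toy 3×3 example of our own construction), we take the first eigenvector v, set s_i = 1/v_i, and check that every Gershgorin disc left-end of B = S M S^{-1} lands exactly on λ_min:

```python
import numpy as np

M = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  1.2, -1.0],
              [ 0.0, -1.0,  2.0]])   # toy graph metric (PD, irreducible)

vals, vecs = np.linalg.eigh(M)
lam_min = vals[0]
v = vecs[:, 0]
v = v if v[0] > 0 else -v            # first eigenvector, positive by Lemma 1

S = np.diag(1.0 / v)                 # scalars s_i = 1 / v_i
B = S @ M @ np.linalg.inv(S)         # similarity transform (10)
radius = np.abs(B).sum(axis=1) - np.abs(np.diag(B))
left_ends = np.diag(B) - radius
print(left_ends, lam_min)            # all left-ends align at lam_min
```

The same matrix that violated the plain constraints (8) is feasible under the scaled constraints (11) with these scalars.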
Thus, using a positive first eigenvector v of a graph metric M, one can compute the corresponding scalars s_i = 1/v_i to align all disc left-ends of B = S M S^{-1} at λ_min > 0, and M satisfies (11) by Corollary 1. Note that these scalars are signal-adaptive, i.e., the s_i's depend on v, which is computed from M. Our strategy then is to derive scalars s_i from a good solution M^{t-1}, optimize for a better solution M^t using the scaled Gershgorin linear constraints (11), then derive new scalars again until convergence. Specifically,

1. Given scalars {s_i^t}, identify a good solution M^t minimizing the objective subject to (11), i.e.,

(16) M^t = arg min_M Q(M) s.t. M_ii - s_i^t Σ_{j≠i} |M_ij| / s_j^t ≥ ρ, ∀i; tr(M) ≤ C

2. Given M^t, update the scalars as s_i^{t+1} = 1/v_i^t, ∀i, where v^t is the first eigenvector of M^t.

3. Increment t and repeat until convergence.
When the scalars in (16) are updated as s_i^{t+1} = 1/v_i^t for iteration t + 1, we show that the previous solution M^t from iteration t remains feasible at iteration t + 1:
Lemma 2.
Solution M^t, feasible for (16) at iteration t, remains feasible at iteration t + 1 when the scalars are updated as s_i^{t+1} = 1/v_i^t.
Proof.
Using the first eigenvector v^t of graph metric M^t at iteration t, by the proof of Theorem 1 we know that the Gershgorin disc left-ends of S^{t+1} M^t (S^{t+1})^{-1}, where S^{t+1} = diag(1/v_1^t, …, 1/v_K^t), are aligned at λ_min(M^t). Since M^t is a feasible solution in (16) at iteration t, λ_min(M^t) ≥ ρ and tr(M^t) ≤ C. Thus M^t is also a feasible solution when the scalars are updated as s_i^{t+1} = 1/v_i^t. ∎
The remaining issue is how to best compute the first eigenvector v^t of solution M^t repeatedly. For this task, we employ Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) [10], a state-of-the-art iterative algorithm known to compute extreme eigenpairs efficiently. Further, using the previously computed eigenvector v^{t-1} as an initial guess, LOBPCG benefits from warm start when computing v^t, reducing its complexity in subsequent iterations [10].
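LOBPCG is available off the shelf, e.g., as scipy.sparse.linalg.lobpcg. The sketch below (the random Laplacian, shift size, block size, and tolerances are all our own choices) illustrates the warm-start pattern: the eigenvector block computed for one metric seeds the solve for a slightly perturbed one, mimicking consecutive iterations:

```python
import numpy as np
from scipy.sparse.linalg import lobpcg

rng = np.random.default_rng(0)
n = 50
# Dense weighted Laplacian of a random connected graph, shifted to be PD.
A = rng.random((n, n)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
M = np.diag(A.sum(axis=1)) - A + 0.1 * np.eye(n)   # lam_min = 0.1 exactly

# Cold start: random initial block (block size 2 for robustness).
X0 = rng.random((n, 2))
vals, vecs = lobpcg(M, X0, largest=False, tol=1e-8, maxiter=200)

# Warm start: reuse the converged block for a perturbed metric.
M2 = M + 0.01 * np.eye(n)                          # lam_min = 0.11
vals2, vecs2 = lobpcg(M2, vecs, largest=False, tol=1e-8, maxiter=200)
print(min(vals), min(vals2))
```

Because the shifted Laplacian satisfies M 1 = 0.1 · 1, the smallest eigenvalues are known exactly here, making the sketch easy to check.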
FrankWolfe Algorithm
To solve (16), we employ the Frank-Wolfe algorithm [9], which iteratively linearizes the objective Q(M) using its gradient with respect to the diagonal terms m = [M_11, …, M_KK]^T, computed at the previous solution M^{τ-1} of inner iteration τ, i.e.,

(17) g^τ = ∇_m Q(M) |_{M = M^{τ-1}}

Given gradient g^τ, optimization (16) becomes a linear program (LP) at each iteration τ:

(18) m^τ = arg min_m (g^τ)^T m
s.t. m_i - s_i Σ_{j≠i} |M_ij^{τ-1}| / s_j ≥ ρ, ∀i; Σ_i m_i ≤ C

where m is a vector composed of the diagonal terms M_ii, and M_ij^{τ-1} are the (fixed) off-diagonal terms of the previous solution M^{τ-1}. LP (18) can be solved efficiently using known fast algorithms such as Simplex [18] and the interior point method [5]. When a new solution m^τ is obtained, the gradient (17) is updated, and LP (18) is solved again until convergence.
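As a rough illustration, one inner Frank-Wolfe step amounts to handing the LP to any solver. Below we use scipy.optimize.linprog; the gradient g, scalars s, fixed off-diagonals, and the parameters rho and C are all made-up stand-ins:

```python
import numpy as np
from scipy.optimize import linprog

M_off = np.array([[ 0.0, -1.0,  0.0],
                  [-1.0,  0.0, -1.0],
                  [ 0.0, -1.0,  0.0]])   # fixed off-diagonal terms
s = np.array([1.0, 1.2, 1.0])            # Gershgorin scalars s_i
g = np.array([1.0, 0.5, 2.0])            # hypothetical gradient w.r.t. m
rho, C = 1e-3, 10.0

# Row i: m_i >= rho + s_i * sum_{j != i} |M_off[i, j]| / s_j
lb = rho + s * (np.abs(M_off) / s[None, :]).sum(axis=1)

# Minimize g^T m subject to the disc constraints and the trace budget.
res = linprog(c=g, A_ub=np.ones((1, 3)), b_ub=[C],
              bounds=list(zip(lb, [None] * 3)))
print(res.x)  # with g > 0 the LP sits at the lower bounds
```

The scaled disc constraints become simple per-variable lower bounds here, so the LP is trivially cheap compared to an eigen-decomposition.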
3.5 Optimization of Off-diagonal Entries
For the off-diagonal entries of M, we design a block coordinate descent algorithm, which optimizes one row / column of M at a time.
Block Coordinate Iteration
First, we divide M into four sub-matrices:

(19) M = [ M_11  b^T
           b     M_1 ]

where M_11 ∈ R is the first diagonal term, b ∈ R^{K-1} collects the off-diagonal terms of the first column, and M_1 ∈ R^{(K-1)×(K-1)}. Since M is symmetric, the first row holds the same off-diagonal terms b^T. We optimize b in one iteration, i.e.,

(20) min_b Q(M) s.t. M ∈ 𝒮

In the next iteration, a different row / column is selected, and with appropriate row / column permutation, we still optimize the first-column off-diagonal terms as in (20).
Note that the constraint tr(M) ≤ C in (3) can be ignored, since it does not involve the optimization variable b. For M to remain in the set 𝒮 of graph metric matrices, i) M must be PD, ii) M must be irreducible, and iii) b ≤ 0 entry-wise.
As done for the diagonal term optimization, we replace the PD constraint with Gershgorin-based linear constraints. To ensure irreducibility (i.e., the graph remains connected), we require that at least one off-diagonal term (say index k) in the first column has magnitude at least δ > 0. The optimization thus becomes:

(21) min_b Q(M)
s.t. M_ii - s_i Σ_{j≠i} |M_ij| / s_j ≥ ρ, ∀i; b ≤ 0; b_k ≤ -δ

Essentially any selection of k in (21) can ensure M is irreducible. To encourage solution convergence, we select k as the index of the entry of largest magnitude in the previously optimized b.
(21) also has a convex differentiable objective with a set of linear constraints. We thus employ the Frank-Wolfe algorithm again, iteratively linearizing the objective using its gradient with respect to the off-diagonal terms b, where the solution in each iteration is computed as an LP. We omit the details for brevity.
4 Experiments
We evaluate our proposed metric learning method via classification performance. Specifically, the objective function we consider here is the graph Laplacian regularizer (GLR) [21, 17]:

(22) x^T L x = Σ_{(i,j)∈E} w_ij (x_i - x_j)^2

A small GLR means that the signal samples x_i and x_j at connected node pairs (i, j) are similar for a large edge weight w_ij, i.e., x is smooth w.r.t. the variation operator L. GLR has been used in the GSP literature to solve a range of inverse problems, including image denoising [17] and deblurring [1].
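The smoothness interpretation is easy to verify: x^T L x expands into the edge-wise sum of weighted squared signal differences, so a constant signal has zero GLR. The 3-node graph and signals below are our own illustration:

```python
import numpy as np

W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0]])            # toy weighted adjacency matrix
L = np.diag(W.sum(axis=1)) - W             # combinatorial graph Laplacian

glr = lambda x: float(x @ L @ x)           # GLR (22): x^T L x

x_smooth = np.array([1.0, 1.0, 1.0])       # constant signal
x_rough  = np.array([1.0, -1.0, 0.0])      # large jumps across edges
print(glr(x_smooth), glr(x_rough))         # 1*(1+1)^2 + 0.5*(1-0)^2 = 4.5
```

Signals that vary sharply across strongly weighted edges are penalized the most, which is what makes GLR useful as a classification objective here.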
We compare our method with the following competing schemes: three metric learning methods that learn only the diagonal of M, i.e., [27], [14] and [24], and two methods that learn a full matrix M, i.e., [25] and [8]. We do this by performing classification tasks via the following two classifiers: 1) a k-nearest-neighbour (kNN) classifier, and 2) a graph-based (GB) classifier with quadratic formulation min_x x^T L x s.t. x_i = y_i, ∀i ∈ B, where y_i for samples i in the labeled subset B are the observed labels. We evaluate all classifiers on wine (3 classes, 13 features and 178 samples), iris (3 classes, 4 features and 150 samples), seeds (3 classes, 7 features and 210 samples), and pb (2 classes, 10 features and 300 samples). All experiments were performed in Matlab R2017a on an Intel i5-7500 PC with 8GB of RAM running Windows 10. We perform 2-fold cross-validation 50 times using 50 random seeds (0 to 49) with a one-against-all classification strategy. As shown in Table 1, our proposed metric learning method has the lowest classification error rates with the graph-based classifier.
Table 1: Classification error rates (%) for the kNN and graph-based (GB) classifiers.

methods | iris kNN | iris GB | wine kNN | wine GB | seeds kNN | seeds GB | pb kNN | pb GB
[27]    |   4.61   |   4.41  |   3.84   |   4.88  |    7.30   |   7.20   |   —    |   —
[14]    |   4.97   |   4.57  |   4.61   |   5.18  |    7.15   |   6.93   |  4.46  |  5.04
[24]    |   5.45   |   5.49  |   4.35   |   4.96  |    7.78   |   7.40   |  5.33  |  4.51
[25]    |   6.12   |  10.40  |   3.58   |   4.37  |    6.92   |   6.63   |  4.55  |  4.96
[8]     |   4.35   |   4.80  |   4.12   |   4.36  |    7.77   |   7.47   |  4.44  |  4.24
Prop.   |   4.35   |   4.12  |   4.27   |   4.19  |    7.10   |   6.61   |  4.80  |  4.23
Footnotes
As the inter-feature correlations tend to zero, only graph self-loops expressing relative importance among the features remain, and the generalized graph Laplacian matrix tends to a diagonal matrix.
By definition of a metric [22], d(i, j) > 0 if f_i ≠ f_j.
References
[1] (2019) Graph-based blind image deblurring from a single photograph. IEEE Transactions on Image Processing (TIP) 28 (3), pp. 1404–1418.
[2] (2019) Reconstruction-cognizant graph sampling using Gershgorin disc alignment. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.
[3] (2019) Fast graph sampling set selection using Gershgorin disc alignment. arXiv.
[4] (2005) Nodal domain theorems and bipartite subgraphs.
[5] (2009) Convex optimization. Cambridge University Press.
[6] (1931) Über die Abgrenzung der Eigenwerte einer Matrix. Proceedings of the Russian Academy of Sciences (6), pp. 749–754.
[7] (2012) Matrix analysis. Cambridge University Press.
[8] (2019) Feature graph learning for 3D point cloud denoising. CoRR abs/1907.09138.
[9] (2013) Revisiting Frank-Wolfe: projection-free sparse convex optimization. In ICML, Atlanta, Georgia, USA, pp. 427–435.
[10] (2001) Toward the optimal preconditioned eigensolver: locally optimal block preconditioned conjugate gradient method. SIAM Journal on Scientific Computing 23 (2), pp. 517–541.
[11] (2013) Robust structural metric learning. In International Conference on Machine Learning, pp. 615–623.
[12] (2015) Low-rank similarity metric learning in high dimensions. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
[13] (1936) On the generalized distance in statistics. Proceedings of the National Institute of Sciences of India 2 (1), pp. 49–55.
[14] (2016) Joint learning of similarity graph and image classifier from partial labels. In APSIPA, Jeju, South Korea.
[15] (1972) Irreducible graphs. Journal of Combinatorial Theory, Series B 12, pp. 6–31.
[16] (2016) Fixed-rank supervised metric learning on Riemannian manifold. In Thirtieth AAAI Conference on Artificial Intelligence.
[17] (2017) Graph Laplacian regularization for image denoising: analysis in the continuous domain. IEEE Transactions on Image Processing (TIP) 26 (4), pp. 1770–1785.
[18] (1998) Combinatorial optimization. Dover Publications, Inc.
[19] (2013) Proximal algorithms. Foundations and Trends in Optimization 1 (3), pp. 123–231.
[20] (2009) An efficient sparse metric learning in high-dimensional space via L1-penalized log-determinant regularization. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 841–848.
[21] (2013) The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine 30, pp. 83–98.
[22] (2014) Foundations of signal processing. Cambridge University Press.
[23] (2009) Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10, pp. 207–244.
[24] (2018) Alternating binary classifier and graph learning from partial labels. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1137–1140.
[25] (2016) Geometric mean metric learning. In International Conference on Machine Learning, pp. 2464–2471.
[26] (2017) Efficient stochastic optimization for low-rank distance metric learning. In Thirty-First AAAI Conference on Artificial Intelligence.
[27] (2003) Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, pp. 912–919.