# Graph Metric Learning via Gershgorin Disc Alignment

## Abstract

We propose a general projection-free metric learning framework, where the minimization objective $Q(\mathbf{M})$ is a convex differentiable function of the metric matrix $\mathbf{M}$, and $\mathbf{M}$ resides in the set $\mathcal{S}$ of generalized graph Laplacian matrices for connected graphs with positive edge weights and node degrees. Unlike low-rank metric matrices common in the literature, $\mathcal{S}$ includes the important positive-diagonal-only matrices as a special case in the limit. The key idea for fast optimization is to rewrite the positive definite cone constraint in $\mathcal{S}$ as signal-adaptive linear constraints via Gershgorin disc alignment, so that the alternating optimization of the diagonal and off-diagonal terms in $\mathbf{M}$ can be solved efficiently as linear programs via Frank-Wolfe iterations. We prove that the Gershgorin discs can be aligned perfectly using the first eigenvector $\mathbf{v}$ of $\mathbf{M}$, which we update iteratively using Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) with warm start as diagonal / off-diagonal terms are optimized. Experiments show that our efficiently computed graph metric matrices outperform metrics learned using competing methods in classification tasks.

Cheng Yang, Gene Cheung, Wei Hu

Department of Electrical Engineering & Computer Science, York University, Toronto, Canada
Wangxuan Institute of Computer Technology, Peking University, Beijing, China

Keywords: Metric learning, graph signal processing

## 1 Introduction

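Given a feature vector $\mathbf{f}_i \in \mathbb{R}^K$ per sample $i$, a metric matrix $\mathbf{M} \in \mathbb{R}^{K \times K}$ defines the feature distance (the Mahalanobis distance [13]) between two samples $i$ and $j$ in a feature space as $(\mathbf{f}_i - \mathbf{f}_j)^\top \mathbf{M} (\mathbf{f}_i - \mathbf{f}_j)$, where $\mathbf{M}$ is commonly assumed to be positive definite (PD). Metric learning, i.e., identifying the best metric $\mathbf{M}$ minimizing a chosen objective function $Q(\mathbf{M})$ subject to $\mathbf{M} \succ 0$, has been the focus of many recent machine learning research efforts [23, 20, 11, 12, 25].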

One key challenge in metric learning is to satisfy the positive (semi-)definite (PSD) cone constraint $\mathbf{M} \succ 0$ ($\mathbf{M} \succeq 0$) when minimizing $Q(\mathbf{M})$ in a computationally efficient manner. A standard approach is iterative gradient-descent / projection (e.g., proximal gradient (PG) [19]), where a descent step of size $\alpha$ from the current solution $\mathbf{M}^t$ at iteration $t$ in the direction of the negative gradient $-\nabla Q(\mathbf{M}^t)$ is followed by a projection $\mathrm{Proj}(\cdot)$ back to the PSD cone, i.e., $\mathbf{M}^{t+1} := \mathrm{Proj}\left(\mathbf{M}^t - \alpha \nabla Q(\mathbf{M}^t)\right)$. However, projection typically requires eigen-decomposition of $\mathbf{M}$ and soft-thresholding of its eigenvalues, which is computationally expensive.

Recent methods consider alternative search spaces of matrices such as sparse or low-rank matrices to ease optimization [20, 11, 12, 16, 26]. While efficient, the assumed restricted search spaces often degrade the quality of the sought metric $\mathbf{M}$ in defining the Mahalanobis distance. For example, low-rank methods explicitly assume reducibility of the $K$ available features to a lower dimension, and hence exclude the simple yet important weighted feature metric case where $\mathbf{M}$ contains only positive diagonal entries [24], i.e., $m_{i,i} > 0, \forall i$, and $m_{i,j} = 0, \forall i \neq j$. We show in our experiments that metrics computed by these methods may result in inferior performance for selected applications.

In this paper, we propose a metric learning framework that is both general and projection-free, capable of optimizing any convex differentiable objective $Q(\mathbf{M})$. Compared to low-rank methods, our framework is more encompassing and includes positive-diagonal metric matrices as a special case in the limit (see footnote 1). The main idea is as follows. First, we define a search space $\mathcal{S}$ of generalized graph Laplacian matrices [4], each corresponding to a connected graph with positive edge weights and node degrees. The underlying graph edge weights capture pairwise correlations among the $K$ features, and the self-loops designate relative importance among the features.

Assuming $\mathbf{M} \in \mathcal{S}$, we next rewrite the PD cone constraint as signal-adaptive linear constraints via Gershgorin disc alignment [2, 3]: first compute scalars $s_i$'s from the previous solution $\mathbf{M}^t$ that align the Gershgorin disc left-ends of matrix $\mathbf{B} = \mathbf{S}\mathbf{M}^t\mathbf{S}^{-1}$, where $\mathbf{S} = \mathrm{diag}(s_1, \ldots, s_K)$, then derive scaled linear constraints using $s_i$'s to ensure PDness of the next computed metric $\mathbf{M}^{t+1}$ via the Gershgorin Circle Theorem (GCT) [6]. Linear constraints mean that our proposed alternating optimization of the diagonal and off-diagonal terms in $\mathbf{M}$ can be solved speedily as linear programs [18] via Frank-Wolfe iterations [9]. We prove that for any metric $\mathbf{M}$ in $\mathcal{S}$, using scalars $s_i = 1/v_i$ can perfectly align Gershgorin disc left-ends for matrix $\mathbf{B} = \mathbf{S}\mathbf{M}\mathbf{S}^{-1}$ at the smallest eigenvalue $\lambda_{\min}$, where $\mathbf{M}\mathbf{v} = \lambda_{\min}\mathbf{v}$. We efficiently update $\mathbf{v}$ iteratively using Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) [10] with warm start as diagonal / off-diagonal terms are optimized. Experiments show that our computed graph metrics outperform metrics learned using competing methods in a range of applications.

## 2 Review of Spectral Graph Theory

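We consider an undirected graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}, \mathbf{W}\}$ composed of a node set $\mathcal{V}$ of cardinality $|\mathcal{V}| = N$, an edge set $\mathcal{E}$ connecting nodes, and a weighted adjacency matrix $\mathbf{W}$. Each edge $(i,j) \in \mathcal{E}$ has a positive weight $w_{i,j} > 0$ which reflects the degree of similarity between nodes $i$ and $j$. Specifically, it is common to compute edge weight $w_{i,j}$ as the exponential of the negative feature distance $\delta_{i,j}$ between nodes $i$ and $j$ [21]:

$$ w_{i,j} = \exp(-\delta_{i,j}) \tag{1} $$

Using (1) means $w_{i,j} \in (0, 1]$ for $\delta_{i,j} \in [0, \infty)$. We discuss feature distance $\delta_{i,j}$ in the next section.

There may be self-loops in graph $\mathcal{G}$, i.e., $\exists i$ where $w_{i,i} > 0$, and the corresponding diagonal entries of $\mathbf{W}$ are positive. The combinatorial graph Laplacian [21] is defined as $\mathbf{L} = \mathbf{D} - \mathbf{W}$, where $\mathbf{D}$ is the degree matrix, a diagonal matrix where $d_{i,i} = \sum_{j=1}^{N} w_{i,j}$. A generalized graph Laplacian [4] also accounts for self-loops in $\mathcal{G}$ and is defined as $\mathbf{L}_g = \mathbf{D} - \mathbf{W} + \mathrm{diag}(\mathbf{W})$, where $\mathrm{diag}(\mathbf{W})$ extracts the diagonal entries of $\mathbf{W}$. Alternatively we can write $\mathbf{L}_g = \mathbf{D}_g - \mathbf{W}$, where the generalized degree matrix $\mathbf{D}_g = \mathbf{D} + \mathrm{diag}(\mathbf{W})$ is diagonal.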
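As a small illustration of the generalized graph Laplacian, the sketch below builds $\mathbf{D} - \mathbf{W} + \mathrm{diag}(\mathbf{W})$ for a three-node graph with one self-loop. The function name `generalized_laplacian` and the example weights are hypothetical, and the code assumes that the degree of node $i$ sums the whole row $i$ of $\mathbf{W}$, self-loop included.

```python
def generalized_laplacian(W):
    """Return D - W + diag(W) for a symmetric adjacency matrix W (list of lists).

    Assumption: the degree d_{i,i} sums row i of W, including the self-loop w_{i,i}.
    """
    n = len(W)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        degree = sum(W[i])
        for j in range(n):
            if i == j:
                # d_{i,i} - w_{i,i} + w_{i,i}: the self-loop weight stays on the diagonal
                L[i][i] = degree
            else:
                L[i][j] = -W[i][j]
    return L


# Hypothetical 3-node path graph with a self-loop of weight 0.5 on node 0.
W = [[0.5, 1.0, 0.0],
     [1.0, 0.0, 2.0],
     [0.0, 2.0, 0.0]]
L_g = generalized_laplacian(W)
row_sums = [sum(row) for row in L_g]  # each row sums to that node's self-loop weight
```

Each row of the result sums to the self-loop weight of that node, so a node without a self-loop has a zero row sum, as in the combinatorial Laplacian.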

## 3 Graph Metric Learning

### 3.1 Graph Metric Matrices

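We first define the search space of metric matrices for our optimization framework. We assume that associated with each sample $i$ is a length-$K$ feature vector $\mathbf{f}_i \in \mathbb{R}^K$. A metric matrix $\mathbf{M} \in \mathbb{R}^{K \times K}$ defines the feature distance $\delta_{i,j}(\mathbf{M})$, the Mahalanobis distance [13], between samples $i$ and $j$ as:

$$ \delta_{i,j}(\mathbf{M}) = (\mathbf{f}_i - \mathbf{f}_j)^\top \mathbf{M} (\mathbf{f}_i - \mathbf{f}_j) \tag{2} $$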
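The feature distance in (2) is a plain quadratic form, which the following minimal sketch evaluates. The function name `feature_distance` and the two-feature example are hypothetical and serve only to illustrate the formula.

```python
def feature_distance(f_i, f_j, M):
    """Evaluate (f_i - f_j)^T M (f_i - f_j) for lists f_i, f_j and a square matrix M."""
    d = [a - b for a, b in zip(f_i, f_j)]
    Md = [sum(M[r][c] * d[c] for c in range(len(d))) for r in range(len(d))]
    return sum(d[r] * Md[r] for r in range(len(d)))


# Hypothetical 2-feature example: a negative off-diagonal entry in M
# increases the distance between samples whose features differ in opposite directions.
M = [[2.0, -1.0],
     [-1.0, 2.0]]
delta = feature_distance([1.0, 0.0], [0.0, 1.0], M)
delta_diag = feature_distance([1.0, 0.0], [0.0, 1.0], [[2.0, 0.0], [0.0, 2.0]])
```

Here `delta` evaluates to 6.0, whereas the diagonal-only metric with the same diagonal gives 4.0.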

We require $\mathbf{M}$ to be a positive definite (PD) matrix (see footnote 2). The special case where $\mathbf{M}$ is diagonal with strictly positive entries was studied in [24]. Instead, we study here a more general case: $\mathbf{M}$ must be a graph metric matrix, which we define formally as follows.

###### Definition 1.

A PD symmetric matrix $\mathbf{M}$ is a graph metric if it is a generalized graph Laplacian matrix with positive edge weights and node degrees for an irreducible graph.

For a generalized graph Laplacian $\mathbf{L}_g$ to have positive degrees, each node $i$ may have a self-loop, but its loop weight $w_{i,i}$ must satisfy $w_{i,i} > -\sum_{j \,|\, j \neq i} w_{i,j}$. An irreducible graph [15] essentially means that any graph node can reach any other node, i.e., the graph is connected.
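The conditions in Definition 1 can be checked directly on a candidate matrix. The sketch below is one possible test, under the assumption that edge weights are read as $w_{i,j} = -m_{i,j}$ and node degrees as the diagonal entries $m_{i,i}$; the helper names are hypothetical. Positive definiteness is tested with a Cholesky factorization and irreducibility with a graph traversal.

```python
import math


def is_positive_definite(M):
    """Attempt a Cholesky factorization; it succeeds iff the symmetric matrix M is PD."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False  # non-positive pivot: M is not PD
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True


def is_connected(M):
    """Depth-first traversal over the non-zero off-diagonal pattern of M."""
    n = len(M)
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(n):
            if j != i and M[i][j] != 0.0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return len(seen) == n


def is_graph_metric(M):
    """Check symmetry, non-positive off-diagonals, positive diagonals, connectivity, PD."""
    n = len(M)
    symmetric = all(M[i][j] == M[j][i] for i in range(n) for j in range(n))
    edges_ok = all(M[i][j] <= 0.0 for i in range(n) for j in range(n) if i != j)
    degrees_ok = all(M[i][i] > 0.0 for i in range(n))
    return (symmetric and edges_ok and degrees_ok
            and is_connected(M) and is_positive_definite(M))


example = [[2.0, -2.0, -1.0], [-2.0, 5.0, -2.0], [-1.0, -2.0, 4.0]]
print(is_graph_metric(example))                     # connected, PD
print(is_graph_metric([[1.0, 0.0], [0.0, 1.0]]))    # diagonal: graph is disconnected
print(is_graph_metric([[1.0, -2.0], [-2.0, 1.0]]))  # connected but not PD
```

A strictly diagonal matrix fails the connectivity test, which matches the remark that positive-diagonal metrics are included only in the limit.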

### 3.2 Problem Formulation

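Denote by $\mathcal{S}$ the set of all graph metric matrices. We pose an optimization problem for $\mathbf{M}$: find the optimal graph metric $\mathbf{M}$ in $\mathcal{S}$, leading to inter-sample distances $\delta_{i,j}(\mathbf{M})$ in (2), that yields the smallest value of a convex differentiable objective $Q(\{\delta_{i,j}(\mathbf{M})\})$:

$$ \min_{\mathbf{M} \in \mathcal{S}} Q\left(\{\delta_{i,j}(\mathbf{M})\}\right), \quad \text{s.t.} \quad \mathrm{tr}(\mathbf{M}) \le C \tag{3} $$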

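where $C > 0$ is a chosen parameter. Constraint $\mathrm{tr}(\mathbf{M}) \le C$ is needed to avoid pathological solutions with infinite feature distances, i.e., $\delta_{i,j}(\mathbf{M}) = \infty$. For stability, we assume also that the objective is lower-bounded, i.e., $\min_{\mathbf{M} \in \mathcal{S}} Q(\{\delta_{i,j}(\mathbf{M})\}) \ge \kappa > -\infty$ for some constant $\kappa$.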

Our strategy to solve (3) is to optimize $\mathbf{M}$'s diagonal and off-diagonal terms alternately using Frank-Wolfe iterations [9], where each iteration is solved as an LP until convergence. We discuss first the initialization of $\mathbf{M}$, then the two optimizations in order. For notational convenience, we will write the objective simply as $Q(\mathbf{M})$, with the understanding that metric $\mathbf{M}$ affects first the feature distances $\delta_{i,j}(\mathbf{M})$, which in turn determine the objective $Q(\{\delta_{i,j}(\mathbf{M})\})$.

### 3.3 Initialization of M

We first initialize a valid graph metric $\mathbf{M}^0$ as follows:

1. Initialize each diagonal term $m^0_{i,i} := C/K$.

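2. Initialize off-diagonal terms $m^0_{i,j}$, $i \neq j$, as:

$$ m^0_{i,j} := \begin{cases} -\epsilon & \text{if } j = i \pm 1 \\ 0 & \text{o.w.} \end{cases} \tag{4} $$

where $\epsilon > 0$ is a parameter. Initialization of the diagonal terms ensures that constraints $\mathrm{tr}(\mathbf{M}^0) \le C$ and $m^0_{i,i} > 0, \forall i$, are satisfied; for $\epsilon$ sufficiently small relative to the diagonal terms, $\mathbf{M}^0$ is also strictly diagonally dominant and hence PD. Initialization of the off-diagonal terms ensures that $\mathbf{M}^0$ is symmetric and irreducible, and constraint $m^0_{i,j} \le 0$, $i \neq j$, is satisfied; i.e., $\mathbf{M}^0$ is a Laplacian matrix for a graph with non-negative edge weights. We can hence conclude that the initial $\mathbf{M}^0$ is a graph metric, i.e., $\mathbf{M}^0 \in \mathcal{S}$.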
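The initialization can be sketched as below. The function name `init_graph_metric` is hypothetical, and the common diagonal value $C/K$ is an assumption of this sketch, chosen so that the trace constraint holds with equality; the off-diagonal pattern follows (4), which yields a tridiagonal matrix whose graph is a path and hence connected.

```python
def init_graph_metric(K, C, eps):
    """Build a K x K tridiagonal starting point with -eps on the first off-diagonals.

    Assumption: every diagonal entry is set to C / K (trace equals C).
    """
    M0 = [[0.0] * K for _ in range(K)]
    for i in range(K):
        M0[i][i] = C / K
        if i + 1 < K:
            # j = i +/- 1 in (4): connect each feature to its neighbor, symmetrically
            M0[i][i + 1] = -eps
            M0[i + 1][i] = -eps
    return M0


M0 = init_graph_metric(K=4, C=8.0, eps=0.1)
trace = sum(M0[i][i] for i in range(4))
# Strict diagonal dominance (at most two off-diagonals of size eps per row) implies PD.
dominant = all(M0[i][i] > sum(abs(M0[i][j]) for j in range(4) if j != i)
               for i in range(4))
```

With these example values each diagonal entry is 2.0 and each row has off-diagonal mass of at most 0.2, so the starting point is comfortably inside the feasible set.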

### 3.4 Optimization of Diagonal Terms

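When optimizing $\mathbf{M}$'s diagonal terms $m_{i,i}$, (3) becomes

$$ \min_{\{m_{i,i}\}} Q(\mathbf{M}) \quad \text{s.t.} \quad \mathbf{M} \succ 0; \quad \sum_i m_{i,i} \le C; \quad m_{i,i} > 0, \; \forall i \tag{5} $$

where $\mathrm{tr}(\mathbf{M}) = \sum_i m_{i,i}$. Because the diagonal terms do not affect the irreducibility of matrix $\mathbf{M}$, the only requirements for $\mathbf{M}$ to be a graph metric are: i) $\mathbf{M}$ must be PD, and ii) diagonals must be strictly positive.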

#### Gershgorin-based Reformulation

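To efficiently enforce the PD constraint $\mathbf{M} \succ 0$, we derive sufficient (but not necessary) linear constraints using the Gershgorin Circle Theorem (GCT) [6]. By GCT, each eigenvalue $\lambda$ of a real matrix $\mathbf{M}$ resides in at least one Gershgorin disc, corresponding to a row $i$ of $\mathbf{M}$, with center $c_i = m_{i,i}$ and radius $r_i = \sum_{j \,|\, j \neq i} |m_{i,j}|$, i.e.,

$$ \exists\, i \;\; \text{s.t.} \;\; c_i - r_i \le \lambda \le c_i + r_i \tag{6} $$

Thus a sufficient condition to ensure $\mathbf{M}$ is PD (smallest eigenvalue $\lambda_{\min} > 0$) is to ensure that all discs' left-ends are strictly positive, i.e.,

$$ 0 < \min_i \, (c_i - r_i) \le \lambda_{\min} \tag{7} $$

This translates to a linear constraint for each row $i$:

$$ m_{i,i} \ge \sum_{j \,|\, j \neq i} |m_{i,j}| + \rho, \quad \forall i \in \{1, \ldots, K\} \tag{8} $$

where $\rho > 0$ is a sufficiently small parameter.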

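However, the GCT lower bound $\min_i (c_i - r_i)$ for $\lambda_{\min}$ is often loose. When optimizing $\mathbf{M}$'s diagonal terms, enforcing (8) directly means that we are searching for $\{m_{i,i}\}$ in a smaller space than the original space in (5), resulting in an inferior solution. As an illustration, consider the following example matrix $\mathbf{M}$:

$$ \mathbf{M} = \begin{bmatrix} 2 & -2 & -1 \\ -2 & 5 & -2 \\ -1 & -2 & 4 \end{bmatrix} \tag{9} $$

Gershgorin disc left-ends $c_i - r_i$ for this matrix are $\{-1, 1, 1\}$, of which $-1$ is the smallest. Thus the diagonal terms $\{2, 5, 4\}$ do not meet constraints (8). However, $\mathbf{M}$ is PD, since its smallest eigenvalue is $\lambda_{\min} \approx 0.108 > 0$.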
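The looseness of the unscaled Gershgorin bound on the example matrix (9) can be reproduced with a few lines. This is only an illustrative check with hypothetical helper names; positive definiteness is confirmed here through Sylvester's criterion (all leading principal minors positive) rather than through an eigen-decomposition.

```python
def gershgorin_left_ends(M):
    """Return c_i - r_i for every row i: diagonal entry minus off-diagonal absolute row sum."""
    n = len(M)
    return [M[i][i] - sum(abs(M[i][j]) for j in range(n) if j != i)
            for i in range(n)]


def det3(A):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1])
            - A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0])
            + A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]))


M = [[2.0, -2.0, -1.0],
     [-2.0, 5.0, -2.0],
     [-1.0, -2.0, 4.0]]

left_ends = gershgorin_left_ends(M)  # first row violates constraint (8)
minors = [M[0][0],
          M[0][0] * M[1][1] - M[0][1] * M[1][0],
          det3(M)]                   # all positive, so M is PD
```

The first left-end is negative even though all three leading minors are positive, so constraint (8) rejects a matrix that is in fact PD.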

#### Gershgorin Disc Alignment

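To derive more appropriate linear constraints, and thus a more suitable search space when solving $\min_{\{m_{i,i}\}} Q(\mathbf{M})$, we examine instead the Gershgorin discs of a similarity-transformed matrix $\mathbf{B}$ from $\mathbf{M}$, i.e.,

$$ \mathbf{B} = \mathbf{S}\mathbf{M}\mathbf{S}^{-1} \tag{10} $$

where $\mathbf{S} = \mathrm{diag}(s_1, \ldots, s_K)$ is a diagonal matrix with scalars $s_1, \ldots, s_K$ along its diagonal, $s_i > 0, \forall i$. $\mathbf{B}$ has the same eigenvalues as $\mathbf{M}$, and thus the smallest Gershgorin disc left-end, $\min_i \left( b_{i,i} - \sum_{j \,|\, j \neq i} |b_{i,j}| \right)$, for $\mathbf{B}$ is also a lower bound for $\mathbf{M}$'s smallest eigenvalue $\lambda_{\min}$. Our goal is then to derive tight lower bounds by adapting to good solutions to (5), by appropriately choosing the scalars $s_i$ used to define $\mathbf{B}$ in (10).

Specifically, given scalars $s_1, \ldots, s_K$, a disc for $\mathbf{B}$ has center $b_{i,i} = m_{i,i}$ and radius $s_i \sum_{j \,|\, j \neq i} |m_{i,j}| / s_j$. Thus to ensure $\mathbf{B}$ is PD (and hence $\mathbf{M}$ is PD), we can write linear constraints similar to (8):

$$ m_{i,i} \ge s_i \sum_{j \,|\, j \neq i} \frac{|m_{i,j}|}{s_j} + \rho, \quad \forall i \in \{1, \ldots, K\} \tag{11} $$

It turns out that given a graph metric $\mathbf{M}$, there exist scalars $s_1, \ldots, s_K$ such that all disc left-ends are aligned at the same value $\lambda_{\min}$. We state this formally as a theorem.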

###### Theorem 1.

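Let $\mathbf{M}$ be a graph metric matrix. There exist strictly positive scalars $s_1, \ldots, s_K$ such that all Gershgorin disc left-ends of $\mathbf{B} = \mathbf{S}\mathbf{M}\mathbf{S}^{-1}$ are aligned exactly at the smallest eigenvalue, i.e., $b_{i,i} - \sum_{j \,|\, j \neq i} |b_{i,j}| = \lambda_{\min}, \forall i$.

In other words, for matrix $\mathbf{B}$ the Gershgorin lower bound $\min_i \left( b_{i,i} - \sum_{j \,|\, j \neq i} |b_{i,j}| \right)$ is exactly $\lambda_{\min}$, and the bound is the tightest possible. The important corollary is the following:

###### Corollary 1.

For any graph metric $\mathbf{M}$, which by definition is PD, there exist scalars $s_1, \ldots, s_K$ for which $\mathbf{M}$ is feasible under the linear constraints in (11).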

###### Proof.

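By Theorem 1, let $s_1, \ldots, s_K$ be scalars such that all Gershgorin disc left-ends of $\mathbf{B} = \mathbf{S}\mathbf{M}\mathbf{S}^{-1}$ align at $\lambda_{\min}$. Thus

$$ \forall i, \quad m_{i,i} - s_i \sum_{j \,|\, j \neq i} \frac{|m_{i,j}|}{s_j} = \lambda_{\min} > 0 \tag{12} $$

where $\lambda_{\min} > 0$ since $\mathbf{M}$ is PD. Hence $\mathbf{M}$ must also satisfy (11) for all $i$ for sufficiently small $\rho > 0$. ∎

Continuing our earlier example, using scalars $s_1$, $s_2$ and $s_3$ computed from the first eigenvector of $\mathbf{M}$, we see that $\mathbf{B} = \mathbf{S}\mathbf{M}\mathbf{S}^{-1}$ for $\mathbf{M}$ in (9) has all disc left-ends aligned at $\lambda_{\min} \approx 0.108$. Hence using these scalars and constraints (11), diagonal terms $\{2, 5, 4\}$ now constitute a feasible solution.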

To prove Theorem 1, we first establish the following lemma.

###### Lemma 1.

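There exists a first eigenvector $\mathbf{v}$ with strictly positive entries for a graph metric matrix $\mathbf{M}$.

###### Proof.

By definition, graph metric matrix $\mathbf{M}$ is a generalized graph Laplacian $\mathbf{L}_g = \mathbf{D}_g - \mathbf{W}$ with positive edge weights in $\mathbf{W}$ and positive degrees in $\mathbf{D}_g$. Let $\mathbf{v}$ be the first eigenvector of $\mathbf{M}$, i.e.,

$$
\begin{aligned}
\mathbf{M}\mathbf{v} &= \lambda_{\min}\mathbf{v} \\
(\mathbf{D}_g - \mathbf{W})\mathbf{v} &= (\lambda_{\min}\mathbf{I})\mathbf{v} \\
\mathbf{D}_g\mathbf{v} &= (\mathbf{W} + \lambda_{\min}\mathbf{I})\mathbf{v} \\
\mathbf{v} &= \mathbf{D}_g^{-1}(\mathbf{W} + \lambda_{\min}\mathbf{I})\mathbf{v}
\end{aligned}
$$

where $\lambda_{\min} > 0$ since $\mathbf{M}$ is PD. Since the matrix $\mathbf{D}_g^{-1}(\mathbf{W} + \lambda_{\min}\mathbf{I})$ on the right contains only non-negative entries and is an irreducible matrix, $\mathbf{v}$ is a positive eigenvector by the Perron-Frobenius Theorem [7]. ∎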

We now prove Theorem 1 as follows.

###### Proof.

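Denote by $\mathbf{v}$ a strictly positive eigenvector corresponding to graph metric matrix $\mathbf{M}$'s smallest eigenvalue $\lambda_{\min}$. Define $\mathbf{S} = \mathrm{diag}(1/v_1, \ldots, 1/v_K)$. Then,

$$ \mathbf{S}\mathbf{M}\mathbf{S}^{-1}\mathbf{S}\mathbf{v} = \lambda_{\min}\mathbf{S}\mathbf{v} \tag{13} $$

where $\mathbf{S}\mathbf{v} = \mathbf{1} = [1, \ldots, 1]^\top$. Let $\mathbf{B} = \mathbf{S}\mathbf{M}\mathbf{S}^{-1}$. Then,

$$ \mathbf{B}\mathbf{1} = \lambda_{\min}\mathbf{1} \tag{14} $$

(14) means that

$$ b_{i,i} + \sum_{j \,|\, j \neq i} b_{i,j} = \lambda_{\min}, \quad \forall i $$

Note that the off-diagonal terms $b_{i,j} = (v_j / v_i)\, m_{i,j} \le 0$, since i) $\mathbf{v}$ is strictly positive and ii) off-diagonal terms of graph metric $\mathbf{M}$ satisfy $m_{i,j} \le 0$. Thus,

$$ b_{i,i} - \sum_{j \,|\, j \neq i} |b_{i,j}| = \lambda_{\min}, \quad \forall i \tag{15} $$

Thus defining $\mathbf{S} = \mathrm{diag}(1/v_1, \ldots, 1/v_K)$ means $\mathbf{B} = \mathbf{S}\mathbf{M}\mathbf{S}^{-1}$ has all its Gershgorin disc left-ends aligned at $\lambda_{\min}$. ∎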
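The construction in the proof can be checked numerically on the example matrix (9). The sketch below is an illustration, not the procedure used in the paper: it obtains the first eigenpair with a simple shifted power iteration (a stand-in chosen to keep the example dependency-free, with hypothetical function names), sets $s_i = 1/v_i$, and evaluates the disc left-ends of $\mathbf{B} = \mathbf{S}\mathbf{M}\mathbf{S}^{-1}$.

```python
import math


def smallest_eigenpair(M, iters=500):
    """Shifted power iteration on (c*I - M), where c is the largest Gershgorin right-end.

    Since c >= lambda_max, the dominant eigenvector of c*I - M is the first eigenvector of M.
    """
    n = len(M)
    c = max(M[i][i] + sum(abs(M[i][j]) for j in range(n) if j != i)
            for i in range(n))
    v = [1.0 / math.sqrt(n)] * n  # positive start vector
    for _ in range(iters):
        w = [c * v[i] - sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Mv = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    lam = sum(v[i] * Mv[i] for i in range(n))  # Rayleigh quotient
    return lam, v


def aligned_left_ends(M, s):
    """Disc left-ends of B = S M S^{-1}: m_ii - s_i * sum_{j != i} |m_ij| / s_j."""
    n = len(M)
    return [M[i][i] - s[i] * sum(abs(M[i][j]) / s[j] for j in range(n) if j != i)
            for i in range(n)]


M = [[2.0, -2.0, -1.0],
     [-2.0, 5.0, -2.0],
     [-1.0, -2.0, 4.0]]
lam, v = smallest_eigenpair(M)
s = [1.0 / x for x in v]          # scalars from the proof of Theorem 1
aligned = aligned_left_ends(M, s)  # every entry should coincide with lam
print(lam, aligned)
```

All three left-ends coincide with the smallest eigenvalue up to floating-point error, whereas the unscaled left-ends of the same matrix are $\{-1, 1, 1\}$.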

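Thus, using a positive first eigenvector $\mathbf{v}$ of a graph metric $\mathbf{M}$, one can compute corresponding scalars $s_i = 1/v_i$ to align all disc left-ends of $\mathbf{B} = \mathbf{S}\mathbf{M}\mathbf{S}^{-1}$ at $\lambda_{\min}$, and $\mathbf{M}$ satisfies (11) by Corollary 1. Note that these scalars are signal-adaptive, i.e., $s_i$'s depend on $\mathbf{v}$, which is computed from $\mathbf{M}$. Our strategy then is to derive scalars $s^t_i$'s from a good solution $\mathbf{M}^{t-1}$, optimize for a better solution $\mathbf{M}^t$ using scaled Gershgorin linear constraints (11), and derive new scalars $s^{t+1}_i$ again until convergence. Specifically,

1. Given scalars $s^t_i$'s, identify a good solution $\mathbf{M}^t$ minimizing objective $Q(\mathbf{M})$ subject to (11), i.e.,

   $$ \min_{\{m_{i,i}\}} Q(\mathbf{M}) \quad \text{s.t.} \quad m_{i,i} \ge s_i \sum_{j \,|\, j \neq i} \frac{|m_{i,j}|}{s_j} + \rho, \; \forall i; \quad \sum_i m_{i,i} \le C \tag{16} $$

2. Given $\mathbf{M}^t$, update scalars $s^{t+1}_i = 1/v^t_i$, where $\mathbf{v}^t$ is the first eigenvector of $\mathbf{M}^t$.

3. Increment $t$ and repeat until convergence.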

When the scalars in (16) are updated as $s^{t+1}_i = 1/v^t_i$ for iteration $t+1$, we show that the previous solution $\mathbf{M}^t$ at iteration $t$ remains feasible at iteration $t+1$:

###### Lemma 2.

Solution $\mathbf{M}^t$ to (16) in iteration $t$ remains feasible in iteration $t+1$, when scalars $s^{t+1}_i$ for the linear constraints in (16) are updated as $s^{t+1}_i = 1/v^t_i$, where $\mathbf{v}^t$ is the first eigenvector of $\mathbf{M}^t$.

###### Proof.

Using the first eigenvector $\mathbf{v}^t$ of graph metric $\mathbf{M}^t$ at iteration $t$, by the proof of Theorem 1 we know that the Gershgorin disc left-ends of $\mathbf{B} = \mathbf{S}\mathbf{M}^t\mathbf{S}^{-1}$ are aligned at $\lambda_{\min}$. Since $\mathbf{M}^t$ is a feasible solution in (16), $\mathbf{M}^t \succ 0$ and $\lambda_{\min} > 0$. Thus $\mathbf{M}^t$ is also a feasible solution when scalars are updated as $s^{t+1}_i = 1/v^t_i$. ∎

The remaining issue is how best to compute the first eigenvector $\mathbf{v}^t$ given solution $\mathbf{M}^t$ repeatedly. For this task, we employ Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) [10], a state-of-the-art iterative algorithm known to compute extreme eigenpairs efficiently. Further, using the previously computed eigenvector $\mathbf{v}^{t-1}$ as an initial guess, LOBPCG benefits from warm start when computing $\mathbf{v}^t$, reducing its complexity in subsequent iterations [10].

#### Frank-Wolfe Algorithm

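To solve (16), we employ the Frank-Wolfe algorithm [9] that iteratively linearizes the objective $Q(\mathbf{M})$ using its gradient $\nabla Q(\mathbf{M}^t)$ with respect to diagonal terms $\{m_{i,i}\}$, computed using the previous solution $\mathbf{M}^t$, i.e.,

$$ \nabla Q(\mathbf{M}^t) = \left. \begin{bmatrix} \dfrac{\partial Q(\mathbf{M})}{\partial m_{1,1}} \\ \vdots \\ \dfrac{\partial Q(\mathbf{M})}{\partial m_{K,K}} \end{bmatrix} \right|_{\mathbf{M}^t} \tag{17} $$

Given gradient $\nabla Q(\mathbf{M}^t)$, optimization (16) becomes a linear program (LP) at each iteration $t$:

$$ \min_{\{m_{i,i}\}} \mathrm{vec}(\{m_{i,i}\})^\top \nabla Q(\mathbf{M}^t) \quad \text{s.t.} \quad m_{i,i} \ge s_i \sum_{j \,|\, j \neq i} \frac{|m^t_{i,j}|}{s_j} + \rho, \; \forall i; \quad \sum_i m_{i,i} \le C \tag{18} $$

where $\mathrm{vec}(\{m_{i,i}\})$ is a vector composed of diagonal terms $\{m_{i,i}\}$, and $m^t_{i,j}$ are off-diagonal terms of the previous solution $\mathbf{M}^t$. LP (18) can be solved efficiently using known fast algorithms such as Simplex [18] and the interior point method [5]. When a new solution $\{m^{t+1}_{i,i}\}$ is obtained, gradient $\nabla Q(\mathbf{M}^{t+1})$ is updated, and LP (18) is solved again until convergence.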
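Because LP (18) has only per-variable lower bounds and a single budget constraint, one linearized step can also be worked out by hand, which the sketch below does. This closed-form rule is an observation about the structure of (18) and is not the Simplex or interior-point solver mentioned in the text; the function name `solve_diagonal_lp` and the example gradient, scalars and parameters are hypothetical.

```python
def solve_diagonal_lp(grad, M_prev, s, C, rho):
    """Solve min grad^T m  s.t.  m_i >= lower_i,  sum(m) <= C  for the diagonal terms m.

    Every m_i starts at its scaled Gershgorin lower bound from (18); any leftover
    trace budget goes to the coordinate with the most negative gradient, if one exists.
    """
    n = len(grad)
    lower = [s[i] * sum(abs(M_prev[i][j]) / s[j] for j in range(n) if j != i) + rho
             for i in range(n)]
    slack = C - sum(lower)
    if slack < 0:
        raise ValueError("trace budget C is too small for the lower bounds")
    diag = list(lower)
    k = min(range(n), key=lambda i: grad[i])
    if grad[k] < 0:
        diag[k] += slack  # spending the budget here lowers the linear objective most
    return diag


M_prev = [[2.0, -2.0, -1.0],
          [-2.0, 5.0, -2.0],
          [-1.0, -2.0, 4.0]]
# Hypothetical inputs: unit scalars, an arbitrary gradient, C = 12, rho = 0.5.
diag = solve_diagonal_lp(grad=[-1.0, -3.0, 2.0], M_prev=M_prev,
                         s=[1.0, 1.0, 1.0], C=12.0, rho=0.5)
diag_pos = solve_diagonal_lp(grad=[1.0, 2.0, 3.0], M_prev=M_prev,
                             s=[1.0, 1.0, 1.0], C=12.0, rho=0.5)
```

With the example inputs the lower bounds are 3.5, 4.5 and 3.5, and the remaining budget of 0.5 is assigned to the second diagonal term; when every gradient entry is non-negative the solution stays at the lower bounds.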

### 3.5 Optimization of Off-diagonal Entries

For off-diagonal entries of $\mathbf{M}$, we design a block coordinate descent algorithm, which optimizes one row / column at a time.

#### Block Coordinate Iteration

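First, we divide $\mathbf{M}$ into four sub-matrices:

$$ \mathbf{M} = \begin{bmatrix} m_{1,1} & \mathbf{M}_{1,2} \\ \mathbf{M}_{2,1} & \mathbf{M}_{2,2} \end{bmatrix}, \tag{19} $$

where $m_{1,1} \in \mathbb{R}$, $\mathbf{M}_{1,2} \in \mathbb{R}^{1 \times (K-1)}$, $\mathbf{M}_{2,1} \in \mathbb{R}^{(K-1) \times 1}$, and $\mathbf{M}_{2,2} \in \mathbb{R}^{(K-1) \times (K-1)}$. Assuming $\mathbf{M}$ is symmetric, $\mathbf{M}_{1,2} = \mathbf{M}_{2,1}^\top$. We optimize $\mathbf{M}_{2,1}$ in one iteration, i.e.,

$$ \min_{\mathbf{M}_{2,1}} Q(\mathbf{M}), \quad \text{s.t.} \quad \mathbf{M} \in \mathcal{S} \tag{20} $$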

In the next iteration, a different row / column is selected, and with appropriate row / column permutation, we still optimize the first column off-diagonal terms as in (20).

Note that the constraint $\mathrm{tr}(\mathbf{M}) \le C$ in (3) can be ignored, since it does not involve the optimization variable $\mathbf{M}_{2,1}$. For $\mathbf{M}$ to remain in the set $\mathcal{S}$ of graph metric matrices, i) $\mathbf{M}$ must be PD, ii) $\mathbf{M}$ must be irreducible, and iii) $\mathbf{M}_{2,1} \le 0$.

As done for the diagonal terms optimization, we replace the PD constraint with Gershgorin-based linear constraints. To ensure irreducibility (i.e., the graph remains connected), we ensure that at least one off-diagonal term (say index $s$) in column 1 has magnitude at least $\epsilon > 0$. The optimization thus becomes:

$$
\begin{aligned}
\min_{\mathbf{M}_{2,1}} \; & Q(\mathbf{M}) \\
\text{s.t.} \; & m_{i,i} \ge s_i \sum_{j \,|\, j \neq i} \frac{|m_{i,j}|}{s_j} + \rho, \; \forall i; \quad m_{s,1} \le -\epsilon; \quad \mathbf{M}_{2,1} \le 0
\end{aligned}
\tag{21}
$$

Essentially any selection of $s$ in (21) can ensure $\mathbf{M}$ is irreducible. To encourage solution convergence, we select $s$ as the index of the entry of the previously optimized $\mathbf{M}^t_{2,1}$ with the largest magnitude.

(21) also has a convex differentiable objective with a set of linear constraints. We thus employ the Frank-Wolfe algorithm again to iteratively linearize the objective using gradient $\nabla Q(\mathbf{M}^t)$ with respect to off-diagonal $\mathbf{M}_{2,1}$, where the solution in each iteration is solved as an LP. We omit the details for brevity.

## 4 Experiments

We evaluate our proposed metric learning method by classification performance. Specifically, the objective function we consider here is the graph Laplacian Regularizer (GLR) [21, 17]:

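$$
\begin{aligned}
Q(\mathbf{M}) &= \mathbf{z}^\top \mathbf{L}(\mathbf{M}) \mathbf{z} = \sum_{i=1}^{N} \sum_{j=1}^{N} w_{i,j} (z_i - z_j)^2 \\
&= \sum_{i=1}^{N} \sum_{j=1}^{N} \exp\left\{ -(\mathbf{f}_i - \mathbf{f}_j)^\top \mathbf{M} (\mathbf{f}_i - \mathbf{f}_j) \right\} (z_i - z_j)^2
\end{aligned}
\tag{22}
$$

A small GLR means that signal $\mathbf{z}$ at connected node pairs $(i, j)$ is similar for a large edge weight $w_{i,j}$, i.e., $\mathbf{z}$ is smooth w.r.t. the variation operator $\mathbf{L}(\mathbf{M})$. GLR has been used in the GSP literature to solve a range of inverse problems, including image denoising [17] and deblurring [1].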
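For concreteness, the GLR objective can be evaluated as a double sum over ordered sample pairs, following the form written in (22). The sketch below is a direct, unoptimized transcription; the function name `glr` and the two-sample data are hypothetical.

```python
import math


def glr(features, z, M):
    """Sum over all ordered pairs (i, j) of exp(-delta_ij(M)) * (z_i - z_j)^2."""
    n = len(features)
    k = len(M)
    total = 0.0
    for i in range(n):
        for j in range(n):
            d = [a - b for a, b in zip(features[i], features[j])]
            # Mahalanobis feature distance as in (2)
            delta = sum(d[r] * M[r][c] * d[c] for r in range(k) for c in range(k))
            total += math.exp(-delta) * (z[i] - z[j]) ** 2
    return total


# Hypothetical data: two samples with one feature each and opposite labels.
features = [[0.0], [1.0]]
z = [1.0, -1.0]
q_small_metric = glr(features, z, [[1.0]])
q_large_metric = glr(features, z, [[2.0]])
```

Increasing the weight on a feature that separates differently labeled samples shrinks the corresponding edge weight and therefore lowers the objective, which is why the trace constraint in (3) is needed to keep the metric bounded.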

We compare our method against the following competing schemes: three metric learning methods that only learn the diagonals of $\mathbf{M}$, i.e., [27], [14], and [24], and two methods that learn the full matrix $\mathbf{M}$, i.e., [25] and [8]. We do this by performing classification tasks via the following two classifiers: 1) a k-nearest-neighbour classifier, and 2) a graph-based classifier with quadratic formulation $\min_{\mathbf{z}} \mathbf{z}^\top \mathbf{L}(\mathbf{M}) \mathbf{z}$ s.t. $z_i = \hat{z}_i, \forall i \in \mathcal{F}$, where $\hat{z}_i$ in subset $\mathcal{F}$ are the observed labels. We evaluate all classifiers on wine (3 classes, 13 features and 178 samples), iris (3 classes, 4 features and 150 samples), seeds (3 classes, 7 features and 210 samples), and pb (2 classes, 10 features and 300 samples). All experiments were performed in Matlab R2017a on a Windows 10 PC with an i5-7500 CPU and 8GB of RAM. We perform 2-fold cross validation 50 times using 50 random seeds (0 to 49) with a one-against-all classification strategy. As shown in Table 1, our proposed metric learning method has the lowest classification error rates with a graph-based classifier.

### Footnotes

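1. As the inter-feature correlations tend to zero, only graph self-loops expressing relative importance among the features remain, and the generalized graph Laplacian matrix tends to a diagonal matrix.
2. By definition of a metric [22], $(\mathbf{f}_i - \mathbf{f}_j)^\top \mathbf{M} (\mathbf{f}_i - \mathbf{f}_j) > 0$ if $\mathbf{f}_i - \mathbf{f}_j \neq \mathbf{0}$.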

### References

1. Y. Bai, G. Cheung, X. Liu and W. Gao (2019) Graph-based blind image deblurring from a single photograph. IEEE Transactions on Image Processing (TIP) 28 (3), pp. 1404–1418.
2. Y. Bai, G. Cheung, F. Wang, X. Liu and W. Gao (2019-05) Reconstruction-cognizant graph sampling using Gershgorin disc alignment. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK.
3. Y. Bai, F. Wang, G. Cheung, Y. Nakatsukasa and W. Gao (2019) Fast graph sampling set selection using Gershgorin disc alignment. arXiv.
4. T. Biyikoglu, J. Leydold and P. F. Stadler (2005) Nodal domain theorems and bipartite subgraphs.
5. S. Boyd and L. Vandenberghe (2009) Convex optimization. Cambridge University Press.
6. S. A. Gershgorin (1931) Über die Abgrenzung der Eigenwerte einer Matrix. Proceedings of the Russian Academy of Sciences (6), pp. 749–754.
7. R. Horn and C. Johnson (2012) Matrix analysis. Cambridge University Press.
8. W. Hu, X. Gao, G. Cheung and Z. Guo (2019) Feature graph learning for 3D point cloud denoising. CoRR abs/1907.09138.
9. M. Jaggi (2013-06) Revisiting Frank-Wolfe: projection-free sparse convex optimization. In ICML, Atlanta, Georgia, USA, pp. 427–435.
10. A. V. Knyazev (2001) Toward the optimal preconditioned eigensolver: locally optimal block preconditioned conjugate gradient method. SIAM Journal on Scientific Computing 23 (2), pp. 517–541.
11. D. Lim, G. Lanckriet and B. McFee (2013) Robust structural metric learning. In International Conference on Machine Learning, pp. 615–623.
12. W. Liu, C. Mu, R. Ji, S. Ma, J. R. Smith and S. Chang (2015) Low-rank similarity metric learning in high dimensions. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
13. P. C. Mahalanobis (1936) On the generalized distance in statistics. Proceedings of the National Institute of Sciences of India 2 (1), pp. 49–55.
14. Y. Mao, G. Cheung, C.-W. Lin and Y. Ji (2016-12) Joint learning of similarity graph and image classifier from partial labels. In APSIPA, Jeju, South Korea.
15. M. Milgram (1972) Irreducible graphs. Journal of Combinatorial Theory (B) 12, pp. 6–31.
16. Y. Mu (2016) Fixed-rank supervised metric learning on Riemannian manifold. In Thirtieth AAAI Conference on Artificial Intelligence.
17. J. Pang and G. Cheung (2017) Graph Laplacian regularization for image denoising: analysis in the continuous domain. IEEE Transactions on Image Processing (TIP) 26 (4), pp. 1770–1785.
18. C. Papadimitriou and K. Steiglitz (1998) Combinatorial optimization. Dover Publications, Inc.
19. N. Parikh and S. Boyd (2013) Proximal algorithms. Foundations and Trends in Optimization 1 (3), pp. 123–231.
20. G. Qi, J. Tang, Z. Zha, T. Chua and H. Zhang (2009) An efficient sparse metric learning in high-dimensional space via $\ell_1$-penalized log-determinant regularization. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 841–848.
21. D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega and P. Vandergheynst (2013-05) The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine 30, pp. 83–98.
22. M. Vetterli, J. Kovacevic and V. Goyal (2014) Foundations of signal processing. Cambridge University Press.
23. K. Q. Weinberger and L. K. Saul (2009) Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10 (Feb), pp. 207–244.
24. C. Yang, G. Cheung and V. Stankovic (2018) Alternating binary classifier and graph learning from partial labels. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1137–1140.
25. P. Zadeh, R. Hosseini and S. Sra (2016) Geometric mean metric learning. In International Conference on Machine Learning, pp. 2464–2471.
26. J. Zhang and L. Zhang (2017) Efficient stochastic optimization for low-rank distance metric learning. In Thirty-First AAAI Conference on Artificial Intelligence.
27. X. Zhu, Z. Ghahramani and J. Lafferty (2003) Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, pp. 912–919.