Positive definite matrices and the S-divergence

(A small fraction of an initial version of this work was presented at the Advances in Neural Information Processing Systems (NIPS) 2012 conference; see [48].)

Suvrit Sra (Parts of this paper were written during my stay at Carnegie Mellon University; the initial version was prepared while I was at the Max Planck Institute for Intelligent Systems, Tübingen, Germany.)
Abstract

Positive definite matrices abound in a dazzling variety of applications. This ubiquity can in part be attributed to their rich geometric structure: positive definite matrices form a self-dual convex cone whose strict interior is a Riemannian manifold. The manifold view is endowed with a “natural” distance function while the conic view is not. Nevertheless, drawing motivation from the conic view, we introduce the S-divergence as a “natural” distance-like function on the open cone of positive definite matrices. We motivate the S-divergence via a sequence of results that connect it to the Riemannian distance. In particular, we show that (a) this divergence is the square of a distance; and (b) it has several geometric properties similar to those of the Riemannian distance, though without being computationally as demanding. The S-divergence is even more intriguing: although it is nonconvex, matrix means and medians under it can still be computed to global optimality. We complement our results with numerical experiments illustrating our theorems and our optimization algorithm for computing matrix medians.

Key words. Bregman matrix divergence; Log Determinant; Stein Divergence; Jensen-Bregman divergence; matrix geometric mean; matrix median; nonpositive curvature

1 Introduction

Hermitian positive definite (HPD) matrices are a noncommutative generalization of positive reals. They abound in a multitude of applications and exhibit attractive geometric properties—e.g., they form a differentiable Riemannian (also Finslerian) manifold [10, 33] that is a well-studied example of a manifold of nonpositive curvature [17, Ch.10]. HPD matrices possess even more structure: (i) they embody a canonical higher-rank symmetric space [51]; and (ii) their closure forms a closed, self-dual convex cone.

The convex conic view enjoys great importance in convex optimization [43, 6, 44] and in nonlinear Perron-Frobenius theory [40]; symmetric spaces are important in algebra, analysis [51, 32, 39], and optimization [43, 52]; while the manifold view (Riemannian or Finslerian) plays diverse roles—see [10, Ch.6] and [46].

The manifold view is equipped with a “natural” distance function while the conic view is not. Nevertheless, drawing motivation from the convex conic view, we introduce the S-divergence as a “natural” distance-like function on the open cone of positive definite matrices. Indeed, we prove a sequence of results connecting the S-divergence to the Riemannian distance. Most importantly, we show that (a) this divergence is the square of a distance; and (b) it has several geometric properties in common with the Riemannian distance, without being numerically as demanding. This builds an informal link between the manifold and conic views of HPD matrices.

1.1 Background and notation

We begin by fixing notation. The letter $\mathbb{H}$ denotes some Hilbert space, usually just $\mathbb{C}^n$. The inner product between two vectors $x$ and $y$ in $\mathbb{H}$ is $\langle x, y\rangle := x^*y$ (where $^*$ denotes ‘conjugate transpose’). The set of $n\times n$ Hermitian matrices is denoted as $\mathbb{H}_n$. A matrix $A \in \mathbb{H}_n$ is called positive definite if

$$\langle x, Ax\rangle > 0 \quad \text{for all } x \neq 0. \qquad (1.1)$$

The set of all $n\times n$ positive definite (henceforth positive) matrices is denoted by $\mathbb{P}_n$. We say $A$ is positive semidefinite if $\langle x, Ax\rangle \ge 0$ for all $x$; denoted $A \succeq 0$. The inequality $A \succeq B$ is the usual Löwner order and means $A - B \succeq 0$. The Frobenius norm of a matrix $A$ is defined as $\|A\|_F := \sqrt{\mathrm{tr}(A^*A)}$; and $\|A\|$ denotes the operator 2-norm. Let $f$ be an analytic function; for a matrix $A \in \mathbb{H}_n$ with eigendecomposition $A = U\Lambda U^*$, $f(A)$ equals $Uf(\Lambda)U^*$ with $f(\Lambda) = \mathrm{Diag}\bigl(f(\lambda_1), \ldots, f(\lambda_n)\bigr)$.

The set $\mathbb{P}_n$ is a well-studied differentiable Riemannian manifold, with the Riemannian metric given by the differential form $ds = \|A^{-1/2}\,\mathrm{d}A\,A^{-1/2}\|_F$. This metric induces the Riemannian distance (see e.g., [10, Ch. 6]):

$$\delta_R(X, Y) := \bigl\|\log\bigl(Y^{-1/2} X Y^{-1/2}\bigr)\bigr\|_F, \qquad X, Y \in \mathbb{P}_n, \qquad (1.2)$$

where $\log(\cdot)$ denotes the matrix logarithm.

A counterpart to the distance (1.2) was formally introduced in [48] under the name S-divergence (it is called a divergence because, although nonnegative, definite, and symmetric, it is not a metric); this divergence is defined as

$$\delta_S^2(X, Y) := S(X, Y) = \log\det\Bigl(\frac{X+Y}{2}\Bigr) - \frac{1}{2}\log\det(XY), \qquad X, Y \in \mathbb{P}_n. \qquad (1.3)$$

Our definition above already writes $\delta_S^2$ in anticipation of Theorem 3.1, which shows $\delta_S$ to be a metric. This paper suggests the S-divergence as an alternative to (1.2), and studies several of its properties that may also be of independent interest. The simplicity of (1.3) is one of the key reasons for using it as an alternative to (1.2): it is cheaper to compute, as is its derivative, and certain basic algorithms involving it run much faster than corresponding ones that use $\delta_R$ [48].
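To make the computational contrast concrete, here is a minimal sketch (our own illustrative code, not the implementation used in the experiments reported below) that computes $S$ via three Cholesky factorizations and $\delta_R$ via a generalized eigenvalue decomposition; both functions assume real symmetric positive definite inputs.

```python
import numpy as np
from scipy.linalg import cholesky, eigvalsh

def logdet(A):
    # log det(A) for A positive definite, via one Cholesky factorization
    L = cholesky(A, lower=True)
    return 2.0 * np.sum(np.log(np.diag(L)))

def s_divergence(X, Y):
    # S(X, Y) = log det((X + Y)/2) - (1/2) log det(XY): three Cholesky factorizations
    return logdet((X + Y) / 2) - 0.5 * (logdet(X) + logdet(Y))

def riemannian_distance(X, Y):
    # delta_R(X, Y) = ||log(Y^{-1/2} X Y^{-1/2})||_F, computed from the
    # generalized eigenvalues of the pencil (X, Y)
    lam = eigvalsh(X, Y)
    return np.sqrt(np.sum(np.log(lam) ** 2))

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100)); X = A @ A.T + 100 * np.eye(100)
B = rng.standard_normal((100, 100)); Y = B @ B.T + 100 * np.eye(100)
print(s_divergence(X, Y), riemannian_distance(X, Y))
```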

This line of thought actually originates in [24, 25], where for an image search task based on “nearest neighbors,” the S-divergence is used to measure nearness instead of $\delta_R$, and is shown to yield large speedups without blighting the quality of search results. Although the exact details of this image search are outside the scope of this paper, let us highlight below the two speedups that were crucial to [24, 25].

The first speedup is shown in the left panel of Fig. 1.1, which compares the times taken to compute $S$ and $\delta_R$. For computing the latter, we used the dist.m function in the Matrix Means Toolbox (MMT), downloaded from http://bezout.dm.unipi.it/software/mmtoolbox/. The second, more dramatic speedup is shown in the right panel, which shows the time taken to compute the matrix means

$$\operatorname*{argmin}_{X \succ 0}\ \sum\nolimits_{i=1}^m \delta_R^2(X, A_i) \qquad\text{and}\qquad \operatorname*{argmin}_{X \succ 0}\ \sum\nolimits_{i=1}^m S(X, A_i),$$

where $A_1, \ldots, A_m$ are HPD matrices. For details on these means see the section on matrix means below; the geometric mean is also known as the “Karcher mean”, and was computed using the MMT via the rich.m script, which implements a state-of-the-art method [13, 34].

Figure 1.1: Left: Time taken to compute $S$ and $\delta_R$. For $\delta_R$, we used MMT’s Schur-factorization-based implementation. The results are averaged over 10 runs to reduce variance. The plot indicates that $S$ can be up to 50 times faster than $\delta_R$. Right: Time taken to compute the Karcher mean and the S-divergence based mean. The former was computed using the method of [13], while the latter was obtained via a fixed-point iteration. The differences are huge: the S-divergence based mean can be obtained up to 1000 times faster! Even the slew of methods surveyed in [34] show similar or worse runtimes than those cited for the Karcher mean above.
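For concreteness, the fixed-point iteration mentioned in the caption can be sketched as follows. It is based on the first-order optimality condition $X^{-1} = \frac{1}{m}\sum_{i}\bigl(\frac{X+A_i}{2}\bigr)^{-1}$ for $\min_{X\succ 0}\sum_i S(X, A_i)$; this is a plausible sketch of such an iteration, and the function name, starting point, and stopping rule below are our own choices rather than those of the toolbox.

```python
import numpy as np

def s_mean(As, max_iter=100, tol=1e-12):
    # Fixed-point iteration X <- [ (1/m) sum_i ((X + A_i)/2)^{-1} ]^{-1},
    # targeting a stationary point of sum_i S(X, A_i).
    m = len(As)
    X = sum(As) / m                      # start from the arithmetic mean
    for _ in range(max_iter):
        R = sum(np.linalg.inv((X + A) / 2) for A in As) / m
        X_new = np.linalg.inv(R)
        if np.linalg.norm(X_new - X, 'fro') <= tol * np.linalg.norm(X, 'fro'):
            return X_new
        X = X_new
    return X
```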

We mention here that other alternatives to $\delta_R$ are also possible, for instance the popular “log-Euclidean” distance [3], given by

$$\delta_{le}(X, Y) := \|\log X - \log Y\|_F. \qquad (1.4)$$

Notice that to compute $\delta_{le}$ we require two eigenvector decompositions; this makes it more expensive than $S$, which requires only three Cholesky factorizations. Even though the matrix mean under $\delta_{le}$ can be computed in closed form, its dependence on matrix logarithms and exponentials can make it slower than the S-divergence based mean. Much more importantly, for the applications in [24, 25], $\delta_{le}$ and other alternatives to $\delta_R$ proved to be substantially less competitive than (1.3), so we limit our focus to $S$; for more extensive empirical comparisons with other distances, we refer the reader to [24, 25].

While our paper was under review (in 2011), we became aware of a concurrent paper of Chebbi and Moakher (CM) [21], who consider a one-parameter family of divergences that generalizes (1.3). Our work differs from CM in the following aspects:

  • CM prove $\sqrt{S}$ to be a distance only for commuting matrices. As per Remark 3.11, the commuting case essentially reduces to the scalar case. The noncommuting case is much harder, and was only conjectured to hold in [21]. We pose and solve the general noncommuting case, independently of CM.

  • We establish several theorems connecting $S$ to the Riemannian distance $\delta_R$. These connections have not been made either by CM or elsewhere.

  • A question closely related to the metricity of $\delta_S$ is whether the matrix

    $$\bigl[\exp\bigl(-\beta\,\delta_S^2(X_i, X_j)\bigr)\bigr]_{i,j=1}^m$$

    is positive semidefinite for every integer $m \ge 1$ and every scalar $\beta > 0$. CM considered special cases of this question. We provide a complete characterization of the values of $\beta$ necessary and sufficient for the above matrix to be semidefinite.

  • CM analyze the “matrix mean” $\min_{X \succ 0}\sum\nolimits_i S(X, A_i)$, whose solution they obtain by solving $\nabla_X \sum\nolimits_i S(X, A_i) = 0$. CM’s results essentially imply global optimality (we thank a referee for alerting us to this fact, which ultimately follows from CM’s uniqueness theorem and the observation that the cost function tends to $\infty$ both at the boundary of $\mathbb{P}_n$ and as $\|X\| \to \infty$; see the section on matrix means for more details); we provide two different proofs of this fact. One of our proofs is based on establishing geodesic convexity of the S-divergence—a result that is also interesting because our previous attempts [48, 49] overlooked this property. In fact, we show that $S$ is jointly geodesically convex.

Other contributions. The present paper substantially extends our initial work [48]; we outline below the key differences from [48].

  • Due to lack of space, proofs of the lemmas supporting Theorem 3.1 did not appear in [48]. In particular, proofs of Corollaries 3.5 and 3.8 and of the supporting theorems on diagonal matrices and determinants are absent from [48] (these results are not difficult though).

  • None of our results on similarities between the Riemannian distance $\delta_R$ and $\delta_S$ are present in [48]. Although some of these results are mentioned in a summary table in [48], the full theorem statements as well as their proofs are absent. The results concerned are Theorem 4.1 and the contraction and cancellation theorems of Section 4.

  • This paper uncovers a previously unknown result: $S$ is jointly geodesically convex—Prop. 4.3 and the joint-geodesic-convexity theorem that follows it establish this remarkable fact.

  • This paper proves several new “conic” contraction results for $S$ and $\delta_R$; the appurtenant results are the contraction and compression propositions, theorems, and corollaries developed later in the paper.

  • This paper proves bi-Lipschitz-like inequalities relating $S$ and $\delta_R$ (see the bounds theorem later in the paper).

  • Finally, this paper studies the weighted matrix-medians problem:

    $$\min_{X \succ 0}\ \sum\nolimits_{i=1}^m w_i\,\delta_S(X, A_i).$$

    This problem was also partially studied by [19], who presented an iterative method that was erroneously claimed to be a fixed-point iteration in the Thompson metric. We present a counterexample to illustrate this error, and rectify it by presenting a different analysis that ensures convergence (see the section on matrix medians).

2 The S-Divergence

We proceed now to formally introduce the S-divergence. We follow the viewpoint of Bregman divergences. Consider, thus, a differentiable strictly convex function $f$; then, $f(x) \ge f(y) + f'(y)(x - y)$, with equality if and only if $x = y$. The difference between the two sides of this inequality defines the Bregman divergence (Bregman divergences over scalars and vectors have been well studied; see e.g., [18, 4]. They are called divergences because they are not distances, though they often behave like squared distances, in a sense that can be made precise for certain choices of $f$ [22]):

$$D_f(x, y) := f(x) - f(y) - f'(y)(x - y). \qquad (2.1)$$

The scalar divergence (2.1) can be extended to Hermitian matrices.

Proposition 2.1

Let $f$ be differentiable and strictly convex on an interval $I \subseteq \mathbb{R}$; let $X, Y \in \mathbb{H}_n$ be arbitrary Hermitian matrices whose eigenvalues lie in $I$. Then, we have the matrix Bregman divergence:

$$D_f(X, Y) := \mathrm{tr}\bigl[f(X) - f(Y) - f'(Y)(X - Y)\bigr]. \qquad (2.2)$$

By construction, $D_f(X, Y)$ is nonnegative, strictly convex in $X$, and zero if and only if $X = Y$. It is typically asymmetric, and may be viewed as a measure of dissimilarity.

Example 2.2

Let $f(x) = x^2$. Then $f'(Y) = 2Y$ for $Y \in \mathbb{H}_n$, with which (2.2) yields the squared Frobenius norm
$$D_f(X, Y) = \mathrm{tr}\bigl[X^2 - Y^2 - 2Y(X - Y)\bigr] = \|X - Y\|_F^2.$$

If $f(x) = x\log x - x$ on $(0, \infty)$, then $f'(Y) = \log Y$, and (2.2) yields the (unnormalized) von Neumann divergence of quantum information theory [45]:
$$D_f(X, Y) = \mathrm{tr}\bigl(X\log X - X\log Y - X + Y\bigr).$$

For $f(x) = -\log x$ on $(0, \infty)$, we have $f'(Y) = -Y^{-1}$, and we obtain the divergence
$$D_{\ell d}(X, Y) = \mathrm{tr}(XY^{-1}) - \log\det(XY^{-1}) - n,$$

which is known as the LogDet Divergence [37], or more classically as Stein’s loss [50].

The divergence $D_{\ell d}$ is of key importance to our paper, so we mention some additional noteworthy contexts where it occurs: (i) information theory [26], as the relative entropy between two multivariate Gaussians with the same mean; (ii) optimization, when deriving the famous Broyden-Fletcher-Goldfarb-Shanno (BFGS) updates [47]; (iii) matrix divergence theory [5, 28, 46]; and (iv) kernel learning [37].
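As a concrete illustration of Example 2.2 (a small sketch with our own helper names; logm is used only for readability, not efficiency), the following code evaluates the three divergences and exhibits the asymmetry of the LogDet divergence.

```python
import numpy as np
from scipy.linalg import logm

def frobenius_div(X, Y):              # f(x) = x^2
    return np.linalg.norm(X - Y, 'fro') ** 2

def von_neumann_div(X, Y):            # f(x) = x log x - x
    return np.trace(X @ logm(X) - X @ logm(Y) - X + Y).real

def logdet_div(X, Y):                 # f(x) = -log x  (Stein's loss)
    n = X.shape[0]
    W = X @ np.linalg.inv(Y)
    return np.trace(W).real - np.linalg.slogdet(W)[1] - n

A = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.array([[1.0, 0.2], [0.2, 3.0]])
print(logdet_div(A, B), logdet_div(B, A))   # different values: the divergence is asymmetric
```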

Despite the broad applicability of Bregman divergences, their asymmetry is sometimes undesirable. This drawback leads us to consider symmetric divergences, among which the most popular is the “Jensen-Bregman” divergence (this symmetrization has largely been studied only for divergences over scalars or vectors):

$$J_f(X, Y) := \tfrac{1}{2}\Bigl[D_f\bigl(X, \tfrac{X+Y}{2}\bigr) + D_f\bigl(Y, \tfrac{X+Y}{2}\bigr)\Bigr]. \qquad (2.3)$$

This divergence has two attractive and perhaps more useful representations:

$$J_f(X, Y) = \frac{\mathrm{tr}\,f(X) + \mathrm{tr}\,f(Y)}{2} - \mathrm{tr}\,f\Bigl(\frac{X+Y}{2}\Bigr), \qquad (2.4)$$
(2.5)

Compare (2.2) with (2.4): both formulas define divergence as departure from linearity; the former uses derivatives, while the latter is stated using midpoint convexity. Representation (2.4) has an advantage over (2.2), (2.3), and (2.5), in that it does not need to assume differentiability of $f$.

The reader must have realized by now that the S-divergence (1.3) is nothing but the symmetrized divergence (2.3) generated by $f(x) = -\log x$. Alternatively, the S-divergence may essentially be viewed as the Jensen-Bregman divergence between two multivariate Gaussians [26], or as the Bhattacharyya distance between them [12].
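For completeness, here is the short computation behind this claim, using the midpoint representation (2.4) with $f(x) = -\log x$, so that $\mathrm{tr}\,f(X) = -\log\det X$:

$$J_f(X, Y) = \frac{-\log\det X - \log\det Y}{2} + \log\det\Bigl(\frac{X+Y}{2}\Bigr) = \log\det\Bigl(\frac{X+Y}{2}\Bigr) - \frac{1}{2}\log\det(XY) = S(X, Y).$$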

Let us now list a few basic properties of $S$.

Proposition 2.3

Let be the vector of eigenvalues of , and be a diagonal matrix with as its diagonal. Let . Then,

  1. ;

  2. , where , ;

  3. ;

  4. ; and

  5. .

Proof. (i) follows from the equality .
(ii) follows upon observing that

(iii) follows upon noting

(iv) follows as , and .
(v) is trivial. ∎

The most useful corollary to Prop. 2.3 is congruence invariance of $S$.

Corollary 2.4

Let $X, Y \in \mathbb{P}_n$, and let $W$ be any invertible matrix. Then,
$$S(W^*XW,\, W^*YW) = S(X, Y).$$
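As a quick numerical sanity check of this invariance (a sketch with randomly generated matrices; the code and names are ours):

```python
import numpy as np

def s_div(X, Y):
    ld = lambda A: np.linalg.slogdet(A)[1]
    return ld((X + Y) / 2) - 0.5 * (ld(X) + ld(Y))

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n)); X = A @ A.T + n * np.eye(n)
B = rng.standard_normal((n, n)); Y = B @ B.T + n * np.eye(n)
W = rng.standard_normal((n, n))                  # invertible with probability one

print(abs(s_div(X, Y) - s_div(Y, X)))                       # symmetry: ~1e-16
print(abs(s_div(W.T @ X @ W, W.T @ Y @ W) - s_div(X, Y)))   # congruence invariance: ~1e-14
```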

The next result reaffirms that $S$ is a divergence, while showing that it enjoys some limited convexity and concavity.

Proposition 2.5

Let $X, Y \in \mathbb{P}_n$. Then, (i) $S(X, Y) \ge 0$ with equality if and only if $X = Y$; (ii) for fixed $Y$, $S(X, Y)$ is convex in $X$ for $X \preceq (1+\sqrt{2})\,Y$, while for $X \succeq (1+\sqrt{2})\,Y$, it is concave.

Proof. Since $S$ is a sum of Bregman divergences, property (i) follows from definition (2.3). Alternatively, note that $\det\bigl(\tfrac{X+Y}{2}\bigr) \ge \sqrt{\det(XY)}$, with equality if and only if $X = Y$. Part (ii) follows upon analyzing the Hessian $\nabla_X^2 S(X, Y)$. This Hessian can be identified with the matrix

$$\nabla_X^2 S(X, Y) \cong \tfrac{1}{2}\,X^{-1}\otimes X^{-1} - (X+Y)^{-1}\otimes(X+Y)^{-1}, \qquad (2.6)$$

where $\otimes$ is the usual Kronecker product. This matrix is positive definite for $X \prec (1+\sqrt{2})\,Y$ and negative definite for $X \succ (1+\sqrt{2})\,Y$, which proves (ii).

Below we show that $S$ is richer than a divergence: its square root $\delta_S$ is actually a distance on $\mathbb{P}_n$. This is the first main result of our paper. Previous authors [24, 21] conjectured this result but could not establish it, perhaps because both ultimately sought to map $\delta_S$ to a Hilbert space metric. This approach fails because HPD matrices do not form even a (multiplicative) semigroup, which renders the powerful theory of harmonic analysis on semigroups [7] inapplicable to $\delta_S$. This difficulty necessitates a different path to proving metricity of $\delta_S$, and this is the subject of the next section.

3 The metric $\delta_S$

In this section we prove the following main theorem.

Theorem 3.1

Let $\delta_S := \sqrt{S}$ be defined via (1.3). Then, $\delta_S$ is a metric on $\mathbb{P}_n$.

The proof of Theorem 3.1 depends on several results, which we first establish. (We note that a referee wondered whether the “metrization” results of [22, 23] could yield an alternative proof of Theorem 3.1. Unfortunately, those results rely very heavily on the commutativity and total order available on the real line, both of which are missing in $\mathbb{P}_n$. Indeed, for the commutative case, Lemma 3.4 alone suffices to prove that $\delta_S$ is a metric.)

Definition 3.2 ([7, Def. 1.1])

Let $\mathcal{X}$ be a nonempty set. A function $\psi : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ is said to be negative definite if $\psi(x, y) = \psi(y, x)$ for all $x, y \in \mathcal{X}$, and the inequality

$$\sum\nolimits_{i,j=1}^{m} c_i c_j\,\psi(x_i, x_j) \le 0$$

holds for all integers $m \ge 2$, subsets $\{x_1, \ldots, x_m\} \subseteq \mathcal{X}$, and scalars $\{c_1, \ldots, c_m\} \subset \mathbb{R}$ with $\sum\nolimits_i c_i = 0$.

Theorem 3.3 ([7, Prop. 3.2, Ch. 3])

Let $\psi : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ be negative definite with $\psi(x, x) = 0$ for all $x \in \mathcal{X}$. Then, there is a Hilbert space $\mathcal{H}$ and a mapping $\varphi$ from $\mathcal{X}$ to $\mathcal{H}$ such that one has the relation

$$\psi(x, y) = \|\varphi(x) - \varphi(y)\|^2. \qquad (3.1)$$

Moreover, negative definiteness of $\psi$ is necessary for such a mapping to exist.

Theorem 3.3 helps prove the triangle inequality for the scalar case.

Lemma 3.4

Define the scalar version of $\delta_S$ as

$$\delta_s(x, y) := \Bigl(\log\frac{x+y}{2} - \frac{\log x + \log y}{2}\Bigr)^{1/2}, \qquad x, y > 0.$$

Then, $\delta_s$ satisfies the triangle inequality, i.e.,

$$\delta_s(x, y) \le \delta_s(x, z) + \delta_s(z, y) \qquad \text{for all } x, y, z > 0. \qquad (3.2)$$

Proof. We show that $\delta_s^2$ is negative definite. Since $\delta_s^2(x, x) = 0$, Theorem 3.3 then immediately implies the triangle inequality (3.2). To prove that $\delta_s^2$ is negative definite, by [7, Thm. 2.2, Ch. 3] we may equivalently show that $e^{-\beta\,\delta_s^2(x, y)}$ is a positive definite function for $\beta > 0$ and $x, y > 0$. To that end, it suffices to show that the matrix

$$H = [h_{ij}], \qquad h_{ij} := e^{-\beta\,\delta_s^2(x_i, x_j)} = \frac{2^{\beta}\,(x_i x_j)^{\beta/2}}{(x_i + x_j)^{\beta}},$$

is HPD for every integer $m \ge 1$, every $\beta > 0$, and positive numbers $x_1, \ldots, x_m$. Now, observe that

$$\frac{1}{(x_i + x_j)^{\beta}} = \frac{1}{\Gamma(\beta)}\int_0^{\infty} t^{\beta-1}\,e^{-t(x_i + x_j)}\,dt, \qquad (3.3)$$

where $\Gamma(\beta)$ is the Gamma function. Thus, with $f_i(t) := 2^{\beta/2}\,x_i^{\beta/2}\,t^{(\beta-1)/2}\,e^{-t x_i}/\sqrt{\Gamma(\beta)}$, we see that $H$ equals the Gram matrix $\bigl[\int_0^\infty f_i(t)f_j(t)\,dt\bigr]_{ij}$, whereby $H \succeq 0$ (and indeed $H \succ 0$ for distinct $x_i$, since the $f_i$ are then linearly independent). ∎

Using Lemma 3.4 we obtain the following “Minkowski” inequality for $\delta_s$.
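The Gram-matrix argument can be checked numerically: for random positive scalars and several values of $\beta$, the matrix $H$ above has (numerically) nonnegative spectrum. The following sketch uses our own names and tolerances.

```python
import numpy as np

def scalar_s(x, y):
    # delta_s^2(x, y) = log((x + y)/2) - (log x + log y)/2
    return np.log((x + y) / 2) - 0.5 * (np.log(x) + np.log(y))

rng = np.random.default_rng(2)
x = rng.uniform(0.1, 10.0, size=20)
for beta in (0.3, 1.0, 2.7):
    H = np.exp(-beta * scalar_s(x[:, None], x[None, :]))
    print(beta, np.linalg.eigvalsh(H).min())   # nonnegative up to rounding error
```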

Corollary 3.5

Let $x, y, z \in \mathbb{R}_{++}^m$; and let $p \ge 1$. Then,

$$\Bigl(\sum\nolimits_{i=1}^m \delta_s^p(x_i, y_i)\Bigr)^{1/p} \le \Bigl(\sum\nolimits_{i=1}^m \delta_s^p(x_i, z_i)\Bigr)^{1/p} + \Bigl(\sum\nolimits_{i=1}^m \delta_s^p(z_i, y_i)\Bigr)^{1/p}. \qquad (3.4)$$

Proof. Lemma 3.4 implies that for positive scalars $x_i$, $y_i$, and $z_i$, we have

$$\delta_s(x_i, y_i) \le \delta_s(x_i, z_i) + \delta_s(z_i, y_i).$$

Raise both sides to the $p$-th power, sum over $i$, and invoke Minkowski’s inequality to conclude (3.4). ∎

Theorem 3.6

Let $X, Y, Z \in \mathbb{P}_n$ be diagonal matrices. Then,

$$\delta_S(X, Y) \le \delta_S(X, Z) + \delta_S(Z, Y). \qquad (3.5)$$

Proof. For diagonal matrices $X$ and $Y$, it is easy to verify that $\delta_S^2(X, Y) = \sum\nolimits_{i=1}^n \delta_s^2(X_{ii}, Y_{ii})$. Now invoke Corollary 3.5 with $p = 2$. ∎

Next, we recall an important determinantal inequality for positive matrices.

Theorem 3.7 ([9, Exercise VI.7.2])

Let $A, B \in \mathbb{P}_n$. Let $\lambda^{\downarrow}(A)$ denote the vector of eigenvalues of $A$ sorted in decreasing order; define $\lambda^{\downarrow}(B)$ likewise. Then,

$$\det(A + B) \ge \prod\nolimits_{i=1}^n \bigl(\lambda_i^{\downarrow}(A) + \lambda_i^{\downarrow}(B)\bigr). \qquad (3.6)$$
Corollary 3.8

Let $A, B \in \mathbb{P}_n$. Let $\mathrm{Eig}^{\downarrow}(A)$ denote the diagonal matrix with $\lambda^{\downarrow}(A)$ as its diagonal; define $\mathrm{Eig}^{\downarrow}(B)$ likewise. Then,
$$S\bigl(\mathrm{Eig}^{\downarrow}(A),\, \mathrm{Eig}^{\downarrow}(B)\bigr) \le S(A, B).$$

Proof. Apply (3.6) to $A/2$ and $B/2$, divide each side by $\sqrt{\det(AB)}$, and note that $\det(AB)$ is invariant to permutations of the eigenvalues; take logarithms to conclude. ∎

The final result we need is well-known in linear algebra (we provide a proof).

Lemma 3.9

Let $A \in \mathbb{P}_n$, and let $B$ be Hermitian. There is an invertible matrix $P$ for which

$$P^*AP = I \quad\text{and}\quad P^*BP = D, \qquad\text{where $D$ is diagonal}. \qquad (3.7)$$

Proof. Let $A = A^{1/2}A^{1/2}$, and define $C := A^{-1/2}BA^{-1/2}$. The matrix $C$ is Hermitian; so let the unitary $U$ diagonalize it to $U^*CU = D$. Set $P := A^{-1/2}U$, to obtain

$$P^*AP = U^*A^{-1/2}AA^{-1/2}U = U^*U = I,$$

and by construction, it follows that $P^*BP = U^*CU = D$. ∎

Accoutered with the above results, we can finally prove Theorem 3.1.

Proof. (Theorem 3.1.)   We need to show that $\delta_S$ is symmetric, nonnegative, definite, and that it satisfies the triangle inequality. Symmetry is obvious. Nonnegativity and definiteness were shown in Prop. 2.5. The only hard part is to prove the triangle inequality, a result that has eluded previous attempts [24, 21].

Let $X, Y, Z \in \mathbb{P}_n$ be arbitrary. From Lemma 3.9 we know that there is an invertible matrix $P$ such that $P^*XP = I$ and $P^*ZP = D$ is diagonal. Since $Y$ is arbitrary, and congruence preserves positive definiteness, we may write just $Y$ instead of $P^*YP$. Also, since $\delta_S$ is invariant under congruence (see Prop. 2.3 and Corollary 2.4), proving the triangle inequality reduces to showing that

$$\delta_S(I, D) \le \delta_S(I, Y) + \delta_S(Y, D). \qquad (3.8)$$

Consider now the diagonal matrices $\mathrm{Eig}^{\downarrow}(Y)$ and $\mathrm{Eig}^{\downarrow}(D)$. Theorem 3.6 asserts

$$\delta_S\bigl(I, \mathrm{Eig}^{\downarrow}(D)\bigr) \le \delta_S\bigl(I, \mathrm{Eig}^{\downarrow}(Y)\bigr) + \delta_S\bigl(\mathrm{Eig}^{\downarrow}(Y), \mathrm{Eig}^{\downarrow}(D)\bigr). \qquad (3.9)$$

Prop. 2.3(i) implies that $\delta_S\bigl(I, \mathrm{Eig}^{\downarrow}(D)\bigr) = \delta_S(I, D)$ and $\delta_S\bigl(I, \mathrm{Eig}^{\downarrow}(Y)\bigr) = \delta_S(I, Y)$, while Corollary 3.8 shows that $\delta_S\bigl(\mathrm{Eig}^{\downarrow}(Y), \mathrm{Eig}^{\downarrow}(D)\bigr) \le \delta_S(Y, D)$. Combining these inequalities, we immediately obtain (3.8).
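Theorem 3.1 is also easy to spot-check numerically. The following sketch (our own code; random trials only, not a proof) verifies the triangle inequality for $\delta_S = \sqrt{S}$ on random HPD triples.

```python
import numpy as np

def delta_s(X, Y):
    ld = lambda A: np.linalg.slogdet(A)[1]
    return np.sqrt(ld((X + Y) / 2) - 0.5 * (ld(X) + ld(Y)))

rng = np.random.default_rng(3)
n = 6
def rand_hpd():
    A = rng.standard_normal((n, n))
    return A @ A.T + 0.1 * np.eye(n)

worst = np.inf
for _ in range(1000):
    X, Y, Z = rand_hpd(), rand_hpd(), rand_hpd()
    worst = min(worst, delta_s(X, Z) + delta_s(Z, Y) - delta_s(X, Y))
print(worst)   # remains nonnegative over all trials
```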

We now turn our attention to a connection of importance to machine learning and approximation theory: kernel functions related to $\delta_S$. Indeed, some of these connections (e.g., Theorem 3.10) have already proved useful in computer vision [31].

3.1 Hilbert space embedding

Since $\delta_S$ is a metric, and Lemma 3.4 shows that for scalars $\delta_s$ embeds isometrically into a Hilbert space, one may ask if $(\mathbb{P}_n, \delta_S)$ also admits such an embedding. But as mentioned previously, it is the lack of such an embedding that necessitated a different route to metricity. Let us look more carefully at what goes wrong, and what kind of Hilbert space embeddability $\delta_S$ does admit.

Theorem 3.3 implies that a Hilbert space embedding exists if and only if $\delta_S^2$ is a negative definite kernel; equivalently, if and only if the map (cf. Lemma 3.4)

$$(X, Y) \mapsto e^{-\beta\,\delta_S^2(X, Y)}$$

is a positive definite kernel for all $\beta > 0$. It suffices to check whether the matrix

$$H_\beta := \bigl[e^{-\beta\,\delta_S^2(X_i, X_j)}\bigr]_{i,j=1}^m \qquad (3.10)$$

is positive definite for every integer $m \ge 1$ and arbitrary HPD matrices $X_1, \ldots, X_m$.

Unfortunately, a quick numerical experiment reveals that $H_\beta$ can be indefinite. A counterexample is given by the following positive definite matrices

(3.11)

and a suitable choice of $\beta$, for which $\lambda_{\min}(H_\beta) < 0$. This counterexample destroys hopes of embedding the metric space $(\mathbb{P}_n, \delta_S)$ isometrically into a Hilbert space.
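Since the specific matrices of (3.11) are not reproduced here, the following sketch (our own code and parameter choices) searches for such indefiniteness by random sampling: it draws HPD matrices, forms $H_\beta$ for a $\beta$ outside the admissible set of Theorem 3.10 below, and reports the most negative eigenvalue encountered.

```python
import numpy as np

def s_div(X, Y):
    ld = lambda A: np.linalg.slogdet(A)[1]
    return ld((X + Y) / 2) - 0.5 * (ld(X) + ld(Y))

rng = np.random.default_rng(4)
n, m, beta = 3, 10, 0.75        # for n = 3, beta = 0.75 lies outside the set in (3.12)
most_negative = np.inf
for _ in range(200):
    Xs = []
    for _ in range(m):
        A = rng.standard_normal((n, n))
        Xs.append(A @ A.T + 0.1 * np.eye(n))
    H = np.array([[np.exp(-beta * s_div(Xi, Xj)) for Xj in Xs] for Xi in Xs])
    most_negative = min(most_negative, np.linalg.eigvalsh(H).min())
print(most_negative)            # a negative value exhibits an indefinite H_beta
```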

Although the matrix (3.10) is not HPD in general, we might ask: for what choices of $\beta$ is $H_\beta$ HPD? Theorem 3.10 answers this question for $H_\beta$ formed from symmetric real positive definite matrices, and characterizes the values of $\beta$ necessary and sufficient for $H_\beta$ to be positive definite. We note here that a special case of this question was essentially treated in [27], in the context of semigroup kernels on measures.

Theorem 3.10

Let $X_1, \ldots, X_m$ be real symmetric matrices in $\mathbb{P}_n$. The matrix $H_\beta$ defined by (3.10) is positive definite for every integer $m \ge 1$ and all such matrices, if and only if $\beta$ satisfies

$$\beta \in \Bigl\{\frac{j}{2} : j = 1, \ldots, n-1\Bigr\} \cup \Bigl[\frac{n-1}{2}, \infty\Bigr). \qquad (3.12)$$

Proof. We first prove the “if” part. Recall, therefore, the Gaussian integral

$$\int_{\mathbb{R}^n} e^{-x^T A x}\,dx = \pi^{n/2}\,\det(A)^{-1/2}, \qquad A \in \mathbb{P}_n.$$

Define the map $\varphi(X) := (2/\pi)^{n/4}\det(X)^{1/4}\,e^{-x^T X x} \in L_2(\mathbb{R}^n)$, where the inner product is given by

$$\langle\varphi(X), \varphi(Y)\rangle = \Bigl(\frac{2}{\pi}\Bigr)^{n/2}\det(XY)^{1/4}\int_{\mathbb{R}^n} e^{-x^T(X+Y)x}\,dx = \frac{\det(XY)^{1/4}}{\det\bigl(\frac{X+Y}{2}\bigr)^{1/2}} = e^{-\frac{1}{2}\delta_S^2(X, Y)}.$$

Thus, it follows that $H_{1/2} \succeq 0$. The Schur product theorem says that the elementwise product of two positive matrices is again positive. So, in particular, $H_\beta$ is positive whenever $\beta$ is an integer multiple of $\tfrac{1}{2}$. To extend the result to all $\beta$ covered by (3.12), we invoke another integral representation: the multivariate Gamma function, defined as [42, §2.1.2]

$$\Gamma_n(\beta) := \int_{A \succ 0} e^{-\mathrm{tr}(A)}\,\det(A)^{\beta - \frac{n+1}{2}}\,dA. \qquad (3.13)$$

Let $f_X(A) := c\,e^{-\mathrm{tr}(AX)}$, for some constant $c > 0$; then, compute the inner product

$$\langle f_X, f_Y\rangle := c^2\int_{A \succ 0} e^{-\mathrm{tr}(A(X+Y))}\,\det(A)^{\beta - \frac{n+1}{2}}\,dA = c^2\,\Gamma_n(\beta)\,\det(X+Y)^{-\beta},$$

which exists if $\beta > \tfrac{n-1}{2}$. Thus, $H_\beta \succeq 0$ for all $\beta$ defined by (3.12).

The converse is a deeper result grounded in the theory of symmetric cones. Specifically, since the cone of real symmetric positive definite matrices is a symmetric cone, and $\det(X)^{-\beta}$ is decreasing on this cone, an appeal to [30, Thm. VII.3.1] yields the ‘only if’ part of our claim.

Remark 3.11

Let $\mathcal{X}$ be a set of HPD matrices that commute with each other. Then, $(\mathcal{X}, \delta_S)$ can be isometrically embedded into a Hilbert space. This claim follows because a commuting set of matrices can be simultaneously diagonalized, and for diagonal matrices, $\delta_S^2(X, Y) = \sum\nolimits_i \delta_s^2(X_{ii}, Y_{ii})$, which is a nonnegative sum of negative definite kernels and is therefore itself negative definite.

Theorem 3.10 shows that $e^{-\beta\delta_S^2}$ is not a kernel for all $\beta > 0$, while Remark 3.11 mentions an extreme case for which it is always a positive definite kernel. This prompts us to pose the following problem.

Open problem 1. Determine necessary and sufficient conditions on a set $\mathcal{X} \subseteq \mathbb{P}_n$, so that $e^{-\beta\delta_S^2(X, Y)}$ is a kernel function on $\mathcal{X}\times\mathcal{X}$ for all $\beta > 0$.

4 Connections with $\delta_R$

This section returns to our original motivation: using $\delta_S$ as an alternative to the Riemannian metric $\delta_R$. In particular, we show a sequence of results that highlight similarities between $\delta_S$ and $\delta_R$. Table 4.1 lists these to provide a quick summary. Thereafter, we develop the details.

Table 4.1: Similarities between $\delta_R$ and $\delta_S$, at a glance: each row pairs a property of the Riemannian distance (with its reference) with the corresponding property of the S-divergence (with its reference); the individual statements are developed in the results of this section. All matrices are assumed to be in $\mathbb{P}_n$, except for those in Lines 1 and 7 (the latter involving matrices of full column rank). An ‘E’ indicates an easily verifiable result.

4.1 Geometric mean

We begin by studying an object that connects $\delta_S$ and $\delta_R$ most intimately: the matrix geometric mean (GM). For HPD matrices $A$ and $B$, the GM is denoted by $A \sharp B$, and is given by the formula

$$A \sharp B := A^{1/2}\bigl(A^{-1/2} B A^{-1/2}\bigr)^{1/2} A^{1/2}. \qquad (4.1)$$

The GM (4.1) has numerous attractive properties—see for instance [1]—among which the following variational characterization is very important [11]:

$$A \sharp B = \operatorname*{argmin}_{X \succ 0}\ \bigl[\delta_R^2(X, A) + \delta_R^2(X, B)\bigr]. \qquad (4.2)$$

Surprisingly, the GM enjoys a similar characterization even with $\delta_S$.

Theorem 4.1

Let $A, B \in \mathbb{P}_n$. Then,

$$A \sharp B = \operatorname*{argmin}_{X \succ 0}\ \bigl[S(X, A) + S(X, B)\bigr]. \qquad (4.3)$$

Moreover, $A\sharp B$ is equidistant from $A$ and $B$, i.e., $\delta_S(A\sharp B, A) = \delta_S(A\sharp B, B)$.

Proof. If $A = B$, then clearly $X = A$ minimizes $h(X) := S(X, A) + S(X, B)$. Assume, therefore, that $A \neq B$. Ignoring the constraint $X \succ 0$ for the moment, we see that any stationary point of $h$ must satisfy $\nabla h(X) = 0$. This condition translates into

$$(X + A)^{-1} + (X + B)^{-1} = X^{-1}, \qquad\text{equivalently}\qquad XA^{-1}X = B. \qquad (4.4)$$

The last equation is a Riccati equation whose unique, positive definite solution is the geometric mean $X = A\sharp B$ (see [10, Prop. 1.2.13]).

Next, we show that this stationary point is a local minimum, not a local maximum or saddle point. To that end, we show that the Hessian $\nabla^2 h(X)$ is positive definite at the stationary point $X = A\sharp B$. The Hessian of $h$ is given by

$$\nabla^2 h(X) \cong X^{-1}\otimes X^{-1} - (X+A)^{-1}\otimes(X+A)^{-1} - (X+B)^{-1}\otimes(X+B)^{-1}.$$

Writing $P := (X+A)^{-1}$ and $Q := (X+B)^{-1}$, upon using (4.4) we obtain

$$\nabla^2 h(X) \cong (P+Q)\otimes(P+Q) - P\otimes P - Q\otimes Q = P\otimes Q + Q\otimes P \succ 0.$$

Thus, $X = A\sharp B$ is a strict local minimum of (4.3). This local minimum is actually global, as (4.4) has a unique positive definite solution and $h(X)$ goes to $\infty$ at the boundary of the constraint set.

To prove the equidistance, recall that $X := A\sharp B$ satisfies $XA^{-1}X = B$ and $\det X = \sqrt{\det A\,\det B}$; then observe that

$$S(X, A) - S(X, B) = \log\frac{\det(X+A)}{\det(X+B)} + \frac{1}{2}\log\frac{\det B}{\det A} = \log\frac{\det A}{\det X} + \frac{1}{2}\log\frac{\det B}{\det A} = \frac{1}{2}\log\frac{\det A\,\det B}{(\det X)^2} = 0,$$

since $\det(X + B) = \det(X + XA^{-1}X) = \det X\,\det(X+A)/\det A$. ∎
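Theorem 4.1 is easy to verify numerically. The sketch below (our own code, using SciPy's matrix square root) computes $A\sharp B$ via (4.1), checks the Riccati equation (4.4), and confirms the equidistance property.

```python
import numpy as np
from scipy.linalg import sqrtm

def s_div(X, Y):
    ld = lambda A: np.linalg.slogdet(A)[1]
    return ld((X + Y) / 2) - 0.5 * (ld(X) + ld(Y))

rng = np.random.default_rng(5)
n = 4
P = rng.standard_normal((n, n)); A = P @ P.T + np.eye(n)
Q = rng.standard_normal((n, n)); B = Q @ Q.T + np.eye(n)

Ah = np.real(sqrtm(A))                          # A^{1/2}
Ahi = np.linalg.inv(Ah)                         # A^{-1/2}
GM = Ah @ np.real(sqrtm(Ahi @ B @ Ahi)) @ Ah    # A # B, via (4.1)

print(np.linalg.norm(GM @ np.linalg.inv(A) @ GM - B))   # ~0: Riccati equation (4.4)
print(abs(s_div(GM, A) - s_div(GM, B)))                  # ~0: equidistance in delta_S
```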

4.2 Geodesic convexity

The above derivation concludes optimality from first principles. It was originally driven by the fact that $\delta_S$ is not geodesically convex [48]. However, we recently realized that its square $\delta_S^2 = S$ actually is geodesically convex. This realization leads to a more insightful proof of uniqueness and optimality of the S-mean.

In fact, even more is true: the theorem below on joint geodesic convexity shows that $S$ is not only geodesically convex, it is jointly geodesically convex. Before proving that theorem, we recall two useful results; the first immediately implies the second. (It is a minor curiosity to note that [41, Thm. 2] proved a mixed-mean inequality for the matrix geometric and arithmetic means; Prop. 4.3 includes their result as a special case.)

Theorem 4.2 ([36])

The GM of $A, B \in \mathbb{P}_n$ is given by the variational formula

$$A \sharp B = \max\Bigl\{X \in \mathbb{H}_n : \begin{pmatrix} A & X \\ X & B \end{pmatrix} \succeq 0\Bigr\},$$

where the maximum is taken with respect to the Löwner order.

Proposition 4.3 (Joint-concavity (see e.g. [36]))

Let $A_1, B_1, A_2, B_2 \in \mathbb{P}_n$. Then,

$$A_1 \sharp B_1 + A_2 \sharp B_2 \preceq (A_1 + A_2)\,\sharp\,(B_1 + B_2). \qquad (4.5)$$

Proof. From one application of Thm. 4.2 we have