Positive definite matrices and the S-divergence\footnote{A small fraction of an initial version of this work was presented at the Advances in Neural Information Processing Systems (NIPS) 2012 conference—see [48].}
Abstract
Positive definite matrices abound in a dazzling variety of applications. This ubiquity can be in part attributed to their rich geometric structure: positive definite matrices form a self-dual convex cone whose strict interior is a Riemannian manifold. The manifold view comes with a “natural” distance function while the conic view does not. Nevertheless, drawing motivation from the conic view, we introduce the S-divergence as a “natural” distance-like function on the open cone of positive definite matrices. We motivate the S-divergence via a sequence of results that connect it to the Riemannian distance. In particular, we show that (a) this divergence is the square of a distance; and (b) that it has several geometric properties similar to those of the Riemannian distance, though without being computationally as demanding. The S-divergence is even more intriguing: although nonconvex, we can still compute matrix means and medians using it to global optimality. We complement our results with some numerical experiments illustrating our theorems and our optimization algorithm for computing matrix medians.
Key words. Bregman matrix divergence; Log-Determinant divergence; Stein divergence; Jensen-Bregman divergence; matrix geometric mean; matrix median; nonpositive curvature
1 Introduction
Hermitian positive definite (HPD) matrices are a noncommutative generalization of positive reals. They abound in a multitude of applications and exhibit attractive geometric properties—e.g., they form a differentiable Riemannian (also Finslerian) manifold [10, 33] that is a well-studied example of a manifold of nonpositive curvature [17, Ch.10]. HPD matrices possess even more structure: (i) they embody a canonical higher-rank symmetric space [51]; and (ii) their closure forms a closed, self-dual convex cone.
The convex conic view enjoys great importance in convex optimization [43, 6, 44] and in nonlinear Perron-Frobenius theory [40]; symmetric spaces are important in algebra, analysis [51, 32, 39], and optimization [43, 52]; while the manifold view (Riemannian or Finslerian) plays diverse roles—see [10, Ch.6] and [46].
The manifold view is equipped with a “natural” distance function while the conic view is not. Nevertheless, drawing motivation from the convex conic view, we introduce the S-divergence as a “natural” distance-like function on the open cone of positive definite matrices. Indeed, we prove a sequence of results connecting the S-divergence to the Riemannian distance. Most importantly, we show that (a) this divergence is the square of a distance; and (b) that it has several geometric properties in common with the Riemannian distance, without being numerically as demanding. This builds an informal link between the manifold and conic views of HPD matrices.
1.1 Background and notation
We begin by fixing notation. The letter $\mathbb{H}$ denotes some Hilbert space, usually just $\mathbb{C}^n$. The inner product between two vectors $x$ and $y$ in $\mathbb{H}$ is $\langle x, y\rangle := x^*y$ ($x^*$ denotes ‘conjugate transpose’). The set of $n \times n$ Hermitian matrices is denoted as $\mathbb{H}_n$. A matrix $A \in \mathbb{H}_n$ is called positive definite if
\[
\langle x, Ax\rangle > 0 \qquad \text{for all } x \neq 0. \tag{1.1}
\]
The set of all positive definite (henceforth positive) matrices is denoted by $\mathbb{P}_n$. We say $A$ is positive semidefinite if $\langle x, Ax\rangle \ge 0$ for all $x$; denoted $A \succeq 0$. The inequality $A \succeq B$ is the usual Löwner order and means $A - B \succeq 0$. The Frobenius norm of a matrix $A$ is defined as $\|A\|_F := \sqrt{\operatorname{tr}(A^*A)}$; and $\|A\|$ denotes the operator 2-norm. Let $f$ be an analytic function on $\mathbb{C}$; for a matrix $A \in \mathbb{H}_n$ with eigendecomposition $A = U\Lambda U^*$, $f(A)$ equals $U f(\Lambda) U^*$ with $f(\Lambda) = \operatorname{Diag}\bigl(f(\lambda_1), \ldots, f(\lambda_n)\bigr)$.
The set $\mathbb{P}_n$ is a well-studied differentiable Riemannian manifold, with the Riemannian metric given by the differential form $ds = \|X^{-1/2}\,dX\,X^{-1/2}\|_F$. This metric induces the Riemannian distance (see e.g., [10, Ch. 6]):
\[
\delta_R(X, Y) := \|\log(Y^{-1/2} X Y^{-1/2})\|_F, \qquad X, Y \in \mathbb{P}_n, \tag{1.2}\label{eq.riem}
\]
where $\log$ denotes the matrix logarithm.
A counterpart to the distance (LABEL:eq.riem) was formally introduced in [48] under the name S-divergence\footnote{It is called a divergence because, although nonnegative, definite, and symmetric, it is not a metric.}; this divergence is defined as
\[
\delta_S^2(X, Y) := S(X, Y) := \log\det\Bigl(\frac{X+Y}{2}\Bigr) - \frac{1}{2}\log\det(XY), \qquad X, Y \in \mathbb{P}_n. \tag{1.3}\label{eq.ss}
\]
Our definition above already writes $\delta_S^2$ in anticipation of Theorem LABEL:thm.metric, which shows $\delta_S$ to be a metric. This paper suggests the S-divergence as an alternative to (LABEL:eq.riem), and studies several of its properties that may also be of independent interest. The simplicity of (LABEL:eq.ss) is one of the key reasons for using it as an alternative to (LABEL:eq.riem): it is cheaper to compute, as is its derivative, and certain basic algorithms involving it run much faster than corresponding ones that use $\delta_R$ [48].
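To make the computational appeal of the S-divergence concrete, here is a minimal Python sketch (our own; the helper names are not from [48]) that evaluates $S(X,Y)$ using only Cholesky factorizations:

```python
import numpy as np

def logdet(A):
    # log det(A) via Cholesky: A = L L^*, so log det(A) = 2 * sum(log diag(L)).
    L = np.linalg.cholesky(A)
    return 2.0 * np.sum(np.log(np.diag(L)))

def s_divergence(X, Y):
    # S(X, Y) = log det((X + Y)/2) - (1/2) log det(XY)
    return logdet((X + Y) / 2) - 0.5 * (logdet(X) + logdet(Y))

# quick sanity checks on random positive definite matrices
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); X = A @ A.T + 4 * np.eye(4)
B = rng.standard_normal((4, 4)); Y = B @ B.T + 4 * np.eye(4)
assert abs(s_divergence(X, X)) < 1e-10                        # definiteness
assert abs(s_divergence(X, Y) - s_divergence(Y, X)) < 1e-10   # symmetry
assert s_divergence(X, Y) > 0                                 # nonnegativity
```

Each evaluation costs three Cholesky factorizations, versus the eigendecompositions needed for the Riemannian distance.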
This line of thought actually originates in [24, 25], where, for an image search task based on “nearest neighbors,” $S$ is used to measure nearness instead of $\delta_R^2$, and is shown to yield large speedups without blighting the quality of the search results. Although the exact details of this image search are outside the scope of this paper, let us highlight below the two speedups that were crucial to [24, 25].
The first speedup is shown in the left panel of Fig. LABEL:fig.one, which compares the times taken to compute $S$ and $\delta_R^2$. For computing the latter, we used the dist.m function in the Matrix Means Toolbox (MMT)\footnote{Downloaded from http://bezout.dm.unipi.it/software/mmtoolbox/.}. The second, more dramatic speedup is shown in the right panel, which shows the time taken to compute the matrix means
\[
\mathrm{GM}(A_1, \ldots, A_m) \quad\text{and}\quad \operatorname*{argmin}_{X \succ 0}\ \sum\nolimits_{i=1}^m S(X, A_i),
\]
where $A_1, \ldots, A_m$ are HPD matrices. For details on these means see Section LABEL:sec.mtxmeans; the geometric mean $\mathrm{GM}$ is also known as the “Karcher mean”, and was computed using the MMT via the rich.m script, which implements a state-of-the-art method [13, 34].
We mention here that other alternatives to $\delta_R$ are also possible, for instance the popular “log-Euclidean” distance [3], given by
\[
\delta_{le}(X, Y) := \|\log X - \log Y\|_F. \tag{1.4}
\]
Notice that to compute $\delta_{le}$ we require two eigenvector decompositions; this makes it more expensive than $S$, which requires only 3 Cholesky factorizations. Even though the matrix mean under $\delta_{le}$ can be computed in closed form, its dependence on matrix logarithms and exponentials can make it slower to use than $S$. However, much more importantly, for the applications in [24, 25], $\delta_{le}$ and other alternatives to $\delta_R$ proved to be substantially less competitive than (LABEL:eq.ss), so we limit our focus to $\delta_R$; for more extensive empirical comparisons with other distances, we refer the reader to [24, 25].
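For comparison, the two competing distances can be sketched as follows (our own code, using SciPy); `scipy.linalg.eigvalsh(X, Y)` solves the generalized eigenproblem $Xv = \lambda Yv$, whose eigenvalues coincide with those of $Y^{-1/2} X Y^{-1/2}$:

```python
import numpy as np
from scipy.linalg import eigvalsh, logm

def delta_riemann(X, Y):
    # delta_R(X, Y) = ||log(Y^{-1/2} X Y^{-1/2})||_F
    lam = eigvalsh(X, Y)                      # generalized eigenvalues, all > 0
    return np.sqrt(np.sum(np.log(lam) ** 2))

def delta_log_euclidean(X, Y):
    # delta_le(X, Y) = ||log X - log Y||_F (two full eigendecompositions)
    return np.linalg.norm(logm(X) - logm(Y), 'fro')

# for commuting (e.g., diagonal) matrices the two distances coincide
X = np.diag([1.0, 2.0, 3.0]); Y = np.diag([2.0, 2.0, 5.0])
assert abs(delta_riemann(X, Y) - delta_log_euclidean(X, Y)) < 1e-8
```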
While our paper was under review (in 2011), we became aware of a concurrent paper of Chebbi and Moakher (CM) [21], who consider a one-parameter family of divergences that generalize (LABEL:eq.ss). Our work differs from CM in the following aspects:

CM prove $\delta_S$ to be a distance only for commuting matrices. As per Remark LABEL:rmk.commute, the commuting case essentially reduces to the scalar case. The noncommuting case is much harder, and was conjectured to be true in [21]. We pose and solve the general noncommuting case, independently of CM.

We establish several theorems connecting to the Riemannian distance . These connections have not been made either by CM or elsewhere.

A question closely related to the metricity of $\delta_S$ is whether the matrix
\[
\bigl[e^{-\beta \delta_S^2(X_i, X_j)}\bigr]_{i,j=1}^m
\]
is positive semidefinite for every integer $m \ge 1$ and every scalar $\beta > 0$. CM considered special cases of this question. We provide a complete characterization of the values of $\beta$ necessary and sufficient for the above matrix to be semidefinite.

CM analyze the “matrix-mean” under their divergence, whose solution they obtain by solving the first-order optimality condition. CM’s results essentially imply global optimality\footnote{We thank a referee for alerting us to this fact, which ultimately follows from CM’s uniqueness theorem and the observation that the cost function goes to $+\infty$ both at the boundary of the cone and as $\|X\| \to \infty$ (see Sec. LABEL:sec.mtxmeans for more details).}; we provide two different proofs of this fact. One of our proofs is based on establishing geodesic convexity of the S-divergence—a result that is also interesting because in our previous attempts [48, 49] we overlooked this property. In fact, we show (Theorem LABEL:thm.jointgc) that $S$ is jointly geodesically convex.
Other contributions. The present paper substantially extends our initial work [48]; we outline below the key differences from [48].

None of our results on similarities between the Riemannian distance and are present in [48]. Although some of these results are mentioned in a summary table in [48], the full theorem statements as well as their proofs are absent. The concerned results are: Theorems LABEL:thm.gm, LABEL:thm.contract, LABEL:thm.contract3, LABEL:thm.cancel, LABEL:thm.tucontract, and LABEL:thm.contract2.

This paper uncovers a previously unknown result: $S$ is jointly geodesically convex—Prop. LABEL:prop.mixedmean and Theorem LABEL:thm.jointgc establish this remarkable fact.

This paper proves several new “conic” contraction results for and . The appurtenant results are: Proposition LABEL:prop.logdet, Corollary LABEL:cor.ratio, Theorem LABEL:thm.compr, Corollary LABEL:cor.compr.tensor, Theorem LABEL:thm.geig, and Corollary LABEL:corr.riem.compress.

This paper proves biLipschitzlike inequalities for and (Theorem LABEL:thm.bnds).

Finally, this paper studies the weighted matrix-median problem:
\[
\min_{X \succ 0}\ \sum\nolimits_{i=1}^m w_i\, \delta_S(X, A_i), \qquad w_i \ge 0,\ \sum\nolimits_i w_i = 1.
\]
This problem was also partially studied by [19], who presented an iterative method that was erroneously claimed to be a fixed-point iteration in the Thompson metric. We present a counterexample to illustrate this error, and rectify it by presenting a different analysis that ensures convergence (see §LABEL:sec.median).
2 The SDivergence
We proceed now to formally introduce the S-divergence. We follow the viewpoint of Bregman divergences. Consider, thus, a differentiable strictly convex function $f : \mathbb{R} \to \mathbb{R}$; then, $f(x) \ge f(y) + f'(y)(x - y)$, with equality if and only if $x = y$. The difference between the two sides of this inequality defines the Bregman divergence\footnote{Bregman divergences over scalars and vectors have been well-studied; see e.g., [18, 4]. They are called divergences because they are not distances (though they often behave like squared distances, in a sense that can be made precise for certain choices of $f$ [22]).}
\[
D_f(x, y) := f(x) - f(y) - f'(y)(x - y). \tag{2.1}\label{eq.breg.scalar}
\]
The scalar divergence (LABEL:eq.breg.scalar) can be extended to Hermitian matrices.
Proposition 2.1
Let $f : \mathbb{R} \to \mathbb{R}$ be differentiable and strictly convex; let $X, Y \in \mathbb{H}_n$ be arbitrary. Then, we have the matrix Bregman divergence:
\[
D_f(X, Y) := \operatorname{tr}\bigl(f(X) - f(Y) - f'(Y)(X - Y)\bigr). \tag{2.2}\label{eq.breg}
\]
By construction, $D_f(X, Y)$ is nonnegative, strictly convex in $X$, and zero if and only if $X = Y$. It is typically asymmetric, and may be viewed as a measure of dissimilarity.
Example 2.2
Let $f(x) = x^2$. Then $f'(x) = 2x$, with which (LABEL:eq.breg) yields the squared Frobenius norm
\[
D_f(X, Y) = \operatorname{tr}\bigl(X^2 - Y^2 - 2Y(X - Y)\bigr) = \|X - Y\|_F^2.
\]
If $f(x) = x\log x$ on $(0, \infty)$, then $f'(x) = 1 + \log x$, and (LABEL:eq.breg) yields the (unnormalized) von Neumann divergence of quantum information theory [45]:
\[
D_{vN}(X, Y) = \operatorname{tr}\bigl(X\log X - X\log Y - X + Y\bigr).
\]
For $f(x) = -\log x$ on $(0, \infty)$, $f'(x) = -1/x$, and we obtain the divergence
\[
D_{\ell d}(X, Y) = \operatorname{tr}(XY^{-1}) - \log\det(XY^{-1}) - n,
\]
which is known as the LogDet divergence [37], or more classically as Stein’s loss [50].
The divergence $D_{\ell d}$ is of key importance to our paper, so we mention some additional noteworthy contexts where it occurs: (i) information theory [26], as the relative entropy between two multivariate Gaussians with the same mean; (ii) optimization, when deriving the famous Broyden-Fletcher-Goldfarb-Shanno (BFGS) updates [47]; (iii) matrix divergence theory [5, 28, 46]; (iv) kernel learning [37].
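As a small illustration (our own sketch, not taken from the references above), the LogDet divergence and its asymmetry are easy to compute:

```python
import numpy as np

def logdet_divergence(X, Y):
    # Stein's loss: D_ld(X, Y) = tr(X Y^{-1}) - log det(X Y^{-1}) - n
    n = X.shape[0]
    M = X @ np.linalg.inv(Y)
    return np.trace(M) - np.linalg.slogdet(M)[1] - n

X = np.diag([2.0, 1.0]); Y = np.eye(2)
assert abs(logdet_divergence(X, X)) < 1e-12                    # zero at X = Y
assert logdet_divergence(X, Y) != logdet_divergence(Y, X)      # asymmetric
```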
Despite the broad applicability of Bregman divergences, their asymmetry is sometimes undesirable. This drawback has led us to consider symmetric divergences, among which the most popular is the “Jensen-Bregman” divergence\footnote{This symmetrization has been largely studied only for divergences over scalars or vectors.}
\[
S_f(X, Y) := \tfrac{1}{2} D_f\bigl(X, \tfrac{X+Y}{2}\bigr) + \tfrac{1}{2} D_f\bigl(Y, \tfrac{X+Y}{2}\bigr). \tag{2.3}\label{eq.js}
\]
This divergence has two attractive and perhaps more useful representations:
\[
S_f(X, Y) = \operatorname{tr}\Bigl(\frac{f(X) + f(Y)}{2} - f\Bigl(\frac{X+Y}{2}\Bigr)\Bigr), \tag{2.4}\label{eq.js1}
\]
\[
S_f(X, Y) = \min_{Z \succ 0}\ \tfrac{1}{2}\bigl[D_f(X, Z) + D_f(Y, Z)\bigr]. \tag{2.5}\label{eq.js2}
\]
Compare (LABEL:eq.breg) with (LABEL:eq.js1): both formulas define a divergence as a departure from linearity; the former uses derivatives, while the latter is stated using midpoint convexity. Representation (LABEL:eq.js1) has an advantage over (LABEL:eq.breg), (LABEL:eq.js), and (LABEL:eq.js2), in that it does not need to assume differentiability of $f$.
The reader must have realized by now that the S-divergence (LABEL:eq.ss) is nothing but the symmetrized divergence (LABEL:eq.js) generated by $f(x) = -\log x$. Alternatively, the S-divergence may be essentially viewed as the Jensen-Bregman divergence between two multivariate Gaussians [26], or as the Bhattacharyya distance between them [12].
Let us now list a few basic properties of .
Proposition 2.3
Let be the vector of eigenvalues of , and be a diagonal matrix with as its diagonal. Let . Then,

;

, where , ;

;

; and

.
Proof. (i) follows from the equality .
(ii) follows upon observing that
(iii) follows upon noting
(iv) follows as , and .
(v) is trivial since . \qedsymbol
The most useful corollary to Prop. LABEL:prop.basic is the congruence invariance of $S$.
Corollary 2.4
Let $X, Y \in \mathbb{P}_n$, and let $P$ be any invertible matrix. Then,
\[
S(P^* X P, P^* Y P) = S(X, Y).
\]
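Congruence invariance is easy to confirm numerically; the following sketch (ours) checks $S(P^* X P, P^* Y P) = S(X, Y)$ for a random invertible $P$:

```python
import numpy as np

def s_divergence(X, Y):
    # S(X, Y) = log det((X + Y)/2) - (1/2) log det(XY), via slogdet
    ld = lambda A: np.linalg.slogdet(A)[1]
    return ld((X + Y) / 2) - 0.5 * (ld(X) + ld(Y))

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5)); X = A @ A.T + 5 * np.eye(5)
B = rng.standard_normal((5, 5)); Y = B @ B.T + 5 * np.eye(5)
P = rng.standard_normal((5, 5))            # invertible with probability 1
assert abs(s_divergence(P.T @ X @ P, P.T @ Y @ P) - s_divergence(X, Y)) < 1e-6
```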
The next result reaffirms that is a divergence, while showing that it enjoys some limited convexity and concavity.
Proposition 2.5

Let $X, Y \in \mathbb{P}_n$. Then, (i) $S(X, Y) \ge 0$, with equality if and only if $X = Y$; (ii) for fixed $Y$, $S(X, Y)$ is convex in $X$ for $X \preceq (1 + \sqrt{2})Y$, while for $X \succeq (1 + \sqrt{2})Y$, it is concave.

Proof. Since $S$ is a sum of Bregman divergences, property (i) follows from definition (LABEL:eq.js). Alternatively, note that $\log\det\bigl(\tfrac{X+Y}{2}\bigr) \ge \tfrac{1}{2}\log\det X + \tfrac{1}{2}\log\det Y$ by the strict concavity of $\log\det$, with equality if and only if $X = Y$. Part (ii) follows upon analyzing the Hessian $\nabla_X^2 S(X, Y)$. This Hessian can be identified with the matrix
\[
M = \tfrac{1}{2} X^{-1} \otimes X^{-1} - (X+Y)^{-1} \otimes (X+Y)^{-1}, \tag{2.6}
\]
where $\otimes$ is the usual Kronecker product. Matrix $M$ is positive definite for $X \prec (1 + \sqrt{2})Y$ and negative definite for $X \succ (1 + \sqrt{2})Y$, which proves (ii). \qedsymbol
Below we show that $S$ is richer than a divergence: its square root $\sqrt{S}$ is actually a distance on $\mathbb{P}_n$. This is the first main result of our paper. Previous authors [24, 21] conjectured this result but could not establish it, perhaps because both ultimately sought to map $S$ to a Hilbert space metric. This approach fails because HPD matrices do not form even a (multiplicative) semigroup, which renders the powerful theory of harmonic analysis on semigroups [7] inapplicable to $S$. This difficulty necessitates a different path to proving metricity of $\sqrt{S}$, and this is the subject of the next section.
3 The metric
In this section we prove the following main theorem.
Theorem 3.1
Let $\delta_S := \sqrt{S}$, with $S$ as defined by (LABEL:eq.ss). Then, $\delta_S$ is a metric on $\mathbb{P}_n$.
The proof of Theorem LABEL:thm.metric depends on several results, which we first establish.\footnote{We note that a referee wondered whether the “metrization” results of [22, 23] could yield an alternative proof of Thm. LABEL:thm.metric. Unfortunately, those results rely very heavily on the commutativity and total order available on $\mathbb{R}$, both of which are missing in $\mathbb{P}_n$. Indeed, for the commutative case, Lemma LABEL:lem.cpd alone suffices to prove that $\delta_S$ is a metric.}
Definition 3.2 ([7, Def. 1.1])
Let $\mathcal{X}$ be a nonempty set. A function $\psi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is said to be negative definite if $\psi(x, y) = \psi(y, x)$ for all $x, y \in \mathcal{X}$, and the inequality
\[
\sum\nolimits_{i,j=1}^n c_i c_j\, \psi(x_i, x_j) \le 0
\]
holds for all integers $n \ge 2$, and subsets $\{x_i\}_{i=1}^n \subseteq \mathcal{X}$, $\{c_i\}_{i=1}^n \subseteq \mathbb{R}$ with $\sum\nolimits_i c_i = 0$.
Theorem 3.3 ([7, Prop. 3.2, Ch. 3])

Let $\psi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be negative definite, with $\psi(x, x) = 0$ for all $x \in \mathcal{X}$. Then, there is a Hilbert space $\mathcal{H}$ and a mapping $\varphi : \mathcal{X} \to \mathcal{H}$ such that one has the relation
\[
\psi(x, y) = \|\varphi(x) - \varphi(y)\|^2. \tag{3.1}
\]
Moreover, negative definiteness of $\psi$ is necessary for such a mapping to exist.
Theorem LABEL:thm.schoen helps prove the triangle inequality for the scalar case.
Lemma 3.4

Define the scalar version of $\delta_S$ as
\[
\delta_s(x, y) := \sqrt{\log\Bigl(\frac{x+y}{2}\Bigr) - \frac{\log x + \log y}{2}}, \qquad x, y > 0.
\]
Then, $\delta_s$ satisfies the triangle inequality, i.e.,
\[
\delta_s(x, y) \le \delta_s(x, z) + \delta_s(z, y) \qquad \text{for all } x, y, z > 0. \tag{3.2}\label{eq.11}
\]
Proof. We show that $\delta_s^2$ is negative definite. Since $\delta_s^2(x, x) = 0$, Theorem LABEL:thm.schoen then immediately implies the triangle inequality (LABEL:eq.11). To prove that $\delta_s^2$ is negative definite, by [7, Thm. 2.2, Ch. 3] we may equivalently show that $e^{-\beta \delta_s^2(x, y)}$ is a positive definite function for $\beta > 0$, and $x, y > 0$. Since
\[
e^{-\beta \delta_s^2(x, y)} = (xy)^{\beta/2}\Bigl(\frac{x+y}{2}\Bigr)^{-\beta},
\]
and scaling by the positive diagonal matrix $\operatorname{Diag}(x_i^{\beta/2})$ preserves positive definiteness, it suffices to show that the matrix
\[
H = \bigl[(x_i + x_j)^{-\beta}\bigr]_{i,j=1}^m
\]
is HPD for every integer $m \ge 1$, and positive numbers $x_1, \ldots, x_m$. Now, observe that
\[
(x_i + x_j)^{-\beta} = \frac{1}{\Gamma(\beta)} \int_0^\infty e^{-t(x_i + x_j)}\, t^{\beta - 1}\, dt, \tag{3.3}
\]
where $\Gamma$ is the Gamma function. Thus, with $f_i(t) := e^{-t x_i}\, t^{(\beta-1)/2} / \sqrt{\Gamma(\beta)}$, we see that $H$ equals the Gram matrix $\bigl[\langle f_i, f_j \rangle_{L^2(0,\infty)}\bigr]$, whereby $H \succeq 0$. \qedsymbol Using Lemma LABEL:lem.cpd we obtain the following “Minkowski” inequality for $\delta_s$.
Corollary 3.5

Let $x, y, z \in \mathbb{R}_{++}^n$; and let $p \ge 1$. Then,
\[
\Bigl(\sum\nolimits_{i} \delta_s(x_i, y_i)^p\Bigr)^{1/p} \le \Bigl(\sum\nolimits_{i} \delta_s(x_i, z_i)^p\Bigr)^{1/p} + \Bigl(\sum\nolimits_{i} \delta_s(z_i, y_i)^p\Bigr)^{1/p}. \tag{3.4}\label{eq.36}
\]
Proof. Lemma LABEL:lem.cpd implies that for positive scalars $x_i$, $y_i$, and $z_i$, we have
\[
\delta_s(x_i, y_i) \le \delta_s(x_i, z_i) + \delta_s(z_i, y_i).
\]
Raise both sides to the $p$-th power, sum over $i$, and invoke Minkowski’s inequality to conclude (LABEL:eq.36). \qedsymbol
Theorem 3.6

Let $X, Y, Z \in \mathbb{P}_n$ be diagonal matrices. Then,
\[
\delta_S(X, Y) \le \delta_S(X, Z) + \delta_S(Z, Y). \tag{3.5}
\]

Proof. For diagonal matrices $X = \operatorname{Diag}(x)$ and $Y = \operatorname{Diag}(y)$, it is easy to verify that $\delta_S^2(X, Y) = \sum\nolimits_i \delta_s^2(x_i, y_i)$. Now invoke Corollary LABEL:corr.mink with $p = 2$. \qedsymbol
Next, we recall an important determinantal inequality for positive matrices.
Theorem 3.7 ([9, Exercise VI.7.2])

Let $A, B \in \mathbb{P}_n$. Let $\lambda^\downarrow(A)$ denote the vector of eigenvalues of $A$ sorted in decreasing order; define $\lambda^\uparrow(A)$ likewise. Then,
\[
\prod\nolimits_{i=1}^n \bigl(\lambda_i^\downarrow(A) + \lambda_i^\downarrow(B)\bigr) \le \det(A + B) \le \prod\nolimits_{i=1}^n \bigl(\lambda_i^\downarrow(A) + \lambda_i^\uparrow(B)\bigr). \tag{3.6}\label{eq.14}
\]
Corollary 3.8

Let $X, Y \in \mathbb{P}_n$. Let $D_X^\downarrow$ denote the diagonal matrix with $\lambda^\downarrow(X)$ as its diagonal; define $D_Y^\downarrow$ likewise. Then,
\[
\delta_S(D_X^\downarrow, D_Y^\downarrow) \le \delta_S(X, Y).
\]

Proof. Scale $X$ and $Y$ by 2, divide each term in (LABEL:eq.14) by $\sqrt{\det X \det Y}$, and note that $\sqrt{\det X \det Y}$ is invariant to permutations of the eigenvalues; take logarithms to conclude. \qedsymbol
The final result we need is well-known in linear algebra (we provide a proof).
Lemma 3.9

Let $A \in \mathbb{P}_n$, and let $B \in \mathbb{H}_n$. There is a matrix $P$ for which
\[
P^* A P = I \quad\text{and}\quad P^* B P = D, \tag{3.7}
\]
where $D$ is diagonal.

Proof. Let $A = A^{1/2} A^{1/2}$, and define $C := A^{-1/2} B A^{-1/2}$. The matrix $C$ is Hermitian; so let the unitary $U$ diagonalize it to $U^* C U = D$. Set $P = A^{-1/2} U$, to obtain
\[
P^* A P = U^* A^{-1/2} A A^{-1/2} U = I, \qquad P^* B P = U^* A^{-1/2} B A^{-1/2} U = D,
\]
and by construction, it follows that $D$ is diagonal. \qedsymbol
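The constructive proof translates directly into code; the sketch below (ours, for real symmetric inputs) follows the lemma step by step:

```python
import numpy as np

def simultaneous_congruence(A, B):
    # Find P with P^* A P = I and P^* B P = D (diagonal), per the lemma:
    # whiten A, then eigendecompose the whitened B.
    lam, U = np.linalg.eigh(A)
    A_inv_sqrt = U @ np.diag(lam ** -0.5) @ U.T
    C = A_inv_sqrt @ B @ A_inv_sqrt            # Hermitian
    d, V = np.linalg.eigh(C)
    P = A_inv_sqrt @ V
    return P, np.diag(d)

rng = np.random.default_rng(5)
G = rng.standard_normal((4, 4)); A = G @ G.T + 4 * np.eye(4)   # positive definite
B = rng.standard_normal((4, 4)); B = (B + B.T) / 2             # merely Hermitian
P, D = simultaneous_congruence(A, B)
assert np.allclose(P.T @ A @ P, np.eye(4), atol=1e-8)
assert np.allclose(P.T @ B @ P, D, atol=1e-8)
```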
Accoutered with the above results, we can finally prove Theorem LABEL:thm.metric.
Proof. (Theorem LABEL:thm.metric). We need to show that $\delta_S$ is symmetric, nonnegative, definite, and that it satisfies the triangle inequality. Symmetry is obvious. Nonnegativity and definiteness were shown in Prop. LABEL:prop.scvx. The only hard part is to prove the triangle inequality, a result that has eluded previous attempts [24, 21].
Let $X, Y, Z \in \mathbb{P}_n$ be arbitrary. From Lemma LABEL:lem.diag2 we know that there is a matrix $P$ such that $P^* X P = D$ is diagonal and $P^* Y P = I$. Since $Z$ is arbitrary, and congruence preserves positive definiteness, we may write just $Z$ instead of $P^* Z P$. Also, since $\delta_S(X, Y) = \delta_S(P^* X P, P^* Y P)$ (see Prop. LABEL:prop.basic), proving the triangle inequality reduces to showing that
\[
\delta_S(D, I) \le \delta_S(D, Z) + \delta_S(Z, I). \tag{3.8}\label{eq.35}
\]
Consider now the diagonal matrices $D^\downarrow$ and $D_Z^\downarrow$, formed from the decreasingly sorted eigenvalues of $D$ and $Z$, respectively. Theorem LABEL:thm.diag asserts
\[
\delta_S(D^\downarrow, I) \le \delta_S(D^\downarrow, D_Z^\downarrow) + \delta_S(D_Z^\downarrow, I). \tag{3.9}
\]
Prop. LABEL:prop.basic(i) implies that $\delta_S(D, I) = \delta_S(D^\downarrow, I)$ and $\delta_S(Z, I) = \delta_S(D_Z^\downarrow, I)$, while Corollary LABEL:cor.eigs shows that $\delta_S(D^\downarrow, D_Z^\downarrow) \le \delta_S(D, Z)$. Combining these inequalities, we immediately obtain (LABEL:eq.35). \qedsymbol
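Though no substitute for the proof, the triangle inequality is easy to stress-test on random HPD triples (our own sketch):

```python
import numpy as np

def delta_s(X, Y):
    # delta_S = sqrt(S), with S(X, Y) = log det((X + Y)/2) - (1/2) log det(XY)
    ld = lambda A: np.linalg.slogdet(A)[1]
    return np.sqrt(ld((X + Y) / 2) - 0.5 * (ld(X) + ld(Y)))

rng = np.random.default_rng(3)
def rand_hpd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

for _ in range(100):
    X, Y, Z = (rand_hpd(4) for _ in range(3))
    assert delta_s(X, Z) <= delta_s(X, Y) + delta_s(Y, Z) + 1e-10
```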
We now turn our attention to a connection of importance to machine learning and approximation theory: kernel functions related to $S$. Indeed, some of these connections (e.g., Theorem LABEL:thm.wallach) have already proved useful in computer vision [31].
3.1 Hilbert space embedding
Since $\delta_S$ is a metric, and Lemma LABEL:lem.cpd shows that for scalars $\delta_s$ embeds isometrically into a Hilbert space, one may ask if $(\mathbb{P}_n, \delta_S)$ also admits such an embedding. But as mentioned previously, it is the lack of such an embedding that necessitated a different route to metricity. Let us look more carefully at what goes wrong, and what kind of Hilbert space embeddability $\delta_S$ does admit.
Theorem LABEL:thm.schoen implies that a Hilbert space embedding exists if and only if $\delta_S^2$ is a negative definite kernel; equivalently, if and only if the map (cf. Lemma LABEL:lem.cpd)
\[
(X, Y) \mapsto e^{-\beta \delta_S^2(X, Y)}
\]
is a positive definite kernel for all $\beta > 0$. It suffices to check whether the matrix
\[
K = \bigl[e^{-\beta \delta_S^2(X_i, X_j)}\bigr]_{i,j=1}^m \tag{3.10}\label{eq.60}
\]
is positive definite for every integer $m$ and arbitrary HPD matrices $X_1, \ldots, X_m$.
Unfortunately, a quick numerical experiment reveals that can be indefinite. A counterexample is given by the following positive definite matrices (, )
\hb@xt@.01(3.11) 
and by setting , for which . This counterexample destroys hopes of embedding the metric space isometrically into a Hilbert space.
Although matrix (LABEL:eq.60) is not HPD in general, we might ask: for what choices of $\beta$ is $K$ HPD? Theorem LABEL:thm.wallach answers this question for $K$ formed from symmetric real positive definite matrices, and characterizes the values of $\beta$ necessary and sufficient for $K$ to be positive definite. We note here that a special case was essentially treated in [27], in the context of semigroup kernels on measures.
Theorem 3.10

Let $X_1, \ldots, X_m$ be real symmetric matrices in $\mathbb{P}_n$. The matrix $K$ defined by (LABEL:eq.60) is positive definite, if and only if $\beta$ satisfies
\[
\beta \in \Bigl\{\frac{j}{2} : j \in \mathbb{N},\ 1 \le j \le n-1\Bigr\} \cup \Bigl[\frac{n-1}{2}, \infty\Bigr). \tag{3.12}\label{eq.62}
\]
Proof. We first prove the “if” part. Recall, therefore, the Gaussian integral
\[
\int_{\mathbb{R}^n} e^{-x^T A x}\, dx = \pi^{n/2} \det(A)^{-1/2}, \qquad A \in \mathbb{P}_n.
\]
Define the map $\varphi_i(x) := e^{-x^T X_i x}$, where the inner product is given by
\[
\langle \varphi_i, \varphi_j \rangle := \int_{\mathbb{R}^n} \varphi_i(x)\varphi_j(x)\, dx = \pi^{n/2} \det(X_i + X_j)^{-1/2}.
\]
Thus, it follows that $\bigl[\det(X_i + X_j)^{-1/2}\bigr]$ is a Gram matrix, so $K$ is positive for $\beta = 1/2$ (diagonal scaling by the factors $\det(X_i)^{\beta/2}$ preserves positive definiteness). The Schur product theorem says that the elementwise product of two positive matrices is again positive. So, in particular, $K$ is positive whenever $\beta$ is an integer multiple of $1/2$. To extend the result to all $\beta$ covered by (LABEL:eq.62), we invoke another integral representation: the multivariate Gamma function, defined as [42, §2.1.2]
\[
\Gamma_n(\beta) := \int_{A \succ 0} e^{-\operatorname{tr}(A)} \det(A)^{\beta - (n+1)/2}\, dA, \qquad \beta > \tfrac{n-1}{2}. \tag{3.13}
\]
Let $\varphi_i(A) := c\, e^{-\operatorname{tr}(A X_i)} \det(A)^{\beta/2 - (n+1)/4}$, for some constant $c$; then, compute the inner product
\[
\langle \varphi_i, \varphi_j \rangle = c^2 \int_{A \succ 0} e^{-\operatorname{tr}(A(X_i + X_j))} \det(A)^{\beta - (n+1)/2}\, dA = c^2\, \Gamma_n(\beta)\, \det(X_i + X_j)^{-\beta},
\]
which exists if $\beta > \tfrac{n-1}{2}$. Thus, $K$ is positive definite for all $\beta$ defined by (LABEL:eq.62).
The converse is a deeper result grounded in the theory of symmetric cones. Specifically, since $\mathbb{P}_n$ is a symmetric cone, and the map $A \mapsto \det(A)^{-\beta}$ is decreasing on this cone, an appeal to [30, Thm. VII.3.1] yields the “only if” part of our claim.
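As a numerical sanity check of the “if” direction (our own sketch): for real SPD matrices in $\mathbb{P}_2$, the value $\beta = 1/2$ is admissible, so the kernel matrix built from it should be positive semidefinite:

```python
import numpy as np

def s_divergence(X, Y):
    ld = lambda A: np.linalg.slogdet(A)[1]
    return ld((X + Y) / 2) - 0.5 * (ld(X) + ld(Y))

rng = np.random.default_rng(4)
mats = []
for _ in range(10):
    A = rng.standard_normal((2, 2))
    mats.append(A @ A.T + 0.5 * np.eye(2))     # random SPD matrices in P_2

beta = 0.5   # admissible for n = 2 by the theorem's characterization
K = np.array([[np.exp(-beta * s_divergence(Xi, Xj)) for Xj in mats]
              for Xi in mats])
assert np.linalg.eigvalsh(K).min() > -1e-10    # PSD up to roundoff
```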
Remark 3.11

Let $\mathcal{X} \subseteq \mathbb{P}_n$ be a set of HPD matrices that commute with each other. Then, $(\mathcal{X}, \delta_S)$ can be isometrically embedded into a Hilbert space. This claim follows because a commuting set of matrices can be simultaneously diagonalized, and for diagonal matrices $X = \operatorname{Diag}(x)$ and $Y = \operatorname{Diag}(y)$, $\delta_S^2(X, Y) = \sum\nolimits_i \delta_s^2(x_i, y_i)$, which is a nonnegative sum of negative definite kernels and is therefore itself negative definite.
Theorem LABEL:thm.wallach shows that $e^{-\beta \delta_S^2}$ is not a positive definite kernel for all $\beta > 0$, while Remark LABEL:rmk.commute mentions an extreme case for which it is always a positive definite kernel. This prompts us to pose the following problem.
4 Connections with
This section returns to our original motivation: using $S$ as an alternative to the Riemannian metric $\delta_R$. In particular, in this section we show a sequence of results that highlight similarities between $S$ and $\delta_R$. Table LABEL:tab.summ lists these to provide a quick summary. Thereafter, we develop the details.
Riemannian distance  Ref.  SDivergence  Ref. 

[10, Ch.6]  Prp.LABEL:prop.basic  
[10, Ch.6]  Prp.LABEL:prop.basic  
E  Prp.LABEL:prop.basic  
[10, Ex.6.5.4]  Th.LABEL:thm.contract  
Th.LABEL:thm.tucontract  Th.LABEL:thm.tucontract  
gconvex in  [10, 6.1.11]  gconvex in  Th.LABEL:thm.jointgc 
Cor. LABEL:corr.riem.compress  Th.LABEL:thm.compr  
E  Th.LABEL:thm.gm  
[10, Th.6.1.6]  Th.LABEL:thm.contract3  
[10, Th.6.1.2]  Th.LABEL:thm.cancel  
[10, §6.2.8]  Th.LABEL:thm.gm  
[14]  Th.LABEL:thm.contract2 
4.1 Geometric mean
We begin by studying an object that connects $S$ and $\delta_R$ most intimately: the matrix geometric mean (GM). For HPD matrices $A$ and $B$, the GM is denoted by $A \sharp B$, and is given by the formula
\[
A \sharp B := A^{1/2}\bigl(A^{-1/2} B A^{-1/2}\bigr)^{1/2} A^{1/2}. \tag{4.1}\label{eq.47}
\]
The GM (LABEL:eq.47) has numerous attractive properties—see for instance [1]—among which, the following variational characterization is very important [11]:
\[
A \sharp B = \operatorname*{argmin}_{X \succ 0}\ \bigl[\delta_R^2(X, A) + \delta_R^2(X, B)\bigr]. \tag{4.2}
\]
Surprisingly, the GM enjoys a similar variational characterization even with $S$ in place of $\delta_R^2$.
Theorem 4.1

Let $A, B \in \mathbb{P}_n$. Then,
\[
A \sharp B = \operatorname*{argmin}_{X \succ 0}\ h(X) := S(X, A) + S(X, B). \tag{4.3}
\]
Moreover, $A \sharp B$ is equidistant from $A$ and $B$, i.e., $\delta_S(A, A \sharp B) = \delta_S(A \sharp B, B)$.
Proof. If $A = B$, then clearly $X = A$ minimizes $h$. Assume, therefore, that $A \neq B$. Ignoring the constraint $X \succ 0$ for the moment, we see that any stationary point of $h$ must satisfy $\nabla h(X) = 0$. This condition translates into
\[
(X + A)^{-1} + (X + B)^{-1} - X^{-1} = 0. \tag{4.4}\label{eq.33}
\]
A short manipulation shows that (LABEL:eq.33) is equivalent to the Riccati equation $X A^{-1} X = B$, whose unique positive definite solution is the geometric mean $X = A \sharp B$ (see [10, Prop 1.2.13]).
Next, we show that this stationary point is a local minimum, not a local maximum or saddle point. To that end, we show that the Hessian is positive definite at the stationary point $X = A \sharp B$. The Hessian of $h$ is given by
\[
\nabla^2 h(X) = X^{-1} \otimes X^{-1} - (X+A)^{-1} \otimes (X+A)^{-1} - (X+B)^{-1} \otimes (X+B)^{-1}.
\]
Writing $P = (X+A)^{-1}$, and $Q = (X+B)^{-1}$, upon using (LABEL:eq.33) we obtain
\[
\nabla^2 h(X) = (P + Q) \otimes (P + Q) - P \otimes P - Q \otimes Q = P \otimes Q + Q \otimes P \succ 0.
\]
Thus, $A \sharp B$ is a strict local minimum of $h$. This local minimum is actually global, as (LABEL:eq.33) has a unique positive definite solution and $h(X) \to \infty$ both at the boundary of $\mathbb{P}_n$ and as $\|X\| \to \infty$.
To prove the equidistance, recall that $X = A \sharp B$ satisfies $B = X A^{-1} X$; then observe that
\[
S(X, B) = \log\det\Bigl(\frac{X + X A^{-1} X}{2}\Bigr) - \frac{1}{2}\log\det(X \cdot X A^{-1} X) = \log\det\Bigl(\frac{A + X}{2}\Bigr) - \frac{1}{2}\log\det(X A) = S(X, A),
\]
where the middle equality uses $\det\bigl(\tfrac{X + XA^{-1}X}{2}\bigr) = \det(XA^{-1})\det\bigl(\tfrac{A+X}{2}\bigr)$ and $\det(X \cdot XA^{-1}X) = \det(XA^{-1})^2 \det(XA)$. \qedsymbol
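Theorem LABEL:thm.gm is easy to probe numerically; the sketch below (ours) computes $A \sharp B$ via (LABEL:eq.47) and checks both the equidistance property and that nearby perturbations do not decrease the objective:

```python
import numpy as np

def mat_sqrt(A):
    # principal square root of a symmetric positive definite matrix
    lam, U = np.linalg.eigh(A)
    return U @ np.diag(np.sqrt(lam)) @ U.T

def geometric_mean(A, B):
    # A # B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}
    As = mat_sqrt(A)
    Ais = np.linalg.inv(As)
    return As @ mat_sqrt(Ais @ B @ Ais) @ As

def s_divergence(X, Y):
    ld = lambda M: np.linalg.slogdet(M)[1]
    return ld((X + Y) / 2) - 0.5 * (ld(X) + ld(Y))

rng = np.random.default_rng(0)
G1 = rng.standard_normal((3, 3)); A = G1 @ G1.T + 3 * np.eye(3)
G2 = rng.standard_normal((3, 3)); B = G2 @ G2.T + 3 * np.eye(3)
GM = geometric_mean(A, B)

# equidistance: S(A#B, A) = S(A#B, B)
assert abs(s_divergence(GM, A) - s_divergence(GM, B)) < 1e-8

# A#B minimizes h(X) = S(X,A) + S(X,B): nearby perturbations cannot do better
h = lambda X: s_divergence(X, A) + s_divergence(X, B)
for _ in range(20):
    E = rng.standard_normal((3, 3)); E = 1e-3 * (E + E.T)
    assert h(GM) <= h(GM + E) + 1e-12
```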
4.2 Geodesic convexity
The above derivation concludes optimality from first principles. It was originally driven by the fact that $\delta_S$ is not geodesically convex [48]. However, we recently realized that the squared divergence $S = \delta_S^2$ actually is geodesically convex. This realization leads to a more insightful proof of the uniqueness and optimality of the S-mean.
In fact even more is true: Theorem LABEL:thm.jointgc shows that is not only geodesically convex, it is jointly geodesically convex. Before proving Theorem LABEL:thm.jointgc, we recall two useful results; the first immediately implies the second^{7}^{7}7It is a minor curiosity to note that [41, Thm. 2] proved a mixedmean inequality for the matrix geometric and arithmetic means; Prop. LABEL:prop.mixedmean includes their result as a special case..
Theorem 4.2 ([36])

The GM of $A, B \in \mathbb{P}_n$ is given by the variational formula
\[
A \sharp B = \max\Bigl\{ X \in \mathbb{H}_n : \begin{bmatrix} A & X \\ X & B \end{bmatrix} \succeq 0 \Bigr\}.
\]

Proposition 4.3 (Joint concavity (see e.g. [36]))

Let $A, B, C, D \in \mathbb{P}_n$. Then,
\[
A \sharp B + C \sharp D \preceq (A + C) \sharp (B + D). \tag{4.5}
\]
Proof. From one application of Thm. LABEL:thm.andomax we have