Riemannian Newton’s method for joint diagonalization on the Stiefel manifoldwith application to ICA

# Riemannian Newton’s method for joint diagonalization on the Stiefel manifold with application to ICA

Hiroyuki Sato
Department of Management Science
Tokyo University of Science, Tokyo 162-8601, Japan
hsato@ms.kagu.tus.ac.jp
July 14, 2019
###### Abstract

Joint approximate diagonalization of non-commuting symmetric matrices is an important process in independent component analysis. It is known that this problem can be formulated as an optimization problem on the Stiefel manifold. Riemannian optimization techniques can be used to solve this optimization problem. Among the available techniques, this article provides Riemannian Newton’s method for the joint diagonalization problem, which has the quadratic convergence property. In particular, it is shown that the resultant Newton’s equation can be effectively solved by means of the Kronecker product and vec operator, which reduces the dimension of the equation. Numerical experiments are performed to show that the proposed method improves the accuracy of an approximate solution of the problem. The proposed method is applied to independent component analysis.

Keywords: Joint diagonalization; Riemannian optimization; Newton’s method; Stiefel manifold; Independent component analysis, Image Separation

## 1 Introduction

The joint diagonalization (JD) problem for real symmetric matrices is often considered on the orthogonal group . The problem is to find an orthogonal matrix that minimizes the sum of the squared off-diagonal elements, or equivalently, maximizes the sum of the squared diagonal elements of [18]. For more information regarding finding non-orthogonal matrices, see [20]. A solution to the JD problem is valuable for independent component analysis (ICA) and the blind source separation problem [2, 5, 6, 8, 17].

Several approaches have been proposed in the context of Jacobi methods [3, 5, 6] and Riemannian optimization [17, 19]. In [17], the JD problem is considered on the Stiefel manifold with . That is, the required matrix is a rectangular orthonormal matrix. The orthogonal group is a special case of the Stiefel manifold because .

Riemannian optimization refers to optimization on Riemannian manifolds. Unconstrained optimization methods in Euclidean space such as steepest descent, conjugate gradient, and Newton’s methods have been generalized to those on a Riemannian manifold [1, 9, 12, 16]. In [17], the Riemannian trust-region method is applied to the JD problem on the Stiefel manifold . With varying on with , minimizing the sum of the squared off-diagonal elements of is no longer equivalent to maximizing the sum of the squared diagonal elements. According to [17], the JD problem on the Stiefel manifold maximizes the sum of the squared diagonal elements of with .

This article deals with Newton’s method for the JD problem on the Stiefel manifold. The Hessian of the objective function is fundamental for deriving Newton’s equation, the key equation in Newton’s method. We have intensively examined the Hessian to efficiently solve Newton’s equation. In particular, we have used the Kronecker product, and vec and veck operators to reduce the dimension of Newton’s equation and to transform the equation into the form “”.

This paper is organized as follows. We introduce the JD problem on the Stiefel manifold in Section 2, following on from [17]. We also present a brief review of computing the gradient and the Hessian of the objective function. In Section 3, we consider Newton’s equation, which is directly obtained by substituting the gradient and Hessian formulas in Section 2 into . To derive the representation matrix formula of the Hessian of the objective function, we use the Kronecker product, and vec and veck operators. This results in a smaller equation that is easier to solve. It should be noted that the dimension of the resultant equation is equal to the dimension of the Stiefel manifold in question. This means that the equation can be efficiently solved. Section 4 provides information about the two types of numerical experiments we use to evaluate our method. The first experiment is an application to ICA. We demonstrate that the proposed method improves the accuracy when compared with the solution generated by a Jacobi-like method, which is an existing method for the JD problem. The other experiment uses a larger problem to show that sequences generated by the proposed Newton’s method converge quadratically. Section 5 contains our concluding remarks. In these sections, the Stiefel manifold is endowed with the induced metric from the natural inner product in ambient Euclidean space. In contrast to this, we derive another formula for the representation matrix of the Hessian in Appendix A, in which the Stiefel manifold is endowed with the canonical metric.

## 2 Joint diagonalization problem on the Stiefel manifold

### 2.1 Joint diagonalization problem

Let be real symmetric matrices. We consider the following JD problem on the Stiefel manifold [17]:

###### Problem
 minimize f(Y)=−N∑l=1∥diag(YTAlY)∥2F, (2.1) subjectto Y∈St(p,n), (2.2)

where with , denotes the Frobenius norm, and denotes the diagonal part of the matrix.

We wish to apply Newton’s method to Problem 2.1, so the Hessian of is fundamental. To derive and analyze the Hessian and other requisites, we first review the geometry of .

### 2.2 The geometry of the Stiefel manifold

In this subsection, we review the geometry of the Stiefel manifold , as discussed in [1, 9].

The tangent space of at is

 TYSt(p,n)={ξ∈Rn×p|ξTY+YTξ=0}. (2.3)

In later sections, we will make full use of the equivalent form [9]

 TYSt(p,n)={YB+Y⊥C|B∈Skew(p), C∈R(n−p)×p}, (2.4)

rather than Eq. (2.3), where is an arbitrary matrix that satisfies and , and denotes the set of all skew-symmetric matrices. We here note that

 dim(St(p,n))=p(p−1)2+p(n−p)=dim(Skew(p))+dim(R(n−p)×p), (2.5)

which is an important relation for rewriting Newton’s equation into a system of linear equations.

Since is a submanifold of the matrix Euclidean space , it can be endowed with the Riemannian metric

 ⟨ξ1,ξ2⟩Y:=tr(ξT1ξ2),ξ1,ξ2∈TYSt(p,n), (2.6)

which is induced from the natural inner product in . We view as a Riemannian submanifold of with the above metric. Under this metric, the orthogonal projection at onto is expressed as

 PY(W)=W−Ysym(YTW),Y∈St(p,n), W∈Rn×p, (2.7)

where denotes the symmetric part of the matrix.

In optimization algorithms on the Euclidean space, the line search is performed after computing the search direction. In Riemannian optimization, the concept of a straight line is replaced with a curve (not necessarily geodesic) on a general Riemannian manifold. A retraction on the manifold in question is needed to implement Riemannian optimization algorithms [1]. It defines an appropriate curve for searching a next iterate point on the manifold. We will use the QR retraction on the Stiefel manifold [1] defined as

 RY(ξ)=qf(Y+ξ),Y∈St(p,n), ξ∈TYSt(p,n), (2.8)

where denotes the Q factor of the QR decomposition of the matrix. In other words, if a full-rank matrix is uniquely decomposed into , where and is a upper triangular matrix with strictly positive diagonal entries, then .

### 2.3 The gradient and the Hessian of the objective function

We now return to Problem 2.1. We need the gradient, , and Hessian, , of the objective function, , on to describe Newton’s equation for the problem. Newton’s equation is defined at each as , where is an unknown tangent vector. Expressions for the gradient and Hessian can also be found in [17]. In this subsection, we briefly describe how to compute them.

Let be the function on defined in the same way as the right-hand side of (2.1). We note that is the restriction of to . Since we view the Stiefel manifold, , as a submanifold of the Euclidean space , the gradient at can be expressed as

where is the Euclidean gradient of on . Furthermore, the Hessian at acts on as

Therefore, we only have to compute and .

We use the relation

 ∥diag(YTAlY)∥2F=tr(diag(YTAlY)2)=tr(YTAlYdiag(YTAlY)) (2.11)

to compute the Frechét derivative with as

 D¯f(Y)[η]=−4N∑l=1ηTAlYdiag(YTAlY), (2.12)

so that

The gradient of on can be obtained by Eq. (2.9) and (2.13) as

We can also express as

where is easily obtained from Eq. (2.13) as

We here note that . It follows from Eqs. (2.10) and (2.15) that

where we have used and in the first equality.

## 3 Newton’s method for the joint diagonalization problem on the Stiefel manifold

### 3.1 The Hessian of the objective function as a linear transformation on Skew(p)×R(n−p)×p

In Newton’s method for minimizing an objective function, , on a general Riemannian manifold, , the search direction at the current point is computed as a solution to Newton’s equation, . Since we have already obtained the matrix expressions of and , Newton’s equation for Problem 2.1 at is written as

 −4N∑l=1PY(Alξdiag(YTAlY)+2AlYdiag(YTAlξ)−ξsym(YTAlYdiag(YTAlY))) =4N∑l=1(AlYdiag(YTAlY)−Ysym(YTAlYdiag(YTAlY))). (3.1)

This must be solved for , with given. Because is in , must satisfy . Equation (3.1) appears too difficult to solve. This is because Eq. (3.1) is complicated, and is an matrix with independent variables.

To overcome these difficulties, we wish to obtain the representation matrix of as a linear transformation on , for an arbitrarily fixed . To this end, we identify as and view as a ()-dimensional vector. This can be done using the form in Eq. (2.4). We arbitrarily fix a that satisfies and . Such a can be computed by applying the Gram-Schmidt orthonormalization process to linearly independent column vectors of the matrix (see Algorithm 3.1). Then, can be expressed as

 ξ=YB+Y⊥C,B∈Skew(p), C∈R(n−p)×p. (3.2)

can also be written as

 Hessf(Y)[ξ]=YBH+Y⊥CH,BH∈Skew(p), CH∈R(n−p)×p, (3.3)

and we can write out and using and .

###### Proposition

Let and satisfy . If a tangent vector is expressed as (3.2), then the Hessian of the objective function (2.1) acts on as with

 BH=−4N∑l=1skew((ZlB+Z⊥lC)diag(Zl)+2Zldiag(ZlB+Z⊥lC)−Bsym(Zldiag(Zl))), (3.4)

and

 CH=−4N∑l=1(((Z⊥l)TB+Z⊥⊥lC)diag(Zl)+2(Z⊥l)Tdiag(ZlB+Z⊥lC)−Csym(Zldiag(Zl))), (3.5)

where we have defined , , and , and where denotes the skew-symmetric part of the matrix.

###### Proof

Multiplying Eq. (3.3) by from the left yields that

 BH=YTHessf(Y)[ξ] = −4N∑l=1YTPY(Alξdiag(YTAlY)+2AlYdiag(YTAlξ)−ξsym(YTAlYdiag(YTAlY))) = −4N∑l=1skew(YT(Alξdiag(YTAlY)+2AlYdiag(YTAlξ)−ξsym(YTAlYdiag(YTAlY)))) = −4N∑l=1skew((ZlB+Z⊥lC)diag(Zl)+2Zldiag(ZlB+Z⊥lC)−Bsym(Zldiag(Zl))). (3.6)

Similarly, we multiply (3.3) by from the left to obtain

 CH=YT⊥Hessf(Y)[ξ] = −4N∑l=1YT⊥PY(Alξdiag(YTAlY)+2AlYdiag(YTAlξ)−ξsym(YTAlYdiag(YTAlY))) = −4N∑l=1YT⊥(Alξdiag(YTAlY)+2AlYdiag(YTAlξ)−ξsym(YTAlYdiag(YTAlY))) = −4N∑l=1(((Z⊥l)TB+Z⊥⊥lC)diag(Zl)+2(Z⊥l)Tdiag(ZlB+Z⊥lC)−Csym(Zldiag(Zl))). (3.7)

This completes the proof.

### 3.2 Kronecker product and the vec and veck operators

The vec operator and the Kronecker product are useful for rewriting a matrix equation. They can be used to transform the matrix into an unknown column vector [13, 15]. The vec operator acts on a matrix as

 vec(W)=(w11,…,wm1,w12,…,wm2,…,w1n,…,wmn)T. (3.8)

That is, is an -dimensional column vector obtained by stacking the columns of one underneath the other. The Kronecker product of and (denoted by ) is an matrix defined as

 U⊗V=⎛⎜ ⎜⎝u11V⋯u1nV⋮⋮um1V⋯umnV⎞⎟ ⎟⎠. (3.9)

The following useful properties of these operators are known.

• For , and ,

 vec(UVW)=(WT⊗U)vec(V). (3.10)
• There exists an permutation matrix such that

 vec(WT)=Tnvec(W),W∈Rn×n, (3.11)

where is given by

 Tn=n∑i,j=1E(n×n)ij⊗E(n×n)ji, (3.12)

and denotes the matrix that has -component equal to , and the others equal to .

Furthermore, we can easily derive the following properties.

• For ,

 vec(sym(W))=12(In2+Tn)vec(W),vec(skew(W))=12(In2−Tn)vec(W). (3.13)
• There exists an diagonal matrix such that

 vec(diag(W))=Δnvec(W),W∈Rn×n, (3.14)

where .

For in Eq. (3.2), is an appropriate vector expression of , because all the elements of are independent variables. On the other hand, for in Eq. (3.2), contains zeros stemmed from the diagonal elements of . These should be removed. In addition, contains duplicates of each independent variable, because the upper triangular part (excluding the diagonal) of is the negative of the lower triangular part. Therefore, we use the veck operator [11]. The veck operator acts on an skew-symmetric matrix as

 veck(S)=(s21,…,sn1,s32,…,sn2,…,sn,n−1)T. (3.15)

That is, is an -dimensional column vector obtained by stacking the columns of the lower triangular part of one underneath the other. There exists a matrix that only depends on (the size of ) such that

 vec(S)=Dnveck(S), (3.16)

where is given by

 Dn=∑n≥i>j≥1(E(n2×n(n−1)/2)n(j−1)+i, j(n−(j+1)/2)−n+i−E(n2×n(n−1)/2)n(i−1)+j, j(n−(j+1)/2)−n+i). (3.17)

We here note that Eq. (3.16) is valid only if is skew-symmetric. Because each column of has a and a , and because each row of has at most one nonzero element, we have . It follows that

 veck(S)=12DTnvec(S). (3.18)

### 3.3 Representation matrix of the Hessian and Newton’s equation

We have obtained all the requisites in the previous subsections. We regard the Hessian at as a linear transformation on that transforms a ()-vector, , into , where and . Our goal is to obtain the representation matrix of .

###### Proposition

Let be a linear transformation on which acts on with , as

 H(veck(B)vec(C))=(veck(BH)vec(CH)), (3.19)

where and are given as Eqs. (3.4) and (3.5). Then, the representation matrix of is given by

 HA=(H11H12H21H22), (3.20)

where

 H11=−DTp(Ip2−Tp)N∑l=1(diag(Zl)⊗Zl+2(Ip⊗Zl)Δp(Ip⊗Zl)−sym(Zldiag(Zl))⊗Ip)Dp, (3.21)
 H12=−DTp(Ip2−Tp)N∑l=1(diag(Zl)⊗Z⊥l+2(Ip⊗Zl)Δp(Ip⊗Z⊥l)), (3.22)
 H21=−4N∑l=1((diag(Zl)⊗(Z⊥l)T+2(Ip⊗(Z⊥l)T)Δp(Ip⊗Zl))Dp, (3.23)

and

 H22=−4N∑l=1(diag(Zl)⊗Z⊥⊥l+2(Ip⊗(Z⊥l)T)Δp(Ip⊗Z⊥l)−sym(Zldiag(Zl))⊗In−p). (3.24)

###### Proof

From Eqs. (3.4) and (3.5), and are calculated as follows:

 veck(BH) = 12DTpvec(−4N∑l=1skew((ZlB+Z⊥lC)diag(Zl)+2Zldiag(ZlB+Z⊥lC)−Bsym(Zldiag(Zl)))) = −2DTpN∑l=112(Ip2−Tp)vec((ZlB+Z⊥lC)diag(Zl)+2Zldiag(ZlB+Z⊥lC)−Bsym(Zldiag(Zl))) = −DTp(Ip2−Tp)N∑l=1((diag(Zl)⊗Zl)vec(B)+(diag(Zl)⊗Z⊥l)vec(C) +2(Ip⊗Zl)vec(diag(ZlB+Z⊥lC))−(sym(Zldiag(Zl))⊗Ip)vec(B)) = −DTp(Ip2−Tp)N∑l=1((diag(Zl)⊗Zl+2(Ip⊗Zl)Δp(Ip⊗Zl)−sym(Zldiag(Zl))⊗Ip)Dpveck(B) +(diag(Zl)⊗Z⊥l+2(Ip⊗Zl)Δp(Ip⊗Z⊥l))vec(C)), (3.25)

and

 vec(CH) = vec(−4N∑l=1(((Z⊥l)TB+Z⊥⊥lC)diag(Zl)+2(Z⊥l)Tdiag(ZlB+Z⊥lC)−Csym(Zldiag(Zl)))) = −4N∑l=1((diag(Zl)⊗(Z⊥l)T)vec(B)+(diag(Zl)⊗Z⊥⊥l)vec(C) +2(Ip⊗(Z⊥l)T)vec(diag(ZlB+Z⊥lC))−(sym(Zldiag(Zl))⊗In−p)vec(C)) = −4N∑l=1((diag(Zl)⊗(Z⊥l)T+2(Ip⊗(Z⊥l)T)Δp(Ip⊗Zl))Dpveck(B) +(diag(Zl)⊗Z⊥⊥l+2(Ip⊗(Z⊥l)T)Δp(Ip⊗Z⊥l)−sym(Zldiag(Zl))⊗In−p)vec(C)). (3.26)

Therefore, we obtain the linear relation

 (veck(BH)vec(CH))=HA(veck(B)vec(C)), (3.27)

where the representation matrix is given by Eqs. (3.20). This ends the proof.

Thus, Newton’s equation, , can be solved by the following method. We first note that Newton’s equation is equivalent to

Applying the veck operator to the first equation of Eq. (3.28), and the vec operator to the second, yields

 (3.29)

where with and . If is invertible, we can solve Eq. (3.29) as

 (3.30)

After we have obtained and , we can easily reshape and . Therefore, we can calculate the solution of Newton’s equation (3.1).

### 3.4 Newton’s method

When implementing Newton’s equation, we must compute the matrix in Eq. (3.20). If the block matrices of have some relationships, we may reduce the computational cost of computing .

The Hessian, , is symmetric with respect to the metric . However, the representation matrix is not symmetric. This is because, for and with and , we have

 (3.31)

so that the independent coordinates of and are counted twice. If we endowed with the canonical metric [9], the representation matrix would be symmetric (see Appendix A for more details).

Although the representation matrix with the induced metric is not symmetric, it does satisfy the following.

###### Proposition

The block matrices and defined by (3.21), (3.22), (3.23), and (3.24) satisfy

 H11=HT11,H21=2HT12,H22=HT22. (3.32)

###### Proof

For the function defined by (2.1) under the induced metric on , we have

 ⟨Hessf(Y)[ξ],η⟩Y=⟨Hessf(Y)[η],ξ⟩Y,ξ,η∈TYSt(p,n), (3.33)

because is symmetric with respect to the induced metric. Let and . It follows from Eqs. (3.27), (3.31), and (3.33) that

 (HA(veck(B1)vec(C1)))T(2Ip(p−1)/200Ip(n−p))(veck(B2)vec(C2)) = (HA(veck(B2)vec(C2)))T(2Ip(p−1)/200Ip(n−p))(veck(B1)vec(C1)). (3.34)

Now

 (3.35)

We have

 (2Ip(p−1)/200Ip(n−p