Approximations of Schatten Norms via Taylor Expansions

# Approximations of Schatten Norms via Taylor Expansions

## Abstract

In this paper we consider symmetric, positive semidefinite (SPSD) matrix and present two algorithms for computing the -Schatten norm . The first algorithm works for any SPSD matrix . The second algorithm works for non-singular SPSD matrices and runs in time that depends on , where is the -th eigenvalue of . Our methods are simple and easy to implement and can be extended to general matrices. Our algorithms improve, for a range of parameters, recent results of Musco, Netrapalli, Sidford, Ubaru and Woodruff (ITCS 2018.) and match the running time of the methods by Han, Malioutov, Avron, and Shin (SISC 2017) while avoiding computations of coefficients of Chebyshev polynomials.

## 1 Introduction

In many applications of data science and machine learning data is represented by large matrices. Fast and accurate analysis of such matrices is a challenging task that is of paramount importance for the aforementioned applications. Randomized numerical algebra (RNA) is an popular area of research that often provides such fast and accurate algorithmic methods for massive matrix computations. Many critical problems in RNA boil down to approximating spectral functions and one of the most fundamental examples of such spectral functions is Schatten norm. The -th Schatten norm for matrix is defined as the norm of a vector comprised of singular values of matrix , i.e.,

 ∥A∥p=⎛⎝min(n1,n2)∑i=1|σi(A)|p⎞⎠1/p,

where is the -th singular value of .

### 1.1 Our Results and Related Work

In this paper we consider symmetric, positive semidefinite (SPSD) matrix and present two algorithms for computing . The first algorithm in Section 3 works for any SPSD matrix . The second algorithm in Section 4 works for non-singular SPSD matrices and runs in time that depends on , where is the -th eigenvalue of . The table below summarizes our results. Our methods are simple and easy to implement and can be extended to general matrices. Indeed, to compute for matrix , one can apply our methods to SPSD matrix and note that . It is well known that for SPSD matrix , and thus, our algorithms provide multiplicative approximations for .

Musco, Netrapalli, Sidford, Ubaru and Woodruff [5] (see also the full version in [4]) provided a general approach for approximating spectral functions that works for general matrices. Table summarizes some of their results for Schatten norms for general matrices. Our result in Theorem 3.1 improves the bounds in [5, 4] for a range of parameters, for example, when and or when and .

In [2], Han, Malioutov, Avron, and Shin proposed to use general Chebyshev polynomials to approximate a wide class of functions and Schatten norms in particular. The methods in [2] work for invertible matrices and the running time depends on . Table summarizes their results for Schatten norms. Under the assumptions from [2] the running time in Remark 4.2 matches the running time of the algorithm in [2] for constant . In addition, our methods do not need to compute or store the coefficients of Chebyshev polynomials. A straightforward computation of Chebyshev coefficients for Schatten norm [2] requires time where1 . Thus, for a range of parameters, e.g. when , a straightforward computation of Chebyshev coefficients may be more expensive than computing the approximation of the Schatten norm in our paper.

Our work has been inspired by the approach of Boutsidis, Drineas, Kambadur, Kontopoulou, and Zouzias [1] that uses Taylor expansion for to approximate the log determinant. The coefficients of this expansion have the same sign and thus the Hutchinson estimator [3] can be applied to each partial sum of this expansion. This is not the case for the Taylor expansion of . The key idea of our approach is to find such a constant minus the partial sum of the first terms is positive. Thus, we can apply the Hutchinson estimator to the corresponding matrix polynomial.

In Section 1.3 we introduce necessary notations. Section 2 provides a Hutchinson estimator for a special matrix polynomial that will be used in our algorithms. Section 3 describes an algorithm for approximating that does not require knowledge of . Section 4 describes an algorithm for approximating that depends on . Section 5 contains necessary technical claims.

### 1.3 Notations

We use the following symbols: be the integer part of ,

 (pk)=p(p−1)…(p−k+1)k!,k≥2,   (p1)=p. (1)

We use to denote the natural logarithm and dedicate lower case Latin letters for constants and real variables and upper case Latin letters for matrices. Consider the Taylor expansion

 (1−x)p=1+∞∑k=1(pk)(−1)kxk,   −1

Denote

 h(x)=∞∑k=1(−1)k−1(pk)xk. (3)

It follows from that for :

 1−(1−x)p=h(x). (4)

Denote

 hm(x)=∞∑k=m+1(−1)k−1(pk)xk, (5)

and

 ~hm(x)=m∑k=1(−1)k−1(pk)xk. (6)

According to ,

 1−(1−x)p=h(x)=~hm(x)+hm(x). (7)

Denote by be the Hutchinson estimator [3] for the trace:

 Ht(X)=1tt∑i=1gTiXgi, (8)

where are i.i.d. vectors whose entries are independent Rademacher variables. Note that . We will be using the following well-known results.

###### Theorem 1.1.

(Roosta-Khorasani and Ascher [6], Theorem ) Let be an SPSD matrix. Let . Then, with probability at least :

 ∣∣Ht(Y)−tr Y∣∣≤ϵ tr Y. (9)
###### Lemma 1.2.

(Boutsidis, Drineas, Kambadur, Kontopoulou, and Zouzias [1], Lemma ) Let be a symmetric positive semidefinite matrix. There exists an algorithm that runs in time and outputs such that, with probability at least :

 16λ1(A)≤α≤λ1(A).

## 2 Hutchinson Estimator for ~hm(X)

The following simple algorithm computes .

###### Theorem 2.1.

Let be a matrix and let be an integer. Then the output of Algorithm 1 is and the running time of the algorithm is

###### Proof.

For denote and note that and , by . Fix and note that after the -th iteration we have

 uk=gTiXkgi,   Ski=gTi(k∑j=1(−1)j−1ajXj)gi.

Thus, Denote ; thus the algorithm outputs the estimator and the theorem follows. The bound on the running time follows from direct computations. ∎

## 3 Algorithm for estimates without κ

###### Theorem 3.1.

Let be an SPSD matrix and let be the output of Algorithm 2 on input . Then, with probability at least , we have:

 ∣∣∥A∥pp−y∣∣≤ϵ∥A∥pp.

If we use Algorithm 1 to compute and Lemma 1.2 to compute then the running time of Algorithm 2 is

 O([γ(p)ϵ2+1/pn1/pnnz(A)+nlogn]log1δ),

where

 γ(p)=c(p)p1/p(1+6p)2. (10)
###### Proof.

We have for , Let be a SPSD matrix with . Denote . Since we have that is also a SPSD matrix and

 Bp=In−h(C). (11)

According to Claim 5.4 the matrix is SPSD. So, we may apply estimator to this matrix. We also have

 Bp=In−h(C)=In−~hm(C)−hm(C)=[(βm+1)In−~hm(C)]−[hm(C)+βmIn].

Because ,

 ∣∣tr Bp−[(1+βm)n−Ht(~hm(C))]∣∣=∣∣tr Bp−Ht[(1+βm)In−~hm(C)]∣∣
 ≤∣∣tr[(βm+1)In−~hm(C)]−Ht[(βm+1)In−~hm(C)]∣∣+|tr(βmIn)|+|tr(hm(C))|=Δ1+Δ2+Δ3.

Below we will bound each separately.

### Bounding Δ1 and Δ2

According to Theorem 1.1 for we have with probability at least :

 Δ1≤ϵtr[(βm+1)In−~hm(C)]=ϵnβm+ϵtr[In−~hm(C)]
 =ϵnβm+ϵtr[In−h(C)]+ϵtrhm(C)=ϵnβm+ϵtr Bp+ϵtr hm(C).

Note that .

### Bounding Δ3

Denoting we have, using Claim 5.3:

 |tr hm(C)|=|n∑i=1hm(1−λi)|≤c(p)pn∑i=1(1−λi)m+1(m+1)−p≤c(p)pn(m+1)−p=nβm.

### The final bound for B

So,

 Δ1+Δ2+Δ3≤(2ϵ+1)nβm+ϵtr Bp≤3nβm+ϵtr Bp.

Thus, if , i.e., then

 ∣∣tr Bp−Ht[(1+βm)In−~hm(C)]∣∣≤ϵ+ϵtr Bp,

with probability at least

### Back to matrix A

Let be the constant from Lemma 1.2 and . Then and . Recall that and . So, from the above:

 ∣∣∥A∥pp−y∣∣=∣∣tr Ap−αp[(1+βm)n−Ht(~hm(In−α−1A))]∣∣≤αpϵtr Bp+αpϵ
 ≤ϵtr Ap+ϵ(6λ1(A))p≤ϵ(1+6p)tr Ap=ϵ(1+6p)∥A∥pp,

with probability at least We obtain the result by substituting with , with , and noting that . The bound on the running time follows from Theorem 2.1 and Lemma 1.2. ∎

## 4 Algorithm for estimates with κ

###### Theorem 4.1.

Let be an SPSD matrix such that and let be the output of Algorithm 3 on input . Then, with probability at least , we have:

 ∣∣∥A∥pp−y∣∣≤ϵ∥A∥pp.

If we use Algorithm 1 to compute and Lemma 1.2 to compute then the running time of Algorithm 3 is

 O((1ϵ2nnz(A)κ[plog(6κ)+log(1ϵ)+log(c(p)p)]+(n+nnz(A))logn)log1δ)
###### Remark 4.2.

In [2] the authors assume that they are given and . This assumption is stronger than our assumption that we only know . Under the stronger assumption of [2] we do not need to estimate using Lemma 1.2. Thus we don’t need to perform Step of Algorithm 3. Hence the running time becomes

 O(1ϵ2nnz(A)κ[plog(6κ)+log(1ϵ)+log(c(p)p)]log1δ),

which, for constant , is the same as in [2].

###### Proof.

(of Theorem 4.1) As in Theorem 3.1, is the constant from Lemma 1.2 and . Denote . We have and

 λi≥λn(A)α=λn(A)λ1(A)λ1(A)α>16κ,

so for . From here and Claim 5.5, with ,

 1−~hm(1−λi)>0,1≤i≤n,

where

 m>6κ(log(c(p)/p)+plog(6κ)). (12)

From here we conclude that for such the matrix is SPSD. In addition,

 Bp=[In−(In−B)]p=In−h(C)=[In−~hm(C)]−hm(C).

Since ,

 ∣∣tr Bp−n+Ht(~hm(C))∣∣=∣∣tr Bp−Ht(In−~hm(C))∣∣≤|tr[In−~hm(C)]−Ht(In−~hm(C))|+|tr hm(C)|=Δ1+Δ2. (13)

To conclude, we will provide estimates for and .

### Estimate for Δ1

According to Theorem 1.1, with probability at least , But and so (14)

### Estimate for Δ2

We have . Applying Claim 5.6 with we get:

 |hm(1−λi)|<ϵλpi,

if satisfies . For such :

 Δ2=|tr hm(C)|≤n∑i=1|hm(1−λi)|≤ϵn∑i=1λpi=ϵtr Bp.

### The final estimate for matrix B

If

 m≥6κ[plog(6κ)+log(1ϵ)+log(c(p)p)]. (15)

then satisfies , so for such from and :

 ∣∣tr Bp−[n−Ht(~hm(C))]∣∣≤Δ1+Δ2≤ϵtr Bp+(1+ϵ)Δ2≤ϵ(2+ϵ)tr Bp≤3ϵtr Bp. (16)

### Back to matrix A

We have and . Recall that and . So we get from , w.p. at least :

 ∣∣∥A∥pp−y∣∣=∣∣tr Ap−αp[n−Ht(~hm(C))]∣∣≤αp(3ϵ)tr Bp≤3ϵ tr Ap=3ϵ∥A∥pp,

for satisfying . Since and by substituting with and with the theorem follows. ∎

## 5 Technical Lemmas

###### Claim 5.1.

There exists a constant , (defined in ), that depends only on , such that

 ∣∣∣(pk)∣∣∣≤c(p)(k+1)−(p+1),

for any .

###### Proof.

Denote

 c1(p)={∏[p]i=1∣∣p−i+1i∣∣,   p>1,1,   0≤p≤1.

If then . Thus, for :

 ∣∣∣(pk)∣∣∣=c1(p)p−[p][p]+1k∏i=[p]+2(1−p+1i). (17)

Since for and , we have:

 ∣∣∣(pk)∣∣∣≤c1(p)exp⎛⎝−(1+p)k∑i=[p]+21i⎞⎠. (18)

Further, it is known from Langrange formula that and thus

 k∑i=[p]+21i≥k∑i=[p]+2(ln(i+1)−lni)=ln(k+1)−ln([p]+2). (19)

From here and we obtain

 ∣∣∣(pk)∣∣∣≤c1(p)exp(−(1+p)(ln(k+1)−ln([p]+2)))=c1(p)(2+[p])p+1(k+1)−(p+1). (20)

Denote

 c(p)=c1(p)(2+[p])p+1. (21)

The claim follows. ∎

###### Claim 5.2.

For all ,

 c(p)>p.
###### Proof.

Assume first that Then and so . Let now . Then and . Assume now . Then and

 c(p)=p[p]∏i=2∣∣∣p−i+1i∣∣∣(2+[p])p+1. (22)

For we have

 p−i+1i=p+1i−1≥p+1[p]−1≥p+1p−1=1p.

So,

 [p]∏i=2∣∣∣p−i+1i∣∣∣(2+[p])p+1≥(1p)p−1(2+[p])p+1=(2+[p]p)p−1(2+[p])2>1,

where the last inequality holds because . From here and the claim follows. ∎

###### Claim 5.3.

For and :

 |hm(x)|≤c(p)p|x|m+1(m+1)−p.
###### Proof.

We have:

 |hm(x)|≤∞∑k=m+1∣∣∣(pk)∣∣∣|x|k≤|x|m+1∞∑k=m+1∣∣∣(pk)∣∣∣≤|x|m+1c(p)∞∑k=m+1(k+1)−(p+1),

where the first inequality follows from the definition of in , the second inequality follows since and the third inequality follows from Claim 5.1.Let us estimate the last sum2. For any we have . Integrating over we obtain

 (k+1)−(p+1)≤∫k+1kx−(p+1)dx.

Thus

 ∞∑k=m+1(k+1)−(p+1)≤∞∑k=m+1∫k+1kx−(p+1)dx=∫∞m+1x−(p+1)dx=1p(m+1)−p.

The claim follows from here and the above bound. ∎

###### Claim 5.4.

For any and :

 1+c(p)p(m+1)−p−~hm(x)≥(1−x)p≥0.
###### Proof.

We have , and, by Claim 5.3, . So . The claim follows. ∎

###### Claim 5.5.

Let . If

 m>log(c(p)/p)−plog(1−a)1−a−1

then for .

###### Proof.

Claim 5.3 yields

 1−~hm(x)=(1−x)p+hm(x)≥(1−a)p−|hm(x)|≥(1−a)p−c(p)pxm+1(m+1)−p≥(1−a)p−c(p)pam+1.

Solving the inequality with respect to we get

 m>log(c(p)/p)−plog(1−a)−loga−1. (23)

Now we use the inequality

 −1logx≤11−x,    0

To check it, note that this inequality is equivalent to

 g(x)\vcentcolon=1−x+logx<0,