
# greedy approaches to symmetric orthogonal tensor decomposition

Cun Mu† · Daniel Hsu‡ · Donald Goldfarb†

†Department of Industrial Engineering and Operations Research, Columbia University (cm3052@columbia.edu, goldfarb@columbia.edu). DG was partially supported by NSF Grant CCF-1527809.
‡Department of Computer Science, Columbia University (djhsu@cs.columbia.edu). DH was partially supported by NSF Grant IIS-1563785, a Bloomberg Data Science Research Grant, and a Sloan Research Fellowship.
###### Abstract

Finding the symmetric and orthogonal decomposition (SOD) of a tensor is a recurring problem in signal processing, machine learning and statistics. In this paper, we review, establish and compare the perturbation bounds for two natural types of incremental rank-one approximation approaches. Numerical experiments and open questions are also presented and discussed.

Key words. tensor decomposition, rank-1 tensor approximation, orthogonally decomposable tensor, perturbation analysis

AMS subject classifications. 15A18, 15A69, 49M27, 62H25

## 1 Introduction

A $p$-way $n$-dimensional tensor $T$, namely $T \in \mathbb{R}^{n \times n \times \cdots \times n}$ ($p$ times), is called symmetrically orthogonally decomposable (SOD) [1, 2] (a.k.a. odeco in [3]) if it can be expressed as a linear combination over the real field of symmetric $p$-th powers of vectors that form an orthonormal basis of $\mathbb{R}^n$. Mathematically, $T$ is SOD if there exist $\lambda_1, \lambda_2, \ldots, \lambda_n \in \mathbb{R}$ and an orthogonal matrix $V = [v_1, v_2, \ldots, v_n] \in \mathbb{R}^{n \times n}$ such that

$$T = \lambda_1 v_1^{\otimes p} + \lambda_2 v_2^{\otimes p} + \cdots + \lambda_n v_n^{\otimes p}, \tag{1.1}$$

where $v^{\otimes p}$, the symmetric $p$-th power of the vector $v$, denotes the $p$-way $n$-dimensional tensor with entries $(v^{\otimes p})_{i_1, i_2, \ldots, i_p} = v_{i_1} v_{i_2} \cdots v_{i_p}$. The decomposition (1.1) is called the symmetric orthogonal decomposition (also abbreviated as SOD) of $T$, with each $\lambda_i$ and $v_i$, respectively, called an eigenvalue and eigenvector of $T$ (for a more detailed discussion of eigenvalues and eigenvectors of SOD tensors, see [3]). The gist of our paper is to find the SOD of $T$ (potentially with perturbations), a recurring problem arising in different contexts including higher-order statistical estimation [4], independent component analysis [5, 6], and parameter estimation for latent variable models [7], just to name a few.
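To make the definition concrete, the following numpy sketch assembles an SOD tensor from given eigenvalues and an orthonormal basis; the helper names (`rank_one_sym`, `sod_tensor`) are ours, not from the paper.

```python
import numpy as np

def rank_one_sym(v, p):
    """The symmetric p-th power v^{(x)p}: a p-way array with entries v_{i1}...v_{ip}."""
    t = v
    for _ in range(p - 1):
        t = np.multiply.outer(t, v)
    return t

def sod_tensor(lams, V, p):
    """T = sum_i lams[i] * v_i^{(x)p}, where v_i are the (orthonormal) columns of V."""
    n = V.shape[0]
    T = np.zeros((n,) * p)
    for lam, v in zip(lams, V.T):
        T += lam * rank_one_sym(v, p)
    return T

# A tiny 3-way, 2-dimensional example built on the standard basis.
lams = np.array([2.0, -1.0])
T = sod_tensor(lams, np.eye(2), p=3)
```

For the standard basis, the resulting tensor is diagonal with the eigenvalues on its diagonal, which is the simplest instance of (1.1).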

From the expression (1.1), it is quite tempting to find the eigenpairs $(\lambda_i, v_i)$ one by one in a greedy manner using proper deflation procedures. Specifically, one first approximates $T$ by its best rank-one approximation,

$$(\lambda_\star, v_\star) \in \operatorname*{argmin}_{\lambda \in \mathbb{R},\, \|v\| = 1} \left\| T - \lambda \cdot v^{\otimes p} \right\|_F. \tag{1.2}$$

After that, to find the next pair, one modifies the optimization problem (1.2) to exclude the found eigenpair $(\lambda_\star, v_\star)$. We next review two natural deflation procedures—residual deflation [8] and constrained deflation [9]—which incorporate the information of $(\lambda_\star, v_\star)$ into an optimization framework by altering, respectively, the objective and the feasible set of problem (1.2).
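Solving (1.2) to global optimality is hard in general. A common local heuristic is the symmetric higher-order power method, sketched below; the paper's analysis assumes a global solver, so treat this purely as an illustration.

```python
import numpy as np

def contract_all_but_one(T, x):
    """Compute T x^{(x)(p-1)} in R^n by contracting all but one mode with x."""
    out = T
    for _ in range(T.ndim - 1):
        out = out @ x          # matmul with a vector contracts the last mode
    return out

def rank_one_approx(T, iters=200, seed=0):
    """Heuristic for (1.2): symmetric higher-order power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = contract_all_but_one(T, v)
        nw = np.linalg.norm(w)
        if nw == 0:
            break
        v = w / nw
    lam = contract_all_but_one(T, v) @ v   # lambda = T v^{(x)p}
    return lam, v

# Example: a diagonal 3-way tensor with eigenvalues 3 and 1.
T = np.zeros((2, 2, 2))
T[0, 0, 0], T[1, 1, 1] = 3.0, 1.0
lam, v = rank_one_approx(T)
```

On this SOD example the iteration converges to the dominant eigenpair; in general it only converges to a local maximizer that depends on the initialization.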

#### Residual deflation

In residual deflation, the rank-one approximation is subtracted from the original tensor, i.e., $T \leftarrow T - \lambda_\star v_\star^{\otimes p}$, and one then finds the best rank-one approximation to the deflated tensor by solving (1.2) again. The complete scheme, referred to as Successive Rank-One Approximation with Residual Deflation (SROAwRD), is described in Algorithm 1.
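A compact sketch of the SROAwRD loop follows, again with a power-method heuristic standing in for the exact rank-one solver that the analysis assumes; the function names are ours.

```python
import numpy as np

def _contract(T, x):
    out = T
    for _ in range(T.ndim - 1):
        out = out @ x
    return out

def _rank_one(T, iters=200, seed=0):
    """Power-method surrogate for the best rank-one approximation (1.2)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = _contract(T, v)
        nw = np.linalg.norm(w)
        if nw == 0:
            break
        v = w / nw
    return _contract(T, v) @ v, v

def sroa_rd(T, r):
    """SROAwRD sketch: find a rank-one approximation, subtract it, repeat r times."""
    lams, vs = [], []
    D = T.copy()
    p = D.ndim
    for j in range(r):
        lam, v = _rank_one(D, seed=j)
        lams.append(lam)
        vs.append(v)
        R = v
        for _ in range(p - 1):
            R = np.multiply.outer(R, v)
        D = D - lam * R            # residual deflation
    return np.array(lams), np.array(vs)

T = np.zeros((2, 2, 2))
T[0, 0, 0], T[1, 1, 1] = 3.0, 1.0
lams, vs = sroa_rd(T, 2)
```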

#### Constrained deflation

In constrained deflation, one restricts the next eigenvector estimate $v$ to be nearly orthogonal to $v_\star$ by solving problem (1.2) with the additional linear constraints $|\langle v_\star, v \rangle| \le \theta$, where $\theta > 0$ is a prescribed parameter. The complete scheme, referred to as Successive Rank-One Approximation with Constrained Deflation (SROAwCD), is described in Algorithm 2. At the $(k+1)$-th iteration, rather than deflating the original tensor by subtracting from it the sum of the rank-one tensors $\hat{\lambda}_i \hat{v}_i^{\otimes p}$, $i = 1, \ldots, k$, as the SROAwRD method does, the SROAwCD method imposes the near-orthogonality constraints $|\langle \hat{v}_i, v \rangle| \le \theta$ for $i \in [k]$.
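For intuition, here is a toy sketch of the SROAwCD loop. The paper solves each constrained subproblem globally (with a polynomial optimization solver); below, the near-orthogonality constraints are instead enforced by clipping inside a power iteration, which is only a heuristic stand-in and carries no guarantee.

```python
import numpy as np

def _contract(T, x):
    out = T
    for _ in range(T.ndim - 1):
        out = out @ x
    return out

def sroa_cd(T, r, theta=0.1, iters=300):
    """SROAwCD sketch: power steps on the ORIGINAL tensor, clipping the
    components along previously found vectors to keep |<v_i, x>| <= theta."""
    lams, vs = [], []
    for j in range(r):
        rng = np.random.default_rng(j)
        x = rng.standard_normal(T.shape[0])
        x /= np.linalg.norm(x)
        for _ in range(iters):
            w = _contract(T, x)
            for v in vs:                   # approximate projection onto the
                c = v @ w                  # constraint set |<v, w>| <= theta*||w||
                cap = theta * np.linalg.norm(w)
                if abs(c) > cap:
                    w = w - (c - np.sign(c) * cap) * v
            nw = np.linalg.norm(w)
            if nw == 0:
                break
            x = w / nw
        lams.append(_contract(T, x) @ x)
        vs.append(x)
    return np.array(lams), np.array(vs)

T = np.zeros((2, 2, 2))
T[0, 0, 0], T[1, 1, 1] = 3.0, 1.0
lams, vs = sroa_cd(T, 2)
```

Note the key structural difference from residual deflation: the tensor itself is never modified, only the feasible set shrinks.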

It is not hard to prove that, given an SOD tensor $T$ as the input, both the SROAwRD and SROAwCD methods are capable of finding its eigenpairs exactly. In this paper, we focus on the more challenging case of tensors that are only close to being SOD. Specifically:

###### Problem 1

Suppose $T$ is an SOD tensor as in (1.1), and that the perturbed tensor $\hat{T} = T + E$ is provided as input to the SROAwRD and SROAwCD methods. Characterize the discrepancy between the eigenpairs $\{(\lambda_i, v_i)\}_{i=1}^n$ of $T$ and the components $\{(\hat{\lambda}_j, \hat{v}_j)\}_{j=1}^n$ found by these methods.

In this paper, we provide positive answers to Problem 1. The characterization for SROAwRD was done in our previous paper [1]; we review those results in Section 3. The characterization for SROAwCD is the main contribution of the present paper. These results can be regarded as higher-order generalizations of the Davis–Kahan perturbation result [10] for the matrix eigenvalue decomposition, and they are not only of mathematical interest but also crucial to applications in signal processing, machine learning and statistics [4, 5, 6, 7], where the common goal is to find the underlying eigenpairs even though the tensor collected is subject to inevitable perturbations arising from sampling errors, noisy measurements, model misspecification, numerical errors and so on.

#### Organization

The rest of the paper is organized as follows. In Section 2, we introduce notation relevant to this paper. In Section 3, we review theoretically what is known about the SROAwRD method. In Section 4, we provide a perturbation analysis for the SROAwCD method.

## 2 Notation

In this section, we introduce some tensor notation needed in our paper, largely borrowed from [11].

#### Symmetric tensor

A real $p$-way $n$-dimensional tensor $A$,

$$A = (A_{i_1, i_2, \ldots, i_p}), \qquad A_{i_1, i_2, \ldots, i_p} \in \mathbb{R}, \qquad 1 \le i_1, i_2, \ldots, i_p \le n,$$

is called symmetric if its entries are invariant under any permutation of their indices, i.e., for any $i_1, i_2, \ldots, i_p \in [n]$,

$$A_{i_1, i_2, \ldots, i_p} = A_{i_{\pi(1)}, i_{\pi(2)}, \ldots, i_{\pi(p)}}$$

for every permutation mapping $\pi$ of $[p]$.
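This invariance can be checked (and imposed) directly; `symmetrize` below averages over all mode permutations, a standard construction that also appears later when building symmetric noise tensors. Both helper names are ours.

```python
import numpy as np
from itertools import permutations

def is_symmetric(A, tol=1e-10):
    """True if A is invariant under every permutation of its modes."""
    return all(np.allclose(A, np.transpose(A, perm), atol=tol)
               for perm in permutations(range(A.ndim)))

def symmetrize(A):
    """Average of A over all mode permutations (always symmetric)."""
    perms = list(permutations(range(A.ndim)))
    return sum(np.transpose(A, q) for q in perms) / len(perms)

G = np.random.default_rng(0).standard_normal((3, 3, 3))
S = symmetrize(G)
```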

#### Multilinear map

In addition to being considered as a multi-way array, a tensor $A$ can also be interpreted as a multilinear map in the following sense: for any matrices $V_k \in \mathbb{R}^{n \times m_k}$, $k \in [p]$, we define $A(V_1, V_2, \ldots, V_p)$ as a tensor in $\mathbb{R}^{m_1 \times m_2 \times \cdots \times m_p}$ whose $(i_1, i_2, \ldots, i_p)$-th entry is

$$(A(V_1, V_2, \ldots, V_p))_{i_1, i_2, \ldots, i_p} := \sum_{j_1, j_2, \ldots, j_p \in [n]} A_{j_1, j_2, \ldots, j_p} (V_1)_{j_1 i_1} (V_2)_{j_2 i_2} \cdots (V_p)_{j_p i_p}.$$

The following two special cases are quite frequently used in the paper:

$V_k = x \in \mathbb{R}^n$ for all $k \in [p]$: $A x^{\otimes p} := A(x, x, \ldots, x) \in \mathbb{R}$, which defines a homogeneous polynomial of degree $p$ in $x$.

$V_k = x \in \mathbb{R}^n$ for all $k \in [p-1]$, and $V_p = I$, the $n \times n$ identity matrix:

$$A x^{\otimes p-1} := A(x, \ldots, x, I) \in \mathbb{R}^n.$$

For a symmetric tensor $A$, the differentiation result $\nabla_x \left( A x^{\otimes p} \right) = p \, A x^{\otimes p-1}$ can be established.
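The multilinear map and its two special cases can be realized with mode-wise contractions. In the sketch below, `np.tensordot` contracts one mode at a time and appends the new index at the end, so after $p$ steps the index order matches the definition above; the function names are ours.

```python
import numpy as np

def multilinear(A, mats):
    """A(V1,...,Vp): contract mode k of A with the rows of mats[k]."""
    out = A
    for V in mats:
        # contract the current leading mode; the new index goes to the end,
        # so the final result has indices (i1, ..., ip) in order
        out = np.tensordot(out, V, axes=([0], [0]))
    return out

def A_xp(A, x):
    """A x^{(x)p} in R: the degree-p homogeneous polynomial at x."""
    return multilinear(A, [x.reshape(-1, 1)] * A.ndim).item()

def A_xpm1(A, x):
    """A x^{(x)(p-1)} = A(x, ..., x, I) in R^n."""
    mats = [x.reshape(-1, 1)] * (A.ndim - 1) + [np.eye(A.shape[0])]
    return multilinear(A, mats).ravel()
```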

#### Inner product

For any tensors $A, B \in \mathbb{R}^{n \times n \times \cdots \times n}$, the inner product between them is naturally defined as

$$\langle A, B \rangle := \sum_{i_1, i_2, \ldots, i_p \in [n]} A_{i_1, i_2, \ldots, i_p} B_{i_1, i_2, \ldots, i_p}.$$

#### Tensor norms

Two tensor norms will be used in the paper. For a tensor $A$, its Frobenius norm is $\|A\|_F := \sqrt{\langle A, A \rangle}$, and its operator norm $\|A\|$ is defined as $\|A\| := \max_{\|x_1\| = \|x_2\| = \cdots = \|x_p\| = 1} |A(x_1, x_2, \ldots, x_p)|$. It is also well known that for a symmetric tensor $A$, $\|A\|$ can be equivalently defined as $\max_{\|x\| = 1} |A x^{\otimes p}|$ (see, e.g., [12, 13]).
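The Frobenius norm is trivial to compute; computing the operator norm exactly is NP-hard for $p \ge 3$ in general, so the sketch below only produces a lower bound via random-restart power iterations (names are ours).

```python
import numpy as np

def inner(A, B):
    return float(np.sum(A * B))

def fro_norm(A):
    return float(np.sqrt(inner(A, A)))

def op_norm_lower(A, restarts=20, iters=100, seed=0):
    """Lower bound on ||A|| = max_{||x||=1} |A x^{(x)p}| for symmetric A."""
    rng = np.random.default_rng(seed)
    best = 0.0
    n, p = A.shape[0], A.ndim
    for _ in range(restarts):
        x = rng.standard_normal(n)
        x /= np.linalg.norm(x)
        for _ in range(iters):
            w = A
            for _ in range(p - 1):
                w = w @ x              # w = A x^{(x)(p-1)}
            nw = np.linalg.norm(w)
            if nw == 0:
                break
            x = w / nw
        w = A
        for _ in range(p - 1):
            w = w @ x
        best = max(best, abs(w @ x))   # |A x^{(x)p}| at the limit point
    return best

A = np.zeros((2, 2, 2))
A[0, 0, 0], A[1, 1, 1] = 4.0, 1.0
```

For this diagonal example the operator norm is 4 and the Frobenius norm is $\sqrt{17}$, and the lower bound is tight.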

## 3 Review of SROAwRD

Algorithm 1 has been intensively studied in the tensor community, though most papers [14, 8, 15, 16, 17, 18, 12, 13, 19, 20, 21, 22, 7, 23, 24] focus on the numerical aspects of how to solve the best rank-one approximation problem (1.2). Regarding theoretical guarantees for the symmetric orthogonal decomposition, Zhang and Golub [8] first proved that SROAwRD outputs the exact symmetric orthogonal decomposition if the input tensor is symmetric and orthogonally decomposable:

###### Proposition 3.1

[8, Theorem 3.2] Let $T$ be a symmetric tensor with orthogonal decomposition $T = \sum_{i=1}^n \lambda_i v_i^{\otimes p}$, where each $\lambda_i \neq 0$ and $\{v_1, v_2, \ldots, v_n\}$ forms an orthonormal basis of $\mathbb{R}^n$. Let $\{(\hat{\lambda}_j, \hat{v}_j)\}_{j=1}^n$ be the output of Algorithm 1 with input $T$. Then the decomposition is recovered exactly; moreover, there exists a permutation $\pi$ of $[n]$ such that for each $j \in [n]$,

$$\min\{|\lambda_{\pi(j)} - \hat{\lambda}_j|, |\lambda_{\pi(j)} + \hat{\lambda}_j|\} = 0, \qquad \min\{\|v_{\pi(j)} - \hat{v}_j\|, \|v_{\pi(j)} + \hat{v}_j\|\} = 0.$$

The corresponding perturbation analysis was recently carried out in [1]:

###### Theorem 3.2

[1, Theorem 3.1] There exists a positive constant $c$ such that the following holds. Let $\hat{T} = T + E$, where the ground-truth tensor $T$ is symmetric with orthogonal decomposition $T = \sum_{i=1}^n \lambda_i v_i^{\otimes p}$, $\{v_1, v_2, \ldots, v_n\}$ forms an orthonormal basis of $\mathbb{R}^n$, and the perturbation tensor $E$ is symmetric with operator norm $\|E\| \le \varepsilon$. Assume $\varepsilon \le c \cdot \lambda_{\min}/n$, where $\lambda_{\min} = \min_{i \in [n]} |\lambda_i|$. Let $\{(\hat{\lambda}_j, \hat{v}_j)\}_{j=1}^n$ be the output of Algorithm 1 with input $\hat{T}$. Then there exists a permutation $\pi$ over $[n]$ such that for each $j \in [n]$:

$$\min\{|\lambda_{\pi(j)} - \hat{\lambda}_j|, |\lambda_{\pi(j)} + \hat{\lambda}_j|\} \le 2\varepsilon, \qquad \min\{\|v_{\pi(j)} - \hat{v}_j\|, \|v_{\pi(j)} + \hat{v}_j\|\} \le 20\,\varepsilon/|\lambda_{\pi(j)}|.$$

Theorem 3.2 generalizes Proposition 3.1 and provides perturbation bounds for the SROAwRD method. Specifically, when the operator norm of the perturbation tensor vanishes, i.e. $\varepsilon = 0$, Theorem 3.2 reduces to Proposition 3.1; when $\varepsilon$ is small enough (i.e. $\varepsilon \le c \cdot \lambda_{\min}/n$), the SROAwRD method is able to robustly recover the eigenpairs of the underlying symmetric orthogonally decomposable tensor $T$.

In Theorem 3.2, $\varepsilon$ is required to be at most on the order of $\lambda_{\min}/n$, which decreases as the tensor size increases. It is interesting to explore whether or not this dimensional dependency is essential:

###### Open Question 1

Can we provide a better analysis for the SROAwRD method to remove the dimensional dependence of the noise level? Or can we design a concrete example that corroborates the necessity of this dimensional dependency?

The existence of the dimensional dependency, at least for the current proof in Mu et al. [1], can be briefly explained as follows. At the end of the $k$-th iteration, we subtract the rank-one tensor $\hat{\lambda}_k \hat{v}_k^{\otimes p}$ from the current deflated tensor. Since $(\hat{\lambda}_k, \hat{v}_k)$ only approximates the underlying truth, this deflation procedure introduces additional errors into subsequent iterations. Although [1] made substantial efforts to reduce the accumulated effect of the sequential deflation steps, the tolerable perturbation error still needs to depend on the number of iterations in order to control the perturbation bounds for the eigenvalues and eigenvectors, and we tend to believe that the dimensional dependency in Theorem 3.2 is necessary.

In contrast, the SROAwCD method, instead of changing the objective, imposes additional constraints, which force the next eigenvector estimate to be nearly orthogonal to the previously found $\hat{v}_1, \ldots, \hat{v}_k$. As the SROAwCD method alters the search space rather than the objective in the optimization, there is hope that the requirement on the noise level might be dimension-free. In the next section, we confirm this intuition.

## 4 SROAwCD

In this section, we establish the first perturbation bounds for the SROAwCD method for tensor SOD. The main result can be stated as follows:

###### Theorem 4.1

Let $\hat{T} = T + E$, where $T$ is a symmetric tensor with orthogonal decomposition $T = \sum_{i=1}^n \lambda_i v_i^{\otimes p}$, $\{v_1, v_2, \ldots, v_n\}$ is an orthonormal basis of $\mathbb{R}^n$, $\lambda_i \neq 0$ for all $i \in [n]$, and $E$ is a symmetric tensor with operator norm $\|E\| \le \varepsilon$. Assume $\varepsilon \le \theta^2 \lambda_{\min}/12.5$ and $\theta = 1/(2\kappa)$, where $\lambda_{\min} = \min_{i \in [n]} |\lambda_i|$, $\lambda_{\max} = \max_{i \in [n]} |\lambda_i|$, and $\kappa = \lambda_{\max}/\lambda_{\min}$. Let $\{(\hat{\lambda}_j, \hat{v}_j)\}_{j=1}^n$ be the output of Algorithm 2 for input $\hat{T}$. Then there exists a permutation $\pi$ of $[n]$ such that for all $j \in [n]$,

$$\min\{|\lambda_{\pi(j)} - \hat{\lambda}_j|, |\lambda_{\pi(j)} + \hat{\lambda}_j|\} \le \varepsilon, \tag{4.1}$$

$$\min\{\|v_{\pi(j)} - \hat{v}_j\|, \|v_{\pi(j)} + \hat{v}_j\|\} \le (6.2 + 4\kappa)\,\varepsilon/|\lambda_{\pi(j)}|. \tag{4.2}$$

Theorem 4.1 guarantees that for an appropriately chosen $\theta$, the SROAwCD method can approximately recover the eigenpairs $\{(\lambda_i, v_i)\}_{i=1}^n$ whenever the perturbation error $\varepsilon$ is small. A few remarks immediately come to mind. First, Theorem 4.1 specifies the choice of the parameter $\theta$, which depends on $\kappa$, the ratio of the largest to the smallest eigenvalue of $T$ in absolute value. In subsection 4.2, we will see through numerical studies that this dependency is necessary. Second, in contrast to the SROAwRD method, Theorem 4.1 does not require the noise level $\varepsilon$ to depend on the tensor size. This could be a potential advantage for the SROAwCD method.

The rest of this section is organized as follows. In subsection 4.1, we provide the proof of Theorem 4.1. In subsection 4.2, we present numerical experiments to corroborate Theorem 4.1. In subsection 4.3, we discuss issues related to determining the maximal spectral ratio $\kappa$ defined in Theorem 4.1.

### 4.1 Proof of Theorem 4.1

We will prove Theorem 4.1 by induction. For the base case, we need a perturbation result for the best rank-one tensor approximation, which is proven in [1] and can be regarded as a generalization of its matrix counterpart [25, 10]. In the following, we restate this result [1, Theorem 2.2] with a minor variation:

###### Lemma 4.2

Let $\hat{T} = T + E$, where $T$ is a symmetric tensor with orthogonal decomposition $T = \sum_{i=1}^n \lambda_i v_i^{\otimes p}$, $\{v_1, v_2, \ldots, v_n\}$ is an orthonormal basis of $\mathbb{R}^n$, $\lambda_i \neq 0$ for all $i \in [n]$, and $E$ is a symmetric tensor with operator norm $\|E\| \le \varepsilon$. Let $(\hat{\lambda}, \hat{v})$ be a best rank-one approximation of $\hat{T}$ as in (1.2). Then there exists $j \in [n]$ such that

$$\min\{|\lambda_j - \hat{\lambda}|, |\lambda_j + \hat{\lambda}|\} \le \varepsilon, \quad\text{and}\quad \min\{\|v_j - \hat{v}\|, \|v_j + \hat{v}\|\} \le 10\left(\frac{\varepsilon}{|\lambda_j|} + \left(\frac{\varepsilon}{\lambda_j}\right)^{2}\right).$$

Now we are ready to prove our main result, Theorem 4.1.

Proof. Without loss of generality, we assume $p$ is odd and $\lambda_i > 0$ for all $i \in [n]$ (as we can always flip the signs of the $v_i$ to ensure this). Then the constrained subproblem solved at the $j$-th iteration of Algorithm 2 can be equivalently written as

$$\max_{x \in \mathbb{R}^n} \ \hat{T} x^{\otimes p} \quad \text{subject to} \quad \|x\| = 1, \quad |\langle \hat{v}_i, x \rangle| \le \theta, \ i \in [j-1], \tag{4.3}$$

with $\hat{v}_j$ a global maximizer and $\hat{\lambda}_j = \hat{T} \hat{v}_j^{\otimes p}$.

To prove the theorem, it suffices to prove that the following property holds for each $k \in [n]$: there is a permutation $\pi$ of $[n]$ such that for every $j \in [k]$,

$$|\lambda_{\pi(j)} - \hat{\lambda}_j| \le \varepsilon \quad\text{and}\quad \|v_{\pi(j)} - \hat{v}_j\| \le (6.2 + 4\kappa)\,\frac{\varepsilon}{\lambda_{\pi(j)}}. \tag{$\ast$}$$

We will prove ($\ast$) by induction.

For the base case $k = 1$, Lemma 4.2 implies that there exists a $j \in [n]$ satisfying

$$|\hat{\lambda}_1 - \lambda_j| \le \varepsilon, \quad\text{and}\quad \|\hat{v}_1 - v_j\| \le \frac{10\,\varepsilon}{\lambda_j}\left(1 + \frac{\varepsilon}{\lambda_j}\right) \le \frac{10.2\,\varepsilon}{\lambda_j} \le (6.2 + 4\kappa)\,\frac{\varepsilon}{\lambda_j},$$

where we have used the facts that $\varepsilon/\lambda_j \le \theta^2/12.5 \le 1/50$ and that $\kappa \ge 1$, so $10.2 \le 6.2 + 4\kappa$.

Next we assume the induction hypothesis ($\ast$) is true for $k < n$, and prove that there exists an $l$ that satisfies

$$|\hat{\lambda}_{k+1} - \lambda_l| \le \varepsilon, \quad\text{and}\quad \|\hat{v}_{k+1} - v_l\| \le (6.2 + 4\kappa)\,\frac{\varepsilon}{\lambda_l}. \tag{4.4}$$

Denote $\hat{x} = \hat{v}_{k+1}$ and $\hat{\lambda} = \hat{\lambda}_{k+1}$. Then, based on (4.3), one has

$$\hat{x} \in \operatorname*{argmax}_{\|x\| = 1,\ |\langle \hat{v}_i, x \rangle| \le \theta,\ i \in [k]} \hat{T} x^{\otimes p}, \tag{4.5}$$

and $\hat{\lambda} = \hat{T} \hat{x}^{\otimes p}$. Since $\{v_1, v_2, \ldots, v_n\}$ forms an orthonormal basis, we may write $\hat{x} = \sum_{i=1}^n x_i v_i$. Without loss of generality, we renumber the pairs $(\lambda_1, v_1), \ldots, (\lambda_k, v_k)$ (so that $\pi(j) = j$ for each $j \in [k]$) and the pairs $(\lambda_{k+1}, v_{k+1}), \ldots, (\lambda_n, v_n)$, respectively, to satisfy

$$\lambda_1 |x_1|^{p-2} \ge \lambda_2 |x_2|^{p-2} \ge \cdots \ge \lambda_k |x_k|^{p-2}, \quad\text{and} \tag{4.6}$$

$$\lambda_{k+1} |x_{k+1}|^{p-2} \ge \lambda_{k+2} |x_{k+2}|^{p-2} \ge \cdots \ge \lambda_n |x_n|^{p-2}.$$

In the following, we will show that $l = k+1$ is indeed an index satisfying (4.4). The idea of the rest of the proof is as follows. We first provide lower and upper bounds for $\hat{\lambda}$, based on which we are able to show that $|\hat{\lambda} - \lambda_{k+1}| \le \varepsilon$ and $|x_{k+1}| \ge 1 - 2\varepsilon/\lambda_{k+1}$. However, Theorem 4.1 requires an eigenvector bound of order $\varepsilon/\lambda_{k+1}$. To close this gap, we characterize the optimality condition of (4.5), use of which enables us to sharpen the upper bound of $\|\hat{x} - v_{k+1}\|$.

We first derive a lower bound for $\hat{\lambda}$ by exhibiting points that are feasible for (4.5). For each $i \in [n] \setminus [k]$ and each $j \in [k]$, one has

$$|\langle v_i, \hat{v}_j \rangle| = |\langle v_i, \hat{v}_j - v_j \rangle| \le \|\hat{v}_j - v_j\| \le (6.2 + 4\kappa)\,\frac{\varepsilon}{\lambda_j} \le \frac{(6.2 + 4\kappa)\,\theta^2 \lambda_{\min}}{12.5\,\lambda_{\min}} = \frac{(6.2 + 4\kappa)\,\theta}{12.5}\cdot\theta = \frac{6.2 + 4\kappa}{25\kappa}\,\theta < \theta, \tag{4.7}$$

where we have used the Cauchy–Schwarz inequality, the induction hypothesis ($\ast$), and the facts $\varepsilon \le \theta^2 \lambda_{\min}/12.5$, $\theta = 1/(2\kappa)$, and $\kappa \ge 1$. Hence, $v_{k+1}, \ldots, v_n$ are all feasible for problem (4.5), and we naturally obtain a lower bound for $\hat{\lambda}$:

$$\hat{\lambda} = \hat{T} \hat{x}^{\otimes p} \ge \max_{i \in [n] \setminus [k]} \hat{T} v_i^{\otimes p} \ge \max_{i \in [n] \setminus [k]} \lambda_i - \varepsilon \ge \lambda_{k+1} - \varepsilon. \tag{4.8}$$

Regarding the upper bound for $\hat{\lambda}$, one has

$$\hat{\lambda} = \hat{T} \hat{x}^{\otimes p} = (T + E)\,\hat{x}^{\otimes p} = \left(\sum_{i=1}^n \lambda_i v_i^{\otimes p} + E\right)\hat{x}^{\otimes p} = \sum_{i=1}^k \lambda_i x_i^p + \sum_{i=k+1}^n \lambda_i x_i^p + E \hat{x}^{\otimes p} \le \sum_{i=1}^k \lambda_i |x_i|^{p-2} x_i^2 + \sum_{i=k+1}^n \lambda_i |x_i|^{p-2} x_i^2 + E \hat{x}^{\otimes p} \le \max\{\lambda_1 |x_1|^{p-2},\ \lambda_{k+1} |x_{k+1}|^{p-2}\} + \varepsilon, \tag{4.9}$$

where the last inequality is due to the orderings assumed in (4.6) and the facts $\sum_{i=1}^n x_i^2 = 1$ and $E \hat{x}^{\otimes p} \le \varepsilon$.

Combining (4.8) and (4.9), we have

$$\lambda_{k+1} - \varepsilon \le \max_{i \in [n] \setminus [k]} \lambda_i - \varepsilon \le \hat{\lambda} \le \max\{\lambda_1 |x_1|^{p-2},\ \lambda_{k+1} |x_{k+1}|^{p-2}\} + \varepsilon. \tag{4.10}$$

Also note that

$$\lambda_1 |x_1|^{p-2} + \varepsilon \le \lambda_1 |x_1| + \varepsilon = \lambda_1 |\langle \hat{x}, v_1 \rangle| + \varepsilon = \lambda_1 \left|\langle \hat{x}, \hat{v}_{\pi^{-1}(1)} \rangle + \langle \hat{x}, v_1 - \hat{v}_{\pi^{-1}(1)} \rangle\right| + \varepsilon \le \frac{\lambda_1}{2\kappa} + 6.2\,\varepsilon + 4\kappa\varepsilon + \varepsilon \le \frac{\lambda_{\min}}{2} + \frac{\lambda_{\min}}{12.5} + 7.2\,\varepsilon < \lambda_{\min} - \varepsilon \le \lambda_{k+1} - \varepsilon, \tag{4.11}$$

where we have used the facts that $|\langle \hat{x}, \hat{v}_i \rangle| \le \theta = 1/(2\kappa)$ for $i \in [k]$, $\lambda_1 \le \lambda_{\max} = \kappa\lambda_{\min}$, $\varepsilon \le \lambda_{\min}/50$, and

$$4\kappa\varepsilon \le 4\cdot\frac{\lambda_{\max}}{\lambda_{\min}}\cdot\frac{\theta^2 \lambda_{\min}}{12.5} = \frac{4\,\lambda_{\max}}{12.5 \cdot 4\kappa^2} = \frac{\lambda_{\min}}{12.5\,\kappa} \le \frac{\lambda_{\min}}{12.5}.$$

Therefore, in order to satisfy (4.10), we must have

$$\max\{\lambda_1 |x_1|^{p-2},\ \lambda_{k+1} |x_{k+1}|^{p-2}\} = \lambda_{k+1} |x_{k+1}|^{p-2}, \tag{4.12}$$

which simplifies (4.10) to

$$\lambda_{k+1} - \varepsilon \le \max_{i \in [n] \setminus [k]} \lambda_i - \varepsilon \le \hat{\lambda} \le \lambda_{k+1} |x_{k+1}|^{p-2} + \varepsilon. \tag{4.13}$$

Based on (4.13), we have that

$$\lambda_{k+1} \ge \max_{i \in [n] \setminus [k]} \lambda_i - 2\varepsilon, \qquad |\lambda_{k+1} - \hat{\lambda}| \le \varepsilon, \quad\text{and}\quad |x_{k+1}| \ge |x_{k+1}|^{p-2} \ge \frac{\lambda_{k+1} - 2\varepsilon}{\lambda_{k+1}} = 1 - \frac{2\varepsilon}{\lambda_{k+1}}. \tag{4.14}$$

Thus, we have established the eigenvalue perturbation bound (4.1) promised in the theorem. Next, we sharpen the eigenvector perturbation bound by exploiting the optimality conditions for problem (4.5).

The key observation is that, at the point $\hat{x}$, none of the near-orthogonality constraints is active. To see this, for any $i \in [k]$,

$$|\langle \hat{v}_i, \hat{x} \rangle| \le |\langle v_i, \hat{x} \rangle| + |\langle \hat{v}_i - v_i, \hat{x} \rangle| \le \sqrt{1 - x_{k+1}^2} + \frac{(6.2 + 4\kappa)\,\theta^2}{12.5} \le \sqrt{\frac{4\varepsilon}{\lambda_{\min}}} + \frac{(6.2 + 4\kappa)\,\theta}{12.5}\cdot\theta \le \sqrt{\frac{4\theta^2}{12.5}} + \left(\frac{3.1}{12.5} + \frac{2}{12.5}\right)\theta < \theta, \tag{4.15}$$

where the last line is due to (4.14) and the facts that $\varepsilon \le \theta^2\lambda_{\min}/12.5$ and $\theta = 1/(2\kappa)$ with $\kappa \ge 1$. Therefore, only the equality constraint $\|x\| = 1$ is active and will be involved in the optimality conditions at the point $\hat{x}$. Consider the Lagrangian function at the point $\hat{x}$,

$$L(\hat{x}, \lambda) = \hat{T} \hat{x}^{\otimes p} - \frac{p\lambda}{2}\left(\|\hat{x}\|^2 - 1\right),$$

where $\lambda$ corresponds to the (scaled) Lagrange multiplier for the equality constraint on the norm of $\hat{x}$, which we have squared. Since the linear independence constraint qualification [26, Section 12.3] can be easily verified, by the first-order optimality conditions (a.k.a. the KKT conditions), there exists a $\bar{\lambda}$ such that

$$\frac{1}{p}\,\nabla_x L(\hat{x}, \bar{\lambda}) = \hat{T} \hat{x}^{\otimes p-1} - \bar{\lambda}\,\hat{x} = 0.$$

Moreover, as $\langle \hat{T} \hat{x}^{\otimes p-1}, \hat{x} \rangle = \hat{T} \hat{x}^{\otimes p} = \hat{\lambda}$, we have $\bar{\lambda} = \hat{\lambda}$. Thus,

$$\hat{\lambda}\,\hat{x} = \hat{T} \hat{x}^{\otimes p-1} = \lambda_{k+1}\, x_{k+1}^{p-1}\, v_{k+1} + \sum_{i \ne k+1} \lambda_i x_i^{p-1} v_i + E \hat{x}^{\otimes p-1}.$$

Consider the quantity

$$\|\lambda_{k+1}(\hat{x} - v_{k+1})\| = \left\|(\lambda_{k+1} - \hat{\lambda})\,\hat{x} + (\hat{\lambda}\,\hat{x} - \lambda_{k+1} v_{k+1})\right\| = \left\|(\lambda_{k+1} - \hat{\lambda})\,\hat{x} + \lambda_{k+1}(x_{k+1}^{p-1} - 1)\,v_{k+1} + \sum_{i \ne k+1} \lambda_i x_i^{p-1} v_i + E \hat{x}^{\otimes p-1}\right\| \le |\lambda_{k+1} - \hat{\lambda}| + \lambda_{k+1}\,|x_{k+1}^{p-1} - 1| + \left\|\sum_{i \ne k+1} \lambda_i x_i^{p-1} v_i\right\| + \left\|E \hat{x}^{\otimes p-1}\right\|. \tag{4.16}$$

Thanks to the intermediate result (4.14), we have

$$|\lambda_{k+1} - \hat{\lambda}| \le \varepsilon, \qquad \left\|E \hat{x}^{\otimes p-1}\right\| \le \varepsilon, \quad\text{and}\quad \lambda_{k+1}\,|x_{k+1}^{p-1} - 1| = \lambda_{k+1}\left(1 - |x_{k+1}| \cdot |x_{k+1}|^{p-2}\right) \le \lambda_{k+1}\left(1 - \left(1 - \frac{2\varepsilon}{\lambda_{k+1}}\right)^{2}\right) \le 4\varepsilon. \tag{4.17}$$

Moreover, for the term $\left\|\sum_{i \ne k+1} \lambda_i x_i^{p-1} v_i\right\|$, we can derive that

$$\left\|\sum_{i \ne k+1} \lambda_i x_i^{p-1} v_i\right\| = \left(\sum_{i \ne k+1} \lambda_i^2\, x_i^{2p-2}\right)^{1/2} \le \max\{\lambda_1 |x_1|^{p-2},\ \lambda_{k+2} |x_{k+2}|^{p-2}\}\sqrt{\sum_{i \ne k+1} x_i^2} \le 4\kappa\varepsilon. \tag{4.18}$$

The last inequality holds due to (4.6) and the fact that, for $j \in \{1, k+2\}$,

$$\lambda_j\,|x_j|^{p-2}\sqrt{\sum_{i \ne k+1} x_i^2} \le \lambda_j \sqrt{1 - x_{k+1}^2}\cdot\sqrt{1 - x_{k+1}^2} = \lambda_j\left(1 - x_{k+1}^2\right) \le \frac{4\,\lambda_j\,\varepsilon}{\lambda_{k+1}} \le 4\kappa\varepsilon, \tag{4.19}$$

where we have used $\lambda_j \le \lambda_{\max} = \kappa\lambda_{\min} \le \kappa\lambda_{k+1}$ and (4.14).

Therefore, by substituting (4.17) and (4.18) into (4.16), one has

$$\|\lambda_{k+1}(\hat{x} - v_{k+1})\| \le (6 + 4\kappa)\,\varepsilon,$$

which leads to the desired bound $\|\hat{v}_{k+1} - v_{k+1}\| \le (6 + 4\kappa)\,\varepsilon/\lambda_{k+1} \le (6.2 + 4\kappa)\,\varepsilon/\lambda_{k+1}$.

By mathematical induction, we complete the proof.

### 4.2 Numerical Experiments

In this subsection, we present three sets of numerical experiments to corroborate our theoretical findings in Theorem 4.1 regarding the SROAwCD method. We solve the main subproblem (4.3) via the general polynomial solver GloptiPoly 3 [27], which is a global solver based on the Sum-of-Squares (SOS) framework [28, 29, 30, 31, 32].

#### Experiment 1

In this experiment, we synthetically verify the perturbation bounds stated in Theorem 4.1. We generate nearly symmetric orthogonally decomposable tensors in the following manner. The underlying symmetric orthogonally decomposable tensor $T$ is set as the diagonal tensor with all diagonal entries equal to 300, i.e., $T = 300 \sum_{i=1}^5 e_i^{\otimes 3}$, and the perturbation tensors $E$ are produced by symmetrizing a randomly generated tensor whose entries independently follow the standard normal distribution. We set $\theta$ to $1/(2\kappa)$ (as suggested in Theorem 4.1); here $\kappa = 1$, so $\theta = 1/2$. 1000 random instances are tested. Figure 1 plots the histograms of the perturbations in both eigenvalues and eigenvectors. As depicted in Figure 1, both types of perturbation are well controlled by the bounds provided in Theorem 4.1.
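The instance generation in this experiment can be sketched as follows; we read $n = 5$, $p = 3$ off the criterion (4.20), and the function name is ours.

```python
import numpy as np
from itertools import permutations

def gen_perturbed_sod(n=5, p=3, lam=300.0, seed=0):
    """T_hat = T + E with T = lam * sum_i e_i^{(x)p} and E a symmetrized
    standard-normal tensor, as in Experiment 1."""
    rng = np.random.default_rng(seed)
    T = np.zeros((n,) * p)
    for i in range(n):
        T[(i,) * p] = lam                      # diagonal SOD tensor
    G = rng.standard_normal((n,) * p)          # i.i.d. N(0, 1) entries
    perms = list(permutations(range(p)))
    E = sum(np.transpose(G, q) for q in perms) / len(perms)   # symmetrize
    return T + E, T, E

T_hat, T, E = gen_perturbed_sod()
```

Each random instance would then be handed to the SROAwCD solver and the recovered eigenpairs compared against the true ones.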

#### Experiment 2

In Theorem 4.1, the parameter $\theta$ is suggested to be set to $1/(2\kappa)$. In this experiment, we compare the performance of SROAwCD with $\theta = 1/(2\kappa)$ and with $\theta = 0$ based on the criterion

$$\left\| T - \sum_{i=1}^5 \hat{\lambda}_i\, \hat{v}_i^{\otimes 3} \right\|_F. \tag{4.20}$$

The tensors are generated in the same way as in the first experiment. In all of the 1000 random cases, the SROAwCD method with $\theta = 1/(2\kappa)$ consistently outperforms the one with $\theta = 0$. This makes intuitive sense: as the recovered pairs $(\hat{\lambda}_i, \hat{v}_i)$ only approximate the underlying truth, setting $\theta = 0$, which forces strict orthogonality, tends to introduce additional errors into the problem.
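Criterion (4.20) is straightforward to evaluate given the recovered pairs; `residual_fro` is our name for this helper.

```python
import numpy as np

def residual_fro(T, lams, vs):
    """Frobenius norm of T - sum_i lams[i] * vs[i]^{(x)p}, i.e. criterion (4.20)."""
    R = T.copy()
    for lam, v in zip(lams, vs):
        B = v
        for _ in range(T.ndim - 1):
            B = np.multiply.outer(B, v)
        R = R - lam * B
    return float(np.linalg.norm(R))   # 2-norm of the flattened residual

# sanity check on an exactly decomposable example
T = np.zeros((2, 2, 2))
T[0, 0, 0], T[1, 1, 1] = 300.0, 300.0
e1, e2 = np.eye(2)
```

The residual is zero exactly when the recovered rank-one sum reproduces $T$, so smaller values indicate better recovery.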

#### Experiment 3

In Theorem 4.1, the parameter $\theta$ is suggested to be set to $1/(2\kappa)$, which depends on $\kappa$. In this experiment, we demonstrate the necessity of this dependency. We consider an SOD tensor whose spectral ratio $\kappa$ is substantially larger than one, and first apply the SROAwCD method with a value of $\theta$ that ignores this ratio. The output is as follows:

 ^λ=(1000.00,189.95,189.95,189.95,189.95)⊤,and ^v1=(1.00,0.00,0.00,0.00,0.00)⊤ ^v2=(0.50,0.00,0.87,0.00,0.00)⊤ ^v3=(0.50,0.00,0.00,0.87,0.00)⊤ ^v4=(0.50,0.87,0.00,