# Orthogonalized ALS: A Theoretically Principled Tensor Decomposition Algorithm for Practical Use

Vatsal Sharan
Stanford University
vsharan@stanford.edu

Gregory Valiant
Stanford University
gvaliant@stanford.edu

###### Abstract

The popular Alternating Least Squares (ALS) algorithm for tensor decomposition is efficient and easy to implement, but often converges to poor local optima—particularly when the weights of the factors are non-uniform. We propose a modification of the ALS approach that is as efficient as standard ALS, but provably recovers the true factors with random initialization under standard incoherence assumptions on the factors of the tensor. We demonstrate the significant practical superiority of our approach over traditional ALS for a variety of tasks on synthetic data—including tensor factorization on exact, noisy and over-complete tensors, as well as tensor completion—and for computing word embeddings from a third-order word tri-occurrence tensor.

## 1 Introduction

From a theoretical perspective, tensor methods have become an incredibly useful and versatile tool for learning a wide array of popular models, including topic modeling (Anandkumar et al., 2012), mixtures of Gaussians (Ge et al., 2015), community detection (Anandkumar et al., 2014a), learning graphical models with guarantees via the method of moments (Anandkumar et al., 2014b; Chaganty and Liang, 2014) and reinforcement learning (Azizzadenesheli et al., 2016). The key property of tensors that enables these applications is that tensors have a unique decomposition (decomposition here refers to the most commonly used CANDECOMP/PARAFAC or CP decomposition), under mild conditions on the factor matrices (Kruskal, 1977); for example, tensors have a unique decomposition whenever the factor matrices are full rank. As tensor methods naturally model three-way (or higher-order) relationships, it is not too optimistic to hope that their practical utility will only increase, with the rise of multi-modal measurements (e.g. measurements taken by “Internet of Things” devices) and the numerous practical applications involving high order dependencies, such as those encountered in natural language processing or genomic settings. In fact, we are already seeing exciting applications of tensor methods for analysis of high-order spatiotemporal data (Yu and Liu, 2016), health data analysis (Wang et al., 2015a) and bioinformatics (Colombo and Vlassis, 2015). Nevertheless, to truly realize the practical impact that the current theory of tensor methods portends, we require better algorithms for computing decompositions—practically efficient algorithms that are both capable of scaling to large (and possibly sparse) tensors, and are robust to noise and deviations from the idealized “low-rank” assumptions.

As tensor decomposition is NP-Hard in the worst-case (Shitov, 2016; Hillar and Lim, 2013; Håstad, 1990), one cannot hope for algorithms which always produce the correct factorization. Despite this worst-case impossibility, accurate decompositions can be efficiently computed in many practical settings. Early work from the 1970’s (Leurgans et al., 1993; Harshman, 1970) established a simple algorithm for computing the tensor decomposition (in the noiseless setting) provided that the factor matrices are full rank. This approach, based on an eigendecomposition, is very sensitive to noise in the tensor (as we also show in our experiments), and does not scale well for large, sparse tensors.

Since this early work, much progress has been made. Nevertheless, many of the tensor decomposition algorithms hitherto proposed and employed have strong provable success guarantees but are computationally expensive (though still polynomial time)—either requiring an expensive initialization phase, being unable to leverage the sparsity of the input tensor, or not being efficiently parallelizable. On the other hand, there are also approaches which are efficient to implement, but which fail to compute an accurate decomposition in many natural settings. The Alternating Least Squares (ALS) algorithm (either with random initialization or more complicated initializations) falls in this latter category and is, by far, the most widely employed decomposition algorithm despite its often poor performance and propensity for getting stuck in local optima (which we demonstrate on both synthetic data and real NLP data).

In this paper we propose an alternative decomposition algorithm, “Orthogonalized Alternating Least Squares” (Orth-ALS) which has strong theoretical guarantees, and seems to significantly outperform the most commonly used existing approaches in practice on both real and synthetic data, for a number of tasks related to tensor decomposition. This algorithm is a simple modification of the ALS algorithm to periodically “orthogonalize” the estimates of the factors. Intuitively, this periodic orthogonalization prevents multiple recovered factors from “chasing after” the same true factors, allowing for the avoidance of local optima and more rapid convergence to the true factors.

From the practical side, our algorithm enjoys all the benefits of standard ALS, namely simplicity and computational efficiency/scalability, particularly for very large yet sparse tensors, and noise robustness. Additionally, the speed of convergence and quality of the recovered factors are substantially better than standard ALS, even when ALS is initialized using the more expensive SVD initialization. As we show, on synthetic low-rank tensors, our algorithm consistently recovers the true factors, while standard ALS often falters in local optima and fails both in recovering the true factors and in recovering an accurate low-rank approximation to the original tensor. We also applied Orth-ALS to a large 3-tensor of word co-occurrences to compute “word embeddings”. (Word embeddings are vector representations of words, which can then be used as features for higher-level machine learning; they have rapidly become the backbone of many downstream natural language processing tasks, see e.g. Mikolov et al., 2013b.) The embedding produced by our Orth-ALS algorithm is significantly better than that produced by standard ALS, as we quantify via a near 30% better performance of the resulting word embeddings across standard NLP datasets that test the ability of the embeddings to answer basic analogy tasks (e.g. “puppy is to dog as kitten is to ?”) and semantic word-similarity tasks. Together, these results support our optimism that with better decomposition algorithms, tensor methods will become an indispensable, widely-used data analysis tool in the near future.

Beyond the practical benefits of Orth-ALS, we also consider its theoretical properties. We show that Orth-ALS provably recovers all factors under random initialization for worst-case tensors as long as the tensor satisfies an incoherence property (which translates to the factors of the tensors having small correlation with each other), which is satisfied by random tensors with rank $k = o(d^{0.25})$, where $d$ is the dimension of the tensor. This requirement that $k = o(d^{0.25})$ is significantly worse than the best known provable recovery guarantees for polynomial-time algorithms on random tensors: the recent work of Ma et al. (2016) succeeds even in the over-complete setting with $k = o(d^{1.5})$. Nevertheless, our experiments support our belief that this shortcoming is more a property of our analysis than the algorithm itself. Additionally, for many practical settings, particularly natural language tasks, the rank of the recovered tensor is typically significantly sublinear in the dimensionality of the space, and the benefits of an extremely efficient and simple algorithm might outweigh limitations on the required rank for provable recovery.

Finally, as a consequence of our analysis technique for proving convergence of Orth-ALS, we also improve the known guarantees for another popular tensor decomposition algorithm, the tensor power method. We show that the tensor power method with random initialization converges to one of the factors with small residual error for rank $k = o(d)$, where $d$ is the dimension. We also show that the convergence rate is quadratic in the dimension. Anandkumar et al. (2014c) had previously shown local convergence of the tensor power method with a linear convergence rate (and also showed global convergence via an SVD-based initialization scheme, obtaining the first guarantees for the tensor power method in non-orthogonal settings). Our new results, particularly global convergence from random initialization, provide some deeper insights into the behavior of this popular algorithm.

The rest of the paper is organized as follows. In Section 2 we discuss related work, describe the ALS algorithm and tensor power method, and discuss the shortcomings of both algorithms, particularly for tensors with non-uniform factor weights. Section 3 states the notation. Section 4 introduces and motivates Orth-ALS, and states the convergence guarantees. We state our convergence results for the tensor power method in Section 4.2. The experimental results, on both synthetic data and the NLP tasks, are discussed in Section 5. In Section 6 we illustrate our proof techniques for the special case of orthogonal tensors. Proof details have been deferred to the Appendix. Our code is available at http://web.stanford.edu/~vsharan/orth-als.html.

## 2 Background and Related Work

We begin the section with a brief discussion of related work on tensor decomposition. We then review the ALS algorithm and the tensor power method and discuss their basic properties. Our proposed tensor decomposition algorithm, Orth-ALS, builds on these algorithms.

### 2.1 Related Work on Tensor Decomposition

Though it is not possible for us to do justice to the substantial body of work on tensor decomposition, we will review three families of algorithms which are distinct from alternating minimization approaches such as ALS and the tensor power method. Many algorithms have been proposed for guaranteed decomposition of orthogonal tensors; we refer the reader to Anandkumar et al. (2014b); Kolda and Mayo (2011); Comon et al. (2009); Zhang and Golub (2001). However, obtaining guaranteed recovery of non-orthogonal tensors using algorithms for orthogonal tensors requires converting the tensor into an orthogonal form (known as whitening), which is ill conditioned in high dimensions (Le et al., 2011; Souloumiac, 2009) and is computationally the most expensive step (Huang et al., 2013). Another very interesting line of work on tensor decompositions is to use simultaneous diagonalization and higher order SVD (Colombo and Vlassis, 2016; Kuleshov et al., 2015; De Lathauwer, 2006), but these are not as computationally efficient as alternating minimization. (De Lathauwer (2006) proves unique recovery under very general conditions, but the algorithm is quite complex and requires solving a linear system whose size is prohibitive for large tensors. We ran the simultaneous diagonalization algorithm of Kuleshov et al. (2015) on a dimension 100, rank 30 tensor; the algorithm needed around 30 minutes to run, whereas Orth-ALS converges in less than 5 seconds.) Recently, there has been intriguing work on provably decomposing random tensors using the sum-of-squares approach (Ma et al., 2016; Hopkins et al., 2016; Tang and Shah, 2015; Ge and Ma, 2015). Ma et al. (2016) show that a sum-of-squares based relaxation can decompose highly overcomplete random tensors of rank up to $o(d^{1.5})$. Though these results establish the polynomial learnability of the problem, they are unfortunately not practical.

Very recently, there has been exciting work on scalable tensor decomposition algorithms using ideas such as sketching (Song et al., 2016; Wang et al., 2015b) and contraction of tensor problems to matrix problems (Shah et al., 2015). Also worth noting are recent approaches to speed up ALS via sampling and randomized least squares (Battaglino et al., 2017; Cheng et al., 2016; Papalexakis et al., 2012).

### 2.2 Alternating Least Squares (ALS)

ALS is the most widely used algorithm for tensor decomposition and has been described as the “workhorse” for tensor decomposition (Kolda and Bader, 2009). The algorithm is conceptually very simple: if the goal is to recover a rank-$k$ tensor, ALS maintains a rank-$k$ decomposition specified by three $d \times k$ matrices $\{\hat{A}, \hat{B}, \hat{C}\}$ corresponding to the three modes of the tensor. ALS will iteratively fix two of the three modes, say $\hat{A}$ and $\hat{B}$, and then update $\hat{C}$ by solving a least-squares regression problem to find the best approximation to the underlying tensor $T$ having factors $\hat{A}$ and $\hat{B}$ in the first two modes, namely $\hat{C} \leftarrow \arg\min_{C'} \| T - \sum_{i=1}^{k} \hat{A}_i \otimes \hat{B}_i \otimes C'_i \|_F$. ALS will then continue to iteratively fix two of the three modes, and update the other mode via solving the associated least-squares regression problem. These updates continue until some stopping condition is satisfied, typically when the squared error of the approximation is no longer decreasing, or when a fixed number of iterations have elapsed. The initial factor estimates used in ALS are either chosen uniformly at random, or via a more expensive initialization scheme such as SVD based initialization (Anandkumar et al., 2014c). In the SVD based scheme, the factors are initialized to be the singular vectors of a random projection of the tensor onto a matrix.
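As a minimal sketch of the alternating updates described above (not the authors' released code), the following NumPy snippet performs one ALS sweep on a dense 3rd order tensor. The helper names `unfold`, `khatri_rao` and `als_sweep` are hypothetical, the flattening convention is one of several equivalent choices, and the factor weights are assumed to be absorbed into the columns of the factor being updated.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: column r equals kron(A[:, r], B[:, r])."""
    return np.einsum("ir,jr->ijr", A, B).reshape(A.shape[0] * B.shape[0], -1)

def unfold(T, mode):
    """Matricize T along `mode`; the two remaining modes are flattened in C order."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def als_sweep(T, A, B, C):
    """One ALS pass: update each factor by least squares with the other two fixed."""
    # unfold(T, 0) = A @ khatri_rao(B, C).T, so A solves a linear least-squares problem.
    A = np.linalg.lstsq(khatri_rao(B, C), unfold(T, 0).T, rcond=None)[0].T
    B = np.linalg.lstsq(khatri_rao(A, C), unfold(T, 1).T, rcond=None)[0].T
    C = np.linalg.lstsq(khatri_rao(A, B), unfold(T, 2).T, rcond=None)[0].T
    return A, B, C
```

Because every update is an exact least-squares minimizer, the squared error cannot increase from one sweep to the next, which is why a stalled error is a natural stopping condition. It does not rule out stalling at a local optimum.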

The main advantages of the ALS approach, which have led to its widespread use in practice are its conceptual simplicity, noise robustness and computational efficiency given its graceful handling of sparse tensors and ease of parallelization. There are several publicly available optimized packages implementing ALS, such as Kossaifi et al. (2016); Vervliet et al. (); Bader et al. (2012); Bader and Kolda (2007); Smith and Karypis (); Huang et al. (2014); Kang et al. (2012).

Despite the advantages, ALS does not have any global convergence guarantees and can get stuck in local optima (Comon et al., 2009; Kolda and Bader, 2009), even under very realistic settings. For example, consider a setting where the weights for the factors decay according to a power-law, hence the first few factors have much larger weight than the others. As we show in the experiments (see Fig. 2), ALS fails to recover the low-weight factors. Intuitively, this is because multiple recovered factors will be chasing after the same high weight factor, leading to a bad local optimum.

### 2.3 Tensor Power Method

The tensor power method is a special case of ALS that only computes a rank-1 approximation. The procedure is then repeated multiple times to recover different factors. The factors recovered in different iterations of the algorithm are then clustered to determine the set of unique factors. Different initialization strategies have been proposed for the tensor power method. Anandkumar et al. (2014c) showed that the tensor power method converges locally (i.e. for a suitably chosen initialization) for random tensors with rank $k = o(d^{1.5})$. They also showed that an SVD based initialization strategy gives good starting points and used this to prove global convergence for random tensors with rank $k = O(d)$. However, the SVD based initialization strategy can be computationally expensive, and our experiments suggest that even SVD initialization fails in the setting where the weights decay according to a power-law (see Fig. 2).

In this work, we prove global convergence guarantees with random initializations for the tensor power method for random and worst-case incoherent tensors. Our results also demonstrate how, with random initialization, the tensor power method converges to the factor having the largest product of weight times the correlation of the factor with the random initialization vector. This explains the difficulty of using random initialization to recover factors with small weight. For example, if one factor has weight less than a $1/c$ fraction of the weight of, say, the heaviest $k/2$ factors, then with high probability we would require at least $k^{\Omega(c^2)}$ random initializations to recover this factor. This is because the correlation between random vectors in high dimensions is approximately distributed as a Normal random variable, and if $k/2$ samples are drawn from the standard Normal distribution, the probability that one particular sample is at least a factor of $c$ larger than the other samples scales as roughly $k^{-\Omega(c^2)}$.
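The selection behavior described above can be illustrated with a short sketch of the power iteration on a symmetric 3rd order tensor. This is an illustration under stated assumptions rather than the authors' implementation: the function name `tensor_power_method` is hypothetical, the tensor is dense and symmetric, and the starting vector is passed in explicitly so the outcome is reproducible.

```python
import numpy as np

def tensor_power_method(T, x0, n_iter=30):
    """Rank-1 power iteration x <- T(I, x, x) / ||T(I, x, x)|| on a symmetric 3-tensor."""
    x = np.asarray(x0, dtype=float)
    x = x / np.linalg.norm(x)
    for _ in range(n_iter):
        x = np.einsum("ijk,j,k->i", T, x, x)   # contract the last two modes with x
        x = x / np.linalg.norm(x)
    w = np.einsum("ijk,i,j,k->", T, x, x, x)   # weight estimate T(x, x, x)
    return w, x

# Orthogonal toy tensor with factors e_0..e_3 and weights 1, 2, 3, 4.
T = np.zeros((4, 4, 4))
for i, w_i in enumerate([1.0, 2.0, 3.0, 4.0]):
    T[i, i, i] = w_i

# The start is most correlated with e_0; weight * correlation is largest for e_0,
# so the iteration converges to the lightest factor rather than the heaviest.
w_hat, x_hat = tensor_power_method(T, np.array([0.9, 0.2, 0.2, 0.1]))
```

In this toy example the product of weight and initial correlation is 0.9, 0.4, 0.6 and 0.4 for the four factors, so the iterate settles on the first factor even though its weight is the smallest.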

## 3 Notation

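We state our algorithm and results for 3rd order tensors, and believe the algorithm and analysis techniques should extend easily to higher-order tensors. Given a 3rd order tensor $T \in \mathbb{R}^{d \times d \times d}$, our task is to decompose the tensor into its factor matrices $A$, $B$ and $C$: $T = \sum_{i \in [k]} w_i \, A_i \otimes B_i \otimes C_i$, where $A_i$ denotes the $i$th column of a matrix $A$. Here $w_i \in \mathbb{R}$, $A_i, B_i, C_i \in \mathbb{R}^d$, and $\otimes$ denotes the tensor product: if $a, b, c \in \mathbb{R}^d$ then $a \otimes b \otimes c \in \mathbb{R}^{d \times d \times d}$ and $(a \otimes b \otimes c)_{ijl} = a_i b_j c_l$. We will refer to $w_i$ as the weight of the factor $\{A_i, B_i, C_i\}$. This is also known as CP decomposition. We refer to the dimension of the tensor by $d$ and denote its rank by $k$. We refer to different dimensions of a tensor as the modes of the tensor.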

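We denote $T_{(n)}$ as the mode $n$ matricization of the tensor, which is the flattening of the tensor along the $n$th direction obtained by stacking all the matrix slices together. For example, $T_{(1)}$ denotes flattening of a tensor $T \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ to an $n_1 \times n_2 n_3$ matrix. We denote the Khatri-Rao product of two matrices $A$ and $B$ as $A \odot B$, with $(A \odot B)_i = (A_i \otimes B_i)_{(1)}$, where $(A_i \otimes B_i)_{(1)}$ denotes the flattening of the matrix $A_i \otimes B_i$ into a row vector. For any tensor $T$ and vectors $a, b, c$, we also define $T(a, b, c) = \sum_{i,j,l} T_{ijl} \, a_i b_j c_l$. Throughout, we say $f(n) = \tilde{O}(g(n))$ if $f(n) = O(g(n))$ up to poly-logarithmic factors.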
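The objects introduced in this section (the CP form, matricization, the Khatri-Rao product and the multilinear form) can be written down compactly in NumPy. The sketch below is illustrative only: the function names are hypothetical, modes are indexed from 0, and the order in which the remaining two modes are flattened is an assumed convention that other references may reverse.

```python
import numpy as np

def cp_tensor(w, A, B, C):
    """Build T = sum_r w_r * A_r (x) B_r (x) C_r from weights and factor matrices."""
    return np.einsum("r,ir,jr,kr->ijk", w, A, B, C)

def unfold(T, mode):
    """Matricization along `mode`: rows index that mode, other modes are flattened."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    """Khatri-Rao product: column r is the flattened outer product of A_r and B_r."""
    return np.einsum("ir,jr->ijr", A, B).reshape(A.shape[0] * B.shape[0], -1)

def multilinear(T, a, b, c):
    """T(a, b, c) = sum over all entries of T_ijl * a_i * b_j * c_l."""
    return np.einsum("ijk,i,j,k->", T, a, b, c)
```

With these conventions the first matricization of a CP tensor factors as `unfold(T, 0) == (A * w) @ khatri_rao(B, C).T`, which is the identity that turns each ALS update into an ordinary least-squares problem.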

Though all algorithms in the paper extend to asymmetric tensors, we prove convergence results under the symmetric setting where $A = B = C$. Similar to other works (Tang and Shah, 2015; Anandkumar et al., 2014c; Ma et al., 2016), our guarantees depend on the incoherence of the factor matrices ($c_{\max}$), defined to be the maximum correlation in absolute value between any two factors, i.e. $c_{\max} = \max_{i \neq j} |A_i^T A_j|$. This serves as a natural assumption to simplify the problem, as it is NP-Hard in the worst case. Also, tensors with randomly drawn factors satisfy $c_{\max} \leq \tilde{O}(1/\sqrt{d})$, and our results hold for such tensors.
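The incoherence of a factor matrix is straightforward to compute, and doing so for randomly drawn factors gives a feel for how small it is in high dimensions. The helper name `incoherence` below is hypothetical, and columns are normalized inside the function so that inner products are correlations.

```python
import numpy as np

def incoherence(A):
    """Largest absolute inner product between two distinct unit-normalized columns of A."""
    A = A / np.linalg.norm(A, axis=0, keepdims=True)
    G = np.abs(A.T @ A)          # pairwise absolute correlations
    np.fill_diagonal(G, 0.0)     # ignore each column's correlation with itself
    return G.max()

rng = np.random.default_rng(0)
A_random = rng.standard_normal((2000, 10))   # 10 random factors in dimension 2000
c_max_random = incoherence(A_random)
c_max_orthogonal = incoherence(np.eye(5))
```

For Gaussian columns each pairwise correlation has standard deviation about $1/\sqrt{d}$, so the maximum over all pairs stays within a logarithmic factor of $1/\sqrt{d}$, while orthogonal factors have incoherence exactly zero.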

## 4 The Algorithm: Orthogonalized Alternating Least Squares (Orth-ALS)

In this section we introduce Orth-ALS, which combines the computational benefits of standard ALS and the provable recovery of the tensor power method, while avoiding the difficulties faced by both when factors have different weights. Orth-ALS is a simple modification of standard ALS that adds an orthogonalization step before each set of ALS steps. We describe the algorithm in Algorithm 1. Note that steps 4-6 are just the solution to the least squares problem expressed in compact tensor notation. Similarly, step 9 is the least squares estimate of the weight $w_i$ of each rank-1 component $\hat{A}_i \otimes \hat{B}_i \otimes \hat{C}_i$.

To get some intuition for why the orthogonalization makes sense, let us consider the more intuitive matrix factorization problem, where the goal is to compute the eigenvectors of a matrix. Subspace iteration is a straightforward extension of the matrix power method to recover all eigenvectors at once. In subspace iteration, the matrix of eigenvector estimates is orthogonalized before each power method step (by projecting the second eigenvector estimate orthogonal to the first one and so on), because otherwise all the vectors would converge to the dominant eigenvector. For the case of tensors, the vectors would not all necessarily converge to the dominant factor if the initialization is good, but with high probability a random initialization would drive many factors towards the larger weight factors. The orthogonalization step is a natural modification which forces the estimates to converge to different factors, even if some factors are much larger than the others. It is worth stressing that the orthogonalization step does not force the final recovered factors to be orthogonal (because the ALS step follows the orthogonalization step) and in general the factors output will not be orthogonal (which is essential for accurately recovering the factors).

From a computational perspective, adding the orthogonalization step does not add to the computational cost, as the least squares updates in steps 4-6 of Algorithm 1 involve an extra pseudoinverse term for standard ALS, which evaluates to identity for Orth-ALS and does not have to be computed. The cost of orthogonalization is $O(k^2 d)$, while the cost of computing the pseudoinverse is also $O(k^2 d)$. We also observe significant speedups in terms of the number of iterations required for convergence for Orth-ALS as compared to standard ALS in our simulations on random tensors (see the experiments in Section 5).

#### Variants of Orthogonalized ALS.

Several other modifications to the simple orthogonalization step also seem natural. Particularly for low-dimensional settings, in practice we found that it is useful to carry out orthogonalization for a few steps and then continue with standard ALS updates until convergence (we call this variant Hybrid-ALS). Hybrid-ALS also gracefully reverts to standard ALS in settings where the factors are highly correlated and orthogonalization is not helpful. Our advice to practitioners would be to try Hybrid-ALS first before the fully orthogonalized Orth-ALS, and then tune the number of steps for which orthogonalization takes place to get the best results.
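The orthogonalize-then-update loop and its hybrid variant can be sketched as follows. This is one reading of the description in this section, not the authors' released implementation: the name `orth_als`, the parameter `n_orth` (number of orthogonalized sweeps before reverting to plain ALS), the use of a QR decomposition for the orthogonalization, and the choice to compute all three updates from the same orthonormal estimates are assumptions. The sketch also assumes a dense tensor with rank at most the dimension.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: column r equals kron(A[:, r], B[:, r])."""
    return np.einsum("ir,jr->ijr", A, B).reshape(A.shape[0] * B.shape[0], -1)

def unfold(T, mode):
    """Matricize T along `mode`; the two remaining modes are flattened in C order."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def normalize_columns(M):
    return M / np.linalg.norm(M, axis=0, keepdims=True)

def orth_als(T, k, n_iter=100, n_orth=None, seed=0):
    """Orth-ALS sketch; n_orth < n_iter gives a Hybrid-ALS style schedule."""
    rng = np.random.default_rng(seed)
    A, B, C = (normalize_columns(rng.standard_normal((n, k))) for n in T.shape)
    n_orth = n_iter if n_orth is None else n_orth
    for t in range(n_iter):
        if t < n_orth:
            # Orthogonalization: replace each estimate by the Q factor of its QR
            # decomposition, so later columns are projected away from earlier ones.
            A, B, C = (np.linalg.qr(M)[0] for M in (A, B, C))
            # With orthonormal columns the pseudoinverse term is the identity,
            # so each least-squares update is a single matrix product.
            X = unfold(T, 0) @ khatri_rao(B, C)
            Y = unfold(T, 1) @ khatri_rao(A, C)
            Z = unfold(T, 2) @ khatri_rao(A, B)
        else:
            # Standard ALS least-squares updates (Hybrid-ALS phase).
            X = np.linalg.lstsq(khatri_rao(B, C), unfold(T, 0).T, rcond=None)[0].T
            Y = np.linalg.lstsq(khatri_rao(X, C), unfold(T, 1).T, rcond=None)[0].T
            Z = np.linalg.lstsq(khatri_rao(X, Y), unfold(T, 2).T, rcond=None)[0].T
        A, B, C = normalize_columns(X), normalize_columns(Y), normalize_columns(Z)
    # Weight of each rank-1 component: w_r = T(A_r, B_r, C_r).
    w = np.einsum("ijk,ir,jr,kr->r", T, A, B, C)
    return w, A, B, C
```

Note that the returned factors are the normalized outputs of the last update, not the orthogonalized estimates, so they are not forced to be orthogonal. Setting `n_orth` to a small number (the experiments in Section 5 use five orthogonalized iterations for Hybrid-ALS) hands control back to standard ALS once the estimates have been pushed toward distinct factors.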

### 4.1 Performance Guarantees

We now state the formal guarantees on the performance of Orthogonalized ALS. The specific variant of Orthogonalized ALS that our theorems apply to is a slight modification of Algorithm 1, and differs in that there is a periodic re-randomization of the factors for which our analysis has not yet guaranteed convergence. In our practical implementations, we observe that all factors seem to converge before the first re-randomization would take place, and hence the subsequent re-randomization is unnecessary.

Theorem 4.1. Consider a $d$-dimensional rank $k$ tensor $T = \sum_{i=1}^{k} w_i \, A_i \otimes A_i \otimes A_i$. Let $c_{\max}$ be the incoherence between the true factors and $\gamma$ be the ratio of the largest and smallest weight. Assume , and the estimates of the factors are initialized randomly from the unit sphere. Provided that, at the th step of the algorithm the estimates for all but the first factors are re-randomized, then with high probability the orthogonalized ALS updates converge to the true factors in steps, and the error at convergence satisfies (up to relabelling) and for all $i$.

Theorem 4.1 immediately gives convergence guarantees for random low rank tensors. For random $d$-dimensional tensors, $c_{\max} \leq \tilde{O}(1/\sqrt{d})$; therefore Orth-ALS converges globally with random initialization whenever $k = o(d^{0.25})$. If the tensor has rank much smaller than the dimension, then our analysis can tolerate significantly higher correlation between the factors. In the Appendix, we also prove Theorem 4.1 for the special and easy case of orthogonal tensors, which nevertheless highlights the key proof ideas.

### 4.2 New Guarantees for the Tensor Power Method

As a consequence of our analysis of the orthogonalized ALS algorithm, we also prove new guarantees on the tensor power method. As these may be of independent interest because of the wide use of the tensor power method, we summarize them in this section. We show a quadratic rate of convergence (in $O(\log \log d)$ steps) with random initialization for random tensors having rank $k = o(d)$. This contrasts with the analysis of Anandkumar et al. (2014c), who showed a linear rate of convergence ($O(\log d)$ steps) for random tensors, provided an SVD based initialization is employed.

Theorem 4.2. Consider a $d$-dimensional rank $k$ tensor $T = \sum_{i=1}^{k} w_i \, A_i \otimes A_i \otimes A_i$ with the factors $A_i$ sampled uniformly from the $d$-dimensional sphere. Define $\gamma$ to be the ratio of the largest and smallest weight. Assume and . If the initialization is chosen uniformly from the unit sphere, then with high probability the tensor power method updates converge to one of the true factors (say $A_1$) in steps, and the error at convergence satisfies . Also, the estimate of the weight satisfies .

Theorem 4.2 provides guarantees for random tensors, but it is natural to ask if there are deterministic conditions on the tensors which guarantee global convergence of the tensor power method. Our analysis also allows us to obtain a clean characterization for global convergence of the tensor power method updates for worst-case tensors in terms of the incoherence of the factor matrix—

Theorem 4.3. Consider a $d$-dimensional rank $k$ tensor $T = \sum_{i=1}^{k} w_i \, A_i \otimes A_i \otimes A_i$. Let $c_{\max}$ be the incoherence of the factors and $\gamma$ be the ratio of the largest and smallest weight, and assume . If the initialization is chosen uniformly from the unit sphere, then with high probability the tensor power method updates converge to one of the true factors (say $A_1$) in steps, and the error at convergence satisfies and .

## 5 Experiments

We compare the performance of Orth-ALS, standard ALS (with random and SVD initialization), the tensor power method, and the classical eigendecomposition approach, through experiments on low rank tensor recovery in a few different parameter regimes, on an overcomplete tensor decomposition task and a tensor completion task. We also compare the factorization of Orth-ALS and standard ALS on a large real-world tensor of word tri-occurrence based on the 1.5 billion word English Wikipedia corpus. MATLAB, Python and C code for Orth-ALS and Hybrid-ALS is available at http://web.stanford.edu/~vsharan/orth-als.html.

### 5.1 Experiments on Random Tensors

Recovering low rank tensors: We explore the abilities of Orth-ALS, standard ALS, and the tensor power method (TPM), to recover a low rank (rank $k$) tensor that has been constructed by drawing each of the factors independently and uniformly at random from the $d$-dimensional unit spherical shell. We consider several different combinations of the dimension, $d$, and rank, $k$. We also consider both the setting where all of the factors are equally weighted, as well as the practically relevant setting where the factor weights decay geometrically, and consider the setting where independent Gaussian noise has been added to the low-rank tensor.

In addition to random initialization for standard ALS and the TPM, we also explore SVD based initialization (Anandkumar et al., 2014c) where the factors are initialized via SVD of a projection of the tensor onto a matrix. We also test the classical technique for tensor decomposition via simultaneous diagonalization (Leurgans et al., 1993; Harshman, 1970) (also known as Jennrich’s algorithm, we refer to it as Sim-Diag), which first performs two random projections of the tensor, and then recovers the factors by an eigenvalue decomposition of the projected matrices. This gives guaranteed recovery when the tensors are noiseless and factors are linearly independent, but is extremely unstable to perturbations.

We evaluate the performance in two respects: 1) the ability of the algorithms to recover a low-rank tensor that is close to the input tensor, and 2) the ability of the algorithms to recover accurate approximations of many of the true factors. Fig. 1 depicts the performance via the first metric. We evaluate the performance in terms of the discrepancy between the input low-rank tensor and the low-rank tensor recovered by the algorithms, quantified via the ratio of the Frobenius norm of the residual to the Frobenius norm of the actual tensor: $\|T - \hat{T}\|_F / \|T\|_F$, where $\hat{T}$ is the recovered tensor. Since the true tensor has rank $k$, the inability of an algorithm to drive this error to zero indicates the presence of local optima. Fig. 1 depicts the performance of Orth-ALS, standard ALS with random initialization and the hybrid algorithm that performs Orth-ALS for the first five iterations before reverting to standard ALS (Hybrid-ALS). Tests are conducted in both the setting where factor weights are uniform, as well as a geometric spacing, where the ratio of the largest factor weight to the smallest is 100. Fig. 1 shows that Hybrid-ALS and Orth-ALS have much faster convergence and find a significantly better fit than standard ALS.
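The fit metric used for Fig. 1 is a one-line computation. The helper name `relative_error` below is hypothetical; for a multi-dimensional array, `np.linalg.norm` with default arguments returns the square root of the sum of squared entries, which is the Frobenius norm used here.

```python
import numpy as np

def relative_error(T, T_hat):
    """Frobenius norm of the residual divided by the Frobenius norm of the true tensor."""
    return np.linalg.norm(T - T_hat) / np.linalg.norm(T)

T_true = np.ones((2, 2, 2))
err_perfect = relative_error(T_true, T_true)        # recovered tensor equals the input
err_half = relative_error(T_true, 0.5 * T_true)     # every entry is off by half
```

A value that plateaus well above zero on an exactly low-rank input is the signature of a local optimum, since a rank-$k$ fit with zero error exists by construction.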

Fig. 2 quantifies the performance of the algorithms in terms of the number of the original factors that the algorithms accurately recover. We use standard ALS, Orth-ALS (Algorithm 1), Hybrid-ALS, TPM with random initialization (TPM), ALS with SVD initialization (ALS-SVD), TPM with SVD initialization (TPM-SVD) and the simultaneous diagonalization approach (Sim-Diag). We run TPM and TPM-SVD with 100 different initializations and find a rank $k$ decomposition for ALS, ALS-SVD, Orth-ALS, Hybrid-ALS and Sim-Diag. We repeat the experiment (by sampling a new tensor) 10 times. We perform this evaluation in both the setting where we receive an actual low-rank tensor as input, as well as the setting where each entry of the low-rank tensor has been perturbed by independent Gaussian noise. We can see that Orth-ALS and Hybrid-ALS perform significantly better than the other algorithms and are able to recover all factors in the noiseless case even when the weights are highly skewed. Note that the reason Hybrid-ALS and Orth-ALS fail to recover all factors in the noisy case when the weights are highly skewed is that the magnitude of the noise essentially swamps the contribution from the smallest weight factors.

Recovering over-complete tensors: Overcomplete tensors are tensors with rank higher than the dimension, and have found numerous theoretical applications in learning latent variable models (Anandkumar et al., 2015). Even though orthogonalization cannot be directly applied to the setting where the rank is more than the dimension (as the factors can no longer be orthogonalized), we explore a deflation based approach in this setting. Given a tensor $T$ with dimension $d$ and rank $k > d$, we find a rank $d$ decomposition $T'$ of $T$, subtract $T'$ from $T$, and then compute a rank $d$ decomposition of $T - T'$ to recover the next set of factors. We repeat this process to recover subsequent factors. After every set of factors has been estimated, we also refine the estimates of all factors extracted so far by running an additional ALS step using the current estimates of the extracted factors as the initialization. Fig. 3 plots the number of factors recovered when this deflation based approach is applied to a tensor with a mild power-law distribution on the weights. We can see that Hybrid-ALS is successful at recovering tensors even in the overcomplete setup, and gives an improvement over ALS.

Tensor completion: We also test the utility of orthogonalization on a tensor completion task, where the goal is to recover a large missing fraction of the entries. Fig. 3 suggests Hybrid-ALS gives considerable improvements over standard ALS. Further examining the utility of orthogonalization in this important setting, in theory and practice, would be an interesting direction.

### 5.2 Learning Word Embeddings via Tensor Factorization

A word embedding is a vector representation of words which preserves some of the syntactic and semantic relationships in the language. Current methods for learning word embeddings implicitly (Mikolov et al., 2013b; Levy and Goldberg, 2014) or explicitly (Pennington et al., 2014) factorize some matrix derived from the matrix of word co-occurrences $M$, where $M_{ij}$ denotes how often word $i$ appears with word $j$. We explore tensor methods for learning word embeddings, and contrast the performance of standard ALS and Orthogonalized ALS on standard tasks.

#### Methodology.

We used the English Wikipedia as our corpus, with 1.5 billion words. We constructed a word co-occurrence tensor of the 10,000 most frequent words, where the entry $T_{ijl}$ denotes the number of times the words $i$, $j$ and $l$ appear together in a sliding window of fixed length across the corpus. We consider two different window lengths. Before factoring the tensor, we apply a non-linear element-wise scaling to the tensor. This scaling is known to perform well in practice for co-occurrence matrices (Pennington et al., 2014), and makes some intuitive sense in light of the Zipfian distribution of word frequencies. Following the application of this element-wise nonlinearity, we recover a rank 100 approximation of the tensor using Orth-ALS or ALS.

We concatenate the (three) recovered factor matrices into one matrix and normalize the rows. The th row of this matrix is then the embedding for the th word. We test the quality of these embeddings on two tasks aimed at measuring the syntactic and semantic structure captured by these word embeddings.
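The concatenation and row normalization described above can be sketched as follows. The helper name `build_embeddings` and the list-of-rows representation are hypothetical, and the use of the Euclidean norm for row normalization is an assumption.

```python
import math

def build_embeddings(A, B, C):
    """Concatenate three factor matrices row-wise and L2-normalize each row.

    A, B, C are lists of rows (one row per word, one column per component).
    """
    embeddings = []
    for row_a, row_b, row_c in zip(A, B, C):
        row = list(row_a) + list(row_b) + list(row_c)
        norm = math.sqrt(sum(x * x for x in row))
        # Leave all-zero rows untouched to avoid dividing by zero.
        embeddings.append([x / norm for x in row] if norm > 0 else row)
    return embeddings
```

With rank 100 factors, each word would receive a 300-dimensional unit vector under this scheme.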

We also evaluated the performance of matrix SVD based methods on the task. For this, we built the co-occurrence matrix with a sliding window of length over the corpus. We applied the same non-linear element-wise scaling and performed a rank 100 SVD, and set the word embeddings to be the singular vectors after row normalization.

It is worth highlighting some implementation details for our experiments, as they indicate the practical efficiency and scalability inherited by Orth-ALS from standard ALS. Our experiments were run on a cluster with 8 cores and 48 GB of RAM per core. Most of the runtime was spent reading the tensor: the runtime for Orth-ALS was around 80 minutes, with 60 minutes spent reading the tensor (the runtime for standard ALS was around 100 minutes because it took longer to converge). Since storing a dense representation of the 10,000 × 10,000 × 10,000 tensor is too expensive, we use an optimized ALS solver for sparse tensors (Smith and Karypis; Smith and Karypis, 2015) which also has an efficient parallel implementation.

#### Evaluation: Similarity and Analogy Tasks.

We evaluated the quality of the recovered word embeddings produced by the various methods via their performance on two different NLP tasks for which standard, human-labeled data exists: estimating the similarity between a pair of words, and completing word analogies.

The word similarity tasks (Bruni et al., 2012; Finkelstein et al., 2001) contain word pairs along with human assigned similarity scores, and the objective is to maximize the correlation between the similarity in the embeddings of the two words (according to a similarity metric such as the dot product) and human judged similarity.

The word analogy tasks (Mikolov et al., 2013a; c) present questions of the form “ is to as is to  ?” (e.g. “Paris is to France as Rome is to  ?”). We find the answer to “ is to as is to ” by finding the word whose embedding is the closest to in cosine similarity, where denotes the embedding of the word .
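A minimal sketch of this analogy procedure is given below. It assumes the standard vector-offset query (the embedding of the second word minus that of the first, plus that of the third) and excludes the three question words from the candidates; both choices, along with the names `cosine` and `solve_analogy`, are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def solve_analogy(emb, a, b, c):
    """Answer "a is to b as c is to ?" with the vector-offset rule.

    `emb` maps words to vectors. The query v_b - v_a + v_c and the
    exclusion of the question words are assumptions.
    """
    query = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], query))
```

For example, with toy embeddings in which countries differ from their capitals by a fixed offset, `solve_analogy(emb, "paris", "france", "rome")` returns the word whose vector is best aligned with that offset applied to "rome".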

#### Results.

The performances are summarized in Table 1. The use of Orth-ALS rather than standard ALS leads to significant improvement in the quality of the embeddings as judged by the similarity and analogy tasks. However, the matrix SVD method still outperforms the tensor based methods. We believe that better tensor based approaches (e.g. using better renormalization, additional data, or some other tensor rather than the symmetric tri-occurrence tensor) or a combination of tensor and matrix based methods could improve the quality of word embeddings, and that this is an interesting research direction. Alternatively, it is possible that natural language does not contain sufficiently rich higher-order dependencies among words that appear close together, beyond the co-occurrence structure, to truly leverage the power of tensor methods. Or, perhaps, the two tasks we evaluated on (similarity and analogy) do not require this higher-order structure. In any case, investigating these possibilities seems worthwhile.

## 6 Proof Overview: the Orthogonal Tensor Case

In this section, we will consider Orthogonalized ALS for the special case when the factor matrix of the tensor is an orthogonal matrix. Although this special case is easy and numerous algorithms provably work in this setting, it will serve to highlight the high level analysis approach that we apply to the more general settings.

The analysis of Orth-ALS hinges on an analysis of the tensor power method. For completeness we describe the tensor power method in Algorithm 2. We will go through some preliminaries for our analysis of the tensor power method. Let the iterate of the tensor power method at time be . The tensor power method update equations can be written as (see Anandkumar et al., 2014c)

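$$Z_t = \frac{\sum_{i=1}^{k} w_i \langle Z_{t-1}, A_i \rangle^2 A_i}{\left\| \sum_{i=1}^{k} w_i \langle Z_{t-1}, A_i \rangle^2 A_i \right\|_2} \tag{6.1}$$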

Eq. 6.1 is just the tensor analog of the matrix power method updates. For tensors, the updates are quadratic in the previous inner products, in contrast to matrices where the updates are linear in the inner products in the previous step.
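The update in Eq. 6.1 can be sketched in a few lines. This is a minimal illustration that assumes the factors and weights are given explicitly (in practice only the tensor is available); the function names, the Gaussian initialization, and the fixed step count are choices made here for illustration.

```python
import math
import random

def power_update(z, factors, weights):
    """One update of Eq. 6.1: z <- sum_i w_i <z, A_i>^2 A_i, then normalize."""
    new = [0.0] * len(z)
    for a_i, w_i in zip(factors, weights):
        corr = sum(x * y for x, y in zip(z, a_i))  # inner product <z, A_i>
        for j, a_ij in enumerate(a_i):
            new[j] += w_i * corr * corr * a_ij     # quadratic in the correlation
    norm = math.sqrt(sum(x * x for x in new))
    return [x / norm for x in new]

def tensor_power_method(factors, weights, steps=30, seed=0):
    """Run the update from a random unit vector (illustrative defaults)."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(len(factors[0]))]
    norm = math.sqrt(sum(x * x for x in z))
    z = [x / norm for x in z]
    for _ in range(steps):
        z = power_update(z, factors, weights)
    return z
```

With orthonormal factors, the iterate converges to the factor with the largest initial weighted correlation, which need not be the factor with the largest weight.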

Observe from Algorithm 1 (Orth-ALS) that the ALS steps in steps 4-6 have the same form as tensor power method updates, but on the orthogonalized factors. This is the key idea we use in our analysis of Orth-ALS. Note that the first factor estimate is never affected by the orthogonalization, hence the updates for the first estimated factor have exactly the same form as the tensor power method updates. The subsequent factors have an orthogonalization step before every tensor power method step. This ensures that they never have high correlation with the factors which have already been recovered, as they are projected orthogonal to the recovered factors before each ALS step. We then use the incoherence of the factors to argue that orthogonalization does not significantly affect the updates of the factors which have not been recovered so far, while ensuring that the factors which have already been recovered always have a small correlation with the current estimate.

Note that Eq. 6.1 is invariant with respect to multiplying the weights of all the factors by some constant. Hence for ease of exposition, we assume that all the weights lie in the interval , where . We also define .

Proposition. Consider a -dimensional rank tensor where the factor matrix is orthogonal. Define to be the ratio of the largest and smallest weight. If the initial estimates for all the factors are initialized randomly from the unit sphere and the factors are re-randomized after steps where is an integer, then with high probability the orthogonalized ALS updates converge to the true factors in steps, and the error at convergence satisfies and for all .

###### Proof.

Without loss of generality, we assume that the th recovered factor converges to the th true factor. As mentioned earlier, the iterations for the first factor are the usual tensor power method updates and are unaffected by the remaining factors. Therefore to show that orthogonalized ALS recovers the first factor, we only need to analyze the tensor power method updates. We show that the tensor power method with random initialization converges in steps with probability at least , for any . This implies that Orth-ALS correctly recovers the first factor in steps with probability at least , for any .

The main idea of our proof of convergence of the tensor power method is the following: with decent probability, there is some separation between the correlations of the factors with the random initialization. By the tensor power method updates (Eq. 6.1), this gap is amplified at every stage. We analyze the updates for all the factors together by a simple recursion. We then show that this recursion converges in steps.

Let be the iterate of the tensor power method updates at time . Without loss of generality, we will be proving convergence to the first factor . Let be the correlation of the th factor with , i.e. (note that this should technically be called the inner product, but we refer to it as the correlation). We will refer to as the weighted correlation of the th factor.

The first step of the proof is that with decent probability, there is some separation between the weighted correlation of the factors with the initial random estimate. This is Lemma 6.

Lemma. If for some , then with probability at least , .

The proof of Lemma 6 is a bit technical, but relies on basic concentration inequalities for Gaussians. Then using Eq. 6.1 the correlation at the end of the th time step is given by

$$a_{i,t+1} = \frac{w_i a_{i,t}^2}{\kappa_t}$$

where is the normalizing factor at the th time step.

Because the estimate is normalized at the end of the updates, we only care about the ratio of the correlations of the factors with the estimate rather than the magnitude of the correlations themselves. Hence, it is convenient to normalize all the correlations by the correlation of the largest factor and normalize all the weights by the weight of the largest factor. Therefore, let and . The new update equation for the ratio of correlations is

$$\hat{a}_{i,t+1} = \hat{w}_i \hat{a}_{i,t}^2 \tag{6.2}$$
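For concreteness, Eq. 6.2 follows in one line from the correlation update $a_{i,t+1} = w_i a_{i,t}^2/\kappa_t$. The derivation below assumes the normalized quantities are $\hat{a}_{i,t} = a_{i,t}/a_{1,t}$ and $\hat{w}_i = w_i/w_1$, which is our reading of the definitions given that convergence to the first factor is assumed.

```latex
\hat{a}_{i,t+1}
  = \frac{a_{i,t+1}}{a_{1,t+1}}
  = \frac{w_i a_{i,t}^2 / \kappa_t}{w_1 a_{1,t}^2 / \kappa_t}
  = \frac{w_i}{w_1} \left( \frac{a_{i,t}}{a_{1,t}} \right)^{2}
  = \hat{w}_i \hat{a}_{i,t}^{2}
```

The normalizing factor $\kappa_t$ cancels, which is why only ratios of correlations need to be tracked.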

Our goal is to show that becomes small for all in steps. Instead of separately analyzing the different for different factors , we upper bound for all via a simple recursion. Consider the recursion,

$$\beta_0 = \max_{i \neq 1} \left| \hat{w}_i \hat{a}_{i,0} \right| \tag{6.3}$$

$$\beta_{t+1} = \beta_t^2 \tag{6.4}$$

We claim that for all and . By Eq. 6.3, this is true for by definition. We prove our claim via induction. Assume that for . Note that by Eq. 6.2, . Therefore for all . Hence for all and . Note that as the weights lie in the interval , .

To show convergence, we will now analyze the recursion in Eq. 6.3 and Eq. 6.4. We will show that becomes sufficiently small in steps. Note that and . Therefore for . In another steps, . Hence in steps. As is an upper bound for the correlation of all but the first factor, for all in steps.
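The speed of this recursion is easy to check numerically. Since squaring at every step gives $\beta_t = \beta_0^{2^t}$, the number of steps needed to fall below a target grows only doubly logarithmically in the inverse target. The sketch below counts those steps; the function name and the example values are illustrative assumptions.

```python
def steps_until_small(beta0, target):
    """Count iterations of beta <- beta**2 (Eq. 6.4) until beta <= target."""
    if not 0.0 < beta0 < 1.0:
        raise ValueError("the recursion only contracts for 0 < beta0 < 1")
    beta, steps = beta0, 0
    while beta > target:
        beta *= beta  # beta_{t+1} = beta_t^2
        steps += 1
    return steps
```

For instance, starting from `beta0 = 0.9`, eight squarings are enough to go below `1e-6`, because $0.9^{128} > 10^{-6} > 0.9^{256}$.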

To finish the proof of convergence for the tensor power method, we need to show that the estimate is close to if it has small correlation with every factor other than . Lemma 6 shows that if the ratio of the correlation of every other factor with is small, then the residual error in estimating is also small.

Lemma. Let . Without loss of generality assume convergence to the first factor . Define , the ratio of the correlation of the th and 1st factor with the iterate at time . If , then in the subsequent iteration. Also, if the relative error in the estimation of the weight is at most .

Using Lemma 6, it follows that the estimate and for the factor satisfies and . Hence we have shown that Orth-ALS correctly recovers the first factor.

We now prove that Orth-ALS also recovers the remaining factors. The proof proceeds by induction. We have already shown that the base case is correct and the algorithm recovers the first factor. We next show that if the first factors have converged, then the th factor converges in steps with failure probability at most . The main idea is that because the factors have small correlation with each other, orthogonalization does not affect the factors which have not been recovered, but ensures that the th estimate never has high correlation with the factors which have already been recovered. Recall that we assume without loss of generality that the th recovered factor converges to the th true factor, hence for , where . This is our induction hypothesis, which is true for the base case as we just showed that the tensor power method updates converge with residual error at most .

Let denote the th factor estimate at time and let denote its value at convergence. We will first calculate the effect of the orthogonalization step on the correlation between the factors and the estimate . Let denote an orthogonal basis for . The basis is calculated via QR decomposition, and can be recursively written down as follows,

 ¯Xi=Xi−∑j

Note that the estimate is projected orthogonal to this basis. Define as this orthogonal projection, which can be written down as follows –

 ¯Xm,t=¯Xm,t−∑j

In the QR decomposition algorithm is also normalized to have unit norm but we will ignore the normalization of in our analysis because as before we only consider ratios of correlations of the true factors with , which is unaffected by normalization.
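The projection step can be sketched with Gram-Schmidt, which yields the same basis as a QR decomposition up to normalization and sign. As in the analysis above, the sketch skips the normalization of the basis vectors. The function names are hypothetical and plain Python lists stand in for vectors.

```python
def project_orthogonal(x, basis):
    """Remove from x its components along each vector of an orthogonal basis."""
    result = list(x)
    for q in basis:
        q_norm_sq = sum(v * v for v in q)
        if q_norm_sq == 0.0:
            continue  # skip degenerate directions
        coeff = sum(a * b for a, b in zip(result, q)) / q_norm_sq
        result = [a - coeff * b for a, b in zip(result, q)]
    return result

def orthogonal_basis(vectors):
    """Gram-Schmidt: each vector is projected orthogonal to the earlier ones."""
    basis = []
    for v in vectors:
        basis.append(project_orthogonal(v, basis))
    return basis
```

In the setting of the proof, `vectors` would hold the factor estimates that have already converged, and `project_orthogonal` would be applied to the current estimate before each ALS step.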

We will now analyze the orthogonal basis . The key idea is that the orthogonal basis is close to the original factors as the factors are incoherent. Lemma 6 proves this claim.

Lemma. Consider a stage of the Orthogonalized ALS iterations when the first factors have converged. Without loss of generality let , where. Let denote an orthogonal basis for calculated using Eq. 6.5. Then,

1. and .

2. .

3. .

Using Lemma 6, we will find the effect of orthogonalization on the correlations of the factors with the iterate . At a high level, we need to show that the iterations for the factors are not much affected by the orthogonalization, while the correlations of the factors with the estimate are ensured to be small. Lemma 6 is the key tool to prove this, as it shows that the orthogonalized basis is close to the true factors.

We will now analyze the inner product between and factor . This is given by

 ATi¯Xm,t=ATiXm,t−∑j

As earlier, we normalize all the correlations by the correlation of the largest factor. Let be the ratio of the correlations of and with the orthogonalized estimate at time . We can write as

 ¯ai,t=ATiXm,t−∑j

We can multiply both sides by and substitute from Lemma 6 and then rewrite as follows:

 ^wi¯ai,t=^wiATiXm,t−∑j

We divide the numerator and denominator by to derive an expression in terms of the ratios of correlations. Let .

 ^wi¯ai,t=^wi^ai,t−∑j

We now need to show is small for all and is close to , the weighted correlation before orthogonalization, for all . Lemma 6 proves this, and shows that the weighted correlation of factors which have not yet been recovered, , is not much affected by orthogonalization, but the weighted correlations of the factors which have already been recovered, , are ensured to be small after the orthogonalization step.

Lemma. Let at the end of the th iteration. Let be the ratio of the correlation of the th and the th factor with , the iterate at time after the orthogonalization step. Then,

1. .

2. .

We are now ready to analyze the Orth-ALS updates for the th factor. First, we argue about the initialization step. Lemma 6 shows that an orthogonalization step performed after a random initialization ensures that the factors which have already been recovered have small correlation with the orthogonalized initialization. This is where we need a periodic re-randomization of the factors which have not converged so far.

Lemma. Let be initialized randomly and the result be projected orthogonal to the previously estimated factors, let these be without loss of generality. Then with high probability. Also, with failure probability at most , after the orthogonalization step.

Lemma 6 shows that with high probability, the initialization for the th recovered factor has the largest weighted correlation with a factor which has not been recovered so far after the orthogonalization step. It also shows that the separation condition in Lemma 6 is satisfied for all remaining factors with probability .

Now, we combine the effects of the tensor power method step and the orthogonalization step for subsequent iterations to show that converges to . Consider a tensor power method step followed by an orthogonalization step. By our previous argument about the convergence of the tensor power method, if at some time , then for after a tensor power method step. Lemma 6 shows that the correlation of all factors other than the th factor is still small after the orthogonalization step, if it was small before. Combining the effect of the orthogonalization step via Lemma 6, if for some time , then for after both the tensor power method and the orthogonalization steps. By also using Lemma 6 for the initialization, we can now write the updated combined recursion analogous to Eq. 6.3 and Eq. 6.4, which combines the effect of the tensor power method step and the orthogonalization step.

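$$\beta_0 = \max_{i \neq 1} \left| \hat{w}_i \hat{a}_{i,0} \right| \tag{6.6}$$

$$\beta_{t+1} = \beta_t^2 \left( 1 + \frac{1}{k^{1+\epsilon}} \right) \tag{6.7}$$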

By the previous argument, . Note that by Lemma 6. By expanding the recursion 6.7, . Hence in steps as was the case for the analysis for the tensor power method. This shows that the correlation of the estimate with all factors other than becomes small in . We now again use Lemma 6 to argue that this implies that the recovery error is small, i.e. and .
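To make the expansion of recursion 6.7 explicit, write $c = 1 + 1/k^{1+\epsilon}$ (a shorthand introduced here, not notation from the proof). Unrolling the recursion gives the following closed form, which is our own worked step and follows directly from Eq. 6.7.

```latex
\beta_1 = c\,\beta_0^{2}, \qquad
\beta_2 = c\,(c\,\beta_0^{2})^{2} = c^{3}\beta_0^{4}, \qquad \ldots, \qquad
\beta_t = c^{\,2^{t}-1}\,\beta_0^{\,2^{t}} \;\le\; (c\,\beta_0)^{2^{t}}
```

The last inequality uses $c \ge 1$. As long as $c\,\beta_0$ is bounded away from 1, the perturbed recursion therefore decays doubly exponentially, just like the unperturbed recursion in Eq. 6.4.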

Hence we have shown that if the first factors have converged to where then the th factor converges to where in steps with probability at least . This completes the inductive step.

We can now do a union bound to argue that each factor converges with error at most in with overall failure probability at most . This finishes the proof of convergence of Orth-ALS for the special case of orthogonal tensors.

## 7 Conclusion

Our results suggest the theoretical and practical benefits of Orthogonalized ALS, versus standard ALS. An interesting direction for future work would be to more thoroughly examine the practical and theoretical utility of orthogonalization for other tensor-related tasks, such as tensor completion. Additionally, it seems worthwhile to investigate Orthogonalized ALS or Hybrid ALS in more application-specific domains, such as natural language processing.

## References

• Anandkumar et al. (2012) Animashree Anandkumar, Yi-kai Liu, Daniel J Hsu, Dean P Foster, and Sham M Kakade. A spectral algorithm for latent dirichlet allocation. In Advances in Neural Information Processing Systems, pages 917–925, 2012.
• Anandkumar et al. (2014a) Animashree Anandkumar, Rong Ge, Daniel Hsu, and Sham M Kakade. A tensor approach to learning mixed membership community models. The Journal of Machine Learning Research, 15(1):2239–2312, 2014a.
• Anandkumar et al. (2014b) Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832, 2014b.
• Anandkumar et al. (2014c) Animashree Anandkumar, Rong Ge, and Majid Janzamin. Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates. arXiv preprint arXiv:1402.5180, 2014c.
• Anandkumar et al. (2015) Animashree Anandkumar, Rong Ge, and Majid Janzamin. Learning overcomplete latent variable models through tensor methods. In Proceedings of The 28th Conference on Learning Theory, pages 36–112, 2015.
• Azizzadenesheli et al. (2016) Kamyar Azizzadenesheli, Alessandro Lazaric, and Animashree Anandkumar. Reinforcement learning of POMDPs using spectral methods. In 29th Annual Conference on Learning Theory, pages 193–256, 2016.
• Bader and Kolda (2007) Brett W. Bader and Tamara G. Kolda. Efficient MATLAB computations with sparse and factored tensors. SIAM Journal on Scientific Computing, 30(1):205–231, December 2007. doi: 10.1137/060676489.
• Bader et al. (2012) Brett W. Bader, Tamara G. Kolda, et al. Matlab tensor toolbox version 2.5. Available online, January 2012.
• Battaglino et al. (2017) Casey Battaglino, Grey Ballard, and Tamara G Kolda. A practical randomized CP tensor decomposition. arXiv preprint arXiv:1701.06600, 2017.
• Bruni et al. (2012) Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 136–145. Association for Computational Linguistics, 2012.
• Chaganty and Liang (2014) Arun Tejasvi Chaganty and Percy Liang. Estimating latent-variable graphical models using moments and likelihoods. In ICML, pages 1872–1880, 2014.
• Cheng et al. (2016) Dehua Cheng, Richard Peng, Yan Liu, and Ioakeim Perros. SPALS: Fast alternating least squares via implicit leverage scores sampling. In Advances In Neural Information Processing Systems, pages 721–729, 2016.
• Colombo and Vlassis (2015) Nicolo Colombo and Nikos Vlassis. FastMotif: spectral sequence motif discovery. Bioinformatics, 31(16):2623, 2015.
• Colombo and Vlassis (2016) Nicolo Colombo and Nikos Vlassis. Tensor decomposition via joint matrix schur decomposition. In Proceedings of The 33rd International Conference on Machine Learning, pages 2820–2828, 2016.
• Comon et al. (2009) Pierre Comon, Xavier Luciani, and André LF De Almeida. Tensor decompositions, alternating least squares and other tales. Journal of chemometrics, 23(7-8):393–405, 2009.
• De Lathauwer (2006) Lieven De Lathauwer. A link between the canonical decomposition in multilinear algebra and simultaneous matrix diagonalization. SIAM journal on Matrix Analysis and Applications, 28(3):642–666, 2006.
• Duembgen (2010) Lutz Duembgen. Bounding standard gaussian tail probabilities. arXiv preprint arXiv:1012.2063, 2010.
• Finkelstein et al. (2001) Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web, pages 406–414. ACM, 2001.
• Ge and Ma (2015) Rong Ge and Tengyu Ma. Decomposing overcomplete 3rd order tensors using sum-of-squares algorithms. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, page 829, 2015.
• Ge et al. (2015) Rong Ge, Qingqing Huang, and Sham M Kakade. Learning mixtures of gaussians in high dimensions. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, pages 761–770. ACM, 2015.
• Harshman (1970) Richard A Harshman. Foundations of the parafac procedure: Models and conditions for an "explanatory" multi-modal factor analysis. 1970.
• Håstad (1990) Johan Håstad. Tensor rank is NP-Complete. Journal of Algorithms, 11(4):644–654, 1990.
• Hillar and Lim (2013) Christopher J Hillar and Lek-Heng Lim. Most tensor problems are NP-Hard. Journal of the ACM (JACM), 60(6):45, 2013.
• Hopkins et al. (2016) Samuel B Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, pages 178–191. ACM, 2016.
• Huang et al. (2013) Furong Huang, UN Niranjan, Mohammad Umar Hakeem, and Animashree Anandkumar. Fast detection of overlapping communities via online tensor methods. arXiv preprint arXiv:1309.0787, 2013.
• Huang et al. (2014) Furong Huang, Sergiy Matusevych, Anima Anandkumar, Nikos Karampatziakis, and Paul Mineiro. Distributed latent dirichlet allocation via tensor factorization. In NIPS Optimization Workshop, 2014.
• Kang et al. (2012) U Kang, Evangelos Papalexakis, Abhay Harpale, and Christos Faloutsos. Gigatensor: scaling tensor analysis up by 100 times-algorithms and discoveries. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 316–324. ACM, 2012.
• Kolda and Bader (2009) Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
• Kolda and Mayo (2011) Tamara G Kolda and Jackson R Mayo. Shifted power method for computing tensor eigenpairs. SIAM Journal on Matrix Analysis and Applications, 32(4):1095–1124, 2011.
• Kossaifi et al. (2016) Jean Kossaifi, Yannis Panagakis, and Maja Pantic. Tensorly: Tensor learning in python. arXiv preprint arXiv:1610.09555, 2016.
• Kruskal (1977) Joseph B Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear algebra and its applications, 18(2):95–138, 1977.
• Kuleshov et al. (2015) Volodymyr Kuleshov, Arun Tejasvi Chaganty, and Percy Liang. Tensor factorization via matrix factorization. In AISTATS, 2015.
• Le et al. (2011) Quoc V Le, Alexandre Karpenko, Jiquan Ngiam, and Andrew Y Ng. ICA with reconstruction cost for efficient overcomplete feature learning. In Advances in Neural Information Processing Systems, pages 1017–1025, 2011.
• Leurgans et al. (1993) SE Leurgans, RT Ross, and RB Abel. A decomposition for three-way arrays. SIAM Journal on Matrix Analysis and Applications, 14(4):1064–1083, 1993.
• Levy and Goldberg (2014) Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185, 2014.
• Ma et al. (2016) Tengyu Ma, Jonathan Shi, and David Steurer. Polynomial-time tensor decompositions with sum-of-squares. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 438–446. IEEE, 2016.
• Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.
• Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013b.
• Mikolov et al. (2013c) Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751, 2013c.
• Papalexakis et al. (2012) Evangelos E Papalexakis, Christos Faloutsos, and Nicholas D Sidiropoulos. Parcube: Sparse parallelizable tensor decompositions. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 521–536. Springer, 2012.
• Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
• Shah et al. (2015) Parikshit Shah, Nikhil Rao, and Gongguo Tang. Sparse and low-rank tensor decomposition. In Advances in Neural Information Processing Systems, pages 2548–2556, 2015.
• Shitov (2016) Yaroslav Shitov. How hard is the tensor rank? arXiv preprint arXiv:1611.01559, 2016.
• Smith and Karypis Shaden Smith and George Karypis. SPLATT: The Surprisingly ParalleL spArse Tensor Toolkit.
• Smith and Karypis (2015) Shaden Smith and George Karypis. DMS: Distributed sparse tensor factorization with alternating least squares. Technical report, 2015.
• Song et al. (2016) Zhao Song, David Woodruff, and Huan Zhang. Sublinear time orthogonal tensor decomposition. In Advances in Neural Information Processing Systems, pages 793–801, 2016.
• Souloumiac (2009) Antoine Souloumiac. Joint diagonalization: Is non-orthogonal always preferable to orthogonal? In 2009 3rd IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2009.
• Tang and Shah (2015) Gongguo Tang and Parikshit Shah. Guaranteed tensor decomposition: A moment approach. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1491–1500, 2015.
• Vervliet et al. N. Vervliet, O. Debals, L. Sorber, M. Van Barel, and L. De Lathauwer. Tensorlab 3.0, Mar. . Available online.
• Wang et al. (2015a) Yichen Wang, Robert Chen, Joydeep Ghosh, Joshua C Denny, Abel Kho, You Chen, Bradley A Malin, and Jimeng Sun. Rubik: Knowledge guided tensor factorization and completion for health data analytics. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1265–1274. ACM, 2015a.
• Wang et al. (2015b) Yining Wang, Hsiao-Yu Tung, Alexander J Smola, and Anima Anandkumar. Fast and guaranteed tensor decomposition via sketching. In Advances in Neural Information Processing Systems, pages 991–999, 2015b.
• Yu and Liu (2016) Rose Yu and Yan Liu. Learning from multiway data: Simple and efficient tensor regression. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16), pages 238–247, 2016.
• Zhang and Golub (2001) Tong Zhang and Gene H Golub. Rank-one approximation to high order tensors. SIAM Journal on Matrix Analysis and Applications, 23(2):534–550, 2001.

## Appendix A Global convergence of the tensor power method for incoherent tensors

In this section, we will analyze the tensor power method updates for worst-case incoherent tensors. This is a necessary step before analyzing Orth-ALS, because as was pointed out in the proof of convergence of Orth-ALS in the orthogonal tensor case, analyzing Orth-ALS updates reduces to analyzing a perturbed version of the tensor power method updates. Our convergence results for the tensor power method are interesting independent of Orth-ALS though, as they prove global convergence under random initialization. The proof idea is similar to the proof of convergence of the tensor power method in the orthogonal case, but we now need to analyze the cross-terms which come in because the factors are no longer orthogonal.

###### Proof.

Without loss of generality, we will prove convergence to the first factor . The proof is similar in spirit to the proof of convergence of the tensor power method in the orthogonal case in Section 6.

As in the orthogonal case, Lemma 6 states that with high probability there is some separation between the weighted correlation of the largest and second largest factors.

We normalize all the correlations by the correlation of the largest factor, letting , and normalize all the weights by the weight of the largest factor, . The new update equations in terms of the ratio of correlations become

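$$\hat{a}_{i,t+1} = \frac{\hat{w}_i \hat{a}_{i,t}^2 + c_{i,1} + \sum_{j : j \notin \{i,1\}} c_{i,j} \hat{w}_j \hat{a}_{j,t}^2}{1 + \sum_{j : j \neq 1} c_{1,j} \hat{w}_j \hat{a}_{j,t}^2} \tag{A.1}$$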

Notice that we have cross terms in Eq. A.1 as compared to Eq. 6.2 in the orthogonal case, due to the correlation between the factors being non-zero. The goal of the analysis for the non-orthogonal case is to bound these cross-terms using the incoherence between the factors.
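The origin of the cross-terms can be seen by expanding one power method update. The derivation below assumes that $c_{i,j} = \langle A_i, A_j \rangle$ denotes the correlation between factors $i$ and $j$, and that the factors have unit norm so that $c_{i,i} = 1$. Both are our reading of the notation rather than definitions stated in this section.

```latex
a_{i,t+1}
  = \frac{1}{\kappa_t} \sum_{j=1}^{k} c_{i,j}\, w_j a_{j,t}^{2}
  = \frac{1}{\kappa_t} \Big( w_i a_{i,t}^{2} + c_{i,1}\, w_1 a_{1,t}^{2}
      + \sum_{j \notin \{i,1\}} c_{i,j}\, w_j a_{j,t}^{2} \Big),
\qquad
a_{1,t+1}
  = \frac{1}{\kappa_t} \Big( w_1 a_{1,t}^{2}
      + \sum_{j \neq 1} c_{1,j}\, w_j a_{j,t}^{2} \Big)
```

Taking the ratio $a_{i,t+1}/a_{1,t+1}$ and dividing the numerator and denominator by $w_1 a_{1,t}^2$ yields Eq. A.1. Setting $c_{i,j} = 0$ for $i \neq j$ recovers Eq. 6.2.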

As in the orthogonal case, we will analyze all the correlations via a single recursion. We define in the non-orthogonal case keeping in mind the cross-terms because of the correlations between the factors being non-zero.

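$$\beta_0 = \max_{i \neq 1} \left| w_i \hat{a}_{i,0} \right| \tag{A.2}$$

$$\beta_{t+1} = \gamma c_{\max} + \beta_t^2 + 3 \gamma k c_{\max} \beta_t^2 \tag{A.3}$$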

We now show that and all .

If