Homotopy Analysis for Tensor PCA

Homotopy Analysis for Tensor PCA

Abstract

Developing efficient and guaranteed nonconvex algorithms has been an important challenge in modern machine learning. Algorithms with good empirical performance such as stochastic gradient descent often lack theoretical guarantees. In this paper, we analyze the class of homotopy or continuation methods for global optimization of nonconvex functions. These methods start from an objective function that is efficient to optimize (e.g. convex), and progressively modify it to obtain the required objective, and the solutions are passed along the homotopy path. For the challenging problem of tensor PCA, we prove global convergence of the homotopy method in the “high noise” regime. The signal-to-noise requirement for our algorithm is tight in the sense that it matches the recovery guarantee for the best degree- sum-of-squares algorithm. In addition, we prove a phase transition along the homotopy path for tensor PCA. This allows us to simplify the homotopy method to a local search algorithm, viz., tensor power iterations, with a specific initialization and a noise injection procedure, while retaining the theoretical guarantees.

1Introduction

Non-convex optimization is a critical component in modern machine learning. Unfortunately, theoretical guarantees for nonconvex optimization have been mostly negative, and the problems are computationally hard in the worst case. Nevertheless, simple local-search algorithms such as stochastic gradient descent have enjoyed great empirical success in areas such as deep learning. As such, recent research efforts have attempted to bridge this gap between theory and practice.

For example, one property that can guarantee the success of local search methods over nonconvex functions is when all local minima are also the global minima. Interestingly, it has been recently proven that many well known nonconvex problems do have this property, under mild conditions. Consequently, local-search methods, which are designed to find a local optimum, automatically achieve global optimality. Examples of such problems include matrix completion [1], orthogonal tensor decomposition [2], phase retrieval [4], complete dictionary learning [5], and so on. However, such a class of nonconvex problems is limited, and there are many practical problems with poor local optima, where local search methods can fail.

The above property, while very helpful, imposes a strong assumption on the nonconvex problem. A less restrictive requirement for the success of local search methods is the ability to initialize local search in the basin of attraction of the global optimum using another polynomial-time algorithm. This approach does not require all the local optima to be of good quality, and thus can cover a broader set of problems. Efficient initialization strategies have recently been developed for many nonconvex problems such as overcomplete dictionary learning [6], tensor decomposition [8], robust PCA [9], mixed linear regression [10] and so on.

Although the list of such tractable nonconvex problems is growing, currently, the initialization algorithms are problem-specific and as such, cannot be directly extended to new problems. An interesting question is whether there exist common principles that can be used in designing efficient initialization schemes for local search methods. In this paper, we demonstrate how a class of homotopy continuation methods may provide such a framework for efficient initialization of local search schemes.

1.1Homotopy Method

The homotopy method is a general and a problem independent technique for tackling nonconvex problems. It starts from an objective function that is efficient to optimize (e.g. convex function), and progressively transforms it to the required objective [11]. Throughout this progression, the solution of each intermediate objective is used to initialize a local search on the next one. A particular approach for constructing this progression is to smooth the objective function. Precisely, the objective function is convolved with the Gaussian kernel and the amount of smoothing is varied to obtain the set of transformations. Intuitively, smoothing “erases wiggles” on the objective surface (which can lead to poor local optima), thereby resulting in a function that is easier to optimize. Global optimality guarantees for the homotopy method have been recently established [12]. However, the assumptions in these results are either too restrictive [12] or extremely difficult to check [13]. In addition, homotopy algorithms are generally slow since local search is repeated within each instantiation of the smoothed objective.

In this paper, we address all the above issues for the nonconvex tensor PCA problem. We analyze the homotopy method and guarantee convergence to global optimum under a set of transparent conditions. Additionally, we demonstrate how the homotopy method can be drastically simplified without sacrificing the theoretical guarantees. Specifically, by taking advantage of the phase transitions in the homotopy path, we can avoid the intermediate transformations of the objective function. In fact, we can start from the extreme case of “easy” (convex) function of the homotopy, and use its solution to initialize local search on the original objective. Thus, we show that the homotopy method can serve as a problem independent principle for obtaining a smart initialization which is then employed in local search methods. Although we limit ourselves to the problem of tensor PCA in this paper, we expect the developed techniques to be applicable for a broader set of nonconvex problems.

1.2Tensor PCA

Tensor PCA problem is an extension of the matrix PCA. The statistical model for tensor PCA was first introduced by [14]. This is a single spike model where the input tensor is a combination of an unknown rank- tensor and a Gaussian noise tensor with for

where is the signal that we would like to recover.

Tensor PCA belongs to the class of “needle in a haystack” or high dimensional denoising problems, where the goal is to separate the unknown signal from a large amount of random noise. Recovery in the high noise regime has intimate connections to computational hardness, and has been extensively studied in a number of settings. For instance, in the spiked random matrix model, the input is an additive combination of an unknown rank- matrix and a random noise matrix. The requirement on the signal-to-noise ratio for simple algorithms, such as principal component analysis (PCA), to recover the unknown signal has been studied under various noise models [15] and sparsity assumptions on the signal vector [17].

Previous algorithms for tensor PCA belong to two classes: local search methods such as tensor power iterations [14], and global methods such as sum of squares [18]. Currently, the best signal-to-noise guarantee is achieved by the sum-of-squares algorithm and the flattening algorithm, which are more expensive compared to power iterations (see Table 1). In this paper, we analyze the Gaussian homotopy method for tensor PCA, and prove that it matches the best known signal-to-noise performance. [18] also showed a lowerbound that no degree- (or lower) sum-of-squares algorithm can achieve better signal-to-noise ratio, implying that our analysis is likely to be tight.

Table 1: Table of comparison of various methods for tensor PCA. Here space does not include the tensor itself. The power method with random initialization was analyzed in . sum-of-squares, Recover and Certify, and flattened tensor were analyzed in .
Method Bound on Time Space
Power method + initialization + noise injection (ours)
Power method, random initialization
Sum-of-Squares
Recover and Certify
Eigendecomposition of flattened matrix
Information-theoretic Exp

1.3Contributions

We analyze a simple variant of the popular tensor power method, which is a local search method for finding the best rank- approximation of the input tensor. We modify it by introducing a specific initialization and injecting appropriate random noise in each iteration. This runs almost in linear time; see Table 1 for more details.

Our algorithm achieves the best possible trade-offs among all known algorithms (see Table 1).

Our algorithm is inspired by the homotopy framework. In particular, we establish a phase transition along the homotopy path.

The above result allows us to skip the intermediate steps in the homotopy path. We only need two end points of the homotopy path: the original objective function with no smoothing and with an infinite amount of smoothing. The optimal solution for the latter can be obtained through any local search method; in fact, in our case, it has a closed form. This serves as initialization for the original objective function. In the proof we also design a new noise injection procedure that breaks the dependency between the steps. This allows for simpler analysis and our algorithm does not rely on the independence conjecture. We discuss this in more detail in Section 3.1.

The comparison of all the current algorithms for tensor PCA is given in Table 1. Note that the space in the table does not include the space for storing the tensor, this is because the more practical algorithms only access the tensor for a very small number of passes, which allows the algorithms to be implemented online and do not need to keep the whole tensor in the memory. We see that our algorithm has the best performance across all the measures. In our synthetic experiments (see Section 5, we find that our method significantly outperforms the other methods: it converges to a better solution faster and with a lower variance.

2Preliminaries

In this section, we formally define the tensor PCA problem and its associated objective function. Then we show how to compute the smoothed versions of these objective functions.

2.1Tensors and Polynomials

Tensors are higher dimensional generalization of matrices. In this paper we focus on 3rd order tensors, which correspond to a 3 dimensional arrays. Given a vector , similar to rank one matrices , we consider rank 1 tensors to be a array whose -th entry is equal to .

For a matrix , we often consider the quadratic form it defines: . Similarly, for a tensor , we define a degree 3 polynomial . This polynomial is just a special trilinear form defined by the tensor. Given three vectors , the trilinear form . Using this trilinear form, we can also consider the tensor as an operator that maps vectors to matrices, or two vectors into a single vector. In particular, is a matrix whose -th entry is equal to where is the -th basis vector. Similarly, is a vector whose -th coordinate is equal to .

Since the tensor we consider is not symmetric ( is not necessarily equal to or other permutations), we also define the symmetric operator

2.2Objective Functions for Tensor PCA

We first define the tensor PCA problem formally.

Similar to the Matrix PCA where we maximize the quadratic form, for tensor PCA we can focus on optimizing the degree 3 polynomial over the unit sphere.

The optimal value of this program is known as the spectral norm of the tensor. It is often solved in practice by tensor power method. [14] noticed that:

Unfortunately, solving this optimization problem is NP-hard in the worst-case [19]. Currently, the best known algorithm uses sum-of-squares hierarchy and works when . There is a huge gap between what’s achievable information theoretically () and what can be achieved algorithmically ().

2.3Gaussian Smoothing for the Objective Function

Guaranteed homotopy methods rely on smoothing the objective function by the Gaussian kernel [11]. More precisely, smoothing the objective (Equation 1) requires convolving it with the Gaussian kernel. Let be a mapping such that

Here, is the Gaussian density function for , satisfying

It is known that convolution of polynomials with the Gaussian kernel has a closed form expression [20]. In particular, the objective function of the Tensor PCA has the following smoothed form.

The proof of this Lemma is based on interpreting the convolution as an expectation . We defer the detailed calculation to Appendix A.1

3Tensor PCA by Homotopy Initialization

In this section we give a simple smart initialization algorithm for tensor PCA. Our algorithm only uses two points in homotopy path – the infinite smoothing and the no smoothing . This is inspired by our full analysis of the homotopy path (see Section ?), where we show there is a phase transition in the homotopy path. When the smoothing parameter is larger than a threshold, the function behaves like the infinite smoothing case; when the smoothing parameter is smaller than the threshold, the function behaves like the no smoothing case.

Recall that the smoothed function is:

with as a vector such that . When , the solution of the smoothed problem has the special form . That is because the term dominates and thus its maximizer under yields .

Note that by Lemma ?, we can compute vector , and we know . Therefore we know can be computed just from the tensor. We use this point as an initialization, and then run power method on the original function. The resulting algorithm is described in Algorithm ?.

In order to analyze the algorithm, we use the following independence condition, which states that the “random”-looking vectors and indeed have some properties satisfied by random vectors:

Note that if in every step of the algorithm, the noise tensor is resampled to be a fresh random tensor, independent of the previous step , then is just a random Gaussian vector. In this case the condition is trivially satisfied. Of course, in reality ’s are dependent on . However, we are able to modify the algorithm by a noise injection procedure, that adds more noise to the tensor , and make the noise tensor “look” as if they were independent. The extra dependency on in Condition ? comes from noise injection procedure and will be more clear in Section 3.1. We will first show the correctness of the algorithm assuming independence condition here, and in Section 3.1 we discuss the noise injection procedure and prove the independence condition.

The main idea is to show the correlation of and increases in every step. In order to do this, first notice that the initial point itself is equal to a normalization of , where the norm of and its correlation with are all bounded by the Independence Condition. It is easy to check that , which is already non-trivial because a random vector would only have correlation around . For the later iterations, let be the vector before normalization and we have . Notice that the first term is in the direction , and the Independence Condition bounds the norm and correlation with for the second term. We can show that the correlation with increases in every iteration, because the initial point already has a large inner product with . The detailed proof is deferred to Appendix A.2.

3.1Noise Injection Procedure

In order to prove the Independence Condition, we slightly modify the algorithm (see Algorithm ?). In particular, we add more noise in every step as follows

  • Get the input tensor ;

  • Draw a sequence of such that ;

  • Let with , run Algorithm ? by using in the -th iteration;

Intuitively, by adding more noise the new noise will overwhelm the original noise , and every time it looks like a fresh random noise. We prove this formally by the following lemma:

This Lemma states that after our noise injection procedure, the tensors look exactly the same as tensors where the noise is sampled independently. The basic idea for this lemma is that for two multivariate Gaussians to have the same distribution, we only need to show that they have the same first and second moments. We defer the details to Appendix A.2.

Using Lemma ? we can create a sequence of such that its noise tensor is redrawn independently and each element is according to . Now, because each behave as if it is drawn independently, we can prove the Independence Condition:

This Lemma is now true because by Lemma ?, we know the noise tensors is independent of . As a result is independent of ! This lemma then follows immediately from standard concentration inequalities. We defer the full proof to Appendix A.2.

The noise injection technique is mostly a technicality that we need in order to make sure different steps are independent. This is standard in analyzing nonconvex optimization algorithms. As an example, previous works on alternating minimization for matrix completion [21] relied on the availability of different subsamples in different iterations to obtain the theoretical guarantees. Our noise injection procedure is very similar, however this is the first application of this idea for the case of Gaussian noise. The main usage of the noise injection is to get rid of the dependence of the noise matrix between different iterations. Moreover, this technique is designed to simplify the proof and rarely used in the real applications. In practice, an algorithm without noise injection, like Algorithm ?, usually performs well enough.

Combining Lemma ? and Theorem ?, we know Algorithm ? solves the tensor PCA problem when .

4Characterizing the Homotopy Path

Figure 1: Phase Transition for a 1-d function
Figure 1: Phase Transition for a 1-d function
Figure 2: Phase Transition for a 1-d function
Figure 2: Phase Transition for a 1-d function
Figure 3: Phase Transition for a 1-d function
Figure 3: Phase Transition for a 1-d function

This section analyzes the behavior of the smoothed objective function as varies. Under a plausible conjecture, we prove that a phase transition occurs: when is large behaves very similarly to and when is small behaves very similarly to . This motivates the algorithms in the previous section, as the phase transition suggests the most important regimes are very large and .

In this section we first describe how the homotopy method works in more details. Then we present an alternative objective function of Tensor PCA and derive its smoothed version. Finally, we prove that when , the smoothed function retains its maximizer around . However, when , the configuration of critical points change, with only one of the critical points being close to the solution . Importantly, we can find our way from the vicinity of toward this critical point via the dominant curvature direction of the function.

4.1Homotopy

In the homotopy method, we start from the maximizer of the function with large amount of smoothing . We earlier denoted this maximizer as . Then we continuously decrease the amount of smoothing , while following the maximizer throughout this process, until reaching . We call the path taken by the maximizer the homotopy path. It is formally defined as follows.

In practice, to search a homotopy path, one computes the initial point by analytical derivation or numerical approximation as and then successively minimizes the smoothed functions over a finite sequence of decreasing numbers to , where is sufficiently large, and . The resulted procedure is listed in Algorithm ?.

4.2Alternative Objective Function and Its Smoothing

Turning a constrained problem into an unconstrained problem can facilitate the computation of the effective gradient and Hessian of . In this section, we consider the alternative objective function: we modify by adding the penalty term :

Thus we consider the following unconstrained optimization problem,

If we fix the magnitude , the function is . The optimizer of this is an increasing function of . Therefore the maximizer of (Equation 2) is exactly in the same direction as the constrained problem (Equation 1). The factor here is just to make sure the optimal solution has roughly unit norm; in practice we can choose any coefficient in front of and the solution will only differ by scaling.

Moreover, note that if in the absence of noise tensor , then

To get the stationary point, we have

Therefore, the new function is defined on and the maximizer of is close to . We also compute the smoothed version of this problem:

The proof of this Lemma is very similar to Lemma ? and is deferred to Appendix A.3.

4.3Phase Transition on the Homotopy Path

Notice that when , the dominating terms in are terms (the only term is a constant). Therefore, forms a quadratic function, so it has a unique global maximizer equal to , denoted as . Notice that this vector has different norm compared to the in previous section.

Before we state the Theorem, we need a counterpart of the Independence Condition. We call this the Strong Independence Conjecture:

Intuitively, this assumes that the noise is not adversarially correlated with the signal on the entire homotopy path. The main difference between the strong independence conjecture and the weak independence conjecture is that they apply to different algorithms with different number of iterations. The strong independence conjecture applies to the general Homotopy method, which may have a large number of iterations, and thus a conjecture that depends on the number of iterations does not provide us any useful properties. We use the strong independence conjecture to analyze the general Homotopy method to gain intuitions in order to design our algorithm. The weak conjecture is for our Algorithm ?, which only has rounds, and can be satisfied using the noise injection technique. Although we cannot use noise injection to prove the strong independence conjecture, similar conjectures are often used to get intuitions about optimization problems [22].

Intuitively, this theorem shows that in the process of homotopy method, if we consider a continuous path in the sense that is close to for all , then (1) at the beginning, is close to ; (2) at some point , is a saddle point in the function and from the saddle point we are very likely to follow the Hessian direction to actually converge to the good local maximizer near the signal. This phenomenon is illustrated in Figure 3:

Figure 3(a) has large smoothing parameter, and the function has a unique local/global maximizer. Figure 3(b) has medium smoothing parameter, the original global maximizer now behaves like a local minimizer in one dimension, but it in general could be a saddle point in high dimensions. The Hessian at this point leads the direction of the homotopy path. In Figure 3(c) the smoothing is small and the algorithm should go to a different maximizer.

Next, we are going to show that, when , starting from , with high probability, a gradient descent algorithm will converge to the local maximizers with and .

In the Algorithm ?, the first step is to compute such that is maximized. That is, the gradient at point should be perpendicular to , i.e. . Henceforth, we have

Since is normalized, we have

Recall that with norm and correlation . For the very first iteration with , we have

By scale analysis, we can get .

Partition the gradient into three parts: , and . Notice that, dominates , . Therefore, in , is the dominant term. Moreover, we have

Therefore, after the first step, the algorithm makes improvement, compared to the initial correlation .

In order to show this for the future steps, we claim that for each step , . If the claim is true, according to Lemma ?, the process can only converge to a good maximizer.

The claim is clearly true for . By induction, assume this is true for , in the next iteration, we partition the gradient into three parts, , and , as before.

For , if dominates , then . Otherwise, due to induction hypothesis, and we have

Therefore, the only part that could decrease the correlation is . However, notice that and . It suffices to show that dominates . To figure out , recall (Equation 3)

where . Recall that , . Henceforth, we have

Therefore, (Equation 4) becomes

If , then the right hand side of (Equation 3) is dominated by . But, notice that if dominates the left hand side of (Equation 3), [Yuan: proof breaks down here, steep descent does not work? but can we find a way to guarantee the length of current solution does not decrease?]

5Experiments

Figure 4: Success probabilities for the algorithms. y axis is n and x axis is \tau. Black means fail.
Figure 4: Success probabilities for the algorithms. axis is and axis is . Black means fail.
Figure 5: Success probabilities for the algorithms. y axis is n and x axis is \tau. Black means fail.
Figure 5: Success probabilities for the algorithms. axis is and axis is . Black means fail.
Figure 6: Success probabilities for the algorithms. y axis is n and x axis is \tau. Black means fail.
Figure 6: Success probabilities for the algorithms. axis is and axis is . Black means fail.
Figure 7: Rate of Convergence. \tau = \alpha n^{\frac 3 4}, x axis is the number of iterations, y axis is the expected correlation with signal \bm v (with variance represented as error bars)
Figure 7: Rate of Convergence. , axis is the number of iterations, axis is the expected correlation with signal (with variance represented as error bars)
Figure 8: Rate of Convergence. \tau = \alpha n^{\frac 3 4}, x axis is the number of iterations, y axis is the expected correlation with signal \bm v (with variance represented as error bars)
Figure 8: Rate of Convergence. , axis is the number of iterations, axis is the expected correlation with signal (with variance represented as error bars)
Figure 9: Rate of Convergence. \tau = \alpha n^{\frac 3 4}, x axis is the number of iterations, y axis is the expected correlation with signal \bm v (with variance represented as error bars)
Figure 9: Rate of Convergence. , axis is the number of iterations, axis is the expected correlation with signal (with variance represented as error bars)

For brevity we refer to our Tensor PCA with homotopy initialization method (Algorithm ?) as HomotopyPCA. We compare that with two other algorithms: the Flatten algorithm and the Power method. The Flatten algorithm was originally proposed by [14], where they show it works when . [18] accelerated the Flatten algorithm to near-linear time, and improved the analysis to show it works when . The Power method is similar to our algorithm, except it does not use intuitions from homotopy, and initialize at a random vector. Note that there are other algorithms proposed in [18], however they are based on the Sum-of-Squares SDP hierarchy, and even the fastest version runs in time (much worse than the algorithms compared here).

We first compare how often these algorithms successfully find the signal vector , given different values of and . The results are in Figure 6, in which -axis represents and -axis represents . We run 50 experiments for each values of , and the grayness in each grid shows how frequent each algorithm succeeds: black stands for “always fail” and white stands “always succeed”. For every algorithm, we say it fails if (1) when it converges, i.e., the result at two consecutive iterations are very close, the correlation with the signal is less than ; (2) the number of iterations exceeds 100. In the experiments for Power Method, we observe there are many cases where situation (1) is true, although our new algorithms can always find the correct solution. In these cases the function indeed have a local maximizer. From Figure 6, our algorithm outperforms both Power Method and the Flatten algorithm in practice. This suggests the constant hiding in our algorithm is possibly smaller.

Next we compare the number of iterations to converge with and , where varies in . In Figure 9, the x-axis is the number of iterations, and the axis is the correlation with the signal (error bars shows the distribution from 50 independent runs). For all , Homotopy PCA performs well — converges in less than iterations and finds the signal . The Power Method converges to a result with good correlations with the signal , but has large variance because it sometimes gets trapped in local optima. As for the Flatten algorithm, the algorithm always converges. However, it takes more iterations compared to our algorithm. Also when is small, the converged result has bad correlation with .


AOmitted Proofs

a.1Omitted Proof in Section

We can write as an expectation

Since is just a degree 3 polynomial, we can expand it and use the lower moments of Gaussian distributions:

Therefore the first part of the lemma holds if we define to be the vector . In order to compute the vector , notice that the term is the linear term on , and it is equal to

This means

a.2Omitted Proof in Section

We first show the initial maximizer already has a nontrivial correlation with . Recall . Note that if is very large such that , then we already have . Later we will show that whenever all later iterations have the same property.

Therefore, we are left with the case when (this implies ). In this case, by Condition ? we know and , therefore

Therefore, . Assume for large enough (where we will later show suffices)

Now let us consider the first step of power method. Let be the vector before normalization. Observe that . By Condition ? we have bounds on and , therefore we have

Note that when and , the first term is much larger than . Hence for the first iteration, we have .

Similar as before, when , we already have . On the other hand, if , in this case, by Condition ? we know . We again have . Therefore, . Combining the bounds for the norm of and its correlation with ,

Therefore, when , the correlation between and is larger than the correlation between and . This shows the first step makes an improvement.

In order to show this for the future steps, we do induction over . The induction hypothesis is for every , either or

Initially, for , we have already proved the induction hypothesis.

Now assume the induction hypothesis is true for . In the next iteration, let be the vector before normalization. Similar as before we have .

When , by Condition ? we know the norm of is much smaller than . Therefore we still have .

In the other case, we follow the same strategy as the first step. By Condition ? we can compute the correlation between and :

For the norm of , notice that the first term has norm , and the second term has norm . Note that these two terms are almost orthogonal by Independence Condition, therefore

If , then , where is a constant that is smaller than when is large enough. Therefore in this case . Thus we successfully recover in the next step.

Otherwise, we know . Then,

If we select , after rounds, we have , therefore we must always be in the first case. As a result .

Note that both distributions are multivariate Gaussians. Therefore we only need to show that they have the same first and second moments.

For the first moment, this is easy, we have and for all .

For the second moment (covariance), we consider the covariance between and . Note that for the distribution , as long as the 4 tuple the correlation is 0. We first show when we have

Hence for these variables the two distributions have the same covariance.

Next we consider the case ,

The covariance for these entries also match.

Finally we need to consider the variance for each entry of and . To do that we compute the Variance of

This is also the same as the variance of . Therefore the two multivariate Gaussians have the same mean and covariance, and must be the same distribution.

Since by Lemma ?, we know the noise tensors used in -th step behave exactly the same as independent Gaussian tensors. The vectors and are therefore spherical Gaussian random variables conditioned on any value of . Therefore we can prove this lemma by standard Gaussian concentration results.

For terms like and , we know the norm of a Gaussian random variable obeys the distribution and is highly concentrated to its expectation. For terms like and , we know they are just Gaussian distributions and is always bounded by with high probability. Therefore we only need to compute the expected norms of these vectors.

Therefore by Claim ? we have with high probability.