Dictionary Learning and Tensor Decomposition via the Sum-of-Squares Method

# Dictionary Learning and Tensor Decomposition via the Sum-of-Squares Method

## Abstract

We give a new approach to the dictionary learning (also known as “sparse coding”) problem of recovering an unknown matrix (for ) from examples of the form

 y=Ax+e,

where is a random vector in with at most nonzero coordinates, and is a random noise vector in with bounded magnitude. For the case , our algorithm recovers every column of within arbitrarily good constant accuracy in time , in particular achieving polynomial time if for any , and time if is (a sufficiently small) constant. Prior algorithms with comparable assumptions on the distribution required the vector to be much sparser—at most nonzero coordinates—and there were intrinsic barriers preventing these algorithms from applying for denser .

We achieve this by designing an algorithm for noisy tensor decomposition that can recover, under quite general conditions, an approximate rank-one decomposition of a tensor , given access to a tensor that is -close to in the spectral norm (when considered as a matrix). To our knowledge, this is the first algorithm for tensor decomposition that works in the constant spectral-norm noise regime, where there is no guarantee that the local optima of and have similar structures.

Our algorithm is based on a novel approach to using and analyzing the Sum of Squares semidefinite programming hierarchy (Parrilo 2000, Lasserre 2001), and it can be viewed as an indication of the utility of this very general and powerful tool for unsupervised learning problems.

\newrefformat

eq(LABEL:*#1) \newrefformatlemLemma LABEL:*#1 \newrefformatdefDefinition LABEL:*#1 \newrefformatthmTheorem LABEL:*#1 \newrefformatcorCorollary LABEL:*#1 \newrefformatchaChapter LABEL:*#1 \newrefformatsecSection LABEL:*#1 \newrefformatappAppendix LABEL:*#1 \newrefformattabTable LABEL:*#1 \newrefformatfigFigure LABEL:*#1 \newrefformathypHypothesis LABEL:*#1 \newrefformatalgAlgorithm LABEL:*#1 \newrefformatremRemark LABEL:*#1 \newrefformatitemItem LABEL:*#1 \newrefformatstepstep LABEL:*#1 \newrefformatconjConjecture LABEL:*#1 \newrefformatfactFact LABEL:*#1 \newrefformatpropProposition LABEL:*#1 \newrefformatprobProblem LABEL:*#1 \newrefformatclaimClaim LABEL:*#1 \newrefformatrelaxRelaxation LABEL:*#1 \newrefformatredReduction LABEL:*#1 \newrefformatpartPart LABEL:*#1

Keywords: sparse coding, dictionary learning, sum-of-squares method, semidefinite programming, machine learning, unsupervised learning, statistical recovery, approximation algorithms, tensor optimization, polynomial optimization.

## 1 Introduction

The dictionary learning (also known as “sparse coding”) problem is to recover an unknown matrix (known as a “dictionary”) from examples of the form

 y=Ax+e, (1.1)

where is sampled from a distribution over sparse vectors in (i.e., with much fewer than nonzero coordinates), and is sampled from a distribution over noise vectors in of some bounded magnitude.

This problem has found applications in multiple areas, including computational neuroscience [OF97, OF96a, OF96b], machine learning [EP07, MRBL07], and computer vision and image processing [EA06, MLB08, YWHM08]. The appeal of this problem is that, intuitively, data should be sparse in the “right” representation (where every coordinate corresponds to a meaningful feature), and finding this representation can be a useful first step for further processing, just as representing sound or image data in the Fourier or wavelet bases is often a very useful preprocessing step in signal or image processing. See [SWW12, AAJ13, AGM13, ABGM14] and the references therein for further discussion of the history and motivation of this problem.

This is a nonlinear problem, as both and are unknown, and dictionary learning is a computationally challenging task even in the noiseless case. When is known, recovering from constitutes the sparse recovery / compressed sensing problem, which has efficient algorithms [Don06, CRT06]. Hence, a common heuristic for dictionary learning is to use alternating minimization, using sparse recovery to obtain a guess for based on a guess of , and vice versa.

Recently there have been several works giving dictionary learning algorithms with rigorous guarantees on their performance [SWW12, AAJ13, AAN13, AGM13, ABGM14]. These works differ in various aspects, but they all share a common feature: they give no guarantee of recovery unless the distribution is over extremely sparse vectors, namely having less than (as opposed to merely ) nonzero coordinates. (There have been other works dealing with the less sparse case, but only at the expense of making strong assumptions on and/or ; see Section 1.3 for more discussion of related works.)

In this work we give a different algorithm that can be proven to approximately recover the matrix even when is much denser (up to coordinates for some small constant in some settings). The algorithm works (in the sense of approximate recovery) even with noise, in the so-called overcomplete case (where ), and without making any incoherence assumptions on the dictionary.

Our algorithm is based on the Sum of Squares (SOS) semidefinite programming hierarchy [Sho87, Nes00, Par00, Las01]. The SOS algorithm is a very natural method for solving non-convex optimization problems that has found applications in a variety of scientific fields, including control theory [HG05], quantum information theory [DPS02], game theory [Par06], formal verification [Har07], and more. Nevertheless, to our knowledge this work provides the first rigorous bounds on the SOS algorithm’s running time for a natural unsupervised learning problem.

### 1.1 Problem definition and conditions on coefficient distribution

In this section we formally define the dictionary learning problem and state our result. We define a -dictionary to be an matrix such that for all , and (where is the identity matrix). The parameter is an analytical proxy for the overcompleteness of the dictionary . In particular, if the columns of are in isotropic position (i.e., is proportional to the identity), then the top eigenvalue of is its trace divided by , which equals because all of the ’s have unit norm.1 In this work we are mostly interested in the case , which corresponds to .

#### Nice distributions

Let be some distribution over the coefficients in (1.1). We will posit some conditions on low-order moments of to allow recovery. Let be some even constant that we will use as a parameter (think of ). Consider a vector with nonzero coordinates. Then and . In other words, if we select three “typical” coordinates , then

 xd/2ixd/2j⩽τxd/2k. (1.2)

Equation (1.2) will motivate us in defining an analytical proxy for the condition that the distribution over coefficients is -sparse.2

Specifically, in the dictionary learning case, since we are interested in learning all column vectors, we want every coordinate to be typical (for example, if the coefficient is always or always , we will not be able to learn the corresponding column vector). Moreover, a necessary condition for recovery is that every pair of coordinates is somewhat typical in the sense that the events that and are nonzero are not perfectly correlated. Indeed, suppose for simplicity that when is nonzero, it is distributed like an independent standard Gaussian. Then if those two events were perfectly correlated, recovery would be impossible since the distribution over examples would be identical if we replaced with any pair of vectors where is a rotation in the plane spanned by .

Given these considerations, if we normalize our distribution so that for all , then it makes sense to assume:3

 Exd/2ixd/2j⩽τ, (1.3)

for all and for some .

We can assume without loss of generality that the marginal distribution is symmetric around zero (namely for all ), since given two samples and we can treat them as a single sample , and the distribution , which is only slightly less sparse (and slightly more noisy), has this property. In particular this means we can assume for every integer and . We will strengthen this condition to assume that

 Exα=0 (1.4)

for every non-square monomial of degree at most . (Here, is a multiindex and denotes the monomial . The degree of is ; we say that is non-square if is not the square of another monomial, i.e.,, if has an odd coordinate.)

We say that a distribution is -nice if it satisfies (1.3) and (1.4).4 One example for a -nice distribution is the Bernoulli-Gaussian distribution, where with the ’s being independent random variables satisfying and the ’s being independent normally distributed random variables (normalized to satisfy ). Indeed, in this case, since (using Cauchy-Schwarz) ,

 Exd/2ixd/2j=(Eyiyj)(Ezd/2izd/2j)⩽τ2(1/τ)=τ.

In fact, we can replace here the normal distribution with any distribution satisfying , and also allow some dependence between the variables (in particular encapsulating the models considered by [AGM13]). As our discussion above and this example demonstrates, the parameter serves as a proxy to the sparsity of , where a -nice distribution roughly corresponds to a distribution having at most coordinates with significant mass. (For technical reasons, our formal definition of nice distributions, Definition 1, is somewhat different but is qualitatively equivalent to the above, see Remark 4.)

###### Remark 1.

Another way to justify this notion of nice distributions is that, as our analysis shows, it is a natural way to ensure that if is a column of the dictionary then the random variable for a random sample from (1.1) will be “spiky” in the sense that it will have a large -norm compared to its -norm. Thus it is a fairly clean way to enable recovery, especially in the setting (such as ours) where we don’t assume orthogonality or even incoherence between the dictionary vectors.

#### Modeling noise

Given a noisy dictionary learning example of the form , one can also view it (assuming we are in the non-degenerate case of having full rank) as for some (whose magnitude is controlled by the norm of and the condition number of ). If has sufficiently small magnitude, and is composed of i.i.d random variables (and even under more general conditions), the distribution will be nice as well. Therefore, we will not explicitly model the noise in the following, but rather treat it as part of the distribution which our definition allows to be only “approximately sparse”.

### 1.2 Our results

Given samples of the form for a -nice , with a sufficiently large constant (corresponding to having nonzero entries), we can approximately recover the dictionary in polynomial time as long as for some , and in quasipolynomial time as long as is a sufficiently small constant. Prior polynomial-time algorithms required the distribution to range over vectors with less than nonzero entries (and it was not known how to improve upon this even using quasipolynomial time).

We define the correlation of a pair of vectors , , to be . We say that two sets of vectors are -close if for every there is such that , and for every there is such that .5

###### Theorem 2 (Dictionary learning).

For every there exists and a polynomial-time algorithm such that for every -dictionary and -nice , given samples from from , outputs with probability at least a set that is -close to .

The hidden constants in the notation may depend on . The algorithm can recover the dictionary vectors even in the relatively dense case when is (a sufficiently small) constant, at the expense of a quasipolynomial (i.e., ) running time. See Theorems 2 and 6 for a precise statement of the dependencies between the constants.

###### Remark 3.

Our algorithm aims to recover the vectors up to -accuracy, with a running time as in a PTAS that depends (polynomially) on in the exponent. Prior algorithms achieving exact recovery needed to assume much stronger conditions, such as incoherence of dictionary columns. Because we have not made incoherence assumptions and have only assumed the signals obey an analytic notion of sparsity, exact recovery is not possible, and there are limitations on how precisely one can recover the dictionary vectors (even information theoretically).

We believe that it is important to understand the extent to which dictionary recovery can be performed with only weak assumptions on the model, particularly given that real-world signals are often only approximately sparse and have somewhat complicated distributions of errors. When stronger conditions are present that make better error guarantees possible, our algorithm can provide an initial solution for local search methods (or other recovery algorithms) to boost the approximate solution to a more precise one. We believe that understanding the precise tradeoffs between the model assumptions, achievable precision, and running time is an interesting question for further research.

We also note that approximate recovery is directly useful in some applications (e.g., for learning applications one might only need to know if a feature, which is modeled by the magnitude of for a dictionary column , is “on” or “off” for a particular sample . By an averaging argument, for a typical sample and feature , the events that us large and is large would have about correlation, where is the approximation we produce for .

Our main tool is a new algorithm for the noisy tensor decomposition problem, which is of interest in its own right. This is the problem of recovering the set of vectors given access to a noisy version of the polynomial in , where is an matrix.6 We give an algorithm that is worse than prior works in the sense that it requires a higher value of , but can handle a much larger level of noise than these previous algorithms. The latter property turns out to be crucial for the dictionary learning application. Our result for noisy tensor decomposition is captured by the following theorem:

###### Theorem 4 (Noisy tensor decomposition).

For every , there exists and a probabilistic -time algorithm such that for every -dictionary , given a polynomial such that

 ∥A⊤u∥dd−τ∥u∥d2⪯P⪯∥A⊤u∥dd+τ∥u∥d2, (1.5)

outputs with probability at least a set that is -close to .

(We denote if is a sum of squares of polynomials. Also, as in Theorem 2, there are certain conditions under which runs in polynomial time; see Section 7.)

The condition (1.5) implies that the input to is -close to the tensor , in the sense that for every unit vector . This allows for very significant noise, since for a typical vector , we expect to be have magnitude roughly which would be much smaller than for every constant . Thus, on most of its inputs, can behave radically differently than , and in particular have many local minima that do not correspond to local minima of the latter polynomial. For this reason, it seems unlikely that one can establish a result such as Theorem 4 using a local search algorithm.7

We give an overview of our algorithm and its analysis in Section 2. Sections 4, 6 and 5 contain the complete formal proofs. In its current form, our algorithm is efficient only in the theoretical/asymptotic sense, but it is very simple to describe (modulo its calls to the SOS solver), see Figure 1. We believe that the Sum of Squares algorithm can be a very useful tool for attacking machine learning problems, yielding a first solution to the problem that can later be tailored and optimized.

### 1.3 Related work

Starting with the work of Olshausen and Field [OF96a, OF96b, OF97], there is a vast body of literature using various heuristics (most commonly alternating minimization) to learn dictionaries for sparse coding, and applying this tool to many applications. Here we focus on papers that gave algorithms with proven performance.

Independent Component Analysis (ICA) [Com94] is one method that can be used for the dictionary learning in the case the random variables are statistically independent. For the case of this was shown in [Com94, FJK96, NR09], while the works [LCC07, GVX14] extend it for the overcomplete (i.e. ) case.

Another recent line of works analyzed different algorithms, which in some cases are more efficient or handle more general distributions than ICA. Spielman, Wang and Wright [SWW12] give an algorithm to exactly recover the dictionary in the case. Agarwal, Anandkumar, Jain, Netrapalli, and Tandon [AAJ13] and Arora, Ge and Moitra [AGM13] obtain approximate recovery in the overcomplete (i.e. ) case, which can be boosted to exact recovery under some additional conditions on the sparsity and dictionary [AAN13, AGM13]. However, all these works require the distribution to be over very sparse vectors, specifically having less than nonzero entries. As discussed in [SWW12, AGM13], sparsity seemed like a natural barrier for this problem, and in fact, Spielman et al [SWW12] proved that every algorithm of similar nature to theirs will fail to recover the dictionary when when the coefficient vector can have coordinates. The only work we know of that can handle vectors of support larger than is the recent paper [ABGM14], but it achieves this at the expense of making fairly strong assumptions on the structure of the dictionary, in particular assuming some sparsity conditions on itself. In addition to the sparsity restriction, all these works had additional conditions on the distribution that are incomparable or stronger than ours, and the works [AAJ13, AGM13, AAN13, ABGM14] make additional assumptions on the dictionary (namely incoherence) as well.

The tensor decomposition problem is also very widely studied with a long history (see e.g., [Tuc66, Har70, Kru77]). Some recent works providing algorithms and analysis include [AFH12, AGM12, BCMV14, BCV14]. However, these works are in a rather different parameter regime than ours— assuming the tensor is given with very little noise (inverse polynomial in the spectral norm), but on the other hand requiring very low order moments (typically three or four, as opposed to the large constant or even logarithmic number we use).

As described in Sections 2 and 2.1 below, the main tool we use is the Sum of Squares (SOS) semidefinite programming hierarchy [Sho87, Nes00, Par00, Las01]. We invoke the SOS algorithm using the techniques and framework introduced by Barak, Kelner and Steurer [BKS14]. In addition to introducing this framework, [BKS14] showed how a somewhat similar technical barrier can be bypassed in a setting related to dictionary learning— the task of recovering a sparse vector that is planted in a random subspace of given a basis for that subspace. Assuming the subspace has dimension at most , [BKS14] showed that the vector can be recovered as long as it has less than nonzero coordinates for some constant , thus improving (for ) on the prior work [DH13] that required the vector to be sparse.

## Organization of this paper

In Section 2 we give a high level overview of our ideas. Sections 46 contain the full proof for solving the dictionary learning and tensor decomposition problems in quasipolynomial time, where the sparsity parameter is a small constant. In Section 7 we show how this can be improved to polynomial time when for some constant .

## 2 Overview of algorithm and its analysis

The dictionary learning problem can be easily reduced to the noisy tensor decomposition problem. Indeed, it is not too hard to show that for an appropriately chosen parameter , given a sufficiently large number of examples from the distribution , the polynomial

 P=1NN∑i=1⟨yi,u⟩2d (2.1)

will be roughly close (in the spectral norm) to the polynomial , where is the “niceness”/“sparsity” parameter of the distribution . Therefore, if we give as input to the tensor decomposition algorithm of Theorem 4, we will obtain a set that is close to the columns of .8

The challenge is that because is a positive constant, no matter how many samples we take, the polynomial will always be bounded away from the tensor . Hence we must use a tensor decomposition algorithm that can handle a very significant amount of noise. This is where the Sum-of-Squares algorithm comes in. This is a general tool for solving systems of polynomial equations [Sho87, Nes00, Par00, Las01]. Given the SOS algorithm, the description of our tensor decomposition algorithm is extremely simple (see Figure 1 below). We now describe the basic facts we use about the SOS algorithm, and sketch the analysis of our noisy tensor decomposition algorithm. See the survey [BS14] and the references therein for more detail on the SOS algorithm, and Sections 4, 5 and 6 for the full description of our algorithm and its analysis (including its variants that take polynomial time at the expense of requiring dictionary learning examples with sparser coefficients).

### 2.1 The SOS algorithm

The SOS algorithm is a method, based on semidefinite programming, for solving a system of polynomial equations. Alas, since this is a non-convex and NP-hard problem, the algorithm doesn’t always succeed in producing a solution. However, it always returns some object, which in some sense can be interpreted as a “distribution” over solutions of the system of equations. It is not an actual distribution, and in particular we cannot sample from and get an individual solution, but we can compute low order moments of . Specifically, we make the following definition:

###### Definition 1 (Pseudo-expectations).

Let denote the ring of polynomials with real coefficients in variables . Let denote the set of polynomials in of degree at most . A degree- pseudoexpectation operator for is a linear operator that maps polynomials in into and satisfies that and for every polynomial of degree at most .

For every distribution over and , the operator defined as for all is degree pseudo-expectation operator. We will use notation that naturally extends the notation for actual expectations. We denote pseudoexpectation operators as , where acts as index to distinguish different operators. If is a degree- pseudoexpectation operator for , we say that is a degree- pseudodistribution for the indeterminates . In order to emphasize or change indeterminates, we use the notation . In case we have only one pseudodistribution for indeterminates , we denote it by . In that case, we also often drop the subscript for the pseudoexpectation and write or for .

We say that a degree- pseudodistribution satisfies the constraint if for all of degree at most . Note that this is a stronger condition than simply requiring . We say that satisfies if it satisfies the constraint where is a sum-of-squares polynomial . It is not hard to see that if was an actual distribution, then these definitions imply that all points in the support of the distribution satisfy the constraints. We write to denote that is a sum of squares of polynomials, and similarly we write to denote .

A degree pseudo-distribution can be represented by the list of values of the expectations of all monomials of degree up to . It can also be written as an matrix whose rows and columns correspond to monomials of degree up to ; the condition that translates to the condition that this matrix is positive semidefinite. The latter observation can be used to prove the fundamental fact about pseudo-distributions, namely that we can efficiently optimize over them. This is captured in the following theorem:

###### Theorem 2 (The SOS Algorithm [Sho87, Nes00, Par00, Las01]).

For every , and -variate polynomials in , whose coefficients are in , if there exists a degree pseudo-distribbution satisfying the constraint for every , then we can find in time a pseudo-distribution satisfying and for every .

Numerical accuracy will never play an important role in our results, and so we can just assume that we can always find in time a degree- pseudo-distribution satisfying given polynomial constraints, if such a pseudo-distribution exists.

### 2.2 Noisy tensor decomposition

Our basic noisy tensor decomposition algorithm is described in Figure 1. This algorithm finds (a vector close to) a column of with inverse polynomial probability. Using similar ideas, one can extend it to an algorithm that outputs all vectors with high probability; we provide the details in Section 6. Following the approach of [BKS14], our analysis of this algorithm proceeds in two phases:

(i)

We show that if the pseudo-distribution obtained in Step 1 is an actual distribution, then the vector output in Step 3 is close to one of the columns of .

(ii)

We then show that the arguments used in establishing (i) generalize to the case of pseudo-distributions as well.

#### Part (i)

The first part is actually not so surprising. For starters, every unit vector that maximizes must be highly correlated with some column of . Indeed, for every column of , and hence the maximum of over a unit is at least . But if for every column then must be much smaller than . Indeed, in this case

 ∥A⊤u∥dd=∑i⟨ai,u⟩d⩽maxi⟨ai,u⟩d−2∑⟨ai,u⟩2. (2.2)

Since , this implies that, as long as , (and thus also ) is much smaller than .

Therefore, if obtained in Step 1 is an actual distribution, then it would be essentially supported on the set of the columns of and their negations. Let us suppose that is simply the uniform distribution over . (It can be shown that this essentially is the hardest case to tackle.) In this case the matrix considered in Step 3 can be written as

 M=1mm∑i=1W(ai)2(ai)(ai)⊤,

where is the polynomial selected in Step 2. (This uses the fact that this polynomial is a product of linear functions and hence satisfies for all .) If satisfies

 |W(a1)|≫√m|W(ai)| (2.3)

for all then is very close to (a constant times) the matrix , and hence its top eigenvector is close to and we would be done. We want to show that the event (2.3) happens with probability at least inverse polynomial in . Recall that is a product of random linear functions for some constant (e.g., will do). That is, , where are standard random Gaussian vectors. Since and these choices are independent, for all . However, with probability it will hold that for all . In this case , while we can show that even conditioned on this event, with high probability we will have for all , in which case (2.3) holds.9

Part (ii). The above argument establishes (i), but this is all based on a rather bold piece of wishful thinking— that the object we obtained in Step 1 of the algorithm was actually a genuine distribution over unit vectors maximizing . In actuality, we can only obtain the much weaker guarantee that is a degree pseudo-distribution for some . (An actual distribution corresponds to a degree- pseudo-distribution.) The technical novelty of our work lies in establishing (ii). The key observation is that in all our arguments above, we never used any higher moments of , and that all the inequalities we showed boil down to the simple fact that a square of a polynomial is never negative. (Such proofs are known as Sum of Squares (SOS) proofs.)

We will not give the full analysis here, but merely show a representative example of how one “lifts” arguments into the SOS setting. In (2.2) above we used the simple inequality that for every vector

 ∥v∥dd⩽∥v∥d−2∞∥v∥22, (2.4)

applying it to the vector (where we denote ). The first (and most major) obstacle in giving a low degree “Sum of Squares” proof for (2.4) is that this is not a polynomial inequality. To turn it into one, we replace the norm with the norm for some large ( will do). If we replace with in (2.4), and raise it to the -th power then we obtain the inequality

 (∥v∥dd)k/(d−2)⩽∥v∥kk(∥v∥22)k/(d−2), (2.5)

which is a valid inequality between polynomials in whenever is an integer multiple of (which we can ensure).

We now need to find a sum-of-squares proof for this inequality, namely that the right-hand side of (2.5) is equal to the left-hand side plus a sum of squares, that is, we are to show that for ,

 (∑ivdi)s⪯(∑iv(d−2)si)(∑iv2i)s.

By expanding the -th powers in this expression, we rewrite this polynomial inequality as

 ∑|α|=s(sα)vdα⪯(∑iv(d−2)si)∑|α|=s(sα)v2α=∑|α|=s(sα)v2α∑iv(d−2)si, (2.6)

where the summations involving are over degree- multiindices , and denotes the multinomial coefficient . We will prove \prefeq:norm-inequality-expanded term by term, i.e., we will show that for every multiindex . Since , it is enough to show that . This is implied by the following general inequality, which we prove in Appendix A:

###### Lemma 3.

Let be polynomials. Suppose . Then, for every multiindex ,

We note that is even, so is a square, as required by the lemma.

For the case that is a power of , the inequality in the lemma follows by repeatedly applying the inequality , which in turn holds because the difference between the two sides equals . As a concrete example, we can derive in this way,

 w31w2=w21⋅w1w2⪯12w41+12w21⋅w22⪯12w41+12(12w41+12w42)⪯w41+w42.

(The first two steps use the inequality . The last step uses that both and are sum of squares.)

Once we have an SOS proof for (2.5) we can conclude that it holds for pseudo-distributions as well, and in particular that for every pseudo-distribution of degree at least satisfying ,

 ~E(∥A⊤u∥dd)k/(d−2)⩽~E∥A⊤u∥kkσk/(d−2).

We use similar ideas to port the rest of the proof to the SOS setting, concluding that whenever is a pseudo-distribution that satisfies and , then with inverse polynomial probability it will hold that

 ~EW(u)2⟨u,a⟩2⩾(1−ε)~EW2 (2.7)

for some column of and that can be made arbitrarily close to . Once we have (2.7), it is not hard to show that the matrix obtained in Step 3 of our algorithm is close to . Hence, we can recover a vector close to by computing the top eigenvector10 of the matrix .

## 3 Preliminaries

We recall some of the notation mentioned above. We use to denote that is a sum of square polynomials. For a vector and , we denote and . For any , a -dictionary is an matrix such that for all and the spectral norm of is at most or ,equivalently, . Two sets are -close in symmetrized Hausdorff distance if for all , , where ; we often drop the qualifier “symmetrized Hausdorff distance” as we will not use another notion of distance between sets of vectors in this paper.

We use the notation of pseudo-expectations and pseudo-distributions from \prefsec:SOS. We now state some basic useful facts about pseudo-distributions, see [BS14, BKS14, BBH12] for a more comprehensive treatment.

One useful property of pseudo-distributions is that we can find actual distribution that match their first two moments.

###### Lemma 1 (Matching first two moments).

Let be a pseudo-distribution over of degree at least . Then we can efficiently sample from a Gaussian distribution11 over such that for every polynomial of degree at most ,

 EQ(ξ)=~EQ(u).
###### Proof.

By shifting, it suffices to restrict attention to the case where for all . Consider the matrix such that . The positivity condition implies that is a positive semidefinite matrix. Therefore, admits a Cholesky factorization . Let be the standard Gaussian distribution on (mean and variance in each coordinate) and consider the Gaussian distribution . We are to show that has the same degree- moments as the pseudo-distribution . Indeed,

 Eξξ⊤=EVζζ⊤V⊤=VV⊤=M=~Euu⊤.

Here, we use that is the identity because is a standard Gaussian vector. ∎

Another property we will use is that we can reweigh a pseudo-distribution by a positive polynomial to obtain a new pseudo-distribution that corresponds to the operation on actual distributions of reweighing the probability of an element proportional to .

###### Lemma 2 (Reweighing).

Let be a degree- pseudo-distribution. Then for every SOS polynomial of degree with , there exists a degree- pseudo-distribution such that for every polynomial of degree at most

 ~E{u′}P(u′)=1~E{u}W(u)~E{u}W(u)P(u).
###### Proof.

The functional is linear and satisfies , and so we just need to verify the positivity property. For every polynomial of degree at most ,

 ~E{u′}P(u′)2=(~E{u}W(u)P(u)2)/(~E{u}W(u))

but since is a sum of squares, is also a sum of squares and hence the denominator of the left-hand side is non-negative, while the numerator is by assumption positive. ∎

## 4 Dictionary Learning

We now state our formal theorem for dictionary learning. The following definition of nice distributions captures formally the conditions needed for recovery. (It is equivalent up to constants to the definition of \prefsec:intro-nice, see \prefrem:niceness-definition below.)

###### Definition 1 (Nice distribution).

Let and with even. A distribution over is -nice if it satisfies the following properties:

1. for all ,

2. for all degree- monomials , and

3. for all non-square degree- monomials .

Here, denotes the monomial . Furthermore, we require that to have polynomial variance so that . To avoid some technical issues, -nice distributions are also assumed to be -nice after rescaling for all even . Concretely, when we say that is a -nice distribution, we also imply that for every positive even , there exists a rescaling factor such that the distribution satisfies the three properties above (plus polynomial variance bound).

Let us briefly discuss the meaning of these conditions. The condition is a weak non-degeneracy condition, ruling out distributions where the main contribution to some low order moments comes from events that happen with super-polynomially small probability. Condition 1 stipulates that we are in the symmetric case, where all coefficients have more or less the same magnitude. (We can remove symmetry by either dropping this condition or allowing the dictionary vectors to have different norms; see \prefrem:different-norms.) Condition 2 captures to a certain extent both the sparsity conditions and that that the random variables and for are not too correlated. Condition 3 stipulates that there is significant “cancellations” between the negative and positive coefficients. While it is satisfied by many natural distributions, it would be good to either show that it can be dropped, or that it is inherently necessary. The requirement of having expectation zero—perfect cancellation—can be somewhat relaxed to having a sufficiently small bound (inverse polynomial in ) on the magnitude of the non-square moments.

We can now state our result for dictionary learning in quasipolynomial time. The result for polynomial time is stated in Section 7.

###### Theorem 2 (Dictionary learning, quasipolynomial time).

There exists an algorithm that for every desired accuracy and overcompleteness solves the following problem for every -nice distribution with and in time : Given samples from a distribution for a -overcomplete dictionary and -nice distribution , output a set of vectors that is -close to the set of columns of (in symmetrized Hausdorff distance).

In the tensor decomposition problem, we are given a polynomial of the form (or equivalently a tensor of the form ) and our goal is to recover the vectors (up to signs). It turns out the heart of the dictionary learning problem is solving a variant of the tensor decomposition problem, where we are not given the polynomial but a polynomial close to it in spectral norm. (The magnitude of this error is related to the niceness of the distribution, which means that we cannot assume it to be arbitrarily small.)

###### Theorem 3 (Noisy tensor decomposition).

There exists an algorithm that for every desired accuracy and overcompleteness solves the following problem for every degree and noise parameter in time : Given a degree- polynomial that is -close to in spectral norm for a -overcomplete dictionary , i.e.,

 ∥A⊤u∥dd+τ∥u∥d2⪰P(u)⪰∥A⊤u∥dd−τ∥u∥d2,

output a set of vectors that is -close to the set of columns of (in symmetrized Hausdorff distance).

###### Remark 4 (Different notions of niceness).

In \prefsec:intro-nice we defined -niceness in a different way. Instead of requiring for every monomial , we only required this condition for some of these monomials, namely monomials of the form . It turns out that these two definitions are equivalent up to a factor in the exponent of . (This loss of a factor of in the exponent is OK, since in our applications will anyway be exponentially small in .) To see the equivalence of the definitions, note that every degree- square monomial involves at least two distinct variables, say and , and therefore where is a monomial of degree (so that ). By Hölder’s Inequality, we can bound its expectation

 Ex2ix2jxα′⩽(Exd/2ixd/2j)4/d(Exβ)(d−4)/d,

for . Since , the Arithmetic-Mean Geometric-Mean Inequality together with our normalization implies

 Exβ⩽∑kβkd⋅Exdk=1,

thus proving that for every degree- square monomial .

### 4.1 Dictionary learning via noisy tensor decompostion

We will prove \prefthm:tensor-decomp (noisy tensor decomposition) in \prefsec:sample and \prefsec:tensor-decomp-proof. At this point, let us see how it yields \prefthm:dict-learn (dictionary learning, quasipolynomial time). The following lemma gives the connection between tensor decomposition and dictionary learning.

###### Lemma 5.

Let be a -nice distribution over and a -overcomplete dictionary. Then,12

 ∥A⊤u∥dd+τσddd∥u∥d2⪰Ex⟨Ax,u⟩d⪰∥A⊤u∥dd.
###### Proof.

Consider the polynomial in the monomial basis for the variables . All coefficients corresponding to non-squared monomials are zero (by the third property of nice istibutions). All other coefficients are nonnegative (by the first and second property of nice distributions). We conclude that is a sum of squares. The relation follows by substituting and using the relation .

For the lower bound, we see that the polynomial is a nonnegative combination of square monomials. Thus, and the desired bound follows by substituting . ∎

###### Proof of \prefthm:dict-learn.

If we take a sufficiently large number of samples from the distribution (e.g., will do), then with high probability every coefficient of the polynomial would be -close to the corresponding coefficient of . Therefore, . Together with \preflem:reduce-tensor it follows that

 ∥A⊤u∥dd+2τσddd∥u∥d2⪰P⪰∥A⊤u∥dd−2τσddd∥u∥d2.

Therefore, we can apply the algorithm in \prefthm:tensor-decomp (noisy tensor decomposition) for noise parameter to obtain a set of unit vectors that is -close to the set of columns of (in symmetrized Hausdorff distance). ∎

## 5 Sampling pseudo-distributions

In this section we will develop an efficient algorithm that behaves in certain ways like a hypothetical sampling procedure for low-degree pseudo-distributions. (Sampling procedures, even inefficient or approximate ones, cannot exist in general for low-degree pseudo-distributions [Gri01, Sch08].) This algorithm will be a key ingredient of our algorithm for \prefthm:tensor-decomp (noisy tensor decomposition, quasipolynomial time).

Here is the property of a sampling procedure that our algorithm mimics: Suppose we have a probability distribution over unit vectors in that satisfies for some unit vector , small , and (so that is very small). This condition implies that if we sample a vector  from the distribution then with probability at least the vector satisfies