Better Agnostic Clustering via Relaxed Tensor Norms
We develop a new family of convex relaxations for $k$-means clustering based on sum-of-squares norms, a relaxation of the injective tensor norm that is efficiently computable using the Sum-of-Squares algorithm. We give an algorithm based on this relaxation that recovers a faithful approximation to the true means in the given data whenever the low-degree moments of the points in each cluster have bounded sum-of-squares norms.
We then prove a sharp upper bound on the sum-of-squares norms for moment tensors of any distribution that satisfies the Poincaré inequality. The Poincaré inequality is a central inequality in probability theory, and a large class of distributions satisfy it, including Gaussians, product distributions, strongly log-concave distributions, and any sum or Lipschitz transformation of such distributions.
As an immediate corollary, for any $\gamma > 0$, we obtain an efficient algorithm for learning the means of a mixture of $k$ arbitrary Poincaré distributions in $\mathbb{R}^d$ in time $d^{O(1/\gamma)}$ so long as the means have separation $\Omega(k^{\gamma})$. This in particular yields an algorithm for learning Gaussian mixtures with separation $\Omega(k^{\gamma})$, thus partially resolving an open problem of Regev and Vijayaraghavan [RV17].
Our algorithm works even in the robust setting where an $\epsilon$ fraction of arbitrary outliers are added to the data, as long as $\epsilon$ is smaller than the fraction of points in the smallest cluster. We therefore obtain results in the strong agnostic setting where, in addition to not knowing the distribution family, the data itself may be arbitrarily corrupted.
Progress on many fundamental unsupervised learning tasks has required circumventing a plethora of intractability results by coming up with natural restrictions on input instances that preserve some essential character of the problem. For example, while $k$-means clustering is NP-hard in the worst case [MNV09], there is an influential line of work providing spectral algorithms for clustering mixture models satisfying appropriate assumptions [AM05, KK10, AS12]. On the flip side, we run the risk of developing algorithmic strategies that exploit strong assumptions in a way that makes them brittle. We are thus forced to walk the tight rope of avoiding computational intractability without “overfitting” our algorithmic strategies to idealized assumptions on input data.
Consider, for example, the problem of clustering data into $k$ groups. On the one hand, a line of work leading to [AS12] shows that a variant of spectral clustering can recover the underlying clustering so long as each cluster has bounded covariance around its center and the cluster centers are separated by at least $\tilde{\Omega}(\sqrt{k})$ times the maximum standard deviation. Known results can improve on this bound to require a separation of only $\tilde{\Omega}(k^{1/4})$ if the cluster distributions are assumed to be isotropic and log-concave [VW02]. If the cluster means are in general position, other lines of work yield results for Gaussians [KMV10, MV10, BS10, HK13, BCV14, GVX14, BCMV14, ABG14, GHK15] or for distributions satisfying independence assumptions [HKZ09, AGH13]. However, the assumptions often play a crucial role in the algorithm. For example, the famous method of moments, which yields a result for learning mixtures of Gaussians in general position, uses the specific algebraic structure of the moment tensor of Gaussian distributions. Such techniques are unlikely to work for more general classes of distributions.
As another example, consider the robust mean estimation problem, which has been actively investigated recently. Lai et al. [LRV16] and later improvements [DKK17, SCV18] show how to estimate the mean of an unknown distribution (with bounded second moments) when an $\epsilon$ fraction of points are adversarially corrupted, obtaining additive error $O(\sqrt{\epsilon})$. On the other hand, Diakonikolas et al. [DKK16] showed how to estimate the mean of a Gaussian or product distribution with nearly optimal additive error $\tilde{O}(\epsilon)$. However, their algorithm again makes strong use of the known algebraic structure of the moments of these distributions.
Further scrutiny reveals that the two examples of clustering and robust mean estimation suffer from a “second-moment” barrier. For both problems, the most general results algorithmically exploit only some boundedness condition on the second moments of the data, while the strongest results use exact information about higher moments (e.g. by assuming Gaussianity) and are thus brittle. This leads to the key conceptual driving force of the present work:
Can we algorithmically exploit boundedness information about a limited number of low-degree moments?
As the above examples illustrate, this is a natural way to formulate the “in-between” case between the two well-explored extremes. From an algorithmic perspective, this question forces us to develop techniques that can utilize information about higher moments of data for problems such as clustering and mean estimation. For these problems, we can more concretely ask:
Can we beat the second-moment barrier in the agnostic setting for clustering and robust mean estimation?
The term agnostic here refers to the fact that we want our algorithm to work for as wide a class of distributions as possible, and in particular to avoid making parametric assumptions (such as Gaussianity) about the underlying distribution.
The main goal of this work is to present a principled way to utilize higher moment information in input data and break the second moment barrier for both clustering and robust mean estimation. A key primitive in our approach is algorithmic certificates upper bounding the injective norms of moment tensors of data.
Given $n$ input points $x_1, \ldots, x_n \in \mathbb{R}^d$ with mean $\mu$, consider the injective tensor norm of their moments, which generalizes the spectral norm of a matrix:
\[ \max_{\|u\|_2 = 1}\; \frac{1}{n} \sum_{i=1}^n \langle x_i - \mu, u \rangle^t. \tag{1.1} \]
For $t > 2$, bounds on the injective norm of the moment tensor present a natural way to utilize higher-moment information in the given data, which suggests an avenue for algorithm design. Indeed, one of our contributions (Theorem 2) is a generalization of spectral norm clustering that uses estimates of injective norms of moment tensors to go beyond the second moment barrier.
Unfortunately for us, estimating injective norms (unlike the spectral norm) is intractable. While it is likely easier than computing injective norms of arbitrary tensors, it turns out that approximately computing injective norms of moment tensors is equivalent to the well-studied problem of approximating the $2$-to-$4$ norm, which is known to be small-set-expansion hard [BBH12b]. The best known algorithms for approximating the $2$-to-$4$ norm achieve a multiplicative approximation ratio that is polynomial in the dimension $d$, and while known hardness results [BBH12b] only rule out constant-factor approximation algorithms for this problem, it seems likely that there is no polynomial-time algorithm for the $2$-to-$4$ norm that achieves any dimension-independent approximation ratio.
An average-case variant of approximating injective norms of moment tensors has been studied to some extent due to its relationship to the small-set-expansion problem. The sum-of-squares hierarchy of semi-definite programming relaxations turns out to be a natural candidate algorithm in this setting and is known to exactly compute the injective norm in specialized settings such as that of the Gaussian distribution. On the other hand, the most general such results [BBH12b, BKS15] imply useful bounds only for settings similar to product distributions.
One of the key technical contributions of this work is to go beyond product distributions for estimating injective norms. Specifically, we show (Theorem 1) that Sum-of-Squares gives a polynomial time procedure to show a dimension-free upper bound on the injective norms of (large enough i.i.d. samples from) arbitrary distributions that satisfy a Poincaré inequality. This is a much more satisfying state of affairs, as it immediately captures all strongly log-concave distributions, including correlated Gaussians. Further, the Poincaré inequality is robust: it continues to hold under Lipschitz transformations of the underlying space, as well as bounded re-weightings of the probability density.
Without further ado, we define Poincaré distributions: A distribution $p$ on $\mathbb{R}^d$ is said to be $\sigma$-Poincaré if for all differentiable functions $f : \mathbb{R}^d \to \mathbb{R}$ we have
\[ \operatorname{Var}_{x \sim p}[f(x)] \le \sigma^2\, \mathbb{E}_{x \sim p}\big[\|\nabla f(x)\|_2^2\big]. \tag{1.2} \]
This is a type of isoperimetric inequality on the distribution and implies concentration of measure. In Section 3 we discuss in more detail various examples of distributions that satisfy (1.2), as well as properties of such distributions. Poincaré inequalities and distributions are intensely studied in probability theory; indeed, we rely on one such powerful result of Adamczak and Wolff [AW15] for establishing a sharp bound on the sum-of-squares algorithm’s estimate of the injective norm of an i.i.d. sample from a Poincaré distribution.
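As a quick numerical illustration of (1.2) (our own sanity check, not from the analysis), consider the standard one-dimensional Gaussian, which is $1$-Poincaré; the test function $f(x) = \sin(x)$ below is an arbitrary choice:

```python
import numpy as np

# Monte Carlo check of the Poincare inequality (1.2) for the standard
# 1-D Gaussian, which is 1-Poincare: Var[f(x)] <= E[f'(x)^2].
# The test function f(x) = sin(x) is an arbitrary illustrative choice.
rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)

f = np.sin(x)       # f(x) = sin(x)
grad = np.cos(x)    # f'(x) = cos(x)

var_f = f.var()                    # closed form: (1 - e^-2)/2 ~ 0.432
mean_sq_grad = (grad ** 2).mean()  # closed form: (1 + e^-2)/2 ~ 0.568

assert var_f <= mean_sq_grad  # Poincare inequality holds (sigma = 1)
```

Both quantities have closed forms here, so the gap between the two sides can be read off exactly.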
We then confirm the intuitive claim that understanding injective norms of moment tensors can give us an algorithmic tool to beat the second moment barrier, by combining our result on certification of Poincaré distributions with our algorithm for clustering under such certificates. Specifically, we show that for any $\gamma > 0$, given a balanced mixture of $k$ Poincaré distributions with means separated by $\Omega(k^{\gamma})$, we can successfully cluster samples from this mixture in time $d^{O(1/\gamma)}$ (by using $O(1/\gamma)$ levels of the sum-of-squares hierarchy). Similarly, given samples from a Poincaré distribution with an $\epsilon$ fraction of adversarial corruptions, we can estimate its mean up to an error of $O(\sigma\,\epsilon^{1 - 1/t})$ in time $d^{O(t)}$. In fact, we will see below that we get both at once: a robust clustering algorithm that can learn well-separated mixtures even in the presence of arbitrary outliers.
To our knowledge such a result was not previously known even in the second-moment case ([CSV17] and [SCV18] study this setting but only obtain results in the list-decodable learning model). Our result only relies on the SOS-certifiability of the moment tensor, and holds for any deterministic point set for which such a sum-of-squares certificate exists.
Despite their generality, our results are strong enough to yield new bounds even in very specific settings such as learning balanced mixtures of spherical Gaussians with separation $\Omega(k^{\gamma})$. Our algorithm allows recovering the true means in $d^{O(1/\gamma)}$ time and partially resolves an open problem posed in the recent work of [RV17].
Certifying injective norms of moment tensors appears to be a useful primitive and could help enable further applications of the sum of squares method in machine learning. Indeed, [KS17] studies the problem of robust estimation of higher moments of distributions that satisfy a bounded-moment condition closely related to approximating injective norms. Their relaxation and the analysis are significantly different from the present work; nevertheless, our result for Poincaré distributions immediately implies that the robust moment estimation algorithm of [KS17] succeeds for a large class of Poincaré distributions.
1.1 Main Results and Applications
Our first main result regards efficient upper bounds on the injective norm of the moment tensor of any Poincaré distribution. Let $x_1, \ldots, x_n$ be i.i.d. samples from a Poincaré distribution with mean $\mu$, and let $\hat{M}_t = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^{\otimes t}$ be the empirical estimate of the $t$th moment tensor. We are interested in upper-bounding the injective norm (1.1), which can be equivalently expressed in terms of the moment tensor as
\[ \|\hat{M}_t\|_{\mathrm{inj}} = \max_{\|u\|_2 = 1} \langle \hat{M}_t, u^{\otimes t} \rangle. \tag{1.3} \]
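The equivalence between (1.1) and (1.3) is a direct tensor-contraction identity; the following sketch checks it numerically for $t = 4$ (the dimensions and random data are arbitrary illustrative choices):

```python
import numpy as np

# Check the identity behind (1.3): for the empirical 4th moment tensor
# M4 = (1/n) sum_i (x_i - mu)^{otimes 4}, the contraction <M4, u^{otimes 4}>
# equals (1/n) sum_i <x_i - mu, u>^4 for any direction u.
rng = np.random.default_rng(1)
n, d = 500, 6
x = rng.standard_normal((n, d))
mu = x.mean(axis=0)
z = x - mu

# Empirical 4th moment tensor, a d x d x d x d array.
M4 = np.einsum('ni,nj,nk,nl->ijkl', z, z, z, z) / n

u = rng.standard_normal(d)
u /= np.linalg.norm(u)

lhs = np.einsum('ijkl,i,j,k,l->', M4, u, u, u, u)  # <M4, u^{otimes 4}>
rhs = np.mean((z @ u) ** 4)                        # (1/n) sum <x_i - mu, u>^4

assert np.isclose(lhs, rhs)
```

Maximizing the right-hand side over unit $u$ is exactly the intractable program that the sum-of-squares norm relaxes.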
Standard results (see Fact 5) yield dimension-free upper bounds on (1.3) for all Poincaré distributions. Our first result is a “sum-of-squares proof” of this fact giving an efficient method to certify dimension-free upper bounds on (1.3) for samples from any Poincaré distribution.
Specifically, let the sum-of-squares norm of $\hat{M}_t$, denoted by $\|\hat{M}_t\|_{\mathrm{sos}}$, be the degree-$2t$ sum-of-squares relaxation of (1.3) (we discuss such norms and the sum-of-squares method in more detail in Section 2; for now the important fact is that $\|\hat{M}_t\|_{\mathrm{sos}}$ can be computed in time $(nd)^{O(t)}$). We show that for a large enough sample from a distribution that satisfies the Poincaré inequality, the sum-of-squares norm of the moment tensor is upper bounded by a dimension-free constant.
Theorem 1. Let $p$ be a $\sigma$-Poincaré distribution over $\mathbb{R}^d$ with mean $\mu$. Let $x_1, \ldots, x_n \sim p$ with $n \ge n_0$ for some $n_0 = \mathrm{poly}(d^t, \log(1/\delta))$. Then, for some constant $C_t$ (depending only on $t$), with probability at least $1 - \delta$ we have $\|\hat{M}_t\|_{\mathrm{sos}} \le C_t\,\sigma^t$, where $\hat{M}_t = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^{\otimes t}$.
As noted above, previous sum-of-squares bounds worked for specialized cases such as product distributions. Theorem 1 is key to our applications, which crucially rely on 1) going beyond product distributions and 2) using sum-of-squares norms as a proxy for injective norms of higher moment tensors.
Outlier-Robust Agnostic Clustering
Our second main result is an efficient algorithm for outlier-robust agnostic clustering whenever the “ground-truth” clusters have moment tensors with bounded sum-of-squares norms.
Concretely, the input is a set of $n$ points in $\mathbb{R}^d$, a $(1-\epsilon)$ fraction of which admit an (unknown) partition into $k$ sets $S_1, \ldots, S_k$, each having bounded sum-of-squares norm around its corresponding mean $\mu_i$. The remaining $\epsilon$ fraction can be arbitrary outliers. Observe that in this setting, we do not make any explicit distributional assumptions.
We will be able to obtain strong estimation guarantees in this setting so long as the clusters are well-separated and the fraction $\epsilon$ of outliers is not more than $O(\alpha)$, where $\alpha$ is the fraction of points in the smallest cluster. We define the separation as $\Delta = \min_{i \neq j} \|\mu_i - \mu_j\|_2$. A lower bound on $\Delta$ is information-theoretically necessary even in the special case of learning mixtures of identity-covariance Gaussians without any outliers (see [RV17]).
Theorem 2. Suppose $n$ points can be partitioned into sets $S_1, \ldots, S_k$ and $S_{\mathrm{out}}$, where the $S_i$ are the clusters and $S_{\mathrm{out}}$ is a set of outliers of size $\epsilon n$. Suppose each $S_i$ has size $\alpha_i n$ and mean $\mu_i$, and that its $t$th moment satisfies $\big\|\frac{1}{|S_i|}\sum_{x \in S_i} (x - \mu_i)^{\otimes t}\big\|_{\mathrm{sos}} \le B^t$. Also suppose that $\alpha_i \ge \alpha$ for all $i$.
Finally, suppose the separation satisfies $\Delta \ge C\,B\,\alpha^{-1/t}$ (for a universal constant $C$). Then there is an algorithm running in time $n^{O(t)}$ and outputting means $\hat{\mu}_1, \ldots, \hat{\mu}_k$ such that $\|\hat{\mu}_i - \mu_i\|_2 \le B \cdot O\big((\epsilon/\alpha)^{1-1/t} + (B/\Delta)^{t-1}\big)$ for all $i$.
The parameter $B$ specifies a bound on the variation in each cluster. The separation condition says that the distance between cluster means must be slightly larger (by an $\alpha^{-1/t}$ factor) than this variation. The error in recovering the cluster means depends on two terms: the fraction of outliers $\epsilon$, and the separation $\Delta$.
To understand the guarantees of the theorem, let's start with the case where $\epsilon = 0$ (no outliers) and $\alpha = 1/k$ (all clusters have the same size). In this case, the separation requirement between the clusters is $C\,B\,k^{1/t}$, where $B$ is the bound on the moment tensor of order $t$. The theorem guarantees a recovery of the means up to an error in Euclidean norm of $B \cdot O\big((B/\Delta)^{t-1}\big)$. By taking $t$ larger (and spending the correspondingly larger running time), our clustering algorithm works with separation $k^{\gamma}$ for any constant $\gamma > 0$. This is the first result that goes beyond the separation requirement of $\Omega(\sqrt{k})$ in the agnostic clustering setting, i.e., without making distributional assumptions on the clusters.
It is important to note that even in one dimension, it is information-theoretically impossible to recover the cluster means to arbitrarily small error when relying only on $t$th moment bounds. A simple example to illustrate this is a mixture of two distributions on the real line with bounded $t$th moments but small overlap in the tails. In this case, it is impossible to correctly classify the points that come from the overlapping part. Thus, a small fraction of points in the tail always end up misclassified, shifting the estimated means. The recovery error of our algorithm does indeed drop as the separation $\Delta$ between the true means increases (making the overlapping parts of the tails smaller). We note that for the specific case of spherical Gaussians, we can exploit their parametric structure to get arbitrarily accurate estimates even for fixed separation; see Corollary 4.
Next, let's consider $\epsilon > 0$. In this case, if $\epsilon \ll \alpha$, we recover the means up to an error of $B \cdot O\big((\epsilon/\alpha)^{1-1/t}\big)$ again (for sufficiently large separation). It is intuitive that the recovery error for the means should grow with the number of outliers, and the condition $\epsilon < \alpha$ is necessary, as if $\epsilon \ge \alpha$ then the outliers could form an entirely new cluster, making recovery of the means information-theoretically impossible.
We also note that in the degenerate case where $k = 1$ (a single cluster), Theorem 2 yields results for robust mean estimation of a set of points corrupted by an $\epsilon$ fraction of outliers. In this case we are able to estimate the mean to error $O(B\,\epsilon^{1-1/t})$; when $t = 2$ this is $O(B\sqrt{\epsilon})$, which matches the error obtained by methods based on second moments [LRV16, DKK17, SCV18]. For $t = 4$ we get error $O(B\,\epsilon^{3/4})$, for $t = 6$ we get error $O(B\,\epsilon^{5/6})$, and so on, approaching an error of $O(B\,\epsilon)$ as $t \to \infty$. In particular, this pleasingly approaches the $\tilde{O}(\epsilon)$ rate obtained by much more bespoke methods that rely strongly on specific distributional assumptions [LRV16, DKK16].
Note that we could not hope to do better than $\epsilon^{1-1/t}$, as that is the information-theoretically optimal error for distributions with bounded $t$th moments (even in one dimension), and degree-$t$ SOS only “knows about” moments up to $t$.
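The $\epsilon^{1-1/t}$ lower bound can be made concrete with a short calculation (a worked example of ours, mirroring the tail argument above): an adversary placing an $\epsilon$ fraction of mass at height $\epsilon^{-1/t}$ keeps the $t$th moment bounded by $1$ while shifting the mean by exactly $\epsilon^{1-1/t}$.

```python
# Why error eps^(1 - 1/t) is unavoidable with only a t-th moment bound:
# an adversary can place an eps fraction of mass at height eps^(-1/t).
# The corrupted distribution still has t-th moment at most 1, yet its
# mean shifts by eps * eps^(-1/t) = eps^(1 - 1/t).
for t in (2, 4, 8):
    eps = 0.01
    spike = eps ** (-1.0 / t)       # location of the adversarial mass
    t_moment = eps * spike ** t     # contribution to the t-th moment
    mean_shift = eps * spike        # contribution to the mean
    assert abs(t_moment - 1.0) < 1e-9
    assert abs(mean_shift - eps ** (1 - 1.0 / t)) < 1e-12
```

As $t$ grows, $\epsilon^{1-1/t}$ approaches $\epsilon$, matching the rates discussed above.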
Finally, we can obtain results even for clusters that are not well-separated, and for fractions of outliers that could exceed $\alpha$. In this case we no longer output exactly $k$ means, and must instead consider the list-decodable model [BBV08, CSV17], where we output a longer list of means of which the true means are a sublist. We defer the statement of this result to Theorem 5 in Section 5.
Corollary 3 (Disentangling Mixtures of Arbitrary Poincaré Distributions).
Suppose that we are given a dataset of $n$ points, such that at least $(1-\epsilon)n$ points are drawn from a mixture $\sum_{i=1}^k \alpha_i\,p_i$ of $k$ distributions, where each $p_i$ is $\sigma$-Poincaré with mean $\mu_i$ (the remaining points may be arbitrary). Let $\alpha = \min_i \alpha_i$. Also suppose that the separation $\min_{i \neq j} \|\mu_i - \mu_j\|_2$ is at least $C_t\,\sigma\,\alpha^{-1/t}$, for some constant $C_t$ depending only on the (even) moment order $t$.
Then, assuming that $n \ge n_0$, for some $n_0 = \mathrm{poly}(d^t, k, 1/\alpha, \log(1/\delta))$, there is an algorithm running in time $n^{O(t)}$ which with probability $1 - \delta$ outputs candidate means $\hat{\mu}_1, \ldots, \hat{\mu}_k$ such that $\|\hat{\mu}_i - \mu_i\|_2 \le C'_t\,\sigma\,(\epsilon/\alpha)^{1-1/t}$ for all $i$ (where $C'_t$ is a different constant depending only on $t$).
The dimension-dependent factor in the sample complexity is so that we have enough samples from every single cluster for Theorem 1 to hold. The extra logarithmic term in the sample complexity is so that the empirical means of each cluster concentrate to the true means.
Corollary 3 is one of the strongest results on learning mixtures that one could hope for. If the mixture weights are all at least $\Omega(1/k)$, then Corollary 3 implies that we can cluster the points as long as the separation is at least $k^{\gamma}$ for any constant $\gamma > 0$. Even for spherical Gaussians the best previously known algorithms required separation $\tilde{\Omega}(k^{1/4})$. On the other hand, Corollary 3 applies to a large family of distributions including arbitrary strongly log-concave distributions. Moreover, while the Poincaré inequality does not directly hold for discrete distributions, Fact 3 in Section 3 implies that a large class of discrete distributions, including product distributions over bounded domains, will satisfy the Poincaré inequality after adding zero-mean Gaussian noise. Corollary 3 then yields a clustering algorithm for these distributions, as well.
For mixtures of Gaussians in particular, we can do better, and in fact achieve vanishing error independent of the separation:
Corollary 4 (Learning Mixtures of Gaussians).
Suppose that $x_1, \ldots, x_n$ are drawn from a mixture of $k$ spherical Gaussians: $p = \sum_{i=1}^k \alpha_i\,\mathcal{N}(\mu_i, I)$, where $\alpha_i \ge \Omega(1/k)$ for all $i$. Then for any $\gamma > 0$ and any target accuracy $\eta > 0$, there is a sufficiently large sample size $n$ such that given $n$ samples from $p$, if the separation satisfies $\min_{i \neq j} \|\mu_i - \mu_j\|_2 \ge C_\gamma\,k^{\gamma}$, then with probability $1 - \delta$ we obtain estimates $\hat{\mu}_i$ with $\|\hat{\mu}_i - \mu_i\|_2 \le \eta$ for all $i$.
This partially resolves an open question of [RV17], who ask whether it is possible to efficiently learn mixtures of spherical Gaussians with separation $o(k^{1/4})$.
The error now goes to $0$ as $n \to \infty$, which is not true in the more general Corollary 3. This requires invoking Theorem IV.1 of [RV17], which, given a sufficiently good initial estimate of the means of a mixture of Gaussians, shows how to get an arbitrarily accurate estimate. As discussed before, such a result is specific to Gaussians and in particular is information-theoretically impossible for mixtures of general Poincaré distributions.
1.2 Proof Sketch and Technical Contributions
1.2.1 Sketch of Theorem 1
For simplicity, we will only focus on SOS-certifiability in the infinite-data limit, i.e. on showing that SOS can certify an upper bound of the form $\mathbb{E}_{x \sim p}\,\langle x - \mu, u \rangle^{2t} \le C_t\,\sigma^{2t}\,\|u\|_2^{2t}$. (In Section 4.2 we will show that finite-sample concentration follows due to the matrix Rosenthal inequality [MJC14].)
We make extensive use of a result of [AW15]; it is a very general result on bounding non-Lipschitz functions of Poincaré distributions, but in our context the important consequence is the following:
If $f$ is a degree-$t$ polynomial such that $\mathbb{E}_{x \sim p}[\nabla^j f(x)] = 0$ for $j = 0, 1, \ldots, t-1$, then $\mathbb{E}_{x \sim p}[f(x)^2] \le C_t\,\sigma^{2t}\,\|\nabla^t f\|_F^2$ for a constant $C_t$, assuming $p$ is $\sigma$-Poincaré. (Note that $\nabla^t f$ is a constant tensor since $f$ is degree-$t$.)
Here $\|\nabla^t f\|_F$ denotes the Frobenius norm of the tensor $\nabla^t f$, i.e. the $\ell_2$-norm of $\nabla^t f$ if it were flattened into a $d^t$-element vector.
We can already see why this sort of bound might be useful for $t = 1$. Then if we let $f(x) = \langle x - \mu, u \rangle$, we have $\nabla f = u$ and hence $\mathbb{E}\big[\langle x - \mu, u \rangle^2\big] \le C_1\,\sigma^2\,\|u\|_2^2$. This exactly says that $p$ has bounded covariance.
More interesting is the case $t = 2$. Here we will let $f(x) = \langle (x - \mu)(x - \mu)^\top - \Sigma,\; U \rangle$, where $\mu$ is the mean and $\Sigma$ is the covariance of $p$. It is easy to see that both $\mathbb{E}[f(x)] = 0$ and $\mathbb{E}[\nabla f(x)] = 0$. Therefore, we have $\mathbb{E}[f(x)^2] \le C_2\,\sigma^4\,\|U\|_F^2$.
Why is this bound useful? It says that if we unroll $(x - \mu)(x - \mu)^\top - \Sigma$ to a $d^2$-dimensional vector, then this vector has bounded covariance (since if we project along any direction $U$ with $\|U\|_F = 1$, the variance is at most $C_2\,\sigma^4$). This is useful because it turns out sum-of-squares “knows about” such covariance bounds; indeed, this type of covariance bound is exactly the property used in [BBH12a] to certify $4$th moment tensors over the hypercube. In our case it yields a sum-of-squares proof that $\mathbb{E}_x\big[\langle (x - \mu)^{\otimes 2} - \Sigma,\; u^{\otimes 2} \rangle^2\big]$ is bounded, which can then be used to bound the $4$th moment $\mathbb{E}_x\,\langle x - \mu, u \rangle^4$.
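The dimension-free nature of this covariance bound can be seen concretely for the standard Gaussian (a numerical illustration of ours): the covariance of the flattened matrix $x x^\top - I$ is $I + P$, where $P$ is the swap operator, so its operator norm is exactly $2$ for every $d$.

```python
import numpy as np

# For x ~ N(0, I_d), flatten T(x) = x x^T - I into a d^2-vector and
# estimate the operator norm of its covariance.  The bound is
# dimension-free: the exact covariance is I + P (P = swap operator),
# whose eigenvalues are 2 and 0, so the operator norm is 2 for every d.
rng = np.random.default_rng(2)
for d in (3, 6):
    n = 60_000
    x = rng.standard_normal((n, d))
    T = np.einsum('ni,nj->nij', x, x) - np.eye(d)  # x x^T - I per sample
    V = T.reshape(n, d * d)                        # flatten to d^2-vectors
    cov = V.T @ V / n                              # E[T] = 0, so this is Cov
    top = np.linalg.eigvalsh(cov).max()
    assert abs(top - 2.0) < 0.3                    # ~2 regardless of d
```

The same top eigenvalue appears for both values of $d$, which is exactly the dimension-free behavior the sum-of-squares certificate exploits.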
Motivated by this, it is natural to try the same idea of “subtracting off the mean and squaring” with $t = 3$. Perhaps we could define $f(x) = \langle (x - \mu)^{\otimes 3} - M_3,\; U \rangle$, where $M_3 = \mathbb{E}\big[(x - \mu)^{\otimes 3}\big]$?
Alas, this does not work: while there is a suitable polynomial for each $t$ that yields sum-of-squares bounds, it is somewhat more subtle. For simplicity we will write the polynomial for $t = 3$. It is the following: $f(x) = \langle (x - \mu)^{\otimes 3} - 3\,(x - \mu)\,\overline{\otimes}\,\Sigma - M_3,\; U \rangle$, where $M_3$ is the third-moment tensor of $p$ and $\overline{\otimes}$ denotes the symmetrized tensor product. By checking that $\mathbb{E}[f] = 0$, $\mathbb{E}[\nabla f] = 0$, and $\mathbb{E}[\nabla^2 f] = 0$, we obtain that the tensor $(x - \mu)^{\otimes 3} - 3\,(x - \mu)\,\overline{\otimes}\,\Sigma - M_3$, when unrolled to a $d^3$-dimensional vector, has bounded covariance, which means that sum-of-squares knows that $\mathbb{E}_x\big[\langle (x - \mu)^{\otimes 3} - 3\,(x - \mu)\,\overline{\otimes}\,\Sigma - M_3,\; u^{\otimes 3} \rangle^2\big]$ is bounded for all $\|u\|_2 \le 1$.
However, this is not quite what we want: we wanted to show that $\mathbb{E}_x\,\langle x - \mu, u \rangle^6$ is bounded. Fortunately, the leading term of the above expression is indeed $\langle x - \mu, u \rangle^6$, and all the remaining terms are lower-order. So, we can subtract off the leading term and recursively bound all of the lower-order terms to get a sum-of-squares bound on $\mathbb{E}_x\,\langle x - \mu, u \rangle^6$. The case of general $t$ follows similarly, by carefully constructing a tensor whose first $t - 1$ derivatives are all zero in expectation.
There are a couple of contributions here beyond what was known before. The first is identifying appropriate tensors whose covariances are actually bounded so that sum-of-squares can make use of them. For $t = 1, 2$ (the cases that had previously been studied) the appropriate tensor is in some sense the “obvious” one ($x - \mu$ and $(x - \mu)^{\otimes 2} - \Sigma$, respectively), but even for $t = 3$ we end up with the fairly non-obvious tensor $(x - \mu)^{\otimes 3} - 3\,(x - \mu)\,\overline{\otimes}\,\Sigma - M_3$. (For $t = 4$ it is $(x - \mu)^{\otimes 4} - 6\,(x - \mu)^{\otimes 2}\,\overline{\otimes}\,\Sigma - 4\,(x - \mu)\,\overline{\otimes}\,M_3 + 6\,\Sigma\,\overline{\otimes}\,\Sigma - M_4$, up to symmetrization conventions.) While these tensors may seem mysterious a priori, they are actually the unique tensor polynomials with leading term $(x - \mu)^{\otimes t}$ such that all derivatives of order less than $t$ have mean zero. Even beyond Poincaré distributions, these seem like useful building blocks for sum-of-squares proofs.
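The mean-zero-derivative property is easy to verify in one dimension (our own illustrative check, not from the paper): for a centered variable with variance $s^2$ and third moment $m_3$, the polynomial $T_3(x) = x^3 - 3 s^2 x - m_3$ has mean zero, as do its first and second derivatives.

```python
import numpy as np

# 1-D sanity check of the "mean-zero derivatives" construction for t = 3.
# For a centered random variable x with variance s2 and third moment m3,
# the polynomial T3(x) = x^3 - 3*s2*x - m3 satisfies E[T3] = 0,
# E[T3'] = 0, and E[T3''] = 0.  We check this on a skewed distribution.
rng = np.random.default_rng(3)
x = rng.exponential(size=400_000) - 1.0  # mean zero, skewed
s2 = x.var()                             # variance (true value 1)
m3 = np.mean(x ** 3)                     # third central moment (true value 2)

T3 = x ** 3 - 3 * s2 * x - m3
dT3 = 3 * x ** 2 - 3 * s2                # first derivative in x
ddT3 = 6 * x                             # second derivative in x

for m in (T3.mean(), dT3.mean(), ddT3.mean()):
    assert abs(m) < 0.05                 # all three means vanish
```

Note the skewness term $m_3$ is essential here; dropping it (the “obvious” attempt above) leaves a nonzero mean for asymmetric distributions.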
The second contribution is making the connection between Poincaré distributions and the above polynomial inequalities. The well-known work of Latała [Lat06] establishes non-trivial upper bounds on the moments of polynomials of Gaussians, of which the inequalities used here are a special case. [AW15] show that these inequalities also hold for Poincaré distributions. However, it is not a priori obvious that these inequalities should lead to sum-of-squares proofs, and it requires a careful invocation of the general inequalities to get the desired results in the present setting.
1.2.2 Sketch of Theorem 2
We next establish our result on robust clustering. In fact we will establish a robust mean estimation result which will lead to the clustering result; specifically, we will show that if a set of points contains a subset $S$ of size $\alpha n$ that is SOS-certifiable, then the mean of the points in $S$ can be estimated regardless of the remaining points. There are two parts: if $\alpha$ is close to $1$ we want to show error going to $0$ as $\alpha \to 1$, while if $\alpha$ is small we want to show error that does not grow too fast as $\alpha \to 0$. In the latter case we will output $O(1/\alpha)$ candidates for the mean and show that at least one of them is close to the true mean (think of these candidates as accounting for the possible clusters in the data). We will later prune down to exactly $k$ means for well-separated clusters.
For $t = 2$ (which corresponds to bounded covariance), the small-$\alpha$ case is studied in [CSV17]. A careful analysis of the proof there reveals that all of the relevant inequalities are sum-of-squares inequalities, so there is a sum-of-squares generalization of the algorithm in [CSV17] that should give bounds for SOS-certifiable distributions. While this would likely lead to some robust clustering result, we note the bounds we achieve here are stronger than those in [CSV17], as [CSV17] do not achieve tight results when the clusters are well-separated. Moreover, the proof in [CSV17] is complex and would be somewhat tedious to extend in full to the sum-of-squares setting.
We combine and simplify ideas from both [CSV17] and [SCV18] to obtain a relatively clean algorithm. In fact, we will see that a certain mysterious constraint appearing in [CSV17] is actually the natural constraint from a sum-of-squares perspective.
Our algorithm is based on the following optimization. Given points $x_1, \ldots, x_n$, we will try to find points $v_1, \ldots, v_n$ such that $\tilde{\mathbb{E}}\big[\frac{1}{n}\sum_{i=1}^n \langle x_i - v_i, u \rangle^t\big]$ is small for all pseudodistributions over the sphere. This is natural because we know that for the good points $x_i \in S$ and the true mean $\mu$, the quantity $\tilde{\mathbb{E}}\big[\frac{1}{|S|}\sum_{i \in S} \langle x_i - \mu, u \rangle^t\big]$ is small (by the SOS-certifiability assumption). However, without further constraints this is not a very good idea, because the trivial optimum is to set $v_i = x_i$. We would somehow like to ensure that the $v_i$ cannot overfit too much to the $x_i$; it turns out that the natural way to measure this degree of overfitting is via the quantity $\tilde{\mathbb{E}}\big[\frac{1}{n}\sum_{i=1}^n \langle v_i - \mu, u \rangle^t\big]$.
Of course, this quantity is not known because we do not know $\mu$. But we do know that $\tilde{\mathbb{E}}\big[\frac{1}{|S|}\sum_{i \in S} \langle v_i - \mu, u \rangle^t\big]$ is small for all pseudodistributions (because the corresponding quantity is small for the $x_i - \mu$ and for the $x_i - v_i$, and hence also for the $v_i - \mu$ by Minkowski's inequality). Therefore, we impose the following constraint: whenever a collection of vectors has uniformly small $t$th moments in every direction, its correlation with the $v_i$ must also be small. This constraint is not efficiently imposable, but it does have a simple sum-of-squares relaxation: we require the analogous bound whenever the vectors are replaced by pseudomoment tensors satisfying the corresponding small-moment constraint.
Together, this leads to seeking $v_1, \ldots, v_n$ such that both the objective $\tilde{\mathbb{E}}\big[\frac{1}{n}\sum_{i=1}^n \langle x_i - v_i, u \rangle^t\big]$ and the overfitting measure above are simultaneously small.
If we succeed in this, we can show that we end up with a good estimate of the mean (more specifically, the $v_i$ are clustered into a small number of clusters, such that one of them is centered near $\mu$). The above is a convex program, and thus, if this is impossible, by duality there must exist a specific pseudodistribution and specific pseudomoment tensors such that the above quantities cannot be small for any choice of the $v_i$. But for fixed duals, the different $v_i$ are independent of each other, and in particular it should be possible to make both sums small at least for the terms coming from the good set $S$. This gives us a way of performing outlier removal: look for points whose corresponding terms in either sum are large, and remove them from the set of points. We can show that after a finite number of iterations this will have removed many outliers and few good points, so that eventually we must succeed in making both sums small and thus get a successful clustering.
Up to this point the proof structure is similar to [SCV18]; the main innovation is the constraint involving the pseudomoment tensors, which bounds the degree of overfitting. In fact, when $t = 2$ this constraint is the dual form of one appearing in [CSV17], which asks that $(v_i - \bar{v})(v_i - \bar{v})^\top \preceq Y$ for all $i$, for some matrix $Y$ of small trace. In [CSV17], the matrix $Y$ couples all of the variables $v_i$, which complicates the analysis. In the form given here, we avoid the coupling and also see why the constraint is the natural one for controlling overfitting.
To finish the proof, it is also necessary to iteratively re-cluster the $v_i$ and re-run the algorithm on each cluster. This is due to issues where we might have, say, three clusters, where the first two are relatively close together but very far from the third one. In this case our algorithm would resolve the third cluster from the first two, but needs to be run a second time to then resolve the first two clusters from each other.
[CSV17] also use this re-clustering idea, but their re-clustering algorithm makes use of a sophisticated metric embedding technique and is relatively complex. Here we avoid this complexity by making use of resilient sets, an idea introduced in [SCV18]. A resilient set is a set such that all large subsets have mean close to the mean of the original set; it can be shown that any set with bounded moment tensor is resilient, and by finding such resilient sets we can robustly cluster in a much more direct manner than before. In particular, in the well-separated case we show that after enough rounds of re-clustering, every resilient set has almost all of its points coming from a single cluster, leading to substantially improved error bounds in that case.
1.3 Open Problems
In this work, we showed that sum-of-squares can certify moment tensors for distributions satisfying the Poincaré inequality. While this class of distributions is fairly broad, one could hope to establish sum-of-squares bounds for even broader families. Indeed, one canonical family is the class of sub-Gaussian distributions. Is it the case that sum-of-squares certifies moment tensors for all sub-Gaussian distributions? Conversely, are there sub-Gaussian distributions that sum-of-squares cannot certify? Even for $4$th moments, this is unknown:
Open Question 6.
Let $p$ be a $1$-sub-Gaussian distribution over $\mathbb{R}^d$ and let $M_4$ denote its fourth moment tensor. Is it always the case that $\|M_4\|_{\mathrm{sos}} \le C$ for some universal constant $C$?
In another direction, the only property we required from Poincaré distributions is Adamczak and Wolff's result [AW15] bounding the variance of polynomials whose derivatives all have mean zero. Adamczak and Wolff show that this property also holds for other distributions, such as sub-Gaussian product distributions. One might expect additional distributions to satisfy these inequalities as well, in which case our present results would apply unchanged.
Open Question 7.
Say that a distribution $p$ satisfies the $(C, t)$-moment property if, whenever $f$ is a degree-$t$ polynomial with $\mathbb{E}_{x \sim p}[\nabla^j f(x)] = 0$ for $j = 0, \ldots, t-1$, we have $\mathbb{E}_{x \sim p}[f(x)^2] \le C\,\|\nabla^t f\|_F^2$. Which distributions satisfy the $(C, t)$-moment property?
Finally, the present results all regard certifying moment tensors in the $\ell_2$-norm, i.e., on upper bounding $\frac{1}{n}\sum_{i=1}^n \langle x_i - \mu, u \rangle^t$ for all $\|u\|_2 \le 1$. However, [SCV18] show that in some cases, such as discrete distribution learning, the $\ell_\infty$-norm is more natural. To this end, define $\|M_t\|_{\mathrm{sos}, \infty}$ to be the maximum of $\tilde{\mathbb{E}}\big[\langle M_t, u^{\otimes t} \rangle\big]$ over all pseudodistributions on the hypercube.
Open Question 8.
For what distributions is $\|M_t\|_{\mathrm{sos}, \infty}$ small? Additionally, do bounds on $\|M_t\|_{\mathrm{sos}, \infty}$ lead to better robust estimation and clustering in the $\ell_\infty$-norm?
In this section we set up notation and introduce a number of preliminaries regarding sum-of-squares algorithms.
We will use $d$ to denote the dimension, and $n$ the number of samples in a dataset $x_1, \ldots, x_n$. For clustering problems $k$ will denote the number of clusters. $\epsilon$ will denote, depending on circumstance, either the desired estimation error or the fraction of adversarial corruptions for a robust estimation problem. $\delta$ will denote the probability of failure of an algorithm. $\gamma$ will denote an exponent which we think of as going to zero, as in phrases like “separation $k^{\gamma}$ for any $\gamma > 0$”. For tensors, $t$ will denote their order (or $2t$ if we want to emphasize that the order is even). We let $C$ denote a universal constant and $C_t$ a universal constant depending on $t$ (these constants may change in each place they are used).
Below we use Theorem (and Proposition, Lemma, etc.) for results that we prove in this paper, and Fact for results proved in other papers.
Tensors, Polynomials and Norms
A $t$th order tensor $T$ on $\mathbb{R}^d$ is a $t$-dimensional array of real numbers indexed by $t$-tuples over $[d]$. It is naturally associated with a homogeneous degree-$t$ polynomial $u \mapsto \langle T, u^{\otimes t} \rangle$. The injective norm of a tensor is defined as
\[ \|T\|_{\mathrm{inj}} = \max_{\|u\|_2 = 1} \langle T, u^{\otimes t} \rangle. \]
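For $t = 2$ the injective norm of a symmetric tensor is just its largest eigenvalue, which is the computationally easy base case that the sum-of-squares relaxation generalizes; the following snippet (an illustration of ours on random test data) makes this concrete:

```python
import numpy as np

# For t = 2 the injective norm of a symmetric matrix M is its largest
# eigenvalue: max_{||u||=1} <M, u u^T> = max_{||u||=1} u^T M u.
rng = np.random.default_rng(4)
d = 8
A = rng.standard_normal((d, d))
M = (A + A.T) / 2                     # random symmetric order-2 "tensor"

eigmax = np.linalg.eigvalsh(M).max()  # exact injective norm for t = 2

# Crude check: random unit directions never beat the top eigenvalue.
for _ in range(1000):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    assert u @ M @ u <= eigmax + 1e-9
```

For $t \ge 3$ no such eigenvalue formula exists, which is precisely why relaxations are needed.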
Given a distribution $p$ on $\mathbb{R}^d$, the $t$th moment tensor of $p$ is defined by $M_t = \mathbb{E}_{x \sim p}\big[(x - \mu)^{\otimes t}\big]$, where $\mu$ is the mean of $p$. Observe that each entry of $M_t$ is the expectation of some monomial of degree $t$ with respect to $p$. For a finite set of points $S$, we let $\mathbb{E}_{x \sim S}$ denote expectation with respect to its empirical distribution. The moment tensor of a set of points is the moment tensor of its empirical distribution.
Given a matrix $A \in \mathbb{R}^{d \times d}$, we let $\|A\|$ denote its operator norm (maximum singular value) and $\|A\|_F$ its Frobenius norm (the $\ell_2$-norm of its entries when flattened to a $d^2$-dimensional vector). More generally, for a tensor $T$, we let $\|T\|_F$ denote the Frobenius norm (which is again the $\ell_2$-norm when the entries are flattened to a $d^t$-dimensional vector).
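As a concrete illustration of these definitions, the following is a minimal numpy sketch (the function name is ours) that forms the empirical $t$-th central moment tensor of a dataset and computes its Frobenius norm:

```python
import numpy as np

def empirical_moment_tensor(X, t):
    """t-th central moment tensor of the empirical distribution of rows of X.

    Returns E[(x - mu)^{tensor t}] as a d^t array, averaging over the samples."""
    Xc = X - X.mean(axis=0)          # center by the empirical mean
    n, d = Xc.shape
    T = np.zeros((d,) * t)
    for x in Xc:
        outer = x
        for _ in range(t - 1):
            outer = np.multiply.outer(outer, x)   # builds x^{tensor t}
        T += outer
    return T / n

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 3))
M2 = empirical_moment_tensor(X, 2)   # approximates the covariance (identity here)
frob = np.linalg.norm(M2)            # Frobenius norm of the flattened entries
```

For $t = 2$ this is just the empirical covariance, so `frob` is close to $\sqrt{3}$ for standard Gaussian data in three dimensions.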
2.1 Sum-of-Squares Programs and Pseudodistributions
In this paper we are interested in approximating injective norms of moment tensors, i.e. in upper-bounding programs of the form
$$\max_{\|u\|_2 = 1}\ \mathbb{E}_{x \sim \mathcal{D}}\,\langle x - \mu, u\rangle^{2t}. \qquad (2.1)$$
This problem is hard to solve exactly, so we will instead consider the following sum-of-squares relaxation of (2.1):
$$\max_{\tilde{\mathbb{E}}}\ \ \tilde{\mathbb{E}}\Big[\mathbb{E}_{x \sim \mathcal{D}}\,\langle x - \mu, u\rangle^{2t}\Big] \qquad (2.3)$$
$$\text{subject to}\quad \tilde{\mathbb{E}}\big[(\|u\|_2^2 - 1)\,p(u)\big] = 0 \ \text{ for all } p \text{ of degree at most } 2t - 2, \qquad (2.4)$$
$$\tilde{\mathbb{E}}\big[p(u)^2\big] \ge 0 \ \text{ for all } p \text{ of degree at most } t, \qquad (2.5)$$
$$\tilde{\mathbb{E}}[1] = 1. \qquad (2.6)$$
Formally, $\tilde{\mathbb{E}}$, which we refer to as a pseudo-expectation, is simply a linear functional on the space of polynomials in $u$ of degree at most $2t$. The last two constraints say that $\tilde{\mathbb{E}}$ specifies a pseudo-distribution, meaning that it respects all properties of a regular probability distribution that can be specified with low-degree polynomials. Meanwhile, the first constraint is a relaxation of the constraint that $\|u\|_2 = 1$. If $\tilde{\mathbb{E}}$ satisfies (2.4-2.6), we say that $\tilde{\mathbb{E}}$ is a pseudo-distribution on the sphere.
The relaxation (2.3) is a special case of the well-studied family of sum-of-squares relaxations. It is well known that these programs can be solved efficiently, i.e. in time $d^{O(t)}$, due to the ability to represent them as semidefinite programs [Sho87, Par00, Nes00, Las01].
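To make the relaxation concrete in the simplest case: for $t = 1$, the program reduces to a semidefinite program over the pseudo-moment matrix $\tilde{\mathbb{E}}[uu^\top]$, which is PSD with unit trace by (2.4-2.6), so its value is exactly the top eigenvalue of the covariance. A sketch (the function name is ours; we specialize to this degree-2 case rather than implement a general SOS solver):

```python
import numpy as np

def sos_degree2_value(Sigma):
    """Value of the degree-2 relaxation: max tr(Sigma @ M) over PSD M with tr(M) = 1.

    The maximum is attained at M = v v^T for a top eigenvector v, so the
    relaxation value equals the largest eigenvalue of Sigma; at this degree
    the relaxation is tight."""
    return float(np.linalg.eigvalsh(Sigma).max())

Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
val = sos_degree2_value(Sigma)   # top eigenvalue of Sigma
```

For higher $t$ the pseudo-moment matrix is indexed by monomials of degree up to $t$, which is where the $d^{O(t)}$ running time comes from.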
The key strategy for bounding the value of (2.3) is sum-of-squares proofs. We say that a polynomial inequality $p(u) \le q(u)$ has a sum-of-squares proof if the difference $q - p$ can be written as a sum of squares of polynomials. We write this as $p \preceq_{\mathrm{sos}} q$, or $p \preceq_{2t} q$ if we want to emphasize that the proof only involves polynomials of degree at most $2t$.
For pseudo-distributions on the sphere, we will extend this notation and say that $p \preceq_{\mathrm{sos}} q$ on the sphere if $q(u) - p(u) + r(u)\,\big(\|u\|_2^2 - 1\big)$ is a sum of squares for some polynomial $r$ of degree at most $2t - 2$.
Now, let $p(u) = \mathbb{E}_{x \sim \mathcal{D}}\,\langle x - \mu, u\rangle^{2t}$. Imagine that there is some sequence of sum-of-squares proofs $p(u) \preceq_{\mathrm{sos}} p_1(u) \preceq_{\mathrm{sos}} \cdots \preceq_{\mathrm{sos}} B\,\|u\|_2^{2t}$. Then we know, by the constraints on $\tilde{\mathbb{E}}$, that $\tilde{\mathbb{E}}[p(u)] \le B\,\tilde{\mathbb{E}}\big[\|u\|_2^{2t}\big] = B$. Therefore, such a proof immediately implies that (2.3) has value at most $B$.
Note that since the relation $\preceq_{\mathrm{sos}}$ is transitive, this is equivalent to the single condition $p(u) \preceq_{\mathrm{sos}} B\,\|u\|_2^{2t}$. In this case, we call the set of points SOS-certifiable. More generally, for a distribution we have the following definition:
For a distribution $\mathcal{D}$ with mean $\mu$, we say that $\mathcal{D}$ is $B$-SOS-certifiable (with degree-$2t$ certificates) if $\mathbb{E}_{x \sim \mathcal{D}}\,\langle x - \mu, u\rangle^{2t} \preceq_{2t} B\,\|u\|_2^{2t}$. For a set of points, we say it is $B$-SOS-certifiable if its empirical distribution is SOS-certifiable.
Note that $\mathbb{E}_{x \sim \mathcal{D}}\,\langle x - \mu, u\rangle^{2t} = \langle M_{2t},\, u^{\otimes 2t}\rangle$, so that this definition coincides with certifying an upper bound on (2.3).
2.2 Basic Sum-of-Squares Facts
We capture a few basic facts about sum-of-squares proofs that we will use later. First, sum-of-squares can certify all spectral norm bounds:
Fact 2 (Sum-of-Squares Proofs from Spectral Norm Bounds).
For any symmetric matrix $A \in \mathbb{R}^{d \times d}$, we have $x^\top A x \preceq_{\mathrm{sos}} \|A\|\,\|x\|_2^2$.
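The proof behind this fact is short enough to sketch here. Assuming $c \ge \|A\|$, the matrix $cI - A$ is positive semidefinite and hence has a square root $B$, which yields an explicit sum-of-squares decomposition:

```latex
% With B^\top B = cI - A (possible since cI - A \succeq 0 when c \ge \|A\|):
\begin{align*}
c\,\|x\|_2^2 - x^\top A x
  = x^\top (cI - A)\, x
  = x^\top B^\top B\, x
  = \|Bx\|_2^2
  = \sum_{i=1}^d \Big( \sum_{j=1}^d B_{ij}\, x_j \Big)^2,
\end{align*}
```

a sum of squares of linear forms, so the inequality $x^\top A x \le \|A\|\,\|x\|_2^2$ has a degree-2 sum-of-squares proof.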
As a corollary (by applying the spectral norm bound to higher-order tensors), we also have the following:
Fact 3 (Spectral Norm Bounds for Tensors).
For an order-$2t$ tensor $T$, suppose that $\langle T, S \otimes S\rangle \le B\,\|S\|_F^2$ for all symmetric order-$t$ tensors $S$. Then, $\langle T, u^{\otimes 2t}\rangle \preceq_{2t} B\,\|u\|_2^{2t}$.
Finally, the following basic inequality holds:
If $0 \preceq_{\mathrm{sos}} p \preceq_{\mathrm{sos}} q$ and $0 \preceq_{\mathrm{sos}} p' \preceq_{\mathrm{sos}} q'$, then $p\,p' \preceq_{\mathrm{sos}} q\,q'$ (with the degree of the proof being the sum of the degrees of the individual proofs).
This is useful because it allows us to multiply sum-of-squares inequalities together.
3 Poincaré Distributions
In this section, we note some important properties of the class of all Poincaré distributions.
Definition 1 (Poincaré Distributions).
A distribution $\mathcal{D}$ over $\mathbb{R}^d$ is said to be $\sigma$-Poincaré if it satisfies the following Poincaré inequality with parameter $\sigma$: for all differentiable functions $f : \mathbb{R}^d \to \mathbb{R}$,
$$\mathrm{Var}_{x \sim \mathcal{D}}\big[f(x)\big] \le \sigma^2\, \mathbb{E}_{x \sim \mathcal{D}}\big[\|\nabla f(x)\|_2^2\big]. \qquad (3.1)$$
Note that no discrete distribution can satisfy (3.1). To see why, consider for instance the uniform distribution over $\{-1, +1\}$. This cannot satisfy (3.1) for any $\sigma$, because for any differentiable function $f$ with $f(-1) = -1$, $f(+1) = +1$, and $f'(\pm 1) = 0$ we have $\mathrm{Var}[f] = 1$ but $\mathbb{E}[f'(x)^2] = 0$. More generally, (3.1) implies that there are no low-probability “valleys” separating two high-probability regions.
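This obstruction is easy to see numerically: for the uniform distribution on $\{-1, +1\}$, the smooth step functions $f_\varepsilon(x) = \tanh(x/\varepsilon)$ drive the ratio $\mathrm{Var}[f]/\mathbb{E}[f'(x)^2]$ to infinity as $\varepsilon \to 0$, so no finite Poincaré parameter can work. A small sketch (helper name ours):

```python
import numpy as np

def poincare_ratio(eps):
    """Var(f) / E[f'(x)^2] for f(x) = tanh(x / eps) under uniform on {-1, +1}."""
    xs = np.array([-1.0, 1.0])                 # support of the distribution
    f = np.tanh(xs / eps)
    var = f.var()                              # tends to 1 as eps -> 0
    # f'(x) = (1/eps) * sech(x/eps)^2, which vanishes at x = +-1 as eps -> 0
    grad_sq = ((1.0 / eps) / np.cosh(xs / eps) ** 2) ** 2
    return var / grad_sq.mean()

ratios = [poincare_ratio(e) for e in (1.0, 0.5, 0.25)]
# ratios grow without bound, so no finite Poincare constant exists
```

The same computation with any smooth surrogate of the sign function exhibits the same blow-up.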
Next we give some examples of distributions satisfying (3.1). First, if $\mathcal{D} = \mathcal{N}(\mu, \sigma^2 I)$ is a normal distribution with mean $\mu$ and covariance $\sigma^2 I$, then $\mathcal{D}$ satisfies (3.1) with parameter $\sigma$. More generally, any strongly log-concave distribution satisfies the Poincaré inequality:
Fact 2 ([Bé85]).
Suppose that $\mathcal{D} \propto \exp(-V(x))$, where $V$ is a function with strictly positive curvature: $\nabla^2 V(x) \succeq \sigma^{-2}\, I$ for all $x$. Then $\mathcal{D}$ satisfies the Poincaré inequality with parameter $\sigma$.
Another important class is the family of distributions with bounded support. It cannot be the case that an arbitrary bounded distribution satisfies (3.1), because we have already seen that no discrete distribution can satisfy (3.1). However, it is always possible to add noise to the distribution that smooths out the support and allows (3.1) to hold. Specifically:
Fact 3 ([BGMZ18]).
Suppose that $\mathcal{D}$ is a distribution on $\mathbb{R}^d$ whose support has radius at most $r$ in $\ell_2$-norm. Let $\mathcal{D}_\nu$ denote the result of adding Gaussian noise $\mathcal{N}(0, \nu^2 I)$ to $\mathcal{D}$. Then, if $\nu \ge r$, $\mathcal{D}_\nu$ satisfies the Poincaré inequality with parameter $O(\nu)$.
So, we can always cause a bounded random variable to satisfy (3.1) by adding sufficiently large Gaussian noise. We remark that while this is very useful in low dimensions, in high dimensions the radius $r$ typically grows as $\sqrt{d}$, in which case Fact 3 does not give very good bounds.
The Poincaré inequality is also preserved under products, sums, and uniformly continuous transformations. Specifically, we have the following:
The following composition rules hold:
If independent random variables $X$ and $Y$ are $\sigma_1$- and $\sigma_2$-Poincaré, respectively, then the product distribution $(X, Y)$ is $\max(\sigma_1, \sigma_2)$-Poincaré.
If independent random variables $X$ and $Y$ are $\sigma_1$- and $\sigma_2$-Poincaré, respectively, then $X + Y$ is $\sqrt{\sigma_1^2 + \sigma_2^2}$-Poincaré.
If $X$ is $\sigma$-Poincaré and $f$ satisfies $\|f(x) - f(y)\|_2 \le L\,\|x - y\|_2$ for all $x, y$, then $f(X)$ is $L\sigma$-Poincaré.
The above properties are all straightforward, but for completeness we prove them in Appendix B.
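For intuition, the sum rule follows from the law of total variance together with Jensen's inequality; a sketch, writing the Poincaré inequality as $\mathrm{Var}[f] \le \sigma^2\,\mathbb{E}\|\nabla f\|^2$ (our convention):

```latex
\begin{align*}
\mathrm{Var}\big[f(X+Y)\big]
  &= \mathbb{E}_Y\,\mathrm{Var}_X\big[f(X+Y)\,\big|\,Y\big]
     + \mathrm{Var}_Y\,\mathbb{E}_X\big[f(X+Y)\,\big|\,Y\big] \\
  &\le \sigma_1^2\,\mathbb{E}\,\|\nabla f(X+Y)\|^2
     + \sigma_2^2\,\mathbb{E}_Y\,\big\|\nabla_y\,\mathbb{E}_X[f(X+y)]\big\|^2 \\
  &\le \big(\sigma_1^2 + \sigma_2^2\big)\,\mathbb{E}\,\|\nabla f(X+Y)\|^2,
\end{align*}
```

where the last step uses $\|\nabla_y\,\mathbb{E}_X f(X+y)\| \le \mathbb{E}_X\|\nabla f(X+y)\|$ followed by Jensen's inequality.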
Implications of the Poincaré inequality
First, any distribution satisfying the Poincaré inequality has exponentially decaying tails. Specifically, the following is a well known fact:
Fact 5 (Tail Bound for Poincaré Distributions).
For any unit vector $v$ and a $\sigma$-Poincaré random variable $x$ with mean $\mu$, we have
$$\Pr\big[\,\langle x - \mu, v\rangle \ge s\,\big] \le C\,e^{-s/(C\sigma)}$$
for a universal constant $C$.
More generally, any distribution satisfying the Poincaré inequality also satisfies a Lipschitz concentration property:
Fact 6 ([BL97]).
If $x$ satisfies the Poincaré inequality with parameter $\sigma$ and $f$ satisfies $|f(x) - f(y)| \le L\,\|x - y\|_2$ for all $x, y$, then $\Pr\big[\,|f(x) - \mathbb{E} f(x)| \ge s\,\big] \le C\,e^{-s/(CL\sigma)}$ for a universal constant $C$. In particular, $f(x) - \mathbb{E}[f(x)]$ is sub-exponential with parameter $O(L\sigma)$.
Exponential concentration of Lipschitz functions (with weaker bounds than above) was first observed in [GM83]. Fact 6 generalizes the previous point on exponential concentration of linear functions, as can be seen by taking $f(x) = \langle x, v\rangle$ for a unit vector $v$.
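Fact 6 is also easy to observe empirically. Take $f(x) = \|x\|_2$, which is 1-Lipschitz, and let $x$ be a standard Gaussian, which is 1-Poincaré: the fluctuations of $f$ stay $O(1)$ no matter the dimension, even though its mean grows like $\sqrt{d}$. A quick Monte Carlo sketch:

```python
import numpy as np

# f(x) = ||x||_2 is 1-Lipschitz; a standard Gaussian is 1-Poincare, so
# ||x|| should fluctuate by O(1) regardless of dimension, even though its
# mean grows like sqrt(d).
rng = np.random.default_rng(1)
stds = []
for d in (10, 100, 1000):
    norms = np.linalg.norm(rng.standard_normal((20000, d)), axis=1)
    stds.append(norms.std())   # stays near 1/sqrt(2) for every d
```

The same dimension-free concentration holds for any Lipschitz function of a Poincaré random variable.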
Finally, and most important to our subsequent analysis, the gradient bound (3.1) implies an analogous bound for higher-order derivatives as well. Specifically, (3.1) can be re-written as saying that $\mathrm{Var}[f(x)] \le \sigma^2$ whenever $\|\nabla f(x)\|_2 \le 1$ for all $x$. More generally, we have:
Fact 7 ([AW15]).
Suppose that $\mathcal{D}$ satisfies the Poincaré inequality with parameter $\sigma$. Then, if $f : \mathbb{R}^d \to \mathbb{R}$ is a function whose lower-order derivatives vanish in expectation ($\mathbb{E}[\nabla^j f] = 0$ for all $j < t$) and which satisfies $\|\nabla^t f(x)\|_F \le 1$ for all $x$, we have
$$\mathrm{Var}_{x \sim \mathcal{D}}\big[f(x)\big] \le C_t\,\sigma^{2t},$$
where $C_t$ is a constant depending only on $t$.
Despite its simplicity, Fact 7 is actually a highly non-trivial consequence of the Poincaré inequality. It is a special case of results due to Adamczak and Wolff [AW15] (see Theorem 3.3 and the ensuing discussion therein); those results in turn build on work of Latała [Lat06].
In the next section, we will use Fact 7 to obtain a low-degree sum-of-squares proof of a sharp upper bound on the injective norms of moment tensors of arbitrary Poincaré distributions.
4 Certifying Injective Norms for Poincaré Distributions
Fact 5 shows an upper bound on the tails of $\langle x - \mu, v\rangle$ for $\sigma$-Poincaré distributions, and hence in particular on the $2t$-th moments for any $t$. The goal of this section is to show a low-degree sum-of-squares proof of this fact.
Specifically, the main goal is to show the following:
Let $\mathcal{D}$ be a zero-mean, $\sigma$-Poincaré distribution with $2t$-th moment tensor $M_{2t} = \mathbb{E}_{x \sim \mathcal{D}}\big[x^{\otimes 2t}\big]$. Then, for all $t$, $\mathcal{D}$ is $(C\,t\,\sigma)^{2t}$-SOS-certifiable:
$$\mathbb{E}_{x \sim \mathcal{D}}\,\langle x, u\rangle^{2t} \preceq_{2t} (C\,t\,\sigma)^{2t}\,\|u\|_2^{2t}. \qquad (4.1)$$
Moreover, given sufficiently many samples from $\mathcal{D}$, with probability $1 - \delta$ the moment tensor of the empirical distribution will also satisfy (4.1) (with a different constant $C$).
Recall that (4.1) means that there is a degree-$2t$ sum-of-squares proof that $\mathbb{E}_{x \sim \mathcal{D}}\,\langle x, u\rangle^{2t} \le (C\,t\,\sigma)^{2t}\,\|u\|_2^{2t}$ (as a polynomial in $u$).
In the rest of this section, we will first warm up by proving Theorem 1 for $t \le 2$ in the case of Gaussians; the proof in these cases is standard for sum-of-squares experts, but will help to illustrate a few important ideas used in the sequel. Next, in Section 4.1 we will prove Theorem 1 for $t = 2$ for arbitrary Poincaré distributions, which is no longer standard and contains most of the ideas in the general case. Finally, we will prove the general case in Section 4.2. We leave the issue of finite-sample concentration to the very end.
For $t = 2$, a natural idea is to flatten the moment tensor $M_4$ (which is a $d \times d \times d \times d$ tensor) into a $d^2 \times d^2$ matrix $M$, and obtain upper bounds in terms of the spectral norm of this flattened matrix.
Unfortunately, even for a Gaussian distribution this estimate can be off by a factor as large as the dimension $d$. Specifically, if $x$ is a Gaussian with covariance $\Sigma$, then the flattening $M$ is equal to $\mathrm{vec}(\Sigma)\,\mathrm{vec}(\Sigma)^\top + 2\,\mathrm{Sym}(\Sigma \otimes \Sigma)$, where $\mathrm{vec}(\Sigma)$ flattens the $d \times d$ matrix $\Sigma$ to a $d^2$-dimensional vector, and $\mathrm{Sym}$ denotes the projection onto the symmetric subspace. (We skip the standard argument based on Isserlis’ theorem.) This is problematic because it means that $\|M\| \ge \|\Sigma\|_F^2$. If e.g. $\Sigma$ is the identity matrix, then $\|M\| \ge d$, while we would hope to certify an upper bound of $3 = \max_{\|u\|_2 = 1}\,\mathbb{E}\,\langle x, u\rangle^4$.
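This dimension-dependent gap can be checked directly. The sketch below builds the exact fourth moment tensor of $\mathcal{N}(0, I_d)$ from Isserlis' theorem, and compares the spectral norm of its naive $d^2 \times d^2$ flattening (which is $d + 2$) against the value of the associated polynomial at a unit vector (which is $3$):

```python
import numpy as np

d = 8
I = np.eye(d)
# Exact 4th moment tensor of N(0, I_d) via Isserlis' theorem:
# E[x_i x_j x_k x_l] = d_ij d_kl + d_ik d_jl + d_il d_jk.
T = (np.einsum('ij,kl->ijkl', I, I)
     + np.einsum('ik,jl->ijkl', I, I)
     + np.einsum('il,jk->ijkl', I, I))

M = T.reshape(d * d, d * d)           # naive (ij),(kl) flattening
naive = np.linalg.eigvalsh(M).max()   # spectral norm: d + 2, grows with d

u = np.random.default_rng(2).standard_normal(d)
u /= np.linalg.norm(u)
injective_val = np.einsum('ijkl,i,j,k,l->', T, u, u, u, u)  # = 3 for any unit u
```

So the naive spectral certificate overshoots the true injective norm by a factor of roughly $d/3$.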
The key idea that allows us to get a (much) improved bound here is to observe that, as polynomials in $u$, $(u \otimes u)^\top\,\mathrm{vec}(\Sigma)\,\mathrm{vec}(\Sigma)^\top\,(u \otimes u) = (u \otimes u)^\top\,(\Sigma \otimes \Sigma)\,(u \otimes u)$, since both sides equal $(u^\top \Sigma u)^2$. That is, the degree-$4$ polynomials defined by the matrices $\mathrm{vec}(\Sigma)\,\mathrm{vec}(\Sigma)^\top$ and $\Sigma \otimes \Sigma$ are equal; this allows us to “change the representation” for the same polynomial and pass to a “representation” where the associated matrix has a smaller spectral norm. This fact has a simple sum-of-squares proof (it is sometimes referred to by saying that pseudo-distributions respect PPT symmetries) and allows us to now upper bound $\mathbb{E}\,\langle x, u\rangle^4 = 3\,(u^\top \Sigma u)^2 \preceq_{\mathrm{sos}} 3\,\|\Sigma\|^2\,\|u\|_2^4$.
The above argument shows how one can exploit the symmetry properties of pseudo-distributions in order to certify strikingly better upper bounds on the maximum of the degree-$2t$ polynomials associated with the moment tensors. This suggests a natural strategy for going beyond $t = 2$: write the moment tensor as a sum of (constantly many) terms and show that each term, as a polynomial, is equivalent to one whose canonical flattening as a matrix has a small spectral norm. This argument can (with much tedium) be made to work for small $t$ but gets unwieldy for large $t$. A more serious issue is that the argument above relies on the structure of the moment tensor being known to us. In our argument for arbitrary Poincaré distributions, we cannot rely on knowing the structure of the moment tensors, and so we will need a different proof technique.
Degree-4 proof for Poincaré distributions
To establish sum-of-squares bounds for general Poincaré distributions, we make use of Fact 7 from Section 3. Recall that Fact 7 states (for $t = 2$) that if $f$ satisfies $\mathbb{E}[\nabla f] = 0$ and $\|\nabla^2 f(x)\|_F \le 1$ for all $x$, then $\mathrm{Var}[f] \le C_2\,\sigma^4$. We will define the polynomial $p_M(x) = \big\langle (x - \mu)(x - \mu)^\top - \Sigma,\; M \big\rangle$ for a symmetric matrix $M$, where $\mu$ is the mean and $\Sigma$ is the covariance of $\mathcal{D}$.
Note that $\mathbb{E}[\nabla p_M(x)] = 2\,M\,\mathbb{E}[x - \mu] = 0$, while $\nabla^2 p_M = 2M$. Therefore, we have $\mathrm{Var}[p_M] \le 4\,C_2\,\sigma^4\,\|M\|_F^2$ for all symmetric matrices $M$. This implies that for the tensor $T = \mathbb{E}\big[\big((x - \mu)(x - \mu)^\top - \Sigma\big)^{\otimes 2}\big]$, we have $\langle T, u^{\otimes 4}\rangle \preceq_{\mathrm{sos}} 4\,C_2\,\sigma^4\,\|u\|_2^4$ (by Fact 3).
Next note that $\mathbb{E}\,\langle x - \mu, u\rangle^4 = \langle T, u^{\otimes 4}\rangle + (u^\top \Sigma u)^2$, since the cross-term $2\,(u^\top \Sigma u)\,\mathbb{E}\big[u^\top\big((x - \mu)(x - \mu)^\top - \Sigma\big)u\big]$ vanishes. Therefore, we have the sum-of-squares proof