Smoothed Analysis in Unsupervised Learning via Decoupling
Smoothed analysis is a powerful paradigm for overcoming worst-case intractability in unsupervised learning and high-dimensional data analysis. While polynomial time smoothed analysis guarantees have been obtained for worst-case intractable problems like tensor decompositions and learning mixtures of Gaussians, such guarantees have been hard to obtain for several other important problems in unsupervised learning. A core technical challenge is obtaining lower bounds on the least singular value of random matrix ensembles with dependent entries, where the entries are given by low-degree polynomials of a few underlying base random variables.
In this work, we address this challenge by obtaining high-confidence lower bounds on the least singular value of new classes of structured random matrix ensembles of the above kind. We then use these bounds to obtain polynomial time smoothed analysis guarantees for the following three important problems in unsupervised learning:
Robust subspace recovery, when the fraction of inliers in the d-dimensional subspace T ⊆ R^n is at least (d/n)^ℓ for any constant ℓ. This contrasts with the known worst-case intractability when the fraction of inliers is below d/n, and with the previous smoothed analysis result, which needed an inlier fraction of at least d/n (Hardt and Moitra, 2013).
Higher order tensor decompositions, where we generalize the so-called FOOBI algorithm of Cardoso to find order-ℓ rank-one tensors in a subspace. This allows us to obtain polynomially robust decomposition algorithms for 2ℓ'th order tensors with rank up to n^ℓ.
Learning overcomplete hidden Markov models, where the size of the state space can be any polynomial in the dimension of the observations. This gives the first polynomial time guarantees for learning overcomplete HMMs in the smoothed analysis model.
Several basic computational problems in unsupervised learning, like learning probabilistic models, clustering, and representation learning, are intractable in the worst case. Yet practitioners have had remarkable success in designing heuristics that work well on real-world instances. Bridging this disconnect between theory and practice is a major challenge for many problems in unsupervised learning and high-dimensional data analysis.
The paradigm of Smoothed Analysis [ST04] has proven to be a promising avenue when the algorithm has only a few isolated bad instances. Given any instance from the whole problem space (potentially the worst input), smoothed analysis gives good guarantees for most instances in a small neighborhood around it; this is formalized by small random perturbations of worst-case inputs. This powerful notion was developed to explain the practical efficiency of the simplex algorithm in solving linear programs [ST04]. It has also been used for many other problems including linear binary optimization problems like knapsack and bin packing [BV06], multi-objective optimization [MO11], local max-cut [ER17, ABPW17], and supervised learning [KST09]. Smoothed analysis gives an elegant way of interpolating between traditional average-case analysis and worst-case analysis by varying the size of the random perturbations.
In recent years, smoothed analysis has been particularly useful in unsupervised learning and high-dimensional data analysis, where the hard instances often correspond to adversarial, degenerate configurations. For instance, consider the problem of finding a low-rank decomposition of an order-3 tensor T that can be expressed as T = Σ_{i=1}^k u_i ⊗ v_i ⊗ w_i. It is NP-hard to find a rank-k decomposition in the worst case when the rank k exceeds the dimension [Hås90] (this setting is called the overcomplete setting). On the other hand, when the factors of the tensor are perturbed with a small amount of random Gaussian noise, there exist polynomial time algorithms that successfully find a rank-k decomposition with high probability, even when the rank is polynomially larger than the dimension [BCMV14, ABG14]. Similarly, parameter estimation for basic latent variable models like mixtures of spherical Gaussians has exponential sample complexity in the worst case [MV10]; yet, polynomial time guarantees can be obtained using smoothed analysis, where the parameters (e.g., the means of the Gaussians) are randomly perturbed in high dimensions [HK12, BCMV14, ABG14, GHK15].¹ Subsequently, smoothed analysis results have also been obtained for other problems like learning mixtures of general Gaussians [GHK15], overcomplete ICA [GVX14], and fourth-order tensor decompositions [MSS16].

¹As in many other unsupervised learning problems, the random perturbation to the parameters cannot be simulated by perturbations to the input (i.e., points sampled from the mixture). Hence, unlike binary linear optimization [BV06], such smoothed analysis settings in learning are not limited by known NP-hardness and hardness-of-approximation results.
The technical core of many of the above smoothed analysis results involves analyzing the minimum singular value of certain carefully constructed random matrices with dependent entries. Let ũ_1, …, ũ_m be random (Gaussian) perturbations of points u_1, …, u_m (think of the average length of the small random perturbation as ρ). Typically, these correspond to the unknown parameters of the probabilistic model that we are trying to learn. Proving polynomial smoothed complexity bounds often boils down to proving an inverse polynomial lower bound on the minimum singular value of certain random matrices (that depend on the algorithm), where every entry of the matrix is a multivariate polynomial in a few of the perturbed vectors ũ_1, …, ũ_m. These bounds need to hold with a sufficiently small failure probability over the randomness in the perturbations.
Let us now consider some examples to give a flavor of the statements that arise in applications.
In learning mixtures of spherical Gaussians via tensor decomposition, the key matrix that arises is the “product of means” matrix, in which the number of columns is k, the number of components in the mixture, and the i'th column is the flattened tensor μ_i ⊗ μ_i ⊗ ⋯ ⊗ μ_i, where μ_i is the mean of the i'th component.
In the so-called FOOBI algorithm for tensor decomposition (proposed by [Car91], which we will study later), the complexity as well as the correctness of the algorithm depend on a special matrix M being well conditioned. M has the following form: each column corresponds to a pair of indices (i, j), and the (i, j)'th column is a fixed combination of tensor products of the factors u_i and u_j.
In learning hidden Markov models (HMMs), the matrix of interest is one in which each column is a sum of monomials of the form x_{i_1} ⊗ x_{i_2} ⊗ ⋯ ⊗ x_{i_ℓ}, where the index sequences (i_1, …, i_ℓ) correspond to length-ℓ paths in the graph being learned.
It turns out that for many of the recent algorithms based on spectral and tensor decomposition methods (e.g., ones in [AMR09, AGH12]), one can write down matrices whose condition numbers determine the complexity of the corresponding algorithms.
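To make the first example above concrete, here is a small numpy sketch (the sizes and the use of random means are illustrative assumptions) that builds a “product of means” style matrix whose columns are flattened tensors μ_i^{⊗3}, and checks that it is robustly full column rank:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, ell = 8, 20, 3  # illustrative: k components, order-3 flattening

# Columns are flattened tensors mu_i^{(x)ell}; with generic (or smoothed)
# means, the matrix should be well conditioned whenever k is well below n^ell.
means = rng.standard_normal((k, n))
cols = []
for mu in means:
    t = mu
    for _ in range(ell - 1):
        t = np.multiply.outer(t, mu)  # build mu^{(x)ell} one mode at a time
    cols.append(t.ravel())
M = np.column_stack(cols)             # shape (n^ell, k)

sigma_min = np.linalg.svd(M, compute_uv=False)[-1]
print(sigma_min > 1e-6)  # robustly full column rank here
```

Degenerate configurations (e.g., repeated means) drive this least singular value to zero, which is exactly the kind of instance smoothed analysis rules out.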
The high level question we study in this paper is the following: can we obtain a general characterization of when such matrices have a polynomial condition number with high probability? For instance, in the first example, we may expect that as long as k is at most (a constant fraction of) the dimension of the space of symmetric tensors, the matrix has an inverse polynomial condition number (note that this dimension is smaller than n^ℓ due to the symmetries).
We mention two general approaches to the question above. The first is a characterization that follows from results in algebraic geometry (see [AMR09, Str83]). These results state that the matrix of polynomials either has a square sub-matrix whose determinant is the identically zero polynomial, or the matrix is generically full rank. This means that the set of vectors u_1, …, u_m for which the matrix is singular has measure zero. However, note that this characterization is far from being quantitative. For polynomial time algorithms, we typically need the least singular value to be inverse polynomial with high probability (this is because polynomial sample complexity often requires these algorithms to be robust to inverse polynomial error). A second approach is via known anti-concentration inequalities for polynomials. In certain settings, these can be used to prove that each column must have at least a small non-zero component orthogonal to the span of the other columns (which would imply a lower bound on the least singular value). However, unless done carefully, as we will see, this approach does not lead to strong enough probability guarantees for the condition number.
Our main contribution is to prove lower bounds on the least singular value for some broad classes of random matrix ensembles in which the entries are low-degree multivariate polynomials of the entries of a given set of randomly perturbed vectors. The technical difficulty arises from the correlations in the perturbations (different matrix entries can be polynomials of the same “base” variables). We remark that even the case when each entry of the matrix is perturbed independently, which corresponds to lower bounding the minimum singular value of a random rectangular matrix, is already non-trivial and has been studied extensively in random matrix theory (see [Tao11, RV08]).
Our results lead to new smoothed analysis guarantees for learning overcomplete hidden Markov models, as well as improved bounds for overcomplete tensor decompositions and for robust subspace recovery.
1.1 Our Results
We give lower bounds for the least singular value of some general classes of random matrix ensembles, where the entries of the matrix are given by low-degree multivariate polynomials of perturbed random vectors. We also instantiate these results to derive new smoothed analysis guarantees in unsupervised learning.
Our first result applies to a random ensemble in which there are m perturbed vectors ũ_1, …, ũ_m, and the i'th column of the matrix is a fixed polynomial function of ũ_i.
Informal Theorem 1.1.
Let ℓ be a constant, let u_1, …, u_m be an arbitrary collection of vectors in R^n, and let f be a vector-valued homogeneous polynomial map of degree ℓ whose coordinates f_1, f_2, … are arbitrary homogeneous degree-ℓ polynomials. Denote by M the coefficient matrix of f, with the j'th row representing the coefficients of f_j, and suppose M has sufficiently many non-negligible singular values. Then, with probability at least 1 − exp(−n^{Ω(1)}), the matrix whose i'th column is f(ũ_i) has least singular value at least ρ^ℓ/poly(n), where ũ_i represents a random perturbation of u_i with independent Gaussian noise of expected length ρ.
In fact, the proof of this theorem essentially gives a vector-valued generalization of the Carbery-Wright anti-concentration inequality [CW01], which may be of independent interest (see Section 1.2 and Theorem 3.2 for details). We note that the singular value condition on the coefficient matrix in Theorem 1.1 is qualitatively necessary: firstly, the relevant singular value being non-negligible is a necessary condition. Secondly, by choosing the coefficient matrix to be a projector onto an appropriate product space, it is not hard to see that many non-trivial singular values are necessary for the required concentration bounds (see Proposition 3.11 for a more detailed explanation). Hence, Theorem 1.1 gives an almost tight condition (up to the exact polynomial of n in the failure probability exponent) for the least singular value of the above random matrix ensemble to be non-negligible. We will use this theorem to derive improved smoothed polynomial time guarantees for robust subspace recovery (Theorem 5.1).
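The following small experiment illustrates the flavor of Theorem 1.1 (the sizes and the random choice of coefficient matrices are assumptions for illustration, not the theorem's exact conditions): even when all base points coincide, a degenerate worst-case instance, small perturbations make the columns f_i(ũ_i) robustly linearly independent.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, rho = 10, 15, 0.01  # illustrative sizes; rho is the perturbation scale

# Adversarial base points: all identical (a degenerate worst-case instance).
u = np.ones((m, n))
u_pert = u + rho * rng.standard_normal((m, n))  # rho-perturbation

# Degree-2 maps f_i(x) = M_i (x kron x); random M_i here stand in for fixed
# coefficient matrices with many non-negligible singular values.
R = 40
Ms = rng.standard_normal((m, R, n * n)) / np.sqrt(n * n)
cols = [Ms[i] @ np.kron(u_pert[i], u_pert[i]) for i in range(m)]
A = np.column_stack(cols)  # shape (R, m)

sigma_min = np.linalg.svd(A, compute_uv=False)[-1]
print(sigma_min > 1e-3)  # columns robustly independent despite degenerate base points
```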
There are several other applications in unsupervised learning where the i'th column of the matrix does not depend solely on ũ_i, but on a small subset of the vectors ũ_1, …, ũ_m in a structured form. The second class of random matrix ensembles that we consider corresponds to a setting where each column of the matrix depends on a few of the vectors as a “monomial” in terms of tensor products, i.e., each column is of the form ũ_{i_1} ⊗ ũ_{i_2} ⊗ ⋯ ⊗ ũ_{i_ℓ}, where i_1, …, i_ℓ ∈ [m]. Before we proceed, we need some notation to describe the result. For two monomials ũ_{i_1} ⊗ ⋯ ⊗ ũ_{i_ℓ} and ũ_{j_1} ⊗ ⋯ ⊗ ũ_{j_ℓ}, we say that they disagree in t positions if i_s ≠ j_s for exactly t different values of s. For a fixed column c, let Δ_t(c) represent the number of other columns whose monomial disagrees with that of column c in exactly t positions, and let Δ_t = max_c Δ_t(c). (Note that Δ_0(c) = 0 when the column monomials are distinct.)
Informal Theorem 1.2.
Let ũ_1, …, ũ_m be a set of ρ-perturbed vectors in R^n, let ℓ be a constant, and let A be a matrix whose columns are tensor monomials in ũ_1, …, ũ_m, with Δ_t defined as above for t = 1, …, ℓ. If for each t the disagreement count Δ_t is at most a sufficiently small multiple of n^t, then σ_min(A) ≥ ρ^ℓ/poly(n) with probability at least 1 − exp(−n^{Ω(1)}).
The above statement will be crucial in obtaining smoothed polynomial time guarantees for learning overcomplete hidden Markov models (Theorem 1.6), and for higher order generalizations of the FOOBI algorithm of [Car91], which give improved tensor decomposition algorithms up to rank n^ℓ for order-2ℓ tensors (Theorem 1.4). In both these applications, the matrix of interest can be expressed in terms of linear combinations of the columns of such a monomial matrix, where the combination matrix has full column rank (in a robust sense). Finally, the ideas here can also be used to derive lower bounds on least singular values for other general random matrix ensembles, where each column is obtained by applying an arbitrary degree-ℓ homogeneous multivariate polynomial, expressed in terms of tensor products, to every combination of ℓ vectors out of ũ_1, …, ũ_m. See Theorem 4.4 and Section B for a formal statement and proof.
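The disagreement counts in the condition of Theorem 1.2 are purely combinatorial and easy to compute. A small sketch, using hypothetical length-2 monomials over 4 base vectors:

```python
from itertools import product

# Columns indexed by tuples (i_1, ..., i_ell): column = x_{i_1} (x) ... (x) x_{i_ell}.
# For each column, count how many other columns disagree in exactly t positions.
def disagreement_profile(monomials):
    ell = len(monomials[0])
    profiles = []
    for a in monomials:
        counts = [0] * (ell + 1)
        for b in monomials:
            t = sum(x != y for x, y in zip(a, b))
            counts[t] += 1
        profiles.append(counts)
    return profiles

# Example: all length-2 monomials over 4 base vectors (an illustrative setting).
mons = list(product(range(4), repeat=2))
prof = disagreement_profile(mons)
print(prof[0])  # [1, 6, 9]: itself, 6 disagreeing in one position, 9 in both
```

Theorem 1.2 asks that the count for t disagreements be small compared to n^t, which holds here once the ambient dimension is moderately large.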
We now describe how the above results give new smoothed analysis results for three different problems in unsupervised learning.
Robust Subspace Recovery
Robust subspace recovery is a basic problem in unsupervised learning where we are given m points in R^n, an α fraction of which lie in a d-dimensional subspace T. When can we find the subspace T, and hence the “inliers” that belong to this subspace? This problem is closely related to designing a robust estimator for subspace recovery: a β-robust estimator for subspace recovery approximately recovers the subspace even when a β fraction of the points are corrupted arbitrarily. The largest fraction of corruptions that an estimator (an algorithm that estimates the subspace T) tolerates is called the breakdown point of the estimator. This problem has attracted significant attention in the robust statistics community [Rou84, RL05, DH83], yet many of these estimators are not computationally efficient in high dimensions. On the other hand, the singular value decomposition is not robust to outliers. Hardt and Moitra [HM13] gave the first algorithm for this problem that is both computationally efficient and robust. Their algorithm successfully estimates the subspace when the fraction of inliers is at least d/n, assuming a certain non-degeneracy condition on both the inliers and outliers.² This algorithm is also robust to a small amount of noise in each point, i.e., the inliers need not lie exactly on the subspace T. They complemented their result with a computational hardness result in the worst case (based on the Small Set Expansion hypothesis) for finding the subspace when the fraction of inliers is below d/n.

²This general position condition holds in a smoothed analysis setting.
We give a simple algorithm that, for any constant ℓ, runs in time n^{O(ℓ)} and, in a smoothed analysis setting, provably recovers the subspace with high probability when the fraction of inliers is at least roughly (d/n)^ℓ. Note that this is significantly smaller than the d/n bound of [HM13]. For instance, in the setting when d = δn for some constant δ < 1, our algorithm recovers the subspace when the fraction of inliers is any constant, by choosing ℓ to be a suitably large constant, while the previous result requires that at least a δ fraction of the points are inliers. In our smoothed analysis setting, each point is given a small random perturbation: each outlier is perturbed with an n-variate Gaussian (think of the perturbation length as ρ), and each inlier is perturbed with the projection of an n-variate Gaussian onto the subspace T.
Informal Theorem 1.3.
For any constant ℓ and any ε > 0, suppose there are m points which are randomly ρ-perturbed according to the smoothed analysis model described above, with an α fraction of the points being close to a d-dimensional subspace T, where α is at least roughly (d/n)^ℓ. Then there is an efficient algorithm that returns a subspace T′ that is ε-close to T, with high probability.
See Section 5 for a formal statement, algorithm and proof. The proof of the above result crucially relies on Theorem 1.1 (and Theorem 1.7) about least singular value bounds in the smoothed analysis setting.
While the above result gives smoothed analysis guarantees when the fraction of inliers is roughly (d/n)^ℓ, the hardness result of [HM13] shows that finding a d-dimensional subspace that contains a fraction of the points slightly below d/n is computationally hard assuming the Small Set Expansion conjecture. Hence our result presents a striking contrast between intractability in the worst case and a computationally efficient algorithm in a smoothed analysis setting, where the inlier fraction can be as small as (d/n)^ℓ for a constant ℓ.
Overcomplete Tensor Decompositions
Tensor decomposition has been a crucial tool in many of the recent developments in proving learning guarantees for unsupervised learning problems. The problem here is the following. Suppose a_1, …, a_k are vectors in R^n. Consider the ℓ'th order moment tensor

T = a_1^{⊗ℓ} + a_2^{⊗ℓ} + ⋯ + a_k^{⊗ℓ}.

The question is whether the decomposition {a_i} can be recovered given access only to the tensor T. This is impossible in general. For instance, with ℓ = 2, the a_i can only be recovered up to a rotation. The remarkable result of Kruskal [Kru77] shows that for ℓ ≥ 3, the decomposition is “typically” unique, as long as k is small enough. Several works [Har70, Car91, AGH12, MSS16] have designed efficient recovery algorithms in different regimes of k, under various assumptions on the a_i. The other important question is whether the a_i can be recovered assuming that we only have access to T + E, for some noise tensor E.
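For concreteness, the moment tensor can be formed directly; a numpy sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, ell = 6, 4, 3  # illustrative sizes

# T = sum_i a_i^{(x)ell}: the ell'th order moment tensor of the unknown a_i.
A = rng.standard_normal((k, n))
T = np.zeros((n,) * ell)
for a in A:
    T += np.einsum('i,j,k->ijk', a, a, a)  # rank-one term a^{(x)3}

# Sanity check: T is symmetric, i.e., invariant under permuting its modes.
print(np.allclose(T, T.transpose(1, 0, 2)))  # True
```

The decomposition problem is to recover the rows of A (up to permutation and sign) from T alone, possibly after adding a small noise tensor E.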
Works based on the sum-of-squares hierarchy achieve the best dependence on k (i.e., they handle the largest values of k) and also have the best noise tolerance, but they require strong incoherence (or even Gaussian) assumptions on the a_i. Meanwhile, spectral algorithms (such as [GVX14, BCMV14]) achieve a weaker dependence on k and can tolerate a significantly smaller amount of noise, but they guarantee recoverability for smoothed vectors a_i, which is considerably more general than recoverability for random vectors. The recent work of [MSS16] bridges the two approaches.
Our result here is a decomposition algorithm for 2ℓ'th order tensors that achieves efficient recovery guarantees in the smoothed analysis model, as long as the rank k is at most c·n^ℓ for a constant c > 0. Our result is based on a generalization of the algorithm of Cardoso [Car91, DLCC07], who considered the case ℓ = 2 (fourth-order tensors). Our contribution here is also a robustness analysis for this algorithm: we show that the algorithm can recover the decomposition to an arbitrary precision (up to a permutation), as long as the noise is a sufficiently small polynomial in ρ and 1/n, where ρ is the perturbation parameter in the smoothed analysis model.
Informal Theorem 1.4.
Let ℓ ≥ 2 be an integer. Suppose we are given a 2ℓ'th order tensor T = Σ_i a_i^{⊗2ℓ} + E, where a_1, …, a_k are ρ-perturbations of vectors with polynomially bounded length. Then with high probability, we can find the a_i up to any desired accuracy (up to a permutation), assuming that the rank k is at most c·n^ℓ for a constant c > 0, and the magnitude of the noise E is a sufficiently small polynomial in ρ and 1/n.
See Theorem 7.1 and Section 7 for a formal statement and details. We remark that there exist other generalizations of the FOOBI algorithm of Cardoso to higher ℓ [AFCC04]. However, to the best of our knowledge, no analysis is known for these algorithms that is robust to inverse polynomial error for ℓ > 2. Further, our algorithm is a very simple generalization of Cardoso's algorithm to higher ℓ.
This yields an improvement in the best-known dependence on the rank k in such a smoothed analysis setting, from roughly n^{ℓ−1} (which follows from [BCMV14]) to n^ℓ. Previously, such results were only known for ℓ = 2 [MSS16]. Apart from this quantitative improvement, our result also has a more qualitative contribution: it yields an algorithm for the problem of finding symmetric rank-one tensors in a linear subspace.
Informal Theorem 1.5.
Suppose we are given a basis for a k-dimensional subspace S that is equal to the span of the flattenings of a_1^{⊗ℓ}, …, a_k^{⊗ℓ}, where the a_i are unknown ρ-perturbed vectors. Then the a_i can be recovered in polynomial time with high probability. Further, this is also true if the given basis for S is known only up to an inverse-polynomial perturbation.
Learning Overcomplete Hidden Markov Models
Hidden Markov Models (HMMs) are latent variable models that are extensively used for data with a sequential structure, as in reinforcement learning, speech recognition, image classification, and bioinformatics [Edd96, GY08]. In an HMM, there is a hidden state sequence that takes values in {1, …, r} and forms a stationary Markov chain with a transition matrix and an initial distribution (assumed to be the stationary distribution). The observation at each time step is represented by a vector in R^d. Given the state at time t, the observation at time t is conditionally independent of all other observations and states. The observation matrix O (of size d × r) represents the conditional means of the observations: the i'th column of O represents the expectation of the observation conditioned on the hidden state being i.
In an HMM with continuous observations, the distribution of the observation conditioned on the state being i can be a Gaussian, and the i'th column of O would correspond to its mean. In the discrete setting, each column of O can correspond to the parameters of a discrete distribution over an alphabet of size d.
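A minimal forward-sampling sketch of the continuous-observation model just described (the sizes, the column-stochastic convention for the transition matrix, and the Gaussian noise scale are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
r, d = 5, 3  # illustrative: r hidden states, d-dimensional observations

# Column-stochastic transition matrix P, and observation matrix O whose
# i'th column is the conditional mean of the observation in state i.
P = rng.random((r, r))
P /= P.sum(axis=0, keepdims=True)
O = rng.standard_normal((d, r))

def sample(T):
    h = rng.integers(r)  # initial hidden state (uniform, for simplicity)
    obs = []
    for _ in range(T):
        obs.append(O[:, h] + 0.1 * rng.standard_normal(d))  # Gaussian observation
        h = rng.choice(r, p=P[:, h])                         # Markov transition
    return np.array(obs)

x = sample(10)
print(x.shape)  # (10, 3)
```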
An important regime for HMMs in the context of many settings in image classification and speech is the overcomplete setting, where the dimension d of the observations is much smaller than the size r of the state space. Many existing algorithms for HMMs are based on tensor decompositions and work in the regime when r ≤ d [AHK12, AGH12]. In the overcomplete regime, there have been several works [AMR09, BCV14, HGKD15] that establish identifiability (and identifiability with polynomially many samples) under some non-degeneracy assumptions, but obtaining polynomial time algorithms has been particularly challenging. Very recently, Sharan et al. [SKLV17] gave a polynomial time algorithm for learning the parameters of overcomplete discrete HMMs when the observation matrix is random (and sparse) and the transition matrix is well-conditioned, under some additional sparsity assumptions on both the transition and observation matrices (e.g., each node in the transition graph has small degree). Using Theorem 1.2, we give a polynomial time algorithm in the more challenging smoothed analysis setting where the entries of the observation matrix are perturbed with small random Gaussian noise.³

³While small Gaussian perturbations make the most sense in a continuous observation setting, we believe that these ideas should also imply similar results in the discrete setting for an appropriate smoothed analysis model.
Informal Theorem 1.6.
Suppose we are given a Hidden Markov Model with r states and d-dimensional observations, with hidden parameters given by the transition matrix and the observation matrix O. Suppose the transition matrix is sparse (in both rows and columns), the number of states r is at most a polynomial in d, each entry of the observation matrix O is ρ-randomly perturbed (in a smoothed analysis sense), and the stationary distribution puts non-negligible probability on every state. Then there is an algorithm that uses polynomially many samples over time windows of constant length, and recovers the parameters up to accuracy ε (in Frobenius norm) in polynomial time, with high probability.
For comparison, the result of Sharan et al. [SKLV17] applies to discrete HMMs and gives a polynomial time algorithm that also uses samples over small time windows (with no extra explicit lower bound on the stationary distribution). But it assumes that the observation matrix is fully random, and makes additional assumptions about the sparsity of both the transition and observation matrices, and about the non-existence of short cycles. On the other hand, we can handle the more general smoothed analysis setting for the observation matrix, for any number of states that is polynomial in d, and we assume no additional conditions about the non-existence of short cycles. To the best of our knowledge, this gives the first polynomial time guarantees in the smoothed analysis model for learning overcomplete HMMs.
Our results complement the surprising sample complexity lower bound of Sharan et al. [SKLV17], who showed that it is statistically impossible to recover the parameters with polynomially many samples when the number of states is too large compared to the dimension, even when the observation matrix is random. The above theorem (Theorem 1.6) follows by a simple application of Theorem 1.2 in conjunction with existing algorithms based on tensor decompositions [AMR09, AGH12, BCMV14, SKLV17].
We first sketch some of the technical ideas involved in the proofs of Theorem 1.1 and Theorem 1.2, before describing how they can be used to prove smoothed analysis bounds for robust subspace recovery, higher order tensor decompositions, and learning HMMs.
Theorem 1.1 and Theorem 1.2 give lower bounds on the minimum singular values of random matrix ensembles whose entries are polynomials of a few base perturbed random vectors. These bounds need to hold with a sufficiently small failure probability, inverse polynomial or even (sub-)exponentially small, over the randomness in the perturbations. This is desirable since, in smoothed analysis applications, these random perturbations correspond to randomness in the generation of the given input; the perturbation is “one shot”, and the success probability cannot be amplified by repetition. Further, the running time and sample complexity in many of these applications have an inverse polynomial dependence on the minimum singular value. Hence, for instance, a guarantee that the minimum singular value is inverse polynomial with only inverse polynomial failure probability may not suffice to show smoothed polynomial complexity.
Theorem 1.1 crucially relies on the following theorem, which is also a statement of independent interest.
Informal Theorem 1.7.
Let δ ∈ (0, 1), let Sym denote the space of all symmetric order-ℓ tensors over R^n, and let W be an arbitrary subspace of Sym of dimension at most (1 − δ)·dim(Sym). If Π represents the projection matrix onto the subspace of Sym orthogonal to W, then for a ρ-perturbed vector ũ we have ‖Π ũ^{⊗ℓ}‖ ≥ ρ^ℓ/poly(n) with probability at least 1 − exp(−n^{Ω(1)}).
Note that the above statement immediately implies a lower bound on the least singular value of a matrix whose columns are symmetric tensor powers ũ_1^{⊗ℓ}, …, ũ_k^{⊗ℓ}, by using the leave-one-out distance characterization of the least singular value (see Lemma 2.2): if the number of columns k is at most a constant fraction of the dimension of the space of symmetric tensors, the least singular value is inverse polynomial with exponentially small failure probability.
The proofs of both Theorem 1.1 and Theorem 1.2 use the smoothed analysis result of Bhaskara, Charikar, Moitra and Vijayaraghavan [BCMV14], which shows minimum singular value bounds (with sub-exponentially small failure probability) for tensor products of vectors that have been independently perturbed. Given randomly perturbed vectors, Bhaskara et al. [BCMV14] analyze the minimum singular value of a matrix in which each column is the tensor product of ℓ distinct, independently perturbed vectors. However, this setting does not suffice for proving Theorem 1.1, Theorem 1.2, or the different applications presented here, for the following two reasons. First, in [BCMV14] each column of the matrix depends on a disjoint set of vectors, and any vector is involved in only one column; in our settings, by contrast, a column may be a symmetric tensor power of a single perturbed vector, so the factors within a column are not independent.
Our main tools for proving Theorem 1.1 and Theorem 1.2 are various decoupling techniques to overcome the dependencies that exist in the randomness of different terms. Decoupling inequalities [dlPMS95] are often used to prove concentration bounds (bounds on the upper tail) for polynomials of random variables. However, in our case they will be used to establish lower bounds on minimum singular values. This has an anti-concentration flavor, since we are giving an upper bound on the “small ball probability”, i.e., the probability that the minimum singular value lies in a small ball around 0. For Theorem 1.1 (and Theorem 1.7), which handles symmetric tensor products, we use a combination of asymmetric decoupling along with a positive correlation inequality for polynomials that is inspired by [Lov10].
We remark that one approach towards proving lower bounds on the least singular value for the random matrix ensembles that we are interested in is through anti-concentration inequalities for low-degree polynomials, like the Carbery-Wright inequality. In certain settings, a direct application of the Carbery-Wright inequality can be used to prove that each column must have at least a small non-zero component orthogonal to the span of the other columns. This would imply an inverse polynomial lower bound on the least singular value, but only with inverse polynomial failure probability, which does not suffice for our purposes (see [ABG14] for smoothed analysis bounds using this approach).
In fact, the ideas developed here can be used to prove a vector-valued generalization of the Carbery-Wright anti-concentration inequality [CW01]. In what follows, we will represent a degree-ℓ multivariate polynomial g by the symmetric tensor T_g such that g(x) = ⟨T_g, x^{⊗ℓ}⟩.
Informal Theorem 1.8.
Let ℓ be a constant, and let F = (f_1, …, f_m) be a vector-valued degree-ℓ homogeneous polynomial of n variables such that the matrix M, with the j'th row formed by the coefficients of the polynomial f_j, has sufficiently many non-negligible singular values. Then for any fixed vector t and any small enough ε > 0, for a standard Gaussian x we have

P[ ‖F(x) − t‖ ≤ ε ] ≤ (C_1 ε^{1/ℓ})^{C_2 m},

where C_1 and C_2 are constants that depend on ℓ.
See Theorem 3.2 for a more formal statement. The main feature of the above result is that while we lose in the “small ball” probability with the degree ℓ (as in the scalar Carbery-Wright inequality), we gain a factor of m in the exponent on account of having a vector-valued function. The interesting setting of parameters is when ε is a small inverse polynomial and the number of coordinates m grows with n. We remark that the requirement of sufficiently many non-trivial singular values is qualitatively necessary, as described below Theorem 3.2.
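A quick Monte Carlo illustration of the gain from being vector-valued (random quadratics here are only a stand-in for maps satisfying the singular value condition): the event that all m coordinates are simultaneously small is far rarer than for a single coordinate. Note that ‖F(x)‖ ≤ ε implies |F_1(x)| ≤ ε, so the comparison below holds deterministically.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, eps, trials = 8, 5, 0.05, 20000  # illustrative sizes

# F(x) = (f_1(x), ..., f_m(x)) with random quadratics f_j(x) = x^T M_j x / n.
Ms = rng.standard_normal((m, n, n))
def F(x):
    return np.array([x @ M @ x for M in Ms]) / n

X = rng.standard_normal((trials, n))
vals = np.array([F(x) for x in X])
p_scalar = np.mean(np.abs(vals[:, 0]) <= eps)            # one coordinate small
p_vector = np.mean(np.linalg.norm(vals, axis=1) <= eps)  # whole vector small
print(p_vector <= p_scalar)  # True: small-ball probability shrinks with m
```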
The second issue mentioned earlier about [BCMV14] is that in many applications, many columns depend on the same few underlying “base” vectors. Theorem 1.2 identifies a simple condition, in terms of the amount of overlap between different columns, that allows us to prove robust linear independence in very different settings, such as learning overcomplete HMMs and higher order versions of the FOOBI algorithm. Here the decoupling is achieved by building on the ideas in [MSS16], carefully defining appropriate subspaces where we can apply the result of [BCMV14].
Robust Subspace Recovery
The algorithm for robust subspace recovery at a high level follows the same approach as Hardt and Moitra [HM13]. Their main insight was that if we sample a set of size slightly less than n from the input, and if the fraction of inliers is more than d/n, then there is a good probability of obtaining more than d inliers, and thus there exist sampled points that lie in the linear span of the others. Further, since we sampled fewer than n points and the outliers are also in general position, one can conclude that the only points that lie in the linear span of the other points are the inliers!
In our algorithm, the key idea is to use the exact same approach, but with tensored vectors. Let us illustrate in the case ℓ = 2. Suppose that the fraction of inliers is roughly (d/n)². Suppose we take a sample of slightly less than n² points from the input, and consider the flattened tensored vectors x ⊗ x of these points. As long as we have more than d² inliers in the sample, we expect to find linear dependencies among the tensored inlier vectors. Further, using Theorem 1.7, we can show that such dependencies cannot involve the outliers. This allows us to find a large number of the inliers using approximate linear dependencies (or small leave-one-out distance). This in turn allows us to recover the subspace, in a smoothed analysis sense, even when the fraction of inliers is roughly (d/n)^ℓ for any constant ℓ.
We remark that the earlier result of [BCMV14] can be used to show a weaker guarantee about the robust linear independence of the matrix formed by the tensored columns, with a polynomial factor loss in the number of columns (for a constant ℓ). This translates to an improvement over [HM13] only in a restricted regime of d and n. Our tight characterization in Theorem 1.7 is crucial for our algorithm to beat the d/n threshold of [HM13] for any dimension d.
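The tensoring idea can be seen in a toy numpy sketch (the sizes and thresholds are illustrative, and no sampling step is modeled): with ℓ = 2, tensored inliers become linearly dependent while tensored outliers do not.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 12, 3          # ambient dimension, subspace dimension (illustrative)
n_in, n_out = 12, 30  # tensored inliers live in a space of dim < n_in

# Inliers lie in a d-dimensional subspace T; outliers are generic.
B = np.linalg.qr(rng.standard_normal((n, d)))[0]          # basis of T
inliers = (B @ rng.standard_normal((d, n_in))).T
outliers = rng.standard_normal((n_out, n))
pts = np.vstack([inliers, outliers])

# Tensor each point: x -> x (x) x (flattened). Tensored inliers lie in a
# space of dimension d(d+1)/2, so each is in the span of the others.
V = np.stack([np.kron(x, x) for x in pts])
flags = []
for i in range(len(V)):
    rest = np.delete(V, i, axis=0)
    Q, _ = np.linalg.qr(rest.T)              # orthonormal basis of span(rest)
    resid = V[i] - Q @ (Q.T @ V[i])          # leave-one-out residual
    flags.append(np.linalg.norm(resid) < 1e-8)
print(flags[:n_in])  # the inliers are exactly the dependent points
```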
Higher order tensor decompositions and FOOBI.
At a technical level, the algorithm of [Car91, DLCC07] for decomposing fourth-order tensors rests on a rank-one detecting “device” that evaluates to zero if its input is a symmetric product (rank-one) vector, and is non-zero otherwise. We construct such a device for general ℓ, and further analyze the condition number of an appropriate matrix that results.
We also give a robustness analysis of the algorithm of [Car91] and its extension to higher ℓ. While such analyses typically rely on proving that every estimated quantity is similar in the noise-free and noisy settings, this turns out to be impossible for the present algorithm. Roughly speaking, this is because the algorithm involves certain non-linear operations on the eigenvectors of the flattened tensor. These can be very sensitive to small perturbations, unless we have gaps in the eigenvalues. While smoothed analysis results give us ways of bounding the condition number, they are not strong enough to provide gaps between the eigenvalues. We thus need a way to argue that even if certain intermediate quantities are rather different due to the perturbation, the final estimates still match up. This brings added complexity to our argument.
In this section, we introduce notation and preliminary results that will be used throughout the rest of the paper.
Given a vector and a (typically a small inverse polynomial in ), a -perturbation of is obtained by adding independent Gaussian random variables to each coordinate of . The result of this perturbation is denoted by .
We will denote the singular values of a matrix by , in decreasing order. We will usually use or to represent the number of columns of the matrix. The maximum and minimum (nonzero) singular values are also sometimes written and .
While estimating the minimum singular value of a matrix can be difficult to do directly, it is closely related to the leave-one-out distance of a matrix, which is often much easier to calculate.
Given a matrix $M$ with columns $M_1, \ldots, M_m$, the leave-one-out distance of $M$ is
$$\ell(M) = \min_{i \in [m]} \operatorname{dist}\big(M_i, \operatorname{span}\{M_j : j \neq i\}\big).$$
The leave-one-out distance is closely related to the minimum singular value, up to a factor polynomial in the number of columns of [RV08].
For any matrix $M$ with $m$ columns, we have
$$\frac{\ell(M)}{\sqrt{m}} \;\leq\; \sigma_{\min}(M) \;\leq\; \ell(M).$$
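To make the relationship concrete, here is a small sketch (the two columns are an arbitrary choice) that computes both quantities for a 3 × 2 matrix and checks that the smallest singular value lies between the leave-one-out distance divided by the square root of the number of columns and the leave-one-out distance itself. For two columns, the smallest singular value has a closed form via the 2 × 2 Gram matrix:

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# two columns in R^3 (arbitrary choice)
a = [1.0, 0.0, 0.0]
b = [0.8, 0.6, 0.0]

def dist_to_span(u, v):
    # distance from u to the line spanned by v
    c = dot(u, v) / dot(v, v)
    w = [ui - c * vi for ui, vi in zip(u, v)]
    return math.sqrt(dot(w, w))

loo = min(dist_to_span(a, b), dist_to_span(b, a))  # leave-one-out distance

# smallest singular value of A = [a b] from the 2x2 Gram matrix A^T A
tr = dot(a, a) + dot(b, b)
det = dot(a, a) * dot(b, b) - dot(a, b) ** 2
sigma_min = math.sqrt((tr - math.sqrt(tr * tr - 4 * det)) / 2)

m = 2  # number of columns
assert loo / math.sqrt(m) <= sigma_min <= loo
```

Here the leave-one-out distance is 0.6 while the smallest singular value is about 0.447, consistent with the two-sided bound.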
Tensors and multivariate polynomials.
An order- tensor has modes each of dimension . Given vectors we will denote by the outer product between the vectors , and by the outer product of with itself times i.e., .
We will often identify with an th order tensor (with dimension in each mode), the vector obtained by flattening the tensor into a vector. For the sake of convenience, we will sometimes abuse notation (when the context is clear) and use to represent both the tensor and the flattened vector interchangeably. Given two th order tensors , the inner product denotes the inner product of the corresponding flattened vectors in .
A symmetric tensor of order satisfies for any and any permutation of the elements in . It is easy to see that the set of symmetric tensors is a linear subspace of , and has dimension . Given any -variate degree homogeneous polynomial , we can associate with the unique symmetric tensor of order such that .
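The dimension of the space of symmetric order-$\ell$ tensors over $\mathbb{R}^n$ equals the number of multisets of size $\ell$ drawn from $[n]$, i.e. $\binom{n+\ell-1}{\ell}$, since a symmetric tensor is determined by one entry per permutation orbit of indices. A quick sketch verifying this count for hypothetical parameters $n = 4$, $\ell = 3$:

```python
import itertools
import math

n, ell = 4, 3  # hypothetical dimension and order
# a symmetric order-ell tensor is determined by its entries on sorted
# index tuples (one representative per orbit under permutations)
free_entries = {tuple(sorted(idx))
                for idx in itertools.product(range(n), repeat=ell)}
# the dimension of the space of symmetric tensors is C(n + ell - 1, ell)
assert len(free_entries) == math.comb(n + ell - 1, ell)
```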
3 Decoupling and Symmetric Tensor Products
In this section we prove Theorem 1.1 and related theorems about the least singular value of random matrices in which each column is a function of a single random vector. The proof of Theorem 1.1 relies on the following theorem which forms the main technical theorem of this section.
Theorem 3.1 (Same as Theorem 1.7).
Let , and let be the space of all symmetric order tensors in (the dimension is ), and let be an arbitrary subspace of dimension . Then we have for any and where
where are constants that depend only on .
Theorem 1.1 follows by combining the above theorem with an additional lemma (see Section 3.4). Our main tool will be the idea of decoupling, along with a result of Bhaskara, Charikar, Moitra and Vijayaraghavan [BCMV14]. While decoupling inequalities [dlPMS95] are often used to prove concentration bounds for polynomials of random variables, here this will be used to establish lower bounds on projections and minimum singular values, which have more of an anti-concentration flavor.
In fact we can use the same ideas to prove the following anti-concentration statement that can be seen as a variant of the well-known inequality of Carbery and Wright [CW01]. In what follows, we will represent a degree multivariate polynomial using the symmetric tensor of order such that .
Let , and let be an integer. Let be a vector-valued degree homogeneous polynomial of variables given by where for each , for some symmetric order tensor . Suppose the matrix formed with the as rows satisfies ; then for any fixed , and we have
where and are constants that depend on .
Comparing with the Carbery-Wright inequality.
Anti-concentration inequalities for polynomials are often stated for a single polynomial. They take the following form: if is a degree- polynomial and for some distribution , then the probability that is . Our statement above applies to vector-valued polynomials . Here, if the are “different enough”, one can hope that the dependence above becomes , where is the number of polynomials. Our statement may be viewed as showing a bound that is qualitatively of this kind (albeit with a much weaker dependence on ), when . We capture the notion of being different using the condition on the singular value of the matrix . Also, while seems like a strong requirement, Proposition 3.11 shows that a fairly large i.e., is necessary. Getting a tight dependence in the exponent in terms of is an interesting open question. We also note that the paper of Carbery and Wright [CW01] does indeed consider vector-valued polynomials, but their focus is only on obtaining type bounds with a better constant for . To the best of our knowledge, none of the known results try to get an advantage due to having multiple .
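For intuition on the $\epsilon^{1/d}$ rate in the single-polynomial case, consider the simplest degree-$d$ example $p(x) = x^d$ with $x$ a standard Gaussian; the small-ball probability then has a closed form via the error function. This toy calculation is a sketch for intuition only, not part of the paper's argument:

```python
import math

# For p(x) = x^d with x ~ N(0, 1):
#   P(|p(x)| <= eps) = P(|x| <= eps^(1/d)) = erf(eps^(1/d) / sqrt(2)).
d, eps = 3, 1e-6
prob = math.erf(eps ** (1.0 / d) / math.sqrt(2))

# the probability scales like eps^(1/d) (here 1e-2), far larger than eps
assert 0.5 * eps ** (1 / d) < prob < 2 * eps ** (1 / d)
assert prob > 1000 * eps
```

So even for this simplest degree-3 polynomial, the small-ball probability at scale $10^{-6}$ is about $8 \times 10^{-3}$, matching the $\epsilon^{1/d}$ rate rather than $\epsilon$.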
The main ingredient in the proof of the above theorems is the following decoupling inequality.
[Anticoncentration through Decoupling] Let and let be an integer, and let represent any norm over . Let be given by where for each , is a multivariate homogeneous polynomial of degree (say given by for some symmetric tensor ). For any fixed , and we have
Note that in the above proposition, the polynomials correspond to decoupled multilinear polynomials of degree . Unlike standard decoupling statements, here the different components are not identically distributed. We also note that the proposition itself is inspired by a similar lemma in the work of Lovett [Lov10] on an alternate proof of the Carbery-Wright inequality. Indeed the basic inductive structure of our argument is similar (going via Lemma 3.6 below), but the details of the argument turn out to be quite different. Also, the proposition above deals with vector-valued polynomials , as opposed to real valued polynomials in [Lov10].
Theorem 3.1 follows by combining Proposition 3.3 and a theorem of [BCMV14]. This will be described in Section 3.2. Later in Section A, we also give an alternate simple proof of Theorem 3.1 for that is more combinatorial. First we introduce the slightly more general setting for decoupling that also captures the required smoothed analysis statement.
3.1 Proof of Proposition 3.3
We will start with a simple fact involving signed combinations.
Let $a_1, \ldots, a_k$ be real numbers, and let $\zeta_1, \ldots, \zeta_k$ be independent Rademacher random variables. Then
$$\mathbb{E}_{\zeta}\left[\zeta_1 \zeta_2 \cdots \zeta_k \Big(\sum_{i=1}^{k} \zeta_i a_i\Big)^{k}\right] = k!\, a_1 a_2 \cdots a_k.$$
For a subset $S \subseteq [k]$, let $\zeta_S = \prod_{i \in S} \zeta_i$. Then it is easy to check that $\mathbb{E}[\zeta_S] = 0$ if $S \neq \emptyset$, and $\mathbb{E}[\zeta_S] = 1$ if $S = \emptyset$. Applying this along with the multinomial expansion for $\big(\sum_i \zeta_i a_i\big)^k$ gives the lemma. ∎
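As a sanity check, the identity $\mathbb{E}[\zeta_1 \cdots \zeta_k (\sum_i \zeta_i a_i)^k] = k!\, a_1 \cdots a_k$ can be verified by exact enumeration over all $2^k$ sign patterns (a small sketch; the particular numbers $a_i$ are arbitrary):

```python
import itertools
import math

def decoupling_average(a):
    # E over Rademacher signs of  zeta_1*...*zeta_k * (sum_i zeta_i a_i)^k,
    # computed exactly by averaging over all 2^k sign patterns
    k = len(a)
    total = 0.0
    for signs in itertools.product([-1, 1], repeat=k):
        total += math.prod(signs) * sum(z * x for z, x in zip(signs, a)) ** k
    return total / 2 ** k

a = [0.7, -1.3, 2.0]
lhs = decoupling_average(a)
rhs = math.factorial(len(a)) * math.prod(a)  # k! * a_1 * ... * a_k
assert abs(lhs - rhs) < 1e-9
```

Only the monomial in which every sign variable appears exactly once survives the averaging, which is precisely why the expansion collapses to the multilinear term.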
Consider any symmetric order-$\ell$ tensor $T$, a fixed vector $x \in \mathbb{R}^n$, let $g_1, \ldots, g_\ell$ be independent random Gaussians, and let $\zeta_1, \ldots, \zeta_\ell$ be independent Rademacher random variables. Then we have
$$\mathbb{E}_{\zeta}\left[\zeta_1 \cdots \zeta_\ell \left\langle T, \big(x + \zeta_1 g_1 + \cdots + \zeta_\ell g_\ell\big)^{\otimes \ell} \right\rangle\right] = \ell!\, \langle T, g_1 \otimes \cdots \otimes g_\ell \rangle.$$
Note that the right side corresponds to the evaluation of the at a random perturbation of .
First, we observe that since is symmetric, it follows that for any permutation on . Let , and let be independent Rademacher random variables. For any symmetric decomposition into rank-one tensors (note that such a decomposition always exists for a symmetric tensor; see [CGLM08] for example), we have for every , . Applying Lemma 3.4 (with ) to each term separately and combining them, we get
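The proof strategy above — decomposing the symmetric tensor into rank-one terms and applying the sign identity to each term — can be checked numerically on a rank-two example $T = a^{\otimes \ell} + b^{\otimes \ell}$, for which $\langle T, v^{\otimes \ell}\rangle = \langle a, v\rangle^\ell + \langle b, v\rangle^\ell$. All instances below are arbitrary random choices for illustration:

```python
import itertools
import math
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

random.seed(0)
n, ell = 4, 3  # hypothetical dimension and tensor order
a = [random.gauss(0, 1) for _ in range(n)]
b = [random.gauss(0, 1) for _ in range(n)]
x = [random.gauss(0, 1) for _ in range(n)]
g = [[random.gauss(0, 1) for _ in range(n)] for _ in range(ell)]

# For T = a^{(x) ell} + b^{(x) ell}: <T, g_1 (x) ... (x) g_ell> =
#   prod_i <a, g_i> + prod_i <b, g_i>
lhs = math.factorial(ell) * (math.prod(dot(a, gi) for gi in g)
                             + math.prod(dot(b, gi) for gi in g))

# exact expectation over Rademacher signs of
#   zeta_1*...*zeta_ell * <T, (x + sum_i zeta_i g_i)^{(x) ell}>
rhs = 0.0
for signs in itertools.product([-1, 1], repeat=ell):
    v = [x[j] + sum(z * gi[j] for z, gi in zip(signs, g)) for j in range(n)]
    rhs += math.prod(signs) * (dot(a, v) ** ell + dot(b, v) ** ell)
rhs /= 2 ** ell

assert abs(lhs - rhs) < 1e-8
```

Note that the contribution of the fixed vector $x$ cancels entirely under the sign averaging, which is what lets the decoupled multilinear form emerge from evaluations at perturbations of $x$.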
Let be an n-variate Gaussian random variable, and let and be a collection of independent n-variate Gaussian random variables. Then for any measurable set we have
This inequality and its proof are inspired by the work of Lovett [Lov10] mentioned earlier. The main advantage in our inequality is that the right side here involves the particular signed combinations of the function values at points from independent copies that directly yields the asymmetric decoupled product (using Lemma 3.5).
Let , and for each , let . Clearly . Let represent the indicator function of . For , let
We will prove that for each , . Using the Cauchy-Schwarz inequality, we have
Now if are i.i.d. variables distributed as , then are identically distributed. More crucially, and are independent! Hence