Tight Bounds for the Subspace Sketch Problem with Applications

(Funding note: Yi Li was supported in part by a Singapore Ministry of Education (AcRF) Tier 2 grant MOE2018-T2-1-013. Ruosong Wang and David P. Woodruff were supported in part by an Office of Naval Research (ONR) grant N00014-18-1-2562, as well as the Simons Institute for the Theory of Computing where part of this work was done.)
In the subspace sketch problem one is given an $n \times d$ matrix $A$ with $O(\log(nd))$-bit entries, and would like to compress it in an arbitrary way to build a small space data structure $Q_p$, so that for any given $x \in \mathbb{R}^d$, with probability at least $0.9$, one has $Q_p(x) = (1 \pm \varepsilon)\|Ax\|_p$, where $p \ge 0$, and where the randomness is over the construction of $Q_p$. The central question is:
How many bits are necessary to store $Q_p$?
This problem has applications to the communication of approximating the number of non-zeros in a matrix product, the size of coresets in projective clustering, the memory of streaming algorithms for regression in the row-update model, and embedding subspaces of $L_p$ in functional analysis. A major open question is the dependence on the approximation factor $\varepsilon$.
We show that if $p$ is not a positive even integer and $d = \Omega(\log(1/\varepsilon))$, then $\tilde\Omega(\varepsilon^{-2} d)$ bits are necessary. On the other hand, if $p$ is a positive even integer, then there is an upper bound of $O(d^p \log(nd))$ bits independent of $\varepsilon$. Our results are optimal up to logarithmic factors, and show in particular that one cannot compress $A$ to $o(\varepsilon^{-2} d)$ "directions" $v_1, \ldots, v_r$, such that for any $x$, $\|Ax\|_1$ can be well-approximated from $\langle v_1, x\rangle, \ldots, \langle v_r, x\rangle$. Our lower bound rules out arbitrary functions of these inner products (and in fact arbitrary data structures built from $A$), and thus rules out the possibility of a singular value decomposition for $\ell_1$ in a very strong sense. Indeed, as $\varepsilon \to 0$, for $p = 1$ the space complexity becomes arbitrarily large, while for $p = 2$ it is at most $O(d^2 \log(nd))$. As corollaries of our main lower bound, we obtain new lower bounds for a wide range of applications, including the above, which in many cases are optimal.
- 1 Introduction
- 2 Preliminaries
- 3 An $\tilde\Omega(\varepsilon^{-2} d)$ Lower Bound
- 4 Lower Bounds of Dependence on $d$
- 5 Linear Embeddings
- 6 Sampling-based Embeddings
- 7 Oblivious Sketches
- 8 Lower Bounds for $M$-estimators
- 9 Lower Bounds for Coresets of Projective Clustering
- 10 Upper Bounds for the Tukey Loss $p$-Norm
- 11 An Upper Bound for the Subspace Sketch in Two Dimensions
The explosive growth of available data has necessitated new models for processing such data. A particularly powerful tool for analyzing such data is sketching, which has found applications to communication complexity, data stream algorithms, functional analysis, machine learning, numerical linear algebra, sparse recovery, and many other areas. Here one is given a large object, such as a graph, a matrix, or a vector, and one seeks to compress it while still preserving useful information about the object. One of the main goals of a sketch is to use as little memory as possible in order to compute functions of interest. Typically, to obtain non-trivial space bounds, such sketches need to be both randomized and approximate. By now there are nearly-optimal bounds on the memory required for sketching many fundamental problems, such as graph sparsification, norms of vectors, and problems in linear algebra such as low-rank approximation and regression. We refer the reader to the surveys [28, 37], as well as to available compilations of lecture notes.
In this paper we consider the subspace sketch problem.
Given an $n \times d$ matrix $A$ with entries specified by $O(\log(nd))$ bits, an accuracy parameter $\varepsilon > 0$, and a function $\Phi : \mathbb{R}^d \to \mathbb{R}_{\ge 0}$, design a data structure $Q_\Phi$ so that for any $x \in \mathbb{R}^d$, with probability at least $0.9$, $Q_\Phi(x) = (1 \pm \varepsilon)\Phi(x)$.
The subspace sketch problem captures many important problems as special cases. We will show how to use this problem to bound the communication of approximating statistics of a matrix product, the size of coresets in projective clustering, the memory of streaming algorithms for regression in the row-update model, and the embedding dimension in functional analysis. We will describe these applications in more detail below.
The goal in this work is to determine the memory, i.e., the size of $Q_\Phi$, required for solving the subspace sketch problem for different functions $\Phi$. We first consider the classical $\ell_p$-norms, $\Phi(x) = \|Ax\|_p^p$, in which case the problem is referred to as the $\ell_p$ subspace sketch problem. (Note we are technically considering the $p$-th power of the $\ell_p$-norms, but for the purposes of $(1\pm\varepsilon)$-approximation, they are the same for constant $p$. Also, when $0 \le p < 1$, $\|x\|_p$ is not a norm, though it is still a well-defined quantity. Finally, $\|x\|_0$ denotes the number of non-zero entries of $x$.) We next consider their robust counterparts $\Phi(x) = \sum_{i=1}^n \phi(\langle a_i, x\rangle)$, where $\phi(t) = |t|^p$ if $|t| \le \tau$, and $\phi(t) = \tau^p$ otherwise, and $a_1, \ldots, a_n$ are the rows of $A$. Here $\phi$ is a so-called $M$-estimator, known as the Tukey loss $p$-norm. It is less sensitive to "outliers" since it truncates large coordinate values at $\tau$. We let $Q_p$ denote $Q_\Phi$ when $\Phi(x) = \|Ax\|_p^p$, and use $Q_{p,\tau}$ when $\phi$ is the Tukey loss $p$-norm.
It is known that for $0 < p \le 2$ and $\varepsilon \in (0,1)$, if one chooses a matrix $S \in \mathbb{R}^{m \times n}$ of i.i.d. $p$-stable random variables with $m = O(\varepsilon^{-2})$, then for any fixed $x \in \mathbb{R}^d$, from the sketch $SAx$ one can output a number $Z$ for which $Z = (1\pm\varepsilon)\|Ax\|_p^p$ with probability at least $0.9$. We say $Z$ is a $(1\pm\varepsilon)$-approximation of $\|Ax\|_p^p$. For $p = 1$, the output is just $\operatorname{median}(|SAx|)$, where $\operatorname{median}(\cdot)$ denotes the median of the absolute values of the coordinates in a vector. A sketch with $O(\varepsilon^{-2})$ rows is also known for $p = 2$. For $p > 2$, there is a distribution on $S \in \mathbb{R}^{m \times n}$ with $m = \tilde O(\varepsilon^{-2} n^{1-2/p})$ for which one can output a $(1\pm\varepsilon)$-approximation of $\|Ax\|_p^p$ given $SAx$ with probability at least $0.9$. By appropriately discretizing the entries, one can solve the $\ell_p$ subspace sketch problem by storing $SA$ for an appropriate sketching matrix $S$, and estimating $\|Ax\|_p^p$ using $(SA)x$. In this way, one obtains a sketch of size $\tilde O(\varepsilon^{-2} d)$ bits (throughout we use $\tilde O(\cdot)$ and $\tilde\Omega(\cdot)$ to hide factors that are polynomial in $\log(nd/\varepsilon)$; we note that our lower bounds are actually independent of $n$) for $0 < p \le 2$, and a sketch of size $\tilde O(\varepsilon^{-2} n^{1-2/p} d)$ bits for $p > 2$. Note, however, that this was only one particular approach, based on choosing a random matrix $S$, and better approaches may be possible. Indeed, note that for $p = 2$, one can simply store $A^\top A$ and output $x^\top A^\top A x$. This is exact (i.e., holds for $\varepsilon = 0$) and only uses $O(d^2 \log(nd))$ bits of space, which is significantly smaller than $\tilde O(\varepsilon^{-2} d)$ for small enough $\varepsilon$. We note that the $\varepsilon^{-2}$ term may be extremely prohibitive in applications, e.g., if one wants high accuracy such as $\varepsilon = 1/\sqrt{d}$, the $\varepsilon^{-2} = d$ factor is a severe drawback of existing algorithms.
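As a toy illustration of the sketching approach for $p = 1$ (a sketch relying only on the standard 1-stability property of the Cauchy distribution, not the paper's exact construction; the parameter choices below are hypothetical): the stored summary is $SA$ for a Cauchy matrix $S$, and the median of $|SAx|$ estimates $\|Ax\|_1$, since the median of the absolute value of a standard Cauchy variable is $1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 5, 2001          # m plays the role of O(1/eps^2) sketch rows

A = rng.integers(-5, 6, size=(n, d)).astype(float)

# 1-stable (standard Cauchy) sketching matrix; SA is all we store.
S = rng.standard_cauchy(size=(m, n))
SA = S @ A                      # an m x d summary of A

def l1_estimate(SA, x):
    """Median-based estimate of ||Ax||_1 from the stored sketch SA.

    By 1-stability, each entry of (SA)x is a standard Cauchy variable
    scaled by ||Ax||_1, and median(|Cauchy|) = 1, so the empirical
    median of |SAx| concentrates around ||Ax||_1.
    """
    return np.median(np.abs(SA @ x))

x = np.array([1.0, -2.0, 0.0, 3.0, 1.0])
est = l1_estimate(SA, x)
true = np.abs(A @ x).sum()
```

With $m \approx 2000$ rows the relative error is typically a few percent; the check below is deliberately loose.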
A natural question is what makes it possible for $p = 2$ to obtain $O(d^2\log(nd))$ bits of space, and whether it is also possible to achieve $\mathrm{poly}(d \log n)$ space for $p = 1$. One thing that makes this possible for $p = 2$ is the singular value decomposition (SVD), namely, $A = U \Sigma V^\top$ for matrices $U \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{d \times d}$ with orthonormal columns, and a non-negative diagonal matrix $\Sigma \in \mathbb{R}^{d \times d}$. Then $\|Ax\|_2 = \|\Sigma V^\top x\|_2$ since $U$ has orthonormal columns. Consequently, it suffices to maintain the inner products $\langle v_i, x \rangle$, where the $v_i$'s are the rows of $\Sigma V^\top$. Thus one can "compress" $A$ to $d$ "directions" $v_1, \ldots, v_d$. A natural question is whether for $p = 1$ it is also possible to find $\mathrm{poly}(d)$ directions $v_1, \ldots, v_r$, such that for any $x$, $\|Ax\|_1$ can be well-approximated from some function of $\langle v_1, x\rangle, \ldots, \langle v_r, x\rangle$. Indeed, this would give an analogous SVD for $\ell_1$, for which little is known.
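The $\ell_2$ identity underlying this compression is easy to check numerically; the following minimal sketch stores only the $d \times d$ matrix $\Sigma V^\top$ and answers $\|Ax\|_2$ exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 6
A = rng.standard_normal((n, d))

# A = U diag(s) Vt with U having orthonormal columns, so
# ||Ax||_2 = ||diag(s) Vt x||_2: a d x d summary R suffices.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
R = np.diag(s) @ Vt            # the d "directions" are the rows of R

x = rng.standard_normal(d)
exact = np.linalg.norm(A @ x)
from_sketch = np.linalg.norm(R @ x)
```

The answer is exact up to floating-point roundoff, with no dependence on an accuracy parameter.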
The central question of our work is:
How much memory is needed to solve the $\ell_p$ subspace sketch problem as a function of $\varepsilon$?
1.1 Our Contributions
Up to polylogarithmic factors, we resolve the above question for the $\ell_p$-norms and Tukey loss $p$-norms for any $p \ge 0$. For the $\ell_p$-norms we also obtain a surprising separation between positive even integers $p$ and all other values of $p$.
Our main theorem is the following. We denote by $\mathbb{Z}_{>0}$ the set of positive integers.
Theorem 1.1 (Informal).
Let $p \in [0, \infty) \setminus 2\mathbb{Z}_{>0}$ be a constant. For any $d = \Omega(\log(1/\varepsilon))$, we have that $\tilde\Omega(\varepsilon^{-2} d)$ bits are necessary to solve the $\ell_p$ subspace sketch problem.
When $p$ is a positive even integer, there is an upper bound of $O(d^p \log(nd))$ bits, independent of $\varepsilon$ (see Remark 3.15). This gives a surprising separation between positive even integers and other values of $p$; in particular for positive even integers $p$ it is possible to obtain $Q_p$ with at most $O(d^p \log(nd))$ bits of space, whereas for other values of $p$ the space becomes arbitrarily large as $\varepsilon \to 0$. This also shows it is not possible, for $p = 1$ for example, to find $\mathrm{poly}(d)$ representative directions for $\ell_1$ analogous to the SVD for $\ell_2$. Note that the lower bound in Theorem 1.1 is much stronger than this, showing that there is no data structure whatsoever which uses fewer than $\tilde\Omega(\varepsilon^{-2} d)$ bits, and so as $\varepsilon$ gets smaller, the space complexity becomes arbitrarily large.
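For even integers $p$, an $\varepsilon$-independent data structure can be realized by storing the order-$p$ moment tensor $\sum_i a_i^{\otimes p}$, which has $d^p$ entries and determines $\|Ax\|_p^p$ exactly. A minimal sketch of this folklore construction for $p = 4$ (not necessarily the exact construction of Remark 3.15):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 3                   # p = 4 in this demo
A = rng.integers(-2, 3, size=(n, d)).astype(float)

# T[i,j,k,l] = sum_r A[r,i] A[r,j] A[r,k] A[r,l]: d^4 numbers in total,
# independent of any accuracy parameter.
T = np.einsum('ri,rj,rk,rl->ijkl', A, A, A, A)

def lp_even(T, x):
    """Exact ||Ax||_4^4 from the stored tensor: <T, x^{tensor 4}>,
    since sum_r <a_r, x>^4 expands into exactly this contraction."""
    return np.einsum('ijkl,i,j,k,l->', T, x, x, x, x)

x = rng.standard_normal(d)
val_sketch = lp_even(T, x)
val_direct = np.sum((A @ x) ** 4)
```

The same idea works for any even $p$ with a $d^p$-entry tensor; for odd $p$ the absolute value in $|\langle a_r, x\rangle|^p$ destroys the polynomial structure.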
In addition to the $\ell_p$-norm, in the subspace sketch problem we also consider more general entry-decomposable functions $\Phi$, that is, $\Phi(x) = \sum_{i=1}^n \phi(\langle a_i, x\rangle)$ for some $\phi : \mathbb{R} \to \mathbb{R}_{\ge 0}$, where $a_1, \ldots, a_n$ are the rows of $A$. We show the same lower bounds for a number of $M$-estimators $\phi$.
Theorem 1.2 (Informal).
The subspace sketch problem requires $\tilde\Omega(\varepsilon^{-2} d)$ bits when $d = \Omega(\log(1/\varepsilon))$ for each of the following functions $\phi$:
($\ell_1$-$\ell_2$ estimator) $\phi(t) = 2\left(\sqrt{1 + t^2/2} - 1\right)$;
(Huber estimator) $\phi(t) = t^2/(2\tau)$ for $|t| \le \tau$, and $\phi(t) = |t| - \tau/2$ otherwise;
(Fair estimator) $\phi(t) = \tau^2\left(|t|/\tau - \ln(1 + |t|/\tau)\right)$;
(Cauchy estimator) $\phi(t) = (\tau^2/2)\ln\left(1 + (t/\tau)^2\right)$;
(Tukey loss $p$-norm) $\phi(t) = |t|^p$ for $|t| \le \tau$, and $\phi(t) = \tau^p$ otherwise.
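For concreteness, the standard forms of these estimators, as commonly defined in the robust statistics literature (the exact scalings used in the formal theorems may differ), can be coded directly:

```python
import numpy as np

def l1_l2(t):
    """l1-l2 estimator: quadratic near 0, linear growth at infinity."""
    return 2.0 * (np.sqrt(1.0 + t * t / 2.0) - 1.0)

def huber(t, tau=1.0):
    """Huber estimator: quadratic for |t| <= tau, linear beyond."""
    a = np.abs(t)
    return np.where(a <= tau, t * t / (2.0 * tau), a - tau / 2.0)

def fair(t, tau=1.0):
    """Fair estimator: smooth, everywhere differentiable, ~tau*|t| for large t."""
    a = np.abs(t)
    return tau * tau * (a / tau - np.log1p(a / tau))

def cauchy(t, tau=1.0):
    """Cauchy estimator: grows only logarithmically for large |t|."""
    return (tau * tau / 2.0) * np.log1p((t / tau) ** 2)

def tukey_p(t, tau=1.0, p=1.0):
    """Tukey loss p-norm: |t|^p truncated at tau^p."""
    return np.minimum(np.abs(t) ** p, tau ** p)
```

All five are even functions, vanish at $0$, and grow at most linearly (or not at all, for the Tukey loss), which is what makes them robust to outliers.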
We remark that the lower bound for the Tukey loss $p$-norm is tight up to logarithmic factors, since we design a new algorithm which yields an upper bound of $\tilde O(\varepsilon^{-2} d)$ bits for the subspace sketch problem with the Tukey loss $p$-norm (see Section 10).
While Theorem 1.1 gives a tight lower bound for $0 < p \le 2$, matching the simple sketching upper bound described earlier, and giving a separation from the $O(d^p \log(nd))$ bit bound for even integers $p$, one may ask what exactly the space required is for even integers $p$ and arbitrarily small $\varepsilon$. For $p = 2$, the $O(d^2 \log(nd))$ upper bound is tight up to logarithmic factors since previous work [4, Theorem 2.2] implies an $\Omega(d^2)$ lower bound once $\varepsilon$ is below a sufficiently small constant. We do not fully resolve the complexity for even integers $p > 2$, though we show the following: for a constant $\varepsilon$, there is a better dependence on $d$, of $\tilde O(d^{p/2+1})$ bits (see Remark 4.4), which is nearly tight in light of the following lower bound, which holds already for constant $\varepsilon$.
Theorem 1.3 (Informal).
Let $\varepsilon \in (0,1)$ and $p \ge 2$ be constants. We have that $\tilde\Omega(d^{p/2})$ bits are necessary to solve the $\ell_p$ subspace sketch problem.
Note that Theorem 1.3 holds even if $p$ is not an even integer, and shows that a lower bound of $\tilde\Omega(d^{p/2})$ holds for every $p \ge 2$.
Statistics of a Matrix Product.
In prior work, an algorithm was given for estimating $\|AB\|_0$ for integer matrices $A$ and $B$ with $O(\log n)$-bit integer entries (see Algorithm 1 there for the general algorithm). This estimates the number of non-zero entries of the matrix product $AB$, which may be useful since there are faster algorithms for matrix product when the output is sparse; see the references therein. More generally, norms of the product $AB$ can be used to determine how correlated the rows of $A$ are with the columns of $B$. The bit complexity of this problem was studied in [35, 38], where a lower bound was shown for estimating $\|AB\|_0$ up to a $(1+\varepsilon)$ factor, under the assumption that $\varepsilon$ is not too small (this lower bound holds already for binary matrices). This implies an $\ell_0$ subspace sketch lower bound under the same restriction on $\varepsilon$. Our lower bound in Theorem 1.1 considerably strengthens this result by showing the same lower bound (up to logarithmic factors) for a much smaller value of $\varepsilon$. For any $p \ge 0$, there is a matching upper bound up to polylogarithmic factors (such an upper bound is implicit in the description of Algorithm 1 of the prior work, and also follows from the random sketching matrices discussed above).
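The connection between the $p = 0$ case and matrix product statistics is simply that the number of non-zeros of $AB$ decomposes column-by-column into $\ell_0$ queries, so an accurate $\ell_0$ subspace sketch for $A$ answers each term of the sum:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.integers(0, 2, size=(30, 20))
B = rng.integers(0, 2, size=(20, 10))

# Column j of AB is A @ B[:, j], so nnz(AB) = sum_j ||A b_j||_0:
# each summand is exactly an l0 subspace sketch query on A.
nnz_direct = np.count_nonzero(A @ B)
nnz_via_queries = sum(np.count_nonzero(A @ B[:, j]) for j in range(B.shape[1]))
```

The identity is exact; the sketching question is how much space a data structure needs to answer each query approximately.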
In the task of projective clustering, we are given a set $P \subseteq \mathbb{R}^d$ of $n$ points, a positive integer $k$, and a non-negative integer $j < d$. A center $\mathcal{C}$ is a $k$-tuple $(S_1, \ldots, S_k)$, where each $S_i$ is a $j$-dimensional affine subspace of $\mathbb{R}^d$. Given a function $\phi : \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$, the objective is to find a center that minimizes the projective cost, defined to be
$\mathrm{cost}(P, \mathcal{C}) = \sum_{x \in P} \phi\big(d(x, \mathcal{C})\big),$

where $d(x, \mathcal{C}) = \min_i d(x, S_i)$, the Euclidean distance from a point to its nearest subspace in $\mathcal{C}$. The coreset problem for projective clustering asks to design a data structure $Q$ such that for any center $\mathcal{C}$, with probability at least $0.9$, $Q(\mathcal{C}) = (1 \pm \varepsilon)\,\mathrm{cost}(P, \mathcal{C})$. Note that in this and other computational geometry problems, the dimension $d$ may be small (e.g., $d = O(1)$), though one may want a high accuracy solution. Our lower bound below is the first non-trivial lower bound on the size of a coreset for projective clustering.
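A direct implementation of the projective cost (a sketch; each affine subspace is assumed to be given by an orthonormal basis $U$ and an offset $c$, and $\phi$ defaults to the identity):

```python
import numpy as np

def dist_to_affine(x, U, c):
    """Distance from x to the affine subspace {c + U y}: the norm of the
    component of x - c orthogonal to the column span of U (U is assumed
    to have orthonormal columns)."""
    r = x - c
    return np.linalg.norm(r - U @ (U.T @ r))

def projective_cost(P, centers, phi=lambda t: t):
    """cost(P, C) = sum over points of phi(distance to nearest subspace)."""
    return sum(phi(min(dist_to_affine(x, U, c) for U, c in centers)) for x in P)

# Two 1-dimensional subspaces (lines through the origin) in R^2.
e1 = np.array([[1.0], [0.0]])
e2 = np.array([[0.0], [1.0]])
centers = [(e1, np.zeros(2)), (e2, np.zeros(2))]
P = [np.array([3.0, 1.0]), np.array([0.5, 4.0])]
# (3,1) is closer to the x-axis (distance 1); (0.5,4) to the y-axis (0.5).
```

A coreset must approximate this cost for every choice of the $k$ subspaces simultaneously.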
Theorem 1.4 (Informal).
Suppose that $\phi(t) = |t|^p$ for $p \in [0,\infty) \setminus 2\mathbb{Z}_{>0}$, or that $\phi$ is one of the functions in Theorem 1.2. For any $k \ge 1$ and $j \ge 1$, any coreset for projective clustering requires $\tilde\Omega(\varepsilon^{-2} k)$ bits.
In the linear regression problem, there is an $n \times d$ data matrix $A$ and a vector $b \in \mathbb{R}^n$. The goal is to find a vector $x \in \mathbb{R}^d$ so as to minimize $\Phi(Ax - b)$, where $\Phi(y) = \sum_i \phi(y_i)$ for some $\phi : \mathbb{R} \to \mathbb{R}_{\ge 0}$. Here we consider streaming coresets for linear regression in the row-update model. In the row-update model, the streaming coreset is updated online during one pass over the rows of $A$, and outputs a $(1\pm\varepsilon)$-approximation to the optimal value at the end. By a simple reduction, our lower bound for the subspace sketch problem implies lower bounds on the size of streaming coresets for linear regression in the row-update model. To see this, we note that for the $\ell_p$ case, by taking a sufficiently large weight $w$,

$\min_{x \in \mathbb{R}^d} \left( \|Ax\|_p^p + w^p \|x - y\|_p^p \right) \longrightarrow \|Ay\|_p^p \quad \text{as } w \to \infty,$

where the left-hand side is the optimal regression cost on the instance formed by appending the weighted rows $wI_d$ to $A$ with corresponding targets $wy$.
Thus, a streaming coreset for linear regression can solve the subspace sketch problem, which we formalize in the following corollary.
Suppose that $\phi(t) = |t|^p$ for $p \in [0,\infty) \setminus 2\mathbb{Z}_{>0}$, or that $\phi$ is one of the functions in Theorem 1.2. Any streaming coreset for linear regression in the row-update model requires $\tilde\Omega(\varepsilon^{-2} d)$ bits when $d = \Omega(\log(1/\varepsilon))$.
Let $1 \le p < \infty$. Given an $n \times d$ matrix $A$, the $\ell_p$ subspace embedding problem asks to find a linear map $\Phi : \mathbb{R}^n \to \mathbb{R}^m$ such that for all $x \in \mathbb{R}^d$,

$(1 - \varepsilon)\|Ax\|_p \le \|\Phi(Ax)\|_p \le (1 + \varepsilon)\|Ax\|_p. \qquad (1)$
The smallest $m$ which admits such a $\Phi$ for every $n \times d$ matrix $A$ is denoted by $m_p(d, \varepsilon)$, which is of great interest in functional analysis. When $\Phi$ is allowed to be random, we require (1) to hold simultaneously for all $x$ with probability at least $0.9$. This problem can be seen as a special case of the "for-all" version of the subspace sketch problem in Definition 1.1. In the for-all version of the subspace sketch problem, the data structure $Q_p$ is required to, with probability at least $0.9$, satisfy $Q_p(x) = (1\pm\varepsilon)\|Ax\|_p^p$ simultaneously for all $x \in \mathbb{R}^d$. In this case, the same lower bound of $\tilde\Omega(\varepsilon^{-2} d)$ bits holds.
Since the data structure can store $\Phi A$ if such a $\Phi$ exists, we can turn our bit lower bound into a dimension lower bound on $m_p(d,\varepsilon)$. Doing so will incur a loss of a logarithmic factor (Theorem 5.1). We give an $m_p(d,\varepsilon) = \tilde\Omega(\varepsilon^{-2} d)$ lower bound, which is the first such lower bound giving a dependence on $\varepsilon$ for general $p$.
Suppose that $p \ge 1$ is not an even integer and $d = \Omega(\log(1/\varepsilon))$. It holds that $m_p(d, \varepsilon) = \tilde\Omega(\varepsilon^{-2} d)$.
This bound is tight, up to logarithmic factors, in the $\varepsilon$-dependence for all such values of $p$. When $p$ is a positive even integer, no bound with a dependence on $\varepsilon$ exists, since a $d$-dimensional subspace of $L_p$ always embeds into $\ell_p^m$ isometrically with $m$ depending only on $d$ and $p$. See more discussion below in Section 1.2 on functional analysis. We also prove an $\tilde\Omega(d^{p/2+1})$ bit lower bound for the aforementioned for-all version of the subspace sketch problem. We refer the readers to Section 4.2 for details.
Let $p \ge 2$ be a constant. Suppose that $\varepsilon \in (0,1)$ is a constant. The for-all version of the $\ell_p$ subspace sketch problem requires $\tilde\Omega(d^{p/2+1})$ bits.
This lower bound immediately implies a dimension lower bound of $\tilde\Omega(d^{p/2})$ for the $\ell_p$ subspace embedding problem for constant $\varepsilon$, recovering the existing lower bounds (up to logarithmic factors), which are known to be tight.
Sampling by Lewis Weights.
While it is immediate that sampling-based embeddings with $\tilde O_\varepsilon(d^{p/2})$ rows exist for $p > 2$, our lower bound above thus far has not precluded the possibility that far fewer sampled rows suffice. However, the next corollary, which lower bounds the target dimension for sampling-based embeddings, indicates this is impossible to achieve using a prevailing existing technique.
Let $p \ge 2$ and $\varepsilon \in (0,1)$ be constants. Suppose that storing $SA$ solves the $\ell_p$ subspace sketch problem for some $S \in \mathbb{R}^{m \times n}$ for which each row of $S$ contains exactly one non-zero element. Then $m = \tilde\Omega(d^{p/2})$, provided that $n = \tilde\Omega(d^{p/2})$.
The same lower bound holds for the for-all version of the subspace sketch problem. As a consequence, since the upper bounds on $m_p(d,\varepsilon)$ in (2) for $p > 2$ are based on subsampling with the "change of density" technique (also known as sampling by Lewis weights), they are, within the framework of this classical technique, best possible up to polylog factors.
For the for-all version of the subspace sketch problem, we note that there exist general sketches such as the Cauchy sketch which are beyond the reach of the corollary above. Note that the Cauchy sketch is an oblivious sketch, which means the distribution of the sketching matrix is independent of $A$. We also prove a lower bound of $\tilde\Omega(d)$ on the target dimension for oblivious sketches (see Section 7), which is tight up to logarithmic factors since the Cauchy sketch has a target dimension of $O(d \log d)$.
Theorem 1.9 (Informal).
Let $\varepsilon \in (0,1)$ be a constant. Any oblivious sketch that solves the for-all version of the $\ell_1$ subspace sketch problem has a target dimension of $\tilde\Omega(d)$.
1.2 Connection with Banach Space Theory
In the language of functional analysis, the subspace embedding problem is a classical problem in the theory of $L_p$ spaces with a rich history. For two Banach spaces $X$ and $Y$, we say $X$ $K$-embeds into $Y$ if there exists an injective homomorphism $T : X \to Y$ satisfying $\|x\|_X \le \|Tx\|_Y \le K\|x\|_X$ for all $x \in X$. Such a $T$ is called an isomorphic embedding. A classical problem in the theory of Banach spaces is to consider the isomorphic embedding of finite-dimensional subspaces of $L_p(0,1)$ into $\ell_p^m$, where the distortion $K = 1 + \varepsilon$ is a constant. Specifically, the problem asks what is the minimum value of $m$, denoted by $m_p(d,\varepsilon)$, for which all $d$-dimensional subspaces of $L_p(0,1)$ $(1+\varepsilon)$-embed into $\ell_p^m$. A comprehensive survey of this problem is available in the literature.
The case of $p = 2$ is immediate, in which case one can take $m = d$ and $\varepsilon = 0$, obtaining an isometric embedding, and thus we assume $p \neq 2$. We remark that, when $p$ is an even integer, it is also possible to attain an isometric embedding into $\ell_p^m$ with $m$ depending only on $d$ and $p$. In general, the best known upper bounds on $m_p(d,\varepsilon)$ are as follows:
$m_p(d,\varepsilon) \le \begin{cases} c\,\varepsilon^{-2}\, d \log d, & 1 \le p < 2, \\ c(p)\,\varepsilon^{-2}\, d^{p/2} \log d, & 2 < p < \infty, \end{cases} \qquad (2)$

where $c$ is an absolute constant and $c(p)$ is a constant depending only on $p$. The cases of $p = 1$ and $1 < p < 2$ are due to Talagrand [33, 34], the case of non-even integers $p > 2$ is due to Bourgain, Lindenstrauss and Milman and to Schechtman, and the case of even integers $p$ is due to Schechtman.
The upper bounds in (2) are established by subsampling with a technique called the "change of density". First observe that it suffices to consider embeddings from $\ell_p^n$ to $\ell_p^m$, since any $d$-dimensional subspace of $L_p$ $(1+\varepsilon)$-embeds into $\ell_p^n$ for some large $n$. Now suppose that $E$ is a $d$-dimensional subspace of $\ell_p^n$. One can show that randomly subsampling $m$ coordinates induces a low-distortion isomorphism between $E$ and $E$ restricted onto the sampled coordinates, provided that each element of $E$ is "spread out" among the $n$ coordinates, which is achieved by first applying the technique of change of density to $E$.
Regarding lower bounds, a quick lower bound follows from the tightness of Dvoretzky's Theorem for $\ell_p$ spaces (see, e.g., [26, p21]), which states that if $\ell_2^d$ $C$-embeds into $\ell_p^m$, then $m \ge c(C)\, d^{p/2}$ for $p > 2$ and $m \ge c(C)\, d$ for $1 \le p < 2$, where $c(C)$ is a constant depending only on $C$. Since $\ell_2^d$ embeds into $L_p$ isometrically for all $p$ [19, p16], identical lower bounds for $m_p(d,\varepsilon)$ follow. Hence the upper bounds in (2) are, in terms of $d$, tight for $p > 2$, and near-tight (up to logarithmic factors) for other values of $p$. However, the right dependence on $\varepsilon$ is a long-standing open problem and little is known [20, 32]. It is known that $m_1(d,\varepsilon) = \Omega_d(\varepsilon^{-2(d-1)/(d+2)})$, whose proof critically relies upon the fact that the unit ball of a finite-dimensional subspace of $L_1$ is the polar of a zonotope (a linear image of a cube $[-1,1]^N$), and the $\ell_1$-norm for vectors in the subspace thus admits a nice representation as a sum of absolute values of inner products, but a lower bound for general $p$ was unknown. Our Corollary 1.6 shows that $m_p(d,\varepsilon) = \tilde\Omega(\varepsilon^{-2} d)$ for all non-even $p \ge 1$ and $d = \Omega(\log(1/\varepsilon))$, which is the first lower bound on the $\varepsilon$-dependence of $m_p(d,\varepsilon)$ for general $p$, and is optimal up to logarithmic factors. We would like to stress that except for the very special case of $p = 1$, no lower bound on the dependence on $\varepsilon$ whatsoever was known for $m_p(d,\varepsilon)$. We consider this to be significant evidence of the generality and novelty of our techniques. Moreover, even our lower bound for $p = 1$ is considerably wider in scope, as discussed more below.
1.3 Comparison with Prior Work
1.3.1 Comparison with Previous Results in Functional Analysis
As discussed, the mentioned lower bounds on $m_p(d,\varepsilon)$ come from the tightness of Dvoretzky's Theorem, which shows the impossibility of embedding $\ell_2^d$ into a Banach space with low distortion. Here the hardness comes from the geometry of the target space. In contrast, we emphasize that the hardness in our subspace sketch problem comes from the source space, since the target space is unconstrained and the output function does not necessarily correspond to an embedding. The lower bound via tightness of Dvoretzky's Theorem cannot, for example, rule out that every $d$-dimensional subspace of $L_p$ $(1+\varepsilon)$-embeds into $\ell_p^m$ with $m$ depending only on $d$ and not on $\varepsilon$.
When the target space is not $\ell_p$, lower bounds via functional analysis are more difficult to obtain since they require understanding the geometry of the dual space. Since our communication problem has no constraints on the data structure, the target space does not even need to be normed. In theoretical computer science and machine learning applications, the usual "sketch and solve" paradigm typically just requires the target space to admit an efficient algorithm for the optimization problem at hand. (For example, one can consider a space endowed with a premetric, which need not even be symmetric; embeddings into such spaces have appeared in the literature.) Our lower bounds are thus much wider in scope than those in geometric functional analysis.
1.3.2 Comparison with Previous Results for Graph Sparsifiers
Prior work [4, 11] studies the cut sketch problem: compress a graph $G$ so that, for any bipartition $(S, \bar S)$ of its vertices, one can recover a $(1\pm\varepsilon)$-approximation to $\mathrm{cap}(S, \bar S)$, where $\mathrm{cap}(S, \bar S)$ denotes the capacity of the cut between $S$ and $\bar S$. The main result of these works is that any such $(1\pm\varepsilon)$-cut sketch requires $\tilde\Omega(\varepsilon^{-2} n)$ bits to store. Note that a cut sketch can be constructed using a for-all version of the $\ell_p$ subspace sketch for any $p$, by just taking the matrix $A$ to be the edge-vertex incidence matrix of the graph and querying all vectors $x \in \{0,1\}^n$. Thus, one may naturally ask if the lower bounds in [4, 11] imply any lower bounds for the subspace sketch problem.
We note that both [4, 11] have explicit constraints on the value of $\varepsilon$. In [4], in order to prove the lower bound, it is required that $\varepsilon$ be at least inverse-polynomial in $n$. In [11] their lower bound of $\tilde\Omega(\varepsilon^{-2} n)$ likewise requires a polynomial lower bound on $\varepsilon$. Thus, the strongest lower bound that can be proved using such an approach is $\mathrm{poly}(n)$ bits. This is natural, since one can always store the entire weighted adjacency matrix of the graph in $\tilde O(n^2)$ bits. Our lower bound, in contrast, becomes arbitrarily large as $\varepsilon \to 0$.
1.4 Our Techniques
We use the case of $p = 1$ to illustrate our ideas behind the $\tilde\Omega(\varepsilon^{-2})$ lower bound for the $\ell_p$ subspace sketch problem, when $d = \Theta(\log(1/\varepsilon))$. We then extend this to an $\tilde\Omega(\varepsilon^{-2} d)$ lower bound for general $d$ via a simple padding argument. We first show how to prove a weaker lower bound for the for-all version of the problem, and then show how to strengthen the argument to obtain both a stronger lower bound and one that holds in the weaker original version of the problem (the "for-each" model, where the data structure only needs to be correct on a fixed query with constant probability).
Note that the condition $d = \Omega(\log(1/\varepsilon))$ is crucial for our proof. As shown in Section 11, when $d = 2$ there is actually an upper bound with a better dependence on $\varepsilon$, and thus our lower bound does not hold universally for all values of $d$. It is thus crucial that we look at a larger value of $d$, and we show that $d = \Theta(\log(1/\varepsilon))$ suffices.
To prove our bit lower bounds for the subspace sketch problem, we use randomized one-way communication complexity, which is a standard framework for setting up lower bounds for data structures. In our communication problem, Alice receives a matrix $A \in \mathbb{R}^{n \times d}$ and then sends a message to Bob. In the for-each version of the problem, Bob also receives a vector $x \in \mathbb{R}^d$, and Bob should correctly report a $(1\pm\varepsilon)$-approximation to $\|Ax\|_p^p$ with constant probability at the end of the protocol. In the for-all version of the problem, upon receiving the message from Alice, Bob should output a function $Q_p$ at the end of the protocol, such that with constant probability, $Q_p(x) = (1\pm\varepsilon)\|Ax\|_p^p$ simultaneously for all $x \in \mathbb{R}^d$.
Warmup: An $\tilde\Omega(\varepsilon^{-1})$ Lower Bound for the For-All Version.
In our hard instance, we let $d$ be such that $2^d = \Theta(1/\varepsilon)$. Suppose that Alice designs her matrix $A$ by including all vectors $u \in \{-1,1\}^d$ as rows and scaling the $u$-th vector by a nonnegative integer $\kappa_u$. We can think of $\kappa$ as a vector in $\mathbb{Z}_{\ge 0}^{2^d}$ with bounded entries. Now, Bob queries $Q_p(x)$ for all vectors $x \in \{-1,1\}^d$. For an appropriate choice of $\kappa$, for all $x \in \{-1,1\}^d$, we have

$\|Ax\|_p^p = \sum_{u \in \{-1,1\}^d} \kappa_u\, |\langle u, x \rangle|^p \le \frac{1}{3\varepsilon}.$
Since $Q_p(x)$ is a $(1\pm\varepsilon)$-approximation to $\|Ax\|_p^p$, and $\|Ax\|_p^p$ is always an integer (recall we are illustrating with $p = 1$), Bob can recover the exact value of $\|Ax\|_p^p$ from $Q_p(x)$, for all $x \in \{-1,1\}^d$.
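The recovery step uses only that the query values are integers of bounded size: if $v$ is an integer with $\varepsilon v < 1/2$, then any $(1\pm\varepsilon)$-approximation determines $v$ by rounding. A toy check (parameters hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
eps = 0.01

# Integer values bounded by 1/(3*eps), so the multiplicative error
# eps * v is strictly below 1/2 and rounding recovers v exactly.
v = rng.integers(0, int(1 / (3 * eps)) + 1, size=1000)
noise = rng.uniform(-eps, eps, size=v.shape)   # worst-case (1 +/- eps) factor
approx = v * (1.0 + noise)
recovered = np.rint(approx).astype(int)
```

This is exactly why the warmup caps the query values at $1/(3\varepsilon)$.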
Now we define a matrix $M \in \mathbb{R}^{2^d \times 2^d}$, where $M_{x,u} = |\langle x, u \rangle|^p$, where $x, u \in \{-1,1\}^d$ are interpreted as vectors in $\mathbb{R}^d$. A simple yet crucial observation is that $\|Ax\|_p^p$ is exactly the $x$-th coordinate of $M\kappa$. Notice that this critically relies on the assumption that $\kappa$ has nonnegative coordinates. Thus, the communication game can be equivalently viewed in the following way: Alice first designs a vector $\kappa$ with bounded nonnegative integer entries, and Bob receives the exact vector $M\kappa$. At this point, a natural idea is to show that the matrix $M$ has a sufficiently large rank, say $2^d/\mathrm{poly}(d)$, and carefully design $\kappa$ to show an $\tilde\Omega(2^d) = \tilde\Omega(\varepsilon^{-1})$ lower bound.
Fourier analysis on the hypercube shows that the eigenvectors of $M$ are the rows of the normalized Hadamard matrix, while the eigenvalues of $M$ are the Fourier coefficients associated with the function $f(w) = |\sum_i w_i|^p$, $w \in \{-1,1\}^d$, which depends only on the Hamming weight of $w$. Considering all characters of Hamming weight $d/2$ and their associated Fourier coefficients, we arrive at the conclusion that there are at least $\binom{d}{d/2} = 2^d/\mathrm{poly}(d)$ eigenvalues of $M$ with the same absolute value, which can be shown to be nonzero whenever $p$ is not a positive even integer. The formal argument is given in Section 3.1. Hence $\mathrm{rank}(M) \ge 2^d/\mathrm{poly}(d)$. Without loss of generality we assume the upper-left $r \times r$ block of $M$ is non-singular, where $r = 2^d/\mathrm{poly}(d)$.
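The diagonalization claim is easy to verify numerically for $p = 1$: the matrix $M_{x,u} = |\langle x, u \rangle|$ depends only on $x \circ u$ (coordinatewise product), so it is a convolution on the group $\{-1,1\}^d$ and conjugating by the character (Hadamard-type) matrix diagonalizes it. A small sanity check with $d = 4$:

```python
import numpy as np
from itertools import product

d = 4
cube = np.array(list(product([-1, 1], repeat=d)), dtype=float)
N = len(cube)                    # 2^d points of the cube

# M[x, u] = |<x, u>|: the p = 1 case of the matrix in the text.
M = np.abs(cube @ cube.T)

# Characters of {-1,1}^d: chi_S(u) = prod_{i in S} u_i; they form the
# rows of a 2^d x 2^d Hadamard-type matrix H with H H^T = 2^d I.
masks = [np.array(m, dtype=bool) for m in product([False, True], repeat=d)]
H = np.array([[np.prod(u[mk]) for u in cube] for mk in masks])

# Since M is group-invariant, H M H^T / 2^d is diagonal, and its diagonal
# carries the (scaled) Fourier coefficients of f(w) = |sum_i w_i|.
D = H @ M @ H.T / N
off_diag = D - np.diag(np.diag(D))
```

The diagonal entries of `D` are precisely the eigenvalues discussed above, indexed by characters, i.e., by Hamming weight up to symmetry.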
Now an $\tilde\Omega(\varepsilon^{-1})$ lower bound follows readily. Alice can set $\kappa$ to be

$\kappa = (b_1, b_2, \ldots, b_r, 0, \ldots, 0),$
where $b_1, \ldots, b_r$ are i.i.d. Bernoulli random variables and $r = 2^d/\mathrm{poly}(d)$. Since Bob knows the exact value of $M\kappa$, and the upper-left $r \times r$ block of $M$ is non-singular, Bob can recover the values of $b_1, \ldots, b_r$ by solving a linear system, which implies an $\Omega(r) = \tilde\Omega(\varepsilon^{-1})$ lower bound.
Before proceeding, let us first review why our argument fails for even integers $p$. For the $\ell_p$-norm, the Fourier coefficient associated with the character of Hamming weight $d$ on the Boolean cube is proportional to

$\sum_{k=0}^{d} (-1)^k \binom{d}{k}\, |d - 2k|^p.$

Therefore this sum vanishes if and only if $p$ is an even integer less than $d$ (in that case $|t|^p = t^p$ is a polynomial of degree $p < d$, which the weight-$d$ character annihilates), in which case $\mathrm{rank}(M)$ will no longer be $2^d/\mathrm{poly}(d)$ and the lower bound argument will fail.
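The vanishing condition can be checked directly from the formula: the sum is zero exactly when $|t|^p$ is a polynomial of degree less than $d$, i.e., when $p$ is an even integer with $p < d$.

```python
from math import comb

def parity_coeff(p, d):
    """S(p, d) = sum_k (-1)^k C(d, k) |d - 2k|^p, proportional to the
    Fourier coefficient of f(w) = |sum_i w_i|^p at the weight-d
    (parity) character of the cube {-1, 1}^d."""
    return sum((-1) ** k * comb(d, k) * abs(d - 2 * k) ** p
               for k in range(d + 1))
```

For even $p < d$ the summand is a polynomial in $k$ of degree $p$, and alternating binomial sums annihilate all polynomials of degree below $d$; for odd $p$ the absolute value breaks polynomiality and the coefficient survives.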
An $\tilde\Omega(\varepsilon^{-2})$ Lower Bound for the For-Each Version.
To strengthen this to an $\tilde\Omega(\varepsilon^{-2})$ lower bound, it is tempting to increase $d$ so that $2^d = \Theta(\varepsilon^{-2})$. In this case, however, Bob can no longer recover the exact value of $\|Ax\|_p^p$, since each entry of $M\kappa$ now has magnitude roughly $\varepsilon^{-2}$ and the function $Q_p$ only gives a $(1\pm\varepsilon)$-approximation. Bob still receives the vector $M\kappa$, but with an additive error of roughly $\varepsilon^{-1}$ on each entry. One peculiarity of the model here is that if some entries of $\kappa$ are negative, then Bob receives $M|\kappa|$ rather than $M\kappa$ (cf. (3)), where $|\kappa|$ denotes the vector formed by taking the absolute value of each coordinate of $\kappa$; i.e., $\|Ax\|_p^p$ depends only on the absolute values of the entries of $\kappa$, which suggests that the constraint that the entries of $\kappa$ be nonnegative, each recoverable only up to an additive error, is somehow intrinsic.
To illustrate our idea for overcoming the issue of large additive error, for the time being let us forget the actual form of the matrix $M$ previously defined in the argument for our lower bound and consider instead a general $M \in \mathbb{R}^{N \times N}$ with orthogonal rows $m_1, \ldots, m_N$, each row having $\ell_2$ norm $L$. For now we also allow Alice to use a $\kappa$ with negative entries, and pretend that Bob receives $M\kappa$ with an additive error on each entry. Now, Alice sets $\kappa$ to be

$\kappa = \sum_{i=1}^{N} \sigma_i\, m_i,$
where $\sigma_1, \ldots, \sigma_N$ are i.i.d. Rademacher random variables. By a standard concentration inequality, $\|\kappa\|_\infty = O(L\sqrt{\log N})$ holds with high probability (recall that the columns of $M$ also have norm $L$, since $M$ is a scalar multiple of an orthogonal matrix). Now consider the vector $M\kappa$. Due to the orthogonality of the rows of $M$, the $i$-th coordinate of $M\kappa$ will be

$\langle m_i, \kappa \rangle = \sigma_i \|m_i\|_2^2 = \sigma_i L^2.$
Provided that $L^2$ is larger than the additive error, Bob can still recover $\sigma_i$ by just looking at the sign of the $i$-th coordinate of $M\kappa$. Thus, for an appropriate choice of parameters such that $N = \tilde\Theta(\varepsilon^{-2})$, we can obtain an $\tilde\Omega(\varepsilon^{-2})$ lower bound.
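A numerical sketch of the sign-recovery step (all parameters hypothetical): $M$ has orthogonal rows of norm $L$, and the adversarial additive error is kept below $L^2$.

```python
import numpy as np

rng = np.random.default_rng(5)
N, L = 64, 10.0                  # N orthogonal rows, each of norm L

# Build an orthogonal set of rows m_1, ..., m_N, each with norm L.
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
M = L * Q                        # M @ M.T = L^2 * I

sigma = rng.choice([-1.0, 1.0], size=N)   # Alice's Rademacher signs
kappa = M.T @ sigma                        # kappa = sum_i sigma_i m_i

# Bob sees M @ kappa with additive error strictly below L^2 per entry.
Delta = 0.5 * L * L
noisy = M @ kappa + rng.uniform(-Delta, Delta, size=N)

# <m_i, kappa> = sigma_i * L^2, so the sign of each noisy entry is sigma_i.
recovered = np.sign(noisy)
```

Each clean coordinate equals $\pm L^2$, so any additive error below $L^2$ in magnitude leaves the sign, and hence each $\sigma_i$, intact.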
Now we return to the original matrix $M$ with $M_{x,u} = |\langle x, u \rangle|^p$, whose rows are not necessarily orthogonal. The previous argument still goes through so long as we can identify a subset of $2^d/\mathrm{poly}(d)$ rows that are nearly orthogonal, meaning that the norm of the orthogonal projection of each such row onto the subspace spanned by the other rows in the subset is much smaller than the norm of the row itself.
To achieve this goal, we study the spectrum of $M$. The Fourier argument mentioned above implies that at least $2^d/\mathrm{poly}(d)$ eigenvalues of $M$ have the same absolute value $\lambda > 0$. If all other eigenvalues of $M$ were zero, then we could identify a set of nearly orthogonal rows using $2^d/\mathrm{poly}(d)$ rows of $M$, each with norm roughly $\lambda$, using a procedure similar to the standard Gram-Schmidt process. The full details can be found in Section 3.2. Although the other eigenvalues of $M$ are not all zero, we can simply ignore the associated eigenvectors since they are orthogonal to the set of nearly orthogonal rows we obtained above.
Lastly, recall that what Bob receives is $M|\kappa|$ instead of $M\kappa$, unless $\kappa$ is entrywise nonnegative. To fix this, note that $\|\kappa\|_\infty = O(L\sqrt{\log N})$ with high probability, and so Alice can just shift each entry of $\kappa$ by a fixed amount of $\Theta(L\sqrt{\log N})$ to ensure that all entries of $\kappa$ are nonnegative. Bob can still obtain $M\kappa$ with an additive error, since the amount of the shift is fixed and bounded by $O(L\sqrt{\log N})$, so its contribution can be subtracted.
Notice that the above argument in fact holds even for the for-each version of the subspace sketch problem. By querying the $x$-th vector on the Boolean cube for a single $x \in \{-1,1\}^d$, Bob is able to recover the sign of the corresponding $\sigma_i$ with constant probability. Given this, we can now use standard arguments in one-way communication complexity to show that our lower bound holds for the for-each version of the problem.
The formal analysis given in Section 3.3 is a careful combination of all the ideas mentioned above.
Applications: $M$-estimators and Projective Clustering Coresets.
Our general strategy for proving lower bounds for $M$-estimators is to relate one $M$-estimator, for which we want to prove a lower bound, to another $M$-estimator for which a lower bound is easy to derive. For the $\ell_1$-$\ell_2$ estimator, the Huber estimator and the Fair estimator, when $|t|$ is sufficiently large we have $\phi(t) \approx |t|$ (up to rescaling of $t$ and of the function value), and thus the lower bounds follow from those for the $\ell_1$ subspace sketch problem.
For the Cauchy estimator, we relate it to another estimator $\tilde\phi$. In Section 8, we show that our Fourier-analytic arguments also work for $\tilde\phi$. Since for sufficiently large $|t|$ the Cauchy estimator satisfies $\phi(t) \approx \tilde\phi(t)$ (up to rescaling of $t$ and the function value), a lower bound for the Cauchy estimator follows.
To prove lower bounds for projective clustering coresets, the main observation is that when $k = 1$ and $j = d - 1$, by choosing the query subspace to be the orthogonal complement of a unit vector $x$, the projective cost is just $\sum_{a \in P} \phi(|\langle a, x \rangle|)$, and thus we can invoke our lower bounds for the subspace sketch problem. We use a coding argument to handle general $k$. In Lemma 9.1, we show there exists a set of $k$ "codewords" $c_1, \ldots, c_k$ such that the distance between $c_i$ and any subspace associated with $c_{i'}$, for $i' \neq i$, is arbitrarily large. Now, for $k$ copies of the hard instance of the subspace sketch problem, we add $c_i$ as a prefix to all data points in the $i$-th hard instance, and set the $i$-th query subspace to be the orthogonal complement of a vector $x_i$, to which we add $c_i$ as a prefix. Now, the data points in the $i$-th hard instance will always choose the $i$-th center in the optimal solution, since otherwise an arbitrarily large cost would be incurred. Thus, we can solve $k$ independent copies of the subspace sketch problem, and the desired lower bound follows.
In the rest of the section, we shall illustrate our techniques for proving lower bounds that depend on $d$ for the subspace sketch problem. These lower bounds hold even when $\varepsilon$ is a constant.
An $\tilde\Omega(d^{p/2})$ Lower Bound for the For-Each Version.
Our approach for proving the lower bound is based on the following crucial observation: consider a uniformly random matrix $A \in \{-1,1\}^{n \times d}$ and a uniformly random vector $x \in \{-1,1\}^d$ of i.i.d. Rademacher coordinates. Then the expectation $\mathbb{E}\,|\langle a, x \rangle|^p = \Theta(d^{p/2})$ for any fixed $a \in \{-1,1\}^d$, whereas for each row $a$ of $A$, $|\langle a, a \rangle|^p = d^p$. Intuitively, the lower bound comes from the fact that Bob can recover the whole matrix $A$ by querying all Boolean vectors using the function $Q_p$, since if $x$ is a row of $A$, then $\|Ax\|_p^p$ would be slightly larger than its typical value, by adjusting constants.
To implement this idea, one can generate a set of almost orthogonal vectors $T$, and require that all rows of $A$ come from $T$. A simple probabilistic argument shows that one can construct a set $T$ of $N$ vectors in $\{-1,1\}^d$ such that for any distinct $u, v \in T$, $|\langle u, v \rangle| = O(\sqrt{d \log N})$ (the logarithmic factor can be removed using more sophisticated constructions based on coding theory; see Lemma 4.1). If Alice generates her matrix $A$ using vectors from $T$ as the rows, then for any vector $x \in T$ that is not a row of $A$, $\|Ax\|_p^p$ is noticeably smaller than when $x$ is a row of $A$,
for some appropriate choice of the parameters. Thus, by querying $Q_p(x)$ for all vectors $x \in T$, Bob can recover the whole matrix $A$, even when $\varepsilon$ is a constant. By a standard information-theoretic argument, this leads to a lower bound of $\tilde\Omega(d^{p/2})$. Furthermore, Bob only needs to query $|T|$ vectors, which means the lower bound in fact holds for the for-each version of the subspace sketch problem, by a standard repetition argument and losing a logarithmic factor in the lower bound.
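The probabilistic construction of a nearly orthogonal family is easy to validate empirically (a sketch; the bound checked below is deliberately loose compared to the $O(\sqrt{d \log N})$ guarantee):

```python
import numpy as np

rng = np.random.default_rng(6)
d, N = 256, 100                  # N random sign vectors in dimension d

T = rng.choice([-1, 1], size=(N, d))
G = T @ T.T                      # Gram matrix: diagonal entries equal d

# Each off-diagonal inner product is a sum of d independent signs, so it
# concentrates around 0 at scale sqrt(d) -- far below the "self" inner
# product d, which is what makes an aligned row stand out in ||Ax||_p^p.
off = G - d * np.eye(N, dtype=int)
max_overlap = np.abs(off).max()
```

With these parameters the largest cross inner product is typically around $4\sqrt{d} \approx 65$, versus the self inner product $d = 256$.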
An $\tilde\Omega(d^{p/2+1})$ Lower Bound for the For-All Version.
In order to obtain the nearly optimal lower bound for the for-all version, we must abandon the constraint that all rows of the matrix come from a fixed small set of vectors. Our plan is still to construct a large set $\mathcal{A}$ of matrices, and show that for any distinct matrices $A, A' \in \mathcal{A}$, Bob can distinguish them using the function $Q_p$, thus proving an $\Omega(\log |\mathcal{A}|)$ lower bound. The new observation is that, to distinguish two matrices $A, A' \in \mathcal{A}$, it suffices to have a single row of $A$, say $a$, such that $\|A'a\|_p^p$ deviates noticeably from $\|Aa\|_p^p$. Again using the probabilistic method, we show the existence of such a set $\mathcal{A}$ with size $2^{\tilde\Omega(d^{p/2+1})}$, which implies an $\tilde\Omega(d^{p/2+1})$ lower bound.
Our main technical tool is Talagrand's concentration inequality, which shows that for any fixed vector $x$ and a matrix $A$ with i.i.d. Rademacher entries, $\|Ax\|_p^p$ concentrates around its typical value with exponentially decaying tail probability. This implies that for two independent random matrices $A, A'$ with i.i.d. Rademacher entries, the probability that there exists some row $a$ of $A$ such that $\|A'a\|_p^p$ deviates noticeably from its typical value is large, since the rows of $A$ are independent. By a probabilistic argument, the existence of the set