Variable Selection is Hard

Abstract

Variable selection for sparse linear regression is the problem of finding, given an $m \times p$ matrix $B$ and a target vector $\mathbf{y}$, a sparse vector $\mathbf{x}$ such that $B\mathbf{x}$ approximately equals $\mathbf{y}$. Assuming a standard complexity hypothesis, we show that no polynomial-time algorithm can find a $k'$-sparse $\mathbf{x}$ with $\|B\mathbf{x} - \mathbf{y}\|_2^2 \le h(m,p)$, where $k' = k \cdot 2^{\log^{1-\delta} p}$ and $h(m,p) \le p^{C_1} m^{1-C_2}$, where $\delta > 0$, $C_1 > 0$, and $C_2 > 0$ are arbitrary. This is true even under the promise that there is an unknown $k$-sparse vector $\mathbf{x}^*$ satisfying $B\mathbf{x}^* = \mathbf{y}$. We prove a similar result for a statistical version of the problem in which the data are corrupted by noise.

To the authors’ knowledge, these are the first hardness results for sparse regression that apply when the algorithm simultaneously has $k' > k$ and $h(m,p) > 0$.

1Introduction

Consider a linear regression problem in which one is given an $m \times p$ matrix $B$ and a target vector $\mathbf{y}$. The goal is to approximately represent $\mathbf{y}$ as a linear combination of as few columns of $B$ as possible. If a polynomial-time algorithm $\mathcal{A}$ is presented with $B$ and $\mathbf{y}$, and it is known that $\mathbf{y}$ is an exact linear combination of some $k$ columns of $B$, and $\mathcal{A}$ is allowed to choose more than $k$ columns of $B$, how many columns must $\mathcal{A}$ choose in order to generate a linear combination which is close to $\mathbf{y}$?

Note that we have allowed $\mathcal{A}$ to “cheat” both on the number of columns and on the accuracy of the resulting linear combination. In this paper, we show that the problem is intractable despite the allowed cheating.

Formally, we define $(k', h)$-Sparse Regression as follows. Let $\mathbf{1}$ denote the $m$-dimensional vector of 1’s, and for any vector $\mathbf{x}$, let $\|\mathbf{x}\|_0$ denote the number of nonzeros in $\mathbf{x}$. Let $k'$ and $h$ denote arbitrary functions. An algorithm for $(k', h)$-Sparse Regression satisfies the following.

  • Given: An $m \times p$ Boolean matrix $B$ and a positive integer $k$ such that there is a real $p$-dimensional vector $\mathbf{x}^*$, $\|\mathbf{x}^*\|_0 \le k$, such that $B\mathbf{x}^* = \mathbf{1}$. (Call such an input valid.)

  • Goal: Output a (possibly random) $p$-dimensional vector $\mathbf{x}$ with $\|\mathbf{x}\|_0 \le k'(k)$ such that $\|B\mathbf{x} - \mathbf{1}\|_2^2 \le h(m,p)$ with high probability over the algorithm’s internal randomness.

Since we are focused on hardness, restricting $B$ to have entries in $\{0, 1\}$ and the target vector to be $\mathbf{1}$ only makes the result stronger.

There is, of course, an obvious exponential-time deterministic algorithm for $(k', h)$-Sparse Regression with $k'(k) = k$ for all $k$ and $h = 0$: enumerate all subsets $S$ of $\{1, 2, \dots, p\}$ of size $k$. For each $S$, use Gaussian elimination to determine if there is an $\mathbf{x}$ with $x_j = 0$ for all $j \notin S$ such that $B\mathbf{x} = \mathbf{1}$, and if there is one, to find it. Since $\|\mathbf{x}^*\|_0 \le k$ and $B\mathbf{x}^* = \mathbf{1}$, for at least one $S$ the algorithm must return an $\mathbf{x}$ with $B\mathbf{x} = \mathbf{1}$.
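As a concrete illustration, here is a minimal Python sketch of this enumeration (an illustrative reconstruction, not code from the paper; it assumes NumPy and uses a least-squares solve in place of explicit Gaussian elimination):

    import itertools
    import numpy as np

    def exact_sparse_regression(B, k, tol=1e-9):
        """Brute-force search for a k-sparse x with B @ x equal to the all-ones vector.

        Tries every size-k subset S of the columns, solves the restricted
        least-squares problem, and returns the first exact solution found.
        Runs in time roughly (p choose k) * poly(m, k).
        """
        m, p = B.shape
        ones = np.ones(m)
        for S in itertools.combinations(range(p), k):
            idx = list(S)
            cols = B[:, idx].astype(float)
            coeffs, *_ = np.linalg.lstsq(cols, ones, rcond=None)
            if np.linalg.norm(cols @ coeffs - ones) ** 2 <= tol:
                x = np.zeros(p)
                x[idx] = coeffs
                return x
        return None  # no k-sparse exact solution: the input was not "valid"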

Before getting too technical, we warm up with a simple hardness proof for Sparse Regression. The short proof can be understood on its own. This theorem is just a warmup; later we will prove a much stronger result (relying, unfortunately, on a stronger complexity assumption). To the authors’ knowledge, this simple proof was not known, though similar arguments had been used previously to show weaker hardness results for related problems (cf. [2] and [6]).

Feige [9] gives a reduction from SAT, running in deterministic $n^{O(\log \log n)}$ time on SAT instances of size $n$, to Set Cover, in which the resulting, say, $M \times N$, incidence matrix $A$ (whose rows are elements and columns are sets) has the following properties. There is a (known) $k$ such that (1) if a formula $\phi$ is satisfiable, then there is a collection of $k$ disjoint sets which covers the universe (that is, there is a collection of $k$ columns of $A$ whose sum is exactly $\mathbf{1}$), and (2) if $\phi$ is unsatisfiable, then no collection of at most $k'$ sets covers the universe (in other words, no set of at most $k'$ columns of $A$ has a sum which is coordinate-wise at least $\mathbf{1}$), where $k'$ exceeds $k$ by a logarithmic factor. This is already enough to establish that any polynomial-time algorithm for $(k', 0)$-Sparse Regression implies an $n^{O(\log \log n)}$-time algorithm for SAT. The remainder of the proof is devoted to establishing the analogous statement for $(k', h)$-Sparse Regression with $h > 0$.

Build a matrix $B$ by stacking $t$ copies of $A$ atop one another, with $t$ to be determined later. Let $m = tM$. If $\phi$ is satisfiable, then there is a collection of $k$ columns of $B$ summing to $\mathbf{1}$. This is a linear combination of sparsity at most $k$. If $\phi$ is unsatisfiable, then for any linear combination of at most $k'$ column vectors, in each of the $t$ copies of $A$, there is squared error at least 1 (since the best one could hope for, in the absence of a set cover, is $M - 1$ 1’s and one 0). This means that the squared error overall is at least $t$. We want the total squared error to exceed the allowed error bound $h$, and hence we define $t$ accordingly.
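The amplification step itself is trivial to implement; the following hypothetical helper sketches the construction (the names A and t are ours):

    import numpy as np

    def stack_copies(A, t):
        """Stacks t copies of the Set Cover incidence matrix A on top of one another.

        If k columns of A sum exactly to the all-ones vector, the same k columns
        of the stacked matrix do too; if no small cover exists, every sparse
        linear combination incurs squared error at least 1 inside each copy,
        so the total squared error is at least t.
        """
        return np.vstack([A] * t)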

Construct an algorithm $\mathcal{A}'$ for SAT as follows. Run $\mathcal{A}$ on the instance $(B, k)$ of Sparse Regression for $T$ time steps, where $T$ is the time $\mathcal{A}$ would need on an $m \times N$ matrix if $(B, k)$ were a valid input for Sparse Regression (which it may not be); since, in the valid case, $\mathcal{A}$ runs in time polynomial in the size of the input matrix (which is $n^{O(\log \log n)}$), $T$ is also $n^{O(\log \log n)}$. If $\mathcal{A}$ outputs a vector $\mathbf{x}$ such that $\|\mathbf{x}\|_0 \le k'$ and $\|B\mathbf{x} - \mathbf{1}\|_2^2 \le h(m, N)$ within this time bound, then $\mathcal{A}'$ outputs “Satisfiable.” Otherwise ($\mathcal{A}$ doesn’t terminate in the allotted time or it terminates and outputs an inappropriate vector), $\mathcal{A}'$ outputs “Unsatisfiable.”

Clearly $\mathcal{A}'$ runs in time $n^{O(\log \log n)}$ on inputs of size $n$. It remains to show that $\mathcal{A}'$ is a correct algorithm for SAT.

To this end, suppose that $\phi$ is satisfiable. In this case, there is a solution of sparsity at most $k$ with zero error, and since $\mathcal{A}$ is a correct algorithm for Sparse Regression, $\mathcal{A}$ will find such a solution, causing $\mathcal{A}'$ to output “Satisfiable” when run on $\phi$. On the other hand, if $\phi$ is unsatisfiable, then there is no vector $\mathbf{x}$ with $\|\mathbf{x}\|_0 \le k'$ such that $\|B\mathbf{x} - \mathbf{1}\|_2^2 \le h(m, N)$. Hence, $\mathcal{A}'$ must output “Unsatisfiable” when run on $\phi$. We conclude that $\mathcal{A}'$ is a correct algorithm for SAT running in time $n^{O(\log \log n)}$ on instances of size $n$.

One can combine Dinur and Steurer’s PCP [8] with Feige’s construction, or the earlier construction of Lund and Yannakakis [12], to strengthen Theorem ? so that its conclusion holds assuming only $\mathrm{P} \neq \mathrm{NP}$.

Throughout, $\mathrm{BPTIME}(T(n))$ will denote the set of all languages decidable by randomized algorithms, with two-sided error, running in expected time $O(T(n))$. Our main result is that unless $\mathrm{NP} \subseteq \mathrm{BPTIME}(n^{\mathrm{poly}\log n})$, then even if $k'(k)$ grows at a “nearly polynomial” rate (namely $k'(k) = k \cdot 2^{\log^{1-\delta} p}$ for an arbitrary constant $\delta > 0$), and $h(m,p)$ is as large as $p^{C_1} m^{1-C_2}$ for any positive constants $C_1$ and $C_2$, there is no quasipolynomial-time (randomized) algorithm for $(k', h)$-Sparse Regression:

Note that the “$m^{1-C_2}$” cannot be removed from the condition on $h$ in Theorem ?: the algorithm that always outputs the all-zeros vector solves $(k', h)$-Sparse Regression for $k'(k) = 0$ and $h(m,p) = m$.

We also show, assuming a slight strengthening of a standard conjecture about the existence of probabilistically checkable proofs with small soundness error, known as the projection games conjecture (PGC) [14], that Sparse Regression is hard even if $k'(k)$ exceeds $k$ by a factor that grows as a constant power of $p$. We refer to our slight strengthening of the PGC as the “Biregular PGC,” and state it formally in Section 2.3.

We might consider $(k', h)$-Sparse Regression to be a “computer science” version of Sparse Regression, in the sense that the data are deterministic and fully specified. In an alternative, “statistics” version of Sparse Regression, the data are corrupted by random noise unknown to the algorithm. Specifically, we consider the following problem, which we call $(k', h)$-Noisy Sparse Regression.

  • There are a positive integer $k$ and an $m \times p$ Boolean matrix $B$, such that there exists an unknown $p$-dimensional vector $\mathbf{x}^*$ with $\|\mathbf{x}^*\|_0 \le k$ such that $B\mathbf{x}^* = \mathbf{1}$. An $m$-dimensional vector $\mathbf{e} = (e_1, \dots, e_m)$ of i.i.d. “noise” components is generated and $\mathbf{y}$ is set to $B\mathbf{x}^* + \mathbf{e}$. $B$, $k$, and $\mathbf{y}$ (but not $\mathbf{x}^*$ or $\mathbf{e}$) are revealed to the algorithm.

  • Goal: Output a (possibly random) $\mathbf{x}$ such that $\|\mathbf{x}\|_0 \le k'(k)$ and $\mathbb{E}\left[\|B\mathbf{x} - \mathbf{1}\|_2^2\right] \le h(m,p)$. Here, the expectation is taken over both the internal randomness of the algorithm and the randomness of the $e_i$’s.
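To make the setup concrete, the following Python sketch generates such an instance. Gaussian noise with standard deviation sigma is an illustrative assumption on our part; the definition above leaves the noise distribution abstract beyond being i.i.d.

    import numpy as np

    def make_noisy_instance(B, x_star, sigma=1.0, rng=None):
        """Generates y = B @ x_star + e with i.i.d. noise (Gaussian here by assumption).

        The algorithm is shown B, k = number of nonzeros of x_star, and y,
        but neither x_star nor the noise vector e.
        """
        rng = np.random.default_rng() if rng is None else rng
        e = sigma * rng.standard_normal(B.shape[0])  # i.i.d. noise components
        y = B @ x_star + e
        k = int(np.count_nonzero(x_star))
        return B, k, y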

We give a simple reduction from -Sparse Regression to -Noisy Sparse Regression that proves the following theorems.

Importance and Prior Work. Variable selection is a crucial part of model design in statistics. People want a model with a small number of variables partially for simplicity and partially because models with fewer variables tend to have smaller generalization error, that is, they give better predictions on test data (data not used in generating the model). Standard greedy statistical algorithms to choose features for linear regression include forward stepwise selection (also known as stepwise regression or orthogonal least squares), backward elimination, and least angle regression. Standard non-greedy feature selection algorithms include LASSO and ridge regression.

There are algorithmic results for sparse linear regression that guarantee good performance under certain conditions on the matrix $B$. For example, equation (6) of [18] states that a “restricted eigenvalue condition” implies that the LASSO algorithm will give good performance for the statistical version of the problem. A paper by Natarajan [16] presents an algorithm, also known as forward stepwise selection or orthogonal least squares (see also [4]), which achieves good performance provided that a certain norm of the pseudoinverse of the matrix obtained from $B$ by normalizing each column is small [16]. It appears that no upper bound for stepwise regression under reasonable assumptions on the matrix was known prior to the appearance of Natarajan’s algorithm in 1995 (and the equivalence of Natarajan’s algorithm with forward stepwise selection appears not to have been noticed until recently). In the appendix, we include an example proving that Natarajan’s algorithm can perform badly when that norm of the pseudoinverse is large, or in other words, that Natarajan’s analysis of his algorithm is close to tight. Another example proving necessity of the factor involving the pseudoinverse appeared in [5].
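For reference, here is a minimal sketch of greedy forward stepwise selection (orthogonal least squares) in the spirit of [16] and [4]; the stopping rule (a fixed number of steps) and the numerical tolerances are our own simplifications, not the algorithm as analyzed in those papers:

    import numpy as np

    def forward_stepwise(B, y, k_max):
        """Greedy forward selection: repeatedly adds the column most correlated
        with the current residual, refitting by least squares on the chosen set.
        """
        m, p = B.shape
        if k_max <= 0:
            return np.zeros(p)
        chosen = []
        residual = y.astype(float)
        col_norms = np.maximum(np.linalg.norm(B, axis=0), 1e-12)
        for _ in range(k_max):
            scores = np.abs(B.T @ residual) / col_norms
            scores[chosen] = -np.inf          # do not re-select a column
            chosen.append(int(np.argmax(scores)))
            coeffs, *_ = np.linalg.lstsq(B[:, chosen].astype(float), y, rcond=None)
            residual = y - B[:, chosen] @ coeffs
        x = np.zeros(p)
        x[chosen] = coeffs
        return x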

There have been several prior works establishing hardness results for variants of the sparse regression problem. Natarajan [16] used a reduction from Exact Cover By 3-Sets to prove that, given an $m \times p$ matrix $B$, a vector $\mathbf{y}$, and $\epsilon > 0$, it is NP-hard to compute a vector $\mathbf{x}$ satisfying $\|B\mathbf{x} - \mathbf{y}\|_2 \le \epsilon$, if such an $\mathbf{x}$ exists, such that $\mathbf{x}$ has the fewest nonzero entries over all such vectors. Davis et al. [7] proved a similar NP-hardness result. The hardness results of [16] only establish hardness if the algorithm is not allowed to “cheat” simultaneously on both the sparsity and the accuracy of the resulting linear combination.

Arora et al. [2] showed that, for any $\epsilon > 0$, a problem called Min-Unsatisfy does not have any polynomial-time algorithm achieving an approximation factor of $2^{\log^{1-\epsilon} n}$, assuming $\mathrm{NP} \not\subseteq \mathrm{DTIME}(n^{\mathrm{poly}\log n})$. In this problem, the algorithm is given a system $A\mathbf{x} = \mathbf{b}$ of linear equations over the rationals, and the cost of a solution $\mathbf{x}$ is the number of equalities that are violated by $\mathbf{x}$. Amaldi and Kann [1] built directly on the result of Arora et al. [2] to show, in our terminology, that $(k', 0)$-Sparse Regression also has no polynomial-time algorithm under the same assumption. Also see a result by S. Muthukrishnan [15].

Finally, Zhang et al. [18] showed a hardness result for Noisy Sparse Regression. We defer a discussion of the result of [18] to Section 3.1. For now, we just note that their hardness only applies to algorithms which cannot “cheat” on the sparsity, that is, to algorithms that must generate a solution with at most $k$ nonzeros.

In summary, to the best of our knowledge, our work is the first to establish that sparse linear regression is hard to approximate, even when the algorithm is allowed to “cheat” on both the sparsity of the solution output and on the accuracy of the resulting linear combination.

2Proofs of Theorems ? and ?

2.1Notation and Proof Outline

Throughout, we will use lower-case boldface letters to denote vectors. For any vector $\mathbf{v}$, $\|\mathbf{v}\|_2$ will denote the Euclidean norm of $\mathbf{v}$, while $\|\mathbf{v}\|_0$ will denote the sparsity (i.e., the number of nonzeros) of $\mathbf{v}$. Let $\mathbf{1}$ denote the all-ones vector, and for any vector $\mathbf{v}$ and any $r \ge 0$, let the ball of radius $r$ around $\mathbf{v}$ be taken with respect to the square of the Euclidean norm, i.e., $\{\mathbf{u} : \|\mathbf{u} - \mathbf{v}\|_2^2 \le r\}$.

If one represents a vector $\mathbf{y}$ as a linear combination $\mathbf{y} = \sum_j c_j \mathbf{v}_j$ of vectors $\mathbf{v}_1, \dots, \mathbf{v}_t$, we say that $\mathbf{v}_j$ participates in the linear combination if $c_j \neq 0$.

We will use the symbol $n$ to denote the input size of instances used in our reductions.

The first step in our proof of Theorem ? is to establish Proposition ? below.

(For the purpose of proving Theorem ?, having in Proposition ? would actually suffice.)

The second step of the proof describes a simple transformation of any -Sparse Regression algorithm for a “fast-growing” function into a -Sparse Regression algorithm for . The proof appears in Section 2.4.

[Proof of Theorem ? assuming Propositions ? and ?] Suppose by way of contradiction that there are positive constants such that there is a quasipolynomial-time randomized algorithm for -Sparse Regression  where and . By Proposition ?, can be transformed into a randomized quasipolynomial-time algorithm for -Sparse Regression.

Clearly is capable of distinguishing the following two cases, for any :

  1. There is an such that and .

  2. For all such that , .

In particular, the above holds for in , which contradicts Proposition ?.

Proof Outline for Proposition ?. Lund and Yannakakis showed, assuming SAT cannot be solved by algorithms running in quasipolynomial time, that Set Cover cannot be approximated within a logarithmic factor [12]. Here, an instance of Set Cover consists of a base set $U$ and a family of subsets $S_1, S_2, \dots$ of $U$, and the goal is to find a minimal collection of the $S_i$’s whose union equals $U$.

Lund and Yannakakis’ transformation from an instance $\phi$ of SAT to an instance of Set Cover has a (known) remarkable property: if $\phi$ is satisfiable, then the generated instance of Set Cover does not just have a small cover of the base set $U$, it has a small partition of $U$. That is, if $\mathbf{1}_{S_i}$ denotes the indicator vector of set $S_i$, then the indicator vectors of the sets in the partition sum to exactly $\mathbf{1}$. This is a stronger property than their coordinate-wise maximum equaling $\mathbf{1}$, which is the condition required to show hardness of Set Cover. This observation naturally allows us to define a corresponding instance of the Linear Regression problem with a sparse solution; the columns of the matrix in the regression problem are simply the indicator vectors $\mathbf{1}_{S_i}$.
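A toy illustration of this correspondence (with hypothetical helper names): the columns of the regression matrix are the indicator vectors, and a partition of the universe is precisely a 0/1 coefficient vector whose chosen columns sum to the all-ones vector.

    import numpy as np

    def sets_to_matrix(sets, N):
        """Columns are indicator vectors: entry (u, i) is 1 iff element u lies in sets[i]."""
        B = np.zeros((N, len(sets)), dtype=int)
        for i, S in enumerate(sets):
            for u in S:
                B[u, i] = 1
        return B

    # A partition of the universe corresponds to a 0/1 coefficient vector whose
    # chosen columns sum to the all-ones vector, i.e., an exact sparse solution.
    sets = [{0, 1}, {2, 3}, {1, 2}, {0, 3}]
    B = sets_to_matrix(sets, 4)
    x = np.array([1, 1, 0, 0])   # pick the first two sets: they partition {0, 1, 2, 3}
    assert np.array_equal(B @ x, np.ones(4, dtype=int))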

A central ingredient in Lund and Yannakakis’ transformation is a certain kind of set system over a base set . Their set system naturally requires that any union of fewer than ’s or ’s cannot cover , unless there is some such that both and participate in the union. As a result, they needed to be polynomial in and exponential in . Since we are studying Sparse Regression rather than , we can impose a weaker condition on our set systems. Specifically, we need that:

any linear combination of indicator vectors of the sets or their complements is “far” from $\mathbf{1}$, unless the linear combination includes both the indicator vector of a set and the indicator vector of its complement.

(See Definition ? below.) As a result, we are able to take to be much smaller relative to and than can Lund and Yannakakis. Specifically, we can take to be polynomial in and logarithmic in . This is the key to establishing hardness for super-logarithmic approximation ratios, and to obtaining hardness results even when we only require an approximate solution to the system of linear equations.

2.2Proof of Proposition ?

Preliminaries

A basic concept in our proofs is that of -useful set systems, defined below.

(Note that there is a 2-sparse linear combination involving the indicator vectors of $S_i$ and of its complement $\bar{S}_i$ that exactly equals $\mathbf{1}$, namely, their sum $\mathbf{1}_{S_i} + \mathbf{1}_{\bar{S}_i}$.)

(It seems likely that a deterministic construction that appears in [2] could be modified to generate a -useful set system.)

Throughout the proof, we set

To avoid notational clutter, we denote simply as throughout the proof of Lemma ?. The core of our argument is the following technical lemma bounding the probability that is “close” to the span of a “small” number of randomly chosen vectors from .

Rather than reasoning directly about the Boolean vectors in the statement of the sublemma, it will be convenient to reason about $\{-1, 1\}$-valued vectors. Accordingly, for any Boolean vector $\mathbf{v}$, define $\hat{\mathbf{v}} = 2\mathbf{v} - \mathbf{1}$. Notice that if $\mathbf{v}$ is a uniform random Boolean vector, then $\hat{\mathbf{v}}$ is a uniform random $\{-1, 1\}$-valued vector. The primary reason we choose to work with vectors over $\{-1, 1\}$ rather than $\{0, 1\}$ is that $\{-1, 1\}$-valued vectors always have squared Euclidean norm exactly equal to their dimension, in contrast to Boolean vectors, which can have squared Euclidean norm as low as 0 and as large as the dimension. Throughout, given a set $W$ of vectors and another vector $\mathbf{u}$ of the same dimension, we let $\mathrm{proj}_W(\mathbf{u})$ denote the projection of $\mathbf{u}$ onto the linear span of the vectors in $W$.
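A small sketch making the change of variables and the projection operator concrete (illustrative only; the projection helper assumes the spanning vectors are linearly independent):

    import numpy as np

    def to_pm1(v):
        """Maps a 0/1 vector to the corresponding +/-1 vector (0 -> -1, 1 -> +1)."""
        return 2 * np.asarray(v, dtype=float) - 1

    def proj_onto_span(W, u):
        """Projects u onto the span of the rows of W (assumed linearly independent),
        via an orthonormal basis obtained from a QR factorization."""
        Q, _ = np.linalg.qr(np.asarray(W, dtype=float).T)  # columns of Q: orthonormal basis
        return Q @ (Q.T @ u)

    # A +/-1 vector always has squared Euclidean norm equal to its dimension:
    v = np.array([1, 0, 1, 1])
    assert np.isclose(np.dot(to_pm1(v), to_pm1(v)), len(v))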

Note that for any Boolean vectors , . Hence, event from the statement of Sublemma ? occurs if and only if there exist coefficients such that . We bound the probability of this occurrence as follows.

Let . Note that

Combined with the previous paragraph, we see that event occurs if and only if

By equation , so it suffices to bound from above the probability that

Bounding the probability that Our analysis consists of two steps. In step 1, we bound the probability that , where is a vector chosen uniformly at random from , independently of . In step 2, we argue that, even though is a fixed vector (not chosen uniformly at random from ), it still holds that with exactly the same probability (where the probability is now only over the random choice of vectors ).

Step 1. Let denote the dimension of the linear subspace . Let denote an arbitrary orthonormal basis for . We have

where denotes the th entry of , and denotes the th entry of . For any , let denote the value of the sum . Since is a vector in an orthonormal basis, we know that .

Meanwhile, each is a Rademacher random variable. Because we can view and hence as fixed while is random, and because , standard concentration results for Rademacher sequences [13] imply that for and for all , , for all fixed , and hence even if is random. In particular, . A union bound over all implies that . In this event, i.e., for all , we can bound the right-hand side of equation (Equation 2) by
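For reference, the standard concentration result for Rademacher sequences being invoked here is Hoeffding’s inequality (stated in generic notation, not necessarily that of [13]): for any fixed real vector $\mathbf{a} = (a_1, \dots, a_d)$ and i.i.d. Rademacher variables $\varepsilon_1, \dots, \varepsilon_d \in \{-1, +1\}$,

\[
\Pr\left[\Big|\sum_{i=1}^{d} a_i \varepsilon_i\Big| \ge \lambda \,\|\mathbf{a}\|_2\right] \le 2 e^{-\lambda^2/2} \quad \text{for every } \lambda \ge 0.
\]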

Step 2. For vectors , let equal 1 if , and 0 otherwise, where . We claim that where the first expectation is over random choice of , and the second expectation is only over the random choice of .

For two vectors $\mathbf{a}$ and $\mathbf{b}$ of the same dimension, let $\mathbf{a} \circ \mathbf{b}$ denote the component-wise product of $\mathbf{a}$ and $\mathbf{b}$, i.e., $(\mathbf{a} \circ \mathbf{b})_i = a_i b_i$. Then we can write:

Here, the first equality is the definition of expectation; the second follows by rearranging the sum. The third equality holds because multiplying each of the vectors component-wise by a fixed $\{-1, 1\}$-valued vector $\mathbf{s}$ is the same as premultiplying each vector by the unitary matrix $\mathrm{diag}(\mathbf{s})$, and premultiplying by a unitary matrix just rotates the space, which doesn’t change lengths of projections. The fourth equality holds because, for any fixed $\mathbf{s}$, if the original vectors are uniformly distributed over $\{-1, 1\}$-valued vectors, then so are their component-wise products with $\mathbf{s}$. The final equality is the definition of expectation.
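A quick numerical sanity check of this unitary-invariance step (an illustration only, not part of the proof; names are ours):

    import numpy as np

    def proj_len(W, u):
        """Length of the projection of u onto the span of the rows of W."""
        Q, _ = np.linalg.qr(W.T)              # orthonormal basis of the span
        return np.linalg.norm(Q @ (Q.T @ u))

    # Three mutually orthogonal (hence independent) +/-1 spanning vectors.
    W = np.array([[ 1,  1,  1,  1, -1, -1, -1, -1],
                  [ 1, -1,  1, -1,  1, -1,  1, -1],
                  [ 1,  1, -1, -1,  1,  1, -1, -1]], dtype=float)
    rng = np.random.default_rng(0)
    u = rng.choice([-1.0, 1.0], size=8)       # vector being projected
    s = rng.choice([-1.0, 1.0], size=8)       # component-wise sign flips
    # Flipping signs component-wise is multiplication by the orthogonal matrix
    # diag(s), which preserves lengths of projections.
    assert np.isclose(proj_len(W, u), proj_len(W * s, u * s))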

Let be random subsets of , and be the indicator vector of . Consider any subset of of the vectors in which there is no such that both and are in . (There are exactly such subsets .) Then the vectors in are all uniformly random vectors in that are chosen independently of each other. Sublemma ? implies that the probability that there exist coefficients such that is at most .

We can take a union bound over all possible choices of to conclude that for all possible choices of the subset with probability at least . In this event, there is no -sparse linear combination of the and vectors in for , unless and both participate in the linear combination for some . That is, is a -useful set system, as desired.

This completes the proof of Lemma ?.

Full Proof of Proposition ?

Suppose that $\mathcal{A}$ is an algorithm that distinguishes the two cases appearing in the statement of Proposition ?. We show how to use $\mathcal{A}$ to design an efficient randomized algorithm for SAT.

MIP Notation. Our construction of assumes the existence of a perfectly complete multiprover interactive proof (MIP) with two provers and soundness error . Let denote the possible values of the MIP verifier’s private randomness, denote the set of possible queries that may be posed to the th prover, and denote the set of possible responses from the th prover. Denote by the verifier’s output given internal randomness and answers from the two provers, with an output of 1 being interpreted as the verifier’s accepting the provers’ answers as valid proofs that the input is satisfiable.

As in prior work extending Lund and Yannakakis’ methodology [3], we require the MIP to satisfy four additional properties, and we call an MIP canonical if all four additional properties hold. Specifically:

We will show how to turn any canonical MIP for into a randomized algorithm that accomplishes the following. Fix any function . Given any instance of size , outputs an Boolean matrix , where and are polynomial in the parameters . Moreover, satisfies the following two properties with high probability over the internal randomness of .

  • If , then there is a vector such that and .

  • If , then any such that satisfies , for some .

Before describing , we complete the proof of Proposition ?, assuming the existence of .

It is well-known that the standard PCP theorem can be combined with Raz’s parallel repetition theorem [17] to yield a canonical two-prover MIP for a instance of size with the following two properties, for any constant :

  • the soundness error is , and

  • are all of size .

See, e.g., [9]. Throughout the remainder of the proof, we choose to satisfy.

Hence, setting and , properties 1 and 2 above imply the following. Given an instance of of size , outputs an Boolean matrix , where and are , and satisfies the following properties.

  • If , then there is a vector such that and .

  • If , then any such that satisfies , where and and .

Suppose that there is a randomized algorithm that runs in and distinguishes the above two cases. By running on the matrix output by , we determine with high probability whether is satisfiable. Thus, if , no such algorithm can exist.

Proposition ? follows by recalling that , where is chosen such that ; thus .

We now turn to the description of the algorithm that generates the matrix $B$. Our description borrows many ideas from Lund and Yannakakis [12]; accordingly, our presentation closely follows that of [12].

Construction of . Let and let be arbitrary. Let be a -useful set system over a set , where , . Lemma ? guarantees can be output by an efficient randomized algorithm with high probability.

The matrix will have rows, and (which is quasipolynomial in ) columns. We associate each of the rows of with a point , where is a random seed to the verifier and is a point in the set . Likewise, we associate each of the columns of with a pair , for .

For and , let be the query that the verifier asks the th prover when using the random seed . Likewise, for each and each , let denote the unique answer such that , if such an answer exists, and be undefined otherwise. By functionality of the MIP, there exists at most one such answer.

Now we define the following sets, following [12]. For each pair , define:

Similarly, for each pair , define

Let denote the indicator vector of . We define to be the matrix whose columns are the vectors over , .

Here is an equivalent way to construct . Start with the zero matrix. For each , , let the block of be the submatrix corresponding to and . Similarly, for each , , let the block of be the submatrix corresponding to and .

Then, for each , do the following.

  1. For each , replace the 0-column corresponding to in the block by .

  2. For each such that is defined, replace the 0-column corresponding to in the block by ,

Note that for each , there is only one value of , namely, , and only one value of , namely, , such that the block of can be nonzero. Note also that a prover strategy corresponds to choosing, for each and each , exactly one column from the block of columns corresponding to .

Analysis of .

Assume that $\phi$ is satisfiable and consider a strategy of the provers that makes the verifier accept with probability 1. Such a strategy must exist, because the MIP has perfect completeness. For this strategy, for each prover $i \in \{1, 2\}$, let $a_i(q)$ denote the answer prover $i$ gives when given query $q$.

Consider any . Because on query , the responses from provers and induce acceptance, we have , and therefore

Furthermore, the only nonzero blocks corresponding to are those for and . It suffices to show that

since any other columns appearing in the sum, in the rows corresponding to , are all zero. (Here, for an -vector , denotes the -vector corresponding to block .) By the definition of , . In addition, . But by , we know that , and therefore .

Thus, equation represents as a linear combination (with coefficients in ) of exactly columns of , proving property 1.

Let $\mathbf{x}$ be any vector satisfying $\|B\mathbf{x} - \mathbf{1}\|_2^2 \le h(m, p)$. We claim that if $\mathbf{x}$ is sufficiently sparse, then we can transform $\mathbf{x}$ into a prover strategy causing the verifier to accept with probability larger than the soundness error, a contradiction.

We say that involves if . We will write to mean the same thing.

Given , and recalling that , we define a cost function via for and . Similarly, we define the cost of as . We say that is good if . Let denote the fraction of values that are good. We claim that we can transform into a prover strategy causing the verifier to accept with probability at least . To establish this, we require several lemmas.

For , let denote the matrix obtained by restricting to rows corresponding to pairs of the form for some . Similarly, let denote the subvector of whose th entry is , for each . Note that since , it also holds that

Since is good, the vector is a linear combination of at most columns of . To see this, observe that the definition of implies that the only non-zero columns of correspond to vectors of the form or for some or . Since is good,