Unbiased estimates for linear regression via volume sampling


Michał Dereziński
Department of Computer Science
University of California Santa Cruz
mderezin@ucsc.edu
&
Manfred K. Warmuth
Department of Computer Science
University of California Santa Cruz
manfred@ucsc.edu
Abstract

Given a full rank matrix $X$ with more columns than rows, consider the task of estimating the pseudo inverse $X^+$ based on the pseudo inverse of a sampled subset of columns (of size at least the number of rows). We show that this is possible if the subset of columns is chosen proportional to the squared volume spanned by the rows of the chosen submatrix (ie, volume sampling). The resulting estimator is unbiased and surprisingly the covariance of the estimator also has a closed form: it equals a specific factor times $X^{+\top}X^+$.

The pseudo inverse plays an important part in solving the linear least squares problem, where we try to predict a label for each column of $X$. We assume labels are expensive and we are only given the labels for the small subset of columns we sample from $X$. Using our methods we show that the weight vector of the solution for the sub problem is an unbiased estimator of the optimal solution for the whole problem based on all column labels.

We believe that these new formulas establish a fundamental connection between linear least squares and volume sampling. We use our methods to obtain an algorithm for volume sampling that is faster than the state of the art, and to obtain bounds for the total loss of the estimated least-squares solution on all labeled columns.


1 Introduction

Let $X$ be a wide full rank matrix with $d$ rows and $n$ columns, where $d \le n$. Our goal is to estimate the pseudo inverse $X^+$ of $X$ based on the pseudo inverse of a subset of its columns. More precisely, we sample a subset $S \subseteq \{1,\dots,n\}$ of $s$ column indices (where $s \ge d$). We let $X_S$ be the sub-matrix of the $s$ columns indexed by $S$ (see Figure 1). Consider a version of $X$ in which all but the columns of $S$ are zero. This matrix equals $X I_S$, where $I_S$ is an $n$-dimensional diagonal matrix with $(I_S)_{ii} = 1$ if $i \in S$ and 0 otherwise.

We assume that the set of $s$ column indices of $X_S$ is selected proportional to the squared volume spanned by the rows of the submatrix $X_S$, i.e. proportional to $\det(X_S X_S^\top)$, and prove a number of new surprising expectation formulas for this type of volume sampling, such as

$$\mathbb{E}\big[(X I_S)^+\big] = X^+ \qquad\text{and}\qquad \mathbb{E}\Big[\underbrace{(X_S X_S^\top)^{-1}}_{(X I_S)^{+\top}(X I_S)^+}\Big] = \frac{n-d+1}{s-d+1}\, X^{+\top} X^+.$$

Note that $(X I_S)^+$ has the shape of $X^+$, where the rows indexed by $S$ contain $X_S^+$ and the remaining rows are zero. The expectation of this matrix is $X^+$ even though $X_S^+$ is clearly not a sub-matrix of $X^+$. In addition to the expectation formulas, our new techniques lead to an efficient volume sampling procedure which beats the state-of-the-art by a factor of $n^2$ in time complexity.
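The two expectation formulas above can be checked numerically by exact enumeration over all subsets (a sanity check of our own, assuming NumPy; the helper name `padded_pinv` is ours for the map $S \mapsto (X I_S)^+$):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, n, s = 2, 5, 3
X = rng.standard_normal((d, n))  # wide full-rank matrix (almost surely)

# Volume sampling distribution over size-s subsets: P(S) proportional to det(X_S X_S^T).
subsets = list(itertools.combinations(range(n), s))
dets = np.array([np.linalg.det(X[:, S] @ X[:, S].T) for S in subsets])
probs = dets / dets.sum()

def padded_pinv(S):
    """(X I_S)^+: pseudo-inverse of X with all columns outside S zeroed."""
    XIS = np.zeros_like(X)
    XIS[:, list(S)] = X[:, list(S)]
    return np.linalg.pinv(XIS)  # n x d

E1 = sum(p * padded_pinv(S) for p, S in zip(probs, subsets))
E2 = sum(p * padded_pinv(S).T @ padded_pinv(S) for p, S in zip(probs, subsets))

Xp = np.linalg.pinv(X)
factor = (n - d + 1) / (s - d + 1)
print(np.allclose(E1, Xp))                  # unbiasedness: E[(X I_S)^+] = X^+
print(np.allclose(E2, factor * Xp.T @ Xp))  # second-moment formula
```

Both checks print `True` up to floating-point tolerance.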

Volume sampling is useful in numerous applications, from clustering to matrix approximation, but we focus on the task of solving linear least squares problems: For an $n$-dimensional label vector $y$, let $w^* = \operatorname{argmin}_w \|X^\top w - y\|^2 = X^{+\top} y$. Assume the entire design matrix $X$ is known to the learner but labels are expensive and you want to observe as few of them as possible. Let $w^*(S)$ be the solution to the sub-problem based on labels $y_i$ for $i \in S$. What is the smallest number of labels $s$ necessary, for which there is a sampling procedure on sets $S$ of size $s$ st the expected loss of $w^*(S)$ is at most a constant factor larger than the loss of $w^*$ that uses all labels (where the constant is independent of $n$)? More precisely, using the short hand $L(w) = \|X^\top w - y\|^2$ for the loss on all $n$ labels, what is the smallest size $s$ such that $\mathbb{E}[L(w^*(S))] \le \mathrm{const}\cdot L(w^*)$? This question is a version of the "minimal coresets" open problem posed in [3].

The size $s$ has to be at least $d$, and one can show that randomization is necessary, in that any deterministic algorithm for choosing a set of $d$ columns can suffer loss larger by a factor of $n$. Also any iid sampling of $S$ (such as the commonly used leverage scores [8]) requires at least $\Omega(d \log d)$ examples to achieve a finite factor. In this paper however we show that with a size $d$ volume sample, $\mathbb{E}[L(w^*(S))] = (d+1)\,L(w^*)$ if $X$ is in general position. Note again that we have equality and not just an upper bound. Also we can show that the multiplicative factor $d+1$ is optimal. We further improve this factor to $1+\epsilon$ via repeated volume sampling. Moreover, our expectation formulas imply that when $S$ is a size $s \ge d$ volume sample, then $w^*(S)$ is an unbiased estimator for $w^*$, ie $\mathbb{E}[w^*(S)] = w^*$.

2 Related Work

Volume sampling is an extension of a determinantal point process [15], which has been given a lot of attention in the literature with many applications to machine learning, including recommendation systems [10] and clustering [13]. Many exact and approximate methods for efficiently generating samples from this distribution have been proposed [6, 14], making it a useful tool in the design of randomized algorithms. Most of those methods focus on sampling $s \le d$ elements. In this paper, we study volume sampling sets of size $s \ge d$, which has been proposed in [1] and motivated with applications in graph theory, linear regression, matrix approximation and more. The only known polynomial time algorithm for size $s > d$ volume sampling was recently proposed in [16] with time complexity $O(n^4 s)$. We offer a new algorithm with runtime $O((n-s+d)\,nd)$, which is faster by a factor of at least $n^2$.

The problem of selecting a subset of input vectors for solving a linear regression task has been extensively studied in the statistics literature under the terms optimal design [9] and pool-based active learning [18]. Various criteria for subset selection have been proposed, like A-optimality and D-optimality. For example, A-optimality seeks to minimize $\operatorname{tr}\big((X_S X_S^\top)^{-1}\big)$, which is combinatorially hard to optimize exactly. We show that $\mathbb{E}\big[(X_S X_S^\top)^{-1}\big] = \frac{n-d+1}{s-d+1}(X X^\top)^{-1}$ for size $s$ volume sampling (for $s \ge d$), which provides an approximate randomized solution for this task.

A related task has been explored in the field of computational geometry, where efficient algorithms are sought for approximately solving linear regression and matrix approximation [17, 5, 3]. Here, subsampling appears as one of the key techniques for obtaining multiplicative bounds on the loss of the approximate solution. In this line of work, volume sampling of size $s \le d$ has been used by [7, 11] for matrix approximation. Another common sampling technique is based on statistical leverage scores [8], which have been effectively used for the task of linear regression. However, this approach is based on iid sampling, and requires sampling at least $\Omega(d \log d)$ elements to achieve multiplicative loss bounds. On the other hand, the input vectors obtained from volume sampling are selected jointly, which makes the chosen subset more informative, and we show that just $d$ volume sampled elements are sufficient to achieve a multiplicative bound.

3 Unbiased estimators

Let $n$ be an integer dimension. For each subset $S \subseteq \{1,\dots,n\}$ of size at least $s$ we are given a matrix formula $F(S)$. Our goal is to sample a set $S$ of size $s$ using some sampling process and then develop concise expressions for $\mathbb{E}[F(S)]$. Examples of formula classes $F(S)$ will be given below.

We represent the sampling by a directed acyclic graph (dag), with a single root node corresponding to the full set $\{1,\dots,n\}$. Starting from the root, we proceed along the edges of the graph, iteratively removing elements from the set $S$. Concretely, consider a dag with levels $n, n-1, \dots, s$. Level $k$ contains $\binom{n}{k}$ nodes for sets $S \subseteq \{1,\dots,n\}$ of size $k$. Every node $S$ at level $k > s$ has $k$ directed edges to the nodes $S-i$ for $i \in S$ at the next lower level. These edges are labeled with a conditional probability vector $P(S-i\,|\,S)$. The probability of a (directed) path is the product of the probabilities along its edges. The outflow of probability from each node on all but the bottom level is 1. We let the probability $P(S)$ of node $S$ be the probability of all paths from the top node to $S$, and set the probability of the top node to 1. We associate a formula $F(S)$ with each set node $S$ in the dag. The following key equality lets us compute expectations.

Lemma 1

If for all $S$ of size greater than $s$ we have

$$F(S) = \sum_{i \in S} P(S-i\,|\,S)\, F(S-i),$$

then for any level $k \in \{s,\dots,n\}$ of the dag:

$$\mathbb{E}\big[F(S)\big] = \sum_{S : |S| = k} P(S)\, F(S) = F(\{1,\dots,n\}).$$

Proof It suffices to show that expectations at successive layers are equal:

$$\sum_{S : |S| = k-1} P(S)\, F(S) = \sum_{S : |S| = k-1} \sum_{i \notin S} P(S\,|\,S+i)\, P(S+i)\, F(S) = \sum_{T : |T| = k} P(T) \sum_{i \in T} P(T-i\,|\,T)\, F(T-i) = \sum_{T : |T| = k} P(T)\, F(T).$$

3.1 Volume sampling

Given a wide full-rank matrix $X \in \mathbb{R}^{d \times n}$ and a sample size $s \in \{d,\dots,n\}$, volume sampling chooses subset $S \subseteq \{1,\dots,n\}$ of size $s$ with probability proportional to the volume spanned by the rows of submatrix $X_S$, ie proportional to $\det(X_S X_S^\top)$. The following corollary uses the above dag setup to compute the normalization constant for this distribution. When $s = d$, the corollary provides a novel minimalist proof for the Cauchy-Binet formula: $\sum_{S : |S| = d} \det(X_S X_S^\top) = \sum_{S : |S| = d} \det(X_S)^2 = \det(X X^\top)$.
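The $s = d$ case of this normalization, the Cauchy-Binet formula, is easy to confirm numerically (a check of our own, assuming NumPy):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 6
X = rng.standard_normal((d, n))

# Cauchy-Binet: det(X X^T) equals the sum of det(X_S)^2 over all size-d subsets.
lhs = np.linalg.det(X @ X.T)
rhs = sum(np.linalg.det(X[:, S]) ** 2 for S in itertools.combinations(range(n), d))
print(np.isclose(lhs, rhs))
```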

Corollary 2

Let $X \in \mathbb{R}^{d \times n}$ and $S \subseteq \{1,\dots,n\}$ of size $s$ st $d < s \le n$. Then for any $i \in S$, define

$$P(S-i\,|\,S) := \frac{\det(X_{S-i} X_{S-i}^\top)}{(s-d)\,\det(X_S X_S^\top)} = \frac{1 - x_i^\top (X_S X_S^\top)^{-1} x_i}{s-d}, \qquad \text{(reverse iterative volume sampling)}$$

where $x_i$ is the $i$th column of $X$ and $X_S$ is the sub matrix of columns indexed by $S$. Then $P(\cdot\,|\,S)$ is a proper probability distribution, and thus $\sum_{S : |S| = s} P(S) = 1$ for all $s \in \{d,\dots,n\}$. Furthermore

$$P(S) = \frac{\det(X_S X_S^\top)}{\binom{n-d}{s-d}\,\det(X X^\top)}. \qquad \text{(volume sampling)}$$

Proof For any $S$ st $d < |S| = s \le n$, it is easy to see that $P(\cdot\,|\,S)$ forms a probability vector:

$$\sum_{i \in S} P(S-i\,|\,S) = \sum_{i \in S} \frac{1 - \operatorname{tr}\big((X_S X_S^\top)^{-1} x_i x_i^\top\big)}{s-d} = \frac{s - \operatorname{tr}\big((X_S X_S^\top)^{-1} X_S X_S^\top\big)}{s-d} = \frac{s-d}{s-d} = 1.$$

It remains to show the formula for the probability $P(S)$ of all paths ending at node $S$. We first consider the top node, ie $S = \{1,\dots,n\}$. In this case both the path definition and the formula are 1. For all but the top node, the probability equals the total inflow of probability into that node from the previous level, ie

$$P(S) = \sum_{i \notin S} P(S\,|\,S+i)\, P(S+i) = \sum_{i \notin S} \frac{\det(X_S X_S^\top)}{(s+1-d)\,\det(X_{S+i} X_{S+i}^\top)} \cdot \frac{\det(X_{S+i} X_{S+i}^\top)}{\binom{n-d}{s+1-d}\,\det(X X^\top)} = \frac{(n-s)\,\det(X_S X_S^\top)}{(s+1-d)\binom{n-d}{s+1-d}\,\det(X X^\top)} = \frac{\det(X_S X_S^\top)}{\binom{n-d}{s-d}\,\det(X X^\top)}.$$

Note that all paths from $\{1,\dots,n\}$ to a subset $S$ have the same probability because the ratios of determinants cancel along paths. It is easy to verify that this probability is $\frac{(s-d)!}{(n-d)!}\,\frac{\det(X_S X_S^\top)}{\det(X X^\top)}$. Note that $\frac{0}{0}$ ratios are avoided because paths with such ratios always lead to sets of probability 0.
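The dag recursion and the closed form of Corollary 2 can be compared directly on a small instance (an illustrative check of our own, assuming NumPy; probabilities are propagated exactly level by level, not by Monte Carlo):

```python
import itertools
from math import comb
import numpy as np

rng = np.random.default_rng(2)
d, n, s = 2, 5, 3
X = rng.standard_normal((d, n))
vol = lambda S: np.linalg.det(X[:, list(S)] @ X[:, list(S)].T)

# Propagate path probabilities down the dag, one level at a time.
P = {tuple(range(n)): 1.0}
for size in range(n, s, -1):
    nxt = {}
    for S, pS in P.items():
        for i in S:
            Smi = tuple(j for j in S if j != i)
            cond = vol(Smi) / ((size - d) * vol(S))  # P(S-i | S)
            nxt[Smi] = nxt.get(Smi, 0.0) + pS * cond
    P = nxt

# Closed-form volume sampling probabilities from Corollary 2.
closed = {S: vol(S) / (comb(n - d, s - d) * np.linalg.det(X @ X.T))
          for S in itertools.combinations(range(n), s)}
print(all(np.isclose(P[S], closed[S]) for S in closed))
```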

3.2 Expectation formulas for volume sampling

All expectations in the remainder of the paper are wrt volume sampling. We use the short hand $\mathbb{E}[F(S)]$ for expectation with volume sampling where the size of the sampled set is fixed to $s$. The expectation formulas for two choices of $F(S)$ are proven in the next two theorems. By Lemma 1 it suffices to show $F(S) = \sum_{i \in S} P(S-i\,|\,S)\, F(S-i)$ for volume sampling.

We introduce a bit more notation first. Recall that $X_S$ is the sub matrix of columns indexed by $S \subseteq \{1,\dots,n\}$ (see Figure 1). Consider a version of $X$ in which all but the columns of $S$ are zero. This matrix equals $X I_S$, where $I_S$ is an $n$-dimensional diagonal matrix with $(I_S)_{ii} = 1$ if $i \in S$ and 0 otherwise.

Theorem 3

Let $X \in \mathbb{R}^{d \times n}$ be a wide full rank matrix (ie $d \le n$). For $d \le s \le n$, let $S$ be a size $s$ volume sampled set over $\{1,\dots,n\}$. Then

$$\mathbb{E}\big[(X I_S)^+\big] = X^+.$$

We believe that this fundamental formula lies at the core of why volume sampling is important in many applications. In this work, we focus on its application to linear regression. However, [1] discuss many problems where controlling the pseudo-inverse of a submatrix is essential. For those applications, it is important to establish variance bounds for the estimator offered by Theorem 3. In this case, volume sampling once again offers very concrete guarantees. We obtain them by showing the following formula, which can be viewed as a second moment for this estimator.

Theorem 4

Let $X \in \mathbb{R}^{d \times n}$ be a full-rank matrix and $d \le s \le n$. If size $s$ volume sampling over $\{1,\dots,n\}$ has full support, then

$$\mathbb{E}\Big[\underbrace{(X_S X_S^\top)^{-1}}_{(X I_S)^{+\top}(X I_S)^+}\Big] = \frac{n-d+1}{s-d+1}\,\underbrace{(X X^\top)^{-1}}_{X^{+\top} X^+}.$$

If volume sampling does not have full support then the matrix equality "$=$" is replaced by the positive-definite inequality "$\preceq$".

The condition that size $s$ volume sampling over $\{1,\dots,n\}$ has full support is equivalent to $\det(X_S X_S^\top) > 0$ for all $S$ of size $s$. Note that if size $s$ volume sampling has full support, then size $s+1$ also has full support. So full support for the smallest size $s = d$ (often phrased as $X$ being in general position) implies that volume sampling wrt any size $s \ge d$ has full support.

Surprisingly, by combining Theorems 3 and 4, we can obtain a "covariance type formula" for the pseudo-inverse matrix estimator:

$$\mathbb{E}\Big[\big((X I_S)^+ - \mathbb{E}[(X I_S)^+]\big)^\top \big((X I_S)^+ - \mathbb{E}[(X I_S)^+]\big)\Big] = \mathbb{E}\big[(X I_S)^{+\top}(X I_S)^+\big] - \mathbb{E}[(X I_S)^+]^\top\, \mathbb{E}[(X I_S)^+] = \frac{n-d+1}{s-d+1}\, X^{+\top} X^+ - X^{+\top} X^+ = \frac{n-s}{s-d+1}\, X^{+\top} X^+. \qquad (1)$$

Theorem 4 can also be used to obtain an expectation formula for the Frobenius norm of the estimator:

$$\mathbb{E}\big[\|(X I_S)^+\|_F^2\big] = \mathbb{E}\big[\operatorname{tr}\big((X I_S)^{+\top}(X I_S)^+\big)\big] = \frac{n-d+1}{s-d+1}\,\|X^+\|_F^2. \qquad (2)$$

This norm formula has been shown in [1], with numerous applications. Theorem 4 can be viewed as a much stronger pre-trace version of the norm formula. Also our proof techniques are quite different and much simpler. Note that if size $s$ volume sampling for $s \ge d$ does not have full support, then (1) becomes a semi-definite inequality $\preceq$ between matrices and (2) an inequality $\le$ between numbers.
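The Frobenius norm formula (2) can likewise be verified by exact enumeration (a sketch of our own, assuming NumPy; `padded_pinv` is our hypothetical helper computing $(X I_S)^+$):

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
d, n, s = 2, 6, 4
X = rng.standard_normal((d, n))

subsets = list(itertools.combinations(range(n), s))
dets = np.array([np.linalg.det(X[:, S] @ X[:, S].T) for S in subsets])
probs = dets / dets.sum()  # volume sampling distribution

def padded_pinv(S):
    """(X I_S)^+: pseudo-inverse of X with columns outside S zeroed."""
    XIS = np.zeros_like(X)
    XIS[:, list(S)] = X[:, list(S)]
    return np.linalg.pinv(XIS)

# np.linalg.norm on a matrix defaults to the Frobenius norm.
E_sq_norm = sum(p * np.linalg.norm(padded_pinv(S)) ** 2 for p, S in zip(probs, subsets))
expected = (n - d + 1) / (s - d + 1) * np.linalg.norm(np.linalg.pinv(X)) ** 2
print(np.isclose(E_sq_norm, expected))
```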

Proof of Theorem 3 We apply Lemma 1 with $F(S) = (X I_S)^+$. It suffices to show the recurrence $F(S) = \sum_{i \in S} P(S-i\,|\,S)\, F(S-i)$ for sets $S$ of size greater than $d$, ie:

$$(X I_S)^+ = \sum_{i \in S} \frac{1 - x_i^\top (X_S X_S^\top)^{-1} x_i}{s-d}\, \underbrace{(X I_{S-i})^\top (X_{S-i} X_{S-i}^\top)^{-1}}_{(X I_{S-i})^+}.$$

Proven by applying Sherman-Morrison to $(X_{S-i} X_{S-i}^\top)^{-1}$ on the rhs:

$$\sum_{i \in S} \frac{1 - x_i^\top (X_S X_S^\top)^{-1} x_i}{s-d}\, \big((X I_S)^\top - e_i x_i^\top\big) \left((X_S X_S^\top)^{-1} + \frac{(X_S X_S^\top)^{-1} x_i x_i^\top (X_S X_S^\top)^{-1}}{1 - x_i^\top (X_S X_S^\top)^{-1} x_i}\right).$$

We now expand the last two factors into 4 terms. The sum over $i \in S$ of the first term times $\frac{1 - x_i^\top (X_S X_S^\top)^{-1} x_i}{s-d}$ is $(X I_S)^\top (X_S X_S^\top)^{-1} = (X I_S)^+$ (which is the lhs, since $\sum_{i \in S} x_i^\top (X_S X_S^\top)^{-1} x_i = d$), and the remaining three terms, summed over $i \in S$ with the same weights and multiplied by $s-d$, sum to 0:

$$\underbrace{-\sum_{i \in S} \big(1 - x_i^\top (X_S X_S^\top)^{-1} x_i\big)\, e_i x_i^\top (X_S X_S^\top)^{-1} - \sum_{i \in S} x_i^\top (X_S X_S^\top)^{-1} x_i\, e_i x_i^\top (X_S X_S^\top)^{-1}}_{-(X I_S)^\top (X_S X_S^\top)^{-1}} + \underbrace{(X I_S)^\top (X_S X_S^\top)^{-1} \sum_{i \in S} x_i x_i^\top (X_S X_S^\top)^{-1}}_{(X I_S)^\top (X_S X_S^\top)^{-1}} = 0.$$

Proof of Theorem 4 Choose $F(S) = \frac{|S|-d+1}{n-d+1}(X_S X_S^\top)^{-1}$. By Lemma 1 it suffices to show $F(S) = \sum_{i \in S} P(S-i\,|\,S)\, F(S-i)$ for volume sampling:

$$\frac{s-d+1}{n-d+1}(X_S X_S^\top)^{-1} = \sum_{i \in S} \frac{1 - x_i^\top (X_S X_S^\top)^{-1} x_i}{s-d} \cdot \frac{s-d}{n-d+1}(X_{S-i} X_{S-i}^\top)^{-1}.$$

To show this we cancel the common factor $\frac{1}{n-d+1}$ and apply Sherman-Morrison to $(X_{S-i} X_{S-i}^\top)^{-1}$ on the rhs:

$$\sum_{i \in S} \big(1 - x_i^\top (X_S X_S^\top)^{-1} x_i\big) \left((X_S X_S^\top)^{-1} + \frac{(X_S X_S^\top)^{-1} x_i x_i^\top (X_S X_S^\top)^{-1}}{1 - x_i^\top (X_S X_S^\top)^{-1} x_i}\right) = (s-d)(X_S X_S^\top)^{-1} + (X_S X_S^\top)^{-1} = (s-d+1)(X_S X_S^\top)^{-1}.$$

If some of the denominators $1 - x_i^\top (X_S X_S^\top)^{-1} x_i$ are zero, then we only sum over the $i$ for which the denominators are positive. In this case the above matrix equality becomes a positive-definite inequality $\succeq$ (the lhs dominating the rhs).

4 Linear regression with few labels

Our main motivation for studying volume sampling came from asking the following simple question. Suppose we want to solve a $d$-dimensional linear regression problem with a matrix $X \in \mathbb{R}^{d \times n}$ of input column vectors and a label vector $y \in \mathbb{R}^n$, ie find $w \in \mathbb{R}^d$ that minimizes the least squares loss $L(w) = \sum_{i=1}^n (x_i^\top w - y_i)^2$:

$$w^* \stackrel{\text{def}}{=} \operatorname*{argmin}_{w \in \mathbb{R}^d} L(w) = X^{+\top} y,$$

but the access to label vector $y$ is restricted. We are allowed to pick a subset $S \subseteq \{1,\dots,n\}$ for which the labels $y_i$ (where $i \in S$) are revealed to us, and then solve the subproblem $(X_S, y_S)$, obtaining $w^*(S)$. What is the smallest number of labels $|S|$ such that for any $X$, we can find $S$ for which $\mathbb{E}[L(w^*(S))]$ is only a multiplicative factor away from $L(w^*)$ (independent of the number of input vectors $n$)? This question was posed as an open problem by [3]. It is easy to show that we need at least $d$ labels (when $X$ is full-rank), so as to guarantee the uniqueness of solution $w^*$. We use volume sampling to show that $d$ labels are in fact sufficient (proof in Section 4.1).

Theorem 5

If the input matrix $X \in \mathbb{R}^{d \times n}$ is in general position, then for any label vector $y \in \mathbb{R}^n$, the expected square loss (on all $n$ labeled vectors) of the optimal solution $w^*(S)$ for the subproblem $(X_S, y_S)$, with the $d$-element set $S$ obtained from volume sampling, is given by

$$\mathbb{E}\big[L(w^*(S))\big] = (d+1)\, L(w^*).$$

If $X$ is not in general position, then the expected loss is upper-bounded by $(d+1)\, L(w^*)$.
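Theorem 5, together with the unbiasedness $\mathbb{E}[w^*(S)] = w^*$, can be verified by exact enumeration over all size-$d$ subsets (a numerical sanity check of our own, assuming NumPy; `w_sub` is our name for the map $S \mapsto w^*(S)$):

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
d, n = 3, 7
X = rng.standard_normal((d, n))  # almost surely in general position
y = rng.standard_normal(n)

loss = lambda w: np.sum((X.T @ w - y) ** 2)
w_star = np.linalg.pinv(X).T @ y  # optimal solution w* = X^{+T} y

subsets = list(itertools.combinations(range(n), d))
dets = np.array([np.linalg.det(X[:, S] @ X[:, S].T) for S in subsets])
probs = dets / dets.sum()         # size-d volume sampling

def w_sub(S):
    """Least squares solution of the subproblem (X_S, y_S)."""
    S = list(S)
    return np.linalg.solve(X[:, S].T, y[S])  # X_S is square for |S| = d

E_loss = sum(p * loss(w_sub(S)) for p, S in zip(probs, subsets))
E_w = sum(p * w_sub(S) for p, S in zip(probs, subsets))
print(np.isclose(E_loss, (d + 1) * loss(w_star)))  # expected loss formula
print(np.allclose(E_w, w_star))                    # unbiasedness
```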

The factor $d+1$ cannot be improved when selecting only $d$ labels (we omit the proof):

Proposition 6

For any $d$, there exists a least squares problem $(X, y)$ with $d+1$ vectors in $\mathbb{R}^d$ such that for every $d$-element index set $S$, we have

$$L(w^*(S)) = (d+1)\, L(w^*).$$

Note that the multiplicative factor in Theorem 5 does not depend on $n$. It is easy to see that this cannot be achieved by any deterministic algorithm (without access to the labels). Namely, suppose that $d = 1$ and $X$ is a vector of all ones, whereas the label vector $y$ is a vector of all ones except for a single zero. No matter which column index we choose deterministically, if that index corresponds to the label $0$, the solution to the subproblem will incur loss larger than $L(w^*)$ by a factor of $n$. The fact that volume sampling is a joint distribution also plays an essential role in proving Theorem 5. Consider a matrix $X$ with exactly $d$ unique linearly independent columns (and an arbitrary number of duplicates). Any iid column sampling distribution (like for example leverage score sampling) will require $\Omega(d \log d)$ samples to retrieve all $d$ unique columns (ie the coupon collector problem), which is necessary to get any multiplicative loss bound.

The exact expectation formula for the least squares loss under volume sampling suggests a deep connection between linear regression and this distribution. We can use Theorem 3 to further strengthen that connection. Note that the least squares estimator obtained through volume sampling can be written as $w^*(S) = (X I_S)^{+\top} y$. Applying the formula for the expectation of the pseudo-inverse, we conclude that $w^*(S)$ is an unbiased estimator of $w^*$.

Proposition 7

Let $X \in \mathbb{R}^{d \times n}$ be a full-rank matrix and $d \le s \le n$. Let $S$ be a size $s$ volume sampled set over $\{1,\dots,n\}$. Then, for an arbitrary label vector $y \in \mathbb{R}^n$, we have

$$\mathbb{E}\big[w^*(S)\big] = \mathbb{E}\big[(X I_S)^{+\top} y\big] = X^{+\top} y = w^*.$$

For size $s = d$ volume sampling, the fact that $\mathbb{E}[w^*(S)]$ equals $w^*$ can be found in an old paper [2]. They give a direct proof based on Cramer's rule. For us the above proposition is a direct consequence of the matrix expectation formula given in Theorem 3, which holds for volume sampling of any size $s \ge d$. In contrast, the loss expectation formula of Theorem 5 is limited to sampling of size $s = d$. Bounding the loss expectation for $s > d$ remains an open problem. However, we consider a different strategy for extending volume sampling in linear regression. Combining Proposition 7 with Theorem 5 we can compute the variance of predictions generated by volume sampling, and obtain tighter multiplicative loss bounds by sampling multiple $d$-element subsets independently.

Theorem 8

Let $X$ be as in Theorem 5. For $k$ independent size $d$ volume samples $S_1, \dots, S_k$,

$$\mathbb{E}\Big[L\Big(\frac{1}{k}\sum_{j=1}^k w^*(S_j)\Big)\Big] = \Big(1 + \frac{d}{k}\Big)\, L(w^*).$$

Proof Denote $\hat{y}(S) = X^\top w^*(S)$ and $\hat{y} = X^\top w^*$ as the predictions generated by $w^*(S)$ and $w^*$ respectively. We perform a bias-variance decomposition of the loss of $w^*(S)$ (for size $d$ volume sampling):

$$\mathbb{E}\big[L(w^*(S))\big] = \mathbb{E}\big[\|\hat{y}(S) - y\|^2\big] = \mathbb{E}\big[\|\hat{y}(S) - \hat{y} + \hat{y} - y\|^2\big] = \mathbb{E}\big[\|\hat{y}(S) - \hat{y}\|^2\big] + \mathbb{E}\big[2\,(\hat{y}(S) - \hat{y})^\top (\hat{y} - y)\big] + \|\hat{y} - y\|^2 \stackrel{(*)}{=} \sum_{i=1}^n \mathbb{E}\big[(\hat{y}(S)_i - \mathbb{E}[\hat{y}(S)_i])^2\big] + L(w^*) = \sum_{i=1}^n \operatorname{Var}\big[\hat{y}(S)_i\big] + L(w^*),$$

where $(*)$ follows from Theorem 3 (the estimator is unbiased, so the cross term vanishes). Now, we use Theorem 5 to obtain the total variance of the predictions:

$$\sum_{i=1}^n \operatorname{Var}\big[\hat{y}(S)_i\big] = \mathbb{E}\big[L(w^*(S))\big] - L(w^*) = d\, L(w^*).$$

Now the expected loss of the average weight vector wrt sampling $k$ independent sets $S_1, \dots, S_k$ is:

$$\mathbb{E}\Big[L\Big(\frac{1}{k}\sum_{j=1}^k w^*(S_j)\Big)\Big] = \sum_{i=1}^n \operatorname{Var}\Big[\frac{1}{k}\sum_{j=1}^k \hat{y}(S_j)_i\Big] + L(w^*) = \frac{1}{k}\sum_{i=1}^n \operatorname{Var}\big[\hat{y}(S)_i\big] + L(w^*) = \Big(1 + \frac{d}{k}\Big)\, L(w^*).$$

It is worth noting that the average weight vector used in Theorem 8 is not expected to perform better than taking the solution to the joint subproblem $(X_{S_{1..k}}, y_{S_{1..k}})$, where $S_{1..k} = S_1 \cup \dots \cup S_k$. However, theoretical guarantees for that case are not yet available.
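For small problems, Theorem 8 can be checked by exact enumeration over all pairs of independent size-$d$ volume samples (an illustrative check of our own for $k = 2$, assuming NumPy):

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
d, n, k = 2, 5, 2
X = rng.standard_normal((d, n))
y = rng.standard_normal(n)

loss = lambda w: np.sum((X.T @ w - y) ** 2)
w_star = np.linalg.pinv(X).T @ y

subsets = list(itertools.combinations(range(n), d))
dets = np.array([np.linalg.det(X[:, S] @ X[:, S].T) for S in subsets])
probs = dets / dets.sum()
ws = [np.linalg.solve(X[:, list(S)].T, y[list(S)]) for S in subsets]

# Exact expectation over k = 2 independent volume samples of the averaged solution.
E_loss = sum(p1 * p2 * loss((w1 + w2) / 2)
             for p1, w1 in zip(probs, ws)
             for p2, w2 in zip(probs, ws))
print(np.isclose(E_loss, (1 + d / k) * loss(w_star)))
```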

4.1 Proof of Theorem 5

We use the following lemma regarding the leave-one-out loss for linear regression [4]:

Lemma 9

Let $w^*(-i)$ denote the least squares solution for the problem $(X_{-i}, y_{-i})$. Then, we have

$$L(w^*) = L(w^*(-i)) - x_i^\top (X X^\top)^{-1} x_i\, \ell_i(w^*(-i)), \qquad \text{where } \ell_i(w) \stackrel{\text{def}}{=} (x_i^\top w - y_i)^2.$$

When $X$ has $d+1$ columns and $X_{-i}$ is a full-rank $d \times d$ matrix, then $L(w^*(-i)) = \ell_i(w^*(-i))$ and Lemma 9 leads to the following:

$$\det(\tilde{X}\tilde{X}^\top) \overset{(1)}{=} \det(X X^\top)\,\underbrace{L(w^*)}_{\|\hat{y} - y\|^2} \overset{(2)}{=} \det(X X^\top)\big(1 - x_i^\top (X X^\top)^{-1} x_i\big)\,\ell_i(w^*(-i)) \overset{(3)}{=} \det(X_{-i} X_{-i}^\top)\,\ell_i(w^*(-i)), \quad \text{where } \tilde{X} = \begin{pmatrix} X \\ y^\top \end{pmatrix}. \qquad (3)$$

where (1) is the "base $\times$ height" formula for the volume, (2) follows from Lemma 9 and (3) follows from a standard determinant formula. Returning to the proof, our goal is to find the expected loss $\mathbb{E}[L(w^*(S))]$, where $S$ is a size $d$ volume sampled set. First, we rewrite the expectation as follows:

$$\mathbb{E}\big[L(w^*(S))\big] = \sum_{S, |S| = d} P(S)\, L(w^*(S)) = \sum_{S, |S| = d} P(S) \sum_{j=1}^n \ell_j(w^*(S)) = \sum_{S, |S| = d} \sum_{j \notin S} P(S)\, \ell_j(w^*(S)) = \sum_{T, |T| = d+1} \sum_{j \in T} P(T-j)\, \ell_j(w^*(T-j)). \qquad (4)$$

We now use (3) on the matrix $X_T$ and test instance $x_j$ (assuming $\det(X_{T-j} X_{T-j}^\top) > 0$):

$$P(T-j)\, \ell_j(w^*(T-j)) = \frac{\det(X_{T-j} X_{T-j}^\top)}{\det(X X^\top)}\, \ell_j(w^*(T-j)) = \frac{\det(\tilde{X}_T \tilde{X}_T^\top)}{\det(X X^\top)}. \qquad (5)$$

Since the summand does not depend on the index $j \in T$, the inner summation in (4) becomes a multiplication by $d+1$. This lets us write the expected loss as:

$$\mathbb{E}\big[L(w^*(S))\big] = \frac{d+1}{\det(X X^\top)} \sum_{T, |T| = d+1} \det(\tilde{X}_T \tilde{X}_T^\top) \overset{(1)}{=} (d+1)\,\frac{\det(\tilde{X}\tilde{X}^\top)}{\det(X X^\top)} \overset{(2)}{=} (d+1)\, L(w^*), \qquad (6)$$

where (1) follows from the Cauchy-Binet formula and (2) is an application of the "base $\times$ height" formula. If $X$ is not in general position, then for some summands in (5), $\det(X_{T-j} X_{T-j}^\top) = 0$ and thus $P(T-j) = 0$. Thus the left-hand side of (5) is 0, while the right-hand side is non-negative, so (6) becomes an inequality "$\le$", completing the proof of Theorem 5.
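The "base $\times$ height" identity used in steps (1) and (2), $\det(\tilde{X}\tilde{X}^\top) = \det(X X^\top)\, L(w^*)$, can be confirmed numerically (a check of our own, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 3, 8
X = rng.standard_normal((d, n))
y = rng.standard_normal(n)

w_star = np.linalg.pinv(X).T @ y
L_star = np.sum((X.T @ w_star - y) ** 2)

# "Base x height": appending y^T as an extra row multiplies the squared
# volume of the row span by the squared distance of y from that span, L(w*).
Xt = np.vstack([X, y])
lhs = np.linalg.det(Xt @ Xt.T)
rhs = np.linalg.det(X @ X.T) * L_star
print(np.isclose(lhs, rhs))
```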

5 Efficient algorithm for volume sampling

In this section we propose an algorithm for efficiently performing exact volume sampling for any $s \ge d$. This addresses the question posed by [1], asking for a polynomial-time algorithm for the case when $s > d$. [6, 11] gave an algorithm for the case when $s = d$, which runs in time $O(nd^3)$. Recently, [16] offered an algorithm for arbitrary $s$, which has complexity $O(n^4 s)$. We propose a new method, which uses our techniques to achieve the time complexity $O((n-s+d)\,nd)$, a direct improvement over [16] by a factor of at least $n^2$. Our algorithm also offers an improvement for $s = d$ in certain regimes. Namely, when $n = o(d^2)$, then our algorithm runs in time $O(n^2 d)$, faster than the method proposed by [6].

Our algorithm implements the reverse iterative sampling from Corollary 2. After removing $n - s$ columns, we are left with an index set of size $s$ that is distributed according to volume sampling for column set size $s$.
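A direct, unoptimized sketch of this reverse iterative sampler is below (our own illustration, assuming NumPy; it recomputes the removal probabilities naively and uses a Sherman-Morrison downdate for $(X_S X_S^\top)^{-1}$, so it does not attain the fast runtime claimed for the paper's algorithm):

```python
import numpy as np

def volume_sample(X, s, rng=None):
    """Reverse iterative volume sampling (Corollary 2): starting from all n
    column indices, remove one at a time with probability proportional to
    1 - x_i^T (X_S X_S^T)^{-1} x_i, until s indices remain."""
    if rng is None:
        rng = np.random.default_rng()
    d, n = X.shape
    S = list(range(n))
    A_inv = np.linalg.inv(X @ X.T)  # (X_S X_S^T)^{-1} for the full set
    while len(S) > s:
        h = np.array([X[:, i] @ A_inv @ X[:, i] for i in S])
        p = np.clip(1.0 - h, 0.0, None)  # clip tiny negative rounding errors
        p /= p.sum()                     # equals (1 - h_i) / (|S| - d)
        j = rng.choice(len(S), p=p)
        xi = X[:, S[j]]
        u = A_inv @ xi                   # Sherman-Morrison downdate after removal
        A_inv += np.outer(u, u) / (1.0 - h[j])
        del S[j]
    return S
```

With full support, the returned set of size $s$ is distributed exactly as in Corollary 2; an index with $1 - h_i = 0$ is never selected since its removal probability is 0.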

Theorem 10

The sampling algorithm runs in time $O((n-s+d)\,nd)$.