
# Reverse iterative volume sampling for linear regression

This paper is an expanded version of two conference papers (Dereziński and Warmuth, 2017, 2018).

Michał Dereziński, mderezin@ucsc.edu
Manfred K. Warmuth, manfred@ucsc.edu
Department of Computer Science, University of California Santa Cruz
###### Abstract

We study the following basic machine learning task: Given a fixed set of n input points in R^d for a linear regression problem, we wish to predict a hidden response value for each of the points. We can only afford to attain the responses for a small subset of the points, which are then used to construct linear predictions for all points in the dataset. The performance of the predictions is evaluated by the total square loss on all responses (the attained as well as the hidden ones). We show that a good approximate solution to this least squares problem can be obtained from just d (the dimension) many responses by using a joint sampling technique called volume sampling. Moreover, the least squares solution obtained for the volume sampled subproblem is an unbiased estimator of the optimal solution based on all n responses. This unbiasedness is a desirable property that is not shared by other common subset selection techniques.

Motivated by these basic properties, we develop a theoretical framework for studying volume sampling, resulting in a number of new matrix expectation equalities and statistical guarantees which are of importance not only to least squares regression but also to numerical linear algebra in general. Our methods also lead to a regularized variant of volume sampling, and we propose the first efficient algorithms for volume sampling which make this technique a practical tool in the machine learning toolbox. Finally, we provide experimental evidence which confirms our theoretical findings.


Keywords: Volume sampling, linear regression, row sampling, active learning, optimal design.

## 1 Introduction

As an introductory case, consider linear regression in one dimension. We are given n points x_1, …, x_n ∈ R. Each point x_i has a hidden real response (or target value) y_i. Assume that obtaining the responses is expensive and the learner can afford to request the responses for only a small number of indices i ∈ {1, …, n}. After receiving the requested responses, the learner determines an approximate linear least squares solution; in the one-dimensional case, this is just a single weight w. How many response values does the learner need to request so that the total square loss of its approximate solution on all n points is “close” to the total loss of the optimal linear least squares solution w∗ found with the knowledge of all n responses? We will show here that just one response suffices if the index i is chosen with probability proportional to x_i^2. When the learner uses the approximate solution w^(i) = y_i / x_i, then its expected loss equals 2 times the loss of the optimum w∗ that is computed based on all n responses (see Figure 1.1). Moreover, the approximate solution w^(i) is an unbiased estimator for the optimum w∗:

 E[w^(i)] = w∗,  when P(i) ∝ x_i^2.
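Both one-dimensional claims are finite sums over the n indices, so they can be checked exactly without any sampling. The following sketch (illustrative data and variable names, not code from the paper) verifies them numerically:

```python
import numpy as np

# Exact check of the 1-d claims: E[y_i/x_i] = w*, and expected loss = 2 L(w*).
rng = np.random.default_rng(0)
n = 6
x = rng.standard_normal(n)
y = rng.standard_normal(n)

p = x**2 / np.sum(x**2)                 # P(i) proportional to x_i^2
w_star = np.dot(x, y) / np.dot(x, x)    # least squares optimum on all n points

def loss(w):
    return np.sum((x * w - y) ** 2)

# Unbiasedness: the exact sum over all i equals w*
assert np.isclose(np.sum(p * y / x), w_star)

# Expected loss is exactly 2 L(w*)
E_loss = np.sum(p * np.array([loss(y[i] / x[i]) for i in range(n)]))
assert np.isclose(E_loss, 2 * loss(w_star))
```

The exactness (rather than approximate agreement) reflects that both statements are identities over the n-term expectation, not concentration results.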

We will extend these formulas to higher dimensions and to sampling more responses by making use of a joint sampling distribution called volume sampling. We summarize our contributions in the next four subsections.

#### Least squares with d (dimension) many responses

Consider the case when the points x_i lie in R^d. Let X ∈ R^{n×d} denote the matrix that has the transposed points x_i^⊤ as rows, and let y ∈ R^n be the vector of responses. Now the goal is to minimize the (total) square loss

 L(w) = ∑_{i=1}^n (x_i^⊤ w − y_i)^2 = ∥Xw − y∥^2,

over all linear weight vectors w ∈ R^d. Let w∗ denote the optimal such weight vector. We want to minimize the square loss based on a small number of responses attained for a subset of rows. Again, the learner is initially given the fixed set of n rows of X (i.e. fixed design), but none of the responses. It is then allowed to choose a random subset of indices S ⊆ {1, …, n}, and obtains the responses for the corresponding rows. The learner proceeds to find the optimal linear least squares solution w∗(S) for the subproblem (X_S, y_S), where X_S is the subset of rows of X indexed by S and y_S holds the corresponding responses from the response vector y. As a generalization of the one-dimensional distribution that chooses an index based on the squared length x_i^2, a set S of size d is chosen with probability proportional to the squared volume of the parallelepiped spanned by the rows of X_S. This squared volume equals det(X_S^⊤ X_S). Using elementary linear algebra, we will show that volume sampling the set S assures that w∗(S) is a good approximation to w∗ in the following sense: In expectation, the square loss (on all n row–response pairs) of w∗(S) is equal to d+1 times the square loss of w∗:

 E[L(w∗(S))] = (d+1) · L(w∗),  when P(S) ∝ det(X_S^⊤ X_S).

Furthermore, for any sampling procedure that attains fewer than d responses, the ratio between the expected loss and the loss of the optimum cannot be bounded by a constant.
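For small n and d the expectation identity can be checked exactly by enumerating all (n choose d) subsets with their determinant weights. The sketch below (illustrative sizes and data, not the paper's code) does so:

```python
import numpy as np
from itertools import combinations

# Exact enumeration check of E[L(w*(S))] = (d+1) L(w*) under size-d volume sampling.
rng = np.random.default_rng(1)
n, d = 7, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_star, *_ = np.linalg.lstsq(X, y, rcond=None)   # optimum on all n responses

def loss(w):
    return np.sum((X @ w - y) ** 2)

subsets = list(combinations(range(n), d))
dets = np.array([np.linalg.det(X[list(S)].T @ X[list(S)]) for S in subsets])
probs = dets / dets.sum()                        # P(S) proportional to det(X_S^T X_S)

# For |S| = d, the subproblem solution is the solve of a square system
E_loss = sum(p * loss(np.linalg.solve(X[list(S)], y[list(S)]))
             for p, S in zip(probs, subsets) if p > 0)
assert np.isclose(E_loss, (d + 1) * loss(w_star))
```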

#### Unbiased pseudoinverse estimator

There is a direct connection between solving linear least squares problems and the pseudoinverse X^+ of the matrix X: For an n-dimensional response vector y, the optimal solution is w∗ = X^+ y. Similarly, w∗(S) = (X_S)^+ y_S is the solution for the subproblem (X_S, y_S). We propose a new implementation of volume sampling called reverse iterative sampling, which enables a novel proof technique for obtaining elementary expectation formulas for pseudoinverses based on volume sampling.

Suppose that our goal is to estimate the pseudoinverse X^+ based on the pseudoinverse of a subset of rows. Recall that for a subset S of s row indices (where the size s is fixed and s ≥ d), we let X_S be the submatrix of the s rows indexed by S (see Figure 1.2). Consider a version of X in which all but the rows of S are zero. This matrix equals I_S X, where the selection matrix I_S is an n-dimensional diagonal matrix with (I_S)_{ii} = 1 if i ∈ S and 0 otherwise.

For a set S of fixed size s ≥ d row indices chosen with probability proportional to det(X_S^⊤ X_S), we can prove the following expectation formulas:

 E[(I_S X)^+] = X^+  and  E[(I_S X)^+ (I_S X)^{+⊤}] = (n−d+1)/(s−d+1) · X^+ X^{+⊤}.

Note that (I_S X)^+ has the shape of X^+, where the columns indexed by S contain (X_S)^+ and the remaining columns are zero. The expectation of this matrix is X^+ even though (X_S)^+ is clearly not a submatrix of X^+. This expectation formula now implies that for any size s ∈ {d, …, n}, if S of size s is drawn by volume sampling, then w∗(S) = (X_S)^+ y_S is an unbiased estimator (for size s = d volume sampling, the fact that E[w∗(S)] = w∗ can be found in an early paper (Ben-Tal and Teboulle, 1990), which gives a direct proof based on Cramer's rule) for w∗ = X^+ y, i.e.

 E[w∗(S)] = E[(X_S)^+ y_S] = E[(I_S X)^+ y] = E[(I_S X)^+] y = X^+ y = w∗.

The second expectation formula can be viewed as a second moment of the pseudoinverse estimator (I_S X)^+, and it can be used to compute a useful notion of matrix variance with applications in random matrix theory:

 E[(I_S X)^+ (I_S X)^{+⊤}] − E[(I_S X)^+] E[(I_S X)^+]^⊤ = (n−s)/(s−d+1) · X^+ X^{+⊤}.
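Both the unbiasedness and this variance formula are finite sums over subsets, so for small n, d, s they can be confirmed exactly by enumeration. A sketch (illustrative sizes, not the paper's code):

```python
import numpy as np
from itertools import combinations

# Exact checks of E[(I_S X)^+] = X^+ and the matrix variance formula
# under size-s volume sampling, by enumerating all subsets.
rng = np.random.default_rng(2)
n, d, s = 6, 2, 4
X = rng.standard_normal((n, d))

subsets = list(combinations(range(n), s))
dets = np.array([np.linalg.det(X[list(S)].T @ X[list(S)]) for S in subsets])
probs = dets / dets.sum()

E_pinv = np.zeros((d, n))
E_sq = np.zeros((d, d))
for p, S in zip(probs, subsets):
    ISX = np.zeros_like(X)
    ISX[list(S)] = X[list(S)]          # all but the rows of S zeroed out
    A = np.linalg.pinv(ISX)            # (I_S X)^+, shape d x n
    E_pinv += p * A
    E_sq += p * (A @ A.T)

Xp = np.linalg.pinv(X)
assert np.allclose(E_pinv, Xp)
# variance: E[A A^T] - X^+ X^{+T} = (n-s)/(s-d+1) X^+ X^{+T}
assert np.allclose(E_sq - Xp @ Xp.T, (n - s) / (s - d + 1) * (Xp @ Xp.T))
```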

#### Regularized volume sampling

We also develop a new regularized variant of volume sampling, which extends reverse iterative sampling to selecting subsets of size s smaller than d, and leads to a useful extension of the above matrix variance formula. Namely, for any λ > 0, our λ-regularized procedure for sampling subsets of size s satisfies

 E[(X_S^⊤ X_S + λI)^{-1}] ⪯ (n − d_λ + 1)/(s − d_λ + 1) · (X^⊤ X + λI)^{-1},

where d_λ = tr(X^⊤X (X^⊤X + λI)^{-1}) ≤ d is a standard notion of statistical dimension. Crucially, the above bound holds for subset sizes s ≥ d_λ, which can be much smaller than the dimension d.

Under the additional assumption that the response vector y is generated by a linear transformation distorted with bounded white noise, the expected bound on (X_S^⊤ X_S + λI)^{-1} leads to strong variance bounds for ridge regression estimators. Specifically, we prove that when y = Xw̃ + ξ, with the noise ξ having mean zero and bounded variance Var[ξ_i] ≤ σ², then if S is sampled according to λ-regularized volume sampling with λ ≤ σ²/∥w̃∥², we can obtain the following mean squared prediction error (MSPE) bound:

 E_S E_ξ[ (1/n) ∥X(wλ∗(S) − w̃)∥² ] ≤ σ² d_λ / (s − d_λ + 1),

where wλ∗(S) is the ridge regression estimator for the subproblem (X_S, y_S). Our new lower bounds show that the above upper bound for regularized volume sampling is essentially optimal with respect to the choice of a subsampling procedure.
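As a small illustration of the two quantities involved (the subset, sizes and λ below are arbitrary choices for the sketch, not values from the paper), the subsampled ridge estimator and the statistical dimension d_λ can be computed as follows:

```python
import numpy as np

# Illustrative sketch: ridge estimator on a subproblem (X_S, y_S) and the
# statistical dimension d_lambda. Names and sizes are assumptions.
rng = np.random.default_rng(5)
n, d, lam = 8, 3, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

S = [0, 2]                                # subset of size s = 2 < d,
XS, yS = X[S], y[S]                       # allowed under regularization
w_ridge = np.linalg.solve(XS.T @ XS + lam * np.eye(d), XS.T @ yS)

# statistical dimension: d_lambda = tr(X^T X (X^T X + lam I)^{-1}) <= d
d_lam = np.trace(X.T @ X @ np.linalg.inv(X.T @ X + lam * np.eye(d)))
assert w_ridge.shape == (d,)
assert 0 < d_lam < d
```

Note that for λ > 0 the regularized subproblem is solvable even when |S| < d, which is exactly what makes sample sizes s ≥ d_λ meaningful.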

#### Algorithms and experiments

The only known polynomial time algorithm for size s > d volume sampling was recently proposed by Li et al. (2017), with time complexity O(n⁴s). In this paper we give two new algorithms using our general framework of reverse iterative sampling: one with deterministic runtime of O((n−s+d)·nd), and a second one that with high probability finishes in time O(nd²). Thus both algorithms improve on the state-of-the-art by a factor of at least n², and make volume sampling nearly as efficient as the comparable i.i.d. sampling technique called leverage score sampling. Our experiments on real datasets confirm the efficiency of our algorithms, and show that for small sample sizes volume sampling is more effective than leverage score sampling for the task of subset selection for linear regression.

#### Related work

Volume sampling is a type of determinantal point process (DPP) (Kulesza and Taskar, 2012). DPPs have received a lot of attention in the literature, with many applications to machine learning, including recommendation systems (Gartrell et al., 2016) and clustering (Kang, 2013). Many exact and approximate methods for efficiently generating samples from this distribution have been proposed (Deshpande and Rademacher, 2010; Kulesza and Taskar, 2011), making it a useful tool in the design of randomized algorithms. Most of those methods focus on sampling s ≤ d elements. In this paper, we study volume sampling sets of size s ≥ d, which was proposed by Avron and Boutsidis (2013) and motivated with applications in graph theory, linear regression, matrix approximation and more.

The problem of selecting a subset of the rows of the input matrix for solving a linear regression task has been extensively studied in the statistics literature under the terms optimal design (Fedorov, 1972) and pool-based active learning (Sugiyama and Nakajima, 2009). Various criteria for subset selection have been proposed, like A-optimality and D-optimality. For example, A-optimality seeks to minimize tr((X_S^⊤ X_S)^{-1}), which is combinatorially hard to optimize exactly. We show that for size s volume sampling, E[(X_S^⊤ X_S)^{-1}] = (n−d+1)/(s−d+1) · (X^⊤X)^{-1}, which provides an approximate randomized solution for the sampled inverse covariance matrix rather than just its trace.

In the field of computational geometry, a variant of volume sampling was used to obtain optimal bounds for low-rank matrix approximation. In this task, the goal is to select a small subset of k rows of a matrix X (much fewer than the rank of X, which is bounded by d), so that a good rank-k approximation of X can be constructed from those rows. Deshpande et al. (2006) showed that volume sampling of size k ≤ d index sets obtains optimal multiplicative bounds for this task, and polynomial time algorithms for size k ≤ d volume sampling were given in Deshpande and Rademacher (2010) and Guruswami and Sinop (2012). We show in this paper that for linear regression, fewer than rank(X) = d many rows do not suffice to obtain multiplicative loss bounds. This is why we focus on volume sampling sets of size s ≥ d (recall that, for simplicity, we assume that X is full rank).

Computing approximate solutions to linear regression has been explored in the domain of numerical linear algebra (see Mahoney (2011) for an overview). Here, multiplicative bounds on the loss of the approximate solution can be achieved via two approaches. The first approach relies on sketching the input matrix X and the response vector y by multiplying both by the same suitably chosen random matrix. Algorithms which use sketching to generate a smaller input matrix for a given linear regression problem are computationally efficient (Sarlos, 2006; Clarkson and Woodruff, 2013), but they require all of the responses from the original problem to generate the sketch and are thus not suitable for the goal of using as few response values as possible. The second approach is based on subsampling the rows of the input matrix and only asking for the responses of the sampled rows. The learner optimally solves the sampled subproblem (note that those methods typically require additional rescaling of the subproblem, whereas the techniques proposed in this paper do not require any rescaling) and then uses the obtained weight vector for its prediction on all rows. The selected subproblem is known under the term “y-agnostic minimal coreset” in (Boutsidis et al., 2013; Drineas et al., 2008), since it is selected without knowing the response vector y. The second approach coincides with the goals of this paper, but the focus here is different in a number of ways. First, we focus on the smallest sample size for which a multiplicative loss bound is possible: Just d volume sampled rows are sufficient to achieve a multiplicative bound with a fixed factor, while fewer than d rows are not sufficient. A second focus here is the efficiency and the combinatorics of volume sampling. The previous work is mostly based on i.i.d. sampling using the statistical leverage scores (Drineas et al., 2012). As we show in this paper, leverage scores are the marginals of volume sampling, and any i.i.d. sampling method requires sample size Ω(d log d) to achieve multiplicative loss bounds for linear regression. On the other hand, the rows obtained from volume sampling are selected jointly, which makes the chosen subset more informative and brings the required sample size down to d. Third, we focus on the fact that the estimators produced from volume sampling are unbiased and therefore can be averaged to get more accurate estimators. Using our methods, averaging immediately leads to an unbiased estimator with expected loss (1+ε) times the optimum based on sampling O(d²/ε) responses in total. We leave it as an open problem to construct a factor (1+ε) unbiased estimator from sampling only O(d/ε) responses. If unbiasedness is not a concern, then such an estimator has recently been found (Chen and Price, 2017).

#### Outline of the paper

In the next section, we define volume sampling as an instance of a more general procedure we call reverse iterative sampling, and we use this methodology to prove closed form matrix expressions for the expectation of the pseudoinverse estimator (I_S X)^+ and its square (I_S X)^+ (I_S X)^{+⊤}, when S is sampled by volume sampling. Central to volume sampling is the Cauchy–Binet formula for determinants. As a side, we produce a number of short self-contained proofs for this formula and show that leverage scores are the marginals of volume sampling. Then in Section 3 we formulate the problem of solving linear regression from a small number of responses, and state the upper bound for the expected square loss of the volume sampled least squares estimator (Theorem 3.1), followed by a discussion and related lower bounds. In Section 3.2, we prove Theorem 3.1 and an additional related matrix expectation formula. We next discuss in Section 3.3 how unbiased estimators can easily be averaged to improve the expected loss, and discuss open problems for constructing unbiased estimators. A new regularized variant of volume sampling is proposed in Section 4, along with the statistical guarantees it offers for computing subsampled ridge regression estimators. Next, we present efficient volume sampling algorithms in Section 5, based on the reverse iterative sampling paradigm, which are then experimentally evaluated in Section 6. Finally, Section 7 concludes the paper by suggesting a future research direction.

## 2 Reverse iterative sampling

Let n be an integer dimension. For each subset S ⊆ {1, …, n} of size s, we are given a matrix formula F(S). Our goal is to sample a set S of size s using some sampling process and then develop concise expressions for E[F(S)]. Examples of formula classes F(S) will be given below.

We represent the sampling by a directed acyclic graph (DAG), with a single root node corresponding to the full set {1, …, n}. Starting from the root, we proceed along the edges of the graph, iteratively removing elements from the set (see Figure 2.1). Concretely, consider a DAG with levels s = n, n−1, …, d. Level s contains (n choose s) nodes for the sets S ⊆ {1, …, n} of size s. Every node S at level s has s directed edges to the nodes S \ {i} (also denoted S−i), for i ∈ S, at the next lower level. These edges are labeled with a conditional probability vector P(S−i|S). The probability of a (directed) path is the product of the probabilities along its edges. The outflow of probability from each node on all but the lowest level is 1. We let the probability P(S) of a node S be the probability of all paths from the top node to S, and we set the probability of the top node to 1. We associate a formula F(S) with each set node S in the DAG. The following key equality lets us compute expectations.

###### Lemma 2.1

If for all S of size greater than d we have

 F(S) = ∑_{i∈S} P(S−i|S) F(S−i),

then for any s ∈ {d, …, n}:

 E[F(S)] = ∑_{S:|S|=s} P(S) F(S) = F({1, …, n}).

Proof  It suffices to show that the expectations at two successive layers s and s−1 are equal, for any s ∈ {d+1, …, n}:

 ∑_{S:|S|=s} P(S) F(S) = ∑_{S:|S|=s} P(S) ∑_{i∈S} P(S−i|S) F(S−i)
  = ∑_{S:|S|=s} ∑_{i∈S} P(S) P(S−i|S) F(S−i)
  = ∑_{T:|T|=s−1} ∑_{j∉T} P(T+j) P(T|T+j) F(T) = ∑_{T:|T|=s−1} P(T) F(T).

Note that the r.h.s. of the second line has one summand per edge leaving level s, and the following expression has one summand per edge arriving at level s−1. The corresponding equality holds because the edges leaving level s are exactly those arriving at level s−1, and the summand for each edge in both expressions is the same. The final equality uses the fact that the total probability flowing into a node T is P(T), i.e. ∑_{j∉T} P(T+j) P(T|T+j) = P(T).

### 2.1 Volume sampling

Given a tall full rank matrix X ∈ R^{n×d} and a sample size s ∈ {d, …, n}, volume sampling chooses a subset S ⊆ {1, …, n} of size s with probability proportional to the squared volume spanned by the columns of the submatrix X_S (for sample size s = d, the submatrix X_S is square, so its rows and columns span the same volume, and det(X_S^⊤ X_S) = det(X_S)² is also the squared volume spanned by the rows of X_S), and this squared volume equals det(X_S^⊤ X_S). The following theorem uses the above DAG setup to compute the normalization constant for this distribution. Note that all subsets of volume 0 will be ignored, since they are unreachable in the proposed sampling procedure.

###### Theorem 2.2

Let X ∈ R^{n×d}, where n ≥ d, and let s ∈ {d, …, n}. For any set S of size greater than d for which det(X_S^⊤ X_S) > 0, define the probability of the edge from S to S−i, for any i ∈ S, as:

 P(S−i|S) = det(X_{S−i}^⊤ X_{S−i}) / ((|S|−d) · det(X_S^⊤ X_S)) = (1 − x_i^⊤ (X_S^⊤ X_S)^{-1} x_i) / (|S|−d), (reverse iterative volume sampling)

where x_i^⊤ is the i-th row of X. In this case P(·|S) is a proper probability distribution. If det(X_S^⊤ X_S) = 0, then simply set P(S−i|S) to 1/|S|. With these definitions, ∑_{i∈S} P(S−i|S) = 1 for all S, and the probability of all paths from the root to any subset S of size s ≥ d is

 P(S) = det(X_S^⊤ X_S) / ((n−d choose s−d) · det(X^⊤ X)). (volume sampling)

The rewrite of the ratio det(X_{S−i}^⊤ X_{S−i}) / det(X_S^⊤ X_S) as 1 − x_i^⊤ (X_S^⊤ X_S)^{-1} x_i is Sylvester's determinant theorem. Incidentally, this is the only property of determinants used in this section.

The theorem also implies a generalization of the Cauchy–Binet formula to sets of size s ≥ d:

 ∑_{S:|S|=s} det(X_S^⊤ X_S) = (n−d choose s−d) · det(X^⊤ X). (2.1)

When s = d, the binomial coefficient is 1 and the above becomes the vanilla Cauchy–Binet formula. The proof of the theorem below thus results in a minimalist proof of this classical formula as well. The proof uses reverse iterative sampling (Figure 2.1) and the fact that all paths from the root to node S have the same probability. For the sake of completeness, we also give a more direct inductive proof of the above generalized Cauchy–Binet formula in Appendix A.

Proof  First, for any node S s.t. |S| > d and det(X_S^⊤ X_S) > 0, the probabilities out of S sum to 1 (writing s = |S|):

 ∑_{i∈S} P(S−i|S) = ∑_{i∈S} (1 − tr((X_S^⊤ X_S)^{-1} x_i x_i^⊤)) / (s−d) = (s − tr((X_S^⊤ X_S)^{-1} X_S^⊤ X_S)) / (s−d) = (s−d)/(s−d) = 1.

It remains to show the formula for the probability of all paths ending at node S. If det(X_S^⊤ X_S) = 0, then one edge on any path from the root to S has probability 0. This edge goes from a superset of S with positive volume to a superset of S that has volume 0. Since all paths have probability 0, P(S) = 0 in this case.

Now assume det(X_S^⊤ X_S) > 0 and consider any path from the root {1, …, n} to S. There are (n−s)! such paths, all going through sets with positive volume. The fractions of determinants in the probabilities along each path telescope, and the additional 1/(|S′|−d) factors accumulate to the same product. So the probability of every path from the root to S is the same, and the total probability into S is

 (n−s)! · det(X_S^⊤ X_S) / ((n−d)(n−d−1) ⋯ (s−d+1) · det(X^⊤ X)) = det(X_S^⊤ X_S) / ((n−d choose s−d) · det(X^⊤ X)).
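Theorem 2.2 can be checked numerically by running the reverse iterative DAG as an exact dynamic program over levels and comparing the resulting node probabilities to the determinant formula. The following sketch (illustrative sizes, not the paper's implementation) does this:

```python
import numpy as np
from itertools import combinations
from math import comb

# Exact DP over the reverse iterative sampling DAG, checked against
# P(S) = det(X_S^T X_S) / (C(n-d, s-d) det(X^T X)).
rng = np.random.default_rng(3)
n, d, s = 6, 2, 4
X = rng.standard_normal((n, d))

def cond_prob(S, i):
    # P(S-i | S) = (1 - x_i^T (X_S^T X_S)^{-1} x_i) / (|S| - d)
    XS = X[list(S)]
    M = np.linalg.inv(XS.T @ XS)
    return (1 - X[i] @ M @ X[i]) / (len(S) - d)

level = {tuple(range(n)): 1.0}          # root has probability 1
for size in range(n, s, -1):            # remove one element per level
    nxt = {}
    for S, p in level.items():
        for i in S:
            T = tuple(j for j in S if j != i)
            nxt[T] = nxt.get(T, 0.0) + p * cond_prob(S, i)
    level = nxt

detX = np.linalg.det(X.T @ X)
for S in combinations(range(n), s):
    target = np.linalg.det(X[list(S)].T @ X[list(S)]) / (comb(n - d, s - d) * detX)
    assert np.isclose(level[S], target)
```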

An immediate consequence of the above sampling procedure is the following composition property of volume sampling, which states that this distribution is closed under subsampling. We also give a direct proof to highlight the combinatorics of volume sampling.

###### Corollary 2.3

For any s ∈ {d, …, n} and t ∈ {s, …, n}, the following hierarchical sampling procedure:

 T ∼ size t volume sampling from X,
 S ∼ size s volume sampling from X_T,

returns a set S which is distributed according to size s volume sampling from X.

Proof  We start with the Law of Total Probability and then use the probability formula for volume sampling from the above theorem. Here P(S|T) means the probability of all paths going through node T at level t and ending up at the final node S at level s. If S ⊄ T, then P(S|T) = 0.

 P(S) = ∑_{T: S⊆T} P(S|T) P(T)
  = ∑_{T: S⊆T} [det(X_S^⊤ X_S) / ((t−d choose s−d) · det(X_T^⊤ X_T))] · [det(X_T^⊤ X_T) / ((n−d choose t−d) · det(X^⊤ X))]
  = (n−s choose t−s) · det(X_S^⊤ X_S) / ((t−d choose s−d) (n−d choose t−d) · det(X^⊤ X))
  = det(X_S^⊤ X_S) / ((n−d choose s−d) · det(X^⊤ X)).

Note that for all sets T containing S, the summand P(S|T) P(T) is the same, and there are (n−s choose t−s) such sets.
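The composition property can likewise be verified by exact enumeration for small sizes; a sketch (illustrative parameters, not the paper's code):

```python
import numpy as np
from itertools import combinations

# Exact check of Corollary 2.3: size-t volume sampling followed by size-s
# volume subsampling equals direct size-s volume sampling.
rng = np.random.default_rng(6)
n, d, t, s = 6, 2, 4, 3
X = rng.standard_normal((n, d))

def vol_probs(rows, size):
    # size-`size` volume sampling restricted to the given row indices
    subs = list(combinations(rows, size))
    dets = np.array([np.linalg.det(X[list(S)].T @ X[list(S)]) for S in subs])
    return dict(zip(subs, dets / dets.sum()))

P_t = vol_probs(tuple(range(n)), t)
composed = {}
for T, pT in P_t.items():
    for S, pS in vol_probs(T, s).items():
        composed[S] = composed.get(S, 0.0) + pT * pS

for S, p in vol_probs(tuple(range(n)), s).items():
    assert np.isclose(composed[S], p)
```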

The main competitor of volume sampling is i.i.d. sampling of the rows of X w.r.t. the statistical leverage scores. For an input matrix X ∈ R^{n×d}, the leverage score l_i of the i-th row x_i^⊤ of X is defined as

 l_i = x_i^⊤ (X^⊤ X)^{-1} x_i.

Recall that this quantity appeared in the definition of the conditional probability P(S−i|S) in Theorem 2.2, where the leverage score was computed w.r.t. the submatrix X_S. In fact, there is a more basic relationship between leverage scores and volume sampling: If the set S is sampled according to size d volume sampling, then the leverage score l_i of row i is the marginal probability of selecting the i-th row into S. A general formula for the marginals of size s volume sampling is given in the following proposition:

###### Proposition 2.4

Let X ∈ R^{n×d} be a full rank matrix and d ≤ s ≤ n. If S is sampled according to size s volume sampling, then for any i ∈ {1, …, n},

 P(i∈S) = (s−d)/(n−d) + ((n−s)/(n−d)) · x_i^⊤ (X^⊤ X)^{-1} x_i.

Proof  Instead of P(i∈S) we will first compute P(i∉S):

 P(i∉S) = ∑_{S:|S|=s, i∉S} det(X_S^⊤ X_S) / ((n−d choose s−d) · det(X^⊤ X))
  = ∑_{S:|S|=s, i∉S} ∑_{T⊆S:|T|=d} det(X_T^⊤ X_T) / ((n−d choose s−d) · det(X^⊤ X))
  = ((n−d−1 choose s−d) / (n−d choose s−d)) · ∑_{T:|T|=d, i∉T} det(X_T^⊤ X_T) / det(X^⊤ X)
  = ((n−s)/(n−d)) · (1 − x_i^⊤ (X^⊤ X)^{-1} x_i),

where we used Cauchy–Binet twice and the fact that every set T with i ∉ T appears in (n−d−1 choose s−d) of the sets S. Now, the marginal probability follows from the fact that P(i∈S) = 1 − P(i∉S).
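The proposition can be confirmed by computing the marginals exactly over all (n choose s) subsets; a sketch with illustrative sizes (not the paper's code):

```python
import numpy as np
from itertools import combinations

# Exact check of the volume sampling marginals (Proposition 2.4).
rng = np.random.default_rng(4)
n, d, s = 6, 2, 4
X = rng.standard_normal((n, d))

subsets = list(combinations(range(n), s))
dets = np.array([np.linalg.det(X[list(S)].T @ X[list(S)]) for S in subsets])
probs = dets / dets.sum()

XtX_inv = np.linalg.inv(X.T @ X)
for i in range(n):
    marginal = sum(p for p, S in zip(probs, subsets) if i in S)
    li = X[i] @ XtX_inv @ X[i]          # leverage score of row i
    assert np.isclose(marginal, (s - d) / (n - d) + (n - s) / (n - d) * li)
```

Setting s = d in the formula recovers the marginal P(i∈S) = l_i mentioned above.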

### 2.2 Expectation formulas for volume sampling

All expectations in the remainder of the paper are w.r.t. volume sampling. We use the short-hand E[F(S)] for expectation with volume sampling where the size of the sampled set S is fixed to s. The expectation formulas for two choices of F(S) are proven in Theorems 2.5 and 2.6. By Lemma 2.1, it suffices to show F(S) = ∑_{i∈S} P(S−i|S) F(S−i) for volume sampling. We also present a related expectation formula (Theorem 2.7), which is proven later using different techniques.

Recall that X_S is the submatrix of the rows of X indexed by S. We also use a version of X in which all but the rows of S are zeroed out. This matrix equals I_S X, where I_S is an n-dimensional diagonal matrix with (I_S)_{ii} = 1 if i ∈ S and 0 otherwise (see Figure 1.2).

###### Theorem 2.5

Let X ∈ R^{n×d} be a tall full rank matrix (i.e. n ≥ d). For any s ∈ {d, …, n}, let S be a size s volume sampled set over {1, …, n}. Then

 E[(I_S X)^+] = X^+.

For the special case of s = d, this fact was known in the linear algebra literature (Ben-Tal and Teboulle, 1990; Ben-Israel, 1992). It was shown there using elementary properties of the determinant such as Cramer's rule. (Using the composition property of volume sampling, Corollary 2.3, the s > d case of the theorem can be reduced to the s = d case; however, we give a different self-contained proof.) The proof methodology developed here, based on reverse iterative volume sampling, is very different. We believe that this fundamental formula lies at the core of why volume sampling is important in many applications. In this work, we focus on its application to linear regression. However, Avron and Boutsidis (2013) discuss many problems where controlling the pseudoinverse of a submatrix is essential. For those applications, it is important to establish variance bounds for the above expectation, and volume sampling once again offers very concrete guarantees. We obtain them by showing the following formula, which can be viewed as a second moment of this estimator.

###### Theorem 2.6

Let X ∈ R^{n×d} be a full rank matrix and d ≤ s ≤ n. If size s volume sampling over {1, …, n} has full support, then

 E[(X⊤SXS)−1(ISX)+(ISX)+⊤]=n−d+1s−d+1(X⊤X)−1X+X+⊤.

In the case when volume sampling does not have full support, the matrix equality “=” above is replaced by the positive-definite inequality “⪯”.

The condition that size s volume sampling over {1, …, n} has full support is equivalent to det(X_S^⊤ X_S) > 0 for all S of size s. Note that if size s volume sampling has full support, then size s+1 volume sampling also has full support. So full support for the smallest size s = d (often phrased as “X is in general position”) implies that volume sampling w.r.t. any size s ≥ d has full support.

The above theorem immediately gives an expectation formula for the Frobenius norm of the estimator:

 E[∥(I_S X)^+∥_F²] = E[tr((I_S X)^+ (I_S X)^{+⊤})] = (n−d+1)/(s−d+1) · ∥X^+∥_F². (2.2)

This norm formula was shown by Avron and Boutsidis (2013), with numerous applications. Theorem 2.6 can be viewed as a much stronger pre-trace version of the known norm formula. Also, our proof techniques are quite different and much simpler. Note that if size s volume sampling does not have full support, then (2.2) becomes an inequality “≤”.

We now mention a second application of the above theorem in the context of linear regression, for the case when the response vector y is modeled as a noisy linear transformation, i.e., y = Xw̃ + ξ for some w̃ ∈ R^d and a random noise vector ξ ∈ R^n (detailed discussion in Section 4). In this case the matrix σ²(X_S^⊤ X_S)^{-1} can be interpreted as the covariance matrix of the least-squares estimator w∗(S) (for a fixed set S), and Theorem 2.6 gives an exact formula for the covariance matrix of w∗(S) under volume sampling. In Section 4, we give an extended version of this result which provides even stronger guarantees for regularized least-squares estimators under this model (Theorem 4.1).

Note that except for the above application, all results in this paper hold for arbitrary response vectors y. By combining Theorems 2.5 and 2.6, we can also obtain a covariance-type formula for the pseudoinverse matrix estimator (I_S X)^+ (this notion of “covariance” is used in random matrix theory: for a random matrix M one analyzes E[(M − E[M])(M − E[M])^⊤]; see for example Tropp, 2012):

 E[((I_S X)^+ − E[(I_S X)^+]) ((I_S X)^+ − E[(I_S X)^+])^⊤] = (n−s)/(s−d+1) · X^+ X^{+⊤}.