# Reverse iterative volume sampling for linear regression

*This paper is an expanded version of two conference papers (Dereziński and Warmuth, 2017, 2018).*

###### Abstract

We study the following basic machine learning task: Given a fixed set of $n$ input points in $\mathbb{R}^d$ for a linear regression problem, we wish to predict a hidden response value for each of the points. We can only afford to attain the responses for a small subset of the points, which are then used to construct linear predictions for all points in the dataset. The performance of the predictions is evaluated by the total square loss on all responses (the attained as well as the hidden ones). We show that a good approximate solution to this least squares problem can be obtained from just dimension $d$ many responses by using a joint sampling technique called volume sampling. Moreover, the least squares solution obtained for the volume sampled subproblem is an unbiased estimator of the optimal solution based on all $n$ responses. This unbiasedness is a desirable property that is not shared by other common subset selection techniques.

Motivated by these basic properties, we develop a theoretical framework for studying volume sampling, resulting in a number of new matrix expectation equalities and statistical guarantees which are of importance not only to least squares regression but also to numerical linear algebra in general. Our methods also lead to a regularized variant of volume sampling, and we propose the first efficient algorithms for volume sampling which make this technique a practical tool in the machine learning toolbox. Finally, we provide experimental evidence which confirms our theoretical findings.

Michał Dereziński mderezin@ucsc.edu

Manfred K. Warmuth manfred@ucsc.edu

Department of Computer Science

University of California Santa Cruz

Keywords: Volume sampling, linear regression, row sampling, active learning, optimal design.

## 1 Introduction

As an introductory case, consider linear regression in one dimension. We are given $n$ points $x_1,\dots,x_n\in\mathbb{R}$. Each point $x_i$ has a hidden real response (or target value) $y_i$. Assume that obtaining the responses is expensive and the learner can afford to request the responses for only a small number of indices $i\in\{1,\dots,n\}$. After receiving the requested responses, the learner determines an approximate linear least squares solution. In the one dimensional case, this is just a single weight $w$. How many response values does the learner need to request so that the total square loss of its approximate solution on all $n$ points is “close” to the total loss of the optimal linear least squares solution $w^\circ$ found with the knowledge of all responses? We will show here that just one response suffices if the index $i$ is chosen with probability proportional to $x_i^2$. When the learner uses the approximate solution $w^* = y_i/x_i$, then its expected loss equals 2 times the loss of the optimum $w^\circ$ that is computed based on all responses (see Figure 1.1). Moreover, the approximate solution $w^*$ is an unbiased estimator for the optimum $w^\circ$:

$$\mathbb{E}[w^*] = w^\circ \qquad\text{and}\qquad \mathbb{E}\big[L(w^*)\big] = 2\,L(w^\circ), \qquad\text{where } L(w)=\sum_{i=1}^n (x_i w - y_i)^2.$$
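The one-dimensional claim can be made concrete with a small numerical check (a numpy sketch with hypothetical toy data; both expectations are computed by exact enumeration over the sampling distribution, not by Monte Carlo):

```python
import numpy as np

# Hypothetical toy data: points x_i with hidden responses y_i.
x = np.array([1.0, 2.0, -3.0, 0.5])
y = np.array([1.2, 1.9, -3.5, 0.4])

# Optimal least squares weight, computed with knowledge of all responses.
w_opt = (x @ y) / (x @ x)

def loss(w):
    return np.sum((x * w - y) ** 2)

# Sample a single index i with probability proportional to x_i^2,
# then predict with the one-response estimator w* = y_i / x_i.
p = x**2 / (x @ x)

# Exact expectations over the sampling distribution (enumeration over i):
E_w = np.sum(p * (y / x))                           # E[w*]
losses = np.array([loss(yi / xi) for xi, yi in zip(x, y)])
E_loss = p @ losses                                  # E[L(w*)]

assert np.isclose(E_w, w_opt)                # unbiasedness: E[w*] = w°
assert np.isclose(E_loss, 2 * loss(w_opt))   # E[L(w*)] = 2 L(w°)
```

Both assertions reflect the identities above: sampling $i \propto x_i^2$ makes $w^* = y_i/x_i$ unbiased, and its expected loss is exactly twice the optimal loss.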

We will extend these formulas to higher dimensions and to sampling more responses by making use of a joint sampling distribution called volume sampling. We summarize our contributions in the next four subsections.

#### Least squares with dimension $d$ many responses

Consider the case when the points $x_i$ lie in $\mathbb{R}^d$. Let $X\in\mathbb{R}^{n\times d}$ denote the matrix that has the transposed points $x_i^\top$ as rows, and let $y\in\mathbb{R}^n$ be the vector of responses. Now the goal is to minimize the (total) square loss

$$L(w) = \|Xw - y\|^2 = \sum_{i=1}^n (x_i^\top w - y_i)^2$$

over all linear weight vectors $w\in\mathbb{R}^d$. Let $w^\circ$ denote the optimal such weight vector. We want to approximately minimize the square loss based on a small number of responses attained for a subset of rows. Again, the learner is initially given the fixed set of $n$ rows of $X$ (i.e. fixed design), but none of the responses. It is then allowed to choose a random subset of indices, $S\subseteq\{1,\dots,n\}$, and obtains the responses for the corresponding rows. The learner proceeds to find the optimal linear least squares solution $w^*$ for the subproblem $(X_S, y_S)$, where $X_S$ is the subset of rows of $X$ indexed by $S$ and $y_S$ holds the corresponding responses from the response vector $y$. As a generalization of the one-dimensional distribution that chooses an index based on the squared length, a set $S$ of size $d$ is chosen with probability proportional to the squared volume of the parallelepiped spanned by the rows of $X_S$. This squared volume equals $\det(X_S)^2 = \det(X_S^\top X_S)$. Using elementary linear algebra, we will show that volume sampling the set $S$ assures that $w^*$ is a good approximation to $w^\circ$ in the following sense: In expectation, the square loss (on all $n$ row response pairs) of $w^*$ is equal to $d+1$ times the square loss of $w^\circ$:

$$\mathbb{E}\big[L(w^*)\big] \;=\; (d+1)\,L(w^\circ).$$

Furthermore, for any sampling procedure that attains fewer than $d$ responses, the ratio between the expected loss and the loss of the optimum cannot be bounded by a constant.

#### Unbiased pseudoinverse estimator

There is a direct connection between solving linear least squares problems and the pseudoinverse $X^+$ of the matrix $X$: For an $n$-dimensional response vector $y$, the optimal solution is $w^\circ = X^+ y$. Similarly, $w^* = X_S^+ y_S$ is the solution for the subproblem $(X_S, y_S)$. We propose a new implementation of volume sampling called reverse iterative sampling, which enables a novel proof technique for obtaining elementary expectation formulas for pseudoinverses based on volume sampling.

Suppose that our goal is to estimate the pseudoinverse $X^+$ based on the pseudoinverse of a subset of rows. Recall that for a subset $S$ of $s$ row indices (where the size $s$ is fixed and $d\le s\le n$), we let $X_S$ be the submatrix of the $s$ rows indexed by $S$ (see Figure 1.2). Consider a version of $X$ in which all but the rows of $S$ are zero. This matrix equals $I_S X$, where the selection matrix $I_S$ is an $n$-dimensional diagonal matrix with $(I_S)_{ii}=1$ if $i\in S$ and 0 otherwise.

For the set $S$ of fixed size $s$ row indices chosen with probability proportional to $\det(X_S^\top X_S)$, we can prove the following expectation formulas:

$$\mathbb{E}\big[(I_SX)^+\big] = X^+ \qquad\text{and}\qquad \mathbb{E}\Big[(I_SX)^+\big((I_SX)^+\big)^{\!\top}\Big] = \frac{n-d+1}{s-d+1}\,\big(X^\top X\big)^{-1}.$$

Note that $(I_SX)^+$ has the shape of $X^+$ (i.e., $d\times n$), where the columns indexed by $S$ contain $X_S^+$ and the remaining columns are zero. The expectation of this matrix is $X^+$ even though $X_S^+$ is clearly not a submatrix of $X^+$. This expectation formula now implies that for any size $s\in\{d,\dots,n\}$, if $S$ of size $s$ is drawn by volume sampling, then $w^* = X_S^+ y_S$ is an unbiased estimator for $w^\circ = X^+ y$, i.e.

$$\mathbb{E}[w^*] \;=\; w^\circ.$$

(For size $s=d$ volume sampling, the fact that $\mathbb{E}\big[(I_SX)^+\big]=X^+$ can be found in an early paper (Ben-Tal and Teboulle, 1990). They give a direct proof based on Cramer’s rule.)

The second expectation formula can be viewed as a second moment of the pseudoinverse estimator $(I_SX)^+$, and it can be used to compute a useful notion of matrix variance with applications in random matrix theory:

$$\mathbb{E}\Big[(I_SX)^+\big((I_SX)^+\big)^{\!\top}\Big] - X^+\big(X^+\big)^\top \;=\; \frac{n-s}{s-d+1}\,\big(X^\top X\big)^{-1}.$$
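Both expectation formulas can again be checked by exact enumeration over all size-$s$ subsets (a numpy sketch with an arbitrary random design; sizes are illustrative):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, d, s = 6, 2, 4
X = rng.standard_normal((n, d))

subsets = list(combinations(range(n), s))
dets = np.array([np.linalg.det(X[list(S)].T @ X[list(S)]) for S in subsets])
probs = dets / dets.sum()   # volume sampling: P(S) proportional to det(X_S^T X_S)

def padded_pinv(S):
    # (I_S X)^+ : pseudoinverse of X with all rows outside S zeroed out.
    I_S = np.zeros((n, n))
    I_S[list(S), list(S)] = 1.0
    return np.linalg.pinv(I_S @ X)

E_pinv = sum(p * padded_pinv(S) for p, S in zip(probs, subsets))
E_sq = sum(p * (padded_pinv(S) @ padded_pinv(S).T) for p, S in zip(probs, subsets))

assert np.allclose(E_pinv, np.linalg.pinv(X))          # E[(I_S X)^+] = X^+
ratio = (n - d + 1) / (s - d + 1)
assert np.allclose(E_sq, ratio * np.linalg.inv(X.T @ X))  # second moment formula
```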

#### Regularized volume sampling

We also develop a new regularized variant of volume sampling, which extends reverse iterative sampling to selecting subsets of size smaller than $d$, and leads to a useful extension of the above matrix variance formula. Namely, for any $\lambda\ge 0$, our $\lambda$-regularized procedure for sampling subsets $S$ of size $k$ satisfies

$$\mathbb{E}\Big[\big(X_S^\top X_S + \lambda I\big)^{-1}\Big] \;\preceq\; \frac{n-d_\lambda+1}{k - d_\lambda + 1}\,\big(X^\top X + \lambda I\big)^{-1},$$

where $d_\lambda = \operatorname{tr}\!\big(X^\top X\,(X^\top X + \lambda I)^{-1}\big)$ is a standard notion of statistical dimension. Crucially, the above bound holds for subset sizes $k\ge d_\lambda$, which can be much smaller than the dimension $d$.
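The statistical dimension $d_\lambda=\operatorname{tr}(X^\top X(X^\top X+\lambda I)^{-1})$ is cheap to compute; a small numpy sketch (random design, arbitrary $\lambda$) also checks its equivalent spectral form $d_\lambda=\sum_i \sigma_i^2/(\sigma_i^2+\lambda)$, where $\sigma_i$ are the singular values of $X$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 5))
lam = 0.7  # hypothetical regularization parameter

# Statistical dimension: d_lambda = tr(X^T X (X^T X + lam I)^{-1}).
A = X.T @ X
d_lam = np.trace(A @ np.linalg.inv(A + lam * np.eye(A.shape[0])))

# Equivalent spectral form: sum of sigma_i^2 / (sigma_i^2 + lam).
sig2 = np.linalg.svd(X, compute_uv=False) ** 2
assert np.isclose(d_lam, np.sum(sig2 / (sig2 + lam)))

# d_lambda interpolates between d (lam -> 0) and 0 (lam -> infinity).
assert 0 < d_lam < X.shape[1]
```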

Under the additional assumption that the response vector is generated by a linear transformation distorted with bounded white noise, the expected bound on $(X_S^\top X_S+\lambda I)^{-1}$ leads to strong variance bounds for ridge regression estimators. Specifically, we prove that when $y = Xw + \xi$, with the noise $\xi$ having mean zero and bounded variance $\sigma^2$, then if $S$ of size $k$ is sampled according to $\lambda$-regularized volume sampling with $\lambda \le \sigma^2/\|w\|^2$, we can obtain the following mean squared prediction error (MSPE) bound:

$$\frac1n\,\mathbb{E}\Big[\big\|X\widehat w_S^\lambda - Xw\big\|^2\Big] \;\le\; \frac{\sigma^2 d_\lambda}{k - d_\lambda + 1},$$

where $\widehat w_S^\lambda$ is the ridge regression estimator for the subproblem $(X_S, y_S)$. Our new lower bounds show that the above upper bound for regularized volume sampling is essentially optimal with respect to the choice of a subsampling procedure.

#### Algorithms and experiments

The only known polynomial time algorithm for size $s>d$ volume sampling was recently proposed by Li et al. (2017). In this paper we give two new algorithms using our general framework of reverse iterative sampling: a deterministic one, and a second one that with high probability finishes much faster. Both algorithms improve substantially on the state of the art and make volume sampling nearly as efficient as the comparable i.i.d. sampling technique called leverage score sampling. Our experiments on real datasets confirm the efficiency of our algorithms, and show that for small sample sizes volume sampling is more effective than leverage score sampling for the task of subset selection for linear regression.
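To illustrate the reverse iterative approach behind these algorithms, here is a simplified sampler that maintains $Z=(X_S^\top X_S)^{-1}$ with Sherman-Morrison updates instead of recomputing determinants (a sketch only, not the paper's optimized algorithms; the per-step invariant check is included purely for illustration):

```python
import numpy as np

def volume_sample(X, s, rng):
    """Sample |S| = s with P(S) proportional to det(X_S^T X_S),
    by iteratively removing one row at a time from the full set."""
    n, d = X.shape
    S = list(range(n))
    Z = np.linalg.inv(X.T @ X)          # Z = (X_S^T X_S)^{-1} for current S
    while len(S) > s:
        # Leverage scores of the remaining rows w.r.t. the current submatrix X_S.
        h = np.array([X[i] @ Z @ X[i] for i in S])
        p = (1.0 - h) / (len(S) - d)    # edge probabilities; sums to 1
        j = rng.choice(len(S), p=p)
        xi = X[S[j]]
        # Sherman-Morrison downdate: (X_S^T X_S - x x^T)^{-1}.
        Z = Z + np.outer(Z @ xi, xi @ Z) / (1.0 - h[j])
        S.pop(j)
        # Invariant check (for illustration; drop in real use).
        assert np.allclose(Z, np.linalg.inv(X[S].T @ X[S]))
    return sorted(S)

rng = np.random.default_rng(6)
X = rng.standard_normal((8, 3))
S = volume_sample(X, 4, rng)
assert len(S) == 4
```

Each removal costs a rank-one inverse update plus a pass over the remaining rows; the paper's algorithms add further accelerations on top of this basic scheme.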

#### Related work

Volume sampling is a type of determinantal point process (DPP) (Kulesza and Taskar, 2012). DPPs have been given a lot of attention in the literature, with many applications to machine learning, including recommendation systems (Gartrell et al., 2016) and clustering (Kang, 2013). Many exact and approximate methods for efficiently generating samples from this distribution have been proposed (Deshpande and Rademacher, 2010; Kulesza and Taskar, 2011), making it a useful tool in the design of randomized algorithms. Most of those methods focus on sampling $k\le d$ elements. In this paper, we study volume sampling sets of size $s\ge d$, which was proposed by Avron and Boutsidis (2013) and motivated with applications in graph theory, linear regression, matrix approximation and more.

The problem of selecting a subset of the rows of the input matrix for solving a linear regression task has been extensively studied in the statistics literature under the terms optimal design (Fedorov, 1972) and pool-based active learning (Sugiyama and Nakajima, 2009). Various criteria for subset selection have been proposed, like A-optimality and D-optimality. For example, A-optimality seeks to minimize $\operatorname{tr}\!\big((X_S^\top X_S)^{-1}\big)$, which is combinatorially hard to optimize exactly. We show that for size $s$ volume sampling, $\mathbb{E}\big[(X_S^\top X_S)^{-1}\big] = \frac{n-d+1}{s-d+1}(X^\top X)^{-1}$, which provides an approximate randomized solution for the sampled inverse covariance matrix rather than just its trace.

In the field of computational geometry a variant of volume sampling was used to obtain optimal bounds for low-rank matrix approximation. In this task, the goal is to select a small subset of $k$ rows of a matrix $X$ (much fewer than the rank of $X$, which is bounded by $d$), so that a good low-rank approximation of $X$ can be constructed from those rows. Deshpande et al. (2006) showed that volume sampling of size $k$ index sets obtains optimal multiplicative bounds for this task, and polynomial time algorithms for size $k\le d$ volume sampling were given in Deshpande and Rademacher (2010) and Guruswami and Sinop (2012). We show in this paper that for linear regression, fewer than rank ($=d$) many rows do not suffice to obtain multiplicative bounds. This is why we focus on volume sampling sets of size $s\ge d$ (recall that, for simplicity, we assume that $X$ is full rank).

Computing approximate solutions to linear regression has
been explored in the domain of numerical linear algebra (see
Mahoney (2011) for an overview).
Here, multiplicative bounds on the loss of the approximate solution can be
achieved via two approaches. The first approach relies on sketching
the input matrix and the response vector by
multiplying both by the same suitably chosen random matrix.
Algorithms which use sketching to generate a smaller
input matrix for a given linear regression problem are
computationally efficient
(Sarlos, 2006; Clarkson and Woodruff, 2013), but they
require all of the responses from the original problem
to generate the sketch and are thus not suitable for the goal
of using as few response values as possible.
The second approach is based on subsampling the rows of
the input matrix and only asking for the responses of the sampled rows.
The learner optimally solves the sampled subproblem (note that those methods typically require additional rescaling of the subproblem, whereas the techniques proposed in this paper do not require any rescaling) and then uses the obtained weight vector for its prediction on all rows. The selected subproblem is known under the term “$b$-agnostic minimal coreset” in (Boutsidis et al., 2013; Drineas et al., 2008), since it is selected without knowing the response vector (denoted there as the vector $b$).
The second approach coincides with the goals of this paper
but the focus here is different in a number of ways.
First, we focus on the smallest sample size for which a
multiplicative loss bound is possible:
Just $d$ volume sampled rows are
sufficient to achieve a multiplicative bound with a fixed
factor, while $d-1$ rows are not sufficient.
A second focus here is the efficiency and the combinatorics of volume
sampling. The previous work is mostly based on
i.i.d. sampling using the statistical leverage scores (Drineas et al., 2012).
As we show in this paper, leverage scores are the
marginals of volume sampling, and
any i.i.d. sampling method requires
sample size $\Omega(d\log d)$ to achieve multiplicative loss bounds
for linear regression. On the other hand, the rows obtained
from volume sampling are selected jointly,
and this makes the chosen subset more informative and
brings the required sample size down to $d$.
Third, we focus on the fact that the estimators produced
from volume sampling are unbiased and therefore can be
averaged to get more accurate estimators.
Using our methods, averaging immediately leads to an unbiased estimator
with expected loss $(1+\epsilon)$ times the optimum based on
sampling $O(d^2/\epsilon)$ responses in total. We leave it as
an open problem to construct a factor $(1+\epsilon)$
unbiased estimator from sampling only $O(d/\epsilon)$ responses.
If unbiasedness is not a concern, then such an estimator has
recently been found (Chen and Price, 2017).

#### Outline of the paper

In the next section, we define volume sampling as an instance of a more general procedure we call reverse iterative sampling, and we use this methodology to prove closed form matrix expressions for the expectation of the pseudoinverse estimator $(I_SX)^+$ and its square $(I_SX)^+\big((I_SX)^+\big)^\top$, when $S$ is sampled by volume sampling. Central to volume sampling is the Cauchy-Binet formula for determinants. As a side, we produce a number of short self-contained proofs for this formula and show that leverage scores are the marginals of volume sampling. Then in Section 3 we formulate the problem of solving linear regression from a small number of responses, and state the upper bound for the expected square loss of the volume sampled least squares estimator (Theorem 3.1), followed by a discussion and related lower bounds. In Section 3.2, we prove Theorem 3.1 and an additional related matrix expectation formula. We next discuss in Section 3.3 how unbiased estimators can easily be averaged for improving the expected loss and discuss open problems for constructing unbiased estimators. A new regularized variant of volume sampling is proposed in Section 4, along with the statistical guarantees it offers for computing subsampled ridge regression estimators. Next, we present efficient volume sampling algorithms in Section 5, based on the reverse iterative sampling paradigm, which are then experimentally evaluated in Section 6. Finally, Section 7 concludes the paper by suggesting a future research direction.

## 2 Reverse iterative sampling

Let $n$ be an integer dimension. For each subset $S\subseteq\{1,\dots,n\}$ of size at least some target size $s$, we are given a matrix formula $F(S)$. Our goal is to sample a set $S$ of size $s$ using some sampling process and then develop concise expressions for $\mathbb{E}_{|S|=s}\big[F(S)\big]$. Examples of formula classes $F(S)$ will be given below.

We represent the sampling by a directed acyclic graph (DAG), with a single root node corresponding to the full set $\{1,\dots,n\}$. Starting from the root, we proceed along the edges of the graph, iteratively removing elements from the set (see Figure 2.1). Concretely, consider a DAG with levels $n, n-1, \dots, s$. Level $k$ contains $\binom{n}{k}$ nodes corresponding to sets $S\subseteq\{1,\dots,n\}$ of size $k$. Every node $S$ at level $k>s$ has $k$ directed edges to the nodes $S-\{i\}$ (also denoted $S{-}i$) at the next lower level. These edges are labeled with a conditional probability vector $P(S{-}i\mid S)$. The probability of a (directed) path is the product of the probabilities along its edges. The outflow of probability from each node on all but the lowest level is 1. We let the probability $P(S)$ of node $S$ be the probability of all paths from the top node to $S$, and set the probability of the top node to 1. We associate a formula $F(S)$ with each set node $S$ in the DAG. The following key equality lets us compute expectations.

###### Lemma 2.1

If for all $S\subseteq\{1,\dots,n\}$ of size greater than $s$ we have

$$F(S) \;=\; \sum_{i\in S} P(S{-}i \mid S)\, F(S{-}i),$$

then for any $k\in\{s,\dots,n\}$:

$$\mathbb{E}_{|S|=k}\big[F(S)\big] \;=\; \sum_{S:\,|S|=k} P(S)\,F(S) \;=\; F(\{1,\dots,n\}).$$

Proof  It suffices to show that the expectations at two successive layers $k$ and $k-1$ are equal for $s < k \le n$:

$$\begin{aligned}\sum_{S:\,|S|=k} P(S)\,F(S) &= \sum_{S:\,|S|=k}\;\sum_{i\in S} P(S)\,P(S{-}i\mid S)\,F(S{-}i)\\ &= \sum_{T:\,|T|=k-1}\;\sum_{j\notin T} P(T{+}j)\,P(T\mid T{+}j)\,F(T) \;=\; \sum_{T:\,|T|=k-1} P(T)\,F(T).\end{aligned}$$

Note that the r.h.s. of the first line has one summand per
edge leaving level $k$, and the r.h.s. of the second line has one
summand per edge arriving at level $k-1$.
Now the last equality holds because
the edges leaving level $k$ are exactly those
arriving at level $k-1$, and the summand for each edge in
both expressions is the same.

### 2.1 Volume sampling

Given a tall full rank matrix $X\in\mathbb{R}^{n\times d}$
and a sample size $s\in\{d,\dots,n\}$,
volume sampling chooses subset $S\subseteq\{1,\dots,n\}$
of size $s$ with probability proportional to the squared volume spanned by
the columns of submatrix $X_S$ (for sample size $s=d$, the submatrix $X_S$ is square,
and $\det(X_S^\top X_S)=\det(X_S)^2$ is also the squared
volume spanned by the rows of $X_S$),
and this squared volume equals $\det(X_S^\top X_S)$.
The following theorem uses
the above DAG setup to compute the normalization constant for this
distribution.
Note that all subsets of volume 0 will be ignored,
since they are unreachable in the proposed sampling procedure.

###### Theorem 2.2

Let $X\in\mathbb{R}^{n\times d}$, where $d\le n$ and $\operatorname{rank}(X)=d$. For any set $T\subseteq\{1,\dots,n\}$ of size $t>d$ for which $\det(X_T^\top X_T)>0$, define the probability of the edge from $T$ to $T{-}i$ for $i\in T$ as:

$$P(T{-}i\mid T) \;=\; \frac{\det\!\big(X_{T-i}^\top X_{T-i}\big)}{(t-d)\,\det\!\big(X_T^\top X_T\big)} \;=\; \frac{1 - x_i^\top\big(X_T^\top X_T\big)^{-1}x_i}{t-d} \qquad\text{(reverse iterative volume sampling)}$$

where $x_i^\top$ is the $i$th row of $X$. In this case $P(\cdot\mid T)$ is a proper probability distribution. If $\det(X_T^\top X_T)=0$, then simply set $P(T{-}i\mid T)$ to $\frac1t$. With these definitions, $\sum_{i\in T}P(T{-}i\mid T)=1$ for all $T$, and the probability of all paths from the root to any subset $S$ of size at least $d$ is

$$P(S) \;=\; \frac{\det\!\big(X_S^\top X_S\big)}{\binom{n-d}{|S|-d}\,\det\!\big(X^\top X\big)} \qquad\text{(volume sampling)}$$

The rewrite of the ratio $\frac{\det(X_{T-i}^\top X_{T-i})}{\det(X_T^\top X_T)}$ as $1 - x_i^\top(X_T^\top X_T)^{-1}x_i$ is Sylvester’s Theorem for determinants. Incidentally, this is the only property of determinants used in this section.

The theorem also implies a generalization of the Cauchy-Binet formula to size $s\ge d$ sets:

$$\sum_{S:\,|S|=s} \det\!\big(X_S^\top X_S\big) \;=\; \binom{n-d}{s-d}\,\det\!\big(X^\top X\big). \qquad (2.1)$$

When $s=d$, the binomial coefficient is 1 and the above becomes the vanilla Cauchy-Binet formula. The below proof of the theorem thus results in a minimalist proof of this classical formula as well. The proof uses reverse iterative sampling (Figure 2.1) and the fact that all paths from the root to node $S$ have the same probability. For the sake of completeness we also give a more direct inductive proof of the above generalized Cauchy-Binet formula in Appendix A.
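Identity (2.1) is easy to check numerically on a random matrix by brute-force enumeration (a numpy sketch; the sizes are arbitrary):

```python
import numpy as np
from itertools import combinations
from math import comb

rng = np.random.default_rng(4)
n, d, s = 7, 3, 5
X = rng.standard_normal((n, d))

# Left-hand side of (2.1): sum of det(X_S^T X_S) over all size-s subsets.
total = sum(np.linalg.det(X[list(S)].T @ X[list(S)])
            for S in combinations(range(n), s))

# Right-hand side: binom(n-d, s-d) * det(X^T X).
assert np.isclose(total, comb(n - d, s - d) * np.linalg.det(X.T @ X))
```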

Proof  First, for any node $T$ s.t. $\det(X_T^\top X_T)>0$ and $|T|=t>d$, the probabilities out of $T$ sum to 1:

$$\sum_{i\in T} P(T{-}i\mid T) \;=\; \sum_{i\in T}\frac{1 - x_i^\top(X_T^\top X_T)^{-1}x_i}{t-d} \;=\; \frac{t - \operatorname{tr}\!\big(X_T(X_T^\top X_T)^{-1}X_T^\top\big)}{t-d} \;=\; \frac{t-d}{t-d} \;=\; 1.$$

It remains to show the formula for the probability of all paths ending at node $S$. If $\det(X_S^\top X_S)=0$, then one edge on any path from the root to $S$ has probability 0. This edge goes from a superset of $S$ with positive volume to a superset of $S$ that has volume 0. Since all paths have probability 0, $P(S)=0$ in this case.

Now assume $\det(X_S^\top X_S)>0$ and consider any path from the root $\{1,\dots,n\}$ to $S$ of size $s$. There are $(n-s)!$ such paths, all going through sets with positive volume. The fractions of determinants in the probabilities along each path telescope to $\frac{\det(X_S^\top X_S)}{\det(X^\top X)}$, and the additional $\frac{1}{t-d}$ factors accumulate to the same product $\frac{(s-d)!}{(n-d)!}$. So the probability of all paths from the root to $S$ is the same, and the total probability into $S$ is

$$P(S) \;=\; (n-s)!\;\frac{(s-d)!}{(n-d)!}\;\frac{\det\!\big(X_S^\top X_S\big)}{\det\!\big(X^\top X\big)} \;=\; \frac{\det\!\big(X_S^\top X_S\big)}{\binom{n-d}{s-d}\,\det\!\big(X^\top X\big)}.$$
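The theorem can also be verified computationally by propagating node probabilities through the entire DAG with the edge weights defined above and comparing the resulting level-$s$ distribution with the closed form (a numpy sketch on a small random instance; exact, no Monte Carlo):

```python
import numpy as np
from itertools import combinations
from math import comb

rng = np.random.default_rng(3)
n, d, s = 6, 2, 3
X = rng.standard_normal((n, d))

def vol(S):
    XS = X[list(S)]
    return np.linalg.det(XS.T @ XS)

# Propagate node probabilities level by level through the sampling DAG,
# using edge probabilities P(T-i | T) = det(X_{T-i}^T X_{T-i}) / ((t-d) det(X_T^T X_T)).
level = {tuple(range(n)): 1.0}
for t in range(n, s, -1):
    nxt = {}
    for T, pT in level.items():
        for i in T:
            Tm = tuple(j for j in T if j != i)
            p_edge = vol(Tm) / ((t - d) * vol(T))
            nxt[Tm] = nxt.get(Tm, 0.0) + pT * p_edge
    level = nxt

# The resulting distribution matches the closed form of the theorem.
Z = comb(n - d, s - d) * vol(range(n))
for S in combinations(range(n), s):
    assert np.isclose(level[S], vol(S) / Z)
```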

An immediate consequence of the above sampling procedure is the following composition property of volume sampling, which states that this distribution is closed under subsampling. We also give a direct proof to highlight the combinatorics of volume sampling.

###### Corollary 2.3

For any $s, t$ with $n\ge s\ge t\ge d$, the following hierarchical sampling procedure:

1. sample $S$ of size $s$ according to volume sampling from $X$;
2. subsample $T\subseteq S$ of size $t$ according to volume sampling from $X_S$;

returns a set $T$ which is distributed according to size $t$ volume sampling from $X$.

Proof  We start with the law of total probability and then use the probability formula for volume sampling from the above theorem. Here $P(T\mid S)$ means the probability of all paths going through node $S$ at level $s$ and ending up at the final node $T$ at level $t$. If $\det(X_T^\top X_T)=0$, then $P(T)=0$, so assume otherwise:

$$P(T) \;=\; \sum_{S\supseteq T,\,|S|=s} P(S)\,P(T\mid S) \;=\; \sum_{S\supseteq T,\,|S|=s} \frac{\det(X_S^\top X_S)}{\binom{n-d}{s-d}\det(X^\top X)}\cdot \frac{\det(X_T^\top X_T)}{\binom{s-d}{t-d}\det(X_S^\top X_S)} \;=\; \frac{\binom{n-t}{s-t}\,\det(X_T^\top X_T)}{\binom{n-d}{s-d}\binom{s-d}{t-d}\,\det(X^\top X)} \;=\; \frac{\det(X_T^\top X_T)}{\binom{n-d}{t-d}\,\det(X^\top X)}.$$

Note that for all sets $S$ containing $T$, the summand is the same,
and there are $\binom{n-t}{s-t}$ such sets $S$.

The main competitor of volume sampling is i.i.d. sampling of the rows of $X$ w.r.t. the statistical leverage scores. For an input matrix $X\in\mathbb{R}^{n\times d}$, the leverage score of the $i$-th row $x_i^\top$ of $X$ is defined as

$$l_i \;=\; x_i^\top\big(X^\top X\big)^{-1}x_i.$$

Recall that this quantity appeared in the definition of the conditional probability in Theorem 2.2, where the leverage score was computed w.r.t. the submatrix $X_T$. In fact, there is a more basic relationship between leverage scores and volume sampling: If set $S$ is sampled according to size $s=d$ volume sampling, then the leverage score $l_i$ of row $i$ is the marginal probability of selecting the $i$-th row into $S$. A general formula for the marginals of size $s$ volume sampling is given in the following proposition:

###### Proposition 2.4

Let $X\in\mathbb{R}^{n\times d}$ be a full rank matrix and $d\le s\le n$. If $S$ is sampled according to size $s$ volume sampling, then for any $i\in\{1,\dots,n\}$,

$$P(i\in S) \;=\; 1 - \frac{n-s}{n-d}\,\big(1 - l_i\big).$$

Proof  Instead of $P(i\in S)$ we will first compute $P(i\notin S)$:

$$P(i\notin S) \;=\; \sum_{|S|=s,\; i\notin S}\frac{\det\!\big(X_S^\top X_S\big)}{\binom{n-d}{s-d}\det\!\big(X^\top X\big)} \;=\; \frac{\binom{n-1-d}{s-d}\,\det\!\big(X_{-i}^\top X_{-i}\big)}{\binom{n-d}{s-d}\,\det\!\big(X^\top X\big)} \;=\; \frac{n-s}{n-d}\,\big(1-l_i\big),$$

where $X_{-i}$ denotes $X$ with the $i$-th row removed. Here we used Cauchy-Binet (2.1) twice: once for $X$ (the normalization) and once for $X_{-i}$ (summing over all size $s$ sets avoiding $i$); the last equality applies Sylvester’s theorem, $\det(X_{-i}^\top X_{-i}) = (1-l_i)\det(X^\top X)$. Now, the marginal probability follows from the fact that $P(i\in S) = 1 - P(i\notin S)$.
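A quick numerical check of the marginal formula by exhaustive enumeration, over every valid sample size (a numpy sketch with an arbitrary random design; for $s=d$ the marginals reduce to the leverage scores):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
n, d = 6, 2
X = rng.standard_normal((n, d))

# Leverage scores l_i = x_i^T (X^T X)^{-1} x_i.
lev = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)

for s in range(d, n + 1):
    subsets = list(combinations(range(n), s))
    dets = np.array([np.linalg.det(X[list(S)].T @ X[list(S)]) for S in subsets])
    probs = dets / dets.sum()
    # Exact marginals of size-s volume sampling by enumeration.
    marg = np.zeros(n)
    for p, S in zip(probs, subsets):
        marg[list(S)] += p
    # Proposition: P(i in S) = 1 - (n-s)/(n-d) * (1 - l_i).
    assert np.allclose(marg, 1 - (n - s) / (n - d) * (1 - lev))
```

At $s=d$ the formula gives `marg == lev` (leverage scores are the marginals), and at $s=n$ every marginal is 1, as expected.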

### 2.2 Expectation formulas for volume sampling

All expectations in the remainder of the paper are w.r.t. volume sampling. We use the short-hand $\mathbb{E}_{|S|=s}[\cdot]$ for expectation with volume sampling where the size of the sampled set is fixed to $s$. The expectation formulas for two choices of $F(S)$ are proven in Theorems 2.5 and 2.6. By Lemma 2.1, it suffices to show $F(T) = \sum_{i\in T}P(T{-}i\mid T)\,F(T{-}i)$ for volume sampling. We also present a related expectation formula (Theorem 2.7), which is proven later using different techniques.

Recall that $X_S$ is the submatrix of rows indexed by $S$. We also use a version of $X$ in which all but the rows of $S$ are zeroed out. This matrix equals $I_SX$, where $I_S$ is an $n$-dimensional diagonal matrix with $(I_S)_{ii}=1$ if $i\in S$ and 0 otherwise (see Figure 1.2).

###### Theorem 2.5

Let $X\in\mathbb{R}^{n\times d}$ be a tall full rank matrix (i.e. $n\ge d$). For $d\le s\le n$, let $S$ be a size $s$ volume sampled set over $\{1,\dots,n\}$. Then

$$\mathbb{E}\big[(I_SX)^+\big] \;=\; X^+.$$

For the special case of $s=d$, this fact was known in the linear
algebra literature (Ben-Tal and Teboulle, 1990; Ben-Israel, 1992).
It was shown there using elementary properties of the determinant
such as Cramer’s rule. (Using the composition property of volume sampling (Corollary 2.3),
the $s>d$ case of the theorem can be reduced to the $s=d$ case.
However, we give a different self-contained proof.)
The proof methodology developed here, based on reverse
iterative volume sampling, is very different.
We believe that this fundamental formula lies at the
core of why volume sampling is important in many applications. In this
work, we focus on its application to linear regression. However,
Avron and Boutsidis (2013) discuss many problems where controlling the
pseudoinverse of a submatrix is essential. For those
applications, it is important to establish variance bounds for the
above expectation and volume
sampling once again offers very concrete guarantees. We obtain them by
showing the following formula, which can be viewed as a second moment
for this estimator.

###### Theorem 2.6

Let $X\in\mathbb{R}^{n\times d}$ be a full rank matrix and $d\le s\le n$. If size $s$ volume sampling over $\{1,\dots,n\}$ has full support, then

$$\mathbb{E}\Big[(I_SX)^+\big((I_SX)^+\big)^{\!\top}\Big] \;=\; \frac{n-d+1}{s-d+1}\,\big(X^\top X\big)^{-1}.$$

In the case when volume sampling does not have full support, the matrix equality “$=$” above is replaced by the positive semi-definite inequality “$\preceq$”.

The condition that size $s$ volume sampling over $\{1,\dots,n\}$ has full support is equivalent to $\det(X_S^\top X_S)>0$ for all $S$ of size $s$. Note that if size $s$ volume sampling has full support, then size $s+1$ also has full support (every set of size $s+1$ contains a full rank subset of size $s$). So full support for the smallest size $s=d$ (often phrased as “$X$ is in general position”) implies that volume sampling w.r.t. any size $s\ge d$ has full support.

The above theorem immediately gives an expectation formula for the Frobenius norm of the estimator:

$$\mathbb{E}\Big[\big\|(I_SX)^+\big\|_F^2\Big] \;=\; \frac{n-d+1}{s-d+1}\,\big\|X^+\big\|_F^2. \qquad (2.2)$$

This norm formula was shown by Avron and Boutsidis (2013), with numerous applications. Theorem 2.6 can be viewed as a much stronger pre-trace version of the known norm formula. Also our proof techniques are quite different and much simpler. Note that if size $s$ volume sampling for $X$ does not have full support, then (2.2) becomes an inequality.

We now mention a second application of the above theorem in the context of linear regression, for the case when the response vector is modeled as a noisy linear transformation, i.e., $y = Xw + \xi$ for some $w\in\mathbb{R}^d$ and a random noise vector $\xi\in\mathbb{R}^n$ (detailed discussion in Section 4). In this case the matrix $\sigma^2\,(X_S^\top X_S)^{-1}$ can be interpreted as the covariance matrix of the least-squares estimator $w_S^* = X_S^+ y_S$ (for a fixed set $S$), and Theorem 2.6 gives an exact formula for the covariance matrix of $w_S^*$ under volume sampling. In Section 4, we give an extended version of this result which provides even stronger guarantees for regularized least-squares estimators under this model (Theorem 4.1).

Note that except for the above application, all results in
this paper hold for arbitrary response vectors $y$.
By combining Theorems 2.5 and 2.6, we
can also obtain a covariance-type formula for the pseudoinverse matrix
estimator (this notion of “covariance” is
used in random matrix theory, i.e. for a random matrix $M$
we analyze $\mathbb{E}\big[(M-\mathbb{E}[M])(M-\mathbb{E}[M])^\top\big]$; see for example
Tropp (2012)):