
## Abstract

Given $n$ vectors $x_i \in \mathbb{R}^d$, we want to fit a linear regression model for noisy labels $y_i \in \mathbb{R}$. The ridge estimator is a classical solution to this problem. However, when labels are expensive, we are forced to select only a small subset of vectors $x_i$ for which we obtain the labels $y_i$. We propose a new procedure for selecting the subset of vectors, such that the ridge estimator obtained from that subset offers strong statistical guarantees in terms of the mean squared prediction error over the entire dataset of $n$ labeled vectors. The number of labels needed is proportional to the statistical dimension of the problem, which is often much smaller than $d$. Our method is an extension of a joint subsampling procedure called volume sampling. A second major contribution is that we speed up volume sampling so that it is essentially as efficient as leverage score sampling, which is the main i.i.d. subsampling procedure for this task. Finally, we show theoretically and experimentally that volume sampling has a clear advantage over any i.i.d. sampling in the sparse label case.


Subsampling for Ridge Regression via Regularized Volume Sampling

## 1 Introduction

Given a matrix $X \in \mathbb{R}^{d\times n}$, we consider the task of fitting a linear model $w \in \mathbb{R}^d$ to a vector of labels $y \in \mathbb{R}^n$, where $y = X^\top w^* + \xi$ and the noise $\xi$ is a mean zero random vector with covariance matrix $\operatorname{Var}[\xi] \preceq \sigma^2 I$ for some $\sigma \geq 0$. A classical solution to this task is the ridge estimator:

$$\hat{w}^*_\lambda \;=\; \operatorname*{argmin}_{w\in\mathbb{R}^d}\, \|X^\top w - y\|^2 + \lambda\|w\|^2 \;=\; (XX^\top + \lambda I)^{-1} X y.$$
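As a quick reference, the closed form above can be computed directly. The sketch below is ours (the function name `ridge` does not come from the paper):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimator (X X^T + lam I)^{-1} X y, where X is d x n
    and holds one column x_i per example."""
    d = X.shape[0]
    # Solving the linear system is cheaper and more stable than forming the inverse.
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)
```

For $\lambda \to 0$ and full-rank $XX^\top$ this recovers the ordinary least squares solution.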

In many settings, obtaining the labels is expensive and we are forced to select a subset $S \subseteq \{1..n\}$ of label indices. Let $y_S$ be the subvector of labels indexed by $S$ and $X_S$ be the submatrix of the columns of $X$ indexed by $S$. We will show that if $S$ is sampled with a new variant of volume sampling [3] on the columns of $X$, then the ridge estimator for the subproblem $(X_S, y_S)$,

$$\hat{w}^{*(S)}_\lambda \;=\; (X_S X_S^\top + \lambda I)^{-1} X_S\, y_S,$$

has strong generalization properties with respect to the full problem $(X, y)$.

Volume sampling is a sampling technique which has received a lot of attention recently [3, 5, 6, 7, 16]. For a fixed size $s \geq d$, the original variant samples a subset $S$ of size $s$ with probability proportional to the squared volume of the parallelepiped spanned by the columns of $X_S$ [3]:

$$P(S) \;\propto\; \det(X_S X_S^\top). \qquad (1)$$

A simple approach for implementing volume sampling (introduced recently in [5]) is to start with the full set of column indices, $S = \{1..n\}$, and then iteratively select an index $i$ to be eliminated from the set $S$ with probability proportional to the change in matrix volume caused by removing the $i$-th column:

$$\text{Sample}\;\; i \sim P(i\,|\,S) = \frac{\det\big(X_{S-i} X_{S-i}^\top\big)}{(|S|-d)\,\det\big(X_S X_S^\top\big)}, \qquad \text{Update}\;\; S \leftarrow S - \{i\}. \qquad (2)$$

Note that when $|S| \leq d$, all of the matrices $X_{S-i} X_{S-i}^\top$ are singular, and so the distribution (2) becomes undefined. Motivated by this limitation, we propose a regularized variant, called $\lambda$-regularized volume sampling:

$$\text{Sample}\;\; i \sim P(i\,|\,S) \propto \frac{\det\big(X_{S-i} X_{S-i}^\top + \lambda I\big)}{\det\big(X_S X_S^\top + \lambda I\big)}, \qquad \text{Update}\;\; S \leftarrow S - \{i\}. \qquad (3)$$
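To make the update rule concrete, here is a deliberately naive sketch of procedure (3) that recomputes the determinant ratios at every step. The name `reg_vol_sample_naive` is ours, and this version is illustrative only; efficient implementations are the subject of Section 3:

```python
import numpy as np

def reg_vol_sample_naive(X, s, lam, seed=0):
    """Reverse iterative lambda-regularized volume sampling, eq. (3).
    Removes one column at a time from S = {0..n-1} until |S| = s, with
    P(remove i) proportional to det(X_{S-i} X_{S-i}^T + lam I)."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    S = list(range(n))
    while len(S) > s:
        dets = []
        for i in S:
            rest = [j for j in S if j != i]
            Xr = X[:, rest]
            dets.append(np.linalg.det(Xr @ Xr.T + lam * np.eye(d)))
        # The common denominator det(X_S X_S^T + lam I) cancels on normalization.
        p = np.array(dets) / np.sum(dets)
        drop = rng.choice(len(S), p=p)  # position within S of the index to remove
        S.pop(drop)
    return S
```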

Note that in the special case of no regularization (i.e., $\lambda = 0$), the unnormalized probabilities in (3) sum to $|S| - d$ (see equality (2)). However, when $\lambda > 0$, the sum of (3) depends on all of the columns of $X_S$ and not just on the size of $S$. This makes regularized volume sampling more complicated, and certain equalities proven in [5] for $\lambda = 0$ become inequalities.

Nevertheless, we were able to show that the proposed $\lambda$-regularized distribution exhibits a fundamental connection to ridge regression, and we introduce an efficient algorithm to sample from it in time $O(nd^2)$ up to logarithmic factors (see Theorem 4). In particular, we prove that when $S$ is sampled according to $\lambda$-regularized volume sampling of size $s \geq 2d_\lambda$, then the mean squared prediction error (MSPE) of the estimator $\hat{w}^{*(S)}_\lambda$ over the entire dataset is bounded:

$$E_S E_\xi\, \frac{1}{n}\big\|X^\top\big(\hat{w}^{*(S)}_\lambda - w^*\big)\big\|^2 = O\!\left(\frac{\sigma^2 d_\lambda}{s}\right), \quad \text{where} \quad d_\lambda = \operatorname{tr}\!\big(X^\top (XX^\top + \lambda I)^{-1} X\big) \qquad (4)$$

is the statistical dimension. If $\lambda_1, \ldots, \lambda_d$ are the eigenvalues of $XX^\top$, then $d_\lambda = \sum_{i=1}^d \frac{\lambda_i}{\lambda_i + \lambda}$. Note that $d_\lambda$ is decreasing in $\lambda$ and $d_0 = \operatorname{rank}(X) \leq d$. If the spectrum of the matrix $XX^\top$ decreases quickly, then $d_\lambda$ does so as well with increasing $\lambda$. When $\lambda$ is properly tuned, $d_\lambda$ is the effective degrees of freedom of the ridge regression problem. Our new lower bounds will show that the above upper bound for regularized volume sampling is essentially optimal with respect to the choice of a subsampling procedure.
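The eigenvalue form of $d_\lambda$ is easy to compute and makes the monotone decrease in $\lambda$ apparent. A small sketch (the name `stat_dim` is ours):

```python
import numpy as np

def stat_dim(X, lam):
    """Statistical dimension d_lambda = sum_i eig_i / (eig_i + lam),
    where eig_i are the eigenvalues of X X^T; for lam > 0 this equals
    tr(X^T (X X^T + lam I)^{-1} X)."""
    eigs = np.linalg.eigvalsh(X @ X.T)
    return float(np.sum(eigs / (eigs + lam)))
```

For $\lambda = 0$ and full-rank $X$ this returns $\operatorname{rank}(X) = d$.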

Volume sampling can be viewed as a non-i.i.d. extension of leverage score sampling [8], a widely used method where columns are sampled independently according to their leverage scores. Volume sampling has been shown to return better column subsets than its i.i.d. counterpart in many applications, such as experimental design, linear regression and graph theory [3, 5]. In this paper we additionally show that i.i.d. subsampling with respect to any fixed distribution (such as leverage score sampling) can require $\Omega(d \log d)$ labels to achieve good generalization for ridge regression, compared to $O(d_\lambda)$ for regularized volume sampling. We reinforce this claim experimentally in Section 4.

The main obstacle to using volume sampling in practice has been its high computational cost. Only recently have the first polynomial time algorithms been proposed for exact [5] and approximate [16] volume sampling (see Table 1 for a comparison).

In particular, the fastest algorithm for exact volume sampling runs in time $O(n^2 d)$, whereas exact leverage score sampling takes $O(nd^2)$ (in both cases, the dependence on the sample size $s$ is not asymptotically significant). In typical settings for experimental design [9] and active learning [19], the quality of the sample is more important than the runtime. However, for many modern datasets the number of examples $n$ is much larger than the dimension $d$, which makes existing algorithms for volume sampling infeasible. In this paper, we give an easy-to-implement volume sampling algorithm that with high probability runs in time $O((n+d)\,d^2)$. Thus we give the first volume sampling procedure which is essentially linear in $n$ and matches the time complexity of exact leverage score sampling. For example, the MSD dataset from the UCI data repository [17] has a number of examples $n$ far exceeding its dimension $d$. Our algorithm performed volume sampling on this dataset in 39 seconds, whereas the previously best algorithm [5] did not finish within hours. Sampling with leverage scores took 12 seconds on this dataset. Finally, our procedure also achieves regularized volume sampling for any $\lambda \geq 0$ with the same asymptotic running time.

### 1.1 Related work

Many variants of probability distributions based on the matrix determinant have been studied in the literature, including Determinantal Point Processes (DPPs) [15], k-DPPs [14] and volume sampling [3, 5, 6, 16], with applications to matrix approximation [7], clustering [13], recommender systems [10], etc. More recently, further theoretical results suggesting applications of volume sampling in linear regression were given by [5], where an expected loss bound for the unregularized least squares estimator was shown under volume sampling of size $s \geq d$. Moreover, Reverse Iterative Volume Sampling (the technique extended in this paper with regularization) was first proposed in [5].

Subset selection techniques for regression have long been studied in the field of experimental design [9]. More recently, computationally tractable techniques have been explored [2]. Statistical guarantees under i.i.d. subsampling in kernel ridge regression have been analyzed for uniform sampling [4] and leverage score sampling [1]. In this paper, we propose the first tractable non-i.i.d. subsampling procedure with strong statistical guarantees for the ridge estimator and show its benefits over using i.i.d. sampling approaches.

For the special case of volume sampling with size $s = d$, a polynomial time algorithm was developed by [6], and slightly improved by [11]. An exact sampling algorithm for arbitrary $s$ was given by [5], with time complexity $O(n^2 d)$. Also, [16] proposed a Markov-chain procedure which generates approximate volume samples in polynomial time. The algorithm proposed in this paper, running in time $O((n+d)\,d^2)$, enjoys a direct asymptotic speed-up over all of the above methods. Moreover, the procedure suffers only a small constant factor overhead over computing the exact leverage scores of matrix $X$.

### 1.2 Main results

The main contributions of this paper are two-fold:

1. Statistical: We define a regularized variant of volume sampling and show that it offers strong generalization guarantees for ridge regression in terms of mean squared error (MSE) and mean squared prediction error (MSPE).

2. Algorithmic: We propose a simple implementation of volume sampling, which not only extends the procedure to its regularized variant, but also offers a significant runtime improvement over the existing methods when $n \gg d$.

The key technical result shown in this paper, needed to obtain statistical guarantees for ridge regression, is the following property of regularized volume sampling (where $d_\lambda$ is defined as in (4)):

###### Theorem 1

For any $X \in \mathbb{R}^{d\times n}$, $\lambda \geq 0$ and $d_\lambda \leq s \leq n$, let $S$ be sampled according to $\lambda$-regularized size $s$ volume sampling from $X$. Then,

$$E_S\, \big(X_S X_S^\top + \lambda I\big)^{-1} \;\preceq\; \frac{n - d_\lambda + 1}{s - d_\lambda + 1}\, \big(XX^\top + \lambda I\big)^{-1},$$

where $\preceq$ denotes the positive semi-definite ordering of matrices.

As a consequence of Theorem 1, we show that ridge estimators computed from volume sampled subproblems offer statistical guarantees with respect to the full regression problem $(X, y)$, despite observing only a small portion of the labels.

###### Theorem 2

Let $X \in \mathbb{R}^{d\times n}$ and $\lambda, \sigma > 0$, and suppose that $y = X^\top w^* + \xi$, where $\xi$ is a mean zero vector with $\operatorname{Var}[\xi] \preceq \sigma^2 I$. Let $S$ be sampled according to $\lambda$-regularized size $s$ volume sampling from $X$, and let $\hat{w}^{*(S)}_\lambda$ be the $\lambda$-ridge estimator of $w^*$ computed from the subproblem $(X_S, y_S)$. Then, if $\lambda \leq \sigma^2/\|w^*\|^2$, we have

$$\textbf{(MSPE)} \qquad E_S E_\xi\, \frac{1}{n}\big\|X^\top\big(\hat{w}^{*(S)}_\lambda - w^*\big)\big\|^2 \;\leq\; \frac{\sigma^2\, d_\lambda}{s - d_\lambda + 1},$$
$$\textbf{(MSE)} \qquad E_S E_\xi\, \big\|\hat{w}^{*(S)}_\lambda - w^*\big\|^2 \;\leq\; \frac{\sigma^2\, n\, \operatorname{tr}\big((XX^\top + \lambda I)^{-1}\big)}{s - d_\lambda + 1}.$$

Next, we present two lower bounds for the MSPE of a subsampled ridge estimator which show that the statistical guarantees achieved by regularized volume sampling are essentially optimal with respect to the sample size $s$, and better than what any i.i.d. subsampling can achieve when $s$ is small. In particular, we show that the non-i.i.d. nature of volume sampling is essential if we want to achieve good generalization when the number of labels is close to $d_\lambda$. Namely, for certain data matrices, any subsampling procedure which selects examples in an i.i.d. fashion (e.g., leverage score sampling) requires $\Omega(d \log d)$ labels to achieve MSPE below $\sigma^2$, whereas volume sampling obtains that bound for any matrix $X$ with $O(d_\lambda)$ labels.

###### Theorem 3

For any $p \geq 1$ and $\sigma > 0$, there is a dimension $d$ such that for any sufficiently large $n$ divisible by $d$ there exists a matrix $X \in \mathbb{R}^{d\times n}$ such that

$$d_\lambda(X) \;\geq\; p \quad \text{for any} \;\; 0 \leq \lambda \leq \sigma^2,$$

and for each of the following two statements there is a vector $w^* \in \mathbb{R}^d$ for which the corresponding regression problem $y = X^\top w^* + \xi$ (with $\operatorname{Var}[\xi] = \sigma^2 I$) satisfies that statement:

1. For any subset $S \subseteq \{1..n\}$ of size $s$,

$$E_\xi\, \frac{1}{n}\big\|X^\top\big(\hat{w}^{*(S)}_\lambda - w^*\big)\big\|^2 \;\geq\; \frac{\sigma^2\, d_\lambda}{s + d_\lambda};$$
2. For any multiset $S$ of size $s \leq \frac{d}{2}\ln d$, sampled i.i.d. from any distribution over $\{1..n\}$,

$$E_S E_\xi\, \frac{1}{n}\big\|X^\top\big(\hat{w}^{*(S)}_\lambda - w^*\big)\big\|^2 \;\geq\; \sigma^2.$$
Finally, we propose an algorithm for regularized volume sampling which with high probability runs in time $O((n+d)\,d^2)$ (ignoring logarithmic factors). For the previously studied case of $\lambda = 0$, this algorithm offers a significant asymptotic speed-up over existing volume sampling algorithms (both exact and approximate).

###### Theorem 4

For any $X \in \mathbb{R}^{d\times n}$, $\lambda \geq 0$, $s \leq n$ and $\delta > 0$, there is an algorithm sampling according to $\lambda$-regularized size $s$ volume sampling, that with probability at least $1 - \delta$ runs in time

$$O\Big(\big(n + d + \log(nd)\,\log\tfrac{1}{\delta}\big)\, d^2\Big).$$

The time complexity of our proposed algorithm is not deterministic, but its dependence on the failure probability $\delta$ is very small: even for $\delta$ polynomially small in $n$ and $d$, the $\log(nd)\log\frac{1}{\delta}$ term is only $O(\log^2(nd))$, and the overall time complexity remains $O((n+d)\,d^2)$.

The remainder of this paper is arranged as follows. In Section 2 we present the statistical analysis of regularized volume sampling in the context of ridge regression, proving Theorems 1, 2 and 3. In Section 3 we present two algorithms for regularized volume sampling and use them to prove Theorem 4. In Section 4 we evaluate the runtime of our algorithms on several standard linear regression datasets, and compare the prediction performance of the subsampled ridge estimator under volume sampling versus leverage score sampling. In Section 5 we summarize the results and suggest future research directions.

## 2 Statistical guarantees

In this section, we show upper and lower bounds on the generalization performance of subsampled ridge estimators, starting with an important property of regularized volume sampling which connects it with ridge regression. Throughout the proofs we use the shorthand $Z_\lambda^{(S)} := X_S X_S^\top + \lambda I$, so that $Z_\lambda^{(\{1..n\})} = XX^\top + \lambda I$.

### 2.1 Proof of Theorem 1

To obtain this result, we will show how the expectation of the matrix $(X_S X_S^\top + \lambda I)^{-1}$ changes when a single column is removed in $\lambda$-regularized volume sampling (see (3)):

###### Lemma 5

Let $X \in \mathbb{R}^{d\times n}$, $\lambda \geq 0$ and $S \subseteq \{1..n\}$ of size $s > d_\lambda$.
If we sample $i \in S$ w.p. $P(i\,|\,S)$ as in (3), then

$$E_i\, \big(X_{S-i} X_{S-i}^\top + \lambda I\big)^{-1} \;\preceq\; \frac{s - d_\lambda + 1}{s - d_\lambda}\, \big(X_S X_S^\top + \lambda I\big)^{-1}.$$

Proof  We write the unnormalized probability of index $i$ as:

$$h_i(S) \;=\; \frac{\det\big(Z_\lambda^{(S-i)}\big)}{\det\big(Z_\lambda^{(S)}\big)} \;\stackrel{(*)}{=}\; 1 - x_i^\top \big(Z_\lambda^{(S)}\big)^{-1} x_i,$$

where $(*)$ follows from Sylvester's determinant theorem. Next, letting $M := \sum_{i\in S} h_i(S)$ denote the normalization factor, we compute the unnormalized expectation by applying the Sherman-Morrison formula to $\big(Z_\lambda^{(S-i)}\big)^{-1}$:

$$\begin{aligned}
M\, E_i\, \big(X_{S-i} X_{S-i}^\top + \lambda I\big)^{-1} &= \sum_{i\in S} h_i(S)\, \big(Z_\lambda^{(S-i)}\big)^{-1} \\
&= \sum_{i\in S} h_i(S) \left( \big(Z_\lambda^{(S)}\big)^{-1} + \frac{\big(Z_\lambda^{(S)}\big)^{-1} x_i x_i^\top \big(Z_\lambda^{(S)}\big)^{-1}}{1 - x_i^\top \big(Z_\lambda^{(S)}\big)^{-1} x_i} \right) \\
&= M\, \big(Z_\lambda^{(S)}\big)^{-1} + \big(Z_\lambda^{(S)}\big)^{-1} \Big(\sum_{i\in S} x_i x_i^\top\Big) \big(Z_\lambda^{(S)}\big)^{-1} \\
&= M\, \big(Z_\lambda^{(S)}\big)^{-1} + \big(Z_\lambda^{(S)}\big)^{-1} \big(Z_\lambda^{(S)} - \lambda I\big) \big(Z_\lambda^{(S)}\big)^{-1} \\
&= M\, \big(Z_\lambda^{(S)}\big)^{-1} + \big(Z_\lambda^{(S)}\big)^{-1} - \lambda\, \big(Z_\lambda^{(S)}\big)^{-2} \\
&\preceq (M+1)\, \big(Z_\lambda^{(S)}\big)^{-1}.
\end{aligned}$$

Finally, we lower-bound the normalization factor:

$$\begin{aligned}
M &= \sum_{i\in S} \Big(1 - x_i^\top \big(Z_\lambda^{(S)}\big)^{-1} x_i\Big) = s - \operatorname{tr}\Big(X_S^\top \big(Z_\lambda^{(S)}\big)^{-1} X_S\Big) = s - \operatorname{tr}\Big(X_S X_S^\top \big(Z_\lambda^{(S)}\big)^{-1}\Big) \\
&= s - \operatorname{tr}\Big(\big(Z_\lambda^{(S)} - \lambda I\big)\big(Z_\lambda^{(S)}\big)^{-1}\Big) = s - \operatorname{tr}(I) + \lambda\, \operatorname{tr}\Big(\big(Z_\lambda^{(S)}\big)^{-1}\Big) \\
&\geq s - d + \lambda\, \operatorname{tr}\Big(\big(Z_\lambda^{(\{1..n\})}\big)^{-1}\Big) = s - d_\lambda,
\end{aligned}$$

where the inequality uses $Z_\lambda^{(S)} \preceq Z_\lambda^{(\{1..n\})}$, and the last equality uses the fact that $d_\lambda$ can be rewritten as $d - \lambda\,\operatorname{tr}\big((XX^\top + \lambda I)^{-1}\big)$. Putting the bounds together (and noting that $\frac{M+1}{M}$ is decreasing in $M$), we obtain the result. $\square$
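The two facts used above, namely $d_\lambda = d - \lambda\operatorname{tr}\big((XX^\top+\lambda I)^{-1}\big)$ and $\sum_{i} h_i(S) = s - d + \lambda\operatorname{tr}\big((Z_\lambda^{(S)})^{-1}\big) \geq s - d_\lambda$, are easy to check numerically. A verification sketch with arbitrary data (not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, lam = 3, 10, 0.7
X = rng.standard_normal((d, n))
S = [0, 1, 2, 4, 6, 8]          # an arbitrary subset, |S| = s
s = len(S)

Zs_inv = np.linalg.inv(X[:, S] @ X[:, S].T + lam * np.eye(d))
Z_inv = np.linalg.inv(X @ X.T + lam * np.eye(d))

d_lam = np.trace(X.T @ Z_inv @ X)
M = sum(1 - X[:, i] @ Zs_inv @ X[:, i] for i in S)

assert np.isclose(d_lam, d - lam * np.trace(Z_inv))   # rewriting of d_lambda
assert np.isclose(M, s - d + lam * np.trace(Zs_inv))  # normalization identity
assert M >= s - d_lam - 1e-9                          # lower bound from the proof
```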

To prove Theorem 1, it remains to chain the conditional expectations along the sequence of subsets $\{1..n\} = S_n \supset S_{n-1} \supset \cdots \supset S_s = S$ obtained by $\lambda$-regularized volume sampling:

$$E_S\, \big(Z_\lambda^{(S)}\big)^{-1} \;\preceq\; \left(\prod_{t=s+1}^{n} \frac{t - d_\lambda + 1}{t - d_\lambda}\right) \big(Z_\lambda^{(\{1..n\})}\big)^{-1} \;=\; \frac{n - d_\lambda + 1}{s - d_\lambda + 1}\, \big(XX^\top + \lambda I\big)^{-1}. \qquad \square$$

### 2.2 Proof of Theorem 2

Standard analysis for the ridge regression estimator proceeds by performing a bias-variance decomposition of the error, and then selecting $\lambda$ so that the bias can be appropriately bounded. We recall this calculation for a fixed subproblem $(X_S, y_S)$. First, we compute the bias of the ridge estimator for a fixed set $S$:

$$\begin{aligned}
\operatorname{Bias}_\xi\big[\hat{w}^{*(S)}_\lambda\big] &= E_\xi\big[\hat{w}^{*(S)}_\lambda\big] - w^* = E_\xi\Big[\big(Z_\lambda^{(S)}\big)^{-1} X_S\, y_S\Big] - w^* \\
&= \big(Z_\lambda^{(S)}\big)^{-1} X_S \big(X_S^\top w^* + E_\xi[\xi_S]\big) - w^* \\
&= \Big(\big(Z_\lambda^{(S)}\big)^{-1} X_S X_S^\top - I\Big)\, w^* \;=\; -\lambda\, \big(Z_\lambda^{(S)}\big)^{-1} w^*.
\end{aligned}$$
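The closed-form bias $-\lambda (Z_\lambda^{(S)})^{-1} w^*$ can be verified numerically by feeding the estimator its expected (noiseless) labels. A verification sketch with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(4)
d, lam = 4, 0.3
Xs = rng.standard_normal((d, 7))   # columns of X indexed by S
w_star = rng.standard_normal(d)

Zs = Xs @ Xs.T + lam * np.eye(d)
# E_xi[w_hat]: plug in y_S = X_S^T w* (since E[xi_S] = 0)
w_hat_mean = np.linalg.solve(Zs, Xs @ (Xs.T @ w_star))
bias = w_hat_mean - w_star

assert np.allclose(bias, -lam * np.linalg.solve(Zs, w_star))
```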

Similarly, the covariance matrix of $\hat{w}^{*(S)}_\lambda$ satisfies:

$$\begin{aligned}
\operatorname{Var}_\xi\big[\hat{w}^{*(S)}_\lambda\big] &= \big(Z_\lambda^{(S)}\big)^{-1} X_S\, \operatorname{Var}_\xi[\xi_S]\, X_S^\top \big(Z_\lambda^{(S)}\big)^{-1} \;\preceq\; \sigma^2\, \big(Z_\lambda^{(S)}\big)^{-1} X_S X_S^\top \big(Z_\lambda^{(S)}\big)^{-1} \\
&= \sigma^2 \Big(\big(Z_\lambda^{(S)}\big)^{-1} - \lambda\, \big(Z_\lambda^{(S)}\big)^{-2}\Big).
\end{aligned}$$

The mean squared error of the ridge estimator for a fixed subset $S$ can now be bounded by:

$$\begin{aligned}
E_\xi\big\|\hat{w}^{*(S)}_\lambda - w^*\big\|^2 &= \operatorname{tr}\Big(\operatorname{Var}_\xi\big[\hat{w}^{*(S)}_\lambda\big]\Big) + \big\|\operatorname{Bias}_\xi\big[\hat{w}^{*(S)}_\lambda\big]\big\|^2 \\
&\leq \sigma^2\, \operatorname{tr}\Big(\big(Z_\lambda^{(S)}\big)^{-1} - \lambda\big(Z_\lambda^{(S)}\big)^{-2}\Big) + \lambda^2\, \operatorname{tr}\Big(\big(Z_\lambda^{(S)}\big)^{-2} w^* w^{*\top}\Big) \\
&\leq \sigma^2\, \operatorname{tr}\Big(\big(Z_\lambda^{(S)}\big)^{-1}\Big) + \lambda\, \operatorname{tr}\Big(\big(Z_\lambda^{(S)}\big)^{-2}\Big)\big(\lambda \|w^*\|^2 - \sigma^2\big) \qquad (5) \\
&\leq \sigma^2\, \operatorname{tr}\Big(\big(Z_\lambda^{(S)}\big)^{-1}\Big), \qquad (6)
\end{aligned}$$

where in (5) we applied the Cauchy-Schwarz inequality for the matrix trace, and in (6) we used the assumption that $\lambda \leq \sigma^2/\|w^*\|^2$. Thus, taking the expectation over the sampling of the set $S$, we get

$$\begin{aligned}
E_S E_\xi\, \big\|\hat{w}^{*(S)}_\lambda - w^*\big\|^2 &\leq \sigma^2\, E_S\, \operatorname{tr}\Big(\big(Z_\lambda^{(S)}\big)^{-1}\Big) \\
&\stackrel{\text{(Thm. 1)}}{\leq} \sigma^2\, \frac{n - d_\lambda + 1}{s - d_\lambda + 1}\, \operatorname{tr}\Big(\big(Z_\lambda^{(\{1..n\})}\big)^{-1}\Big) \qquad (7) \\
&\leq \frac{\sigma^2\, n\, \operatorname{tr}\big((XX^\top + \lambda I)^{-1}\big)}{s - d_\lambda + 1}.
\end{aligned}$$

Next, we bound the mean squared prediction error. As before, we start with the standard bias-variance decomposition for a fixed set $S$:

$$\begin{aligned}
E_\xi\, \big\|X^\top\big(\hat{w}^{*(S)}_\lambda - w^*\big)\big\|^2 &= \operatorname{tr}\Big(\operatorname{Var}_\xi\big[X^\top \hat{w}^{*(S)}_\lambda\big]\Big) + \big\|X^\top\big(E_\xi\big[\hat{w}^{*(S)}_\lambda\big] - w^*\big)\big\|^2 \\
&\leq \sigma^2\, \operatorname{tr}\Big(X^\top \Big(\big(Z_\lambda^{(S)}\big)^{-1} - \lambda\big(Z_\lambda^{(S)}\big)^{-2}\Big) X\Big) + \lambda^2\, \operatorname{tr}\Big(\big(Z_\lambda^{(S)}\big)^{-1} X X^\top \big(Z_\lambda^{(S)}\big)^{-1} w^* w^{*\top}\Big) \\
&\leq \sigma^2\, \operatorname{tr}\Big(X^\top \big(Z_\lambda^{(S)}\big)^{-1} X\Big) + \lambda\, \operatorname{tr}\Big(X^\top \big(Z_\lambda^{(S)}\big)^{-2} X\Big)\big(\lambda\|w^*\|^2 - \sigma^2\big) \\
&\leq \sigma^2\, \operatorname{tr}\Big(X^\top \big(Z_\lambda^{(S)}\big)^{-1} X\Big).
\end{aligned}$$

Once again, taking the expectation over the subset $S$, we have

$$\begin{aligned}
E_S E_\xi\, \frac{1}{n}\big\|X^\top\big(\hat{w}^{*(S)}_\lambda - w^*\big)\big\|^2 &\leq \frac{\sigma^2}{n}\, E_S\, \operatorname{tr}\Big(X^\top \big(Z_\lambda^{(S)}\big)^{-1} X\Big) = \frac{\sigma^2}{n}\, \operatorname{tr}\Big(X^\top E_S\Big[\big(Z_\lambda^{(S)}\big)^{-1}\Big] X\Big) \\
&\stackrel{\text{(Thm. 1)}}{\leq} \frac{\sigma^2}{n}\cdot \frac{n - d_\lambda + 1}{s - d_\lambda + 1}\, \operatorname{tr}\Big(X^\top \big(Z_\lambda^{(\{1..n\})}\big)^{-1} X\Big) \qquad (8) \\
&\leq \frac{\sigma^2\, d_\lambda}{s - d_\lambda + 1}. \qquad \square
\end{aligned}$$

The key step in proving both bounds is the application of Theorem 1. For the MSE, we only used the trace version of the inequality (see (7)); however, to obtain the bound on the MSPE we used the more general positive semi-definite inequality in (8).

### 2.3 Proof of Theorem 3

Let $d = p + 1$ and let $n \geq \lceil\sigma^2\rceil\, d\,(d-1)$ be divisible by $d$. We define

$$X \;\stackrel{\text{def}}{=}\; [\,I, \ldots, I\,] \in \mathbb{R}^{d\times n}, \qquad w^{*\top} \;\stackrel{\text{def}}{=}\; [\,a\sigma, \ldots, a\sigma\,] \in \mathbb{R}^{1\times d}$$

for some $a > 0$ (chosen separately in each part). For any $0 \leq \lambda \leq \sigma^2$, the $\lambda$-statistical dimension of $X$ is

$$d_\lambda \;=\; \operatorname{tr}\Big(X^\top \big(Z_\lambda^{(\{1..n\})}\big)^{-1} X\Big) \;\geq\; \frac{\lceil\sigma^2\rceil\, d\,(d-1)}{\lceil\sigma^2\rceil\,(d-1) + \lambda} \;\geq\; \frac{d\,(d-1)}{d - 1 + 1} \;\geq\; p.$$

Let $S$ be any multiset of indices of size $s$, and for $i = 1, \ldots, d$ let

$$s_i \;\stackrel{\text{def}}{=}\; \big|\{\, j \in S : x_j = e_i \,\}\big|.$$

The prediction variance of the estimator $\hat{w}^{*(S)}_\lambda$ is equal to

$$\operatorname{tr}\Big(\operatorname{Var}_\xi\big[X^\top \hat{w}^{*(S)}_\lambda\big]\Big) = \sigma^2\, \operatorname{tr}\Big(X^\top\Big(\big(Z_\lambda^{(S)}\big)^{-1} - \lambda\big(Z_\lambda^{(S)}\big)^{-2}\Big)X\Big) = \sigma^2\, \frac{n}{d} \sum_{i=1}^d \left(\frac{1}{s_i + \lambda} - \frac{\lambda}{(s_i + \lambda)^2}\right) = \sigma^2\, \frac{n}{d} \sum_{i=1}^d \frac{s_i}{(s_i + \lambda)^2}.$$

The squared prediction bias of the estimator is equal to

$$\big\|X^\top\big(E_\xi\big[\hat{w}^{*(S)}_\lambda\big] - w^*\big)\big\|^2 = \lambda^2\, w^{*\top} \big(Z_\lambda^{(S)}\big)^{-1} X X^\top \big(Z_\lambda^{(S)}\big)^{-1} w^* = \lambda^2 a^2 \sigma^2\, \frac{n}{d}\, \operatorname{tr}\Big(\big(Z_\lambda^{(S)}\big)^{-2}\Big) = \lambda^2 a^2 \sigma^2\, \frac{n}{d} \sum_{i=1}^d \frac{1}{(s_i + \lambda)^2}.$$

Thus, the MSPE of the estimator is equal to:

$$E_\xi\, \frac{1}{n}\big\|X^\top\big(\hat{w}^{*(S)}_\lambda - w^*\big)\big\|^2 = \frac{1}{n}\operatorname{tr}\Big(\operatorname{Var}_\xi\big[X^\top\hat{w}^{*(S)}_\lambda\big]\Big) + \frac{1}{n}\big\|X^\top\big(E_\xi\big[\hat{w}^{*(S)}_\lambda\big] - w^*\big)\big\|^2 = \frac{\sigma^2}{d} \sum_{i=1}^d \left(\frac{s_i}{(s_i+\lambda)^2} + \frac{a^2\lambda^2}{(s_i+\lambda)^2}\right) = \frac{\sigma^2}{d} \sum_{i=1}^d \frac{s_i + a^2\lambda^2}{(s_i+\lambda)^2}.$$
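For the construction $X = [I,\ldots,I]$, this closed form can be cross-checked against the direct trace computation. A verification sketch (the subset and constants below are arbitrary choices of ours):

```python
import numpy as np

d, k = 3, 4                        # n = k*d copies of the identity
n = k * d
X = np.hstack([np.eye(d)] * k)     # column j of X equals e_{j mod d}
a, sigma, lam = 1.0, 2.0, 0.5
w_star = a * sigma * np.ones(d)

S = [0, 1, 3, 4, 6]                # arbitrary multiset of column indices
s_counts = np.array([sum(1 for j in S if j % d == i) for i in range(d)])

Zs_inv = np.linalg.inv(X[:, S] @ X[:, S].T + lam * np.eye(d))
var_term = sigma**2 * np.trace(X.T @ (Zs_inv - lam * Zs_inv @ Zs_inv) @ X) / n
bias_term = lam**2 * np.linalg.norm(X.T @ Zs_inv @ w_star)**2 / n
closed_form = sigma**2 / d * np.sum((s_counts + a**2 * lam**2) / (s_counts + lam)**2)

assert np.isclose(var_term + bias_term, closed_form)
```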

Next, we find the $\lambda$ that minimizes this expression. Taking the derivative with respect to $\lambda$, we get:

$$\frac{\partial}{\partial\lambda}\left(\frac{\sigma^2}{d}\sum_{i=1}^d \frac{s_i + a^2\lambda^2}{(s_i+\lambda)^2}\right) \;=\; \frac{\sigma^2}{d}\sum_{i=1}^d \frac{2\, a^2 s_i\,(\lambda - a^{-2})}{(s_i+\lambda)^3}.$$

Thus, since at least one $s_i$ has to be positive, for any set $S$ the derivative is negative for $\lambda < a^{-2}$ and positive for $\lambda > a^{-2}$, so the unique minimum of the MSPE is achieved at $\lambda = a^{-2}$, regardless of which subset $S$ is chosen. Since we are seeking a lower bound, we can therefore focus on the case $\lambda = a^{-2}$.

#### Proof of Part 1

Let $a = 1$. As shown above, we can assume that $\lambda = a^{-2} = 1$. In this case the formula simplifies to:

$$E_\xi\, \frac{1}{n}\big\|X^\top\big(\hat{w}^{*(S)}_\lambda - w^*\big)\big\|^2 = \frac{\sigma^2}{d}\sum_{i=1}^d \frac{s_i + 1}{(s_i+1)^2} = \frac{\sigma^2}{d}\sum_{i=1}^d \frac{1}{s_i+1} \;\stackrel{(*)}{\geq}\; \frac{\sigma^2}{\frac{s}{d}+1} = \frac{\sigma^2\, d}{s + d} \;\geq\; \frac{\sigma^2\, d_\lambda}{s + d_\lambda},$$

where $(*)$ follows by applying Jensen's inequality to the convex function $z \mapsto \frac{1}{z+1}$.

#### Proof of Part 2

Let $a^2 = 2d$. As shown above, we can assume that $\lambda = a^{-2} = \frac{1}{2d}$. Suppose that the multiset $S$ is sampled i.i.d. from some distribution over the set $\{1..n\}$. Using standard analysis for the Coupon Collector's problem, it can be shown that if $s \leq \frac{d}{2}\ln d$, then with probability at least $\frac{1}{2}$ there is an $i$ such that $s_i = 0$ (i.e., one of the unit vectors $e_i$ was never selected). Thus, the MSPE can be lower-bounded as follows:

$$E_S E_\xi\, \frac{1}{n}\big\|X^\top\big(\hat{w}^{*(S)}_\lambda - w^*\big)\big\|^2 \;\geq\; \frac{1}{2}\cdot\frac{\sigma^2}{d}\cdot\frac{s_i + a^2\lambda^2}{(s_i+\lambda)^2} \;=\; \frac{\sigma^2}{2d}\cdot\frac{2d\,\lambda^2}{\lambda^2} \;=\; \sigma^2. \qquad \square$$

## 3 Algorithms

In this section, we present two algorithms that implement $\lambda$-regularized volume sampling (3). The first one, RegVol (Algorithm 1), simply adds regularization to the algorithm proposed in [5] for Reverse Iterative Volume Sampling (2). In this algorithm the conditional distribution is recomputed at every iteration, which leads to $O(n^2 d)$ time complexity. The second algorithm, FastRegVol (Algorithm 2), is our new algorithm, which avoids recomputing the conditional distribution at every step, making it essentially as fast as exact leverage score sampling.

The correctness of RegVol follows from Lemma 6, which is a straightforward application of the Sherman-Morrison formula (see [5] for more details).

###### Lemma 6

For any matrix $X \in \mathbb{R}^{d\times n}$, set $S \subseteq \{1..n\}$ and two distinct indices $i, j \in S$, we have

$$1 - x_j^\top \big(Z_\lambda^{(S-i)}\big)^{-1} x_j \;=\; 1 - x_j^\top \big(Z_\lambda^{(S)}\big)^{-1} x_j - a^2,$$

where

$$a \;=\; \frac{x_j^\top \big(Z_\lambda^{(S)}\big)^{-1} x_i}{\sqrt{1 - x_i^\top \big(Z_\lambda^{(S)}\big)^{-1} x_i}}.$$
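Lemma 6 is easy to sanity-check numerically. A verification sketch with random data (requires $\lambda > 0$ so that $1 - x_i^\top (Z_\lambda^{(S)})^{-1} x_i > 0$):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, lam = 3, 8, 0.4
X = rng.standard_normal((d, n))
S = list(range(n))
i, j = 2, 5                        # two distinct indices in S

Z_inv = np.linalg.inv(X[:, S] @ X[:, S].T + lam * np.eye(d))
S_minus_i = [t for t in S if t != i]
Zmi_inv = np.linalg.inv(X[:, S_minus_i] @ X[:, S_minus_i].T + lam * np.eye(d))

a = (X[:, j] @ Z_inv @ X[:, i]) / np.sqrt(1 - X[:, i] @ Z_inv @ X[:, i])
lhs = 1 - X[:, j] @ Zmi_inv @ X[:, j]
rhs = 1 - X[:, j] @ Z_inv @ X[:, j] - a**2

assert np.isclose(lhs, rhs)
```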

We propose a new volume sampling algorithm which runs in time $O((n+d)\,d^2)$, significantly faster than RegVol when $n \gg d$. Our key observation is that updating the full conditional distribution at every iteration is wasteful, since the distribution changes very slowly throughout the procedure. Moreover, the unnormalized weights $h_i(S) = 1 - x_i^\top \big(Z_\lambda^{(S)}\big)^{-1} x_i$, which are computed in the process, are all bounded by 1. Thus, to sample from the correct distribution at any given iteration, we can employ rejection sampling as follows:

1. Sample $i$ uniformly from the set $S$,

2. Compute $h_i(S) = 1 - x_i^\top \big(Z_\lambda^{(S)}\big)^{-1} x_i$,

3. Accept $i$ with probability $h_i(S)$,

4. Otherwise, draw another sample.

Note that this rejection sampling can be employed locally, within each iteration of the algorithm. Thus, one rejection does not revert us back to the beginning of the algorithm. Moreover, if the probability of acceptance is high, then this strategy requires computing only a small number of weights per iteration of the algorithm, as opposed to updating all of them. This turns out to be the case, as shown below. The full pseudo-code of the sampling procedure is given in Algorithm 2, called FastRegVol.
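The rejection loop above, combined with a Sherman-Morrison update of $(Z_\lambda^{(S)})^{-1}$ after each accepted removal, can be sketched as follows. This is an illustrative implementation of ours, not the authors' reference code; see Algorithm 2 for the full pseudo-code:

```python
import numpy as np

def fast_reg_vol(X, s, lam, seed=0):
    """Rejection-sampling sketch of lambda-regularized volume sampling.
    Proposes a uniform index, accepts with probability h_i(S) <= 1, and
    maintains (X_S X_S^T + lam I)^{-1} via Sherman-Morrison updates.
    Assumes lam > 0 (or s >= d) so that all weights h_i(S) are positive."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    S = list(range(n))
    Z_inv = np.linalg.inv(X @ X.T + lam * np.eye(d))
    while len(S) > s:
        while True:
            i = S[rng.integers(len(S))]             # step 1: uniform proposal
            h = 1.0 - X[:, i] @ Z_inv @ X[:, i]     # step 2: unnormalized weight
            if rng.random() < h:                    # step 3: accept w.p. h_i(S)
                break                               # step 4: otherwise retry
        v = Z_inv @ X[:, i]                         # Sherman-Morrison: inverse of
        Z_inv = Z_inv + np.outer(v, v) / h          # Z - x_i x_i^T after removal
        S.remove(i)
    return S
```

One rejection only repeats the inner loop, so the cost per iteration stays small as long as the acceptance probability is high.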

### 3.1 Proof of Theorem 4

To simplify the analysis, we combine RegVol and FastRegVol, running FastRegVol while the subset $S$ has size at least