A statistical perspective of sampling scores for linear regression
Abstract
In this paper, we consider the statistical problem of learning a linear model from noisy samples. Existing work has focused on approximating the least squares solution by using leverage-based scores as an importance sampling distribution. However, no finite-sample statistical guarantees and no computationally efficient optimal sampling strategies have been proposed. To evaluate the statistical properties of different sampling strategies, we propose a simple yet effective estimator, which is amenable to theoretical analysis and is useful in multi-task linear regression. We derive the exact mean squared error of the proposed estimator for any given sampling scores. By minimizing the mean squared error, we obtain the optimal sampling scores for both the estimator and the predictor, and show that they are influenced by the noise-to-signal ratio. Numerical simulations match the theoretical analysis well.
I Introduction
In many engineering problems, it is expensive to measure and store a large set of samples. Motivated by this, there has been a great deal of work on developing effective sampling strategies for a variety of matrix-based problems, including compressed sensing [2, 3], adaptive sampling for signal recovery and detection [4, 5, 6], adaptive sampling for matrix approximation and completion [7, 8, 9], and many others. At the same time, motivated by computational efficiency, statistically effective sampling strategies have been developed for least-squares approximation [10], least absolute deviations regression [11], and low-rank matrix approximation [12].
One way to develop a sampling strategy is to design a data-dependent importance sampling distribution from which to sample the data. A widely used sampling distribution for selecting rows and columns of the input matrix is based on the leverage scores, defined formally in Section II. Typically, the leverage scores are computed approximately [13, 11], or otherwise, a random projection is used to precondition by approximately uniformizing them [14, 15, 16, 11, 17]. A detailed discussion of this approach can be found in the recent review on randomized algorithms for matrices and matrix-based data problems [18]. Even though leverage-based sampling distributions are widely used, for many problems it is unclear why they should work well from a statistical perspective.
For the problem of low-rank matrix approximation, [19] provided an extensive empirical evaluation of several sampling distributions and showed that iterative norm sampling outperforms leverage score based sampling empirically. [20] showed that for low-rank matrix approximation, sampling based on the square roots of the leverage scores always statistically outperforms uniform sampling, and outperforms leverage score based sampling when the leverage scores are nonuniform. Further, the authors proposed a constrained optimization problem yielding optimal sampling probabilities, which lie between the leverage scores and their square roots.
For the problem of linear regression, the optimal sampling strategies for estimation and prediction are known to be obtained via A- and V-optimality criteria, respectively [21], but the resulting sampling strategies are combinatorial. Computationally efficient optimal sampling strategies remain unknown. [22] showed that, based on the sampled least squares, neither leverage score based sampling nor uniform sampling outperforms the other: the leveraging-based estimator can suffer from a large variance when the leverage scores are nonuniform, while uniform sampling is less vulnerable to small leverage scores. [23] only analyzed the asymptotic behavior of the sampled least squares, and the corresponding optimal sampling remains unclear.
In this paper, we propose an estimator with a computationally efficient sampling strategy that also comes with closed-form, finite-sample guarantees on performance. We derive the exact mean squared error (MSE) of this estimator and show the optimal sampling distribution for both the estimator and the predictor. The results of the statistical analysis can be summarized as follows: (1) the proposed estimator is unbiased for any sampling distribution; (2) the optimal sampling distribution is influenced by the noise-to-signal ratio; (3) the optimal sampling distribution involves a tradeoff between the leverage scores and the square roots of the leverage scores. We further provide an empirical evaluation of the proposed algorithm on synthetic data under various settings. This empirical evaluation clearly shows the efficacy of the derived optimal sampling scores for sampling large-scale data in order to learn a linear model from noisy samples.
Beyond the simple statistical analysis, the proposed sampled projection estimator is useful for multi-task linear regression, because its computational cost is lower when estimating multiple coefficient vectors based on the same data matrix. This is crucial for many applications in sensor networks and electrical systems [24]. For example, traffic speeds are time-evolving data supported on an urban traffic network. Based on the same urban traffic network, it is potentially efficient to use the proposed estimator to estimate the traffic speeds at all the intersections from a small number of samples in every time period. The proposed estimator can also be applied to estimating corrupted measurements in sensor networks [25], uncertain voltage states in power grids [26], and many others.
The main contributions are (1) we derive the exact MSE of a natural modification of the least squares estimator; and (2) we derive the optimal sampling distributions for the estimator and the predictor, and show that they are influenced by the noise-to-signal ratio and related to the leverage scores.
II Background
We consider a simple linear model

y = Xβ + ε,

where y ∈ ℝ^N is a response vector, X ∈ ℝ^{N×p} is a data matrix (N > p), β ∈ ℝ^p is a coefficient vector, and ε ∈ ℝ^N is a zero-mean noise vector with independent and identically distributed entries and covariance matrix σ²I. Given y and X, the task is to estimate β. The unknown coefficient vector is usually estimated via least squares as

  β̂ = argmin_β ‖y − Xβ‖₂² = X†y,   (1)

where X† = (X^⊤X)^{−1}X^⊤ is the pseudoinverse of X. The predicted response vector is then ŷ = Xβ̂ = Hy, where H = XX† is the hat matrix. The diagonal elements of H are the leverage scores, that is, ℓ_i = H_{ii} is the leverage of the ith sample. The statistical leverage scores are widely used for detecting outliers and influential data [27, 28, 13].
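As a concrete illustration (not part of the original derivation), the leverage scores can be computed from a thin QR decomposition, since the hat matrix equals QQ^⊤ for any orthonormal basis Q of the column space of the data matrix. A minimal sketch in Python:

```python
import numpy as np

def leverage_scores(X):
    """Leverage scores: diagonal of the hat matrix H = X (X^T X)^{-1} X^T.

    Since H = Q Q^T for a thin QR factorization X = QR, the i-th leverage
    score is the squared norm of the i-th row of Q.
    """
    Q, _ = np.linalg.qr(X)           # Q: N-by-p with orthonormal columns
    return np.sum(Q ** 2, axis=1)    # squared row norms = diag(H)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
scores = leverage_scores(X)
# leverage scores lie in [0, 1] and sum to p when X has full column rank
assert np.all(scores >= 0) and np.all(scores <= 1 + 1e-12)
assert np.isclose(scores.sum(), X.shape[1])
```

The QR route costs O(Np²), the same order as solving the full least squares problem, which is why approximate leverage score computation [13] matters in practice.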
In some applications, it is expensive to sample the entire response vector. Previous works consider sampling algorithms that compute approximate solutions to the overconstrained least squares problem [10, 18, 29, 22]. These algorithms choose a small number of rows of X and the corresponding elements of y, and use least squares on the samples to estimate β. Formally, let M = (i₁, …, i_m) be the sequence of sampled indices, called the sampling set, with i_t ∈ {1, …, N} and m ≤ N. A sampling matrix S ∈ ℝ^{m×N} is a linear mapping from ℝ^N to ℝ^m that describes the process of sampling with replacement: when the i_t-th element of y is chosen in the t-th random trial, S_{t,i_t} = 1 and S_{t,j} = 0 for j ≠ i_t. The goal is, given X, to design the sampling operator S to obtain samples Sy, and then estimate β. Here we focus on experimental design of the sampling operator; that is, the sampling strategy must be designed in advance of collecting any elements of y.
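The sampling process above can be sketched as follows; this is an illustrative implementation of i.i.d. sampling with replacement from a given score-based distribution, not the authors' code:

```python
import numpy as np

def scores_to_distribution(scores):
    # normalize nonnegative sampling scores into a probability distribution
    scores = np.asarray(scores, dtype=float)
    return scores / scores.sum()

def draw_samples(pi, m, rng):
    # draw m indices i.i.d. with replacement; index i is chosen w.p. pi[i]
    return rng.choice(len(pi), size=m, replace=True, p=pi)

rng = np.random.default_rng(1)
pi = scores_to_distribution([3.0, 1.0, 1.0, 1.0])  # first index favored
idx = draw_samples(pi, 6000, rng)
# empirical frequency of index 0 should concentrate near pi[0] = 0.5
```

In practice the sampling matrix S is never formed explicitly; indexing with `idx` applies S, and scattering back into a length-N vector applies S^⊤.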
The choice of samples is an important degree of freedom when studying the corresponding quality of approximation. In this paper, we employ random sampling with an underlying probability distribution to choose samples. Let π = (π₁, …, π_N) be a probability distribution, where π_i denotes the probability of choosing the i-th sample in each random trial. We consider two choices of the probability distribution: uniform sampling, in which the sample indices are chosen from {1, …, N} independently and uniformly at random; and score-based sampling, in which the sample indices are chosen from an importance sampling distribution proportional to a sampling score computed from the data matrix (the terms sampling distribution and sampling scores are used interchangeably in this paper). A widely used sampling score is the leverage scores of the data matrix. Given the samples, one way to estimate β is by solving a weighted least squares problem:
  β̂ = argmin_β ‖DSy − DSXβ‖₂²,   (2)

where D ∈ ℝ^{m×m} is a diagonal rescaling matrix with D_{tt} = 1/√(mπ_i) when S_{t,i} = 1. This is called sampled least squares (SampleLS) [22]. The computational cost of taking the pseudoinverse of DSX is O(mp²). The inverse term (X^⊤S^⊤D²SX)^{−1} involves the sampling operator, but its computation is much cheaper than taking the pseudoinverse of X, which costs O(Np²) (N ≫ m). However, it is hard to obtain the exact MSE and the optimal sampling scores of SampleLS. [22] shows that, based on SampleLS, neither leverage score based sampling nor uniform sampling dominates the other from a statistical perspective.
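A minimal sketch of SampleLS under the conventions above; the rescaling 1/√(mπ_i) is the standard choice that makes the subsampled objective unbiased for the full least-squares objective, and this is an illustration rather than the authors' implementation:

```python
import numpy as np

def sample_ls(X, y, pi, m, rng):
    """Sampled least squares (SampleLS): sample m rows i.i.d. from pi,
    rescale each sampled row and response by 1/sqrt(m * pi_i), then
    solve least squares on the m-row subproblem (cost O(m p^2))."""
    idx = rng.choice(X.shape[0], size=m, replace=True, p=pi)
    w = 1.0 / np.sqrt(m * pi[idx])       # diagonal of the rescaling matrix D
    beta, *_ = np.linalg.lstsq(w[:, None] * X[idx], w * y[idx], rcond=None)
    return beta

# in the noiseless case y = X @ beta, any full-rank subsample recovers beta
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
pi = np.full(200, 1.0 / 200)             # uniform sampling distribution
beta_hat = sample_ls(X, X @ beta_true, pi, 50, rng)
```

Note that the solved inverse term depends on which rows were drawn, so the whole O(mp²) solve must be repeated for every new sampling set.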
III Proposed Method
To gain a deeper statistical understanding of this task, we propose a similar but simpler algorithm for the same task, one whose exact MSE and corresponding optimal sampling scores are easy to derive.
III-A Problem Formulation
Similarly to (2), we estimate β by solving the following problem:

  β̂ = argmin_β ‖S^⊤D²Sy − Xβ‖₂² = X†S^⊤D²Sy,   (3)

where D is the same diagonal rescaling matrix as in (2). We call (3) sampled projection (SampleProj). The vector S^⊤D²Sy is akin to zero-filling the unobserved entries of y and rescaling the observed entries to make the estimator unbiased. Compared to ordinary least squares, SampleProj does not improve computational efficiency, because it still requires the pseudoinverse of X and the computation involves the factor N; however, it is still useful when taking samples is expensive. Compared to SampleLS, this algorithm is more appealing when we aim to estimate multiple coefficient vectors based on the same data matrix, because the inverse term (X^⊤X)^{−1} does not involve the sampling operator and can be computed once and reused.
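The idea can be sketched as follows; this is a hypothetical implementation consistent with the description above (zero-fill y, rescale each sampled entry by 1/(mπ_i) so the filled vector is unbiased for y, then apply a precomputed pseudoinverse), not the authors' code:

```python
import numpy as np

def sample_proj(X_pinv, y, pi, m, rng):
    """SampleProj: X_pinv = np.linalg.pinv(X) is computed once and reused
    across multiple response vectors (the multi-task setting)."""
    n = y.shape[0]
    idx = rng.choice(n, size=m, replace=True, p=pi)
    y_tilde = np.zeros(n)
    # accumulate rescaled samples; E[y_tilde] = y, hence E[beta_hat] = beta
    np.add.at(y_tilde, idx, y[idx] / (m * pi[idx]))
    return X_pinv @ y_tilde

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
beta_true = np.array([2.0, -1.0, 0.5])
pi = np.full(20, 1.0 / 20)
# with many trials the estimate concentrates around beta_true
beta_hat = sample_proj(np.linalg.pinv(X), X @ beta_true, pi, 500_000, rng)
```

Because the pseudoinverse does not depend on the sampling operator, estimating many coefficient vectors from the same X amortizes the single O(Np²) factorization cost.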
III-B Statistical Analysis
We obtain the exact mean squared error and an optimal sampling distribution for SampleProj. Our main contributions based on SampleProj can be summarized as follows:
Due to space constraints, the full proofs of these results appear in [1].
Lemma 1.
The estimator β̂ in SampleProj is an unbiased estimator of β, that is, E[β̂] = β.
Lemma 2.
The covariance of the estimator in SampleProj has the following property:
where σ² is the variance of the Gaussian random noise.
Combining the results of Lemmas 1 and 2, we obtain the exact MSE of both the estimator and the predictor.
Theorem 1.
Let β̂ be the solution of SampleProj with sampling distribution π. The mean squared error of the estimator is
The mean squared error of the predictor is
We next optimize over the mean squared errors and obtain the optimal sampling scores.
Theorem 2.
The optimal sampling scores that minimize the mean squared error of the estimator are
The optimal sampling scores that minimize the MSE of the predictor are
Theorem 2 shows that the optimal sampling scores depend on the signal strength and the noise. In practice, we cannot access β, so we need to approximate the relevant noise-to-signal ratios. For active sampling, we can collect feedback to approximate each signal coefficient; for experimentally designed sampling, we approximate the ratios beforehand. One way is to use the Cauchy–Schwarz inequality to obtain a bound,
An upper bound on the MSE of the estimator is then
The corresponding sampling scores are
(4)  
When the column vectors of X are orthonormal, X^⊤X is the identity matrix and the sampling scores (4) lie between the leverage scores and the square roots of the leverage scores.
Corollary 1.
An upper bound on the MSE of the predictor is then
The corresponding optimal sampling scores are
(5)  
IV Empirical Evaluation
In this section, we validate the proposed algorithm and statistical analysis on synthetic data.
IV-A Synthetic Dataset
We generate a data matrix X from a multivariate t-distribution with 1 degree of freedom and covariance matrix Σ. The leverage scores of this matrix are nonuniform. This is the same setting as in [22]. We then generate a coefficient vector β whose elements are drawn from a uniform distribution. We test both noiseless and noisy cases. In the noiseless case, the response vector is y = Xβ; in the noisy case, the response vector is y = Xβ + ε, where ε is Gaussian noise with variance σ². We vary the noise level σ and report the corresponding noise-to-signal ratios (NSR).
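The heavy-tailed design can be generated with the standard normal/chi-square construction of the multivariate t-distribution (with 1 degree of freedom this is a multivariate Cauchy, which yields highly nonuniform leverage scores). A sketch, with a placeholder scale matrix since the paper's exact covariance entries are not reproduced here:

```python
import numpy as np

def multivariate_t(n_rows, scale, df, rng):
    """Rows from a multivariate t-distribution: z / sqrt(u/df), with
    z ~ N(0, scale) and u ~ chi-square(df)."""
    p = scale.shape[0]
    z = rng.multivariate_normal(np.zeros(p), scale, size=n_rows)
    u = rng.chisquare(df, size=n_rows)
    return z / np.sqrt(u / df)[:, None]

rng = np.random.default_rng(2)
p = 10
# placeholder AR(1)-style scale matrix; the paper's entries may differ
scale = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = multivariate_t(1000, scale, df=1, rng=rng)   # df = 1: heavy tails
```

With df = 1 a few rows have very large norm, so their leverage scores approach 1 while most rows have leverage near the average p/N, which is exactly the nonuniform regime the experiments target.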
IV-B Experimental Setup
We consider two key questions:

1) Do different sampling scores make a difference?

2) Does noise make a difference?
To answer these questions, we test the proposed SampleProj with different sampling scores in both the noiseless and noisy cases. In each test, we compare five sampling scores: uniform sampling as Uniform (blue), leverage scores as Lev (red), square roots of the leverage scores as sqLev (orange), the sampling scores for the estimator (4) as optEst (purple), and the sampling scores for the predictor (5) as optPred (green). For optEst and optPred, we use the true noise-to-signal ratios. All the results are averaged over 500 independent runs.
We measure the quality of estimation by the estimation error between the ground truth β and an estimator β̂, and we measure the quality of prediction by the corresponding prediction error.
IV-C Results
Figure 1 compares the estimation error of SampleProj in the noiseless and noisy scenarios as a function of sample size. We see that, as expected, the sampling scores for the estimator (4) outperform all other scores in both the noiseless and noisy cases. Also, the leverage scores perform well in the noiseless case and the square roots of the leverage scores perform well in the noisy case, which matches the result in Corollary 1.
(a) Noiseless.  (b) Noisy.
Figure 2 compares the prediction error of SampleProj in the noiseless and noisy cases as a function of sample size. We see that the sampling scores for the predictor (5) give the best performance in both the noiseless and noisy cases; the leverage scores perform better than the square roots of the leverage scores in the noiseless case. From Corollary 1, the square roots of the leverage scores should perform better than the leverage scores in a high-noise case; thus, we suspect that the noise level here is not high enough to see the trend.
(a) Noiseless.  (b) Noisy.
To study how noise influences the results, given 200 samples, we show the estimation errors and the prediction errors of SampleProj as a function of noise level. Figure 3 (a) shows that the sampling scores for the estimator (4) consistently outperform the other scores. The leverage scores are better than the square roots of the leverage scores when the noise level is small, and the square roots of the leverage scores catch up with the leverage scores as the noise level increases, which matches Corollary 1. Figure 3 (b) shows that the sampling scores for the predictor (5) consistently outperform the other scores in terms of prediction error; again the leverage scores are better than their square roots when the noise level is small, and the square roots catch up as the noise level increases, which matches Corollary 1.
(a) Estimation error.  (b) Prediction error.
Now we can answer the two questions posed in Section IV-B. Based on SampleProj, the proposed sampling scores for the estimator consistently outperform the other sampling scores in terms of estimation error, and the proposed sampling scores for the predictor consistently outperform the other sampling scores in terms of prediction error. Leverage score based sampling is better in the noiseless case, but sampling based on the square roots of the leverage scores is better in the noisy case. Note that much work in the computer science theory literature does not address noise and mostly focuses on least squares [10, 30, 14]; hence, from the algorithmic perspective, leverage score sampling was considered optimal.
In the experiments, we use the true NSRs for the optimal sampling scores for the estimator and the predictor, which is impractical. In practice, for low-noise cases, we set the NSR to zero in the sampling scores (4) and (5); for high-noise cases, we let the NSR term dominate. In general, when X is nonorthonormal, these sampling scores differ from both the leverage scores and the square roots of the leverage scores.
V Conclusions
We consider the problem of learning a linear model from noisy samples from a statistical perspective. Existing work on sampling for least-squares approximation has focused on using leverage-based scores as an importance sampling distribution. However, it is hard to obtain the precise MSE of the sampled least squares estimator. To understand importance sampling distributions, we propose a simple yet effective estimator, called SampleProj, to evaluate the statistical properties of sampling scores. The proposed SampleProj is appealing for theoretical analysis and multi-task linear regression. The main contributions are (1) we derive the exact MSE of SampleProj for a given sampling distribution; and (2) we derive the optimal sampling scores for the estimator and the predictor, and show that they are influenced by the noise-to-signal ratio. The numerical simulations show that empirical performance is consistent with the proposed theory. We have derived the optimal sampling strategy for a specific estimator, but identifying the optimal estimator with a computationally efficient sampling strategy remains an open direction.
References
 [1] S. Chen, R. Varma, A. Singh, and J. Kovačević, “A statistical perspective of sampling scores for linear regression,” in arXiv:1507.05870, 2015.
 [2] E. J. Candès, “Compressive sampling,” in Int. Congr. Mathematicians, Madrid, Spain, 2006.
 [3] D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.
 [4] J. Tropp, J. N. Laska, M. F. Duarte, J. K. Romberg, and R. G. Baraniuk, “Beyond Nyquist: Efficient sampling of sparse bandlimited signals,” IEEE Trans. Inf. Theory, vol. 56, pp. 520–544, June 2010.
 [5] J. Haupt, R. Castro, and R. Nowak, “Distilled sensing: Adaptive sampling for sparse detection and estimation,” IEEE Trans. Inf. Theory, vol. 57, pp. 6222–6235, Sept. 2011.
 [6] M. A. Davenport, A. K. Massimino, D. Needell, and T. Woolf, “Constrained adaptive sensing,” arXiv:1506.05889, 2015.
 [7] A. Krishnamurthy and A. Singh, “Lowrank matrix and tensor completion via adaptive sampling,” in Proc. Neural Information Process. Syst., Harrahs and Harveys, Lake Tahoe, Dec. 2013.
 [8] R. Patel, T. A. Goldstein, E. L. Dyer, A. Mirhoseini, and R. G. Baraniuk, “oASIS: Adaptive column sampling for kernel matrix approximation,” arXiv:1505.05208 [stat.ML], May 2015.
 [9] Y. Wang and A. Singh, “Column subset selection with missing data via active sampling,” in AISTATS, 2015, pp. 1033–1041.
 [10] P. Drineas, M. W. Mahoney, and S. Muthukrishnan, “Sampling algorithms for regression and applications,” in Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, 2006, pp. 1127–1136.
 [11] K. L. Clarkson, P. Drineas, M. Magdon-Ismail, M. W. Mahoney, X. Meng, and D. P. Woodruff, “The fast Cauchy transform and faster robust linear regression,” in Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2013, pp. 466–477.
 [12] M. W. Mahoney and P. Drineas, “CUR matrix decompositions for improved data analysis,” Proceedings of the National Academy of Sciences, vol. 106, no. 3, pp. 697–702, 2009.
 [13] P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. Woodruff, “Fast approximation of matrix coherence and statistical leverage,” Journal of Machine Learning Research, vol. 13, no. 1, pp. 3475–3506, 2012.
 [14] H. Avron, P. Maymounkov, and S. Toledo, “Blendenpik: Supercharging LAPACK’s least-squares solver,” SIAM Journal on Scientific Computing, vol. 32, no. 3, pp. 1217–1236, 2010.
 [15] N. Ailon and B. Chazelle, “Faster dimension reduction,” Communications of the ACM, vol. 53, no. 2, pp. 97–104, 2010.
 [16] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós, “Faster least squares approximation,” Numerische Mathematik, vol. 117, no. 2, pp. 219–249, 2011.
 [17] X. Meng, M. A. Saunders, and M. W. Mahoney, “LSRN: A parallel iterative solver for strongly over- or underdetermined systems,” SIAM Journal on Scientific Computing, vol. 36, no. 2, pp. C95–C118, 2014.
 [18] M. W. Mahoney, Randomized Algorithms for Matrices and Data, Number 2. Now Publishers Inc., Boston, 2011.
 [19] Y. Wang and A. Singh, “An empirical comparison of sampling techniques for matrix column subset selection,” in Allerton, 2015.
 [20] T. Yang, L. Zhang, R. Jin, and S. Zhu, “An explicit sampling dependent spectral error bound for column subset selection,” arXiv preprint arXiv:1505.00526, 2015.
 [21] F. Pukelsheim, Optimal Design of Experiments, Classics in Applied Mathematics. SIAM, New York, NY, 2006.
 [22] P. Ma, M. W. Mahoney, and B. Yu, “A statistical perspective on algorithmic leveraging,” Journal of Machine Learning Research, vol. 16, pp. 861–911, 2015.
 [23] R. Zhu, P. Ma, M. W. Mahoney, and B. Yu, “Optimal subsampling approaches for large sample linear regression,” arXiv:1509.05111, 2015.
 [24] S. Chen, R. Varma, A. Singh, and J. Kovačević, “Signal recovery on graphs: Random versus experimentally designed sampling,” in SampTA, Washington, D.C., 2015.
 [25] S. Chen, A. Sandryhaila, J. M. F. Moura, and J. Kovačević, “Signal recovery on graphs: Variation minimization,” IEEE Trans. Signal Process., vol. 63, pp. 4609–4624, June 2015.
 [26] Y. Weng, R. Negi, and M. D. Ilić, “Graphical model for state estimation in electric power systems,” in IEEE SmartGridComm Symposium, Vancouver, Oct. 2013.
 [27] D. C. Hoaglin and R. E. Welsch, “The hat matrix in regression and ANOVA,” American Statistician, no. 1, pp. 17–22, 1978.
 [28] P. F. Velleman and R. E. Welsch, “Efficient computing of regression diagnostics,” American Statistician, no. 4, pp. 234–242, 1981.
 [29] P. Dhillon, Y. Lu, D. P. Foster, and L. Ungar, “New subsampling algorithms for fast least square regression,” in Advances in Neural Information Processing Systems, 2013, pp. 360–368.
 [30] C. Boutsidis and P. Drineas, “Random projections for the nonnegative least-squares problem,” Linear Algebra Appl., vol. 431, pp. 760–771, 2009.
Appendix A
A-A Proof of Theorem 1
We first show the unbiasedness of the estimator in Lemma 1. For each element of the bias term, we have
where the marked equality follows from the fact that each sample is drawn independently and identically from the distribution π.
We then show the covariance of the estimator. We first split the covariance into two components and then bound each in turn.
The first term captures variability due to sampling, while the second term captures variability introduced by noise. Since the noise is independent of the sampling set, we can bound the two terms separately. For the sampling term, we have
For the noise term, we have
We then combine the two terms and obtain the variance term.
Finally, we put the bias term and the variance term together to obtain the exact MSE of the estimator.
A-B Proof of Theorem 2
To obtain the optimal sampling scores for the estimator, we solve the following optimization problem.
The objective function is the upper bound on the MSE of the estimator derived above, and the constraints require π to be a valid probability distribution. The Lagrangian function is then
We set the derivative of the Lagrangian function to zero,
and then, we obtain the optimal sampling distribution to be
Similarly, we minimize the MSE of the predictor to obtain the optimal sampling distribution for the predictor,