Gradient-based Sampling: An Adaptive Importance Sampling for Least-squares
Abstract
In modern data analysis, random sampling is an efficient and widely-used strategy to overcome the computational difficulties brought by large sample size. In previous studies, researchers conducted random sampling which is according to the input data but independent on the response variable, however the response variable may also be informative for sampling. In this paper we propose an adaptive sampling called the gradient-based sampling which is dependent on both the input data and the output for fast solving of least-square (LS) problems. We draw the data points by random sampling from the full data according to their gradient values. This sampling is computationally saving, since the running time of computing the sampling probabilities is reduced to where is the full sample size and is the dimension of the input. Theoretically, we establish an error bound analysis of the general importance sampling with respect to LS solution from full data. The result establishes an improved performance of the use of our gradient-based sampling. Synthetic and real data sets are used to empirically argue that the gradient-based sampling has an obvious advantage over existing sampling methods from two aspects of statistical efficiency and computational saving.
1 Introduction
Modern data analysis always addresses enormous data sets in recent years. Facing the increasing large sample data, computational savings play a major role in the data analysis. One simple way to reduce the computational cost is to perform random sampling, that is, one uses a small proportion of the data as a surrogate of the full sample for model fitting and statistical inference. Among random sampling strategies, uniform sampling is simple but trivial way since it fails to exploit the unequal importance of the data points. As an alternative, leverage-based sampling is to perform random sampling with respect to nonuniform sampling probabilities that depend on the empirical statistical leverage scores of the input matrix . It has been intensively studied in the machine learning community and has been proved to achieve much better results for worst-case input than uniform sampling [1, 2, 3, 4]. However it is known that leverage-based sampling replies on input data but is independent on the output variable, so does not make use of the information of the output. Another shortcoming is that it needs to cost much time to get the leverage scores, although approximating leverage scores has been proposed to further reduce the computational cost [5, 6, 7].
In this paper, we proposed an adaptive importance sampling, the gradient-based sampling, for solving least-square (LS) problem. This sampling attempts to sufficiently make use of the data information including the input data and the output variable. This adaptive process can be summarized as follows: given a pilot estimate (good “guess”) for the LS solution, determine the importance of each data point by calculating the gradient value, then sample from the full data by importance sampling according to the gradient value. One key contribution of this sampling is to save more computational time than leverage-based sampling, and the running time of getting the probabilities is reduced to where is the sample size and is the input dimension. It is worthy noting that, although we apply gradient-based sampling into the LS problem, we believe that it may be extended to fast solve other large-scale optimization problems as long as the gradient of optimization function is obtained. However this is out of the scope so we do not extend it in this paper.
Theoretically, we give the risk analysis, error bound of the LS solution from random sampling. [8] and [9] gave the risk analysis of approximating LS by Hadamard-based projection and covariance-thresholded regression, respectively. However, no such analysis is studied for importance sampling. The error bound analysis is a general result on any importance sampling as long as the conditions hold. By this result, we establishes an improved performance guarantee on the use of our gradient-based sampling. It is improved in the sense that our gradient-based sampling can make the bound approximately attain its minimum, while previous sampling methods can not get this aim. Additionally, the non-asymptotic result also provides a way of balancing the tradeoff between the subsample size and the statistical accuracy.
Empirically, we conduct detailed experiments on datasets generated from the mixture Gaussian and real datasets. We argue by these empirical studies that the gradient-based sampling is not only more statistically efficient than leverage-based sampling but also much computationally cheaper from the computational viewpoint. Another important aim of detailed experiments on synthetic datasets is to guide the use of the sampling in different situations that users may encounter in practice.
The remainder of the paper is organized as follows: In Section 2, we formally describe random sampling algorithm to solve LS, then establish the gradient-based sampling in Section 3. The non-asymptotic analysis is provided in Section 4. We study the empirical performance on synthetic and real world datasets in Section 5.
Notation: For a symmetric matrix , we define and as its the largest and smallest eigenvalues. For a vector , we define as its L norm.
2 Problem Set-up
For LS problem, suppose that there are an matrix and an response vector . We focus on the setting . The LS problem is to minimize the sample risk function of parameters as follows:
(1) |
The solution of equation (1) takes the form of
(2) |
where and . However, the challenge of large sample size also exists in this simple problem, i.e., the sample size is so large that the computational cost for calculating LS solution (2) is very expensive or even not affordable.
We perform the random sampling algorithm as follows:
(a) Assign sampling probabilities for all data points such that ;
(b) Get a subsample by random sampling according to the probabilities;
(c) Maximize a weighted loss function to get an estimate
(3) |
where , , and , and are the partitions of , and with the subsample size , respectively, corresponding the subsample . Note that the last equality in (3) holds under the assumption that is invertible. Throughout this paper, we assume that is invertible for the convenience since in our setting and it can be replaced with its regularized version if it is not invertible.
How to construct is a key component in random sampling algorithm. One simple method is the uniform sampling, i.e.,, and another method is leverage-based sampling, i.e., . In the next section, we introduce a new efficient method: gradient-based sampling, which draws data points according to the gradient value of each data point.
Related Work. [10, 11, 4] developed leverage-based sampling in matrix decomposition. [10, 12] applied the sampling method to approximate the LS solution. [13] derived the bias and variance formulas for the leverage-based sampling algorithm in linear regression using the Taylor series expansion. [14] further provided upper bounds for the mean-squared error and the worst-case error of randomized sketching for the LS problem. [15] proposed a sampling-dependent error bound then implied a better sampling distribution by this bound. Fast algorithms for approximating leverage scores were proposed to further reduce the computational cost [5, 6, 7].
3 Gradient-based Sampling Algorithm
The gradient-based sampling uses a pilot solution of the LS problem to compute the gradient of the objective function, and then sampling a subsample data set according to the calculated gradient values. It differs from leverage-based sampling in that the sampling probability is allowed to depend on input data as well as . Given a pilot estimate (good guess) for parameters , we calculate the gradient for the th data point
(4) |
Gradient represents the slope of the tangent of the loss function, so logically if gradient of data points are large in some sense, these data points are important to find the optima. Our sampling strategy makes use of the gradient upon observing given , and specifically,
(5) |
Equations (4) and (5) mean that, includes two parts of information: one is which is the information provided by the input data and the other is which is considered to provide a justification from the pilot estimate to a better estimate. Figure 1 illustrates the efficiency benefit of the gradient-based sampling by constructing the following simple example. The figure shows that the data points with larger are probably considered to be more important in approximating the solution. On the other side, given , we hope to choose the data points with larger values, since larger values probably cause the approximate solution be more efficient. From the computation view, calculating costs , so the gradient-based sampling is much saving computational cost.
Choosing the pilot estimate . In many applications, there may be a natural choice of pilot estimate , for instance, the fit from last time is a natural choice for this time. Another simple way is to use a pilot estimate from an initial subsample of size obtained by uniform sampling. The extra computational cost is , which is assumed to be small since a choice will be good enough. We empirically show the effect of small () on the performance of the gradient-based sampling by simulations, and argue that one does not need to be careful when choosing to get a pilot estimate. (see Supplementary Material, Section S1)
Poisson sampling v.s. sampling with replacement. In this study, we do not choose sampling with replacement as did in previous studies, but apply Poisson sampling into this algorithm. Poisson sampling is executed in the following way: proceed down the list of elements and carry out one randomized experiment for each element, which results either in the election or in the nonselection of the element [16]. Thus, Poisson sampling can improve the efficiency in some context compared to sampling with replacement since it can avoid repeatedly drawing the same data points, especially when the sampling ratio increases, We empirically illustrates this advantage of Poisson sampling compared to sampling with replacement. (see Supplementary Material, Section S2)
Independence on model assumption. LS solution is well known to be statistically efficient under the linear regression model with homogeneous errors, but model misspecification is ubiquitous in real applications. On the other hand, LS solution is also an optimization problem without any linear model assumption from the algorithmic view. To numerically show the independence of the gradient-based sampling on model assumption, we do simulation studies and find that it is an efficient sampling method from the algorithmic perspective. (see Supplementary Material, Section S3)
Now as a summary we present the gradient-based sampling in Algorithm 1.
Remarks on Algorithm 1. (a) The subsample size from Poisson sampling is random in Algorithm 1. Since is multinomial distributed with expectation and variance , the range of probable values of can be assessed by an interval. In practice we just need to set the expected subsample size . (b) If ’s are so large that for some data points, we should take , i.e., for them.
4 Error Bound Analysis of Sampling Algorithms
Our main theoretical result establishes the excess risk, i.e., an upper error bound of the subsample estimator to approximate for an random sampling method. Given sampling probabilities , the excess risk of the subsample estimator with respect to is given in Theorem 1. (see Section S4 in Supplementary Material for the proof). By this general result, we provide an explanation why the gradient-based sampling algorithm is statistically efficient.
Theorem 1
Define , where , and , if
holds, the excess risk of for approximating is bounded in probability for as
(6) |
where .
Theorem 1 indicates that, can be bounded by . From (6), the choice of sampling method has no effect on the decreasing rate of the bound, , but influences the constant . Thus, a theoretical measure of efficiency for some sampling method is whether it can make the constant attain its minimum. In Corollary 1 (see Section S5 in Supplementary Material for the proof), we show that Algorithm 1 can approximately get this aim.
Remarks on Theorem 1. (a) Theorem 1 can be used to guide the choice of in practice so as to guarantee the desired accuracy of the solution with high probability. (b) The constants , and can be estimated based on the subsample. (c) The risk of to predict follows from equation (6) and get that . (d) Although Theorem 1 is established under Poisson sampling, we can easily extend the error bound to sampling with replacement by following the technical proofs in Supplementary Material, since each drawing in sampling with replacement is considered to be independent.
Corollary 1
If , then is approximately mimimized by Algorithm 1, that is,
(7) |
where denotes the value corresponding to our gradient-based sampling.
The significance of Corollary 1 is to give an explanation why the gradient-based sampling is statistically efficient. The corollary establishes an improved performance guarantee on the use of the gradient-based sampling. It is improved in the sense that our gradient-based sampling can make the bound approximately attain its minimum as long as the condition is satisfied, while neither uniform sampling nor leverage-based sampling can get this aim. The condition that provides a benchmark whether the pilot estimate is a good guess of . Note the condition is satisfied by the initial estimate from an initial subsample of size by uniform sampling since .
5 Numerical Experiments
Detailed numerical experiments are conducted to compare the excess risk of based on loss against the expected subsample size for different synthetic datasets and real data examples. In this section, we report several representative studies.
5.1 Performance of gradient-based sampling
The design matrix is generated with elements drawn independently from the mixture Gaussian distributions below: (1) and , i.e., Gaussian distribution (referred as to GA data); (2) and , i.e.,the mixture between small and relatively large variances (referred as to MG1 data); (3) and , i.e., the mixture between small and highly large variances (referred as to MG2 data); (4) and , i.e., the mixture between two symmetric peaks (referred as to MG3 data). We also do simulations on generated from multivariate mixture Gaussian distributions with AR(1) covariance matrix, but obtain the similar performance to the setting above, so we do not report them here. Given , we generate from the model where each element of is drawn from normal distribution and then fixed, and , where . Note that we also consider the heteroscedasticity setting that is from a mixture Gaussian, and get the similar results to the homoscedasticity setting. So we do not report them here. We set as 100, and as among 20K, 50K, 100K, 200K, 500K.
We calculate the full sample LS solution for each dataset, and repeatedly apply various sampling methods for times to get subsample estimates for . We calculate the empirical risk based on loss (MSE) as follows:
Two sampling ratio values are considered: 0.01 and 0.05. We compare uniform sampling (UNIF), the leverage-based sampling (LEV) and the gradient-based sampling (GRAD) to these data sets. For GRAD, we set the to getting the pilot estimate .
Figure 2 gives boxplots of the logarithm of sampling probabilities of LEV and GRAD, where taking the logarithm is to clearly show their distributions. We have some observations from the figure. (1) For all four datasets, GRAD has heavier tails than LEV, that is, GRAD lets sampling probabilities more disperse than LEV. (2) MG2 tends to have the most heterogeneous sampling probabilities, MG1 has less heterogeneous than MG2, whereas MG3 and GA have the most homogeneous sampling probabilities. This indicates that the mixture of large and small variances has effect on the distributions of sampling probabilities while the mixture of different peak locations has no effect.
We plot the logarithm of MSE values for GA, MG1, and MG2 in Figure 3, where taking the logarithm is to clearly show the relative values. We do not report the results for MG3, as there is little difference between MG3 and GA. There are several interesting results shown in Figure 3. (1) GRAD has better performance than others, and the advantage of GRAD becomes obvious as increases. (2) For GA, LEV is shown to have similar performance to UNIF, however GRAD has obviously better performance than UNIF. (3) When increases, the smaller is needed to make sure that GRAD outperforms others.
From the computation view, we compare the computational cost for UNIF, approximate LEV (ALEV) [5, 6] and GRAD in Table 1, since ALEV is shown to be computationally efficient to approximate LEV. From the table, UNIF is the most saving, and the time cost of GRAD is much less than that of ALEV. It indicates that GRAD is also an efficient method from the computational view, since its running time is . Additionally, Table 2 summaries the computational complexity of several sampling methods for fast solving LS problems.
System Time () | User Time () | |||||
200 | 500 | 2000 | 200 | 500 | 2000 | |
UNIF | 0.000 | 0.002 | 0.003 | 0.010 | 0.018 | 0.050 |
ALEV | 0.494 | 0.642 | 0.797 | 2.213 | 2.592 | 4.353 |
GRAD | 0.099 | 0.105 | 0.114 | 0.338 | 0.390 | 0.412 |
System Time () | User Time () | |||||
500 | 2000 | 10000 | 500 | 2000 | 10000 | |
UNIF | 0.057 | 0.115 | 0.159 | 2.81 | 5.94 | 14.28 |
ALEV | 50.86 | 53.64 | 81.85 | 86.12 | 88.36 | 120.15 |
GRAD | 5.836 | 6.107 | 6.479 | 28.85 | 30.06 | 37.51 |
Stage | D1 | D2 | overall |
---|---|---|---|
Full | - | ||
UNIF | - | ||
LEV | |||
ALEV | |||
GRAD |
5.2 Real Data Examples
In this section, we compare the performance of various sampling algorithms on two UCI datasets: CASP () and OnlineNewsPopularity (NEWS) (). At first, we plot boxplots of the logarithm of sampling probabilities of LEV and GRAD in Figure 4. From it, similar to synthetic datasets, we know that the sampling probabilities of GRAD looks more dispersed compared to those of LEV.
The MSE values are reported in Table 3. From it, we have two observations below. First, GRAD has smaller MSE values than others when is large. Second, as increases, the outperformance of Poisson sampling than sampling with replacement gets obvious for various methods. Similar observation is gotten in simulations (see Supplementary Material, Section S2).
CASP | |||||
45 | 180 | 450 | 1800 | 4500 | |
UNIF-SR | 2.998e-05 | 9.285e-06 | 4.411e-06 | 1.330e-06 | 4.574e-07 |
UNIF-PS | 2.702e-05 | 9.669e-06 | 4.243e-06 | 1.369e-06 | 4.824e-07 |
LEV-SR | 1.962e-05 | 4.379e-06 | 1.950e-06 | 4.594e-07 | 2.050e-07 |
LEV-PS | 2.118e-05 | 5.240e-06 | 1.689e-06 | 4.685e-07 | 1.694e-07 |
GRAD-SR | 2.069e-05 | 5.711e-06 | 1.861e-06 | 4.322e-07 | 1.567e-07 |
GRAD-PS | 2.411e-05 | 5.138e-06 | 1.678e-06 | 3.687e-07 | 1.179e-07 |
NEWS | |||||
300 | 600 | 1200 | 2400 | 4800 | |
UNIF-SR | 22.050 | 14.832 | 10.790 | 7.110 | 4.722 |
UNIF-PS | 27.215 | 19.607 | 15.258 | 9.504 | 4.378 |
LEV-SR | 22.487 | 11.047 | 5.519 | 2.641 | 1.392 |
LEV-PS | 21.971 | 9.419 | 4.072 | 2.101 | 0.882 |
GRAD-SR | 10.997 | 5.508 | 3.074 | 1.505 | 0.752 |
GRAD-PS | 9.729 | 5.252 | 2.403 | 1.029 | 0.399 |
6 Conclusion
In this paper we have proposed gradient-based sampling algorithm for approximating LS solution. This algorithm is not only statistically efficient but also computationally saving. Theoretically, we provide the error bound analysis, which supplies a justification for the algorithm and give a tradeoff between the subsample size and approximation efficiency. We also argue from empirical studies that: (1) since the gradient-based sampling algorithm is justified without linear model assumption, it works better than the leverage-based sampling under different model specifications; (2) Poisson sampling is much better than sampling with replacement when sampling ratio increases.
There is an interesting problem to address in the further study. Although the gradient-based sampling is proposed to approximate LS solution in this paper, we believe that this sampling method can apply into other optimization problems for large-scale data analysis, since gradient is considered to be the steepest way to attain the (local) optima. Thus, applying this idea to other optimization problems is an interesting study.
Acknowledgments
This research was supported by National Natural Science Foundation of China grants 11301514 and 71532013. We thank Xiuyuan Cheng for comments in a preliminary version.
s.7 Supplementary
s.7.1 Additional Empirical Studies
The influence of the pilot estimate
In gradient-based sampling, we need to get the pilot estimate by uniformly sampling a subsample of size . Now we investigate the effect of by plotting the relative MSE for with respect to that for on GA, MG1 and MG2 datasets in Figure 5. We observe that when is larger than , MSE values of go flat. Thus, we argue that choosing the initial size of to get a pilot estimate may not be careful.
The advantage of poisson sampling
Now we empirically compare poisson sampling (PS) with sampling with replacement (SR). We compare risk performance between them for different values: 0.01 and 0.05. We report the results in Table 4, where we do not report the performance of UNIF and LEV due to the similarity shared with GRAD. From Table 4, there is little difference between PS and SR for , however PS becomes better than SR for . This observation indicates that PS outperforms SR when the sampling ratio increases.
n | 20K | 50K | 100K | 200K | 500K |
---|---|---|---|---|---|
GA | 1.061 | 0.945 | 0.986 | 0.946 | 1.019 |
MG1 | 0.968 | 0.945 | 0.927 | 0.973 | 0.997 |
MG2 | 1.020 | 0.980 | 0.969 | 0.989 | 1.004 |
GA | 0.920 | 0.946 | 0.937 | 0.975 | 0.970 |
MG1 | 0.860 | 0.921 | 0.909 | 0.853 | 0.925 |
MG2 | 0.885 | 0.843 | 0.895 | 0.824 | 0.835 |
The Robustness to Model Specification
The gradient-based sampling algorithm does not reply on the model assumption. We empirically investigate the effect of the model specification on various sampling methods.
Three kinds of model specification are considered here, i.e., models generating data are as follows:
(I) heteroscedasticity,
where is ignored in LS computation, and denotes the seriousness degree of “model wrong” and is set as among ;
(II) model error dependence,
where , and denotes the dependence degree among model errors and is set as among ;
(III) correlation between error and predictor,
where denotes the correlation between model error and the predictor and is set as among .
We report the results on MG1 dataset for and in Table 5 but do not report the results on other data sets because of the similarity. From Table 5, Firstly, most importantly, GRAD still works better than UNIF and LEV. Secondly, Types I and III can bring serious effect, especially Type III causes the most serious effect, while Type II seems have little effect on efficiency of sampling methods. Thus, these observations command that GRAD is a nice choice from the model robustness viewpoint.
Type I: heteroscedasticity | |||||
---|---|---|---|---|---|
0 | 0.2 | 0.5 | 1 | 2 | |
UNIF | 0.027 | 0.031 | 0.054 | 0.131 | 0.466 |
LEV | 0.026 | 0.029 | 0.038 | 0.078 | 0.227 |
GRAD | 0.013 | 0.016 | 0.026 | 0.060 | 0.199 |
Type II: model error dependence | |||||
0 | 0.2 | 0.5 | 0.7 | 0.9 | |
UNIF | 0.027 | 0.027 | 0.028 | 0.028 | 0.027 |
LEV | 0.026 | 0.025 | 0.027 | 0.025 | 0.026 |
GRAD | 0.013 | 0.0143 | 0.013 | 0.014 | 0.013 |
Type III: correlation between error and predictor | |||||
0 | 0.2 | 0.5 | 1 | 2 | |
UNIF | 0.027 | 0.545 | 3.181 | 12.74 | 51.07 |
LEV | 0.026 | 0.235 | 1.344 | 5.271 | 20.80 |
GRAD | 0.013 | 0.157 | 0.908 | 3.724 | 14.67 |
s.7.2 Technical Results
Lemma for proving Theorem 1
To analyze the risk, our key point is to apply Matrix Bernstein expectation bound (Theorem 6.1 in [17]) into matrix Bernoulli series. The lemma below present the expectation bound for matrix Bernoulli series.
Lemma 1
Consider a finite sequence of Hermitian matrices, where is vector. Let , with mean respectively, be a finite sequence of independent Bernoulli variables. Let and
Define matrix Bernoulli series We have,
Since the sequence is independent random Hermitian matrices with , , and
applying the matrix Berstein inequality of Theorem 6.1 in ([17]) to obtain that
Proof of Theorem 1
We have that
(A.1) |
Note that . For convenience, we assume without loss of generality. If the event
(A.2) |
holds, then we have that
and combining (B.2),
(A.3) |
where the 2nd inequality is from the fact that for any and the condition that the event holds. For any , define
Since
by Markov’s inequality we have that,
(A.4) |
Lemma 1 shows that
(A.5) |
For (A.2), we have that if
(A.6) | |||
(A.7) |
holds. Thus, combing (A.3), (A.4), (B.2), (A.6) and (A.7), we get
(A.8) |
where and . From (A.6), . Thus Theorem 1 is proved.
Proof of Corollary 1
References
- P. Drineas, R. Kannan, and M.W. Mahoney. Fast monte carlo algorithms for matrices i: approximating matrix multiplication. SIAM Journal on Scientific Computing, 36:132–157, 2006.
- P. Drineas, R. Kannan, and M.W. Mahoney. Fast monte carlo algorithms for matrices ii: computing a low-rank approximation to a matrix. SIAM Journal on Scientific Computing, 36:158–183, 2006.
- P. Drineas, R. Kannan, and M.W. Mahoney. Fast monte carlo algorithms for matrices iii: computing a compressed approximate matrix decomposition. SIAM Journal on Scientific Computing, 36:184–206, 2006.
- M.W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106:697–702, 2009.
- P. Drineas, M. Magdon-Ismail, M.W. Mahoney, and D.P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13:3475–3506, 2012.
- D.P. Clarkson, K.L.and Woodruff. Low rank approximation and regression in input sparsity time. STOC, 2013.
- M.B. Cohen, Y.T. Lee, C. Musco, C. Musco, R. Peng, and A. Sidford. Uniform sampling for matrix approximation. arXiv:1408.5099, 2014.
- P. Dhillon, Y. Lu, D.P. Foster, and L. Ungar. New subsampling algorithns for fast least squares regression. In Advances in Neural Information Processing Systems, volume 26, pages 360–368, 2013.
- D. Shender and J. Lafferty. Computation-risk tradeoffs for covariance-thresholded regression. In Proceedings of the 30th International Conference on Machine Learning, 2013.
- P. Drineas, M.W. Mahoney, and S. Muthukrishnan. Sampling algorithms for regression and applications. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1127–1136, 2006.
- P. Drineas, M.W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decomposition. SIAM Journal on Matrix Analysis and Applications, 30:844–881, 2008.
- P. Drineas, M.W. Mahoney, S. Muthukrishnan, and T. Sarlos. Faster least squares approximation. Numerische Mathematik, 117:219–249, 2011.
- P. Ma, M.W. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. In Proceedings of the 31th International Conference on Machine Learning, 2014.
- G. Raskutti and M.W. Mahoney. A statistical perspective on randomized sketching for ordinary least-squares. In Proc. of the 32nd ICML Conference, 2015.
- T. Yang, L. Zhang, R. Jin, and S. Zhu. An explicit sampling dependent spectral error bound for column subset selection. In Proc. of the 32nd ICML Conference, 2015.
- C.E. Särndal, B. Swensson, and J.H. Wretman. Model Assisted Survey Sampling. Springer, New York, 2003.
- J.A. Tropp. User-friendly tools for random matrices: An introduction. In Advances in Neural Information Processing Systems, 2012.