# Sparse Approximate Cross-Validation for High-Dimensional GLMs

William T. Stephenson
MIT CSAIL
wtstephe@mit.edu
&Tamara Broderick
MIT CSAIL
tbroderick@mit.edu
###### Abstract

Leave-one-out cross validation (LOOCV) can be particularly accurate among CV variants for estimating out-of-sample error. Unfortunately, LOOCV requires re-fitting a model $N$ times for a dataset of size $N$. To avoid this prohibitive computational expense, a number of authors have proposed approximations to LOOCV. These approximations work well when the unknown parameter is of small, fixed dimension but suffer in high dimensions; they incur a running time roughly cubic in the dimension, and, in fact, we show their accuracy significantly deteriorates in high dimensions. We demonstrate that these difficulties can be surmounted in $\ell_1$-regularized generalized linear models when we assume that the unknown parameter, while high dimensional, has a small support. In particular, we show that, under interpretable conditions, the support of the recovered parameter does not change as each datapoint is left out. This result implies that the previously proposed heuristic of only approximating CV along the support of the recovered parameter has running time and error that scale with the (small) support size even when the full dimension is large. Experiments on synthetic and real data support the accuracy of our approximations.


## 1 Introduction

Modern complex data analyses, on large data sets and in high dimensions, pose computational challenges not just for machine learning algorithms but also for evaluation methods. For instance, practitioners widely use cross validation (CV) to assess out-of-sample error and variability in machine learning methods. But CV requires substantial computation, namely re-running a machine learning algorithm many times, especially in the case of leave-one-out CV (LOOCV). This expense has led to recent proposals to approximate LOOCV (Obuchi and Kabashima, 2016, 2018; Beirami et al., 2017; Rad and Maleki, 2019; Wang et al., 2018; Giordano et al., 2019). Theory and empirics demonstrate that these approximations are fast and accurate – as long as the dimension $D$ of the unknown parameter is low. Unfortunately, a number of issues arise in high dimensions. These LOOCV approximations use local quadratic fits in the parameter space, which incur an $O(D^3)$ time cost for inverting a $D \times D$ matrix. In fact, these matrices may not even be invertible when $D$ is large relative to the number of datapoints $N$. Moreover, existing error bounds for LOOCV approximations either assume a fixed $D$ or suffer from poor error scaling when $D$ grows with $N$. One might wonder whether the theory could be improved, but our own experiments (see, e.g., Fig. 1) confirm that LOOCV approximations can suffer considerable error degradation in high dimensions in practice. We emphasize, though, that this high-dimensional case is exactly the case where we need LOOCV the most, rather than the computationally cheaper $K$-fold CV; the fact that $K$-fold CV trains on a smaller portion of the dataset than LOOCV makes it a more biased estimate of the out-of-sample error (Arlot and Celisse, 2010, Sec. 5.1.1), and in many cases this effect becomes more extreme when $D$ is large relative to $N$ (e.g., Rad and Maleki, 2019, Fig. 1; Wang et al., 2018, Fig. 1).

In this work, we demonstrate that the difficulties of approximate LOOCV in high dimensions can be surmounted by carefully exploiting sparsity. We focus on generalized linear models (GLMs) as they have a clear mechanism for engendering sparsity (namely, setting many parameter components to zero). And we focus on $\ell_1$ regularization as an effective and popular means to recover a sparse parameter support. Since $\ell_1$ penalties are not twice differentiable, they cannot be used directly with existing CV approximations. And we show empirically that smooth approximations to the $\ell_1$ penalty yield slow and inaccurate CV approximations. Instead, we validate a simple and easy-to-use proposal: first run the full machine learning algorithm once and then use approximate LOOCV only on the recovered (low-dimensional) support. This proposal will work if exact CV rounds recover a shared support. Our major theoretical contribution is to show that, under mild and interpretable conditions, the recovered support is in fact shared across rounds of LOOCV with very high probability. Obuchi and Kabashima (2016, 2018) have considered a similar setup and shown that the effect of the change in support is asymptotically negligible for $\ell_1$-regularized linear regression; however, they do not show the support is actually shared. We show that the support is shared with high probability in the practical finite-data setting – even in the very high-dimensional case – and also extend to logistic regression. Our support stability result may be of independent interest and allows us to show that, with high probability under finite data, the error and time cost of our LOOCV approximation will depend on the support size – typically much smaller than the full dimension – rather than on $D$. Our experiments confirm these findings empirically on simulated and real data.

## 2 Overview of Approximations

Let $\theta \in \Theta \subseteq \mathbb{R}^D$ be an unknown parameter of interest. Consider a dataset of size $N$, where $n = 1, \dots, N$ indexes the data points. Then a number of problems – such as maximum likelihood, general M-estimation, and regularized loss minimization – can be expressed as solving

$$\hat\theta := \operatorname*{argmin}_{\theta \in \Theta} \frac{1}{N}\sum_{n=1}^{N} f_n(\theta) + \lambda R(\theta), \tag{1}$$

where $\lambda \ge 0$ is a constant, and $f_n$ and $R$ are functions. For instance, $f_n$ might be the loss associated with the $n$th data point, $R$ the regularizer, and $\lambda$ the amount of regularization. Consider a data set where the $n$th data point has covariates $x_n \in \mathbb{R}^D$ and response $y_n$. In what follows, we will be interested in taking advantage of sparsity. With this in mind, we focus on generalized linear models (GLMs), where $f_n(\theta) = f(x_n^T\theta, y_n)$, as they offer a natural framework in which sparsity can be expressed by choosing many parameter dimensions to be zero.

In LOOCV, we are interested in the solutions of the same problem with the $n$th data point removed. To that end, define $\hat\theta_{\setminus n} := \operatorname{argmin}_{\theta \in \Theta} \frac{1}{N}\sum_{m \ne n} f_m(\theta) + \lambda R(\theta)$. Note our choice of scaling here – $1/N$ instead of $1/(N-1)$. While we believe this choice is not of particular importance in the case of LOOCV, we observe this scaling does not seem to be a settled issue in the literature; see Appendix A for a brief discussion. In any case, computing $\hat\theta_{\setminus n}$ exactly across $n = 1, \dots, N$ usually requires $N$ runs of an optimization procedure – a prohibitive cost. Various approximations, which we detail next, address this poor scaling by solving Eq. 1 only once.

Two approximations. Assume that $f_n$ and $R$ are twice differentiable functions of $\theta$. Let $F(\theta) := \frac{1}{N}\sum_{n=1}^N f_n(\theta)$ be the unregularized objective, and let $H(\theta) := \nabla^2_\theta F(\theta) + \lambda \nabla^2_\theta R(\theta)$ be the Hessian matrix of the full objective. For the moment, we assume the appropriate terms in each approximation below are invertible. Beirami et al. (2017); Rad and Maleki (2019); Wang et al. (2018) approximate $\hat\theta_{\setminus n}$ by taking a Newton step ("NS") on the leave-one-out objective starting from $\hat\theta$; see Section B.4 for details. We thus call this approximation $\tilde\theta^{\mathrm{NS}}_{\setminus n}(R)$ for regularizer $R$:

$$\tilde\theta^{\mathrm{NS}}_{\setminus n}(R) := \hat\theta + \frac{1}{N}\left(H(\hat\theta) - \frac{1}{N}\nabla^2_\theta f_n(\hat\theta)\right)^{-1} \nabla_\theta f_n(\hat\theta). \tag{2}$$

In the case of GLMs, Theorem 8 of Rad and Maleki (2019) gives conditions on $f$ and $R$ that imply, for fixed $D$, the error of $\tilde\theta^{\mathrm{NS}}_{\setminus n}(R)$ averaged over $n$ is $o(1/N)$ as $N \to \infty$.

The second approximation we consider is due to Giordano et al. (2019). While their work applies much more generally than LOOCV, its application to LOOCV was originally formulated only for the unregularized case. We show in Appendix B that their approach can be adapted to the more general case. As their approximation is inspired by the infinitesimal jackknife ("IJ") (Jaeckel, 1972; Efron, 1982), we denote it by $\tilde\theta^{\mathrm{IJ}}_{\setminus n}(R)$; see Section B.1 for details.

$$\tilde\theta^{\mathrm{IJ}}_{\setminus n}(R) := \hat\theta + \frac{1}{N} H(\hat\theta)^{-1} \nabla_\theta f_n(\hat\theta). \tag{3}$$

When $\lambda = 0$, Corollary 1 of Giordano et al. (2019) gives two bounds on the accuracy of Eq. 3: one that holds in general, and a tighter one in the case of bounded gradients $\nabla_\theta f_n$. The constants in these bounds may depend on $D$ but not on $N$. Our Proposition 2 in Section B.3 extends this result to the case of $\lambda > 0$. Still, we are left with the fact that these constants depend on $D$ in an unknown way.

In what follows, we consider both $\tilde\theta^{\mathrm{NS}}_{\setminus n}$ and $\tilde\theta^{\mathrm{IJ}}_{\setminus n}$, as the two approximations have complementary strengths. In terms of empirics, we find that $\tilde\theta^{\mathrm{NS}}_{\setminus n}$ performs better in our LOOCV GLM experiments here. But $\tilde\theta^{\mathrm{IJ}}_{\setminus n}$ is computationally efficient for a wider range of losses and CV variants. E.g., one can derive an NS-style approximation for general leave-$k$-out CV for GLMs, but every round will require inversion of a matrix that differs from $H(\hat\theta)$ by a new rank-$k$ matrix. Meanwhile, the IJ approximation still requires only a single inversion, namely of $H(\hat\theta)$. In terms of theory, $\tilde\theta^{\mathrm{NS}}_{\setminus n}$ has a generally tighter error bound, but given a good bound on the gradients $\nabla_\theta f_n$, the theory behind $\tilde\theta^{\mathrm{IJ}}_{\setminus n}$ may provide a tighter rate.
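To make the two approximations concrete, here is a minimal sketch in a smooth setting where Eqs. 2 and 3 apply directly: logistic regression with an $\ell_2$ penalty, so that $R$ is twice differentiable. All function names and the setup are ours, not the paper's; a production implementation would use a standard solver for the initial fit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def fit_logistic_l2(X, y, lam, n_iter=50):
    """Newton's method on F(theta) + (lam/2)||theta||^2, with
    F(theta) = (1/N) sum_n [log(1 + exp(x_n^T theta)) - y_n x_n^T theta], y in {0,1}."""
    N, D = X.shape
    theta = np.zeros(D)
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        g = X.T @ (p - y) / N + lam * theta
        W = p * (1.0 - p)
        H = (X * W[:, None]).T @ X / N + lam * np.eye(D)  # full-objective Hessian
        theta = theta - np.linalg.solve(H, g)
    return theta

def loo_approximations(X, y, theta_hat, lam):
    """NS (Eq. 2) and IJ (Eq. 3) approximations to every leave-one-out fit."""
    N, D = X.shape
    p = sigmoid(X @ theta_hat)
    W = p * (1.0 - p)
    H = (X * W[:, None]).T @ X / N + lam * np.eye(D)
    H_inv = np.linalg.inv(H)
    ns = np.empty((N, D))
    ij = np.empty((N, D))
    for n in range(N):
        g_n = (p[n] - y[n]) * X[n]             # gradient of f_n at theta_hat
        h_n = W[n] * np.outer(X[n], X[n])      # Hessian of f_n at theta_hat
        ns[n] = theta_hat + np.linalg.solve(H - h_n / N, g_n) / N   # Eq. 2
        ij[n] = theta_hat + (H_inv @ g_n) / N                       # Eq. 3
    return ns, ij
```

As the discussion above suggests, the two outputs differ only at higher order: IJ reuses one inverse Hessian for all $n$, while NS re-solves with the rank-one-corrected Hessian per datapoint.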

Problems in high dimensions. The first challenge for both approximations in high dimensions is computational. Since every variant of CV or approximate CV requires running the machine learning algorithm of interest at least once, we will focus on the cost of the approximations after this single run. Given $\hat\theta$, both approximations require the inversion of a $D \times D$ matrix. Calculation of $\tilde\theta^{\mathrm{IJ}}_{\setminus n}$ across all $n$ requires a single matrix inversion and $N$ matrix-vector multiplications for a runtime in $O(D^3 + ND^2)$. In general, calculating $\tilde\theta^{\mathrm{NS}}_{\setminus n}$ has a runtime of $O(ND^3)$ due to needing a separate inversion for each $n$. In the case of GLMs, though, $\nabla^2_\theta f_n(\hat\theta)$ is a rank-one matrix, so standard rank-one updates give a runtime of $O(D^3 + ND^2)$ as well.
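One way to realize the rank-one-update runtime quoted above is via the Sherman–Morrison identity: for a GLM, each left-out Hessian differs from $H(\hat\theta)$ by a rank-one term $(w_n/N)\,x_n x_n^T$, where $w_n$ is the GLM's per-datapoint Hessian weight. A hedged sketch (names ours):

```python
import numpy as np

def loo_inverses(H_inv, X, w):
    """Return (H - (w_n/N) x_n x_n^T)^{-1} for each n via Sherman-Morrison,
    given the single precomputed inverse H_inv of the symmetric matrix H."""
    N = X.shape[0]
    out = []
    for n in range(N):
        u = (w[n] / N) * X[n]            # rank-one factor: H - u x_n^T
        Hi_u = H_inv @ u
        Hi_x = H_inv @ X[n]              # H_inv symmetric, so x_n^T H_inv = Hi_x^T
        denom = 1.0 - X[n] @ Hi_u
        # (H - u x_n^T)^{-1} = H^{-1} + H^{-1} u x_n^T H^{-1} / (1 - x_n^T H^{-1} u)
        out.append(H_inv + np.outer(Hi_u, Hi_x) / denom)
    return out
```

Each update is $O(D^2)$, so all $N$ left-out inverses cost $O(ND^2)$ after the one $O(D^3)$ inversion.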

The second challenge for both approximations is the invertibility of $H(\hat\theta)$ and $H(\hat\theta) - (1/N)\nabla^2_\theta f_n(\hat\theta)$ that was assumed in defining $\tilde\theta^{\mathrm{NS}}_{\setminus n}$ and $\tilde\theta^{\mathrm{IJ}}_{\setminus n}$. We note that, if the regularizer term $\lambda\nabla^2_\theta R(\hat\theta)$ is only positive semidefinite, then invertibility of both matrices may be impossible when $D > N$, presenting a major problem in high dimensions. We discuss this point further in Section B.2.

The third and final challenge for both approximations is accuracy in high dimensions. Not only do existing error bounds behave poorly (or not exist) in high dimensions, but empirical performance degrades as well. To create Fig. 1, we generated datasets from a sparse logistic regression model with $N$ ranging from 500 to 5,000. For the blue lines, we held the dimension $D$ small and fixed, and for the red lines we let $D$ grow with $N$. In both cases, we see that the error is much lower when $D$ is small and fixed.

We recall that for large $N$ and small $D$, training error often provides a fine estimate of the out-of-sample error (e.g., see Vapnik (1992)). That is, CV is needed precisely in the high-dimensional regime, and this case is exactly where current approximations struggle both computationally and statistically. Thus, we wish to understand whether there are high-$D$ cases where approximate CV is useful. In what follows, we answer this question in the affirmative.

Approximate CV with $\ell_1$ Regularization. For statistical or interpretability reasons, we often wish to recover a parameter that has some low "effective dimension" $D_{\mathrm{eff}} \ll D$. One way to achieve low $D_{\mathrm{eff}}$ is sparsity: i.e., we have $D_{\mathrm{eff}} = |\hat S|$, where $\hat S$ collects the indices of the non-zero entries of $\hat\theta$. A way to achieve sparsity is choosing $R(\theta) = \lVert\theta\rVert_1$. We show that proper use of $\ell_1$ regularization allows approximate CV to be far more accurate and fast. Note that $\tilde\theta^{\mathrm{NS}}_{\setminus n}$ and $\tilde\theta^{\mathrm{IJ}}_{\setminus n}$ cannot be applied directly in this case, as $\lVert\theta\rVert_1$ is not twice-differentiable. One proposal put forward by Rad and Maleki (2019); Wang et al. (2018) is to use a smoothed approximation to $\lVert\theta\rVert_1$; however, as we show in Section 4, this approach is both inaccurate and slow. We instead take the intuitive approach of approximating CV only on the dimensions in $\hat S$.
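As a small illustration of how the $\ell_1$ penalty produces a sparse recovered support $\hat S$, here is a minimal proximal-gradient (ISTA) lasso solver. This is purely illustrative, under our own toy setup, and is not the solver used in the paper's experiments:

```python
import numpy as np

def soft_threshold(v, t):
    """Prox operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """ISTA on (1/2N) ||y - X theta||^2 + lam * ||theta||_1."""
    N, D = X.shape
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / N)   # 1/L for the smooth part
    theta = np.zeros(D)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / N
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

# The recovered support is then S_hat = np.flatnonzero(theta_hat):
# exactly the index set on which the restricted approximations below operate.
```

The soft-threshold step is what sets most coordinates exactly to zero, giving the small $|\hat S|$ that the restricted approximations exploit.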

For notation, let $X \in \mathbb{R}^{N \times D}$ be the covariate matrix, with rows $x_n^T$. For an index set $S$, let $X_S$ be the submatrix of $X$ with column indices in $S$; define $\theta_S$ and $x_{n,S}$ similarly. Let $F(\theta) := \frac{1}{N}\sum_{n=1}^N f(x_n^T\theta, y_n)$, and define the restricted Hessian evaluated at $\hat\theta$: $H_{\hat S \hat S} := [\nabla^2_\theta F(\hat\theta)]_{\hat S \hat S}$. Further define the LOO restricted Hessian, $H^{\setminus n}_{\hat S \hat S} := H_{\hat S \hat S} - (1/N)[\nabla^2_\theta f_n(\hat\theta)]_{\hat S \hat S}$. Finally, without loss of generality, assume the indices are ordered so that $\hat S = \{1, \dots, |\hat S|\}$. We now define versions of $\tilde\theta^{\mathrm{NS}}_{\setminus n}$ and $\tilde\theta^{\mathrm{IJ}}_{\setminus n}$ restricted to the entries in $\hat S$:

$$\theta^{\mathrm{NS}}_{\setminus n} := \begin{pmatrix} \hat\theta_{\hat S} + \frac{1}{N}\big(H^{\setminus n}_{\hat S \hat S}\big)^{-1}\big[\nabla_\theta f(x_n^T\hat\theta, y_n)\big]_{\hat S} \\ 0 \end{pmatrix}, \qquad \theta^{\mathrm{IJ}}_{\setminus n} := \begin{pmatrix} \hat\theta_{\hat S} + \frac{1}{N} H_{\hat S \hat S}^{-1}\big[\nabla_\theta f(x_n^T\hat\theta, y_n)\big]_{\hat S} \\ 0 \end{pmatrix}. \tag{4}$$

Other authors have previously considered $\theta^{\mathrm{NS}}_{\setminus n}$. Rad and Maleki (2019); Wang et al. (2018) derive $\theta^{\mathrm{NS}}_{\setminus n}$ by considering a smooth approximation to $\lVert\theta\rVert_1$ and then taking the limit of $\tilde\theta^{\mathrm{NS}}_{\setminus n}$ as the amount of smoothness goes to zero. In Appendix C, we show a similar argument can yield $\theta^{\mathrm{IJ}}_{\setminus n}$. Also, Obuchi and Kabashima (2016, 2018) directly propose the special case of $\theta^{\mathrm{NS}}_{\setminus n}$ for linear and logistic regression without using $\tilde\theta^{\mathrm{NS}}_{\setminus n}$ as a starting point. We now show how $\theta^{\mathrm{NS}}_{\setminus n}$ and $\theta^{\mathrm{IJ}}_{\setminus n}$ avoid many of the high-dimensional challenges with $\tilde\theta^{\mathrm{NS}}_{\setminus n}$ and $\tilde\theta^{\mathrm{IJ}}_{\setminus n}$ that we discussed above.

The first challenge was that the compute time for $\tilde\theta^{\mathrm{NS}}_{\setminus n}$ and $\tilde\theta^{\mathrm{IJ}}_{\setminus n}$ scaled poorly with $D$. That $\theta^{\mathrm{NS}}_{\setminus n}$ and $\theta^{\mathrm{IJ}}_{\setminus n}$ do not share this issue is immediate from their definitions.

###### Proposition 1.

For general $f_n$, the time to compute $\theta^{\mathrm{NS}}_{\setminus n}$ or $\theta^{\mathrm{IJ}}_{\setminus n}$ scales with $|\hat S|$ rather than with $D$. In particular, computing $\theta^{\mathrm{IJ}}_{\setminus n}$ across all $n$ takes $O(|\hat S|^3 + N|\hat S|^2)$ time, and computing $\theta^{\mathrm{NS}}_{\setminus n}$ across all $n$ takes $O(N|\hat S|^3)$ time. Furthermore, when $f_n$ takes the form of a GLM, computing $\theta^{\mathrm{NS}}_{\setminus n}$ across all $n$ takes $O(|\hat S|^3 + N|\hat S|^2)$ time.

Thus, when $|\hat S| \ll D$, $\theta^{\mathrm{NS}}_{\setminus n}$ and $\theta^{\mathrm{IJ}}_{\setminus n}$ will be much faster to compute than $\tilde\theta^{\mathrm{NS}}_{\setminus n}$ and $\tilde\theta^{\mathrm{IJ}}_{\setminus n}$. The second challenge was that $H(\hat\theta)$ and $H(\hat\theta) - (1/N)\nabla^2_\theta f_n(\hat\theta)$ may not be invertible when $D > N$. Notice the relevant matrices in Eq. 4 are of dimension $|\hat S| \times |\hat S|$. So, to calculate $\theta^{\mathrm{NS}}_{\setminus n}$ and $\theta^{\mathrm{IJ}}_{\setminus n}$, we need only make the much less restrictive assumption that $|\hat S| < N$, rather than $D < N$. We address the third and final challenge of accuracy in the next section.
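A compact sketch of the restricted approximations of Eq. 4 for $\ell_1$-regularized logistic regression, given the single full-data fit $\hat\theta$; all linear algebra is $|\hat S|$-dimensional. The helper names and the numerical-zero threshold `eps` are our own choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def restricted_loo(theta_hat, X, y, eps=1e-8):
    """NS and IJ approximations of Eq. 4 (logistic regression, y in {0,1}),
    computed only over the recovered support and zero elsewhere."""
    N, D = X.shape
    S = np.flatnonzero(np.abs(theta_hat) > eps)    # recovered support S-hat
    X_S = X[:, S]
    p = sigmoid(X @ theta_hat)
    W = p * (1.0 - p)
    H_SS = (X_S * W[:, None]).T @ X_S / N          # restricted Hessian of F
    H_SS_inv = np.linalg.inv(H_SS)                 # |S| x |S| inversion only
    ns = np.zeros((N, D))
    ij = np.zeros((N, D))
    for n in range(N):
        g_S = (p[n] - y[n]) * X_S[n]               # [grad of f_n]_S at theta_hat
        h_S = W[n] * np.outer(X_S[n], X_S[n])      # restricted Hessian of f_n
        ns[n, S] = theta_hat[S] + np.linalg.solve(H_SS - h_S / N, g_S) / N
        ij[n, S] = theta_hat[S] + (H_SS_inv @ g_S) / N
    return ns, ij
```

Every matrix formed here is $|\hat S| \times |\hat S|$, so (as Proposition 1 states for the exact procedure) cost scales with the support size rather than with $D$.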

## 3 Approximation quality in high dimensions

Recall that the accuracy of $\tilde\theta^{\mathrm{NS}}_{\setminus n}$ and $\tilde\theta^{\mathrm{IJ}}_{\setminus n}$ in general has a poor dependence on the dimension $D$. We now show that the accuracy of $\theta^{\mathrm{NS}}_{\setminus n}$ and $\theta^{\mathrm{IJ}}_{\setminus n}$ depends on the (hopefully small) support size rather than on $D$. We start by assuming a "true" population parameter $\theta^*$ (this assumption may not be necessary to prove the dependence of $\theta^{\mathrm{NS}}_{\setminus n}$ and $\theta^{\mathrm{IJ}}_{\setminus n}$ on the support size, but it allows us to invoke existing support results in our proofs) that minimizes the population-level loss, $\theta^* := \operatorname{argmin}_\theta \mathbb{E}[f(x^T\theta, y)]$, where the expectation is over $(x, y)$ drawn from some population distribution. Assume $\theta^*$ is sparse with support $S := \{d : \theta^*_d \ne 0\}$ and $|S| \ll D$. Our parameter estimate would be faster and more accurate if an oracle told us $S$ in advance and we worked just over the dimensions in $S$:

$$\hat\phi := \operatorname*{argmin}_{\phi \in \mathbb{R}^{|S|}} \frac{1}{N}\sum_{n=1}^{N} f(x_{n,S}^T\phi, y_n) + \lambda \lVert\phi\rVert_1. \tag{5}$$

We define $\hat\phi_{\setminus n}$ as the leave-one-out variant of $\hat\phi$ (as $\hat\theta_{\setminus n}$ is to $\hat\theta$). Let $\phi^{\mathrm{NS}}_{\setminus n}$ and $\phi^{\mathrm{IJ}}_{\setminus n}$ be the result of applying the NS or IJ approximation to the restricted problem in Eq. 5; note that $\phi^{\mathrm{NS}}_{\setminus n}$ and $\phi^{\mathrm{IJ}}_{\setminus n}$ have accuracy that scales with the (small) dimension $|S|$.

Our analysis of the accuracy of $\theta^{\mathrm{NS}}_{\setminus n}$ and $\theta^{\mathrm{IJ}}_{\setminus n}$ will depend on the following idea: if, for all $n$, $\hat\theta$, $\hat\theta_{\setminus n}$, and the approximations run over the same $|S|$-dimensional subspace, then the accuracy of $\theta^{\mathrm{NS}}_{\setminus n}$ and $\theta^{\mathrm{IJ}}_{\setminus n}$ must be identical to that of $\phi^{\mathrm{NS}}_{\setminus n}$ and $\phi^{\mathrm{IJ}}_{\setminus n}$. In the case of $\ell_1$ regularization, this idea specializes to the following condition, under which our main result in Theorem 1 will be immediate.

###### Condition 1.

For all $n$, we have $\operatorname{supp}(\hat\theta_{\setminus n}) = \operatorname{supp}(\hat\theta) = S$.

###### Theorem 1.

Assume Condition 1 holds. Then for all $n$, $\hat\theta_{\setminus n}$ and $\theta^{\mathrm{IJ}}_{\setminus n}$ are (1) zero outside the dimensions in $S$ and (2) equal to their restricted counterparts from Eq. 5:

$$\hat\theta_{\setminus n} = \begin{pmatrix} \hat\theta_{\setminus n, S} \\ 0 \end{pmatrix} = \begin{pmatrix} \hat\phi_{\setminus n} \\ 0 \end{pmatrix}, \qquad \theta^{\mathrm{IJ}}_{\setminus n} = \begin{pmatrix} \theta^{\mathrm{IJ}}_{\setminus n, S} \\ 0 \end{pmatrix} = \begin{pmatrix} \phi^{\mathrm{IJ}}_{\setminus n} \\ 0 \end{pmatrix}. \tag{6}$$

It follows that the error is the same in the full problem as in the low-dimensional restricted problem: $\lVert \hat\theta_{\setminus n} - \theta^{\mathrm{IJ}}_{\setminus n} \rVert_2 = \lVert \hat\phi_{\setminus n} - \phi^{\mathrm{IJ}}_{\setminus n} \rVert_2$. The same results hold with $\theta^{\mathrm{IJ}}_{\setminus n}$ and $\phi^{\mathrm{IJ}}_{\setminus n}$ replaced by $\theta^{\mathrm{NS}}_{\setminus n}$ and $\phi^{\mathrm{NS}}_{\setminus n}$.

Again, Theorem 1 is immediate if one is willing to assume Condition 1, but when does Condition 1 hold? There exist general conditions in the literature under which $\operatorname{supp}(\hat\theta) = S$ (Lee et al., 2014; Li et al., 2015). If one assumed these conditions held for each of the $N$ leave-one-out problems, then Condition 1 would directly follow. However, it is not immediate that any models of interest meet these conditions. In what follows, we give a few interpretable assumptions under which Condition 1 holds.

In fact, we show that we need just four principal assumptions in the case of linear and logistic regression; we conjecture that similar results hold for other GLMs. The first assumption arises from the intuition that, if individual data points are very extreme, the support will certainly change for some $n$. To avoid these extremes with high probability, we assume that the covariates follow a sub-Gaussian distribution:

###### Definition 1.

[Vershynin (2018)] For $\sigma > 0$, a random variable $Z$ is $\sigma$-sub-Gaussian if $\mathbb{E}\left[e^{Z^2/\sigma^2}\right] \le 2$.

###### Assumption 1.

Each $x_n$ has zero-mean i.i.d. $c_x$-sub-Gaussian entries with unit variance.

We conjecture that the unit-variance part of the assumption is unnecessary. Conditions on the distributions of the responses $y_n$ will be specific to linear and logistic regression and will be given in Assumptions 5 and 6, respectively. Our results below will hold with high probability under these distributions. Note there are reasons to expect we cannot do better than high-probability results. In particular, Xu et al. (2012) suggest that there exist worst-case training datasets for which sparsity-inducing methods like $\ell_1$ regularization are not stable as each datapoint is left out.

Our second principal assumption is a full-data high-probability incoherence condition.

###### Assumption 2.

The incoherence condition holds with high probability over the data: for some constant $\alpha \in (0, 1)$,

$$\Pr\left[\left\lVert \nabla^2_\theta F(\theta^*)_{S^c S}\left(\nabla^2_\theta F(\theta^*)_{SS}\right)^{-1} \right\rVert_\infty \ge 1 - \alpha \right] \le e^{-25}.$$

Authors in the literature often assume that incoherence holds deterministically for a given design matrix – starting from the introduction of incoherence in Zhao and Yu (2006) and continuing in more recent work (Lee et al., 2014; Li et al., 2015). Similarly, we will take our high-probability version in Assumption 2 as given. But we note that Assumption 2 is at least known to hold for the case of linear regression with an i.i.d. Gaussian design matrix (e.g., see Exercise 11.5 of Hastie et al. (2015)). We next place some restrictions on how quickly $D$ and $|S|$ grow as functions of $N$.

###### Assumption 3.

As functions of $N$, the dimension $D$ and the support size $|S|$ satisfy three growth conditions; we state the exact rates in Appendix G.

The constraints on $D$ here are particularly loose. While those on $|S|$ are tighter, we still allow polynomial growth of $|S|$ in $N$ for some lower powers of $N$. Our final assumption is on the smallest non-zero entry of $\theta^*$. Such conditions – typically called beta-min conditions – are frequently used in the literature to ensure $\operatorname{supp}(\hat\theta) = S$ (Wainwright, 2009; Lee et al., 2014; Li et al., 2015).

###### Assumption 4.

The smallest non-zero entry of $\theta^*$ satisfies a lower bound scaling with $\lambda$ and a constant relating to the objective function $F$; see Assumption 15 in Section G.1 for an exact description.

### 3.1 Linear regression

We now give the distributional assumption on the responses in the case of linear regression and then show that Condition 1 holds.

###### Assumption 5.

$y_n = x_n^T\theta^* + \varepsilon_n$, where the $\varepsilon_n$ are i.i.d. $c_\varepsilon$-sub-Gaussian random variables.

###### Theorem 2 (Linear Regression).

Take Assumptions 1, 2, 3, 4 and 5. Suppose the regularization parameter satisfies

$$\lambda \ge \frac{C}{\alpha}\, M_{\mathrm{lin}} \left( \sqrt{\frac{c_x^2 c_\varepsilon^2 \log D}{N}} + \frac{25\, c_x^2 c_\varepsilon^2}{N} + \frac{4\, c_x c_\varepsilon \big(\log(ND) + 26\big)}{N} \right), \tag{7}$$

where $C$ is a constant in terms of $c_x$ and $c_\varepsilon$, and $M_{\mathrm{lin}}$ is a scalar given by Eq. 35 in Appendix G that converges to a constant as $N \to \infty$. Then for $N$ sufficiently large, Condition 1 holds with high probability; the exact probability is given in the full statement in Appendix G.

A full statement and proof of Theorem 2, including the exact value of $M_{\mathrm{lin}}$, appears in Appendix G. A corollary of Theorem 1 and Theorem 2 together is that, under Assumptions 1–5, the LOOCV approximations $\theta^{\mathrm{NS}}_{\setminus n}$ and $\theta^{\mathrm{IJ}}_{\setminus n}$ have accuracy that depends on the (ideally small) $|S|$ rather than the (potentially large) $D$.

It is worth considering how the allowed values of $\lambda$ in Eq. 7 compare to previous results in the literature for the support recovery of $\hat\theta$. We will talk about a sequence of choices of $\lambda$ scaling with $N$, denoted by $\lambda_N$. Theorem 11.3 of Hastie et al. (2015) shows that $\lambda_N \ge c\sqrt{\log D / N}$ (for some constant $c$ in terms of $c_x$ and $c_\varepsilon$) is sufficient for ensuring that $\operatorname{supp}(\hat\theta) = S$ with high probability in the case of linear regression. Thus, we ought to set $\lambda_N \propto \sqrt{\log D / N}$ to ensure support recovery of $\hat\theta$. Compare this to the bound implied by Eq. 7. We have that $M_{\mathrm{lin}}$ converges to a constant as $N \to \infty$, so that, for large $N$, the bound in Eq. 7 becomes $\lambda_N \ge C'\sqrt{\log D / N}$ for some constant $C'$. We immediately see that the sequence of $\lambda_N$ satisfying Eq. 7 scales at exactly the same rate as those that ensure $\operatorname{supp}(\hat\theta) = S$. This is important, as the error in $\hat\theta$, $\lVert \hat\theta - \theta^* \rVert_2$, is typically proportional to $\lambda_N$. The fact that we have not increased the asymptotic scaling of $\lambda_N$ therefore means that we can enjoy the same decay of $\lVert \hat\theta - \theta^* \rVert_2$ while ensuring $\operatorname{supp}(\hat\theta_{\setminus n}) = S$ for all $n$.

### 3.2 Logistic regression

We now give the distributional assumption on the responses in the case of logistic regression.

###### Assumption 6.

For each $n$, we have $y_n \in \{0, 1\}$ with $\Pr(y_n = 1 \mid x_n) = \left(1 + e^{-x_n^T\theta^*}\right)^{-1}$.

We will also need a condition on the minimum eigenvalue of the Hessian.

###### Assumption 7.

Assume that, for some scalar $L_{\min} > 0$ that may depend on $N$ and $D$, we have

$$\Pr\left[\lambda_{\min}\left(\nabla^2_\theta F(\theta^*)_{SS}\right) \le L_{\min}\right] \le e^{-25}.$$

Furthermore, assume the scaling of $L_{\min}$ in $N$ and $D$ is such that, under Assumption 3 and for sufficiently large $N$, $L_{\min}$ is bounded below by a positive constant that may depend on $c_x$.

In the case of linear regression, we did not need an analogue of Assumption 7, as standard matrix concentration results tell us that its Hessian satisfies Assumption 7 with a constant $L_{\min}$ (see Lemma 2 in Appendix G). The Hessian for logistic regression is significantly more complicated, and it is typical in the literature to make some kind of assumption about its eigenvalues (Bach, 2010; Li et al., 2015). Empirically, Assumption 7 is satisfied when Assumptions 1 and 6 hold; however, we are unaware of any results in the literature showing this is the case.

###### Theorem 3 (Logistic Regression).

Take Assumptions 1, 2, 3, 4, 6 and 7. Suppose the regularization parameter satisfies:

$$\lambda \ge \frac{C}{\alpha}\, M_{\mathrm{log}}\, r \left( \sqrt{\frac{c_x^2 (25 + \log D)}{N}} + \frac{\sqrt{2 c_x^2 \log(ND)} + \sqrt{50\, c_x^2}}{N} \right), \tag{8}$$

where $C$ and $r$ are constants in terms of $\alpha$ and $c_x$, and $M_{\mathrm{log}}$ is a scalar given by Eq. 66 that converges to a constant as $N \to \infty$. Then for $N$ sufficiently large, Condition 1 is satisfied with high probability; the exact probability is given in the full statement.

A restatement and proof of Theorem 3 are given as Theorem 5 in Appendix G. Similar to the remarks after Theorem 2, Theorem 3 implies that, when applied to logistic regression, the LOOCV approximations $\theta^{\mathrm{NS}}_{\setminus n}$ and $\theta^{\mathrm{IJ}}_{\setminus n}$ have accuracy that depends on the (ideally small) $|S|$ rather than the (potentially large) $D$, even when $D > N$.

Theorem 3 has implications for the work of Obuchi and Kabashima (2018). The authors of that paper conjecture that, as $N, D \to \infty$, the change in support of $\ell_1$-regularized logistic regression becomes negligible as each datapoint is left out, and use this assumption to derive a version of $\theta^{\mathrm{NS}}_{\setminus n}$ customized to logistic regression. Our Theorem 3 confirms this conjecture by proving the stronger fact that the support is actually unchanged with high probability for finite data.

## 4 Experiments

Two alternatives to the limiting approximation. Our theoretical results imply that $\theta^{\mathrm{NS}}_{\setminus n}$ and $\theta^{\mathrm{IJ}}_{\setminus n}$ have high accuracy and low computational cost. We begin our empirical investigation of this claim by first comparing to two more straightforward alternatives: (A) just use Eq. 3 with some smoothed version of $\lVert\theta\rVert_1$, or (B) approximate exact CV by exactly computing $\hat\theta_{\setminus n}$ for just a few random values of $n$. To illustrate (A), we consider the smooth approximation to $\lVert\theta\rVert_1$ given by Rad and Maleki (2019), which approaches $\lVert\theta\rVert_1$ as its smoothness parameter grows. We found that this approximation became numerically unstable for the purposes of optimization when the smoothness parameter was much larger than 100, so we set it to 100 in our experiments. To test (A) and (B), we trained $\ell_1$-regularized logistic regression models on twenty-five random datasets; the true $\theta^*$ was supported on its first five entries. We computed our approximation to CV as $\mathrm{ALOO} := \frac{1}{N}\sum_{n=1}^N f(x_n^T \theta^{\mathrm{IJ}}_{\setminus n}, y_n)$, and we computed the exact CV estimate, denoted by $\mathrm{LOO}$, similarly. We computed the accuracy of our approximation as a percent error:

$$\left| \mathrm{ALOO} - \mathrm{LOO} \right| / \mathrm{LOO}. \tag{9}$$

Fig. 2 compares the accuracy and run times of (A) and (B) versus our approximation. We chose the number of subsamples in (B) so that subsampling CV would have about the same runtime as computing $\theta^{\mathrm{IJ}}_{\setminus n}$ for all $n$ (specifically, we computed 41 different $\hat\theta_{\setminus n}$ for each trial in order to roughly match the time cost of computing $\theta^{\mathrm{IJ}}_{\setminus n}$ for all $N$ datapoints). In this case, we see that the accuracy of subsampling is much worse than that of $\theta^{\mathrm{IJ}}_{\setminus n}$ for nearly every trial. Using Eq. 3 with the smoothed $\ell_1$ penalty as a regularizer is far worse: while subsampling exact CV is fast and unbiased (although high variance), the smoothed approximation has to form an approximation over all $D$ dimensions; the resulting approximation is orders of magnitude less accurate and slower.
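For completeness, the metric of Eq. 9 as a tiny helper (names ours), assuming the per-datapoint losses under the exact and approximate leave-one-out parameters have already been computed:

```python
import numpy as np

def percent_error(approx_losses, exact_losses):
    """|ALOO - LOO| / LOO, where each estimate is the mean of its per-point losses."""
    aloo = float(np.mean(approx_losses))
    loo = float(np.mean(exact_losses))
    return abs(aloo - loo) / loo
```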

##### The importance of correct support recovery.

Our theoretical results show that each $\hat\theta_{\setminus n}$ having correct support (i.e., $\operatorname{supp}(\hat\theta_{\setminus n}) = S$) is a sufficient condition for obtaining the fixed-dimensional error scaling shown in blue in Fig. 1. One might wonder if such a condition is necessary for approximate CV to be accurate. We offer evidence in Appendix D that this is the case. In particular, we empirically show that when the size of the recovered support grows with $N$, the error in the approximations also grows with $N$.

Real Data Experiments. We next aim to study how dependent our results are on the particular random designs we chose in Theorems 2 and 3. We explore this question with a number of publicly available datasets (bcTCGA, 2018; Lewis et al., 2004; Guyon et al., 2004) described in Appendix E. We chose the particular datasets shown here for having a high enough dimension to observe the effect of our results, yet not so high in dimension nor number of datapoints that running exact CV for comparison was prohibitively expensive. For a description of each dataset as well as our exact experimental setup, see Appendix E. For each dataset, we approximate CV for the $\ell_1$-regularized model using $\theta^{\mathrm{NS}}_{\setminus n}$ and $\theta^{\mathrm{IJ}}_{\setminus n}$. For comparison, we report the accuracy of $\tilde\theta^{\mathrm{NS}}_{\setminus n}$ and $\tilde\theta^{\mathrm{IJ}}_{\setminus n}$ applied with the smoothed $\ell_1$ penalty. Our results, reported in Fig. 3, show that approximate CV for the $\ell_1$-regularized problem is significantly faster and more accurate.

To briefly demonstrate the scalability of our approximations, we re-ran our RCV1 experiment on a larger version of the dataset (with larger $N$ and $D$). Based on the time to compute exact LOOCV for twenty datapoints, we estimate exact LOOCV would have taken over two weeks to complete, whereas computing both $\theta^{\mathrm{NS}}_{\setminus n}$ and $\theta^{\mathrm{IJ}}_{\setminus n}$ for all $n$ took three minutes.

## 5 Selection of λ and future work

We have shown, in both theory and practice, that we can expect $\theta^{\mathrm{NS}}_{\setminus n}$ and $\theta^{\mathrm{IJ}}_{\setminus n}$ to harness the low effective dimension present in $\ell_1$-regularized GLMs to give significant improvements in both computation time and approximation accuracy. However, these results, as well as those in all previous works on approximate CV, apply only when assessing the generalization error for a single fixed model (i.e., for a fixed $\lambda$). These analyses do not address one of the more common uses of CV, which is to train with many different values of $\lambda$ and select the one with the lowest CV error. In Appendix F, we give some empirical evidence that approximate CV methods are sometimes – but not always – suitable for this purpose. Specifically, we show that the infinitesimal jackknife approximation sometimes selects a value of $\lambda$ at the boundary of the considered range. Fortunately, this failure mode is easy to detect, suggesting the following workflow: run approximate CV to select $\lambda$, and in the event that a boundary value is selected, run exact CV instead. Still, this suggestion is purely based on empirics. We believe that understanding the uses and limitations of approximate CV for selecting $\lambda$ is one of the most important directions for future work in this area.

#### Acknowledgements

This research is supported in part by DARPA, the CSAIL-MSR Trustworthy AI Initiative, an NSF CAREER Award, an ARO YIP Award, and ONR.

## References

• Arlot and Celisse (2010) S. Arlot and A. Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 2010.
• Bach (2010) F. Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4, 2010.
• bcTCGA (2018) bcTCGA. Breast cancer gene expression data, Nov 2018.
• Beirami et al. (2017) A. Beirami, M. Razaviyayn, S. Shahrampour, and V. Tarokh. On optimal generalizability in parametric learning. In Advances in Neural Information Processing Systems (NIPS), pages 3458–3468, 2017.
• Chetverikov et al. (2019) D. Chetverikov, Z. Liao, and V. Chernozhukov. On cross-validated Lasso. arXiv Preprint, jan 2019.
• Efron (1982) B. Efron. The Jackknife, the Bootstrap, and Other Resampling Plans, volume 38. Society for Industrial and Applied Mathematics, 1982.
• Giordano et al. (2019) R. Giordano, W. T. Stephenson, R. Liu, M. I. Jordan, and T. Broderick. A Swiss army infinitesimal jackknife. In International Conference on Artificial Intelligence and Statistics (AISTATS), April 2019.
• Guyon et al. (2004) I. Guyon, S. R. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems (NIPS), 2004.
• Hastie et al. (2015) T. Hastie, R. Tibshirani, and M. Wainwright. Statistical learning with sparsity: the Lasso and generalizations. Chapman and Hall / CRC, 2015.
• Homrighausen and McDonald (2013) D. Homrighausen and D. J. McDonald. The lasso, persistence, and cross-validation. In International Conference in Machine Learning (ICML), 2013.
• Homrighausen and McDonald (2014) D. Homrighausen and D. J. McDonald. Leave-one-out cross-validation is risk consistent for Lasso. Machine Learning, 97(1-2):65–78, October 2014.
• Jaeckel (1972) L. Jaeckel. The infinitesimal jackknife, memorandum. Technical report, MM 72-1215-11, Bell Lab. Murray Hill, NJ, 1972.
• Lee et al. (2014) J. D. Lee, Y. Sun, and J. E. Taylor. On model selection consistency of regularized m-estimators. arXiv Preprint, Oct 2014.
• Lewis et al. (2004) D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 2004.
• Li et al. (2015) Y. Li, J. Scarlett, P. Ravikumar, and V. Cevher. Sparsistency of l1-regularized m-estimators. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
• Miolane and Montanari (2018) L. Miolane and A. Montanari. The distribution of the Lasso: Uniform control over sparse balls and adaptive parameter tuning. arXiv Preprint, November 2018.
• Obuchi and Kabashima (2016) T. Obuchi and Y. Kabashima. Cross validation in lasso and its acceleration. Journal of Statistical Mechanics, May 2016.
• Obuchi and Kabashima (2018) T. Obuchi and Y. Kabashima. Accelerating cross-validation in multinomial logistic regression with l1-regularization. Journal of Machine Learning Research, September 2018.
• Rad and Maleki (2019) K. R. Rad and A. Maleki. A scalable estimate of the extra-sample prediction error via approximate leave-one-out. arXiv Preprint, February 2019.
• van Handel (2016) R. van Handel. Probability in High Dimensions. Lecture Notes, December 2016.
• Vapnik (1992) V. Vapnik. Principles of risk minimization for learning theory. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 831–838. 1992.
• Vershynin (2018) R. Vershynin. High-dimensional probability: an introduction with applications in data science. Cambridge University Press, August 2018.
• Wainwright (2009) M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5), 05 2009.
• Wang et al. (2018) S. Wang, W. Zhou, H. Lu, A. Maleki, and V. Mirrokni. Approximate leave-one-out for fast parameter tuning in high dimensions. In International Conference in Machine Learning (ICML), 2018.
• Xu et al. (2012) H. Xu, C. Caramanis, and S. Mannor. Sparse algorithms are not stable: a no-free-lunch theorem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1), 2012.
• Xu et al. (2019) J. Xu, A. Maleki, K. R. Rad, and D. Hsu. Consistent risk estimation in high-dimensional linear regression. arXiv Preprint, February 2019.
• Zhao and Yu (2006) P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.

## Appendix A Scaling of the leave-one-out objective

We defined as the solution to the following optimization problem:

$$\hat\theta_{\setminus n} := \operatorname*{argmin}_{\theta \in \Theta} \frac{1}{N}\sum_{m \ne n} f_m(\theta) + \lambda R(\theta).$$

An alternative would be to use the objective $\frac{1}{N-1}\sum_{m \ne n} f_m(\theta) + \lambda R(\theta)$ in order to keep the scaling between the regularizer and the objective the same as in the full-data problem. Indeed, all existing theory that we are aware of for CV applied to $\ell_1$-regularized problems seems to follow the $1/(N-1)$ scaling [Homrighausen and McDonald, 2014, 2013, Miolane and Montanari, 2018, Chetverikov et al., 2019]. On the other hand, all existing approaches to approximate LOOCV for $\ell_1$-regularized problems have used the $1/N$ scaling that we have given [Beirami et al., 2017, Rad and Maleki, 2019, Wang et al., 2018, Xu et al., 2019, Obuchi and Kabashima, 2016, 2018]. Note that the scaling is not relevant in Giordano et al. [2019], as they do not consider the regularized case. As our work is aimed at identifying when existing approximations work well in high dimensions, we have followed the choice from the literature on approximate LOOCV. The difference between the two scalings may be insignificant when leaving only one datapoint out. But one might expect the difference to be substantial for, e.g., $K$-fold CV. We leave an understanding of the effect of this scaling (if any) to future work.

## Appendix B Further details of Eq. 2 and Eq. 3

In Section 2, we briefly outlined the approximations $\tilde{\theta}^{\setminus n}_{IJ}(R)$ and $\tilde{\theta}^{\setminus n}_{NS}(R)$ to $\hat{\theta}^{\setminus n}$; we give more details about these approximations and their derivations here. Recall that we defined $H(\theta)$ as the Hessian of the full-data objective, $H(\theta) := \frac{1}{N} \sum_{n=1}^N \nabla^2_\theta f(x_n^T\theta, y_n) + \lambda \nabla^2_\theta R(\theta)$. We first restate the “infinitesimal jackknife” approximation from the main text, which was derived by the same approach taken by Giordano et al. [2019]:

$$\hat{\theta}^{\setminus n} \approx \tilde{\theta}^{\setminus n}_{IJ}(R) := \hat{\theta} + \frac{1}{N} H(\hat{\theta})^{-1} \nabla_\theta f(x_n^T\hat{\theta}, y_n). \qquad (10)$$

The “Newton step” approximation, similar to the approach in Beirami et al. [2017] and identical to the approximation in Rad and Maleki [2019] and Wang et al. [2018], is:

$$\hat{\theta}^{\setminus n} \approx \tilde{\theta}^{\setminus n}_{NS}(R) := \hat{\theta} + \frac{1}{N} \Big( H(\hat{\theta}) - \frac{1}{N} \nabla^2_\theta f(x_n^T\hat{\theta}, y_n) \Big)^{-1} \nabla_\theta f(x_n^T\hat{\theta}, y_n). \qquad (11)$$
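As a sanity check on Eqs. 10 and 11 (a sketch we add here, not from the original text, on an invented ridge-regression instance): since a Newton step is exact on a quadratic objective, $\tilde{\theta}^{\setminus n}_{NS}$ recovers the exact leave-one-out solution for squared-error loss, while $\tilde{\theta}^{\setminus n}_{IJ}$ is close but inexact.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, lam = 40, 4, 0.2
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)
I = np.eye(D)

# full-data fit of (1/(2N)) sum_m (y_m - x_m^T th)^2 + (lam/2) * ||th||^2
H = X.T @ X / N + lam * I              # Hessian of the full objective
theta_hat = np.linalg.solve(H, X.T @ y / N)

n = 0
xn, yn = X[n], y[n]
grad_n = -(yn - xn @ theta_hat) * xn   # gradient of f_n at theta_hat
hess_n = np.outer(xn, xn)              # Hessian of f_n

# exact leave-one-out solution, keeping the 1/N scaling
Xm, ym = np.delete(X, n, axis=0), np.delete(y, n)
theta_loo = np.linalg.solve(Xm.T @ Xm / N + lam * I, Xm.T @ ym / N)

# Eq. 10 (infinitesimal jackknife) and Eq. 11 (Newton step)
theta_ij = theta_hat + np.linalg.solve(H, grad_n) / N
theta_ns = theta_hat + np.linalg.solve(H - hess_n / N, grad_n) / N

assert np.allclose(theta_ns, theta_loo)  # quadratic loss: one Newton step is exact
assert np.linalg.norm(theta_ij - theta_loo) < np.linalg.norm(theta_hat - theta_loo)
```

For non-quadratic GLM losses, neither approximation is exact, but the same comparison can be run against an exactly re-fit $\hat{\theta}^{\setminus n}$.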

### B.1 Derivation of $\tilde{\theta}^{\setminus n}_{IJ}(R)$

We will see in Section B.3 that, after some creative algebra, $\tilde{\theta}^{\setminus n}_{IJ}(R)$ is an instance of the approximation $\hat{\theta}_{IJ}$ from Definition 2 of Giordano et al. [2019]. However, this somewhat obscures the motivation for considering Eq. 10. As an alternative to jamming our problem setup into that considered by Giordano et al. [2019], we can more directly obtain the approximation in Eq. 10 by a derivation only slightly different from theirs. We begin by defining $\hat{\theta}^w$ as the solution to a weighted optimization problem with weights $w \in \mathbb{R}^N$:

$$\hat{\theta}^w := \operatorname*{argmin}_{\theta \in \Theta} G(w, \theta) := \operatorname*{argmin}_{\theta \in \Theta} \frac{1}{N} \sum_{n=1}^N w_n f(x_n^T\theta, y_n) + \lambda R(\theta), \qquad (12)$$

where we assume $G(w, \cdot)$ to be twice continuously differentiable with an invertible Hessian at $\hat{\theta}^1$ (where $\hat{\theta}^1$ is the solution in Eq. 12 with all $w_n = 1$). For example, we have that $\hat{\theta}^w = \hat{\theta}^{\setminus n}$ if $w$ is the $N$-dimensional vector of all ones but with a zero in the $n$th coordinate. We will form a linear approximation to $\hat{\theta}^w$ as a function of $w$. To do so, we will need to compute the derivatives $d\hat{\theta}^w / dw_n$ for each $n$. To compute these derivatives, we begin with the first-order optimality condition of Eq. 12 and take a total derivative with respect to $w_n$:

$$\frac{\partial G}{\partial \theta}\bigg|_{w=1,\,\theta=\hat{\theta}^1} = 0 \;\implies\; \frac{d}{dw_n}\frac{\partial G}{\partial \theta}\bigg|_{w=1,\,\theta=\hat{\theta}^1} = \frac{\partial^2 G}{\partial \theta\,\partial w_n}\bigg|_{w=1,\,\theta=\hat{\theta}^1}\frac{dw_n}{dw_n} + \frac{\partial^2 G}{\partial \theta^2}\bigg|_{w=1,\,\theta=\hat{\theta}^1}\frac{d\hat{\theta}^w}{dw_n}\bigg|_{w=1} = 0.$$

Re-arranging, defining $\hat{\theta} := \hat{\theta}^1$, and using the assumed invertibility of $\frac{\partial^2 G}{\partial \theta^2}\big|_{w=1,\,\theta=\hat{\theta}^1}$ gives:

$$\frac{d\hat{\theta}^w}{dw_n}\bigg|_{w=1} = -\left(\frac{\partial^2 G}{\partial \theta^2}\bigg|_{w=1,\,\theta=\hat{\theta}^1}\right)^{-1}\frac{\partial^2 G}{\partial w_n\,\partial \theta}\bigg|_{w=1,\,\theta=\hat{\theta}^1} = -\frac{1}{N} H(\hat{\theta})^{-1} \nabla_\theta f(x_n^T\hat{\theta}, y_n). \qquad (13)$$

In the final equality, we used the fact that $\frac{\partial^2 G}{\partial \theta^2}\big|_{w=1,\,\theta=\hat{\theta}^1} = H(\hat{\theta})$ and $\frac{\partial^2 G}{\partial w_n\,\partial \theta}\big|_{w=1,\,\theta=\hat{\theta}^1} = \frac{1}{N}\nabla_\theta f(x_n^T\hat{\theta}, y_n)$. Now, by a first-order Taylor expansion around $w = 1$, we can write:

$$\hat{\theta}^w \approx \hat{\theta} + \sum_{n=1}^N \frac{d\hat{\theta}^w}{dw_n}\bigg|_{w=1}(w_n - 1) \qquad (14)$$
$$\phantom{\hat{\theta}^w} = \hat{\theta} - \frac{1}{N}\sum_{n=1}^N H(\hat{\theta})^{-1}\nabla_\theta f(x_n^T\hat{\theta}, y_n)(w_n - 1). \qquad (15)$$

For the special case of $w$ being the vector of all ones with a zero in the $n$th coordinate (i.e., the weighting for LOOCV), we recover Eq. 10.
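The derivative formula in Eq. 13 can also be checked numerically (again an illustrative sketch with invented ridge-regression data): a central finite difference in the weight $w_n$ should match $-\frac{1}{N} H(\hat{\theta})^{-1} \nabla_\theta f(x_n^T\hat{\theta}, y_n)$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, lam = 30, 3, 0.1
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)
I = np.eye(D)

def theta_w(w):
    # minimizer of (1/(2N)) sum_n w_n (y_n - x_n^T th)^2 + (lam/2) * ||th||^2
    return np.linalg.solve(X.T @ (w[:, None] * X) / N + lam * I,
                           X.T @ (w * y) / N)

w = np.ones(N)
theta_hat = theta_w(w)
H = X.T @ X / N + lam * I
n = 0
grad_n = -(y[n] - X[n] @ theta_hat) * X[n]   # gradient of f_n at theta_hat

# Eq. 13: d theta^w / d w_n at w = 1
analytic = -np.linalg.solve(H, grad_n) / N

# central finite difference in the n-th weight
eps = 1e-5
wp, wm = w.copy(), w.copy()
wp[n] += eps
wm[n] -= eps
numeric = (theta_w(wp) - theta_w(wm)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-6)
```

Summing these derivatives against $(w_n - 1)$, as in Eqs. 14 and 15, gives the linearization used by $\tilde{\theta}^{\setminus n}_{IJ}(R)$.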

### B.2 Invertibility in the definitions of $\tilde{\theta}^{\setminus n}_{NS}(R)$ and $\tilde{\theta}^{\setminus n}_{IJ}(R)$

In writing Eqs. 2 and 3 we have assumed the invertibility of $H(\hat{\theta})$ and $H(\hat{\theta}) - \frac{1}{N}\nabla^2_\theta f(x_n^T\hat{\theta}, y_n)$. We here note a number of common cases where this invertibility holds. First, if $\nabla^2_\theta R(\theta)$ is positive definite for all $\theta$ (as in the case of $R(\theta) = \|\theta\|_2^2$), then these matrices are always invertible. If $R$ is merely convex, then $H(\hat{\theta})$ is always invertible if $\mathrm{span}\{x_1, \dots, x_N\} = \mathbb{R}^D$. If $R$ is convex, $H(\hat{\theta}) - \frac{1}{N}\nabla^2_\theta f(x_n^T\hat{\theta}, y_n)$ is invertible if $\mathrm{span}\{x_m : m \neq n\} = \mathbb{R}^D$. This condition on the span holds almost surely if the $x_n$ are sampled from a continuous distribution and $N \geq D + 1$.

### B.3 Accuracy of $\tilde{\theta}^{\setminus n}_{IJ}(R)$ for regularized problems

As noted in the main text, Giordano et al. [2019] show that the error of $\tilde{\theta}^{\setminus n}_{IJ}$ is bounded by $C/N^2$ for some $C$ that is constant in $N$. However, their results apply only to the unregularized case (i.e., $\lambda = 0$). We show here that their results can be extended to the case $\lambda > 0$ with mild additional assumptions; the proof of Proposition 2 appears below.

###### Proposition 2.

Assume that the conditions for Corollary 1 of Giordano et al. [2019] are satisfied by the losses $f(x_n^T\theta, y_n)$. Furthermore, assume that we are restricted to $\theta$ in some compact subset $\Theta$ of $\mathbb{R}^D$, that $\lambda = O(1/\sqrt{N})$, that $R$ is twice continuously differentiable for all $\theta \in \Theta$, and that $\nabla^2_\theta F(\theta)$ is positive definite for all $\theta \in \Theta$, where $F(\theta) := \frac{1}{N}\sum_{n=1}^N f(x_n^T\theta, y_n)$ denotes the unregularized objective. Then $\tilde{\theta}^{\setminus n}_{IJ}(R)$ can be seen as an application of the approximation in Definition 2 of Giordano et al. [2019]. Furthermore, the assumptions of their Corollary 1 are met, which implies:

$$\max_n \big\| \hat{\theta}^{\setminus n} - \tilde{\theta}^{\setminus n}_{IJ}(R) \big\|_2 \leq \min\left\{ \frac{C_1}{N^2} \max_n \sup_{\theta \in \Theta} \big\| \nabla_\theta f(x_n^T\theta, y_n) \big\|_2, \; \frac{C_2}{N^2} \right\}, \qquad (16)$$

where $C_1$ and $C_2$ are problem-specific constants independent of $N$ that may depend on $D$.

Proposition 2 provides two bounds on the error $\max_n \| \hat{\theta}^{\setminus n} - \tilde{\theta}^{\setminus n}_{IJ}(R) \|_2$: either $C_1/N^2$ times the maximum of the gradient or just $C_2/N^2$. One bound or the other may be easier to use, depending on the specific problem. It is worth discussing the conditions of Proposition 2 before going into its proof. The first major assumption is that $\theta$ is restricted to some compact set $\Theta$. Although this assumption may not be satisfied by problems of interest, one may be willing to assume that $\hat{\theta}$ lives in some bounded set in practice. In any case, such an assumption seems necessary to apply the results of Giordano et al. [2019] to most unregularized problems, as they, for example, require the gradients $\nabla_\theta f(x_n^T\theta, y_n)$ to be bounded. We will require the compactness of $\Theta$ to show that $\sup_{\theta \in \Theta} \|\nabla_\theta R(\theta)\|_2$ is bounded.

The second major assumption of Proposition 2 is that $\lambda = O(1/\sqrt{N})$. We need this assumption to ensure that the regularization term $\lambda \nabla_\theta R(\theta)$ is sufficiently well behaved. In practice this assumption may be somewhat limiting; however, we note that for fixed $D$, such a scaling is usually assumed – and in some situations is necessary – to obtain standard theoretical results for $\ell_1$ regularization (e.g., Wainwright [2009] gives the standard scaling for linear regression, $\lambda \asymp \sqrt{\log D / N}$). Our Theorems 2 and 3 also satisfy such a scaling when $D$ is fixed. In any case, we stress that this assumption – as well as the assumption on compactness – is needed only to prove Proposition 2, and not any of our other results. We prove Proposition 2 to demonstrate the baseline results that exist in the literature so that we can then show how our results build on these baselines.

###### Proof.

We proceed by showing that the regularized optimization problem in our Eq. 1 can be written in the framework of Eq. (1) of Giordano et al. [2019] and then showing that the re-written problem satisfies the assumptions of their Corollary 1. First, the framework of Giordano et al. [2019] applies to weighted optimization problems of the form:

$$\hat{\theta}^w := \theta \in \Theta \;\;\text{s.t.}\;\; \frac{1}{N}\sum_{n=1}^N w_n g_n(\theta) = 0. \qquad (17)$$

In order to match this form, we will rewrite the gradient of the objective in Eq. 1 as a weighted sum with $N+1$ terms, where the first term, with weight $w_0$, will correspond to the regularizer $\lambda \nabla_\theta R(\theta)$:

$$\frac{1}{N+1} w_0 (N+1) \lambda \nabla_\theta R(\theta) + \frac{1}{N+1} \sum_{n=1}^N w_n \frac{N+1}{N} \nabla_\theta f(x_n^T\theta, y_n). \qquad (18)$$

We will also need a set of weight vectors at which we are interested in evaluating $\hat{\theta}^w$. We choose this set as follows. In the set, we include each weight vector that is equal to one everywhere except for exactly one of the coordinates $1, \dots, N$, where it is zero. Thus, for each $n$, there is a $w$ in the set such that $\hat{\theta}^w = \hat{\theta}^{\setminus n}$. Finally, then, we can apply Definition 2 of Giordano et al. [2019] to find the approximation $\hat{\theta}_{IJ}(w)$ for the $w$ that corresponds to leaving out datapoint $n$. We see that in this case $\hat{\theta}_{IJ}(w)$ is exactly equal to $\tilde{\theta}^{\setminus n}_{IJ}(R)$ in our notation here.

Now that we know our approximation is actually an instance of $\hat{\theta}_{IJ}$, we need to check that Eq. 18 meets the assumptions of Corollary 1 of Giordano et al. [2019] in order to apply their theoretical analysis to our problem. We check these below, first stating each assumption from Giordano et al. [2019] and then covering why it holds for our problem.

1. (Assumption 1): for all $n$, each $g_n(\theta)$ is continuously differentiable in $\theta$.
For our problem, by assumption, $f$ and $R$ are twice continuously differentiable functions of $\theta$, so this assumption holds.

2. (Assumption 2): for all $\theta \in \Theta$, the Hessian matrix $H(\theta, 1) := \frac{1}{N}\sum_{n=1}^N \nabla_\theta g_n(\theta)$ is invertible and satisfies $\sup_{\theta \in \Theta} \|H(\theta, 1)^{-1}\|_{op} \leq C_{op}$ for some constant $C_{op} < \infty$, where $\|\cdot\|_{op}$ denotes the operator norm on matrices with respect to the $\ell_2$ norm (i.e., the maximum singular value of the matrix).
For our problem, by assumption, the inverse matrix $(\nabla^2_\theta F(\theta))^{-1}$ exists and has bounded maximum eigenvalue for all $\theta \in \Theta$. Also by assumption, $R$ has a positive semidefinite Hessian for all $\theta \in \Theta$, which implies:

$$\sup_{\theta \in \Theta}\big\| H^{-1}(\theta, 1) \big\|_{op} = \sup_{\theta \in \Theta}\big\| \big(\nabla^2_\theta F(\theta) + \lambda \nabla^2_\theta R(\theta)\big)^{-1} \big\|_{op} \leq \sup_{\theta \in \Theta}\big\| \big(\nabla^2_\theta F(\theta)\big)^{-1} \big\|_{op}.$$

To see that the inequality holds, first note that for a symmetric positive definite matrix $A$, $\|A^{-1}\|_{op} = 1/\lambda_{\min}(A)$, where $\lambda_{\min}(A)$ denotes the minimum eigenvalue of $A$. The inequality would then follow if