Approximate Newton-based statistical inference using only stochastic gradients

# Approximate Newton-based statistical inference using only stochastic gradients

Tianyang Li1
lty@cs.utexas.edu
Anastasios Kyrillidis2
anastasios@rice.edu
Liu Liu1
liuliu@utexas.edu
Constantine Caramanis1
constantine@utexas.edu

1 The University of Texas at Austin
2 Rice University
###### Abstract

We present a novel inference framework for convex empirical risk minimization, using approximate stochastic Newton steps. The proposed algorithm is based on the notion of finite differences and allows the approximation of a Hessian-vector product from first-order information. In theory, our method efficiently computes the statistical error covariance in -estimation, both for unregularized convex learning problems and high-dimensional LASSO regression, without using exact second order information, or resampling the entire data set. In practice, we demonstrate the effectiveness of our framework on large-scale machine learning problems, that go even beyond convexity: as a highlight, our work can be used to detect certain adversarial attacks on neural networks.

## 1 Introduction

Statistical inference is an important tool for assessing uncertainties, both for estimation and prediction purposes [25, 21]. E.g., in unregularized linear regression and high-dimensional LASSO settings [53, 32, 49], we are interested in computing coordinate-wise confidence intervals and p-values of a -dimensional variable, in order to infer which coordinates are active or not [58]. Traditionally, the inverse Fisher information matrix [20] contains the answer to such inference questions; however it requires storing and computing a matrix structure, often prohibitive for large-scale applications [52]. Alternatively, the Bootstrap method is a popular statistical inference algorithm, where we solve an optimization problem per dataset replicate, but can be expensive for large data sets [35].

While optimization is mostly used for point estimates, recently it is also used as a means for statistical inference in large scale machine learning [37, 14, 48, 24]. This manuscript follows this path: we propose an inference framework that uses stochastic gradients to approximate second-order, Newton steps. This is enabled by the fact that we only need to compute Hessian-vector products; in math, this can be approximated using , where is the objective function, and , denote the gradient and Hessian of . Our method can be interpreted as a generalization of the SVRG approach in optimization [34] (Appendix D); further, it is related to other stochastic Newton methods (e.g. [3]) when . We defer the reader to Section 5 for more details. In this work, we apply our algorithm to unregularized -estimation, and we use a similar approach, with proximal approximate Newton steps, in high-dimensional linear regression.

Our contributions can be summarized as follows; a more detailed discussion is deferred to Section 5:

• For the case of unregularized -estimation, our method efficiently computes the statistical error covariance, useful for confidence intervals and p-values. Compared to state of the art, our scheme guarantees consistency of computing the statistical error covariance, exploits better the available information (without wasting computational resources to compute quantities that are thereafter discarded), and converges to the optimum (without swaying around it).

• For high-dimensional linear regression, we propose a different estimator (see (13)) than the current literature. It is the result of a different optimization problem that is strongly convex with high probability. This permits the use of linearly convergent proximal algorithms [61, 36] towards the optimum; in contrast, state of the art only guarantees convergence to a neighborhood of the LASSO solution within statistical error. Our model also does not assume that absolute values of the true parameter’s non-zero entries are lower bounded.

• The effectiveness of our framework goes even beyond convexity. As a highlight, we show that our work can be used to detect certain adversarial attacks on neural networks.

## 2 Unregularized M-estimation

In unregularized, low-dimensional -estimation problems, we estimate a parameter of interest:

 θ⋆=argminθ∈RpEX∼P[ℓ(X;θ)],where P(X) is % the data distribution,

using empirical risk minimization (ERM) on i.i.d. data points :

 ˆθ=argminθ∈Rp1nn∑i=1ℓ(Xi;θ).

Statistical inference, such as computing one-dimensional confidence intervals, gives us information beyond the point estimate , when has an asymptotic limit distribution [58]. E.g., under regularity conditions, the -estimator satisfies asymptotic normality [54, Theorem 5.21]. I.e., weakly converges to a normal distribution:

 √n(ˆθ−θ⋆)→N(0,H⋆−1G⋆H⋆−1),

where and . We can perform statistical inference when we have a good estimate of . In this work, we use the plug-in covariance estimator for , where:

 ˆH=1nn∑i=1∇2θℓ(Xi;ˆθ), and ˆG=1nn∑i=1∇θℓ(Xi;ˆθ)∇θℓ(Xi;ˆθ)⊤.

Observe that, in the naive case of directly computing and , we require both high computational- and space-complexity. Here, instead, we utilize approximate stochastic Newton motions from first order information to compute the quantity .

### 2.1 Statistical inference with approximate Newton steps using only stochastic gradients

Based on the above, we are interested in solving the following -dimensional optimization problem:

 ˆθ=argminθ∈Rpf(θ):=1nn∑i=1fi(θ),where fi(θ)=ℓ(Xi;θ).

Notice that can be written as , which can be interpreted as the covariance of stochastic –inverse-Hessian conditioned– gradients at . Thus, the covariance of stochastic Newton steps can be used for statistical inference.

Algorithm 1 approximates each stochastic Newton step using only first order information. We start from which is sufficiently close to , which can be effectively achieved using SVRG [34]; a description of the SVRG algorithm can be found in Appendix D. Lines 4, 5 compute a stochastic gradient whose covariance is used as part of statistical inference. Lines 6 to 12 use SGD to solve the Newton step,

 ming∈Rp⟨1So∑i∈Io∇fi(θt),g⟩+12ρt⟨g,∇2f(θt)g⟩, (1)

which can be seen as a generalization of SVRG; this relationship is described in more detail in Appendix D. In particular, these lines correspond to solving (1) using SGD by uniformly sampling a random , and approximating:

 ∇2f(θ)g≈∇f(θ+δjtg)−∇f(θ)δjt=E[∇fi(θ+δjtg)−∇fi(θ)δjt∣θ]. (2)

Finally, the outer loop (lines 2 to 13) can be viewed as solving inverse Hessian conditioned stochastic gradient descent, similar to stochastic natural gradient descent [4].

In terms of parameters, similar to [43, 46], we use a decaying step size in Line 8 to control the error of approximating . We set to control the error of approximating Hessian vector product using a finite difference of gradients, so that it is smaller than the error of approximating using stochastic approximation. For similar reasons, we use a decaying step size in the outer loop to control the optimization error.

The following theorem characterizes the behavior of Algorithm 1.

###### Theorem 1.

For a twice continuously differentiable and convex function where each is also convex and twice continuously differentiable, assume satisfies

• strong convexity: , ;

• , each , which implies that has Lipschitz gradient: , ;

• each is Lipschitz continuous: , .

In Algorithm 1, we assume that batch sizes —in the outer loop—and —in the inner loops—are . The outer loop step size is

 ρt=ρ0⋅(t+1)−do,where do∈(12,1) is the decaying rate. (3)

In each outer loop, the inner loop step size is

 τj=τ0⋅(j+1)−di,where di∈(12,1) is the decaying rate. (4)

The scaling constant for Hessian vector product approximation is

 δjt=δ0⋅ρ4t⋅τ4j=o(1(t+1)2(j+1)2). (5)

Then, for the outer iterate we have
(6) and (7)

In each outer loop, after steps of the inner loop, we have:

 E[∥∥¯gtρt−[∇2f(θt)]−1g0t∥∥22∣θt]≲1L∥∥g0t∥∥22, (8)

and at each step of the inner loop, we have:

 E[∥∥gj+1t−[∇2f(θt)]−1g0t∥∥42∣θt]≲(j+1)−2di∥∥g0t∥∥42. (9)

After steps of the outer loop, we have a non-asymptotic bound on the “covariance”:

 E[∥∥ ∥∥H−1GH−1−SoTT∑t=1¯gt¯g⊤tρ2t∥∥ ∥∥2]≲T−do2+L−12, (10)

where and .

Some comments on the results in Theorem 1. The main outcome is that (10) provides a non-asymptotic bound and consistency guarantee for computing the estimator covariance using Algorithm 1. This is based on the bound for approximating the inverse-Hessian conditioned stochastic gradient in (8), and the optimization bound in (6). As a side note, the rates in Theorem 1 are very similar to classic results in stochastic approximation [43, 46]; however the nested structure of outer and inner loops is different from standard stochastic approximation algorithms. Heuristically, calibration methods for parameter tuning in subsampling methods ([42], Ch. 9) can be used for hyper-parameter tuning in our algorithm.

In Algorithm 1, does not have asymptotic normality. I.e., does not weakly converge to ; we give an example using mean estimation in Section C.1. For a similar algorithm based on SVRG (Algorithm 5 in Appendix C), we show that we have asymptotic normality and improved bounds for the “covariance”; however, this requires a full gradient evaluation in each outer loop. In Appendix B, we present corollaries for the case where the iterations in the inner loop increase, as the counter in the outer loop increases (i.e., is an increasing series). This guarantees consistency (convergence of the covariance estimate to ), although it is less efficient than using a constant number of inner loop iterations. Our procedure also serves as a general and flexible framework for using different stochastic gradient optimization algorithms [50, 28, 38, 16] in the inner and outer loop parts.

Finally, we present the following corollary that states that the average of consecutive iterates, in the outer loop, has asymptotic normality, similar to [43, 46].

###### Corollary 1.

In Algorithm 1’s outer loop, the average of consecutive iterates satisfies
(11) and (12)
where weakly converges to , and when and .

Corollary 1 uses 2nd , 4th moment bounds on individual iterates (eqs. (6), (7) in the above theorem), and the approximation of inverse Hessian conditioned stochastic gradient in (9).

## 3 High dimensional LASSO linear regression

In this section, we focus on the case of high-dimensional linear regression. Statistical inference in such settings, where , is arguably a more difficult task: the bias introduced by the regularizer is of the same order with the estimator’s variance. Recent works [63, 53, 32] propose statistical inference via de-biased LASSO estimators. Here, we present a new -norm regularized objective and propose an approximate stochastic proximal Newton algorithm, using only first order information.

We consider the linear model , for some sparse . For each sample, is i.i.d. noise. And each data point .

• Assumptions on : is -sparse; , which implies that .

• Assumptions on : is sparse, where each column (and row) has at most non-zero entries;111This is satisfied when is block diagonal or banded. Covariance estimation under this sparsity assumption has been extensively studied [7, 8, 13], and soft thresholding is an effective yet simple estimation method [45]. is well conditioned: all of ’s eigenvalues are ; is diagonally dominant ( for all ), and this will be used to bound the norm of [55]. A commonly used design covariance that satisfies all of our assumptions is .

We estimate using:

 ˆθ=argminθ∈Rp12⟨θ,(ˆS−1nn∑i=1xix⊤i)θ⟩+1nn∑i=112(x⊤iθ−yi)2+λ∥θ∥1, (13)

where is an estimate of by soft-thresholding each element of with [45]. Under our assumptions, is positive definite with high probability when (Lemma 4), and this guarantees that the optimization problem (13) is well defined. I.e., we replace the degenerate Hessian in regular LASSO regression with an estimate, which is positive definite with high probability under our assumptions.

We set the regularization parameter

 λ=Θ((σ+∥θ⋆∥1)√logpn),

which is similar to LASSO regression [12, 41] and related estimators using thresholded covariance [62, 33].

##### Point estimate.

Theorem 2 provides guarantees for our proposed point estimate (13).

###### Theorem 2.

When , the solution in (13) satisfies

 ∥∥ˆθ−θ⋆∥∥1 ≲s(σ+∥θ⋆∥1)√logpn≲s(σ+√s)√logpn, (14) ∥∥ˆθ−θ⋆∥∥2 ≲√s(σ+∥θ⋆∥1)√logpn≲√s(σ+√s)√logpn, (15)

with probability at least .

##### Confidence intervals.

We next present a de-biased estimator (16), based on our proposed estimator. can be used to compute confidence intervals and p-values for each coordinate of , which can be used for false discovery rate control [30]. The estimator satisfies:

 ˆθd=ˆθ+ˆS−1[1nn∑i=1yixi−(1nn∑i=1xix⊤i)ˆθ]. (16)

Our de-biased estimator is similar to [63, 53, 31, 32]. however, we have different terms, since we need to de-bias covariance estimation. Our estimator assumes , since then is positive definite with high probability (Lemma 4). The assumption that is diagonally dominant guarantees that the norm is bounded by with high probability when .

Theorem 3 shows that we can compute valid confidence intervals for each coordinate when This is satisfied when . And the covariance is similar to the sandwich estimator [29, 59].

###### Theorem 3.

Under our assumptions, when , we have:

 √n(ˆθd−θ⋆)=Z+R, (17)

where the conditional distribution satisfies , and with probability at least .

Our estimate in (13) has similar error rates to the estimator in [62]; however, no confidence interval guarantees are provided, and the estimator is based on inverting a large covariance matrix. Further, although it does not match minimax rates achieved by regular LASSO regression [44], and the sample complexity in Theorem 3 is slightly higher than other methods [53, 31, 32], our criterion is strongly convex with high probability: this allows us to use linearly convergent proximal algorithms [61, 36], whereas provable linearly convergent optimization bounds for LASSO only guarantees convergence to a neighborhood of the LASSO solution within statistical error [1]. This is crucial for computing the de-biased estimator, as we need the optimization error to be much less than the statistical error.

In Appendix A, we present our algorithm for statistical inference in high dimensional linear regression using stochastic gradients. It estimates the statistical error covariance using the plug-in estimator:

 ˆS−1(1nn∑i=1(x⊤iˆθ−yi)2xix⊤i)ˆS−1,

which is related to the empirical sandwich estimator [29, 59]. Algorithm 2 computes the statistical error covariance. Similar to Algorithm 1, Algorithm 2 has an outer loop part and an inner loop part, where the outer loops correspond to approximate proximal Newton steps, and the inner loops solve each proximal Newton step using proximal SVRG [61]. To control the variance, we use SVRG and proximal SVRG to solve the Newton steps. This is because in the high dimensional setting, the variance is too large when we use SGD [40] and proximal SGD [5] for solving Newton steps. However, since we have , instead of sampling by sample, we sample by feature. When we set , we can estimate the statistical error covariance with element-wise error less than with high probability, using numerical operations. And Algorithm 3 calculates the de-biased estimator (16) via SVRG. For more details, we defer the reader to the appendix.

## 4 Experiments

### 4.1 Synthetic data

The coverage probability is defined as where is the estimated confidence interval for the coordinate. The average confidence interval length is defined as where is the estimated confidence interval for the coordinate. In our experiments, coverage probability and average confidence interval length are estimated through simulation. Result given as a pair indicates (coverage probability, confidence interval length).

##### Low dimensional problems.

Table 1 shows coverage and confidence interval length of 200 simulations for linear and logistic regression. The exact configurations for linear/logistic regression examples are provided in Section G.1.1. Compared with Bootstrap and Jackknife [22], Algorithm 1 uses less numerical operations, while achieving similar results. Compared with the averaged SGD method [37, 14], our algorithm performs much better, while using the same amount of computation, and is much less sensitive to the choice hyper-parameters.

##### High dimensional linear regression.

We use 600 i.i.d. samples from a model with , , which is 8-sparse. Figure 1 shows 95% confidence intervals for the first 20 coordinates. The average confidence interval length is 0.14 and average coverage is 0.83. Additional experimental results, including p-value distribution, are presented in Section G.1.2.

### 4.2 Real data

##### Neural network adversarial attack detection.

Here we use ideas from statistical inference to detect certain adversarial attacks on neural networks. A key observation is that neural networks are effective at representing low dimensional manifolds such as natural images [6, 15], and this causes the risk function’s Hessian to be degenerate [47]. From a statistical inference perspective, we interpret this as meaning that the confidence intervals in the null space of is infinity, where is the pseudo-inverse of the Hessian (see Section 2). When we make a prediction using a fixed data point as input (i.e., conditioned on ), using the delta method [54], the confidence interval of the prediction can be derived from the asymptotic normality of

 √n(Ψ(x;ˆθ)−Ψ(x;θ⋆))→N(0,∇θΨ(x;ˆθ)⊤[ˆH−1ˆGˆH−1]∇θΨ(x;ˆθ)).

To detect adversarial attacks, we use the score

 ∥∥(I−PH+GH+)∇θΨ(x;ˆθ)∥∥2∥∥∇θΨ(x;ˆθ)∥∥2,

to measure how much lies in null space of , where is the projection matrix onto the range of . Conceptually, for the same image, the randomly perturbed image’s score should be larger than the original image’s score, and the adversarial image’s score should be larger than the randomly perturbed image’s score.

We train a binary classification neural network with 1 hidden layer and softplus activation function, to distinguish between “Shirt” and “T-shirt/top” in the Fashion MNIST data set [60]. Figure 2 shows distributions of scores of original images, adversarial images generated using the fast gradient sign method [27], and randomly perturbed images. Adversarial and random perturbations have the same norm. The adversarial perturbations and example images are shown in Section G.2.1. Although the scores’ values are small, they are still significantly larger than 64-bit floating point precision (). We observe that scores of randomly perturbed images is an order of magnitude larger than scores of original images, and scores of adversarial images is an order of magnitude larger than scores of randomly perturbed images.

##### High dimensional linear regression.

We apply our high dimensional linear regression statistical inference procedure to a high-throughput genomic data set concerning riboflavin (vitamin B2) production rate [11], which contains samples of genes. We set and . In Section G.2.2, we show that our point estimate is similar to the vanilla LASSO estimate, and compare our statistical inference results with those of [31, 11, 10, 39].

## 5 Related work

##### Unregularized M-estimation.

This work provides a general, flexible framework for simultaneous point estimation and statistical inference, and improves upon previous methods, based on averaged stochastic gradient descent [37, 14].

Compared to [14] (and its extensions [48, 24]), our method does not need to increase the lengths of “segments” (inner loops) to reduce correlations between different “replicates”. Even in that case, if we use replicates and increasing “segment” length (number of inner loops is ) with a total of stochastic gradient steps, [14] guarantees , whereas our method guarantees . Further, [14] is inconsistent, whereas our scheme guarantees consistency of computing the statistical error covariance.

[37] uses fixed step size SGD for statistical inference, and discards iterates between different “segments” to reduce correlation, whereas we do not discard any iterates in our computations. Although [37] states empirically constant step SGD performs well in statistical inference, it has been empirically shown [17] that averaging consecutive iterates in constant step SGD does not guarantee convergence to the optimal – the average will be “wobbling” around the optimal, whereas decreasing step size stochastic approximation methods ([43, 46] and our work) will converge to the optimal, and averaging consecutive iterates guarantees “fast” rates.

Finally, from an optimization perspective, our method is similar to stochastic Newton methods (e.g. [3]); however, our method only uses first-order information to approximate a Hessian vector product (). Algorithm 1’s outer loops are similar to stochastic natural gradient descent [4]. Also, we demonstrate an intuitive view of SVRG [34] as a special case of approximate stochastic Newton steps using first order information (Appendix D).

##### High dimensional linear regression.

[14]’s high dimensional inference algorithm is based on [2], and only guarantees that optimization error is at the same scale as the statistical error. However, proper de-biasing of the LASSO estimator requires the optimization error to be much less than the statistical error, otherwise the optimization error introduces additional bias that de-biasing cannot handle. Our optimization objective is strongly convex with high probability: this permits the use of linearly convergent proximal algorithms [61, 36] towards the optimum, which guarantees the optimization error to be much smaller than the statistical error.

Our method of de-biasing the LASSO in Section 3 is similar to [63, 53, 31, 32]. Our method uses a new regularized objective (13) for high dimensional linear regression, and we have different de-biasing terms, because we also need to de-bias the covariance estimation.

In Algorithm 2, our covariance estimate is similar to the classic sandwich estimator [29, 59]. Previous methods require space which unsuitable for large scale problems, whereas our method only requires space.

Similar to our -norm regularized objective, [62, 33] shows similar point estimate statistical guarantees for related estimators; however there are no confidence interval results. Further, although [62] is an elementary estimator in closed form, it still requires computing the inverse of the thresholded covariance, which is challenging in high dimensions, and may not computationally outperform optimization approaches.

Finally, for feature selection, we do not assume that absolute values of the true parameter’s non-zero entries are lower bounded. [23, 56, 12].

## Acknowledgments

We thank Xi Chen and Philipp Krähenbühl for insightful discussions.

## References

• [1] Alekh Agarwal, Sahand Negahban, and Martin Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems, pages 37–45, 2010.
• [2] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions. In Advances in Neural Information Processing Systems, pages 1538–1546, 2012.
• [3] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187, 2017.
• [4] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
• [5] Yves F. Atchadé, Gersende Fort, and Eric Moulines. On perturbed proximal gradient algorithms. J. Mach. Learn. Res, 18(1):310–342, 2017.
• [6] Ronen Basri and David Jacobs. Efficient representation of low-dimensional manifolds using deep networks. arXiv preprint arXiv:1602.04723, 2016.
• [7] Peter Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, pages 2577–2604, 2008.
• [8] Peter Bickel, Ya’acov Ritov, and Alexandre Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.
• [9] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
• [10] Peter Bühlmann. Statistical significance in high-dimensional linear models. Bernoulli, 19(4):1212–1242, 2013.
• [11] Peter Bühlmann, Markus Kalisch, and Lukas Meier. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1:255–278, 2014.
• [12] Peter Bühlmann and Sara van de Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
• [13] Tony Cai and Harrison Zhou. Optimal rates of convergence for sparse covariance matrix estimation. The Annals of Statistics, pages 2389–2420, 2012.
• [14] Xi Chen, Jason Lee, Xin Tong, and Yichen Zhang. Statistical inference for model parameters in stochastic gradient descent. arXiv preprint arXiv:1610.08637, 2016.
• [15] Charles K Chui and Hrushikesh Narhar Mhaskar. Deep nets for local manifold learning. arXiv preprint arXiv:1607.07110, 2016.
• [16] Hadi Daneshmand, Aurelien Lucchi, and Thomas Hofmann. Starting small-learning with adaptive sample sizes. In International conference on machine learning, pages 1463–1471, 2016.
• [17] Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the Gap between Constant Step Size Stochastic Gradient Descent and Markov Chains. arXiv preprint arXiv:1707.06386, 2017.
• [18] Jürgen Dippon. Asymptotic expansions of the Robbins-Monro process. Mathematical Methods of Statistics, 17(2):138–145, 2008.
• [19] Jürgen Dippon. Edgeworth expansions for stochastic approximation theory. Mathematical Methods of Statistics, 17(1):44–65, 2008.
• [20] Francis Ysidro Edgeworth. On the probable errors of frequency-constants. Journal of the Royal Statistical Society, 71(2):381–397, 1908.
• [21] B. Efron and T. Hastie. Computer age statistical inference. Cambridge University Press, 2016.
• [22] Bradley Efron and Robert J. Tibshirani. An introduction to the bootstrap. CRC press, 1994.
• [23] Jianqing Fan, Wenyan Gong, Chris Junchi Li, and Qiang Sun. Statistical sparse online regression: A diffusion approximation perspective. In International Conference on Artificial Intelligence and Statistics, pages 1017–1026, 2018.
• [24] Yixin Fang, Jinfeng Xu, and Lei Yang. On Scalable Inference with Stochastic Gradient Descent. arXiv preprint arXiv:1707.00192, 2017.
• [25] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.
• [26] Sébastien Gadat and Fabien Panloup. Optimal non-asymptotic bound of the Ruppert-Polyak averaging without strong convexity. arXiv preprint arXiv:1709.03342, 2017.
• [27] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
• [28] Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konecny, and Scott Sallinen. StopWasting My Gradients: Practical SVRG. In Advances in Neural Information Processing Systems 28, pages 2251–2259, 2015.
• [29] Peter Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the fifth Berkeley symposium on Mathematical Statistics and Probability, pages 221–233, Berkeley, CA, 1967. University of California Press.
• [30] Adel Javanmard and Hamid Javadi. False Discovery Rate Control via Debiased Lasso. arXiv preprint arXiv:1803.04464, 2018.
• [31] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15(1):2869–2909, 2014.
• [32] Adel Javanmard and Andrea Montanari. De-biasing the Lasso: Optimal sample size for Gaussian designs. arXiv preprint arXiv:1508.02757, 2015.
• [33] Jessie Jeng and John Daye. Sparse covariance thresholding for high-dimensional variable selection. Statistica Sinica, pages 625–657, 2011.
• [34] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
• [35] Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, and Michael Jordan. A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):795–816, 2014.
• [36] Jason Lee, Yuekai Sun, and Michael Saunders. Proximal Newton-type methods for minimizing composite functions. SIAM Journal on Optimization, 24(3):1420–1443, 2014.
• [37] Tianyang Li, Liu Liu, Anastasios Kyrillidis, and Constantine Caramanis. Statistical inference using SGD. In The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 2018.
• [38] Ilya Loshchilov and Frank Hutter. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343, 2015.
• [39] Nicolai Meinshausen, Lukas Meier, and Peter Bühlmann. -values for high-dimensional regression. Journal of the American Statistical Association, 104(488):1671–1681, 2009.
• [40] Eric Moulines and Francis R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
• [41] Sahand Negahban, Pradeep Ravikumar, Martin Wainwright, and Bin Yu. A Unified Framework for High-Dimensional Analysis of M-Estimators with Decomposable Regularizers. Statistical Science, pages 538–557, 2012.
• [42] D. N. Politis, J. P. Romano, and M. Wolf. Subsampling. Springer Series in Statistics. Springer New York, 2012.
• [43] Boris Polyak and Anatoli Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
• [44] Garvesh Raskutti, Martin Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over -balls. IEEE transactions on information theory, 57(10):6976–6994, 2011.
• [45] Adam Rothman, Elizaveta Levina, and Ji Zhu. Generalized thresholding of large covariance matrices. Journal of the American Statistical Association, 104(485):177–186, 2009.
• [46] David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.
• [47] Levent Sagun, Utku Evci, V. Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical Analysis of the Hessian of Over-Parametrized Neural Networks. arXiv preprint arXiv:1706.04454, 2017.
• [48] Weijie Su and Yuancheng Zhu. Statistical Inference for Online Learning and Stochastic Approximation via Hierarchical Incremental Gradient Descent. arXiv preprint arXiv:1802.04876, 2018.
• [49] R. Tibshirani, M. Wainwright, and T. Hastie. Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC, 2015.
• [50] Panos Toulis and Edoardo M. Airoldi. Asymptotic and finite-sample properties of estimators based on stochastic gradients. The Annals of Statistics, 45(4):1694–1727, 2017.
• [51] Joel Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.
• [52] F. Tuerlinckx, F. Rijmen, G. Verbeke, and P. Boeck. Statistical inference in generalized linear mixed models: A review. British Journal of Mathematical and Statistical Psychology, 59(2):225–255, 2006.
• [53] Sara van de Geer, Peter Bühlmann, Yaâacov Ritov, and Ruben Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.
• [54] Aad W. van der Vaart. Asymptotic statistics. Cambridge University Press, 1998.
• [55] James Varah. A lower bound for the smallest singular value of a matrix. Linear Algebra and its Applications, 11(1):3–5, 1975.
• [56] Martin Wainwright. Sharp thresholds for High-Dimensional and noisy sparsity recovery using -Constrained Quadratic Programming (Lasso). IEEE transactions on information theory, 55(5):2183–2202, 2009.
• [57] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. To appear, 2017.
• [58] Larry Wasserman. All of statistics: a concise course in statistical inference. Springer Science & Business Media, 2013.
• [59] Halbert White. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica: Journal of the Econometric Society, pages 817–838, 1980.
• [60] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747, 2017.
• [61] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
• [62] Eunho Yang, Aurelie Lozano, and Pradeep Ravikumar. Elementary estimators for high-dimensional linear regression. In International Conference on Machine Learning, pages 388–396, 2014.
• [63] Cun-Hui Zhang and Stephanie Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242, 2014.

## Appendix A High dimensional linear regression statistical inference using stochastic gradients (Section 3)

### a.1 Statistical inference using approximate proximal Newton steps with stochastic gradients

Here, we present a statistical inference procedure for high dimensional linear regression via approximate proximal Newton steps using stochastic gradients. It uses the plug-in estimator:

 ˆS−1(1nn∑i=1(x⊤iˆθ−yi)2xix⊤i)ˆS−1,

which is related to the empirical sandwich estimator [29, 59]. Lemma 1 shows this is a good estimate of the covariance when .

Algorithm 2 performs statistical inference in high dimensional linear regression (13), by computing the statistical error covariance in Theorem 3, based on the plug-in estimate in Lemma 1. We denote the soft thresholding of by as an element-wise procedure . For a vector , we write ’s th coordinate as . The optimization objective (13) is denoted as:

 12θ⊤(ˆS−1n∑ni=1xix⊤i)θ+1n∑ni=1fi,

where Further,

where is the basis vector where the th coordinate is and others are , and is computed in a column-wise manner.

For point estimate optimization, the proximal Newton step [36] at solves the optimization problem

 minΔ12ρΔ⊤ˆSΔ+⟨(ˆS−1n∑ni=1xix⊤i)θ+1nn∑i=1∇fi(θ),Δ⟩+λ∥θ+Δ∥1,

to determine a descent direction. For statistical inference, we solve a Newton step:

 minΔ12ρΔ⊤ˆSΔ+⟨1So∑k∈Io∇fk(θt),Δ⟩

to compute , whose covariance is the statistical error covariance.

To control variance, we solve Newton steps using SVRG and proximal SVRG [61], because in the high dimensional setting, the variance using SGD [40] and proximal SGD [5] for solving Newton steps is too large. However because , instead of sampling by sample, we sample by feature. We start from sufficiently close to (see Theorem 4 for details), which can be effectively achieved using proximal SVRG (Section A.3). Line 7 corresponds to SVRG’s outer loop part that computes the full gradient, and line 12 corresponds to SVRG’s inner loop update. Line 8 corresponds to proximal SVRG’s outer loop part that computes the full gradient, and line 13 corresponds to proximal SVRG’s inner loop update.

The covariance estimate bound, asymptotic normality result, and choice of hyper-parameters are described in Section A.4. When , we can estimate the covariance with element-wise error less than with high probability, using numerical operations. Calculation of the de-biased estimator (16) via SVRG is described in Section A.2.

### a.2 Computing the de-biased estimator (16) via SVRG

To control variance, we solve each proximal Newton step using SVRG, in stead of SGD as in Algorithm 1. Because However because the number of features is much larger than the number of samples, instead of sampling by sample, we sample by feature.

The de-biased estimator is

 ˆθd= ˆθ+ˆS−1[1nn∑i=1yixi−(1nn∑i=1xix⊤i)ˆθ] = ˆθ+ˆS−1(−1nn∑i=1∇fi(ˆθ)).

And we compute using SVRG [34] by solving the following optimization problem using SVRG and sampling by feature

 minu12u⊤ˆSu+⟨1nn∑i=1∇fi(ˆθ),u⟩.

Similar to Algorithm 2, we choose and .

### a.3 Solving the high dimensional linear regression optimization objective (13) using proximal SVRG

We solve our high dimensional linear regression optimization problem using proximal SVRG [61]

 ˆθ=argminθ12θ⊤(ˆS−1nn∑i=1xix⊤i)θ+1nn∑i=112(x⊤iθ−yi)2+λ