# Convergence diagnostics for stochastic gradient descent with constant step size

## Abstract

Iterative procedures in stochastic optimization are typically comprised of a transient phase and a stationary phase. During the transient phase the procedure converges towards a region of interest, and during the stationary phase the procedure oscillates in a convergence region, commonly around a single point. In this paper, we develop a statistical diagnostic test to detect such phase transition in the context of stochastic gradient descent with constant step size. We present theoretical and experimental results suggesting that the diagnostic behaves as intended, and the region where the diagnostic is activated coincides with the convergence region. For a class of loss functions, we derive a closed-form solution describing such region, and support this theoretical result with simulated experiments. Finally, we suggest an application to speed up convergence of stochastic gradient descent by halving the learning rate every time convergence is detected. This leads to remarkable speed gains that are empirically comparable to state-of-art procedures.

## 1Introduction

We consider the classical problem in stochastic optimization stated as

where is the loss function, denotes the response, are the features, and are parameters in . For example, the quadratic loss function is defined as . In estimation problems we typically have a finite data set , for , from which we wish to estimate by solving the empirical version of Eq. :

When data size, , and parameter size, , are large classical methods for computing fail. Stochastic gradient descent (SGD) is a powerful alternative [29] because it solves the problem in an iterative fashion through the procedure:

Here, is the estimate of prior to the th iteration, is a random sample from the data, and is the gradient of the loss with respect to . Classical stochastic approximation theory [20] suggests that SGD converges to a value such that , which under typical regularity conditions is equal to when is infinite (streaming setting), or is equal to when is finite. Going forward we assume the streaming setting for simplicity, but our results hold for finite- settings as well.

Typically, stochastic iterative procedures start from some starting point and then move through a transient phase and towards a stationary phase [14]. In stochastic gradient descent this behavior is largely governed by parameter , which is known as the learning rate, and can be either decreasing over (e.g., ), or constant. In the decreasing rate case, the transient phase is usually long, and can be impractically so if the rate is slightly misspecified [16], whereas the stationary phase involves SGD converging in quadratic mean to . In the constant rate case the transient phase is much shorter and less sensitive to the learning rate, whereas the stationary phase involves SGD oscillating within a region that contains . As such, in this paper we focus on constant rate SGD because it is in this context where the diagnostic can be utilized to identify when there is no benefit in running the procedure longer. Furthermore, we can utilize the speed of constant rate SGD in combination with the diagnostic to implement SGD procedures that converge in linear time, as in Section 4.2.

More generally, understanding when a stochastic procedure, such as stochastic gradient descent, reaches the stationary phase is crucial for implementation and for understanding its empirical performance. The main approach to this problem in current practice is through heuristics. In this paper, we develop a principled statistical diagnostic to detect stationarity in SGD with constant learning rate. We illustrate the use of this diagnostic through extensive theory and simulated experiments in the quadratic loss model and extensions. As an additional illustration of the diagnostic’s use, we suggest a new variant of SGD, where we halve the learning rate each time stationarity is detected. This variant combines effectively the speed of constant rate SGD with the convergence property of decreasing rate SGD and exhibits state-of-art performance.

### 1.1Related work and contributions

The idea that SGD methods are roughly composed of a transient phase and a stationary phase (also known as search phase and convergence phase, respectively), has been expressed before [14]. However, no principled statistical methods have been developed to address stationarity issues, and thereby guide empirical practice of SGD. Currently, guidance is based on heuristics originating from optimization theory that aim to evaluate the magnitude of SGD updates. For example, a popular method is to stop when is small according to some threshold, or when updates of the loss function have reached machine precision [8]. These methods, however, do not take into account the sampling variation in SGD estimates, which implies that the aforementioned thresholds depend on the learning rate and the underlying data variation, and are thus suited for deterministic procedures but not stochastic ones.

A more statistically motivated approach is to monitor the test error of SGD iterates on a hold-out validation test, concurrently with the main SGD iteration [3]. One idea here is to stop the procedure when the validation error starts increasing. An important problem with this approach is that the validation error is also a stochastic process, and estimating when it actually starts increasing presents similar, if not greater, challenges to the original problem of detecting convergence to stationary phase. Furthermore, cross validation can be computationally costly in large data sets.

Interestingly, statistically principled methods to detect stationarity can be traced back to classical theory of stopping times in stochastic approximation [28], of which SGD is a special case. One important method, which also forms the basis of our own work, is Pflug’s procedure [18] that keeps a running average of the inner product of successive gradients , where . The underlying idea is that in the transient phase SGD moves towards and on average the stochastic gradients point to the same direction, and thus their inner product is positive. In the stationary phase, however, SGD with constant rate moves more haphazardly around , and so the gradients point to different directions making their inner product negative. Despite the solid mathematical underpinnings of this idea, so far it has not been analyzed or implemented in settings of SGD-based stochastic optimization.

#### Overview of results and contributions

Our contributions in this paper can be summarized as follows. In Section 2, we develop the convergence diagnostic test for SGD, which combines Pflug’s stopping time procedure [18] with SGD in Eq. to detect when SGD exits the transient phase and enters to the stationary phase. We note that by convergence of SGD with constant rate we do not mean convergence to a single point but convergence to the stationarity region. In Theorem ? we prove under relatively mild assumptions that the diagnostic is activated almost surely. We obtain this result by examining the expected value of change (increase or decrease) in the diagnostic as a function of . We then illustrate the use of the diagnostic through a simulated example where conditional on the diagnostic being activated, the distance to true parameter value, , is uncorrelated with the starting distance , which implies that the diagnostic captures the transition from transient to stationary phase well. In Section 3 we develop theory for quadratic loss, and derive a closed-form solution describing the region where the diagnostic is activated, which we support with visual empirical evidence. In Section 4.1 we present extensions beyond the quadratic loss. In Section 4.2 we suggest an application of the diagnostic in speeding up SGD by halving the learning rate each time convergence is detected. This leads to a new SGD procedure, named `SGD`

, with impressive speed gains that we compare against state-of-art procedures, such as variance-reduced SGD [9] and averaged SGD [5], in Sections Section 4.3 and Section 4.4.

## 2Convergence diagnostic

In this section, we focus on the SGD procedure in Eq. and develop the diagnostic for detecting when has escaped the transient phase and is in the stationary phase. Before we develop the diagnostic we note that the existence of the two phases is suggested from existing analyses of the SGD procedure in Eq. . Such analyses operate under different assumptions, but invariably suggest that the mean squared error of SGD has a bias term from distance to the starting point, and a variance term from noise in stochastic gradients that becomes dominant for large . The following meta-theorem summarizes this finding.

*Remarks.* The constants differ depending on the analysis. For example, Bach and Moulines [13] use , where is the strong convexity constant of expected loss , and is the Lipschitz constant of ; and , where is an upper bound for the variance of . Needell and Srebro [15], under different assumptions, use and .

Despite differences, all analyses suggest that the SGD procedure with constant rate goes through a transient phase exponentially fast during which it forgets the initial conditions , and then enters a stationary phase during which it oscillates around , roughly at a region of radius . We note that a trade-off exists here: reducing will make the oscillation radius, , smaller but escaping the transient phase becomes much slower—in the extreme case where , for instance, the procedure will never exit the transient phase.

Despite the theoretical insights it offers, Theorem ? has limited practical utility for estimating the phase transition in SGD. For example, consider one approach where we find the value of for which , that is, the initial conditions have been discounted to 1% of the oscillation (squared) radius of the stationarity region. That, however, requires estimating quantities such as , , and , which is challenging. In the following section, we develop a concrete statistical diagnostic to estimate the phase transition and detect convergence of SGD in a much simpler way.

### 2.1`Pflug`

diagnostic

In this section, we develop a convergence diagnostic for SGD procedures that relies on Pflug’s procedure [?] in stochastic approximation. The diagnostic is presented as Algorithm ? and concrete instances under quadratic loss along with theoretical analysis are presented in Section 3, with extensions in Section 4.

The diagnostic is defined by a random variable, , that keeps the running sum of the inner product of successive stochastic gradients, as shown in Line 5. The idea is that in the transient phase SGD moves towards by discarding initial conditions, and so the stochastic gradients point to the same direction, on average. This implies that the inner product of successive stochastic gradients is likely positive in the transient phase. In the stationary phase, however, SGD is oscillating around at a distance bounded by Theorem ?, and so the gradients point to different directions. This implies a negative inner product on average during the stationary phase. When the statistic changes sign from positive to negative, this is probably a good signal that the procedure has exited the transient phase.

Since our testing diagnostic is iterative we need to show that it eventually terminates with an answer. Our first theoretical result is Theorem ?, where we prove that as , and so the algorithm indeed terminates almost surely. For brevity, we state the theorem without technical details. The full assumptions and proof are in the supplementary material.

*Remarks.* Theorem ? shows that the inner product of successive gradients is negative in expectation as the iteration number increases. Roughly, when is very close to the dominant force is the variance in the stochastic gradient pulling away the next iterates; when is far from the dominant force is the bias in the stochastic gradient pulling in the next iterates towards . This implies that the running average of successive gradients will eventually become negative at a finite iteration, and so by the law of large numbers the diagnostic returns a value almost surely.

## 3Quadratic loss model

In this section, we focus on the quadratic loss function, where and . In this context we will obtain theoretical and empirical results that will illustrate the performance of the `Pflug`

diagnostic in Algorithm ?. In Section 4 we discuss extensions to other models.

For a first intuition on how the test works consider the case where , i.e., the procedure starts in the stationary region. Let , where are zero-mean random variables, . Then,

from which it follows that

Thus, when the procedure starts at the truth, , the diagnostic is decreased in expectation, and eventually, by the law of large numbers, at some iteration the statistic becomes negative and the diagnostic is activated returning value . We generalize this result in the following theorem.

*Remarks.* Theorem ? shows that the boundary surface that separates the two regions where the test statistic increases or decreases in expectation looks like an ellipse, for large enough . Regardless of the choice of , when is close enough to , the diagnostic is guaranteed to decrease in expectation since the only remaining term is .

The result also shows the various competing forces between bias and variance in the stochastic gradients as they relate to how the diagnostic behaves. For instance, when is very close , larger (noise in stochastic gradient) contributes faster decrease of the diagnostic in expectation, but at the cost of higher variance. The contribution of the other term, , is less clear. For instance, is large when there is strong collinearity in features , which may contribute to decreasing . But strong collinearity also implies that is almost positive definite which contributes positive values to , thus counteracting the contribution of . Note that is a positive definite matrix but may not be. This implies that careful selection of may be necessary for the diagnostic to work well. For example, when is very small and is positive definite, then will converge to a negative number slowly. One way to alleviate this sensitivity to the learning rate is through implicit updates [25], which we explore in Section 3.3.

In the following sections we illustrate the main result of Theorem ?, and also the performance of the `Pflug`

diagnostic in detecting convergence of SGD.

### 3.1Illustration

Here, we illustrate the main results of Theorem ? through Figure 1, which can be described as follows. The dark shaded area in the figure corresponds to the region where the `Pflug`

diagnostic is decreased in expectation if the SGD iterate falls in the region. We call this the `Pflug`

region. Outside of the `Pflug`

region the diagnostic is increased in expectation, and the farther we move away from the center of that region, which roughly coincides with , the larger the expected increase of the diagnostic becomes. The polygon shaded with diagonal lines corresponds to empirical estimations of the convergence region of SGD, where SGD iterates have oscillated around over 1000 simulations. In addition to the result in Theorem ? about the shape of the region this shows that the `Pflug`

region approximates very well the convergence region of SGD. This is remarkable because the `Pflug`

region is implementable from data, whereas the SGD convergence region is not since by Theorem ? it depends on unknown quantities.

### 3.2Simulated example

Next, we apply the `Pflug`

convergence diagnostic on a simulated example to investigate empirically whether it estimates the phase transition well. The experimental setup is as follows. We set as the parameter dimension, and as the data set size. We fix with —this ensures some variation and sparsity in the parameter values. We sample features as , where denotes a -variate normal distribution, and is the identity matrix, and . We sample outcomes as , .

For a given we run Algorithm ? with `burnin`

= , for various values of the starting point sampled as , where . Let , then for each run we store the tuple

where is the output of Algorithm ?, i.e., the iteration at which the `Pflug`

diagnostic detected convergence. The idea in this experimental evaluation is that if the convergence diagnostic detects convergence accurately, the iterates will depend on the initial conditions more than the iterates . Thus, for given and , we should expect a much higher correlation between and than between and . To test this hypothesis, for a given value of we draw 100 independent samples of . With these samples we regress on and on in two normal linear regression models. Table 1 summarizes the experimental results. In the second and third column of Table 1 we report the regression coefficients of in the two model fits, respectively, and also report statistical significance.

** | . | |

*** | ||

** | ||

** | ||

*** | ||

* | * | |

** | ||

** | ||

From the table, we see that the regression coefficient corresponding to is always positive and statistically significant, whereas the coefficient is mostly not significant for . This suggests that depends on initial conditions , and thus stationarity has not yet been reached at iteration . In contrast, does not depend on initial conditions , and thus stationarity has likely occurred after iteration . This is evidence indicating that the `Pflug`

diagnostic performs reasonably well in estimating the switch of SGD from its transient phase to its stationary phase. We note that in the regression evaluation we had to control for (by using it as a regressor) because the iteration number is correlated with mean-squared error (larger values for are correlated with smaller error).

### 3.3Implicit update

As mentioned in the remarks of Theorem ? the `Pflug`

diagnostic is sensitive to the choice of learning rate . For example, when is small and is positive definite, will be mostly increasing during the transient phase, which makes convergence slower. But choosing a large learning rate can easily lead to numerical instability. One way to alleviate such sensitivity to the learning rate is to use the SGD procedure with an implicit update:

Note that appears on both sides of the equation. In the quadratic loss model we can solve exactly the implicit equation as follows:

Implementing the procedure in Eq. is fast since it is equivalent to . More generally, the implicit update in Equation 11 can be computed efficiently through a one-dimensional root-finding procedure in many settings [25]. Previous work on implicit SGD procedures (ISGD) has shown that implicit procedures have similar asymptotic properties with standard SGD procedures with numerical stability as an added benefit. Since most related work on ISGD methods is with respect to decreasing learning rate procedures [11], we provide an analysis for constant rate ISGD as in Eq. (4) in the supplementary material. We note that ISGD procedures are related to proximal updates in stochastic optimization [17], but these methods differ from ISGD methods in that they employ a combination of classical SGD with deterministic proximal operators, whereas ISGD’s proximal operator is purely stochastic.

The following theorem shows that the implicit update in the linear model mitigates the sensitivity of the `Pflug`

diagnostic to the choice of the learning rate.

*Remarks.* Theorem ? shows that the diagnostic is more stable with the ISGD procedure rather than the classical SGD procedure. Take, for example, the case where so that . In the classical SGD setting, Theorem ? suggests that the diagnostic increases without bound and convergence fails. In the ISGD setting, however, it holds that , and convergence may happen. In contrast, when then and so , which implies that the diagnostic behaves in the implicit procedure as it behaves in the explicit one.

## 4Extensions and applications

In this section we consider extensions of the `Pflug`

diagnostic to a more broad family of loss functions inspired by generalized linear models (GLMs). We also consider an application of the diagnostic to speed up convergence of SGD with constant rate.

### 4.1Generalized linear loss

Here, we consider the loss based on the GLM formulation [12] where . For example, the quadratic loss is equivalent to . The logistic loss is when is binary and . In general, cannot be chosen arbitrarily—one standard choice is to define such that is a proper density, i.e., it integrates to one. The following theorem generalizes the results in Section 3 on the quadratic loss.

*Remarks.* The result in Theorem ? has the same structure as in Theorem ? so a direct analogy can be helpful. The terms in the two theorems are identical, if we consider that for the quadratic loss it holds that . The term in Theorem ? corresponds to the term in Theorem ?, and corresponds to , respectively. The terms are equal when we set , in which case . Thus, the diagnostic with the more general GLM loss has familiar properties. For example, when , i.e., when SGD is near the truth, and , in which case the negative constant term dominates and the test statistic decreases in expectation. One difference with the quadratic loss, however, is that as we move farther from the statistic may change in a nonlinear way. Therefore the boundary separating the positive and negative regions of the diagnostic will generally not have the nice, smooth elliptical shape as in the quadratic loss (see Figure 1). This may lead to more complex behavior for the diagnostic, which we aim to analyze fully in future work.

We note that the constraint on is not particularly strict because in the GLM formulation is guaranteed to be positive. The assumption is made to simplify the analysis but can be improved by analyzing the quantity through existing analyses of .

### 4.2`Sgd`

for fast convergence

We now switch gears from analyzing the behavior of the `Pflug`

diagnostic to using it in a practical application. Our suggested application is to use the diagnostic within a SGD loop where the learning rate is halved and the procedure restarted each time convergence is detected. We note that our goal here is to illustrate the utility of our convergence diagnostic, and a full analysis of the proposed method will be a subject of future work.

More specifically, note that by Theorem ? procedure SGD with constant rate has linear convergence to a stationary distance from of . It would therefore be beneficial to reduce the learning rate when we know that the SGD iterates are oscillating around in a ball of radius , so that the procedure moves to a ball with a smaller radius. To implement such a procedure, however, would require knowing , and also knowing all parameters required to calculate . Our solution employs the `Pflug`

convergence diagnostic to detect stationarity. Algorithm ? describes such a procedure, named `SGD`

, where the learning rate is halved upon detection of convergence (Line 10).

Note that implicit updates can be used in this algorithm as well; we call this modified algorithm `ISGD`

. In our experiments in the following section, we employ `ISGD`

because of the benefits in numerical stability from using implicit updates.

### 4.3Simulated data experiments

To evaluate the effectiveness of `ISGD`

, we compare to other classical and state-of-the-art SGD methods. We first experiment on simulated data to better understand the performance of `ISGD`

and its competition under various parameter settings. In particular, we compare the performance of our procedure `ISGD`

in Algorithm ? against SVRG and classical ISGD on simulated data. The classical ISGD uses an optimized learning rate of . The basic experimental setup is as follows.

We consider settings of high and low signal to noise ration (SNR), and high and low dimension and test under the four combinations of these settings. For the high SNR case, we set , where , and for the low SNR case we set . For the high dimension case we set as the parameter dimension, and for the low dimension case we set . Given the dimension , we fix such that . We set as the size of the data set. We sample features as , where . We sample outcomes as for the normal model, and for the logistic model, where denotes the binomial random variable with mean . The learning parameters for each SGD method were tuned to provide best performance through pre-processing.

From simulations with the normal model in the left half of Figure 2 we see that `ISGD`

attains a comparable performance to SVRG. In general, SVRG attains an overall better performance for these experiments, which we believe is related to our convergence diagnostic being aggressive in a couple of cases.

From simulations with the logistic model in the right half of Figure 2 we see that `ISGD`

attains an even better performance than before as there are fewer cases where the diagnostic is aggressive. With high SNR and low dimension parameter settings, `ISGD`

achieves consistently better performance than SVRG. We note that such comparisons do not take into account the sensitivity of SVRG to misspecifications of the learning rate (large enough learning rates can easily make the procedure diverge); or that SVRG requires periodic calculations over the entire data set, which here is easy because we are using only 5,000 data points, but may be a problem in more realistic settings. We also note that there are several improvements available for `ISGD`

by allowing a larger `burnin`

period or by discounting the learning rate less aggressively, but we plan on addressing such tuning issues in future work, both theoretically and empirically.

### 4.4Benchmark data sets

In addition to simulated experiments we conduct experiments on benchmark data sets MNIST (binary) and COVERTYPE (binary) to evaluate real world performance.^{1}`ISGD`

, SVRG, classical ISGD, and averaged ISGD, where averaging iterates theoretically leads to optimal asymptotic errors. We plot the prediction error on a held out test set in Figure 3 relative to the number of passes over the data.

Overall, we see that `ISGD`

convergences very quickly, after going over less than a quarter of the data, and achieves best performance in the COVERTYPE data set. We currently do not have a theoretical justification for this, but we have verified that it is consistently observed across multiple experiments. `ISGD`

was also very stable to specifications of the learning rate parameter, as expected from the analysis of Theorem ?. In contrast, even though SVRG performed comparably to `ISGD`

, its performance was unstable, especially in the COVERTYPE data set, and required careful fine tuning of the learning rate through trial and error. Averaged SGD performed well on the MNIST data set, but flattened out very fast in the COVERTYPE data, possibly due to misspecification.

## 5Concluding remarks

In this paper we focused on the problem of convergence of SGD to the stationary phase. This is an important task because statistical properties of iterative stochastic procedures are better understood when they have reached stationarity. We borrowed from the theory of stopping times in stochastic approximation to develop a diagnostic that reliably detects the phase transition, as shown theoretically and empirically. Future work can focus on analysis of conditional on the diagnostic being activated. This could show theoretically that the error is uncorrelated with the initial starting point conditional on the test being activated, and so provide a theoretical analysis of the results in Table 1.

In separate work we plan to focus more on the procedure `ISGD`

that worked very well in experiments, and to analyze theoretically its behavior. One idea is to run parallel `ISGD`

chains and aggregate the iterates. At stationarity we expect the iterates from different chains to be uncorrelated with each other; averaging can help, in a similar way that iterate averaging helps in stochastic approximations [22].

## AProofs of theorems

For brevity let be the stochastic gradient at iteration .

*Remarks.* Assumption (b) is a form of strong convexity. For example, suppose that , then and . In this case is the Fisher information matrix and Assumption (b) holds for . When is small enough and a Taylor approximation of is possible, the above result still holds for when the Fisher information exists. Assumption (c) shows that there is a threshold value for below which the diagnostic cannot terminate. For example, suppose that error noise is small so that and , as argued before. Then, , that is, the learning rate has to exceed the reciprocal of the minimum eigenvalue of the Fisher information matrix.

For notational brevity we make the following definitions:

where is the current iterate, and and are the next two using iid data and . For a fixed we understand the Pflug diagnostic through the function

We use Eq. to derive an expression for :

Let ; we know that . Now, we analyze each term individually:

The variables are conditionally independent of and so using the law of iterated expectations these terms vanish.

Using a similar reasoning, for the second term we have:

In expectation of Eq. ,

By combining all results we finally get:

We derive similar theoretical results for under the linear normal model for implicit updates. We have the implicit updates

Also note the collinearity

where and . We derive an expression for , with implicit updates:

where is the function from the explicit update in Eq. . The formula for follows by applying expectation and the reasoning in Eq. . Note that since and are independent and have marginally identical distributions.

The updates for the GLM loss are as follows:

Note that . We can now follow the exact same reasoning as in Theorem ? and that almost surely.

## BMean squared error bound for constant rate ISGD

We have the implicit update of SGD (ISGD):

We will operate under the following assumptions:

To prove Theorem ?, our result for the upper bound on the MSE for constant learning rate ISGD, we first prove the following results:

From the chain rule , and similarly . Thus the two gradients are collinear. Therefore there exists a scalar such that

We also have,

Substituting the expression for in Eq. into Eq. we obtain the desired result of the theorem. From Eq. we get the equality

and substituting we get our desired result

From Lemma ? we have

where the derivative of the loss function is with respect to the natural parameter . Using the definition of the implicit update Eq. ,

We substitute this definition of into Eq. and perform a Taylor approximation on . Recall Taylor approximation for a function , where lies in the closed interval between and . From Eq. we let and . Also, by the Chain rule . Thus we obtain,

where and .

By combining Eq. with Eq. and cancelling out the first derivative term we get

Starting from the implicit update , we have

To bound the last term,

Taking expectation of both sides of Eq.,

We can bound the expectation of the second term as

Taking expectations in and substituting inequalities and into , and again taking expectation, yields the recursion,

Let . We can now derive the bound of the theorem as follows:

The discount factor is bounded below by because . We will show that this term is bounded below by 0.

A quick manipulation of the algebra gives us

Both the upper bound and stationary bound are satisfied by Assumption 1 (e). Further manipulating the lower bound, from Eq.,

Solving the equality of Eq. (with the quadratic equation) gives us

Recall that for a second-degree polynomial of the form , the convexity is determined by . Because (a standard assumption), the discriminant and thus there are no real roots. Looking at the convexity,

The strict inequality is because of the following. For all observed Fisher information matrices, (with the dimesnion)

Thus for all the lower bound represented by Eq. is satisfied. We have zero real roots and a convex function.

## CComputation of ISGD

Here, we show that the ISGD procedure can be computed efficiently. Most of the ideas here have been presented in earlier work [10], but we present them here as well for clarity and completeness.

Consider the ISGD procedure:

Note that , where is scalar and is defined as . Therefore we can write, for some ,

and

Substituting Eq. into Eq. we get

Equating Eq. and Eq. we conclude that

which is a one-dimensional fixed-point equation for . We can also see that if , or if . Thus, search bounds are available for and solving the fixed-point equation is very fast. Given , the implicit update can be readily computed through Eq..

### Footnotes

- Data sets can be found at https://archive.ics.uci.edu/ml/databases/mnist/ and https://archive.ics.uci.edu/ml/datasets/covertype, respectively.

### References

*Adaptive algorithms and stochastic approximations*.

Benveniste, A., P. Priouret, and M. Métivier (1990). Springer-Verlag New York, Inc.**Incremental proximal methods for large scale convex optimization.**

Bertsekas, D. P. (2011).*Mathematical programming**129*(2), 163–195.**Beating the hold-out: Bounds for k-fold and progressive cross-validation.**

Blum, A., A. Kalai, and J. Langford (1999). In*Proceedings of the twelfth annual conference on Computational learning theory*, pp. 203–208. ACM.**Stochastic approximation.**

Borkar, V. S. (2008).*Cambridge Books*.**Large-scale machine learning with stochastic gradient descent.**

Bottou, L. (2010). In*Proceedings of COMPSTAT’2010*, pp. 177–186. Springer.**Stochastic Gradient Descent Tricks.**

Bottou, L. (2012). In*Neural Networks: Tricks of the Trade*, Volume 1, pp. 421–436.**Optimization methods for large-scale machine learning.**

Bottou, L., F. E. Curtis, and J. Nocedal (2016).*arXiv preprint arXiv:1606.04838*.*Numerical techniques for stochastic optimization*.

Ermoliev, Y. M. and R.-B. Wets (1988). Springer-Verlag.**Accelerating stochastic gradient descent using predictive variance reduction.**

Johnson, R. and T. Zhang (2013). In*Advances in Neural Information Processing Systems*, pp. 315–323.**The p-norm generalization of the lms algorithm for adaptive filtering.**

Kivinen, J., M. K. Warmuth, and B. Hassibi (2006).*Signal Processing, IEEE Transactions on**54*(5), 1782–1793.**Implicit online learning.**

Kulis, B. and P. L. Bartlett (2010). In*Proceedings of the 27th International Conference on Machine Learning (ICML-10)*, pp. 575–582.**Generalized linear models.**

McCullagh, P. (1984).*European Journal of Operational Research**16*(3), 285–292.**Non-asymptotic analysis of stochastic approximation algorithms for machine learning.**

Moulines, E. and F. R. Bach (2011). In*Advances in Neural Information Processing Systems*, pp. 451–459.**A statistical study of on-line learning.**

Murata, N. (1998).*Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK*, 63–92.**Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm.**

Needell, D., R. Ward, and N. Srebro (2014). In*Advances in Neural Information Processing Systems*, pp. 1017–1025.**Robust stochastic approximation approach to stochastic programming.**

Nemirovski, A., A. Juditsky, G. Lan, and A. Shapiro (2009).*SIAM Journal on Optimization**19*(4), 1574–1609.**Proximal algorithms.**

Parikh, N. and S. Boyd (2013).*Foundations and Trends in optimization**1*(3), 123–231.**Non-asymptotic confidence bounds for stochastic approximation algorithms with constant step size.**

Pflug, G. C. (1990).*Monatshefte für Mathematik**110*(3), 297–314.**Gradient estimates for the performance of markov chains and discrete event processes.**

Pflug, G. C. (1992).*Annals of Operations Research**39*(1), 173–194.**A stochastic approximation method.**

Robbins, H. and S. Monro (1951).*The annals of mathematical statistics*, 400–407.**Convergence of stochastic proximal gradient algorithm.**

Rosasco, L., S. Villa, and B. C. Vũ (2014).*arXiv preprint arXiv:1403.5074*.**Efficient estimators from a slowly convergent robbins-monro process.**

Ruppert, D. (1988). Technical report, School of Operations Research and Industrial Engineering, Cornell University.**Scalable estimation strategies based on stochastic approximations: classical results and new insights.**

Toulis, P. and E. M. Airoldi (2015).*Statistics and computing**25*(4), 781–795.**Asymptotic and finite-sample properties of estimators based on stochastic gradients.**

Toulis, P., E. M. Airoldi, et al. (2017).*The Annals of Statistics**45*(4), 1694–1727.**Statistical analysis of stochastic gradient methods for generalized linear models.**

Toulis, P., J. Rennie, and E. Airoldi (2014). In*31st International Conference on Machine Learning*.**A proximal stochastic gradient method with progressive variance reduction.**

Xiao, L. and T. Zhang (2014).*SIAM Journal on Optimization**24*, 2057–2075.**Towards optimal one pass large scale learning with averaged stochastic gradient descent.**

Xu, W. (2011).*arXiv preprint arXiv:1107.2490*.**Stopping times for stochastic approximation.**

Yin, G. (1989). In*Modern Optimal Control: A Conference in Honor of Solomon Lefschetz and Joseph P. LaSalle*, pp. 409–420.**Solving large scale linear prediction problems using stochastic gradient descent algorithms.**

Zhang, T. (2004). In*Proceedings of the twenty-first international conference on Machine learning*, pp. 116. ACM.