Convergence diagnostics for stochastic gradient descent with constant step size

# Convergence diagnostics for stochastic gradient descent with constant step size

Jerry Chee University of Chicago Panos Toulis University of Chicago, Booth School of Business
June 28, 2019
###### Abstract

Iterative procedures in stochastic optimization are typically comprised of a transient phase and a stationary phase. During the transient phase the procedure converges towards a region of interest, and during the stationary phase the procedure oscillates in a convergence region, commonly around a single point. In this paper, we develop a statistical diagnostic test to detect such phase transition in the context of stochastic gradient descent with constant step size. We present theoretical and experimental results suggesting that the diagnostic behaves as intended, and the region where the diagnostic is activated coincides with the convergence region. For a class of loss functions, we derive a closed-form solution describing such region, and support this theoretical result with simulated experiments. Finally, we suggest an application to speed up convergence of stochastic gradient descent by halving the learning rate every time convergence is detected. This leads to remarkable speed gains that are empirically comparable to state-of-art procedures.

## 1 Introduction

We consider the classical problem in stochastic optimization stated as

 θ⋆=argminθ∈ΘE(ℓ(y,x⊤θ)), (1)

where is the loss function, denotes the response, are the features, and are parameters in . For example, the quadratic loss function is defined as . In estimation problems we typically have a finite data set , for , from which we wish to estimate by solving the empirical version of Eq. (1):

 ^θ=argminθ∈Θ1NN∑i=1ℓ(yi,x⊤iθ).

When data size, , and parameter size, , are large classical methods for computing fail. Stochastic gradient descent (SGD) is a powerful alternative (Zhang, 2004; Bottou, 2010, 2012; Toulis and Airoldi, 2015) because it solves the problem in an iterative fashion through the procedure:

 θn=θn−1−γ∇ℓ(yn,x⊤nθn−1). (2)

Here, is the estimate of prior to the th iteration, is a random sample from the data, and is the gradient of the loss with respect to . Classical stochastic approximation theory (Robbins and Monro, 1951; Benveniste et al., 1990; Borkar, 2008) suggests that SGD converges to a value such that , which under typical regularity conditions is equal to when is infinite (streaming setting), or is equal to when is finite. Going forward we assume the streaming setting for simplicity, but our results hold for finite- settings as well.

Typically, stochastic iterative procedures start from some starting point and then move through a transient phase and towards a stationary phase (Murata, 1998). In stochastic gradient descent this behavior is largely governed by parameter , which is known as the learning rate, and can be either decreasing over (e.g., ), or constant. In the decreasing rate case, the transient phase is usually long, and can be impractically so if the rate is slightly misspecified (Nemirovski et al., 2009; Toulis et al., 2017), whereas the stationary phase involves SGD converging in quadratic mean to . In the constant rate case the transient phase is much shorter and less sensitive to the learning rate, whereas the stationary phase involves SGD oscillating within a region that contains . As such, in this paper we focus on constant rate SGD because it is in this context where the diagnostic can be utilized to identify when there is no benefit in running the procedure longer. Furthermore, we can utilize the speed of constant rate SGD in combination with the diagnostic to implement SGD procedures that converge in linear time, as in Section 4.2.

More generally, understanding when a stochastic procedure, such as stochastic gradient descent, reaches the stationary phase is crucial for implementation and for understanding its empirical performance. The main approach to this problem in current practice is through heuristics. In this paper, we develop a principled statistical diagnostic to detect stationarity in SGD with constant learning rate. We illustrate the use of this diagnostic through extensive theory and simulated experiments in the quadratic loss model and extensions. As an additional illustration of the diagnostic’s use, we suggest a new variant of SGD, where we halve the learning rate each time stationarity is detected. This variant combines effectively the speed of constant rate SGD with the convergence property of decreasing rate SGD and exhibits state-of-art performance.

### 1.1 Related work and contributions

The idea that SGD methods are roughly composed of a transient phase and a stationary phase (also known as search phase and convergence phase, respectively), has been expressed before (Murata, 1998). However, no principled statistical methods have been developed to address stationarity issues, and thereby guide empirical practice of SGD. Currently, guidance is based on heuristics originating from optimization theory that aim to evaluate the magnitude of SGD updates. For example, a popular method is to stop when is small according to some threshold, or when updates of the loss function have reached machine precision (Ermoliev and Wets, 1988; Bottou et al., 2016). These methods, however, do not take into account the sampling variation in SGD estimates, which implies that the aforementioned thresholds depend on the learning rate and the underlying data variation, and are thus suited for deterministic procedures but not stochastic ones.

A more statistically motivated approach is to monitor the test error of SGD iterates on a hold-out validation test, concurrently with the main SGD iteration (Blum et al., 1999; Bottou, 2012). One idea here is to stop the procedure when the validation error starts increasing. An important problem with this approach is that the validation error is also a stochastic process, and estimating when it actually starts increasing presents similar, if not greater, challenges to the original problem of detecting convergence to stationary phase. Furthermore, cross validation can be computationally costly in large data sets.

Interestingly, statistically principled methods to detect stationarity can be traced back to classical theory of stopping times in stochastic approximation (Yin, 1989; Pflug, 1990), of which SGD is a special case. One important method, which also forms the basis of our own work, is Pflug’s procedure (Pflug, 1990) that keeps a running average of the inner product of successive gradients , where . The underlying idea is that in the transient phase SGD moves towards and on average the stochastic gradients point to the same direction, and thus their inner product is positive. In the stationary phase, however, SGD with constant rate moves more haphazardly around , and so the gradients point to different directions making their inner product negative. Despite the solid mathematical underpinnings of this idea, so far it has not been analyzed or implemented in settings of SGD-based stochastic optimization.

#### 1.1.1 Overview of results and contributions

Our contributions in this paper can be summarized as follows. In Section 2, we develop the convergence diagnostic test for SGD, which combines Pflug’s stopping time procedure (Pflug, 1990) with SGD in Eq. (2) to detect when SGD exits the transient phase and enters to the stationary phase. We note that by convergence of SGD with constant rate we do not mean convergence to a single point but convergence to the stationarity region. In Theorem 2 we prove under relatively mild assumptions that the diagnostic is activated almost surely. We obtain this result by examining the expected value of change (increase or decrease) in the diagnostic as a function of . We then illustrate the use of the diagnostic through a simulated example where conditional on the diagnostic being activated, the distance to true parameter value, , is uncorrelated with the starting distance , which implies that the diagnostic captures the transition from transient to stationary phase well. In Section 3 we develop theory for quadratic loss, and derive a closed-form solution describing the region where the diagnostic is activated, which we support with visual empirical evidence. In Section 4.1 we present extensions beyond the quadratic loss. In Section 4.2 we suggest an application of the diagnostic in speeding up SGD by halving the learning rate each time convergence is detected. This leads to a new SGD procedure, named SGD, with impressive speed gains that we compare against state-of-art procedures, such as variance-reduced SGD (Johnson and Zhang, 2013, SVRG) and averaged SGD (Bottou, 2010; Xu, 2011), in Sections 4.3 and 4.4.

## 2 Convergence diagnostic

In this section, we focus on the SGD procedure in Eq. (2) and develop the diagnostic for detecting when has escaped the transient phase and is in the stationary phase. Before we develop the diagnostic we note that the existence of the two phases is suggested from existing analyses of the SGD procedure in Eq. (2). Such analyses operate under different assumptions, but invariably suggest that the mean squared error of SGD has a bias term from distance to the starting point, and a variance term from noise in stochastic gradients that becomes dominant for large . The following meta-theorem summarizes this finding.

###### Theorem 1 ((Moulines and Bach, 2011; Needell et al., 2014))

Under certain assumptions on the loss function, there are positive constants such that, for every , it holds that

 E(||θn−θ⋆||2)≤E(||θ0−θ⋆||2)e−Aγn+Bγ.

Remarks. The constants differ depending on the analysis. For example, Bach and Moulines (Moulines and Bach, 2011) use , where is the strong convexity constant of expected loss , and is the Lipschitz constant of ; and , where is an upper bound for the variance of . Needell and Srebro (Needell et al., 2014), under different assumptions, use and .

Despite differences, all analyses suggest that the SGD procedure with constant rate goes through a transient phase exponentially fast during which it forgets the initial conditions , and then enters a stationary phase during which it oscillates around , roughly at a region of radius . We note that a trade-off exists here: reducing will make the oscillation radius, , smaller but escaping the transient phase becomes much slower—in the extreme case where , for instance, the procedure will never exit the transient phase.

Despite the theoretical insights it offers, Theorem 1 has limited practical utility for estimating the phase transition in SGD. For example, consider one approach where we find the value of for which , that is, the initial conditions have been discounted to 1% of the oscillation (squared) radius of the stationarity region. That, however, requires estimating quantities such as , , and , which is challenging. In the following section, we develop a concrete statistical diagnostic to estimate the phase transition and detect convergence of SGD in a much simpler way.

### 2.1 Pflug diagnostic

In this section, we develop a convergence diagnostic for SGD procedures that relies on Pflug’s procedure (Pflug, 1992) in stochastic approximation. The diagnostic is presented as Algorithm 1 and concrete instances under quadratic loss along with theoretical analysis are presented in Section 3, with extensions in Section 4.

The diagnostic is defined by a random variable, , that keeps the running sum of the inner product of successive stochastic gradients, as shown in Line 5. The idea is that in the transient phase SGD moves towards by discarding initial conditions, and so the stochastic gradients point to the same direction, on average. This implies that the inner product of successive stochastic gradients is likely positive in the transient phase. In the stationary phase, however, SGD is oscillating around at a distance bounded by Theorem 1, and so the gradients point to different directions. This implies a negative inner product on average during the stationary phase. When the statistic changes sign from positive to negative, this is probably a good signal that the procedure has exited the transient phase.

Since our testing diagnostic is iterative we need to show that it eventually terminates with an answer. Our first theoretical result is Theorem 2, where we prove that as , and so the algorithm indeed terminates almost surely. For brevity, we state the theorem without technical details. The full assumptions and proof are in the supplementary material.

###### Theorem 2

Under certain assumptions, the convergence diagnostic in Algorithm 1 for constant rate SGD procedure in Eq. (2) satisfies as , and so the algorithm terminates almost surely.

Remarks. Theorem 2 shows that the inner product of successive gradients is negative in expectation as the iteration number increases. Roughly, when is very close to the dominant force is the variance in the stochastic gradient pulling away the next iterates; when is far from the dominant force is the bias in the stochastic gradient pulling in the next iterates towards . This implies that the running average of successive gradients will eventually become negative at a finite iteration, and so by the law of large numbers the diagnostic returns a value almost surely.

In this section, we focus on the quadratic loss function, where and . In this context we will obtain theoretical and empirical results that will illustrate the performance of the Pflug diagnostic in Algorithm 1. In Section 4 we discuss extensions to other models.

For a first intuition on how the test works consider the case where , i.e., the procedure starts in the stationary region. Let , where are zero-mean random variables, . Then,

 θ1 =θ⋆+γ(y1−x⊤1θ⋆)x1=θ⋆+γε1x1,

from which it follows that

 S2−S1 =(y2−x⊤2θ1)(y1−x⊤1θ0)x⊤2x1 =(ε2−γε1x⊤2x1)ε1x⊤2x1. E(S2−S1) =−γE(ε21)E((x⊤2x1)2)<0. (3)

Thus, when the procedure starts at the truth, , the diagnostic is decreased in expectation, and eventually, by the law of large numbers, at some iteration the statistic becomes negative and the diagnostic is activated returning value . We generalize this result in the following theorem.

###### Theorem 3

Suppose that the loss is quadratic, . Let and be two iid vectors from the distribution of , and define: ; ; ; , and suppose that all such constants are finite. Then, for ,

 Δn(θ) =E(Sn+2−Sn+1|θn=θ) =(θ−θ⋆)⊤(C−γD)(θ−θ⋆)−γc2σ2.

Remarks. Theorem 3 shows that the boundary surface that separates the two regions where the test statistic increases or decreases in expectation looks like an ellipse, for large enough . Regardless of the choice of , when is close enough to , the diagnostic is guaranteed to decrease in expectation since the only remaining term is .

The result also shows the various competing forces between bias and variance in the stochastic gradients as they relate to how the diagnostic behaves. For instance, when is very close , larger (noise in stochastic gradient) contributes faster decrease of the diagnostic in expectation, but at the cost of higher variance. The contribution of the other term, , is less clear. For instance, is large when there is strong collinearity in features , which may contribute to decreasing . But strong collinearity also implies that is almost positive definite which contributes positive values to , thus counteracting the contribution of . Note that is a positive definite matrix but may not be. This implies that careful selection of may be necessary for the diagnostic to work well. For example, when is very small and is positive definite, then will converge to a negative number slowly. One way to alleviate this sensitivity to the learning rate is through implicit updates (Toulis et al., 2014), which we explore in Section 3.3.

In the following sections we illustrate the main result of Theorem 3, and also the performance of the Pflug diagnostic in detecting convergence of SGD.

### 3.1 Illustration

Here, we illustrate the main results of Theorem 3 through Figure 1, which can be described as follows. The dark shaded area in the figure corresponds to the region where the Pflug diagnostic is decreased in expectation if the SGD iterate falls in the region. We call this the Pflug region. Outside of the Pflug region the diagnostic is increased in expectation, and the farther we move away from the center of that region, which roughly coincides with , the larger the expected increase of the diagnostic becomes. The polygon shaded with diagonal lines corresponds to empirical estimations of the convergence region of SGD, where SGD iterates have oscillated around over 1000 simulations. In addition to the result in Theorem 3 about the shape of the region this shows that the Pflug region approximates very well the convergence region of SGD. This is remarkable because the Pflug region is implementable from data, whereas the SGD convergence region is not since by Theorem 1 it depends on unknown quantities.

### 3.2 Simulated example

Next, we apply the Pflug convergence diagnostic on a simulated example to investigate empirically whether it estimates the phase transition well. The experimental setup is as follows. We set as the parameter dimension, and as the data set size. We fix with —this ensures some variation and sparsity in the parameter values. We sample features as , where denotes a -variate normal distribution, and is the identity matrix, and . We sample outcomes as , .

For a given we run Algorithm 1 with burnin = , for various values of the starting point sampled as , where . Let , then for each run we store the tuple

 (γ,τ,E0,Eτ/2,E2τ),

where is the output of Algorithm 1, i.e., the iteration at which the Pflug diagnostic detected convergence. The idea in this experimental evaluation is that if the convergence diagnostic detects convergence accurately, the iterates will depend on the initial conditions more than the iterates . Thus, for given and , we should expect a much higher correlation between and than between and . To test this hypothesis, for a given value of we draw 100 independent samples of . With these samples we regress on and on in two normal linear regression models. Table 1 summarizes the experimental results. In the second and third column of Table 1 we report the regression coefficients of in the two model fits, respectively, and also report statistical significance.

From the table, we see that the regression coefficient corresponding to is always positive and statistically significant, whereas the coefficient is mostly not significant for . This suggests that depends on initial conditions , and thus stationarity has not yet been reached at iteration . In contrast, does not depend on initial conditions , and thus stationarity has likely occurred after iteration . This is evidence indicating that the Pflug diagnostic performs reasonably well in estimating the switch of SGD from its transient phase to its stationary phase. We note that in the regression evaluation we had to control for (by using it as a regressor) because the iteration number is correlated with mean-squared error (larger values for are correlated with smaller error).

### 3.3 Implicit update

As mentioned in the remarks of Theorem 3 the Pflug diagnostic is sensitive to the choice of learning rate . For example, when is small and is positive definite, will be mostly increasing during the transient phase, which makes convergence slower. But choosing a large learning rate can easily lead to numerical instability. One way to alleviate such sensitivity to the learning rate is to use the SGD procedure with an implicit update:

 θn=θn−1+γ∇ℓ(yn,x⊤nθn). (4)

Note that appears on both sides of the equation. In the quadratic loss model we can solve exactly the implicit equation as follows:

 θn=(I+γxnx⊤n)−1(θn−1+γynxn) (5)

Implementing the procedure in Eq. (5) is fast since it is equivalent to . More generally, the implicit update in Eq. 14 can be computed efficiently through a one-dimensional root-finding procedure in many settings (Toulis et al., 2014). Previous work on implicit SGD procedures (ISGD) has shown that implicit procedures have similar asymptotic properties with standard SGD procedures with numerical stability as an added benefit. Since most related work on ISGD methods is with respect to decreasing learning rate procedures (Kulis and Bartlett, 2010; Bertsekas, 2011; Toulis et al., 2014, 2017), we provide an analysis for constant rate ISGD as in Eq. (4) in the supplementary material. We note that ISGD procedures are related to proximal updates in stochastic optimization (Parikh and Boyd, 2013; Rosasco et al., 2014; Xiao and Zhang, 2014), but these methods differ from ISGD methods in that they employ a combination of classical SGD with deterministic proximal operators, whereas ISGD’s proximal operator is purely stochastic.

The following theorem shows that the implicit update in the linear model mitigates the sensitivity of the Pflug diagnostic to the choice of the learning rate.

###### Theorem 4

Let . Under the assumptions of Theorem 3 applied on the implicit procedure in Eq. (14), it holds that

 Δimn(θ) =E(Sn+2−Sn+1|θn=θ) =aγΔn(θ)+bγ[(θ−θ⋆)⊤D(θ−θ⋆)+σ2c2],

where , .

Remarks. Theorem 4 shows that the diagnostic is more stable with the ISGD procedure rather than the classical SGD procedure. Take, for example, the case where so that . In the classical SGD setting, Theorem 3 suggests that the diagnostic increases without bound and convergence fails. In the ISGD setting, however, it holds that , and convergence may happen. In contrast, when then and so , which implies that the diagnostic behaves in the implicit procedure as it behaves in the explicit one.

## 4 Extensions and applications

In this section we consider extensions of the Pflug diagnostic to a more broad family of loss functions inspired by generalized linear models (GLMs). We also consider an application of the diagnostic to speed up convergence of SGD with constant rate.

### 4.1 Generalized linear loss

Here, we consider the loss based on the GLM formulation (McCullagh, 1984; Toulis et al., 2014) where . For example, the quadratic loss is equivalent to . The logistic loss is when is binary and . In general, cannot be chosen arbitrarily—one standard choice is to define such that is a proper density, i.e., it integrates to one. The following theorem generalizes the results in Section 3 on the quadratic loss.

###### Theorem 5

Consider the GLM loss defined as . Let and suppose that , almost surely for all . Let be two iid vectors from the distribution of . Define ; ; ; . For small enough ,

 Δglmn(θ) =E(Sn+2−Sn+1|θn=θ) ≤||C(θ,θ⋆)||2−γk[σ2c2+D2(θ,θ⋆)].

Remarks. The result in Theorem 5 has the same structure as in Theorem 3 so a direct analogy can be helpful. The terms in the two theorems are identical, if we consider that for the quadratic loss it holds that . The term in Theorem 5 corresponds to the term in Theorem 3, and corresponds to , respectively. The terms are equal when we set , in which case . Thus, the diagnostic with the more general GLM loss has familiar properties. For example, when , i.e., when SGD is near the truth, and , in which case the negative constant term dominates and the test statistic decreases in expectation. One difference with the quadratic loss, however, is that as we move farther from the statistic may change in a nonlinear way. Therefore the boundary separating the positive and negative regions of the diagnostic will generally not have the nice, smooth elliptical shape as in the quadratic loss (see Figure 1). This may lead to more complex behavior for the diagnostic, which we aim to analyze fully in future work.

We note that the constraint on is not particularly strict because in the GLM formulation is guaranteed to be positive. The assumption is made to simplify the analysis but can be improved by analyzing the quantity through existing analyses of .

### 4.2 Sgd1/2 for fast convergence

We now switch gears from analyzing the behavior of the Pflug diagnostic to using it in a practical application. Our suggested application is to use the diagnostic within a SGD loop where the learning rate is halved and the procedure restarted each time convergence is detected. We note that our goal here is to illustrate the utility of our convergence diagnostic, and a full analysis of the proposed method will be a subject of future work.

More specifically, note that by Theorem 1 procedure SGD with constant rate has linear convergence to a stationary distance from of . It would therefore be beneficial to reduce the learning rate when we know that the SGD iterates are oscillating around in a ball of radius , so that the procedure moves to a ball with a smaller radius. To implement such a procedure, however, would require knowing , and also knowing all parameters required to calculate . Our solution employs the Pflug convergence diagnostic to detect stationarity. Algorithm 2 describes such a procedure, named SGD, where the learning rate is halved upon detection of convergence (Line 10).

Note that implicit updates can be used in this algorithm as well; we call this modified algorithm ISGD. In our experiments in the following section, we employ ISGD because of the benefits in numerical stability from using implicit updates.

### 4.3 Simulated data experiments

To evaluate the effectiveness of ISGD, we compare to other classical and state-of-the-art SGD methods. We first experiment on simulated data to better understand the performance of ISGD and its competition under various parameter settings. In particular, we compare the performance of our procedure ISGD in Algorithm 2 against SVRG and classical ISGD on simulated data. The classical ISGD uses an optimized learning rate of . The basic experimental setup is as follows.

We consider settings of high and low signal to noise ration (SNR), and high and low dimension and test under the four combinations of these settings. For the high SNR case, we set , where , and for the low SNR case we set . For the high dimension case we set as the parameter dimension, and for the low dimension case we set . Given the dimension , we fix such that . We set as the size of the data set. We sample features as , where . We sample outcomes as for the normal model, and for the logistic model, where denotes the binomial random variable with mean . The learning parameters for each SGD method were tuned to provide best performance through pre-processing.

From simulations with the normal model in the left half of Figure 2 we see that ISGD attains a comparable performance to SVRG. In general, SVRG attains an overall better performance for these experiments, which we believe is related to our convergence diagnostic being aggressive in a couple of cases.

From simulations with the logistic model in the right half of Figure 2 we see that ISGD attains an even better performance than before as there are fewer cases where the diagnostic is aggressive. With high SNR and low dimension parameter settings, ISGD achieves consistently better performance than SVRG. We note that such comparisons do not take into account the sensitivity of SVRG to misspecifications of the learning rate (large enough learning rates can easily make the procedure diverge); or that SVRG requires periodic calculations over the entire data set, which here is easy because we are using only 5,000 data points, but may be a problem in more realistic settings. We also note that there are several improvements available for ISGD by allowing a larger burnin period or by discounting the learning rate less aggressively, but we plan on addressing such tuning issues in future work, both theoretically and empirically.

### 4.4 Benchmark data sets

In addition to simulated experiments we conduct experiments on benchmark data sets MNIST (binary) and COVERTYPE (binary) to evaluate real world performance.111Data sets can be found at https://archive.ics.uci.edu/ml/databases/mnist/ and https://archive.ics.uci.edu/ml/datasets/covertype, respectively. In particular, we perform binary logistic regression using ISGD, SVRG, classical ISGD, and averaged ISGD, where averaging iterates theoretically leads to optimal asymptotic errors. We plot the prediction error on a held out test set in Figure 3 relative to the number of passes over the data.

Overall, we see that ISGD convergences very quickly, after going over less than a quarter of the data, and achieves best performance in the COVERTYPE data set. We currently do not have a theoretical justification for this, but we have verified that it is consistently observed across multiple experiments. ISGD was also very stable to specifications of the learning rate parameter, as expected from the analysis of Theorem 4. In contrast, even though SVRG performed comparably to ISGD, its performance was unstable, especially in the COVERTYPE data set, and required careful fine tuning of the learning rate through trial and error. Averaged SGD performed well on the MNIST data set, but flattened out very fast in the COVERTYPE data, possibly due to misspecification.

## 5 Concluding remarks

In this paper we focused on the problem of convergence of SGD to the stationary phase. This is an important task because statistical properties of iterative stochastic procedures are better understood when they have reached stationarity. We borrowed from the theory of stopping times in stochastic approximation to develop a diagnostic that reliably detects the phase transition, as shown theoretically and empirically. Future work can focus on analysis of conditional on the diagnostic being activated. This could show theoretically that the error is uncorrelated with the initial starting point conditional on the test being activated, and so provide a theoretical analysis of the results in Table 1.

In separate work we plan to focus more on the procedure ISGD that worked very well in experiments, and to analyze theoretically its behavior. One idea is to run parallel ISGD chains and aggregate the iterates. At stationarity we expect the iterates from different chains to be uncorrelated with each other; averaging can help, in a similar way that iterate averaging helps in stochastic approximations (Ruppert, 1988).

## References

• Benveniste et al. (1990) Benveniste, A., P. Priouret, and M. Métivier (1990). Adaptive algorithms and stochastic approximations. Springer-Verlag New York, Inc.
• Bertsekas (2011) Bertsekas, D. P. (2011). Incremental proximal methods for large scale convex optimization. Mathematical programming 129(2), 163–195.
• Blum et al. (1999) Blum, A., A. Kalai, and J. Langford (1999). Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proceedings of the twelfth annual conference on Computational learning theory, pp. 203–208. ACM.
• Borkar (2008) Borkar, V. S. (2008). Stochastic approximation. Cambridge Books.
• Bottou (2010) Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186. Springer.
• Bottou (2012) Bottou, L. (2012). Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade, Volume 1, pp. 421–436.
• Bottou et al. (2016) Bottou, L., F. E. Curtis, and J. Nocedal (2016). Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838.
• Ermoliev and Wets (1988) Ermoliev, Y. M. and R.-B. Wets (1988). Numerical techniques for stochastic optimization. Springer-Verlag.
• Johnson and Zhang (2013) Johnson, R. and T. Zhang (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323.
• Kivinen et al. (2006) Kivinen, J., M. K. Warmuth, and B. Hassibi (2006). The p-norm generalization of the lms algorithm for adaptive filtering. Signal Processing, IEEE Transactions on 54(5), 1782–1793.
• Kulis and Bartlett (2010) Kulis, B. and P. L. Bartlett (2010). Implicit online learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 575–582.
• McCullagh (1984) McCullagh, P. (1984). Generalized linear models. European Journal of Operational Research 16(3), 285–292.
• Moulines and Bach (2011) Moulines, E. and F. R. Bach (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp. 451–459.
• Murata (1998) Murata, N. (1998). A statistical study of on-line learning. Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 63–92.
• Needell et al. (2014) Needell, D., R. Ward, and N. Srebro (2014). Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. In Advances in Neural Information Processing Systems, pp. 1017–1025.
• Nemirovski et al. (2009) Nemirovski, A., A. Juditsky, G. Lan, and A. Shapiro (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19(4), 1574–1609.
• Parikh and Boyd (2013) Parikh, N. and S. Boyd (2013). Proximal algorithms. Foundations and Trends in optimization 1(3), 123–231.
• Pflug (1990) Pflug, G. C. (1990). Non-asymptotic confidence bounds for stochastic approximation algorithms with constant step size. Monatshefte für Mathematik 110(3), 297–314.
• Pflug (1992) Pflug, G. C. (1992). Gradient estimates for the performance of markov chains and discrete event processes. Annals of Operations Research 39(1), 173–194.
• Robbins and Monro (1951) Robbins, H. and S. Monro (1951). A stochastic approximation method. The annals of mathematical statistics, 400–407.
• Rosasco et al. (2014) Rosasco, L., S. Villa, and B. C. Vũ (2014). Convergence of stochastic proximal gradient algorithm. arXiv preprint arXiv:1403.5074.
• Ruppert (1988) Ruppert, D. (1988). Efficient estimators from a slowly convergent robbins-monro process. Technical report, School of Operations Research and Industrial Engineering, Cornell University.
• Toulis and Airoldi (2015) Toulis, P. and E. M. Airoldi (2015). Scalable estimation strategies based on stochastic approximations: classical results and new insights. Statistics and computing 25(4), 781–795.
• Toulis et al. (2017) Toulis, P., E. M. Airoldi, et al. (2017). Asymptotic and finite-sample properties of estimators based on stochastic gradients. The Annals of Statistics 45(4), 1694–1727.
• Toulis et al. (2014) Toulis, P., J. Rennie, and E. Airoldi (2014). Statistical analysis of stochastic gradient methods for generalized linear models. In 31st International Conference on Machine Learning.
• Xiao and Zhang (2014) Xiao, L. and T. Zhang (2014). A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization 24, 2057–2075.
• Xu (2011) Xu, W. (2011). Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv preprint arXiv:1107.2490.
• Yin (1989) Yin, G. (1989). Stopping times for stochastic approximation. In Modern Optimal Control: A Conference in Honor of Solomon Lefschetz and Joseph P. LaSalle, pp. 409–420.
• Zhang (2004) Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the twenty-first international conference on Machine learning, pp. 116. ACM.

## Appendix A Proofs of theorems

###### Theorem 1 ((Moulines and Bach, 2011; Needell et al., 2014))

Under certain assumptions on the loss function, there are positive constants such that, for every , it holds that

 E(||θn−θ⋆||2)≤E(||θ0−θ⋆||2)e−Aγn+Bγ.
###### Theorem 2

Consider SGD with constant rate,

 θn=θn−1−γ∇ℓ(yn,x⊤nθn−1).

Suppose that Theorem 1 holds, so that that , for some positive and large enough . We make the following additional assumptions:

1. , where is -Lipschitz, and .

2. It holds , for any , for some positive constant , and some positive definite matrix with minimum eigenvalue .

3. It holds that .

Then,

 E(∇ℓ(yn,x⊤nθn−1)⊤∇ℓ(yn+1,x⊤n+1θn))<0.
###### Proof 2

For brevity let be the stochastic gradient at iteration .

 E(~ℓ⊤i−1~ℓi) =E[(fi−1+ei−1)⊤(fi+ei)]=E[(fi−1+ei−1)⊤fi] %[because$ei$arezero−mean] =E[(fi−1+ei−1)⊤f(θi−1−γfi−1−γei−1)] [\small{% by SGD step for θi }] ≤E(||fi−1||2)−γK⋅E[(fi−1+ei−1)⊤C(fi−1+ei−1)] [% \small{ by Assumption~{}(b) }] ≤(1−γμK)E(||fi−1)||2)−γK⋅E(||ei−1||2C)+O(γ2) ≤(1−γμK)L2E(||θi−1−θ⋆||2)−γμKτ2+O(γ2) [\small{ % by Lipschitz Assumption~{}(a) }] ≤γ[(1−γμK)L2M−μKτ2]+O(γ2) <0. [\small{ by Assumption~{}(c) and % small enough γ }] (6)

Remarks. Assumption (b) is a form of strong convexity. For example, suppose that , then and . In this case is the Fisher information matrix and Assumption (b) holds for . When is small enough and a Taylor approximation of is possible, the above result still holds for when the Fisher information exists. Assumption (c) shows that there is a threshold value for below which the diagnostic cannot terminate. For example, suppose that error noise is small so that and , as argued before. Then, , that is, the learning rate has to exceed the reciprocal of the minimum eigenvalue of the Fisher information matrix.

###### Theorem 3

Suppose that the loss is quadratic, . Let and be two iid vectors from the distribution of , and define: ; ; ; , and suppose that all such constants are finite. Then, for ,

 Δn(θ) =E(Sn+2−Sn+1|θn=θ) =(θ−θ⋆)⊤(C−γD)(θ−θ⋆)−γc2σ2.
###### Proof 3

For notational brevity we make the following definitions:

 θ+ =θ+γ(y1−x⊤1θ)x1 θ++ =θ++γ(y2−x⊤2θ+)x2, (7)

where is the current iterate, and and are the next two using iid data and . For a fixed we understand the Pflug diagnostic through the function

 H(θ) =S++−S+|θ=∇++ℓ⊤∇+ℓ=(θ+−θ)⊤(θ++−θ+)/γ2 (8) and Δn(θ) =E(H(θ))=E((θ+−θ)⊤(θ++−θ+)/γ2). (9)

We use Eq. (3) to derive an expression for :

 H(θ) =(y1−x⊤1θ)(y2−x⊤2θ+)x⊤1x2 =(y1−x⊤1θ)[y2−x⊤2θ−γ(y1−x⊤1θ)x⊤1x2]x⊤1x2 =(y1−x⊤1θ)(y2−x⊤2θ)x⊤1x2−γ(y1−x⊤1θ)2(x⊤1x2)2. (10)

Let ; we know that . Now, we analyze each term individually:

 (y1−x⊤1θ)(y2−x⊤2θ)x⊤1x2 =[x⊤1(θ⋆−θ)+ε1][x⊤2(θ⋆−θ)+ε2]x⊤1x2 =(θ−θ⋆)⊺x1x⊤2(x⊤1x2)(θ−θ⋆)+ε1W(1)+ε2W(2)+ε1ε2W(3).

The variables are conditionally independent of and so using the law of iterated expectations these terms vanish.

 E((y1−x⊤1θ)(y2−x⊤2θ)x⊤1x2)=(θ−θ⋆)⊺E(x1x⊤2(x⊤1x2))(θ−θ⋆)=(θ−θ⋆)⊺C(θ−θ⋆).

Using a similar reasoning, for the second term we have:

 (y1−x⊤1θ)2(x⊤1x2)2= [(x⊤1(θ⋆−θ)+ε1]2(x⊤1x2)2 = (θ−θ⋆)⊺x1x⊤1(x⊤1x2)2(θ−θ⋆)+ε1W(4)+ε21(x⊤1x2)2. (11)

In expectation of Eq. (3),

 E((y1−x⊤1θ)2(x⊤1x2)2) =(θ−θ⋆)⊺E(x1x⊤1(x⊤1x2)2)(θ−θ⋆)+ε21(x⊤1x2)2 =(θ−θ⋆)⊺D(θ−θ⋆)+σ2c2. (12)

By combining all results we finally get:

 Δn(θ)=(θ−θ⋆)⊺(C−γD)(θ−θ⋆)−γσ2c2.
###### Theorem 4

Let . Under the assumptions of Theorem 3 applied on the implicit procedure in Eq. (14), it holds that

 Δimn(θ) =E(Sn+2−Sn+1|θn=θ) =aγΔn(θ)+bγ[(θ−θ⋆)⊤D(θ−θ⋆)+σ2c2],

where , .

###### Proof 4

We derive similar theoretical results for under the linear normal model for implicit updates. We have the implicit updates

 θ+ =θ+γ(y1−x⊤1θ+)x1 θ++ =θ++γ(y2−x⊤2θ++)x2

Also note the collinearity

 (y1−x⊤1θ+) =λ1(y1−x⊤1θ) (y2−x⊤2θ++) =λ2(y2−x⊤2θ+), =λ2[y2−x⊤2θ−γλ1(y1−x⊤1θ)x⊤1x2],

where and . We derive an expression for , with implicit updates:

 Him(θ) =(θ+−θ)⊤(θ++−θ+)/