Asymptotic and finite-sample properties of estimators based on stochastic gradients

Panagiotis (Panos) Toulis is an Assistant Professor of Econometrics and Statistics at the University of Chicago, Booth School of Business (panos.toulis@chicagobooth.edu). Edoardo M. Airoldi is an Associate Professor of Statistics at Harvard University (airoldi@fas.harvard.edu). The authors wish to thank Joe Blitzstein, Leon Bottou, Bob Carpenter, David Dunson, Andrew Gelman, Brian Kulis, Xiao-Li Meng, Natesh Pillai and Neil Shephard for useful comments and discussion. We are grateful to four reviewers, the Associate Editor, and the Editor for offering suggestions that improved the paper. This research was sponsored, in part, by NSF CAREER award IIS-1149662, ARO MURI award W911NF-11-1-0036, and ONR YIP award N00014-14-1-0485.

Abstract

Stochastic gradient descent procedures have gained popularity for parameter estimation from large data sets. However, their statistical properties are not well understood in theory, and in practice avoiding numerical instability requires careful tuning of key parameters. Here, we introduce implicit stochastic gradient descent procedures, which involve parameter updates that are implicitly defined. Intuitively, implicit updates shrink standard stochastic gradient descent updates. The amount of shrinkage depends on the observed Fisher information matrix, which does not need to be explicitly computed; thus, implicit procedures increase stability without increasing the computational burden. Our theoretical analysis provides the first full characterization of the asymptotic behavior of both standard and implicit stochastic gradient descent-based estimators, including finite-sample error bounds. Importantly, analytical expressions for the variances of these stochastic gradient-based estimators reveal their exact loss of efficiency. We also develop new algorithms to compute implicit stochastic gradient descent-based estimators in practice for generalized linear models, Cox proportional hazards models, and M-estimators, and perform extensive experiments. Our results suggest that implicit stochastic gradient descent procedures are poised to become a workhorse for approximate inference from large data sets.

Keywords: Stochastic Approximation; Implicit Updates; Asymptotic Variance; Generalized Linear Models; Cox Proportional Hazards; M-estimation; Maximum Likelihood; Statistical Efficiency; Numerical Stability

1 Introduction

Parameter estimation by optimization of an objective function is a fundamental idea in statistics and machine learning [28]. However, classical procedures, such as Fisher scoring, the EM algorithm or iteratively reweighted least squares [27], do not scale to modern data sets with millions of data points and hundreds or thousands of parameters [53].

In particular, suppose we want to estimate the true parameter of a distribution from i.i.d. data points , such that, conditional on covariate , outcome is distributed according to . Such estimation problems often reduce to optimization problems. For instance, the maximum likelihood estimator (MLE) is obtained by solving . Classical optimization procedures, such as Newton-Raphson or Fisher scoring, have a runtime complexity that ranges between and , in the best and worst case, respectively [40]. Quasi-Newton (QN) procedures are the only viable alternative in practice because they have complexity per iteration, or in certain favorable cases [33]. However, estimation from large data sets requires an even better runtime complexity that is roughly , i.e., linear in data size but sublinear in parameter dimension . The first requirement, on , is generally unavoidable because, under the i.i.d. assumption, every data point carries information. Sublinearity in is therefore critical.

Such requirements have recently generated interest in stochastic optimization procedures, especially those relying only on first-order information, i.e., gradients. Perhaps the most popular procedure in this family is stochastic gradient descent (SGD), defined for , as

where is the learning rate sequence, typically defined as , is the learning rate parameter, , and are positive-definite matrices, also known as condition matrices.

Stochastic optimization procedures of this kind are special cases of stochastic approximation [62], where the estimation problem is not formulated as an optimization problem but more generally as a characteristic equation. Early research considered a streaming data setting—akin to a superpopulation setting—where the characteristic equation is

with the expectation being over the true conditional distribution of outcome given covariate . More recent research, largely in computer science and optimization, considers a finite N setting with characteristic equation

where the expectation is over the empirical distribution of in the finite data set. In both settings, SGD of Eq. is well-defined: in the finite population setting of Eq. the data point is a random sample with replacement from the finite data set; in the infinite population setting of Eq. the data point is simply the th data point in the stream.

From a computational perspective, SGD in Eq. is appealing because it avoids expensive matrix inversions, as in Newton-Raphson, and the log-likelihood is evaluated at a single data point rather than on the entire data set. From a theoretical perspective, SGD in Eq. converges, under suitable conditions, to where [6]. This condition covers both Eq. and Eq. , implying that SGD can be used in both finite and infinite population settings. For the rest of this paper we assume an infinite population setting, as it is the most natural setting for stochastic approximations. The main difference between the streaming data setting studied in the computer science and optimization literature and the infinite population setting we consider here is that we do not condition on the observed ordering of data points, but condition on a random ordering instead. Moreover, most of the theoretical results presented in this paper for the infinite population case can be applied to the finite population case, where instead of estimating we estimate the MLE, or the maximum a posteriori (MAP) estimate if there is regularization.

In this paper, we introduce implicit stochastic gradient descent procedures—implicit SGD for short—defined as

where are defined as in standard SGD in Eq. . Furthermore, we provide a theoretical analysis of estimators based on stochastic gradients, for both implicit and standard procedures. To distinguish the two procedures, we will refer to standard SGD in Eq. as SGD with explicit updates, or explicit SGD for short, because the next iterate can be immediately computed given and the data point . In contrast, the update in Eq. is implicit because the next iterate appears on both sides of the equation.

1.1 Illustrative example

Here, we motivate the main results of this paper by comparing implicit and explicit SGD. Let be the true parameter of a normal model with i.i.d. observations , where the variance is assumed known for simplicity. The log-likelihood is , and the score function (i.e., the gradient of the log-likelihood) is . Let be distributed according to some unknown distribution with bounded second moment. Assume , for some , as the learning rate. Then, the explicit SGD procedure in Eq. is

Procedure is the least mean squares (LMS) filter in signal processing, also known as the Widrow-Hoff algorithm [82]. In this problem, the implicit SGD procedure can be derived in closed form from the update in Eq. as

The procedure defined by Eq. is known as the normalized least mean squares filter (NLMS) in signal processing [52].

From Eq. we see that it is crucial for explicit SGD to have a well-specified learning rate parameter . For instance, if then will diverge to a value on the order of before converging to the true value (see Section 2.5, Lemma ?). In contrast, implicit SGD is more stable to misspecification of the learning rate parameter . For example, a very large will not cause divergence as in explicit SGD, but will simply put more weight on the th observation than on the previous iterate . Assuming for simplicity , it also holds that , showing that implicit SGD iterates are shrunken versions of the explicit ones (see also Section 5).
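To make this contrast concrete, the following R sketch (ours; the dimension, learning rate, and seed are arbitrary choices) runs the LMS update of Eq. and the closed-form NLMS update of Eq. side by side:

\begin{verbatim}
# Minimal sketch (ours): LMS (explicit SGD) vs. NLMS (implicit SGD)
# on a normal linear model, with learning rate gamma_n = gamma1 / n.
set.seed(1)
N <- 1000; p <- 5; gamma1 <- 20  # deliberately large gamma1
theta_star <- rnorm(p)
theta_lms <- theta_nlms <- rep(0, p)
for (n in 1:N) {
  x <- rnorm(p); y <- sum(x * theta_star) + rnorm(1)
  g <- gamma1 / n
  # LMS / explicit: full gradient step
  theta_lms <- theta_lms + g * (y - sum(x * theta_lms)) * x
  # NLMS / implicit: the same step shrunk by 1 / (1 + g * ||x||^2)
  theta_nlms <- theta_nlms +
    (g / (1 + g * sum(x^2))) * (y - sum(x * theta_nlms)) * x
}
c(lms = sum((theta_lms - theta_star)^2),
  nlms = sum((theta_nlms - theta_star)^2))
\end{verbatim}

With this large learning rate parameter, the LMS iterates make an enormous early excursion before eventually contracting, whereas the NLMS iterates remain bounded throughout, since the shrinkage factor keeps every update moderate.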

Let ; then, by Theorem ?, the asymptotic variance of (and of ) satisfies if . Since , it is best to set . In this case . Implicit SGD can thus be optimal by setting , in which case is exactly the OLS estimator, and is an approximate but more stable version of the OLS estimator. Thus, the implicit SGD estimator in Eq. inherits the efficiency properties of , with the added benefit of being stable over a wide range of learning rates. Overall, implicit SGD is a superior form of SGD.

1.2 Related work

Historically, the duo of explicit-implicit updates originates from the numerical methods introduced by Euler (ca. 1770) for solving ordinary differential equations [34]. The explicit SGD procedure was first proposed by Sakrison [67] as a recursive statistical estimation method, and it is theoretically based on the stochastic approximation method of [62]. Statistical estimation with explicit SGD is a straightforward generalization of Sakrison's method and has recently attracted attention in the machine learning community as a fast learning method for large-scale problems [86]. Applications of explicit SGD procedures to massive data problems can be found in many diverse areas, such as large-scale machine learning [86], the online EM algorithm [14], image analysis and deep learning [19], and MCMC sampling [81].

The implicit SGD procedure is less well known and not well understood. In optimization, implicit methods have recently attracted attention under the guise of proximal methods, such as mirror descent [56]. In fact, the implicit SGD update in Eq. can be expressed as a proximal update:
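In its standard form (our reconstruction, taking identity condition matrices for simplicity, consistent with the Bayesian interpretation below), this update reads
\[
\theta_n = \arg\max_{\theta} \Big\{ \log f(y_n; x_n, \theta) \;-\; \frac{1}{2\gamma_n} \|\theta - \theta_{n-1}\|^2 \Big\}.
\]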

From a Bayesian perspective, is the posterior mode of a model with the standard multivariate normal as the prior, and as the log-likelihood of for observation . Arguably, the normalized least mean squares (NLMS) filter [52], introduced in Eq. , was the first statistical procedure to use an implicit update as in Eq. , and it was shown to be consistent and robust under excessive input noise [73]. From an optimization perspective, the update in Eq. corresponds to a stochastic version of the proximal point algorithm of [63], which has been generalized through the idea of splitting algorithms [44]; see also the comprehensive review of proximal methods in optimization by [58]. Additional intuition for implicit methods has been provided by [39] and [55], who argued that proximal methods can better fit the geometry of the parameter space. [7] derived the convergence rate of an implicit procedure similar to Eq. on a fixed data set, and compared the rates between procedures that randomly sample data points and those that simply cycle through them. [78] derived the asymptotic variance of as an estimator of in the family of generalized linear models, and provided an algorithm to efficiently compute the implicit update of Eq. in such models in the simplified setting where .

1.3 Contributions

Prior work on procedures similar to implicit SGD has mostly considered an optimization setting, in which the focus is on speed of convergence [7]. Instead, we focus on statistical efficiency, that is, the sampling variability of the estimators implied by implicit and explicit SGD procedures; the relevant analysis and the results of Theorem ? and Theorem ? are novel. Furthermore, our procedure, which we generalized in [75], differs from typical stochastic proximal gradient procedures [23]. In such procedures the parameter updates are obtained by combining a stochastic explicit update and a deterministic implicit update. In implicit SGD there is a single stochastic implicit update, which prevents numerical instability.

With regard to theoretical contributions, the asymptotic statistical efficiency of SGD procedures (both explicit and implicit) derived in Theorem ? is a key contribution of our work. Our analysis is in fact general enough to allow us to derive the asymptotic efficiency of other popular stochastic optimization procedures, notably of AdaGrad [24] in Eq. of our paper. The asymptotic normality of implicit SGD in Theorem ? is new and enables a novel comparison of explicit and implicit SGD in terms of the normality of their iterates, which is also a clear point of departure from the typical optimization literature. The results in Section 2.5 are also new, and formalize the advantages of implicit SGD over explicit SGD in terms of numerical stability.

With regard to practical contributions, Algorithm ? and its variants presented in the paper are a significant extension of our earlier work beyond first-order GLMs [78]. The key contribution here is that these new algorithms make implicit SGD as simple to implement as standard explicit SGD, whenever the fixed-point computation of the implicit update is feasible. We provide extensive applications of implicit SGD in Section 3 and experiments comparing it with explicit SGD in Section 4. Importantly, we implemented implicit SGD in the R package sgd [80] (available at https://cran.r-project.org/web/packages/sgd/index.html) to compare implicit SGD with state-of-the-art procedures, including R's glm() function, the biglm package, the elastic net [29], AdaGrad [24], Prox-SVRG [84], and Prox-SAG [69].

2 Theory

The norm denotes the norm. If a positive scalar sequence is nonincreasing and , we write . For two positive scalar sequences , equation denotes that is bounded above by , i.e., there exists a fixed such that , for all . Furthermore, denotes that . Similarly, for a sequence of vectors (or matrices) , we write to denote , and write to denote . For two positive definite matrices we write to express that is positive definite. The set of eigenvalues of a matrix is denoted by ; thus, if and only if for every .

Remarks. Assumption is typical in stochastic approximation as it implies that and , as posited by [62]. Assumption narrows our focus to models for which the likelihood depends on the parameters through the linear combination . This family of models is large and includes generalized linear models, Cox proportional hazards models, and M-estimation. Furthermore, in Section 5 we discuss a significant relaxation of Assumption . Assumption puts a Lipschitz condition on the log-likelihood, but it is used only for deriving the finite-sample error bounds in Theorem ?; it is possible that this condition can be relaxed. Assumption is equivalent to assuming strong convexity of the negative log-likelihood, which is typical for proving convergence in probability. The assumption on the observed Fisher information is less standard; intuitively, it posits that a minimum of statistical information is received from any data point, at least for certain model parameters. Making this assumption allows us to forgo the boundedness assumptions on the errors of stochastic gradients that were originally used by [62] and have since been standard. Finally, Assumption posits the typical Lindeberg conditions that are necessary to invoke the central limit theorem and prove asymptotic normality; this assumption follows the conditions defined by [25] for the normality of explicit SGD procedures.

2.1 Finite-sample error bounds

Here, we derive bounds for the errors on a finite sample of fixed size .

Not surprisingly, implicit SGD in Eq. matches the asymptotic rate of explicit SGD in Eq. . In particular, the iterates have squared error with rate , as seen in Theorem ?, which is identical to the error rate of the explicit iterates [6]. One way to explain this similarity in convergence rates intuitively is to assume that both explicit and implicit SGD are at the same estimate . Then, using the definitions in Eq. and in Eq. , a Taylor approximation of the gradient yields
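In symbols (our reconstruction, writing $\hat{\mathcal{I}}_n(\theta) = -\nabla^2 \log f(y_n; x_n, \theta)$ for the observed Fisher information):
\[
\theta_n^{\mathrm{sgd}} = \theta_{n-1} + \gamma_n \nabla \log f(y_n; x_n, \theta_{n-1}),
\qquad
\theta_n^{\mathrm{im}} \approx \theta_{n-1} + \gamma_n \big(I + \gamma_n \hat{\mathcal{I}}_n(\theta_{n-1})\big)^{-1} \nabla \log f(y_n; x_n, \theta_{n-1}),
\]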

where and . Therefore, as , we have , and the two procedures coincide.

Despite the similarity in convergence rates, the critical advantage of implicit SGD, and of implicit procedures more generally, is their robustness to initial conditions and excess noise. This can be seen in Theorem ?, where the implicit procedure discounts the initial conditions at an exponential rate through the term . Importantly, the discounting of initial conditions happens regardless of the specification of the learning rate. In fact, large values of can lead to faster discounting, and thus possibly to faster convergence, albeit at the expense of increased variance, as implied by Theorem ?, which is presented in the following section. The implicit iterates are therefore unconditionally stable, i.e., virtually any specification of the learning rate will lead to a stable discounting of the initial conditions.

In contrast, explicit SGD is known to be very sensitive to the learning rate, and can numerically diverge if the rate is misspecified. For example, [51] showed that there exists a term , where is a Lipschitz constant for the gradient of the log-likelihood, amplifying the initial conditions of explicit SGD, which can be catastrophic if the learning rate parameter is misspecified. Thus, although implicit and explicit SGD have identical asymptotic performance, they differ crucially in their stability properties. This is investigated further in Section 2.5 and in the experiments of Section 4.

2.2 Asymptotic variance and optimal learning rates

In the previous section we showed that in quadratic mean, i.e., the implicit SGD iterates converge to the true model parameters , similar to classical results for the explicit SGD iterates . Thus, and are consistent estimators of . In the following theorem we show that both SGD estimators have the same asymptotic variance.

Remarks. Although the implicit SGD estimator is significantly more stable than the explicit estimator (Theorem ?), both estimators have the same asymptotic efficiency in the limit according to Theorem ?. This implies that implicit SGD is a superior form of SGD, and should be preferred whenever the calculation of the implicit updates in Eq. is computationally feasible. In Section 3 we show that this is possible in a large family of statistical models, and we illustrate this with several numerical experiments in Section 4.1.

Asymptotic variance results in stochastic approximation similar to Theorem ? were first obtained by [16] and [66], and later by [25], [60], and several other authors [45]. We contribute to this literature in two important ways. First, our asymptotic variance result includes implicit SGD, which is a stochastic approximation procedure with implicitly defined updates, whereas other works consider only explicit updates. Second, in our setting we recursively estimate the true parameters of a statistical model, and thus we can exploit the typical regularity conditions of Assumption to derive the asymptotic variance of (and ) in simplified closed form. We illustrate the asymptotic variance results of Theorem ? in Section ?.

Optimal learning rates

Crucially, the asymptotic variance formula of Theorem ? depends on the limit of the sequence used in the SGD procedures of Eq. and Eq. . We distinguish two classes of procedures, one where , known as first-order procedures, and a second class where is not trivial, known as second-order procedures.

In first-order procedures, only gradients are used. Inevitably, no matter how we set the learning rate parameter , first-order SGD procedures lose statistical efficiency. We can verify this immediately by comparing the asymptotic variance in Theorem ? with the asymptotic variance of the maximum likelihood estimator (MLE), denoted by , on a data set with data points , . Under the regularity conditions of Assumption , the MLE is the asymptotically optimal unbiased estimator and . By Theorem ? and the convergence of implicit SGD, it holds that , which also holds for . For any we have as an identity that

The proof is rather quick if we consider and note that is the corresponding eigenvalue of the left-hand matrix in Ineq. , that is the eigenvalue of , and that implies

for every . Therefore, both SGD estimators lose information and this loss can be quantified exactly by Ineq. . This inequality can also be leveraged to find the optimal choice for given an appropriate objective. As demonstrated in the experiments in Section 4, this often suffices to achieve estimates that are comparable with MLE in statistical efficiency but with substantial computational gains. One reasonable objective is to minimize the trace of the asymptotic variance matrix, i.e., to set equal to

Eq. is defined under the constraint because Theorem ? requires to be positive definite.

Of course, the eigenvalues are unknown in practice and need to be estimated from the data. This problem has received significant attention recently, and several methods exist [37]. We use Eq. extensively in our experiments (Section 4) to tune the SGD procedures. However, we note that in first-order SGD procedures, knowing the eigenvalues of does not necessarily achieve statistical efficiency because of the spectral gap of , i.e., the ratio between its maximum eigenvalue and minimum eigenvalue ; for instance, if , then the choice of learning rate parameter according to Eq. leads to statistically efficient first-order SGD procedures. However, this case is not typical in practice, especially in high dimensions.
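As a concrete illustration, the trace objective can be minimized numerically. The following R sketch is ours (the function name, search width, and example spectrum are arbitrary); it assumes the first-order asymptotic variance of Theorem ? with identity condition matrices, whose trace is \(\sum_i \gamma^2 \lambda_i / (2\gamma\lambda_i - 1)\) over the eigenvalues \(\lambda_i\) of the Fisher information:

\begin{verbatim}
# Minimal sketch (ours): choose gamma1 by minimizing the trace of the
# first-order SGD asymptotic variance, subject to the feasibility
# constraint 2 * gamma * min(lambda) > 1.
optimal_gamma1 <- function(lambda, width = 100) {
  trace_var <- function(g) sum(g^2 * lambda / (2 * g * lambda - 1))
  lower <- 1 / (2 * min(lambda)) + 1e-8
  optimize(trace_var, interval = c(lower, lower + width))$minimum
}
optimal_gamma1(c(0.5, 1, 2, 5))  # example (estimated) spectrum
\end{verbatim}

As a sanity check, when all eigenvalues equal a common value \(\lambda\), the minimizer is \(\gamma_1 = 1/\lambda\), which attains the value \(1/\lambda\) in each direction, recovering the statistically efficient case mentioned above.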

In second-order procedures, we assume non-trivial condition matrices . Such procedures are called second-order because they usually leverage curvature information from the Fisher information matrix (or the Hessian of the log-likelihood). They are also known as adaptive procedures because they adapt their hyperparameters, i.e., learning rates or condition matrices , according to the observed data. For instance, let and . Plugging these into Theorem ?, the normalized asymptotic variance of the SGD estimators is
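With these choices, the variance formula of Theorem ? reduces to (our reconstruction, writing \(\mathcal{I}(\theta_\star)\) for the Fisher information)
\[
N \, \mathrm{Var}(\theta_N) \;\to\; \mathcal{I}(\theta_\star)^{-1},
\]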

which is the theoretically optimal asymptotic variance of the MLE, i.e., the Cramér-Rao lower bound.

Therefore, to achieve asymptotic efficiency, second-order procedures need to estimate the Fisher information matrix at . Because is unknown, one can simply use (or ) as an iterative estimate of , and the same optimality result holds. This approach in second-order explicit SGD was first studied by [67], and subsequently by [57]. It was later extended by [26] and several other authors. Notably, [1] refers to the direction as the "natural gradient" and uses information-geometry arguments to prove statistical optimality.

An alternative way to implement second-order procedures is to use stochastic approximation to estimate , in addition to the approximation procedure estimating . For example, [2] proposed the following second-order procedure,

where is a learning rate sequence separate from . By standard stochastic approximation, converges to , and thus the procedure in Eq. is asymptotically optimal. However, there are two important problems with this procedure. First, it is computationally costly because of the matrix inversions. A faster way is to apply quasi-Newton ideas: SGD-QN, developed by [9], is such a procedure, in which the expensive matrix computations are replaced by the secant condition. Second, the stochastic approximation of is usually very noisy in high-dimensional problems, and this affects the main approximation for . Recently, more robust variants of SGD-QN have been proposed [13].

Another notable adaptive procedure is AdaGrad [24], which is defined as
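A standard statement of the diagonal AdaGrad update (our reconstruction, consistent with the description below) is
\[
\theta_n = \theta_{n-1} + \gamma \, C_n \nabla \log f(y_n; x_n, \theta_{n-1}),
\qquad
C_n = \mathrm{diag}\Big(\textstyle\sum_{i=1}^{n} \nabla \log f(y_i; x_i, \theta_{i-1})\, \nabla \log f(y_i; x_i, \theta_{i-1})^\intercal\Big)^{-1/2},
\]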

where takes the diagonal of its matrix argument, and the learning rate is set to a constant . AdaGrad can be considered a second-order procedure because it tries to approximate the Fisher information matrix; however, it uses only gradient information, so technically it is first-order. Under appropriate conditions, a simple modification of the proof of Theorem ? shows that the asymptotic variance of the AdaGrad estimate is given by

This result reveals an interesting trade-off achieved by AdaGrad and a subtle contrast to first-order SGD procedures. The asymptotic variance of AdaGrad is , which indicates a significant loss of information. However, this rate is attained regardless of the specification of the learning rate parameter . In contrast, as shown in Theorem ?, first-order SGD procedures require in order to achieve the rate, and the rate is significantly worse if this condition is not met. For instance, [55] give an example of misspecification of where the rate of first-order explicit SGD is , and can be arbitrarily small. The variance result in Eq. is illustrated in the numerical experiments of Section ?.

2.3 Optimality with averaging

As shown in Section ?, Theorem ? implies that first-order SGD procedures can be statistically inefficient, especially in high dimensions. One surprisingly simple idea to achieve statistical efficiency is to combine larger learning rates with averaging of the iterates. In particular, we consider the procedure

where are the typical implicit SGD iterates in Eq. , and , . Under suitable conditions, the iterates are asymptotically efficient. This is formalized in the following theorem.
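In code, averaging is a one-line addition to any SGD loop. The following R sketch (ours; the parameter values are arbitrary) averages implicit SGD iterates for the normal linear model of Section 1.1, using a larger rate \(\gamma_n = \gamma_1 n^{-\alpha}\) with \(\alpha \in (1/2, 1)\):

\begin{verbatim}
# Minimal sketch (ours): Polyak-Ruppert averaging on top of implicit SGD
# (the NLMS update of Section 1.1); theta_bar is the averaged estimator.
set.seed(2)
N <- 10000; p <- 5; gamma1 <- 1; alpha <- 2/3
theta_star <- rnorm(p)
theta <- rep(0, p); theta_bar <- rep(0, p)
for (n in 1:N) {
  x <- rnorm(p); y <- sum(x * theta_star) + rnorm(1)
  g <- gamma1 * n^(-alpha)
  theta <- theta + (g / (1 + g * sum(x^2))) * (y - sum(x * theta)) * x
  theta_bar <- theta_bar + (theta - theta_bar) / n  # running average
}
\end{verbatim}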

Remarks. In the context of explicit stochastic approximations, averaging was first proposed and analyzed by [65] and [4]. [65] argued that larger learning rates in stochastic approximation decorrelate the iterates, allowing averaging to improve efficiency. [59] expanded the scope of averaging by proving asymptotic optimality in more general explicit stochastic approximations operating under suitable conditions similar to Theorem ?. [59] thus proved that slowly converging stochastic approximations can be improved by using larger learning rates and averaging the iterates. Recent work has analyzed explicit updates with averaging [86], and has shown their superiority in numerous learning tasks. More recently, [79] derived finite-sample error bounds for the averaged implicit SGD estimator.

2.4 Asymptotic normality

Asymptotic distributions, or more generally invariance principles, are well-studied in classical stochastic approximation [45]. In this section we leverage Fabian’s theorem [25] to show that iterates from implicit SGD are asymptotically normal.

Remarks. The combined results of Theorems ?, ?, and ? indicate that implicit SGD is numerically stable and has known asymptotic variance and distribution. Therefore, contrary to explicit SGD, which has severe stability issues, implicit SGD emerges as a stable estimation procedure with known standard errors, which enables typical statistical tasks such as confidence intervals, hypothesis testing, and model checking. We show empirical evidence supporting this claim in Section ?.

2.5 Stability

To illustrate the stability, or lack thereof, of both SGD estimators in small-to-moderate samples, we simplify the SGD procedures and inspect the size of the biases and . In particular, based on Theorem ?, we simply assume the Taylor expansion ; to simplify further we ignore the remainder term .

Under this simplification, the SGD procedures in Eq. and in Eq. can be written as follows:

where , , and denotes the initial bias of the two procedures from a common starting point . Thus, the matrices and describe how fast the initial bias decays for the explicit and implicit SGD respectively. In the limit, and [77], and thus both methods are asymptotically stable.

However, the explicit procedure has significant stability issues in small-to-moderate samples. By inspection of Eq. , the magnitude of is dominated by , the maximum eigenvalue of . Furthermore, the rate of convergence is dominated by , the minimum eigenvalue of . For stability, it is desirable that , for all eigenvalues . This implies the requirement for stability. Furthermore, Theorem ? implies the requirement for fast convergence. This is problematic in high-dimensional settings because is typically orders of magnitude larger than . Thus, the requirements for stability and speed of convergence are in conflict in explicit procedures: to ensure stability we need a small learning rate parameter , thus paying a high price in convergence, which will be on the order of , and vice versa.

In contrast, the implicit procedure is unconditionally stable. The eigenvalues of are . Critically, it is no longer required to have a small for stability because the eigenvalues of are always less than one. We summarize these findings in the following lemma.
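A quick numeric illustration of this contrast (ours; the parameter values are arbitrary) tracks the bias factor along a single eigen-direction with eigenvalue lambda:

\begin{verbatim}
# Minimal sketch (ours): transient amplification of the initial bias.
# With gamma_n = gamma1 / n, explicit SGD scales the bias along an
# eigen-direction by cumprod(1 - gamma_n * lambda), while implicit SGD
# scales it by cumprod(1 / (1 + gamma_n * lambda)), which never exceeds 1.
bias_path <- function(gamma1, lambda, N = 200, implicit = FALSE) {
  g <- gamma1 / seq_len(N)
  if (implicit) cumprod(1 / (1 + g * lambda)) else cumprod(1 - g * lambda)
}
max(abs(bias_path(10, 5)))                   # explicit: enormous transient
max(abs(bias_path(10, 5, implicit = TRUE)))  # implicit: bounded by 1
\end{verbatim}

Both paths eventually shrink to zero, but the explicit path first grows by many orders of magnitude, which is exactly the amplification quantified in the lemma below.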

Remarks. Lemma ? shows that in the explicit SGD procedure the effect of the initial bias can be amplified in an arbitrarily large way before fading out, if the learning rate is misspecified (i.e., if ). This sensitivity of explicit SGD is well known, and avoiding it in practice requires problem-specific interventions, e.g., preprocessing, small-sample tests, projections, or truncation [15]. In fact, there is a large and still-growing body of work on designing learning rates to stabilize explicit SGD; see, for example, the review by [30]. Implicit procedures render such ad-hoc designs obsolete because they remain stable regardless of the learning rate design, while maintaining the asymptotic convergence and efficiency properties of explicit SGD.

3 Applications

Here, we show how to apply implicit SGD in Eq. for estimation in generalized linear models, Cox proportional hazards, and more general M-estimation problems. We start by developing an algorithm that efficiently computes the implicit update in Eq. , and is applicable to all aforementioned models.

3.1 Efficient computation of implicit updates

The main difficulty in applying implicit SGD is the solution of the multidimensional fixed-point equation . In a large family of models where the likelihood depends on the parameter only through the natural parameter , the solution of the fixed-point equation is feasible and computationally efficient. We prove the general result in Theorem ?.

For the rest of this section we will treat as a function of the natural parameter for a fixed outcome . Thus, will refer to the first derivative of with respect to with fixed .

Remarks. Theorem ? implies that computing the implicit update in Eq. reduces to numerically solving the one-dimensional fixed-point equation for —this idea is implemented in Algorithm ?. As shown in the proof of Theorem ?, this implementation is fast because lies on an interval of size .
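To sketch why the reduction holds (our notation): when the log-likelihood depends on \(\theta\) only through the natural parameter \(x^\intercal\theta\), its gradient is \(\ell'(x^\intercal\theta;\, y)\, x\), so the implicit iterate can move only along \(x_n\). Writing \(\theta_n = \theta_{n-1} + r_n x_n\), the vector fixed-point equation collapses to the scalar one
\[
r_n = \gamma_n\, \ell'\big(x_n^\intercal \theta_{n-1} + r_n \|x_n\|^2;\; y_n\big),
\]
which can be solved by a one-dimensional search over the bracketing interval noted above.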

We also note that Theorem ? can be readily extended to cases with linearly separable regularizers, for instance, regularizers using the norm . In such cases, there are additional fixed-point equations as in Step 9 of Algorithm ? that involve the components of the regularizer. More generally, for families of models that do not satisfy Assumption there are methods to approximately perform the implicit update—we discuss one such method in Section 3.3.

3.2 Generalized linear models

In this section, we apply implicit SGD to estimate generalized linear models (GLMs). In such models, follows an exponential family distribution conditional on , and , where is the transfer function of the GLM [54]. Furthermore, the gradient of the GLM log-likelihood for parameter value at data point is given by

The conditional variance of is , and thus the Fisher information matrix is . Thus, the SGD procedures in Eq. and in Eq. can be written as

Implementation of explicit SGD is straightforward. Implicit SGD can be implemented through Algorithm ?. In particular, with . In typical GLMs is twice-differentiable and also because it is proportional to the conditional variance of given , thus fulfilling Assumption . In the simplified case where , the identity matrix, for all , Algorithm ? simplifies to Algorithm ?, which was first derived by [78].
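A self-contained R sketch of this computation is given below (ours, not the sgd package interface; implicit_sgd_glm is our name). It solves the scalar fixed-point equation with uniroot, using the fact that the solution is bracketed between zero and the explicit-update step size:

\begin{verbatim}
# Minimal sketch (ours): implicit SGD for a GLM with transfer function h,
# solving r = g * (y - h(eta + r * ||x||^2)) at each step; the root is
# bracketed by 0 and g * (y - h(eta)) because h is nondecreasing.
implicit_sgd_glm <- function(X, y, h, gamma = function(n) 1 / n) {
  theta <- rep(0, ncol(X))
  for (n in seq_len(nrow(X))) {
    x <- X[n, ]
    g <- gamma(n)
    eta <- sum(x * theta)
    b <- g * (y[n] - h(eta))   # explicit step size; bracket endpoint
    if (b == 0) next
    f <- function(r) r - g * (y[n] - h(eta + r * sum(x^2)))
    r <- uniroot(f, lower = min(0, b), upper = max(0, b))$root
    theta <- theta + r * x
  }
  theta
}

# Usage: logistic regression, where h is the logistic function plogis().
set.seed(3)
N <- 5000; p <- 10
X <- matrix(rnorm(N * p), N, p)
theta_star <- rnorm(p) / 2
y <- rbinom(N, 1, plogis(X %*% theta_star))
theta_hat <- implicit_sgd_glm(X, y, h = plogis,
                              gamma = function(n) 2 / (2 + n))
\end{verbatim}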

We conduct extensive experiments using Algorithm ? in Section 4.2.

3.3 Cox proportional hazards model

Here, we apply SGD to estimate a Cox proportional hazards model, which is a popular model in survival analysis [17]. Multiple variants of the model exist, but for simplicity we analyze one simple variant that is popular in practice [18]. Consider individuals, indexed by , with observed survival times , failure indicators , and covariates . The survival times can be assumed ordered, , where denotes failure (e.g., death) and indicates censoring (e.g., the patient dropped out of the study). Given a failure for unit () at time , the risk set is defined as the set of individuals that could possibly fail at , i.e., all individuals except those who failed or were censored before . In our simplified model, . Define ; then the log-likelihood for is given by [18]

where . In an online setting, where is infinite and data points are observed one at a time, future observations affect the likelihood of previous ones, as can be seen by inspection of Eq. . Therefore, we apply SGD assuming fixed to estimate the MLE . As mentioned in Section 1, our theory in Section 2 can be applied unchanged if we only substitute , the true parameter, with the MLE .

A straightforward implementation of explicit SGD in Eq. for the Cox model is shown in Algorithm ?. For implicit SGD in Eq. we have the update

which is similar to the implicit procedure for GLMs in Eq. . However, the log-likelihood term does not satisfy the conditions of Assumption because may be increasing or decreasing since it depends on terms . Thus, Theorem ? cannot be applied.

One way to circumvent this problem is to simply compute on the previous iterate instead of the current . The update then becomes

which now satisfies Assumption since is constant with respect to . In fact, this idea can be used to apply implicit SGD more generally beyond models that satisfy Assumption ; see Section 5 for a discussion.

3.4 M-estimation

Given observed data points and a convex function , the M-estimator is defined as

where it is assumed , and are i.i.d. zero mean-valued noise. M-estimators are especially useful in robust statistics [35] because appropriate choice of can reduce the influence of outliers in data. Typically, is twice-differentiable around zero. In this case,

where the expectation is over the empirical data distribution. Thus, according to Section 1, SGD procedures can be applied to approximate the M-estimator . There has been increased interest in the literature in fast approximation of M-estimators due to their robustness [22]. The implicit SGD procedure for approximating M-estimators is defined in Algorithm ?, and is a simple adaptation of Algorithm ?.
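The same fixed-point computation adapts directly; the following R sketch (ours; huber_psi and implicit_sgd_m are our names, and the tuning constant 1.345 is the conventional Huber default) uses Huber's loss:

\begin{verbatim}
# Minimal sketch (ours): implicit SGD for Huber M-estimation. psi = rho'
# is nondecreasing because rho is convex, so the same root bracketing
# as in the GLM case applies.
huber_psi <- function(u, k = 1.345) pmax(-k, pmin(k, u))

implicit_sgd_m <- function(X, y, psi = huber_psi,
                           gamma = function(n) 1 / n) {
  theta <- rep(0, ncol(X))
  for (n in seq_len(nrow(X))) {
    x <- X[n, ]; g <- gamma(n)
    res <- y[n] - sum(x * theta)   # residual at the current iterate
    b <- g * psi(res)              # bracket endpoint for the root
    if (b == 0) next
    f <- function(r) r - g * psi(res - r * sum(x^2))
    r <- uniroot(f, lower = min(0, b), upper = max(0, b))$root
    theta <- theta + r * x
  }
  theta
}
\end{verbatim}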

Importantly, the conditions of Assumption are met because is convex and thus . Step 4 of Algorithm ? is therefore a straightforward application of Algorithm ?, obtained by simply setting . The asymptotic variance of is also easy to derive. If , such that and commute, , and , Theorem ? can be leveraged to show that

Historically, one of the first applications of explicit stochastic approximation procedures in robust estimation was due to [48]. The asymptotic variance was first derived, only for the explicit SGD case, by [61] using stochastic approximation theory from [57].

4 Simulation and data analysis

In this section, we demonstrate the computational and statistical advantages of the SGD estimation procedures in Eq. and in Eq. . For our experiments we developed a new R package, namely sgd, which has been published on CRAN. All experiments were conducted on a single laptop running Linux Ubuntu 13.x with 8 cores @ 2.4GHz, 16GB of RAM, and 256GB of SSD storage. A separate set of experiments, presented in the supplemental article [77], focuses on comparisons of implicit SGD with popular machine learning methods on typical estimation tasks.

4.1 Numerical results

In this section we aim to illustrate the theoretical results of Section 2, namely the results on the asymptotic variance (Theorem ?) and asymptotic normality (Theorem ?) of SGD procedures.

Asymptotic variance

In this experiment we use a normal linear model following [85]. The procedures we test are explicit SGD in Eq. , implicit SGD in Eq. , and AdaGrad in Eq. . For simplicity we use first-order SGD, i.e., . In the experiment we calculate the empirical variance of these procedures for 25 values of their common learning rate parameter in the interval . For every value of we calculate the empirical variances through the following process, repeated 150 times. First, we set as the true parameter value. For iterations , we sample covariates as , where is diagonal with elements sampled uniformly on . The outcome is then sampled as . In every repetition we store the iterate for every tested procedure and then calculate the empirical variance of the stored iterates over all 150 repetitions.

For any fixed learning rate parameter we set for implicit SGD and for AdaGrad. For explicit SGD we set in order to stabilize its updates. This trick is necessary according to the analysis of Section 2.5. In particular, the Fisher information matrix here is , and thus the minimum eigenvalue is and the maximum is . Therefore, for stability we require and for fast convergence we require . The two requirements are incompatible, which indicates that explicit SGD can have serious stability issues.

For given , the asymptotic variance of SGD procedures after iterations is , by Theorem ?. The asymptotic variance of AdaGrad after iterations is equal to by Eq. . The log traces of the empirical variance of the SGD procedures and AdaGrad in this experiment are shown in Figure 4.

Figure 4: Simulation with normal model. The x-axis corresponds to the learning rate parameter \gamma_1; the y-axis corresponds to the log trace of the empirical variance of the tested procedures (explicit/implicit SGD, AdaGrad). Theoretical asymptotic variances of SGD and AdaGrad are plotted as well. Implicit SGD is stable and its empirical variance is very close to its asymptotic value. Explicit SGD becomes unstable at large \gamma_1. AdaGrad is statistically inefficient but remains stable for large learning rates.

The x-axis corresponds to different values of the learning rate parameter , and the y-axis corresponds to the log trace of the empirical variance of the iterates for all three different procedures. We also include curves for the theoretical values of the empirical variances.

We see that our theory predicts the empirical variances of all methods well. Explicit SGD performs on par with implicit SGD for moderate values of ; however, it required a modification of its learning rate to work. Furthermore, explicit SGD quickly becomes unstable at larger values of (see, for example, its empirical variance for ), and in several instances, not considered in Figure 4, it numerically diverged. On the other hand, AdaGrad is stable to the specification of and tracks its theoretical variance well. However, it yields inefficient estimators because their variance is of order . Implicit SGD effectively combines stability and good statistical efficiency. First, it remains very stable over the entire range of the learning rate parameter . Second, its empirical variance closely tracks the theoretical value predicted by Theorem ? for all .

Asymptotic normality

In this experiment we use the normal linear model in the setup of Section ? to check the asymptotic normality result of Theorem ?. For simplicity, we only test first-order implicit SGD in Eq. and first-order explicit SGD.

In the experiment we define a set of learning rates . For every learning rate we take 400 samples of , where and denotes either or . The matrix is the asymptotic variance matrix in Theorem ?, and is the true parameter value. We use the ground-truth values for both and , as we are only interested in testing the normality of the iterates in this experiment. We also tried as the parameter dimension. Because the explicit SGD procedure was very unstable across experiments, we only report results for . Results on the implicit procedure for larger are given in the supplemental article [77], where we also include results for a logistic regression model.

By Theorem ? for implicit SGD, and by classical normality results for explicit SGD [25], the quadratic form is a chi-squared random variable with degrees of freedom. Thus, for every procedure we plot this quantity against independent samples from a distribution and visually check for deviations. As before, we tried to stabilize explicit SGD as much as possible by setting . This worked in many replications, but not all. Replications for which explicit SGD diverged were not considered. For implicit SGD we simply set without additional tuning.
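A compact version of this diagnostic in R (our sketch; run_sgd() is a hypothetical function returning the final iterate of one replication, and Sigma, theta_star, N, and p are as defined in the text):

\begin{verbatim}
# Minimal sketch (ours) of the normality check: replicate SGD runs, form
# the standardized quadratic form, and compare against chi-squared(p).
q <- replicate(400, {
  theta_N <- run_sgd()   # hypothetical: one full SGD pass
  d <- theta_N - theta_star
  N * drop(t(d) %*% solve(Sigma) %*% d)
})
qqplot(qchisq(ppoints(400), df = p), q,
       xlab = "chi-squared quantiles", ylab = "observed quadratic form")
\end{verbatim}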

The results of this experiment are shown in Figure 1.

Figure 1: Simulation with normal model. The x-axis corresponds to the SGD procedure (explicit or implicit) for various values of the learning rate parameter, \gamma_1 \in \{0.5, 1, 3, 5, 7\}. The histograms (x-axis) for the SGD procedures are 500 replications of SGD where at each replication we only store the quantity N (\theta_{N}-\theta_\star)^\intercal \Sigma^{-1} (\theta_{N}-\theta_\star), for every method (N=1200); the theoretical covariance matrix \Sigma is different for every learning rate and is given in Theorem ?. The data generative model is the same as in Section ?. We observe that implicit SGD is stable and follows the nominal chi-squared distribution. Explicit SGD becomes unstable at larger \gamma_1 and its distribution does not follow the nominal one well. In particular, the distribution of N (\theta_{N}^{\mathrm{sgd}}-\theta_\star)^\intercal \Sigma^{-1} (\theta_{N}^{\mathrm{sgd}}-\theta_\star) becomes increasingly heavy-tailed as the learning rate parameter gets larger, and eventually diverges for \gamma_1 \ge 7.

The vertical axis on the grid corresponds to different values of the learning rate parameter , and the horizontal axis has histograms of , and also includes samples from a distribution for visual comparison.

We see that the distribution of the implicit iterates follows the nominal chi-squared distribution. This also appears to be unaffected by the learning rate parameter. However, the distribution of does not follow a chi-squared distribution, except for small values of the learning rate parameter. For example, as the learning rate parameter increases, the distribution becomes more heavy-tailed (e.g., for ), indicating that explicit SGD becomes unstable. In particular, for explicit SGD diverged in almost all replications, and thus a histogram could not be constructed.

4.2 Comparative performance results

In this section we aim to illustrate the performance of implicit SGD estimation against optimal deterministic estimation procedures. The goal is to investigate the extent to which implicit SGD can be as fast as deterministic methods, and to quantify how much statistical efficiency needs to be sacrificed to accomplish that.

Experiments with the glm() function

The built-in function glm() in R performs deterministic maximum-likelihood estimation through iteratively reweighted least squares. In this experiment, we compare the computing time and MSE of first-order implicit SGD and glm(). Our simulated data set is a simple normal linear model constructed as follows. First, we sample a binary design matrix such that (intercept) and i.i.d., where determines the sparsity of . We set , indicating that roughly 8% of the matrix will be nonzero. We generate by sampling elements from with replacement. The outcomes are , where i.i.d., and is the vector of 's covariates. By GLM properties,

Slightly tedious algebra shows that the eigenvalues of are , with multiplicity , and the two solutions of , where and . It is thus possible to use the analysis of Section 2.2 and Eq. to derive a theoretically optimal learning rate. We sample 200 pairs for the problem size, uniformly in the ranges and , and obtain the running times and MSE of the estimates from implicit SGD and glm(). Finally, we regress computing time and MSE against the problem size .

The results are shown in Table ?. We observe that implicit SGD scales better in both the sample size and, especially, the model size . We also observe that this significant computational gain does not come with much efficiency loss. In fact, averaged over all samples, the MSE of implicit SGD is 10% higher than the MSE of glm(), with a standard error of . Furthermore, the memory requirements (not reported in Table ?) are roughly for glm() and only for implicit SGD.

Experiments with biglm

The package biglm is a popular choice for fitting GLMs with data sets where is large but is small. It works iteratively by splitting the data set into many parts, and by updating the model parameters using incremental QR decomposition [50], which results in only