Asymptotic Analysis via Stochastic Differential Equations of Gradient Descent Algorithms in Statistical and Computational Paradigms

Yazhen Wang
Department of Statistics, University of Wisconsin-Madison
Madison, WI 53706, USA. Email: yzwang@stat.wisc.edu
Abstract

This paper investigates the asymptotic behaviors of gradient descent algorithms, particularly accelerated gradient descent and stochastic gradient descent, in the stochastic optimization setting arising in statistics and machine learning, where objective functions are estimated from available data. We show that these algorithms can be modeled computationally by continuous-time ordinary or stochastic differential equations. We establish gradient flow central limit theorems to describe the limiting dynamic behaviors of these computational algorithms and the large-sample performances of the related statistical procedures, as the number of algorithm iterations and the data size both go to infinity, where the gradient flow central limit theorems are governed by linear ordinary or stochastic differential equations such as time-dependent Ornstein-Uhlenbeck processes. We illustrate that our study provides a novel unified framework for a joint computational and statistical asymptotic analysis: the computational asymptotic analysis studies the dynamic behaviors of these algorithms with time (or the number of iterations), the statistical asymptotic analysis investigates the large-sample behaviors of the statistical procedures (such as estimators and classifiers) that the algorithms are applied to compute, and the statistical procedures are in fact the limits of the random sequences generated by these iterative algorithms as the number of iterations goes to infinity. Based on the obtained gradient flow central limit theorems, the joint analysis identifies four factors – learning rate, batch size, gradient covariance, and Hessian – and derives new theory regarding the local minima found by stochastic gradient descent for solving non-convex optimization problems.

Key words: Gradient flow central limit theorem; joint computational and statistical asymptotic analysis; weak convergence limit; mini-batch; optimization; ordinary or stochastic differential equation.

Running title: Computational & Statistical Analysis of Gradient Descent

1 Introduction

1.1 Background and Motivation

Optimization plays an important role in scientific fields ranging from machine learning and statistics to the physical sciences and engineering. It lies at the core of data science by providing a mathematical language for handling both computational algorithms and statistical inference in data analysis. Numerous algorithms and methods have been proposed to solve optimization problems. Examples include Newton’s method, gradient and subgradient descent, conjugate gradient methods, trust region methods, and interior point methods (see Polyak, 1987; Boyd and Vandenberghe, 2004; Nocedal and Wright, 2006; Ruszczynski, 2006; Boyd et al., 2011; Shor, 2012; Goodfellow et al., 2016 for expositions). Practical problems arising in fields like statistics and machine learning usually involve optimization settings where the objective functions are empirically estimated from available data and take the form of a sum of differentiable functions. We refer to such optimization problems with random objective functions as stochastic optimization. As data sets in practical problems grow rapidly in scale and complexity, methods such as stochastic gradient descent can scale to the enormous size of big data and have become very popular. There has been surging recent interest in, and substantial research on, the theory and practice of gradient descent and its extensions and variants. For example, a number of recent papers have been devoted to investigating stochastic gradient descent and its variants for solving complex optimization problems (Ali et al. (2019), Chen et al. (2016), Ge et al. (2015), Jin et al. (2017), Kawaguchi (2016), Keskar et al. (2017), Lee et al. (2016), Li et al. (2017b), Mandt et al. (2016), and Shallue et al. (2019)). In particular, Su et al. (2016) derived the continuous-time limit of Nesterov’s accelerated gradient descent as a second-order ordinary differential equation for studying the acceleration phenomenon and generalizing Nesterov’s scheme. Wibisono et al. (2016) further developed a systematic approach based on a continuous-time variational framework to understand the acceleration phenomenon and to produce acceleration algorithms from continuous-time differential equations. In spite of compelling theoretical and numerical evidence on the value of the stochastic approximation idea and the acceleration phenomenon, some conceptual and theoretical mystery remains in the acceleration and stochastic approximation schemes.

1.2 Contributions

This paper establishes asymptotic theory for gradient descent, stochastic gradient descent, and accelerated gradient descent in the stochastic optimization setup. We derive continuous-time ordinary or stochastic differential equations to model the dynamic behaviors of these gradient descent algorithms and investigate their limiting algorithmic dynamics and large-sample performances as the number of algorithm iterations and the data size both go to infinity. Specifically, for an optimization problem whose objective function is convex and deterministic, we consider a matched stochastic optimization problem whose random objective function is an empirical estimator of the deterministic objective function based on available data. The solution of the stochastic optimization problem specifies a decision rule, such as an estimator or a classifier, based on the sampled data in statistics and machine learning, while the corresponding deterministic optimization problem characterizes through its solution the true value of the parameter in the population model. In other words, the two connected optimization problems are associated with the data sample and the population model from which the data are sampled, and the stochastic optimization is a sample version of the deterministic optimization corresponding to the population. We refer to these two types of problems as the deterministic population and stochastic sample optimization problems, respectively. Consider the random sequences generated from the gradient descent algorithms and their corresponding continuous-time ordinary or stochastic differential equations in the stochastic sample optimization setting. We show that these random sequences converge to the solutions of the ordinary differential equations for the corresponding deterministic population optimization setup, and we derive their asymptotic distributions through linear ordinary or stochastic differential equations such as time-dependent Ornstein-Uhlenbeck processes. The asymptotic distributions are used to understand and quantify the limiting discrepancy between the random sequences generated from each algorithm when solving the corresponding sample and population optimization problems. In particular, since the obtained asymptotic distributions characterize the limiting behavior of the normalized difference between the sample and population gradient (or Lagrangian) flows, the limiting distributions may be viewed as central limit theorems (CLT) for gradient (or Lagrangian) flows, and are therefore called the gradient (or Lagrangian) flow central limit theorems (GF-CLT or LF-CLT). Moreover, our analysis may offer a novel unified framework for carrying out a joint asymptotic analysis of computational algorithms and the statistical decision rules that the algorithms are applied to compute. As iterative computational methods, these gradient descent algorithms generate sequences that converge to the exact decision rule or the true parameter value for the corresponding optimization problems as the number of iterations goes to infinity. Thus, as time (corresponding to the number of iterations) goes to infinity, the continuous-time differential equations may have distributional limits corresponding to the large-sample distributions of the statistical decision rules as the sample size goes to infinity. In other words, the asymptotic analysis can be carried out in both time and data size, where the time direction corresponds to the computational asymptotics on the dynamic behaviors of the algorithms, and the data size direction corresponds to the statistical large-sample asymptotics on the behaviors of decision rules such as estimators and classifiers. The continuous-time modeling and the GF-CLT based joint asymptotic analysis may reveal new facts and shed light on the phenomenon that stochastic gradient descent algorithms can escape from saddle points and converge to good local minimizers when solving non-convex optimization problems in deep learning. To the best of our knowledge, this is the first paper to establish the GF-CLT and LF-CLT, offer a unified framework for the joint computational and statistical asymptotic analysis, and establish a novel theory identifying four factors that influence the local minima found by stochastic gradient descent in non-convex optimization.

There is a large literature on stochastic approximation and recursive algorithms, in particular on stochastic gradient descent in deep learning (see Chen et al. (2016), Dalalyan (2017), Fan et al. (2018), Ge et al. (2015), Jastrzȩbski et al. (2018), Jin et al. (2017), Kawaguchi (2016), Keskar et al. (2017), Kushner and Yin (2003), Lee et al. (2016), Li et al. (2016), Li et al. (2017a), Li et al. (2017b), Ma et al. (2019), Mandt et al. (2016), Shallue et al. (2019), Sirignano and Spiliopoulos (2017), Su et al. (2016), Wibisono et al. (2016)). Both continuous-time and discrete-time approaches are adopted by the computational and statistical (as well as machine learning) communities. The work on the computational side focuses more on the dynamics and convergence of learning algorithms, while the statistical research emphasizes inference for learning rules. Our study combines both computational and statistical approaches to carry out a joint analysis of the learning algorithms and the learning rules, where the algorithms are used to compute the rules. In a nutshell, we analyze the learning algorithms in terms of both computational dynamic convergence behavior and statistical large-sample performance. The developed GF-CLT based theory and the proposed joint analysis describe the dynamic convergence behaviors of the computational algorithms as well as the statistical large-sample performances of the learning rules computed by the algorithms. The statistical behaviors of the associated learning rules can be derived from the GF-CLT at an infinite time horizon. On the computational side, the theory has implications for optimization phenomena observed in the discrete-time case as well as in practice. For example, the obtained GF-CLT reveals that the gradient flow for the stochastic sample optimization can be decomposed into the gradient flow for the corresponding deterministic population optimization plus a random fluctuation term, where the random fluctuation depends on the learning rate and batch size only through their ratio and is governed by a time-dependent Ornstein-Uhlenbeck process. Using the joint analysis together with the algebraic Riccati equation characterizing the stationary covariance of the Ornstein-Uhlenbeck process, we develop a novel theory about how the minima found by stochastic gradient descent are influenced by four factors: learning rate, batch size, gradient covariance, and Hessian. As a case in point, our general results cover, as a special case, the study in Jastrzȩbski et al. (2018), which identifies only three of the four factors influencing the local minima found by stochastic gradient descent. Foster et al. (2019) showed that the complexity of a stochastic optimization problem can be decomposed into the complexity of its corresponding deterministic population optimization and the sample complexity, where the optimization complexity represents the minimal amount of effort required to find near-stationary points, and the sample complexity of an algorithm refers to the number of training samples needed to learn a target function sufficiently well. Our results indicate that finding near-stationary points for a stochastic sample optimization can be converted into finding near-stationary points for the corresponding deterministic population optimization plus some control on the random fluctuation term. As the random fluctuation has zero mean, the control can be achieved by bounding the variance of the time-dependent Ornstein-Uhlenbeck process and selecting a sufficiently small ratio of learning rate to batch size, which relates to the sample complexity of the associated statistical learning problem. This shows that our results are in agreement with Foster et al. (2019) and may point to a potential intrinsic connection between our approach and theirs. Furthermore, the continuous-time approach serves as both a convenient tool and an elegant framework for formulating stochastic dynamics and statistical procedures and for deriving their limiting behaviors and large-sample performances, which are eventually used to establish the limiting behaviors of their discrete counterparts.

In summary, we highlight our main contributions as follows:

  • We establish a new asymptotic theory for the discrepancy between the sample and population gradient (or Lagrangian) flows. In particular the new limiting distributions for the normalized discrepancy are called the gradient (or Lagrangian) flow central limit theorems (GF-CLT or LF-CLT). See Sections 3.3 and 4.1-4.2.

  • The obtained asymptotic theory provides a novel unified framework for a joint computational and statistical asymptotic analysis. Statistically, the joint analysis can facilitate inferential analysis of a learning rule computed by gradient descent algorithms. Computationally, the joint analysis enables us to understand and quantify the random fluctuation in, and its impact on, the dynamic and convergence behavior of a gradient descent algorithm applied to solve a stochastic optimization problem. See Sections 3.4 and 4.3.

  • Computationally, we develop a novel theory showing that four factors – learning rate, batch size, gradient covariance, and Hessian – together with the associated identities influence the local minima found by stochastic gradient descent for solving a non-convex optimization problem. The theory may also shed light on some intrinsic relationship among stochastic optimization, deterministic optimization, and statistical learning. See Section 4.4.

  • Statistically we illustrate implications of our results for statistical analysis of stochastic gradient descent and inference of outputs from stochastic gradient descent. See Section 4.5.

  • We employ the continuous-time approach to demonstrate that it provides a convenient means for deriving elegant and deep results on the stochastic dynamics of learning algorithms and the statistical inference of learning rules.

1.3 Organization

The rest of the paper proceeds as follows. Section 2 introduces gradient descent, accelerated gradient descent, and their corresponding ordinary differential equations. Section 3 presents stochastic optimization and investigates the asymptotic behaviors of the plain and accelerated gradient descent algorithms and their associated ordinary differential equations (with random coefficients) as the sample size goes to infinity. We illustrate the unified framework for carrying out a joint analysis of computational and statistical asymptotics, where computational asymptotics deals with the dynamic behaviors of the gradient descent algorithms with time (or iteration), and statistical asymptotics studies the large-sample behaviors of the statistical decision rules that the algorithms are applied to compute. Section 4 considers stochastic gradient descent algorithms for large-scale data and derives stochastic differential equations to model these algorithms. We establish asymptotic theory for these algorithms and their associated stochastic differential equations, and describe a joint analysis of computational and statistical asymptotics. Section 5 features an example. All technical proofs are relegated to the Appendix.

We adopt the following notations and conventions. For the stochastic sample optimization problem considered in Sections 3 and 4, we add a superscript to the notations for the associated processes and sequences in Section 3, and indices and/or to the notations for the corresponding processes and sequences affiliated with mini-batches in Section 4, while notations without such subscripts or superscripts refer to sequences and functions corresponding to the deterministic population optimization problem given in Section 2. Our basic proof ideas are as follows. Each algorithm generates a sequence for computing a learning rule, a step-wise empirical process is formed from the generated sequence, and a continuous process is obtained from the corresponding continuous-time differential equation. We derive asymptotic distributions by analyzing the differential equations, and we bound the differences between the empirical processes and their corresponding continuous processes by studying the optimization problems and utilizing empirical process theory together with the related differential equations.

2 Ordinary differential equations for gradient descent algorithms

Consider the following minimization problem

(2.1)

where the objective function is defined on a parameter space and assumed to have L-Lipschitz continuous gradients. Iterative algorithms such as gradient descent methods are often employed to numerically compute the solution of the minimization problem. Starting with some initial values , the plain gradient descent algorithm is iteratively defined by

(2.2)

where denotes the gradient operator, and is a positive constant often called the step size or learning rate.
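As a concrete illustration of the recursion (2.2) (a minimal sketch in our own notation, since the paper's symbols are not reproduced here; the names grad_f, eta, and num_iters are ours), the standard update reads x_{k+1} = x_k − η ∇f(x_k):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, num_iters=100):
    """Plain gradient descent: repeatedly step against the gradient
    with a constant learning rate eta (the step size in (2.2))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - eta * grad_f(x)   # x_{k+1} = x_k - eta * grad f(x_k)
    return x

# Toy usage: minimize f(x) = 0.5 * ||x - b||^2, whose gradient is x - b,
# so the iterates contract toward the minimizer b.
b = np.array([1.0, -2.0])
print(gradient_descent(lambda x: x - b, x0=np.zeros(2)))
```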

It is easy to model , by a smooth curve with the Ansatz as follows. Define a step function for , and as , approaches satisfying

(2.3)

where denotes the derivative of , and initial value . In fact, is a gradient flow associated with the objective function in the optimization problem (2.1).
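For reference, the gradient flow in (2.3) takes the standard form (written here with X(t) as our own placeholder notation):

\[
\dot X(t) = -\nabla f\bigl(X(t)\bigr), \qquad X(0) = x_0,
\]

and the Ansatz identifies the discrete iterates with the flow sampled at multiples of the step size, x_k ≈ X(kη), so that the gradient descent recursion (2.2) is the forward Euler discretization of this equation.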

Nesterov’s accelerated gradient descent scheme is a well-known algorithm that achieves a faster convergence rate than the plain gradient descent algorithm. Starting with initial values and , Nesterov’s accelerated gradient descent algorithm is iteratively defined by

(2.4)

where is a positive constant. Using (2.4) we derive a recursive relationship between consecutive increments

(2.5)
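For concreteness, one common form of Nesterov's scheme (2.4), as written in Su et al. (2016), alternates a gradient step at a look-ahead point with a momentum extrapolation; a minimal sketch (our own variable names) is:

```python
import numpy as np

def nesterov_agd(grad_f, x0, s=0.01, num_iters=200):
    """Nesterov's accelerated gradient descent in the constant-step form of
    Su et al. (2016): x_k = y_{k-1} - s * grad_f(y_{k-1}),
                      y_k = x_k + (k - 1) / (k + 2) * (x_k - x_{k-1})."""
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    for k in range(1, num_iters + 1):
        x = y - s * grad_f(y)                         # gradient step at the look-ahead point
        y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)  # momentum extrapolation
        x_prev = x
    return x_prev

# Same toy quadratic as before: f(x) = 0.5 * ||x - b||^2.
b = np.array([1.0, -2.0])
print(nesterov_agd(lambda x: x - b, x0=np.zeros(2)))
```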

We model , by a smooth curve in the sense that are its samples at discrete points; that is, we define a step function for , and introduce the Ansatz for some smooth function defined for . Let be the step size. For , as , we have , , and

Applying the Taylor expansion and using the L-Lipschitz continuity of the gradients, we obtain

where denotes the second derivative of . Substituting the above results into equation (2.5) and letting we obtain

(2.6)

with the initial conditions and . As the coefficient in the ordinary differential equation (2.6) is singular at , classical ordinary differential equation theory is not applicable to establish the existence or uniqueness of the solution to equation (2.6). The heuristic derivation of (2.6) is from Su et al. (2016), who established that equation (2.6) has a unique solution satisfying the initial conditions and that converges to uniformly on for any fixed . Note the step-size difference between the plain and accelerated cases: the step size is for Nesterov’s accelerated gradient descent algorithm and for the plain gradient descent algorithm. Su et al. (2016) showed that, because of this difference, the accelerated gradient descent algorithm moves much faster than the plain gradient descent algorithm along the curve . Wibisono et al. (2016) provided a more elaborate explanation of the acceleration phenomenon and developed a systematic continuous-time variational scheme to generate a large class of continuous-time differential equations and produce a family of accelerated gradient algorithms. The variational scheme utilizes a first-order rescaled gradient flow and a second-order Lagrangian flow, which are generalizations of the gradient flow. As we refer to the solution of the differential equation (2.3) as the gradient flow for the gradient descent algorithm (2.2), we call the solution of the differential equation (2.6) the Lagrangian flow for the accelerated gradient descent algorithm (2.4).
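For the reader's convenience, the second-order limit that (2.6) refers to can be written, in the notation of Su et al. (2016), as

\[
\ddot X(t) + \frac{3}{t}\,\dot X(t) + \nabla f\bigl(X(t)\bigr) = 0,
\qquad X(0) = x_0, \quad \dot X(0) = 0,
\]

with the time identification t ≈ k√s linking the iteration counter k to continuous time (here s and η are our generic symbols for the accelerated and plain step sizes, respectively, in contrast to t ≈ kη for the plain gradient flow); the coefficient 3/t is the singular term mentioned above.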

3 Gradient descent for stochastic optimization

Let be the parameter that we are interested in, and be a relevant random element on a probability space with a given distribution . Consider an objective function and its corresponding expectation . For example, in a statistical decision problem, we may take to be a decision rule, a loss function, and its corresponding risk; in M-estimation, we may treat as a sample observation and a -function; in nonparametric function estimation and machine learning, we may choose an observation and equal to a loss function plus some penalty. For these problems we need to consider the corresponding population minimization problem (2.1) for characterizing the true parameter value or its function as an estimand, but practically, because is usually unavailable, we have to employ its empirical version and consider a stochastic optimization problem, described as follows:

(3.7)

where , is a sample, and we assume that are i.i.d. and follow the distribution .

The minimization problem (2.1) characterizes the true value of the target estimand, such as an estimation parameter in a statistical model or a classification parameter in a machine learning task. As the true objective function is usually unknown in practice, we often solve the stochastic minimization problem (3.7) with observed data to obtain practically useful decision rules such as an M-estimator, a smoothing function estimator, or a machine learning classifier. The approach to obtaining practical procedures is based on the heuristic reasoning that, as , the law of large numbers implies that eventually converges to in probability, and thus the solution of (3.7) approaches that of the minimization problem (2.1).
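To make the sample versus population distinction concrete, the following toy sketch (our own hypothetical setup, not the paper's Example 1) runs plain gradient descent on an empirical objective of the form f_n(θ) = n^{-1} Σ_i ℓ(θ, Z_i) and on its population counterpart f(θ) = E ℓ(θ, Z):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = 2.0                               # population parameter: minimizer of f
z = rng.normal(mu_true, 1.0, size=500)      # i.i.d. sample Z_1, ..., Z_n

# Quadratic loss ell(theta, Z) = 0.5 * (theta - Z)^2, so the population objective
# f(theta) = E ell(theta, Z) is minimized at mu_true, while the sample objective
# f_n(theta) = mean_i ell(theta, Z_i) is minimized at the sample mean z.mean().
grad_f_n = lambda theta: theta - z.mean()   # gradient of the sample objective
grad_f = lambda theta: theta - mu_true      # gradient of the population objective

theta_n = theta = 0.0
for _ in range(200):                        # plain gradient descent on both objectives
    theta_n -= 0.1 * grad_f_n(theta_n)
    theta -= 0.1 * grad_f(theta)

print(theta_n, z.mean())                    # sample iterates approach the sample minimizer
print(theta, mu_true)                       # population iterates approach the true parameter
```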

3.1 Plain gradient descent algorithm

Applying the plain gradient descent scheme to the minimization problem (3.7) with initial value , we obtain the following iterative algorithm to compute the solution of (3.7),

(3.8)

where is a step size or learning rate, and is the objective function in the minimization problem (3.7).

Following the continuous curve approximation described in Section 2 we define a step function for , and for each , as , approaches a smooth curve , , given by

(3.9)

where , gradient operator here is applied to and with respect to , and initial value . is a gradient flow associated with in the optimization problem (3.7).

As and are random, and our main interest is to study the distributional behaviors of the solution and the algorithm, we may define a solution of equation (3.9) in a weak sense: there exist a process and a random vector defined on some probability space such that is identically distributed as , satisfies equation (3.9), and is called a (weak) solution of equation (3.9). Note that is not required to be defined on a fixed probability space with given random variables; instead, we define on some probability space with some associated random variables whose distributions are given by . The weak solution definition, which shares the same spirit as that for stochastic differential equations (see Ikeda and Watanabe (1981) and more in Section 4), will be very handy in facilitating our asymptotic analysis in this paper. For simplicity we drop the index and ‘weak’ when there is no confusion.

3.2 Accelerated gradient descent algorithm

Nesterov’s accelerated gradient descent scheme can be used to solve the minimization problem (3.7). Starting with initial values and , we obtain the following iterative algorithm to compute the solution of the stochastic minimization problem (3.7),

(3.10)

Using the continuous curve approach described in Section 2 we can define a step function for , and for every , as , we approximate by a smooth curve , , governed by

(3.11)

where initial values and , , and gradient operator here is applied to and with respect to . is a Lagrangian flow associated with in the optimization problem (3.7).

Again we define a solution of equation (3.11) in the weak sense, i.e., there exist a process and a random vector on some probability space such that the distribution of is specified by , and is a solution of equation (3.11).

3.3 Asymptotic theory via ordinary differential equations

To make equations (3.9) and (3.11) and their solutions well defined and to study their asymptotics, we impose the following conditions.

  A0. Assume initial values satisfy .

  A1. is twice continuously differentiable in ; , , such that , , where and for some fixed have finite fourth moments.

  A2. , , , on the parameter space , is twice continuously differentiable and strongly convex, and is -Lipschitz for some , where is the gradient operator (the first-order partial derivatives), and is the Hessian operator (the second-order partial derivatives).

  A3. Define the cross auto-covariance , , where Cov are assumed to be continuously differentiable and -Lipschitz. Let Cov, and Var be positive definite.

  A4. weakly converges to uniformly over , where is a Gaussian process with mean zero and auto-covariance defined in A3, is a bounded subset of , and the interior of contains the solutions of the ordinary differential equations (2.3) and (2.6) connecting the initial value and the minimizer of .

Conditions A1-A2 are often used to make optimization problems and differential equations well defined and to match the stochastic sample optimization problem (3.7) with the deterministic population optimization problem (2.1). Conditions A3-A4 guarantee that the solution of (3.7) and its associated differential equations provide large-sample approximations of those for (2.1). Condition A4 can be easily justified by empirical process theory under common assumptions, such as that , , form a Donsker class (van der Vaart and Wellner, 2000), since the solution curves of the ordinary differential equations (2.3) and (2.6) are deterministic and bounded, and it is easy to select .

For a given , denote by the space of all continuous functions on with the uniform metric between functions and . For the solutions and of the ordinary differential equations (2.3) and (3.9) [or (2.6) and (3.11)], respectively, we define . Then , and live on . Treating them as random elements in , in the following theorem we establish a weak convergence limit of .

Theorem 3.1.

Under conditions A0-A4, as , weakly converges to a Gaussian process , where is the unique solution of the following linear differential equations

(3.12)

for the plain gradient descent case, and

(3.13)

for the accelerated gradient descent case, where the deterministic functions in (3.12) and (3.13) are the solutions of the ordinary differential equations (2.3) and (2.6), respectively, is the Hessian operator, random coefficient is the Gaussian process given by Condition A4.

In particular if Gaussian process , where random variable , and is defined in Condition A3, then on , and the deterministic matrix is the unique solution of the following linear differential equations

(3.14)

for the plain gradient descent case, and

(3.15)

for the accelerated gradient descent case, where in (3.14) and (3.15) are the solutions of the ordinary differential equations (2.3) and (2.6), respectively, is the Hessian operator, and is defined in Condition A3.

Remark 3.1.

As discussed in Sections 2 and 3.1, for the gradient descent case and are the gradient flows associated with the population optimization (2.1) and the sample optimization (3.7), respectively, and are thus referred to as the corresponding population and sample gradient flows. As a consequence, the Gaussian limiting distribution of describes the asymptotic distribution of the difference between the sample and population gradient flows, with a normalization factor . Hence, it is natural to view the Gaussian limiting distribution as a central limit theorem for the gradient flows, and we call it the gradient flow central limit theorem (GF-CLT). Similarly, for the accelerated case and are the Lagrangian flows associated with the population optimization (2.1) and the sample optimization (3.7), respectively, and are thus referred to as the corresponding population and sample Lagrangian flows. The Gaussian limiting distribution for the normalized discrepancy between the sample and population Lagrangian flows can naturally be viewed as a central limit theorem for the Lagrangian flows, and we call it the Lagrangian flow central limit theorem (LF-CLT).

Remark 3.2.

As discussed earlier in Section 3, as , converges to in probability, and the solutions of the minimization problems (2.1) and (3.7) should be very close to each other. We may heuristically illustrate the derivation of Theorem 3.1 as follows. The central limit theorem suggests that, as , is asymptotically distributed as . Then, asymptotically, the differential equations (3.9) and (3.11) are, respectively, equivalent to

(3.16)
(3.17)

Applying the perturbation method for solving ordinary differential equations, we write approximate solutions of equations (3.16) and (3.17) as and substitute them into (3.16) and (3.17). With satisfying the ordinary differential equation (2.3) or (2.6), using the Taylor expansion and ignoring higher-order terms, we easily obtain equations (3.12) and (3.13) for the limit of in the two cases, respectively.
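To spell out this perturbation step for the plain gradient descent case in our own placeholder notation (with x_n(t) the sample gradient flow of (3.9), x(t) the population flow of (2.3), H = ∇²f the Hessian, and G the Gaussian process of Condition A4), write x_n(t) ≈ x(t) + n^{-1/2} V(t) and substitute into (3.16):

\[
\dot x(t) + n^{-1/2}\dot V(t)
\approx -\nabla f\bigl(x(t) + n^{-1/2}V(t)\bigr) - n^{-1/2}G\bigl(x(t)\bigr)
\approx -\nabla f\bigl(x(t)\bigr) - n^{-1/2}\Bigl[H\bigl(x(t)\bigr)V(t) + G\bigl(x(t)\bigr)\Bigr].
\]

Matching the n^{-1/2} terms (the leading terms cancel because \(\dot x(t) = -\nabla f(x(t))\)) leaves the linear equation \(\dot V(t) = -H(x(t))V(t) - G(x(t))\) with V(0) = 0, which is the form taken by (3.12); the accelerated case (3.13) follows in the same way from the second-order equation.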

The step function is used to model generated from the gradient descent algorithms (3.8) and (3.10). To study their weak convergence, we need to introduce the Skorokhod space, denoted by , of all càdlàg functions on , equipped with the Skorokhod metric (Billingsley, 1999). Then lives on , and treating it as a random element in , we derive its weak convergence limit in the following theorem.

Theorem 3.2.

Under conditions A0-A4, as and , we have

where are the continuous-time step processes for discrete generated from the algorithms (3.8) and (3.10), with continuous curves defined by the ordinary differential equations (3.9) and (3.11), for the cases of plain and accelerated gradient descent algorithms, respectively. In particular, we may choose such that as and , and then for the chosen , weakly converges to on , where is the solution of the ordinary differential equations (2.3) or (2.6), and is given by Theorem 3.1. That is, and share the same weak convergence limit.

Remark 3.3.

There are two types of asymptotic analyses in this setup. One type employs continuous differential equations to model the discrete sequences generated from the gradient descent algorithms, and is associated with treated as the step size between consecutive sequence points. The other type involves the use of random objective functions in stochastic optimization, which are estimated from sample data of size . We refer to the first and second types as computational and statistical asymptotics, respectively. The computational asymptotic analysis is that, for each , the ordinary differential equations (3.9) and (3.11) [or (3.16) and (3.17)] provide continuous solutions as the limits of the discrete sequences generated from the algorithms (3.8) and (3.10), respectively, when is allowed to go to zero. Theorem 3.1 provides the statistical asymptotic analysis describing the difference in behavior between the sample gradient flow and the population gradient flow as the sample size goes to infinity. Theorem 3.2 involves both types of asymptotics and shows that as and , is of order . It is easy to choose so that is of order smaller than . Then has the same asymptotic distribution as .
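In the same placeholder notation as above (writing \(\widehat x_{n,\eta}(t)\) for the step process built from the iterates of (3.8), \(x_n(t)\) for the sample gradient flow, and \(x(t)\) for the population flow), the two asymptotics combine through the decomposition

\[
\widehat x_{n,\eta}(t) - x(t)
= \underbrace{\bigl[\widehat x_{n,\eta}(t) - x_n(t)\bigr]}_{\text{computational (discretization) error}}
+ \underbrace{\bigl[x_n(t) - x(t)\bigr]}_{\text{statistical error, } O_p(n^{-1/2})},
\]

where the first term vanishes as the step size shrinks (Theorem 3.2) and the second obeys the GF-CLT (Theorem 3.1); choosing the step size small enough relative to \(n^{-1/2}\), as in the remark above, makes the computational error asymptotically negligible, so the step process inherits the GF-CLT limit.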

3.4 A framework to unify computational and statistical asymptotic analysis

The two types of asymptotics, associated with and , seem to be quite different, with one for computational algorithms and the other for statistical procedures. This section elaborates further on these analyses and provides a framework to unify both viewpoints. Denote the solutions of the optimization problems (2.1) and (3.7) by and , respectively. In the statistical setup, and represent the true estimand and its associated estimator, respectively. Using the definitions of and and the Taylor expansion, we have ,

the law of large numbers implies that converges in probability to as , and Condition A4 indicates that

where stands for a standard normal random vector. Thus, is asymptotically distributed as . On the other hand, the gradient descent algorithms generate sequences corresponding to and , which are expected to approach the solutions of the two optimization problems (2.1) and (3.7), respectively. Hence and must move towards and , respectively, and and are reaching their corresponding targets and . Below we will provide a framework to connect with and with .
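For completeness, in standard M-estimation notation (with \(\hat\theta_n\) the sample minimizer of (3.7), \(\theta^{*}\) the population minimizer of (2.1), \(H = \nabla^2 f(\theta^{*})\), and \(\Sigma\) the covariance of the loss gradient at \(\theta^{*}\) from Condition A3), the Taylor-expansion argument above yields the familiar sandwich form, stated here in our own notation as a standard fact under conditions of this type:

\[
\sqrt{n}\,\bigl(\hat\theta_n - \theta^{*}\bigr) \;\Rightarrow\; N\bigl(0,\; H^{-1}\,\Sigma\, H^{-1}\bigr),
\]

since \(0 = \nabla f_n(\hat\theta_n) \approx \nabla f_n(\theta^{*}) + H(\hat\theta_n - \theta^{*})\) and \(\sqrt{n}\,\nabla f_n(\theta^{*})\) is asymptotically \(N(0, \Sigma)\) by the central limit theorem.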

Since the time interval considered so far is for any arbitrary , we may extend the time interval to , and consider , the space of all continuous functions on , equipped with a metric for the topology of uniform convergence on compacta:

The solutions , , and of the ordinary differential equations (2.3), (2.6), (3.9), (3.11)-(3.17) all live on , and we can study their weak convergence on . Similarly, we may adopt the Skorokhod space equipped with the Skorokhod metric for the weak convergence study of (see Billingsley, 1999). The following theorem establishes the weak convergence of these processes on and studies their asymptotic behaviors as .

Theorem 3.3.

Suppose that conditions A0-A4 are met, is positive definite, all eigenvalues of diverge as , and commute for any , and as and . Then on , as and , and weakly converge to , .

Furthermore, for the plain gradient descent case we have as and ,

  • , and converge to , where , and are defined in Section 2 (see the algorithms and ordinary differential equations (2.2)-(2.4) and (2.6)).

  • , and converge to in probability, and thus converges to in probability, where , and are defined in the algorithms and equations (3.8)-(3.11).

  • The limiting distributions of as and as are identical and given by a normal distribution with mean zero and variance , where , defined in the ordinary differential equations (3.12) and (3.13), is the weak convergence limit of as .

Remark 3.4.

Denote the limits of the processes in Theorem 3.3 as by the corresponding processes with and replacing by . Then Theorem 3.3 shows that for the plain gradient descent case, , , , , weakly converges to as , and weakly converges to as . In particular, as the process is indexed by and , its limits are the same regardless of the order of and . Also, as is the minimizer of the convex function , the positive definiteness assumption is very reasonable; since the limit of as has all positive eigenvalues, it is natural to expect that has diverging eigenvalues. We conjecture that similar asymptotic results might hold for the accelerated gradient descent case as .

With the augmentation of , we extend further to , consider , , , , , and on and derive the limits of and on by Theorem 3.3. As and , the limiting distributions of and are for , where describe the dynamic evolution of the gradient descent algorithms for and the statistical distribution of for .

The joint asymptotic analysis provides a unified framework to describe the distributional limits of and from both computational and statistical viewpoints as follows. For , and give the limiting behaviors of and corresponding to the computational algorithms, while and illustrate the limiting behaviors of the corresponding statistical decision rule (or the exact solutions of the corresponding optimization problems (2.1) and (3.7) that the algorithms are designed to compute). We use the following simple example to explicitly illustrate the joint asymptotic analysis.

Example 1. Suppose that , , are i.i.d. random vectors, where and are independent, and follow a normal distribution with mean and variance and an exponential distribution with mean , respectively, and . Define , and denote by the true value of the parameter in the model. Then , , , , , and , where is the sample mean. It is easy to see that the corresponding minimization problems (2.1) and (3.7) have explicit solutions: has the minimizer , and has the minimizer . For this example, the algorithms (2.2), (3.8), (2.4) and (3.10) yield recursive formulas , and for the plain gradient descent case; and , , , for the accelerated gradient descent case. While it may not be obvious how to explicitly describe the dynamic behaviors of these algorithms for the accelerated case, below we clearly illustrate the behaviors of their corresponding ordinary differential equations through closed-form expressions. First we consider the plain gradient descent case, where the closed-form expressions are very simple. The ordinary differential equations (2.3) and (3.9) admit simple solutions

Note that , converges in distribution to a standard normal random variable , and and are independent. As in Theorem 3.1, let , , where is the matrix solution of the linear differential equation (3.14) in this case. Then for ,

which confirms that converges to , as shown in Theorem 3.1. Furthermore, as , , , and