On the Regularizing Property of Stochastic Gradient Descent

Bangti Jin Department of Computer Science, University College London, Gower Street, London WC1E 2BT, UK (b.jin@ucl.ac.uk,bangti.jin@gmail.com)    Xiliang Lu School of Mathematics and Statistics and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, People’s Republic of China (xllv.math@whu.edu.cn)
Abstract

Stochastic gradient descent is one of the most successful approaches for solving large-scale problems, especially in machine learning and statistics. At each iteration, it employs an unbiased estimator of the full gradient computed from one single randomly selected data point. Hence, it scales well with the problem size and is very attractive for truly massive datasets, and it holds significant potential for solving large-scale inverse problems. In the recent machine learning literature, it has been empirically observed that, when equipped with early stopping, the method has a regularizing property. In this work, we rigorously establish its regularizing property (under an a priori early stopping rule), and also prove convergence rates under the canonical sourcewise condition, for minimizing the quadratic functional associated with linear inverse problems. This is achieved by combining tools from classical regularization theory and stochastic analysis. Further, we analyze the preasymptotic weak and strong convergence behavior of the algorithm. The theoretical findings shed insights into the performance of the algorithm, and are complemented with illustrative numerical experiments.
Keywords: stochastic gradient descent; regularizing property; error estimates; preasymptotic convergence; strong convergence; weak convergence.

1 Introduction

In this paper, we consider the following finite-dimensional linear inverse problem:

Ax=y^{\delta}, (1.1)

where A\in\mathbb{R}^{n\times m} is the system matrix representing the data formation mechanism, and x\in\mathbb{R}^{m} is the unknown signal of interest. The noisy data y^{\delta}\in\mathbb{R}^{n} is formed by

y^{\delta}=y^{\dagger}+\xi,

where the vector \xi\in\mathbb{R}^{n} is the noise in the data, with a noise level \delta=\|\xi\| (and \bar{\delta}=n^{-\frac{1}{2}}\delta), and y^{\dagger}\in\mathbb{R}^{n} is the exact data formed by y^{\dagger}=Ax^{\dagger}, with x^{\dagger} being the true solution. The noise \xi is assumed to be a realization of an independent identically distributed (i.i.d.) mean zero Gaussian random vector. Throughout, we denote the ith row of the matrix A by a column vector a_{i}\in\mathbb{R}^{m}, and the ith entry of the vector y^{\delta} by y_{i}^{\delta}. The model (1.1) is representative of many discrete linear inverse problems. Hence, the stable and efficient numerical solution of the model (1.1) has been the topic of many research works, and plays an important role in developing practical inversion techniques (see, e.g., the monographs [6, 9, 22]).

Stochastic gradient descent (SGD), dating back at least to Robbins and Monro [21], is an extremely popular solver for large-scale least-squares type problems and statistical inference, and its accelerated variants represent state-of-the-art solvers for training (deep) neural networks [4, 13]. Such methods hold great potential for solving large-scale inverse problems. For example, the randomized Kaczmarz method, which has long been very popular in computed tomography [16], can be viewed as SGD with properly weighted sampling (see, e.g., [17] and [11, Prop. 4.1]). In this work, we consider the following version of SGD: given an initial guess x_{1}\in\mathbb{R}^{m}, we update the iterate x_{k+1}^{\delta} by

x_{k+1}^{\delta}=x_{k}^{\delta}-\eta_{k}((a_{i_{k}},x_{k}^{\delta})-y_{i_{k}}^{\delta})a_{i_{k}},\quad k=1,\ldots (1.2)

where the index i_{k} is drawn i.i.d. uniformly from the set \{1,\ldots,n\}, \eta_{k}>0 is the step size at the kth iteration, and (\cdot,\cdot) denotes the Euclidean inner product on \mathbb{R}^{m}. The update (1.2) can be derived by computing an unbiased gradient estimate ((a_{i_{k}},x)-y_{i_{k}}^{\delta})a_{i_{k}} from the functional \frac{1}{2}(y_{i_{k}}^{\delta}-(a_{i_{k}},x))^{2}, corresponding to uniformly randomly sampling one single datum, instead of the full gradient n^{-1}A^{t}(Ax-y^{\delta}) of the functional \frac{1}{2n}\sum_{i=1}^{n}(y_{i}^{\delta}-(a_{i},x))^{2} for the full data.
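
For concreteness, here is a minimal NumPy sketch of the update (1.2) with the step sizes \eta_{k}=c_{0}k^{-\alpha} used later in Assumption 2.1; the function name sgd_iterate and all inputs (A, y_delta, x1, c0, alpha, num_iter) are illustrative choices, not notation from the paper.

```python
import numpy as np

def sgd_iterate(A, y_delta, x1, c0, alpha, num_iter, rng=None):
    """Run the SGD iteration (1.2) with step sizes eta_k = c0 * k**(-alpha)."""
    rng = np.random.default_rng() if rng is None else rng
    n, m = A.shape
    x = x1.copy()
    for k in range(1, num_iter + 1):
        i = rng.integers(n)                    # draw i_k uniformly from {1,...,n}
        eta = c0 * k ** (-alpha)               # polynomially decaying step size
        residual_i = A[i] @ x - y_delta[i]     # (a_{i_k}, x_k) - y^delta_{i_k}
        x = x - eta * residual_i * A[i]        # update (1.2)
    return x
```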

The SGD iteration (1.2) is a randomized version of the classical Landweber iteration:

x_{k+1}^{\delta}=x_{k}^{\delta}-\eta_{k}n^{-1}A^{t}(Ax_{k}^{\delta}-y^{\delta}). (1.3)

In comparison with the Landweber iteration (1.3), SGD (1.2) requires evaluating only one data point per iteration, and thus the per-iteration cost is drastically reduced by a factor of n, which is especially beneficial for large-scale problems. In theory, the Landweber method (1.3) is known to be regularizing [6, Chapter 6]. However, the regularizing property of SGD remains to be established, even though it is widely conjectured and empirically examined (see, e.g., [8, 10]). Numerically, one observes a semiconvergence phenomenon for SGD: the iterate x_{k}^{\delta} first converges to the true solution x^{\dagger}, and then diverges as the iteration further proceeds. Semiconvergence is characteristic of deterministic iterative regularization techniques, and early stopping is often used to stabilize the iteration [6, 14]. This naturally motivates its use in obtaining a regularizing version of SGD.
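
The semiconvergence phenomenon is easy to reproduce numerically; the following sketch tracks the error \|x_{k}^{\delta}-x^{\dagger}\| along one SGD path on a synthetic, mildly ill-conditioned test problem (the problem setup, noise level and step-size exponent are all illustrative assumptions, not taken from the paper's experiments).

```python
import numpy as np

# Illustrative setup (not from the paper): a mildly ill-conditioned random problem.
rng = np.random.default_rng(0)
n, m = 200, 100
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
s = 1.0 / np.arange(1, m + 1)                        # decaying singular values
A = U[:, :m] @ np.diag(s) @ V.T
x_true = V @ (s ** 2 * rng.standard_normal(m))       # a source-type exact solution
y = A @ x_true
y_delta = y + 1e-3 * np.linalg.norm(y) * rng.standard_normal(n) / np.sqrt(n)

c0 = 1.0 / np.max(np.sum(A ** 2, axis=1))            # enforce c0 * max_i ||a_i||^2 <= 1
x = np.zeros(m)
errors = []
for k in range(1, 20001):
    i = rng.integers(n)
    eta = c0 * k ** (-0.1)
    x = x - eta * (A[i] @ x - y_delta[i]) * A[i]
    errors.append(np.linalg.norm(x - x_true))
# np.argmin(errors) locates the semiconvergence "turning point"; running past it
# lets the propagated data noise dominate, and the error grows again.
```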

The goal of this work is to analyze SGD (1.2) with a polynomially decaying sequence of step sizes of the form \eta_{j}=c_{0}j^{-\alpha}, \alpha\in(0,1), through the lens of regularization theory, i.e., the regularizing property and convergence rates under the canonical source condition. Our analysis exploits the specific structure of the least-squares quadratic functional and relies on decomposing the error into three components: the approximation error due to early stopping, the propagation error due to the presence of data noise, and the stochastic error due to the random index i_{k}. The first two parts are deterministic in nature and can be analyzed in a manner similar to the Landweber method (see, e.g., [6, Chapter 6]), and the last part constitutes the main technical challenge in the analysis. The randomness of the iterates introduces additional error and complicates the analysis of the algorithm. It is overcome by a careful analysis of the iterate variance and of the evolution of the (expected) residual. The analysis indicates that the stochasticity affects the convergence rate via the computational variance: indeed, there is a limit beyond which the smoothness of the solution does not further improve the convergence rate, in contrast to the deterministic case.

Consistency is concerned with the asymptotics as the noise level \delta tends to zero, and it does not tell the full story of an algorithm. Hence, we also study the preasymptotic convergence in both weak and strong norms, extending the recent work on the randomized Kaczmarz method [11]. This is achieved by dividing the error into low- and high-frequency components, and studying their dynamics separately. The analysis indicates that the low-frequency error can decay much faster than the high-frequency one. Under the canonical source type condition, the low-frequency error is dominating, and the analysis explains the fast initial convergence observed for SGD, thereby shedding insights into its practical performance. These theoretical findings are complemented with extensive numerical experiments.

Now we situate this work in the existing literature in two areas: inverse problems with random noise, and machine learning. Inverse problems with random noise have attracted considerable attention over the last decade. In a series of interesting works, Hohage and his collaborators [1, 2, 3] studied various regularization methods, e.g., Tikhonov regularization, iterative regularization and nonsmooth penalties, for solving linear and nonlinear inverse problems with random noise. For example, Bissantz et al. [2] analyzed Tikhonov regularization for nonlinear inverse problems, and established consistency and convergence rates. In these works, randomness enters into the problem formulation via the data y^{\delta} directly as a Hilbert space valued process, which is fixed (though random) when applying regularization techniques. Thus, the setting differs from SGD, for which the randomness arises from the random row indices i_{k} and changes at each iteration. Handling the iteration noise requires different techniques than those in these works.

There are also relevant works in machine learning [24, 23, 15, 5]. Ying and Pontil [24] studied an online least-squares gradient descent algorithm in a reproducing kernel Hilbert space (RKHS), and presented a novel capacity-independent approach to derive bounds on the generalization error. Tarres and Yao [23] studied an online learning algorithm closely related to SGD, and analyzed its convergence. Lin and Rosasco [15] analyzed the convergence of the value of the objective function for SGD (1.2), and discussed the influence of the batch size. See also the recent work [5] for SGD with averaging for nonparametric regression in RKHS. All these works analyze the methods in the framework of statistical learning, where the noise arises due to finite sampling from the underlying data distribution, whereas for inverse problems, the noise arises from the imperfect data acquisition process and enters into the data y^{\delta} directly. Further, the main focus of these works is to bound the generalization error, instead of error estimates for the regularized solution. Nonetheless, our proof strategy follows closely the existing works by means of decomposing the error into three different components.

The rest of the paper is organized as follows. In Section 2, we derive basic estimates for SGD iterates. Then in Section 3, we analyze the regularizing property of SGD with an a priori stopping rule, and prove convergence rates under classical source condition. In Section 4, we discuss the preasymptotic convergence of SGD. Some numerical results are given in Section 5. In an appendix, we collect some elementary inequalities which will be used extensively in the analysis. We conclude this section with some notation. Throughout, we use the superscript \delta in x_{k}^{\delta} to indicate the iterates for noisy data y^{\delta}, and as usual, denote by x_{k} the SGD iterates for the exact data y^{\dagger}. The notation \|\cdot\| denotes the Euclidean norm for vectors, and also the spectral norm for matrices. The notation c, with or without subscript, denotes a generic constant that is independent of the iteration index k and the noise level \delta.

2 Basic estimates

In this part, we provide basic estimates for the SGD iteration (1.2). These estimates are crucial for analyzing the SGD in Section 3.

2.1 Basic estimates on SGD iteration (1.2)

Now we derive useful estimates for analyzing SGD (1.2). Due to the stochasticity of the row index i_{k}, the iterate x_{k}^{\delta} is random. We shall measure its approximation error relative to the true solution x^{\dagger} by the mean squared error \mathbb{E}[\|x_{k}^{\delta}-x^{\dagger}\|^{2}], where the expectation \mathbb{E}[\cdot] is with respect to the random index i_{k}. The overall strategy is to decompose the error into three components: approximation error due to early stopping, propagation error due to noise and stochastic error due to the random choice of the index i_{k}. Indeed, by bias-variance decomposition and triangle inequality, we have

\displaystyle\mathbb{E}[\|x_{k}^{\delta}-x^{\dagger}\|^{2}] \displaystyle=\|\mathbb{E}[x_{k}^{\delta}]-x^{\dagger}\|^{2}+\mathbb{E}[\|\mathbb{E}[x_{k}^{\delta}]-x_{k}^{\delta}\|^{2}]
\displaystyle\leq 2\|\mathbb{E}[x_{k}]-x^{\dagger}\|^{2}+2\|\mathbb{E}[x_{k}-x_{k}^{\delta}]\|^{2}+\mathbb{E}[\|\mathbb{E}[x_{k}^{\delta}]-x_{k}^{\delta}\|^{2}], (2.1)

where x_{k} is the random iterate for exact data y^{\dagger}. The objective of this part is to derive various bounds on the terms in (2.1).

For the analysis, we first introduce auxiliary iterations. Let e_{k}^{\delta}=x_{k}^{\delta}-x^{\dagger} and e_{k}=x_{k}-x^{\dagger} be the errors for SGD iterates x_{k}^{\delta} and x_{k}, for y^{\delta} and y^{\dagger}, respectively. They satisfy the following recursion:

\displaystyle e_{k+1} \displaystyle=e_{k}-\eta_{k}((a_{i_{k}},x_{k})-y_{i_{k}}^{\dagger})a_{i_{k}}=e_{k}-\eta_{k}(a_{i_{k}},e_{k})a_{i_{k}}, (2.2)
\displaystyle e_{k+1}^{\delta} \displaystyle=e_{k}^{\delta}-\eta_{k}((a_{i_{k}},x_{k}^{\delta})-y_{i_{k}}^{\delta})a_{i_{k}}=e_{k}^{\delta}-\eta_{k}(a_{i_{k}},e_{k}^{\delta})a_{i_{k}}+\eta_{k}\xi_{i_{k}}a_{i_{k}}. (2.3)

We denote by \{\mathcal{F}_{k}\}_{k\geq 1} a sequence of increasing \sigma-fields generated by the random index i_{k} up to the kth iteration. Then we introduce two auxiliary matrices: for any vector b\in\mathbb{R}^{n},

\displaystyle B:=\mathbb{E}[a_{i}a_{i}^{t}]\quad\mbox{and}\quad\bar{A}^{t}b:=\mathbb{E}[a_{i}b_{i}].

Clearly, under uniform sampling for the index i_{k}, B=n^{-1}A^{t}A and \bar{A}^{t}=n^{-1}A^{t}.
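
The identity B=n^{-1}A^{t}A under uniform sampling can be checked directly, e.g., with the short NumPy snippet below (the random test matrix is an arbitrary assumption).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20))
n = A.shape[0]

# Under uniform sampling of the index i, E[a_i a_i^t] equals n^{-1} A^t A:
B_expectation = sum(np.outer(A[i], A[i]) for i in range(n)) / n
assert np.allclose(B_expectation, A.T @ A / n)
```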

Throughout, we consider the following choice of the step size \eta_{j}, which is commonly employed for SGD. Clearly the condition c_{0}\max_{i}\|a_{i}\|^{2}\leq 1 implies c_{0}\|B\|\leq 1.

Assumption 2.1.

The step size \eta_{j}=c_{0}j^{-\alpha}, j=1,2,\ldots, \alpha\in(0,1), with c_{0}\max_{i}\|a_{i}\|^{2}\leq 1.
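
In code, Assumption 2.1 amounts to choosing c_{0} from the row norms of A and forming the polynomially decaying schedule; a small helper (hypothetical name step_sizes, with a user-chosen safety factor) might read:

```python
import numpy as np

def step_sizes(A, alpha, num_iter, safety=1.0):
    """Step sizes eta_j = c0 * j**(-alpha) with c0 * max_i ||a_i||^2 <= 1 (Assumption 2.1)."""
    c0 = safety / np.max(np.sum(A ** 2, axis=1))   # max_i ||a_i||^2 over the rows of A
    j = np.arange(1, num_iter + 1, dtype=float)
    return c0, c0 * j ** (-alpha)
```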

Next we recall the concept of source condition:

x^{\dagger}-x_{1}=B^{p}w,\quad p\geq 0. (2.4)

This condition is often employed in classical regularization theory [6, 9] for deterministic inverse problems. It represents smoothness on the initial error x^{\dagger}-x_{1}, and is necessary for deriving convergence rates.

Now we bound the weighted norm \|B^{s}\mathbb{E}[e_{k}]\| of the approximation error \mathbb{E}[e_{k}]. The cases s=0 and s=1/2 will be used for bounding the approximation error and the residual, respectively.

Theorem 2.1.

Let Assumption 2.1 be fulfilled. Under the source condition (2.4) and for any s\geq 0, there holds (with c_{p,s}=(\frac{(p+s)(1-\alpha)}{c_{0}e(2^{1-\alpha}-1)})^{p+s}\|w\|)

\|B^{s}\mathbb{E}[e_{k+1}]\|\leq c_{p,s}k^{-(p+s)(1-\alpha)}.
Proof.

It follows from the relation (2.2) and the identity y_{i}^{\dagger}=(a_{i},x^{\dagger}) that the error e_{k} satisfies

\displaystyle\mathbb{E}[e_{k+1}|\mathcal{F}_{k-1}] \displaystyle=(I-\eta_{k}\mathbb{E}[a_{i}a_{i}^{t}])e_{k}=(I-\eta_{k}B)e_{k}.

Taking the full expectation yields a Landweber type iteration:

\mathbb{E}[e_{k+1}]=(I-\eta_{k}B)\mathbb{E}[e_{k}]. (2.5)

Repeatedly applying the recursion (2.5) gives

\mathbb{E}[e_{k+1}]=\prod_{i=1}^{k}(I-\eta_{i}B)\mathbb{E}[e_{1}]=\prod_{i=1}^{k}(I-\eta_{i}B)e_{1},

where the last step follows since e_{1} is deterministic. Now, with the shorthand notation

\Pi_{j}^{k}(B)=\prod_{i=j}^{k}(I-\eta_{i}B),\quad j\leq k, (2.6)

(and the convention \Pi_{k+1}^{k}(B)=I), and noting the source condition (2.4), we deduce

\|\mathbb{E}[B^{s}e_{k+1}]\|\leq\|\Pi_{1}^{k}(B)B^{p+s}\|\|w\|.

By Lemmas A.1 and A.2, we arrive at

\displaystyle\|B^{s}\mathbb{E}[e_{k+1}]\| \displaystyle\leq\frac{(p+s)^{p+s}}{e^{p+s}(\sum_{i=1}^{k}\eta_{i})^{p+s}}\|w\|\leq c_{p,s}k^{-(p+s)(1-\alpha)},

with a constant c_{p,s}=(\frac{(p+s)(1-\alpha)}{c_{0}e(2^{1-\alpha}-1)})^{p+s}\|w\|. This completes the proof of the theorem. ∎

Remark 2.1.

The constant c_{p,s} is uniformly bounded in \alpha\in[0,1]: \lim_{\alpha\to 1^{-}}\frac{1-\alpha}{2^{1-\alpha}-1}=(\ln 2)^{-1}.

Next we bound the weighted norm of the propagation error \mathbb{E}[x_{k}^{\delta}-x_{k}] due to data noise \xi.

Theorem 2.2.

Let Assumption 2.1 be fulfilled, s\in[0,\frac{1}{2}], and r=\frac{1}{2}+s. Then there holds

\|B^{s}\mathbb{E}[x_{k+1}-x_{k+1}^{\delta}]\|\leq c_{r,\alpha}\bar{\delta}\left\{\begin{array}{ll}k^{(1-r)(1-\alpha)},&0\leq r<1,\\ \max(1,\ln k),&r=1,\end{array}\right.

with \bar{\delta}=n^{-\frac{1}{2}}\delta, and the constant c_{r,\alpha} given by

c_{r,\alpha}=c_{0}^{1-r}\left\{\begin{array}{ll}\frac{r^{r}}{e^{r}}B(1-\alpha,1-r)+1,&r<1,\\ \frac{r^{r}}{e^{r}}2^{\alpha}(1-\alpha)^{-1}+1,&r=1.\end{array}\right.
Proof.

It follows from the recursive relations (2.2) and (2.3) that the (expected) propagation error \nu_{k}=\mathbb{E}[x_{k}^{\delta}-x_{k}] satisfies (with \xi=y^{\delta}-y^{\dagger} being the data noise)

\displaystyle\nu_{k+1}=(I-\eta_{k}B)\nu_{k}+\eta_{k}\bar{A}^{t}\xi,

with the initial condition \nu_{1}=0. With the notation \Pi_{i}^{k}(B) from (2.6) (and the convention \Pi_{k+1}^{k}(B)=I), applying the recursion repeatedly yields

\nu_{k+1}=\sum_{j=1}^{k}\eta_{j}\Pi_{j+1}^{k}(B)\bar{A}^{t}\xi.

Thus, by the triangle inequality, we have

\displaystyle\|B^{s}\nu_{k+1}\|\leq\sum_{j=1}^{k}\eta_{j}\|B^{s}\Pi_{j+1}^{k}(B)\bar{A}^{t}\|\|\xi\|.

Since \|B^{s}\Pi_{j+1}^{k}(B)\bar{A}^{t}\|=n^{-\frac{1}{2}}\|\Pi_{j+1}^{k}(B)B^{s+\frac{1}{2}}\|, by Lemma A.1 and the identity n^{-\frac{1}{2}}\|\xi\|=\bar{\delta} for the noise level, we obtain

\displaystyle\|B^{s}\nu_{k+1}\| \displaystyle\leq\frac{r^{r}}{e^{r}}\sum_{j=1}^{k-1}\frac{\eta_{j}}{(\sum_{i=j+1}^{k}\eta_{i})^{r}}\bar{\delta}+\eta_{k}\|B^{s}\bar{A}^{t}\|\|\xi\|
\displaystyle=\Big{(}\frac{r^{r}}{e^{r}}\sum_{j=1}^{k-1}\frac{\eta_{j}}{(\sum_{i=j+1}^{k}\eta_{i})^{r}}+k^{-\alpha}c_{0}\|B\|^{r}\Big{)}\bar{\delta}.

Under Assumption 2.1, we have c_{0}\|B\|^{r}\leq c_{0}^{1-r}. This together with Lemma A.2 completes the proof. ∎

The last two results together yield an error bound on the mean \mathbb{E}[x_{k}^{\delta}].

Corollary 2.1.

Let Assumption 2.1 be fulfilled. Under the source condition (2.4), there holds

\|\mathbb{E}[x_{k+1}^{\delta}]-x^{\dagger}\|\leq c_{p}k^{-p(1-\alpha)}+c_{\alpha}k^{\frac{1-\alpha}{2}}\bar{\delta}.
Remark 2.2.

\mathbb{E}[x_{k}] and \mathbb{E}[x_{k}^{\delta}] satisfy the recursive relation of the Landweber method. Hence, these estimates are deterministic in nature, and the proof follows closely that for the Landweber method, with error bounds resembling those in [6, Chapter 6]. Such estimates are known as weak errors in the literature on stochastic differential equations. Theorems 2.1 and 2.2 with s=0 give bounds on the approximation and propagation errors. In particular, when \alpha=0 (constant step size), we have

\displaystyle\|\mathbb{E}[x_{k+1}]-x^{\dagger}\| \displaystyle\leq c_{p}k^{-p}\quad\mbox{and}\quad\|\mathbb{E}[x_{k+1}^{\delta}-x_{k+1}]\|\leq c_{\alpha}k^{\frac{1}{2}}\bar{\delta}.

The next result gives a recurrence relation on the variance \mathbb{E}[\|B^{s}(x_{k}^{\delta}-\mathbb{E}[x_{k}^{\delta}])\|^{2}] of the approximation x_{k}^{\delta}. It arises from the random index i_{k} in SGD (1.2). Theorem 2.3 indicates that the variance can be bounded by the expected residuals \{\mathbb{E}[\|Ax_{j}^{\delta}-y^{\delta}\|^{2}]\}_{j=1}^{k}, and also step sizes \{\eta_{j}\}_{j=1}^{k}. The extra exponent \frac{1}{2} relies crucially on the quadratic structure of the least-squares functional.

Theorem 2.3.

For the SGD iteration (1.2), there holds

\mathbb{E}[\|B^{s}(x_{k+1}^{\delta}-\mathbb{E}[x_{k+1}^{\delta}])\|^{2}]\leq\sum_{j=1}^{k}\eta_{j}^{2}\|B^{s+\frac{1}{2}}\Pi_{j+1}^{k}(B)\|^{2}\mathbb{E}[\|Ax_{j}^{\delta}-y^{\delta}\|^{2}].
Proof.

Let z_{k}=x_{k}^{\delta}-\mathbb{E}[x_{k}^{\delta}]. By the definition of the iterate x_{k}^{\delta} in (2.3), we directly deduce

\mathbb{E}[x_{k+1}^{\delta}]=\mathbb{E}[x_{k}^{\delta}]-\eta_{k}(B\mathbb{E}[x_{k}^{\delta}]-\bar{A}^{t}y^{\delta}),

and thus z_{k} satisfies the iteration

z_{k+1}=z_{k}-\eta_{k}[((a_{i_{k}},x_{k}^{\delta})-y_{i_{k}}^{\delta})a_{i_{k}}-(B\mathbb{E}[x_{k}^{\delta}]-\bar{A}^{t}y^{\delta})],

with the initial condition z_{1}=0. Upon rewriting, z_{k} satisfies

z_{k+1}=(I-\eta_{k}B)z_{k}+\eta_{k}M_{k}, (2.7)

where the iteration noise M_{k} is defined by

M_{k}=(Bx_{k}^{\delta}-\bar{A}^{t}y^{\delta})-((a_{i_{k}},x_{k}^{\delta})-y_{i_{k}}^{\delta})a_{i_{k}}.

Since x_{j}^{\delta} is measurable with respect to \mathcal{F}_{j-1}, \mathbb{E}[M_{j}|\mathcal{F}_{j-1}]=0, and thus \mathbb{E}[M_{j}]=0. Further, for j\neq\ell, M_{j} and M_{\ell} are uncorrelated:

\mathbb{E}[(M_{j},M_{\ell})]=0,\quad\forall j\neq\ell. (2.8)

Indeed, for j<\ell, we have \mathbb{E}[(M_{j},M_{\ell})|\mathcal{F}_{\ell-1}]=(M_{j},\mathbb{E}[M_{\ell}|\mathcal{F}_{\ell-1}])=0, since M_{j} is measurable with respect to \mathcal{F}_{\ell-1}. Then taking full expectation yields (2.8). Applying (2.7) repeatedly gives

z_{k+1}=\sum_{j=1}^{k}\eta_{j}\Pi_{j+1}^{k}(B)M_{j}.

Then it follows from the uncorrelatedness (2.8) that

\displaystyle\mathbb{E}[\|B^{s}z_{k+1}\|^{2}] \displaystyle=\sum_{j=1}^{k}\sum_{\ell=1}^{k}\eta_{j}\eta_{\ell}\mathbb{E}[(B^{s}\Pi_{j+1}^{k}(B)M_{j},B^{s}\Pi_{\ell+1}^{k}(B)M_{\ell})]=\sum_{j=1}^{k}\eta_{j}^{2}\mathbb{E}[\|B^{s}\Pi_{j+1}^{k}(B)M_{j}\|^{2}].

Since a_{i}=A^{t}e_{i} (with e_{i} being the ith Cartesian vector), we have (with the notation \bar{y}^{\delta}=n^{-1}y^{\delta})

\displaystyle M_{j} \displaystyle=A^{t}(\bar{A}x_{j}^{\delta}-\bar{y}^{\delta})-((a_{i_{j}},x_{j}^{\delta})-y_{i_{j}}^{\delta})A^{t}e_{i_{j}}
\displaystyle=A^{t}[(\bar{A}x_{j}^{\delta}-\bar{y}^{\delta})-((a_{i_{j}},x_{j}^{\delta})-y_{i_{j}}^{\delta})e_{i_{j}}]:=A^{t}N_{j}.

This and the identity \|B^{s}\Pi_{j+1}^{k}(B)A^{t}\|^{2}=n\|B^{s}\Pi_{j+1}^{k}(B)B^{\frac{1}{2}}\|^{2} yield

\displaystyle\mathbb{E}[\|B^{s}\Pi_{j+1}^{k}(B)M_{j}\|^{2}] \displaystyle\leq\|B^{s}\Pi_{j+1}^{k}(B)A^{t}\|^{2}\mathbb{E}[\|N_{j}\|^{2}]=\|B^{s+\frac{1}{2}}\Pi_{j+1}^{k}(B)\|^{2}\mathbb{E}[n\|N_{j}\|^{2}].

By the measurability of x_{j}^{\delta} with respect to \mathcal{F}_{j-1}, we can bound the term \mathbb{E}[\|N_{j}\|^{2}]:

\displaystyle\mathbb{E}[\|N_{j}\|^{2}|\mathcal{F}_{j-1}] \displaystyle=\mathbb{E}[\|(\bar{A}x_{j}^{\delta}-\bar{y}^{\delta})-((a_{i_{j}},x_{j}^{\delta})-y_{i_{j}}^{\delta})e_{i_{j}}\|^{2}|\mathcal{F}_{j-1}]
\displaystyle\leq\sum_{i=1}^{n}n^{-1}\|((a_{i},x_{j}^{\delta})-y_{i}^{\delta})e_{i}\|^{2}=n^{-1}\|Ax_{j}^{\delta}-y^{\delta}\|^{2},

where the inequality follows from the identity \mathbb{E}[((a_{i_{j}},x_{j}^{\delta})-y_{i_{j}}^{\delta})e_{i_{j}}|\mathcal{F}_{j-1}]=\bar{A}x_{j}^{\delta}-\bar{y}^{\delta} and the bias-variance decomposition. Thus, by taking full expectation, we obtain

\displaystyle\mathbb{E}[\|N_{j}\|^{2}]\leq n^{-1}\mathbb{E}[\|Ax_{j}^{\delta}-y^{\delta}\|^{2}].

Combining the preceding bounds yields the desired assertion. ∎
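
The quantities controlled by Theorem 2.3, namely the mean iterate and the computational variance \mathbb{E}[\|x_{k}^{\delta}-\mathbb{E}[x_{k}^{\delta}]\|^{2}], can be estimated empirically by Monte Carlo over independent SGD paths; the sketch below is self-contained and purely illustrative (all function names and the number of chains are assumptions).

```python
import numpy as np

def run_chain(A, y_delta, x1, c0, alpha, K, rng):
    # One realization of the SGD path (1.2), returning x_{K+1}^delta.
    x = x1.copy()
    for k in range(1, K + 1):
        i = rng.integers(A.shape[0])
        x = x - c0 * k ** (-alpha) * (A[i] @ x - y_delta[i]) * A[i]
    return x

def mean_and_variance(A, y_delta, x1, c0, alpha, K, num_chains=500, seed=0):
    # Monte Carlo estimates of E[x_{K+1}^delta] and E||x_{K+1}^delta - E[x_{K+1}^delta]||^2,
    # the two ingredients of the bias-variance decomposition (2.1).
    rng = np.random.default_rng(seed)
    X = np.stack([run_chain(A, y_delta, x1, c0, alpha, K, rng) for _ in range(num_chains)])
    mean = X.mean(axis=0)
    variance = np.mean(np.sum((X - mean) ** 2, axis=1))
    return mean, variance
```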

2.2 Bounding the expected residuals

Now we derive bounds on the expected residual \mathbb{E}[\|Ax_{k}^{\delta}-y^{\delta}\|^{2}], which, together with Theorem 2.3, will be used to control the variance of x_{k}^{\delta}. We begin with a technical lemma.

Lemma 2.1.

Given \{b_{j}\}_{j=1}^{\infty}\subset\mathbb{R}_{+}, a_{1}\geq 0 and c_{i}>0, let \{a_{j}\}_{j=2}^{\infty}\subset\mathbb{R}_{+} be defined by

a_{k+1}=c_{1}\sum_{j=1}^{k-1}\frac{\eta_{j}^{2}}{\sum_{i=j+1}^{k}\eta_{i}}a_{j}+c_{2}k^{-2\alpha}a_{k}+b_{k}.

Then, under Assumption 2.1, the following statements hold.

  • If \sup_{j}b_{j}<\infty, then \{a_{j}\}_{j=1}^{\infty} is bounded by a constant depending on \alpha, \sup_{j}b_{j} and the c_{i}'s. Further, if b_{j}\leq c_{3}j^{-\gamma}, j\in\mathbb{N}, then for some constant c(\alpha,\gamma,c_{i},\ell) depending on \alpha, \gamma, \ell and the c_{i}'s, there holds

    a_{k+1}\leq c(\alpha,\gamma,c_{i},\ell)k^{-\min(\ell\alpha,1-\alpha,\gamma)}\ln^{\ell}k.
  • If b_{j} is nondecreasing, then for some constant c(\alpha,c_{i}) depending on \alpha and the c_{i}'s, there holds

    a_{k+1}\leq c(\alpha,c_{i})k^{-\min(\alpha,1-\alpha)}\ln k+2b_{k}.
Proof.

The proof proceeds by mathematical induction. Let c_{\alpha}=c(\alpha,0,1)+c^{\prime}(\alpha,0,1) from Lemma A.3. Let k_{*}\in\mathbb{N} be such that c_{1}c_{\alpha}k^{-\min(1-\alpha,\alpha)}\ln k+c_{2}k^{-2\alpha}<1/2 for any k\geq k_{*}. The existence of a finite k_{*} follows by the monotonicity of f(t)=t^{-\min(1-\alpha,\alpha)}\ln t for large t>0 and \lim_{t\to\infty}f(t)=0.

(a) part (i). Let b=\sup_{j}b_{j}. Then a_{*}=\max_{1\leq i\leq k_{*}}a_{i}<\infty. Let a^{*}=2b+a_{*}. We claim that a_{k}\leq a^{*} for any k\in\mathbb{N}. By construction, this holds trivially for k=1,\ldots,k_{*}. Suppose now that the claim holds for all indices up to some k\geq k_{*}. Then by Lemma A.3, for k+1, there holds

\displaystyle a_{k+1} \displaystyle\leq\max_{1\leq i\leq k}a_{i}\Big{(}c_{1}\sum_{j=1}^{k-1}\frac{\eta_{j}^{2}}{\sum_{i=j+1}^{k}\eta_{i}}+c_{2}k^{-2\alpha}\Big{)}+b
\displaystyle\leq a^{*}(c_{1}c_{\alpha}k^{-\min(\alpha,1-\alpha)}\ln k+c_{2}k^{-2\alpha})+b\leq\tfrac{1}{2}(a_{*}+2b)+b\leq a_{*}+2b,

showing the desired claim. Next we show the decay estimate. By Lemma A.3, for any k\in\mathbb{N}, we have

\displaystyle a_{k+1} \displaystyle\leq\max_{1\leq i\leq k}a_{i}\Big{(}c_{1}\sum_{j=1}^{k-1}\frac{\eta_{j}^{2}}{\sum_{i=j+1}^{k}\eta_{i}}+c_{2}k^{-2\alpha}\Big{)}+c_{3}k^{-\gamma}
\displaystyle\leq a^{*}(c_{1}c_{\alpha}k^{-\min(\alpha,1-\alpha)}\ln k+c_{2}k^{-2\alpha})+c_{3}k^{-\gamma}.

Setting c(\alpha,\gamma,c_{i})=a^{*}(c_{1}c_{\alpha}+c_{2})+c_{3} shows the case \ell=1, and the rest follows similarly.

(b) part (ii). The proof is analogous to part (i). First, the argument in part (i) shows that a_{k}\leq a_{*}+2b_{k} for any k\in\mathbb{N}, with a_{*}=\max_{1\leq i\leq k_{*}}a_{i}. Clearly, the estimate in part (ii) is trivial for k\leq k_{*}. Next, by applying Lemma A.3, for any k>k_{*}, we have

\displaystyle a_{k+1} \displaystyle\leq(a_{*}+2b_{k})(c_{1}c_{\alpha}k^{-\min(\alpha,1-\alpha)}\ln k+c_{2}k^{-2\alpha})+b_{k}
\displaystyle\leq c(\alpha,c_{i})k^{-\min(\alpha,1-\alpha)}\ln k+2b_{k}.

This completes the proof of the lemma. ∎

Now we can state a decay estimate of the expected residual \mathbb{E}[\|Ax_{k}-y^{\dagger}\|^{2}] (for exact data y^{\dagger}).

Theorem 2.4.

Let Assumption 2.1 be fulfilled. Then, under the source condition (2.4), there holds

\mathbb{E}[\|Ax_{k+1}-y^{\dagger}\|^{2}]\leq c(p,\alpha,\ell)k^{-\min(\ell\alpha,2(1-\alpha),(2p+1)(1-\alpha))}\ln^{\ell}k,\quad\forall\ell\geq 2,

where the constant c(p,\alpha,\ell) depends on p, \alpha, \|Ax_{1}-y^{\dagger}\|, \|A\| and \ell.

Proof.

Since y^{\dagger}=Ax^{\dagger}, by the bias-variance decomposition, for any k\geq 1, there holds

\displaystyle\mathbb{E}[\|A(x_{k+1}-x^{\dagger})\|^{2}]=\mathbb{E}[\|A(x_{k+1}-\mathbb{E}[x_{k+1}])\|^{2}]+\|A(\mathbb{E}[x_{k+1}]-x^{\dagger})\|^{2}.

For the bias \|A(\mathbb{E}[x_{k+1}]-x^{\dagger})\|^{2}, Theorem 2.1 gives (with c_{p}=n(\frac{(\frac{1}{2}+p)(1-\alpha)}{c_{0}e(2^{1-\alpha}-1)})^{1+2p}\|w\|^{2})

\displaystyle\|A(\mathbb{E}[x_{k+1}]-x^{\dagger})\|^{2} \displaystyle=n\|B^{\frac{1}{2}}(\mathbb{E}[x_{k+1}]-x^{\dagger})\|^{2}\leq c_{p}k^{-(2p+1)(1-\alpha)}. (2.9)

Next, Theorem 2.3 (with s=\frac{1}{2}) and Lemmas A.1 and A.3 (with r=1) imply

\displaystyle\mathbb{E}[\|A(x_{k+1}-\mathbb{E}[x_{k+1}])\|^{2}]\leq n\sum_{j=1}^{k}\eta_{j}^{2}\|\Pi_{j+1}^{k}(B)B\|^{2}\mathbb{E}[\|Ax_{j}-y^{\dagger}\|^{2}]
\displaystyle\leq \displaystyle c_{1}\sum_{j=1}^{k-1}\frac{\eta_{j}^{2}}{\sum_{i=j+1}^{k}\eta_{i}}\mathbb{E}[\|Ax_{j}-y^{\dagger}\|^{2}]+c_{2}k^{-2\alpha}\mathbb{E}[\|Ax_{k}-y^{\dagger}\|^{2}], (2.10)

with c_{1}=(2e)^{-1}\|A\|^{2} and c_{2}=c_{0}\|A\|^{2}, where the last step is due to Assumption 2.1, i.e., c_{0}\|B\|\leq 1, and the identity n\|B\|=\|A\|^{2}. Thus,

\mathbb{E}[\|Ax_{k+1}-y^{\dagger}\|^{2}]\leq c_{1}\sum_{j=1}^{k-1}\frac{\eta_{j}^{2}}{\sum_{i=j+1}^{k}\eta_{i}}\mathbb{E}[\|Ax_{j}-y^{\dagger}\|^{2}]+c_{2}k^{-2\alpha}\mathbb{E}[\|Ax_{k}-y^{\dagger}\|^{2}]+c_{p}k^{-(2p+1)(1-\alpha)}.

Then Lemma 2.1(i) (with \ell=1) yields

\mathbb{E}[\|Ax_{k+1}-y^{\dagger}\|^{2}]\leq c(p,\alpha)k^{-\min(\alpha,1-\alpha)}\ln k. (2.11)

Next, we bootstrap the estimate by applying Lemma A.1 to (2.2) (by letting p=1 for j=1,\ldots,[\frac{k}{2}] and p=\frac{1}{2} for j=[\frac{k}{2}]+1,\ldots,k-1, respectively, and c_{3}=4ne^{-2}):

\displaystyle\mathbb{E}[\|A(x_{k+1}-\mathbb{E}[x_{k+1}])\|^{2}]\leq c_{3}\sum_{j=1}^{[\frac{k}{2}]}\frac{\eta_{j}^{2}}{(\sum_{i=j+1}^{k}\eta_{i})^{2}}\mathbb{E}[\|Ax_{j}-y^{\dagger}\|^{2}]
\displaystyle\quad+c_{1}\sum_{j=[\frac{k}{2}]+1}^{k-1}\frac{\eta_{j}^{2}}{\sum_{i=j+1}^{k}\eta_{i}}\mathbb{E}[\|Ax_{j}-y^{\dagger}\|^{2}]+c_{2}k^{-2\alpha}\mathbb{E}[\|Ax_{k}-y^{\dagger}\|^{2}]:={\rm I}_{1}+{\rm I}_{2}+{\rm I}_{3}.

Then we bound the three terms {\rm I}_{i} by Lemma A.3 with \beta=\min(\alpha,1-\alpha) and (2.11):

\displaystyle{\rm I}_{1} \displaystyle\leq c_{3}c(p,\alpha)c_{\alpha,\beta,2}k^{-2(1-\alpha)+\max(0,1-2\alpha-\beta)}\ln^{2}k,
\displaystyle{\rm I}_{2} \displaystyle\leq c_{1}c(p,\alpha)c^{\prime}_{\alpha,\beta,1}k^{-(\alpha+\beta)}\ln^{2}k,\quad\mbox{and}\quad{\rm I_{3}}\leq c_{2}c(p,\alpha)k^{-2\alpha-\beta}\ln k,

where the constants c_{\alpha,\beta,2} and c_{\alpha,\beta,1}^{\prime} are defined in Lemma A.3. These three estimates and (2.9) imply

\mathbb{E}[\|Ax_{k}-y^{\dagger}\|^{2}]\leq ck^{-\gamma}\ln^{2}k,

with c=c(p,\alpha)(c_{3}c_{\alpha,\beta,2}+c_{1}c^{\prime}_{\alpha,\beta,1}+c_{2})+c_{p}. The exponent \gamma is computed from the identities \max(0,1-2\alpha-\beta)=\max(0,1-3\alpha) and \alpha+\beta=\min(2\alpha,1), and thus

\displaystyle\gamma \displaystyle=\min(2\alpha,2(1-\alpha)-\max(0,1-3\alpha),(2p+1)(1-\alpha))
\displaystyle=\min(2\alpha,2(1-\alpha),(2p+1)(1-\alpha)).

Now repeating the bootstrapping argument yields the desired estimate and completes the proof. ∎

Remark 2.3.

For any \alpha>0, the decay is essentially O(k^{-\min(2-2\alpha,(2p+1)(1-\alpha))}), modulo the \log factor, which shows the impact of condition (2.4). The constant c(p,\alpha,\ell) may be pessimistic for large \ell.

Last, we state a bound on the expected residual \mathbb{E}[\|Ax_{k}^{\delta}-y^{\delta}\|^{2}] for noisy data y^{\delta}.

Theorem 2.5.

Let Assumption 2.1 be fulfilled. Then under the source condition (2.4), there holds

\mathbb{E}[\|Ax_{k+1}^{\delta}-y^{\delta}\|^{2}]\leq c(\alpha,p)k^{-\min(\ell\alpha,2-2\alpha,(2p+1)(1-\alpha))}\ln^{\ell}k+c^{\prime}(\alpha,p)\delta^{2}\ln^{2}k,

for any \ell\geq 2, where the constants c(\alpha,p) and c^{\prime}(\alpha,p) depend on \alpha, p, \|A\|, y^{\delta} and \ell.

Proof.

By bias-variance decomposition and the triangle inequality, we have

\displaystyle\mathbb{E}[\|Ax_{k+1}^{\delta}-y^{\delta}\|^{2}]=\|A\mathbb{E}[x_{k+1}^{\delta}]-y^{\delta}\|^{2}+\mathbb{E}[\|A(x_{k+1}^{\delta}-\mathbb{E}[x_{k+1}^{\delta}])\|^{2}]
\displaystyle\leq \displaystyle 4\|A(\mathbb{E}[x_{k+1}]-x^{\dagger})\|^{2}+4\|A\mathbb{E}[x_{k+1}^{\delta}-x_{k+1}]\|^{2}+\mathbb{E}[\|A(x_{k+1}^{\delta}-\mathbb{E}[x_{k+1}^{\delta}])\|^{2}]+2\delta^{2}.

Theorems 2.1 and 2.2 imply (with c_{p}=n(\frac{(\frac{1}{2}+p)(1-\alpha)}{c_{0}e(2^{1-\alpha}-1)})^{1+2p}\|w\|^{2} and c_{\alpha}=(\frac{2^{\alpha}}{e(1-\alpha)}+1)^{2})

\displaystyle\|A(\mathbb{E}[x_{k+1}]-x^{\dagger})\|^{2} \displaystyle=\|(A^{t}A)^{\frac{1}{2}}(\mathbb{E}[x_{k+1}]-x^{\dagger})\|^{2}\leq c_{p}k^{(1+2p)(\alpha-1)},
\displaystyle\|A\mathbb{E}[x_{k+1}^{\delta}-x_{k+1}]\|^{2} \displaystyle=\|(A^{t}A)^{\frac{1}{2}}(\mathbb{E}[x_{k+1}^{\delta}-x_{k+1}])\|^{2}\leq c_{\alpha}\delta^{2}\max(1,\ln k)^{2}.

Next, we bound the variance \mathbb{E}[\|A(x_{k+1}^{\delta}-\mathbb{E}[x_{k+1}^{\delta}])\|^{2}] by Theorem 2.3 with s=\frac{1}{2} and Lemma A.1:

\displaystyle\mathbb{E}[\|A(x_{k+1}^{\delta}-\mathbb{E}[x_{k+1}^{\delta}])\|^{2}] \displaystyle\leq n\sum_{j=1}^{k}\eta_{j}^{2}\|\Pi_{j+1}^{k}(B)B\|^{2}\mathbb{E}[\|Ax_{j}^{\delta}-y^{\delta}\|^{2}] (2.12)
\displaystyle\leq c_{1}\sum_{j=1}^{k-1}\frac{\eta_{j}^{2}}{\sum_{i=j+1}^{k}\eta_{i}}\mathbb{E}[\|Ax_{j}^{\delta}-y^{\delta}\|^{2}]+c_{2}k^{-2\alpha}\mathbb{E}[\|Ax_{k}^{\delta}-y^{\delta}\|^{2}],

with c_{1}=e^{-1}\|A\|^{2} and c_{2}=c_{0}\|A\|^{2}. Combining these estimates yields (with c_{3}=4c_{p} and c_{4}=4c_{\alpha}+2)

\displaystyle\mathbb{E}[\|Ax_{k+1}^{\delta}-y^{\delta}\|^{2}] \displaystyle\leq c_{1}\sum_{j=1}^{k-1}\frac{\eta_{j}^{2}}{\sum_{i=j+1}^{k}\eta_{i}}\mathbb{E}[\|Ax_{j}^{\delta}-y^{\delta}\|^{2}]
\displaystyle\quad+c_{2}k^{-2\alpha}\mathbb{E}[\|Ax_{k}^{\delta}-y^{\delta}\|^{2}]+c_{3}k^{-(2p+1)(1-\alpha)}+c_{4}\delta^{2}\max(1,\ln k)^{2}.

This and Lemma 2.1 directly imply

\mathbb{E}[\|Ax_{k+1}^{\delta}-y^{\delta}\|^{2}]\leq c(p,\alpha)k^{-\min(\alpha,1-\alpha)}\ln k+2c_{4}\delta^{2}\max(1,\ln k)^{2}. (2.13)

Next, we refine the estimate by bootstrapping. Lemma A.1 and (2.12) yield (with c_{5}=ne^{-2})

\displaystyle\mathbb{E}[\|A(x_{k+1}^{\delta}-\mathbb{E}[x_{k+1}^{\delta}])\|^{2}]\leq c_{5}\sum_{j=1}^{[\frac{k}{2}]}\frac{\eta_{j}^{2}}{(\sum_{i=j+1}^{k}\eta_{i})^{2}}\mathbb{E}[\|Ax_{j}^{\delta}-y^{\delta}\|^{2}]
\displaystyle\quad+c_{1}\sum_{j=[\frac{k}{2}]+1}^{k-1}\frac{\eta_{j}^{2}}{\sum_{i=j+1}^{k}\eta_{i}}\mathbb{E}[\|Ax_{j}^{\delta}-y^{\delta}\|^{2}]+c_{2}k^{-2\alpha}\mathbb{E}[\|Ax_{k}^{\delta}-y^{\delta}\|^{2}]:={\rm I}_{1}+{\rm I}_{2}+{\rm I}_{3}.

Last, we bound the terms {\rm I}_{i} by (2.13) and Lemma A.3 with \beta=\min(\alpha,1-\alpha):

\displaystyle{\rm I}_{1} \displaystyle\leq c_{5}c(p,\alpha)c_{\alpha,\beta,2}k^{-2(1-\alpha)+\max(0,1-2\alpha-\beta)}\ln^{2}k+2c_{5}c_{4}c_{\alpha,0,2}\delta^{2}k^{-(1-\alpha)}\ln^{2}k,
\displaystyle{\rm I}_{2} \displaystyle\leq c_{1}c(p,\alpha)c^{\prime}_{\alpha,\beta,1}k^{-(\alpha+\beta)}\ln^{2}k+2c_{1}c_{4}c^{\prime}_{\alpha,0,1}\delta^{2}k^{-\alpha}\ln^{3}k,
\displaystyle{\rm I_{3}} \displaystyle\leq c_{2}c(p,\alpha)k^{-2\alpha-\beta}\ln k+2c_{2}c_{4}\delta^{2}k^{-2\alpha}\ln^{2}k.

Combining these estimates together yields the assertion for \ell=2, where the constants in front of \delta^{2} can be made independent of k, since for any \gamma>0, k^{-\gamma}\ln k is uniformly bounded for k\geq 1. Then repeating the argument in Theorem 2.4 shows the desired assertion. ∎

Remark 2.4.

The bound in Theorem 2.5 depends on the regularity index p and the noise level \delta. Due to the presence of the factor \ln k, the bound is not uniform in k, but the growth is very mild.

3 Regularizing property

Now we analyze the regularizing property of SGD with early stopping, and establish convergence rates under an a priori stopping rule. First, we analyze the convergence for exact data. The reference solution x^{\dagger} is taken to be the unique minimum norm solution:

x^{\dagger}=\arg\min_{x}\{\|x-x_{1}\|\quad\mbox{s.t.}\quad Ax=y^{\dagger}\}.

It is well known that the minimum norm solution is characterized by x^{\dagger}-x_{1}\in\mathrm{range}(A^{t}) [7]. We only consider an a priori stopping index, and begin with exact data y^{\dagger}. The next result shows the convergence of the SGD iterate x_{k} to x^{\dagger}, for any \alpha\in(0,1).
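
In numerical experiments, the reference minimum norm solution can be computed via the pseudoinverse; a small sketch (assuming the exact data y^{\dagger} lies in the range of A, as it does here, and with a hypothetical helper name):

```python
import numpy as np

def minimum_norm_solution(A, y_exact, x1=None):
    """x_dagger = argmin ||x - x1|| subject to A x = y_dagger, via the pseudoinverse."""
    x1 = np.zeros(A.shape[1]) if x1 is None else x1
    # Shift by x1 so the problem becomes a standard minimum-norm least-squares solve.
    return x1 + np.linalg.pinv(A) @ (y_exact - A @ x1)
```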

Theorem 3.1.

Let Assumption 2.1 be fulfilled. Then the SGD iterate x_{k} converges to the minimum norm solution x^{\dagger} as k\to\infty, i.e.,

\lim_{k\to\infty}\mathbb{E}[\|x_{k}-x^{\dagger}\|^{2}]=0.
Proof.

The proof employs the decomposition (2.1), and bounds separately the mean and variance. It follows from (2.5) that the mean \mathbb{E}[e_{k}] satisfies

\mathbb{E}[e_{k+1}]=\Pi_{1}^{k}(B)e_{1}.

The term \|\Pi_{1}^{k}(B)e_{1}\| converges to zero as k\to\infty. Specifically, we define a function r_{k}(\lambda):(0,\|B\|]\to[0,1) by r_{k}(\lambda)=\prod_{j=1}^{k}(1-\eta_{j}\lambda). By Assumption 2.1, i.e., c_{0}\max_{i}\|a_{i}\|^{2}\leq 1, each factor 1-\eta_{j}\lambda lies in [0,1), so r_{k}(\lambda) is uniformly bounded. By the inequality 1-x\leq e^{-x} for x\geq 0, r_{k}(\lambda)\leq e^{-\lambda\sum_{j=1}^{k}\eta_{j}}. This and the fact that \lim_{k\to\infty}\sum_{j=1}^{k}\eta_{j}=\infty imply that for any \lambda>0, \lim_{k\to\infty}r_{k}(\lambda)=0. Hence, r_{k}(\lambda) converges to zero pointwise, and the argument for Theorem 4.1 of [6] yields \lim_{k\to\infty}\|\mathbb{E}[e_{k}]\|=0. Next, we bound the variance \mathbb{E}[\|x_{k+1}-\mathbb{E}[x_{k+1}]\|^{2}]. By Theorem 2.3 (with s=0) and Lemma A.1 (with p=\frac{1}{2}),

\displaystyle\mathbb{E}[\|x_{k+1}-\mathbb{E}[x_{k+1}]\|^{2}] \displaystyle\leq\sum_{j=1}^{k}\eta_{j}^{2}\|\Pi_{j+1}^{k}(B)B^{\frac{1}{2}}\|^{2}\mathbb{E}[\|A(x_{j}-x^{\dagger})\|^{2}]
\displaystyle\leq\sup_{j}\mathbb{E}[\|A(x_{j}-x^{\dagger})\|^{2}]\Big{(}(2e)^{-1}\sum_{j=1}^{k-1}\frac{\eta_{j}^{2}}{\sum_{i=j+1}^{k}\eta_{i}}+c_{0}k^{-2\alpha}\Big{)}.

By Theorem 2.4, the sequence \{\mathbb{E}[\|A(x_{j}-x^{\dagger})\|^{2}]\}_{j=1}^{\infty} is uniformly bounded. Then Lemma A.2 implies

\lim_{k\to\infty}\mathbb{E}[\|x_{k}-\mathbb{E}[x_{k}]\|^{2}]=0.

The desired assertion follows from the bias-variance decomposition:

\lim_{k\to\infty}\mathbb{E}[\|x_{k}-x^{\dagger}\|^{2}]\leq\lim_{k\to\infty}\|\mathbb{E}[x_{k}]-x^{\dagger}\|^{2}+\lim_{k\to\infty}\mathbb{E}[\|x_{k}-\mathbb{E}[x_{k}]\|^{2}]=0.

Last, by the construction of the SGD iterate x_{k}, x_{k}-x_{1} always belongs to {\rm range}(A^{t}), and thus the limit is the unique minimum-norm solution x^{\dagger}. ∎

Next we analyze the convergence of the SGD iterate x_{k}^{\delta} for noisy data y^{\delta} as \delta\to 0. To this end, we need a bound on the variance \mathbb{E}[\|x_{k}^{\delta}-\mathbb{E}[x_{k}^{\delta}]\|^{2}] of the iterate x_{k}^{\delta}.

Lemma 3.1.

Let Assumption 2.1 be fulfilled. Under the source condition (2.4), there holds

\displaystyle\mathbb{E}[\|x_{k+1}^{\delta}-\mathbb{E}[x_{k+1}^{\delta}]\|^{2}]\leq c(\alpha)k^{-\min(1-\alpha,\ell\alpha)}\ln^{\ell}k+c^{\prime}(\alpha)\delta^{2}\quad\forall\ell\geq 2,

where the constants c(\alpha) and c^{\prime}(\alpha) depend on \alpha and \ell.

Proof.

Theorem 2.3 with s=0 and Lemma A.1 with p=\frac{1}{2} imply (with c_{1}=(2e)^{-1})

\displaystyle\mathbb{E}[\|x_{k+1}^{\delta}-\mathbb{E}[x_{k+1}^{\delta}]\|^{2}] \displaystyle\leq\sum_{j=1}^{k-1}\eta_{j}^{2}\|\Pi_{j+1}^{k}(B)B^{\frac{1}{2}}\|^{2}\mathbb{E}[\|Ax_{j}^{\delta}-y^{\delta}\|^{2}]+\eta_{k}^{2}\|B^{\frac{1}{2}}\|^{2}\mathbb{E}[\|Ax_{k}^{\delta}-y^{\delta}\|^{2}]
\displaystyle\leq c_{1}\sum_{j=1}^{[\frac{k}{2}]}\frac{\eta_{j}^{2}}{\sum_{i=j+1}^{k}\eta_{i}}\mathbb{E}[\|Ax_{j}^{\delta}-y^{\delta}\|^{2}]+c_{1}\sum_{j=[\frac{k}{2}]+1}^{k-1}\frac{\eta_{j}^{2}}{\sum_{i=j+1}^{k}\eta_{i}}\mathbb{E}[\|Ax_{j}^{\delta}-y^{\delta}\|^{2}]
\displaystyle\quad+c_{0}k^{-2\alpha}\mathbb{E}[\|Ax_{k}^{\delta}-y^{\delta}\|^{2}]:={\rm I}_{1}+{\rm I}_{2}+{\rm I}_{3},

where the last step is due to c_{0}\|B\|\leq 1 from Assumption 2.1. Now the estimate (2.13) gives

\mathbb{E}[\|Ax_{k+1}^{\delta}-y^{\delta}\|^{2}]\leq c(\alpha)k^{-\beta}\ln k+2c_{4}\delta^{2}\max(\ln k,1)^{2},

where the exponent \beta=\min(\alpha,1-\alpha). This and Lemma A.3 imply

\displaystyle{\rm I}_{1} \displaystyle\leq c_{1}c(\alpha)c_{\alpha,\beta,1}k^{-(1-\alpha)+\max(0,1-2\alpha-\beta)}\ln^{2}k+c_{1}c_{4}c_{\alpha,0,1}\delta^{2}k^{-\alpha}\ln^{2}k,
\displaystyle{\rm I}_{2} \displaystyle\leq c_{1}c(\alpha)c_{\alpha,\beta,1}^{\prime}k^{-(\alpha+\beta)}\ln^{2}k+c_{1}c_{4}c_{\alpha,0,1}^{\prime}\delta^{2}k^{-\alpha}\ln^{3}k,\quad{\rm I}_{3}\leq c_{0}c(\alpha)k^{-2\alpha-\beta}\ln k+c_{0}c_{4}\delta^{2}k^{-2\alpha}\ln^{3}k.

By combining these estimates and collecting the terms, we obtain the assertion for \ell=2. The general case \ell\geq 3 can be proved analogously by applying Theorem 2.5. ∎

Now we can state the regularizing property of SGD (1.2) under a priori stopping rule. The condition (3.1) is analogous to that for classical regularization methods.

Theorem 3.2.

Let Assumption 2.1 be fulfilled. If the stopping index k(\delta) satisfies

\lim_{\delta\to 0^{+}}k(\delta)=\infty\quad\mbox{and}\quad\lim_{\delta\to 0^{+}}k(\delta)^{(1-\alpha)/2+\epsilon}\delta=0, (3.1)

for some small \epsilon>0, then the iterate x_{k(\delta)}^{\delta} satisfies

\lim_{\delta\to 0^{+}}\mathbb{E}[\|x_{k(\delta)}^{\delta}-x^{\dagger}\|^{2}]=0.
Proof.

We appeal to the bias-variance decomposition (2.1):

\displaystyle\mathbb{E}[\|x_{k(\delta)}^{\delta}-x^{\dagger}\|^{2}]\leq 2\|\mathbb{E}[x_{k(\delta)}^{\delta}-x_{k(\delta)}]\|^{2}+2\|\mathbb{E}[x_{k(\delta)}]-x^{\dagger}\|^{2}+\mathbb{E}[\|x_{k(\delta)}^{\delta}-\mathbb{E}[x_{k(\delta)}^{\delta}]\|^{2}].

By the proof of Theorem 3.1 and condition (3.1), we have

\lim_{\delta\to 0^{+}}\|\mathbb{E}[x_{k(\delta)}]-x^{\dagger}\|=\lim_{k\to\infty}\|\mathbb{E}[x_{k}]-x^{\dagger}\|=0.

Thus, it suffices to analyze \|\mathbb{E}[x_{k(\delta)}^{\delta}-x_{k(\delta)}]\|^{2} and \mathbb{E}[\|x_{k(\delta)}^{\delta}-\mathbb{E}[x_{k(\delta)}^{\delta}]\|^{2}]. Theorem 2.2 and the choice of k(\delta) in condition (3.1) yield

\lim_{\delta\to 0^{+}}\|\mathbb{E}[x_{k(\delta)}-x_{k(\delta)}^{\delta}]\|=0.

Last, by Lemma 3.1 and condition (3.1), upon choosing \ell sufficiently large, we obtain

\lim_{\delta\to 0^{+}}\mathbb{E}[\|x_{k(\delta)}^{\delta}-\mathbb{E}[x_{k(\delta)}^{\delta}]\|^{2}]=0.

This completes the proof of the theorem. ∎

Remark 3.1.

The consistency condition (3.1) in Theorem 3.2 requires \alpha\in(0,1). The constant step size, i.e., \alpha=0, is not covered by the theory, since the bootstrapping argument does not work in that case.

Last, we prove convergence rates under an a priori stopping index.

Theorem 3.3.

Let Assumption 2.1 be fulfilled, and let the source condition (2.4) hold. Then there holds

\mathbb{E}[\|x_{k+1}^{\delta}-x^{\dagger}\|^{2}]\leq c(\alpha,p)k^{-\min(\ell\alpha,1-\alpha,2p(1-\alpha))}\ln^{\ell}k+c_{\alpha}k^{1-\alpha}\delta^{2},\quad\forall\ell\geq 3.
Proof.

By the bias-variance decomposition, we have

\displaystyle\mathbb{E}[\|x_{k+1}^{\delta}-x^{\dagger}\|^{2}]=\mathbb{E}[\|x_{k+1}^{\delta}-\mathbb{E}[x_{k+1}^{\delta}]\|^{2}]+\|\mathbb{E}[x_{k+1}^{\delta}]-x^{\dagger}\|^{2}.

It follows from Lemma 3.1 that

\displaystyle\mathbb{E}[\|x_{k+1}^{\delta}-\mathbb{E}[x_{k+1}^{\delta}]\|^{2}]\leq c(\alpha,p)(k^{-\min(1-\alpha,\ell\alpha)}\ln^{\ell}k+\delta^{2}).

By the triangle inequality and Theorems 2.1 and 2.2,

\displaystyle\|\mathbb{E}[x_{k+1}^{\delta}]-x^{\dagger}\|^{2} \displaystyle\leq 2c_{p}^{2}k^{-2p(1-\alpha)}+2c_{\alpha}^{2}k^{1-\alpha}\bar{\delta}^{2}.

These two estimates together give the desired rate. ∎

Remark 3.2.

Theorem 3.3 indicates semiconvergence for the iterate x_{k}^{\delta}: one decreasing term depending on the regularity index p, and one increasing term depending on the data noise. Due to the variance, the best possible convergence rate is restricted, and it is suboptimal for any p>\frac{1}{2} when compared with that for the Landweber method. For sufficiently large \ell, we obtain a rate O(\delta^{\min(1,\frac{4p}{2p+1})}) after O(\delta^{-\frac{2}{1-\alpha}\frac{1}{\min(2,2p+1)}}) iterations, and the required number of iterations increases as p decreases to 0.
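
As a rough practical guide, the a priori stopping index suggested by Remark 3.2 can be coded directly; the theory only fixes the exponent, so the proportionality constant below is a user-chosen assumption.

```python
import numpy as np

def a_priori_stopping_index(delta, p, alpha, const=1.0):
    """k(delta) ~ const * delta**(-2 / ((1 - alpha) * min(2, 2*p + 1))), per Remark 3.2."""
    exponent = 2.0 / ((1.0 - alpha) * min(2.0, 2.0 * p + 1.0))
    return max(1, int(np.ceil(const * delta ** (-exponent))))
```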

Remark 3.3.

The a priori choice in Theorem 3.3 requires knowledge of the regularity index p, and thus is infeasible in practice. The discrepancy principle also does not work directly due to the expensive residual evaluation, and it induces complex dependence between the iterates, which requires different techniques for the analysis. Thus, it is of much interest to develop purely data-driven rules that avoid residual evaluation while adapting to the solution regularity, e.g., the quasi-optimality criterion and the balancing principle [12, 19].

4 Preasymptotic convergence

The regularizing property discussed in Section 3 is concerned with the asymptotics as \delta\to 0. This does not fully describe the practical performance of SGD at a fixed noise level \delta. To shed further insights, we analyze the preasymptotic convergence of SGD, in order to explain the fast initial convergence observed experimentally. The analysis extends that for the randomized Kaczmarz method [11]. It relies on the singular value decomposition (SVD). That is, n^{-\frac{1}{2}}A=U\Sigma V^{t}, where U\in\mathbb{R}^{n\times n} and V=[v_{1}\ v_{2}\ \cdots\ v_{m}]\in\mathbb{R}^{m\times m} are column orthonormal matrices, and \Sigma=\mathrm{diag}(\sigma_{1},\ldots,\sigma_{r},0,\ldots,0)\in\mathbb{R}^{n\times m} is diagonal with the diagonal entries ordered nonincreasingly, i.e., \sigma_{1}\geq\sigma_{2}\geq\ldots\geq\sigma_{r}>\sigma_{r+1}=\ldots=\sigma_{\min(n,m)}=0, where r is the rank of the matrix A. Hence, we have B=V\Sigma^{2}V^{t}. The solution space is spanned by the singular vectors \{v_{i}\}_{i=1}^{m}. For a fixed truncation level L\leq r, we define the low- and high-frequency solution spaces \mathcal{L} and \mathcal{H} respectively by

\mathcal{L}=\mathrm{span}(\{v_{i}\}_{i=1}^{L})\quad\mbox{and}\quad\mathcal{H}=\mathrm{span}(\{v_{i}\}_{i=L+1}^{\min(n,m)}).

Let P_{\mathcal{L}} and P_{\mathcal{H}} be the orthogonal projection onto \mathcal{L} and \mathcal{H}, respectively. The analysis decomposes the error e_{k}=x_{k}-x^{\dagger} into P_{\mathcal{L}}e_{k} and P_{\mathcal{H}}e_{k}, the low- and high-frequency components, respectively. This is motivated by the following fact: In view of the source condition (2.4) and the decay of the singular values \sigma_{i}, e_{1} is dominated by P_{\mathcal{L}}e_{1}, i.e., \|P_{\mathcal{L}}e_{1}\|\gg\|P_{\mathcal{H}}e_{1}\|. To see this, we consider a simplistic probabilistic model: the sourcewise representer w follows the standard Gaussian distribution \mathcal{N}(0,I_{m}).
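
Numerically, the splitting into P_{\mathcal{L}}e and P_{\mathcal{H}}e can be realized from the SVD of n^{-\frac{1}{2}}A; a minimal sketch (the truncation level L is a user choice, and the helper name is hypothetical):

```python
import numpy as np

def frequency_split(A, e, L):
    """Split an error vector e into low- and high-frequency parts P_L e and P_H e,
    using the right singular vectors of n^{-1/2} A (equivalently, eigenvectors of B)."""
    n = A.shape[0]
    _, _, Vt = np.linalg.svd(A / np.sqrt(n), full_matrices=True)
    V = Vt.T                      # columns v_1, ..., v_m, ordered by decreasing sigma_i
    coeffs = V.T @ e
    e_low = V[:, :L] @ coeffs[:L]
    e_high = V[:, L:] @ coeffs[L:]
    return e_low, e_high
```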

Proposition 4.1.

In Condition (2.4), if w\sim\mathcal{N}(0,I_{m}), then there hold

\mathbb{E}[\|P_{\mathcal{L}}e_{1}\|^{2}]=\sum_{i=1}^{L}\sigma_{i}^{4p}\quad\mbox{and}\quad\mathbb{E}[\|P_{\mathcal{H}}e_{1}\|^{2}]=\sum_{i=L+1}^{r}\sigma_{i}^{4p}.
Proof.

Under Condition (2.4), we have e_{1}=B^{p}w=V\Sigma^{2p}V^{t}w. Thus, we have

\displaystyle\|P_{\mathcal{L}}e_{1}\|^{2} \displaystyle=\|\sum_{i=1}^{L}v_{i}\sigma_{i}^{2p}(V^{t}w)_{i}\|^{2}=\sum_{i=1}^{L}\sigma_{i}^{4p}(V^{t}w)_{i}^{2}.

Since w\sim\mathcal{N}(0,I_{m}) and the matrix V is orthonormal, (V^{t}w)_{i}\sim\mathcal{N}(0,1), and \mathbb{E}[(V^{t}w)_{i}^{2}]=1, from which the assertion on \mathbb{E}[\|P_{\mathcal{L}}e_{1}\|^{2}] follows, and the other estimate follows similarly. ∎

Remark 4.1.

For polynomially decaying singular values \sigma_{i}, i.e., \sigma_{i}=ci^{-\beta}, \beta>0, and if 4p\beta>1, simple computation shows that \mathbb{E}[\|P_{\mathcal{L}}e_{1}\|^{2}]\geq c^{4p}(4p\beta-1)^{-1}(1-(L+1)^{1-4p\beta}) and \mathbb{E}[\|P_{\mathcal{H}}e_{1}\|^{2}]\leq c^{4p}(4p\beta-1)^{-1}(L^{1-4p\beta}-m^{1-4p\beta}), and thus

\frac{\mathbb{E}[\|P_{\mathcal{L}}e_{1}\|^{2}]}{\mathbb{E}[\|P_{\mathcal{H}}e_{1}\|^{2}]}\geq\frac{1-(L+1)^{1-4p\beta}}{L^{1-4p\beta}-m^{1-4p\beta}}.

Hence, for a truncation level L\ll m and 4p\beta\gg 1, \mathbb{E}[\|P_{\mathcal{L}}e_{1}\|^{2}] is dominating. The condition 4p\beta\gg 1 holds for either severely ill-posed problems (large \beta) or a highly regular solution (large p).
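
A quick numerical check of the ratio in Remark 4.1, for illustrative values of \beta, p, L and m (all assumed, not taken from the paper's experiments):

```python
import numpy as np

def expected_energy_ratio(beta, p, L, m, c=1.0):
    """E||P_L e_1||^2 / E||P_H e_1||^2 for sigma_i = c * i**(-beta) and w ~ N(0, I_m)."""
    sigma4p = (c * np.arange(1, m + 1, dtype=float) ** (-beta)) ** (4 * p)
    return sigma4p[:L].sum() / sigma4p[L:].sum()

# e.g. beta = 2, p = 1, L = 10, m = 1000 gives a ratio of several orders of magnitude,
# i.e. the initial error is overwhelmingly low-frequency.
```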

The preasymptotic analysis accounts for this important fact. Our analysis below covers both weak and strong errors, and further it does not impose any condition on the step sizes.

First we give the preasymptotic weak convergence for noisy data y^{\delta}.

Theorem 4.1.

Let Assumption 2.1 be fulfilled. Then with \bar{\delta}^{2}=n^{-1}\delta^{2}, there hold

\displaystyle\|\mathbb{E}[P_{\mathcal{L}}e_{k+1}^{\delta}]\| \displaystyle\leq(1-\eta_{k}\sigma_{L}^{2})\|\mathbb{E}[P_{\mathcal{L}}e_{k}^{\delta}]\|+c_{0}^{-\frac{1}{2}}\eta_{k}\bar{\delta},
\displaystyle\|\mathbb{E}[P_{\mathcal{H}}e_{k+1}^{\delta}]\| \displaystyle\leq\|\mathbb{E}[P_{\mathcal{H}}e_{k}^{\delta}]\|+\eta_{k}\sigma_{L+1}\bar{\delta}.
Proof.

By the definition of the SGD iteration in (2.3), we have

P_{\mathcal{L}}e_{k+1}^{\delta}=P_{\mathcal{L}}e_{k}^{\delta}-\eta_{k}(a_{i_{k}},e_{k}^{\delta})P_{\mathcal{L}}a_{i_{k}}+\eta_{k}\xi_{i_{k}}P_{\mathcal{L}}a_{i_{k}}.

By taking conditional expectation with respect to \mathcal{F}_{k-1} and the identity e_{k}^{\delta}=P_{\mathcal{L}}e_{k}^{\delta}+P_{\mathcal{H}}e_{k}^{\delta}, we obtain

\displaystyle\mathbb{E}[P_{\mathcal{L}}e_{k+1}^{\delta}|\mathcal{F}_{k-1}] \displaystyle=P_{\mathcal{L}}e_{k}^{\delta}-\eta_{k}n^{-1}\sum_{i=1}^{n}(a_{i},e_{k}^{\delta})P_{\mathcal{L}}a_{i}+\eta_{k}n^{-1}\sum_{i=1}^{n}\xi_{i}P_{\mathcal{L}}a_{i}
\displaystyle=P_{\mathcal{L}}e_{k}^{\delta}-\eta_{k}P_{\mathcal{L}}Be_{k}^{\delta}+\eta_{k}P_{\mathcal{L}}\bar{A}^{t}\xi
\displaystyle=(I-\eta_{k}P_{\mathcal{L}}BP_{\mathcal{L}})P_{\mathcal{L}}e_{k}^{\delta}-\eta_{k}P_{\mathcal{L}}BP_{\mathcal{H}}e_{k}^{\delta}+\eta_{k}P_{\mathcal{L}}\bar{A}^{t}\xi.

By the construction of P_{\mathcal{L}} and P_{\mathcal{H}}, P_{\mathcal{L}}BP_{\mathcal{H}}e_{k}^{\delta}=0, and then taking full expectation yields

\displaystyle\mathbb{E}[P_{\mathcal{L}}e_{k+1}^{\delta}] \displaystyle=(I-\eta_{k}P_{\mathcal{L}}BP_{\mathcal{L}})\mathbb{E}[P_{\mathcal{L}}e_{k}^{\delta}]+\eta_{k}P_{\mathcal{L}}\bar{A}^{t}\xi.

Then the first assertion follows from Assumption 2.1 as \|\bar{A}^{t}\|=n^{-\frac{1}{2}}\|B\|^{\frac{1}{2}}\leq n^{-\frac{1}{2}}c_{0}^{-\frac{1}{2}}, and the estimates

\displaystyle\eta_{k}\|P_{\mathcal{L}}\bar{A}^{t}\xi\| \displaystyle\leq c_{0}k^{-\alpha}\|\bar{A}^{t}\|\|\xi\|\leq c_{0}^{\frac{1}{2}}k^{-\alpha}\bar{\delta},
\displaystyle\|(I-\eta_{k}P_{\mathcal{L}}BP_{\mathcal{L}})P_{\mathcal{L}}e_{k}\| \displaystyle\leq(1-\eta_{k}\sigma_{L}^{2})\|P_{\mathcal{L}}e_{k}\|.

Next, appealing again to the SGD iteration (2.3) gives

P_{\mathcal{H}}e_{k+1}^{\delta}=P_{\mathcal{H}}e_{k}^{\delta}-\eta_{k}(a_{i_{k}},e_{k}^{\delta})P_{\mathcal{H}}a_{i_{k}}+\eta_{k}\xi_{i_{k}}P_{\mathcal{H}}a_{i_{k}}.

Thus the conditional expectation \mathbb{E}[P_{\mathcal{H}}e_{k+1}^{\delta}|\mathcal{F}_{k-1}] is given by

\displaystyle\mathbb{E}[P_{\mathcal{H}}e_{k+1}^{\delta}|\mathcal{F}_{k-1}] \displaystyle=P_{\mathcal{H}}e_{k}^{\delta}-\eta_{k}n^{-1}\sum_{i=1}^{n}(a_{i},e_{k}^{\delta})P_{\mathcal{H}}a_{i}+\eta_{k}n^{-1}\sum_{i=1}^{n}\xi_{i}P_{\mathcal{H}}a_{i}
\displaystyle=(I-\eta_{k}P_{\mathcal{H}}BP_{\mathcal{H}})P_{\mathcal{H}}e_{k}^{\delta}+\eta_{k}P_{\mathcal{H}}\bar{A}^{t}\xi.

Then, taking full expectation and appealing to the triangle inequality yield the second estimate. ∎

Remark 4.2.

For exact data y^{\dagger}, we obtain the following simplified expressions:

\displaystyle\|\mathbb{E}[P_{\mathcal{L}}e_{k+1}]\|\leq(1-c_{1}\eta_{k})\|\mathbb{E}[P_{\mathcal{L}}e_{k}]\|\quad\mbox{and}\quad\|\mathbb{E}[P_{\mathcal{H}}e_{k+1}]\|\leq\|\mathbb{E}[P_{\mathcal{H}}e_{k}]\|.

Thus the low-frequency error decreases faster than the high-frequency one. Further, there is no interaction between the low- and high-frequency errors in the weak error.
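
The qualitative behavior described in Remark 4.2 can be observed empirically by tracking \|P_{\mathcal{L}}e_{k}\| and \|P_{\mathcal{H}}e_{k}\| along one SGD path for exact data; the sketch below assumes a user-supplied test problem (A, y, x_true) and a chosen truncation level L, and all names are illustrative.

```python
import numpy as np

def track_frequency_errors(A, y, x_true, L, c0, alpha, K, seed=0):
    """Track ||P_L e_k|| and ||P_H e_k|| along one SGD path for exact data
    (an empirical counterpart of Remark 4.2)."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    _, _, Vt = np.linalg.svd(A / np.sqrt(n), full_matrices=True)
    x = np.zeros(m)
    low, high = [], []
    for k in range(1, K + 1):
        i = rng.integers(n)
        x = x - c0 * k ** (-alpha) * (A[i] @ x - y[i]) * A[i]
        coeffs = Vt @ (x - x_true)          # coefficients of e_k in the basis v_1,...,v_m
        low.append(np.linalg.norm(coeffs[:L]))
        high.append(np.linalg.norm(coeffs[L:]))
    return np.array(low), np.array(high)
```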

Now we turn to preasymptotic strong convergence, for exact data y^{\dagger}.

Theorem 4.2.

Let Assumption 2.1 be fulfilled. Then with c_{1}=\sigma_{L}^{2} and c_{2}=\sum_{i=L+1}^{r}\sigma_{i}^{2}, there hold

\displaystyle\mathbb{E}[\|P_{\mathcal{L}}e_{k+1}\|^{2}|\mathcal{F}_{k-1}] \displaystyle\leq(1-\eta_{k}c_{1})\|P_{\mathcal{L}}e_{k}\|^{2}+c_{2}c_{0}^{-1}\eta_{k}^{2}\|P_{\mathcal{H}}e_{k}\|^{2},
\displaystyle\mathbb{E}[\|P_{\mathcal{H}}e_{k+1}\|^{2}|\mathcal{F}_{k-1}] \displaystyle\leq c_{2}c_{0}^{-1}\eta_{k}^{2}\|P_{\mathcal{L}}e_{k}\|^{2}+(1+c_{2}c_{0}^{-1}\eta_{k}^{2})\|P_{\mathcal{H}}e_{k}\|^{2}.
Proof.

It follows from the SGD iteration (2.2) that P_{\mathcal{L}}e_{k+1}=P_{\mathcal{L}}e_{k}-\eta_{k}(a_{i_{k}},e_{k})P_{\mathcal{L}}a_{i_{k}}. This and the condition on \eta_{k} from Assumption 2.1, i.e., c_{0}\max_{i}\|a_{i}\|^{2}\leq 1, imply

\displaystyle\|P_{\mathcal{L}}e_{k+1}\|^{2} \displaystyle=\|P_{\mathcal{L}}e_{k}\|^{2}-2\eta_{k}(a_{i_{k}},e_{k})(P_{\mathcal{L}}e_{k},P_{\mathcal{L}}a_{i_{k}})+\eta_{k}^{2}(e_{k},a_{i_{k}})^{2}\|P_{\mathcal{L}}a_{i_{k}}\|^{2}
\displaystyle\leq\|P_{\mathcal{L}}e_{k}\|^{2}-2\eta_{k}(a_{i_{k}},e_{k})(P_{\mathcal{L}}e_{k},P_{\mathcal{L}}a_{i_{k}})+c_{0}^{-1}\eta_{k}^{2}(e_{k},a_{i_{k}})^{2}.

The conditional expectation with respect to \mathcal{F}_{k-1} is given by

\displaystyle\mathbb{E}[\|P_{\mathcal{L}}e_{k+1}\|^{2}|\mathcal{F}_{k-1}] \displaystyle\leq\|P_{\mathcal{L}}e_{k}\|^{2}-2\eta_{k}n^{-1}\sum_{i=1}^{n}(a_{i},e_{k})(P_{\mathcal{L}}e_{k},P_{\mathcal{L}}a_{i})+c_{0}^{-1}\eta_{k}^{2}n^{-1}\sum_{i=1}^{n}(e_{k},a_{i})^{2}
\displaystyle=\|P_{\mathcal{L}}e_{k}\|^{2}-2\eta_{k}(P_{\mathcal{L}}e_{k},P_{\mathcal{L}}Be_{k})+c_{0}^{-1}\eta_{k}^{2}(e_{k},Be_{k}).

With the decomposition e_{k}=P_{\mathcal{L}}e_{k}+P_{\mathcal{H}}e_{k} and the construction of P_{\mathcal{L}} and P_{\mathcal{H}}, we obtain

\displaystyle(P_{\mathcal{L}}e_{k},P_{\mathcal{L}}Be_{k}) \displaystyle=(P_{\mathcal{L}}e_{k},P_{\mathcal{L}}BP_{\mathcal{L}}e_{k}),
\displaystyle(e_{k},Be_{k}) \displaystyle=(P_{\mathcal{L}}e_{k},P_{\mathcal{L}}BP_{\mathcal{L}}e_{k})+(P_{\mathcal{H}}e_{k},P_{\mathcal{H}}BP_{\mathcal{H}}e_{k}).

Together with the last two inequalities, this leads to

\displaystyle\mathbb{E}[\|P_{\mathcal{L}}e_{k+1}\|^{2}|\mathcal{F}_{k-1}] \displaystyle\leq\|P_{\mathcal{L}}e_{k}\|^{2}-\eta_{k}(P_{\mathcal{L}}e_{k},P_{\mathcal{L}}BP_{\mathcal{L}}e_{k})+c_{0}^{-1}\eta_{k}^{2}(P_{\mathcal{H}}e_{k},P_{\mathcal{H}}BP_{\mathcal{H}}e_{k})
\displaystyle\leq(1-\eta_{k}\sigma_{L}^{2})\|P_{\mathcal{L}}e_{k}\|^{2}+c_{0}^{-1}\eta_{k}^{2}\sigma_{L+1}^{2}\|P_{\mathcal{H}}e_{k}\|^{2}
\displaystyle\leq(1-c_{1}\eta_{k})\|P_{\mathcal{L}}e_{k}\|^{2}+c_{2}c_{0}^{-1}\eta_{k}^{2}\|P_{\mathcal{H}}e_{k}\|^{2}.

This shows the first estimate. Next, appealing again to the SGD iteration (2.2), we obtain

P_{\mathcal{H}}e_{k+1}=P_{\mathcal{H}}e_{k}-\eta_{k}(a_{i_{k}},e_{k})P_{\mathcal{H}}a_{i_{k}},

which together with the condition on \eta_{k} from Assumption 2.1, i.e., c_{0}\max_{i}\|a_{i}\|^{2}\leq 1, and the Cauchy-Schwarz inequality, implies

\displaystyle\|P_{\mathcal{H}}e_{k+1}\|^{2} \displaystyle=\|P_{\mathcal{H}}e_{k}\|^{2}-2\eta_{k}(a_{i_{k}},e_{k})(P_{\mathcal{H}}e_{k},P_{\mathcal{H}}a_{i_{k}})+\eta_{k}^{2}(e_{k},a_{i_{k}})^{2}\|P_{\mathcal{H}}a_{i_{k}}\|^{2}
\displaystyle\leq\|P_{\mathcal{H}}e_{k}\|^{2}-2\eta_{k}(a_{i_{k}},e_{k})(P_{\mathcal{H}}e_{k},P_{\mathcal{H}}a_{i_{k}})+c_{0}^{-1}\eta_{k}^{2}\|e_{k}\|^{2}\|P_{\mathcal{H}}a_{i_{k}}\|^{2}.

Thus the conditional expectation \mathbb{E}[\|P_{\mathcal{H}}e_{k+1}\|^{2}|\mathcal{F}_{k-1}] is given by

\displaystyle\mathbb{E}[\|P_{\mathcal{H}}e_{k+1}\|^{2}|\mathcal{F}_{k-1}] \displaystyle\leq\|P_{\mathcal{H}}e_{k}\|^{2}-2\eta_{k}n^{-1}\sum_{i=1}^{n}(a_{i},e_{k})(P_{\mathcal{H}}a_{i},P_{\mathcal{H}}e_{k})+c_{0}^{-1}\eta_{k}^{2}n^{-1}\|e_{k}\|^{2}\sum_{i=1}^{n}\|P_{\mathcal{H}}a_{i}\|^{2}
\displaystyle=\|P_{\mathcal{H}}e_{k}\|^{2}-2\eta_{k}(P_{\mathcal{H}}e_{k},P_{\mathcal{H}}Be_{k})+c_{0}^{-1}\eta_{k}^{2}\|e_{k}\|^{2}\|P_{\mathcal{H}}B^{\frac{1}{2}}\|^{2}_{F}.

Upon observing \|P_{\mathcal{H}}B^{\frac{1}{2}}\|_{F}^{2}=\sum_{i=L+1}^{r}\sigma_{i}^{2}\equiv c_{2} [11, Lemma 3.2], we deduce

\displaystyle\mathbb{E}[\|P_{\mathcal{H}}e_{k+1}\|^{2}|\mathcal{F}_{k-1}] \displaystyle\leq\|P_{\mathcal{H}}e_{k}\|^{2}-2\eta_{k}\|B^{\frac{1}{2}}P_{\mathcal{H}}e_{k}\|^{2}+c_{2}c_{0}^{-1}\eta_{k}^{2}\|e_{k}\|^{2}
\displaystyle\leq\|P_{\mathcal{H}}e_{k}\|^{2}+c_{2}c_{0}^{-1}\eta_{k}^{2}(\|P_{\mathcal{L}}e_{k}\|^{2}+\|P_{\mathcal{H}}e_{k}\|^{2}).

This proves the second estimate and completes the proof of the theorem. ∎

Last, we turn to the preasymptotic strong convergence for noisy data y^{\delta}.

Theorem 4.3.

Under Assumption 2.1, with c_{1}=\sigma_{L}^{2}, c_{2}=\sum_{i=L+1}^{r}\sigma_{i}^{2} and \bar{\delta}^{2}=n^{-1}\delta^{2}, there hold

\displaystyle\mathbb{E}[\|P_{\mathcal{L}}e_{k+1}^{\delta}\|^{2}|\mathcal{F}_{k-1}] \displaystyle\leq(1-c_{1}\eta_{k})\|P_{\mathcal{L}}e_{k}^{\delta}\|^{2}+c_{2}c_{0}^{-1}\eta_{k}^{2}\|P_{\mathcal{H}}e_{k}^{\delta}\|^{2}+c_{0}^{-1}\eta_{k}\bar{\delta}(\eta_{k}\bar{\delta}+2\sigma_{1}\|e_{k}^{\delta}\|),
\displaystyle\mathbb{E}[\|P_{\mathcal{H}}e_{k+1}^{\delta}\|^{2}|\mathcal{F}_{k-1}] \displaystyle\leq c_{2}c_{0}^{-1}\eta_{k}^{2}\|P_{\mathcal{L}}e_{k}^{\delta}\|^{2}+(1+c_{2}c_{0}^{-1}\eta_{k}^{2})\|P_{\mathcal{H}}e_{k}^{\delta}\|^{2}+c_{0}^{-1}\eta_{k}^{2}\bar{\delta}^{2}
\displaystyle\quad+2\sqrt{2}c_{2}^{\frac{1}{2}}\eta_{k}\bar{\delta}\Big{(}\|P_{\mathcal{H}}e_{k}^{\delta}\|^{2}+c_{0}^{-2}\eta_{k}^{2}\|e_{k}^{\delta}\|^{2}\Big{)}^{\frac{1}{2}}.
Proof.

It follows from the SGD iteration (2.3) that

P_{\mathcal{L}}e_{k+1}^{\delta}=P_{\mathcal{L}}e_{k}^{\delta}-\eta_{k}(a_{i_{k}},e_{k}^{\delta})P_{\mathcal{L}}a_{i_{k}}+\eta_{k}\xi_{i_{k}}P_{\mathcal{L}}a_{i_{k}},

and upon expansion, we obtain

\displaystyle\mathbb{E}[\|P_{\mathcal{L}}e_{k+1}^{\delta}\|^{2}|\mathcal{F}_{k-1}] \displaystyle=\mathbb{E}[\|P_{\mathcal{L}}e_{k}^{\delta}-\eta_{k}(a_{i_{k}},e_{k}^{\delta})P_{\mathcal{L}}a_{i_{k}}\|^{2}|\mathcal{F}_{k-1}]+\eta_{k}^{2}\mathbb{E}[\xi_{i_{k}}^{2}\|P_{\mathcal{L}}a_{i_{k}}\|^{2}|\mathcal{F}_{k-1}]
\displaystyle\quad+2\mathbb{E}[(P_{\mathcal{L}}e_{k}^{\delta}-\eta_{k}(a_{i_{k}},e_{k}^{\delta})P_{\mathcal{L}}a_{i_{k}},\eta_{k}\xi_{i_{k}}P_{\mathcal{L}}a_{i_{k}})|\mathcal{F}_{k-1}]:={\rm I}_{1}+{\rm I}_{2}+{\rm I}_{3}.

It suffices to bound the three terms {\rm I}_{i}. The first term {\rm I}_{1} can be bounded by Theorem 4.2. For the term {\rm I}_{2}, by Assumption 2.1, there holds

{\rm I}_{2}\leq\eta_{k}^{2}n^{-1}\max_{i}\|P_{\mathcal{L}}a_{i}\|^{2}\sum_{i=1}^{n}\xi_{i}^{2}\leq c_{0}^{-1}\eta_{k}^{2}\bar{\delta}^{2}.

For the third term {\rm I}_{3}, by the identity (a_{i},e_{k}^{\delta})=(P_{\mathcal{L}}a_{i},P_{\mathcal{L}}e_{k}^{\delta})+(P_{\mathcal{H}}a_{i},P_{\mathcal{H}}e_{k}^{\delta}), we have the splitting

\displaystyle{\rm I}_{3}=2n^{-1}\eta_{k}\sum_{i=1}^{n}\xi_{i}[(P_{\mathcal{L}}a_{i},P_{\mathcal{L}}e_{k}^{\delta})-\eta_{k}(a_{i},e_{k}^{\delta})\|P_{\mathcal{L}}a_{i}\|^{2}]=2n^{-1}\eta_{k}\sum_{i=1}^{n}\xi_{i}{\rm I}_{3,i},

with {\rm I}_{3,i}=(1-\eta_{k}\|P_{\mathcal{L}}a_{i}\|^{2})(P_{\mathcal{L}}a_{i},P_{\mathcal{L}}e_{k}^{\delta})-\eta_{k}(P_{\mathcal{H}}a_{i},P_{\mathcal{H}}e_{k}^{\delta})\|P_{\mathcal{L}}a_{i}\|^{2}. It suffices to bound {\rm I}_{3,i}. By the choice of \eta_{k}, cf. Assumption 2.1, we deduce

\displaystyle{\rm I}_{3,i}^{2} \displaystyle=(1-\eta_{k}\|P_{\mathcal{L}}a_{i}\|^{2})^{2}(P_{\mathcal{L}}a_{i},P_{\mathcal{L}}e_{k}^{\delta})^{2}+\eta_{k}^{2}\|P_{\mathcal{L}}a_{i}\|^{4}(P_{\mathcal{H}}a_{i},P_{\mathcal{H}}e_{k}^{\delta})^{2}
\displaystyle\leq(P_{\mathcal{L}}a_{i},P_{\mathcal{L}}e_{k}^{\delta})^{2}+(P_{\mathcal{H}}a_{i},P_{\mathcal{H}}e_{k}^{\delta})^{2},

and consequently,

\displaystyle\sum_{i=1}^{n}{\rm I}_{3,i}^{2} \displaystyle\leq\sum_{i=1}^{n}(P_{\mathcal{L}}a_{i},P_{\mathcal{L}}e_{k}^{\delta})^{2}+(P_{\mathcal{H}}a_{i},P_{\mathcal{H}}e_{k}^{\delta})^{2}=\|Ae_{k}^{\delta}\|^{2}\leq n\|B\|\|e_{k}^{\delta}\|^{2}.

Combining these two estimates with the Cauchy-Schwarz inequality leads to |{\rm I}_{3}|\leq 2\bar{\delta}\eta_{k}\sigma_{1}\|e_{k}^{\delta}\|. These estimates together show the first assertion. For the high-frequency component P_{\mathcal{H}}e_{k}^{\delta}, we have

P_{\mathcal{H}}e_{k+1}^{\delta}=P_{\mathcal{H}}e_{k}^{\delta}-\eta_{k}(a_{i_{k}},e_{k}^{\delta})P_{\mathcal{H}}a_{i_{k}}+\eta_{k}\xi_{i_{k}}P_{\mathcal{H}}a_{i_{k}},

and upon expansion, we obtain

\displaystyle\mathbb{E}[\|P_{\mathcal{H}}e_{k+1}^{\delta}\|^{2}|\mathcal{F}_{k-1}] \displaystyle=\mathbb{E}[\|P_{\mathcal{H}}e_{k}^{\delta}-\eta_{k}(a_{i_{k}},e_{k}^{\delta})P_{\mathcal{H}}a_{i_{k}}\|^{2}|\mathcal{F}_{k-1}]+\eta_{k}^{2}\mathbb{E}[\xi_{i_{k}}^{2}\|P_{\mathcal{H}}a_{i_{k}}\|^{2}|\mathcal{F}_{k-1}]
\displaystyle\quad+2\mathbb{E}[(P_{\mathcal{H}}e_{k}^{\delta}-\eta_{k}(a_{i_{k}},e_{k}^{\delta})P_{\mathcal{H}}a_{i_{k}},\eta_{k}\xi_{i_{k}}P_{\mathcal{H}}a_{i_{k}})|\mathcal{F}_{k-1}]:={\rm I}_{4}+{\rm I}_{5}+{\rm I}_{6}.

The term {\rm I}_{4} can be bounded by Theorem 4.2. Clearly, {\rm I}_{5}\leq c_{0}^{-1}\eta_{k}^{2}\bar{\delta}^{2}. For {\rm I}_{6}, simple computation yields

{\rm I}_{6}=2n^{-1}\eta_{k}\sum_{i=1}^{n}\xi_{i}[(P_{\mathcal{H}}a_{i},P_{\mathcal{H}}e_{k}^{\delta})-\eta_{k}(a_{i},e_{k}^{\delta})\|P_{\mathcal{H}}a_{i}\|^{2}]:=2n^{-1}\eta_{k}\sum_{i=1}^{n}\xi_{i}{\rm I}_{6,i},

with {\rm I}_{6,i} given by {\rm I}_{6,i}=(P_{\mathcal{H}}a_{i},P_{\mathcal{H}}e_{k}^{\delta})-\eta_{k}(a_{i},e_{k}^{\delta})\|P_{\mathcal{H}}a_{i}\|^{2}. Now we bound the term involving {\rm I}_{6,i} by

\displaystyle\sum_{i=1}^{n}{\rm I}_{6,i}^{2} \displaystyle\leq 2\sum_{i=1}^{n}\Big((P_{\mathcal{H}}a_{i},P_{\mathcal{H}}e_{k}^{\delta})^{2}+\eta_{k}^{2}(a_{i},e_{k}^{\delta})^{2}\|P_{\mathcal{H}}a_{i}\|^{4}\Big)
\displaystyle\leq\big(2\|P_{\mathcal{H}}e_{k}^{\delta}\|^{2}+2\eta_{k}^{2}\max_{i}\|a_{i}\|^{4}\|e_{k}^{\delta}\|^{2}\big)\sum_{i=1}^{n}\|P_{\mathcal{H}}a_{i}\|^{2}
\displaystyle\leq 2c_{2}n(\|P_{\mathcal{H}}e_{k}^{\delta}\|^{2}+c_{0}^{-2}\eta_{k}^{2}\|e_{k}^{\delta}\|^{2}).

This estimate together with the Cauchy-Schwarz inequality gives

|{\rm I}_{6}|\leq 2\sqrt{2}c_{2}^{\frac{1}{2}}\eta_{k}\bar{\delta}\Big(\|P_{\mathcal{H}}e_{k}^{\delta}\|^{2}+c_{0}^{-2}\eta_{k}^{2}\|e_{k}^{\delta}\|^{2}\Big)^{\frac{1}{2}}.

These estimates together show the second assertion, and complete the proof. ∎

Remark 4.3.

Theorems 4.2 and 4.3 indicate that the low-frequency error can decrease much faster than the high-frequency error (up to a term dependent on the noise level, in the presence of data noise). Thus, if the initial error consists mostly of low-frequency modes, SGD can decrease it rapidly, which explains the fast initial convergence.

5 Numerical experiments

Now we present numerical experiments to complement the theoretical study. All the numerical examples, i.e., phillips, gravity and shaw, are taken from the public domain MATLAB package Regutools (available from http://www.imm.dtu.dk/~pcha/Regutools/, last accessed on January 8, 2018). They are Fredholm integral equations of the first kind, with the first example being mildly ill-posed and the other two severely ill-posed. Unless otherwise stated, the examples are discretized with dimension n=m=1000. The noisy data y^{\delta} is generated from the exact data y^{\dagger} as

y^{\delta}_{i}=y_{i}^{\dagger}+\delta\max_{j}(|y_{j}^{\dagger}|)\xi_{i},\quad i=1,\ldots,n,

where \delta is the relative noise level, and the random variables \xi_{i} are i.i.d. standard Gaussian. The initial guess x_{1} is fixed at x_{1}=0. We report the squared error e_{k} and/or the squared residual r_{k}, i.e.,

e_{k}=\mathbb{E}[\|x^{\dagger}-x_{k}\|^{2}]\quad\mbox{and}\quad r_{k}=\mathbb{E}[\|Ax_{k}-y^{\delta}\|^{2}]. (5.1)

The expectation \mathbb{E}[\cdot] with respect to the random index i_{k} is approximated by the average over 100 independent runs. The constant c_{0} in the step size schedule is always taken to be c_{0}=1/\max_{i}\|a_{i}\|^{2}, and the exponent \alpha is taken to be \alpha=0.1, unless otherwise stated. All the computations were carried out on a personal laptop with a 2.50 GHz CPU and 8.00 GB RAM using MATLAB 2015b.
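
For illustration only, the following Python/NumPy sketch (not the code used for the experiments; the system matrix A and the exact solution x_true are assumed to be given, e.g., by a discretized Fredholm integral equation, and the names are our own) reproduces the setup described above: the noise generation, the SGD iteration with step sizes \eta_{k}=c_{0}k^{-\alpha}, and the Monte Carlo approximation of the expectation in (5.1).

import numpy as np

def add_noise(y_exact, delta, rng):
    # y_i^delta = y_i^dagger + delta * max_j |y_j^dagger| * xi_i, with xi_i ~ N(0, 1)
    return y_exact + delta * np.max(np.abs(y_exact)) * rng.standard_normal(y_exact.size)

def sgd(A, y_delta, x1, alpha=0.1, n_iter=10**4, rng=None):
    # SGD with uniformly sampled row index i_k and step size eta_k = c0 * k^{-alpha}
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    c0 = 1.0 / np.max(np.sum(A**2, axis=1))   # c0 = 1 / max_i ||a_i||^2
    x = x1.copy()
    for k in range(1, n_iter + 1):
        i = rng.integers(n)
        x = x - c0 * k**(-alpha) * (A[i] @ x - y_delta[i]) * A[i]
    return x

rng = np.random.default_rng(0)
# The expectation in (5.1) is approximated by averaging over 100 independent runs, e.g.,
# e = np.mean([np.sum((x_true - sgd(A, add_noise(A @ x_true, 1e-2, rng),
#                                   np.zeros(A.shape[1]), rng=rng))**2) for _ in range(100)])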

5.1 The role of the exponent \alpha

The convergence of SGD depends essentially on the parameter \alpha. To examine its role, we present in Figs. 1, 2 and 3 the numerical results for the examples with different noise levels, computed using different \alpha values. The smaller the \alpha value is, the quicker the algorithm reaches convergence, and, for noisy data, the sooner it starts to diverge. This agrees with the analysis in Section 3. Hence, a smaller \alpha value is desirable for convergence. However, in the presence of large noise, a too small \alpha value may sacrifice the attainable accuracy (see Figs. 1(c) and 2(c) for illustrations), and the oscillation magnitudes of the iterates and the residual also tend to be larger. This is possibly due to the intrinsic variance for large step sizes, and it would be interesting to precisely characterize the dynamics, e.g., with stochastic differential equations. In practice, such variations may complicate the design of a proper stopping criterion (especially with one single trajectory).

Figure 1: Numerical results for phillips with different noise levels ((a) \delta=1e-3, (b) \delta=1e-2, (c) \delta=5e-2) by SGD (with various \alpha).
Figure 2: Numerical results for gravity with different noise levels ((a) \delta=1e-3, (b) \delta=1e-2, (c) \delta=5e-2) by SGD (with various \alpha).
Figure 3: Numerical results for shaw with different noise levels ((a) \delta=1e-3, (b) \delta=1e-2, (c) \delta=5e-2) by SGD (with various \alpha).

5.2 Comparison with Landweber method

Since SGD is a randomized version of the classical Landweber method, we compare their performance in Fig. 4. To compare the iteration complexity only, we count one Landweber iteration as n SGD iterations, and the full gradient evaluation is indicated by flat segments in the plots. For all examples, the error e_{k} and the residual r_{k} first experience a fast reduction, and then the error starts to increase, which is especially pronounced at \delta=5\times 10^{-2}, exhibiting the typical semiconvergence behavior. During the initial stage, SGD is much more effective than the Landweber method: indeed, one single full loop over the data can already significantly reduce the error e_{k} and produce an acceptable approximation. The precise mechanism for this interesting observation will be further examined below. However, the nonvanishing variance of the stochastic gradient slows down the asymptotic convergence of SGD, and the error e_{k} and the residual r_{k} eventually tend to oscillate for noisy data, before finally diverging.
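
For reference, a minimal sketch of the Landweber iteration used as the baseline (our own illustrative implementation; the constant step size eta and the n^{-1} normalization of the gradient are assumptions chosen to match the scaling of the SGD iteration above): one step evaluates the full gradient n^{-1}A^{t}(Ax_{k}-y^{\delta}), i.e., one sweep over all n data points, and is therefore counted as n SGD iterations.

import numpy as np

def landweber(A, y_delta, x1, eta, n_iter):
    # Full-gradient (Landweber) iteration; one step = one full pass over the data.
    n = A.shape[0]
    x = x1.copy()
    for _ in range(n_iter):
        x = x - (eta / n) * (A.T @ (A @ x - y_delta))
    return x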

Figure 4: Numerical results for the examples by SGD (with \alpha=0.1) and the Landweber method (LM): (a) phillips, \delta=10^{-2}; (b) phillips, \delta=5\times 10^{-2}; (c) gravity, \delta=10^{-2}; (d) gravity, \delta=5\times 10^{-2}; (e) shaw, \delta=10^{-2}; (f) shaw, \delta=5\times 10^{-2}.

5.3 Preasymptotic convergence

Now we examine the preasymptotic strong convergence of SGD (since the weak convergence satisfies a Landweber type iteration). Theorems 4.2 and 4.3 predict that during the first iterations, the low-frequency error e_{L}=\mathbb{E}[\|P_{\mathcal{L}}e_{k}\|^{2}] decreases rapidly, whereas the high-frequency error e_{H}=\mathbb{E}[\|P_{\mathcal{H}}e_{k}\|^{2}] can at best decay mildly. For all examples, the first five singular vectors capture well the total energy of the initial error e_{1}=x^{\dagger}-x_{1}, which suggests the truncation level L=5 for illustration; the corresponding computation is sketched below. We plot the low- and high-frequency errors e_{L} and e_{H} and the total error e=\mathbb{E}[\|e_{k}\|^{2}] in Fig. 5. The low-frequency error e_{L} decays much more rapidly during the initial iterations, and since, under the source condition (2.4), e_{L} is dominant, the total error e also enjoys a fast initial decay. Intuitively, this behavior may be explained as follows. The rows of the matrix A mainly contain low-frequency modes, and thus each SGD iteration tends to mostly remove the low-frequency component e_{L} of the initial error x^{\dagger}-x_{1}. The high-frequency component e_{H} experiences a similar but much slower decay. Eventually, both components level off and oscillate, due to the deleterious effect of noise. These observations confirm the preasymptotic analysis in Section 4. For noisy data, the error e_{k} can be highly oscillating, and so is the residual r_{k}. The larger the noise level \delta is, the larger the oscillation magnitude.
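
The low- and high-frequency errors are obtained from the singular value decomposition; the sketch below (our own, with the error vector err = x^{\dagger}-x_{k} and the truncation level L assumed given) splits the error with the projections P_{\mathcal{L}} and P_{\mathcal{H}} onto the span of the L leading right singular vectors of A and its orthogonal complement.

import numpy as np

def frequency_split(A, err, L=5):
    # P_L projects onto the span of the L leading right singular vectors of A
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    e_low = Vt[:L].T @ (Vt[:L] @ err)      # P_L err
    e_high = err - e_low                   # P_H err
    return np.sum(e_low**2), np.sum(e_high**2)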

Figure 5: The error decay for the examples ((a) phillips, (b) gravity, (c) shaw) with two noise levels: \delta=10^{-2} (top) and \delta=5\times 10^{-2} (bottom), with a truncation level L=5.

5.4 Asymptotic convergence

To examine the asymptotic convergence (with respect to the noise level \delta), we present in Table 1 the smallest error e along the trajectory and the number of iterations needed to reach it, for several \delta values. It is observed that for all three examples, the minimal error e increases steadily with the noise level \delta, while the required number of iterations decreases dramatically, which qualitatively agrees well with Remark 3.2. Thus, SGD is especially efficient in the regime of high noise levels, for which one or two epochs can already give very good approximations, due to the fast preasymptotic convergence. This agrees with the common belief that SGD is most effective for finding an approximate solution that is not highly accurate. At low noise levels, shaw takes far more iterations to reach the smallest error. This might be attributed to the fact that the exponent p in the source condition (2.4) for shaw is much smaller than that for phillips or gravity, since the low-frequency modes are less dominant, as roughly indicated by the red curves in Fig. 5. Interestingly, for all examples, the error e undergoes a sudden change when the noise level \delta increases from 1e-2 to 3e-2. This might be related to the exponent \alpha in the step size schedule, which should probably be adapted to \delta in order to achieve an optimal balance between computational efficiency and statistical errors. However, the precise mechanism remains to be ascertained.
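
The entries of Table 1 can be reproduced by recording the expected error along the trajectory and taking its minimum. A self-contained sketch (our own illustration, with default values for the number of runs and iterations chosen for exposition only) reads:

import numpy as np

def minimal_error(A, x_true, delta, alpha=0.1, n_iter=10**5, runs=100, seed=0):
    # Returns (minimal expected error, iteration at which it is attained), cf. Table 1.
    rng = np.random.default_rng(seed)
    n, m = A.shape
    c0 = 1.0 / np.max(np.sum(A**2, axis=1))
    err = np.zeros(n_iter)
    y_exact = A @ x_true
    for _ in range(runs):
        y_delta = y_exact + delta * np.max(np.abs(y_exact)) * rng.standard_normal(n)
        x = np.zeros(m)
        for k in range(1, n_iter + 1):
            i = rng.integers(n)
            x = x - c0 * k**(-alpha) * (A[i] @ x - y_delta[i]) * A[i]
            err[k - 1] += np.sum((x_true - x)**2)
    err /= runs
    k_min = int(np.argmin(err))
    return err[k_min], k_min + 1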

Table 1: The minimal expected error e for the examples; each entry gives the pair (e, number of iterations to reach e).
\delta phillips gravity shaw
1e-3 (1.09e-3,7.92e4) (3.22e-1,4.55e5) (2.92e0,3.55e6)
5e-3 (3.23e-3,1.83e4) (5.65e-1,6.19e4) (3.21e0,1.95e6)
1e-2 (6.85e-3,3.09e3) (6.21e-1,4.60e4) (6.75e0,1.15e6)
3e-2 (4.74e-2,4.20e2) (2.60e0, 6.50e3) (3.50e1,7.80e3)
5e-2 (6.71e-2,1.09e3) (6.32e0, 2.55e3) (3.70e1,1.28e3)

6 Concluding remarks

In this work, we have analyzed the regularizing property of stochastic gradient descent for solving linear inverse problems, by properly extending deterministic inversion techniques. The study indicates that, with proper early stopping and a suitable step size schedule, it is regularizing in the sense that the iterates converge to the exact solution in the mean squared norm as the noise level tends to zero. Further, under the canonical source condition, we prove error estimates, which depend on the noise level and the step size schedule. The findings are complemented by extensive numerical experiments.

There are many questions related to stochastic algorithms that deserve further research. One outstanding issue is the stopping criterion, for which rigorous rules have yet to be developed. In practice, the performance of SGD can be sensitive to the exponent \alpha [18]. Promising strategies for overcoming this drawback include averaging [20] and variance reduction [13]. It is of much interest to analyze such schemes in the context of inverse problems.

References

[1] F. Bauer, T. Hohage, and A. Munk. Iteratively regularized Gauss-Newton method for nonlinear inverse problems with random noise. SIAM J. Numer. Anal., 47(3):1827–1846, 2009.
[2] N. Bissantz, T. Hohage, and A. Munk. Consistency and rates of convergence of nonlinear Tikhonov regularization with random noise. Inverse Problems, 20(6):1773–1789, 2004.
[3] N. Bissantz, T. Hohage, A. Munk, and F. Ruymgaart. Convergence rates of general regularization methods for statistical inverse problems and applications. SIAM J. Numer. Anal., 45(6):2610–2636, 2007.
[4] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Y. Lechevallier and G. Saporta, editors, Proc. CompStat’2010, pages 177–186. Physica-Verlag HD, 2010.
[5] A. Dieuleveut and F. Bach. Nonparametric stochastic approximation with large step-sizes. Ann. Statist., 44(4):1363–1399, 2016.
[6] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Kluwer, Dordrecht, 1996.
[7] P. C. Hansen. Rank-Deficient and Discrete Ill-posed Problems. SIAM, Philadelphia, PA, 1998.
[8] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: stability of stochastic gradient descent. In Proc. 33rd Int. Conf. Machine Learning (ICML), volume 48, pages 1225–1234, 2016.
[9] K. Ito and B. Jin. Inverse Problems: Tikhonov Theory and Algorithms. World Scientific, Hackensack, NJ, 2015.
[10] S. Jastrzȩbski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in SGD. Preprint, arXiv:1711.04623, 2017.
[11] Y. Jiao, B. Jin, and X. Lu. Preasymptotic convergence of randomized Kaczmarz method. Inverse Problems, 33(12):125012, 21 pp., 2017.
[12] B. Jin and D. A. Lorenz. Heuristic parameter-choice rules for convex variational regularization based on error estimates. SIAM J. Numer. Anal., 48(3):1208–1229, 2010.
[13] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323, Lake Tahoe, Nevada, 2013.
[14] B. Kaltenbacher, A. Neubauer, and O. Scherzer. Iterative Regularization Methods for Nonlinear Ill-posed Problems. Walter de Gruyter, Berlin, 2008.
[15] J. Lin and L. Rosasco. Optimal rates for multi-pass stochastic gradient methods. J. Mach. Learn. Res., 18:1–47, 2017.
[16] F. Natterer. The Mathematics of Computerized Tomography. SIAM, Philadelphia, PA, 2001.
[17] D. Needell, N. Srebro, and R. Ward. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Math. Program., 155(1-2, Ser. A):549–573, 2016.
[18] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2008.
[19] S. Pereverzev and E. Schock. On the adaptive selection of the parameter in regularization of ill-posed problems. SIAM J. Numer. Anal., 43(5):2060–2076, 2005.
[20] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992.
[21] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Stat., 22:400–407, 1951.
[22] T. Schuster, B. Kaltenbacher, B. Hofmann, and K. S. Kazimierski. Regularization Methods in Banach Spaces. Walter de Gruyter, Berlin, 2012.
[23] P. Tarrès and Y. Yao. Online learning as stochastic approximation of regularization paths: optimality and almost-sure convergence. IEEE Trans. Inform. Theory, 60(9):5716–5735, 2014.
[24] Y. Ying and M. Pontil. Online gradient descent learning algorithms. Found. Comput. Math., 8(5):561–596, 2008.

Appendix A Elementary inequalities

In this appendix, we collect some useful inequalities for sequences involving polynomially decaying step sizes. We begin with a basic estimate on the operator norm, which is well known (see, e.g., [15]).

Lemma A.1.

For j<k, any symmetric and positive semidefinite matrix S, step sizes \eta_{j}\in(0,\|S\|^{-1}] and p\geq 0, there holds

\|\prod_{i=j}^{k}(I-\eta_{i}S)S^{p}\|\leq\frac{p^{p}}{e^{p}(\sum_{i=j}^{k}\eta_{i})^{p}}.

Next we derive basic estimates on various series involving \eta_{j}=c_{0}j^{-\alpha}, with c_{0}>0 and \alpha\in[0,1).

Lemma A.2.

For the choice \eta_{j}=c_{0}j^{-\alpha}, with \alpha\in[0,1) and r\in[0,1], there hold for any k\geq 2

\displaystyle\sum_{i=1}^{k}\eta_{i} \displaystyle\geq(2^{1-\alpha}-1)(1-\alpha)^{-1}c_{0}k^{1-\alpha}, (A.1)
\displaystyle\sum_{j=1}^{k-1}\frac{\eta_{j}}{(\sum_{i=j+1}^{k}\eta_{i})^{r}} \displaystyle\leq\left\{\begin{array}{ll}c_{0}^{1-r}B(1-\alpha,1-r)k^{(1-r)(1-\alpha)},&r\in[0,1),\\ 2^{\alpha}((1-\alpha)^{-1}+\ln k),&r=1,\end{array}\right. (A.2)

where B(\cdot,\cdot) is the Beta function defined by B(a,b)=\int_{0}^{1}s^{a-1}(1-s)^{b-1}\mathrm{d}s for any a,b>0.

Proof.

Since \alpha\in[0,1), we have

\displaystyle c_{0}^{-1}\sum_{i=1}^{k}\eta_{i} \displaystyle\geq\int_{1}^{k+1}s^{-\alpha}\mathrm{d}s=(1-\alpha)^{-1}((k+1)^{1-\alpha}-1)\geq c_{\alpha}k^{1-\alpha},

with the constant c_{\alpha}=(1-\alpha)^{-1}(2^{1-\alpha}-1). This shows the first estimate (A.1). Next, since \eta_{i}\geq c_{0}k^{-\alpha}, for any i=j+1,\ldots,k, we have

\displaystyle c_{0}^{-1}\sum_{i=j+1}^{k}\eta_{i}\geq k^{-\alpha}(k-j). (A.3)

If r\in[0,1), by changing variables, we have

\displaystyle c_{0}^{r-1}\sum_{j=1}^{k-1}\frac{\eta_{j}}{(\sum_{i=j+1}^{k}\eta_{i})^{r}}\leq k^{r\alpha}\sum_{j=1}^{k-1}j^{-\alpha}(k-j)^{-r}
\displaystyle\leq \displaystyle k^{r\alpha}\int_{0}^{k}s^{-\alpha}(k-s)^{-r}\mathrm{d}s=B(1-\alpha,1-r)k^{(1-r)(1-\alpha)}.

For r=1, we derive directly (with the notation [\cdot] denoting the integer part of a real number) that

\displaystyle\sum_{j=1}^{k-1}\frac{\eta_{j}}{\sum_{i=j+1}^{k}\eta_{i}}\leq k^{\alpha}\sum_{j=1}^{k-1}j^{-\alpha}(k-j)^{-1}
\displaystyle= \displaystyle k^{\alpha}\sum_{j=1}^{[\frac{k}{2}]}j^{-\alpha}(k-j)^{-1}+k^{\alpha}\sum_{j=[\frac{k}{2}]+1}^{k-1}j^{-\alpha}(k-j)^{-1}
\displaystyle\leq \displaystyle 2k^{\alpha-1}\sum_{j=1}^{[\frac{k}{2}]}j^{-\alpha}+2^{\alpha}\sum_{j=[\frac{k}{2}]+1}^{k-1}(k-j)^{-1}.

Simple computation gives \sum_{j=[\frac{k}{2}]+1}^{k-1}(k-j)^{-1}\leq\ln k and \sum_{j=1}^{[\frac{k}{2}]}j^{-\alpha}\leq(1-\alpha)^{-1}(\frac{k}{2})^{1-\alpha}. Combining the last two estimates yields the estimate (A.2), and completes the proof. ∎
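
The estimates (A.1) and (A.2) (for r\in[0,1)) can also be checked numerically. The following short Python script (our own illustration, with arbitrarily chosen values of c_{0}, \alpha, r and k; the Beta function is evaluated via the Gamma function) compares both sides of the two bounds:

import numpy as np
from math import gamma

c0, alpha, r, k = 1.0, 0.3, 0.5, 2000
eta = c0 * np.arange(1, k + 1) ** (-alpha)
tails = np.cumsum(eta[::-1])[::-1] - eta           # tails[j-1] = sum_{i=j+1}^k eta_i
lhs1, rhs1 = eta.sum(), (2**(1 - alpha) - 1) / (1 - alpha) * c0 * k**(1 - alpha)
lhs2 = np.sum(eta[:-1] / tails[:-1] ** r)
rhs2 = c0**(1 - r) * gamma(1 - alpha) * gamma(1 - r) / gamma(2 - alpha - r) * k**((1 - r) * (1 - alpha))
print(lhs1 >= rhs1, lhs2 <= rhs2)                  # both should print True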

The next result gives some further estimates.

Lemma A.3.

For \eta_{j}=c_{0}j^{-\alpha}, with \alpha\in(0,1), \beta\in[0,1], and r\geq 0, there hold

\displaystyle\sum_{j=1}^{[\frac{k}{2}]}\eta_{j}^{2}\frac{1}{(\sum_{i=j+1}^{k}\eta_{i})^{r}}j^{-\beta} \displaystyle\leq c_{\alpha,\beta,r}k^{-r(1-\alpha)+\max(0,1-2\alpha-\beta)},
\displaystyle\sum_{j=[\frac{k}{2}]+1}^{k-1}\frac{\eta_{j}^{2}}{(\sum_{i=j+1}^{k}\eta_{i})^{r}}j^{-\beta} \displaystyle\leq c_{\alpha,\beta,r}^{\prime}k^{-((2-r)\alpha+\beta)+\max(0,1-r)},

where we slightly abuse the notation k^{\max(0,0)} to denote \ln k, and the constants c_{\alpha,\beta,r} and c^{\prime}_{\alpha,\beta,r} are given by

\displaystyle c_{\alpha,\beta,r} \displaystyle=c_{0}^{2-r}\left\{\begin{array}{ll}2^{r}(2\alpha+\beta-1)^{-1},&2\alpha+\beta>1,\\ 2,&2\alpha+\beta=1,\\ 2^{r-1+2\alpha+\beta}(1-2\alpha-\beta)^{-1},&2\alpha+\beta<1,\end{array}\right.
\displaystyle c^{\prime}_{\alpha,\beta,r} \displaystyle=2^{2\alpha+\beta}c_{0}^{2-r}\left\{\begin{array}{ll}(r-1)^{-1},&r>1,\\ 1,&r=1,\\ 2^{r-1}(1-r)^{-1},&r<1.\end{array}\right.
Proof.

It follows from the inequality (A.3) that

\displaystyle\quad c_{0}^{r-2}\sum_{j=1}^{[\frac{k}{2}]}\eta_{j}^{2}\frac{1}{(\sum_{i=j+1}^{k}\eta_{i})^{r}}j^{-\beta}=\sum_{j=1}^{[\frac{k}{2}]}\frac{j^{-(2\alpha+\beta)}}{(\sum_{i=j+1}^{k}i^{-\alpha})^{r}}
\displaystyle\leq k^{r\alpha}\sum_{j=1}^{[\frac{k}{2}]}j^{-(2\alpha+\beta)}(k-j)^{-r}\leq 2^{r}k^{-r+r\alpha}\sum_{j=1}^{[\frac{k}{2}]}j^{-(2\alpha+\beta)}
\displaystyle\leq 2^{r}k^{r\alpha-r}\left\{\begin{array}{ll}(2\alpha+\beta-1)^{-1},&2\alpha+\beta>1,\\ \ln k,&2\alpha+\beta=1,\\ (1-2\alpha-\beta)^{-1}(\frac{k}{2})^{1-2\alpha-\beta},&2\alpha+\beta<1.\end{array}\right.

Collecting terms shows the first estimate. Similarly, by the inequality (A.3), we have

\displaystyle\quad c_{0}^{r-2}\sum_{j=[\frac{k}{2}]+1}^{k-1}\eta_{j}^{2}\frac{1}{(\sum_{i=j+1}^{k}\eta_{i})^{r}}j^{-\beta}=\sum_{j=[\frac{k}{2}]+1}^{k-1}\frac{j^{-(2\alpha+\beta)}}{(\sum_{i=j+1}^{k}i^{-\alpha})^{r}}
\displaystyle\leq k^{r\alpha}\sum_{j=[\frac{k}{2}]+1}^{k-1}j^{-(2\alpha+\beta)}(k-j)^{-r}\leq 2^{2\alpha+\beta}k^{r\alpha-(2\alpha+\beta)}\sum_{j=[\frac{k}{2}]+1}^{k-1}(k-j)^{-r}
\displaystyle\leq 2^{2\alpha+\beta}k^{r\alpha-(2\alpha+\beta)}\left\{\begin{array}{ll}(r-1)^{-1},&r>1,\\ \ln k,&r=1,\\ (1-r)^{-1}(\frac{k}{2})^{1-r},&r<1.\end{array}\right.

This completes the proof of the lemma. ∎
