# Rate of Convergence of Truncated Stochastic Approximation Procedures with Moving Bounds

## Abstract

The paper is concerned with stochastic approximation procedures having three main characteristics: truncations with random moving bounds, a matrix-valued random step-size sequence, and a dynamically changing random regression function. We study convergence and rate of convergence. The main results are supplemented with corollaries that establish various sets of sufficient conditions, with the main emphasis on parametric statistical estimation. The theory is illustrated by examples and special cases.

Department of Mathematics, Royal Holloway, University of London

Egham, Surrey TW20 0EX

e-mail: t.sharia@rhul.ac.uk

Keywords: Stochastic approximation, Recursive estimation, Parameter estimation

## 1 Introduction

This paper is a continuation of Sharia (2014), where a large class of truncated stochastic approximation (SA) procedures with moving random bounds was proposed. Although the proposed class of procedures can be applied to a wider range of problems, our main motivation comes from applications to parametric statistical estimation theory. To make this paper self-contained, we introduce the main ideas below (a full list of references as well as some comparisons can be found in Sharia (2014)).

The main idea can be easily explained in the case of the classical problem of finding a unique zero, say $z^0$, of a real-valued function $R(z):\mathbb{R}\to\mathbb{R}$ when only noisy measurements of $R$ are available. To estimate $z^0$, consider a sequence defined recursively as

$$Z_t = Z_{t-1} + \gamma_t\,[R(Z_{t-1}) + \varepsilon_t], \qquad t = 1, 2, \dots \tag{1.1}$$

where $\{\varepsilon_t\}$ is a sequence of zero-mean random variables and $\{\gamma_t\}$ is a deterministic sequence of positive numbers. This is the classical Robbins-Monro SA procedure (see Robbins and Monro (1951)), which under certain conditions converges to the root $z^0$ of the equation $R(z)=0$. (Comprehensive surveys of the SA technique can be found in Benveniste et al. (1990), Borkar (2008), Kushner and Yin (2003), Lai (2003), and Kushner (2010).)
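As an illustration of (1.1), the following is a minimal sketch in Python, not taken from the paper. The regression function $R(z) = -(z-2)$, the standard normal noise, the step size $\gamma_t = 1/t$ and the helper name `robbins_monro` are all assumptions chosen only to make the recursion concrete.

```python
import random

def robbins_monro(noisy_r, z_init, n_steps):
    """Run Z_t = Z_{t-1} + gamma_t * [R(Z_{t-1}) + eps_t] with gamma_t = 1/t."""
    z = z_init
    for t in range(1, n_steps + 1):
        z = z + noisy_r(z) / t  # noisy_r(z) returns R(z) + eps_t in one call
    return z

rng = random.Random(0)  # fixed seed so that the run is reproducible

def noisy_r(z):
    # Hypothetical regression function R(z) = -(z - 2) observed with N(0, 1) noise
    return -(z - 2.0) + rng.gauss(0.0, 1.0)

estimate = robbins_monro(noisy_r, z_init=10.0, n_steps=20000)
print(estimate)  # expected to land near the root z^0 = 2
```

With this particular choice of $R$ and $\gamma_t$, the iterate $Z_t$ is simply the running average of the noisy observations $2+\varepsilon_s$, which is why the starting value is forgotten after the first step.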

Statistical parameter estimation is one of the most important applications of the above procedure. Indeed, suppose that $X_1, X_2, \dots$ are i.i.d. random variables and $f(x,\theta)$ is the common probability density function (w.r.t. some $\sigma$-finite measure), where $\theta\in\mathbb{R}^m$ is an unknown parameter. Consider a recursive estimation procedure for $\theta$ defined by

$$\hat\theta_t = \hat\theta_{t-1} + \frac{1}{t}\, i(\hat\theta_{t-1})^{-1}\, \frac{f'^{T}(X_t,\hat\theta_{t-1})}{f(X_t,\hat\theta_{t-1})}, \qquad t\ge 1, \tag{1.2}$$

where $\hat\theta_0\in\mathbb{R}^m$ is some starting value and $i(\theta)$ is the one-step Fisher information matrix ($f'(x,\theta)$ is the row-vector of partial derivatives of $f(x,\theta)$ w.r.t. the components of $\theta$). This estimator was introduced in Sakrison (1965) and studied by a number of authors (see, e.g., Polyak and Tsypkin (1980), Campbell (1982), Ljung and Soderstrom (1987), Lazrieve and Toronjadze (1987), Englund et al. (1989), Lazrieve et al. (1997, 2008), Sharia (1997–2010)). In particular, it has been shown that under certain conditions, the recursive estimator $\hat\theta_t$ is asymptotically equivalent to the maximum likelihood estimator, i.e., it is consistent and asymptotically efficient. One can analyse (1.2) by rewriting it in the form of stochastic approximation with $\gamma_t = 1/t$,

$$R(z) = i(z)^{-1} E_\theta\left\{\frac{f'^{T}(X_t,z)}{f(X_t,z)}\right\} \quad\text{and}\quad \varepsilon_t = i(\hat\theta_{t-1})^{-1}\frac{f'^{T}(X_t,\hat\theta_{t-1})}{f(X_t,\hat\theta_{t-1})} - R(\hat\theta_{t-1}),$$

where $\theta$ is an arbitrary but fixed value of the unknown parameter. Indeed, under certain standard assumptions, $R(\theta)=0$ and $\varepsilon_t$ is a martingale difference w.r.t. the filtration $\mathcal{F}_t$ generated by $X_1,\dots,X_t$. So, (1.2) is a standard SA of type (1.1).
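To make (1.2) concrete, here is a small sketch for a scalar parameter. The model is an assumption made for illustration only: $X_t \sim N(\theta, 1)$, for which $f'/f = x-\theta$ and $i(\theta)=1$. The function name `recursive_mle` and its arguments are hypothetical.

```python
import random

def recursive_mle(sample, score, fisher_info, theta_init):
    """Scalar version of (1.2): theta_t = theta_{t-1} + score / (t * i(theta_{t-1}))."""
    theta = theta_init
    for t, x in enumerate(sample, start=1):
        theta += score(x, theta) / (t * fisher_info(theta))
    return theta

# Assumed model: X ~ N(theta, 1), so f'/f = x - theta and i(theta) = 1
rng = random.Random(0)
sample = [rng.gauss(3.0, 1.0) for _ in range(1000)]
theta_hat = recursive_mle(
    sample,
    score=lambda x, theta: x - theta,
    fisher_info=lambda theta: 1.0,
    theta_init=0.0,
)
sample_mean = sum(sample) / len(sample)
print(theta_hat, sample_mean)
```

In this special case the recursion reproduces the sample mean, which is also the maximum likelihood estimator, so the asymptotic equivalence mentioned above holds exactly here.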

Suppose now that we have a stochastic process $X_1, X_2, \dots$ and let $f_t(x,\theta) = f_t(x,\theta \mid X_1,\dots,X_{t-1})$ be the conditional probability density function of the observation $X_t$ given $X_1,\dots,X_{t-1}$, where $\theta\in\mathbb{R}^m$ is an unknown parameter. Then one can define a recursive estimator of $\theta$ by

$$\hat\theta_t = \hat\theta_{t-1} + \gamma_t(\hat\theta_{t-1})\,\psi_t(\hat\theta_{t-1}), \qquad t\ge 1, \tag{1.3}$$

where $\psi_t(\theta) = \psi_t(X_1,\dots,X_t;\theta)$ are suitably chosen functions which may, in general, depend on the vector of all past and present observations $X_1,\dots,X_t$, and have the property that the process $\psi_t(\theta)$ is a $P^\theta$-martingale difference, i.e., $E_\theta\{\psi_t(\theta)\mid\mathcal{F}_{t-1}\}=0$ for each $t$. For example, the choice

$$\psi_t(\theta) = l_t(\theta) \equiv \frac{[f'_t(X_t,\theta)]^T}{f_t(X_t,\theta)}$$

yields a likelihood type estimation procedure. In general, to obtain an estimator with asymptotically optimal properties, state-dependent matrix-valued random step-size sequences are needed (see Sharia (2010)). For the above procedure, a step-size sequence with the property

$$\gamma_t^{-1}(\theta) - \gamma_{t-1}^{-1}(\theta) = E_\theta\{\psi_t(\theta)\, l_t^T(\theta) \mid \mathcal{F}_{t-1}\}$$

is an optimal choice. For example, to derive a recursive procedure which is asymptotically equivalent to the maximum likelihood estimator, we need to take

$$\psi_t(\theta) = l_t(\theta) \quad\text{and}\quad \gamma_t(\theta) = I_t^{-1}(\theta),$$

where

$$I_t(\theta) = \sum_{s=1}^{t} E\{l_s(\theta)\, l_s^T(\theta) \mid \mathcal{F}_{s-1}\} \tag{1.4}$$

is the conditional Fisher information matrix. To rewrite (1.3) in the SA form, let us assume that $\theta$ is an arbitrary but fixed value of the parameter and define

$$R_t(z) = E_\theta\{\psi_t(X_t,z) \mid \mathcal{F}_{t-1}\} \quad\text{and}\quad \varepsilon_t(z) = \psi_t(X_t,z) - R_t(z).$$

Then, since $\psi_t(\theta)$ is a $P^\theta$-martingale difference, it follows that $R_t(\theta)=0$ for each $t$. So, the objective now is to find a common root $\theta$ of a dynamically changing sequence of functions $R_t$.
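A simple non-i.i.d. instance of (1.3) with the step size built from (1.4) can be sketched as follows. The model is an assumption chosen for illustration: a Gaussian AR(1) process $X_t = \theta X_{t-1} + \xi_t$ with standard normal innovations, for which $l_t(\theta) = X_{t-1}(X_t - \theta X_{t-1})$ and $I_t = \sum_{s\le t} X_{s-1}^2$. The function names are hypothetical.

```python
import random

def simulate_ar1(theta, n_steps, x_init=1.0, seed=0):
    """Simulate X_t = theta * X_{t-1} + xi_t with N(0, 1) innovations."""
    rng = random.Random(seed)
    xs = [x_init]
    for _ in range(n_steps):
        xs.append(theta * xs[-1] + rng.gauss(0.0, 1.0))
    return xs

def recursive_ar1(xs, theta_init=0.0):
    """Recursion (1.3) with psi_t = l_t and gamma_t = 1 / I_t for the AR(1) model."""
    theta, info = theta_init, 0.0
    for t in range(1, len(xs)):
        prev, cur = xs[t - 1], xs[t]
        info += prev * prev                   # I_t = I_{t-1} + X_{t-1}^2
        score = prev * (cur - theta * prev)   # l_t evaluated at theta_{t-1}
        theta += score / info                 # step with gamma_t = I_t^{-1}
    return theta

xs = simulate_ar1(theta=0.5, n_steps=5000)
theta_hat = recursive_ar1(xs)
least_squares = sum(a * b for a, b in zip(xs, xs[1:])) / sum(a * a for a in xs[:-1])
print(theta_hat, least_squares)
```

For this model the recursion coincides with the least squares estimator computed from the whole sample, so the step size $I_t^{-1}$ is random and depends on the past observations, unlike the deterministic $1/t$ in (1.2).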

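Before introducing the general SA process, let us consider one simple modification of the classical SA procedure. Suppose that we have additional information about the root $z^0$ of the equation $R(z)=0$. Let us, e.g., assume that $\alpha_t \le z^0 \le \beta_t$ at each step $t$, where $\alpha_t$ and $\beta_t$ are random variables such that $-\infty<\alpha_t\le\beta_t<\infty$. Then one can consider a procedure which at each step $t$ produces points from the interval $[\alpha_t,\beta_t]$. For example, a truncated classical SA procedure in this case can be derived using the following recursion

$$Z_t = \Phi_{[\alpha_t,\beta_t]}\big(Z_{t-1} + \gamma_t\,[R(Z_{t-1}) + \varepsilon_t]\big), \qquad t = 1, 2, \dots$$

where $\Phi$ is the truncation operator, that is, for any $-\infty<a\le b<\infty$,

$$\Phi_{[a,b]}(z) = \begin{cases} a & \text{if } z<a, \\ z & \text{if } a\le z\le b, \\ b & \text{if } z>b. \end{cases}$$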
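The truncated recursion can be sketched in a few lines. Everything specific below is an assumption made for illustration: the regression function $R(z) = -(z-2) - (z-2)^3$ (which grows faster than linearly, so a large starting error can make the untruncated iteration unstable), the bounds $[\alpha_t,\beta_t] = [-3-\log t,\ 3+\log t]$, the step size $1/t$, and the function names.

```python
import math
import random

def truncate(z, a, b):
    """Truncation operator Phi_[a,b]: clip z to the interval [a, b]."""
    return a if z < a else b if z > b else z

def truncated_rm(noisy_r, bounds, z_init, n_steps):
    """Z_t = Phi_[alpha_t, beta_t](Z_{t-1} + (1/t) * [R(Z_{t-1}) + eps_t])."""
    z = z_init
    for t in range(1, n_steps + 1):
        alpha_t, beta_t = bounds(t)
        z = truncate(z + noisy_r(z) / t, alpha_t, beta_t)
    return z

rng = random.Random(1)

def noisy_r(z):
    # Hypothetical R(z) = -(z - 2) - (z - 2)^3 observed with N(0, 1) noise
    d = z - 2.0
    return -d - d ** 3 + rng.gauss(0.0, 1.0)

def bounds(t):
    # Assumed slowly expanding bounds; they contain the root 2 at every step
    return -3.0 - math.log(t), 3.0 + math.log(t)

estimate = truncated_rm(noisy_r, bounds, z_init=10.0, n_steps=20000)
print(estimate)
```

Here the bounds only serve to keep the first few iterates in a region where the steps are moderate; once the iterates are near the root, the truncation no longer binds.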

Truncated procedures may be useful in a number of circumstances. For example, if the functions in the recursive equation are defined only for certain values of the parameter, then the procedure should produce points only from this set. Truncations may also be useful when certain standard assumptions, e.g., conditions on the growth rate of the relevant functions, are not satisfied. Truncations may also help to make efficient use of auxiliary information concerning the value of the unknown parameter. For example, we might have auxiliary information about the parameters, e.g., a set, possibly time dependent, that contains the value of the unknown parameter. Also, sometimes a consistent but not necessarily efficient auxiliary estimator is available with a known rate of convergence. Then, to obtain an asymptotically efficient estimator, one can construct a procedure with shrinking bounds by truncating the recursive procedure in a shrinking neighbourhood of the auxiliary estimator.

Note that the idea of truncations is not new and goes back to Khas'minskii and Nevelson (1972) and Fabian (1978) (see also Chen and Zhu (1986), Chen et al. (1987), Andradóttir (1995), Sharia (1997), Tadic (1997, 1998), Lelong (2008)). A comprehensive bibliography and some comparisons can be found in Sharia (2014).

In order to study these procedures in a unified manner, Sharia (2014) introduced an SA procedure of the following form

$$Z_t = \Phi_{U_t}\big(Z_{t-1} + \gamma_t(Z_{t-1})\,[R_t(Z_{t-1}) + \varepsilon_t(Z_{t-1})]\big), \qquad t = 1, 2, \dots$$

where $Z_0\in\mathbb{R}^m$ is some starting value, $R_t(z)$ is a predictable process with the property that $R_t(z^0)=0$ for all $t$'s, $\gamma_t(z)$ is a matrix-valued predictable step-size sequence, and $U_t\subset\mathbb{R}^m$ is a random sequence of truncation sets (see Section 2 for details). These SA procedures have the following main characteristics: (1) inhomogeneous random functions $R_t$; (2) state-dependent matrix-valued random step-sizes; (3) truncations with random and moving (shrinking or expanding) bounds. The main motivation for these comes from parametric statistical applications: (1) is needed for recursive parameter estimation procedures for non-i.i.d. models; (2) is required to guarantee asymptotic optimality and efficiency of statistical estimation; (3) is needed for various adaptive truncations, in particular, for the ones arising from auxiliary estimators.

Convergence of the above class of procedures is studied in Sharia (2014). In this paper we present new results on the rate of convergence. Furthermore, we present a convergence result which generalises the corresponding result in Sharia (2014) by considering time-dependent random Lyapunov type functions (see Lemma 3.1). This generalisation turns out to be quite useful as it can be used to derive convergence results for recursive parameter estimators in time series models. Some of the conditions in the main statements are difficult to interpret. Therefore, we discuss these conditions in explanatory remarks and corollaries. The corollaries are presented in such a way that each subsequent statement imposes conditions that are more restrictive than the previous one. We discuss the case of the classical SA and demonstrate that the conditions introduced in this paper are minimal in the sense that they do not impose any additional restrictions when applied to the classical case. We also compare our set of conditions to that of the Kushner-Clark setting (see Remark 4.4). Furthermore, the paper contains new results even for the classical SA. In particular, truncations with moving bounds make it possible to use SA in cases when the standard conditions on the function $R$ do not hold. Also, an interesting link between the rate of the step-size sequence and the rate of convergence of the SA process is given in the classical case (see Corollary 4.7 and Remark 4.8). This observation might not surprise experts working in this field, but we failed to find it in a written form in the existing literature.

## 2 Main objects and notation

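Let $(\Omega, \mathcal{F}, F=(\mathcal{F}_t)_{t\ge 0}, P)$ be a stochastic basis satisfying the usual conditions. Suppose that for each $t=1,2,\dots$, we have $\mathcal{B}(\mathbb{R}^m)\times\mathcal{F}$-measurable functions

$$\begin{aligned} R_t(z) &= R_t(z,\omega): \mathbb{R}^m\times\Omega\to\mathbb{R}^m, \\ \varepsilon_t(z) &= \varepsilon_t(z,\omega): \mathbb{R}^m\times\Omega\to\mathbb{R}^m, \\ \gamma_t(z) &= \gamma_t(z,\omega): \mathbb{R}^m\times\Omega\to\mathbb{R}^{m\times m} \end{aligned}$$

such that for each $z\in\mathbb{R}^m$, the processes $R_t(z)$ and $\gamma_t(z)$ are predictable, i.e., $R_t(z)$ and $\gamma_t(z)$ are $\mathcal{F}_{t-1}$ measurable for each $t$. Suppose also that for each $z\in\mathbb{R}^m$, the process $\varepsilon_t(z)$ is a martingale difference, i.e., $\varepsilon_t(z)$ is $\mathcal{F}_t$ measurable and $E\{\varepsilon_t(z)\mid\mathcal{F}_{t-1}\}=0$. We also assume that

$$R_t(z^0) = 0$$

for each $t=1,2,\dots$, where $z^0\in\mathbb{R}^m$ is a non-random vector.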

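Suppose that $h=h(z)$ is a real-valued function of $z\in\mathbb{R}^m$. Denote by $h'(z)$ the row-vector of partial derivatives of $h$ with respect to the components of $z$, that is, $h'(z) = \left(\frac{\partial}{\partial z_1}h(z),\dots,\frac{\partial}{\partial z_m}h(z)\right)$. Also, we denote by $h''(z)$ the matrix of second partial derivatives. The $m\times m$ identity matrix is denoted by $\mathbf{I}$. Denote by $[a]^+$ and $[a]^-$ the positive and negative parts of $a\in\mathbb{R}$, i.e., $[a]^+=\max(a,0)$ and $[a]^-=-\min(a,0)$.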

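Let $U\subset\mathbb{R}^m$ be a closed convex set and define a truncation operator as a function $\Phi_U(z):\mathbb{R}^m\to\mathbb{R}^m$ such that

$$\Phi_U(z) = \begin{cases} z & \text{if } z\in U, \\ z^* & \text{if } z\notin U, \end{cases}$$

where $z^*$ is a point in $U$ that minimizes the distance to $z$.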
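For a concrete closed convex set the truncation operator has a closed form. The sketch below takes $U$ to be a closed ball, which is an assumption made for illustration (the definition above allows any closed convex set); the function name `project_onto_ball` is hypothetical.

```python
import math

def project_onto_ball(z, center, radius):
    """Truncation Phi_U(z) for U = closed ball with the given center and radius.

    Returns z itself if z is in U, otherwise the point of U closest to z.
    """
    diff = [zi - ci for zi, ci in zip(z, center)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist <= radius:
        return list(z)
    scale = radius / dist  # shrink the offset from the center onto the sphere
    return [ci + scale * d for ci, d in zip(center, diff)]

inside = project_onto_ball([0.5, 0.5], center=[0.0, 0.0], radius=1.0)
outside = project_onto_ball([3.0, 4.0], center=[0.0, 0.0], radius=1.0)
print(inside, outside)
```

Since a ball is convex and closed, the closest point is unique, so the operator is well defined without any tie-breaking rule.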

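Suppose that $z^0\in\mathbb{R}^m$. We say that a random sequence of sets $U_t=U_t(\omega)$ ($t=1,2,\dots$) from $\mathbb{R}^m$ is admissible for $z^0$ if

- for each $t$ and $\omega$, $U_t(\omega)$ is a closed convex subset of $\mathbb{R}^m$;
- for each $t$ and $z\in\mathbb{R}^m$, the truncation $\Phi_{U_t}(z)$ is $\mathcal{F}_t$ measurable;
- $z^0\in U_t$ eventually, i.e., for almost all $\omega$ there exists $t_0(\omega)<\infty$ such that $z^0\in U_t(\omega)$ whenever $t>t_0(\omega)$.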

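Assume that $Z_0\in\mathbb{R}^m$ is some starting value and consider the procedure

$$Z_t = \Phi_{U_t}\big(Z_{t-1} + \gamma_t(Z_{t-1})\,\Psi_t(Z_{t-1})\big), \qquad t = 1, 2, \dots \tag{2.1}$$

where $U_t$ is admissible for $z^0$,

$$\Psi_t(z) = R_t(z) + \varepsilon_t(z),$$

and $R_t(z)$, $\varepsilon_t(z)$, $\gamma_t(z)$ are the random fields defined above. Everywhere in this work, we assume that

$$E\{\Psi_t(Z_{t-1}) \mid \mathcal{F}_{t-1}\} = R_t(Z_{t-1}) \tag{2.2}$$

and

$$E\{\varepsilon_t^T(Z_{t-1})\,\varepsilon_t(Z_{t-1}) \mid \mathcal{F}_{t-1}\} = \big[E\{\varepsilon_t^T(z)\,\varepsilon_t(z) \mid \mathcal{F}_{t-1}\}\big]_{z=Z_{t-1}}, \tag{2.3}$$

and the conditional expectations (2.2) and (2.3) are assumed to be finite.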
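The general procedure (2.1) can be sketched as a short loop. All specifics below are assumptions chosen for illustration: the truncation sets are balls centred at the origin with radius $\sqrt{t}$ (so they contain the root only eventually), the step size is the diagonal matrix $\mathrm{diag}(1/t,\,2/t)$, the regression function is $R_t(z) = -(z-z^0)$ with $z^0=(1,-2)$, and the noise is standard normal. The function names are hypothetical.

```python
import math
import random

def mat_vec(matrix, vector):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

def project_onto_origin_ball(z, radius):
    """Truncation Phi_U for U = closed ball of the given radius centred at 0."""
    norm = math.sqrt(sum(zi * zi for zi in z))
    return list(z) if norm <= radius else [radius * zi / norm for zi in z]

def truncated_sa(psi, gamma, radius, z_init, n_steps):
    """Z_t = Phi_{U_t}(Z_{t-1} + gamma_t(Z_{t-1}) Psi_t(Z_{t-1})), as in (2.1)."""
    z = list(z_init)
    for t in range(1, n_steps + 1):
        step = mat_vec(gamma(t, z), psi(t, z))
        z = project_onto_origin_ball([zi + si for zi, si in zip(z, step)], radius(t))
    return z

rng = random.Random(2)
Z_ROOT = [1.0, -2.0]  # hypothetical common root z^0 of the functions R_t

def psi(t, z):
    # Psi_t(z) = R_t(z) + eps_t(z) with R_t(z) = -(z - z^0) and N(0, 1) noise
    return [-(zi - ri) + rng.gauss(0.0, 1.0) for zi, ri in zip(z, Z_ROOT)]

def gamma(t, z):
    # Assumed matrix-valued step size; here it does not depend on the state z
    return [[1.0 / t, 0.0], [0.0, 2.0 / t]]

z_final = truncated_sa(psi, gamma, radius=lambda t: math.sqrt(t),
                       z_init=[0.0, 0.0], n_steps=20000)
print(z_final)
```

The loop keeps the three ingredients separate (random field `psi`, step-size matrix `gamma`, truncation radius), mirroring the three characteristics listed in the Introduction.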

###### Remark 2.1

Condition (2.2) ensures that $\varepsilon_t(Z_{t-1})$ is a martingale difference. Conditions (2.2) and (2.3) obviously hold if, e.g., the measurement errors $\varepsilon_t(u)$ are independent random variables, or if they are state independent. In general, since we assume that all conditional expectations are calculated as integrals w.r.t. corresponding regular conditional probability measures (see the convention below), these conditions can be checked using the disintegration formula (see, e.g., Theorem 5.4 in Kallenberg (2002)).

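We say that a random field

$$V_t(z) = V_t(z,\omega): \mathbb{R}^m\times\Omega\longrightarrow\mathbb{R} \qquad (t = 1, 2, \dots)$$

is a Lyapunov random field if

- $V_t(z)$ is a predictable process for each $z\in\mathbb{R}^m$;
- for each $t$ and almost all $\omega$, $V_t(z)$ is a non-negative function with continuous and bounded partial second derivatives.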

Convention.

Everywhere in the present work convergence and all relations between random variables are meant with probability one w.r.t. the measure $P$ unless specified otherwise.
A sequence of random variables $(\zeta_t)_{t\ge 1}$ has a property eventually if for every $\omega$ in a set of probability 1, the realisation $\zeta_t(\omega)$ has this property for all $t$ greater than some $t_0(\omega)<\infty$.
All conditional expectations are calculated as integrals w.r.t. corresponding regular conditional probability measures.
The of a real valued function is whenever .

## 3 Convergence and rate of convergence

We start this section with a convergence lemma, which uses the concept of a Lyapunov random field (see Section 2). The proof of this lemma is very similar to that presented in Sharia (2014). However, the dynamically changing Lyapunov functions make it possible to apply this result to derive the rate of convergence of the SA procedures. Also, this result turns out to be very useful for deriving convergence of recursive parameter estimators in time series models.

###### Lemma 3.1

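Suppose that $Z_t$ is a process defined by (2.1). Let $V_t(u)$ be a Lyapunov random field. Denote $\Delta_t = Z_t - z^0$, $\Delta V_t(u) = V_t(u) - V_{t-1}(u)$, and assume that

(V1)

$$V_t(\Delta_t) \le V_t\big(\Delta_{t-1} + \gamma_t(Z_{t-1})\,[R_t(Z_{t-1}) + \varepsilon_t(Z_{t-1})]\big)$$

eventually;

(V2)

$$\sum_{t=1}^{\infty} [1 + V_{t-1}(\Delta_{t-1})]^{-1}\,[K_t(\Delta_{t-1})]^+ < \infty, \qquad P\text{-a.s.},$$

where

$$K_t(u) = \Delta V_t(u) + V'_t(u)\,\gamma_t(z^0+u)\,R_t(z^0+u) + \eta_t(z^0+u)$$

and

$$\eta_t(v) = \frac{1}{2}\sup_{z} E\big\{[R_t(v)+\varepsilon_t(v)]^T\,\gamma_t^T(v)\,V''_t(z)\,\gamma_t(v)\,[R_t(v)+\varepsilon_t(v)] \,\big|\, \mathcal{F}_{t-1}\big\}.$$

Then $V_t(\Delta_t)$ converges ($P$-a.s.) to a finite limit for any initial value $Z_0$.

Furthermore, if there exists a set $A\in\mathcal{F}$ with $P(A)>0$ such that for each $\epsilon\in(0,1)$

(V3)

$$\sum_{t=1}^{\infty}\ \inf_{\substack{\epsilon\le V_t(u)\le 1/\epsilon \\ z^0+u\in U_{t-1}}} [K_t(u)]^- = \infty \quad\text{on } A, \tag{3.1}$$

then $V_t(\Delta_t)\to 0$ ($P$-a.s.) for any initial value $Z_0$.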

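Proof. The proof is similar to those of Theorems 2.2 and 2.4 in Sharia (2014). Rewrite (2.1) in the form

$$\Delta_t = \Delta_{t-1} + \gamma_t(Z_{t-1})\,[R_t(Z_{t-1}) + \varepsilon_t(Z_{t-1})].$$

By (V1), using the Taylor expansion, we have

$$\begin{aligned} V_t(\Delta_t) \le{}& V_t(\Delta_{t-1}) + V'_t(\Delta_{t-1})\,\gamma_t(Z_{t-1})\,[R_t(Z_{t-1}) + \varepsilon_t(Z_{t-1})] \\ &+ \frac{1}{2}\,[R_t(Z_{t-1}) + \varepsilon_t(Z_{t-1})]^T\,\gamma_t^T(Z_{t-1})\,V''_t(\tilde\Delta_{t-1})\,\gamma_t(Z_{t-1})\,[R_t(Z_{t-1}) + \varepsilon_t(Z_{t-1})], \end{aligned}$$

where $\tilde\Delta_{t-1}\in\mathbb{R}^m$ is $\mathcal{F}_t$-measurable. Since

$$V_t(\Delta_{t-1}) = V_{t-1}(\Delta_{t-1}) + \Delta V_t(\Delta_{t-1}),$$

using (2.2) and (2.3), we obtain

$$E\{V_t(\Delta_t) \mid \mathcal{F}_{t-1}\} \le V_{t-1}(\Delta_{t-1}) + K_t(\Delta_{t-1}).$$

Then, using the decomposition $K_t = [K_t]^+ - [K_t]^-$, the above can be rewritten as

$$E\{V_t(\Delta_t) \mid \mathcal{F}_{t-1}\} \le V_{t-1}(\Delta_{t-1})\,(1+B_t) + B_t - [K_t(\Delta_{t-1})]^-,$$

where $B_t = (1 + V_{t-1}(\Delta_{t-1}))^{-1}\,[K_t(\Delta_{t-1})]^+$.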
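The rewriting step can be checked directly. Taking $B_t = [K_t(\Delta_{t-1})]^+ / (1 + V_{t-1}(\Delta_{t-1}))$, which is the choice under which the two displays agree and under which (V2) reads $\sum_t B_t < \infty$, a short verification is:

```latex
\begin{aligned}
V_{t-1}(\Delta_{t-1})(1+B_t) + B_t - [K_t(\Delta_{t-1})]^-
 &= V_{t-1}(\Delta_{t-1}) + B_t\,\bigl(1+V_{t-1}(\Delta_{t-1})\bigr) - [K_t(\Delta_{t-1})]^- \\
 &= V_{t-1}(\Delta_{t-1}) + [K_t(\Delta_{t-1})]^+ - [K_t(\Delta_{t-1})]^- \\
 &= V_{t-1}(\Delta_{t-1}) + K_t(\Delta_{t-1}).
\end{aligned}
```

So the rewritten bound is the same inequality, arranged in the form required by a Robbins-Siegmund type argument: a multiplicative perturbation $1+B_t$, an additive perturbation $B_t$, and a non-negative term $[K_t(\Delta_{t-1})]^-$ that is subtracted.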

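By (V2), we have that $\sum_{t=1}^{\infty} B_t<\infty$. Now we can use Lemma 6.1 in Appendix (applied to the process $V_t(\Delta_t)$, with $B_t$ as the summable sequence and $[K_t(\Delta_{t-1})]^-$ as the non-negative term) to deduce that the processes $V_t(\Delta_t)$ and

$$Y_t = \sum_{s=1}^{t} [K_s(\Delta_{s-1})]^-$$

converge to some finite limits. Therefore, it follows that $V_t(\Delta_t)\to r$ for some finite $r\ge 0$.

To prove the second assertion, suppose that $r>0$. Then there exists $\epsilon>0$ such that $\epsilon\le V_t(\Delta_t)\le 1/\epsilon$ eventually. By (3.1), this would imply that for some $t_0$,

$$\sum_{s=t_0}^{\infty} [K_s(\Delta_{s-1})]^- \ \ge\ \sum_{s=t_0}^{\infty}\ \inf_{\substack{\epsilon\le V_s(u)\le 1/\epsilon \\ z^0+u\in U_{s-1}}} [K_s(u)]^- = \infty$$

on the set $A$, which contradicts the existence of a finite limit of $Y_t$. Hence, $r=0$ and $V_t(\Delta_t)\to 0$.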

###### Remark 3.2

The conditions of the above lemma are difficult to interpret. Therefore, the rest of the section is devoted to formulating lemmas and corollaries (Lemmas 3.5 and 3.9, Corollaries 3.7, 3.12 and 3.13) containing sufficient conditions for the convergence and the rate of convergence, and remarks (Remarks 3.3, 3.4, 3.8, 3.10, 3.11 and 3.14) explaining some of the assumptions. These results are presented in such a way that each subsequent statement imposes conditions that are more restrictive than the previous one. For example, Corollary 3.13 and Remark 3.14 contain conditions which are more restrictive than all the previous ones, but are written in the simplest possible terms.

###### Remark 3.3

A typical choice of is , where is a predictable positive semi-definite matrix process. If goes to a finite matrix with , then subject to the conditions of Lemma 3.1, will tend to a finite limit implying that . This approach is adopted in Example 5.3 to derive convergence of the on-line Least Square estimator.

###### Remark 3.4

Consider truncation sets , where denotes a closed sphere in with the center at and the radius . Let and suppose that . Let where is a positive definite matrix and denote by and the largest and smallest eigenvalues of respectively. Then (i.e., (V1) holds with ), if , where . (See Proposition 6.2 in Appendix for details.) In particular, if is a scalar matrix, condition (V1) automatically holds.

###### Lemma 3.5

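Suppose that all the conditions of Lemma 3.1 hold and

(L)

for any $M>0$, there exists some $\delta>0$ such that

$$\inf_{\|u\|\ge M} V_t(u) > \delta \quad\text{eventually.}$$

Then $Z_t\to z^0$ ($P$-a.s.) for any initial value $Z_0$.

Proof. From Lemma 3.1, we have $V_t(\Delta_t)\to 0$ (a.s.). Now, $\Delta_t\to 0$ follows from (L) by contradiction. Indeed, suppose that $\Delta_t\not\to 0$ on a set of positive probability. Then, for any fixed $\omega$ from this set, there would exist a sequence $t_k\to\infty$ such that $\|\Delta_{t_k}\|\ge M$ for some $M>0$, and (L) would imply that $V_{t_k}(\Delta_{t_k})>\delta>0$ for large $k$, which contradicts the $P$-a.s. convergence $V_t(\Delta_t)\to 0$.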

###### Remark 3.6

The following corollary contains simple sufficient conditions for convergence. The proof of this corollary does not require dynamically changing Lyapunov functions and can be obtained from a less general version of Lemma 3.1 presented in Sharia (2014). We decided to present this corollary for the sake of completeness, noting that the proof, as well as a number of different sets of sufficient conditions, can be found in Sharia (2014).

###### Corollary 3.7

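Suppose that $Z_t$ is a process defined by (2.1), $U_t$ are admissible truncations for $z^0$ and

(D1)

for large $t$'s

$$(z-z^0)^T R_t(z) \le 0 \quad\text{if}\quad z\in U_{t-1};$$

(D2)

there exists a predictable process $r_t>0$ such that

$$\sup_{z\in U_{t-1}} \frac{E\{\|R_t(z)+\varepsilon_t(z)\|^2 \mid \mathcal{F}_{t-1}\}}{1+\|z-z^0\|^2} \le r_t$$

eventually, and

$$\sum_{t=1}^{\infty} r_t\,a_t^{-2} < \infty, \qquad P\text{-a.s.}$$

Then $\|Z_t-z^0\|$ converges ($P$-a.s.) to a finite limit.

Furthermore, if

(D3)

for each $\epsilon\in(0,1)$ there exists a predictable process $\nu_t>0$ such that

$$\inf_{\substack{\epsilon\le\|z-z^0\|\le 1/\epsilon \\ z\in U_{t-1}}} -(z-z^0)^T R_t(z) > \nu_t$$

eventually, where

$$\sum_{t=1}^{\infty} \nu_t\,a_t^{-1} = \infty, \qquad P\text{-a.s.},$$

then $Z_t$ converges ($P$-a.s.) to $z^0$.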

Proof. See Remark 3.6 above.

###### Remark 3.8

The rest of this section is concerned with the derivation of sufficient conditions to establish the rate of convergence. In most applications, checking the conditions of Lemma 3.9 and Corollary 3.12 below is difficult without establishing the convergence of $Z_t$ first. Therefore, although formally not required, we can assume that the convergence $Z_t\to z^0$ has already been established (using the lemmas and corollaries above or otherwise). Under this assumption, the conditions for the rate of convergence below can be regarded as local in $z^0$, that is, they can be derived using certain continuity and differentiability assumptions of the corresponding functions at the point $z^0$ (see examples in Section 5).

###### Lemma 3.9

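Suppose that $Z_t$ is a process defined by (2.1). Let $C_t$ be a predictable positive definite $m\times m$ matrix process, and $\lambda_t^{max}$ and $\lambda_t^{min}$ be the largest and the smallest eigenvalues of $C_t$ respectively. Denote $\Delta_t = Z_t - z^0$. Suppose also that (V1) of Lemma 3.1 holds and

(R1)

there exists a predictable non-negative scalar process $P_t$ such that

$$\frac{2\Delta_{t-1}^T C_t\,\gamma_t(z^0+\Delta_{t-1})\,R_t(z^0+\Delta_{t-1})}{\lambda_t^{max}} + P_t \le -\rho_t\,\|\Delta_{t-1}\|^2$$

eventually, where $\rho_t$ is a predictable non-negative scalar process satisfying

$$\sum_{t=1}^{\infty}\left[\frac{\lambda_t^{max}-\lambda_{t-1}^{min}}{\lambda_{t-1}^{min}} - \frac{\lambda_t^{max}}{\lambda_{t-1}^{min}}\,\rho_t\right]^+ < \infty;$$

(R2)

$$\sum_{t=1}^{\infty} \frac{\lambda_t^{max}\,\big[E\big\{\|\gamma_t(z^0+\Delta_{t-1})\,[R_t(z^0+\Delta_{t-1})+\varepsilon_t(z^0+\Delta_{t-1})]\|^2 \,\big|\, \mathcal{F}_{t-1}\big\} - P_t\big]^+}{1+\lambda_{t-1}^{min}\,\|\Delta_{t-1}\|^2} < \infty.$$

Then $(Z_t-z^0)^T C_t\,(Z_t-z^0)$ converges to a finite limit ($P$-a.s.).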

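Proof. Let us check the conditions of Lemma 3.1 with $V_t(u) = u^T C_t u$. Condition (V1) is satisfied automatically.

Denote $R_t = R_t(z^0+\Delta_{t-1})$, $\gamma_t = \gamma_t(z^0+\Delta_{t-1})$ and $\varepsilon_t = \varepsilon_t(z^0+\Delta_{t-1})$. Since $V'_t(u) = 2u^T C_t$ and $V''_t(u) = 2C_t$, we have

$$K_t(\Delta_{t-1}) = \Delta V_t(\Delta_{t-1}) + 2\Delta_{t-1}^T C_t\gamma_t R_t + E\big\{[\gamma_t(R_t+\varepsilon_t)]^T C_t\,\gamma_t(R_t+\varepsilon_t) \,\big|\, \mathcal{F}_{t-1}\big\}.$$

Since $C_t$ is positive definite, $\lambda_t^{min}\|u\|^2 \le u^T C_t u \le \lambda_t^{max}\|u\|^2$ for any $u\in\mathbb{R}^m$. Therefore

$$\Delta V_t(\Delta_{t-1}) \le (\lambda_t^{max}-\lambda_{t-1}^{min})\,\|\Delta_{t-1}\|^2.$$

Denote

$$\tilde P_t = \lambda_t^{max}\,(D_t - P_t),$$

where

$$D_t = E\{\|\gamma_t(R_t+\varepsilon_t)\|^2 \mid \mathcal{F}_{t-1}\}.$$

Then

$$\begin{aligned} K_t(\Delta_{t-1}) &\le (\lambda_t^{max}-\lambda_{t-1}^{min})\,\|\Delta_{t-1}\|^2 + 2\Delta_{t-1}^T C_t\gamma_t R_t + \lambda_t^{max} D_t \\ &= (\lambda_t^{max}-\lambda_{t-1}^{min})\,\|\Delta_{t-1}\|^2 + 2\Delta_{t-1}^T C_t\gamma_t R_t + \lambda_t^{max} P_t + \tilde P_t. \end{aligned}$$

By (R1), we have

$$2\Delta_{t-1}^T C_t\gamma_t R_t \le -\lambda_t^{max}\,(\rho_t\,\|\Delta_{t-1}\|^2 + P_t).$$

Therefore,

$$\begin{aligned} K_t(\Delta_{t-1}) &\le (\lambda_t^{max}-\lambda_{t-1}^{min})\,\|\Delta_{t-1}\|^2 - \lambda_t^{max}\,(\rho_t\,\|\Delta_{t-1}\|^2 + P_t) + \lambda_t^{max} P_t + \tilde P_t \\ &\le (\lambda_t^{max}-\lambda_{t-1}^{min}-\lambda_t^{max}\rho_t)\,\|\Delta_{t-1}\|^2 + \tilde P_t = r_t\,\lambda_{t-1}^{min}\,\|\Delta_{t-1}\|^2 + \tilde P_t, \end{aligned}$$

where

$$r_t = (\lambda_t^{max}-\lambda_{t-1}^{min}-\lambda_t^{max}\rho_t)/\lambda_{t-1}^{min}.$$

Since $\lambda_{t-1}^{min}\|\Delta_{t-1}\|^2\ge 0$, using the inequality $[a+b]^+\le[a]^++[b]^+$, we have

$$[K_t(\Delta_{t-1})]^+ \le \lambda_{t-1}^{min}\,\|\Delta_{t-1}\|^2\,[r_t]^+ + [\tilde P_t]^+.$$

Also, since $V_{t-1}(\Delta_{t-1}) = \Delta_{t-1}^T C_{t-1}\Delta_{t-1} \ge \lambda_{t-1}^{min}\|\Delta_{t-1}\|^2$,

$$\begin{aligned} \frac{[K_t(\Delta_{t-1})]^+}{1+V_{t-1}(\Delta_{t-1})} &\le \frac{[K_t(\Delta_{t-1})]^+}{1+\lambda_{t-1}^{min}\|\Delta_{t-1}\|^2} \le \frac{\lambda_{t-1}^{min}\|\Delta_{t-1}\|^2\,[r_t]^+}{1+\lambda_{t-1}^{min}\|\Delta_{t-1}\|^2} + \frac{[\tilde P_t]^+}{1+\lambda_{t-1}^{min}\|\Delta_{t-1}\|^2} \\ &\le [r_t]^+ + \frac{[\tilde P_t]^+}{1+\lambda_{t-1}^{min}\|\Delta_{t-1}\|^2}. \end{aligned}$$

By (R2), $\sum_{t=1}^{\infty} [\tilde P_t]^+/(1+\lambda_{t-1}^{min}\|\Delta_{t-1}\|^2)<\infty$, and according to (R1)

$$\sum_{t=1}^{\infty} [r_t]^+ = \sum_{t=1}^{\infty}\left[\frac{\lambda_t^{max}-\lambda_{t-1}^{min}}{\lambda_{t-1}^{min}} - \frac{\lambda_t^{max}}{\lambda_{t-1}^{min}}\,\rho_t\right]^+ < \infty.$$

Thus,

$$\sum_{t=1}^{\infty} \frac{[K_t(\Delta_{t-1})]^+}{1+V_{t-1}(\Delta_{t-1})} < \infty,$$

implying that Condition (V2) of Lemma 3.1 holds. Thus, $(Z_t-z^0)^T C_t\,(Z_t-z^0)$ converges to a finite limit almost surely.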

###### Remark 3.10

The choice $P_t=0$ means that (R2) becomes more restrictive, imposing stronger probabilistic restrictions on the model. Now, if $\Delta_{t-1}^T C_t\,\gamma_t(z^0+\Delta_{t-1})\,R_t(z^0+\Delta_{t-1})$ is eventually negative with a large absolute value, then it is possible to introduce a non-zero $P_t$ without strengthening condition (R1). One possibility might be