# Stability of Stochastic Approximations with ‘Controlled Markov’ Noise and Temporal Difference Learning

Arunselvan Ramaswamy (arunselvan@csa.iisc.ernet.in), Shalabh Bhatnagar (shalabh@csa.iisc.ernet.in)

###### Abstract

In this paper we present a ‘stability theorem’ for stochastic approximation (SA) algorithms with ‘controlled Markov’ noise. Such algorithms were first studied by Borkar in 2006. Specifically, we present sufficient conditions that guarantee the stability of the iterates. Further, under these conditions the iterates are shown to track a solution to the differential inclusion defined in terms of the ergodic occupation measures associated with the ‘controlled Markov’ process. As an application of our main result, we present an improvement to a general form of temporal difference learning algorithms: we give sufficient conditions for their stability and convergence using our framework. This paper builds on the works of Borkar and of Benveniste, Métivier and Priouret.

## 1 Introduction

Let us begin by considering the general form of stochastic approximation algorithms:

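$$x_{n+1} = x_n + a(n)\left(h(x_n) + M_{n+1}\right), \tag{1}$$

where

• $h: \mathbb{R}^d \to \mathbb{R}^d$ is a Lipschitz continuous function;

• $\{a(n)\}_{n \ge 0}$ is the given step-size sequence such that $\sum_{n \ge 0} a(n) = \infty$ and $\sum_{n \ge 0} a(n)^2 < \infty$;

• $\{M_n\}_{n \ge 1}$ is the sequence of square integrable martingale difference terms.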

In 1996, Benaïm [3] showed that the asymptotic behavior of recursion (1) can be determined by studying the asymptotic behavior of the associated o.d.e.

$$\dot{x}(t) = h(x(t)).$$

This technique is popularly known as the ODE method and was originally developed by Ljung in 1977 [9]. In [3] it is assumed that $\sup_{n \ge 0} \|x_n\| < \infty$ a.s.; in other words, the iterates are assumed to be stable. In many cases the stability assumption becomes a bottleneck in using the ODE method. This bottleneck was overcome by Borkar and Meyn in 1999 [8]. Specifically, they developed sufficient conditions that guarantee the ‘stability and convergence’ of recursion (1).
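As a rough illustration of recursion (1) and the ODE method, the following minimal sketch simulates a scalar instance. The drift `h`, the step sizes $a(n) = 1/(n+1)$ and the Gaussian noise are assumptions chosen for illustration only; they are not taken from the paper.

```python
import random

def h(x):
    # Hypothetical Lipschitz drift: the o.d.e. x'(t) = -(x(t) - 2)
    # has the globally asymptotically stable equilibrium x = 2.
    return -(x - 2.0)

def run_sa(drift, x0, n_steps, seed=0):
    """Iterate x_{n+1} = x_n + a(n) * (drift(x_n) + M_{n+1})."""
    rng = random.Random(seed)
    x = x0
    for n in range(n_steps):
        a_n = 1.0 / (n + 1)          # sum a(n) = inf, sum a(n)^2 < inf
        m_next = rng.gauss(0.0, 1.0)  # i.i.d. zero-mean noise, a martingale difference
        x = x + a_n * (drift(x) + m_next)
    return x

x_final = run_sa(h, x0=10.0, n_steps=50_000)
print(abs(x_final - 2.0))  # distance of the final iterate from the equilibrium
```

In this toy case the iterates are stable and settle near the equilibrium of the o.d.e., which is the behaviour the ODE method predicts once stability is known.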

In many applications, the noise process is Markovian in nature. Stochastic approximation algorithms with ‘Markov noise’ have been extensively studied in Benveniste et al. [5]. These results were extended to the case of ‘controlled Markov’ noise by Borkar [6]. Specifically, the asymptotics of the iterates are described via a limiting differential inclusion (DI) that is defined in terms of the ergodic occupation measures of the Markov process. As explained in [6], the motivation for such a study stems from the fact that in many cases the noise process is not Markov, but its lack of the Markov property comes through its dependence on a time-varying ‘control’ process. In particular, this is the case with many reinforcement learning algorithms. In [6], the iterates are assumed to be stable, which, as explained earlier, poses a bottleneck, especially in analyzing algorithms from reinforcement learning. The aim of this paper is to overcome this bottleneck. In other words, we present sufficient conditions for the ‘stability and convergence’ of stochastic approximation algorithms with ‘controlled Markov’ noise. Finally, as an application setting, we consider a general form of the temporal difference learning algorithms in reinforcement learning and present weaker sufficient conditions (than those in the literature) that guarantee their stability and convergence using our framework.

The organization of this paper is as follows:
In Section 2.1 we present the definitions and notations involved in this paper. In Section 2.2 we discuss the assumptions involved in proving the stability of the iterates given by (3).
In Section 3 we show the stability of the iterates under the assumptions outlined in Section 2.2 (Theorem 1).
In Section 4 we present additional assumptions which, coupled with the assumptions from Section 2.2, are used to prove the ‘stability and convergence’ of recursion (3) (Theorem 2). Specifically, Theorem 2 states that under the aforementioned sets of assumptions the iterates are stable and converge to an internally chain transitive invariant set associated with the limiting differential inclusion. For the definition of this inclusion the reader is referred to Section 4.
In Section 5 we discuss an application of Theorem 2. We present sufficient conditions for the ‘stability and convergence’ of a general form of temporal difference learning algorithms, in reinforcement learning.

## 2 Preliminaries and Assumptions

### 2.1 Notations & Definitions

In this section we present the definitions and notations used in this paper for the purpose of easy reference. Note that they can be found in Benaïm et al. [4], Aubin et al. [1], [2] and Borkar [7].

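Marchaud Map: A set-valued map $H: \mathbb{R}^d \to \{\text{subsets of } \mathbb{R}^d\}$ is called a Marchaud map if it satisfies the following properties:

• For each $x \in \mathbb{R}^d$, $H(x)$ is convex and compact.

• (point-wise boundedness) For each $x \in \mathbb{R}^d$, $\sup_{w \in H(x)} \|w\| < K(1 + \|x\|)$ for some $K > 0$.

• $H$ is an upper-semicontinuous map.
We say that $H$ is upper-semicontinuous if, given sequences $\{x_n\}_{n \ge 1}$ (in $\mathbb{R}^d$) and $\{y_n\}_{n \ge 1}$ (in $\mathbb{R}^d$) with $x_n \to x$, $y_n \to y$ and $y_n \in H(x_n)$, $n \ge 1$, we have $y \in H(x)$. In other words, the graph of $H$, $\{(x, y) : y \in H(x),\ x \in \mathbb{R}^d\}$, is closed in $\mathbb{R}^d \times \mathbb{R}^d$.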

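If the set-valued map $H$ is Marchaud, then the differential inclusion (DI) given by

$$\dot{x}(t) \in H(x(t)) \tag{2}$$

is guaranteed to have at least one solution that is absolutely continuous. The reader is referred to Aubin & Cellina [1] for more details.
If $\mathbf{x}$ is an absolutely continuous map satisfying (2) then we say that $\mathbf{x} \in \Sigma$.

A set-valued semiflow $\Phi$ associated with (2) is defined on $[0, +\infty) \times \mathbb{R}^d$ as follows:
$\Phi_t(x) := \{\mathbf{x}(t) \mid \mathbf{x} \in \Sigma,\ \mathbf{x}(0) = x\}$. Let $B \times M \subset [0, +\infty) \times \mathbb{R}^d$, define

$$\Phi_B(M) := \bigcup_{t \in B,\ x \in M} \Phi_t(x).$$

Let $M \subseteq \mathbb{R}^d$, the $\omega$-limit set be defined by $\omega_\Phi(M) := \bigcap_{t \ge 0} \overline{\Phi_{[t, +\infty)}(M)}$. Similarly the limit set of a solution $\mathbf{x}$ is given by $L(\mathbf{x}) := \bigcap_{t \ge 0} \overline{\mathbf{x}([t, +\infty))}$.

Invariant Set: $M \subseteq \mathbb{R}^d$ is invariant if for every $x \in M$ there exists a trajectory, $\mathbf{x}$, entirely in $M$ with $\mathbf{x}(0) = x$. In other words, $\mathbf{x} \in \Sigma$ with $\mathbf{x}(t) \in M$, for all $t \ge 0$.

Internally Chain Transitive Set: $M \subset \mathbb{R}^d$ is said to be internally chain transitive if $M$ is compact and for every $x, y \in M$, $\epsilon > 0$ and $T > 0$ we have the following: There exist $\Phi^1, \ldots, \Phi^n$ that are $n$ solutions to the differential inclusion $\dot{x}(t) \in H(x(t))$, a sequence $x_1 (= x), \ldots, x_{n+1} (= y) \subset M$ and $n$ real numbers $t_1, t_2, \ldots, t_n$ greater than $T$ such that: $\Phi^i_{t_i}(x_i) \in N^\epsilon(x_{i+1})$ and $\Phi^i_{[0, t_i]}(x_i) \subset M$ for $1 \le i \le n$. The sequence $(x_1 (= x), \ldots, x_{n+1} (= y))$ is called an $(\epsilon, T)$ chain in $M$ from $x$ to $y$.

Given $x \in \mathbb{R}^d$ and $A \subseteq \mathbb{R}^d$, define the distance between $x$ and $A$ by $d(x, A) := \inf\{\|x - y\| \mid y \in A\}$. We define the $\delta$-open neighborhood of $A$ by $N^\delta(A) := \{x \mid d(x, A) < \delta\}$. The $\delta$-closed neighborhood of $A$ is defined by $\overline{N^\delta}(A) := \{x \mid d(x, A) \le \delta\}$.

Attracting Set: $A \subseteq \mathbb{R}^d$ is an attracting set if it is compact and there exists a neighborhood $U$ such that for any $\epsilon > 0$ there exists $T(\epsilon) \ge 0$ such that $\Phi_{[T(\epsilon), +\infty)}(U) \subset N^\epsilon(A)$. Then $U$ is called the fundamental neighborhood of $A$. If, in addition to being compact, the attracting set is also invariant, then it is called an attractor. The basin of attraction of $A$ is given by $B(A) := \{x \mid \omega_\Phi(x) \subset A\}$. The set $A$ is Lyapunov stable if for all $\delta > 0$, $\exists\, \epsilon > 0$ such that $\Phi_{[0, +\infty)}(N^\epsilon(A)) \subseteq N^\delta(A)$. We use $T(\epsilon)$ and $T_\epsilon$ interchangeably to denote the dependence of $T$ on $\epsilon$.

The open ball of radius $r$ around $0$ is represented by $B_r(0)$, while the closed ball is represented by $\overline{B}_r(0)$.

Upper limit of sequences of sets: Let $\{K_n\}_{n \ge 1}$ be a sequence of sets in $\mathbb{R}^d$. The upper-limit of $\{K_n\}_{n \ge 1}$ is given by $\text{Limsup}_{n \to \infty} K_n := \{y \mid \liminf_{n \to \infty} d(y, K_n) = 0\}$.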

### 2.2 Assumptions

Let us consider a stochastic approximation algorithm with ‘controlled Markov’ noise in $\mathbb{R}^d$:

$$x_{n+1} = x_n + a(n)\left[h(x_n, y_n) + M_{n+1}\right], \tag{3}$$

where

• $h: \mathbb{R}^d \times S \to \mathbb{R}^d$ is a jointly continuous map with $S$ a compact metric space. The map $h$ is Lipschitz continuous in the first component, and its Lipschitz constant does not change with the second component. Let the Lipschitz constant be $L$. This corresponds to an assumption in Section 2 of Borkar [6]. Here we call it (A1).

• The step-size sequence $\{a(n)\}_{n \ge 0}$ is such that $a(n) > 0$ for all $n$, $\sum_{n \ge 0} a(n) = \infty$ and $\sum_{n \ge 0} a(n)^2 < \infty$. Without loss of generality let $\sup_n a(n) \le 1$. This corresponds to an assumption in Section 2 of Borkar [6]. Here we call it (A3).

• $\{M_n\}_{n \ge 1}$ is a sequence of square integrable martingale difference terms that also contribute to the noise. They are related to $\{x_n\}_{n \ge 0}$ by

$$E\left[\|M_{n+1}\|^2 \mid \mathcal{F}_n\right] \le K\left(1 + \|x_n\|^2\right), \text{ where } n \ge 0.$$

This corresponds to an assumption in Section 2 of Borkar [6]. Here we call it (A2).

• $\{y_n\}_{n \ge 0}$ is the $S$-valued ‘Controlled Markov’ process.

Note that $S$ is assumed to be Polish in [6]. As stated in (A1), in this paper we let $S$ be a compact metric space, hence Polish. Among the assumptions made in [6], those listed above as (A1)-(A3) are relevant to prove the stability of the iterates. The remaining assumptions are listed in Section 4, where we present the result on the ‘stability and convergence’ of the iterates given by (3). See Borkar [6] for more details.
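To make the setting of recursion (3) concrete, here is a minimal sketch under stated assumptions: the state space is the two-point set {0, 1}, the drift `h` and the transition rule `next_state` are hypothetical, and the iterate itself plays the role of the control. None of these choices come from the paper; they only illustrate a noise process whose transition law depends on a time-varying control.

```python
import math
import random

def h(x, y):
    # Hypothetical drift: Lipschitz in x with constant 1, uniformly in y.
    return -x + (1.0 if y == 1 else -1.0)

def next_state(y, x, rng):
    # Hypothetical controlled kernel: the probability of moving to state 1
    # depends on the current state y and on the control, here the iterate x.
    p_one = 0.5 + 0.4 * math.tanh(x) * (1.0 if y == 0 else -1.0)
    return 1 if rng.random() < p_one else 0

def run_controlled_sa(x0, n_steps, seed=0):
    """Iterate x_{n+1} = x_n + a(n) * (h(x_n, y_n) + M_{n+1})."""
    rng = random.Random(seed)
    x, y = x0, 0
    for n in range(n_steps):
        a_n = 1.0 / (n + 1)
        m_next = rng.gauss(0.0, 1.0)   # martingale difference term M_{n+1}
        x_new = x + a_n * (h(x, y) + m_next)
        y = next_state(y, x, rng)      # y_{n+1} is drawn using the control x_n
        x = x_new
    return x

print(run_controlled_sa(x0=5.0, n_steps=20_000))
```

In this toy example the iterates stay bounded because the drift pulls them toward the interval $[-1, 1]$; the point of the paper is to give conditions under which such boundedness can be proved rather than assumed.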

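• For each $c \ge 1$, we define functions $h_c: \mathbb{R}^d \times S \to \mathbb{R}^d$ by $h_c(x, y) := h(cx, y)/c$.

• We define the limiting map $h_\infty: \mathbb{R}^d \times S \to \{\text{subsets of } \mathbb{R}^d\}$ by $h_\infty(x, y) := \text{Limsup}_{c \to \infty} \{h_c(x, y)\}$, where $\text{Limsup}$ is the upper-limit of a sequence of sets (see Section 2.1).

• For each $x \in \mathbb{R}^d$ define $H(x) := \overline{co}\left(\bigcup_{y \in S} h_\infty(x, y)\right)$.

We replace the stability assumption in [6] with the following two assumptions.

• (S1) If $c_n \uparrow \infty$, $y_n \to y$ and $\lim_{n \to \infty} h_{c_n}(x, y_n) = u$ for some $u \in \mathbb{R}^d$, then $u \in h_\infty(x, y)$.

• (S2) There exists an attracting set, $A$, associated with $\dot{x}(t) \in H(x(t))$ such that $\sup_{u \in A} \|u\| < 1$. Further, $\overline{B}_1(0)$ is a subset of some fundamental neighborhood of $A$.

A further assumption, discussed in Section 5, is a sufficient condition for (S1) to be satisfied. One could say that (S2) constitutes the ‘Lyapunov function’ condition for $\dot{x}(t) \in H(x(t))$. We shall show that $H$ is a Marchaud map in Lemma 2. As explained in [1], it follows that the DI, $\dot{x}(t) \in H(x(t))$, has at least one solution that is absolutely continuous. Hence assumption (S2) is meaningful.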
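The rescaling that defines the limiting map can be illustrated numerically. The sketch below assumes the rescaled map is $h_c(x, y) = h(cx, y)/c$, as in the Lipschitz computation that follows, and uses a hypothetical scalar drift $h(x, y) = -x + \cos(y)$; for this choice the bounded perturbation is scaled away and $h_c(x, y)$ approaches $-x$ as $c$ grows.

```python
import math

def h(x, y):
    # Hypothetical drift: linear in x plus a bounded perturbation in y.
    return -x + math.cos(y)

def h_c(c, x, y):
    # Rescaled drift h_c(x, y) = h(c * x, y) / c for c >= 1.
    return h(c * x, y) / c

x, y = 0.7, 1.3
scales = (1, 10, 100, 1000)
# Gap between h_c(x, y) and the candidate limit -x; here it equals |cos(y)| / c.
gaps = [abs(h_c(c, x, y) - (-x)) for c in scales]
for c, gap in zip(scales, gaps):
    print(c, gap)
```

For this example the limiting differential inclusion reduces to the o.d.e. $\dot{x}(t) = -x(t)$, whose attracting set is the origin, so a condition of the attracting-set type holds trivially.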

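We begin by showing that $h_c$ satisfies (A1) for all $c \ge 1$. Fix $c \ge 1$, $x_1, x_2 \in \mathbb{R}^d$ and $y \in S$, we have

$$\|h_c(x_1, y) - h_c(x_2, y)\| = \|h(cx_1, y)/c - h(cx_2, y)/c\|,$$
$$\|h(cx_1, y)/c - h(cx_2, y)/c\| \le L\|cx_1 - cx_2\|/c, \text{ hence}$$
$$\|h_c(x_1, y) - h_c(x_2, y)\| \le L\|x_1 - x_2\|.$$

We thus have that $h_c$ is Lipschitz continuous in the first component with Lipschitz constant $L$. Further, for a fixed $c$ this constant does not change with $y$. Since $c$ was arbitrarily chosen it follows that $L$ is the Lipschitz constant associated with every $h_c$. It is trivially true that $h_c$ is a jointly continuous map.

Fix $c \ge 1$, $x \in \mathbb{R}^d$ and $y \in S$, then

$$\|h_c(x, y) - h_c(0, y)\| \le L\|x - 0\|, \text{ hence}$$
$$\|h_c(x, y)\| \le \|h_c(0, y)\| + L\|x\|.$$

Since $h(0, \cdot)$ is a continuous function on $S$ (a compact set) and $c \ge 1$, we have $\|h_c(0, y)\| \le M$ for some $M > 0$. Thus

$$\|h_c(x, y)\| \le K(1 + \|x\|), \text{ where } K = L \vee M.$$

We may assume without loss of generality that $K$ is such that $E\left[\|M_{n+1}\|^2 \mid \mathcal{F}_n\right] \le K(1 + \|x_n\|^2)$ also holds for all $n \ge 0$ (assumption (A2)). Again $K$ does not change with $c$.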

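Fix $x \in \mathbb{R}^d$ and $y \in S$. As explained in the previous paragraph we have,

$$\sup_{c \ge 1} \|h_c(x, y)\| \le K(1 + \|x\|).$$

The upper-limit of $\{h_c(x, y)\}_{c \ge 1}$, $h_\infty(x, y)$, is clearly non-empty. Recall that $h_\infty(x, y) = \text{Limsup}_{c \to \infty} \{h_c(x, y)\}$ and $H(x) = \overline{co}\left(\bigcup_{y \in S} h_\infty(x, y)\right)$. Hence,

$$\sup_{u \in h_\infty(x, y)} \|u\| \le K(1 + \|x\|) \text{ and } \sup_{u \in H(x)} \|u\| \le K(1 + \|x\|). \tag{4}$$

We need to show that $H$ is a Marchaud map. Before we do that, let us prove an auxiliary result.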

###### Lemma 1.

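Suppose $x_n \to x$ in $\mathbb{R}^d$, $y_n \to y$ in $S$, $c_n \uparrow \infty$ and $h_{c_n}(x_n, y_n) \to u$. Then $u \in h_\infty(x, y)$.

###### Proof.

Consider the following inequality:

$$\|h_{c_n}(x, y_n) - u\| \le \|h_{c_n}(x_n, y_n) - u\| + \|h_{c_n}(x, y_n) - h_{c_n}(x_n, y_n)\|.$$

Since $\|h_{c_n}(x, y_n) - h_{c_n}(x_n, y_n)\| \le L\|x - x_n\|$ and $h_{c_n}(x_n, y_n) \to u$, we get

$$\lim_{c_n \to \infty} h_{c_n}(x, y_n) = u.$$

It follows from the first stability assumption of Section 2.2 (limits of $h_{c_n}(x, y_n)$ along $c_n \uparrow \infty$, $y_n \to y$ belong to $h_\infty(x, y)$) that $u \in h_\infty(x, y)$. ∎

The following is a direct consequence of Lemma 1: If $x_n \to x$ in $\mathbb{R}^d$, $y_n \to y$ in $S$ and $c_n \uparrow \infty$, then $d\left(h_{c_n}(x_n, y_n), h_\infty(x, y)\right) \to 0$. If this is not so, then without loss of generality we have that $d\left(h_{c_n}(x_n, y_n), h_\infty(x, y)\right) > \epsilon$ for all $n$, for some $\epsilon > 0$. Since the sequence $\{h_{c_n}(x_n, y_n)\}_{n \ge 1}$ is bounded (by the linear growth bound on $h_c$), it lies in a compact set, so $\exists\, \{n(k)\} \subseteq \{n\}$ such that $h_{c_{n(k)}}(x_{n(k)}, y_{n(k)}) \to u$ for some $u \in \mathbb{R}^d$ with $d(u, h_\infty(x, y)) \ge \epsilon$. We have $x_{n(k)} \to x$, $y_{n(k)} \to y$, $c_{n(k)} \uparrow \infty$ and $h_{c_{n(k)}}(x_{n(k)}, y_{n(k)}) \to u$. It follows from Lemma 1 that $u \in h_\infty(x, y)$. This is a contradiction.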

###### Lemma 2.

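$H$ is a Marchaud map.

###### Proof.

Recall that $H(x) = \overline{co}\left(\bigcup_{y \in S} h_\infty(x, y)\right)$. As explained earlier (cf. (4)),

$$\sup_{u \in H(x)} \|u\| \le K(1 + \|x\|).$$

Hence $H$ is point-wise bounded. From the definition of $H$ it follows that $H(x)$ is convex and compact for each $x \in \mathbb{R}^d$.

It is left to show that $H$ is upper semi-continuous. Let $x_n \to x$, $z_n \to z$ and $z_n \in H(x_n)$, $\forall\, n \ge 1$. We need to show that $z \in H(x)$. If this is not true, then there exists a linear functional on $\mathbb{R}^d$, say $f$, such that $\sup_{w \in H(x)} f(w) \le \alpha - \epsilon$ and $f(z) \ge \alpha + \epsilon$, for some $\alpha \in \mathbb{R}$ and $\epsilon > 0$. Since $z_n \to z$, there exists $N$ such that $f(z_n) \ge \alpha + \epsilon/2$ for each $n \ge N$, i.e., $H(x_n) \cap [f \ge \alpha + \epsilon/2] \ne \emptyset$; here $[f \ge a]$ is used to denote the set $\{w \mid f(w) \ge a\}$. For the sake of notational convenience let us denote $\bigcup_{y \in S} h_\infty(x, y)$ by $G(x)$ for all $x \in \mathbb{R}^d$. We claim that $G(x_n) \cap [f \ge \alpha + \epsilon/2] \ne \emptyset$ for all $n \ge N$. We shall prove this claim later; for now we assume that the claim is true and proceed.

Pick $w_n \in G(x_n) \cap [f \ge \alpha + \epsilon/2]$ for each $n \ge N$. Let $w_n \in h_\infty(x_n, y_n)$ for some $y_n \in S$. Since $\{w_n\}_{n \ge N}$ is norm bounded it contains a convergent subsequence, say $\{w_{n(k)}\}_{k \ge 1}$. Let $\lim_{k \to \infty} w_{n(k)} = w$. Since $w_{n(k)} \in h_\infty(x_{n(k)}, y_{n(k)})$, $\exists\, c_{n(k)}$ such that $\|w_{n(k)} - h_{c_{n(k)}}(x_{n(k)}, y_{n(k)})\| < 1/n(k)$. The sequence $\{c_{n(k)}\}_{k \ge 1}$ is chosen such that $c_{n(k+1)} > c_{n(k)}$ for each $k$ and $c_{n(k)} \uparrow \infty$. Since $\{y_{n(k)}\}_{k \ge 1}$ is from a compact set, there exists a convergent subsequence. For the sake of notational convenience (without loss of generality) we assume that the sequence itself has a limit, i.e., $y_{n(k)} \to y$ for some $y \in S$. We have the following: $c_{n(k)} \uparrow \infty$, $x_{n(k)} \to x$, $y_{n(k)} \to y$, and $h_{c_{n(k)}}(x_{n(k)}, y_{n(k)}) \to w$. It follows from Lemma 1 that $w \in h_\infty(x, y) \subseteq H(x)$. Since $w_{n(k)} \to w$ and $f(w_{n(k)}) \ge \alpha + \epsilon/2$ for each $k$, we have that $f(w) \ge \alpha + \epsilon/2$. This contradicts $\sup_{w' \in H(x)} f(w') \le \alpha - \epsilon$.

It remains to prove that $G(x_n) \cap [f \ge \alpha + \epsilon/2] \ne \emptyset$ for all $n \ge N$. If this were not true, then $\exists\, \{m(k)\} \subseteq \{n \ge N\}$ such that $G(x_{m(k)}) \subseteq [f < \alpha + \epsilon/2]$ for all $k$. It follows that $H(x_{m(k)}) = \overline{co}\left(G(x_{m(k)})\right) \subseteq [f \le \alpha + \epsilon/2]$ for each $k$. Since $z_{m(k)} \to z$, $\exists\, N_1$ such that for all $m(k) \ge N_1$, $f(z_{m(k)}) \ge \alpha + 3\epsilon/4$. This leads to a contradiction, since $z_{m(k)} \in H(x_{m(k)})$. ∎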

## 3 Stability Theorem

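Let us construct the linear interpolated trajectory $\overline{x}(t)$ for $t \in [0, \infty)$ from the sequence $\{x_n\}_{n \ge 0}$. Define $t(0) := 0$ and $t(n) := \sum_{i=0}^{n-1} a(i)$, $n \ge 1$. Let $\overline{x}(t(n)) := x_n$ and for $t \in (t(n), t(n+1))$ let

$$\overline{x}(t) := \left(\frac{t(n+1) - t}{t(n+1) - t(n)}\right) \overline{x}(t(n)) + \left(\frac{t - t(n)}{t(n+1) - t(n)}\right) \overline{x}(t(n+1)).$$

Define $T_0 := 0$ and $T_n := \min\{t(m) : t(m) \ge T_{n-1} + T\}$ for $n \ge 1$. Observe that there exists a subsequence, $\{m(n)\}_{n \ge 0}$, of $\{n\}$ such that $T_n = t(m(n))$ for all $n \ge 0$.

We use $\overline{x}(\cdot)$ to construct the rescaled trajectory, $\hat{x}(t)$, for $t \ge 0$. Let $t \in [T_n, T_{n+1})$ for some $n \ge 0$ and define $\hat{x}(t) := \overline{x}(t)/r(n)$, where $r(n) = \|\overline{x}(T_n)\| \vee 1$. Also, let $\hat{x}(T_{n+1}^-) := \lim_{t \uparrow T_{n+1}} \hat{x}(t)$, $t \in [T_n, T_{n+1})$. The rescaled martingale difference terms are given by $\hat{M}_{k+1} := M_{k+1}/r(n)$, $t(k) \in [T_n, T_{n+1})$.

We define a piece-wise constant trajectory, $\hat{z}(\cdot)$, using the rescaled trajectory as follows: Let $t \in [t(m), t(m+1))$ and $T_n \le t(m) < t(m+1) \le T_{n+1}$. Define $\hat{z}(t) := h_{r(n)}(\hat{x}(t(m)), y_m)$. Let us define another piece-wise constant trajectory $\hat{y}(\cdot)$ using $\{y_n\}_{n \ge 0}$ as follows: Let $\hat{y}(t) := y_m$ for all $t \in [t(m), t(m+1))$.

Recall that $A$ is an attracting set associated with $\dot{x}(t) \in H(x(t))$ (see the attracting-set assumption in Section 2.2). Let $\delta_1 := \sup_{u \in A} \|u\|$, then $\delta_1 < 1$. Choose $\delta_2$, $\delta_3$ and $\delta_4$ such that $\delta_1 < \delta_2 < \delta_3 < \delta_4 < 1$. Fix $T := T(\delta_2 - \delta_1)$, where $T(\cdot)$ is defined in Section 2.1. Let $\mathbf{x}(\cdot)$ be a solution to $\dot{x}(t) \in H(x(t))$ such that $\|\mathbf{x}(0)\| \le 1$, then $\|\mathbf{x}(t)\| < \delta_2$ for all $t \ge T$.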
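The bookkeeping behind the interpolation and rescaling can be sketched as follows, for scalar iterates. This is only an illustration under assumptions: the step sizes $a(n) = 1/(n+1)$, the block length `T = 1.0` and the helper names `time_grid`, `block_indices` and `rescale` are hypothetical, and the block starts are taken to be the first grid times at least `T` after the previous block start.

```python
def time_grid(a, n_max):
    """Return [t(0), ..., t(n_max)] with t(0) = 0 and t(n) = a(0) + ... + a(n-1)."""
    t = [0.0]
    for n in range(n_max):
        t.append(t[-1] + a(n))
    return t

def block_indices(t, T):
    """Return [m(0), m(1), ...]: m(0) = 0 and m(n) is the first index
    with t(m(n)) >= t(m(n-1)) + T, so that T_n = t(m(n))."""
    m = [0]
    for idx, t_idx in enumerate(t):
        if t_idx >= t[m[-1]] + T:
            m.append(idx)
    return m

def rescale(x, m):
    """Divide the iterates of block n by r(n) = max(|x_{m(n)}|, 1)."""
    x_hat = []
    for n, start in enumerate(m):
        end = m[n + 1] if n + 1 < len(m) else len(x)
        r_n = max(abs(x[start]), 1.0)
        x_hat.extend(x_k / r_n for x_k in x[start:end])
    return x_hat

T = 1.0
t = time_grid(lambda n: 1.0 / (n + 1), 200)
m = block_indices(t, T)
x = [((-1) ** k) * (5.0 + 0.1 * k) for k in range(len(t))]  # made-up iterates
x_hat = rescale(x, m)
```

Because the step sizes are at most 1, every block length $T_{n+1} - T_n$ lies in $[T, T+1)$, and each rescaled block starts inside the closed unit ball; both facts are used repeatedly in the stability argument.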

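Consider the following recursion:

$$\overline{x}(t(k+1)) = \overline{x}(t(k)) + a(k)\left(h(\overline{x}(t(k)), y_k) + M_{k+1}\right),$$

such that $t(k), t(k+1) \in [T_n, T_{n+1})$. Multiplying both sides by $1/r(n)$, we get the following rescaled recursion:

$$\hat{x}(t(k+1)) = \hat{x}(t(k)) + a(k)\left(h_{r(n)}(\hat{x}(t(k)), y_k) + \hat{M}_{k+1}\right). \tag{5}$$

Note that $E\left[\|\hat{M}_{k+1}\|^2 \mid \mathcal{F}_k\right] \le K\left(1 + \|\hat{x}(t(k))\|^2\right)$, since $r(n) \ge 1$.

The following two lemmas can be found in Borkar & Meyn [8] (which, however, does not consider ‘controlled Markov’ noise). It is shown there that the ‘martingale noise’ sequence converges almost surely. We present the results below using our setting.

###### Lemma 3.

$\sup_{t \in [0, \infty)} E\|\hat{x}(t)\|^2 < \infty$.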

###### Proof.

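Recall that $T_n = t(m(n))$ and $T_{n+1} = t(m(n+1))$. It is enough to show that

$$\sup_{m(n) \le k < m(n+1)} E\|\hat{x}(t(k))\|^2 \le C$$

for some $C > 0$ that is independent of $n$. Let us fix $n$ and $k$ such that $n \ge 0$ and $m(n) \le k < m(n+1)$. Consider the following rescaled recursion:

$$\hat{x}(t(k)) = \hat{x}(t(k-1)) + a(k-1)\left(\hat{z}(t(k-1)) + \hat{M}_k\right).$$

Unfolding the above we get,

$$\hat{x}(t(k)) = \hat{x}(t(m(n))) + \sum_{l=m(n)}^{k-1} a(l)\left(\hat{z}(t(l)) + \hat{M}_{l+1}\right).$$

Taking expectation of the square of the norms on both sides we get,

$$E\|\hat{x}(t(k))\|^2 = E\left\|\hat{x}(t(m(n))) + \sum_{l=m(n)}^{k-1} a(l)\left(\hat{z}(t(l)) + \hat{M}_{l+1}\right)\right\|^2.$$

It follows from the Minkowski inequality that,

$$E^{1/2}\|\hat{x}(t(k))\|^2 \le E^{1/2}\|\hat{x}(T_n)\|^2 + \sum_{l=m(n)}^{k-1} a(l)\left(E^{1/2}\|\hat{z}(t(l))\|^2 + E^{1/2}\|\hat{M}_{l+1}\|^2\right).$$

For each $l$ such that $m(n) \le l \le k-1$, $\|\hat{z}(t(l))\| \le K(1 + \|\hat{x}(t(l))\|)$. Further, $E\|\hat{M}_{l+1}\|^2 \le K\left(1 + E\|\hat{x}(t(l))\|^2\right)$. Observe that $\|\hat{x}(T_n)\| \le 1$ (since $r(n) = \|\overline{x}(T_n)\| \vee 1$) and that $\sum_{l=m(n)}^{k-1} a(l) \le T + 1$ (since the step sizes are bounded by $1$). Using these observations we get the following:

$$E^{1/2}\|\hat{x}(t(k))\|^2 \le 1 + \sum_{l=m(n)}^{k-1} a(l)\left(K E^{1/2}\left(1 + \|\hat{x}(t(l))\|\right)^2 + \sqrt{K} E^{1/2}\left(1 + \|\hat{x}(t(l))\|^2\right)\right), \tag{6}$$
$$E^{1/2}\|\hat{x}(t(k))\|^2 \le 1 + \sum_{l=m(n)}^{k-1} a(l)\left(K\left(1 + E^{1/2}\|\hat{x}(t(l))\|^2\right) + \sqrt{K}\left(1 + E^{1/2}\|\hat{x}(t(l))\|^2\right)\right), \tag{7}$$
$$E^{1/2}\|\hat{x}(t(k))\|^2 \le \left[1 + (K + \sqrt{K})(T + 1)\right] + (K + \sqrt{K}) \sum_{l=m(n)}^{k-1} a(l) E^{1/2}\|\hat{x}(t(l))\|^2. \tag{8}$$

Applying the discrete version of the Gronwall inequality we now get,

$$E^{1/2}\|\hat{x}(t(k))\|^2 \le \left[1 + (K + \sqrt{K})(T + 1)\right] e^{(K + \sqrt{K})(T + 1)}.$$

Let us define $C := \left(\left[1 + (K + \sqrt{K})(T + 1)\right] e^{(K + \sqrt{K})(T + 1)}\right)^2$. Clearly $C$ is independent of $n$ and the claim follows. ∎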
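The discrete Gronwall step can be checked numerically. In its standard form, if $u_k \le C_0 + \lambda \sum_{l<k} a(l) u_l$ with non-negative terms, then $u_k \le C_0 \exp\left(\lambda \sum_{l<k} a(l)\right)$. The sketch below builds the extremal sequence that satisfies the hypothesis with equality and compares it with the exponential bound; the function name and the numerical values are illustrative assumptions, not quantities from the paper.

```python
import math

def gronwall_check(steps, c0, lam):
    """Build u_k = c0 + lam * sum_{l<k} a(l) * u_l (the extremal case) and
    check u_k <= c0 * exp(lam * sum_{l<k} a(l)) for every k."""
    weighted_sum = 0.0   # running value of sum_{l<k} a(l) * u_l
    step_sum = 0.0       # running value of sum_{l<k} a(l)
    u = []
    holds = True
    for a_l in steps:
        u_k = c0 + lam * weighted_sum
        bound = c0 * math.exp(lam * step_sum)
        holds = holds and u_k <= bound * (1.0 + 1e-12)  # small float tolerance
        u.append(u_k)
        weighted_sum += a_l * u_k
        step_sum += a_l
    return holds, u

steps = [1.0 / (n + 1) for n in range(50)]
holds, u = gronwall_check(steps, c0=3.0, lam=2.0)
print(holds)
```

The extremal sequence equals $C_0 \prod_{l<k}(1 + \lambda a(l))$, so the check amounts to the elementary inequality $1 + s \le e^s$ applied factor by factor.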

###### Lemma 4.

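The sequence $\hat{\zeta}_n$, $n \ge 0$, converges almost surely, where $\hat{\zeta}_n := \sum_{k=0}^{n-1} a(k) \hat{M}_{k+1}$ for all $n \ge 1$.

###### Proof.

It is enough to prove that

$$\sum_{k=0}^{\infty} E\left[\|a(k) \hat{M}_{k+1}\|^2 \mid \mathcal{F}_k\right] < \infty \text{ a.s.}$$

For this, it suffices to show that the expectation of the sum is finite:

$$E\left[\sum_{k=0}^{\infty} a(k)^2 E\left[\|\hat{M}_{k+1}\|^2 \mid \mathcal{F}_k\right]\right] < \infty.$$

From assumption (A2) we get

$$E\left[\sum_{k=0}^{\infty} a(k)^2 E\left[\|\hat{M}_{k+1}\|^2 \mid \mathcal{F}_k\right]\right] \le \sum_{k=0}^{\infty} a(k)^2 K\left(1 + E\|\hat{x}(t(k))\|^2\right).$$

The claim now follows from Lemma 3 and (A3). ∎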

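Let $x^n(t)$, $t \in [0, T]$, be the solution (up to time $T$) to $\dot{x}^n(t) = \hat{z}(T_n + t)$ with initial condition $x^n(0) = \hat{x}(T_n)$. Clearly,

$$x^n(t) = \hat{x}(T_n) + \int_0^t \hat{z}(T_n + s)\, ds. \tag{9}$$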

###### Proof.

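Let $t \in [T_n, T_{n+1})$ be such that $t \in [t(m(n)+k), t(m(n)+k+1))$, where $k \ge 0$ and $t(m(n)+k+1) \le T_{n+1}$. First we prove the lemma when $t(m(n)+k+1) < T_{n+1}$. Consider the following:

$$\hat{x}(t) = \left(\frac{t(m(n)+k+1) - t}{a(m(n)+k)}\right) \hat{x}(t(m(n)+k)) + \left(\frac{t - t(m(n)+k)}{a(m(n)+k)}\right) \hat{x}(t(m(n)+k+1)). \tag{10}$$

Substituting for $\hat{x}(t(m(n)+k+1))$ in the above equation we get:

$$\hat{x}(t) = \left(\frac{t(m(n)+k+1) - t}{a(m(n)+k)}\right) \hat{x}(t(m(n)+k)) + \left(\frac{t - t(m(n)+k)}{a(m(n)+k)}\right) \left(\hat{x}(t(m(n)+k)) + a(m(n)+k)\left(h_{r(n)}(\hat{x}(t(m(n)+k)), y_{m(n)+k}) + \hat{M}_{m(n)+k+1}\right)\right), \tag{11}$$

hence,

$$\hat{x}(t) = \hat{x}(t(m(n)+k)) + \left(t - t(m(n)+k)\right)\left(h_{r(n)}(\hat{x}(t(m(n)+k)), y_{m(n)+k}) + \hat{M}_{m(n)+k+1}\right). \tag{12}$$

Unfolding $\hat{x}(t(m(n)+k))$, we get (see (5)),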

 ^x(t)=^x(Tn)+k−1∑l=0a(m(n)+l)(hr(n)(^x(t(m(n)+l)),ym(n)+l)+