# Supplementary material for Uncorrected least-squares temporal difference with lambda-return

Takayuki Osogami
IBM Research - Tokyo
osogami@jp.ibm.com
###### Abstract

Here, we provide supplementary material for Takayuki Osogami, “Uncorrected least-squares temporal difference with lambda-return,” which appears in Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI-20) (Osogami, 2019).

## Appendix A Proofs

In this section, we prove Theorem 1, Lemma 1, Theorem 2, Lemma 2, and Proposition 1. Note that equations (1)-(19) refer to those in Osogami (2019).

### a.1 Proof of Theorem 1

From (7)-(8), we have the following equality:

$$
\begin{aligned}
A^{\mathrm{Unc}}_{T+1}
&= \sum_{t=0}^{T} \phi_t \Bigl( \phi_t - (1-\lambda)\gamma \sum_{m=1}^{T-t} (\lambda\gamma)^{m-1} \phi_{t+m} \Bigr)^{\!\top} && (20) \\
&= \sum_{t=0}^{T-1} \phi_t \Bigl( \phi_t - (1-\lambda)\gamma \sum_{m=1}^{T-t} (\lambda\gamma)^{m-1} \phi_{t+m} \Bigr)^{\!\top} + \phi_T \phi_T^\top && (21) \\
&= \sum_{t=0}^{T-1} \phi_t \Bigl( \phi_t - (1-\lambda)\gamma \sum_{m=1}^{T-t-1} (\lambda\gamma)^{m-1} \phi_{t+m} - (1-\lambda)\gamma (\lambda\gamma)^{T-t-1} \phi_T \Bigr)^{\!\top} + \phi_T \phi_T^\top && (22) \\
&= A^{\mathrm{Unc}}_{T} - \sum_{t=0}^{T-1} \phi_t \, (1-\lambda)\gamma (\lambda\gamma)^{T-t-1} \phi_T^\top + \phi_T \phi_T^\top && (23) \\
&= A^{\mathrm{Unc}}_{T} - (1-\lambda)\gamma \, z_{T-1} \phi_T^\top + \phi_T \phi_T^\top && (24) \\
&= A^{\mathrm{Unc}}_{T} + (z_T - \gamma z_{T-1}) \, \phi_T^\top && (25)
\end{aligned}
$$

and

$$
\begin{aligned}
b_{T+1}
&= \sum_{t=0}^{T} \phi_t \sum_{m=0}^{T-t} (\lambda\gamma)^{m} r_{t+1+m} && (26) \\
&= \sum_{t=0}^{T-1} \phi_t \Bigl( \sum_{m=0}^{T-t-1} (\lambda\gamma)^{m} r_{t+1+m} + (\lambda\gamma)^{T-t} r_{T+1} \Bigr) + \phi_T r_{T+1} && (27) \\
&= b_T + z_T \, r_{T+1}. && (28)
\end{aligned}
$$

Here, the equality from (24) to (25) and the equality from (27) to (28) follow from the definition of the eligibility trace in the theorem. The recursive computation of the eligibility trace can be verified in a straightforward manner. This completes the proof of Theorem 1.
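The recursions in Theorem 1 translate directly into a constant-cost-per-step implementation. The following NumPy sketch is our own illustration (the function and variable names are not from the paper); it builds $A^{\mathrm{Unc}}_T$ and $b_T$ incrementally and can be checked against the direct definitions (7)-(8):

```python
import numpy as np

def uncorrected_lstd(phis, rewards, lam, gamma):
    """Build A^Unc_T and b_T incrementally via the Theorem 1 recursions:
        z_t = lam*gamma*z_{t-1} + phi_t        (eligibility trace)
        A  += (z_t - gamma*z_{t-1}) phi_t^T    (rank-one update, eq. (25))
        b  += z_t * r_{t+1}                    (eq. (28))
    Here rewards[t] holds r_{t+1}, the reward observed after step t.
    """
    d = len(phis[0])
    A, b = np.zeros((d, d)), np.zeros(d)
    z_prev = np.zeros(d)
    for phi_t, r_next in zip(phis, rewards):
        z = lam * gamma * z_prev + phi_t  # trace recursion
        A += np.outer(z - gamma * z_prev, phi_t)
        b += z * r_next
        z_prev = z
    return A, b
```

Each step costs one rank-one update of $A$ and one scaled addition to $b$, so no inner sum over past time steps is ever recomputed.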

### a.2 Proof of Lemma 1

Observe that there exists $T_0$ such that $\frac{1}{T} A^{\mathrm{Unc}}_{T}$ is invertible for any $T \ge T_0$, because we assume that $\frac{1}{T} A^{\mathrm{Unc}}_{T}$ converges to an invertible matrix as $T \to \infty$, and invertible matrices form an open set. Then, for each $T \ge T_0$, we have

$$
\begin{aligned}
\frac{1}{T} A^{\mathrm{Unc}}_{T} (\theta_T - \theta^\star_T) &= \frac{1}{T} (b_T - b^\star_T) && (29) \\
\theta_T - \theta^\star_T &= \Bigl( \frac{1}{T} A^{\mathrm{Unc}}_{T} \Bigr)^{\!-1} \frac{1}{T} (b_T - b^\star_T). && (30)
\end{aligned}
$$

By the continuity of matrix inverse, we then have

$$
\lim_{T\to\infty} (\theta_T - \theta^\star_T) = \Bigl( \lim_{T\to\infty} \frac{1}{T} A^{\mathrm{Unc}}_{T} \Bigr)^{\!-1} \lim_{T\to\infty} \frac{1}{T} (b_T - b^\star_T). \qquad (31)
$$

It thus suffices to show that $\frac{1}{T}(b_T - b^\star_T) \to 0$ as $T \to \infty$.

Because the state space is finite, the magnitudes of the immediate reward and of the feature vector are uniformly bounded. Namely, there exists $c$ such that $|R(s)| \le c$ and $|\phi(s)| \le c$ elementwise for any $s \in \mathcal{S}$. Thus, we have the following elementwise inequality:

$$
\begin{aligned}
\frac{1}{T} \bigl| b^\star_T - b_T \bigr|
&= \frac{1}{T} \Bigl| \sum_{t=0}^{T-1} \phi_t \, \lambda^{T-t-1} \sum_{m=T-t}^{\infty} \gamma^m \sum_{s' \in \mathcal{S}} (P^m)_{s_t, s'} R(s') \Bigr| && (32) \\
&\le c^2 \, \frac{1}{T} \sum_{t=0}^{T-1} \lambda^{T-t-1} \sum_{m=T-t}^{\infty} \gamma^m \sum_{s' \in \mathcal{S}} (P^m)_{s_t, s'} && (33) \\
&= c^2 \, \frac{1}{T} \sum_{t=0}^{T-1} \lambda^{T-t-1} \sum_{m=T-t}^{\infty} \gamma^m && (34) \\
&= c^2 \, \frac{1}{T} \sum_{t=0}^{T-1} \lambda^{T-t-1} \, \frac{\gamma^{T-t}}{1-\gamma} && (35) \\
&= \frac{c^2 \gamma}{1-\gamma} \, \frac{1}{T} \sum_{t=0}^{T-1} (\lambda\gamma)^{T-t-1} && (36) \\
&= \frac{c^2 \gamma}{1-\gamma} \, \frac{1}{T} \, \frac{1-(\lambda\gamma)^T}{1-\lambda\gamma}, && (37)
\end{aligned}
$$

which tends to 0 as $T \to \infty$. This completes the proof of Lemma 1.
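As a small sanity check on the final bound (37), the snippet below (our own illustration) evaluates it and confirms that it decays like $1/T$:

```python
import numpy as np

def lemma1_bound(c, lam, gamma, T):
    """Right-hand side of (37): an elementwise bound on (1/T)|b*_T - b_T|.
    c bounds |R(s)| and the feature entries; the bound decays like 1/T."""
    return c ** 2 * gamma / (1 - gamma) * (1 - (lam * gamma) ** T) / ((1 - lam * gamma) * T)
```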

### a.3 Proof of Theorem 2

At each step $T$, Uncorrected LSTD($\lambda$) gives the weights $\theta_T$ as the solution of $A^{\mathrm{Unc}}_{T} \theta = b_T$. Therefore, it suffices to show that $\frac{1}{T} A^{\mathrm{Unc}}_{T}$ and $\frac{1}{T} b_T$ converge almost surely to the matrix and the vector stated in the theorem as $T \to \infty$.

Due to the ergodicity of the Markov chain, each state is visited infinitely often as $T \to \infty$, and the fraction of time spent in each state converges to its steady-state probability almost surely. Then, by the pointwise ergodic theorem, we have the following almost sure convergence:

$$
\begin{aligned}
\lim_{T\to\infty} \frac{1}{T} A^{\mathrm{Unc}}_{T}
&= \lim_{T\to\infty} \frac{1}{T} \sum_{t=0}^{T-1} \phi_t \Bigl( \phi_t - (1-\lambda)\gamma \sum_{m=1}^{T-t-1} (\lambda\gamma)^{m-1} \phi_{t+m} \Bigr)^{\!\top} && (38) \\
&= \sum_{s \in \mathcal{S}} \pi(s) \, \phi(s) \Bigl( \sum_{m=0}^{\infty} (\lambda\gamma)^m \sum_{s' \in \mathcal{S}} (P^m)_{s,s'} \phi(s') - \gamma \sum_{m=1}^{\infty} (\lambda\gamma)^{m-1} \sum_{s' \in \mathcal{S}} (P^m)_{s,s'} \phi(s') \Bigr)^{\!\top} && (39) \\
&= \Phi^\top \mathrm{Diag}(\pi) (I - \lambda\gamma P)^{-1} \Phi - \gamma \, \Phi^\top \mathrm{Diag}(\pi) \, P \, (I - \lambda\gamma P)^{-1} \Phi && (40) \\
&= \Phi^\top \mathrm{Diag}(\pi) (I - \gamma P)(I - \lambda\gamma P)^{-1} \Phi && (41)
\end{aligned}
$$

and

$$
\begin{aligned}
\lim_{T\to\infty} \frac{1}{T} b_T
&= \lim_{T\to\infty} \frac{1}{T} \sum_{t=0}^{T-1} \phi_t \sum_{m=0}^{T-t-1} (\lambda\gamma)^m r_{t+1+m} && (42) \\
&= \sum_{s \in \mathcal{S}} \pi(s) \, \phi(s) \sum_{m=0}^{\infty} (\lambda\gamma)^m \sum_{s' \in \mathcal{S}} (P^m)_{s,s'} R(s') && (43) \\
&= \sum_{m=0}^{\infty} (\lambda\gamma)^m \, \Phi^\top \mathrm{Diag}(\pi) \, P^m r && (44) \\
&= \Phi^\top \mathrm{Diag}(\pi) (I - \lambda\gamma P)^{-1} r, && (45)
\end{aligned}
$$

which establishes the theorem. Here, the equality from (38) to (39) and the equality from (42) to (43) relate the time average to the ensemble average (almost surely) via the pointwise ergodic theorem.
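The two limits can be checked numerically on a small ergodic chain. The sketch below is our own illustration (it assumes the reward convention $r_{t+1} = R(s_t)$ used in the proof); it evaluates the closed-form limits (41) and (45), which a long simulated trajectory should approach.

```python
import numpy as np

def lstd_limits(P, r, Phi, pi, lam, gamma):
    """Closed-form limits from Theorem 2:
        (1/T) A^Unc_T -> Phi^T Diag(pi) (I - gamma P)(I - lam gamma P)^{-1} Phi  (41)
        (1/T) b_T     -> Phi^T Diag(pi) (I - lam gamma P)^{-1} r                 (45)
    """
    n = P.shape[0]
    D = np.diag(pi)
    M = np.linalg.inv(np.eye(n) - lam * gamma * P)  # (I - lam*gamma*P)^{-1}
    A_lim = Phi.T @ D @ (np.eye(n) - gamma * P) @ M @ Phi
    b_lim = Phi.T @ D @ M @ r
    return A_lim, b_lim
```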

### a.4 Proof of Lemma 2

From (17), we have

$$
\begin{aligned}
A^{\mathrm{Boy}}_{T}
&= \sum_{t=0}^{T-1} \phi_t \Bigl( \phi_t - (1-\lambda)\gamma \sum_{m=1}^{T-t-1} (\lambda\gamma)^{m-1} \phi_{t+m} - \gamma (\lambda\gamma)^{T-t-1} \phi_T \Bigr)^{\!\top} && (46) \\
&= \sum_{t=0}^{T-1} \phi_t \Bigl( \sum_{m=0}^{T-t-1} (\lambda\gamma)^m \phi_{t+m} - \gamma \sum_{m=1}^{T-t} (\lambda\gamma)^{m-1} \phi_{t+m} \Bigr)^{\!\top} && (47) \\
&= \sum_{t=0}^{T-1} \phi_t \Bigl( \sum_{k=t}^{T-1} (\lambda\gamma)^{k-t} \phi_k - \gamma \sum_{k=t+1}^{T} (\lambda\gamma)^{k-t-1} \phi_k \Bigr)^{\!\top} && (48) \\
&= \sum_{k=0}^{T-1} \sum_{t=0}^{k} (\lambda\gamma)^{k-t} \phi_t \phi_k^\top - \gamma \sum_{k=1}^{T} \sum_{t=0}^{k-1} (\lambda\gamma)^{k-t-1} \phi_t \phi_k^\top && (49) \\
&= \sum_{k=0}^{T-1} z_k \phi_k^\top - \gamma \sum_{k=1}^{T} z_{k-1} \phi_k^\top && (50) \\
&= \sum_{k=0}^{T-1} z_k (\phi_k - \gamma \phi_{k+1})^\top, && (51)
\end{aligned}
$$

where the equality from (49) to (50) follows from the definition of the eligibility trace in Theorem 1. This completes the proof of the lemma.
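The identity of Lemma 2 is easy to verify numerically. The sketch below (our own illustration) computes $A^{\mathrm{Boy}}_T$ from the eligibility-trace form (51), which can be compared against the defining sum (46):

```python
import numpy as np

def boyan_A(phis, lam, gamma):
    """A^Boy_T via (51): sum_{k=0}^{T-1} z_k (phi_k - gamma phi_{k+1})^T,
    with the trace recursion z_k = lam*gamma*z_{k-1} + phi_k.
    phis must contain phi_0, ..., phi_T (T+1 feature vectors)."""
    d = len(phis[0])
    A = np.zeros((d, d))
    z = np.zeros(d)
    for k in range(len(phis) - 1):
        z = lam * gamma * z + phis[k]  # eligibility trace
        A += np.outer(z, phis[k] - gamma * phis[k + 1])
    return A
```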

### a.5 Proof of Proposition 1

Because there is no transition of states, we can let $\phi_t = 1$ for every $t$. Then the coefficient matrix (17) for Boyan’s LSTD($\lambda$) is reduced to the following one-dimensional constant for given $T$, $\lambda$, and $\gamma$:

$$
\begin{aligned}
A^{\mathrm{Boy}}_{T}
&= \sum_{t=0}^{T-1} \Bigl( 1 - (1-\lambda)\gamma \sum_{m=1}^{T-t-1} (\lambda\gamma)^{m-1} - \gamma (\lambda\gamma)^{T-t-1} \Bigr) && (52) \\
&= (1-\gamma) \sum_{t=0}^{T-1} \sum_{m=0}^{T-t-1} (\lambda\gamma)^m && (53) \\
&= (1-\gamma) \sum_{n=1}^{T} \frac{1-(\lambda\gamma)^n}{1-\lambda\gamma} && (54) \\
&= \frac{(1-\gamma) \, T}{1-\lambda\gamma} + o(T). && (55)
\end{aligned}
$$

From (7) and (17), we have

$$
\begin{aligned}
A^{\mathrm{Unc}}_{T}
&= \sum_{t=0}^{T-1} \Bigl( 1 - (1-\lambda)\gamma \sum_{m=1}^{T-t-1} (\lambda\gamma)^{m-1} \Bigr) && (56) \\
&= A^{\mathrm{Boy}}_{T} + \gamma \sum_{t=0}^{T-1} (\lambda\gamma)^{T-t-1} && (57) \\
&= A^{\mathrm{Boy}}_{T} + \gamma \, \frac{1-(\lambda\gamma)^T}{1-\lambda\gamma}. && (58)
\end{aligned}
$$

Let

$$
\begin{aligned}
\Delta_T
&\equiv A^{\mathrm{Unc}}_{T} - A^{\mathrm{Boy}}_{T} && (59) \\
&= \gamma \, \frac{1-(\lambda\gamma)^T}{1-\lambda\gamma} && (60) \\
&= \frac{\gamma}{1-\lambda\gamma} + o(1). && (61)
\end{aligned}
$$

The estimator of the discounted cumulative reward is given by $\theta^{\mathrm{Unc}}_T = b_T / A^{\mathrm{Unc}}_{T}$ or $\theta^{\mathrm{Boy}}_T = b_T / A^{\mathrm{Boy}}_{T}$, where $b_T$ is reduced to the following random variable (here, $R_t$ denotes the reward obtained at step $t$):

$$
\begin{aligned}
b_T
&= \sum_{t=0}^{T-1} \sum_{m=0}^{T-t-1} (\lambda\gamma)^m R_{t+1+m} && (62) \\
&= \sum_{n=1}^{T} \sum_{t=0}^{n-1} (\lambda\gamma)^{n-t-1} R_n && (63) \\
&= \sum_{n=1}^{T} \frac{1-(\lambda\gamma)^n}{1-\lambda\gamma} \, R_n, && (64)
\end{aligned}
$$

where the second equality follows by changing variables and exchanging the order of summations. Because the rewards are i.i.d. (with mean $\mu$ and variance $\sigma^2$), the expectation and variance of $b_T$ are given as follows:

$$
\begin{aligned}
\mathbb{E}[b_T] &= \mu \sum_{n=1}^{T} \frac{1-(\lambda\gamma)^n}{1-\lambda\gamma} && (65) \\
\mathrm{Var}[b_T] &= \sigma^2 \sum_{n=1}^{T} \Bigl( \frac{1-(\lambda\gamma)^n}{1-\lambda\gamma} \Bigr)^2. && (66)
\end{aligned}
$$

By (54) and (65), it is straightforward to verify that $\theta^{\mathrm{Boy}}_T$ is unbiased. Indeed, its expected value is

$$
\mathbb{E}[\theta^{\mathrm{Boy}}_T] = \frac{\mathbb{E}[b_T]}{A^{\mathrm{Boy}}_{T}} = \frac{\mu}{1-\gamma}, \qquad (67)
$$

which coincides with the true expected discounted cumulative reward.

Now, by (58) and (67), the bias of $\theta^{\mathrm{Unc}}_T$ is given by

$$
\begin{aligned}
\mathbb{E}[\theta^{\mathrm{Unc}}_T] - \frac{\mu}{1-\gamma}
&= \frac{\mathbb{E}[b_T]}{A^{\mathrm{Unc}}_{T}} - \frac{\mathbb{E}[b_T]}{A^{\mathrm{Boy}}_{T}} && (68) \\
&= \frac{A^{\mathrm{Boy}}_{T} - A^{\mathrm{Unc}}_{T}}{A^{\mathrm{Unc}}_{T}} \, \frac{\mathbb{E}[b_T]}{A^{\mathrm{Boy}}_{T}} && (69) \\
&= -\frac{\Delta_T}{A^{\mathrm{Boy}}_{T} + \Delta_T} \, \frac{\mu}{1-\gamma} && (70) \\
&= -\frac{\gamma\mu}{(1-\gamma)^2 \, T} + o\Bigl(\frac{1}{T}\Bigr), && (71)
\end{aligned}
$$

which establishes (18).

Finally, the variances of the estimators are given by $\mathrm{Var}[\theta^{\mathrm{Boy}}_T] = \mathrm{Var}[b_T] / (A^{\mathrm{Boy}}_{T})^2$ and $\mathrm{Var}[\theta^{\mathrm{Unc}}_T] = \mathrm{Var}[b_T] / (A^{\mathrm{Unc}}_{T})^2$. Hence, we have

$$
\begin{aligned}
\frac{\mathrm{Var}[\theta^{\mathrm{Boy}}_T]}{\mathrm{Var}[\theta^{\mathrm{Unc}}_T]}
&= \Bigl( \frac{A^{\mathrm{Boy}}_{T} + \Delta_T}{A^{\mathrm{Boy}}_{T}} \Bigr)^{\!2} && (72) \\
&= 1 + \frac{2\gamma}{(1-\gamma) \, T} + o\Bigl(\frac{1}{T}\Bigr). && (73)
\end{aligned}
$$

This completes the proof of the proposition.
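Because the closed forms (54), (58), and (65) make the bias of $\theta^{\mathrm{Unc}}_T$ exactly computable in this single-state setting, the asymptotic expression (71) can be checked numerically. The sketch below is our own illustration:

```python
import numpy as np

def unc_bias_single_state(mu, lam, gamma, T):
    """Exact bias E[theta^Unc_T] - mu/(1-gamma) in the single-state MRP,
    assembled from A^Boy_T (54), A^Unc_T (58), and E[b_T] (65)."""
    w = (1 - (lam * gamma) ** np.arange(1, T + 1)) / (1 - lam * gamma)
    A_boy = (1 - gamma) * w.sum()
    A_unc = A_boy + gamma * (1 - (lam * gamma) ** T) / (1 - lam * gamma)
    return mu * w.sum() / A_unc - mu / (1 - gamma)
```

For large $T$ the exact bias matches the asymptote $-\gamma\mu / ((1-\gamma)^2 T)$ of (71).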

## Appendix B Details of experiments

In this section, we provide the details of the experiments in Section 4.

### b.1 Computational environment

To generate the random MRPs and to run the experiments, we use the library published by van Seijen et al. (2016). We run our experiments on an Ubuntu 18.04 workstation with eight Intel Core i7-6700K CPUs running at 4.00 GHz and 64 GB of random access memory.

### b.2 Detailed results of experiments

Figure 2 shows the results corresponding to Figure 4 of van Seijen et al. (2016). A difference is that, for the three LSTD($\lambda$)s, we show the best MSE with the optimal value of the regularization coefficient among $\{2^i \mid i = -8, -7, \ldots, 7, 8\}$ for each data point. In Figure 4 of van Seijen et al. (2016), the best MSE with the optimal step size is shown for each TD($\lambda$). As a reference, we include the results with true online TD($\lambda$) in Figure 2.

Figure 2: Mean squared error (MSE) of Uncorrected, Mixed, and Boyan’s LSTD($\lambda$) as well as true online TD($\lambda$). The MSE is evaluated on three random Markov reward processes (MRPs), each with three representations (features) of states. For Uncorrected, Mixed, and Boyan’s LSTD($\lambda$), the best MSE with the optimal value of the regularization coefficient (among $\{2^i \mid i = -8, -7, \ldots, 7, 8\}$) is shown for each $\lambda$. For true online TD($\lambda$), the best MSE with the optimal step size (see van Seijen et al. (2016) for details) is shown for each $\lambda$. For each data point, the MSE is computed on the basis of 50 runs, and the error bar shows its standard deviation. For clarity, we show error bars only at $\lambda = i/10$ for $i = 0, 1, \ldots, 10$.

Figure 2 shows the best achievable MSE with the optimal choice of hyperparameters for each LSTD($\lambda$) and for true online TD($\lambda$), but this best achievable MSE cannot be attained in practice, because one cannot tune the hyperparameters optimally.

Figure 2 thus needs to be understood together with the sensitivity of the performance to the particular values of the hyperparameters, which is shown in Figures 3-5. These figures may be compared against Figures 10-12 of van Seijen et al. (2016). Note, however, that the horizontal axis is the regularization coefficient in Figures 3-5 but the step size in Figures 10-12 of van Seijen et al. (2016).

Table 1 shows the computational time (in seconds) of each LSTD($\lambda$) in each run of the experiment with the small MRP shown in Figure 3. Likewise, Tables 2-3 show the computational time with the large and deterministic MRPs shown in Figures 4-5. Notice that each run consists of all combinations of the values of $\lambda$ and the regularization coefficient (recall that we vary the regularization coefficient in $\{2^i \mid i = -8, \ldots, 8\}$ for each value of $\lambda$). As a reference, we also include the running time of true online TD($\lambda$) on our environment. Because true online TD($\lambda$) considers a different number of hyperparameter combinations, as in van Seijen et al. (2016), its running time is normalized by the number of combinations after running all of them, for fair comparison.

In our implementation, Uncorrected LSTD($\lambda$) requires slightly more computational time than Boyan’s, because at each step Uncorrected LSTD($\lambda$) copies and stores the eligibility trace to be used in the next step. As expected, Mixed LSTD($\lambda$) requires 20% to 65% more computational time than the other LSTD($\lambda$)s, because Mixed LSTD($\lambda$) applies the rank-one update twice.

Figure 3: Mean squared error (MSE) of Uncorrected, Mixed, and Boyan’s LSTD($\lambda$) on the small MRP (10, 3, 0.1) as a function of the value of the regularization coefficient. Each curve shows the MSE (over 50 runs) for a particular value of $\lambda$ with $0 \le \lambda \le 1$. The legend only shows the colors for $\lambda \in \{0, 1\}$, but the intermediate values of $\lambda$ follow the rainbow color map. This figure is analogous to Figure 10 of van Seijen et al. (2016), but here we vary the value of the regularization coefficient, while the step size is varied in van Seijen et al. (2016).

Figure 4: Mean squared error (MSE) of Uncorrected, Mixed, and Boyan’s LSTD($\lambda$) on the large MRP (100, 10, 0.1) as a function of the value of the regularization coefficient. See the caption of Figure 3 for the settings of the experiments.

Figure 5: Mean squared error (MSE) of Uncorrected, Mixed, and Boyan’s LSTD($\lambda$) on the deterministic MRP (100, 3, 0) as a function of the value of the regularization coefficient. See the caption of Figure 3 for the settings of the experiments.
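The experiments report the best MSE over a sweep of the regularization coefficient. The exact regularization used in the code is not stated here, so the sketch below shows one common choice, a ridge-style solve $\theta = (A + \varepsilon I)^{-1} b$; the function names and the regularization form are our assumptions, not the paper's implementation.

```python
import numpy as np

def solve_regularized(A, b, eps):
    """One plausible regularized LSTD solve: theta = (A + eps*I)^{-1} b.
    (The exact regularization form used in the experiments is an assumption.)"""
    return np.linalg.solve(A + eps * np.eye(A.shape[0]), b)

def best_coefficient(A, b, mse):
    """Sweep eps over {2^i : i = -8, ..., 8} and keep the one with the lowest
    MSE, mirroring how the best MSE per lambda is reported in Figure 2."""
    grid = [2.0 ** i for i in range(-8, 9)]
    return min(grid, key=lambda eps: mse(solve_regularized(A, b, eps)))
```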

## References

• Osogami (2019) T. Osogami. Uncorrected least-squares temporal difference with lambda-return. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, February 2019.
• van Seijen et al. (2016) H. van Seijen, A. R. Mahmood, P. M. Pilarski, M. C. Machado, and R. S. Sutton. True online temporal-difference learning. Journal of Machine Learning Research, 17(145):1–40, 2016.