Supplementary material for Uncorrected least-squares temporal difference with lambda-return


Takayuki Osogami
IBM Research - Tokyo
osogami@jp.ibm.com
Abstract

Here, we provide supplementary material for Takayuki Osogami, “Uncorrected least-squares temporal difference with lambda-return,” which appears in the Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI-20) (Osogami, 2019).

Appendix A Proofs

In this section, we prove Theorem 1, Lemma 1, Theorem 2, Lemma 2, and Proposition 1. Note that equations (1)-(19) refer to those in Osogami (2019).

A.1 Proof of Theorem 1

From (7)-(8), we have the following equality:

(20)
(21)
(22)
(23)
(24)
(25)
(26)
(27)
(28)

Here, the equality from (24) to (25) and the equality from (27) to (28) follow from the definition of the eligibility trace in the theorem. The recursive computation of the eligibility trace can be verified in a straightforward manner. This completes the proof of Theorem 1.
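The eligibility trace referred to above is defined in Theorem 1 of the paper and is not reproduced here. As a reference point only, the sketch below shows how the standard recursion z_t = γλ z_{t-1} + φ(s_t) appears in a generic, Boyan-style LSTD(λ) implementation; all names are generic choices for illustration, and this is not the paper's uncorrected variant.

```python
import numpy as np

def lstd_lambda(phis, rewards, gamma, lam, reg=1e-3):
    """Generic (Boyan-style) LSTD(lambda) sketch.

    Accumulates the eligibility trace z_t = gamma*lam*z_{t-1} + phi_t
    and solves the resulting linear system A w = b.
    `phis` must contain one more feature vector than `rewards`.
    """
    d = len(phis[0])
    A = np.zeros((d, d))
    b = np.zeros(d)
    z = np.zeros(d)
    for t, r in enumerate(rewards):
        z = gamma * lam * z + phis[t]                  # eligibility-trace recursion
        A += np.outer(z, phis[t] - gamma * phis[t + 1])
        b += r * z
    # a small ridge term keeps A invertible for short trajectories
    return np.linalg.solve(A + reg * np.eye(d), b)
```

On a single-state chain with constant reward r and the scalar feature φ = 1, the estimate approaches r/(1-γ) as the trajectory length grows.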

A.2 Proof of Lemma 1

Observe that there exists such that is invertible for any , because we assume that converges to an invertible matrix as , and invertible matrices form an open set. Then for each , we have

(29)
(30)

By the continuity of matrix inverse, we then have

(31)

It thus suffices to show as .

Because the state space is finite, the magnitudes of the immediate reward and the feature vector are uniformly bounded. Namely, there exists such that and elementwise for any . Thus, we have the following elementwise inequality:

(32)
(33)
(34)
(35)
(36)
(37)

which tends to 0 as . This completes the proof of Lemma 1.
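In generic notation (writing $\bar{A}_t$ and $\bar{b}_t$ for the estimated matrix and vector at step $t$; the paper's symbols may differ), the skeleton of the argument is the standard continuity fact:

```latex
\bar{A}_t \longrightarrow A \ \text{invertible} \quad\text{and}\quad \bar{b}_t \longrightarrow b
\qquad\Longrightarrow\qquad
\bar{A}_t^{-1}\,\bar{b}_t \longrightarrow A^{-1} b ,
```

since the invertible matrices form an open set on which $M \mapsto M^{-1}$ is continuous.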

A.3 Proof of Theorem 2

At each step , Uncorrected LSTD(λ) gives the weights , which solve . Therefore, it suffices to show and as .

Due to the ergodicity of the Markov chain, as , each state is visited infinitely often, and the time each state is occupied is proportional to the steady state probability almost surely. Then, by the pointwise ergodic theorem, we have the following almost sure convergence:

(38)
(39)
(40)
(41)

and

(42)
(43)
(44)
(45)

which establishes the theorem. Here, the equality from (38) to (39) and the equality from (42) to (43) relate the time average to the ensemble average (almost surely) via the pointwise ergodic theorem.
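For reference, the pointwise (Birkhoff) ergodic theorem used in this step takes the following generic form for an ergodic Markov chain $(S_t)$ with stationary distribution $\pi$ and a bounded function $f$ (the symbols here are generic, not necessarily the paper's):

```latex
\frac{1}{T} \sum_{t=0}^{T-1} f(S_t, S_{t+1})
\;\xrightarrow{\ \text{a.s.}\ }\;
\mathbb{E}_{S \sim \pi}\!\left[\, f(S, S') \,\right]
\qquad (T \to \infty),
```

where $S'$ denotes the successor state of $S$ under the transition kernel; applying this with $f$ equal to the summands defining the matrix and vector estimates yields the equalities from (38) to (39) and from (42) to (43).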

A.4 Proof of Lemma 2

From (17), we have

(46)
(47)
(48)
(49)
(50)
(51)

where the equality from (49) to (50) follows from the definition of the eligibility trace in Theorem 1. This completes the proof of the lemma.

A.5 Proof of Proposition 1

Because there is no transition between states, we can let . Then the coefficient matrix (17) for Boyan’s LSTD(λ) reduces to the following one-dimensional constant for given , , and :

(52)
(53)
(54)
(55)

From (7) and (17), we have

(56)
(57)
(58)

Let

(59)
(60)
(61)

The estimator of the discounted cumulative reward is given by or , where is reduced to the following random variable (here, denotes the reward obtained at step ):

(62)
(63)
(64)

where the second equality follows by changing variables and exchanging the order of summations. Because the rewards are i.i.d., the expectation and variance of are given as follows:

(65)
(66)

By (54) and (65), it is straightforward to verify that is unbiased. Indeed, the expected value is

(67)

which coincides with the true expected discounted cumulative reward.
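As a numerical sanity check of this kind of statement, one can simulate the single-state MRP with i.i.d. rewards and compare sample statistics of the (truncated) discounted return with their closed forms, $\mu \sum_t \gamma^t$ for the mean and $\sigma^2 \sum_t \gamma^{2t}$ for the variance. The sketch below assumes Uniform(0,1) rewards and a hypothetical truncation horizon; it illustrates the closed forms above, not the paper's specific estimator.

```python
import random

def discounted_return(T, gamma, rng):
    # one truncated episode: i.i.d. Uniform(0, 1) reward at each step
    return sum(gamma ** t * rng.random() for t in range(T))

def sample_stats(n, T, gamma, seed=0):
    # sample mean and (unbiased) sample variance over n episodes
    rng = random.Random(seed)
    xs = [discounted_return(T, gamma, rng) for _ in range(n)]
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean, var

# closed forms for i.i.d. rewards with mean mu and variance sigma2:
#   E   = mu     * (1 - gamma**T)       / (1 - gamma)
#   Var = sigma2 * (1 - gamma**(2 * T)) / (1 - gamma**2)
```

For Uniform(0,1) rewards, mu = 1/2 and sigma2 = 1/12, and the sample statistics match the closed forms up to Monte Carlo error.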

Now, by (58) and (67), the bias of is given by

(68)
(69)
(70)
(71)

which establishes (18).

Finally, the variance of the estimator is given by and . Hence, we have

(72)
(73)

This completes the proof of the proposition.

Appendix B Details of experiments

In this section, we provide the details of the experiments in Section 4.

B.1 Computational environment

To generate the random MRPs and to run the experiments, we use the library published by van Seijen et al. (2016), available at https://github.com/armahmood/totd-rndmdp-experiments. We run our experiments on an Ubuntu 18.04 workstation with eight Intel Core i7-6700K CPUs running at 4.00 GHz and 64 GB of memory.

B.2 Detailed results of experiments

Figure 2 shows the results corresponding to Figure 4 of van Seijen et al. (2016). A difference is that, for the three LSTD(λ) variants, we show the best MSE with the optimal value of the regularization coefficient, , among for each data point. In Figure 4 of van Seijen et al. (2016), the best MSE with the optimal step size is shown for each TD(λ). As a reference, we include the results with true online TD(λ) in Figure 2.

[Figure 2 panels: rows show the small, large, and deterministic MRPs; columns show (a) tabular, (b) binary, and (c) non-binary features.]

Figure 2: Mean squared error (MSE) of Uncorrected, Mixed, and Boyan’s LSTD(λ) as well as true online TD(λ). The MSE is evaluated on three random Markov reward processes (MRPs), each with three representations (features) of states. For Uncorrected, Mixed, and Boyan’s LSTD(λ), the best MSE with the optimal value of the regularization coefficient (among ) is shown for each . For true online TD(λ), the best MSE with the optimal step size (see van Seijen et al. (2016) for details) is shown for each . For each data point, the MSE is computed on the basis of 50 runs, and the error bar shows its standard deviation. For clarity, we show error bars only at for .

Figure 2 shows the best achievable MSE with the optimal choice of hyperparameters for each LSTD(λ) and for true online TD(λ). This best achievable MSE cannot be attained in practice, however, because one cannot tune the hyperparameters optimally.

Figure 2 thus needs to be interpreted together with the sensitivity of the performance to the particular values of the hyperparameters, which is shown in Figures 3-5. These figures may be compared against Figures 10-12 of van Seijen et al. (2016). Note, however, that the horizontal axis is the regularization coefficient in Figures 3-5 but the step size in Figures 10-12 of van Seijen et al. (2016).

Table 1 shows the computational time (in seconds) of each LSTD(λ) in each run of the experiment with the small MRP shown in Figure 3. Likewise, Tables 2-3 show the computational time with the large and deterministic MRPs shown in Figures 4-5. Notice that each run consists of combinations of the values of and (recall that we vary in for each of in ). As a reference, we also include the running time of true online TD(λ) in our environment. Because true online TD(λ) considers combinations of the values of hyperparameters as in van Seijen et al. (2016), its running time is normalized by after running all of the combinations for a fair comparison.

In our implementation, Uncorrected LSTD(λ) requires slightly more computational time than Boyan’s, because at each step Uncorrected LSTD(λ) copies and stores the eligibility trace to be used in the next step. As expected, Mixed LSTD(λ) requires more computational time (20% to 65%) than the other LSTD(λ) variants, because Mixed LSTD(λ) applies the rank-one update twice.
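The normalization of the true online TD(λ) timings can be made concrete with simple arithmetic: if true online TD(λ) runs 600 hyperparameter combinations while each LSTD(λ) run covers 340, the TD(λ) time is scaled by the ratio of the counts for a like-for-like comparison. The factor 340/600 is our reading of the normalization described with Table 1, and the helper below is purely illustrative, not the code used in the experiments.

```python
def normalize_time(total_seconds, combos_run, combos_reference):
    # rescale a total running time to a reference number of hyperparameter
    # combinations, assuming a uniform per-combination cost
    return total_seconds * combos_reference / combos_run

# e.g. rescaling a 60-second true online TD(lambda) run (600 combinations)
# to the 340 combinations covered by each LSTD(lambda) run
```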

Detailed results with the small MRP



[Figure 3 panels: (a) tabular, (b) binary, and (c) non-binary features; legend: Mixed, Uncorrected, and Boyan’s LSTD(λ).]
Figure 3: Mean squared error (MSE) of Uncorrected, Mixed, and Boyan’s LSTD(λ) on the small MRP as a function of the value of the regularization coefficient. Each curve shows the MSE (over 50 runs) of a particular value of for . The legend only shows the color with , but the intermediate values of follow the rainbow color map. This figure is analogous to Figure 10 of van Seijen et al. (2016), but here we vary the value of the regularization coefficient, while the step size is varied in van Seijen et al. (2016).
           | Boyan’s | Uncorrected | Mixed | true online TD(λ)
tabular    |         |             |       |
binary     |         |             |       |
non-binary |         |             |       |
Table 1: The average computational time (seconds) for each run, consisting of 340 combinations of the values of hyperparameters, in the experiments with the small MRP. Here, the computational time of true online TD(λ) is normalized by after running all of the 600 combinations of the values of hyperparameters.

Detailed results with the large MRP



[Figure 4 panels: (a) tabular, (b) binary, and (c) non-binary features; legend: Mixed, Uncorrected, and Boyan’s LSTD(λ).]
Figure 4: Mean squared error (MSE) of Uncorrected, Mixed, and Boyan’s LSTD(λ) on the large MRP as a function of the value of the regularization coefficient. See the caption of Figure 3 for the settings of the experiments.
           | Boyan’s | Uncorrected | Mixed | true online TD(λ)
tabular    |         |             |       |
binary     |         |             |       |
non-binary |         |             |       |
Table 2: The average computational time (seconds) for each run in the experiments with the large MRP. See the caption of Table 1 for details.

Detailed results with the deterministic MRP



[Figure 5 panels: (a) tabular, (b) binary, and (c) non-binary features; legend: Mixed, Uncorrected, and Boyan’s LSTD(λ).]
Figure 5: Mean squared error (MSE) of Uncorrected, Mixed, and Boyan’s LSTD(λ) on the deterministic MRP as a function of the value of the regularization coefficient. See the caption of Figure 3 for the settings of the experiments.
           | Boyan’s | Uncorrected | Mixed | true online TD(λ)
tabular    |         |             |       |
binary     |         |             |       |
non-binary |         |             |       |
Table 3: The average computational time (seconds) for each run in the experiments with the deterministic MRP. See the caption of Table 1 for details.

References

  • T. Osogami. Uncorrected least-squares temporal difference with lambda-return. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, February 2019.
  • H. van Seijen, A. R. Mahmood, P. M. Pilarski, M. C. Machado, and R. S. Sutton. True online temporal-difference learning. Journal of Machine Learning Research, 17(145):1–40, 2016.