Supplementary material for Uncorrected least-squares temporal difference with lambda-return
Abstract
Here, we provide supplementary material for Takayuki Osogami, "Uncorrected least-squares temporal difference with lambda-return," which appears in the Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI-20) (Osogami, 2019).
Appendix A Proofs
In this section, we prove Theorem 1, Lemma 1, Theorem 2, Lemma 2, and Proposition 1. Note that equations (1)-(19) refer to those in Osogami (2019).
A.1 Proof of Theorem 1
From (7)-(8), we have the following equality:
[Equations (20)-(28): display content not recovered]
Here, the equality from (24) to (25) and the equality from (27) to (28) follow from the definition of the eligibility trace in the theorem. The recursive computation of the eligibility trace can be verified in a straightforward manner. This completes the proof of Theorem 1.
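For concreteness, the recursive computation of an eligibility trace can be sketched as follows. This is the standard accumulating trace with decay γλ, written with assumed variable names; the exact trace defined in Theorem 1 may differ in detail.

```python
import numpy as np

def update_trace(z, phi, gamma, lam):
    """One step of the standard recursive eligibility-trace update,
    z <- gamma * lam * z + phi.  A generic sketch, not necessarily
    the exact trace defined in Theorem 1."""
    return gamma * lam * z + phi

# usage: accumulate the trace over a short feature sequence
features = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
z = np.zeros(2)
for phi in features:
    z = update_trace(z, phi, gamma=0.9, lam=0.8)
```

Each feature vector enters with weight 1 and then decays geometrically at rate γλ per step, which is exactly what the closed-form sums in (20)-(28) unroll.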
A.2 Proof of Lemma 1
Observe that there exists such that is invertible for any , because we assume that converges to an invertible matrix as , and invertible matrices form an open set. Then for each , we have
[Equations (29)-(30): display content not recovered]
By the continuity of the matrix inverse, we then have
[Equation (31): display content not recovered]
It thus suffices to show as .
Because the state space is finite, the magnitudes of the immediate reward and the feature vector are uniformly bounded. Namely, there exists such that and elementwise for any . Thus, we have the following elementwise inequality:
[Equations (32)-(37): display content not recovered]
which tends to 0 as . This completes the proof of Lemma 1.
A.3 Proof of Theorem 2
At each step , Uncorrected LSTD(λ) gives the weights , which is the solution of . Therefore, it suffices to show and as .
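The overall scheme of solving a linear system accumulated with an eligibility trace can be sketched as below. This is a generic LSTD(λ) implementation under assumed conventions, not necessarily the paper's exact uncorrected estimator; the variant studied in the paper differs in how the trace enters the accumulated statistics.

```python
import numpy as np

def lstd_lambda(trajectory, gamma, lam, eps=1e-3):
    """Generic LSTD(lambda) sketch: accumulate a coefficient matrix A
    and vector b with an eligibility trace, then solve
    (A + eps * I) w = b.  A sketch under assumed conventions, not the
    paper's exact uncorrected estimator."""
    d = len(trajectory[0][0])
    A = np.zeros((d, d))
    b = np.zeros(d)
    z = np.zeros(d)
    for phi, r, phi_next in trajectory:   # (features, reward, next features)
        z = gamma * lam * z + phi
        A += np.outer(z, phi - gamma * phi_next)
        b += z * r
    return np.linalg.solve(A + eps * np.eye(d), b)

# usage: constant reward 1, one constant feature, gamma = 0.5;
# the fixed point is the value 1 / (1 - 0.5) = 2
traj = [(np.array([1.0]), 1.0, np.array([1.0]))] * 5
w = lstd_lambda(traj, gamma=0.5, lam=0.0, eps=0.0)
```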
Due to the ergodicity of the Markov chain, as , each state is visited infinitely often, and the time each state is occupied is proportional to the steady-state probability almost surely. Then, by the pointwise ergodic theorem, we have the following almost sure convergence:
[Equations (38)-(41): display content not recovered]
and
[Equations (42)-(45): display content not recovered]
which establishes the theorem. Here, the equality from (38) to (39) and the equality from (42) to (43) relate the time average to the ensemble average (almost surely) via the pointwise ergodic theorem.
A.4 Proof of Lemma 2
A.5 Proof of Proposition 1
Because there is no transition of states, we can let . Then the coefficient matrix (17) for Boyan's LSTD(λ) reduces to the following one-dimensional constant for given , , and :
[Equations (52)-(55): display content not recovered]
From (7) and (17), we have
[Equations (56)-(58): display content not recovered]
Let
[Equations (59)-(61): display content not recovered]
The estimator of the discounted cumulative reward is given by or , where is reduced to the following random variable (here, denotes the reward obtained at step ):
[Equations (62)-(64): display content not recovered]
where the second equality follows by changing variables and exchanging the order of summations. Because the rewards are i.i.d., the expectation and variance of are given as follows:
[Equations (65)-(66): display content not recovered]
By (54) and (65), it is straightforward to verify that is unbiased. Indeed, the expected value is
[Equation (67): display content not recovered]
which coincides with the true expected discounted cumulative reward.
Finally, the variance of the estimator is given by and . Hence, we have
[Equations (72)-(73): display content not recovered]
This completes the proof of the proposition.
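The moment formulas underlying this calculation are standard: for i.i.d. rewards with mean μ and variance σ², the discounted sum G = Σ_t γ^t R_t satisfies E[G] = μ/(1-γ) and Var[G] = σ²/(1-γ²). A Monte Carlo sanity check (the normal reward distribution and all parameter values are arbitrary illustrative choices):

```python
import numpy as np

# Monte Carlo check of E[G] = mu / (1 - gamma) and
# Var[G] = sigma^2 / (1 - gamma^2) for G = sum_t gamma^t R_t
# with i.i.d. rewards R_t ~ Normal(mu, sigma^2).
rng = np.random.default_rng(1)
gamma, mu, sigma = 0.9, 1.0, 0.5
T, n = 100, 50_000                  # truncated horizon, sample size
disc = gamma ** np.arange(T)        # discount weights gamma^t
G = rng.normal(mu, sigma, size=(n, T)) @ disc
mean_hat, var_hat = G.mean(), G.var()
```

With γ = 0.9, the theoretical values are E[G] = 10 and Var[G] = 0.25/0.19 ≈ 1.316; the truncation at T = 100 contributes a bias of order γ^100, which is negligible here.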
Appendix B Details of experiments
In this section, we provide the details of the experiments in Section 4.
B.1 Computational environment
To generate the random MRPs and to run the experiments, we use the library (https://github.com/armahmood/totd-rndmdp-experiments) published by van Seijen et al. (2016). We run our experiments on an Ubuntu 18.04 workstation with eight Intel Core i7-6700K CPUs running at 4.00 GHz and 64 GB of RAM.
B.2 Detailed results of experiments
Figure 2 shows the results corresponding to Figure 4 of van Seijen et al. (2016). A difference is that, for the three LSTD(λ)s, we show the best MSE with the optimal value of the regularization coefficient, , among for each data point. In Figure 4 of van Seijen et al. (2016), the best MSE with the optimal step size is shown for each TD(λ). As a reference, we include the results with true online TD(λ) in Figure 2.
Figure 2 shows the best achievable MSE with the optimal choice of hyperparameters for each LSTD(λ) and for true online TD(λ), but this best achievable MSE cannot be attained in practice, because one cannot tune the hyperparameters optimally in advance.
Figure 2 thus needs to be understood together with the sensitivity of the performance to the particular values of the hyperparameters, which is shown in Figures 3-5. These figures may be compared against Figures 10-12 of van Seijen et al. (2016). Note, however, that the horizontal axis is the regularization coefficient in Figures 3-5 but the step size in Figures 10-12 of van Seijen et al. (2016).
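The selection protocol behind these figures (per λ, report the MSE under the best regularization coefficient from a fixed grid) can be sketched as follows. `run_lstd` is a hypothetical stand-in for one experiment run; its toy MSE surface and the grids are illustrative assumptions, not the paper's actual values.

```python
import numpy as np

def run_lstd(lam, eps):
    """Hypothetical stand-in for one experiment returning an MSE.
    The toy surface below is minimized at eps = 0.1 for every lam."""
    return (np.log10(eps) + 1.0) ** 2 + 0.1 * lam

# assumed grids for illustration only
lambdas = [0.0, 0.4, 0.8, 1.0]
eps_grid = [10.0 ** k for k in range(-3, 2)]

# per lambda, keep only the best MSE over the regularization grid
best = {lam: min(run_lstd(lam, e) for e in eps_grid) for lam in lambdas}
```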
Table 1 shows the computational time (in seconds) of each LSTD(λ) in each run of the experiment with the small MRP shown with Figure 3. Likewise, Tables 2-3 show the computational time with the large and deterministic MRPs shown with Figures 4-5. Notice that each run consists of combinations of the values of and (recall that we vary in for each of in ). As a reference, we also include the running time of true online TD(λ) in our environment. Because true online TD(λ) considers combinations of the values of hyperparameters as in van Seijen et al. (2016), the running time is normalized by after running all of the combinations for a fair comparison.
In our implementation, Uncorrected LSTD(λ) requires slightly more computational time than Boyan's, because at each step Uncorrected LSTD(λ) copies and stores the eligibility trace to be used in the next step. As expected, Mixed LSTD(λ) requires 20% to 65% more computational time than the other LSTD(λ)s, because Mixed LSTD(λ) applies the rank-one update twice.
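The rank-one update here typically refers to maintaining the inverse of the coefficient matrix via the Sherman-Morrison identity, which costs O(d²) per step instead of a fresh O(d³) solve. A minimal sketch (the function name and usage are ours, not the library's):

```python
import numpy as np

def sherman_morrison(Ainv, u, v):
    """Rank-one inverse update: given Ainv = A^{-1}, return
    (A + u v^T)^{-1} via the Sherman-Morrison identity in O(d^2).
    A method applying this twice per step does roughly double the work."""
    Au = Ainv @ u                    # A^{-1} u
    vA = v @ Ainv                    # v^T A^{-1}
    return Ainv - np.outer(Au, vA) / (1.0 + v @ Au)

# usage: agrees with recomputing the inverse directly
A = 2.0 * np.eye(3)
u = np.array([1.0, 0.0, 1.0])
v = np.array([0.5, 1.0, 0.0])
Ainv_new = sherman_morrison(np.linalg.inv(A), u, v)
```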
Table 1: Computational time (in seconds) with the small MRP (numerical values not recovered).

|            | Boyan's | Uncorrected | Mixed | true online TD(λ) |
|------------|---------|-------------|-------|-------------------|
| tabular    |         |             |       |                   |
| binary     |         |             |       |                   |
| non-binary |         |             |       |                   |
Table 2: Computational time (in seconds) with the large MRP (numerical values not recovered).

|            | Boyan's | Uncorrected | Mixed | true online TD(λ) |
|------------|---------|-------------|-------|-------------------|
| tabular    |         |             |       |                   |
| binary     |         |             |       |                   |
| non-binary |         |             |       |                   |
Table 3: Computational time (in seconds) with the deterministic MRP (numerical values not recovered).

|            | Boyan's | Uncorrected | Mixed | true online TD(λ) |
|------------|---------|-------------|-------|-------------------|
| tabular    |         |             |       |                   |
| binary     |         |             |       |                   |
| non-binary |         |             |       |                   |
References
- Osogami (2019) T. Osogami. Uncorrected least-squares temporal difference with lambda-return. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, February 2019.
- van Seijen et al. (2016) H. van Seijen, A. R. Mahmood, P. M. Pilarski, M. C. Machado, and R. S. Sutton. True online temporal-difference learning. Journal of Machine Learning Research, 17(145):1–40, 2016.