Model-Based Policy Gradients with Parameter-Based Exploration by Least-Squares Conditional Density Estimation
The goal of reinforcement learning (RL) is to let an agent learn an optimal control policy in an unknown environment so that future expected rewards are maximized. The model-free RL approach directly learns the policy based on data samples. Although using many samples tends to improve the accuracy of policy learning, collecting a large number of samples is often expensive in practice. On the other hand, the model-based RL approach first estimates the transition model of the environment and then learns the policy based on the estimated transition model. Thus, if the transition model is accurately learned from a small amount of data, the model-based approach can perform better than the model-free approach. In this paper, we propose a novel model-based RL method by combining a recently proposed model-free policy search method called policy gradients with parameter-based exploration and the state-of-the-art transition model estimator called least-squares conditional density estimation. Through experiments, we demonstrate the practical usefulness of the proposed method.
1 Introduction
Reinforcement learning (RL) is a framework to let an agent learn an optimal control policy in an unknown environment so that expected future rewards are maximized. The RL methods developed so far can be categorized into two types: policy iteration, where policies are learned based on value function approximation [21, 12], and policy search, where policies are learned directly to maximize expected future rewards [24, 4, 22, 8, 15, 26].
1.1 Policy Iteration Vs. Policy Search
A value function represents expected future rewards as a function of a state or a state and an action. In the policy iteration framework, approximation of the value function for the current policy and improvement of the policy based on the learned value function are iteratively performed until an optimal policy is found. Thus, accurately approximating the value function is a challenge in the value-function-based approach. So far, various machine learning techniques have been employed for better value function approximation, such as least-squares approximation, manifold learning, efficient sample reuse, active learning, and robust learning.
However, because policy functions are learned indirectly via value functions in policy iteration, improving the quality of value function approximation does not necessarily yield a better policy function. Furthermore, because a small change in value functions can cause a big change in policy functions, it is not safe to use the value function based approach for controlling expensive dynamic systems such as a humanoid robot. Another weakness of the value function approach is that it is difficult to handle continuous actions because a maximizer of the value function with respect to an action needs to be found for policy improvement.
On the other hand, in the policy search approach, policy functions are determined so that expected future rewards are directly maximized. A popular policy search method is to update policy functions via gradient ascent. However, a classic policy gradient method called REINFORCE tends to produce gradient estimates with large variance, which results in unreliable policy improvement. More theoretically, it was shown that the variance of policy gradients can be proportional to the length of an agent's trajectory, due to the stochasticity of policies. This can be a critical limitation in RL problems with long trajectories.
To cope with this problem, a novel policy gradient method called policy gradients with parameter-based exploration (PGPE) was proposed. In PGPE, deterministic policies are used to suppress irrelevant randomness, and useful stochasticity is introduced by drawing policy parameters from a prior distribution. Then, instead of policy parameters, hyper-parameters included in the prior distribution are learned from data. Thanks to this prior-based formulation, the variance of gradient estimates in PGPE is independent of the length of an agent's trajectory. However, PGPE still suffers from an instability problem in small sample cases. To further improve the practical performance of PGPE, an efficient sample reuse method called importance-weighted PGPE (IW-PGPE) was proposed recently and demonstrated to achieve state-of-the-art performance.
1.2 Model-Based Vs. Model-Free
The RL methods reviewed above are categorized into the model-free approach, where policies are learned without explicitly modeling the unknown environment (i.e., the transition probability of the agent in the environment). On the other hand, an alternative approach called the model-based approach explicitly models the environment in advance and uses the learned environment model for policy learning [23, 5]. In the model-based approach, no additional sampling cost is necessary to generate artificial samples from the learned environment model.
The model-based approach is particularly advantageous in the policy search scenario. For example, given a fixed budget for data collection, IW-PGPE requires us to determine the sampling schedule in advance. More specifically, we need to decide, e.g., whether to gather many samples at the beginning or to collect small batches of samples over a longer period. However, optimizing the sampling schedule in advance is not possible without strong prior knowledge. Thus, in practice the sampling schedule has to be designed blindly, which can cause significant performance degradation. On the other hand, the model-based approach does not suffer from this problem because we can draw as many trajectory samples as we want from the learned transition model without additional sampling costs.
Another advantage of the model-based approach lies in baseline subtraction. In the gradient-based policy search methods such as REINFORCE and PGPE, subtraction of a baseline from a gradient estimate is a vital technique to reduce the estimation variance of policy gradients [13, 26]. If the baseline is estimated from samples that are statistically independent of samples used for the estimation of policy gradients, variance reduction can be carried out without increasing the estimation bias. However, such independent samples are not available in practice (if available, they should be used for policy gradient estimation), and thus variance reduction by baseline subtraction is practically performed at the expense of bias increase. On the other hand, in the model-based scenario, we can draw as many trajectory samples as we want from the learned transition model without additional sampling costs. Therefore, two statistically independent sets of samples can be generated and they can be separately used for policy gradient estimation and baseline estimation.
1.3 Transition Model Learning by Least-Squares Conditional Density Estimation
If the unknown environment is accurately approximated, the model-based approach can fully enjoy all the above advantages. However, accurately estimating the transition model from a limited amount of trajectory data in multi-dimensional continuous state and action spaces is highly challenging. Although a model-based method that does not require an accurate transition model was developed, it is only applicable to deterministic environments, which significantly limits its range of applications in practice. On the other hand, a recently proposed model-based policy search method called PILCO learns a probabilistic transition model by the Gaussian process (GP) and explicitly incorporates long-term model uncertainty. However, PILCO requires states and actions to follow Gaussian distributions and the reward function to have a particular exponential form, so that policy evaluation can be performed in closed form and policy gradients can be computed analytically for policy improvement. These strong requirements make PILCO practically restrictive.
To overcome such limitations of existing approaches, we propose a highly practical policy-search algorithm by extending the model-free PGPE method to the model-based scenario. In the proposed model-based PGPE (M-PGPE) method, the transition model is learned by the state-of-the-art non-parametric conditional density estimator called least-squares conditional density estimation (LSCDE), which has various superior properties: It can directly handle multi-dimensional inputs and outputs, it was proved to achieve the optimal convergence rate, it has high numerical stability, it is robust against outliers, its solution can be analytically and efficiently computed just by solving a system of linear equations, and generating samples from the learned conditional density is straightforward. Through experiments, we demonstrate that the proposed M-PGPE method is a promising approach.
2 Problem Formulation and Model-Free Policy Search
In this section, we first formulate our RL problem and review existing model-free policy search methods.
Let us consider a Markov decision problem consisting of the following elements:
$\mathcal{S}$: A set of continuous states.
$\mathcal{A}$: A set of continuous actions.
$p(s)$: The (unknown) probability density of initial states.
$p(s'|s,a)$: The (unknown) conditional probability density of visiting state $s'$ from state $s$ by action $a$.
$r(s,a,s')$: The immediate reward function for the transition from $s$ to $s'$ by $a$.
Let $\pi(a|s,\theta)$ be a policy of an agent parameterized by $\theta$, which is the conditional probability density of taking action $a$ at state $s$. Let
$$h = [s_1, a_1, \ldots, s_T, a_T, s_{T+1}]$$
be a history, which is a sequence of states and actions with finite length $T$ generated as follows: First, the initial state $s_1$ is determined following the initial-state probability density $p(s)$. Then action $a_1$ is chosen following policy $\pi(a|s,\theta)$, and next state $s_2$ is determined following the transition probability density $p(s'|s,a)$. This process is repeated $T$ times.
Let $R(h)$ be the return for history $h$, which is the discounted sum of future rewards the agent can obtain:
$$R(h) = \sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t, s_{t+1}),$$
where $\gamma$ is a discount factor. The expected return is given by
$$J(\theta) = \int p(h|\theta) R(h) \,\mathrm{d}h,$$
where $p(h|\theta)$ is the probability density of observing history $h$:
$$p(h|\theta) = p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t,a_t)\, \pi(a_t|s_t,\theta).$$
The goal of RL is to find optimal policy parameter $\theta^*$ that maximizes the expected return $J(\theta)$:
$$\theta^* = \mathop{\mathrm{argmax}}_{\theta} J(\theta).$$
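The roll-out process and the discounted return described above can be sketched in a few lines of Python. This is only an illustration: the function names (`rollout`, `discounted_return`) and the callables passed in for the initial-state sampler, policy, transition, and reward are hypothetical and not part of the original formulation.

```python
def rollout(init_state, policy, transition, reward, T):
    """Generate one history of length T and the per-step rewards.

    init_state, policy, transition, and reward are user-supplied callables
    (hypothetical placeholders for p(s), pi, p(s'|s,a), and r).
    """
    s = init_state()
    history, rewards = [], []
    for _ in range(T):
        a = policy(s)                 # choose an action at the current state
        s_next = transition(s, a)     # move to the next state
        history.append((s, a))
        rewards.append(reward(s, a, s_next))
        s = s_next
    history.append((s, None))         # final state has no action attached
    return history, rewards


def discounted_return(rewards, gamma):
    """Discounted sum of rewards: sum over t of gamma^(t-1) * r_t."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total
```

Averaging `discounted_return` over many roll-outs generated with the same policy gives a Monte Carlo estimate of the expected return.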
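REINFORCE is a classic method for learning the policy parameter $\theta$ via gradient ascent:
$$\theta \longleftarrow \theta + \varepsilon \nabla_{\theta} J(\theta),$$
where $\varepsilon$ denotes the learning rate and $\nabla_{\theta} J(\theta)$ denotes the gradient of $J(\theta)$ with respect to $\theta$.
The gradient $\nabla_{\theta} J(\theta)$ can be expressed as
$$\nabla_{\theta} J(\theta) = \int \nabla_{\theta} p(h|\theta) R(h) \,\mathrm{d}h = \int p(h|\theta) \nabla_{\theta} \log p(h|\theta) R(h) \,\mathrm{d}h = \int p(h|\theta) \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_t|s_t,\theta) R(h) \,\mathrm{d}h,$$
where we used
$$\nabla_{\theta} p(h|\theta) = p(h|\theta) \nabla_{\theta} \log p(h|\theta).$$
In the above expression, the probability density of histories, $p(h|\theta)$, is unknown. Suppose that we are given $N$ roll-out samples $\{h_n\}_{n=1}^{N}$ for the current policy, where
$$h_n = [s_{1,n}, a_{1,n}, \ldots, s_{T,n}, a_{T,n}, s_{T+1,n}].$$
Then the expectation over $p(h|\theta)$ can be approximated by the empirical average over the samples $\{h_n\}_{n=1}^{N}$, i.e., an empirical approximation of the gradient $\nabla_{\theta} J(\theta)$ is given by
$$\nabla_{\theta} \widehat{J}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_{t,n}|s_{t,n},\theta) R(h_n).$$
It is known that the variance of the above gradient estimator can be reduced by subtracting the baseline $b$:
$$\nabla_{\theta} \widehat{J}_b(\theta) = \frac{1}{N} \sum_{n=1}^{N} (R(h_n) - b) \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_{t,n}|s_{t,n},\theta).$$
Let us consider the following Gaussian policy model with policy parameter $\theta = (\mu, \sigma)$:
$$\pi(a|s,\theta) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(a - \mu^\top \phi(s))^2}{2\sigma^2}\right),$$
where $^\top$ denotes the transpose, $\mu$ is the Gaussian mean, $\sigma$ is the Gaussian standard deviation, and $\phi(s)$ is the basis function vector. Then the policy gradients are explicitly expressed as
$$\nabla_{\mu} \log \pi(a|s,\theta) = \frac{a - \mu^\top \phi(s)}{\sigma^2}\, \phi(s), \qquad \nabla_{\sigma} \log \pi(a|s,\theta) = \frac{(a - \mu^\top \phi(s))^2 - \sigma^2}{\sigma^3}.$$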
REINFORCE is a simple policy-search algorithm that directly updates policies to increase the expected return. However, its gradient estimates tend to have large variance even when combined with variance reduction by baseline subtraction. For this reason, policy update by REINFORCE tends to be unreliable. In particular, the variance of gradient estimates in REINFORCE can be proportional to the length of the history, $T$, due to the stochasticity of policies. This can be a critical limitation when the history is long.
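As an illustration of the estimator described in this section, the following NumPy sketch computes the baseline-subtracted REINFORCE gradient for a Gaussian policy whose mean is linear in the basis functions. The function names and the data layout (each history as a list of (basis vector, action) pairs) are assumptions made for this example only.

```python
import numpy as np


def gaussian_policy_log_grads(a, phi, mu, sigma):
    """Gradients of log N(a | mu^T phi, sigma^2) with respect to mu and sigma."""
    diff = a - mu @ phi
    grad_mu = diff / sigma**2 * phi
    grad_sigma = (diff**2 - sigma**2) / sigma**3
    return grad_mu, grad_sigma


def reinforce_gradient(histories, returns, mu, sigma, baseline=0.0):
    """Empirical REINFORCE gradient with baseline subtraction.

    histories: list of roll-outs, each a list of (phi(s_t), a_t) pairs.
    returns:   list with the return R(h) of each roll-out.
    """
    g_mu = np.zeros_like(mu)
    g_sigma = 0.0
    for steps, R in zip(histories, returns):
        sum_mu = np.zeros_like(mu)
        sum_sigma = 0.0
        for phi, a in steps:
            d_mu, d_sigma = gaussian_policy_log_grads(a, phi, mu, sigma)
            sum_mu += d_mu
            sum_sigma += d_sigma
        # The inner sum runs over all T steps, which is why the variance
        # of this estimator can grow with the trajectory length.
        g_mu += (R - baseline) * sum_mu
        g_sigma += (R - baseline) * sum_sigma
    n = len(histories)
    return g_mu / n, g_sigma / n
```

The policy parameters would then be updated by a gradient-ascent step, e.g. `mu = mu + lr * g_mu`, with a learning rate `lr` chosen by the user.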
2.3 Policy Gradients with Parameter-Based Exploration (PGPE)
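To overcome the above limitation of REINFORCE, a novel policy-search method called policy gradients with parameter-based exploration (PGPE) was proposed recently. In PGPE, a deterministic policy (such as the linear policy) is adopted, and the stochasticity for exploration is introduced by drawing the policy parameter $\theta$ from a prior distribution $p(\theta|\rho)$ with hyper-parameter $\rho$. Thanks to this per-trajectory formulation, the variance of gradient estimates can be drastically reduced.
In the PGPE formulation, the expected return is represented as a function of the hyper-parameter $\rho$:
$$J(\rho) = \iint p(h|\theta)\, p(\theta|\rho)\, R(h) \,\mathrm{d}h \,\mathrm{d}\theta.$$
Differentiating this with respect to $\rho$, we have
$$\nabla_{\rho} J(\rho) = \iint p(h|\theta)\, \nabla_{\rho} p(\theta|\rho)\, R(h) \,\mathrm{d}h \,\mathrm{d}\theta = \iint p(h|\theta)\, p(\theta|\rho)\, \nabla_{\rho} \log p(\theta|\rho)\, R(h) \,\mathrm{d}h \,\mathrm{d}\theta.$$
Because of the per-trajectory formulation, roll-out samples in the PGPE framework are accompanied by policy parameters, i.e., $\{(h_n, \theta_n)\}_{n=1}^{N}$. Based on these paired samples, an empirical estimator of the above gradient (with baseline subtraction) is given as follows:
$$\nabla_{\rho} \widehat{J}_b(\rho) = \frac{1}{N} \sum_{n=1}^{N} (R(h_n) - b)\, \nabla_{\rho} \log p(\theta_n|\rho).$$
Let us employ the linear deterministic policy, i.e., action $a$ is chosen as $a = \theta^\top \phi(s)$ for some basis function $\phi$. The parameter vector $\theta$ is drawn from the Gaussian prior distribution with hyper-parameter $\rho = (\eta, \tau)$. Here $\eta$ denotes the Gaussian mean vector and $\tau$ denotes the vector consisting of the Gaussian standard deviation in each element:
$$p(\theta|\rho) = \prod_{i} \frac{1}{\tau_i \sqrt{2\pi}} \exp\left(-\frac{(\theta_i - \eta_i)^2}{2\tau_i^2}\right),$$
where the subscript $i$ denotes the $i$-th element of the corresponding vector. Then the derivatives of $\log p(\theta|\rho)$ with respect to $\eta_i$ and $\tau_i$ are given as follows:
$$\nabla_{\eta_i} \log p(\theta|\rho) = \frac{\theta_i - \eta_i}{\tau_i^2}, \qquad \nabla_{\tau_i} \log p(\theta|\rho) = \frac{(\theta_i - \eta_i)^2 - \tau_i^2}{\tau_i^3}.$$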
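The PGPE gradient estimator with a factorized Gaussian prior can be sketched as follows. This is a minimal NumPy illustration, assuming that the sampled policy parameters are stacked row-wise and that the baseline is supplied by the caller; the function name `pgpe_gradient` is hypothetical.

```python
import numpy as np


def pgpe_gradient(thetas, returns, eta, tau, baseline=0.0):
    """Empirical PGPE gradient with respect to the prior mean and std.

    thetas:  (N, d) array, one sampled policy parameter per roll-out.
    returns: (N,) array with the return of each roll-out.
    eta:     (d,) Gaussian prior mean; tau: (d,) Gaussian prior std.
    """
    thetas = np.asarray(thetas, dtype=float)
    returns = np.asarray(returns, dtype=float)
    diff = thetas - eta
    # Derivatives of the log-prior for an element-wise Gaussian.
    dlog_eta = diff / tau**2
    dlog_tau = (diff**2 - tau**2) / tau**3
    # One weight per trajectory: no sum over time steps appears here.
    weights = (returns - baseline)[:, None]
    g_eta = (weights * dlog_eta).mean(axis=0)
    g_tau = (weights * dlog_tau).mean(axis=0)
    return g_eta, g_tau
```

In contrast to the REINFORCE estimator, each roll-out contributes a single log-derivative term, which is the reason the variance does not grow with the trajectory length.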
2.4 Importance-Weighted PGPE (IW-PGPE)
A popular idea to further improve the performance of RL methods is to reuse previously collected samples [21, 6]. Such a sample-reuse strategy is particularly useful when the data sampling cost is high (e.g., robot control).
Importance-weighted PGPE (IW-PGPE) combines the sample-reuse idea with PGPE. Technically, IW-PGPE can be regarded as an off-policy extension of PGPE, where data collecting policies are different from the current policy. In the PGPE formulation, such an off-policy scenario can be regarded as the situation where data collecting policies and the current policy are drawn from different prior distributions (more specifically, different hyper-parameters). Let $\rho$ be the hyper-parameter for the current policy and $\rho'$ be the hyper-parameter for a data collecting policy. Let us denote data samples collected with hyper-parameter $\rho'$ as $\{(h'_n, \theta'_n)\}_{n=1}^{N'}$.
When the data collecting policy is different from the current policy, importance sampling is a useful technique to correct the estimation bias caused by differing distributions. More specifically, the gradient is estimated as
$$\nabla_{\rho} \widehat{J}_{\mathrm{IW}}(\rho) = \frac{1}{N'} \sum_{n=1}^{N'} w(\theta'_n)\, (R(h'_n) - b)\, \nabla_{\rho} \log p(\theta'_n|\rho),$$
where $w(\theta)$ is the importance weight defined as
$$w(\theta) = \frac{p(\theta|\rho)}{p(\theta|\rho')},$$
and is the baseline given by
Through experiments, the IW-PGPE method was demonstrated to be the best performing algorithm in model-free RL approaches .
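The importance-weighting step can be illustrated with a short NumPy sketch. It assumes a factorized Gaussian prior for both the current and the data-collecting hyper-parameters and takes the baseline as a user-supplied constant; the function names are hypothetical and this is not the authors' implementation.

```python
import numpy as np


def log_gaussian_prior(thetas, eta, tau):
    """Log-density of an element-wise Gaussian prior for each row of thetas."""
    z = (thetas - eta) / tau
    return np.sum(-0.5 * z**2 - np.log(tau) - 0.5 * np.log(2.0 * np.pi), axis=1)


def importance_weights(thetas, eta, tau, eta_old, tau_old):
    """Ratio of the current prior density to the data-collecting prior density."""
    log_w = (log_gaussian_prior(thetas, eta, tau)
             - log_gaussian_prior(thetas, eta_old, tau_old))
    return np.exp(log_w)  # computed in log space for numerical stability


def iw_pgpe_gradient(thetas, returns, eta, tau, eta_old, tau_old, baseline=0.0):
    """Importance-weighted PGPE gradient from samples drawn under (eta_old, tau_old)."""
    w = importance_weights(thetas, eta, tau, eta_old, tau_old)
    coef = (w * (returns - baseline))[:, None]
    diff = thetas - eta
    g_eta = np.mean(coef * diff / tau**2, axis=0)
    g_tau = np.mean(coef * (diff**2 - tau**2) / tau**3, axis=0)
    return g_eta, g_tau
```

When the two hyper-parameters coincide, all weights equal one and the estimator reduces to the plain PGPE estimator.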
The purpose of this paper is to develop a model-based counterpart of PGPE.
3 Model-Based Policy Search
Model-based RL first estimates the transition model and then learns a policy based on the estimated transition model. Because one can draw as many trajectory samples as one wants from the learned transition model without additional sampling costs, the model-based approach can work well if the transition model is accurately estimated [23, 5]. In this section, we extend PGPE to a model-based scenario. We first review an existing model estimation method based on the Gaussian process (GP)  and point out its limitations. Then we propose to use the state-of-the-art conditional density estimator called least-squares conditional density estimation (LSCDE)  in the model-based PGPE method.
3.1 Model-Based PGPE (M-PGPE)
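PGPE can be extended to a model-based scenario as follows.
1. Collect transition samples $\{(s_m, a_m, s'_m)\}_{m=1}^{M}$.
2. Obtain transition model $\widehat{p}(s'|s,a)$ by a model estimation method from $\{(s_m, a_m, s'_m)\}_{m=1}^{M}$.
3. Initialize hyper-parameter $\rho$.
4. Draw policy parameter $\theta$ from prior distribution $p(\theta|\rho)$.
5. Generate many samples $\{\widetilde{h}_n\}_{n=1}^{\widetilde{N}}$ from $\widehat{p}(s'|s,a)$ and current policy $\pi(a|s,\theta)$.
6. Estimate baseline $b$ and gradient $\nabla_{\rho} \widehat{J}(\rho)$ from disjoint subsets of $\{\widetilde{h}_n\}_{n=1}^{\widetilde{N}}$.
7. Update hyper-parameter as $\rho \longleftarrow \rho + \varepsilon \nabla_{\rho} \widehat{J}(\rho)$, where $\varepsilon$ denotes the learning rate.
8. Repeat Steps 4–7 until $\rho$ converges.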
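Steps 3 to 8 above can be sketched as follows for a scalar state and a scalar linear policy $a = \theta s$. This is a simplified illustration under several assumptions that are not from the original: the learned model is represented by a user-supplied sampler `sample_next_state` (standing in for Steps 1 and 2), the baseline is the mean return of a held-out half of the samples, and the standard deviation is kept above a small floor `tau_min`.

```python
import numpy as np


def rollout_return(theta, sample_next_state, reward, s0, T, gamma):
    """Discounted return of the deterministic policy a = theta * s under the model."""
    s, total = s0, 0.0
    for t in range(T):
        a = theta * s
        s_next = sample_next_state(s, a)   # artificial sample from the learned model
        total += gamma**t * reward(s, a, s_next)
        s = s_next
    return total


def m_pgpe(sample_next_state, reward, s0, eta=0.0, tau=1.0, n_iters=100,
           n_samples=200, T=10, gamma=0.95, lr=0.05, tau_min=0.05, seed=0):
    """Model-based PGPE loop with a scalar Gaussian prior N(eta, tau^2)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_iters):
        # Step 4: draw policy parameters from the prior.
        thetas = eta + tau * rng.standard_normal(2 * n_samples)
        # Step 5: generate artificial roll-outs from the model.
        returns = np.array([rollout_return(th, sample_next_state, reward, s0, T, gamma)
                            for th in thetas])
        # Step 6: baseline and gradient from disjoint halves of the samples.
        baseline = returns[n_samples:].mean()   # simple mean baseline (assumption)
        th, R = thetas[:n_samples], returns[:n_samples]
        g_eta = np.mean((R - baseline) * (th - eta) / tau**2)
        g_tau = np.mean((R - baseline) * ((th - eta)**2 - tau**2) / tau**3)
        # Step 7: gradient ascent on the hyper-parameters.
        eta = eta + lr * g_eta
        tau = max(tau + lr * g_tau, tau_min)    # keep the std positive (assumption)
    return eta, tau
```

Because the roll-outs come from the model, the number of artificial samples per iteration can be increased without spending any of the real data-collection budget.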
Below, we consider the problem of approximating the transition probability from samples , and review transition model estimation methods.
3.2 Gaussian Process (GP)
Here we review a transition model estimation method based on GP.
In the GP framework, the problem of transition probability estimation is formulated as the regression problem of predicting output given input and under Gaussian noise:
where is an unknown regression function and is independent Gaussian noise. Then, the GP estimate of the transition probability density for an arbitrary test input and is given by the Gaussian distribution with mean and variance given by
respectively. Here, denotes the -dimensional identity matrix. is the -dimensional vector and is the Gram matrix defined by
denotes the covariance function, which is, e.g., defined by
Here, and are hyperparameters, and together with the noise variance , the hyperparameters are determined by evidence maximization .
As shown above, the GP-based model estimation method requires the strong assumption that the transition probability density is Gaussian. That is, although GP is non-parametric as a regression method for estimating the conditional mean, it is parametric (Gaussian) as a conditional density estimator. Such a conditional Gaussian assumption is highly restrictive in RL problems.
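The GP predictive density reviewed above can be illustrated with a short NumPy sketch for a scalar output (each dimension of the next state would be handled separately). The squared-exponential covariance and its parameterization by `signal_var` and `length_scale` are assumptions for this example, the hyperparameters are fixed rather than chosen by evidence maximization, and the predictive variance here includes the noise variance.

```python
import numpy as np


def sq_exp_kernel(X1, X2, signal_var, length_scale):
    """Squared-exponential covariance between the rows of X1 and X2."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return signal_var * np.exp(-d2 / (2.0 * length_scale**2))


def gp_predict(X, y, x_star, signal_var=1.0, length_scale=1.0, noise_var=0.1):
    """Mean and variance of the Gaussian predictive density at x_star.

    X: (M, d) training inputs (state-action pairs), y: (M,) scalar outputs,
    x_star: (d,) test input.
    """
    K = sq_exp_kernel(X, X, signal_var, length_scale)
    k_star = sq_exp_kernel(X, x_star[None, :], signal_var, length_scale)[:, 0]
    A = K + noise_var * np.eye(len(X))
    mean = k_star @ np.linalg.solve(A, y)
    var = signal_var - k_star @ np.linalg.solve(A, k_star) + noise_var
    return mean, var
```

Whatever the training data look like, the returned density is a single Gaussian with this mean and variance, so a bimodal transition density cannot be represented.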
3.3 Least-Squares Conditional Density Estimation (LSCDE)
To overcome the restriction of the GP-based model estimation method, we propose to use LSCDE.
Let us model the transition probability by the following linear-in-parameter model:
where is the -dimensional basis function vector and is the -dimensional parameter vector. If is too large, we may reduce the number of basis functions by only using a subset of samples as Gaussian centers. We may use different Gaussian widths for and if necessary.
The parameter in the model (1) is learned so that the following squared error is minimized:
This can be expressed as
where we used in the second term and
Because is constant, we only consider the first two terms from here on:
Note that, for the Gaussian model (1), the -th element of can be computed analytically as
Because and included in contain the expectations over unknown densities and , they are approximated by sample averages. Then we have
By adding an -regularizer to to avoid overfitting, the LSCDE optimization criterion is given as
where is the regularization parameter. Taking the derivative of the above objective function and equating it to zero, we can see that the solution can be obtained just by solving the following system of linear equations:
where denotes the -dimensional identity matrix. Thus, the solution is given analytically as
Because conditional probability densities are non-negative by definition, we modify the solution as
where denotes the -dimensional zero vector and ‘’ for vectors are applied in the element-wise manner.
Finally, we renormalize the solution in the test phase. More specifically, given a test input point , the final LSCDE solution is given as
LSCDE was proved to achieve the optimal non-parametric convergence rate to the true conditional density in the minimax sense, meaning that no method can outperform this simple LSCDE method asymptotically in terms of the convergence rate.
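The LSCDE procedure described above (analytic solution of a regularized linear system, rounding negative coefficients up to zero, renormalization at test time, and sampling) can be sketched in NumPy as follows. The sketch assumes Gaussian basis functions with a common width `sigma` for inputs and outputs, centered at user-chosen sample points; the function names are hypothetical and the code is not the MATLAB implementation mentioned in the text.

```python
import numpy as np


def _sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)


def lscde_fit(X, Y, centers_x, centers_y, sigma, lam):
    """Fit non-negative coefficients. X: (n, dx) inputs, Y: (n, dy) outputs."""
    n, dy = Y.shape
    Kx = np.exp(-_sq_dists(X, centers_x) / (2 * sigma**2))   # (n, B)
    Ky = np.exp(-_sq_dists(Y, centers_y) / (2 * sigma**2))   # (n, B)
    h = (Kx * Ky).mean(axis=0)
    # Analytic integral over the output of products of Gaussian basis functions.
    Cy = (np.sqrt(np.pi) * sigma) ** dy * np.exp(
        -_sq_dists(centers_y, centers_y) / (4 * sigma**2))
    H = Cy * (Kx.T @ Kx) / n
    alpha = np.linalg.solve(H + lam * np.eye(len(h)), h)
    return np.maximum(alpha, 0.0)   # conditional densities are non-negative


def lscde_density(x, y, alpha, centers_x, centers_y, sigma):
    """Renormalized estimate of p(y | x) at single points x (dx,) and y (dy,)."""
    kx = np.exp(-((centers_x - x) ** 2).sum(axis=1) / (2 * sigma**2))
    ky = np.exp(-((centers_y - y) ** 2).sum(axis=1) / (2 * sigma**2))
    norm = (2 * np.pi * sigma**2) ** (len(y) / 2) * (alpha @ kx)
    return (alpha @ (kx * ky)) / norm   # undefined if alpha @ kx is zero


def lscde_sample(x, alpha, centers_x, centers_y, sigma, rng):
    """Draw one output sample from the estimated conditional density at x."""
    kx = np.exp(-((centers_x - x) ** 2).sum(axis=1) / (2 * sigma**2))
    w = alpha * kx
    b = rng.choice(len(w), p=w / w.sum())   # pick a mixture component
    return centers_y[b] + sigma * rng.standard_normal(centers_y.shape[1])
```

With Gaussian basis functions, the renormalized estimate at a fixed input is a mixture of Gaussians over the output, which is why sampling only requires choosing a component and adding Gaussian noise.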
Model selection of the Gaussian width and the regularization parameter is possible by cross-validation. A MATLAB implementation of LSCDE is available from
4 Experiments
In this section, we demonstrate the usefulness of the proposed method through experiments.
4.1 Continuous Chain Walk
For illustration purposes, let us first consider a simple continuous chain-walk task (Figure 1).
That is, the agent receives positive reward at the center of the state space. We set the episode length at , the discount factor at , and the learning rate at . We use the following linear-in-parameter policy model:
As transition dynamics, we consider two setups:
The true transition dynamics is given by
where is the Gaussian noise with mean and standard deviation .
The true transition dynamics is given by
where is the Gaussian noise with mean and standard deviation , and the sign of is randomly chosen with probability .
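To make the difference between the two setups concrete, the following sketch simulates a unimodal and a bimodal transition. The additive form of the dynamics and the parameter names (`noise_std`, `flip_prob`) are assumptions made for illustration only; the actual dynamics and noise levels are those specified in the two setups above.

```python
import numpy as np


def gaussian_step(s, a, noise_std, rng):
    """Unimodal transition: the action shifts the state, plus Gaussian noise."""
    return s + a + noise_std * rng.standard_normal()


def bimodal_step(s, a, noise_std, rng, flip_prob=0.5):
    """Bimodal transition: the sign of the action is flipped at random."""
    sign = -1.0 if rng.random() < flip_prob else 1.0
    return s + sign * a + noise_std * rng.standard_normal()
```

In the bimodal case the next state concentrates around two separate values for the same state and action, so a single Gaussian centered between them places most of its mass where no transitions occur.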
We compare the following three policy search methods:
M-PGPE(LSCDE): The model-based PGPE method with transition model estimated by LSCDE.
M-PGPE(GP): The model-based PGPE method with transition model estimated by GP.
IW-PGPE: The model-free PGPE method with sample reuse by importance weighting. (Footnote 1: We have also tested the plain PGPE method without importance weighting. However, this did not perform well in our preliminary experiments, and thus we decided to omit the results.)
Below, we consider the situation where the budget for data collection is limited to episodic samples.
4.1.2 LSCDE Vs. GP
When the transition model is learned in the M-PGPE methods, all samples are gathered randomly in the beginning at once. More specifically, the initial state and the action are chosen from the uniform distributions over and , respectively. Then the next state and the immediate reward are obtained. Then the action is chosen from the uniform distribution over , and the next state and the immediate reward are obtained. This process is repeated until we obtain . This gives a trajectory sample, and we repeat this data generation process times to obtain trajectory samples.
Figure 4 and Figure 7 illustrate the true transition dynamics and its estimates obtained by LSCDE and GP in the Gaussian and bimodal cases. Figure 4 shows that both LSCDE and GP can learn the entire profile of the true transition dynamics well in the Gaussian case. On the other hand, Figure 7 shows that LSCDE can still successfully capture the entire profile of the true transition model well even in the bimodal case, but GP fails to capture the bimodal structure.
Based on the estimated transition models, we learn policies by the M-PGPE method. We generate artificial samples for policy gradient estimation and another artificial samples for baseline estimation from the learned transition model. Then policy is updated based on these artificial samples. We repeat this policy update step times. For evaluating the return of a learned policy, we use additional test episodic samples which are not used for policy learning. Figure 4 and Figure 7 depict the average performance of learned policies over runs. As expected, the GP-based method performs very well in the Gaussian case, but LSCDE still exhibits reasonably good performance. In the bimodal case, GP performs poorly and LSCDE gives much better policies than GP. This illustrates the high flexibility of LSCDE.
4.1.3 Model-Based Vs. Model-Free
Next, we compare the performance of M-PGPE with the model-free IW-PGPE method.
For the IW-PGPE method, we need to determine the schedule of collecting samples under the fixed budget scenario. First, we illustrate how the choice of sampling schedules affects the performance of IW-PGPE. Figure 4 and Figure 7 show expected returns averaged over runs under the sampling schedule that a batch of samples are gathered times for different values. In our implementation of IW-PGPE, policy update is performed times after observing each batch of samples, because we empirically observed that this performs better than performing policy update only once. Figure 4 shows that the performance of IW-PGPE depends heavily on the sampling schedule, and gathering samples at once is shown to be the best choice in the Gaussian case. Figure 7 shows that gathering samples at once is also the best choice in the bimodal case.
Although the best sampling schedule is not accessible in practice, we use this optimal sampling schedule for IW-PGPE. Figure 4 and Figure 7 also include returns of IW-PGPE averaged over runs as functions of the sampling steps. These graphs show that IW-PGPE can improve the policies only in the beginning, because all samples are gathered at once in the beginning. The performance of IW-PGPE may be further improved if it is possible to gather more samples, but this is prohibited under the fixed budget scenario. On the other hand, return values of M-PGPE constantly increase throughout iterations, because artificial samples can be generated continually without additional sampling costs. This illustrates a potential advantage of model-based RL methods.
4.2 Humanoid Robot Control
Finally, we evaluate the performance of M-PGPE on a practical control problem of a simulated upper-body model of the humanoid robot CB-i  (see Figure 8(a)). We use its simulator for experiments (see Figure 8(b)). The goal of the control problem is to lead the end-effector of the right arm (right hand) to the target object.
The simulator is based on the upper-body of the CB-i humanoid robot, which has 9 joints for shoulder pitch, shoulder roll, elbow pitch of the right arm, shoulder pitch, shoulder roll, elbow pitch of the left arm, waist yaw, torso roll, and torso pitch.
At each time step, the controller receives a state vector from the system and sends out an action vector. The state vector is 18-dimensional and real-valued, which corresponds to the current angle in degree and the current angular velocity for each joint. The action vector is 9-dimensional and real-valued, which corresponds to the target angle of each joint in degree.
We simulate a noisy control system by perturbing action vectors with independent bimodal Gaussian noise. More specifically, for each action element, we add Gaussian noise with mean and standard deviation with probability , and Gaussian noise with mean and standard deviation with probability .
The initial posture of the robot is fixed to standing up straight with arms down. The target object is located in front of and above the right hand, at a position reachable by using the controllable joints. The reward function at each time step is defined as
where is the distance between the right hand and target object at time step , and is the sum of control costs for each joint. The coefficient is multiplied to keep the values of the two terms in the same order of magnitude. The deterministic policy model used in PGPE is defined as with the basis function . We set the episode length at , the discount factor at , and the learning rate at .
4.2.2 Experiment with 2 Joints
First, we only use 2 joints among the 9 joints, i.e., we allow only the right shoulder pitch and right elbow pitch to be controlled, while the other joints remain still at each time step (no control signal is sent to these joints). Therefore, the dimensionality of the state vector and the action vector is 4 and 2, respectively. Under this simplified setup, we compare the performance of M-PGPE(LSCDE), M-PGPE(GP), and IW-PGPE.
We suppose that the budget for data collection is limited to episodic samples. For the M-PGPE methods, all samples are collected at first using the uniformly random initial states and policy. More specifically, the initial state is chosen from the uniform distributions over . At each time step, the -th element of action vector is chosen from the uniform distribution on . In total, we have 5000 transition samples for model estimation. Then, we generate 1000 artificial samples for policy gradient estimation and another 1000 artificial samples for baseline estimation from the learned transition model, and update the control policy based on these artificial samples. For the IW-PGPE method, we performed preliminary experiments to determine the optimal sampling schedule (Figure 10), showing that collecting samples times yields the highest average return. We use this sampling schedule for performance comparison with the M-PGPE methods.
Returns obtained by each method averaged over 10 runs are plotted in Figure 10, showing that M-PGPE(LSCDE) tends to outperform both M-PGPE(GP) and IW-PGPE. Figure 11 illustrates an example of the reaching motion with 2-joints obtained by M-PGPE(LSCDE) at the th iteration policy. This shows that the learned policy successfully leads the right hand to the target object within only steps in this noisy control system.
4.2.3 Experiment with 9 Joints
Finally, we evaluate the performance of the proposed method on the reaching task with 9 joints, i.e., all joints are allowed to move. In this experiment, we compare learning performance between M-PGPE(LSCDE) and IW-PGPE. We do not include M-PGPE(GP) since it is outperformed by M-PGPE(LSCDE) on the previous 2-joints experiments, and furthermore the GP-based method requires an enormous amount of computation time.
The experimental setup is essentially the same as the 2-joint experiments, but we have a budget for gathering samples for this complex and high-dimensional task. The position of the target object is moved to the far left, which is not reachable by using just 2 joints. Thus, the robot is required to move other joints to reach the object with the right hand. We randomly choose 5000 samples as Gaussian centers for M-PGPE(LSCDE). The sampling schedule for IW-PGPE was set to 1000 samples at once, which is the best sampling schedule according to Figure 12. The returns obtained by M-PGPE(LSCDE) and IW-PGPE averaged over 30 runs are plotted in Figure 13, showing that M-PGPE(LSCDE) tends to outperform the state-of-the-art IW-PGPE method in this challenging robot control task.
Figure 14 shows a typical example of the reaching motion with 9 joints obtained by M-PGPE(LSCDE) at the 1000th iteration. The images show that the policy learned by M-PGPE(LSCDE) leads the right hand to the distant object successfully within 14 steps.
Overall, the proposed M-PGPE(LSCDE) method is shown to be promising in the noisy and high-dimensional humanoid robot arm reaching task.
5 Conclusion
We extended the model-free PGPE method to a model-based scenario, and proposed to combine it with a model estimator called LSCDE. Under the fixed sampling budget, appropriately designing a sampling schedule is critical for the model-free IW-PGPE method, while this is not a problem for the proposed model-based PGPE method. Through experiments, we confirmed that GP-based model estimation is not as flexible as the LSCDE-based method when the transition model is not Gaussian, and the proposed model-based PGPE based on LSCDE was overall demonstrated to be promising.
Acknowledgments
VT was supported by the JASSO scholarship, TZ was supported by the MEXT scholarship, JM was supported by MEXT KAKENHI 23120004, and MS was supported by the FIRST project.
References
- [1] P. Abbeel, M. Quigley, and A. Y. Ng. Using inaccurate models in reinforcement learning. Proceedings of the 23rd International Conference on Machine Learning, pages 1–8, 2006.
- [2] T. Akiyama, H. Hachiya, and M. Sugiyama. Efficient exploration through active learning for value function approximation in reinforcement learning. Neural Networks, 23(5):639–648, 2010.
- [3] G. Cheng, S. Hyon, J. Morimoto, A. Ude, J. G. Hale, G. Colvin, W. Scroggin, and S. C. Jacobsen. CB: A humanoid research platform for exploring neuroscience. Advanced Robotics, 21(10):1097–1114, 2007.
- [4] P. Dayan and G. E. Hinton. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278, 1997.
- [5] M. P. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. Proceedings of the 28th International Conference on Machine Learning, pages 465–473, 2011.
- [6] H. Hachiya, T. Akiyama, M. Sugiyama, and J. Peters. Adaptive importance sampling for value function approximation in off-policy reinforcement learning. Neural Networks, 22(10):1399–1410, 2009.
- [7] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
- [8] S. Kakade. A natural policy gradient. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1531–1538, Cambridge, MA, 2002. MIT Press.
- [9] T. Kanamori, S. Hido, and M. Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10:1391–1445, Jul. 2009.
- [10] T. Kanamori, T. Suzuki, and M. Sugiyama. Computational complexity of kernel-based density-ratio estimation: A condition number analysis. Machine Learning, 2012. to appear.
- [11] T. Kanamori, T. Suzuki, and M. Sugiyama. Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86(3):335–367, 2012.
- [12] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
- [13] J. Peters and S. Schaal. Policy gradient methods for robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2219–2225, 2006.
- [14] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, USA, 2006.
- [15] F. Sehnke, C. Osendorfer, T. Rückstiess, A. Graves, J. Peters, and J. Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551–559, 2010.
- [16] M. Sugiyama, H. Hachiya, H. Kashima, and T. Morimura. Least absolute policy iteration: A robust approach to value function approximation. IEICE Transactions on Information and Systems, E93-D(9):2555–2565, 2010.
- [17] M. Sugiyama, H. Hachiya, C. Towell, and S. Vijayakumar. Geodesic Gaussian kernels for value function approximation. Autonomous Robots, 25(3):287–304, 2008.
- [18] M. Sugiyama and M. Kawanabe. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press, Cambridge, MA, USA, 2012.
- [19] M. Sugiyama, T. Suzuki, and T. Kanamori. Density ratio matching under the Bregman divergence: A unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5):1009–1044, 2012.
- [20] M. Sugiyama, I. Takeuchi, T. Suzuki, T. Kanamori, H. Hachiya, and D. Okanohara. Least-squares conditional density estimation. IEICE Transactions on Information and Systems, E93-D(3):583–594, 2010.
- [21] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1998.
- [22] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press, 2000.
- [23] X. Wang and T. G. Dietterich. Model-based policy gradient reinforcement learning. Proceedings of the 20th International Conference on Machine Learning, pages 776–783, 2003.
- [24] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
- [25] T. Zhao, H. Hachiya, G. Niu, and M. Sugiyama. Analysis and improvement of policy gradient estimation. Neural Networks, 26:118–129, 2012.
- [26] T. Zhao, H. Hachiya, V. Tangkaratt, J. Morimoto, and M. Sugiyama. Efficient sample reuse in policy gradients with parameter-based exploration. Neural Computation, 25:1512–1547, 2013.