Model-Based Policy Gradients
with Parameter-Based Exploration
by Least-Squares Conditional Density Estimation
Abstract
The goal of reinforcement learning (RL) is to let an agent learn an optimal control policy in an unknown environment so that future expected rewards are maximized. The model-free RL approach directly learns the policy based on data samples. Although using many samples tends to improve the accuracy of policy learning, collecting a large number of samples is often expensive in practice. On the other hand, the model-based RL approach first estimates the transition model of the environment and then learns the policy based on the estimated transition model. Thus, if the transition model is accurately learned from a small amount of data, the model-based approach can perform better than the model-free approach. In this paper, we propose a novel model-based RL method by combining a recently proposed model-free policy search method called policy gradients with parameter-based exploration and the state-of-the-art transition model estimator called least-squares conditional density estimation. Through experiments, we demonstrate the practical usefulness of the proposed method.
1 Introduction
Reinforcement learning (RL) is a framework to let an agent learn an optimal control policy in an unknown environment so that expected future rewards are maximized [7]. The RL methods developed so far can be categorized into two types: policy iteration, where policies are learned based on value function approximation [21, 12], and policy search, where policies are learned directly to maximize expected future rewards [24, 4, 22, 8, 15, 26].
1.1 Policy Iteration Vs. Policy Search
A value function represents expected future rewards as a function of a state or a state and an action. In the policy iteration framework, approximation of the value function for the current policy and improvement of the policy based on the learned value function are iteratively performed until an optimal policy is found. Thus, accurately approximating the value function is a challenge in the value function based approach. So far, various machine learning techniques have been employed for better value function approximation, such as least-squares approximation [12], manifold learning [17], efficient sample reuse [6], active learning [2], and robust learning [16].
However, because policy functions are learned indirectly via value functions in policy iteration, improving the quality of value function approximation does not necessarily yield a better policy function. Furthermore, because a small change in value functions can cause a big change in policy functions, it is not safe to use the value function based approach for controlling expensive dynamic systems such as a humanoid robot. Another weakness of the value function approach is that it is difficult to handle continuous actions because a maximizer of the value function with respect to an action needs to be found for policy improvement.
On the other hand, in the policy search approach, policy functions are determined so that expected future rewards are directly maximized. A popular policy search method is to update policy functions via gradient ascent. However, a classic policy gradient method called REINFORCE [24] tends to produce gradient estimates with large variance, which results in unreliable policy improvement [13]. More theoretically, it was shown that the variance of policy gradients can be proportional to the length of an agent’s trajectory, due to the stochasticity of policies [25]. This can be a critical limitation in RL problems with long trajectories.
To cope with this problem, a novel policy gradient method called policy gradients with parameter-based exploration (PGPE) was proposed [15]. In PGPE, deterministic policies are used to suppress irrelevant randomness and useful stochasticity is introduced by drawing policy parameters from a prior distribution. Then, instead of policy parameters, hyperparameters included in the prior distribution are learned from data. Thanks to this prior-based formulation, the variance of gradient estimates in PGPE is independent of the length of an agent's trajectory [25]. However, PGPE still suffers from an instability problem in small sample cases. To further improve the practical performance of PGPE, an efficient sample reuse method called importance-weighted PGPE (IW-PGPE) was proposed recently and demonstrated to achieve the state-of-the-art performance [26].
1.2 Model-Based Vs. Model-Free
The RL methods reviewed above are categorized into the model-free approach, where policies are learned without explicitly modeling the unknown environment (i.e., the transition probability of the agent in the environment). On the other hand, an alternative approach called the model-based approach explicitly models the environment in advance and uses the learned environment model for policy learning [23, 5]. In the model-based approach, no additional sampling cost is necessary to generate artificial samples from the learned environment model.
The model-based approach is particularly advantageous in the policy search scenario. For example, given a fixed budget for data collection, IW-PGPE requires us to determine the sampling schedule in advance. More specifically, we need to decide, e.g., whether we gather many samples in the beginning or collect small batches of samples over a longer period. However, optimizing the sampling schedule in advance is not possible without strong prior knowledge. Thus, we need to design the sampling schedule blindly in practice, which can cause significant performance degradation. On the other hand, the model-based approach does not suffer from this problem because we can draw as many trajectory samples as we want from the learned transition model without additional sampling costs.
Another advantage of the model-based approach lies in baseline subtraction. In gradient-based policy search methods such as REINFORCE and PGPE, subtraction of a baseline from a gradient estimate is a vital technique to reduce the estimation variance of policy gradients [13, 26]. If the baseline is estimated from samples that are statistically independent of samples used for the estimation of policy gradients, variance reduction can be carried out without increasing the estimation bias. However, such independent samples are not available in practice (if available, they should be used for policy gradient estimation), and thus variance reduction by baseline subtraction is practically performed at the expense of bias increase. On the other hand, in the model-based scenario, we can draw as many trajectory samples as we want from the learned transition model without additional sampling costs. Therefore, two statistically independent sets of samples can be generated and they can be separately used for policy gradient estimation and baseline estimation.
1.3 Transition Model Learning by Least-Squares Conditional Density Estimation
If the unknown environment is accurately approximated, the model-based approach can fully enjoy all the above advantages. However, accurately estimating the transition model from a limited amount of trajectory data in multidimensional continuous state and action spaces is highly challenging. Although a model-based method that does not require an accurate transition model was developed [1], it is only applicable to deterministic environments, which significantly limits its range of applications in practice. On the other hand, a recently proposed model-based policy search method called PILCO [5] learns a probabilistic transition model by the Gaussian process (GP) [14], and explicitly incorporates long-term model uncertainty. However, PILCO requires states and actions to follow Gaussian distributions and the reward function to take a particular exponential form so that policy evaluation can be performed in a closed form and policy gradients can be computed analytically for policy improvement. These strong requirements make PILCO practically restrictive.
To overcome such limitations of existing approaches, we propose a highly practical policy-search algorithm by extending the model-free PGPE method to the model-based scenario. In the proposed model-based PGPE (M-PGPE) method, the transition model is learned by the state-of-the-art nonparametric conditional density estimator called least-squares conditional density estimation (LSCDE) [20], which has various superior properties: It can directly handle multidimensional inputs and outputs, it was proved to achieve the optimal convergence rate [11], it has high numerical stability [10], it is robust against outliers [19], its solution can be computed analytically and efficiently just by solving a system of linear equations [9], and generating samples from the learned conditional density is straightforward. Through experiments, we demonstrate that the proposed M-PGPE method is a promising approach.
2 Problem Formulation and Model-Free Policy Search
In this section, we first formulate our RL problem and review existing model-free policy search methods.
2.1 Formulation
Let us consider a Markov decision problem consisting of the following elements:
- $\mathcal{S}$: A set of continuous states.
- $\mathcal{A}$: A set of continuous actions.
- $p(s)$: The (unknown) probability density of initial states.
- $p(s'|s,a)$: The (unknown) conditional probability density of visiting state $s'$ from state $s$ by action $a$.
- $r(s,a,s')$: The immediate reward function for the transition from $s$ to $s'$ by $a$.
Let $\pi(a|s;\theta)$ be a policy of an agent parameterized by $\theta$, which is the conditional probability density of taking action $a$ at state $s$. Let
$$h = [s_1, a_1, \ldots, s_T, a_T, s_{T+1}]$$
be a history, which is a sequence of states and actions with finite length $T$ generated as follows: First, the initial state $s_1$ is determined following the initial-state probability density $p(s)$. Then action $a_1$ is chosen following policy $\pi(a|s_1;\theta)$, and the next state $s_2$ is determined following the transition probability density $p(s'|s_1,a_1)$. This process is repeated $T$ times.
Let $R(h)$ be the return for history $h$, which is the discounted sum of future rewards the agent can obtain:
$$R(h) = \sum_{t=1}^{T} \gamma^{t-1}\, r(s_t, a_t, s_{t+1}),$$
where $0 \le \gamma < 1$ is a discount factor. The expected return is given by
$$J(\theta) = \int p(h|\theta)\, R(h)\, \mathrm{d}h,$$
where $p(h|\theta)$ is the probability density of observing history $h$:
$$p(h|\theta) = p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t;\theta).$$
The goal of RL is to find the optimal policy parameter $\theta^*$ that maximizes the expected return $J(\theta)$:
$$\theta^* = \operatorname*{argmax}_{\theta} J(\theta).$$
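As a concrete illustration, the return $R(h)$ and a Monte Carlo approximation of the expected return $J(\theta)$ can be sketched in Python as follows (a minimal sketch with illustrative names; per-history rewards are assumed to be given as plain lists):

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Discounted sum of rewards: R(h) = sum_t gamma^(t-1) r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def expected_return(histories_rewards, gamma):
    """Monte Carlo estimate of J: average R(h) over sampled histories."""
    return np.mean([discounted_return(r, gamma) for r in histories_rewards])
```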
2.2 REINFORCE
REINFORCE [24] is a classic method for learning the policy parameter $\theta$ via gradient ascent:
$$\theta \longleftarrow \theta + \varepsilon \nabla_\theta J(\theta),$$
where $\varepsilon$ denotes the learning rate and $\nabla_\theta J(\theta)$ denotes the gradient of $J(\theta)$ with respect to $\theta$.
The gradient $\nabla_\theta J(\theta)$ can be expressed as
$$\nabla_\theta J(\theta) = \int \nabla_\theta p(h|\theta)\, R(h)\, \mathrm{d}h = \int p(h|\theta)\, \nabla_\theta \log p(h|\theta)\, R(h)\, \mathrm{d}h,$$
where we used $\nabla_\theta p(h|\theta) = p(h|\theta)\, \nabla_\theta \log p(h|\theta)$ and
$$\nabla_\theta \log p(h|\theta) = \sum_{t=1}^{T} \nabla_\theta \log \pi(a_t|s_t;\theta).$$
In the above expression, the probability density of histories, $p(h|\theta)$, is unknown. Suppose that we are given $N$ rollout samples $\{h_n\}_{n=1}^{N}$ for the current policy, where
$$h_n = [s_1^{(n)}, a_1^{(n)}, \ldots, s_T^{(n)}, a_T^{(n)}, s_{T+1}^{(n)}].$$
Then the expectation over $p(h|\theta)$ can be approximated by the empirical average over the samples $\{h_n\}_{n=1}^{N}$, i.e., an empirical approximation of the gradient is given by
$$\nabla_\theta \widehat{J}(\theta) = \frac{1}{N} \sum_{n=1}^{N} R(h_n) \sum_{t=1}^{T} \nabla_\theta \log \pi(a_t^{(n)}|s_t^{(n)};\theta).$$
It is known [13] that the variance of the above gradient estimator can be reduced by subtracting a baseline $b$:
$$\nabla_\theta \widehat{J}^{\,b}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \big( R(h_n) - b \big) \sum_{t=1}^{T} \nabla_\theta \log \pi(a_t^{(n)}|s_t^{(n)};\theta),$$
where the variance-minimizing baseline is given by
$$b = \frac{\sum_{n=1}^{N} R(h_n)\, \big\| \nabla_\theta \log p(h_n|\theta) \big\|^2}{\sum_{n=1}^{N} \big\| \nabla_\theta \log p(h_n|\theta) \big\|^2}.$$
Let us consider the following Gaussian policy model with policy parameter $\theta = (\mu^\top, \sigma)^\top$:
$$\pi(a|s;\theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{\big( a - \mu^\top \phi(s) \big)^2}{2\sigma^2} \right),$$
where $^\top$ denotes the transpose, $\mu^\top \phi(s)$ is the Gaussian mean, $\sigma$ is the Gaussian standard deviation, and $\phi(s)$ is the basis function vector. Then the policy gradients are explicitly expressed as
$$\nabla_\mu \log \pi(a|s;\theta) = \frac{a - \mu^\top \phi(s)}{\sigma^2}\, \phi(s), \qquad \nabla_\sigma \log \pi(a|s;\theta) = \frac{\big( a - \mu^\top \phi(s) \big)^2 - \sigma^2}{\sigma^3}.$$
REINFORCE is a simple policy-search algorithm that directly updates policies to increase the expected return. However, gradient estimates tend to have large variance even when combined with variance reduction by baseline subtraction. For this reason, policy updates by REINFORCE tend to be unreliable [13]. In particular, the variance of gradient estimates in REINFORCE can be proportional to the length of the history, $T$, due to the stochasticity of policies [25]. This can be a critical limitation when the history is long.
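For concreteness, the REINFORCE gradient estimator with baseline subtraction for the Gaussian policy above can be sketched as follows (a simplified illustration with a scalar action and a user-supplied constant baseline; not an official implementation):

```python
import numpy as np

def reinforce_gradient(histories, phi, mu, sigma, gamma, baseline=0.0):
    """REINFORCE gradient estimate for the Gaussian policy
    pi(a|s) = N(a; mu^T phi(s), sigma^2), with baseline subtraction.
    `histories` is a list of (states, actions, rewards) tuples."""
    grad_mu = np.zeros_like(mu)
    grad_sigma = 0.0
    for states, actions, rewards in histories:
        # return R(h) of this history
        R = sum(gamma ** t * r for t, r in enumerate(rewards))
        for s, a in zip(states, actions):
            f = phi(s)
            diff = a - mu @ f
            # log-policy gradients of the Gaussian policy
            grad_mu += (R - baseline) * diff * f / sigma ** 2
            grad_sigma += (R - baseline) * (diff ** 2 - sigma ** 2) / sigma ** 3
    n = len(histories)
    return grad_mu / n, grad_sigma / n
```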
2.3 Policy Gradients with Parameter-Based Exploration (PGPE)
To overcome the above limitation of REINFORCE, a novel policy-search method called policy gradients with parameter-based exploration (PGPE) was proposed recently [15]. In PGPE, a deterministic policy (such as the linear policy) is adopted, and the stochasticity for exploration is introduced by drawing the policy parameter $\theta$ from a prior distribution $p(\theta|\rho)$ with hyperparameter $\rho$. Thanks to this per-trajectory formulation, the variance of gradient estimates can be drastically reduced.
In the PGPE formulation, the expected return is represented as a function of the hyperparameter $\rho$:
$$J(\rho) = \iint p(h|\theta)\, p(\theta|\rho)\, R(h)\, \mathrm{d}h\, \mathrm{d}\theta.$$
Differentiating this with respect to $\rho$, we have
$$\nabla_\rho J(\rho) = \iint p(h|\theta)\, p(\theta|\rho)\, \nabla_\rho \log p(\theta|\rho)\, R(h)\, \mathrm{d}h\, \mathrm{d}\theta.$$
Because of the per-trajectory formulation, rollout samples in the PGPE framework are accompanied with policy parameters, i.e., $\{(h_n, \theta_n)\}_{n=1}^{N}$. Based on these paired samples, an empirical estimator of the above gradient (with baseline subtraction) is given as follows [25]:
$$\nabla_\rho \widehat{J}(\rho) = \frac{1}{N} \sum_{n=1}^{N} \big( R(h_n) - b \big)\, \nabla_\rho \log p(\theta_n|\rho),$$
where
$$b = \frac{\sum_{n=1}^{N} R(h_n)\, \| \nabla_\rho \log p(\theta_n|\rho) \|^2}{\sum_{n=1}^{N} \| \nabla_\rho \log p(\theta_n|\rho) \|^2}.$$
Let us employ the linear deterministic policy, i.e., action $a$ is chosen as $a = \theta^\top \phi(s)$ for some basis function vector $\phi(s)$. The parameter vector $\theta$ is drawn from the Gaussian prior distribution $p(\theta|\rho)$ with hyperparameter $\rho = (\eta^\top, \tau^\top)^\top$. Here $\eta$ denotes the Gaussian mean vector and $\tau$ denotes the vector consisting of the Gaussian standard deviation in each element:
$$p(\theta|\rho) = \prod_{i} \frac{1}{\sqrt{2\pi}\,\tau_i} \exp\!\left( -\frac{(\theta_i - \eta_i)^2}{2\tau_i^2} \right),$$
where $\theta_i$, $\eta_i$, and $\tau_i$ are the $i$th elements of $\theta$, $\eta$, and $\tau$, respectively. Then the derivatives of $\log p(\theta|\rho)$ with respect to $\eta_i$ and $\tau_i$ are given as follows:
$$\nabla_{\eta_i} \log p(\theta|\rho) = \frac{\theta_i - \eta_i}{\tau_i^2}, \qquad \nabla_{\tau_i} \log p(\theta|\rho) = \frac{(\theta_i - \eta_i)^2 - \tau_i^2}{\tau_i^3}.$$
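The PGPE gradient estimator built from these derivatives can be sketched as follows (a numpy illustration; `eta` and `tau` stand for the Gaussian prior mean and per-element standard deviation, and the baseline is passed in as a constant for simplicity):

```python
import numpy as np

def pgpe_gradient(thetas, returns, eta, tau, baseline):
    """PGPE gradient estimate w.r.t. the prior mean eta and the
    per-element standard deviation tau, with baseline subtraction."""
    thetas = np.asarray(thetas)           # sampled parameters, shape (N, d)
    adv = np.asarray(returns) - baseline  # R(h_n) - b, shape (N,)
    diff = thetas - eta                   # theta_n - eta, shape (N, d)
    # derivatives of log p(theta | rho) averaged over the N samples
    grad_eta = (adv[:, None] * diff / tau ** 2).mean(axis=0)
    grad_tau = (adv[:, None] * (diff ** 2 - tau ** 2) / tau ** 3).mean(axis=0)
    return grad_eta, grad_tau
```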
2.4 Importance-Weighted PGPE (IW-PGPE)
A popular idea to further improve the performance of RL methods is to reuse previously collected samples [21, 6]. Such a sample-reuse strategy is particularly useful when the data sampling cost is high (e.g., robot control).
Importance-weighted PGPE (IW-PGPE) [26] combines the sample-reuse idea with PGPE. Technically, IW-PGPE can be regarded as an off-policy extension of PGPE, where data-collecting policies are different from the current policy. In the PGPE formulation, such an off-policy scenario can be regarded as the situation where the data-collecting policies and the current policy are drawn from prior distributions with different hyperparameters. Let $\rho$ be the hyperparameter for the current policy and $\rho'$ be the hyperparameter for a data-collecting policy. Let us denote data samples collected with hyperparameter $\rho'$ as $\{(h_n, \theta_n)\}_{n=1}^{N}$.
When the data-collecting policy is different from the current policy, importance sampling is a useful technique to correct the estimation bias caused by the differing distributions [18]. More specifically, the gradient is estimated as
$$\nabla_\rho \widehat{J}(\rho) = \frac{1}{N} \sum_{n=1}^{N} w(\theta_n)\, \big( R(h_n) - b \big)\, \nabla_\rho \log p(\theta_n|\rho),$$
where $w(\theta)$ is the importance weight defined as
$$w(\theta) = \frac{p(\theta|\rho)}{p(\theta|\rho')},$$
and $b$ is the baseline given by
$$b = \frac{\sum_{n=1}^{N} R(h_n)\, w(\theta_n)^2\, \| \nabla_\rho \log p(\theta_n|\rho) \|^2}{\sum_{n=1}^{N} w(\theta_n)^2\, \| \nabla_\rho \log p(\theta_n|\rho) \|^2}.$$
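A sketch of the importance-weighted estimator, assuming the Gaussian prior of Section 2.3 so that the weight can be computed in closed form (a simplified illustration with a squared-weight-averaged baseline, not the reference implementation):

```python
import numpy as np

def iw_pgpe_gradient(thetas, returns, eta, tau, eta_data, tau_data):
    """Importance-weighted PGPE gradient: samples were drawn from the
    Gaussian prior (eta_data, tau_data); gradient is w.r.t. (eta, tau)."""
    thetas = np.asarray(thetas)
    # log importance weight of the Gaussian priors, summed over dimensions
    log_w = (np.log(tau_data / tau)
             - (thetas - eta) ** 2 / (2 * tau ** 2)
             + (thetas - eta_data) ** 2 / (2 * tau_data ** 2)).sum(axis=1)
    w = np.exp(log_w)
    R = np.asarray(returns)
    # baseline weighted by squared importance weights (simplified)
    b = (w ** 2 * R).sum() / (w ** 2).sum()
    diff = thetas - eta
    adv = w * (R - b)
    grad_eta = (adv[:, None] * diff / tau ** 2).mean(axis=0)
    grad_tau = (adv[:, None] * (diff ** 2 - tau ** 2) / tau ** 3).mean(axis=0)
    return grad_eta, grad_tau
```

When the data-collecting prior equals the current prior, all weights are one and the estimator reduces to plain PGPE with a baseline.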
Through experiments, the IW-PGPE method was demonstrated to be the best-performing algorithm among model-free RL approaches [26].
The purpose of this paper is to develop a model-based counterpart of PGPE.
3 Model-Based Policy Search
Model-based RL first estimates the transition model and then learns a policy based on the estimated transition model. Because one can draw as many trajectory samples as one wants from the learned transition model without additional sampling costs, the model-based approach can work well if the transition model is accurately estimated [23, 5]. In this section, we extend PGPE to the model-based scenario. We first review an existing model estimation method based on the Gaussian process (GP) [14] and point out its limitations. Then we propose to use the state-of-the-art conditional density estimator called least-squares conditional density estimation (LSCDE) [20] in the model-based PGPE method.
3.1 Model-Based PGPE (M-PGPE)
PGPE can be extended to the model-based scenario as follows.
1. Collect transition samples $\{(s_m, a_m, s'_m)\}_{m=1}^{M}$.
2. Obtain a transition model $\widehat{p}(s'|s,a)$ by a model estimation method from $\{(s_m, a_m, s'_m)\}_{m=1}^{M}$.
3. Initialize the hyperparameter $\rho$.
4. Draw policy parameters $\{\theta_n\}_{n=1}^{N}$ from the prior distribution $p(\theta|\rho)$.
5. Generate many trajectory samples $\{h_n\}_{n=1}^{N}$ from the estimated transition model $\widehat{p}(s'|s,a)$ and the current policies.
6. Estimate the baseline $b$ and the gradient $\nabla_\rho \widehat{J}(\rho)$ from disjoint subsets of $\{(h_n, \theta_n)\}_{n=1}^{N}$.
7. Update the hyperparameter as $\rho \longleftarrow \rho + \varepsilon \nabla_\rho \widehat{J}(\rho)$, where $\varepsilon$ denotes the learning rate.
8. Repeat Steps 4–7 until $\rho$ converges.
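The steps above can be sketched as the following loop (a toy illustration: `sample_model`, `reward_fn`, and `policy` are hypothetical placeholders for the learned transition model, the reward function, and the deterministic policy; a small floor on the standard deviation is added for numerical stability):

```python
import numpy as np

def model_based_pgpe(sample_model, reward_fn, policy, eta, tau,
                     n_samples=100, n_iters=50, horizon=10,
                     gamma=0.99, lr=0.05, seed=0):
    """Sketch of the M-PGPE loop: draw policy parameters from the prior,
    roll out artificial trajectories from the learned transition model,
    and update the prior hyperparameters by gradient ascent."""
    rng = np.random.default_rng(seed)
    eta, tau = np.array(eta, float), np.array(tau, float)
    for _ in range(n_iters):
        thetas = rng.normal(eta, tau, size=(n_samples, len(eta)))
        returns = np.empty(n_samples)
        for n in range(n_samples):
            s, R = 0.0, 0.0
            for t in range(horizon):
                a = policy(thetas[n], s)
                s_next = sample_model(s, a)       # artificial transition
                R += gamma ** t * reward_fn(s, a, s_next)
                s = s_next
            returns[n] = R
        half = n_samples // 2                     # disjoint subsets:
        b = returns[:half].mean()                 # baseline estimate,
        diff = thetas[half:] - eta                # gradient estimate
        adv = (returns[half:] - b)[:, None]
        eta += lr * (adv * diff / tau ** 2).mean(axis=0)
        tau += lr * (adv * (diff ** 2 - tau ** 2) / tau ** 3).mean(axis=0)
        tau = np.maximum(tau, 1e-3)               # keep std positive
    return eta, tau
```

On a toy deterministic "model" where the next state equals the action and the reward penalizes distance to a goal, the prior mean moves toward the goal-reaching parameter.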
Below, we consider the problem of approximating the transition probability $p(s'|s,a)$ from the samples $\{(s_m, a_m, s'_m)\}_{m=1}^{M}$, and review transition model estimation methods.
3.2 Gaussian Process (GP)
Here we review a transition model estimation method based on GP.
In the GP framework, the problem of transition probability estimation is formulated as the regression problem of predicting output $s'$ from input $(s,a)$ under Gaussian noise:
$$s' = f(s,a) + \epsilon,$$
where $f$ is an unknown regression function and $\epsilon$ is independent Gaussian noise with variance $\sigma_{\mathrm{noise}}^2$. Then the GP estimate of the transition probability density for an arbitrary test input $(s,a)$ is given by the Gaussian distribution with mean and variance given by
$$\widehat{\mu}(s,a) = \boldsymbol{k}^\top \big( \boldsymbol{K} + \sigma_{\mathrm{noise}}^2 \boldsymbol{I}_M \big)^{-1} \boldsymbol{s}' \quad \text{and} \quad \widehat{\sigma}^2(s,a) = k\big((s,a),(s,a)\big) - \boldsymbol{k}^\top \big( \boldsymbol{K} + \sigma_{\mathrm{noise}}^2 \boldsymbol{I}_M \big)^{-1} \boldsymbol{k},$$
respectively. Here, $\boldsymbol{I}_M$ denotes the $M$-dimensional identity matrix, $\boldsymbol{s}' = (s'_1, \ldots, s'_M)^\top$, $\boldsymbol{k}$ is the $M$-dimensional vector with $m$th element $k\big((s,a),(s_m,a_m)\big)$, and $\boldsymbol{K}$ is the Gram matrix defined by
$$K_{m,m'} = k\big((s_m,a_m),(s_{m'},a_{m'})\big).$$
$k(\cdot,\cdot)$ denotes the covariance function, which is, e.g., defined by
$$k\big((s,a),(\tilde{s},\tilde{a})\big) = \sigma_f^2 \exp\!\left( -\frac{\|(s,a) - (\tilde{s},\tilde{a})\|^2}{2\ell^2} \right).$$
Here, $\sigma_f$ and $\ell$ are hyperparameters, and together with the noise variance $\sigma_{\mathrm{noise}}^2$, they are determined by evidence maximization [14].
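The GP predictive distribution can be sketched as follows (a minimal numpy illustration of the mean and variance formulas above with a squared-exponential covariance; hyperparameter selection by evidence maximization is omitted):

```python
import numpy as np

def gp_predict(X, y, x_star, kernel, noise_var):
    """GP predictive mean and variance at test input x_star:
    mean = k*^T (K + noise_var I)^{-1} y,
    var  = k(x*, x*) - k*^T (K + noise_var I)^{-1} k*."""
    K = np.array([[kernel(a, b) for b in X] for a in X])   # Gram matrix
    k_star = np.array([kernel(x, x_star) for x in X])
    A = K + noise_var * np.eye(len(X))
    alpha = np.linalg.solve(A, y)
    v = np.linalg.solve(A, k_star)
    mean = k_star @ alpha
    var = kernel(x_star, x_star) - k_star @ v
    return mean, var

def se_kernel(a, b, sigma_f=1.0, ell=1.0):
    """Squared-exponential covariance with hyperparameters (sigma_f, ell)."""
    return sigma_f ** 2 * np.exp(
        -np.sum((np.asarray(a) - np.asarray(b)) ** 2) / (2 * ell ** 2))
```

With zero noise, the predictor interpolates the training data, i.e., the predictive mean at a training input equals its target and the predictive variance vanishes.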
As shown above, the GP-based model estimation method requires the strong assumption that the transition probability density is Gaussian. That is, although GP is nonparametric as a regression method for estimating the conditional mean, it is parametric (Gaussian) as a conditional density estimator. Such a conditional Gaussian assumption is highly restrictive in RL problems.
3.3 Least-Squares Conditional Density Estimation (LSCDE)
To overcome the restriction of the GP-based model estimation method, we propose to use LSCDE.
Let us model the transition probability $p(s'|s,a)$ by the following linear-in-parameter model:
$$q(s'|s,a;\alpha) = \alpha^\top \phi(s,a,s'), \qquad (1)$$
where $\phi(s,a,s')$ is the $M$-dimensional basis function vector and $\alpha$ is the $M$-dimensional parameter vector. As basis functions, we use Gaussian kernels centered at the transition samples:
$$\phi_m(s,a,s') = \exp\!\left( -\frac{\|(s,a) - (s_m,a_m)\|^2 + \|s' - s'_m\|^2}{2\kappa^2} \right),$$
where $\kappa$ denotes the Gaussian width. If $M$ is too large, we may reduce the number of basis functions by using only a subset of the samples as Gaussian centers. We may use different Gaussian widths for $(s,a)$ and $s'$ if necessary.
The parameter $\alpha$ in the model (1) is learned so that the following squared error is minimized:
$$G(\alpha) = \frac{1}{2} \iiint \big( q(s'|s,a;\alpha) - p(s'|s,a) \big)^2\, p(s,a)\, \mathrm{d}s\, \mathrm{d}a\, \mathrm{d}s'.$$
This can be expressed as
$$G(\alpha) = \frac{1}{2} \iiint q(s'|s,a;\alpha)^2\, p(s,a)\, \mathrm{d}s\, \mathrm{d}a\, \mathrm{d}s' - \iiint q(s'|s,a;\alpha)\, p(s,a,s')\, \mathrm{d}s\, \mathrm{d}a\, \mathrm{d}s' + C,$$
where we used $p(s'|s,a)\, p(s,a) = p(s,a,s')$ in the second term and
$$C = \frac{1}{2} \iiint p(s'|s,a)^2\, p(s,a)\, \mathrm{d}s\, \mathrm{d}a\, \mathrm{d}s'.$$
Because $C$ is constant, we only consider the first two terms from here on:
$$G_0(\alpha) = \frac{1}{2} \alpha^\top H \alpha - h^\top \alpha,$$
where
$$H = \iint \Phi(s,a)\, p(s,a)\, \mathrm{d}s\, \mathrm{d}a, \quad \Phi(s,a) = \int \phi(s,a,s')\, \phi(s,a,s')^\top\, \mathrm{d}s', \quad h = \iiint \phi(s,a,s')\, p(s,a,s')\, \mathrm{d}s\, \mathrm{d}a\, \mathrm{d}s'.$$
Note that, for the Gaussian model (1), the $(m,m')$th element of $\Phi(s,a)$ can be computed analytically as
$$\Phi_{m,m'}(s,a) = (\sqrt{\pi}\kappa)^{\dim(s')} \exp\!\left( -\frac{\|(s,a) - (s_m,a_m)\|^2 + \|(s,a) - (s_{m'},a_{m'})\|^2}{2\kappa^2} - \frac{\|s'_m - s'_{m'}\|^2}{4\kappa^2} \right).$$
Because $H$ and $h$ included in $G_0$ contain expectations over the unknown densities $p(s,a)$ and $p(s,a,s')$, they are approximated by sample averages. Then we have
$$\widehat{G}_0(\alpha) = \frac{1}{2} \alpha^\top \widehat{H} \alpha - \widehat{h}^\top \alpha,$$
where
$$\widehat{H} = \frac{1}{M} \sum_{m=1}^{M} \Phi(s_m, a_m) \quad \text{and} \quad \widehat{h} = \frac{1}{M} \sum_{m=1}^{M} \phi(s_m, a_m, s'_m).$$
By adding an $\ell_2$-regularizer to $\widehat{G}_0$ to avoid overfitting, the LSCDE optimization criterion is given as
$$\widehat{\alpha} = \operatorname*{argmin}_{\alpha} \left[ \widehat{G}_0(\alpha) + \frac{\lambda}{2} \|\alpha\|^2 \right],$$
where $\lambda \ge 0$ is the regularization parameter. Taking the derivative of the above objective function and equating it to zero, we can see that the solution can be obtained just by solving the following system of linear equations:
$$\big( \widehat{H} + \lambda I_M \big)\, \alpha = \widehat{h},$$
where $I_M$ denotes the $M$-dimensional identity matrix. Thus, the solution is given analytically as
$$\widehat{\alpha} = \big( \widehat{H} + \lambda I_M \big)^{-1} \widehat{h}.$$
Because conditional probability densities are non-negative by definition, we modify the solution as
$$\widetilde{\alpha} = \max(0_M, \widehat{\alpha}),$$
where $0_M$ denotes the $M$-dimensional zero vector and 'max' for vectors is applied in the element-wise manner.
Finally, we renormalize the solution in the test phase. More specifically, given a test input point $(s,a)$, the final LSCDE solution is given as
$$\widehat{p}(s'|s,a) = \frac{\widetilde{\alpha}^\top \phi(s,a,s')}{\int \widetilde{\alpha}^\top \phi(s,a,\bar{s}')\, \mathrm{d}\bar{s}'}, \qquad (2)$$
where, for the Gaussian model (1), the denominator in Eq.(2) can be computed analytically as
$$\int \widetilde{\alpha}^\top \phi(s,a,\bar{s}')\, \mathrm{d}\bar{s}' = (\sqrt{2\pi}\kappa)^{\dim(s')} \sum_{m=1}^{M} \widetilde{\alpha}_m \exp\!\left( -\frac{\|(s,a) - (s_m,a_m)\|^2}{2\kappa^2} \right).$$
LSCDE was proved to achieve the optimal nonparametric convergence rate to the true conditional density in the minimax sense [20], meaning that no method can outperform this simple LSCDE method asymptotically.
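The whole LSCDE procedure (analytic computation of the matrix and vector of the linear system, solving the regularized system, clipping at zero, and renormalization at test time) can be sketched as follows, with inputs $(s,a)$ stacked row-wise in `X` and outputs $s'$ in `Y` (a numpy illustration with a shared Gaussian width, not the authors' MATLAB implementation):

```python
import numpy as np

def lscde_fit(X, Y, centers_x, centers_y, sigma, lam):
    """Least-squares conditional density estimation with Gaussian basis
    phi_l(x,y) = exp(-||x-u_l||^2/(2 sigma^2) - ||y-v_l||^2/(2 sigma^2)),
    where (u_l, v_l) are paired Gaussian centers."""
    def sq_dists(A, B):
        return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    Phi_x = np.exp(-sq_dists(X, centers_x) / (2 * sigma ** 2))  # (n, b)
    Phi_y = np.exp(-sq_dists(Y, centers_y) / (2 * sigma ** 2))
    d_y = Y.shape[1]
    # analytic integral over y of phi_l * phi_l', times the sample average
    # over x: H[l,l'] = (pi sigma^2)^{d_y/2} exp(-||v_l-v_l'||^2/(4 sigma^2))
    #                   * mean_i Phi_x[i,l] Phi_x[i,l']
    Hy = (np.pi * sigma ** 2) ** (d_y / 2) * np.exp(
        -sq_dists(centers_y, centers_y) / (4 * sigma ** 2))
    H = Hy * (Phi_x.T @ Phi_x) / len(X)
    h = (Phi_x * Phi_y).mean(axis=0)
    alpha = np.linalg.solve(H + lam * np.eye(len(centers_x)), h)
    alpha = np.maximum(alpha, 0.0)        # enforce non-negativity

    def predict(x, y):
        """Renormalized conditional density estimate p_hat(y | x)."""
        px = np.exp(-((x - centers_x) ** 2).sum(axis=1) / (2 * sigma ** 2))
        py = np.exp(-((y - centers_y) ** 2).sum(axis=1) / (2 * sigma ** 2))
        num = alpha @ (px * py)
        den = (2 * np.pi * sigma ** 2) ** (d_y / 2) * (alpha @ px)
        return num / den
    return predict
```

Because the renormalization in Eq.(2) is exact for the Gaussian model, the returned estimate integrates to one over the output space for any fixed test input.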
Model selection of the Gaussian width and the regularization parameter is possible by cross-validation. A MATLAB implementation of LSCDE is available from
‘http://sugiyamawww.cs.titech.ac.jp/~sugi/software/LSCDE/’.
4 Experiments
In this section, we demonstrate the usefulness of the proposed method through experiments.
4.1 Continuous Chain Walk
For illustration purposes, let us first consider a simple continuous chain-walk task (Figure 1).
4.1.1 Setup
Let
That is, the agent receives positive reward at the center of the state space. We set the episode length at , the discount factor at , and the learning rate at . We use the following linear-in-parameter policy model:
where .
As transition dynamics, we consider the following two setups:
- Gaussian: The true transition dynamics is given by
where is the Gaussian noise with mean and standard deviation .
- Bimodal: The true transition dynamics is given by
where is the Gaussian noise with mean and standard deviation , and the sign of is randomly chosen with probability .
We compare the following three policy search methods:
- M-PGPE(LSCDE): The model-based PGPE method with the transition model estimated by LSCDE.
- M-PGPE(GP): The model-based PGPE method with the transition model estimated by GP.
- IW-PGPE: The model-free PGPE method with sample reuse by importance weighting [26]. (We also tested the plain PGPE method without importance weighting, but it did not perform well in our preliminary experiments, so we omit the results.)
Below, we consider the situation where the budget for data collection is limited to episodic samples.
4.1.2 LSCDE Vs. GP
When the transition model is learned in the M-PGPE methods, all samples are gathered randomly in the beginning at once. More specifically, the initial state and the action are chosen from the uniform distributions over and , respectively. Then the next state and the immediate reward are obtained. Then the action is chosen from the uniform distribution over , and the next state and the immediate reward are obtained. This process is repeated until we obtain . This gives a trajectory sample, and we repeat this data generation process times to obtain trajectory samples.
Figure 4 and Figure 7 illustrate the true transition dynamics and its estimates obtained by LSCDE and GP in the Gaussian and bimodal cases. Figure 4 shows that both LSCDE and GP can learn the entire profile of the true transition dynamics well in the Gaussian case. On the other hand, Figure 7 shows that LSCDE can still successfully capture the entire profile of the true transition model well even in the bimodal case, but GP fails to capture the bimodal structure.
Based on the estimated transition models, we learn policies by the M-PGPE method. We generate artificial samples for policy gradient estimation and another artificial samples for baseline estimation from the learned transition model. Then the policy is updated based on these artificial samples. We repeat this policy update step times. For evaluating the return of a learned policy, we use additional test episodic samples which are not used for policy learning. Figure 4 and Figure 7 depict the average performance of learned policies over runs. As expected, the GP-based method performs very well in the Gaussian case, but LSCDE still exhibits reasonably good performance. In the bimodal case, GP performs poorly and LSCDE gives much better policies than GP. This illustrates the high flexibility of LSCDE.
4.1.3 Model-Based Vs. Model-Free
Next, we compare the performance of M-PGPE with the model-free IW-PGPE method.
For the IW-PGPE method, we need to determine the schedule of collecting samples under the fixed-budget scenario. First, we illustrate how the choice of sampling schedule affects the performance of IW-PGPE. Figure 4 and Figure 7 show expected returns averaged over runs under the sampling schedule in which a batch of samples is gathered times, for different values. In our implementation of IW-PGPE, the policy update is performed times after observing each batch of samples, because we empirically observed that this performs better than performing the policy update only once. Figure 4 shows that the performance of IW-PGPE depends heavily on the sampling schedule, and gathering samples at once is shown to be the best choice in the Gaussian case. Figure 7 shows that gathering samples at once is also the best choice in the bimodal case.
Although the best sampling schedule is not accessible in practice, we use this optimal sampling schedule for IW-PGPE. Figure 4 and Figure 7 also include returns of IW-PGPE averaged over runs as functions of the sampling steps. These graphs show that IW-PGPE can improve the policies only in the beginning, because all samples are gathered at once in the beginning. The performance of IW-PGPE may be further improved if it is possible to gather more samples, but this is prohibited under the fixed-budget scenario. On the other hand, the return values of M-PGPE increase constantly throughout iterations, because artificial samples can keep being generated without additional sampling costs. This illustrates a potential advantage of model-based RL methods.
4.2 Humanoid Robot Control
Finally, we evaluate the performance of M-PGPE on a practical control problem with a simulated upper-body model of the humanoid robot CB-i [3] (see Figure 8(a)). We use its simulator for experiments (see Figure 8(b)). The goal of the control problem is to lead the end-effector of the right arm (the right hand) to the target object.
4.2.1 Setup
The simulator is based on the upper body of the CB-i humanoid robot, which has 9 joints: the shoulder pitch, shoulder roll, and elbow pitch of the right arm; the shoulder pitch, shoulder roll, and elbow pitch of the left arm; the waist yaw; the torso roll; and the torso pitch.
At each time step, the controller receives a state vector from the system and sends out an action vector. The state vector is 18-dimensional and real-valued, consisting of the current angle (in degrees) and the current angular velocity of each joint. The action vector is 9-dimensional and real-valued, consisting of the target angle of each joint (in degrees).
We simulate a noisy control system by perturbing action vectors with independent bimodal Gaussian noise. More specifically, for each action element, we add Gaussian noise with mean and standard deviation with probability , and Gaussian noise with mean and standard deviation with probability .
The initial posture of the robot is fixed to standing up straight with arms down. The target object is located in front of and above the right hand, and it is reachable by using the controllable joints. The reward function at each time step is defined as
where is the distance between the right hand and the target object at time step , and is the sum of control costs for each joint. The coefficient is multiplied to keep the values of the two terms in the same order of magnitude. The deterministic policy model used in PGPE is defined as with the basis function . We set the episode length at , the discount factor at , and the learning rate at .
4.2.2 Experiment with 2 Joints
First, we use only 2 of the 9 joints, i.e., we allow only the right shoulder pitch and the right elbow pitch to be controlled, while the other joints remain still at each time step (no control signal is sent to these joints). Therefore, the dimensionality of the state vector and the action vector is and , respectively. Under this simplified setup, we compare the performance of M-PGPE(LSCDE), M-PGPE(GP), and IW-PGPE.
We suppose that the budget for data collection is limited to episodic samples. For the M-PGPE methods, all samples are collected at first using a uniformly random initial state and policy. More specifically, the initial state is chosen from the uniform distribution over . At each time step, the th element of the action vector is chosen from the uniform distribution on . In total, we have 5000 transition samples for model estimation. Then, we generate 1000 artificial samples for policy gradient estimation and another 1000 artificial samples for baseline estimation from the learned transition model, and update the control policy based on these artificial samples. For the IW-PGPE method, we performed preliminary experiments to determine the optimal sampling schedule (Figure 10), showing that collecting samples times yields the highest average return. We use this sampling schedule for performance comparison with the M-PGPE methods.
Returns obtained by each method, averaged over 10 runs, are plotted in Figure 10, showing that M-PGPE(LSCDE) tends to outperform both M-PGPE(GP) and IW-PGPE. Figure 11 illustrates an example of the 2-joint reaching motion obtained by the M-PGPE(LSCDE) policy at the th iteration. This shows that the learned policy successfully leads the right hand to the target object within only steps in this noisy control system.
4.2.3 Experiment with 9 Joints
Finally, we evaluate the performance of the proposed method on the reaching task with all 9 joints, i.e., all joints are allowed to move. In this experiment, we compare the learning performance of M-PGPE(LSCDE) and IW-PGPE. We do not include M-PGPE(GP) because it was outperformed by M-PGPE(LSCDE) in the previous 2-joint experiments, and furthermore the GP-based method requires an enormous amount of computation time.
The experimental setup is essentially the same as in the 2-joint experiments, but we have a budget for gathering samples for this complex and high-dimensional task. The position of the target object is moved far to the left, where it is not reachable by using just 2 joints. Thus, the robot is required to move the other joints as well to reach the object with the right hand. We randomly choose 5000 samples as Gaussian centers for M-PGPE(LSCDE). The sampling schedule for IW-PGPE was set to 1000 samples at once, which is the best sampling schedule according to Figure 12. The returns obtained by M-PGPE(LSCDE) and IW-PGPE averaged over 30 runs are plotted in Figure 13, showing that M-PGPE(LSCDE) tends to outperform the state-of-the-art IW-PGPE method in this challenging robot control task.
Figure 14 shows a typical example of the 9-joint reaching motion obtained by M-PGPE(LSCDE) at the 1000th iteration. The images show that the policy learned by M-PGPE(LSCDE) successfully leads the right hand to the distant object within 14 steps.
Overall, the proposed M-PGPE(LSCDE) method is shown to be promising in the noisy and high-dimensional humanoid robot arm reaching task.
5 Conclusion
We extended the model-free PGPE method to a model-based scenario, and proposed to combine it with the transition model estimator LSCDE. Under a fixed sampling budget, appropriately designing a sampling schedule is critical for the model-free IW-PGPE method, while this is not a problem for the proposed model-based PGPE method. Through experiments, we confirmed that GP-based model estimation is not as flexible as the LSCDE-based method when the transition model is not Gaussian, and the proposed model-based PGPE method based on LSCDE was overall demonstrated to be promising.
Acknowledgments
VT was supported by the JASSO scholarship, TZ was supported by the MEXT scholarship, JM was supported by MEXT KAKENHI 23120004, and MS was supported by the FIRST project.
References
 [1] P. Abbeel, M. Quigley, and A. Y. Ng. Using inaccurate models in reinforcement learning. Proceedings of the 23rd International Conference on Machine Learning, pages 1–8, 2006.
 [2] T. Akiyama, H. Hachiya, and M. Sugiyama. Efficient exploration through active learning for value function approximation in reinforcement learning. Neural Networks, 23(5):639–648, 2010.
 [3] G. Cheng, S. Hyon, J. Morimoto, A. Ude, J. G. Hale, G. Colvin, W. Scroggin, and S. C. Jacobsen. CB: A humanoid research platform for exploring neuroscience. Advanced Robotics, 21(10):1097–1114, 2007.
 [4] P. Dayan and G. E. Hinton. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278, 1997.
 [5] M. P. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. Proceedings of the 28th International Conference on Machine Learning, pages 465–473, 2011.
 [6] H. Hachiya, T. Akiyama, M. Sugiyama, and J. Peters. Adaptive importance sampling for value function approximation in off-policy reinforcement learning. Neural Networks, 22(10):1399–1410, 2009.
 [7] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
 [8] S. Kakade. A natural policy gradient. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1531–1538, Cambridge, MA, 2002. MIT Press.
 [9] T. Kanamori, S. Hido, and M. Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10:1391–1445, Jul. 2009.
 [10] T. Kanamori, T. Suzuki, and M. Sugiyama. Computational complexity of kernel-based density-ratio estimation: A condition number analysis. Machine Learning, 2012. to appear.
 [11] T. Kanamori, T. Suzuki, and M. Sugiyama. Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86(3):335–367, 2012.
 [12] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003.
 [13] J. Peters and S. Schaal. Policy gradient methods for robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2219–2225, 2006.
 [14] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, USA, 2006.
 [15] F. Sehnke, C. Osendorfer, T. Rückstiess, A. Graves, J. Peters, and J. Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551–559, 2010.
 [16] M. Sugiyama, H. Hachiya, H. Kashima, and T. Morimura. Least absolute policy iteration—A robust approach to value function approximation. IEICE Transactions on Information and Systems, E93-D(9):2555–2565, 2010.
 [17] M. Sugiyama, H. Hachiya, C. Towell, and S. Vijayakumar. Geodesic Gaussian kernels for value function approximation. Autonomous Robots, 25(3):287–304, 2008.
 [18] M. Sugiyama and M. Kawanabe. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press, Cambridge, Massachusetts, USA, 2012.
 [19] M. Sugiyama, T. Suzuki, and T. Kanamori. Density ratio matching under the Bregman divergence: A unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5):1009–1044, 2012.
 [20] M. Sugiyama, I. Takeuchi, T. Suzuki, T. Kanamori, H. Hachiya, and D. Okanohara. Least-squares conditional density estimation. IEICE Transactions on Information and Systems, E93-D(3):583–594, 2010.
 [21] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1998.
 [22] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In S.A. Solla, T.K. Leen, and K.R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press, 2000.
 [23] X. Wang and T. G. Dietterich. Model-based policy gradient reinforcement learning. Proceedings of the 20th International Conference on Machine Learning, pages 776–783, 2003.
 [24] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
 [25] T. Zhao, H. Hachiya, G. Niu, and M. Sugiyama. Analysis and improvement of policy gradient estimation. Neural Networks, 26:118–129, 2012.
 [26] T. Zhao, H. Hachiya, V. Tangkaratt, J. Morimoto, and M. Sugiyama. Efficient sample reuse in policy gradients with parameterbased exploration. Neural Computation, 25:1512–1547, 2013.