# Neural Policy Gradient Methods: Global Optimality and Rates of Convergence

###### Abstract

Policy gradient methods with actor-critic schemes demonstrate tremendous empirical successes, especially when the actors and critics are parameterized by neural networks. However, it remains less clear whether such “neural” policy gradient methods converge to globally optimal policies and whether they even converge at all. We answer both the questions affirmatively in the overparameterized regime. In detail, we prove that neural natural policy gradient converges to a globally optimal policy at a sublinear rate. Also, we show that neural vanilla policy gradient converges sublinearly to a stationary point. Meanwhile, by relating the suboptimality of the stationary points to the representation power of neural actor and critic classes, we prove the global optimality of all stationary points under mild regularity conditions. Particularly, we show that a key to the global optimality and convergence is the “compatibility” between the actor and critic, which is ensured by sharing neural architectures and random initializations across the actor and critic. To the best of our knowledge, our analysis establishes the first global optimality and convergence guarantees for neural policy gradient methods.

## 1 Introduction

In reinforcement learning (Sutton and Barto, 2018), an agent aims to maximize its expected total reward by taking a sequence of actions according to a policy in a stochastic environment, which is modeled as a Markov decision process (MDP) (Puterman, 2014). To obtain the optimal policy, policy gradient methods (Williams, 1992; Baxter and Bartlett, 2000; Sutton et al., 2000) directly maximize the expected total reward via gradient-based optimization. As policy gradient methods are easily implementable and readily integrable with advanced optimization techniques such as variance reduction (Johnson and Zhang, 2013; Papini et al., 2018) and distributed optimization (Mnih et al., 2016; Espeholt et al., 2018), they enjoy wide popularity among practitioners. In particular, when the policy (actor) and action-value function (critic) are parameterized by neural networks, policy gradient methods achieve significant empirical successes in challenging applications, such as playing Go (Silver et al., 2016, 2017), real-time strategy gaming (Vinyals et al., 2019), robot manipulation (Peters and Schaal, 2006; Duan et al., 2016), and natural language processing (Wang et al., 2018). See Li (2017) for a detailed survey.

In stark contrast to the tremendous empirical successes, policy gradient methods remain much less well understood in terms of theory, especially when they involve neural networks. More specifically, most existing work analyzes the REINFORCE algorithm (Williams, 1992; Sutton et al., 2000), which estimates the policy gradient via Monte Carlo sampling. Based on the recent progress in nonconvex optimization, Papini et al. (2018); Shen et al. (2019); Xu et al. (2019a); Karimi et al. (2019); Zhang et al. (2019) establish the rate of convergence of REINFORCE to a first- or second-order stationary point. However, the global optimality of the attained stationary point remains unclear. A more commonly used class of policy gradient methods is equipped with the actor-critic scheme (Konda and Tsitsiklis, 2000), which alternates between estimating the action-value function in the policy gradient via a policy evaluation step (critic update) and performing a policy improvement step using the estimated policy gradient (actor update). The global optimality and rate of convergence of this class are even more challenging to analyze than those of REINFORCE. In particular, the policy evaluation step itself may converge to an undesirable stationary point or even diverge (Tsitsiklis and Van Roy, 1997), especially when it involves both a nonlinear action-value function approximator, such as a neural network, and temporal-difference updates (Sutton, 1988). As a result, the estimated policy gradient may be biased, which possibly leads to divergence. Even if the algorithm converges to a stationary point, due to the nonconvexity of the expected total reward with respect to the policy as well as its parameter, the global optimality of such a stationary point remains unclear. The only exception is the linear-quadratic regulator (LQR) setting (Fazel et al., 2018; Malik et al., 2018; Tu and Recht, 2018; Yang et al., 2019b; Bu et al., 2019), which is, however, more restrictive than the general MDP setting that possibly involves neural networks.

To bridge the gap between practice and theory, we analyze neural policy gradient methods equipped with actor-critic schemes, where the actors and critics are represented by overparameterized two-layer neural networks. In detail, we study two settings, where the policy improvement steps are based on vanilla policy gradient and natural policy gradient, respectively. In both settings, the policy evaluation steps are based on the TD(0) algorithm (Sutton, 1988). In the first setting, we prove that neural vanilla policy gradient converges to a stationary point of the expected total reward at a sublinear rate in the expected squared norm of the policy gradient, measured in the number $T$ of policy improvement steps. Meanwhile, through a geometric characterization that relates the suboptimality of the stationary points to the representation power of the neural networks parameterizing the actor and critic, we establish the global optimality of all stationary points under mild regularity conditions. In the second setting, through the lens of Kullback-Leibler (KL) divergence regularization, we prove that neural natural policy gradient converges to a globally optimal policy at a sublinear rate in the expected total reward. In particular, a key to such global optimality and convergence guarantees is a notion of compatibility between the actor and critic, which connects the accuracy of policy evaluation steps with the efficacy of policy improvement steps. We show that such a notion of compatibility is ensured by using shared neural architectures and random initializations for both the actor and critic, which is often used as a practical heuristic (Mnih et al., 2016). To the best of our knowledge, our analysis gives the first global optimality and convergence guarantees for neural policy gradient methods, which corroborate their significant empirical successes.

Related Work. In contrast to the huge body of empirical literature on policy gradient methods, theoretical results on their convergence remain relatively scarce. In particular, Sutton et al. (2000) and Kakade (2002) analyze vanilla policy gradient (REINFORCE) and natural policy gradient with compatible action-value function approximators, respectively, which are further extended by Konda and Tsitsiklis (2000); Peters and Schaal (2008); Castro and Meir (2010) to incorporate actor-critic schemes. Most of this line of work only establishes asymptotic convergence based on stochastic approximation techniques (Kushner and Yin, 2003; Borkar, 2009) and requires the actor and critic to be parameterized by linear functions. Another line of work (Papini et al., 2018; Xu et al., 2019a, b; Shen et al., 2019; Karimi et al., 2019; Zhang et al., 2019) builds on the recent progress in nonconvex optimization to establish the nonasymptotic rates of convergence of REINFORCE (Williams, 1992; Baxter and Bartlett, 2000; Sutton et al., 2000) and its variants, but only to first- or second-order stationary points, and therefore lacks global optimality guarantees. Moreover, when actor-critic schemes are involved, due to the error of policy evaluation steps and its impact on policy improvement steps, the nonasymptotic rates of convergence of policy gradient methods, even to first- or second-order stationary points, remain largely open.

Compared with the convergence of policy gradient methods, their global optimality is even less explored in terms of theory. Fazel et al. (2018); Malik et al. (2018); Tu and Recht (2018); Yang et al. (2019b); Bu et al. (2019) prove that policy gradient methods converge to globally optimal policies in the LQR setting, which is more restrictive. In very recent work, Bhandari and Russo (2019) establish the global optimality of vanilla policy gradient (REINFORCE) in the general MDP setting. However, they require the policy class to be convex, which restricts its applicability to the tabular and LQR settings. In independent work, Agarwal et al. (2019) prove that vanilla policy gradient and natural policy gradient converge to globally optimal policies at sublinear rates in the tabular and linear settings. In the tabular setting, their rate of convergence of vanilla policy gradient depends on the size of the state space. In contrast, we focus on the nonlinear setting with the actor-critic scheme, where the actor and critic are parameterized by neural networks. It is worth mentioning that when such neural networks have linear activation functions, our analysis also covers the linear setting, which is, however, not our focus. In addition, Liu et al. (2019) analyze the proximal policy optimization (PPO) and trust region policy optimization (TRPO) algorithms (Schulman et al., 2015, 2017), where the actors and critics are parameterized by neural networks, and establish their sublinear rates of convergence to globally optimal policies. However, they require solving a subproblem of policy improvement in the functional space using multiple stochastic gradient steps in the parameter space, whereas vanilla policy gradient and natural policy gradient only require a single stochastic (natural) gradient step in the parameter space, which makes the analysis even more challenging.

There is also an emerging body of literature that analyzes the training and generalization error of deep supervised learning with overparameterized neural networks (Daniely, 2017; Jacot et al., 2018; Wu et al., 2018; Allen-Zhu et al., 2018a, b; Du et al., 2018a, b; Zou et al., 2018; Chizat and Bach, 2018; Li and Liang, 2018; Cao and Gu, 2019a, b; Arora et al., 2019; Lee et al., 2019), especially when they are trained using stochastic gradient. See Fan et al. (2019) for a detailed survey. In comparison, our focus is on deep reinforcement learning with policy gradient methods. In particular, the policy evaluation steps are based on the TD(0) algorithm, which uses stochastic semigradient (Sutton, 1988) rather than stochastic gradient. Moreover, the interplay between the actor and critic makes our analysis even more challenging than that of deep supervised learning.

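Notation. For distribution $\mu$ on $\Omega$ and $p > 0$, we define $\|f(\cdot)\|_{\mu, p} = (\int_\Omega |f|^p \,\mathrm{d}\mu)^{1/p}$ as the $L_p(\mu)$-norm of $f$. We define $\|f(\cdot)\|_{\mu, \infty} = \inf\{C \geq 0 : |f(x)| \leq C \text{ for } \mu\text{-almost every } x\}$ as the $L_\infty(\mu)$-norm of $f$. We write $\|f\|_{\mu, p}$ for notational simplicity when the variable of $f$ is clear from the context. We further denote by $\|\cdot\|_\mu$ the $L_2(\mu)$-norm for notational simplicity. For a vector $\phi \in \mathbb{R}^n$ and $p > 0$, we denote by $\|\phi\|_p$ the $\ell_p$-norm of $\phi$. We denote by $x = ([x]_1^\top, \ldots, [x]_m^\top)^\top$ a vector in $\mathbb{R}^{md}$, where $[x]_i \in \mathbb{R}^d$ is the $i$-th block of $x$ for $i \in [m]$.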

## 2 Background

In this section, we introduce the background of reinforcement learning and policy gradient methods.

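Reinforcement Learning. A discounted Markov decision process (MDP) is defined by tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \zeta, r, \gamma)$. Here $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, respectively. Meanwhile, $\mathcal{P}$ is the Markov transition kernel and $r$ is the reward function, which is possibly stochastic. Specifically, when taking action $a \in \mathcal{A}$ at state $s \in \mathcal{S}$, the agent receives reward $r(s, a)$ and the environment transits into a new state according to transition probability $\mathcal{P}(\cdot \mid s, a)$. Meanwhile, $\zeta$ is the distribution of initial state $S_0 \in \mathcal{S}$ and $\gamma \in (0, 1)$ is the discount factor. In addition, policy $\pi(a \mid s)$ gives the probability of taking action $a$ at state $s$. We denote the state- and action-value functions associated with $\pi$ by $V^\pi \colon \mathcal{S} \to \mathbb{R}$ and $Q^\pi \colon \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, which are defined respectively as

$$V^\pi(s) = (1 - \gamma) \cdot \mathbb{E}\Bigl[\sum_{t=0}^\infty \gamma^t \cdot r(S_t, A_t) \,\Big|\, S_0 = s\Bigr], \quad \forall s \in \mathcal{S}, \tag{1}$$

$$Q^\pi(s, a) = (1 - \gamma) \cdot \mathbb{E}\Bigl[\sum_{t=0}^\infty \gamma^t \cdot r(S_t, A_t) \,\Big|\, S_0 = s, A_0 = a\Bigr], \quad \forall (s, a) \in \mathcal{S} \times \mathcal{A}, \tag{2}$$

where $A_t \sim \pi(\cdot \mid S_t)$, and $S_{t+1} \sim \mathcal{P}(\cdot \mid S_t, A_t)$ for all $t \geq 0$. Also, we define the advantage function of policy $\pi$ as the difference between $Q^\pi$ and $V^\pi$, i.e., $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. By the definitions in (1) and (2), $V^\pi$ and $Q^\pi$ are related via

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\bigl[Q^\pi(s, a)\bigr] = \bigl\langle Q^\pi(s, \cdot), \pi(\cdot \mid s) \bigr\rangle,$$

where $\langle \cdot, \cdot \rangle$ is the inner product in $\mathbb{R}^{|\mathcal{A}|}$. Here we write $\mathbb{E}_{a \sim \pi(\cdot \mid s)}[Q^\pi(s, a)]$ as $\mathbb{E}_\pi[Q^\pi(s, a)]$ for notational simplicity. Note that policy $\pi$ together with the transition kernel $\mathcal{P}$ induces a Markov chain over state space $\mathcal{S}$. We denote by $\varrho_\pi$ the stationary state distribution of the Markov chain induced by $\pi$. We further define $\varsigma_\pi(s, a) = \pi(a \mid s) \cdot \varrho_\pi(s)$ as the stationary state-action distribution for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. Meanwhile, policy $\pi$ induces a state visitation measure over $\mathcal{S}$ and a state-action visitation measure over $\mathcal{S} \times \mathcal{A}$, which are denoted by $\nu_\pi$ and $\sigma_\pi$, respectively. Specifically, for all $(s, a) \in \mathcal{S} \times \mathcal{A}$, we define

$$\nu_\pi(s) = (1 - \gamma) \cdot \sum_{t=0}^\infty \gamma^t \cdot \mathbb{P}(S_t = s), \qquad \sigma_\pi(s, a) = (1 - \gamma) \cdot \sum_{t=0}^\infty \gamma^t \cdot \mathbb{P}(S_t = s, A_t = a), \tag{3}$$

where $S_0 \sim \zeta(\cdot)$, $A_t \sim \pi(\cdot \mid S_t)$, and $S_{t+1} \sim \mathcal{P}(\cdot \mid S_t, A_t)$ for all $t \geq 0$. By definition, we have $\sigma_\pi(s, a) = \pi(a \mid s) \cdot \nu_\pi(s)$. We define the expected total reward function $J(\pi)$ by

$$J(\pi) = (1 - \gamma) \cdot \mathbb{E}\Bigl[\sum_{t=0}^\infty \gamma^t \cdot r(S_t, A_t)\Bigr] = \mathbb{E}_\zeta\bigl[V^\pi(s)\bigr] = \mathbb{E}_{\sigma_\pi}\bigl[r(s, a)\bigr], \tag{4}$$

where we write $\mathbb{E}_{\sigma_\pi}[\cdot] = \mathbb{E}_{(s, a) \sim \sigma_\pi}[\cdot]$ for notational simplicity. The goal of reinforcement learning is to find the optimal policy that maximizes $J(\pi)$, which is denoted by $\pi^*$. When the state space $\mathcal{S}$ is large, a popular approach is to find the maximizer of $J(\pi)$ over a class of parameterized policies $\{\pi_\theta : \theta \in \mathcal{B}\}$, where $\theta \in \mathcal{B}$ is the parameter and $\mathcal{B}$ is the parameter space. In this case, we obtain the optimization problem $\max_{\theta \in \mathcal{B}} J(\pi_\theta)$.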
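As a concrete check of the definitions in this paragraph, the following minimal sketch evaluates a toy tabular MDP. It assumes the $(1-\gamma)$-normalized convention for the value functions and the expected total reward; the MDP, the policy, and all names (`evaluate`, `visitation`, `PI`, `ZETA`) are hypothetical and chosen only for illustration.

```python
# Toy 2-state, 2-action MDP; every number below is made up for illustration.
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)
# P[s][a][s2]: transition probabilities; R[s][a]: deterministic rewards.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0], [0.0, 2.0]]
ZETA = [0.6, 0.4]                # initial state distribution
PI = [[0.7, 0.3], [0.4, 0.6]]    # PI[s][a] = pi(a | s)


def evaluate(n_iter=2000):
    """Return (V, Q) under the assumed (1 - gamma)-normalized convention."""
    V = [0.0, 0.0]
    for _ in range(n_iter):
        Q = [[(1 - GAMMA) * R[s][a]
              + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in STATES)
              for a in ACTIONS] for s in STATES]
        # V(s) is the inner product of Q(s, .) and pi(. | s).
        V = [sum(PI[s][a] * Q[s][a] for a in ACTIONS) for s in STATES]
    return V, Q


def visitation(n_steps=2000):
    """sigma[s][a] = (1 - gamma) * sum_t gamma^t * P(S_t = s, A_t = a)."""
    dist = ZETA[:]               # distribution of S_t
    sigma = [[0.0, 0.0], [0.0, 0.0]]
    weight = 1 - GAMMA
    for _ in range(n_steps):
        for s in STATES:
            for a in ACTIONS:
                sigma[s][a] += weight * dist[s] * PI[s][a]
        dist = [sum(dist[s] * PI[s][a] * P[s][a][s2]
                    for s in STATES for a in ACTIONS) for s2 in STATES]
        weight *= GAMMA
    return sigma


V, Q = evaluate()
sigma = visitation()
# Two ways of writing the expected total reward J(pi).
J_from_V = sum(ZETA[s] * V[s] for s in STATES)
J_from_sigma = sum(sigma[s][a] * R[s][a] for s in STATES for a in ACTIONS)
```

Under these assumptions the two expressions for the expected total reward agree up to numerical error, and the visitation measure sums to one.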

Policy Gradient Methods. Policy gradient methods maximize $J(\pi_\theta)$ using $\nabla_\theta J(\pi_\theta)$. These methods are based on the policy gradient theorem (Sutton and Barto, 2018), which states that

(5) |

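where $\sigma_{\pi_\theta}$ is the state-action visitation measure defined in (3). Based on (5), (vanilla) policy gradient maximizes the expected total reward via gradient ascent. Specifically, we generate a sequence of policy parameters $\{\theta_i\}_{i \geq 1}$ via

$$\theta_{i+1} \leftarrow \theta_i + \eta \cdot \nabla_\theta J(\pi_{\theta_i}), \tag{6}$$

where $\eta > 0$ is the learning rate. Meanwhile, natural policy gradient (Kakade, 2002) utilizes natural gradient ascent (Amari, 1998), which is invariant to the parameterization of policies. Specifically, let $F(\theta)$ be the Fisher information matrix corresponding to policy $\pi_\theta$, which is given by

$$F(\theta) = \mathbb{E}_{\sigma_{\pi_\theta}}\Bigl[\nabla_\theta \log \pi_\theta(a \mid s) \bigl(\nabla_\theta \log \pi_\theta(a \mid s)\bigr)^\top\Bigr]. \tag{7}$$

At each iteration, natural policy gradient performs

$$\theta_{i+1} \leftarrow \theta_i + \eta \cdot \bigl(F(\theta_i)\bigr)^{-1} \cdot \nabla_\theta J(\pi_{\theta_i}), \tag{8}$$

where $(F(\theta_i))^{-1}$ is the inverse of $F(\theta_i)$ and $\eta$ is the learning rate. In practice, both $Q^{\pi_\theta}$ in (5) and $F(\theta)$ in (7) remain to be estimated, which yields approximations of the policy improvement steps in (6) and (8).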
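To make the difference between the two updates concrete, here is a minimal sketch on a single-state, two-action problem, where the expected total reward is simply $\mathbb{E}_\pi[r(a)]$ and both the gradient and the Fisher information are scalars. The sigmoid parameterization, the reward values, and the function names are assumptions made for this illustration only.

```python
import math

R0, R1 = 0.0, 1.0   # made-up rewards of actions 0 and 1


def p1(theta):
    """Probability of action 1 under a one-parameter sigmoid policy."""
    return 1.0 / (1.0 + math.exp(-theta))


def J(theta):
    """Expected reward of the policy."""
    p = p1(theta)
    return (1 - p) * R0 + p * R1


def policy_gradient(theta):
    """E_pi[r(a) * d/dtheta log pi(a)], using dlog pi(1) = 1 - p and dlog pi(0) = -p."""
    p = p1(theta)
    return p * R1 * (1 - p) + (1 - p) * R0 * (-p)


def fisher(theta):
    """E_pi[(d/dtheta log pi(a))^2], which simplifies to p * (1 - p)."""
    p = p1(theta)
    return p * (1 - p) ** 2 + (1 - p) * p ** 2


def vanilla_step(theta, eta):
    """Gradient ascent step."""
    return theta + eta * policy_gradient(theta)


def natural_step(theta, eta):
    """Natural gradient ascent step: precondition by the inverse Fisher information."""
    return theta + eta * policy_gradient(theta) / fisher(theta)
```

In this toy case the Fisher factor $p(1-p)$ cancels, so the natural step moves $\theta$ by $\eta \cdot (R_1 - R_0)$ regardless of how saturated the policy is, whereas the vanilla step shrinks as the policy approaches determinism.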

## 3 Neural Policy Gradient Methods

In this section, we represent $\pi_\theta$ by a two-layer neural network and study neural policy gradient methods, which estimate the policy gradient and natural policy gradient using the actor-critic scheme (Konda and Tsitsiklis, 2000).

### 3.1 Overparameterized Neural Policy

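We now introduce the parameterization of policies. For notational simplicity, we assume that $\mathcal{S} \times \mathcal{A} \subseteq \mathbb{R}^d$ with $d \geq 2$. Without loss of generality, we further assume that $\|(s, a)\|_2 = 1$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. A two-layer neural network $f((s, a); W, b)$ with input $(s, a)$ and width $m$ takes the form of

$$f\bigl((s, a); W, b\bigr) = \frac{1}{\sqrt{m}} \sum_{r=1}^m b_r \cdot \mathrm{ReLU}\bigl((s, a)^\top [W]_r\bigr), \quad \forall (s, a) \in \mathcal{S} \times \mathcal{A}. \tag{9}$$

Here $\mathrm{ReLU} \colon \mathbb{R} \to \mathbb{R}$ is the rectified linear unit (ReLU) activation function, which is defined as $\mathrm{ReLU}(u) = \mathbb{1}\{u > 0\} \cdot u$. Also, $\{b_r\}_{r \in [m]}$ and $W = ([W]_1^\top, \ldots, [W]_m^\top)^\top \in \mathbb{R}^{md}$ in (9) are the parameters. When training the two-layer neural network, we initialize the parameters via $[W_{\text{init}}]_r \sim N(0, I_d / d)$ and $b_r \sim \mathrm{Unif}(\{-1, 1\})$ for all $r \in [m]$. Note that the ReLU activation function satisfies $\mathrm{ReLU}(c \cdot u) = c \cdot \mathrm{ReLU}(u)$ for all $c > 0$ and $u \in \mathbb{R}$. Hence, without loss of generality, we keep $b_r$ fixed at the initial parameter throughout training and only update $W$ in the sequel. For notational simplicity, we write $f((s, a); W, b)$ as $f((s, a); W)$ hereafter.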
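The following minimal sketch illustrates a two-layer ReLU network of this kind and the positive-homogeneity argument for freezing the output weights. It assumes the common form $f(x; W, b) = m^{-1/2} \sum_r b_r \cdot \mathrm{ReLU}(x^\top [W]_r)$ with Gaussian first-layer weights of variance $1/d$ and random signs for $b_r$; the helper names and the dimensions are hypothetical.

```python
import math
import random


def init_params(m, d, rng):
    """Return (W, b): m blocks in R^d with N(0, 1/d) entries, and signs b_r in {-1, +1}."""
    W = [[rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)] for _ in range(m)]
    b = [rng.choice((-1.0, 1.0)) for _ in range(m)]
    return W, b


def relu(u):
    return u if u > 0 else 0.0


def two_layer(x, W, b):
    """f(x; W, b) = m^{-1/2} * sum_r b_r * ReLU(x^T [W]_r)."""
    total = 0.0
    for w_r, b_r in zip(W, b):
        total += b_r * relu(sum(xi * wi for xi, wi in zip(x, w_r)))
    return total / math.sqrt(len(W))


rng = random.Random(0)
d, m = 4, 64
W, b = init_params(m, d, rng)
x = [0.5, 0.5, 0.5, 0.5]          # unit-norm input standing in for (s, a)

# Positive homogeneity: dividing b_r by c > 0 is undone by multiplying [W]_r by c,
# so a change in the magnitude of b can always be absorbed into W.
c = 3.0
W_scaled = [[c * w for w in w_r] for w_r in W]
b_scaled = [b_r / c for b_r in b]
```

Evaluating `two_layer(x, W, b)` and `two_layer(x, W_scaled, b_scaled)` gives the same output up to rounding, which is why only the first-layer weights need to be trained in this sketch.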

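Using the two-layer neural network in (9), we define

$$\pi_\theta(a \mid s) = \frac{\exp\bigl[\tau \cdot f\bigl((s, a); \theta\bigr)\bigr]}{\sum_{a' \in \mathcal{A}} \exp\bigl[\tau \cdot f\bigl((s, a'); \theta\bigr)\bigr]}, \quad \forall (s, a) \in \mathcal{S} \times \mathcal{A}, \tag{10}$$

where $f(\cdot\,; \theta)$ is defined in (9) with $\theta \in \mathbb{R}^{md}$ playing the role of $W$. Note that $\pi_\theta$ defined in (10) takes the form of an energy-based policy (Haarnoja et al., 2017), where $\tau$ is the temperature parameter and $f(\cdot\,; \theta)$ is the energy function.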
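A minimal sketch of an energy-based policy follows. It assumes the softmax form $\pi(a \mid s) \propto \exp[\tau \cdot f(s, a)]$, with $\tau$ multiplying the energy; the function name `energy_policy` and the energy values are made up for illustration.

```python
import math


def energy_policy(energies, tau):
    """Return pi(. | s), proportional to exp(tau * f(s, a)), given f(s, a) for every action."""
    z = [tau * e for e in energies]
    z_max = max(z)                        # subtract the max for numerical stability
    w = [math.exp(v - z_max) for v in z]
    total = sum(w)
    return [v / total for v in w]


energies = [0.2, -0.1, 0.5]               # made-up values of the energy for three actions
uniform = energy_policy(energies, tau=0.0)   # tau = 0 ignores the energy entirely
sharp = energy_policy(energies, tau=50.0)    # large tau concentrates on the argmax
```

Under this assumed form, $\tau = 0$ gives the uniform policy and increasing $\tau$ moves the policy toward the greedy action with respect to the energy function.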

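In the sequel, we investigate policy gradient methods for the class of neural policies defined in (10). We define the feature mapping $\phi_\theta = ([\phi_\theta]_1^\top, \ldots, [\phi_\theta]_m^\top)^\top \colon \mathbb{R}^d \to \mathbb{R}^{md}$ of a two-layer neural network $f(\cdot\,; \theta)$ as

$$[\phi_\theta]_r(s, a) = \frac{b_r}{\sqrt{m}} \cdot \mathbb{1}\bigl\{(s, a)^\top [\theta]_r > 0\bigr\} \cdot (s, a), \quad \forall (s, a) \in \mathcal{S} \times \mathcal{A}, \ \forall r \in [m]. \tag{11}$$

By (9), it holds that $f((s, a); \theta) = \phi_\theta(s, a)^\top \theta$. Meanwhile, $f((s, a); \theta)$ is almost everywhere differentiable with respect to $\theta$, and it holds that $\nabla_\theta f((s, a); \theta) = \phi_\theta(s, a)$. In the following proposition, we calculate the closed forms of the policy gradient and the Fisher information matrix for $\pi_\theta$ defined in (10).

Proposition (Policy Gradient and Fisher Information Matrix). For $\pi_\theta$ defined in (10), we have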

(12) | |||

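$$F(\theta) = \tau^2 \cdot \mathbb{E}_{\sigma_{\pi_\theta}}\Bigl[\bigl(\phi_\theta(s, a) - \mathbb{E}_{\pi_\theta}[\phi_\theta(s, a')]\bigr) \bigl(\phi_\theta(s, a) - \mathbb{E}_{\pi_\theta}[\phi_\theta(s, a')]\bigr)^\top\Bigr], \tag{13}$$

where $\phi_\theta$ is the feature mapping defined in (11), $\tau$ is the temperature parameter, and $\sigma_{\pi_\theta}$ is the state-action visitation measure defined in (3). Here we write $\mathbb{E}_{\pi_\theta}[\phi_\theta(s, a')] = \mathbb{E}_{a' \sim \pi_\theta(\cdot \mid s)}[\phi_\theta(s, a')]$ for notational simplicity.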

###### Proof.

See §D.1 for a detailed proof. ∎
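The closed forms in the proposition rest on the score function of the energy-based policy being a centered feature, $\nabla_\theta \log \pi_\theta(a \mid s) = \tau \cdot (\phi_\theta(s, a) - \mathbb{E}_{\pi_\theta}[\phi_\theta(s, a')])$. The sketch below checks this identity numerically by central finite differences. It assumes the softmax policy and the two-layer ReLU feature mapping in the forms commonly used for such networks; the dimensions, the action encodings, and all helper names are hypothetical.

```python
import math
import random


def features(x, theta, b):
    """phi_theta(x): block r equals (b_r / sqrt(m)) * 1{x^T theta_r > 0} * x."""
    m = len(theta)
    phi = []
    for th_r, b_r in zip(theta, b):
        active = sum(xi * ti for xi, ti in zip(x, th_r)) > 0
        scale = b_r / math.sqrt(m) if active else 0.0
        phi.append([scale * xi for xi in x])
    return phi


def f(x, theta, b):
    """Network output written as the inner product phi_theta(x)^T theta."""
    phi = features(x, theta, b)
    return sum(p * t for p_r, t_r in zip(phi, theta) for p, t in zip(p_r, t_r))


def log_pi(a, xs, theta, b, tau):
    """log pi(a | s) for the softmax policy over the encodings xs of (s, a')."""
    z = [tau * f(x, theta, b) for x in xs]
    z_max = max(z)
    return z[a] - z_max - math.log(sum(math.exp(v - z_max) for v in z))


def score(a, xs, theta, b, tau):
    """Closed form: tau * (phi(s, a) - E_{a' ~ pi}[phi(s, a')])."""
    probs = [math.exp(log_pi(k, xs, theta, b, tau)) for k in range(len(xs))]
    phis = [features(x, theta, b) for x in xs]
    m, d = len(theta), len(theta[0])
    return [[tau * (phis[a][r][j] - sum(p * ph[r][j] for p, ph in zip(probs, phis)))
             for j in range(d)] for r in range(m)]


rng = random.Random(1)
d, m, tau = 3, 8, 2.0
theta = [[rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)] for _ in range(m)]
b = [rng.choice((-1.0, 1.0)) for _ in range(m)]
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.0, 0.8]]   # encodings of 3 actions


def max_fd_error(a=2, eps=1e-6):
    """Largest gap between the closed-form score and central finite differences."""
    g = score(a, xs, theta, b, tau)
    worst = 0.0
    for r in range(m):
        for j in range(d):
            theta[r][j] += eps
            up = log_pi(a, xs, theta, b, tau)
            theta[r][j] -= 2 * eps
            down = log_pi(a, xs, theta, b, tau)
            theta[r][j] += eps
            worst = max(worst, abs((up - down) / (2 * eps) - g[r][j]))
    return worst
```

The finite-difference check is valid away from the ReLU kinks, which is where the network is differentiable; a randomly drawn `theta` lands on a kink with probability zero.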

Since the action-value function $Q^{\pi_\theta}$ in (12) is unknown, to obtain the policy gradient, we use another two-layer neural network to track the action-value function of policy $\pi_\theta$. Specifically, we use a two-layer neural network $Q_\omega(\cdot) = f(\cdot\,; \omega)$ defined in (9) to represent the action-value function $Q^{\pi_\theta}$, where $\omega$ plays the same role as $W$ in (9). Such an approach is known as the actor-critic scheme (Konda and Tsitsiklis, 2000). We call $\pi_\theta$ and $Q_\omega$ the actor and critic, respectively.

Shared Initialization and Compatible Function Approximation. Sutton et al. (2000) introduce the notion of compatible function approximations. Specifically, the action-value function is compatible with if we have for all , where is the advantage function corresponding to . Compatible function approximations enable us to construct unbiased estimators of the policy gradient, which are essential for the optimality and convergence of policy gradient methods (Konda and Tsitsiklis, 2000; Sutton et al., 2000; Kakade, 2002; Peters and Schaal, 2008; Wagner, 2011, 2013).

To approximately obtain compatible function approximations when both the actor and critic are represented by neural networks, we use a shared architecture between the action-value function and the energy function of , and initialize and with the same parameter , where for all . We show that in the overparameterized regime where is large, the shared architecture and random initialization ensure to be approximately compatible with in the following sense. We define as the centered feature mapping corresponding to the initialization, which takes the form of

(14) | ||||

where is the initialization shared by both the actor and critic, and we omit the dependency on for notational simplicity. Similarly, we define for all the following centered feature mappings,

(15) |

Here and are the feature mappings defined in (11), which correspond to and , respectively. By (9), we have

(16) |

which holds almost everywhere for . As shown in Corollary A in §A, when the width is sufficiently large, in policy gradient methods, both and are well approximated by defined in (14). Therefore, by (16), we conclude that in the overparameterized regime with shared architecture and random initialization, is approximately compatible with .

### 3.2 Neural Policy Gradient Methods

Now we present neural policy gradient and neural natural policy gradient. Following the actor-critic scheme, they generate a sequence of policies and action-value functions .

#### 3.2.1 Actor Update

As introduced in §2, we aim to solve the optimization problem iteratively via gradient-based methods, where is the parameter space. We set , where and is the initial parameter defined in §3.1. For all , let be the policy parameter at the -th iteration. For notational simplicity, in the sequel, we denote by and the state-action visitation measure and the stationary state-action distribution , respectively, which are defined in §2. Similarly, we write and . To update , we set

(17) |

where we define as the projection operator onto the parameter space . Here is a matrix specific to each algorithm. Specifically, we have for policy gradient and for natural policy gradient, where is the Fisher information matrix in (13). Meanwhile, is the learning rate and is an estimator of , which takes the form of

(18) |

Here is the temperature parameter of , is sampled from the state-action visitation measure corresponding to the current policy , and is the batch size. Also, is the critic obtained by Algorithm 2.

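Sampling From Visitation Measure. Recall that the policy gradient in (12) involves an expectation taken over the state-action visitation measure $\sigma_{\pi_\theta}$. Thus, to obtain an unbiased estimator of the policy gradient, we need to sample from the visitation measure $\sigma_{\pi_\theta}$. To achieve such a goal, we introduce an artificial MDP $(\mathcal{S}, \mathcal{A}, \widetilde{\mathcal{P}}, \zeta, r, \gamma)$. Such an MDP only differs from the original MDP in the Markov transition kernel $\widetilde{\mathcal{P}}$, which is defined as

$$\widetilde{\mathcal{P}}(s' \mid s, a) = \gamma \cdot \mathcal{P}(s' \mid s, a) + (1 - \gamma) \cdot \zeta(s'), \quad \forall (s, a, s') \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}.$$

Here $\mathcal{P}$ is the Markov transition kernel of the original MDP. That is, at each state transition of the artificial MDP, the next state is sampled from the initial state distribution $\zeta$ with probability $1 - \gamma$. In other words, at each state transition, we restart the original MDP with probability $1 - \gamma$. As shown in Konda (2002), the stationary state distribution of the induced Markov chain is exactly the state visitation measure $\nu_{\pi_\theta}$. Therefore, when we sample a trajectory $\{(S_t, A_t)\}_{t \geq 0}$, where $S_0 \sim \zeta(\cdot)$, $A_t \sim \pi_\theta(\cdot \mid S_t)$, and $S_{t+1} \sim \widetilde{\mathcal{P}}(\cdot \mid S_t, A_t)$ for all $t \geq 0$, the marginal distribution of $(S_t, A_t)$ converges to the state-action visitation measure $\sigma_{\pi_\theta}$.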
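The restart construction can be checked numerically on a small chain. The sketch below assumes that the restart kernel mixes the original transition with a restart from the initial distribution with probabilities $\gamma$ and $1 - \gamma$, and that the policy has already been averaged into a state-to-state matrix; the chain `K`, the distribution `ZETA`, and the function names are made up for illustration.

```python
GAMMA = 0.9
# Made-up 3-state chain: K[s][s2] is the state transition matrix under a fixed policy.
K = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.2, 0.0, 0.8]]
ZETA = [1.0, 0.0, 0.0]    # initial state distribution
N = len(ZETA)


def step(dist, kernel):
    """One step of a Markov chain: returns the row vector dist times kernel."""
    return [sum(dist[s] * kernel[s][s2] for s in range(N)) for s2 in range(N)]


def visitation_measure(n_terms=2000):
    """nu(s) = (1 - gamma) * sum_t gamma^t * P(S_t = s) under the original chain."""
    dist, nu, weight = ZETA[:], [0.0] * N, 1 - GAMMA
    for _ in range(n_terms):
        nu = [nu[s] + weight * dist[s] for s in range(N)]
        dist = step(dist, K)
        weight *= GAMMA
    return nu


def restart_stationary(n_iter=2000):
    """Stationary distribution of the restart kernel gamma * K + (1 - gamma) * zeta."""
    K_tilde = [[GAMMA * K[s][s2] + (1 - GAMMA) * ZETA[s2] for s2 in range(N)]
               for s in range(N)]
    dist = [1.0 / N] * N
    for _ in range(n_iter):
        dist = step(dist, K_tilde)
    return dist


nu = visitation_measure()
stationary = restart_stationary()
```

The two vectors coincide up to numerical error, since a stationary distribution $\mu$ of the restart kernel satisfies $\mu = \gamma \mu K + (1 - \gamma)\zeta$, whose solution is the discounted sum defining the visitation measure.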

Inverting Fisher Information Matrix. Recall that is the inverse of the Fisher information matrix used in natural policy gradient. In the overparameterized regime, inverting an estimator of can be infeasible as is a high-dimensional matrix, which is possibly not invertible. To resolve this issue, we estimate the natural policy gradient by solving

(19) |

where is defined in (18), is the temperature parameter in , and is the parameter space. Meanwhile, is an unbiased estimator of based on sampled from , which is defined as

(20) |

where is defined in (11) with . The actor update of neural natural policy gradient takes the form of

(21) |

where we use an arbitrary minimizer of (19) if it is not unique. Note that we also update the temperature parameter by , which ensures . It is worth mentioning that up to minor modifications, our analysis allows for approximately solving (19), which is the common practice of approximate second-order optimization (Martens and Grosse, 2015; Wu et al., 2017).

#### 3.2.2 Critic Update

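To evaluate the policy gradient estimator in (18), it remains to obtain the critic $Q_{\omega_i}$ that appears there. For any policy $\pi$, the action-value function $Q^\pi$ is the unique solution to the Bellman equation $Q = \mathcal{T}^\pi Q$ (Sutton and Barto, 2018). Here $\mathcal{T}^\pi$ is the Bellman operator that takes the form of

$$\mathcal{T}^\pi Q(s, a) = \mathbb{E}\bigl[(1 - \gamma) \cdot r(s, a) + \gamma \cdot Q(s', a')\bigr], \quad \forall (s, a) \in \mathcal{S} \times \mathcal{A},$$

where $s' \sim \mathcal{P}(\cdot \mid s, a)$ and $a' \sim \pi(\cdot \mid s')$. Correspondingly, we aim to solve the following optimization problem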

(22) |

where and are the stationary state-action distribution and the Bellman operator associated with , respectively, and is the parameter space. We adopt neural temporal-difference learning (TD) studied in Cai et al. (2019), which solves the optimization problem in (22) via stochastic semigradient descent (Sutton, 1988). Specifically, an iteration of neural TD takes the form of

(23) | |||

(24) |

where , , , and is the learning rate of neural TD. Here (23) is the stochastic semigradient step, and (24) projects the parameter obtained by (23) back to the parameter space . Meanwhile, the state-action pairs in (23) are sampled from the stationary state-action distribution , which is achieved by sampling from the Markov chain induced by until it mixes. See Algorithm 2 in §B for details. Finally, combining the actor updates and the critic update described in (17), (21), and (22), respectively, we obtain neural policy gradient and natural policy gradient, which are described in Algorithm 1.
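A single neural TD iteration, as described above, can be sketched as a semigradient step followed by a projection. The code assumes a two-layer ReLU critic, a parameter space given by a Euclidean ball around the initialization, and a $(1-\gamma)$-weighted reward in the TD target; these forms, the radius, the learning rate, and every name below are assumptions for illustration rather than the paper's exact algorithm.

```python
import math
import random

GAMMA = 0.9


def q_value(x, omega, b):
    """Two-layer ReLU critic: m^{-1/2} * sum_r b_r * ReLU(x^T omega_r)."""
    return sum(b_r * max(0.0, sum(xi * wi for xi, wi in zip(x, w_r)))
               for w_r, b_r in zip(omega, b)) / math.sqrt(len(omega))


def q_grad(x, omega, b):
    """Gradient of the critic in omega (valid almost everywhere)."""
    m = len(omega)
    return [[(b_r / math.sqrt(m)) * xi
             if sum(xj * wj for xj, wj in zip(x, w_r)) > 0 else 0.0
             for xi in x] for w_r, b_r in zip(omega, b)]


def project(omega, omega_init, radius):
    """Euclidean projection onto the ball of the given radius around omega_init."""
    dist = math.sqrt(sum((w - w0) ** 2 for w_r, w0_r in zip(omega, omega_init)
                         for w, w0 in zip(w_r, w0_r)))
    if dist <= radius:
        return omega
    scale = radius / dist
    return [[w0 + scale * (w - w0) for w, w0 in zip(w_r, w0_r)]
            for w_r, w0_r in zip(omega, omega_init)]


def td_step(omega, omega_init, b, x, reward, x_next, lr, radius):
    """One semigradient TD(0) step followed by projection."""
    # The bootstrapped target is treated as a constant: only Q(x) is differentiated.
    target = (1 - GAMMA) * reward + GAMMA * q_value(x_next, omega, b)
    td_error = q_value(x, omega, b) - target
    grad = q_grad(x, omega, b)
    half = [[w - lr * td_error * g for w, g in zip(w_r, g_r)]
            for w_r, g_r in zip(omega, grad)]
    return project(half, omega_init, radius)


rng = random.Random(0)
d, m = 3, 16
omega_init = [[rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)] for _ in range(m)]
b = [rng.choice((-1.0, 1.0)) for _ in range(m)]
x, x_next = [1.0, 0.0, 0.0], [0.0, 0.6, 0.8]   # encodings of (s, a) and (s', a')
omega = td_step(omega_init, omega_init, b, x, reward=1.0, x_next=x_next,
                lr=10.0, radius=0.5)
```

The deliberately large learning rate in the usage line makes the projection visible: whatever the unprojected step is, the returned parameter stays within the chosen radius of the initialization.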

## 4 Main Results

In this section, we establish the global optimality and convergence for neural policy gradient methods. Hereafter, we assume that the absolute value of the reward function $r$ is upper bounded by an absolute constant $Q_{\max} > 0$. As a result, we obtain from (1) and (2) that $|V^\pi(s)| \leq Q_{\max}$, $|Q^\pi(s, a)| \leq Q_{\max}$, and $|A^\pi(s, a)| \leq 2 Q_{\max}$ for all $\pi$ and $(s, a) \in \mathcal{S} \times \mathcal{A}$. In §4.1, we show that neural policy gradient converges to a stationary point of $J(\pi_\theta)$ with respect to $\theta$ at a sublinear rate. We further characterize the geometry of $J(\pi_\theta)$ and establish the global optimality of the obtained stationary point. Meanwhile, in §4.2, we prove that neural natural policy gradient converges to the global optimum of $J(\pi_\theta)$ at a sublinear rate.

### 4.1 Neural Policy Gradient

In the sequel, we study the convergence of neural policy gradient, i.e., Algorithm 1 with (17) as the actor update, where . In what follows, we lay out a regularity condition on the action-value function . {assumption}[Action-Value Function Class] We define

where is the density function of the Gaussian distribution and is the two-layer neural network corresponding to the initial parameter . We assume that for all . Assumption 4.1 is a mild regularity condition on , as captures a sufficiently general family of functions, which constitute a subset of the reproducing kernel Hilbert space (RKHS) induced by the random feature with (Rahimi and Recht, 2008, 2009) up to the shift of . Similar assumptions are imposed in the analysis of batch reinforcement learning in RKHS (Farahmand et al., 2016).

In what follows, we lay out a regularity condition on the state visitation measure and the stationary state distribution .

[Regularity Condition on and ] Let and be two arbitrary policies. We assume that there exists an absolute constant such that

Here the expectations are taken over the joint distributions and over , respectively.

Assumption 4.1 essentially imposes a regularity condition on the Markov transition kernel of the MDP as determines and for all . Such a regularity condition holds if both and have upper-bounded density functions for all .

After introducing these regularity conditions, we present the following proposition adapted from Cai et al. (2019), which characterizes the convergence of neural TD for the critic update.

[Convergence of Critic Update] We set in Algorithm 1. Let be the output of the -th critic update in Line 3 of Algorithm 1, which is an estimator of obtained by Algorithm 2 with iterations. Under Assumptions 4.1 and 4.1, it holds for that

(25) |

where is the stationary state-action distribution corresponding to . Here the expectation is taken over the random initialization.

###### Proof.

See §B.1 for a detailed proof. ∎

Cai et al. (2019) show that the error of the critic update consists of two parts, namely the approximation error of two-layer neural networks and the algorithmic error of neural TD. The former decays as the width grows, while the latter decays as the number of neural TD iterations in Algorithm 2 grows. By setting , the algorithmic error in (25) of Proposition 4.1 is dominated by the approximation error. In contrast with Cai et al. (2019), we obtain a more refined convergence characterization under the more restrictive assumption that . Specifically, such a restriction allows us to obtain the upper bound of the mean squared error in (25) of Proposition 4.1.

It now remains to establish the convergence of the actor update, which involves the estimator of the policy gradient based on . We introduce the following regularity condition on the variance of .

[Variance Upper Bound] Recall that is the state-action visitation measure corresponding to for all . Let , where is defined in (18). We assume that there exists an absolute constant such that for all . Here the expectations are taken over given and .

Assumption 4.1 is a mild regularity condition. Such a regularity condition holds if the Markov chain that generates mixes sufficiently fast and with have upper bounded second moments for all . Zhang et al. (2019) verify that under certain regularity conditions, similar unbiased policy gradient estimators have almost surely upper bounded norms, which implies Assumption 4.1. Similar regularity conditions are also imposed in the analysis of policy gradient methods by Xu et al. (2019a, b).

In what follows, we impose a regularity condition on the discrepancy between the state-action visitation measure and the stationary state-action distribution corresponding to the same policy. {assumption}[Regularity Condition on and ] We assume that there exists an absolute constant such that

(26) |

Here is the Radon-Nikodym derivative of with respect to .

We highlight that if the MDP is initialized at the stationary distribution , the state-action visitation measure is the same as . Meanwhile, if the induced Markov state-action chain mixes sufficiently fast, such an assumption also holds. A similar regularity condition is imposed by Scherrer (2013), which assumes that the -norm of is upper bounded, whereas we only assume that its -norm is upper bounded.

Meanwhile, we impose the following regularity condition on the smoothness of the expected total reward with respect to . {assumption}[Lipschitz Continuous Policy Gradient] We assume that is -Lipschitz continuous with respect to , where is an absolute constant.

Such an assumption holds when the transition probability and the reward function are both Lipschitz continuous with respect to their inputs (Pirotta et al., 2015). Also, Karimi et al. (2019); Zhang et al. (2019); Xu et al. (2019b) verify the Lipschitz continuity of the policy gradient under certain regularity conditions.

Note that we restrict to the parameter space . Here we call a stationary point of if it holds for all that . We now show that the sequence generated by neural policy gradient converges to a stationary point at a sublinear rate. {theorem}[Convergence to Stationary Point] We set , , , , and by Algorithm 1, where the actor update is given in (17) with . For all , we define

(27) |

where is the projection operator onto . Under the assumptions of Proposition 4.1 and Assumptions 4.1-4.1, for we have