# Actor-Expert: A Framework for using Action-Value Methods in Continuous Action Spaces

Sungsu Lim, Ajin Joseph, Lei Le‡, Yangchen Pan, and Martha White
University of Alberta
‡ Indiana University Bloomington
{sungsu, ajoseph, pan6, whitem}@ualberta.ca
leile@iu.edu
###### Abstract

Value-based approaches can be difficult to use in continuous action spaces, because an optimization has to be solved to find the greedy action for the action-values. A common strategy has been to restrict the functional form of the action-values to be convex or quadratic in the actions, to simplify this optimization. Such restrictions, however, can prevent learning accurate action-values. In this work, we propose the Actor-Expert framework for value-based methods, which decouples action-selection (Actor) from the action-value representation (Expert). The Expert uses Q-learning to update the action-values towards the optimal action-values, whereas the Actor learns to output the greedy action for the current action-values. We develop a Conditional Cross Entropy Method for the Actor, to learn the greedy action for a generically parameterized Expert, and provide a two-timescale analysis to validate asymptotic behavior. We demonstrate in a toy domain with bimodal action-values that previous restrictive action-value methods fail, whereas the decoupled Actor-Expert with a more general action-value parameterization succeeds. Finally, we demonstrate that Actor-Expert performs as well as or better than these other methods on several benchmark continuous-action domains.

## Introduction

Model-free control methods are currently divided into two main branches: value-based methods and policy gradient methods. Value-based methods, such as Q-learning, have been quite successful in discrete-action domains [Mnih et al. 2013; van Hasselt, Guez, and Silver 2015], whereas policy gradient methods have been more commonly used in continuous action spaces. One reason for this choice is that finding the optimal action for Q-learning can be difficult in continuous action spaces, since it requires solving an optimization problem.

A common strategy when using action-value methods with continuous actions has been to restrict the form of the action-values, so that the optimization over actions is easy to solve. Wire-fitting [Baird and Klopf 1993; del R Millán, Posenato, and Dedieu 2002] interpolates between a set of action points, adjusting those points over time to force one interpolation action point to become the maximizing action. Normalized Advantage Functions (NAF) [Gu et al. 2016b] learn an advantage function [Baird 1993; Harmon and Baird 1996a; Harmon and Baird 1996b] by constraining the advantage function to be quadratic in terms of the actions, keeping track of the vertex of the parabola. Partially Input Convex Neural Networks (PICNN) are learned such that action-values are guaranteed to be convex in terms of action [Amos, Xu, and Kolter 2016]. To enable convex functions to be learned, however, PICNNs are restricted to non-negative weights and ReLU activations, and the maximizing action is found with an approximate gradient descent from random action points.

Another direction has been to parameterize the policy using the action-values, and to use a soft Q-learning update instead [Haarnoja et al. 2017]. For action selection, the policy is parameterized as an energy-based model using the action-values. This approach avoids the difficult optimization over actions, but sampling an action from the policy can instead be expensive. The action-values can be an arbitrary (energy) function, and sampling from the corresponding energy-based model requires an approximate sampling routine, like MCMC. Moreover, it optimizes an entropy-regularized objective, which differs from the traditional objective in most other action-value learning algorithms, like Q-learning.

Policy gradient methods, on the other hand, learn a simple parametric distribution or a deterministic function over actions that can be easily used in continuous action spaces. In recent years, policy gradient methods have been particularly successful in continuous action benchmark domains [Duan et al. 2016], facilitated by the Actor-Critic framework. Actor-Critic methods, first introduced in [Sutton 1984], use a Critic (value function) that evaluates the current policy, to help compute the gradient for the Actor (policy). This separation into Actor and Critic enabled the two components to be optimized in a variety of ways, facilitating algorithm development. The Actor can incorporate different update mechanisms to achieve better sample efficiency [Mnih et al. 2016; Kakade 2001; Peters and Schaal 2008; Wu et al. 2017] or stable learning [Schulman et al. 2015; Schulman et al. 2017]. The Critic can be used as a baseline or control variate to reduce variance [Greensmith, Bartlett, and Baxter 2004; Gu et al. 2016a; Schulman et al. 2016], and improve sample efficiency by incorporating off-policy samples [Degris, White, and Sutton 2012; Silver et al. 2014; Lillicrap et al. 2015; Wang et al. 2016].

In this work, we propose a framework called Actor-Expert, that parallels Actor-Critic, but for value-based methods, facilitating use of Q-learning for continuous action spaces. Actor-Expert decouples optimal action selection (Actor) from action-value representation (Expert), enabling a variety of optimization methods to be used for the Actor. The Expert learns the action-values using Q-learning. The Actor learns the greedy action by iteratively updating towards an estimate of the maximum action for the action-values given by the Expert. This decoupling also enables any Actor to be used, including any exploration mechanism, without interfering with the Expert’s goal to learn the optimal action-values. Actor-Expert is different from Actor-Critic because the Expert uses Q-learning—the Bellman optimality operator—whereas the Critic performs policy evaluation to get values of the current (sub-optimal) policy. In Actor-Expert, the Actor tracks the Expert, to track the greedy action, whereas in Actor-Critic, the Critic tracks the Actor, to track the policy values.

Taking advantage of this formalism, we introduce a Conditional Cross Entropy Method for the Actor, which puts minimal restrictions on the form of the action-values. The basic idea is to iteratively increase the likelihood of near-maximal actions for the Expert over time, extending the global optimization algorithm, the Cross Entropy Method [Rubinstein 1999], to be conditioned on state. We show in a toy domain with bimodal action-values, which are neither quadratic nor convex, that previous action-value methods with restrictive action-values (NAF and PICNN) perform poorly, whereas Actor-Expert learns the optimal policy well. We then show, on several continuous-action benchmark domains, that our algorithm outperforms previous value-based methods and an instance of an Actor-Critic method, Deep Deterministic Policy Gradient (DDPG).

## Background and Problem Formulation

The interaction between the agent and environment is formalized as a Markov decision process (\mathcal{S},\mathcal{A},P,R,\gamma), where \mathcal{S} is the state space, \mathcal{A} is the action space, P:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1] is the one-step state transition dynamics, R:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R} is the reward function and \gamma\in[0,1) is the discount rate. At each discrete time step t=1,2,3,..., the agent selects an action A_{t}\sim\pi(\cdot|S_{t}) according to policy \pi:\mathcal{S}\times\mathcal{A}\rightarrow[0,\infty), the agent transitions to state S_{t+1} according to P, and observes a scalar reward R_{t+1}\doteq R(S_{t},A_{t},S_{t+1}).

For value-based methods, the objective is to find the fixed-point for the Bellman optimality operator:

 Q^{*}(s,a)\doteq\int_{\mathcal{S}}P(s,a,s^{\prime})\left[R(s,a,s^{\prime})+\gamma\max_{a^{\prime}\in\mathcal{A}}Q^{*}(s^{\prime},a^{\prime})\right]ds^{\prime}. (1)

The corresponding optimal policy selects a greedy action from the set \operatorname*{arg\,max}_{a\in\mathcal{A}}Q^{*}(s,a). These optimal Q-values are typically learned using Q-learning [Watkins and Dayan 1992]: for action-values Q_{\theta} parameterized by \theta\in\mathbb{R}^{n}, the iterative updates are \theta_{t+1}=\theta_{t}+\alpha_{t}\delta_{t}\nabla_{\theta}Q_{\theta}(S_{t},A_{t}) for

 \delta_{t}=R_{t+1}+\gamma\max_{a^{\prime}\in\mathcal{A}}Q_{\theta}(S_{t+1},a^{\prime})-Q_{\theta}(S_{t},A_{t}).

Q-learning is an off-policy algorithm that can learn the action-values for the optimal policy while following a different (exploratory) behaviour policy.
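The update above can be sketched in a few lines. This is only an illustration under stated assumptions: the action-values are taken to be linear, Q_{\theta}(s,a)=\theta^{\top}\phi(s,a), the feature map `phi` is hypothetical, and the maximization over \mathcal{A} is replaced by a maximum over a small, hand-picked set of candidate actions.

```python
def q_learning_step(theta, phi, s, a, r, s_next, candidate_actions,
                    alpha=0.1, gamma=0.99):
    """One semi-gradient Q-learning update for a linear Q_theta(s, a).

    Assumption: Q_theta(s, a) = theta . phi(s, a), and the max over actions
    is approximated by a max over `candidate_actions`.
    """
    def q(state, action):
        return sum(t * f for t, f in zip(theta, phi(state, action)))

    # TD error: delta = r + gamma * max_a' Q(s', a') - Q(s, a)
    max_next = max(q(s_next, a_next) for a_next in candidate_actions)
    delta = r + gamma * max_next - q(s, a)

    # For a linear Q, the gradient with respect to theta is phi(s, a).
    grad = phi(s, a)
    new_theta = [t + alpha * delta * g for t, g in zip(theta, grad)]
    return new_theta, delta


# Usage with hypothetical polynomial features of the action.
phi = lambda s, a: [1.0, a, a * a]
theta, delta = q_learning_step([0.0, 0.0, 0.0], phi, s=0, a=0.5, r=1.0,
                               s_next=0, candidate_actions=[-1.0, 0.0, 1.0])
```

The candidate-set maximum is exactly the step that becomes problematic for continuous actions, which is the issue the rest of the paper addresses.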

Policy gradient methods directly optimize a parameterized policy \pi_{\mathbf{w}}, with parameters \mathbf{w}\in\mathbb{R}^{m}. The objective is typically an average reward objective,

 \max_{\pi}\int_{\mathcal{S}}d_{\pi}(s)\int_{\mathcal{A}}\pi(s,a)\int_{\mathcal{S}}P(s,a,s^{\prime})R(s,a,s^{\prime})\,ds^{\prime}\,da\,ds (2)

where d_{\pi}:\mathcal{S}\rightarrow[0,\infty) is the stationary distribution over states, representing state visitation. Policy gradient methods estimate gradients of this objective [Sutton et al. 2000]

 \int_{\mathcal{S}}d_{\pi_{\mathbf{w}}}(s)\int_{\mathcal{A}}\nabla_{\mathbf{w}}\pi_{\mathbf{w}}(s,a)Q^{\pi_{\mathbf{w}}}(s,a)\,da\,ds

 \text{for }\ Q^{\pi_{\mathbf{w}}}(s,a)\doteq\int_{\mathcal{S}}P(s,a,s^{\prime})\left[R(s,a,s^{\prime})+\gamma\int_{\mathcal{A}}\pi_{\mathbf{w}}(s^{\prime},a^{\prime})Q^{\pi_{\mathbf{w}}}(s^{\prime},a^{\prime})\,da^{\prime}\right]ds^{\prime}.

For example, in the policy-gradient approach called Actor-Critic [Sutton 1984], the Critic estimates Q^{\pi_{\mathbf{w}}} and the Actor uses the Critic to obtain an estimate of the above gradient to adjust the policy parameters \mathbf{w}.

Action-value methods can be difficult to use with continuous actions, because an optimization over actions needs to be solved, both for decision-making and for the Q-learning update. For a reasonably small number of discrete actions, \max_{a\in\mathcal{A}}Q_{\theta}(s,a) is straightforward to solve, by iterating across all actions. For continuous actions, Q_{\theta}(s,\cdot) cannot be queried for all actions, and the optimization can be difficult to solve, for example when Q_{\theta}(s,\cdot) is non-concave in a.
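To make the difficulty concrete, the following sketch runs plain gradient ascent on a hypothetical one-dimensional action-value with two local maxima. The function `q`, its derivative `dq`, the step size and the starting points are all illustrative assumptions, not quantities from the paper.

```python
def q(a):
    """Hypothetical non-concave action-value with local maxima near a = -1 and a = +1."""
    return -(a * a - 1.0) ** 2 + 0.3 * a


def dq(a):
    """Analytic derivative of q with respect to the action."""
    return -4.0 * a * (a * a - 1.0) + 0.3


def gradient_ascent(a, step_size=0.02, n_steps=500):
    """Follow the gradient of q from the initial action a."""
    for _ in range(n_steps):
        a += step_size * dq(a)
    return a


left = gradient_ascent(-1.5)   # settles at the lower local maximum (negative action)
right = gradient_ascent(0.5)   # settles at the higher maximum (positive action)
```

Which maximum is reached depends only on the starting point, so a greedy-action routine based on local ascent can return a suboptimal action for an unlucky initialization.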

## Actor-Expert Formalism

We propose a new framework for value-based methods, with an explicit Actor. The goal is to provide a framework similar to Actor-Critic, which has been so successful for algorithm development of policy gradient methods, to simplify algorithm development for value-based methods. The Expert learns Q using Q-learning, but with an explicit Actor that provides the greedy actions. The Actor has two roles: to select which action to take (behavior policy) and to provide the greedy action for the Expert's Q-learning target. In this section, we develop a Conditional Cross Entropy Method for the Actor, to estimate the greedy action, and provide theoretical guarantees that the approach tracks a changing Expert.

### Conditional Cross Entropy Method for the Actor

The primary role of the Actor is to identify, or learn, \operatorname*{arg\,max}_{a^{\prime}\in\mathcal{A}}Q_{\theta}(S_{t+1},a^{\prime}) for the Expert. Different strategies can be used to obtain this greedy action on each step. The simplest strategy is to solve this optimization with gradient ascent, to convergence, on every time step. This is problematic for two reasons: it is expensive and is likely to get stuck in suboptimal stationary points.

Consider now a slightly more effective strategy, that learns an Actor that can provide an approximate greedy action that can serve as a good initial point for gradient ascent. Such a strategy reduces the number of gradient ascent steps required, and so makes it more feasible to solve the gradient ascent problem on each step. After obtaining a^{\prime} at the end of the gradient ascent iterations, the Actor can be trained towards a^{\prime}, using a supervised learning update on \pi_{\mathbf{w}}(\cdot|S_{t+1}). The Actor will slowly learn to select better initial actions, conditioned on state, that are near stationary points for Q(s,a)—which hopefully correspond to high-value actions. This Actor learns to maximize Q, reducing computational complexity, but still suffers from reaching suboptimal stationary points.

To overcome this issue, we propose an approach inspired by the Cross Entropy Method from global optimization. Global optimization strategies are designed to find the global optimum of a function f(\theta) for some parameters \theta. For example, for parameters \theta of a neural network, f may be the loss function on a sample of data. The advantage of these methods is that they do not rely on gradient-based strategies, which are prone to getting stuck in saddle points and local optima. Instead, they use randomized search strategies, which have been shown to be effective in practice [Salimans et al. 2017; Peters and Schaal 2007; Szita and Lörincz 2006; Hansen, Müller, and Koumoutsakos 2003].

One such algorithm is the Cross Entropy Method (CEM) [Rubinstein 1999]. This method maintains a distribution p(\theta) over parameters \theta, starting with a wide distribution, such as a Gaussian distribution with mean zero \mu_{0}=\mathbf{0} and a diagonal covariance \boldsymbol{\Sigma}_{0} of large magnitude. The high-level idea is elegantly simple. On each iteration t, the goal is to minimize the KL-divergence to the uniform distribution over parameters where the objective function is greater than some threshold: I(f(\theta)\geq\text{threshold}). This distribution can be approximated with an empirical distribution, such as by sampling several parameter vectors \theta_{1},\ldots,\theta_{N} and keeping those with f(\theta_{i})\geq\text{threshold} and discarding the rest. Each minimization of the KL-divergence to this empirical distribution \hat{I}=\{\theta_{1}^{*},\ldots,\theta_{h}^{*}\}, for h<N, corresponds to maximizing the likelihood of the parameters in the set \hat{I} under the distribution p_{t}. Iteratively, the distribution over parameters p_{t} narrows around higher valued \theta. Sampling the \theta from p_{t} narrows the search over \theta and makes it more likely for them to produce a useful approximation to I(f(\theta)\geq\text{threshold}).
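A minimal one-dimensional sketch of this procedure, assuming a scalar Gaussian search distribution and a top-quantile threshold; the function name `cem_maximize` and all default values are illustrative choices rather than settings from the paper.

```python
import math
import random
import statistics


def cem_maximize(f, mu=0.0, sigma=5.0, n_samples=50, rho=0.2,
                 iterations=30, seed=0):
    """Cross Entropy Method for maximizing a scalar function f.

    Each iteration samples from N(mu, sigma^2), keeps the top rho fraction
    of samples (the empirical distribution), and refits mu and sigma by
    maximum likelihood on those elite samples.
    """
    rng = random.Random(seed)
    # round() guards against floating-point noise before taking the ceiling.
    n_elite = math.ceil(round(rho * n_samples, 9))
    for _ in range(iterations):
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        elite = sorted(samples, key=f, reverse=True)[:n_elite]
        # Gaussian maximum likelihood: sample mean and population std.
        mu = statistics.fmean(elite)
        sigma = statistics.pstdev(elite) + 1e-6  # small floor avoids collapse to 0
    return mu, sigma


# Usage on a hypothetical objective with its maximum at theta = 3.
mu, sigma = cem_maximize(lambda x: -(x - 3.0) ** 2)
```

The search distribution starts wide and narrows around high-valued samples without ever using a gradient of f.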

CEM, however, finds the single-best set of optimal parameters for a single optimization problem. Most of the work using CEM in reinforcement learning aims to learn a single-best set of parameters that optimize towards higher roll-out returns [Szita and Lörincz 2006; Mannor, Rubinstein, and Gat 2003]. However, our goal is not to do a single global optimization over returns, but rather a repeated optimization to select maximal actions, conditioned on each state. The global optimization strategy could be run on each step to find the exact best action for each current state, but this is expensive and discards information about the function surface obtained in previous optimizations.

We extend the Cross Entropy Method to be (a) conditioned on state and (b) learned iteratively over time. CEM is well-suited to extend to a conditional approach, for use in the Actor, because it provides a stochastic Actor that can explore naturally and is effective for smooth, non-convex functions [Kroese, Porotsky, and Rubinstein 2006]. The idea is to iteratively update \pi(\cdot|S_{t}), where previous updates conditioned on state S_{t} generalize to similar states. The Actor learns a stochastic policy that slowly narrows around maximal actions, conditioned on states, as the agent does CEM updates iteratively for the functions Q(S_{1},\cdot),Q(S_{2},\cdot),\ldots.

The Conditional CEM (CCEM) algorithm replaces the learned p(\cdot) with \pi(\cdot|S_{t}), where \pi(\cdot|S_{t}) can be any parametrized, multi-modal distribution. For a mixture model, for example, the parameters are conditional means \mu_{i}(S_{t}), conditional diagonal covariances \Sigma_{i}(S_{t}) and coefficients c_{i}(S_{t}), for the ith component of the mixture. On each step, the conditional mixture model, \pi(\cdot|S_{t}), is sampled to provide a set of actions a_{1},\ldots,a_{N} from which we construct the empirical distribution \hat{I}(S_{t})=\{a^{*}_{1},\ldots,a_{h}^{*}\} where h<N for state S_{t} with current values Q(S_{t},\cdot). The parameters \mathbf{w} are updated using a gradient ascent step on the log-likelihood of the actions \hat{I}(S_{t}) under \pi.
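The update just described can be sketched for a deliberately simplified Actor. The assumptions here are substantial and are not the paper's architecture: a single Gaussian instead of a mixture, a linear conditional mean \mu(s)=w_{0}+w_{1}s, a fixed standard deviation, and a hypothetical action-value Q(s,a)=-(a-s)^{2} whose greedy action is a=s.

```python
import math
import random


def ccem_update(w, s, q, rng, sigma=0.5, n_samples=30, rho=0.2, lr=0.05):
    """One Conditional CEM step for a Gaussian policy with mean w[0] + w[1] * s.

    Samples actions from pi(.|s), keeps the top rho fraction under q(s, .),
    and takes one gradient ascent step on their average log-likelihood.
    """
    mean = w[0] + w[1] * s
    actions = [rng.gauss(mean, sigma) for _ in range(n_samples)]
    n_elite = math.ceil(round(rho * n_samples, 9))
    elite = sorted(actions, key=lambda a: q(s, a), reverse=True)[:n_elite]
    # d/d(mean) of log N(a; mean, sigma^2) is (a - mean) / sigma^2.
    g_mean = sum((a - mean) / sigma ** 2 for a in elite) / n_elite
    # Chain rule through the linear mean: d(mean)/dw = [1, s].
    return [w[0] + lr * g_mean, w[1] + lr * g_mean * s]


# Usage: states are drawn at random and the Actor is updated on each one.
rng = random.Random(0)
q = lambda s, a: -(a - s) ** 2   # hypothetical Expert; greedy action is a = s
w = [0.0, 0.0]
for _ in range(2000):
    w = ccem_update(w, rng.uniform(-1.0, 1.0), q, rng)
```

Because each update is made for a different state, the parameters are never optimized to convergence for any single Q(s,\cdot); the policy instead generalizes across states, and in this sketch the mean parameters drift towards w_{0}\approx 0, w_{1}\approx 1.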

The high-level framework is given in Algorithm 1. The Expert is updated towards learning the optimal Q-values, with (a variant of) Q-learning. The Actor provides exploration and, over time, learns how to find the maximal action for the Expert in the given state, using the described Conditional CEM algorithm. The strategy for the empirical distribution is assumed to be given. We discuss two strategies we explore in the experiments, in the next subsection.

We depict an Actor-Expert architecture where the Actor uses a mixture model in Figure 1. In our implementation, we use mixture density networks [Bishop 1994] to learn a Gaussian mixture distribution. As in Figure 1, the Actor and Expert share the same neural network to obtain the representation for the state, and learn separate functions conditioned on that state. To obtain the maximal action under mixture models with a small number of components, we simply used the mean with the highest coefficient c_{i}(S_{t}). To prevent the diagonal covariance \boldsymbol{\Sigma} from exploding or vanishing, we bound it to the interval [e^{-2},e^{2}] using a tanh layer. We also follow standard practice of using experience replay and target networks to stabilize learning in neural networks. A more detailed algorithm for Actor-Expert with neural networks is described in Supplement 2.1.
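Two of these implementation details are easy to illustrate. The sketch below is one plausible reading, not the authors' code: it assumes the tanh output is scaled by 2 and exponentiated to obtain a value in [e^{-2},e^{2}], and both function names are hypothetical.

```python
import math


def bounded_sigma(raw_output):
    """Map an unbounded network output to [e^-2, e^2].

    Assumption: the bound is applied in log-space, sigma = exp(2 * tanh(x)).
    """
    return math.exp(2.0 * math.tanh(raw_output))


def greedy_action(coefficients, means):
    """Return the mean of the mixture component with the largest coefficient."""
    best = max(range(len(coefficients)), key=lambda i: coefficients[i])
    return means[best]


# Usage with a hypothetical two-component mixture in one action dimension.
action = greedy_action([0.3, 0.7], [-1.0, 1.0])
```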

### Selecting the empirical distribution

A standard strategy for selecting the empirical distribution in CEM is to use the top quantile of sampled variables, which are actions in this case (Algorithm 2). For a_{1},\ldots,a_{N} sampled from \pi_{\mathbf{w}_{t}}(\cdot|S_{t}), we select the subset \{a_{i}^{*}\}\subset\{a_{1},\ldots,a_{N}\} whose values Q(S_{t},a_{i}^{*}) lie at or above the (1-\rho) quantile of the sampled values. The resulting empirical distribution is \hat{I}(S_{t})=\{a^{*}_{1},\ldots,a_{h}^{*}\}, for h=\lceil\rho N\rceil. This strategy is generic and, as we find empirically, effective.

For particular regularities in the action-values, however, we may be able to further improve this empirical distribution. For action-values differentiable in the action, we can perform a small number of gradient ascent steps from a_{i} to reach actions a_{i}^{*} with slightly higher action-values (Algorithm 3). The empirical distribution, then, should contain a larger number of useful actions (those with higher action-values) on which to perform maximum likelihood, potentially also requiring fewer samples. In our experiments we perform 10 gradient ascent steps.
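A sketch of this optimized selection for a scalar action. The ordering of the steps (refine every sample, then keep the top \lceil\rho N\rceil) is an assumption about Algorithm 3, and the finite-difference gradient, step size and function name are illustrative stand-ins for differentiating a learned Q.

```python
import math


def refine_and_select(actions, q, rho=0.4, n_steps=10, step_size=0.05, eps=1e-4):
    """Improve sampled actions by gradient ascent on q, then keep the top quantile."""
    def grad(a):
        # Central finite difference; a learned Q would use automatic differentiation.
        return (q(a + eps) - q(a - eps)) / (2.0 * eps)

    refined = []
    for a in actions:
        for _ in range(n_steps):
            a = a + step_size * grad(a)
        refined.append(a)

    h = math.ceil(round(rho * len(actions), 9))   # h = ceil(rho * N)
    return sorted(refined, key=q, reverse=True)[:h]


# Usage with a hypothetical action-value peaked at a = 1.
q = lambda a: -(a - 1.0) ** 2
elite = refine_and_select([-1.0, 0.0, 0.5, 2.5, 3.0], q)
```

Setting `n_steps=0` recovers the plain top-quantile selection of the previous paragraph.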

### Theoretical guarantees for the Actor

In this section, we derive guarantees that the Conditional CEM Actor tracks a CEM update, for an evolving Expert. We follow a two-timescale stochastic approximation approach, where the action-values (Expert) change more slowly than the policy (Actor), allowing the Actor to track the maximal actions. (This is the opposite of Actor-Critic, for which the Actor changes slowly and the value estimates are on the faster timescale.) The Actor itself has two timescales, to account for its own parameters changing at different timescales. Actions for the maximum likelihood step are selected according to older, slower parameters, so that it is as if the primary, faster parameters are updated using samples from a fixed distribution.

We provide an informal theorem statement here, with a proof-sketch. We include the full theorem statement, with assumptions and proof, in Supplement 1.

###### Theorem 1 (Informal Convergence Result).

Let \theta_{t} be the action-value parameters with stepsize \alpha_{q,t}, and \mathbf{w}_{t} be the policy parameters with stepsize \alpha_{a,t}, with \mathbf{w}^{\prime}_{t} a more slowly changing set of policy parameters updated as \mathbf{w}^{\prime}_{t+1}=(1-\alpha^{\prime}_{a,t})\mathbf{w}^{\prime}_{t}+\alpha^{\prime}_{a,t}\mathbf{w}_{t} for stepsize \alpha^{\prime}_{a,t}\in(0,1]. Assume

1. States S_{t} are sampled from a fixed marginal distribution.

2. \nabla_{\mathbf{w}}\ln{\pi_{\mathbf{w}}(\cdot|s)} is locally Lipschitz w.r.t. \mathbf{w}, \forall s\in\mathcal{S}.

3. Parameters \mathbf{w}_{t} and \theta_{t} remain bounded almost surely.

4. Stepsizes are chosen for three different timescales so that \mathbf{w}_{t} evolves faster than \mathbf{w}^{\prime}_{t}, and \mathbf{w}^{\prime}_{t} evolves faster than \theta_{t}:

 \lim_{t\rightarrow\infty}\frac{\alpha^{\prime}_{a,t}}{\alpha_{a,t}}=0\quad\text{and}\quad\lim_{t\rightarrow\infty}\frac{\alpha_{q,t}}{\alpha_{a,t}}=0.

5. All three stepsizes decay to 0, while the sample length N_{t} strictly increases to infinity.

6. Both the L_{2} norm and the centered second moment of \nabla_{\mathbf{w}}\ln{\pi_{\mathbf{w}}(\cdot|s)} w.r.t. \pi_{\mathbf{w}^{\prime}} are bounded uniformly.

Then the Conditional CEM Actor tracks the CEM Optimizer for actions, conditioned on state: the stochastic recursion for the Actor asymptotically behaves like an expected CEM Optimizer, with expectation taken across states.
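As an illustration of the stepsize conditions only, the following sketch uses hypothetical polynomial schedules whose ratios vanish in the required order, together with the update of the slowly changing policy parameters. The exponents are assumptions chosen for the example; the full set of conditions is in the formal theorem statement in the Supplement.

```python
def stepsizes(t):
    """Hypothetical schedules for t >= 1 on three separated timescales."""
    alpha_a = t ** -0.6        # fast Actor parameters w
    alpha_a_slow = t ** -0.8   # slowly changing Actor parameters w'
    alpha_q = t ** -1.0        # Expert parameters theta
    return alpha_a, alpha_a_slow, alpha_q


def update_slow_weights(w_slow, w, alpha_a_slow):
    """w' <- (1 - alpha') * w' + alpha' * w, applied elementwise."""
    return [(1.0 - alpha_a_slow) * ws + alpha_a_slow * wf
            for ws, wf in zip(w_slow, w)]


# Both ratios from the stepsize assumption shrink as t grows.
a, a_slow, a_q = stepsizes(10 ** 6)
ratio_slow_to_fast = a_slow / a    # equals t^-0.2
ratio_q_to_fast = a_q / a          # equals t^-0.4
```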

Proof Sketch:  The proof follows a multi-timescale stochastic approximation analysis. The primary concern is that the stochastic update to the Actor is not a direct gradient-descent update. Rather, each update to the Actor is a CEM update, which requires a different analysis to ensure that the stochastic noise remains bounded and is asymptotically negligible. Further, the classical results of the CEM also do not immediately apply, because such updates assume distribution parameters can be directly computed. Here, distribution parameters are conditioned on state, as outputs from a parametrized function. We identify conditions on the parametrized policy to ensure well-behaved CEM updates.

The multi-timescale analysis allows us to focus on the updates of the Actor \mathbf{w}_{t}, assuming the action-value parameter \theta and action-sampling parameter \mathbf{w}^{\prime} are quasi-static. These parameters are allowed to change with time—as they will in practice—but are moving at a sufficiently slower timescale relative to \mathbf{w}_{t} and hence the analysis can be undertaken as if they are static. These updates need to produce \theta that keep the action-values bounded for each state and action, but we do not specify the exact algorithm for the action-values. We assume that the action-value algorithm is given, and focus the analysis on the novel component: the Conditional CEM updates for the Actor.

The first step in the proof is to formulate the update to the weights as a projected stochastic recursion, that is, a stochastic update where after each update the weights are projected onto a compact, convex set to keep them bounded. The stochastic recursion is reformulated into a summation involving the mean vector field g^{\theta}(\mathbf{w}_{t}) (which depends on the action-value parameters \theta), martingale noise and a loss term \ell^{\theta}_{t} that is due to having approximate quantiles. The key steps are then to show almost surely that the mean vector field g^{\theta} is locally Lipschitz, the martingale noise is quadratically bounded and that the loss term \ell^{\theta}_{t} decays to zero asymptotically. For the first and second, we identify conditions on the policy parameterization that guarantee these. For the final case, we adapt the proof for sampled quantiles approaching true quantiles for CEM, with modifications to account for expectations over the conditioning variable, the state. \blacksquare

## Experiments

In this section, we investigate the utility of AE, particularly highlighting the utility of generalizing the functional form for the action-values and demonstrating performance across several benchmark domains. We first design a domain where the true action-values are neither quadratic nor concave, to investigate the utility of generalizing the functional form for the action-values. Then, we test AE and several other algorithms listed below in more complex continuous-action domains from OpenAI Gym [Brockman et al. 2016] and MuJoCo [Todorov, Erez, and Tassa 2012].

### Algorithms

We use two versions of Actor-Expert: AE, which uses the Quantile Empirical Distribution (Alg. 2), and AE+, which uses the Optimized Quantile Empirical Distribution (Alg. 3). We use a bimodal Gaussian mixture for both Actors, with N=30 and \rho=0.2 for AE and N=10 and \rho=0.4 for AE+. The second choice for AE+ reflects that a smaller number of samples is needed for the optimized set of actions. For benchmark environments, AE+ was effective, and more efficient, even when sampling only one action (N=1), with \rho=1.0. For NAF, PICNN, Wire-fitting, and DDPG, we attempt to match the settings used in their works.

Normalized Advantage Function (NAF) [Gu et al. 2016b] uses Q(s,a)=V(s)+A(s,a), restricting the advantage function to the form A(s,a)=-\frac{1}{2}(a-\mu(s))^{T}\Sigma^{-1}(s)(a-\mu(s)). V(s) corresponds to the state value for the maximum action \mu(s), and A(s,a) only decreases this value for a\neq\mu(s). NAF takes actions by sampling from a Gaussian with learned mean \mu(s) and learned covariance \Sigma(s), with initial exploration scale swept in {0.1, 0.3, 1.0}.
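For a scalar action the NAF form reduces to a downward parabola, which the short sketch below evaluates. The function name and the numeric values are illustrative; in NAF itself V, \mu and \Sigma are network outputs.

```python
def naf_q(a, v, mu, sigma_sq):
    """Scalar-action NAF value: Q(s, a) = V(s) - 0.5 * (a - mu(s))^2 / Sigma(s)."""
    return v - 0.5 * (a - mu) ** 2 / sigma_sq


# The value peaks at a = mu and is symmetric around it.
peak = naf_q(0.3, v=2.0, mu=0.3, sigma_sq=0.5)
left = naf_q(-0.7, v=2.0, mu=0.3, sigma_sq=0.5)
right = naf_q(1.3, v=2.0, mu=0.3, sigma_sq=0.5)
```

Because the parabola has a single peak, this form cannot represent action-values with two separate maxima of different heights.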

Partially Input Convex Neural Networks (PICNN) [Amos, Xu, and Kolter 2016] is a neural network that is convex with respect to a part of its input, the action in this case. PICNN learns -Q(s,a) so that it is convex with respect to a, by restricting the weights of intermediate layers to be non-negative, and the activation function to be convex and non-decreasing (e.g. ReLU). For exploration in PICNN, we use OU noise (temporally correlated stochastic noise generated by an Ornstein-Uhlenbeck process [Uhlenbeck and Ornstein 1930]) with \mu=0.0,\theta=0.15,\sigma=0.2, where the noise is added to the greedy action. To obtain the greedy action, as suggested in their paper, we used 5 iterations of the bundle entropy method from a randomly initialized action.

Wire-fitting [Baird and Klopf 1993] outputs a set of action control points and corresponding action values \mathcal{C}=\{(a_{i},q_{i}):i=1,...,m\} for a state. By construction, the optimal action is one of the action control points with the highest action value. Like PICNN, we use OU exploration. This method uses interpolation between the action control points to find the action values, and thus its performance is largely dependent on the number of control points. We used 100 action control points for the Bimodal Domain. For the benchmark problems, we found that Wire-fitting did not scale well, and so was omitted.

Deep Deterministic Policy Gradient (DDPG) [Lillicrap et al. 2015] learns a deterministic policy, parameterized as a neural network, using the deterministic policy gradient theorem [Silver et al. 2014]. We include it as a policy gradient baseline, as it is a competitive Actor-Critic method using off-policy policy gradient. Like PICNN and Wire-fitting, DDPG uses OU noise for exploration.
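The OU exploration noise can be sketched with an Euler-Maruyama discretization. The parameters \mu, \theta and \sigma follow the values quoted above; the time step `dt`, the initial value `x0` and the class name are assumptions of this sketch, since the discretization is not specified in the text.

```python
import math
import random


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process for scalar exploration noise."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, x0=0.0, seed=None):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = x0
        self.rng = random.Random(seed)

    def sample(self):
        # Mean-reverting drift plus Gaussian diffusion.
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


# Usage: add temporally correlated noise to a hypothetical greedy action.
noise = OUNoise(seed=0)
greedy = 0.5
exploratory_actions = [greedy + noise.sample() for _ in range(5)]
```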

### Experimental Settings

Agent performance is evaluated every n steps of training, by executing the current policy without exploration for 10 episodes. The performance was averaged over 20 runs with different random seeds for the Bimodal Domain, and 10 runs for the benchmark domains. For all agents we use a two-layer neural network with 200 hidden units in each layer, with ReLU activations between layers and a tanh activation for action outputs. For AE and AE+, the Actor and Expert share the first layer, and branch out into two separate layers which all have 200 hidden units. We keep a running average and standard deviation to normalize unbounded state inputs. We use an experience replay buffer and target networks, as is common with neural networks. We use a batch size of 32, a buffer size of 10^{6}, target networks (\tau=0.01), and discount factor \gamma=0.99 for all agents. We sweep over learning rates (policy: {1e-3, 1e-4, 1e-5}; action-values: {1e-2, 1e-3, 1e-4}) and over the use of layer normalization between network layers. For PICNN, however, layer normalization could not be used, in order to preserve convexity. The best hyperparameter settings found for all agents are reported in Supplement 2.4.
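Two of these ingredients can be sketched compactly. Both are common readings rather than confirmed details of the authors' implementation: \tau is interpreted as the coefficient of a soft (Polyak) target-network update, and the running statistics are maintained with Welford's online algorithm. The names `soft_update` and `RunningNormalizer` are hypothetical.

```python
import math


def soft_update(target, online, tau=0.01):
    """Move target parameters a fraction tau towards the online parameters."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]


class RunningNormalizer:
    """Running mean and standard deviation for one scalar state input."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        # Welford's online update of the mean and the sum of squared deviations.
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def normalize(self, x, eps=1e-8):
        std = math.sqrt(self.m2 / self.n) if self.n > 0 else 0.0
        return (x - self.mean) / (std + eps)


# Usage on a few hypothetical observations.
norm = RunningNormalizer()
for obs in [1.0, 2.0, 3.0]:
    norm.update(obs)
new_target = soft_update([1.0], [0.0])
```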

### Experiments in a Bimodal Toy Domain

To illustrate the limitation that could be posed by restricting the functional form of Q(s,a), we design a toy domain with a single state S_{0} and actions a\in[-2,2], where the true Q^{*}(s,a), shown in Figure 3, is a function of two radial basis functions centered at a=-1.0 and a=1.0, with unequal values of 1.0 and 1.5 respectively. We assume a deterministic setting, and so the rewards are R(s,a)=Q^{*}(s,a).

We plot the average performance of the best setting for each agent over 20 runs, in Figure 2. We also monitored the training process, logging the action-value function, exploratory action, and greedy action at each time step. We include videos in the Supplement (https://sites.google.com/ualberta.ca/actorexpert/), and descriptions can be found in Supplement 2.2.
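A sketch of a domain with this shape follows. The text gives only the centers and peak values; combining the two bumps with a maximum, the Gaussian form of each bump and the width of 0.2 are assumptions made here for illustration.

```python
import math


def q_star(a, width=0.2):
    """Hypothetical bimodal action-value: peaks of 1.0 at a = -1 and 1.5 at a = +1."""
    left = 1.0 * math.exp(-((a + 1.0) ** 2) / (2.0 * width ** 2))
    right = 1.5 * math.exp(-((a - 1.0) ** 2) / (2.0 * width ** 2))
    return max(left, right)


def reward(a):
    """Single-state, deterministic reward: R(s, a) = Q*(s, a) with a clipped to [-2, 2]."""
    return q_star(min(2.0, max(-2.0, a)))


# A dense grid search recovers the optimal action in this one-dimensional case.
grid = [-2.0 + 0.01 * i for i in range(401)]
best_action = max(grid, key=q_star)
```

An agent that settles on the bump at a=-1 receives a reward of 1.0 per step and, exploring only locally, sees no gradient towards the better action at a=1.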

All the methods that restrict the functional form for actions failed in many runs. PICNN and NAF start to increase value for one action center, and by necessity of convexity, must overly decrease the values around the other action center. Consequently, when they randomly explore and observe the higher reward for that action than they predict, a large update skews the action-value estimates. DDPG would similarly suffer, because the Actor only learns to output one action. Even though its action-value function is not restrictive, DDPG may periodically see high value for the other action center, and so its choice of greedy action can be pulled back-and-forth between these high-valued actions. AE methods, on the other hand, almost always found the optimal action. Wire-fitting performed better than DDPG, PICNN and NAF, as it should be capable of correctly modeling action-values, but still converged to the suboptimal policy quite often.

The exploration mechanisms also played an important role. For certain exploration settings, the agents restricting the functional form on the action-values or policy can learn to settle on one action, rather than oscillating. For NAF with a small exploration scale and DDPG using OU noise, such oscillations were not observed in the above figure, because the agent only explores locally around one action, avoiding oscillation but also often converging to the suboptimal action. This was still a better choice for overall performance, as oscillation produces lower accumulated reward. AE and AE+, on the other hand, explore by sampling from their learned multi-modal Gaussian mixture distribution, with no external exploration parameter to tune.

### Experiments in Benchmark Domains

We evaluated the algorithms on a set of benchmark continuous action tasks, with results shown in Figure 4. As mentioned above, we do not include Wire-fitting, as it scaled poorly on these domains. Detailed description of the benchmark environments and their dimensions is included in Supplement 2.3, with state dimensions ranging from 3 to 17 and action dimensions ranging from 1 to 6.

In all benchmark environments, AE and AE+ perform as well as or better than the other methods. In particular, they seem to learn more quickly. We hypothesize that this is because AE better estimates greedy actions and explores around actions with high action-values more effectively.

NAF and PICNN seemed to have less stable behavior, potentially due to their restrictive action-value function. PICNN likely suffers less, because its functional form is more general, but its greedy action selection mechanism is not as robust and some instability was observed in Lunar Lander. Such instability is not observed in Pendulum or HalfCheetah, possibly because the action-value surface is simple in Pendulum and, for a locomotion environment like HalfCheetah, precision is not necessary; approximately good actions may still enable the agent to move and achieve reasonable performance.

Though the goal here is to evaluate the utility of AE compared to value-based methods, we do include one policy gradient method as a baseline and as a preliminary look at value-based versus policy gradient approaches for continuous control. It is interesting to see that AE methods often perform better than their Actor-Critic (policy gradient) counterpart, DDPG. In particular, AE seems to learn much more quickly, which is a hypothesized benefit of value-based methods. Policy gradient methods, on the other hand, typically have to use out-of-date value estimates to update the policy, which could slow learning.

## Discussion and Future Work

In our work, we introduced a new framework called Actor-Expert that decouples action-selection from action-value representation by introducing an Actor that learns to identify maximal actions for the Expert action-values. Previous value-based approaches for continuous control have typically limited the action-value functional form to easily optimize over actions. We have shown that this can be problematic in domains with true action-values that do not follow this parameterization. We proposed an instance of Actor-Expert, by developing a Conditional Cross Entropy Method to iteratively find greedy actions conditioned on states. We use a multi-timescale analysis to prove that this Actor tracks the Cross Entropy updates which seek the optimal actions across states, as the Expert evolves gradually. This proof differs from other multi-timescale proofs in reinforcement learning, as we analyze a stochastic recursion that is based on the Cross Entropy Method, rather than a more typical stochastic (semi-)gradient descent update. We conclude by showing that AE methods are able to find the optimal policy even when the true action-value function is bimodal, and perform as well as or better than previous methods in more complex domains. Like the Actor-Critic framework, we hope for the Actor-Expert framework to facilitate further development and use of value-based methods for continuous action problems.

One such direction is to more extensively compare value-based methods and policy gradient methods for continuous control. In this work, we investigated how to use value-based methods under continuous actions, but did not state that value-based methods were preferable over policy gradient methods. However, there are several potential benefits of value-based methods that merit further exploration. One advantage is that Q-learning easily incorporates off-policy samples, potentially improving sample complexity, whereas with policy gradient methods, it comes at the cost of introducing bias. Although some off-policy policy gradient methods like DDPG have achieved high performance in benchmark domains, they are also known to suffer from brittleness and hyperparameter sensitivity [\citeauthoryearDuan et al.2016, \citeauthoryearHenderson et al.2017]. Another more speculative advantage is in terms of the optimization surface. The Q-learning update converges to optimal values in tabular settings and linear function approximation [\citeauthoryearMelo and Ribeiro2007]. Policy gradient methods, on the other hand, can have local minima, even in the tabular setting. One goal with Actor-Expert is to improve value-based methods for continuous actions, and so facilitate investigation into these hypotheses, without being limited by difficulties in action selection.

## References

• [\citeauthoryearAmos, Xu, and Kolter2016] Amos, B.; Xu, L.; and Kolter, J. Z. 2016. Input convex neural networks. CoRR abs/1609.07152.
• [\citeauthoryearBaird and Klopf1993] Baird, L. C., and Klopf, A. H. 1993. Reinforcement learning with high-dimensional, continuous actions. Wright Laboratory.
• [\citeauthoryearBaird1993] Baird, L. C. 1993. Advantage updating. Technical report, DTIC Document.
• [\citeauthoryearBishop1994] Bishop, C. M. 1994. Mixture density networks. Technical report.
• [\citeauthoryearBorkar1997] Borkar, V. S. 1997. Stochastic approximation with two time scales. Systems & Control Letters 29(5):291–294.
• [\citeauthoryearBorkar2008] Borkar, V. S. 2008. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press.
• [\citeauthoryearBrockman et al.2016] Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. Openai gym.
• [\citeauthoryearDegris, White, and Sutton2012] Degris, T.; White, M.; and Sutton, R. S. 2012. Off-policy actor-critic. CoRR abs/1205.4839.
• [\citeauthoryeardel R Millán, Posenato, and Dedieu2002] del R Millán, J.; Posenato, D.; and Dedieu, E. 2002. Continuous-Action Q-Learning. Machine Learning.
• [\citeauthoryearDuan et al.2016] Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; and Abbeel, P. 2016. Benchmarking deep reinforcement learning for continuous control. CoRR abs/1604.06778.
• [\citeauthoryearDurrett1991] Durrett, R. 1991. Probability: Theory and Examples. The Wadsworth & Brooks/Cole Statistics/Probability Series. Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, CA.
• [\citeauthoryearGreensmith, Bartlett, and Baxter2004] Greensmith, E.; Bartlett, P. L.; and Baxter, J. 2004. Variance reduction techniques for gradient estimates in reinforcement learning. J. Mach. Learn. Res. 5:1471–1530.
• [\citeauthoryearGu et al.2016a] Gu, S.; Lillicrap, T. P.; Ghahramani, Z.; Turner, R. E.; and Levine, S. 2016a. Q-prop: Sample-efficient policy gradient with an off-policy critic. CoRR abs/1611.02247.
• [\citeauthoryearGu et al.2016b] Gu, S.; Lillicrap, T. P.; Sutskever, I.; and Levine, S. 2016b. Continuous deep q-learning with model-based acceleration. CoRR abs/1603.00748.
• [\citeauthoryearHaarnoja et al.2017] Haarnoja, T.; Tang, H.; Abbeel, P.; and Levine, S. 2017. Reinforcement learning with deep energy-based policies. CoRR abs/1702.08165.
• [\citeauthoryearHansen, Müller, and Koumoutsakos2003] Hansen, N.; Müller, S. D.; and Koumoutsakos, P. 2003. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (cma-es). Evol. Comput. 11(1):1–18.
• [\citeauthoryearHarmon and Baird1996a] Harmon, M. E., and Baird, L. C. 1996a. Advantage learning applied to a game with nonlinear dynamics and a nonlinear function approximator. Proceedings of the International Conference on Neural Networks (ICNN).
• [\citeauthoryearHarmon and Baird1996b] Harmon, M. E., and Baird, L. C. 1996b. Multi-player residual advantage learning with general function approximation. Technical report, Wright Laboratory.
• [\citeauthoryearHenderson et al.2017] Henderson, P.; Islam, R.; Bachman, P.; Pineau, J.; Precup, D.; and Meger, D. 2017. Deep reinforcement learning that matters. CoRR abs/1709.06560.
• [\citeauthoryearHoeffding1963] Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. Journal of the American statistical association 58(301):13–30.
• [\citeauthoryearHomem-de Mello2007] Homem-de Mello, T. 2007. A study on the cross-entropy method for rare-event probability estimation. INFORMS Journal on Computing 19(3):381–394.
• [\citeauthoryearHu, Fu, and Marcus2007] Hu, J.; Fu, M. C.; and Marcus, S. I. 2007. A model reference adaptive search method for global optimization. Operations Research 55(3):549–568.
• [\citeauthoryearKroese, Porotsky, and Rubinstein2006] Kroese, D. P.; Porotsky, S.; and Rubinstein, R. Y. 2006. The cross-entropy method for continuous multi-extremal optimization. Methodology and Computing in Applied Probability 8(3):383–407.
• [\citeauthoryearKushner and Clark2012] Kushner, H. J., and Clark, D. S. 2012. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer Science & Business Media.
• [\citeauthoryearLillicrap et al.2015] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. CoRR abs/1509.02971.
• [\citeauthoryearMannor, Rubinstein, and Gat2003] Mannor, S.; Rubinstein, R.; and Gat, Y. 2003. The cross entropy method for fast policy search. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, 512–519. AAAI Press.
• [\citeauthoryearMelo and Ribeiro2007] Melo, F. S., and Ribeiro, M. I. 2007. Convergence of q-learning with linear function approximation. In European Control Conference.
• [\citeauthoryearMnih et al.2013] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. A. 2013. Playing atari with deep reinforcement learning. CoRR abs/1312.5602.
• [\citeauthoryearMnih et al.2016] Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T. P.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. CoRR abs/1602.01783.
• [\citeauthoryearMorris1982] Morris, C. N. 1982. Natural exponential families with quadratic variance functions. The Annals of Statistics 65–80.
• [\citeauthoryearPeters and Schaal2007] Peters, J., and Schaal, S. 2007. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, 745–750. ACM.
• [\citeauthoryearPeters and Schaal2008] Peters, J., and Schaal, S. 2008. Natural actor-critic. Neurocomputing 71(7):1180 – 1190.
• [\citeauthoryearRobbins and Monro1985] Robbins, H., and Monro, S. 1985. A stochastic approximation method. In Herbert Robbins Selected Papers. Springer. 102–109.
• [\citeauthoryearRubinstein and Shapiro1993] Rubinstein, R. Y., and Shapiro, A. 1993. Discrete event systems: Sensitivity analysis and stochastic optimization by the score function method, volume 1. Wiley New York.
• [\citeauthoryearRubinstein1999] Rubinstein, R. 1999. The cross-entropy method for combinatorial and continuous optimization. Methodology And Computing In Applied Probability 1(2):127–190.
• [\citeauthoryearSalimans et al.2017] Salimans, T.; Ho, J.; Chen, X.; and Sutskever, I. 2017. Evolution strategies as a scalable alternative to reinforcement learning. CoRR abs/1703.03864.
• [\citeauthoryearSchulman et al.2015] Schulman, J.; Levine, S.; Moritz, P.; Jordan, M. I.; and Abbeel, P. 2015. Trust region policy optimization. CoRR abs/1502.05477.
• [\citeauthoryearSchulman et al.2016] Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; and Abbeel, P. 2016. High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations.
• [\citeauthoryearSchulman et al.2017] Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. CoRR abs/1707.06347.
• [\citeauthoryearSen and Singer2017] Sen, P. K., and Singer, J. M. 2017. Large Sample Methods in Statistics (1994): An Introduction with Applications. CRC Press.
• [\citeauthoryearSilver et al.2014] Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, I–387–I–395. JMLR.org.
• [\citeauthoryearSutton et al.2000] Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems.
• [\citeauthoryearSutton1984] Sutton, R. S. 1984. Temporal Credit Assignment in Reinforcement Learning. Ph.D. Dissertation.
• [\citeauthoryearSzita and Lörincz2006] Szita, I., and Lörincz, A. 2006. Learning tetris using the noisy cross-entropy method. Neural Computation 18(12):2936–2941.
• [\citeauthoryearTodorov, Erez, and Tassa2012] Todorov, E.; Erez, T.; and Tassa, Y. 2012. Mujoco: A physics engine for model-based control. In IROS, 5026–5033. IEEE.
• [\citeauthoryearUhlenbeck and Ornstein1930] Uhlenbeck, G. E., and Ornstein, L. S. 1930. On the theory of the brownian motion. Physical review, 36(5):823.
• [\citeauthoryearvan Hasselt, Guez, and Silver2015] van Hasselt, H.; Guez, A.; and Silver, D. 2015. Deep reinforcement learning with double q-learning. CoRR abs/1509.06461.
• [\citeauthoryearWang et al.2016] Wang, Z.; Bapst, V.; Heess, N.; Mnih, V.; Munos, R.; Kavukcuoglu, K.; and de Freitas, N. 2016. Sample efficient actor-critic with experience replay. CoRR abs/1611.01224.
• [\citeauthoryearWatkins and Dayan1992] Watkins, C. J. C. H., and Dayan, P. 1992. Q-learning. In Machine Learning, 279–292.
• [\citeauthoryearWu et al.2017] Wu, Y.; Mansimov, E.; Liao, S.; Grosse, R. B.; and Ba, J. 2017. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. CoRR abs/1708.05144.

Supplementary Material

## Appendix A 1 Convergence Analysis

In this section, we prove that the stochastic Conditional Cross-Entropy Method update for the Actor tracks an underlying deterministic ODE for the expected Cross-Entropy update over states. We begin by providing some definitions, particularly for the quantile function, which is central to the analysis. We then lay out the assumptions, and discuss some policy parameterizations that satisfy those assumptions. We finally state the theorem, with proof, and provide one lemma needed to prove the theorem in the final subsection.

### 1.1 Notation and Definitions

Notation: For a set A, let \mathring{A} represent the interior of A, while \partial A is the boundary of A. The abbreviation a.s. stands for almost surely and i.o. stands for infinitely often. Let \mathbb{N} represent the set \{0,1,2,\dots\}. For a set A, we let I_{A} be the indicator function (characteristic function) of A, defined as I_{A}(x)=1 if x\in A and 0 otherwise. Let \mathbb{E}_{g}[\cdot], \mathbb{V}_{g}[\cdot] and \mathbb{P}_{g}(\cdot) denote the expectation, variance and probability measure w.r.t. g. For a \sigma-field \mathcal{F}, let \mathbb{E}\left[\cdot|\mathcal{F}\right] represent the conditional expectation w.r.t. \mathcal{F}. A function f:X\rightarrow Y is called Lipschitz continuous if \exists L\in(0,\infty) s.t. \|f(\mathbf{x}_{1})-f(\mathbf{x}_{2})\|\leq L\|\mathbf{x}_{1}-\mathbf{x}_{2}\|, \forall\mathbf{x}_{1},\mathbf{x}_{2}\in X. A function f is called locally Lipschitz continuous if for every \mathbf{x}\in X, there exists a neighbourhood U of \mathbf{x} such that f_{|U} is Lipschitz continuous. Let C(X,Y) represent the space of continuous functions from X to Y. Also, let B_{r}(\mathbf{x}) represent an open ball of radius r centered at \mathbf{x}. For a positive integer M, let [M]:=\{1,2,\dots,M\}.

###### Definition 1.

A function \Gamma:U\subseteq{\rm I\!R}^{d_{1}}\rightarrow V\subseteq{\rm I\!R}^{d_{2}} is Frechet differentiable at \mathbf{x}\in U if there exists a bounded linear operator \widehat{\Gamma}_{\mathbf{x}}:{\rm I\!R}^{d_{1}}\rightarrow{\rm I\!R}^{d_{2}} such that the limit

 \displaystyle\lim_{\epsilon\downarrow 0}\frac{\Gamma(\mathbf{x}+\epsilon\mathbf{y})-\Gamma(\mathbf{x})}{\epsilon} (3)

exists and is equal to \widehat{\Gamma}_{\mathbf{x}}(\mathbf{y}). We say \Gamma is Frechet differentiable if the Frechet derivative of \Gamma exists at every point in its domain.

###### Definition 2.

Given a bounded real-valued continuous function H:{\rm I\!R}^{d}\rightarrow{\rm I\!R} with H(a)\in[H_{l},H_{u}] and a scalar \rho\in[0,1], we define the (1-\rho)-quantile of H(A) w.r.t. the PDF g (denoted as f^{\rho}(H,g)) as follows:

 \displaystyle f^{\rho}(H,g):=\sup_{\ell\in[H_{l},H_{u}]}\left\{\mathbb{P}_{g}\big(H(A)\geq\ell\big)\geq\rho\right\}, (4)

where \mathbb{P}_{g} is the probability measure induced by the PDF g, i.e., for a Borel set \mathcal{A}, \mathbb{P}_{g}(\mathcal{A}):=\int_{\mathcal{A}}g(a)da.

This quantile operator will be used to succinctly write the quantile for Q_{\theta}(S,\cdot), with actions selected according to \pi_{\mathbf{w}}, i.e.,

 f_{\theta}^{\rho}(\mathbf{w};s):=f^{\rho}(Q_{\theta}(s,\cdot),\pi_{\mathbf{w}}(\cdot|s))=\sup_{\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]}\left\{\mathbb{P}_{\pi_{\mathbf{w}}(\cdot|s)}\big(Q_{\theta}(s,A)\geq\ell\big)\geq\rho\right\}. (5)
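In practice the quantile in Eq. (5) is estimated from sampled actions. The following is a minimal sketch of one such sample estimate, assuming a finite batch of action-values; the function `empirical_quantile` and the bimodal surface `q_example` are hypothetical names introduced only for illustration.

```python
import numpy as np

def empirical_quantile(q_values, rho):
    """Largest sample value l such that at least a fraction rho
    of the samples satisfy q >= l (a sample (1 - rho)-quantile)."""
    q_sorted = np.sort(np.asarray(q_values, dtype=float))[::-1]  # descending order
    n = q_sorted.size
    k = max(1, int(np.ceil(rho * n)))  # number of samples required at or above l
    return q_sorted[k - 1]

def q_example(a):
    # Hypothetical bimodal action-values over a one-dimensional action.
    return np.exp(-(a - 1.0) ** 2 / 0.1) + 0.7 * np.exp(-(a + 1.0) ** 2 / 0.1)

rng = np.random.default_rng(0)
actions = rng.normal(loc=0.0, scale=1.0, size=1000)  # stand-in for draws from pi_w(.|s)
q_vals = q_example(actions)
threshold = empirical_quantile(q_vals, rho=0.2)
elite_fraction = np.mean(q_vals >= threshold)  # at least rho by construction
```

Actions whose value meets the threshold play the role of the indicator set in the Actor update.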

### 1.2 Assumptions

###### Assumption 1.

We are given a realization of the transition dynamics of the MDP in the form of a sequence of transition tuples \mathcal{O}:=\{(S_{t},A_{t},R_{t},S^{\prime}_{t})\}_{t\in\mathbb{N}}, where the state S_{t}\in\mathcal{S} is drawn using a latent sampling distribution \nu, A_{t}\in\mathcal{A} is the action chosen at state S_{t}, the transitioned state is \mathcal{S}\ni S^{\prime}_{t}\sim P(S_{t},A_{t},\cdot), and the reward is {\rm I\!R}\ni R_{t}:=R(S_{t},A_{t},S^{\prime}_{t}). We further assume that the reward is uniformly bounded, i.e., |R(\cdot,\cdot,\cdot)|<R_{max}<\infty.

Here, we analyze the long run behaviour of the conditional cross-entropy recursion (actor) which is defined as follows:

 \displaystyle\mathbf{w}_{t+1}:=\Gamma^{W}\left\{\mathbf{w}_{t}+\alpha_{a,t}\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}I_{\{Q_{\theta_{t}}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|S_{t})\right\}, (6)
 \displaystyle\qquad\text{where }\Xi_{t}:=\{A_{t,1},A_{t,2},\dots,A_{t,N_{t}}\}\overset{\text{iid}}{\sim}\pi_{\mathbf{w}^{\prime}_{t}}(\cdot|S_{t}).
 \displaystyle\mathbf{w}^{\prime}_{t+1}:=\mathbf{w}^{\prime}_{t}+\alpha^{\prime}_{a,t}\left(\mathbf{w}_{t+1}-\mathbf{w}^{\prime}_{t}\right). (7)

Here, \Gamma^{W}\{\cdot\} is the projection operator onto the compact (closed and bounded) and convex set W\subset{\rm I\!R}^{m} with a smooth boundary \partial W. Therefore, \Gamma^{W} maps vectors in {\rm I\!R}^{m} to the nearest vectors in W w.r.t. the Euclidean distance (or equivalent metric). Convexity and compactness ensure that the projection is unique and belongs to W.
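To make the recursion concrete, here is a minimal sketch of one Actor update under simplifying assumptions that are not taken from the paper: a linear-Gaussian policy with fixed standard deviation, W chosen as a Euclidean ball, constant step sizes, a constant sample size, and a hypothetical action-value function `q_toy`. All function names are illustrative.

```python
import numpy as np

def project_ball(w, radius):
    """Euclidean projection onto {w : ||w||_2 <= radius} (assumed choice of W)."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def ccem_actor_step(w, w_prime, s, q_fn, rng, n_samples=100, rho=0.2,
                    alpha=0.1, alpha_prime=0.05, sigma=0.5, radius=10.0):
    """One update of (w, w') for the policy pi_w(a|s) = N(w.s, sigma^2)."""
    # Draw N_t actions from the slowly moving sampling policy pi_{w'}(.|s).
    actions = w_prime @ s + sigma * rng.standard_normal(n_samples)
    q_vals = q_fn(s, actions)
    # Sample (1 - rho)-quantile: the k-th largest value, k = ceil(rho * N_t).
    k = max(1, int(np.ceil(rho * n_samples)))
    threshold = np.sort(q_vals)[::-1][k - 1]
    elite = actions[q_vals >= threshold]
    # Indicator-weighted average of score functions grad_w ln pi_w(a|s).
    grad = np.sum((elite - w @ s) / sigma**2) * s / n_samples
    w_new = project_ball(w + alpha * grad, radius)
    # Slow averaging of the sampling parameters toward the new Actor parameters.
    w_prime_new = w_prime + alpha_prime * (w_new - w_prime)
    return w_new, w_prime_new

def q_toy(s, a):
    return -(a - 1.0) ** 2  # hypothetical action-values with greedy action a = 1

rng = np.random.default_rng(0)
s = np.array([1.0])
w, w_prime = np.zeros(1), np.zeros(1)
for _ in range(500):
    w, w_prime = ccem_actor_step(w, w_prime, s, q_toy, rng)
```

In this toy setting the policy mean `w @ s` drifts toward the maximizer of `q_toy`. The constant step sizes and sample size are a convenience for the sketch and do not satisfy the decaying and growing schedules assumed in the analysis.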

###### Assumption 2.

The pre-determined, deterministic, step-size sequences \{\alpha_{a,t}\}_{t\in\mathbb{N}}, \{\alpha^{\prime}_{a,t}\}_{t\in\mathbb{N}} and \{\alpha_{q,t}\}_{t\in\mathbb{N}} are positive scalars which satisfy the following:

 \displaystyle\sum_{t\in\mathbb{N}}\alpha_{a,t}=\sum_{t\in\mathbb{N}}\alpha^{\prime}_{a,t}=\sum_{t\in\mathbb{N}}\alpha_{q,t}=\infty,\qquad\sum_{t\in\mathbb{N}}\left(\alpha^{2}_{a,t}+{\alpha^{\prime}}^{2}_{a,t}+\alpha^{2}_{q,t}\right)<\infty,
 \displaystyle\lim_{t\rightarrow\infty}\frac{\alpha^{\prime}_{a,t}}{\alpha_{a,t}}=0,\qquad\lim_{t\rightarrow\infty}\frac{\alpha_{q,t}}{\alpha_{a,t}}=0.

The first conditions in Assumption 2 are the classical Robbins-Monro conditions [\citeauthoryearRobbins and Monro1985] required for stochastic approximation algorithms. The last two conditions enable the different stochastic recursions to have separate timescales. Indeed, they ensure that the \mathbf{w}_{t} recursion is relatively faster compared to the recursions of \theta_{t} and \mathbf{w}^{\prime}_{t}. This timescale divide is needed to obtain the pursued coherent asymptotic behaviour, as we describe in the next section.
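One family of schedules consistent with these conditions is polynomial decay with exponents in (0.5, 1], using a smaller exponent for the faster recursion. The sketch below is an assumed example, not the schedules used in the experiments; the exponents 0.6 and 1.0 are arbitrary choices within that range.

```python
import numpy as np

def alpha_a(t):
    return 1.0 / (t + 1) ** 0.6   # fast timescale: Actor parameters w

def alpha_a_prime(t):
    return 1.0 / (t + 1)          # slow timescale: sampling parameters w'

def alpha_q(t):
    return 1.0 / (t + 1)          # slow timescale: Expert parameters theta

t = np.array([1, 10, 100, 1000, 10000, 100000], dtype=float)
ratio_w_prime = alpha_a_prime(t) / alpha_a(t)  # equals (t + 1) ** -0.4
ratio_q = alpha_q(t) / alpha_a(t)
```

Each sequence sums to infinity because its exponent is at most 1, and is square-summable because its exponent exceeds 0.5; both ratios shrink toward zero as t grows.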

###### Assumption 3.

The pre-determined, deterministic, sample length schedule \{N_{t}\in\mathbb{N}\}_{t\in\mathbb{N}} is positive and strictly monotonically increases to \infty and \inf_{t\in\mathbb{N}}\frac{N_{t+1}}{N_{t}}>1.

Assumption 3 states that the number of samples increases to infinity and is primarily required to ensure that the estimation error arising due to the estimation of sample quantiles eventually decays to 0. Practically, one can indeed consider a fixed, finite, positive integer for N_{t} which is large enough to accommodate the acceptable error.
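A geometric schedule is one simple way to meet the growth condition in Assumption 3. The sketch below assumes a hypothetical starting size and growth factor; rounding up keeps every ratio N_{t+1}/N_t at or above the growth factor.

```python
import math

def sample_schedule(n0, growth, steps):
    """Return [N_0, ..., N_{steps-1}] with N_{t+1} = ceil(growth * N_t), growth > 1."""
    sizes = [n0]
    for _ in range(steps - 1):
        sizes.append(math.ceil(growth * sizes[-1]))
    return sizes

sizes = sample_schedule(n0=10, growth=1.5, steps=6)
ratios = [b / a for a, b in zip(sizes, sizes[1:])]
```

Such growth quickly becomes expensive, which is why a fixed, sufficiently large N_t is the practical choice mentioned above.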

###### Assumption 4.

The sequence \{\theta_{t}\}_{t\in\mathbb{N}} satisfies \theta_{t}\in\Theta, where \Theta \subset{\rm I\!R}^{n} is a convex, compact set. Also, for \theta\in\Theta, let Q_{\theta}(s,a)\in[Q^{\theta}_{l},Q^{\theta}_{u}], \forall s\in\mathcal{S},a\in\mathcal{A}.

Assumption 4 assumes stability of the Expert, and minimally only requires that the values remain in a bounded range. We make no additional assumptions on the convergence properties of the Expert, as we simply need stability to prove that the Actor tracks the desired update.

###### Assumption 5.

For \theta\in\Theta and s\in\mathcal{S}, let \mathbb{P}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left(Q_{\theta}(s,A)\geq\ell\right)>0, \forall\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}] and \forall\mathbf{w}^{\prime}\in W.

Assumption 5 implies that there always exists a strictly positive probability mass beyond every threshold \ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]. This assumption is easily satisfied when Q_{\theta}(s,a) is continuous in a and \pi_{\mathbf{w}}(\cdot|s) is a continuous probability density function.

###### Assumption 6.

 \displaystyle\sup_{\substack{\mathbf{w},\mathbf{w}^{\prime}\in W,\\ \theta\in\Theta,\,\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]}}\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\Big[\Big\|I_{\{Q_{\theta}(S,A)\geq\ell\}}\nabla_{\mathbf{w}}\ln\pi_{\mathbf{w}}(A|S)-\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\left[I_{\{Q_{\theta}(S,A)\geq\ell\}}\nabla_{\mathbf{w}}\ln\pi_{\mathbf{w}}(A|S)\big|S\right]\Big\|_{2}^{2}\Big|S\Big]<\infty\quad a.s.,
 \displaystyle\sup_{\substack{\mathbf{w},\mathbf{w}^{\prime}\in W,\\ \theta\in\Theta,\,\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]}}\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\left[\Big\|I_{\{Q_{\theta}(S,A)\geq\ell\}}\nabla_{\mathbf{w}}\ln\pi_{\mathbf{w}}(A|S)\Big\|_{2}^{2}\Big|S\right]<\infty\quad a.s.

###### Assumption 7.

For s\in\mathcal{S}, \nabla_{\mathbf{w}}\ln{\pi_{\mathbf{w}}(\cdot|s)} is locally Lipschitz continuous w.r.t. \mathbf{w}.

Assumptions 6 and 7 are technical requirements and can be justified and more appropriately characterized when we consider \pi_{\mathbf{w}} to belong to the most popular natural exponential family (NEF) of distributions.

###### Definition 3.

Natural exponential family of distributions (NEF)[\citeauthoryearMorris1982]: These probability distributions over {\rm I\!R}^{m} are represented by

 \{\pi_{\eta}(\mathbf{x}):=h(\mathbf{x})e^{\eta^{\top}T(\mathbf{x})-K(\eta)}\mid\eta\in\Lambda\subset{\rm I\!R}^{d}\}, (8)

where \eta is the natural parameter, h:{\rm I\!R}^{m}\longrightarrow{\rm I\!R}, T:{\rm I\!R}^{m}\longrightarrow{\rm I\!R}^{d} is the sufficient statistic, and K(\eta):=\ln{\int{h(\mathbf{x})e^{\eta^{\top}T(\mathbf{x})}d\mathbf{x}}} is the cumulant function of the family. The space \Lambda is defined as \Lambda:=\{\eta\in{\rm I\!R}^{d}\mid|K(\eta)|<\infty\}. Also, the above representation is assumed minimal. (For a distribution in NEF, there may exist multiple representations of the form (8). However, there always exists a representation in which the components of the sufficient statistic are linearly independent, and such a representation is referred to as minimal.) A few popular distributions which belong to the NEF family include Binomial, Poisson, Bernoulli, Gaussian, Geometric and Exponential distributions.

We parametrize the policy \pi_{\mathbf{w}}(\cdot|S) using a neural network, which implies that when we consider NEF for the stochastic policy, the natural parameter \eta of the NEF is being parametrized by \mathbf{w}. To be more specific, we have \{\psi_{\mathbf{w}}:\mathcal{S}\rightarrow\Lambda|\mathbf{w}\in{\rm I\!R}^{m}\} to be the function space induced by the neural network of the actor, i.e., for a given state s\in\mathcal{S}, \psi_{\mathbf{w}}(s) represents the natural parameter of the NEF policy \pi_{\mathbf{w}}(\cdot|s). Further,

 \displaystyle\nabla_{\mathbf{w}}\ln\pi_{\mathbf{w}}(A|S)=\nabla_{\mathbf{w}}\Big(\ln{(h(A))}+\psi_{\mathbf{w}}(S)^{\top}T(A)-K(\psi_{\mathbf{w}}(S))\Big)
 \displaystyle\qquad=\nabla_{\mathbf{w}}\psi_{\mathbf{w}}(S)\left(T(A)-\nabla_{\eta}K(\psi_{\mathbf{w}}(S))\right)
 \displaystyle\qquad=\nabla_{\mathbf{w}}\psi_{\mathbf{w}}(S)\left(T(A)-\mathbb{E}_{A\sim\pi_{\mathbf{w}}(\cdot|S)}\left[T(A)\right]\right). (9)

Therefore Assumption 7 can be directly satisfied by assuming that \psi_{\mathbf{w}} is twice continuously differentiable w.r.t. \mathbf{w}.
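As a quick numerical sanity check of the score identity for an NEF policy, the sketch below uses a unit-variance Gaussian, for which T(a)=a and K(\eta)=\eta^{2}/2, together with an assumed linear map \psi_{\mathbf{w}}(s)=\mathbf{w}^{\top}s. The linear map and all function names are hypothetical choices made for illustration; the paper parametrizes \psi_{\mathbf{w}} with a neural network.

```python
import numpy as np

def log_pi(w, s, a):
    eta = w @ s  # natural parameter psi_w(s); a linear map is assumed here
    log_h = -0.5 * a**2 - 0.5 * np.log(2 * np.pi)
    return log_h + eta * a - 0.5 * eta**2  # ln h(a) + eta * T(a) - K(eta)

def score_closed_form(w, s, a):
    eta = w @ s
    # grad_w psi_w(s) * (T(a) - E[T(A)]), where E[T(A)] = eta for this family.
    return s * (a - eta)

def score_finite_diff(w, s, a, eps=1e-6):
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (log_pi(w + step, s, a) - log_pi(w - step, s, a)) / (2 * eps)
    return grad

w = np.array([0.3, -0.2])
s = np.array([1.0, 2.0])
a = 0.7
closed = score_closed_form(w, s, a)
numeric = score_finite_diff(w, s, a)
```

The closed-form score and the central finite-difference estimate agree up to rounding error in this example.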

The next assumption is a standard assumption that sample average converges with an exponential rate in the number of samples. The assumption reflects that this should be true for arbitrary \mathbf{w}\in W.

###### Assumption 8.

For \epsilon>0 and N\in\mathbb{N}, we have

 \displaystyle\mathbb{P}_{\Xi\overset{\text{iid}}{\sim}\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\Big(\Big\|\frac{1}{N}\sum_{A\in\Xi}I_{\{Q_{\theta}(s,A)\geq f^{\rho}(Q_{\theta}(s,\cdot),\pi_{\mathbf{w}^{\prime}}(\cdot|s))\}}\nabla_{\mathbf{w}}\ln\pi_{\mathbf{w}}(A|s)-
 \displaystyle\qquad\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left[I_{\{Q_{\theta}(s,A)\geq\widehat{f}^{\rho}(Q_{\theta}(s,\cdot),\pi_{\mathbf{w}^{\prime}}(\cdot|s))\}}\nabla_{\mathbf{w}}\ln\pi_{\mathbf{w}}(A|s)\right]\Big\|\geq\epsilon\Big)\leq C_{1}\exp{\left(-c_{2}N^{c_{3}}\epsilon^{c_{4}}\right)},
 \displaystyle\qquad\forall\theta\in\Theta,\ \mathbf{w},\mathbf{w}^{\prime}\in W,\ s\in\mathcal{S},

where C_{1},c_{2},c_{3},c_{4}>0.

###### Assumption 9.

For every \theta\in\Theta, s\in\mathcal{S} and \mathbf{w}\in W, f_{\theta}^{\rho}(\mathbf{w};s) (from Eq. (5)) exists and is unique.

The above assumption ensures that the true (1-\rho)-quantile is unique and the assumption is usually satisfied for most distributions and a well-behaved Q_{\theta}.

### 1.3 Main Theorem

To analyze the algorithm, we employ here the ODE-based analysis as proposed in [\citeauthoryearBorkar2008, \citeauthoryearKushner and Clark2012]. The actor recursions (Eqs. (6-7)) represent a classical two timescale stochastic approximation recursion, where there exists a bilateral coupling between the individual stochastic recursions (6) and (7). Since the step-size schedules \{\alpha_{a,t}\}_{t\in\mathbb{N}} and \{\alpha^{\prime}_{a,t}\}_{t\in\mathbb{N}} satisfy \frac{\alpha^{\prime}_{a,t}}{\alpha_{a,t}}\rightarrow 0, we have \alpha^{\prime}_{a,t}\rightarrow 0 relatively faster than \alpha_{a,t}\rightarrow 0. This disparity induces a pseudo-heterogeneous rate of convergence (or timescales) between the individual stochastic recursions, which further amounts to the asymptotic emergence of a stable coherent behaviour which is quasi-asynchronous. This pseudo-behaviour can be interpreted using multiple viewpoints, i.e., when viewed from the faster timescale recursion (recursion controlled by \alpha_{a,t}), the slower timescale recursion (recursion controlled by \alpha^{\prime}_{a,t}) appears quasi-static (‘almost a constant’); likewise, when observed from the slower timescale, the faster timescale recursion seems equilibrated. The existence of this stable long run behaviour under certain standard assumptions of stochastic approximation algorithms is rigorously established in [\citeauthoryearBorkar1997] and also in Chapter 6 of [\citeauthoryearBorkar2008]. For our stochastic approximation setting (Eqs. (6-7)), we can directly apply this appealing characterization of the long run behaviour of the two timescale stochastic approximation algorithms, after ensuring the compliance of our setting with the pre-requisites demanded by the characterization, by considering the slow timescale stochastic recursion (7) to be quasi-stationary (i.e., {\mathbf{w}^{\prime}}_{t}\equiv\mathbf{w}^{\prime}, a.s., \forall t\in\mathbb{N}), while analyzing the limiting behaviour of the faster timescale recursion (6). Similarly, we let \theta_{t} be quasi-stationary too (i.e., \theta_{t}\equiv\theta, a.s., \forall t\in\mathbb{N}). The asymptotic behaviour of the slower timescale recursion is further analyzed by considering the faster timescale temporal variable \mathbf{w}_{t} with the limit point so obtained during quasi-stationary analysis.

Define the filtration \{\mathcal{F}_{t}\}_{t\in\mathbb{N}}, a family of increasing natural \sigma-fields, where \mathcal{F}_{t}:=\sigma\left(\{\mathbf{w}_{i},\mathbf{w}^{\prime}_{i},(S_{i},A_{i},R_{i},S^{\prime}_{i}),\Xi_{i};0\leq i\leq t\}\right).

###### Theorem 2.

Let \mathbf{w}^{\prime}_{t}\equiv\mathbf{w}^{\prime},\theta_{t}\equiv\theta,\forall t\in\mathbb{N} a.s. Let Assumptions 1-9 hold. Then the stochastic sequence \{\mathbf{w}_{t}\}_{t\in\mathbb{N}} generated by the stochastic recursion (6) asymptotically tracks the following ODE:

 \displaystyle\frac{d}{dt}\mathbf{w}(t)=\widehat{\Gamma}^{W}_{\mathbf{w}(t)}\left(\nabla_{\mathbf{w}(t)}\mathbb{E}_{S\sim\nu,A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\Big[I_{\{Q_{\theta}(S,A)\geq f_{\theta}^{\rho}(\mathbf{w}^{\prime};S)\}}\ln\pi_{\mathbf{w}(t)}(A|S)\Big]\right),\qquad t\geq 0. (10)

In other words, \lim_{t\rightarrow\infty}\mathbf{w}_{t}\in\mathcal{K} a.s., where \mathcal{K} is the set of stable equilibria of the ODE (10) contained inside W.

###### Proof.

Firstly, we rewrite the stochastic recursion (6) under the hypothesis that \theta_{t} and \mathbf{w}^{\prime}_{t} are quasi-stationary, i.e., \theta_{t}\underset{a.s.}{\equiv}\theta and \mathbf{w}^{\prime}_{t}\underset{a.s.}{\equiv}\mathbf{w}^{\prime} as follows:

 \displaystyle\mathbf{w}_{t+1}:=\Gamma^{W}\left\{\mathbf{w}_{t}+\alpha_{a,t}\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}I_{\{Q_{\theta}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|S_{t})\right\}
 \displaystyle\quad=\Gamma^{W}\Bigg\{\mathbf{w}_{t}+\alpha_{a,t}\Bigg(\mathbb{E}_{S_{t}\sim\nu,A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S_{t})}\Big[I_{\{Q_{\theta}(S_{t},A)\geq f_{\theta}^{\rho}(\mathbf{w}^{\prime};S_{t})\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|S_{t})\Big]
 \displaystyle\qquad-\mathbb{E}_{S_{t}\sim\nu,A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S_{t})}\Big[I_{\{Q_{\theta}(S_{t},A)\geq f_{\theta}^{\rho}(\mathbf{w}^{\prime};S_{t})\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|S_{t})\Big]
 \displaystyle\qquad+\mathbb{E}\bigg[\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}I_{\{Q_{\theta}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|S_{t})\bigg|\mathcal{F}_{t}\bigg]
 \displaystyle\qquad-\mathbb{E}\bigg[\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}I_{\{Q_{\theta}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|S_{t})\bigg|\mathcal{F}_{t}\bigg]
 \displaystyle\qquad+\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}I_{\{Q_{\theta}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|S_{t})\Bigg)\Bigg\}
 \displaystyle\quad=\Gamma^{W}\Big\{\mathbf{w}_{t}+\alpha_{a,t}\big(g^{\theta}(\mathbf{w}_{t})+\mathbb{M}_{t+1}+\ell^{\theta}_{t}\big)\Big\}, (11)

where f_{\theta}^{\rho}(\mathbf{w}^{\prime};S):=f^{\rho}(Q_{\theta}(S,\cdot),\pi_{% \mathbf{w}^{\prime}}(\cdot|S)) and \nabla_{\mathbf{w}_{t}}:=\nabla_{\mathbf{w}=\mathbf{w}_{t}}, i.e., the gradient w.r.t. \mathbf{w} at \mathbf{w}_{t}. Also,

 \displaystyle g^{\theta}(\mathbf{w}) \displaystyle:=\mathbb{E}_{S_{t}\sim\nu,A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S% _{t})}\Big{[}I_{\{Q_{\theta}(S_{t},A)\geq f_{\theta}^{\rho}(\mathbf{w}^{\prime% };S_{t})\}}\nabla_{\mathbf{w}}\ln\pi_{\mathbf{w}}(A|S_{t})\Big{]}. (12)
 \displaystyle\mathbb{M}_{t+1} \displaystyle:=\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}{I_{\{Q_{\theta}(S_{t},A)\geq% \widehat{f}^{\rho}_{t+1}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|S_{t})}- \displaystyle         \mathbb{E}\Bigg{[}\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}{I_{% \{Q_{\theta}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}\nabla_{\mathbf{w}_{t}}\ln% \pi_{\mathbf{w}}(A|S_{t})}\Big{|}\mathcal{F}_{t}\Bigg{]}. (13)
 \displaystyle\ell^{\theta}_{t} \displaystyle:=\mathbb{E}\bigg{[}\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}{I_{\{Q_{% \theta}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{% \mathbf{w}}(A|S_{t})}\bigg{|}\mathcal{F}_{t}\bigg{]}- \displaystyle         \mathbb{E}_{S_{t}\sim\nu,A\sim\pi_{\mathbf{w}^{\prime}}(% \cdot|S_{t})}\Big{[}I_{\{Q_{\theta}(S_{t},A)\geq f_{\theta}^{\rho}(\mathbf{w}^% {\prime};S_{t})\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|S_{t})\Big{]} (14)

A few observations are in order:

B1. \{\mathbb{M}_{t+1}\}_{t\in\mathbb{N}} is a martingale difference noise sequence w.r.t. the filtration \{\mathcal{F}_{t}\}_{t\in\mathbb{N}}, i.e., \mathbb{M}_{t+1} is \mathcal{F}_{t+1}-measurable and integrable, \forall t\in\mathbb{N} and \mathbb{E}\left[\mathbb{M}_{t+1}|\mathcal{F}_{t}\right]=0 a.s., \forall t\in\mathbb{N}.

B2. g^{\theta} is locally Lipschitz continuous. This follows from Assumption 7.

B3. \ell^{\theta}_{t}\rightarrow 0 a.s. as t\rightarrow\infty (by Lemma 2 below).

B4. The iterates \{\mathbf{w}_{t}\}_{t\in\mathbb{N}} are bounded almost surely, i.e.,

 \displaystyle\sup_{t\in\mathbb{N}}\|\mathbf{w}_{t}\|<\infty\quad\text{a.s.}

This is ensured by the explicit application of the projection operator \Gamma^{W}\{\cdot\} over the iterates \{\mathbf{w}_{t}\}_{t\in\mathbb{N}} at every iteration onto the bounded set W.

B5. \exists L\in(0,\infty)\ \text{s.t.}\ \mathbb{E}\left[\|\mathbb{M}_{t+1}\|^{2}|\mathcal{F}_{t}\right]\leq L\left(1+\|\mathbf{w}_{t}\|^{2}\right)\ \text{a.s.} This follows from Assumption 6 (ii).

Now, we rewrite the stochastic recursion (11) as follows:

 \begin{aligned}
 \mathbf{w}_{t+1} &:= \mathbf{w}_{t}+\alpha_{a,t}\frac{\Gamma^{W}\left\{\mathbf{w}_{t}+\alpha_{a,t}\left(g^{\theta}(\mathbf{w}_{t})+\mathbb{M}_{t+1}+\ell^{\theta}_{t}\right)\right\}-\mathbf{w}_{t}}{\alpha_{a,t}} \\
 &= \mathbf{w}_{t}+\alpha_{a,t}\left(\widehat{\Gamma}^{W}_{\mathbf{w}_{t}}(g^{\theta}(\mathbf{w}_{t}))+\widehat{\Gamma}^{W}_{\mathbf{w}_{t}}\left(\mathbb{M}_{t+1}\right)+\widehat{\Gamma}^{W}_{\mathbf{w}_{t}}\left(\ell^{\theta}_{t}\right)+o(\alpha_{a,t})\right),
 \end{aligned} \qquad (15)

where \widehat{\Gamma}^{W} is the Fréchet derivative (Definition 3).

The above stochastic recursion is also a stochastic approximation recursion with the vector field \widehat{\Gamma}^{W}_{{\mathbf{w}}_{t}}(g^{\theta}({\mathbf{w}}_{t})), the noise term \widehat{\Gamma}^{W}_{{\mathbf{w}}_{t}}\left(\mathbb{M}_{t+1}\right), the bias term \widehat{\Gamma}^{W}_{{\mathbf{w}}_{t}}\left({\ell}^{\theta}_{t}\right) with an additional error term o(\alpha_{a,t}) which is asymptotically inconsequential.

Also, note that \Gamma^{W} is a single-valued map since the set W is assumed convex, and the limit exists since the boundary \partial W is assumed smooth. Further, for \mathbf{w}\in\mathring{W}, we have

 \displaystyle\widehat{\Gamma}^{W}_{\mathbf{w}}(\mathbf{u}):=\lim_{\epsilon% \rightarrow 0}\frac{\Gamma^{W}\left\{\mathbf{w}+\epsilon\mathbf{u}\right\}-% \mathbf{w}}{\epsilon}=\lim_{\epsilon\rightarrow 0}\frac{\mathbf{w}+\epsilon% \mathbf{u}-\mathbf{w}}{\epsilon}=\mathbf{u}\text{ (for sufficiently small }% \epsilon), (16)

i.e., \widehat{\Gamma}^{W}_{\mathbf{w}}(\cdot) is an identity map for \mathbf{w}\in\mathring{W}.
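The identity-map property in Eq. (16) can be checked numerically. The following is a minimal sketch, assuming W is a closed Euclidean ball (a convex set with smooth boundary, chosen here only for illustration); the names `project_ball` and `directional_derivative` are hypothetical and not from the paper. It approximates the limit by a finite difference with a small \epsilon.

```python
import math


def project_ball(point, radius=1.0):
    # Gamma^W for W = closed Euclidean ball of the given radius (assumption).
    norm = math.sqrt(sum(x * x for x in point))
    if norm <= radius:
        return list(point)
    return [x * radius / norm for x in point]


def directional_derivative(w, u, eps=1e-6, radius=1.0):
    # Finite-difference estimate of (Gamma^W{w + eps * u} - w) / eps.
    shifted = [wi + eps * ui for wi, ui in zip(w, u)]
    projected = project_ball(shifted, radius)
    return [(p - wi) / eps for p, wi in zip(projected, w)]


if __name__ == "__main__":
    # Interior point: the estimate should be close to u itself.
    print(directional_derivative([0.2, -0.3], [1.0, -2.0]))
    # Boundary point: the outward (radial) component of u is removed.
    print(directional_derivative([1.0, 0.0], [1.0, 1.0]))
```

For an interior point the estimate reproduces the direction u, while on the boundary only the tangential part of an outward-pointing u survives, which is how the projected ODE keeps the dynamics inside W.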

Now by appealing to Theorem 2, Chapter 2 of [\citeauthoryearBorkar2008] along with the observations B1-B5, we conclude that the stochastic recursion (6) asymptotically tracks the following ODE almost surely:

 \begin{aligned}
 \frac{d}{dt}\mathbf{w}(t) &= \widehat{\Gamma}^{W}_{\mathbf{w}(t)}(g^{\theta}(\mathbf{w}(t))),\qquad t\geq 0 \\
 &= \widehat{\Gamma}^{W}_{\mathbf{w}(t)}\left(\mathbb{E}_{S\sim\nu,A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\Big[I_{\{Q_{\theta}(S,A)\geq f_{\theta}^{\rho}(\mathbf{w}^{\prime};S)\}}\nabla_{\mathbf{w}(t)}\ln\pi_{\mathbf{w}}(A|S)\Big]\right) \\
 &= \widehat{\Gamma}^{W}_{\mathbf{w}(t)}\left(\nabla_{\mathbf{w}(t)}\mathbb{E}_{S\sim\nu,A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\Big[I_{\{Q_{\theta}(S,A)\geq f_{\theta}^{\rho}(\mathbf{w}^{\prime};S)\}}\ln\pi_{\mathbf{w}}(A|S)\Big]\right).
 \end{aligned} \qquad (17)

The interchange of expectation and gradient in the last equality follows from the dominated convergence theorem and Assumption 7 [\citeauthoryearRubinstein and Shapiro1993]. The above ODE is a gradient flow with dynamics restricted inside W. This further implies that the stochastic recursion (6) converges to a (possibly sample path dependent) asymptotically stable equilibrium point of the above ODE inside W. ∎
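To make the recursion analysed in this proof concrete, here is a minimal sketch of the projected Actor update for a single fixed state. Several simplifications are assumptions made only for illustration and are not taken from the paper: the policy is a one-dimensional Gaussian with fixed standard deviation whose only parameter is its mean, the sampling parameter \mathbf{w}^{\prime} is set equal to \mathbf{w}, the Expert `q_value` is a hypothetical concave function, the sample size is held constant, and W is the interval [-3, 3].

```python
import math
import random


def q_value(action):
    # Hypothetical Expert for one fixed state: maximised at action = 1.
    return -(action - 1.0) ** 2


def project(mean, low=-3.0, high=3.0):
    # Gamma^W: projection of the Actor parameter onto the interval W.
    return max(low, min(high, mean))


def actor_step(mean, sigma, step_size, n_samples, rho, rng):
    # Sample actions from the sampling policy (here w' is taken equal to w).
    actions = [rng.gauss(mean, sigma) for _ in range(n_samples)]
    values = sorted(q_value(a) for a in actions)
    # Empirical (1 - rho)-quantile: the ceil((1 - rho) N)-th order statistic.
    threshold = values[math.ceil((1.0 - rho) * n_samples) - 1]
    # (1/N) * sum of I{Q >= threshold} * d/d(mean) ln pi(A) over all samples.
    grad = sum((a - mean) / sigma ** 2
               for a in actions if q_value(a) >= threshold) / n_samples
    return project(mean + step_size * grad)


def run_actor(n_steps=300, seed=0):
    rng = random.Random(seed)
    mean = -1.0
    for _ in range(n_steps):
        mean = actor_step(mean, sigma=0.5, step_size=0.5,
                          n_samples=200, rho=0.2, rng=rng)
    return mean


if __name__ == "__main__":
    print(run_actor())
```

In this toy setting the mean drifts towards the maximiser of the hypothetical Expert and then fluctuates around it, which is the qualitative behaviour the gradient-flow ODE describes.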

### 1.4 Proof of Lemma 2 to satisfy Condition 3

In this section, we show that \ell^{\theta}_{t}\rightarrow 0 a.s. as t\rightarrow\infty, in Lemma 2. To do so, we first need to prove several supporting lemmas. Lemma 1 shows that, for a given Actor and Expert, the sample quantile converges to the true quantile. Using this lemma, we can then prove Lemma 2. In the following subsection, we provide three supporting lemmas about convexity and Lipschitz properties of the sample quantiles, required for the proof of Lemma 1.

For this section, we require the following characterization of f^{\rho}(Q_{\theta}(s,\cdot),\mathbf{w}^{\prime}). See Lemma 1 of [\citeauthoryearHomem-de Mello2007] for more details.

 \displaystyle f^{\rho}(Q_{\theta}(s,\cdot),\mathbf{w}^{\prime})=\operatorname*% {arg\,min}_{\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]}{\mathbb{E}_{A\sim\pi_{% \mathbf{w}^{\prime}}(\cdot|s)}\left[\Psi(Q_{\theta}(s,A),\ell)\right]}, (18)

where \Psi(y,\ell):=(y-\ell)(1-\rho)I_{\{y\geq\ell\}}+(\ell-y)\rho I_{\{\ell\geq y\}}.

Similarly, the sample estimate of the true (1-\rho)-quantile, i.e., \widehat{f}^{\rho}:=Q^{(\lceil(1-\rho)N\rceil)}_{\theta,s} (where Q_{\theta,s}^{(i)} is the i-th order statistic of the random sample \{Q_{\theta}(s,A)\}_{A\in\Xi} with \Xi:=\{A_{i}\}_{i=1}^{N}\overset{\text{iid}}{\sim}\pi_{\mathbf{w}^{\prime}}(\cdot|s)), can be characterized as the unique solution of the stochastic counterpart of the above optimization problem, i.e.,

 \displaystyle\widehat{f}^{\rho}=\operatorname*{arg\,min}_{\ell\in[Q^{\theta}_{% l},Q^{\theta}_{u}]}{\frac{1}{N}\sum_{\begin{subarray}{c}A\in\Xi\\ |\Xi|=N\end{subarray}}\Psi(Q_{\theta}(s,A),\ell)}. (19)
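The characterization in Eq. (19) can be illustrated with a small numerical check. This is a sketch under stated assumptions: the list `values` stands in for the sampled action-values \{Q_{\theta}(s,A)\}_{A\in\Xi}, and the helper names are hypothetical. It compares the \lceil(1-\rho)N\rceil-th order statistic with a brute-force minimizer of the empirical \Psi-loss.

```python
import math
import random


def psi(y, level, rho):
    # Psi(y, l) = (y - l)(1 - rho) I{y >= l} + (l - y) rho I{l >= y}.
    if y >= level:
        return (y - level) * (1.0 - rho)
    return (level - y) * rho


def empirical_loss(values, level, rho):
    return sum(psi(y, level, rho) for y in values) / len(values)


def order_statistic_quantile(values, rho):
    # ceil((1 - rho) N)-th order statistic (1-indexed) of the sample.
    ordered = sorted(values)
    return ordered[math.ceil((1.0 - rho) * len(ordered)) - 1]


def loss_minimizer(values, rho):
    # The empirical loss is piecewise linear and convex in the level,
    # so a minimizer is attained at one of the sample points.
    return min(values, key=lambda level: empirical_loss(values, level, rho))


if __name__ == "__main__":
    rng = random.Random(1)
    values = [rng.uniform(-2.0, 3.0) for _ in range(501)]
    print(order_statistic_quantile(values, 0.2), loss_minimizer(values, 0.2))
```

The sample size 501 is chosen so that (1-\rho)N is not an integer, in which case the empirical minimizer is a single point.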

###### Lemma 1.

Assume \theta_{t}\equiv\theta, \mathbf{w}^{\prime}_{t}\equiv\mathbf{w}^{\prime}, \forall t\in\mathbb{N}. Also, let Assumptions 3-5 hold. Then, for a given state s\in\mathcal{S},

 \displaystyle\lim_{t\rightarrow\infty}\widehat{f}^{\rho}_{t}=f^{\rho}(Q_{% \theta}(s,\cdot),\mathbf{w}^{\prime})\hskip 5.690551pta.s.,

where \widehat{f}^{\rho}_{t}:=Q^{(\lceil(1-\rho)N_{t}\rceil)}_{\theta,s} (where Q_{\theta,s}^{(i)} is the i-th order statistic of the random sample \{Q_{\theta}(s,A)\}_{A\in\Xi_{t}} with \Xi_{t}:=\{A_{i}\}_{i=1}^{N_{t}}\overset{\text{iid}}{\sim}\pi_{\mathbf{w}^{\prime}}(\cdot|s)).

###### Proof.

The proof is similar to arguments in Lemma 7 of [\citeauthoryearHu, Fu, and Marcus2007]. Since state s and expert parameter \theta are considered fixed, we assume the following notation in the proof. Let

 \displaystyle\widehat{f}^{\rho}_{t|s,\theta}:=\widehat{f}^{\rho}_{t}\text{ and% }f^{\rho}_{|s,\theta}:=f^{\rho}(Q_{\theta}(s,\cdot),\mathbf{w}^{\prime}), (20)

where \widehat{f}^{\rho}_{t} and f^{\rho}(Q_{\theta}(s,\cdot),\mathbf{w}^{\prime}) are defined in Equations (19) and (18), respectively.

Consider the open cover \{B_{r}(\ell),\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]\} of [Q^{\theta}_{l},Q^{\theta}_{u}]. Since [Q^{\theta}_{l},Q^{\theta}_{u}] is compact, there exists a finite sub-cover, i.e., \exists\{\ell_{1},\ell_{2},\dots,\ell_{M}\} s.t. \cup_{i=1}^{M}B_{r}(\ell_{i})=[Q^{\theta}_{l},Q^{\theta}_{u}]. Let \vartheta(\ell):=\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left[\Psi(Q_{\theta}(s,A),\ell)\right] and \widehat{\vartheta}_{t}(\ell):=\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}\Psi(Q_{\theta}(s,A),\ell), where |\Xi_{t}|=N_{t} and \Xi_{t}\overset{\text{iid}}{\sim}\pi_{\mathbf{w}^{\prime}}(\cdot|s).

Now, by the triangle inequality, we have for \ell\in[Q^{\theta}_{l},Q^{\theta}_{u}] and any \ell_{j} with \ell\in B_{r}(\ell_{j}),

 \begin{aligned}
 |\vartheta(\ell)-\widehat{\vartheta}_{t}(\ell)| &\leq |\vartheta(\ell)-\vartheta(\ell_{j})|+|\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|+|\widehat{\vartheta}_{t}(\ell_{j})-\widehat{\vartheta}_{t}(\ell)| \\
 &\leq L_{\rho}|\ell-\ell_{j}|+|\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|+\widehat{L}_{\rho}|\ell_{j}-\ell| \\
 &\leq \left(L_{\rho}+\widehat{L}_{\rho}\right)r+|\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|,
 \end{aligned} \qquad (21)

where L_{\rho} and \widehat{L}_{\rho} are the Lipschitz constants of \vartheta(\cdot) and \widehat{\vartheta}_{t}(\cdot) respectively.

For \delta>0, take r=\delta/\big(2(L_{\rho}+\widehat{L}_{\rho})\big), so that (L_{\rho}+\widehat{L}_{\rho})r=\delta/2. Also, by Kolmogorov’s strong law of large numbers (Theorem 2.3.10 of [\citeauthoryearSen and Singer2017]), we have \widehat{\vartheta}_{t}(\ell)\rightarrow\vartheta(\ell) a.s. This implies that there exists T\in\mathbb{N} s.t. |\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|<\delta/2, \forall t\geq T, \forall j\in[M]. Then from Eq. (21), we have

 \displaystyle|\vartheta(\ell)-\widehat{\vartheta}_{t}(\ell)|\leq\delta/2+% \delta/2=\delta,\hskip 11.381102pt\forall\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}].

This implies that \widehat{\vartheta}_{t} converges uniformly to \vartheta. By Lemmas 4 and 5, \widehat{\vartheta}_{t} and \vartheta are strictly convex and Lipschitz continuous. Because \widehat{\vartheta}_{t} converges uniformly to \vartheta, Lemma 3 implies that the sequence of minimizers of \widehat{\vartheta}_{t} converges to the minimizer of \vartheta. These minimizers correspond to \widehat{f}^{\rho}_{t} and f^{\rho}(Q_{\theta}(s,\cdot),\mathbf{w}^{\prime}) respectively, and so \lim_{N_{t}\rightarrow\infty}\widehat{f}^{\rho}_{t}=f^{\rho}(Q_{\theta}(s,\cdot),\mathbf{w}^{\prime}) a.s.

Now, for \delta>0 and r:=\delta/\big(2(L_{\rho}+\widehat{L}_{\rho})\big), we obtain the following from Eq. (21):

 \begin{aligned}
 &|\vartheta(\ell)-\widehat{\vartheta}_{t}(\ell)|\leq\delta/2+|\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})| \\
 &\Leftrightarrow\{|\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|\leq\delta/2,\forall j\in[M]\}\Rightarrow\{|\vartheta(\ell)-\widehat{\vartheta}_{t}(\ell)|\leq\delta,\forall\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]\} \\
 &\Rightarrow\mathbb{P}_{\pi_{\mathbf{w}^{\prime}}}\left(|\vartheta(\ell)-\widehat{\vartheta}_{t}(\ell)|\leq\delta,\forall\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]\right) \\
 &\qquad\geq\mathbb{P}_{\pi_{\mathbf{w}^{\prime}}}\left(|\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|\leq\delta/2,\forall j\in[M]\right) \\
 &\qquad=1-\mathbb{P}_{\pi_{\mathbf{w}^{\prime}}}\left(|\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|>\delta/2,\exists j\in[M]\right) \\
 &\qquad\geq 1-\sum_{j=1}^{M}\mathbb{P}_{\pi_{\mathbf{w}^{\prime}}}\left(|\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|>\delta/2\right) \\
 &\qquad\geq 1-M\max_{j\in[M]}\mathbb{P}_{\pi_{\mathbf{w}^{\prime}}}\left(|\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|>\delta/2\right) \\
 &\qquad\geq 1-2M\exp\left(\frac{-2N_{t}\delta^{2}}{4(Q^{\theta}_{u}-Q^{\theta}_{l})^{2}}\right),
 \end{aligned} \qquad (22)

where \mathbb{P}_{\pi_{\mathbf{w}^{\prime}}}:=\mathbb{P}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}. The last inequality follows from Hoeffding’s inequality [\citeauthoryearHoeffding1963] along with the fact that \mathbb{E}_{\pi_{\mathbf{w}^{\prime}}}\left[\widehat{\vartheta}_{t}(\ell_{j})\right]=\vartheta(\ell_{j}) and \sup_{\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]}|\vartheta(\ell)|\leq Q^{\theta}_{u}-Q^{\theta}_{l}.
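To get a feel for how quickly the lower bound in Eq. (22) approaches one, the final expression can be evaluated directly. This is a small sketch; the function name and the example values for M, \delta and the range [Q^{\theta}_{l},Q^{\theta}_{u}] are assumptions chosen only for illustration.

```python
import math


def uniform_deviation_lower_bound(n_samples, delta, cover_size, q_low, q_high):
    # 1 - 2 M exp(-2 N delta^2 / (4 (Q_u - Q_l)^2)), the last line of Eq. (22).
    spread = q_high - q_low
    exponent = -2.0 * n_samples * delta ** 2 / (4.0 * spread ** 2)
    return 1.0 - 2.0 * cover_size * math.exp(exponent)


if __name__ == "__main__":
    for n in (100, 1000, 10000, 100000):
        print(n, uniform_deviation_lower_bound(n, 0.1, 10, 0.0, 1.0))
```

For small sample sizes the expression can be negative, in which case the bound is vacuous; it only becomes informative once N_{t}\delta^{2} is large relative to (Q^{\theta}_{u}-Q^{\theta}_{l})^{2}\ln M.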

Now, the sub-differential of \vartheta(\ell) is given by

 \displaystyle\partial_{\ell}\vartheta(\ell)=\left[\rho-\mathbb{P}_{A\sim\pi_{% \mathbf{w}^{\prime}}(\cdot|s)}\left(Q_{\theta}(s,A)\geq\ell\right),\rho-1+% \mathbb{P}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left(Q_{\theta}(s,A)\leq% \ell\right)\right]. (23)

By the definition of sub-gradient we obtain

 \begin{aligned}
 &c|\widehat{f}^{\rho}_{t|s,\theta}-f^{\rho}_{|s,\theta}|\leq|\vartheta(\widehat{f}^{\rho}_{t|s,\theta})-\vartheta(f^{\rho}_{|s,\theta})|,\quad c\in\partial_{\ell}\vartheta(\ell) \\
 &\Rightarrow C|\widehat{f}^{\rho}_{t|s,\theta}-f^{\rho}_{|s,\theta}|\leq|\vartheta(\widehat{f}^{\rho}_{t|s,\theta})-\vartheta(f^{\rho}_{|s,\theta})|,
 \end{aligned} \qquad (24)

where C:=\max{\left\{\rho-\mathbb{P}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left(% Q_{\theta}(s,A)\geq f^{\rho}_{|s,\theta}\right),\rho-1+\mathbb{P}_{A\sim\pi_{% \mathbf{w}^{\prime}}(\cdot|s)}\left(Q_{\theta}(s,A)\leq f^{\rho}_{|s,\theta}% \right)\right\}}. Further,

 \begin{aligned}
 C|\widehat{f}^{\rho}_{t|s,\theta}-f^{\rho}_{|s,\theta}| &\leq |\vartheta(\widehat{f}^{\rho}_{t|s,\theta})-\vartheta(f^{\rho}_{|s,\theta})| \\
 &\leq |\vartheta(\widehat{f}^{\rho}_{t|s,\theta})-\widehat{\vartheta}_{t}(\widehat{f}^{\rho}_{t|s,\theta})|+|\widehat{\vartheta}_{t}(\widehat{f}^{\rho}_{t|s,\theta})-\vartheta(f^{\rho}_{|s,\theta})| \\
 &\leq |\vartheta(\widehat{f}^{\rho}_{t|s,\theta})-\widehat{\vartheta}_{t}(\widehat{f}^{\rho}_{t|s,\theta})|+\sup_{\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]}|\widehat{\vartheta}_{t}(\ell)-\vartheta(\ell)| \\
 &\leq 2\sup_{\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]}|\widehat{\vartheta}_{t}(\ell)-\vartheta(\ell)|.
 \end{aligned} \qquad (25)

From Eqs. (22) and (25), we obtain for \epsilon>0

 \begin{aligned}
 \mathbb{P}_{\mathbf{w}^{\prime}}\left(N_{t}^{\alpha}|\widehat{f}^{\rho}_{t|s,\theta}-f^{\rho}_{|s,\theta}|\geq\epsilon\right) &\leq \mathbb{P}_{\mathbf{w}^{\prime}}\left(N_{t}^{\alpha}\sup_{\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]}|\widehat{\vartheta}_{t}(\ell)-\vartheta(\ell)|\geq\frac{\epsilon}{2}\right) \\
 &\leq 2M\exp\left(\frac{-2N_{t}\epsilon^{2}}{16N^{2\alpha}_{t}(Q^{\theta}_{u}-Q^{\theta}_{l})^{2}}\right)=2M\exp\left(\frac{-2N_{t}^{1-2\alpha}\epsilon^{2}}{16(Q^{\theta}_{u}-Q^{\theta}_{l})^{2}}\right).
 \end{aligned}

For \alpha\in(0,1/2), since \inf_{t\in\mathbb{N}}\frac{N_{t+1}}{N_{t}}\geq\tau>1 (by Assumption 3), we have

 \displaystyle\sum_{t=1}^{\infty}2M\exp{\left(\frac{-2N_{t}^{1-2\alpha}\epsilon% ^{2}}{16(Q^{\theta}_{u}-Q^{\theta}_{l})^{2}}\right)}\leq\sum_{t=1}^{\infty}2M% \exp{\left(\frac{-2\tau^{(1-2\alpha)t}N_{0}^{1-2\alpha}\epsilon^{2}}{16(Q^{% \theta}_{u}-Q^{\theta}_{l})^{2}}\right)}<\infty.

Therefore, by Borel-Cantelli’s Lemma [\citeauthoryearDurrett1991], we have

 \displaystyle\mathbb{P}_{\mathbf{w}^{\prime}}\left(N_{t}^{\alpha}\big{|}% \widehat{f}^{\rho}_{t|s,\theta}-f^{\rho}_{|s,\theta}\big{|}\geq\epsilon\hskip 5% .690551pti.o\right)=0.

Thus we have N_{t}^{\alpha}\left(\widehat{f}^{\rho}_{t|s,\theta}-f^{\rho}_{|s,\theta}\right) \rightarrow 0 a.s. as N_{t}\rightarrow\infty. ∎
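The statement of Lemma 1 can be illustrated empirically. The sketch below makes several assumptions for illustration only: the sampled action-values are replaced by uniform draws on [0, 1] (so the true (1-\rho)-quantile is exactly 1-\rho), and the sample size grows as N_{t}=N_{0}\tau^{t} with hypothetical values N_{0}=100 and \tau=2, which satisfies the geometric-growth condition used in the proof.

```python
import math
import random


def sample_quantile(values, rho):
    # ceil((1 - rho) N)-th order statistic (1-indexed) of the sample.
    ordered = sorted(values)
    return ordered[math.ceil((1.0 - rho) * len(ordered)) - 1]


def quantile_errors(rho=0.1, n0=100, tau=2, n_rounds=10, seed=0):
    # Values are uniform on [0, 1], so the true (1 - rho)-quantile is 1 - rho.
    # The sample size grows geometrically: N_t = n0 * tau**t.
    rng = random.Random(seed)
    errors = []
    for t in range(n_rounds + 1):
        n_t = n0 * tau ** t
        values = [rng.random() for _ in range(n_t)]
        errors.append((n_t, abs(sample_quantile(values, rho) - (1.0 - rho))))
    return errors


if __name__ == "__main__":
    for n_t, err in quantile_errors():
        print(n_t, err)
```

The printed errors are not monotone, since each round draws a fresh sample, but they shrink on the order of N_{t}^{-1/2}, consistent with the requirement \alpha<1/2 in the proof.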

###### Lemma 2.

Almost surely,

 \displaystyle\ell^{\theta}_{t}\rightarrow 0\hskip 5.690551pt\text{ as }N_{t}% \rightarrow\infty.

Proof of Lemma 2. Consider

 \begin{aligned}
 &\mathbb{E}\bigg[\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}I_{\{Q_{\theta}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|S_{t})\bigg|\mathcal{F}_{t}\bigg] \\
 &\qquad=\mathbb{E}\Bigg[\mathbb{E}_{\Xi_{t}}\bigg[\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}I_{\{Q_{\theta}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|S_{t})\bigg]\bigg|S_{t}=s,\mathbf{w}^{\prime}_{t}\Bigg].
 \end{aligned}

For \alpha^{\prime}>0, from Assumption 8, we have

 \begin{aligned}
 &\mathbb{P}\Big(N_{t}^{\alpha^{\prime}}\Big\|\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}I_{\{Q_{\theta}(s,A)\geq\widehat{f}^{\rho}_{\theta,s}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|s)-\mathbb{E}\left[I_{\{Q_{\theta}(s,A)\geq\widehat{f}^{\rho}_{\theta,s}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|s)\right]\Big\|\geq\epsilon\Big) \\
 &\qquad\leq C_{1}\exp\left(-\frac{c_{2}N^{c_{3}}_{t}\epsilon^{c_{4}}}{N_{t}^{c_{4}\alpha^{\prime}}}\right)=C_{1}\exp\left(-c_{2}N^{c_{3}-c_{4}\alpha^{\prime}}_{t}\epsilon^{c_{4}}\right) \\
 &\qquad\leq C_{1}\exp\left(-c_{2}\tau^{(c_{3}-c_{4}\alpha^{\prime})t}N^{c_{3}-c_{4}\alpha^{\prime}}_{0}\epsilon^{c_{4}}\right),
 \end{aligned}

where f^{\rho}_{\theta,s}:=f^{\rho}(Q_{\theta}(s,\cdot),\pi_{\mathbf{w}^{\prime}}(% \cdot|s)) and \inf_{t\in\mathbb{N}}\frac{N_{t+1}}{N_{t}}\geq\tau>1 (by Assumption 3).

For c_{3}-c_{4}\alpha^{\prime}>0, i.e., \alpha^{\prime}<c_{3}/c_{4}, we have

 \displaystyle\sum_{t=1}^{\infty}{C_{1}\exp{\left(-c_{2}\tau^{(c_{3}-c_{4}% \alpha^{\prime})t}N^{c_{3}-c_{4}\alpha^{\prime}}_{0}\epsilon^{c_{4}}\right)}}<\infty.

Therefore, by Borel-Cantelli’s Lemma [\citeauthoryearDurrett1991], we have

 \displaystyle\mathbb{P}\Big(N_{t}^{\alpha^{\prime}}\Big\|\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}I_{\{Q_{\theta}(s,A)\geq\widehat{f}^{\rho}_{\theta,s}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|s)-\mathbb{E}\left[I_{\{Q_{\theta}(s,A)\geq\widehat{f}^{\rho}_{\theta,s}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|s)\right]\Big\|\geq\epsilon\quad\text{i.o.}\Big)=0.

This implies that

 \displaystyle N_{t}^{\alpha^{\prime}}\Big{\|}\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}% {I_{\{Q_{\theta}(s,A)\geq\widehat{f}^{\rho}_{\theta,s}\}}\nabla_{\mathbf{w}_{t% }}\ln\pi_{\mathbf{w}}(A|s)}-\mathbb{E}\left[I_{\{Q_{\theta}(s,A)\geq\widehat{f% }^{\rho}_{\theta,s}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|s)\right]% \Big{\|}\rightarrow 0\hskip 11.381102pta.s. (26)

The above result implies that the sample average converges at a rate O(N_{t}^{\alpha^{\prime}}), where 0<\alpha^{\prime}<c_{3}/c_{4}, independent of \mathbf{w},\mathbf{w}^{\prime}\in W. By Lemma 1, the sample quantiles \widehat{f}^{\rho}_{t} also converge to the true quantile at a rate O(N_{t}^{\alpha}) independent of \mathbf{w},\mathbf{w}^{\prime}\in W. The claim now follows directly from Assumption 6 (ii) and the bounded convergence theorem.

∎

### 1.5 Supporting Lemmas for Lemma 1

###### Lemma 3.

Let \{f_{n}\in C({\rm I\!R},{\rm I\!R})\}_{n\in\mathbb{N}} be a sequence of strictly convex, continuous functions converging uniformly to a strictly convex function f. Let x_{n}^{*}=\operatorname*{arg\,min}_{x\in{\rm I\!R}}{f_{n}(x)} and x^{*}=\operatorname*{arg\,min}_{x\in{\rm I\!R}}{f(x)}. Then \lim\limits_{n\rightarrow\infty}x^{*}_{n}=x^{*}.

###### Proof.

Let c=\liminf_{n}x^{*}_{n}. We employ proof by contradiction here. For that, we assume x^{*}>c. Now, note that f(x^{*})<f(c) and f(x^{*})<f(\left(x^{*}+c\right)/2) (by the definition of x^{*}). Also, by the strict convexity of f, we have f((x^{*}+c)/2)<\left(f(x^{*})+f(c)\right)/2 <f(c). Therefore, we have

 \displaystyle f(c)>f((x^{*}+c)/2)>f(x^{*}). (27)

Let r_{1}\in{\rm I\!R} be such that f(c)>r_{1}>f((x^{*}+c)/2). Now, since \|f_{n}-f\|_{\infty}\rightarrow 0 as n\rightarrow\infty, there exists a positive integer N s.t. |f_{n}(c)-f(c)|<f(c)-r_{1}, \forall n\geq N. Therefore, f_{n}(c)-f(c)>r_{1}-f(c) \Rightarrow f_{n}(c)>r_{1}. Similarly, we can show that f_{n}((x^{*}+c)/2)<r_{1}. Therefore, we have f_{n}(c)>f_{n}((x^{*}+c)/2). Similarly, we can show that f_{n}((x^{*}+c)/2)>f_{n}(x^{*}). Finally, we obtain

 \displaystyle f_{n}(c)>f_{n}((x^{*}+c)/2)>f_{n}(x^{*}),\hskip 11.381102pt% \forall n\geq N. (28)

Now, by the extreme value theorem for continuous functions, we obtain that for n\geq N, f_{n} achieves its minimum (say at x_{p}) in the closed interval [c,(x^{*}+c)/2]. Note that f_{n}(x_{p})\nless f_{n}((x^{*}+c)/2) (if so, then x_{p} would be a local minimum of f_{n} that is not a global minimum, since f_{n}(x^{*})<f_{n}((x^{*}+c)/2), which contradicts strict convexity). Therefore, f_{n} achieves its minimum in the closed interval [c,(x^{*}+c)/2] at the point (x^{*}+c)/2. This further implies that x^{*}_{n}>(x^{*}+c)/2. Therefore, \liminf_{n}x_{n}^{*}\geq(x^{*}+c)/2 \Rightarrow c\geq(x^{*}+c)/2 \Rightarrow c\geq x^{*}. This is a contradiction. This implies that

 \displaystyle\liminf_{n}x^{*}_{n}\geq x^{*}. (29)

Now consider g_{n}(x)=f_{n}(-x). Note that g_{n} is also a continuous and strictly convex function. Indeed, for \lambda\in(0,1) and x_{1}\neq x_{2}, we have g_{n}(\lambda x_{1}+(1-\lambda)x_{2})=f_{n}(-\lambda x_{1}-(1-\lambda)x_{2})<\lambda f_{n}(-x_{1})+(1-\lambda)f_{n}(-x_{2})=\lambda g_{n}(x_{1})+(1-\lambda)g_{n}(x_{2}). Applying the result from Eq. (29) to the sequence \{g_{n}\}_{n\in\mathbb{N}}, we obtain that \liminf_{n}(-x_{n}^{*})\geq-x^{*}. This further implies that \limsup_{n}x_{n}^{*}\leq x^{*}. Therefore,

 \displaystyle\liminf_{n}x^{*}_{n}\geq x^{*}\geq\limsup_{n}x_{n}^{*}\geq\liminf_{n}x_{n}^{*}.

Hence, \liminf_{n}x^{*}_{n}=\limsup_{n}x_{n}^{*}=x^{*}. ∎
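Lemma 3 can be illustrated on a concrete family of functions. The sketch below uses a hypothetical example, f_{n}(x)=(x-1)^{2}+\sin(x)/n, which is strictly convex for n\geq 1 and converges uniformly on the real line to f(x)=(x-1)^{2}; the minimizers are located with a ternary search over an interval assumed wide enough to contain them.

```python
import math


def f_n(x, n):
    # Strictly convex for n >= 1 (second derivative 2 - sin(x)/n >= 1) and
    # within 1/n of f(x) = (x - 1)^2 uniformly over the real line.
    return (x - 1.0) ** 2 + math.sin(x) / n


def argmin_convex(func, low=-5.0, high=5.0, iters=200):
    # Ternary search; valid because func is strictly convex on [low, high].
    for _ in range(iters):
        left = low + (high - low) / 3.0
        right = high - (high - low) / 3.0
        if func(left) < func(right):
            high = right
        else:
            low = left
    return 0.5 * (low + high)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, argmin_convex(lambda x, n=n: f_n(x, n)))
```

As n grows, the computed minimizers approach x^{*}=1, the minimizer of the limit function.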

###### Lemma 4.

Let Assumption 5 hold. For \theta\in\Theta, \mathbf{w}^{\prime}\in W, s\in\mathcal{S} and \ell\in[Q^{\theta}_{l},Q^{\theta}_{u}], we have \mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left[\Psi(Q_{\theta}(s,A),\ell)\right] is Lipschitz continuous. Also, \frac{1}{N}\sum_{A\in\Xi,\,|\Xi|=N}\Psi(Q_{\theta}(s,A),\ell) (with \Xi\overset{\text{iid}}{\sim}\pi_{\mathbf{w}^{\prime}}(\cdot|s)) is Lipschitz continuous with Lipschitz constant independent of the sample length N.

###### Proof.

Let \ell_{1},\ell_{2}\in[Q^{\theta}_{l},Q^{\theta}_{u}], \ell_{2}\geq\ell_{1}. By Assumption 5 we have \mathbb{P}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}(Q_{\theta}(s,A)\geq\ell_{% 1})>0 and \mathbb{P}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}(Q_{\theta}(s,A)\geq\ell_{% 2})>0. Now,

 \begin{aligned}
 &\Big|\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left[\Psi(Q_{\theta}(s,A),\ell_{1})\right]-\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left[\Psi(Q_{\theta}(s,A),\ell_{2})\right]\Big| \\
 &=\Big|\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left[(Q_{\theta}(s,A)-\ell_{1})(1-\rho)I_{\{Q_{\theta}(s,A)\geq\ell_{1}\}}+(\ell_{1}-Q_{\theta}(s,A))\rho I_{\{\ell_{1}\geq Q_{\theta}(s,A)\}}\right] \\
 &\qquad-\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left[(Q_{\theta}(s,A)-\ell_{2})(1-\rho)I_{\{Q_{\theta}(s,A)\geq\ell_{2}\}}+(\ell_{2}-Q_{\theta}(s,A))\rho I_{\{\ell_{2}\geq Q_{\theta}(s,A)\}}\right]\Big| \\
 &=\Big|\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\Big[(Q_{\theta}(s,A)-\ell_{1})(1-\rho)I_{\{Q_{\theta}(s,A)\geq\ell_{1}\}}+(\ell_{1}-Q_{\theta}(s,A))\rho I_{\{\ell_{1}\geq Q_{\theta}(s,A)\}} \\
 &\qquad-(Q_{\theta}(s,A)-\ell_{2})(1-\rho)I_{\{Q_{\theta}(s,A)\geq\ell_{2}\}}-(\ell_{2}-Q_{\theta}(s,A))\rho I_{\{\ell_{2}\geq Q_{\theta}(s,A)\}}\Big]\Big| \\
 &=\Big|\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\Big[(1-\rho)(\ell_{2}-\ell_{1})I_{\{Q_{\theta}(s,A)\geq\ell_{2}\}}+\rho(\ell_{1}-\ell_{2})I_{\{Q_{\theta}(s,A)\leq\ell_{1}\}} \\
 &\qquad+\left(-(1-\rho)\ell_{1}-\rho\ell_{2}+\rho Q_{\theta}(s,A)+(1-\rho)Q_{\theta}(s,A)\right)I_{\{\ell_{1}\leq Q_{\theta}(s,A)\leq\ell_{2}\}}\Big]\Big| \\
 &\leq(1-\rho)|\ell_{2}-\ell_{1}|+\left(2\rho+1\right)|\ell_{2}-\ell_{1}| \\
 &=(\rho+2)|\ell_{2}-\ell_{1}|.
 \end{aligned}

Similarly, we can prove the latter claim also. This completes the proof of Lemma 4. ∎
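The sample-size independence claimed in Lemma 4 can be checked numerically. This is an illustrative sketch only: the sampled action-values are replaced by standard Gaussian draws, and the helper names are hypothetical. It estimates the largest slope of the empirical \Psi-loss over random pairs of levels and compares it with the constant \rho+2 derived above.

```python
import random


def psi(y, level, rho):
    # Psi(y, l) = (y - l)(1 - rho) I{y >= l} + (l - y) rho I{l >= y}.
    if y >= level:
        return (y - level) * (1.0 - rho)
    return (level - y) * rho


def empirical_loss(values, level, rho):
    return sum(psi(y, level, rho) for y in values) / len(values)


def max_slope(values, rho, n_pairs=500, seed=0):
    # Largest observed |loss(l1) - loss(l2)| / |l1 - l2| over random pairs.
    rng = random.Random(seed)
    low, high = min(values), max(values)
    worst = 0.0
    for _ in range(n_pairs):
        l1, l2 = rng.uniform(low, high), rng.uniform(low, high)
        if l1 != l2:
            gap = abs(empirical_loss(values, l1, rho)
                      - empirical_loss(values, l2, rho))
            worst = max(worst, gap / abs(l1 - l2))
    return worst


if __name__ == "__main__":
    rng = random.Random(3)
    for n in (10, 100, 1000):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        print(n, max_slope(sample, rho=0.2))
```

The observed slopes stay well below \rho+2 for every sample size, which is consistent with the bound being loose but uniform in N.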

###### Lemma 5.

Let Assumption 5 hold. Then, for \theta\in\Theta, \mathbf{w}^{\prime}\in W, s\in\mathcal{S} and \ell\in[Q^{\theta}_{l},Q^{\theta}_{u}], we have \mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left[\Psi(Q_{\theta}(s,A),\ell)\right] and \frac{1}{N}\sum_{A\in\Xi,\,|\Xi|=N}\Psi(Q_{\theta}(s,A),\ell) (with \Xi\overset{\text{iid}}{\sim}\pi_{\mathbf{w}^{\prime}}(\cdot|s)) are strictly convex.

###### Proof.

For \lambda\in[0,1] and \ell_{1},\ell_{2}\in[Q^{\theta}_{l},Q^{\theta}_{u}] with \ell_{1}\leq\ell_{2}, we have

 \begin{aligned}
 &\mathbb{E}_{A\in\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\Psi(Q_{\theta}(S,A),\lambda\ell_{1}+(1-\lambda)\ell_{2})\big] \\
 &=\mathbb{E}_{A\in\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[(1-\rho)\big(Q_{\theta}(S,A)-\lambda\ell_{1}-(1-\lambda)\ell_{2}\big)I_{\{Q_{\theta}(S,A)\geq\lambda\ell_{1}+(1-\lambda)\ell_{2}\}} \\
 &\qquad+\rho\big(\lambda\ell_{1}+(1-\lambda)\ell_{2}-Q_{\theta}(S,A)\big)I_{\{Q_{\theta}(S,A)\leq\lambda\ell_{1}+(1-\lambda)\ell_{2}\}}\big].
 \end{aligned} \qquad (30)

Notice that

 \begin{aligned}
 &\big(Q_{\theta}(S,A)-\lambda\ell_{1}-(1-\lambda)\ell_{2}\big)I_{\{Q_{\theta}(S,A)\geq\lambda\ell_{1}+(1-\lambda)\ell_{2}\}} \\
 &=\big(\lambda Q_{\theta}(S,A)-\lambda\ell_{1}+(1-\lambda)Q_{\theta}(S,A)-(1-\lambda)\ell_{2}\big)I_{\{Q_{\theta}(S,A)\geq\lambda\ell_{1}+(1-\lambda)\ell_{2}\}}.
 \end{aligned}

We consider how one of these components simplifies.

 \begin{aligned}
 &\mathbb{E}_{A\in\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(\lambda Q_{\theta}(S,A)-\lambda\ell_{1}\big)I_{\{Q_{\theta}(S,A)\geq\lambda\ell_{1}+(1-\lambda)\ell_{2}\}}\big] \\
 &=\lambda\mathbb{E}_{A\in\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(Q_{\theta}(S,A)-\ell_{1}\big)I_{\{Q_{\theta}(S,A)\geq\lambda\ell_{1}\}}-\big(Q_{\theta}(S,A)-\ell_{1}\big)I_{\{\lambda\ell_{1}\leq Q_{\theta}(S,A)\leq\lambda\ell_{1}+(1-\lambda)\ell_{2}\}}\big] \\
 &\leq\lambda\mathbb{E}_{A\in\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(Q_{\theta}(S,A)-\ell_{1}\big)I_{\{Q_{\theta}(S,A)\geq\lambda\ell_{1}\}}\big] \qquad\triangleright\ -\big(Q_{\theta}(S,A)-\ell_{1}\big)\leq 0\text{ for }\lambda\ell_{1}\leq Q_{\theta}(S,A)\leq\lambda\ell_{1}+(1-\lambda)\ell_{2} \\
 &\leq\lambda\mathbb{E}_{A\in\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(Q_{\theta}(S,A)-\ell_{1}\big)I_{\{Q_{\theta}(S,A)\geq\ell_{1}\}}\big] \qquad\triangleright\ \big(Q_{\theta}(S,A)-\ell_{1}\big)\leq 0\text{ for }\lambda\ell_{1}\leq Q_{\theta}(S,A)\leq\ell_{1}
 \end{aligned}

Similarly, we get

 \begin{aligned}
 \mathbb{E}_{A\in\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(Q_{\theta}(S,A)-\ell_{2}\big)I_{\{Q_{\theta}(S,A)\geq\lambda\ell_{1}+(1-\lambda)\ell_{2}\}}\big] &\leq \mathbb{E}_{A\in\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(Q_{\theta}(S,A)-\ell_{2}\big)I_{\{Q_{\theta}(S,A)\geq\ell_{2}\}}\big] \\
 \mathbb{E}_{A\in\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(\ell_{1}-Q_{\theta}(S,A)\big)I_{\{Q_{\theta}(S,A)\leq\lambda\ell_{1}+(1-\lambda)\ell_{2}\}}\big] &\leq \mathbb{E}_{A\in\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(\ell_{1}-Q_{\theta}(S,A)\big)I_{\{Q_{\theta}(S,A)\leq\ell_{1}\}}\big] \\
 \mathbb{E}_{A\in\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(\ell_{2}-Q_{\theta}(S,A)\big)I_{\{Q_{\theta}(S,A)\leq\lambda\ell_{1}+(1-\lambda)\ell_{2}\}}\big] &\leq \mathbb{E}_{A\in\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(\ell_{2}-Q_{\theta}(S,A)\big)I_{\{Q_{\theta}(S,A)\leq\ell_{2}\}}\big]
 \end{aligned}

Therefore, for Equation (30), we get

 \begin{aligned}
 (30) &\leq \lambda(1-\rho)\mathbb{E}_{A\in\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(Q_{\theta}(S,A)-\ell_{1}\big)I_{\{Q_{\theta}(S,A)\geq\ell_{1}\}}\big] \\
 &\quad+(1-\lambda)(1-\rho)\mathbb{E}_{A\in\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(Q_{\theta}(S,A)-\ell_{2}\big)I_{\{Q_{\theta}(S,A)\geq\ell_{2}\}}\big] \\
 &\quad+\lambda\rho\mathbb{E}_{A\in\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(\ell_{1}-Q_{\theta}(S,A)\big)I_{\{Q_{\theta}(S,A)\leq\ell_{1}\}}\big] \\
 &\quad+(1-\lambda)\rho\mathbb{E}_{A\in\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(\ell_{2}-Q_{\theta}(S,A)\big)I_{\{Q_{\theta}(S,A)\leq\ell_{2}\}}\big] \\
 &=\lambda\mathbb{E}_{A\in\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\left[\Psi(Q_{\theta}(S,A),\ell_{1})\right]+(1-\lambda)\mathbb{E}_{A\in\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\left[\Psi(Q_{\theta}(S,A),\ell_{2})\right].
 \end{aligned}

We can prove the second claim similarly. This completes the proof of Lemma 5. ∎

## Appendix B Experiment Details

### 2.2 Bimodal Toy Domain Videos

We monitored the training process of each agent, including its action-value function and its policy, at every step of training on the Bimodal Toy Domain. A sample video for each agent is available at https://sites.google.com/ualberta.ca/actorexpert/. In each video, the upper graph shows the action-value function, while the lower graph shows the policy of the agent. The red vertical line indicates the greedy action and the blue vertical line indicates the exploratory action actually taken. For NAF, AE, and AE+, we also plot the policy function on the same graph.

### 2.3 Benchmark Environment Description

We use benchmark environments from OpenAI Gym [\citeauthoryearBrockman et al.2016] and MuJoCo [\citeauthoryearTodorov, Erez, and Tassa2012] to evaluate our agents. In this section we give a brief description and the dimensionality of each environment.

### 2.4 Best Hyperparameters

In this section we report the best hyperparameter settings for the evaluated environments. We swept over the use of layer normalization, actor learning rate, expert/critic learning rate, and other algorithm specific parameters.
