Actor-Expert: A Framework for using Action-Value Methods in Continuous Action Spaces
Abstract
Value-based approaches can be difficult to use in continuous action spaces, because an optimization has to be solved to find the greedy action for the action-values. A common strategy has been to restrict the functional form of the action-values to be convex or quadratic in the actions, to simplify this optimization. Such restrictions, however, can prevent learning accurate action-values. In this work, we propose the Actor-Expert framework for value-based methods, which decouples action selection (Actor) from the action-value representation (Expert). The Expert uses Q-learning to update the action-values towards the optimal action-values, whereas the Actor learns to output the greedy action for the current action-values. We develop a Conditional Cross Entropy Method for the Actor, to learn the greedy action for a generically parameterized Expert, and provide a two-timescale analysis to validate asymptotic behavior. We demonstrate in a toy domain with bimodal action-values that previous restrictive action-value methods fail, whereas the decoupled Actor-Expert with a more general action-value parameterization succeeds. Finally, we demonstrate that Actor-Expert performs as well as or better than these other methods on several benchmark continuous-action domains.
Introduction
Model-free control methods are currently divided into two main branches: value-based methods and policy gradient methods. Value-based methods, such as Q-learning, have been quite successful in discrete-action domains [\citeauthoryearMnih et al.2013, \citeauthoryearvan Hasselt, Guez, and Silver2015], whereas policy gradient methods have been more commonly used in continuous action spaces. One reason for this choice is that finding the greedy action for Q-learning can be difficult in continuous-action spaces, requiring an optimization problem to be solved.
A common strategy when using action-value methods for continuous actions has been to restrict the form of the action-values, to make the optimization over actions easy to solve. Wire-fitting [\citeauthoryearBaird and Klopf1993, \citeauthoryeardel R Millán, Posenato, and Dedieu2002] interpolates between a set of action points, adjusting those points over time to force one interpolated action point to become the maximizing action. Normalized Advantage Functions (NAF) [\citeauthoryearGu et al.2016b] learn an advantage function [\citeauthoryearBaird1993, \citeauthoryearHarmon and Baird1996a, \citeauthoryearHarmon and Baird1996b] by constraining the advantage function to be quadratic in the actions, keeping track of the vertex of the parabola. Partial Input Convex Neural Networks (PICNN) are learned such that the action-values are guaranteed to be convex in the actions [\citeauthoryearAmos, Xu, and Kolter2016]. To enable convex functions to be learned, however, PICNNs are restricted to non-negative weights and ReLU activations, and the maximizing action is found with an approximate gradient descent from random action points.
Another direction has been to parameterize the policy using the action-values, and to use a soft Q-learning update instead [\citeauthoryearHaarnoja et al.2017]. For action selection, the policy is parameterized as an energy-based model using the action-values. This approach avoids the difficult optimization over actions, but it can instead be expensive to sample an action from the policy: the action-values can be an arbitrary (energy) function, and sampling from the corresponding energy-based model requires an approximate sampling routine, like MCMC. Moreover, it optimizes an entropy-regularized objective, which differs from the traditional objective used in most other action-value learning algorithms, like Q-learning.
Policy gradient methods, on the other hand, learn a simple parametric distribution or a deterministic function over actions that can be easily used in continuous action spaces. In recent years, policy gradient methods have been particularly successful in continuous-action benchmark domains [\citeauthoryearDuan et al.2016], facilitated by the Actor-Critic framework. Actor-Critic methods, first introduced by [\citeauthoryearSutton1984], use a Critic (value function) that evaluates the current policy, to help compute the gradient for the Actor (policy). This separation into Actor and Critic enables the two components to be optimized in a variety of ways, facilitating algorithm development. The Actor can incorporate different update mechanisms to achieve better sample efficiency [\citeauthoryearMnih et al.2016, \citeauthoryearKakade2001, \citeauthoryearPeters and Schaal2008, \citeauthoryearWu et al.2017] or more stable learning [\citeauthoryearSchulman et al.2015, \citeauthoryearSchulman et al.2017]. The Critic can be used as a baseline or control variate to reduce variance [\citeauthoryearGreensmith, Bartlett, and Baxter2004, \citeauthoryearGu et al.2016a, \citeauthoryearSchulman et al.2016], and to improve sample efficiency by incorporating off-policy samples [\citeauthoryearDegris, White, and Sutton2012, \citeauthoryearSilver et al.2014, \citeauthoryearLillicrap et al.2015, \citeauthoryearWang et al.2016].
In this work, we propose a framework called Actor-Expert, which parallels Actor-Critic, but for value-based methods, facilitating the use of Q-learning in continuous action spaces. Actor-Expert decouples optimal action selection (Actor) from the action-value representation (Expert), enabling a variety of optimization methods to be used for the Actor. The Expert learns the action-values using Q-learning. The Actor learns the greedy action by iteratively updating towards an estimate of the maximal action for the action-values given by the Expert. This decoupling also enables any Actor to be used, including any exploration mechanism, without interfering with the Expert's goal of learning the optimal action-values. Actor-Expert is different from Actor-Critic because the Expert uses Q-learning—the Bellman optimality operator—whereas the Critic performs policy evaluation to obtain the values of the current (suboptimal) policy. In Actor-Expert, the Actor tracks the Expert, to track the greedy action, whereas in Actor-Critic, the Critic tracks the Actor, to track the policy values.
Taking advantage of this formalism, we introduce a Conditional Cross Entropy Method for the Actor, which puts minimal restrictions on the form of the action-values. The basic idea is to iteratively increase the likelihood of near-maximal actions for the Expert over time, extending the global optimization algorithm, the Cross Entropy Method [\citeauthoryearRubinstein1999], to be conditioned on state. We show in a toy domain with bimodal action-values—which are neither quadratic nor convex—that previous action-value methods with restrictive action-values (NAF and PICNN) perform poorly, whereas Actor-Expert learns the optimal policy well. We then show on several continuous-action benchmark domains that our algorithm outperforms previous value-based methods and an instance of an Actor-Critic method, Deep Deterministic Policy Gradient (DDPG).
Background and Problem Formulation
The interaction between the agent and environment is formalized as a Markov decision process (\mathcal{S},\mathcal{A},P,R,\gamma), where \mathcal{S} is the state space, \mathcal{A} is the action space, P:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1] is the one-step state transition dynamics, R:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R} is the reward function and \gamma\in[0,1) is the discount rate. At each discrete time step t=1,2,3,\ldots, the agent selects an action A_{t}\sim\pi(\cdot\mid S_{t}) according to policy \pi:\mathcal{S}\times\mathcal{A}\rightarrow[0,\infty), transitions to state S_{t+1} according to P, and observes a scalar reward R_{t+1}\doteq R(S_{t},A_{t},S_{t+1}).
For value-based methods, the objective is to find the fixed point of the Bellman optimality operator:
Q^{*}(s,a)\doteq\int_{\mathcal{S}}P(s,a,s^{\prime})\left[R(s,a,s^{\prime})+\gamma\max_{a^{\prime}\in\mathcal{A}}Q^{*}(s^{\prime},a^{\prime})\right]ds^{\prime}.  (1)
The corresponding optimal policy selects a greedy action from the set \operatorname*{arg\,max}_{a\in\mathcal{A}}Q^{*}(s,a). These optimal Q-values are typically learned using Q-learning [\citeauthoryearWatkins and Dayan1992]: for action-values Q_{\theta} parameterized by \theta\in\mathbb{R}^{n}, the iterative updates are \theta_{t+1}=\theta_{t}+\alpha_{t}\delta_{t}\nabla_{\theta}Q_{\theta}(S_{t},A_{t}) for
\delta_{t}=R_{t+1}+\gamma\max_{a^{\prime}\in\mathcal{A}}Q_{\theta}(S_{t+1},a^{\prime})-Q_{\theta}(S_{t},A_{t}).
Q-learning is an off-policy algorithm: it can learn the action-values for the optimal policy while following a different (exploratory) behaviour policy.
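As a concrete illustration, the update above can be sketched for the simplest case: a linear parameterization and a small discrete action set, so the max over actions can be computed by enumeration. This is a minimal sketch with an illustrative one-hot feature map, not the paper's implementation.

```python
import numpy as np

# Minimal sketch (illustrative names): Q-learning with a linear
# function approximator, Q_theta(s, a) = theta . phi(s, a), and a
# small discrete action set so max_a' can be taken by enumeration.

def q_value(theta, phi_sa):
    return theta @ phi_sa

def q_learning_step(theta, phi, s, a, r, s_next, actions,
                    alpha=0.1, gamma=0.99):
    """One update: theta += alpha * delta * grad_theta Q(s, a).
    For a linear parameterization, grad_theta Q(s, a) = phi(s, a)."""
    max_next = max(q_value(theta, phi(s_next, ap)) for ap in actions)
    delta = r + gamma * max_next - q_value(theta, phi(s, a))
    return theta + alpha * delta * phi(s, a)

# Toy usage: 2 states x 2 actions, one-hot features per (s, a) pair.
def phi(s, a):
    f = np.zeros(4)
    f[2 * s + a] = 1.0
    return f

theta = np.zeros(4)
theta = q_learning_step(theta, phi, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
```

With continuous actions, the `max` by enumeration in this sketch is exactly the step that becomes an optimization problem.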
Policy gradient methods directly optimize a parameterized policy \pi_{\mathbf{w}}, with parameters \mathbf{w}\in\mathbb{R}^{m}. The objective is typically an average reward objective,
\max_{\pi}\int_{\mathcal{S}}d_{\pi}(s)\int_{\mathcal{A}}\pi(s,a)\int_{\mathcal{S}}P(s,a,s^{\prime})R(s,a,s^{\prime})\,ds^{\prime}\,da\,ds  (2)
where d_{\pi}:\mathcal{S}\rightarrow[0,\infty) is the stationary distribution over states, representing state visitation. Policy gradient methods estimate gradients of this objective [\citeauthoryearSutton et al.2000]
\int_{\mathcal{S}}d_{\pi_{\mathbf{w}}}(s)\int_{\mathcal{A}}\nabla_{\mathbf{w}}\pi_{\mathbf{w}}(s,a)\,Q^{\pi_{\mathbf{w}}}(s,a)\,da\,ds
\text{for }\ Q^{\pi_{\mathbf{w}}}(s,a)\doteq\int_{\mathcal{S}}P(s,a,s^{\prime})\left[R(s,a,s^{\prime})+\gamma\int_{\mathcal{A}}\pi_{\mathbf{w}}(s^{\prime},a^{\prime})Q^{\pi_{\mathbf{w}}}(s^{\prime},a^{\prime})\,da^{\prime}\right]ds^{\prime}.
For example, in the policy-gradient approach called Actor-Critic [\citeauthoryearSutton1984], the Critic estimates Q^{\pi_{\mathbf{w}}} and the Actor uses the Critic to obtain an estimate of the above gradient to adjust the policy parameters \mathbf{w}.
Action-value methods can be difficult to use with continuous actions, because an optimization over actions needs to be solved, both for decision-making and for the Q-learning update. For a reasonably small number of discrete actions, \max_{a\in\mathcal{A}}Q_{\theta}(s,a) is straightforward to solve by iterating across all actions. For continuous actions, Q_{\theta}(s,\cdot) cannot be queried for all actions, and the optimization can be difficult to solve, such as when Q_{\theta}(s,\cdot) is non-convex in a.
Actor-Expert Formalism
We propose a new framework for value-based methods, with an explicit Actor. The goal is to provide a framework similar to Actor-Critic—which has been so successful for algorithm development of policy gradient methods—to simplify algorithm development for value-based methods. The Expert learns Q using Q-learning, but with an explicit Actor that provides the greedy actions. The Actor has two roles: to select which action to take (behavior policy) and to provide the greedy action for the Expert's Q-learning target. In this section, we develop a Conditional Cross Entropy Method for the Actor, to estimate the greedy action, and provide theoretical guarantees that the approach tracks a changing Expert.
Conditional Cross Entropy Method for the Actor
The primary role of the Actor is to identify—or learn—\operatorname*{arg\,max}_{a^{\prime}\in\mathcal{A}}Q_{\theta}(S_{t+1},a^{\prime}) for the Expert. Different strategies can be used to obtain this greedy action on each step. The simplest strategy is to solve this optimization with gradient ascent, to convergence, on every time step. This is problematic for two reasons: it is expensive, and it is likely to get stuck in suboptimal stationary points.
Consider now a slightly more effective strategy: learn an Actor that provides an approximate greedy action, which can serve as a good initial point for gradient ascent. Such a strategy reduces the number of gradient ascent steps required, making it more feasible to solve the optimization on each step. After obtaining a^{\prime} at the end of the gradient ascent iterations, the Actor can be trained towards a^{\prime}, using a supervised learning update on \pi_{\mathbf{w}}(\cdot\mid S_{t+1}). The Actor will slowly learn to select better initial actions, conditioned on state, that are near stationary points of Q(s,\cdot)—which hopefully correspond to high-value actions. This Actor learns to maximize Q with reduced computational cost, but still suffers from reaching suboptimal stationary points.
To overcome this issue, we propose an approach inspired by the Cross Entropy Method from global optimization. Global optimization strategies are designed to find the global optimum of a function f(\theta) for some parameters \theta. For example, for parameters \theta of a neural network, f may be the loss function on a sample of data. The advantage of these methods is that they do not rely on gradient-based strategies, which are prone to getting stuck in saddle points and local optima. Instead, they use randomized search strategies, which have been shown to be effective in practice [\citeauthoryearSalimans et al.2017, \citeauthoryearPeters and Schaal2007, \citeauthoryearSzita and Lőrincz2006, \citeauthoryearHansen, Müller, and Koumoutsakos2003].
One such algorithm is the Cross Entropy Method (CEM) [\citeauthoryearRubinstein1999]. This method maintains a distribution p(\theta) over parameters \theta, starting with a wide distribution, such as a Gaussian with mean \mu_{0}=\mathbf{0} and a diagonal covariance \boldsymbol{\Sigma}_{0} of large magnitude. The high-level idea is elegantly simple. On each iteration t, the goal is to minimize the KL-divergence to the uniform distribution over the parameters where the objective is above a threshold: I(f(\theta)\geq\text{threshold}). This distribution can be approximated with an empirical distribution, obtained by sampling several parameter vectors \theta_{1},\ldots,\theta_{N}, keeping those with f(\theta_{i})\geq\text{threshold} and discarding the rest. Minimizing the KL-divergence to this empirical distribution \hat{I}=\{\theta_{1}^{*},\ldots,\theta_{h}^{*}\}, for h<N, corresponds to maximizing the likelihood of the parameters in \hat{I} under the distribution p_{t}. Iteratively, the distribution p_{t} narrows around higher-valued \theta, and sampling \theta from p_{t} narrows the search, making the samples more likely to produce a useful approximation to I(f(\theta)\geq\text{threshold}).
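The iteration just described can be sketched in a few lines. This is a minimal, unconditional CEM sketch for maximizing a function f, where keeping a fixed number of elite samples stands in for the threshold indicator; the sample sizes and iteration counts are arbitrary illustrative choices.

```python
import numpy as np

# Minimal sketch of unconditional CEM: sample from a Gaussian,
# keep the top-scoring "elite" samples, refit the mean and (diagonal)
# standard deviation to the elites, and repeat.

def cem_maximize(f, dim, n_samples=50, n_elite=10, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.zeros(dim)                 # wide initial distribution
    sigma = np.full(dim, 2.0)
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(n_samples, dim))
        scores = np.array([f(x) for x in samples])
        elites = samples[np.argsort(scores)[-n_elite:]]
        mu = elites.mean(axis=0)       # maximum-likelihood refit
        sigma = elites.std(axis=0) + 1e-6
    return mu

# Toy usage: a smooth function with its maximum at x = (1, 1).
best = cem_maximize(lambda x: -np.sum((x - 1.0) ** 2), dim=2)
```

The refit step is exactly the maximum-likelihood update described above: the Gaussian is fit to the elite set, so the distribution narrows around high-valued parameters.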
CEM, however, finds a single best set of parameters for a single optimization problem. Most work using CEM in reinforcement learning aims to learn a single best set of parameters that maximizes rollout returns [\citeauthoryearSzita and Lőrincz2006, \citeauthoryearMannor, Rubinstein, and Gat2003]. Our goal, however, is not a single global optimization over returns, but rather a repeated optimization to select maximal actions, conditioned on each state. The global optimization strategy could be run on each step to find the exact best action for the current state, but this is expensive and throws away prior information about the function surface gained from previous optimizations.
We extend the Cross Entropy Method to be (a) conditioned on state and (b) learned iteratively over time. CEM is well-suited to a conditional extension, for use in the Actor, because it provides a stochastic Actor that can explore naturally and is effective for smooth, non-convex functions [\citeauthoryearKroese, Porotsky, and Rubinstein2006]. The idea is to iteratively update \pi(\cdot\mid S_{t}), where previous updates conditioned on state S_{t} generalize to similar states. The Actor learns a stochastic policy that slowly narrows around maximal actions, conditioned on states, as the agent performs CEM updates iteratively for the functions Q(S_{1},\cdot),Q(S_{2},\cdot),\ldots.
The Conditional CEM (CCEM) algorithm replaces the learned p(\cdot) with \pi(\cdot\mid S_{t}), where \pi(\cdot\mid S_{t}) can be any parameterized, multimodal distribution. For a mixture model, for example, the parameters are conditional means \mu_{i}(S_{t}), conditional diagonal covariances \Sigma_{i}(S_{t}) and coefficients c_{i}(S_{t}), for the i-th component of the mixture. On each step, the conditional mixture model \pi(\cdot\mid S_{t}) is sampled to provide a set of actions a_{1},\ldots,a_{N}, from which we construct the empirical distribution \hat{I}(S_{t})=\{a^{*}_{1},\ldots,a_{h}^{*}\}, where h<N, for state S_{t} with current values Q(S_{t},\cdot). The parameters \mathbf{w} are updated with a gradient ascent step on the log-likelihood of the actions in \hat{I}(S_{t}) under \pi.
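One CCEM update can be sketched as follows. To keep the log-likelihood gradient short, this sketch uses a single conditional Gaussian \pi(\cdot\mid s)=\mathcal{N}(Ws,\sigma^{2}I) with fixed \sigma in place of the paper's mixture model; the quantile \rho, stepsize, and all names are illustrative assumptions.

```python
import numpy as np

# Sketch of one Conditional CEM (CCEM) actor update, with a single
# conditional Gaussian policy pi(a|s) = N(W s, sigma^2 I) standing in
# for the paper's mixture model. Sample actions, keep the top-quantile
# "elites" under Q(s, .), then take a gradient ascent step on their
# log-likelihood with respect to the policy weights W.

def ccem_actor_step(W, s, q, rng, sigma=0.5, n=30, rho=0.2, lr=0.1):
    mu = W @ s                                       # conditional mean
    actions = rng.normal(mu, sigma, size=(n, mu.shape[0]))
    h = int(np.ceil(rho * n))                        # elite count
    elites = actions[np.argsort([q(s, a) for a in actions])[-h:]]
    # grad_W of sum_i log N(a_i* | W s, sigma^2 I) is the outer
    # product of sum_i (a_i* - W s) / sigma^2 with s.
    grad = ((elites - mu).sum(axis=0) / sigma**2)[:, None] * s[None, :]
    return W + lr * grad / h

# Toy usage: one state, Q peaked at a = 1; the policy mean tracks it.
rng = np.random.default_rng(0)
s = np.array([1.0])
W = np.zeros((1, 1))
for _ in range(300):
    W = ccem_actor_step(W, s, lambda s, a: -np.sum((a - 1.0) ** 2), rng)
```

Because the update is a likelihood ascent on elite actions rather than a gradient of Q itself, it works even when Q is multimodal in the action.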
The high-level framework is given in Algorithm 1. The Expert is updated towards the optimal Q-values, with (a variant of) Q-learning. The Actor provides exploration and, over time, learns how to find the maximal action for the Expert in the given state, using the described Conditional CEM algorithm. The strategy for constructing the empirical distribution is assumed to be given. We discuss the two strategies we explore in the experiments in the next subsection.
We depict an Actor-Expert architecture where the Actor uses a mixture model in Figure 1. In our implementation, we use mixture density networks [\citeauthoryearBishop1994] to learn a Gaussian mixture distribution. As in Figure 1, the Actor and Expert share the same neural network to obtain the representation for the state, and learn separate functions conditioned on that state. To obtain the maximal action under mixture models with a small number of components, we simply use the mean with the highest coefficient c_{i}(S_{t}). To prevent the diagonal covariance \boldsymbol{\Sigma} from exploding or vanishing, we bound it within [e^{-2},e^{2}] using a tanh layer. We also follow the standard practice of using experience replay and target networks to stabilize learning in neural networks. A more detailed algorithm for Actor-Expert with neural networks is described in Supplement 2.1.
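The covariance bounding can be sketched as a one-line squashing of a raw network output; the particular scaling is an assumption chosen to match the [e^{-2},e^{2}] range above.

```python
import numpy as np

# Sketch: squash a raw (unbounded) network output with tanh and scale
# it, so each diagonal covariance entry lies in (e^{-2}, e^{2}) and
# can neither explode nor vanish during training.

def bounded_sigma(raw):
    return np.exp(2.0 * np.tanh(raw))
```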
Selecting the empirical distribution
A standard strategy for selecting the empirical distribution in CEM is to use the top quantile of the sampled variables—actions in this case (Algorithm 2). For a_{1},\ldots,a_{N} sampled from \pi_{\mathbf{w}_{t}}(\cdot\mid S_{t}), we select the subset \{a_{i}^{*}\}\subset\{a_{1},\ldots,a_{N}\} whose values Q(S_{t},a_{i}^{*}) lie in the top (1-\rho) quantile. The resulting empirical distribution is \hat{I}(S_{t})=\{a^{*}_{1},\ldots,a_{h}^{*}\}, for h=\lceil\rho N\rceil. This strategy is generic and, as we find empirically, effective.
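This selection step can be sketched directly, with illustrative names:

```python
import numpy as np

# Sketch of the quantile empirical distribution: keep the
# h = ceil(rho * N) sampled actions with the highest Q-values.

def quantile_elites(actions, q_values, rho):
    h = int(np.ceil(rho * len(actions)))
    return actions[np.argsort(q_values)[-h:]]

# Toy usage: with rho = 0.4 and N = 5, the two highest-valued
# actions are kept.
elites = quantile_elites(np.array([10.0, 20.0, 30.0, 40.0, 50.0]),
                         np.array([5.0, 1.0, 4.0, 2.0, 3.0]), rho=0.4)
```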
For particular regularities in the action-values, however, we may be able to further improve this empirical distribution. For action-values differentiable in the action, we can perform a small number of gradient ascent steps from each a_{i} to reach actions a_{i}^{*} with slightly higher action-values (Algorithm 3). The empirical distribution should then contain a larger number of useful actions—those with higher action-values—on which to perform maximum likelihood, potentially also requiring fewer samples. In our experiments we perform 10 gradient ascent steps.
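A sketch of this refinement, using finite differences to approximate \nabla_{a}Q(s,a) so the example is self-contained; the number of steps and the stepsize are illustrative, and in practice the gradient would come from the Expert network directly.

```python
import numpy as np

# Sketch of the optimized empirical distribution: before selecting the
# top quantile, nudge each sampled action uphill on a -> Q(s, a) with a
# few gradient ascent steps (gradient estimated by central finite
# differences here, purely to keep the sketch self-contained).

def refine_actions(actions, q, s, steps=10, lr=0.05, eps=1e-4):
    actions = actions.astype(float).copy()
    for _ in range(steps):
        for i in range(len(actions)):
            a = actions[i]
            grad = np.array([(q(s, a + eps * e) - q(s, a - eps * e)) / (2 * eps)
                             for e in np.eye(a.shape[0])])
            actions[i] = a + lr * grad
    return actions

# Toy usage: Q peaked at a = 0.5; both actions move toward the peak.
q = lambda s, a: -np.sum((a - 0.5) ** 2)
raw = np.array([[0.0], [1.2]])
refined = refine_actions(raw, q, s=None)
```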
Theoretical guarantees for the Actor
In this section, we derive guarantees that the Conditional CEM Actor tracks a CEM update, for an evolving Expert. We follow a two-timescale stochastic approximation approach, where the action-values (Expert) change more slowly than the policy (Actor), allowing the Actor to track the maximal actions. (This is actually the opposite of Actor-Critic, for which the Actor changes slowly and the value estimates are on the faster timescale.) The Actor itself has two timescales, to account for its own parameters changing at different rates. Actions for the maximum likelihood step are selected according to older—slower—parameters, so that it is as if the primary—faster—parameters are updated using samples from a fixed distribution.
We provide an informal theorem statement here, with a proof sketch. The full theorem statement, with assumptions and proof, is in Supplement 1.
Theorem 1 (Informal Convergence Result).
Let \theta_{t} be the action-value parameters with stepsize \alpha_{q,t}, and \mathbf{w}_{t} the policy parameters with stepsize \alpha_{a,t}, with \mathbf{w}^{\prime}_{t} a more slowly changing set of policy parameters updated as \mathbf{w}^{\prime}_{t+1}=(1-\alpha^{\prime}_{a,t})\mathbf{w}^{\prime}_{t}+\alpha^{\prime}_{a,t}\mathbf{w}_{t} for stepsize \alpha^{\prime}_{a,t}\in(0,1]. Assume
1. States S_{t} are sampled from a fixed marginal distribution.
2. \nabla_{\mathbf{w}}\ln{\pi_{\mathbf{w}}(a\mid s)} is locally Lipschitz w.r.t. \mathbf{w}, \forall s\in\mathcal{S}.
3. Parameters \mathbf{w}_{t} and \theta_{t} remain bounded almost surely.
4. Stepsizes are chosen for three different timescales, so that \mathbf{w}_{t} evolves faster than \mathbf{w}^{\prime}_{t}, and \mathbf{w}^{\prime}_{t} evolves faster than \theta_{t}:
\lim_{t\rightarrow\infty}\frac{\alpha^{\prime}_{a,t}}{\alpha_{a,t}}=0,\quad\text{and}\quad\lim_{t\rightarrow\infty}\frac{\alpha_{q,t}}{\alpha^{\prime}_{a,t}}=0.
5. All three stepsizes decay to 0, while the sample size N_{t} strictly increases to infinity.
6. Both the L_{2} norm and the centered second moment of \nabla_{\mathbf{w}}\ln{\pi_{\mathbf{w}}(a\mid s)} w.r.t. \pi_{\mathbf{w}^{\prime}} are bounded uniformly.
Then the Conditional CEM Actor tracks the CEM Optimizer for actions, conditioned on state: the stochastic recursion for the Actor asymptotically behaves like an expected CEM Optimizer, with expectation taken across states.
Proof Sketch: The proof follows a multi-timescale stochastic approximation analysis. The primary concern is that the stochastic update to the Actor is not a direct gradient-descent update. Rather, each update to the Actor is a CEM update, which requires a different analysis to ensure that the stochastic noise remains bounded and is asymptotically negligible. Further, the classical results for CEM do not immediately apply, because those updates assume the distribution parameters can be computed directly. Here, the distribution parameters are conditioned on state, as outputs from a parameterized function. We identify conditions on the parameterized policy that ensure well-behaved CEM updates.
The multi-timescale analysis allows us to focus on the updates of the Actor \mathbf{w}_{t}, assuming the action-value parameters \theta and action-sampling parameters \mathbf{w}^{\prime} are quasi-static. These parameters are allowed to change with time—as they will in practice—but move at a sufficiently slower timescale relative to \mathbf{w}_{t}, and hence the analysis can proceed as if they were static. The action-value updates need to produce \theta that keep the action-values bounded for each state and action, but we do not specify the exact algorithm for the action-values. We assume the action-value algorithm is given, and focus the analysis on the novel component: the Conditional CEM updates for the Actor.
The first step in the proof is to formulate the update to the weights as a projected stochastic recursion—a stochastic update where, after each step, the weights are projected to a compact, convex set to keep them bounded. The stochastic recursion is decomposed into a summation involving the mean vector field g^{\theta}(\mathbf{w}_{t}) (which depends on the action-value parameters \theta), martingale noise, and a loss term \ell^{\theta}_{t} due to using approximate quantiles. The key steps are then to show, almost surely, that the mean vector field g^{\theta} is locally Lipschitz, that the martingale noise is quadratically bounded, and that the loss term \ell^{\theta}_{t} decays to zero asymptotically. For the first two, we identify conditions on the policy parameterization that guarantee these properties. For the last, we adapt the proof that sampled quantiles approach true quantiles for CEM, with modifications to account for expectations over the conditioning variable, the state. \blacksquare
Experiments
In this section, we investigate the utility of AE, particularly highlighting the benefit of a general functional form for the action-values, and demonstrating performance across several benchmark domains. We first design a domain where the true action-values are neither quadratic nor concave in the action. We then test AE and the other algorithms listed below in more complex continuous-action domains from OpenAI Gym [\citeauthoryearBrockman et al.2016] and MuJoCo [\citeauthoryearTodorov, Erez, and Tassa2012].
Algorithms
We use two versions of Actor-Expert: AE, which uses the Quantile Empirical Distribution (Alg. 2), and AE+, which uses the Optimized Quantile Empirical Distribution (Alg. 3). We use a bimodal Gaussian mixture for both Actors, with N=30 and \rho=0.2 for AE, and N=10 and \rho=0.4 for AE+. The second choice reflects that fewer samples are needed for the optimized set of actions. For the benchmark environments, it was even effective—and more efficient—for AE+ to sample only one action (N=1), with \rho=1.0. For NAF, PICNN, Wire-fitting, and DDPG, we attempt to match the settings used in their original works.
Normalized Advantage Function (NAF) [\citeauthoryearGu et al.2016b] uses Q(s,a)=V(s)+A(s,a), restricting the advantage function to the form A(s,a)=-\frac{1}{2}(a-\mu(s))^{T}\Sigma^{-1}(s)(a-\mu(s)). V(s) corresponds to the state value for the maximizing action \mu(s), and A(s,a) only decreases this value for a\neq\mu(s). NAF takes actions by sampling from a Gaussian with learned mean \mu(s) and learned covariance \Sigma(s), with the initial exploration scale swept in {0.1, 0.3, 1.0}.
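The quadratic restriction can be made concrete with a small sketch; here \Sigma^{-1}(s) is passed directly as a precision matrix P, and all names are illustrative rather than NAF's network implementation.

```python
import numpy as np

# Sketch of NAF's restricted form: Q(s, a) = V(s) + A(s, a) with
# A(s, a) = -1/2 (a - mu)^T P (a - mu), where P stands for the
# precision matrix Sigma^{-1}(s). By construction A(s, mu) = 0 and
# A(s, a) <= 0 elsewhere, so Q is maximized exactly at a = mu.

def naf_q(v, mu, P, a):
    d = a - mu
    return v - 0.5 * d @ P @ d

# Toy usage: Q equals V(s) at the learned mean, and is lower elsewhere.
v, mu, P = 1.0, np.zeros(2), np.eye(2)
```

This makes the greedy action trivial to read off, but it is exactly the restriction that fails when the true action-values are multimodal, as in the Bimodal Domain below.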
Partially Input Convex Neural Networks (PICNN) [\citeauthoryearAmos, Xu, and Kolter2016] are neural networks that are convex with respect to a part of their input—the action in this case. PICNN learns Q(s,a) so that it is convex with respect to a, by restricting the weights of intermediate layers to be non-negative, and the activation functions to be convex and non-decreasing (e.g., ReLU). For exploration in PICNN, we use OU noise—temporally correlated stochastic noise generated by an Ornstein-Uhlenbeck process [\citeauthoryearUhlenbeck and Ornstein1930]—with \mu=0.0,\theta=0.15,\sigma=0.2, where the noise is added to the greedy action. To obtain the greedy action, as suggested in their paper, we use 5 iterations of the bundle entropy method from a randomly initialized action.
Wire-fitting [\citeauthoryearBaird and Klopf1993] outputs a set of action control points and corresponding action-values \mathcal{C}=\{(a_{i},q_{i}):i=1,\ldots,m\} for a state. By construction, the optimal action is the action control point with the highest action-value. Like PICNN, we use OU exploration. This method interpolates between the action control points to obtain the action-values, so its performance depends largely on the number of control points. We used 100 action control points for the Bimodal Domain. For the benchmark problems, we found that Wire-fitting did not scale well, and so it was omitted.
Deep Deterministic Policy Gradient (DDPG) [\citeauthoryearLillicrap et al.2015] learns a deterministic policy, parameterized as a neural network, using the deterministic policy gradient theorem [\citeauthoryearSilver et al.2014]. We include it as a policy gradient baseline, as it is a competitive Actor-Critic method using off-policy policy gradients. Like PICNN and Wire-fitting, DDPG uses OU noise for exploration.
Experimental Settings
Agent performance is evaluated every n steps of training, by executing the current policy without exploration for 10 episodes. The performance is averaged over 20 runs with different random seeds for the Bimodal Domain, and 10 runs for the benchmark domains. For all agents we use a neural network with 2 layers of 200 hidden units each, with ReLU activations between layers and a tanh activation for action outputs. For AE and AE+, the Actor and Expert share the first layer, and branch out into two separate layers, each with 200 hidden units. We keep a running average and standard deviation to normalize unbounded state inputs. We use an experience replay buffer and target networks, as is common with neural networks. We use a batch size of 32, a buffer size of 10^{6}, target networks with \tau=0.01, and discount factor \gamma=0.99 for all agents. We sweep over learning rates—policy: \{10^{-3},10^{-4},10^{-5}\}, action-values: \{10^{-2},10^{-3},10^{-4}\}—and over whether to use layer normalization between network layers. For PICNN, however, layer normalization could not be used, in order to preserve convexity. The best hyperparameter settings found for all agents are reported in Supplement 2.4.
Experiments in a Bimodal Toy Domain
To illustrate the limitation posed by restricting the functional form of Q(s,a), we design a toy domain with a single state S_{0} and a\in[-2,2], where the true Q^{*}(s,a)—shown in Figure 3—is a function of two radial basis functions centered at a=-1.0 and a=1.0, with unequal values of 1.0 and 1.5 respectively. We assume a deterministic setting, and so the rewards are R(s,a)=Q^{*}(s,a).
We plot the average performance of the best setting for each agent over 20 runs in Figure 2. We also monitored the training process, logging the action-value function, exploratory action, and greedy action at each time step. We include videos in the Supplement (https://sites.google.com/ualberta.ca/actorexpert/), and descriptions can be found in Supplement 2.2.
All the methods that restrict the functional form over actions failed in many runs. PICNN and NAF start to increase the value of one action center and, by necessity of convexity, must overly decrease the values around the other action center. Consequently, when they randomly explore and observe a higher reward for that action than they predict, a large update skews the action-value estimates. DDPG suffers similarly: although its action-value function is not restricted, the Actor only learns to output one action, and DDPG may periodically see high value for the other action center, so its choice of greedy action can be pulled back and forth between these high-valued actions. The AE methods, on the other hand, almost always found the optimal action. Wire-fitting performed better than DDPG, PICNN and NAF, as it is capable of correctly modeling the action-values, but it still converged to the suboptimal policy quite often.
The exploration mechanisms also played an important role. For certain exploration settings, the agents restricting the functional form of the action-values or policy can learn to settle on one action, rather than oscillating. For NAF with a small exploration scale and DDPG using OU noise, such oscillations were not observed in the above figure, because the agent only explores locally around one action, avoiding oscillation but also often converging to the suboptimal action. This was still the better choice for overall performance, as oscillation produces lower accumulated reward. AE and AE+, on the other hand, explore by sampling from their learned multimodal Gaussian mixture distribution, with no external exploration parameter to tune.
Experiments in Benchmark Domains
We evaluated the algorithms on a set of benchmark continuous-action tasks, with results shown in Figure 4. As mentioned above, we do not include Wire-fitting, as it scaled poorly on these domains. A detailed description of the benchmark environments and their dimensions is included in Supplement 2.3, with state dimensions ranging from 3 to 17 and action dimensions ranging from 1 to 6.
In all benchmark environments, AE and AE+ perform as well as or better than the other methods. In particular, they seem to learn more quickly. We hypothesize that this is because AE better estimates greedy actions and explores around actions with high action-values more effectively.
NAF and PICNN seemed to have less stable behavior, potentially due to their restrictive action-value form. PICNN likely suffers less because its functional form is more general, but its greedy action selection mechanism is not as robust, and some instability was observed in Lunar Lander. Such instability is not observed in Pendulum or HalfCheetah, possibly because the action-value surface is simple in Pendulum, and for locomotion environments like HalfCheetah, precision is not necessary; approximately good actions may still enable the agent to move and achieve reasonable performance.
Though the goal here is to evaluate the utility of AE compared to other value-based methods, we include one policy gradient method as a baseline, providing a preliminary comparison of value-based and policy gradient approaches for continuous control. It is interesting that AE methods often perform better than their Actor-Critic (policy gradient) counterpart, DDPG. In particular, AE seems to learn much more quickly, which is a hypothesized benefit of value-based methods. Policy gradient methods, in contrast, typically have to use out-of-date value estimates to update the policy, which could slow learning.
Discussion and Future Work
In our work, we introduced a new framework called Actor-Expert that decouples action-selection from action-value representation by introducing an Actor that learns to identify maximal actions for the Expert's action-values. Previous value-based approaches for continuous control have typically limited the action-value functional form to make optimization over actions easy. We have shown that this can be problematic in domains whose true action-values do not follow such a parameterization. We proposed an instance of Actor-Expert by developing a Conditional Cross Entropy Method that iteratively finds greedy actions conditioned on states. We used a multi-timescale analysis to prove that this Actor tracks the Cross Entropy updates, which seek the optimal actions across states, as the Expert evolves gradually. This proof differs from other multi-timescale proofs in reinforcement learning, as we analyze a stochastic recursion based on the Cross Entropy Method rather than the more typical stochastic (semi-)gradient descent update. We conclude by showing that AE methods are able to find the optimal policy even when the true action-value function is bimodal, and perform as well as or better than previous methods in more complex domains. Like the Actor-Critic framework, we hope the Actor-Expert framework will facilitate further development and use of value-based methods for continuous-action problems.
One such direction is to compare value-based methods and policy gradient methods for continuous control more extensively. In this work, we investigated how to use value-based methods under continuous actions, but did not claim that value-based methods are preferable to policy gradient methods. However, there are several potential benefits of value-based methods that merit further exploration. One advantage is that Q-learning easily incorporates off-policy samples, potentially improving sample complexity, whereas for policy gradient methods, incorporating off-policy samples comes at the cost of introducing bias. Although some off-policy policy gradient methods like DDPG have achieved high performance in benchmark domains, they are also known to suffer from brittleness and hyperparameter sensitivity [\citeauthoryearDuan et al.2016, \citeauthoryearHenderson et al.2017]. Another, more speculative, advantage concerns the optimization surface. The Q-learning update converges to optimal values in tabular settings and under linear function approximation [\citeauthoryearMelo and Ribeiro2007]. Policy gradient methods, on the other hand, can have local optima, even in the tabular setting. One goal of Actor-Expert is to improve value-based methods for continuous actions, and so facilitate investigation into these hypotheses without being limited by difficulties in action selection.
References
 [\citeauthoryearAmos, Xu, and Kolter2016] Amos, B.; Xu, L.; and Kolter, J. Z. 2016. Input convex neural networks. CoRR abs/1609.07152.
 [\citeauthoryearBaird and Klopf1993] Baird, L. C., and Klopf, A. H. 1993. Reinforcement learning with high-dimensional, continuous actions. Wright Laboratory.
 [\citeauthoryearBaird1993] Baird, L. C. 1993. Advantage updating. Technical report, DTIC Document.
 [\citeauthoryearBishop1994] Bishop, C. M. 1994. Mixture density networks. Technical report.
 [\citeauthoryearBorkar1997] Borkar, V. S. 1997. Stochastic approximation with two time scales. Systems & Control Letters 29(5):291–294.
 [\citeauthoryearBorkar2008] Borkar, V. S. 2008. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press.
 [\citeauthoryearBrockman et al.2016] Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym.
 [\citeauthoryearDegris, White, and Sutton2012] Degris, T.; White, M.; and Sutton, R. S. 2012. Off-policy actor-critic. CoRR abs/1205.4839.
 [\citeauthoryeardel R Millán, Posenato, and Dedieu2002] del R. Millán, J.; Posenato, D.; and Dedieu, E. 2002. Continuous-Action Q-Learning. Machine Learning.
 [\citeauthoryearDuan et al.2016] Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; and Abbeel, P. 2016. Benchmarking deep reinforcement learning for continuous control. CoRR abs/1604.06778.
 [\citeauthoryearDurrett1991] Durrett, R. 1991. Probability: Theory and Examples. The Wadsworth & Brooks/Cole Statistics/Probability Series. Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, CA.
 [\citeauthoryearGreensmith, Bartlett, and Baxter2004] Greensmith, E.; Bartlett, P. L.; and Baxter, J. 2004. Variance reduction techniques for gradient estimates in reinforcement learning. J. Mach. Learn. Res. 5:1471–1530.
 [\citeauthoryearGu et al.2016a] Gu, S.; Lillicrap, T. P.; Ghahramani, Z.; Turner, R. E.; and Levine, S. 2016a. Q-Prop: Sample-efficient policy gradient with an off-policy critic. CoRR abs/1611.02247.
 [\citeauthoryearGu et al.2016b] Gu, S.; Lillicrap, T. P.; Sutskever, I.; and Levine, S. 2016b. Continuous deep Q-learning with model-based acceleration. CoRR abs/1603.00748.
 [\citeauthoryearHaarnoja et al.2017] Haarnoja, T.; Tang, H.; Abbeel, P.; and Levine, S. 2017. Reinforcement learning with deep energy-based policies. CoRR abs/1702.08165.
 [\citeauthoryearHansen, Müller, and Koumoutsakos2003] Hansen, N.; Müller, S. D.; and Koumoutsakos, P. 2003. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evol. Comput. 11(1):1–18.
 [\citeauthoryearHarmon and Baird1996a] Harmon, M. E., and Baird, L. C. 1996a. Advantage learning applied to a game with nonlinear dynamics and a nonlinear function approximator. Proceedings of the International Conference on Neural Networks (ICNN).
 [\citeauthoryearHarmon and Baird1996b] Harmon, M. E., and Baird, L. C. 1996b. Multi-player residual advantage learning with general function approximation. Technical report, Wright Laboratory.
 [\citeauthoryearHenderson et al.2017] Henderson, P.; Islam, R.; Bachman, P.; Pineau, J.; Precup, D.; and Meger, D. 2017. Deep reinforcement learning that matters. CoRR abs/1709.06560.
 [\citeauthoryearHoeffding1963] Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. Journal of the American statistical association 58(301):13–30.
 [\citeauthoryearHomemde Mello2007] Homem-de-Mello, T. 2007. A study on the cross-entropy method for rare-event probability estimation. INFORMS Journal on Computing 19(3):381–394.
 [\citeauthoryearHu, Fu, and Marcus2007] Hu, J.; Fu, M. C.; and Marcus, S. I. 2007. A model reference adaptive search method for global optimization. Operations Research 55(3):549–568.
 [\citeauthoryearKakade2001] Kakade, S. 2001. A natural policy gradient. In Advances in Neural Information Processing Systems 14 (NIPS 2001). MIT Press.
 [\citeauthoryearKroese, Porotsky, and Rubinstein2006] Kroese, D. P.; Porotsky, S.; and Rubinstein, R. Y. 2006. The cross-entropy method for continuous multi-extremal optimization. Methodology and Computing in Applied Probability 8(3):383–407.
 [\citeauthoryearKushner and Clark2012] Kushner, H. J., and Clark, D. S. 2012. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer Science & Business Media.
 [\citeauthoryearLillicrap et al.2015] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. CoRR abs/1509.02971.
 [\citeauthoryearMannor, Rubinstein, and Gat2003] Mannor, S.; Rubinstein, R.; and Gat, Y. 2003. The cross entropy method for fast policy search. In Proceedings of the Twentieth International Conference on Machine Learning, 512–519. AAAI Press.
 [\citeauthoryearMelo and Ribeiro2007] Melo, F. S., and Ribeiro, M. I. 2007. Convergence of Q-learning with linear function approximation. In European Control Conference.
 [\citeauthoryearMnih et al.2013] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. A. 2013. Playing Atari with deep reinforcement learning. CoRR abs/1312.5602.
 [\citeauthoryearMnih et al.2016] Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T. P.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. CoRR abs/1602.01783.
 [\citeauthoryearMorris1982] Morris, C. N. 1982. Natural exponential families with quadratic variance functions. The Annals of Statistics 65–80.
 [\citeauthoryearPeters and Schaal2007] Peters, J., and Schaal, S. 2007. Reinforcement learning by rewardweighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, 745–750. ACM.
 [\citeauthoryearPeters and Schaal2008] Peters, J., and Schaal, S. 2008. Natural actor-critic. Neurocomputing 71(7):1180–1190.
 [\citeauthoryearRobbins and Monro1985] Robbins, H., and Monro, S. 1985. A stochastic approximation method. In Herbert Robbins Selected Papers. Springer. 102–109.
 [\citeauthoryearRubinstein and Shapiro1993] Rubinstein, R. Y., and Shapiro, A. 1993. Discrete event systems: Sensitivity analysis and stochastic optimization by the score function method, volume 1. Wiley New York.
 [\citeauthoryearRubinstein1999] Rubinstein, R. 1999. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability 1(2):127–190.
 [\citeauthoryearSalimans et al.2017] Salimans, T.; Ho, J.; Chen, X.; and Sutskever, I. 2017. Evolution strategies as a scalable alternative to reinforcement learning. CoRR abs/1703.03864.
 [\citeauthoryearSchulman et al.2015] Schulman, J.; Levine, S.; Moritz, P.; Jordan, M. I.; and Abbeel, P. 2015. Trust region policy optimization. CoRR abs/1502.05477.
 [\citeauthoryearSchulman et al.2016] Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; and Abbeel, P. 2016. Highdimensional continuous control using generalized advantage estimation. International Conference on Learning Representations.
 [\citeauthoryearSchulman et al.2017] Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. CoRR abs/1707.06347.
 [\citeauthoryearSen and Singer2017] Sen, P. K., and Singer, J. M. 2017. Large Sample Methods in Statistics (1994): An Introduction with Applications. CRC Press.
 [\citeauthoryearSilver et al.2014] Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, I–387–I–395. JMLR.org.
 [\citeauthoryearSutton et al.2000] Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems.
 [\citeauthoryearSutton1984] Sutton, R. S. 1984. Temporal Credit Assignment in Reinforcement Learning. Ph.D. Dissertation.
 [\citeauthoryearSzita and Lörincz2006] Szita, I., and Lörincz, A. 2006. Learning Tetris using the noisy cross-entropy method. Neural Computation 18(12):2936–2941.
 [\citeauthoryearTodorov, Erez, and Tassa2012] Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In IROS, 5026–5033. IEEE.
 [\citeauthoryearUhlenbeck and Ornstein1930] Uhlenbeck, G. E., and Ornstein, L. S. 1930. On the theory of the Brownian motion. Physical Review 36(5):823.
 [\citeauthoryearvan Hasselt, Guez, and Silver2015] van Hasselt, H.; Guez, A.; and Silver, D. 2015. Deep reinforcement learning with double Q-learning. CoRR abs/1509.06461.
 [\citeauthoryearWang et al.2016] Wang, Z.; Bapst, V.; Heess, N.; Mnih, V.; Munos, R.; Kavukcuoglu, K.; and de Freitas, N. 2016. Sample efficient actor-critic with experience replay. CoRR abs/1611.01224.
 [\citeauthoryearWatkins and Dayan1992] Watkins, C. J. C. H., and Dayan, P. 1992. Qlearning. In Machine Learning, 279–292.
 [\citeauthoryearWu et al.2017] Wu, Y.; Mansimov, E.; Liao, S.; Grosse, R. B.; and Ba, J. 2017. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. CoRR abs/1708.05144.
Supplementary Material
Appendix A
1 Convergence Analysis
In this section, we prove that the stochastic Conditional Cross-Entropy Method update for the Actor tracks an underlying deterministic ODE for the expected Cross-Entropy update over states. We begin by providing some definitions, particularly for the quantile function, which is central to the analysis. We then lay out the assumptions and discuss some policy parameterizations that satisfy them. Finally, we state the theorem, with proof, and provide one lemma needed for the theorem in the final subsection.
1.1 Notation and Definitions
Notation:
For a set A, let \mathring{A} represent the interior of A, while \partial A is the boundary of A. The abbreviation a.s. stands for almost surely and i.o. stands for infinitely often. Let \mathbb{N} represent the set \{0,1,2,\dots\}. For a set A, we let I_{A} be the indicator (characteristic) function of A, defined as I_{A}(x)=1 if x\in A and 0 otherwise. Let \mathbb{E}_{g}[\cdot], \mathbb{V}_{g}[\cdot] and \mathbb{P}_{g}(\cdot) denote the expectation, variance and probability measure w.r.t. g. For a \sigma-field \mathcal{F}, let \mathbb{E}\left[\cdot\mid\mathcal{F}\right] represent the conditional expectation w.r.t. \mathcal{F}.
A function f:X\rightarrow Y is called Lipschitz continuous if \exists L\in(0,\infty) s.t. \|f(\mathbf{x}_{1})-f(\mathbf{x}_{2})\|\leq L\|\mathbf{x}_{1}-\mathbf{x}_{2}\|, \forall\mathbf{x}_{1},\mathbf{x}_{2}\in X. A function f is called locally Lipschitz continuous if for every \mathbf{x}\in X, there exists a neighbourhood U of \mathbf{x} such that f\vert_{U} is Lipschitz continuous. Let C(X,Y) represent the space of continuous functions from X to Y. Also, let B_{r}(\mathbf{x}) represent an open ball of radius r centered at \mathbf{x}. For a positive integer M, let [M]:=\{1,2,\dots,M\}.
Definition 1.
A function \Gamma:U\subseteq{\rm I\!R}^{d_{1}}\rightarrow V\subseteq{\rm I\!R}^{d_{2}} is Fréchet differentiable at \mathbf{x}\in U if there exists a bounded linear operator \widehat{\Gamma}_{\mathbf{x}}:{\rm I\!R}^{d_{1}}\rightarrow{\rm I\!R}^{d_{2}} such that the limit
\displaystyle\lim_{\epsilon\downarrow 0}\frac{\Gamma(\mathbf{x}+\epsilon\mathbf{y})-\mathbf{x}}{\epsilon}  (3) 
exists and is equal to \widehat{\Gamma}_{\mathbf{x}}(\mathbf{y}). We say \Gamma is Fréchet differentiable if the Fréchet derivative of \Gamma exists at every point in its domain.
Definition 2.
Given a bounded real-valued continuous function H:{\rm I\!R}^{d}\rightarrow{\rm I\!R} with H(a)\in[H_{l},H_{u}] and a scalar \rho\in[0,1], we define the (1-\rho)-quantile of H(A) w.r.t. the PDF g (denoted as f^{\rho}(H,g)) as follows:
\displaystyle f^{\rho}(H,g):=\sup_{\ell\in[H_{l},H_{u}]}\{\mathbb{P}_{g}\big{(% }H(A)\geq\ell\big{)}\geq\rho\},  (4) 
where \mathbb{P}_{g} is the probability measure induced by the PDF g, i.e., for a Borel set \mathcal{A}, \mathbb{P}_{g}(\mathcal{A}):=\int_{\mathcal{A}}g(a)da.
This quantile operator will be used to succinctly write the quantile of Q_{\theta}(S,\cdot), with actions selected according to \pi_{\mathbf{w}}, i.e.,
f_{\theta}^{\rho}(\mathbf{w};s):=f^{\rho}(Q_{\theta}(s,\cdot),\pi_{\mathbf{w}}(\cdot\mid s))=\sup_{\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]}\{\mathbb{P}_{\pi_{\mathbf{w}}(\cdot\mid s)}\big{(}Q_{\theta}(s,A)\geq\ell\big{)}\geq\rho\}.  (5) 
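As a concrete illustration of this quantile, below is a sketch of the standard empirical estimate computed from a finite sample of Q-values (the function name is ours): the supremum in Eq. (4) is attained, for the empirical measure, at the \lceil\rho N\rceil-th largest sample.

```python
import numpy as np

def empirical_quantile(q_values, rho):
    """Empirical (1-rho)-quantile in the sense of Eq. (4): the largest
    threshold ell such that at least a rho-fraction of the sampled
    Q-values are >= ell; i.e., the ceil(rho*N)-th largest sample."""
    q_sorted = np.sort(q_values)[::-1]        # sort descending
    k = int(np.ceil(rho * len(q_values)))     # size of the "elite" set
    return q_sorted[k - 1]

q = np.array([0.1, 0.9, 0.5, 0.7, 0.3])
# top 40% of 5 samples = 2 samples, so the threshold is the 2nd-largest value
assert empirical_quantile(q, 0.4) == 0.7
```

This is exactly the threshold \widehat{f}^{\rho}_{t+1} used by the Actor's update below, computed from the N_{t} actions sampled at time t.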
1.2 Assumptions
Assumption 1.
Given a realization of the transition dynamics of the MDP in the form of a sequence of transition tuples \mathcal{O}:=\{(S_{t},A_{t},R_{t},S^{\prime}_{t})\}_{t\in\mathbb{N}}, where the state S_{t}\in\mathcal{S} is drawn using a latent sampling distribution \nu, A_{t}\in\mathcal{A} is the action chosen at state S_{t}, the transitioned state \mathcal{S}\ni S^{\prime}_{t}\sim P(S_{t},A_{t},\cdot) and the reward {\rm I\!R}\ni R_{t}:=R(S_{t},A_{t},S^{\prime}_{t}). We further assume that the reward is uniformly bounded, i.e., |R(\cdot,\cdot,\cdot)|<R_{max}<\infty.
Here, we analyze the long-run behaviour of the conditional cross-entropy recursion (Actor), which is defined as follows:
\displaystyle\mathbf{w}_{t+1}:=\Gamma^{W}\left\{\mathbf{w}_{t}+\alpha_{a,t}\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}{I_{\{Q_{\theta_{t}}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A\mid S_{t})\right\},  (6)  
\displaystyle \text{ where }\Xi_{t}:=\{A_{t,1},A_{t,2},\dots,A_{t,N_{t}}\}\overset{\text{iid}}{\sim}\pi_{\mathbf{w}^{\prime}_{t}}(\cdot\mid S_{t}),  
\displaystyle\mathbf{w}^{\prime}_{t+1}:=\mathbf{w}^{\prime}_{t}+\alpha^{\prime}_{a,t}\left(\mathbf{w}_{t+1}-\mathbf{w}^{\prime}_{t}\right).  (7) 
Here, \Gamma^{W}\{\cdot\} is the projection operator onto the compact (closed and bounded) and convex set W\subset{\rm I\!R}^{m} with a smooth boundary \partial W. Therefore, \Gamma^{W} maps vectors in {\rm I\!R}^{m} to the nearest vectors in W w.r.t. the Euclidean distance (or equivalent metric). Convexity and compactness ensure that the projection is unique and belongs to W.
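For intuition, the recursion above can be sketched in a toy setting: a single state, a 1-D Gaussian policy N(w, sigma^2) with fixed sigma, and a box constraint W implemented by clipping as the projection. All names and constants below are illustrative choices for this sketch, not values from our implementation.

```python
import numpy as np

def actor_step(w, w_prime, q_func, rng, rho=0.2, n=500, alpha=0.5,
               alpha_prime=0.1, sigma=0.5, w_bounds=(-2.0, 2.0)):
    """One conditional cross-entropy update in the spirit of Eqs. (6)-(7):
    w is the fast iterate, w_prime the slowly tracking sampling parameter."""
    actions = rng.normal(w_prime, sigma, size=n)           # Xi_t ~ pi_{w'_t}
    q = q_func(actions)
    thresh = np.sort(q)[::-1][int(np.ceil(rho * n)) - 1]   # empirical quantile
    elite = actions[q >= thresh]                           # actions above it
    # (1/N) * sum over elite of grad_w ln N(a; w, sigma^2) = (a - w)/sigma^2
    grad = np.sum(elite - w) / (sigma ** 2) / n
    w = float(np.clip(w + alpha * grad, *w_bounds))        # projection Gamma^W
    w_prime = w_prime + alpha_prime * (w - w_prime)        # Eq. (7)
    return w, w_prime

# Bimodal action-values: the higher mode sits at a = 1, a lower one at a = -1.
q_func = lambda a: np.exp(-8.0 * (a - 1.0) ** 2) + 0.5 * np.exp(-8.0 * (a + 1.0) ** 2)

rng = np.random.default_rng(1)
w = w_prime = 0.0
for _ in range(200):
    w, w_prime = actor_step(w, w_prime, q_func, rng)
# Both iterates should settle near the dominant mode at a = 1.
```

The elite-weighted log-likelihood gradient pulls w toward the mean of the high-value actions, while w' tracks w on the slower timescale, exactly the coupling analyzed below.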
Assumption 2.
The predetermined, deterministic, stepsize sequences \{\alpha_{a,t}\}_{t\in\mathbb{N}}, \{\alpha^{\prime}_{a,t}\}_{t\in\mathbb{N}} and \{\alpha_{q,t}\}_{t\in\mathbb{N}} are positive scalars which satisfy the following:
\displaystyle\sum_{t\in\mathbb{N}}\alpha_{a,t}=\sum_{t\in\mathbb{N}}\alpha^{\prime}_{a,t}=\sum_{t\in\mathbb{N}}\alpha_{q,t}=\infty,\qquad\sum_{t\in\mathbb{N}}\left(\alpha^{2}_{a,t}+{\alpha^{\prime}}^{2}_{a,t}+\alpha^{2}_{q,t}\right)<\infty,  
\displaystyle\lim_{t\rightarrow\infty}\frac{\alpha^{\prime}_{a,t}}{\alpha_{a,t}}=0,\qquad\lim_{t\rightarrow\infty}\frac{\alpha_{q,t}}{\alpha_{a,t}}=0. 
The first conditions in Assumption 2 are the classical Robbins-Monro conditions [\citeauthoryearRobbins and Monro1985] required for stochastic approximation algorithms. The last two conditions enable the different stochastic recursions to have separate timescales. Indeed, they ensure that the \mathbf{w}_{t} recursion is relatively faster than the recursions of \theta_{t} and \mathbf{w}^{\prime}_{t}. This timescale separation is needed to obtain the desired coherent asymptotic behaviour, as we describe in the next section.
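For concreteness, one illustrative family of schedules satisfying Assumption 2 (not necessarily the schedules used in our experiments) is

```latex
\alpha_{a,t}=\frac{1}{(t+1)^{0.6}},\qquad
\alpha'_{a,t}=\alpha_{q,t}=\frac{1}{(t+1)^{0.9}}.
```

Each sum diverges since the exponents are at most 1, each squared sum converges since the doubled exponents exceed 1, and \alpha'_{a,t}/\alpha_{a,t}=\alpha_{q,t}/\alpha_{a,t}=(t+1)^{-0.3}\rightarrow 0, so the Actor recursion for \mathbf{w}_{t} indeed moves on the faster timescale.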
Assumption 3.
The predetermined, deterministic sample-size schedule \{N_{t}\in\mathbb{N}\}_{t\in\mathbb{N}} is positive and increases strictly monotonically to \infty, with \inf_{t\in\mathbb{N}}\frac{N_{t+1}}{N_{t}}>1.
Assumption 3 states that the number of samples increases to infinity; it is primarily required to ensure that the error arising from the estimation of the sample quantiles eventually decays to 0. In practice, one can use a fixed, finite, positive integer N_{t} that is large enough to keep the estimation error acceptably small.
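As one illustrative schedule satisfying Assumption 3, consider

```latex
N_{t+1}=\lceil \gamma N_{t}\rceil,\qquad \gamma>1,\; N_{0}\in\mathbb{N},
```

for which N_{t+1}/N_{t}\geq\gamma N_{t}/N_{t}=\gamma>1 for all t, so the infimum condition holds and N_{t} grows geometrically to \infty.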
Assumption 4.
The sequence \{\theta_{t}\}_{t\in\mathbb{N}} satisfies \theta_{t}\in\Theta, where \Theta \subset{\rm I\!R}^{n} is a convex, compact set. Also, for \theta\in\Theta, let Q_{\theta}(s,a)\in[Q^{\theta}_{l},Q^{\theta}_{u}], \forall s\in\mathcal{S},a\in\mathcal{A}.
Assumption 4 assumes stability of the Expert, and minimally only requires that the values remain in a bounded range. We make no additional assumptions on the convergence properties of the Expert, as we simply need stability to prove that the Actor tracks the desired update.
Assumption 5.
For \theta\in\Theta and s\in\mathcal{S}, let \mathbb{P}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot\mid s)}\left(Q_{\theta}(s,A)\geq\ell\right)>0, \forall\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}] and \forall\mathbf{w}^{\prime}\in W.
Assumption 5 implies that there always exists a strictly positive probability mass beyond every threshold \ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]. This assumption is easily satisfied when Q_{\theta}(s,a) is continuous in a and \pi_{\mathbf{w}}(\cdot\mid s) is a continuous probability density function.
Assumption 6.
\displaystyle\sup_{\begin{subarray}{c}\mathbf{w},\mathbf{w}^{\prime}\in W,\\ \theta\in\Theta,\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]\end{subarray}}\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot\mid S)}\Big[\Big\|I_{\{Q_{\theta}(S,A)\geq\ell\}}\nabla_{\mathbf{w}}\ln{\pi_{\mathbf{w}}(A\mid S)}  
\displaystyle\qquad-\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot\mid S)}\left[I_{\{Q_{\theta}(S,A)\geq\ell\}}\nabla_{\mathbf{w}}\ln{\pi_{\mathbf{w}}(A\mid S)}\,\big|\,S\right]\Big\|_{2}^{2}\,\Big|\,S\Big]<\infty\quad a.s.,  
\displaystyle\sup_{\begin{subarray}{c}\mathbf{w},\mathbf{w}^{\prime}\in W,\\ \theta\in\Theta,\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]\end{subarray}}\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot\mid S)}\left[\Big\|I_{\{Q_{\theta}(S,A)\geq\ell\}}\nabla_{\mathbf{w}}\ln\pi_{\mathbf{w}}(A\mid S)\Big\|_{2}^{2}\,\Big|\,S\right]<\infty\quad a.s. 
Assumption 7.
For s\in\mathcal{S}, \nabla_{\mathbf{w}}\ln{\pi_{\mathbf{w}}(\cdot\mid s)} is locally Lipschitz continuous w.r.t. \mathbf{w}.
Assumptions 6 and 7 are technical requirements; they can be justified, and more precisely characterized, when \pi_{\mathbf{w}} belongs to the widely used natural exponential family (NEF) of distributions.
Definition 3.
Natural exponential family of distributions (NEF) [\citeauthoryearMorris1982]: These probability distributions over {\rm I\!R}^{m} are represented by
\{\pi_{\eta}(\mathbf{x}):=h(\mathbf{x})e^{\eta^{\top}T(\mathbf{x})-K(\eta)}\mid\eta\in\Lambda\subset{\rm I\!R}^{d}\},  (8) 
where \eta is the natural parameter, h:{\rm I\!R}^{m}\longrightarrow{\rm I\!R}, T:{\rm I\!R}^{m}\longrightarrow{\rm I\!R}^{d} is the sufficient statistic, and K(\eta):=\ln{\int{h(\mathbf{x})e^{\eta^{\top}T(\mathbf{x})}d\mathbf{x}}} is the cumulant function of the family. The space \Lambda is defined as \Lambda:=\{\eta\in{\rm I\!R}^{d}\mid K(\eta)<\infty\}. Also, the above representation is assumed to be minimal. (For a distribution in the NEF, there may exist multiple representations of the form (8); however, there always exists a representation in which the components of the sufficient statistic are linearly independent, and such a representation is referred to as minimal.) A few popular distributions which belong to the NEF include the Binomial, Poisson, Bernoulli, Gaussian, Geometric and Exponential distributions.
We parametrize the policy \pi_{\mathbf{w}}(\cdot\mid S) using a neural network, which implies that, when we consider an NEF stochastic policy, the natural parameter \eta of the NEF is parametrized by \mathbf{w}. To be more specific, we take \{\psi_{\mathbf{w}}:\mathcal{S}\rightarrow\Lambda\mid\mathbf{w}\in{\rm I\!R}^{m}\} to be the function space induced by the neural network of the Actor, i.e., for a given state s\in\mathcal{S}, \psi_{\mathbf{w}}(s) represents the natural parameter of the NEF policy \pi_{\mathbf{w}}(\cdot\mid s). Further,
\displaystyle\nabla_{\mathbf{w}}\ln\pi_{\mathbf{w}}(A\mid S)  \displaystyle=\nabla_{\mathbf{w}}\left(\ln{h(A)}+\psi_{\mathbf{w}}(S)^{\top}T(A)-K(\psi_{\mathbf{w}}(S))\right)  
\displaystyle=\nabla_{\mathbf{w}}\psi_{\mathbf{w}}(S)\left(T(A)-\nabla_{\eta}K(\psi_{\mathbf{w}}(S))\right)  
\displaystyle=\nabla_{\mathbf{w}}\psi_{\mathbf{w}}(S)\left(T(A)-\mathbb{E}_{A\sim\pi_{\mathbf{w}}(\cdot\mid S)}\left[T(A)\right]\right).  (9) 
Therefore, Assumption 7 is directly satisfied by assuming that \psi_{\mathbf{w}} is twice continuously differentiable w.r.t. \mathbf{w}.
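As a concrete instance of Eq. (9), consider a one-dimensional Gaussian policy with known variance \sigma^{2} and state-dependent mean \mu_{\mathbf{w}}(s); it is a member of the NEF with

```latex
T(a)=a,\qquad \psi_{\mathbf{w}}(s)=\frac{\mu_{\mathbf{w}}(s)}{\sigma^{2}},\qquad
h(a)=\frac{1}{\sqrt{2\pi\sigma^{2}}}\,e^{-a^{2}/(2\sigma^{2})},\qquad
K(\eta)=\frac{\sigma^{2}\eta^{2}}{2},
```

so that \nabla_{\eta}K(\psi_{\mathbf{w}}(s))=\sigma^{2}\psi_{\mathbf{w}}(s)=\mu_{\mathbf{w}}(s)=\mathbb{E}_{A\sim\pi_{\mathbf{w}}(\cdot\mid s)}[T(A)], and Eq. (9) reduces to the familiar Gaussian likelihood gradient \nabla_{\mathbf{w}}\ln\pi_{\mathbf{w}}(a\mid s)=\frac{1}{\sigma^{2}}\nabla_{\mathbf{w}}\mu_{\mathbf{w}}(s)\left(a-\mu_{\mathbf{w}}(s)\right).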
The next assumption is standard: the sample average concentrates around its mean at an exponential rate in the number of samples. The assumption reflects that this should hold for arbitrary \mathbf{w}\in W.
Assumption 8.
For \epsilon>0 and N\in\mathbb{N}, we have
\displaystyle\mathbb{P}_{\Xi\overset{\text{iid}}{\sim}\pi_{\mathbf{w}^{\prime}}(\cdot\mid s)}\Big(\Big\|\frac{1}{N}\sum_{A\in\Xi}{I_{\{Q_{\theta}(s,A)\geq f^{\rho}(Q_{\theta}(s,\cdot),\pi_{\mathbf{w}^{\prime}}(\cdot\mid s))\}}\nabla_{\mathbf{w}}\ln\pi_{\mathbf{w}}(A\mid s)}  
\displaystyle\qquad-\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot\mid s)}\left[I_{\{Q_{\theta}(s,A)\geq f^{\rho}(Q_{\theta}(s,\cdot),\pi_{\mathbf{w}^{\prime}}(\cdot\mid s))\}}\nabla_{\mathbf{w}}\ln\pi_{\mathbf{w}}(A\mid s)\right]\Big\|\geq\epsilon\Big)\leq C_{1}\exp{\left(-c_{2}N^{c_{3}}\epsilon^{c_{4}}\right)},  
\displaystyle\forall\theta\in\Theta,\mathbf{w},\mathbf{w}^{\prime}\in W,s\in\mathcal{S}, 
where C_{1},c_{2},c_{3},c_{4}>0.
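To see why a bound of this form is plausible, note that for a fixed threshold \ell and summands with coordinates bounded in [-B,B] (as when the log-likelihood gradients are uniformly bounded), Hoeffding's inequality [\citeauthoryearHoeffding1963] gives, per coordinate,

```latex
\mathbb{P}\left(\left|\frac{1}{N}\sum_{i=1}^{N}X_{i}-\mathbb{E}[X_{1}]\right|\geq\epsilon\right)
\leq 2\exp\left(-\frac{N\epsilon^{2}}{2B^{2}}\right),
```

which matches the assumed form with C_{1}=2, c_{2}=1/(2B^{2}), c_{3}=1 and c_{4}=2. The random sample quantile \widehat{f}^{\rho}_{t+1} additionally requires a uniform argument over thresholds, which is why we state the bound as an assumption.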
Assumption 9.
For every \theta\in\Theta, s\in\mathcal{S} and \mathbf{w}\in W, f_{\theta}^{\rho}(\mathbf{w};s) (from Eq. (5)) exists and is unique.
The above assumption ensures that the true (1-\rho)-quantile is unique; it is usually satisfied for most distributions and a well-behaved Q_{\theta}.
1.3 Main Theorem
To analyze the algorithm, we employ the ODE-based analysis proposed in [\citeauthoryearBorkar2008, \citeauthoryearKushner and Clark2012]. The actor recursions (Eqs. (6)-(7)) form a classical two-timescale stochastic approximation recursion, with a bilateral coupling between the individual stochastic recursions (6) and (7). Since the step-size schedules \{\alpha_{a,t}\}_{t\in\mathbb{N}} and \{\alpha^{\prime}_{a,t}\}_{t\in\mathbb{N}} satisfy \frac{\alpha^{\prime}_{a,t}}{\alpha_{a,t}}\rightarrow 0, we have \alpha^{\prime}_{a,t}\rightarrow 0 relatively faster than \alpha_{a,t}\rightarrow 0. This disparity induces a pseudo-heterogeneous rate of convergence (i.e., timescales) between the individual stochastic recursions, which leads to the asymptotic emergence of a stable, coherent behaviour that is quasi-asynchronous. This behaviour can be interpreted from multiple viewpoints: when viewed from the faster timescale recursion (controlled by \alpha_{a,t}), the slower timescale recursion (controlled by \alpha^{\prime}_{a,t}) appears quasi-static ('almost a constant'); likewise, when observed from the slower timescale, the faster timescale recursion seems equilibrated. The existence of this stable long-run behaviour, under certain standard assumptions on stochastic approximation algorithms, is rigorously established in [\citeauthoryearBorkar1997] and in Chapter 6 of [\citeauthoryearBorkar2008]. For our setting (Eqs. (6)-(7)), we can directly apply this characterization of the long-run behaviour of two-timescale stochastic approximation algorithms, after ensuring that our setting complies with its prerequisites, by considering the slow timescale stochastic recursion (7) to be quasi-stationary (i.e., {\mathbf{w}^{\prime}}_{t}\equiv\mathbf{w}^{\prime}, a.s., \forall t\in\mathbb{N}) while analyzing the limiting behaviour of the faster timescale recursion (6). Similarly, we let \theta_{t} be quasi-stationary too (i.e., \theta_{t}\equiv\theta, a.s., \forall t\in\mathbb{N}). The asymptotic behaviour of the slower timescale recursion is then analyzed by replacing the faster timescale variable \mathbf{w}_{t} with the limit point obtained in the quasi-stationary analysis.
Define the filtration \{\mathcal{F}_{t}\}_{t\in\mathbb{N}}, a family of increasing natural \sigma-fields, where \mathcal{F}_{t}:=\sigma\left(\{{\mathbf{w}}_{i},{\mathbf{w}}^{\prime}_{i},(S_{i},A_{i},R_{i},S^{\prime}_{i}),\Xi_{i};0\leq i\leq t\}\right).
Theorem 2.
Let \mathbf{w}^{\prime}_{t}\equiv\mathbf{w}^{\prime},\theta_{t}\equiv\theta,\forall t\in\mathbb{N} a.s. Let Assumptions 1-9 hold. Then the stochastic sequence \{\mathbf{w}_{t}\}_{t\in\mathbb{N}} generated by the stochastic recursion (6) asymptotically tracks the following ODE:
\displaystyle\frac{d}{dt}{\mathbf{w}}(t)=\widehat{\Gamma}^{W}_{{\mathbf{w}}(t)}\left(\nabla_{\mathbf{w}(t)}\mathbb{E}_{S\sim\nu,A\sim\pi_{\mathbf{w}^{\prime}}(\cdot\mid S)}\Big[I_{\{Q_{\theta}(S,A)\geq f_{\theta}^{\rho}(\mathbf{w}^{\prime};S)\}}\ln\pi_{\mathbf{w}(t)}(A\mid S)\Big]\right),\qquad t\geq 0.  (10) 
In other words, \lim_{t\rightarrow\infty}\mathbf{w}_{t}\in\mathcal{K} a.s., where \mathcal{K} is the set of stable equilibria of the ODE (10) contained inside W.
Proof.
Firstly, we rewrite the stochastic recursion (6) under the hypothesis that \theta_{t} and \mathbf{w}^{\prime}_{t} are quasi-stationary, i.e., \theta_{t}\underset{a.s.}{\equiv}\theta and \mathbf{w}^{\prime}_{t}\underset{a.s.}{\equiv}\mathbf{w}^{\prime}, as follows:
\displaystyle\mathbf{w}_{t+1}  \displaystyle:=\Gamma^{W}\left\{\mathbf{w}_{t}+\alpha_{a,t}\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}{I_{\{Q_{\theta}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A\mid S_{t})\right\}  
\displaystyle=\Gamma^{W}\Bigg\{\mathbf{w}_{t}+\alpha_{a,t}\Bigg(\mathbb{E}_{S_{t}\sim\nu,A\sim\pi_{\mathbf{w}^{\prime}}(\cdot\mid S_{t})}\left[I_{\{Q_{\theta}(S_{t},A)\geq f_{\theta}^{\rho}(\mathbf{w}^{\prime};S_{t})\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A\mid S_{t})\right]  
\displaystyle\qquad-\mathbb{E}_{S_{t}\sim\nu,A\sim\pi_{\mathbf{w}^{\prime}}(\cdot\mid S_{t})}\Big[I_{\{Q_{\theta}(S_{t},A)\geq f_{\theta}^{\rho}(\mathbf{w}^{\prime};S_{t})\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A\mid S_{t})\Big]  
\displaystyle\qquad+\mathbb{E}\bigg[\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}{I_{\{Q_{\theta}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A\mid S_{t})}\,\bigg|\,\mathcal{F}_{t}\bigg]  
\displaystyle\qquad-\mathbb{E}\bigg[\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}{I_{\{Q_{\theta}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A\mid S_{t})}\,\bigg|\,\mathcal{F}_{t}\bigg]  
\displaystyle\qquad+\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}{I_{\{Q_{\theta}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A\mid S_{t})\Bigg)\Bigg\}  
\displaystyle=\Gamma^{W}\Big\{\mathbf{w}_{t}+\alpha_{a,t}\big(g^{\theta}(\mathbf{w}_{t})+\mathbb{M}_{t+1}+\ell^{\theta}_{t}\big)\Big\},  (11) 
where f_{\theta}^{\rho}(\mathbf{w}^{\prime};S):=f^{\rho}(Q_{\theta}(S,\cdot),\pi_{\mathbf{w}^{\prime}}(\cdot\mid S)) and \nabla_{\mathbf{w}_{t}}:=\nabla_{\mathbf{w}=\mathbf{w}_{t}}, i.e., the gradient w.r.t. \mathbf{w} at \mathbf{w}_{t}. Also,
\displaystyle g^{\theta}(\mathbf{w})  \displaystyle:=\mathbb{E}_{S_{t}\sim\nu,A\sim\pi_{\mathbf{w}^{\prime}}(\cdot\mid S_{t})}\Big[I_{\{Q_{\theta}(S_{t},A)\geq f_{\theta}^{\rho}(\mathbf{w}^{\prime};S_{t})\}}\nabla_{\mathbf{w}}\ln\pi_{\mathbf{w}}(A\mid S_{t})\Big],  (12) 
\displaystyle\mathbb{M}_{t+1}  \displaystyle:=\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}{I_{\{Q_{\theta}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A\mid S_{t})}  
\displaystyle\qquad-\mathbb{E}\Bigg[\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}{I_{\{Q_{\theta}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A\mid S_{t})}\,\Big|\,\mathcal{F}_{t}\Bigg],  (13) 
\displaystyle\ell^{\theta}_{t}  \displaystyle:=\mathbb{E}\bigg[\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}{I_{\{Q_{\theta}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A\mid S_{t})}\,\bigg|\,\mathcal{F}_{t}\bigg]  
\displaystyle\qquad-\mathbb{E}_{S_{t}\sim\nu,A\sim\pi_{\mathbf{w}^{\prime}}(\cdot\mid S_{t})}\Big[I_{\{Q_{\theta}(S_{t},A)\geq f_{\theta}^{\rho}(\mathbf{w}^{\prime};S_{t})\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A\mid S_{t})\Big].  (14) 
A few observations are in order:

B1.
\{\mathbb{M}_{t+1}\}_{t\in\mathbb{N}} is a martingale-difference noise sequence w.r.t. the filtration \{\mathcal{F}_{t}\}_{t\in\mathbb{N}}, i.e., \mathbb{M}_{t+1} is \mathcal{F}_{t+1}-measurable and integrable \forall t\in\mathbb{N}, and \mathbb{E}\left[\mathbb{M}_{t+1}\,|\,\mathcal{F}_{t}\right]=0 a.s., \forall t\in\mathbb{N}.

B2.
g^{\theta} is locally Lipschitz continuous. This follows from Assumption 7.

B3.
\ell^{\theta}_{t}\rightarrow 0 a.s. as t\rightarrow\infty. (By Lemma 2 below).

B4.
The iterates \{\mathbf{w}_{t}\}_{t\in\mathbb{N}} are bounded almost surely, i.e.,
\displaystyle\sup_{t\in\mathbb{N}}\|\mathbf{w}_{t}\|<\infty\quad a.s.
This is ensured by the explicit application of the projection operator \Gamma^{W}\{\cdot\} at every iteration, which maps the iterates \{\mathbf{w}_{t}\}_{t\in\mathbb{N}} onto the bounded set W.

B5.
\exists L\in(0,\infty)\ s.t.\ \mathbb{E}\left[\|\mathbb{M}_{t+1}\|^{2}\,|\,\mathcal{F}_{t}\right]\leq L\left(1+\|\mathbf{w}_{t}\|^{2}\right)\quad a.s.
This follows from Assumption 6 (ii).
Now, we rewrite the stochastic recursion (11) as follows:
\displaystyle\mathbf{w}_{t+1}:=\mathbf{w}_{t}+\alpha_{a,t}\frac{\Gamma^{W}\left\{\mathbf{w}_{t}+\alpha_{a,t}\left(g^{\theta}(\mathbf{w}_{t})+\mathbb{M}_{t+1}+\ell^{\theta}_{t}\right)\right\}-\mathbf{w}_{t}}{\alpha_{a,t}}
\displaystyle={\mathbf{w}}_{t}+\alpha_{a,t}\left(\widehat{\Gamma}^{W}_{{\mathbf{w}}_{t}}(g^{\theta}({\mathbf{w}}_{t}))+\widehat{\Gamma}^{W}_{{\mathbf{w}}_{t}}\left(\mathbb{M}_{t+1}\right)+\widehat{\Gamma}^{W}_{{\mathbf{w}}_{t}}\left({\ell}^{\theta}_{t}\right)+o(\alpha_{a,t})\right),  (15)
where \widehat{\Gamma}^{W} is the Fréchet derivative (Definition 3).
The above stochastic recursion is again a stochastic approximation recursion, with vector field \widehat{\Gamma}^{W}_{{\mathbf{w}}_{t}}(g^{\theta}({\mathbf{w}}_{t})), noise term \widehat{\Gamma}^{W}_{{\mathbf{w}}_{t}}\left(\mathbb{M}_{t+1}\right), bias term \widehat{\Gamma}^{W}_{{\mathbf{w}}_{t}}\left({\ell}^{\theta}_{t}\right), and an additional error term o(\alpha_{a,t}) that is asymptotically inconsequential.
Also, note that \Gamma^{W} is a single-valued map since the set W is assumed convex, and the limit below exists since the boundary \partial W is assumed smooth. Further, for \mathbf{w}\in\mathring{W}, we have
\displaystyle\widehat{\Gamma}^{W}_{\mathbf{w}}(\mathbf{u}):=\lim_{\epsilon\rightarrow 0^{+}}\frac{\Gamma^{W}\left\{\mathbf{w}+\epsilon\mathbf{u}\right\}-\mathbf{w}}{\epsilon}=\lim_{\epsilon\rightarrow 0^{+}}\frac{\mathbf{w}+\epsilon\mathbf{u}-\mathbf{w}}{\epsilon}=\mathbf{u}\quad\text{(for sufficiently small }\epsilon\text{)},  (16)
i.e., \widehat{\Gamma}^{W}_{\mathbf{w}}(\cdot) is an identity map for \mathbf{w}\in\mathring{W}.
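The identity-map property in Eq. (16) is easy to sanity-check numerically. The sketch below is illustrative only: the box constraint set and the specific points are arbitrary choices, not from the paper. For an interior point the projected difference quotient recovers the direction \mathbf{u} exactly; at a boundary point the projection clips away the outward component.

```python
import numpy as np

def proj_box(w, lo=-1.0, hi=1.0):
    # Gamma^W: Euclidean projection onto the box W = [lo, hi]^d (a stand-in convex set)
    return np.clip(w, lo, hi)

w = np.array([0.3, -0.5])   # a point in the interior of W
u = np.array([2.0, -3.0])   # an arbitrary direction

# For small eps, w + eps*u stays inside W, so the difference quotient equals u.
quotient = (proj_box(w + 1e-6 * u) - w) / 1e-6

# At a boundary point the projection clips, and the quotient loses the
# outward component (here the first coordinate is pinned at the boundary).
wb = np.array([1.0, -0.5])
boundary_quotient = (proj_box(wb + 1e-6 * u) - wb) / 1e-6
```

This mirrors why the projected ODE analysis only needs \widehat{\Gamma}^{W} to deviate from the identity on \partial W.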
Now by appealing to Theorem 2, Chapter 2 of [\citeauthoryearBorkar2008] along with the observations B1B5, we conclude that the stochastic recursion (6) asymptotically tracks the following ODE almost surely:
\displaystyle\frac{d}{dt}{\mathbf{w}}(t)=\widehat{\Gamma}^{W}_{{\mathbf{w}}(t)}(g^{\theta}({\mathbf{w}}(t))),\quad t\geq 0
\displaystyle=\widehat{\Gamma}^{W}_{{\mathbf{w}}(t)}\left(\mathbb{E}_{S\sim\nu,A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\Big[I_{\{Q_{\theta}(S,A)\geq f_{\theta}^{\rho}(\mathbf{w}^{\prime};S)\}}\nabla_{\mathbf{w}(t)}\ln\pi_{\mathbf{w}}(A|S)\Big]\right)
\displaystyle=\widehat{\Gamma}^{W}_{{\mathbf{w}}(t)}\left(\nabla_{\mathbf{w}(t)}\mathbb{E}_{S\sim\nu,A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\Big[I_{\{Q_{\theta}(S,A)\geq f_{\theta}^{\rho}(\mathbf{w}^{\prime};S)\}}\ln\pi_{\mathbf{w}}(A|S)\Big]\right).  (17)
The interchange of the expectation and the gradient in the last equality follows from the dominated convergence theorem and Assumption 7 [\citeauthoryearRubinstein and Shapiro1993]. The above ODE is a gradient flow with dynamics restricted to W. This further implies that the stochastic recursion (6) converges to a (possibly sample-path dependent) asymptotically stable equilibrium point of the above ODE inside W. ∎
1.4 Proof of Lemma 2 to satisfy Condition 3
In this section, we show in Lemma 2 that \ell^{\theta}_{t}\rightarrow 0 a.s. as t\rightarrow\infty. To do so, we first prove several supporting lemmas. Lemma 1 shows that, for a given Actor and Expert, the sample quantile converges to the true quantile; using this lemma, we can then prove Lemma 2. In the subsequent subsection, we provide three supporting lemmas on the convexity and Lipschitz properties of the sample quantiles, required for the proof of Lemma 1.
For this section, we require the following characterization of f^{\rho}(Q_{\theta}(s,\cdot),\mathbf{w}^{\prime}); please refer to Lemma 1 of [\citeauthoryearHomem-de-Mello2007] for more details.
\displaystyle f^{\rho}(Q_{\theta}(s,\cdot),\mathbf{w}^{\prime})=\operatorname*{arg\,min}_{\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]}\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left[\Psi(Q_{\theta}(s,A),\ell)\right],  (18)
where \Psi(y,\ell):=(y-\ell)(1-\rho)I_{\{y\geq\ell\}}+(\ell-y)\rho I_{\{\ell\geq y\}}.
Similarly, the sample estimate of the true (1-\rho)-quantile, i.e., \widehat{f}^{\rho}:=Q^{(\lceil(1-\rho)N\rceil)}_{\theta,s} (where Q_{\theta,s}^{(i)} is the i-th order statistic of the random sample \{Q_{\theta}(s,A)\}_{A\in\Xi} with \Xi:=\{A_{i}\}_{i=1}^{N}\stackrel{\text{iid}}{\sim}\pi_{\mathbf{w}^{\prime}}(\cdot|s)), can be characterized as the unique solution of the stochastic counterpart of the above optimization problem, i.e.,
\displaystyle\widehat{f}^{\rho}=\operatorname*{arg\,min}_{\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]}\frac{1}{N}\sum_{A\in\Xi}\Psi(Q_{\theta}(s,A),\ell),\qquad|\Xi|=N.  (19)
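The \Psi-loss characterization in Eqs. (18)-(19) can be checked numerically: the \lceil(1-\rho)N\rceil-th order statistic minimizes the sample-average pinball loss. The sketch below is a minimal illustration; the Gaussian samples stand in for \{Q_{\theta}(s,A)\}_{A\in\Xi}, and the function names are ours, not from the paper.

```python
import numpy as np

def pinball_loss(y, ell, rho):
    # Psi(y, ell) = (y - ell)(1 - rho) I{y >= ell} + (ell - y) rho I{ell >= y}
    return np.where(y >= ell, (y - ell) * (1.0 - rho), (ell - y) * rho)

def empirical_quantile(q_values, rho):
    # Order-statistic estimate of the (1 - rho)-quantile: the
    # ceil((1 - rho) * N)-th smallest sample, as in the definition of f-hat.
    n = len(q_values)
    k = int(np.ceil((1.0 - rho) * n))   # 1-based index of the order statistic
    return np.sort(q_values)[k - 1]

rng = np.random.default_rng(0)
rho = 0.3
q_samples = rng.normal(size=997)        # stand-in for Q_theta(s, A), A ~ pi_w'(.|s)
f_hat = empirical_quantile(q_samples, rho)

# The order statistic minimizes the sample-average pinball loss over candidate
# thresholds (the minimizer of a piecewise-linear convex sum lies at a sample point).
losses = [pinball_loss(q_samples, ell, rho).mean() for ell in q_samples]
assert pinball_loss(q_samples, f_hat, rho).mean() <= min(losses) + 1e-9
```

The same check reproduces the first-order condition behind Eq. (23): at the minimizer, roughly a (1-\rho) fraction of the samples lies below the threshold.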
Lemma 1.
Assume \theta_{t}\equiv\theta, \mathbf{w}^{\prime}_{t}\equiv\mathbf{w}^{\prime}, \forall t\in\mathbb{N}. Also, let Assumptions 3-5 hold. Then, for a given state s\in\mathcal{S},
\displaystyle\lim_{t\rightarrow\infty}\widehat{f}^{\rho}_{t}=f^{\rho}(Q_{\theta}(s,\cdot),\mathbf{w}^{\prime})\quad a.s.,
where \widehat{f}^{\rho}_{t}:=Q^{(\lceil(1-\rho)N_{t}\rceil)}_{\theta,s} (where Q_{\theta,s}^{(i)} is the i-th order statistic of the random sample \{Q_{\theta}(s,A)\}_{A\in\Xi_{t}} with \Xi_{t}:=\{A_{i}\}_{i=1}^{N_{t}}\stackrel{\text{iid}}{\sim}\pi_{\mathbf{w}^{\prime}}(\cdot|s)).
Proof.
The proof is similar to the arguments in Lemma 7 of [\citeauthoryearHu, Fu, and Marcus2007]. Since the state s and the expert parameter \theta are considered fixed, we adopt the following notation in the proof. Let
\displaystyle\widehat{f}^{\rho}_{t|s,\theta}:=\widehat{f}^{\rho}_{t}\quad\text{and}\quad f^{\rho}_{s,\theta}:=f^{\rho}(Q_{\theta}(s,\cdot),\mathbf{w}^{\prime}),  (20)
where \widehat{f}^{\rho}_{t} and f^{\rho}(Q_{\theta}(s,\cdot),\mathbf{w}^{\prime}) are defined in Equations (19) and (18), respectively.
Consider the open cover \{B_{r}(\ell),\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]\} of [Q^{\theta}_{l},Q^{\theta}_{u}].
Since [Q^{\theta}_{l},Q^{\theta}_{u}] is compact, there exists a finite subcover, i.e., \exists\{\ell_{1},\ell_{2},\dots,\ell_{M}\} s.t. \cup_{i=1}^{M}B_{r}(\ell_{i})=[Q^{\theta}_{l},Q^{\theta}_{u}].
Let \vartheta(\ell):=\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left[\Psi(Q_{\theta}(s,A),\ell)\right] and \widehat{\vartheta}_{t}(\ell):=\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}\Psi(Q_{\theta}(s,A),\ell), where \Xi_{t}:=\{A_{i}\}_{i=1}^{N_{t}}\stackrel{\text{iid}}{\sim}\pi_{\mathbf{w}^{\prime}}(\cdot|s) and |\Xi_{t}|=N_{t}.
Now, by the triangle inequality, we have for \ell\in[Q^{\theta}_{l},Q^{\theta}_{u}], choosing j such that \ell\in B_{r}(\ell_{j}),
\displaystyle|\vartheta(\ell)-\widehat{\vartheta}_{t}(\ell)|\leq|\vartheta(\ell)-\vartheta(\ell_{j})|+|\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|+|\widehat{\vartheta}_{t}(\ell_{j})-\widehat{\vartheta}_{t}(\ell)|
\displaystyle\leq L_{\rho}|\ell-\ell_{j}|+|\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|+\widehat{L}_{\rho}|\ell_{j}-\ell|
\displaystyle\leq\left(L_{\rho}+\widehat{L}_{\rho}\right)r+|\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|,  (21)
where L_{\rho} and \widehat{L}_{\rho} are the Lipschitz constants of \vartheta(\cdot) and \widehat{\vartheta}_{t}(\cdot), respectively.
For \delta>0, take r=\delta/\big(2(L_{\rho}+\widehat{L}_{\rho})\big). Also, by Kolmogorov's strong law of large numbers (Theorem 2.3.10 of [\citeauthoryearSen and Singer2017]), we have \widehat{\vartheta}_{t}(\ell)\rightarrow\vartheta(\ell) a.s. This implies that there exists T\in\mathbb{N} s.t. |\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|<\delta/2, \forall t\geq T, \forall j\in[M]. Then from Eq. (21), we have
\displaystyle|\vartheta(\ell)-\widehat{\vartheta}_{t}(\ell)|\leq\delta/2+\delta/2=\delta,\quad\forall\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}].
This implies that \widehat{\vartheta}_{t} converges uniformly to \vartheta. By Lemmas 4 and 5, \widehat{\vartheta}_{t} and \vartheta are strictly convex and Lipschitz continuous; since \widehat{\vartheta}_{t} converges uniformly to \vartheta, Lemma 3 implies that the sequence of minimizers of \widehat{\vartheta}_{t} converges to the minimizer of \vartheta. These minimizers are \widehat{f}^{\rho}_{t} and f^{\rho}(Q_{\theta}(s,\cdot),\mathbf{w}^{\prime}) respectively, and so \lim_{N_{t}\rightarrow\infty}\widehat{f}^{\rho}_{t}=f^{\rho}(Q_{\theta}(s,\cdot),\mathbf{w}^{\prime}) a.s.
Now, for \delta>0 and r:=\delta/\big(2(L_{\rho}+\widehat{L}_{\rho})\big), we obtain the following from Eq. (21):
\displaystyle|\vartheta(\ell)-\widehat{\vartheta}_{t}(\ell)|\leq\delta/2+|\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|
\displaystyle\Rightarrow\{|\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|\leq\delta/2,\forall j\in[M]\}\Rightarrow\{|\vartheta(\ell)-\widehat{\vartheta}_{t}(\ell)|\leq\delta,\forall\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]\}
\displaystyle\Rightarrow\mathbb{P}_{\pi_{\mathbf{w}^{\prime}}}\left(|\vartheta(\ell)-\widehat{\vartheta}_{t}(\ell)|\leq\delta,\forall\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]\right)\geq\mathbb{P}_{\pi_{\mathbf{w}^{\prime}}}\left(|\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|\leq\delta/2,\forall j\in[M]\right)
\displaystyle=1-\mathbb{P}_{\pi_{\mathbf{w}^{\prime}}}\left(|\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|>\delta/2,\exists j\in[M]\right)
\displaystyle\geq 1-\sum_{j=1}^{M}\mathbb{P}_{\pi_{\mathbf{w}^{\prime}}}\left(|\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|>\delta/2\right)
\displaystyle\geq 1-M\max_{j\in[M]}\mathbb{P}_{\pi_{\mathbf{w}^{\prime}}}\left(|\vartheta(\ell_{j})-\widehat{\vartheta}_{t}(\ell_{j})|>\delta/2\right)
\displaystyle\geq 1-2M\exp\left(-\frac{2N_{t}\delta^{2}}{4(Q^{\theta}_{u}-Q^{\theta}_{l})^{2}}\right),  (22)
where \mathbb{P}_{\pi_{\mathbf{w}^{\prime}}}:=\mathbb{P}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}. The last inequality follows from Hoeffding's inequality [\citeauthoryearHoeffding1963], along with the facts that \mathbb{E}_{\pi_{\mathbf{w}^{\prime}}}\left[\widehat{\vartheta}_{t}(\ell_{j})\right]=\vartheta(\ell_{j}) and 0\leq\Psi(Q_{\theta}(s,A),\ell)\leq Q^{\theta}_{u}-Q^{\theta}_{l} for \ell\in[Q^{\theta}_{l},Q^{\theta}_{u}].
Now, the subdifferential of \vartheta(\ell) is given by
\displaystyle\partial_{\ell}\vartheta(\ell)=\left[\rho-\mathbb{P}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left(Q_{\theta}(s,A)\geq\ell\right),\;\rho-1+\mathbb{P}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left(Q_{\theta}(s,A)\leq\ell\right)\right].  (23)
By the definition of the subgradient, we obtain
\displaystyle c\big|\widehat{f}^{\rho}_{t|s,\theta}-f^{\rho}_{s,\theta}\big|\leq\vartheta(\widehat{f}^{\rho}_{t|s,\theta})-\vartheta(f^{\rho}_{s,\theta}),\quad c\in\partial_{\ell}\vartheta(\ell)
\displaystyle\Rightarrow C\big|\widehat{f}^{\rho}_{t|s,\theta}-f^{\rho}_{s,\theta}\big|\leq\vartheta(\widehat{f}^{\rho}_{t|s,\theta})-\vartheta(f^{\rho}_{s,\theta}),  (24)
where C:=\max\left\{\rho-\mathbb{P}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left(Q_{\theta}(s,A)\geq f^{\rho}_{s,\theta}\right),\;\rho-1+\mathbb{P}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left(Q_{\theta}(s,A)\leq f^{\rho}_{s,\theta}\right)\right\}. Further,
\displaystyle C\big|\widehat{f}^{\rho}_{t|s,\theta}-f^{\rho}_{s,\theta}\big|\leq\vartheta(\widehat{f}^{\rho}_{t|s,\theta})-\vartheta(f^{\rho}_{s,\theta})
\displaystyle\leq\vartheta(\widehat{f}^{\rho}_{t|s,\theta})-\widehat{\vartheta}_{t}(\widehat{f}^{\rho}_{t|s,\theta})+\widehat{\vartheta}_{t}(\widehat{f}^{\rho}_{t|s,\theta})-\vartheta(f^{\rho}_{s,\theta})
\displaystyle\leq\vartheta(\widehat{f}^{\rho}_{t|s,\theta})-\widehat{\vartheta}_{t}(\widehat{f}^{\rho}_{t|s,\theta})+\sup_{\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]}|\widehat{\vartheta}_{t}(\ell)-\vartheta(\ell)|
\displaystyle\leq 2\sup_{\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]}|\widehat{\vartheta}_{t}(\ell)-\vartheta(\ell)|.  (25)
From Eqs. (22) and (25), we obtain, for \epsilon>0,
\displaystyle\mathbb{P}_{\mathbf{w}^{\prime}}\left(N_{t}^{\alpha}\big|\widehat{f}^{\rho}_{t|s,\theta}-f^{\rho}_{s,\theta}\big|\geq\epsilon\right)\leq\mathbb{P}_{\mathbf{w}^{\prime}}\left(N_{t}^{\alpha}\sup_{\ell\in[Q^{\theta}_{l},Q^{\theta}_{u}]}|\widehat{\vartheta}_{t}(\ell)-\vartheta(\ell)|\geq\frac{\epsilon}{2}\right)
\displaystyle\leq 2M\exp\left(-\frac{2N_{t}\epsilon^{2}}{16N^{2\alpha}_{t}(Q^{\theta}_{u}-Q^{\theta}_{l})^{2}}\right)=2M\exp\left(-\frac{2N_{t}^{1-2\alpha}\epsilon^{2}}{16(Q^{\theta}_{u}-Q^{\theta}_{l})^{2}}\right).
For \alpha\in(0,1/2), since \inf_{t\in\mathbb{N}}\frac{N_{t+1}}{N_{t}}\geq\tau>1 (by Assumption 3), we have
\displaystyle\sum_{t=1}^{\infty}2M\exp\left(-\frac{2N_{t}^{1-2\alpha}\epsilon^{2}}{16(Q^{\theta}_{u}-Q^{\theta}_{l})^{2}}\right)\leq\sum_{t=1}^{\infty}2M\exp\left(-\frac{2\tau^{(1-2\alpha)t}N_{0}^{1-2\alpha}\epsilon^{2}}{16(Q^{\theta}_{u}-Q^{\theta}_{l})^{2}}\right)<\infty.
Therefore, by the Borel-Cantelli lemma [\citeauthoryearDurrett1991], we have
\displaystyle\mathbb{P}_{\mathbf{w}^{\prime}}\left(N_{t}^{\alpha}\big|\widehat{f}^{\rho}_{t|s,\theta}-f^{\rho}_{s,\theta}\big|\geq\epsilon\ \text{i.o.}\right)=0.
Thus we have N_{t}^{\alpha}\left(\widehat{f}^{\rho}_{t|s,\theta}-f^{\rho}_{s,\theta}\right)\rightarrow 0 a.s. as N_{t}\rightarrow\infty. ∎
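Lemma 1's conclusion is straightforward to simulate. In the sketch below (an illustration only; the Exp(1) distribution and the growth factor τ = 2 are arbitrary choices of ours), the sample size N_t grows geometrically as in Assumption 3, and the order-statistic estimate approaches the closed-form (1-ρ)-quantile.

```python
import numpy as np

rng = np.random.default_rng(1)
rho, tau, n = 0.25, 2.0, 64         # quantile level, growth factor N_{t+1}/N_t, N_0
true_q = -np.log(rho)               # (1 - rho)-quantile of Exp(1), in closed form

errors = []
for t in range(12):                 # N_t = N_0 * tau^t, satisfying Assumption 3
    q_vals = rng.exponential(size=n)            # stand-in for {Q_theta(s, A)}_{A in Xi_t}
    k = int(np.ceil((1.0 - rho) * n))           # 1-based order-statistic index
    f_hat_t = np.sort(q_vals)[k - 1]
    errors.append(abs(f_hat_t - true_q))
    n = int(n * tau)

# The error |f_hat_t - f| shrinks as N_t grows, consistent with a.s. convergence.
```

Repeating the run with different seeds shows the error shrinking at roughly the N_t^{-1/2} rate one would expect, comfortably inside the O(N_t^{-α}) window with α < 1/2 used in the proof.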
Lemma 2.
Almost surely,
\displaystyle\ell^{\theta}_{t}\rightarrow 0\quad\text{as }N_{t}\rightarrow\infty.
Proof of Lemma 2: Consider
\displaystyle\mathbb{E}\bigg[\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}I_{\{Q_{\theta}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|S_{t})\,\bigg|\,\mathcal{F}_{t}\bigg]=\mathbb{E}\Bigg[\mathbb{E}_{\Xi_{t}}\bigg[\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}I_{\{Q_{\theta}(S_{t},A)\geq\widehat{f}^{\rho}_{t+1}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|S_{t})\bigg]\,\bigg|\,S_{t}=s,\mathbf{w}^{\prime}_{t}\Bigg].
For \alpha^{\prime}>0, from Assumption 8, we have
\displaystyle\mathbb{P}\Big(N_{t}^{\alpha^{\prime}}\Big\|\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}I_{\{Q_{\theta}(s,A)\geq f^{\rho}_{\theta,s}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|s)-\mathbb{E}\left[I_{\{Q_{\theta}(s,A)\geq f^{\rho}_{\theta,s}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|s)\right]\Big\|\geq\epsilon\Big)
\displaystyle\leq C_{1}\exp\left(-\frac{c_{2}N^{c_{3}}_{t}\epsilon^{c_{4}}}{N_{t}^{c_{4}\alpha^{\prime}}}\right)=C_{1}\exp\left(-c_{2}N^{c_{3}-c_{4}\alpha^{\prime}}_{t}\epsilon^{c_{4}}\right)
\displaystyle\leq C_{1}\exp\left(-c_{2}\tau^{(c_{3}-c_{4}\alpha^{\prime})t}N^{c_{3}-c_{4}\alpha^{\prime}}_{0}\epsilon^{c_{4}}\right),
where f^{\rho}_{\theta,s}:=f^{\rho}(Q_{\theta}(s,\cdot),\pi_{\mathbf{w}^{\prime}}(\cdot|s)) and \inf_{t\in\mathbb{N}}\frac{N_{t+1}}{N_{t}}\geq\tau>1 (by Assumption 3).
For c_{3}-c_{4}\alpha^{\prime}>0, i.e., \alpha^{\prime}<c_{3}/c_{4}, we have
\displaystyle\sum_{t=1}^{\infty}C_{1}\exp\left(-c_{2}\tau^{(c_{3}-c_{4}\alpha^{\prime})t}N^{c_{3}-c_{4}\alpha^{\prime}}_{0}\epsilon^{c_{4}}\right)<\infty.
Therefore, by the Borel-Cantelli lemma [\citeauthoryearDurrett1991], we have
\displaystyle\mathbb{P}\Big(N_{t}^{\alpha^{\prime}}\Big\|\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}I_{\{Q_{\theta}(s,A)\geq f^{\rho}_{\theta,s}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|s)-\mathbb{E}\left[I_{\{Q_{\theta}(s,A)\geq f^{\rho}_{\theta,s}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|s)\right]\Big\|\geq\epsilon\ \text{i.o.}\Big)=0.
This implies that
\displaystyle N_{t}^{\alpha^{\prime}}\Big\|\frac{1}{N_{t}}\sum_{A\in\Xi_{t}}I_{\{Q_{\theta}(s,A)\geq f^{\rho}_{\theta,s}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|s)-\mathbb{E}\left[I_{\{Q_{\theta}(s,A)\geq f^{\rho}_{\theta,s}\}}\nabla_{\mathbf{w}_{t}}\ln\pi_{\mathbf{w}}(A|s)\right]\Big\|\rightarrow 0\quad a.s.  (26)
The above result implies that the sample average converges to its expectation at a rate O(N_{t}^{-\alpha^{\prime}}), where 0<\alpha^{\prime}<c_{3}/c_{4}, independently of \mathbf{w},\mathbf{w}^{\prime}\in W. By Lemma 1, the sample quantiles \widehat{f}^{\rho}_{t} also converge to the true quantile at a rate O(N_{t}^{-\alpha}), independently of \mathbf{w},\mathbf{w}^{\prime}\in W. The claim now follows directly from Assumption 6 (ii) and the bounded convergence theorem.
\hbox{}\hfill\blacksquare
1.5 Supporting Lemmas for Lemma 1
Lemma 3.
Let \{f_{n}\in C({\rm I\!R},{\rm I\!R})\}_{n\in\mathbb{N}} be a sequence of strictly convex, continuous functions converging uniformly to a strictly convex function f. Let x_{n}^{*}=\operatorname*{arg\,min}_{x\in{\rm I\!R}}f_{n}(x) and x^{*}=\operatorname*{arg\,min}_{x\in{\rm I\!R}}f(x). Then \lim\limits_{n\rightarrow\infty}x^{*}_{n}=x^{*}.
Proof.
Let c=\liminf_{n}x^{*}_{n}. We argue by contradiction: assume x^{*}>c. Note that f(x^{*})<f(c) and f(x^{*})<f\left((x^{*}+c)/2\right) (by the definition of x^{*} and the uniqueness of the minimizer under strict convexity). Also, by the strict convexity of f, we have f((x^{*}+c)/2)<\left(f(x^{*})+f(c)\right)/2<f(c). Therefore, we have
\displaystyle f(c)>f((x^{*}+c)/2)>f(x^{*}).  (27) 
Let r_{1}\in{\rm I\!R} be such that f(c)>r_{1}>f((x^{*}+c)/2). Now, since \|f_{n}-f\|_{\infty}\rightarrow 0 as n\rightarrow\infty, there exists a positive integer N s.t. |f_{n}(c)-f(c)|<f(c)-r_{1}, \forall n\geq N. Therefore, f_{n}(c)-f(c)>-(f(c)-r_{1}) \Rightarrow f_{n}(c)>r_{1}. Similarly, since f((x^{*}+c)/2)<r_{1}, we can show that f_{n}((x^{*}+c)/2)<r_{1} for all n large enough. Therefore, we have f_{n}(c)>f_{n}((x^{*}+c)/2). An analogous argument with a threshold between f(x^{*}) and f((x^{*}+c)/2) gives f_{n}((x^{*}+c)/2)>f_{n}(x^{*}). Finally, we obtain
\displaystyle f_{n}(c)>f_{n}((x^{*}+c)/2)>f_{n}(x^{*}),\hskip 11.381102pt% \forall n\geq N.  (28) 
Now, by the extreme value theorem for continuous functions, for n\geq N, f_{n} achieves its minimum over the closed interval [c,(x^{*}+c)/2] (say at x_{p}). Note that f_{n}(x_{p})\nless f_{n}((x^{*}+c)/2): otherwise x_{p} would be a local minimum of f_{n} distinct from its global minimum (since f_{n}(x^{*})<f_{n}((x^{*}+c)/2) with x^{*} outside the interval), contradicting the strict convexity of f_{n}. Therefore, f_{n} achieves its minimum over [c,(x^{*}+c)/2] at the endpoint (x^{*}+c)/2. This further implies that x^{*}_{n}\geq(x^{*}+c)/2. Therefore, \liminf_{n}x_{n}^{*}\geq(x^{*}+c)/2 \Rightarrow c\geq(x^{*}+c)/2 \Rightarrow c\geq x^{*}. This is a contradiction. This implies that
\displaystyle\liminf_{n}x^{*}_{n}\geq x^{*}.  (29) 
Now consider g_{n}(x):=f_{n}(-x). Note that each g_{n} is also a continuous and strictly convex function. Indeed, for \lambda\in(0,1), we have g_{n}(\lambda x_{1}+(1-\lambda)x_{2})=f_{n}(-\lambda x_{1}-(1-\lambda)x_{2})<\lambda f_{n}(-x_{1})+(1-\lambda)f_{n}(-x_{2})=\lambda g_{n}(x_{1})+(1-\lambda)g_{n}(x_{2}). Applying the result of Eq. (29) to the sequence \{g_{n}\}_{n\in\mathbb{N}}, whose minimizers are -x_{n}^{*}, we obtain \liminf_{n}(-x_{n}^{*})\geq-x^{*}. This further implies that \limsup_{n}x_{n}^{*}\leq x^{*}. Therefore,
\displaystyle\liminf_{n}x^{*}_{n}\geq x^{*}\geq\limsup_{n}x_{n}^{*}\geq\liminf_{n}x_{n}^{*}.
Hence, \liminf_{n}x^{*}_{n}=\limsup_{n}x_{n}^{*}=x^{*}, i.e., \lim_{n\rightarrow\infty}x_{n}^{*}=x^{*}. ∎
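Lemma 3 can be illustrated numerically on a compact grid. The sketch below uses arbitrarily chosen quadratics (ours, purely illustrative): f_n(x) = (x - 1 - 1/n)^2 is strictly convex and converges uniformly to f(x) = (x - 1)^2 on [0, 2], so the grid minimizers converge to x* = 1.

```python
import numpy as np

xs = np.linspace(0.0, 2.0, 20001)          # compact grid standing in for the domain

f = (xs - 1.0) ** 2                        # strictly convex limit, minimizer x* = 1
x_star = xs[np.argmin(f)]

minimizers = []
for n in [1, 10, 100, 1000]:
    f_n = (xs - 1.0 - 1.0 / n) ** 2        # strictly convex; sup|f_n - f| -> 0 on [0, 2]
    minimizers.append(xs[np.argmin(f_n)])

# minimizers is approximately [2.0, 1.1, 1.01, 1.001]: x_n* -> x* = 1.
```

Note how the n = 1 minimizer sits at the grid boundary 2.0, far from x*; uniform convergence is what eventually pins the minimizers near 1.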
Lemma 4.
Let Assumption 5 hold. For \theta\in\Theta, \mathbf{w}^{\prime}\in W and s\in\mathcal{S}, the map \ell\mapsto\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left[\Psi(Q_{\theta}(s,A),\ell)\right] is Lipschitz continuous on [Q^{\theta}_{l},Q^{\theta}_{u}]. Also, \ell\mapsto\frac{1}{N}\sum_{A\in\Xi}\Psi(Q_{\theta}(s,A),\ell) (with \Xi\stackrel{\text{iid}}{\sim}\pi_{\mathbf{w}^{\prime}}(\cdot|s), |\Xi|=N) is Lipschitz continuous with Lipschitz constant independent of the sample size N.
Proof.
Let \ell_{1},\ell_{2}\in[Q^{\theta}_{l},Q^{\theta}_{u}], \ell_{2}\geq\ell_{1}. By Assumption 5 we have \mathbb{P}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}(Q_{\theta}(s,A)\geq\ell_{1})>0 and \mathbb{P}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}(Q_{\theta}(s,A)\geq\ell_{2})>0. Now,
\displaystyle\Big|\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left[\Psi(Q_{\theta}(s,A),\ell_{1})\right]-\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left[\Psi(Q_{\theta}(s,A),\ell_{2})\right]\Big|
\displaystyle=\Big|\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\Big[(Q_{\theta}(s,A)-\ell_{1})(1-\rho)I_{\{Q_{\theta}(s,A)\geq\ell_{1}\}}+(\ell_{1}-Q_{\theta}(s,A))\rho I_{\{\ell_{1}\geq Q_{\theta}(s,A)\}}
\displaystyle\qquad-(Q_{\theta}(s,A)-\ell_{2})(1-\rho)I_{\{Q_{\theta}(s,A)\geq\ell_{2}\}}-(\ell_{2}-Q_{\theta}(s,A))\rho I_{\{\ell_{2}\geq Q_{\theta}(s,A)\}}\Big]\Big|
\displaystyle=\Big|\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\Big[(1-\rho)(\ell_{2}-\ell_{1})I_{\{Q_{\theta}(s,A)\geq\ell_{2}\}}+\rho(\ell_{1}-\ell_{2})I_{\{Q_{\theta}(s,A)\leq\ell_{1}\}}
\displaystyle\qquad+\left((1-\rho)Q_{\theta}(s,A)+\rho Q_{\theta}(s,A)-(1-\rho)\ell_{1}-\rho\ell_{2}\right)I_{\{\ell_{1}\leq Q_{\theta}(s,A)\leq\ell_{2}\}}\Big]\Big|
\displaystyle\leq(1-\rho)|\ell_{2}-\ell_{1}|+\left(2\rho+1\right)|\ell_{2}-\ell_{1}|
\displaystyle=(\rho+2)|\ell_{2}-\ell_{1}|.
The second claim can be proved similarly. This completes the proof of Lemma 4. ∎
Lemma 5.
Let Assumption 5 hold. Then, for \theta\in\Theta, \mathbf{w}^{\prime}\in W and s\in\mathcal{S}, the maps \ell\mapsto\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|s)}\left[\Psi(Q_{\theta}(s,A),\ell)\right] and \ell\mapsto\frac{1}{N}\sum_{A\in\Xi}\Psi(Q_{\theta}(s,A),\ell) (with \Xi\stackrel{\text{iid}}{\sim}\pi_{\mathbf{w}^{\prime}}(\cdot|s), |\Xi|=N) are strictly convex on [Q^{\theta}_{l},Q^{\theta}_{u}].
Proof.
For \lambda\in[0,1] and \ell_{1},\ell_{2}\in[Q^{\theta}_{l},Q^{\theta}_{u}] with \ell_{1}\leq\ell_{2}, we have
\displaystyle\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\Psi(Q_{\theta}(S,A),\lambda\ell_{1}+(1-\lambda)\ell_{2})\big]  (30)
\displaystyle=\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[(1-\rho)\big(Q_{\theta}(S,A)-\lambda\ell_{1}-(1-\lambda)\ell_{2}\big)I_{\{Q_{\theta}(S,A)\geq\lambda\ell_{1}+(1-\lambda)\ell_{2}\}}
\displaystyle\qquad+\rho\big(\lambda\ell_{1}+(1-\lambda)\ell_{2}-Q_{\theta}(S,A)\big)I_{\{Q_{\theta}(S,A)\leq\lambda\ell_{1}+(1-\lambda)\ell_{2}\}}\big].
Notice that
\displaystyle\big(Q_{\theta}(S,A)-\lambda\ell_{1}-(1-\lambda)\ell_{2}\big)I_{\{Q_{\theta}(S,A)\geq\lambda\ell_{1}+(1-\lambda)\ell_{2}\}}
\displaystyle=\big(\lambda Q_{\theta}(S,A)-\lambda\ell_{1}+(1-\lambda)Q_{\theta}(S,A)-(1-\lambda)\ell_{2}\big)I_{\{Q_{\theta}(S,A)\geq\lambda\ell_{1}+(1-\lambda)\ell_{2}\}}.
We consider how one of these components simplifies:
\displaystyle\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(\lambda Q_{\theta}(S,A)-\lambda\ell_{1}\big)I_{\{Q_{\theta}(S,A)\geq\lambda\ell_{1}+(1-\lambda)\ell_{2}\}}\big]
\displaystyle=\lambda\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(Q_{\theta}(S,A)-\ell_{1}\big)I_{\{Q_{\theta}(S,A)\geq\ell_{1}\}}-\big(Q_{\theta}(S,A)-\ell_{1}\big)I_{\{\ell_{1}\leq Q_{\theta}(S,A)<\lambda\ell_{1}+(1-\lambda)\ell_{2}\}}\big]
\displaystyle\leq\lambda\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(Q_{\theta}(S,A)-\ell_{1}\big)I_{\{Q_{\theta}(S,A)\geq\ell_{1}\}}\big]\ \ \ \ \triangleright\ \big(Q_{\theta}(S,A)-\ell_{1}\big)\geq 0\text{ for }\ell_{1}\leq Q_{\theta}(S,A)<\lambda\ell_{1}+(1-\lambda)\ell_{2}.
Similarly, we get
\displaystyle\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(Q_{\theta}(S,A)-\ell_{2}\big)I_{\{Q_{\theta}(S,A)\geq\lambda\ell_{1}+(1-\lambda)\ell_{2}\}}\big]\leq\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(Q_{\theta}(S,A)-\ell_{2}\big)I_{\{Q_{\theta}(S,A)\geq\ell_{2}\}}\big],
\displaystyle\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(\ell_{1}-Q_{\theta}(S,A)\big)I_{\{Q_{\theta}(S,A)\leq\lambda\ell_{1}+(1-\lambda)\ell_{2}\}}\big]\leq\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(\ell_{1}-Q_{\theta}(S,A)\big)I_{\{Q_{\theta}(S,A)\leq\ell_{1}\}}\big],
\displaystyle\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(\ell_{2}-Q_{\theta}(S,A)\big)I_{\{Q_{\theta}(S,A)\leq\lambda\ell_{1}+(1-\lambda)\ell_{2}\}}\big]\leq\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(\ell_{2}-Q_{\theta}(S,A)\big)I_{\{Q_{\theta}(S,A)\leq\ell_{2}\}}\big].
Therefore, for Equation (30), we get
\displaystyle(30)\leq\lambda(1-\rho)\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(Q_{\theta}(S,A)-\ell_{1}\big)I_{\{Q_{\theta}(S,A)\geq\ell_{1}\}}\big]
\displaystyle\qquad+(1-\lambda)(1-\rho)\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(Q_{\theta}(S,A)-\ell_{2}\big)I_{\{Q_{\theta}(S,A)\geq\ell_{2}\}}\big]
\displaystyle\qquad+\lambda\rho\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(\ell_{1}-Q_{\theta}(S,A)\big)I_{\{Q_{\theta}(S,A)\leq\ell_{1}\}}\big]
\displaystyle\qquad+(1-\lambda)\rho\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\big[\big(\ell_{2}-Q_{\theta}(S,A)\big)I_{\{Q_{\theta}(S,A)\leq\ell_{2}\}}\big]
\displaystyle=\lambda\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\left[\Psi(Q_{\theta}(S,A),\ell_{1})\right]+(1-\lambda)\mathbb{E}_{A\sim\pi_{\mathbf{w}^{\prime}}(\cdot|S)}\left[\Psi(Q_{\theta}(S,A),\ell_{2})\right].
The second claim can be proved similarly. This completes the proof of Lemma 5. ∎
Appendix B 2 Experiment Details
2.1 Actor-Expert Algorithm with Neural Networks
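As a minimal, self-contained reference for the Actor update analyzed in Appendix A, the sketch below implements the conditional CEM step with a one-dimensional Gaussian actor and a fixed, bimodal stand-in Expert. All names, the toy Q-function, and the hyperparameters here are our illustrative choices, not the paper's network implementation: sample N actions from π_w(·|s), keep those above the empirical (1-ρ)-quantile of their Q-values, and ascend the average log-likelihood gradient of the retained actions.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_fixed(a):
    # Stand-in Expert: a bimodal action-value curve with its higher mode at a = -1.
    return np.exp(-(a - 1.0) ** 2) + 1.5 * np.exp(-(a + 1.0) ** 2)

mu, log_sigma = 0.0, 0.0        # Gaussian actor parameters w = (mu, log_sigma)
rho, n, lr = 0.2, 256, 0.1      # elite quantile level, sample size N_t, step size

for t in range(500):
    sigma = np.exp(log_sigma)
    actions = rng.normal(mu, sigma, size=n)                 # A ~ pi_w(.|s)
    q = q_fixed(actions)
    f_hat = np.sort(q)[int(np.ceil((1.0 - rho) * n)) - 1]   # empirical (1-rho)-quantile
    elite = q >= f_hat                                      # indicator I{Q >= f_hat}
    # (1/N) sum_A I{Q >= f_hat} grad_w log pi_w(A|s), for a Gaussian policy:
    grad_mu = np.mean(elite * (actions - mu) / sigma ** 2)
    grad_ls = np.mean(elite * ((actions - mu) ** 2 / sigma ** 2 - 1.0))
    mu = mu + lr * grad_mu
    log_sigma = max(log_sigma + lr * grad_ls, np.log(0.05))  # floor keeps exploration alive

# The actor mean should settle near the dominant mode of the Expert, a = -1.
```

The variance floor is an implementation convenience of this sketch; the analysis instead relies on the projection Γ^W keeping the policy parameters in a bounded set.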
2.2 Bimodal Toy Domain Videos
We monitored the training process of every agent on the Bimodal Toy Domain, logging its action-value function and its policy at each training step. A sample video for each agent is included at https://sites.google.com/ualberta.ca/actorexpert/. The upper graph shows the action-value function, while the lower graph shows the agent's policy. The red vertical line indicates the greedy action, and the blue vertical line indicates the actual exploratory action taken. For NAF, AE, and AE+, we also plot the policy function on the same graph.
2.3 Benchmark Environment Description
We use benchmark environments from OpenAI Gym [\citeauthoryearBrockman et al.2016] and MuJoCo [\citeauthoryearTodorov, Erez, and Tassa2012] to evaluate our agents. In this section we give a brief description and the dimensionality of each environment.
Environment | Description
Pendulum-v0 | The goal of the agent is to swing the pendulum up to the top and keep it upright, while spending the least amount of energy.
LunarLanderContinuous-v2 | The goal of the agent is to control the spaceship and land safely on a specified region.
HalfCheetah-v2 | The goal of the agent is to move forward as much as possible with a cheetah-like figure.
Hopper-v2 | The goal of the agent is to move forward as much as possible with a monoped figure.

Environment | State dim. | Action dim. | Action range
Pendulum-v0 | 3 | 1 | [-2, 2]
LunarLanderContinuous-v2 | 8 | 2 | [-1, 1]
HalfCheetah-v2 | 17 | 6 | [-1, 1]
Hopper-v2 | 11 | 3 | [-1, 1]
2.4 Best Hyperparameters
In this section we report the best hyperparameter settings for the evaluated environments. We swept over the use of layer normalization, the actor learning rate, the expert/critic learning rate, and other algorithm-specific parameters.
Algorithm | Normalization | Actor LR | Expert/Critic LR | Others
AE | - | 1e-3 | 1e-2 | -
AE+ | - | 1e-3 | 1e-2 | -
Wire-fitting | - | - | 1e-2 | control points (100)
NAF | - | - | 1e-2 | exploration scale (0.1)
PICNN | - | - | 1e-2 | -
DDPG | - | 1e-3 | 1e-2 | -

Algorithm | Normalization | Actor LR | Expert/Critic LR | Others
AE | layernorm | 1e-4 | 1e-2 | -
AE+ | layernorm | 1e-4 | 1e-2 | -
NAF | - | - | 1e-3 | exploration scale (1.0)
PICNN | - | - | 1e-3 | -
DDPG | - | 1e-3 | 1e-3 | -

Algorithm | Normalization | Actor LR | Expert/Critic LR | Others
AE | layernorm | 1e-5 | 1e-3 | -
AE+ | layernorm | 1e-5 | 1e-3 | -
NAF | - | - | 1e-4 | exploration scale (1.0)
PICNN | - | - | 1e-3 | -
DDPG | layernorm | 1e-3 | 1e-3 | -

Algorithm | Normalization | Actor LR | Expert/Critic LR | Others
AE | layernorm | 1e-5 | 1e-3 | -
AE+ | - | 1e-4 | 1e-3 | -
NAF | - | - | 1e-3 | exploration scale (0.1)
PICNN | - | - | 1e-3 | -
DDPG | layernorm | 1e-3 | 1e-3 | -

Algorithm | Normalization | Actor LR | Expert/Critic LR | Others
AE | layernorm | 1e-5 | 1e-3 | -
AE+ | - | 1e-5 | 1e-3 | -
NAF | - | - | 1e-3 | exploration scale (1.0)
PICNN | - | - | 1e-3 | -
DDPG | - | 1e-5 | 1e-3 | -