Smoothed Dual Embedding Control


Bo Dai Albert Shaw Lihong Li Lin Xiao Niao He Jianshu Chen Le Song
Georgia Institute of Technology
Microsoft Research Redmond
Google AI
University of Illinois at Urbana Champaign

We revisit the Bellman optimality equation with Nesterov's smoothing technique and provide a unique saddle-point optimization perspective on the policy optimization problem in reinforcement learning, based on Fenchel duality. A new reinforcement learning algorithm, called Smoothed Dual Embedding Control (SDEC), is derived to solve the saddle-point reformulation with arbitrary learnable function approximators. The algorithm bypasses the policy evaluation step of policy optimization in a principled way and extends naturally to multi-step bootstrapping and eligibility traces. We provide a PAC-learning bound on the number of samples needed from a single off-policy sample path, and also characterize the convergence of the algorithm. Finally, we show that the algorithm compares favorably to state-of-the-art baselines on several benchmark control problems.

1 Introduction

Reinforcement learning (RL) algorithms aim to learn a policy that maximizes the long-term return by sequentially interacting with an unknown environment (Sutton and Barto, 1998). The dominating framework to model such an interaction is the Markov decision process, or MDP. A fundamental result for MDPs is that the Bellman operator is a contraction in the value-function space, and thus the optimal value function is the unique fixed point of the operator. Furthermore, starting from any initial value function, iterative application of the Bellman operator converges to the fixed point. Interested readers are referred to the textbook of Puterman (2014) for details.

Many of the most effective RL algorithms have their root in such a fixed-point view. The most prominent family of algorithms is perhaps the temporal-difference algorithms, including TD (Sutton, 1988), Q-learning (Watkins, 1989), SARSA (Rummery and Niranjan, 1994; Sutton, 1996), and numerous variants. Compared to direct policy search or policy gradient algorithms like REINFORCE (Williams, 1992), these fixed-point methods use bootstrapping to make learning more efficient by reducing variance. When the Bellman operator can be computed exactly (or at least in expectation), such as when the MDP has finite states and actions, convergence is guaranteed and the proof typically relies on the contraction property (Bertsekas and Tsitsiklis, 1996). Unfortunately, when function approximators are used, such fixed-point methods can easily become unstable or divergent (Boyan and Moore, 1995; Baird, 1995; Tsitsiklis and Van Roy, 1997), except in rather limited cases. For example,

  • for some rather restrictive function classes that have a non-expansion property, such as kernel averaging, most of the finite-state MDP theory continues to apply (Gordon, 1995);

  • when linear function classes are used to approximate the value function of a fixed policy from on-policy samples (Tsitsiklis and Van Roy, 1997), convergence is guaranteed.

In recent years, a few authors have made important progress toward finding scalable, convergent TD algorithms, by designing proper objective functions and using stochastic gradient descent (SGD) to optimize them (Sutton et al., 2009; Maei, 2011). Later on, it was realized that several of these gradient-based algorithms can be interpreted as solving a primal-dual problem (Mahadevan et al., 2014; Liu et al., 2015; Macua et al., 2015; Dai et al., 2016). This insight has led to novel, faster, and more robust algorithms by adopting sophisticated optimization techniques (Du et al., 2017). Unfortunately, to the best of our knowledge, all existing works either assume linear function approximation or are designed for policy evaluation. It remains an open question how to find the optimal policy reliably with nonlinear function approximators such as neural networks, even in the presence of off-policy data.

In this work, we take a substantial step towards solving this decades-long open problem, leveraging a unique saddle-point optimization perspective to derive a new algorithm called Smoothed Dual Embedding Control (SDEC). Our development hinges upon a special look at the Bellman optimality equation and the temporal relationship between the optimal value function and optimal policy revealed by a smoothed Bellman optimality equation. We exploit this relation and introduce a distinct saddle-point optimization that simultaneously learns both the optimal value function and the optimal policy in the primal form, and allows us to escape the instability of the $\max$-operator and the "double sampling" issue faced by existing algorithms. As a result, the SDEC algorithm enjoys many desired properties, in particular:

  • It is stable for a broad class of nonlinear function approximators including neural networks, and provably converges to a solution with vanishing gradient. This is the case even in the more challenging off-policy scenario.

  • It uses bootstrapping to yield high sample efficiency, as in TD-style methods, and can be generalized to cases of multi-step bootstrapping and eligibility traces.

  • It directly optimizes the squared Bellman residual based on a sample trajectory, while avoiding the infamous double-sample issue.

  • It uses stochastic gradient descent to optimize the objective, and is thus very efficient and scalable.

Furthermore, the algorithm handles both optimal value function estimation and policy optimization in a unified way, and readily applies to both continuous and discrete action spaces. We test the algorithm on several continuous control benchmarks. Preliminary results show that the proposed algorithm achieves state-of-the-art performance.

2 Preliminaries

In this section, we introduce the necessary background on Markov decision processes.

Markov Decision Processes (MDPs).

We denote an MDP as $M = (S, A, P, R, \gamma)$, where $S$ is the state space (possibly infinite), $A$ is the set of actions, $P(\cdot\,|\,s, a)$ is the transition probability kernel defining the distribution over next states upon taking action $a$ in state $s$, $R(s, a)$ gives the corresponding stochastic immediate reward with expectation $\bar R(s, a)$, and $\gamma \in (0, 1)$ is the discount factor.

Given the initial state distribution $\mu(s)$, the goal of reinforcement learning is to find a policy $\pi(\cdot\,|\,s): S \to \mathcal{P}(A)$, where $\mathcal{P}(A)$ denotes the set of probability measures over $A$, that maximizes the total expected discounted reward, i.e., $\max_\pi \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$, where $s_0 \sim \mu(s)$, $a_t \sim \pi(\cdot\,|\,s_t)$, and $s_{t+1} \sim P(\cdot\,|\,s_t, a_t)$.

Bellman Optimality Equation.

Denote $(\mathcal{T}V)(s) := \max_{a} \big\{\bar R(s, a) + \gamma\,\mathbb{E}_{s'|s,a}[V(s')]\big\}$. It is known that the optimal state value function $V^*(s)$ satisfies the Bellman optimality equation (Puterman, 2014)

$$V^*(s) = (\mathcal{T}V^*)(s). \qquad (1)$$

The optimal policy $\pi^*$ can be obtained from $V^*$ via

$$\pi^*(s) = \operatorname{argmax}_{a}\big\{\bar R(s, a) + \gamma\,\mathbb{E}_{s'|s,a}[V^*(s')]\big\}. \qquad (2)$$
Therefore, solving the reinforcement learning problem is equivalent to finding the solution to the Bellman optimality equation (1). However, one should note that even if a solution to the Bellman optimality equation (1) is obtained, finding the optimal policy via (2) remains a challenging task in reinforcement learning, since one needs to solve yet another optimization problem.
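To make the fixed-point view concrete, the following sketch runs value iteration on a small hypothetical two-state, two-action MDP (all numbers are illustrative, not taken from the paper); the loop applies the Bellman optimality operator until convergence and then reads off a greedy policy as in (2):

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP: P[a, s, s'] transition probabilities, R[s, a] expected rewards.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality operator: (TV)(s) = max_a { R(s,a) + gamma * E_{s'|s,a}[V(s')] }
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:   # contraction guarantees convergence
        V = V_new
        break
    V = V_new

pi_star = Q.argmax(axis=1)  # greedy policy extracted from V*, as in Eq. (2)
```

Solving Eq. (2) is trivial here because the action set is tiny; with continuous actions the inner `argmax` is itself a nontrivial optimization, which is the point of the paragraph above.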

3 A New Optimization Perspective of Reinforcement Learning

In this section, we introduce a unique optimization perspective on the reinforcement learning problem that paves the way to designing efficient and provable algorithms with the desired properties. The development hinges upon a special look at the Bellman optimality equation, explicitly exposing the role of the policy. By leveraging a smoothed Bellman optimality equation with entropy regularization, we discover a direct relationship between the optimal policy and the corresponding value function, and arrive at a new optimization formulation of the reinforcement learning problem.

3.1 Revisiting the Bellman Optimality Equation

Recall that most value-function-based algorithms find the optimal policy only after (approximately) solving the Bellman optimality equation for the optimal value function. We start by revisiting the Bellman optimality equation and rewriting it in a Fenchel-type representation:

$$V^*(s) = \max_{\pi(\cdot|s) \in \mathcal{P}(A)} \mathbb{E}_{a \sim \pi(\cdot|s)}\big[\bar R(s, a) + \gamma\,\mathbb{E}_{s'|s,a}[V^*(s')]\big], \qquad (3)$$
where $\mathcal{P}(A)$ denotes the family of valid distributions over $A$. This reformulation is based on the simple fact that for any $x \in \mathbb{R}^{|A|}$, $\max_a x_a = \max_{\pi \in \mathcal{P}(A)} \mathbb{E}_{a \sim \pi}[x_a]$. Observe that the role of the policy is now explicitly revealed in the Bellman optimality equation. Despite its simplicity, this observation is an important step that allows us to develop the new algorithm in the following. A straightforward idea is to consider a joint optimization over $V$ and $\pi$ by fitting them to the Bellman equation and minimizing the expected residual; that is,

$$\min_{V, \pi}\ \mathbb{E}_{s}\Big[\big(V(s) - \mathbb{E}_{a \sim \pi(\cdot|s)}\big[\bar R(s, a) + \gamma\,\mathbb{E}_{s'|s,a}[V(s')]\big]\big)^2\Big]. \qquad (4)$$
Unlike most existing approaches, this optimization formulation brings the search procedures for optimal state value function and optimal policy in a unified framework. However, there are several major difficulties when directly solving this optimization problem,

  • The $\max$ operator over the distribution space will cause numerical instability, especially in environments where a slight change in $V$ may cause large differences in the RHS of Eq. (3).

  • The conditional expectation $\mathbb{E}_{s'|s,a}[\cdot]$, composed with the square loss, requires two independent samples of $s'$ (Baird, 1995) to compute an unbiased stochastic gradient, which is often impractical.
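The double-sampling issue can be illustrated numerically: squaring a single-sample residual estimates $\mathbb{E}[\delta^2]$ rather than the desired $(\mathbb{E}[\delta])^2$, and the gap is exactly $\gamma^2$ times the variance of the next-state value. A minimal sketch with synthetic numbers (rewards omitted, all quantities hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, V_s = 0.9, 1.0
# Sampled next-state values V(s') for one fixed (s, a): a stochastic transition.
V_next = rng.choice([0.0, 2.0], size=200_000)   # each outcome with prob 1/2
delta = V_s - gamma * V_next                    # single-sample residual

true_sq_residual = (V_s - gamma * V_next.mean()) ** 2   # (E[delta])^2, what Eq. (4) needs
single_sample = (delta ** 2).mean()                     # E[delta^2], what one sample gives
gap = single_sample - true_sq_residual                  # equals gamma^2 * Var[V(s')]
```

The gap does not vanish with more data, so a single-sample gradient of the squared residual is biased; only two independent draws of $s'$ per $(s,a)$ would fix it.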

3.2 Smoothed Bellman Optimality Equation

To avoid the instability and discontinuity caused by the $\max$ operator, we propose to smooth the policy update by utilizing the smoothing technique of Nesterov (2005). Since the policy is defined on the distribution space, we introduce an entropic regularization to the Fenchel-type representation (3):

$$V_\lambda(s) = \max_{\pi(\cdot|s) \in \mathcal{P}(A)} \mathbb{E}_{a \sim \pi(\cdot|s)}\big[\bar R(s, a) + \gamma\,\mathbb{E}_{s'|s,a}[V_\lambda(s')]\big] + \lambda H(\pi, s), \qquad (5)$$
where $H(\pi, s) := -\sum_a \pi(a|s) \log \pi(a|s)$ is the entropy function and $\lambda \ge 0$ controls the level of smoothing. We first show that the entropy regularization indeed introduces smoothness and continuity, yet preserves the existence and uniqueness of the optimal solution. Observe that the RHS of Eq. (5) is exactly the Fenchel dual representation of the $\log$-sum-$\exp$ function (Boyd and Vandenberghe, 2004). Hence, we have the smoothed Bellman optimality equation

$$V_\lambda(s) = (\mathcal{T}_\lambda V_\lambda)(s) := \lambda \log\Big(\sum_{a} \exp\Big(\frac{\bar R(s, a) + \gamma\,\mathbb{E}_{s'|s,a}[V_\lambda(s')]}{\lambda}\Big)\Big), \qquad (6)$$

where the $\log$-sum-$\exp$ is a smooth approximation of the $\max$-operator.
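As a quick sanity check, one can verify numerically that $\lambda \log \sum_a \exp(x_a / \lambda)$ upper-bounds $\max_a x_a$ with a gap of at most $\lambda \log |A|$, and converges to the max as $\lambda \to 0$ (the vector $x$ below is arbitrary illustrative data):

```python
import numpy as np

def smooth_max(x, lam):
    # lambda * log(sum_a exp(x_a / lambda)), shifted by the max for numerical stability.
    z = x / lam
    m = z.max()
    return lam * (m + np.log(np.exp(z - m).sum()))

x = np.array([1.0, 1.5, 3.0])  # e.g. action values at one state (illustrative)
for lam in [1.0, 0.1, 0.01]:
    sm = smooth_max(x, lam)
    # Sandwiched between max and max + lam * log|A|, so it converges to max as lam -> 0.
    assert x.max() <= sm <= x.max() + lam * np.log(len(x)) + 1e-12
```

Unlike the hard max, this function is differentiable everywhere, which is what restores stability to the policy update.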

Next we show that the $V_\lambda$ satisfying the smoothed optimality equation is still unique, due to the contraction of the smoothed Bellman operator $\mathcal{T}_\lambda$.

Proposition 1 (Uniqueness)

$\mathcal{T}_\lambda$ is a $\gamma$-contraction operator. Therefore, the smoothed Bellman optimality equation (5) has a unique solution.

A similar result is also presented in Fox et al. (2015); Asadi and Littman (2016). For completeness, we list the proof in Appendix A.

Note that although the entropy regularization introduces smoothness to the policy and avoids the numerical instability caused by the $\max$-operator, it also introduces a bias into the optimal value function:

Proposition 2 (Smoothing bias)

Let $V^*$ and $V_\lambda$ be the fixed points of (3) and (5), respectively. It holds that

$$V^*(s) \le V_\lambda(s) \le V^*(s) + \frac{\lambda \log |A|}{1 - \gamma}. \qquad (7)$$

As $\lambda \to 0$, $V_\lambda$ converges to $V^*$ pointwise.

The proof can be found in Appendix A.
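Proposition 2 can be checked numerically on a toy problem: running value iteration with the hard max and with the log-sum-exp backup on the same hypothetical two-state MDP (illustrative numbers only), the smoothed fixed point dominates $V^*$ and stays within the $\lambda \log|A| / (1-\gamma)$ bound:

```python
import numpy as np

gamma, lam = 0.9, 0.2
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])   # P[a, s, s'] (illustrative)
R = np.array([[1.0, 0.0], [0.5, 2.0]])     # R[s, a]

def fixed_point(backup):
    # Iterate V <- backup(Q) where Q(s,a) = R(s,a) + gamma * E[V(s')].
    V = np.zeros(2)
    for _ in range(3000):
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V = backup(Q)
    return V

V_star = fixed_point(lambda Q: Q.max(axis=1))                               # Eq. (1)
V_lam  = fixed_point(lambda Q: lam * np.log(np.exp(Q / lam).sum(axis=1)))   # Eq. (6)

bias = np.abs(V_lam - V_star).max()
bound = lam * np.log(2) / (1 - gamma)   # lambda * log|A| / (1 - gamma), with |A| = 2
```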

By further simplifying the smoothed Bellman optimality equation, we are able to recover a direct relationship between the optimal value function and optimal policy.

Theorem 3 (Temporal consistency)

Let $V_\lambda$ be the fixed point of (5) and $\pi_\lambda(\cdot|s)$ the corresponding policy that attains the maximum on the RHS of (5). Then $(V_\lambda, \pi_\lambda)$ is the unique pair that satisfies

$$V_\lambda(s) = \bar R(s, a) + \gamma\,\mathbb{E}_{s'|s,a}[V_\lambda(s')] - \lambda \log \pi_\lambda(a|s), \quad \forall (s, a) \in S \times A. \qquad (8)$$
We point out that a similar condition has also been observed in Rawlik et al. (2012); Neu et al. (2017); Nachum et al. (2017a), but from a completely different viewpoint, and our proof is slightly different; see Appendix A. In Nachum et al. (2017a), the entropy regularization term is adopted to encourage exploration and prevent premature convergence, while we start from the smoothing technique of Nesterov (2005).

Note that the simplified equation (8) provides a condition that is both sufficient and necessary for characterizing the optimal value function and optimal policy. As we will see, this characterization indeed yields new opportunities for learning optimal value functions, especially in the off-policy and multi-step/eligibility-traces settings.
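The consistency condition (8) can be verified numerically: run smoothed value iteration (6) to approximate convergence on a hypothetical two-state MDP, form the softmax policy that attains the maximum in (5), and check that the residual of (8) vanishes for every state-action pair (numbers are illustrative):

```python
import numpy as np

gamma, lam = 0.9, 0.5
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])   # P[a, s, s'] (illustrative)
R = np.array([[1.0, 0.0], [0.5, 2.0]])     # R[s, a]

V = np.zeros(2)
for _ in range(2000):
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    V = lam * np.log(np.exp(Q / lam).sum(axis=1))   # smoothed backup, Eq. (6)

# Maximizing policy of (5): softmax of Q / lambda, recomputed at the fixed point.
Q = R + gamma * np.einsum('ast,t->sa', P, V)
pi = np.exp(Q / lam)
pi /= pi.sum(axis=1, keepdims=True)

# Temporal consistency, Eq. (8): V(s) = R(s,a) + gamma E[V(s')] - lam*log pi(a|s), for ALL (s,a).
residual = Q - lam * np.log(pi) - V[:, None]
```

Note the residual vanishes for every action, not just the greedy one; this is what makes (8) usable with off-policy data.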

3.3 Saddle Point Reformulation via Dual Embeddings

With this new characterization of the smoothed Bellman optimality equation, a straightforward idea is to solve equation (8) by minimizing the mean squared consistency Bellman error, namely,

$$\min_{V, \pi}\ \ell(V, \pi) := \mathbb{E}_{s,a}\Big[\big(V(s) - \bar R(s, a) - \gamma\,\mathbb{E}_{s'|s,a}[V(s')] + \lambda \log \pi(a|s)\big)^2\Big]. \qquad (9)$$
Due to the inner conditional expectation, directly applying a stochastic gradient descent algorithm requires two independent samples at each update, which is referred to as the "double sampling" issue; see, e.g., Baird (1995); Dai et al. (2016). Directly optimizing (9) thus remains challenging, since in practice one can hardly access two independent samples from $P(\cdot|s, a)$.

Inspired by Dai et al. (2016), we propose to reformulate the objective into an equivalent saddle-point problem in order to bypass the double sampling issue. Specifically, by exploiting the Fenchel dual of the square function, i.e., $x^2 = \max_{\nu}\,(2\nu x - \nu^2)$, and further applying the interchangeability principle (Dai et al., 2016, Lemma 1), we can show that (9) is equivalent to the saddle-point problem

$$\min_{V, \pi}\ \max_{\nu \in \mathcal{F}(S \times A)}\ \mathbb{E}_{s,a,s'}\Big[2\nu(s, a)\big(V(s) - R(s, a) - \gamma V(s') + \lambda \log \pi(a|s)\big) - \nu(s, a)^2\Big], \qquad (10)$$
where $\mathcal{F}(S \times A)$ stands for the function space defined on $S \times A$. Note that this is not a standard convex-concave saddle-point problem: the objective is convex in $V$ for any fixed $(\pi, \nu)$ and concave in $\nu$ for any fixed $(V, \pi)$, but not necessarily convex in $\pi$ for any fixed $(V, \nu)$.

In contrast to our saddle-point optimization approach (10), Nachum et al. (2017a) considered a different way to handle the double sampling issue by instead solving an upper bound of (9), namely $\mathbb{E}_{s,a,s'}\big[\big(V(s) - R(s, a) - \gamma V(s') + \lambda \log \pi(a|s)\big)^2\big]$. This surrogate is obtained by brute-force extraction of the inner expectation outside the square, thus admitting unbiased stochastic gradient estimates with one sample $(s, a, r, s')$. However, the surrogate introduces an extra variance term, $\gamma^2\,\mathbb{E}_{s,a}\big[\mathbb{V}_{s'|s,a}[V(s')]\big]$, into the original objective. If this variance is large, minimizing the surrogate could lead to a highly inaccurate solution, while such an issue does not exist in our saddle-point optimization because of its exact equivalence to (9).

In fact, by substituting the dual function $\nu(s, a)$, the objective in the saddle-point problem becomes

$$\min_{V, \pi}\ \max_{\nu}\ \mathbb{E}_{s,a,s'}\big[\delta(s, a, s')^2\big] - \mathbb{E}_{s,a,s'}\big[\big(\delta(s, a, s') - \nu(s, a)\big)^2\big], \qquad (11)$$

where $\delta(s, a, s') := V(s) - R(s, a) - \gamma V(s') + \lambda \log \pi(a|s)$. Note that the first term is the same as the surrogate objective of Nachum et al. (2017a), and the second term cancels the extra variance term, as we prove in Theorem 8 in Appendix B. A similar decomposition was also observed in Antos et al. (2008). Such an understanding of the saddle-point objective as a decomposition into mean and variance is very useful for exploiting a better bias-variance tradeoff. Specifically, when function approximators are used for the dual variables, extra bias is induced in place of the variance term. To balance the bias caused by the dual function approximator against the variance, one can impose a weight on the second term. This turns out to be particularly important in the multi-step setting, where the dual variables require a complicated parametrization; we have also observed this effect in our experiments.
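The mean-variance decomposition can be checked directly for a single $(s,a)$: the inner maximization over a scalar dual $\nu$ is attained at the conditional mean of the residual, and subtracting the second term removes exactly the variance, leaving the desired squared mean (synthetic residual samples, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
# Samples of the residual delta(s, a, s') at one fixed (s, a): mean 0.3, noisy.
delta = 0.3 + 0.9 * rng.standard_normal(100_000)

surrogate = (delta ** 2).mean()      # E[delta^2]: the upper-bound objective
nu_star = delta.mean()               # optimal dual: the conditional mean of delta
# Saddle-point objective at nu*: variance term is cancelled, leaving (E[delta])^2.
corrected = surrogate - ((delta - nu_star) ** 2).mean()
```

Algebraically, $\mathbb{E}[\delta^2] - \mathbb{E}[(\delta - \nu^*)^2] = (\mathbb{E}[\delta])^2$ when $\nu^* = \mathbb{E}[\delta]$, which is why the saddle point recovers the unbiased squared Bellman error.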

Remark (Comparison to existing optimization perspectives).

Several recent works (Chen and Wang, 2016; Wang, 2017) have also considered saddle-point formulations of Bellman equations, but these formulations are fundamentally different from ours, even in origin. Those saddle-point problems are derived from the Lagrangian dual of the linear programming formulation of the Bellman optimality equation and are only applicable to MDPs with finite state and action spaces. In contrast, our saddle-point optimization originates from the Fenchel dual of the mean squared error of a smoothed Bellman optimality equation. Moreover, our framework applies to both finite-state and continuous MDPs, and naturally adapts to the multi-step and eligibility-trace extensions.

4 Smoothed Dual Embedding Control Algorithm

In this section, we develop an efficient reinforcement learning algorithm from the saddle-point perspective. As discussed, the optimization (11) provides a convenient mechanism to achieve a better bias-variance tradeoff by reweighting the two terms, i.e.,

$$\min_{V, \pi}\ \max_{\nu}\ \mathbb{E}_{s,a,s'}\big[\delta(s, a, s')^2\big] - \eta\,\mathbb{E}_{s,a,s'}\big[\big(\delta(s, a, s') - \nu(s, a)\big)^2\big], \qquad (12)$$

where $\delta(s, a, s') = V(s) - R(s, a) - \gamma V(s') + \lambda \log \pi(a|s)$ denotes the one-step consistency residual and $\eta \in [0, 1]$ is a given scalar used for balancing the variance and potential bias. When $\eta = 1$, this reduces to the original saddle-point formulation (10). When $\eta = 0$, this reduces to the surrogate objective considered in Nachum et al. (2017a).

From this new optimization view of reinforcement learning, we derive the smoothed dual embedding control algorithm based on stochastic mirror descent (Nemirovski et al., 2009). For simplicity, we mainly discuss the one-step optimization (12); the algorithm is easily generalized to the multi-step and eligibility-trace settings. Details can be found in Appendices C.2 and C.3, respectively.

We first derive the unbiased gradient estimators of the objective in (12) w.r.t. $V$ and $\pi$:

Theorem 4 (Unbiased gradient estimator)

Let $\delta(s, a, s') = V(s) - R(s, a) - \gamma V(s') + \lambda \log \pi(a|s)$ and let $\nu$ be fixed. For the objective in (12), we have the unbiased gradient estimators

$$\nabla_V = 2\,\mathbb{E}_{s,a,s'}\Big[\big((1 - \eta)\,\delta(s, a, s') + \eta\,\nu(s, a)\big)\big(\nabla V(s) - \gamma \nabla V(s')\big)\Big], \qquad (13)$$

$$\nabla_\pi = 2\lambda\,\mathbb{E}_{s,a,s'}\Big[\big((1 - \eta)\,\delta(s, a, s') + \eta\,\nu(s, a)\big)\,\nabla \log \pi(a|s)\Big]. \qquad (14)$$
Denote the parameters of the primal variables $(V, \pi)$ and the dual variable $\nu$ as $w = (w_V, w_\pi)$ and $w_\nu$, respectively; the gradients w.r.t. $w_V$ and $w_\pi$ can then be obtained by the chain rule from (13) and (14). We apply stochastic mirror descent to update $w_V$ and $w_\pi$, i.e., solving a prox-mapping in each iteration,

$$w_V^{j+1} = \operatorname{argmin}_{w_V}\ \zeta_j \big\langle \widehat{\nabla}_{w_V}, w_V \big\rangle + D_V(w_V, w_V^j), \qquad w_\pi^{j+1} = \operatorname{argmin}_{w_\pi}\ \zeta_j \big\langle \widehat{\nabla}_{w_\pi}, w_\pi \big\rangle + D_\pi(w_\pi, w_\pi^j),$$

where $D_V$ and $D_\pi$ denote Bregman divergences. We can use the Euclidean metric for both $V$ and $\pi$, or exploit the KL-divergence for $\pi$. Following these steps, we arrive at the smoothed dual embedding control algorithm.
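For concreteness, the two prox-mappings have simple closed forms: with the Euclidean metric the prox-mapping is a plain gradient step, while with the KL-divergence on a probability simplex it becomes a multiplicative (exponentiated-gradient) update. This is a generic sketch, not tied to any particular parametrization in the paper:

```python
import numpy as np

def prox_euclidean(w, grad, step):
    # argmin_u  <grad, u> + (1 / (2 * step)) * ||u - w||^2  =  plain gradient step
    return w - step * grad

def prox_kl_simplex(p, grad, step):
    # argmin_q  <grad, q> + (1 / step) * KL(q || p)  over the probability simplex:
    # the exponentiated-gradient (multiplicative) update, renormalized.
    q = p * np.exp(-step * grad)
    return q / q.sum()

p = np.array([0.25, 0.25, 0.5])     # current policy over 3 actions (illustrative)
g = np.array([1.0, 0.0, -1.0])      # gradient w.r.t. action probabilities
q = prox_kl_simplex(p, g, 0.5)      # mass shifts toward the negative-gradient action
```

The KL prox keeps the policy strictly inside the simplex without any projection step, which is one reason to prefer it for the $\pi$ update.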

For practical purposes, we incorporate experience replay into the algorithm, illustrated in Algorithm 1. Rather than using prefixed samples, we collect samples by executing the behavior policy (lines 3-5), and the behavior policy is updated in line 12. Lines 6-11 correspond to the stochastic gradient updates.

1:  Initialize $w_V$, $w_\pi$, $w_\nu$, and the behavior policy $\pi_b$ randomly; set the replay buffer $\mathcal{D} = \emptyset$.
2:  for episode $i = 1, \ldots, T$ do
3:     for step $k = 1, \ldots, K$ do
4:        Collect transition $(s, a, r, s')$ into $\mathcal{D}$ by executing behavior policy $\pi_b$.
5:     end for
6:     for iteration $j = 1, \ldots, N$ do
7:        Update $w_\nu$ by solving the dual subproblem on a mini-batch from $\mathcal{D}$.
8:        Decay the stepsize $\zeta_j$ at rate $O(1/j)$.
9:        Compute the stochastic gradients w.r.t. $w_V$ and $w_\pi$ from (13) and (14).
10:        Update the parameters of the primal functions by solving the prox-mappings.
11:     end for
12:     Update behavior policy $\pi_b \leftarrow \pi$.
13:  end for
Algorithm 1 Smoothed Dual Embedding Control with Experience Replay
Remark (Role of dual variables):

The dual solution $\nu$ is updated by solving the subproblem

$$\max_{\nu}\ -\,\mathbb{E}_{s,a,s'}\Big[\big(\delta(s, a, s') - \nu(s, a)\big)^2\Big], \qquad (15)$$

where $\delta(s, a, s') = V(s) - R(s, a) - \gamma V(s') + \lambda \log \pi(a|s)$; this can be processed by stochastic gradient descent or other optimization algorithms. Obviously, the solution of this optimization is

$$\nu^*(s, a) = \mathbb{E}_{s'|s,a}\big[\delta(s, a, s')\big] = V(s) - \bar R(s, a) - \gamma\,\mathbb{E}_{s'|s,a}[V(s')] + \lambda \log \pi(a|s).$$

Therefore, the dual variable can essentially be viewed as a $Q$-function in the entropy-regularized MDP. The algorithm can thus be understood as first fitting a parametrized $Q$-function via the dual variable with a mean squared loss, and then applying stochastic mirror descent w.r.t. $V$ and $\pi$ with the gradient estimators (13) and (14), where $\nu = \nu^*$.

Remark (Connection to TRPO and natural policy gradient):

The update of $\pi$ is highly related to trust region policy optimization (TRPO) of Schulman et al. (2015) and the natural policy gradient (NPG) (Kakade, 2002; Rajeswaran et al., 2017) when we set $D_\pi$ to the KL-divergence. Specifically, in Kakade (2002) and Rajeswaran et al. (2017), $\pi$ is updated by a natural gradient step, which is similar to our prox-mapping update except that the vanilla policy gradient is replaced with our gradient (14); in Schulman et al. (2015), a related optimization with a hard KL constraint is used to update the policy. Although these operations are similar to ours, we emphasize that the estimation of the advantage function and the update of the policy are separated in NPG and TRPO: an arbitrary policy evaluation algorithm can be adopted for estimating the value function of the current policy. In our algorithm, the quantity playing the role of the advantage differs from the vanilla advantage function and is designed specifically for the off-policy setting; moreover, the estimation of $V$ and the update of $\pi$ are integrated into a single procedure.

5 Theoretical Analysis

In this section, we present our main theoretical results on the behavior of the proposed algorithm under the setting of Antos et al. (2008), where samples are prefixed and come from one single off-policy sample path. For simplicity, we consider the case $\eta = 1$ with the equivalent optimization (10). For general $\eta$, a similar result to Theorem 6 can be achieved by replacing the objective with a combination of (9) and the surrogate; we omit the details due to space limitations.

Based on the construction of the algorithm, the convergence analysis essentially boils down to several parts:

  • the bias from smoothing Bellman optimality equation;

  • the statistical error induced when learning with finite samples from one single sample path;

  • the approximation error introduced by function parametrization (both for primal and dual variables) in (10);

  • the optimization error when solving the finite-sample version of the saddle point problem (10) within a fixed number of iterations.

Notations. The parametrized function classes of the value function $V$, policy $\pi$, and dual variable $\nu$ are denoted as $\mathcal{V}$, $\Pi$, and $\mathcal{H}$, respectively. Denote $\ell_{\mathcal{V},\Pi}$ as the objective (9) restricted to these parametrized classes and $(V^\dagger, \pi^\dagger)$ as the corresponding optimal solution. Denote $\hat\ell_T$ as the finite-sample approximation of $\ell_{\mathcal{V},\Pi}$ using $T$ samples and $(\hat V_T, \hat\pi_T)$ as the corresponding optimal solution. The approximation error measures how well the parametrized classes $\mathcal{V}$, $\Pi$, and $\mathcal{H}$ can approximate the target functions. The $\mu$-weighted norm of any function $f$ is defined as $\|f\|_\mu^2 := \int f^2\,d\mu$. We also introduce a scaled norm for the value function,
$$\|V\|^2 := \mathbb{E}_{s,a}\Big[\big(V(s) - \gamma\,\mathbb{E}_{s'|s,a}[V(s')]\big)^2\Big];$$
this is indeed a well-defined norm, since it equals $\|(I - \gamma P)V\|_\mu^2$ and $I - \gamma P$ is injective.

We make the following standard assumptions about the MDPs:

Assumption 1 (MDP regularity)

We assume the reward function is uniformly bounded, i.e., $\|R\|_\infty \le C_R < \infty$, and that there exists an optimal policy $\pi^*_\lambda(a|s)$ attaining the maximum in (5).

Assumption 2 (Sample path property (Antos et al., 2008))

Denote $\mu$ as the stationary distribution of the behavior policy $\pi_b$ over the MDP. We assume $\pi_b(a|s) > 0$ for every state-action pair, and that the corresponding Markov process is ergodic. We further assume that the sample path $\{(s_t, a_t, r_t)\}_{t \ge 1}$ is strictly stationary and exponentially $\beta$-mixing, with a rate defined by the parameters $(\bar\beta, b, \kappa)$.

Assumption 1 ensures the solvability of the MDP and the boundedness of the optimal value functions $V^*$ and $V_\lambda$. Assumption 2 ensures the $\beta$-mixing property of the samples (see, e.g., Proposition 4 in Carrasco and Chen (2002)) and is often necessary for proving large-deviation bounds.

The error introduced by smoothing has been characterized in Section 3.2. The approximation error is tied to the flexibility of the parametrized function classes and has been widely studied in approximation theory. Here we mainly focus on the statistical error and the optimization error. For the sake of simplicity, we only summarize the main results and omit constant factors whenever possible. Detailed theorems and proofs can be found in Appendix D.


Denote $(\hat V_T, \hat\pi_T)$ as the solution of the finite-sample problem and $(V^\dagger, \pi^\dagger)$ as the solution of the corresponding population problem over the same function classes. The statistical error is defined as $\epsilon_{stat} := \big|\ell(\hat V_T, \hat\pi_T) - \ell(V^\dagger, \pi^\dagger)\big|$. Invoking a generalized version of Pollard's tail inequality for $\beta$-mixing sequences and prior results in Antos et al. (2008) and Haussler (1995), we show that

Theorem 5 (Statistical error)

Under Assumption 2, with probability at least $1 - \delta$, it holds that $\epsilon_{stat} = \tilde{O}\big(\sqrt{M/T}\big)$, where $M$ is a constant depending on the mixing parameters and the capacities of the parametrized function classes.

Combining the errors caused by smoothing and function approximation, we bound the difference between $\hat V_T$ and $V^*$ under the scaled norm as follows.

Theorem 6 (Total error)

Let $\hat V_T$ be a candidate solution output from the proposed algorithm based on $T$ off-policy samples. Then, with high probability, we have
$$\|\hat V_T - V^*\| \le \epsilon_{app} + \epsilon_{sm} + \epsilon_{stat} + \epsilon_{opt},$$
where $\epsilon_{app}$ corresponds to the approximation error, $\epsilon_{sm}$ to the bias induced by smoothing, $\epsilon_{stat}$ to the statistical error, and $\epsilon_{opt}$ to the optimization error of solving the finite-sample problem within a fixed budget.

There exists a delicate trade-off between the smoothing bias and the approximation error. Using a large $\lambda$ increases the smoothing bias but decreases the approximation error, since the solution function space is better behaved. The concrete correspondence between $\lambda$ and the approximation error depends on the specific form of the function approximators, which is beyond the scope of this paper. When $\lambda \to 0$ and the approximation is good enough, the solution will converge to the optimal value function $V^*$.

Convergence Analysis.

It is well known that for convex-concave saddle-point problems, applying stochastic mirror descent ensures global convergence at a sublinear rate; see, e.g., Nemirovski et al. (2009). However, this no longer holds for problems without convex-concavity. On the other hand, since our algorithm solves the dual maximization problem exactly at each iteration (which is convex), it can essentially be regarded as a special case of the stochastic mirror descent algorithm applied to the non-convex minimization problem $\min_{V,\pi} \max_{\nu} L(V, \pi; \nu)$. The latter was proven to converge sublinearly to a stationary point when the stepsize is diminishing and the Euclidean distance is used for the prox-mapping (Ghadimi and Lan, 2013). For completeness, we state the result below.

Theorem 7 (Ghadimi and Lan (2013), Corollary 2.2)

Consider the case when the Euclidean distance is used in the algorithm. Assume that the parametrized objective is $K$-Lipschitz-smooth and that the variance of the stochastic gradient is bounded by $\sigma^2$. Let the algorithm run $N$ iterations with stepsize $\zeta_j = \min\{1/K, c/(\sigma\sqrt{N})\}$ for some $c > 0$. Setting the candidate solution to be $w_j$ with $j$ randomly chosen from $\{1, \ldots, N\}$ with probability proportional to $\zeta_j$, it holds that
$$\mathbb{E}\big[\|\nabla \ell(w_j)\|^2\big] = O\Big(\frac{K D^2}{N} + \Big(c + \frac{D^2}{c}\Big)\frac{\sigma}{\sqrt{N}}\Big),$$
where $D$ represents the distance of the initial solution to the optimal solution.

The above result implies that the algorithm converges sublinearly to a stationary point. Note that the Lipschitz constant is inherently dependent on the smoothing parameter: it scales with $1/\lambda$, and thus gets worse as $\lambda$ decreases.

6 Related Work

The algorithm is related to the reinforcement learning with entropy-regularized MDP model. Different from the motivation in our method where the entropy regularization is introduced in dual form for smoothing (Nesterov, 2005), the entropy-regularized MDP has been proposed for balancing exploration and exploitation (Haarnoja et al., 2017), taming the noises in observations (Rubin et al., 2012; Fox et al., 2015), and tractability (Todorov, 2007).

Specifically, Fox et al. (2015) proposed soft Q-learning, which extends tabular Q-learning to the Bellman optimality equation of the finite-state, finite-action entropy-regularized MDP. The algorithm does not accommodate function approximators due to the intractability of the $\log$-sum-$\exp$ operation in the soft Q-learning update. To avoid this difficulty, Haarnoja et al. (2017) reformulate the update as an optimization which is approximated with samples from a Stein variational gradient descent (SVGD) sampler. Another related algorithm is proposed in Asadi and Littman (2016), where the intractability of the $\log$-sum-$\exp$ operator, there named 'mellowmax', is avoided by optimizing for a maximum-entropy policy in each update; the resulting algorithm resembles SARSA with a particular policy. Liu et al. (2017) focus on the soft Bellman optimality equation with the 'mellowmax' operator, following an approach similar to Asadi and Littman (2016); the only difference is that a Bayesian policy parametrization is used, updated by SVGD. By noticing the duality between soft Q-learning and the maximum-entropy policy, Neu et al. (2017) and Schulman et al. (2017a) investigate the equivalence between these two types of algorithms.

Besides the difficulty of generalizing these algorithms to multi-step trajectories in the off-policy setting, their major drawback is the lack of theoretical guarantees when combined with function approximators: it is not clear whether the algorithms converge, let alone what the quality of their stationary points is.

On the other hand, Nachum et al. (2017a, b) also exploit the consistency condition in Theorem 3 and propose the PCL algorithm, which optimizes an upper bound of the mean squared consistency Bellman error (9). The same consistency condition was also discovered in Rawlik et al. (2012), whose proposed $\Psi$-learning algorithm can be viewed as a fixed-point-iteration version of the unified PCL with a tabular parametrization. However, as discussed in Section 3, the PCL algorithm becomes biased in stochastic environments, which may lead to inferior solutions.

7 Experiments

Figure 1: The results of SDEC against TRPO and PPO on (a) InvertedDoublePendulum, (b) Swimmer, (c) Hopper, and (d) HalfCheetah. Each plot shows the average reward during training across random runs, with confidence intervals; the x-axis is the number of training iterations. SDEC achieves better or comparable performance than TRPO and PPO on all tasks.

We test the proposed smoothed dual embedding control algorithm (SDEC) on several continuous control tasks from the OpenAI Gym benchmark (Brockman et al., 2016) using the MuJoCo simulator (Todorov et al., 2012), comparing with trust region policy optimization (TRPO) (Schulman et al., 2015) and proximal policy optimization (PPO) (Schulman et al., 2017b). Since TRPO and PPO are only applicable in the on-policy setting, for fairness we also restrict SDEC to the on-policy setting, although, as shown earlier, SDEC is able to exploit off-policy samples efficiently. We use the Euclidean distance for $V$ and the KL-divergence for $\pi$ in the experiments; other Bregman divergences are also applicable. As shown in the comprehensive comparison of Henderson et al. (2017), the implementation of TRPO and PPO affects their performance. For a fair comparison, we use the codes reported to have achieved the best scores in Henderson et al. (2017).

We ran each algorithm with multiple random seeds and report the average rewards with 50% confidence intervals. The empirical results are illustrated in Figure 1: on all these tasks, the proposed SDEC achieves significantly better performance than the other algorithms. The experimental setting is reported below.

Policy and value function parametrization.

For fairness, we use the same parametrization of policy and value functions across all algorithms. The choices of parametrization are largely based on the recent paper by Rajeswaran et al. (2017), which shows that the natural policy gradient with an RBF neural network achieves state-of-the-art TRPO performance on MuJoCo. We parametrize the policy as a Gaussian, $\pi(a|s) = \mathcal{N}(\mu(s), \Sigma)$, where $\mu(s)$ is a two-layer neural net with the random features of an RBF kernel as the hidden layer and $\Sigma$ is a diagonal covariance matrix. The RBF kernel bandwidth is chosen via the median trick (Dai et al., 2014; Rajeswaran et al., 2017). The numbers of hidden nodes for InvertedDoublePendulum, Swimmer, Hopper, and HalfCheetah follow Rajeswaran et al. (2017). Since TRPO and PPO use a linear function as the control variate, we adopt the same parametrization for $V$ in our algorithm; SDEC can, however, adopt arbitrary function approximators.

Training hyperparameters.

For all algorithms, the discount factor and stepsize were set to the same values, and a fixed batch of trajectories was used in each iteration. For TRPO, the CG damping parameter was fixed. For SDEC, $\lambda$ was set to a fixed value and $\eta$ was chosen by grid search.

8 Conclusion

We provide a new optimization perspective on the Bellman optimality equation, based on which we develop smoothed dual embedding control for the policy optimization problem in reinforcement learning. The algorithm is provably convergent with nonlinear function approximators using off-policy samples. We also provide a PAC-learning bound characterizing the sample complexity based on one single off-policy sample path. A preliminary empirical study shows the proposed algorithm achieves performance comparable to or better than state-of-the-art methods on MuJoCo tasks.


Part of this work was done during BD’s internship at Microsoft Research, Redmond. Part of the work was done when LL was with Microsoft Research, Redmond. LS is supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, NSF CNS-1704701, ONR N00014-15-1-2340, Intel ISTC, NVIDIA and Amazon AWS.


  • Antos et al. (2008) Antos, A., Szepesvári, C., and Munos, R. (2008). Learning near-optimal policies with bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129.
  • Asadi and Littman (2016) Asadi, K. and Littman, M. L. (2016). A new softmax operator for reinforcement learning. CoRR, abs/1612.05628.
  • Baird (1995) Baird, L. (1995). Residual algorithms: reinforcement learning with function approximation. In Proc. Intl. Conf. Machine Learning, pages 30–37. Morgan Kaufmann.
  • Bertsekas and Tsitsiklis (1996) Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.
  • Boyan and Moore (1995) Boyan, J. A. and Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems 7 (NIPS-94), pages 369–376.
  • Boyd and Vandenberghe (2004) Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, Cambridge, England.
  • Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
  • Carrasco and Chen (2002) Carrasco, M. and Chen, X. (2002). Mixing and moment properties of various GARCH and stochastic volatility models. Econometric Theory, 18(1):17–39.
  • Chen and Wang (2016) Chen, Y. and Wang, M. (2016). Stochastic primal-dual methods and sample complexity of reinforcement learning. arXiv preprint arXiv:1612.02516.
  • Dai et al. (2016) Dai, B., He, N., Pan, Y., Boots, B., and Song, L. (2016). Learning from conditional distributions via dual kernel embeddings. CoRR, abs/1607.04579.
  • Dai et al. (2014) Dai, B., Xie, B., He, N., Liang, Y., Raj, A., Balcan, M.-F. F., and Song, L. (2014). Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, pages 3041–3049.
  • Du et al. (2017) Du, S. S., Chen, J., Li, L., Xiao, L., and Zhou, D. (2017). Stochastic variance reduction methods for policy evaluation. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1049–1058.
  • Fox et al. (2015) Fox, R., Pakman, A., and Tishby, N. (2015). Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562.
  • Ghadimi and Lan (2013) Ghadimi, S. and Lan, G. (2013). Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368.
  • Gordon (1995) Gordon, G. J. (1995). Stable function approximation in dynamic programming. In Proceedings of the Twelfth International Conference on Machine Learning (ICML-95), pages 261–268.
  • Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165.
  • Haussler (1995) Haussler, D. (1995). Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217–232.
  • Henderson et al. (2017) Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. (2017). Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560.
  • Kakade (2002) Kakade, S. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems 14 (NIPS-01), pages 1531–1538. MIT Press.
  • Liu et al. (2015) Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., and Petrik, M. (2015). Finite-sample analysis of proximal gradient TD algorithms. In Uncertainty in Artificial Intelligence (UAI). AUAI Press.
  • Liu et al. (2017) Liu, Y., Ramachandran, P., Liu, Q., and Peng, J. (2017). Stein variational policy gradient. arXiv preprint arXiv:1704.02399.
  • Macua et al. (2015) Macua, S. V., Chen, J., Zazo, S., and Sayed, A. H. (2015). Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control, 60(5):1260–1274.
  • Maei (2011) Maei, H. R. (2011). Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta, Edmonton, Alberta, Canada.
  • Mahadevan et al. (2014) Mahadevan, S., Liu, B., Thomas, P. S., Dabney, W., Giguere, S., Jacek, N., Gemp, I., and Liu, J. (2014). Proximal reinforcement learning: A new theory of sequential decision making in primal-dual spaces. CoRR, abs/1405.6757.
  • Nachum et al. (2017a) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017a). Bridging the gap between value and policy based reinforcement learning. arXiv preprint arXiv:1702.08892.
  • Nachum et al. (2017b) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017b). Trust-PCL: An off-policy trust region method for continuous control. CoRR, abs/1707.01891.
  • Nemirovski et al. (2009) Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609.
  • Nesterov (2005) Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Mathematical programming, 103(1):127–152.
  • Neu et al. (2017) Neu, G., Jonsson, A., and Gómez, V. (2017). A unified view of entropy-regularized Markov decision processes. CoRR, abs/1705.07798.
  • Puterman (2014) Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
  • Rajeswaran et al. (2017) Rajeswaran, A., Lowrey, K., Todorov, E., and Kakade, S. (2017). Towards generalization and simplicity in continuous control. arXiv preprint arXiv:1703.02660.
  • Rawlik et al. (2012) Rawlik, K., Toussaint, M., and Vijayakumar, S. (2012). On stochastic optimal control and reinforcement learning by approximate inference. In Robotics: Science and Systems.
  • Rubin et al. (2012) Rubin, J., Shamir, O., and Tishby, N. (2012). Trading value and information in MDPs. Decision Making with Imperfect Decision Makers, pages 57–74.
  • Rummery and Niranjan (1994) Rummery, G. A. and Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department.
  • Schulman et al. (2017a) Schulman, J., Abbeel, P., and Chen, X. (2017a). Equivalence between policy gradients and soft Q-learning. CoRR, abs/1704.06440.
  • Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. (2015). Trust region policy optimization. In ICML, pages 1889–1897.
  • Schulman et al. (2017b) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017b). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Sutton (1988) Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44.
  • Sutton (1996) Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8 (NIPS-95), pages 1038–1044.
  • Sutton and Barto (1998) Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge.
  • Sutton et al. (2009) Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 993–1000.
  • Todorov (2007) Todorov, E. (2007). Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pages 1369–1376.
  • Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE.
  • Tsitsiklis and Van Roy (1997) Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690.
  • Wang (2017) Wang, M. (2017). Randomized linear programming solves the discounted Markov decision problem in nearly-linear running time. arXiv e-prints.
  • Watkins (1989) Watkins, C. J. (1989). Learning from Delayed Rewards. PhD thesis, King’s College, University of Cambridge, UK.
  • Williams (1992) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.


Appendix A Properties of Smoothed Bellman Optimality Equation

In this section, we provide the details of the proofs for the properties of the smoothed Bellman optimality equation.

After applying the smoothing technique (Nesterov, 2005), we obtain a new Bellman operator, $\mathcal{T}_\lambda$, which is contractive. This property guarantees the uniqueness of the solution. Specifically,

Proposition 1 (Uniqueness)

The operator $\mathcal{T}_\lambda$ is a contraction; therefore, the smoothed Bellman optimality equation (5) has a unique solution.

Proof  Consider any $V_1, V_2$. For every state $s$, since the smoothed maximum is $1$-Lipschitz with respect to the $\ell_\infty$-norm of its arguments,
$$\left|\mathcal{T}_\lambda V_1(s) - \mathcal{T}_\lambda V_2(s)\right| \;\le\; \gamma \max_{s'}\left|V_1(s') - V_2(s')\right| \;=\; \gamma \left\|V_1 - V_2\right\|_\infty,$$
so $\mathcal{T}_\lambda$ is a $\gamma$-contraction. By the Banach fixed-point theorem, existence and uniqueness are guaranteed for the smoothed Bellman optimality equation.  
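As an illustrative sanity check (not part of the formal argument), the contraction property can be verified numerically. The sketch below assumes the smoothed operator takes the entropy-regularized log-sum-exp form $(\mathcal{T}_\lambda V)(s) = \lambda \log \sum_a \exp\big((R(s,a) + \gamma\,\mathbb{E}_{s'}[V(s')])/\lambda\big)$ on a randomly generated finite MDP; all names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, lam = 5, 3, 0.9, 0.1
R = rng.uniform(0.0, 1.0, size=(S, A))           # rewards R(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))       # P[s, a] = distribution over s'

def smoothed_bellman(V):
    # (T_lam V)(s) = lam * log sum_a exp(Q(s, a) / lam), computed stably.
    Q = R + gamma * P @ V
    m = Q.max(axis=1)
    return m + lam * np.log(np.exp((Q - m[:, None]) / lam).sum(axis=1))

# Empirical check of the gamma-contraction in the sup-norm.
V1, V2 = rng.normal(size=S), rng.normal(size=S)
lhs = np.abs(smoothed_bellman(V1) - smoothed_bellman(V2)).max()
rhs = gamma * np.abs(V1 - V2).max()
assert lhs <= rhs + 1e-12
```

The inequality holds for any pair of value functions because the gradient of the log-sum-exp is a probability vector, making the smoothed maximum $1$-Lipschitz in the sup-norm.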
A similar result is also presented in Fox et al. (2015); Asadi and Littman (2016). We next characterize the bias introduced by the smoothing technique.

Proposition 2 (Smoothing bias)

Let $V^*$ and $V^*_\lambda$ be the fixed points of (3) and (5), respectively. It holds that
$$V^*(s) \;\le\; V^*_\lambda(s) \;\le\; V^*(s) + \frac{\lambda H^*}{1-\gamma}, \qquad \forall s,$$
where $H^* := \max_{s}\max_{\pi(\cdot|s)} H(\pi, s)$ denotes the maximum entropy of the policy. As $\lambda \to 0$, $V^*_\lambda$ converges to $V^*$ pointwise.

Proof  For any $V$ and state $s$, since $0 \le H(\pi, s) \le H^*$, we have $(\mathcal{T}V)(s) \le (\mathcal{T}_\lambda V)(s) \le (\mathcal{T}V)(s) + \lambda H^*$, where $H^* := \max_{s}\max_{\pi(\cdot|s)} H(\pi, s)$. Applying this bound repeatedly, starting from $V^*$, and using the $\gamma$-contraction of both operators,
$$0 \;\le\; V^*_\lambda(s) - V^*(s) \;\le\; \lambda H^* \sum_{t=0}^{\infty} \gamma^t \;=\; \frac{\lambda H^*}{1-\gamma},$$
which implies the conclusion.  
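The bias bound can also be checked numerically. The sketch below is illustrative and assumes entropy regularization over a discrete action set, where the maximum entropy is $\log |A|$; it solves both the original and the smoothed fixed points by value iteration on a random finite MDP.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, lam = 5, 3, 0.9, 0.05
R = rng.uniform(0.0, 1.0, size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))

def bellman(V, smoothed):
    Q = R + gamma * P @ V
    if not smoothed:
        return Q.max(axis=1)                     # hard Bellman operator
    m = Q.max(axis=1)                            # smoothed (log-sum-exp) operator
    return m + lam * np.log(np.exp((Q - m[:, None]) / lam).sum(axis=1))

def fixed_point(smoothed):
    V = np.zeros(S)
    for _ in range(2000):                        # plenty of iterations to converge
        V = bellman(V, smoothed)
    return V

V_star, V_lam = fixed_point(False), fixed_point(True)
bias = V_lam - V_star
# The entropy bonus only adds value, by at most lam * log(A) per step.
assert np.all(bias >= -1e-8)
assert np.all(bias <= lam * np.log(A) / (1 - gamma) + 1e-8)
```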

The smoothed Bellman optimality equation involves a log-sum-exp operator in place of the max-operator, which increases the nonlinearity of the equation. We further characterize the solution of the smoothed Bellman optimality equation via temporal consistency conditions.
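The behavior of this smoothing can be seen on a single vector of action values: the sketch below (illustrative values only) shows that $\lambda \log \sum_a \exp(q_a/\lambda)$ overestimates $\max_a q_a$ by at most $\lambda \log |A|$ and recovers the hard max as $\lambda \to 0$.

```python
import numpy as np

q = np.array([1.0, 2.0, 3.5])                    # example action values Q(s, .)
for lam in [1.0, 0.1, 0.01]:
    m = q.max()
    # Stable evaluation of lam * log sum_a exp(q_a / lam).
    smooth_max = m + lam * np.log(np.exp((q - m) / lam).sum())
    gap = smooth_max - m
    # The smoothed max lies within [max, max + lam * log(#actions)].
    assert 0.0 <= gap <= lam * np.log(len(q)) + 1e-12
```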

Theorem 3 (Temporal consistency)

Let $V^*_\lambda$ be the fixed point of (5) and $\pi^*_\lambda$ the corresponding policy that attains the maximum on the RHS of (5). Then $(V^*_\lambda, \pi^*_\lambda)$ is the unique pair $(V, \pi)$ that satisfies
$$V(s) \;=\; R(s,a) + \gamma\,\mathbb{E}_{s'|s,a}\!\left[V(s')\right] - \lambda \log \pi(a|s), \qquad \forall (s,a). \tag{8}$$
Proof  Necessity. Given the definition of $\mathcal{T}_\lambda$, denote
$$Q^*_\lambda(s,a) \;:=\; R(s,a) + \gamma\,\mathbb{E}_{s'|s,a}\!\left[V^*_\lambda(s')\right].$$
By the strict concavity of the entropy-regularized objective in $\pi(\cdot|s)$, the maximum on the RHS of (5) is attained by the unique policy
$$\pi^*_\lambda(a|s) \;=\; \frac{\exp\!\left(Q^*_\lambda(s,a)/\lambda\right)}{\sum_{a'}\exp\!\left(Q^*_\lambda(s,a')/\lambda\right)},$$
which implies
$$\lambda \log \pi^*_\lambda(a|s) \;=\; Q^*_\lambda(s,a) - \lambda \log \sum_{a'} \exp\!\left(Q^*_\lambda(s,a')/\lambda\right).$$
Then, we can rewrite the smoothed Bellman optimality equation (5) as
$$V^*_\lambda(s) \;=\; \lambda \log \sum_{a'} \exp\!\left(Q^*_\lambda(s,a')/\lambda\right) \;=\; Q^*_\lambda(s,a) - \lambda \log \pi^*_\lambda(a|s), \qquad \forall (s,a).$$
This shows that (8) is a necessary condition, i.e., the optimal $V^*_\lambda$ and $\pi^*_\lambda$ satisfy it. In fact, the condition is also sufficient.

Sufficiency. Assume $(V, \pi)$ satisfies (8). Exponentiating both sides of (8),
$$\pi(a|s) \;=\; \exp\!\left(\frac{R(s,a) + \gamma\,\mathbb{E}_{s'|s,a}[V(s')] - V(s)}{\lambda}\right).$$
Recall $\sum_a \pi(a|s) = 1$; summing over $a$ and solving for $V(s)$, we have
$$V(s) \;=\; \lambda \log \sum_a \exp\!\left(\frac{R(s,a) + \gamma\,\mathbb{E}_{s'|s,a}[V(s')]}{\lambda}\right) \;=\; (\mathcal{T}_\lambda V)(s),$$
so $V$ is a fixed point of $\mathcal{T}_\lambda$. By Proposition 1, $V = V^*_\lambda$, and hence $\pi = \pi^*_\lambda$.  
The same conditions have been rediscovered several times, e.g., by Rawlik et al. (2012) and Nachum et al. (2017a), from completely different points of view.
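The temporal consistency condition can likewise be checked numerically. The sketch below (illustrative setup, assuming the log-sum-exp form of the smoothed operator on a random finite MDP) computes the smoothed fixed point by value iteration, forms the softmax policy, and verifies that $V(s) = Q(s,a) - \lambda \log \pi(a|s)$ holds for every state-action pair.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma, lam = 4, 3, 0.9, 0.1
R = rng.uniform(0.0, 1.0, size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))

def smoothed_bellman(V):
    Q = R + gamma * P @ V
    m = Q.max(axis=1)
    return m + lam * np.log(np.exp((Q - m[:, None]) / lam).sum(axis=1))

V = np.zeros(S)
for _ in range(3000):                            # value iteration to the fixed point
    V = smoothed_bellman(V)

Q = R + gamma * P @ V
logits = Q / lam
log_pi = logits - logits.max(axis=1, keepdims=True)
log_pi -= np.log(np.exp(log_pi).sum(axis=1, keepdims=True))  # log softmax policy

# Temporal consistency: V(s) = Q(s, a) - lam * log pi(a|s) for every (s, a).
residual = np.abs(V[:, None] - (Q - lam * log_pi)).max()
assert residual < 1e-6
```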

Appendix B Variance Cancellation via the Saddle Point Formulation

The second term in the saddle point formulation (11) will cancel the variance . Specifically,

Theorem 8

Given and , we have