Smoothed Dual Embedding Control
Abstract
We revisit the Bellman optimality equation with Nesterov’s smoothing technique and provide a unique saddle-point optimization perspective of the policy optimization problem in reinforcement learning based on Fenchel duality. A new reinforcement learning algorithm, called Smoothed Dual Embedding Control or SDEC, is derived to solve the saddle-point reformulation with arbitrary learnable function approximators. The algorithm bypasses the policy evaluation step in policy optimization in a principled way and can be extended to integrate multi-step bootstrapping and eligibility traces. We provide a PAC-learning bound on the number of samples needed from one single off-policy sample path, and also characterize the convergence of the algorithm. Finally, we show the algorithm compares favorably to the state-of-the-art baselines on several benchmark control problems.
1 Introduction
Reinforcement learning (RL) algorithms aim to learn a policy that maximizes the long-term return by sequentially interacting with an unknown environment (Sutton and Barto, 1998). The dominating framework to model such an interaction is Markov decision processes, or MDPs. A fundamental result for MDPs is that the Bellman operator is a contraction in the value-function space, and thus, the optimal value function is the unique fixed point of the operator. Furthermore, starting from any initial value function, iterative applications of the Bellman operator will converge to the fixed point. Interested readers are referred to the textbook of Puterman (2014) for details.
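The contraction and fixed-point properties described above can be checked numerically. The following is a minimal sketch on a small random tabular MDP; the MDP itself, the helper names (`bellman_backup`, `sup_dist`, `random_mdp`) and all constants are illustrative assumptions, not part of the paper.

```python
import random

def bellman_backup(V, P, R, gamma):
    """One application of the Bellman optimality operator on a tabular MDP.

    P[s][a] is a list of next-state probabilities, R[s][a] the expected reward.
    """
    n_states, n_actions = len(P), len(P[0])
    return [
        max(
            R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
            for a in range(n_actions)
        )
        for s in range(n_states)
    ]

def sup_dist(U, V):
    """Sup-norm distance between two value functions."""
    return max(abs(u - v) for u, v in zip(U, V))

def random_mdp(n_states, n_actions, rng):
    """Build a hypothetical MDP with random transitions and rewards in [0, 1)."""
    P, R = [], []
    for _ in range(n_states):
        rows, rewards = [], []
        for _ in range(n_actions):
            w = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(w)
            rows.append([x / total for x in w])
            rewards.append(rng.random())
        P.append(rows)
        R.append(rewards)
    return P, R

rng = random.Random(0)
gamma = 0.9
P, R = random_mdp(4, 3, rng)

# Two different starting points are pulled together by a factor of at most gamma.
U = [rng.uniform(-5, 5) for _ in range(4)]
V = [rng.uniform(-5, 5) for _ in range(4)]
ratio = sup_dist(bellman_backup(U, P, R, gamma),
                 bellman_backup(V, P, R, gamma)) / sup_dist(U, V)

# Iterating from either start reaches (numerically) the same fixed point.
for _ in range(300):
    U = bellman_backup(U, P, R, gamma)
    V = bellman_backup(V, P, R, gamma)
gap = sup_dist(U, V)
```

Here `ratio` stays below `gamma`, and `gap` shrinks to numerical zero, which is the tabular behavior that breaks down once function approximation enters.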
Many of the most effective RL algorithms have their root in such a fixed-point view. The most prominent family of algorithms is perhaps the temporal-difference algorithms, including TD (Sutton, 1988), Q-learning (Watkins, 1989), SARSA (Rummery and Niranjan, 1994; Sutton, 1996), and numerous variants. Compared to direct policy search or policy gradient algorithms like REINFORCE (Williams, 1992), these fixed-point methods use bootstrapping to make learning more efficient by reducing variance. When the Bellman operator can be computed exactly (even on average), such as when the MDP has finite states/actions, convergence is guaranteed and the proof typically relies on the contraction property (Bertsekas and Tsitsiklis, 1996). Unfortunately, when function approximators are used, such fixed-point methods can easily become unstable or divergent (Boyan and Moore, 1995; Baird, 1995; Tsitsiklis and Van Roy, 1997), except in rather limited cases. For example,

for some rather restrictive function classes that have a non-expansion property, such as kernel averaging, most of the finite-state MDP theory continues to apply (Gordon, 1995);

when linear function classes are used to approximate the value function of a fixed policy from on-policy samples (Tsitsiklis and Van Roy, 1997), convergence is guaranteed.
In recent years, a few authors have made important progress toward finding scalable, convergent TD algorithms, by designing proper objective functions and using stochastic gradient descent (SGD) to optimize them (Sutton et al., 2009; Maei, 2011). Later on, it was realized that several of these gradient-based algorithms can be interpreted as solving a primal-dual problem (Mahadevan et al., 2014; Liu et al., 2015; Macua et al., 2015; Dai et al., 2016). This insight has led to novel, faster, and more robust algorithms by adopting sophisticated optimization techniques (Du et al., 2017). Unfortunately, to the best of our knowledge, all existing works either assume linear function approximation or are designed for policy evaluation. It remains an open question how to find the optimal policy reliably with nonlinear function approximators such as neural networks, even in the presence of off-policy data.
In this work, we take a substantial step towards solving this decades-long open problem, leveraging a unique saddle point optimization perspective to derive a new algorithm called smoothed dual embedding control (SDEC). Our development hinges upon a special look at the Bellman optimality equation and the temporal relationship between the optimal value function and the optimal policy revealed by a smoothed Bellman optimality equation. We exploit this relation and introduce a distinct saddle point optimization that simultaneously learns both the optimal value function and the optimal policy in the primal form, and that escapes the instability of the max operator and the “double sampling” issue faced by existing algorithms. As a result, the SDEC algorithm enjoys many desired properties, in particular:

It is stable for a broad class of nonlinear function approximators including neural networks, and provably converges to a solution with vanishing gradient. This is the case even in the more challenging off-policy scenario.

It uses bootstrapping to yield high sample efficiency, as in TD-style methods, and can be generalized to cases of multi-step bootstrapping and eligibility traces.

It directly optimizes the squared Bellman residual based on a sample trajectory, while avoiding the infamous double-sample issue.

It uses stochastic gradient descent to optimize the objective, and is thus very efficient and scalable.
Furthermore, the algorithm handles both the optimal value function estimation and policy optimization in a unified way, and readily applies to both continuous and discrete action spaces. We test the algorithm on several continuous control benchmarks. Preliminary results show that the proposed algorithm achieves state-of-the-art performance.
2 Preliminaries
In this section, we introduce the preliminary background about Markov decision processes.
Markov Decision Processes (MDPs).
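We denote an MDP as $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is the state space (possibly infinite), $\mathcal{A}$ is the set of actions, $P(\cdot|s,a)$ is the transition probability kernel defining the distribution over next states upon taking action $a$ in state $s$, $R(s,a)$ gives the corresponding stochastic immediate reward with expectation $\bar{R}(s,a)$, and $\gamma \in (0,1)$ is the discount factor.
Given the initial state distribution $\mu$, the goal of reinforcement learning is to find a policy $\pi(\cdot|s): \mathcal{S} \to \mathcal{P}(\mathcal{A})$, where $\mathcal{P}(\mathcal{A})$ denotes all probability measures over $\mathcal{A}$, that maximizes the total expected discounted reward, i.e., $\mathbb{E}_{s_0\sim\mu}\left[\sum_{t=0}^{\infty}\gamma^t R(s_t,a_t)\right]$, where $a_t \sim \pi(\cdot|s_t)$, $s_{t+1} \sim P(\cdot|s_t,a_t)$.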
Bellman Optimality Equation.
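Denote $V^*(s) = \max_{\pi}\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R(s_t,a_t) \mid s_0 = s\right]$. It is known that the optimal state value function $V^*$ satisfies the Bellman optimality equation (Puterman, 2014)
$$V^*(s) = \max_{a\in\mathcal{A}}\; \mathbb{E}_{s'|s,a}\left[R(s,a) + \gamma V^*(s')\right]. \tag{1}$$
The optimal policy $\pi^*$ can be obtained from $V^*$ via
$$\pi^*(\cdot|s) = \arg\max_{a\in\mathcal{A}}\; \mathbb{E}_{s'|s,a}\left[R(s,a) + \gamma V^*(s')\right]. \tag{2}$$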
Therefore, solving the reinforcement learning problem is equivalent to finding the solution to the Bellman optimality equation (1). However, one should note that even if a solution to the Bellman optimality equation (1) is obtained, finding the optimal policy via (2) remains a challenging task in reinforcement learning, since one needs to solve yet another optimization problem.
3 A New Optimization Perspective of Reinforcement Learning
In this section, we introduce a unique optimization perspective of the reinforcement learning problem that paves the way to designing efficient and provable algorithms with the desired properties. The development hinges upon a special look at the Bellman optimality equation that makes the role of the policy explicit. By leveraging a smoothed Bellman optimality equation with entropy regularization, we discover a direct relationship between the optimal policy and the corresponding value function and arrive at a new optimization formulation of the reinforcement learning problem.
3.1 Revisiting the Bellman Optimality Equation
Recall that most value-function-based algorithms find the optimal policy only after (approximately) solving the Bellman optimality equation for the optimal value function. We start by revisiting the Bellman optimality equation and rewriting it in a Fenchel-type representation:
$$V^*(s) = \max_{\pi(\cdot|s)\in\mathcal{P}}\; \mathbb{E}_{a\sim\pi(\cdot|s)}\Big[\mathbb{E}_{s'|s,a}\big[R(s,a) + \gamma V^*(s')\big]\Big], \tag{3}$$
where $\mathcal{P}$ denotes the family of valid distributions. This reformulation is based on the simple fact that for any $x\in\mathbb{R}^n$, $\max_i x_i = \max_{\pi\ge 0,\ \|\pi\|_1 = 1} \pi^\top x$. Observe that the role of the policy is now explicitly revealed in the Bellman optimality equation. Despite its simplicity, this observation is an important step that allows us to develop the new algorithm in the following. A straightforward idea is to consider a joint optimization over $V$ and $\pi$ by fitting them to the Bellman equation and minimizing the expected residuals; that is,
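$$\min_{V}\; \mathbb{E}_{s}\bigg[\Big(\max_{\pi(\cdot|s)\in\mathcal{P}}\mathbb{E}_{a\sim\pi(\cdot|s)}\Big[\mathbb{E}_{s'|s,a}\big[R(s,a) + \gamma V(s')\big]\Big] - V(s)\Big)^2\bigg]. \tag{4}$$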
Unlike most existing approaches, this optimization formulation brings the search procedures for the optimal state value function and the optimal policy into a unified framework. However, there are several major difficulties when directly solving this optimization problem:

The $\max$ operator over the distribution space will cause numerical instability, especially in environments where a slight change in $V$ may cause large differences in the RHS of Eq. (3).

The conditional expectation, $\mathbb{E}_{s'|s,a}[\cdot]$, composed with the square loss, requires double samples (Baird, 1995) to compute an unbiased stochastic gradient, which is often impractical.
3.2 Smoothed Bellman Optimality Equation
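To avoid the instability and discontinuity caused by the $\max$ operator, we propose to smooth the policy update by utilizing the smoothing technique of Nesterov (2005). Since the policy is defined on the distribution space, we introduce an entropic regularization to the Fenchel-type representation (3):
$$V(s) = \max_{\pi(\cdot|s)\in\mathcal{P}}\; \mathbb{E}_{a\sim\pi(\cdot|s)}\Big[\mathbb{E}_{s'|s,a}\big[R(s,a) + \gamma V(s')\big]\Big] + \lambda H(\pi, s), \tag{5}$$
where $H(\pi, s) = -\sum_{a\in\mathcal{A}}\pi(a|s)\log\pi(a|s)$ is the entropy function and $\lambda > 0$ controls the level of smoothing. We first show that the entropy regularization will indeed introduce smoothness and continuity yet preserve the existence and uniqueness of the optimal solution. Observe that the RHS of Eq. (5) is exactly the Fenchel-dual representation of the log-sum-exp function (Boyd and Vandenberghe, 2004). Hence, we have the smoothed Bellman optimality equation as
$$V(s) = \lambda\log\sum_{a\in\mathcal{A}}\exp\left(\frac{\mathbb{E}_{s'|s,a}\left[R(s,a) + \gamma V(s')\right]}{\lambda}\right), \tag{6}$$
where the log-sum-exp is a smooth approximation of the $\max$ operator.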
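The smoothing step can be illustrated on a single state. The sketch below assumes a hypothetical vector `q` of action values and checks two standard facts about the log-sum-exp: it exceeds the max by at most $\lambda\log|\mathcal{A}|$, and the softmax distribution attains the entropy-regularized maximum. Function names and numbers are illustrative only.

```python
import math

def smoothed_max(x, lam):
    """lam * log(sum(exp(x_i / lam))), computed stably by shifting by max(x)."""
    m = max(x)
    return m + lam * math.log(sum(math.exp((xi - m) / lam) for xi in x))

def softmax_policy(x, lam):
    """Distribution proportional to exp(x_i / lam)."""
    m = max(x)
    w = [math.exp((xi - m) / lam) for xi in x]
    z = sum(w)
    return [wi / z for wi in w]

def entropy(p):
    """Shannon entropy with the convention 0 * log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

q = [1.0, 0.5, -2.0, 0.9]   # hypothetical action values at a single state

# Gap between the smoothed max and the hard max for several smoothing levels.
gaps = {}
for lam in (1.0, 0.1, 0.01):
    gaps[lam] = smoothed_max(q, lam) - max(q)

# The softmax policy attains the entropy-regularized maximum in the Fenchel form.
lam = 0.1
pi = softmax_policy(q, lam)
regularized_value = sum(p * x for p, x in zip(pi, q)) + lam * entropy(pi)
```

As the smoothing level decreases, the gap to the hard max shrinks, at the price of a less smooth function of `q`.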
Next we show that the $V$ satisfying the smoothed optimality equation is still unique, due to the contraction property of the operator $\mathcal{T}$ defined by the right-hand side of (6).
Proposition 1 (Uniqueness)
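The operator $\mathcal{T}$ defined by the right-hand side of the smoothed Bellman optimality equation is a contraction operator. Therefore, the smoothed Bellman optimality equation (5) has a unique solution.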
A similar result is also presented in Fox et al. (2015); Asadi and Littman (2016). For completeness, we list the proof in Appendix A.
Note that although the entropy regularization introduces smoothness to the policy and avoids the numerical instability caused by the $\max$ operator, it also introduces a bias into the optimal value function:
Proposition 2 (Smoothing bias)
The proof can be found in Appendix A.
By further simplifying the smoothed Bellman optimality equation, we are able to recover a direct relationship between the optimal value function and optimal policy.
Theorem 3 (Temporal consistency)
We point out that a similar condition has also been realized in Rawlik et al. (2012); Neu et al. (2017); Nachum et al. (2017a), but from a completely different viewpoint, and our proof is slightly different; see Appendix A. In Nachum et al. (2017a), the entropy regularization term is adopted to encourage exploration and prevent early convergence, while we start from the smoothing technique (Nesterov, 2005).
Note that the simplified equation Eq. (8) provides a condition that is both sufficient and necessary for characterizing the optimal value function and optimal policy. As we will see, this characterization indeed yields new opportunities for learning optimal value functions, especially in the off-policy and multi-step/eligibility-trace cases.
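As a sanity check of the discussion above, the following sketch runs smoothed value iteration on a small random tabular MDP and then evaluates a consistency residual for every state-action pair. It assumes the consistency condition takes the path-consistency form $V(s) = R(s,a) + \gamma\,\mathbb{E}_{s'|s,a}[V(s')] - \lambda\log\pi(a|s)$, as in Nachum et al. (2017a); the MDP, the helper `soft_backup`, and all constants are hypothetical.

```python
import math
import random

def soft_backup(V, P, R, gamma, lam):
    """Apply the log-sum-exp Bellman operator once; return (new V, Q-values used)."""
    new_V, Q = [], []
    for s in range(len(P)):
        q = [R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
             for a in range(len(P[s]))]
        m = max(q)
        new_V.append(m + lam * math.log(sum(math.exp((x - m) / lam) for x in q)))
        Q.append(q)
    return new_V, Q

rng = random.Random(1)
n_states, n_actions, gamma, lam = 3, 2, 0.9, 0.5
P = []
for _ in range(n_states):
    rows = []
    for _ in range(n_actions):
        w = [rng.random() + 1e-3 for _ in range(n_states)]
        rows.append([x / sum(w) for x in w])
    P.append(rows)
R = [[rng.random() for _ in range(n_actions)] for _ in range(n_states)]

# Fixed-point iteration; the contraction property makes this converge.
V = [0.0] * n_states
for _ in range(500):
    V, _ = soft_backup(V, P, R, gamma, lam)

# Q-values of the converged V, and the softmax policy they induce.
_, Q = soft_backup(V, P, R, gamma, lam)
residuals = []
for s in range(n_states):
    m = max(Q[s])
    z = sum(math.exp((x - m) / lam) for x in Q[s])
    for a in range(n_actions):
        log_pi = (Q[s][a] - m) / lam - math.log(z)
        # Consistency residual: should vanish for every action, not only the best one.
        residuals.append(Q[s][a] - lam * log_pi - V[s])
max_residual = max(abs(r) for r in residuals)
```

The residual vanishes for every action, including suboptimal ones, which is what makes a condition of this form usable with off-policy samples.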
3.3 Saddle Point Reformulation via Dual Embeddings
With this new characterization of the smoothed Bellman optimality equation, a straightforward idea is to solve equation (8) by minimizing the mean square consistency Bellman error, namely,
$$\min_{V,\,\pi\in\mathcal{P}}\; \mathbb{E}_{s,a}\Big[\Big(\mathbb{E}_{s'|s,a}\big[R(s,a) + \gamma V(s')\big] - \lambda\log\pi(a|s) - V(s)\Big)^2\Big]. \tag{9}$$
Due to the inner conditional expectation, directly applying a stochastic gradient descent algorithm requires two independent samples in each update, referred to as the “double sampling” issue; e.g., Baird (1995); Dai et al. (2016). Directly optimizing (9) remains challenging since in practice one can hardly access two independent samples from $P(\cdot|s,a)$.
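Inspired by Dai et al. (2016), we propose to reformulate the objective into an equivalent saddle-point problem in order to bypass the double sampling issue. Specifically, by exploiting the Fenchel dual of the square function, i.e., $x^2 = \max_{\nu}\left(2\nu x - \nu^2\right)$, and further applying the interchangeability principle (Dai et al., 2016, Lemma 1), we can show that (9) is equivalent to the saddle point problem
$$\min_{V,\,\pi\in\mathcal{P}}\;\max_{\nu\in\mathcal{F}(\mathcal{S}\times\mathcal{A})}\; \mathbb{E}_{s,a,s'}\Big[2\nu(s,a)\big(R(s,a) + \gamma V(s') - \lambda\log\pi(a|s) - V(s)\big) - \nu^2(s,a)\Big], \tag{10}$$
where $\mathcal{F}(\mathcal{S}\times\mathcal{A})$ stands for the function space defined on $\mathcal{S}\times\mathcal{A}$. Note that this is not a standard convex-concave saddle point problem: the objective is convex in $V$ for any fixed $(\pi, \nu)$ and concave in $\nu$ for any fixed $(V, \pi)$, but not necessarily convex in $\pi$ for any fixed $(V, \nu)$.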
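The role of the Fenchel dual of the square can be seen on a toy example. The sketch below uses two hypothetical, equally likely residual samples for a single state-action pair: averaging the squared samples overestimates the square of the mean, whereas the dual form is linear in each sample for a fixed dual variable and recovers the square of the mean after maximization. The grid search over `nu` is only for illustration.

```python
def dual_objective(nu, samples):
    """Sample average of 2*nu*y - nu^2, which is linear in each sample y."""
    return sum(2.0 * nu * y - nu * nu for y in samples) / len(samples)

# Hypothetical residual samples for a single (s, a): two equally likely outcomes.
samples = [0.0, 2.0]
mean = sum(samples) / len(samples)
squared_mean = mean ** 2                      # square of the expectation (the target quantity)
mean_of_squares = sum(y * y for y in samples) / len(samples)   # expectation of the square

# Maximize over nu on a grid; the maximizer coincides with the sample mean.
grid = [i / 100.0 for i in range(-300, 301)]
best_nu = max(grid, key=lambda nu: dual_objective(nu, samples))
best_value = dual_objective(best_nu, samples)
```

Because the dual objective is linear in each sample, one transition per state-action pair suffices for an unbiased estimate of its gradient.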
In contrast to our saddle point optimization approach (10), Nachum et al. (2017a) considered a different way to handle the double sampling issue by solving instead an upper bound of (9), namely, $\min_{V,\pi}\mathbb{E}_{s,a,s'}\big[\big(R(s,a) + \gamma V(s') - \lambda\log\pi(a|s) - V(s)\big)^2\big]$. This surrogate function is obtained by moving the inner expectation outside the square, thus admitting unbiased stochastic gradient estimates with one sample $(s,a,s')$. However, the surrogate function introduces an extra variance term into the original objective; in fact, it equals the objective in (9) plus the conditional variance of the one-step target given $(s,a)$, averaged over $(s,a)$. If the variance is large, minimizing the surrogate function could lead to a highly inaccurate solution, while such an issue does not exist in our saddle point optimization because of the exact equivalence to (9).
In fact, by substituting the dual function $\nu(s,a) = \rho(s,a) - V(s)$, the objective in the saddle point problem becomes
$$\min_{V,\pi}\max_{\rho}\; \mathbb{E}_{s,a,s'}\Big[\big(\delta(s,a,s') - V(s)\big)^2\Big] - \mathbb{E}_{s,a,s'}\Big[\big(\delta(s,a,s') - \rho(s,a)\big)^2\Big], \tag{11}$$
where $\delta(s,a,s') = R(s,a) + \gamma V(s') - \lambda\log\pi(a|s)$. Note that the first term is the same as the surrogate objective of Nachum et al. (2017a), and the second term cancels the extra variance term, as we prove in Theorem 8 in Appendix B. Indeed, this was also observed in Antos et al. (2008). Such an understanding of the saddle point objective as a decomposition of mean and variance is very useful for exploiting a better bias-variance tradeoff. Specifically, when function approximators are used for the dual variables, extra bias will be induced instead of the variance term. To balance the bias induced by the function approximator against the variance, one can impose a weight on the second term. This turns out to be very important, especially in the multi-step setting where the dual variables need complicated parametrization, and has also been observed in our experiments.
Remark (Comparison to existing optimization perspectives).
Several recent works (Chen and Wang, 2016; Wang, 2017) have also considered saddle point formulations of Bellman equations, but these formulations are fundamentally different from ours, even in their origin. These saddle point problems are derived from the Lagrangian dual of the linear programming formulation of Bellman optimality equations and are only applicable to MDPs with finite state and action spaces. In contrast, our saddle point optimization originates from the Fenchel dual of the mean squared error of a smoothed Bellman optimality equation. Moreover, our framework can be applied to both finite-state and continuous MDPs, and naturally adapts to the multi-step and eligibility-trace extensions.
4 Smoothed Dual Embedding Control Algorithm
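In this section, we develop an efficient reinforcement learning algorithm from the saddle point perspective. As we discussed, the optimization (11) provides a convenient mechanism to achieve a better bias-variance tradeoff by reweighting the two terms, i.e.,
$$\min_{V,\pi}\max_{\rho}\; \mathbb{E}_{s,a,s'}\Big[\big(\delta(s,a,s') - V(s)\big)^2\Big] - \eta\,\mathbb{E}_{s,a,s'}\Big[\big(\delta(s,a,s') - \rho(s,a)\big)^2\Big], \tag{12}$$
where $\delta(s,a,s') = R(s,a) + \gamma V(s') - \lambda\log\pi(a|s)$ and $\eta\in[0,1]$ is a given scalar used for balancing the variance and potential bias. When $\eta = 1$, this reduces to the original saddle point formulation (10). When $\eta = 0$, this reduces to the surrogate objective considered in Nachum et al. (2017a).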
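The effect of the weight on the second term can be made concrete for a single state-action pair. The sketch below assumes the weighted objective has the two-term form "squared error of the one-step target against $V(s)$, minus $\eta$ times the squared error against the dual variable", and uses a hypothetical discrete distribution for the target; all names and numbers are illustrative.

```python
def weighted_objective(v, rho, targets, probs, eta):
    """E[(delta - v)^2] - eta * E[(delta - rho)^2] for one (s, a) pair.

    targets/probs describe a discrete distribution of the one-step target delta.
    """
    first = sum(p * (d - v) ** 2 for d, p in zip(targets, probs))
    second = sum(p * (d - rho) ** 2 for d, p in zip(targets, probs))
    return first - eta * second

targets, probs = [0.0, 1.0, 4.0], [0.5, 0.25, 0.25]   # hypothetical target distribution
mean = sum(p * d for d, p in zip(targets, probs))       # conditional mean of the target
variance = sum(p * (d - mean) ** 2 for d, p in zip(targets, probs))
v = 0.5                                                 # hypothetical value estimate V(s)

exact = (mean - v) ** 2                                 # error with the expectation inside the square
surrogate = weighted_objective(v, mean, targets, probs, eta=0.0)
saddle = weighted_objective(v, mean, targets, probs, eta=1.0)
halfway = weighted_objective(v, mean, targets, probs, eta=0.5)
```

With the dual variable at the conditional mean, the unweighted case recovers the exact squared error, the zero-weight case carries the full variance, and intermediate weights interpolate between the two.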
From the new optimization view of reinforcement learning, we derive the smoothed dual embedding control algorithm based on stochastic mirror descent (Nemirovski et al., 2009). For simplicity, we mainly discuss the one-step optimization (12); the algorithm can be easily generalized to the multi-step and eligibility-trace settings. See Appendix C.2 and C.3, respectively, for details.
We first derive the unbiased gradient estimator of the objective in (12) w.r.t. $V$ and $\pi$:
Theorem 4 (Unbiased gradient estimator)
Denote
We have the unbiased gradient estimator as
(13) 
(14) 
Given parametrizations of the value function $V$ and the policy $\pi$, the gradients with respect to their parameters can be obtained by the chain rule from (13) and (14). We apply stochastic mirror descent to update these parameters, i.e., in each iteration we solve a prox-mapping whose proximity terms are Bregman divergences.
We can use the Euclidean metric for both the value-function and the policy parameters, or exploit the KL-divergence for the policy. Following these steps, we arrive at the smoothed dual embedding control algorithm.
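The two choices of Bregman divergence lead to familiar closed-form updates. The following sketch shows a generic prox-mapping step under each choice: with the squared Euclidean distance it is a plain gradient step, and with the KL-divergence on a probability vector it is a multiplicative (exponentiated-gradient) update. It operates directly on a small probability vector rather than on policy network parameters, which is a simplifying assumption; function names and numbers are hypothetical.

```python
import math

def prox_euclidean(w, grad, step):
    """argmin_u <grad, u> + ||u - w||^2 / (2 * step): a plain gradient step."""
    return [wi - step * gi for wi, gi in zip(w, grad)]

def prox_kl(p, grad, step):
    """argmin over the simplex of <grad, q> + KL(q || p) / step: a multiplicative update."""
    logits = [math.log(pi) - step * gi for pi, gi in zip(p, grad)]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [x / z for x in weights]

def kl(q, p):
    """KL divergence between two discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def kl_prox_objective(q, p, grad, step):
    """Objective minimized by prox_kl, used here only to check optimality."""
    return sum(gi * qi for gi, qi in zip(grad, q)) + kl(q, p) / step

p = [0.5, 0.3, 0.2]          # hypothetical current action probabilities
grad = [1.0, -0.5, 0.2]      # hypothetical stochastic gradient
step = 0.4
q_new = prox_kl(p, grad, step)
w_new = prox_euclidean([1.0, 2.0], [0.5, -1.0], 0.1)
```

The KL version keeps the iterate on the simplex automatically, which is one reason it is a natural choice for policies.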
For practical purposes, we incorporate experience replay into the algorithm, as illustrated in Algorithm 1. Rather than using pre-fixed samples, the algorithm collects samples by executing the behavior policy (lines 3-5), and the behavior policy is updated in line 12. Lines 6-11 correspond to the stochastic gradient descent updates.
Remark (Role of dual variables):
The dual solution is updated through solving the subproblem
$$\min_{\rho}\; \mathbb{E}_{s,a,s'}\Big[\big(R(s,a) + \gamma V(s') - \lambda\log\pi(a|s) - \rho(s,a)\big)^2\Big], \tag{15}$$
which can be processed by stochastic gradient descent or other optimization algorithms. Obviously, the solution of the optimization is $\rho^*(s,a) = R(s,a) + \gamma\,\mathbb{E}_{s'|s,a}[V(s')] - \lambda\log\pi(a|s)$.
Therefore, the dual variables can be essentially viewed as a Q-function in the entropy-regularized MDP. The algorithm can thus be understood as first fitting a parametrized Q-function by the dual variables via mean square loss, and then applying stochastic mirror descent w.r.t. $V$ and $\pi$ with the gradient estimators (13) and (14) evaluated at the fitted dual variables.
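The statement that a mean-square fit recovers a conditional expectation can be illustrated for a single state-action pair. The sketch below runs SGD on the squared loss over hypothetical noisy one-step targets; with step size $1/(2n)$ the iterate is exactly the running sample mean. The target distribution and constants are assumptions made for illustration.

```python
import random

rng = random.Random(0)
# Hypothetical noisy one-step targets observed for a single (s, a) pair.
targets = [1.0 + rng.uniform(-0.5, 0.5) for _ in range(1000)]

rho = 0.0
for n, y in enumerate(targets, start=1):
    grad = -2.0 * (y - rho)          # gradient of (y - rho)^2 with respect to rho
    rho -= grad / (2.0 * n)          # step size 1 / (2n) reproduces the running mean

sample_mean = sum(targets) / len(targets)
```

With a function approximator shared across state-action pairs, the same regression yields an approximation of the conditional mean rather than the exact value, which is the source of the bias discussed in Section 3.3.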
Remark (Connection to TRPO and natural policy gradient):
The update of the policy parameters is closely related to trust region policy optimization (TRPO) of Schulman et al. (2015) and natural policy gradient (NPG) (Kakade, 2002; Rajeswaran et al., 2017) when we set the Bregman divergence for the policy to the KL-divergence. Specifically, in Kakade (2002) and Rajeswaran et al. (2017), the policy parameters are updated by a KL-regularized step along the policy gradient, which is similar to our prox-mapping, the difference being that the policy gradient is replaced with our gradient; in Schulman et al. (2015), a related optimization with a hard KL constraint is used to update the policy. Although these operations are similar to our policy update, we emphasize that the estimation of the advantage function and the update of the policy are separated in NPG and TRPO: an arbitrary policy evaluation algorithm can be adopted for estimating the value function of the current policy. In our algorithm, by contrast, the quantity playing the role of the advantage differs from the vanilla advantage function and is designed specifically for the off-policy setting, and its estimation is integrated with the policy update in a single procedure.
5 Theoretical Analysis
In this section, we provide our main results on the theoretical behavior of the proposed algorithm under the setting in Antos et al. (2008), where samples are pre-fixed and come from one single off-policy sample path. We consider the case where $\eta = 1$ with the equivalent optimization (10) for simplicity. For general $\eta$, we can achieve a similar result to Theorem 6 by replacing with a combination of and . We omit the details here due to space limitations.
Based on the construction of the algorithm, the convergence analysis essentially boils down to several parts:

the bias from smoothing Bellman optimality equation;

the statistical error induced when learning with finite samples from one single sample path;

the approximation error introduced by function parametrization (both for primal and dual variables) in (10);

the optimization error when solving the finite-sample version of the saddle point problem (10) within a fixed number of iterations.
Notations. The parametrized function class of value function , policy , and dual variable are denoted as , respectively. Denote as the parametrized objective of and as the corresponding optimal solution. Denote as the finite sample approximation of using samples and as the corresponding optimal solution. The function approximation error between two function classes and is defined as and for and as and . The norm of any function is defined as . We also introduce a scaled norm for value function : this is indeed a welldefined norm since and is injective.
We make the following standard assumptions about the MDPs:
Assumption 1 (MDP regularity)
We assume , and there exists an optimal policy, , such that .
Assumption 2 (Sample path property (Antos et al., 2008))
Denote as the stationary distribution of behavior policy over the MDP. We assume , , and the corresponding Markov process is ergodic. We further assume that is strictly stationary and exponentially mixing with a rate defined by the parameters .
Assumption 1 ensures the solvability of the MDP and boundedness of the optimal value functions, and . Assumption 2 ensures mixing property of the samples (see e.g., Proposition 4 in Carrasco and Chen (2002)) and is often necessary to prove large deviation bounds.
The error introduced by smoothing has been characterized in Section 3.2. The approximation error is tied to the flexibility of the parametrized function classes and has been widely studied in approximation theory. Here we mainly focus on investigating the statistical error and optimization error. For the sake of simplicity, we only state the main results briefly and ignore constant factors whenever possible. Detailed theorems and proofs can be found in Appendix D.
Suboptimality.
Denote and . The statistical error is defined as . Invoking a generalized version of Pollard’s tail inequality to mixing sequences and prior results in Antos et al. (2008) and Haussler (1995), we show that
Theorem 5 (Statistical error)
Under Assumption 2, it holds with at least probability , where are some constants.
Combining the error caused by smoothing and function approximation, we show that the difference between and under the norm is given by
Theorem 6 (Total error)
Let be a candidate solution output from the proposed algorithm based on offpolicy samples, with high probability, we have where corresponds to the approximation error, corresponds to the bias induced by smoothing, and corresponds to the statistical error, and is the optimization error of solving within a fixed budget.
There exists a delicate tradeoff between the smoothing bias and the approximation error. Using a large $\lambda$ increases the smoothing bias but decreases the approximation error, since the solution function space is better behaved. The concrete correspondence between $\lambda$ and the approximation error depends on the specific form of the function approximators, which is beyond the scope of this paper. When $\lambda \to 0$ and the approximation is good enough, the solution will converge to the optimal value function $V^*$.
Convergence Analysis.
It is well-known that for convex-concave saddle point problems, applying stochastic mirror descent ensures global convergence at a sublinear rate; see, e.g., Nemirovski et al. (2009). However, this no longer holds for problems without convex-concavity. On the other hand, since our algorithm solves exactly the dual maximization problem at each iteration (which is convex), it can be essentially regarded as a special case of the stochastic mirror descent algorithm applied to the nonconvex minimization problem that remains after the dual variable has been maximized out. The latter was proven to converge sublinearly to a stationary point when the stepsize is diminishing and the Euclidean distance is used for the prox-mapping (Ghadimi and Lan, 2013). For completeness, we list the result below.
Theorem 7 ((Ghadimi and Lan, 2013), resp. Corollary 2.2)
Consider the case when Euclidean distance is used in the algorithm. Assume that the parametrized objective is Lipschitz and variance of stochastic gradient is bounded by . Let the algorithm run iterations with stepsize for some and output . Setting the candidate solution to be with randomly chosen from such that , then it holds that where represents the distance of the initial solution to the optimal solution.
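The output rule in this kind of result, returning a randomly chosen iterate rather than the last one, can be sketched as follows. The example assumes a constant stepsize, in which case the random index is drawn uniformly, and uses a one-dimensional smooth nonconvex function with bounded gradient noise; the function, the noise model, and all constants are hypothetical and are not the objective analyzed in the paper.

```python
import random

def grad_f(x):
    """Gradient of the smooth nonconvex function f(x) = x^2 / (1 + x^2)."""
    return 2.0 * x / (1.0 + x * x) ** 2

def randomized_sgd(x0, step, n_iters, noise, rng):
    """Run SGD with bounded gradient noise; return (all iterates, random output index)."""
    iterates = [x0]
    x = x0
    for _ in range(n_iters):
        g = grad_f(x) + rng.uniform(-noise, noise)   # unbiased stochastic gradient
        x -= step * g
        iterates.append(x)
    # With a constant stepsize, the output index is drawn uniformly at random.
    return iterates, rng.randrange(n_iters)

rng = random.Random(0)
iterates, k = randomized_sgd(x0=2.0, step=0.1, n_iters=2000, noise=0.1, rng=rng)
candidate = iterates[k]
# Average squared gradient norm over the iterates the output is drawn from.
avg_sq_grad = sum(grad_f(x) ** 2 for x in iterates[:-1]) / (len(iterates) - 1)
```

The guarantee concerns the expected squared gradient norm at the randomly selected iterate, which equals the average computed in `avg_sq_grad` under uniform selection.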
The above result implies that the algorithm converges sublinearly to a stationary point. Note that the Lipschitz constant is inherently dependent on the smoothing parameter $\lambda$: the Lipschitz constant gets worse when $\lambda$ increases.
6 Related Work
The algorithm is related to reinforcement learning with the entropy-regularized MDP model. Different from the motivation in our method, where the entropy regularization is introduced in dual form for smoothing (Nesterov, 2005), the entropy-regularized MDP has been proposed for balancing exploration and exploitation (Haarnoja et al., 2017), taming the noise in observations (Rubin et al., 2012; Fox et al., 2015), and tractability (Todorov, 2007).
Specifically, Fox et al. (2015) proposed soft Q-learning, which extends tabular Q-learning to the new Bellman optimality equation corresponding to the finite-state, finite-action entropy-regularized MDP. The algorithm does not accommodate function approximators due to the intractability of the log-sum-exp operation in the soft Q-learning update. To avoid this difficulty, Haarnoja et al. (2017) reformulate the update as an optimization which is approximated by samples from a Stein variational gradient descent (SVGD) sampler. Another related algorithm is proposed in Asadi and Littman (2016), where the intractability issue of the log-sum-exp operator, named ‘mellowmax’, is avoided by optimizing for a maximum entropy policy in each update. The resulting algorithm resembles SARSA with a particular policy. Liu et al. (2017) focus on the soft Bellman optimality equation with the ‘mellowmax’ operator, following an approach similar to that of Asadi and Littman (2016). The only difference is that a Bayesian policy parametrization is used in Liu et al. (2017), which is updated by SVGD. By noticing the duality between soft Q-learning and the maximum entropy policy, Neu et al. (2017) and Schulman et al. (2017a) investigate the equivalence between these two types of algorithms.
Besides the difficulty of generalizing these algorithms to multi-step trajectories in the off-policy setting, the major drawback of these algorithms is the lack of theoretical guarantees when combined with function approximators. It is not clear whether the algorithms converge or not, not to mention the quality of the stationary points.
On the other hand, Nachum et al. (2017a, b) also exploit the consistency condition in Theorem 3 and propose the PCL algorithm, which optimizes the upper bound of the mean square consistency Bellman error (9). The same consistency condition is also discovered in Rawlik et al. (2012), and the proposed learning algorithm can be viewed as a fixed-point iteration version of the unified PCL with tabular functions. However, as we discussed in Section 3, the PCL algorithms become biased in stochastic environments, which may lead to inferior solutions.
7 Experiments
Figure 1: Panels (a) InvertedDoublePendulum, (b) Swimmer, (c) Hopper, (d) HalfCheetah.
We test the proposed smoothed dual embedding control algorithm (SDEC) on several continuous control tasks from the OpenAI Gym benchmark (Brockman et al., 2016) using the MuJoCo simulator (Todorov et al., 2012), comparing with trust region policy optimization (TRPO) (Schulman et al., 2015) and proximal policy optimization (PPO) (Schulman et al., 2017b). Since TRPO and PPO are only applicable in the on-policy setting, for fairness, we also restrict SDEC to the on-policy setting. However, as we show in the paper, SDEC is able to exploit off-policy samples efficiently. We use the Euclidean distance for the value-function parameters and the KL-divergence for the policy parameters in the experiments. We emphasize that other Bregman divergences are also applicable. As the comprehensive comparison in Henderson et al. (2017) shows, the implementation of TRPO and PPO affects the performance of the algorithms. For a fair comparison, we use the code from https://github.com/joschu/modular_rl, reported to have achieved the best scores in Henderson et al. (2017).
We ran the algorithm with random seeds and reported the average rewards with 50% confidence intervals. The empirical comparison results are illustrated in Figure 1. We can see that in all these tasks, the proposed SDEC achieves significantly better performance than the other algorithms. The experiment setting is reported below.
Policy and value function parametrization.
For fairness, we use the same parametrization of policy and value functions across all algorithms. The choices of parametrization are largely based on the recent paper by Rajeswaran et al. (2017), which shows that natural policy gradient with an RBF neural network achieves the state-of-the-art performance of TRPO on MuJoCo. For the policy distribution, we parametrize it as $\pi(a|s) = \mathcal{N}(\mu(s), \Sigma)$, where $\mu(s)$ is a two-layer neural net with the random features of an RBF kernel as the hidden layer and $\Sigma$ is a diagonal matrix. The RBF kernel bandwidth is chosen via the median trick (Dai et al., 2014; Rajeswaran et al., 2017). As in Rajeswaran et al. (2017), we use hidden nodes in InvertedDoublePendulum, Swimmer, Hopper, and use hidden nodes in HalfCheetah. Since TRPO and PPO use a linear control variate as $V$, we also adopt this parametrization for $V$ in our algorithm. However, SDEC can adopt arbitrary function approximators.
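A hidden layer of random RBF features with a median-trick bandwidth can be sketched as follows. This uses the standard random Fourier feature construction (cosine features with Gaussian frequencies and uniform phases), which is an assumption and may differ in detail from the features used by Rajeswaran et al. (2017) or in the experiments here; the state dimension, feature count, and sample states are hypothetical.

```python
import math
import random
import statistics

def median_bandwidth(points):
    """Median trick: use the median pairwise Euclidean distance as the bandwidth."""
    dists = [math.dist(x, y) for i, x in enumerate(points) for y in points[i + 1:]]
    return statistics.median(dists)

def make_rff(dim, n_features, bandwidth, rng):
    """Sample frequencies and phases for random Fourier features of an RBF kernel."""
    W = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)] for _ in range(n_features)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def features(x):
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + bi)
                for w, bi in zip(W, b)]

    return features

rng = random.Random(0)
states = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(20)]   # hypothetical states
sigma = median_bandwidth(states)
phi = make_rff(dim=3, n_features=2000, bandwidth=sigma, rng=rng)

# The inner product of feature vectors approximates the RBF kernel value.
x, y = states[0], states[1]
approx = sum(u * v for u, v in zip(phi(x), phi(y)))
exact = math.exp(-math.dist(x, y) ** 2 / (2.0 * sigma ** 2))
```

A Gaussian policy mean would then be a linear map applied to `phi(s)`, with only the output layer and the diagonal covariance being trained.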
Training hyperparameters.
For all algorithms, we set and stepsize = . A batch size of trajectories was used in each iteration. For TRPO, the CG damping parameter is set to be . For SDEC, was set to and from a grid search in .
8 Conclusion
We provide a new optimization perspective of the Bellman optimality equation, based on which we develop the smoothed dual embedding control algorithm for the policy optimization problem in reinforcement learning. The algorithm is provably convergent with nonlinear function approximators using off-policy samples by solving the Bellman optimality equations. We also provide a PAC-learning bound to characterize the sample complexity based on one single off-policy sample path. A preliminary empirical study shows that the proposed algorithm achieves performance comparable to or even better than the state of the art on MuJoCo tasks.
Acknowledgments
Part of this work was done during BD’s internship at Microsoft Research, Redmond. Part of the work was done when LL was with Microsoft Research, Redmond. LS is supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, NSF CNS-1704701, ONR N00014-15-1-2340, Intel ISTC, NVIDIA and Amazon AWS.
References
 Antos et al. (2008) Antos, A., Szepesvári, C., and Munos, R. (2008). Learning nearoptimal policies with bellmanresidual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129.
 Asadi and Littman (2016) Asadi, K. and Littman, M. L. (2016). A new softmax operator for reinforcement learning. CoRR, abs/1612.05628.
 Baird (1995) Baird, L. (1995). Residual algorithms: reinforcement learning with function approximation. In Proc. Intl. Conf. Machine Learning, pages 30–37. Morgan Kaufmann.
 Bertsekas and Tsitsiklis (1996) Bertsekas, D. P. and Tsitsiklis, J. N. (1996). NeuroDynamic Programming. Athena Scientific.
 Boyan and Moore (1995) Boyan, J. A. and Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems 7 (NIPS94), pages 369–376.
 Boyd and Vandenberghe (2004) Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, Cambridge, England.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). Openai gym. arXiv preprint arXiv:1606.01540.
 Carrasco and Chen (2002) Carrasco, M. and Chen, X. (2002). Mixing and moment properties of various garch and stochastic volatility models. Econometric Theory, 18(1):17–39.
 Chen and Wang (2016) Chen, Y. and Wang, M. (2016). Stochastic primaldual methods and sample complexity of reinforcement learning. arXiv preprint arXiv:1612.02516.
 Dai et al. (2016) Dai, B., He, N., Pan, Y., Boots, B., and Song, L. (2016). Learning from conditional distributions via dual kernel embeddings. CoRR, abs/1607.04579.
 Dai et al. (2014) Dai, B., Xie, B., He, N., Liang, Y., Raj, A., Balcan, M.F. F., and Song, L. (2014). Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, pages 3041–3049.
 Du et al. (2017) Du, S. S., Chen, J., Li, L., Xiao, L., and Zhou, D. (2017). Stochastic variance reduction methods for policy evaluation. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1049–1058.
 Fox et al. (2015) Fox, R., Pakman, A., and Tishby, N. (2015). Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562.
 Ghadimi and Lan (2013) Ghadimi, S. and Lan, G. (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368.
 Gordon (1995) Gordon, G. J. (1995). Stable function approximation in dynamic programming. In Proceedings of the Twelfth International Conference on Machine Learning (ICML-95), pages 261–268.
 Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165.
 Haussler (1995) Haussler, D. (1995). Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217–232.
 Henderson et al. (2017) Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. (2017). Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560.
 Kakade (2002) Kakade, S. (2002). A natural policy gradient. pages 1531–1538. MIT Press.
 Liu et al. (2015) Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., and Petrik, M. (2015). Finite-sample analysis of proximal gradient TD algorithms. In Uncertainty in Artificial Intelligence (UAI). AUAI Press.
 Liu et al. (2017) Liu, Y., Ramachandran, P., Liu, Q., and Peng, J. (2017). Stein variational policy gradient. arXiv preprint arXiv:1704.02399.
 Macua et al. (2015) Macua, S. V., Chen, J., Zazo, S., and Sayed, A. H. (2015). Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control, 60(5):1260–1274.
 Maei (2011) Maei, H. R. (2011). Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta, Edmonton, Alberta, Canada.
 Mahadevan et al. (2014) Mahadevan, S., Liu, B., Thomas, P. S., Dabney, W., Giguere, S., Jacek, N., Gemp, I., and Liu, J. (2014). Proximal reinforcement learning: A new theory of sequential decision making in primal-dual spaces. CoRR, abs/1405.6757.
 Nachum et al. (2017a) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017a). Bridging the gap between value and policy based reinforcement learning. arXiv preprint arXiv:1702.08892.
 Nachum et al. (2017b) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017b). Trust-PCL: An off-policy trust region method for continuous control. CoRR, abs/1707.01891.
 Nemirovski et al. (2009) Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609.
 Nesterov (2005) Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152.
 Neu et al. (2017) Neu, G., Jonsson, A., and Gómez, V. (2017). A unified view of entropy-regularized Markov decision processes. CoRR, abs/1705.07798.
 Puterman (2014) Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
 Rajeswaran et al. (2017) Rajeswaran, A., Lowrey, K., Todorov, E., and Kakade, S. (2017). Towards generalization and simplicity in continuous control. arXiv preprint arXiv:1703.02660.
 Rawlik et al. (2012) Rawlik, K., Toussaint, M., and Vijayakumar, S. (2012). On stochastic optimal control and reinforcement learning by approximate inference. In Robotics: Science and Systems.
 Rubin et al. (2012) Rubin, J., Shamir, O., and Tishby, N. (2012). Trading value and information in MDPs. Decision Making with Imperfect Decision Makers, pages 57–74.
 Rummery and Niranjan (1994) Rummery, G. A. and Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department.
 Schulman et al. (2017a) Schulman, J., Abbeel, P., and Chen, X. (2017a). Equivalence between policy gradients and soft Q-learning. CoRR, abs/1704.06440.
 Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. (2015). Trust region policy optimization. In ICML, pages 1889–1897.
 Schulman et al. (2017b) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017b). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
 Sutton (1988) Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44.
 Sutton (1996) Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8 (NIPS-95), pages 1038–1044.
 Sutton and Barto (1998) Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge.
 Sutton et al. (2009) Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 993–1000.
 Todorov (2007) Todorov, E. (2007). Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pages 1369–1376.
 Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE.
 Tsitsiklis and Van Roy (1997) Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690.
 Wang (2017) Wang, M. (2017). Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear Running Time. ArXiv e-prints.
 Watkins (1989) Watkins, C. J. (1989). Learning from Delayed Rewards. PhD thesis, King’s College, University of Cambridge, UK.
 Williams (1992) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.
Appendix
Appendix A Properties of Smoothed Bellman Optimality Equation
In this section, we provide the details of the proofs for the properties of the smoothed Bellman optimality equation.
After applying the smoothing technique (Nesterov, 2005), we obtain a new, smoothed Bellman operator, which is a contraction. This property guarantees that the solution is unique. Specifically,
Proposition 1 (Uniqueness)
The smoothed Bellman operator is a contraction; therefore, the smoothed Bellman optimality equation (5) has a unique solution.
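The contraction claim can be checked numerically on a small example. The sketch below assumes the smoothed operator takes the entropy-regularized log-sum-exp form $(\mathcal{T}_\lambda V)(s) = \lambda \log \sum_a \exp\big((R(s,a) + \gamma\,\mathbb{E}_{s'|s,a}[V(s')])/\lambda\big)$ on a finite MDP; the symbol $\mathcal{T}_\lambda$, the random MDP, and all function names are illustrative choices, not taken from the paper.

```python
import math
import random

def logsumexp(xs):
    # Numerically stable log(sum(exp(x))) for a list of floats.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def smoothed_bellman(V, R, P, gamma, lam):
    # One application of the assumed smoothed operator on a tabular MDP.
    n_states = len(V)
    out = []
    for s in range(n_states):
        q = [R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_states))
             for a in range(len(R[s]))]
        out.append(lam * logsumexp([x / lam for x in q]))
    return out

def random_mdp(n_states, n_actions, rng):
    # Hypothetical test MDP: uniform rewards, random transition rows.
    R = [[rng.uniform(-1, 1) for _ in range(n_actions)] for _ in range(n_states)]
    P = []
    for _ in range(n_states):
        rows = []
        for _ in range(n_actions):
            w = [rng.random() + 1e-3 for _ in range(n_states)]
            z = sum(w)
            rows.append([x / z for x in w])
        P.append(rows)
    return R, P

def sup_dist(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

rng = random.Random(0)
gamma, lam = 0.9, 0.5
R, P = random_mdp(4, 3, rng)

# Contraction: ||T V1 - T V2||_inf <= gamma * ||V1 - V2||_inf.
V1 = [rng.uniform(-5, 5) for _ in range(4)]
V2 = [rng.uniform(-5, 5) for _ in range(4)]
ratio = sup_dist(smoothed_bellman(V1, R, P, gamma, lam),
                 smoothed_bellman(V2, R, P, gamma, lam)) / sup_dist(V1, V2)

# Fixed-point iteration from zero converges to the unique solution.
V = [0.0] * 4
for _ in range(500):
    V = smoothed_bellman(V, R, P, gamma, lam)
fixed_point_residual = sup_dist(smoothed_bellman(V, R, P, gamma, lam), V)
```

Under these assumptions, `ratio` never exceeds `gamma`, because log-sum-exp is a non-expansion in the sup norm, and `fixed_point_residual` shrinks to floating-point precision.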
Proposition 2 (Smoothing bias)
Proof We denote .
which implies the conclusion.
The smoothed Bellman optimality equation uses a log-sum-exp operator to approximate the max operator, which increases the nonlinearity of the equation. We further characterize the solution of the smoothed Bellman optimality equation through the temporal consistency conditions.
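To make the approximation concrete, the following sketch compares a scaled log-sum-exp against the hard max. It assumes a finite action set and a Shannon-entropy smoother, in which case the gap lies in $[0, \lambda \log n]$ for $n$ actions; the vector `q` and the function name `smoothed_max` are hypothetical.

```python
import math

def smoothed_max(q, lam):
    # lam * log(sum(exp(q_i / lam))), computed stably around max(q).
    m = max(q)
    return m + lam * math.log(sum(math.exp((x - m) / lam) for x in q))

q = [1.0, 0.3, -2.0, 0.9]  # hypothetical action values for one state
gaps = {}
for lam in (1.0, 0.1, 0.01):
    # The gap to the hard max is non-negative and at most lam * log(len(q)).
    gaps[lam] = smoothed_max(q, lam) - max(q)
```

The gap vanishes as `lam` decreases. Propagating this per-state gap through a $\gamma$-contraction suggests a smoothing bias of order $\lambda \log n / (1-\gamma)$ in this setting, which is offered only as a plausible reading of Proposition 2, not as its exact statement.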
Theorem 3 (Temporal consistency)
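Let $V^*$ be the fixed point of (5) and $\pi^*$ be the corresponding policy that attains the maximum on the right-hand side of (5). Then $(V^*, \pi^*)$ is the unique solution that satisfies
$$V(s) = R(s,a) + \gamma\,\mathbb{E}_{s'|s,a}\left[V(s')\right] - \lambda \log \pi(a|s), \quad \forall (s,a). \tag{17}$$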
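The temporal consistency condition can be illustrated on a small tabular MDP. The sketch assumes the condition has the entropy-regularized form $V(s) = R(s,a) + \gamma\,\mathbb{E}_{s'|s,a}[V(s')] - \lambda \log \pi(a|s)$ for every state-action pair, with $\pi$ the softmax policy induced by $V$; the MDP, seed, and helper names are illustrative.

```python
import math
import random

def q_values(V, R, P, gamma):
    n = len(V)
    return [[R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
             for a in range(len(R[s]))] for s in range(n)]

def softmax_policy(q, lam):
    # pi(a|s) proportional to exp(Q(s,a) / lam), computed stably.
    m = max(q)
    w = [math.exp((x - m) / lam) for x in q]
    z = sum(w)
    return [x / z for x in w]

def soft_backup(V, R, P, gamma, lam):
    out = []
    for q in q_values(V, R, P, gamma):
        m = max(q)
        out.append(m + lam * math.log(sum(math.exp((x - m) / lam) for x in q)))
    return out

def consistency_residual(V, R, P, gamma, lam):
    # Largest violation of the assumed condition over all (s, a).
    Q = q_values(V, R, P, gamma)
    worst = 0.0
    for s, q in enumerate(Q):
        pi = softmax_policy(q, lam)
        for a, qa in enumerate(q):
            worst = max(worst, abs(qa - lam * math.log(pi[a]) - V[s]))
    return worst

rng = random.Random(1)
n_states, n_actions, gamma, lam = 3, 2, 0.9, 0.5
R = [[rng.uniform(-1, 1) for _ in range(n_actions)] for _ in range(n_states)]
P = []
for _ in range(n_states):
    rows = []
    for _ in range(n_actions):
        w = [rng.random() + 1e-3 for _ in range(n_states)]
        z = sum(w)
        rows.append([x / z for x in w])
    P.append(rows)

# Solve the smoothed equation by fixed-point iteration.
V_star = [0.0] * n_states
for _ in range(1000):
    V_star = soft_backup(V_star, R, P, gamma, lam)

at_fixed_point = consistency_residual(V_star, R, P, gamma, lam)
away_from_it = consistency_residual([100.0] * n_states, R, P, gamma, lam)
```

At the fixed point the residual is zero up to rounding for every action simultaneously, whereas an arbitrary value function violates the condition. This is the property that lets the condition be enforced on individual sampled transitions rather than through a max over actions.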
Proof Necessity. Given the definition of , denote
by the convexity of , we have a unique as
which implies
Then, we can rewrite the smoothed Bellman optimality equation Eq. (5) as
(18)  
This equation is clearly a necessary condition, i.e., the optimal value function and policy satisfy it. In fact, we can also show that it is sufficient,
Appendix B Variance Cancellation via the Saddle Point Formulation
The second term in the saddle point formulation (11) cancels the variance term. Specifically,
Theorem 8
Given and , we have
(19) 
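The mechanism behind this cancellation can be sketched with exact expectations over a small discrete distribution. The sketch assumes that, for a fixed state-action pair, the objective has the form $\mathbb{E}[\delta^2] - \mathbb{E}[(\delta - \nu)^2]$, where $\delta$ is the one-sample temporal consistency error (random through $s'$) and $\nu$ is the dual variable; the exact weighting in (11) is not reproduced here, and the numbers are hypothetical.

```python
# delta takes value d[i] with probability p[i]; randomness comes from s'.
p = [0.2, 0.5, 0.3]
d = [1.0, -0.5, 2.0]

def expect(f):
    return sum(w * f(x) for w, x in zip(p, d))

mean = expect(lambda x: x)
variance = expect(lambda x: (x - mean) ** 2)
second_moment = expect(lambda x: x ** 2)

def saddle_objective(nu):
    # First term: mean squared error. Second term: dual fit to delta.
    return second_moment - expect(lambda x: (x - nu) ** 2)

# The objective is a concave quadratic in nu, maximized at nu = E[delta].
best = saddle_objective(mean)
# The subtracted term at the optimum equals Var(delta), so what remains
# is the squared conditional mean (E[delta])^2.
cancelled = second_moment - best
```

Here `best` equals `mean ** 2` and `cancelled` equals `variance`: maximizing over the dual variable removes exactly the variance that makes the naive squared error a biased estimate of the squared conditional mean.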
Proof