Wasserstein Reinforcement Learning
Abstract
We propose behavior-driven optimization via Wasserstein distances (WDs) to improve several classes of state-of-the-art reinforcement learning (RL) algorithms. We show that WD regularizers acting on appropriate policy embeddings efficiently incorporate behavioral characteristics into policy optimization. We demonstrate that they improve Evolution Strategy methods by encouraging more efficient exploration, can be applied in imitation learning and can speed up training of Trust Region Policy Optimization methods. Since the exact computation of WDs is expensive, we develop approximate algorithms based on the combination of different methods: the dual formulation of the optimal transport problem, alternating optimization and random feature maps, to effectively replace exact WD computations in the RL tasks considered. We provide theoretical analysis of our algorithms and exhaustive empirical evaluation in a variety of RL settings.
Aldo Pacchiano† (UC Berkeley) pacchiano@berkeley.edu; Jack Parker-Holder† (Columbia University) jh3764@columbia.edu; Yunhao Tang† (Columbia University) yt2541@columbia.edu; Anna Choromanska (New York University) ac5455@nyu.edu; Krzysztof Choromanski (Google Brain Robotics) kchoro@google.com; Michael Jordan (UC Berkeley) jordan@cs.berkeley.edu. († Equal contribution.)
Preprint. Under review.
1 Introduction
One of the key challenges in reinforcement learning (RL) is to efficiently incorporate the behavioral characteristics of learned policies into optimization algorithms [15, 17, 7]. Its importance comes from the fact that the natural policy optimization landscape is highly non-Euclidean: two policies with similar vectorized representations may lead to drastically different trajectories and rewards. It is thus reasonable to seek more expressive policy encodings to represent this information. A natural approach is to identify policies with probabilistic distributions over the spaces of trajectories they generate, and ultimately obtain the corresponding pushforward distributions by applying certain embeddings to these trajectories, focusing on different behavioral aspects (state visit frequencies, distributions over rewards, etc.). Dissimilarity between different policies can be interpreted in this framework as a distance in a certain metric space defined on the manifold of probabilistic measures.
With this in mind, we propose to use Wasserstein distances (WDs, [34]) defined on sets of probabilistic distributions as regularization terms in a wide range of RL algorithms to efficiently incorporate behavioral characteristics of policies into the learning algorithms. As we demonstrate, this information can lead to dramatic improvements in a variety of settings. In particular, we show that WD regularizers acting on appropriate policy embeddings improve standard Evolution Strategy (ES, [26]) techniques by encouraging more efficient exploration, can be applied in imitation learning and to speed up training of Trust Region Policy Optimization (TRPO, [27]) methods. Since the exact computation of WDs is expensive, we develop approximate algorithms based on the combination of different methods: dual formulation of the optimal transport problem, alternating optimization techniques and random feature maps [23], to effectively replace exact WD computation in the RL tasks considered. We provide theoretical analysis of our algorithms and an exhaustive empirical evaluation across a wide range of RL tasks.
WDs are increasingly used in a plethora of machine learning applications such as: generative adversarial networks (GANs) [1], retrieving images in computer vision [3], document classification [14], graph/network classification [21], autoencoders [32], time series and dynamical systems analysis, and more. As opposed to asymmetric Kullback-Leibler (KL) divergences, they define valid metrics. It was shown in [1] that, in contrast to other metrics (such as those based on Jensen-Shannon (JS) divergences), they lead to improved GAN training methods, where the gradient does not vanish when the generator reaches a region over which the discriminator has high confidence. Therefore, they alleviate a common problem in GAN training: the need for careful balancing between training of the discriminator and the generator. Finally, they are much more flexible than other constructions. For example, WDs are well defined in the important hybrid continuous-discrete distribution setting, whereas many of their counterparts requiring finiteness conditions (such as the KL or JS divergence) are not.
Related Work:
Here we summarize some prior work on applying the WD metric in RL. Wasserstein Gradient Flows (WGFs) were recently introduced [35] for finding efficient RL policies. This approach casts policy optimization as a gradient descent flow on the manifold of corresponding probability measures, where geodesic lengths are given as second-order WDs. We note that computing WGFs is a non-trivial task. In [35] particle approximation methods are used, and we show in Section 5 that RL algorithms using these techniques are substantially slower than our methods. The WD was also proposed to replace KL terms [24] in standard Trust Region Policy Optimization. This is a very special case of our more generic framework (see: Section 3.2). In [24] it is suggested to solve the corresponding RL problems via Fokker-Planck equations and diffusion processes, yet no empirical evidence is provided. We propose general practical algorithms and present exhaustive empirical evaluation.
2 Preliminaries: Optimal Transport Problem and RL
Let $\mu$ and $\nu$ be distributions over domains $\mathcal{X}$ and $\mathcal{Y}$, respectively, and let $C:\mathcal{X}\times\mathcal{Y}\rightarrow\mathbb{R}$ be a cost function. For $\gamma \geq 0$, define:

$$\mathrm{WD}_{\gamma}(\mu,\nu)=\min_{\pi\in\Pi(\mu,\nu)}\int_{\mathcal{X}\times\mathcal{Y}}C(x,y)\,d\pi(x,y)-\gamma H(\pi), \qquad (1)$$

where $\pi$ is a joint distribution over $\mathcal{X}\times\mathcal{Y}$ that has marginals $\mu$ and $\nu$, and $H(\pi)$ is the entropy of $\pi$. We call $\mathrm{WD}_{\gamma}$ the smoothed Wasserstein distance.
If we omit the entropy expression, the remaining term in Eq. 1 (the Wasserstein distance) can be interpreted as the minimal cost of transporting the mass of some material from a specific set of locations (corresponding to the atoms of a probability distribution in the discrete case) to another set of locations, where the cost of transporting a unit of mass from $x$ to $y$ is given by $C(x,y)$. For this reason the Wasserstein metric is sometimes called the Earth mover's distance (EMD) and the corresponding optimization problem is referred to as the optimal transport problem (OTP).
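To make the transport interpretation concrete, here is a minimal illustration (ours, not part of the paper's method): in one dimension, with two equal-size empirical samples, the optimal transport plan simply matches sorted points, so the EMD reduces to a mean absolute difference of order statistics.

```python
def wasserstein_1d(xs, ys):
    """W1 between two equal-size empirical samples on the real line.

    In 1-D the optimal transport plan matches sorted points, so the EMD
    is the mean absolute difference of the order statistics.
    """
    assert len(xs) == len(ys)
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)

print(wasserstein_1d([0.0, 2.0], [1.0, 3.0]))  # each point moves by 1 -> 1.0
```

In higher dimensions no such closed form exists, which is what motivates the approximate algorithms developed in this paper.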
2.1 Optimal Transport Problem: Dual Formulation
We wish to use the smoothed Wasserstein distance to derive efficient regularizers for RL algorithms. To arrive at this goal we first need to consider the dual form of Eq. 1.
We begin with the discrete case, where we identify $\mu$ and $\nu$ with points in simplices $\Delta_{n}$ and $\Delta_{m}$ respectively. We let $\mathbf{p}\in\Delta_{n}$ and $\mathbf{q}\in\Delta_{m}$ denote the corresponding vectors of probabilities. Let $\pi\in\mathbb{R}^{n\times m}_{\geq 0}$, and define the entropy as $H(\pi)=-\sum_{i,j}\pi_{i,j}\log\pi_{i,j}$. We denote by $\mathbf{C}\in\mathbb{R}^{n\times m}$ the cost matrix defined as $\mathbf{C}_{i,j}=C(x_{i},y_{j})$. The optimization problem in Eq. 1 reduces to:

$$\min_{\pi\geq 0,\;\pi\mathbf{1}=\mathbf{p},\;\pi^{\top}\mathbf{1}=\mathbf{q}}\;\langle\pi,\mathbf{C}\rangle-\gamma H(\pi).$$

The Lagrangian for this optimization problem is defined as follows:

$$\mathcal{L}(\pi,\lambda_{\mathbf{p}},\lambda_{\mathbf{q}})=\langle\pi,\mathbf{C}\rangle-\gamma H(\pi)+\lambda_{\mathbf{p}}^{\top}(\pi\mathbf{1}-\mathbf{p})+\lambda_{\mathbf{q}}^{\top}(\pi^{\top}\mathbf{1}-\mathbf{q}),$$

where $\lambda_{\mathbf{p}}\in\mathbb{R}^{n}$ and $\lambda_{\mathbf{q}}\in\mathbb{R}^{m}$ stand for the vectors of dual variables. Since the objective in Eq. 1 is convex (even strongly convex when $\gamma>0$), strong duality holds:

$$\mathrm{WD}_{\gamma}(\mu,\nu)=\max_{\lambda_{\mathbf{p}},\lambda_{\mathbf{q}}}\min_{\pi\geq 0}\mathcal{L}(\pi,\lambda_{\mathbf{p}},\lambda_{\mathbf{q}}),$$

and therefore:

$$\mathrm{WD}_{\gamma}(\mu,\nu)=\max_{\lambda_{\mathbf{p}},\lambda_{\mathbf{q}}}\;-\lambda_{\mathbf{p}}^{\top}\mathbf{p}-\lambda_{\mathbf{q}}^{\top}\mathbf{q}-\Phi_{\gamma}(\lambda_{\mathbf{p}},\lambda_{\mathbf{q}},\mathbf{C}),$$

where $\Phi_{\gamma}(\lambda_{\mathbf{p}},\lambda_{\mathbf{q}},\mathbf{C})=\gamma\sum_{i,j}e^{-\frac{1}{\gamma}(\mathbf{C}_{i,j}+\lambda_{\mathbf{p}}(i)+\lambda_{\mathbf{q}}(j))-1}$ if $\gamma>0$ and $\Phi_{0}(\lambda_{\mathbf{p}},\lambda_{\mathbf{q}},\mathbf{C})=\sum_{i,j}\infty\cdot\mathbb{1}\{\mathbf{C}_{i,j}+\lambda_{\mathbf{p}}(i)+\lambda_{\mathbf{q}}(j)<0\}$ if $\gamma=0$. (The symbol $\mathbb{1}\{\cdot\}$ denotes an indicator function that takes the value $1$ if the condition is satisfied and $0$ otherwise.)
We now turn to the continuous case. Consider metric spaces $\mathcal{X}$ and $\mathcal{Y}$. Let $\mathcal{C}(\mathcal{X})$ denote the space of continuous functions on $\mathcal{X}$ and let $\mathcal{C}(\mathcal{Y})$ denote the space of continuous functions over $\mathcal{Y}$. Let $\mu$ and $\nu$ be (Radon) probability measures over $\mathcal{X}$ and $\mathcal{Y}$, respectively. Let $\Pi(\mu,\nu)$ be the space of couplings (joint distributions) over $\mathcal{X}\times\mathcal{Y}$ having marginal distributions $\mu$ and $\nu$ respectively. Finally, define the Kullback-Leibler (KL) divergence between probability distributions $\rho_{1}$ and $\rho_{2}$ having support $\Omega$ as: $\mathrm{KL}(\rho_{1}\|\rho_{2})=\int_{\Omega}\log\left(\frac{d\rho_{1}}{d\rho_{2}}\right)d\rho_{1}$. Let $C:\mathcal{X}\times\mathcal{Y}\rightarrow\mathbb{R}$ be a cost function, interpreted as the "ground cost" to move a unit of mass from $x$ to $y$. The Kantorovich formulation [13] of optimal transport, augmented with entropic regularization [8], can be written as:
$$\mathrm{WD}_{\gamma}(\mu,\nu)=\inf_{\pi\in\Pi(\mu,\nu)}\int_{\mathcal{X}\times\mathcal{Y}}C(x,y)\,d\pi(x,y)+\gamma\,\mathrm{KL}(\pi\,\|\,\mu\otimes\nu). \qquad (2)$$
When the entropic regularizer is omitted ($\gamma=0$) and the cost function satisfies $C(x,y)=d(x,y)$ for some distance function $d$ over $\mathcal{X}=\mathcal{Y}$, the optimization problem in Eq. 2 defines the Wasserstein distance. If $\gamma>0$, the problem is strongly convex, so that the optimal coupling is unique and the problem can be solved in principle using the Sinkhorn algorithm [28].
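As an illustration of the entropic-regularization route mentioned above, the following is a minimal numpy sketch of Sinkhorn's matrix-scaling iterations [28] for discrete distributions; the function name and fixed iteration count are our choices, not the paper's.

```python
import numpy as np

def sinkhorn(p, q, C, gamma, iters=500):
    """Entropy-regularized OT between discrete distributions p and q with
    cost matrix C, via Sinkhorn's alternating matrix scaling (gamma > 0)."""
    K = np.exp(-C / gamma)              # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(iters):
        v = q / (K.T @ u)               # scale to match column marginals
        u = p / (K @ v)                 # scale to match row marginals
    pi = u[:, None] * K * v[None, :]    # (approximately) optimal coupling
    return float(np.sum(pi * C)), pi

p = q = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0], [1.0, 0.0]])
cost, pi = sinkhorn(p, q, C, gamma=0.05)
print(round(cost, 6))  # identical marginals, zero-cost diagonal: 0.0
```

For small $\gamma$ this kernel is numerically fragile; log-domain stabilization is the usual remedy, omitted here for brevity.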
Using Fenchel duality, and a similar decomposition as in the discrete case, we can obtain the following dual formulation of the problem in Eq. 2:

$$\mathrm{WD}_{\gamma}(\mu,\nu)=\sup_{\lambda_{\mu}\in\mathcal{C}(\mathcal{X}),\,\lambda_{\nu}\in\mathcal{C}(\mathcal{Y})}-\int_{\mathcal{X}}\lambda_{\mu}\,d\mu-\int_{\mathcal{Y}}\lambda_{\nu}\,d\nu-\Phi_{\gamma}(\lambda_{\mu},\lambda_{\nu},C), \qquad (3)$$

where the continuous analogue of the expression $\Phi_{\gamma}$ is defined as:

$$\Phi_{\gamma}(\lambda_{\mu},\lambda_{\nu},C)=\gamma\int_{\mathcal{X}\times\mathcal{Y}}e^{-\frac{1}{\gamma}(C(x,y)+\lambda_{\mu}(x)+\lambda_{\nu}(y))-1}\,d\mu(x)\,d\nu(y). \qquad (4)$$
2.2 Reinforcement Learning Policy Optimization
We turn to the problem of RL policy optimization under Wasserstein regularization. We set the stage by providing a brief overview of relevant RL concepts in this section.
A Markov Decision Process (MDP) is a tuple $(\mathcal{S},\mathcal{A},P,R)$. Here $\mathcal{S}$ and $\mathcal{A}$ stand for the sets of states and actions respectively, such that for $s,s'\in\mathcal{S}$ and $a\in\mathcal{A}$: $P(s'|s,a)$ is the probability that the system/agent transitions from $s$ to $s'$ given action $a$ and $R(s,a,s')$ is a reward obtained by an agent transitioning from $s$ to $s'$ via $a$. A policy $\pi_{\theta}$ is a (possibly randomized) mapping (parameterized by $\theta$) from $\mathcal{S}$ to $\mathcal{A}$. The goal is to optimize the parameters $\theta$ of $\pi_{\theta}$ such that an agent applying it in the environment given by a fixed MDP maximizes total (expected/discounted) reward over a given horizon $T$. In this paper we consider MDPs with finite horizons. In most practical applications the MDP is not known to the learner and in many applications it is accessed through a simulator. The policy is often parameterized by a neural network. Several algorithmic frameworks have been proposed to construct efficient policies. In this paper we consider the following general frameworks.
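For concreteness, a finite-horizon trajectory in the tabular setting described above can be sampled as follows (a hypothetical helper of ours, with `P`, `R` and `policy` stored as dense arrays):

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(P, R, policy, s0, horizon):
    """Sample one finite-horizon trajectory from a tabular MDP.

    P[s, a] is a distribution over next states, R[s, a, s_next] the reward,
    and policy[s] a distribution over actions.
    """
    s, states, actions, rewards = s0, [s0], [], []
    for _ in range(horizon):
        a = int(rng.choice(P.shape[1], p=policy[s]))
        s_next = int(rng.choice(P.shape[0], p=P[s, a]))
        actions.append(a)
        rewards.append(R[s, a, s_next])
        states.append(s_next)
        s = s_next
    return states, actions, rewards

# Deterministic 2-state example: action 1 flips the state, reward 1 for
# landing in state 1; the policy always picks action 1.
P = np.zeros((2, 2, 2))
P[:, 0, :] = np.eye(2)                         # action 0: stay
P[0, 1, 1] = P[1, 1, 0] = 1.0                  # action 1: flip
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0
policy = np.array([[0.0, 1.0], [0.0, 1.0]])
states, actions, rewards = rollout(P, R, policy, s0=0, horizon=4)
print(states, sum(rewards))  # [0, 1, 0, 1, 0] 2.0
```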
Evolution strategy (ES) methods:
The RL problem can be cast as black-box (agnostic) optimization, where the function $F$ to be optimized takes as input the parameters $\theta$ of $\pi_{\theta}$ and outputs the total discounted/expected reward obtained by an agent applying this policy in a given environment. There is a large literature on applying ES methods in challenging RL tasks [26, 11, 16, 6, 25, 5]. ES algorithms approximate the gradient of the Gaussian smoothing of $F$, defined as $F_{\sigma}(\theta)=\mathbb{E}_{\mathbf{g}\sim\mathcal{N}(0,\mathbf{I})}[F(\theta+\sigma\mathbf{g})]$ (where $\sigma>0$ stands for the smoothing parameter), using Monte Carlo (MC) methods, since the gradient is given as an expectation:

$$\nabla F_{\sigma}(\theta)=\frac{1}{\sigma}\mathbb{E}_{\mathbf{g}\sim\mathcal{N}(0,\mathbf{I})}[F(\theta+\sigma\mathbf{g})\,\mathbf{g}]. \qquad (5)$$
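The MC estimator of Eq. 5 can be sketched as follows; the antithetic (mirrored) sampling shown here is a common variance-reduction choice, and the toy quadratic objective is ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def es_gradient(F, theta, sigma=0.1, num_samples=1000):
    """Monte Carlo estimate of grad F_sigma(theta), the gradient of the
    Gaussian smoothing E_g[F(theta + sigma * g)], with antithetic pairs."""
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        g = rng.standard_normal(theta.shape)
        # Antithetic (mirrored) sampling reduces estimator variance.
        grad += (F(theta + sigma * g) - F(theta - sigma * g)) / (2.0 * sigma) * g
    return grad / num_samples

# Toy check: for F(x) = -||x||^2 the smoothed gradient equals -2 * theta.
F = lambda x: -float(np.dot(x, x))
theta = np.array([1.0, -0.5])
print(es_gradient(F, theta))  # close to [-2.0, 1.0]
```

In RL, each evaluation of `F` would be one (or several) environment rollouts with the perturbed policy parameters.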
Policy gradient methods:
These techniques [30] rely on the formula for gradients of the total expected reward $\mathbb{E}_{\tau\sim\pi_{\theta}}[R(\tau)]$ over trajectories $\tau$ obtained by agents applying policies $\pi_{\theta}$, with the use of the underlying structure of the problem (even though assumptions can in fact be relaxed). In practice, if the expectation is taken with respect to perturbations of a base policy, these methods lead to ES. Another important case is when policies are randomized (e.g. Gaussian policies), and the expectation is taken with respect to random perturbations affecting the final actions proposed by policies. In the latter setting, for $T$ denoting the horizon, the gradient is given by:

$$\nabla_{\theta}\mathbb{E}_{\tau\sim\pi_{\theta}}[R(\tau)]=\mathbb{E}_{\tau\sim\pi_{\theta}}\left[\sum_{t=0}^{T-1}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\,R_{t}(\tau)\right], \qquad (6)$$

where $R_{t}(\tau)=\sum_{i=t}^{T-1}r_{i}$ is the reward-to-go. As before, various MC methods are used to approximate the above gradients. An important subclass of these methods is constituted by actor-critic (AC) algorithms, where the reward-to-go is approximated by one neural network (the critic) and another neural network (the actor) encodes a policy that uses the critic's feedback to improve itself via a gradient signal.
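A minimal instance of Eq. 6, ours for illustration, is a one-step MDP (a two-armed bandit), where the reward-to-go is just the immediate reward and the policy is a softmax over two logits:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One-step MDP (two-armed bandit) with deterministic rewards per action.
arm_rewards = np.array([0.0, 1.0])
theta = np.zeros(2)                    # logits of a softmax policy

for _ in range(2000):
    p = softmax(theta)
    a = int(rng.choice(2, p=p))
    reward_to_go = arm_rewards[a]      # horizon 1: just the immediate reward
    grad_log_pi = -p                   # gradient of log softmax ...
    grad_log_pi[a] += 1.0              # ... at the chosen arm
    theta += 0.1 * grad_log_pi * reward_to_go

print(softmax(theta))  # probability mass concentrates on the better arm
```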
Trust Region Policy Optimization (TRPO) methods:
At every iteration of the algorithm the goal is to maximize the difference between the reward obtained by the new policy $\pi_{\theta}$ and the current policy $\pi_{\theta_{\mathrm{old}}}$. This is measured by the so-called advantage function $A_{\theta_{\mathrm{old}}}$ [27] that is estimated using importance sampling. The corresponding loss term is defined as $L(\theta)=\mathbb{E}_{s\sim\rho_{\theta_{\mathrm{old}}},\,a\sim\pi_{\theta_{\mathrm{old}}}}\left[\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{\mathrm{old}}}(a|s)}A_{\theta_{\mathrm{old}}}(s,a)\right]$, where $\rho_{\theta_{\mathrm{old}}}$ stands for the (discounted) state visitation distribution. The final form of the optimization problem is (for some fixed $C>0$):

$$\max_{\theta}\;L(\theta)-C\cdot\mathrm{KL}^{\max}(\theta_{\mathrm{old}},\theta), \qquad (7)$$

or: $\max_{\theta}L(\theta)$ such that $\mathrm{KL}(\pi_{\theta_{\mathrm{old}}},\pi_{\theta})\leq\delta$ for some $\delta>0$. The latter condition defines a trust region, where the parameter updates are sensibly constrained (see also Subsection 5.2 in Section 5).
Imitation learning:
In this setting an agent has access to expert demonstrations, usually given by the expert's policy trajectories (and potentially an additional reinforcement signal), to learn efficient policies. Approaches here include behavioral cloning [33], which casts the problem as supervised learning over state-action pairs from the expert's trajectories, and inverse reinforcement learning, which aims to learn the reward function [20] (see also Subsection 5.2 in Section 5).
3 Wasserstein Reinforcement Learning
We are ready to explain how the WD metric can be applied to improve RL policy optimization.
Let $\Gamma$ be the set of possible trajectories, enriched by the sequences of partial rewards, under some policy. The undiscounted reward function satisfies $R(\tau)=\sum_{t=0}^{T-1}r_{t}$, where $\tau=((s_{0},a_{0},r_{0}),\ldots,(s_{T-1},a_{T-1},r_{T-1}),s_{T})$. Denote by $\Psi:\Gamma\rightarrow\mathcal{E}$ an embedding map of trajectories into a space $\mathcal{E}$ equipped with a metric (or cost function) $C$. Given a policy $\pi$, we denote by $\mathbb{P}_{\pi}$ the distribution it induces over the space of trajectories and by $\Psi(\mathbb{P}_{\pi})$ the corresponding pushforward distribution on $\mathcal{E}$ induced by $\Psi$.

We measure the distance between two trajectories quantitatively by a cost function for their corresponding embeddings. For a given cost function $C$, embedding function $\Psi$ and parameter $\gamma\geq 0$, we propose to measure the distance between two policies $\pi_{1}$ and $\pi_{2}$ by the smoothed Wasserstein distance $\mathrm{WD}_{\gamma}(\Psi(\mathbb{P}_{\pi_{1}}),\Psi(\mathbb{P}_{\pi_{2}}))$ between the corresponding pushforward distributions $\Psi(\mathbb{P}_{\pi_{1}})$ and $\Psi(\mathbb{P}_{\pi_{2}})$, parameterized by $\gamma$. We propose to use the following embeddings:

State-based: the final state $\Psi(\tau)=s_{T}$, the visiting frequency of a fixed state $s$, or the frequency vector of visited states $\Psi(\tau)=\frac{1}{T}\sum_{t=0}^{T-1}e_{s_{t}}$ (where $e_{s}$ is the one-hot vector corresponding to state $s$); see also Section 5.1.

Action-based: the concatenation of actions $\Psi(\tau)=(a_{0},\ldots,a_{T-1})$; see also Section 5.2.

Reward-based: the total reward $\Psi(\tau)=R(\tau)$, or the reward-to-go vector $\Psi(\tau)=\sum_{t=0}^{T-1}R_{t}(\tau)\,e_{t}$ (where $e_{t}$ is a one-hot vector corresponding to timestep $t$, with dimensions indexed from $0$ to $T-1$); see also Section 5.2.
For instance, the frequency vector of visited states captures the frequency with which different states are visited under policy $\pi$. Note that some of the above embeddings are valid only in the tabular case (finite $\mathcal{S}$ and $\mathcal{A}$) while others are universal. Next we provide conditions on $\Psi$ under which a vanishing WD ($\gamma=0$) between $\Psi(\mathbb{P}_{\pi_{1}})$ and $\Psi(\mathbb{P}_{\pi_{2}})$ implies that policies $\pi_{1}$ and $\pi_{2}$ are equal.
Lemma 3.1.
Let $\mathcal{S}$ and $\mathcal{A}$ be finite sets, the MDP be episodic (i.e. of finite horizon $T$), and $\Psi(\tau)=\frac{1}{T}\sum_{t=0}^{T-1}e_{(s_{t},a_{t})}$, with $e_{(s,a)}$ the indicator vector for the state-action pair $(s,a)$. Let $C(x,y)=\|x-y\|_{2}$ for $x,y\in\mathcal{E}$. If $\mathrm{WD}_{0}(\Psi(\mathbb{P}_{\pi_{1}}),\Psi(\mathbb{P}_{\pi_{2}}))=0$ and every state in $\mathcal{S}$ is visited with positive probability, then $\pi_{1}=\pi_{2}$.
It is not our goal to provide optimal embeddings and in fact, as we see in Section 5, the choice of the right embedding depends on the particular RL setup. In our experiments, we use embeddings from all three classes defined above. We argue that the Wasserstein metric provides a natural framework where those embeddings can be used to differentiate between qualitatively different policies.
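As a concrete illustration of the state-based class above, the frequency vector of visited states can be computed from a trajectory's state sequence as follows (a sketch; the helper name is ours):

```python
import numpy as np

def state_frequency_embedding(states, num_states):
    """Frequency vector of visited states: the average of one-hot
    encodings e_s over a trajectory's state sequence."""
    psi = np.zeros(num_states)
    for s in states:
        psi[s] += 1.0
    return psi / len(states)

emb = state_frequency_embedding([0, 1, 1, 3], num_states=4)
print(emb)  # visit frequencies 0.25, 0.5, 0.0, 0.25
```

Embeddings of sampled trajectories like this one define the empirical pushforward distributions whose WD is then estimated.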
3.1 Max-Max RL Policy Optimization with Wasserstein regularizers
Here we propose to use the WD on the embeddings of the policies to improve exploration profiles of existing RL algorithms. Consider the following problem $\max_{\theta}F_{\mathrm{WD}}(\theta)$, where:

$$F_{\mathrm{WD}}(\theta)=F(\theta)+\beta\,\mathrm{WD}_{\gamma}(\Psi(\mathbb{P}_{\pi_{\theta}}),\Psi(\mathbb{P}_{\pi'})), \qquad (8)$$

where $\beta>0$, $\gamma\geq 0$, $\pi_{\theta}$ is a policy parameterized by $\theta$, $\pi'$ is a policy sampled from a buffer of policies seen so far in the optimization and $\Psi$ is some fixed embedding. The WD term enforces newly constructed policies to be behaviorally different from the previous ones (improving exploration) while the term $F(\theta)$ drives the optimization to achieve its main objective, i.e. to maximize the obtained reward. This application of our Wasserstein regularizers falls into the category of so-called novelty search techniques, see for instance [7, 22], where several existing approaches such as [7] can be seen as special instantiations of our method.
By the dual formulation from Eq. 3, the regularizer can be rewritten as:

$$\mathrm{WD}_{\gamma}(\Psi(\mathbb{P}_{\pi_{\theta}}),\Psi(\mathbb{P}_{\pi'}))=\max_{\lambda_{1},\lambda_{2}}-\mathbb{E}_{\tau\sim\pi_{\theta}}[\lambda_{1}(\Psi(\tau))]-\mathbb{E}_{\tau'\sim\pi'}[\lambda_{2}(\Psi(\tau'))]-\Phi_{\gamma}(\lambda_{1},\lambda_{2},C), \qquad (9)$$

where $\tau\sim\pi$ stands for a trajectory sampled from policy $\pi$ and $\Phi_{\gamma}$ is the function defined in Equation 4. The optimization problem becomes $\max_{\theta}\max_{\lambda_{1},\lambda_{2}}F_{\mathrm{WD}}(\theta,\lambda_{1},\lambda_{2})$, where:

$$F_{\mathrm{WD}}(\theta,\lambda_{1},\lambda_{2})=F(\theta)+\beta\left(-\mathbb{E}_{\tau\sim\pi_{\theta}}[\lambda_{1}(\Psi(\tau))]-\mathbb{E}_{\tau'\sim\pi'}[\lambda_{2}(\Psi(\tau'))]-\Phi_{\gamma}(\lambda_{1},\lambda_{2},C)\right). \qquad (10)$$
To make the above problem tractable, we combine different techniques. The first step is to replace continuous functions over $\mathcal{E}$ with functions from Reproducing Kernel Hilbert Spaces (RKHSs) corresponding to universal kernels [18]. We are motivated by the fact that those classes of functions are dense in the space of continuous functions on $\mathcal{E}$. We choose the Gaussian kernel and approximate it using random Fourier feature maps [23]. Thus we consider functions of the following form: $\lambda(x)=\mathbf{p}^{\top}\phi(x)$, where $\phi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{m}$ is a random feature map with $m$ standing for the number of random features and $\mathbf{p}\in\mathbb{R}^{m}$. For the Gaussian kernel, $\phi$ is defined as follows: $\phi(z)=\sqrt{\frac{2}{m}}\cos(\mathbf{G}z+\mathbf{b})$ (with $\cos$ applied elementwise) for $z\in\mathbb{R}^{d}$, where $\mathbf{G}\in\mathbb{R}^{m\times d}$ is Gaussian with iid entries taken from $\mathcal{N}(0,1)$ and where $\mathbf{b}=(b_{1},\ldots,b_{m})^{\top}$ with iid $b_{i}$s such that $b_{i}\sim\mathrm{Unif}[0,2\pi]$. By optimizing over $\lambda_{1},\lambda_{2}$, from now on we mean optimizing over the corresponding dual vectors $\mathbf{p}_{1},\mathbf{p}_{2}$ associated with them. Thus we replace $\Phi_{\gamma}(\lambda_{1},\lambda_{2},C)$ with the corresponding function $\Phi_{\gamma}(\mathbf{p}_{1},\mathbf{p}_{2},C)$.
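A minimal sketch of the random Fourier feature construction described above (following [23]; the helper name and the test values are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_rff(dim, num_features, sigma=1.0):
    """Random Fourier feature map phi with phi(x) . phi(y) approximating
    the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    G = rng.standard_normal((num_features, dim)) / sigma   # frequency matrix
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)   # random phases
    return lambda x: np.sqrt(2.0 / num_features) * np.cos(G @ x + b)

phi = gaussian_rff(dim=3, num_features=20000)
x = np.array([0.1, 0.2, -0.3])
y = np.array([0.0, 0.5, 0.1])
approx = float(phi(x) @ phi(y))
exact = float(np.exp(-np.sum((x - y) ** 2) / 2.0))
print(approx, exact)  # the two values agree to a few decimal places
```

Dual functions are then represented as finite-dimensional dot products against such feature vectors, so optimizing over them reduces to optimizing over weight vectors.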
We propose to solve this Max-Max problem using Alternating Optimization (AO) techniques, alternating between optimizing over $\theta$ and over $(\mathbf{p}_{1},\mathbf{p}_{2})$. In each turn of the AO we compute the gradient of the objective with respect to the corresponding arguments and conduct a fixed number of steps of the gradient ascent algorithm. Computing the gradient with respect to $\theta$ can be done using standard RL formulae like those from Equations 5 and 6, by using various MC methods [6, 25]. They can also be used to approximate gradients with respect to $\mathbf{p}_{1},\mathbf{p}_{2}$ since (proof in the Appendix):
Lemma 3.2.
The gradient of the objective function from Equation 10 with respect to the dual parameters $\mathbf{p}_{1}$ satisfies:

$$\nabla_{\mathbf{p}_{1}}F_{\mathrm{WD}}=\beta\left(-\mathbb{E}_{\tau\sim\pi_{\theta}}\left[\phi(\Psi(\tau))\right]-\nabla_{\mathbf{p}_{1}}\Phi_{\gamma}(\mathbf{p}_{1},\mathbf{p}_{2},C)\right), \qquad (11)$$

and an analogous result holds for $\mathbf{p}_{2}$.
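The alternating-optimization scheme just described can be sketched generically as follows; in the RL setting the two gradient callbacks would be the MC estimators from Equations 5 and 6 and the dual-gradient estimator, while here we plug in a toy concave objective instead:

```python
def alternating_ascent(grad_theta, grad_lam, theta, lam,
                       outer_steps=50, inner_steps=5, lr=0.1):
    """Generic alternating gradient-ascent sketch for a Max-Max objective
    F(theta, lam): a few ascent steps in lam, then a few in theta."""
    for _ in range(outer_steps):
        for _ in range(inner_steps):
            lam = lam + lr * grad_lam(theta, lam)
        for _ in range(inner_steps):
            theta = theta + lr * grad_theta(theta, lam)
    return theta, lam

# Toy concave objective F = -(theta - 1)^2 - (lam + 2)^2, maximized at (1, -2).
grad_t = lambda th, la: -2.0 * (th - 1.0)
grad_l = lambda th, la: -2.0 * (la + 2.0)
print(alternating_ascent(grad_t, grad_l, 0.0, 0.0))  # converges to about (1, -2)
```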
3.2 Min-Max RL Policy Optimization with Wasserstein regularizers
If in the optimization problem considered in Section 3.1 we replace the Wasserstein regularization term $+\beta\,\mathrm{WD}_{\gamma}$ with $-\beta\,\mathrm{WD}_{\gamma}$, then we obtain an optimization problem of the form:

$$\max_{\theta}\min_{\lambda_{1},\lambda_{2}}\;F(\theta)+\beta\left(\mathbb{E}_{\tau\sim\pi_{\theta}}[\lambda_{1}(\Psi(\tau))]+\mathbb{E}_{\tau'\sim\pi'}[\lambda_{2}(\Psi(\tau'))]+\Phi_{\gamma}(\lambda_{1},\lambda_{2},C)\right), \qquad (12)$$

which we propose to solve using techniques completely analogous to those from Section 3.1. Interestingly, the optimization problem from Equation 12 appears in a couple of important RL setups:
Imitation learning:
Assume that an agent has access to trajectories from an expert's policy, even though the expert's policy itself is not known. We propose an imitation learning setup for training, where the goal is to train a policy $\pi_{\theta}$ maximizing the reward $F(\theta)$, but the objective is regularized by a negated WD term between the trained policy's and the expert's pushforward distributions to enforce trained policies to produce trajectories resembling those of the expert. We show empirically that WD regularizers applied in this context improve training.
WassersteinTRPO:
As in the trust region formulation of policy optimization from Eq. 7, we introduce Wasserstein-TRPO by replacing KL-divergence constraints by a WD constraint. This leads to a Min-Max optimization problem if we use our dual formulation. Importantly, Wasserstein-TRPO can leverage important trajectory information when forming the trust region, while retaining monotonic improvement guarantees as in the original KL-divergence formulation [27]. We show empirically that this leads to faster optimization and provide theoretical guarantees.
4 Theoretical results
Here we provide theoretical guarantees for the presented AO algorithm and for the method of Wasserstein regularizers in TRPO. Additional theoretical results, e.g. those for the Min-Max setting, all technical definitions not explicitly given here, and proofs are given in the Appendix.
We will analyze our AO algorithm for the Max-Max optimization problem. We show that the obtained solutions converge to local maxima of the objective function. Consider the function $F_{\mathrm{WD}}$ from Lemma 3.2. We denote by $(\theta^{*},\mathbf{p}_{1}^{*},\mathbf{p}_{2}^{*})$ one of its local maxima. Define $g(\theta)=F_{\mathrm{WD}}(\theta,\mathbf{p}_{1}^{*},\mathbf{p}_{2}^{*})$, i.e. $g$ is $F_{\mathrm{WD}}$ as a function of $\theta$ for locally optimal values of $\mathbf{p}_{1}$ and $\mathbf{p}_{2}$.
We will assume that $F_{\mathrm{WD}}$ is locally strongly concave and smooth, for fixed $\theta$, in the neighborhood of its optimal value. We will also assume that the gradient of $g$ is Lipschitz in that neighborhood. The following convergence theorem holds:
Theorem 4.1.
For the entropy coefficient , denote: , and . Denote and . Let be the solution from iteration and the local maximum considered above. Assume that optimization starts in . If , then the error at iteration of the presented AO algorithm for the Max-Max problem from Section 3.1 with decaying gradient step size satisfies for :
(13) 
Local concavity is necessary to obtain strict theoretical guarantees for the optimization of black-box functions, and smoothness assumptions are motivated by the fact that most RL algorithms deal with smoothings of the original black-box functions (see our discussion of ES methods).
4.1 Wasserstein Trust Region
For a given policy $\pi$, we denote by $V^{\pi}$, $Q^{\pi}$ and $A^{\pi}$ the value function, $Q$-function and advantage function respectively (see Appendix, Section A.5). Furthermore, let $\eta(\pi)$ be the expected reward of policy $\pi$ and $\rho_{\pi}$ be the visitation measure.
Two distinct policies $\pi$ and $\tilde{\pi}$ can be related via the equation $\eta(\tilde{\pi})=\eta(\pi)+\mathbb{E}_{s\sim\rho_{\tilde{\pi}},\,a\sim\tilde{\pi}}[A^{\pi}(s,a)]$ (see: [29]) and the linear approximation $L_{\pi}(\tilde{\pi})$ to $\eta(\tilde{\pi})$ around $\pi$ via $L_{\pi}(\tilde{\pi})=\eta(\pi)+\mathbb{E}_{s\sim\rho_{\pi},\,a\sim\tilde{\pi}}[A^{\pi}(s,a)]$ (see: [12]). Let $\mathcal{S}$ be a finite set. Consider the embedding defined by $\Psi(\tau)=\frac{1}{T}\sum_{t=0}^{T-1}e_{s_{t}}$ and the related cost function defined as $C(x,y)=\|x-y\|_{1}$. The Wasserstein distance under this embedding is closely related to visitation frequencies (see the Appendix for the precise statement and proof). These observations enable us to prove an analogue of Theorem 1 from [27], namely:
Theorem 4.2.
If and , then .
5 Experiments
We compare our algorithms with the state of the art in several RL policy optimization settings and include wall-clock time experiments demonstrating that our alternating optimization approach is significantly faster than the particle approximation techniques used to compute Wasserstein flows. We present results for all domains considered in the paper, in particular: improving exploration strategies, policy gradient methods (TRPO) and imitation learning. Additional details are in the Appendix.
5.1 Max-Max Setting
A common challenge in model-free RL is deceptive rewards. These arise since agents can only learn from data gathered via exploration in the environment. As such, our primary interest in the Max-Max setting (see: Section 3.1) is to assess whether we can explore efficiently to avoid deceptive rewards and escape from local optima. We investigate this through the use of intentionally deceptive environments and locomotion problems where agents may learn suboptimal gaits.
Efficient Exploration:
We consider two types of agents (point and quadruped) that aim to avoid a deceptive barrier. We compare with state-of-the-art methods for efficient exploration: the novelty-seeking agents of [7] and NoisyNets [10]. Results are presented in Fig. 1. Policies avoiding the wall correspond to markedly higher rewards for the quadruped and point agents respectively.
In the former case an agent needs to first learn how to walk, and the presence of the wall is enough to prevent vanilla ES from even learning basic forward locomotion. Our WD method is the only one that drives the agent to the goal in both settings. For the quadruped agent we used the reward-to-go embedding while for the point agent we applied the final-state embedding (see: Section 3). We are solely interested in investigating the impact of our method, and thus optimize a single agent.
Escaping Local Maxima.
We compared our methods with ones applying regularizers based on other distances/divergences defined on probability distributions (namely: Hellinger, Jensen-Shannon (JS), KL and Total Variation (TV) distances), as well as vanilla ES (i.e. with no distance regularizer); see Fig. 2. Experiments were performed on an environment from [4], where the number of MC samples of the ES optimizer (see: Section 2.2) was drastically reduced. Our method is the only one that manages to obtain good policies, which demonstrates that the benefits come not just from introducing a regularizer, but from its particular form.
5.2 Min-Max Setting
Trust Region Policy Optimization:
As discussed in Section 3.2, we propose to use Wasserstein distances to improve standard TRPO algorithms. We compare our method with the baseline using KL divergence on several RL tasks, obtaining consistent substantial gains. We use the concatenation-of-actions embedding (see: Section 3). Results are presented in Fig. 3. The benchmark tasks are from OpenAI Gym [4] or the DeepMind Control Suite [31]. See the Appendix for more results.
Imitation Learning
As discussed in Section 3.2, we can also utilize the Min-Max framework for imitation learning. Here we have access to an expert's trajectories and translate them using the reward-to-go embedding (see: Section 3). In Fig. 4(a), we show that the policy trained with this WD-regularized objective significantly outperforms vanilla ES on the task.
Wall Clock Time:
To illustrate the computational benefits of alternating optimization (AO) of the Wasserstein distance, we compare it to the particle approximation (PA) method introduced in [35] in Fig. 4(b)-(d). In practice, the Wasserstein distance across different state samples can be optimized in a batched manner using AO (see the Appendix for details). We see that AO is substantially faster than PA.
6 Conclusion
We showed in this paper that state-of-the-art RL algorithms can be improved by incorporating behavioral characteristics of the trained policies via Wasserstein-based regularizers acting on certain policy embeddings. We also proposed efficient algorithms to solve these enriched problems and provided theoretical guarantees.
References
 [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017.
 [2] S. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. Ann. Statist., 45(1):77–120, 2017.
 [3] N. Bonneel. Optimal Transport for Computer Graphics and Temporal Coherence of Image Processing Algorithms. (Transport Optimal pour l’Informatique Graphique et Cohérence Temporelle des Algorithmes de Traitement d’Images). 2018.
 [4] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
 [5] K. Choromanski, A. Pacchiano, J. Parker-Holder, J. Hsu, A. Iscen, D. Jain, and V. Sindhwani. When random search is not enough: Sample-efficient and noise-robust blackbox optimization of RL policies. CoRR, abs/1903.02993, 2019.
 [6] K. Choromanski, M. Rowland, V. Sindhwani, R. E. Turner, and A. Weller. Structured evolution with compact architectures for scalable policy optimization. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 969–977, 2018.
 [7] E. Conti, V. Madhavan, F. P. Such, J. Lehman, K. O. Stanley, and J. Clune. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. In Advances in Neural Information Processing Systems 31, NeurIPS, pages 5032–5043, 2018.
 [8] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, NeurIPS, pages 2292–2300, 2013.
 [9] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov. OpenAI Baselines. https://github.com/openai/baselines, 2017.
 [10] M. Fortunato, M. G. Azar, B. Piot, J. Menick, M. Hessel, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg. Noisy networks for exploration. In International Conference on Learning Representations, ICLR, 2018.
 [11] D. Ha and J. Schmidhuber. Recurrent world models facilitate policy evolution. Advances in Neural Information Processing Systems 31, NeurIPS, 2018.
 [12] S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), pages 267–274, 2002.
 [13] L. Kantorovich. On the transfer of masses. Doklady Akademii Nauk USSR, 1942.
 [14] M. J. Kusner, Y. Sun, N. I. Kolkin, and K. Q. Weinberger. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 957–966, 2015.
 [15] S. J. Lee and Z. Popovic. Learning behavior styles with inverse reinforcement learning. ACM Trans. Graph., 29(4):122:1–122:7, 2010.
 [16] H. Mania, A. Guy, and B. Recht. Simple random search of static linear policies is competitive for reinforcement learning. Advances in Neural Information Processing Systems 31, NeurIPS, 2018.
 [17] E. Meyerson, J. Lehman, and R. Miikkulainen. Learning behavior characterizations for novelty search. In Proceedings of the 2016 Genetic and Evolutionary Computation Conference, Denver, CO, USA, July 20-24, 2016, pages 149–156, 2016.
 [18] C. A. Micchelli, Y. Xu, and H. Zhang. Universal kernels. Journal of Machine Learning Research, 7:2651–2667, 2006.
 [19] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1 edition, 2014.
 [20] A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pages 663–670, 2000.
 [21] G. Nikolentzos, P. Meladianos, and M. Vazirgiannis. Matching node embeddings for graph similarity. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 2429–2435, 2017.
 [22] J. K. Pugh, L. B. Soros, and K. O. Stanley. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI, 3:40, 2016.
 [23] A. Rahimi and B. Recht. Random features for largescale kernel machines. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008.
 [24] P. H. Richemond and B. Maginnis. On Wasserstein reinforcement learning and the Fokker-Planck equation. CoRR, abs/1712.07185, 2017.
 [25] M. Rowland, K. Choromanski, F. Chalus, A. Pacchiano, T. Sarlos, R. E. Turner, and A. Weller. Geometrically coupled monte carlo sampling. Advances in Neural Information Processing Systems 31, NeurIPS, 2018.
 [26] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. CoRR, abs/1703.03864, 2017.
 [27] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 1889–1897, 2015.
 [28] R. Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Statist., 35(2):876–879, 06 1964.
 [29] R. S. Sutton, A. G. Barto, et al. Introduction to Reinforcement Learning, volume 135. MIT press Cambridge, 1998.
 [30] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, [NIPS Conference, Denver, Colorado, USA, November 29  December 4, 1999], pages 1057–1063, 1999.
 [31] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.
 [32] I. O. Tolstikhin, O. Bousquet, S. Gelly, and B. Schölkopf. Wasserstein autoencoders. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30  May 3, 2018, Conference Track Proceedings, 2018.
 [33] F. Torabi, G. Warnell, and P. Stone. Behavioral cloning from observation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pages 4950–4957, 2018.
 [34] C. Villani. Optimal transport: Old and new. Springer, 2008.
 [35] R. Zhang, C. Chen, C. Li, and L. Carin. Policy optimization as Wasserstein gradient flows. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 5741–5750, 2018.
Appendix A: Wasserstein Reinforcement Learning
A.1 Proof of Lemma 3.1
We start by proving Lemma 3.1, which we restate here for the reader's convenience:
Lemma A.1.
Let $\mathcal{S}$ and $\mathcal{A}$ be finite sets, the MDP be episodic (i.e. of finite horizon $T$), and $\Psi(\tau)=\frac{1}{T}\sum_{t=0}^{T-1}e_{(s_{t},a_{t})}$, with $e_{(s,a)}$ the indicator vector for the state-action pair $(s,a)$. Let $C(x,y)=\|x-y\|_{2}$ for $x,y\in\mathcal{E}$. If $\mathrm{WD}_{0}(\Psi(\mathbb{P}_{\pi_{1}}),\Psi(\mathbb{P}_{\pi_{2}}))=0$ and every state in $\mathcal{S}$ is visited with positive probability, then $\pi_{1}=\pi_{2}$.
Proof.
If , there exists a coupling between and such that:
Consequently:
Therefore for all :
where and denote the entries of and respectively. Notice that for all :
(14) 
Since for all and :
Therefore for all :
Consequently for all . By Bayes rule, this plus equation 14 yields:
And therefore: . ∎
A.2 Proof of Lemma 3.2
We now prove Lemma 3.2, which, as before, we restate here for the reader's convenience:
Lemma A.2.
The gradient of the objective function from Equation 10 with respect to the dual parameters $\mathbf{p}_{1}$ satisfies:

$$\nabla_{\mathbf{p}_{1}}F_{\mathrm{WD}}=\beta\left(-\mathbb{E}_{\tau\sim\pi_{\theta}}\left[\phi(\Psi(\tau))\right]-\nabla_{\mathbf{p}_{1}}\Phi_{\gamma}(\mathbf{p}_{1},\mathbf{p}_{2},C)\right), \qquad (15)$$

and an analogous result holds for $\mathbf{p}_{2}$.
Proof.
Note that the objective function from Equation 10 is of the form:
And therefore:
where . Rewriting this expression in terms of expectations gives:
(16) 
and that completes the proof.
∎
A.3 Max-Max Problem: theoretical analysis
Our goal in this section is to prove Theorem 4.1, which we restate below for the reader's convenience:
Theorem A.3.
For the entropy coefficient , denote: , and . Denote and . Let be the solution from iteration and the local maximum considered above. Assume that optimization starts in . If , then the error at iteration of the presented AO algorithm for the Max-Max problem from Section 3.1 with decaying gradient step size satisfies:
(17) 
where .
We will need several auxiliary technical results. We will use the following notation: , where is the objective function from the main body of the paper parameterized by entropy coefficient , and . We will apply this notation also in the next section regarding the Min-Max problem. For completeness, the definitions of strong concavity, smoothness and the Lipschitz condition from Theorem A.3 (and Theorem 4.1 from the main body of the paper) are given in Section A.3.1.
We consider the dual optimization problem:
where is a parameter,