Off-Policy Deep Reinforcement Learning with Analogous Disentangled Exploration

Off-policy reinforcement learning (RL) is concerned with learning a rewarding policy by executing another policy that gathers samples of experience. While the former (i.e. the target policy) should be rewarding yet need not be expressive (in most cases, it is deterministic), the latter task requires an expressive policy (i.e. the behavior policy) that offers guided and effective exploration. Contrary to most methods, which make a trade-off between optimality and expressiveness, disentangled frameworks explicitly decouple the two objectives, each of which is handled by a distinct policy. Although this allows the two policies to be freely designed and optimized with respect to their own objectives, naively disentangling them can lead to inefficient learning or stability issues. To mitigate this problem, our proposed method Analogous Disentangled Actor-Critic (ADAC) designs analogous pairs of actors and critics. Specifically, ADAC leverages a key property of Stein variational gradient descent (SVGD) to constrain the expressive energy-based behavior policy with respect to the target one for effective exploration. Additionally, an analogous critic pair is introduced to incorporate intrinsic rewards in a principled manner, with theoretical guarantees on the overall learning stability and effectiveness. We empirically evaluate environment-reward-only ADAC on 14 continuous-control tasks and report state-of-the-art results on 10 of them. We further demonstrate that ADAC, when paired with intrinsic rewards, outperforms alternatives in exploration-challenging tasks.

Key Words:
Reinforcement Learning; Deep Reinforcement Learning; Exploration

Proc. of the 19th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2020), May 9–13, 2020, Auckland, New Zealand. B. An, N. Yorke-Smith, A. El Fallah Seghrouchni, G. Sukthankar (eds.)


University of California, Los Angeles, 404 Westwood Plaza

1 Introduction

Reinforcement learning (RL) studies the control problem where an agent tries to navigate through an unknown environment Sutton and Barto (2018). The agent attempts to maximize its cumulative rewards through an iterative trial-and-error learning process Arulkumaran et al. (2017). Recently, we have seen many successes of applying RL to challenging simulation Mnih et al. (2015); Liang et al. (2016) and real-world Silver et al. (2017); Leibo et al. (2017); Wang and Zhang (2017) problems. Inherently, RL consists of two distinct but closely related objectives: learn the best possible policy from the gathered samples (i.e. exploitation) and collect new samples effectively (i.e. exploration). While the exploitation step shares certain similarities with tasks such as supervised learning, exploration is unique, essential, and is often viewed as the backbone of many successful RL algorithms Mnih et al. (2013); Haarnoja et al. (2018a).

In order to explore novel states that are potentially rewarding, it is crucial to incorporate randomness when interacting with the environment. Thanks to its simplicity, injecting noise into the action Lillicrap et al. (2015); Fujimoto et al. (2018a) or parameter space Fortunato et al. (2017); Plappert et al. (2018) is widely used to implicitly construct behavior policies from target policies. In most prior work, the injected noise has zero mean, so that the updates to the target policy are unbiased Fujimoto et al. (2018b); Gu et al. (2016). The stability of noise-based exploration, which stems from this unbiasedness, makes it a safe exploration strategy. However, noise-based approaches are generally less effective, since they are neither aware of potentially rewarding actions nor guided by exploration-oriented targets.
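As a concrete illustration of this family of strategies, the sketch below shows zero-mean action-space noise injection in the style of DDPG/TD3. The function name, noise scale, and clipping range are illustrative assumptions, not values from any specific paper.

```python
import numpy as np

def behavior_action(target_action, sigma=0.1, low=-1.0, high=1.0, rng=None):
    """Implicitly construct a behavior policy from a target policy by adding
    zero-mean Gaussian noise to the target action, then clipping the result
    to the valid action range."""
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.normal(0.0, sigma, size=np.shape(target_action))
    return np.clip(np.asarray(target_action) + noise, low, high)
```

Because the noise has zero mean, averaging over many behavior actions recovers the target action, which is the intuition behind the unbiasedness of updates computed from such samples.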

To tackle the above problem, two orthogonal lines of approaches have been proposed. One of them considers extracting more information from the current knowledge (i.e. gathered samples). For example, energy-based RL algorithms learn to capture potentially rewarding actions through its energy objective Haarnoja et al. (2018a); Sutton and Barto (2018). A second line of work considers leveraging external guidance to aid exploration. In a nutshell, they formulate some intuitive tendencies in exploration as an additional reward function called intrinsic reward Bellemare et al. (2016); Houthooft et al. (2016). Guided by these auxiliary tasks, RL algorithms tend to act curiously, substantially improving exploration of the state space.

Despite their promising exploration efficiency, both lines of work fail to fully exploit the collected samples and turn them into the highest-performing policy, as their learned policies often execute sub-optimal actions. To avoid this undesirable exploration-exploitation trade-off, several attempts have been made to design the two policies separately (i.e. to disentangle them), of which one aims to gather the most informative samples (and hence is commonly referred to as the behavior policy) while the other attempts to best utilize the current knowledge from the gathered samples (and hence is usually referred to as the target policy) Colas et al. (2018); Beyer et al. (2019). To help fulfill their respective goals, disentangled objective functions and learning paradigms are further designed and separately applied to the two policies.

However, naively disentangling the behavior policy from the target policy can render their update processes unstable. For example, when disentangled naively, the two policies tend to differ substantially due to their contrasting objectives, which is known to potentially result in catastrophic learning failure Nachum et al. (2018). To mitigate this problem, we propose Analogous Disentangled Actor-Critic (ADAC), where being analogous is reflected by the constraints imposed on the disentangled actor-critic Mnih et al. (2016) pairs. ADAC consists of two main algorithmic contributions. First, policy co-training guides the behavior policy's updates by the target policy, making the gathered samples more helpful for the target policy's learning process while keeping the expressiveness of the behavior policy for extensive exploration (Section 4.2). Second, critic bounding allows an additional explorative critic to be trained with the aid of intrinsic rewards (Section 4.3). Under certain constraints from the target policy, the resultant critic maintains the curiosity incentivized by intrinsic rewards while guaranteeing the training stability of the target policy.

Besides Section 4’s elaboration of our method, the rest of the paper is organized as follows. Section 2 reviews and summarizes the related work. Key background concepts and notations are introduced in Section 3. Experiment details of ADAC are explained in Section 5. Finally, conclusions are presented in Section 6.

2 Related Work

Learning to be aware of potentially rewarding actions is a promising strategy for conducting exploration, as it automatically prunes less rewarding actions and concentrates exploration efforts on those with high potential. To capture these actions, expressive learning models/objectives are widely used. The most notable recent work in this direction, such as Soft Actor-Critic Haarnoja et al. (2018a), EntRL Schulman et al. (2017a), and Soft Q-Learning Haarnoja et al. (2017), learns an expressive energy-based target policy according to the maximum entropy RL objective Ziebart (2010). However, the expressiveness of their policies in turn becomes a burden on their optimality, and in practice, trade-offs such as temperature controlling Haarnoja et al. (2018b) and reward scaling Haarnoja et al. (2017) have to be made for better overall performance. As we shall show later, ADAC makes use of a similar but extended energy-based target, and alleviates the compromise on optimality using its analogous disentangled framework.

Ad-hoc exploration-oriented learning targets designed to better explore the state space are also promising. Recent research efforts along this line include count-based exploration Xu et al. (2017); Bellemare et al. (2016) and intrinsic motivation Houthooft et al. (2016); Fu et al. (2017); Kulkarni et al. (2016) approaches. The outcome of these methods is usually an auxiliary reward termed the intrinsic reward, which is extremely useful when the environment-defined reward is sparsely available. However, as we shall illustrate in Section 5.3, intrinsic rewards can bias the task-defined learning objective, leading to catastrophic failure in some tasks. Again, owing to the disentangled nature of ADAC, we give a principled solution to this problem with theoretical guarantees (Section 4.3).

Explicitly disentangling exploration from exploitation has been used to solve a problem common to the above approaches: sacrificing the target policy's optimality for better exploration. By separately designing the exploration and exploitation components, both objectives can be better pursued simultaneously. Specifically, GEP-PG Colas et al. (2018) uses a Goal Exploration Process (GEP) Forestier et al. (2017) to generate samples and feed them to the replay buffer of DDPG Lillicrap et al. (2015) or its variants. Multiple Losses for Exploration (MULEX) Beyer et al. (2019) proposes to use a series of intrinsic rewards to optimize different policies in parallel, which in turn generate abundant samples to train the target policy. Despite their intriguing conceptual ideas, these methods overlook the training stability issue caused by the mismatch between the distribution of collected samples (under the behavior policy) and the distribution induced by the target policy, which is formalized as extrapolation error in Fujimoto et al. (2018b). ADAC aims to mitigate the training stability issue caused by the extrapolation error while maintaining the effective exploration-exploitation trade-off promised by expressive behavior policies (Section 4.2) as well as intrinsic rewards (Section 4.3), using its analogous disentangled actor-critic pairs.

3 Preliminaries

In this section, we introduce the reinforcement learning (RL) setting we address in this paper, as well as some background concepts that we utilize to build our method.

3.1 RL with Continuous Control

In a standard reinforcement learning (RL) setup, an agent interacts with an unknown environment at discrete time steps and aims to maximize the reward signal Sutton and Barto (2018). The environment is often formalized as a Markov Decision Process (MDP), which can be succinctly defined as a 5-tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$. At time step $t$, the agent in state $s_t \in \mathcal{S}$ takes action $a_t \in \mathcal{A}$ according to policy $\pi(a_t \mid s_t)$, a conditional distribution of $a_t$ given $s_t$, leading to the next state $s_{t+1}$ according to the transition probability $p(s_{t+1} \mid s_t, a_t)$. Meanwhile, the agent observes reward $r(s_t, a_t)$ emitted from the environment.

The agent strives to learn the optimal policy $\pi^*$ that maximizes the expected return $J(\pi) = \mathbb{E}_{s_0 \sim \rho_0,\, \pi}\left[\sum_{t} \gamma^t r(s_t, a_t)\right]$, where $\rho_0$ is the initial state distribution and $\gamma \in [0, 1)$ is the discount factor balancing the priority of short- and long-term rewards. For continuous control, the policy $\pi_\theta$ (also known as the actor in the actor-critic framework) parameterized by $\theta$ can be updated by taking the gradient $\nabla_\theta J(\pi_\theta)$. According to the deterministic policy gradient theorem Silver et al. (2014), $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^\pi}\left[\nabla_\theta \pi_\theta(s)\, \nabla_a Q^\pi_r(s, a)\big|_{a = \pi_\theta(s)}\right]$, where $\rho^\pi$ denotes the state-action marginals of the trajectory distribution induced by $\pi$, and $Q^\pi_r$ denotes the state-action value function (also known as the critic in the actor-critic framework), which represents the expected return under the reward function $r$ when performing action $a$ at state $s$ and following policy $\pi$ afterwards. Intuitively, it measures how preferable executing action $a$ is at state $s$ with respect to the policy $\pi$ and reward function $r$. Following Bellman (1966), we additionally introduce the Bellman operator, which is commonly used to update the $Q$-function. The Bellman operator uses a policy $\pi$ and a reward function $r$ to update an arbitrary value function $Q$, which is not necessarily defined with respect to the same $\pi$ or $r$. For example, the outcome of $\mathcal{T}^\pi_r Q$ is defined as $\mathcal{T}^\pi_r Q(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim p,\, a' \sim \pi}\left[Q(s', a')\right]$. By slightly abusing notations, we further define the outcome of $\mathcal{T}_r Q$ as $\mathcal{T}_r Q(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim p}\left[\max_{a'} Q(s', a')\right]$. Some also call $\mathcal{T}_r$ the Bellman optimality operator.
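The two operators can be made concrete on a small tabular MDP. The sketch below implements both $\mathcal{T}^\pi_r$ and the optimality variant $\mathcal{T}_r$ for finite state and action spaces; the array shapes are illustrative assumptions.

```python
import numpy as np

def bellman_policy_op(Q, P, R, pi, gamma=0.9):
    """Apply T^pi_r: (T^pi_r Q)(s,a) = r(s,a) + gamma * E_{s'~p, a'~pi}[Q(s',a')].
    Q: (S,A) values, P: (S,A,S) transition probs, R: (S,A) rewards,
    pi: (S,A) per-state action probabilities."""
    V = np.sum(pi * Q, axis=1)                       # value of each next state under pi
    return R + gamma * np.einsum('sax,x->sa', P, V)

def bellman_optimality_op(Q, P, R, gamma=0.9):
    """Apply T_r, which backs up the greedy (max) action instead of pi's expectation."""
    V = np.max(Q, axis=1)
    return R + gamma * np.einsum('sax,x->sa', P, V)
```

Iterating the optimality operator from any initial $Q$ is exactly value iteration, and its fixed point is the optimal value function.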

3.2 Off-policy Learning and Behavior Policy

To aid exploration, it is a common practice to construct/store more than one policy for the agent (either implicitly or explicitly). Off-policy actor-critic methods Watkins and Dayan (1992) allow us to make a clear separation between the target policy, which refers to the best policy currently learned by the agent, and the behavior policy, which the agent follows to interact with the environment. Note that the discussion in Section 3.1 is largely around the target policy. Thus, from this point on, to avoid confusion, $\pi$ is reserved to denote only the target policy, and the notation $\beta$ is introduced to denote the behavior policy. Due to the policy separation, the target policy instead resorts to estimates calculated from samples collected by the behavior policy $\beta$; that is, the deterministic policy gradient mentioned above is approximated as

$$\nabla_\theta J(\pi_\theta) \approx \mathbb{E}_{s \sim \rho^\beta}\left[\nabla_\theta \pi_\theta(s)\, \nabla_a Q^\pi_r(s, a)\big|_{a = \pi_\theta(s)}\right] \quad (1)$$
where $r$ is the environment-defined reward. One of the most notable off-policy learning algorithms that capitalizes on this idea is deep deterministic policy gradient (DDPG) Lillicrap et al. (2015). To mitigate function approximation errors in DDPG, Fujimoto et al. propose TD3 Fujimoto et al. (2018a). Given that DDPG and TD3 have demonstrated themselves to be competitive on many continuous-control benchmarks, we choose to implement our Analogous Disentangled Actor-Critic (ADAC) on top of their target policies. Yet, it is worth reiterating that ADAC is compatible with any existing off-policy learning algorithm. We defer a more detailed discussion of ADAC's compatibility until we formally introduce our method in Section 4.1.
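The approximated gradient above can be exercised end-to-end in a toy setting. The sketch below uses a linear deterministic policy and a hand-built quadratic critic whose action gradient is known in closed form; `w_star`, the learning rate, and the state distribution are all illustrative assumptions rather than part of any paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([0.7, -0.3])           # hypothetical weights of the best action map

def critic_grad_a(s, a):
    """dQ/da for a toy critic Q(s,a) = -(a - w_star @ s)^2."""
    return -2.0 * (a - w_star @ s)

theta = np.zeros(2)                      # linear deterministic policy: pi(s) = theta @ s
for _ in range(2000):
    s = rng.normal(size=2)               # state drawn from the behavior distribution rho^beta
    a = theta @ s
    # deterministic policy gradient: grad_theta pi(s) * dQ/da evaluated at a = pi(s)
    theta += 0.05 * s * critic_grad_a(s, a)
# theta now approximates w_star, i.e. the policy has learned the critic's best action
```

Note that the states are sampled from whatever distribution generated the data, mirroring the off-policy approximation in Eq (1).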

3.3 Expressive Behavior Policies through Energy-Based Representation

Figure 1: Evaluation of the amortized SVGD learning algorithm Feng et al. (2017) (Eq (3)) with different values of $\alpha$ under two target distributions.

One promising way to design an exploration-oriented behavior policy without external guidance (which is usually in the form of an intrinsic reward) is to increase the expressiveness of $\beta$ so that it captures information about potentially rewarding actions. Energy-based representations have recently been increasingly chosen as the target form for constructing an expressive behavior policy. Since their first introduction by Ziebart (2010) to achieve maximum-entropy reinforcement learning, several lines of prior work have kept improving upon this idea. Among them, the most notable are Soft Q-Learning (SQL) Haarnoja et al. (2017), EntRL Schulman et al. (2017a), and Soft Actor-Critic (SAC) Haarnoja et al. (2018b). Collectively, they have achieved competitive results on many benchmark tasks. Formally, the energy-based behavior policy is defined as

$$\beta(a \mid s) \propto \exp\left(Q(s, a)\right) \quad (2)$$
where $Q$ is commonly selected to be the target critic in prior work Haarnoja et al. (2018b); Haarnoja et al. (2018a). Various efficient samplers have been proposed to approximate the distribution specified in Eq (2). Among them, Haarnoja et al. (2017)'s sampler based on Stein variational gradient descent (SVGD) Liu and Wang (2016); Wang et al. (2018) is especially worth noting, as it has the potential to approximate complex and multi-modal behavior policies. Given this, we also choose it to sample the behavior policy in our proposed ADAC.

Additionally, we want to highlight an intriguing property of SVGD that is critical for understanding why we can perform analogous disentangled exploration effectively. Intuitively, SVGD transforms a set of particles to match a target distribution. In the context of RL, following Amortized SVGD Feng et al. (2017), we use a neural network sampler $f(\xi; s)$, with input noise $\xi$, to approximate Eq (2), which is done by minimizing the KL divergence between the two distributions. According to Feng et al. (2017), $f$ is updated according to the following gradient:

$$\Delta f(\xi; s) \propto \mathbb{E}_{\xi'}\left[ k\big(f(\xi'; s), f(\xi; s)\big)\, \nabla_a Q(s, a)\big|_{a = f(\xi'; s)} + \alpha\, \nabla_{a'} k\big(a', f(\xi; s)\big)\big|_{a' = f(\xi'; s)} \right] \quad (3)$$
where $k$ is a positive definite kernel and $\alpha$ is an additional hyper-parameter introduced to control the optimality-expressiveness trade-off. The intrinsic connection between Eq (3) and the deterministic policy gradient (i.e. Eq (1)) is introduced in Haarnoja et al. (2017) and Feng et al. (2017): the first term of the gradient represents a combination of deterministic policy gradients weighted by the kernel $k$, while the second term represents an entropy maximization objective.

To aid a better understanding of this relation, we illustrate the distributions approximated by SVGD with different values of $\alpha$ in a toy example, as shown in Figure 1. The dashed line is the approximation target. When $\alpha$ is small, the entropy of the learned distribution is restricted and the overall policy leans towards the highest-probability region. On the other hand, a larger $\alpha$ leads to a more expressive approximation.
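The effect of the two terms, and of scaling the repulsive one, can be reproduced with a bare-bones SVGD sketch on 1-D particles. The RBF bandwidth, step size, and particle count below are illustrative assumptions, and the target is a standard normal rather than a critic-induced energy.

```python
import numpy as np

def svgd_step(x, dlogp, h=0.5, alpha=1.0, lr=0.1):
    """One SVGD update on 1-D particles x. The kernel-weighted gradient term
    pulls particles toward high density; the repulsive term, scaled here by
    alpha, keeps them spread out (larger alpha -> higher-entropy result)."""
    diff = x[:, None] - x[None, :]            # diff[j, i] = x_j - x_i
    k = np.exp(-diff**2 / (2 * h**2))         # RBF kernel k(x_j, x_i)
    grad_k = -diff / h**2 * k                 # d k(x_j, x_i) / d x_j
    phi = (k * dlogp(x)[:, None] + alpha * grad_k).mean(axis=0)
    return x + lr * phi

rng = np.random.default_rng(0)
dlogp = lambda x: -x                          # target: standard normal, grad log p(x) = -x
x = rng.uniform(-4, 4, size=200)
y = x.copy()
for _ in range(1000):
    x = svgd_step(x, dlogp, alpha=1.0)        # repulsion on: spread matches the target
    y = svgd_step(y, dlogp, alpha=0.0)        # repulsion off: particles collapse to the mode
```

With `alpha=0.0` the particle cloud concentrates around the highest-probability point, mirroring the small-$\alpha$ behavior in Figure 1, while `alpha=1.0` retains spread comparable to the target.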

4 Analogous Disentangled Actor Critic

This section introduces our proposed method Analogous Disentangled Actor-Critic (ADAC). We start by providing an overview of it (Section 4.1), which is followed by elaborating the specific choices we make to design our actors and critics (Sections 4.2 and 4.3).

Figure 2: Block diagram of ADAC, which consists of the sample collection phase (green box with dotted line) and the model update phase (gray box with dashed line). Model (i.e. actor and critic networks) updates are performed sequentially from 1⃝ to 4⃝. Each update step’s corresponding line in Algorithm 1 is shown in brackets.
1:  Input: A minibatch of samples, actor model $f$ (representing the target policy $\pi$ as well as the behavior policy $\beta$), critic models $Q^\pi_r$ and $Q^\pi_{r+r'}$.
2:  Compute the deterministic policy gradient of $Q^\pi_r$ with respect to $\pi$ (Eq (1)). // target policy update
3:  Compute the gradient of the energy-based objective on $Q^\pi_{r+r'}$ with respect to the behavior policy $\beta$ (Eq (3), Section 3.3). // behavior policy learning
4:  Update $f$ with the two gradients above. // policy co-training
5:  Update $Q^\pi_r$ and $Q^\pi_{r+r'}$ to minimize the mean squared error on the minibatch with respect to the targets $\mathcal{T}^\pi_r Q^\pi_r$ and $\mathcal{T}^\pi_{r+r'} Q^\pi_{r+r'}$, respectively. // value update with critic bounding
Algorithm 1 The model update phase of ADAC. Correspondence with Figure 2 is given after “//”.

4.1 Algorithm Overview

Figure 2 provides a diagram overview of ADAC, which consists of two actor-critic pairs, $(\pi, Q^\pi_r)$ and $(\beta, Q^\pi_{r+r'})$ (see the blue and pink boxes), to achieve disentanglement. As in prior off-policy algorithms (e.g., DDPG), during training ADAC alternates between two main procedures, namely sample collection (dotted green box), where we use $\beta$ to interact with the environment to collect training samples, and model update (dashed gray box), which consists of two phases: (i) batches of the collected samples are used to update both critics (the pink box); (ii) $\pi$ and $\beta$ (the blue box) are updated according to their respective critics using different objectives. During evaluation, $\pi$ is used to interact with the environment.

Both steps in the model update phase manifest the analogous property of our method. First, although optimized with respect to different objectives, both policies ($\pi$ and $\beta$) are represented by the same neural network $f$, where $\pi(s) = f(0; s)$ and $\beta$ samples actions via $f(\xi; s)$ with $\xi \sim \mathcal{N}(0, I)$. That is, $\pi$ is a deterministic policy since the input $\xi$ to $f$ is fixed, while $\beta$ can be regarded as an action sampler that uses the randomly sampled $\xi$ to generate actions. As we shall demonstrate in Section 4.2, this specific setup effectively restricts the deviation between the two policies ($\pi$ and $\beta$) (i.e. the update bias), which stabilizes the training process while maintaining sufficient expressiveness in the behavior policy (also see Section 5.1 for an intuitive illustration).

The second exhibit of our method’s analogous nature lies in our designed critics $Q^\pi_r$ and $Q^\pi_{r+r'}$, which are based on the environment-defined reward $r$ and the augmented reward $r + r'$ ($r'$ is the intrinsic reward) respectively, yet are both computed with regard to the target policy $\pi$. As in the standard approach, $Q^\pi_r$ approximates the task-defined objective that the algorithm aims to maximize. On the other hand, $Q^\pi_{r+r'}$ is a behavior critic that can be shown to be both explorative and stable, theoretically (Section 4.3) and empirically (Section 5.3). Note that when no intrinsic reward is used, the two critics degenerate to be identical to one another (i.e. $Q^\pi_r = Q^\pi_{r+r'}$), and in practice when that happens we only store one of them.

To better appreciate our method, it is not enough to gain an overview of our actors and critics only in isolation. Given this, we now formalize the connections between the actors and the critics, as well as the objectives that are optimized during the model update phase (Figure 2). As defined above, $\pi$ is the exploitation policy that aims to maintain optimality throughout the learning process, which is best optimized using the deterministic policy gradient (Eq (1)), where $Q^\pi_r$ is used as the critic (1⃝ in Figure 2). On the other hand, for the sake of expressiveness, the energy-based objective (Eq (2)) is a good fit for $\beta$. To further encourage exploration, we use the behavior critic $Q^\pi_{r+r'}$ in the objective, which gives $\beta(a \mid s) \propto \exp\left(Q^\pi_{r+r'}(s, a)\right)$ (2⃝ in Figure 2). Since both policies share the same network $f$, the actor optimization process (3⃝ in Figure 2) is done by maximizing

$$J(f) = J_\pi(f) + J_\beta(f) \quad (4)$$
where the gradients of the two terms are defined by Eqs (1) and (3), respectively. In particular, we set $Q = Q^\pi_r$ in Eq (1) and $Q = Q^\pi_{r+r'}$ in Eq (3). As illustrated in Algorithm 1 (line 5), we update $Q^\pi_r$ and $Q^\pi_{r+r'}$ with the targets $\mathcal{T}^\pi_r Q^\pi_r$ and $\mathcal{T}^\pi_{r+r'} Q^\pi_{r+r'}$ on the collected samples using the mean squared error loss, respectively.

In the sample collection phase, $\beta$ interacts with the environment and the gathered samples are stored in a replay buffer Mnih et al. (2013) for later use in the model update phase. Given state $s$, actions are sampled from $\beta$ with a three-step procedure: (i) sample $\xi \sim \mathcal{N}(0, I)$, (ii) plug the sampled $\xi$ into $f$ to get its output $f(\xi; s)$, and (iii) regard $f(\xi; s)$ as the center of the kernel $k$ and sample an action from it.
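The three-step sampling procedure, together with the shared-network construction $\pi(s) = f(0; s)$, can be sketched as follows. The tiny linear-tanh network, the dimensions, and the Gaussian kernel with a fixed scale are illustrative assumptions, not ADAC's exact architecture.

```python
import numpy as np

class SharedActor:
    """Toy deterministic network f(xi; s): both policies share the same weights.
    pi(s) = f(0; s) is deterministic; beta first samples xi ~ N(0, I), then
    perturbs f(xi; s) with a small kernel noise."""
    def __init__(self, state_dim, noise_dim, action_dim, rng):
        self.rng = rng
        self.W_s = rng.normal(scale=0.5, size=(action_dim, state_dim))
        self.W_x = rng.normal(scale=0.5, size=(action_dim, noise_dim))

    def f(self, xi, s):
        return np.tanh(self.W_s @ s + self.W_x @ xi)

    def target_action(self, s):
        return self.f(np.zeros(self.W_x.shape[1]), s)   # pi(s) = f(0; s)

    def behavior_action(self, s, kernel_scale=0.05):
        xi = self.rng.normal(size=self.W_x.shape[1])    # (i) sample xi ~ N(0, I)
        center = self.f(xi, s)                          # (ii) compute f(xi; s)
        # (iii) treat f(xi; s) as the kernel center and sample around it
        return center + self.rng.normal(scale=kernel_scale, size=center.shape)
```

Repeated calls to `target_action` return the same action, while `behavior_action` is stochastic, mirroring the deterministic/expressive split between $\pi$ and $\beta$.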

On the implementation side, ADAC is compatible with any existing off-policy actor-critic model for continuous control: it directly builds upon such a model by inheriting its actor (which is also its target policy $\pi$) and critic $Q^\pi_r$. To be more specific, ADAC merely adds a new actor $\beta$ to interact with the environment and a new critic $Q^\pi_{r+r'}$ that guides $\beta$'s updates on top of the base model, along with the constraints/connections enforced between the inherited and the new actor and between the inherited and the new critic (i.e. policy co-training and critic bounding). In other words, the modifications made by ADAC do not conflict with the originally proposed improvements of the base model. In our experiments, two base models (i.e. DDPG Lillicrap et al. (2015) and TD3 Fujimoto et al. (2018a)) are adopted.

4.2 Stabilizing Policy Updates by Policy Co-training

Although the energy-based behavior policy defined by Eq (2) is sufficiently expressive to capture potentially rewarding actions, it may still not be helpful for learning a better $\pi$: being expressive also means that $\beta$ is often significantly different from $\pi$, leading it to collect samples that can substantially bias $\pi$'s updates (recall the discussion around Eq (1)), in turn rendering the learning process of $\pi$ unstable and vulnerable to catastrophic failure Sutton et al. (2008); Schlegel et al. (2019); Xie et al. (2019); Fujimoto et al. (2018b). To be more specific, since the difference between $\pi$ and an expressive $\beta$ is more than some zero-mean random noise, the state marginal distribution $\rho^\beta$ defined with respect to $\beta$ can potentially diverge greatly from the state marginal distribution $\rho^\pi$ defined with respect to $\pi$. Since $\rho^\pi$ is not directly accessible, as shown in Eq (1), the gradients of $\pi$ are approximated using samples from $\rho^\beta$. When the approximated gradients constantly deviate significantly from the true values (i.e. the approximated gradients are biased), the updates to $\pi$ essentially become inaccurate and hence ineffective. This suggests that naively disentangling the behavior policy from the target policy alone is no guarantee of improved training efficiency or final performance.

Therefore, to mitigate the aforementioned problem, we would like to reduce the distance between $\beta$ and $\pi$, which naturally reduces the KL-divergence between the distributions $\rho^\beta$ and $\rho^\pi$. One straightforward approach to reduce the distance between the two policies is to restrict the randomness of $\beta$, for example by lowering the entropy of the behavior policy through a smaller $\alpha$ (Eq (3)). However, this inevitably sacrifices $\beta$'s expressiveness, which in turn would also harm ADAC's competitiveness. Alternatively, we propose policy co-training to best maintain the expressiveness of $\beta$ while also stabilizing it by restricting it with regard to $\pi$, motivated by the intrinsic connection between Eqs (1) and (3) (see Section 3.3). As described in Section 4.1, we reiterate that, in a nutshell, both policies are modeled by the same network $f$ and are distinguished only by their different inputs to $f$. During training, $f$ is updated to maximize Eq (4). The method to sample actions from $\beta$ is described in Section 4.1.

We further justify the above choice by demonstrating that the imposed restrictions on $\pi$ and $\beta$ have only a minor influence on $\pi$'s optimality and $\beta$'s expressiveness. To argue this point, we need to revisit Eq (3) one more time: $\pi$ can be viewed as being updated with the fixed input $\xi = 0$, whereas $\beta$ is updated with randomly sampled $\xi$. Intuitively, this keeps the policy $\pi$ optimal, since its action is not affected by the entropy maximization term (i.e. the second term). Meanwhile, $\beta$ remains expressive, since only when the input random variable $\xi$ is close to the zero vector will it be significantly restricted by $\pi$. In Section 5.1, we will empirically demonstrate that policy co-training indeed reduces the distance between $\pi$ and $\beta$ during training, fulfilling its mission.

Additionally, policy co-training enforces the underlying relations between $\pi$ and $\beta$. Specifically, policy co-training forces $\pi$ to be contained in $\beta$, since $\pi(s) = f(0; s)$ is the highest-density point of $\beta(\cdot \mid s)$, and sampling from $\beta$ is likely to generate actions close to those from $\pi$. This matches the intuition that $\pi$ and $\beta$ should share similarities: actions proposed by $\pi$ are rewarding (with respect to $Q^\pi_r$) and thus should be frequently executed by $\beta$.

4.3 Incorporating Intrinsic Reward in Behavior Critic via Critic Bounding

With the help of disentanglement as well as policy co-training, which makes $\pi$ and $\beta$ analogous, we manage to design an expressive behavior policy that not only explores effectively but also helps stabilize $\pi$'s learning process. In this subsection, we aim to achieve the same goal – stability and expressiveness – for a different subject, the behavior critic $Q^\pi_{r+r'}$.

As introduced in Section 4.1, $r$ is the environment-defined reward function, while $r + r'$ includes an additional exploration-oriented intrinsic reward $r'$. As hinted by the notations, ADAC's target critic $Q^\pi_r$ and behavior critic $Q^\pi_{r+r'}$ are defined with regard to the same policy $\pi$ but updated differently, according to the following

$$Q^\pi_r \leftarrow \mathcal{T}^\pi_r Q^\pi_r, \qquad Q^\pi_{r+r'} \leftarrow \mathcal{T}^\pi_{r+r'} Q^\pi_{r+r'} \quad (5)$$
where the updates are performed on minibatches in practice. Note that when no intrinsic reward is used, Eq (5) becomes trivial and the two critics ($Q^\pi_r$ and $Q^\pi_{r+r'}$) are identical. Therefore, we only consider the case where an intrinsic reward exists in the following discussion.
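A minimal sketch of the two updates in Eq (5), written as one-step TD targets on a batch of transitions; the function signature and batch layout are illustrative assumptions. The key point is that both critics bootstrap with actions from the same target policy $\pi$, differing only in their reward.

```python
import numpy as np

def critic_targets(batch, Q_r, Q_ri, pi, gamma=0.99):
    """TD targets for the analogous critic pair (critic bounding): both critics
    bootstrap with the SAME target policy pi; Q_r uses the environment reward r,
    while Q_ri uses the augmented reward r + r_int (intrinsic)."""
    s, a, r, r_int, s2 = batch
    a2 = pi(s2)                        # next action from the TARGET policy, not beta
    y_r  = r + gamma * Q_r(s2, a2)
    y_ri = (r + r_int) + gamma * Q_ri(s2, a2)
    return y_r, y_ri
```

Each critic would then be regressed toward its target with a mean squared error loss; when the intrinsic reward is identically zero, the two targets coincide and a single critic suffices.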

While it is natural that the target critic is updated using the target policy, it may seem counterintuitive that the behavior critic is also updated using the target policy. Given that $\beta$ is updated following the guidance (i.e. through the energy-based objective) of $Q^\pi_{r+r'}$, we do so to prevent $\beta$ from diverging disastrously from $\pi$.

Theorem 4.3. Let $\pi'$ be a greedy policy w.r.t. $Q^\pi_r$ and $\beta'$ be a greedy policy w.r.t. $Q^\pi_{r+r'}$. Under standard assumptions (cf. Munos (2007)), and assuming the intrinsic reward $r'$ is non-negative, we have the following results.

First, the expected policy improvement under $\rho^\pi$, a proxy of training stability, is lower bounded by


Second, the expected policy improvement under the training sample distribution $\rho^\beta$, a proxy of training effectiveness, is lower bounded by


The full proof is deferred to Appendix A. Here, we focus only on the insights conveyed by Theorem 4.3. Intuitively, the first result (i.e. Eq (6)) guarantees training stability by providing a lower bound on our ultimate learning goal – the expected improvement of $\pi$ w.r.t. the task-defined reward $r$; the second result (i.e. Eq (7)) provides a lower bound on the expected improvement of $\beta$, which measures the effectiveness of ADAC in the sense that the better $\beta$'s performance, the higher the quality of the collected samples (since $\beta$ is used to interact with the environment).

Before formalizing the above intuition, we first examine the assumptions made by the theorem. While the other assumptions are generally satisfiable and are commonly made in the RL literature Munos (2007), the assumption on the rewards (non-negativity of the intrinsic reward $r'$) may seem restrictive. However, since most intrinsic rewards are strictly greater than zero (e.g., Houthooft et al. (2016); Fu et al. (2017)), it can easily be satisfied in practice.

To better understand the theorem, we first provide interpretations of its key components. According to the definition of the Bellman optimality operator (Section 3.1), $\mathcal{T}_r Q^\pi_r - Q^\pi_r$ quantifies the improvement on $Q^\pi_r$ after performing one value iteration Bellman (1966) step (w.r.t. $r$, where all states receive a hard update), which is a proxy of the policy improvement in the near future. Therefore, $\mathbb{E}_{(s,a) \sim \rho}\left[\mathcal{T}_r Q^\pi_r - Q^\pi_r\right]$ is the expected policy improvement under state-action distribution $\rho$ in the near future.

We formalize training stability as the lower bound of the expected policy improvement under $\rho^\pi$. This lower bound (Eq (6)) consists of two parts. The second part is greater than zero, since $\pi$ is optimized to maximize the cumulative reward of $r$ while $\beta$ is not. On the other hand, the first part can be viewed as the improvement of $\pi$ during training, since $\rho^\beta$ is the training sample distribution. Therefore, the improvement of $\pi$ during training lower bounds the expected policy improvement under $\rho^\pi$, which represents stability.

Figure 3: Learning curves of ADAC (with base model DDPG) and baselines on the modified CartPole environment. In addition, ADAC’s target (red dots) and behavior policies (solid blue curves) at different timesteps are plotted below the learning curves.

Conversely, the lower bound in Eq (7) reflects the effectiveness of the training procedure. Note that most intrinsic rewards are designed to be small in states that are frequently visited. Therefore, when the state-action pairs drawn from $\rho^\beta$ mostly visit states that are frequently visited by $\pi$, which is promised by the policy co-training approach (Section 4.2), the expected intrinsic reward $\mathbb{E}_{\rho^\beta}[r']$ will be small. Therefore, even if $\rho^\beta$ and $\rho^\pi$ are not identical, as long as $\beta$ allows substantial visitation of the high-probability states in $\rho^\pi$ to make $\mathbb{E}_{\rho^\beta}[r']$ sufficiently small, the improvement when trained on the samples from $\rho^\beta$ will be almost as large as the training improvement on the target distribution $\rho^\pi$, which indicates effectiveness.

Table 1: Specifications of our action and reward designs for the modified CartPole task. The original task consists of two discrete actions, left and right, each pushing the cart towards its corresponding direction. We converted them into a single-dimension continuous action.

5 Experiments

In this section, we take gradual steps to analyze and illustrate our proposed method ADAC. Specifically, we first investigate the behavior of our analogous disentangled behavior policy (Section 5.1). Next, we perform an empirical evaluation of ADAC without intrinsic rewards on 14 standard continuous-control benchmarks (Section 5.2). Finally, encouraged by its promising performance, and to further justify the critic bounding method, we examine ADAC with intrinsic rewards in 4 sparse-reward and hence exploration-heavy environments (Section 5.3). Throughout this paper, we highlight two benefits of the analogous disentangled nature of ADAC: (i) avoiding unnecessary trade-offs between current optimality and exploration (i.e. a more expressive and effective behavior policy); (ii) natural compatibility with intrinsic rewards without altering environment-defined optimality. In this context, the first two subsections are devoted to demonstrating the first benefit and the last subsection is dedicated to the second.

RoboschoolAnt 2219±373 838.1±97.1* 2903±666 450.0±27.9 2726±652 1280±71
RoboschoolHopper 2299±333 766.5±10* 2302±537 543.8±307 2089±657 1229±345
RoboschoolHalfCheetah 1578±166 1711±95* 607.2±246.2 441.6±120.4 807.0±252.6 1225±184.2
RoboschoolAtlasForwardWalk 234.6±55.7 186.7±37.9* 190.6±50.1 52.63±26.2 126.0±47.1 107.6±29.4
RoboschoolWalker2d 1769±452 1564±651* 995.1±146.3 208.7±137.1 1021±263 578.9±231.3
Ant 3353±847 1226±18* 4034±517 370.5±223 4291±1498 1401±168
Hopper 3598±374 374.5±36.5* 2845±609 38.93±0.88 3307±825 1555±458
HalfCheetah 9392±199 2238±40* 10526±2367 1009±49 11541±2989 881.7±10.1
Walker2d 5122±1314 1291±42* 4630±778 186.2±33.3 4067±1211 1146±368
InvertedPendulum 1000±0 1000±0* 1000±0 1000±0* 1000±0 98.90±2.08
InvertedDoublePendulum 9359±0.17 9334±1.39* 7665±566 27.20±2.61 9353±2896 98.90±5.88
BipedalWalker 309.8±15.6 -52.77±1.94* 288.4±51.25 -123.90±11.17 307.2±57.92 266.9±28.52
BipedalWalkerHardcore -10.76±27.70 -98.52±3.21 -57.97±21.08 -50.05±10.27* -127.4±45.2 -105.3±22.2
LunarLanderContinuous 290.0±50.9 85.67±23.42* 289.7±54.1 -65.89±96.48 283.3±69.29 59.32±68.44
Table 2: Continuous-control performance on 14 benchmark environments. Average episode return (± standard deviation) over 20 trials is reported. Bold indicates the best average episode return. One marker indicates the better performance between ADAC (TD3) and its base model TD3; similarly, another marker indicates the better performance between ADAC (DDPG) and its base model DDPG. In all three cases, values that are statistically indistinguishable (p > 0.05 in a t-test) from the respective should-be-indicated ones are marked as well.
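The statistical-comparability criterion in the caption (p > 0.05 under a t-test over the 20 trials) can be sketched as follows; using Welch's form of the t statistic is an assumption, since the exact test variant is not specified in the text.

```python
import math
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t statistic for two independent samples of per-trial returns."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    na, nb = len(a), len(b)
    # Difference of means, scaled by the combined standard error.
    return (mean(a) - mean(b)) / math.sqrt(va / na + vb / nb)
```

With 20 trials per method, an absolute t value below roughly 2.0 corresponds to p > 0.05, i.e. the two methods would be marked as statistically comparable.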

5.1 Analysis of Analogous Disentangled Behavior Policy

Since we are largely motivated by the freedom to design an expressive exploration strategy that the disentangled nature of our framework offers, it is natural that we first investigate how well our behavior policy lives up to this expectation. Yet, as discussed in Section 4.2, in order to aid stable policy updates we deliberately place restraints on our behavior policy, making it analogous to the target policy, which means our behavior policy may be less expressive than it would otherwise be. Given this, we start this set of empirical experiments by investigating whether our behavior policy is still expressive enough, which is measured by its coverage (i.e. does it explore a wide enough action/policy space outside the current target policy). To further examine the influence of the added restraints, we examine the policy network's stability (i.e. does policy co-training lower the bias between the two policies and stabilize the learning process). Finally, we focus on the effectiveness of our behavior policy by measuring the overall performance of ADAC (i.e. does ADAC's exploration strategy efficiently lead the target policy to iteratively converge to a more desirable local optimum).

Setup For ease of illustration, we choose a straightforward environment, namely CartPole Brockman et al. (2016), as our demonstration bed. The goal in this environment is to balance the pole attached to a cart by applying left/right force. For compatibility with continuous control and a better modeling of real-world implications, we modified CartPole's original discrete action space and added an effort penalty to the rewards, as specified in Table 1. To demonstrate the advantages of our behavior policy, we choose DDPG with two commonly used exploration strategies as the main baselines, i.e., Gaussian noise and Ornstein-Uhlenbeck process noise Uhlenbeck and Ornstein (1930), both with two variance levels. For fair comparison, we only present DDPG-based ADAC here (or simply ADAC later in this subsection). To further demonstrate the benefits of disentanglement, we choose SAC as another baseline. As discussed earlier in related work, SAC similarly utilizes energy-based policies, yet, in contrast to our approach, its exploration is embedded into its target policy.
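For reference, the temporally correlated Ornstein-Uhlenbeck exploration noise used by the DDPG baselines can be sketched as below; the theta, sigma, and dt values are common defaults, not necessarily the ones used in these experiments.

```python
import math
import random

class OrnsteinUhlenbeckNoise:
    """Temporally correlated exploration noise, commonly added to DDPG actions."""

    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = [0.0] * dim
        self.rng = random.Random(seed)

    def sample(self):
        # Euler discretization: dx = -theta * x * dt + sigma * sqrt(dt) * N(0, 1).
        self.x = [
            xi - self.theta * xi * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for xi in self.x
        ]
        return list(self.x)
```

The mean-reversion term pulls the noise back toward zero, so consecutive perturbations drift smoothly rather than jumping independently, which is the property that distinguishes it from plain Gaussian noise.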

Empirical Insights To minimize distraction, our discussion starts by closely examining ADAC's behavior and target policies alone. First, see the cells at the bottom of Figure 3, which are snapshots of the behavior and target policies at different training stages. As suggested by the wide bell shape of the solid blue curve in the first cell, our behavior policy acts curiously when ignorant about the environment, extensively exploring all possible actions, including those far away from the target policy (represented by the red dots). Yet having such broad coverage alone is still not sufficient to escape the trap, present at the beginning of training, of getting stuck in a deceiving local optimum. As suggested by the bimodal shape of the solid blue curve in the second cell, after acquiring a preliminary understanding of the environment the agent starts to form preferences for some actions when exploring. Almost at the same time, the target policy no longer stays close to 0.0 (represented by the intersection of the two axes), suggesting that the behavior policy is effective in leading the target policy towards a more desirable place. This is further corroborated by the third and fourth cells. In the late stage, besides being able to balance the pole, our agent even manages to learn to exert actions with small absolute values from time to time to avoid the effort penalty.

Other than its expressiveness, stability critically influences ADAC's overall performance, which is by design controlled by the proposed policy co-training approach. To examine its effect, we perform an ablation study: we compare ADAC with policy co-training against ADAC without it.6 The effect is measured by the bias between the behavior and target policies, which is shown in the middle of Figure 3. We can see that ADAC has much lower bias than its variant without policy co-training. Additionally, policy co-training does not affect the expressiveness of the behavior policy, as suggested by the behavior policies rendered at the bottom of Figure 3.

Finally, we move our attention to the learning curves in Figure 3: ADAC exceeds the baselines in both learning efficiency (i.e. being the first to consistently accumulate positive rewards) and final performance. Unlike our behavior policy, exploration through random noise is unguided, resulting in either wasted exploration of unpromising regions or insufficient exploration of rewarding areas. This largely explains the noticeable performance gap between DDPG with random noise and ADAC. On the other hand, SAC bears an expressive policy similar to our behavior policy. However, having no separate behavior policy, SAC has to consistently take sub-optimal actions into account to aid exploration, adversely affecting its policy-improvement process. In other words, unlike ADAC, SAC cannot fully exploit its learned knowledge of the environment (i.e. its value functions) when constructing its target policy, leading to performance inferior to ADAC's.

5.2 Comparison with the State of the Art

Though well suited for illustration, CartPole alone is neither challenging nor general enough to fully manifest ADAC's competitiveness. In this subsection, we show that ADAC achieves state-of-the-art performance on standard benchmarks.

Setup To demonstrate the generality of our method, we construct a 14-task testbed suite composed of qualitatively diverse continuous-control environments from the OpenAI Gym toolkit Brockman et al. (2016). On top of the two baselines adopted earlier (i.e., DDPG and SAC), we further include TD3 Fujimoto et al. (2018a), which improves upon DDPG by addressing some of its function-approximation errors; PPO Schulman et al. (2017b), which is regarded as one of the most stable and efficient on-policy policy-gradient algorithms; and GEP-PG Colas et al. (2018), which combines the Goal Exploration Process Péré et al. (2018) with policy gradient to perform curious exploration as well as stable learning. Though not exhaustive, this baseline suite embodies many of the latest advancements and can be deemed the existing state of the art. However, we compare with GEP-PG only on tasks adopted in its original experiments: since the GEP part of the algorithm needs hand-crafted exploration goals, generalizing it to other tasks is nontrivial. To best reproduce the remaining baselines' performance, we use their original open-source implementations when released; otherwise, we build our own versions after the most-starred third-party implementations on GitHub. Furthermore, we fine-tune their hyper-parameters around the values reported in the respective literature and only coarsely tune the hyper-parameters introduced by ADAC. All experiments are run for 1 million time-steps, or until reaching performance convergence, whichever happens earlier.7

Empirical Insights Table 2 corroborates ADAC's competitiveness, stemming from its disentangled nature, over existing methods. More importantly, these results reveal two desirable properties of ADAC's full compatibility with existing off-policy methods. First, ADAC consistently outperforms the method it is based on. Compared to its base model, DDPG-based ADAC achieves statistically better or comparable performance on all but two of the benchmarks and obtains identical performance on one of the remaining two. Though not as remarkable as DDPG-based ADAC, TD3-based ADAC also manages to achieve statistically better or comparable performance over its base model on the majority of the tasks. Second, ADAC retains the benefits of the improvements developed by the base models themselves. This is best illustrated by TD3-based ADAC's performance superiority over DDPG-based ADAC.

We would like to specifically call readers' attention to our comparison of ADAC against SAC, since both use energy-based policies. This comparison also reveals the benefit brought by the disentangled structure and the analogous actors and critics. ADAC (TD3) achieves better average performance than SAC on 71% (10/14) of the benchmarks, indicating the effectiveness of our proposed analogous disentangled structure.

Despite also using the disentanglement idea, we do not compare with GEP-PG Colas et al. (2018) across all 14 benchmarks, and hence GEP-PG is not included in Table 2, because the Goal Exploration Process (GEP) in GEP-PG requires manually defining a goal space to explore, which is task-dependent and critically influences the algorithm's performance. Therefore, we only compare our results on the two experiments they ran, of which only one overlaps with our task suite, namely HalfCheetah. On HalfCheetah, GEP-PG achieves a cumulative reward of 6118, while ADAC (TD3) achieves 9392, showing superiority over GEP-PG. Furthermore, as acknowledged in its own paper, GEP-PG lags behind SAC in performance, which suggests that naively disentangling the behavior policy from the target policy does not guarantee competitive performance. Rather, to design an effective disentangled actor-critic, attention must also be paid to how best to restrict its components.

When considering all reported methods together, TD3-based ADAC obtains the largest number of state-of-the-art results; as indicated in bold, it is the best performer (or statistically comparable with the best) on 10 of the 14 benchmarks.

5.3 Evaluation in Sparse-Reward Environments

Figure 4: Illustration of how intrinsic reward contaminates the environment-defined optimality in PendulumSparse. Fooled into collecting intrinsic rewards rather than environment rewards (see the learning curves on the left), the agent constantly alternates between spinning the pendulum and barely moving it (see the snapshots of the target policy on the right), making no real progress.

Encouraged by the promising results observed on the benchmarks, in this subsection we evaluate ADAC in more challenging environments, in which rewards are barely provided. This set of experiments aims to test ADAC's exploration capacity under extreme settings. Furthermore, we also see these environments as fitting demonstration beds for ADAC's natural compatibility with intrinsic methods. In this regard, we are particularly interested in investigating whether the disentangled nature of ADAC helps mitigate intrinsic rewards' undesirable bias effect on the environment-defined optimality.

Setup To the surprise of many, sparse-reward environments turn out to be relatively unpopular in commonly used RL toolkits. Besides including the classic MountainCarContinuous and Acrobot (after converting its action space to be continuous), to construct a decently sized testing suite we further hand-craft new tasks, namely PendulumSparse and CartPoleSwingUpSparse, by sparsifying the rewards in existing environments. Sparsifying is achieved mainly by suppressing the original rewards until some predefined threshold is reached.8 Due to their dependency on environment-provided rewards as feedback signals, most model-free RL algorithms suffer significant performance degradation in these sparse-reward tasks. In this situation, resorting to intrinsic motivation (IM) methods for additional signals is widely considered the go-to solution. Among a wide variety of IM methods, we adopt Variational Information Maximization Exploration (VIME) Houthooft et al. (2016) as our intrinsic reward generator for its consistently good performance on a wide variety of exploration-challenging tasks. Considering TD3-based ADAC's superiority over DDPG-based ADAC, we only combine VIME with TD3 and TD3-based ADAC. Note that when paired with ADAC, intrinsic rewards are only visible to the behavior policy.
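A minimal sketch of the sparsifying recipe described above, assuming a wrapper that tracks cumulative progress; the threshold and the notion of progress are placeholders here, not the actual per-task definitions used in PendulumSparse or CartPoleSwingUpSparse.

```python
class SparsifiedReward:
    """Suppress per-step rewards until accumulated progress crosses a threshold.

    Progress is measured here as the running sum of the dense reward; the real
    tasks may use a task-specific quantity (e.g. pole angle) instead.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.progress = 0.0

    def reward(self, dense_reward):
        self.progress += dense_reward
        # Before the threshold is reached, the agent sees zero reward.
        return dense_reward if self.progress >= self.threshold else 0.0

    def reset(self):
        # Call at the start of each episode.
        self.progress = 0.0
```

In practice such a wrapper would be applied between the environment and the agent, so that all baselines see the same sparsified signal.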

Figure 5: Learning curves for four sparse-reward tasks. Lines denote the average over 20 trials and the shaded areas represent the range of one standard deviation.

Empirical Insights Among the four environments, PendulumSparse has the most vulnerable environment-defined optimality. The goal here is to swing the inverted pendulum up so it stays upright. As suggested by Figure 4, not knowing how to distinguish between intrinsic and environment rewards, VIME-augmented TD3 is completely fooled into chasing after the intrinsic rewards. In other words, VIME-augmented TD3's understanding of what is optimal is completely off from the true environment-defined optimality. Note, as demonstrated in the bottom-left part of Figure 5, that VIME-augmented TD3's performance even trails behind TD3's, indisputable evidence that the bias introduced by IM can be detrimental and should be addressed whenever possible. In contrast, thanks to its disentangled nature, VIME-augmented ADAC only perceives intrinsic rewards in its behavior policy, which means its target policy always remains optimal with regard to our current knowledge of environment rewards. Because of this, VIME-augmented ADAC manages to consistently solve this exploration-challenging task. ADAC's natural compatibility with VIME is further corroborated by the results in the remaining three tasks. As suggested by the complete Figure 5, VIME-augmented ADAC consistently surpasses all reported alternatives by a large margin in terms of both convergence speed and final performance.

6 Conclusion

We present Analogous Disentangled Actor-Critic (ADAC), an off-policy reinforcement learning framework that explicitly disentangles the behavior and target policies. Compared to prior work, to stabilize model updates, we restrain our behavior policy and its corresponding critic to be analogous to their target counterparts. Thanks to its disentangled and analogous nature, ADAC achieves state-of-the-art results on 10 out of 14 continuous-control benchmarks. Moreover, ADAC is naturally compatible with intrinsic rewards, outperforming alternatives in exploration-challenging tasks.

Acknowledgements This work is partially supported by NSF grants #IIS-1943641, #IIS-1633857, #CCF-1837129, DARPA XAI grant #N66001-17-2-4032, UCLA Samueli Fellowship, and gifts from Intel and Facebook Research.

Supplementary Material

Appendix A Theoretical Results

This section provides the full proof of Theorem 4.3, which guarantees the training stability as well as the training effectiveness of the critic bounding approach (Section 4.3).

Proof of Theorem 4.3

We define the optimal value functions with respect to the environment reward and to the intrinsically augmented reward, respectively. Our proof is built upon the foundational result stated in the following lemma. For the sake of a smoother presentation, we defer its proof until after we finish proving the theorem.

Lemma \thetheorem

Under the definitions and assumptions made in Theorem 4.3 and the above paragraph, we have the following result


Recall that these are the optimal value functions with respect to the environment reward and the augmented reward, respectively. By definition, we have and (since ).

Result on training effectiveness (i.e. Eq (7)) We are now ready to prove the second result stated in the theorem. Since , we have . Plugging in Eq (8) and using the equality

we have

which is equivalent to the second result stated in the theorem (Eq (7)).

Result on training stability (i.e. Eq (6)) To prove the first result stated in the theorem, we start by rearranging Eq (8):


where uses the inequality , and follows from . Rewriting Eq (9) gives us the first result stated in the theorem (Eq (6)):

Proof of Lemma A.

Before delving into the detailed derivation, we make the following clarifications. First, although is a greedy policy w.r.t. , is not the optimal value function w.r.t. and . In other words, is guaranteed to hold yet we might have . Second, in both the theorem and the proof, we omit the state-action notation (e.g., ) for the sake of simplicity.

We begin with the difference between the respective optimal value functions with regard to the two rewards:


where is the state probability transition operator with respect to the environment dynamics and policy ; uses the equality ; adopts the fact that . Combining the terms and gives us


where is the identity operator, i.e. . We define . By definition, given the initial state-action distribution , is the state-action marginal distribution with respect to and policy . We can easily verify that and since by definition, .

Next, we derive the connection between the two quantities, which is closely related to a result given by Munos (2007):

where the preceding results are used. Plugging into Eq (11), we have

Combining the above equation with Eq (10), we get

Appendix B Algorithmic Details of ADAC

1:  input: environment , batch size , maximum episode length , dimension of the action space , mini-batch size , , and .
2:  initialize: networks , , and ; target networks , , and ; replay buffer . and correspond to and in the main text, respectively. ; ; .
3:  Define the deterministic target policy and the stochastic behavior policy with :
4:  where and are input to the neural network (see Figure 6 for its structure). For the target policy, is fixed as a -dimensional zero vector. To sample an action from the behavior policy, we first sample from , feed it into together with , and add Gaussian noise .
5:  repeat
6:     Reset the environment and receive the initial state .
7:     for  do
8:        Sample from according to Eq (13).
9:        Execute in and observe environment reward , intrinsic reward (from any intrinsic motivation approach, or simply set to zero), and the next state .
10:        Store tuple in replay buffer .
11:        Call training procedure
12:        Break if the current episode terminates on .
13:     end for
14:  until convergence or reaching the pre-defined number of training steps
15:  return
17:  training procedure
18:      Sample a minibatch of samples from .
19:      Update and by minimizing the losses
20:      where and .   //Update the critic
21:      Update following the gradient
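The action-sampling step of the behavior and target policies (lines 3-4 and 8 above) can be sketched as follows; `policy_net` is a hypothetical stand-in for the actor network, and giving the latent noise the action's dimensionality is an assumption, since the exact dimension stated in the text was lost.

```python
import random

def sample_action(policy_net, state, act_dim, sigma=0.1,
                  deterministic=False, rng=random):
    """Sketch of ADAC's action sampling.

    `policy_net(state, xi)` stands in for the actor network that takes the
    state and a latent noise vector xi and returns an action.
    """
    if deterministic:
        # Target policy: xi fixed to a zero vector.
        xi = [0.0] * act_dim
        return policy_net(state, xi)
    # Behavior policy: sample xi ~ N(0, I), then add Gaussian smoothing noise.
    xi = [rng.gauss(0.0, 1.0) for _ in range(act_dim)]
    action = policy_net(state, xi)
    return [a + rng.gauss(0.0, sigma) for a in action]
```

With `deterministic=True` this reduces to the deterministic target policy, so a single network serves both roles, which is the analogous structure the algorithm relies on.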