Towards Effective Context for Meta-Reinforcement Learning: an Approach based on Contrastive Learning


Context, the embedding of previously collected trajectories, is a powerful construct for Meta-Reinforcement Learning (Meta-RL) algorithms. By conditioning on an effective context, Meta-RL policies can generalize to new tasks within a few adaptation steps. We argue that improving the quality of context involves answering two questions: 1. How to train a compact and sufficient encoder that can embed the task-specific information contained in prior trajectories? 2. How to collect informative trajectories whose corresponding context reflects the specification of tasks? To this end, we propose a novel Meta-RL framework called CCM (Contrastive learning augmented Context-based Meta-RL). We first focus on the contrastive nature behind different tasks and leverage it to train a compact and sufficient context encoder. Further, we train a separate exploration policy and theoretically derive a new information-gain-based objective that aims to collect informative trajectories in a few steps. Empirically, we evaluate our approach on common benchmarks as well as several complex sparse-reward environments. The experimental results show that CCM outperforms state-of-the-art algorithms by addressing the two problems above.


1Tianjin University, {haotianfu, bluecontra},
2Noah’s Ark Lab, Huawei, {haojianye, chenchen9, lidong106, liuwulong}
3Department of Automation, Tsinghua University,


Reinforcement Learning (RL) combined with deep neural networks has achieved impressive results on various complex tasks DBLP:journals/nature/MnihKSRVBGRFOPB15; DBLP:journals/corr/LillicrapHPHETS15; DBLP:conf/icml/SchulmanLAJM15. Conventional RL agents need a large number of environment interactions to train a single policy for one task. However, in real-world problems many tasks share similar internal structures, and we expect agents to adapt to such tasks quickly based on prior experience. Meta-Reinforcement Learning (Meta-RL) addresses such problems by learning how to learn DBLP:journals/corr/WangKTSLMBKB16. Given a number of tasks with similar structures, Meta-RL methods aim to capture the common knowledge from previous experience on training tasks and adapt to a new task with only a small number of interactions.

Based on this idea, many Meta-RL methods try to learn a general model initialization and update the parameters during adaptation DBLP:conf/icml/FinnAL17; DBLP:conf/iclr/RothfussLCAA19. Such methods require on-policy meta-training and have been shown empirically to be sample inefficient. To alleviate this problem, a number of methods DBLP:conf/icml/RakellyZFLQ19; DBLP:conf/iclr/FakoorCSS20 have been proposed to meta-learn a policy that is able to adapt with off-policy data by leveraging context information. Typically, an agent adapts to a new environment by inferring a latent context from a small number of interactions with the environment. The latent context is expected to capture the distribution of tasks and to infer new tasks efficiently. Context-based Meta-RL methods then train a policy conditioned on the latent context to improve generalization.

As the key component of context-based Meta-RL, the quality of the latent context can significantly affect an algorithm’s performance. However, current algorithms are sub-optimal in two respects. First, the training strategy for the context encoder is flawed. A desirable context should extract only task-specific information from trajectories and discard everything else. However, the latent context learned by existing methods (e.g. by recovering the value function DBLP:conf/icml/RakellyZFLQ19 or by dynamics prediction DBLP:journals/corr/abs-2005-06800; DBLP:conf/iclr/ZhouPG19) is quite noisy, as it may model irrelevant dependencies and ignore some task-specific information. Instead, we propose to directly analyze and discriminate the underlying structure behind different tasks’ trajectories by leveraging contrastive learning. Second, prior context-based Meta-RL methods ignore the importance of collecting informative trajectories for generating a distinctive context. If the exploration process does not collect transitions that reflect the task’s individual properties and distinguish it from dissimilar tasks, the latent context will be ineffective. For instance, tasks in one distribution often vary only in their final goals, which means the transition dynamics remain the same in most of the state space. Without a good exploration policy, it is hard to obtain information that distinguishes tasks from each other, which leads to a poor context.

In this paper, we propose a novel off-policy Meta-RL algorithm, CCM (Contrastive learning augmented Context-based Meta-RL), which aims to improve the quality of context by tackling the two aforementioned problems. Our first contribution is an unsupervised training framework for the context encoder that leverages contrastive learning. The main insight is that, by treating transitions from the same task as positive samples and those from different tasks as negative samples, contrastive learning can directly distinguish contexts in the latent space without modeling irrelevant dependencies. The second contribution is an information-gain-based exploration strategy. With the goal of collecting trajectories that are as informative as possible, we theoretically derive a lower-bound estimate of the exploration objective within the contrastive learning framework. This estimate is then used as an intrinsic reward on which a separate exploration agent is trained. The effectiveness of CCM is validated on a variety of continuous control tasks. The experimental results show that CCM outperforms state-of-the-art Meta-RL methods by generating high-quality latent context.


Meta-Reinforcement Learning

In the meta-reinforcement learning (Meta-RL) scenario, we assume a distribution of tasks $p(\mu)$. Each task $\mu \sim p(\mu)$ shares a similar structure and corresponds to a different Markov Decision Process (MDP), $M_\mu = \{\mathcal{S}, \mathcal{A}, T_\mu, R_\mu\}$, with state space $\mathcal{S}$, action space $\mathcal{A}$, transition distribution $T_\mu$, and reward function $R_\mu$. We assume that the transition dynamics, the reward function, or both vary across tasks. Following prior problem settings in DBLP:journals/corr/DuanSCBSA16; DBLP:conf/iclr/FakoorCSS20; DBLP:conf/icml/RakellyZFLQ19, we define a meta-test trial as a fixed number of episodes in the same MDP, in which the first few episodes are used for exploration and the remaining episodes for execution, leveraging the data collected in the exploration phase. We further define a transition batch sampled from task $\mu$ as $b_\mu$. A trajectory $\tau_\mu$ is the special case of a transition batch in which all transitions are consecutive. For context-based Meta-RL, the agent’s policy typically depends on all prior transitions $\tau_{1:T} = \{(s_1, a_1, r_1, s'_1), \dots, (s_T, a_T, r_T, s'_T)\}$ collected by the exploration policy $\pi_{exp}$. The agent first consumes the collected trajectories and outputs a latent context $z$ through the context encoder $q(z \mid \tau_{1:T})$, then executes the policy $\pi_{exe}$ conditioned on the current state and the latent context $z$. The goal of the agent is therefore to maximize the expected return obtained by $\pi_{exe}$ over tasks $\mu \sim p(\mu)$.

Contrastive Representation Learning

The key component of representation learning is the way to efficiently learn rich representations of given input data. Contrastive learning has recently been widely used for this purpose. The core idea is to learn a representation function that maps semantically similar data closer together in the embedding space. Given a query $q$ and keys $\mathbb{K} = \{k_0, k_1, \dots\}$, the goal of contrastive representation learning is to ensure that $q$ matches the positive key $k^{+}$ more than any other key in the data set. Empirically, the positive key and the query are often obtained by taking two augmented versions of the same image, and negative keys are obtained from other images.

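Previous work proposes the InfoNCE loss DBLP:journals/corr/abs-1807-03748 to score the positive key $k^{+}$ higher than a set of distractors. With $K$ candidate keys $k_1, \dots, k_K$ (the positive key and $K-1$ distractors), the loss is

$$\mathcal{L}_{NCE} = -\mathbb{E}\left[\log \frac{\exp(f(q, k^{+}))}{\sum_{j=1}^{K} \exp(f(q, k_j))}\right], \tag{1}$$

where the function $f$ calculates the similarity score between query and key data and is usually modeled as a bilinear product, i.e. $f(q, k) = q^{\top} W k$ DBLP:journals/corr/abs-1905-09272. As proposed and proved in DBLP:journals/corr/abs-1807-03748, minimizing the InfoNCE loss is equivalent to maximizing a lower bound of the mutual information between $q$ and $k$:

$$I(q; k) \geq \log(K) - \mathcal{L}_{NCE}. \tag{2}$$

The lower bound becomes tighter as $K$ increases.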
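
To make the loss concrete, the following is a minimal sketch in plain Python for a single query, assuming a bilinear similarity and a denominator that runs over the positive key and all distractors. The function names `bilinear_score` and `info_nce` and the toy vectors are illustrative assumptions, not part of any reference implementation.

```python
import math

def bilinear_score(q, W, k):
    """Similarity f(q, k) = q^T W k for plain Python lists."""
    Wk = [sum(W[i][j] * k[j] for j in range(len(k))) for i in range(len(q))]
    return sum(q[i] * Wk[i] for i in range(len(q)))

def info_nce(q, positive_key, distractors, W):
    """InfoNCE loss for one query: negative log-softmax of the positive score.

    The denominator runs over the positive key and all distractors.
    """
    scores = [bilinear_score(q, W, positive_key)]
    scores += [bilinear_score(q, W, k) for k in distractors]
    m = max(scores)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - scores[0]

# Toy example (hypothetical values): an identity W reduces f to a dot product.
W = [[1.0, 0.0], [0.0, 1.0]]
q = [1.0, 0.0]
k_pos = [0.9, 0.1]
k_neg = [[-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

loss = info_nce(q, k_pos, k_neg, W)
# Single-sample estimate of the bound log(K) - L_NCE with K = 4 candidates.
mi_lower_bound = math.log(1 + len(k_neg)) - loss
```

Because the positive key has the highest score here, the loss stays below $\log K$ and the single-sample estimate of the bound is positive; in practice the loss is averaged over many queries.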


In this section, we describe our algorithm CCM. We first introduce a novel context encoder training strategy and then propose an information-gain-based exploration strategy that collects informative trajectories for effective task inference.

Contrastive Context Encoder

The context encoder is the key component of the context-based Meta-RL framework, and training a powerful one is non-trivial. One straightforward method is to train the encoder end-to-end from the RL loss (i.e. recovering the state-action value function DBLP:conf/icml/RakellyZFLQ19). However, the update signal from the value function is stochastic and weak, and may not capture the similarity relations among tasks. Moreover, recovering the value function can only implicitly capture high-level task-specific features and may ignore low-level, detailed transition differences that contain relevant information as well. Another family of methods resorts to dynamics prediction DBLP:journals/corr/abs-2005-06800; DBLP:conf/iclr/ZhouPG19. The main insight is to capture task-specific features by distinguishing the varying dynamics among tasks. However, relying entirely on low-level reconstruction of states or actions is prone to overfitting to the commonly shared transition dynamics and to modeling irrelevant dependencies, which may hinder the learning process.

Figure 1: Contrastive Context Encoder

These two existing methods train the context encoder by extracting high-level or low-level task-specific features, but the result is either not sufficient, because useful information is ignored, or not compact, because irrelevant dependencies are modeled. We therefore aim to train a compact and sufficient encoder by extracting mid-level task-specific features. We propose to directly extract the underlying task structure behind trajectories by performing contrastive learning on the natural distinctions between trajectories. By explicitly comparing different tasks’ trajectories instead of each transition’s dynamics, the encoder is prevented from modeling commonly shared information while still being able to capture all the relevant task-specific structure by leveraging the contrastive nature behind different tasks.

Contrastive learning methods learn representations by pulling together semantically similar data points (positive data pairs) while pushing apart dissimilar ones (negative data pairs). We treat trajectories sampled from the same task as positive data pairs and trajectories from different tasks as negative data pairs. The contrastive loss is then minimized to bring the latent contexts of trajectories from the same task closer together in the embedding space while pushing apart the contexts from other tasks.

Concretely, assume a training task set containing $N$ different tasks from the task distribution $p(\mu)$. We first generate trajectories with the current policy for each task and store them in the replay buffer. At each training step, we first sample a task from the training task set, and then randomly and independently sample two batches of transitions $b^{q}$ and $b^{k}$ from that task. $b^{q}$ serves as the query in the contrastive learning framework, while $b^{k}$ is the corresponding positive key. We also randomly sample batches of transitions from the other tasks as negative keys. Note that previous work (e.g. CURL DBLP:journals/corr/abs-2004-04136) relies on data augmentation of images, such as random crops, to generate positive observations. In contrast, independent transitions sampled from the replay buffer of the same task naturally form positive samples.
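
The sampling scheme described above can be sketched as follows. This is an illustrative sketch only: the per-task buffer layout, the function name `sample_contrastive_batches`, and the choice of one negative batch per other task are assumptions made for the example.

```python
import random

def sample_contrastive_batches(buffers, batch_size, rng=random):
    """Sample a query, a positive key, and negative keys from per-task buffers.

    buffers: dict mapping task id -> list of transitions (s, a, r, s_next).
    Returns (task_id, query_batch, positive_batch, negative_batches).
    """
    task_id = rng.choice(sorted(buffers))
    # Two independent draws from the same task give the query and positive key.
    query = rng.sample(buffers[task_id], batch_size)
    positive = rng.sample(buffers[task_id], batch_size)
    # Assumption: one batch from every other task serves as a negative key.
    negatives = [
        rng.sample(buffers[other], batch_size)
        for other in sorted(buffers)
        if other != task_id
    ]
    return task_id, query, positive, negatives

# Toy buffers (hypothetical): the reward encodes the task id for illustration.
rng = random.Random(0)
buffers = {
    t: [((rng.random(),), (rng.random(),), float(t), (rng.random(),))
        for _ in range(50)]
    for t in range(4)
}
task_id, query, positive, negatives = sample_contrastive_batches(buffers, 8, rng)
```

No data augmentation is involved: the positive pair arises purely from drawing twice from the same task's buffer.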

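As shown in Figure 1, after obtaining the query and key data, we map them into latent contexts $z^{q}$ and $z^{k}$ respectively, on which we compute the similarity score and the contrastive loss used to train the encoder:

$$\mathcal{L}_{contrastive} = -\mathbb{E}\left[\log \frac{\exp(f(z^{q}, z^{k^{+}}))}{\exp(f(z^{q}, z^{k^{+}})) + \sum_{j} \exp(f(z^{q}, z^{k^{-}_{j}}))}\right], \tag{3}$$

where $z^{k^{+}}$ is the latent context of the positive key and $z^{k^{-}_{j}}$ are the latent contexts of the negative keys.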
Following the settings in DBLP:conf/cvpr/He0WXG20; DBLP:journals/corr/abs-2004-04136, we use a momentum-averaged version of the query encoder to encode the keys. The reinforcement learning part takes the latent context as an additional observation, then executes its policy and is updated separately. Note that the Contrastive Context Encoder is a generic framework and can be integrated with any context-based Meta-RL algorithm.
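
A momentum-averaged key encoder is commonly maintained as an exponential moving average of the query encoder's parameters. The sketch below illustrates that update on flat lists of floats; the function name `momentum_update` and the momentum values are assumptions chosen for the example, not settings reported in this paper.

```python
def momentum_update(query_params, key_params, momentum=0.99):
    """Return key-encoder parameters moved towards the query encoder.

    new_key = momentum * key + (1 - momentum) * query, applied element-wise.
    Parameters are represented as flat lists of floats for simplicity.
    """
    return [
        momentum * k + (1.0 - momentum) * q
        for q, k in zip(query_params, key_params)
    ]

query_params = [1.0, -2.0, 0.5]  # updated by gradient descent (not shown)
key_params = [0.0, 0.0, 0.0]     # never receives gradients directly

# Hypothetical momentum of 0.9 so the drift is visible after a few steps.
for _ in range(3):
    key_params = momentum_update(query_params, key_params, momentum=0.9)
```

The key encoder therefore changes slowly, which keeps the keys consistent across training steps while the query encoder is updated by gradients.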

Information-gain-based Exploration

Even with a compact and sufficient encoder, the context is still ineffective if the exploration process does not collect enough transitions that reflect the new task’s specific properties and distinguish it from dissimilar tasks. Some previous approaches DBLP:conf/icml/RakellyZFLQ19; DBLP:conf/iclr/RothfussLCAA19 use Thompson sampling Schmidhuber1987EvolutionaryPI to explore, in which an agent needs to explore in the initial few episodes and execute optimally in the subsequent episodes. However, these two processes are actually carried out by one single policy, which is trained end-to-end by maximizing the expected return. This means that the exploration policy is limited by the learned execution policy. When adapting to a new task, the agent tends to explore only experiences that are useful for solving previously trained tasks, making the adaptation process less effective.

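Instead, we decouple the exploration and execution policies and define an exploration agent that aims to collect trajectories that are as informative as possible. We achieve this goal by encouraging the exploration policy to maximize the information gain from collecting a new transition $\tau_i$ at time step $i$ of task $\mu$:

$$I(z; \tau_i \mid \tau_{1:i-1}) = H(z \mid \tau_{1:i-1}) - H(z \mid \tau_{1:i}), \tag{4}$$

where $\tau_{1:i-1}$ denotes the transitions collected before time step $i$. The above equation can also be interpreted as how much the task belief has changed given the newly collected transition. We further transform the equation as follows:

$$\begin{aligned} I(z; \tau_i \mid \tau_{1:i-1}) &= H(z \mid \tau_{1:i-1}) - H(z \mid \tau_{1:i}) \\ &= \left[H(z) - H(z \mid \tau_{1:i})\right] - \left[H(z) - H(z \mid \tau_{1:i-1})\right] \\ &= I(z; \tau_{1:i}) - I(z; \tau_{1:i-1}). \end{aligned} \tag{5}$$

Equation (5) implies that the information gain from collecting a new transition can be written as the temporal difference of the mutual information between the task belief and the collected trajectories. This indicates that we expect the exploration agent to collect informative experience that forms a solid hypothesis for task $\mu$.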
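
As a small numerical illustration of information gain as a reduction in the entropy of the task belief, the sketch below performs one Bayesian update of a discrete belief over three candidate tasks. The belief, the likelihood values, and the function names are hypothetical, and the example shows the entropy drop for a single observed transition rather than the expectation over transitions.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def bayes_update(prior, likelihood):
    """Posterior over tasks after one observation with the given likelihoods."""
    unnorm = [pr * li for pr, li in zip(prior, likelihood)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical belief over three candidate tasks before the new transition.
belief_before = [1 / 3, 1 / 3, 1 / 3]

# Likelihood of the observed transition under each task (assumed numbers).
# An informative transition is likely under one task and unlikely under others.
informative = [0.8, 0.1, 0.1]
# An uninformative transition is equally likely under every task.
uninformative = [0.5, 0.5, 0.5]

gain_informative = entropy(belief_before) - entropy(
    bayes_update(belief_before, informative))
gain_uninformative = entropy(belief_before) - entropy(
    bayes_update(belief_before, uninformative))
```

A transition that looks the same under every task leaves the belief unchanged and yields zero gain, which is the situation the exploration agent is meant to avoid.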

Although theoretically sound, the mutual information between the latent context and the collected trajectories is hard to estimate directly. We seek a tractable form of Equation (5) that does not lose information. To this end, we first formally define a sufficient encoder for context in the Meta-RL framework. Given two batches of transitions $b_1$ and $b_2$ from one task, the encoders $e_1(\cdot)$ and $e_2(\cdot)$ extract representations $c_1 = e_1(b_1)$ and $c_2 = e_2(b_2)$, respectively.

Definition 4.1

(Context-Sufficient Encoder) The context encoder $e_1(\cdot)$ of $b_1$ is sufficient if and only if $I(b_1; b_2) = I(e_1(b_1); b_2)$.

Intuitively, the mutual information remains the same because the encoder does not change the amount of information contained in $b_1$, which means the encoder is context-sufficient.
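
The sufficiency condition can be checked numerically on a toy discrete example. In the sketch below, the first batch is modeled as a pair (task bit, nuisance bit) and the second batch as a noisy copy of the task bit; this joint distribution, the two toy encoders, and all function names are assumptions made for illustration.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(X; Y) in nats from a dict {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(
        p * math.log(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0.0
    )

def encode_first(joint, encoder):
    """Joint distribution of (encoder(x), y) induced by encoding x."""
    out = defaultdict(float)
    for (x, y), p in joint.items():
        out[(encoder(x), y)] += p
    return dict(out)

# Toy joint (hypothetical): b1 = (task bit, nuisance bit), b2 = noisy task bit.
joint = {}
for task in (0, 1):
    for noise in (0, 1):
        for b2 in (0, 1):
            p_b2 = 0.9 if b2 == task else 0.1
            joint[((task, noise), b2)] = 0.5 * 0.5 * p_b2

def keep_task(b1):
    return b1[0]   # drops the nuisance bit, keeps the task bit

def keep_noise(b1):
    return b1[1]   # drops the task bit, keeps the nuisance bit

mi_raw = mutual_information(joint)
mi_sufficient = mutual_information(encode_first(joint, keep_task))
mi_insufficient = mutual_information(encode_first(joint, keep_noise))
```

The encoder that keeps only the task bit preserves the mutual information with the second batch while discarding the nuisance bit, so it is both sufficient and more compact than the raw input; the encoder that keeps only the nuisance bit loses all of it.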

Assuming the context encoder trained in the previous section is context-sufficient, we can use this property to rewrite the information gain objective in Equation (5) at the level of the latent context. Given the collected transitions $\tau_{1:i}$ and a positive transition batch $b_{pos}$ from the same task, together with a context-sufficient encoder $e(\cdot)$, the latent contexts are $c_{1:i} = e(\tau_{1:i})$ and $c_{pos} = e(b_{pos})$. As for the latent context $z$, we can approximate it as the embedding of the positive transition batch as well: $z = c_{pos}$. Then (5) becomes:

$$I(z; \tau_i \mid \tau_{1:i-1}) = I(c_{pos}; c_{1:i}) - I(c_{pos}; c_{1:i-1}). \tag{6}$$

Given this form of the equation, we can further derive a tractable lower bound of $I(z; \tau_i \mid \tau_{1:i-1})$. Computing the lower bound of Equation (6) can be decomposed into computing a lower bound of $I(c_{pos}; c_{1:i})$ and an upper bound of $I(c_{pos}; c_{1:i-1})$.

As mentioned in Preliminaries, a commonly used lower bound of $I(c_{pos}; c_{1:i})$ can be written as:

$$I(c_{pos}; c_{1:i}) \geq \mathbb{E}\left[\log \frac{\exp(f(c_{1:i}, c_{pos}))}{\sum_{c \in C} \exp(f(c_{1:i}, c))}\right] + \log N. \tag{7}$$

$N$ denotes the number of tasks. $C = \{c_{pos}\} \cup C_{neg}$, where $c_{pos}$ is the latent context from the same task while $C_{neg}$ contains latent contexts from different ones. As stated in the previous section, we optimize the former term in (7) to train the context encoder and the similarity function $f$.

We also need to find a tractable form for the upper bound of $I(c_{pos}; c_{1:i-1})$. Leveraging the current contrastive learning framework, we make the following proposition:

Proposition 4.2




As defined in DBLP:journals/corr/abs-1807-03748, the optimal value for is given by , inserting this back into Equation (10) results in:


Compared with the InfoNCE loss, the only difference between these two bounds is whether the similarity between the query and the positive key is included in the denominator. We can then further derive the lower bound of Equation (6) as:

Proposition 4.3


This can be derived by simply calculating the difference between Equation (7) and (9). ∎

Thus, we obtain an estimate of the lower bound of mutual information and we further use it as an intrinsic reward for the independent exploration agent:


The intrinsic reward can be interpreted as the difference between two contrastive losses. It measures how much the inference of the current task has improved after collecting the new transition. To make this objective clearer, we can transform the lower bound (12) as:

Figure 2: CCM training procedure. Dashed lines denote backward gradients.
Figure 3: Comparison of different context encoder training strategies. Our methods CCM+DP and CCM+RV achieve consistently better performance than existing methods. Error bars show one standard deviation.

Intuitively, the first term estimates how much the similarity score for positive pairs (the correct task inference) has improved after collecting the new transition. Maximizing this term means that we want the exploration policy to collect transitions that make it easy to distinguish between tasks and to form a correct and solid task hypothesis. The second term can be interpreted as a regularization term. Maximizing this term means that, while the agent tries to visit places that may increase the positive score, we discourage the policy from visiting places where the negative score (the similarity with other tasks in the embedding space) increases as well.
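
The following sketch shows one plausible reading of such a reward, written only from the verbal description above: a gain in the positive similarity score, minus the growth of the log-sum-exp similarity mass, where the positive key is included in the candidate set after the new transition but not before. The exact form, the dot-product similarity, and every name in the code are assumptions made for illustration and should not be read as the paper's equation.

```python
import math

def dot(u, v):
    """Dot-product similarity used as a stand-in for the learned score."""
    return sum(a * b for a, b in zip(u, v))

def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def intrinsic_reward(c_before, c_after, c_pos, c_negs, score=dot):
    """Change in a contrastive task-inference score after one new transition.

    c_before / c_after: context computed without / with the new transition.
    c_pos: context from another batch of the same task; c_negs: other tasks.
    """
    # First term: improvement of the similarity with the positive context.
    positive_gain = score(c_after, c_pos) - score(c_before, c_pos)
    # Second term (regulariser): growth of the similarity mass over candidates.
    after_mass = log_sum_exp([score(c_after, c) for c in [c_pos] + c_negs])
    before_mass = log_sum_exp([score(c_before, c) for c in c_negs])
    return positive_gain - (after_mass - before_mass)

# Toy contexts (hypothetical 2-D embeddings).
c_pos = [1.0, 0.0]
c_negs = [[0.0, 1.0], [-1.0, 0.0]]
c_before = [0.0, 0.0]

r_good = intrinsic_reward(c_before, [2.0, 0.0], c_pos, c_negs)  # towards own task
r_bad = intrinsic_reward(c_before, [0.0, 2.0], c_pos, c_negs)   # towards another task
```

In this toy setting, a transition that moves the context towards the positive context is rewarded, while one that moves it towards another task's context is penalized through the second term.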

We summarize our training procedure in Figure 2. The exploration agent and the execution agent interact with the environment separately, each filling its own replay buffer. The execution replay buffer also contains transitions from the exploration agent, stored without the intrinsic reward term. At the beginning of each meta-training episode, the two agents sample transition data from their respective buffers to train their policies separately. We independently sample from the exploration buffer to obtain the transition batches used for computing the latent context. Note that the rewards in these batches do not include the intrinsic reward; only the original environmental reward is used for computing the latent context. In practice, we use Soft Actor-Critic DBLP:conf/icml/HaarnojaZAL18 for both the exploration and execution agents. We first pretrain the context encoder with the contrastive loss for a few episodes to avoid the intrinsic reward being too noisy at the beginning of training. For the same reason, we also maintain an encoder replay buffer that contains only recently collected data. During meta-testing, we first let the exploration agent collect trajectories for a few episodes and then compute the latent context based on this experience. The execution agent then acts conditioned on the latent context. The pseudo-code for CCM during training and adaptation can be found in our appendix.
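
The meta-testing phase described above (explore first, encode once, then execute conditioned on the context) can be sketched with plain callables. Everything in this example is a hypothetical stand-in: the one-dimensional toy task, the fixed reset state, the hand-written policies, and the "encoder" that simply returns the best-rewarded next state.

```python
def meta_test(env_step, explore_policy, execute_policy, encode,
              n_explore, n_execute, horizon, initial_state=0.0):
    """Collect context with the exploration policy, then run the execution policy."""
    transitions = []
    for _ in range(n_explore):
        state = initial_state
        for _ in range(horizon):
            action = explore_policy(state)
            next_state, reward = env_step(state, action)
            transitions.append((state, action, reward, next_state))
            state = next_state
    context = encode(transitions)  # latent context inferred from exploration data
    returns = []
    for _ in range(n_execute):
        state, total = initial_state, 0.0
        for _ in range(horizon):
            action = execute_policy(state, context)
            state, reward = env_step(state, action)
            total += reward
        returns.append(total)
    return context, returns

goal = 3.0  # hidden task parameter of the toy task (hypothetical)

def env_step(state, action):
    next_state = state + action
    return next_state, -abs(next_state - goal)

def explore_policy(state):
    return 1.0  # sweep to the right to probe the reward

def encode(transitions):
    # Stand-in "encoder": the next state with the highest observed reward.
    return max(transitions, key=lambda t: t[2])[3]

def execute_policy(state, context):
    return max(-1.0, min(1.0, context - state))  # move towards the inferred goal

context, returns = meta_test(env_step, explore_policy, execute_policy, encode,
                             n_explore=1, n_execute=2, horizon=5)
```

The point of the sketch is the separation of roles: the exploration policy is judged only by how well its data identifies the task, while the execution policy consumes the resulting context.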


In this section, we evaluate the performance of our proposed CCM to answer the following questions:

  • Does CCM’s contrastive context encoder improve the performance of state-of-the-art context-based Meta-RL methods?

  • After combining with information-gain-based exploration policy, does CCM improve the overall adaptation performance compared with prior Meta-RL algorithms?

  • How does the regularization term (the second term in (14)) influence the exploration policy?

  • Is CCM able to extract effective and reasonable context information?

Figure 4: CCM’s overall performance compared with state-of-the-art Meta-RL methods on complex sparse-reward environments. From left to right: walker-sparse, cheetah-sparse, hard-point-robot. Error bars show one standard deviation.

Comparison of Context Encoder Training Strategy

We first evaluate the performance of context-based Meta-RL methods combined with the contrastive context encoder on several continuous control tasks simulated with the MuJoCo physics simulator DBLP:conf/iros/TodorovET12, which are standard Meta-RL benchmarks used in prior work DBLP:conf/iclr/FakoorCSS20; DBLP:conf/icml/RakellyZFLQ19. We also evaluate on out-of-distribution (OOD) tasks, which are more challenging tests of the encoder’s capacity to extract useful information. We compare against the two existing kinds of training strategy for the context encoder: 1) RV (Recovering Value-function) DBLP:conf/icml/RakellyZFLQ19; DBLP:conf/iclr/FakoorCSS20, in which the encoder is trained with gradients from recovering the state-action value function. 2) DP (Dynamics Prediction) DBLP:journals/corr/abs-2005-06800; DBLP:conf/iclr/ZhouPG19, in which the encoder is trained by performing forward or backward prediction. Here, we follow the settings in CaDM DBLP:journals/corr/abs-2005-06800, which uses the combined loss of forward and backward prediction to update the context encoder. Our method CCM+RV denotes that the encoder receives gradients from both the contrastive loss and the value function, while CCM+DP denotes that the encoder receives gradients from both the contrastive loss and the dynamics prediction loss. For consistency, we do not use the extra exploration policy in this part of the evaluation, and all three methods use the same network structure (i.e. actor-critic, context encoder) and evaluation settings as the state-of-the-art context-based Meta-RL method PEARL.

The results are shown in Figure 3. Both CCM+DP and CCM+RV achieve comparable or better performance than existing algorithms, implying that the proposed contrastive context encoder helps to extract contextual information more effectively. For in-distribution tasks that vary in dynamics (i.e. ant-mass and cheetah-mass), CCM+DP obtains better returns and converges faster than CCM+RV. This is consistent with the empirical finding in CaDM DBLP:journals/corr/abs-2005-06800 that prediction models are more effective when the transition function changes across tasks. Moreover, recall that the contrastive context encoder focuses on the differences between tasks at the trajectory level. This property may compensate for the drawback of DP, which can embed commonly shared dynamics knowledge that hinders the meta-learning process. We hypothesize that this property leads to the better performance of CCM+DP on in-distribution tasks.

For out-of-distribution tasks, CCM+RV outperforms the other methods, including CCM+DP, implying that the contrastive loss combined with the loss from recovering the value function generalizes better across tasks. Recovering the value function focuses on extracting high-level task-specific features. This property avoids the overfitting problem of dynamics prediction (being limited to the common dynamics of the training tasks), which is amplified in this case. However, using RV alone to train the encoder can perform poorly because of its noisy update signal and because it ignores detailed transition differences that contain task-specific information. Combining it with CCM addresses these problems through the low-variance InfoNCE loss and by forcing the encoder to explicitly find the trajectory-level information that distinguishes tasks from each other.

Comparison of Overall Adaptation Performance on Complex Environments

We then consider both components of the CCM algorithm and evaluate on several complex sparse-reward environments, both to compare the performance of CCM with prior Meta-RL algorithms and to show whether the information-gain-based exploration strategy improves its adaptation performance. In the cheetah-sparse and walker-sparse environments, the agent receives a reward only when it is close to the target velocity. In hard-point-robot, the agent needs to reach two randomly selected goals one after another, and receives a reward when it is close enough to the goals. We compare CCM with the state-of-the-art Meta-RL approach PEARL, and we further equip PEARL with a contrastive context encoder (PEARL-CL) for a fair comparison that isolates the influence of CCM’s exploration strategy. We further consider MAML DBLP:conf/icml/FinnAL17 and ProMP DBLP:conf/iclr/RothfussLCAA19, which adapt with on-policy methods. We also compare against VariBAD DBLP:conf/iclr/ZintgrafSISGHW20, which is a stable version of RL2 DBLP:journals/corr/DuanSCBSA16; DBLP:journals/corr/WangKTSLMBKB16.

As shown in Figure 4, CCM achieves better performance than prior Meta-RL algorithms in both final returns and learning speed in these complex sparse-reward environments. Within a relatively small number of environment interactions, the on-policy methods MAML, VariBAD and ProMP struggle to learn effective policies in these complex environments, while the off-policy context-based methods CCM and PEARL-CL generally achieve higher performance.

As the only difference between CCM and PEARL-CL here is whether to use the proposed extra exploration policy, we can conclude that CCM’s information-gain-based exploration strategy enables fast adaptation by collecting more informative experience and further improving the quality of context when carrying out execution policies. Note that during training phases, trajectories collected by the exploration agent are used for updating the context encoder and the execution policy as well. As shown in experimental results, this may lead to faster training speed of context encoder and execution policy because of the more informative replay buffer.

Figure 5: Influence of the regularization term

Influence of the Regularization Term

We perform ablation experiments to investigate the influence of the regularization term in the objective function, which reflects the change in similarity score with all the tasks in the embedding space (the second term in (14)). We compare against CCM using the intrinsic reward without the regularization term on the in-distribution version of hard-point-robot. The results are shown in Figure 5. The regularization term improves CCM’s performance. We hypothesize that, without this term, the exploration agent may collect transitions for which the similarity score with all tasks in the embedding space increases as well, which results in a noisy update signal. Due to space limits, we put the other ablation experiments (e.g. intrinsic reward scale, context updating frequency) in the appendix.

Visualization for Context

Finally, we visualize CCM’s learned context during the adaptation phase via t-SNE Maaten2008VisualizingDU and compare it with PEARL. We run the learned policies on ten randomly sampled test tasks multiple times to collect trajectories. We then encode the collected trajectories into latent contexts with the learned context encoder and visualize them via t-SNE in Figure 6. We find that the latent contexts generated by CCM from the same task lie close together in the embedding space while maintaining clear boundaries between different tasks. In contrast, the latent contexts generated by PEARL show globally messy clustering, with only a few reasonable patterns in local regions. This indicates that CCM extracts higher-quality (i.e. compact and sufficient) task-specific information from the environment than PEARL. As a result, a policy conditioned on the high-quality latent context is more likely to obtain a higher return on these meta-testing tasks, which is consistent with our earlier empirical results.

Figure 6: Visualization of context in embedding space. Different colors represent contexts from different tasks.

Related Work

Contrastive Learning

Contrastive learning has recently achieved great success in learning representations for image or sequential data DBLP:journals/corr/abs-1807-03748; DBLP:journals/corr/abs-1905-09272; DBLP:conf/cvpr/He0WXG20; DBLP:journals/corr/abs-2002-05709. In RL, it has been used to extract reward signals in latent space DBLP:conf/icra/SermanetLCHJSLB18; DBLP:conf/iros/DwibediTLS18, or as an auxiliary task to learn representations for high-dimensional data DBLP:journals/corr/abs-2004-04136; DBLP:conf/nips/AnandROBCH19. Contrastive learning helps learn representations that obey similarity constraints by dividing the data set into similar (positive) and dissimilar (negative) pairs and minimizing a contrastive loss. Prior work DBLP:journals/corr/abs-2004-04136; DBLP:journals/corr/abs-1905-09272 has shown various methods of generating positive and negative pairs for image-based input data. The standard approach is to create multiple views of each datapoint, for example through random crops and other data augmentations Wu2018UnsupervisedFL; DBLP:journals/corr/abs-2002-05709; DBLP:conf/cvpr/He0WXG20. In this work, however, we focus on low-dimensional input data and leverage the natural distinctions between the trajectories of different tasks to generate positive and negative data. Various contrastive loss functions have been proposed, and the most competitive one is InfoNCE DBLP:journals/corr/abs-1807-03748. The motivation behind the contrastive loss is the InfoMax principle DBLP:journals/computer/Linsker88, which can be interpreted as maximizing the mutual information between two views of the data. The relationship between the InfoNCE loss and mutual information is comprehensively explained in DBLP:conf/icml/PooleOOAT19.

Meta-Reinforcement Learning

Meta-RL extends the framework of meta-learning Schmidhuber1987EvolutionaryPI; Thrun1998LearningTL; Naik1992MetaneuralNT to the reinforcement learning setting. Recurrent or recursive Meta-RL methods DBLP:journals/corr/DuanSCBSA16; DBLP:journals/corr/WangKTSLMBKB16 directly learn a function that takes in past experience and outputs a policy for the agent. Gradient-based Meta-RL methods DBLP:conf/icml/FinnAL17; DBLP:conf/iclr/RothfussLCAA19; Liu2019TamingME; Gupta2018MetaReinforcementLO meta-learn a model initialization and adapt the parameters with policy gradient methods. These two kinds of methods focus on on-policy meta-training and struggle to achieve good performance in complicated environments.

Here we consider context-based Meta-RL DBLP:conf/icml/RakellyZFLQ19; Fu2019EfficientMR; DBLP:conf/iclr/FakoorCSS20, a family of Meta-RL algorithms that can meta-learn from off-policy data by leveraging context. Rakelly et al. (DBLP:conf/icml/RakellyZFLQ19) propose PEARL, which adapts to a new environment by inferring latent context variables from a small number of trajectories. Fakoor et al. (DBLP:conf/iclr/FakoorCSS20) propose MQL, showing that context combined with otherwise vanilla RL algorithms performs comparably to PEARL. Lee et al. (DBLP:journals/corr/abs-2005-06800) learn a global model that generalizes across tasks by training a latent context to capture the local dynamics. In contrast, our approach focuses directly on the context itself, which motivates an algorithm that improves the quality of the latent context.

Some prior work has proposed exploration strategies for obtaining the most informative trajectories in Meta-RL DBLP:conf/icml/RakellyZFLQ19; DBLP:conf/iclr/RothfussLCAA19; Gupta2018MetaReinforcementLO. The common idea behind these approaches is Thompson sampling Schmidhuber1987EvolutionaryPI. However, this kind of posterior sampling is conditioned on the learned execution policy, which may limit the exploration behavior to the learned tasks. In contrast, we theoretically derive a lower-bound estimate of the exploration objective, on which a separate exploration agent is trained.


In this paper, we propose that constructing a powerful context for Meta-RL involves two problems: 1) How to collect informative trajectories of which the corresponding context reflects the specification of tasks? 2) How to train a compact and sufficient encoder that can embed the task-specific information contained in prior trajectories? We then propose our method CCM which tackles the above two problems respectively. Firstly, CCM focuses on the underlying structure behind different tasks’ transitions and trains the encoder by leveraging contrastive learning. CCM further learns a separate exploration agent with an information-theoretical objective that aims to maximize the improvement of inference after collecting new transitions. The empirical results on several complex simulated control tasks show that CCM outperforms state-of-the-art Meta-RL methods by addressing the aforementioned problems.


We thank Chenjia Bai, Qianli Shen, and Weixun Wang for useful discussions and suggestions.


Appendix A Environment Details

We run all experiments with OpenAI Gym DBLP:journals/corr/BrockmanCPSSTZ16, using the MuJoCo simulator DBLP:conf/iros/TodorovET12. The benchmarks used in our experiments are visualized in Figure 7. We further modify the original tasks to be Meta-RL tasks, similar to DBLP:conf/icml/RakellyZFLQ19; DBLP:journals/corr/abs-2005-06800; DBLP:conf/iclr/FakoorCSS20:

  • humanoid-dir: The target direction of running changes across tasks. The horizon is 200.

  • cheetah-mass: The mass of cheetah changes across tasks to change transition dynamics. The horizon is 200.

  • cheetah-mass-OOD: The mass of the cheetah for a training task is sampled uniformly from one range, while that for a test task is sampled uniformly from a different range. The horizon is 200.

  • ant-mass: The mass of ant changes across tasks to change transition dynamics. The horizon is 200.

  • cheetah-vel-OOD: The target velocity for a training task is sampled uniformly from one range, while that for a test task is sampled uniformly from a different range. The horizon is 200.

  • cheetah-sparse: Both target velocity and mass of cheetah change across tasks. The agent receives a dense reward only when it is close enough to the target velocity. The horizon is 64.

  • walker-sparse: The target velocity changes across tasks. The agent receives a dense reward only when it is close enough to the target velocity. The horizon is 64.

  • hard-point-robot: The agent needs to reach two randomly selected goals one after another with reward given when inside the goal radius. The horizon is 40.
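The sparse-reward variants listed above (cheetah-sparse, walker-sparse) can be pictured with a small sketch. The function name, the tolerance value, and the linear shape of the reward inside the band are illustrative assumptions, not the settings used in the experiments.

```python
def sparse_velocity_reward(velocity, target_velocity, tolerance=0.2):
    """Dense reward inside a tolerance band around the target, zero outside.

    The name, the default tolerance, and the linear shaping are hypothetical.
    """
    error = abs(velocity - target_velocity)
    if error > tolerance:
        # Too far from the target velocity: no learning signal at all.
        return 0.0
    # Inside the band the reward grows as the error shrinks.
    return 1.0 - error / tolerance
```

Under such a reward, an agent that never comes near the target velocity observes only zeros, which is why informative exploration matters on these tasks.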

Figure 7: Meta-RL tasks: left-to-right: humanoid, ant, half cheetah, and walker.
Algorithm 1 CCM Meta-training
Require: Batch of training tasks sampled from the task distribution
  Initialize an execution replay buffer and an exploration replay buffer for each training task
  Initialize the parameters of the off-policy method (e.g., SAC) employed by the exploration agent and the execution agent, respectively
  Initialize the parameters of the context encoder network
  while not done do
     for each task do
        Roll out the exploration policy, producing transitions
        Add the tuples to the exploration replay buffer and the execution replay buffer
        Roll out the execution policy, producing transitions
        Add the tuples to the execution replay buffer
     end for
     for each training step do
        for each task do
           Sample transitions for the encoder and for the RL batches
           Update the execution agent and the context encoder with the RL loss and the contrastive loss
           Update the exploration agent with the RL loss
        end for
     end for
  end while
Algorithm 2 CCM Meta-testing
Require: Test task sampled from the task distribution
  Initialize the context
  for each episode do
     for each step do
        Roll out the exploration policy, producing a transition
        Accumulate the transition into the context
     end for
  end for
  for each step do
     Roll out the execution policy conditioned on the context, producing a transition
     Accumulate the transition into the context
  end for
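The two phases of Algorithm 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: every function name and argument below is hypothetical, the exploration policy is assumed to act on the state alone, and the toy 1-D task with its hand-written "encoder" merely stands in for the learned networks.

```python
def meta_test(env_reset, env_step, explore_policy, execute_policy, encode,
              num_explore_episodes, horizon):
    """Collect context with the exploration policy, then run the execution policy."""
    context = []  # transitions accumulated on the test task

    # Exploration phase: gather transitions that reveal the task.
    for _ in range(num_explore_episodes):
        state = env_reset()
        for _ in range(horizon):
            action = explore_policy(state)
            next_state, reward = env_step(state, action)
            context.append((state, action, reward, next_state))
            state = next_state

    # Execution phase: re-encode the context at every step before acting.
    state = env_reset()
    episode_return = 0.0
    for _ in range(horizon):
        latent = encode(context)
        action = execute_policy(state, latent)
        next_state, reward = env_step(state, action)
        context.append((state, action, reward, next_state))
        episode_return += reward
        state = next_state
    return episode_return, context


# Toy usage (all hypothetical): a 1-D point that should reach a hidden goal.
GOAL = 3.0

def reset():
    return 0.0

def step(state, action):
    next_state = state + action
    return next_state, -abs(next_state - GOAL)

def explore(state):
    # Fixed sweep to the right, turning back at 5.
    return 1.0 if state < 5.0 else -1.0

def encode_best_state(context):
    # Stand-in "encoder": the next state of the best-rewarded transition.
    return max(context, key=lambda t: t[2])[3]

def execute(state, latent):
    if state < latent:
        return 1.0
    return -1.0 if state > latent else 0.0

ret, ctx = meta_test(reset, step, explore, execute, encode_best_state, 2, 6)
```

Re-encoding inside the execution loop corresponds to updating the context every step; moving the `encode` call outside the loop would correspond to updating it once per episode.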

Appendix B Implementation Details

For each environment, we meta-train 3 models and meta-test each of them. The evaluation results reflect the average return in the last episode over a test rollout. We show the pseudo-code for CCM during meta-training and meta-testing in Algorithms 1 and 2. Note that we calculate the exploration intrinsic reward during interaction with the environment rather than during training, which reduces the computational cost. To keep the intrinsic reward from being too noisy, we pretrain the context encoder for several episodes and restrict the exploration replay buffer to recently collected data.
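One simple way to keep only recently collected data in a replay buffer is a fixed-capacity queue that silently evicts the oldest entries. The sketch below is an assumption about how this could be done, not a description of the actual implementation; the class name and capacity are hypothetical.

```python
import random
from collections import deque


class RecentReplayBuffer:
    """Replay buffer that retains only the most recent `capacity` transitions."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item on overflow.
        self._data = deque(maxlen=capacity)

    def add(self, transition):
        self._data.append(transition)

    def sample(self, batch_size, rng):
        # Sample without replacement, capped at the current buffer size.
        k = min(batch_size, len(self._data))
        return rng.sample(list(self._data), k)

    def __len__(self):
        return len(self._data)


buffer = RecentReplayBuffer(capacity=3)
for transition in range(5):
    buffer.add(transition)
batch = buffer.sample(3, random.Random(0))
```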

Figure 8: CCM adaptation procedure.

We model the context encoder as the product of independent Gaussian factors, in which the mean and variance are parameterized by MLPs with hidden layers of units that produce a 7-dimensional vector. When comparing context encoder training strategies, we use a deterministic version of the context encoder network. In other cases, we select in KL divergence term from . For the contrastive context encoder, the scale of RV or DP loss is while the scale for contrastive loss is chosen between . For DP, the penalty parameter DBLP:journals/corr/abs-2005-06800 is set to be . We use SAC for both exploration and execution agents and set learning rate as . Other hyperparameters are detailed in Tables 1 and 2.
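For diagonal Gaussians, a product of independent factors has a closed form: precisions add, and the mean is the precision-weighted average of the factor means. The sketch below illustrates that identity only; the function name, the list-based interface, and the variance floor are assumptions, and the per-transition means and variances would come from the MLPs described above.

```python
def product_of_gaussians(means, variances, min_variance=1e-7):
    """Combine N diagonal Gaussian factors into one diagonal Gaussian.

    means, variances: lists of N lists, each of length D (hypothetical layout).
    Returns the mean and variance of the normalized product, per dimension.
    """
    dim = len(means[0])
    post_mean, post_var = [], []
    for d in range(dim):
        # Floor the variance to avoid division by zero.
        precisions = [1.0 / max(v[d], min_variance) for v in variances]
        var = 1.0 / sum(precisions)
        mean = var * sum(m[d] * p for m, p in zip(means, precisions))
        post_mean.append(mean)
        post_var.append(var)
    return post_mean, post_var


mean, var = product_of_gaussians(
    means=[[0.0, 0.0], [2.0, 4.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
)
```

Each additional factor can only shrink the combined variance, so the context posterior becomes sharper as more transitions are collected.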

Environment Meta-train tasks Meta-test tasks Number of exploration episodes Embedding batch size
cheetah-sparse 60 10 2 1 128
walker-sparse 60 10 2 2 128
hard-point-robot 40 10 4 2 256
Table 1: CCM’s hyperparameters for sparse-reward environments
Environment        Meta-train tasks   Meta-test tasks   Meta batch size
humanoid-dir       100                30                20
cheetah-mass       30                 5                 16
cheetah-mass-OOD   30                 5                 16
cheetah-vel-OOD    50                 5                 24
ant-mass           50                 5                 24
Table 2: CCM’s hyperparameters for continuous control benchmarks

Appendix C Additional Ablation Study

Further evaluation of contrastive context encoder

In this section, we show the performance of our proposed contrastive context encoder without combining it with other training strategies (i.e., RV or DP). “CCM only” denotes that the context encoder is trained only with gradients from the contrastive loss. As shown in Figure 9, a context encoder trained only with the contrastive loss still reaches performance comparable to or better than existing methods.
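As a rough illustration of what a contrastive loss over context embeddings looks like, the sketch below implements a generic InfoNCE-style loss with a dot-product similarity. It is not CCM's exact loss: the similarity function, the temperature, and the assumption that the positive key comes from the same task while the negatives come from other tasks are all illustrative choices.

```python
import math


def info_nce_loss(query, positive_key, negative_keys, temperature=1.0):
    """Generic InfoNCE loss: cross-entropy of picking the positive among all keys."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # The positive logit goes first, followed by one logit per negative.
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, key) / temperature for key in negative_keys]

    # Numerically stable log-sum-exp over all logits.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_sum_exp - logits[0]


aligned = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
ambiguous = info_nce_loss([1.0, 0.0], [0.0, 1.0], [[0.0, 1.0]])
```

The loss is small when the query is closer to the positive key than to any negative, and equals the log of the number of keys when the query cannot tell them apart.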

Figure 9: Contrastive context encoder without combining other training strategies

Intrinsic reward scale

We investigate the influence of the extra intrinsic reward on CCM’s exploration by changing the intrinsic reward scale. We experiment with several values and show the results on cheetah-sparse. As shown in Figure 10, an intrinsic reward scale that is either too large or too small decreases the adaptation performance. Depending entirely on either the intrinsic reward or the environment reward hinders the exploration process.
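The role of the scale can be made concrete with a one-line sketch. The additive form below is an assumption about how the two signals are combined, and the function and parameter names are hypothetical.

```python
def exploration_reward(env_reward, intrinsic_reward, intrinsic_scale):
    """Reward seen by the exploration agent (assumed additive combination)."""
    return env_reward + intrinsic_scale * intrinsic_reward
```

With a scale of zero the exploration agent sees only the environment reward, while a very large scale lets the intrinsic term dominate; the ablation above suggests that neither extreme works well.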

Context updating frequency

We also conduct experiments to show the influence of the context updating frequency during adaptation. We compare the adaptation performance of updating the context every episode against updating it every step. As shown in Figure 10, updating the context every step obtains relatively better returns, which is consistent with our original objective in Equation (4).

Figure 10: (a) Context updating frequency comparison; (b) Intrinsic reward scale comparison

Appendix D Additional Details

Derivation of Equation (14)


Connection to recent work on Meta-RL exploration

Concurrent with this work, Liu et al. (DBLP:journals/corr/abs-2008-02790) and Zhang et al. (DBLP:journals/corr/abs-2006-08170) also focus on the exploration problem in Meta-RL and propose to learn execution and exploration with decoupled objectives. First, our paper points out that obtaining a high-quality latent context involves not only exploration to gain informative trajectories but also training the context encoder to extract the task-specific information. In contrast, the two concurrent works focus only on the exploration problem. Here we address the latter problem by proposing the contrastive context encoder. As for exploration, we theoretically derive a low-variance variational lower bound for the information gain objective based on the contrastive loss, without introducing additional components such as a prediction model. Second, the intrinsic reward used by the two concurrent works can be interpreted as the Meta-RL version of “curiosity” DBLP:conf/icml/PathakAED17, which encourages the agent to visit places that are prone to generate high prediction error. However, high prediction error does not always indicate an informative transition; it may instead be caused by task-irrelevant random noise. In contrast, our objective, based on the temporal difference of the contrastive loss, not only encourages temporal uncertainty of the task belief but also restricts the agent to transitions from which it is easy to form a correct task hypothesis, filtering out task-irrelevant information.


  1. For environments that have fixed dynamics, we further modify the model to predict the reward as well.