Towards Effective Context for Meta-Reinforcement Learning: an Approach based on Contrastive Learning
Abstract
Context, the embedding of previously collected trajectories, is a powerful construct for Meta-Reinforcement Learning (Meta-RL) algorithms. By conditioning on an effective context, Meta-RL policies can easily generalize to new tasks within a few adaptation steps. We argue that improving the quality of context involves answering two questions: 1. How to train a compact and sufficient encoder that can embed the task-specific information contained in prior trajectories? 2. How to collect informative trajectories of which the corresponding context reflects the specification of tasks? To this end, we propose a novel Meta-RL framework called CCM (Contrastive learning augmented Context-based Meta-RL). We first focus on the contrastive nature behind different tasks and leverage it to train a compact and sufficient context encoder. Further, we train a separate exploration policy and theoretically derive a new information-gain-based objective which aims to collect informative trajectories in a few steps. Empirically, we evaluate our approach on common benchmarks as well as several complex sparse-reward environments. The experimental results show that CCM outperforms state-of-the-art algorithms by addressing these two problems respectively.
^{1}Tianjin University,
{haotianfu, bluecontra}@tju.edu.cn,
^{2}Noah’s Ark Lab, Huawei, {haojianye, chenchen9, lidong106, liuwulong}@huawei.com
^{3}Department of Automation, Tsinghua University, fengxidongwh@gmail.com
Introduction
Reinforcement Learning (RL) combined with deep neural networks has achieved impressive results on various complex tasks DBLP:journals/nature/MnihKSRVBGRFOPB15; DBLP:journals/corr/LillicrapHPHETS15; DBLP:conf/icml/SchulmanLAJM15. Conventional RL agents need a large amount of environmental interaction to train a single policy for one task. However, in real-world problems many tasks share similar internal structures, and we expect agents to adapt to such tasks quickly based on prior experience. Meta-Reinforcement Learning (Meta-RL) proposes to address such problems by learning how to learn DBLP:journals/corr/WangKTSLMBKB16. Given a number of tasks with similar structures, Meta-RL methods aim to capture the common knowledge from previous experience on training tasks and adapt to a new task with only a small number of interactions.
Based on this idea, many Meta-RL methods try to learn a general model initialization and update the parameters during adaptation DBLP:conf/icml/FinnAL17; DBLP:conf/iclr/RothfussLCAA19. Such methods require on-policy meta-training and have been empirically shown to be sample-inefficient. To alleviate this problem, a number of methods DBLP:conf/icml/RakellyZFLQ19; DBLP:conf/iclr/FakoorCSS20 meta-learn a policy that is able to adapt with off-policy data by leveraging context information. Typically, an agent adapts to a new environment by inferring a latent context from a small number of interactions with the environment. The latent context is expected to capture the distribution of tasks and to efficiently identify new tasks. Context-based Meta-RL methods then train a policy conditioned on the latent context to improve generalization.
As the key component of context-based Meta-RL, the quality of the latent context can affect an algorithm's performance significantly. However, current algorithms are suboptimal in two respects. Firstly, the training strategy for the context encoder is flawed. A desirable context is expected to extract only task-specific information from trajectories and throw away everything else. However, the latent context learned by existing methods (i.e., recovering the value function DBLP:conf/icml/RakellyZFLQ19 or dynamics prediction DBLP:journals/corr/abs200506800; DBLP:conf/iclr/ZhouPG19) is quite noisy, as it may model irrelevant dependencies and ignore some task-specific information. Instead, we propose to directly analyze and discriminate the underlying structure behind different tasks' trajectories by leveraging contrastive learning. Secondly, prior context-based Meta-RL methods ignore the importance of collecting informative trajectories for generating a distinctive context. If the exploration process does not collect transitions that reflect a task's individual properties and distinguish it from dissimilar tasks, the latent context will be ineffective. For instance, in many cases tasks in one distribution vary only in their final goals, which means the transition dynamics remains the same in most of the state space. Without a good exploration policy it is hard to obtain information that distinguishes tasks from each other, which leads to a poor context.
In this paper, we propose a novel off-policy Meta-RL algorithm, CCM (Contrastive learning augmented Context-based Meta-RL), aiming to improve the quality of context by tackling the two aforementioned problems. Our first contribution is an unsupervised training framework for the context encoder based on contrastive learning. The main insight is that by treating transitions from the same task as positive samples and transitions from different tasks as negative samples, contrastive learning can directly distinguish contexts in the latent space without modeling irrelevant dependencies. The second contribution is an information-gain-based exploration strategy. With the goal of collecting trajectories that are as informative as possible, we theoretically derive a lower-bound estimate of the exploration objective within the contrastive learning framework. It is then employed as an intrinsic reward, based on which a separate exploration agent is trained. The effectiveness of CCM is validated on a variety of continuous control tasks. The experimental results show that CCM outperforms state-of-the-art Meta-RL methods by generating high-quality latent contexts.
Preliminaries
Meta-Reinforcement Learning
In the meta-reinforcement learning (Meta-RL) setting, we assume a distribution of tasks p(\mathcal{T}). Each task shares similar structure and corresponds to a different Markov Decision Process (MDP) (\mathcal{S}, \mathcal{A}, P_{\mathcal{T}}, R_{\mathcal{T}}), with state space \mathcal{S}, action space \mathcal{A}, transition distribution P_{\mathcal{T}}, and reward function R_{\mathcal{T}}. We assume that the transition dynamics, the reward function, or both vary across tasks. Following the problem settings in DBLP:journals/corr/DuanSCBSA16; DBLP:conf/iclr/FakoorCSS20; DBLP:conf/icml/RakellyZFLQ19, we define a meta-test trial as several episodes in the same MDP: an initial exploration phase of a few episodes, followed by execution episodes that leverage the data collected in the exploration phase. We further define a transition batch sampled from task \mathcal{T} as b^{\mathcal{T}} = \{(s_j, a_j, r_j, s'_j)\}_{j=1}^{B}. A trajectory \tau, consisting of consecutive transitions, is a special case of a transition batch. For context-based Meta-RL, the agent's policy typically depends on all prior transitions collected by an exploration policy. The agent first consumes the collected trajectories and outputs a latent context z through a context encoder, then executes a policy \pi(a \mid s, z) conditioned on the current state and the latent context. The goal of the agent is therefore to maximize the expected return over the execution episodes.
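To make this interface concrete, the following minimal PyTorch sketch shows a context-conditioned policy \pi(a \mid s, z) that simply concatenates the state with the inferred latent context; the class name, layer sizes, and dimensions are illustrative assumptions rather than part of the original method.

```python
import torch
import torch.nn as nn

class ContextConditionedPolicy(nn.Module):
    """Gaussian policy pi(a | s, z) conditioned on a latent task context z."""
    def __init__(self, state_dim, latent_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim))   # outputs mean and log-std

    def forward(self, state, z):
        mean, log_std = self.net(torch.cat([state, z], dim=-1)).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp())

# Toy usage: 17-dimensional state, 7-dimensional latent context, 6-dimensional action.
policy = ContextConditionedPolicy(17, 7, 6)
action = policy(torch.randn(17), torch.randn(7)).sample()
```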
Contrastive Representation Learning
The key component of representation learning is how to efficiently learn rich representations of the given input data, and contrastive learning has recently been widely used for this purpose. The core idea is to learn a representation function that maps semantically similar data closer together in the embedding space. Given a query q and a set of keys \{k_1, \dots, k_K\}, the goal of contrastive representation learning is to ensure that q matches its positive key k^{+} more closely than any other key in the set. Empirically, the positive key and the query are often obtained by taking two augmented versions of the same image, and negative keys are obtained from other images.
Previous work proposes the InfoNCE loss DBLP:journals/corr/abs180703748 to score the positive key higher than a set of negative distractors:
(1)  \mathcal{L}_{\mathrm{NCE}} = -\mathbb{E}\left[\log \frac{\exp\big(h(q, k^{+})\big)}{\sum_{i=1}^{K}\exp\big(h(q, k_{i})\big)}\right]
where the function h(\cdot,\cdot) computes a similarity score between the query and a key and is usually modeled as a bilinear product, i.e., h(q, k) = q^{\top} W k DBLP:journals/corr/abs190509272. As proposed and proved in DBLP:journals/corr/abs180703748, minimizing the InfoNCE loss is equivalent to maximizing a lower bound on the mutual information between q and k^{+}:
(2)  I\big(q;\, k^{+}\big) \ge \log K - \mathcal{L}_{\mathrm{NCE}}
The lower bound becomes tighter as the number of keys K increases.
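As a concrete illustration, here is a minimal PyTorch sketch of the InfoNCE loss with a bilinear score h(q, k) = q^T W k; the batch construction (other items in the batch act as negatives) and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(queries, keys, W):
    """InfoNCE loss with a bilinear score h(q, k) = q^T W k.

    queries: (B, D) query embeddings; keys: (B, D) key embeddings, where
    keys[i] is the positive for queries[i] and all other rows act as negatives.
    """
    scores = queries @ W @ keys.t()          # (B, B) similarity matrix
    labels = torch.arange(queries.size(0))   # index of the positive key per query
    # Row-wise cross-entropy = -log softmax score of the positive key.
    return F.cross_entropy(scores, labels)

# Toy usage: 8 query/key pairs with 16-dimensional embeddings.
q, k = torch.randn(8, 16), torch.randn(8, 16)
W = torch.randn(16, 16, requires_grad=True)
info_nce_loss(q, k, W).backward()
```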
Algorithms
In this section, we describe our algorithm CCM, first introducing a novel training strategy for the context encoder and then proposing an information-gain-based exploration strategy that collects informative trajectories for effective task inference.
Contrastive Context Encoder
As the key component of the context-based Meta-RL framework, training a powerful context encoder is non-trivial. One straightforward method is to train the encoder end-to-end from the RL loss (i.e., recovering the state-action value function DBLP:conf/icml/RakellyZFLQ19). However, the update signal from the value function is stochastic and weak, and may not capture the similarity relations among tasks. Moreover, recovering the value function only implicitly captures high-level task-specific features and may ignore low-level transition differences that also carry relevant information. Another kind of method resorts to dynamics prediction DBLP:journals/corr/abs200506800; DBLP:conf/iclr/ZhouPG19, whose main insight is to capture task-specific features by distinguishing the varying dynamics among tasks. However, relying entirely on low-level reconstruction of states or actions is prone to overfitting to commonly-shared dynamics and to modeling irrelevant dependencies, which may hinder the learning process.
These two existing approaches train the context encoder by extracting high-level or low-level task-specific features, but the result is either not sufficient, because useful information is ignored, or not compact, because irrelevant dependencies are modeled. We therefore aim to train a compact and sufficient encoder by extracting mid-level task-specific features. We propose to directly extract the underlying task structure behind trajectories by performing contrastive learning on the natural distinctions between trajectories. By explicitly comparing trajectories from different tasks instead of modeling each transition's dynamics, the encoder avoids modeling commonly-shared information while still capturing all the relevant task-specific structure.
Contrastive learning methods learn representations by pulling together semantically similar data points (positive data pairs) while pushing apart dissimilar ones (negative data pairs). We treat trajectories sampled from the same task as positive data pairs and trajectories from different tasks as negative data pairs. The contrastive loss is then minimized to pull the latent contexts of trajectories from the same task closer together in the embedding space while pushing away the contexts of other tasks.
Concretely, assume a training task set containing N different tasks drawn from the task distribution p(\mathcal{T}). We first generate trajectories with the current policy for each task and store them in per-task replay buffers. At each training step, we first sample a task from the training task set and then independently sample two batches of transitions, b^{q} and b^{+}, from that task. b^{q} serves as the query in the contrastive learning framework, while b^{+} is the corresponding positive key. We also randomly sample batches of transitions from the other tasks as negative keys. Note that previous work (e.g., CURL DBLP:journals/corr/abs200404136) relies on image data augmentation, such as random cropping, to generate positive observations. In contrast, independent transition batches sampled from the same task's replay buffer naturally form positive pairs.
As shown in Figure 1, after obtaining the query and key data, we map them to latent contexts z^{q} and z^{k} respectively, on which we compute the similarity score and the contrastive loss used to train the encoder:
(3)  \mathcal{L}_{\mathrm{CL}} = -\mathbb{E}\left[\log \frac{\exp\big(h(z^{q}, z^{+})\big)}{\exp\big(h(z^{q}, z^{+})\big) + \sum_{i=1}^{N-1}\exp\big(h(z^{q}, z^{-}_{i})\big)}\right]
Following the settings in DBLP:conf/cvpr/He0WXG20; DBLP:journals/corr/abs200404136, we use a momentum-averaged version of the query encoder to encode the keys. The reinforcement learning component takes the latent context as an additional observation, then executes the policy and is updated separately. Note that the contrastive context encoder is a generic framework and can be integrated with any context-based Meta-RL algorithm.
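The sketch below illustrates one contrastive update of the context encoder under this scheme. It assumes per-task replay buffers exposing a sample() method that returns flattened transition batches and uses a MoCo-style momentum key encoder; all class names, shapes, and hyperparameters are illustrative rather than the authors' implementation.

```python
import copy
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextEncoder(nn.Module):
    """Maps a batch of flattened transitions to a single latent context vector."""
    def __init__(self, transition_dim, latent_dim=7, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim))

    def forward(self, transitions):              # (B, transition_dim)
        return self.net(transitions).mean(0)     # aggregate over the batch -> (latent_dim,)

def contrastive_update(buffers, encoder, key_encoder, W, optimizer, m=0.99):
    """One CCM-style step: positives are two independent batches from the same
    task; negatives are batches from the other training tasks."""
    task = random.randrange(len(buffers))
    z_q = encoder(buffers[task].sample(64))                  # query context
    with torch.no_grad():
        z_pos = key_encoder(buffers[task].sample(64))        # positive key
        z_neg = torch.stack([key_encoder(buffers[t].sample(64))
                             for t in range(len(buffers)) if t != task])
    keys = torch.cat([z_pos.unsqueeze(0), z_neg])            # (N, latent_dim), positive first
    scores = (z_q @ W @ keys.t()).unsqueeze(0)               # (1, N) bilinear similarities
    loss = F.cross_entropy(scores, torch.zeros(1, dtype=torch.long))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    for p_k, p_q in zip(key_encoder.parameters(), encoder.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)         # momentum key-encoder update
    return loss.item()

# Toy usage with random data standing in for per-task replay buffers.
class FakeBuffer:
    def __init__(self, dim): self.dim = dim
    def sample(self, n): return torch.randn(n, self.dim)

buffers = [FakeBuffer(20) for _ in range(10)]
encoder = ContextEncoder(20)
key_encoder = copy.deepcopy(encoder)
W = torch.randn(7, 7, requires_grad=True)
optimizer = torch.optim.Adam(list(encoder.parameters()) + [W], lr=3e-4)
contrastive_update(buffers, encoder, key_encoder, W, optimizer)
```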
Information-gain-based Exploration
Even with a compact and sufficient encoder, the context is still ineffective if the exploration process does not collect enough transitions that reflect the new task's specific properties and distinguish it from dissimilar tasks. Some previous approaches DBLP:conf/icml/RakellyZFLQ19; DBLP:conf/iclr/RothfussLCAA19 explore via Thompson sampling Schmidhuber1987EvolutionaryPI, in which the agent explores in the initial few episodes and executes optimally in the subsequent episodes. However, these two processes are actually carried out by a single policy trained end-to-end by maximizing the expected return. This means the exploration behavior is tied to the learned execution policy: when adapting to a new task, the agent tends to explore only experiences that are useful for solving previously trained tasks, making the adaptation process less effective.
Instead, we decouple the exploration and execution policies and define an exploration agent that aims to collect trajectories that are as informative as possible. We achieve this by encouraging the exploration policy to maximize the information gain about the task's latent context z^{\mathcal{T}} from collecting a new transition \tau_{t} at time step t of task \mathcal{T}:
(4)  \mathrm{IG}_{t} = H\big(z^{\mathcal{T}} \mid \tau_{1:t-1}\big) - H\big(z^{\mathcal{T}} \mid \tau_{1:t}\big)
where \tau_{1:t-1} denotes the transitions collected before time step t. The above equation can also be interpreted as how much the task belief changes given the newly collected transition. We further transform the equation as follows:
(5)  \mathrm{IG}_{t} = I\big(z^{\mathcal{T}};\, \tau_{1:t}\big) - I\big(z^{\mathcal{T}};\, \tau_{1:t-1}\big)
Equation (5) implies that the information gain from collecting a new transition can be written as the temporal difference of the mutual information between the task belief and the collected trajectories. This indicates that we expect the exploration agent to collect informative experience that forms a solid hypothesis about task \mathcal{T}.
Although theoretically sound, the mutual information between the latent context and the collected trajectories is hard to estimate directly. We seek a tractable form of Equation (5) that does not lose information. To this end, we first formally define a sufficient encoder for context in the Meta-RL framework. Given two batches of transitions b^{q} and b^{+} from one task, the encoders f_{q} and f_{k} extract representations z^{q} = f_{q}(b^{q}) and z^{+} = f_{k}(b^{+}), respectively.
Definition 4.1
(Context-Sufficient Encoder) The context encoder f_{k} of b^{+} is sufficient if and only if I\big(b^{q};\, b^{+}\big) = I\big(b^{q};\, f_{k}(b^{+})\big).
Intuitively, the mutual information remains the same because the encoder does not change the amount of task-relevant information contained in b^{+}, which means the encoder is context-sufficient.
Assuming the context encoder trained in the previous section is context-sufficient, we can use this property to rewrite the information-gain objective of Equation (5) at the level of latent contexts. Given batches of transitions from the same task and a context-sufficient encoder, the latent contexts are z^{q} = f_{q}(b^{q}) and z^{+} = f_{k}(b^{+}), so the task belief can be represented by z^{q}. As for the latent context of the collected trajectory \tau_{1:t}, we can approximate it as the embedding of a positive transition batch as well: z^{+}_{1:t} = f_{k}(\tau_{1:t}). Then (5) becomes:
(6)  I\big(z^{q};\, z^{+}_{1:t}\big) - I\big(z^{q};\, z^{+}_{1:t-1}\big)
Given this form, we can derive a tractable lower bound of Equation (6). The calculation decomposes into computing a lower bound of I\big(z^{q}; z^{+}_{1:t}\big) and an upper bound of I\big(z^{q}; z^{+}_{1:t-1}\big).
As mentioned in the Preliminaries, a commonly used lower bound of I\big(z^{q}; z^{+}\big) can be written as:
(7)  I\big(z^{q};\, z^{+}\big) \ge \log N - \mathcal{L}_{\mathrm{NCE}}\big(z^{q}, z^{+}\big)
where,
(8)  \mathcal{L}_{\mathrm{NCE}}\big(z^{q}, z^{+}\big) = -\mathbb{E}\left[\log \frac{\exp\big(h(z^{q}, z^{+})\big)}{\sum_{z \in Z}\exp\big(h(z^{q}, z)\big)}\right]
N denotes the number of tasks and Z = Z^{+} \cup Z^{-}, where Z^{+} contains latent contexts from the same task and Z^{-} contains latent contexts from different tasks. As stated in the previous section, we optimize this contrastive loss to train the context encoder and the similarity function h.
We also need a tractable upper bound of I\big(z^{q}; z^{+}_{1:t-1}\big). Within the current contrastive learning framework, we make the following proposition:
Proposition 4.2
(9)  I\big(z^{q};\, z^{+}\big) \le \log\,(N-1) - \mathcal{L}_{\mathrm{UB}}\big(z^{q}, z^{+}\big)
where,
(10)  \mathcal{L}_{\mathrm{UB}}\big(z^{q}, z^{+}\big) = -\mathbb{E}\left[\log \frac{\exp\big(h(z^{q}, z^{+})\big)}{\sum_{z^{-} \in Z^{-}}\exp\big(h(z^{q}, z^{-})\big)}\right]
Proof.
As shown in DBLP:journals/corr/abs180703748, the optimal value for \exp\big(h(z^{q}, z)\big) is given by p(z \mid z^{q}) / p(z). Inserting this back into Equation (10) results in:
(11)  -\mathcal{L}_{\mathrm{UB}} = \mathbb{E}\left[\log \frac{p(z^{+} \mid z^{q})}{p(z^{+})} - \log \sum_{z^{-} \in Z^{-}} \frac{p(z^{-} \mid z^{q})}{p(z^{-})}\right] \ge I\big(z^{q};\, z^{+}\big) - \log\,(N-1)
∎
Compared with the InfoNCE loss, the only difference between these two bounds is whether the similarity score between the query and the positive key is included in the denominator. We can then derive the lower bound of Equation (6):
Proposition 4.3
(12)  I\big(z^{q};\, z^{+}_{1:t}\big) - I\big(z^{q};\, z^{+}_{1:t-1}\big) \ge \mathcal{L}_{\mathrm{UB}}\big(z^{q}, z^{+}_{1:t-1}\big) - \mathcal{L}_{\mathrm{NCE}}\big(z^{q}, z^{+}_{1:t}\big) + \log\frac{N}{N-1}
Thus, we obtain an estimate of the lower bound of the mutual-information gain, which we further use as an intrinsic reward for the separate exploration agent:
(13)  r^{\mathrm{int}}_{t} = \mathcal{L}_{\mathrm{UB}}\big(z^{q}, z^{+}_{1:t-1}\big) - \mathcal{L}_{\mathrm{NCE}}\big(z^{q}, z^{+}_{1:t}\big)
The intrinsic reward can be interpreted as the difference between two contrastive losses. It measures how much the inference of the current task improves after collecting the new transition \tau_{t}. To make this objective clearer, we can rewrite the lower bound (12) as:
(14)  r^{\mathrm{int}}_{t} = \Big[h\big(z^{q}, z^{+}_{1:t}\big) - h\big(z^{q}, z^{+}_{1:t-1}\big)\Big] - \Big[\log \sum_{z \in Z_{t}}\exp\big(h(z^{q}, z)\big) - \log \sum_{z^{-} \in Z^{-}}\exp\big(h(z^{q}, z^{-})\big)\Big], \quad Z_{t} = \{z^{+}_{1:t}\} \cup Z^{-}
Intuitively, the first term estimates how much the similarity score for the positive pair (the correct task inference) improves after collecting the new transition. Maximizing it encourages the exploration policy to collect transitions that make it easy to distinguish between tasks and to form a correct, solid task hypothesis. The second term can be interpreted as a regularization term: while the agent tries to visit places that increase the positive score, we want to prevent it from visiting places where the negative scores (the similarity with other tasks in the embedding space) increase as well.
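The decomposition above can be computed directly from similarity scores. The following sketch evaluates the intrinsic reward for a single new transition, given the latent context of the trajectory before and after adding it and the latent contexts of the other tasks; the variable names and the bilinear score follow the earlier sketches and are illustrative assumptions, not the authors' code.

```python
import torch

def intrinsic_reward(z_query, z_traj_prev, z_traj_new, z_negatives, W):
    """Improvement of the positive similarity score minus the change of the
    log-partition over all (positive + negative) scores.

    z_query:      (D,) latent context of a reference batch from the current task
    z_traj_prev:  (D,) latent context of the trajectory up to step t-1
    z_traj_new:   (D,) latent context of the trajectory including the new transition
    z_negatives:  (N-1, D) latent contexts of the other tasks
    """
    def score(z_key):                              # bilinear score h(q, k) = q^T W k
        return z_query @ W @ z_key

    pos_gain = score(z_traj_new) - score(z_traj_prev)
    neg_scores = (z_query @ W) @ z_negatives.t()   # (N-1,) scores against other tasks
    all_new = torch.cat([score(z_traj_new).view(1), neg_scores])
    reg = torch.logsumexp(all_new, 0) - torch.logsumexp(neg_scores, 0)
    return (pos_gain - reg).item()

# Toy usage with random 7-dimensional latent contexts and 9 negative tasks.
D = 7
r_int = intrinsic_reward(torch.randn(D), torch.randn(D), torch.randn(D),
                         torch.randn(9, D), torch.randn(D, D))
```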
We summarize our training procedure in Figure 2. The exploration agent and the execution agent interact with the environment separately and maintain their own replay buffers. The execution replay buffer also contains the transitions collected by the exploration agent, stored without the intrinsic reward term. At the beginning of each meta-training episode, the two agents sample transition data to train their policies separately. We independently sample from the exploration buffer to obtain the transition batches used to compute the latent context. Note that the rewards used for computing the latent context are the original environmental rewards, without the intrinsic bonus. In practice, we use Soft Actor-Critic DBLP:conf/icml/HaarnojaZAL18 for both the exploration and the execution agent. We pre-train the context encoder with the contrastive loss for a few episodes to avoid an overly noisy intrinsic reward at the beginning of training. For the same reason, we maintain an encoder replay buffer that only contains recently collected data. During meta-testing, we first let the exploration agent collect trajectories for a few episodes and then compute the latent context from this experience. The execution agent then acts conditioned on the latent context. The pseudocode for CCM during training and adaptation can be found in our appendix.
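As a schematic illustration of this adaptation protocol (not the authors' pseudocode), the sketch below runs one gym-style meta-test trial: a few exploration episodes, context encoding, and context-conditioned execution; the environment and policy interfaces are placeholders.

```python
import torch

def meta_test(env, explore_policy, exec_policy, encoder,
              explore_episodes=2, exec_episodes=1, horizon=200):
    """Schematic CCM adaptation on a single test task."""
    # Phase 1: exploration episodes, collected without task knowledge.
    transitions = []
    for _ in range(explore_episodes):
        s = env.reset()
        for _ in range(horizon):
            a = explore_policy(s)
            s_next, r, done, _ = env.step(a)
            transitions.append(torch.as_tensor([*s, *a, r, *s_next], dtype=torch.float32))
            s = s_next
            if done:
                break
    # Phase 2: embed the collected transitions into a latent task context.
    with torch.no_grad():
        z = encoder(torch.stack(transitions))
    # Phase 3: execution episodes conditioned on the latent context.
    returns = []
    for _ in range(exec_episodes):
        s, ep_ret = env.reset(), 0.0
        for _ in range(horizon):
            a = exec_policy(s, z)
            s, r, done, _ = env.step(a)
            ep_ret += r
            if done:
                break
        returns.append(ep_ret)
    return returns
```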
Experiments
In this section, we evaluate our proposed CCM to answer the following questions:
1. Does CCM's contrastive context encoder improve the performance of state-of-the-art context-based Meta-RL methods?
2. After combining with the information-gain-based exploration policy, does CCM improve the overall adaptation performance compared with prior Meta-RL algorithms?
3. How does the regularization term (the second term in (14)) influence the exploration policy?
4. Is CCM able to extract effective and reasonable context information?
Comparison of Context Encoder Training Strategy
We first evaluate the performance of context-based Meta-RL methods combined with the contrastive context encoder on several continuous control tasks simulated via the MuJoCo physics simulator DBLP:conf/iros/TodorovET12; these are standard Meta-RL benchmarks used in prior work DBLP:conf/iclr/FakoorCSS20; DBLP:conf/icml/RakellyZFLQ19. We also evaluate on out-of-distribution (OOD) tasks, which are more challenging cases for testing the encoder's capacity to extract useful information. We compare to the two existing training strategies for context encoders: 1) RV (Recovering Value-function) DBLP:conf/icml/RakellyZFLQ19; DBLP:conf/iclr/FakoorCSS20, in which the encoder is trained with gradients from recovering the state-action value function, and 2) DP (Dynamics Prediction) DBLP:journals/corr/abs200506800; DBLP:conf/iclr/ZhouPG19, in which the encoder is trained by performing forward or backward prediction. Here we follow the settings in CaDM DBLP:journals/corr/abs200506800, which uses the combined loss of forward and backward prediction to update the context encoder.
The results are shown in Figure 3. Both CCM+DP and CCM+RV achieve comparable or better performance than the existing algorithms, implying that the proposed contrastive context encoder helps to extract contextual information more effectively. For in-distribution tasks that vary in dynamics (i.e., ant-mass and cheetah-mass), CCM+DP obtains higher returns and converges faster than CCM+RV. This is consistent with what is empirically found in CaDM DBLP:journals/corr/abs200506800: prediction models are more effective when the transition function changes across tasks. Moreover, recall that the contrastive context encoder focuses on the differences between tasks at the trajectory level. This property may compensate for DP's drawback of embedding commonly-shared dynamics knowledge that hinders meta-learning, which we assume leads to the better performance of CCM+DP on in-distribution tasks.
For out-of-distribution tasks, CCM+RV outperforms the other methods, including CCM+DP, implying that the contrastive loss combined with the loss from recovering the value function generalizes better across tasks. Recovering the value function focuses on extracting high-level task-specific features. This property avoids the overfitting problem of dynamics prediction (being limited to the dynamics shared among training tasks), which is amplified in this case. However, solely using RV to train the encoder can perform poorly due to its noisy update signal and its ignorance of detailed transition differences that also contain task-specific information. Combining it with CCM addresses these problems through the low-variance InfoNCE loss and by forcing the encoder to explicitly find the trajectory-level information that distinguishes tasks from each other.
Comparison of Overall Adaptation Performance on Complex Environments
We then consider both components of the CCM algorithm and evaluate on several complex sparse-reward environments to compare CCM with prior Meta-RL algorithms and to determine whether the information-gain-based exploration strategy improves adaptation performance. In the cheetah-sparse and walker-sparse environments, the agent receives a reward only when it is close to the target velocity. In hard-point-robot, the agent needs to reach two randomly selected goals one after another, and receives a reward only when it is close enough to the goals. We compare CCM with the state-of-the-art Meta-RL approach PEARL, which we further modify with a contrastive context encoder for a fair comparison (PEARL-CL) to isolate the influence of CCM's exploration strategy. We also consider MAML DBLP:conf/icml/FinnAL17 and ProMP DBLP:conf/iclr/RothfussLCAA19, which adapt with on-policy methods, as well as variBAD DBLP:conf/iclr/ZintgrafSISGHW20, a stable version of RL^2 DBLP:journals/corr/DuanSCBSA16; DBLP:journals/corr/WangKTSLMBKB16.
As shown in Figure 4, CCM outperforms prior Meta-RL algorithms in both final return and learning speed on these complex sparse-reward environments. Within a relatively small number of environment interactions, the on-policy methods MAML, variBAD, and ProMP struggle to learn effective policies on these complex environments, while the off-policy context-based methods CCM and PEARL-CL generally achieve higher performance.
Since the only difference between CCM and PEARL-CL here is the proposed separate exploration policy, we can conclude that CCM's information-gain-based exploration strategy enables fast adaptation by collecting more informative experience, further improving the quality of the context used by the execution policy. Note that during training, the trajectories collected by the exploration agent are also used for updating the context encoder and the execution policy. As the experimental results show, this can speed up training of the context encoder and the execution policy because the replay buffer is more informative.
Influence of the Regularization Term
We perform ablation experiments to investigate the influence of the regularization term in the objective function, which reflects the change in similarity score with all tasks in the embedding space (the second term in (14)). We compare against CCM using the intrinsic reward without the regularization term on an in-distribution version of hard-point-robot. The results are shown in Figure 5. The regularization term improves CCM's performance; we assume that without this term the exploration agent may collect transitions for which the similarity scores with all tasks in the embedding space increase, resulting in a noisy update signal. Due to space limits, we put further ablation experiments (i.e., intrinsic reward scale, context updating frequency) in the appendix.
Visualization for Context
Finally, we visualize CCM's learned context during the adaptation phase via t-SNE Maaten2008VisualizingDU and compare with PEARL. We run the learned policies on ten randomly sampled test tasks multiple times to collect trajectories. We then encode the collected trajectories into latent contexts with the learned context encoder and visualize them via t-SNE in Figure 6. The latent contexts generated by CCM from the same task are close together in the embedding space while maintaining clear boundaries between different tasks. In contrast, the latent contexts generated by PEARL show globally messy clustering with only a few reasonable patterns in local regions. This indicates that CCM extracts higher-quality (i.e., compact and sufficient) task-specific information from the environment than PEARL. As a result, the policy conditioned on the high-quality latent context is more likely to obtain a higher return on meta-testing tasks, which is consistent with our earlier empirical results.
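A minimal sketch of this kind of visualization, assuming a trained deterministic context encoder and per-task lists of transition batches (both hypothetical names), could look as follows:

```python
import numpy as np
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def visualize_contexts(encoder, task_batches):
    """task_batches: list over test tasks; each entry is a list of transition batches."""
    latents, labels = [], []
    with torch.no_grad():
        for task_id, batches in enumerate(task_batches):
            for batch in batches:
                latents.append(encoder(batch).numpy())
                labels.append(task_id)
    embedding = TSNE(n_components=2, perplexity=10).fit_transform(np.stack(latents))
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=15)
    plt.title("t-SNE of latent contexts on test tasks")
    plt.show()
```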
Related Work
Contrastive Learning
Contrastive learning has recently achieved great success in learning representations for image and sequential data DBLP:journals/corr/abs180703748; DBLP:journals/corr/abs190509272; DBLP:conf/cvpr/He0WXG20; DBLP:journals/corr/abs200205709. In RL, it has been used to extract reward signals in latent space DBLP:conf/icra/SermanetLCHJSLB18; DBLP:conf/iros/DwibediTLS18 or as an auxiliary task for learning representations of high-dimensional data DBLP:journals/corr/abs200404136; DBLP:conf/nips/AnandROBCH19. Contrastive learning learns representations that obey similarity constraints by dividing a data set into similar (positive) and dissimilar (negative) pairs and minimizing a contrastive loss. Prior work DBLP:journals/corr/abs200404136; DBLP:journals/corr/abs190509272 has proposed various ways of generating positive and negative pairs for image-based input data; the standard approach is to create multiple views of each data point through random crops and other data augmentations Wu2018UnsupervisedFL; DBLP:journals/corr/abs200205709; DBLP:conf/cvpr/He0WXG20. In this work, however, we focus on low-dimensional input data and leverage the natural discrimination between the trajectories of different tasks to generate positive and negative data. Many contrastive loss functions have also been proposed, among which the most competitive is InfoNCE DBLP:journals/corr/abs180703748. The motivation behind the contrastive loss is the InfoMax principle DBLP:journals/computer/Linsker88, which can be interpreted as maximizing the mutual information between two views of the data. The relationship between the InfoNCE loss and mutual information is comprehensively explained in DBLP:conf/icml/PooleOOAT19.
Meta-Reinforcement Learning
Meta-RL extends the framework of meta-learning Schmidhuber1987EvolutionaryPI; Thrun1998LearningTL; Naik1992MetaneuralNT to the reinforcement learning setting. Recurrent or recursive Meta-RL methods DBLP:journals/corr/DuanSCBSA16; DBLP:journals/corr/WangKTSLMBKB16 directly learn a function that takes in past experiences and outputs a policy for the agent. Gradient-based Meta-RL methods DBLP:conf/icml/FinnAL17; DBLP:conf/iclr/RothfussLCAA19; Liu2019TamingME; Gupta2018MetaReinforcementLO meta-learn a model initialization and adapt the parameters with policy gradients. Both kinds of methods rely on on-policy meta-training and struggle to achieve good performance on complicated environments.
We here consider context-based Meta-RL DBLP:conf/icml/RakellyZFLQ19; Fu2019EfficientMR; DBLP:conf/iclr/FakoorCSS20, a class of Meta-RL algorithms that can meta-learn from off-policy data by leveraging context. Rakelly et al. (DBLP:conf/icml/RakellyZFLQ19) propose PEARL, which adapts to a new environment by inferring latent context variables from a small number of trajectories. Fakoor et al. (DBLP:conf/iclr/FakoorCSS20) propose MQL, showing that context combined with otherwise vanilla RL algorithms performs comparably to PEARL. Lee et al. (DBLP:journals/corr/abs200506800) learn a global model that generalizes across tasks by training a latent context to capture the local dynamics. In contrast, our approach directly focuses on the context itself, which motivates an algorithm to improve the quality of the latent context.
Some prior research has also proposed exploring to obtain the most informative trajectories in Meta-RL DBLP:conf/icml/RakellyZFLQ19; DBLP:conf/iclr/RothfussLCAA19; Gupta2018MetaReinforcementLO. The common idea behind these approaches is Thompson sampling Schmidhuber1987EvolutionaryPI. However, this kind of posterior sampling is conditioned on the learned execution policy, which may restrict exploration to behaviors useful for previously learned tasks. In contrast, we theoretically derive a lower-bound estimate of the exploration objective, based on which a separate exploration agent is trained.
Conclusion
In this paper, we argue that constructing a powerful context for Meta-RL involves two problems: 1) how to collect informative trajectories of which the corresponding context reflects the specification of tasks, and 2) how to train a compact and sufficient encoder that can embed the task-specific information contained in prior trajectories. We then propose our method, CCM, which tackles these two problems respectively. Firstly, CCM focuses on the underlying structure behind different tasks' transitions and trains the encoder by leveraging contrastive learning. CCM further learns a separate exploration agent with an information-theoretic objective that aims to maximize the improvement of task inference after collecting new transitions. Empirical results on several complex simulated control tasks show that CCM outperforms state-of-the-art Meta-RL methods by addressing the aforementioned problems.
Acknowledgements
We thank Chenjia Bai, Qianli Shen, and Weixun Wang for useful discussions and suggestions.
References
Appendix A Environment Details
We run all experiments with OpenAI Gym DBLP:journals/corr/BrockmanCPSSTZ16 and the MuJoCo simulator DBLP:conf/iros/TodorovET12. The benchmarks used in our experiments are visualized in Figure 7. We modify the original tasks into Meta-RL tasks similarly to DBLP:conf/icml/RakellyZFLQ19; DBLP:journals/corr/abs200506800; DBLP:conf/iclr/FakoorCSS20:

- humanoid-dir: The target running direction changes across tasks. The horizon is 200.
- cheetah-mass: The mass of the cheetah changes across tasks, which changes the transition dynamics. The horizon is 200.
- cheetah-mass-OOD: The mass of the cheetah for a training task is sampled uniformly from a fixed training range, while that for a test task is sampled uniformly from a range outside the training range. The horizon is 200.
- ant-mass: The mass of the ant changes across tasks, which changes the transition dynamics. The horizon is 200.
- cheetah-vel-OOD: The target velocity for a training task is sampled uniformly from a fixed training range, while that for a test task is sampled uniformly from a range outside the training range. The horizon is 200.
- cheetah-sparse: Both the target velocity and the mass of the cheetah change across tasks. The agent receives a dense reward only when it is close enough to the target velocity. The horizon is 64.
- walker-sparse: The target velocity changes across tasks. The agent receives a dense reward only when it is close enough to the target velocity. The horizon is 64.
- hard-point-robot: The agent needs to reach two randomly selected goals one after another, with a reward given only when it is inside the goal radius. The horizon is 40.
Appendix B Implementation Details
For each environment, we meta-train 3 models and meta-test each of them. The evaluation results report the average return of the last episode of a test rollout. We show the pseudocode for CCM during meta-training and meta-testing in Algorithms 1 and 2. Note that we calculate the exploration intrinsic reward during interaction with the environment instead of during training, to reduce the computational cost. To avoid the intrinsic reward being too noisy, we pre-train the context encoder for several episodes and restrict the exploration replay buffer to recently collected data.
We model the context encoder as a product of independent Gaussian factors, in which the mean and variance are parameterized by MLPs that produce a 7-dimensional latent vector (a schematic sketch is given after Table 2). When comparing context encoder training strategies, we use a deterministic version of the context encoder network; in other cases, the weight of the KL divergence term is selected from a small set of candidate values. For the contrastive context encoder, the scale of the RV or DP loss is kept fixed while the scale of the contrastive loss is tuned. For DP, we follow the penalty parameter setting of DBLP:journals/corr/abs200506800. We use SAC for both the exploration and execution agents. Other hyperparameters are detailed in Tables 1 and 2.
Environment | Meta-train tasks | Meta-test tasks | Exploration episodes | Execution episodes | Embedding batch size
cheetah-sparse | 60 | 10 | 2 | 1 | 128
walker-sparse | 60 | 10 | 2 | 2 | 128
hard-point-robot | 40 | 10 | 4 | 2 | 256
Environment | Meta-train tasks | Meta-test tasks | Meta batch size
humanoid-dir | 100 | 30 | 20
cheetah-mass | 30 | 5 | 16
cheetah-mass-OOD | 30 | 5 | 16
cheetah-vel-OOD | 50 | 5 | 24
ant-mass | 50 | 5 | 24
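For illustration, the sketch below shows one way to implement the probabilistic context encoder described earlier in this appendix: per-transition Gaussian factors are combined by a product of Gaussians, as in PEARL, and a context is drawn with the reparameterization trick; layer sizes and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class ProbabilisticContextEncoder(nn.Module):
    """Per-transition Gaussian factors multiplied into one task posterior."""
    def __init__(self, transition_dim, latent_dim=7, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))    # per-factor mean and log-variance

    def forward(self, transitions):               # (B, transition_dim)
        mu, logvar = self.net(transitions).chunk(2, dim=-1)
        var = logvar.exp().clamp(min=1e-6)
        precision = 1.0 / var                      # (B, latent_dim)
        post_var = 1.0 / precision.sum(0)          # product of Gaussians: precisions add
        post_mu = post_var * (precision * mu).sum(0)
        z = post_mu + post_var.sqrt() * torch.randn_like(post_mu)   # reparameterized sample
        return z, post_mu, post_var

# Toy usage: encode 32 transitions of dimension 20 into a 7-dimensional context.
encoder = ProbabilisticContextEncoder(20)
z, mu, var = encoder(torch.randn(32, 20))
```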
Appendix C Additional Ablation Study
Further evaluation of contrastive context encoder
In this section, we show the performance of our proposed contrastive context encoder without combining it with another training strategy (i.e., RV or DP). CCM-only denotes that the context encoder is trained only with gradients from the contrastive loss. As shown in Figure 9, a context encoder trained only with the contrastive loss still reaches comparable or better performance than existing methods.
Intrinsic reward scale
We investigate the influence of the intrinsic reward on CCM's exploration by varying the intrinsic reward scale. We experiment with several values and show the results on cheetah-sparse. As shown in Figure 10, both too large and too small an intrinsic reward scale decrease the adaptation performance: relying entirely on either the intrinsic reward or the environment reward hinders the exploration process.
Context updating frequency
We also conduct experiments to show the influence of the context updating frequency during adaptation. We compare the adaptation performance of updating the context every episode versus updating it every step. As shown in Figure 10, updating the context every step obtains relatively better returns, which is consistent with our original objective in Equation (4).
Appendix D Additional Details
Derivation of Equation (14)
(15)  r^{\mathrm{int}}_{t} = \mathcal{L}_{\mathrm{UB}}\big(z^{q}, z^{+}_{1:t-1}\big) - \mathcal{L}_{\mathrm{NCE}}\big(z^{q}, z^{+}_{1:t}\big)
= \Big[\log \sum_{z^{-} \in Z^{-}}\exp\big(h(z^{q}, z^{-})\big) - h\big(z^{q}, z^{+}_{1:t-1}\big)\Big] - \Big[\log \sum_{z \in Z_{t}}\exp\big(h(z^{q}, z)\big) - h\big(z^{q}, z^{+}_{1:t}\big)\Big]
= \Big[h\big(z^{q}, z^{+}_{1:t}\big) - h\big(z^{q}, z^{+}_{1:t-1}\big)\Big] - \Big[\log \sum_{z \in Z_{t}}\exp\big(h(z^{q}, z)\big) - \log \sum_{z^{-} \in Z^{-}}\exp\big(h(z^{q}, z^{-})\big)\Big]
where the losses are evaluated with single-sample estimates of the expectations and Z_{t} = \{z^{+}_{1:t}\} \cup Z^{-}.
Connection to recent work on Meta-RL exploration
Concurrently with this work, Liu et al. (DBLP:journals/corr/abs200802790) and Zhang et al. (DBLP:journals/corr/abs200608170) also focus on the exploration problem in Meta-RL and propose to learn execution and exploration with decoupled objectives. Our paper first points out that obtaining a high-quality latent context involves not only exploring to gain informative trajectories but also training the context encoder to extract the task-specific information; the two concurrent works focus only on the exploration problem, whereas we address the encoder-training problem with the proposed contrastive context encoder. As for exploration, we theoretically derive a low-variance variational lower bound for the information-gain objective based on the contrastive loss, without introducing additional components such as a prediction model. Secondly, the intrinsic reward used by the two concurrent works can be interpreted as a Meta-RL version of "curiosity" DBLP:conf/icml/PathakAED17, which encourages the agent to visit places that produce high prediction error. However, high prediction error does not always indicate an informative transition; it can be caused by task-irrelevant random noise. In contrast, our objective based on the temporal difference of the contrastive loss not only encourages reducing the uncertainty of the task belief but also steers the agent towards transitions that make it easy to form a correct task hypothesis, filtering out task-irrelevant information.
Footnotes
 For environments that have fixed dynamics, we further modify the model to predict the reward as well.