# Improving Sample Efficiency in Model-Free Reinforcement Learning from Images

###### Abstract

Training an agent to solve control tasks directly from high-dimensional images with model-free reinforcement learning (RL) has proven difficult. The agent needs to learn a latent representation together with a control policy to perform the task. Fitting a high-capacity encoder using a scarce reward signal is not only sample inefficient, but also prone to suboptimal convergence.
Two ways to improve sample efficiency are to extract relevant features for the task and use off-policy algorithms.
We dissect various approaches of learning good latent features, and conclude that the image reconstruction loss is the essential ingredient that enables efficient and stable representation learning in image-based RL.
Following these findings, we devise an off-policy actor-critic algorithm with an auxiliary decoder that trains end-to-end and matches state-of-the-art performance across both model-free and model-based algorithms on many challenging control tasks. We release our code to encourage future research on image-based RL. Code, results, and videos are available at https://sites.google.com/view/sac-ae/home.


## 1 Introduction

Cameras are a convenient and inexpensive way to acquire state information, especially in complex, unstructured environments, where effective control requires access to the proprioceptive state of the underlying dynamics. Thus, having effective RL approaches that can utilize pixels as input would potentially enable solutions for a wide range of real world problems.

The challenge is to efficiently learn a mapping from pixels to an appropriate representation for control using only a sparse reward signal. Although deep convolutional encoders can learn good representations (upon which a policy can be trained), they require large amounts of training data. As existing reinforcement learning approaches already have poor sample complexity, this makes direct use of pixel-based inputs prohibitively slow. For example, model-free methods on Atari (bellemare13arcade) and DeepMind Control (DMC) (tassa2018dmcontrol) take tens of millions of steps (mnih2013dqn; barth-maron2018d4pg), which is impractical in many applications, especially robotics.

A natural solution is to add an auxiliary task with an unsupervised objective to improve sample efficiency. The simplest option is an autoencoder with a pixel reconstruction objective. Prior work has attempted to learn state representations from pixels with autoencoders using a two-step training procedure: the representation is first trained via the autoencoder, and then either a policy is learned on top of the fixed representation (lange10deepaeinrl; munk2016mlddpg; higgins2017darla; zhang2018ddr; nair2018imaginedgoal), or planning is performed in the learned space (mattner2012ae; finn2015deepspatialae). This adds stability in optimization by circumventing dueling training objectives but leads to suboptimal policies. Other work utilizes end-to-end model-free learning with an auxiliary reconstruction signal in an on-policy manner (jaderberg2016unreal).

We revisit the concept of adding an autoencoder to model-free RL approaches, but with a focus on off-policy algorithms. We perform a sequence of careful experiments to understand why previous approaches did not work well. We found that a pixel reconstruction loss is vital for learning a good representation, specifically when trained end-to-end. Based on these findings, we propose a simple autoencoder-based off-policy method that can be trained end-to-end. Our method is the first model-free off-policy algorithm to successfully train simultaneously both the latent state representation and policy in a stable and sample-efficient manner.

Of course, some recent state-of-the-art model-based RL methods (hafner2018planet; lee2019slac) have demonstrated superior sample efficiency to leading model-free approaches on pixel tasks from (tassa2018dmcontrol). But we find that our model-free, off-policy, autoencoder-based approach is able to match their performance, closing the gap between model-based and model-free approaches in image-based RL, despite being a far simpler method that does not require a world model.

This paper makes three main contributions: (i) a demonstration that adding a simple auxiliary reconstruction loss to a model-free off-policy RL algorithm achieves comparable results to state-of-the-art model-based methods on the suite of continuous control tasks from tassa2018dmcontrol; (ii) an understanding of the issues involved with combining autoencoders with model-free RL in the off-policy setting that guides our algorithm; and (iii) an open-source PyTorch implementation of our simple method for researchers and practitioners to use as a strong baseline that may easily be built upon.

## 2 Related work

Efficient learning from high-dimensional pixel observations has been a problem of paramount importance for model-free RL. While some impressive progress has been made applying model-free RL to domains with simple dynamics and discrete action spaces (mnih2013dqn), attempts to scale these approaches to complex continuous control environments have largely been unsuccessful, both in simulation and the real world. A glaring issue is that the RL signal is much sparser than in supervised learning, which leads to sample inefficiency, and higher dimensional observation spaces such as pixels worsen this problem.

One approach to alleviate this problem is training with auxiliary losses. Early work (lange10deepaeinrl) explores using deep autoencoders to learn feature spaces in visual reinforcement learning. Crucially, lange10deepaeinrl propose to recompute features for all collected experiences after each update of the autoencoder, which makes this approach impractical to scale to more complicated domains. Moreover, this method has only been demonstrated on toy problems. Alternatively, finn2015deepspatialae apply deep autoencoder pretraining that does not require iterative re-training to real-world robots, improving upon the computational complexity of earlier methods. However, in this work the linear policy is trained separately from the autoencoder, which we find does not perform as well as end-to-end methods.

shelhamer16selfsuperrl use auxiliary losses in Atari that incorporate forward and inverse dynamics with A3C, an on-policy algorithm. They recommend a multi-task setting and learning dynamics and reward to find a good representation, which relies on the assumption that the dynamics in the task are easy to learn and useful for learning a good policy. jaderberg2016unreal propose to use unsupervised auxiliary tasks, both observation-based and reward-based, built on real-world inductive priors, and show improvements in Atari, again in the on-policy regime, which is much more stable for learning. Unfortunately, this work also relies on inductive biases by designing internal rewards to learn a good representation, which is hard to scale to real-world problems. higgins2017darla; nair2018imaginedgoal use a beta variational autoencoder ($\beta$-VAE) (kingma2013auto; higgins2017betavae) and attempt to extend unsupervised representation pretraining to the off-policy setting, but find it hard to perform end-to-end training, thus falling back to the iterative re-training procedure (lange10deepaeinrl; finn2015deepspatialae).

There has been more success in using model-based methods on images, such as hafner2018planet; lee2019slac. These methods use a world model (ha2018worldmodels) approach, learning a representation space using a latent dynamics loss and pixel decoder loss to ground on the original observation space. These model-based reinforcement learning methods often show improved sample efficiency, but with the additional complexity of balancing various auxiliary losses, such as a dynamics loss, reward loss, and decoder loss in addition to the original policy and value optimizations. These proposed methods are correspondingly brittle to hyperparameter settings, and difficult to reproduce, as they balance multiple training objectives.

To close the gap between model-based and model-free image-based RL in terms of sample efficiency and sidestep the issues of model learning, our goal is to train a model-free off-policy algorithm with auxiliary reconstruction loss in a stable manner.

## 3 Background

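A fully observable Markov decision process (MDP) is described by the tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(s_{t+1} \mid s_t, a_t)$ is the probability distribution over transitions, $R(s_t, a_t, s_{t+1})$ is the reward function, and $\gamma$ is the discount factor (bellman1957mdp). An agent starts in an initial state $s_1$ sampled from a fixed distribution $p(s_1)$, then at each timestep $t$ it takes an action $a_t \in \mathcal{A}$ from a state $s_t \in \mathcal{S}$ and moves to a next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. After each action the agent receives a reward $r_t = R(s_t, a_t, s_{t+1})$. We consider episodic environments with the length fixed to $T$. The goal of standard RL is to learn a policy $\pi(a_t \mid s_t)$ that maximizes the agent’s expected cumulative reward $\sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}[r_t]$, where $\rho_\pi$ is the state-action marginal distribution induced by the policy $\pi(a_t \mid s_t)$ and the transition distribution $P(s_{t+1} \mid s_t, a_t)$. An important modification (ziebart2008maxent) augments this objective with an entropy term $\mathcal{H}(\pi(\cdot \mid s_t))$ to encourage exploration and robustness to noise. The resulting maximum entropy objective is then defined as:

$$\pi^{*} = \arg\max_{\pi} \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r_t + \alpha \mathcal{H}(\pi(\cdot \mid s_t)) \right],$$

where $\alpha$ is a temperature parameter that balances between optimizing for the reward and for the stochasticity of the policy.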
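As a rough illustration of the maximum entropy objective, the sketch below adds an entropy bonus to each reward along one trajectory. It assumes a plain (unsquashed) diagonal Gaussian policy, for which the differential entropy has a closed form; the function names are hypothetical and not taken from the released code.

```python
import math


def gaussian_entropy(log_stds):
    """Differential entropy of a diagonal Gaussian given per-dimension log std."""
    dim = len(log_stds)
    return 0.5 * dim * (1.0 + math.log(2.0 * math.pi)) + sum(log_stds)


def max_entropy_return(rewards, log_stds_per_step, alpha):
    """Single-trajectory estimate of sum_t [ r_t + alpha * H(pi(.|s_t)) ]."""
    return sum(
        r + alpha * gaussian_entropy(log_stds)
        for r, log_stds in zip(rewards, log_stds_per_step)
    )
```

With `alpha = 0` the quantity reduces to the ordinary cumulative reward; larger `alpha` rewards policies that stay more stochastic.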

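We build on Soft Actor-Critic (SAC) (haarnoja2018sac), an *off-policy* actor-critic method that uses the maximum entropy framework to derive soft policy iteration. At each iteration SAC performs a soft policy evaluation step and a soft policy improvement step. The soft policy evaluation step fits a parametric soft Q-function $Q(s_t, a_t)$ (critic) by minimizing the soft Bellman residual:

$$\begin{aligned} J(Q) &= \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}}\left[ \left( Q(s_t, a_t) - r_t - \gamma \bar{V}(s_{t+1}) \right)^2 \right], \\ \bar{V}(s_{t+1}) &= \mathbb{E}_{a_{t+1} \sim \pi}\left[ \bar{Q}(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \mid s_{t+1}) \right], \end{aligned} \tag{1}$$

where $\mathcal{D}$ is the replay buffer, and $\bar{Q}$ is the target soft Q-function parametrized by a weight vector obtained using the exponential moving average of the soft Q-function weights to stabilize training. The soft policy improvement step then attempts to learn a parametric policy $\pi(a_t \mid s_t)$ (actor) by directly minimizing the KL divergence between the policy and a Boltzmann distribution induced by the current soft Q-function, producing the following objective:

$$J(\pi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[ \mathbb{E}_{a_t \sim \pi}\left[ \alpha \log \pi(a_t \mid s_t) - Q(s_t, a_t) \right] \right]. \tag{2}$$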

The policy is parametrized as a diagonal Gaussian to handle continuous action spaces.
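To make the two SAC objectives concrete, here is a minimal single-sample sketch of the critic and actor losses. It assumes scalar Q-values and log-probabilities that have already been computed by some networks; the helper names are hypothetical, and a real implementation would average these quantities over a batch sampled from the replay buffer.

```python
def soft_value(target_q_next, log_prob_next, alpha):
    """Single-sample soft state value: target Q minus the entropy-weighted log-probability."""
    return target_q_next - alpha * log_prob_next


def critic_loss(q, reward, gamma, target_q_next, log_prob_next, alpha):
    """Squared soft Bellman residual for one transition."""
    target = reward + gamma * soft_value(target_q_next, log_prob_next, alpha)
    return (q - target) ** 2


def actor_loss(q, log_prob, alpha):
    """Single-sample actor objective: alpha * log pi(a|s) - Q(s, a)."""
    return alpha * log_prob - q
```

Note that the bootstrapped target uses the target Q-function rather than the online one, which is what the exponential moving average of weights is meant to stabilize.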

When learning from raw images, we deal with the problem of partial observability, which is formalized by a partially observable MDP (POMDP). In this setting, instead of getting a low-dimensional state $s_t \in \mathcal{S}$ at time $t$, the agent receives a high-dimensional observation $o_t \in \mathcal{O}$, which is a rendering of a potentially incomplete view of the corresponding state $s_t$ of the environment (kaelbling1998planning). This complicates applying RL as the agent now needs to also learn a compact latent representation $z_t$ to infer the state. Fitting a high-capacity encoder using only a scarce reward signal is sample inefficient and prone to suboptimal convergence. Following prior work (lange10deepaeinrl; finn2015deepspatialae) we explore unsupervised pretraining via an image-based autoencoder. In practice, the autoencoder is represented as a convolutional encoder $g_\phi$ that maps an image observation $o_t$ to a low-dimensional latent vector $z_t$, and a deconvolutional decoder $f_\theta$ that reconstructs $z_t$ back to the original image $o_t$. The optimization is done by minimizing the standard reconstruction objective:

$$J(\mathrm{AE}) = \mathbb{E}_{o_t \sim \mathcal{D}}\left[ -\log p_\theta(o_t \mid z_t) \right], \quad z_t = g_\phi(o_t). \tag{3}$$

Or in the case of $\beta$-VAE (kingma2013auto; higgins2017betavae), where the variational distribution is parametrized as a diagonal Gaussian, the objective is defined as:

$$J(\mathrm{VAE}) = \mathbb{E}_{o_t \sim \mathcal{D}}\left[ -\mathbb{E}_{z_t \sim q_\phi(z_t \mid o_t)}\left[ \log p_\theta(o_t \mid z_t) \right] + \beta D_{\mathrm{KL}}\left( q_\phi(z_t \mid o_t) \,\|\, p(z_t) \right) \right], \tag{4}$$

where $q_\phi(z_t \mid o_t) = \mathcal{N}(\mu_\phi(o_t), \sigma_\phi^2(o_t))$ and $p(z_t) = \mathcal{N}(0, I)$. The latent vector $z_t$ is then used by an RL algorithm, such as SAC, instead of the unavailable true state $s_t$. To infer temporal statistics, such as velocity and acceleration, it is common practice to stack three consecutive frames to form a single observation (mnih2013dqn). We emphasize that in contrast to model-based methods (ha2018worldmodels; hafner2018planet), we do not predict future states and solely focus on learning representations from the current observation to stay model-free.
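The $\beta$-VAE objective can be sketched for a single observation as follows. This is an illustration under two assumptions that are not stated in the text: the reconstruction term is taken to be a squared error (a Gaussian likelihood up to constants), and the prior is a standard normal, so the KL term has a closed form. All names are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def squared_error(x, x_hat):
    """Assumed reconstruction term: sum of squared pixel differences."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def beta_vae_loss(x, x_hat, mu, log_var, beta):
    """Reconstruction error plus beta-weighted KL regularizer."""
    return squared_error(x, x_hat) + beta * kl_to_standard_normal(mu, log_var)
```

Setting `beta = 0` leaves only the reconstruction term, while larger values pull the latent distribution toward the prior.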

## 4 A dissection of learning state representations with $\beta$-VAE

In this section we explore in a systematic fashion how model-free *off-policy* RL can be made to train directly from pixel observations. We start by noting a dramatic performance drop when SAC is trained on pixels instead of proprioceptive state (Section 4.2) in the off-policy regime. This result motivates us to explore different ways of employing auxiliary supervision to speed up representation learning. While a wide range of auxiliary objectives could be added to aid effective representation learning, for simplicity we focus our attention on autoencoders.
We follow lange10deepaeinrl; finn2015deepspatialae and in Section 4.3 try an iterative unsupervised pretraining of an autoencoder that reconstructs pixels and is parameterized as a $\beta$-VAE as per nair2018imaginedgoal; higgins2017betavae. Exploring the training procedure used in previous work shows it to be sub-optimal and points towards the need for end-to-end training of the $\beta$-VAE with the policy network. Our investigation in Section 4.4 renders this approach useless due to severe instability in training, especially with larger $\beta$ values. We resolve this by using deterministic forms of the variational autoencoder (ghosh2019rae) and a careful learning procedure. This leads to our algorithm, which is described and evaluated in Section 5.

### 4.1 Experimental setup

We briefly state our setup here; for more details refer to Appendix B. Throughout the paper we evaluate on challenging image-based continuous control tasks from tassa2018dmcontrol depicted in Figure 1. For a concise presentation, in some places of the main paper we plot results for reacher_easy, ball_in_cup_catch, and walker_walk only, while full results are available in the Appendix. An episode for each task results in a maximum total reward of 1000 and lasts for exactly 1000 steps. Image observations are represented as RGB renderings, where each pixel is scaled down to the $[0, 1]$ range. To infer velocity and acceleration we stack 3 consecutive frames following standard practice from mnih2013dqn. We keep the hyper parameters fixed across all tasks, except for action repeat, which we set only when learning from pixels according to hafner2018planet for a fair comparison to the baselines. If action repeat is used, the number of training observations is only a fraction of the environment steps (e.g. a 1000-step episode at action repeat 4 results in only 250 training observations). The exact action repeat settings can be found in Section B.3. We evaluate the agent at regular intervals of training observations by computing the average total reward across a fixed number of evaluation episodes. For reliable comparison we run 10 random seeds for each configuration and compute the mean and standard deviation of the evaluation reward.

### 4.2 Model-free off-policy RL with no auxiliary tasks

We start with an experiment comparing a model-free and off-policy algorithm SAC (haarnoja2018sac) on pixels, with two state-of-the-art model-based algorithms, PlaNet (hafner2018planet) and SLAC (lee2019slac), and an upper bound of SAC on proprioceptive state (Table 1). We see a large gap between the capability of SAC on pixels (SAC:pixel), versus PlaNet and SLAC, which make use of many auxiliary tasks to learn a better representation, and can achieve performance close to the upper bound of SAC on proprioceptive state (SAC:state). From now, SAC:pixel will be our lower bound on performance as we gradually introduce different auxiliary reconstruction losses in order to close the performance gap.

Task name | Number of episodes | SAC:pixel | PlaNet | SLAC | SAC:state |
---|---|---|---|---|---|
finger_spin | 1000 | ||||
walker_walk | 1000 | ||||
ball_in_cup_catch | 2000 | ||||
cartpole_swingup | 2000 | - | |||
reacher_easy | 2500 | - | |||
cheetah_run | 3000 |

### 4.3 Iterative representation learning with $\beta$-VAE

Following lange10deepaeinrl; finn2015deepspatialae, we experiment with unsupervised representation pretraining using a pixel autoencoder, which speeds up representation learning in image-based RL. Taking into account successful results from nair2018imaginedgoal; higgins2017darla of using a $\beta$-VAE (kingma2013auto; higgins2017betavae) in the iterative re-training setup, we likewise employ a $\beta$-VAE. We first learn a representation space by pretraining the networks of the $\beta$-VAE according to the loss in Equation 4 on data collected from a random policy. We then learn a control policy on top of the frozen latent representations. We tune $\beta$ for best performance, and find that large $\beta$ is worse and that very small $\beta$ performs best. In Figure 2 we vary the frequency $N$ at which the representation space is updated, from $N = \infty$, where the representation is never updated after an initial pretraining period with randomly collected data, to $N = 1$, where the representation is updated after every policy update. There is a positive correlation between this frequency and the final policy performance. We emphasize that the gradients are never shared between the $\beta$-VAE learning the representation space and the actor-critic learning the policy. These results suggest that if we can combine the representation pretraining via a $\beta$-VAE together with the policy learning in a stable end-to-end procedure, we would expect better performance. However, we note that prior work (nair2018imaginedgoal; higgins2017betavae) has been unable to successfully demonstrate this. Regardless, we next perform such an experiment to gain a better understanding of what goes wrong.
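The update-frequency sweep described above can be pictured as a simple schedule. The sketch below is only an illustration: it assumes the representation is retrained once every `n` policy updates, with `n=None` standing in for the "never updated after pretraining" extreme; the function name is hypothetical.

```python
def representation_update_steps(num_policy_updates, n):
    """Return the 1-based policy-update indices at which the autoencoder is retrained.

    n=None models the case where the representation stays frozen after pretraining.
    """
    if n is None:
        return []
    return [t for t in range(1, num_policy_updates + 1) if t % n == 0]
```

For example, `representation_update_steps(6, 2)` selects every second policy update, while `n=1` selects all of them.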

### 4.4 An attempt for end-to-end representation learning with $\beta$-VAE

Our findings and the results from jaderberg2016unreal motivate us to allow gradient propagation to the encoder of the $\beta$-VAE from the actor-critic, which in our case is SAC. We enable end-to-end learning by allowing the encoder to update not only with gradients from the $\beta$-VAE loss (Equation 4), as done in Section 4.3, but also with gradients coming from the actor and critic losses (Equations 2 and 1) specified in Section 3. Results in Figure 3 show that end-to-end policy learning together with the $\beta$-VAE is unstable in the off-policy setting and prone to divergent behaviours that hurt performance. Our conclusion supports the findings from nair2018imaginedgoal; higgins2017betavae, which alleviate the problem by falling back to the iterative re-training procedure. We next attempt stabilizing end-to-end training and introduce our method.

## 5 Our method: SAC+AE with end-to-end off-policy training

We now seek to design a stable training procedure that can update the pixel autoencoder simultaneously with policy learning. We build on top of SAC (haarnoja2018sac), a model-free and off-policy actor-critic algorithm. Based on our findings from Section 4, we propose a new, simple algorithm, SAC+AE, that enables end-to-end training. We notice that electing to learn deterministic latent representations, rather than stochastic ones as in the $\beta$-VAE case, has a stabilizing effect on end-to-end learning in the off-policy regime. We thus use a deterministic autoencoder in the form of the regularized autoencoder (RAE) (ghosh2019rae), which has many structural similarities with the $\beta$-VAE. We also found it important to update the convolutional weights in the target critic network faster than the rest of the parameters. This allows faster learning while preserving the stability of the off-policy actor-critic. Finally, we share the encoder’s convolutional weights between the actor and critic networks, but prevent the actor from updating them. Our algorithm is presented in Figure 4 for visual guidance.
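The faster target update for the convolutional weights can be sketched as two exponential moving averages with different rates. The example below is a simplification: parameters are plain floats stored in dictionaries (a hypothetical layout), and the default rates follow the soft-update values listed in Table 4 (0.05 for the critic encoder, 0.01 for the Q-function).

```python
def soft_update(target, online, tau):
    """In-place EMA: target <- tau * online + (1 - tau) * target."""
    for name, value in online.items():
        target[name] = tau * value + (1.0 - tau) * target[name]


def update_target_critic(target, online, q_tau=0.01, encoder_tau=0.05):
    """Track the online critic, moving the encoder weights faster than the Q-head."""
    soft_update(target["encoder"], online["encoder"], encoder_tau)
    soft_update(target["q"], online["q"], q_tau)
```

A larger rate for the encoder lets the target network pick up representation changes sooner, while the slower Q-head rate keeps the bootstrapped targets stable.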

### 5.1 Performance on pixels

We now show that our simple method, SAC+AE, achieves stable end-to-end training of an off-policy algorithm from images with an auxiliary reconstruction loss. We test our method on 6 challenging image-based continuous control tasks (see Figure 1) from DMC (tassa2018dmcontrol). The RAE consists of a convolutional and a deconvolutional trunk, with the same number of filters and the same kernel size in every layer. The actor and critic networks are 3-layer MLPs with ReLU activations. We update the RAE and actor-critic network at each environment step with a batch of experience sampled from a replay buffer. A comprehensive overview of other hyper parameters is in Appendix B.

We perform comparisons against several state-of-the-art model-free and model-based RL algorithms for learning from pixels. In particular: D4PG (barth-maron2018d4pg), an off-policy actor-critic algorithm, PlaNet (hafner2018planet), a model-based method that learns a dynamics model with deterministic and stochastic latent variables and employs cross-entropy planning for control, and SLAC (lee2019slac), which combines a purely stochastic latent model with a model-free soft actor-critic. In addition, we compare against SAC that learns from low-dimensional proprioceptive state, as an upper bound on performance. In Figure 5 we show that SAC+AE:pixel is able to match state-of-the-art model-based methods such as PlaNet and SLAC, and significantly improves performance over the baseline SAC:pixel. Note that we use 10 random seeds, as recommended in henderson2017rlmatters, whereas the PlaNet and SLAC numbers shown are only over 4 and 2 seeds, respectively, as per the original publications.

## 6 Ablations

To shed more light on some properties of the latent representation space learned by our algorithm we conduct several ablation studies. In particular, we want to answer the following questions: (i) is our method able to extract a sufficient amount of information from raw images to recover corresponding proprioceptive states readily? (ii) can our learned latent representation generalize to unseen tasks with similar image observations, but different reward objective, without reconstruction signal? Below, we answer these questions.

### 6.1 Representation power of the encoder

Given how significantly our method outperforms a variant that does not have access to the image reconstruction signal, we hypothesize that the learned representation space encodes a sufficient amount of information about the internal state of the environment from raw images. Moreover, this information can be easily extracted from the latent state. To test this conjecture, we train SAC+AE:pixel and SAC:pixel until convergence on cheetah_run, then fix their encoders. We then train two identical linear projections to map the encoders’ latent embedding of image observations into the corresponding proprioceptive states. Finally, we compare ground truth proprioceptive states against their reconstructions on a sample episode. Results in Figure 6 confirm our hypothesis that the encoder grounded on pixel observations is powerful enough to almost perfectly restore the internals of the task, whereas SAC without the reconstruction loss cannot. Full results in Appendix F.

### 6.2 Generalization to unseen tasks

To verify whether the latent representation space learned by our method is able to generalize to different tasks without additional fine-tuning with the reconstruction signal, we take three tasks, walker_stand, walker_walk, and walker_run from DMC, which share similar observational appearance but have different reward structure. We train an agent using our method (SAC+AE:pixel) on the walker_walk task until convergence and extract its encoder. Subsequently, we train two SAC agents without reconstruction loss on the walker_stand and walker_run tasks from pixels. The encoder of the first agent is initialized with weights from the pretrained walker_walk encoder, while the encoder of the second agent is not. Neither of the agents uses the reconstruction signal, and both only backpropagate gradients from the critic to the encoder (see Figure 4). Results in Figure 7 suggest that our method learns latent representations that can readily generalize to unseen tasks and help a SAC agent achieve strong performance and solve the tasks.

## 7 Discussion

We have presented the first end-to-end, off-policy, model-free RL algorithm for pixel observations with only reconstruction loss as an auxiliary task. It is competitive with state-of-the-art model-based methods, but much simpler, robust, and without requiring learning a dynamics model. We show through ablations the superiority of end-to-end learning over previous methods that use a two-step training procedure with separated gradients, the necessity of a pixel reconstruction loss over reconstruction to lower-dimensional “correct” representations, and demonstrations of the representation power and generalization ability of our learned representation.

We find that deterministic models outperform $\beta$-VAEs (higgins2017betavae), likely due to the other introduced instabilities, such as bootstrapping, off-policy data, and end-to-end training with auxiliary losses. We hypothesize that deterministic models that perform better even in stochastic environments should be chosen over stochastic ones with the potential to learn probability distributions, and argue that determinism has the benefit of added interpretability, through handling of simpler distributions.

In the Appendix we provide results across all experiments on the full suite of 6 tasks chosen from DMC (Appendix A), and the full set of hyperparameters used in Appendix B. There are also additional experiments on autoencoder capacity (Appendix E), a look at optimality of the learned latent representation (Appendix H), importance of action repeat (Appendix I), and a set of benchmarks on learning from proprioceptive observation (Appendix J). Finally, we open-source our codebase for the community to spur future research in image-based RL.

## References

## Appendix

## Appendix A The DeepMind control suite

We evaluate the algorithms in the paper on the DeepMind control suite (DMC) (tassa2018dmcontrol), a collection of continuous control tasks that offers an excellent testbed for reinforcement learning agents. The software emphasizes the importance of having a standardised set of benchmarks with a unified reward structure in order to measure progress reliably.

Specifically, we consider six domains (see Figure 14) that result in twelve different control tasks. Each task (Table 2) poses a particular set of challenges to a learning algorithm. The ball_in_cup_catch task only provides the agent with a sparse reward when the ball is caught; the cheetah_run task offers high dimensional internal state and action spaces; the reacher_hard task requires the agent to explore the environment. We refer the reader to the original paper to find more information about the benchmarks.

Task name | Reward type |
---|---|
ball_in_cup_catch | sparse |
cartpole_{balance,swingup} | dense |
cheetah_run | dense |
finger_{spin,turn_easy,turn_hard} | dense/sparse |
reacher_{easy,hard} | sparse |
walker_{stand,walk,run} | dense |

## Appendix B Hyper parameters and setup

### B.1 Actor and Critic networks

We employ double Q-learning (hasselt2015doubledqn) for the critic, where each Q-function is parametrized as a 3-layer MLP with ReLU activations after each layer except the last. The actor is also a 3-layer MLP with ReLUs that outputs the mean and covariance for the diagonal Gaussian that represents the policy. The same hidden dimension is used for both the critic and actor.

### B.2 Encoder and Decoder networks

We employ an almost identical encoder architecture as in tassa2018dmcontrol, with two minor differences. Firstly, we add two more convolutional layers to the convnet trunk. Secondly, we use ReLU activations after each conv layer, instead of ELU. All conv layers use the same kernel size and number of channels, and the same stride everywhere except for the first conv layer, which uses a different stride. We then take the output of the convnet and feed it into a single fully-connected layer normalized by LayerNorm (ba2016layernorm). Finally, we add a tanh nonlinearity to the output of the fully-connected layer.
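When sizing the fully-connected layer that follows the convnet trunk, it helps to track the spatial size of the feature map through the conv layers. The sketch below uses the standard output-size formula for a convolution; the input size, kernel size, and strides in the usage note are illustrative assumptions only, since the text above does not state them.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of one conv layer: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def trunk_output_size(size, kernel, strides):
    """Apply conv_output_size once per layer, one stride per layer, no padding."""
    for stride in strides:
        size = conv_output_size(size, kernel, stride)
    return size
```

As a hypothetical example, `trunk_output_size(84, 3, [2, 1, 1, 1])` follows an 84-pixel input through four unpadded layers with a larger stride only in the first layer.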

The actor and critic networks both have separate encoders, although we share the weights of the conv layers between them. Furthermore, only the critic optimizer is allowed to update these weights (i.e. we truncate the gradients from the actor before they propagate to the shared conv layers).

The decoder consists of one fully-connected layer followed by four deconv layers. We use ReLU activations after each layer, except the final deconv layer, which produces the pixel representation. All deconv layers share the same kernel size, number of channels, and stride, except for the last layer, which uses a different stride.

We then combine the critic’s encoder together with the decoder specified above into an autoencoder. Note, because we share conv weights between the critic’s and actor’s encoders, the conv layers of the actor’s encoder will be also affected by reconstruction signal from the autoencoder.

### B.3 Training and evaluation setup

We first collect seed observations using a random policy. We then collect training observations by sampling actions from the current policy. We perform one training update every time we receive a new observation. In cases where we use action repeat, the number of training observations is only a fraction of the environment steps (e.g. a 1000-step episode at action repeat 4 results in only 250 training observations). The action repeat used for each environment is specified in Table 3, following those used by PlaNet and SLAC.

We evaluate our agent at regular intervals of environment steps by computing the average episode return over a fixed number of evaluation episodes. Instead of sampling from the Gaussian policy we take its mean during evaluation.
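The difference between training-time and evaluation-time action selection can be sketched as below. This is a minimal illustration assuming the actor has already produced a mean and log standard deviation per action dimension; the function names are hypothetical.

```python
import math
import random


def select_action(mean, log_std, evaluate, rng=random):
    """Return the Gaussian mean at evaluation time, a sample otherwise."""
    if evaluate:
        return list(mean)
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mean, log_std)]


def average_return(episode_returns):
    """Average episode return over a set of evaluation episodes."""
    return sum(episode_returns) / len(episode_returns)
```

Using the mean removes sampling noise from the reported evaluation returns, so differences between runs reflect the learned policy rather than exploration.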

We preserve this setup throughout all the experiments in the paper.

Task name | Action repeat |
---|---|
cartpole_swingup | 8 |
reacher_easy | 4 |
cheetah_run | 4 |
finger_spin | 2 |
ball_in_cup_catch | 4 |
walker_walk | 2 |
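The relationship between environment steps and training observations under action repeat is simple integer arithmetic. The sketch below uses the action repeat values from Table 3; the episode length of 1000 environment steps is the DMC default and is stated here as an assumption, and the names are hypothetical.

```python
# Action repeat per task, copied from Table 3.
ACTION_REPEAT = {
    "cartpole_swingup": 8,
    "reacher_easy": 4,
    "cheetah_run": 4,
    "finger_spin": 2,
    "ball_in_cup_catch": 4,
    "walker_walk": 2,
}


def training_observations(env_steps, action_repeat):
    """Number of observations the agent sees when each action is repeated."""
    return env_steps // action_repeat


# Assumed episode length of 1000 environment steps.
per_episode = {task: training_observations(1000, k) for task, k in ACTION_REPEAT.items()}
```

Under this assumption a cartpole_swingup episode yields 125 training observations, while a walker_walk episode yields 500.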

### B.4 Weights initialization

We initialize the weight matrix of fully-connected layers with the orthogonal initialization (saxe2013ortho) and set the bias to be zero. For convolutional and deconvolutional layers we use delta-orthogonal initialization (xiao2018deltainit).

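### B.5 Regularization

We regularize the autoencoder network using the scheme proposed in ghosh2019rae. In particular, we extend the standard reconstruction loss for a deterministic autoencoder with a penalty on the learned representation $z_t$ and add weight decay on the decoder parameters $\theta$:

$$J(\mathrm{RAE}) = \mathbb{E}_{o_t \sim \mathcal{D}}\left[ -\log p_\theta(o_t \mid z_t) + \lambda_z \|z_t\|^2 + \lambda_\theta \|\theta\|^2 \right], \quad z_t = g_\phi(o_t).$$

The coefficients $\lambda_z$ and $\lambda_\theta$ are fixed hyper parameters.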
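The regularized autoencoder loss can be sketched for one observation as a reconstruction term plus two squared-norm penalties. The example assumes a squared-error reconstruction term and flat lists of floats for the latent vector and decoder weights; the coefficient values in the test are arbitrary placeholders, not the ones used in the paper.

```python
def rae_loss(obs, recon, z, decoder_params, lambda_z, lambda_theta):
    """Reconstruction error + penalty on the latent code + decoder weight decay."""
    reconstruction = sum((o - r) ** 2 for o, r in zip(obs, recon))
    latent_penalty = sum(v * v for v in z)
    weight_decay = sum(w * w for w in decoder_params)
    return reconstruction + lambda_z * latent_penalty + lambda_theta * weight_decay
```

Unlike the $\beta$-VAE objective, nothing here is sampled: the latent code is a deterministic function of the observation, and the penalty on its norm plays the role of the KL regularizer.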

### B.6 Pixels preprocessing

We construct an observational input as a stack of 3 consecutive frames (mnih2013dqn), where each frame is an RGB rendering from a fixed camera. We then divide each pixel by 255 to scale it down to the $[0, 1]$ range. For reconstruction targets we instead preprocess images by reducing bit depth to 5 bits as in kingma2018glow.
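The two preprocessing paths can be sketched on flat lists of 8-bit pixel values. The input scaling follows the text directly. For the reconstruction target, the sketch assumes Glow-style preprocessing, which quantizes to the reduced bit depth and then centers the values around zero; the centering and the omission of any dequantization noise are assumptions of this example, and the function names are hypothetical.

```python
def scale_pixels(frame):
    """Encoder input: map 8-bit pixel values to the [0, 1] range."""
    return [p / 255.0 for p in frame]


def reduce_bit_depth(frame, bits=5):
    """Reconstruction target: quantize 8-bit pixels to `bits` bits, then center."""
    shift = 8 - bits
    bins = 2 ** bits
    return [(p >> shift) / bins - 0.5 for p in frame]


def stack_frames(frames):
    """Concatenate consecutive frames into a single observation."""
    return [p for frame in frames for p in frame]
```

Reducing the bit depth discards low-order pixel detail, so the decoder is not asked to reproduce fine-grained noise in the rendering.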

### B.7 Other hyper parameters

We also provide a comprehensive overview of all the remaining hyper parameters in Table 4.

Parameter name | Value |
---|---|
Replay buffer capacity | |
Batch size | |
Discount | |
Optimizer | Adam |
Critic learning rate | |
Critic target update frequency | |
Critic Q-function soft-update rate | 0.01 |
Critic encoder soft-update rate | 0.05 |
Actor learning rate | |
Actor update frequency | |
Actor log stddev bounds | |
Autoencoder learning rate | |
Temperature learning rate | |
Temperature Adam’s | |
Init temperature | |

## Appendix C Iterative representation learning with $\beta$-VAE

Iterative pretraining suggested in lange10deepaeinrl; finn2015deepspatialae allows for faster representation learning, which consequently boosts the final performance, yet it is not sufficient to fully close the gap, and additional modifications, such as end-to-end training, are needed. Figure 15 provides additional results for the experiment described in Section 4.3.

## Appendix D An attempt for end-to-end representation learning with β-VAE

Additional results for the experiments in Section 4.4 are shown in Figure 16.

## Appendix E Capacity of the Autoencoder

We also investigate various autoencoder capacities for the different tasks. Specifically, we measure the impact of changing the capacity of the convolutional trunk of the encoder and the corresponding deconvolutional trunk of the decoder. Here, we keep the convolutional weights shared between the actor and critic, but vary the number of convolutional layers and the number of filters per layer; Figure 17 shows the results across several environments. We find that SAC+AE is robust to various autoencoder capacities, and all architectures tried were capable of extracting the relevant features from pixel space necessary to learn a good policy. We use the same training and evaluation setup as detailed in Section B.3.

## Appendix F Representation power of the Encoder

Additional results for the experiment in Section 6.1, which demonstrates the encoder’s ability to reconstruct proprioceptive state from image observations, are shown in Figure 18.

## Appendix G Decoding to proprioceptive state

Learning from low-dimensional proprioceptive observations achieves better final performance with greater sample efficiency (see Figure 5 for comparison to pixels and Appendix J for proprioceptive baselines). Our intuition is therefore to use these compact observations directly as the reconstruction targets to generate an auxiliary signal. Although this is an unrealistic setup, given that we do not have access to proprioceptive states in practice, we use it as a tool to understand whether such supervision is beneficial for representation learning and can therefore achieve good performance. We augment the observational encoder , that maps an image into a latent vector , with a state decoder , that restores the corresponding state from the latent vector . This leads to an auxiliary objective , where . We parametrize the state decoder as a -layer MLP with hidden size and ReLU activations, and train it end-to-end with the actor-critic network. Such auxiliary supervision helps less than expected, and surprisingly hurts performance in ball_in_cup_catch, as seen in Figure 19. Our intuition is that such low-dimensional supervision is not able to provide the rich reconstruction error needed to fit the high-capacity convolutional encoder . We thus seek a denser auxiliary signal and try learning latent representation spaces with pixel reconstructions.

## Appendix H Optimality of learned latent representation

We define the optimality of the learned latent representation as the ability of our model to extract and preserve all relevant information from the pixel observations sufficient to learn a good policy. For example, the proprioceptive state representation is clearly better than the pixel representation because we can learn a better policy from it. However, the differences in performance between SAC:state and SAC+AE:pixel can be attributed not only to the different observation spaces, but also to the difference in the data collected in the replay buffer. To decouple these factors and determine how much information is lost in moving from proprioceptive states to pixel images, we measure the final task reward of policies learned from the same fixed replay buffer, where one is trained on proprioceptive states and the other on pixel observations.

We first train a SAC+AE policy until convergence and save the replay buffer that we collected during training. Importantly, in the replay buffer we store both the pixel observations and the corresponding proprioceptive states. Note that for two policies trained on the fixed replay buffer, we are operating in an off-policy regime, and thus it is possible we won’t be able to train a policy that performs as well.

In Figure 20 we find, surprisingly, that our learned latent representation outperforms proprioceptive state on a fixed buffer. This could be because the data in the buffer was collected by a policy that was itself learned from pixel observations, and this policy differs enough from the one that would be learned from proprioceptive states that SAC:state underperforms in this setting.
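The fixed-buffer protocol relies on storing both observation types for every transition so that the two policies see exactly the same data. A minimal sketch of such a buffer is given below; the class name `PairedReplayBuffer`, its methods, and the dictionary layout are hypothetical and are not taken from the released code.

```python
import random


class PairedReplayBuffer:
    """Stores pixel and proprioceptive observations for the same transitions."""

    def __init__(self):
        self.transitions = []

    def add(self, pixels, state, action, reward, next_pixels, next_state):
        self.transitions.append({
            "pixels": (pixels, next_pixels),
            "state": (state, next_state),
            "action": action,
            "reward": reward,
        })

    def sample(self, batch_size, modality, rng):
        """Sample a batch, exposing only the requested observation type."""
        batch = rng.sample(self.transitions, batch_size)
        return [(t[modality][0], t["action"], t["reward"], t[modality][1])
                for t in batch]
```

Sampling with `modality="pixels"` or `modality="state"` from identically seeded generators yields the same transitions in both views, so any performance gap reflects the observation space rather than the data distribution.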

## Appendix I Importance of action repeat

We found that repeating nominal actions several times has a significant effect on learning dynamics and final reward. Prior works (hafner2018planet; lee2019slac) treat action repeat as a hyper parameter of the learning algorithm, rather than a property of the target environment. Effectively, action repeat decreases the control horizon of the task and makes the control dynamics more stable. Yet action repeat can also introduce a harmful bias that prevents the agent from learning an optimal policy due to the injected lag. This leaves the practitioner with the problem of finding a value for the action repeat hyper parameter that stabilizes training without limiting control elasticity too much.

To gain more insight, we perform an ablation study in which we sweep over several choices of action repeat on multiple control tasks and compare the results against PlaNet (hafner2018planet) with its original action repeat setting, which was also tuned per environment. We use the same setup as detailed in Section B.3. Specifically, we average performance over random seeds, and reduce the number of training observations in inverse proportion to the action repeat value. The results are shown in Figure 21. We observe that PlaNet’s choice of action repeat is not always optimal for our algorithm. For example, we can significantly improve the performance of our agent on the ball_in_cup_catch task if, instead of taking the same nominal action four times, as PlaNet suggests, we take it once or twice. The same is true on a few other environments.
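The mechanics of action repeat, and the way it shrinks the number of agent decisions for a fixed simulator budget, can be sketched as follows. The toy environment `CountingEnv`, the wrapper `ActionRepeat`, and the helper `agent_steps` are hypothetical names introduced for illustration, and the episode length of 8 is an arbitrary toy value.

```python
class CountingEnv:
    """Toy environment (hypothetical): reward 1 per simulator step, fixed length."""

    def __init__(self, episode_length=8):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        return self.t, 1.0, done


class ActionRepeat:
    """Applies each agent action `repeat` times and sums the rewards."""

    def __init__(self, env, repeat):
        self.env = env
        self.repeat = repeat

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.repeat):
            obs, reward, done = self.env.step(action)
            total_reward += reward
            if done:  # stop early if the episode ends mid-repeat
                break
        return obs, total_reward, done


def agent_steps(total_env_steps, repeat):
    """Number of agent decisions available for a fixed simulator step budget."""
    return total_env_steps // repeat
```

With a repeat of four the agent observes only every fourth simulator state, which is the lag referred to above; a repeat of one recovers the unwrapped environment.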

## Appendix J Learning from proprioceptive observations

In addition to the results when an agent learns from pixels, we also provide a comprehensive comparison of several state-of-the-art continuous control algorithms that learn directly from proprioceptive states. Specifically, we consider four agents that implement SAC (haarnoja2018sac), TD3 (fujimoto2018td3), DDPG (lillicrap2015ddpg), and D4PG (barth-maron2018d4pg). We leverage open-source implementations of TD3 and DDPG from https://github.com/sfujim/TD3, and use the reported set of optimal hyper parameters, except for the batch size, which we increase to , as we find it improves the performance of both algorithms. Due to the lack of a publicly accessible implementation of D4PG, we take the final performance results after environment steps as reported in tassa2018dmcontrol. We use our own implementation of SAC together with the hyper parameters listed in Appendix B; again, we increase the batch size to . Importantly, we keep the same set of hyper parameters across all tasks to avoid overfitting to individual tasks.

For this evaluation we do not repeat actions and perform one training update per environment step. We evaluate a policy every steps (or every episodes as one episode consists of steps) by running evaluation episodes and averaging the corresponding returns. To assess the stability of each algorithm and produce reliable baselines, we compute the mean and standard deviation of evaluation performance over random seeds. We test on twelve challenging continuous control tasks from DMC (tassa2018dmcontrol), as described in Appendix A. The results are shown in Figure 22.
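The aggregation described above amounts to two nested averages: over evaluation episodes within a run, then over random seeds. A minimal sketch is given below, assuming the population standard deviation is used (the text does not say which estimator was chosen); the function names and the callable `episode_return_fn` are hypothetical.

```python
import statistics


def evaluate(episode_return_fn, num_episodes):
    """Average return over `num_episodes` evaluation episodes."""
    returns = [episode_return_fn(episode) for episode in range(num_episodes)]
    return statistics.mean(returns)


def aggregate_over_seeds(per_seed_scores):
    """Mean and standard deviation of evaluation scores across random seeds.

    Assumption: population standard deviation (statistics.pstdev).
    """
    mean = statistics.mean(per_seed_scores)
    std = statistics.pstdev(per_seed_scores)
    return mean, std
```

Swapping `statistics.pstdev` for `statistics.stdev` gives the sample estimator instead, which is slightly larger when only a handful of seeds are available.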