Abstract
Deep reinforcement learning has the potential to train robots to perform complex tasks in the real world without requiring accurate models of the robot or its environment. A practical approach is to train agents in simulation, and then transfer them to the real world. One of the most popular methods for achieving this is to use domain randomisation, which involves randomly perturbing various aspects of a simulated environment in order to make trained agents robust to the reality gap between the simulator and the real world. However, less work has gone into understanding such agents, which are deployed in the real world, beyond task performance. In this work we examine such agents, through qualitative and quantitative comparisons between agents trained with and without visual domain randomisation, in order to provide a better understanding of how they function. Specifically, we train agents for Fetch and Jaco robots on a visuomotor control task, and evaluate how well they generalise using different unit tests. We tie this to interpretability techniques, providing both quantitative and qualitative data. Finally, we investigate the internals of the trained agents by examining their weights and activations. Our results show that the primary outcome of domain randomisation is more redundant, entangled representations, accompanied by significant statistical/structural changes in the weights; moreover, the types of changes are heavily influenced by the task setup and presence of additional proprioceptive inputs. Furthermore, even with an improved saliency method introduced in this work, we show that qualitative studies may not always correspond with quantitative measures, necessitating the use of a wide suite of inspection tools in order to provide sufficient insights into the behaviour of trained agents.
Analysing Deep Reinforcement Learning Agents Trained with Domain Randomisation
Tianhong Dai*, Kai Arulkumaran*, Samyakh Tukra, Feryal Behbahani, Anil Anthony Bharath
*Equal contributions. Correspondence to: BICILab, Department of Bioengineering, Imperial College London, Exhibition Road, London SW7 2AZ, United Kingdom.
Keywords: Deep reinforcement learning, Generalisation, Interpretability, Saliency
Deep reinforcement learning (DRL) is currently one of the most prominent subfields in AI, with applications to many domains (Arulkumaran et al., 2017; François-Lavet et al., 2018). One of the most enticing possibilities that DRL affords is the ability to train robots to perform complex tasks in the real world, all from raw sensory inputs. For instance, while robotics has traditionally relied on handcrafted pipelines, each performing well-defined estimation tasks – such as ground-plane estimation, object detection, segmentation and classification (Kragic and Vincze, 2009; Martinez-Gomez et al., 2014) – it is now possible to learn visual perception and control in an “end-to-end” fashion (Levine et al., 2016; Gu et al., 2017; Zhu et al., 2017; Levine et al., 2018), without explicit specification and training of networks for specific subtasks.
A major advantage of using reinforcement learning (RL) versus the more traditional approach to robotic system design based on optimal control is that the latter requires a transition model for the task in order to solve for the optimal sequence of actions. While optimal control, when applicable, is more efficient, modelling certain classes of objects (e.g., deformable objects) can require expensive simulation steps, and often physical parameters (e.g., frictional coefficients) of real objects that are not known in detail. Instead, approaches that use RL can learn a direct mapping from observations to the optimal sequence of actions, purely through interacting with the environment. Through the powerful function approximation capabilities of neural networks (NNs), deep learning (DL) has allowed RL algorithms to scale to domains with significantly more complex input and action spaces than previously considered tractable.
The downside is that while DRL algorithms can learn complex control policies from raw sensory data, they typically have poor sample complexity. In practice, this means training DRL algorithms in simulators before deploying them on real robots, which then introduces a reality gap (Jakobi et al., 1995) between the simulated and real worlds, including not just differences in the physics, but also visual appearance. There are several solutions to this problem, including fine-tuning a DRL agent on the real world (Rusu et al., 2017), performing system identification to reduce the domain gap (Chebotar et al., 2018), and explicitly performing domain adaptation (Tzeng et al., 2015; Bousmalis et al., 2018).
One solution to increase the robustness of agents to potential differences between simulators and the real world is to use domain randomisation (DR; pictured in Figure 1), in which various properties of the simulation are varied, altering anything from the positions or dynamical properties of objects to their visual appearance. This extension of data augmentation to RL environments has been used to successfully train agents for a range of different robots, including robotic arms (Tobin et al., 2017; James et al., 2017), quadcopters (Sadeghi and Levine, 2017), and even humanoid robotic hands (Andrychowicz et al., 2018). While early uses of DR (Tobin et al., 2017; James et al., 2017) did not include transition dynamics as a random property, we note that “dynamics randomisation” (Peng et al., 2018) can now also be considered part of the standard DR pipeline.
When the primary aim of this line of research is to enable the training of agents that perform well in the real world, there is an obvious need to characterise how these agents behave before they can be deployed “in the wild”. In particular, one can study how well these agents generalise, a criterion that has received considerable interest in the DRL community recently (Zhang et al., 2018a, b; Justesen et al., 2018; Witty et al., 2018; Packer et al., 2018; Cobbe et al., 2018; Zhao et al., 2019). To do so, we can construct unit tests that not only reflect the conditions under which the agent has been trained, but also extrapolate beyond; for instance, James et al. (2017) studied the test-time performance of agents trained with DR in the presence of distractors or changed illumination. While adding robustness to these extrapolation tests can be done by simply training under the new conditions, we are interested in developing general procedures that would still be useful when this option is not available. As we show later (in Subsection [), depending on the training conditions, we can even observe a failure of agents trained with DR to generalise to the much simpler default visuals of the simulator.
While unit tests provide a quantitative measure by which we can probe the performance of trained agents under various conditions, by themselves they treat the trained agents as black boxes. On the contrary, with full access to the internals of the trained models and even control over the training process, we can dive even further into the models. Using common interpretability tools such as saliency maps (Morch et al., 1995; Simonyan et al., 2013; Zeiler and Fergus, 2014; Selvaraju et al., 2017; Sundararajan et al., 2017) and dimensionality reduction methods (Pearson, 1901; Maaten and Hinton, 2008; McInnes et al., 2018) for visualising NN activations (Rauber et al., 2017), we can obtain information on why agents act the way they do. The results of these methods work in tandem with unit tests, as matching performance to the qualitative results allows us to have greater confidence in interpreting the latter; in fact, this process allowed us to debug and improve upon an existing saliency map method, as detailed in Subsection [. Through a combination of existing and novel methods, we present here a more extensive look into DRL agents that have been trained to perform control tasks using both visual and proprioceptive inputs. In particular, under our set of experimental conditions, we show that our agents trained with visual DR:

• require more representational learning capacity (Subsection [),
• are more robust to visual changes in the scene, exhibiting generalisation to unseen local/global perturbations (Subsection [),
• use a smaller set of more reliable visual cues when not provided proprioceptive inputs (Subsection [),
• have filters that have higher norms or greater spatial structure (Subsection [), which respond to more complex spatial patterns (Subsection [),
• learn more redundant (Subsection [) and entangled (Frosst et al., 2019) representations (Subsection [),
• and can “overfit” to DR visuals (Subsection [).
In RL, the aim is to learn optimal behaviour in sequential decision problems (Sutton and Barto, 2018), such as finding the best trajectory for a manipulation task. It can formally be described by a Markov decision process (MDP), whereby at every timestep t the agent receives the state of the environment \mathbf{s}_{t}, performs an action \mathbf{a}_{t} sampled from its policy \pi(\mathbf{a}_{t}|\mathbf{s}_{t}) (potentially parameterised by weights \theta), and then receives the next state \mathbf{s}_{t+1} along with a scalar reward r_{t+1}. The goal of RL is to find the optimal policy, \pi^{*}, which maximises the expected return:
\displaystyle\mathbb{E}[R_{t=0}]=\mathbb{E}\left[\sum_{t=0}^{T-1}\gamma^{t}r_{t+1}\right],
where in practice a discount value \gamma\in[0,1) is used to weight earlier rewards more heavily and reduce the variance of the return over an episode of interaction with the environment, ending at timestep T.
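As a minimal sketch of the return defined above, the following computes the discounted sum for a single episode. The function name and the convention that `rewards[t]` stores r_{t+1} are illustrative assumptions, not part of the original work.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return sum_{t=0}^{T-1} gamma^t * r_{t+1} for one episode.

    Assumption: rewards[t] holds r_{t+1}, the reward received after
    acting at timestep t, so len(rewards) == T.
    """
    ret = 0.0
    # Accumulate backwards so each earlier step applies one more factor of gamma.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret
```

With `gamma` strictly below 1, rewards received later in the episode contribute less to the total.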
Policy search methods, which are prevalent in robotics (Deisenroth et al., 2013), are one way of finding the optimal policy. In particular, policy gradient methods that are commonly used with NNs perform gradient ascent on \mathbb{E}_{\pi}[R] to optimise a parameterised policy \pi(\cdot;\theta) (Williams and Peng, 1991). Other RL methods rely on value functions, which represent the future expected return from following a policy from a given state: V_{\pi}(\mathbf{s}_{t})=\mathbb{E}_{\pi}[R_{t}]. Methods that combine learned policy and value functions are known as actor-critic methods, and utilise the critic (value function) in order to reduce the variance of the training signal to the actor (policy) (Barto et al., 1983). Instead of directly maximising the return R_{t}, the policy can then be trained to maximise the advantage A_{t}=R_{t}-V_{t} (the difference between the empirical and predicted return).
We note that in practice many problems are better described as partially-observed MDPs, where the observation received by the agent does not contain full information about the state of the environment. In visuomotor object manipulation this can occur as the end effector blocks the line of sight between the camera and the object, causing self-occlusion. A common solution to this is to utilise recurrent connections within the NN, allowing information about observations to propagate from the beginning of the episode to the current timestep (Wierstra et al., 2007).
For our experiments we train our agents using proximal policy optimisation (PPO) (Schulman et al., 2017), a widely used and performant RL algorithm; notably, PPO has been used with DR to train a policy that was applied to a Shadow Dexterous Hand in the real world (Andrychowicz et al., 2018). Rather than training the policy to maximise the advantage directly, PPO instead maximises the surrogate objective:
\displaystyle\mathcal{L}_{clip}=\mathbb{E}_{t}\left[\min(\rho_{t}(\theta)A_{t},\text{clip}(\rho_{t}(\theta),1-\epsilon,1+\epsilon)A_{t})\right],
with
\displaystyle\rho_{t}(\theta)=\frac{\pi(\mathbf{a}_{t}|\mathbf{s}_{t};\theta)}{\pi_{old}(\mathbf{a}_{t}|\mathbf{s}_{t};\theta_{old})},
where \rho_{t}(\theta) is the ratio between the current policy and the old policy, \epsilon is the clip ratio which restricts the change in the policy distribution, and A_{t} is the advantage, which we choose to be the Generalised Advantage Estimate (GAE):
\displaystyle A_{t}=\delta_{t}+(\gamma\lambda)\delta_{t+1}+\ldots+(\gamma\lambda)^{T-t-1}\delta_{T-1},
that mixes Monte Carlo returns R_{t} and temporal difference errors \delta_{t}=r_{t}+\gamma V_{\pi}(\mathbf{s}_{t+1})-V_{\pi}(\mathbf{s}_{t}) with hyperparameter \lambda (Schulman et al., 2015).
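The GAE recursion can be sketched as a single backward pass over an episode. This is an illustrative implementation under stated assumptions rather than the authors' code: the function name, the default values of `gamma` and `lam`, and the convention that `values` carries one extra bootstrap entry for the final state are all hypothetical.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimate for one episode.

    Assumptions: rewards has length T, where rewards[t] is the reward that
    follows the action at timestep t; values has length T + 1 and holds
    V(s_0), ..., V(s_T), with V(s_T) = 0 for a terminal state.
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        # Temporal difference error delta_t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + (gamma * lambda) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` recovers the one-step temporal difference error, while `lam=1` with zero value estimates recovers the discounted Monte Carlo return.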
In practice, both the actor and the critic can be combined into a single NN with two output heads, parameterised by \theta (Mnih et al., 2016). The full PPO objective involves maximising \mathcal{L}_{clip}, minimising the squared error between the learned value function and the empirical return:
\displaystyle\mathcal{L}_{value}=\mathbb{E}_{t}\left[(V_{\pi}(\mathbf{s}_{t};\theta)-R_{t})^{2}\right],
and maximising the (Shannon) entropy of the policy, which for discrete action sets of size |\mathcal{A}|, is defined as:
\displaystyle\mathcal{L}_{entropy}=\mathbb{E}_{t}\left[-\sum_{n=1}^{|\mathcal{A}|}\pi(a_{n}|\mathbf{s}_{t};\theta)\log\left(\pi(a_{n}|\mathbf{s}_{t};\theta)\right)\right].
Entropy regularisation prevents the policy from prematurely collapsing to a deterministic solution and aids exploration (Williams and Peng, 1991).
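To make the three terms of the objective concrete, the sketch below evaluates them on a batch of samples with NumPy. It only computes the quantities; how they are weighted and optimised is not specified here. The function name, argument names, the default `eps=0.2`, and the small constant inside the logarithm are assumptions for illustration.

```python
import numpy as np

def ppo_terms(logp_new, logp_old, advantages, values, returns, probs, eps=0.2):
    """Return (L_clip, L_value, L_entropy) for a batch.

    logp_new, logp_old: log-probabilities of the taken actions, shape (B,)
    advantages, values, returns: shape (B,)
    probs: full categorical policy distribution per sample, shape (B, |A|)
    """
    ratio = np.exp(logp_new - logp_old)                 # rho_t(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    l_clip = np.mean(np.minimum(ratio * advantages, clipped * advantages))
    l_value = np.mean((values - returns) ** 2)
    # Shannon entropy; the 1e-8 guards against log(0) and is an assumption.
    l_entropy = np.mean(-np.sum(probs * np.log(probs + 1e-8), axis=-1))
    return l_clip, l_value, l_entropy
```

A training loop would typically ascend `l_clip` and `l_entropy` while descending `l_value`, with coefficients chosen per task.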
Using a parallelised implementation of PPO, we are able to train our agents to strong performance on all training setups within a reasonable amount of time. Training details are described in Subsection [.
The recent success of machine learning (ML) methods has led to a renewed interest in trying to interpret trained models, whereby an explanation of a model’s “reasoning” may be used as a way to understand other properties, such as safety, fairness, or reliability, or simply to provide an explanation of the model’s behaviour (Doshi-Velez and Kim, 2017). In this work, we are primarily concerned with scientific understanding, but our considerations are grounded in other properties necessary for eventual real-world deployment, such as robustness.
The challenge that we face is that, unlike other ML algorithms that are considered interpretable by design (such as decision trees or nearest neighbours (Freitas, 2014)), standard NNs are generally considered black boxes. However, given decades of research into methods for interpreting NNs (Morch et al., 1995; Craven and Shavlik, 1996), we now have a range of techniques at our disposal (Guidotti et al., 2018). Beyond simply looking at test performance (a measure of interpretability in its own right (Doshi-Velez and Kim, 2017)), we will focus on a variety of techniques that will let us examine trained NNs both in the context of, and independently of, task performance. In particular, we discuss saliency maps (Subsection [), activation maximisation (Subsection [), weight visualisations (Subsection [), statistical and structural weight characterisations (Subsection [), unit ablations (Subsection [), layer ablations (Subsection [) and activation analysis (Subsection [). By utilising a range of techniques we hope to cover various points along the trade-off between fidelity and interpretability (Ribeiro et al., 2016).
Saliency maps are one of the most common techniques used for understanding the decisions made by NNs, and in particular, convolutional NNs (CNNs). The most common methods are gradient-based, and utilise the derivative of the network output with respect to the inputs, indicating, for images, how changing the pixel intensities at each location will affect the output (Simonyan et al., 2013). We investigated the use of two popular, more advanced variants of this technique, gradient-weighted class activation mapping (Grad-CAM) (Selvaraju et al., 2017) and integrated gradients (IG) (Sundararajan et al., 2017), as well as an occlusion-based method, which masks parts of the image and performs a sensitivity analysis with respect to the change in the network’s outputs (Zeiler and Fergus, 2014). As shown in Figure 2, the latter technique gave the most “interpretable” saliency maps across all trained agents, so we utilise it alone when analysing our trained agents in later sections; this observation is also in line with prior work on interpreting DRL agents (Greydanus et al., 2018). In light of the unreliability of saliency methods (Kindermans et al., 2017), we include a discussion and comparison of these methods to illuminate the importance of checking the outputs of qualitative methods. As a final remark we note that clustering methods have been used to automatically find groups of strategies via collections of saliency maps (Lapuschkin et al., 2019), but, given the relative visual simplicity of our tasks, highlighting individual examples is sufficiently informative.
The class activation map (CAM) (Zhou et al., 2016) was developed as a saliency method for CNNs with global average pooling (Lin et al., 2013) trained for the purpose of object recognition. The value of the saliency map S^{c}_{m,n} for class c at spatial location m,n is calculated by summing, over channels k, the activations \mathbf{A}^{k} of the final convolutional layer weighted by the corresponding class weights w_{k}^{c}:
\displaystyle S^{c}_{m,n}=\sum_{k}w_{k}^{c}A^{k}_{m,n} 
Given a network F and input \mathbf{x}, Grad-CAM extends CAM from fully-convolutional NNs to generic CNNs by instead constructing class weights \omega_{k}^{c} by using the partial derivative for the output of a class c, \partial F(\mathbf{x})^{c}, with respect to the k feature maps \mathbf{A}^{k} of any convolutional layer. The Grad-CAM saliency map for a class, \mathbf{S}^{c}, is the positive component of the linear combination of class weights \omega_{k}^{c} and feature maps \mathbf{A}^{k}:
\displaystyle\mathbf{S}^{c}=\max\left(\sum_{k}\omega_{k}^{c}\mathbf{A}^{k},0\right),
\displaystyle\text{with }\omega_{k}^{c}=\frac{1}{mn}\sum_{m}\sum_{n}\frac{\partial F(\mathbf{x})^{c}}{\partial A_{m,n}^{k}},
where \omega_{k}^{c} is formed by averaging over spatial locations m,n.
In place of a given class c, we use Grad-CAM to create a saliency map per (output) action (Figure 3).
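The Grad-CAM computation itself is short once the feature maps and their gradients are available. The following NumPy sketch assumes both have already been extracted from the network for one chosen output (a class or an action); the function name and the array layout are illustrative assumptions.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM saliency map for one output.

    activations: feature maps A^k of a convolutional layer, shape (K, M, N)
    gradients:   d(output) / dA^k for the same layer, shape (K, M, N)
    Both arrays are assumed to be precomputed by an autodiff framework.
    """
    # omega_k: average the gradient over spatial locations for each channel.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum over channels, giving an (M, N) map.
    cam = np.tensordot(weights, activations, axes=1)
    # Keep only the positive component.
    return np.maximum(cam, 0.0)
```

The resulting map has the spatial resolution of the chosen layer, so it is usually upsampled to the input size before being overlaid on the image.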
Sundararajan et al. (2017) proposed that attribution methods (saliency maps in our case) should be:
• Sensitive: If an input and the baseline differ in one feature and have different outputs, the differing feature should have a nonzero attribution.
• Invariant to implementation: Attributions should be identical for two functionally equivalent models.
Prior gradient-based methods break the first property. Their method, IG, achieves both by constructing the saliency value S_{n} for each input dimension n from the path integral of the gradients along the linear interpolation between input \mathbf{x} and a baseline input \mathbf{x}^{base}:
\displaystyle S_{n}=\left(x_{n}-x_{n}^{base}\right)\int_{\alpha=0}^{1}\frac{\partial F\left(\mathbf{x}^{base}+\alpha\left(\mathbf{x}-\mathbf{x}^{base}\right)\right)}{\partial x_{n}}d\alpha.
Although Sundararajan et al. (2017) suggested that a black image can be used as the baseline, we found that using the (dataset) average input provided superior results.
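In practice the IG path integral is approximated numerically. The sketch below uses a midpoint Riemann sum and takes the gradient of the network output as a callable; `grad_fn`, the function name and the default of 50 steps are assumptions for illustration, and in a real setting `grad_fn` would be supplied by an autodiff framework.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum.

    grad_fn:  hypothetical callable returning dF/dx at a given input
    x:        input array
    baseline: reference input of the same shape (e.g. an average input)
    """
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    alphas = (np.arange(steps) + 0.5) / steps   # midpoints in (0, 1)
    avg_grad = np.zeros_like(x)
    for alpha in alphas:
        # Gradient at a point on the straight line from baseline to x.
        avg_grad += grad_fn(baseline + alpha * (x - baseline))
    avg_grad /= steps
    return (x - baseline) * avg_grad
```

A useful sanity check is that the attributions should sum approximately to the difference between the output at the input and the output at the baseline.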
As an alternative to gradient-based methods, Zeiler and Fergus (2014) proposed running a (grey, square) mask over the input and tracking how the network’s outputs change in response. Greydanus et al. (2018) applied this method to understanding actor-critic-based DRL agents, using the resulting saliency maps to examine strong and overfitting policies; they however noted that a grey square may be perceived as part of a grey object, and instead used a localised Gaussian blur to add “spatial uncertainty”. The saliency value for each input location is the Euclidean distance between the original output and the output given the input \mathbf{x}_{m,n}^{occ} which has been occluded at location (m,n):
\displaystyle S_{m,n}=\lVert F(\mathbf{x})-F(\mathbf{x}_{m,n}^{occ})\rVert_{2},
where \lVert\cdot\rVert_{p} denotes the \ell_{p}-norm. In practice the output is taken to be the logits for a categorical policy. We considered that the Kullback-Leibler divergence between policy distributions might be more meaningful, but found that it produces qualitatively similar saliency maps.
However, we found that certain trained agents sometimes confused the blurred location with the target location, which is a failing of the attribution method against noise/distractors (Kindermans et al., 2016), and not necessarily the model itself. Motivated by the methods that compute interpretations against reference inputs (Bach et al., 2015; Ribeiro et al., 2016; Shrikumar et al., 2017; Sundararajan et al., 2017; Lundberg and Lee, 2017), we replaced the Gaussian blur with a mask (replacing a circular region of 5px radius around the (m,n) location) derived from a baseline input, which roughly represents what the model would expect to see on average. Intuitively, this acts as a counterfactual, revealing what would happen if the specific part of the input was not there. For this we averaged over frames collected from our standard evaluation protocol (see Subsection [ for details), creating an average input to be used as an improved baseline for IG, as well as the source of the mask for the occlusion-based method (Figure 4). Unless specified otherwise, we use our average input baseline for all IG and occlusion-based saliency maps.
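The occlusion procedure with a baseline-derived mask can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `policy_fn` is a hypothetical callable mapping an image to a vector of logits, and the `stride` between occlusion centres is an assumption, since the text only specifies the 5px radius.

```python
import numpy as np

def occlusion_saliency(policy_fn, image, baseline, radius=5, stride=5):
    """Occlusion-based saliency using a circular patch from a baseline image.

    policy_fn: hypothetical callable, image (H, W, C) -> logit vector
    image, baseline: float arrays of shape (H, W, C)
    stride: spacing between occlusion centres (an assumption)
    """
    height, width = image.shape[:2]
    rows, cols = np.mgrid[:height, :width]
    original = policy_fn(image)
    centres_m = range(0, height, stride)
    centres_n = range(0, width, stride)
    saliency = np.zeros((len(centres_m), len(centres_n)))
    for i, m in enumerate(centres_m):
        for j, n in enumerate(centres_n):
            # Circular region around (m, n) to be replaced.
            mask = (rows - m) ** 2 + (cols - n) ** 2 <= radius ** 2
            occluded = image.copy()
            occluded[mask] = baseline[mask]
            # Euclidean distance between original and occluded outputs.
            saliency[i, j] = np.linalg.norm(original - policy_fn(occluded))
    return saliency
```

Passing the average of frames gathered during evaluation as `baseline` corresponds to the average-input mask described above; passing a blurred copy of `image` would instead approximate the Gaussian blur variant.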
Gradients can also be used to try to visualise what maximises the activation of a given neuron/channel. This can be formulated as an optimisation problem, using projected gradient ascent in the input space (Erhan et al., 2009), where after every gradient step the input is clamped back to within [0,1]. Although this would ideally show what a neuron/channel is selective for, unconstrained optimisation may end up in solutions far from the training manifold (Mahendran and Vedaldi, 2015), and so a variety of regularisation techniques have been suggested for making qualitatively better visualisations. We experimented with some of the “weak regularisers” (Olah et al., 2017), and found that a combination of frequency penalisation (Gaussian blur) (Nguyen et al., 2015) and transformation robustness (random scaling and translation/jitter) (Mordvintsev et al., 2015) worked best, although they were not sufficient to completely rid the resulting visualisations of the high frequency patterns caused by strided convolutions (Odena et al., 2016). We performed the optimisation procedure for activation maximisation for 20 iterations, applying the regularisation transformations and taking gradient steps in the \ell_{2}-norm (Madry et al., 2018) with a step size of 0.1. Pseudocode for our method, applied to a trained network f, is detailed in Algorithm 1.
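The core update, an \ell_{2}-normalised gradient step followed by clamping to [0,1], can be sketched in a few lines. This is a simplified illustration and not a reproduction of Algorithm 1: the regularisation transformations (blur, scaling, jitter) are deliberately omitted, and `grad_fn` is a hypothetical callable returning the gradient of the chosen unit's activation with respect to the input.

```python
import numpy as np

def activation_maximisation(grad_fn, x0, iterations=20, step_size=0.1):
    """Projected gradient ascent on a unit's activation (no regularisers).

    grad_fn: hypothetical callable, input -> d(activation)/d(input)
    x0:      starting input, clamped into [0, 1]
    """
    x = np.clip(np.asarray(x0, dtype=float), 0.0, 1.0)
    for _ in range(iterations):
        g = grad_fn(x)
        norm = np.linalg.norm(g)
        if norm == 0.0:
            break                          # no ascent direction available
        x = x + step_size * g / norm       # l2-normalised ascent step
        x = np.clip(x, 0.0, 1.0)           # project back into the valid range
    return x
```

For a toy linear unit with weights `w`, `grad_fn` is simply `lambda x: w`, and the procedure drives each input dimension towards 0 or 1 according to the sign of the corresponding weight.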
It is possible to visualise both convolutional filters and fully-connected weight matrices as images. Part of the initial excitement around DL was the observation that CNNs trained on object recognition would learn frequency, orientation and colour-selective filters (Krizhevsky et al., 2012), and more broadly might reflect the hierarchical feature extraction within the visual cortex (Yamins and DiCarlo, 2016). However, as demonstrated by Such et al. (2018), DRL agents can perform well with unstructured filters, although they did find a positive correlation between structure and performance for RL agents trained with gradients (intriguingly, agents trained using evolutionary algorithms did not develop structured filters, even when achieving competitive performance). We also found this to be the case, and hence developed quantitative measures to compare filters, which we discuss below. Similarly, even more sophisticated visualisations of weight matrices for fully-connected layers (Hinton and Shallice, 1991) are difficult to reason about, and so we turned to statistical measures for these as well.
A traditional measure for the “importance” of individual neurons in a weight matrix is their magnitude, as exemplified by utilising weight decay as a regulariser (Hanson and Pratt, 1989). Similarly, convolutional filters, considered as one unit, can be characterised by their \ell_{1}-norms. Given that NN weights are typically randomly initialised with small but nonzero values (LeCun et al., 1998; Glorot and Bengio, 2010; He et al., 2015), the presence of many zeros or large values indicates significant changes during training. We can compare these both across trained agents, and across the training process (although change in magnitude may not correspond with a change in task performance (Zhang et al., 2019)).
The set of weights in a layer can be considered as a distribution of values, and analysed as such. Early connectionist work studied the distributions of weights of trained networks, finding generally non-normal distributions using goodness-of-fit tests and higher-order moments (skew and kurtosis) (Hanson and Burr, 1990; Bellido and Fiesler, 1993).
Convolutional filters are typically initialised pseudorandomly, so that there exists little or no spatial correlation within a single unit. We hence propose using the 2D discrete power spectral density (PSD) as a way of assessing the spatial organisation of convolutional filters, and the power spectral entropy (PSE) as a measure of their complexity. Given the 2D spatial-domain filter, \mathbf{W}_{m,n}, its corresponding spectral representation, \mathbf{\hat{W}}_{u,v}, can be calculated via the 2D discrete Fourier transform of the original filter pattern (j=\sqrt{-1}):
\displaystyle\mathbf{\hat{W}}_{u,v}=\sum_{m=0}^{M-1}\sum_{n=0}^{N-1}W_{m,n}\exp\left[-j2\pi\left(\frac{um}{M}+\frac{vn}{N}\right)\right],
and its PSD, \mathbf{S}_{u,v}, from the normalised squared amplitude of the spectrum:
\displaystyle\mathbf{S}_{u,v}=\frac{1}{UV}\left|\mathbf{\hat{W}}_{u,v}\right|^{2},
where (m,n) are spatial indices, (u,v) are frequency indices, (M,N) is the spatial extent of the filter, and (U,V) is the frequency extent of the filter.
When renormalised such that the sum of the PSD is 1, the PSD may be thought of as a probability mass function over a dictionary of components from a spatial Fourier transform. We can treat each location (u,v) in Fourier space as a symbol, and its corresponding value at \mathbf{S}_{u,v} as the probability of that symbol appearing. The PSE, H_{S}, is then simply the Shannon entropy of this distribution.
The initial weights for units are typically drawn independently from a normal or uniform distribution. In either case, this leads to a flat PSD with PSE close to \log(MN). Therefore, we can compare the PSE of trained filters to this baseline to have an estimate of their spatial organisation relative to noise, which we denote as the relative entropy, H_{R}.
One weakness of spectral analysis is that these measures will fail to pick up strongly localised spatial features, and such filters would also result in a roughly uniform PSD. In practice, global structure is still useful to quantify, and matches well with human intuition (Figure 5).
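The PSD and PSE of a single filter can be computed directly with NumPy's FFT. The sketch below follows the definitions above; the function name is an assumption, and an all-zero filter is not handled because its PSD cannot be normalised.

```python
import numpy as np

def power_spectral_entropy(filt):
    """Power spectral entropy (in nats) of a 2D filter.

    filt: real-valued array of shape (M, N); must not be all zeros.
    """
    spectrum = np.fft.fft2(filt)                    # 2D discrete Fourier transform
    psd = np.abs(spectrum) ** 2 / spectrum.size     # normalised squared amplitude
    p = psd / psd.sum()                             # renormalise to sum to 1
    nonzero = p[p > 0]                              # treat 0 * log(0) as 0
    return float(-np.sum(nonzero * np.log(nonzero)))
```

A constant filter concentrates all power in a single frequency and gives a PSE of 0, whereas a filter with a single nonzero weight has a perfectly flat spectrum and reaches the maximum of \log(MN), which illustrates the weakness with strongly localised features noted above.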
Entropy as an information-theoretic measure has been used in DL in many roles, from predicting neural network ensemble performance (Hansen and Salamon, 1990) to usage as a regulariser (Khabou et al., 1999) or pruning criterion (Luo and Wu, 2017) when applied to activations. Spectral entropy has been used as an input feature for NNs (Zheng et al., 1996; Krkic et al., 1996; Misra et al., 2004; Srinivasan et al., 2005), but, to the best of our knowledge, not for quantifying aspects of the network itself.
Another way to characterise the importance of a single neuron/convolutional filter is to remove it and observe how this affects the performance of the NN: a large drop indicates that a particular unit is by itself very important to the task at hand. More generally, one might simply look for a large change in the output. It is also possible to extend this to pairs or higher-order groups of neurons, checking for redundancy among units (Sietsma and Dow, 1988), but this process can then become combinatorially expensive.
This process is highly related to that of pruning, a methodology for model compression. Pruning involves removing connections or even entire units while minimising performance loss (Sietsma and Dow, 1988; Reed, 1993). Some statistical and structural weight characterisations used for pruning include the \ell_{1}-norm (for individual neurons (Han et al., 2015) and for convolutional filters (Li et al., 2017)) and discrete cosine transform coefficients (for convolutional filters (Liu et al., 2018)). More broadly, one might consider redundancy in activation space (Sietsma and Dow, 1988, 1991), or (indirectly) change in task performance, using criteria such as the (second) derivative of the objective function with respect to the parameters (LeCun et al., 1990; Hassibi and Stork, 1993). As such, we combine unit ablation studies, which give empirical results, with these quantitative metrics.
One can extend the concept of ablations to entire layers, and use this to study the re-initialisation robustness of trained networks (Zhang et al., 2019). Typical neural network architectures, as used in our work, are compositions of multiple parameterised layers, with parameters \{\theta_{1},\theta_{2},\ldots,\theta_{L}\}, where L is the depth of the network. Using \theta_{l}^{t} to denote the set of parameters of layer l\in[1,L] at training epoch t\in[1,T] over a maximum of T epochs, we can study the evolution of each layer’s parameters over time, for example through the change in the \ell_{\infty}- or \ell_{2}-norm of the set of parameters.
Zhang et al. (2019) proposed re-initialisation robustness as a measure of how important a layer’s parameters are with respect to task performance over the span of the optimisation procedure. After training, for a given layer l, re-initialisation robustness is measured by replacing the parameters \theta_{l}^{T} with parameters checkpointed from a previous timepoint t, that is, setting \theta_{l}^{T}\leftarrow\theta_{l}^{t}, and then re-measuring task performance. They observed that for common CNN architectures trained for object classification, while the parameters of the later layers of the networks tended to change a lot by the \ell_{\infty}- and \ell_{2}-norms, the same layers were robust to re-initialisation at checkpoints early during the optimisation procedure, and even to the initialisation at t=0. In the latter case, the parameters are independent of the training data, which means that the effective number of parameters is lower than the total number of parameters, and hence the model is simpler. In line with Zhang et al. (2019), we use re-initialisation robustness to study the effect of task complexity (training with and without DR, and with and without proprioceptive inputs), but with networks of similar capacity.
Finally, we consider analysing the internal activations of trained networks. One of the primary methods for examining activations is to take the high-dimensional vectors and project them to a lower-dimensional space (commonly \mathbb{R}^{2} for visualisation purposes) using dimensionality reduction methods that try to preserve the structure of the original data (Rauber et al., 2017). Common choices for visualising activations include both principal components analysis (PCA; a linear projection) (Pearson, 1901; Elman, 1989; Aubry and Russell, 2015) and t-distributed stochastic neighbor embedding (t-SNE; a nonlinear projection) (Maaten and Hinton, 2008; Hamel and Eck, 2010; Mohamed et al., 2012; Donahue et al., 2014; Mnih et al., 2015).
While it is possible to qualitatively examine the projections of the activations for a single network, or compare them across trained networks, one can also use the projections quantitatively, by for instance looking at class overlap in the projected space (Rauber et al., 2017). In our RL setting there is no native concept of a “class”, but we can instead use activations taken under different generalisation test scenarios (Subsection [) to see (beyond the generalisation performance) how the internal representations of the trained networks vary under the different scenarios. Specifically, we measure entanglement (“how close pairs of representations from the same class are, relative to pairs of representations from different classes” (Frosst et al., 2019)) using the soft nearest neighbour loss, \mathcal{L}_{SNN}, (Salakhutdinov and Hinton, 2007), defined over a batch of size B with samples \mathbf{x} and classes y (where in our case \mathbf{x} is a projected activation and y is a test scenario) with temperature T (and using \delta_{i,j} as the Kronecker delta):
\mathcal{L}_{SNN}=\frac{1}{B}\sum_{n=1}^{B}\left(\log\left[\sum_{b=1}^{B}(1-\delta_{b,n})\cdot e^{-\frac{\lVert\mathbf{x}_{n}-\mathbf{x}_{b}\rVert_{2}^{2}}{T}}\right]-\log\left[\sum_{a=1}^{B}(1-\delta_{a,n})\cdot\delta_{y_{a},y_{n}}\cdot e^{-\frac{\lVert\mathbf{x}_{n}-\mathbf{x}_{a}\rVert_{2}^{2}}{T}}\right]\right)
In particular, if representations between different test scenarios are highly entangled, this indicates that the network is largely invariant to the factors of variation between the different scenarios. Considering DR as a form of data augmentation, this is what we might expect of networks trained with DR.
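As an illustration of the entanglement measure, the soft nearest neighbour loss can be sketched in plain Python as follows. This is a naive O(B^2) version written for readability, following the form given by Frosst et al. (2019); the function name and argument layout are assumptions made for this sketch, and it requires every class to have at least two samples in the batch.

```python
import math
from typing import Hashable, Sequence


def soft_nearest_neighbour_loss(
    xs: Sequence[Sequence[float]],
    ys: Sequence[Hashable],
    temperature: float,
) -> float:
    """Soft nearest neighbour loss over a batch of points xs with labels ys."""
    batch_size = len(xs)
    total = 0.0
    for n in range(batch_size):
        all_sum = 0.0   # similarity to every other point (b != n)
        same_sum = 0.0  # similarity to other points with the same label
        for b in range(batch_size):
            if b == n:
                continue  # the (1 - delta) factor excludes the point itself
            sq_dist = sum((p - q) ** 2 for p, q in zip(xs[n], xs[b]))
            weight = math.exp(-sq_dist / temperature)
            all_sum += weight
            if ys[b] == ys[n]:
                same_sum += weight
        if same_sum == 0.0:
            raise ValueError("each class needs at least two (nearby) samples")
        total += math.log(all_sum) - math.log(same_sum)
    return total / batch_size
```

With this sign convention the loss is non-negative: it is close to zero when each point's neighbours share its label (low entanglement) and grows as points from different scenarios are interleaved (high entanglement).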
In order to test the effects of DR, we base our experiments on reaching tasks with visuomotor control. The tasks involve moving the end effector of a robot arm to reach a randomly positioned target during each episode, with visual (one RGB camera view) and sometimes proprioceptive (joint positions, angles and velocities) input provided to the agent. Unlike many DRL experiments where the position of the joints and the target are explicitly provided, in our setup the agent must infer the position of the target, and sometimes itself, purely through vision. Importantly, we use two robotic arms—the Fetch Mobile Manipulator and the KINOVA JACO Assistive robotic arm (pictured in Figure 6; henceforth referred to as Fetch and Jaco, respectively)—which have different control schemes and different visual appearances. This leads to changes in the relative importance of the visual and proprioceptive inputs, which we explore in several of our experiments.
The Fetch has a 7 degrees-of-freedom (DoF) arm, not including the two-finger gripper. The original model and reaching task setup were modified from the FetchReach task in OpenAI Gym (Brockman et al., 2016; Plappert et al., 2018) in order to provide an additional camera feed for the agent (while also removing the coordinates of the target from the input). The target can appear anywhere on the 2D table surface. The agent has 3 sets of actions, corresponding to position control of the end effector ([-5, 5] cm in the x, y and z directions; gripper control is disabled).
The Jaco has 9 DoF, including 1 DoF for each of the 3 fingers. The target can appear anywhere within a 3D area to one side of the robot’s base. The agent has 6 sets of actions, corresponding to velocity control of the arm joints ([-0.6, +0.6] rad/s; finger control is disabled). Due to the difference in control schemes, 2D versus 3D target locations, and homogeneous appearance of the Jaco, reaching tasks with the Jaco are more challenging, particularly when proprioceptive input is not provided to the agent. A summary of the different settings for the Fetch and Jaco environments is provided in Table 1.
Setting  Fetch  Jaco 
Active (Total) DoF  7  6 (9) 
Target Range  21\times 31\text{cm}^{2}  40\times 40\times 40\text{cm}^{3} 
Num. Test Targets  80  250 
Vision Input  3\times 64\times 64  3\times 64\times 64 
Proprioceptive Inputs  30  18 
Control Type  Position  Velocity 
Num. Actions  3  6 
Action Discretisation  5  5 
Control Frequency  6.67Hz  6.67Hz 
During training, target positions are sampled uniformly from within the set range, with episodes terminating once the target is reached (within 10 cm of the target centre), or otherwise timing out in 100 timesteps. The reward is sparse, with the only nonzero reward being +1 when the target is reached. During testing, a fixed set of target positions, covering a uniform grid over all possible target positions, is used; 80 positions in a 2D grid are used for Fetch, and 250 positions in a 3D grid are used for Jaco. By using a deterministic policy and averaging performance over the entire set of test target positions, we obtain an empirical estimate of the probability of task success. Test episodes are set to time out within 20 timesteps in order to minimise false positives from the policy accidentally reaching the target.
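The evaluation protocol above amounts to averaging a binary success indicator over a fixed grid of targets. A minimal sketch is given below; the grid resolution (for example 8 by 10 points for the 80 Fetch targets) and the `run_episode` callable, which is assumed to roll out the deterministic policy and report whether the target was reached within the step limit, are illustrative assumptions rather than details taken from the experiments.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Target = Tuple[float, ...]


def uniform_grid(
    lows: Sequence[float], highs: Sequence[float], counts: Sequence[int]
) -> List[Target]:
    """Evenly spaced target positions covering the box [lows, highs]."""
    axes = []
    for lo, hi, n in zip(lows, highs, counts):
        step = (hi - lo) / (n - 1) if n > 1 else 0.0
        axes.append([lo + i * step for i in range(n)])
    return list(product(*axes))


def success_rate(
    targets: Sequence[Target],
    run_episode: Callable[[Target, int], bool],
    max_steps: int = 20,
) -> float:
    """Fraction of test targets reached within max_steps timesteps."""
    successes = sum(1 for target in targets if run_episode(target, max_steps))
    return successes / len(targets)
```

Because the policy and the target set are both fixed, repeated calls give the same estimate for a given agent, so differences between test conditions can be attributed to the change in visuals rather than to sampling noise.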
We only randomise initial positions (for all agents) and visuals (for some agents), but not dynamics, as this is still a sufficiently rich task setup to explore. Henceforth we refer to agents trained with visual randomisations as being under the DR condition, whereas agents trained without are under the standard (baseline) condition. Apart from the target, we randomise the visuals of all other objects in the environment: the robots, the table, the floor and the skybox. At the start of every episode and at each timestep, we randomly alter the RGB colours, textures and colour gradients of all surfaces (see Figure 1 for example visual observations). One of the tests that we apply to probe generalisation is to change a previously static property (surface reflectivity, which is completely disabled during training) and see how this affects the trained agents. All environments were constructed in MuJoCo (Todorov et al., 2012), a fast and accurate physics simulator that is commonly used for DRL experiments.
We utilise the same basic actor-critic network architecture for each experiment, based on the recurrent architecture used by Rusu et al. (2017) for their Jaco experiments. The architecture has 2 convolutional layers, a fully-connected layer, a long short-term memory (LSTM) layer (Hochreiter and Schmidhuber, 1997; Gers et al., 2000), and a final fully-connected layer for the policy and value outputs; rectified linear units (Nair and Hinton, 2010) were used at the output of the convolutional layers and first fully-connected layer. Proprioceptive inputs, when provided, were concatenated with the outputs of the convolutional layers before being input into the first fully-connected layer. The policy, \pi(\cdot;\theta), is a product of independent categorical distributions, with one distribution per action. Weights were initialised using orthogonal weight initialisation (Saxe et al., 2014; Ilyas et al., 2018) and biases were set to zero. The specifics of the architecture are detailed in Figure 7.
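To make the factorised policy concrete, the sketch below treats the policy output as one vector of logits per action dimension (for example 3 dimensions with 5 discrete bins each for Fetch, per Table 1). The function names are assumptions for illustration; the point is that the joint log-probability is a sum of per-dimension log-probabilities, and that the most probable joint action can be found by taking the argmax of each dimension independently.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-probabilities of one categorical distribution."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def joint_log_prob(
    logits_per_action: Sequence[Sequence[float]], action: Sequence[int]
) -> float:
    """log pi(a|s) for a product of independent categorical distributions."""
    return sum(
        log_softmax(logits)[choice]
        for logits, choice in zip(logits_per_action, action)
    )


def greedy_action(logits_per_action: Sequence[Sequence[float]]) -> List[int]:
    """Per-dimension argmax, which also maximises the joint probability."""
    return [
        max(range(len(logits)), key=logits.__getitem__)
        for logits in logits_per_action
    ]
```

One consequence of this factorisation is that the number of policy outputs grows linearly with the number of action dimensions (15 logits for 3 dimensions of 5 bins) rather than exponentially (125 joint actions).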
During training, a stochastic policy \mathbf{a}\sim\pi(\mathbf{a}|\mathbf{s};\theta) is used and trained with PPO with clip ratio \epsilon=0.1, GAE trace decay \lambda=0.95 and discount \gamma=0.99. Each epoch of training consists of 32 worker processes collecting 128 timesteps worth of data each, then 4 PPO updates with a minibatch size of 1024. We train for up to 5\times 10^{3} epochs, using the Adam optimiser (Kingma and Ba, 2014) with learning rate =2.5\times 10^{-4}, \beta=\{0.9,0.999\}, and \epsilon=1\times 10^{-5}. \mathcal{L}_{value} is weighted by 0.5 and \mathcal{L}_{entropy} is weighted by 0.01. If the max \ell_{2}-norm of the gradients exceeds 0.5 they are rescaled to have a max \ell_{2}-norm of 0.5 (Pascanu et al., 2013). During testing, the deterministic policy \mathbf{a}=\operatorname*{\arg\!\max}_{\mathbf{a}}\pi(\mathbf{a}|\mathbf{s};\theta) is used. Our training was implemented using PyTorch (Paszke et al., 2017). Training each model (each seed) for the full number of timesteps takes 1 day on a GTX 1080Ti.
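For readers less familiar with these hyperparameters, the following sketch shows where the discount, the GAE trace decay and the clip ratio enter the computation. It is a generic textbook-style illustration of generalised advantage estimation and the PPO clipped surrogate for a single sample, not the training code used here; the function names and the list-based interface are assumptions.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float,
    dones: Sequence[bool],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Generalised advantage estimates for one worker's rollout."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value  # bootstrap value for the state after the rollout
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0  # do not bootstrap across episodes
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def ppo_clip_objective(ratio: float, advantage: float, clip: float = 0.1) -> float:
    """Clipped surrogate for one sample; ratio = pi_new(a|s) / pi_old(a|s)."""
    clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped_ratio * advantage)
```

With 32 workers collecting 128 timesteps each, an epoch yields 4096 samples, so 4 updates with a minibatch size of 1024 are consistent with a single pass over the collected data.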
Once agents are successfully trained on each of the different conditions (Fetch/Jaco, DR/no DR, proprioceptive/no proprioceptive inputs), we can perform further tests to see how they generalise. However, while the agents achieve practically perfect test performance on the conditions that they were trained under, the agents trained with DR but without proprioceptive inputs fare worse when tested under the simulator’s standard visuals—demonstrating a drop in performance due to domain shift without even testing on real world visuals. The effect is especially pronounced with the Jaco agents (Figure 8). Because of this, it is not completely straightforward to compare performance between different agents, but the change in performance of a single agent over differing test conditions is still highly meaningful.
We also trained agents with visual DR where the visuals were only randomised at the beginning of each episode, and kept fixed for the remainder of the episode. These agents exhibited the same gap in performance between the standard and randomised visuals, indicating that this is not an issue of temporal consistency.
In order to test how the agents generalise to different held-out conditions, we constructed a suite of standard tests for the trained agents (see Figure 9 for observations for Fetch under the different conditions, and Table 2 for the results; the MuJoCo simulation environment parameters are documented at http://www.mujoco.org/book/XMLreference.html):
 Standard

This is the standard evaluation procedure with the default simulator visuals, where the deterministic policy is applied to all test target positions and the performance is averaged (1.0 means that all targets were reached within 20 timesteps).
 Colour

This introduces a yellow sphere distractor object that is the same size and shape as the target. To deal with clutter, one would train with distractors; this specifically tests the sensitivity of the policy to localising the target given another object of a different colour in the scene, given our training regime (no distractors).
 Shape

This introduces a red cube distractor object that is the same size and colour as the target, but a different shape.
 Illumination

This changes the diffuse colour of the main light from 1.2 to 0.1 for Jaco, or from 0.8 to 0.0 for Fetch.
 Noise

This adds Gaussian noise \sim N(0,0.25) to the visual observations.
 Reflection

This sets the table (for Fetch) or ground (for Jaco) to be reflective. This introduces reflections of the robot (and the target for Jaco) in the input.
 Translation

This offsets the RGB camera by 20cm in the x direction for Jaco or 20cm in the y direction for Fetch.
 Invisibility

This makes the robot transparent; this is not a realistic alteration, but is instead used to test the importance of the visual inputs for self-localisation.
Robot  DR  Prop.  Standard  Colour  Shape  Illumination  Noise  Reflection  Translation  Invisibility 
Fetch  ✗  ✗  1.000\pm 0.000  0.993\pm 0.007  0.775\pm 0.085  0.467\pm 0.067  0.980\pm 0.006  0.447\pm 0.039  0.008\pm 0.004  0.000\pm 0.000 
Fetch  ✗  ✓  1.000\pm 0.000  0.875\pm 0.088  0.243\pm 0.064  0.325\pm 0.115  0.988\pm 0.004  0.570\pm 0.078  0.000\pm 0.000  0.000\pm 0.000 
Fetch  ✓  ✗  0.983\pm 0.004  0.970\pm 0.011  0.913\pm 0.042  0.893\pm 0.013  0.985\pm 0.007  0.972\pm 0.011  0.093\pm 0.040  0.000\pm 0.000 
Fetch  ✓  ✓  0.997\pm 0.002  0.995\pm 0.003  0.963\pm 0.020  0.983\pm 0.006  0.970\pm 0.008  0.985\pm 0.005  0.153\pm 0.055  0.023\pm 0.015 
Jaco  ✗  ✗  0.995\pm 0.003  0.281\pm 0.067  0.274\pm 0.077  0.874\pm 0.034  0.635\pm 0.028  0.734\pm 0.032  0.394\pm 0.055  0.000\pm 0.000 
Jaco  ✗  ✓  0.995\pm 0.001  0.451\pm 0.040  0.258\pm 0.044  0.587\pm 0.043  0.478\pm 0.059  0.618\pm 0.061  0.399\pm 0.040  0.001\pm 0.001 
Jaco  ✓  ✗  0.650\pm 0.056  0.640\pm 0.046  0.636\pm 0.040  0.473\pm 0.049  0.575\pm 0.040  0.429\pm 0.060  0.141\pm 0.034  0.007\pm 0.002 
Jaco  ✓  ✓  0.991\pm 0.004  0.987\pm 0.005  0.970\pm 0.017  0.442\pm 0.018  0.896\pm 0.007  0.946\pm 0.006  0.356\pm 0.029  0.916\pm 0.022 
For Fetch, the distractor with the same shape and size but different colour has a small effect on the performance of the agents. The distractor with the same colour but different shape has a small effect on agents without proprioceptive inputs, but a large effect on agents with proprioceptive inputs; in the latter case the agent with DR is a bit more robust though. From this we can infer that these agents have learned something slightly more sophisticated than a colour detector, except when proprioceptive inputs are provided, in which case this simple strategy suffices during training.
For Jaco, it appears that both giving proprioceptive inputs and applying DR make the agent more robust to either type of distractor, with the agent having neither being greatly affected, while the agent with both is almost unaffected. As with Fetch, the distractor with the same colour reduces performance more than the distractor with the same shape.
As the location of the distractor may affect the model’s response, we also test this. As shown in Table 3, this has a relatively small impact on agent performance.
Robot  DR  Prop.  Colour  Shape 
Fetch  ✗  ✗  0.79\pm 0.06  0.44\pm 0.08 
Fetch  ✗  ✓  0.75\pm 0.08  0.35\pm 0.06 
Fetch  ✓  ✗  0.84\pm 0.06  0.50\pm 0.11 
Fetch  ✓  ✓  0.86\pm 0.05  0.49\pm 0.10 
Jaco  ✗  ✗  0.31\pm 0.05  0.18\pm 0.04 
Jaco  ✗  ✓  0.45\pm 0.07  0.20\pm 0.04 
Jaco  ✓  ✗  0.69\pm 0.02  0.41\pm 0.05 
Jaco  ✓  ✓  0.91\pm 0.02  0.46\pm 0.09 
For Fetch, reducing the illumination drops performance across all agents, and somewhat more for agents without proprioceptive inputs, indicating that providing this input helps mitigate some of the effect of changing the lighting. When reflections are introduced, the DR agents are relatively robust, particularly in comparison to the agents without DR. When the camera is moved, performance drops significantly for all agents, but again DR provides some protection against this.
For Jaco, in contrast with performance with distractors, the agent trained without proprioceptive inputs and without DR is most robust to changes in illumination, whereas the agent trained with both is the most affected. There therefore seems to be a trade-off between how sensitive the agent is to local versus global changes in the scene, with respect to illumination. On the other hand, when adding reflections or translating the camera, the presence of proprioceptive inputs seems to be the most beneficial with respect to robustness; the agent trained with both proprioceptive inputs and DR seems to be unaffected by the presence of reflections.
For nearly all agents, rendering the robot invisible drops performance to 0 (the Fetch agent without proprioceptive inputs and without DR reaches 1/80 targets by chance), indicating that perhaps either directly or indirectly the position of the robot is inferred visually. Given the relative size of the Fetch robot in the visual inputs, we cannot rule out that the drop in performance is due to the domain shift that results from rendering the arm invisible. The Jaco agent without proprioceptive inputs but with DR training reached 15/250 targets, but this is well within the possibilities of chance, as the arm reaches out over space. The standout is the Jaco agent with proprioceptive inputs and DR training, which only incurs a small drop in performance: this agent uses almost exclusively its proprioceptive input to self-localise, and its visual inputs to locate only the target.
There is no single clear result from our evaluation of different setups with different types of tests, beyond the general importance of sensor fusion and DR to regularise agents. For instance, when the global illumination is decreased, in the case of the Fetch agents, having access to proprioceptive inputs helps, but DR does not; on the other hand, the Jaco agent which uses proprioceptive inputs and had DR training performs significantly worse. When provided with proprioceptive inputs (without noise), all of the Fetch agents seem to rely heavily on vision for self-localisation, so it is not necessarily the case that agents will even utilise the inputs that we may expect. The takeaway is that “generalisation” is more nuanced, and performing systematic tests can help probe what strategies networks might be using to operate. Finding failure cases for “weaker” agents can still be a useful exercise for evaluating more robust agents, as it enables adversarial evaluation for finding rare failure cases (Uesato et al., 2018).
The unit tests that we constructed can be used to evaluate the performance of an arbitrary black box policy under differing conditions, but we also have the ability to inspect the internals of our trained agents. Although we cannot obtain a complete explanation for the learned policies, we can still glean further information from both the learned parameters and the sets of activations in the networks.
One of the first tests usually conducted is to examine saliency maps to infer which aspects of the input influence the output of the agent. We use the occlusion-based technique with average baseline, and focus on distractors: we show saliency maps for both the standard test setup, and with either the different colour or different shape distractors.
Between Fetch agents trained either with or without DR (Figure 11 and Figure 10, respectively), or Jaco agents trained either with or without DR (Figure 13 and Figure 12, respectively), most saliency maps are very similar. The most noticeable difference is for agents trained without DR and without proprioception (Figure 10a-c and Figure 12a-c), as these show a large amount of attention on the robots for self-localisation, but also on the distractors; however, empirically, this Jaco agent is badly affected, while the Fetch agent with these settings is not (Table 2). Therefore it is not always the case that saliency maps reflect performance.
That said, there are several aspects of the learned policy that are reflected in the saliency maps. For instance, the Fetch agents are much more robust to the distractor with a different colour than the distractor with a different shape, and this is reflected in the corresponding saliency maps (Figure 10b,e vs. Figure 10c,f, and Figure 11e vs. Figure 11f). The Fetch agents, which learn policies that are more driven by visuals, show attention just above the gripper, even when proprioception is provided (with DR off). The saliency maps for the Jaco agents show a high degree of activation around the target position, and as a result it is difficult to see saliency over the robot for visual self-localisation, even when DR is on and proprioception is off (Figure 13a-c). It is possible to find this saliency given that we know to expect it, but this is a potential failure mode of saliency maps. This problem persists across the other test settings for Jaco agents: although the saliency maps tend to be localised around the target and the arm if proprioception is not provided, test performance is still poor.
In line with Such et al. (2018), activation maximisation applied to the first convolutional layer results in edge detectors, with larger-scale spatial structure in the later layers (Figure 14 and Figure 15). In the first convolutional layer, Jaco agents have several “dead” filters, whereas the Fetch agents use most of their visual inputs; the notable exception is the Jaco agent with DR but without proprioceptive inputs, as this agent must rely purely on visual inputs to solve the challenging visuomotor control task. In the second convolutional layer of the Fetch agents, the texture detectors take on different colours without DR, but are more homogeneous with DR. For the Jaco agents, the second layer features produce sparser patterns, with what appear to be red target detectors when DR is active. Finally, there is more global, but largely uninterpretable structure when maximising the value function or policy outputs (choosing the unit that corresponds to the largest positive movement per action output). For Fetch agents without DR, the visualisations are dominated by red (the target colour), but with DR there is a wider spectrum of colours. This trend is the same for the Jaco agents, although without DR and without proprioceptive inputs the colours that maximise the value output are purple and green (a constant hue shift on the usual red and blue). Most noticeably with the Jaco agents trained with DR, only the first and third actuators are activated by strong visual inputs (given zeroes as the proprioceptive inputs and hidden state), which correspond to the most important joints for accomplishing this reaching task (the rotating base and the elbow).
As a reminder we note that activation maximisation may not (and is practically unlikely to) converge to images within the training data manifold (Mahendran and Vedaldi, 2015). This disadvantage can be addressed by the complementary technique of finding image patches within the training data that maximally activate individual neurons (Girshick et al., 2014).
We calculated statistical and structural weight characteristics over all trained models (Fetch and Jaco, with/without proprioception, with/without DR, 5 seeds), which gives us many factors to average over when trying to examine the effects of DR. We analysed the norms (Subsection [) and moments (Subsection [) of all of the weights of the trained agents, but in general did not come across any particular trends. The most meaningful characterisations were the \ell_{1}-norm and our relative entropy measure, H_{R}, (Subsection [), applied to the convolutional filters. Our analysis is based on treating these measures as probability distributions over the observed values via the use of kernel density estimation. Firstly, we see that with DR the distribution of \ell_{1}-norms is skewed towards higher values for layer 2 filters (Figure 16b), but there is no noticeable difference for layer 1 filters (Figure 16a). Conversely, with DR, H_{R} is greater for layer 1 filters (Figure 16c), but not layer 2 filters (Figure 16d). As one might expect, DR induces stronger spatial structure in the convolutional filters, though in our setups this seems to be localised to only the first convolutional layer. We further evaluate these trends by calculating the Kullback-Leibler (KL) divergence between marginal DR and no DR distributions, and find that there are indeed significant differences between these conditions (Figure 17).
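The kind of comparison described above (per-filter \ell_{1}-norms, a kernel density estimate over the observed values, and a KL divergence between two such estimates) can be sketched as follows. The Gaussian kernel, the fixed bandwidth and the Riemann-sum approximation of the KL divergence on a uniform grid are assumptions of this sketch; the text does not specify which kernel, bandwidth or integration scheme was used.

```python
import math
from typing import Callable, List, Sequence


def l1_norms(filters: Sequence[Sequence[float]]) -> List[float]:
    """l1-norm of each (flattened) convolutional filter."""
    return [sum(abs(w) for w in flat_filter) for flat_filter in filters]


def gaussian_kde(samples: Sequence[float], bandwidth: float) -> Callable[[float], float]:
    """Density estimate with a Gaussian kernel of fixed, assumed bandwidth."""
    scale = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x: float) -> float:
        return scale * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def kl_divergence(
    p: Callable[[float], float],
    q: Callable[[float], float],
    grid: Sequence[float],
) -> float:
    """Riemann-sum approximation of KL(p || q) on a uniformly spaced grid."""
    dx = grid[1] - grid[0]
    total = 0.0
    for x in grid:
        px, qx = p(x), q(x)
        if px > 0.0 and qx > 0.0:  # skip points where either density underflows
            total += px * math.log(px / qx) * dx
    return total
```

In use, one density would be fitted to the values pooled from the DR agents and another to those from the agents trained without DR, with the grid chosen to cover the range of both sets of values.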
Given access to the trained models, unit ablations allow us to perform a quantitative, white box analysis. Our method is to manually zero the activations of one of the output channels in either the first or second convolutional layers, iterating the process over every channel. We then re-evaluate the modified agents for each of the 8 training settings, using the agent with the best performance over all 5 seeds for each one. These agents are tested on a single x-y plane of the fixed test targets (the full 80 for Fetch, and 125 for Jaco) and on both the standard visual and additive Gaussian noise test scenarios (see Subsection [).
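The ablation loop described above can be sketched as follows. The nested-list `FeatureMaps` type and the `evaluate` callable (assumed to return the test success rate with the given channel zeroed, or with the network intact when passed `None`) are hypothetical stand-ins; in a deep learning framework the zeroing would more typically be applied to the layer's output tensor during the forward pass.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical activation layout for one conv layer: [channel][row][column].
FeatureMaps = List[List[List[float]]]


def ablate_channel(feature_maps: FeatureMaps, channel: int) -> FeatureMaps:
    """Return a copy of the activations with one output channel set to zero."""
    return [
        [[0.0] * len(row) for row in fmap]
        if index == channel
        else [list(row) for row in fmap]
        for index, fmap in enumerate(feature_maps)
    ]


def ablation_sweep(
    num_channels: int,
    evaluate: Callable[[Optional[int]], float],
) -> Tuple[float, Dict[int, float]]:
    """Drop in success rate caused by ablating each channel in turn."""
    baseline = evaluate(None)  # intact network
    drops = {c: baseline - evaluate(c) for c in range(num_channels)}
    return baseline, drops
```

A channel with a large drop is one that the agent depends on heavily, whereas uniformly small drops across all channels suggest a more redundant representation.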
We can make several observations from the plots in Figure 18. First, unit ablations have little effect on the Fetch agents, but have varying effects on the Jaco agents, possibly reflecting the complexity of the sensing and control problem. The one noticeable exception for the Fetch agents is a single unit in layer 1 for the agent trained with proprioception and without DR (Figure 18a): The ablation of the single unit increases the failure rate to 40%. This unit is likely to be strongly (but not exclusively) involved in detecting the location of the target. All of the Fetch agents are robust to noise in the face of unit ablations (Figure 18b,d), which again suggests that the problem is not particularly challenging.
The effects of unit ablations are much more pronounced for the Jaco robots. Under standard visuals, several layer 1 units affect performance (Figure 18a), and also several layer 2 units (Figure 18c), but with a lesser effect. In these conditions, DR does not seem to convey any particular advantage (Figure 18a,c), but with added noise both the distribution and worst-case performance of DR agents under ablation are clearly better than those of the non-DR agents (Figure 18b,d), suggesting that DR does result in more redundant representations.
Moving from unit ablations to layer ablations, we now show the reinitialisation robustness, as well as the change in \ell_{\infty}- and \ell_{2}-norms of the parameters of our trained Fetch and Jaco agents in Figures 19 and 20, respectively. Our results are mostly in line with Zhang et al. (2019): the later layers of the network are robust to reinitialisation after a few epochs of training, and in the case of the Fetch agents, the policy layer is robust to reinitialisation to the original set of weights. For nearly all agents, the recurrent layer is quite robust to reinitialisation to the original set of weights (despite noticeable changes in the weights as measured by both the \ell_{\infty}- and \ell_{2}-norms). While this does not necessarily indicate that the agents do not utilise information over time, it does imply that training the recurrent connections is largely unnecessary for these tasks. The Jaco agent trained with DR and with proprioceptive inputs displays a noticeable difference against the other agents (Figure 20j): we observe that the fully-connected layer (which receives the proprioceptive inputs) is not robust to reinitialisation, indicating its importance for solving the reaching task.
We first consider the quantitative analysis of activations from different trained agents under the different training conditions. Table 4 contains the entanglement scores (Frosst et al., 2019) of the different trained agents, calculated across the first 4 layers (not including the policy/value outputs); as with the original work, we use a 2D t-SNE (Maaten and Hinton, 2008) embedding for the activations. There are two noticeable trends. Firstly, the entanglement scores increase deeper into the network; this supports the notion that the different testing conditions can result in very different visual observations, but the difference between them diminishes as they are further processed by the networks. Secondly, the agents trained with DR have noticeably higher entanglement scores for each layer as compared to their equivalents trained without DR. This quantitatively supports the idea that DR makes agents largely invariant to nuisance visual factors (as opposed to the agents finding different strategies to cope with different visual conditions).
Robot  DR  Prop.  1^{st} Conv.  2^{nd} Conv.  FC  LSTM 
Fetch  ✗  ✗  0.11  0.30  0.56  0.68 
Fetch  ✗  ✓  0.12  0.30  0.45  0.45 
Fetch  ✓  ✗  0.23  0.38  0.62  0.92 
Fetch  ✓  ✓  0.24  0.41  0.58  1.15 
Jaco  ✗  ✗  0.14  0.29  0.52  0.68 
Jaco  ✗  ✓  0.11  0.08  0.43  0.66 
Jaco  ✓  ✗  0.41  0.37  0.55  0.73 
Jaco  ✓  ✓  0.65  0.56  1.21  1.37 
We can also qualitatively support these findings by visualising the same activations in 2D (Figure 21). We use three common embedding techniques in order to show different aspects of the data. Firstly, we use PCA (Pearson, 1901), which linearly embeds the data into dimensions which explain the most variance in the original data; as a result, linearly separable clusters have very different global characteristics. Secondly, we use t-SNE (Maaten and Hinton, 2008), which attempts to retain local structure in the data by calculating pairwise similarities between datapoints and creating a constrained graph layout in which distances in the original high-dimensional and the low-dimensional projection are preserved as much as possible. Thirdly, we use uniform manifold approximation and projection (UMAP) (McInnes et al., 2018), which operates similarly to t-SNE at a high level, but better preserves global structure. Although it is possible to tune t-SNE (Wattenberg et al., 2016), by default, UMAP better shows relevant global structure.
The outcome of our series of experiments, with associated qualitative and quantitative tests, has allowed us to empirically assess the effects of DR on trained DRL agents. Our results support prior work and intuitions about the use of DR, and are in line with research showing that both more traditional NN regularisers, such as weight decay, as well as data augmentation, improve generalisation in DRL (Cobbe et al., 2018). Adding DR to a task makes it more challenging to solve, not just in terms of sample complexity, but also in terms of the effective number of parameters utilised (Subsection [). This results in various changes in the weights according to different quantitative metrics (Subsection [). We find evidence that this leads to a more distributed coding, as there is greater redundancy in the neurons (Subsection [). We applied the idea of entanglement (Frosst et al., 2019), but with respect to visual perturbations, and found that the representations that are learned appear to be more entangled, or invariant, to these changes in the visuals (Subsection [). Put together, these factors provide possible explanations for why the resulting policies generalise to some unseen perturbation types (Subsection [).
While we observe these general trends, it is notable that some of the results are not as obvious a priori. For example, even when provided with proprioceptive inputs, the Fetch agent trained without DR still uses its visual inputs for self-localisation (Subsection [), although this does not seem to be the case when it is trained with DR. We believe that the relative simplicity of the Fetch reaching task, including both sensing and actuation, means that the effects of DR are less pronounced (Subsection [). However, the most unexpected finding was that the performance of the Jaco agent trained with DR and without proprioception dropped when shifting from DR visuals to the standard simulator visuals (Subsection [). With proprioception the gap disappears, which supports the idea that the form of input can have a significant effect on generalisation in agents (Hill et al., 2019), meriting further investigation.
Our investigation has focused on the effects of DR, but it also has a dual purpose, which is to inform research in an opposite sense: in situations where DR is expensive or even infeasible, what approaches can we take to improve generalisation in simulation-to-real transfer? Standard regularisation techniques work to a limited extent (Cobbe et al., 2018), and these could perhaps be augmented with our spatial structure metric, the frequency-domain entropy of convolutional filters (Subsection [), as a regularisation objective. We suggest that a fruitful avenue for future research is to consider techniques for adversarial robustness (Qin et al., 2019), as these have strong regularisation effects that aim to promote robustness to a range of potentially unknown input perturbations.
References
 Andrychowicz et al. (2018) Andrychowicz, M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., et al., 2018. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177.
 Arulkumaran et al. (2017) Arulkumaran, K., Deisenroth, M.P., Brundage, M., Bharath, A.A., 2017. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine 34, 26–38.
 Aubry and Russell (2015) Aubry, M., Russell, B.C., 2015. Understanding deep features with computer-generated imagery, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 2875–2883.
 Bach et al. (2015) Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W., 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one 10, e0130140.
 Barto et al. (1983) Barto, A.G., Sutton, R.S., Anderson, C.W., 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics , 834–846.
 Bellido and Fiesler (1993) Bellido, I., Fiesler, E., 1993. Do backpropagation trained neural networks have normal weight distributions?, in: International Conference on Artificial Neural Networks, Springer. pp. 772–775.
 Bousmalis et al. (2018) Bousmalis, K., Irpan, A., Wohlhart, P., Bai, Y., Kelcey, M., Kalakrishnan, M., Downs, L., Ibarz, J., Pastor, P., Konolige, K., et al., 2018. Using simulation and domain adaptation to improve efficiency of deep robotic grasping, in: 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE. pp. 4243–4250.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W., 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540 .
 Chebotar et al. (2018) Chebotar, Y., Handa, A., Makoviychuk, V., Macklin, M., Issac, J., Ratliff, N., Fox, D., 2018. Closing the simtoreal loop: Adapting simulation randomization with real world experience. arXiv preprint arXiv:1810.05687 .
 Cobbe et al. (2018) Cobbe, K., Klimov, O., Hesse, C., Kim, T., Schulman, J., 2018. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341 .
 Craven and Shavlik (1996) Craven, M., Shavlik, J.W., 1996. Extracting tree-structured representations of trained networks, in: Advances in neural information processing systems, pp. 24–30.
 Deisenroth et al. (2013) Deisenroth, M.P., Neumann, G., Peters, J., et al., 2013. A survey on policy search for robotics. Foundations and Trends® in Robotics 2, 1–142.
 Donahue et al. (2014) Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T., 2014. Decaf: A deep convolutional activation feature for generic visual recognition, in: International conference on machine learning, pp. 647–655.
 Doshi-Velez and Kim (2017) Doshi-Velez, F., Kim, B., 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608 .
 Elman (1989) Elman, J.L., 1989. Representation and structure in connectionist models. Technical Report. Univ. of California at San Diego, La Jolla Center For Research In Language.
 Erhan et al. (2009) Erhan, D., Bengio, Y., Courville, A., Vincent, P., 2009. Visualizing higher-layer features of a deep network. Technical Report 1341. University of Montreal.
 François-Lavet et al. (2018) François-Lavet, V., Henderson, P., Islam, R., Bellemare, M.G., Pineau, J., et al., 2018. An introduction to deep reinforcement learning. Foundations and Trends® in Machine Learning 11, 219–354.
 Freitas (2014) Freitas, A.A., 2014. Comprehensible classification models: a position paper. ACM SIGKDD explorations newsletter 15, 1–10.
 Frosst et al. (2019) Frosst, N., Papernot, N., Hinton, G., 2019. Analyzing and improving representations with the soft nearest neighbor loss, in: International Conference on Machine Learning, pp. 2012–2020.
 Gers et al. (2000) Gers, F.A., Schmidhuber, J., Cummins, F., 2000. Learning to forget: Continual prediction with lstm. Neural Computation 12, 2451–2471.
 Girshick et al. (2014) Girshick, R., Donahue, J., Darrell, T., Malik, J., 2014. Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587.
 Glorot and Bengio (2010) Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256.
 Greydanus et al. (2018) Greydanus, S., Koul, A., Dodge, J., Fern, A., 2018. Visualizing and understanding atari agents, in: International Conference on Machine Learning, pp. 1787–1796.
 Gu et al. (2017) Gu, S., Holly, E., Lillicrap, T., Levine, S., 2017. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates, in: 2017 IEEE International Conference on Robotics and Automation (ICRA), IEEE. pp. 3389–3396.
 Guidotti et al. (2018) Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., Pedreschi, D., 2018. A survey of methods for explaining black box models. ACM computing surveys (CSUR) 51, 93.
 Hamel and Eck (2010) Hamel, P., Eck, D., 2010. Learning features from music audio with deep belief networks, in: ISMIR, Utrecht, The Netherlands. pp. 339–344.
 Han et al. (2015) Han, S., Pool, J., Tran, J., Dally, W., 2015. Learning both weights and connections for efficient neural network, in: Advances in neural information processing systems, pp. 1135–1143.
 Hansen and Salamon (1990) Hansen, L.K., Salamon, P., 1990. Neural network ensembles. IEEE Transactions on Pattern Analysis & Machine Intelligence , 993–1001.
 Hanson and Burr (1990) Hanson, S.J., Burr, D.J., 1990. What connectionist models learn: Learning and representation in connectionist networks. Behavioral and Brain Sciences 13, 471–489.
 Hanson and Pratt (1989) Hanson, S.J., Pratt, L.Y., 1989. Comparing biases for minimal network construction with backpropagation, in: Advances in neural information processing systems, pp. 177–185.
 Hassibi and Stork (1993) Hassibi, B., Stork, D.G., 1993. Second order derivatives for network pruning: Optimal brain surgeon, in: Advances in neural information processing systems, pp. 164–171.
 He et al. (2015) He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in: Proceedings of the IEEE international conference on computer vision, pp. 1026–1034.
 Hill et al. (2019) Hill, F., Lampinen, A., Schneider, R., Clark, S., Botvinick, M., McClelland, J.L., Santoro, A., 2019. Emergent systematic generalization in a situated agent. arXiv preprint arXiv:1910.00571 .
 Hinton and Shallice (1991) Hinton, G.E., Shallice, T., 1991. Lesioning an attractor network: Investigations of acquired dyslexia. Psychological review 98, 74.
 Hochreiter and Schmidhuber (1997) Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural computation 9, 1735–1780.
 Ilyas et al. (2018) Ilyas, A., Engstrom, L., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., Madry, A., 2018. Are deep policy gradient algorithms truly policy gradient algorithms? arXiv preprint arXiv:1811.02553 .
 Jakobi et al. (1995) Jakobi, N., Husbands, P., Harvey, I., 1995. Noise and the reality gap: The use of simulation in evolutionary robotics, in: European Conference on Artificial Life, Springer. pp. 704–720.
 James et al. (2017) James, S., Davison, A.J., Johns, E., 2017. Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task, in: Conference on Robot Learning, pp. 334–343.
 Justesen et al. (2018) Justesen, N., Torrado, R.R., Bontrager, P., Khalifa, A., Togelius, J., Risi, S., 2018. Procedural level generation improves generality of deep reinforcement learning. arXiv preprint arXiv:1806.10729 .
 Khabou et al. (1999) Khabou, M.A., Gader, P.D., Shi, H., 1999. Entropy optimized morphological shared-weight neural networks. Optical Engineering 38, 263–274.
 Kindermans et al. (2017) Kindermans, P.J., Hooker, S., Adebayo, J., Alber, M., Schütt, K.T., Dähne, S., Erhan, D., Kim, B., 2017. The (un)reliability of saliency methods, in: Proceedings of the NeurIPS Interpreting, Explaining and Visualizing Deep Learning Workshop.
 Kindermans et al. (2016) Kindermans, P.J., Schütt, K., Müller, K.R., Dähne, S., 2016. Investigating the influence of noise and distractors on the interpretation of neural networks, in: Proceedings of the NIPS Interpretable Machine Learning in Complex Systems Workshop.
 Kingma and Ba (2014) Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
 Kragic and Vincze (2009) Kragic, D., Vincze, M., 2009. Vision for robotics. Foundations and Trends in Robotics 1, 1–78.
 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, pp. 1097–1105.
 Krkic et al. (1996) Krkic, M., Roberts, S.J., Rezek, I., Jordan, C., 1996. Eeg-based assessment of anaesthetic depth using neural networks .
 Lapuschkin et al. (2019) Lapuschkin, S., Wäldchen, S., Binder, A., Montavon, G., Samek, W., Müller, K.R., 2019. Unmasking clever hans predictors and assessing what machines really learn. Nature communications 10, 1096.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Orr, G.B., Müller, K.R., 1998. Efficient backprop, in: Neural Networks: Tricks of the Trade. Springer, pp. 9–50.
 LeCun et al. (1990) LeCun, Y., Denker, J.S., Solla, S.A., 1990. Optimal brain damage, in: Advances in neural information processing systems, pp. 598–605.
 Levine et al. (2016) Levine, S., Finn, C., Darrell, T., Abbeel, P., 2016. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17, 1334–1373.
 Levine et al. (2018) Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J., Quillen, D., 2018. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37, 421–436.
 Li et al. (2017) Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P., 2017. Pruning filters for efficient convnets, in: ICLR.
 Lin et al. (2013) Lin, M., Chen, Q., Yan, S., 2013. Network in network. International Conference on Learning Representations (ICLR) .
 Liu et al. (2018) Liu, Z., Xu, J., Peng, X., Xiong, R., 2018. Frequency-domain dynamic pruning for convolutional neural networks, in: Advances in Neural Information Processing Systems, pp. 1043–1053.
 Lundberg and Lee (2017) Lundberg, S.M., Lee, S.I., 2017. A unified approach to interpreting model predictions, in: Advances in Neural Information Processing Systems, pp. 4765–4774.
 Luo and Wu (2017) Luo, J.H., Wu, J., 2017. An entropy-based pruning method for cnn compression. arXiv preprint arXiv:1706.05791 .
 Maaten and Hinton (2008) Maaten, L.v.d., Hinton, G., 2008. Visualizing data using t-sne. Journal of machine learning research 9, 2579–2605.
 Madry et al. (2018) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A., 2018. Towards deep learning models resistant to adversarial attacks, in: ICLR.
 Mahendran and Vedaldi (2015) Mahendran, A., Vedaldi, A., 2015. Understanding deep image representations by inverting them, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5188–5196.
 Martinez-Gomez et al. (2014) Martinez-Gomez, J., Fernandez-Caballero, A., Garcia-Varea, I., Rodriguez, L., Romero-Gonzalez, C., 2014. A taxonomy of vision systems for ground mobile robots. International Journal of Advanced Robotic Systems 11, 111.
 McInnes et al. (2018) McInnes, L., Healy, J., Melville, J., 2018. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 .
 Misra et al. (2004) Misra, H., Ikbal, S., Bourlard, H., Hermansky, H., 2004. Spectral entropy based feature for robust asr, in: 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE. pp. I–193.
 Mnih et al. (2016) Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K., 2016. Asynchronous methods for deep reinforcement learning, in: International conference on machine learning, pp. 1928–1937.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al., 2015. Human-level control through deep reinforcement learning. Nature 518, 529.
 Mohamed et al. (2012) Mohamed, A.r., Hinton, G., Penn, G., 2012. Understanding how deep belief networks perform acoustic modelling. neural networks , 6–9.
 Morch et al. (1995) Morch, N., Kjems, U., Hansen, L.K., Svarer, C., Law, I., Lautrup, B., Strother, S., Rehm, K., 1995. Visualization of neural networks using saliency maps, in: Proceedings of ICNN’95 - International Conference on Neural Networks, IEEE. pp. 2085–2090.
 Mordvintsev et al. (2015) Mordvintsev, A., Olah, C., Tyka, M., 2015. Inceptionism: Going deeper into neural networks .
 Nair and Hinton (2010) Nair, V., Hinton, G.E., 2010. Rectified linear units improve restricted boltzmann machines, in: Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814.
 Nguyen et al. (2015) Nguyen, A., Yosinski, J., Clune, J., 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 427–436.
 Odena et al. (2016) Odena, A., Dumoulin, V., Olah, C., 2016. Deconvolution and checkerboard artifacts. Distill URL: http://distill.pub/2016/deconv-checkerboard, doi:10.23915/distill.00003.
 Olah et al. (2017) Olah, C., Mordvintsev, A., Schubert, L., 2017. Feature visualization. Distill doi:10.23915/distill.00007. https://distill.pub/2017/feature-visualization.
 Packer et al. (2018) Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V., Song, D., 2018. Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282 .
 Pascanu et al. (2013) Pascanu, R., Mikolov, T., Bengio, Y., 2013. On the difficulty of training recurrent neural networks, in: International conference on machine learning, pp. 1310–1318.
 Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A., 2017. Automatic differentiation in pytorch .
 Pearson (1901) Pearson, K., 1901. Liii. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 559–572.
 Peng et al. (2018) Peng, X.B., Andrychowicz, M., Zaremba, W., Abbeel, P., 2018. Sim-to-real transfer of robotic control with dynamics randomization, in: 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE. pp. 1–8.
 Plappert et al. (2018) Plappert, M., Andrychowicz, M., Ray, A., McGrew, B., Baker, B., Powell, G., Schneider, J., Tobin, J., Chociej, M., Welinder, P., et al., 2018. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464 .
 Qin et al. (2019) Qin, C., Martens, J., Gowal, S., Krishnan, D., Fawzi, A., De, S., Stanforth, R., Kohli, P., et al., 2019. Adversarial robustness through local linearization. arXiv preprint arXiv:1907.02610 .
 Rauber et al. (2017) Rauber, P.E., Fadel, S.G., Falcao, A.X., Telea, A.C., 2017. Visualizing the hidden activity of artificial neural networks. IEEE transactions on visualization and computer graphics 23, 101–110.
 Reed (1993) Reed, R., 1993. Pruning algorithms-a survey. IEEE transactions on Neural Networks 4, 740–747.
 Ribeiro et al. (2016) Ribeiro, M.T., Singh, S., Guestrin, C., 2016. Why should i trust you?: Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, ACM. pp. 1135–1144.
 Rusu et al. (2017) Rusu, A.A., Večerík, M., Rothörl, T., Heess, N., Pascanu, R., Hadsell, R., 2017. Sim-to-real robot learning from pixels with progressive nets, in: Conference on Robot Learning, pp. 262–270.
 Sadeghi and Levine (2017) Sadeghi, F., Levine, S., 2017. Cad2rl: Real single-image flight without a single real image, in: Robotics: Science and Systems.
 Salakhutdinov and Hinton (2007) Salakhutdinov, R., Hinton, G., 2007. Learning a nonlinear embedding by preserving class neighbourhood structure, in: Artificial Intelligence and Statistics, pp. 412–419.
 Saxe et al. (2014) Saxe, A.M., McClelland, J.L., Ganguli, S., 2014. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, in: ICLR.
 Schulman et al. (2015) Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P., 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 .
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 .
 Selvaraju et al. (2017) Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D., 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626.
 Shrikumar et al. (2017) Shrikumar, A., Greenside, P., Kundaje, A., 2017. Learning important features through propagating activation differences, in: Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR.org. pp. 3145–3153.
 Sietsma and Dow (1988) Sietsma, J., Dow, R.J., 1988. Neural net pruning-why and how, in: Neural Networks, 1988., IEEE International Conference on, pp. 325–333.
 Sietsma and Dow (1991) Sietsma, J., Dow, R.J., 1991. Creating artificial neural networks that generalize. Neural networks 4, 67–79.
 Simonyan et al. (2013) Simonyan, K., Vedaldi, A., Zisserman, A., 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 .
 Springenberg et al. (2015) Springenberg, J., Dosovitskiy, A., Brox, T., Riedmiller, M., 2015. Striving for simplicity: The all convolutional net, in: ICLR (Workshop Track).
 Srinivasan et al. (2005) Srinivasan, V., Eswaran, C., Sriraam, N., 2005. Artificial neural network based epileptic detection using time-domain and frequency-domain features. Journal of Medical Systems 29, 647–660.
 Such et al. (2018) Such, F., Madhavan, V., Liu, R., Wang, R., Castro, P., Li, Y., Schubert, L., Bellemare, M.G., Clune, J., Lehman, J., 2018. An atari model zoo for analyzing, visualizing, and comparing deep reinforcement learning agents, in: Proceedings of the NeurIPS Deep RL Workshop.
 Sundararajan et al. (2017) Sundararajan, M., Taly, A., Yan, Q., 2017. Axiomatic attribution for deep networks, in: Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR.org. pp. 3319–3328.
 Sutton and Barto (2018) Sutton, R.S., Barto, A.G., 2018. Reinforcement learning: An introduction. MIT press.
 Tobin et al. (2017) Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., Abbeel, P., 2017. Domain randomization for transferring deep neural networks from simulation to the real world, in: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE. pp. 23–30.
 Todorov et al. (2012) Todorov, E., Erez, T., Tassa, Y., 2012. Mujoco: A physics engine for model-based control, in: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE. pp. 5026–5033.
 Tzeng et al. (2015) Tzeng, E., Devin, C., Hoffman, J., Finn, C., Peng, X., Levine, S., Saenko, K., Darrell, T., 2015. Towards adapting deep visuomotor representations from simulated to real environments. arXiv preprint arXiv:1511.07111 .
 Uesato et al. (2018) Uesato, J., Kumar, A., Szepesvari, C., Erez, T., Ruderman, A., Anderson, K., Heess, N., Kohli, P., et al., 2018. Rigorous agent evaluation: An adversarial approach to uncover catastrophic failures. arXiv preprint arXiv:1812.01647 .
 Wattenberg et al. (2016) Wattenberg, M., Viégas, F., Johnson, I., 2016. How to use t-sne effectively. Distill 1, e2.
 Wierstra et al. (2007) Wierstra, D., Foerster, A., Peters, J., Schmidhuber, J., 2007. Solving deep memory pomdps with recurrent policy gradients, in: International Conference on Artificial Neural Networks, Springer. pp. 697–706.
 Williams and Peng (1991) Williams, R.J., Peng, J., 1991. Function optimization using connectionist reinforcement learning algorithms. Connection Science 3, 241–268.
 Witty et al. (2018) Witty, S., Lee, J.K., Tosch, E., Atrey, A., Littman, M., Jensen, D., 2018. Measuring and characterizing generalization in deep reinforcement learning. arXiv preprint arXiv:1812.02868 .
 Yamins and DiCarlo (2016) Yamins, D.L., DiCarlo, J.J., 2016. Using goal-driven deep learning models to understand sensory cortex. Nature neuroscience 19, 356.
 Zeiler and Fergus (2014) Zeiler, M.D., Fergus, R., 2014. Visualizing and understanding convolutional networks, in: European conference on computer vision, Springer. pp. 818–833.
 Zhang et al. (2018a) Zhang, A., Ballas, N., Pineau, J., 2018a. A dissection of overfitting and generalization in continuous reinforcement learning. arXiv preprint arXiv:1806.07937 .
 Zhang et al. (2019) Zhang, C., Bengio, S., Singer, Y., 2019. Are all layers created equal? arXiv preprint arXiv:1902.01996 .
 Zhang et al. (2018b) Zhang, C., Vinyals, O., Munos, R., Bengio, S., 2018b. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893 .
 Zhao et al. (2019) Zhao, C., Sigaud, O., Stulp, F., Hospedales, T.M., 2019. Investigating generalisation in continuous deep reinforcement learning. arXiv preprint arXiv:1902.07015 .
 Zheng et al. (1996) Zheng, B., Qian, W., Clarke, L.P., 1996. Digital mammography: mixed feature neural network with spectral entropy decision for detection of microcalcifications. IEEE Transactions on Medical Imaging 15, 589–597.
 Zhou et al. (2016) Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A., 2016. Learning deep features for discriminative localization, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929.
 Zhu et al. (2017) Zhu, Y., Mottaghi, R., Kolve, E., Lim, J.J., Gupta, A., Fei-Fei, L., Farhadi, A., 2017. Target-driven visual navigation in indoor scenes using deep reinforcement learning, in: 2017 IEEE international conference on robotics and automation (ICRA), IEEE. pp. 3357–3364.