Data-Efficient Learning of Feedback Policies from Image Pixels using Deep Dynamical Models
Abstract
Data-efficient reinforcement learning (RL) in continuous state-action spaces using very high-dimensional observations remains a key challenge in developing fully autonomous systems. We consider a particularly important instance of this challenge, the pixels-to-torques problem, where an RL agent learns a closed-loop control policy (“torques”) from pixel information only. We introduce a data-efficient, model-based reinforcement learning algorithm that learns such a closed-loop policy directly from pixel information. The key ingredient is a deep dynamical model for learning a low-dimensional feature embedding of images jointly with a predictive model in this low-dimensional feature space. Joint learning is crucial for long-term predictions, which lie at the core of the adaptive nonlinear model predictive control strategy that we use for closed-loop control. Compared to state-of-the-art RL methods for continuous states and actions, our approach learns quickly, scales to high-dimensional state spaces, is lightweight, and is an important step toward fully autonomous end-to-end learning from pixels to torques.
John-Alexander M. Assael† (Department of Computing, Imperial College London, UK, i.assael@imperial.ac.uk) · Niklas Wahlström† (Division of Automatic Control, Linköping University, Sweden, nikwa@isy.liu.se) · Thomas B. Schön (Department of Information Technology, Uppsala University, Sweden, thomas.schon@it.uu.se) · Marc Peter Deisenroth (Department of Computing, Imperial College London, UK, m.deisenroth@imperial.ac.uk)
†These authors contributed equally to this work.
1 Introduction
The vision of fully autonomous and intelligent systems that learn by themselves has inspired artificial intelligence (AI) and robotics research for many decades. The pixels-to-torques problem identifies key aspects of such an autonomous system: autonomous thinking and decision making using (general-purpose) sensor measurements only, intelligent exploration, and learning from mistakes. We consider the problem of efficiently learning closed-loop policies (“torques”) from pixel information end-to-end. Although this problem falls into the general class of reinforcement learning (RL) [1], it is challenging for the following reasons: (1) the state space (here defined by pixel values) is enormous (e.g., for an $n \times n$ image, we are looking at $n^2$ continuous-valued dimensions); (2) in many practical applications, we need to find solutions data-efficiently: when working with real systems, e.g., robots, we cannot perform millions of experiments because of time and hardware constraints.
One way of using data efficiently, and therefore reducing the number of experiments, is to learn predictive forward models of the underlying dynamical system, which are then used for internal simulations and policy learning. These ideas have been successfully applied to RL, control, and robotics [2, 3, 4, 5, 6, 7], for instance. However, they often rely on heuristics, demonstrations, or engineered low-dimensional features, and do not easily scale to data-efficient RL using pixel information only.
A common way of dealing with high-dimensional data is to learn low-dimensional feature representations. Deep learning architectures, such as deep neural networks [8], stacked autoencoders [9, 10], or convolutional neural networks [11], are the current state of the art in learning parsimonious representations of high-dimensional data. Since 2006, deep learning has produced outstanding empirical results in image, text, and audio tasks [12].
Related Work
In recent months, there has been significant progress in the context of the pixels-to-torques problem. A first working solution was presented in 2015 [13], where an RL agent automatically learned to play Atari games purely based on pixel information. The key idea was to embed the high-dimensional pixel space into a lower-dimensional space using deep neural networks and to apply Q-learning in this compact feature space. A potential issue with this approach is that it is not a data-efficient way of learning policies (weeks of training data are required), i.e., it is impractical to apply it to a robotic scenario. This data inefficiency is not specific to Q-learning, but a general problem of model-free RL methods [3, 14].
To increase data efficiency, model-based RL methods aim to learn a model of the transition dynamics of the system/robot and subsequently use this model as a surrogate simulator. Recently, the idea of learning predictive models from raw images, where only pixel information is available, has been exploited [15, 16]. The approach taken here follows the idea of Deep Dynamical Models (DDMs) [17]: instead of learning predictive models for images directly, a detour via a low-dimensional feature space is taken by embedding images into a lower-dimensional feature space, e.g., with a deep autoencoder. This detour is promising since direct mappings between high-dimensional spaces require large data sets. Whereas Wahlström et al. [15] consider deterministic systems and nonlinear model predictive control (NMPC) techniques for online control, Watter et al. [16] use variational autoencoders [18], local linearization, and locally linear control methods (iLQR [19] and AICO [20]).
To model the dynamical behavior of the system, the pixels of both the current and previous frame are used. Watter et al. [16] concatenate the input pixels to discover such features, whereas Wahlström et al. [15] concatenate the processed low-dimensional embeddings of the two states. The latter approach requires far fewer parameters, which makes it a promising candidate for more data-efficient learning. Nevertheless, properties such as local linearization [16] can be attractive. However, the complex architecture proposed by Watter et al. [16] is based on very large neural network models with millions of parameters for learning to swing up a single pendulum. A vast number of training parameters results in higher model complexity and, thus, decreased statistical efficiency. Hence, an excessive number of training samples, which might not be available, is required to learn the underlying system, and training takes several days. These properties make data-efficient learning complicated. Therefore, we propose a relatively lightweight architecture that addresses the pixels-to-torques problem in a data-efficient manner.
Contribution
We propose a data-efficient model-based RL algorithm that addresses the pixels-to-torques problem. (1) We devise a data-efficient policy learning framework based on the DDM approach for learning predictive models for images. We use state-of-the-art optimization techniques for training the DDM. (2) Our model profits from a concatenation of low-dimensional features (instead of high-dimensional images) to model dynamical behavior, yielding several times fewer model parameters and faster training than the complex E2C architecture [16]. In practice, our model requires only a few hours of training, while E2C [16] requires days. (3) We introduce a novel training objective that encourages consistency in the latent space, paving the way toward more accurate long-term predictions. Overall, we use an efficient model architecture that can learn tasks with complex nonlinear dynamics.
2 Problem Setup and Objective
We consider a $T$-step finite-horizon RL setting in which an agent attempts to solve a particular task by trial and error. In particular, our objective is to find a closed-loop policy $\pi$ that minimizes the long-term cost $\sum_{t=1}^{T} c(x_t, u_t)$, where $c$ denotes an immediate cost, $x_t$ is the continuous-valued system state, and $u_t$ are continuous control signals.
The learning agent faces the following two additional challenges: (a) the agent has no access to the true state $x_t$, but perceives the environment only through high-dimensional pixel information $y_t$ (images); (b) a good control policy is required after only a few trials. This setting is practically relevant, e.g., when the agent is a robot that is monitored by a video camera based on which the robot has to learn to solve tasks fully autonomously. Therefore, this setting is an instance of the pixels-to-torques problem.
We solve this problem in three key steps, which are detailed in the following sections: (a) using a deep autoencoder architecture, we map the high-dimensional pixel information $y_t$ to a low-dimensional embedding/feature $z_t$; (b) we combine the features $z_t$ and $z_{t-1}$ with the control signal $u_t$ to learn a predictive model of the system dynamics for predicting future features $z_{t+1}$; (a) and (b) form a Deep Dynamical Model (DDM) [17]; (c) we apply an adaptive nonlinear model predictive control strategy for optimal closed-loop control and end-to-end learning from pixels to torques.
3 Learning a Deep Dynamical Model (DDM)
Our approach to solving the pixels-to-torques problem is based on a deep dynamical model (DDM), see Figure 1, which jointly (a) embeds high-dimensional images in a low-dimensional feature space via deep autoencoders, and (b) learns a predictive forward model in this feature space, following the work by Wahlström et al. [17]. In particular, we consider a DDM with control signals $u_t$ and high-dimensional observations $y_t$ at time step $t$. We assume that the relevant properties of $y_t$ can be compactly represented by a feature variable $z_t$. Furthermore, $\hat{y}_t$ is the reconstructed high-dimensional measurement. The two components of the DDM, i.e., the low-dimensional feature and the predictive model, which predicts future features and observations based on past observations and control signals, are detailed in the following sections.
3.1 Predictive Forward Model
Inspired by the concept of static autoencoders [21, 22], we turn them into a dynamical model that can predict future features $z_{t+1}$ and images $y_{t+1}$. Our DDM consists of the following elements:

An encoder $f_{\text{enc}}$ mapping high-dimensional observations $y_t$ onto low-dimensional features $z_t$,

A decoder $f_{\text{dec}}$ mapping low-dimensional features $z_t$ back to high-dimensional observations $\hat{y}_t$, and

The predictive model $f_{\text{pred}}$, which takes $z_t$, $z_{t-1}$, and $u_t$ as input and predicts the next latent feature $\hat{z}_{t+1}$.
The $f_{\text{enc}}$, $f_{\text{dec}}$, and $f_{\text{pred}}$ functions of our DDM are neural network models performing the following transformations:
(1a) $z_t = f_{\text{enc}}(y_t;\, \theta_{\text{enc}})$
(1b) $\hat{y}_t = f_{\text{dec}}(z_t;\, \theta_{\text{dec}})$
(1c) $\hat{z}_{t+1} = f_{\text{pred}}(z_t, z_{t-1}, u_t;\, \theta_{\text{pred}})$
(1d) $\hat{y}_{t+1} = f_{\text{dec}}(\hat{z}_{t+1};\, \theta_{\text{dec}})$
We now put these elements together to construct the DDM. The DDM architecture takes the raw images $y_t$ and $y_{t-1}$ as input and maps them to their low-dimensional features $z_t$ and $z_{t-1}$, respectively, using $f_{\text{enc}}$ in (1a). These latent features are then concatenated and, together with the control signal $u_t$, used to predict $\hat{z}_{t+1}$ with $f_{\text{pred}}$ in (1c). Finally, the predicted feature $\hat{z}_{t+1}$ is passed through the decoder network $f_{\text{dec}}$ to compute the predicted image $\hat{y}_{t+1}$. The overall architecture is depicted in Figure 2.
The neural networks $f_{\text{enc}}$, $f_{\text{dec}}$, and $f_{\text{pred}}$ that compose the DDM are parameterized by $\theta_{\text{enc}}$, $\theta_{\text{dec}}$, and $\theta_{\text{pred}}$, respectively. These parameters consist of the weights that perform linear transformations of the input data in each neuron.
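To make the data flow concrete, the following is a minimal sketch of such a DDM in PyTorch. The paper does not provide an implementation; the module names, layer counts, and the hidden width are illustrative assumptions, and only the overall wiring follows (1a)–(1d) and Figure 2.

```python
import torch
import torch.nn as nn

class DDM(nn.Module):
    """Encoder, decoder, and feature-space predictor, wired as in (1a)-(1d)."""
    def __init__(self, obs_dim, feat_dim, ctrl_dim, hidden=100):
        super().__init__()
        # f_enc: y_t -> z_t                                  (1a)
        self.f_enc = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, feat_dim))
        # f_dec: z_t -> y_hat_t                              (1b)
        self.f_dec = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, obs_dim))
        # f_pred: (z_t, z_{t-1}, u_t) -> z_hat_{t+1}         (1c)
        self.f_pred = nn.Sequential(nn.Linear(2 * feat_dim + ctrl_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, feat_dim))

    def forward(self, y_prev, y_curr, u):
        z_prev, z_curr = self.f_enc(y_prev), self.f_enc(y_curr)
        # concatenate the two features with the control signal
        z_next_hat = self.f_pred(torch.cat([z_curr, z_prev, u], dim=-1))
        y_next_hat = self.f_dec(z_next_hat)                # (1d)
        return z_prev, z_curr, z_next_hat, y_next_hat
```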
3.2 Training
For training the DDM in (1), we introduce a novel training objective that encourages consistency in the latent space, paving the way toward accurate long-term predictions. More specifically, for our training objective we define the following cost functions:
(2a) $c_{\text{R},t} = \lVert y_t - \hat{y}_t \rVert^2$
(2b) $c_{\text{P},t} = \lVert y_{t+1} - \hat{y}_{t+1} \rVert^2$
(2c) $c_{\text{L},t} = \lVert z_{t+1} - \hat{z}_{t+1} \rVert^2$
where $c_{\text{R},t}$ is the squared deep autoencoder reconstruction error and $c_{\text{P},t}$ is the squared prediction error, both operating in image space. Note that $c_{\text{P},t}$ depends on the parameters of the decoder, the predictive model, and the encoder. Additionally, we introduce $c_{\text{L},t}$, which enforces consistency between the latent spaces of the encoder and the prediction model. In the big-data regime, this additional penalty in latent space is not necessary, but if not much data is available, this additional term increases the data efficiency, as the prediction model is forced to make predictions close to the next embedded feature $z_{t+1}$. The overall training objective on the current dataset is given by
(3) $V(\theta) = \sum_{t} \left( c_{\text{R},t} + c_{\text{P},t} + \lambda\, c_{\text{L},t} \right)$
where $\lambda$ is a parameter that controls the influence of $c_{\text{L},t}$. Finally, we train the DDM parameters $\theta = (\theta_{\text{enc}}, \theta_{\text{dec}}, \theta_{\text{pred}})$ by jointly minimizing the overall cost:
(4) $\hat{\theta} = \arg\min_{\theta} V(\theta)$
Training all components jointly leads to good predictions, as it facilitates the extraction and separation of features that describe the underlying dynamical system, rather than only features for creating good reconstructions [17].
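As an illustration, the combined objective (2)–(3) could be computed as follows for a batch of transitions; this is a hedged sketch building on the hypothetical `DDM` module above, with `lam` standing in for the latent-consistency weight $\lambda$.

```python
import torch

def ddm_loss(model, y_prev, y_curr, u, y_next, lam=1.0):
    """Reconstruction (2a) + image-space prediction (2b) + latent consistency (2c)."""
    z_prev, z_curr, z_next_hat, y_next_hat = model(y_prev, y_curr, u)
    c_r = ((model.f_dec(z_curr) - y_curr) ** 2).sum(-1).mean()      # (2a)
    c_p = ((y_next_hat - y_next) ** 2).sum(-1).mean()               # (2b)
    c_l = ((z_next_hat - model.f_enc(y_next)) ** 2).sum(-1).mean()  # (2c)
    return c_r + c_p + lam * c_l                                    # (3)
```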
3.3 Network Architecture
The neural networks $f_{\text{enc}}$, $f_{\text{dec}}$, and $f_{\text{pred}}$ are composed of linear layers, where all but the last layer are followed by Rectified Linear Unit (ReLU) activation functions [23]. As demonstrated in [24], ReLU nonlinearities allow a network to train faster than conventional saturating units, as evaluated on the CIFAR-10 dataset [25]. Furthermore, similar to Watter et al. [16], we use Adam [26], a state-of-the-art method for stochastic gradient optimization, to train the DDM. Finally, after evaluating different weight initialization schemes, such as uniform and Gaussian random initialization [27, 24], the weights of the DDM were initialized using orthogonal weight initialization [28], which demonstrated the most efficient training performance, leading to decoupled weights that evolve independently of each other.
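Continuing the hypothetical sketch above, the initialization and optimizer setup might look as follows; the dimensions and learning rate are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

def init_orthogonal(module):
    if isinstance(module, nn.Linear):
        nn.init.orthogonal_(module.weight)   # orthogonal weight initialization [28]
        nn.init.zeros_(module.bias)

model = DDM(obs_dim=50, feat_dim=2, ctrl_dim=1)   # placeholder dimensions
model.apply(init_orthogonal)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam [26]
```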
4 Policy Learning
Our objective is to control the system toward a state corresponding to a given target frame, without any prior knowledge of the system at hand. To accomplish this, we use the DDM to learn a closed-loop policy by means of nonlinear model predictive control (NMPC).
4.1 NMPC using the DDM
NMPC finds an optimal sequence of control signals that minimizes a $P$-step loss function, where $P$ is typically smaller than the full horizon $T$. We choose to perform the control in the low-dimensional embedded space to reduce the complexity of the control problem.
Our NMPC formulation relies on (a) a target feature $z_{\text{ref}}$ and (b) the DDM that allows us to predict future features. The target feature $z_{\text{ref}}$ is computed by encoding the provided target frame. Further, with the DDM, future features $\hat{z}_{t+1}, \ldots, \hat{z}_{t+P}$ can be predicted based on a sequence of future (and yet unknown) controls $u_t, \ldots, u_{t+P-1}$ and two initial encoded features, where the current feature is denoted by $z_t$.
Using the dynamical model, NMPC determines an optimal (open-loop) control sequence $u_t^*, \ldots, u_{t+P-1}^*$, such that the predicted features get as close to the target feature as possible, which results in the objective
(5) $u_t^*, \ldots, u_{t+P-1}^* = \arg\min_{u_t, \ldots, u_{t+P-1}} \sum_{k=1}^{P} \left( \lVert \hat{z}_{t+k} - z_{\text{ref}} \rVert^2 + \lambda_u \lVert u_{t+k-1} \rVert^2 \right)$
where $\lVert \hat{z}_{t+k} - z_{\text{ref}} \rVert^2$ is a cost associated with the deviation of the predicted features from the reference feature $z_{\text{ref}}$, and $\lVert u_{t+k-1} \rVert^2$ penalizes the amplitude of the control signals. Here, $\lambda_u$ is a tuning parameter adjusting the importance of these two objectives. Once the control sequence is determined, the first control $u_t^*$ is applied to the system. After observing the next feature, NMPC repeats the entire optimization, turning the overall policy into a closed-loop (feedback) control strategy.
Overall, we now have an online NMPC algorithm that, given a trained DDM, works indirectly on images by exploiting their feature representation.
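To make the receding-horizon computation concrete, here is a hedged sketch that optimizes the open-loop sequence in (5) by gradient descent through the differentiable DDM sketch above; the horizon `P`, penalty `lam_u`, optimizer, and iteration budget are assumptions rather than the paper's settings.

```python
import torch

def nmpc_action(model, z_prev, z_curr, z_ref, ctrl_dim=1, P=15, lam_u=0.01, iters=50):
    """Optimize a P-step control sequence through the learned DDM, cf. (5)."""
    z_prev, z_curr = z_prev.detach(), z_curr.detach()
    u_seq = torch.zeros(P, ctrl_dim, requires_grad=True)   # candidate open-loop controls
    opt = torch.optim.Adam([u_seq], lr=0.1)
    for _ in range(iters):
        opt.zero_grad()
        zp, zc = z_prev, z_curr
        cost = torch.zeros(())
        for k in range(P):                                 # roll the DDM forward in feature space
            zn = model.f_pred(torch.cat([zc, zp, u_seq[k]], dim=-1))
            cost = cost + ((zn - z_ref) ** 2).sum() + lam_u * (u_seq[k] ** 2).sum()
            zp, zc = zc, zn
        cost.backward()
        opt.step()
    return u_seq[0].detach()                               # receding horizon: apply only the first control
```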
4.2 Adaptive NMPC for Learning from Scratch
We now describe how adaptive NMPC can be used together with our DDM to address the pixels-to-torques problem and to learn from scratch. At the core of our NMPC formulation lies the DDM, which is used to predict future features (and images) from a sequence of control signals. The quality of the NMPC controller is inherently bound to the prediction quality of the dynamical model, which is typical in model-based RL [14, 29, 5].
To learn models and controllers from scratch, we apply a control scheme that allows us to update the DDM as new data arrives. In particular, we use the NMPC controller in an adaptive fashion to gradually improve the model with data collected in the feedback loop, without any specific prior knowledge of the system at hand. Data collection is performed in closed loop (online NMPC) and is divided into multiple sequential trials. After each trial, we add the data of the most recent trajectory to the dataset and retrain the model using all data collected so far.
Simply applying the NMPC controller based on a randomly initialized model would very likely cause the closed-loop system to converge to a point far away from the desired reference value, since the poor model cannot extrapolate well to unseen states. This would in turn imply that no data is collected in unexplored regions, including the region that we are interested in. There are two solutions to this problem: either we use a probabilistic dynamical model [14, 5] to explicitly account for model uncertainty and the implied natural exploration, or we follow an explicit exploration strategy to ensure proper excitation of the system. In this paper, we follow the latter approach. In particular, we choose an $\varepsilon$-greedy exploration strategy, where the optimal feedback is selected at each time step with probability $1-\varepsilon$, and a random action with probability $\varepsilon$.
Algorithm 1 summarizes our adaptive online NMPC scheme. We initialize the DDM with a random trial. We then use the learned DDM to find an $\varepsilon$-greedy policy using predicted features within NMPC. This happens online, while the collected data is added to the dataset and the DDM is updated after each trial.
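As a schematic illustration of this loop (Algorithm 1 itself is not reproduced here), the adaptive scheme might be organized as follows; `env`, `train_ddm`, the trial lengths, and `epsilon` are hypothetical placeholders for the simulator interface, the training procedure of Section 3.2, and the (unspecified) experiment settings, and `nmpc_action` is the sketch above.

```python
import random
import torch

def adaptive_nmpc(env, z_ref, num_trials=10, trial_length=500, epsilon=0.2):
    dataset = env.random_trial()                 # 1. initialize the DDM with a random trial
    for _ in range(num_trials):
        model = train_ddm(dataset)               # 2. (re)train the DDM on all data so far
        y_prev, y_curr = env.reset()
        for _ in range(trial_length):            # 3. control in closed loop, online
            if random.random() < epsilon:        #    epsilon-greedy exploration
                u = env.random_action()
            else:
                z_prev = model.f_enc(torch.as_tensor(y_prev))
                z_curr = model.f_enc(torch.as_tensor(y_curr))
                u = nmpc_action(model, z_prev, z_curr, z_ref)
            y_next = env.step(u)
            dataset.append((y_prev, y_curr, u, y_next))   # 4. grow the dataset
            y_prev, y_curr = y_curr, y_next
    return model
```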
5 Experimental Evaluation
In this section, we empirically assess the components of our proposed methodology for autonomous learning from high-dimensional synthetic image data, by learning the underlying dynamics of a single and a planar double pendulum. The main lines of the evaluation are: (a) the quality of the learned DDM and (b) the overall learning framework.
In both experiments, we consider the following setting: we take screenshots of a simulated pendulum system at a fixed sampling frequency. Each pixel is a component of the measurement $y_t$ and takes a continuous gray value in the interval $[0, 1]$. The control signals $u_t$ are the torques applied to the system. No access to the underlying dynamics or the state (angles and angular velocities) was available, i.e., we are dealing with a high-dimensional continuous time series. The challenge was to data-efficiently learn (a) a good dynamical model and (b) a good controller from pixel information only.
To speed up the training process, we applied PCA prior to model learning as a preprocessing step to reduce the dimensionality of the original problem. On these inputs, a deep autoencoder was employed, with a feature dimensionality suited to model the periodic angle of the pendulums. The features $z_t$ and $z_{t-1}$ were then passed to the feedforward predictive network generating $\hat{z}_{t+1}$. Furthermore, during training, the parameter $\lambda$ encouraging consistent latent-space predictions was fixed to the same value in both experiments, as was the NMPC tuning parameter $\lambda_u$ that penalizes the amplitude of the control signals.
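For concreteness, the PCA preprocessing step could be implemented as in the following sketch; the number of retained components `k` is a free parameter standing in for the (unspecified) setting used in the experiments.

```python
import numpy as np

def pca_fit(Y, k):
    """Fit PCA on an (N, D) matrix of flattened grayscale frames."""
    mean = Y.mean(axis=0)
    # top-k principal directions via SVD of the centered data
    _, _, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    return mean, Vt[:k].T                 # (D,) mean and (D, k) projection matrix

def pca_project(y, mean, W):
    """Map a flattened frame to the reduced k-dim input fed to the encoder."""
    return (y - mean) @ W
```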
5.1 Planar Pendulum


The first experiment evaluates the performance of the DDM on a planar pendulum, assembled as a 1-link robot arm with fixed length, weight, and friction coefficient. The input dimension of the screenshots was first reduced using PCA. These inputs are processed by an encoder of the form Linear – ReLU – Linear – ReLU – Linear.
The low-dimensional features are two-dimensional in order to model the periodic angle of the pendulum. To capture dynamic properties, such as the angular velocity, we concatenate two consecutive features with the control signal and pass them through the predictive model $f_{\text{pred}}$, which has the same Linear – ReLU – Linear – ReLU – Linear structure. Note that the dimensionality of its first layer is given by the two concatenated features plus the control signal. Finally, the predicted feature $\hat{z}_{t+1}$ can be mapped back to image space using our decoder, which mirrors the encoder architecture.
The performance of the DDM is illustrated in Figure 3 on a test data set. The top row shows the true images and the bottom row the DDM's long-term predictions. The model yields good predictive performance for both one-step-ahead and multiple-step-ahead prediction, a consequence of (a) jointly learning the predictor and autoencoder, and (b) concatenating features instead of images to model the dynamic behavior.
In Figure 4, we show the learned feature space for different pendulum angles spanning a full revolution. The DDM has learned to generate features that represent the angle of the pendulum: they are mapped to a circle-like shape, accounting for the wrap-around property of an angle.
Finally, in Figure 5, we report results on learning a policy that moves the pendulum from a downward start position to an upright target position. The reference signal was the screenshot of the pendulum in the target position. For the NMPC controller, we used a planning horizon of $P$ steps and a control penalty $\lambda_u$; for the $\varepsilon$-greedy exploration strategy, we used a fixed $\varepsilon$.
Figure 5 shows the learning stages of the system, i.e., the different trials of the NMPC controller. Starting with a randomly initialized model, the images of each trial were appended to the dataset. As can be seen, already from the first controlled trial, the system managed to control the pendulum successfully and bring it to within a small angular distance of the target position. This means the solution is found very data-efficiently, especially considering that the problem is learned from pixel information without access to the “true” state.
5.2 Planar Double Pendulum


In this experiment, a planar double pendulum is considered, assembled as a 2-link robot arm with fixed link lengths, weights, and friction coefficients. Torques can be applied at both joints. The input dimension of the screenshots has been reduced prior to model learning using PCA to speed up the training process. The encoder again has the form Linear – ReLU – Linear – ReLU – Linear, the decoder mirrors it, and the predictive model follows the same structure, operating on the low-dimensional embeddings.
The predictive performance of the DDM is shown in Figure 3 on a test data set. The performance of the controller is depicted in Figure 5. We used trials with the downward initial position and an upward target for the angles of both the inner and outer pendulum. The figure shows the error after each trial (1000 frames) and clearly indicates that after three controlled trials a good solution is found, which brings both pendulums to within a small range of the target angles.
Despite the high complexity of the dynamical system, our learning framework manages to successfully control both pendulums after the third trial in nearly all cases.
5.3 Comparison with the State of the Art
The same experiments were executed with PILCO [30], a state-of-the-art policy search method, under the following settings: (a) PILCO has access to the true state, i.e., the angle and angular velocity; (b) a deep autoencoder is used to learn two-dimensional features from images, which are used by PILCO for policy learning. In setting (a), PILCO managed to successfully reach the target after the second and third trial in the two experiments, respectively. However, in setting (b), PILCO did not manage to learn anything meaningful at all. The reason PILCO could not learn on autoencoder features is that these features were trained only to minimize the reconstruction error: the autoencoder did not attempt to map similar images to similar features, which led to zigzagging around in feature space (instead of following a smooth manifold as in Figure 4), making the model-learning part in feature space incredibly hard [17].
We modeled and controlled equally complex systems with E2C [16], yet our DDM requires fewer neural network parameters when the same PCA preprocessing step is used within E2C. The reason lies in our efficient processing of the dynamics in the feature space instead of the image space. The parameter gap grows even larger when E2C is used without the PCA preprocessing step.
The reduced number of parameters translates directly into reduced training time and increased data efficiency. Employing adaptive model predictive control, our proposed DDM requires significantly fewer data samples, as it efficiently focuses on learning the latent space around the reference target state. Furthermore, the control performance of our model gradually improves with the number of trials. As our experimental evaluation shows, we can successfully control a complex dynamical system, such as the planar double pendulum, from comparatively few samples. This adaptive learning approach can be essential in problems with time and hardware constraints.
6 Conclusion
We proposed a data-efficient model-based RL algorithm that learns closed-loop policies in continuous state and action spaces directly from pixel information. The key components of our solution are (a) a deep dynamical model (DDM) that is used for long-term predictions via a compact feature space, (b) a novel training objective that encourages consistency in the latent space, paving the way toward more accurate long-term predictions, and (c) an NMPC controller that uses the predictions of the DDM to determine optimal actions on the fly without the need for value function estimation. For the success of this RL algorithm, it is crucial that the DDM learns the feature mapping and the predictive model in feature space jointly to capture dynamical behavior for high-quality long-term predictions. Compared to state-of-the-art RL, our algorithm learns fairly quickly, scales to high-dimensional state spaces, and facilitates learning from pixels to torques.
Acknowledgements
We thank Roberto Calandra for valuable discussions in the early stages of the project. The Tesla K40 used for this research was donated by the NVIDIA Corporation.
References
 Sutton and Barto [1998] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT Press, 1998.
 Schmidhuber [1990] J. Schmidhuber. An online algorithm for dynamic reinforcement learning and planning in reactive environments. In International Joint Conference on Neural Networks (IJCNN), pages 253–258. IEEE, 1990.
 Atkeson and Schaal [1997] C. G. Atkeson and S. Schaal. Learning tasks from a single demonstration. In Proceedings of the International Conference on Robotics and Automation (ICRA). IEEE, 1997.
 Bagnell and Schneider [2001] J. A. Bagnell and J. G. Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In Proceedings of the IEEE International Conference on Robotics and Automation, volume 2, pages 1615–1620, 2001.
 Deisenroth et al. [2015] M. P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian processes for dataefficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(2):408–423, 2015.
 Pan and Theodorou [2014] Y. Pan and E. Theodorou. Probabilistic differential dynamic programming. In Advances in Neural Information Processing Systems, pages 1907–1915, 2014.
 Levine et al. [2015] S. Levine, C. Finn, T. Darrell, and P. Abbeel. Endtoend training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.
 Hinton and Salakhutdinov [2006] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 Bengio et al. [2007] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems (NIPS), pages 153–160. MIT Press, 2007.
 Vincent et al. [2008] P. Vincent, L. Hugo, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning (ICML), pages 1096–1103. ACM, 2008. ISBN 9781605582054.
 LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Schmidhuber [2014] J. Schmidhuber. Deep learning in neural networks: An overview. Technical Report IDSIA0314 / arXiv:1404.7828v1 [cs.NE], The Swiss AI Lab IDSIA, 2014.
 Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Schneider [1997] J. G. Schneider. Exploiting model uncertainty estimates for safe dynamic control learning. In Advances in Neural Information Processing Systems (NIPS). 1997.
 Wahlström et al. [2015a] N. Wahlström, T. B. Schön, and M. P. Deisenroth. From pixels to torques: Policy learning with deep dynamical models. arXiv preprint arXiv:1502.02251, 2015a.
 Watter et al. [2015] M. Watter, J. T. Springenberg, J. Boedecker, and M. A. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. arXiv preprint arXiv:1506.07365, 2015.
 Wahlström et al. [2015b] N. Wahlström, T. B. Schön, and M. P. Deisenroth. Learning deep dynamical models from image pixels. In IFAC Symposium on System Identification (SYSID), 2015b.
 Jimenez Rezende et al. [2014] D. Jimenez Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and variational inference in deep latent Gaussian models. In International Conference on Machine Learning (ICML), June 2014.
 Todorov and Li [2005] E. Todorov and W. Li. A generalized iterative LQG method for locally optimal feedback control of constrained nonlinear stochastic systems. In American Control Conference, pages 300–306. IEEE, 2005.
 Toussaint [2009] M. Toussaint. Robot trajectory optimization using approximate inference. In International Conference on Machine Learning (ICML), Montreal, QC, Canada, June 2009.
 Rumelhart et al. [1986] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, pages 318–362. MIT Press, 1986.
 Bengio [2009] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
 Nair and Hinton [2010] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), pages 807–814, 2010.
 Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
 Krizhevsky [2009] A. Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, Computer Science Department, University of Toronto, 2009.
 Kingma and Ba [2014] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Glorot and Bengio [2010] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Artificial Intelligence and Statistics (AISTATS), pages 249–256, 2010.
 Saxe et al. [2013] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
 Schaal [1997] S. Schaal. Learning from demonstration. In Advances in Neural Information Processing Systems (NIPS), pages 1040–1046. 1997.
 Deisenroth and Rasmussen [2011] M. P. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning (ICML), 2011.