Model-Based Reinforcement Learning for Physical Systems Without Velocity and Acceleration Measurements
Abstract
In this paper, we propose a derivative-free model learning framework for Reinforcement Learning (RL) algorithms based on Gaussian Process Regression (GPR). In many mechanical systems, only positions can be measured by the sensing instruments. Therefore, instead of representing the system state as suggested by the physics, with a collection of positions, velocities, and accelerations, we define the state as the set of past position measurements. However, the equations of motion derived from physical first principles cannot be directly applied in this framework, since they are functions of velocities and accelerations. For this reason, we introduce a novel derivative-free physically-inspired kernel, which can be easily combined with nonparametric derivative-free Gaussian Process models. Tests performed on two real platforms show that the considered state definition combined with the proposed model improves estimation performance and data-efficiency w.r.t. traditional models based on GPR. Finally, we validate the proposed framework by solving two RL control problems for two real robotic systems.
I Introduction
Reinforcement Learning (RL) has seen explosive growth in recent years. RL algorithms have been able to reach and exceed human-level performance in several benchmark problems, such as playing the games of chess, Go, and shogi [25]. Despite these remarkable results, the application of RL to real physical systems (e.g., robotic systems) is still a challenge, because of the large amount of experience required and the safety risks associated with random exploration.
To overcome these limitations, Model-Based RL (MBRL) techniques have been developed [2, 27, 11]. Providing an explicit or learned model of the physical system allows drastic decreases in the experience time required to converge to good solutions, while also reducing the risk of damage to the hardware during exploration and policy improvement.
Describing the evolution of physical systems is generally very challenging, and still an active area of research. Deriving models from first principles of physics might be very difficult, and could also introduce biases due to parameter uncertainties and unmodelled nonlinear effects. On the other hand, learning a model solely from data could be expensive, and generally suffers from insufficient generalization. Models based on Gaussian Process Regression (GPR) [19] have received considerable attention for model learning tasks in MBRL [2]. GPR makes it possible to merge prior physical information with data-driven knowledge, i.e., information inferred from analyzing the similarity between data, leading to so-called semiparametric models [22, 21, 15].
Physical laws suggest that the state of a mechanical system can be described by positions, velocities, and accelerations of its generalized coordinates. However, velocity and acceleration sensors are often not available, in particular when considering low-cost experimental setups. In such cases, velocities and accelerations are usually estimated by means of causal numerical differentiation of positions, introducing a difference between the real and estimated signals. These signal distortions can be seen as an additional unknown input noise, which might significantly compromise the prediction accuracy of the learning algorithm. Indeed, standard GPR models do not consider noisy inputs. Several Heteroscedastic GPR models have been proposed in the literature, see for example [28, 6, 13]. However, the solutions proposed might not be suitable for real-time application, and most of the time they are more useful for improving the estimation of uncertainty than for improving the accuracy of prediction.
In this work, we propose a learning framework for model-based RL algorithms that does not need measurements of velocities and accelerations. Instead of representing the system state as a collection of positions, velocities, and accelerations, we propose to define the state as a finite past history of the position measurements. We call this representation derivative-free, to express the idea that the derivatives of position are not included in it.
The use of the past history of the state has been considered in the GP-NARX literature [13, 12, 3], as well as in the Eigensystem Realization Algorithm (ERA) and Dynamic Mode Decomposition (DMD) [10, 23]. However, these techniques do not use a derivative-free approach when dealing with physical systems: for example, they consider the history of both position and velocity, which doubles the state dimension w.r.t. our approach (and might be a problem for MBRL), and they do not incorporate a prior physical model in the design of the covariance function. Derivative-free GPR models have also already been introduced in [20], where the authors proposed derivative-free nonparametric kernels.
The proposed approach has some connections with discrete dynamics models, see for instance [17, 14]. In these works, the authors derived a discrete-time model of the dynamics of a manipulator by discretizing the Lagrangian equations. However, different from our approach, these techniques assume a complete knowledge of the dynamics parameters, typically identified in continuous time. Finally, such models might not be sufficiently flexible to capture unmodeled behaviors like delays, backlash, and elasticity.
Contribution. The main contribution of the present work is the formulation of derivative-free GPR models capable of encoding physical prior knowledge of mechanical systems that naturally depend on velocity and acceleration. We propose physically inspired derivative-free (PIDF) kernels, which provide better generalization properties than the nonparametric derivative-free kernel, and enable the design of semiparametric derivative-free (SPDF) models.
The velocity and acceleration signals commonly approximated through numerical differentiation are statistics of the past raw position data that cannot, in general, be exact. The proposed framework does not make these computational assumptions, thus preserving richer information content in the inputs that are fed into the model. Moreover, by providing the GPR model with a sufficiently rich past history, we can capture possible higher-order unmodeled behaviors, like delays, backlash, and elasticity.
The proposed learning framework is tested on two real systems, a ball-and-beam platform and a Furuta pendulum. The experiments show that the proposed derivative-free learning framework significantly improves the estimation performance obtained by standard derivative-based models. The SPDF models are used to solve RL-based trajectory optimization tasks. In both systems, we applied the control trajectory obtained by an iLQG [26] algorithm in an open-loop fashion. The obtained performance shows that the proposed framework accurately learns the dynamics of the two systems, and that it is suitable for RL applications.
The paper is organized as follows. In Section II, we briefly introduce the standard learning framework adopted in model-based RL using GPR. Then, in Section III, we propose our derivative-free learning framework, composed of the definition of the state and a novel derivative-free prior for GPR based on the physical equations of motion. Finally, in the last two sections, we report the performed experiments.
II Model-Based Reinforcement Learning Using Gaussian Process Regression
In this section, we describe the standard model learning framework adopted in MBRL using GPR, and the trajectory optimization algorithm applied. An environment for RL is formally defined by a Markov Decision Process (MDP). Consider a discrete-time system $\tilde{x}_{k+1} = f(\tilde{x}_k, u_k)$ subject to the Markov property, where $\tilde{x}_k \in \mathbb{R}^{n_s}$ and $u_k \in \mathbb{R}^{n_u}$ are the state vector and the input vector at the time instant $k$.
When considering a mechanical system with generalized coordinates $q_k = [q_k^1, \dots, q_k^n] \in \mathbb{R}^n$, the dynamics equations obtained through Rigid Body Dynamics, see [24], suggest that, in order to satisfy the Markov property, the state vector $\tilde{x}_k$ should consist of positions, velocities, and accelerations of the generalized coordinates, i.e., $\tilde{x}_k = [q_k, \dot{q}_k, \ddot{q}_k] \in \mathbb{R}^{3n}$, or possibly of a subset of these variables, depending on the task.
Model-based RL algorithms derive the policy $\pi(\tilde{x}_k)$ starting from $\hat{f}(\tilde{x}_k, u_k)$, an estimate of the system evolution.
II-A Gaussian Process Regression
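In some studies, GPR [19] has been used to learn $\hat{f}(\tilde{x}_k, u_k)$, see for instance [2]. Typically, the variables composing $\tilde{x}_{k+1}$ are assumed to be conditionally independent given $\tilde{x}_k$ and $u_k$, and each state dimension is modeled by a separate GPR. The components of $\hat{f}(\cdot, \cdot)$, denoted by $\hat{f}^i(\cdot, \cdot)$, with $i = 1, \dots, n_s$, are inferred and updated based on $\{X, y^i\}$, a data set of input-output noisy observations. Let $N$ be the number of samples available, and define the set of GPR inputs as $X = [\bar{x}_1, \dots, \bar{x}_N]$, where $\bar{x}_k = [\tilde{x}_k, u_k] \in \mathbb{R}^m$ with $m = n_s + n_u$. As regards the outputs $y^i = [y^i_1, \dots, y^i_N]$, two definitions have been proposed in the literature. In particular, $y^i_k$ can be defined as $\tilde{x}^i_{k+1}$, the $i$-th component of the state at the next time instant, or as $y^i_k = \tilde{x}^i_{k+1} - \tilde{x}^i_k$, leading to $\hat{\tilde{x}}_{k+1} = \tilde{x}_k + \hat{f}(\tilde{x}_k, u_k)$. In both cases, GPR models the observations as
(1) $y^i = [f^i(\bar{x}_1), \dots, f^i(\bar{x}_N)]^T + [e_1, \dots, e_N]^T = f^i(X) + e$,
where $e$ is Gaussian i.i.d. noise with zero mean and covariance $\sigma_n^2$, and $f^i(X) \sim N(m_{f^i}(X), K_{f^i}(X, X))$. The matrix $K_{f^i}(X, X) \in \mathbb{R}^{N \times N}$ is called the kernel matrix, and is defined through the kernel function $k_{f^i}(\cdot, \cdot)$, i.e., the entry in position $(k, j)$ is equal to $k_{f^i}(\bar{x}_k, \bar{x}_j)$. In GPR, the crucial aspect is the selection of the prior functions for $f^i(\cdot)$, defined by the mean $m_{f^i}(\cdot)$, usually considered $0$, and the kernel $k_{f^i}(\cdot, \cdot)$. Then, see [19], the maximum a posteriori estimator is:
(2) $\hat{f}^i(\cdot) = K_{f^i}(\cdot, X) \left( K_{f^i}(X, X) + \sigma_n^2 I_N \right)^{-1} y^i$.
In the following, we will refer to $f(\cdot)$ and $k(\cdot, \cdot)$ as one of the $f^i(\cdot)$ components and the relative kernel function.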
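As an illustration of the estimator in (2), the following is a minimal NumPy sketch of a zero-mean GP posterior mean. The function names, the isotropic RBF kernel, the toy sine data, and the noise level are all assumptions made for this example and are not taken from the experiments in this paper.

```python
import numpy as np

def rbf_kernel(A, B, lam=1.0, lengthscale=1.0):
    """RBF kernel matrix between the rows of A (n x m) and B (p x m)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return lam * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gp_posterior_mean(X, y, X_new, kernel, noise_var):
    """Zero-mean GP posterior mean: K(X_new, X) (K(X, X) + noise_var * I)^-1 y."""
    K = kernel(X, X) + noise_var * np.eye(X.shape[0])
    L = np.linalg.cholesky(K)                      # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return kernel(X_new, X) @ alpha

# Toy data (hypothetical): noisy samples of a sine function.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(40, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(40)
X_new = np.linspace(-2.0, 2.0, 5)[:, None]
y_hat = gp_posterior_mean(X, y, X_new, rbf_kernel, noise_var=0.05**2)
```

The Cholesky factorization is used here instead of an explicit matrix inverse purely for numerical stability; it computes the same quantity as (2).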
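Physically inspired kernels. When the physical model of the system is available, the model information might be used to identify a feature space over which the evolution of the system is linear. More precisely, assume that the model can be written in the form $y_k = \phi(\bar{x}_k)^T w$, where $\phi(\bar{x}_k): \mathbb{R}^m \rightarrow \mathbb{R}^q$ is a known nonlinear function that maps the GPR input vector onto the physically inspired feature space, and $w$ is the vector of unknown parameters, modeled as a zero-mean Gaussian random variable, i.e., $w \sim N(0, \Sigma_{PI})$, with $\Sigma_{PI} \in \mathbb{R}^{q \times q}$. The expression of the physically inspired kernel (PI) is
(3) $k_{PI}(\bar{x}_k, \bar{x}_j) = \phi(\bar{x}_k)^T \Sigma_{PI} \phi(\bar{x}_j)$,
namely, a linear kernel in the features $\phi(\cdot)$. For later convenience, we define also the homogeneous polynomial kernel of degree $p$ in $\phi(\cdot)$, which is a more general case of (3):
$k^p_{poly}(\bar{x}_k, \bar{x}_j) = \left( \phi(\bar{x}_k)^T \Sigma_{PI} \phi(\bar{x}_j) \right)^p$.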
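Nonparametric kernel. When a physical model is not available, the kernel has to be chosen by the user according to their understanding of the process to be modeled [19]. A common option is the Radial Basis Function kernel (RBF):
(4) $k_{RBF}(\bar{x}_k, \bar{x}_j) = \lambda \exp\left( -\frac{1}{2} (\bar{x}_k - \bar{x}_j)^T \Sigma_{RBF}^{-1} (\bar{x}_k - \bar{x}_j) \right)$,
where $\lambda$ is a positive constant called the scaling factor, and $\Sigma_{RBF}$ is a positive definite matrix that defines the norm over which the distance between $\bar{x}_k$ and $\bar{x}_j$ is computed, i.e., $\|\bar{x}_k - \bar{x}_j\|^2 = (\bar{x}_k - \bar{x}_j)^T \Sigma_{RBF}^{-1} (\bar{x}_k - \bar{x}_j)$. Several options to parameterize $\Sigma_{RBF}$ have been proposed, e.g., a diagonal matrix or a full matrix defined by the Cholesky decomposition, namely, $\Sigma_{RBF} = L L^T$, see [19, Chp. 5], [5, Sec. 4.1].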
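Semiparametric kernel. This approach combines the physically inspired and the nonparametric kernels. Here we define the kernel function as the sum of the covariances:
(5) $k_{SP}(\bar{x}_k, \bar{x}_j) = \phi(\bar{x}_k)^T \Sigma_{PI} \phi(\bar{x}_j) + k_{NP}(\bar{x}_k, \bar{x}_j)$,
where $k_{NP}(\cdot, \cdot)$ can be, for example, the RBF kernel (4).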
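The three kernel families above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the feature map `toy_features`, the hyperparameter values, and the convention that the RBF distance uses the inverse of a matrix built as `L_chol @ L_chol.T` are assumptions made for the example.

```python
import numpy as np

def pi_kernel(Phi_a, Phi_b, Sigma_pi):
    """Physically inspired kernel: linear kernel in the features, phi(a)^T Sigma_PI phi(b)."""
    return Phi_a @ Sigma_pi @ Phi_b.T

def poly_kernel(Phi_a, Phi_b, Sigma_pi, p):
    """Homogeneous polynomial kernel of degree p in the features."""
    return pi_kernel(Phi_a, Phi_b, Sigma_pi) ** p

def rbf_kernel(A, B, lam, Sigma_rbf):
    """RBF kernel with squared distance (a - b)^T Sigma_rbf^-1 (a - b)."""
    diff = A[:, None, :] - B[None, :, :]
    sq = np.einsum("ijk,kl,ijl->ij", diff, np.linalg.inv(Sigma_rbf), diff)
    return lam * np.exp(-0.5 * sq)

def semiparametric_kernel(A, B, features, Sigma_pi, lam, Sigma_rbf):
    """Sum of the physically inspired kernel and a nonparametric (RBF) kernel."""
    return pi_kernel(features(A), features(B), Sigma_pi) + rbf_kernel(A, B, lam, Sigma_rbf)

def toy_features(X):
    """Hypothetical feature map, standing in for a physics-derived phi."""
    return np.column_stack([np.sin(X[:, 0]), X[:, 1]])

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 2))
L_chol = np.array([[1.0, 0.0], [0.3, 0.8]])        # Cholesky-style parameterization
K = semiparametric_kernel(X, X, toy_features, np.diag([2.0, 0.5]), 1.0, L_chol @ L_chol.T)
```

Because sums of valid kernels are valid kernels, the resulting matrix `K` is symmetric and positive semidefinite for any choice of feature map.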
II-B Trajectory Optimization using iLQG
The iLQG algorithm is a popular technique for trajectory optimization [26]. Given discrete-time dynamics such as (1) and a cost function, the algorithm computes locally linear models and quadratic cost functions for the system along a trajectory. These linear models are then used to compute optimal control inputs and local gain matrices by iteratively solving the associated LQG problem, see [26].
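The "locally linear models" step can be illustrated with a small sketch. The code below linearizes a one-step model by central finite differences; this is a technique swapped in for illustration, not necessarily how [26] or our implementation obtains the derivatives, and `toy_dynamics` is a hypothetical pendulum-like model.

```python
import numpy as np

def linearize(f, x, u, eps=1e-6):
    """Central finite-difference Jacobians A = df/dx and B = df/du at (x, u)."""
    n, m = x.size, u.size
    A = np.zeros((n, n))
    B = np.zeros((n, m))
    for i in range(n):
        d = np.zeros(n)
        d[i] = eps
        A[:, i] = (f(x + d, u) - f(x - d, u)) / (2.0 * eps)
    for j in range(m):
        d = np.zeros(m)
        d[j] = eps
        B[:, j] = (f(x, u + d) - f(x, u - d)) / (2.0 * eps)
    return A, B

def toy_dynamics(x, u):
    """Hypothetical discrete-time dynamics used only for this example."""
    dt = 0.05
    return np.array([x[0] + dt * x[1], x[1] + dt * (-np.sin(x[0]) + u[0])])

A, B = linearize(toy_dynamics, np.array([0.0, 0.0]), np.array([0.0]))
```

In an iLQG-style loop, such Jacobians would be recomputed at every point of the current nominal trajectory before each backward pass.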
III Derivative-Free Framework for Reinforcement Learning Algorithms
A novel learning framework to model the evolution of a physical system is proposed, which addresses several limitations of the standard modelling approach described in Sec. II.
Numerical differentiation. The Rigid Body Dynamics of any physical system are functions of joint positions, velocities, and accelerations. However, a common issue is that often joint velocities and accelerations cannot be measured.
Computing them by means of causal numerical differentiation starting from the (possibly noisy) measurements of the joint positions might introduce considerable delays and distortions of the estimated signals. This fact could severely hamper the final solution. This is a very well known and often discussed problem, see, e.g., [24, 8, 16].
Conditional Independence. The assumption of conditional independence among the $f^i(\bar{x}_k)$, with $i = 1, \dots, n_s$, given the GPR input $\bar{x}_k$ in (1) might be a very imprecise approximation of the real system’s behavior, in particular when the outputs considered are position, velocity, or acceleration of the same variable, which are correlated by nature. This fact has been shown to be an issue in estimation performance in [21], where the authors proposed to learn the acceleration function and integrate it forward in time in order to estimate position and velocity.
Moreover, under this assumption, a separate GP for each output needs to be estimated for modeling variables that are intrinsically correlated, leading to redundant modeling design and testing work, and a waste of computational resources and time. This last aspect might be particularly relevant when considering systems with a considerable number of DoF.
Delays and nonlinearities. Finally, physical systems are often affected by intrinsic delays and nonlinear effects that have an impact on the system over several time instants, contradicting the first-order Markov assumption; an instance of such behavior is reported in Section V-B.
III-A Derivative-Free State definition
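To overcome the aforementioned limitations, we define the system state as the finite history of the position measurements:
(6) $x_k = [q_k, q_{k-1}, \dots, q_{k-k_p}] \in \mathbb{R}^{n(k_p+1)}$,
where $k_p$ is a positive integer denoting the number of past time instants considered.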
The simple yet exact idea behind this definition is that, when velocity and acceleration measurements are not available, if the history length $k_p$ is chosen sufficiently large, then the history of the positions contains all the system information available at time $k$, leaving to the model-learning algorithm the possibility of estimating the state transition function. Indeed, velocities and accelerations computed through causal numerical differentiation are the outputs of digital filters with finite impulse response (or with finite past instants knowledge for nonlinear filters), which represent a statistic of the past raw position data. Notice that these statistics cannot be exact in general, leading to a loss of information that is instead kept in the proposed derivative-free framework.
The state transition function becomes deterministic and known (i.e., the identity function) for all the components of the state except the newest position: the remaining entries of the next state are copies of entries of the current state. Consequently, the problem of learning the evolution of the system is restricted to learning only the functions that predict the next positions, $q_{k+1} = f(x_k, u_k)$, reducing the number of models to learn and avoiding erroneous conditional independence assumptions. Finally, the MDP has state information rich enough to be robust to intrinsic delays and to obey the first-order Markov property.
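A minimal sketch of the derivative-free state construction and of its deterministic shift is given below. The function names and the symbol `k_p` for the number of past instants are assumptions made for this illustration.

```python
import numpy as np

def derivative_free_states(q, k_p):
    """Stack [q_k, q_{k-1}, ..., q_{k-k_p}] for every k >= k_p.

    q: array of shape (T, n) (or (T,) for a single coordinate) of measured positions.
    Returns an array of shape (T - k_p, n * (k_p + 1)).
    """
    q = np.asarray(q, dtype=float)
    if q.ndim == 1:
        q = q[:, None]
    T = q.shape[0]
    blocks = [q[k_p - lag : T - lag] for lag in range(k_p + 1)]
    return np.hstack(blocks)

def shift_state(x, q_next, n):
    """Known part of the transition: insert q_{k+1} in front and drop the oldest position."""
    return np.concatenate([np.atleast_1d(q_next), x[:-n]])

q = np.arange(6.0)                 # toy position sequence 0, 1, ..., 5
X = derivative_free_states(q, k_p=2)
x_next = shift_state(X[0], q_next=3.0, n=1)
```

Only `q_next` has to be predicted by a learned model; every other entry of the next state is obtained by the shift.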
III-B State Transition Learning with PIDF Kernel
Derivative-free GPRs have already been introduced in [20], where the authors derived a data-driven derivative-free GPR. As pointed out in the introduction, the generalization performance of data-driven models might not be sufficient to guarantee robust learning performance, and exploiting any available prior information coming from the physical model is crucial. To address this problem, we propose a novel Physically Inspired Derivative-Free (PIDF) kernel.
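The PIDF exploits the property that products and sums of kernels are still kernels, see [19]. Define $\bar{x}_k = [x_k, u_k] = [q_k, \dots, q_{k-k_p}, u_k]$ and assume that a physical model of the type $y_k = \phi(q_k, \dot{q}_k, \ddot{q}_k, u_k)^T w$ is known. Then, we propose a set of guidelines to derive a PIDF kernel starting from $\phi(q_k, \dot{q}_k, \ddot{q}_k, u_k)$:
PIDF Kernel Guidelines
1) Each and every position, velocity, or acceleration term in $\phi(\cdot)$ is replaced by a distinct polynomial kernel $k^p_{poly}(\cdot, \cdot)$ of degree $p$, where $p$ is the degree of the original term; e.g., $(\ddot{q}^i)^2 \rightarrow k^2_{poly}(\cdot, \cdot)$.
2) The input of each of the kernels in 1) is a function of $q^i_{k-} = [q^i_k, \dots, q^i_{k-k_p}]$, the history of the position $q^i$ corresponding to the independent variable of the substituted term; e.g., $(\ddot{q}^i)^2 \rightarrow k^2_{poly}(q^i_{k-}, \cdot)$.
3) If a state variable appears in $\phi(\cdot)$ transformed by a function $g(\cdot)$, the input to $k^p_{poly}(\cdot, \cdot)$ becomes the input defined at point 2) transformed by the same function $g(\cdot)$; e.g., $\sin(q^i) \rightarrow k^1_{poly}(\sin(q^i_{k-}), \sin(q^i_{j-}))$.
Applying these guidelines will generate a kernel function $k_{PIDF}(\cdot, \cdot)$, which incorporates the information given by the physics without knowing the velocity and acceleration.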
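The extension to semiparametric derivative-free (SPDF) kernels becomes trivial. Combining, as described in Section II-A, the proposed $k_{PIDF}(\cdot, \cdot)$ with a derivative-free NP kernel $k_{NPDF}(\cdot, \cdot)$ (or as proposed in [20]), we obtain:
(7) $k_{SPDF}(\bar{x}_k, \bar{x}_j) = k_{PIDF}(\bar{x}_k, \bar{x}_j) + k_{NPDF}(\bar{x}_k, \bar{x}_j)$.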
These guidelines formalize the solution to the nontrivial issue of modeling real systems using physical models without measuring velocity and acceleration. Although the guidelines might not be the only possible solution, they represent an algorithm that leaves no ambiguity or arbitrary choice to the user when converting Rigid Body Dynamics (RBD) into derivative-free models.
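To make the guidelines concrete, here is a sketch for a hypothetical one-coordinate model whose physical features are a squared velocity and the sine of the position. The toy model, the function names, and the hyperparameters are assumptions for illustration; they are not one of the systems studied in this paper.

```python
import numpy as np

def poly_kernel(A, B, sigma_diag, p):
    """Homogeneous polynomial kernel (a^T diag(sigma_diag) b)^p between rows of A and B."""
    return ((A * sigma_diag) @ B.T) ** p

def toy_pidf_kernel(Qa, Qb, s_vel, s_sin):
    """PIDF kernel for hypothetical physical features [qdot^2, sin(q)].

    Qa, Qb: position histories, one row [q_k, ..., q_{k-k_p}] per sample.
    Guidelines 1-2: the degree-2 velocity term becomes a degree-2 polynomial kernel
    on the position history. Guideline 3: the sine term becomes a degree-1 kernel
    on the sine of the history. Summed features become summed kernels.
    """
    k_vel = poly_kernel(Qa, Qb, s_vel, 2)
    k_sin = poly_kernel(np.sin(Qa), np.sin(Qb), s_sin, 1)
    return k_vel + k_sin

def toy_spdf_kernel(Qa, Qb, s_vel, s_sin, lam, ell):
    """Semiparametric derivative-free kernel: PIDF kernel plus an RBF on the history."""
    sq = np.sum((Qa[:, None, :] - Qb[None, :, :]) ** 2, axis=2)
    return toy_pidf_kernel(Qa, Qb, s_vel, s_sin) + lam * np.exp(-0.5 * sq / ell**2)

rng = np.random.default_rng(2)
Q = rng.standard_normal((5, 3))                  # 5 samples, history length k_p + 1 = 3
K = toy_spdf_kernel(Q, Q, np.ones(3), np.ones(3), lam=1.0, ell=1.0)
```

Note that the weights in `s_vel` play the role of the filter coefficients that a hand-designed differentiator would fix in advance; here they are hyperparameters to be learned from data.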
In the next sections, we apply the proposed learning framework to the benchmark systems Ball-and-Beam (BB) and Furuta Pendulum (FP), describing in detail the kernel derivations. For both setups we show the task of controlling the system, highlighting the advantages of the proposed derivative-free framework; due to space limitations, however, we present different properties of the proposed method in each of them. In the BB case, we highlight the estimation performance of the derivative-free approach compared with computing velocities with several filters, and the difficulty of choosing the most suitable velocity estimate. In the more complex FP system, we analyze robustness to delays and performance at multi-step-ahead prediction, and make extensive comparisons among physically inspired, nonparametric, semiparametric derivative-free, and standard GPR.
IV Ball-and-Beam Platform
Fig. 1 shows our experimental setup for the BB system [9]. An aluminum bar is attached to a tip-tilt table (platform) constrained to have 1 degree of freedom (DoF). The platform is actuated by an inexpensive, commercial off-the-shelf HiTec type HS-805BB RC model servo motor that provides open-loop positioning; the platform angle is measured by an accurate absolute encoder. There is no tachometer attached to the axis, so angular velocity is not directly measurable. A ball rolls freely in the groove. We use an RGB camera attached to a fixed frame to measure the ball’s position. The ball is tracked in real time using a simple, yet fast, blob tracking algorithm. All the communication with the camera and servo motors driving the system is done using ROS [18].
Let and be the beam angle and the ball position, respectively, considered in a reference frame with origin at the beam center and oriented s.t. the beam end is positive. The forward dynamics of the ball are expressed by the following equation (see [7] for the details)
(8)  
where , , and are the ball mass, inertia, radius, and friction coefficient, respectively. Starting from eq. (8), the forward function for is derived by integrating twice forward in time, and assuming a constant between two time instants:
(9) 
where is the sampling time. In order to describe the BB system in the framework proposed in Section III, we define the derivativefree state as , with
Applying the guidelines defined in Section III-B to Eq. (9), the PIDF kernel obtained is
(10) 
IV-A Prediction performance
The purpose of this section is to compare the prediction performance of the GP models (2), using as prior the PIDF kernel (10), , and using the standard PI kernel applying (8) to Eq. (3), . The question that the standard approach imposes is how to compute the velocities from the measurements in order to estimate , and there is not a unique answer to this question. We experimented with some common filters using different gains in order to find good velocity approximations:

Standard numerical differentiation followed by a low-pass filter to reject noise, which uses the position history . We considered 3 different cutoff frequencies , , Hz with corresponding estimators denoted , , , respectively;

Kalman filter, with different process covariances and equal to , , with corresponding estimators , , ;

The acausal Savitzky-Golay filter with window length .
Acausal filters have been introduced just to provide an upper bound on prediction performance; they cannot be applied in real-time applications. As regards the number of past time instants considered in , we set . Both the training and test datasets consist of 3 minutes of operation on the BB system, with control actions applied at 30 Hz, while measurements from the encoder and camera were recorded. Both the datasets account for samples. The control actions were generated as a sum of 10 sine waves with randomly sampled frequency between Hz, shift phases in , and amplitude ranging .
Fig. 2: Boxplots of the absolute estimation errors on the test set, with the RMSE of each estimator.
In Fig. 2, we visualize the distribution of the absolute estimation errors in the test set through boxplots, and report the numerical values of the RMSE. Acausal filtering guarantees the best performance, whereas, among the estimators with causal inputs, the proposed approach performs best. Indeed, the RMSE obtained with the derivative-free estimator is approximately smaller than the best RMSE obtained with the other causal estimators, i.e., and . As visible from the boxplots, the proposed solution exhibits a smaller variability. Results obtained with numerical differentiation and Kalman filtering show that the technique used to compute velocities can affect prediction performance significantly. In Fig. 2, we present also a detailed plot of the evolution obtained with different differentiation techniques. As expected, there is a trade-off between noise rejection and the delay introduced that must be considered. For instance, increasing the cutoff frequency decreases the delay, but at the same time impairs the rejection of noise. An inspection of the , and prediction errors shows that too high or too low cutoff frequencies lead to the worst prediction performance. With our proposed approach, tuning is not required, since the filtering coefficients are learned automatically during the GPR training.
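The trade-off between delay and noise rejection can be reproduced with a small sketch of a causal velocity estimator. The estimator below (backward difference followed by a first-order low-pass filter), the synthetic sine signal, the noise level, and the cutoff values are assumptions chosen for illustration; only the 30 Hz sampling rate is taken from the experimental setup.

```python
import numpy as np

def causal_velocity(pos, dt, cutoff_hz):
    """Backward difference followed by a first-order (exponential) low-pass filter."""
    raw = np.zeros_like(pos)
    raw[1:] = np.diff(pos) / dt
    alpha = 1.0 - np.exp(-2.0 * np.pi * cutoff_hz * dt)   # smoothing factor in (0, 1)
    vel = np.zeros_like(pos)
    for k in range(1, len(pos)):
        vel[k] = vel[k - 1] + alpha * (raw[k] - vel[k - 1])
    return vel

dt = 1.0 / 30.0                                   # 30 Hz sampling
t = np.arange(0.0, 10.0, dt)
true_pos = np.sin(2.0 * np.pi * 0.5 * t)          # hypothetical 0.5 Hz motion
true_vel = 2.0 * np.pi * 0.5 * np.cos(2.0 * np.pi * 0.5 * t)
rng = np.random.default_rng(0)
noisy_pos = true_pos + 1e-3 * rng.standard_normal(t.size)

rmse_by_cutoff = {}
for cutoff in (1.0, 3.0, 10.0):
    err = causal_velocity(noisy_pos, dt, cutoff)[30:] - true_vel[30:]
    rmse_by_cutoff[cutoff] = float(np.sqrt(np.mean(err**2)))
```

A low cutoff smooths the noise but lags the true velocity, while a high cutoff tracks the signal but passes more differentiation noise; which effect dominates depends on the noise level, which is why the cutoff has to be tuned per system.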
IV-B Ball-and-beam control
The control task is the stabilization of the ball with zero velocity in a target position along the beam. The control trajectory is computed using the iLQG algorithm introduced in Section II-B. In order to model also the behaviors not captured by the physical equations of motion, we train a GP, called , with semiparametric kernel as in Eq. (7):
(11) 
where the NP kernel is an RBF kernel with the matrix parameterized through Cholesky decomposition. The training data are the same as described in Section IV-A. The control trajectory obtained by iLQG using model is applied to the physical system, and performance is shown in Fig. 3.
In the top plot, we can observe how the optimized trajectory for the model remains close to the ball trajectory of the real system for all the 100 steps (3.3 s), which is the chosen length for the iLQG trajectory. This result illustrates the high accuracy of the model in estimating the future evolution of the real system. Note that the control trajectory is implemented in open loop to highlight the model precision, obtaining an average deviation between the target and the final ball position of 9 mm, with a standard deviation of 5 mm, in 10 runs. By adding a small proportional feedback control, the error becomes almost null. In the bottom plot, the control trajectories obtained by iLQG using either the semiparametric (SPDF) model or the PIDF model are shown. Two major observations can be made: the trajectory obtained with the SPDF model approximates a bang-bang trajectory, which in a linear system would be the optimal trajectory; and the trajectory obtained with the PIDF model is similar, but since the equations of motion cannot describe all the nonlinear effects present in a real system, the control action has a final bias that makes the ball drift away from the target position.
V Furuta Pendulum: Derivative-Free Modeling and Control
The second physical system considered is the Furuta pendulum [4], a popular benchmark system in control theory. A schematic of the FP with its parameters and variables is shown in Fig. 4. We refer to “Arm1” and “Arm2” in Fig. 4 as the base arm and the pendulum arm, respectively, and we denote and the angles of the base arm and the pendulum.
In [1], the authors have presented a model of the FP. Based on that model, we obtained the expression of as a linear function w.r.t a vector of parameters ,
(12) 
where and .
The FP considered in this work has several characteristics that are different from those typically studied in the research literature. Indeed, in our FP (see Fig. 5), the base arm is held by a gripper which is rotated by the wrist joint of a robotic arm (a MELFA RV-4FL). For this reason, the rotation applied to the wrist joint is denoted by , and it is different from the actual base arm angle (see Fig. 4). The control cycle of the robot is fixed at ms, and communication to the robot and the pendulum encoder is handled by ROS.
These circumstances have several consequences. First, the robot can only be controlled in a position-control mode, and we need to design a trajectory of set points considering that the manufacturer limits the maximum angle displacement of any of the robot’s joints in a control period. This constraint, together with the high performance of the robot controller, results in a quasi-deterministic evolution of , that we identified to be . Therefore, the forward dynamics learning problem is restricted to modeling the pendulum arm dynamics. Additionally, the 3D-printed gripper causes a significant interplay with the FP base link, due to the presence of elasticity and backlash. These facts lead to vibrations of the base arm along with the rotational motion, and a significant delay in actuation of the pendulum arm, which results in .
V-A Delay and nonlinear effects
In order to demonstrate the presence of delays in the system dynamics, we report a simple experiment in which a triangular wave in excites the system. The results are shown in Fig. 6 (for lack of space, the term depending on is not reported, as the effects of viscous friction are not significant). The evolution of is characterized by a main low-frequency component with two evident peaks at the beginning of the trajectory, and a higher-frequency dynamical component which increasingly corrupts the main component as time passes. Several insights can be obtained from these results. First, the peaks of the low-frequency component can be caused only by the contribution, given that the and contributions do not exhibit these behaviours so prominently. Second, the difference between the peaks in the contribution and (highlighted in the figure by the vertical dashed lines) represents the delay between the control signal and its effect on the pendulum arm. Third, the high-frequency component in might represent the noise generated by the vibration of the gripper, the elasticity of the base arm, and all the nonlinear effects given by the flexibility of the gripper.
V-B FP derivative-free GPR models
We used the derivativefree framework to learn a model for the evolution of the pendulum arm. The FP state vector is defined as , with
From (12), following the same procedure applied in the BB application to derive Eq. (9), we obtain . Applying the guidelines in Section III-B we obtain the corresponding PIDF kernel
(13) 
In order to also model the complex behavior shown in Section V-A, we define a semiparametric kernel for the FP as:
(14) 
where the NP kernel is defined as the product of two RBFs with their matrices independently parameterized through Cholesky decomposition . Adopting a full covariance matrix, the RBF can learn convenient transformations of the inputs, increasing the generalization ability of the predictor.
As experimentally verified in Section V-A, the evolution of is characterized by a significant delay w.r.t. the dynamics of . As a consequence, positions, velocities, and accelerations at time instant $k$ are not sufficient to describe the FP dynamics. However, by defining the state as the collection of past measured positions and choosing the history length properly, the GPR has a sufficiently informative input vector, and can select inputs at the proper time instants, thus inferring the system delay from data. Note that when considering also velocities and accelerations, a similar approach would require a state of dimension , instead of .
V-C Prediction performance
In this section, we test the accuracy of different predictors:

: NP estimator defined in (2) with an RBF kernel with diagonal covariance and input given by and its derivatives, i.e., all the positions, velocities, and accelerations from time to , ;

: NPDF estimator defined in (2) with an RBF kernel, ;
The model is considered to provide the performance of a standard NP estimator based on and derivatives.
The estimators have been trained by minimizing the negative marginal log-likelihood (NMLL) over a training data set composed of samples, corresponding approximately to seconds of experience. The input signal is a sum of sinusoids with random angular velocity ranging between . To deal with the considerable number of samples available, we rely on stochastic gradient descent to optimize the NMLL. Performance is measured on a test data set composed of samples, obtained with an input signal of the same type as the one considered in , but with a different distribution of the sinusoids, with frequency ranging between , to show generalization ability. Estimators are compared both in terms of accuracy and data efficiency, and results are in Fig. 7. In the bottom graph, we report the Root Mean Squared Error (RMSE) in ; all the estimators considered are able to predict the evolution of the pendulum arm with an error smaller than one degree. However, derivative-free approaches outperform the derivative-based estimator. Note that achieves the best performance, and that the NPDF estimator outperforms the derivative-based NP estimator, even though both models are based on an RBF kernel. The latter fact confirms that numerical computation of the derivatives might reduce estimation accuracy. In the top graph, we report the evolution of the RMSE as a function of the seconds of training samples available. The derivative-free approaches are more accurate and data-efficient than . Notice that is more accurate only for the short period of the first seconds, and its RMSE decreases more slowly. The use of the PI kernel is particularly helpful as regards data efficiency, since after seconds of data is more accurate than and , and the ’s decreases faster than the one of .
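For reference, the objective minimized during training can be written compactly. The sketch below only evaluates the NMLL of a zero-mean GP with an isotropic RBF kernel for fixed hyperparameters; the optimizer (stochastic gradient descent in our experiments) is not reproduced, and the toy data and hyperparameter values are assumptions.

```python
import numpy as np

def rbf(A, B, lam, ell):
    """Isotropic RBF kernel matrix between the rows of A and B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return lam * np.exp(-0.5 * sq / ell**2)

def nmll(X, y, lam, ell, noise_var):
    """Negative marginal log-likelihood of a zero-mean GP:
    0.5 * y^T Ky^-1 y + 0.5 * log|Ky| + 0.5 * N * log(2 pi), with Ky = K + noise_var * I.
    """
    N = X.shape[0]
    Ky = rbf(X, X, lam, ell) + noise_var * np.eye(N)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # 0.5 * log|Ky| equals the sum of the logs of the Cholesky diagonal.
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * N * np.log(2.0 * np.pi)

# Toy comparison (hypothetical data): a sensible length scale versus a far too large one.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)
nmll_good = nmll(X, y, lam=1.0, ell=1.0, noise_var=0.01)
nmll_bad = nmll(X, y, lam=1.0, ell=100.0, noise_var=0.01)
```

A gradient-based optimizer would adjust the kernel hyperparameters and the noise variance to reduce this value, possibly on mini-batches when the data set is large.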
V-D Rollout performance
In this section, we characterize the rollout accuracy of the derived models, namely the estimation performance at multi-step-ahead predictions.
For each model, we performed rollouts. During the th rollout, we randomly pick an initial instant , then the input location in is selected as initial input, and a prediction is performed for a window of steps. For each simulation, is computed by subtracting the predicted trajectory from the one realized in . To characterize how uncertainty evolves over time, we define the error statistic , that is the RMSE of the prediction at the th step ahead. The confidence intervals are computed assuming i.i.d. and normally distributed errors. Under these assumptions, each has a distribution. The performance in terms of the of , and is reported in Fig. 8. In the initial phase, the RMSE of the NP model is lower than that of the PI model, whereas for longer horizons it becomes greater. This suggests that the NP model behaves well for short-interval prediction, whereas the PI model is more suitable for long-term predictions. The SP approach combines the advantages of these two models, and the evolution of its RMSE confirms this, showing that the SP model outperforms both the NP and the PI models.
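The rollout procedure and the per-step error statistic can be sketched as follows. The function names, the toy constant-velocity model, and the example error matrix are assumptions for illustration; a learned GP predictor would take the place of `toy_model`.

```python
import numpy as np

def rollout(one_step_model, x0, inputs, n):
    """Iterate a derivative-free one-step model over a window of inputs.

    x0: initial state [q_k, ..., q_{k-k_p}] (flattened), n: dimension of q.
    Each prediction is fed back by shifting the position history.
    """
    x = np.array(x0, dtype=float)
    preds = []
    for u in inputs:
        q_next = np.atleast_1d(one_step_model(x, u))
        preds.append(q_next)
        x = np.concatenate([q_next, x[:-n]])      # push the prediction, drop the oldest
    return np.array(preds)

def rmse_per_step(errors):
    """errors: (number of rollouts, window length) -> RMSE at each step ahead."""
    return np.sqrt(np.mean(np.asarray(errors, dtype=float) ** 2, axis=0))

def toy_model(x, u):
    """Hypothetical stand-in for a learned model: q_{k+1} = 2 q_k - q_{k-1} + u."""
    return 2.0 * x[0] - x[1] + u

preds = rollout(toy_model, x0=[1.0, 0.0], inputs=np.zeros(3), n=1)
step_rmse = rmse_per_step([[1.0, 2.0], [-1.0, 2.0]])
```

Since each prediction is fed back into the state, one-step errors accumulate along the window, which is why the statistic is reported per step rather than averaged over the whole rollout.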
V-E Furuta Pendulum control
The semiparametric derivative-free model is used to design a controller to swing up the FP using the iLQG algorithm described in Section II-B. The model is accurate to the point that the trajectories obtained by the iLQG algorithm were implemented in an open-loop fashion on the real system, and the results are shown in Fig. 9. The FP swings up with near-zero velocity at the goal position; however, as expected, an open-loop control sequence cannot stabilize it. Fig. 9 reports the agreement between the trajectories obtained under the iLQG control sequence, using both the learned model and the real system. The comparison shows the long-horizon predictive accuracy of the learned model. Note that the models lose accuracy around the unstable equilibrium point, because of insufficient data, which are harder to collect in this area during training.
VI Conclusions
In this paper, we presented a derivative-free learning framework for model-based RL, and we defined a novel physically-inspired derivative-free kernel. Experiments with two real robotic systems show that the proposed learning framework outperforms its derivative-based GPR counterpart in prediction accuracy, and that semi-parametric derivative-free methods are accurate enough to solve model-based RL control problems in real-world applications. The proposed framework also exhibits robustness to delays and a capacity to deal with partially observable systems, both of which can be further investigated.
Footnotes
 The exact state of a physical system is usually unknown, but it is generally accepted to be given by position, velocity, and acceleration, according to the first principles of physics. With a slight abuse of notation, we refer to our derivative-free representation of the state as the state variable.
References
[1] (2011) On the dynamics of the Furuta pendulum. Journal of Control Science and Engineering. Cited by: §V.
[2] PILCO: a model-based and data-efficient approach to policy search. In ICML 2011. Cited by: §I, §I, §II-A.
[3] (2017) Optimizing long-term predictions for model-based policy search. In Conference on Robot Learning, pp. 227–238. Cited by: §I.
[4] (1992) A new inverted pendulum apparatus for education. In Advances in Control Education 1991, pp. 133–138. Cited by: §V.
[5] (1995) Regularization theory and neural networks architectures. Neural Computation. Cited by: §II-A.
[6] (1998) Regression with input-dependent noise: a Gaussian process treatment. In NIPS 10, pp. 493–499. Cited by: §I.
[7] (1992) Nonlinear control via approximate input-output linearization: the ball and beam example. IEEE Transactions on Automatic Control 37, pp. 392–398. ISSN 0018-9286. Cited by: §IV.
[8] (2008) Model identification. In Springer Handbook of Robotics, pp. 321–344. Cited by: §III.
[9] (2017) Learning to regulate rolling ball motion. In 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–6. Cited by: §IV.
[10] (1985) An eigensystem realization algorithm for modal parameter identification and model reduction. Journal of Guidance, Control, and Dynamics 8. Cited by: §I.
[11] (2014) Learning neural network policies with guided policy search under unknown dynamics. In NIPS 27. Cited by: §I.
[12] (2016) Latent autoregressive Gaussian processes models for robust system identification. IFAC-PapersOnLine. Cited by: §I.
[13] (2011) Gaussian process training with input noise. In Advances in Neural Information Processing Systems. Cited by: §I, §I.
[14] (1985) Discrete dynamic robot models. IEEE Transactions on Systems, Man, and Cybernetics. Cited by: §I.
[15] (2010) Using model knowledge for learning inverse dynamics. In ICRA. Cited by: §I.
[16] (2011) Model learning for robot control: a survey. Cognitive Processing 12 (4), pp. 319–340. Cited by: §III.
[17] (1989) Discrete-time modeling and control of robotic manipulators. Journal of Intelligent and Robotic Systems 2 (4), pp. 411–423. Cited by: §I.
[18] (2009) ROS: an open-source robot operating system. In ICRA workshop on open source software. Cited by: §IV.
[19] (2006) Gaussian processes for machine learning. The MIT Press. Cited by: §I, §II-A, §II-A, §III-B.
[20] (2019) Derivative-free online learning of inverse dynamics models. IEEE Transactions on Control Systems Technology. Cited by: §I, §III-B, §III-B.
[21] (2019) Semiparametrical Gaussian processes learning of forward dynamical models for navigating in a circular maze. In International Conference on Robotics and Automation (ICRA). Cited by: §I, §III.
[22] (2016) Online semi-parametric learning for inverse dynamics modeling. In IEEE CDC. Cited by: §I.
[23] (2008) Dynamic mode decomposition of numerical and experimental data. Journal of Fluid Mechanics. Cited by: §I.
[24] (2010) Robotics: modeling, planning and control. Springer Science & Business Media. Cited by: §II, §III.
[25] (2018) A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science. ISSN 0036-8075. Cited by: §I.
[26] (2012) Synthesis and stabilization of complex behaviors through online trajectory optimization. In IROS, pp. 4906–4913. Cited by: §I, §II-B.
[27] (2005) A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In ACC, pp. 300–306. Cited by: §I.
[28] (2012) Gaussian process regression with heteroscedastic or non-Gaussian residuals. CoRR abs/1212. Cited by: §I.