Cascaded Gaussian Processes for
Dataefficient Robot Dynamics Learning
Abstract
Motivated by the recursive NewtonEuler formulation, we propose a novel cascaded Gaussian process learning framework for the inverse dynamics of robot manipulators. This approach leads to a significant dimensionality reduction which in turn results in better learning and data efficiency. We explore two formulations for the cascading: the inward and outward, both along the manipulator chain topology. The learned modeling is tested in conjunction with the classical inverse dynamics model (semiparametric) and on its own (nonparametric) in the context of feedforward control of the arm. Experimental results are obtained with Jaco 2 sixDOF and SARCOS sevenDOF manipulators for randomly defined sinusoidal motions of the joints in order to evaluate the performance of cascading against the standard GP learning. In addition, experiments are conducted using Jaco 2 on a task emulating a pouring maneuver. Results indicate a consistent improvement in learning speed with the inward cascaded GP model and an overall improvement in data efficiency and generalization.
I Introduction
Accurate inverse dynamics enables accurate control and thus several of robotics researchers have considered this problem. When the robot’s rigidbody parameters are known precisely, such as through System Identification [wu2010overview], the manipulator dynamics equations can be formed, allowing us to express the inverse dynamics as a linear algebraic problem [featherstone2016dynamics]. Recursive NewtonEuler (NE) algorithms have been shown to accurately and efficiently solve the inverse dynamics problem in this case. However, robots today are only imperfectly calibrated, and tend to change dynamics frequently as they interact with the world (e.g., carrying loads, using new end effectors, interacting with humans).
Learning approaches, in principle, allow the quick adaptation of inverse dynamics to observed conditions, but present many obstacles such as the need to capture sufficient data and computational expense. Gaussian Processes (GP) have frequently been used in this context, as their nonparametric nature provides strong performance across a variety of robots with little human knowledge required [nguyen2009model, nguyen2011model, vinogradska2017stability]. It is feasible to learn during live operation of robots and adapt to changing situations [deisenroth2015gaussian]. The primary downside of GPs for use in robotics is their computational scaling as a function of training data size. Although many methods exist to reduce this complexity, such as sparse pseudoinputs [snelson2006sparse], sparse spectrum GPs [quia2010sparse] and Bayesian Neural Networks [galMCDropout], each of these approaches still scales with the underlying task complexity. A more intrinsically efficient representation of the inverse dynamics learning problem allows for a simpler learning and requires less data.
This paper adapts GPs to learn the inverse dynamics of a robotic arm from data in a fashion that respects our knowledge of the underlying topology by cascading the learning along the kinematic chain with the same factorization used in a NE solver, as illustrated in Figure 1. Specifically, the torque output by the GP at each joint is used as an input for the subsequent joint, capturing the dynamic coupling from all previous joints. This cascaded approach utilizes many fewer input dimensions, reducing computation, but more importantly, it reuses the results and requires learning simpler functions at each joint. Following the forward and backwards recursion of NE, we attempt both inwards and outwards cascaded versions of our GP learner.
Rather than discarding the classical inverse dynamics solution, we follow many previous approaches and combine parametric models with nonparametric models, resulting in a semiparametric modeling framework [nguyen2010using, um2014independent, romeres2016online, reinhart2017hybrid, dallasemi, wu2012semi]. Specifically, we utilize the classical inverse dynamics solution as our learner’s mean function. This approach tends to achieve better generalization, dataefficiency, and faster learning due to its utilization of prior knowledge. Finally, we employ the recentlyproposed derivativefree features [romeres2019derivative] as alternative to the standard derivativebased variables (velocity and acceleration) as inputs to the inverse dynamics model. We evaluate our proposed approach with extensive experiments carried out on a Kinova Jaco 2 sixDOF arm and using the public SARCOS dataset [vijayakumar2002statistical].
We continue with Section II, which provides a background on modelbased torque controllers and inverse dynamics learning. Section III describes the cascaded Gaussian process formulation while Section IV evaluates this method by presenting the details behind the experiments and the obtained results. Finally, section V summarizes our work.
Ii Background
Iia Modelbased Torque Control
Modelfree control schemes (e.g. independentjoint control and PID control) are only capable of controlling a robotic arm through setpoint regulation. When used for trajectory tracking applications, these methods require intermediate set points, resulting in considerable delays and jerky motions. On the other hand, modelbased controllers can make use of the inverse dynamics model to cancel out nonlinearities and to achieve zeroerror tracking performance [chung2016motion].
The dynamics model of a robotic manipulator with DOF is described by Lagrange’s equation of motion [featherstone2016dynamics]:
(1) 
where is the column vector of joint positions, is the inertia matrix, represents Coriolis and centrifugal forces, denotes gravity forces, is the column vector of joint control inputs to be computed, and is the additional torque components due to friction and disturbances.
Feedforward torque control aims to cancel the nonlinear terms and decouple the dynamics of each link based on the theory of feedback linearization. The control signal, Equation (2), consists of a feedforward term computed using the inverse dynamics model and a feedback term typically on joint positions and velocities [khosla1988experimental]:
(2) 
where and are the desired joint positions and velocities.
A major drawback of this parametric algorithm is the considerable degradation of controller performance in the case of an imprecise inverse dynamics model. The limitations of parametric approaches to capture friction, nonlinearities, input disturbances and robot interactions strongly motivate the need for a datadriven learning scheme.
IiB Gaussian Processes for Model Learning
Following the notation used in [rasmussen2004gaussian], a nonparametric model represented by Gaussian process regression is written as , where , and are respectively the input, mean and covariance.
The kernel is responsible for the general shape of the sampled functions from the GP distribution and it has a large impact on the performance of the model learning. The Matern kernel is often recommended in robotics applications for two reasons. First, unlike the Squared Exponential kernel, it does not suffer from concentration of measure for high dimensional inputs and second, it is not infinitely differentiable [rasmussen2004gaussian] which makes it a more realistic approach in robotics applications. The hyperparameters of the GP, lengthscale (), variance () and the noise variance are obtained by maximizing the log marginal likelihood using gradient based optimization algorithms.
While a nonparametric inverse dynamics model is a direct map from joint states () to joint torques, a semiparametric model can be thought of as a combination of a rigid body dynamics (RBD) and an error term:
(3) 
Here, the parametric component, left side of Equation (1), provides a baseline for the predicted torque while the nonparametric term is responsible for unmodeled dynamics of the system and the interactions of the robot.
Assuming the robot has degrees of freedom and predicts the torque for joint , the standard nonparametric model developed by previous authors uses as the input and as the output. On the other hand, the nonparametric part of a semiparametric model keeps the same input but employs as the output.
Proposed in [nguyen2010using], a straightforward method to integrate the RBD function into the GP framework is by setting it as the mean function, thus making the GP biased towards the prior knowledge available from the RBD model. It is often recommended to use a zero mean function for the GP, since addition of a nonzero constant to the GP posterior would negatively affect the refined uncertainty predictions [rasmussen2004gaussian]. Nevertheless, we believe that the semiparametric modeling framework has other advantages in the context of robot learning that justifies the use of nonzero mean functions. Using this approach, the torque of joint for query point can be predicted by:
(4)  
Equation (4) indicates the fact that the GP is mainly responsible for modeling the error between the RBD model and actual torque measurements, because if this error is close to zero, the second term would become negligible and the predicted torque is based on the RBD function. Furthermore, according to Equation (4), if the query point is far from the training data , the resulting covariance matrix will become almost zero and the torque prediction will be a function of the RBD model only. This is in fact the reason for better generalization of semiparametric models compared to nonparametric ones [nguyen2010using].
Iii Cascaded Gaussian Processes for Inverse Dynamics Learning
Iiia Cascaded Formulation
We propose cascaded Gaussian processes as a novel dataefficient approach for inverse dynamics learning. The basic idea is to reduce the dimensions of the regression problem recursively by using the output of a GP trained on one joint as a substitute for all previous joint states, thus reducing the dimensions of the feature space. The formulation of our cascaded GP is inspired by the recursive NE algorithm for computing the inverse dynamics of robotic manipulators. The recursive NE method is comprised of two main steps; first, an outward recursion from the base of the robot to compute link velocities and accelerations based on robot kinematics and joint states. Total forces on each link are also calculated in this step using the NE equations of motion. Second, during an inward recursion, values for joint torques are computed by balancing the forces acting on each link [featherstone2016dynamics].
Figure 2 shows the free body diagram of link ; is the angular velocity of link , is the axis of joint , and and are respectively the force and moment applied from link to link . As is calculated during the outward recursion and and are available from the inward recursion step, one can solve for and using the Euler equations of motion:
(5) 
Standard GP models (Figure 3c), discussed in Section IIB, are trained with the states of all joints, thus resulting in a high dimensional feature space. In contrast, the proposed cascaded approach uses joint torques as a proxy for joint states. The cascading GP formulation can be implemented in either nonparametric or semiparametric modeling paradigm.
We investigate two embodiments of the cascaded GP formulation, referred to as inward and outward cascaded GP. Our inward cascaded GP (Figure 3a) is based on Equation (5) in which during an inward recursion, the torque of joint predicted by its corresponding GP can be utilized instead of the states of the outward joints to , allowing for a reduced feature space .
The outward cascaded GP (Figure 3b), on the other hand, starts from the base of the robot and reduces the dimensions of the feature space during an outward recursion procedure. Thus, the reduced feature space for joint becomes .
The cascaded GP formulation has two main advantages over the standard inverse dynamics learning schemes. First, our method is able to generalize more effectively. According to Equation (5), the force and torque can summarize the impact of all the outward joints; however, measuring these values requires installing force/torque sensors on each joint of a robot. Therefore, we hypothesize that a single joint torque value, which is readily available on most of the robots, can summarize the impact of all the outward joints to some extent; giving our model access to the estimate of this torque avoids the need to rediscover this pattern from training data. Compared to the regression in joint states used by standard methods, which requires observing a complete set of states covering most of the workspace, the cascaded approach better matches the topology of the arm; hence, the learning is more dataefficient and the learned model can generalize better.
Second, as a result of the reduced dimensionality of the GP inputs in the cascaded approach, several computational tasks are more efficient. During prediction, the computation of the GP scales with the number of state dimensions so our cascaded approach requires less computation. Further, optimization of the hyperparamaters becomes a significantly simpler task in lower dimensions and when the input states are more uniformly correlated to the outputs. Specifically, computing the GP’s marginal likelihood and its derivative scales as , with the number of training points, so the data efficiency gained by our method translates into faster learning.
IiiB DerivativeFree Features
Utilizing derivativefree features in the context of inverse dynamics has been proposed in [romeres2019derivative] as a means to address the noisy nature of numerical differentiation, an inevitable step to obtain joint velocities and accelerations. Assuming we have access to the previous joint positions, there are numerous ways to incorporate the state history as a feature. In a study reported in [romeres2019derivative], the authors concluded that derivativefree features with reduced rank provide wellrounded performance. In this method, a smaller number of features is chosen to compress the information within the position history. Physics suggests that 3 elements (position, velocity and acceleration) suffice to define a state; in addition, according to the empirical results in [romeres2019derivative], is the optimal choice for inverse dynamics learning. Therefore, we have set in this paper for the implementation of derivativefree features:
(6) 
where is the reduced rank derivativefree feature of joint , is the history of joint positions, and is a fully parameterised matrix. Guidelines for designing can be found in [romeres2019derivative], but in the present implementation, we set and .
\clineB2203  Model  Joint 1  Joint 2  Joint 3  Joint 4  Joint 5  Joint 6  
2s  5s  10s  2s  5s  10s  2s  5s  10s  2s  5s  10s  2s  5s  10s  2s  5s  10s  
\hlineB3 Derivativebased  NPInwardCascaded  .219  .182  .151  .297  .178  .087  .562  .360  .161  .280  .247  .245  .288  .253  .204  .594  .470  .333 
NPOutwardCascaded  .229  .206  .172  .421  .293  .125  .565  .367  .213  .271  .210  .173  .330  .310  .303  .331  .286  .255  
NP  .229  .207  .169  .488  .351  .174  .769  .553  .212  .298  .236  .201  .294  .239  .212  .639  .401  .329  
\clineB222  SPInwardCascaded  .198  .171  .147  .085  .068  .058  .129  .143  .123  .258  .210  .172  .191  .191  .176  .590  .449  .335 
SPOutwardCascaded  .200  .192  .168  .088  .076  .061  .130  .146  .125  .268  .192  .158  .182  .184  .182  .370  .268  .258  
SP  .197  .186  .162  .092  .081  .067  .130  .136  .120  .296  .217  .200  .188  .194  .188  .638  .394  .331  
\hlineB3 Derivativefree  NPInwardCascadedDF  .198  .189  .148  .257  .120  .097  .605  .283  .153  .281  .237  .226  .302  .238  .178  .483  .313  .287 
NPOutwardCascadedDF  .211  .200  .172  .351  .145  .126  .615  .282  .196  .245  .200  .150  .320  .287  .259  .309  .225  .257  
NPDF  .214  .204  .175  .435  .144  .141  .705  .276  .189  .276  .215  .153  .304  .228  .189  .429  .304  .294  
\clineB222  SPInwardCascadedDF  .186  .182  .147  .076  .064  .067  .125  .124  .122  .259  .177  .134  .194  .196  .177  .471  .314  .294 
SPOutwardCascadedDF  .195  .182  .161  .076  .068  .065  .122  .129  .126  .261  .186  .121  .186  .177  .167  .329  .232  .258  
SPDF  .196  .186  .165  .079  .072  .068  .121  .124  .125  .285  .216  .133  .192  .192  .171  .431  .300  .292  
\hlineB3 

The values report an average of error over a 1second interval (e.g. 12 second, 45 second, 910 second) of training data. The yellow highlighted regions present the lowest error among the models of the same category (nonparametric and semiparametric), and the light yellow highlighted regions indicate an almost similar performance (up to 5 percent difference) among the models.
Iv Evaluation
In this section, we evaluate the performance of the proposed cascaded Gaussian process approach in terms of dataefficiency, generalization and learning efficiency. Two evaluation scenarios are considered here: first, a quantitative evaluation with a standard trainingtesting procedure on subsets of a dataset. Second, a qualitative evaluation using taskbased training episodes in which the robot attempts to learn the dynamics of various tasks while the data is transferred from one to another.
Iva Setup
The experiments are carried out on Kinova Jaco 2 robot, a lightweight sixDOF robotic manipulator with a maximum payload of 1.6 kg and a reach of 90 cm. The robot is equipped with joint encoders and joint torque sensors, and it can be monitored and commanded through a ROS (Robot Operating System) [quigley2009ros] package. The proposed algorithms as well as the feedforward torque controller have been implemented in Python as a ROS package on top of Kinova API. Due to the connectivity limitations of the robot, the torque controller runs at 40 Hz. Prior to applying the controller on the actual robot, it has been thoroughly tested in Gazebo, an opensource simulation environment [koenig2004design].
For implementation of our cascaded Gaussian process framework, we used GPy^{1}^{1}1https://github.com/SheffieldML/GPy, an opensource Gaussian process library in Python. In addition, PyBullet^{2}^{2}2https://github.com/bulletphysics/bullet3, a Python implementation of Bullet Physics Engine, has been utilized for parametric inverse dynamics modeling of Kinova Jaco 2 arm. Our model learning framework learns offline but predicts the torques online.
IvB General Evaluation
The aim of this set of experiments is to provide a quantitative evaluation of dataefficiency and accuracy of our proposed cascaded method. Two datasets have been used:

Our own dataset collected on Kinova Jaco 2 arm and referred to as the Jaco dataset.

The publicly available SARCOS dataset ^{3}^{3}3SARCOS dataset is available at http://www.gaussianprocess.org/gpml/data/ [vijayakumar2002statistical].
Every model variant has been evaluated on the Jaco dataset. However, due to the discrepancies found in some of the results, SARCOS dataset was also used to evaluate the nonparametric models.
IvB1 Training
In the Jaco dataset, the training set consists of 600 seconds of velocity controlled robot motion subsampled at 10 Hz and divided into 10 subsets. The motion for each subset has been generated by commanding each joint rate as a summation of multiple sinusoidal functions with random amplitudes, frequencies and phase angles. Therefore, the training set ensures a wide variety of joint states and robot configurations. Moreover, the test dataset consists of three oneminute motions, distinct from the training set, each collected with randomly sampled amplitudes, frequencies and phase angles.
The SARCOS dataset consists of 50,000 data points collected at 50 Hz from various movement patterns on the sevenDOF anthropomorphic SARCOS robot. The test set includes 10 percent of the data. For consistency with the Jaco dataset, we have subsampled the SARCOS dataset at 10 Hz and divided the training set into 9 subsets.
IvB2 Results
Since our main goal is to improve the dataefficiency of dynamics learning, we train our model on a series of everincreasing datasets, evaluating each resulting learned model on the test set. This allows evaluation of how much data each method requires before achieving low error.
Table I shows the evaluated normalized rootmeansquare error (nRMSE = RMSE/range) of torque predictions on the Jaco dataset of all models for three time stamps (averaged on ranges 12s, 45s, and 910s) of training time, whereas Figures 4 and 5 present the nRMSE of torque prediction as a function of data acquisition duration for the first 50 seconds of the data acquisition for nonparametric and semiparametric models, evaluated on the Jaco dataset. Finally, Figure 6 reports the results obtained from nonparametric models on the SARCOS dataset.
The results show that in the case of nonparametric models (Figures 4 and 6), cascaded models in general outperform standard models in terms of data efficiency, even though these models have access to a far smaller set of robot states. Overall, the inward cascaded model, which is consistent with the direction of the recursive NE method solution for the inverse dynamics torques, provides more accurate predictions compared to the outward cascaded.
Nonetheless, outward cascaded model outperforms its inward counterpart on joint 4 in the Jaco dataset and the last joint in both of the datasets. The discrepancy of joint 4 can be explained by the intrinsic properties of the Jaco 2 arm. As Figure 7 suggests, due to the strong coupling between joints 4 and 5, the torque of each joint is more correlated to the other rather that its own states; therefore, not having access to states of joint 5, the inward cascaded model yields less accurate predictions for joint 4. The fact that this discrepancy is not present in the results obtained from the SARCOS dataset (Figure 6), confirms the data correlation analysis.
The discrepancy of the last joint can be explained by the fact that the torque on the wrist of the robot is only a function of its own states, as it is the last joint of the openchain robot arm topology. Hence, the outward cascaded model, which has access only to the states of this joint and torque of the previous one, learns the torque in a much smaller feature space, resulting in more accurate and dataefficient predictions.
The same general trends can be observed in the case of semiparametric models (Figure 5). However, the improvement provided by cascaded modeling is less significant as the rigid body dynamics model used as the mean function already reduces learning complexity to some extent. The comments made earlier regarding joints 4 and 6 on the Jaco dataset and joint 7 on the SARCOS dataset apply to semiparametric models as well.
Finally, as already addressed in [romeres2019derivative], derivativefree features (Table I) help with the transient response of the GP models (i.e., the prediction accuracy of the model within the first seconds of training data), as they do not rely on the noisy and delayed nature of numerical differentiation. However, this improvement is limited to the first few seconds of data acquisition and as the number of data points increases, derivativefree features lack the generalization of derivativebased features.
IvC Learning Efficiency
The hyperparameters of a Gaussian process are most often obtained by maximizing the log marginal likelihood. Gradient based methods are usually used to solve this nonconvex optimization problem. In highdimensional systems the solution may converge to a local optimum that is far from the ideal solution. In order to prevent this, multiple restarts of the optimization procedure are required. Furthermore, since computing both the log marginal likelihood and its gradient has the computational complexity of , solving for the optimal hyperparameters requires a significant amount of time for large datasets.
First, the dataefficiency of the cascaded approach helps to decrease the size of the dataset required to learn a reliable model and thus decreases the training time significantly. Second, as a result of the reduced dimensionality of the cascaded GP, the total number of hyperparameters is decreased, leading to a simpler optimization problem. Consequently, log marginal likelihood is maximized with significantly fewer iterations. Figure 8 compares the average number of iterations to train cascaded and standard GP models with 500 data points. As expected, cascaded models converge quicker on the joints with smaller feature state: joints closer to the base for the inward cascaded and joints closer to the endeffector for the outward cascaded model.
IvD Task Based Evaluation
As a qualitative evaluation of our proposed cascaded modeling framework, we have applied it to a series of “pouring tasks” in which the torque controlled robot is required to move a bottle and tip it to allow marbles to roll out of the bottle into a jar. Each task attempts to explore a different part of the workspace while the data and the learned model is transferred between the tasks. As the parametric model is not aware of the unknown load carried by the robot, the baseline performance is limited. However, cascaded GP modeling proposed here is capable of learning the dynamics of each task within the first few attempts and it can generalize well to the tasks it has already observed. The results can be seen in the supplementary video ^{4}^{4}4https://youtu.be/ilwWuyRZIY.
V Conclusion
In this paper, we introduced the cascaded Gaussian process as a novel approach for inverse dynamics learning. The proposed method actively reduces the number of dimensions of the feature space by substituting joint states with joint torques. We explored two formulations of the cascaded Gaussian process, namely inward and outward models. In general, the inward models outperform the outward models as suggested by their consistency with the recursive NewtonEuler method. Overall, cascaded models provide a more dataefficient and faster training compared to the standard models; this improvement is more evident in the case of nonparametric models. Moreover, experimental results for a pouring task confirm the dataefficiency of the cascaded modeling.
As a final wrapup of the study on the variations of inverse dynamic model learning schemes, we would suggest semiparametric cascaded GPs with derivativebased features. The semiparametric variant and cascaded approach both help with the generalization and dataefficiency of the model. Although derivativefree features improve the consistency of the results by omitting the inherent noise present in the differentiation procedure, this improvement is limited to the first few seconds of data acquisition and comes with the cost of reducing the generalization of the learned model.
Acknowledgments
This work was supported by the National Sciences and Engineering Research Council (NSERC) Canadian Robotics Network (NCRN).