Derivative-free online learning of inverse dynamics models

# Derivative-free online learning of inverse dynamics models

## Abstract

This paper discusses online algorithms for inverse dynamics modelling in robotics. Several model classes including rigid body dynamics (RBD) models, data-driven models and semiparametric models (which are a combination of the previous two classes) are placed in a common framework. While model classes used in the literature typically exploit joint velocities and accelerations, which need to be approximated resorting to numerical differentiation schemes, in this paper a new “derivative-free” framework is proposed that does not require this preprocessing step. An extensive experimental study with real data from the right arm of the iCub robot is presented, comparing different model classes and estimation procedures, showing that the proposed “derivative-free” methods outperform existing methodologies.

{keywords}

Inverse dynamics, Robotics, Rigid Body Dynamics, Learning, Online methods, Semiparametric models, Derivative-free methods, Gaussian processes, Marginal Likelihood optimisation

## I Introduction

Robotic platforms, such as industrial and service robots, are becoming more and more popular and are prospected to pervade our everyday life in the near future. A key requirement for such systems is that they should be able to safely interact with the environment, possibly including humans, while performing robustly and efficiently certain assigned tasks; in order to do so, they have to adapt their behavior to variable conditions.

Among other technological challenges, this requires new mathematical tools for data-driven modeling while accounting for inherent uncertainty and changing conditions (learning). While accomplishing user assigned tasks, the robotic system acquires new data which needs to be processed online to adapt these models.

In this paper we shall be concerned with online learning of so called inverse dynamics models: these models have joint trajectories as inputs and joint torques as outputs. They are widely used for model-based control in robotic applications to improve the tracking performances leading to high accuracy and low control gains [craig2005introduction, nguyen2009model], [KARIMI, fliess2009model]; see also the survey [Nguyen-Tuong2011] for a comprehensive overview. Inverse dynamics models are also very important for detection and estimation of contact forces, see for instance [IvaldiICRA2015].

Typically, inverse dynamics models are obtained from first principles, using the physics of rigid body dynamics (RBD), [siciliano2010robotics]. This results in a parametric model that depends on some physical parameters (masses, lengths, etc.). Unfortunately, accurate knowledge of these physical parameters may often not be available; for this reason an RBD model should be estimated from data. The main advantage of the parametric approach is that in principle it provides a global relationship between inputs and outputs. However, the parametric model strongly relies on several assumptions and may be rather inaccurate when these assumptions are not perfectly satisfied (frictions, nonlinearities, non perfectly rigid links, etc.), [hollerbach2008model, siciliano2010robotics].

Alternatively, a nonparametric black-box model can be estimated from experimental data using machine learning techniques such as Gaussian Process regression [Rasmussen]. The nonparametric framework has the advantage of not requiring unrealistic assumptions, but it comes at the price of being local in nature: the model can only be expected to reflect the systems dynamics in a “neighbourhood” of the trajectories already seen during the learning phase. In the context of system identification for linear dynamical systems a comparison between the capability of parametric and nonparametric approaches has been presented in [prando2015classical].

To exploit the advantages of both estimation techniques, semiparametric models have been recently introduced as a combination of RBD and nonparametric models, as for instance in [ICRA2010NguyenTuong_62320, wu2012semi].

In order to endow robotic systems with the ability to adapt to changing conditions, algorithms should be able to process data online, while taking advantage of knowledge already acquired, in the spirit of so-called transfer learning [pan2010survey, bocsi2013alignment]. Feasibility of online learning strongly hinges on the possibility to keep the computational complexity and memory storage requirements bounded as the number of data grows with time. To this propose several approaches have been proposed in the literature [Lawrence:2002, NIPS2000_1880], [Snelson06sparsegaussian], [Tresp:2000, Williams01usingthe, AE-MY-JH:11, Csato:2002] including algorithms explicitly developed to process data online, such as [nguyen2011incremental], which selects a sparse subset of training data points, and the local Gaussian process regression approach proposed in [nguyen2009model].

In [gijsberts2011incremental] the complexity is kept constant approximating the kernel function using the so-called “random features”, [rahimi2007random], an approximation which will be exploited in this paper following also [nguyen2011incremental, SEMIPARAMTERIC_2016, romeres2016onlineIcub]. It is worth noting that other approximation techniques are available in the literature, for instance: subset of data approximation, subset of regressors approximation, conditional approximations, Nystrom approximation and relevance vector machine approach, see [quinonero2005unifying]. Approximation of the kernel is equivalent to choosing a finite (and fixed) basis expansion for the unknown model; this expansion can be exploited to tackle estimation using Kalman filtering techniques [JH-SS:10], [NicFer:1998:IFA_1779] and [Huber201485], [JP-MS:14]. Online nonparametric learning using Gaussian processes calls also for online tuning of the kernel function, see for instance [romeres2016online, prando2016online] where marginal likelihood optimisation is used in the context of online system identification.

The main contributions of this paper are as follows:

• Various models (parametric, nonparametric and semiparametric) proposed in the literature [ICRA2010NguyenTuong_62320, wu2012semi, SEMIPARAMTERIC_2016] are placed in a common framework through the so-called semiparametric models. This common framework is exploited to compare theoretically the semiparametric model wtih RBD mean (SP) and the semiparametric with RBD kernel (SPK) and to show (see Section III for details) that the latter is to be preferred, complementing the findings from the experimental evaluation. Online learning is performed, following [SEMIPARAMTERIC_2016], exploiting the random features approach.

• A new “derivative-free” modelling framework, which avoids the use of numerical derivatives, is proposed. In robot modelling, this framework contrasts the errors introduced into the system by computing numerically the velocities and accelerations starting from the measured (noisy) positions.

• A thorough experimental study, based on real data from the iCub robot, is undertaken; classical methods as well as the newly proposed derivative-free methods are compared and analysed, both in terms of adaptation capabilities (how fast the algorithms can learn the dynamical models after a change in the experiental conditions) as well as in terms of steady state error. The experimental results show that the derivative-free methods proposed in this paper outperform classical schemes based on numerical derivatives. In doing so Cross Validation and Marginal Likelihood optimisation methods are compared for estimating the kernel hyperparametrs, suggesting that the latter should be preferred to the former.

The paper is organized as follows. In Section II the problem of inverse dynamics modeling is formalized. In Section III parametric, nonparametric and semiparametric models are introduced, while Section IV deals with kernel approximation which allows for the use of online algorithms. Section V introduces learning methods to avoid the use of numerical derivatives. In Section VI the different online algorithms are tested in the inverse dynamics estimation of the robotic platform iCub. Finally, in Section VII conclusions and future works are drawn.

## Ii Problem Statement

Assume we are given a robotic system with degrees of freedom (DoF), and denote with , , the free coordinates describing the robot configuration. Starting from the laws of physics, which account for gravity, apparent forces, frictions and so on, it would in principle be possible to write a (direct) dynamical model which, having as inputs the torques acting on the robot joints, outputs the trajectory of the free coordinates (joint positions) . This is the so called “direct dynamics”.

However, for the purpose of control design, it is of interest to know which torques should be applied in order to obtain a certain trajectory . This can be achieved for example using inverse dynamics models, which can be exploited, e.g., to determine the feedforward joint torques which should be applied to follow a desired trajectory ; see Figure 1.

Clearly, closing a high performance loop quests for an accurate inverse dynamics model . While control is an important application of inverse dynamics models, it is certainly not their unique use. For instance, inverse dynamics models find important applications in modelling, detecting and estimating contact forces, see e.g. [IvaldiICRA2015].

In practice an inverse dynamics model should be learned from a set of measured data , with ; this should be possibly performed online, updating the model as new data become available. An overview of model structures used for inverse dynamics modeling will be provided in Section III and the approximations involved in online learning discussed in Section IV.

###### Remark II.1 ( Causality of Inverse Dynamics Models)

Strictly speaking, such models are not proper: the present torques depend on the joint positions, velocities and accelerations which corresponds to having knowledge about future temporal instants. In this paper we shall either assume velocities and accelerations are given, or causal approximations will be considered, where the present torques only depend on present and past joint trajectories. If such models have to be used to build feedforward control laws, it is well possible that the future (desired) trajectories are known and thus one can build, starting from data, non-causal models. On the other hand, when online identification is to be performed (which means that the underlying system is time-varying) there is the delay in updating the time-varying model. However, time-variability should be slow and, as such, it should not be a significant drawback.

###### Remark II.2 ( Stability of Inverse Dynamics Models)

A model of a mechanical system as a map from joint torques (the inputs) to joint angular velocities (the outputs), is passive. In the linear case this implies that the transfer from input to outputs is positive real and thus it is minimum phase. Nonlinear extensions of this concept are indeed possible, i.e. a passive nonlinear system has stable zero dynamics, see e.g. [ByrnesIW1991]. Therefore, when considering inverse dynamics for (passive) mechanical systems, stability is not an issue. Stability of the inverse model may become critical when non minimum-phase systems are considered (e.g. non-collocated systems); it would thus be possible to consider acausal and stable functions of the reference trajectory to build a (stable) inverse dynamics model. This would be compatible with online implementation provided the reference trajectory is known in a finite and sufficiently long future horizon. We shall not consider the latter case in this paper.

The typical assumption that joint positions , joint velocities , and joint accelerations are measured considerably simplifies the identification procedure; the latter are stacked in the so-called “input location” vector

 x(t):=[q⊤(t)˙q⊤(t)¨q⊤(t)]⊤∈Rm (1)

with . Under this assumption, the inverse dynamics of a rigid body can be written in a linear form once a suitable overparametrization is introduced, see Section III-A. Unfortunately, measuring velocities and accelerations is often unrealistic and one has to resort to numerical differentiation schemes, which may be prone to large errors in the presence of measurement noise. In Section V, we shall analyze an alternative model which only exploits the past history of the joint positions. In this way, is a model with the past history of the joint positions as input and with the applied torques as output. As we shall see this choice compares favourably with standard approaches in the literature.

## Iii Model Classes

In this Section we briefly review the typical model classes used to learn the inverse dynamics of a robot, that is the linear parametric model, whose structure is given by the physics, the nonparametric model, whose structure is learnt from the data, and the semiparametric models which are a combination of the first two models. Throughout this Section, Gaussian processes are indexed in , take values in and are zero mean. The symbol denotes a zero mean white Gaussian noise with covariance matrix . Most of the models introduced in this Section are equipped with a so-called hyperparameters vector. We address the problem of estimating this vector in Section IV.

### Iii-a Linear parametric model

The rigid body dynamics (RBD) of a robot is described by the equation

 y(t)=M(q(t))¨q(t)+C(q(t),˙q(t))˙q(t)+G(q(t)) (2)

where is the inertia matrix of the robot, the Coriolis and centripetal forces and the gravity forces, [siciliano2010robotics, taylor2005classical]. The terms on the right hand side of (2) can be rewritten as which is linear in the robot (base) inertial parameters vector . The map is the known RBD regressor which is a nonlinear function of the input locations vector described by . Therefore, the RBD model is equivalent to

 y(t)=ψ⊤(x(t))π+e(t) (3)

where includes the nonlinearities of the robot that are not modeled by the rigid body dynamics (e.g. actuator dynamics, frictions, etc.).

Since the RBD model is physics-based, it should, in principle, describe the robot dynamics for all the desired trajectories, leading to good generalization and global approximation properties. A known issue of this model (see e.g., [hollerbach2008model]) is that the problem of determining from measured data is usually ill posed and the matrix is rank deficient. Possible solutions in system identification are either the design of efficient experiments to collect data sufficiently rich to excite the highest number of modes of the system or dedicated experiments which are good to estimate parameters separately.

Another drawback of such model class is that it is prone to undermodeling (as it may not capture non-linear friction effects, effects of non-rigidities etc.) that may ultimately result in severely biased estimated models.

This model will be denoted as “P” (Parametric).

### Iii-B Nonparametric Model

Following the “Gaussian Process” framework [Rasmussen], the robot inverse dynamics can be modeled postulating that:

 y(t)=f(x(t))+e(t) (4)

where is a Gaussian process with covariance function (i.e. kernel function)1 :

 K(x(t),x(s))=ρ2KG(x(t),x(s))In. (5)

Several choices are possible for the kernel (see e.g. [Rasmussen, GP-MHQ-AC:11]), which can be exploited to encode specific model structures. However, in this paper we shall only consider to be the Gaussian kernel2

 KG(x(t),x(s))=e−∥x(t)−x(s)∥22τ (6)

where the hyperparameter is the kernel width. The latter is a typical choice to correlate the input locations for learning the inverse dynamics, [gijsberts2011incremental, ICRA2010NguyenTuong_62320, wu2012semi]. plays the role of scaling factor. The vector is referred to as hyperparameters vector. This model will be denoted as “NP” (NonParametric).

This class of models is known to have high flexibility and prediction performance (see e.g. [Rasmussen]) since the dynamics are extrapolated directly from the experimental data, without making any unrealistic approximation on the physical system (e.g. assuming linear frictions models, ignoring the dynamics of the hydraulic actuators, etc.). Nevertheless, nonparametric models deteriorate their performance when predicting unseen data which are far ( in the Euclidean metric) from those visited in the training dataset.

### Iii-C Semiparametric model with RBD mean

The semiparametric model is an attempt to take advantage of the global property of the parametric model as well as of the flexibility of the nonparametric model; the first possibility is to embed the former as a mean term into the latter. Thus, the inverse dynamics is modeled as

 y(t)=ψ⊤(x(t))π+f(x(t))+e(t) (7)

where is the vector of inertial parameters, is the RBD regressor and a Gaussian process with kernel function as in (5).

At this point two alternatives are possible. The first and most principled one is to treat as an unknown hyperparameter. Model (7) with this hypothesis will be denoted as “SP” (SemiParametric). In this case .

A suboptimal alternative but often applied in the literature is to assume that is known, here denoted by . This could be possibly estimated using some preliminary experiment as in [ICRA2010NguyenTuong_62320] or estimated via Least Squares as in [SEMIPARAMTERIC_2016] or it can be given from some expert knowledge. Model (7) with this hypothesis will be denoted as “SP2”. In this case .

### Iii-D Semiparametric model with RBD kernel

An alternative possibility for combining the parametric and nonparametric models is to incorporate the RBD structure in the kernel, [ICRA2010NguyenTuong_62320]. The debate as to whether one should account for prior knowledge as a mean term or as a structured kernel is often encountered in Gaussian Process regression. We shall come back to the relation between these two alternatives in Proposition III.1 and the discussion which follows that proposition.

Therefore, the inverse dynamics is modeled as

 y(t)=f(x(t))+e(t) (8)

where is a Gaussian process with kernel function

 K(x(t),x(s)) =γ2ψ(x(t))⊤ψ(x(s))+ρ2KG(x(t),x(s))In. (9)

This model will be denoted as “SPK” (SemiParametric Kernel). The hyperparameters vector is .

It is worth noting that in the case we obtain the parametric model (3) where the inertia parameters vector is now modeled as a Gaussian random vector with zero mean and covariance matrix .

###### Remark III.1

Note that equation (9) is derived assuming is a zero mean Gaussian vector with variance . Of course if prior knowledge was available on the expected scale of the components of , this could be included in the prior variance introducing a diagonal scaling matrix , so that .

We have also tested an alternative version in which the diagonal matrix has been estimated via Marginal Likelihood together with all other hyperparameters. However the overall performance is worse than when assuming and estimating only , which is probably due to the fact that this extra freedom results in a less favourable bias-variance tradeoff.

The semiparametric model with RBD kernel (SPK) is connected with the RBD mean (SP) model introduced in Section III-C. To make this connection sharp we shall refer to both the minimum variance estimators (see Section IV) obtained with these models, as well as to the log-likelihood functions which can be used to estimate the hyperparameters (see Section IV-C). In particular, let us stack the available data , in the vector and stack correspondingly the regressors and in the matrix and vector , respectively. Moreover, we define , and as the negative marginal log-likelihoods of data as a function of hyperparameters for models SPK and SP, respectively, and

 ^LSP(y):=minπLSP(y)=LSP(y)|π=^πWLS (10)

as the profile marginal log-likelihood, where denotes a suitable weighted least squares estimate of (see Appendix VIII for the precise definition of ). The following proposition establishes the link between the semiparametric with RBD mean (SP) and semiparametric with RBD kernel (SPK) estimators as the prior on becomes uninformative (i.e. as ), showing that also in this asymptotic regime the two models induce a different marginal likelihood (see eq. (11)). The take-home message of this proposition is that the SPK model better handles uncertainties and should thus be preferred; a more in depth discussion will be provided in Section III-E. The reader is also referred to [RUSSU2012189] where a similar discussion has been provided in a completely different context.

###### Proposition III.1

Consider the semiparametric models SP in (7) and SPK in (8), their respective negative marginal log-likelihoods and and the profile marginal likelihood defined in (10). Assume that are fixed. Then, the following two statements hold:

1. the minimum variance estimator of model SP and model SPK coincide when ;

2. in the limiting case of the two log-likelihoods and differ in a nontrivial term which leads to different location in their minima. That is:

 limγ2→∞LSPK(y)−^LSP(y)−plogγ2=log(det(Ψ⊤R−1Ψ)) (11)

where .

{proof}

See Appendix -A.

### Iii-E Discussion of Proposition iii.1

Proposition III.1 shows that, when estimating hyperparameters using the marginal likelihoods (as discussed in IV-C), the two models (SP and SPK) are not equivalent even under the assumption that a non informative prior on is used, i.e. . In fact, the two marginal likelihood functions differ by a nontrivial term which may change the location of the minima. In particular the latter term accounts for the uncertainty in estimating the term .

When the SP model is used, the hyperparameters are estimated minimising

 ^LSP(y)=log(det(2πR))+(y−Ψ^πWLS)⊤R−1(y−Ψ^πWLS).

Doing so, are chosen so as to fit with the “sample” covariance . The drawback of this choice is that the component of the variance along may be underestimated (i.e. too small along ). On the contrary, when optimising the term

 log(det(Ψ⊤R−1Ψ))=−log(det(Var{^πWLS})

favour values of which make large or, equivalently, small (i.e. large) along the direction of .

Clearly this goes in the opposite direction and avoids the risk alluded at above (i.e. too small along ). Thus, to summarise, the term serves as a correction which accounts for the uncertainty of . Thus it is fair to say that model SPK deals more “cautiously” with uncertainty than model SP. This is reflected in the simulation results (see e.g. Figure 4) where SPK slightly outperforms SP (both endowed with ML hyperparameters estimation, i.e. SP-ML and SPK-ML).

## Iv Kernel Approximation and Online Learning

In this Section we review the online learning problem using the model classes of Section III. Since it is well known how to perform online learning using the parametric model (3), we will focus on the other models. On the other hand, model (3) can be understood as a special case, for instance, of model (7). Online learning using nonparametric and semiparametric models can be performed by applying the Gaussian regression framework, [Rasmussen]. In order to do so, two issues have to be faced which will be addressed in Section IV-A and Section IV-C, respectively. First, following [SEMIPARAMTERIC_2016, romeres2016onlineIcub], an approximation of the kernel is introduced to allow online learning with constant complexity. Second, two commonly used approaches are very briefly reviewed to estimate the hyperparametes vector , namely Cross Validation and Marginal Likelihood optimisation.

### Iv-a Kernel approximation

A typical route followed in machine learning is to approximate the kernel with a low rank factorisation of the form

 KG(x(t),x(s))≈ϕ⊤(x(t))ϕ(x(s)) (12)

where is a suitable vector of “basis functions” that have to be properly defined. Ideally, should be built from the eigenfunctions of the kernel matrix; however this optimal approximation is computationally unfeasible in a recursive algorithm3. Therefore, we rely on the random features approximation proposed in [rahimi2007random] and exploited for online inverse dynamics modelling in [gijsberts2011incremental, SEMIPARAMTERIC_2016, romeres2016onlineIcub].

According to the random feature approximation, the basis functions vector for the Gaussian kernel is defined as4

 ϕ(x)=1√d [cos(ω⊤1xτ)…cos(ω⊤dxτ) sin(ω⊤1xτ)…sin(ω⊤dxτ)]⊤ (13)

with , . Notice that if then the approximation is almost surely exact. As a consequence, the parameter has to be chosen to trade-off accuracy of the approximation and computational complexity. Finally, it is worth noting that depends on the width of the Gaussian kernel .

Using the approximation in (12) for model (8) is equivalent to consider the approximation

 f(x(t))≈ψ⊤(x(t))π+[ϕ⊤(x(t))⊗In]πNP (14)

where and are independent Gaussian random vectors with zero mean and covariance matrices and , respectively. Finally, by defining

 φ⊤(x(t)) =[ψ⊤(x(t))ϕ⊤(x(t))⊗In] θ =[π⊤π⊤NP]⊤

the approximation of model (8) is

 y(t)=φ⊤(x(t))θ+e(t). (15)

It is possible to show that all the models in Section III can be approximated with a model in the form (15). A derivation of all these approximations can be found in [romeres2016onlineIcub].

### Iv-B Online learning

Next, we address the problem of online learning. For simplicity of exposition, we consider the NP model in (4). It is well known that, given data , , the minimum variance estimator of can also be expressed as the solution of the Tikhonov regularization problem, [Rasmussen],

 ^f=argminf∈H1σ2N∑t=1∥y(t)−f(x(t))∥2+1ρ2∥f∥2H (16)

where belongs to the reproducing kernel Hilbert space , [wahba1990spline], with reproducing kernel function and norm .

It is not difficult to see that the Tikhonov regularization problem (16) takes a similar form for all the models of Section III. The solution of (16) takes the general form

 ^f(x)=N∑t=1αtK(x,x(t)) (17)

where ’s are uniquely determined from . Accordingly, the solution of (16) is unique regardless of the excitation properties of . Moreover, it is possible to prove that ’s take finite value, so that maps bounded trajectories into bounded torques/forces. It is worth noting that the number of coefficients coincides with the number of the data points/samples, i.e. , making the estimator intractable for an online (recursive) solution. On the other hand, using the approximated model (15), the minimum variance estimator of is the solution to the following Regularized Least Squares problem

 ^θ=argminθ1σ2N∑t=1∥y(t)−φ⊤(x(t))θ∥2+θ⊤Wθ (18)

where is the inverse kernel matrix induced by the approximation. Its optimal solution can be computed recursively, whenever new data becomes available, through the well known Recursive Least Squares algorithm, see e.g [Ljung:99, Chapter 11] and [RPEM]. In practice, the implementation of this algorithm uses Cholesky-based updates [bjorck96], which have robust numerical properties. The computational complexity of each update is where is the dimension of vector ; the cost of evaluating the model output is for each output and thus if wrenches (torques/forces) are to be computed. This computational complexity is compatible with online implementation for state-of-the-art computation facilities.

### Iv-C Hyperparameter vector estimation

Nonparametric and semiparametric models are characterized by the hyperparameters vector , which is not known and needs to be estimated from the data. Typically, two approaches are considered to address this problem.

A first possible method is called Cross Validation (CV). The goal is to obtain an estimate of the prediction capability of the model on future data for different choices of the hyperparameters vector . Hold-out cross validation approach is a possible version of CV, see e.g. [james2013introduction, Chapter 6]. The dataset is split in two, the training set and the validation set. The optimal is given by optimizing an estimate of the mean squared error in the validation set

 ^ηCV=argminη∈ΩˆMSE(η). (19)

In practice this approach is limited to the estimation of a small number of hyperparameters since the minimization in (19) is typically performed by gridding the search space .

A second method is offered directly by the Gaussian regression framework. The marginal likelihood (ML), denoted by , expresses the likelihood of the hyperparameters vector on the data , once the parameter has been integrated out. Under model (15), the latter can be computed in closed form, as discussed in [Rasmussen]. To conclude, the optimal is then given by solving

 ^ηML=argmaxη∈Ωpη(y). (20)

In Section VI we shall thoroughly compare on real data these two approaches for hyperparameter estimation. The experimental results, in this case, show that Marginal Likelihood optimisation outperforms Cross Validation.

## V Derivative-free Learning

In Section III-A we have seen that the rigid body dynamics suggests that the inverse dynamics is described by a map from the input locations vector composed by joint positions-velocities-accelerations to the joint torques applied to the joints. This hypothesis has been exploited in all the model classes in Section III. However, it is often the case that joint velocities and accelerations cannot be measured from the robot; rather they are estimated using numerical differentiation schemes from the (noisy) measurements of the joint positions. As a consequence, failure to properly handle noise in the measurement may severely hamper the final solution. This is a very well known and highly discussed problem, see e.g., [siciliano2010robotics, hollerbach2008model, kozlowski2012modelling, craig2005introduction, Nguyen-Tuong2011] and it is usually partially addressed by ad-hoc filter design. However, this requires users’ knowledge and experience in tuning the filters’ parameters.

In this Section we propose a new methodology, which avoids the use of numerical pre-differentiation issues, by learning the inverse dynamics without using the estimated velocities and accelerations. We shall thus assume that only joint torques and joint positions , , are measured. Let

 q(t−):=[q⊤1(t−)q⊤2(t−)…q⊤n(t−)]⊤∈R(M+1)n, (21)

and

 qi(t−):=[q⊤i(t)q⊤i(t−1)…q⊤i(t−M)]⊤∈RM+1, (22)

be the vector of the past joint positions and the vector of the past of the -th joint position, respectively, in the time window where , sufficiently large, has been fixed. Our aim is to model the inverse dynamics as a map from the past joint positions to the joint torques. In particular, we shall postulate that output can be written as a non-linear function of a “features vector” , defined as a linear function of past measurements

 ξi(t)=Rqi(t−) (23)

where . Using the features (23) can be seen as a reduced rank regression problem. We shall discuss later on the choice of the number (rows of ), i.e. the number of features which are used by model (24) below5. The choice of could also be performed resorting to Bayesian arguments, as for instance done in [BSL_JOURNAL] in a similar low-rank regression problem in the context of sparse-low rank dynamic network models.

In particular, we shall model the inverse dynamics with the NP model

 y(t)=f(ξ(t))+e(t) (24)

where is a Gaussian process with kernel function

 K(ξ(s),ξ(t))=ρ2KG(ξ(s),ξ(t))In. (25)

Then, the online learning algorithm is similar to the one sketched in Section IV. The only difference is that the “standard” input locations vector in (1) is replaced with in (23). Special cases which one may consider are so that in (23) coincides with the past measurements (i.e. no dimensionally reduction is performed); as an alternative (we shall come back to this later on) the rows of the matrix may compute, using causal filtering operations, an approximation of the derivatives of (i.e. and possibly higher order derivatives). We may thus say that the approach considered in this section generalises what seen in the previous ones.

In the following, the matrix in (23) will be described by a set of hyperparameters that have to be estimated from the data; for instance, for the NP model we shall have , where is the vector containing the hyperparameters of . Models P, SP, SP2 and SPK can be derived in a similar way.

We shall now consider three possible choices (but of course many others would be possible) for the structure of the matrix .

### V-a Derivative-free features

The simplest choice is to take that is the features vector coincides with the past measured joint positions. Thus . These features will be denoted as “Derivative-Free” (DF). As an alternative it is possible to choose

 R=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣r10…00⋱⋮⋮⋱00…0rM+1⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦ (26)

with . That is, the features vector is a weighted version of the past measured joint positions. These features will be denoted as “Derivative-Free Weighted” (DFW).

### V-B Derivative-free features with reduced rank

An alternative choice is to take the number of features smaller than . For instance, the physics suggests that the right number of features should be equal to 3 (position, velocity, acceleration). Therefore, the role of the features is to compress the useful information available in so as to render the learning procedure more robust. This means that

 R=⎡⎢ ⎢ ⎢ ⎢ ⎢⎣r⊤1r⊤2⋮r⊤k⎤⎥ ⎥ ⎥ ⎥ ⎥⎦,∥ri∥=1 (27)

where , and the vector of hyperparameters is , i.e. is fully parameterised. In practice it makes also sense to impose some constraints in the ’s. In particular so that the first feature is the measured position. It would also be reasonable to enforce other constraints, for instance imposing that is an orthogonal projection. Notwithstanding its importance, we shall not delve further into this issue.

The number of features is chosen by the user and it will be subject of empirical analysis in Section VI. These features will be denoted as “Derivative-Free Reduced rank” (DFR).

Notice, that the features DFR include all the possible linear and causal numerical differentiation and filtering operations. The price of this generality is that a large number of hyperparameters has to be estimated, i.e. . As it is well known in optimization, this might lead to local minima problems. In order to overcome this issue one might resort to regularization techniques on the hyperparameters or to set appropriate initial conditions.

### V-C Structured derivative-free input locations with reduced rank

The last idea is to consider only features: the first will be the position while the other two will attempt to estimate explicitly velocities and accelerations. We know that the joint velocities and accelerations can be computed by a first order backward difference and by a second order backward difference, respectively, both filtered by a first order low pass filter, that is

 ˙qi(t) ≈1−z−1Ts11−β1z−1qi(t) ¨qi(t) ≈1−2z−1+z−2T2s11−β2z−1qi(t)

where is the backward shift operator, is the sampling time and represent the poles of the filters.

We resort to a partial fraction decomposition to rewrite the above expressions as a function of , that is:

 ˙qi(t) ≈α1qi(t)+M∑t=1α1βt−11(β1−1)qi(s−t) (28) ¨qi(t) ≈α2qi(t)+α2(β2−2)qi(s−1) +M∑t=2α2βt−22(β22−2β2+1)qi(s−t) (29)

where and . Here, we exploited the fact that a (stable) low-pass filter can be approximated by finite impulse response (FIR) filter with length , where the latter is chosen sufficiently large. Accordingly, we have

 R=⎡⎢ ⎢⎣10……0α1α1(β1−1)…α1βt−11(β1−1)…α2α2(β2−2)…α2βt−22(β22−2β2+1)…⎤⎥ ⎥⎦.

These features are denoted as “Derivative-Free Structured Reduced rank” (DFSR) since the features have a structure to resemble the joint positions, velocities and accelerations.

A nice property of this characterization is that the number of hyperparameters in is small, in this case , independently of the length of the past temporal lags , which can be arbitrarily chosen.

## Vi Inverse Dynamics Learning on iCub

In this section, we shall provide a thorough experimental evaluation of all the models discussed in Section III, also endowed with derivative-free input locations as discussed in Section V, for online learning of the inverse dynamics; all the algorithms have been tested using real data from the iCub robot [metta2010icub], which is a full-body humanoid robot with 53 degrees of freedom.

In this work we consider inverse dynamics modelling of the right arm. Therefore the free coordinates are the angular positions of the 3 shoulder joints and of the elbow joint, for a total of 4 degrees of freedom. When needed, joint positions have been numerically differentiated to obtain joint velocities and accelerations by the Authors of [SEMIPARAMTERIC_2016] using the standard adaptWinPolyEstimator6 module of the open source iCub project. The differentiation procedure consists in applying a time-varying linear filter based on the work [janabi2000discrete], that is actually implemented at a higher rate before downsampling the signals.

The outputs , are the 3 forces and 3 torques components measured by the six-axes force/torque (F/T) sensors embedded in the shoulder of the iCub’s arm, see Figure 2.

Notice that the measured forces/torques are not the applied joint forces and torques and, as such, the model we learn is not, strictly speaking, the inverse dynamics model. Yet, as explained in [ivaldi2011computing], the feedforward joint torques can be determined from components (forces and torques) of . Indeed, such model has been used in the literature as a benchmark for the inverse dynamics learning, [gijsberts2011incremental], [SEMIPARAMTERIC_2016].

We consider the two datasets used in [SEMIPARAMTERIC_2016], corresponding to different trajectories of the end-effector. In the first one (XY-dataset) the end-effector tracks circles in the XY plane of radius at an approximate speed of ; in the second one (XZ-dataset), the end-effector tracks similar circles but in the XZ plane (the Z axis corresponds to the vertical direction, parallel to the gravity force). The two circles are tracked using the Cartesian controller proposed in [pattacini2010experimental]. Each dataset contains approximately 8 minutes of data collected at a sampling rate of , for a total of 10000 points per dataset. One single circle is completed by the robot in about seconds, which corresponds to 25 points.

We shall consider the models described in Section III endowed with standard input locations or derivative-free features and hyperparameters estimated via either the marginal likelihood approach or Cross Validation7, the latter has been discussed in [SEMIPARAMTERIC_2016]. For ease of exposition the models will be denoted with a combination of the following shorthands:

• P, NP, SP, SP2, SPK: to indicate the model class, as discussed in Section III, approximated according to Section IV-A.

• ML, CV: to indicate the method used to estimate the hyperparameters, according to Section IV-C.

• DF, DFW, DFR, DFSR: to indicate the different derivative-free features, according to Section V. If nothing is indicated, the standard input locations are considered.

For example, we shall denote “NP-ML”, the nonparametric model (4) with hyperparameters estimated through maximization of the marginal likelihood; instead “NP-ML-DFR” shall denote the nonparametric model (24) with hyperparameters estimated through maximization of the marginal likelihood and derivative-free features defined in (27).

The estimation routine has been implemented using Matlab. The RBD regressor for the iCub’s right arm has been computed using the library iDynTree, [nori2015icub]. The Marginal Likelihood has been optimized using the Matlab fmincon.m function. The recursive least squares algorithms have been implemented using the GURLS library, [tacchetti2013gurls]. The results of all CV methods are obtained using code which has been kindly provided by the authors of [SEMIPARAMTERIC_2016].

For each model as above, the following online learning scenario is considered (with reference to the general model structure (15)):

• Initialization: The first 1000 points in XY-dataset are used to estimate the hyperparameters with one of the two techniques considered, say and , as well as to compute an initial estimate of the parameter , say .

• Online Estimation - Stage 1: The remaining 9000 points of the XY-dataset are used to update online the parameter using the recursive least squares algorithm, thus obtaining , . Let be the final value obtained by this procedure on the XY-dataset.

• Online Estimation - Stage 2: The XZ-dataset is split in 5 sequential subsets (XZ-dataset-i, ) of 2000 points each (approximately seconds). These subsets are used to evaluate the performance of the online estimators. For each subset, the estimator of is always initialized with and updated recursively on the data of each dataset XZ-dataset-i, .

Note that the estimators in Stage 2 are initialised using the final estimate from Stage 1, which corresponds to a different motion (XY-dataset). Therefore, the evaluation of the performance in Stage 2 allows us to verify how well the estimators generalise on new unseen data (initial part of the new dataset) as well as how well they are able to learn adapting to a new experimental condition (transient and steady state).

In order to measure the quality of the estimated models, we evaluate the online prediction capability of the estimated models using the following index:

 ε(k)t =∑Ts=1(y(k)(t+s)−^y(k)(t+s|t))2∑Ts=1(y(k)(t+s))2 (31)

where is the estimate of the -th output at time using the model estimated with data up to time . Note that the test data , , have not been used to estimate the model , Thus, is an average relative error on a future horizon of length for the output component . In addition, and will be the average values of for the 3 forces () and the 3 torques (), respectively. In all our simulations, will be equal to (which roughly corresponds to one revolution of the end effector in the plane).

The index (31) can be motivated, for instance, by the possible use of the model in the framework of model predictive control [maciejowski2002predictive], where the final performance of the controller hinges on the predictive capabilities of the model on a given future horizon (say ).

The experimental results will be first presented for models that use numerical derivatives and then for the derivative-free models; a comparison will be eventually provided between the two. For each experiment, we show and averaged over the subsets XZ-Dataset-i, .

### Vi-a Experimental results using numerical derivatives

In Figure 3

the behavior of and is presented. The parametric model, P, exhibits a poor performance because it describes only crude idealizations of the actual dynamics. The algorithms based on Cross Validation (CV) perform significantly worse in the first seconds than those based on Marginal Likelihood (ML) optimisation; this is not unexpected as discussed in [Pillonetto2015106].

As expected, the nonparametric model, NP-ML, has worse generalization performance (the error is larger in the first few steps) but better adaptation capabilities with respect to model P. The models with the best performance are SP-ML and SPK-ML because they combine the benefit of the parametric approach, i.e. generalization capabilities (good estimation performance at the beginning of the new dataset) and of the nonparametric approach, i.e. learning capabilities (good transient and steady state performance). The estimator SP2-ML should partially inherit these benefits from the SP structure, yet it shows an overall slightly worse performance. This is due to the fact that the first (least squares) step, i.e. the estimation of the linear model, is subject to a strong bias deriving from the unmodeled dynamics. Instead, a sound approach is followed by SP-ML and SPK-ML in which the estimation of the hyperparameters is performed jointly, avoiding such bias. In the steady state these semiparametric models outperform the others; yet the semiparametric SPK-ML performs best both in terms of average as well as distribution, as shown in Figure 4

which reports the boxplots of and in “steady state”, i.e. after the first seconds which is considered to be transient, see Figure 3. The reader is also referred to the discussion in Appendix III-E for a theoretical justification of this latter fact.

### Vi-B Experimental results with derivative-free features

The focus of this section is the analysis of the derivative-free models, including the comparison with the models that use numerical derivatives presented in Section VI-A.

The parameters (number of past temporal lags used to form the features vector , see equations (26), (27) and (LABEL:eq:input_locations_features_vel_acc), and the number of features, , in the reduced derivative-free models are set as follows:

• is fixed equal to ; larger values have been tested with no significant differences in performance.

• is fixed to . A discussion on this choice as well as some results with different choices can be found later in this Section, see e.g. Figure 7.

The hyperparameters are estimated using marginal likelihood (ML) maximization, since performing Cross Validation via gridding in high dimension is computationally unfeasible.

In the first comparison, the nonparametric methods (NP-ML) is compared to its derivative-free versions NP-ML-DF, NP-ML-DFW, NP-ML-DFR and NP-ML-DFSR.

In Figure 5 the averaged (over realisations) time evolutions of and are illustrated. All the nonparametric derivative-free models perform comparably and outperform NP-ML, both in transient (more significant) as well as in steady state. The distribution of the steady state errors is shown using boxplots in Figure 6. It is clear that all the derivative-free methods outperform NP-ML, but also that NP-ML-DFR (which uses a reduced rank derivative-free feature) is the best performing method. This confirms that the dimensionality reduction in equation (23) captures the relevant information and allows to reduce the variance of the estimators.

As anticipated, the results in Figure 5 and Figure 6 are obtained setting in NP-ML-DSR. The reasons for this choice is that, when using numerical differentiation, the input location vector contains exactly 3 components (position, velocity and acceleration) for each DoF. It is natural to ask what happens as changes. In Figure 7 the steady state behaviour8 of and of is analysed as a function of .

It is apparent from Figure 7 that and are not sufficient and lead to a considerable performance degradation; in addition, no improvement is obtained increasing beyond (compare and in Figure 7). Larger values (e.g. ) have also been tested leading to similar results and have been therefore omitted. The results in Figure 6 show that the performance of NP-ML-DF (which uses ) is worse than NP-ML-DFR with . Overall this suggests that, indeed, is the optimal choice for this specific application. NP-ML-DFR with performs better than NP-ML-DFSR. Considering the bias-variance trade-off dilemma, the latter method introduces more bias and less variance than the former, suggesting that, in these experimental results, the bias introduced by NP-ML-DFSR is preponderant with respect to the reduction of the variance.

Finally, a comparison between the estimators using numerical derivatives and the derivative-free methods (and in particular DFR) is provided for the semiparametric models.

In Figure 8 and Figure 9 the behaviour of and of for the models SPK-ML-DFR and SPK-ML is illustrated.

The two semiparametric methods have similar initial performance, but SPK-ML has a a better learning rate in the transient. This behaviour might be attributed to the parametric component of the model since in SPK-ML-DFR the physical meaning of the features is lost. However, in steady state the semiparametric derivative-free model outperforms the standard model with an improvement of about in terms of relative error, see Figure 9.

The results obtained in this section give an empirical evidence that learning the features directly from the past history of the measured joint trajectories is a rather promising direction.

### Vi-C Comparison among DFR-like models

Finally a comparison among the models with DFR features, namely, P-ML-DFR, NP-ML-DFR, SP2-ML-DFR and SPK-ML-DFR, is presented in Figures 10 and 11.

The transient performance can be analyzed by observing the first 30 seconds in Figure 10. Similarly to P-ML, also the parametric model P-ML-DFR has very poor performance (note that the relative error is always larger than 1); the poor performance of the parametric models seems to negatively affect also the transient behaviour of the semiparametric models. As a matter of fact, the transient performances of SP2-ML-DFR and SPK-ML-DFR are very poor. The nonparametric model, NP-ML-DFR, definitely outperforms all the other models in terms of transient, suggesting that the parametric models are probably inadequate to capture the dynamic behaviour.

The last 60 seconds of the experiment reported in Figure 10 and the boxplots in Figure 11 (only for NP-ML-DFR and SPK-ML-DFR) illustrate the behaviour at steady state. The suboptimal model SP2-ML-DFR is largely unsatisfactory, which can be probably attributed to the bias error introduced by the parametric component estimated by least squares.

### Vi-D Experimental results discussion

We can summarise as follows the findings of our extensive experimental study:

• The derivative-free methods outperform the schemes based on numerical differentiation; this is even more remarkable if one recalls that the numerical derivatives have been computed starting from data at a higher sampling rate, i.e. they have practically used a richer dataset (which ideally should have lead to better noise-rejection properties). The main reason is probably to be attributed to the fact that the derivatives on which the “classical” models rely are computed using numerical differentiation schemes from the measured positions, which are subject to noise. However, the models (4), (7) and (8) do not account for this noise and operate as if the measurements were correct. Instead, the derivative free method performs a reduction, which can be assimilated to computing derivatives and accelerations, but this reduction is computed as part of the modelling exercise, and as such its effect on wrench prediction is directly accounted for.

• The transient performances of semiparametric models strongly depend on the availability of an accurate (physical) model. Indeed, when using derivative-free methods wherein the extracted features are not guaranteed to approximate velocities and accelerations, the parametric model loses physical meaning. This is confirmed by the poor transient performance of P-ML-DFR, SP2-ML-DFR and SPK-ML-DFR (see Figure 10).

• Depending on whether transient or steady state performance is to be favoured, NP-ML-DFR or SPK-ML-DFR should be preferred respectively (see Figures 10, 11).

## Vii Conclusions

In this paper, several models and algorithms for online inverse dynamics learning have been discussed and compared in a common framework. Parametric and nonparametric methods have been considered, as well as semiparametric models obtained by combining the two. Different strategies, Cross Validation (CV) and Marginal Likelihood optimisation (ML), for estimating the hyperparameters have been compared and also new model structures which do not rely on pre-computed numerical derivatives have been introduced.

All these algorithms have been thoroughly tested on real data from the right arm of the iCub robot; the comparison has been performed both in terms of transient behaviour (how fast algorithms can adapt to a new experimental setup) as well as steady state behaviour.

Overall our experimental validation suggests that:

• The non-parametric model (NP) and the Semiparametric model with RBD mean (SP2) perform better when trained using Marginal Likelihood optimisation rather than Cross Validation algorithms available in [SEMIPARAMTERIC_2016];

• Semiparametric methods which exploit physical insight, when using numerical derivatives, outperform purely nonparametric structures;

• Derivative-free methods are definitely advantageous w.r.t. ad-hoc methods which rely on numerical differentiation, when applied to purely non-parametric model structures (see e.g. NP-ML-DFR);

• Endowing semiparametric methods with derivative-free schemes is not entirely trivial; these new features do not yield improvements in the parametric model as much as they do in the nonparametric one. As a result the transient behaviour of SPK-ML-DFR is significantly worse than its nonparametric counterpart; instead, SPK-ML-DFR outperforms its nonparametric counterpart in steady state.

This last item calls for further research efforts in the future, which would hopefully allow to fully exploit the benefits of derivative-free methods coupled with semiparametric model structures. Our future agenda includes further comparison with parametric methods which account for physical consistency of the parameters (see for instance [TraversaroBEN16]) as well as the implementation of control strategies which exploit the estimated models.

## Viii Acknowledgments

The authors gratefully acknowledge the other teams involved in the FIRB project “Learning meets time” (RBFR12M3AC), in particular iCub Facility group (Francesco Nori, Giorgio Metta) and Lorenzo Rosasco (LCSL-IIT@MIT) for making their data and code available to us.

### -a Proof of Proposition iii.1

First, note that both the SP model and the SPK model can be rewritten as

 y(t)=ψ⊤(x(t))π+f(x(t))+e(t) (32)

where :

1. is an unknown but fixed quantity for the SP model;

2. is a zero mean Gaussian random vector with covariance matrix for the SPK model.

In both models, is a Gaussian process with kernel function

 K(x(t),x(s))=ρ2KG(x(t),x(s)).

A well known connection between Bayes and Fisher (i.e. assuming the parameter is an unknown but fixed quantity) estimators, is that the latter can be obtained as a limiting case of the former when:

• the parameter is modeled as a zero mean Gaussian vector with covariance matrix

• the variance of the prior distribution of is let go to infinity by letting .

This proves the first part of the Proposition. We proceed to prove the second part. Let us stack the available data , in the vector and stack correspondingly the regressors and in the matrix and vector , respectively, so that models (32) can be written as

 y=Ψπ+f+e (33)

where is defined with the same rule as . Moreover, we define and . The minimum variance estimators of and under 2) and given , are:

 ^π =cov[π,y]Var−1[y]y =γ2Ψ⊤(γ2ΨΨ⊤+K(x,x)+σ2I)−1y (34) ^f(⋅) =cov[f(⋅),y]Var−1[y]y =K(⋅,x)(γ2ΨΨ⊤+K(x,x)+σ2I)−1y.

Defining and using the matrix inversion lemma we have:

 (γ2ΨΨ⊤+K(x,x)+σ2I)−1 =(γ2ΨΨ⊤+R)−1 =R−1−R−1Ψ(Ψ⊤R−1Ψ+γ−2I)−1Ψ⊤R−1

so that, from (34)

 ^π =γ2(I−Ψ⊤R−1Ψ(Ψ⊤R−1Ψ+γ−2I)−1)Ψ⊤R−1y =(Ψ⊤R−1Ψ+γ−2I)−1Ψ⊤R−1y,

and similarly

 ^f(⋅) =K(⋅,x)R−1(I−Ψ(Ψ⊤R−1Ψ+γ−2I)−1Ψ⊤R−1)y =K(⋅,x)R−1[y−Ψ^π].

Clearly, as , we have that converges to the weighted least squares estimate

 ^πWLS=(Ψ⊤R−1Ψ)−1Ψ⊤R−1y (35)

and converges to

 ^f(⋅)=K(⋅,x)R−1[y−Ψ^πWLS].

On the other hand the marginal likelihood function for model (33), under 1), i.e. when is considered as an unknown parameter, has the form:

 LSP(y)=−2log(pη(y))=log(det(2πR))+(y−Ψπ)⊤R−1(y−Ψπ).

When are kept fixed, the minimization with respect to can be performed in closed form, and yields exactly the weighted least squares solution (35). However, even for , the marginal likelihoods of given the hyperparameters under 1) and 2) are different. In fact, if 1) is postulated, and is solved as above, one obtains the profile marginal log-likelihood

 ^LSP(y)=log(det(2πR))+(y−Ψ^πWLS)⊤R−1(y−Ψ^πWLS)

where the hyperparameters are hidden in the definition of .

Instead, if 2) is postulated, the marginal log-likelihood takes the form

 LSPK(y)=log(det(2π(γ2ΨΨ⊤+R)))+y⊤(γ2ΨΨ⊤+R)−1y.

Using, as above, the matrix inversion Lemma on , the Sylvester determinant identity and defining , we obtain

 LSPK(y)=log (det(2πR))+log(det(Ip+γ2Ψ⊤R−1Ψ)) +(y−Ψ^π)⊤R−1(y−Ψ^π) +^π⊤Ψ⊤R−1(y−Ψ^π).

As we have that and so that9

 LSPK(y)≈log (det(2πR))+log(det(Ip+γ2Ψ⊤R−1Ψ)) +(y−Ψ^πWLS)⊤R−1(y−Ψ^πWLS).

The second term can be manipulated as follows:

 log (det(Ip+γ2Ψ⊤R−1Ψ)) =log(det(γ2Ip))+log(det(γ−2Ip+Ψ⊤R−1Ψ)) ≈plogγ2+log(det(Ψ⊤R−1Ψ))

where the last approximation clearly holds when . Inserting the last expression in we obtain that

 LSPK(y)≈log(det(2πR))+plogγ2+log(det(Ψ⊤R−1Ψ))+(y−Ψ^πWLS)⊤R−1(y−Ψ^πWLS)≈^LSP(y)+log(det(Ψ⊤R−1Ψ))+plogγ2

which shows that the the two log-likelihoods differ, up to the constant