An Incremental Self-Organizing Architecture for Sensorimotor Learning and Prediction

Luiza Mici, German I. Parisi, and Stefan Wermter

The authors are with the Department of Informatics, Knowledge Technology, University of Hamburg, Vogt-Koelln-Strasse 30, 22527 Hamburg, Germany, e-mail: {mici, parisi, wermter}@informatik.uni-hamburg.de. Manuscript submitted to IEEE Transactions on Cognitive and Developmental Systems on 06-07-2017.

During visuomotor tasks, robots have to compensate for the temporal delays inherent in their sensorimotor processing systems. This capability becomes crucial in a dynamic environment where the visual input is constantly changing, e.g. when interacting with humans. For this purpose, the robot should be equipped with a prediction mechanism able to use the acquired perceptual experience in order to estimate possible future motor commands. In this paper, we present a novel neural network architecture that learns prototypical visuomotor representations and provides reliable predictions to compensate for the delayed robot behavior in an online manner. We investigate the performance of our method in the context of a synchronization task, where a humanoid robot has to generate visually perceived arm motion trajectories in synchrony with a human demonstrator. We evaluate the prediction accuracy in terms of mean prediction error and analyze the response of the network to novel movement demonstrations. Additionally, we provide experiments with the system receiving incomplete data sequences, showing the robustness of the proposed architecture in the case of a noisy and faulty visual sensor.

Self-organized networks, hierarchical learning, motion prediction

I Introduction

Real-time interaction with the environment requires robots to adapt their motor behavior according to perceived events. However, each sensorimotor cycle of the robot is affected by an inherent latency introduced by the processing time of sensors, transmission time of signals, and mechanical constraints [mainprice2012sharing][Zhong2012][saegusa2007sensory]. Due to this latency, robots exhibit a discontinuous motor behavior which may compromise the accuracy and execution time of the assigned task. For social robots, delayed motor behavior makes human-robot interaction asynchronous and less natural. Synchronization of movements during human-robot interaction may increase rapport and may endow humanoid robots with the ability to be joint partners in humans’ daily tasks [lorenz2011synchronization]. A possible solution to this issue is the application of predictive mechanisms which accumulate information from the robot’s perceptual and motor experience and learn an internal model which estimates possible future motor states. Learning these models in an unsupervised manner and adapting them as new sensorimotor information is acquired remains a challenging task.

There are naturally occurring latencies between perception and possible motor reaction in human beings [Nijhawan1063]. Such discrepancies are caused by neural transmission delays and are constantly compensated by predictive mechanisms in our sensorimotor system that account for both motor prediction and anticipation of the target movement. Miall et al. [miall1993cerebellum] have proposed that the human cerebellum is capable of estimating the effects of a motor command through an internal action simulation and uses a forward model similar to a Smith predictor [bahill1983simple]. Furthermore, there are additional mechanisms for visual motion extrapolation which account for the anticipation of the future position and movement of the target [kerzel2003neuronal]. Not only do we predict sensorimotor events in our everyday tasks, but we also constantly adjust our delay compensation mechanisms to the sensory feedback [rohde2014predictability] and to the specific task [deLaMalla].

Recently, there has been a considerable growth of learning-based prediction techniques, which mainly operate in a “learn then predict” approach, i.e. typical motion patterns are extracted and learned from training data sequences and then learned motion patterns are used for prediction [mainprice][ito2004line][levine2016learning][Zhong2012]. The main issue with this approach is that the adaptation of the learned models is interrupted by the prediction stage. However, it is desirable for a robot operating in natural environments to be able to learn incrementally, i.e. over a lifetime of observations, and to refine the accumulated knowledge over time. Therefore, the development of learning-based predictive methods accounting for both incremental learning and predictive behavior still remains an open challenge.

In this work, we propose a novel architecture that learns sensorimotor patterns and predicts the future motor states from the delayed sensory information. We investigate the capabilities of the architecture in learning new sensorimotor representations and in predicting them during an imitation task in a human-robot interaction scenario. In this scenario, body motion patterns demonstrated by a human teacher are mapped to trajectories of robot joint angles and then learned through an unsupervised incremental neural network architecture. The learned trajectories are then immediately imitated by the robot. The architecture is able to predict future motor behavior in order to compensate for the delay during generation of robot movements, thereby leading to human-robot synchronization. We approach the demonstration of the movements through motion capture with a depth sensor, with body motion patterns represented by a three-dimensional skeleton model. The learning module is based on a hierarchy of Growing When Required (GWR) [marsland2002self] networks, which has been successfully applied to the classification of human activities [Parisi2016][Mici2016]. The learning algorithm processes incoming motion sequences and progressively builds a dictionary of motion segments in an unsupervised way. The cost of learning during each iteration is linear with respect to the number of neurons, thereby enabling a real-time application of the system.

We evaluated our system on a dataset of 10 arm movement patterns demonstrated by three different users. Experimental results show that the proposed architecture is able to continuously compensate for the robot’s sensorimotor delay. Moreover, we show that the proposed system is able to learn and reliably predict motion in the face of missing sensory information.

II Related Work

II-A Motion prediction

Motion analysis and prediction are an integral part of robotic platforms that counterbalance the inherent sensorimotor latency. Well-known methods for tracking and prediction are the Kalman filter models, as well as their extended versions which assume non-linearity of the system, and the Hidden Markov Models (HMM). Kalman filter-based prediction techniques require a precise kinematic or dynamic model that describes how the state of an object evolves while being subject to a set of given control commands. HMMs describe the temporal evolution of a process through a finite set of states and transition probabilities. Predictive approaches based on dynamic properties of the objects are not able to provide correct long-term predictions of human motion [vasquez2008intentional] because human motion also depends on higher-level factors beyond kinematic constraints, such as plans or intentions.

There are some alternatives to approaches based on probabilistic frameworks in the literature and neural networks are probably the most popular ones. Neural networks are known to be able to learn universal function approximations and thereby predict non-linear data even though dynamic properties of a system or state transition probabilities are not known [schaefer2008learning][saegusa2007sensory]. For instance, Multilayer Perceptrons (MLP) and Radial Basis Function (RBF) networks as well as Recurrent Neural Networks have found successful applications as predictive approaches [mainprice][barreto2007time][ito2004line][Zhong2012]. A subclass of neural network models, namely the Self-Organizing Map (SOM) [kohonen1993self], is able to perform local function approximation by partitioning the input space and learning the dynamics of the underlying process in a localized region. The advantage of the SOM-based methods is their ability to achieve long-term predictions at much less expensive computational time [simon2007forecasting].

Johnson and Hogg [johnson1996learning] first proposed the use of multilayer self-organizing networks for the motion prediction of a tracked object. Their model consisted of a bottom SOM layer learning to represent the object states and a higher SOM layer learning motion trajectories through the leaky integration of neuron activations over time. Similar approaches were proposed later by Sumpter and Bulpitt [sumpter2000learning] and Hu et al. [hu2004learning], who modeled time explicitly by adding lateral connections between neurons in the state layer, obtaining performance comparable to that of the probabilistic models. Several other approaches exist whereby the SOM has been extended with temporal associative memory techniques [barreto2007time], associating to each neuron a linear Autoregressive (AR) model [walter1990nonlinear][vesanto1997using]. A drawback common to these approaches is their assumption that the number of movement patterns to be learned is known a priori. This issue can be mitigated by adopting growing extensions of the SOM such as the GWR algorithm [marsland2002self]. The GWR algorithm has the advantage of a non-fixed, varying topology that requires no specification of the number of neurons in advance. Moreover, the prediction capability of the self-organizing approaches in the case of multidimensional data sequences has not been thoroughly analyzed in the literature. In the current work, we present experimental results in the context of a challenging robotic task, whereby real-world sensorimotor sequences have to be learned and predicted.

Fig. 1: Overview of the proposed system for the sensorimotor delay compensation during an imitative learning scenario. The vision module acquires motion from a depth sensor and estimates the three-dimensional position of joints. Shoulder and elbow angle values are extracted and fed to the visuomotor learning algorithm. The robot then receives predicted motor commands processed by the delay compensation module.

II-B Incremental learning of motion patterns

In the context of learning motion sequences, an architecture capable of incremental learning should identify unknown patterns and adapt its internal structure accordingly. This topic has been the focus of a number of studies on programming by demonstration (PbD) [Billard2016]. Kulić et al. [kulic2008incremental] used HMMs for segmenting and representing motion patterns together with a clustering algorithm that learns in an incremental fashion based on intra-model distances. In a more recent approach, the authors organized motion patterns as leaves of a directed graph where edges represented temporal transitions [kulic2012incremental]. However, the approach was built upon automatic segmentation which required observing the complete demonstrated task, thereby becoming task-dependent. A number of other works have also adapted HMMs to the problem of incremental learning of human motion [takano2006humanoid][billard2006discriminative][ekvall2006online][dixon2004predictive]. The main drawback of these methods is that they require knowing a priori the number of motions to be learned or the number of Markov models comprising the learning architecture.

Ogata et al. [ogata2004open] proposed a model that considers the case of long-term incremental learning. In their work, a recurrent neural network was used to learn a navigation task in cooperation with a human partner. The authors introduced a new training method for the recurrent neural network in order to avoid the problem of memory corruption during new training data acquisition. Calinon et al. [calinon2007incremental] showed that the Gaussian Mixture Regression (GMR) technique can be successfully applied for encoding demonstrated motion patterns incrementally through a Gaussian Mixture Model (GMM) tuned with an expectation-maximization (EM) algorithm. The main limitation of this method is the need to specify in advance the number and complexity of tasks in order to find an optimal number of Gaussian components. Therefore, Cederborg et al. [cederborg2010incremental] suggested performing a local partitioning of the input space through kd-trees and training several local GMR models. However, for high-dimensional data, partitioning of the input space in a real-time system requires additional computational time. Regarding this issue, it is convenient to adopt self-organized network-based methods that partition the input space through the creation of prototypical representations and, in parallel, fit the necessary local models.

III Methodology

III-A Overview

The proposed learning architecture consists of a hierarchy of GWR networks [marsland2002self] which process input data sequences and learn inherent spatiotemporal dependencies (Fig. 1). The first layer of the hierarchy learns a set of spatial prototype vectors which will then represent incoming data samples. Temporal context is later encoded as a concatenation of consecutively matched prototypical samples and becomes more complex and of higher dimensionality when moving towards the last layer. This means that time dependence of the incoming data samples is captured by the order of the concatenated elements. When body motion sequences are provided as input, the response of the neurons in the architecture resembles the neural selectivity towards temporally ordered body pose snapshots in the human brain [giese2003neural]. This simple, yet effective data sequence representation is also convenient in a prediction application due to implicitly mapping past values to the future ones. Indeed, if we consider the concatenation vector as being composed of two parts, the first part carries information about the input data at previous time steps, while the second part concerns the desired output of this mapping.

The evaluation of the predictive capabilities of the proposed architecture for compensating robot sensorimotor delay will be conducted in the context of a human-robot interaction scenario. In this scenario, a simulated Nao robot imitates a human demonstrator while compensating for the sensorimotor delay in an on-line manner.

III-B Sequence representations with hierarchical GWR

The building block of our architecture is the GWR network [marsland2002self], which belongs to the unsupervised competitive learning class of artificial neural networks. A widely known algorithm of this class is the SOM [kohonen1993self]. The main components of these algorithms are the neurons, which, in the case of SOMs, are distributed in a fixed 2D or 3D lattice, whereas in a GWR network they have a varying topology. The main training steps in these networks are the competition between the neurons based on a similarity measure, usually the Euclidean distance, and their adaptation to the incoming input data samples in an unsupervised fashion. When convergence has been reached, the neurons have learned to represent the input space while preserving topology, i.e. similar inputs are mapped to neurons that are near to each other. In this regard, the SOM’s predefined structure and size have been shown to be disadvantageous compared to growing extensions of self-organizing networks such as the GWR network. Unlike SOMs, the topological structure in the GWR algorithm is not fixed but grows to adapt to the topological properties of the input space. Thus, new knowledge can be added to the network in the form of new prototype vectors as new data become available.

Traditionally, GWR networks do not encode temporal relationships between inputs. This limitation has been addressed by different extensions, such as hierarchies of GWRs augmented with a window-in-time memory or with recurrent connections [parisiFrontiers][Parisi2016][Mici2016]. Since our intention is both to encode data sequences and to generate them, we adopt the first approach, in which the relevant information regarding data samples in a window of time is always explicitly available. Moreover, the use of a hierarchy instead of a shallow architecture allows for the encoding of multiple time-varying sequences through neurons representing sub-patterns observed at different points in time. Thus, the reuse of the neurons is possible within the same data sequence containing repeated sub-patterns as well as across different sequences. This data compression mechanism is quite effective, especially when neurons are reused in higher levels, but how often they are reused depends on the data domain, specifically on how much the sequences to be learned overlap. The idea of decomposing sequences into reusable sub-patterns, when considering body motion, relates to the motor schemata theory [arbib]. This theory considers a set of so-called action primitives which are learned and then combined for generating the desired behavior. This compositionality property leads to a further possible application of our architecture for a Programming by Demonstration task (this application is left for future studies).

The same data processing mechanism applies to all layers of the proposed architecture. Each GWR neuron $j$ holds a weight vector $w_j$, with dimensionality equal to the input size. Given one data sample $x(t)$, the winner, namely the best matching unit (BMU), is calculated as the neuron most similar, i.e. least distant, to the input. Therefore, the index $b$ of the best matching unit is defined as:

$$b = \arg\min_{j:\, w_j \in W} \lVert x(t) - w_j \rVert \tag{1}$$

where $W$ is the set of all weights of one GWR. The weight vector associated with the computed BMU, $w_{b(t)}$, represents the current data sample. The output of each GWR layer will be the concatenation of the consecutive winning neurons within a pre-defined temporal window $\tau$, as depicted in Fig. 2. The function for the recursive computation of the output is defined as:

$$o(t) = w_{b(t)} \oplus o(t-1)\big[1 : (\tau - 1)\cdot\dim(w_{b(t)})\big] \tag{2}$$

where $\oplus$ represents the concatenation symbol, $\dim(\cdot)$ denotes the dimension of a vector and we use the square parentheses in expressions like $a[i:j]$ to denote the vector constructed by the components $i$ through $j$ of vector $a$. Moving up in the hierarchy, the output computed as in Eq. 2 will represent the input for the next layer. This means that the first level GWR is used as a quantization algorithm for the spatial domain, while the second and the third level GWRs encode information accumulated over a short and a relatively longer time period respectively.
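The quantize-then-concatenate step of one layer can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `WindowedLayer`, the function `bmu_index`, the fixed toy prototypes and the newest-first ordering of the concatenation are assumptions made for illustration, and no learning takes place here.

```python
import math
from collections import deque

def bmu_index(sample, weights):
    """Index of the prototype closest to `sample` (Euclidean distance)."""
    return min(range(len(weights)), key=lambda j: math.dist(sample, weights[j]))

class WindowedLayer:
    """Hypothetical helper: quantizes inputs and concatenates the last `tau` winners."""

    def __init__(self, weights, tau):
        self.weights = weights            # fixed prototypes (assumed already learned)
        self.window = deque(maxlen=tau)   # newest prototype kept at the left end

    def step(self, sample):
        b = bmu_index(sample, self.weights)
        self.window.appendleft(self.weights[b])   # oldest entry drops off the right
        if len(self.window) < self.window.maxlen:
            return None                           # not enough history yet
        return [value for w in self.window for value in w]

# Toy usage: three 2-D prototypes, window of two time steps.
layer = WindowedLayer(weights=[[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]], tau=2)
outputs = [layer.step(x) for x in ([0.1, -0.1], [0.9, 1.2], [2.1, 1.8])]
```

In this sketch the output of a layer would be fed as the input sample of the next layer, so the dimensionality grows by a factor of `tau` at each level.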

Fig. 2: Schematic description of one layer of the proposed GWR-based learning architecture (not all neurons and connections are shown). At each time step $t$, the input data sample $x(t)$ is represented by the weight of the winner neuron $w_{b(t)}$, which is then concatenated with the previous winner neuron weights (depicted in fading yellow) in order to compute the output $o(t)$. The length of the concatenation vector is determined by a pre-defined constant $\tau$. The delay blocks denote the time delay used for obtaining a vector of consecutively activated neurons.

III-C Predictive GWR algorithm

As discussed in Section II-A, the problem of one-step-ahead prediction can be formalized as a function approximation problem. Given a multi-dimensional time series denoted by $\{y(t)\}$, the function approximation is of the form:

$$\hat{y}(t+1) = f\big(y^{-}(t) \mid \Theta\big), \qquad y^{-}(t) = [\,y(t),\, y(t-1),\, \ldots,\, y(t-p+1)\,] \tag{3}$$

where the input of the function, or regressor, $y^{-}(t)$ has an order of regression $p$, with $\Theta$ denoting the vector of adjustable parameters of the model, and $\hat{y}(t+1)$ is the predicted value. In other words, the prediction function maps or associates the past input values to the observed value directly following them. This type of input-output mapping approximation can be implemented during each training step of the last layer GWR of our architecture, which receives concatenations of the temporally ordered winning neurons from the preceding layer computed as in Eq. 2. For this purpose, we can define the right-hand side of the concatenation operator in Eq. 2 as the regressor $x^{in}(t)$ and the left-hand side of the operator as the value to predict $x^{out}(t)$, and extend the GWR algorithm in order to perform quantization of the regressors as well as learning of the mapping between the regressors and the predicted values. The same learning scheme has been successfully applied to a SOM network [barreto2007time], called the Vector-Quantized Temporal Associative Memory (VQTAM) model, and it has been shown to perform well on tasks such as time series prediction and predictive control.
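As a small illustration of how a concatenation vector can be split into a regressor and a value to predict, consider the following sketch. The function name `split_concatenation` is hypothetical, and the sketch assumes that the newest prototype occupies the first `dim` components of the flat vector, with older prototypes following.

```python
def split_concatenation(o, dim):
    """Split a newest-first concatenation vector into (regressor, target).

    o   : flat list holding several consecutive prototypes of size `dim`
    dim : dimensionality of a single prototype (assumed known)
    """
    target = o[:dim]       # left-hand part: the value to predict
    regressor = o[dim:]    # right-hand part: the past values
    return regressor, target

# Three 2-D prototypes, newest first.
regressor, target = split_concatenation([3.0, 3.5, 2.0, 2.5, 1.0, 1.5], dim=2)
```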

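  1. Create two random neurons with weights $\{w^{in}_1, w^{out}_1\}$ and $\{w^{in}_2, w^{out}_2\}$

  2. At each iteration $t$, generate an input sample $\{x^{in}(t), x^{out}(t)\}$

  3. Select the best and second-best matching neuron considering only the regressor: $b = \arg\min_{n \in A} \lVert x^{in}(t) - w^{in}_n \rVert$, $s = \arg\min_{n \in A \setminus \{b\}} \lVert x^{in}(t) - w^{in}_n \rVert$

  4. Create a connection $E = E \cup \{(b, s)\}$ if it does not exist and set its age to 0.

  5. If ($\exp(-\lVert x^{in}(t) - w^{in}_b \rVert) < a_T$) and ($\eta_b < f_T$) then:

    • Add a new neuron $r$ ($A = A \cup \{r\}$) with $w^{in}_r = 0.5 \cdot (x^{in}(t) + w^{in}_b)$, $w^{out}_r = 0.5 \cdot (x^{out}(t) + w^{out}_b)$, $\eta_r = 1$

    • Update edges: $E = E \cup \{(r, b), (r, s)\}$ and $E = E \setminus \{(b, s)\}$

  6. If no new neuron is added:

    • Update best-matching neuron $b$ and its neighbors $i$:

      $\Delta w^{in}_i = \epsilon_i \cdot \eta_i \cdot (x^{in}(t) - w^{in}_i)$, $\Delta w^{out}_i = \epsilon_i \cdot \eta_i \cdot (x^{out}(t) - w^{out}_i)$, with the learning rates $0 < \epsilon_i < \epsilon_b < 1$.

    • Increment the age of all edges connected to $b$ by 1.

  7. Reduce the firing counters of the best-matching neuron $b$ and its neighbors $i$:

    $\Delta \eta_b = \tau_b \cdot \kappa \cdot (1 - \eta_b) - \tau_b$, $\Delta \eta_i = \tau_i \cdot \kappa \cdot (1 - \eta_i) - \tau_i$, with constants $\tau_b$, $\tau_i$ and $\kappa$ controlling the curve behavior.

  8. Remove all edges with ages larger than $a_{max}$ and remove neurons without edges.

  9. If the stop criterion is not met, repeat from step 2.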

Algorithm 1 Predictive GWR - In layers 1 and 2 of our architecture, we use GWR learning, while in layer 3 we use Predictive GWR
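To make the steps of Algorithm 1 concrete, the following Python sketch implements one possible reading of them with the standard library only. It is an illustration under stated assumptions rather than the authors' code: the class name `PredictiveGWR`, all parameter names and default values, the use of the standard GWR insertion and habituation rules, and the toy sine-wave data are choices made here for demonstration.

```python
import math

class PredictiveGWR:
    """Illustrative Predictive GWR; names and default values are assumptions."""

    def __init__(self, seeds, a_T=0.9, f_T=0.3, eps_b=0.1, eps_n=0.01,
                 tau_b=0.3, tau_n=0.1, kappa=1.05, max_age=50):
        # seeds: two (regressor, target) pairs used as the initial neurons
        self.w_in = {i: list(r) for i, (r, _) in enumerate(seeds)}
        self.w_out = {i: list(t) for i, (_, t) in enumerate(seeds)}
        self.fire = {i: 1.0 for i in self.w_in}   # firing counters
        self.edges = {}                           # frozenset({i, j}) -> age
        self.next_id = len(seeds)
        self.a_T, self.f_T, self.kappa, self.max_age = a_T, f_T, kappa, max_age
        self.eps = {"b": eps_b, "n": eps_n}
        self.tau = {"b": tau_b, "n": tau_n}

    def _ranked(self, x_in):
        # Neurons sorted by distance between the regressor and their input weights.
        return sorted(self.w_in, key=lambda n: math.dist(x_in, self.w_in[n]))

    def _neighbors(self, i):
        return [j for e in self.edges if i in e for j in e if j != i]

    def predict(self, x_in):
        """One-step prediction: output weight of the best-matching neuron."""
        return self.w_out[self._ranked(x_in)[0]]

    def train_step(self, x_in, x_out):
        b, s = self._ranked(x_in)[:2]                        # step 3
        error = math.dist(x_out, self.w_out[b])              # error before the update
        self.edges[frozenset((b, s))] = 0                    # step 4
        activity = math.exp(-math.dist(x_in, self.w_in[b]))
        if activity < self.a_T and self.fire[b] < self.f_T:  # step 5: grow
            r, self.next_id = self.next_id, self.next_id + 1
            self.w_in[r] = [(x + w) / 2 for x, w in zip(x_in, self.w_in[b])]
            self.w_out[r] = [(x + w) / 2 for x, w in zip(x_out, self.w_out[b])]
            self.fire[r] = 1.0
            del self.edges[frozenset((b, s))]
            self.edges[frozenset((r, b))] = 0
            self.edges[frozenset((r, s))] = 0
        else:                                                # step 6: adapt
            for n, kind in [(b, "b")] + [(j, "n") for j in self._neighbors(b)]:
                rate = self.eps[kind] * self.fire[n]
                self.w_in[n] = [w + rate * (x - w) for x, w in zip(x_in, self.w_in[n])]
                self.w_out[n] = [w + rate * (x - w) for x, w in zip(x_out, self.w_out[n])]
            for e in self.edges:
                if b in e:
                    self.edges[e] += 1
        for n, kind in [(b, "b")] + [(j, "n") for j in self._neighbors(b)]:  # step 7
            t = self.tau[kind]
            self.fire[n] += t * self.kappa * (1 - self.fire[n]) - t
        # step 8: drop old edges, then neurons left without any edge
        self.edges = {e: a for e, a in self.edges.items() if a <= self.max_age}
        connected = {n for e in self.edges for n in e}
        for n in [n for n in self.w_in if n not in connected]:
            del self.w_in[n], self.w_out[n], self.fire[n]
        return error

# Toy usage: predict the next value of a sine wave from the two latest values.
series = [math.sin(0.3 * t) for t in range(103)]
pairs = [([series[t], series[t - 1]], [series[t + 1]]) for t in range(1, 102)]
net = PredictiveGWR(seeds=[([0.0, 0.0], [0.0]), ([1.0, 1.0], [1.0])])
epoch_errors = []
for _ in range(20):
    errors = [net.train_step(x_in, x_out) for x_in, x_out in pairs]
    epoch_errors.append(sum(errors) / len(errors))
```

Because the prediction error is computed with the same winner search that drives learning, `train_step` can return it at no extra cost, which mirrors the on-line error monitoring discussed in the text. Neurons are stored in dictionaries keyed by an integer identifier so that removing a neuron does not shift the indices used by the edges.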

We assign to each unit of the Predictive GWR two weight vectors: $w^{in}$ and $w^{out}$. BMUs will be calculated following Eq. 1 and only the regressor part will be considered. Therefore, the predicted value is defined as:

$$\hat{y}(t+1) = w^{out}_{b} \tag{4}$$

where $b$ is the winner neuron, thus the element responsible for the mapping of the input regressor to the output predicted value. In this way, we are also able to define an absolute value for the prediction error:

$$e(t+1) = \lVert y(t+1) - \hat{y}(t+1) \rVert \tag{5}$$

The learning procedure for the Predictive GWR is illustrated in Algorithm 1. During training, the winner neuron at time step $t$ is determined based on $x^{in}(t)$ as in Eq. 1, but both weights of the winner neuron and its topological neighbors will be updated according to the rules in Step 6a. This learning step guarantees that the topology-preserving vector quantization happens in both the input and the output space. It also allows for minimization of the prediction error during each iteration.

The dynamics of the network are determined by the match between regressor vectors and the prototype neurons. New neurons will be added whenever the activity of the network with respect to the input is smaller than a given threshold $a_T$ for a well-trained best-matching neuron (firing counter smaller than a threshold $f_T$). Additionally, the GWR algorithm is equipped with a connection-aging mechanism that leads to neurons being removed when they become unconnected, i.e. when they are rarely used. In this way, representations of data samples that were seen in the distant past are eliminated, leading to an efficient use of available resources from the lifelong learning perspective.

The Predictive GWR algorithm operates differently from supervised prediction approaches in general. In the latter, the prediction error signal is the factor that guides the learning, whereas in the Predictive GWR the prediction error is implicitly calculated and minimized but does not affect the learning flow. Moreover, unlike the SOM-based VQTAM model, the number of input-output mapping neurons, or so-called local models, is not pre-defined or fixed but instead adapts to the data given as input, so that the number of such models follows the complexity of the data. It should be noted that overfitting does not occur with the growth of the network due to the fact that neural growth decreases the quantization error, which is proportional to the prediction error.

III-D Sequence prediction

During the prediction phase, each sequence is presented to the trained architecture, and the temporal information is encoded as described in Section III-B. At each time step $t$, the one-step-ahead estimate $\hat{y}(t+1)$ is given by Eq. 4. In the case that the desired prediction horizon is greater than 1, the multi-step-ahead prediction can be obtained in two ways: (1) recursive prediction, in which predicted values are fed back into the regressor and Eq. 4 is applied recursively until the whole desired prediction vector is obtained; (2) vector prediction, by augmenting the predicted value with as many time steps as the desired prediction horizon. The sliding window approach that has been adopted for the encoding of the sequences allows for a direct application of the recursive prediction route. The vector prediction, on the other hand, can be applied by adapting the regressor and the predicted value as follows:

$$x^{in}(t) = [\,y(t),\, y(t-1),\, \ldots,\, y(t-p+1)\,], \qquad x^{out}(t) = [\,y(t+1),\, y(t+2),\, \ldots,\, y(t+h)\,] \tag{6}$$

where $p$ is the number of past values and $h$ is the desired prediction horizon. The same dimensionality and order should be defined for the weight vectors $w^{in}$ and $w^{out}$ of the Predictive GWR neurons as well. In this way, the output of the prediction function in Eq. 4 will contain all future values within the desired prediction horizon.
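The recursive route can be sketched in a few lines of Python. The function `recursive_forecast` and the toy predictor `linear_extrapolation` are hypothetical and stand in for a trained one-step predictor; the sketch assumes a flat regressor ordered newest-first, with each sample occupying `dim` components.

```python
def recursive_forecast(predict_one, regressor, horizon, dim):
    """Roll a one-step predictor forward `horizon` steps.

    predict_one : callable mapping a flat regressor to the next sample (length `dim`)
    regressor   : flat list of past samples, newest first
    """
    window = list(regressor)
    forecast = []
    for _ in range(horizon):
        nxt = list(predict_one(window))
        forecast.append(nxt)
        window = nxt + window[:-dim]   # feed the prediction back, drop the oldest sample
    return forecast

def linear_extrapolation(window):
    """Toy stand-in for a trained model: continue the last observed difference."""
    return [2 * window[0] - window[1]]

steps = recursive_forecast(linear_extrapolation, [2.0, 1.0], horizon=3, dim=1)
```

With the recursive route, errors made at early steps are fed back into later ones, whereas the vector route of Eq. 6 produces the whole horizon from observed values only, at the cost of higher-dimensional output weights.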

Since computing the winner neuron during learning and predicting the next elements of the sequences given the current input rely on the same mechanism, we can keep track of the prediction error for each learning iteration. Following the weights update reported in Algorithm 1, Step 6a, the prediction error decreases during learning of a motion sequence. On the other hand, the error is expected to increase in case of novel input sequences that do not follow any of the prototypical movement patterns previously learned. Therefore, a threshold can be defined in order to set the biggest tolerated prediction error. The extension of our architecture with a strategy for choosing predictions based on estimated accuracy complements our architecture with an action selection capability, which has to be investigated further in our future studies.

IV The synchronization task

For humans, the synchronization of behavior is a fundamental principle for motor coordination and is known to increase rapport in daily social interaction [lorenz2011synchronization]. Psychological studies have shown that during conversation humans tend to coordinate body posture and gaze direction [shockley2009conversation]. This phenomenon is believed to be connected to the mirror neuron system [tessitore2010motor] which suggests a common neural mechanism for both motor control and action understanding. Considering that interpersonal coordination is an integral part of human interaction, we assume that applied to human-robot interaction it may promote the social acceptance of robots.

The robot task consists of learning a set of visually demonstrated movement patterns and imitating them in synchrony with the teacher signal. In this task, the users’ body pose during motion demonstration represents the sensory input and the synchronous imitation of the teacher represents the desired robot behavior. In this case, the prediction of future visuomotor states is necessary to compensate for the sensory delay introduced by the vision sensor camera, the signal transmission delay as well as the robot’s motor latency during motion generation. A schematic description of the system components for the imitation scenario is depicted in Fig. 1. This type of scenario serves the purpose of showcasing the sensorimotor delay compensation capabilities of the proposed architecture and implies a potential human-robot interaction application.

For simplicity, we consider only the arm movements while leaving the rest of the body fixed, thereby focusing on the generation of movements rather than the stability of the robot. The target motor commands for the robot are obtained by mapping the users’ arm skeletal configuration to the robot’s arm joint angles. This direct mapping allows for a simple, yet compact representation of the visuomotor states, so that predictions of the visual input flow include those of the future motor commands to give to the robot. Motion trajectories are then learned incrementally by training our hierarchical GWR-based learning algorithm. The training of the employed neural network model is conducted by using a set of training samples corresponding to multiple user movement patterns. This allows for extracting prototypical motion patterns which can be used for the generation of robot movements as well as prediction of future target trajectories in parallel. The simulated Nao robot was used as the robotic platform for the experimental evaluation.

Fig. 3: Examples of arm movement patterns. The visual input data, represented as three-dimensional skeleton sequences, are mapped to the robot’s joint angles.

IV-A System description

A general overview of the proposed architecture is depicted in Fig. 1. The system consists of three main modules:

  1. The vision module, including the depth sensor and the tracking of the 3D skeleton through the OpenNI/NITE framework;

  2. The visuomotor learning module, which computes angles from the user’s skeletal representation and feeds the angle values to the proposed learning architecture. With each learning step, the architecture improves its predictive capabilities while providing future motor commands;

  3. The robot control module. Nao’s motors are angle-controlled utilizing the proprietary NaoQi framework. The latter processes the motor commands and relays them to the microcontrollers of the robot, which in our case is a local simulated Nao.

Our main contribution is the visuomotor learning module which performs incremental adaptation and early prediction of human motion patterns. Although the current setup uses a simulated environment, we will consider a further extension of the experiments towards the real robot. Therefore, we simulate the same amount of motor response latency as has been quantified in the real Nao robot, being between to ms [Zhong2012]. This latency could be even higher due to reduced motor performance, friction or worn hardware. Visual sensor latency on the other hand, for an RGB and depth resolution of 640x480, together with the computation time required by the skeleton estimation middleware, can peak up to  ms [livingston2012performance]. Taking into consideration also possible transmission delays due to connectivity issues, we assume a maximum of  ms of overall sensorimotor latency in order to carry out the experiments described in Section V.

Fig. 4: Behavior of the network during three learning iterations on an unseen sequence of robot joint angles. From top to bottom: the skeleton model of the visual sequence, the ground truth data of robot joint angles, the values predicted by the network, and the absolute value of the prediction error on the trajectory over time (the red dashed line indicates the statistical trend). As can be seen from the decreasing trend, the network adapts to the input after a few time steps.

IV-B Data acquisition and representation

The motion sequences were collected with an Asus Xtion Pro camera operating at 30 frames per second. This type of sensor is capable of providing synchronized color information and depth maps at a reduced power consumption and weight, making it a more suitable choice than a Kinect for being placed on our small humanoid robot. Moreover, it offers a reliable and markerless body tracking method [HanMicrosoft] which makes the interface less invasive. The distance of each participant from the visual sensor was not fixed, yet maintained within the maximum range for the proper functioning of the depth sensor. To attenuate noise, we computed the median value for each body joint every 3 frames resulting in 10 joint position vectors per second.
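The noise attenuation step described above can be sketched as a block-wise median. This is a minimal illustration assuming non-overlapping blocks of three frames and a flat list of coordinates per frame; the function name `median_downsample` and the toy data are hypothetical.

```python
from statistics import median

def median_downsample(frames, block=3):
    """Per-coordinate median over consecutive, non-overlapping blocks of frames.

    frames : list of frames, each a flat list of joint coordinates
    block  : number of frames per block (3 frames at 30 fps gives 10 vectors per second)
    """
    reduced = []
    for start in range(0, len(frames) - block + 1, block):
        chunk = frames[start:start + block]
        reduced.append([median(column) for column in zip(*chunk)])
    return reduced   # trailing frames that do not fill a block are dropped

# Toy usage: the outlier 10.0 in the first block is suppressed by the median.
frames = [[0.0, 1.0], [10.0, 1.2], [0.2, 1.1],
          [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
          [9.0, 9.0]]
smoothed = median_downsample(frames)
```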

We chose joint angles to represent the demonstrator’s postures, as our aim is to both predict and reproduce motion patterns with the robot. Joint angles allow a straightforward reconstruction of the regressed motion without applying inverse kinematics, which may be difficult due to redundancy and leads to less natural movements. Nao’s arm kinematic configuration differs from the human arm in terms of degrees of freedom (DoF). For instance, the shoulder and the elbow joints have only two DoFs while human arms have three. For this reason, we compute only shoulder pitch and yaw and elbow yaw and roll from the skeletal representation by applying trigonometric functions and map them to the Nao’s joints by appropriate rotation of the coordinate frames. Wrist orientations are not considered since they are not provided by the OpenNI/NITE framework. Considering both arms, a total of 8 angle values represent one frame of body motion given as input to the visuomotor learning module.
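As an illustration of how an angle can be obtained from the skeletal representation with trigonometric functions, the sketch below computes the angle between the upper arm and the forearm from three 3D joint positions. It is only an assumption-laden example: the helper names are hypothetical, the joints are assumed to be distinct points, and the rotation of coordinate frames and the sign conventions needed to map the result to a specific Nao joint are not shown.

```python
import math


def angle_between(u, v):
    """Angle in radians between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


def elbow_flexion(shoulder, elbow, hand):
    """Angle between upper arm and forearm; 0 means a fully stretched arm."""
    upper_arm = [e - s for s, e in zip(shoulder, elbow)]
    forearm = [h - e for e, h in zip(elbow, hand)]
    return angle_between(upper_arm, forearm)


# Forearm perpendicular to the upper arm: a right angle.
print(math.degrees(elbow_flexion((0, 0, 0), (1, 0, 0), (1, 1, 0))))
```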

V Experimental Results

We conducted experiments with a set of movement patterns that were demonstrated either with one or with both arms simultaneously: raise arm(s) laterally, raise arm(s) in front, wave arm(s), rotate arms in front of the body both clockwise and counter-clockwise. Some of the movement patterns are illustrated in Fig. 3. In total, 10 different motion sequences were obtained, each repeated 10 times by three participants, leading to 30 demonstrations for each of the sequences. We first describe the incremental training procedure, then we assess and analyze in detail the prediction accuracy of the proposed learning method. We concentrate on the learning capabilities of the method while simulating a possible recurring malfunction of the visual system that leads to the loss of entire data chunks. We conclude with a mathematical model for choosing the optimal predicted value for a system with a variable delay.

V-A Hierarchical training

The training of our architecture is carried out layer-wise in an online manner. In the initialization phase, all networks are composed of two neurons with random weight vectors, i.e. carrying no relevant information about the input data. With the presentation of a sequence, the first GWR layer is trained in order to perform spatial vector quantization. Then the sequence is encoded as a trajectory of activated neurons as described in Eq. 2 and given as input to the GWR of the second layer. The same procedure is then repeated for the second layer until the training of the full architecture is performed. This procedure differs from batch learning methods in that a full pre-training of each layer is not necessary and the whole dataset need not be available from the beginning.

The learning parameters used throughout our experiments with the 10 arm movement sequences are listed in Table I. The selection of the range of parameters was made empirically while also considering the GWR algorithm learning factors. The activation threshold parameter, which modulates the number of neurons, was kept relatively high for all GWR networks. This was necessary since we want crisp clusters of the spatiotemporal patterns. A generous number of neurons in the Predictive GWR allows for a better input-output approximation, and thus a better data reconstruction during the prediction phase. The maximum edge age parameter, which modulates the removal of rarely used neurons, was set increasingly higher with each layer. The neurons activated less frequently in the lower layer may represent noisy input data samples, whereas in higher layers the neurons capture spatiotemporal dependencies which may vary significantly from sequence to sequence.

Parameter Value
Activation Threshold
Firing Threshold
Learning rates
Firing counter behavior
Maximum edge age {100, 200, 300}
Training epochs 50
TABLE I: Training parameters for each GWR network in our architecture for the incremental learning of sensorimotor patterns.

V-B Predictive behavior

Fig. 5: (a) Average growth of the three GWR networks in terms of number of neurons and standard deviation over learning epochs, (b) Overall prediction MSE (C.P.E) averaged over all learned sequences up to each learning epoch (in blue) and the prediction error (P.E.) computed between the predicted sequence and the sequence represented by the architecture (in red), (c) Prediction MSE computed between the predicted values and the ground truth input data.

We now assess the predictive capabilities of the proposed method while training occurs continuously in an incremental fashion. Considering that the data sample rate is 10 fps (see Section IV-B) we set a prediction horizon of 6 frames in order to compensate for the estimated delay of 600 ms. In addition to the calculation of the absolute value of the prediction error presented in Eq. 5 we analyze the overall prediction accuracy of the model by computing the mean squared error (MSE):


where is the sensory input and is the predicted value for time step and is the length of the sequence in terms of video time frames. Since the errors are squared, a relatively high weight is given to large errors. High magnitudes of error are particularly undesirable in our case since they would lead to incorrect joint angles commands given to the robot for execution, thereby causing abrupt movements.
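The two quantities discussed above, the prediction horizon and the MSE, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the error of one frame is assumed to be the squared Euclidean distance between the observed and the predicted joint-angle vectors, averaged over the sequence length.

```python
def horizon_in_frames(delay_ms, sample_rate_hz):
    """Number of frames to predict ahead to cover a given delay."""
    return round(delay_ms * sample_rate_hz / 1000)


def prediction_mse(observed, predicted):
    """Mean squared error between two equal-length sequences of vectors.

    Assumption: per-frame error is the squared Euclidean distance,
    and the mean is taken over the number of frames.
    """
    if not observed or len(observed) != len(predicted):
        raise ValueError("sequences must be non-empty and of equal length")
    total = 0.0
    for x, p in zip(observed, predicted):
        total += sum((xi - pi) ** 2 for xi, pi in zip(x, p))
    return total / len(observed)


# 600 ms of delay at 10 frames per second corresponds to 6 frames.
print(horizon_in_frames(600, 10))

# A single large deviation dominates the result because errors are squared.
print(prediction_mse([[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 3.0]]))
```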

During training, we analyzed the response of the Predictive GWR network when a new joint angle sequence was provided as input. An example is shown in Fig. 4. We observed that, except in cases of highly noisy trajectories, the network adapted to the input after a few time steps, e.g. after three learning iterations, as can be deduced from the descending line of the absolute value of the prediction error illustrated in Fig. 4. Additionally, we conducted a quantitative study of the behavior of our architecture when trained on the dataset of 10 arm movements in an incremental fashion. For this, we presented the data sequences one at a time and let the network perform 50 learning iterations (epochs) on each. Over the whole set of arm movements, the learning therefore took 500 epochs in total. We then varied the order of the sequences, ran the same experiment, and averaged the final results. In this way, the behavior analysis does not depend on the order of the data given during training.

Results of the networks’ growth in terms of the number of neurons, the prediction MSE on each training sequence, and the overall MSE over epochs are reported in Fig. 5. As expected, the prediction error increases as soon as a new sequence is introduced. This explains the high peaks depicted in Fig. 5(c), which are immediately followed by a fast decrease of the error. The overall prediction error, i.e. the error averaged over all learned sequences, remains approximately constant. This means that the new knowledge added to the networks does not significantly affect the overall performance, which is a desirable feature for an incremental learning approach. Furthermore, we observe a growth of the three GWR networks, which is an understandable consequence of the fact that the movement patterns are very different from each other. In fact, the first GWR, which performs quantization of the spatial domain, converges to a much lower number of neurons, whereas the higher layers have to capture a high variance of spatiotemporal patterns. However, the computational complexity of a prediction step is $O(n)$, where $n$ is the number of neurons. Thus the growth of the network does not introduce significant computational cost. The rapid decrease of the prediction error shows that the system learns newly introduced patterns quickly, making the method suitable for an adaptive prediction of motion trajectories in our synchronization task.

Fig. 6: Prediction mean squared error (MSE) versus the number of neurons in the Predictive GWR.

In the experiments described so far, we set a relatively high activation threshold parameter, which led to a continuous growth of the GWR networks. Thus, we further investigated how a decreased number of neurons in the Predictive GWR would affect the overall prediction error. For this purpose, we fixed the weight vectors of the first two layers after they had been trained on the entire dataset, and ran the incremental learning procedure on the Predictive GWR multiple times, each time with a different activation threshold parameter. We observed that a lower number of neurons, obtained through lower threshold values, led to quite high values of the mean squared error (Fig. 6). However, due to the hierarchical structure of our architecture, the quantization error can be propagated from layer to layer. It is expected that similar performance can be reproduced with a lower number of neurons in the Predictive GWR network when a lower quantization error is obtained in the preceding layers.

V-C Learning with missing sensory data

In addition to the previous experiments, we also analyze how the predictive performance of the network changes when it is trained on input data produced by a faulty visual sensor. We simulate a recurring loss of entire input data chunks in the following way: during the presentation of a motion pattern, we randomly choose time steps at which a whole second of data samples (i.e. 10 frames) is eliminated. The network is trained for 50 epochs on a motion pattern, each time with a different missing portion of information. We repeat the experiment, increasing the occurrence of this event until up to 30% of the data is compromised, and measure how much the overall prediction error increases. Results are averaged over epochs and are presented in Fig. 7. As can be seen, there is a slight increase of the prediction MSE for higher values of data loss. This means that the network can still learn and predict motion sequences even under such circumstances. On the other hand, the standard deviation shows that higher magnitudes of error can be obtained for larger data loss. However, this issue can be mitigated by the prediction error threshold mechanism explained in Section III-D.
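The data-loss simulation can be sketched as below. This is only one possible reading of the procedure: the function name `drop_chunks` is hypothetical, and the sketch assumes that the removed one-second chunks are aligned to multiples of the chunk length so that they never overlap, which the text does not specify.

```python
import random


def drop_chunks(sequence, n_chunks, chunk_len=10, rng=None):
    """Return a copy of `sequence` with `n_chunks` whole chunks removed.

    sequence: list of frames (10 frames correspond to one second at 10 fps).
    Chunk positions are drawn at random from non-overlapping, aligned slots
    (an assumption made here to keep the amount of lost data exact).
    """
    rng = rng or random.Random()
    n_slots = len(sequence) // chunk_len
    if n_chunks > n_slots:
        raise ValueError("cannot drop more chunks than the sequence contains")
    dropped = set(rng.sample(range(n_slots), n_chunks))
    return [frame for i, frame in enumerate(sequence)
            if i // chunk_len not in dropped]


# Removing 3 chunks from a 100-frame sequence compromises 30% of the data.
sequence = list(range(100))
faulty = drop_chunks(sequence, n_chunks=3, rng=random.Random(0))
print(len(faulty))
```

Calling the function with a fresh random generator in every epoch would give a different missing portion each time, as in the experiment described above.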

Fig. 7: Prediction MSE averaged over 50 epochs of training on each motion pattern. In spite of increasing percentage of data loss, the MSE does not grow linearly but rather stays almost constant.

V-D Compensating a variable delay

Experimental results reported so far have accounted for compensating a fixed time delay which has been measured empirically by generating motor behavior with the real robot. However, the proposed architecture can also be used when the delay varies due to changes in the status of the hardware. In this case, given the configuration of the robot at time step in terms of joint angle values , where is the time delay estimation, the optimal predicted angle values to execute in the next step can be chosen in the following way:

where are the predictions computed up to a maximum of the prediction horizon.

The application of this prediction step requires a method for the estimation of the time-delay , which is out of the scope of this work. Current time-delay estimation techniques mainly cover constant time delays, random delay with a specific noise characteristic, or restricted dynamic time delay [sargolzaei2016sensorimotor], which nonetheless do not address uncertainty affecting real-world robot applications. Computational models inspired by biology have also been proposed for the time-delay estimation [sargolzaei2016sensorimotor]. However, these models assume knowledge of the sensorimotor dynamics.

VI Discussion

VI-A Summary

In this paper, we presented a self-organized hierarchical neural architecture for sensorimotor delay compensation in robots. In particular, we evaluated the proposed architecture in an imitative learning scenario, in which a simulated Nao robot had to learn visually demonstrated arm movements and reproduce them in synchrony with the teacher signal. Visuomotor sequences were extracted in the form of joint angles, which can be computed from a body skeletal representation in a straightforward way. Sequences generated by multiple users were learned using hierarchically-arranged GWR networks equipped with an increasingly large temporal window.

The prediction of the visuomotor sequences was obtained by extending the GWR learning algorithm with a mapping mechanism of input and output vectors lying in the temporal domain. We conducted experiments with a dataset of 10 arm movement sequences showing that our system achieves low prediction error values on the training data and can adapt to unseen sequences in an online manner. Experiments also showed that a possible system malfunction causing loss of data samples has a relatively low impact on the overall performance of the system.

VI-B Growing self-organization and hierarchical learning

The hierarchical arrangement of Growing When Required (GWR) networks equipped with a window in time memory is appealing due to the fact that it can dynamically change the topological structure in an unsupervised manner and learn increasingly more complex spatiotemporal dependencies of the data in the input space. This allows for a reuse of the neurons during sequence encoding, having learned prototypical spatiotemporal patterns out of the training sequences. Although this approach seems to be quite resource-efficient for the task of learning visuomotor sequences, the extent to which neurons are reused is tightly coupled with the input domain. In fact, in our experiments with input data samples represented as multi-dimensional vectors of both arms’ shoulder and elbow angles, there was little to no overlap between training sequences. This led to an inevitable growth of the network with each sequence presentation.

The parameters modulating the growth rate of each GWR network are the activation threshold and the firing counter threshold. The activation threshold establishes the maximum discrepancy between the input and the prototype neurons in the network. The larger we set the value of this parameter, the smaller the discrepancy, i.e. the quantization error of the network. The firing counter threshold is used to ensure the training of recently added neurons before creating new ones. Thus, smaller thresholds lead to more training of existing neurons and slower creation of new ones, again favoring better network representations of the input. Intuitively, the smaller the discrepancy between the input and the network representations, the smaller the input reconstruction error during the prediction phase. However, less discrepancy also means more neurons. This did not prove to be a major issue in our experiments, since redundant neurons did not significantly affect the computation of the predicted values.
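The interplay of the two thresholds can be illustrated with the neuron insertion test of the standard GWR algorithm, on which the networks in this paper are based. The sketch below is an assumption-labeled simplification and not the authors' code: it uses the exponential activation of the original GWR formulation, a firing counter that starts at 1 and decays as a neuron fires, and hypothetical names (`activation`, `should_grow`, `a_T`, `f_T`).

```python
import math


def activation(x, weight):
    """Activation of a neuron for input x: 1 for a perfect match, near 0 when far."""
    return math.exp(-math.dist(x, weight))


def should_grow(x, best_weight, best_firing_counter, a_T, f_T):
    """Decide whether a new neuron is inserted for input x.

    A neuron is added only if the best-matching neuron represents the input
    poorly (activation below a_T) AND has already been trained enough
    (firing counter decayed below f_T).
    """
    poorly_matched = activation(x, best_weight) < a_T
    well_trained = best_firing_counter < f_T
    return poorly_matched and well_trained


x, w = [1.0, 0.0], [0.0, 0.0]           # activation is exp(-1), about 0.37
print(should_grow(x, w, 0.05, a_T=0.9, f_T=0.1))  # high threshold, trained neuron
print(should_grow(x, w, 1.00, a_T=0.9, f_T=0.1))  # freshly added neuron
print(should_grow(x, w, 0.05, a_T=0.3, f_T=0.1))  # low activation threshold
```

Under these assumptions, a higher `a_T` makes the first condition easier to satisfy and therefore produces more neurons, while a smaller `f_T` forces existing neurons to fire more often before growth is allowed.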

The use of joint angles as visuomotor representations may seem to be a limitation of the proposed architecture due to the fact that it requires sensory input and robot actions to share the same representational space. For instance, in an object manipulation task, this requirement is not satisfied, since the visual feedback would be the position information given by the object tracking algorithm. This issue can be addressed by including both the position information and the corresponding robot joint angles as input to our architecture. Due to the associative nature of self-organizing networks and their capability to function properly when receiving an incomplete input pattern, only the prediction of the object movement patterns would trigger the generation of corresponding patterns of the robot behavior.

VI-C Future work

An interesting direction for future work is the extension of the current implementation towards the autonomous generation of robot movements that account for both synchronization with the visual feedback and a given action goal. For this purpose, the implementation of bidirectional Hebbian connections would have to be investigated in order to connect the last layer of the proposed architecture with a symbolic layer containing action labels [Parisi2016][schrodt2016just]. It would be interesting to investigate how such a symbolic layer can modulate generated movement patterns when they diverge from the final goal.

The human-robot interaction experiments reported in this paper were carried out offline, in the sense that the synchronization was evaluated on an acquired data set of motion patterns. Future experiments will consider a human-robot interaction user study in which participants will be able to teach the motion patterns directly to the robot. In the context of robots learning from demonstration [Billard2016], immediate motion imitation is at the basis of learning more complex behavior patterns which require the acquisition, fine-tuning, and combination of motor primitives [ito2004line][arie2012imitating]. Therefore, future work directions should consider an extension towards storing and recalling of complex activities such as goal-oriented actions.


The authors gratefully acknowledge partial support by the EU- and City of Hamburg-funded program Pro-Exzellenzia 4.0, the German Research Foundation DFG under project CML (TRR 169), and the Hamburg Landesforschungsförderungsprojekt.


Luiza Mici received her Bachelor’s and Master’s degree in Computer Engineering from the University of Siena, Italy. Since 2015, she is a research associate and PhD candidate in the Knowledge Technology Group at the University of Hamburg, Germany, where she was part of the research project CML (Crossmodal Learning). Her main research interests include perception and learning, neural network self-organization, and bio-inspired action recognition.

German I. Parisi received his Bachelor’s and Master’s degree in Computer Science from the University of Milano-Bicocca, Italy. In 2017 he received his PhD in Computer Science from the University of Hamburg, Germany, where he was part of the research project CASY (Cognitive Assistive Systems) and the international PhD research training group CINACS (Cross-Modal Interaction in Natural and Artificial Cognitive Systems). In 2015 he was a visiting researcher at the Cognitive Neuro-Robotics Lab at the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. Since 2016 he is a research associate of international project Transregio TRR 169 on Crossmodal Learning in the Knowledge Technology Institute at the University of Hamburg, Germany. His main research interests include neurocognitive systems for human-robot assistance, computational models for multimodal integration, neural network self-organization, and deep learning.

Stefan Wermter is Full Professor at the University of Hamburg and Director of the Knowledge Technology institute. He holds an MSc from the University of Massachusetts in Computer Science, and a PhD and Habilitation in Computer Science from the University of Hamburg. He has been a research scientist at the International Computer Science Institute in Berkeley, US before leading the Chair in Intelligent Systems at the University of Sunderland, UK. His main research interests are in the fields of neural networks, hybrid systems, neuroscience-inspired computing, cognitive robotics and natural communication. He has been general chair for the International Conference on Artificial Neural Networks 2014. He is an associate editor of the journals “Transactions of Neural Networks and Learning Systems”, “Connection Science”, “International Journal for Hybrid Intelligent Systems” and “Knowledge and Information Systems” and he is on the editorial board of the journals “Cognitive Systems Research”, “Cognitive Computation” and “Journal of Computational Intelligence”. Currently he serves as Co-coordinator of the DFG-funded SFB/Transregio International Collaborative Research Centre on “Crossmodal Learning” and is coordinator of the European Training Network SECURE on Safe Robots.
