
# Global Navigation Using Predictable and Slow Feature Analysis in Multiroom Environments, Path Planning and Other Control Tasks

Stefan Richthofer (stefan.richthofer@ini.rub.de, corresponding author), Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany
Laurenz Wiskott (laurenz.wiskott@ini.rub.de), Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany
###### Abstract

Extended Predictable Feature Analysis (PFAx) [Richthofer and Wiskott, 2017] is an extension of PFA [Richthofer and Wiskott, 2015] that allows generating a goal-directed control signal of an agent whose dynamics has previously been learned during a training phase in an unsupervised manner. PFAx hardly requires assumptions or prior knowledge of the agent’s sensor or control mechanics, or of the environment. It selects features from a high-dimensional input by intrinsic predictability and organizes them into a reasonably low-dimensional model.

While PFA obtains a well predictable model, PFAx yields a model ideally suited for manipulations with predictable outcome. This allows for goal-directed manipulation of an agent and thus for local navigation, i.e. for reaching states where intermediate actions can be chosen by a permanent descent of distance to the goal. The approach is limited when it comes to global navigation, e.g. involving obstacles or multiple rooms.

In this article, we extend theoretical results from [Sprekeler and Wiskott, 2008], enabling PFAx to perform stable global navigation. So far, the most widely exploited characteristic of Slow Feature Analysis (SFA) was that slowness yields invariances. We focus on another fundamental characteristic of slow signals: they tend to yield monotonicity, and one significant property of monotonic functions is that local optimization suffices to find a global optimum.

We present an SFA-based algorithm that structures an environment such that navigation tasks hierarchically decompose into subgoals. Each of these can be efficiently achieved by PFAx, yielding an overall global solution of the task. The algorithm needs to explore and process an environment only once and can then perform all sorts of navigation tasks efficiently. We support this algorithm by mathematical theory and apply it to different problems.

## 1 Introduction

The original motivation of this work is based on the idea to apply the unsupervised learning algorithm Slow Feature Analysis (SFA) [Wiskott and Sejnowski, 2002] to interactive scenarios. The motivation for this idea is based on the experience that SFA was successfully used in various (passive) analysis tasks that closely relate to such scenarios, e.g. learning place cells [Franzius et al., 2007, Schönfeld and Wiskott, 2015], identifying objects invariant under spatial transformations [Franzius et al., 2008, Berkes and Wiskott, 2002, Franzius et al., 2011], blind source separation [Sprekeler et al., 2014], and visual tasks like face recognition and age estimation [Escalante-B. and Wiskott, 2013]. In previous work, we recognized predictability as a crucial feature for tackling interactive scenarios, as these require estimation of the consequences of possible actions. This led to the development of Predictable Feature Analysis (PFA) [Richthofer and Wiskott, 2015, Richthofer and Wiskott, 2017], an algorithm strongly inspired by SFA – while SFA selects features by slowness, PFA selects them by predictability. Before we get into more detail of these algorithms, we briefly collect possible application fields.

Path planning of mobile robots is an application area that closely fits our implicit prototype assumptions. We imagine a robot in an environment that perceives sensory input of some kind and emits an action signal that controls its motors (c.f. Figure 1). A naturally arising task is to control the robot such that it reaches a desired state in the environment.

SFA has been frequently applied to model a rather similar scenario concerning a rat instead of a robot. Of course, it was not attempted to control the rat, but to obtain biologically plausible phenomena like place cells. A more general interactive problem setting is reinforcement learning (RL), where an agent is likewise acting in an environment, aiming for a maximal accumulated reward over time. In this fashion, that setting extends our notion of a sensor signal and a control signal by a reward signal. Control theory of dynamical systems, involving tasks like pendulum swing-up or pole-and-cart balancing, also fits the notion of an action/perception loop illustrated in figure 1. The phase space of the system can be seen as the environment in this case. A rich repertoire of work exists that links these fields in various ways. Traditionally, RL algorithms are applied to path planning, or dynamical systems. We list a selection of such articles throughout this section.

As a unifying notion of the named areas’ essentials we stick to the idea of controlling an agent in an environment, aiming for a specific goal state. Environment, agent and control are represented as abstract, continuous sensor and control signals. Based on this, we perceive the navigation into a goal state as an optimization problem. This matches the setting we tackled with PFAx in [Richthofer and Wiskott, 2017] and we continue with a summary of that approach.

### 1.1 Predictable Feature Analysis (Extended)

Predictable Feature Analysis (PFA) [Richthofer and Wiskott, 2015] is an unsupervised learning algorithm that was developed to efficiently turn high-dimensional input data into a low-dimensional model consisting of well predictable features.

In [Richthofer and Wiskott, 2017] we have shown that by using an extension to PFA – namely PFAx – it is possible to learn well controllable features that are sufficient to solve local navigation tasks. By taking supplementary information into account for prediction, PFAx can find features that are ideally predictable based on themselves and under the assumption that a supplementary signal can be used as a helper for prediction. Such a supplementary signal does – however – not participate in feature extraction. Providing the control signal from the RL setting (specific action chosen at each time-step) as supplementary information, we can obtain features that strongly depend on the supplementary information in terms of predictability. By inverting that relation we can compute the control signal that would most likely yield a specific desired outcome, given the agent’s current state. In this sense the obtained features are well controllable.

Solving a complex navigation task usually cannot be achieved within a single time step, so in each time step we transformed the agent’s state as far as possible towards the goal state (greedy optimization), using the least squares distance in feature space as a cost function. The resulting approximate gradient descent easily gets stuck in local optima, so PFAx on its own is only suitable for local navigation.

### 1.2 Approach in this work

A key observation is that Slow Feature Analysis (SFA) [Wiskott and Sejnowski, 2002] can be used to decompose a given environment into features that are represented as monotonic signals across the environment (see [Sprekeler and Wiskott, 2008]). We refer to these monotonic features as sources. Computing them from the usual slow features found by SFA requires additional processing, which is provided by the xSFA algorithm [Sprekeler et al., 2014]. The theoretical analysis of xSFA only covers the case that statistically independent sources exist. We extend this analysis in section 3.2 and establish a geometrical characterization of the solutions in terms of potential, monotonicity, geodesics and representation of the data manifold. A major contribution of this work is to clarify what (x)SFA-induced monotonicity means in higher dimensions. These results motivate the navigation algorithm proposed in section 3.3.

Obtaining the sources involves an extensive unsupervised exploration phase in which PFAx can learn the effects of the control signal and (x)SFA can learn a model of the environment. Based on the learned representation, any navigation problem can be solved by local descent on the monotonic feature representation, using the learned effects of the control signal. In figure 2 we illustrate this with an interval serving as a 1D environment. As a sensor representation we model five differently parametrized grid cells using overlapping Gaussians. The component (sf1) obtained by xSFA yields a monotonic representation of the environment. Concerning the goal distance measures on the right, imagine we wanted to move an agent from some position in the interval to a goal position elsewhere. Based on local techniques (imagine a limited perception range of, e.g., one or two units) it is impossible to efficiently find the goal in sensor space. Measuring distance by sf1, the same task – actually any navigation task – is well feasible.
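The figure-2 setup can be reproduced in a few lines. The following sketch is our own illustration (arbitrarily chosen trajectory, Gaussian widths and sample counts; plain linear SFA rather than full xSFA): it builds five Gaussian grid-cell sensors on an interval and extracts the slowest feature, which comes out approximately monotonic in position.

```python
import numpy as np

# Slow trajectory traversing the 1D interval [0, 1]
t = np.linspace(0.0, 400.0, 40001)
x = 0.5 + 0.5 * np.sin(0.05 * t)

# Five "grid cell" sensors: overlapping Gaussians over the interval
centers = np.linspace(0.1, 0.9, 5)
H = np.exp(-(x[:, None] - centers) ** 2 / (2 * 0.15 ** 2))

# Linear SFA: center, sphere, then diagonalize the covariance of the
# temporal derivative; the eigenvector with the smallest eigenvalue
# gives the slowest feature
Hc = H - H.mean(axis=0)
d, E = np.linalg.eigh(np.cov(Hc.T))
S = E @ np.diag(d ** -0.5) @ E.T       # sphering matrix
Z = Hc @ S                             # unit covariance
w, W = np.linalg.eigh(np.cov(np.diff(Z, axis=0).T))
sf1 = Z @ W[:, 0]                      # slowest feature (sf1)
```

In this toy setting sf1 correlates almost perfectly with position, so the squared sf1-distance to any goal value has a single minimum along the interval.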

A special case of the approach in this work was studied in [Metka et al., 2017]. Leveraging the monotonicity of the slowest SFA components, a robot is navigated in an approximately open environment around an obstacle. This demonstrates the feasibility of the method in principle. A major difference to the approach presented here is that the control of the robot is not learned, but assumed to be known. The estimation of the gradient is also done in a different manner. The navigation is based on a fixed selection of SFA components, which limits it to environments whose spatial dimensions are of roughly the same size, e.g. with quadratic or circular boundary. In that sense, our paper presents a generalization of that approach, vastly relaxing the geometrical requirements on the environment. However, we still require some non-geometric limitations on the environment:

• The environment is fully observable, i.e. each position in the environment yields a unique representation in sensor space

• The environment is stationary, i.e. constant over time, contains no blinking lights, no flickering colors or moving objects

The named limitations are not inherent; in section 5.2 we suggest extensions of the algorithm to overcome each of them. They are merely simplifying assumptions that let us focus on the core method in this work.

### 1.3 Connection to optimization

Using a distance measure (e.g. least squares distance in sensor or feature space) of the agent’s current state to a goal state as a cost function, a navigation task can be seen as an optimization problem. Complex tasks usually cannot be achieved within a single time step. This can be resolved in several ways, e.g.:

• An optimization problem could be defined on the space of possible paths rather than on a space of possible actions.

• Greedy optimization can be used, i.e. transforming the agent’s state as far as possible towards the goal state in each time step (c.f. gradient descent).

• Optimization can scope on finding a good policy.

Finding a policy is the typical approach in RL, which is discussed in the next section. Greedy optimization easily gets stuck in a local optimum, and finding the globally optimal path in an arbitrary environment is a general black-box global optimization problem. Without further assumptions on the environment, such a problem would in general not be convex, quadratic or anything else significantly useful. Historically, a wide range of techniques has been developed to deal with such black-box problems, e.g. evolutionary algorithms, swarm algorithms, convex relaxations, cutting plane methods, branch and bound methods, stochastic methods and many more. These techniques have in common that they are orders of magnitude slower than methods for efficiently optimizable problem domains like convex, quadratic or linear problems. Apart from that, they usually cannot guarantee convergence or actual optimality of the obtained result.

The method in this work can be seen as a reverse approach applicable to our specific setting. Instead of attempting to solve the resulting difficult optimization problem, we create the optimization problem such that it is efficiently solvable. Given that we already use a model to represent the agent’s state, we have a certain degree of freedom in forming this model. The presented method can intentionally form the model such that all possible goal states in the environment are efficiently achievable from any initial state.

In this context, we exploit SFA-induced monotonicity as a link between local and global optimization: Exactly on monotonic functions, the distance measure to a goal value has only a single optimum. Thus, exactly on these functions, local techniques like PFAx are sufficient for global optimization of goal-distance. However, for multidimensional environments there does not exist a single strictly monotonic component that can cover all possible goal states. Instead we will find a hierarchical decomposition into components to cover the whole environment, defining a precise sequence of efficiently and globally solvable optimization tasks to reach any desired goal state that exists in the environment, provided that it has been sufficiently explored.
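The link between monotonicity and local optimization can be made concrete with a one-dimensional sketch (our own illustration; the feature map tanh and all constants are arbitrary choices): on a strictly monotonic feature, greedy descent of the squared goal distance reaches the goal from any starting point.

```python
import numpy as np

def descend(f, x0, goal, lr=0.5, steps=2000):
    # Greedy local descent on the squared goal distance (f(x) - goal)^2,
    # using a numerical (finite-difference) gradient.
    x, eps = x0, 1e-6
    for _ in range(steps):
        grad = ((f(x + eps) - goal) ** 2 - (f(x - eps) - goal) ** 2) / (2 * eps)
        x -= lr * grad
    return x

f = np.tanh                            # a strictly monotonic "feature" of the state
x = descend(f, x0=-2.0, goal=f(1.0))   # converges to the goal state x = 1
```

With a non-monotonic feature (e.g. a single Gaussian sensor response), the same descent would stall on a plateau or in a local optimum, which is exactly the failure mode that motivates the monotonic sources.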

### 1.4 Connection to Reinforcement Learning

Navigating an agent in an environment has significant links to RL. However, RL usually considers an arbitrary reward signal, typically in discrete state and action spaces, though approaches for continuous RL exist. Navigation tasks and path planning can be encoded in the RL setting by measuring the distance of the agent’s current state from the goal state by a certain metric (e.g. Euclidean distance in sensor or feature space). Using an inversion of this measure (e.g. its negative or its reciprocal) as reward function would yield maximum reward at the goal. The main formal difference to the optimization view is that in RL the objective function, i.e. the accumulated future reward (in RL terms, the value function), is unknown and is usually learned, as in popular methods such as Q-learning. Apart from that, a reward function that is non-zero across wider ranges of the environment is untypical for RL. Usually reward is only given right at the goal and the value function is learned across multiple sessions.

In terms of RL, the approach in this work would mean that during an initial exploration phase, reward would be completely ignored and instead the topology of the environment and the dynamics of the agent would be exhaustively learned. Based on this, any reward signal that measures distance to some goal position can be maximized efficiently. Goal and start position can be arbitrary and any number of such tasks with varying start and goal can be performed efficiently based on a single initial exploration. This is possible, because the model is aligned to the environment and not the specific reward function. Thus, our approach is especially valuable if the goal is not known during exploration and if many tasks with different goals need to be performed in the same environment. Note that in terms of RL, our approach does not account for the exploration-vs-exploitation issue. It rather performs exhaustive exploitation after exhaustive exploration.

Proto value functions (PVFs) [Mahadevan and Maggioni, 2007] are an RL concept that shares some characteristics with the presented approach. They are frequently used to discover bottleneck states and options in RL settings. SFA has been used earlier to approximate PVFs, e.g. in [Luciw and Schmidhuber, 2012, Böhmer et al., 2013]. The extracted slow features can then be used as basis functions for linear models that solve the RL problem, such as LSTD [Lagoudakis et al., 2002].

Originally, the central building blocks of PVFs were Laplacian eigenmaps (LEMs) rather than slow features. LEMs are traditionally more affected by the curse of dimensionality in RL, as the dimensionality of the graph Laplacian depends on the number of data points. [Sprekeler, 2011] provides the missing link between SFA and LEMs. Specifically, that work identifies conditions under which the problem settings of SFA and LEMs become equivalent and describes how LEMs can be approximated by slow features. The application of SFA as a replacement for LEMs proposed in [Luciw and Schmidhuber, 2012, Böhmer et al., 2013] was originally enabled by the named article. For our purpose, slow features yield another crucial advantage over LEMs: SFA’s central optimization problem can be analyzed by Sturm-Liouville theory, which allows one to formally prove monotonicity results for slow features [Sprekeler and Wiskott, 2008].

In this work we primarily take the optimization perspective onto the navigation setting. Our approach is rather driven by xSFA and its theoretical implications and we propose that based on results of a sufficiently converged xSFA processing, it is already feasible to fully solve our setting by local and efficient optimization techniques.

### 1.5 Connection to path planning

Path planning of mobile robots closely fits our assumed setting as it deals with finding a safe path to navigate a robot through a complex environment. Work in this area usually assumes a specific goal location. A key difference is that the robot’s dynamics are usually known and focus is fully on planning the path. Our approach on the other hand does not incorporate a safety criterion, but this could be modeled as an additional part of the sensor.

The optimization issues we stated in section 1.3 are widely recognized in this field, especially avoidance of local optima is a central concern [Warren, 1989]. We pointed out the nature of a black box optimization problem and indeed various typical black box optimization approaches have been applied to path planning: Evolutionary methods [Vadakkepat et al., 2000], neural networks [Engedy and Horvath, 2009], particle swarm optimization [Kun Su and Hu, 2015], ant colony optimization [TAN et al., 2007, Xu et al., 2017] and others [Garrido et al., 2006]. Path planning has also been frequently approached using RL [Romero-Martí et al., 2016, Zuo et al., 2014, Singh et al., 1994, Igarashi, 2002, Kollar and Roy, 2008].

### 1.6 Other related work

We give an overview of various other more or less closely related approaches. Some papers are listed because they apply a hierarchical decomposition of some sort to RL scenarios, others are listed because they deal with slowness or predictability.

[McGovern and Barto, 2001] and [Stolle and Precup, 2002] use diverse density to discover bottleneck states as useful subgoals for RL tasks. This has some parallels to the algorithm in this paper in the sense that bottlenecks occur as special states. In A.2 we explicitly study the behavior of SFA around a bottleneck and suggest in a side note how slow features can be used as a bottleneck detector.

[Stachenfeld et al., 2014] suggests how a hierarchical decomposition of a problem space can be achieved using the successor representation and its eigenvalue decomposition. That work draws a number of links to studies of animal behavior, observations in the hippocampus and to cognitive maps. It contains many interesting notes regarding biological plausibility of decomposition approaches.

[Botvinick et al., 2009] addresses the scaling problem/curse of dimensionality in RL. They develop a hierarchical notion of RL (HRL) in a model-free actor-critic approach. The work features an extensive discussion of implications for neuroscience and psychology.

[Böhmer et al., 2015] gives an overview of methods for autonomous RL directly based on sensor observations. The mainly discussed algorithms are deep autoencoders and SFA. They mention slow distractors (e.g. the position of a slowly moving sun in an outdoor scenario) as a typical issue of SFA. This supports our idea of combining SFA with predictability in the form of PFAx, which should be able to discard slow distractors. PFAx can, however, be affected by predictable distractors, but these can be identified by the coefficients of the matrix incorporating the control signal. Finally, they explicitly point out the idea of combining notions of slowness with predictability, referencing the following work:

[Jonschkowski and Brock, 2013] combines notions of slowness and predictability to learn state representations for RL using a neural network. To combine these notions they propose a hybrid cost function consisting of arbitrarily weighted terms for slowness, predictability and non-constantness.

There are a number of approaches related to SFA, PFA or PFAx we discussed in a little more detail in [Richthofer and Wiskott, 2017]: Contingent Feature Analysis (CFA) [Sprague, 2014], Forecastable Component Analysis (ForeCA) [Goerg, 2013], Graph-based Predictable Feature Analysis (GPFA) [Weghenkel et al., 2017], Predictive Projections [Sprague, 2009], Neighborhood Components Analysis (NCA) [Goldberger et al., 2004], A Canonical Analysis of Multiple Time Series [Box and Tiao, 1977].

## 2 Local navigation using predictable features with supplementary information (PFAx)

We start with a summary of the PFAx algorithm [Richthofer and Wiskott, 2017], which extends the PFA algorithm [Richthofer and Wiskott, 2015] to incorporate supplementary information. Later we extend the method to enable global navigation. Given an $n$-dimensional input signal $x(t)$, PFA’s objective is to find the $r$ most predictable output components, referred to as “predictable features”. PFAx additionally considers a supplementary signal $u(t)$ and extracts components from $x(t)$ such that they are most predictable if $u(t)$ can be used as an additional helper for the prediction.

Like SFA, PFAx performs a linear extraction, but can incorporate a non-linear expansion as a preprocessing step. For this we usually use monomials up to a certain degree, which essentially yields a polynomial extraction overall (for higher degrees, Legendre or Bernstein polynomials should be favored over monomials because of better numerical stability). Note that by the Stone-Weierstrass theorem this technique can approximate any continuous function and moreover also regulated functions (piecewise continuous). However, high degree expansion can require significant cost in terms of training data and computation. Applying PFA hierarchically, as is done with SFA in [Franzius et al., 2007, Schönfeld and Wiskott, 2015], can help to keep these costs tractable.
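A minimal version of such a monomial expansion might look as follows (a sketch of our own; real implementations, e.g. the polynomial expansion node in the MDP toolkit, may differ in ordering and normalization):

```python
from itertools import combinations_with_replacement
import numpy as np

def monomial_expansion(X, degree=2):
    # X: (N, n) data matrix; returns all monomials of the input
    # variables up to the given degree, excluding the constant term.
    N, n = X.shape
    feats = [np.prod(X[:, idx], axis=1)
             for d in range(1, degree + 1)
             for idx in combinations_with_replacement(range(n), d)]
    return np.column_stack(feats)

# For (x1, x2) = (2, 3) and degree 2: x1, x2, x1^2, x1*x2, x2^2
feats = monomial_expansion(np.array([[2.0, 3.0]]), degree=2)
```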

### 2.1 Recall PFAx

In the PFAx setting, predictability is measured by linear autoregressive processes, which are widely used to model time-related problems. That means each value of an extracted signal should be as predictable as possible by a linear combination of its recent values.

This yields the problem of finding vectors $a$ and $b = (b_1,\dots,b_p)^T$ such that

$$a^T z(t) \overset{!}{\approx} \big(b_1 L^1 + \dots + b_p L^p\big)\, a^T z(t) \qquad (1)$$

where $L$ denotes the lag operator (also backshift operator), i.e. $L^i z(t) = z(t-i)$, and $z(t)$ is the expanded representation of our input signal $x(t)$, sphered over a finite training phase with average notation $\langle\cdot\rangle$:

$$\begin{aligned} \tilde z(t) &:= h(x(t)) - \langle h(x(t))\rangle &&\text{(make mean-free)} &&(2)\\ z(t) &:= S\,\tilde z(t) \quad\text{with}\quad S := \langle\tilde z\tilde z^T\rangle^{-\frac12},\ \ z(t)\in\mathbb{R}^n &&\text{(normalize covariance)} &&(3) \end{aligned}$$
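Steps (2) and (3) can be sketched directly in code (a minimal illustration with random data; the symmetric inverse square root is computed via an eigendecomposition):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(10000, 4)) @ rng.normal(size=(4, 4))  # correlated raw features

Zt = H - H.mean(axis=0)                 # make mean-free, cf. (2)
C = Zt.T @ Zt / len(Zt)
d, E = np.linalg.eigh(C)
S = E @ np.diag(d ** -0.5) @ E.T        # S = <z~ z~^T>^(-1/2)
Z = Zt @ S                              # sphered signal: unit covariance, cf. (3)
```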

(1) can be formulated as a least squares optimization problem over the training phase. We extend the problem to multiple output components, and to avoid trivial or repeated solutions we add constraints that require them to have unit variance and to be pairwise uncorrelated:

$$\text{For } i\in\{1,\dots,r\}\text{:}\qquad \underset{a_i,\,b}{\operatorname{minimize}}\ \Big\langle\big(a_i^T z - (b_1 L^1 + \dots + b_p L^p)\,a_i^T z\big)^2\Big\rangle \quad\text{s.t.}\quad \langle(a_i^T z)^2\rangle = 1,\ \ \langle(a_i^T z)(a_j^T z)\rangle = 0\ \ \forall j<i \qquad (4)$$

Because of the sphering, i.e. $\langle z\rangle = 0$ and $\langle zz^T\rangle = I$, the constraints simplify to

$$a_i^T a_j = \delta_{ij} \qquad (5)$$

With $A_r := (a_1,\dots,a_r)$, constraint (5) is automatically fulfilled if we choose

$$A_r = A\, I_r \quad\text{with}\quad A\in O(n) \qquad (6)$$

$O(n)$ denotes the space of orthogonal transformations, i.e. $A^T A = A A^T = I$, and $I_r\in\mathbb{R}^{n\times r}$ denotes the reduced identity matrix consisting of the first $r$ Euclidean unit vectors as columns.

Problem (4) is not readily solvable. As a prerequisite for a solvable relaxation we define $m(t) := A_r^T z(t)$ and extend the prediction model to matrix notation (7). To keep things compact we directly switch to the PFAx notion by incorporating supplementary information in (8).

$$\begin{aligned} m(t)\ \overset{!}{\approx}\ & B_1\, m(t-1) + \dots + B_p\, m(t-p) &&\text{with } B_i\in\mathbb{R}^{r\times r} &&(7)\\ &+\ U_1\, u(t-1) + \dots + U_q\, u(t-q) &&\text{with } U_i\in\mathbb{R}^{r\times n_u} &&(8)\\ =\ & B\,\underline{A_r}^T\underbrace{\operatorname{vec}\!\big((L^1 z,\dots,L^p z)(t)\big)}_{=:\ \zeta(t)}\ +\ U\,\underbrace{\operatorname{vec}\!\big((L^1 u,\dots,L^q u)(t)\big)}_{=:\ \mu(t)} && &&(9) \end{aligned}$$

(9) uses the block matrix notation $B := (B_1,\dots,B_p)$ and $U := (U_1,\dots,U_q)$.
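The stacked vectors $\zeta(t)$ and $\mu(t)$ are just concatenations of the $p$ (resp. $q$) most recent past values. A small helper might look like this (a sketch; the index conventions are our own):

```python
import numpy as np

def stacked_lags(X, p):
    # Rows of X are samples over time. Row j of the result corresponds
    # to t = p + j and concatenates X[t-1], ..., X[t-p], i.e.
    # vec((L^1 x, ..., L^p x)(t)).
    N, _ = X.shape
    return np.column_stack([X[p - k:N - k] for k in range(1, p + 1)])

zeta = stacked_lags(np.arange(5.0).reshape(5, 1), p=2)
```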

In (19) we will state the PFAx optimization problem in terms of (9), but that formulation requires a formula expressing the prediction matrices $B$ and $U$ in terms of the extraction matrix $A_r$. Matrix calculus allows us to compute the ideal prediction matrices for a given extraction $A_r$:

$$\begin{aligned} B(A_r) &:= \big(A_r^T\langle z\zeta^T\rangle - U(A_r)\,\langle\mu\zeta^T\rangle\big)\,\underline{A_r}\,\big(\underline{A_r}^T\langle\zeta\zeta^T\rangle\,\underline{A_r}\big)^{-1} &&(10)\\ U(A_r) &:= \big(A_r^T\langle z\mu^T\rangle - B(A_r)\,\underline{A_r}^T\langle\zeta\mu^T\rangle\big)\,\langle\mu\mu^T\rangle^{-1} &&(11) \end{aligned}$$

This uses the shortcut notation $\underline{M}$, defined for any matrix $M\in\mathbb{R}^{n\times m}$:

$$\underline{M} := I_{p,p}\otimes M = \begin{pmatrix} M & & 0\\ & \ddots & \\ 0 & & M \end{pmatrix}\ (p\ \text{times}),\qquad \underline{M}\in\mathbb{R}^{np\times mp} \qquad (12)$$
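In code, this block-diagonal replication is simply a Kronecker product with the identity, for instance:

```python
import numpy as np

M = np.array([[1.0, 2.0],
              [3.0, 4.0]])
p = 3
M_bar = np.kron(np.eye(p), M)   # p copies of M on the diagonal, zeros elsewhere
```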

Equations (10) and (11) are derived as follows. For a given $A_r$, the optimal $B$ and $U$ must solve

$$\underset{B,\,U}{\operatorname{minimize}}\ f(B,U) := \big\langle\,\big\|A_r^T z - B\,\underline{A_r}^T\zeta - U\mu\big\|^2\,\big\rangle \qquad (13)$$

Writing the squared norm as a trace, we can expand this to

$$\begin{aligned} f(B,U) = \operatorname{Tr}\Big(&-2\,B\,\underline{A_r}^T\langle\zeta z^T\rangle A_r + B\,\underline{A_r}^T\langle\zeta\zeta^T\rangle\,\underline{A_r}\,B^T + B\,\underline{A_r}^T\langle\zeta\mu^T\rangle U^T &&(14)\\ &-2\,U\langle\mu z^T\rangle A_r + U\langle\mu\zeta^T\rangle\,\underline{A_r}\,B^T + U\langle\mu\mu^T\rangle U^T\Big) + \text{const} \end{aligned}$$

and set its matrix derivatives to zero:

$$\begin{aligned} \frac{\partial}{\partial B} f(B,U) &= -2\,A_r^T\langle z\zeta^T\rangle\,\underline{A_r} + 2\,U\langle\mu\zeta^T\rangle\,\underline{A_r} + 2\,B\,\underline{A_r}^T\langle\zeta\zeta^T\rangle\,\underline{A_r} \overset{!}{=} 0 &&(15)\\ \frac{\partial}{\partial U} f(B,U) &= -2\,A_r^T\langle z\mu^T\rangle + 2\,B\,\underline{A_r}^T\langle\zeta\mu^T\rangle + 2\,U\langle\mu\mu^T\rangle \overset{!}{=} 0 &&(16) \end{aligned}$$

Solving (15) for $B$ yields (10) and solving (16) for $U$ yields (11). In (10) and (11), $B(A_r)$ and $U(A_r)$ are defined implicitly in terms of each other. By inserting (11) into (10) and solving for $B(A_r)$ we get the explicit formula

$$B(A_r) = A_r^T\big(\langle z\zeta^T\rangle - \langle z\mu^T\rangle\langle\mu\mu^T\rangle^{-1}\langle\mu\zeta^T\rangle\big)\,\underline{A_r}\,\Big(\underline{A_r}^T\big(\langle\zeta\zeta^T\rangle - \langle\zeta\mu^T\rangle\langle\mu\mu^T\rangle^{-1}\langle\mu\zeta^T\rangle\big)\,\underline{A_r}\Big)^{-1} \qquad (17)$$

If one of the matrices to be inverted here, e.g. $\langle\mu\mu^T\rangle$, is not (cleanly) invertible due to very small or zero-valued eigenvalues, we recommend projecting away the eigenspaces corresponding to eigenvalues below a critical threshold. These indicate redundancies in the signal and can therefore be dropped: in an eigenvalue decomposition, replace eigenvalues below the threshold by $0$ and invert the others. Use the resulting matrix as a proxy for the inverse. Proceed equivalently with other matrices whose inversions are not computable due to near-zero eigenvalues.
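The recommended thresholded inversion can be sketched as follows (a minimal version for symmetric positive semidefinite matrices; the threshold value is an arbitrary choice):

```python
import numpy as np

def safe_inv(M, thresh=1e-10):
    # Pseudo-invert a symmetric PSD matrix: eigenvalues below the
    # threshold indicate redundant directions and are projected away
    # (replaced by 0 instead of being inverted).
    d, E = np.linalg.eigh(M)
    keep = d > thresh
    d_inv = np.where(keep, 1.0 / np.where(keep, d, 1.0), 0.0)
    return E @ np.diag(d_inv) @ E.T

P = safe_inv(np.diag([2.0, 1.0, 1e-14]))   # inverts 2 and 1, drops the third axis
```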

We define the ideal linear predictor $\hat z^{(0)}$ for the original signal without extraction, i.e. for $A_r = I$:

$$\hat z^{(0)}(t) := B(I)\,\zeta(t) + U(I)\,\mu(t) \qquad (18)$$

and can now refine problem (4) to

$$\underset{A\in O(n)}{\operatorname{minimize}}\ \big\langle\big\|A_r^T\big(z - \hat z^{(0)}\big)\big\|^2\big\rangle = \operatorname{Tr}\Big(A_r^T\,\big\langle\big(z-\hat z^{(0)}\big)\big(z-\hat z^{(0)}\big)^T\big\rangle\, A_r\Big) \qquad (19)$$

which can be solved by choosing $A$ such that it diagonalizes $\langle(z-\hat z^{(0)})(z-\hat z^{(0)})^T\rangle$ and sorts the smallest eigenvalues to the upper left. From this we obtain the prediction model by calculating $B(A_r)$ and $U(A_r)$.
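The diagonalization step can be sketched in a few lines (np.linalg.eigh already returns eigenvalues in ascending order, so "sorting the smallest eigenvalues to the upper left" amounts to taking the first r columns):

```python
import numpy as np

def extraction_matrix(err_cov, r):
    # Diagonalize the symmetric prediction-error covariance and keep
    # the r eigenvectors with the smallest eigenvalues: A_r = A I_r.
    d, A = np.linalg.eigh(err_cov)   # eigenvalues ascending
    return A[:, :r]

A_r = extraction_matrix(np.diag([3.0, 1.0, 2.0]), r=2)
```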

In [Richthofer and Wiskott, 2015] we proposed an iterated prediction as a heuristic method to better avoid overfitting. In [Richthofer and Wiskott, 2017] we extended this method to comply with supplementary information as follows. We define a matrix $V$ which implements the autoregressive model and predicts $\zeta(t+1)$ from $\zeta(t)$:

$$\begin{aligned} V &:= \big(\langle\zeta(t+1)\,\zeta^T\rangle - I_{np,n}\,U(I)\,\langle\mu\zeta^T\rangle\big)\,\langle\zeta\zeta^T\rangle^{-1} &&(20)\\ \hat z^{(i)}(t) &:= B(I)\,V^i\,\zeta(t-i) + I_{np,n}^T\sum_{j=0}^{i} V^j\, I_{np,n}\,U(I)\,\mu(t-j) &&(21) \end{aligned}$$

Note that $\hat z^{(i)}$ is consistent with $\hat z^{(0)}$ in (18) for $i = 0$. Based on $\hat z^{(i)}$ we proposed the optimization problem

$$\underset{A\in O(n)}{\operatorname{minimize}}\ \big\langle\big\|A_r^T\big(z - \hat z^{(i)}\big)\big\|^2\big\rangle \qquad (22)$$

It can be solved by the same procedure as (19): choose $A$ such that it diagonalizes $\langle(z-\hat z^{(i)})(z-\hat z^{(i)})^T\rangle$ and sorts the lowest eigenvalues to the upper left. Apply (10) and (11) to get the prediction matrices for the obtained extraction matrix.

### 2.2 Generating a control signal for local navigation

Considering an agent exploring an environment, we present the agent’s perception as main input $x(t)$ to PFAx and provide the preceding control command as supplementary information $u$. This way the extracted predictable features will be a compact representation of perception aspects that are influenced by the control commands. We assume that the control signal is somehow generated during the training phase, e.g. randomly within some constraints. After the training phase, PFAx provides $A_r$, $B$ and $U$, and we want to reach a goal position $m^*$ in feature space, assuming that feature space is sufficiently representative to let us actually reach the associated goal in our environment. As explained earlier, we will apply SFA on top of PFAx, so in contrast to the original PFAx setting, we must consider an additional extraction matrix $A_{\mathrm{SFA}}$. This is a true generalization, as $A_{\mathrm{SFA}} = I$ yields the original setting. Note that SFA also incorporates a mean shift, which is omitted here for simplicity and considering that PFAx output should already be mean-free. Furthermore, this would only result in a constant shift in the cost function and is thus irrelevant for optimization. To calculate the ideal control command we minimize the least squares distance between predicted features and goal features w.r.t. a subsequent linear SFA step:

$$\begin{aligned} \underset{u(t)\in\mathbb{R}^{n_u}}{\operatorname{minimize}}\quad &\big\|m^* - A_{\mathrm{SFA}}^T\big(B(A_r)\,\underline{A_r}^T\zeta(t+1) + U(A_r)\,\mu(t+1)\big)\big\|^2 &&(23)\\ =\ &\Big\|\underbrace{m^* - A_{\mathrm{SFA}}^T\Big(B(A_r)\,\underline{A_r}^T\zeta(t+1) + \sum_{i=2}^{q} U_i\, u(t+1-i)\Big)}_{=:\ u^*} - A_{\mathrm{SFA}}^T U_1\, u(t)\Big\|^2 &&(24)\\ =\ &\big\|u^* - \underbrace{A_{\mathrm{SFA}}^T U_1}_{=:\ \tilde U_1}\, u(t)\big\|^2 &&(25) \end{aligned}$$

This problem is readily solved by choosing $u(t) = \tilde U_1^{-1} u^*$ (or via the pseudoinverse, $u(t) = \tilde U_1^{+} u^*$, if $\tilde U_1$ is not square or not invertible). Note that this would also minimize the unsquared distance $\|u^* - \tilde U_1 u(t)\|$. However, to incorporate constraints on $u(t)$, the squared distance is much friendlier for optimization.

Later we will model an agent moving with constant speed, which involves a normalized-length-constraint:

$$\underset{u(t)\in\mathbb{R}^{n_u},\ \|u(t)\|=c}{\operatorname{minimize}}\ \big\|u^* - \tilde U_1\, u(t)\big\|^2 \qquad (26)$$

This is equivalent to the inhomogeneous eigenvalue problem

$$\begin{aligned} \tilde U_1^T\tilde U_1\, u(t) &= \lambda\, u(t) + \tilde U_1^T u^* &&(27)\\ \|u(t)\| &= c &&(28) \end{aligned}$$

Such problems are approached in [Mattheij and Söderlind, 1987]; one method from there can also be found in the appendix of [Richthofer and Wiskott, 2017]. In that work we provide some experiments indicating that this method is suitable for local navigation, but cannot readily navigate globally, e.g. around obstacles or through doors connecting multiple rooms. The following section extends the method such that it is capable of solving this kind of global navigation task.
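For reference, the length-constrained problem (26)-(28) is a trust-region-type subproblem and can also be solved by a simple one-dimensional search over the Lagrange multiplier. The following is a sketch of one standard strategy, not the specific method of [Mattheij and Söderlind, 1987]; the degenerate "hard case", where the projection of $\tilde U_1^T u^*$ onto the smallest eigenspace vanishes, is not handled.

```python
import numpy as np

def constrained_control(U1, u_star, c, iters=200):
    # minimize ||u_star - U1 u||^2 subject to ||u|| = c.
    # Stationarity gives (U1^T U1 + lam*I) u = U1^T u_star; since
    # ||u(lam)|| decreases monotonically in lam, we find the multiplier
    # by bisection such that ||u(lam)|| = c.
    d, E = np.linalg.eigh(U1.T @ U1)        # eigenvalues d ascending
    beta = E.T @ (U1.T @ u_star)
    norm_u = lambda lam: np.linalg.norm(beta / (d + lam))
    lo = -d[0] + 1e-12                      # just above the pole
    gap = 1.0
    while norm_u(lo + gap) > c:             # bracket the solution
        gap *= 2.0
    hi = lo + gap
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if norm_u(mid) > c else (lo, mid)
    lam = 0.5 * (lo + hi)
    return E @ (beta / (d + lam))

u = constrained_control(np.eye(2), np.array([3.0, 0.0]), c=1.0)
```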

## 3 From local to global navigation

To achieve global navigation, the Slow Feature Analysis algorithm (SFA) [Wiskott and Sejnowski, 2002] and its extension xSFA [Sprekeler et al., 2014] for blind source separation play an important role. Especially the mathematical foundation of xSFA, which is grounded in the mathematical analysis of SFA in [Sprekeler and Wiskott, 2008], forms a key component of the navigation approach presented here. We therefore first recapitulate the original SFA algorithm and then sketch its mathematical foundation, stating key results of the theory that xSFA is based on. Finally we apply these results to our navigation setting, yielding an efficient algorithm for global navigation.

### 3.1 Recall SFA

Where PFA selects components by predictability, SFA selects them by slowness. As it was a central inspiration for PFA, SFA shares some more similarities with it: the extraction is also optimized over a training phase, and to avoid trivial/constant or repeated solutions, the output signals must have unit variance, zero mean and must be pairwise uncorrelated. We refer to the transformation as $g = (g_1,\dots,g_r)$, i.e. the $i$th extracted signal is given as $y_i(t) = g_i(x(t))$. Note that these depend instantaneously on the input signal $x(t)$, so SFA cannot fulfill its goal by simply forming a lowpass filter. Operating on a general function space $\mathcal{F}$ that fulfills the necessary mathematical requirements of integrability and differentiability, the SFA optimization problem can be formulated as follows:

$$\text{For } i\in\{1,\dots,r\}\text{:}\qquad \underset{g_i\in\mathcal{F}}{\operatorname{minimize}}\ \langle\dot y_i^2\rangle \quad\text{subject to}\quad \langle y_i\rangle = 0,\ \ \langle y_i^2\rangle = 1,\ \ \langle y_i y_j\rangle = 0\ \ \forall j<i \qquad (29)$$

Restricting $\mathcal{F}$ to be finite dimensional, e.g. to the space of polynomials up to a certain degree, transforms (29) into an efficiently solvable eigenvalue problem. With the notation familiar from the PFA description in section 2.1, let $h$ denote a basis of $\mathcal{F}$. Then, using $h$ as a nonlinear expansion of the input signal $x(t)$, extraction can be performed by linear transformation and projection. With an initial sphering, i.e. (2) and (3) from section 2.1, we can set $y_i = a_i^T z$ for extraction vectors $a_i$. SFA then becomes the following linearized version of (29):

$$\text{For } i\in\{1,\dots,r\}\text{:}\qquad \underset{a_i\in\mathbb{R}^n}{\operatorname{minimize}}\ \langle(a_i^T\dot z)^2\rangle = a_i^T\langle\dot z\dot z^T\rangle\, a_i \quad\text{subject to the constraints of (29)} \qquad (30)$$

Like in section 2.1, the sphering yields $\langle z\rangle = 0$ and $\langle zz^T\rangle = I$, transforming the constraints to (5) and its associated matrix notation (6). Choosing the $a_i$ as eigenvectors of $\langle\dot z\dot z^T\rangle$, corresponding to the eigenvalues in ascending order, solves (30) globally. [Wiskott and Sejnowski, 2002] describes this procedure in detail.

(30) is an important approximation of (29) for practical solvability. To get an idea of what solutions (30) would converge to if we increased the dimension of $\mathcal{F}$, we focus again on the SFA version concerning an unrestricted function space and the ideal solutions one would expect there. More specifically, we focus on the scenario where $x(t)$ is composed of statistically independent sources $s_\alpha(t)$. [Sprekeler and Wiskott, 2008] and [Sprekeler et al., 2014] analyze this case, proposing xSFA as an extension of SFA that can identify such sources. We summarize some of the theory and results:

Assuming that $x(t)$ is an ergodic process, SFA can be formulated in terms of the ensemble (i.e. the set of possible values of $x$ and $\dot{x}$) using the probability density $p_{x, \dot{x}}(x, \dot{x})$. The corresponding marginal and conditional densities are defined as $p_x(x) = \int p_{x, \dot{x}}(x, \dot{x}) \, d\dot{x}$ and $p_{\dot{x}|x}(\dot{x}|x) = p_{x, \dot{x}}(x, \dot{x}) / p_x(x)$. Further assuming that the ensemble averages $\langle \cdot \rangle_{x, \dot{x}}$, $\langle \cdot \rangle_x$ and $\langle \cdot \rangle_{\dot{x}|x}$ all exist and using the chain rule $\dot{y}_i = \nabla g_i(x)^T \dot{x}$, the SFA optimization problem can be stated in terms of the ensemble as well:

$$\text{For } i \in \{1, \ldots, r\}: \quad \min_{g_i \in \mathcal{F}} \; \big\langle (\nabla g_i(x)^T \dot{x})^2 \big\rangle_{x, \dot{x}} \quad \text{subject to} \quad \langle g_i \rangle_x = 0, \;\; \langle g_i^2 \rangle_x = 1, \;\; \langle g_i g_j \rangle_x = 0 \;\; \forall j < i \tag{31}$$

A key result from [Sprekeler and Wiskott, 2008] is that the ideal solutions for SFA on an unrestricted function space can be found by solving the following eigenvalue equation for the partial differential operator $\mathcal{D} := -\frac{1}{p_x(x)} \sum_{\gamma, \nu} \partial_\gamma \, p_x(x) \, \langle \dot{x}_\gamma \dot{x}_\nu \rangle_{\dot{x}|x} \, \partial_\nu$:

$$\mathcal{D} g_i(x) = \lambda_i g_i(x) \tag{32}$$

under the Neumann boundary conditions

$$\sum_{\gamma, \nu} n_\gamma(x) \, p_x(x) \, \langle \dot{x}_\gamma \dot{x}_\nu \rangle_{\dot{x}|x} \, \partial_\nu g_i(x) = 0 \tag{33}$$

where $n_\gamma(x)$ is the $\gamma$th component of the normal vector at the boundary point $x$. Assuming the input signal is composed of statistically independent sources, i.e. $x = x(s)$ for $s = (s_1, \ldots, s_S)$, this result can be formulated in terms of the sources. Because of statistical independence we have $p_s(s) = \prod_\alpha p_\alpha(s_\alpha)$ and $\langle \dot{s}_\gamma \dot{s}_\nu \rangle_{\dot{s}|s} = 0$ for $\gamma \neq \nu$, so $\mathcal{D}$ can be decomposed as

$$\mathcal{D}(s) = \sum_\alpha \mathcal{D}_\alpha(s_\alpha), \quad \text{with} \quad \mathcal{D}_\alpha := -\frac{1}{p_\alpha(s_\alpha)} \, \partial_\alpha \, p_\alpha(s_\alpha) \, \langle \dot{s}_\alpha^2 \rangle_{\dot{s}_\alpha|s_\alpha} \, \partial_\alpha \tag{34}$$

Given this decomposition, (32) and (33) can be reformulated such that, with an additional normalization constraint, the following equations state SFA in terms of the sources:

$$\mathcal{D}_\alpha g_{\alpha i} = \lambda_{\alpha i} g_{\alpha i} \tag{35}$$
$$p_\alpha \langle \dot{s}_\alpha^2 \rangle_{\dot{s}_\alpha|s_\alpha} \, \partial_\alpha g_{\alpha i} = 0 \quad \text{on the boundary} \tag{36}$$
$$\langle g_{\alpha i}^2 \rangle_{s_\alpha} = 1 \tag{37}$$

Theorem 2 in [Sprekeler and Wiskott, 2008] / Theorem 1 in [Sprekeler et al., 2014] states that the solutions of (32) are composed of solutions of (35):

$$g_i(s) = \prod_\alpha g_{\alpha i_\alpha}(s_\alpha) \tag{38}$$
$$\lambda_i = \sum_\alpha \lambda_{\alpha i_\alpha} \tag{39}$$

with $i = (i_1, \ldots, i_S)$ denoting a multi-index that selects the right combination of per-source solutions. Choosing the combinations with the smallest eigenvalues $\lambda_i$ yields the slowest output signals.
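The combination rule (38)/(39) can be illustrated with a toy enumeration. This is our own sketch; the eigenvalue constants are hypothetical, assuming two uniform sources with $\lambda_{\alpha i} \propto i^2$ and source 1 twice as slow as source 2:

```python
from itertools import product

c = {0: 1.0, 1: 2.0}                          # hypothetical per-source speed constants
indices = list(product(range(4), repeat=2))   # multi-indices i = (i_1, i_2)
lam = {i: c[0] * i[0] ** 2 + c[1] * i[1] ** 2 for i in indices}  # eq. (39)
ordered = sorted(indices, key=lam.get)
# ordered[0] == (0, 0): the constant solution, which is discarded.
# ordered[1] == (1, 0): the first harmonic of the slowest source.
# ordered[3] == (1, 1): a mixed product solution g_11(s_1) * g_21(s_2), cf. (38).
```

Note that the mixed product (1, 1) with $\lambda = 3$ is slower than the second harmonic (2, 0) of the slowest source with $\lambda = 4$, so product solutions genuinely interleave with pure harmonics in the eigenvalue ordering.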

Another crucial result from [Sprekeler and Wiskott, 2008] and [Sprekeler et al., 2014] states monotonicity of each first harmonic $g_{\alpha 1}$ w.r.t. its source $s_\alpha$. For later reference we denote this result as Lemma 1 and recapitulate the proof. We extend the lemma by remarking that it does not require $p_\alpha$ to be a probability distribution; it rather works for any strictly positive weighting function. We will make use of this fact later on.

###### Lemma 1

If $s$ consists of statistically independent components $s_\alpha$ as introduced above, then for each source $s_\alpha$ the first harmonic $g_{\alpha 1}$ is a monotonic function of the source $s_\alpha$. This also holds if $p_\alpha$ is not a probability distribution, but any strictly positive weighting function.

###### Proof.

In the standard form of a Sturm-Liouville problem and assuming that $s_\alpha$ takes values in the interval $[a_\alpha, b_\alpha]$, (35)/(36) read as

$$\partial_\alpha \, p_\alpha \langle \dot{s}_\alpha^2 \rangle_{\dot{s}_\alpha|s_\alpha} \, \partial_\alpha g_{\alpha i} + \lambda_{\alpha i} \, p_\alpha \, g_{\alpha i} = 0 \tag{40}$$
$$p_\alpha \langle \dot{s}_\alpha^2 \rangle_{\dot{s}_\alpha|s_\alpha} \, \partial_\alpha g_{\alpha i} = 0 \quad \forall s_\alpha \in \{a_\alpha, b_\alpha\} \tag{41}$$

Sturm-Liouville theory states that $g_{\alpha 1}$ has exactly one zero $\xi \in (a_\alpha, b_\alpha)$; we assume without loss of generality that $g_{\alpha 1} < 0$ for $s_\alpha < \xi$ and $g_{\alpha 1} > 0$ for $s_\alpha > \xi$.

$$\partial_\alpha \, p_\alpha \langle \dot{s}_\alpha^2 \rangle_{\dot{s}_\alpha|s_\alpha} \, \partial_\alpha g_{\alpha 1} = -\lambda_{\alpha 1} \, p_\alpha \, g_{\alpha 1} < 0 \quad \forall s_\alpha > \xi \tag{42}$$
$$\Rightarrow \quad p_\alpha \langle \dot{s}_\alpha^2 \rangle_{\dot{s}_\alpha|s_\alpha} \, \partial_\alpha g_{\alpha 1} \text{ is strictly decreasing on } (\xi, b_\alpha] \text{ and vanishes at } b_\alpha \tag{43}$$
$$\Rightarrow \quad p_\alpha \langle \dot{s}_\alpha^2 \rangle_{\dot{s}_\alpha|s_\alpha} \, \partial_\alpha g_{\alpha 1} > 0 \quad \text{on } (\xi, b_\alpha) \tag{44}$$
$$\Rightarrow \quad \partial_\alpha g_{\alpha 1} > 0 \quad \text{on } (\xi, b_\alpha) \tag{45}$$
$$\Leftrightarrow \quad g_{\alpha 1} \text{ monotonically increasing on } (\xi, b_\alpha] \tag{46}$$

By the analogous argument, $g_{\alpha 1}$ is monotonically increasing on $[a_\alpha, \xi)$, implying that $g_{\alpha 1}$ is monotonically increasing on the whole interval $[a_\alpha, b_\alpha]$.

The calculation above does not require $p_\alpha$ to be a probability distribution, but only a strictly positive weighting function. ∎
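As a concrete sanity check (our addition, not part of the cited proof), the uniform case makes this monotonicity explicit: for a uniform source, $p_\alpha$ and $K \coloneqq \langle \dot{s}_\alpha^2 \rangle_{\dot{s}_\alpha|s_\alpha}$ are constant, so (40)/(41) reduce to a harmonic oscillator equation with reflecting boundaries:

```latex
K\,g_{\alpha 1}'' + \lambda_{\alpha 1}\,g_{\alpha 1} = 0,
  \qquad g_{\alpha 1}'(a_\alpha) = g_{\alpha 1}'(b_\alpha) = 0
% First nontrivial solution (up to sign and scaling):
g_{\alpha 1}(s_\alpha) = \sqrt{2}\,\cos\!\Big(\pi\,\tfrac{s_\alpha - a_\alpha}{b_\alpha - a_\alpha}\Big),
  \qquad
g_{\alpha 1}'(s_\alpha)
  = -\tfrac{\sqrt{2}\,\pi}{b_\alpha - a_\alpha}\,
    \sin\!\Big(\pi\,\tfrac{s_\alpha - a_\alpha}{b_\alpha - a_\alpha}\Big) < 0
  \ \text{on } (a_\alpha, b_\alpha)
```

The single zero is $\xi = (a_\alpha + b_\alpha)/2$ and $g_{\alpha 1}$ is strictly monotonic (here: decreasing; the sign is arbitrary) on the whole interval, as Lemma 1 asserts.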

We list some additional important results from [Sprekeler and Wiskott, 2008] and [Sprekeler et al., 2014]:

• If the sources are normally distributed, i.e. $s_\alpha \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha^2)$, and $\langle \dot{s}_\alpha^2 \rangle_{\dot{s}_\alpha|s_\alpha}$ is constant, then the Hermite polynomials yield the solutions, with eigenvalues $\lambda_{\alpha i}$ growing linearly in the order $i$.

• If the sources are uniformly distributed, then $p_\alpha$ is constant and the solutions are given by Sturm-Liouville theory as harmonic oscillations $g_{\alpha i}(s_\alpha) = \sqrt{2} \cos\!\big(i \pi \frac{s_\alpha - a_\alpha}{b_\alpha - a_\alpha}\big)$ with $\lambda_{\alpha i} \propto i^2$, assuming that $s_\alpha$ takes values in the interval $[a_\alpha, b_\alpha]$. Therefore one refers to $g_{\alpha i}$ as the $i$th harmonic of the source $s_\alpha$. Note that in this case, all higher harmonics can be calculated from the first harmonic using the Chebyshev polynomials $T_i$: $g_{\alpha i} = \sqrt{2} \, T_i(g_{\alpha 1} / \sqrt{2})$.

• The slowest signal found by SFA is simply the first harmonic of the slowest source. This result is a cornerstone of xSFA, as it allows cleaning subsequent signals from the first source. Iterating this procedure finally yields all sources.

• (38) implies that each output component of SFA is a product of harmonics of the underlying sources.
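The Chebyshev relation between harmonics of a uniform source can be checked numerically. In this sketch of ours, the source is assumed uniform on $[0, 1]$; the check rests on the identity $\cos(i\theta) = T_i(\cos\theta)$:

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

s = np.linspace(0.0, 1.0, 501)                # uniform source values on [0, 1]
harmonic = lambda i: np.sqrt(2) * np.cos(i * np.pi * s)

# cos(i*theta) = T_i(cos(theta)) implies g_i = sqrt(2) * T_i(g_1 / sqrt(2)).
ok = all(
    np.allclose(np.sqrt(2) * Chebyshev.basis(i)(harmonic(1) / np.sqrt(2)), harmonic(i))
    for i in range(1, 5)
)
```

`Chebyshev.basis(i)` constructs the $i$th basis polynomial $T_i$, which is then evaluated directly on the array of first-harmonic values.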

#### 3.1.1 SFA harmonics illustrated on a 1D random walk

To illustrate the role of harmonics for one specific source, we demonstrate the theory on a simple 1D random walk on a bounded interval. The agent starts at a fixed position and each step is drawn from a uniform distribution over a small symmetric interval. Steps exceeding the left or right boundary are simply cut off.

We provide the plain position as input to SFA, using monomials up to the sixth degree as expansion. We extract the first four harmonics and compare them to those predicted by the theory. Although a 1D random walk usually yields a normal distribution around the starting point, for long training phases and due to the boundaries, the distribution actually approaches uniformity with some bias near the boundaries. Figure 6 illustrates this effect by comparing a shorter walk with a longer walk. Another notable effect is that lower harmonics are usually extracted more cleanly than higher harmonics, which is due to the limited monomial expansion. Further note that the monotonicity of the first harmonic is mostly preserved even if the harmonic itself was not cleanly extracted. The algorithm presented in this paper benefits from this effect, as it mainly exploits the first harmonic of each source and especially its monotonicity.
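The experiment can be reproduced with a short script. This is a sketch under assumed parameters (unit interval, start at the center, uniform steps in $[-0.05, 0.05]$); the actual values behind Figure 6 may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps = 100_000
pos = np.empty(n_steps)
pos[0] = 0.5                                  # assumed start position
for k in range(1, n_steps):                   # steps beyond a boundary are cut off
    pos[k] = np.clip(pos[k - 1] + rng.uniform(-0.05, 0.05), 0.0, 1.0)

# Expand the position into monomials up to degree 6, then apply standard linear
# SFA: sphere, eigendecompose the derivative covariance, ascending eigenvalues.
Z = np.stack([pos ** d for d in range(1, 7)], axis=1)
Z -= Z.mean(axis=0)
d, U = np.linalg.eigh(np.cov(Z, rowvar=False))
Zs = (Z @ U) / np.sqrt(d)
dd, W = np.linalg.eigh(np.cov(np.diff(Zs, axis=0), rowvar=False))
Y = Zs @ W                                    # Y[:, i] approximates the (i+1)-th harmonic

# For a near-uniform position distribution, the slowest output should be close
# to the first harmonic cos(pi * pos), and in particular monotonic in pos.
corr = abs(np.corrcoef(Y[:, 0], np.cos(np.pi * pos))[0, 1])
```

Plotting `Y[:, i]` against `pos` reproduces the qualitative picture described above: the first harmonic comes out nearly perfectly monotonic, while higher harmonics degrade due to the limited expansion.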

### 3.2 xSFA on manifolds

For the blind source separation setting in [Sprekeler et al., 2014] it is assumed that the input is composed of statistically independent sources. This assumption is not necessarily appropriate for the setting studied in this paper. A closer fit can be found in [Franzius et al., 2007], where the agent's state space, denoted configuration space $V$, is considered a manifold embedded in the sensor space that yields the data subject of study. In this section we extend xSFA theory from [Sprekeler et al., 2014] to such a manifold setting and establish a geometrical characterization of the solutions in terms of potential, monotonicity, geodesics and manifold representation. These results motivate the navigation algorithm proposed in section 3.3.

#### 3.2.1 Slowest features are monotonic flows on the state space

Lemma 1 shows monotonicity with respect to the slowest source, but does not characterize the source itself in the context of the state space $V$. We can show that under certain assumptions, the slowest features actually correspond to monotonic flows across $V$. In the notation from [Franzius et al., 2007], the sources in terms of xSFA are the agent's possible configurations $s \in V$. For a fully observable environment and a sufficiently rich sensor, each sensor input value $x$ can be identified with a configuration $s \in V$ such that $s \mapsto x(s)$ is a bijective map. Each output component $g_i$ of xSFA is a scalar field on $V$, mapping each $s$ to a real number. We can assume $g_i$ is bounded, so it actually maps to an interval $I_i$, i.e. $g_i: V \to I_i$. Let $g_i^{-1}(c)$ denote the fiber of a value $c \in I_i$. Fibers are also known as level sets or equipotential sets. With a plain $I$ we denote the unit interval $[0, 1]$. Mapping from $V$ to $I_i$, $g_i$ usually performs a dimensionality reduction, unless $\dim V = 1$. It proves advantageous to study the layout of this reduction, i.e. of the fibers of $g_i$, separately from its value. We achieve this separation by splitting

$$g_i(s) \eqqcolon \tilde{g}_i(t_i(s)) \quad \text{with} \quad \tilde{g}_i: I \to I_i, \quad t_i: V \to I \tag{47}$$

$t_i$ is a scalar field on $V$ realizing the level sets of $g_i$, while $\tilde{g}_i$ is a real-valued function realizing the value of $g_i$ on top of $t_i$. We refer to $t_i$ as the coordinate function of $g_i$, because it defines a one-dimensional coordinate for $g_i$ on $V$. If $\tilde{g}_i$ is injective, $t_i$ must have the same level sets as $g_i$. Otherwise $t_i$ assigns distinct values to separate connectivity components of level sets of $g_i$ whenever such components are induced by $\tilde{g}_i$ being non-injective. Also note that $t_i$ can differ from $g_i$ in velocity and that it only underlies Neumann boundary conditions where its fibers hit the boundary orthogonally. Note that the choice of $t_i$ is not unique, e.g. every composition of $t_i$ with a bijective function on $I$ yields another valid coordinate function. This gives us the freedom to assume additional properties of $t_i$, most notably uniform velocity of its integral curves.

###### Theorem 1

Let $g_i$ be the solution components of xSFA. For every coordinate function $t_i$ as defined in (47), the following holds: let $\varphi$ be an integral curve of $\nabla t_i$ from $t_i^{-1}(0)$ to $t_i^{-1}(1)$ or the other way round. Then it runs through $V$ strictly monotonically w.r.t. $t_i$, i.e. $t_i \circ \varphi$ is a strictly monotonic function.

Intuitively this means that $g_i$ has no local extrema or bumps spatially “between” its minimal and maximal level sets. It does not rule out local extrema completely, but they must be somewhat isolated from the main flow, e.g. in another branch of $V$. Theorem 1 is the first step of characterizing $g_i$ as consisting of monotonic flows that bridge the potential spanned by its minimal and maximal level sets in the slowest possible fashion, or – in terms of potential theory – with minimal energy. Note that due to the superposition principle, $g_i$ can consist of multiple overlapping flows of this kind. Then it can happen that the minimal or maximal level set is not connected.

###### Proof.

Let $\Phi$ be the set of integral curves of $\nabla t_i$. We assume that $V$ provides sufficient structure to define integration over it, e.g. $V$ could be a Riemannian manifold. With $dV$ we denote integration by volume over $V$ and with $dS$ we denote integration by volume over $R_i$, in the sense that $R_i$ is a hypersurface in $V$ indexing the integral curves. With $D$ denoting the Jacobi matrix, we can write the SFA optimization criterion as follows (c.f. optimization problem 2 in [Franzius et al., 2007]):

$$\langle (\nabla g_i(s) \, \dot{s})^2 \rangle_{s, \dot{s}} = \int_V p_s(s) \, (Dg_i)(s) \, \underbrace{\langle \dot{s} \dot{s}^T \rangle_{\dot{s}|s}}_{\eqqcolon K(s)} \, (Dg_i)^T(s) \, dV \tag{48}$$
$$= \int_V p_s(s) \, (D\tilde{g}_i)^2(t_i(s)) \, (Dt_i)(s) \, K(s) \, (Dt_i)^T(s) \, dV \tag{49}$$
$$= \int_{R_i} p_\varphi(\varphi) \int_I \underbrace{p_{\theta|\varphi}(\theta|\varphi) \, \rho(\theta)}_{\eqqcolon \tilde{p}(\theta)} \, (D\tilde{g}_i)^2(t_i(\varphi(\theta))) \, \underbrace{(Dt_i)(\varphi(\theta)) \, K(\varphi(\theta)) \, (Dt_i)^T(\varphi(\theta))}_{\eqqcolon K_i(\theta)} \, d\theta \, dS \tag{50}$$

Here, $\rho(\theta)$ denotes the volume element arising from the change of variables to curve coordinates. Since $K$ can be interpreted as the empirically measured inverse metric tensor of $V$, $K_i(\theta)$ measures the squared velocity of $t_i$ along the curve $\varphi$.

We transform the unit variance constraint in a similar way:

$$\langle g_i^2(s) \rangle_s = \int_V p_s(s) \, g_i^2(s) \, dV = \int_V p_s(s) \, \tilde{g}_i^2(t_i(s)) \, dV \tag{51}$$
$$= \int_{R_i} p_\varphi(\varphi) \int_I p_{\theta|\varphi}(\theta|\varphi) \, \rho(\theta) \, \tilde{g}_i^2(t_i(\varphi(\theta))) \, d\theta \, dS \tag{52}$$
$$= \int_{R_i} p_\varphi(\varphi) \underbrace{\int_I \tilde{p}(\theta) \, \tilde{g}_i^2(t_i(\varphi(\theta))) \, d\theta}_{\eqqcolon v_\varphi > 0} \, dS \tag{53}$$

A valid solution must yield

$$\int_{R_i} p_\varphi(\varphi) \, v_\varphi \, dS = 1 \tag{54}$$

Since $v_\varphi > 0$, every integral curve $\varphi$ contributes a positive quantity to the overall unit variance. Let $(v_\varphi^*)_{\varphi \in \Phi}$ be the family of quantities that yields the slowest signal $g_1$. The distribution of variance across $\Phi$ is the only trade-off between the integral curves forming $g_1$, so we can conclude that for each $\varphi$, $\tilde{g}_1$ must be the solution of an optimization problem of the following form:

$$\min_{\tilde{g}_1 \in \mathcal{F}} \int_I \tilde{p}(\theta) \, (D\tilde{g}_1)^2 \, K_1(\theta) \, d\theta \quad \text{subject to} \quad \int_I \tilde{p}(\theta) \, \tilde{g}_1^2(t_1(\varphi(\theta))) \, d\theta = v_\varphi^* \tag{55}$$

The crucial advantage of having split off $t_1$ is that $\tilde{g}_1$ is now scalar-valued, so (55) is an ordinary SFA optimization problem defined on the interval $I$. It is a bit special, because the variance is not normalized to $1$ but to $v_\varphi^*$, and $\tilde{p}$ is not a probability distribution but a general strictly positive weight function. However, these are just scaling issues and the mathematical theory of SFA solutions is still applicable. The underlying space is one-dimensional, so it can only involve a single source, which must have coordinate character on $I$, i.e. be bijective and continuous, thus monotonic. Since a single source is always statistically independent, we can apply Lemma 1 and find that the slowest solution $\tilde{g}_1$ must be a strictly monotonic function on the interior of $I$, denoted $\mathring{I}$. Therefore we have for $\theta \in \mathring{I}$:

$$D(\tilde{g}_1 \circ t_1 \circ \varphi) = D\tilde{g}_1 \, Dt_1 \, D\varphi \neq 0 \tag{56}$$

$\theta \in \{0, 1\}$ corresponds to $t_1^{-1}(0)$ or $t_1^{-1}(1)$, respectively. For $\theta \in \mathring{I}$ it follows that $D\tilde{g}_1$, $Dt_1$ and $D\varphi$ are each non-zero, so $t_1 \circ \varphi$ is strictly monotonic. This readily proves theorem 1 for $i = 1$.

To extend the proof to $i > 1$ we need to recall how xSFA operates. After $g_1$ is extracted, in an idealized xSFA the data is projected onto a space orthogonal to the space of continuous functions of $g_1$. We can think of an infinite sequence of monomials of $g_1$ as a basis of this space. Essentially, the projection implies

$$\langle g_1^k \, g_2 \rangle_s = 0 \quad \forall k \in \mathbb{N} \tag{57}$$

and consequently that

$$\langle \tilde{g}_1^k(t_1) \, \tilde{g}_2(t_2) \rangle_s = 0 \quad \forall k \in \mathbb{N} \tag{58}$$

Since $\tilde{g}_1$ is monotonic, the identity function on $I$ can be approximated by polynomials in $\tilde{g}_1$, yielding that already the coordinate $t_1$ is orthogonal to $\tilde{g}_2(t_2)$:

$$\langle t_1 \, \tilde{g}_2(t_2) \rangle_s = 0 \tag{59}$$

Therefore we can express this constraint on the coordinate level and restrict $t_2$ to fulfill

$$\langle t_1 \, t_2 \rangle_s = 0 \tag{60}$$

which imposes no further constraint on $\tilde{g}_2$. That means the decorrelation constraint for $g_2$ in the xSFA sense only affects $t_2$ rather than $\tilde{g}_2$, and is encoded in (60). So we can apply (55) with $i = 2$ and equation (56) follows. ∎

Our next theorem characterizes the ideal spatial location of and