Universal discrete-time reservoir computers with stochastic inputs and linear readouts using non-homogeneous state-affine systems
A new class of non-homogeneous state-affine systems is introduced for use in reservoir computing. Sufficient conditions are identified that guarantee first, that the associated reservoir computers with linear readouts are causal, time-invariant, and satisfy the fading memory property and second, that a subset of this class is universal in the category of fading memory filters with stochastic almost surely uniformly bounded inputs. This means that any discrete-time filter that satisfies the fading memory property with random inputs of that type can be uniformly approximated by elements in the non-homogeneous state-affine family.
Key Words: reservoir computing, state-affine systems, SAS, echo state networks, ESN, echo state affine systems, ESAS, machine learning, universality, fading memory property, linear training, stochastic signal treatment.
A reservoir computer (RC) [Jaeg 10, Jaeg 04, Maas 02, Maas 11, Croo 07, Vers 07, Luko 09] or RC system is a specific type of recurrent neural network determined by two maps, namely a reservoir map $F$ and a readout map $h$, that under certain hypotheses transform (or filter) an infinite discrete-time input $\mathbf{z}$ into an output signal $\mathbf{y}$ of the same type using the state-space transformation given by:
\begin{align}
\mathbf{x}_t &= F(\mathbf{x}_{t-1}, \mathbf{z}_t), \tag{1.1}\\
\mathbf{y}_t &= h(\mathbf{x}_t), \tag{1.2}
\end{align}
where $t \in \mathbb{Z}$ and the dimension of the state vectors $\mathbf{x}_t$ will be referred to as the number of virtual neurons of the system. The expressions (1.1)-(1.2) determine a nonlinear state-space system, and many of its dynamical properties (stability, controllability) have been studied for decades in the literature from that point of view.
In supervised machine learning applications, the reservoir map is very often randomly generated (see, for instance, the echo state networks in [Jaeg 10, Jaeg 04]) and the memoryless readout is trained so that the output matches a given teaching signal that we will denote by . Two important advantages of this approach lie in the fact that it reduces the training of a dynamic task to a static problem and, moreover, if the reservoir map is rich enough, good performance can be attained with readouts that have a relatively simple functional form. Indeed, on many occasions it suffices to use an affine map that is trained via a (possibly regularized) linear regression that minimizes the Euclidean distance between the output and the teaching signal . These features circumvent well-known difficulties in the training of generic recurrent neural networks having to do with bifurcation phenomena that render classical gradient descent methods non-convergent [Doya 92].
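The training scheme just described can be sketched in a few lines. The following is a minimal illustration and not the setup used later in the paper: the tanh reservoir map, the sizes, the ridge parameter, the toy teaching signal, and all function names are assumptions made for the example. The reservoir is generated randomly and left untouched; only the affine readout is fitted, via a regularized linear regression.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_reservoir(A, c, z):
    """Iterate x_t = tanh(A x_{t-1} + c z_t) from x = 0; one state per row."""
    n_neurons = A.shape[0]
    x = np.zeros(n_neurons)
    states = np.empty((len(z), n_neurons))
    for t, z_t in enumerate(z):
        x = np.tanh(A @ x + c * z_t)
        states[t] = x
    return states

def train_affine_readout(states, teaching, ridge=1e-6):
    """Ridge regression for y_t = W^T x_t + a (the intercept a is not penalized)."""
    X = np.hstack([states, np.ones((len(states), 1))])
    penalty = ridge * np.eye(X.shape[1])
    penalty[-1, -1] = 0.0
    coef = np.linalg.solve(X.T @ X + penalty, X.T @ teaching)
    return coef[:-1], coef[-1]

# Randomly generated reservoir (assumed sizes), rescaled to spectral norm 0.9.
n_neurons = 50
A = rng.normal(size=(n_neurons, n_neurons))
A *= 0.9 / np.linalg.norm(A, 2)
c = rng.normal(size=n_neurons)

# Toy task (assumption): the teaching signal depends on z_t and z_{t-1}.
z = rng.uniform(-1.0, 1.0, size=2000)
teaching = 0.5 * z
teaching[1:] += 0.3 * z[:-1]

washout = 100  # discard the initial transient before fitting
states = run_reservoir(A, c, z)
W, a = train_affine_readout(states[washout:], teaching[washout:])
y = states[washout:] @ W + a
mse = np.mean((y - teaching[washout:]) ** 2)
```

The dynamic task has been reduced to a static least-squares problem on the matrix of collected states, which is the first advantage mentioned above.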
There are two central questions that need to be addressed when designing a machine learning paradigm, namely, the capacity and the universality problems. The capacity problem concerns the estimation of the error that is going to be committed in the execution of a specific task. This estimation can have the form of generic bounds that incorporate various architecture parameters of the system like in [Pisi 81, Jone 92, Barr 93, Kurk 05]. In the specific context of reservoir computing, it has been the subject of much research [Jaeg 02, Whit 04, Gang 08, Herm 10, Damb 12, Grig 15, Coui 16, Grig 16a].
The universality problem consists in showing that the set of input/output functionals that can be generated with a specific architecture is dense in a sufficiently rich class, like the one containing, for example, all continuous or all measurable functionals. For classical machine learning paradigms like neural networks, this question has given rise to well-known results [Kolm 56, Arno 57, Spre 65, Spre 96, Spre 97, Cybe 89, Horn 89, Rusc 98] that show that they can be considered as universal approximators in a static and deterministic setup.
There is no general recipe that allows one to conclude the universality of a given supervised machine learning paradigm. The proof strategy depends heavily on the specific paradigm itself and, more importantly, on the nature of the inputs and the outputs. In the context of reservoir computing there are several situations for which universality has been established when the inputs/outputs are deterministic, that is, when dealing with real-valued curves or time series. There are two features that significantly influence the level of mathematical sophistication that is needed to conclude universality: first, the compactness of the time domain under consideration and second, whether one works in continuous or discrete time. In the following paragraphs we briefly review the results that have already been obtained and, in passing, we present and put in context the contributions contained in this paper.
The compactness of the time domain is crucial because, as we will see later on, universality is obtained as a consequence of various versions of the Stone-Weierstrass theorem, which is invariably formulated for functions defined on a compact metric space. When the time domain is compact, this property is naturally inherited by the spaces relevant in the proofs. However, when it is not, it is obtained by restricting the study to functionals that satisfy a condition introduced in [Boyd 85] and known as the fading memory property. The distinction between continuous and discrete time inputs is justified by the availability in the continuous setup of different tools coming from functional analysis that do not exist for discrete time.
Reservoir computing universality for compact time domains is obtained as a corollary of classical results in systems theory. Indeed, in the continuous time setup, it can be established [Flie 76, Suss 76] for linear systems using polynomial readouts or with bilinear systems using linear readouts. In the discrete-time setup, the situation is more convoluted when the readout is linear and required the introduction in [Flie 80] of the so-called (homogeneous) state-affine systems (SAS) (see also [Sont 79a, Sont 79b]). The extension of these results to continuous non-compact time intervals was carried out in [Boyd 85] for fading memory filters using exponentially stable linear RCs with polynomial readouts and their bilinear counterparts with linear readouts (see also [Maas 00, Maas 02, Maas 04, Maas 07]). An extension to the non-compact discrete-time setup based on the Stone-Weierstrass theorem is, to our knowledge, not available in the literature. This problem has only been tackled from an internal approximation point of view, which consists in uniformly approximating the reservoir and readout maps in (1.1)-(1.2) in order to obtain an approximation of the resulting filter; this strategy has been introduced in [Matt 92, Matt 93] for absolutely summable systems. The proofs in those works were unfortunately based on an invalid compactness assumption. Even though corrections were proposed in [Perr 96, Stub 97a], this approach yields, in the best of cases, universality only within the reservoir filter category, while we aim at proving that statement in the much larger category of fading memory filters.
Our paper contains the following four main contributions:
A non-homogeneous variant of the state-affine systems in [Flie 80] is introduced and we identify sufficient conditions that guarantee that the associated reservoir computers with linear readouts have the echo state property, are causal, time-invariant, and satisfy the fading memory property.
A subset of this class is characterized that is universal in the category of fading memory filters with uniformly bounded inputs.
These results are extended to the stochastic setup by formulating a version of this universality result that is valid for almost surely uniformly bounded inputs. This result shows that any discrete-time filter that has the fading memory property with almost surely uniformly bounded stochastic inputs can be uniformly approximated by elements in the non-homogeneous state-affine family.
The universal non-homogeneous state-affine family suggests a generalization of the echo state networks introduced in [Jaeg 04] that have been very successful in many information processing tasks. We call this generalization echo state affine systems (ESAS) and we empirically show that they outperform echo state networks in a standard benchmark forecasting task having to do with the chaotic time series generated by the Mackey-Glass system [Mack 77].
Despite some preexisting work on the uniform approximation in probability of stochastic systems with finite memory [Perr 96, Perr 97, Stub 97b], the universality result in the stochastic setup is, to our knowledge, the first of its type in the literature and opens the door to new developments in the learning of stochastic processes and their obvious applications to forecasting [Galt 14]. Indeed, in the deterministic setup, RC has been very successful (see for instance [Jaeg 04]) in learning the attractors of various dynamical systems which, in passing, is used for forecasting by continuing the paths of the system in question using its synthetically learnt proxy. This approach led to accuracy improvements of several orders of magnitude with respect to most standard dynamical systems forecasting techniques based on Takens’ Theorem [Take 81], and we expect that it should have analogous beneficial effects in the density forecasting of stochastic processes that satisfy the hypotheses of the results that are formulated later on in the paper.
2 Notation, definitions, and preliminary discussions
Vector and matrix notations. Polynomials.
A column vector is denoted by a bold lower case symbol like and indicates its transpose. Given a vector , we denote its entries by , with ; we also write . We denote by the space of real matrices with . When , we use the symbol to refer to the space of square matrices of order . Given a matrix , we denote its components by and we write , with , . Given a vector , the symbol stands for its Euclidean norm. For any , denotes its matrix norm induced by the Euclidean norms in and , and satisfies [Horn 13, Example 5.6.6] that , with the largest singular value of . is sometimes referred to as the spectral norm of [Horn 13].
Given an element , we denote by the real-valued multivariate polynomials on with real coefficients. Analogously, will denote the set of real-valued polynomials on . When is a scalar and , we define the set of -valued polynomials on with coefficients in as
The symbol denotes the set of infinite real sequences of the form , ; and are the subspaces consisting of, respectively, left and right infinite sequences: , . In most cases we shall use in these infinite product spaces either the product topology (see [Munk 14, Chapter 2]) or the topology induced by the supremum norm . The symbols and will be used to denote the Banach spaces formed by the elements in those infinite product spaces that have a finite supremum norm .
We will refer to the maps of the type as filters or operators and to those like (or ) as functionals. A filter is called causal when for any two elements that satisfy that for any , for a given , we have that . The filter is called time-invariant when, for any , it commutes with the associated time delay operator defined by , that is, (in this expression, the two time delay operators have to be understood as defined in the appropriate sequence spaces). We recall (see for instance [Boyd 85]) that there is a bijection between causal time-invariant filters and functionals on . Indeed, given a time-invariant filter , we can associate to it a functional via the assignment , where is an arbitrary extension of to . Conversely, for any functional , we can define a time-invariant causal filter by , where is the -time delay operator and is the natural projection. It is easy to verify that:
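The correspondence between functionals and causal time-invariant filters can be illustrated numerically on finite windows. The sketch below rests on several assumptions that are not part of the text: the functional is a hypothetical exponentially weighted sum of squares, sequences are truncated to finite arrays, and every input is taken to be zero before the start of the window. The filter is built from the functional by evaluating it on the history available at each time step.

```python
import numpy as np

rng = np.random.default_rng(0)

def functional(z_left, lam=0.5):
    """Hypothetical functional H(z) = sum_{j>=0} lam**j * z_{-j}**2.

    z_left[-1] plays the role of z_0 and z_left[-1 - j] that of z_{-j}.
    """
    past = z_left[::-1]
    return float(np.sum(lam ** np.arange(len(past)) * past ** 2))

def filter_from_functional(H, z):
    """U_H(z)_t = H(history of z up to time t); zeros are assumed before the window."""
    return np.array([H(z[: t + 1]) for t in range(len(z))])

z = rng.uniform(-1.0, 1.0, size=100)
out = filter_from_functional(functional, z)

# Time invariance: delaying the input by 7 steps delays the output by 7 steps.
delay = 7
z_delayed = np.concatenate([np.zeros(delay), z])
out_delayed = filter_from_functional(functional, z_delayed)

# Causality: changing the input from time 50 onwards leaves earlier outputs unchanged.
z_modified = z.copy()
z_modified[50:] += 1.0
out_modified = filter_from_functional(functional, z_modified)
```

Here `out_delayed[delay:]` coincides with `out` up to rounding, and `out_modified[:50]` coincides with `out[:50]`.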
Additionally, let and , then , for any .
Consider now the RC system determined by (1.1)–(1.2) and assume, additionally, that the following existence and uniqueness property holds: for each there exists a unique such that for each , the relation (1.1) holds. This condition is known in the literature as the echo state property [Jaeg 10, Yild 12] and has deserved much attention in the context of echo state networks [Jaeg 04, Bueh 06, Bai 12, Wain 16, Manj 13]. We emphasize that the echo state property is a genuine condition that is not automatically satisfied by all RC systems.
RC systems that satisfy the echo state property have a naturally associated filter. We will denote by the filter determined by the reservoir map via (1.1), that is, and by the one determined by the entire reservoir system, that is, . will be called the reservoir filter associated to the RC system (1.1)–(1.2). The filters and are causal by construction and it can also be shown that they are necessarily time-invariant [Grig 18]. We can hence associate to a reservoir functional determined by .
Weighted norms and the fading memory property (FMP).
Let be a decreasing sequence with zero limit and . We define the weighted norm on associated to the weighting sequence as the map:
where denotes the Euclidean norm in . It is worth noting that the space
endowed with the weighted norm forms a Banach space [Grig 18].
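As a small numerical illustration, the sketch below assumes that the weighted norm takes the form used in [Boyd 85], namely the supremum over $t \leq 0$ of $\|\mathbf{z}_t\| \, w_{-t}$, and evaluates it on a finite window of the past. The function name, the window length, and the geometric weighting sequence are choices made for the example.

```python
import numpy as np

def weighted_norm(z_past, w):
    """sup over the window of ||z_{-k}|| * w(k); z_past[-1] is z_0, z_past[-1 - k] is z_{-k}."""
    norms = np.linalg.norm(z_past[::-1], axis=1)  # norms[k] = ||z_{-k}||
    weights = np.array([w(k) for k in range(len(norms))])
    return float(np.max(norms * weights))

lam = 0.5

def w(k):
    """Geometric weighting sequence with w(0) = 1 and zero limit."""
    return lam ** k

# A window of ten 2-dimensional inputs, all equal to (1, 1).
z_past = np.ones((10, 2))
norm_z = weighted_norm(z_past, w)  # attained at k = 0

# Perturb only the oldest entry z_{-9} by a vector of Euclidean norm one.
z_perturbed = z_past.copy()
z_perturbed[0] += np.array([1.0, 0.0])
distance = weighted_norm(z_past - z_perturbed, w)  # equals lam**9
```

A perturbation of size one located nine steps in the past is measured as $0.5^9 \approx 0.002$. This is the mechanism by which continuity with respect to a weighted norm encodes the fading of memory.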
All along the paper, we will work with uniformly bounded families of sequences, both in the deterministic and the stochastic setups. The two main properties of these subspaces in relation with the weighted norms are spelled out in the following two lemmas.
Let and let be the set of uniformly bounded elements in by , that is,
Then, for any weighting sequence and , we have that .
Additionally, let and let be the weighting sequences given by , , , . Then, the following series are convergent and satisfy the inequalities:
The following result is a discrete-time version of Lemma 1 in [Boyd 85] that is easily obtained by noticing that in the discrete-time setup all functions are trivially continuous if we consider the discrete topology for their domains and, moreover, all families of functions are equicontinuous. A proof is given in the appendices for the sake of completeness.
Let and let be as in (2.3). Let be an arbitrary weighting sequence. Then is a compact topological space when endowed with the relative topology inherited from .
Let and let be the functional associated to the causal and time-invariant filter . We say that has the fading memory property (FMP) on whenever there exists a weighting sequence such that the map is continuous. This means that for any and any there exists a such that for any that satisfies that
If the weighting sequence is such that , for some and all , then is said to have the -exponential fading memory property.
The fading memory property is on some occasions also related to the Lyapunov stability of the autonomous system associated to the reservoir map. This connection has been made explicit, for example, for discrete-time nonlinear state-space models that are affine in their inputs and have a direct feed-through term in the output [Zang 04], or for time-delay reservoirs [Grig 16b].
Time-invariant fading memory filters always have the bounded input, bounded output (BIBO) property. Indeed, if for simplicity we consider functionals that map the zero input to zero, that is , and we want bounded outputs such that , for a given constant , by Definition 2.3 it suffices to consider inputs such that . Indeed, if has the FMP with respect to a weighting sequence , then and hence , as required.
The following lemma, that will be used later on in the paper, spells out how the FMP depends on the weighting sequence used to define it.
Let and let be the functional associated to the causal and time-invariant filter . If has the FMP with respect to a given weighting sequence , then it also has it with respect to any other weighting sequence that satisfies that
In particular, the thesis of the lemma holds when dominates , that is when .
3 Universality results in the deterministic setup
In this section we consider deterministic filters, in the sense that their inputs belong to a subset of formed by uniformly bounded elements, as in the definition in (2.3).
We will formulate two different universality results. In the first one, we show that polynomial algebras of filters generated by reservoir computers with the fading memory property that separate points are able to approximate any fading memory filter. Such families of reservoir computers are said to be universal. Two important consequences of this result are that the entire family of fading memory RCs itself is universal, as well as the one containing all the linear reservoirs with polynomial readouts, when certain spectral restrictions are imposed on the reservoir matrices. In the second result, we restrict ourselves to reservoir computers with linear readouts and introduce the non-homogeneous state-affine family in order to be able to obtain a similar universality statement.
The first result can be seen as a discrete-time translation of the one formulated in [Boyd 85] for continuous-time filters while the second one is an extension to infinite time intervals of the main approximation result in [Flie 80] formulated for compact time intervals using homogeneous state-affine systems.
3.1 Universality for fading memory RCs with non-linear readouts
The following statement is a direct consequence of the compactness result in Lemma 2.3 and the Stone-Weierstrass theorem for polynomial subalgebras of real-valued functions defined on compact metric spaces, as formulated in Theorem 7.3.1 in [Dieu 69]. See Appendix 6.4 for a detailed proof.
Let be a subset of the type defined in (2.3) and let
be a set of reservoir filters defined on that have the FMP with respect to a given weighted norm . Let be the polynomial algebra generated by , that is, the set formed by finite products and linear combinations of elements in . If the algebra contains the constant functionals and separates the points in , then any causal, time-invariant fading memory filter can be uniformly approximated by elements in , that is, is dense in the set of real-valued continuous functions on . More explicitly, this implies that for any fading memory filter and any , there exist indices and a polynomial such that
An important fact is that the polynomial algebra generated by a set formed by fading memory reservoir filters is made of fading memory reservoir filters. Indeed, let , , , and . Then, the product and the linear combination filters are such that
We emphasize that the functionals and in (3.2) and (3.3) are well defined because if the reservoir maps and satisfy the echo state property then so does . Indeed, if and are the solutions of the reservoir equation (1.1) for and associated to the input , then so is for in (3.4).
This observation has as a consequence that the set of all the RC systems that have the echo state property and the FMP with respect to a given weighted norm forms a polynomial algebra that contains the constant functions (they can be obtained by using as readouts constant elements in ) and separates points (take the trivial reservoir map and use the separation property of ). This remark, put together with Theorem 3.1, yields the following corollary.
Let be a subset of the type defined in (2.3) and let
be the set of all reservoir filters defined on that have the FMP with respect to a given weighted norm . Then is universal, that is, it is dense in the set of real-valued continuous functions on .
According to the previous corollary, reservoir filters that have the FMP are able to approximate any time-invariant fading memory filter. We now show that actually a much smaller family of reservoirs suffices to do that, namely, certain classes of linear reservoirs with polynomial readouts. Consider the RC system determined by the expressions
If this system has a reservoir filter associated (the next result provides a sufficient condition for this to happen) we will denote by the associated functional and we will refer to it as the linear reservoir filter determined by , and the polynomial . These filters exhibit the following universality property that is proved in Appendix 6.5.
Let be a subset of the type defined in (2.3) and let be such that . Consider the set formed by all the linear reservoir systems as in (3.6)-(3.7) determined by polynomial readouts and by matrices such that . Then, the elements in generate -exponential fading memory reservoir filters, with , for any , that can be explicitly written down using the expression:
This family is dense in with , for any .
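The explicit expression of these filters can be checked numerically. The sketch below assumes that the linear reservoir system (3.6)-(3.7) has the form $\mathbf{x}_t = A \mathbf{x}_{t-1} + \mathbf{c} \mathbf{z}_t$, $y_t = h(\mathbf{x}_t)$, so that the state is the series $\sum_{j \geq 0} A^j \mathbf{c} \mathbf{z}_{t-j}$, which converges when the spectral norm of $A$ is smaller than one. The dimensions, the polynomial readout `h`, and the zero input before the window are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)

N, n = 4, 2  # state and input dimensions (assumed)
A = rng.normal(size=(N, N))
A *= 0.8 / np.linalg.norm(A, 2)  # rescale so that the spectral norm equals 0.8
c = rng.normal(size=(N, n))
z = rng.uniform(-1.0, 1.0, size=(200, n))  # inputs; zero input assumed before the window

def h(x):
    """Hypothetical polynomial readout."""
    return x[0] * x[1] + 2.0 * x[2] ** 2 - x[3]

# State obtained by iterating the recursion x_t = A x_{t-1} + c z_t.
x = np.zeros(N)
for z_t in z:
    x = A @ x + c @ z_t
y_recursive = h(x)

# Same state written as the series sum_j A^j c z_{t-j} over the window.
x_series = np.zeros(N)
A_power = np.eye(N)
for z_lagged in z[::-1]:
    x_series += A_power @ (c @ z_lagged)
    A_power = A_power @ A
y_series = h(x_series)
```

Both computations agree up to rounding. The geometric decay of the powers of $A$ is what makes the contribution of remote inputs negligible.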
3.2 State-affine systems and universality for fading memory RCs with linear readouts
As explained in the introduction, it is particularly convenient to work with RCs that have a linear readout since in that case the training reduces to the solution of a regression problem. This remains feasible in most cases even when, as happens in many applications, a high number of neurons is needed. This point makes relevant the identification of families of reservoirs that are universal when the readout is restricted to be linear, which is the subject of this subsection. In order to simplify the presentation, we restrict ourselves in this case to one-dimensional input signals, that is, all along this section we set . The extension to the general case is straightforward even though more convoluted to write down (see Remark 3.12).
Let , , and let and be two polynomials on the variable with matrix coefficients, as they were introduced in (2.1). The non-homogeneous state-affine system (SAS) associated to and is the reservoir system determined by the state-space transformation:
Our next result spells out a sufficient condition that guarantees that the SAS reservoir system (3.9)-(3.10) has the echo state property. Moreover, it provides an explicit expression for the unique causal and time-invariant solution associated to a given input. Recall that for any , denotes its matrix norm induced by the Euclidean norms in and and that .
In that situation, we will denote by and the corresponding SAS reservoir filter and SAS functional, respectively.
The next result presents two alternative conditions that imply the hypothesis in the previous proposition and that are easier to verify in practice.
Let be the polynomial given by
Consider the following three conditions:
There exists a constant , such that , for any , and that at the same time satisfies that .
Then, condition (i) implies (ii) and condition (ii) implies (iii).
We emphasize that, since Proposition 3.5 was proved using condition (iii) in the previous lemma, any of the three conditions in that statement implies the echo state property for (3.12)-(3.13) and the time-invariance of the corresponding solutions. The next result shows that the same situation holds in relation with the fading memory property.
Consider a non-homogeneous state-affine system as in (3.9)-(3.10) determined by polynomials , and a vector , with inputs defined on , . If the polynomial satisfies any of the three conditions in Lemma 3.6 then the corresponding reservoir filter has the fading memory property. More specifically, if satisfies condition (i) in Lemma 3.6, then is continuous with and arbitrary. The same conclusion holds for conditions (ii) and (iii) with and , respectively.
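The role of the contractivity conditions can be visualized with a short simulation. The sketch below assumes that the SAS (3.9)-(3.10) reads $\mathbf{x}_t = p(z_t)\mathbf{x}_{t-1} + q(z_t)$, $y_t = \mathbf{W}^\top \mathbf{x}_t$, with inputs in $[-1, 1]$. The coefficients of $p$ are rescaled so that their spectral norms add up to $0.9$; by the triangle inequality this forces $\|p(z)\| \leq 0.9$ on $[-1, 1]$. All names, sizes, and degrees are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def poly_eval(coeffs, z):
    """Evaluate sum_i coeffs[i] * z**i for matrix (or vector) coefficients."""
    return sum(C * z ** i for i, C in enumerate(coeffs))

def run_sas(A_coeffs, B_coeffs, W, z, x0):
    """Iterate x_t = p(z_t) x_{t-1} + q(z_t) and return the outputs y_t = W^T x_t."""
    x = x0.copy()
    outputs = np.empty(len(z))
    for t, z_t in enumerate(z):
        x = poly_eval(A_coeffs, z_t) @ x + poly_eval(B_coeffs, z_t)
        outputs[t] = W @ x
    return outputs

N, deg_p, deg_q = 5, 2, 3
A_coeffs = [rng.normal(size=(N, N)) for _ in range(deg_p + 1)]
total = sum(np.linalg.norm(A, 2) for A in A_coeffs)
A_coeffs = [0.9 * A / total for A in A_coeffs]  # spectral norms now add up to 0.9
B_coeffs = [rng.normal(size=N) for _ in range(deg_q + 1)]
W = rng.normal(size=N)

# Numerical check of the bound on ||p(z)|| over a grid of [-1, 1].
grid = np.linspace(-1.0, 1.0, 201)
max_norm = max(np.linalg.norm(poly_eval(A_coeffs, g), 2) for g in grid)

# Same input, two very different initial states.
z = rng.uniform(-1.0, 1.0, size=300)
y_a = run_sas(A_coeffs, B_coeffs, W, z, x0=np.zeros(N))
y_b = run_sas(A_coeffs, B_coeffs, W, z, x0=10.0 * rng.normal(size=N))
gap = np.abs(y_a - y_b)
```

The gap between the two outputs decays at least as fast as $0.9^t$, so the initial condition is forgotten. This is the behavior that the echo state and the fading memory properties formalize.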
The importance of SAS in relation to the universality problem has to do with the fact that they form a polynomial algebra which allows us, under certain conditions, to use the Stone-Weierstrass theorem to prove a density statement. Before we show that, we observe that for any two polynomials and given by
with , their direct sum and their Kronecker product is written as
where we assumed that . Analogously,
Let be an open set and let be two SAS reservoir functionals associated to two corresponding time-invariant SAS reservoir systems that have and as state spaces, respectively. Assume that the two polynomials with matrix coefficients and satisfy that and for all and a given . Then, with the notation introduced in the expressions (3.16) and (3.17), we have that:
For any , the linear combination is a SAS reservoir functional associated to a SAS that has as state space and:
The product is a SAS reservoir functional associated to a SAS that has as state space and:
where , , is the polynomial with matrix coefficients whose block-matrix expression for the three summands in is:
The expression (respectively, ) denotes the linear map from (respectively, ) to that associates to any the element (respectively, ).
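The product construction can be verified numerically. The sketch below assumes again that a SAS has the form $\mathbf{x}_t = p(z_t)\mathbf{x}_{t-1} + q(z_t)$, $y_t = \mathbf{W}^\top \mathbf{x}_t$, and it stacks the joint state as $(\mathbf{x}^1, \mathbf{x}^2, \mathbf{x}^1 \otimes \mathbf{x}^2)$. This ordering of the blocks is an assumption and may differ from the one used in the proposition. Expanding $(p_1 \mathbf{x}^1 + q_1) \otimes (p_2 \mathbf{x}^2 + q_2)$ shows that the stacked state again follows a state-affine recursion whose linear readout $\mathbf{W}_1 \otimes \mathbf{W}_2$ returns the product of the two outputs.

```python
import numpy as np

rng = np.random.default_rng(3)

def poly_eval(coeffs, z):
    """Evaluate sum_i coeffs[i] * z**i for matrix (or vector) coefficients."""
    return sum(C * z ** i for i, C in enumerate(coeffs))

def random_sas(N, degree, bound):
    """Random SAS whose p-coefficients have spectral norms adding up to `bound`."""
    A = [rng.normal(size=(N, N)) for _ in range(degree + 1)]
    total = sum(np.linalg.norm(a, 2) for a in A)
    A = [bound * a / total for a in A]
    B = [rng.normal(size=N) for _ in range(degree + 1)]
    return A, B, rng.normal(size=N)

N1, N2 = 3, 2
p1, q1, W1 = random_sas(N1, degree=1, bound=0.8)
p2, q2, W2 = random_sas(N2, degree=2, bound=0.8)

# Readout of the product system: only the Kronecker block of the state is used.
W12 = np.concatenate([np.zeros(N1), np.zeros(N2), np.kron(W1, W2)])

x1, x2 = np.zeros(N1), np.zeros(N2)
x12 = np.zeros(N1 + N2 + N1 * N2)
products, joint = [], []
for z_t in rng.uniform(-1.0, 1.0, size=100):
    P1, Q1 = poly_eval(p1, z_t), poly_eval(q1, z_t)
    P2, Q2 = poly_eval(p2, z_t), poly_eval(q2, z_t)
    # Block matrix and vector of the product SAS evaluated at z_t.
    P12 = np.block([
        [P1, np.zeros((N1, N2)), np.zeros((N1, N1 * N2))],
        [np.zeros((N2, N1)), P2, np.zeros((N2, N1 * N2))],
        [np.kron(P1, Q2[:, None]), np.kron(Q1[:, None], P2), np.kron(P1, P2)],
    ])
    Q12 = np.concatenate([Q1, Q2, np.kron(Q1, Q2)])
    x1, x2, x12 = P1 @ x1 + Q1, P2 @ x2 + Q2, P12 @ x12 + Q12
    products.append((W1 @ x1) * (W2 @ x2))
    joint.append(W12 @ x12)
```

The two lists `products` and `joint` coincide up to rounding. The entries of `P12` and `Q12` are polynomials in the input, which is why the product of two SAS functionals is again a SAS functional.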
Theorem 3.9 (Universality of SAS reservoir computers)
Let be the subset of real uniformly bounded sequences in as in (2.3), that is,
and let be the family of functionals induced by the state-affine systems defined in (3.9)-(3.10) that satisfy that and . The family forms a polynomial subalgebra of (as defined in (3.5)) with and arbitrary, made of fading memory reservoir filters that contains the constant functions and separate points. The subfamily is hence dense in the set of real-valued continuous functions on which implies that any causal, time-invariant fading memory filter can be uniformly approximated by elements in . More specifically, for any fading memory filter and any , there exist a natural number , polynomials with , and a vector such that
As stated in Theorem 3.9, it is condition (iii) in Lemma 3.6 that yields a universal family of SAS fading memory reservoirs. As can be deduced from its proof (available in Appendix 6.10), the families determined by conditions (i) or (ii) in that lemma contain SAS fading memory reservoirs but they do not form a polynomial algebra. In such cases, and according to Theorem 3.1, it is the algebras generated by them and not the families themselves that are universal.
A continuous-time analog of the universality result that we just proved can be obtained using the bilinear systems considered in Section 5.3 of [Boyd 85]. In discrete time, but only when the number of time steps is finite, this universal approximation property is exhibited [Flie 80] by homogeneous state-affine systems, that is, by setting in (3.9)-(3.10).
Generalization to multidimensional signals. When the input signal is defined in , with , a SAS family with the same universality properties can be defined by replacing the polynomials and in Definition 3.4, by polynomials of the form:
SAS with trigonometric polynomials. An analogous construction can be carried out using trigonometric polynomials of the type:
In this case, it is easy to establish that the resulting SAS family forms a polynomial algebra by invoking Proposition 3.8 and by reformulating the expressions (3.16) and (3.17) using the trigonometric identity
4 Reservoir universality results in the stochastic setup
This section extends the previously stated deterministic universality results to a setup in which the reservoir inputs and outputs are stochastic, that is, the reservoir is not driven anymore by infinite sequences but by discrete-time stochastic processes. We emphasize that we restrict our discussion to reservoirs that are deterministic and hence the only source of randomness in the systems considered is the stochastic nature of the input.
The stochastic setup.
All along this section we work on a probability space . If a condition defined on this probability space holds everywhere except for a set with zero measure, we will say that the relation is true almost surely and we will abbreviate it as a.s. We will denote by the set of -valued random variables whose Euclidean norms have a finite essential supremum or that, equivalently, have almost surely bounded Euclidean norms. More specifically, let be a random variable and let
It can be shown that is a Banach space (see [Ledo 91, pages 42 and 46], [Lord 14, page 149]). Given an element , we denote by its expectation. The following lemma collects some elementary results that will be needed later on and shows, in particular, that the expectation as well as that of all the powers of its norm are finite for all the elements .
Let and let be a real number. Then: