# Memory and Information Processing in Recurrent Neural Networks

###### Abstract

Recurrent neural networks (RNN) are simple dynamical systems whose computational power has been attributed to their short-term memory. Short-term memory of RNNs has been previously studied analytically only for the case of orthogonal networks, and only under annealed approximation, and uncorrelated input. Here for the first time, we present an exact solution to the memory capacity and the task-solving performance as a function of the structure of a given network instance, enabling direct determination of the function–structure relation in RNNs. We calculate the memory capacity for arbitrary networks with exponentially correlated input and further related it to the performance of the system on signal processing tasks in a supervised learning setup. We compute the expected error and the worst-case error bound as a function of the spectra of the network and the correlation structure of its inputs and outputs. Our results give an explanation for learning and generalization of task solving using short-term memory, which is crucial for building alternative computer architectures using physical phenomena based on the short-term memory principle.

Excitable dynamical systems, or reservoirs, store a short-term memory of a driving input signal in their instantaneous state (ganguli2008, ). This memory can produce a desired output in a linear readout layer, which can be trained efficiently using ordinary linear regression or gradient descent. This paradigm, called reservoir computing (RC), was originally proposed as a simplified model of information processing in the prefrontal cortex (dominey1995, ). It was later generalized to explain computation in cortical microcircuits (Maass et al., 2002) and to facilitate training in recurrent neural networks (jaeger2004, ). A central feature of RC is the lack of fine tuning of the underlying dynamical system: any random structure that guarantees a stable dynamics gives rise to short-term memory (Maass et al., 2002; jaeger2004, ). Analogous behavior has also been observed in selective response in random neural populations (Hansel2012, ). Furthermore, fixed underlying structure in RC makes it suitable for implementing computation using spatially distributed physical phenomena(goudarzi2012, ; goudarzi2013, ; Sillin, 2013; burger2015, ; Nakajima, 2014; Vandoorne, 2014; Haynes, 2015; Katayama, 2015). Such approaches can give us a way to store and process information more efficiently than with von Neumann architecture (crutchfield2010, ).

Short-term memory in neural networks has been studied for uncorrelated input under annealed approximation, i.e., connectivity is resampled independently at each time step (white2004, ). That study considered only linear orthogonal networks, where the columns of the connectivity matrix are pairwise orthogonal and the node transfer functions are linear. A memory function was defined to measure the ability of the system to reconstruct input from time steps ago, i.e., , from the present system state . It was shown that the total memory capacity cannot exceed the degrees of freedom in the system. For networks with saturating nonlinearity, the memory scales with (ganguli2008, ); however, by fine-tuning the nonlinearity one can achieve near-linear scaling of memory capacity (Toyoizumi, 2002). In nonlinear networks, it is very difficult to analyze the complete memory function and even harder to relate it to the performance on computational tasks, as is evident from many works in this area with hard-to-reconcile conclusions (see Ref. (lukosevicius2012, )).

Model— Consider a discrete-time network of nodes. The network weight matrix is with spectral radius . A time-dependent scalar input signal is fed to the network using the input weight vector . The evolution of the network state and the output is governed by

(1) | ||||

(2) |

where is an -dimensional column vector calculated for a desired output .
Here, each column of is the state of the network at time and each column of is the corresponding desired output at each time step. In practice it is sometimes necessary to use Tikhonov regularization to calculate the readout weights, i.e., , where is a regularization factor that needs to be adjusted depending on , and lukosevicius2012 ().

Calculating for a given problem requires the following input-output-dependent evaluations (Appendix A):

(3) | ||||

(4) |

where and are the autocorrelation of the input and the cross-correlation of the input and target output. This may also be expressed more generally in terms of the power spectrum of the input and the target:

(5) | ||||

(6) |

where and , is the power spectral density of the input, and is the cross-spectral density of the input and the target output.

The performance can be evaluated by the mean-squared-error (MSE) as follows:

(7) | ||||

(8) |

The MSE gives us a distribution-independent upper bound on the instantaneous-squared-error through the application of Markov inequality:

(9) |

The existence of a worst-case bound is important for engineering applications of RC.

Memory and task solving— The memory function of the system is defined as (white2004, )

(10) |

where is the input with lag , .

The exact evaluation of this function has previously proved elusive for arbitrary and . Here we provide a solution using eigendecomposition of and the power spectral density of the input signal. The solution may also be described directly in terms of the autocorrelation of the input, as in Appendix B.

Let and so that

(11) |

The memory function is reduced to

(12) |

where the matrix is given by

(13) |

and the matrix is given by

(14) |

The total memory is then given by

(15) |

where

(16) |

We validate our formula by comparing analytical and empirical evaluation of the memory curve. The input is assumed to be a sequence of length with autocorrelation function (Appendix C). FIG 1(a) shows the result of the single-instance calculation of the analytical and the empirical memory curves for different (Appendix C). As expected, the analytical and empirical results agree.

We also study the memory capacity for different levels of structure in the input signal. Here, we use simple ring topologies with and vary the decay exponent , FIG 1(b). For a fixed system size, decreasing exponentially increases the correlation in the input, which increases the memory capacity.

Next, we use our method to calculate the optimal output layer, the expected average error, and bounds on worst-case error for a common NARMA10 benchmark task:

(17) |

where , . The input is drawn from a uniform distribution over the interval . Here the evaluation of follows the same calculation as for the memory capacity for the uniform distribution. For we must estimate the cross-correlation of and and substitute it into Equation 4. FIG 2(a) shows the output of a network of nodes and with analytically calculated optimal output layer. The output agrees with the correct output of the NARMA10 system. The cross-correlation of the system used for the calculation is shown in the inset. FIG 2(b) shows the worst-case error bound for this system and the empirical errors generated from the system output, showing that the bound we derived is tight.

Finally, we show how the framework can be used in a prediction scenario, namely the prediction of the Mackey-Glass system. The Mackey-Glass system (Mackey, 1977) was first proposed as a model for feedback systems that may show different dynamical regimes. The system is a one-dimensional delayed feedback differential equation and manifests a wide range of dynamics, from fixed points to strange attractors with varying divergence rates (Lyapunov exponent). This system has been used as a benchmark task for chaotic signal prediction and generation (jaeger2004, ). It is defined as:

(18) |

where ensure the chaoticity of the dynamics (jaeger2004, ).

FIG 2(c) shows the prediction result for 10 time steps ahead and the inset shows the autocorrelation at different lags. The autocorrelation is characterized by a long correlation length evident from non-zero correlation values for large . This long memory is a hallmark of chaotic systems. We use this information to evaluate Equation 3 and Equation 4, where for predicting steps ahead we have

The effect of network structure— The effect of randomness and sparsity of reservoir connectivity has been a subject of debate (lukosevicius2012, ). To study the effect of network structure on memory and performance, we systematically explore the range between sparse deterministic uniform networks and random graphs. We start from a simple ring topology with identical weights and induce noise to random links by sampling the normal distribution . We then re-evaluate the memory and task solving performance keeping the weight matrix fixed. We evaluate system performance on a memory task with exponentially correlated inputs, the nonlinear autoregressive NARMA10 task, and the Mackey-Glass chaotic time series prediction. We systematically explore the effects of , , and on the performance of the system.

FIG 4(a) shows the resulting total memory capacity normalized by as a function of increasing randomness for different spectral radii . The expected theoretical total memory capacity for an uncorrelated signal is . Here the system exploits the structure of the input signal to store longer input sequences, i.e., . This effect has been studied previously under annealed approximation and in a compressive sensing setup (NIPS2010_3980, ). However, here we see that even without the sparse input assumption and optimization in the output (a computationally expensive optimization used in compressive sensing) the network can achieve capacity greater than its degrees of freedom . FIG 4(b) and (c) show the error in the NARMA10 and the Mackey-Glass prediction tasks. Here, best performance is achieved for a regular architecture. A slight randomness significantly increases error at first, but additional irregularity will decrease it. This can be observed for the NARMA10 task at and for the Mackey-Glass prediction task at .

Discussion— Although memory capacity of RNNs has been studied before, it is learning and generalization ability in a task solving setup is not discussed. Our derivation allows us to relate the memory capacity to task solving performance for arbitrary RNNs and reason about their generalization. In empirical experiments with systems presented here, the training and testing are done with finite input sequences that are sampled independently for each experiment, so the statistics of the training and testing inputs vary according to a Gaussian distribution around their true values and one expects these estimates to approach their true values with increasing sample size. Hence, the mean-squared-error , which is linear in the input and output statistics, is also distributed as a Gaussian for repeated experiments. By the law of large numbers, the difference between testing and training mean-squared-error tends to zero in the limit. This explains the ability of the system to generalize its computation from training to test samples.

Conclusion— The computational power of reservoir computing networks has been attributed to their memory capacity. While their memory properties have been studied under annealed approximation, thus far no direct mathematical connection to their signal processing performance had been made. We developed a mathematical framework to exactly calculate the memory capacity of RC systems and extended the framework to study their expected and worst-case errors on a given task in a supervised learning setup. Our result confirms previous studies that the upper bound for memory capacity for uncorrelated inputs is . We further show that the memory capacity monotonically increases with correlation in the input. Intuitively, the output exploits the redundant structure of the inputs to retrieve longer sequences. Moreover, we generalize our derivation to task solving performance. Our derivation help us reason about the memory and performance of arbitrary systems directly in terms of their structure. We showed that networks with regular structure have a higher memory capacity but are very sensitive to slight changes in the structure, while irregular networks are robust to variation in their structure.

Acknowledgments. This work was partly supported by NSF grants #1028238 and #1028120, #1518861, and #1525553.

## References

- (1) S. Ganguli, D. Huh, H. Sompolinsky, Proc. Natl. Acad. Sci. USA 105, 18970–18975 (2008)
- (2) P. Dominey, M. Arbib, and J. P. Joseph, J. Cogn. Neurosci. 7, 311–336 (1995)
- Maass et al. (2002) W. Maass, T. Natschläger, and H. Markram, Neural Comput. 14, 2531–60, (2002)
- (4) H. Jaeger and H. Haas, Science 304, 148102 (2004)
- (5) Lukoševičius, H. Jaeger, and B. Schrauwen, Künstliche Intelligenz 26, 365–371 (2012)
- (6) D. Hansel and C. van Vreeswijk, J. Neurosci. 32, 4049-4064 (2012)
- (7) J. P. Crutchfield, W. L. Ditto, and S. Sinha, Chaos 20, 037101 (2010)
- (8) O. L. White, D. D. Lee, and H. Sompolinsky, Phys. Rev. Lett. 92, 148102 (2004)
- Toyoizumi (2002) T. Toyoizumi, Neural Comput. 24, 2678–99, (2012)
- (10) L. Büsing, B. Schrauwen, and R. Legenstein, Neural Comput. 22, 1272–1311 (2010)
- (11) A. Rodan and P. Tiňo, Neural Networks, IEEE Transactions on 22, 131-144 (2011)
- (12) S. Ganguli and H. Sompolinsky, Advances in Neural Information Processing Systems, 23, 667–675, (2010)
- (13) A. Goudarzi, C. Teuscher, N. Gulbahce, and T. Rohlf, Phys. Rev. Lett. 108, 128702 (2012)
- (14) D. Snyder, A. Goudarzi, and C. Teuscher, Phys. Rev. E 87, 042808 (2013)
- Sillin (2013) H. O. Sillin, R. Aguilera, H. Shieh, A. V. Avizienis, M Aono, A. Z. Stieg, and J. K. Gimzewski, Nanotechnology, 24, 384004, (2013)
- Goudarzi (2014a) A. Goudarzi and D. Stefanovic, Procedia Computer Science 41, 176–181, (2014)
- Haynes (2015) N. D. Haynes, M. C. Soriano, D. P. Rosin, I. Fischer, and D. J. Gauthier, Phys. Rev. E 91, 020801, (2015)
- Nakajima (2014) K. Nakajima, T. Li, H. Hauser, and R. Pfiefer, J. R. Soc. Interface 11, 20140437, (2014)
- (19) J. Bürger, A. Goudarzi, D. Stefanovic, and C. Teuscher, AIMS Materials Science 2, 530–545, (2015)
- Katayama (2015) Y. Katayama, T. Yamane, D. Nakano, R. Nakane, and G. Tanaka, Proceedings of the 2015 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH ’15), IEEE, p. 23-24, (2015)
- Vandoorne (2014) K. Vandoorne, P. Mechet, T. Van Vaerenbergh, M. Fiers, G. Morthier, D. Verstraeten, B. Schrauwen, J. Dambre, and P. Bienstman, Nat. Commun. 5, 3541, (2014)
- Mackey (1977) M. C. Mackey and L. Glass, Science 197, 287–289, (1977)

## Appendix A Computing the optimal readout weights using autocorrelation

The recurrent neural network (RNN) model that we study in this paper is an echo state network (ESN) with linear activation function. This system consist of an input driven recurrent network of size , and a linear readout layer trained to calculate a desired function of the input. Let and indicate a one-dimensional input at time and an input weight vector respectively. Let be a recurrent weight matrix, be an -dimensional network state at time , and be the readout weight vector. The dynamics of the network and output is described by:

(19) | ||||

(20) |

where the readout weights are given by white2004 ():

(21) |

The value of the optimal readout weights depend on the covariance and cross-covariance components and . Here we show that these can be computed exactly for any arbitrary system given by and and autocorrelation of the input and cross-correlation of input and output .

We begin by noting that the explicit expression for the system state is given by:

(22) |

Calculating for a given problem requires the following input-output-dependent evaluations:

(23) | ||||

(24) |

## Appendix B Computing the total memory using autocorrelation

Here we compute the memory function and the total memory of the recurrent neural network described in Appendix A for exponentially correlated input where . The total memory of the system is given by the following summation over the memory function white2004 ():

(25) |

where is the input with lag , .

Computing requires the evaluation of:

(26) |

This assumes an even correlation function, i.e., . For numerical computation it is more convenient to perform the calculation as follows:

(27) |

where is a partial sum of satisfying , is a partial sum of satisfying , and is a partial sum of satisfying , which is double counted and must be subtracted. We can substitute and evaluate and as follows:

(28) | ||||

(29) |

(30) | ||||

(31) | ||||

(32) | ||||

(33) | ||||

(34) | ||||

(35) | ||||

(36) |

Here is the identity of the Hadamard product denoted by , and is a matrix inverse with respect to the Hadamard product. Here the trick is that takes the input to the basis of the connection matrix allowing the dynamics to be described by the powers of the eigenvalues of , i.e., . Since is symmetric we can use the matrix identity , where is the main diagonal of . Summing over the powers of gives us .

The covariance of the network states and the expected output is given by:

(37) |

For , the signal becomes i.i.d. and the calculations simplify as follows (Goudarzi, 2014a):

(38) | ||||

(39) |

The total memory capacity can be calculated by summing over :

(40) |

## Appendix C Experimental Setup for Memory Task

For our experiment with memory capacity of network under exponentially correlated input we used the following setup. We generated long sample inputs with autocorrelation function . To generate exponentially correlated input we draw samples from a uniform distribution over the interval . The samples are passed through a low-pass filter with a smoothing factor . We normalize and center so that and . The resulting normalized samples have exponential autocorrelation with decay exponent , i.e., . To validate our calculations, we use a network of nodes in a ring topology and identical weights. The spectral radius . The input weights are created by sampling the binomial distribution and multiplying with . The scale of the input weights does not affect the memory and the performance in linear systems and therefore we adopt this convention for generating throughout the paper. We also assumed , the number of samples , washout period of steps, and regularization factor .

## Appendix D Experimental Setup for Topological Study

A long standing question in recurrent neural network is how its structure effect its memory and task solving performance. Our derivation lets us compute optimal readout layer for arbitrary network. Here we describe the calculations we performed to examine the effect of structure of the network on its memory and task solving performance. To this end, we use networks of size , , and and we systematically study the randomness and spectral radius. We start from a uniform weight ring topology and incrementally add randomness from to . The results for each value of and are averaged over instances. This averaging is necessary even for because the input weights are randomly generated and although their scaling does not affect the result their exact values do (ganguli2008, ).

## Appendix E Computing the optimal readout weights using power spectrum

The calculations in Appendix A for optimal layer of a recurrent network may be described in a more generally in terms of power spectrum of the input signal. Here we assume the setup in Appendix A and derive an expressions for optimal readout layer using its the power spectrum of the input and output.

We start by the standard calculation of and :

(41) | ||||

(42) |

We replace

(43) |

and

(44) |

which gives

(45) | ||||

(46) | ||||

(47) |

and

(48) | ||||

(49) | ||||

(50) |

## Appendix F Memory capacity expressed in terms of power spectrum

Here we use the derivation in Appendix E and compute the memory function and the total memory of the system. Let and so that

(51) |

We find that

(52) |

The matrix is given by

(53) |

and the matrix is given by:

(54) |

The total memory is then given by:

(55) |

where

(56) |