Integer Echo State Networks: Hyperdimensional Reservoir Computing
We propose an approximation of Echo State Networks (ESN) that can be efficiently implemented on digital hardware based on the mathematics of hyperdimensional computing. The reservoir of the proposed Integer Echo State Network (intESN) is a vector containing only n-bit integers (where n < 8 is normally sufficient for satisfactory performance). The recurrent matrix multiplication is replaced with an efficient cyclic shift operation. The intESN architecture is verified on tasks typical for reservoir computing: memorizing a sequence of inputs, classifying time-series, and learning dynamic processes. Such an architecture results in dramatic improvements in memory footprint and computational efficiency, with minimal performance loss.
Recent work in Reservoir Computing (RC) [1, 2] illustrates how a Recurrent Neural Network (RNN) with fixed connectivity can memorize and generate complex spatio-temporal sequences. RC has been shown to be a powerful tool for modeling and predicting dynamic systems, both living and technical [4, 5].
Recent work on feed-forward networks shows that the binarization of filters in Convolutional Neural Networks can lead to enormous gains in memory and computational efficiency. Reducing the memory allocated to each neuron or synapse from a 32-bit float to a few bits or to binary values saves computation with minimal loss in performance (see, e.g., [7, 8]). The increase in efficiency broadens the range of applications for these networks.
This article addresses two important research directions in RC: training reservoir networks and implementing networks efficiently. We discovered several direct functional similarities between the operations in RC and those of hyperdimensional computing (HDC). HDC, or more generally Vector Symbolic Architectures, is a framework for neural symbolic representation, computation, and analogical reasoning. The distinction from traditional computing is that all entities (objects, phonemes, symbols) are represented by random vectors of very high dimensionality – several thousand dimensions. Complex data structures and analogical reasoning are implemented by simple arithmetical operations (binding, addition/bundling, and permutation) and a well-defined similarity metric. Specifically, RC and HDC are connected by the following core principles:
The random projection of input values onto a reservoir (which in essence is a high-dimensional vector) matches random HDC representations stored in superposition;
The update of the reservoir by a random recurrent connection matrix is similar to the HDC binding/permutation operation;
The nonlinearity of the reservoir can be approximated with the thresholded addition of integers in HDC.
We exploited these findings to design the Integer Echo State Network (intESN), which performs like the conventional ESN but with a smaller memory footprint and computational cost.
In the proposed architecture, the reservoir of the network contains only constrained integers for each neuron, reducing the memory of each neuron from a 32-bit float to only a few bits. The recurrent matrix multiplication is replaced by a permutation (or even a cyclic shift), which results in a dramatic boost in computational efficiency. We validate the architecture on several tasks common in the RC literature. All examples demonstrate a satisfactory approximation of the performance of the conventional ESN.
II Background and related work
There are many practical tasks that require the history of inputs in order to be solved. In the area of artificial neural networks (ANN), such tasks require working memory, which can be implemented by recurrent connections between the neurons of an RNN. Training RNNs is much harder than training feed-forward ANNs (FFNNs) due to the vanishing gradient problem.
The challenge of training RNNs has been addressed with two approaches. One approach eliminates the vanishing gradient problem through neurons with special memory gates. Another approach reformulates the training process by learning only the connections to the last readout layer while keeping the other connections fixed. This approach originally appeared in two similar architectures: Liquid State Machines and ESNs, now jointly referred to as reservoir computing.
It is interesting to note that similar ideas were conceived in the area of FFNNs, which can be seen as RNNs without memory, and are known under the name of Extreme Learning Machines (ELMs). ELMs are used to solve various machine learning problems including classification, clustering, and regression.
Important applications of RC are the modeling and prediction of complex dynamic systems. For example, it was shown that ESNs can be used for forecasting electroencephalography signals and for solving classification problems in the context of Brain-Computer Interfaces. There are different classification strategies and readout methods for classifying time-series with ESN. Three classification strategies and three readout methods have been explored under conditions where the testing data is purposefully polluted with noise; interestingly, different readout methods are preferable in different noise conditions. Recent work also studied classification of multivariate time-series with ESN using standard benchmarking datasets, covering several advanced approaches, which extend the conventional ESN architecture, for generating a representation of a time-series in a reservoir.
Another recent research area is binary RC with cellular automata (CARC), which started as interdisciplinary research spanning three areas: cellular automata, RC, and HDC. CARC was initially explored for projecting binarized features into high-dimensional space. Further, it was applied to modality classification of medical images, its usage for symbolic reasoning was explored, the memory characteristics of a reservoir formed by CARC were studied, and the usage of coupled cellular automata in CARC was proposed. Examples of recent RC developments also include advanced architectures such as the Laplacian ESN, learning of the reservoir's size and topology, new tools for investigating reservoir dynamics, and determining its edge of criticality.
The approach conceptually closest to our intESN argued that a simple cycle reservoir can be used to achieve a performance similar to the conventional ESN. While this work explored reservoir update solutions similar to one of our optimizations, its technical side is very different from our approach, as intESN strives to use only integers as neuron activation values.
II-A Echo State Networks
This subsection summarizes the functionality of the conventional ESN, following the standard description for the special case of leaky integration in which the leaking rate equals one. (For a detailed tutorial on ESNs, diligent readers are referred to the literature.) Fig. 1 depicts the architectural design of the conventional ESN, which includes three layers of neurons. The input layer (K neurons) represents the current value of the input signal, denoted as u(n). The output layer (L neurons) produces the output of the network (denoted as y(n)) during the operating phase. The reservoir is the hidden layer of the network with N neurons, with the state of the reservoir at time n denoted as x(n).
In general, the connectivity of the ESN is described by four matrices. W^in describes connections between the input layer neurons and the reservoir, and W^fb does the same for the output layer. Both matrices project the current input and output to the reservoir. The memory in the ESN is due to the recurrent connections between neurons in the reservoir, which are described in the reservoir matrix W. Finally, the matrix of readout connections W^out transforms the current activity levels in the input layer and reservoir (u(n) and x(n), respectively) into the network's output y(n).
Note that three matrices (W^in, W^fb, and W) are randomly generated at the network initialization and stay fixed during the network's lifetime. Thus, the training process is focused on learning the readout matrix W^out. There are no strict restrictions on the generation of the projection matrices W^in and W^fb. They are usually randomly drawn from either normal or uniform distributions and scaled as shown below. The reservoir connection matrix, however, is restricted to possess the echo state property. This property is achieved when the spectral radius of the matrix W is less than or equal to one. For example, W can be generated from a normal distribution and then normalized by its maximal eigenvalue. Unless otherwise stated, in this article an orthogonal matrix was used as the reservoir connection matrix; such a matrix was formed by applying QR decomposition to a random matrix generated from the standard normal distribution. Also, W can be scaled by a feedback strength parameter, see (1).
The update of the network's reservoir at time n is described by the following equation:

x(n) = tanh(W x(n-1) + k^in W^in u(n) + k^fb W^fb y(n-1)),    (1)

where k^in and k^fb denote the projection gain and the feedback strength, respectively. Note that it is assumed that the spectral radius of the reservoir connection matrix W is one. Note also that at each time step the neurons in the reservoir apply tanh as the activation function. The nonlinearity prevents the network from exploding by restricting the range of possible values from -1 to 1. The activity in the output layer is calculated as:

y(n) = f^out(W^out [u(n); x(n)]),    (2)
where the semicolon denotes the concatenation of two vectors and f^out denotes the activation function of the output neurons, for example, linear or winner-take-all.
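As a minimal sketch of Eqs. (1)-(2) in NumPy (not the authors' code: the sizes, gains, and the linear readout below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, L = 100, 1, 1              # reservoir, input, output sizes (illustrative)

# Random projections; W is orthogonalized so that its spectral radius is 1.
W_in = rng.uniform(-1, 1, (N, K))
W_fb = rng.uniform(-1, 1, (N, L))
W, _ = np.linalg.qr(rng.standard_normal((N, N)))
W_out = rng.standard_normal((L, K + N))   # in practice learned (Section II-A1)

k_in, k_fb = 1.0, 0.5            # projection gain and feedback strength (assumed)

def esn_step(x, u, y):
    """Eq. (1): x(n) = tanh(W x(n-1) + k_in W_in u(n) + k_fb W_fb y(n-1))."""
    return np.tanh(W @ x + k_in * (W_in @ u) + k_fb * (W_fb @ y))

def esn_output(x, u):
    """Eq. (2) with a linear f_out: y(n) = W_out [u(n); x(n)]."""
    return W_out @ np.concatenate([u, x])

x = esn_step(np.zeros(N), np.array([0.3]), np.zeros(L))
```

The tanh nonlinearity guarantees that every reservoir state stays within [-1, 1], matching the discussion above.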
II-A1 Training process
This article only considers training with supervised learning, when the network is provided with the ground truth desired output at each update step. The reservoir states are collected together with the ground truth for each training step. The weights of the output layer connections are acquired by solving the regression problem that minimizes the mean square error between the predictions (2) and the ground truth. While this article does not focus on the readout training task, it should be noted that there are many alternatives reported in the literature, including regression with regularization, online update rules, etc.
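A sketch of such a readout computed with ridge regression (the regularization constant and array shapes here are illustrative, not the values used in the experiments):

```python
import numpy as np

def train_readout(X, Y, beta=1e-6):
    """Ridge-regression readout: W_out = Y X^T (X X^T + beta I)^(-1).

    X: (N, T) matrix of collected reservoir states;
    Y: (L, T) matrix of ground-truth outputs."""
    N = X.shape[0]
    return Y @ X.T @ np.linalg.inv(X @ X.T + beta * np.eye(N))

# Synthetic check: if Y really is a linear function of the states,
# the learned readout recovers it.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 200))
W_true = rng.standard_normal((2, 50))
Y = W_true @ X
W_out = train_readout(X, Y)
```

For ill-conditioned state matrices, a pseudo-inverse (`np.linalg.pinv`) or a larger `beta` is the usual remedy.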
II-B Fundamentals of hyperdimensional computing
In a localist representation, which is used in all modern digital computers, a group of bits is needed in its entirety to interpret a representation. In HDC, all entities (objects, phonemes, symbols, items) are represented by vectors of very high dimensionality – thousands of bits. The information is spread out in a distributed representation, in which, contrary to localist representations, any subset of the bits can be interpreted. Computing with distributed representations utilizes statistical properties of vector spaces of very high dimensionality, which allow for approximate, noise-tolerant, highly parallel computations. An item memory (also referred to as a clean-up memory) is needed to recover the composite representations assigned to complex concepts. There are several flavors of HDC with distributed representations, differentiated by the random distribution of vector elements, which can be real numbers [32, 33, 34, 35], complex numbers, binary numbers [10, 37], or bipolar numbers [33, 38].
We rely on the mathematics of HDC with bipolar distributed representations to develop the intESN. Kanerva proposed the use of distributed representations comprising binary elements (referred to as HD vectors). The values of the elements of an HD vector are independent and equally probable, hence such representations are also called dense distributed representations. Similarity between two binary HD vectors is characterized by the Hamming distance, which measures the number of elements in which the two vectors differ. In very high dimensions, Hamming distances (normalized by the dimensionality n) between any arbitrarily chosen HD vector and all other vectors in the HD space are concentrated around 0.5. Interested readers are referred to the literature for a comprehensive analysis of the probabilistic properties of the high-dimensional representational space.
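This concentration behavior is easy to verify numerically (the dimensionality and sample count below are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                                  # dimensionality of the HD space

a = rng.integers(0, 2, n)                   # one random binary HD vector
others = rng.integers(0, 2, (100, n))       # 100 more random HD vectors

# Normalized Hamming distance between a and every other vector.
dists = (a != others).mean(axis=1)
```

The distances cluster tightly around 0.5, with a spread on the order of 0.5/sqrt(n); this is what makes random HD vectors "nearly orthogonal" in practice.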
The binary HD vectors can be equivalently mapped to the case of bipolar representations, i.e., where each vector's element is encoded as "-1" or "+1". This definition is sometimes more convenient for purely computational reasons. The distance (similarity) metric for the bipolar case is the dot product: sim(a, b) = a^T b.
Basic symbols in HDC are referred to as atomic HD vectors. They are generated randomly and independently, and due to the high dimensionality they will be nearly orthogonal with very high probability, i.e., the similarity (dot product) between such HD vectors is approximately 0. An ordered sequence of symbols can be encoded into a composite HD vector using the atomic HD vectors, the permutation operation (e.g., cyclic shift as a special case of permutation), and the bundling operation. The composite HD vector encodes the entire sequence history and resembles a neural reservoir.
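A sketch of such sequence encoding with cyclic shift and bundling (the alphabet and dimensionality are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                      # dimensionality of HD vectors
# Item memory: one random atomic bipolar HD vector per symbol.
item_memory = {s: rng.choice([-1, 1], n) for s in "abcd"}

def encode(seq):
    """Bundle a sequence: cyclically shift the running trace, add the next symbol."""
    x = np.zeros(n, dtype=int)
    for s in seq:
        x = np.roll(x, 1) + item_memory[s]
    return x

x = encode("abc")
# The most recent symbol ('c') is the one most similar to the unshifted trace;
# earlier symbols must be probed through correspondingly shifted atomic vectors.
sims = {s: int(x @ item_memory[s]) for s in "abcd"}
```

Probing `x` against `np.roll(item_memory["b"], 1)` similarly recovers the symbol stored one step earlier, which is exactly the delayed-recall mechanism used later in Section IV-A1.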
Normally in HDC, the recovery of component atomic HD vectors from a composite HD vector is performed by finding the most similar vectors stored in the item memory. (It is not common to do such decoding in RC; normally, in the scope of RC, a readout matrix is learned. In this article, we follow this standard RC approach to extracting information back from a reservoir.) However, as more vectors are bundled together there is more interference noise, and the likelihood of recovering the correct atomic HD vector declines.
Our recent work reveals the impact of interference noise and shows that different flavors of HDC have universal memory capacity. Thus, the different flavors of HDC can be interchanged without affecting performance. From these insights, we are able to design much more efficient reservoir computing networks for digital hardware.
III Integer Echo State Networks
This section presents the main contribution of the article – an architecture for the Integer Echo State Network. The architecture is illustrated in Fig. 2. The intESN is structurally identical to the conventional ESN (see Fig. 1), with three layers of neurons: input (u(n), K neurons), output (y(n), L neurons), and reservoir (x(n), N neurons). It is important to note from the beginning that training the readout matrix W^out for the intESN is the same as for the conventional ESN (Section II-A1).
However, other components of the intESN differ from the conventional ESN. First, the activations of the input and output layers are projected into the reservoir in the form of bipolar HD vectors of size N (denoted as u^HD(n) and y^HD(n)). For problems where the input and output data are described by finite alphabets and each symbol can be treated independently, the mapping to the N-dimensional space is achieved by simply assigning a random bipolar HD vector to each symbol in the alphabet and storing them in the item memory. In the case of continuous data (e.g., real numbers), we quantize the continuous values into a finite alphabet. The quantization scheme and the granularity of the quantization are problem dependent. Additionally, when there is a need to preserve similarity between quantization levels, distance-preserving mapping schemes are applied (see, e.g., [40, 41]), which can preserve, for example, linear or nonlinear similarity between levels. An example of discretization and quantization of a continuous signal, as well as its HD vectors in the item memory, is illustrated in Fig. 2. Continuous values can also be represented in HD vectors by varying their density; for a recent overview of several mapping approaches readers are referred to the literature, and an example of applying such a mapping is presented in Section IV-A2.

Another feature of the intESN is the way the recurrence in the reservoir is implemented. Rather than a matrix multiplication, the recurrence is implemented via a permutation of the reservoir vector. Note that a permutation of a vector can be described in matrix form, and this matrix can play the role of W in the intESN; the spectral radius of such a matrix equals one. An efficient implementation of permutation can be achieved for a special case – the cyclic shift (denoted as Sh()). Fig. 2 shows the recurrent connections of neurons in a reservoir with recurrence by a cyclic shift of one position. In this case, the vector-matrix multiplication is equivalent to Sh(x(n-1), 1).
Finally, to keep the values of the neurons integer, the intESN uses a different nonlinear activation function for the reservoir – clipping (4). Note that the simplest bundling operation is elementwise addition. However, when using elementwise addition, the activity of the reservoir (i.e., a composite HD vector) is no longer bipolar. From the implementation point of view, it is practical to keep the values of the elements of the HD vector in a limited range using a threshold value (denoted as κ).
The clipping threshold κ regulates the nonlinear behavior of the reservoir and limits the range of activation values. Note that in the intESN the reservoir is updated only with integer bipolar vectors, and after clipping the values of the neurons are still integers in the range between -κ and κ. Thus, each neuron can be represented using only a few bits of memory. For example, when κ = 7, there are fifteen unique values of a neuron, which can be stored with just four bits.
Summarizing the aforementioned differences, the update of the intESN is described as:

x(n) = f_κ(Sh(x(n-1), 1) + u^HD(n) + y^HD(n)),    (3)

where f_κ denotes the clipping function:

f_κ(x) = -κ if x < -κ;  x if -κ ≤ x ≤ κ;  κ if x > κ.    (4)
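In code, this update – cyclic shift, elementwise addition of the input and output HD vectors, then clipping – can be sketched as follows (the reservoir size and κ are assumed values):

```python
import numpy as np

rng = np.random.default_rng(0)
N, kappa = 100, 7                # reservoir size and clipping threshold (assumed)

def clip(x, kappa):
    """Clipping activation f_kappa: identity on [-kappa, kappa], saturated outside."""
    return np.clip(x, -kappa, kappa)

def intesn_step(x, u_hd, y_hd):
    """Cyclic shift, add the bipolar input/output HD vectors, clip to integers."""
    return clip(np.roll(x, 1) + u_hd + y_hd, kappa)

x = np.zeros(N, dtype=np.int8)   # each neuron fits in a few bits
u = rng.choice([-1, 1], N).astype(np.int8)   # a bipolar input HD vector
x = intesn_step(x, u, np.zeros(N, dtype=np.int8))
```

Because the state stays within [-kappa, kappa], the whole reservoir can live in a narrow integer type such as `int8`, which is the source of the memory savings discussed below.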
IV Performance evaluation
In this section, the intESN architecture is verified and compared to the conventional ESN on a set of tasks typical for ESN. In particular, three aspects are evaluated: short-term memory, classification of time-series, and modeling of dynamic processes. Short-term memories are compared using the trajectory association task, introduced in the area of holographic reduced representations. Additionally, an approach for storing and decoding analog values using the intESN is demonstrated on image patches. Classification of time-series is studied using standard datasets from UCI and UCR. Modeling of dynamic processes is tested on two typical cases. First, the task of learning a simple sinusoidal function is considered. Next, the networks are trained to reproduce a complex dynamical system produced by a Mackey-Glass series. Unless otherwise stated, ridge regression with the Moore-Penrose pseudo-inverse was used to learn the readout matrix W^out. The values of the input neurons were not used for training the readout in any of the experiments below.
IV-A Short-term memory
IV-A1 Sequence recall task
The sequence recall task includes two stages: memorization and recall. At the memorization stage, a network continuously stores a sequence of tokens (e.g., letters, phonemes, etc.). The number of unique tokens is denoted as D, and one token is presented as input at each timestep. At the recall stage, the network uses the content of its reservoir to retrieve the token stored d steps ago, where d denotes the delay. In the experiments, the delay varied between 0 and 15.
For ESN, the dictionary of tokens was represented by a one-hot encoding, i.e., the number of input layer neurons was set to the size of the dictionary, K = D. The same encoding scheme was adopted for the output layer, L = D. The input vector was projected to the reservoir by the projection matrix W^in, where each entry was independently generated from a uniform distribution and scaled by the projection gain k^in. The reservoir connection matrix W was first generated from the standard normal distribution, then orthogonalized, and scaled by the feedback strength k^fb.
For the intESN, the item memory was populated with random high-dimensional bipolar vectors. The threshold for the clipping function was set to κ = 3. The output layer was the same as in the ESN, with L = D and one-hot encoding of tokens.
For each value d of the delay, a readout matrix was trained, producing 16 matrices in total. The training sequence presented 2000 random tokens to the network, and only the last 1500 steps were used to compute the readout matrices. A fixed regularization parameter was used for the ridge regression. The training sequence of tokens delayed by the particular d was used as the ground truth for the activations of the output layer. During the operating phase, both the inclusion of a new token into the reservoir and the recall of the delayed token from the reservoir were simultaneous. Experiments were performed for three different sizes of the reservoir: N = 100, N = 200, and N = 300.
The memory capacity of the network is characterized by the accuracy of correct decoding of tokens for different values of the delay. Fig. 3 depicts the accuracy for both networks: ESN (solid lines) and intESN (dashed lines). The capacities of both networks grow with an increased number of neurons in the reservoir. The capacities of ESN and intESN are comparable for small d, i.e., for the most recent tokens. For larger delays, the curves feature slightly different behaviors: with increasing d, the performance of the intESN starts to decline faster than that of the ESN. Eventually, all curves converge to the value of the random guess, which equals 1/D. Moreover, the information capacity of a network is characterized by the amount of information decoded from the reservoir. This amount is determined using the amount of information per token (log2 D bits), the probability of correctly decoding a token at each delay value, and the concept of mutual information. We calculated the amount of information for all networks in Fig. 3 in the considered delay range. For 100 neurons the intESN preserved 19.3% less information; for 200 and 300 neurons, 21.7% less.
These results highlight a very important trade-off: performance versus complexity of implementation. While the performance of the intESN is somewhat poorer in this task, one has to bear in mind its memory footprint. With the clipping threshold κ = 3, only 3 bits are needed to represent the state of a neuron, compared to 32 bits per neuron (the size of type float) in the ESN. We conjecture that some reduction in performance for a roughly tenfold memory footprint reduction is an acceptable price in applications on resource-constrained computing devices. On the other hand, we can compare the performance of the two networks at equal memory footprints. For this, we increased the number of neurons in the intESN so that the total memory consumed by the reservoir with the same clipping threshold would match that of the ESN. This network is denoted as "intESN-large". The results for this case are presented in Fig. 4 (the training sequence was prolonged to 9000 random tokens). With such settings, intESN-large has a clearly higher information capacity. In particular, for the memory footprint of the ESN with 100 neurons, the decoded amount of information increased 2.2 times, while for 200 and 300 neurons it increased 1.6 and 1.3 times, respectively.
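The footprint matching behind "intESN-large" is simple arithmetic; a sketch, assuming 32-bit floats for ESN neurons and 3 bits per intESN neuron as stated above:

```python
# Matching the reservoir memory footprint of ESN with a larger intESN reservoir.
# Assumes 32-bit float ESN neurons and 3-bit intESN neurons (7 levels for kappa = 3).
ESN_BITS = 32
INTESN_BITS = 3

def matched_reservoir_size(n_esn):
    """Number of intESN neurons that fit in the memory of n_esn ESN neurons."""
    return n_esn * ESN_BITS // INTESN_BITS
```

At these bit widths, the matched intESN reservoir holds roughly ten times more neurons than the ESN it replaces.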
IV-A2 Storage of analog values in intESN
This subsection presents the feasibility of storing analog values in the intESN, using image patches as a showcase. The aim is to demonstrate that the integer approximation of neuron states in the intESN can work with analog representations. The value of a pixel (in an RGB channel) can be treated as an analog value in the range between 0 and 1. For each pixel it is possible to generate a unique bipolar HD vector. The typical approach to encode an analog value is to multiply all elements of the HD vector by that value; the whole sequence is then represented using the bundling operation on all scaled HD vectors, and the result of bundling can be used as an input to a reservoir. However, the resultant composite HD vector will no longer have integer values. We address this problem by using sparsity. Instead of scaling the elements of an HD vector, we propose to randomly set a fraction of the elements of the HD vector to zero, i.e., the HD vector becomes ternary. The proportion of zero elements is determined by the pixel's analog value: pixels with values close to zero will have very sparse HD vectors, while pixels with values close to one will have dense HD vectors, but all entries will always be in {-1, 0, +1}. The result of bundling such HD vectors (i.e., the HD vector for an image) will still have integer values. Such a representation scheme keeps integer values in the reservoir while still effectively storing analog values.
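A sketch of this sparsity-based encoding (the dimensionality and random-masking scheme below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                       # dimensionality of HD vectors

def encode_analog(value, rng):
    """Ternary HD vector for a value in [0, 1]: roughly a fraction `value`
    of the elements are +/-1, the rest are zeroed, so bundling stays integer."""
    bipolar = rng.choice([-1, 1], n)
    mask = rng.random(n) < value          # keep about value * n non-zero elements
    return bipolar * mask

v = encode_analog(0.3, rng)
```

The density of the resulting ternary vector tracks the encoded analog value, so a dot product with the pixel's bipolar HD vector scales with that value, which is what makes decoding from the bundle possible.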
Examples of the results are presented in Fig. 5. The top row depicts the original images stored in the reservoir; the other rows depict images reconstructed from the reservoir for four intESN configurations (top to bottom) with increasing reservoir size N, where the clipping threshold κ was optimized for each particular N. Columns correspond to the delay values (i.e., how many steps ago an image was stored in the reservoir), as in the previous experiment. As one would anticipate, the quality of the reconstructed images improves for larger reservoir sizes. At the same time, the quality of the reconstructed images deteriorates for larger delay values, i.e., the worst quality of the reconstructed image can be observed in the bottom right corner, while the best reconstruction is located in the top left corner. Nevertheless, the main observation from this experiment is that it is possible to project analog values into a reservoir with integer values using the mapping via varying sparsity and then retrieve the values from the reservoir.
IV-B Classification of time-series
In this section, the ESN and intESN networks are compared in terms of the classification accuracy obtained on standard time-series datasets. Following previous work, we used four univariate datasets from UCR (http://www.cs.ucr.edu/~eamonn/time_series_data/) and three multivariate datasets from UCI (http://archive.ics.uci.edu/ml/datasets.html). Details of the datasets are presented in Table I. For each dataset, the table includes the name, the number of variables (#V), the number of classes (#C), and the number of examples in the training and testing datasets.
The configurations of both networks were kept fixed for all datasets. The output layers of the networks represented one-hot encodings of the classes in a dataset, i.e., L equals #C for the particular dataset. For both networks, the same regularization parameter for ridge regression was used. The configuration of the ESN followed the literature, with fixed reservoir size N, projection gain k^in, and feedback strength k^fb.
For the intESN, the clipping threshold was set to κ = 7. Also, for the intESN, the values of the time-series were mapped to bipolar vectors using scatter codes [42, 44]. Two sizes of reservoir were used. The first size corresponded to the size of the ESN reservoir. The second size ("intESN-large") corresponded to the same memory footprint required for the ESN reservoir, assuming that one ESN neuron requires 32 bits while one intESN neuron requires 4 bits (when κ = 7). (An exception is the Japanese Vowels dataset, where such a reservoir size seemed to significantly overfit the training data; in that case, the number of neurons was doubled instead.)
The readout layers of both networks were trained using time-series from a training dataset in the so-called endpoints mode, where only the final temporal reservoir states for each time-series are used for training a single readout matrix.
The experimental accuracies obtained by the networks on the considered datasets are presented in Figs. 6 and 7. Fig. 6 presents the results for the univariate datasets, while Fig. 7 presents the results for the multivariate datasets. The figures depict mean and standard deviation values across ten independent random initializations of the networks.
The obtained results strongly depend on the characteristics of the data. However, it was generally observed that the intESN with a memory footprint equivalent to the ESN demonstrated higher classification accuracy. On the other hand, the classification accuracy of the intESN with the same number of neurons as the ESN was similar to the ESN's performance for all considered datasets but two ("Swedish Leaf" and "Character Trajectories"), for which the accuracy degradation was noticeable. We, therefore, conjecture that in the general case one cannot guarantee the same classification accuracy as for the ESN. The empirical evidence, however, shows that it is often achievable. Since placing the reported results into the general context of time-series classification is outside the scope of this article, we do not further elaborate on fine-tuning the hyperparameters of the intESN for the best classification performance.
TABLE I: Details of the datasets.
Univariate datasets from UCR.
Multivariate datasets from UCI, including: Spoken Arabic Digit – #V: 13; training examples: 6600; testing examples: 2200; #C: 10.
IV-C Modeling of dynamic processes
IV-C1 Learning Sinusoidal Function
The task of learning a sinusoidal function is an example of learning a simple dynamic system with constant cyclic behavior. The ground truth signal was a slowly varying sinusoid.
In this task, the input layer was not used, but the network projected the activations of the output layer back to the reservoir using W^fb (for the intESN, y^HD(n)). The output layer had only one neuron (L = 1). The reservoir size was fixed for both networks. The length of the training sequence was 3000 (the first 1000 steps were discarded from the calculation). For the ESN, the feedback strength of the reservoir connection matrix was fixed, and for both networks the projection gain was set to 0. A continuous value of the ground truth signal was fed to the ESN during training.
For the intESN, quantization was used to map the signal to a bipolar vector: the range of the signal was quantized into a finite set of levels.
The item memory for the projection of the output layer was populated with bipolar vectors preserving linear (in terms of the dot product) similarity between the quantization levels. The threshold for the clipping function was fixed.
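One way to build such a linearly similarity-preserving item memory is to start from one random bipolar vector and flip a fixed, disjoint set of positions per level (a sketch; the number of levels, dimensionality, and flip schedule are assumptions, not the authors' exact scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
n, levels = 1000, 21             # dimensionality and number of quantization levels (assumed)

def quantize(value, lo=-1.0, hi=1.0):
    """Map a continuous value in [lo, hi] to one of `levels` indices."""
    idx = round((value - lo) / (hi - lo) * (levels - 1))
    return min(max(idx, 0), levels - 1)

# Each consecutive level flips a fresh block of positions, so the dot product
# between levels i and j decreases linearly with |i - j|.
im = np.empty((levels, n), dtype=int)
im[0] = rng.choice([-1, 1], n)
flips_per_level = n // (2 * (levels - 1))
positions = rng.permutation(n)
for k in range(1, levels):
    im[k] = im[k - 1].copy()
    block = positions[(k - 1) * flips_per_level : k * flips_per_level]
    im[k, block] *= -1
```

With this construction, sim(im[i], im[j]) = n - 2 * flips_per_level * |i - j|, i.e., the similarity falls off linearly with the distance between quantization levels.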
In the operating phase, the network acted as a generator of the signal, feeding its previous prediction back to the reservoir. The error accumulates because every time the prediction is fed back into the reservoir of the intESN, it must be quantized in order to fetch a vector from the item memory. Fig. 8 demonstrates the behavior of the intESN during the first 100 prediction steps. The ground truth is depicted by a dashed line, while the prediction of the intESN is illustrated by the shaded area between the 10th and 90th percentiles (100 simulations were performed). The figure does not show the performance of the conventional ESN, as it simply followed the ground truth without visible deviations. The intESN clearly follows the values of the ground truth, but its deviation from the ground truth increases with the number of prediction steps. This is unavoidable for an increasing prediction horizon, but in this scenario it is additionally accelerated by the presence of the quantization error at each prediction step. The next subsection demonstrates the effects caused by this process more clearly.
IV-C2 Mackey-Glass series prediction
A Mackey-Glass series is generated by a nonlinear time-delay differential equation and is commonly used to assess the predictive power of an RC approach. In this scenario, we followed the data preprocessing and the ESN parameters described in the literature. The parameters of the intESN (including the quantization scheme) were the same as in the subsection above. The length of the training sequence was 3000 (the first 1000 steps were discarded from the calculation). Fig. 9 depicts the results for the first 300 prediction steps, calculated from 100 simulation runs. The figure includes four panels; each panel depicts the ground truth, the mean value of the predictions, and the area between the 10th and 90th percentiles. The lower right panel corresponds to the intESN, while the three other panels show the performance of the ESN in three different cases related to the quantization of the data.
In these scenarios, the ESN was trained to learn the model from the quantized data in order to see to what extent the quantization affects the network. The upper left panel corresponds to the ESN without data quantization; in this case, the predictions precisely follow the ground truth. The upper right panel corresponds to the ESN trained on the quantized data but with no quantization during the operational phase. In these settings, the network closely follows the ground truth for the first 150 steps, but then it often explodes. The lower left panel corresponds to the ESN where the data was quantized during both training and prediction. In this scenario, the network was able to produce good predictions only for the first few dozen steps and then entered a chaotic mode where even the mean value does not reflect the ground truth. These cases demonstrate how the quantization error can affect the predictions, especially when it is added at each prediction step. Note that the intESN operated in the same mode as the third ESN. Despite this fact, its performance rather resembles that of the second ESN, although its deviation from the ground truth grows faster. At the same time, the deviation of the intESN grows smoothly, without the sudden explosion seen for the ESN.
V-1 Efficiency of intESN
The proper evaluation of the computational efficiency of the proposed intESN approach requires the design of a reference implementation and detailed benchmarking tests. Here we support our efficiency claims with simplified execution-time measurements performed in Matlab. We used our Matlab implementations of both networks for the trajectory association task (Section IV-A1) with neurons to compare the times of projecting data and executing a reservoir. The ESN was implemented using a 32-bit float type (type single in Matlab), while the intESN was implemented using an 8-bit integer type (type int8 in Matlab). On average, the time required by intESN was 3.9 times less than that of ESN.
V-2 Hyperparameters of intESN
In comparison to ESN, intESN has fewer hyperparameters. The common hyperparameter is the reservoir size. The two intESN-specific hyperparameters are the clipping threshold and the mapping to the reservoir. When the reservoir size is fixed, the clipping threshold in intESN has an effect on the network’s memory similar to that of the corresponding real-valued hyperparameters in ESN. The difference, however, is that the clipping threshold takes only positive integer values. This has its pros and cons: on the one hand, it could be much easier to optimize a single hyperparameter over an integer range; on the other hand, real-valued hyperparameters allow configurations that provide much finer tuning of the network’s memory for a given task.
With respect to the mapping (projection) to the reservoir, it is probably the most non-trivial part of intESN, especially when the data to be projected are real numbers. In this article, we mention three different strategies for mapping real numbers to bipolar or ternary vectors: puncturing of non-zero elements (Section IV-A2), a mapping preserving linear similarity between vectors , and a nonlinear approximate mapping using scatter codes [42, 44]. We suggest trying each of the approaches as a heuristic and choosing the one that performs best.
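As an illustration of the second strategy, the sketch below maps a scalar to a bipolar vector by flipping a number of components of a fixed random base vector proportional to the scalar's normalized value, so that the dot-product similarity between the codes of two values decreases roughly linearly with their distance. This is a generic similarity-preserving scheme written for illustration, not necessarily the exact mapping of the cited work; the range, dimension, and flip order are assumptions.

```python
import numpy as np

def linear_map(value, lo, hi, base, flip_order):
    """Map a scalar in [lo, hi] to a bipolar vector; nearby scalars get
    similar vectors because they share most of their flipped components."""
    frac = np.clip((value - lo) / (hi - lo), 0.0, 1.0)
    n_flip = int(round(frac * len(base)))     # flip count grows with the value
    out = base.copy()
    out[flip_order[:n_flip]] *= -1            # flip a fixed, value-independent order
    return out

def sim(u, v):
    """Normalized dot product; cast up to avoid int8 overflow."""
    return float(u.astype(np.int32) @ v.astype(np.int32)) / len(u)

N = 1000
rng = np.random.default_rng(2)
base = rng.choice(np.array([-1, 1], dtype=np.int8), size=N)
order = rng.permutation(N)
a, b, c = (linear_map(v, 0.0, 1.0, base, order) for v in (0.1, 0.15, 0.9))
# sim(a, b) is high (values 0.1 and 0.15 are close); sim(a, c) is low
```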
V-3 On equivalence of ESN and intESN in terms of forgetting time constants
Section IV-A1 presented the experimental comparison of the storage capabilities of intESN and ESN. An analytical approach to the treatment of the memory capacity of reservoirs was presented in  (Section 2.4.4). That work introduced an analytical tool called the forgetting time constant: a scalar characterizing the memory decay in a network. In the case of intESN, it can be calculated analytically from the clipping threshold. For ESN, currently only special cases can be analyzed. For example, when the reservoir neurons are linear, the feedback strength determines the forgetting time constant. Another example is when the reservoir neurons are nonlinear, the feedback strength is fixed to one, and the reservoir update rule is modified to include a gain parameter; this parameter affects the forgetting time constant, which can then be calculated numerically. Fig. 10 presents an implicit comparison of this special case of ESN and intESN via their forgetting time constants, determined by the gain parameter and the clipping threshold, respectively. Both parameters affect the forgetting time constant similarly (not far from linearly in logarithmic coordinates). This allows arguing that the networks are close to functionally identical in terms of storage capabilities. The development of an approach for estimating the forgetting time constant of ESN in the general case is part of our agenda for future work.
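The forgetting behavior of an intESN can also be probed empirically: run the reservoir on a sequence of random bipolar items and measure how the similarity between the final state and each stored item (shifted by its age, to undo the cyclic shifts) decays as the item gets older. The sketch below is illustrative; the dimension, clipping threshold, and sequence length are arbitrary choices, and the measured decay curve plays the role of the forgetting time constant rather than computing it in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
N, kappa, steps = 1000, 3, 60

# intESN-style recall experiment: shift, add a random bipolar item, clip.
items = rng.choice(np.array([-1, 1], dtype=np.int8), size=(steps, N))
x = np.zeros(N, dtype=np.int8)
for t in range(steps):
    x = np.clip(np.roll(x, 1) + items[t], -kappa, kappa).astype(np.int8)

def sim(u, v):
    """Normalized dot product; cast up to avoid int8 overflow."""
    return float(u.astype(np.int32) @ v.astype(np.int32)) / N

# Probe the final state with each item rotated by its age (steps-1-t shifts
# have been applied to it since storage); similarity decays with age.
sims = [sim(np.roll(items[t], steps - 1 - t), x) for t in range(steps)]
# sims[-1] (newest item) stays high; sims[0] (oldest) decays toward chance
```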
V-4 Training the readout in a generator mode
VI Conclusion
In this article we proposed an architecture for an integer approximation of reservoir computing, based on the mathematics of hyperdimensional computing. The neurons in the reservoir are described by integers in a limited range, and the update operations include only addition, permutation (cyclic shift), and clipping. Therefore, the integer Echo State Network has a substantially smaller memory footprint and higher computational efficiency than the conventional Echo State Network with the same number of neurons in the reservoir. The actual number of bits for representing a neuron depends on the clipping threshold, but can be significantly lower than the 32-bit floats of the Echo State Network. For example, the clipping thresholds used in our experiments made it sufficient to represent a neuron with only three or four bits, respectively.

We demonstrated that the performance of the integer Echo State Network is comparable to that of the conventional Echo State Network in terms of memory capacity and of capabilities for classification of time-series and modeling of dynamic systems. Better performance was observed when the memory footprint of the reservoir of the integer Echo State Network was set to that of the conventional Echo State Network. Further improvements can be made by optimizing the parameters and by better quantization schemes for handling continuous values.

Naturally, due to the peculiarity of projecting input data into the integer Echo State Network, the performance of the network in tasks for modeling dynamic systems is to a certain degree lower than that of the conventional Echo State Network. This, however, does not undermine the importance of integer Echo State Networks, which are extremely attractive for memory and power savings, and for the general area of approximate computing, where errors and approximations are acceptable as long as the outcomes have a well-defined statistical behavior.
-  M. Lukosevicius and H. Jaeger, “Reservoir computing approaches to recurrent neural network training,” Computer Science Review, vol. 3, no. 3, pp. 127–149, 2009.
-  D. Sussillo and L. Abbott, “Generating coherent patterns of activity from chaotic neural networks,” Neuron, vol. 63, no. 4, pp. 544–557, 2009.
-  F. Triefenbach, A. Jalalvand, B. Schrauwen, and J.-P. Martens, “Phoneme recognition with large hierarchical reservoirs,” in Advances in Neural Information Processing Systems 23, 2010, pp. 2307–2315.
-  H. Jaeger and H. Haas, “Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication,” Science, vol. 304, no. 5667, pp. 78–80, 2004.
-  L. Appeltant, M. C. Soriano, G. Van der Sande, J. Danckaert, S. Massar, J. Dambre, B. Schrauwen, C. R. Mirasso, and I. Fischer, “Information processing using a single dynamical node as complex system,” Nature Communications, vol. 2, no. 1, pp. 1–6, 2011.
-  M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks,” in European Conference on Computer Vision, ser. Lecture Notes in Computer Science, vol. 9908, 2016, pp. 525–542.
-  I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Advances in Neural Information Processing Systems 29 (NIPS 2016), 2016, pp. 1–9.
-  ——, “Quantized neural networks: Training neural networks with low precision weights and activations,” arXiv:1609.07061, pp. 1–29, 2016.
-  E. P. Frady, D. Kleyko, and F. T. Sommer, “A theory of sequence indexing and working memory in recurrent neural networks,” Neural Computation, vol. 0, no. 0, pp. 1–65, 2018.
-  P. Kanerva, “Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors,” Cognitive Computation, vol. 1, no. 2, pp. 139–159, 2009.
-  R. Gayler, “Vector Symbolic Architectures Answer Jackendoff’s Challenges for Cognitive Neuroscience,” in Proceedings of the Joint International Conference on Cognitive Science, ICCS/ASCS, 2003, pp. 133–138.
-  Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  W. Maass, T. Natschlager, and H. Markram, “Real-Time Computing Without Stable States: A New Framework for Neural Computation Based on Perturbations,” Neural Computation, vol. 14, no. 11, pp. 2531–2560, 2002.
-  H. Jaeger, “Adaptive nonlinear system identification with echo state networks,” in Advances in Neural Information Processing Systems 15, 2003, pp. 593–600.
-  G. Huang, Q. Zhu, and C. Siew, “Extreme learning machine: Theory and applications,” Neurocomputing, vol. 70, no. 1-3, pp. 489–501, 2006.
-  G. Huang, “What are Extreme Learning Machines? Filling the Gap Between Frank Rosenblatt’s Dream and John von Neumann’s Puzzle,” Cognitive Computation, vol. 7, pp. 263–278, 2015.
-  E. M. Forney, C. W. Anderson, W. J. Gavin, P. L. Davies, M. C. Roll, and B. K. Taylor, “Echo State Networks for Modeling and Classification of EEG Signals in Mental-Task Brain-Computer Interfaces,” Colorado State University, CS-15-102, Tech. Rep., November 2015.
-  A. Prater, “Comparison of Echo State Network output layer classification methods on noisy data,” in 2017 International Joint Conference on Neural Networks (IJCNN), May 2017, pp. 2644–2651.
-  F. Bianchi, S. Scardapane, S. Lokse, and R. Jenssen, “Reservoir computing approaches for representation and classification of multivariate time series,” arXiv:1803.07870, pp. 1–19, 2018.
-  O. Yilmaz, “Machine Learning Using Cellular Automata Based Feature Expansion and Reservoir Computing,” Journal of Cellular Automata, vol. 10, no. 5-6, pp. 435–472, 2015.
-  D. Kleyko, S. Khan, E. Osipov, and S. P. Yong, “Modality Classification of Medical Images with Distributed Representations based on Cellular Automata Reservoir Computing,” in IEEE International Symposium on Biomedical Imaging, 2017, pp. 1–4.
-  O. Yilmaz, “Symbolic Computation Using Cellular Automata-Based Hyperdimensional Computing,” Neural Computation, vol. 27, no. 12, pp. 2661–2692, 2015.
-  S. Nichele and A. Molund, “Deep Reservoir Computing Using Cellular Automata,” arXiv:1703.02806, pp. 1–9, 2017.
-  N. McDonald, “Reservoir computing and extreme learning machines using pairs of cellular automata rules,” arXiv:1703.05807, pp. 1–8, 2017.
-  M. Han and M. Xu, “Laplacian Echo State Network for Multivariate Time Series Prediction,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 1, pp. 238–244, 2018.
-  J. Qiao, F. Li, H. Han, and W. Li, “Growing Echo-State Network With Multiple Subreservoirs,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 2, pp. 391–404, 2017.
-  F. Bianchi, L. Livi, and C. Alippi, “Investigating Echo-State Networks Dynamics by Means of Recurrence Analysis,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 2, pp. 427–439, 2018.
-  L. Livi, F. Bianchi, and C. Alippi, “Determination of the Edge of Criticality in Echo State Networks Through Fisher Information Maximization,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 3, pp. 706–717, 2018.
-  A. Rodan and P. Tino, “Minimum Complexity Echo State Network,” IEEE Transactions on Neural Networks, vol. 22, no. 1, pp. 131–144, 2011.
-  M. Lukosevicius, “A Practical Guide to Applying Echo State Networks,” in Neural Networks: Tricks of the Trade, ser. Lecture Notes in Computer Science, vol. 7700, 2012, pp. 659–686.
-  T. A. Plate, “Holographic reduced representations,” IEEE Transactions on Neural Networks, vol. 6, no. 3, pp. 623–641, 1995.
-  S. I. Gallant and T. W. Okaywe, “Representing objects, relations, and sequences,” Neural Computation, vol. 25, no. 8, pp. 2038–2078, 2013.
-  R. W. Gayler, “Multiplicative binding, representation operators & analogy,” in Gentner, D., Holyoak, K. J., Kokinov, B. N. (Eds.), Advances in analogy research: Integration of theory and data from the cognitive, computational, and neural sciences, New Bulgarian University, Sofia, Bulgaria, 1998, pp. 1–4.
-  S. I. Gallant and P. Culliton, “Positional binding with distributed representations,” in International Conference on Image, Vision and Computing (ICIVC), 2016, pp. 108–113.
-  T. A. Plate, Holographic Reduced Representations: Distributed Representation for Cognitive Structures. Stanford: Center for the Study of Language and Information (CSLI), 2003.
-  D. A. Rachkovskij, “Representation and Processing of Structures with Binary Sparse Distributed Codes,” IEEE Transactions on Knowledge and Data Engineering, vol. 3, no. 2, pp. 261–276, 2001.
-  A. Rahimi, S. Benatti, P. Kanerva, L. Benini, and J. M. Rabaey, “Hyperdimensional biosignal processing: A case study for emg-based hand gesture recognition,” in 2016 IEEE International Conference on Rebooting Computing (ICRC), Oct 2016, pp. 1–8.
-  P. Kanerva, Sparse Distributed Memory. The MIT Press, 1988.
-  D. A. Rachkovskij, S. V. Slipchenko, E. M. Kussul, and T. N. Baidyk, “Sparse Binary Distributed Encoding of Scalars,” Journal of Automation and Information Sciences, vol. 37, no. 6, pp. 12–23, 2005.
-  D. Widdows and T. Cohen, “Reasoning with vectors: A continuous model for fast robust inference,” Logic Journal of the IGPL, vol. 23, no. 2, pp. 141–173, 2015.
-  D. Kleyko, A. Rahimi, D. Rachkovskij, E. Osipov, and J. Rabaey, “Classification and Recall with Binary Hyperdimensional Computing: Trade-offs in Choice of Density and Mapping Characteristic,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–19, 2018.
-  T. A. Plate, “Holographic reduced representations,” IEEE Transactions on Neural Networks, vol. 6, no. 3, pp. 623–641, 1995.
-  D. Smith and P. Stanford, “A random walk in hamming space,” in 1990 International Joint Conference on Neural Networks (IJCNN), June 1990, pp. 465–470 vol.2.
-  H. Jaeger, “Tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the Echo State Network approach,” Technical Report GMD Report 159, German National Research Center for Information Technology, Tech. Rep., 2002.