# Online Training of LSTM Networks in Distributed Systems for Variable Length Data Sequences

## Abstract

In this brief paper, we investigate online training of Long Short Term Memory (LSTM) architectures in a distributed network of nodes, where each node employs an LSTM based structure for online regression. In particular, each node sequentially receives a variable length data sequence with its label and can only exchange information with its neighbors to train the LSTM architecture. We first provide a generic LSTM based regression structure for each node. In order to train this structure, we put the LSTM equations in a nonlinear state space form for each node and then introduce a highly effective and efficient Distributed Particle Filtering (DPF) based training algorithm. We also introduce a Distributed Extended Kalman Filtering (DEKF) based training algorithm for comparison. Here, our DPF based training algorithm guarantees convergence to the performance of the optimal LSTM coefficients in the mean square error (MSE) sense under certain conditions. We achieve this performance with communication and computational complexity in the order of the first order gradient based methods. Through both simulated and real life examples, we illustrate significant performance improvements with respect to the state of the art methods.

## 1 Introduction

Neural networks provide enhanced performance for a wide range of engineering applications, e.g., prediction [?] and human behavior modeling [?], thanks to their strong nonlinear modeling capabilities. Among neural networks, recurrent neural networks (RNNs) in particular are used to model time series and temporal data due to their inherent memory, which stores past information [?]. However, since simple RNNs lack control structures, the norm of the gradient may grow or decay rapidly during training, i.e., the exploding and vanishing gradient issues [?]. Due to these problems, simple RNNs are insufficient to capture long and short term dependencies [?]. To circumvent this issue, a novel RNN architecture with control structures, i.e., the Long Short Term Memory (LSTM) network, was introduced [?]. However, since LSTM networks have additional nonlinear control structures with several parameters, they may also suffer from training issues [?].

To this end, in this brief paper, we consider online training of the parameters of an LSTM structure in a distributed network of nodes. Here, we have a network of nodes, where each node has a set of neighboring nodes and can only exchange information with these neighbors. In particular, each node sequentially receives a variable length data sequence with its label and trains the parameters of the LSTM network. Each node can also communicate with its neighbors to share information in order to enhance the training performance since the goal is to train one set of LSTM coefficients using all the available data. As an example application, suppose that we have a database of labelled tweets and our aim is to train an emotion recognition engine based on an LSTM structure, where the training is performed in an online and distributed manner using several processing units. Words in each tweet are represented by word2vec vectors [?] and tweets are distributed to several processing units in an online manner.

The LSTM architectures are usually trained in a batch setting in the literature, where all data instances are present and processed together [?]. However, for applications involving big data, storage issues may arise due to keeping all the data in one place [?]. Additionally, in certain frameworks, not all data instances are available beforehand since instances are received sequentially, which precludes batch training [?]. Hence, we consider online training, where we sequentially receive the data to train the LSTM architecture without storing the previous data instances. Note that even though we work in an online setting, we may still suffer from computational power and storage issues due to the large amount of data [?]. As an example, in tweet emotion recognition applications, the systems are usually trained using an enormous amount of data to achieve sufficient performance, especially for agglutinative languages [?]. For such tasks, distributed architectures are used. In the basic distributed architecture, commonly called the centralized approach [?], the whole data is distributed to different nodes and the trained parameters are later merged at a central node [?]. However, this centralized approach requires high storage capacity and computational power at the central node [?]. Additionally, centralized strategies carry a risk of failure at the central node. To circumvent these issues, we distribute both the processing and the data to all the nodes and allow communication only between neighboring nodes; hence, we remove the need for a central node. In particular, each node sequentially receives a variable length data sequence with its label and exchanges information only with its neighboring nodes to train the common LSTM parameters.

For online training of the LSTM architecture in a distributed manner, one can employ one of the first order gradient based algorithms at each node due to their efficiency [?] and exchange estimates among neighboring nodes as in [?]. However, since these training methods only exploit the first order gradient information, they suffer from poor performance and convergence issues. As an example, the Stochastic Gradient Descent (SGD) based algorithms usually have slower convergence compared to the second order methods [?]. On the other hand, the second order gradient based methods require much higher computational complexity and communication load while providing superior performance compared to the first order methods [?]. Following the distributed implementation of the first order methods, one can implement the second order training methods in a distributed manner, where we share not only the estimates but also the Jacobian matrix, e.g., the Distributed Extended Kalman Filtering (DEKF) algorithm [?]. However, as in the first order case, this sharing and combining of information at each node is ad hoc, which does not provide the optimal training performance [?]. In this brief paper, to provide improved performance with respect to the second order methods while keeping both the communication and computational complexity similar to the first order methods, we introduce a highly effective distributed online training method based on the particle filtering algorithm [?]. We first propose an LSTM based model for variable length data regression. We then put this model in a nonlinear state space form to train the model in an online and optimal manner.

Our main contributions include: 1) We introduce distributed LSTM training methods in an online setting for variable length data sequences. Our Distributed Particle Filtering (DPF) based training algorithm guarantees convergence to the optimal centralized training performance in the mean square error (MSE) sense; 2) We achieve this performance with a computational complexity and a communication load in the order of the first order gradient based methods; 3) Through simulations involving real life and financial data, we illustrate significant performance improvements with respect to the state of the art methods [?].

The organization of this brief paper is as follows. In Section 2, we first describe the variable length data regression problem in a network of nodes and then introduce an LSTM based structure. Then, in Section 3, we first put this structure in a nonlinear state space form and then introduce our training algorithms. In Section 4, we illustrate the merits of our algorithms through simulations. We then finalize the brief paper with concluding remarks in Section 5.

## 2 Model and Problem Description

Here,^{1} in the tweet emotion recognition example, the input matrix $\mathbf{X}_{k,t}$ corresponds to the $t^{th}$ tweet at the node (processing unit) $k$. For the $t^{th}$ tweet at the node $k$, one can construct $\mathbf{X}_{k,t}$ by finding the word2vec representation of each word, i.e., $\mathbf{x}_{k,t}^{(l)}$ for the $l^{th}$ word. After receiving $d_{k,t}$, i.e., the desired emotion label for the $t^{th}$ tweet at the node $k$, each node first updates its belief about the relation between the tweet and its emotion label, and then exchanges information, e.g., the trained system parameters, with its neighboring units to estimate the next label.

In this brief paper, each node generates an estimate using the LSTM architecture. Although there exist different variants of LSTM, we use the most widely used variant [?], i.e., the LSTM architecture without peephole connections. The input X is first fed to the LSTM architecture as illustrated in Figure 1, where the internal equations are given as [?]:
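For reference, the LSTM without peephole connections is commonly written as below. The node and time subscripts $(k,t)$, the unit superscript $(l)$ and the gate labels $(z)$, $(i)$, $(f)$, $(o)$ are assumed notation, chosen to match the symbol definitions that follow.

$$
\begin{aligned}
\mathbf{z}_{k,t}^{(l)} &= g\big(\mathbf{W}^{(z)} \mathbf{x}_{k,t}^{(l)} + \mathbf{R}^{(z)} \mathbf{y}_{k,t}^{(l-1)} + \mathbf{b}^{(z)}\big), \\
\mathbf{i}_{k,t}^{(l)} &= \sigma\big(\mathbf{W}^{(i)} \mathbf{x}_{k,t}^{(l)} + \mathbf{R}^{(i)} \mathbf{y}_{k,t}^{(l-1)} + \mathbf{b}^{(i)}\big), \\
\mathbf{f}_{k,t}^{(l)} &= \sigma\big(\mathbf{W}^{(f)} \mathbf{x}_{k,t}^{(l)} + \mathbf{R}^{(f)} \mathbf{y}_{k,t}^{(l-1)} + \mathbf{b}^{(f)}\big), \\
\mathbf{c}_{k,t}^{(l)} &= \mathbf{i}_{k,t}^{(l)} \odot \mathbf{z}_{k,t}^{(l)} + \mathbf{f}_{k,t}^{(l)} \odot \mathbf{c}_{k,t}^{(l-1)}, \\
\mathbf{o}_{k,t}^{(l)} &= \sigma\big(\mathbf{W}^{(o)} \mathbf{x}_{k,t}^{(l)} + \mathbf{R}^{(o)} \mathbf{y}_{k,t}^{(l-1)} + \mathbf{b}^{(o)}\big), \\
\mathbf{y}_{k,t}^{(l)} &= \mathbf{o}_{k,t}^{(l)} \odot h\big(\mathbf{c}_{k,t}^{(l)}\big).
\end{aligned}
$$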

where $\mathbf{x}_{k,t}^{(l)}$ is the input vector, $\mathbf{y}_{k,t}^{(l)}$ is the output vector and $\mathbf{c}_{k,t}^{(l)}$ is the state vector for the $l^{th}$ LSTM unit. Moreover, $\mathbf{o}_{k,t}^{(l)}$, $\mathbf{f}_{k,t}^{(l)}$ and $\mathbf{i}_{k,t}^{(l)}$ represent the output, forget and input gates, respectively. The functions $g(\cdot)$ and $h(\cdot)$ are set to the hyperbolic tangent function and apply to vectors pointwise. Likewise, $\sigma(\cdot)$ is the pointwise sigmoid function. The operation $\odot$ represents the elementwise multiplication of two vectors of the same size. As the coefficient matrices and the weight vectors of the LSTM architecture, we have $\mathbf{W}^{(\cdot)}$, $\mathbf{R}^{(\cdot)}$ and $\mathbf{b}^{(\cdot)}$, where the sizes are chosen according to the input and output vectors. Given the outputs of the LSTM for each column of $\mathbf{X}_{k,t}$ as seen in Figure 1, we generate the estimate for each node as follows

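$$\hat{d}_{k,t} = \mathbf{w}_{k,t}^T \bar{\mathbf{y}}_{k,t},$$

where $\mathbf{w}_{k,t}$ is a vector of the regression coefficients and $\bar{\mathbf{y}}_{k,t}$ is a vector obtained by taking the average of the LSTM outputs for each column of $\mathbf{X}_{k,t}$, known as the mean pooling method, as described in Figure 1.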

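In the estimate above, we use the mean pooling method to generate $\bar{\mathbf{y}}_{k,t}$. One can also use other pooling methods by changing the calculation of $\bar{\mathbf{y}}_{k,t}$ and then generate the estimate in the same way. As an example, for the max and last pooling methods, we use $\bar{\mathbf{y}}_{k,t} = \max_{l} \mathbf{y}_{k,t}^{(l)}$ (taken elementwise) and $\bar{\mathbf{y}}_{k,t} = \mathbf{y}_{k,t}^{(m_t)}$, respectively, where $m_t$ is the number of columns of $\mathbf{X}_{k,t}$. All our derivations hold for these pooling methods and the other LSTM architectures. We provide the required updates for different LSTM architectures in the next section.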
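The regression structure described in this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `lstm_estimate`, the dictionary layout of `params` and the zero initial output and cell state are all hypothetical choices.

```python
import math

def _sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def _tanh(v):
    return [math.tanh(a) for a in v]

def _affine(W, x, R, y, b):
    """Return W x + R y + b for list-of-lists matrices and list vectors."""
    return [
        sum(wj * xj for wj, xj in zip(w_row, x))
        + sum(rj * yj for rj, yj in zip(r_row, y))
        + bi
        for w_row, r_row, bi in zip(W, R, b)
    ]

def lstm_estimate(X, params, w, pooling="mean"):
    """Run an LSTM without peepholes over the columns of X, pool, and regress.

    X      : list of input vectors (columns of the input matrix), length >= 1
    params : dict mapping "z", "i", "f", "o" to a (W, R, b) triple (assumed layout)
    w      : regression vector with the same length as the LSTM output
    """
    n = len(w)
    y = [0.0] * n  # previous output, assumed to start at zero
    c = [0.0] * n  # previous cell state, assumed to start at zero
    outputs = []
    for x in X:
        pre = {k: _affine(W, x, R, y, b) for k, (W, R, b) in params.items()}
        z = _tanh(pre["z"])      # block input
        i = _sigmoid(pre["i"])   # input gate
        f = _sigmoid(pre["f"])   # forget gate
        o = _sigmoid(pre["o"])   # output gate
        c = [ik * zk + fk * ck for ik, zk, fk, ck in zip(i, z, f, c)]
        y = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        outputs.append(y)
    if pooling == "mean":
        pooled = [sum(col) / len(outputs) for col in zip(*outputs)]
    elif pooling == "max":
        pooled = [max(col) for col in zip(*outputs)]
    elif pooling == "last":
        pooled = outputs[-1]
    else:
        raise ValueError("unknown pooling method: " + pooling)
    # Linear regression on the pooled LSTM output.
    return sum(wk * pk for wk, pk in zip(w, pooled))
```

Because the loop simply iterates over the columns of `X`, the same function handles inputs with any number of columns, which is the variable length property the model relies on.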

## 3 Online Distributed Training Algorithms

In this section, we first give the LSTM equations for each node in a nonlinear state space form. Based on this form, we then introduce our distributed algorithms to train the LSTM parameters in an online manner.

Considering our model in Figure 1 and the LSTM equations of Section 2, we have the following nonlinear state space form for each node

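where the two nonlinear functions represent the mappings performed by the consecutive LSTM units and the mean pooling operation as illustrated in Figure 1, and $\boldsymbol{\theta}_{k,t}$ is a parameter vector consisting of $\{\mathbf{w}, \mathbf{W}^{(z)}, \mathbf{R}^{(z)}, \mathbf{b}^{(z)}, \mathbf{W}^{(i)}, \mathbf{R}^{(i)}, \mathbf{b}^{(i)}, \mathbf{W}^{(f)}, \mathbf{R}^{(f)}, \mathbf{b}^{(f)}, \mathbf{W}^{(o)}, \mathbf{R}^{(o)}, \mathbf{b}^{(o)}\}$. Since the LSTM parameters are the states of the network to be estimated, we also include the static equation $\boldsymbol{\theta}_{k,t} = \boldsymbol{\theta}_{k,t-1}$ as our state. Furthermore, $\varepsilon_{k,t}$ represents the error in observations and it is a zero mean Gaussian random variable with variance $R_{k,t}$.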

We can also apply the introduced algorithms to different implementations of the LSTM architecture [?]. For this purpose, we modify the two nonlinear functions in the state space form according to the chosen LSTM architecture. We also alter the parameter vector $\boldsymbol{\theta}_{k,t}$ by adding or removing certain parameters according to the chosen LSTM architecture.

### 3.1 Online Training Using the DEKF Algorithm

In this subsection, we first derive our training method based on the EKF algorithm, where each node trains its LSTM parameters without any communication with its neighbors. We then introduce our training method based on the DEKF algorithm in order to train the LSTM architecture when we allow communication between the neighbors.

The EKF algorithm is based on the assumption that the state distribution given the observations is Gaussian [?]. To meet this assumption, we introduce Gaussian noise to the three state equations of the state space form. By this, we have the following model for each node

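where $\mathbf{e}_{k,t}$ is a zero mean Gaussian process with covariance $\mathbf{Q}_{k,t}$. Here, each node is able to observe only $d_{k,t}$ to estimate $\bar{\mathbf{c}}_{k,t}$, $\bar{\mathbf{y}}_{k,t}$ and $\boldsymbol{\theta}_{k,t}$. Hence, we group $\bar{\mathbf{c}}_{k,t}$, $\bar{\mathbf{y}}_{k,t}$ and $\boldsymbol{\theta}_{k,t}$ together into a vector $\mathbf{a}_{k,t}$ as the hidden states to be estimated.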

#### Online Training with the EKF Algorithm:

In this subsection, we derive the online training method based on the EKF algorithm when we do not allow communication between the neighbors. Since the noisy system above is already in a nonlinear state space form, we can directly apply the EKF algorithm [?] as follows

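where $\boldsymbol{\Sigma}_{k,t}$ is the error covariance matrix, $\mathbf{Q}_{k,t}$ is the state noise covariance and $R_{k,t}$ is the measurement noise variance. Additionally, we assume that $R_{k,t}$ and $\mathbf{Q}_{k,t}$ are known terms. We compute $\mathbf{H}_{k,t}$ and $\mathbf{F}_{k,t}$ as the Jacobians of the measurement function and of the state transition function with respect to $\mathbf{a}_{k,t}$, respectively, each evaluated at the latest available state estimate.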
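To make the structure of the EKF recursion concrete, the following sketch shows one iteration for a simplified special case: a static parameter vector (identity state transition) observed through a scalar measurement. It is not the full augmented state update used for the LSTM model; the callables `predict` and `gradient`, the scalar noise levels `q` and `r`, and the function name `ekf_step` are hypothetical, and the covariance `P` is assumed symmetric.

```python
def ekf_step(theta, P, d, predict, gradient, q, r):
    """One EKF iteration for theta_t = theta_{t-1} + e_t, d_t = predict(theta_t) + eps_t.

    theta    : list of floats, current state estimate
    P        : list of lists, symmetric error covariance
    d        : observed scalar
    predict  : callable, theta -> predicted scalar observation (assumed interface)
    gradient : callable, theta -> list, Jacobian of predict at theta (assumed interface)
    q, r     : state noise variance (Q = q I) and measurement noise variance
    """
    n = len(theta)
    # Time update: the state is static, so only the covariance grows by Q = q I.
    P = [[P[a][b] + (q if a == b else 0.0) for b in range(n)] for a in range(n)]
    # Linearize the measurement around the predicted state.
    h = gradient(theta)
    Ph = [sum(P[a][b] * h[b] for b in range(n)) for a in range(n)]
    s = sum(h[a] * Ph[a] for a in range(n)) + r  # innovation variance
    gain = [v / s for v in Ph]                   # Kalman gain
    # Measurement update of the state and the covariance (P - K h^T P).
    err = d - predict(theta)
    theta = [theta[a] + gain[a] * err for a in range(n)]
    P = [[P[a][b] - gain[a] * Ph[b] for b in range(n)] for a in range(n)]
    return theta, P
```

For a linear `predict` and `q = 0`, this recursion reduces to recursive least squares, which is a convenient sanity check before plugging in a nonlinear model.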

#### Online Training with the DEKF Algorithm:

In this subsection, we introduce our online training method based on the DEKF algorithm for the network described by the noisy state space model above. In our network of nodes, we denote the number of neighbors for the node $k$ as $\eta_k$, also called the degree of the node $k$ [?]. With this structure, the time update equations of the EKF algorithm still hold for each node $k$. However, since we have information exchange between the neighbors, the measurement update equations of each node $k$ adopt the iterative scheme [?] as follows.

Now, we update the state and covariance matrix estimate as

where $c(k,j)$ is the weight between the nodes $k$ and $j$, and we compute these weights using the Metropolis rule as follows
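As a hedged illustration of the Metropolis rule, the sketch below computes the combination weights from a neighbor list. It assumes the common convention in which a node's degree counts the node itself, which is equivalent to using one plus the maximum graph degree in the denominator; the function name `metropolis_weights` and the dictionary representation of the network are hypothetical.

```python
def metropolis_weights(neighbors):
    """Compute Metropolis combination weights.

    neighbors : dict mapping each node to the set of its adjacent nodes
                (excluding the node itself); the relation must be symmetric.
    Returns a dict of rows; entries that are absent from a row are zero.
    """
    # Degree including the node itself (assumed convention).
    degree = {k: len(nb) + 1 for k, nb in neighbors.items()}
    weights = {}
    for k, nb in neighbors.items():
        # Off-diagonal weight: reciprocal of the larger of the two degrees.
        row = {j: 1.0 / max(degree[k], degree[j]) for j in nb}
        # Self weight: whatever is left so that the row sums to one.
        row[k] = 1.0 - sum(row.values())
        weights[k] = row
    return weights
```

The resulting weight matrix is symmetric with rows summing to one, so repeated averaging with these weights preserves the network-wide mean.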

With these steps, we can update all the nodes in our network as illustrated in Algorithm ?.

According to the procedure in Algorithm ?, the computational complexity of our training method results in computations at each node due to matrix and vector multiplications on lines ? and ? as shown in Table ?.

### 3.2 Online Training Using the DPF Algorithm

In this subsection, we first derive our training method based on the PF algorithm when we do not allow communication between the nodes. We then introduce our online training method based on the DPF algorithm when the nodes share information with their neighbors.

The PF algorithm only requires the independence of the noise samples in the state and measurement equations. Thus, we modify our system for the node $k$ as follows

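where the state and measurement noise samples are independent, the state transition is the nonlinear mapping of the state space form above, and $\mathbf{a}_{k,t} = [\bar{\mathbf{c}}_{k,t}^T \; \bar{\mathbf{y}}_{k,t}^T \; \boldsymbol{\theta}_{k,t}^T]^T$.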

#### Online Training with the PF Algorithm:

For the system above, our aim is to obtain $E[\mathbf{a}_{k,t} \mid d_{k,1:t}]$, i.e., the optimal estimate for the hidden state in the MSE sense. To achieve this, we first obtain the posterior distribution of the states, i.e., $p(\mathbf{a}_{k,t} \mid d_{k,1:t})$. Based on the posterior density function, we then calculate the conditional mean estimate. In order to obtain the posterior distribution, we apply the PF algorithm [?].

In this algorithm, we have the samples and the corresponding weights of $p(\mathbf{a}_{k,t} \mid d_{k,1:t})$, denoted as $\{\mathbf{a}_{k,t}^{i}, \omega_{k,t}^{i}\}_{i=1}^{N}$. Based on the samples, we obtain the posterior distribution as follows

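Sampling from the desired distribution $p(\mathbf{a}_{k,t} \mid d_{k,1:t})$ is intractable in general, so we obtain the samples from another distribution $q(\mathbf{a}_{k,t} \mid d_{k,1:t})$, which is called the importance function [?]. To calculate the weights in the posterior approximation, we use the following formula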

We can factorize the importance function such that we obtain the following recursive formula [?]

In the recursive formula, we choose the importance function so that the variance of the weights is minimized. By this, we obtain particles that have nonnegligible weights and significantly contribute to the posterior approximation [?]. In this sense, since the prior $p(\mathbf{a}_{k,t} \mid \mathbf{a}_{k,t-1}^{i})$ provides a small variance for the weights [?], we choose it as our importance function. With this choice, we alter the weight recursion as follows

By the weight recursion and the posterior approximation above, we obtain the state estimate as follows
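Written out in the standard sequential importance sampling form, the steps above read as follows. The weight symbol $\omega$, the importance function $q(\cdot)$, the particle count $N$ and the index conventions are assumed notation rather than a verbatim restoration.

$$
\begin{aligned}
p(\mathbf{a}_{k,t} \mid d_{k,1:t}) &\approx \sum_{i=1}^{N} \omega_{k,t}^{i}\, \delta(\mathbf{a}_{k,t} - \mathbf{a}_{k,t}^{i}), \\
\omega_{k,t}^{i} &\propto \omega_{k,t-1}^{i}\, \frac{p(d_{k,t} \mid \mathbf{a}_{k,t}^{i})\, p(\mathbf{a}_{k,t}^{i} \mid \mathbf{a}_{k,t-1}^{i})}{q(\mathbf{a}_{k,t}^{i} \mid \mathbf{a}_{k,t-1}^{i}, d_{k,t})}, \\
\omega_{k,t}^{i} &\propto \omega_{k,t-1}^{i}\, p(d_{k,t} \mid \mathbf{a}_{k,t}^{i}) \quad \text{when } q(\cdot) = p(\mathbf{a}_{k,t} \mid \mathbf{a}_{k,t-1}^{i}), \\
E[\mathbf{a}_{k,t} \mid d_{k,1:t}] &\approx \sum_{i=1}^{N} \omega_{k,t}^{i}\, \mathbf{a}_{k,t}^{i}.
\end{aligned}
$$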

Although we choose the importance function to reduce the variance of the weights, the variance inevitably increases over time [?]. Hence, we apply the resampling algorithm introduced in [?] such that we eliminate the particles with small weights and prevent the variance from increasing.
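The single node PF iteration described above can be sketched as follows. This is a generic illustration, not the paper's exact procedure: plain multinomial resampling is swapped in for the resampling algorithm of [?], and the callables `propagate` (draws from the prior) and `likelihood` (evaluates the observation density) as well as the name `pf_step` are hypothetical.

```python
import math
import random

def pf_step(particles, weights, d, propagate, likelihood, rng):
    """One particle filter iteration with the prior as importance function.

    particles  : list of state vectors (lists of floats)
    weights    : list of normalized weights, one per particle
    d          : current observation
    propagate  : callable (particle, rng) -> new particle drawn from the prior
    likelihood : callable (d, particle) -> p(d | particle)
    rng        : random.Random instance
    """
    n = len(particles)
    # Draw each particle from the prior p(a_t | a_{t-1}).
    particles = [propagate(p, rng) for p in particles]
    # With the prior as importance function, weights scale by the likelihood.
    weights = [w * likelihood(d, p) for w, p in zip(weights, particles)]
    total = sum(weights)
    if total == 0.0:
        weights = [1.0 / n] * n  # degenerate case: fall back to uniform weights
    else:
        weights = [w / total for w in weights]
    # Conditional mean estimate from the weighted particles.
    dim = len(particles[0])
    estimate = [sum(w * p[j] for w, p in zip(weights, particles)) for j in range(dim)]
    # Multinomial resampling (assumed choice) removes low weight particles.
    particles = rng.choices(particles, weights=weights, k=n)
    weights = [1.0 / n] * n
    return particles, weights, estimate
```

Resampling at every step is the simplest policy; resampling only when the effective sample size drops is a common alternative that reduces the extra variance introduced by the resampling step.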

#### Online Training with the DPF Algorithm:

In this subsection, we introduce our online training method based on the DPF algorithm when the nodes share information with their neighbors. We employ the Markov Chain Distributed Particle Filter (MCDPF) algorithm [?] to train our distributed system. In the MCDPF algorithm, particles move around the network according to the network topology. In every step, each particle can randomly move to another node in the neighborhood of its current node. While randomly moving, the weight of each particle is updated using the likelihood $p(d_{j,t} \mid \mathbf{a}_{j,t})$ at the visited node $j$; hence, particles use the observations at different nodes.

Suppose we consider our network as a graph $G = (V, E)$, where the vertices $V$ represent the nodes in our network and the edges $E$ represent the connections between the nodes. In addition to this, we denote the number of visits to each node $j$ in $\kappa$ steps by each particle $i$ as $M^{i}(j, \kappa)$. Here, each particle moves to one of its neighboring nodes with a certain probability, where the movement probabilities of each node to the other nodes are represented by the adjacency matrix, denoted as $\mathbf{A}$. In this framework, at each visit to each node $j$, each particle multiplies its weight with $p(d_{j,t} \mid \mathbf{a}_{j,t})^{2|E(G)|/(\kappa \eta_j)}$ in a run of $\kappa$ steps [?], where $|E(G)|$ is the number of edges in $G$ and $\eta_j$ is the degree of the node $j$. From the weight recursion of the PF algorithm, we have the following update for each particle $i$ at the node $k$ after $\kappa$ steps
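The random walk weight update can be illustrated with a short sketch. It assumes a simple random walk that moves uniformly to one of the adjacent nodes and an exponent of $2|E(G)|/(\kappa \eta_j)$ per visit, so that each node's observation is counted once on average; the function name `mcdpf_move`, the neighbor list representation and the precomputed `likelihoods` dictionary are hypothetical.

```python
import random

def mcdpf_move(weight, start, neighbors, likelihoods, steps, rng):
    """Move one particle for `steps` random walk steps and update its weight.

    weight      : current particle weight
    start       : node where the particle currently resides
    neighbors   : dict node -> list of adjacent nodes (symmetric, no self loops)
    likelihoods : dict node -> p(d_j | particle) evaluated with that node's data
    steps       : number of Markov steps in this run
    rng         : random.Random instance
    """
    # Each undirected edge appears twice in the neighbor lists.
    num_edges = sum(len(nb) for nb in neighbors.values()) // 2
    node = start
    for _ in range(steps):
        # Uniform move to an adjacent node (assumed transition rule).
        node = rng.choice(neighbors[node])
        # Visits to high degree nodes are more frequent, so they count less.
        exponent = 2.0 * num_edges / (steps * len(neighbors[node]))
        weight *= likelihoods[node] ** exponent
    return weight, node
```

The exponent compensates for the stationary distribution of the walk, under which a node of degree $\eta_j$ is visited a fraction $\eta_j / (2|E(G)|)$ of the time.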

We then calculate the posterior distribution at the node as

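where the conditioning set contains the observations seen by the particles at the node $k$ until $t$, and the weights are obtained from the update above. After we obtain the posterior distribution, we calculate our estimate for $\mathbf{a}_{k,t}$ as follows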

We can obtain the estimate for each node using the same procedure as illustrated in Algorithm ?. In Algorithm ?, each node keeps track of the number of particles it currently holds and of the indices of the particles that move from one node to another. Thus, we obtain a distributed training algorithm that guarantees convergence to the optimal centralized parameter estimation as stated in Theorem 1.

Using the posterior approximation at each node, from [?], we obtain

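where the test function is bounded, the convergence rate depends on the second largest eigenvalue modulus of $\mathbf{A}$, the remaining factors are time dependent constants, and the residual term is a function that goes to zero as its argument goes to infinity, as described in [?]. Since the state vector $\mathbf{a}_{k,t}$ is bounded, we can choose the test function as the state vector itself. With this choice, evaluating the bound as the number of particles and the number of Markov steps go to infinity yields the result. This concludes our proof.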

According to the update procedure illustrated in Algorithm ?, the computational complexity of our training method at each node is dominated by the matrix vector multiplications in the particle updates, as shown in Table ?.

## 4 Simulations

We evaluate the performance of the introduced algorithms on different real benchmark datasets. We first consider the prediction performance on the Hong Kong exchange rate dataset (Figure 2). We then evaluate the regression performance on the emotion labelled sentence dataset [?]. For these experiments, to observe the effects of communication among nodes, we also consider the EKF and PF based algorithms without communication over a network of multiple nodes, where each node trains its LSTM based on only its own observations. Throughout this section, we denote the EKF and PF based algorithms without communication over a network of multiple nodes as “EKF” and “PF”, respectively. We also consider the SGD based algorithm without communication over a network of multiple nodes as a benchmark algorithm and denote it by “SGD”.

We first consider the Hong Kong exchange rate dataset (Figure 2). For this dataset, we have the amount of Hong Kong dollars that can buy one United States dollar on certain days. Our aim is to estimate the future exchange rate by using the values in the previous two days. In online applications, one can demand a small steady state error or a fast convergence rate based on the requirements of the application [?]. In this experiment, we evaluate the convergence rates of the algorithms. For this purpose, we select the parameters such that the algorithms converge to the same steady state error level. In this setup, we choose the parameters for each node as follows. The input $\mathbf{X}_{k,t}$ contains the values of the previous two days, and we set the output dimension accordingly. In addition to this, we consider a network of four nodes. For the PF based algorithms, we fix the number of particles and select the state and measurement noise as zero mean Gaussian random variables, where the state noise covariance is a scaled identity matrix. For the DPF based algorithm, we additionally fix the number of Markov steps. For the EKF based algorithms, we select the initial error covariance and the state noise covariance $\mathbf{Q}_{k,t}$ as scaled identity matrices and fix the measurement noise variance. Moreover, the weights between nodes are calculated according to the Metropolis rule. For the SGD based algorithm, we use a fixed learning rate. In Figure 2, we illustrate the prediction performance of the algorithms. Due to the highly nonlinear structure of our model, the EKF and DEKF based algorithms have slower convergence compared to the other algorithms. Moreover, since it only exploits the first order gradient information, the SGD based algorithm also has slower convergence compared to the PF based algorithms. Unlike the SGD and EKF based methods, the PF based algorithms perform parameter estimation through a high performance gradient free density estimation technique; hence, they converge much faster to the final MSE level. Among the PF based methods, due to its distributed structure, the DPF based algorithm has the fastest convergence rate.

In order to demonstrate the effects of the number of particles and the number of Markov steps, we perform another experiment using the Hong Kong exchange rate dataset. In this experiment, we use the same setting as in the previous case except for the covariance matrices, which are again chosen as scaled identity matrices. In Figure 3, we observe that as the number of particles and the number of Markov steps increase, the DPF based algorithm obtains a faster convergence rate and a lower final MSE value. However, as both increase, the marginal performance improvement becomes smaller with respect to the previous values. Furthermore, the computation time of the algorithm increases with both. Thus, after a certain selection, a further increase is not worth the additional computational load. This trade-off motivates the values used in our previous simulation.

Other than the Hong Kong exchange rate dataset, we consider the emotion labelled sentence dataset [?]. In this dataset, we have the vector representation of each word in an emotion labelled sentence. In this experiment, we evaluate the steady state error performance of the algorithms. Thus, we choose the parameters such that the convergence rates of the algorithms are similar. To provide this setup, we select the parameters for each node as follows. Since the number of words varies from sentence to sentence in this case, we have a variable length input regressor, i.e., the matrix $\mathbf{X}_{k,t}$ with $m_t$ columns, where $m_t$ represents the number of words in a sentence. For the other parameters, we use the same setting as for the Hong Kong exchange rate dataset except for the output dimension, the covariance matrices (again scaled identity matrices) and the measurement noise variance. In Figure 4, we illustrate the label prediction performance of the algorithms. Again due to the highly nonlinear structure of our model, the EKF based algorithm has the highest steady state error value. Additionally, the SGD based algorithm also has a high final MSE value compared to the other algorithms. Furthermore, the DEKF based algorithm achieves a lower final MSE value than the PF based method thanks to its distributed structure. However, since the DPF based method utilizes a powerful gradient free density estimation method while effectively sharing information between the neighboring nodes, it achieves a much smaller steady state error value.

## 5 Concluding Remarks

We studied online training of the LSTM architecture in a distributed network of nodes for regression and introduced online distributed training algorithms for variable length data sequences. We first proposed a generic LSTM based model for variable length data inputs. In order to train this model, we put the model equations in a nonlinear state space form. Based on this form, we introduced distributed extended Kalman and particle filtering based online training algorithms. In this way, we obtain effective training algorithms for our LSTM based model. Here, our distributed particle filtering algorithm guarantees convergence to the optimal centralized parameter estimation in the MSE sense under certain conditions. We achieve this performance with communication and computational complexity in the order of the first order methods [?]. Through simulations involving real life and financial data, we illustrate significant performance improvements with respect to the state of the art methods [?].

### Footnotes

- All column vectors (or matrices) are denoted by boldface lowercase (or uppercase) letters. For a matrix $\mathbf{A}$ (or a vector $\mathbf{a}$), $\mathbf{A}^T$ ($\mathbf{a}^T$) is its ordinary transpose. The time index is given as subscript, e.g., $\mathbf{u}_t$ is the vector at time $t$. Here, $\mathbf{1}$ (or $\mathbf{0}$) is a vector of all ones (or zeros) and $\mathbf{I}$ is the identity matrix, where the sizes of these notations are understood from the context.