# Online Training of LSTM Networks in Distributed Systems for Variable Length Data Sequences

Tolga Ergen and Suleyman S. Kozat Senior Member, IEEE This work is supported in part by TUBITAK Contract No 115E917.The authors are with the Department of Electrical and Electronics Engineering, Bilkent University, Bilkent, Ankara 06800, Turkey, Tel: +90 (312) 290-2336, Fax: +90 (312) 290-1223, (contact e-mail: {ergen, kozat}@ee.bilkent.edu.tr).
###### Abstract

In this brief paper, we investigate online training of Long Short Term Memory (LSTM) architectures in a distributed network of nodes, where each node employs an LSTM based structure for online regression. In particular, each node sequentially receives a variable length data sequence with its label and can only exchange information with its neighbors to train the LSTM architecture. We first provide a generic LSTM based regression structure for each node. In order to train this structure, we put the LSTM equations in a nonlinear state space form for each node and then introduce a highly effective and efficient Distributed Particle Filtering (DPF) based training algorithm. We also introduce a Distributed Extended Kalman Filtering (DEKF) based training algorithm for comparison. Here, our DPF based training algorithm guarantees convergence to the performance of the optimal LSTM coefficients in the mean square error (MSE) sense under certain conditions. We achieve this performance with communication and computational complexity on the order of the first order gradient based methods. Through both simulated and real life examples, we illustrate significant performance improvements with respect to the state of the art methods.

**Keywords**: Distributed learning, online learning, particle filtering, extended Kalman filtering, LSTM networks.

## I Introduction

Neural networks provide enhanced performance for a wide range of engineering applications, e.g., prediction [1] and human behavior modeling [2], thanks to their strong nonlinear modeling capabilities. Among neural networks, recurrent neural networks (RNNs) in particular are used to model time series and temporal data due to their inherent memory, which stores past information [3]. However, since simple RNNs lack control structures, the norm of the gradient may grow or decay rapidly during training, i.e., the exploding and vanishing gradient problems [4]. Due to these problems, simple RNNs are insufficient to capture long and short term dependencies [4]. To circumvent this issue, a novel RNN architecture with control structures, i.e., the Long Short Term Memory (LSTM) network, was introduced [5]. However, since LSTM networks have additional nonlinear control structures with several parameters, they may also suffer from training issues [5].

To this end, in this brief paper, we consider online training of the parameters of an LSTM structure in a distributed network of nodes. Here, we have a network of nodes, where each node has a set of neighboring nodes and can only exchange information with these neighbors. In particular, each node sequentially receives a variable length data sequence with its label and trains the parameters of the LSTM network. Each node can also communicate with its neighbors to share information in order to enhance the training performance since the goal is to train one set of LSTM coefficients using all the available data. As an example application, suppose that we have a database of labelled tweets and our aim is to train an emotion recognition engine based on an LSTM structure, where the training is performed in an online and distributed manner using several processing units. Words in each tweet are represented by word2vec vectors [6] and tweets are distributed to several processing units in an online manner.

The LSTM architectures are usually trained in a batch setting in the literature, where all data instances are present and processed together [3]. However, for applications involving big data, storage issues may arise from keeping all the data in one place [7]. Additionally, in certain frameworks, all data instances are not available beforehand, since instances are received in a sequential manner, which precludes batch training [7]. Hence, we consider online training, where we sequentially receive the data to train the LSTM architecture without storing the previous data instances. Note that even though we work in an online setting, we may still suffer from computational power and storage issues due to the large amount of data [8]. As an example, in tweet emotion recognition applications, the systems are usually trained using an enormous amount of data to achieve sufficient performance, especially for agglutinative languages [6]. For such tasks, distributed architectures are used. In the basic distributed architecture, commonly named the centralized approach [8], the whole data is distributed to different nodes and the trained parameters are later merged at a central node [3]. However, this centralized approach requires high storage capacity and computational power at the central node [8]. Additionally, centralized strategies carry a potential risk of failure at the central node. To circumvent these issues, we distribute both the processing and the data to all the nodes and allow communication only between neighboring nodes; hence, we remove the need for a central node. In particular, each node sequentially receives a variable length data sequence with its label and exchanges information only with its neighboring nodes to train the common LSTM parameters.

For online training of the LSTM architecture in a distributed manner, one can employ one of the first order gradient based algorithms at each node due to their efficiency [3] and exchange estimates among neighboring nodes as in [9]. However, since these training methods exploit only the first order gradient information, they suffer from poor performance and convergence issues. As an example, the Stochastic Gradient Descent (SGD) based algorithms usually have slower convergence compared to the second order methods [9, 3]. On the other hand, the second order gradient based methods require much higher computational complexity and communication load while providing superior performance compared to the first order methods [3]. Following the distributed implementation of the first order methods, one can implement the second order training methods in a distributed manner, where we share not only the estimates but also the Jacobian matrix, e.g., the Distributed Extended Kalman Filtering (DEKF) algorithm [10, 11]. However, as in the first order case, this sharing and combining of the information at each node is ad hoc, which does not provide the optimal training performance [10]. In this brief paper, to provide improved performance with respect to the second order methods while preserving both communication and computational complexity similar to the first order methods, we introduce a highly effective distributed online training method based on the particle filtering algorithm [12]. We first propose an LSTM based model for variable length data regression. We then put this model in a nonlinear state space form to train the model in an online and optimal manner.

Our main contributions include: 1) We introduce distributed LSTM training methods in an online setting for variable length data sequences. Our Distributed Particle Filtering (DPF) based training algorithm guarantees convergence to the optimal centralized training performance in the mean square error (MSE) sense; 2) We achieve this performance with a computational complexity and a communication load in the order of the first order gradient based methods; 3) Through simulations involving real life and financial data, we illustrate significant performance improvements with respect to the state of the art methods [13, 14].

The organization of this brief paper is as follows. In Section II, we first describe the variable length data regression problem in a network of nodes and then introduce an LSTM based structure. Then, in Section III, we first put this structure in a nonlinear state space form and then introduce our training algorithms. In Section IV, we illustrate the merits of our algorithms through simulations. We then finalize the brief paper with concluding remarks in Section V.

## II Model and Problem Description

*Notation:* All column vectors (or matrices) are denoted by boldface lowercase (or uppercase) letters. For a matrix $\boldsymbol{U}$ (or a vector $\boldsymbol{u}$), $\boldsymbol{U}^T$ (or $\boldsymbol{u}^T$) is its ordinary transpose. The time index is given as a subscript, e.g., $\boldsymbol{u}_t$ is the vector at time $t$. Here, $\boldsymbol{1}$ (or $\boldsymbol{0}$) is a vector of all ones (or zeros) and $\boldsymbol{I}$ is the identity matrix, where the sizes of these quantities are understood from the context.

We consider a network of $K$ nodes. In this network, we declare two nodes that can exchange information to be neighbors and denote the neighborhood of each node $k$ as $\mathcal{N}_k$, which also includes the node $k$ itself, i.e., $k \in \mathcal{N}_k$. At each node $k$, we sequentially receive a desired value $d_{k,t} \in \mathbb{R}$ and a matrix $\boldsymbol{X}_{k,t} \in \mathbb{R}^{p \times m_{k,t}}$, defined as $\boldsymbol{X}_{k,t} = [\boldsymbol{x}_{k,t}^{(1)}, \ldots, \boldsymbol{x}_{k,t}^{(m_{k,t})}]$, where $\boldsymbol{x}_{k,t}^{(l)} \in \mathbb{R}^p$ and $m_{k,t}$ is the number of columns in $\boldsymbol{X}_{k,t}$, which can change with respect to $t$. In our network, each node aims to learn a certain relation between the desired value $d_{k,t}$ and the matrix $\boldsymbol{X}_{k,t}$. After observing $\boldsymbol{X}_{k,t}$ and $d_{k,t}$, each node first updates its belief about the relation and then exchanges the updated information with its neighbors. After receiving $\boldsymbol{X}_{k,t+1}$, each node estimates the next signal as $\hat{d}_{k,t+1}$. Based on $\hat{d}_{k,t+1}$, each node suffers the loss $\ell(d_{k,t+1}, \hat{d}_{k,t+1})$ at time instance $t+1$. This framework models a wide range of applications in the machine learning and signal processing literatures, e.g., sentiment analysis [6]. As an example, in a tweet emotion recognition application [6], each $\boldsymbol{X}_{k,t}$ corresponds to a tweet, i.e., the $t$th tweet at the node (processing unit) $k$. For the $t$th tweet at the node $k$, one can construct $\boldsymbol{X}_{k,t}$ by finding the word2vec representation of each word, i.e., $\boldsymbol{x}_{k,t}^{(l)}$ for the $l$th word. After receiving $d_{k,t}$, i.e., the desired emotion label for the $t$th tweet at the node $k$, each node first updates its belief about the relation between the tweet and its emotion label, and then exchanges information, e.g., the trained system parameters, with its neighboring units to estimate the next label.

In this brief paper, each node generates an estimate $\hat{d}_{k,t}$ using the LSTM architecture. Although there exist different variants of LSTM, we use the most widely used variant [5], i.e., the LSTM architecture without peephole connections. Each column of $\boldsymbol{X}_{k,t}$ is first fed to the LSTM architecture as illustrated in Fig. 1, where the internal equations are given as [5]:

$$
\begin{aligned}
\boldsymbol{i}_{k,t}^{(l)} &= \sigma\big(\boldsymbol{W}_k^{(i)}\boldsymbol{x}_{k,t}^{(l)} + \boldsymbol{R}_k^{(i)}\boldsymbol{y}_{k,t}^{(l-1)} + \boldsymbol{b}_k^{(i)}\big) && (1) \\
\boldsymbol{f}_{k,t}^{(l)} &= \sigma\big(\boldsymbol{W}_k^{(f)}\boldsymbol{x}_{k,t}^{(l)} + \boldsymbol{R}_k^{(f)}\boldsymbol{y}_{k,t}^{(l-1)} + \boldsymbol{b}_k^{(f)}\big) && (2) \\
\boldsymbol{c}_{k,t}^{(l)} &= \boldsymbol{i}_{k,t}^{(l)} \odot g\big(\boldsymbol{W}_k^{(z)}\boldsymbol{x}_{k,t}^{(l)} + \boldsymbol{R}_k^{(z)}\boldsymbol{y}_{k,t}^{(l-1)} + \boldsymbol{b}_k^{(z)}\big) + \boldsymbol{f}_{k,t}^{(l)} \odot \boldsymbol{c}_{k,t}^{(l-1)} && (3) \\
\boldsymbol{o}_{k,t}^{(l)} &= \sigma\big(\boldsymbol{W}_k^{(o)}\boldsymbol{x}_{k,t}^{(l)} + \boldsymbol{R}_k^{(o)}\boldsymbol{y}_{k,t}^{(l-1)} + \boldsymbol{b}_k^{(o)}\big) && (4) \\
\boldsymbol{y}_{k,t}^{(l)} &= \boldsymbol{o}_{k,t}^{(l)} \odot h\big(\boldsymbol{c}_{k,t}^{(l)}\big), && (5)
\end{aligned}
$$

where $\boldsymbol{x}_{k,t}^{(l)}$ is the input vector, $\boldsymbol{y}_{k,t}^{(l)}$ is the output vector and $\boldsymbol{c}_{k,t}^{(l)}$ is the state vector for the $l$th LSTM unit. Moreover, $\boldsymbol{o}_{k,t}^{(l)}$, $\boldsymbol{f}_{k,t}^{(l)}$ and $\boldsymbol{i}_{k,t}^{(l)}$ represent the output, forget and input gates, respectively. $g(\cdot)$ and $h(\cdot)$ are set to the hyperbolic tangent function and apply to vectors pointwise. Likewise, $\sigma(\cdot)$ is the pointwise sigmoid function. The operation $\odot$ represents the elementwise multiplication of two vectors of the same size. As the coefficient matrices and the weight vectors of the LSTM architecture, we have $\boldsymbol{W}_k^{(\cdot)}$, $\boldsymbol{R}_k^{(\cdot)}$ and $\boldsymbol{b}_k^{(\cdot)}$, where the sizes are chosen according to the input and output vectors. Given the outputs of the LSTM units for each column of $\boldsymbol{X}_{k,t}$ as seen in Fig. 1, we generate the estimate for each node as follows

$$
\hat{d}_{k,t} = \boldsymbol{w}_{k,t}^T \bar{\boldsymbol{y}}_{k,t}, \qquad (6)
$$

where $\boldsymbol{w}_{k,t}$ is a vector of the regression coefficients and $\bar{\boldsymbol{y}}_{k,t}$ is a vector obtained by taking the average of the LSTM outputs over the columns of $\boldsymbol{X}_{k,t}$, i.e., $\bar{\boldsymbol{y}}_{k,t} = \frac{1}{m_{k,t}} \sum_{l=1}^{m_{k,t}} \boldsymbol{y}_{k,t}^{(l)}$, known as the mean pooling method, as described in Fig. 1.

Remark 1: In (6), we use the mean pooling method to generate $\bar{\boldsymbol{y}}_{k,t}$. One can also use other pooling methods by changing the calculation of $\bar{\boldsymbol{y}}_{k,t}$ and then generate the estimate as in (6). As an example, for the max and last pooling methods, we use $\bar{\boldsymbol{y}}_{k,t} = \max_{l} \boldsymbol{y}_{k,t}^{(l)}$ (taken elementwise) and $\bar{\boldsymbol{y}}_{k,t} = \boldsymbol{y}_{k,t}^{(m_{k,t})}$, respectively. All our derivations hold for these pooling methods and for the other LSTM architectures. We provide the required updates for different LSTM architectures in the next section.
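To make the forward pass in (1)-(5) and the pooling choices of Remark 1 concrete, the following is a minimal NumPy sketch; the weight dictionaries, dimensions and initializations are illustrative assumptions, not the trained parameters of the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(X, W, R, b, n_hidden):
    """Run the LSTM of (1)-(5) over the columns of X (one column per word).

    W, R, b are dicts keyed by gate name: 'i', 'f', 'z', 'o'.
    Returns the stacked unit outputs y^(1), ..., y^(m).
    """
    y = np.zeros(n_hidden)          # y^(0)
    c = np.zeros(n_hidden)          # c^(0)
    outputs = []
    for x in X.T:                   # l-th column of X is the l-th input vector
        i = sigmoid(W['i'] @ x + R['i'] @ y + b['i'])               # input gate,  (1)
        f = sigmoid(W['f'] @ x + R['f'] @ y + b['f'])               # forget gate, (2)
        c = i * np.tanh(W['z'] @ x + R['z'] @ y + b['z']) + f * c   # cell state,  (3)
        o = sigmoid(W['o'] @ x + R['o'] @ y + b['o'])               # output gate, (4)
        y = o * np.tanh(c)                                          # unit output, (5)
        outputs.append(y)
    return np.stack(outputs)        # shape (m, n_hidden)

def pooled_estimate(outputs, w, pooling="mean"):
    """Estimate d as w^T y_bar for the chosen pooling method (Remark 1)."""
    if pooling == "mean":
        y_bar = outputs.mean(axis=0)
    elif pooling == "max":
        y_bar = outputs.max(axis=0)   # elementwise max over units
    else:                             # "last" pooling
        y_bar = outputs[-1]
    return w @ y_bar
```

Here `W`, `R` and `b` hold one matrix or bias per gate, mirroring $\boldsymbol{W}_k^{(\cdot)}$, $\boldsymbol{R}_k^{(\cdot)}$ and $\boldsymbol{b}_k^{(\cdot)}$, and swapping `pooling` changes only how $\bar{\boldsymbol{y}}_{k,t}$ is formed, exactly as noted in Remark 1.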

## III Online Distributed Training Algorithms

In this section, we first give the LSTM equations for each node in a nonlinear state space form. Based on this form, we then introduce our distributed algorithms to train the LSTM parameters in an online manner.

Considering our model in Fig. 1 and the LSTM equations (1)-(5), we have the following nonlinear state space form for each node $k$:

$$
\begin{aligned}
\bar{\boldsymbol{c}}_{k,t} &= \Omega\big(\bar{\boldsymbol{c}}_{k,t-1}, \boldsymbol{X}_{k,t}, \bar{\boldsymbol{y}}_{k,t-1}\big) && (7) \\
\bar{\boldsymbol{y}}_{k,t} &= \Theta\big(\bar{\boldsymbol{c}}_{k,t}, \boldsymbol{X}_{k,t}, \bar{\boldsymbol{y}}_{k,t-1}\big) && (8) \\
\boldsymbol{\theta}_{k,t} &= \boldsymbol{\theta}_{k,t-1} && (9) \\
d_{k,t} &= \boldsymbol{w}_{k,t}^T \bar{\boldsymbol{y}}_{k,t} + \varepsilon_{k,t}, && (10)
\end{aligned}
$$

where $\Omega(\cdot)$ and $\Theta(\cdot)$ represent the nonlinear mappings performed by the consecutive LSTM units and the mean pooling operation as illustrated in Fig. 1, and $\boldsymbol{\theta}_{k,t}$ is a parameter vector consisting of $\boldsymbol{w}_{k,t}$ and the LSTM coefficients $\boldsymbol{W}_k^{(\cdot)}$, $\boldsymbol{R}_k^{(\cdot)}$ and $\boldsymbol{b}_k^{(\cdot)}$ arranged in vector form. Since the LSTM parameters are the states of the network to be estimated, we also include the static equation (9) in our state. Furthermore, $\varepsilon_{k,t}$ represents the error in the observations and is a zero mean Gaussian random variable with variance $R_{k,t}$.

Remark 2: We can also apply the introduced algorithms to different implementations of the LSTM architecture [5]. For this purpose, we modify the functions $\Omega(\cdot)$ and $\Theta(\cdot)$ in (7) and (8) according to the chosen LSTM architecture. We also alter $\boldsymbol{\theta}_{k,t}$ in (9) by adding or removing certain parameters according to the chosen LSTM architecture.

### III-A Online Training Using the DEKF Algorithm

In this subsection, we first derive our training method based on the EKF algorithm, where each node trains its LSTM parameters without any communication with its neighbors. We then introduce our training method based on the DEKF algorithm in order to train the LSTM architecture when we allow communication between the neighbors.

The EKF algorithm is based on the assumption that the state distribution given the observations is Gaussian [11]. To meet this assumption, we introduce Gaussian noise into (7), (8) and (9). With this, we have the following model for each node $k$:

$$
\begin{bmatrix} \bar{\boldsymbol{c}}_{k,t} \\ \bar{\boldsymbol{y}}_{k,t} \\ \boldsymbol{\theta}_{k,t} \end{bmatrix}
=
\begin{bmatrix} \Omega\big(\bar{\boldsymbol{c}}_{k,t-1}, \boldsymbol{X}_{k,t}, \bar{\boldsymbol{y}}_{k,t-1}\big) \\ \Theta\big(\bar{\boldsymbol{c}}_{k,t}, \boldsymbol{X}_{k,t}, \bar{\boldsymbol{y}}_{k,t-1}\big) \\ \boldsymbol{\theta}_{k,t-1} \end{bmatrix}
+
\begin{bmatrix} \boldsymbol{e}_{k,t} \\ \boldsymbol{\epsilon}_{k,t} \\ \boldsymbol{\upsilon}_{k,t} \end{bmatrix}
\qquad (11)
$$

$$
d_{k,t} = \boldsymbol{w}_{k,t}^T \bar{\boldsymbol{y}}_{k,t} + \varepsilon_{k,t}, \qquad (12)
$$

where $[\boldsymbol{e}_{k,t}^T\ \boldsymbol{\epsilon}_{k,t}^T\ \boldsymbol{\upsilon}_{k,t}^T]^T$ is a zero mean Gaussian process with covariance $\boldsymbol{Q}_{k,t}$. Here, each node is able to observe only $d_{k,t}$ to estimate $\bar{\boldsymbol{c}}_{k,t}$, $\bar{\boldsymbol{y}}_{k,t}$ and $\boldsymbol{\theta}_{k,t}$. Hence, we group $\bar{\boldsymbol{c}}_{k,t}$, $\bar{\boldsymbol{y}}_{k,t}$ and $\boldsymbol{\theta}_{k,t}$ together into a vector $\boldsymbol{a}_{k,t} = [\bar{\boldsymbol{c}}_{k,t}^T\ \bar{\boldsymbol{y}}_{k,t}^T\ \boldsymbol{\theta}_{k,t}^T]^T$ as the hidden states to be estimated.

#### III-A1 Online Training with the EKF Algorithm:

In this subsection, we derive the online training method based on the EKF algorithm when we do not allow communication between the neighbors. Since the system in (11) and (12) is already in a nonlinear state space form, we can directly apply the EKF algorithm [11] as follows

Time Update:

$$
\begin{aligned}
\bar{\boldsymbol{c}}_{k,t|t-1} &= \Omega\big(\bar{\boldsymbol{c}}_{k,t-1|t-1}, \boldsymbol{X}_{k,t}, \bar{\boldsymbol{y}}_{k,t-1|t-1}\big) && (13) \\
\bar{\boldsymbol{y}}_{k,t|t-1} &= \Theta\big(\bar{\boldsymbol{c}}_{k,t|t-1}, \boldsymbol{X}_{k,t}, \bar{\boldsymbol{y}}_{k,t-1|t-1}\big) && (14) \\
\boldsymbol{\theta}_{k,t|t-1} &= \boldsymbol{\theta}_{k,t-1|t-1} && (15) \\
\boldsymbol{\Sigma}_{k,t|t-1} &= \boldsymbol{F}_{k,t-1} \boldsymbol{\Sigma}_{k,t-1|t-1} \boldsymbol{F}_{k,t-1}^T + \boldsymbol{Q}_{k,t-1} && (16)
\end{aligned}
$$

Measurement Update:

$$
\begin{aligned}
R &= \boldsymbol{H}_{k,t}^T \boldsymbol{\Sigma}_{k,t|t-1} \boldsymbol{H}_{k,t} + R_{k,t} \\
\begin{bmatrix} \bar{\boldsymbol{c}}_{k,t|t} \\ \bar{\boldsymbol{y}}_{k,t|t} \\ \boldsymbol{\theta}_{k,t|t} \end{bmatrix} &= \begin{bmatrix} \bar{\boldsymbol{c}}_{k,t|t-1} \\ \bar{\boldsymbol{y}}_{k,t|t-1} \\ \boldsymbol{\theta}_{k,t|t-1} \end{bmatrix} + \boldsymbol{\Sigma}_{k,t|t-1} \boldsymbol{H}_{k,t} R^{-1} \big(d_{k,t} - \hat{d}_{k,t}\big) \\
\boldsymbol{\Sigma}_{k,t|t} &= \boldsymbol{\Sigma}_{k,t|t-1} - \boldsymbol{\Sigma}_{k,t|t-1} \boldsymbol{H}_{k,t} R^{-1} \boldsymbol{H}_{k,t}^T \boldsymbol{\Sigma}_{k,t|t-1},
\end{aligned}
$$

where $\boldsymbol{\Sigma}_{k,t|t}$ is the error covariance matrix, $\boldsymbol{Q}_{k,t}$ is the state noise covariance and $R_{k,t}$ is the measurement noise variance. Additionally, we assume that $\boldsymbol{Q}_{k,t}$ and $R_{k,t}$ are known. We compute $\boldsymbol{H}_{k,t}$ and $\boldsymbol{F}_{k,t}$ as follows

$$
\boldsymbol{H}_{k,t}^T = \left[ \frac{\partial \hat{d}_{k,t}}{\partial \bar{\boldsymbol{c}}}\ \ \frac{\partial \hat{d}_{k,t}}{\partial \bar{\boldsymbol{y}}}\ \ \frac{\partial \hat{d}_{k,t}}{\partial \boldsymbol{\theta}} \right] \Bigg|_{\substack{\bar{\boldsymbol{c}} = \bar{\boldsymbol{c}}_{k,t|t-1} \\ \bar{\boldsymbol{y}} = \bar{\boldsymbol{y}}_{k,t|t-1} \\ \boldsymbol{\theta} = \boldsymbol{\theta}_{k,t|t-1}}} \qquad (17)
$$

and

$$
\boldsymbol{F}_{k,t} = \begin{bmatrix}
\frac{\partial \Omega(\bar{\boldsymbol{c}}, \boldsymbol{X}_{k,t}, \bar{\boldsymbol{y}})}{\partial \bar{\boldsymbol{c}}} & \frac{\partial \Omega(\bar{\boldsymbol{c}}, \boldsymbol{X}_{k,t}, \bar{\boldsymbol{y}})}{\partial \bar{\boldsymbol{y}}} & \frac{\partial \Omega(\bar{\boldsymbol{c}}, \boldsymbol{X}_{k,t}, \bar{\boldsymbol{y}})}{\partial \boldsymbol{\theta}} \\
\frac{\partial \Theta(\bar{\boldsymbol{c}}, \boldsymbol{X}_{k,t}, \bar{\boldsymbol{y}})}{\partial \bar{\boldsymbol{c}}} & \frac{\partial \Theta(\bar{\boldsymbol{c}}, \boldsymbol{X}_{k,t}, \bar{\boldsymbol{y}})}{\partial \bar{\boldsymbol{y}}} & \frac{\partial \Theta(\bar{\boldsymbol{c}}, \boldsymbol{X}_{k,t}, \bar{\boldsymbol{y}})}{\partial \boldsymbol{\theta}} \\
\boldsymbol{0} & \boldsymbol{0} & \boldsymbol{I}
\end{bmatrix} \Bigg|_{\substack{\bar{\boldsymbol{c}} = \bar{\boldsymbol{c}}_{k,t|t} \\ \bar{\boldsymbol{y}} = \bar{\boldsymbol{y}}_{k,t|t} \\ \boldsymbol{\theta} = \boldsymbol{\theta}_{k,t|t}}}, \qquad (18)
$$

where $\boldsymbol{0}$ and $\boldsymbol{I}$ are the all-zeros and identity matrices of the appropriate sizes.
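As a sketch of how the time update (13)-(16) and the measurement update combine in practice, the following generic EKF step for a scalar observation takes the nonlinear mappings and their Jacobians as callables; the function names and the flattened state layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ekf_step(a, Sigma, d, f, F_jac, h, H_jac, Q, R_meas):
    """One EKF time + measurement update for a scalar observation d.

    f: state transition, F_jac: its Jacobian,
    h: observation function, H_jac: its gradient (a vector).
    """
    # Time update: propagate the state and the error covariance, as in (16)
    a_pred = f(a)
    F = F_jac(a)
    Sigma_pred = F @ Sigma @ F.T + Q
    # Measurement update with innovation variance R and gain K
    H = H_jac(a_pred).reshape(-1, 1)            # H as a column vector
    R = float(H.T @ Sigma_pred @ H) + R_meas    # scalar innovation variance
    K = (Sigma_pred @ H) / R                    # Kalman gain
    a_new = a_pred + (K * (d - h(a_pred))).ravel()
    Sigma_new = Sigma_pred - K @ H.T @ Sigma_pred
    return a_new, Sigma_new
```

For a linear model this step reduces to the ordinary Kalman filter, which gives a quick sanity check of the update equations.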

#### III-A2 Online Training with the DEKF Algorithm:

In this subsection, we introduce our online training method based on the DEKF algorithm for the network described by (11) and (12). In our network of nodes, we denote the number of neighbors of the node $k$, i.e., the degree of the node, as $\eta_k$ [10]. With this structure, the time update equations (13)-(16) still hold for each node $k$. However, since we have information exchange between the neighbors, the measurement update equations of each node adopt the following iterative scheme [10].

For the node $k$ at time $t$:

$$
\begin{aligned}
\boldsymbol{\phi}_{k,t} &\longleftarrow \big[\bar{\boldsymbol{c}}_{k,t|t-1}^T\ \ \bar{\boldsymbol{y}}_{k,t|t-1}^T\ \ \boldsymbol{\theta}_{k,t|t-1}^T\big]^T \\
\boldsymbol{\Phi}_{k,t} &\longleftarrow \boldsymbol{\Sigma}_{k,t|t-1}
\end{aligned}
$$

For each $l \in \mathcal{N}_k$ repeat:

$$
\begin{aligned}
R &\longleftarrow \boldsymbol{H}_{l,t}^T \boldsymbol{\Phi}_{k,t} \boldsymbol{H}_{l,t} + R_{l,t} \\
\boldsymbol{\phi}_{k,t} &\longleftarrow \boldsymbol{\phi}_{k,t} + \boldsymbol{\Phi}_{k,t} \boldsymbol{H}_{l,t} R^{-1} \big(d_{l,t} - \boldsymbol{w}_{k,t|t-1}^T \bar{\boldsymbol{y}}_{k,t|t-1}\big) \\
\boldsymbol{\Phi}_{k,t} &\longleftarrow \boldsymbol{\Phi}_{k,t} - \boldsymbol{\Phi}_{k,t} \boldsymbol{H}_{l,t} R^{-1} \boldsymbol{H}_{l,t}^T \boldsymbol{\Phi}_{k,t}.
\end{aligned}
$$

Now, we update the state and covariance matrix estimate as

$$
\boldsymbol{\Sigma}_{k,t|t} = \boldsymbol{\Phi}_{k,t}, \qquad \big[\bar{\boldsymbol{c}}_{k,t|t}^T\ \ \bar{\boldsymbol{y}}_{k,t|t}^T\ \ \boldsymbol{\theta}_{k,t|t}^T\big]^T = \sum_{l \in \mathcal{N}_k} c(k,l)\, \boldsymbol{\phi}_{l,t},
$$

where $c(k,l)$ is the combination weight between the nodes $k$ and $l$, and we compute these weights using the Metropolis rule as follows

$$
c(k,l) = \begin{cases} 1/\max(\eta_k, \eta_l) & \text{if } l \in \mathcal{N}_k \setminus \{k\} \\ 1 - \sum_{l \in \mathcal{N}_k \setminus \{k\}} c(k,l) & \text{if } k = l \\ 0 & \text{if } l \notin \mathcal{N}_k. \end{cases} \qquad (19)
$$

With these steps, we can update all the nodes in our network as illustrated in Algorithm 1.
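A minimal sketch of the Metropolis rule (19), assuming the network is described by a neighbor list in which `neighbors[k]` excludes the node `k` itself:

```python
import numpy as np

def metropolis_weights(neighbors):
    """Combination weights c(k, l) of (19).

    neighbors[k] is the set of neighbors of node k (excluding k itself).
    """
    K = len(neighbors)
    C = np.zeros((K, K))
    deg = [len(neighbors[k]) for k in range(K)]   # eta_k
    for k in range(K):
        for l in neighbors[k]:
            C[k, l] = 1.0 / max(deg[k], deg[l])
        C[k, k] = 1.0 - C[k].sum()   # self weight keeps each row summing to 1
    return C
```

By construction the resulting matrix is symmetric and each row sums to one, so the combination step in the DEKF measurement update is a convex averaging over the neighborhood.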

According to the procedure in Algorithm 1, the computational complexity of our training method is dominated by the matrix and vector multiplications on lines 8 and 19, performed at each node, as shown in Table I.

### III-B Online Training Using the DPF Algorithm

In this subsection, we first derive our training method based on the PF algorithm when we do not allow communication between the nodes. We then introduce our online training method based on the DPF algorithm when the nodes share information with their neighbors.

The PF algorithm only requires the independence of the noise samples in (11) and (12). Thus, we rewrite our system in (11) and (12) for the node $k$ as follows

$$
\begin{aligned}
\boldsymbol{a}_{k,t} &= \varphi\big(\boldsymbol{a}_{k,t-1}, \boldsymbol{X}_{k,t}\big) + \boldsymbol{\gamma}_{k,t} && (20) \\
d_{k,t} &= \boldsymbol{w}_{k,t}^T \bar{\boldsymbol{y}}_{k,t} + \varepsilon_{k,t}, && (21)
\end{aligned}
$$

where $\boldsymbol{\gamma}_{k,t}$ and $\varepsilon_{k,t}$ are independent state and measurement noise samples, respectively, $\varphi(\cdot)$ is the nonlinear mapping in (11) and $\boldsymbol{a}_{k,t} = [\bar{\boldsymbol{c}}_{k,t}^T\ \bar{\boldsymbol{y}}_{k,t}^T\ \boldsymbol{\theta}_{k,t}^T]^T$.

#### III-B1 Online Training with the PF Algorithm:

For the system in (20) and (21), our aim is to obtain $E[\boldsymbol{a}_{k,t}\,|\,d_{k,1:t}]$, i.e., the optimal estimate of the hidden state in the MSE sense. To achieve this, we first obtain the posterior distribution of the states, i.e., $p(\boldsymbol{a}_{k,t}\,|\,d_{k,1:t})$. Based on the posterior density function, we then calculate the conditional mean estimate. In order to obtain the posterior distribution, we apply the PF algorithm [15].

In this algorithm, we have $N$ samples of $\boldsymbol{a}_{k,t}$ and the corresponding weights, denoted as $\{\boldsymbol{a}_{k,t}^i, \omega_{k,t}^i\}_{i=1}^N$. Based on the samples, we obtain the posterior distribution as follows

$$
p(\boldsymbol{a}_{k,t}\,|\,d_{k,1:t}) \approx \sum_{i=1}^N \omega_{k,t}^i\, \delta\big(\boldsymbol{a}_{k,t} - \boldsymbol{a}_{k,t}^i\big). \qquad (22)
$$

Sampling from the desired distribution is intractable in general, so we obtain the samples from $q(\boldsymbol{a}_{k,t}\,|\,d_{k,1:t})$, which is called the importance function [15]. To calculate the weights in (22), we use the following formula

$$
\omega_{k,t}^i \propto \frac{p(\boldsymbol{a}_{k,t}^i\,|\,d_{k,1:t})}{q(\boldsymbol{a}_{k,t}^i\,|\,d_{k,1:t})}, \quad \text{where } \sum_{i=1}^N \omega_{k,t}^i = 1. \qquad (23)
$$

We can factorize (23) such that we obtain the following recursive formula [15]

$$
\omega_{k,t}^i \propto \frac{p(d_{k,t}\,|\,\boldsymbol{a}_{k,t}^i)\, p(\boldsymbol{a}_{k,t}^i\,|\,\boldsymbol{a}_{k,t-1}^i)}{q(\boldsymbol{a}_{k,t}^i\,|\,\boldsymbol{a}_{k,t-1}^i, d_{k,t})}\, \omega_{k,t-1}^i. \qquad (24)
$$

In (24), we choose the importance function so that the variance of the weights is minimized. In this way, we obtain particles that have nonnegligible weights and contribute significantly to (22) [15]. In this sense, since the prior $p(\boldsymbol{a}_{k,t}\,|\,\boldsymbol{a}_{k,t-1}^i)$ provides a small variance for the weights [15], we choose it as our importance function. With this choice, (24) reduces to

$$
\omega_{k,t}^i \propto p(d_{k,t}\,|\,\boldsymbol{a}_{k,t}^i)\, \omega_{k,t-1}^i. \qquad (25)
$$

By (22) and (25), we obtain the state estimate as follows

$$
E[\boldsymbol{a}_{k,t}\,|\,d_{k,1:t}] = \int \boldsymbol{a}_{k,t}\, p(\boldsymbol{a}_{k,t}\,|\,d_{k,1:t})\, d\boldsymbol{a}_{k,t} \approx \int \boldsymbol{a}_{k,t} \sum_{i=1}^N \omega_{k,t}^i\, \delta\big(\boldsymbol{a}_{k,t} - \boldsymbol{a}_{k,t}^i\big)\, d\boldsymbol{a}_{k,t} = \sum_{i=1}^N \omega_{k,t}^i\, \boldsymbol{a}_{k,t}^i.
$$

Although we choose the importance function to reduce the variance of the weights, the variance inevitably increases over time [15]. Hence, we apply the resampling algorithm introduced in [15], which eliminates the particles with small weights and prevents the variance from increasing.
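The propagation, weight update (25), conditional mean estimate and resampling can be sketched as follows for a scalar observation with Gaussian state and measurement noise; the effective sample size threshold `N/2` is a common heuristic, not a choice made in the paper.

```python
import numpy as np

def pf_step(particles, weights, d, f, h, q_std, r_std, rng):
    """One particle filter step with the prior as importance function,
    so the weight update reduces to (25), followed by resampling.

    f: state transition, h: scalar observation function.
    """
    N = len(particles)
    # Propagate each particle through the state equation plus Gaussian noise
    particles = np.array([f(a) + rng.normal(0.0, q_std, size=a.shape)
                          for a in particles])
    # Weight update (25): multiply by the Gaussian likelihood p(d | a^i)
    innov = d - np.array([h(a) for a in particles])
    weights = weights * np.exp(-0.5 * (innov / r_std) ** 2)
    weights = weights / weights.sum()
    # Conditional mean estimate: sum_i w^i a^i
    estimate = (weights[:, None] * particles).sum(axis=0)
    # Resample when the effective sample size collapses (common heuristic)
    if 1.0 / (weights ** 2).sum() < N / 2:
        idx = rng.choice(N, size=N, p=weights)
        particles, weights = particles[idx], np.full(N, 1.0 / N)
    return particles, weights, estimate
```

Dropping the normalization constant of the Gaussian likelihood is harmless here because the weights are renormalized immediately afterwards.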

#### III-B2 Online Training with the DPF Algorithm:

In this subsection, we introduce our online training method based on the DPF algorithm when the nodes share information with their neighbors. We employ the Markov Chain Distributed Particle Filter (MCDPF) algorithm [12] to train our distributed system. In the MCDPF algorithm, particles move around the network according to the network topology. In every step, each particle can randomly move to another node in the neighborhood of its current node. While moving randomly, the weight of each particle is updated using the likelihood $p(d_{k,t}\,|\,\boldsymbol{a}_{k,t}^i)$ at the node $k$; hence, the particles use the observations at different nodes.

We consider our network as a graph $G = (V, E)$, where the vertices $V$ represent the nodes in our network and the edges $E$ represent the connections between the nodes. In addition, we denote the number of visits by each particle $i$ to the node $j$ in $s$ steps as $M_i(j,s)$. Here, each particle moves to one of its neighboring nodes with a certain probability, where the movement probabilities between the nodes are represented by the adjacency matrix, denoted as $\boldsymbol{A}$. In this framework, at each visit to the node $j$, each particle multiplies its weight by $p(d_{j,t}\,|\,\boldsymbol{a}_{k,t}^i)^{2|E(G)|/(s\eta_j)}$ in a run of $s$ steps [12], where $|E(G)|$ is the number of edges in $G$ and $\eta_j$ is the degree of the node $j$. From (25), we have the following update for each particle at the node $k$ after $s$ steps

$$
\omega_{k,t}^i = \omega_{k,t-1}^i \prod_{j=1}^K p(d_{j,t}\,|\,\boldsymbol{a}_{k,t}^i)^{\frac{2|E(G)|}{s\,\eta_j} M_i(j,s)}. \qquad (26)
$$

We then calculate the posterior distribution at the node $k$ as

$$
p(\boldsymbol{a}_{k,t}\,|\,O_{k,t}) \approx \sum_{i=1}^N \omega_{k,t}^i\, \delta\big(\boldsymbol{a}_{k,t} - \boldsymbol{a}_{k,t}^i\big), \qquad (27)
$$

where $O_{k,t}$ represents the observations seen by the particles at the node $k$ up to time $t$ and $\omega_{k,t}^i$ is obtained from (26). After we obtain (27), we calculate our estimate for $\boldsymbol{a}_{k,t}$ as follows

$$
E[\boldsymbol{a}_{k,t}\,|\,O_{k,t}] = \int \boldsymbol{a}_{k,t}\, p(\boldsymbol{a}_{k,t}\,|\,O_{k,t})\, d\boldsymbol{a}_{k,t} \approx \int \boldsymbol{a}_{k,t} \sum_{i=1}^N \omega_{k,t}^i\, \delta\big(\boldsymbol{a}_{k,t} - \boldsymbol{a}_{k,t}^i\big)\, d\boldsymbol{a}_{k,t} = \sum_{i=1}^N \omega_{k,t}^i\, \boldsymbol{a}_{k,t}^i. \qquad (28)
$$

We can obtain the estimate for each node using the same procedure, as illustrated in Algorithm 2. In Algorithm 2, $N_k$ represents the number of particles at the node $k$, and the algorithm also tracks the indices of the particles that move from the node $k$ to the node $l$. Thus, we obtain a distributed training algorithm that guarantees convergence to the optimal centralized parameter estimation, as illustrated in Theorem 1.
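A sketch of the random walk weight update (26) for a single particle, assuming the network is given by a 0/1 adjacency matrix; the function and argument names are illustrative, not the MCDPF reference implementation.

```python
import numpy as np

def mcdpf_weight_update(w_prev, lik, A, start_node, s, rng):
    """Weight update (26) for one particle.

    The particle takes s random-walk steps on the adjacency structure A,
    counting visits M(j, s), then multiplies its weight by
    p(d_j | a^i)^(2|E(G)| / (s * eta_j)) once per visit to node j.
    lik[j] is the likelihood p(d_{j,t} | a^i) evaluated at node j.
    """
    K = A.shape[0]
    deg = A.sum(axis=1)                  # eta_j, the degree of each node
    n_edges = A.sum() / 2                # |E(G)| for an undirected graph
    visits = np.zeros(K, dtype=int)      # M(j, s)
    node = start_node
    for _ in range(s):
        node = rng.choice(np.flatnonzero(A[node]))  # hop to a random neighbor
        visits[node] += 1
    w = w_prev
    for j in range(K):
        if visits[j]:
            w *= lik[j] ** (2 * n_edges / (s * deg[j]) * visits[j])
    return w
```

The exponent $2|E(G)|/(s\eta_j)$ compensates for the fact that a random walk visits high degree nodes more often, so in expectation every node's likelihood contributes with equal total power.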

Theorem 1: For each node $k$, let $\boldsymbol{a}_{k,t}$ be the bounded state vector with a measurement density function satisfying

$$
0 < p(d_{k,t}\,|\,\boldsymbol{a}_{k,t}) \le C < \infty, \qquad (29)
$$

where $C$ is a constant and

$$
\|p\|_\infty = \sup_{d_{k,t}} p(d_{k,t}\,|\,\boldsymbol{a}_{k,t}).
$$

Then, we have the following convergence result in the MSE sense

$$
\sum_{i=1}^N \omega_{k,t}^i\, \boldsymbol{a}_{k,t}^i \to E\big[\boldsymbol{a}_{k,t}\,\big|\,\{d_{j,1:t}\}_{j=1}^K\big] \quad \text{as } N \to \infty \text{ and } s \to \infty.
$$

Proof of Theorem 1: Using (29), from [12], we obtain

$$
E\bigg[\Big(E\big[\pi(\boldsymbol{a}_t)\,\big|\,\{d_{j,1:t}\}_{j=1}^K\big] - \sum_{i=1}^N \omega_{k,t}^i\, \pi(\boldsymbol{a}_{k,t}^i)\Big)^2\bigg] \le \|\pi\|_\infty^2 \Big(C_t \sqrt{U(s,\upsilon)} + \sqrt{\frac{\varsigma_t}{N}}\Big)^2, \qquad (30)
$$

where $\pi(\cdot)$ is a bounded function, $\upsilon$ is the second largest eigenvalue modulus of $\boldsymbol{A}$, $C_t$ and $\varsigma_t$ are time dependent constants and $U(s,\upsilon)$ is a function of $s$ and $\upsilon$, as described in [12], such that $U(s,\upsilon)$ goes to zero as $s$ goes to infinity. Since the state vector is bounded, we can choose $\pi(\cdot)$ as the identity map on the entries of the state vector. With this choice, evaluating (30) as $N$ and $s$ go to infinity yields the result. This concludes our proof.

According to the update procedure illustrated in Algorithm 2, the computational complexity of our training method is dominated by the matrix and vector multiplications in (20) and (21), performed for each particle at each node, as shown in Table I.

## IV Simulations

We evaluate the performance of the introduced algorithms on several real benchmark datasets. We first consider the prediction performance on the Hong Kong exchange rate dataset [16]. We then evaluate the regression performance on an emotion labelled sentence dataset [17]. For these experiments, to observe the effects of communication among the nodes, we also consider the EKF and PF based algorithms without communication over a network of multiple nodes, where each node trains its LSTM based only on its own observations. Throughout this section, we denote these noncommunicating EKF and PF based algorithms as “EKF” and “PF”, respectively. We also consider the SGD based algorithm without communication over a network of multiple nodes as a benchmark algorithm and denote it by “SGD”.

We first consider the Hong Kong exchange rate dataset [16]. For this dataset, we have the amount of Hong Kong dollars that could buy one United States dollar on certain days. Our aim is to estimate the future exchange rate by using the values of the previous two days. In online applications, one can demand a small steady state error or a fast convergence rate based on the requirements of the application [18]. In this experiment, we evaluate the convergence rates of the algorithms. For this purpose, we select the parameters such that the algorithms converge to the same steady state error level. In this setup, we choose the parameters for each node as follows. Since the previous two days' rates form our input, we set the output dimension accordingly. In addition, we consider a network of four nodes. For the PF based algorithms, we fix the number of particles. Additionally, we select the state and measurement noises as zero mean Gaussian random variables with fixed variances. For the DPF based algorithm, we choose