Multivariate LSTM-FCNs for Time Series Classification
Over the past decade, multivariate time series classification has received a great deal of attention. We propose augmenting the existing univariate time series classification models, LSTM-FCN and ALSTM-FCN, with a squeeze and excitation block to further improve performance. Our proposed models outperform most of the state of the art models while requiring minimal preprocessing. The proposed models work efficiently on various complex multivariate time series classification tasks such as activity recognition or action recognition. Furthermore, the proposed models are highly efficient at test time and small enough to deploy on memory constrained systems.
Time series data is used in various fields of study, ranging from weather readings to psychological signals [kadous2002temporal, sharabiani2017efficient]. A time series is a sequence of data points in a time domain, typically sampled at a uniform interval [wang2016effective]. The amount of time series data being collected by sensors has increased significantly [spiegel2011pattern]. A time series dataset can be univariate, where a sequence of measurements from the same variable is collected, or multivariate, where sequences of measurements from multiple variables are collected [prieto2015stacking]. Over the past decade, multivariate time series classification has received significant interest. Applications of multivariate time series classification include healthcare, activity recognition, object recognition, and action recognition [fu2015human, geurts2001pattern, pavlovic1999time]. In this paper, we propose two deep learning models that outperform existing state of the art algorithms.
Several time series classification algorithms have been developed over the years. Distance based methods along with k-nearest neighbors (k-NN) have proven to be successful in classifying multivariate time series [orsenigo2010combining]. Plenty of research indicates Dynamic Time Warping (DTW) as the best distance based measure to use with k-NN [seto2015multivariate].
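To make the distance based baseline concrete, the following is a minimal sketch of DTW combined with a 1-nearest-neighbor classifier. It assumes the "dependent" multivariate variant, in which the cost between two frames is the squared Euclidean distance across all variables; the function names `dtw_distance` and `classify_1nn` are illustrative and are not taken from the cited works.

```python
import math


def dtw_distance(a, b):
    """DTW distance between two multivariate series (lists of frames)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Squared Euclidean distance between the two frames.
            d = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1]))
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def classify_1nn(train, query):
    """train is a list of (series, label); return the label of the nearest series."""
    return min(train, key=lambda item: dtw_distance(item[0], query))[1]


rising = [[0, 0], [1, 1], [2, 2]]
falling = [[2, 2], [1, 1], [0, 0]]
stretched = [[0, 0], [0, 0], [1, 1], [2, 2]]  # same shape as rising, warped in time
label = classify_1nn([(rising, "up"), (falling, "down")], stretched)
```

Because DTW may repeat frames while aligning, the stretched series is at distance zero from the rising one even though their lengths differ.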
In addition to distance based metrics, other traditional feature based algorithms are used. Typically, feature based classification algorithms rely heavily on the features being extracted from the time series data [xing2010brief]. However, feature extraction is very difficult because intrinsic features of time series data are challenging to capture. For this reason, distance based approaches are more successful in classifying multivariate time series data [zheng2014time]. Hidden State Conditional Random Field (HCRF) and Hidden Unit Logistic Model (HULM) are two successful feature based algorithms that have led to state of the art results on various benchmark datasets, ranging from online character recognition to activity recognition [pei2017multivariate]. HCRF is a computationally expensive algorithm that detects latent structures of the input time series data using a chain of k-nomial latent variables. The number of parameters in the model increases linearly with the total number of latent states required [quattoni2007hidden]. Further, HCRF tends to overfit on datasets that require a large number of latent states. To overcome this, HULM proposes using $H$ binary stochastic hidden units to model $2^H$ latent structures of the data with only $O(H)$ parameters. Results indicate that HULM outperforms HCRF on most datasets [pei2017multivariate].
Traditional models, such as the naive logistic model (NL) and Fisher kernel learning (FKL) [jaakkola2000discriminative], show strong performance on a wide variety of time series classification problems. The NL model is a linear logistic model that makes a prediction by summing the inner products between the model weights and feature vectors over time, followed by a softmax function [pei2017multivariate]. The FKL model is effective on time series classification problems when based on Hidden Markov Models (HMM). Subsequently, the features or representation from the FKL model are used to train a linear SVM to make a final prediction [jaakkola2000discriminative, maaten2011learning].
Another common approach for multivariate time series classification is to apply dimensionality reduction techniques or to concatenate all dimensions of a multivariate time series into a univariate time series. Symbolic Representation for Multivariate Time Series (SMTS) [SMTS] applies a random forest on the multivariate time series to partition it into leaf nodes, which are each represented by a word to form a codebook. These words are used with another random forest to classify the multivariate time series. Learned Pattern Similarity (LPS) [LPS] is a similar model that extracts segments from the multivariate time series. These segments are used to train regression trees to find dependencies between them. Each node is represented by a word. Finally, these words are used with a similarity measure to classify the unknown multivariate time series. Ultra Fast Shapelets (UFS) [UFS] obtains random shapelets from the multivariate time series and applies a linear SVM or a Random Forest classifier. Subsequently, UFS was enhanced by additionally computing derivatives as features (dUFS) [UFS]. The Auto-Regressive (AR) kernel [ARkernel] applies an AR kernel-based distance measure to classify the multivariate time series. Auto-Regressive forests for multivariate time series modelling (mv-ARF) [mvARF] uses a tree ensemble in which each tree is trained with a different time lag. Most recently, WEASEL+MUSE [schafer2017multivariate] builds a multivariate feature vector using a classical bag of patterns approach on each variable with various sliding window sizes to capture discrete features, words and pairs of words. Subsequently, feature selection is used to remove non-discriminative features using a Chi-squared test. The final classification is obtained using a logistic classifier on the final feature vector.
Deep learning has also yielded promising results for multivariate time series classification. In 2014, Zheng et al. proposed using a Multi-Channel Deep Convolutional Neural Network (MC-DCNN) for multivariate time series classification. MC-DCNN takes input from each variable to detect latent features. The latent features from each channel are fed into an MLP to perform classification [zheng2014time].
This paper proposes two deep learning models for multivariate time series classification. The proposed models require minimal preprocessing and are tested on 35 datasets, obtaining strong performance on most of them. The rest of the paper is ordered as follows. The background works are discussed in Section II. We present the architecture of the two proposed models in Section III. In Section IV, we discuss the datasets the models are tested on, present our results and analyze our findings. In Section V we draw our conclusion.
II Background Works
II-A Temporal Convolutions
In the proposed models, Temporal Convolutional Networks are used as a feature extraction module of the Fully Convolutional Network (FCN) branch. Typically, a basic convolution block contains a convolution layer, which is accompanied by a batch normalization [ioffe2015batch]. The batch normalization is followed by an activation function of either a Rectified Linear Unit or a Parametric Rectified Linear Unit [Trottier2016].
Generally, Temporal Convolutional Networks have an input of a time series signal. Lea et al. [Lea_2016] define $X_t \in \mathbb{R}^{F_0}$ to be the input feature vector of length $F_0$ for time step $t$, with $0 < t \le T$, where $T$ is the number of time steps of a sequence. Each frame has an action label, $y_t$, where $y_t \in \{1, \ldots, C\}$ and $C$ is the number of classes.
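Each of the convolutional layers has a 1D filter applied to it, such that the evolution of the input signals over time is captured. Lea et al. [Lea_2016] use a tensor $W^{(l)} \in \mathbb{R}^{F_l \times d \times F_{l-1}}$ and biases $b^{(l)} \in \mathbb{R}^{F_l}$ to parameterize the 1D filter. The layer index is defined as $l \in \{1, \ldots, L\}$ and the filter duration is defined by $d$. The $i$-th component of the unnormalized activation $\hat{E}_t^{(l)} \in \mathbb{R}^{F_l}$ for the $l$-th layer is a function of the incoming normalized activation matrix $E^{(l-1)} \in \mathbb{R}^{F_{l-1} \times T_{l-1}}$ from the previous layer
$$\hat{E}_{i,t}^{(l)} = f\left( b_i^{(l)} + \sum_{t'=1}^{d} \left\langle W_{i,t',\cdot}^{(l)}, E_{\cdot,t+d-t'}^{(l-1)} \right\rangle \right)$$
for each time $t$, where $f(\cdot)$ is a Rectified Linear Unit.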
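As an illustration of a single temporal convolution layer, the sketch below computes, for every valid time step, a bias plus the sum of inner products between each filter tap and the corresponding input frame, followed by a ReLU. It is a simplified reading of the layer: it assumes no padding, a stride of one, and a forward tap ordering, and it omits batch normalization. The name `temporal_conv_relu` is hypothetical.

```python
def temporal_conv_relu(x, weights, biases):
    """Apply a 1D convolution over time followed by a ReLU.

    x:       T frames, each a list of F_in feature values
    weights: F_out filters, each a list of d taps of F_in values
    biases:  F_out bias values
    """
    d = len(weights[0])
    n_in = len(x[0])
    out = []
    for t in range(len(x) - d + 1):  # valid positions only, no padding
        frame = []
        for w, b in zip(weights, biases):
            z = b + sum(w[k][c] * x[t + k][c] for k in range(d) for c in range(n_in))
            frame.append(max(0.0, z))  # ReLU
        out.append(frame)
    return out


signal = [[1.0], [2.0], [3.0], [4.0]]          # T = 4 time steps, one variable
filters = [[[1.0], [-1.0]], [[-1.0], [1.0]]]   # two filters of duration d = 2
features = temporal_conv_relu(signal, filters, [0.5, 0.0])
```

The second filter responds to the steady increase of the signal, while the first is suppressed by the ReLU.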
II-B Recurrent Neural Networks
Recurrent Neural Networks (RNN) are a form of neural networks that display temporal behavior through the direct connections between individual layers. Pascanu et al. [pascanu2013construct] state that an RNN maintains a hidden vector $h$ that is updated at time step $t$,
$$h_t = \tanh(W h_{t-1} + I x_t),$$
where the hyperbolic tangent function is represented by $\tanh$, the recurrent weight matrix is denoted by $W$ and the projection matrix is signified by $I$. A prediction, $y_t$, can be made using a hidden state, $h_t$, and a weight matrix, $W$,
$$y_t = \text{softmax}(W h_{t-1}).$$
The softmax creates a normalized probability distribution over all possible classes using a logistic sigmoid function, $\sigma$. RNNs can be stacked to create deeper networks by using the hidden state, $h$, as an input to another RNN,
$$h_t^l = \sigma(W h_{t-1}^l + I h_t^{l-1}).$$
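A vanilla RNN of this kind can be sketched in a few lines. The example below assumes the update "new hidden state = tanh(recurrent matrix times previous hidden state + projection matrix times current input)" and, as a simplification, makes one prediction from the final hidden state. The separate output matrix `W_out` and the function names are assumptions made for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(z):
    """Normalize scores into a probability distribution."""
    shift = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def rnn_forward(W, I, W_out, inputs):
    """Run the recurrence over all inputs and classify from the last state."""
    h = [0.0] * len(W)
    for x in inputs:
        h = [math.tanh(r + p) for r, p in zip(matvec(W, h), matvec(I, x))]
    return h, softmax(matvec(W_out, h))


W = [[0.0, 0.0], [0.0, 0.0]]                   # no recurrence in this toy example
I = [[1.0, 0.0], [0.0, 1.0]]                   # identity projection
W_out = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # three classes, untrained
hidden, probabilities = rnn_forward(W, I, W_out, [[0.5, -0.5]])
```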
II-C Long Short-Term Memory RNNs
A major issue with RNNs is that they suffer from the vanishing gradient problem. Long short-term memory (LSTM) RNNs address this problem by integrating gating functions into their state dynamics [hochreiter1997long]. An LSTM maintains a hidden vector, $h$, and a memory vector, $m$, which control state updates and outputs at each time step, respectively. The computation at each time step is depicted by Graves et al. [graves2012supervised] as the following:
$$g^u = \sigma(W^u h_{t-1} + I^u x_t)$$
$$g^f = \sigma(W^f h_{t-1} + I^f x_t)$$
$$g^o = \sigma(W^o h_{t-1} + I^o x_t)$$
$$g^c = \tanh(W^c h_{t-1} + I^c x_t)$$
$$m_t = g^f \odot m_{t-1} + g^u \odot g^c$$
$$h_t = \tanh(g^o \odot m_t)$$
where the logistic sigmoid function is defined by $\sigma$ and the elementwise multiplication is represented by $\odot$. The recurrent weight matrices are depicted using the notation $W^u, W^f, W^o, W^c$ and the projection matrices are portrayed by $I^u, I^f, I^o, I^c$.
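The gating computation can be illustrated with a single LSTM time step. The sketch below assumes the formulation in which update, forget and output gates use a sigmoid, the candidate uses tanh, the memory is "forget gate times old memory plus update gate times candidate", and the hidden state is tanh of the output gate times the new memory. Biases are omitted, and the dictionary layout of `params` is a hypothetical convenience.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(params, h_prev, m_prev, x):
    """One LSTM step; params maps a gate name to (recurrent, projection) matrices."""
    def gate(name, activation):
        W, I = params[name]
        return [activation(r + p) for r, p in zip(matvec(W, h_prev), matvec(I, x))]

    g_u = gate("u", sigmoid)    # update gate
    g_f = gate("f", sigmoid)    # forget gate
    g_o = gate("o", sigmoid)    # output gate
    g_c = gate("c", math.tanh)  # candidate memory
    m = [f * mp + u * c for f, mp, u, c in zip(g_f, m_prev, g_u, g_c)]
    h = [math.tanh(o * mv) for o, mv in zip(g_o, m)]
    return h, m


# With all-zero weights every sigmoid gate is 0.5 and the candidate is 0,
# so the memory is simply halved at each step.
zero_params = {name: ([[0.0]], [[0.0]]) for name in "ufoc"}
h_new, m_new = lstm_step(zero_params, [0.0], [2.0], [1.0])
```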
LSTMs can learn temporal dependencies. However, long term dependencies in long sequences are challenging to learn using LSTMs. Bahdanau et al. [bahdanau2014neural] propose using an attention mechanism to learn these long term dependencies.
II-D Attention Mechanism
An attention mechanism conditions a context vector $C$ on the target sequence $y$. This method is commonly used in neural translation of texts. Bahdanau et al. [bahdanau2014neural] argue the context vector $c_i$ depends on a sequence of annotations $(h_1, \ldots, h_{T_x})$, to which an encoder maps the input sequence. Each annotation, $h_i$, contains information on the whole input sequence, while focusing on regions surrounding the $i$-th word of the input sequence. The weighted sum of each annotation, $h_j$, is used to compute the context vector $c_i$ as follows:
$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j.$$
The weight, $\alpha_{ij}$, of each annotation $h_j$ is calculated through:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})},$$
where the alignment model, $e_{ij}$, is $a(s_{i-1}, h_j)$. The alignment model measures how well the input position, $j$, and the output at position, $i$, match using the RNN hidden state, $s_{i-1}$, and the $j$-th annotation, $h_j$, of the input sentence. Bahdanau et al. [bahdanau2014neural] use a feedforward neural network to parameterize the alignment model, $a$. The feedforward neural network is trained with all other components of the model. In addition, the alignment model calculates a soft alignment that can backpropagate the gradient of the cost function. The gradient of the cost function trains the alignment model and the whole translation model simultaneously [bahdanau2014neural].
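The weighting step of the attention mechanism can be sketched as follows: alignment scores are normalized with a softmax, and the context vector is the resulting weighted sum of the annotations. The alignment scores are taken as given here, since the feedforward alignment network itself is not reproduced; `attention_context` is an illustrative name.

```python
import math


def attention_context(scores, annotations):
    """Return (weights, context) for one output position.

    scores:      one alignment score per annotation
    annotations: one vector per input position
    """
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(e - shift) for e in scores]
    total = sum(exps)
    weights = [v / total for v in exps]
    dim = len(annotations[0])
    context = [sum(w * h[k] for w, h in zip(weights, annotations)) for k in range(dim)]
    return weights, context


annotations = [[1.0, 0.0], [0.0, 1.0]]
# A score gap of log(3) makes the second annotation three times as important.
weights, context = attention_context([0.0, math.log(3.0)], annotations)
```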
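II-E Squeeze and Excite Block
Hu et al. [hu2017squeeze] propose a Squeeze-and-Excitation block that acts as a computational unit for any transformation $F_{tr}: X \rightarrow U$, $X \in \mathbb{R}^{W' \times H' \times C'}$, $U \in \mathbb{R}^{W \times H \times C}$. The outputs of $F_{tr}$ are represented as $U = [u_1, u_2, \ldots, u_C]$ where
$$u_c = v_c * X = \sum_{s=1}^{C'} v_c^s * x^s.$$
The convolution operation is depicted by $*$, and the 2D spatial kernel is depicted by $v_c^s$. The single channel of $v_c$ acts on the corresponding channel of $X$. Hu et al. [hu2017squeeze] model the channel interdependencies to adjust the filter responses in two steps, squeeze and excitation.
The squeeze operation exploits the contextual information outside the local receptive field by using a global average pool to generate channel-wise statistics. The transformation output, $U$, is shrunk through spatial dimensions $W \times H$ to compute the channel-wise statistics, $z \in \mathbb{R}^C$. The $c$-th element of $z$ is calculated by:
$$z_c = F_{sq}(u_c) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i, j).$$
For temporal sequence data, the transformation output, $U$, is shrunk through the temporal dimension $T$ to compute the channel-wise statistics, $z \in \mathbb{R}^C$. The $c$-th element of $z$ is then calculated by:
$$z_c = F_{sq}(u_c) = \frac{1}{T} \sum_{t=1}^{T} u_c(t).$$
The aggregated information from the squeeze operation is followed by an excite operation, whose objective is to capture the channel-wise dependencies. To achieve this, a simple gating mechanism is applied with a sigmoid activation, as follows:
$$s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2 \delta(W_1 z)),$$
where $\delta$ is a ReLU activation function, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, and $r$ is the reduction ratio. $W_1$ and $W_2$ are used to limit model complexity and aid with generalization. $W_1$ are the parameters of a dimensionality-reduction layer and $W_2$ are the parameters of a dimensionality-increasing layer.
Finally, the output of the block is rescaled as follows:
$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c,$$
where $\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_C]$ and $F_{scale}(u_c, s_c)$ refers to channel-wise multiplication between the feature map $u_c$ and the scale $s_c$.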
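For temporal data, the three steps of the block (squeeze, excite, rescale) can be sketched as below. The example assumes a feature map stored as T frames of C channels, a bottleneck of C/r units with a ReLU, and a sigmoid gate per channel; biases are omitted and the function name `squeeze_excite` is illustrative rather than taken from the authors' code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def squeeze_excite(u, w1, w2):
    """Rescale each channel of a temporal feature map.

    u:  T frames, each a list of C channel values
    w1: dimensionality-reduction weights, shape (C/r) x C
    w2: dimensionality-increasing weights, shape C x (C/r)
    """
    n_steps, n_channels = len(u), len(u[0])
    # Squeeze: global average over the temporal dimension, one value per channel.
    z = [sum(frame[c] for frame in u) / n_steps for c in range(n_channels)]
    # Excite: bottleneck with ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, sum(a * b for a, b in zip(row, z))) for row in w1]
    s = [sigmoid(sum(a * b for a, b in zip(row, hidden))) for row in w2]
    # Scale: channel-wise multiplication of the feature map by the gate.
    scaled = [[s[c] * frame[c] for c in range(n_channels)] for frame in u]
    return scaled, s


feature_map = [[1.0, 2.0], [3.0, 4.0]]   # T = 2 time steps, C = 2 channels
w1 = [[0.0, 0.0]]                        # C/r = 1 bottleneck unit
w2 = [[0.0], [0.0]]
scaled, gate = squeeze_excite(feature_map, w1, w2)
```

With untrained (zero) weights every gate equals 0.5, so each channel is simply halved; trained weights would instead emphasize informative channels and suppress the rest.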
III Multivariate LSTM Fully Convolutional Network
III-A Network Architecture
LSTM-FCN and ALSTM-FCN have been successful in classifying univariate time series [karim2017lstm]. The models we propose, Multivariate LSTM-FCN (MLSTM-FCN) and Multivariate ALSTM-FCN (MALSTM-FCN) augment LSTM-FCN and ALSTM-FCN respectively.
Similar to LSTM-FCN and ALSTM-FCN, the proposed models comprise a fully convolutional block and an LSTM block, as depicted in Fig. 1. The fully convolutional block contains three temporal convolutional blocks, used as a feature extractor. Each convolutional block contains a convolutional layer with 128 or 256 filters, and is succeeded by a batch normalization layer with a momentum of 0.99 and an epsilon of 0.001. The batch normalization layer is succeeded by the ReLU activation. In addition, the first two convolutional blocks conclude with a squeeze and excite block, which sets the proposed models apart from LSTM-FCN and ALSTM-FCN. Fig. 2 summarizes how the squeeze and excite block is computed in our architecture. The added squeeze and excite block enhances the performance of the LSTM-FCN and ALSTM-FCN models. The final temporal convolutional block is followed by a global average pooling layer.
On the other hand, the multivariate time series input is passed through a dimension shuffle layer, explained further in section III-B, followed by the LSTM block. The LSTM block is identical to the block from the LSTM-FCN or ALSTM-FCN models [karim2017lstm], comprising either an LSTM layer or an Attention LSTM layer, which is followed by a dropout layer. Since the datasets are padded with zeros at the end to make their size consistent, we use a mask prior to the LSTM or Attention LSTM layer to skip time steps for which we have no information.
III-B Network Input
Depending on the dataset, the inputs to the fully convolutional block and the LSTM block vary. The input to the fully convolutional block is a multivariate time series with N time steps having M distinct variables per time step. If there is a time series with M variables and N time steps, the fully convolutional block will receive the data as such.
On the other hand, the input to the LSTM can vary depending on the application of dimension shuffle. The dimension shuffle transposes the temporal dimension of the input data. If the input to a LSTM does not go through the dimension shuffle, the LSTM will require N time steps to process M variables at each timestep. However, if the dimension shuffle is applied, the LSTM will require M time steps to process N variables. In other words, the dimension shuffle improves the efficiency of the model when the number of variables M is less than the number of time steps N. In the proposed model, the dimension shuffle operation is only applied when the number of time steps, N, is greater than the number of variables, M.
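The dimension shuffle amounts to a conditional transpose. The sketch below represents one series as a list of N time steps with M values each; the helper names `dimension_shuffle` and `lstm_input` are hypothetical.

```python
def dimension_shuffle(series):
    """Transpose a series from N x M (time steps x variables) to M x N."""
    return [list(column) for column in zip(*series)]


def lstm_input(series):
    """Apply the shuffle only when there are more time steps than variables."""
    n_steps, n_vars = len(series), len(series[0])
    return dimension_shuffle(series) if n_steps > n_vars else series


long_series = [[1, 10], [2, 20], [3, 30], [4, 40]]   # N = 4 time steps, M = 2 variables
short_series = [[1, 2, 3], [4, 5, 6]]                # N = 2 time steps, M = 3 variables
shuffled = lstm_input(long_series)
unchanged = lstm_input(short_series)
```

After the shuffle, the LSTM processes the long series in 2 steps of 4 values each instead of 4 steps of 2 values, which is where the efficiency gain for long sequences comes from.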
Training on all datasets takes a total of 13 hours for MLSTM-FCN and a total of 18 hours for MALSTM-FCN on a single GTX 1080 Ti GPU. While the time required to train these models is significant, it is to be noted that their inference time is comparable with other standard models.
MLSTM-FCN and MALSTM-FCN have been tested on 35 datasets, further explained in section IV-B1. The optimal number of LSTM cells for each dataset was found via grid search over a range of 8 to 128 cells. The FCN block comprises 3 blocks of 128-256-128 filters, so as to be comparable with the original models. During the training phase, we set the total number of training epochs to 250 unless explicitly stated, and the dropout rate to 80% to mitigate overfitting. The convolution kernels are initialized with the initialization proposed by He et al. [he2015delving]. In most experiments, the proposed models are trained using a batch size of 128. For datasets with class imbalance, a class weighting scheme inspired by King et al. is utilized [king2001logistic].
We use the Adam optimizer [kingma2014adam] to train all models, with an initial learning rate of 1e-3 and a final learning rate of 1e-4. The datasets were normalized and preprocessed to have zero mean and unit variance. We then pad variable length time series with zeros at the end so as to obtain a time series dataset with a constant length N, where N is the maximum length of the time series. In addition, after every 100 epochs, the learning rate is reduced by a factor of $1/\sqrt[3]{2}$. We use the Keras [chollet2015keras] library with the Tensorflow backend [tensorflow2015-whitepaper] to train the proposed models.
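The normalization and padding steps can be sketched as follows. The text does not say whether statistics are computed per variable or over the whole dataset, so this example assumes per-variable statistics over all frames; it also returns a boolean mask marking real time steps, in the spirit of the masking applied before the LSTM. The name `zscore_and_pad` is hypothetical.

```python
import math


def zscore_and_pad(dataset):
    """Normalize each variable, then zero-pad every series to the maximum length.

    dataset: list of series, each a list of frames with the same number of variables.
    Returns (padded, masks), where masks[i][t] is False for padded time steps.
    """
    n_vars = len(dataset[0][0])
    frames = [frame for series in dataset for frame in series]
    mean = [sum(f[v] for f in frames) / len(frames) for v in range(n_vars)]
    std = [
        # Fall back to 1.0 for constant variables to avoid division by zero.
        math.sqrt(sum((f[v] - mean[v]) ** 2 for f in frames) / len(frames)) or 1.0
        for v in range(n_vars)
    ]
    max_len = max(len(series) for series in dataset)
    padded, masks = [], []
    for series in dataset:
        norm = [[(f[v] - mean[v]) / std[v] for v in range(n_vars)] for f in series]
        n_pad = max_len - len(series)
        padded.append(norm + [[0.0] * n_vars for _ in range(n_pad)])
        masks.append([True] * len(series) + [False] * n_pad)
    return padded, masks


raw = [[[1.0], [3.0]], [[2.0]]]   # two univariate series of lengths 2 and 1
padded, masks = zscore_and_pad(raw)
```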
IV-A Evaluation Metrics
In this paper, various models, including the proposed models, are evaluated using accuracy, arithmetic rank, geometric rank, Wilcoxon signed rank test, and mean per class error. The arithmetic and geometric rank are the arithmetic and geometric mean of the ranks. The Wilcoxon signed rank test is a non-parametric statistical test whose null hypothesis is that the median rank of the compared models is the same. The alternative hypothesis of the Wilcoxon signed rank test is that the median rank of the compared models is not the same. Finally, the mean per class error (MPCE) is the mean of the per class error (PCE) over all $K$ datasets,
$$\text{PCE}_k = \frac{1 - \text{accuracy}_k}{\text{number of unique classes}_k}, \qquad \text{MPCE} = \frac{1}{K} \sum_{k=1}^{K} \text{PCE}_k.$$
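The rank and error metrics are simple to compute once per-dataset ranks and accuracies are available. The sketch below assumes that the per class error of a dataset is one minus the accuracy, divided by the number of classes in that dataset; the input values are made up for illustration and are not results from this paper.

```python
import math


def arithmetic_rank(ranks):
    """Arithmetic mean of a model's ranks across datasets."""
    return sum(ranks) / len(ranks)


def geometric_rank(ranks):
    """Geometric mean of a model's ranks across datasets."""
    return math.exp(sum(math.log(r) for r in ranks) / len(ranks))


def mean_per_class_error(accuracies, n_classes):
    """Mean over datasets of (1 - accuracy) / number of classes."""
    pce = [(1.0 - acc) / c for acc, c in zip(accuracies, n_classes)]
    return sum(pce) / len(pce)


ranks = [1, 2, 4]                                  # hypothetical ranks on three datasets
mpce = mean_per_class_error([0.9, 0.8], [2, 4])    # hypothetical accuracies and class counts
```

The geometric rank penalizes a single poor rank less than the arithmetic rank does, which is why the two are reported together.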
A total of 35 datasets are used to test the proposed models. Five of the 35 datasets are benchmark datasets used by Pei et al. [pei2017multivariate], where the training and testing sets are provided online. In addition, we test the proposed models on 20 benchmark datasets, most recently utilized by Schäfer and Leser [schafer2017multivariate]. The remaining datasets are from the UCI repository [Lichman:2013]. The datasets are summarized in Table I.
|Dataset||Num. of Classes||Num. of Variables||Max Training Length||Task||Train-Test Split||Source|
|Arabic Voice||88||39||91||Speaker Recognition||75-25 split||[hammami2010improved]|
|Cohn-Kanade AU-coded Expression (CK+)||7||136||71||Facial Expression Classification||10-fold||[van2012action]|
|MSR Action||20||570||100||Action Recognition||5 ppl in train; rest in test||[li2010action]|
|MSR Activity||16||570||337||Activity Recognition||5 ppl in train; rest in test||[wang2012mining]|
|ArabicDigits||10||13||93||Digit Recognition||75-25 split||[Lichman:2013]|
|AUSLAN||95||22||96||Sign Language Recognition||44-56 split||[Lichman:2013]|
|CharacterTrajectories||20||3||205||Handwriting Classification||10-90 split||[Lichman:2013]|
|CMU_MOCAP_S16||2||62||534||Action Recognition||50-50 split||[cmu]|
|DigitShape||4||2||97||Action Recognition||60-40 split||[subakan2014probabilistic]|
|ECG||2||2||147||ECG Classification||50-50 split||[bobski_world]|
|JapaneseVowels||9||12||26||Speech Recognition||42-58 split||[Lichman:2013]|
|KickvsPunch||2||62||761||Action Recognition||62-38 split||[cmu]|
|LIBRAS||15||2||45||Sign Language Recognition||38-62 split||[Lichman:2013]|
|LP1||4||6||15||Robot Failure Recognition||43-57 split||[Lichman:2013]|
|LP2||5||6||15||Robot Failure Recognition||36-64 split||[Lichman:2013]|
|LP3||4||6||15||Robot Failure Recognition||36-64 split||[Lichman:2013]|
|LP4||3||6||15||Robot Failure Recognition||36-64 split||[Lichman:2013]|
|LP5||5||6||15||Robot Failure Recognition||39-61 split||[Lichman:2013]|
|NetFlow||2||4||994||Action Recognition||60-40 split||[subakan2014probabilistic]|
|PenDigits||10||2||8||Digit Recognition||2-98 split||[Lichman:2013]|
|Shapes||3||2||97||Action Recognition||60-40 split||[subakan2014probabilistic]|
|Uwave||8||3||315||Gesture Recognition||20-80 split||[Lichman:2013]|
|Wafer||2||6||198||Manufacturing Classification||25-75 split||[bobski_world]|
|WalkVsRun||2||62||1918||Action Recognition||64-36 split||[cmu]|
|AREM||7||7||480||Activity Recognition||50-50 split||[Lichman:2013]|
|HAR||6||9||128||Activity Recognition||71-29 split||[Lichman:2013]|
|Daily Sport||19||45||125||Activity Recognition||50-50 split||[Lichman:2013]|
|Gesture Phase||5||18||214||Gesture Recognition||50-50 split||[Lichman:2013]|
|EEG||2||13||117||EEG Classification||50-50 split||[Lichman:2013]|
|EEG2||2||64||256||EEG Classification||20-80 split||[Lichman:2013]|
|HT Sensor||3||11||5396||Food Classification||50-50 split||[Lichman:2013]|
|Movement AAL||2||4||119||Movement Classification||50-50 split||[Lichman:2013]|
|Occupancy||2||5||3758||Occupancy Classification||35-65 split||[Lichman:2013]|
|Ozone||2||72||291||Weather Classification||50-50 split||[Lichman:2013]|
The proposed models were tested on twenty-five multivariate datasets. Each model was trained and tested on the same training and testing sets most recently used by Pei et al. [pei2017multivariate] and Schäfer and Leser [schafer2017multivariate]. These benchmark datasets come from varying fields, including the medical, speech recognition and motion recognition fields. Further details of each dataset are given in Table I.
Multivariate Datasets from UCI
The remaining 10 datasets of various classification tasks were obtained from the UCI repository. “HAR”, “EEG2”, and the “Occupancy” datasets have training and testing sets provided. All the remaining datasets are partitioned into training and testing sets with a split ratio of 50-50. Each of the datasets is normalized with a mean of 0 and a standard deviation of 1, and padded with zeros as required.
|Action 3d||71.72||75.42*||72.73||74.74||70.71 [DTW]|
|Daily Sport||99.65||99.65||99.63||99.72*||98.42 [DTW]|
|Gesture Phase||50.51||53.53*||52.53||53.05||40.91 [DTW]|
|HT Sensor||68.00||78.00||72.00||80.00*||72.00 [DTW]|
|Movement AAL||73.25||79.63*||70.06||78.34||65.61 [SVM-Poly]|
MLSTM-FCN and MALSTM-FCN are applied in three sets of experiments: benchmark datasets used by Pei et al., benchmark datasets used by Schäfer and Leser, and datasets found in a variety of repositories. We compare our performance with HULM [pei2017multivariate], HCRF [quattoni2007hidden], NL and FKL [jaakkola2000discriminative] only on the datasets used by Pei et al. The proposed models are also compared to the results of ARKernel [ARkernel], LPS [LPS], mv-ARF [mvARF], SMTS [SMTS], WEASEL+MUSE [schafer2017multivariate], and dUFS [UFS] on the benchmark datasets used by Schäfer and Leser. Additionally, we compare our performance with LSTM-FCN [karim2017lstm] and ALSTM-FCN [karim2017lstm]. Alongside these models, we also obtain baselines for these datasets by testing DTW, Random Forest, SVM with a linear kernel, and SVM with a 3rd degree polynomial kernel, and choose the highest score as the baseline.
IV-C1a Multivariate Datasets Used By Pei et al.
Table III compares the performance of various models with MLSTM-FCN and MALSTM-FCN. Two datasets, “Activity” and “Action 3d”, required a strided temporal convolution (stride 3 and stride 2 respectively) prior to the LSTM branch to reduce the amount of memory consumed when using the MALSTM-FCN model, because the models were otherwise too large to fit on a single GTX 1080 Ti GPU. Both the proposed models outperform the state of the art (SOTA) models on five out of the six datasets of this experiment. “Activity” is the only dataset where the proposed models did not outperform the SOTA model. We postulate that the low performance is due to the large stride of the convolution prior to the LSTM branch, which led to a loss of valuable information.
MLSTM-FCN and MALSTM-FCN have an average arithmetic rank of 2 and 1.4 respectively, and a geometric rank of 2.05 and 1.23 respectively. Fig. 3 depicts the superiority of the proposed models over existing models through a critical difference diagram of the average arithmetic ranks. The MPCE of both the proposed models is below 0.71 percent. In comparison, HULM, the prior SOTA on these datasets, has an MPCE of 1.03 percent.
IV-C1b Multivariate Datasets Used By Schäfer and Leser
MALSTM-FCN and MLSTM-FCN are compared to seven state of the art models, using results reported by their respective authors in their publications. The results are summarized in Table III. Over the 20 datasets, MLSTM-FCN and MALSTM-FCN far outperform all of the existing state of the art models. The average arithmetic rank of MLSTM-FCN is 2.25 and its average geometric rank is 1.79. The average arithmetic rank and average geometric rank of MALSTM-FCN are 3.15 and 2.47, respectively. The existing state of the art model, WEASEL+MUSE, has an average arithmetic rank of 4.15 and an average geometric rank of 2.98. Fig. 4, a critical difference diagram of the average arithmetic ranks, indicates the performance dominance of MLSTM-FCN and MALSTM-FCN over all other state of the art models. Furthermore, the MPCE of MLSTM-FCN and MALSTM-FCN are 0.016 and 0.018, respectively. WEASEL+MUSE has an MPCE of 0.016.
Multivariate Datasets from UCI
The performance of MLSTM-FCN and MALSTM-FCN on various multivariate datasets is summarized in Table IV. The colored cells represent the highest performing model on a particular dataset. Both the proposed models outperform the baseline models on all the datasets in this experiment. The arithmetic rank of MLSTM-FCN is 1.33 and the arithmetic rank of MALSTM-FCN is 1.42. In addition, the geometric rank of MLSTM-FCN and MALSTM-FCN is 1.26 and 1.33 respectively. MLSTM-FCN and MALSTM-FCN have an MPCE of 6.49 and 6.81, respectively. In comparison, DTW, a successful algorithm for multivariate time series classification, has an arithmetic rank, geometric rank, and MPCE of 5.5, 5.34, and 9.68 respectively. A visual comparison of the arithmetic ranks of each model is given in Fig. 5, which shows both the proposed models outperforming the remaining models by a large margin.
Further, we perform a Wilcoxon signed rank test to compare all models that were tested on all 35 datasets. We statistically conclude that the proposed models have a performance score higher than the remaining models, as the p-values are below 5 percent. However, the Wilcoxon signed rank test finds no statistically significant difference between the performance of MLSTM-FCN and MALSTM-FCN. Both MLSTM-FCN and MALSTM-FCN perform significantly better than LSTM-FCN and ALSTM-FCN. This indicates the squeeze and excite layer enhances performance significantly on multivariate time series classification through modelling the inter-dependencies between the channels.
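To illustrate how two models are compared, the sketch below computes the Wilcoxon signed rank statistic from paired per-dataset scores: zero differences are dropped, absolute differences are ranked with ties sharing their average rank, and the ranks of positive and negative differences are summed separately. The scores are hypothetical integers chosen for readability, and the p-value computation is omitted; in practice a statistics library such as SciPy (`scipy.stats.wilcoxon`) would normally be used.

```python
def wilcoxon_statistic(scores_a, scores_b):
    """Return (W+, W-): rank sums of positive and negative paired differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


model_a = [5, 7, 3, 9]   # hypothetical scores of one model on four datasets
model_b = [4, 9, 3, 6]   # hypothetical scores of another model on the same datasets
w_plus, w_minus = wilcoxon_statistic(model_a, model_b)
```

A large imbalance between the two rank sums is evidence against the null hypothesis that the two models perform the same.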
V Conclusion & Future Work
The two proposed models attain state of the art results on most of the datasets tested, 28 out of 35 datasets. Each of the proposed models requires minimal preprocessing and feature extraction. Furthermore, the addition of the squeeze and excite block improves the performance of LSTM-FCN and ALSTM-FCN significantly. We provide a comparison of our proposed models to other existing state of the art algorithms.
The proposed models will be beneficial in various multivariate time series classification tasks, such as activity recognition or action recognition. Because they are small and efficient, the proposed models can easily be deployed on real time systems and embedded systems. Further research is being done to better understand why the models with the squeeze and excite block do not match the performance of the general LSTM-FCN or ALSTM-FCN models on the dataset “Arabic Voice”.
- The code and weights of each model are available at https://github.com/houshd/MLSTM-FCN