Modeling Combinatorial Evolution in Time Series Prediction

Modeling Combinatorial Evolution in Time Series Prediction

Wenjie Hu Zhejiang University Yang Yang Zhejiang University Zilong You State Grid Lishui Power Supply Co. Ltd. Zongtao Liu Zhejiang University  and  Xiang Ren University of Southern California

Time series modeling aims to capture the intrinsic factors underpinning observed data and its evolution. However, most existing studies ignore the evolutionary relations among these factors, which are what cause the combinatorial evolution of a given time series. For example, personal interests are intrinsic factors hidden behind users’ observable online shopping behaviors; consequently, a precise item recommendation depends not only on discovering the item-interest relationship, but also on an understanding of how user interests shift over time. In this paper, we propose to represent complex and dynamic relations among intrinsic factors of time series data by means of an evolutionary state graph structure. Accordingly, we propose the Evolutionary Graph Recurrent Networks (EGRN) to learn representations of these factors, along with the given time series, using a graph neural network framework. The learned representations can then be applied to time series classification tasks. From our experimental results, based on six real-world datasets, it can be seen that our approach clearly outperforms ten state-of-the-art baseline methods (e.g. +5% in terms of accuracy, and +15% in terms of F1 on average). In addition, we demonstrate that due to the graph structure’s improved interpretability, our method is also able to explain the logical causes of the predicted events. Code is available at

Time series model, evolutionary graph, graph neural networks
copyright: acmlicensedjournalyear: 2019doi: 10.1145/1122445.1122456isbn: 978-1-4503-9999-9/18/06

1. Introduction

Discovering and understanding the intrinsic factors that cause data to evolve over time has been found to be critical for modeling time series data (e.g., stock prices, earthquake wave, automobile sensor data, etc.). For instance, earthquake wave is the observation of crustal movements, while different actions like running and walking will cause differences in observations of a fitness-tracking device. Moreover, in practice, we often observe the combinatorial evolution of data; that is, the observed time series being covered by the influence of multiple factors, and especially the relations among these factors. For example, an earthquake is the result of quick transitions from smooth movements in the Earth’s crust to intense ones, which cause a sudden release of energy in the Earth’s crust. Meanwhile, observing one’s online shopping logs, precise item recommendations rely on tracing and understanding the shift of that user’s personal interests.

One common method of modeling time series data is to use the latent states to represent the intrinsic factors behind each segment, where each state has, in most cases, an independent probability distribution over the possible observations. As for the relations between these latent states, traditional statistical methods (e.g., HMM (Yang and Jiang, 2014)) either consider only the Markov dependence of one latent state on the previous state, or ignore the relations entirely. Meanwhile, deep-neural-network based methods, especially RNNs (e.g., LSTM, GRU, and variants) (Hochreiter and Schmidhuber, 1997; Chung et al., 2015; Du et al., 2016), model the sequential combination of states by defining recurrent layers over time. However, in these cases, the structure is generalized indiscriminately thus can not model complex relations.

Recently, various graph (neural) networks (Battaglia et al., 2018) has been developed to support relational reasoning over graph-structured representations. Existing graph networks require an explicit graph as input. However, it is difficult to directly observe either the latent states or their relations in practice. Moreover, these methods can only take static graphs as input, whereas the relations among states might change over time. To the best of our knowledge, no existing studies have successfully captured the combinatorial evolution of time series with complex and dynamic relations among latent states.

Figure 1. An example of evolutionary state graph constructed by our method based on the Earthquakes dataset. (a) is an observed earthquake waves; (b) and (c) are the evolutionary state graphs on the positive sample (earthquakes occur) and negative samples respectively; (d) presents the learned distribution to generated observations corresponding to several typical latent states. The transitions from state 1 to state 3 and state 10 to state 46 indicate a sign of earthquake, which shall be incorporated into the prediction task effectively.

In this paper, we present a novel framework for modeling the combinatorial evolution of time series data, which we call Evolutionary Graph Recurrent Networks (EGRN). We are given a set of time series, each of which is composed of several segments. In our method, we define the states to indicate different distributions of segments, so that segments can be broken down into a combination of these states. For a generic segment of a time series, we build a directed graph to represent the transitions from states of current segment to those of the next segment. For example, the graph may show whether states indicating stable crustal movement still remain stable, or whether they have transmitted to other states indicating intense crustal movement within the specified time period. In particular, each vertex in the graph indicates a state, and each edge, along with its weight represents the transition probability between states. The graph evolves among the segments over time; therefore we refer to it as an Evolutionary State Graph. See the formal definition provided in Section 2.

After constructing the Evolutionary State Graph according to the observed time series, the question of how to quantitatively incorporate the graph’s structural information (dynamic relations between states over different time) into the time series classification becomes a challenging one. Inspired by studies of graph networks and graph embedding, at a generic time, we define a representation vector for the whole graph and that for each vertex. We then employ a graph propagation algorithm to learn and update the representation vectors. Eventually, the representation vector of the whole graph can be fed to the appropriate machine learning models for applications such as time series classification. See details in Section 3.

Figure 1 presents an application of our method to the prediction of previously recorded earthquakes according to the Richter scale111The original magnitude scale for measuring the strength ("size") of earthquake wave, developed by Charles F. Richter. Figure 1 (a) is one of the observed time series, from which our method generates and recognizes states to each segment. Figure 1 (b) and (c) show the evolutionary state graph built by our method that represents the state transitions in all positive samples (i.e. an earthquake occurs) and negative samples respectively. From the figure, we can see that the transition from state 15 to state 16 and from state 1 to state 25 in the negative samples are more obvious than that in positive samples. From Figure 1 (d), we can observe that all these states tend to generate lower and more stable Richter-scale values, indicating smooth crustal movement. Otherwise, these transitions hardly appear in the positive samples. In addition, state 14, state 13 and state 43 present more intensive crustal movement, while they are more likely to be reached the earthquake samples. Eventually, as a clear transition from “smooth states” (state 16, state 1) in the segment within time to “intensive states” (state 3, state 31, and state 43) in the segment within time is incorporated into the representation vector of the evolutionary state graph, our model successfully predicts a coming earthquake.

To further validate the effectiveness of the proposed Evolutionary Graph Recurrent Networks method, we construct experiments based on four public datasets and two real-world datasets from different domains. Experimental results demonstrate the superiority of our method over 10 state-of-the-art baseline methods on several classification tasks. Moreover, we visualize the evolutionary state graph extracted from the data by means of our method, which provides us with additional interpretability and insights in the classification scenario. See more experimental results in Section 4.

Accordingly, our contributions are as follows:

  • We propose the evolutionary state graph to represent the combinatorial evolution of time series.

  • We design and implement a novel EGRN method to quantitatively incorporate the dynamic structural information of the evolutionary state graph into the time series classification task.

  • We construct sufficient experiments to demonstrate advantages of our approach over 10 state-of-the-art baseline methods.

2. Evolutionary State Graph  for Time series

2.1. Overview

Most works of time series modeling only consider simple relations among latent states (e.g. Markov dependency), rather than modeling its combinatorial evolution, which is covered by multiple intrinsic factors, and especially the relations among these factors.

Problem definition. The task considered in this paper is to discover and model the combinatorial evolution behind the time series , and thus to reason the event of the time series classification. Formally, let be an observation sequence with segments in a time series data. Each is a segment which the length is and has physical meaning; for example, one day has 24 hours or an hour has 60 minutes. Each is a single- or multi-variate observation with variables, denoted as . is the event occurring under the observation sequence , and represents the event label where is the set of labels and is the specific label.

Figure 2. Overview of our proposed model. Segments are recognized into four states with different probabilities. The evolution between the two segments is represented as an evolutionary state graph, where each node denotes a state and edges are the product of anterior-posterior probability. The solid lines represent the passing messages and the dashed lines represent the feedback messages. Each node has a bar with different gray scale, which denotes the current hidden vectors of each node and influences each other through the edges. Their information is propagated via an LSTM-like structure.

Evolutionary State Graph. In this paper, we leverage graph structure, a nature way of modeling relations, to represent the combinatorial evolution. We define the state v to indicate a kind of factor influencing segments, such as running or walking (as discussed above) influence different segments of a fitness-tracking time series, and each segment can be generated by the combination of states with the probability . We define a time-sensitive graphical structure, evolutionary state graph, which is a sequence of weighted-directed graphs and evolves among the segments over time. In particular, each graph is formulated as to represent the transitions from states of segments to those of . Each vertex in the graph indicates a state , and each edge along with its weight represents the transition probability from to . Each vertex has a node vector (or node representation or node embedding) to encode the information of the state, which is denoted by .

Based on above definitions, we propose a novel model, Evolutionary Graph Recurrent Networks (EGRN), a GNN-based advanced deep framework that can capture information propagations along the edges, to model the combinatorial evolution of time series. As Figure 2 shows, given a time series, EGRN first recognizes the states of each segment, then constructs the evolutionary state graph, and finally, propagates the information and outputs the prediction.

2.2. State Discovery

For the state recognition of time series, there are many existing works (Lin et al., 2012; Rakthanmanon and Keogh, 2013; Senin and Malinchik, 2013; Bagnall et al., 2017) in past few years, which focus on discretizing the successive time series and form the independent segments into several representative patterns. In our work, we define state to uniformly represent these patterns, each of which indicates a factor influencing segments, such as running or walking (as discussed above) influence different segments of a fitness-tracking time series. Each segment can be generated by the combination of states with the probability .

Formally, we let denote the pattern of the state , and function denote the generation of segments with the state . Then, each segment can be considered as a sample generated from a linear combination of all states, which is formulated as:


where is the regularized weight of the segment ; is a hyper parameter to indicate the number of states. We can implement it by many existing clustering methods, such as K-means (Kanungo et al., 2002) and GMM (Bouttefroy et al., 2010). They estimate via Expectation-Maximization algorithm (Do and Batzoglou, 2008).

2.3. Evolutionary State Graph Construction

After discovering the states behind the time series, we then aim to represent transitions of these states among adjacent segments. To do this, we propose to construct the evolutionary state graph for modeling these relations. Each vertex denotes a particular state, with a node vector to encode the information of . For example, in our work, initially, and will be updated along with the graph evolves. We will introduce the details in Section 3. Each edge denotes the transition from the sender state to the receiver state . In addition, we define the weight of each edge as the production of anterior-posterior probability, which is formulated as:


For example, as Figure 2 (a) and (b) shows, the -th segment is recognized to the state 2 with the probability of 0.8, while the -th segment is recognized to state 3 with the probability of 0.7. Thereby the edge weight between node 2 and node 3 is their production (i.e., 0.56) from segment to .

Next, we model the combinatorial evolution of time series by a graph neural network on the constructed evolutionary state graph.

3. Time Series Classification with Evolutionary State Graph

In this section, we provide specific descriptions our proposed method, Evolutionary Graph Recurrent Networks (EGRN), which can model the combinatorial evolution of time series on the constructed evolutionary state graph and is applied to reasoning the event of time series classification. We first present the procedure of message passing on the evolutionary state graph, and then present its propagation. Finally, we present EGRN’s output and learning. We will introduce each component detailedly in the following chapters.

3.1. Message Passing

Recently, the message-passing neural network (MPNN)  (Gilmer et al., 2017) has unified various graph convolutional network and graph neural network approaches by analogy to message-passing in graphical models, which provides new ideas for representing the propagation of dynamic graphs. Inspired by that, we formulate the message passing of evolutionary state graph at the -th segment as


where is the node vector of state after message passing, which combines the messages from all in-degree nodes and out-degree nodes . Specifically, we assume that passes a message to via edge , and will pass a feedback message to . Just as shown in Figure 2 (b), for the node 2 from segment to , it receives messages from node 1 and itself and sends message to node 3, so that it will receive the feedback messages from itself and node 3. The node vector will be updated after receiving messages. In particular, we implement by a neural network that controls the updating weight of node vectors as follows:


where is the passing messages. and are parameters that can be estimated according to the downstream applications like time series classification.

From the perspective of all the nodes, we can reformulate the message passing as the recurrence of evolutionary state graph among the adjacent segments, which can be formulated as


The adjacency matrix represents the current in and out edges propagated at the -th segment, which contains in-degree matrix and out-degree matrix as Eq 3. is the weight vector of all states for segment . is the transpose operation and is matrix multiplication. denotes all the edge weights from segment to . Similarly, the feedback edge weights are denoted as . We observe that is actually the transposition of . Then, Eq 4 rewritten to Eq 5b denotes the node vector of all states after message passing.

3.2. Evolutionary State Graph Propagation

We next introduce the propagation of EGRN. A potential solution is to implement the gated graph sequence neural networks (GGSNN) (Li et al., 2016), which is taken as a baseline in our experiments (see details in Section 4). To quantify node vectors, which encode state’s information and can be propagated to all reachable nodes, GGSNN only takes node index into consideration and ignore other information. Here, we propose to use the distribution patterns represented by states as the node vectors, which encodes richer information and obtains better performance in our experiments. Formally, for each state obtained by GMM, we take the distribution patterns represented by state to initialize node vectors:


We can propagate and update node vectors by wrapping the evolutionary-state-graph operation in Eq 3 into a recurrent block that can be incorporated into many existing architectures. To do this, we define an EGRN block as


where is given in Eq 3 and “” denotes a gated connection (e.g., LSTM, GRU) (Hochreiter and Schmidhuber, 1997; Chung et al., 2015). The gated connection allows us to incorporate information from other nodes and from the previous timestamp to update vectors of each node. When there are few messages from other nodes , will be influenced more by previous . Otherwise, the messages from other nodes will influence more. By this way, each state’s information is propagated over time.

As shown in Figure 2(c), unlike the GRU-like structure taken in the GGSNN, we take a LSTM-like structure (Hochreiter and Schmidhuber, 1997) to model the propagation. The difference is that LSTM has a global representation to memorize all information of evolutionary state graph, which can capture more patterns and better control the propagation of states’ information. Formally, we have


where , and are forget gate, input gate and output gate respectively. is sigmoid activation function and is element-wise multiplication. The current node vectors are updated by receiving their own previous memory and the messages from other nodes under the global memory’s influence. Meanwhile, global memory is updated under the node vectors over time.

3.3. End-to-End Model Learning

To encode the information of the entire time series, we define the graph-level representation vector, or graph vector, as


where acts as an “end-to-end” model, which is relevant to the downstream application task. can be neural networks or other machine learning algorithms such as XGBoost (Chen and Guestrin, 2016). It takes the concatenation of and as input and outputs the graph vector for downstream application task. In the learning process, we use Adam optimization algorithm (Kingma and Ba, 2015) to minimize the cross-entropy loss to update the network parameters:


where is the training labels of the downstream application. EGRN  computes the gradients based upon the converged solution and runs the propagation to convergence. The procedure of states’ capturing and propagation is carried out step by step, that we first pre-train the GMM to find the states and construct evolutionary state graph, and then train evolutionary state graph propagation to model the combinatorial evolution of time series, the procedure of which is as shown in Algorithm 1.

1:time series data , real labels , state recognition
2:predicted label
3:while the parameters of EGRN  have not converged do
4:     sample a batch from , and
5:     for each segment  do
6:          construct the evolutionary state graph as Eq 5a
7:          message passing as Eq 5b
8:          propagate and update as Eq 8
9:     end for
10:      compute the graph-level vectors as Eq 9
12:end while
Algorithm 1 Time series classification on EGRN

Complete procedure of training EGRN is presented in Appendix.

4. Experiment

Earthquakes WormTwoClass DJIA30 WebTraffic INS MCE
Accuracy Precision Recall Precision Recall
NN-ED (Bagnall et al., 2017) 68.22 62.41 80.51 73.40 28.51 19.33 23.01 59.90 34.82 44.01
NN-DTW (Berndt and Clifford, 1994) 70.31 66.06 82.09 74.03 27.14 21.73 24.13 60.17 41.41 49.04
NN-CID (Batista et al., 2011) 69.41 69.56 82.97 74.26 52.65 10.25 17.05 57.12 40.86 47.55
FS (Rakthanmanon and Keogh, 2013) 74.66 70.58 72.84 73.89 31.66 16.73 21.84 54.34 43.54 48.34
TSF (Deng et al., 2013) 74.67 68.51 81.94 75.38 48.11 21.04 29.13 76.80 52.61 62.50
SAXVSM (Senin and Malinchik, 2013) 73.76 72.10 83.51 74.91 62.71 28.41 40.11 65.12 59.96 62.44
MC-DCNN (Zheng et al., 2014) 70.29 59.85 75.34 75.29 53.77 5.79 10.38 78.94 49.27 60.70
RNN (Hochreiter and Schmidhuber, 1997) 74.82(10) 63.46(50) 79.08(20) 73.41(20) 71.52 33.10 45.31(15) 64.37 53.55 58.46(10)
GGSNN (Li et al., 2016) 74.82 (20) 65.14 (5) 81.91 (5) 73.51 (10) 71.42 48.47 57.74 (10) 71.11 59.1 64.52 (5)
NLNN (Wang et al., 2018a) 75.54 71.40 79.79 73.69 75.32 48.48 58.95 72.35 48.32 58.03
EGRN * 76.26 (50) 71.64 (20) 83.51 (20) 74.22 (20) 78.72 48.21 59.80 (30) 71.35 65.30 68.19 (15)
EGRN 77.70 (50) 72.72 (20) 83.76 (20) 73.93 (20) 78.74 48.87 60.25 (30) 71.21 65.36 68.14 (15)
RNN (Hochreiter and Schmidhuber, 1997) 76.25(10) 53.25(5) 81.56(20) 73.74(20) 73.45 48.13 58.14(15) 80.33 58.10 67.42(10)
GGSNN (Li et al., 2016) 77.70 (20) 59.35 (5) 83.51 (50) 73.73 (10) 84.53 48.86 61.94 (10) 80.27 72.14 75.95 (10)
NLNN (Wang et al., 2018a) 76.26 62.33 83.87 75.17 83.73 50.00 62.61 81.49 72.77 76.88
EGRN * 79.14 (50) 64.90 (20) 83.62 (20) 75.79 (50) 86.33 50.40 63.64 (30) 80.91 73.72 77.15 (15)
EGRN 80.58 (50) 64.94 (20) 84.22 (20) 76.11 (50) 87.15 49.67 63.26 (30) 81.61 73.77 77.45 (15)
Table 1. Comparison of classification performance on public and real-world datasets (%). The bold indicates the best performance of all the methods, the underline indicates the second-best performance, and the italics indicate the local best performance of some (graph) neural networks’ methods. Parentheses indicate the number of states which have the best performance in Section 4.2.

4.1. Datasets

We employ six datasets to construct our experiments, including four public datasets and two real-world datasets. Two of the public datasets come from the UCR Suite222 while the remaining two come from Kaggle333 One of the real-world datasets is provided by China Telecom, the major mobile service provider, while the other is provided by the State Grid, the major electric power company.

Earthquakes. This dataset, which comes from UCR, spans from Dec 1st 1967 to 2003, and each data point is an average of an hourly reading on the Richter scale. The task is to predict whether a major event is about to occur based on the most recent readings. A major event is defined as any reading of over 5 on the Richter scale. In total, 368 negative and 93 positive cases are extracted from 86k hourly readings. We set 24 hours as a window and split the sequence with length of 512 into 21 segments.

WormTwoClass. This dataset, which comes from UCR, is used for time series classification tasks and was employed in (Bagnall et al., 2015). The task is to classify individual worms of either wild-type (109 cases) or mutant-type (149 cases). We set 60 observations as a window and split the sequence with length of 900 to 15 segments.

DJIA 30 Stock Time Series (DJIA30). This dataset, which comes from Kaggle, contains historical stock prices for 29 of 30 DJIA companies and spans from Jan. 1st 2006 to Jan. 1st 2018. We set a classification task of predicting whether there will be drastic mutation (variance greater than 1) in the next week (five trading days) based on the most recent readings in the past year (50 weeks). In total, we extract 10k negative cases and 3k positive cases from 310k daily readings.

Web Traffic Time Series Forecasting (WebTraffic). This dataset, which comes from Kaggle, is taken from Jul 1st 2015 up until Dec 31st 2016. Each data point is a number of daily views of a specific Wikipedia article. We set a classification task of predicting whether there will be rapid growth (curve slope greater than 1) in the next months (30 days) based on the most recent readings in the past year (12 months). In total, we extract 105k negative cases and 38k positive cases from 145k daily readings.

Information Networks Supervision (INS). This dataset is provided by China Telecom. It consists of around 242K network flow series, each of which describes the hourly in- and out-flow of different servers, spanning from Apr 1st 2017 to May 10th 2017. When an abnormal flow goes through the server ports, the alarm states will be recorded. Our goal is to use the daily network flow data from the previous 15 days to predict whether there will be an abnormal flow in the next day. In total, we identify 2K abnormal flow series and 240K normal ones.

Watt-hour Meter Clock Error (MCE). This dataset is provided by the State Grid in China. It consists of around 4 million clock error series, each of which describes the deviation time, compared with the standard time, and the communication delay of different watt-hour meters per week, The duration is from Feb. 2016 to Feb. 2018. When the deviation time exceeds 120s, the meter will be marked as abnormal. Our goal is to predict the potential abnormal watt-hour meters in the next month by utilizing clock data from the past 12 months. In total, we identify 0.5 million abnormal clock error sequences and 3.5 million normal ones.

We present the statistics of these datasets and implement details in the appendix.

4.2. Experiment on Time Series Classification

We then evaluate our proposed model in terms of its accuracy of predicting the correct labels. We compare our proposed model against the following ten baseline methods, which have proven to be competitive across a wide variety of classification tasks:

  • NN-ED, NN-DTW and NN-CID: Given a sample, these methods calculate its nearest neighbor in the training data and use the nearest neighbor’s label to classify the given sample. To quantify the distance between samples, these methods consider different metrics, which are, respectively, Euclidean Distance, Dynamic Time Warping (Berndt and Clifford, 1994) and Complexity Invariant Distance (Batista et al., 2011).

  • Fast Shapelets (FS): This is a fast shapelets algorithm that uses shapelets as features for classification (Rakthanmanon and Keogh, 2013).

  • Time Series Forest (TSF): This is a tree-ensemble method that derives features from the intervals of each series (Deng et al., 2013).

  • SAX-VSM: This is a dictionary method that derives features from the intervals of each series (Senin and Malinchik, 2013).

  • MC-DCNN This is a multi-channel deep convolutional neural network for time series classification proposed in (Zheng et al., 2014).

In addition to the above methods, we further consider the following (graph) neural network as baselines. is a model in which we use node vectors and raw segment vector as input in Eq 9, and only uses node vectors as input.

  • RNN: This is a common neural networks (Hochreiter and Schmidhuber, 1997), that uses the sequential assignment (and ) as input.

  • GGSNN: This is a gated graph neural network (Li et al., 2016) in which each is a one-hot vector of corresponding node. The GRU structure is used in propagation.

  • NLNN: This is a non-local neural networks (Wang et al., 2018a). Each node is a segment and connects to each other. The of each node is the segment vector . We use NLNN blocks to replace the RNN blocks and keep the same in output model.

  • EGRN: This is the proposed method that captures different states and uses their distribution information as node vector . denotes that the GRU structure is used in propagation.

Comparison results. Table 1 compares the classification results. For the public datasets, we use accuracy as a metric due to their relatively balanced positive/negative ratio; this is also used in (Bagnall et al., 2017). For the real-world datasets, we use precision, recall and F-measures () as metrics. We observe that all quantifying-distance methods based on nearest neighbors perform similarly, because they only capture the sequence structure and can perform poorly on some complex scenarios. The neural network approach (MC-DCNN) performs poorly on small-scale data, as it might be more suitable for processing large-scale data due to its model complexity. Fast shapelets (FS) are unstable, and performs poorly on the WormTwoClass and DJI30 datasets. Moreover, feature-extracted methods have relatively better performance on all datasets, such as TSF and the dictionary-method SAX-VSM. In particular, SAX-VSM gets the second best accuracy on WormTwoClass datasets.

The performance can be improved by modeling the combinatorial evolution of time series. The graph neural network can adopt a non-local approach, which outperforms the recurrent neural networks (RNN). We also note that the combination of the can perform better in most cases. GGSNN, which uses the one-hot annotations of the corresponding node as the node vector, outperforms many non-graph baselines. However, it is not as good as the latter two methods; this result may be due to the monotonous information expression. NLNN, which directly uses the original segment as a node vector and constructs a fully connected graph, has achieved a good performance on most of the datasets, achieving the second-best performance on the INS and MCE datasets. However, the graph structure of NLNN becomes very complex with the growth of segments, which increases the computational cost of the calculations. As expected, our method obtains the best performance compared to other baselines because the distribution patterns of states can provide more useful information relating to time series. Meanwhile, the LSTM-like structure is slightly better than GRU-like structure, due to the global memory’s influence. They both can be applied into the evolutionary state graph propagation. Due to the pre-selected number , computational complexity does not increase as the time windows increase. We can observe that the number of nodes is not a fixed parameter and is sensitive to specific methods and datasets. To show how the hyper-parameter influences our model, more analyses are conducted in the following part .

Figure 3. The impact of the different state numbers . (a)-(f) present comparison on the public and real-world datasets. Red lines represent the classification performance (accuracy, ) over the different , and gray lines present the silhouette scores of corresponding state assignments.

How does the state number influence performance? We conduct some experiments on these six real-world datasets, to validate the impact of state number . We test with the value from 5 to 100 with interval 5. We use accuracy or F1-score as metrics to compare this parameter across the datasets, and also use the silhouette score to evaluate the current quality of state assignment from the GMM.

As shown in Figure 3, the curves of classification performance and silhouette score are relatively consistent, illustrating that the state number is sensitive to the segment’s own patterns. The performance varies on the different datasets and is not bound to improve as the state number increases. The peaks of the state number are different; moreover, the performance will be worsen when the state number is too large for all datasets, which may be due to the over-sized feature space or the insufficient data volume. We conclude that is an empirically determined parameter that can be set by evaluating the quality of state assignment.

5. Case Studies

Figure 4. Another example of the evolutionary state graph trained for anomaly detection in China Telecoms’ dataset. The top-25% weighted edges are visualized. The hourly in- and out-flow are recorded by the port monitor. The red line indicates an anomaly in this day. We focus on the edges (25-12, 8-29, 22-13) and nodes (16, 18, 23). In this example computed by our model, the transitions (25-12, 8-29) are more likely to indicate anomaly.

In this section, we visualize two examples of the evolutionary state graph constructed by our approach. As shown in Figure 1 and Figure 4, we use 50 states in the Earthquakes dataset and 30 states in the INS dataset, which exhibit the best performance in Section 4.2. We visualize the evolutionary state graphs built on positive (earthquake or abnormal flow) and negative (no earthquake or normal flow) samples separately. We note that our model can learn to find some meaningful relational clues no matter where they appears.

What kind of situations may earthquakes occur in? It is well-known that crustal movements before earthquakes should be more active than usual, and that Richter scale readings should be higher and more unstable. Our findings from the evolutionary state graph confirm this common-sense understanding. As shown in Figure 1, the graphs built by EGRN differ depending on whether they utilize positive or negative samples. The transitions of state 15 to state 16 and state 1 to state 25 in the negative samples are more obvious than in the positive samples, meanwhile, we can observe that their Richter scale reading are all lower and stable. These findings reveal that crustal movement transfers among the relatively quiet states, under which earthquakes are less likely to occur. Otherwise, these safety transitions appear only rarely in the positive samples. In addition, state 14, state 13 and state 43 present more active crustal movement, and we can observe that their weights connected with other states in positive samples are larger than that in negative samples. Specially, the state 46 and state 3, whose Richter scale readings are considerably enhanced compared to other states, only exhibit transitions with other states in the positive samples. There is no state in the negative samples associated with them.

As shown in Figure 1(a), we can observe a clear transition from “smooth states” (state 16, 1) in the segment within time to “intensive states” (state 3, 31, and 43) in the segment within time , which illustrates that the crustal activity is becoming more and more active. Its evolutionary state graph will be more like positive one, and EGRN  will predict that an earthquake is about to occur.

What will cause abnormal flow? Many phenomena can indicate anomalies in flow data of network devices, such as sudden drop of flow data, low indication and unbalanced flow of the import and export, etc. As shown in Figure 4, the evolutionary state graphs built by our methods differ between the positive and negative samples. The transitions of state 25 to state 12 and state 8 to state 29 in the positive samples are more obvious than the case in the negative samples, which reveals that the low indication and sudden drop will cause an exception. On the contrary, the smooth transition of state 13 to state 22 appears only in the negative samples. The difference between state 23 and state 18 is the unbalanced flow of the import and export, although all of them are in a high indication. State 23 is more likely to appear in the negative samples, but it is not true for state 18, which illustrates that the high but inconsistent in-flow and out-flow is also a cause of anomalies. Similarly, anomalies can also occur when the flows are always low, such as state 16. As shown in the upper case of the raw time series, the transition between state 25 and state 12 computed by our methods appears in the days , and occurs again in the future (the days ). This abnormal transition can be memorized by EGRN  and predicted when it happens again in the future.

Through analysis of the evolutionary state graph, we find several logical causes to explain events in the time series classification. As a result, we are aware what kind of states or transitions need to be earned, early, and why these warnings are necessary.

6. Related Work

Time series modeling. The modeling of Time series have broad applications in different domains, such as biology applications (e.g., the hormonal cycles (Chiazze et al., 1968)); human behavior recognition (e.g., circadian rhythms and cyclic variation (Pierson et al., 2018)); and anomaly detection (e.g., abnormal mutation (Chapfuwa et al., 2018). The different distance measurements have been mainly concentrated to model time series data, such as dynamic time warping (Lines and Bagnall, 2015), complexity-invariant distance (Batista et al., 2014) move–split–merge (Stefan et al., 2013), and elastic ensemble (Lines and Bagnall, 2015). Some methods pay attention to sequence-clustering by graph (Hallac et al., 2017), which aims to apply graph structure to represent different segments, rather than the transition between segments. It is different from our task.

Model-based methods fit a generative model to each sequence, then measure the similarity between the sequences via model’s parameters. The parametric approaches used include hidden Markov models (Yang and Jiang, 2014) and fitting auto-regressive models (Shokoohi-Yekta et al., 2015), which rely on the artificial knowledges. Recently, a lot of models based on neural networks have been proposed (Wang et al., 2018c; Binkowski et al., 2018), which have mostly been studied in high-level patterns representation. The main idea behind these methods is to model the fusion of multiple factors like time or space, etc. . For forming the frequency count of repeated patterns, some dictionary-based approaches have also been explored (Lin et al., 2012; Senin and Malinchik, 2013), which form frequency counts of the recurring patterns, then build classifiers based on the resulting histograms (Baydogan and Runger, 2016; Xu et al., 2018). Many of these works have formulated the problem as a discrete-time sequence prediction task and used Markov models. However, Markov models assume unit time steps and are further unable to capture long-range dependencies since the overall state-space will grow exponentially in the number of time steps considered (Yang and Jiang, 2014). Other works have used LSTM models (Hochreiter and Schmidhuber, 1997), which also assume discrete time steps and are limited in their interpretability.

Dynamic graph and graph neural networks. Models in the graph neural network family (Scarselli et al., 2009; Li et al., 2016; Raposo et al., 2017; Van Steenkiste et al., 2018) have been explored in a various range of domains, across supervised, unsupervised, semi-supervised, and reinforcement learning settings. Most works focus on the dynamic graph structure, which has rich relational representation and can be applied to many real scenarios. They have been applied to learning the dynamics of physical systems (Watters et al., 2017; Zhou et al., 2018), to predicting the chemical properties of molecules (Gilmer et al., 2017), to predicting traffic on roads(Cui et al., 2018), to reasoning about knowledge graphs (Hamaguchi et al., 2017). They have been used within both model-based (Sanchez et al., 2018) and model-free(Wang et al., 2018b) continuous control, for more classical approaches to planning(Toyer et al., 2018) and for model-free reinforcement learning (Hamrick et al., 2018). These studies provide a representative cross-section of the breadth of domains for which graph neural networks have proven useful. Recently, the message-passing neural network (MPNN) unified various graph convolutional network and graph neural network approaches by analogy to message-passing in graphical models (Gilmer et al., 2017). The non-local neural network (NLNN) has a similar vein, which unified various “self-attention”-style approaches by analogy to methods from graphical models and computer vision for capturing long range dependencies in signals (Wang et al., 2018a). Recently, several researchers from DeepMind, Google Brain and MIT summed up previous works in this domain, focusing on these two representative works (MPNN and NLNN), and systematically proposed the concept of graph network (Battaglia et al., 2018).

7. Conclusions

In this paper, we study the problem of combinatorial evolution of time series data. We propose a novel graph neural networks-based model that can capture the complex and dynamic relations among intrinsic factors of time series and learn effective representations for classification tasks. To validate the effectiveness of our proposed model, we conduct sufficient experiments on both the public and the real-world datasets. Experimental results show that our model clearly outperforms eleven state-of-the-art baseline methods. Meanwhile, based on two case studies, we find some meaningful relations among the states, which can reveal the logical causes of the predicted events.


  • (1)
  • Bagnall et al. (2017) Anthony J Bagnall, Jason Lines, Aaron Bostrom, James Large, and Eamonn J Keogh. 2017. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. SIGKDD 31, 3 (2017), 606–660.
  • Bagnall et al. (2015) Anthony J Bagnall, Jason Lines, Jon Hills, and Aaron Bostrom. 2015. Time-Series Classification with COTE: The Collective of Transformation-Based Ensembles. TKDE 27, 9 (2015), 2522–2535.
  • Batista et al. (2014) Gustavo E. Batista, Eamonn J. Keogh, Oben Moses Tataw, and Vinícius M. Souza. 2014. CID: An efficient complexity-invariant distance for time series. SIGKDD (2014), 634–669.
  • Batista et al. (2011) Gustavo EAPA Batista, Xiaoyue Wang, and Eamonn J Keogh. 2011. A complexity-invariant distance measure for time series. ICDM (2011), 699–710.
  • Battaglia et al. (2018) Peter Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchezgonzalez, Vinicius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv: Learning (2018).
  • Baydogan and Runger (2016) Mustafa Gokce Baydogan and George Runger. 2016. Time series representation and similarity based on local autopatterns. DMKD (2016), 476–509.
  • Berndt and Clifford (1994) Donald J Berndt and James Clifford. 1994. Using dynamic time warping to find patterns in time series. SIGKDD (1994), 359–370.
  • Binkowski et al. (2018) Mikolaj Binkowski, Gautier Marti, and Philippe Donnat. 2018. Autoregressive Convolutional Neural Networks for Asynchronous Time Series. ICML (2018), 579–588.
  • Bouttefroy et al. (2010) Philippe Loic Marie Bouttefroy, Abdesselam Bouzerdoum, Son Lam Phung, and Azeddine Beghdadi. 2010. On the analysis of background subtraction techniques using Gaussian Mixture Models. ICASSP (2010), 4042–4045.
  • Chapfuwa et al. (2018) Paidamoyo Chapfuwa, Chenyang Tao, Courtney Page, Benjamin Goldstein, Chunyuan Li, Lawrence Carin, and Ricardo Henao. 2018. Adversarial Time-to-Event Modeling. ICML (2018), 734–743.
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. SIGKDD (2016), 785–794.
  • Chiazze et al. (1968) Leonard Chiazze, Franklin T Brayer, John J Macisco, Margaret P Parker, and Benedict J Duffy. 1968. The Length and Variability of the Human Menstrual Cycle. JAMA (1968), 377–380.
  • Chung et al. (2015) Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated Feedback Recurrent Neural Networks. ICML (2015), 2067–2075.
  • Cui et al. (2018) Zhiyong Cui, Kristian Henrickson, Ruimin Ke, and Yinhai Wang. 2018. High-Order Graph Convolutional Recurrent Neural Network: A Deep Learning Framework for Network-Scale Traffic Learning and Forecasting. arXiv: Learning (2018).
  • Deng et al. (2013) Houtao Deng, George Runger, Eugene Tuv, and Martyanov Vladimir. 2013. A time series forest for classification and feature extraction. Information Sciences (2013), 142–153.
  • Do and Batzoglou (2008) Chuong B Do and Serafim Batzoglou. 2008. What is the expectation maximization algorithm. Nature Biotechnology 26, 8 (2008), 897–899.
  • Du et al. (2016) Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. 2016. Recurrent Marked Temporal Point Processes:Embedding Event History to Vector. SIGKDD (2016), 1555–1564.
  • Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural Message Passing for Quantum Chemistry. ICML (2017), 1263–1272.
  • Hallac et al. (2017) David Hallac, Sagar Vare, Stephen P Boyd, and Jure Leskovec. 2017. Toeplitz Inverse Covariance-Based Clustering of Multivariate Time Series Data. SIGKDD (2017), 215–223.
  • Hamaguchi et al. (2017) Takuo Hamaguchi, Hidekazu Oiwa, Masashi Shimbo, and Yuji Matsumoto. 2017. Knowledge Transfer for Out-of-Knowledge-Base Entities : A Graph Neural Network Approach. IJCAI (2017), 1802–1808.
  • Hamrick et al. (2018) Jessica B Hamrick, Kelsey Allen, Victor Bapst, Tina Zhu, Kevin R Mckee, Joshua B Tenenbaum, and Peter Battaglia. 2018. Relational inductive bias for physical construction in humans and machines. arXiv: Learning (2018).
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation (1997), 1735–1780.
  • Kanungo et al. (2002) Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. 2002. An efficient k-means clustering algorithm: analysis and implementation. TPAMI 24, 7 (2002), 881–892.
  • Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. ICLR (2015).
  • Li et al. (2016) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S Zemel. 2016. Gated Graph Sequence Neural Networks. ICLR (2016).
  • Lin et al. (2012) Jessica Lin, Rohan Khade, and Yuan Li. 2012. Rotation-invariant similarity in time series using bag-of-patterns representation. IJIIS (2012), 287–315.
  • Lines and Bagnall (2015) Jason Lines and Anthony Bagnall. 2015. Time series classification with ensembles of elastic distance measures. SIGKDD (2015), 565–592.
  • Pierson et al. (2018) Emma Pierson, Tim Althoff, and Jure Leskovec. 2018. Modeling Individual Cyclic Variation in Human Behavior. WWW (2018), 107–116.
  • Rakthanmanon and Keogh (2013) Thanawin Rakthanmanon and Eamonn Keogh. 2013. Fast shapelets: A scalable algorithm for discovering time series shapelets. ICDM (2013), 668–676.
  • Raposo et al. (2017) David Raposo, Adam Santoro, David Barrett, Razvan Pascanu, Timothy Lillicrap, and Peter Battaglia. 2017. Discovering objects and their relations from entangled scene representations. ICLR (2017).
  • Sanchez et al. (2018) Alvaro Sanchez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Raia Hadsell, Martin A Riedmiller, and Peter Battaglia. 2018. Graph Networks as Learnable Physics Engines for Inference and Control. ICML (2018), 4467–4476.
  • Scarselli et al. (2009) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2009. Computational Capabilities of Graph Neural Networks. TNNLS 20, 1 (2009), 81–102.
  • Senin and Malinchik (2013) Pavel Senin and Sergey Malinchik. 2013. SAX-VSM: Interpretable Time Series Classification Using SAX and Vector Space Model. ICDM (2013), 1175–1180.
  • Shokoohi-Yekta et al. (2015) Mohammad Shokoohi-Yekta, Yanping Chen, Bilson Campana, Bing Hu, Jesin Zakaria, and Eamonn Keogh. 2015. Discovery of Meaningful Rules in Time Series. SIGKDD (2015), 1085–1094.
  • Stefan et al. (2013) Alexandra Stefan, Vassilis Athitsos, and Gautam Das. 2013. The Move-Split-Merge Metric for Time Series. TKDE (2013), 1425–1438.
  • Toyer et al. (2018) Sam Toyer, Felipe W Trevizan, Sylvie Thiebaux, and Lexing Xie. 2018. Action Schema Networks: Generalised Policies with Deep Learning. AAAI (2018), 6294–6301.
  • Van Steenkiste et al. (2018) Sjoerd Van Steenkiste, Michael Chang, Klaus Greff, and Jurgen Schmidhuber. 2018. Relational Neural Expectation Maximization: Unsupervised Discovery of Objects and their Interactions. ICLR (2018).
  • Wang et al. (2018c) Jingyuan Wang, Ze Wang, Jianfeng Li, and Junjie Wu. 2018c. Multilevel Wavelet Decomposition Network for Interpretable Time Series Analysis. SIGKDD (2018), 2437–2446.
  • Wang et al. (2018b) Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. 2018b. NerveNet: Learning Structured Policy with Graph Neural Networks. ICLR (2018).
  • Wang et al. (2018a) Xiaolong Wang, Ross B Girshick, Abhinav Gupta, and Kaiming He. 2018a. Non-Local Neural Networks. CVPR (2018).
  • Watters et al. (2017) Nicholas Watters, Daniel Zoran, Theophane Weber, Peter Battaglia, Razvan Pascanu, and Andrea Tacchetti. 2017. Visual Interaction Networks: Learning a Physics Simulator from Video. NIPS (2017), 4539–4547.
  • Xu et al. (2018) Haowen Xu, Wenxiao Chen, Nengwen Zhao, Zeyan Li, Jiahao Bu, Zhihan Li, Ying Liu, Youjian Zhao, Dan Pei, and Yang Feng. 2018. Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications. WWW (2018), 187–196.
  • Yang and Jiang (2014) Yun Yang and Jianmin Jiang. 2014. HMM-based hybrid meta-clustering ensemble for temporal data. KBS (2014), 299–310.
  • Zheng et al. (2014) Yi Zheng, Qi Liu, Enhong Chen, Yong Ge, and J Leon Zhao. 2014. Time series classification using multi-channels deep convolutional neural networks. WAIM (2014), 298–310.
  • Zhou et al. (2018) Lekui Zhou, Yang Yang, Xiang Ren, Fei Wu, and Yueting Zhuang. 2018. Dynamic Network Embedding by Modeling Triadic Closure Process. AAAI (2018), 571–578.

Appendix A Appendix

a.1. Pseudo Code

In order to illustrate the structure and procedure of EGRN, we present the complete pseudo code of Evolutionary Graph Recurrent Networks  on the classification of time series. Given the time series , labels and parameter , EGRN  first use GMM to capture different states. Then, we construct the evolutionary state graph and propagate the information of states. At last, the graph-level output is fed into an output model and we use back-propagation learning algorithm with the cross-entropy loss to train the whole networks. Algorithm 2 presents more details.

1:time series , real labels , state number
2:predicted label
3:procedure OuterAssignment() Capture the states and transitions
4:      compute the weight of states via GMM Cluster number is
5:      get the distribution patterns of state Get initial node vectors
6:     EgrnPredict()
7:end procedure
9:procedure EgrnPredict()
10:     while the parameters of EGRN  have not converged do
11:         sample a batch from , and
12:         for each segment  do Recurrence in time
13:               construct the evolutionary state graph as Eq 5a
14:               message passing as Eq 5b
15:               propagate and update as Eq 8
16:         end for
17:          compute graph-level vectors as Eq 9 Graph-level classification
18:          Back propagate the loss and train the whole EGRN
19:     end while
20:end procedure
Algorithm 2  The procedure of time series classification on Evolutionary Graph Recurrent Networks

a.2. Implementation Details

Classification. Time series from different sources have different formats, which the detailed statistics are as following:

Dataset #samples #time windows #time points #variables
Earthquakes 461 21 24 1
WormTwoClass 258 15 60 1
DJIA 30 Stock Time Series (DJIA30) 14,040 50 5 4
Web Traffic Time Series Forecasting (WebTraffic) 142,753 12 30 1
Information Networks Supervision (INS) 241,045 15 24 2
Watt-hour Meter Clock Error (MCE) 3,833,213 12 4 2
Table 2. Dataset statistics

For the different datasets, if there are clear train/test split, such as UCR datasets, we use them to make experiment. Otherwise, we split the train/test set by 0.8 at the time line, such that preceding windows’ series are used for training and the following ones are used for testing. We split 10% samples from train set as validation, which controls the procedure of training and avoids the overfitting.

Time series data divided by the fixed time window can be applied into graph structure, where the number of state is hyper parameters. We train our model on an 1-GPU machine and set the batch size as 5000. Specially, for the small-size datasets from UCR, we set 50 for a batch. We train our models for 100 iterations in total, starting with a learning rate of 0.01 and reducing it by a factor of 10 at every 20 iterations. In our experiment, we also note that the larger the volume of the data, the more the number of batches, and the fewer training epoch required for convergence. For example, MCE dataset is only trained for 30 epochs and can achieve convergence, which we train 100 epochs on Earthquakes dataset.

In addition to presenting the parameter settings of our model, we also present the parameter settings of other baseline methods. The details are shown in the following table:

Parameter Setting
NN-ED the number of neighbor (5)
NN-DTW the number of neighbor (5), wrap size (10)
NN-CID the number of neighbor (5)
FS the minimum length of shaplets (5), the depth of tree (2)
TSF the number of trees (500)
MC-DCNN the number of filter (8), filter size (5), pooling size (2)
LSTM hidden dimension (64), time major (True)
GGSNN the number of state (*), the dimension of global vector (64)
NLNN input frames (the number of time windows), 3D-conv (False)
Table 3. Parameter Setting on different methods

where the codes of baseline method in the first section are from, and others can be found in Github.

Reasoning. We visualize the evolutionary state graph constructed by our model to present the intrinsic relations of states behind the time series, some transitions of which will be highlighted for a certain event of classification. These transitions between states can tell us the logical cause behind the events. We extract each output of edges in propagation and aggregate their weights among all the segments to draw the graph structure, so that we can find these meaningful relations. The edge weight is the sum of all transitions and we show the top-25% edges. For the Figure 1 and Figure 4, we visualize the graphs of different labels arrived at the last time, which present several meaningful relations for classification.

Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description