Modeling Combinatorial Evolution in Time Series Prediction
Abstract.
Time series modeling aims to capture the intrinsic factors underpinning observed data and its evolution. However, most existing studies ignore the evolutionary relations among these factors, which are what cause the combinatorial evolution of a given time series. For example, personal interests are intrinsic factors hidden behind users’ observable online shopping behaviors; consequently, a precise item recommendation depends not only on discovering the item-interest relationship, but also on an understanding of how user interests shift over time. In this paper, we propose to represent complex and dynamic relations among the intrinsic factors of time series data by means of an evolutionary state graph structure. Accordingly, we propose the Evolutionary Graph Recurrent Networks (EGRN) to learn representations of these factors, along with the given time series, using a graph neural network framework. The learned representations can then be applied to time series classification tasks. Our experimental results on six real-world datasets show that our approach clearly outperforms ten state-of-the-art baseline methods (e.g., +5% in terms of accuracy, and +15% in terms of F1 on average). In addition, we demonstrate that, thanks to the graph structure’s improved interpretability, our method is also able to explain the logical causes of the predicted events. Code is available at https://github.com/VachelHU/ESGRN.
1. Introduction
Discovering and understanding the intrinsic factors that cause data to evolve over time has been found to be critical for modeling time series data (e.g., stock prices, earthquake waves, automobile sensor data, etc.). For instance, an earthquake wave is an observation of crustal movements, while different actions like running and walking will cause differences in the observations of a fitness-tracking device. Moreover, in practice, we often observe the combinatorial evolution of data; that is, the observed time series is shaped by the influence of multiple factors, and especially by the relations among these factors. For example, an earthquake is the result of quick transitions from smooth movements in the Earth’s crust to intense ones, which cause a sudden release of energy. Meanwhile, when observing a user’s online shopping logs, precise item recommendation relies on tracing and understanding the shift of that user’s personal interests.
One common method of modeling time series data is to use latent states to represent the intrinsic factors behind each segment, where each state has, in most cases, an independent probability distribution over the possible observations. As for the relations between these latent states, traditional statistical methods (e.g., HMM (Yang and Jiang, 2014)) either consider only the Markov dependence of one latent state on the previous state, or ignore the relations entirely. Meanwhile, deep-neural-network-based methods, especially RNNs (e.g., LSTM, GRU, and variants) (Hochreiter and Schmidhuber, 1997; Chung et al., 2015; Du et al., 2016), model the sequential combination of states by defining recurrent layers over time. However, in these cases, the structure is generalized indiscriminately and thus cannot model complex relations.
Recently, various graph (neural) networks (Battaglia et al., 2018) have been developed to support relational reasoning over graph-structured representations. Existing graph networks require an explicit graph as input. However, it is difficult to directly observe either the latent states or their relations in practice. Moreover, these methods can only take static graphs as input, whereas the relations among states may change over time. To the best of our knowledge, no existing studies have successfully captured the combinatorial evolution of time series with complex and dynamic relations among latent states.
In this paper, we present a novel framework for modeling the combinatorial evolution of time series data, which we call Evolutionary Graph Recurrent Networks (EGRN). We are given a set of time series, each of which is composed of several segments. In our method, we define states to indicate different distributions of segments, so that segments can be broken down into a combination of these states. For a generic segment of a time series, we build a directed graph to represent the transitions from the states of the current segment to those of the next segment. For example, the graph may show whether states indicating stable crustal movement remain stable, or whether they have transitioned to other states indicating intense crustal movement within the specified time period. In particular, each vertex in the graph indicates a state, and each edge, along with its weight, represents the transition probability between states. The graph evolves across the segments over time; we therefore refer to it as an Evolutionary State Graph. See the formal definition provided in Section 2.
After constructing the Evolutionary State Graph according to the observed time series, the question of how to quantitatively incorporate the graph’s structural information (dynamic relations between states over different time) into the time series classification becomes a challenging one. Inspired by studies of graph networks and graph embedding, at a generic time, we define a representation vector for the whole graph and that for each vertex. We then employ a graph propagation algorithm to learn and update the representation vectors. Eventually, the representation vector of the whole graph can be fed to the appropriate machine learning models for applications such as time series classification. See details in Section 3.
Figure 1 presents an application of our method to the prediction of previously recorded earthquakes according to the Richter scale (the original magnitude scale for measuring the strength (“size”) of earthquake waves, developed by Charles F. Richter). Figure 1 (a) shows one of the observed time series, from which our method generates and assigns states to each segment. Figure 1 (b) and (c) show the evolutionary state graphs built by our method, representing the state transitions in all positive samples (i.e., an earthquake occurs) and negative samples respectively. From the figure, we can see that the transitions from state 15 to state 16 and from state 1 to state 25 are more obvious in the negative samples than in the positive samples. From Figure 1 (d), we can observe that all these states tend to generate lower and more stable Richter-scale values, indicating smooth crustal movement; accordingly, these transitions hardly appear in the positive samples. In addition, state 14, state 13 and state 43 present more intense crustal movement, and they are more likely to be reached in the earthquake samples. Eventually, as a clear transition from “smooth states” (state 16, state 1) in earlier segments to “intensive states” (state 3, state 31, and state 43) in later segments is incorporated into the representation vector of the evolutionary state graph, our model successfully predicts a coming earthquake.
To further validate the effectiveness of the proposed Evolutionary Graph Recurrent Networks method, we conduct experiments based on four public datasets and two real-world datasets from different domains. Experimental results demonstrate the superiority of our method over 10 state-of-the-art baseline methods on several classification tasks. Moreover, we visualize the evolutionary state graph extracted from the data by means of our method, which provides us with additional interpretability and insights in the classification scenario. See more experimental results in Section 4.
Accordingly, our contributions are as follows:

We propose the evolutionary state graph to represent the combinatorial evolution of time series.

We design and implement a novel EGRN method to quantitatively incorporate the dynamic structural information of the evolutionary state graph into the time series classification task.

We conduct extensive experiments to demonstrate the advantages of our approach over 10 state-of-the-art baseline methods.
2. Evolutionary State Graph for Time series
2.1. Overview
Most works on time series modeling only consider simple relations among latent states (e.g., Markov dependency), rather than modeling the combinatorial evolution driven by multiple intrinsic factors and, especially, the relations among these factors.
Problem definition. The task considered in this paper is to discover and model the combinatorial evolution behind a time series, and thus to reason about the event underlying the time series for classification. Formally, let $X = \langle x_1, x_2, \dots, x_T \rangle$ be an observation sequence with $T$ segments in a time series. Each $x_t$ is a segment whose length $\tau$ has physical meaning; for example, one day has 24 hours or an hour has 60 minutes. Each observation $x_{t,i} \in \mathbb{R}^{D}$ is a single or multivariate observation with $D$ variables, so that $x_t = (x_{t,1}, \dots, x_{t,\tau})$. $y$ is the event occurring under the observation sequence $X$, and $y \in \mathcal{Y}$ represents the event label, where $\mathcal{Y}$ is the set of labels.
Evolutionary State Graph. In this paper, we leverage graph structure, a natural way of modeling relations, to represent the combinatorial evolution. We define the state $v$ to indicate a kind of factor influencing segments, in the way that running or walking (as discussed above) influence different segments of a fitness-tracking time series; each segment $x_t$ can be generated by a combination of states with probabilities $\mu_{t,v}$. We define a time-sensitive graphical structure, the evolutionary state graph, which is a sequence of weighted directed graphs $\langle G_1, \dots, G_T \rangle$ that evolves among the segments over time. In particular, each graph $G_t = (\mathcal{V}, \mathcal{E}_t)$ represents the transitions from the states of segment $x_{t-1}$ to those of $x_t$. Each vertex $v \in \mathcal{V}$ indicates a state, and each edge $(v \to v') \in \mathcal{E}_t$, along with its weight $m_{t,(v,v')}$, represents the transition probability from $v$ to $v'$. Each vertex has a node vector (also called a node representation or node embedding) to encode the information of the state, denoted by $h_v^t$.
Based on the above definitions, we propose a novel model, Evolutionary Graph Recurrent Networks (EGRN), a GNN-based deep framework that captures information propagation along the edges, to model the combinatorial evolution of time series. As Figure 2 shows, given a time series, EGRN first recognizes the states of each segment, then constructs the evolutionary state graph, and finally propagates the information and outputs the prediction.
2.2. State Discovery
For the state recognition of time series, there have been many works in the past few years (Lin et al., 2012; Rakthanmanon and Keogh, 2013; Senin and Malinchik, 2013; Bagnall et al., 2017) that focus on discretizing a continuous time series and grouping the resulting segments into several representative patterns. In our work, we define the state to uniformly represent these patterns, each of which indicates a factor influencing segments, in the way that running or walking (as discussed above) influence different segments of a fitness-tracking time series. Each segment can be generated by a combination of states with probabilities $\mu_{t,v}$.
Formally, we let $\theta_v$ denote the pattern of the state $v$, and the function $g(\theta_v)$ denote the generation of segments under the state $v$. Then, each segment can be considered as a sample generated from a linear combination of all states, which is formulated as:
(1) $x_t \sim \sum_{v=1}^{K} \mu_{t,v}\, g(\theta_v), \quad \sum_{v=1}^{K} \mu_{t,v} = 1$
where $\mu_{t,v}$ is the regularized weight of the segment $x_t$ on state $v$, and $K$ is a hyper-parameter indicating the number of states. We can implement this step with many existing clustering methods, such as K-means (Kanungo et al., 2002) and GMM (Bouttefroy et al., 2010), which estimate $\theta_v$ and $\mu_{t,v}$ via the Expectation-Maximization algorithm (Do and Batzoglou, 2008).
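To make this step concrete, the following minimal sketch fits a GMM over pooled segments and reads off the soft state assignments $\mu_{t,v}$; the data, shapes, and hyper-parameters are illustrative, and scikit-learn's GaussianMixture stands in for the GMM described above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical sketch of state discovery (Eq. 1): pool the segments, fit a
# K-component GMM, and read off mu[t, v] = P(state v | segment x_t).
rng = np.random.default_rng(0)
K, T, tau = 3, 200, 8                  # states, number of segments, segment length
segments = rng.normal(size=(T, tau))   # each row is one (univariate) segment

gmm = GaussianMixture(n_components=K, covariance_type="diag",
                      random_state=0).fit(segments)
mu = gmm.predict_proba(segments)       # (T, K): soft assignments, rows sum to 1
```

Each row of `mu` is the weight vector of one segment over the $K$ states, which is exactly the quantity reused when building the evolutionary state graph.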
2.3. Evolutionary State Graph Construction
After discovering the states behind the time series, we then aim to represent the transitions of these states among adjacent segments. To do this, we construct the evolutionary state graph to model these relations. Each vertex $v \in \mathcal{V}$ denotes a particular state, with a node vector $h_v^t$ to encode the information of $v$. In our work, $h_v^0$ is initialized with the state’s distribution pattern and is updated as the graph evolves; we will introduce the details in Section 3. Each edge $(v \to v') \in \mathcal{E}_t$ denotes the transition from the sender state $v$ to the receiver state $v'$. In addition, we define the weight of each edge as the product of the anterior and posterior recognition probabilities, which is formulated as:
(2) $m_{t,(v,v')} = P(v \mid x_{t-1}) \cdot P(v' \mid x_t) = \mu_{t-1,v}\, \mu_{t,v'}$
For example, as Figure 2 (a) and (b) show, the $(t-1)$-th segment is recognized as state 2 with probability 0.8, while the $t$-th segment is recognized as state 3 with probability 0.7. Thereby the weight of the edge from node 2 to node 3, from segment $x_{t-1}$ to $x_t$, is their product, i.e., $0.8 \times 0.7 = 0.56$.
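The edge weights of one graph $G_t$ can thus be computed as a single outer product of the two adjacent assignment vectors. The sketch below reproduces the running example with toy probability vectors (the unnamed remaining mass is distributed arbitrarily for illustration):

```python
import numpy as np

# Sketch of evolutionary-state-graph construction (Eq. 2): the weight of the
# edge from state v at segment t-1 to state v' at segment t is the product of
# the two recognition probabilities, i.e. M_t = outer(mu_{t-1}, mu_t).
def edge_weights(mu_prev, mu_curr):
    """mu_prev, mu_curr: state-probability vectors of two adjacent segments."""
    return np.outer(mu_prev, mu_curr)

# Toy numbers from the running example: segment t-1 is state 2 with p = 0.8,
# segment t is state 3 with p = 0.7 (states indexed from 1 in the text).
mu_prev = np.array([0.1, 0.8, 0.1])
mu_curr = np.array([0.2, 0.1, 0.7])
M = edge_weights(mu_prev, mu_curr)
print(round(M[1, 2], 2))  # weight of edge 2 -> 3: 0.8 * 0.7 = 0.56
```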
Next, we model the combinatorial evolution of time series by a graph neural network on the constructed evolutionary state graph.
3. Time Series Classification with Evolutionary State Graph
In this section, we provide a detailed description of our proposed method, Evolutionary Graph Recurrent Networks (EGRN), which models the combinatorial evolution of time series on the constructed evolutionary state graph and is applied to time series classification. We first present the procedure of message passing on the evolutionary state graph, then present its propagation, and finally present EGRN’s output and learning. We introduce each component in detail in the following subsections.
3.1. Message Passing
Recently, the message-passing neural network (MPNN) (Gilmer et al., 2017) has unified various graph convolutional network and graph neural network approaches by analogy to message passing in graphical models, which provides new ideas for representing the propagation of dynamic graphs. Inspired by this, we formulate the message passing of the evolutionary state graph at the $t$-th segment as
(3) $\hat{h}_v^t = \mathcal{M}\Big( \sum_{(v',v) \in \mathcal{E}_t} m_{t,(v',v)}\, h_{v'}^{t-1},\ \sum_{(v,v') \in \mathcal{E}_t} m_{t,(v,v')}\, h_{v'}^{t-1} \Big)$
where $\hat{h}_v^t$ is the node vector of state $v$ after message passing, which combines the messages from all in-degree nodes and out-degree nodes of $v$. Specifically, we assume that $v'$ passes a message to $v$ via the edge $(v' \to v)$, and $v$ will pass a feedback message back to $v'$. Just as shown in Figure 2 (b), for node 2 from segment $x_{t-1}$ to $x_t$, it receives messages from node 1 and itself and sends a message to node 3, so that it will receive feedback messages from itself and node 3. The node vector will be updated after receiving messages. In particular, we implement $\mathcal{M}$ by a neural network that controls the updating weight of node vectors as follows:
(4) $\hat{h}_v^t = \mathcal{M}(a_v^t) = W a_v^t + b$
where $a_v^t$ is the concatenation of the passing messages in Eq. 3, and $W$ and $b$ are parameters that can be estimated according to downstream applications such as time series classification.
From the perspective of all the nodes, we can reformulate the message passing as the recurrence of the evolutionary state graph among adjacent segments, which can be formulated as
(5a) $M_t = \mu_{t-1}\, \mu_t^{\top}$
(5b) $\hat{H}^t = \mathcal{M}\big(\big[\, M_t^{\top} H^{t-1},\ M_t H^{t-1} \,\big]\big) = \big[\, M_t^{\top} H^{t-1},\ M_t H^{t-1} \,\big]\, W + b$
The adjacency information propagated at the $t$-th segment contains the in-degree part $M_t^{\top}$ and the out-degree part $M_t$, as in Eq. 3. Here $\mu_t$ is the weight vector of all states for segment $x_t$, $\top$ is the transpose operation, and $[\cdot, \cdot]$ concatenates along the feature dimension. $M_t$ denotes all the edge weights from segment $x_{t-1}$ to $x_t$; the feedback edge weights are simply its transposition $M_t^{\top}$. Eq. 4, rewritten as Eq. 5b, gives the node vectors $\hat{H}^t$ of all states after message passing.
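A vectorized sketch of this matrix-form message passing, under assumed shapes ($\mu_t \in \mathbb{R}^K$, node vectors $H \in \mathbb{R}^{K \times d}$; the linear parameters `W`, `b` are illustrative placeholders):

```python
import numpy as np

# Numpy sketch of Eqs. 5a-5b: edge weights are the outer product of adjacent
# state-weight vectors; in-edges carry M^T-weighted messages, out-edges carry
# the M-weighted feedback messages; a linear layer merges both.
rng = np.random.default_rng(0)
K, d = 5, 4
mu_prev = rng.dirichlet(np.ones(K))       # state weights of segment t-1
mu_curr = rng.dirichlet(np.ones(K))       # state weights of segment t
H_prev = rng.normal(size=(K, d))          # node vectors after segment t-1

M = np.outer(mu_prev, mu_curr)            # (K, K) edge weights, Eq. 5a
msg_in = M.T @ H_prev                     # messages along in-edges
msg_out = M @ H_prev                      # feedback messages along out-edges
W, b = rng.normal(size=(2 * d, d)), np.zeros(d)
H_hat = np.concatenate([msg_in, msg_out], axis=1) @ W + b   # (K, d), Eq. 5b
```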
3.2. Evolutionary State Graph Propagation
We next introduce the propagation of EGRN. A potential solution is to implement the gated graph sequence neural networks (GGS-NN) (Li et al., 2016), which is taken as a baseline in our experiments (see details in Section 4). To quantify node vectors, which encode a state’s information and can be propagated to all reachable nodes, GGS-NN only takes the node index into consideration and ignores other information. Here, we propose instead to use the distribution patterns represented by the states as the node vectors, which encodes richer information and obtains better performance in our experiments. Formally, for each state $v$ obtained by the GMM, we take the distribution pattern $\theta_v$ represented by the state to initialize the node vectors:
(6) $h_v^0 = \theta_v$
We can propagate and update node vectors by wrapping the evolutionary-state-graph operation in Eq. 3 into a recurrent block that can be incorporated into many existing architectures. To do this, we define an EGRN block as
(7) $H^t = \hat{H}^t \boxplus H^{t-1}$
where $\hat{H}^t$ is given in Eq. 3 and “$\boxplus$” denotes a gated connection (e.g., LSTM, GRU) (Hochreiter and Schmidhuber, 1997; Chung et al., 2015). The gated connection allows us to incorporate information from other nodes and from the previous timestamp to update the vector of each node. When there are few messages from other nodes, $H^t$ is influenced more by the previous $H^{t-1}$; otherwise, the messages from other nodes have more influence. In this way, each state’s information is propagated over time.
As shown in Figure 2(c), unlike the GRU-like structure adopted in GGS-NN, we take an LSTM-like structure (Hochreiter and Schmidhuber, 1997) to model the propagation. The difference is that the LSTM has a global representation to memorize all information of the evolutionary state graph, which can capture more patterns and better control the propagation of the states’ information. Formally, we have
(8a) $f_t = \sigma\big( W_f\, [\hat{H}^t, H^{t-1}] + b_f \big)$
(8b) $i_t = \sigma\big( W_i\, [\hat{H}^t, H^{t-1}] + b_i \big)$
(8c) $o_t = \sigma\big( W_o\, [\hat{H}^t, H^{t-1}] + b_o \big)$
(8d) $\tilde{C}_t = \tanh\big( W_c\, [\hat{H}^t, H^{t-1}] + b_c \big)$
(8e) $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
(8f) $H^t = o_t \odot \tanh(C_t)$
where $f_t$, $i_t$ and $o_t$ are the forget gate, input gate and output gate, respectively; $\sigma$ is the sigmoid activation function and $\odot$ is element-wise multiplication. The current node vectors $H^t$ are updated by receiving their own previous memory and the messages from other nodes under the influence of the global memory $C_t$. Meanwhile, the global memory is updated by the node vectors over time.
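One step of this LSTM-like propagation can be sketched as follows. All parameter names and shapes below are illustrative placeholders, not the paper's released implementation; the aggregated messages `H_hat` play the role of the input and `C` is the gated memory.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sketch of one propagation step (Eqs. 8a-8f): forget/input/
# output gates combine the incoming messages H_hat with the previous node
# vectors H_prev, and the memory C carries state information over time.
def egrn_step(H_hat, H_prev, C_prev, params):
    Wf, Wi, Wo, Wc, bf, bi, bo, bc = params
    Z = np.concatenate([H_hat, H_prev], axis=1)
    f = sigmoid(Z @ Wf + bf)                    # forget gate (8a)
    i = sigmoid(Z @ Wi + bi)                    # input gate (8b)
    o = sigmoid(Z @ Wo + bo)                    # output gate (8c)
    C = f * C_prev + i * np.tanh(Z @ Wc + bc)   # memory update (8d-8e)
    H = o * np.tanh(C)                          # new node vectors (8f)
    return H, C

rng = np.random.default_rng(1)
K, d = 5, 4
params = tuple(rng.normal(scale=0.1, size=(2 * d, d)) for _ in range(4)) \
       + tuple(np.zeros(d) for _ in range(4))
H, C = egrn_step(rng.normal(size=(K, d)), rng.normal(size=(K, d)),
                 np.zeros((K, d)), params)
```

Iterating `egrn_step` over $t = 1, \dots, T$, with `H_hat` recomputed from each segment's edge weights, yields the full propagation.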
3.3. EndtoEnd Model Learning
To encode the information of the entire time series, we define the graph-level representation vector, or graph vector, as
(9) $u = \mathcal{O}\big( \big[\, H^t,\ \mu_t \,\big]_{t=1}^{T} \big)$
where $\mathcal{O}(\cdot)$ acts as an “end-to-end” output model, which is relevant to the downstream application task. $\mathcal{O}$ can be a neural network or another machine learning algorithm such as XGBoost (Chen and Guestrin, 2016). It takes the concatenation of $H^t$ and $\mu_t$ as input and outputs the graph vector $u$ for the downstream application task. In the learning process, we use the Adam optimization algorithm (Kingma and Ba, 2015) to minimize the cross-entropy loss to update the network parameters:
(10) $\mathcal{L} = -\sum_{n} \sum_{c \in \mathcal{Y}} y_{n,c}\, \log \hat{y}_{n,c}$
where $y_{n,c}$ are the training labels of the downstream application and $\hat{y}_{n,c}$ are the predicted probabilities. EGRN computes the gradients based upon the converged solution and runs the propagation to convergence. The procedure of state capturing and propagation is carried out step by step: we first pre-train the GMM to find the states and construct the evolutionary state graph, and then train the evolutionary-state-graph propagation to model the combinatorial evolution of time series, as shown in Algorithm 1.
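The output stage (Eqs. 9-10) can be sketched in a few lines, using an illustrative linear output model in place of $\mathcal{O}$ and synthetic final node vectors (all names and sizes here are assumptions for illustration):

```python
import numpy as np

# Minimal sketch of the output stage: concatenate the final node vectors with
# the last state-weight vector, map to class logits with a toy linear output
# model, and score the prediction with cross-entropy.
rng = np.random.default_rng(2)
K, d, n_classes = 5, 4, 2
H_final = rng.normal(size=(K, d))          # node vectors after segment T
mu_final = rng.dirichlet(np.ones(K))       # state weights of segment T

graph_vec = np.concatenate([H_final.ravel(), mu_final])   # graph-level vector
W_out = rng.normal(scale=0.1, size=(graph_vec.size, n_classes))
logits = graph_vec @ W_out
probs = np.exp(logits) / np.exp(logits).sum()             # softmax
y_true = 1
loss = -np.log(probs[y_true])                             # cross-entropy (Eq. 10)
```

In the two-stage procedure described above, only this stage and the propagation parameters are trained with Adam; the GMM is pre-trained and fixed.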
The complete procedure for training EGRN is presented in the Appendix.
4. Experiment
Table 1. Classification results (%) on the six datasets. The first four datasets are evaluated by accuracy; INS and MCE are evaluated by precision, recall, and F1. Numbers in parentheses denote the selected state number K. Methods marked with † take both node vectors and raw segment vectors as input to the output model (Eq. 9); unmarked neural models use node vectors only (see Section 4.2).

| Method | Earthquakes | WormTwoClass | DJIA30 | WebTraffic | INS Prec. | INS Rec. | INS F1 | MCE Prec. | MCE Rec. | MCE F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| NN-ED (Bagnall et al., 2017) | 68.22 | 62.41 | 80.51 | 73.40 | 28.51 | 19.33 | 23.01 | 59.90 | 34.82 | 44.01 |
| NN-DTW (Berndt and Clifford, 1994) | 70.31 | 66.06 | 82.09 | 74.03 | 27.14 | 21.73 | 24.13 | 60.17 | 41.41 | 49.04 |
| NN-CID (Batista et al., 2011) | 69.41 | 69.56 | 82.97 | 74.26 | 52.65 | 10.25 | 17.05 | 57.12 | 40.86 | 47.55 |
| FS (Rakthanmanon and Keogh, 2013) | 74.66 | 70.58 | 72.84 | 73.89 | 31.66 | 16.73 | 21.84 | 54.34 | 43.54 | 48.34 |
| TSF (Deng et al., 2013) | 74.67 | 68.51 | 81.94 | 75.38 | 48.11 | 21.04 | 29.13 | 76.80 | 52.61 | 62.50 |
| SAX-VSM (Senin and Malinchik, 2013) | 73.76 | 72.10 | 83.51 | 74.91 | 62.71 | 28.41 | 40.11 | 65.12 | 59.96 | 62.44 |
| MC-DCNN (Zheng et al., 2014) | 70.29 | 59.85 | 75.34 | 75.29 | 53.77 | 5.79 | 10.38 | 78.94 | 49.27 | 60.70 |
| RNN (Hochreiter and Schmidhuber, 1997) | 74.82 (10) | 63.46 (50) | 79.08 (20) | 73.41 (20) | 71.52 | 33.10 | 45.31 (15) | 64.37 | 53.55 | 58.46 (10) |
| GGS-NN (Li et al., 2016) | 74.82 (20) | 65.14 (5) | 81.91 (5) | 73.51 (10) | 71.42 | 48.47 | 57.74 (10) | 71.11 | 59.10 | 64.52 (5) |
| NLNN (Wang et al., 2018a) | 75.54 | 71.40 | 79.79 | 73.69 | 75.32 | 48.48 | 58.95 | 72.35 | 48.32 | 58.03 |
| EGRN* | 76.26 (50) | 71.64 (20) | 83.51 (20) | 74.22 (20) | 78.72 | 48.21 | 59.80 (30) | 71.35 | 65.30 | 68.19 (15) |
| EGRN | 77.70 (50) | 72.72 (20) | 83.76 (20) | 73.93 (20) | 78.74 | 48.87 | 60.25 (30) | 71.21 | 65.36 | 68.14 (15) |
| RNN† (Hochreiter and Schmidhuber, 1997) | 76.25 (10) | 53.25 (5) | 81.56 (20) | 73.74 (20) | 73.45 | 48.13 | 58.14 (15) | 80.33 | 58.10 | 67.42 (10) |
| GGS-NN† (Li et al., 2016) | 77.70 (20) | 59.35 (5) | 83.51 (50) | 73.73 (10) | 84.53 | 48.86 | 61.94 (10) | 80.27 | 72.14 | 75.95 (10) |
| NLNN† (Wang et al., 2018a) | 76.26 | 62.33 | 83.87 | 75.17 | 83.73 | 50.00 | 62.61 | 81.49 | 72.77 | 76.88 |
| EGRN*† | 79.14 (50) | 64.90 (20) | 83.62 (20) | 75.79 (50) | 86.33 | 50.40 | 63.64 (30) | 80.91 | 73.72 | 77.15 (15) |
| EGRN† | 80.58 (50) | 64.94 (20) | 84.22 (20) | 76.11 (50) | 87.15 | 49.67 | 63.26 (30) | 81.61 | 73.77 | 77.45 (15) |
4.1. Datasets
We employ six datasets in our experiments, including four public datasets and two real-world datasets. Two of the public datasets come from the UCR Suite (http://www.timeseriesclassification.com), while the remaining two come from Kaggle (https://www.kaggle.com). One of the real-world datasets is provided by China Telecom, the major mobile service provider, while the other is provided by the State Grid, the major electric power company.
Earthquakes. This dataset, which comes from UCR, spans from Dec 1st 1967 to 2003; each data point is an averaged hourly reading on the Richter scale. The task is to predict whether a major event is about to occur based on the most recent readings. A major event is defined as any reading of over 5 on the Richter scale. In total, 368 negative and 93 positive cases are extracted from 86k hourly readings. We set 24 hours as a window and split each sequence of length 512 into 21 segments.
WormTwoClass. This dataset, which comes from UCR, is used for time series classification tasks and was employed in (Bagnall et al., 2015). The task is to classify individual worms as either wild-type (109 cases) or mutant-type (149 cases). We set 60 observations as a window and split each sequence of length 900 into 15 segments.
DJIA 30 Stock Time Series (DJIA30). This dataset, which comes from Kaggle, contains historical stock prices for 29 of 30 DJIA companies and spans from Jan. 1st 2006 to Jan. 1st 2018. We set a classification task of predicting whether there will be drastic mutation (variance greater than 1) in the next week (five trading days) based on the most recent readings in the past year (50 weeks). In total, we extract 10k negative cases and 3k positive cases from 310k daily readings.
Web Traffic Time Series Forecasting (WebTraffic). This dataset, which comes from Kaggle, is taken from Jul 1st 2015 up until Dec 31st 2016. Each data point is the number of daily views of a specific Wikipedia article. We set a classification task of predicting whether there will be rapid growth (curve slope greater than 1) in the next month (30 days) based on the most recent readings in the past year (12 months). In total, we extract 105k negative cases and 38k positive cases from 145k daily readings.
Information Networks Supervision (INS). This dataset is provided by China Telecom. It consists of around 242K network flow series, each of which describes the hourly in and outflow of different servers, spanning from Apr 1st 2017 to May 10th 2017. When an abnormal flow goes through the server ports, the alarm states will be recorded. Our goal is to use the daily network flow data from the previous 15 days to predict whether there will be an abnormal flow in the next day. In total, we identify 2K abnormal flow series and 240K normal ones.
Watt-hour Meter Clock Error (MCE). This dataset is provided by the State Grid in China. It consists of around 4 million clock error series, each of which describes the weekly deviation from the standard time and the communication delay of different watt-hour meters. The data spans from Feb. 2016 to Feb. 2018. When the deviation exceeds 120s, the meter is marked as abnormal. Our goal is to predict potentially abnormal watt-hour meters in the next month by utilizing clock data from the past 12 months. In total, we identify 0.5 million abnormal clock error sequences and 3.5 million normal ones.
We present the statistics of these datasets and implementation details in the appendix.
4.2. Experiment on Time Series Classification
We then evaluate our proposed model in terms of its accuracy of predicting the correct labels. We compare our proposed model against the following ten baseline methods, which have proven to be competitive across a wide variety of classification tasks:

NN-ED, NN-DTW and NN-CID: Given a sample, these methods calculate its nearest neighbor in the training data and use the nearest neighbor’s label to classify the given sample. To quantify the distance between samples, these methods consider different metrics, which are, respectively, Euclidean Distance, Dynamic Time Warping (Berndt and Clifford, 1994) and Complexity Invariant Distance (Batista et al., 2011).

Fast Shapelets (FS): This is a fast shapelets algorithm that uses shapelets as features for classification (Rakthanmanon and Keogh, 2013).

Time Series Forest (TSF): This is a treeensemble method that derives features from the intervals of each series (Deng et al., 2013).

SAX-VSM: This is a dictionary-based method that derives features from the SAX words of each series (Senin and Malinchik, 2013).

MC-DCNN: This is a multi-channel deep convolutional neural network for time series classification proposed in (Zheng et al., 2014).
In addition to the above methods, we further consider the following (graph) neural networks as baselines. A model marked with † uses both node vectors and the raw segment vector as input in Eq. 9, while an unmarked model uses only node vectors as input.

RNN: This is a common recurrent neural network (Hochreiter and Schmidhuber, 1997) that uses the sequential assignments ($\mu_t$ and $x_t$) as input.

GGS-NN: This is a gated graph neural network (Li et al., 2016) in which each node vector is a one-hot vector of the corresponding node. The GRU structure is used in propagation.

NLNN: This is a non-local neural network (Wang et al., 2018a). Each node is a segment and connects to every other node. The node vector of each node is the segment vector $x_t$. We use NLNN blocks to replace the RNN blocks and keep the output model the same.

EGRN: This is the proposed method, which captures different states and uses their distribution information as node vectors. EGRN* denotes the variant in which the GRU structure (rather than LSTM) is used in propagation.
Comparison results. Table 1 compares the classification results. For the public datasets, we use accuracy as the metric due to their relatively balanced positive/negative ratios; this is also used in (Bagnall et al., 2017). For the real-world datasets, we use precision, recall and F-measure (F1) as metrics. We observe that all distance-based methods using nearest neighbors perform similarly, because they only capture the sequence structure and can perform poorly in some complex scenarios. The neural network approach (MC-DCNN) performs poorly on small-scale data, as its model complexity makes it more suitable for processing large-scale data. Fast Shapelets (FS) is unstable, and performs poorly on the WormTwoClass and DJIA30 datasets. Moreover, feature-extraction methods such as TSF and the dictionary method SAX-VSM have relatively better performance on all datasets. In particular, SAX-VSM achieves the second-best accuracy on the WormTwoClass dataset.
The performance can be improved by modeling the combinatorial evolution of time series. The graph neural networks can adopt a non-local approach, which outperforms the recurrent neural network (RNN). We also note that combining node vectors with raw segments (†) performs better in most cases. GGS-NN, which uses the one-hot annotation of the corresponding node as the node vector, outperforms many non-graph baselines. However, it is not as good as the latter two methods; this result may be due to its monotonous information expression. NLNN, which directly uses the original segment as the node vector and constructs a fully connected graph, achieves good performance on most of the datasets, attaining the second-best performance on the INS and MCE datasets. However, the graph structure of NLNN becomes very complex as the number of segments grows, which increases the computational cost. As expected, our method obtains the best performance compared to the other baselines, because the distribution patterns of states provide more useful information about the time series. Meanwhile, the LSTM-like structure is slightly better than the GRU-like structure, due to the global memory’s influence; both can be applied to the evolutionary-state-graph propagation. Because the number of states is pre-selected, the computational complexity does not increase as the time window grows. We observe that the number of nodes is not a fixed parameter and is sensitive to the specific method and dataset. To show how this hyper-parameter influences our model, more analyses are conducted in the following part.
How does the state number influence performance? We conduct experiments on the six datasets to validate the impact of the state number $K$. We test $K$ with values from 5 to 100 at intervals of 5. We use accuracy or F1-score as the metric to compare this parameter across the datasets, and also use the silhouette score to evaluate the quality of the state assignment from the GMM.
As shown in Figure 3, the curves of classification performance and silhouette score are relatively consistent, illustrating that the appropriate state number depends on the segments’ own patterns. The performance varies across datasets and is not bound to improve as the state number increases. The peaks occur at different state numbers; moreover, performance worsens on all datasets when the state number is too large, which may be due to the oversized feature space or insufficient data volume. We conclude that $K$ is an empirically determined parameter that can be set by evaluating the quality of the state assignment.
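As a rough illustration of this selection procedure, the sketch below (synthetic data with an obvious two-cluster structure; all parameters illustrative) fits a GMM for several candidate values of K and keeps the silhouette-maximizing one:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Sketch of choosing the state number K via silhouette score: fit a GMM per
# candidate K on the segments and keep the best-scoring assignment.
rng = np.random.default_rng(0)
segments = np.concatenate([rng.normal(0, 1, size=(50, 8)),
                           rng.normal(5, 1, size=(50, 8))])  # two clear clusters

scores = {}
for K in (2, 3, 4):
    labels = GaussianMixture(n_components=K, random_state=0).fit_predict(segments)
    scores[K] = silhouette_score(segments, labels)

best_K = max(scores, key=scores.get)  # the two-cluster structure should win
```

On real data the curve is flatter and the peak shifts with the dataset, which matches the behavior reported in Figure 3.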
5. Case Studies
In this section, we visualize two examples of the evolutionary state graph constructed by our approach. As shown in Figure 1 and Figure 4, we use 50 states for the Earthquakes dataset and 30 states for the INS dataset, which exhibit the best performance in Section 4.2. We visualize the evolutionary state graphs built on positive (earthquake or abnormal flow) and negative (no earthquake or normal flow) samples separately. We note that our model can learn to find meaningful relational clues no matter where they appear.
In what kind of situations may earthquakes occur? It is well known that crustal movements before earthquakes are more active than usual, and that Richter scale readings are accordingly higher and more unstable. Our findings from the evolutionary state graph confirm this common-sense understanding. As shown in Figure 1, the graphs built by EGRN differ depending on whether they are built from positive or negative samples. The transitions from state 15 to state 16 and from state 1 to state 25 are more obvious in the negative samples than in the positive samples; meanwhile, we can observe that their Richter scale readings are all lower and stable. These findings reveal that crustal movement transfers among the relatively quiet states, under which earthquakes are less likely to occur. By contrast, these safe transitions appear only rarely in the positive samples. In addition, state 14, state 13 and state 43 present more active crustal movement, and we can observe that their edge weights to other states are larger in the positive samples than in the negative ones. Notably, state 46 and state 3, whose Richter scale readings are considerably higher than those of other states, exhibit transitions with other states only in the positive samples; no state in the negative samples is associated with them.
As shown in Figure 1(a), we can observe a clear transition from “smooth states” (states 16 and 1) in the earlier segments to “intensive states” (states 3, 31, and 43) in the later segments, which illustrates that the crustal activity is becoming more and more active. Its evolutionary state graph thus looks more like the positive one, and EGRN predicts that an earthquake is about to occur.
What causes abnormal flow? Many phenomena can indicate anomalies in the flow data of network devices, such as a sudden drop in flow, low readings, and unbalanced import and export flows. As shown in Figure 4, the evolutionary state graphs built by our method differ between the positive and negative samples. The transitions from state 25 to state 12 and from state 8 to state 29 are more pronounced in the positive samples than in the negative ones, which reveals that low readings and sudden drops can cause an exception. On the contrary, the smooth transition from state 13 to state 22 appears only in the negative samples. The difference between state 23 and state 18 lies in the unbalanced import and export flow, although both have high readings. State 23 is more likely to appear in the negative samples, but this is not true for state 18, which illustrates that high but inconsistent inflow and outflow is also a cause of anomalies. Similarly, anomalies can also occur when the flows are consistently low, as in state 16. As shown in the upper case of the raw time series, the transition between state 25 and state 12 computed by our method appears in an earlier period and occurs again later. This abnormal transition can be memorized by EGRN and predicted when it happens again in the future.
Through analysis of the evolutionary state graph, we find several logical causes that explain events in time series classification. As a result, we know which kinds of states or transitions warrant early warnings, and why these warnings are necessary.
6. Related Work
Time series modeling. Time series modeling has broad applications across domains, such as biology (e.g., hormonal cycles (Chiazze et al., 1968)), human behavior recognition (e.g., circadian rhythms and cyclic variation (Pierson et al., 2018)), and anomaly detection (e.g., abnormal mutation (Chapfuwa et al., 2018)). Much prior work models time series data with distance measures, such as dynamic time warping (Lines and Bagnall, 2015), complexity-invariant distance (Batista et al., 2014), move-split-merge (Stefan et al., 2013), and the elastic ensemble (Lines and Bagnall, 2015). Some methods focus on graph-based sequence clustering (Hallac et al., 2017), which uses a graph structure to represent different segments rather than the transitions between segments; this differs from our task.
Model-based methods fit a generative model to each sequence and then measure similarity between sequences via the models' parameters. Parametric approaches include hidden Markov models (Yang and Jiang, 2014) and fitted autoregressive models (Shokoohi-Yekta et al., 2015), which rely on hand-crafted knowledge. Recently, many neural network-based models have been proposed (Wang et al., 2018c; Binkowski et al., 2018), mostly aimed at representing high-level patterns; the main idea behind these methods is to model the fusion of multiple factors such as time and space. Dictionary-based approaches have also been explored (Lin et al., 2012; Senin and Malinchik, 2013), which form frequency counts of recurring patterns and then build classifiers on the resulting histograms (Baydogan and Runger, 2016; Xu et al., 2018). Many of these works formulate the problem as a discrete-time sequence prediction task and use Markov models. However, Markov models assume unit time steps and cannot capture long-range dependencies, since the overall state space grows exponentially in the number of time steps considered (Yang and Jiang, 2014). Other works use LSTM models (Hochreiter and Schmidhuber, 1997), which also assume discrete time steps and are limited in their interpretability.
Dynamic graph and graph neural networks. Models in the graph neural network family (Scarselli et al., 2009; Li et al., 2016; Raposo et al., 2017; Van Steenkiste et al., 2018) have been explored in a wide range of domains, across supervised, unsupervised, semi-supervised, and reinforcement learning settings. Most of these works focus on dynamic graph structures, which offer rich relational representations and can be applied to many real-world scenarios. They have been applied to learning the dynamics of physical systems (Watters et al., 2017; Zhou et al., 2018), predicting the chemical properties of molecules (Gilmer et al., 2017), predicting traffic on roads (Cui et al., 2018), and reasoning about knowledge graphs (Hamaguchi et al., 2017). They have been used for both model-based (Sanchez et al., 2018) and model-free (Wang et al., 2018b) continuous control, for more classical approaches to planning (Toyer et al., 2018), and for model-free reinforcement learning (Hamrick et al., 2018). These studies provide a representative cross-section of the breadth of domains in which graph neural networks have proven useful. The message-passing neural network (MPNN) unified various graph convolutional network and graph neural network approaches by analogy to message passing in graphical models (Gilmer et al., 2017). In a similar vein, the non-local neural network (NLNN) unified various "self-attention"-style approaches by analogy to methods from graphical models and computer vision for capturing long-range dependencies in signals (Wang et al., 2018a). Recently, researchers from DeepMind, Google Brain, and MIT summarized previous work in this domain, focusing on these two representative frameworks (MPNN and NLNN), and systematically proposed the concept of graph networks (Battaglia et al., 2018).
7. Conclusions
In this paper, we study the problem of combinatorial evolution in time series data. We propose a novel graph neural network-based model that captures the complex and dynamic relations among the intrinsic factors of a time series and learns effective representations for classification tasks. To validate the proposed model, we conduct extensive experiments on both public and real-world datasets. Experimental results show that our model clearly outperforms eleven state-of-the-art baseline methods. Moreover, through two case studies, we find meaningful relations among the states that reveal the logical causes of the predicted events.
References
 Bagnall et al. (2017) Anthony J Bagnall, Jason Lines, Aaron Bostrom, James Large, and Eamonn J Keogh. 2017. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. SIGKDD 31, 3 (2017), 606–660.
 Bagnall et al. (2015) Anthony J Bagnall, Jason Lines, Jon Hills, and Aaron Bostrom. 2015. Time-Series Classification with COTE: The Collective of Transformation-Based Ensembles. TKDE 27, 9 (2015), 2522–2535.
 Batista et al. (2014) Gustavo E. Batista, Eamonn J. Keogh, Oben Moses Tataw, and Vinícius M. Souza. 2014. CID: An efficient complexity-invariant distance for time series. SIGKDD (2014), 634–669.
 Batista et al. (2011) Gustavo EAPA Batista, Xiaoyue Wang, and Eamonn J Keogh. 2011. A complexity-invariant distance measure for time series. ICDM (2011), 699–710.
 Battaglia et al. (2018) Peter Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv: Learning (2018).
 Baydogan and Runger (2016) Mustafa Gokce Baydogan and George Runger. 2016. Time series representation and similarity based on local auto-patterns. DMKD (2016), 476–509.
 Berndt and Clifford (1994) Donald J Berndt and James Clifford. 1994. Using dynamic time warping to find patterns in time series. SIGKDD (1994), 359–370.
 Binkowski et al. (2018) Mikolaj Binkowski, Gautier Marti, and Philippe Donnat. 2018. Autoregressive Convolutional Neural Networks for Asynchronous Time Series. ICML (2018), 579–588.
 Bouttefroy et al. (2010) Philippe Loic Marie Bouttefroy, Abdesselam Bouzerdoum, Son Lam Phung, and Azeddine Beghdadi. 2010. On the analysis of background subtraction techniques using Gaussian Mixture Models. ICASSP (2010), 4042–4045.
 Chapfuwa et al. (2018) Paidamoyo Chapfuwa, Chenyang Tao, Courtney Page, Benjamin Goldstein, Chunyuan Li, Lawrence Carin, and Ricardo Henao. 2018. Adversarial Time-to-Event Modeling. ICML (2018), 734–743.
 Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. SIGKDD (2016), 785–794.
 Chiazze et al. (1968) Leonard Chiazze, Franklin T Brayer, John J Macisco, Margaret P Parker, and Benedict J Duffy. 1968. The Length and Variability of the Human Menstrual Cycle. JAMA (1968), 377–380.
 Chung et al. (2015) Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated Feedback Recurrent Neural Networks. ICML (2015), 2067–2075.
 Cui et al. (2018) Zhiyong Cui, Kristian Henrickson, Ruimin Ke, and Yinhai Wang. 2018. High-Order Graph Convolutional Recurrent Neural Network: A Deep Learning Framework for Network-Scale Traffic Learning and Forecasting. arXiv: Learning (2018).
 Deng et al. (2013) Houtao Deng, George Runger, Eugene Tuv, and Martyanov Vladimir. 2013. A time series forest for classification and feature extraction. Information Sciences (2013), 142–153.
 Do and Batzoglou (2008) Chuong B Do and Serafim Batzoglou. 2008. What is the expectation maximization algorithm. Nature Biotechnology 26, 8 (2008), 897–899.
 Du et al. (2016) Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. 2016. Recurrent Marked Temporal Point Processes: Embedding Event History to Vector. SIGKDD (2016), 1555–1564.
 Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural Message Passing for Quantum Chemistry. ICML (2017), 1263–1272.
 Hallac et al. (2017) David Hallac, Sagar Vare, Stephen P Boyd, and Jure Leskovec. 2017. Toeplitz Inverse CovarianceBased Clustering of Multivariate Time Series Data. SIGKDD (2017), 215–223.
 Hamaguchi et al. (2017) Takuo Hamaguchi, Hidekazu Oiwa, Masashi Shimbo, and Yuji Matsumoto. 2017. Knowledge Transfer for Out-of-Knowledge-Base Entities: A Graph Neural Network Approach. IJCAI (2017), 1802–1808.
 Hamrick et al. (2018) Jessica B Hamrick, Kelsey Allen, Victor Bapst, Tina Zhu, Kevin R Mckee, Joshua B Tenenbaum, and Peter Battaglia. 2018. Relational inductive bias for physical construction in humans and machines. arXiv: Learning (2018).
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation (1997), 1735–1780.
 Kanungo et al. (2002) Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. 2002. An efficient k-means clustering algorithm: analysis and implementation. TPAMI 24, 7 (2002), 881–892.
 Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. ICLR (2015).
 Li et al. (2016) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S Zemel. 2016. Gated Graph Sequence Neural Networks. ICLR (2016).
 Lin et al. (2012) Jessica Lin, Rohan Khade, and Yuan Li. 2012. Rotation-invariant similarity in time series using bag-of-patterns representation. IJIIS (2012), 287–315.
 Lines and Bagnall (2015) Jason Lines and Anthony Bagnall. 2015. Time series classification with ensembles of elastic distance measures. SIGKDD (2015), 565–592.
 Pierson et al. (2018) Emma Pierson, Tim Althoff, and Jure Leskovec. 2018. Modeling Individual Cyclic Variation in Human Behavior. WWW (2018), 107–116.
 Rakthanmanon and Keogh (2013) Thanawin Rakthanmanon and Eamonn Keogh. 2013. Fast shapelets: A scalable algorithm for discovering time series shapelets. ICDM (2013), 668–676.
 Raposo et al. (2017) David Raposo, Adam Santoro, David Barrett, Razvan Pascanu, Timothy Lillicrap, and Peter Battaglia. 2017. Discovering objects and their relations from entangled scene representations. ICLR (2017).
 Sanchez et al. (2018) Alvaro Sanchez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Raia Hadsell, Martin A Riedmiller, and Peter Battaglia. 2018. Graph Networks as Learnable Physics Engines for Inference and Control. ICML (2018), 4467–4476.
 Scarselli et al. (2009) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2009. Computational Capabilities of Graph Neural Networks. TNNLS 20, 1 (2009), 81–102.
 Senin and Malinchik (2013) Pavel Senin and Sergey Malinchik. 2013. SAX-VSM: Interpretable Time Series Classification Using SAX and Vector Space Model. ICDM (2013), 1175–1180.
 Shokoohi-Yekta et al. (2015) Mohammad Shokoohi-Yekta, Yanping Chen, Bilson Campana, Bing Hu, Jesin Zakaria, and Eamonn Keogh. 2015. Discovery of Meaningful Rules in Time Series. SIGKDD (2015), 1085–1094.
 Stefan et al. (2013) Alexandra Stefan, Vassilis Athitsos, and Gautam Das. 2013. The Move-Split-Merge Metric for Time Series. TKDE (2013), 1425–1438.
 Toyer et al. (2018) Sam Toyer, Felipe W Trevizan, Sylvie Thiebaux, and Lexing Xie. 2018. Action Schema Networks: Generalised Policies with Deep Learning. AAAI (2018), 6294–6301.
 Van Steenkiste et al. (2018) Sjoerd Van Steenkiste, Michael Chang, Klaus Greff, and Jurgen Schmidhuber. 2018. Relational Neural Expectation Maximization: Unsupervised Discovery of Objects and their Interactions. ICLR (2018).
 Wang et al. (2018c) Jingyuan Wang, Ze Wang, Jianfeng Li, and Junjie Wu. 2018c. Multilevel Wavelet Decomposition Network for Interpretable Time Series Analysis. SIGKDD (2018), 2437–2446.
 Wang et al. (2018b) Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. 2018b. NerveNet: Learning Structured Policy with Graph Neural Networks. ICLR (2018).
 Wang et al. (2018a) Xiaolong Wang, Ross B Girshick, Abhinav Gupta, and Kaiming He. 2018a. Non-Local Neural Networks. CVPR (2018).
 Watters et al. (2017) Nicholas Watters, Daniel Zoran, Theophane Weber, Peter Battaglia, Razvan Pascanu, and Andrea Tacchetti. 2017. Visual Interaction Networks: Learning a Physics Simulator from Video. NIPS (2017), 4539–4547.
 Xu et al. (2018) Haowen Xu, Wenxiao Chen, Nengwen Zhao, Zeyan Li, Jiahao Bu, Zhihan Li, Ying Liu, Youjian Zhao, Dan Pei, and Yang Feng. 2018. Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications. WWW (2018), 187–196.
 Yang and Jiang (2014) Yun Yang and Jianmin Jiang. 2014. HMM-based hybrid meta-clustering ensemble for temporal data. KBS (2014), 299–310.
 Zheng et al. (2014) Yi Zheng, Qi Liu, Enhong Chen, Yong Ge, and J Leon Zhao. 2014. Time series classification using multi-channels deep convolutional neural networks. WAIM (2014), 298–310.
 Zhou et al. (2018) Lekui Zhou, Yang Yang, Xiang Ren, Fei Wu, and Yueting Zhuang. 2018. Dynamic Network Embedding by Modeling Triadic Closure Process. AAAI (2018), 571–578.
Appendix A Appendix
A.1. Pseudo Code
To illustrate the structure and procedure of EGRN, we present the complete pseudo code of Evolutionary Graph Recurrent Networks for time series classification. Given the time series, labels, and parameters, EGRN first uses a GMM to capture the different states. We then construct the evolutionary state graph and propagate state information over it. Finally, the graph-level output is fed into an output model, and we train the whole network by back-propagation with the cross-entropy loss. Algorithm 2 presents more details.
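As a rough illustration of this three-stage procedure (state recognition, graph construction, propagation), the sketch below is ours, not the authors' code: a nearest-centroid assignment stands in for the full GMM, and a simple normalized message pass stands in for the learned recurrent propagation. Algorithm 2 in the paper remains the authoritative description.

```python
import math

def assign_states(segments, centroids):
    # State-recognition stand-in: assign each segment to the nearest
    # centroid (a full GMM would use posterior responsibilities instead).
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [min(range(len(centroids)), key=lambda k: dist(s, centroids[k]))
            for s in segments]

def build_evolutionary_state_graph(states, n_states):
    # Edge (i, j) accumulates transitions from state i to state j
    # between consecutive time windows.
    adj = [[0.0] * n_states for _ in range(n_states)]
    for a, b in zip(states, states[1:]):
        adj[a][b] += 1.0
    return adj

def propagate(adj, node_feats, steps=2):
    # Simplified propagation over the state graph: each node's feature
    # becomes the average of its old feature and the normalized,
    # transition-weighted sum of its incoming neighbours' features.
    n = len(adj)
    h = [list(f) for f in node_feats]
    for _ in range(steps):
        new_h = []
        for i in range(n):
            msg = [0.0] * len(h[0])
            total = sum(adj[j][i] for j in range(n)) or 1.0
            for j in range(n):
                w = adj[j][i] / total
                for d in range(len(h[0])):
                    msg[d] += w * h[j][d]
            new_h.append([(a + b) / 2 for a, b in zip(h[i], msg)])
        h = new_h
    return h
```

In the full model the update above would be a learned gated recurrent cell rather than a fixed average; this sketch only shows how information flows along the evolutionary state graph.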
A.2. Implementation Details
Classification. Time series from different sources have different formats; the detailed statistics are as follows:
Dataset  #samples  #time windows  #time points  #variables
Earthquakes  461  21  24  1
WormTwoClass  258  15  60  1
DJIA 30 Stock Time Series (DJIA30)  14,040  50  5  4
Web Traffic Time Series Forecasting (WebTraffic)  142,753  12  30  1
Information Networks Supervision (INS)  241,045  15  24  2
Watt-hour Meter Clock Error (MCE)  3,833,213  12  4  2
For each dataset, if a predefined train/test split exists (as with the UCR datasets), we use it in our experiments. Otherwise, we split the data into train/test sets at a ratio of 0.8 along the timeline, so that series from the preceding windows are used for training and those from the following windows for testing. We further hold out 10% of the training set for validation, which monitors the training procedure and avoids overfitting.
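The chronological split described above can be sketched as follows; this is an illustrative helper of ours (the function name and the choice to carve the validation set from the end of the training span are our assumptions, not the authors' code):

```python
def chronological_split(samples, train_ratio=0.8, val_ratio=0.1):
    """Split time-ordered samples into train/val/test without shuffling.

    `samples` must already be sorted by time; earlier windows go to
    training and later ones to testing, as described above.
    """
    n = len(samples)
    train_end = int(n * train_ratio)
    train = samples[:train_end]
    test = samples[train_end:]
    # Carve the validation set from the end of the training span,
    # keeping it chronologically closest to the test period.
    val_size = int(len(train) * val_ratio)
    val = train[len(train) - val_size:] if val_size else []
    train = train[:len(train) - val_size]
    return train, val, test
```

Avoiding a random shuffle here matters: with time series, shuffled splits leak future information into training and inflate test scores.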
Time series divided by a fixed time window can be fed into the graph structure, where the number of states is a hyper-parameter. We train our model on a single-GPU machine with a batch size of 5,000; for the small UCR datasets, we use a batch size of 50. We train our models for 100 iterations in total, starting with a learning rate of 0.01 and reducing it by a factor of 10 every 20 iterations. In our experiments, we also note that the larger the data volume, the more batches per epoch and the fewer epochs required for convergence. For example, the MCE dataset converges after only 30 epochs, whereas we train for 100 epochs on the Earthquakes dataset.
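The step schedule above (start at 0.01, divide by 10 every 20 iterations) can be written as a one-line helper; this is a minimal sketch of ours, not code from the paper:

```python
def step_lr(initial_lr, epoch, drop_every=20, factor=0.1):
    # Step schedule: multiply the initial learning rate by `factor`
    # once per completed `drop_every`-epoch interval.
    return initial_lr * (factor ** (epoch // drop_every))
```

Equivalent built-in schedulers exist in most frameworks (e.g., a step-decay scheduler in PyTorch or TensorFlow), so in practice one would use those rather than hand-rolling this.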
In addition to our model's parameter settings, we also report those of the baseline methods, as shown in the following table:
Method  Parameter Setting
NN-ED  the number of neighbors (5)
NN-DTW  the number of neighbors (5), warp size (10)
NN-CID  the number of neighbors (5)
FS  the minimum length of shapelets (5), the depth of tree (2)
TSF  the number of trees (500)
SAX-VSM
MC-DCNN  the number of filters (8), filter size (5), pooling size (2)
LSTM  hidden dimension (64), time major (True)
GGSNN  the number of states (*), the dimension of global vector (64)
NLNN  input frames (the number of time windows), 3D-conv (False)
The code for the baseline methods in the first group is from http://www.timeseriesclassification.com; code for the others can be found on GitHub.
Reasoning. We visualize the evolutionary state graph constructed by our model to present the intrinsic relations among states behind the time series; certain transitions are highlighted for a given classification event. These transitions between states reveal the logical causes behind the events. We extract each edge's output during propagation and aggregate its weights over all segments to draw the graph structure, so that we can find these meaningful relations. The weight of an edge is the sum of all its transitions, and we show the top-25% of edges. In Figure 1 and Figure 4, we visualize the graphs of the different labels at the last time step, which present several meaningful relations for classification.
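The top-25% edge filtering described above can be sketched as follows; the function name and the dict-of-edges representation are our assumptions for illustration:

```python
def top_quartile_edges(edge_weights):
    """Keep the heaviest 25% of edges for visualization.

    `edge_weights` maps (src_state, dst_state) -> transition weight
    aggregated (summed) over all segments, as described above.
    """
    ranked = sorted(edge_weights.items(), key=lambda kv: kv[1], reverse=True)
    keep = max(1, len(ranked) // 4)  # always keep at least one edge
    return dict(ranked[:keep])
```

Thresholding by rank rather than by an absolute weight keeps the visualization comparable across graphs whose total transition counts differ, e.g., between the positive- and negative-sample graphs.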