Predicting Path Failure in Time-Evolving Graphs
Abstract.
In this paper we use a time-evolving graph, which consists of a sequence of graph snapshots over time, to model many real-world networks. We study the path classification problem in a time-evolving graph, which has many applications in real-world scenarios, for example, predicting path failure in a telecommunication network and predicting path congestion in a traffic network in the near future. In order to capture the temporal dependency and graph structure dynamics, we design a novel deep neural network named Long Short-Term Memory RGCN (LRGCN). LRGCN considers temporal dependency between time-adjacent graph snapshots as a special relation with memory, and uses relational GCN to jointly process both intra-time and inter-time relations. We also propose a new path representation method named self-attentive path embedding (SAPE), to embed paths of arbitrary length into fixed-length vectors. Through experiments on a real-world telecommunication network and a traffic network in California, we demonstrate the superiority of LRGCN to other competing methods in path failure prediction, and prove the effectiveness of SAPE on path representation.
1. Introduction
Graphs have been widely used to model real-world entities and the relationships among them. For example, a telecommunication network can be modeled as a graph where a node corresponds to a switch and an edge represents an optical fiber link; a traffic network can be modeled as a graph where a node corresponds to a sensor station and an edge represents a road segment. In many real scenarios, the graph topological structure may evolve over time, e.g., link failures due to hardware outages, or road closures due to accidents or natural disasters. This leads to a time-evolving graph, which consists of a sequence of graph snapshots over time. In the literature, some studies on time-evolving graphs focus on the node classification task, e.g., (Aggarwal and Li, 2011) uses a random walk approach to combine structure and content for node classification, and (Güneş et al., 2014) improves the performance of node classification in time-evolving graphs by exploiting genetic algorithms. In this work, we focus on a more challenging but practically useful task: path classification in a time-evolving graph, which predicts the status of a path in the near future. A good solution to this problem can benefit many real-world applications, e.g., predicting path failure (or path congestion) in a telecommunication (or traffic) network so that preventive measures can be implemented promptly.
In our problem setting, besides the topological structure, we also consider signals collected on the graph nodes, e.g., traffic density and traveling speed recorded at each sensor station in a traffic network. The observed signals on one node over time form a time series. We incorporate both the time series observations and the evolving topological structure into our model for path classification. The complex temporal dependency and structure dynamics pose a huge challenge. For one thing, observations on nodes exhibit highly non-stationary properties such as seasonality or daily periodicity, e.g., morning and evening rush hours in a traffic network. For another, graph structure evolution can result in sudden and dramatic changes of observations on nodes, e.g., road closure due to accidents redirects traffic to alternative routes, causing increased traffic flow on those routes. To model the temporal dependency and structure dynamics, we design a new time-evolving neural network named Long Short-Term Memory RGCN (LRGCN). LRGCN considers node correlation within a graph snapshot as the intra-time relation, views temporal dependency between adjacent graph snapshots as the inter-time relation, and then utilizes Relational GCN (RGCN) (Schlichtkrull et al., 2018) to capture both temporal dependency and structure dynamics.
Another challenge we face is that paths are of arbitrary length. It is non-trivial to develop a uniform path representation that provides both good data interpretability and classification performance. Existing solutions such as (Li et al., 2017) rely on Recurrent Neural Networks (RNN) to derive a fixed-size representation, which, however, fails to provide meaningful interpretation of the learned path representation. In this work, we design a new path representation method named self-attentive path embedding (SAPE), which takes advantage of the self-attentive mechanism to explicitly highlight the important nodes on a path, thus providing good interpretability and benefiting downstream tasks such as path failure diagnosis.
Our contributions are summarized as follows.

We study path classification in a time-evolving graph, which, to the best of our knowledge, has not been studied before. Our proposed solution LRGCN achieves classification performance superior to state-of-the-art deep learning methods.

We design a novel self-attentive path embedding method called SAPE to embed paths of arbitrary length into fixed-length vectors, which are then used as a standard input format for classification. The embedding approach not only improves the classification performance, but also provides meaningful interpretation of the underlying data in two forms: (1) embedding vectors of paths, and (2) node importance in a path, learned through a self-attentive mechanism that differentiates the nodes' contributions in classifying a path.

We evaluate LRGCN-SAPE on two real-world data sets. In a telecommunication network of a real service session, we use LRGCN-SAPE to predict path failures and achieve a Macro-F1 score of 61.89%, outperforming competing methods by at least 5%. In a traffic network in California, we utilize LRGCN-SAPE to predict path congestions and achieve a Macro-F1 score of 86.74%, outperforming competing methods by at least 4%.
2. Problem Definition
We denote a set of nodes as $V = \{v_1, v_2, \dots, v_N\}$, which represent real-world entities, e.g., switches in a telecommunication network, sensor stations in a traffic network. At time $t$, we use an adjacency matrix $A_t \in \{0, 1\}^{N \times N}$ to describe the connections between nodes in $V$. $A_t(i, j)$ represents whether there is a directed edge from node $v_i$ to node $v_j$ or not, e.g., a link that bridges two switches, a road that connects two sensor stations. In this study, we focus on directed graphs, as many real-world networks, e.g., telecommunication networks and traffic networks, are directed; yet our methodology is also applicable to undirected graphs. We use $X_t \in \mathbb{R}^{N \times d}$ to denote the observations at each node at time $t$, where $X_t(i)$ is a $d$-dimensional vector describing the values of $d$ different signals recorded at node $v_i$ at time $t$, e.g., temperature, power and other signals of a switch.
We define the adjacency matrix $A_t$ and the observed signals $X_t$ on the nodes in $V$ as a graph snapshot $G_t$ at time $t$. A sequence of graph snapshots with adjacency matrices $(A_{t-M+1}, \dots, A_t)$ and the corresponding observations $(X_{t-M+1}, \dots, X_t)$ over $M$ time steps is defined as a time-evolving graph. Note that the graph structure can evolve over time, as some edges may become unavailable, e.g., link failure, road congestion/closure, and some new edges may become available over time. For one node $v_i$, the sequence of observations over time is a multivariate time series.
We denote a path as a sequence $P = (v_{p_1}, v_{p_2}, \dots, v_{p_m})$ of length $m$ in the time-evolving graph, where each node $v_{p_i} \in V$. For the same path, we use $X_t(P)$ to represent the observations of the path nodes at time $t$. In this paper we aim to predict if a given path is available or not in the future, e.g., a path failure in a telecommunication network, or a path congestion in a traffic network. Note that the availability of a path is service dependent, e.g., a path is defined as available in a telecommunication network if the transmission latency for a data packet to travel through the path is less than a predefined threshold. Thus the path availability cannot be simply regarded as the physical connectivity of the path, but is related to the "quality of service" of the path. To be more specific, for a given path at time $t$, we utilize the past $M$ time steps to predict the availability of this path in the next $F$ time steps. We formulate this prediction task as a classification problem, and our goal is to learn a function that can minimize the cross-entropy loss over the training set $\mathcal{D}$:

(1) $L = -\sum_{(P, y) \in \mathcal{D}} \sum_{c=1}^{C} y_c \log \hat{y}_c$

where $(P, y)$ is a training instance, $y$ is the training label representing the availability of this path in the next $F$ time steps, $\hat{y}_c$ is the predicted probability of class $c$, and $C$ is the number of classes. In our problem, we have $C = 2$, i.e., path availability and path failure.
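As a concrete illustration, the per-instance cross-entropy of Eq. 1 can be sketched in plain Python; the function name and the one-hot label encoding below are illustrative, not part of the paper:

```python
import math

def cross_entropy(y_true, probs):
    """Cross-entropy loss of Eq. 1 for one training instance.
    y_true: one-hot list over C classes; probs: predicted class probabilities."""
    return -sum(y * math.log(p) for y, p in zip(y_true, probs) if y > 0)

# Binary case (C = 2): true class "path failure" predicted with probability 0.8
loss = cross_entropy([0, 1], [0.2, 0.8])  # = -log(0.8)
```

The total loss over the training set is simply the sum of this quantity over all instances.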
Figure 1 depicts a time-evolving graph in the context of a telecommunication network with four switches. In the past $M$ time steps, although the graph structure has changed, e.g., one link becomes unavailable due to overload, the path of interest is still available. From this time-evolving graph, we want to predict the availability of the path in the next $F$ time steps.
3. Methodology
3.1. Framework
In the context of a time-evolving graph, we observe three important properties as follows.
Property 1. Node correlation. Observations on nodes are correlated. For example, if a sensor station in a traffic network detects low traffic density at time $t$, we can infer that nearby stations on the same path also record low traffic at the same time with high probability.
Property 2. Influence of graph structure dynamics. Observations on nodes are influenced by changes in the graph structure. For example, if a road segment becomes unavailable at time $t$ (e.g., road closure), traffic shall be redirected to alternative routes. As a result, nodes on the affected path may record a sudden drop of traffic density, while nodes on alternative routes may record an increase of traffic flow at the subsequent time steps, which may increase the probability of path congestion.
Property 3. Temporal dependency. The time series recorded on each node demonstrates strong temporal dependency, e.g., high traffic density and low traveling speed recorded at morning and evening rush hours. This makes the time series nonstationary.
These three properties make our problem complicated. A desired model for path failure prediction should have built-in mechanisms to address these challenges. First, it should model the node correlation and the influence of graph structure dynamics for an accurate prediction. Second, it should capture the temporal dependency, especially the long-term trends, in the time series. Moreover, the temporal dependency and graph structure dynamics should be modeled jointly. Third, the model should be able to represent paths of arbitrary length and generate a fixed-length representation by considering all path nodes as a whole.
In this section, we present a novel end-to-end neural network framework to address the above three requirements. The framework (shown in Figure 2) takes as input the time-evolving graph, and outputs the representation and failure probability for each training path. To be more specific, our model uses a two-layer Long Short-Term Memory RGCN (LRGCN), a time-evolving graph neural network newly proposed in this work, to obtain the hidden representation of each node by capturing both graph structure dynamics and temporal dependency. Then it utilizes a self-attentive mechanism to learn the node importance and encode it into a unified path representation. Finally, it cascades the path representation with a fully connected layer and computes the loss defined in Eq. 1. In the following, we describe the components of our model in detail.
3.2. Time-Evolving Graph Modeling
We propose a new time-evolving neural network to capture the graph structure dynamics and temporal dependency jointly. Our design is mainly motivated by the recent success of graph convolutional networks (GCN) (Kipf and Welling, 2017) in graph-based learning tasks. As GCN cannot take both the time series X and the evolving graph structures A as input, our focus is how to generalize GCN to process time series and evolving graph structures simultaneously. In the following we first describe how we model the node correlation within a graph snapshot. Then we detail temporal dependency modeling between two adjacent graph snapshots. Finally, we generalize our model to the time-evolving graph.
3.2.1. Static graph modeling
Within one graph snapshot, the graph structure does not change, thus it can be regarded as a static graph. The original GCN (Kipf and Welling, 2017) was designed to handle undirected static graphs. Later, Relational GCN (RGCN) (Schlichtkrull et al., 2018) was developed to deal with multi-relational graphs. A directed graph can be regarded as a multi-relational graph with incoming and outgoing relations. In this vein, we use RGCN to model the node correlation in a static directed graph. Formally, RGCN takes as input the adjacency matrix $A_t$ and time series $X_t$, and transforms the nodes' features over the graph structure via one-hop normalization:

(2) $Z = \sigma\!\left(\tilde{A}_{in} X_t W_{in} + \tilde{A}_{out} X_t W_{out} + X_t W_{self}\right)$

where $\tilde{A}_{in}$ and $\tilde{A}_{out}$ are the normalized adjacency matrices of the incoming and outgoing relations respectively, and $\sigma$ is the activation function such as $\mathrm{ReLU}$. Eq. 2 can be considered as an accumulation of multi-relational normalization, where $W_{in}$ is a weight matrix for the incoming relation, $W_{out}$ for the outgoing relation, and $W_{self}$ for self-connection.
To further generalize RGCN and prevent overfitting, we can view the effect of self-connection normalization as a linear combination of incoming and outgoing normalization. This provides us with the following simplified expression:

(3) $Z = \sigma\!\left(\hat{A}_{in} X_t W_{in} + \hat{A}_{out} X_t W_{out}\right)$

where $\hat{A}_{in} = D_{in}^{-1}(A^{in} + I)$, $\hat{A}_{out} = D_{out}^{-1}(A^{out} + I)$, $D_{in}$ and $D_{out}$ are the corresponding degree matrices, and $I$ is the identity matrix. Note we can impose multi-hop normalization by stacking multiple layers of RGCN. In our design, we use a two-layer RGCN:
(4) $f(X_t, A_t; \theta) = \mathrm{RGCN}\!\left(\mathrm{RGCN}(X_t, A_t; W^{(0)}),\, A_t; W^{(1)}\right)$

where $\theta = \{W^{(0)}, W^{(1)}\}$ represents the parameter set used in static graph modeling, $W^{(0)}$ is an input-to-hidden weight matrix for a hidden layer with $h$ feature maps, and $W^{(1)}$ is a hidden-to-output weight matrix. $f(\cdot)$ stands for this two-hop graph convolution operation and shall be used hereafter.
Relation with the original GCN. The original GCN (Kipf and Welling, 2017) was defined on undirected graphs and can be regarded as a special case of this revised RGCN. One difference is that in undirected graphs the incoming and outgoing relations are identical, which makes $A^{in} = A^{out}$ in RGCN for undirected graphs. Another difference lies in the normalization trick. The purpose of this trick is to normalize the features of each node according to its one-hop neighborhood. In undirected graphs the relation is symmetric, thus the symmetric normalization $\hat{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ is applied, where $\tilde{D}$ is the degree matrix of $A + I$; while in directed graphs the relation is asymmetric, hence the asymmetric normalization is used.
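The two normalization schemes can be contrasted with a small pure-Python sketch; the helper names are ours, and self-loops are assumed to have been added to the adjacency matrix already:

```python
def asym_normalize(A):
    """Asymmetric (row) normalization D^{-1} A used for directed graphs:
    each node averages over its one-hop neighborhood."""
    return [[a / max(sum(row), 1) for a in row] for row in A]

def sym_normalize(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2} used for undirected graphs."""
    d = [sum(row) for row in A]
    n = len(A)
    return [[A[i][j] / ((d[i] * d[j]) ** 0.5) if d[i] and d[j] else 0.0
             for j in range(n)] for i in range(n)]
```

For a directed graph the row normalization keeps each row summing to 1, while the symmetric form is only meaningful when the adjacency matrix is symmetric.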
The discussion above focuses on a graph snapshot, which is static. Next, we extend RGCN to take as inputs two adjacent graph snapshots.
3.2.2. Adjacent graph snapshots modeling
Before diving into a sequence of graph snapshots, we first focus on two adjacent time steps $t-1$ and $t$, as shown in Figure 3. A node at time $t$ is not only correlated with other nodes at the same time (which is referred to as the intra-time relation), but also depends on nodes at the previous time step (which is referred to as the inter-time relation), and this dependency is directed and asymmetric. For example, if a sensor station detects high traffic density at time $t$, then nearby sensor stations may also record high traffic density at the same time due to spatial proximity. Moreover, if a sensor station detects a sudden increase of traffic density at time $t-1$, downstream stations on the same path will record the corresponding increase at subsequent time steps, as it takes time for traffic flows to reach downstream stations. In our model, we use the Markov property to model the inter-time dependency. In total, there are four types of relations to model in RGCN, i.e., intra-incoming, intra-outgoing, inter-incoming and inter-outgoing relations. For nodes at time $t$, the multi-relational normalization expression is as follows:
(5) $Z_t = \sigma\!\left(\hat{A}^{in}_t X_t W^{in} + \hat{A}^{out}_t X_t W^{out} + \hat{A}^{in}_{t-1} X_{t-1} U^{in} + \hat{A}^{out}_{t-1} X_{t-1} U^{out}\right)$

where $U = \{U^{in}, U^{out}\}$ stands for the parameter set used in inter-time modeling, and it does not change over time. For the inter-time relations, $A_{t-1}$ is used to represent the graph structure. This operation is named the time-evolving graph unit (G_unit), which plays a similar role to the unit in Recurrent Neural Networks (RNN). Note that here the normalization still includes inter-time self-connection, as $\hat{A}_{t-1}$ has self-loops.
Intuitively, Eq. 5 computes the new feature of a node by accumulating transformed features via a normalized sum of itself and its neighbors from both the current and the previous graph snapshot. As nodes which are densely connected by inter-time and intra-time relations tend to be proximal, this computation makes their representations similar, thus simplifying the downstream tasks.
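A minimal sketch of this time-evolving graph unit, written with plain Python lists; the weight names and the use of the previous hidden state as the inter-time input follow the description above but are illustrative rather than the paper's exact parameterization:

```python
def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def relu(M):
    return [[max(0.0, x) for x in row] for row in M]

def g_unit(A_in, A_out, X_t, H_prev, W_in, W_out, U_in, U_out):
    """Sketch of the time-evolving graph unit (Eq. 5): accumulate normalized
    intra-time messages (A_in/A_out applied to X_t) and inter-time messages
    (applied to the previous state H_prev), then apply a nonlinearity.
    All weight names here are illustrative."""
    Z = matmul(matmul(A_in, X_t), W_in)
    for T in (matmul(matmul(A_out, X_t), W_out),
              matmul(matmul(A_in, H_prev), U_in),
              matmul(matmul(A_out, H_prev), U_out)):
        Z = [[z + t for z, t in zip(zr, tr)] for zr, tr in zip(Z, T)]
    return relu(Z)
```

With identity adjacencies and weights, the unit simply sums the current input and previous state, which makes the accumulation structure easy to check by hand.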
Relation with RNN unit. The RNN unit was proposed to transform an input by considering not only the present input but also the inputs preceding it in a sequence. It can be regarded as a special case of our time-evolving graph unit in which the input elements at each time step are not graph-structured and only one-hop smoothing is considered.
3.2.3. The proposed LRGCN model
Based on the time-evolving graph unit proposed above, we are ready to design a neural network working on a time-evolving graph. We use a hidden state $H_{t-1}$ to memorize the transformed features in the previous snapshots, and feed the hidden state and the current input $X_t$ into the unit to derive a new hidden state:
(6) $H_t = \mathrm{G\_unit}(X_t, H_{t-1}; W, U)$

where the parameter set includes $W$ for the intra-time relations and $U$ for the inter-time relations. When applied sequentially, as the transformed features in $H_{t-1}$ can contain information from an earlier, arbitrarily long window, the unit can be utilized to process a sequence of graph snapshots, i.e., a time-evolving graph.
Unfortunately, despite the usefulness of this RNN-style evolving graph neural network, it still suffers from the well-known problem of exploding or vanishing gradients. In this context, past studies (e.g., (Sutskever et al., 2014)) utilized Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) to model the long-term dependency in sequence learning. Inspired by this, we propose a Long Short-Term Memory RGCN, called LRGCN, which can take as input a long-term time-evolving graph and capture the structure dynamics and temporal dependency jointly. Formally, LRGCN utilizes three gates to achieve the long-term memory or accumulation:
(7) $i_t = \sigma\!\left(f(X_t; \theta_{xi}) + f(H_{t-1}; \theta_{hi})\right)$

(8) $f_t = \sigma\!\left(f(X_t; \theta_{xf}) + f(H_{t-1}; \theta_{hf})\right)$

(9) $o_t = \sigma\!\left(f(X_t; \theta_{xo}) + f(H_{t-1}; \theta_{ho})\right)$

(10) $C_t = f_t \odot C_{t-1} + i_t \odot \tanh\!\left(f(X_t; \theta_{xc}) + f(H_{t-1}; \theta_{hc})\right)$

(11) $H_t = o_t \odot \tanh(C_t)$

where $\odot$ stands for element-wise multiplication, and $i_t$, $f_t$, $o_t$ are the input gate, forget gate and output gate at time $t$ respectively. $\theta_{xi}$, $\theta_{hi}$, $\theta_{xf}$, $\theta_{hf}$, $\theta_{xo}$, $\theta_{ho}$, $\theta_{xc}$, $\theta_{hc}$ are the parameter sets for the corresponding gates and cell. $H_t$ is the hidden state or output at time $t$, as used in Eq. 6. $f(\cdot)$ denotes the two-hop graph convolution operation defined in Eq. 4. Intuitively, LRGCN achieves long-term time-evolving graph memory by carefully selecting the input to alter the state of the memory cell and to remember or forget its previous state, according to the task at hand.
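The gating logic of Eqs. 7–11 can be sketched element-wise as follows, with the two-hop graph convolutions abstracted away as pre-computed scalar pre-activations (all names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lrgcn_cell_step(gx_i, gh_i, gx_f, gh_f, gx_o, gh_o, gx_c, gh_c, c_prev):
    """Element-wise sketch of the LRGCN gating (Eqs. 7-11) for one scalar
    feature. Each gx_*/gh_* stands for the pre-computed graph convolution of
    the current input / previous hidden state for the corresponding gate."""
    i = sigmoid(gx_i + gh_i)                      # input gate, Eq. 7
    f = sigmoid(gx_f + gh_f)                      # forget gate, Eq. 8
    o = sigmoid(gx_o + gh_o)                      # output gate, Eq. 9
    c = f * c_prev + i * math.tanh(gx_c + gh_c)   # memory cell, Eq. 10
    h = o * math.tanh(c)                          # hidden state, Eq. 11
    return h, c
```

The full LRGCN cell applies this gating to every entry of the node-feature matrices, with the pre-activations produced by the graph convolutions of Eq. 4.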
To summarize, we use a two-layer LRGCN to generate the hidden representation of each node, in which the first-layer LRGCN serves as the encoder of the whole time-evolving graph and its output is used to initialize the second-layer LRGCN. Then we take the outputs of the last time step in the second-layer LRGCN as the final node representation $H$. As we work on path classification, the next task is how to obtain the path representation based on $H$.
3.3. Self-Attentive Path Embedding
In this subsection, we describe our method, which produces a fixed-length path representation given the node representation $H$ from the previous subsection. For a path instance $P$, we can retrieve its node representations $H(P)$ directly from $H$. For the final classification task, however, we still identify two challenges:

Size invariance: How to produce a fixed-length vector representation for a path of arbitrary length?

Node importance: How to encode the importance of different nodes into a unified path representation?
Node importance means that different nodes in a path carry different degrees of importance. For example, along a path, a sensor station at the intersection of main streets should contribute more to the derived embedding vector than one on a less busy street. We need to design a mechanism to learn the node importance, and then encode it into the embedding vector properly.
To this end, we propose a self-attentive path embedding method, called SAPE, to address the challenges listed above. In SAPE, we first utilize LSTM to sequentially take in the node representations of a path, outputting at each step a representation that balances the upstream node representations and the current input node representation, as proposed in (Li et al., 2017; Liu et al., 2017). Then we use the self-attentive mechanism to learn the node importance and transform a path of variable length into a fixed-length embedding vector. Figure 4 depicts the overall framework of SAPE.
Formally, for a path $P = (v_{p_1}, v_{p_2}, \dots, v_{p_m})$, we first apply LSTM to capture node dependency along the path sequence:

(12) $H^P = \mathrm{LSTM}(H(P))$

where $H(P) \in \mathbb{R}^{m \times d'}$ contains the LRGCN representations of the path nodes and $H^P \in \mathbb{R}^{m \times u}$. With LSTM, we have transformed the node representations from a $d'$-dimensional space to a $u$-dimensional space by capturing node dependency along the path sequence.
Note that this intermediate path representation does not provide node importance, and it is size variant, i.e., its size is still determined by the number of nodes in the path. So next we utilize the self-attentive mechanism to learn node importance and encode it into a unified path representation, which is size invariant:
(13) $A = \mathrm{softmax}\!\left(W_2 \tanh\!\left(W_1 (H^P)^{\top}\right)\right)$

where $W_1 \in \mathbb{R}^{d_a \times u}$ and $W_2 \in \mathbb{R}^{r \times d_a}$ are two weight matrices. The function of $W_1$ is to transform the node representations from a $u$-dimensional space to a $d_a$-dimensional space. $W_2$ is used as $r$ views for inferring the importance of each node. Then softmax is applied to derive a standardized importance of each node, which means that in each view the summation of all node importance is 1.
Based on all the above, we compute the final path representation $E$ by multiplying $A$ with $H^P$:

(14) $E = A H^P$

$E \in \mathbb{R}^{r \times u}$ is size invariant since it does not depend on the number of nodes any more. It also unifies the node importance into the final representation, in which the node importance is only determined by the downstream tasks.
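Putting Eqs. 13 and 14 together, the self-attentive pooling can be sketched in pure Python; the dimension names follow the text, while the implementation details (and the orientation of the attention matrix) are ours:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def sape_pool(H, W1, W2):
    """Sketch of SAPE's self-attentive pooling (Eqs. 13-14).
    H: m x u node representations from the LSTM; W1: u x d_a; W2: d_a x r.
    Returns the r x u size-invariant path embedding and the m x r attention."""
    # S = tanh(H W1): m x d_a
    S = [[math.tanh(sum(h * w for h, w in zip(row, col)))
          for col in zip(*W1)] for row in H]
    # scores = S W2: m x r, then softmax over nodes within each view
    scores = [[sum(s * w for s, w in zip(row, col)) for col in zip(*W2)]
              for row in S]
    A = list(zip(*[softmax(col) for col in zip(*scores)]))  # columns sum to 1
    # E = A^T H: r x u
    E = [[sum(A[i][k] * H[i][j] for i in range(len(H)))
          for j in range(len(H[0]))] for k in range(len(A[0]))]
    return E, A
```

With uniform (zero) weights every node receives equal attention, so the pooled embedding reduces to the mean of the node representations, which is a convenient sanity check.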
Overall, our framework first uses a two-layer LRGCN to obtain the hidden representation of each node by capturing graph structure dynamics and temporal dependency. Then it uses SAPE to derive the path representation that takes node importance into consideration. The output of SAPE is cascaded with a fully connected layer to compute the final loss.
4. Experiments
We validate the effectiveness of our model on two real-world data sets: (1) predicting path failure in a telecommunication network, and (2) predicting path congestion in a traffic network.
4.1. Data
4.1.1. Telecommunication network (Telecom)
This data set targets a metropolitan LTE transmission network serving 10 million subscribers. We select 626 switches and collect 3 months of data from Feb 1, 2018 to Apr 30, 2018. For each switch, we collect two values every 15 minutes: sending power and receiving power. From the network, we construct an adjacency matrix $A$ by setting $A(i, j) = 1$ if there is a directed optical link from switch $v_i$ to switch $v_j$, and $A(i, j) = 0$ otherwise. The graph structure changes when links fail or recover. The observations on switches and the graph structure over time form a time-evolving graph, where a time step corresponds to 15 minutes and the total number of time steps is 8449 over the 3 months.
Table 1. Statistics of the two data sets.

                             Telecom      Traffic
No. of failure/congestion    385,896      85,083
No. of availability          6,821,101    346,917
Average length of paths      7.05±4.39    32.56±12.48
There are 853 paths serving transmission of various services. Using a sliding window over time, we create path instances, which are split into training, validation and test sets. We label a path instance at time $t$ as failure if alarms are triggered on the path by the alarm system within the next $F$ time steps. We use 24 hours' history data to predict if a path will fail in the next 24 hours, i.e., $M = 96$ and $F = 96$.
4.1.2. Traffic network (Traffic)
This data set targets District 7 of California, collected from the Caltrans Performance Measurement System (PeMS). We select 4438 sensor stations and collect 3 months of data from Jun 1, 2018 to Aug 30, 2018. For each station, we aggregate two measures at the hourly granularity: average speed and average occupancy. From the traffic network, we construct an adjacency matrix $A$ by setting $A(i, j) = 1$ if $v_i$ and $v_j$ are adjacent stations on a freeway along the same direction, and $A(i, j) = 0$ otherwise. The graph structure changes according to the node status (e.g., congestion or closure). A time-evolving graph is constructed from the observations recorded at stations and the dynamic graph structure, where a time step is an hour and the total number of time steps is 2160 over the 3 months.
We sample 200 paths by randomly choosing two stations as the source and target, then use Dijkstra's algorithm to generate the shortest path. Using a sliding window over time, we create path instances, which are split into training, validation and test sets. We label a path instance at time $t$ as congestion if two consecutive stations are labeled as congested within the next $F$ time steps. We use 24 hours' history data to predict if a path will congest in the next one hour, i.e., $M = 24$ and $F = 1$.
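The path sampling step can be sketched with a standard Dijkstra implementation; the adjacency-list format and the edge weights are assumptions on our part, as the paper does not specify them:

```python
import heapq

def dijkstra_path(adj, src, dst):
    """Shortest path between two stations, as used to sample paths.
    adj: {node: [(neighbor, weight), ...]}; returns the node sequence or None."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1]
```

Each sampled (source, target) pair then yields one path, over which sliding-window instances are generated.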
4.2. Baselines and Metrics

DTW (Berndt and Clifford, 1994), which first measures node similarity by the Dynamic Time Warping distance of time series observed on each node, then models the path as a bag of nodes, and calculates the similarity between two paths by their maximum node similarity.

FC-LSTM, which uses a two-layer LSTM to capture the temporal dependency and another LSTM layer to derive the path representation. It only considers the time series, but does not model node correlation or graph structure.

DCRNN (Li et al., 2018), which uses a two-layer DCRNN to capture both temporal dependency and node correlation, and uses LSTM to derive the path representation from the last hidden state of the second DCRNN layer. It works on a static graph.

STGCN (Yu et al., 2018), which is similar to DCRNN except that we replace DCRNN with STGCN.

LRGCN, which is similar to DCRNN except that we replace DCRNN with LRGCN.

LRGCN-SAPE (static), which is similar to LRGCN except that we replace the LSTM path representation method with SAPE.

LRGCN-SAPE (evolving), which is similar to LRGCN-SAPE (static) except that the underlying graph structure evolves over time.
All neural network approaches are implemented using TensorFlow, and trained using a mini-batch based Adam optimizer with exponential learning rate decay. The best hyperparameters are chosen using early stopping with an epoch window size of 3 on the validation set. All trainable variables are initialized with He normal initialization (He et al., 2015). For fair comparison, DCRNN, STGCN and LRGCN use the same static graph structure and the same LSTM path representation method. Detailed parameter settings for all methods are available in Appendix B.
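The early-stopping rule with an epoch window of 3 can be sketched as follows (the function and its return convention are illustrative, not the paper's code):

```python
def early_stopping(val_losses, window=3):
    """Stop once the validation loss has not improved for `window`
    consecutive epochs; returns the epoch index at which training stops."""
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
        if since_best >= window:
            return epoch
    return len(val_losses) - 1
```

The model checkpoint kept for evaluation is the one from the epoch with the best validation loss, not the stopping epoch itself.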
We run each method three times and report the average Precision, Recall and Macro-F1. Precision and Recall are computed with respect to the positive class, i.e., path failure/congestion, and are defined as follows:

(15) $\mathrm{Precision} = \frac{TP}{TP + FP}$

(16) $\mathrm{Recall} = \frac{TP}{TP + FN}$

where $TP$, $FP$ and $FN$ denote the numbers of true positives, false positives and false negatives respectively. The F1 score of the positive class is defined as follows:

(17) $F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

The Macro-F1 score is the average of the F1 scores of the positive and negative classes, i.e., $F1_{pos}$ and $F1_{neg}$:

(18) $\text{Macro-F1} = \frac{F1_{pos} + F1_{neg}}{2}$
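These metrics can be computed with a straightforward sketch; the TP/FP/FN counts are assumed to be given:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 (Eqs. 15-17) for one class, with the usual
    zero-division guards."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def macro_f1(f1_pos, f1_neg):
    """Macro-F1 (Eq. 18): average of positive- and negative-class F1."""
    return (f1_pos + f1_neg) / 2
```

Note that Precision and Recall in the tables below are reported for the positive class only, while Macro-F1 averages over both classes.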
4.3. Results
4.3.1. Classification performance
Tables 2 and 3 list the experimental results on the Telecom and Traffic data sets respectively. Among all approaches, LRGCN-SAPE (evolving) achieves the best performance. In the following, we analyze the performance of all methods categorized into 4 groups.
Group 1: DTW performs worse than all neural network based methods. One possible explanation is that DTW is an unsupervised method, which fails to generate discriminative features for classification. Another possibility is that DTW measures the similarity of two time series by their pairwise distance and does not capture temporal dependency like its competitors.
Group 2: FC-LSTM performs worse than the three neural network methods in Group 3 in both Macro-F1 and Precision, which proves the effectiveness of the node correlation modeling in Group 3.
Group 3: All three neural networks in this group model both node correlation and temporal dependency, but the underlying graph structure is static and does not change. LRGCN outperforms both DCRNN and STGCN by at least 1% in Macro-F1 on both data sets, indicating that LRGCN is more effective in modeling node correlation and temporal dependency. Between STGCN and DCRNN, DCRNN performs slightly better (by 0.19% in Macro-F1) on Traffic and STGCN performs better (by 1.87% in Macro-F1) on Telecom.
Group 4: LRGCN-SAPE (static) works on a static graph and LRGCN-SAPE (evolving) works on a time-evolving graph. LRGCN-SAPE (static) outperforms the Group 3 methods by at least 1% in Macro-F1 on both data sets, which means that SAPE is superior to pure LSTM for path representation. LRGCN-SAPE (evolving) further achieves a substantial improvement by modeling the time-evolving graph, i.e., it improves Macro-F1 by 1.34% on Telecom and 1.90% on Traffic.
4.3.2. Training efficiency
To compare the training efficiency of different methods, we plot their learning curves in Figure 6. We find that our proposed LRGCN-based methods, including LRGCN, LRGCN-SAPE (static) and LRGCN-SAPE (evolving), converge more quickly than the other methods. Another finding is that after three epochs, LRGCN-SAPE (evolving) outperforms the other methods by achieving the lowest validation loss, which indicates the better training efficiency of our proposed method on time-evolving graphs.
4.3.3. Benefits of graph evolution modeling
To further investigate how LRGCN performs on time-evolving graphs, we target a path which is heavily influenced by a closed sensor station, and visualize the node attention weights learned by LRGCN-SAPE (evolving) and LRGCN-SAPE (static) respectively in Figure 7. In the visualization, green represents light traffic recorded by sensor stations, red represents heavy traffic, and a bigger node radius denotes a larger attention weight, computed as the average across the views inferred by SAPE. We find that LRGCN-SAPE (evolving) can capture the dynamics of the graph structure caused by the closure of the station, in the sense that the nearby station is affected subsequently and thus receives more attention. In contrast, LRGCN-SAPE (static) is unaware of the graph structure change and assigns a large attention weight to a node that is far away from the closed station.
Table 2. Experimental results on Telecom.

Group  Algorithm              Precision  Recall   Macro-F1
1      DTW                    15.47%     9.63%    53.23%
2      FC-LSTM                13.29%     52.27%   53.78%
3      DCRNN                  13.97%     57.81%   54.42%
       STGCN                  16.35%     52.53%   56.29%
       LRGCN                  17.38%     61.34%   57.70%
4      LRGCN-SAPE (static)    17.67%     65.28%   60.55%
       LRGCN-SAPE (evolving)  19.23%     65.07%   61.89%
4.3.4. Effect of the number of views
We evaluate how the number of views ($r$) affects the model performance. Taking LRGCN-SAPE (static) on Traffic as an example, we vary $r$ from 1 to 32 and plot the corresponding validation loss with respect to the number of epochs in Figure 8. As we increase $r$, the performance improves (as the loss drops), achieving the best result at the setting shown as the green line. We also observe that the performance differences for different $r$ are quite small, which demonstrates that our model performs very stably with respect to the setting of $r$.
Table 3. Experimental results on Traffic.

Group  Algorithm              Precision  Recall   Macro-F1
1      DTW                    12.05%     39.12%   51.62%
2      FC-LSTM                54.44%     87.97%   76.55%
3      DCRNN                  63.05%     88.55%   82.60%
       STGCN                  64.52%     86.15%   82.41%
       LRGCN                  65.15%     87.65%   83.74%
4      LRGCN-SAPE (static)    67.74%     88.44%   84.84%
       LRGCN-SAPE (evolving)  71.04%     88.50%   86.74%
4.3.5. Path embedding visualization
To better understand the derived path embeddings, we select 1000 path instances from the test set of the Traffic data. We apply LRGCN-SAPE (evolving) to derive the embeddings of these 1000 test instances, and then project the learned embeddings into a two-dimensional space by t-SNE (van der Maaten and Hinton, 2008), as depicted in Figure 9. Green represents path availability and red represents path congestion. As we can see in this two-dimensional space, paths of the same label have similar representations, as reflected by the geometric distance between them.
4.3.6. Effect of the normalization methods
We compare the performance of asymmetric normalization and symmetric normalization on both the Telecom and Traffic data sets with the LRGCN-SAPE (evolving) method. Other experimental settings remain the same as in the experiments presented above. Results are listed in Table 4. On both data sets, asymmetric normalization outperforms symmetric normalization in both Precision and Macro-F1. The advantage of asymmetric normalization is more significant on Telecom than on Traffic. The reason is that most of the sensor stations in the traffic network are connected bidirectionally, while the switches in the telecommunication network are connected unidirectionally. This demonstrates that asymmetric normalization is more effective than symmetric normalization on directed graphs.
Normalization  Telecom                       Traffic
               Precision  Recall  Macro-F1   Precision  Recall  Macro-F1
Symmetric      15.39%     59.07%  56.40%     67.60%     89.14%  85.19%
Asymmetric     19.23%     65.07%  61.89%     71.04%     88.50%  86.74%
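For concreteness, the two normalization schemes can be sketched with NumPy. We assume the standard GCN-style forms (random-walk normalization for the asymmetric case, symmetric normalization with square-root degrees otherwise, with self-loops added); the paper's exact formulas are elided here, so treat this as an illustrative assumption:

```python
import numpy as np

# Toy directed adjacency matrix; adding self-loops (A + I) and the exact
# normalization forms below are standard GCN conventions, assumed here
# rather than taken verbatim from the paper.
A = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 0., 0.]])
A_tilde = A + np.eye(3)
deg = A_tilde.sum(axis=1)            # out-degrees

# Asymmetric (random-walk) normalization: D^{-1} A_tilde
asym = A_tilde / deg[:, None]

# Symmetric normalization: D^{-1/2} A_tilde D^{-1/2}
d_inv_sqrt = 1.0 / np.sqrt(deg)
sym = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
```

Each row of `asym` sums to 1, giving a proper transition matrix even when in- and out-degrees differ, which is one intuition for why the asymmetric form suits directed graphs.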
5. Related Work
Many real-world problems can be formulated as prediction tasks in time-evolving graphs. We survey two tasks: failure prediction in telecommunication and traffic forecasting in transportation. For failure prediction, the pioneering work (Klemettinen et al., 1999) formulates the task as a sequential pattern matching problem, and the network topological structure is not fully exploited. Later, (Fronza et al., 2013) formulates it as a classification problem and uses SVM (Hearst et al., 1998) to distinguish failures from normal behaviors. (Pitakrat et al., 2018) uses a Bayesian network to model the spatial dependency and an AutoRegressive Integrated Moving Average (ARIMA) model to predict node failures. For traffic forecasting, existing approaches can be broadly categorized into two groups. The first group uses statistical models (Vlahogianni et al., 2014), which either impose a stationarity hypothesis on the time series or only incorporate cyclical patterns such as seasonality. The second group takes advantage of deep neural networks to tackle nonlinear spatial and temporal dependency: (Laptev et al., 2017) uses RNNs to capture dynamic temporal dependency, and (Zhang et al., 2017) applies CNNs to model the spatial dependency between nodes. All the above studies treat the underlying graph as static, while our problem setting and solution target time-evolving graphs. In addition, we study path failure prediction instead of node failure prediction.
Although there are many studies on node representation learning (Perozzi et al., 2014), path representation has been studied much less. DeepCas (Li et al., 2017) leverages a bidirectional GRU to read the node representations of a path forward and backward, and represents the path by concatenating the forward and backward hidden vectors. ProxEmbed (Liu et al., 2017) uses an LSTM to read node representations and applies a max-pooling operation over all time step outputs to generate the final path representation. However, for a long path, an RNN may suffer from exploding or vanishing gradients, which prevents the derived representation from preserving long-term dependency. In contrast, our proposed path representation method SAPE utilizes the self-attentive mechanism, previously proven successful in (Lin et al., 2017; Li et al., 2019), to explicitly encode node importance into a unified path representation.
There are many studies on neural networks for static graphs (Kipf and Welling, 2017; Schlichtkrull et al., 2018), but research that generalizes neural networks to time-evolving graphs is still lacking. The closest works are neural networks on spatiotemporal data where the graph structure does not change. DCRNN (Li et al., 2018) models the static structural dependency as a diffusion process and replaces the matrix multiplications in GRU with the diffusion convolution to jointly handle temporal dynamics and spatial dependency. STGCN (Yu et al., 2018) models spatial and temporal dependency with a three-layer convolutional structure, i.e., two gated sequential convolution layers with a graph convolution layer in between. Our solution LRGCN is novel in that it extends neural networks to time-evolving graphs whose structure changes over time.
6. Conclusion
In this paper, we study path classification in time-evolving graphs. To capture temporal dependency and graph structure dynamics, we design a new neural network LRGCN, which views node correlation within a graph snapshot as intra-time relations, views temporal dependency between adjacent graph snapshots as inter-time relations, and jointly models the two. To provide interpretation as well as enhance performance, we propose a new path representation method named SAPE. Experimental results on a real-world telecommunication network and a traffic network in California show that LRGCN-SAPE outperforms other competitors by a significant margin in path failure prediction. It also generates meaningful interpretations of the learned path representations.
Acknowledgements.
The work described in this paper was supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No.: CUHK 14205618], and the Huawei Technologies Research and Development Fund.
References
 Aggarwal and Li (2011) Charu C. Aggarwal and Nan Li. 2011. On node classification in dynamic content-based networks. In SDM. 355–366.
 Berndt and Clifford (1994) Donald J. Berndt and James Clifford. 1994. Using dynamic time warping to find patterns in time series. In KDD Workshop. 359–370.
 Fronza et al. (2013) Ilenia Fronza, Alberto Sillitti, Giancarlo Succi, Mikko Terho, and Jelena Vlasenko. 2013. Failure prediction based on log files using random indexing and support vector machines. Journal of Systems and Software 86, 1 (2013), 2–11.
 Güneş et al. (2014) İsmail Güneş, Zehra Çataltepe, and Şule G. Öğüdücü. 2014. GA-TVRC-Het: genetic algorithm enhanced time varying relational classifier for evolving heterogeneous networks. Data Mining and Knowledge Discovery 28, 3 (2014), 670–701.
 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV. 1026–1034.
 Hearst et al. (1998) Marti A. Hearst, Susan T. Dumais, Edgar Osuna, John Platt, and Bernhard Schölkopf. 1998. Support vector machines. IEEE Intelligent Systems and their Applications 13, 4 (1998), 18–28.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
 Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR.
 Klemettinen et al. (1999) Mika Klemettinen, Heikki Mannila, and Hannu Toivonen. 1999. Rule discovery in telecommunication alarm data. Journal of Network and Systems Management 7, 4 (1999), 395–423.
 Laptev et al. (2017) Nikolay Laptev, Jason Yosinski, Li Erran Li, and Slawek Smyl. 2017. Time-series Extreme Event Forecasting with Neural Networks at Uber. In ICML Workshop.
 Li et al. (2017) Cheng Li, Jiaqi Ma, Xiaoxiao Guo, and Qiaozhu Mei. 2017. DeepCas: An end-to-end predictor of information cascades. In WWW. 577–586.
 Li et al. (2019) Jia Li, Yu Rong, Hong Cheng, Helen Meng, Wenbing Huang, and Junzhou Huang. 2019. Semi-Supervised Graph Classification: A Hierarchical Graph Perspective. In WWW.
 Li et al. (2018) Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2018. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In ICLR.
 Lin et al. (2017) Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A Structured Self-attentive Sentence Embedding. In ICLR.
 Liu et al. (2017) Zemin Liu, Vincent W. Zheng, Zhou Zhao, Fanwei Zhu, Kevin Chen-Chuan Chang, Minghui Wu, and Jing Ying. 2017. Semantic Proximity Search on Heterogeneous Graph by Proximity Embedding. In AAAI. 154–160.
 Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In KDD. 701–710.
 Pitakrat et al. (2018) Teerat Pitakrat, Dušan Okanović, André van Hoorn, and Lars Grunske. 2018. Hora: Architecture-aware online failure prediction. Journal of Systems and Software 137 (2018), 669–685.
 Schlichtkrull et al. (2018) Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference. Springer, 593–607.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS. 3104–3112.
 v. d. Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
 Vlahogianni et al. (2014) Eleni I. Vlahogianni, Matthew G. Karlaftis, and John C. Golias. 2014. Short-term traffic forecasting: Where we are and where we're going. Transportation Research Part C: Emerging Technologies 43 (2014), 3–19.
 Yu et al. (2018) Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2018. Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In IJCAI. 3634–3640.
 Zhang et al. (2017) Junbo Zhang, Yu Zheng, and Dekang Qi. 2017. Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction. In AAAI. 1655–1661.
Appendix A Data sets
A.1. Telecommunication network
The telecommunication data set is provided by a large telecommunication company and records sensor data from a real service session.

Data preparation We choose a metropolitan LTE transmission network as our target, which contains 626 switches and records sensor data from Feb 1, 2018 to Apr 30, 2018.

Static graph Switches are linked by optical fibers in a directed way. We treat a switch as a node in the graph and add a weighted directed edge between two nodes if there is an optical fiber linking one switch to the other. The statistics of the constructed static graph are listed in Table 5.
Table 5. Statistics of constructed static graphs

Data set  Nodes  Edges  Density
Telecom   626    2464   0.63%
Traffic   4438   8996   0.05%

Table 6. A multivariate time series example recorded by switches

SwitchID  Time              Sending power  Receiving power
H0001     2018-02-01 00:00  0.7 dB         20.7 dB
H0001     2018-02-01 00:15  8.0 dB         18.1 dB
Feature matrix Each switch records several observations every 15 minutes. Among these observations, we use the average sending power and average receiving power as features. For each switch, the sequence of features over time forms a multivariate time series. Table 6 shows an example time series fragment in a 30-minute window. We normalize each feature to the range [0, 1]. Finally we get a feature matrix, where 8449 is the number of time steps.
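The [0, 1] scaling above is plain min-max normalization; a minimal sketch (the function name is ours):

```python
def min_max_normalize(series):
    """Scale a list of readings (e.g. one switch's sending power over
    time) to [0, 1] by min-max normalization, as done per feature above.
    A constant series maps to all zeros by convention."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(x - lo) / (hi - lo) for x in series]
```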

Path labeling There are 853 paths serving various services in this metropolitan transmission network. An alarm system serves as an anomaly detector: once a path fault (e.g., network hardware outage, high transmission latency, or signal interference) is detected, it issues a warning message. If the number of warning messages within an hour exceeds a threshold, we label the path as "path failure". We use 24 hours' history data to predict whether a path will fail in the next 24 hours. Finally we get a label matrix.
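The hourly warning-count rule can be sketched as a sliding window over sorted warning timestamps; the helper name and the use of seconds are our conventions, and the actual threshold value is not disclosed in the paper:

```python
def has_path_failure(warning_times, threshold):
    """Return True if any one-hour (3600 s) window contains more than
    `threshold` warning messages. `warning_times` must be sorted
    timestamps in seconds. Illustrative sketch of the labeling rule."""
    i = 0
    for j, t in enumerate(warning_times):
        # advance the window start until it is within one hour of t
        while t - warning_times[i] > 3600:
            i += 1
        if j - i + 1 > threshold:
            return True
    return False
```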

Time-evolving graph An optical link can be in one of two states, failure or availability, determined by a key performance index called the bit error rate. We construct the time-evolving graph as follows: at time step , we set the adjacency matrix element  if edge  is labeled failure, and set  otherwise.
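The snapshot construction can be sketched as follows; zeroing the adjacency entry of a failed edge (rather than, say, deleting it) is our assumption, since the exact matrix values are elided above:

```python
def snapshot_adjacency(edges, failed):
    """Build one snapshot's adjacency map for the telecom graph.
    `edges` maps a directed edge (u, v) to its static weight; edges
    labeled failure by the bit-error-rate index get entry 0, all
    others keep their weight. Illustrative helper, not the paper's code."""
    return {e: (0.0 if e in failed else w) for e, w in edges.items()}
```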
A.2. Traffic network
The traffic data set is collected by the California Transportation Agencies (CalTrans) Performance Measurement System (PeMS). The details can be found on the CalTrans website (http://pems.dot.ca.gov/).

Data preparation We use the traffic data in District 7 from Jun 1, 2018 to Aug 30, 2018. There are two kinds of information: real-time traffic data and metadata. The former records the traffic information at sensor stations, such as the average speed and average occupancy. The latter records general information about the stations, such as longitude, latitude, delay, and closure. The number of stations used in this study is 4438.

Static graph In the metadata, the order of the stations along a freeway is indicated by the "Absolute Postmile" field. We treat a station as a node in the graph and connect adjacent stations on a freeway along the same direction one by one (in "Absolute Postmile" order). The weight of every edge is set to 1.
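The construction can be sketched as sorting one direction of a freeway by postmile and chaining consecutive stations (function and field names are ours):

```python
def freeway_edges(stations):
    """Connect adjacent stations along one freeway direction.
    `stations` is a list of (station_id, absolute_postmile) pairs;
    we sort by postmile and link consecutive stations with weight 1,
    mirroring the construction described above."""
    ordered = sorted(stations, key=lambda s: s[1])
    return [(a[0], b[0], 1.0) for a, b in zip(ordered, ordered[1:])]
```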

Feature matrix Each station records several kinds of traffic information hourly. Among them, we use the average speed and average occupancy as features. We replace missing values with 0 and normalize the features to the range [0, 1]. Finally we get a feature matrix, where 2160 is the number of time steps.

Path labeling We randomly choose two nodes on the static graph as the source and target, then use Dijkstra's algorithm to find the shortest path from the source to the target. We sample 200 paths and restrict the path length to the range [2, 50]. For a node, its congestion information is indicated by the "delay" field. If two consecutive nodes on a path are congested at the same time, we label the path as "path congestion". We use 24 hours' history data to predict whether a path will be congested in the next hour. Finally we get a label matrix.
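The shortest-path step is standard Dijkstra; a textbook sketch on a weighted directed adjacency map (not the authors' code):

```python
import heapq

def dijkstra_path(adj, src, dst):
    """Shortest path by edge weight on a directed graph given as
    {u: [(v, w), ...]}. Returns the node list from src to dst,
    or None if dst is unreachable."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    seen = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in seen:
            continue
        seen.add(u)
        if u == dst:
            break
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]
```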

Timeevolving graph We construct the timeevolving graph according to the following rules.

At time step , if a node is labeled closure, we delete all its incoming and outgoing edges in the graph snapshot at that time step.

At time step , if a node is labeled congestion, we shrink the weights of all its incoming and outgoing edges by a factor of 0.5.
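The two rules together can be sketched as a pure function from the static weighted adjacency map to one snapshot (names are illustrative):

```python
def apply_evolving_rules(adj, closed_nodes, congested_nodes):
    """Return a new weighted adjacency dict {(u, v): w} after applying
    the two rules above: drop every edge touching a closed station and
    halve the weight of every edge touching a congested one.
    Illustrative helper, not the authors' code."""
    out = {}
    for (u, v), w in adj.items():
        if u in closed_nodes or v in closed_nodes:
            continue  # rule 1: closure removes incident edges
        if u in congested_nodes or v in congested_nodes:
            w *= 0.5  # rule 2: congestion shrinks incident edge weights
        out[(u, v)] = w
    return out
```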

Appendix B Detailed Experiment Settings
This part details the implementation of each method and its hyperparameter settings, if any.
DTW: Dynamic Time Warping is an algorithm for measuring the similarity between two time series. As the time series on each node is multivariate, we calculate the sum of squared DTW similarities over all variables. We treat a path as a bag of nodes, and calculate the similarity between two paths as their maximum node similarity. The DTW method does not model node correlations, which means the graph structure is not taken into consideration.
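For reference, the classic dynamic-programming DTW with squared pointwise cost can be sketched as follows; summing this quantity over the variables of a multivariate series gives the per-node value used above:

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW between two
    univariate sequences, with squared pointwise cost. dp[i][j] is the
    best cumulative cost of aligning a[:i] with b[:j]."""
    n, m = len(a), len(b)
    INF = float("inf")
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            dp[i][j] = cost + min(dp[i - 1][j],      # insertion
                                  dp[i][j - 1],      # deletion
                                  dp[i - 1][j - 1])  # match
    return dp[n][m]
```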
FC-LSTM uses a two-layer LSTM network to model temporal dependency, another LSTM layer for path representation, and a fully connected layer. In the two-layer LSTM, the first layer is initialized with zeros, and its last hidden state is used to initialize the second LSTM layer. The output dimension of both LSTM layers is 8. The LSTM path representation layer derives a fixed-length path representation. It works as follows:

Indexing the node representations of a path from the last hidden state of the previous LSTM.

Feeding this hidden representation sequence to an LSTM layer.

Taking the last hidden state of this LSTM as the final path representation.
The output dimension of this LSTM is also 8. FC-LSTM does not model node correlations, and it can be regarded as LRGCN with .
DCRNN uses two DCRNN layers to model the static graph, another LSTM layer for path representation, and a fully connected layer. The difference between DCRNN and FC-LSTM is that the former models node correlation as a diffusion process while the latter does not consider node correlation. For parameters, the maximum diffusion step is 3, and the output dimensions of both DCRNN and the LSTM are 8.
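The diffusion idea can be sketched as a truncated sum of powers of the transition matrix applied to node features. This simplified single-direction form with scalar coefficients (`thetas` is our name) only illustrates the concept behind DCRNN's diffusion convolution, not its full bidirectional, learned-filter version:

```python
import numpy as np

def diffusion_features(A, X, K, thetas):
    """Aggregate node features X by a K-step diffusion:
    sum over k = 0..K of thetas[k] * (D^{-1} A)^k X, where D^{-1} A is
    the out-degree-normalized transition matrix. Simplified sketch of
    the diffusion convolution used in DCRNN."""
    # row-normalize A into a transition matrix (guard empty rows)
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
    out = np.zeros_like(X)
    Pk = np.eye(A.shape[0])  # P^0
    for k in range(K + 1):
        out += thetas[k] * (Pk @ X)
        Pk = Pk @ P
    return out
```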
STGCN uses two STGCN layers to model the static graph, another LSTM layer for path representation, and a fully connected layer. STGCN models node correlation and temporal dependency with a three-layer convolutional structure, i.e., two temporal convolution layers with one GCN layer in between. For parameters, the graph convolution kernel size is set to 1 and the temporal convolution kernel size is set to 3. The output dimensions of both STGCN and the LSTM are 8.
LRGCN is the same as DCRNN except that we replace the first two DCRNN layers with LRGCN. The hidden dimension is 96.
LRGCN-SAPE (static): The difference between this method and LRGCN is that the path representation is derived by SAPE instead of an LSTM. For the parameters of SAPE, we set , and .
LRGCN-SAPE (evolving): The main advantage of LRGCN is that it can model time-evolving graphs. In this method, graph structure dynamics are modeled. The parameters of SAPE are set the same as in the above method.