Attention Models with Random Features for Multi-layered Graph Embeddings


Modern data analysis pipelines are becoming increasingly complex due to the presence of multi-view information sources. While graphs are effective in modeling complex relationships, in many scenarios a single graph is rarely sufficient to succinctly represent all interactions, and hence multi-layered graphs have become popular. Though this leads to richer representations, extending solutions from the single-graph case is not straightforward. Consequently, there is a strong need for novel solutions to solve classical problems, such as node classification, in the multi-layered case. In this paper, we consider the problem of semi-supervised learning with multi-layered graphs. Though deep network embeddings, e.g. DeepWalk, are widely adopted for community discovery, we argue that feature learning with random node attributes, using graph neural networks, can be more effective. To this end, we propose to use attention models for effective feature learning, and develop two novel architectures, GrAMME-SG and GrAMME-Fusion, that exploit the inter-layer dependencies for building multi-layered graph embeddings. Using empirical studies on several benchmark datasets, we evaluate the proposed approaches and demonstrate significant performance improvements in comparison to state-of-the-art network embedding strategies. The results also show that using simple random features is an effective choice, even in cases where explicit node attributes are not available.

Index Terms: Semi-supervised learning, multi-layered graphs, attention, deep learning, network embeddings

I Introduction

I-A Multi-layered Graph Embeddings

The prevalence of relational data in several real-world applications, e.g. social network analysis [1], recommendation systems [2] and neurological modeling [3], has led to crucial advances in machine learning techniques for graph-structured data. This encompasses a wide range of formulations to mine and gather insights from complex network datasets, including node classification [4], link prediction [5], community detection [6], influential node selection [7] and many others. Despite the variability in these formulations, a recurring idea in almost all of these approaches is to obtain embeddings for the nodes in a graph prior to carrying out the downstream learning task. In the simplest form, the adjacency matrix indicating the connectivities can be treated as naïve embeddings for the nodes. However, it is well known that such high-dimensional representations suffer from the curse of dimensionality and can be ineffective for the subsequent learning. Hence, there has been a long-standing interest in constructing low-dimensional embeddings that can best represent the network topology.

Until recently, the majority of existing work has focused on analysis and inference from a single network. However, with the emergence of multi-view datasets in real-world scenarios, commonly represented as multi-layered graphs, conventional inference tasks have become more challenging. Our definition of multi-layered graphs assumes complementary views of connectivity patterns for the same set of nodes, which requires modeling a complex dependency structure across the views. The heterogeneity in the relationships, while providing richer information, makes statistical inference challenging. Note that alternative definitions for multi-view networks exist in the literature [8], wherein the node sets can be different across layers (e.g. interdependent networks). Prior work on multi-layered graphs focuses extensively on unsupervised community detection, and can be broadly classified into two groups: methods that obtain a consensus community structure for producing node embeddings [9, 10, 11, 12], and methods that infer a separate embedding for a node in every layer, while exploiting the inter-layer dependencies, and produce multiple potential community associations for each node [13, 14]. In contrast to existing approaches, the goal of this work is to build multi-layered graph embeddings that can lead to improved node label prediction in a semi-supervised setting.

I-B Deep Learning on Graphs

Node embeddings can be inferred by optimizing a variety of measures that describe the graph structure. Examples include decomposition of the graph Laplacian [15], stochastic factorization of the adjacency matrix [16], and decomposition of the modularity matrix [6, 17]. The unprecedented success of deep learning with data defined on regular domains, e.g. images and speech, has motivated its extension to arbitrarily structured graphs. For example, Yang et al. [18] and Thiagarajan et al. [19] have proposed stacked auto-encoder style solutions that directly transform the objective measure into an undercomplete representation. An alternate class of approaches constructs embeddings using the distributional hypothesis, popularly adopted in language modeling [20], under which the co-occurrence of two nodes in short random walks implies a strong notion of semantic similarity. Examples include DeepWalk [21] and Node2Vec [22].

While the aforementioned approaches are effective in preserving network structure, semi-supervised learning with graph-structured data requires feature learning from node attributes in order to effectively propagate labels to unlabeled nodes. Since convolutional neural networks (CNNs) have been the mainstay for feature learning with data defined on regular grids, a natural idea is to generalize convolutions to graphs. Existing work on this generalization can be categorized into spectral approaches [23, 24], which operate on an explicit spectral representation of the graphs, and non-spectral approaches that define convolutions directly on the graphs using spatial neighborhoods [25, 26]. More recently, attention models [27] have been introduced as an effective alternative for graph data modeling. An attention model parameterizes the local dependencies to determine the most relevant parts of the graph to focus on while computing the features for a node. Unlike spectral approaches, attention models do not require an explicit definition of the Laplacian operator, and can support variable-sized neighborhoods. However, a key assumption with these feature learning methods is that we have access to node attributes, in addition to the network structure, which is not the case in several applications.

I-C Proposed Work

In this paper, we present a novel approach, GrAMME (Graph Attention Models for Multi-layered Embeddings), for constructing multi-layered graph embeddings using attention models. In contrast to the existing literature on community detection, we propose to perform feature learning in an end-to-end fashion with the node classification objective, and show that it is superior to employing separate stages of network embedding (e.g. DeepWalk) and classifier design. First, we argue that even in datasets that do not have explicit node attributes, using random features is a highly effective choice. Second, we show that attention models provide a powerful framework for modeling inter-layer dependencies, and can easily scale to a large number of layers. To this end, we develop two architectures, GrAMME-SG and GrAMME-Fusion, that employ deep attention models for semi-supervised learning. While the former approach introduces virtual edges between the layers and constructs a Supra Graph to parameterize dependencies, the latter approach builds layer-specific attention models and subsequently obtains consensus representations through fusion for label prediction. Using several benchmark multi-layered graph datasets, we demonstrate the effectiveness of random features and show that the proposed approaches significantly outperform state-of-the-art network embedding strategies such as DeepWalk. The main contributions of this work can be summarized as follows:

  • For the first time, we develop attention model architectures for multi-layered graphs in semi-supervised learning problems;

  • We propose the use of random attributes at nodes of a multi-layered graph for deep feature learning;

  • We introduce a weighting mechanism in graph attention to better utilize complementary information from multiple attention heads;

  • We develop the GrAMME-SG architecture, which uses attention models to parameterize virtual edges in a Supra Graph;

  • We develop the GrAMME-Fusion architecture, which performs layer-wise attention modeling and effectively fuses information from different layers;

  • We evaluate the proposed approaches on several benchmark datasets and show that they outperform existing network embedding strategies.

II Preliminaries

Definitions: A single-layered undirected, unweighted graph is represented by $G = (V, E)$, where $V$ denotes the set of nodes with cardinality $|V| = N$, and $E$ denotes the set of edges. A multi-layered graph is represented using a set of $L$ inter-dependent graphs $G^{(l)} = (V^{(l)}, E^{(l)})$ for $l = 1, \dots, L$, where there exists a node mapping between every pair of layers to indicate which vertices in one graph correspond to vertices in the other. In our setup, we assume that the node sets $V^{(l)}$ from all layers contain the same set of nodes, while the edge sets $E^{(l)}$ (each of cardinality $|E^{(l)}|$) are assumed to be different. In addition to the network structure, each node is endowed with a set of attributes, $\mathbf{x}_i \in \mathbb{R}^{D}$, $i \in \{1, \dots, N\}$, which can be used to construct latent representations, $\mathbf{z}_i \in \mathbb{R}^{d}$, where $d$ is the desired number of latent dimensions. Finally, each node is associated with a label $y_i$, which contains one of the $C$ pre-defined categories.

II-A Deep Network Embeddings

The scalability challenge of factorization techniques has motivated the use of deep learning methods to obtain node embeddings. The earliest work to report results in this direction was the DeepWalk algorithm by Perozzi et al. [21]. It draws an analogy between node sequences generated by short random walks on graphs and sentences in a document corpus. Given this formulation, the authors utilize popular language modeling tools to obtain latent representations for the nodes [28]. Let us consider a simple metric walk $\mathcal{W}_t$ in step $t$, which is rooted at the vertex $v_i$. The transition probability between the nodes $v_i$ and $v_j$ can be expressed as

$$P(\mathcal{W}_{t+1} = v_j \mid \mathcal{W}_t = v_i) = h\left(\|\mathbf{z}_i - \mathbf{z}_j\|_2\right), \tag{1}$$

where $\|\mathbf{z}_i - \mathbf{z}_j\|_2$ indicates the similarity metric between the two vertices in the latent space to be recovered and $h$ is a linking function that connects the vertex similarity to the actual co-occurrence probability. With an appropriate choice of the walk length, the true metric can be recovered accurately from the co-occurrence statistics inferred using random walks. Furthermore, it is important to note that the frequency with which vertices appear in the short random walks follows a power-law distribution, similar to words in natural language. Given a length-$S$ sequence of words, $(w_0, w_1, \dots, w_{S-1})$, where $w_s$ denotes a word in the vocabulary, neural word embeddings attempt to obtain vector spaces that can recover the likelihood of observing a word given its context, i.e., $P(w_s \mid w_0, w_1, \dots, w_{s-1})$ over all sequences. Extending this idea to the case of graphs, a random walk on the nodes, starting from node $v_i$, produces a node sequence analogous to sentences in language data.
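The analogy between random walks and sentences can be made concrete with a small sketch. The code below only generates the walk "corpus"; the language-model step that turns walks into embeddings is omitted. The function names, the toy graph, and the choice of uniform neighbor sampling are illustrative assumptions, not details taken from [21].

```python
import random

def random_walk(adjacency, start, walk_length, rng):
    """Return a truncated random walk rooted at `start` (uniform neighbor sampling)."""
    walk = [start]
    for _ in range(walk_length - 1):
        neighbors = adjacency[walk[-1]]
        if not neighbors:  # isolated node: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk

def build_corpus(adjacency, walks_per_node, walk_length, seed=0):
    """Generate `walks_per_node` walks from every node; each walk plays the role of a sentence."""
    rng = random.Random(seed)
    nodes = list(adjacency)
    corpus = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit the roots in a fresh random order on every pass
        for node in nodes:
            corpus.append(random_walk(adjacency, node, walk_length, rng))
    return corpus

# Hypothetical toy graph given as an adjacency list.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
corpus = build_corpus(toy_graph, walks_per_node=2, walk_length=5)
```

Each entry of `corpus` is a list of node IDs that a word-embedding model could consume as if it were a sentence.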

II-B Graph Attention Models

In this section, we discuss the recently proposed graph attention model [27], a variant of which is utilized in this paper to construct multi-layered graph embeddings. The attention mechanism is a widely adopted strategy in sequence-to-sequence modeling tasks, wherein a parameterized function is used to determine relevant parts of the input to focus on, in order to make decisions. A recent popular implementation of the attention mechanism in sequence models is the Transformer architecture by Vaswani et al. [29], which employs scaled dot-product attention to identify dependencies. Furthermore, this architecture uses a self-attention mechanism to capture dependencies within the same input and employs multiple attention heads to enhance the modeling power. These important components have been subsequently utilized in a variety of NLP tasks [30, 31] and clinical modeling [32].

One useful interpretation of self-attention is that it implicitly induces a graph structure for a given sequence, where the nodes are time-steps and the edges indicate temporal dependencies. Instead of a single attention graph, we can consider multiple graphs corresponding to the different attention heads, each of which can be interpreted as encoding a different type of edge and hence can provide complementary information about different types of dependencies. This naturally motivates the use of the attention mechanism in modeling graph-structured data. Recently, Veličković et al. [27] generalized this idea to create a graph attention layer that can be stacked to build effective deep architectures for semi-supervised learning tasks. In addition to supporting variability in neighborhood sizes and improving the model capacity, graph attention models are computationally more efficient than other graph convolutional networks. In this paper, we propose to utilize attention mechanisms to model multi-layered graphs.

Formulation: A head in the graph attention layer learns a latent representation for each node by aggregating the features from its neighbors. More specifically, the feature at a node is computed as the weighted combination of features from its neighbors, where the weights are obtained using the attention function. Following our notation, each node $v_i$ is endowed with a $D$-dimensional attribute vector $\mathbf{x}_i$, and hence the input to the graph attention layer is denoted by the set of attributes $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$. The attention layer subsequently produces $d$-dimensional latent representations $\{\mathbf{z}_1, \dots, \mathbf{z}_N\}$.

An attention head is constructed as follows: First, a linear transformation is applied to the features at each node, using a shared and trainable weight matrix $\mathbf{W} \in \mathbb{R}^{d \times D}$, thus producing intermediate representations,

$$\tilde{\mathbf{x}}_i = \mathbf{W} \mathbf{x}_i. \tag{2}$$

Subsequently, a scalar dot-product attention function is utilized to determine attention weights for every edge in the graph, based on features from the incident neighbors. Formally, the attention weight for the edge connecting the nodes $v_i$ and $v_j$ is computed as

$$e_{ij} = \langle \mathbf{A}, \tilde{\mathbf{x}}_i \,\|\, \tilde{\mathbf{x}}_j \rangle, \tag{3}$$

where $\mathbf{A} \in \mathbb{R}^{2d}$ denotes the parameters of the attention function, and $\tilde{\mathbf{x}}_i \,\|\, \tilde{\mathbf{x}}_j$ represents concatenation of features from nodes $v_i$ and $v_j$ respectively. The attention weights are computed with respect to every node in the neighborhood of $v_i$, i.e., for $v_j \in \mathcal{N}_i$, where $\mathcal{N}_i$ represents the neighborhood of $v_i$. Note that we include the self-edge for every node while implementing the attention function. The weights are then normalized across all neighboring nodes using a softmax function, thus producing the normalized attention coefficients,

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}. \tag{4}$$

Finally, the normalized attention coefficients are used to compute the latent representation at each node, through a weighted combination of the node features. Note that a non-linearity function $\sigma$ is also utilized at the end to improve the approximation:

$$\mathbf{z}_i = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} \tilde{\mathbf{x}}_j\right). \tag{5}$$

An important observation is that the attention weights are not required to be symmetric. For example, if a node $v_i$ has a strong influence on node $v_j$, it does not imply that node $v_j$ also has a strong influence on $v_i$, and hence $\alpha_{ij} \neq \alpha_{ji}$ in general. The operations from equations (2) to (5) constitute a single head. While this simple parameterization enables effective modeling of relationships in a graph while learning latent features, the modeling capacity can be significantly improved by considering multiple attention heads. Following the Transformer architecture [29], the output latent representations from the different heads can be aggregated using either concatenation or averaging operations.
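The four steps of a single attention head (linear transform, edge scoring on concatenated features, softmax over the neighborhood, weighted aggregation with a non-linearity) can be sketched in NumPy as follows. This is a minimal dense sketch under stated assumptions: `W` and `a` stand in for the shared weight matrix and the attention parameters, ReLU is an assumed choice for the unspecified non-linearity, and the toy graph is hypothetical.

```python
import numpy as np

def attention_head(X, adj, W, a):
    """One graph attention head on a dense adjacency matrix.

    X: (N, D) node attributes, adj: (N, N) 0/1 adjacency,
    W: (d, D) shared weight matrix, a: (2d,) attention parameters.
    Returns the (N, d) latent features and the (N, N) attention coefficients.
    """
    N, d = X.shape[0], W.shape[0]
    X_tilde = X @ W.T                                   # linear transform of every node
    # <a, x_i || x_j> decomposes into a term for node i plus a term for node j.
    scores = (X_tilde @ a[:d])[:, None] + (X_tilde @ a[d:])[None, :]
    mask = (adj + np.eye(N)) > 0                        # neighborhood, self-edge included
    scores = np.where(mask, scores, -np.inf)            # non-neighbors get zero weight
    scores = scores - scores.max(axis=1, keepdims=True) # numerical stability
    weights = np.exp(scores)
    alpha = weights / weights.sum(axis=1, keepdims=True)  # softmax over each neighborhood
    Z = np.maximum(alpha @ X_tilde, 0.0)                # ReLU (assumed non-linearity)
    return Z, alpha

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])       # hypothetical 3-node path graph
X = rng.standard_normal((3, 4))
Z, alpha = attention_head(X, adj, rng.standard_normal((2, 4)), rng.standard_normal(4))
```

Because the score for an ordered pair depends on which node is the source, `alpha` is generally not symmetric, which matches the observation about asymmetric attention weights.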

III Proposed Approaches

Fig. 1: 2-D visualization of the embeddings for the single-layer Cora dataset obtained using the proposed weighted attention mechanism.
Fig. 2: GrAMME-SG Architecture: Proposed approach for obtaining multi-layered graph embeddings with attention models applied to the Supra Graph, constructed by introducing virtual edges between layers.

In this section, we discuss the two proposed approaches for constructing multi-layered graph embeddings in semi-supervised learning problems. Before presenting the algorithmic details, we describe the attention mechanism used in our approach, which utilizes a weighting function to deal with multiple attention heads. Next, we motivate the use of randomized node attributes for effective feature learning. As described in Section I, in multi-layered graphs, the relationships between nodes are encoded using multiple edge sets. Consequently, while applying attention models to multi-layered graphs, a node in layer $l$ needs to update its hidden state using not only knowledge from its neighborhood in that layer, but also the shared information from other layers. Note that we assume no prior knowledge of the dependency structure, and rely solely on attention mechanisms to uncover the structure.

III-A Weighted Attention Mechanism

From the discussion in Section II-B, it is clear that latent representations from the multiple attention heads can provide complementary information about the node relationships. Hence, it is crucial to utilize that information to produce reliable embeddings for label propagation. When simple concatenation is used, as done in [27], an attention layer results in features of dimension $K \times d$, where $K$ is the number of attention heads. While this has been effective, one can gain improvements by performing a weighted combination of the attention heads, such that different heads can be assigned varying levels of importance. This is conceptually similar to the Weighted Transformer architecture proposed by Ahmed et al. [33]. For a node $v_i$, denoting the representations from the different heads as $\{\mathbf{z}_i^{(k)}\}_{k=1}^{K}$, the proposed weighted attention combines these representations as follows:

$$\hat{\mathbf{z}}_i = \sum_{k=1}^{K} \beta_k \mathbf{z}_i^{(k)}, \tag{6}$$

where $\beta_k$ denotes the scaling factor for head $k$; the scaling factors are trainable during the optimization. Note that the scaling factors are shared across all nodes and are constrained to be non-negative. Optionally, one can introduce the constraint $\sum_{k=1}^{K} \beta_k = 1$ into the formulation. However, we observed that its inclusion did not result in significant performance improvements in our experiments. Given a set of attention heads for a single graph layer, we refer to this weighting mechanism as a fusion head.
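A fusion head reduces to a weighted sum over the head axis. The sketch below illustrates that operation only; the text does not say how non-negativity is enforced, so clipping at zero is an assumption, and the array names are hypothetical.

```python
import numpy as np

def fusion_head(head_outputs, raw_scales):
    """Combine per-head features with one non-negative scaling factor per head.

    head_outputs: (K, N, d) features from K attention heads.
    raw_scales:   (K,) unconstrained trainable values.
    """
    beta = np.maximum(raw_scales, 0.0)               # assumed way to keep factors non-negative
    return np.tensordot(beta, head_outputs, axes=1)  # weighted sum over heads -> (N, d)

# Two hypothetical heads for 3 nodes with 2 latent dimensions.
heads = np.stack([np.ones((3, 2)), 2.0 * np.ones((3, 2))])
fused = fusion_head(heads, np.array([0.5, 0.25]))
```

Unlike concatenation, the fused output keeps the per-head dimension `d` rather than growing with the number of heads.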

Interestingly, we find that this modified attention mechanism produces robust embeddings, when compared to the graph attention layer proposed in [27], even with fewer attention heads. For example, let us consider Cora, a single-layered graph dataset, containing nodes (publications) belonging to one of classes. With the regular graph attention model, comprised of two attention layers with heads each, we obtained a test accuracy of ( training nodes). In contrast, our weighted attention, even with just heads, produces state-of-the-art accuracy of . Naturally, this leads to a significant reduction in the computational complexity of our architecture, which is even more beneficial when dealing with multi-layered graphs. Figure 1 illustrates a 2-D visualization (obtained using t-SNE) of the embeddings from our graph attention model.

III-B Using Randomized Node Attributes

With graph attention models, it is required to have access to node attributes (or features), which are then used to obtain the latent representations. However, in practice, multi-layered graph datasets are often comprised of only the edge sets, without any additional information. Consequently, in existing graph inferencing approaches (e.g. community detection), it is typical to adopt an unsupervised network embedding strategy, where the objective is to ensure that the learned representations preserve the network topology (i.e. neighborhoods). However, such an approach is not optimal for semi-supervised learning tasks, since the model parameters can be more effectively tuned using the task-specific objective, in an end-to-end fashion. In order to address this challenge, we propose to employ a randomized initialization strategy for creating node attributes. Interestingly, random initialization has been highly successful in creating word representations for NLP tasks, and in many scenarios its performance matches or even surpasses pre-trained word embeddings. With this initialization, the graph attention model can be used to obtain latent representations that maximally support label propagation in the input graph. Unlike fully supervised learning approaches, the embeddings for nodes that belong to the same class can still be vastly different, since the attention model fine-tunes the initial embeddings using only the locally connected neighbors. As we will show in our experiments, this simple initialization is effective, and our end-to-end training approach produces superior performance.
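As a minimal sketch of the randomized initialization, the snippet below draws one attribute matrix per graph layer. The sampling distribution is not stated in the text; a standard normal is assumed here, and the function name and sizes are hypothetical.

```python
import numpy as np

def random_attributes_per_layer(num_layers, num_nodes, dim, seed=0):
    """Draw random node attributes for every node in every layer.

    Returns an array of shape (num_layers, num_nodes, dim). The standard
    normal distribution is an assumption; any fixed random draw plays the
    same role of an initial embedding refined by end-to-end training.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_layers, num_nodes, dim))

# Hypothetical sizes: 3 layers, 29 nodes, 8-dimensional attributes.
X = random_attributes_per_layer(3, 29, 8, seed=1)
```

Fixing the seed keeps the attributes constant across training epochs, so that only the attention model parameters change during optimization.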

III-C Approach Description: GrAMME-SG

In this approach, we begin with the initial assumption that information is shared between all layers in a multi-layered graph, and use attention models to infer the actual dependencies, with the objective of improving label propagation performance. More specifically, we introduce virtual edges (also referred to as pillar edges [34]) between every node in a layer and its counterparts in other layers, resulting in a supra graph, $G_{sup}$. The diagonal blocks of the adjacency matrix for $G_{sup}$ contain the individual layers, while the off-diagonal blocks indicate the inter-layer connectivities. As illustrated in Figure 2, the virtual edges are introduced between nodes with the same ID across layers. This strategy is popular in recent community detection approaches [35], with the difference that those approaches connect nodes across layers only when they share similar neighborhoods. In contrast, we consider all possible connections for information flow, and rely on the attention model to guide the learning process. Note that it is possible for some of the layers to contain only a subset of the nodes. Given a multi-layered graph with $L$ layers and $N$ nodes per layer, the resulting supra graph is comprised of (at most) $NL$ nodes. Furthermore, the number of edges in the supra graph is upper bounded by $N^2 L + N L^2$, assuming that there are edges between every pair of nodes in every layer, as opposed to $N^2 L$ in the original multi-layered graph. The flexibility gained in modeling dependencies comes at the price of increased computational complexity, since we need to deal with a much larger graph.
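The supra graph construction can be sketched as a block matrix: the layer adjacency matrices sit on the diagonal and identity blocks connect each node to its counterparts in the other layers. The sketch assumes that every layer contains all nodes in the same order (the text notes this need not always hold); the function name and toy layers are hypothetical.

```python
import numpy as np

def build_supra_adjacency(layers):
    """Build the supra-graph adjacency from a list of (N, N) layer adjacencies.

    Diagonal blocks hold the individual layers; off-diagonal blocks are
    identity matrices, i.e. virtual edges between nodes with the same ID.
    """
    num_layers = len(layers)
    n = layers[0].shape[0]
    supra = np.zeros((num_layers * n, num_layers * n), dtype=int)
    for p in range(num_layers):
        for q in range(num_layers):
            block = layers[p] if p == q else np.eye(n, dtype=int)
            supra[p * n:(p + 1) * n, q * n:(q + 1) * n] = block
    return supra

# Two hypothetical layers over the same 3 nodes.
layer_a = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]])
layer_b = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]])
supra = build_supra_adjacency([layer_a, layer_b])
```

A dense matrix is used only for readability; for graphs with many nodes or layers a sparse representation would be the more practical choice.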

Fig. 3: GrAMME-Fusion Architecture: Proposed approach for obtaining multi-layered graph embeddings through fusion of representations from layer-wise attention models.

Following this, we generate random features of dimension $D$ at each of the nodes in $G_{sup}$ and build a stacked attention model for feature learning and label prediction. Our architecture is comprised of stacked graph attention layers, each of which contains $K$ attention heads and a fusion head to combine the complementary representations. As discussed earlier, an attention head first performs a linear transformation on the input features, and parameterizes the neighborhood dependencies to learn locally consistent features. The neighborhood size for each node can be different, and we also include self edges while computing the attention weights. Since we are using the supra graph in this case, the attention model also considers nodes from the other layers. This exploits the inter-layer dependencies and produces latent representations that can be influenced by neighbors in the other layers. Following the expression in Equation (5), the latent feature at a node $v_i$ in layer $l$ can be obtained using an attention head as follows:

$$\mathbf{z}_i^{(l)} = \sigma\left(\sum_{j \in \mathcal{N}_i^{(l)}} \alpha_{ij}^{(l)} \tilde{\mathbf{x}}_j\right), \tag{7}$$

where $\tilde{\mathbf{x}}_j$ denotes the linear-transformed feature vector for a node, $\mathcal{N}_i^{(l)}$ is the neighborhood of the node in the supra graph (including its counterparts in the other layers), and $\alpha_{ij}^{(l)}$ are the corresponding normalized attention coefficients. This is repeated with $K$ attention heads with different parameters, and subsequently a fusion head is used to combine those representations. Note that a fusion head is defined using $K$ scaling factors, denoting the importance of each of the heads. This operation can be formally stated as follows:

$$\hat{\mathbf{z}}_i^{(l)} = \sum_{k=1}^{K} \beta_k \mathbf{z}_i^{(l),k}. \tag{8}$$

Consequently, we obtain latent features of dimension $d$ for each node in $G_{sup}$, which are then sequentially processed using additional graph attention layers. Since the overall goal is to obtain a single label prediction for each node, there is a need to aggregate features for a node from different layers. For this purpose, we perform an across-layer average pooling and employ a feed-forward layer with softmax activation for the final prediction.
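The final aggregation step (across-layer average pooling followed by a softmax layer) can be sketched as below. The sketch assumes that the supra-graph features are stored layer by layer, with the same node order in each layer; `W_out` and `b_out` are hypothetical parameters of the feed-forward layer.

```python
import numpy as np

def predict_from_supra_features(H, num_layers, W_out, b_out):
    """Average each node's features over layers, then apply a softmax classifier.

    H: (L*N, d) features of supra-graph nodes, ordered layer by layer (assumed).
    W_out: (d, C) weights and b_out: (C,) bias of the output layer.
    Returns (N, C) class probabilities, one row per original node.
    """
    n = H.shape[0] // num_layers
    pooled = H.reshape(num_layers, n, -1).mean(axis=0)    # across-layer average pooling
    logits = pooled @ W_out + b_out
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Hypothetical example: 2 layers, 4 nodes, 3 latent dimensions, 2 classes.
N, d, C, L = 4, 3, 2, 2
H = np.vstack([np.zeros((N, d)), 2.0 * np.ones((N, d))])
rng = np.random.default_rng(0)
probs = predict_from_supra_features(H, L, rng.standard_normal((d, C)), np.zeros(C))
```

The pooling collapses the $L$ copies of each node into one feature vector, so the classifier produces exactly one prediction per node.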

Fig. 4: Convergence characteristics of the proposed GrAMME-Fusion architecture. Panels (a), (b) and (c) correspond to three different proportions of train nodes.

III-D Approach Description: GrAMME-Fusion

While the GrAMME-SG approach provides complete flexibility in dealing with dependencies, the complexity of handling large supra graphs is an inherent challenge. Hence, we introduce another architecture, GrAMME-Fusion, which builds only layer-wise attention models, and introduces a supra fusion layer that exploits inter-layer dependencies using only fusion heads. As described in Section III-A, a fusion head computes a simple weighted combination and hence is computationally cheap. For simplicity, we assume that the same attention model architecture is used for every layer, although that is not required. This approach is motivated by the observation that the attention heads in our feature learning architecture and the different layers in a multi-layered graph both provide complementary views of the same data, and hence they can be handled similarly using fusion heads. In contrast, GrAMME-SG considers each node in every layer as a separate entity. Figure 3 illustrates the GrAMME-Fusion architecture.

| Dataset | Type | # Nodes | # Layers | # Total edges | # Classes |
|---|---|---|---|---|---|
| Vickers-Chan | Classroom social structure | 29 | 3 | 740 | 2 |
| Congress Votes | Bill voting structure among senators | 435 | 4 | 358,338 | 2 |
| Leskovec-Ng | Academic collaboration | 191 | 4 | 1,836 | 2 |
| Reinnovation | Global innovation index similarities | 145 | 12 | 18,648 | 3 |
| Mammography | Mammographic Masses | 961 | 5 | 1,979,115 | 2 |
| Balance Scale | Psychological assessment | 625 | 4 | 312,500 | 3 |

TABLE I: Summary of the datasets used in our empirical studies.

Initially, each graph layer is processed using an attention model comprised of stacked graph attention layers, each of which implements $K$ attention heads and a fusion head, to construct layer-specific latent representations. Though the processing of the layers can be parallelized, the computational complexity is dominated by the number of heads in each model. Next, we construct a supra fusion layer which is designed extensively using fusion heads in order to parameterize the dependencies between layers. In other words, we create $M$ fusion heads with scaling factors $\{\gamma_l^{(m)}\}_{l=1}^{L}$, $m = 1, \dots, M$, in order to combine the representations from the $L$ layer-specific attention models. Note that we use multiple fusion heads to allow different parameterizations for assigning importance to each of the layers. This is conceptually similar to using multiple attention heads. Finally, we use an overall fusion head, with scaling factors $\{\delta_m\}_{m=1}^{M}$, to obtain a consensus representation from the multiple fusion heads. One can optionally introduce an additional feed-forward layer prior to employing the overall fusion to improve the model capacity. The output from the supra fusion layer is used to make the prediction through a fully-connected layer with softmax activation. The interplay between the parameters $K$ (layer-wise attention heads) and $M$ (fusion heads in the supra fusion layer) controls the effectiveness and complexity of this approach.
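The supra fusion layer amounts to two nested weighted sums: several fusion heads each weight the layer-specific representations, and an overall fusion head weights the outputs of those heads. The sketch below shows only this combination step, without the optional feed-forward layer. The array names are hypothetical, and clipping the scaling factors at zero is an assumption carried over from the non-negativity constraint of Section III-A.

```python
import numpy as np

def supra_fusion(layer_feats, layer_scales, overall_scales):
    """Fuse layer-specific features into one consensus representation per node.

    layer_feats:    (L, N, d) outputs of the L layer-wise attention models.
    layer_scales:   (M, L) one row of per-layer scaling factors per fusion head.
    overall_scales: (M,) scaling factors of the overall fusion head.
    """
    gamma = np.maximum(layer_scales, 0.0)      # assumed non-negativity via clipping
    delta = np.maximum(overall_scales, 0.0)
    per_head = np.einsum("ml,lnd->mnd", gamma, layer_feats)  # M fused views, each (N, d)
    return np.einsum("m,mnd->nd", delta, per_head)           # consensus, (N, d)

# Hypothetical example: 2 graph layers, 4 nodes, 2 latent dimensions, 2 fusion heads.
layer_feats = np.stack([np.ones((4, 2)), 3.0 * np.ones((4, 2))])
consensus = supra_fusion(layer_feats,
                         np.array([[1.0, 0.0], [0.5, 0.5]]),
                         np.array([0.5, 0.5]))
```

No attention coefficients are computed across layers here, which is why this step is much cheaper than running attention on a supra graph.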

IV Empirical Studies

In this section, we evaluate the proposed approaches by performing semi-supervised learning with benchmark multi-layered graph datasets. Our experiments study the behavior of our approaches, with varying amounts of labeled nodes, and cross-validated with different train-test splits. Though the proposed approaches can be utilized for inductive learning, we restrict our experiments to transductive tasks. For each dataset and experiment, we select labeled nodes uniformly at random, while fixing the amount of labeled nodes. We begin by describing the datasets considered for our study, and then briefly discuss the baseline techniques based on deep network embeddings.
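The evaluation protocol (a fixed amount of labeled nodes, selected uniformly at random, repeated over several train-test splits) can be sketched as follows. The function name, the rounding rule for the number of train nodes, and the example sizes are illustrative assumptions.

```python
import numpy as np

def transductive_splits(num_nodes, train_fraction, num_realizations, seed=0):
    """Yield (train_idx, test_idx) pairs with a fixed number of labeled nodes.

    Train nodes are drawn uniformly at random without replacement; all
    remaining nodes are used for testing in the transductive setting.
    """
    rng = np.random.default_rng(seed)
    num_train = max(1, int(round(train_fraction * num_nodes)))  # assumed rounding rule
    for _ in range(num_realizations):
        perm = rng.permutation(num_nodes)
        yield perm[:num_train], perm[num_train:]

# Hypothetical usage: 29 nodes, 10% labeled, 5 independent realizations.
splits = list(transductive_splits(29, 0.10, 5))
```

Averaging the test accuracy over the realizations then gives one number per dataset and labeling rate.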

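| Dataset | % Train nodes | DeepWalk (baseline) | DeepWalk-SG (baseline) | GrAMME-SG | GrAMME-Fusion (config. 1) | GrAMME-Fusion (config. 2) |
|---|---|---|---|---|---|---|
| Vickers | 10% | 94.60 | 95.55 | 98.94 | 99.21 | 99.21 |
| Vickers | 20% | 95.26 | 95.83 | 98.94 | 99.21 | 99.21 |
| Vickers | 30% | 96.10 | 96.19 | 98.94 | 99.21 | 99.21 |
| Congress Votes | 10% | 98.82 | 98.00 | 96.02 | 100 | 100 |
| Congress Votes | 20% | 99.90 | 99.10 | 96.87 | 100 | 100 |
| Congress Votes | 30% | 99.91 | 99.63 | 97.33 | 100 | 100 |
| Leskovec-Ng | 10% | 92.89 | 94.52 | 91.56 | 92.95 | 93.32 |
| Leskovec-Ng | 20% | 96.96 | 97.82 | 96.25 | 96.84 | 97.62 |
| Leskovec-Ng | 30% | 98.09 | 98.11 | 98.30 | 98.72 | 98.73 |
| Reinnovation | 10% | 69.26 | 67.23 | 76.42 | 74.41 | 75.28 |
| Reinnovation | 20% | 72.12 | 70.61 | 80.72 | 79.61 | 79.00 |
| Reinnovation | 30% | 73.46 | 70.55 | 83.16 | 81.97 | 80.95 |
| Mammography | 10% | 73.30 | 71.65 | 82.27 | 82.57 | 82.63 |
| Mammography | 20% | 69.86 | 70.68 | 83.01 | 83.20 | 83.28 |
| Mammography | 30% | 77.21 | 77.04 | 83.06 | 83.74 | 83.75 |
| Balance Scale | 10% | 81.80 | 81.39 | 77.67 | 80.13 | 80.15 |
| Balance Scale | 20% | 86.48 | 85.69 | 78.67 | 86.5 | 86.58 |
| Balance Scale | 30% | 89.19 | 86.41 | 79.10 | 87.84 | 88.72 |

TABLE II: Semi-supervised learning performance (accuracy, %) of the proposed multi-layered attention architectures on the benchmark datasets. The two GrAMME-Fusion columns correspond to the two configurations considered in the experiment setup. The results reported were obtained by averaging over independent realizations.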

IV-A Datasets

We describe in detail the multi-layered graph datasets used for evaluation. A summary of the datasets can be found in Table I.

(i) Vickers-Chan: The Vickers-Chan [36] dataset represents the social structure of students from a school in Victoria, Australia. Each node represents a student studying in 7th grade, and the three graph layers are constructed based on student responses to the following three questions: (i) who did they get along with in the class, (ii) who are their best friends in the class, and (iii) who do they prefer to work with. The dataset is comprised of 29 nodes and their gender value is used as the label in our learning formulation.

(ii) Congress Votes: The Congress votes [37] dataset is obtained from the 1984 United States Congressional Voting Records Database. This includes votes from every congressman from the U.S. House of Representatives for 4 different bills, which results in a 4-layered graph. The dataset is comprised of 435 nodes and they are labeled as either democrats or republicans. For every layer, we establish an edge between two nodes in the corresponding layer if those two congressmen voted similarly (“yes” or “no”).

(iii) Leskovec-Ng: This dataset [38] is a temporal collaboration network of professors Jure Leskovec and Andrew Ng. The co-authorship information is partitioned into consecutive multi-year intervals, in order to construct a 4-layered graph. In any layer, two researchers are connected by an edge if they co-authored at least one paper in the considered interval. Each researcher is labeled as affiliated to either Leskovec’s or Ng’s group.

(iv) Reinnovation: This dataset describes the Global Innovation Index for 145 countries, which form the nodes of the graph. For each node, the label represents the development level of the corresponding country. There are 3 levels of development, which form the 3 classes. Each layer in the graph is constructed based on similarities between countries in different sectors. The sectors include infrastructure, institutions, labor market, financial market etc. This graph contains 12 layers in total.

(v) Mammography: This dataset contains information about mammographic mass lesions from 961 subjects. We consider 5 different attributes, namely the BI-RADS assessment, subject age, shape, margin, and density of the lesion, in order to construct the different layers of the graph. This data is quite challenging due to the presence of nearly 2 million edges. Conventional network embedding techniques that rely on sparsity of the graphs can be particularly ineffective in these scenarios. Finally, the lesions are marked as either benign or malignant, to define the labels.

(vi) Balance Scale: The final dataset that we consider is the UCI Balance scale dataset, which summarizes the results from a psychological experiment. Using 4 different attributes characterizing the subject, namely the left weight, the left distance, the right weight, and the right distance, we constructed a 4-layered graph. Each subject (or node) is classified as having the balance scale tip to the right, tip to the left, or be balanced.
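The rule used for the Congress Votes layers (one layer per bill, with an edge between two congressmen who voted the same way on that bill) can be sketched as follows. The binary vote encoding and the function name are assumptions; how missing votes or abstentions are treated is not stated in the text and is not handled here.

```python
import numpy as np

def layers_from_votes(votes):
    """Build one graph layer per bill from a vote matrix.

    votes: (N, B) array with 1 for "yes" and 0 for "no" (assumed encoding).
    Returns a list of B symmetric (N, N) adjacency matrices, where two nodes
    are connected in layer b if they cast the same vote on bill b.
    """
    votes = np.asarray(votes)
    layers = []
    for b in range(votes.shape[1]):
        column = votes[:, b]
        same = (column[:, None] == column[None, :]).astype(int)
        np.fill_diagonal(same, 0)  # self-edges are left to the attention model
        layers.append(same)
    return layers

# Hypothetical votes of 3 members on 2 bills.
layers = layers_from_votes([[1, 0], [1, 1], [0, 1]])
```

Because every pair of like-voting members is connected, layers built this way are dense, which is consistent with the large edge count reported for this dataset in Table I.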

IV-B Baselines

We use the following two baselines to compare the performance of the proposed approaches. Given that the datasets considered do not contain specific node attributes to perform feature learning, the natural approach is to obtain embeddings for each node in every layer, using deep network embedding techniques, and to subsequently build a classifier model using the resulting features. Following recent approaches such as [8], we choose DeepWalk, which is a state-of-the-art embedding technique, for obtaining deep embeddings. In particular, we consider two different variants: (i) DeepWalk: We treat each layer in the multi-layered graph as independent, and obtain embeddings from the layers separately. Finally, we concatenate the embeddings for each node from the different layers and build a multi-layer perceptron to perform the classification; (ii) DeepWalk-SG: We construct a supra graph by introducing virtual edges between nodes across layers (as described in Section III-C) and perform DeepWalk on the supra graph. Finally, the embeddings are concatenated as in the previous case and the classifier is designed. Though the former approach does not exploit the inter-layer information, it can still be effective in cases where the neighborhood structure varies significantly across layers, precisely because it treats the layers independently.
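The two data-preparation steps behind these baselines can be sketched as below. The sketch makes an explicit assumption that is not confirmed by this section: virtual edges are taken to connect each node to its own copies in all other layers. The function names `build_supra_adjacency` and `concat_layer_embeddings` are hypothetical, and the DeepWalk step itself is omitted.

```python
import numpy as np


def build_supra_adjacency(layers):
    """Stack L layer graphs into one (L*N) x (L*N) supra adjacency.

    layers: (n_layers, n_nodes, n_nodes) array of adjacency matrices.
    Assumption: a virtual edge links every node to its counterpart
    in each of the other layers.
    """
    layers = np.asarray(layers)
    n_layers, n_nodes, _ = layers.shape
    # Identity blocks at every off-diagonal (layer, layer') position.
    off_diag = np.ones((n_layers, n_layers)) - np.eye(n_layers)
    supra = np.kron(off_diag, np.eye(n_nodes))
    # Within-layer edges go on the block diagonal.
    for l in range(n_layers):
        s = l * n_nodes
        supra[s:s + n_nodes, s:s + n_nodes] = layers[l]
    return supra


def concat_layer_embeddings(embeddings):
    """Concatenate per-layer embeddings into one feature vector per node.

    embeddings: (n_layers, n_nodes, dim) array.
    Returns an (n_nodes, n_layers * dim) array.
    """
    embeddings = np.asarray(embeddings)
    n_layers, n_nodes, dim = embeddings.shape
    return embeddings.transpose(1, 0, 2).reshape(n_nodes, n_layers * dim)
```

In the DeepWalk variant, `concat_layer_embeddings` would be applied to embeddings learned independently per layer; in the DeepWalk-SG variant, the embeddings would instead be learned on the graph returned by `build_supra_adjacency` before the same concatenation.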

Fig. 5: Visualization of the embeddings, for two different datasets, obtained using the GrAMME-Fusion architecture with parameters , and respectively. We also show the initial randomized features for reference. Panels: (a) Congress Votes – Initial; (b) Congress Votes – Final; (c) Mammography – Initial; (d) Mammography – Final.

IV-C Experiment Setup

In this section, we describe the experiment setup in detail. For both of the proposed approaches, we considered architectures with graph (weighted) attention layers, and fixed the input feature dimension . The number of hidden dimensions for both attention layers was fixed at . We ran our experiments in a transductive learning setting. As described earlier, we begin by creating random node attributes of dimension in every layer. For the GrAMME-SG architecture, we used attention heads and a single fusion head. On the other hand, in the GrAMME-Fusion approach, we experimented with (no fusion head) and (one fusion head) for each of the layers in the graph. Furthermore, in the supra fusion layer, we used fusion heads. All networks were trained with the Adam optimizer, with the learning rate fixed at . In order to study the sensitivity of the proposed approaches to varying levels of training data availability, we varied the percentage of training nodes from to . We repeated the experiments over independent realizations of the train-test split and report the average performance in all cases. The performance of the algorithms was measured using the overall accuracy score. The DeepWalk algorithm was run with the number of walks fixed at , and the embedding sizes were fixed at and respectively for the two baselines.
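The evaluation protocol described above (random node attributes, repeated random train-test splits in a transductive setting, and averaged accuracy) can be sketched as follows. All names here are hypothetical, the Gaussian distribution for the random attributes is an assumption, and the actual dimensions, training percentages, and number of trials used in the paper are not reproduced; the caller supplies them.

```python
import numpy as np


def random_node_attributes(n_nodes, n_layers, dim, rng):
    """Draw random features for every node in every layer.

    Assumption: standard Gaussian entries; the source only states
    that the attributes are random.
    """
    return rng.standard_normal((n_layers, n_nodes, dim))


def transductive_split(n_nodes, train_fraction, rng):
    """Randomly partition node indices into labeled and held-out sets."""
    perm = rng.permutation(n_nodes)
    n_train = max(1, int(round(train_fraction * n_nodes)))
    return perm[:n_train], perm[n_train:]


def average_accuracy(fit_predict, labels, train_fraction, n_trials, seed=0):
    """Average test accuracy over independent train-test realizations.

    fit_predict(train_idx, train_labels, test_idx) -> predicted labels
    is a placeholder for any model (e.g. an attention network that
    sees the whole graph but only the training labels).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    scores = []
    for _ in range(n_trials):
        train_idx, test_idx = transductive_split(len(labels), train_fraction, rng)
        pred = fit_predict(train_idx, labels[train_idx], test_idx)
        scores.append(np.mean(pred == labels[test_idx]))
    return float(np.mean(scores))


def majority_class(train_idx, train_labels, test_idx):
    """Trivial stand-in model: predict the most frequent training label."""
    values, counts = np.unique(train_labels, return_counts=True)
    return np.full(len(test_idx), values[np.argmax(counts)])
```

Passing a trivial predictor such as `majority_class` gives a quick sanity floor against which a trained model's averaged accuracy can be read.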

IV-D Results

Table II summarizes the performance of our approaches on the multi-layered graph datasets, along with the baseline results. As described in Section III-C, increasing the number of attention heads in GrAMME-SG increases the computational complexity. However, GrAMME-Fusion is computationally efficient, since it employs multiple fusion heads (supra fusion layer), while simplifying the layer-wise attention models. In fact, we report results obtained by using a single attention head in each layer. Figure 4 illustrates the convergence characteristics of the proposed GrAMME-Fusion architecture under different training settings for the Mammography dataset. As can be observed, even with the complex graph structure (around million edges), the proposed solutions demonstrate good convergence characteristics.

From the reported results, we make the following observations: On most datasets, the proposed attention-based approaches significantly outperform the baseline techniques, providing highly robust models even when the training size was fixed at . In particular, with challenging datasets such as Reinnovation and Mammography, the proposed approaches achieve improvements of over network embedding techniques. This clearly demonstrates the effectiveness of both attention-based feature learning and random features in multi-layer graph analysis. The Balance Scale dataset is a representative example of scenarios where the neighborhood structure varies significantly across layers, and consequently the baseline DeepWalk approach performs competitively with the GrAMME approaches that take inter-layer dependencies into account. When comparing the two proposed approaches, though GrAMME-SG provides improved flexibility by allowing information flow between all layers, GrAMME-Fusion consistently outperforms it, while also being significantly cheaper. Interestingly, with GrAMME-Fusion, increasing the number of attention heads does not lead to significant performance improvements, demonstrating the effectiveness of the supra fusion layer.

Finally, we visualize the multi-layered graph embeddings to qualitatively understand the behavior of the feature learning approaches. More specifically, we show the D t-SNE visualizations of the hidden representations for the Congress Votes and Mammography datasets, obtained using GrAMME-Fusion. Figure 5 shows the initial random features and the learned representations, wherein the effectiveness of the attention mechanism in revealing the class structure is clearly evident.

V Related Work

In this section, we briefly review prior work on deep feature learning for graph datasets, and multi-layered graph analysis. Note that the proposed techniques are built on the graph attention networks recently proposed by [27].

V-A Feature Learning with Graph-Structured Data

Performing data-driven feature learning with graph-structured data has gained a lot of interest, thanks to the recent advances in generalizing deep learning to non-Euclidean domains. The earliest application of neural networks to graph data can be seen in [39, 40], wherein recursive models were utilized to model dependencies. More formal generalizations of recurrent neural networks to graph analysis were later proposed in [41, 42]. Given the success of convolutional neural networks in feature learning from data defined on regular grids (e.g. images), the next generation of graph neural networks focused on performing graph convolutions efficiently. This implied that the feature learning was carried out to transform signals defined at nodes into meaningful latent representations, akin to filtering of signals [43]. Since the spatial convolution operation cannot be directly defined on arbitrary graphs, a variety of spectral domain and neighborhood based techniques have been developed.

Spectral approaches, as the name suggests, operate using the spectral representation of graph signals, defined using the eigenvectors of the graph Laplacian. For example, in [23], convolutions are realized as multiplications in the graph Fourier domain. However, since the filters cannot be spatially localized on arbitrary graphs, this relies on explicit computation of the spectrum based on matrix inversion. Consequently, special families of spatially localized filters have been considered. Examples include the localization technique in [44], and Chebyshev polynomial expansion based localization in [24]. Building upon this idea, Kipf and Welling [4] introduced graph convolutional neural networks (GCN) using a localized first-order approximation of spectral graph convolutions, wherein the filters operate within a one-step neighborhood, thus making it scalable to even large networks. On the other hand, with non-spectral approaches, convolutions are defined directly on graphs and they are capable of working with different sized neighborhoods. For example, localized spatial filters with different weight matrices for varying node degrees are learned in [25]. In contrast, in approaches such as [26], the neighborhood of each node is normalized to achieve a fixed-size neighborhood. More recently, attention models, which are commonly used to model temporal dependencies in sequence modeling tasks, were generalized to model neighborhood structure in graphs. More specifically, graph attention networks [27] employ dot product based self attention mechanisms to perform feature learning in semi-supervised learning problems. While these methods have produced state-of-the-art results in the case of single-layer graphs, to the best of our knowledge, no generalization exists for multi-layered graphs, which is the focus of this paper. In particular, we build solutions for scenarios where no explicit node attributes are available.
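To make the one-step neighborhood filtering of [4] mentioned above concrete, the following is a minimal NumPy sketch of a single GCN propagation step, relu(D^{-1/2} (A + I) D^{-1/2} H W), where D is the degree matrix of A + I. The function name `gcn_layer` and the dense-matrix formulation are illustrative choices made here; practical implementations typically use sparse operations and learn W by backpropagation.

```python
import numpy as np


def gcn_layer(adj, features, weights):
    """One first-order graph convolution step with ReLU activation.

    adj:      (n, n) symmetric adjacency matrix without self-loops
    features: (n, d_in) node features H
    weights:  (d_in, d_out) filter parameters W
    """
    adj = np.asarray(adj, dtype=float)
    # Add self-loops so every node also aggregates its own features.
    a_tilde = adj + np.eye(adj.shape[0])
    # Symmetric degree normalization of the self-looped adjacency.
    deg = a_tilde.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    a_hat = a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Aggregate over the one-step neighborhood, then transform.
    return np.maximum(a_hat @ features @ weights, 0.0)
```

Each output row depends only on a node and its immediate neighbors, which is the localization property that makes the approach scale to large graphs.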

V-B Multi-layered Graph Analysis

Analysis and inferencing with multi-layered graphs is a challenging, yet crucial problem in data mining. With each layer characterizing a specific kind of relationship, the multi-layered graph is a comprehensive representation of relationships between nodes, which can be utilized to gain insights about complex datasets. Although the multi-layered representation is more comprehensive, a question that naturally arises is how to effectively fuse the information. Most existing work in the literature focuses on community detection, and an important class of approaches tackles this problem through joint factorization of the multiple graph adjacency matrices to infer embeddings [45, 9]. In [46], the symmetric non-negative matrix tri-factorization algorithm is utilized in order to factorize the adjacencies into non-negative matrices including a shared cluster indicator matrix. Other alternative approaches include subgraph pattern mining [47, 48] and information-theoretic optimization based on Minimum Description Length [49]. A comprehensive survey studying the algorithms and datasets on this topic can be found in [34]. A unified optimization framework is developed in [8] to model within-layer connections and cross-layer connections simultaneously, to generate node embeddings for interdependent networks. Recently, Song and Thiagarajan [35] proposed to generalize the DeepWalk algorithm to the case of multi-layered graphs, through optimization with proxy clustering costs, and showed that the resulting embeddings produce state-of-the-art results. In contrast to these approaches, we consider the problem of semi-supervised learning and develop novel feature learning techniques for the multi-layered case.

VI Conclusions

In this paper, we introduced two novel architectures, GrAMME-SG and GrAMME-Fusion, for semi-supervised node classification with multi-layered graph data. Our architectures utilize randomized node attributes, and effectively fuse information from both within-layer and across-layer connectivities, through the use of a weighted attention mechanism. While GrAMME-SG provides complete flexibility by allowing virtual edges between all layers, GrAMME-Fusion exploits inter-layer dependencies using fusion heads, operating on layer-wise hidden representations. Experimental results show that our models consistently outperform existing node embedding techniques. As part of the future work, the proposed solution can be naturally extended to the cases of multi-modal networks and interdependent networks. Furthermore, studying the effectiveness of simple and scalable attention models in other challenging graph inferencing tasks such as multi-layered link prediction and influential node selection remains an important open problem.


This work was supported in part by the NSF I/UCRC Arizona State University Site, award 1540040 and the SenSIP center. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. We also thank Prasanna Sattigeri for the useful discussions, and sharing data.


  1. Nathan Eagle and Alex Sandy Pentland, “Reality mining: sensing complex social systems,” Personal and ubiquitous computing, vol. 10, no. 4, pp. 255–268, 2006.
  2. Nikhil Rao, Hsiang-Fu Yu, Pradeep K Ravikumar, and Inderjit S Dhillon, “Collaborative filtering with graph information: Consistency and scalable methods,” in Advances in neural information processing systems, 2015, pp. 2107–2115.
  3. Alex Fornito, Andrew Zalesky, and Michael Breakspear, “Graph analysis of the human connectome: promise, progress, and pitfalls,” Neuroimage, vol. 80, pp. 426–444, 2013.
  4. Thomas N Kipf and Max Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
  5. M. Zhang and Y. Chen, “Link Prediction Based on Graph Neural Networks,” 2018.
  6. Mark EJ Newman, “Finding community structure in networks using the eigenvectors of matrices,” Physical review E, vol. 74, no. 3, pp. 036104, 2006.
  7. Qunwei Li, Bhavya Kailkhura, Jayaraman Thiagarajan, Zhenliang Zhang, and Pramod Varshney, “Influential node detection in implicit social networks using multi-task gaussian copula models,” in NIPS 2016 Time Series Workshop, 2017, pp. 27–37.
  8. Jundong Li, Chen Chen, Hanghang Tong, and Huan Liu, “Multi-layered network embedding,” in Proceedings of the 2018 SIAM International Conference on Data Mining. SIAM, 2018, pp. 684–692.
  9. Xiaowen Dong, Pascal Frossard, Pierre Vandergheynst, and Nikolai Nefedov, “Clustering with multi-layer graphs: A spectral perspective,” IEEE Transactions on Signal Processing, vol. 60, no. 11, pp. 5820–5831, 2012.
  10. Xiaowen Dong, Pascal Frossard, Pierre Vandergheynst, and Nikolai Nefedov, “Clustering on multi-layer graphs via subspace analysis on grassmann manifolds,” IEEE Transactions on signal processing, vol. 62, no. 4, pp. 905–918, 2014.
  11. Jungeun Kim, Jae-Gil Lee, and Sungsu Lim, “Differential flattening: A novel framework for community detection in multi-layer graphs,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 8, no. 2, pp. 27, 2017.
  12. Andrea Tagarelli, Alessia Amelio, and Francesco Gullo, “Ensemble-based community detection in multilayer networks,” Data Mining and Knowledge Discovery, vol. 31, no. 5, pp. 1506–1543, 2017.
  13. Peter J Mucha, Thomas Richardson, Kevin Macon, Mason A Porter, and Jukka-Pekka Onnela, “Community structure in time-dependent, multiscale, and multiplex networks,” science, vol. 328, no. 5980, pp. 876–878, 2010.
  14. Marya Bazzi, Mason A Porter, Stacy Williams, Mark McDonald, Daniel J Fenn, and Sam D Howison, “Community detection in temporal multilayer networks, with an application to correlation networks,” Multiscale Modeling & Simulation, vol. 14, no. 1, pp. 1–41, 2016.
  15. Andrew Y Ng, Michael I Jordan, and Yair Weiss, “On spectral clustering: Analysis and an algorithm,” in Advances in neural information processing systems, 2002, pp. 849–856.
  16. Amr Ahmed, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and Alexander J Smola, “Distributed large-scale natural graph factorization,” in Proceedings of the 22nd international conference on World Wide Web. ACM, 2013, pp. 37–48.
  17. Mingming Chen, Konstantin Kuzmin, and Boleslaw K Szymanski, “Community detection via maximization of modularity and its variants,” IEEE Transactions on Computational Social Systems, vol. 1, no. 1, pp. 46–65, 2014.
  18. Liang Yang, Xiaochun Cao, Dongxiao He, Chuan Wang, Xiao Wang, and Weixiong Zhang, “Modularity based community detection with deep learning.,” in IJCAI, 2016, pp. 2252–2258.
  19. Jayaraman J Thiagarajan, Prasanna Sattigeri, Karthikeyan Natesan Ramamurthy, and Bhavya Kailkhura, “Robust local scaling using conditional quantiles of graph similarities,” in Data Mining Workshops (ICDMW), 2016 IEEE 16th International Conference on. IEEE, 2016, pp. 762–769.
  20. Z.S. Harris, Distributional Structure, 1954.
  21. Bryan Perozzi, Rami Al-Rfou, and Steven Skiena, “Deepwalk: Online learning of social representations,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 701–710.
  22. Aditya Grover and Jure Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 855–864.
  23. Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.
  24. Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
  25. David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Advances in neural information processing systems, 2015, pp. 2224–2232.
  26. Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov, “Learning convolutional neural networks for graphs,” in International conference on machine learning, 2016, pp. 2014–2023.
  27. Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
  28. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
  29. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
  30. Zhen Yang, Wei Chen, Feng Wang, and Bo Xu, “Improving neural machine translation with conditional sequence generative adversarial nets,” arXiv preprint arXiv:1703.04887, 2017.
  31. Antonio Valerio Miceli Barone, Jindřich Helcl, Rico Sennrich, Barry Haddow, and Alexandra Birch, “Deep architectures for neural machine translation,” arXiv preprint arXiv:1707.07631, 2017.
  32. Huan Song, Deepta Rajan, Jayaraman J Thiagarajan, and Andreas Spanias, “Attend and diagnose: Clinical time series analysis using attention models,” arXiv preprint arXiv:1711.03905, 2017.
  33. Karim Ahmed, Nitish Shirish Keskar, and Richard Socher, “Weighted transformer network for machine translation,” CoRR, vol. abs/1711.02132, 2017.
  34. Jungeun Kim and Jae-Gil Lee, “Community detection in multi-layer graphs: A survey,” ACM SIGMOD Record, vol. 44, no. 3, pp. 37–48, 2015.
  35. Huan Song and Jayaraman J Thiagarajan, “Improved community detection using deep embeddings from multi-layer graphs,” arXiv preprint, 2018.
  36. M Vickers and S Chan, “Representing classroom social structure,” Victoria Institute of Secondary Education, Melbourne, 1981.
  37. Jeffrey Curtis Schlimmer, “Concept acquisition through representational adjustment,” 1987.
  38. Pin-Yu Chen and Alfred O Hero, “Multilayer spectral graph clustering via convex layer aggregation: Theory and algorithms,” IEEE Transactions on Signal and Information Processing over Networks, vol. 3, no. 3, pp. 553–567, 2017.
  39. Paolo Frasconi, Marco Gori, and Alessandro Sperduti, “A general framework for adaptive processing of data structures,” IEEE transactions on Neural Networks, vol. 9, no. 5, pp. 768–786, 1998.
  40. Alessandro Sperduti and Antonina Starita, “Supervised neural networks for the classification of structures,” IEEE Transactions on Neural Networks, vol. 8, no. 3, pp. 714–735, 1997.
  41. Marco Gori, Gabriele Monfardini, and Franco Scarselli, “A new model for learning in graph domains,” in Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on. IEEE, 2005, vol. 2, pp. 729–734.
  42. Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
  43. David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83–98, 2013.
  44. Mikael Henaff, Joan Bruna, and Yann LeCun, “Deep convolutional networks on graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.
  45. Wei Tang, Zhengdong Lu, and Inderjit S Dhillon, “Clustering with multiple graphs,” in Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on. IEEE, 2009, pp. 1016–1021.
  46. Vladimir Gligorijević, Yannis Panagakis, and Stefanos Zafeiriou, “Fusion and community detection in multi-layer graphs,” in Pattern Recognition (ICPR), 2016 23rd International Conference on. IEEE, 2016, pp. 1327–1332.
  47. Zhiping Zeng, Jianyong Wang, Lizhu Zhou, and George Karypis, “Coherent closed quasi-clique discovery from large dense graph databases,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006, pp. 797–802.
  48. Brigitte Boden, Stephan Günnemann, Holger Hoffmann, and Thomas Seidl, “Mining coherent subgraphs in multi-layer graphs with edge labels,” in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012, pp. 1258–1266.
  49. Evangelos E Papalexakis, Leman Akoglu, and Dino Ienco, “Do more views of a graph help? community detection and clustering in multi-graphs,” in Information fusion (FUSION), 2013 16th international conference on. IEEE, 2013, pp. 899–905.