Attention Models with Random Features for Multilayered Graph Embeddings
Abstract
Modern data analysis pipelines are becoming increasingly complex due to the presence of multiview information sources. While graphs are effective in modeling complex relationships, in many scenarios a single graph is rarely sufficient to succinctly represent all interactions, and hence multilayered graphs have become popular. Though this leads to richer representations, extending solutions from the single-graph case is not straightforward. Consequently, there is a strong need for novel solutions to solve classical problems, such as node classification, in the multilayered case. In this paper, we consider the problem of semi-supervised learning with multilayered graphs. Though deep network embeddings, e.g. DeepWalk, are widely adopted for community discovery, we argue that feature learning with random node attributes, using graph neural networks, can be more effective. To this end, we propose to use attention models for effective feature learning, and develop two novel architectures, GrAMME-SG and GrAMME-Fusion, that exploit the inter-layer dependencies for building multilayered graph embeddings. Using empirical studies on several benchmark datasets, we evaluate the proposed approaches and demonstrate significant performance improvements in comparison to state-of-the-art network embedding strategies. The results also show that using simple random features is an effective choice, even in cases where explicit node attributes are not available.
I Introduction
I-A Multilayered Graph Embeddings
The prevalence of relational data in several real-world applications, e.g. social network analysis [1], recommendation systems [2] and neurological modeling [3], has led to crucial advances in machine learning techniques for graph-structured data. This encompasses a wide range of formulations to mine and gather insights from complex network datasets – node classification [4], link prediction [5], community detection [6], influential node selection [7] and many others. Despite the variabilities in these formulations, a recurring idea that appears in almost all of these approaches is to obtain embeddings for the nodes in a graph, prior to carrying out the downstream learning task. In the simplest form, the adjacency matrix indicating the connectivities can be treated as a naïve embedding of the nodes. However, it is well known that such high-dimensional representations suffer from the curse of dimensionality and can be ineffective for the subsequent learning. Hence, there has been a long-standing interest in constructing low-dimensional embeddings that can best represent the network topology.
Until recently, the majority of existing work has focused on analysis and inferencing from a single network. However, with the emergence of multiview datasets in real-world scenarios, commonly represented as multilayered graphs, conventional inferencing tasks have become more challenging. Our definition of multilayered graphs assumes complementary views of connectivity patterns for the same set of nodes, thus requiring models of the complex dependency structure across the views. The heterogeneity in the relationships, while providing richer information, makes statistical inferencing challenging. Note that alternative definitions for multiview networks exist in the literature [8], wherein the node sets can be different across layers (e.g. interdependent networks). Prior work on multilayered graphs focuses extensively on unsupervised community detection, and can be broadly classified into methods that obtain a consensus community structure for producing node embeddings [9, 10, 11, 12], and methods that infer a separate embedding for a node in every layer, while exploiting the inter-layer dependencies, and produce multiple potential community associations for each node [13, 14]. In contrast to existing approaches, the goal of this work is to build multilayered graph embeddings that can lead to improved node label prediction in a semi-supervised setting.
I-B Deep Learning on Graphs
Node embeddings can be inferred by optimizing a variety of measures that describe the graph structure – examples include decomposition of the graph Laplacian [15], stochastic factorization of the adjacency matrix [16], and decomposition of the modularity matrix [6, 17]. The unprecedented success of deep learning with data defined on regular domains, e.g. images and speech, has motivated its extension to arbitrarily structured graphs. For example, Yang et al. [18] and Thiagarajan et al. [19] have proposed stacked autoencoder-style solutions that directly transform the objective measure into an undercomplete representation. An alternate class of approaches utilizes the distributional hypothesis popularly adopted in language modeling [20], where the co-occurrence of two nodes in short random walks implies a strong notion of semantic similarity – examples include DeepWalk [21] and Node2Vec [22].
While the aforementioned approaches are effective in preserving network structure, semi-supervised learning with graph-structured data requires feature learning from node attributes in order to effectively propagate labels to unlabeled nodes. Since convolutional neural networks (CNNs) have been the mainstay for feature learning with data defined on regular grids, a natural idea is to generalize convolutions to graphs. Existing work on this generalization can be categorized into spectral approaches [23, 24], which operate on an explicit spectral representation of the graphs, and non-spectral approaches, which define convolutions directly on the graphs using spatial neighborhoods [25, 26]. More recently, attention models [27] have been introduced as an effective alternative for graph data modeling. An attention model parameterizes the local dependencies to determine the most relevant parts of the graph to focus on while computing the features for a node. Unlike spectral approaches, attention models do not require an explicit definition of the Laplacian operator, and can support variable-sized neighborhoods. However, a key assumption of these feature learning methods is that we have access to node attributes in addition to the network structure, which is not the case in several applications.
I-C Proposed Work
In this paper, we present a novel approach, GrAMME (Graph Attention Models for Multilayered Embeddings), for constructing multilayered graph embeddings using attention models. In contrast to the existing literature on community detection, we propose to perform feature learning in an end-to-end fashion with the node classification objective, and show that it is superior to employing separate stages of network embedding (e.g. DeepWalk) and classifier design. First, we argue that even in datasets that do not have explicit node attributes, using random features is a highly effective choice. Second, we show that attention models provide a powerful framework for modeling inter-layer dependencies, and can easily scale to a large number of layers. To this end, we develop two architectures, GrAMME-SG and GrAMME-Fusion, that employ deep attention models for semi-supervised learning. While the former approach introduces virtual edges between the layers and constructs a supra graph to parameterize dependencies, the latter approach builds layer-specific attention models and subsequently obtains consensus representations through fusion for label prediction. Using several benchmark multilayered graph datasets, we demonstrate the effectiveness of random features and show that the proposed approaches significantly outperform state-of-the-art network embedding strategies such as DeepWalk. The main contributions of this work can be summarized as follows:

For the first time, we develop attention model architectures for multilayered graphs in semi-supervised learning problems;

We propose the use of random attributes at nodes of a multilayered graph for deep feature learning;

We introduce a weighting mechanism in graph attention to better utilize complementary information from multiple attention heads;

We develop the GrAMME-SG architecture, which uses attention models to parameterize virtual edges in a supra graph;

We develop the GrAMME-Fusion architecture, which performs layer-wise attention modeling and effectively fuses information from the different layers;

We evaluate the proposed approaches on several benchmark datasets and show that they outperform existing network embedding strategies.
II Preliminaries
Definitions: A single-layered undirected, unweighted graph is represented by $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ denotes the set of nodes with cardinality $N$, and $\mathcal{E}$ denotes the set of edges. A multilayered graph is represented using a set of $L$ interdependent graphs $\{\mathcal{G}_\ell = (\mathcal{V}_\ell, \mathcal{E}_\ell)\}_{\ell=1}^{L}$, where there exists a node mapping between every pair of layers to indicate which vertices in one graph correspond to vertices in the other. In our setup, we assume the $\mathcal{V}_\ell$ from all layers contain the same set of $N$ nodes, while the edge sets $\mathcal{E}_\ell$ (each of cardinality $M_\ell$) are assumed to be different. In addition to the network structure, each node $v_i$ is endowed with a set of attributes, $\mathbf{x}_i \in \mathbb{R}^{D}$, $\forall i$, which can be used to construct latent representations $\mathbf{z}_i \in \mathbb{R}^{d}$, where $d$ is the desired number of latent dimensions. Finally, each node is associated with a label $y_i$, which contains one of the $C$ predefined categories.
II-A Deep Network Embeddings
The scalability challenge of factorization techniques has motivated the use of deep learning methods to obtain node embeddings. The earliest work to report results in this direction was the DeepWalk algorithm by Perozzi et al. [21]. Interestingly, it draws an analogy between node sequences generated by short random walks on graphs and sentences in a document corpus. Given this formulation, the authors utilize popular language modeling tools to obtain latent representations for the nodes [28]. Let us consider a simple metric walk $S_t$ in step $t$, which is rooted at the vertex $v_i$. The transition probability between the nodes $v_i$ and $v_j$ can be expressed as

$$\Pr(S_{t+1} = v_j \mid S_t = v_i) = h\big(d(v_i, v_j)\big), \qquad (1)$$

where $d(v_i, v_j)$ indicates the similarity metric between the two vertices in the latent space to be recovered, and $h$ is a linking function that connects the vertex similarity to the actual co-occurrence probability. With an appropriate choice of the walk length, the true metric can be recovered accurately from the co-occurrence statistics inferred using random walks. Furthermore, it is important to note that the frequency with which vertices appear in the short random walks follows a power-law distribution, similar to words in natural language. Given a length-$T$ sequence of words $(w_1, \dots, w_T)$, where $w_t$ denotes a word in the vocabulary, neural word embeddings attempt to obtain vector spaces that can recover the likelihood of observing a word given its context, i.e., $\Pr(w_t \mid w_1, \dots, w_{t-1})$, over all sequences. Extending this idea to the case of graphs, a random walk on the nodes, starting from node $v_i$, produces a node sequence analogous to sentences in language data.
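The walk-as-sentence idea above can be sketched in a few lines; this is a minimal illustration (not the DeepWalk implementation itself), with the walk length, walks per node, and toy graph chosen arbitrarily:

```python
import random

def random_walks(adj, walk_length=10, walks_per_node=5, seed=0):
    """Generate short random walks (node 'sentences') from an adjacency
    dict mapping each node to a list of its neighbors."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                nbrs = adj[walk[-1]]
                if not nbrs:  # dangling node: stop the walk early
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

# Toy 4-node cycle graph
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
walks = random_walks(adj, walk_length=6, walks_per_node=2)
```

The resulting walks would then be fed to a skip-gram style language model to learn the node embeddings, exactly as sentences are in word embedding training.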
II-B Graph Attention Models
In this section, we discuss the recently proposed graph attention model [27], a variant of which is utilized in this paper to construct multilayered graph embeddings. The attention mechanism is a widely adopted strategy in sequence-to-sequence modeling tasks, wherein a parameterized function is used to determine the relevant parts of the input to focus on in order to make decisions. A recent popular implementation of the attention mechanism in sequence models is the Transformer architecture by Vaswani et al. [29], which employs scaled dot-product attention to identify dependencies. Furthermore, this architecture uses a self-attention mechanism to capture dependencies within the same input, and employs multiple attention heads to enhance the modeling power. These components have subsequently been utilized in a variety of NLP tasks [30, 31] and in clinical modeling [32].
One useful interpretation of self-attention is that it implicitly induces a graph structure for a given sequence, where the nodes are time-steps and the edges indicate temporal dependencies. Instead of a single attention graph, we can actually consider multiple graphs corresponding to the different attention heads, each of which can be interpreted to encode a different type of edge, and hence can provide complementary information about different types of dependencies. This naturally motivates the use of the attention mechanism in modeling graph-structured data. Recently, Veličković et al. [27] generalized this idea to create a graph attention layer that can be stacked to build effective deep architectures for semi-supervised learning tasks. In addition to supporting variability in neighborhood sizes and improving the model capacity, graph attention models are computationally more efficient than other graph convolutional networks. In this paper, we propose to utilize attention mechanisms to model multilayered graphs.
Formulation: A head in the graph attention layer learns a latent representation for each node by aggregating the features from its neighbors. More specifically, the feature at a node is computed as the weighted combination of features from its neighbors, where the weights are obtained using the attention function. Following our notations, each node $v_i$ is endowed with a $D$-dimensional attribute vector $\mathbf{x}_i$, and hence the input to the graph attention layer is the set of attributes $\{\mathbf{x}_i\}_{i=1}^{N}$. The attention layer subsequently produces $d$-dimensional latent representations $\{\mathbf{z}_i\}_{i=1}^{N}$.
An attention head is constructed as follows: First, a linear transformation is applied to the features at each node, using a shared and trainable weight matrix $\mathbf{W} \in \mathbb{R}^{d \times D}$, thus producing intermediate representations

$$\bar{\mathbf{h}}_i = \mathbf{W} \mathbf{x}_i. \qquad (2)$$
Subsequently, a scalar dot-product attention function is utilized to determine attention weights for every edge in the graph, based on features from the incident neighbors. Formally, the attention weight for the edge connecting the nodes $i$ and $j$ is computed as

$$e_{ij} = \text{LeakyReLU}\big(\mathbf{a}^{\top} [\bar{\mathbf{h}}_i \,\|\, \bar{\mathbf{h}}_j]\big), \qquad (3)$$

where $\mathbf{a} \in \mathbb{R}^{2d}$ denotes the parameters of the attention function, and $[\cdot \,\|\, \cdot]$ represents concatenation of the features from nodes $i$ and $j$ respectively. The attention weights are computed with respect to every node in the neighborhood of $i$, i.e., $e_{ij}$ for all $j \in \mathcal{N}_i$, where $\mathcal{N}_i$ represents the neighborhood of node $i$. Note that we include the self-edge for every node while implementing the attention function. The weights are then normalized across all neighboring nodes using a softmax function, thus producing the normalized attention coefficients

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}. \qquad (4)$$
Finally, the normalized attention coefficients are used to compute the latent representation at each node, through a weighted combination of the transformed node features. Note that a nonlinearity function $\sigma(\cdot)$ is also utilized at the end to improve the approximation:

$$\mathbf{z}_i = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij}\, \bar{\mathbf{h}}_j\Big). \qquad (5)$$
An important observation is that the attention weights are not required to be symmetric. For example, if a node $i$ has a strong influence on node $j$, it does not imply that node $j$ also has a strong influence on $i$, and hence $\alpha_{ij} \neq \alpha_{ji}$ in general. The operations in equations (2) to (5) constitute a single head. While this simple parameterization enables effective modeling of relationships in a graph while learning latent features, the modeling capacity can be significantly improved by considering multiple attention heads. Following the Transformer architecture [29], the output latent representations from the different heads can be aggregated using either concatenation or averaging operations.
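A single attention head, i.e., the chain of equations (2) through (5), can be sketched in NumPy as follows. This is an illustrative dense formulation with random stand-ins for the trained parameters; the LeakyReLU scoring and the ELU output nonlinearity follow common graph-attention practice [27] and are assumptions of this sketch rather than this paper's exact choices:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0.0, x, slope * x)

def attention_head(X, A, W, a):
    """One graph attention head (Eqs. 2-5), dense form for clarity.
    X: (N, D) node attributes; A: (N, N) 0/1 adjacency;
    W: (D, d) shared linear transform; a: (2d,) attention parameters."""
    N, d = X.shape[0], W.shape[1]
    H = X @ W                                   # Eq. (2): linear transform
    # Eq. (3): e_ij = LeakyReLU(a^T [h_i || h_j]), computed for all pairs
    # by splitting a into the parts acting on h_i and on h_j
    E = leaky_relu((H @ a[:d])[:, None] + (H @ a[d:])[None, :])
    mask = (A + np.eye(N)) > 0                  # neighborhoods incl. self-edges
    E = np.where(mask, E, -np.inf)              # non-edges get zero attention
    # Eq. (4): softmax over each node's neighborhood (numerically stable)
    alpha = np.exp(E - E.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    # Eq. (5): weighted combination, followed by an ELU nonlinearity
    Z = alpha @ H
    return np.where(Z > 0.0, Z, np.expm1(np.minimum(Z, 0.0))), alpha
```

Stacking several such heads and concatenating (or averaging) their outputs gives the multi-head layer described above.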
III Proposed Approaches
In this section, we discuss the two proposed approaches for constructing multilayered graph embeddings in semi-supervised learning problems. Before presenting the algorithmic details, we describe the attention mechanism used in our approach, which utilizes a weighting function to deal with multiple attention heads. Next, we motivate the use of randomized node attributes for effective feature learning. As described in Section I, in multilayered graphs the relationships between nodes are encoded using multiple edge sets. Consequently, while applying attention models to multilayered graphs, a node in a given layer needs to update its hidden state using not only knowledge from its neighborhood in that layer, but also the shared information from the other layers. Note that we assume no prior knowledge of the dependency structure, and rely solely on the attention mechanism to uncover it.
III-A Weighted Attention Mechanism
From the discussion in Section II-B, it is clear that latent representations from the multiple attention heads can provide complementary information about the node relationships. Hence, it is crucial to utilize that information to produce reliable embeddings for label propagation. When simple concatenation is used, as done in [27], an attention layer results in features of dimension $Kd$, where $K$ is the number of attention heads. While this has been effective, one can gain improvements by performing a weighted combination of the attention heads, such that different heads can be assigned varying levels of importance. This is conceptually similar to the Weighted Transformer architecture proposed by Ahmed et al. [33]. For a node $i$, denoting the representations from the $K$ different heads as $\mathbf{h}_i^{(k)}$, the proposed weighted attention combines these representations as follows:

$$\mathbf{z}_i = \sum_{k=1}^{K} \omega_k\, \mathbf{h}_i^{(k)}, \qquad (6)$$

where $\omega_k$ denotes the scaling factor for head $k$; the $\omega_k$ are trainable during the optimization. Note that the scaling factors are shared across all nodes and are constrained to be non-negative. Optionally, one can introduce the constraint $\sum_{k} \omega_k = 1$ into the formulation; however, we observed that its inclusion did not result in significant performance improvements in our experiments. Given a set of $K$ attention heads for a single graph layer, we refer to this weighting mechanism as a fusion head.
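A fusion head thus amounts to a learned non-negative weighted sum of the head outputs. A minimal NumPy sketch, in which a softplus map is one assumed way to keep the scaling factors non-negative during unconstrained optimization:

```python
import numpy as np

def fusion_head(head_outputs, w):
    """Weighted combination of K attention-head outputs (one fusion head).
    head_outputs: (K, N, d) stacked head representations;
    w: (K,) raw trainable scores, mapped through softplus so that the
    effective scaling factors stay non-negative (an assumed parameterization)."""
    scale = np.log1p(np.exp(w))                       # softplus(w) >= 0
    return np.tensordot(scale, head_outputs, axes=1)  # sum_k scale_k * H_k
```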
Interestingly, we find that this modified attention mechanism produces robust embeddings when compared to the graph attention layer proposed in [27], even with a smaller number of attention heads. For example, let us consider Cora, a single-layered graph dataset containing 2,708 nodes (publications), each belonging to one of 7 classes. With the regular graph attention model, comprised of two attention layers with multiple heads each, we obtained a competitive test accuracy; in contrast, our weighted attention, even with fewer heads, produces state-of-the-art accuracy. Naturally, this leads to a significant reduction in the computational complexity of our architecture, which is all the more beneficial when dealing with multilayered graphs. Figure 1 illustrates a low-dimensional visualization (obtained using t-SNE) of the embeddings from our graph attention model.
III-B Using Randomized Node Attributes
With graph attention models, it is required to have access to node attributes (or features), which are then used to obtain the latent representations. However, in practice, multilayered graph datasets are often comprised of only the edge sets, without any additional information. Consequently, in existing graph inferencing approaches (e.g. community detection), it is typical to adopt an unsupervised network embedding strategy, where the objective is to ensure that the learned representations preserve the network topology (i.e. neighborhoods). However, such an approach is not optimal for semi-supervised learning tasks, since the model parameters can be tuned more effectively using the task-specific objective, in an end-to-end fashion. In order to address this challenge, we propose to employ a randomized initialization strategy for creating node attributes. Interestingly, random initialization has been highly successful in creating word representations for NLP tasks, where in many scenarios its performance matches or even surpasses pre-trained word embeddings. With this initialization, the graph attention model can be used to obtain latent representations that maximally support label propagation in the input graph. Unlike in fully supervised learning approaches, the embeddings for nodes that belong to the same class can still be vastly different, since the attention model fine-tunes the initial embeddings using only the locally connected neighbors. As we will show in our experiments, this simple initialization is effective, and our end-to-end training approach produces superior performance.
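A sketch of this initialization; the Gaussian distribution and the fixed seed are illustrative assumptions:

```python
import numpy as np

def random_node_attributes(num_nodes, dim, seed=13):
    """Draw i.i.d. Gaussian attributes for nodes with no observed features.
    Like randomly initialized word embeddings in NLP, these act only as a
    starting point: the attention layers refine them end-to-end against the
    node classification objective."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(num_nodes, dim))
```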
III-C Approach Description: GrAMME-SG
In this approach, we begin with the initial assumption that information is shared between all layers in a multilayered graph, and use attention models to infer the actual dependencies, with the objective of improving label propagation performance. More specifically, we introduce virtual edges (also referred to as pillar edges [34]) between every node in a layer and its counterparts in the other layers, resulting in a supra graph $\mathcal{G}_{\text{sup}}$. The block diagonals of the adjacency matrix for $\mathcal{G}_{\text{sup}}$ contain the individual layers, while the off-diagonal entries indicate the inter-layer connectivities. As illustrated in Figure 2, the virtual edges are introduced between nodes with the same ID across layers. This is a popularly adopted strategy in recent community detection approaches [35], however with a difference: there, nodes across layers are connected only when they share similar neighborhoods. In contrast, we consider all possible connections for information flow, and rely on the attention model to guide the learning process. Note that it is possible for some layers to contain only a subset of the nodes. Given a multilayered graph with $L$ layers, the resulting supra graph is comprised of (at most) $NL$ nodes. Furthermore, the number of edges in the supra graph is upper bounded by $\sum_{\ell=1}^{L} M_\ell + N\binom{L}{2}$, assuming virtual edges between every pair of a node's copies across the layers, as opposed to $\sum_{\ell=1}^{L} M_\ell$ in the original multilayered graph. The flexibility gained in modeling dependencies comes at the price of increased computational complexity, since we need to deal with a much larger graph.
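The supra graph construction can be pictured as a block matrix: layer adjacencies on the diagonal, and identity blocks as the virtual (pillar) edges between a node's copies. A dense NumPy illustration, assuming all layers share the same node set:

```python
import numpy as np

def supra_adjacency(layers):
    """Supra-graph adjacency for L layers over the same N nodes.
    layers: list of L (N, N) adjacency matrices. Diagonal blocks hold the
    layer adjacencies; every off-diagonal block is the identity, i.e. a
    virtual edge tying node i's copy in layer p to its copy in layer q."""
    L, N = len(layers), layers[0].shape[0]
    S = np.zeros((L * N, L * N))
    for p in range(L):
        S[p * N:(p + 1) * N, p * N:(p + 1) * N] = layers[p]
        for q in range(L):
            if p != q:
                S[p * N:(p + 1) * N, q * N:(q + 1) * N] = np.eye(N)
    return S
```

A sparse representation would be preferable in practice; the dense form is only for exposition.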
Following this, we generate random features of dimension $D$ at each of the nodes in the supra graph, and build a stacked attention model for feature learning and label prediction. Our architecture is comprised of stacked graph attention layers, each of which contains $K$ attention heads and a fusion head to combine the complementary representations. As discussed earlier, an attention head first performs a linear transformation on the input features, and parameterizes the neighborhood dependencies to learn locally consistent features. The neighborhood size for each node can be different, and we also include self-edges while computing the attention weights. Since we are using the supra graph in this case, the attention model also considers nodes from the other layers. This exploits the inter-layer dependencies and produces latent representations that can be influenced by neighbors in the other layers. Following the expression in Equation (5), the latent feature at a node $i$ in layer $\ell$ can be obtained using an attention head as follows:
$$\mathbf{h}_i^{(\ell)} = \sigma\Big(\sum_{j \in \tilde{\mathcal{N}}_i^{(\ell)}} \alpha_{ij}\, \bar{\mathbf{h}}_j\Big), \qquad (7)$$

where $\bar{\mathbf{h}}_j$ denotes the linear-transformed feature vector for a node, and $\tilde{\mathcal{N}}_i^{(\ell)}$ denotes the supra-graph neighborhood of node $i$ in layer $\ell$, which includes its counterparts in the other layers. This is repeated with $K$ attention heads with different parameters, and subsequently a fusion head is used to combine those representations. Note that a fusion head is defined using $K$ scaling factors, denoting the importance of each of the heads. This operation can be formally stated as follows:

$$\mathbf{z}_i^{(\ell)} = \sum_{k=1}^{K} \omega_k\, \mathbf{h}_i^{(\ell, k)}. \qquad (8)$$
Consequently, we obtain latent features of dimension $d$ for each node in the supra graph, which are then sequentially processed using additional graph attention layers. Since the overall goal is to obtain a single label prediction for each node, there is a need to aggregate the features for a node from the different layers. For this purpose, we perform an across-layer average pooling and employ a feed-forward layer with softmax activation for the final prediction.
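The across-layer average pooling step can be sketched as a simple reshape-and-mean, assuming the supra-graph features are ordered layer by layer:

```python
import numpy as np

def pool_across_layers(Z, num_layers):
    """Across-layer average pooling for supra-graph features.
    Z: (L*N, d) node features, ordered layer-by-layer (layer 0's N nodes
    first); returns one (N, d) consensus representation per node."""
    N = Z.shape[0] // num_layers
    return Z.reshape(num_layers, N, -1).mean(axis=0)
```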
III-D Approach Description: GrAMME-Fusion
While the GrAMME-SG approach provides complete flexibility in dealing with dependencies, the complexity of handling large supra graphs is an inherent challenge. Hence, we introduce another architecture, GrAMME-Fusion, which builds only layer-wise attention models, and introduces a supra fusion layer that exploits inter-layer dependencies using only fusion heads. As described in Section III-A, a fusion head computes a simple weighted combination and hence is computationally cheap. For simplicity, we assume that the same attention model architecture is used for every layer, although that is not required. This approach is motivated by the observation that the attention heads in our feature learning architecture and the different layers in a multilayered graph both provide complementary views of the same data, and hence they can be handled similarly using fusion heads. In contrast, GrAMME-SG considers each node in every layer as a separate entity. Figure 3 illustrates the GrAMME-Fusion architecture.
TABLE I: Summary of the multilayered graph datasets used in our study.

Dataset        | Type                                 | # Nodes | # Layers | # Total edges | # Classes
Vickers-Chan   | Classroom social structure           | 29      | 3        | 740           | 2
Congress Votes | Bill voting among congressmen        | 435     | 4        | 358,338       | 2
Leskovec-Ng    | Academic collaboration               | 191     | 4        | 1,836         | 2
Reinnovation   | Global Innovation Index similarities | 145     | 12       | 18,648        | 3
Mammography    | Mammographic masses                  | 961     | 5        | 1,979,115     | 2
Balance Scale  | Psychological assessment             | 625     | 4        | 312,500       | 3
Initially, each graph layer is processed using an attention model comprised of stacked graph attention layers, each of which implements $K$ attention heads and a fusion head, to construct layer-specific latent representations. Though the processing of the layers can be parallelized, the computational complexity is dominated by the number of heads in each model. Next, we construct a supra fusion layer which is designed exclusively using fusion heads in order to parameterize the dependencies between layers. In other words, we create $K_f$ fusion heads, each with $L$ scaling factors, in order to combine the representations from the layer-specific attention models. Note that we use multiple fusion heads to allow different parameterizations for assigning importance to each of the layers; this is conceptually similar to using multiple attention heads. Finally, we use an overall fusion head, with $K_f$ scaling factors, to obtain a consensus representation from the multiple fusion heads. One can optionally introduce an additional feed-forward layer prior to employing the overall fusion to improve the model capacity. The output from the supra fusion layer is used to make the prediction through a fully-connected layer with softmax activation. The interplay between the parameters $K$ (layer-wise attention heads) and $K_f$ (fusion heads in the supra fusion layer) controls the effectiveness and complexity of this approach.
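A minimal sketch of the supra fusion layer described above; the array shapes and the softplus non-negativity map are assumptions of this illustration:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def supra_fusion(layer_reps, W_heads, w_overall):
    """Supra fusion layer: K_f fusion heads, each a non-negative weighted
    combination of the L layer-wise representations, followed by an overall
    fusion head over the K_f head outputs.
    layer_reps: (L, N, d); W_heads: (K_f, L); w_overall: (K_f,)."""
    # Each fusion head k mixes the L layers with its own scaling factors
    heads = np.einsum('kl,lnd->knd', softplus(W_heads), layer_reps)
    # The overall fusion head produces the consensus representation
    return np.einsum('k,knd->nd', softplus(w_overall), heads)
```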
IV Empirical Studies
In this section, we evaluate the proposed approaches by performing semi-supervised learning with benchmark multilayered graph datasets. Our experiments study the behavior of our approaches with varying amounts of labeled nodes, cross-validated over different train-test splits. Though the proposed approaches can be utilized for inductive learning, we restrict our experiments to transductive tasks. For each dataset and experiment, we select the labeled nodes uniformly at random, while fixing the total amount of labeled nodes. We begin by describing the datasets considered for our study, and then briefly discuss the baseline techniques based on deep network embeddings.

TABLE II: Semi-supervised node classification accuracy (%) for varying percentages of labeled nodes. DeepWalk and DeepWalk-SG are the baselines; the two GrAMME-Fusion columns correspond to the two per-layer head configurations described in the experiment setup.

Labeled | DeepWalk | DeepWalk-SG | GrAMME-SG | GrAMME-Fusion (i) | GrAMME-Fusion (ii)

Vickers-Chan Dataset
10% | 94.60 | 95.55 | 98.94 | 99.21 | 99.21
20% | 95.26 | 95.83 | 98.94 | 99.21 | 99.21
30% | 96.10 | 96.19 | 98.94 | 99.21 | 99.21

Congress Votes Dataset
10% | 98.82 | 98.00 | 96.02 | 100 | 100
20% | 99.90 | 99.10 | 96.87 | 100 | 100
30% | 99.91 | 99.63 | 97.33 | 100 | 100

Leskovec-Ng Dataset
10% | 92.89 | 94.52 | 91.56 | 92.95 | 93.32
20% | 96.96 | 97.82 | 96.25 | 96.84 | 97.62
30% | 98.09 | 98.11 | 98.30 | 98.72 | 98.73

Reinnovation Dataset
10% | 69.26 | 67.23 | 76.42 | 74.41 | 75.28
20% | 72.12 | 70.61 | 80.72 | 79.61 | 79.00
30% | 73.46 | 70.55 | 83.16 | 81.97 | 80.95

Mammography Dataset
10% | 73.30 | 71.65 | 82.27 | 82.57 | 82.63
20% | 69.86 | 70.68 | 83.01 | 83.20 | 83.28
30% | 77.21 | 77.04 | 83.06 | 83.74 | 83.75

Balance Scale Dataset
10% | 81.80 | 81.39 | 77.67 | 80.13 | 80.15
20% | 86.48 | 85.69 | 78.67 | 86.50 | 86.58
30% | 89.19 | 86.41 | 79.10 | 87.84 | 88.72
IV-A Datasets
We describe in detail the multilayered graph datasets used for evaluation. A summary of the datasets can be found in Table I.
(i) Vickers-Chan: The Vickers-Chan [36] dataset represents the social structure of students at a school in Victoria, Australia. Each node represents a student in the 7th grade, and the three graph layers are constructed based on student responses to the following three criteria: (i) whom did they get along with in the class?, (ii) who are their best friends in the class?, and (iii) whom do they prefer to work with?. The dataset is comprised of 29 nodes, and gender is used as the label in our learning formulation.
(ii) Congress Votes: The Congress Votes [37] dataset is obtained from the 1984 United States Congressional Voting Records Database. It includes the votes of every congressman from the U.S. House of Representatives on different bills, which results in a 4-layered graph. The dataset is comprised of 435 nodes, labeled as either Democrat or Republican. For every layer, we establish an edge between two nodes if the corresponding congressmen voted the same way ("yes" or "no") on the associated bill.
(iii) Leskovec-Ng: This dataset [38] is a temporal collaboration network of the research communities of professors Jure Leskovec and Andrew Ng. The co-authorship information is partitioned into multi-year intervals, in order to construct a 4-layered graph. In any layer, two researchers are connected by an edge if they co-authored at least one paper in the corresponding interval. Each researcher is labeled as affiliated with either Leskovec's or Ng's group.
(iv) Reinnovation: This dataset describes the Global Innovation Index for 145 countries, which form the nodes of the graph. For each node, the label represents the development level of the corresponding country; there are 3 levels of development, representing the classes. Each layer of the graph is constructed based on similarities between countries in a different sector, including infrastructure, institutions, labor market, financial market etc. The graph contains 12 layers in total.
(v) Mammography: This dataset contains information about mammographic mass lesions from 961 subjects. We consider 5 different attributes, namely the BI-RADS assessment, subject age, and the shape, margin, and density of the lesion, in order to construct the different layers of the graph. This dataset is quite challenging due to the presence of nearly 2 million edges; conventional network embedding techniques that rely on the sparsity of graphs can be particularly ineffective in such scenarios. Finally, the lesions are marked as either benign or malignant, to define the labels.
(vi) Balance Scale: The final dataset that we consider is the UCI Balance Scale dataset, which summarizes the results from a psychological experiment. Using 4 different attributes characterizing each subject, namely the left weight, the left distance, the right weight, and the right distance, we constructed a 4-layered graph. Each subject (or node) is classified as having the balance scale tip to the right, tip to the left, or be balanced.
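Several of the layer-construction rules above (e.g., the Congress Votes layers) reduce to "connect two nodes when an attribute matches". A minimal sketch for one vote-based layer, assuming a 'y'/'n'/'?' vote coding with '?' for abstentions (the exact coding in [37] may differ):

```python
def vote_layer_edges(votes):
    """Edge list for one bill's layer: connect congressmen i and j iff
    they cast the same recorded vote; abstentions never create edges.
    votes: list of per-node votes, e.g. ['y', 'n', 'y', '?']."""
    edges = []
    for i in range(len(votes)):
        for j in range(i + 1, len(votes)):
            if votes[i] in ('y', 'n') and votes[i] == votes[j]:
                edges.append((i, j))
    return edges
```

Running this once per bill yields the edge set of each layer of the multilayered graph.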
IV-B Baselines
We use the following two baselines in order to compare the performance of the proposed approaches. Given that the datasets considered do not contain explicit node attributes to perform feature learning, the natural approach is to obtain embeddings for each node in every layer using deep network embedding techniques, and to subsequently build a classifier model using the resulting features. Following recent approaches such as [8], we choose DeepWalk, a state-of-the-art embedding technique, for obtaining deep embeddings. In particular, we consider two different variants: (i) DeepWalk: we treat each layer in the multilayered graph as independent, and obtain embeddings from the layers separately. Finally, we concatenate the embeddings for each node from the different layers and build a multi-layer perceptron to perform the classification; (ii) DeepWalk-SG: we construct a supra graph by introducing virtual edges between nodes across layers (as described in Section III-C) and perform DeepWalk on the supra graph. Finally, the embeddings are concatenated as in the previous case and the classifier is designed. Though the former approach does not exploit the inter-layer information, it can still be effective in cases where there is significant variability in neighborhood structure across layers, by treating the layers independently.
IV-C Experiment Setup
In this section, we describe the experiment setup in detail. For both of the proposed approaches, we considered architectures with graph (weighted) attention layers, with fixed input feature and hidden dimensions for both attention layers. We ran our experiments in a transductive learning setting. As described earlier, we begin by creating random node attributes in every layer. For the GrAMME-SG architecture, we used a fixed number of attention heads and a single fusion head. On the other hand, for the GrAMME-Fusion approach, we experimented with both no fusion head and one fusion head for each of the layers in the graph; furthermore, in the supra fusion layer, we used a fixed number of fusion heads. All networks were trained with the Adam optimizer at a fixed learning rate. In order to study the sensitivity of the proposed approaches to varying levels of training data availability, we varied the percentage of training nodes. We repeated the experiments over independent realizations of the train-test split, and we report the average performance in all cases. The performance of the algorithms was measured using the overall accuracy score. The DeepWalk baselines were run with a fixed number of walks and fixed embedding sizes for the two variants.
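The setup above can be sketched as follows. The feature dimension, number of layers, training fraction, and seed here are illustrative placeholders, not the values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, d, n_layers = 100, 16, 3   # illustrative sizes only

# Random node attributes: one i.i.d. Gaussian matrix per layer,
# standing in for the (unavailable) explicit node attributes.
X = [rng.standard_normal((n_nodes, d)) for _ in range(n_layers)]

def train_test_split_nodes(n, train_frac, rng):
    """Transductive split: the labels of a random subset of nodes
    are revealed for training; all nodes are visible to the model."""
    perm = rng.permutation(n)
    n_train = int(train_frac * n)
    return perm[:n_train], perm[n_train:]

train_idx, test_idx = train_test_split_nodes(n_nodes, 0.1, rng)
```

Averaging accuracy over several such random splits, as described above, then amounts to repeating this split with fresh seeds and aggregating the per-split scores.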
IV-D Results
Table II summarizes the performance of our approaches on the multilayered graph datasets, along with the baseline results. As described in Section III-C, increasing the number of attention heads in GrAMME-SG increases the computational complexity. In contrast, GrAMME-Fusion is computationally efficient, since it employs multiple fusion heads (in the supra fusion layer) while simplifying the layer-wise attention models. In fact, we report results obtained using a single attention head in each layer. Figure 4 illustrates the convergence characteristics of the proposed GrAMME-Fusion architecture under different training settings for the Mammography dataset. As can be observed, even with the complex graph structure (around a million edges), the proposed solutions demonstrate good convergence characteristics.
From the reported results, we make the following observations: in most datasets, the proposed attention-based approaches significantly outperform the baseline techniques, providing highly robust models even with small training sets. In particular, on challenging datasets such as the Reinnovation and Mammography datasets, the proposed approaches achieve sizable improvements over the network embedding techniques. This clearly demonstrates the effectiveness of both attention-based feature learning and random features in multilayered graph analysis. The Balance Scale dataset is a representative example of scenarios where the neighborhood structure varies significantly across layers, and consequently the baseline DeepWalk approach performs competitively with the GrAMME approaches that take inter-layer dependencies into account. Comparing the two proposed approaches, though GrAMME-SG provides improved flexibility by allowing information flow between all layers, GrAMME-Fusion consistently outperforms it, while also being significantly cheaper. Interestingly, with GrAMME-Fusion, increasing the number of attention heads does not lead to significant performance improvements, demonstrating the effectiveness of the supra fusion layer.
Finally, we visualize the multilayered graph embeddings to qualitatively understand the behavior of the feature learning approaches. More specifically, we show 2D t-SNE visualizations of the hidden representations for the Congress Votes and Mammography datasets, obtained using GrAMME-Fusion. Figure 5 shows the initial random features and the learned representations, wherein the effectiveness of the attention mechanism in revealing the class structure is clearly evident.
V Related Work
In this section, we briefly review prior work on deep feature learning for graph datasets and on multilayered graph analysis. Note that the proposed techniques build on the graph attention networks recently proposed in [27].
V-A Feature Learning with Graph-Structured Data
Performing data-driven feature learning with graph-structured data has gained a lot of interest, thanks to recent advances in generalizing deep learning to non-Euclidean domains. The earliest applications of neural networks to graph data can be seen in [39, 40], wherein recursive models were utilized to model dependencies. More formal generalizations of recurrent neural networks to graph analysis were later proposed in [41, 42]. Given the success of convolutional neural networks in feature learning from data defined on regular grids (e.g. images), the next generation of graph neural networks focused on performing graph convolutions efficiently. Here, feature learning transforms signals defined at the nodes into meaningful latent representations, akin to the filtering of signals [43]. Since the spatial convolution operation cannot be directly defined on arbitrary graphs, a variety of spectral-domain and neighborhood-based techniques have been developed.
Spectral approaches, as the name suggests, operate on the spectral representation of graph signals, defined using the eigenvectors of the graph Laplacian. For example, in [23], convolutions are realized as multiplications in the graph Fourier domain. However, since the filters cannot be spatially localized on arbitrary graphs, this relies on explicit computation of the spectrum via an eigendecomposition of the Laplacian. Consequently, special families of spatially localized filters have been considered; examples include the localization technique in [44] and the Chebyshev polynomial expansion based localization in [24]. Building upon this idea, Kipf and Welling [4] introduced graph convolutional networks (GCN) using a localized first-order approximation of spectral graph convolutions, wherein the filters operate within a one-step neighborhood, making the approach scalable even to large networks. On the other hand, in non-spectral approaches, convolutions are defined directly on graphs, and they are capable of working with different-sized neighborhoods. For example, localized spatial filters with different weight matrices for varying node degrees are learned in [25], whereas in approaches such as [26] the neighborhood of each node is normalized to a fixed size. More recently, attention models, which are commonly used to model temporal dependencies in sequence modeling tasks, were generalized to model neighborhood structure in graphs. More specifically, graph attention networks [27] employ self-attention mechanisms to perform feature learning in semi-supervised learning problems. While these methods have produced state-of-the-art results for single-layer graphs, to the best of our knowledge, no generalization exists for multilayered graphs, which is the focus of this paper. In particular, we build solutions for scenarios where no explicit node attributes are available.
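A minimal numerical sketch of a single-head graph attention layer, in the spirit of [27], is shown below. The additive scoring function, the LeakyReLU slope, and the parameter names are illustrative assumptions, not the exact formulation used in any particular implementation.

```python
import numpy as np

def graph_attention_layer(A, H, W, a_src, a_dst):
    """Single-head graph attention sketch: each node aggregates its
    neighbors' transformed features, weighted by attention
    coefficients normalized (softmax) over its neighborhood.
    A: (n, n) adjacency; H: (n, d) features; W: (d, d') weights;
    a_src, a_dst: (d',) attention parameter vectors (illustrative)."""
    Z = H @ W                                  # transformed features
    s, t = Z @ a_src, Z @ a_dst                # per-node score terms
    raw = s[:, None] + t[None, :]              # e_ij before nonlinearity
    e = np.where(raw > 0, raw, 0.2 * raw)      # LeakyReLU(e_ij)
    mask = (A + np.eye(A.shape[0])) > 0        # neighbors + self-loop
    e = np.where(mask, e, -np.inf)             # exclude non-neighbors
    e = e - e.max(axis=1, keepdims=True)       # numerically stable softmax
    alpha = np.exp(e) * mask
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ Z                           # attention-weighted aggregation
```

With all attention parameters set to zero, the coefficients reduce to a uniform average over each node's neighborhood, which makes the connection to the one-step aggregation of GCN-style layers explicit.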
V-B Multilayered Graph Analysis
Analysis and inference with multilayered graphs is a challenging, yet crucial problem in data mining. With each layer characterizing a specific kind of relationship, the multilayered graph is a comprehensive representation of the relationships between nodes, which can be utilized to gain insights into complex datasets. Although the multilayered representation is more comprehensive, a question that naturally arises is how to effectively fuse the information across layers. Most existing work in the literature focuses on community detection, and an important class of approaches tackles this problem through joint factorization of the multiple graph adjacency matrices to infer embeddings [45, 9]. In [46], a symmetric non-negative matrix tri-factorization algorithm is utilized to factorize the adjacencies into non-negative matrices, including a shared cluster indicator matrix. Other alternative approaches include subgraph pattern mining [47, 48] and information-theoretic optimization based on the Minimum Description Length principle [49]. A comprehensive survey of the algorithms and datasets on this topic can be found in [34]. A unified optimization framework is developed in [8] to simultaneously model within-layer and cross-layer connections, in order to generate node embeddings for interdependent networks. Recently, Song and Thiagarajan [35] proposed to generalize the DeepWalk algorithm to the case of multilayered graphs, through optimization with proxy clustering costs, and showed that the resulting embeddings produce state-of-the-art results. In contrast to these approaches, we consider the problem of semi-supervised learning and develop novel feature learning techniques for the multilayered case.
VI Conclusions
In this paper, we introduced two novel architectures, GrAMME-SG and GrAMME-Fusion, for semi-supervised node classification with multilayered graph data. Our architectures utilize randomized node attributes, and effectively fuse information from both within-layer and across-layer connectivities through a weighted attention mechanism. While GrAMME-SG provides complete flexibility by allowing virtual edges between all layers, GrAMME-Fusion exploits inter-layer dependencies using fusion heads operating on layer-wise hidden representations. Experimental results show that our models consistently outperform existing node embedding techniques. As part of future work, the proposed solutions can be naturally extended to multimodal and interdependent networks. Furthermore, studying the effectiveness of simple and scalable attention models in other challenging graph inferencing tasks, such as multilayered link prediction and influential node selection, remains an important open problem.
Acknowledgments
This work was supported in part by the NSF I/UCRC Arizona State University Site, award 1540040, and the SenSIP center. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. We also thank Prasanna Sattigeri for useful discussions and for sharing data.
References
 Nathan Eagle and Alex Sandy Pentland, “Reality mining: sensing complex social systems,” Personal and ubiquitous computing, vol. 10, no. 4, pp. 255–268, 2006.
 Nikhil Rao, Hsiang-Fu Yu, Pradeep K Ravikumar, and Inderjit S Dhillon, “Collaborative filtering with graph information: Consistency and scalable methods,” in Advances in neural information processing systems, 2015, pp. 2107–2115.
 Alex Fornito, Andrew Zalesky, and Michael Breakspear, “Graph analysis of the human connectome: promise, progress, and pitfalls,” Neuroimage, vol. 80, pp. 426–444, 2013.
 Thomas N Kipf and Max Welling, “Semisupervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
 M. Zhang and Y. Chen, “Link Prediction Based on Graph Neural Networks,” in Advances in Neural Information Processing Systems, 2018.
 Mark EJ Newman, “Finding community structure in networks using the eigenvectors of matrices,” Physical review E, vol. 74, no. 3, pp. 036104, 2006.
 Qunwei Li, Bhavya Kailkhura, Jayaraman Thiagarajan, Zhenliang Zhang, and Pramod Varshney, “Influential node detection in implicit social networks using multitask gaussian copula models,” in NIPS 2016 Time Series Workshop, 2017, pp. 27–37.
 Jundong Li, Chen Chen, Hanghang Tong, and Huan Liu, “Multilayered network embedding,” in Proceedings of the 2018 SIAM International Conference on Data Mining. SIAM, 2018, pp. 684–692.
 Xiaowen Dong, Pascal Frossard, Pierre Vandergheynst, and Nikolai Nefedov, “Clustering with multilayer graphs: A spectral perspective,” IEEE Transactions on Signal Processing, vol. 60, no. 11, pp. 5820–5831, 2012.
 Xiaowen Dong, Pascal Frossard, Pierre Vandergheynst, and Nikolai Nefedov, “Clustering on multilayer graphs via subspace analysis on grassmann manifolds,” IEEE Transactions on signal processing, vol. 62, no. 4, pp. 905–918, 2014.
 Jungeun Kim, Jae-Gil Lee, and Sungsu Lim, “Differential flattening: A novel framework for community detection in multilayer graphs,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 8, no. 2, pp. 27, 2017.
 Andrea Tagarelli, Alessia Amelio, and Francesco Gullo, “Ensemble-based community detection in multilayer networks,” Data Mining and Knowledge Discovery, vol. 31, no. 5, pp. 1506–1543, 2017.
 Peter J Mucha, Thomas Richardson, Kevin Macon, Mason A Porter, and Jukka-Pekka Onnela, “Community structure in time-dependent, multiscale, and multiplex networks,” Science, vol. 328, no. 5980, pp. 876–878, 2010.
 Marya Bazzi, Mason A Porter, Stacy Williams, Mark McDonald, Daniel J Fenn, and Sam D Howison, “Community detection in temporal multilayer networks, with an application to correlation networks,” Multiscale Modeling & Simulation, vol. 14, no. 1, pp. 1–41, 2016.
 Andrew Y Ng, Michael I Jordan, and Yair Weiss, “On spectral clustering: Analysis and an algorithm,” in Advances in neural information processing systems, 2002, pp. 849–856.
 Amr Ahmed, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and Alexander J Smola, “Distributed largescale natural graph factorization,” in Proceedings of the 22nd international conference on World Wide Web. ACM, 2013, pp. 37–48.
 Mingming Chen, Konstantin Kuzmin, and Boleslaw K Szymanski, “Community detection via maximization of modularity and its variants,” IEEE Transactions on Computational Social Systems, vol. 1, no. 1, pp. 46–65, 2014.
 Liang Yang, Xiaochun Cao, Dongxiao He, Chuan Wang, Xiao Wang, and Weixiong Zhang, “Modularity based community detection with deep learning.,” in IJCAI, 2016, pp. 2252–2258.
 Jayaraman J Thiagarajan, Prasanna Sattigeri, Karthikeyan Natesan Ramamurthy, and Bhavya Kailkhura, “Robust local scaling using conditional quantiles of graph similarities,” in Data Mining Workshops (ICDMW), 2016 IEEE 16th International Conference on. IEEE, 2016, pp. 762–769.
 Z.S. Harris, Distributional Structure, 1954.
 Bryan Perozzi, Rami Al-Rfou, and Steven Skiena, “Deepwalk: Online learning of social representations,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 701–710.
 Aditya Grover and Jure Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 855–864.
 Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint arXiv:1312.6203, 2013.
 Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
 David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Advances in neural information processing systems, 2015, pp. 2224–2232.
 Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov, “Learning convolutional neural networks for graphs,” in International conference on machine learning, 2016, pp. 2014–2023.
 Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
 Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
 Zhen Yang, Wei Chen, Feng Wang, and Bo Xu, “Improving neural machine translation with conditional sequence generative adversarial nets,” arXiv preprint arXiv:1703.04887, 2017.
 Antonio Valerio Miceli Barone, Jindřich Helcl, Rico Sennrich, Barry Haddow, and Alexandra Birch, “Deep architectures for neural machine translation,” arXiv preprint arXiv:1707.07631, 2017.
 Huan Song, Deepta Rajan, Jayaraman J Thiagarajan, and Andreas Spanias, “Attend and diagnose: Clinical time series analysis using attention models,” arXiv preprint arXiv:1711.03905, 2017.
 Karim Ahmed, Nitish Shirish Keskar, and Richard Socher, “Weighted transformer network for machine translation,” CoRR, vol. abs/1711.02132, 2017.
 Jungeun Kim and Jae-Gil Lee, “Community detection in multi-layer graphs: A survey,” ACM SIGMOD Record, vol. 44, no. 3, pp. 37–48, 2015.
 Huan Song and Jayaraman J Thiagarajan, “Improved community detection using deep embeddings from multilayer graphs,” arXiv preprint, 2018.
 M Vickers and S Chan, “Representing classroom social structure,” Victoria Institute of Secondary Education, Melbourne, 1981.
 Jeffrey Curtis Schlimmer, “Concept acquisition through representational adjustment,” 1987.
 PinYu Chen and Alfred O Hero, “Multilayer spectral graph clustering via convex layer aggregation: Theory and algorithms,” IEEE Transactions on Signal and Information Processing over Networks, vol. 3, no. 3, pp. 553–567, 2017.
 Paolo Frasconi, Marco Gori, and Alessandro Sperduti, “A general framework for adaptive processing of data structures,” IEEE transactions on Neural Networks, vol. 9, no. 5, pp. 768–786, 1998.
 Alessandro Sperduti and Antonina Starita, “Supervised neural networks for the classification of structures,” IEEE Transactions on Neural Networks, vol. 8, no. 3, pp. 714–735, 1997.
 Marco Gori, Gabriele Monfardini, and Franco Scarselli, “A new model for learning in graph domains,” in Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on. IEEE, 2005, vol. 2, pp. 729–734.
 Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
 David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst, “The emerging field of signal processing on graphs: Extending highdimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83–98, 2013.
 Mikael Henaff, Joan Bruna, and Yann LeCun, “Deep convolutional networks on graph-structured data,” arXiv preprint arXiv:1506.05163, 2015.
 Wei Tang, Zhengdong Lu, and Inderjit S Dhillon, “Clustering with multiple graphs,” in Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on. IEEE, 2009, pp. 1016–1021.
 Vladimir Gligorijević, Yannis Panagakis, and Stefanos Zafeiriou, “Fusion and community detection in multilayer graphs,” in Pattern Recognition (ICPR), 2016 23rd International Conference on. IEEE, 2016, pp. 1327–1332.
 Zhiping Zeng, Jianyong Wang, Lizhu Zhou, and George Karypis, “Coherent closed quasiclique discovery from large dense graph databases,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006, pp. 797–802.
 Brigitte Boden, Stephan Günnemann, Holger Hoffmann, and Thomas Seidl, “Mining coherent subgraphs in multilayer graphs with edge labels,” in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012, pp. 1258–1266.
 Evangelos E Papalexakis, Leman Akoglu, and Dino Ienco, “Do more views of a graph help? community detection and clustering in multi-graphs,” in Information Fusion (FUSION), 2013 16th international conference on. IEEE, 2013, pp. 899–905.