Anonymous Walk Embeddings
Abstract
The task of representing entire graphs has seen a surge of prominent results, mainly due to learning convolutional neural networks (CNNs) on graph-structured data. While CNNs demonstrate state-of-the-art performance in the graph classification task, such methods are supervised and therefore steer away from the original problem of network representation in a task-agnostic manner. Here, we propose a coherent approach for embedding entire graphs and show that our feature representations with an SVM classifier increase the classification accuracy of CNN algorithms and traditional graph kernels. For this, we describe a recently discovered graph object, the anonymous walk, on which we design task-independent algorithms for learning graph representations in explicit and distributed ways. Overall, our work presents a new scalable unsupervised approach to learning state-of-the-art representations of entire graphs.
1 Introduction
A wide range of real-world applications deal with network analysis and classification tasks. The ease of representing data with graphs makes them a very valuable asset in any data mining toolbox; however, the complexity of working with graphs led researchers to seek new ways of representing and analyzing graphs, of which network embeddings have become broadly popular due to their success in several machine learning areas such as graph classification (Cai et al., 2017), visualization (Cao et al., 2016), and pattern recognition (Monti et al., 2017).
Essentially, network embeddings are vector representations of graphs that capture local and global traits and, as a consequence, are more suitable for standard machine learning techniques such as SVMs that work on numerical vectors rather than graph structures. Ideally, a practitioner would like to have a polynomial-time algorithm that converts different graphs into different feature vectors. However, such an algorithm would be capable of deciding whether two graphs are isomorphic (Gärtner et al., 2003), for which currently only a quasi-polynomial-time algorithm exists (Babai, 2016). Hence, there are fundamental challenges in the design of a polynomial-time algorithm for network-to-vector conversion. Instead, a lot of research has been devoted to designing network embedding models that are computationally efficient and preserve similarity between graphs.
Broadly speaking, network embeddings come from one of two buckets: either based on engineered graph features or driven by training on graph data. Feature-based methods traditionally appeared in the graph kernel setting (Vishwanathan et al., 2010), where each graph is decomposed into discrete components whose distribution is used as a vector representation of the graph (Haussler, 1999). Importantly, the general concept of feature-based methods presumes ad-hoc knowledge about the data at hand. For example, the Random Walk kernel (Vishwanathan et al., 2010) assumes that a graph's realization originates from the types of random walks the graph has, whereas for the Weisfeiler-Lehman (WL) kernel (Shervashidze et al., 2011) the insight is in the subtree patterns of a graph. For high-dimensional graph embeddings, feature-based methods produce sparse solutions, as only a few substructures are common across graphs. This is known as diagonal dominance (Yanardag & Vishwanathan, 2015), a situation when a graph representation is only similar to itself, but not to any other graph.
On the other hand, the data-driven approach learns network embeddings by optimizing some form of objective function defined on graph data. Deep Graph Kernels (DGK) (Yanardag & Vishwanathan, 2015), for example, learn a positive semidefinite matrix that weights the relationship between graph substructures, while Patchy-San (PSCN) (Niepert et al., 2016) constructs locally connected neighborhoods on which to train a convolutional neural network. The data-driven approach implies learning distributed graph representations, which have demonstrated promising classification results (Niepert et al., 2016; Tixier et al., 2017).
Our approach. We propose to use a natural graph object named the anonymous walk as a basis for learning feature-based and data-driven network embeddings. A recent discovery (Micali & Allen Zhu, 2016) has shown that anonymous walks provide characteristic graph traits and are capable of reconstructing the network proximity of a node exactly. In particular, the distribution of anonymous walks starting at a node is sufficient to reconstruct the subgraph induced by all vertices within a fixed distance from that node; and such a distribution uniquely determines the underlying Markov process, i.e. no two different subgraphs exist with the same distribution of anonymous walks. This implies that two graphs with similar distributions of anonymous walks should be topologically similar. We therefore define feature-based network embeddings on the distribution of anonymous walks and show an efficient sampling approach that approximates these distributions for large networks.
To overcome the sparsity of feature-based methods, we design a data-driven approach that learns distributed representations on a generated corpus of anonymous walks via backpropagation, in the same vein as neural models in NLP (Le & Mikolov, 2014; Bengio et al., 2003). Considering anonymous walks for the same source node as co-occurring words in a sentence, and a graph as a collection of such sentences, the hope is that, by predicting a target word in a given context of words and a document, the proposed algorithm learns the semantic meaning of words and documents.
To the best of our knowledge, we are the first to introduce anonymous walks in the context of learning network representations, and we highlight the following contributions:

Based on the notion of the anonymous walk, we propose feature-based network embeddings, for which we describe an efficient sampling procedure that alleviates the time complexity of exact computation.

By maximizing the likelihood of preserving network proximity of anonymous walks, we propose a scalable algorithm to learn data-driven network embeddings.

On widely used real datasets, we demonstrate that our network embeddings achieve superior performance over state-of-the-art graph kernels and neural networks in the graph classification task.
2 Anonymous Walks
Random walks are sequences of nodes where each new node is selected independently from the set of neighbors of the last node in the sequence. Normally, states in a random walk correspond to labels or global names of nodes; however, for reasons described below, such states may be unavailable. Yet, it has recently been shown that an anonymized version of a random walk can provide a flexible way to reconstruct a network even when global names are absent (Micali & Allen Zhu, 2016). We next define the notion of an anonymous walk.
Definition 1.
Let s = (u_1, u_2, …, u_k) be an ordered list of elements u_i ∈ V. We define the positional function pos(s, u_i) such that for any ordered list s and an element u_i ∈ s it returns the list of all positions at which u_i occurs in s.
For example, if s = (a, b, c, b, c), then pos(s, a) = (1), as element a appears only in the first position, and pos(s, b) = (2, 4).
Definition 2 (Anonymous Walk).
If w = (v_1, v_2, …, v_k) is a random walk, then its corresponding anonymous walk is the sequence of integers a = (f(v_1), f(v_2), …, f(v_k)), where f(v_i) = min(pos(w, v_i)).
We denote the mapping of a random walk w to its anonymous walk a by w → a.
For instance, in the graph of Fig. 1 a random walk matches anonymous walk . Likewise, another random walk also corresponds to anonymous walk . Conversely, another random walk corresponds to a different anonymous walk .
Intuitively, states in an anonymous walk correspond to the first positions of nodes in a random walk, and their total number equals the number of distinct nodes in the random walk. The particular names of the states do not matter (so, for example, two anonymous walks that differ only by a consistent renaming of states would be the same); however, by convention, anonymous walks start from 1 and name each new state by incrementing the current maximum state in the anonymous walk.
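The mapping from a random walk to its anonymous walk is straightforward to implement; a minimal sketch (the function name and the list-of-nodes input format are illustrative, not from the paper):

```python
def to_anonymous_walk(walk):
    """Replace each node in a random walk by the state index assigned at the
    node's first occurrence, with states numbered from 1 (Definition 2)."""
    first_seen = {}  # node -> state assigned when the node was first seen
    states = []
    for node in walk:
        if node not in first_seen:
            first_seen[node] = len(first_seen) + 1  # next "maximum" state
        states.append(first_seen[node])
    return states
```

For example, `to_anonymous_walk(['a', 'b', 'c', 'b', 'c'])` and `to_anonymous_walk(['c', 'd', 'b', 'd', 'b'])` both return `[1, 2, 3, 2, 3]`: different random walks can share the same anonymous walk.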
Rationale. From the perspective of a single node in the position of an observer, the global topology of the network may be hidden deliberately (e.g. social networks often restrict outsiders from examining your friendships) or otherwise (e.g. newly created links in the world wide web may be yet unknown to a search engine). Nevertheless, an observer can, on its own, experiment with the network by starting a random walk from itself, passing the process to its neighbors, and recording the observed states of the random walk. As global names of the nodes are not available to the observer, one way to record the states anonymously is to describe each by the first occurrence of its node in the random walk. Not only are such records succinct, but it is also common to have privacy constraints (Abraham, 2012) that would not allow recording a full description of nodes.
Somewhat remarkably, (Micali & Allen Zhu, 2016) show that for a single node v in a graph G, a known distribution D over anonymous walks of length l is sufficient to reconstruct the topology of the ball with center at v and radius r, i.e. the subgraph of G induced by all vertices at most r hops from v. For the task of learning embeddings, the topology of the network is available and thus the distribution D of anonymous walks can be computed precisely. As no two different subgraphs can have the same distribution D, it is useful to generalize the distribution of anonymous walks from a single node to the whole network and use it as a feature representation of a graph. This idea paves the way to our feature-based network embeddings.
3 Algorithms
We start by discussing how to leverage anonymous walks for learning network embeddings in a feature-based manner. Inspired by the empirical results, we then train an objective function on local neighborhoods of anonymous walks, which further improves classification results.
3.1 AWE: Feature-Based Model
By definition, a weighted directed graph is a tuple G = (V, E, Ω), where V is a set of vertices, E ⊆ V × V is a set of edges, and Ω is a set of edge weights. Given a graph G, we construct a random walk graph R(G) such that every edge e = (u, v) has a weight p_e equal to ω_e / Σ_{v' ∈ N_out(u)} ω_{(u, v')}, where N_out(u) is the set of out-neighbors of u and ω_e ∈ Ω. A random walk w of length l on graph R(G) is a sequence of nodes u_1, u_2, …, u_{l+1}, where u_i ∈ V, such that each pair (u_i, u_{i+1}) is selected with probability p_{(u_i, u_{i+1})} in the random walk graph R(G). The probability p(w) of a random walk w is the total probability of choosing the edges in the walk, i.e. p(w) = Π_{i=1}^{l} p_{(u_i, u_{i+1})}.
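Constructing the random walk graph R(G) amounts to normalizing each node's out-edge weights into transition probabilities; a minimal sketch, assuming the graph is given as a nested dict of positive edge weights (a representation chosen here for illustration):

```python
def random_walk_graph(weights):
    """Turn a weighted directed graph into its random walk graph R(G):
    each out-edge (u, v) gets probability w(u, v) / sum of u's out-weights.
    `weights` maps a node u to a dict {v: edge weight}."""
    probs = {}
    for u, out in weights.items():
        total = sum(out.values())
        probs[u] = {v: w / total for v, w in out.items()}
    return probs

# Small illustrative graph: node 'a' splits its probability evenly.
R = random_walk_graph({'a': {'b': 2.0, 'c': 2.0}, 'b': {'a': 1.0}, 'c': {'a': 3.0}})
```

Each row of `R` is a probability distribution over the out-neighbors of the corresponding node, which is exactly what a walk simulator needs.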
According to Definition 2, an anonymous walk is a random walk where each state is recorded by the index of its first occurrence in the walk. The number of all possible anonymous walks of length l in an arbitrary graph grows exponentially with l (Figure 2). Consider an initial node u and the set W_u^l of all different random walks that start from u and have length l. These random walks correspond to a set of s different anonymous walks A_u^l = (a_1, a_2, …, a_s). The probability of seeing anonymous walk a_i of length l for a node u is p(a_i | u) = Σ_{w ∈ W_u^l : w → a_i} p(w). Aggregating probabilities across all vertices in a graph and normalizing them by the total number of nodes N, we get the probability of choosing anonymous walk a_i in graph G: p(a_i) = (1/N) Σ_{u ∈ G} p(a_i | u).
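The exponential growth of the number of possible anonymous walks can be checked with a small enumerator. In this sketch, length counts edges, consecutive states must differ (every step of a walk moves to a neighboring node), and a new state may exceed the current maximum by at most one:

```python
def anonymous_walks(length):
    """Enumerate all anonymous walks with `length` edges as tuples of states.
    Walks start at state 1; each step moves to a different state that is at
    most (current maximum state + 1)."""
    walks = [(1,)]
    for _ in range(length):
        extended = []
        for w in walks:
            for s in range(1, max(w) + 2):
                if s != w[-1]:  # a step never stays at the same node
                    extended.append(w + (s,))
        walks = extended
    return walks

for l in range(1, 8):
    print(l, len(anonymous_walks(l)))  # 1, 2, 5, 15, 52, 203, 877
```

Under these conventions there are 877 anonymous walks with 7 edges, and the count grows exponentially with the length.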
We are now ready to define the network embeddings that we name feature-based anonymous walk embeddings (AWE).
Definition 3 (feature-based AWE).
Let A_l = (a_1, a_2, …, a_η) be the set of all possible anonymous walks of length l. The anonymous walk embedding of a graph G is the vector f_G of size η, whose i-th component corresponds to the probability p(a_i) of having anonymous walk a_i in graph G:
f_G = (p(a_1), p(a_2), …, p(a_η)).    (1)
Direct computation of AWE relies on the enumeration of all different random walks in graph G, whose number, as shown below, grows exponentially with the number of steps l.
Theorem 1.
The running time of computing Anonymous Walk Embeddings (eq. 1) is O(n · l · d^l), where d is the maximum in/out degree in a graph G with n vertices.
Proof.
Let R_l be the number of random walks of length l in a directed graph. According to (Täubig, 2012), R_l can be bounded by the powers of the in- and out-degrees of the nodes in G.
Hence, the number of random walks in a graph is at most O(n · d^l), where d is the maximum in/out degree. As it requires O(l) operations to map one random walk of length l to an anonymous walk, the theorem follows. ∎
Sampling. As complete counting of all anonymous walks in a large graph may be infeasible, we describe a sampling approach to approximate the true distribution. In this fashion, we independently draw a set of m random walks and calculate the corresponding empirical distribution of anonymous walks. To guarantee that the empirical and actual distributions are close with a given confidence, we set the number of random walks m sufficiently large.
More formally, let A_l = (a_1, a_2, …, a_η) be the set of all possible anonymous walks of length l. For two discrete probability distributions P and Q on the set A_l, define their l1 distance as:
‖P − Q‖_1 = Σ_{i=1}^{η} |P(a_i) − Q(a_i)|.
For a graph G, let D be the actual distribution of anonymous walks A_l of length l, and let X = (X_1, X_2, …, X_m) be i.i.d. random variables drawn from D. The empirical distribution D^m of the original distribution is defined as:
D^m(i) = (1/m) Σ_{j=1}^{m} [X_j = a_i],
where [•] equals 1 if the condition is true and 0 otherwise.
Then, for given ε > 0 and δ > 0, the number of samples m needed to satisfy P(‖D − D^m‖_1 ≥ ε) ≤ δ equals (from (Shervashidze et al., 2009)):
m = ⌈(2/ε²)(log₂(2^η − 2) + ln(1/δ))⌉.    (2)
For example, there are possible anonymous walks with length (Figure 2). If we set and , then . If we decrease and , then the number of samples will increase to 122500.
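The sample-size bound in equation (2) is cheap to evaluate; a sketch with illustrative values of η, ε, and δ (these are not necessarily the paper's settings):

```python
import math

def sample_size(eta, eps, delta):
    """Number of sampled random walks m for which the empirical distribution
    over eta possible anonymous walks deviates from the true one by at most
    eps (in l1 norm) with probability at least 1 - delta
    (bound from Shervashidze et al., 2009)."""
    return math.ceil(2 * (math.log2(2 ** eta - 2) + math.log(1 / delta)) / eps ** 2)

# Tightening the tolerance eps (or the confidence delta) increases m:
m_loose = sample_size(877, eps=0.5, delta=0.05)
m_tight = sample_size(877, eps=0.1, delta=0.05)
```

Note that m depends on the number of possible anonymous walks η only logarithmically through the 2^η term, i.e. roughly linearly in η itself, which is what makes the sampling approach practical.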
As transition probabilities for random walks can be preprocessed, sampling the next node in a random walk can be done in O(1) via the alias method. Hence, the overall running time of the sampling approach to computing feature-based anonymous walk embeddings is O(m · l).
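Putting the pieces together, the sampling estimate of the feature-based embedding can be sketched as follows. For simplicity this sketch assumes an unweighted graph, so each step chooses a neighbor uniformly rather than via the alias method over weighted transition probabilities, and start nodes are drawn uniformly; it returns the empirical distribution only over the anonymous walks actually observed:

```python
import random
from collections import Counter

def awe_feature_based(adj, m, length, seed=0):
    """Approximate the distribution of anonymous walks of `length` edges by
    sampling m random walks; `adj` maps each node to a list of neighbors."""
    rng = random.Random(seed)
    nodes = list(adj)
    counts = Counter()
    for _ in range(m):
        v = rng.choice(nodes)             # uniform start node
        first_seen, walk = {v: 1}, (1,)
        for _ in range(length):
            v = rng.choice(adj[v])        # uniform neighbor step
            if v not in first_seen:
                first_seen[v] = len(first_seen) + 1
            walk += (first_seen[v],)
        counts[walk] += 1
    return {w: c / m for w, c in counts.items()}

# Toy example: on a triangle, a walk can never reach a fourth distinct state.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
dist = awe_feature_based(triangle, m=2000, length=3)
```

The returned probabilities sum to one and can be scattered into the full η-dimensional vector f_G of equation (1) when a fixed ordering of anonymous walks is needed.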
Our experimental study shows state-of-the-art classification accuracy of feature-based AWE on real datasets. We continue with a data-driven approach that further increases the performance of our embeddings.
3.2 AWE: Data-Driven Model
Our approach for learning network embeddings is analogous to methods for learning paragraph vectors in a text corpus (Le & Mikolov, 2014). In our case, an anonymous walk is a word, a randomly sampled set of anonymous walks starting from the same node is a set of co-occurring words, and a graph is a document.
Neighborhoods of anonymous walks. To leverage the analogy from NLP, we first need to generate a corpus of co-occurring anonymous walks in a graph G. We define two anonymous walks of length l as neighbors if they share the same source node. This is similar to other methods such as shortest-path co-occurrence in DGK (Yanardag & Vishwanathan, 2015) and the rooted-subgraph neighborhood in graph2vec (Narayanan et al., 2017), which proved successful in empirical studies. Therefore, we iterate over each vertex u in a graph G, sampling T random walks (w_1, w_2, …, w_T) that start at u and mapping them to a sequence of co-occurring anonymous walks (a_1, a_2, …, a_T), i.e. w_i → a_i. The collection of such sequences over all vertices is a corpus of co-occurring anonymous walks in the graph, analogous to a collection of sentences in a document.
Training. In this framework, we learn a representation vector d of a graph and a matrix W of anonymous walks (see Figure 3). Vector d has size 1 × d_g, where d_g is the embedding size of a graph. Matrix W has size η × d_a, where η is the number of all possible anonymous walks of length l and d_a is the embedding size of an anonymous walk. For convenience, we call d a document vector and W a word matrix. Each graph corresponds to its vector d and each anonymous walk corresponds to a row in matrix W. The model tries to predict a target anonymous walk given co-occurring context anonymous walks and a graph.
Formally, a sequence of co-occurring anonymous walks (a_1, a_2, …, a_T) corresponds to vectors w_1, w_2, …, w_T of matrix W, and a graph G corresponds to vector d. We aim to maximize the average log probability:
(1/T) Σ_{t=Δ}^{T−Δ} log p(a_t | a_{t−Δ}, …, a_{t+Δ}, d)    (3)
where Δ is the window size, i.e. the number of context words for each target word. The probability in objective (3) is defined via the softmax function:
p(a_t | a_{t−Δ}, …, a_{t+Δ}, d) = e^{y(a_t)} / Σ_{i=1}^{η} e^{y(a_i)}    (4)
Each y(a_t) is an unnormalized log probability for the output walk a_t: y(a_t) = b + U·h(a_{t−Δ}, …, a_{t+Δ}, d), where b and U are the softmax parameters. The vector h is constructed by first averaging the context walk vectors and then concatenating the result with the graph vector d. The reason is that, since anonymous walks are randomly sampled, we average their vectors to compensate for the lack of knowledge about the order of walks; at the same time, the graph vector is shared among multiple (context, target) pairs.
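A minimal numpy sketch of this forward pass (a full softmax is used here instead of the sampled softmax used for training, and all sizes and initializations are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
eta, d_a, d_g = 877, 128, 128  # vocabulary size and embedding sizes (illustrative)

W = rng.normal(scale=0.1, size=(eta, d_a))        # anonymous-walk ("word") matrix
d = rng.normal(scale=0.1, size=d_g)               # graph ("document") vector
U = rng.normal(scale=0.1, size=(eta, d_a + d_g))  # softmax weights
b = np.zeros(eta)                                 # softmax biases

def predict_proba(context_ids, graph_vec):
    """Average the context walk vectors, concatenate the graph vector,
    and apply softmax to get probabilities over all anonymous walks."""
    h = np.concatenate([W[context_ids].mean(axis=0), graph_vec])
    y = b + U @ h         # unnormalized log probabilities
    y -= y.max()          # for numerical stability
    p = np.exp(y)
    return p / p.sum()

probs = predict_proba([3, 17, 42], d)  # three context walks, by vocabulary index
```

During training, the cross-entropy loss between `probs` and the one-hot target walk is backpropagated into `U`, `b`, the context rows of `W`, and the graph vector `d`.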
To avoid computing the sum in the softmax equation (4), which becomes impractical for large sets of anonymous walks, one can use hierarchical softmax (Mikolov et al., 2013b) or NCE loss functions (Gutmann & Hyvärinen, 2010) to speed up training. In our work, we use sampled softmax (Jean et al., 2015), which for each training example picks only a fraction of the vocabulary according to a chosen sampling function. One can measure the distribution of anonymous walks in a graph and decide on a corresponding sampling function.
At every step, the model samples context and target anonymous walks from a graph, computes the gradient of the error in predicting the target walk, and updates the vectors of the context walks and the graph via gradient backpropagation. When given several networks to embed, one can reuse the word matrix W across graphs, thereby sharing previously learned embeddings of walks.
Summarizing: after initialization of the matrix W for all anonymous walks of length l and a graph vector d, the model repeats the following two steps for all nodes in a graph: 1) for sampled co-occurring anonymous walks, the model calculates the loss (Eq. 3) of predicting a target walk (one of the sampled anonymous walks) given all context walks and the graph; 2) the model updates the vectors of the context walks in matrix W and the graph vector d via gradient backpropagation. One step of the model is depicted in Figure 3. After the sampled corpus is exhausted, the learned graph vector d is called the anonymous walk embedding.
Definition 4 (data-driven AWE).
The anonymous walk embedding of a graph G is the vector representation d of G learned on a corpus of anonymous walks sampled from G.
So, despite the fact that graph and walk vectors are initialized randomly, as an indirect result of predicting a walk in the context of other walks and a graph, the model also learns feature representations of networks. Intuitively, a graph vector can be thought of as a word with a special meaning: it serves as an overall summary of all anonymous walks in the graph.
In our experiments, we show how anonymous walk network embeddings can be used in the graph classification problem, demonstrating state-of-the-art classification accuracy.
4 Graph Classification
Graph classification is the task of predicting a class label for a whole graph; it has found applications in bioinformatics (Nikolentzos et al., 2017) and malware detection (Narayanan et al., 2017). In this task, given a series of graphs G_1, G_2, …, G_n and their corresponding labels y_1, y_2, …, y_n, we are asked to train a model m that maps graphs to labels and efficiently classifies new graphs. Two typical approaches to the graph classification problem are (1) supervised classification algorithms such as the PSCN algorithm (Niepert et al., 2016) and (2) graph kernel methods such as the WL kernel (Shervashidze et al., 2011). As we are interested in designing task-agnostic network embeddings that do not require labeled data during training, we show how to use anonymous walk embeddings in conjunction with kernel methods to classify new graphs. For this we define a kernel function on two graphs.
Definition 5 (Kernel function).
A kernel function is a symmetric, positive semidefinite function K: X × X → ℝ defined on a non-empty set X.
When X = ℝⁿ, several popular choices of kernel exist (Schölkopf & Smola, 2002):

Inner product: K(x, y) = ⟨x, y⟩, x, y ∈ ℝⁿ,

Polynomial: K(x, y) = (⟨x, y⟩ + c)^d, x, y ∈ ℝⁿ,

RBF: K(x, y) = exp(−‖x − y‖² / (2σ²)), x, y ∈ ℝⁿ.
With network embeddings, it is then easy to define a kernel function on two graphs:
K(G_1, G_2) = K'(f_{G_1}, f_{G_2}),    (5)
where f_G is an embedding of a graph G and K': ℝⁿ × ℝⁿ → ℝ is a kernel function.
To train a graph classifier one can then construct a square kernel matrix K for the training data and feed this matrix to a kernelized algorithm such as SVM. Every element of the kernel matrix equals K_ij = K(G_i, G_j). For classifying a new test instance G', one first computes its graph kernels with the training instances, (K(G', G_1), …, K(G', G_n)), and provides them to the trained classifier m.
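The kernel matrix computation can be sketched with numpy; the kernel parameters `sigma`, `c`, and `deg` below are illustrative defaults, not the paper's tuned values:

```python
import numpy as np

def kernel_matrix(embeddings, kind="rbf", sigma=1.0, c=0.0, deg=2):
    """Build the square matrix K_ij = K'(f_Gi, f_Gj) from graph embeddings
    (one row per graph), for use with a precomputed-kernel classifier."""
    X = np.asarray(embeddings, dtype=float)
    G = X @ X.T                                 # pairwise inner products
    if kind == "inner":
        return G
    if kind == "poly":
        return (G + c) ** deg
    if kind == "rbf":
        sq = np.diag(G)
        d2 = sq[:, None] + sq[None, :] - 2 * G  # squared Euclidean distances
        return np.exp(-d2 / (2 * sigma ** 2))
    raise ValueError(f"unknown kernel: {kind}")

K = kernel_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], kind="rbf")
```

The resulting matrix can be fed to any classifier that accepts precomputed kernels, e.g. scikit-learn's SVC(kernel="precomputed").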
In our experiments, we use anonymous walk embeddings to compute kernel matrices and show that the kernelized SVM classifier achieves a significant boost in performance compared to more complex state-of-the-art models.
5 Experiments
We evaluate our embeddings on the task of graph classification for a variety of widely used datasets.
Datasets. We evaluate performance on two sets of graphs. One set contains unlabeled graph data and is related to social networks (Yanardag & Vishwanathan, 2015). The other set contains graphs with labels on nodes and/or edges and originates from bioinformatics (Shervashidze et al., 2011). Statistics of these ten graph datasets are presented in Table 1.
Evaluation. We train a multiclass SVM classifier with the one-vs-one scheme. We perform 10-fold cross-validation and for each fold we estimate the SVM parameter C from the range [0.001, 0.01, 0.1, 1, 10] using a validation set. This process is repeated 10 times and the average accuracy is reported, i.e. the average proportion of correctly classified test graphs.
Table 1. Statistics of the graph datasets.

Dataset   Source  Graphs  Classes (Max)
COLLAB    Social  5000    3 (2600)
IMDB-B    Social  1000    2 (500)
IMDB-M    Social  1500    3 (500)
RE-B      Social  2000    2 (1000)
RE-M5K    Social  4999    5 (1000)
RE-M12K   Social  12000   11 (2592)
NCI1      Bio     4110    2 (2057)
NCI109    Bio     4127    2 (2079)
DD        Bio     1178    2 (691)
Mutag     Bio     188     2 (125)
Competitors. 2D CNN is the most recent supervised network representation algorithm (Tixier et al., 2017). PSCN is a convolutional neural network algorithm (Niepert et al., 2016) with receptive field size equal to 10. 2D CNN and PSCN are state-of-the-art instances of neural network algorithms that have achieved strong classification accuracy on many datasets, and we use the best reported accuracy for these algorithms. GK is the graphlet kernel (Shervashidze et al., 2009) and DGK is the deep graphlet kernel (Yanardag & Vishwanathan, 2015) with graphlet size equal to 7. WL is the Weisfeiler-Lehman graph kernel (Shervashidze et al., 2011) with the height of the subtree pattern equal to 7. WL has shown consistently strong results compared to other graph kernels and supervised algorithms. ER is the exponential random walk kernel (Gärtner et al., 2003) and kR is the k-step random walk kernel (Sugiyama & Borgwardt, 2015). FGSD is a graph spectral algorithm with harmonic distance (Verma & Zhang, 2017). graph2vec is an unsupervised algorithm that provides distributed network embeddings (Narayanan et al., 2017).
Setup. For feature-based anonymous walk embeddings (Def. 3), we set the length of walks l and approximate the actual distribution of anonymous walks via sampling with equation (2).
For data-driven anonymous walk embeddings (Def. 4), we set the length of walks l to generate a corpus of co-occurring anonymous walks. Gradient descent runs for 100 iterations with a batch size that we vary over the range [100, 500, 1000, 5000, 10000]. Context walks are drawn from a window whose size varies over the range [2, 4, 8, 16]. The embedding sizes of walks and graphs both equal 128. Finally, the candidate sampling function for the softmax equation (4) uses a uniform or log-uniform distribution over sampled classes.
To perform classification, we compute a kernel matrix, testing Inner product, Polynomial, and RBF kernels. For the RBF kernel we tune the parameter σ; for the Polynomial kernel we set the parameters c and d. We run the experiments on a machine with an Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz and 32GB RAM. We refer to our algorithms as AWE (DD) and AWE (FB) for the data-driven and feature-based approaches respectively.
Classification results. Table 2 presents classification accuracy on the unlabeled Social datasets. For IMDB-M, the AWE (DD) approach almost doubles the state-of-the-art classification accuracy of the neural network algorithm PSCN and the WL graph kernel. For the IMDB-B, RE-M5K, RE-M12K, and COLLAB datasets, the AWE approach improves classification accuracy by an additional 10-20%. At the same time, Table 4 shows accuracy results for the labeled bio datasets. Note that AWE is learned using only the topology of the network and not node/edge labels. Even in this setting, our embedding achieves state-of-the-art performance on the NCI1, NCI109, and DD datasets.
Table 2. Classification accuracy on the Social datasets.

Algorithm  IMDB-M        IMDB-B        COLLAB        RE-B          RE-M5K        RE-M12K
AWE (DD)   90.65 ± 2.45  90.62 ± 2.61  80.03 ± 1.75  77.17 ± 3.04  73.28 ± 1.69  63.53 ± 1.44
AWE (FB)   51.58 ± 4.66  73.13 ± 3.28  70.99 ± 1.49  82.97 ± 2.86  54.74 ± 2.93  41.51 ± 1.98
2D CNN     –             70.47 ± 3.24  70.37 ± 1.60  89.45 ± 1.64  52.11 ± 2.24  48.13 ± 1.47
PSCN       45.23 ± 2.84  71.00 ± 2.29  72.60 ± 2.15  86.30 ± 1.58  49.10 ± 0.70  41.32 ± 0.32
DGK        44.55 ± 0.52  66.96 ± 0.56  73.09 ± 0.25  78.04 ± 0.39  41.27 ± 0.18  32.22 ± 0.10
WL         49.33 ± 4.75  73.4 ± 4.63   79.02 ± 1.77  81.1 ± 1.9    49.44 ± 2.36  38.18 ± 1.3
GK         43.89 ± 0.38  65.87 ± 0.98  72.84 ± 0.28  65.87 ± 0.98  41.01 ± 0.17  31.82 ± 0.08
ER         OOM           64.00 ± 4.93  OOM           OOM           OOM           OOM
kR         34.47 ± 2.42  45.8 ± 3.45   OOM           OOM           OOM           OOM
Overall observations.

Tables 2 and 4 demonstrate that AWE significantly outperforms state-of-the-art solutions in the graph classification task. Importantly, even with a simple classifier such as SVM, AWE elevates classification accuracy compared to more complex neural network models. Likewise, comparing graph kernels alone, we can see that anonymous walks work consistently better than other graph objects such as graphlets (GK kernel) or subtree patterns (WL kernel).

The data-driven approach AWE (DD) achieves better performance than the feature-based AWE (FB) on all datasets except RE-B. This indicates that sparse handcrafted network features may be hard to generalize across graph data.

Polynomial and RBF kernel functions bring nonlinearity to the classification algorithm and are able to learn more complex classification boundaries. Table 3 shows that RBF and Polynomial kernels are well suited for the feature-based and data-driven models respectively.
Table 3. Classification accuracy of AWE with different kernel functions.

Algorithm       IMDB-M  COLLAB  RE-B
AWE (DD) RBF    90.32   80.03   75.15
AWE (DD) Inner  81.86   74.13   62.09
AWE (DD) Poly   90.65   76.42   77.17
AWE (FB) RBF    51.58   70.99   82.97
AWE (FB) Inner  46.45   69.60   76.83
AWE (FB) Poly   46.57   64.3    67.22
Table 4. Classification accuracy on the bio datasets.

Algorithm  NCI1   NCI109  DD     Mutag
AWE        83.54  82.76   97.27  84.02
WL         84.55  84.49   79.78  83.78
PSCN       78.59  –       77.12  92.63
FGSD       79.80  78.84   77.10  92.12
DGK        80.31  80.32   72.75  87.44
GK         62.07  62.04   75.05  84.04
graph2vec  73.22  74.26   –      83.15
kR         58.66  58.36   66.64  79.19
Scalability. To test scalability, we learn network representations with the AWE (DD) algorithm on Erdos-Renyi graphs of increasing size. For each size we construct 10 Erdos-Renyi graphs with n nodes and edge probability p, where p is the probability of an edge between two arbitrary nodes; in expectation such a graph has p·n(n−1)/2 edges. We average the time to train AWE (DD) embeddings across the 10 graphs for every n and p. Our setup: embedding size equal to 128, batch size equal to 100, window size equal to 100. We run the AWE (DD) model for 100 iterations. In Figure 4, we empirically observe that learning AWE (DD) network representations scales to networks with tens of thousands of nodes and edges and requires no more than a few seconds to map a graph to a vector.
Intuition behind performance. A couple of factors lead anonymous walk embeddings to state-of-the-art performance in the graph classification task. First, the use of anonymous walks is backed by the recent discovery that, under certain conditions, the distribution of anonymous walks of a single node is sufficient to reconstruct the topology of the ball around that node. Hence, at least at the level of a single node, the distribution of anonymous walks serves as a unique representation of subgraphs in a network. Second, the data-driven approach reuses the embedding matrix learned in previous iterations when learning embeddings of new graph instances. One can therefore think of anonymous walks as words with semantic meanings unified across all graphs. While learning graph embeddings, we simultaneously learn the meanings of different anonymous walks, which provides extra information for our model.
6 Related Work
Network representations were first studied in the context of graph kernels (Gärtner et al., 2003) and have since become a separate topic with numerous applications beyond graph classification (Cai et al., 2017). Our feature-based embeddings originate from learning the distribution of anonymous walks in a graph and are akin to the graph kernel approach. Embeddings based on graph kernels include Random Walk (Gärtner et al., 2003), Graphlet (Shervashidze et al., 2009), Weisfeiler-Lehman (Shervashidze et al., 2011), and Shortest-Path (Borgwardt & Kriegel, 2005) decompositions, all of which can be summarized as instances of the R-convolution framework (Haussler, 1999).
Distributed representations became popular after significant achievements in NLP applications (Mikolov et al., 2013a, b). Our data-driven network embeddings stem from the paragraph-vector distributed-memory model (Le & Mikolov, 2014), which has been successful in learning document representations. Other related approaches include the Deep Graph Kernel (Yanardag & Vishwanathan, 2015), which learns a matrix for a graph kernel that encodes the relationship between substructures; the PSCN (Niepert et al., 2016) and 2D CNN (Tixier et al., 2017) algorithms, which learn convolutional neural networks on graphs; graph2vec (Narayanan et al., 2017), which learns network embeddings by extracting rooted subgraphs and training a skip-gram negative sampling model (Mikolov et al., 2013b); and FGSD (Verma & Zhang, 2017), which constructs a feature vector from the histogram of the multiset of pairwise node distances. (Cai et al., 2017) provides a more comprehensive list of graph embeddings. Besides this, there are aggregation techniques over node embeddings for the purpose of graph classification (Hamilton et al., 2017).
7 Conclusion
We described two unsupervised algorithms that compute network vector representations using anonymous walks. In the first approach, we use the distribution of anonymous walks as a network embedding. As the exact computation of network embeddings can be expensive, we demonstrate how one can sample walks in a graph to approximate the actual distribution with a given confidence. Next, we show how one can learn distributed graph representations in a data-driven manner, similar to learning paragraph vectors in NLP.
In our experiments, we show that our network embeddings, even with a simple SVM classifier, achieve a significant increase in classification accuracy compared to state-of-the-art supervised neural network methods and graph kernels. This highlights an illustrative case where the representation of the data can be a more promising subject of study than the type and architecture of the predictive model.
Although the focus of this work was on the representation of networks, the AWE (DD) algorithm can be used to learn node, edge, or any subgraph representations by replacing the graph vector with a corresponding subgraph vector. In all graph and subgraph representations, we expect the data-driven approach to be a strong alternative to feature-based methods.
8 Acknowledgement
This work was supported by the Ministry of Education and Science of the Russian Federation (Grant no. 14.756.31.0001).
References
 Abraham (2012) Abraham, Ajith. Computational Social Networks: Security and Privacy. Springer Publishing Company, Incorporated, 2012.
 Babai (2016) Babai, László. Graph isomorphism in quasipolynomial time [extended abstract]. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pp. 684–697, 2016.
 Bengio et al. (2003) Bengio, Yoshua, Ducharme, Réjean, Vincent, Pascal, and Janvin, Christian. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.
 Borgwardt & Kriegel (2005) Borgwardt, Karsten M. and Kriegel, Hans-Peter. Shortest-path kernels on graphs. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), 27-30 November 2005, Houston, Texas, USA, pp. 74–81, 2005.
 Cai et al. (2017) Cai, HongYun, Zheng, Vincent W., and Chang, Kevin ChenChuan. A comprehensive survey of graph embedding: Problems, techniques and applications. CoRR, abs/1709.07604, 2017. URL http://arxiv.org/abs/1709.07604.
 Cao et al. (2016) Cao, Shaosheng, Lu, Wei, and Xu, Qiongkai. Deep neural networks for learning graph representations. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 1217, 2016, Phoenix, Arizona, USA., pp. 1145–1152, 2016.
 Gärtner et al. (2003) Gärtner, Thomas, Flach, Peter A., and Wrobel, Stefan. On graph kernels: Hardness results and efficient alternatives. In Computational Learning Theory and Kernel Machines, 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 2427, 2003, Proceedings, pp. 129–143, 2003.
 Gutmann & Hyvärinen (2010) Gutmann, Michael and Hyvärinen, Aapo. Noisecontrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 1315, 2010, pp. 297–304, 2010.
 Hamilton et al. (2017) Hamilton, William L., Ying, Rex, and Leskovec, Jure. Representation learning on graphs: Methods and applications. IEEE Data Eng. Bull., 40(3):52–74, 2017.
 Haussler (1999) Haussler, David. Convolution kernels on discrete structures. Technical report, 1999.
 Jean et al. (2015) Jean, Sébastien, Cho, Kyunghyun, Memisevic, Roland, and Bengio, Yoshua. On using very large target vocabulary for neural machine translation. In ACL 2015, pp. 1–10, 2015.
 Le & Mikolov (2014) Le, Quoc V. and Mikolov, Tomas. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014, pp. 1188–1196, 2014.
 Micali & Allen-Zhu (2016) Micali, Silvio and Allen-Zhu, Zeyuan. Reconstructing Markov processes from independent and anonymous experiments. Discrete Applied Mathematics, 200:108–122, 2016.
 Mikolov et al. (2013a) Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. CoRR, 2013a. URL http://arxiv.org/abs/1301.3781.
 Mikolov et al. (2013b) Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Gregory S., and Dean, Jeffrey. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems NIPS, pp. 3111–3119, 2013b.
 Monti et al. (2017) Monti, Federico, Boscaini, Davide, Masci, Jonathan, Rodolà, Emanuele, Svoboda, Jan, and Bronstein, Michael M. Geometric deep learning on graphs and manifolds using mixture model CNNs. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, pp. 5425–5434, 2017.
 Narayanan et al. (2017) Narayanan, Annamalai, Chandramohan, Mahinthan, Venkatesan, Rajasekar, Chen, Lihui, Liu, Yang, and Jaiswal, Shantanu. graph2vec: Learning distributed representations of graphs. In Proceedings of the 13th International Workshop on Mining and Learning with Graphs (MLG), 2017.
 Niepert et al. (2016) Niepert, Mathias, Ahmed, Mohamed, and Kutzkov, Konstantin. Learning convolutional neural networks for graphs. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19–24, 2016, pp. 2014–2023, 2016.
 Nikolentzos et al. (2017) Nikolentzos, Giannis, Meladianos, Polykarpos, and Vazirgiannis, Michalis. Matching node embeddings for graph similarity. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4–9, 2017, San Francisco, California, USA, pp. 2429–2435, 2017.
 Schölkopf & Smola (2002) Schölkopf, Bernhard and Smola, Alexander Johannes. Learning with Kernels: support vector machines, regularization, optimization, and beyond. Adaptive computation and machine learning series. MIT Press, 2002.
 Shervashidze et al. (2009) Shervashidze, Nino, Vishwanathan, S. V. N., Petri, Tobias, Mehlhorn, Kurt, and Borgwardt, Karsten M. Efficient graphlet kernels for large graph comparison. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 16–18, 2009, pp. 488–495, 2009.
 Shervashidze et al. (2011) Shervashidze, Nino, Schweitzer, Pascal, van Leeuwen, Erik Jan, Mehlhorn, Kurt, and Borgwardt, Karsten M. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.
 Sugiyama & Borgwardt (2015) Sugiyama, Mahito and Borgwardt, Karsten M. Halting in random walk kernels. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7–12, 2015, Montreal, Quebec, Canada, pp. 1639–1647, 2015.
 Tixier et al. (2017) Tixier, Antoine JeanPierre, Nikolentzos, Giannis, Meladianos, Polykarpos, and Vazirgiannis, Michalis. Classifying graphs as images with convolutional neural networks. CoRR, abs/1708.02218, 2017. URL http://arxiv.org/abs/1708.02218.
 Täubig (2012) Täubig, H. The number of walks and degree powers in directed graphs. Technical Report TUM-I123, Department of Computer Science, TU Munich, 2012.
 Verma & Zhang (2017) Verma, Saurabh and Zhang, Zhi-Li. Hunt for the unique, stable, sparse and fast feature learning on graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pp. 87–97, 2017.
 Vishwanathan et al. (2010) Vishwanathan, S. V. N., Schraudolph, Nicol N., Kondor, Risi, and Borgwardt, Karsten M. Graph kernels. J. Mach. Learn. Res., 11:1201–1242, August 2010. ISSN 15324435.
 Yanardag & Vishwanathan (2015) Yanardag, Pinar and Vishwanathan, S. V. N. Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10–13, 2015, pp. 1365–1374, 2015.