# Hebbian Graph Embeddings

## Abstract

Representation learning has recently been used successfully to create vector representations of entities in language learning, recommender systems and similarity learning. Graph embeddings exploit the locality structure of a graph and generate embeddings for nodes, which could be words in a language or products on a retail website; the nodes are connected based on a context window. In this paper, we consider graph embeddings with an error-free associative learning update rule, which models the embedding vector of a node as a non-convex Gaussian mixture of the embeddings of the nodes in its immediate vicinity, with a variance that is reduced as iterations progress. Our algorithm is very easy to parallelize without any form of shared memory, which makes it possible to use it on very large graphs with a much higher dimensionality of the embeddings. We study the efficacy of the proposed method on several benchmark data sets and compare favourably with state-of-the-art methods. Further, the proposed method is applied to generate relevant recommendations for a large retailer.

## 1 Introduction

Graph embeddings learn vector representations of nodes in a graph. [3] and [6] give comprehensive surveys of graph embedding methods such as node2vec [7] as well as deep convolutional embeddings. The advantage of learning low dimensional embeddings is that they induce an order on the nodes of a graph, which could be authors in a citation network, products in a recommender system, or words in a text corpus. The order could be established using an inner product or using another machine learning algorithm such as a neural network or a random forest.

Our method uses error-free associative learning to learn the embeddings on graphs. The algorithm is quite simple, but very effective. We apply the learnt embeddings to the task of recommending items to users and to the task of link prediction and reconstruction.

Label propagation and message passing have been applied to many tasks like feature propagation [8], interest propagation, propagation of information in a population [19] and other network models of behavior like PageRank [18] and models of text like TextRank [14]. Instead of propagating a single unit of information, we propagate entire embeddings across the network. By propagating information on a graph iteratively, long distance similarities can also be learnt.

For link prediction and reconstruction, our results are directly comparable to [6]. We compare our results with the state-of-the-art results in [6] in tables 1, 5, 6 and 7 and find that our method performs better than VGAE [11], node2vec [7], GF (graph factorization) [1], SDNE [23], HOPE [17] and LE [2]. The exception is SEAL [24], which outperforms our method on 3 out of 4 data sets.

Our method is similar to LLE [20] (we compare with algorithms that were developed after LLE). It takes inspiration from PageRank [18] and from annealing [12], which is theoretically sound and iteratively reduces the global variance. Our method is an instance of errorless learning which, as the results show, is effective and embarrassingly parallelizable.

Annealing is a process in which steel is heated in a furnace to a very high temperature and then gradually cooled to lower temperatures. This creates a self-organizing process that improves the ductility of the steel. A high temperature implies higher variance. We take inspiration from this process: initially the variance is very high, and it is gradually reduced so that the network gains more structure and finds a stable state by the time the iterations complete [12].

## 2 Hebbian Graph Embeddings

Hebbian learning is the simplest form of learning, introduced by Donald Hebb in 1949 in his book "The Organization of Behavior" [9]. It is inspired by the dynamics of biological systems. A synapse between two neurons is strengthened when the neurons on either side of the synapse (input and output) have highly correlated outputs. In essence, when an input neuron fires, if it frequently leads to the firing of the output neuron, the synapse is strengthened. In simple terms: "neurons that fire together wire together" [9]. Recently, there has been renewed interest in Hebbian learning. [10] postulates that Hebbian learning predicts mirror-like neurons for sensations and emotions. [21] applies Hebbian learning to the modelling of temporal-causal networks.

Hebbian learning consists of a parameter update rule which is based on the strength of connection between two nodes, as applied to neural networks (based on firing tendencies of neurons on the opposite ends of a synapse). We extend the idea to graphs. Based on a pre-computed transition probability between two nodes, we update the parameters (the embeddings of a node) iteratively based on an error-free associative learning rule (nodes that are contextually connected should have similar embeddings, like word2vec for words [15]). For a discussion on errorless learning, please see [13].
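For reference, the classical Hebbian rule for a synapse between an input unit and an output unit is commonly written as below. The symbols $x_i$ (input activity), $y_j$ (output activity), $w_{ij}$ (synaptic weight) and $\eta$ (learning rate) are textbook notation introduced here for illustration and are not taken from this paper.

```latex
\Delta w_{ij} = \eta \, x_i \, y_j
```

The update contains no error or target term: the weight changes only as a function of correlated activity. This is the sense in which such a rule is called error-free.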

We first initialize all embeddings from a multivariate normal distribution with mean 0 and variance $\sigma^2$.

$$w_i \sim \mathcal{N}(0, \sigma^2 I) \tag{1}$$

We model the embedding at a node as a non-convex Gaussian mixture of the embeddings of the connected nodes. If there is an edge from node i to node j, the embedding of node j is modeled as follows:

$$w_j \sim \sum_{i} p_{ij} \, \mathcal{N}(w_i, \sigma^2 I) \tag{2}$$

The variance starts off at a value of 10 and is divided by 1.1 every iteration in the spirit of simulated annealing [12]. The embedding of node j is updated as follows:

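$$\tilde{w}_i \sim \mathcal{N}(w_i, \sigma^2 I) \tag{3}$$

$$\delta w_{ij} = \eta \, p_{ij} \, \tilde{w}_i \tag{4}$$

$$w_j \leftarrow w_j + \delta w_{ij} \tag{5}$$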

The updates $\delta w_{ij}$ are then simply added to the embedding at node j (where there is an edge from node i to node j). $p_{ij}$ is the transition probability and $\eta$ is the learning rate. The graph is weighted and undirected, with asymmetric edge weights: the transition probability from node i to node j may differ from that from node j to node i. Also, a random negative edge is selected at each node and the negative of the embeddings is propagated to both selected nodes with a fixed transition probability (we use 0.5). This iterative procedure learns the embeddings of all nodes in the graph and generates very effective embeddings, as the next section shows. As shown in figure 1, the embeddings are propagated across the Gaussian graph iteratively.

(6) |

(7) |

(8) |
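The procedure described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `hebbian_embed`, its parameters, the synchronous accumulation of updates, and the exact handling of negative edges (one random non-neighbour per node, with each end receiving the negated noisy embedding of the other) are all choices made here for illustration.

```python
import numpy as np

def hebbian_embed(edges, num_nodes, dim=8, iters=10, lr=1.0,
                  var0=10.0, decay=1.1, neg_p=0.5, seed=0):
    """Sketch of the update rule. edges: list of (i, j, p_ij) triples,
    where p_ij is the transition probability from node i to node j."""
    rng = np.random.default_rng(seed)
    var = var0
    linked = {(i, j) for i, j, _ in edges}
    # Initialise every embedding from N(0, var * I).
    w = rng.normal(0.0, np.sqrt(var), size=(num_nodes, dim))
    for _ in range(iters):
        std = np.sqrt(var)
        delta = np.zeros_like(w)
        # Positive propagation: node i sends a noisy copy of its embedding
        # to node j, scaled by the learning rate and transition probability.
        for i, j, p in edges:
            delta[j] += lr * p * rng.normal(w[i], std)
        # Negative propagation (assumed form): pick one random partner per
        # node; if the pair is not an edge, both ends receive the negated
        # noisy embedding of the other with a fixed probability neg_p.
        for i in range(num_nodes):
            k = int(rng.integers(num_nodes))
            if k == i or (i, k) in linked or (k, i) in linked:
                continue
            delta[k] -= lr * neg_p * rng.normal(w[i], std)
            delta[i] -= lr * neg_p * rng.normal(w[k], std)
        w += delta      # apply all updates of this iteration at once
        var /= decay    # annealing step: shrink the variance
    return w
```

Accumulating the updates in `delta` and applying them once per iteration means every message is computed from the previous iteration's embeddings, which is the property that makes the loop easy to distribute.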

## 3 Experiments and Results

We run our algorithm on three of the data sets used in [6], namely AstroPh, BlogCatalog and HepTh, for both link prediction and reconstruction. Our algorithm outperforms several other algorithms that are implemented in [6] and [5]. We also compare our algorithm on link prediction, in terms of average precision and run-time, with SEAL [24] and VGAE [11] in table 6 and table 7.

We start with an initial variance of 10 and use a variance reduction factor of 1.1. We run the algorithm for 10 iterations. The algorithm is shown in Algorithm 1.
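The effect of these settings on the variance can be checked with a short calculation. The helper `variance_schedule` is a hypothetical name, and the sketch assumes that the first iteration uses the initial variance and that the division happens at the end of each iteration.

```python
def variance_schedule(initial=10.0, factor=1.1, iterations=10):
    """Return the variance used in each iteration and the value that
    remains after the final division."""
    values = []
    var = initial
    for _ in range(iterations):
        values.append(var)
        var /= factor
    return values, var

per_iteration, final = variance_schedule()
print(round(per_iteration[1], 3))  # 9.091, the variance in the second iteration
print(round(final, 3))             # 3.855, i.e. 10 / 1.1**10
```

With these settings the variance falls to a little under 40% of its starting value over the 10 iterations, so the cooling is gentle rather than a collapse to zero.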

Link prediction is the task of predicting links between pairs of nodes, where those links were not part of the training data. Reconstruction tries to reconstruct the entire graph, all of which is used for training (i.e. there is no train/test separation).

We also run our algorithm on our recommender system and find that it is able to achieve a very high hit rate. Future work will focus more on the recommender system.

### 3.1 Results on Reconstruction

We ran our algorithm for reconstruction on publicly available data sets. Reconstruction tries to reconstruct the entire original graph (without splitting into train/test). As in [6], we sample 1024 nodes for the calculation of the MAP. We run the algorithm for 10 iterations with a learning rate of 1.0. The results in table 1, table 2 and figure 2 show that our algorithm achieves good results on reconstruction when the dimensionality is large. As benchmarks, we use three data sets that [6] uses for reconstruction, and our results compare favorably on those three data sets. The other data sets are not used by [6], but the supporting code base in [5] can be used to compare on them.
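One way the MAP over sampled nodes could be computed from learnt embeddings is sketched below. The function names are hypothetical, scoring by inner product is an assumption, and the sketch ranks each sampled node against every other node in the graph; evaluating only within the sampled subgraph is an equally plausible reading of the protocol in [6].

```python
import numpy as np

def average_precision(scores, true_neighbors):
    """Average precision of one node's ranking.
    scores[v] is the predicted affinity of the node to node v."""
    if not true_neighbors:
        return None
    order = np.argsort(-scores)  # highest score first
    hits, precisions = 0, []
    for rank, v in enumerate(order, start=1):
        if int(v) in true_neighbors:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions))

def mean_average_precision(w, adjacency, sample_size=1024, seed=0):
    """w: (num_nodes, dim) embeddings.
    adjacency: dict mapping a node to the set of neighbours to recover."""
    rng = np.random.default_rng(seed)
    n = w.shape[0]
    nodes = rng.choice(n, size=min(sample_size, n), replace=False)
    aps = []
    for u in nodes:
        scores = (w @ w[u]).astype(float)
        scores[u] = -np.inf  # never rank a node against itself
        ap = average_precision(scores, adjacency.get(int(u), set()))
        if ap is not None:   # skip nodes with nothing to recover
            aps.append(ap)
    return float(np.mean(aps)) if aps else 0.0
```

For reconstruction, `adjacency` would hold the full training graph; for link prediction the same function applies with `adjacency` restricted to the held-out edges.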

Table 1: Reconstruction MAP of Hebbian Graph Embeddings compared with the methods reported in [6].

Algorithm | Dimensionality | AstroPh | BlogCatalog | HepTh |
---|---|---|---|---|
Hebbian Graph Embeddings | 200 | 0.573 | 0.499 | 0.619 |
node2vec | 256 | 0.56 | 0.24 | 0.42 |
GF | 256 | 0.29 | 0.09 | 0.39 |
SDNE | 256 | 0.46 | 0.33 | 0.5 |
HOPE | 256 | 0.33 | 0.45 | 0.32 |
LE | 256 | 0.26 | 0.09 | 0.4 |

Table 2: Reconstruction results (MAP) for varying embedding dimensionality. The numeric column headers give the dimensionality.

DataSet | Nodes | Edges | 10 | 20 | 50 | 100 | 200 | 300 | 400 | 500 |
---|---|---|---|---|---|---|---|---|---|---|
CondMat | 23,133 | 93,497 | 0.192 | 0.304 | 0.495 | 0.649 | 0.778 | 0.838 | 0.873 | 0.895 |
GrQc | 5,242 | 14,496 | 0.245 | 0.407 | 0.625 | 0.763 | 0.860 | 0.894 | 0.910 | 0.918 |
HepPh | 12,008 | 118,521 | 0.196 | 0.293 | 0.455 | 0.586 | 0.698 | 0.755 | 0.789 | 0.814 |
AstroPh | 18,772 | 198,110 | 0.181 | 0.245 | 0.362 | 0.461 | 0.573 | 0.635 | 0.675 | 0.707 |
HepTh | 27,770 | 352,807 | 0.188 | 0.261 | 0.402 | 0.509 | 0.619 | 0.679 | 0.709 | 0.732 |
BlogCatalog | 10,312 | 333,983 | 0.432 | 0.432 | 0.458 | 0.491 | 0.499 | 0.507 | 0.508 | 0.496 |

Table 3: Reconstruction MAP of random embeddings (no training) at dimensionality 500.

DataSet | Nodes | Edges | Random (no training), dimension 500 |
---|---|---|---|
CondMat | 23,133 | 93,497 | 0.0139 |
GrQc | 5,242 | 14,496 | 0.0126 |
HepPh | 12,008 | 118,521 | 0.0233 |
AstroPh | 18,772 | 198,110 | 0.0255 |
HepTh | 27,770 | 352,807 | 0.0292 |
BlogCatalog | 10,312 | 333,983 | 0.0364 |

### 3.2 Results on the Recommender System of a large retailer

For the recommender system at a large retailer, we use a sample of 200 thousand items as our population for training and measurement. 10% of the users are held out as the test set. The number of nodes in the graph is 200,000 and the number of edges is about 13.1 billion (note that the weight of an incoming edge might be different from that of the outgoing edge between the same two nodes).

We measure the performance of our algorithm using the hit rate. The top 10 recommendations are generated per item from the nearest neighbors of the learnt embeddings under an inner product (using all 200,000 items). Then, one random item from the user's entire interaction history is chosen as the seed item, and the recommendations for this item are looked up. If any of the top 10 recommended items (other than the seed item) also occurs in the user's interaction history, it is counted as a hit; otherwise it is counted as a miss. The average hit rate is the number of hits divided by the number of users in the test set. Results are shown in table 4. We use 10 iterations and a learning rate of 1.0.
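The hit-rate measurement described above can be sketched as follows. The names `top_k_items` and `hit_rate_at_k` and the input layout (one list of item ids per test user) are assumptions made for illustration; a production system over 200,000 items would likely use an approximate nearest-neighbour index rather than a full sort.

```python
import numpy as np

def top_k_items(w, seed_item, k=10):
    """Indices of the k items with the largest inner product to seed_item,
    excluding the seed item itself."""
    scores = (w @ w[seed_item]).astype(float)
    scores[seed_item] = -np.inf
    k = min(k, len(scores) - 1)
    return np.argsort(-scores)[:k]

def hit_rate_at_k(w, histories, k=10, seed=0):
    """histories: one non-empty list of item ids per test user."""
    rng = np.random.default_rng(seed)
    hits = 0
    for items in histories:
        seed_item = int(rng.choice(items))      # random item from the history
        recs = top_k_items(w, seed_item, k)
        others = set(items) - {seed_item}       # the seed itself never counts
        if others.intersection(int(r) for r in recs):
            hits += 1
    return hits / len(histories)
```

For example, `hit_rate_at_k(w, test_histories, k=10)` returns the fraction of test users for whom at least one recommended item appears elsewhere in their history.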

The edges are determined from the graph induced by the consumer-product bipartite graph, based on the co-viewing of products. So, if two products were viewed by the same consumer, we create an edge between them, with the edge weight defined as described in section 2.
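A small sketch of how such a co-view graph could be induced from user-product interactions is given below. The text does not state how the transition probabilities are derived from co-views, so the row normalisation of co-view counts used here, along with the function name and input layout, is an assumption for illustration only.

```python
from collections import Counter, defaultdict
from itertools import permutations

def coview_transition_probs(views):
    """views: dict mapping a user to the products that user viewed.
    Returns probs[a][b], an assumed transition probability from a to b."""
    counts = defaultdict(Counter)
    for products in views.values():
        # Every ordered pair of distinct products seen by one user
        # contributes one co-view in each direction.
        for a, b in permutations(set(products), 2):
            counts[a][b] += 1
    probs = {}
    for a, nbrs in counts.items():
        total = sum(nbrs.values())
        probs[a] = {b: c / total for b, c in nbrs.items()}
    return probs

views = {"u1": ["a", "b"], "u2": ["a", "b", "c"], "u3": ["a", "c"]}
probs = coview_transition_probs(views)
print(probs["a"]["b"], round(probs["b"]["a"], 3))  # 0.5 0.667
```

Under this normalisation the co-view counts are symmetric but the resulting weights are not, which matches the remark that the weight of an incoming edge can differ from that of the outgoing edge between the same two nodes.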

Table 4: Hit rate of the top 10 recommendations on the recommender system data.

Dimensionality | HitRate@10 |
---|---|
100 | 24.2% |
200 | 30.1% |
250 | 31.1% |

### 3.3 Results on Link Prediction

For link prediction, we use some of the data sets used in [16] and [6]. As in [6], we sample 1024 nodes for the calculation of the MAP. We keep 10% of the edges as a held-out test set. We run the algorithm for 10 iterations with a learning rate of 1.0. The results in table 8, table 9 and figure 3 show that our algorithm achieves good results on link prediction when the dimensionality is large. As benchmarks, we use three data sets that [6] uses for link prediction, and our results compare favorably on those three data sets. [6] also has a supporting code base [5] which can be used to compare on other data sets.
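Holding out 10% of the edges can be sketched as below. The helper name `split_edges` and the uniform random split are assumptions; the text does not say whether the split is stratified or whether connectivity of the training graph is preserved.

```python
import random

def split_edges(edges, test_fraction=0.1, seed=0):
    """Shuffle the edge list and hold out a fraction of it for testing.
    Returns (train_edges, test_edges)."""
    edges = list(edges)
    rng = random.Random(seed)
    rng.shuffle(edges)
    n_test = int(round(len(edges) * test_fraction))
    return edges[n_test:], edges[:n_test]

all_edges = [(i, i + 1) for i in range(100)]
train_edges, test_edges = split_edges(all_edges)
print(len(train_edges), len(test_edges))  # 90 10
```

The embeddings are then learnt on `train_edges` only, and the ranking metric is computed against `test_edges`.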

We also compare our algorithm with SEAL and VGAE and find that our algorithm outperforms VGAE on all four data-sets and outperforms SEAL on one of the four data-sets. Note that since our algorithm is an Apache Spark application, there is some initial time spent on initialization and allocation of resources. The larger the graph, the more noticeable is the difference in the run-time. For instance, it might be infeasible to run SEAL or VGAE on our recommender system data-set with 200,000 nodes and 13.1 billion edges.

Table 5: Link prediction MAP of Hebbian Graph Embeddings compared with the methods reported in [6].

Algorithm | Dimensionality | AstroPh | BlogCatalog | HepTh |
---|---|---|---|---|
Hebbian Graph Embeddings | 200 | 0.317 | 0.202 | 0.339 |
node2vec | 256 | 0.025 | 0.17 | 0.04 |
GF | 256 | 0.15 | 0.02 | 0.17 |
SDNE | 256 | 0.24 | 0.19 | 0.16 |
HOPE | 256 | 0.25 | 0.07 | 0.17 |
LE | 256 | 0.21 | 0.04 | 0.23 |

Table 6: Link prediction average precision compared with SEAL [24] and VGAE [11].

Algorithm | Power | PB | USAir | C.Ele |
---|---|---|---|---|
Hebbian Graph Embeddings | 93.11 | 93 | 95.92 | 84.99 |
SEAL | 86.69 | 94.55 | 97.13 | 88.81 |
VGAE | 75.91 | 90.38 | 89.27 | 78.32 |

Table 7: Link prediction run-time compared with SEAL [24].

Algorithm | Power | PB | USAir | C.Ele |
---|---|---|---|---|
Hebbian Graph Embeddings | 287 | 233 | 237 | 172 |
SEAL | 1640 | 146 | 31 | 16 |

Table 8: Link prediction results (MAP) for varying embedding dimensionality. The numeric column headers give the dimensionality.

DataSet | Nodes | Edges | 10 | 20 | 50 | 100 | 200 | 300 | 400 | 500 |
---|---|---|---|---|---|---|---|---|---|---|
CondMat | 23,133 | 93,497 | 0.070 | 0.130 | 0.251 | 0.350 | 0.450 | 0.507 | 0.531 | 0.544 |
GrQc | 5,242 | 14,496 | 0.064 | 0.129 | 0.233 | 0.292 | 0.332 | 0.348 | 0.363 | 0.383 |
HepPh | 12,008 | 118,521 | 0.065 | 0.121 | 0.213 | 0.289 | 0.346 | 0.384 | 0.401 | 0.424 |
AstroPh | 18,772 | 198,110 | 0.060 | 0.092 | 0.179 | 0.235 | 0.317 | 0.357 | 0.388 | 0.409 |
HepTh | 27,770 | 352,807 | 0.070 | 0.120 | 0.203 | 0.259 | 0.339 | 0.370 | 0.383 | 0.407 |
BlogCatalog | 10,312 | 333,983 | 0.183 | 0.182 | 0.198 | 0.198 | 0.202 | 0.217 | 0.210 | 0.212 |

Table 9: Link prediction MAP of random embeddings (no training) at dimensionality 500.

DataSet | Nodes | Edges | Random (no training), dimension 500 |
---|---|---|---|
CondMat | 23,133 | 93,497 | 0.007 |
GrQc | 5,242 | 14,496 | 0.007 |
HepPh | 12,008 | 118,521 | 0.010 |
AstroPh | 18,772 | 198,110 | 0.009 |
HepTh | 27,770 | 352,807 | 0.009 |
BlogCatalog | 10,312 | 333,983 | 0.014 |

It is quite easy to parallelize the algorithm, and we implement it on Apache Spark. We run the algorithm for 10 iterations. On the parallel implementation, this takes about 3 hours on the recommender system data and from 5 minutes to 2 hours on the publicly available data, depending on the dimensionality. We found that the learning rate does not affect the results in any significant way (we use 1.0).
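Why the update needs no shared memory can be illustrated with a map/reduce-style sketch in plain Python. It does not use Apache Spark APIs and is not the authors' implementation; the function names, the partition layout and the per-partition random streams are assumptions. Each edge emits one message keyed by its destination node, and messages are summed per key, so partitions only read the previous iteration's embeddings and never write to common state.

```python
from collections import defaultdict
import numpy as np

def emit_messages(edge_partition, w, lr, std, rng):
    """Map step: each edge (i, j, p_ij) emits (destination, contribution)."""
    for i, j, p in edge_partition:
        yield j, lr * p * rng.normal(w[i], std)

def reduce_by_node(messages, dim):
    """Reduce step: sum all contributions addressed to the same node."""
    totals = defaultdict(lambda: np.zeros(dim))
    for node, contribution in messages:
        totals[node] += contribution
    return totals

def parallel_style_step(partitions, w, lr=1.0, var=10.0, seed=0):
    """One iteration over edge partitions. Every partition reads only the
    old embeddings w, so the partitions could run on separate workers."""
    std = np.sqrt(var)
    new_w = w.copy()
    for pid, part in enumerate(partitions):
        rng = np.random.default_rng([seed, pid])  # independent stream per partition
        totals = reduce_by_node(emit_messages(part, w, lr, std, rng), w.shape[1])
        for node, total in totals.items():
            new_w[node] += total
    return new_w
```

Because addition is commutative and associative, the order in which partitions are processed does not change the result for a given set of messages, which is what a distributed reduce-by-key relies on.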

## 4 Conclusion

In this paper, we described a simple but very effective algorithm to learn the embeddings on a graph. The results show that the algorithm, as applied to the tasks of link prediction and reconstruction, performs well when the dimensionality of the embeddings is large. This shows the effectiveness of learning on graphs using iterative methods. It is a useful experiment in error-free (errorless) learning on graphs. Our method can learn long distance similarities because of the iterative nature of the algorithm, which percolates the embeddings across the weighted graph.

A distinctive advantage of our approach is that it is very easy to parallelize the algorithm without any need for shared memory. It is quite easy to implement the algorithm on platforms like Apache Spark, which makes the algorithm amenable to very large graphs which cannot be processed on one machine.

Our recommender system work was tested live and it did very well. But because our item graph has a very large number of nodes and edges, we omit the implementation of [16] and [6] for our recommender system.

Other algorithms, such as those in [22] and [4], could be compared with our work. There is still an opportunity to improve the algorithm through hyperparameter tuning. It might also be interesting to evaluate the algorithm with a much higher dimensionality of the embeddings.

#### Acknowledgments

We thank Ramasubbu Venkatesh, Nicholas Eggert, Sayon Majumdar and Jinghe Zhang for their valuable suggestions and comments.

### References

- [1] (2013) Distributed large-scale natural graph factorization. In Proceedings of the 22nd international conference on World Wide Web, pp. 37–48.
- [2] (2002) Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems, pp. 585–591.
- [3] (2018) A comprehensive survey of graph embedding: problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering 30 (9), pp. 1616–1637.
- [4] (2019) Scalable hyperbolic recommender systems. arXiv preprint arXiv:1902.08648.
- [5] (2018) GEM: a python package for graph embedding methods. J. Open Source Software 3 (29), pp. 876.
- [6] (2018) Graph embedding techniques, applications, and performance: a survey. Knowledge-Based Systems 151, pp. 78–94.
- [7] (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864.
- [8] (2013) Dynamic probabilistic models for latent feature propagation in social networks. In International Conference on Machine Learning, pp. 275–283.
- [9] (1949) The organization of behavior. Wiley & Sons.
- [10] (2014) Hebbian learning and predictive mirror neurons for actions, sensations and emotions. Philosophical Transactions of the Royal Society B: Biological Sciences 369 (1644), pp. 20130175.
- [11] (2016) Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.
- [12] (1983) Optimization by simulated annealing. Science 220 (4598), pp. 671–680.
- [13] (2006) How far can you go with hebbian learning, and when does it lead you astray?. Processes of change in brain and cognitive development: Attention and performance xxi 21, pp. 33–69.
- [14] (2004) Textrank: bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing, pp. 404–411.
- [15] (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- [16] (2017) Poincaré embeddings for learning hierarchical representations. In Advances in neural information processing systems, pp. 6338–6347.
- [17] (2016) Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1105–1114.
- [18] (1999) The pagerank citation ranking: bringing order to the web. Technical report, Stanford InfoLab.
- [19] (1953) Spread of information through a population with socio-structural bias: i. assumption of transitivity. The bulletin of mathematical biophysics 15 (4), pp. 523–533.
- [20] (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), pp. 2323–2326.
- [21] (2016) Network-oriented modeling. Springer.
- [22] (2018) Hyperbolic recommender systems. arXiv preprint arXiv:1809.01703.
- [23] (2016) Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1225–1234.
- [24] (2018) Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pp. 5165–5175.