RankMerging: A supervised learning-to-rank framework to predict links in large social networks
Uncovering unknown or missing links in social networks is a difficult task because of their sparsity and because links may represent different types of relationships, characterized by different structural patterns. In this paper, we define a simple yet efficient supervised learning-to-rank framework, called RankMerging, which aims at combining information provided by various unsupervised rankings. We illustrate our method on three different kinds of social networks and show that it substantially improves the performances of unsupervised metrics of ranking. We also compare it to other combination strategies based on standard methods. Finally, we explore various aspects of RankMerging, such as feature selection and parameter estimation and discuss its area of relevance: the prediction of an adjustable number of links on large networks.
Link prediction is a key field of research for the mining and analysis of large-scale social networks because of its many practical applications, ranging from recommendation strategies for commercial websites [huang2005link] to recovering missing links in incomplete data [zhou2009predicting]. Link prediction also has significant implications from a fundamental point of view, as it allows for the identification of the elementary mechanisms behind the creation and decay of links in time-evolving networks [leskovec2008microscopic]. For example, triadic closure, at the core of standard methods of link prediction, is considered one of the driving forces for the creation of links in social networks [kossinets2006empirical].
In their seminal formulation, Liben-Nowell and Kleinberg [liben2007link] present link prediction as follows: considering a snapshot of a network at time t, the problem is to predict which links will be present at a future time t'. It is therefore a binary classification issue, where the features used are mostly based on the structural properties of the network or on node and link attributes. Our problem is similar, except that no timescale is involved: we aim at recovering missing links in large social networks. In this context, the classification issue has specific characteristics: for a typical nodes network, there are around candidate pairs of nodes that can be connected, most of them being completely irrelevant. The problem is unmanageable without restraining ourselves to subsets of pairs, but even with this restriction we typically have to handle to items, which, as we shall see, raises challenges in terms of computational efficiency. Features are moreover known to be domain-specific: as links play different roles in social networks, they are expected to be surrounded by different types of environments and thus to be best identified by different features. For these reasons, schemes based on a single metric are prone to misclassification. In the context of classification, machine learning methods have been widely used to combine the available information. In recent works, classification trees, support vector machines, matrix factorization or neural networks are implemented to predict links in biological networks or scientific collaboration networks [pavlov2007finding, kashima2009link, benchettara2010supervised, lichtenwalter2010new, menon2011link, davis2013supervised]. However, these classification methods are not designed to easily set the number of predictions, whereas this property is desirable in the context of link prediction in social networks.
Another way to address the issue consists in establishing a ranking of likely links according to a scalar metric, correlated with the existence of interactions between nodes. The user sets the number of links predicted by selecting the T top-ranked pairs. Ranking metrics are often based on the structural properties of the network of known interactions, either at a local scale – e.g. the number of common neighbors – or at a global scale – e.g. random walk or hitting time. See for example [lu2011link] for a survey. Other sources of information are available to predict links, in particular node attributes such as age, gender or other profile information [backstrom2011supervised, bliss2013evolutionary], geographic location [scellato2011exploiting], as well as interaction attributes: frequencies [tylenda2009towards] or the time elapsed since the last interaction [raeder2011predictors].
In this context, learning-to-rank frameworks can be used to combine different metrics for link prediction. Unsupervised solutions are available, such as Borda’s method or Markov chain ordering [dwork2001rank, sculley2007rank]. But formulating the link prediction problem in a supervised way should perform better, so we explore the supervised methods available for learning-to-rank purposes. Most supervised learning-to-rank techniques have been designed in the context of information retrieval tasks, such as document filtering, spam webpage detection, recommendation or text summarization, e.g. [freund2003efficient, burges2011learning, comar2011linkboost]. In [liu2009learning], the author distinguishes between three kinds of approaches in this field. Pointwise approaches are the most straightforward: they use the score or rank associated with a feature to fit, for example, a regression model. One undesirable effect is that low-ranked items tend to have an over-important role in the learning process, which is particularly critical in the case of link prediction as rankings are very large. Pairwise approaches [herbrich1999large] consist in transforming the ranking problem into a classification one, by considering couples of items and learning which one should be ranked above the other. This transformation allows the use of many supervised classification methods, but unfortunately even the cheapest implementations of this approach [chapelle2010efficient] cannot be used to predict links on large networks, as the number of items to rank here is larger than . Listwise approaches [cao2007learning] use a ranking of items as ground truth, which is not relevant to our case: when predicting T links, the quality of the prediction provided by two rankings is strictly equivalent if they provide the same number of true predictions in their top-T items.
More generally, information retrieval techniques primarily aim at high precision on the top-ranked items, and stress the relative ranking of two items. As stated in [chapelle2011future], most of the research on the topic has therefore focused on improving the prediction accuracy rather than on making the algorithms scalable, which is crucial in the case of link prediction. Learning-to-rank in the context of link prediction in large graphs calls for specific methods, suited for large rankings. Closer to our work, in [pujari2012supervised] links are predicted in scientific collaboration networks using supervised approaches based on improvements of unsupervised methods, which allowed efficient predictions on to items rankings.
Here, we propose a simple yet efficient supervised learning-to-rank framework specifically designed to uncover links in large and sparse networks, such as social networks. We improve the prediction by combining rankings obtained from different sources of information. The article is organized as follows. In Section 2, we describe the datasets under study in this article, which are three large social networks (featuring more than nodes) representing different kinds of social interactions: phone calls, coauthorship, and friendship on an online social network. Section 3 is dedicated to the description of the metrics that we use to evaluate the performances of the link prediction. We then present in Section 4 how classic unsupervised learning methods can be applied to the problem under consideration. In Section 5, we develop a supervised machine learning framework, called RankMerging, which improves the quality of predictions by aggregating the information from the unsupervised metrics. Finally, we implement this method on the three datasets during two series of experiments in Section 6, comparing our results to those of other methods. We also explore aspects such as the feature selection problem and the impact of parameter values, and show that RankMerging is suited to social networks where information is partial and noisy, and the number of links to predict is large.
2.1 PSP phonecall network
We investigate a call detail record of approximately phonecalls of anonymized subscribers of a European phone service provider (PSP) during a one-month period. Users are described as nodes and an interaction between two users is a directed link. We focus on the social groups underlying the phone call network, so in order to filter out calls which are not indicative of a lasting social relationship, we only consider calls on bidirectional links, i.e. links that have been activated in both directions. After this filtering has been applied, interactions between users are considered as undirected. The total number of phone calls between two nodes is the weight of the link connecting them. The resulting network is composed of 1,131,049 nodes, 795,865 links and 10,934,277 calls.
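The filtering step described above can be sketched as follows. This is a minimal illustration, assuming that call records are available as (caller, callee) pairs, that the weight of an undirected link counts calls in both directions, and that self-calls are discarded; the function name `build_call_network` and the toy records are hypothetical.

```python
from collections import Counter

def build_call_network(calls):
    """calls: iterable of (caller, callee) records, one per phone call.
    Returns {frozenset({u, v}): weight}, keeping reciprocated links only."""
    directed = Counter((u, v) for u, v in calls if u != v)  # assumption: drop self-calls
    network = {}
    for (u, v), n_uv in directed.items():
        n_vu = directed.get((v, u), 0)
        if n_vu > 0:  # link activated in both directions
            # assumption: the weight is the total number of calls in both directions
            network[frozenset((u, v))] = n_uv + n_vu
    return network

# Toy records: a and b call each other, a calls c without reciprocation.
calls = [("a", "b"), ("b", "a"), ("a", "b"), ("a", "c"), ("c", "d"), ("d", "c")]
network = build_call_network(calls)
```

Here the one-way link from a to c is filtered out, while the reciprocated links a-b and c-d are kept with their call counts as weights.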
2.2 DBLP coauthorship network
The second dataset under study is the collection of computer science publications DBLP (available on http://konect.uni-koblenz.de/networks/dblp_coauthor), which contains roughly papers. We study the collaboration network, where nodes of the network are authors and links connect authors who are coauthors; links are weighted by the number of collaborations. The network features 10,724,828 links and 1,314,050 nodes.
2.3 Pokec social network
Pokec is a popular online social network in Slovakia (available on http://snap.stanford.edu/data/soc-pokec.html). We consider the network of friendship links between accounts. In its original version there are around friendship links, but friendship is directed in this dataset, so we only kept reciprocated friendships in the spirit of the preprocessing of the PSP dataset. The final network contains 8,320,600 links and 1,632,804 nodes.
3 Performance evaluation
Link prediction is here formulated as a learning-to-rank problem. The definition of a good prediction, and thus of an adequate quality estimator, depends on the purpose of the ranking problem. For example, the goal of an efficient search engine is to provide highly relevant information on a small number of items. More generally, in the field of information retrieval, the performance of a ranking is often measured using metrics that emphasize the relative ranking between two items, such as discounted cumulative gain. However, considering link prediction for a fixed number of predictions T, a link is predicted or not depending on whether its rank falls above or below T. The quality of a prediction is therefore assessed in this work by measuring the numbers of true and false positive (resp. tp and fp), true and false negative (tn and fn) predictions in the top T pairs, and usual related quantities: precision tp/(tp+fp), recall tp/(tp+fn) and F-score, the harmonic mean of precision and recall.
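As an illustration of these quantities, the following sketch evaluates the top-T pairs of a ranking against a set of links to recover. The function name `evaluate_top_t` and the toy data are hypothetical; we assume that the ground truth contains every missing link, including those that no ranking reaches.

```python
def evaluate_top_t(ranking, true_links, t):
    """ranking: list of candidate pairs, best first.
    true_links: set of pairs to recover. Returns (precision, recall, F-score)."""
    predicted = ranking[:t]
    tp = sum(1 for pair in predicted if pair in true_links)
    fp = len(predicted) - tp
    fn = len(true_links) - tp
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if true_links else 0.0
    # harmonic mean of precision and recall, defined as 0 when there is no true positive
    f_score = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f_score

ranking = [(1, 2), (1, 3), (2, 4), (3, 4)]
true_links = {(1, 2), (3, 4), (5, 6)}
precision, recall, f_score = evaluate_top_t(ranking, true_links, t=2)
```

With T = 2, one of the two predictions is correct and one of the three links is recovered, so precision is 1/2 and recall is 1/3.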
Previous works have emphasized the dramatic effect of class imbalance (or skewness) on link prediction problems in social networks, especially in mobile phone networks [lichtenwalter2010new, comar2011linkboost]. The fact that the network is sparse and that there are many more pairs of nodes than links makes the prediction and its evaluation tricky: the typical order of magnitude of the classes ratio for a social network made of nodes is . It means that the number of predicted links is much lower than the number of candidate pairs; consequently, the fall-out is always very small, making the ROC curve an inappropriate way of visualizing the performances in our case. For this reason, and because we aim at improving both precision and recall over a large range, we visualize the performances in the precision-recall space.
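To give an order of magnitude of this imbalance, the short computation below uses the node and link counts of the PSP network reported in Section 2.1. It is only an illustration on the full network: in the actual experiments the relevant ratio is the number of missing links over the number of candidate pairs, which depends on the protocol.

```python
n_nodes = 1_131_049   # PSP network, Section 2.1
n_links = 795_865

n_pairs = n_nodes * (n_nodes - 1) // 2   # all unordered pairs of nodes
class_ratio = n_links / n_pairs          # fraction of pairs that are actual links
mean_degree = 2 * n_links / n_nodes      # average number of neighbours per node

print(f"pairs: {n_pairs:.3e}, links/pairs: {class_ratio:.2e}, mean degree: {mean_degree:.2f}")
```

The ratio is of the order of one link per million pairs, which equals the mean degree divided by the number of nodes minus one.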
4 Unsupervised rankings
4.1 Ranking metrics
In this work, we focus on structural features that assign to each pair of nodes a score based on topological information, then pairs are ranked according to this score. A large number of metrics have been used in the past, see for example [zhou2009predicting, lu2011link]. The goal of this paper is not to propose elaborate classifiers, but to present a method that takes advantage of how complementary they are. We have therefore chosen classic metrics and generalized them to the case of weighted networks – other generalizations exist in the literature, e.g. [murata2007link].
4.1.1 Local features
In the following, denotes the set of neighbors of node , its degree is , is the weight of a link and is the activity of a node , that is the sum of the weights of its adjacent links. Some metrics are local (also called neighborhood rankers) as they only rank links among nodes which are at most at distance 2.
Common Neighbors index (CN), based on the number of common neighbors shared by nodes and ; the corresponding unweighted and weighted scores are:
Resource Allocation index (RA):
Adamic-Adar index (AA):
Sørensen index (SR):
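For reference, the sketch below computes the four local indices in their standard unweighted forms, as commonly given in the link prediction literature (see e.g. the survey [lu2011link]). It is not a transcription of the expressions of this paper: the weighted variants are not reproduced, and the factor 2 in the Sørensen index is the usual convention, assumed here. The function name `local_scores` and the toy graph are hypothetical.

```python
import math

def local_scores(neighbors, i, j):
    """neighbors: dict node -> set of neighbours (undirected, unweighted graph)."""
    common = neighbors[i] & neighbors[j]
    degree = {k: len(neighbors[k]) for k in common | {i, j}}
    degree_sum = degree[i] + degree[j]
    return {
        "CN": len(common),
        "RA": sum(1 / degree[k] for k in common),
        # a common neighbour of two distinct nodes has degree >= 2, so log(degree) > 0
        "AA": sum(1 / math.log(degree[k]) for k in common),
        "SR": 2 * len(common) / degree_sum if degree_sum else 0.0,
    }

# Toy graph with links 1-3, 1-4, 2-3, 2-4, 2-5, 4-5.
neighbors = {1: {3, 4}, 2: {3, 4, 5}, 3: {1, 2}, 4: {1, 2, 5}, 5: {2, 4}}
scores = local_scores(neighbors, 1, 2)
```

Nodes 1 and 2 share the neighbours 3 (degree 2) and 4 (degree 3), so only pairs at distance 2 such as this one receive a non-zero score.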
4.1.2 Global features
Another class of features is global, since they are calculated using the whole structure of the network, and allow for the ranking of distant pairs of nodes:
Katz index (Katz), computed from the number of paths from node to node of length , i.e. , according to the following expression ( is a parameter lower than ):
Note that in the weighted case, the number of paths is computed as if links were multilinks.
Random Walk with Restart index (RWR), derived from the PageRank algorithm, is defined as the probability that a random walker starting on node , going from a node to a node with probability and returning to with probability , is on in the steady state of the process.
Preferential Attachment index (PA), based on the observation that active nodes tend to connect preferentially in social networks.
4.1.3 Intermediary features
In practice, the exact computation of global metrics is expensive on large networks, which is why approximations are often favoured to compute these scores. Both Katz and RWR are computed using infinite sums; in the following, they are approximated by keeping only the first four dominating terms to reduce the computational cost. This approximation means that we can only predict links between pairs of nodes at a maximum distance of 4. Notice that it is also a way to reduce the class-imbalance problem: as distant pairs are less likely to be connected, we dismiss them in order to increase the (true positive / candidate pairs) ratio. As the class imbalance problem is known to dramatically hinder the performance of PA, we have restricted the ranking in this case to pairs of nodes at a maximum distance of 3. Notice that with larger maximum distances, we can increase the maximum recall that can be reached, but at the cost of a drop in precision.
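A truncated Katz score of this kind can be sketched as below for an unweighted graph. We assume the usual Katz form, a sum over lengths of a damping factor raised to the length times the number of paths of that length, where paths are counted as walks (as with powers of the adjacency matrix). The function name `truncated_katz` and the symbol `beta` for the damping parameter are illustrative choices.

```python
def truncated_katz(neighbors, source, beta, max_length=4):
    """Sum over l = 1..max_length of beta**l * (number of walks of length l
    from `source` to each node). Returns {node: score} for reachable nodes."""
    scores = {}
    walks = {source: 1}  # number of walks of the current length ending at each node
    for length in range(1, max_length + 1):
        next_walks = {}
        for node, count in walks.items():
            for nb in neighbors[node]:
                next_walks[nb] = next_walks.get(nb, 0) + count
        walks = next_walks
        for node, count in walks.items():
            scores[node] = scores.get(node, 0.0) + beta ** length * count
    scores.pop(source, None)  # a node is not a candidate partner for itself
    return scores

# Path graph 1 - 2 - 3.
neighbors = {1: {2}, 2: {1, 3}, 3: {2}}
katz = truncated_katz(neighbors, source=1, beta=0.1)
```

Only nodes within four steps of the source receive a score, which is the restriction on candidate pairs mentioned above. In a link prediction setting, pairs that are already linked would be removed from the resulting ranking.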
When even distance 4 approximation is too expensive, we use the Local Path index, especially designed to capture the structure at an intermediary scale:
With the same notations as Katz index and is a parameter.
4.2 Borda’s method
The main purpose of this work is to develop a framework to exploit a set of rankings for link prediction. Here, we present an unsupervised way of merging rankings stemming from social choice theory: Borda’s method is a rank-then-combine method originally proposed to obtain a consensus from a voting system [deborda1781memoire]. Each pair is given a score corresponding to the sum of the number of pairs ranked below, that is to say:
where denotes the number of elements ranked in . This scoring system may be biased toward global predictors by the fact that local rankings feature fewer elements. To alleviate this problem, a pair that is absent from one ranking but present in another is treated, in the ranking where it is absent, on an equal footing with any other unranked pair and below all ranked pairs. Borda’s method is computationally cheap, which is a highly desirable property in the case of large networks. A more comprehensive discussion on this method can be found in [dwork2001rank].
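The aggregation can be sketched as follows. This is one possible reading of the rule above, under two assumptions: the number of pairs ranked below a ranked pair is counted within the union of all rankings, and pairs absent from a ranking, being tied at the bottom, receive no contribution from it. The function name `borda_merge` is hypothetical.

```python
def borda_merge(rankings):
    """rankings: list of lists of pairs, best first.
    Returns the pairs of the union sorted by decreasing Borda score."""
    universe = set().union(*map(set, rankings))
    n = len(universe)
    score = dict.fromkeys(universe, 0)
    for ranking in rankings:
        for position, pair in enumerate(ranking, start=1):
            score[pair] += n - position  # number of pairs ranked below in this ranking
        # pairs absent from this ranking are tied at the bottom: contribution 0
    # ties between equal scores are broken by pair order, for reproducibility
    return sorted(universe, key=lambda pair: (-score[pair], pair))

local_ranking = [(1, 2), (3, 4), (5, 6)]
global_ranking = [(3, 4), (7, 8)]
merged = borda_merge([local_ranking, global_ranking])
```

In this toy case the pair (3,4), ranked in both lists, obtains the highest score, which reflects the consensus nature of the method.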
5 RankMerging framework
The ranking methods presented in the previous section use structural information in complementary ways. In systems such as social networks, where the communication patterns of different groups (e.g. family, friends, coworkers) are different, one expects that a link detected as likely by a specific ranking method may not be discovered by another one. In this section, we describe the supervised machine learning framework that we created to aggregate information from various ranking techniques for link prediction in social networks. In a nutshell, it does not require a pair to be highly ranked according to all criteria (as in a consensus rule), but according to at least one. The whole procedure is referred to as RankMerging (an implementation and user guide are available on http://lioneltabourier.fr/program.html).
5.1 The method
5.1.1 Learning phase.
We first consider the training set to learn the parameters on the learning graph . Suppose that we have different rankings from the unsupervised methods described previously. During the learning phase, ranked pairs of nodes are labelled, depending on the fact that the link does exist or not. We create during the learning phase an output ranking which contains pairs selected from various rankings .
Let denote the index used to go through ranking , we name it sliding index of in the following. Predicted links on rank from position 1 to of each ranking . We now consider simultaneously the rankings, with the goal to compute the optimal values of for a fixed number of predictions so that the total number of true positive predictions is maximum. These values are the essential outputs of the training phase, as they indicate the contribution of each ranking to the merged ranking. It is important to note that a link can only be predicted once: if a pair has been added to using ranking , it cannot be added to anymore. It implies that we have to take into account the links already added to in the next steps, which makes this problem hard to solve exactly. For this reason, it is solved heuristically, as described below.
The central idea is to find at each step the ranking with the highest number of true predictions in the next coming steps. For that purpose, we define the window as the set of links predicted according to ranking in the next steps. Note that links already in are not considered for . The number of tp in is the quality of the window, denoted . The ranking corresponding to the highest quality value is selected (in the case of a tie, we choose randomly), and we add its next top-ranked pair to the output ranking of the learning phase . Throughout the process, the sliding indices are recorded. Then, the windows are updated so that each one contains exactly pairs, and the process is iterated until contains the chosen number of links to predict . To summarize, we are looking for a maximum number of true predictions by local search. The algorithm corresponding to this process is described in Algo. 1.
Table 1: The two rankings of the example, best-ranked pair first; pairs marked x are true positives.
rank   first ranking   second ranking
1      (1,2)           (5,18) x
2      (1,4) x         (1,2)
3      (5,6) x         (8,9)
4      (6,12) x        (5,6) x
5      (5,18) x        (7,11) x
6      (3,4)           (6,9) x
7      (4,9) x         (1,14)
8      (7,11) x        (2,9)
9      (2,9)           (3,7)
Table 1 gives an example of the first steps of the merging process between two rankings, with windows of five pairs. Pairs in the windows are represented with a gray background. Initially, there are 4 true positives in the window of the first ranking and 3 in the window of the second, so the first link selected is the top-ranked pair available in the first ranking: (1,2). This pair is therefore excluded from the second ranking. At step 2, both windows have a quality of 4, so the ranking is selected randomly; we suppose here that the second ranking has been selected, so that the next link added to the output ranking is (5,18). At step 3, the qualities are 4 and 3, and the pair selected to join the output ranking is (1,4), from the first ranking. At step 4, the qualities are again 4 and 3, and the pair selected is (5,6), from the first ranking. At this point, according to the previously defined notations, and .
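The learning phase can be sketched as follows. This is a didactic version, not the authors' implementation: it rebuilds the windows at each step instead of updating them incrementally, so it does not reach the complexity discussed later. The names `learn_merging` and `tie_break` are hypothetical, rankings are assumed to contain no duplicate pair, and the input transcribes the two rankings of Table 1 as read from the text, taking pairs marked x as true positives.

```python
import random

def learn_merging(rankings, true_links, n_predictions, window_size,
                  tie_break=random.choice):
    """Greedy merging. Returns (merged ranking, index of the ranking used at each step)."""
    merged, selected_from, chosen = [], [], set()
    cursors = [0] * len(rankings)  # position of the next unread pair in each ranking

    def window(i):
        """Next `window_size` pairs of ranking i that are not yet in the output."""
        pairs, pos = [], cursors[i]
        while pos < len(rankings[i]) and len(pairs) < window_size:
            if rankings[i][pos] not in chosen:
                pairs.append(rankings[i][pos])
            pos += 1
        return pairs

    while len(merged) < n_predictions:
        windows = {i: window(i) for i in range(len(rankings))}
        # quality of a window: number of true positives it contains
        quality = {i: sum(p in true_links for p in w) for i, w in windows.items() if w}
        if not quality:
            break  # every ranking is exhausted
        best = max(quality.values())
        i = tie_break([k for k in sorted(quality) if quality[k] == best])
        pair = windows[i][0]
        merged.append(pair)
        selected_from.append(i)
        chosen.add(pair)
        cursors[i] = rankings[i].index(pair, cursors[i]) + 1
    return merged, selected_from

r1 = [(1, 2), (1, 4), (5, 6), (6, 12), (5, 18), (3, 4), (4, 9), (7, 11), (2, 9)]
r2 = [(5, 18), (1, 2), (8, 9), (5, 6), (7, 11), (6, 9), (1, 14), (2, 9), (3, 7)]
true_links = {(5, 18), (1, 4), (5, 6), (6, 12), (7, 11), (6, 9), (4, 9)}
# The tie at step 2 is resolved in favour of the second ranking, as supposed in the text.
merged, selected_from = learn_merging([r1, r2], true_links, n_predictions=4,
                                      window_size=5, tie_break=lambda ties: ties[-1])
```

With this tie-breaking choice, the four pairs selected are (1,2), (5,18), (1,4) and (5,6), in agreement with the steps described above. The list `selected_from` is the information carried over to the test phase.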
5.1.2 Test phase.
The test phase of the procedure consists in combining rankings on the test network according to the learnt on the training network . The practical implementation is simple: at each step, we look up the ranking chosen according to the learning process and select the corresponding ranking on the test set. Its highest ranked pair is then added to the merged ranking of the test phase , except if it is already in , in which case we go to the next highest ranked pair of until we have found one which is not already in . The corresponding process is described in Algo. 2. Note that the rankings considered during the learning phase may have different lengths from the ones considered during the test phase. There are ways to circumvent this problem, namely to use a scaling factor , that is the ratio of sizes between these rankings, so that considering step of the merging algorithm, we use the ranking predicted on step of the learning algorithm.
[Table: example of the test phase; extracted pair sequence: (2,8) (1,8) (1,8) (9,11) (5,11) (4,5) (3,6) (5,11)]
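The test phase can be sketched along the same lines. The function below replays on the test rankings the sequence of ranking indices recorded during learning. The name `apply_merging`, the argument `scale` and the way a test step is mapped to a learning step (integer part of the step divided by the scaling factor) are assumptions made for this illustration; the toy rankings are hypothetical.

```python
def apply_merging(test_rankings, selected_from, scale=1.0):
    """test_rankings: list of lists of pairs, best first.
    selected_from: index of the ranking chosen at each learning step.
    scale: assumed ratio (test ranking length) / (learning ranking length), > 0."""
    merged, chosen = [], set()
    cursors = [0] * len(test_rankings)
    n_steps = int(len(selected_from) * scale)
    for step in range(n_steps):
        learning_step = min(int(step / scale), len(selected_from) - 1)
        i = selected_from[learning_step]
        ranking = test_rankings[i]
        while cursors[i] < len(ranking) and ranking[cursors[i]] in chosen:
            cursors[i] += 1  # skip pairs already in the merged ranking
        if cursors[i] == len(ranking):
            continue  # this ranking is exhausted, nothing to add at this step
        merged.append(ranking[cursors[i]])
        chosen.add(ranking[cursors[i]])
        cursors[i] += 1
    return merged

test_r1 = [(2, 8), (1, 8), (5, 11), (3, 6)]
test_r2 = [(1, 8), (9, 11), (4, 5), (5, 11)]
# Sequence learnt beforehand: first ranking, second, then first twice.
merged_test = apply_merging([test_r1, test_r2], selected_from=[0, 1, 0, 0])
```

At the third step the first ranking proposes (1,8), which has already been added from the second ranking, so the next pair (5,11) is taken instead.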
An important benefit of our learning algorithm is that we need to go through each learning ranking only once. Moreover, by using appropriate data structures to store the windows – e.g., associative arrays – it is easy to manage each update in . So if we have rankings and predictions, it implies a temporal complexity. Similarly, the test phase demands to go through the test rankings once, yielding a complexity. Besides, most of the memory used is due to the storage of the rankings, which is also . These time and space consumptions are in general insignificant w.r.t. the complexities of the preliminary unsupervised classification methods.
5.2 Optimality justification
In this part, we demonstrate that the aggregation method that we used in RankMerging provides the maximum value for a function , close to the function defined in Section 5.1.1, and we discuss numerically how close and can be.
5.2.1 Sufficient condition to maximize
First, we consider a slightly different theoretical problem: suppose that we have functions with values from to . We aim at finding that maximize , with fixed .
If , is a decreasing function, then by adding at each step, we obtain the maximum value of for a fixed .
By recursion. For , the lemma is trivial. We then use a reductio ad absurdum argument: suppose is the maximum value for , and let be the index corresponding to the highest at this step. Now, suppose such that
with and .
By assumption, , it implies that . But as has been selected before , so we have with , as is a decreasing function, it comes that , in contradiction with the previous statement. ∎
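The statement of the lemma can be checked numerically on small cases. The sketch below reads it as follows (our interpretation of the text): given several sequences of values, selecting at each step the largest next unread value maximizes the sum of the selected prefixes when all sequences are decreasing. The greedy result is compared to an exhaustive search, and a non-decreasing sequence shows that the condition matters. Function names are hypothetical, and the number of picks is assumed not to exceed the total number of values.

```python
from itertools import product

def greedy_total(sequences, t):
    """Pick t values, each time taking the largest next unread value."""
    cursors, total = [0] * len(sequences), 0
    for _ in range(t):
        available = [k for k in range(len(sequences)) if cursors[k] < len(sequences[k])]
        i = max(available, key=lambda k: sequences[k][cursors[k]])
        total += sequences[i][cursors[i]]
        cursors[i] += 1
    return total

def best_total(sequences, t):
    """Exhaustive maximum of the sum of prefixes whose lengths add up to t."""
    ranges = [range(len(s) + 1) for s in sequences]
    return max(sum(sum(s[:r]) for s, r in zip(sequences, lengths))
               for lengths in product(*ranges) if sum(lengths) == t)

decreasing = [[5, 3, 1], [4, 4, 0], [6, 1, 1]]
not_decreasing = [[1, 10], [2, 0]]

greedy_ok = greedy_total(decreasing, 4) == best_total(decreasing, 4)
greedy_gap = best_total(not_decreasing, 2) - greedy_total(not_decreasing, 2)
```

On the decreasing sequences the greedy choice reaches the optimum, whereas on the second example it misses the large value hidden behind a small one.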
5.2.2 Numerical measures
Our practical problem can be related to this lemma considering that . But it is important to notice that may not be decreasing functions during a real-life aggregation experiment. We therefore examine how close we are to the hypotheses of the lemma by examining if , are decreasing functions. Note that having decreasing functions means that our classifying features are chosen so that highest ranked pairs have an increased probability of being true positive predictions – which is expected from a good classifier.
As shown on the example of Figure 1 (extracted from the experiments on the PSP dataset, see later), the decreasing condition is nearly fulfilled during the process if is large enough. However, it does not mean that the best solution for the merging problem is to take as large as possible, as the larger gets, the less is an accurate estimation of the probability that the next pair selected is a true positive prediction. We therefore have to manage a trade-off by tuning parameter in order to obtain the best possible performance of RankMerging.
6.1 Benchmarks for comparison
In order to assess the efficiency of RankMerging, we compare its performances to existing techniques. First, we compare it to the various unsupervised metrics described above. We also compare it to classic supervised techniques for classification tasks. We restricted ourselves to several computationally efficient methods, namely nearest neighbors (NN), classification trees (CT) and AdaBoost (AB). We have used implementations from the Python scikit-learn toolkit (http://scikit-learn.org/; for more details, see [scikit-learn]). These techniques are not specifically designed for ranking tasks, so we obtain different points in the precision-recall space by varying the algorithms' parameters, respectively the number of neighbors (NN), the minimum size of a leaf (CT), and the number of trees (AB). We recall that usual supervised learning-to-rank techniques cannot be used in our case.
6.2 Series 1
In the first series of experiments, we focus on the PSP dataset and explore in detail the impact of the parameter value and the question of feature selection. As PSPs only have access to records involving their own clients, they have an incomplete view of the social network as a whole. In several practical applications, however, this missing information is crucial. An important example is churn prediction, that is the detection of clients at risk of leaving for another PSP. This risk is known to depend on the local structure of the social network [dasgupta2008social, ngonmang2012churn]. Therefore, the knowledge of connections between subscribers of other providers is important for the design of efficient customer retention campaigns. We design this series of experiments for this application, that is to say the prediction of links among the users of a PSP competitor.
A PSP is usually confronted with the following situation: it has full access to the details of phone calls between its subscribers, as well as between one of its subscribers and a subscriber of another PSP. However, connections between subscribers of other PSPs are hidden. In order to simulate this situation from our dataset, we divide the set of nodes of our data into three sets: , and , and a partition of the set of links will ensue. During the learning phase, links – and – are known, defining the set of links of the graph , and we calibrate the model by guessing links –, defining the set . During the test phase, all the links are known except for links –, which we guess to evaluate the performances; we denote this set of links . Users have been assigned to , and according to the proportions 50, 25, 25%. With these notations:
, it contains 848,911 nodes and 597,538 links and we aim at predicting the 49,731 links in .
, it contains 1,131,049 nodes and 746,202 links and we aim at predicting the 49,663 links in .
Notice that according to our simulation, the learning set and the test set are derived from the same distributions, whereas this might not be the case in situations involving real-world PSPs. However, this assumption is fair when comparing the performances of our method to other prediction techniques.
6.2.2 Unsupervised learning process
We plot the results obtained on to predict links for the above classifiers. For the sake of readability, we only represent a selection of them on Figure 2. The evolution of the F-score significantly varies from one classifier to another. For example, it increases sharply for CN, and then more slowly until reaching its maximum, while RWR rises smoothly before slowly declining. Borda’s aggregation improves the performance of the classification, especially considering the precision on the top-ranked pairs. RankMerging method aims at exploiting the differences between these profiles. Given the difficulty of the task, precision is low on average: for instance, when recall is greater than , precision is lower than for all rankers. We used here structural features, making it impossible to predict links between nodes which are too distant from each other. We are therefore limited to small recall values, as increasing the number of predictions lowers dramatically the precision.
6.2.3 Supervised merging process
According to the description in 5.1, are computed on to discover links, and then used to merge rankings on to discover links, applying the scaling factor to adapt the learnt to the test rankings. Cross-validation is made through a simple hold-out strategy. We will argue in the following that the user may aggregate as many rankings as possible: the information provided by different rankings may be redundant, but the merging process is such that the addition of a supplementary ranking is computationally cheap, and if a ranking does not bring additional information, it should simply be ignored during the learning process. As for the value of , our numerical experiments show that the performance of the algorithm is robust over a sufficiently large range of values (see Table 3), and we extrapolate the best value on for the aggregation on .
We plot on Figure 3 the evolution of the F-score and the precision-recall curve obtained with RankMerging, for , aggregating the rankings of the following classifiers: AA, CN, , SR, Katz (), PA, RWR () and Borda’s method applied to the seven former ones. We observe that RankMerging performs better than Borda, and consequently better than all the unsupervised methods explored, especially for intermediary recall values. This was expected, as RankMerging incorporates the information of Borda’s aggregation here. We measure the area under the precision-recall curves to quantify the performances with a scalar quantity. RankMerging increases the area by 8.3% compared to Borda. Concerning the supervised benchmarks, we observe that they perform well, but only for a low number of predictions (comparable to Borda for approximately 1000 to 2000 predictions). Unsurprisingly, AdaBoost, being an ensemble method, outperforms Decision Trees and Nearest Neighbors for an optimal parameter choice, but the performances are of the same order of magnitude, in line with the observations in [al2006link]. As formerly stated, these methods are not designed to cover a wide range of the precision-recall space, and therefore perform very poorly outside their optimal region of use.
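As an illustration of how such an area can be measured, the sketch below builds the precision-recall points obtained when the number of predictions grows along a ranking, then integrates them with the trapezoidal rule. The integration rule is an assumption, as the text does not specify it, and the function names and toy data are hypothetical.

```python
def precision_recall_points(ranking, true_links):
    """One (recall, precision) point per number of predictions T = 1, 2, ..."""
    points, tp = [], 0
    for t, pair in enumerate(ranking, start=1):
        tp += pair in true_links
        points.append((tp / len(true_links), tp / t))
    return points

def area_under_curve(points):
    """Trapezoidal area under (recall, precision) points ordered by recall."""
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area

ranking = ["a", "b", "c", "d"]
true_links = {"a", "c"}
points = precision_recall_points(ranking, true_links)
area = area_under_curve(points)
```

Two merging strategies can then be compared through the relative difference of their areas, computed on the same set of links to recover.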
On the downside, RankMerging has been designed for classification problems with a large number of predictions. The window size implies an averaging effect which causes the method to lack efficiency on the top-ranked items, as can be seen on Fig. 3. As a consequence, it is not suited to problems with a low number of predictions, as is often the case for information retrieval tasks, for example.
We evaluate the influence of the structural metrics in Table 3. A comprehensive study of the matter would be long and repetitive considering all possible combinations of rankings, so we restrict ourselves to a few examples. An important result to notice here is that the addition of a ranking does not decrease the quality of the merging process, except for small fluctuations. So ideally, users may add any ranking to their own implementation of the method: whatever the source of information is, it should improve, or at least not worsen, the prediction. Note that the design of the method is supposed to favour this behaviour: if a ranking brings too little information to the aggregation process, then it should be ignored during the learning process. The problem of feature selection is critical for the link prediction problem; for instance, the performances of consensus methods drop dramatically with a poor feature choice, so these experimental observations are encouraging for future uses of RankMerging.
The dependency on value is shown on Table 4. Results indicate that the performances are close to the maximum within the interval on both the learning and test sets. This observation suggests the possibility of tuning in the testing phase from the values of during the learning process. It is interesting to note that the performance is maximum for these intermediate values, around . This is the right balance between small which fluctuate too much to provide a good local optimum, and large , which average the quality of rankings on a too large window.
6.3 Series 2
In the second series of experiments, we focus on the DBLP and Pokec datasets. Here, we investigate in detail the impact of the sizes of the learning and test sets.
A few points differ from the protocol of the first series. Here, all the nodes belong to both and ; the partition is made on the set of links :
are the links of ,
are the links to guess during the learning phase, to calibrate our model,
are the links to guess during the testing phase, to evaluate the performance of the method.
During the learning phase, links are used to guess links. During the test phase, the links of , that is to say , are used to guess . We generate several samples such that , but with various values for ; the missing information increases as this set grows larger. The samples are defined by the missing links ratio .
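The partition of the links described above can be sketched as follows. This is a minimal illustration under stated assumptions: the hidden links are drawn uniformly at random, the sizes of the two hidden sets are passed explicitly rather than derived from the missing links ratio, and the names `split_links`, `observed`, `learn_hidden` and `test_hidden` are hypothetical.

```python
import random


def split_links(links, n_learn_hidden, n_test_hidden, seed=0):
    """Partition a set of links into three disjoint sets.

    Returns (observed, learn_hidden, test_hidden):
      - observed: links available during the learning phase,
      - learn_hidden: links to guess during the learning phase,
      - test_hidden: links to guess during the testing phase.
    Assumption: hidden links are sampled uniformly at random.
    """
    links = sorted(links)  # fixed order so that the seed gives a reproducible split
    if n_learn_hidden + n_test_hidden > len(links):
        raise ValueError("cannot hide more links than the network contains")
    rng = random.Random(seed)
    rng.shuffle(links)
    test_hidden = set(links[:n_test_hidden])
    learn_hidden = set(links[n_test_hidden:n_test_hidden + n_learn_hidden])
    observed = set(links[n_test_hidden + n_learn_hidden:])
    return observed, learn_hidden, test_hidden


# Toy usage: a path of 20 links, hiding 3 for learning and 5 for testing.
toy_links = {(i, i + 1) for i in range(20)}
observed, learn_hidden, test_hidden = split_links(toy_links, 3, 5)
# Learning phase: rank candidate pairs from `observed`, score against `learn_hidden`.
# Test phase: rank from `observed | learn_hidden`, score against `test_hidden`.
```

Passing the set sizes as arguments keeps the sketch independent of how the missing links ratio is normalised.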
Another noteworthy difference is that these networks have a much higher average degree than the PSP network, making the computation of the global rankers expensive in both memory and time. We therefore limited ourselves to the less costly local and intermediary metrics, more precisely: AA, CN, SR, RA, LP () and Borda’s aggregation.
For the same reasons, the number of pairs is very large when considering only nodes at distance 2 (typically larger than ). Since it would not make much sense to predict so many links in a social network, we choose to reduce the length of the rankings by focusing on their intersection: we keep only the top pairs in the intersection of the different rankings. Borda’s method is then applied to the intersected rankings. In this setting of experiments, the scaling factor during the whole series.
Figure 4 shows the results obtained on the test sets of both the DBLP and Pokec datasets, for samples where the missing links ratio . In both cases, RankMerging outperforms the unsupervised methods, but the improvement is much more visible for DBLP than for Pokec.
In the case of DBLP, a closer examination shows that at the beginning of the process, we closely follow the curve of the Sørensen index, which is the most efficient unsupervised ranking for a low number of predictions. Then its performance drops, and other rankings are chosen during the aggregation process (mostly the Local Path ranking). In the case of Pokec, the pairs aggregated initially come mainly from the Adamic-Adar ranking; this index soon becomes less efficient than the Resource Allocation index, which takes over until the end of the merging process. Pokec gives a good indication of what happens when the rankings are not complementary enough and one of them is more efficient than the others: the aggregation nearly always chooses the pairs from this ranking and does not significantly improve the performance. Notice that in both cases, Borda’s method is not the most efficient unsupervised method: as some rankings perform very poorly, they dramatically hinder its performance.
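The intersection step followed by Borda’s aggregation can be sketched as below. The sketch assumes that each ranking is a list of distinct pairs sorted from most to least likely, and it uses one common variant of the Borda count (an item at position p in a ranking of length n receives n - p points, and points are summed over rankings); the exact variant and the truncation to a fixed number of top pairs are not reproduced here, and the function names are hypothetical.

```python
def intersect_rankings(rankings):
    """Restrict every ranking to the pairs present in all rankings.

    The relative order of the remaining pairs is preserved in each ranking.
    """
    common = set(rankings[0]).intersection(*rankings[1:])
    return [[pair for pair in ranking if pair in common] for ranking in rankings]


def borda_aggregate(rankings):
    """Aggregate rankings with a simple Borda count (assumed variant).

    A pair at 0-based position p in a ranking of length n gets n - p points;
    pairs are returned by decreasing total, ties broken by pair value.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, pair in enumerate(ranking):
            scores[pair] = scores.get(pair, 0) + (n - position)
    return sorted(scores, key=lambda pair: (-scores[pair], pair))


# Toy usage with three rankings over single-letter "pairs".
r1 = ["a", "b", "c", "d"]
r2 = ["b", "a", "d", "e"]
r3 = ["b", "d", "a", "c"]
restricted = intersect_rankings([r1, r2, r3])
consensus = borda_aggregate(restricted)
```

Because the intersected rankings all contain the same items, every pair is scored by every ranking, which avoids having to choose a convention for unranked pairs.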
The comparison to supervised classification methods is interesting here (note that AdaBoost does not provide exploitable results on either network, as it predicts very few links). Classification Trees as well as Nearest Neighbors perform poorly on Pokec; however, the results on DBLP show that NN and, above all, CT perform better than RankMerging on a limited range. This observation highlights a limit of our learning-to-rank aggregation method for the purpose of link prediction: its prediction performance cannot be much better than that of the rankings used as features for the learning, while that of classification methods can, the trade-off being severe constraints on the number of predictions.
Learning set size.
We now explore the impact of the learning set size on the performance of the method. We generated five different samples with missing links ratios 0.05, 0.10, 0.15, 0.20 and 0.25. In Figure 5, we plot the area under the precision-recall curves for the samples generated, and compare them to the most efficient unsupervised methods tested. RankMerging outperforms the unsupervised methods in nearly all cases, but the differences vanish when the size of the learning set decreases, that is to say when the missing information grows. This observation seems to stem from the fact that one ranker dominates the others when information is missing, so that the merging method tends to stick to the best-performing metric.
Experimental running times.
To give the reader a better grasp of the practical running times, we briefly indicate the orders of magnitude for the different experiments. Of course, these running times depend on the implementations chosen. We used standard ones on the same machine throughout the experiments (a workstation with 16 CPUs at 3 GHz). Producing the unsupervised rankings takes from 10 to 1000 minutes (depending on the dataset and the scoring index), while the whole merging process takes less than a minute for all the datasets.
In this work, we have presented RankMerging, a supervised machine learning framework which combines rankings from unsupervised classifiers to improve the performance of link prediction. This learning-to-rank method is straightforward and computationally cheap: its complexity is , where is the number of rankings aggregated and the number of predictions. It is suited to prediction in social networks, as can be tuned according to the application’s needs. As we have discussed, the values of are generally large; that is, our method is designed to yield a large number of predictions. Indeed, the precision on top-ranked items is not significantly improved, making the method ineffective for most information retrieval purposes.
So far, we have focused exclusively on structural information in order to predict unknown links. However, the framework is general, and any feature providing a ranking for likely pairs of nodes can be incorporated. Additional structural classifiers are an option, but other types of attributes can also be considered, such as the profile of the users (age, hometown, etc.) or the timings of their interactions. In the latter case, for instance, if and are both interacting with within a short span of time, it is probably an indication of a connection between and . From a theoretical perspective, RankMerging provides a way to uncover the mechanisms of link creation, by identifying which sources of information play a dominant role in the quality of a prediction. The method could be applied to other types of networks, especially when links are difficult to detect. Applications include network security (for example, detecting the existence of connections between machines of a botnet) and biomedical engineering (screening combinations of active compounds and experimental environments for the purpose of drug discovery).
The authors would like to thank Emmanuel Viennet and Maximilien Danisch for useful bibliographic indications. This paper presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. The scientific responsibility rests with its authors. We also acknowledge support from FNRS and the European Commission Project Optimizr.