Social Ranking Techniques for the Web
The proliferation of social media has the potential to change the structure and organization of the web. In the past, scientists have looked at the web as a large connected component to understand how the topology of hyperlinks correlates with the quality of information contained in a page, and they have proposed techniques to rank the information contained in web pages. We argue that information from web pages and network data on social relationships can be combined to create a personalized and socially connected web. In this paper, we look at the web as a composition of two networks, one consisting of information in web pages and the other of personal data shared on social media web sites. Together, they allow us to analyze how social media channels the flow of information from person to person and how to use the structure of the social network to rank, deliver, and organize information specifically for each individual user. We validate our social ranking concepts through a ranking experiment conducted on web pages that users shared on Google Buzz and Twitter.
According to , a conceptualization of the web is revealed by looking at patterns in the topology of hyperlinks connecting web pages to separate prominent websites that serve as authorities for trusted information from malicious pages created by spammers. This conceptualization of the web eliminates the complexity of textual analysis and creates a potpourri of information that gets incorporated into search engines for the purpose of finding information on computing devices.
Advances in social networks have provided a new dimension to studying problems in information retrieval from a network point of view. Incorporating the social network structure into algorithms used for ranking, organizing, and delivering information in information retrieval systems such as search engines promises improvements and new practical applications. For example, “movies that my friends like” has been introduced by Facebook as Graph Search.
Advances in the web have created applications where humans can identify and label relationships for the purpose of interacting with information. Besides the typical information that users share in online social networks, such as photos, messages, and geographic locations, URLs that users share with their friends and followers are used in this paper to infer how humans would rank the importance of the content embedded on a page, because URLs shared by users focus on selected topics that they want their followers to know about. Therefore, publicly shared messages embedded with URLs provide us a clue into how a user would rank the importance of a page, which defines the ranking of the page from the user's point of view, and allow us to re-rank, re-organize, and re-deliver query results based on who is connected to whom.
We propose techniques for answering the following questions. First, how can we incorporate social relevance into the process of ranking pages while preserving authoritative sources determined by algorithms based on indegree analysis, such as PageRank and HITS? Second, how can we rank pages based on URLs that users share in online social media such as Google Buzz and Twitter by incorporating the social network structure of those users to personalize the ranking of pages tailored to each individual user?
The rest of the paper is organized as follows. In section II, we provide techniques for ranking pages by applying PageRank, HITS, and maximum flow to social ties and URL-embedded messages shared on social media. In section III, we overview the procedure for collecting data on two social media (Google Buzz and Twitter) for the validation of our proposed framework by ranking URLs shared in them. In section IV, we analyze the social relevance of URLs and conduct a ranking experiment to observe the ranking positions of URLs computed by PageRank, HITS, and maximum flow. After presenting the literature review of ranking and other related work in section V, we conclude in section VI by summarizing the results.
II Social Ranking Techniques
Let $G = (V, E, U)$ be a directed multi-labeled graph where $V$ is the set of nodes, $E$ is the set of edges where $(i, j) \in E$ represents a directed edge from node $i$ to node $j$, and $U$ is the set of URLs with subsets of which nodes in $V$ are labeled. For a URL $u \in U$, let $S(u)$ denote the set of all spreaders of the URL $u$; in other words, all nodes in $V$ that have posted $u$.
II-A PageRank on Social Network (PRSN)
We extend the PageRank algorithm to rank URLs on a social network (PRSN) as follows. Given a multi-labeled graph $G = (V, E, U)$, let $M$ be an $n \times n$ weighted adjacency matrix where $n$ is the number of nodes (i.e., $n = |V|$), $M_{ij} = 0$ if there is no directed edge from node $i$ to node $j$, and $M_{ij} = 1/d_i$ otherwise, where $d_i$ is the number of links pointing from node $i$ to other nodes. Let $r$ be a vector consisting of $n$ elements where the $i$-th element of $r$, denoted as $r_i$, corresponds to the PageRank score of the $i$-th node. Let $k$ be the maximum number of iterations that the PageRank algorithm runs. At the first iteration, every node sends its score divided by the number of links pointing from this node to other nodes through each outgoing link. Then each node updates its score to the sum of scores that it has received; that is,
$$r_i^{(1)} = \sum_{j : (j, i) \in E} M_{ji}\, r_j^{(0)}. \qquad (1)$$
If there is an edge from node $j$ to node $i$, then $M_{ji} = 1/d_j$ and node $j$ will send a $1/d_j$ fraction of its score to node $i$. Equation 1 can be compactly written as $r^{(1)} = M^T r^{(0)}$, where $M^T$ is the transpose of the matrix $M$, the superscript $(1)$ denotes the scores of all nodes after the first iteration, and $r^{(0)}$ is the initial vector. Let $r^{(k)}$ be the scores of nodes at the $k$-th or last iteration defined as:
$$r^{(k)} = M^T r^{(k-1)} = (M^T)^k r^{(0)}.$$
If there are sinks in the graph $G$, that is, nodes without outgoing edges, then for large enough $k$'s they will absorb all scores since scores can enter but cannot leave the sinks. One way to fix this problem is to scale the strength of links by a constant factor of $s$ and to compensate for this scaling by adding an artificial flow between any two nodes with the weight $(1 - s)/n$. This solution is known as the scaled version of PageRank. The score of the $i$-th node is then denoted as $r_i^{(k)}$ and is defined as:
$$r_i^{(k)} = s \sum_{j : (j, i) \in E} M_{ji}\, r_j^{(k-1)} + \frac{1 - s}{n}.$$
Given a subset of URLs $Q \subseteq U$, the PageRank score of a URL $u \in Q$ on a social network (PRSN) is defined as the sum of the scores of its spreaders, normalized over $Q$:
$$\mathrm{PRSN}(u) = \frac{\sum_{v \in S(u)} r_v^{(k)}}{\sum_{u' \in Q} \sum_{v \in S(u')} r_v^{(k)}}.$$
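A minimal sketch of scaled PageRank followed by the PRSN aggregation may look as follows. The edge direction (follower to followee), the choice $s = 0.85$, the uniform redistribution at sinks, and the normalization over the URL subset are our assumptions for illustration, not fixed by the text above.

```python
# Sketch of scaled PageRank on a follow graph, then PRSN aggregation.
# Assumptions: edges point follower -> followee, s = 0.85, sinks
# redistribute their score uniformly so the total score is preserved.

def scaled_pagerank(n, edges, s=0.85, k=50):
    """edges: list of (i, j) directed pairs over nodes 0..n-1."""
    out = [[] for _ in range(n)]
    for i, j in edges:
        out[i].append(j)
    r = [1.0 / n] * n
    for _ in range(k):
        nxt = [(1.0 - s) / n] * n  # artificial flow (1 - s)/n to every node
        for i in range(n):
            if out[i]:
                share = s * r[i] / len(out[i])
                for j in out[i]:
                    nxt[j] += share
            else:  # sink: spread its score uniformly instead of absorbing it
                for j in range(n):
                    nxt[j] += s * r[i] / n
        r = nxt
    return r

def prsn(r, spreaders, Q):
    """spreaders: dict url -> set of node ids; Q: subset of URLs to rank."""
    raw = {u: sum(r[v] for v in spreaders[u]) for u in Q}
    total = sum(raw.values()) or 1.0
    return {u: raw[u] / total for u in Q}
```

On a three-node cycle all PageRank scores are equal, so a URL with two spreaders receives twice the PRSN score of a URL with one.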
II-B HITS on Social Network (HSN)
The HITS algorithm used to rank URLs on a social network (HSN) is defined as follows . Given $G = (V, E, U)$ and a set of URLs $Q \subseteq U$, let $A$ be an $n \times m$ adjacency matrix where $n$ is the number of nodes and $m = |Q|$, $A_{iu} = 1$ if node $i$ has shared URL $u$, and $A_{iu} = 0$ otherwise. Let $k$ be the maximum number of iterations. Let $h$ and $a$ be vectors of scores for hubs and authorities, respectively. Authorities are the URLs (i.e., $u \in Q$) and hubs are nodes that share these URLs. The $i$-th element of the vector $h$ represents the score of the $i$-th hub, and the $u$-th element of the vector $a$ represents the score of the $u$-th authority. At the first iteration, the score of a hub gets set to the number of authorities to which it points, and the score of an authority gets set to the sum of the scores of hubs pointing to it. More formally, $h^{(1)}$ and $a^{(1)}$ are defined as:
$$h^{(1)} = A\,\mathbf{1}, \qquad a^{(1)} = A^T h^{(1)}.$$
Let $h^{(t)}$ and $a^{(t)}$ be the scores of hubs and authorities at the iteration $t$; the HITS algorithm  can be written as:
$$h^{(t)} = A\, a^{(t-1)}, \qquad a^{(t)} = A^T h^{(t)}, \qquad t = 2, \ldots, k.$$
Finally, the score of a URL $u$ in the authorities is the value $a_u^{(k)}$ normalized by the sum of scores in the vector $a^{(k)}$.
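The hub/authority updates above operate on the node–URL sharing relation rather than a web graph; a minimal sketch follows. The dictionary representation of shares and the per-iteration normalization (which leaves the relative ranking unchanged) are our choices for illustration.

```python
# Sketch of HITS over a node-URL bipartite structure: hubs are users,
# authorities are URLs. shares: dict node -> set of URLs that node posted.

def hits_on_shares(shares, urls, k=30):
    hubs = {v: 1.0 for v in shares}
    auth = {u: 1.0 for u in urls}
    for _ in range(k):
        # authority score: sum of the scores of hubs pointing to it
        auth = {u: sum(hubs[v] for v in shares if u in shares[v])
                for u in urls}
        # hub score: sum of the scores of authorities it points to
        hubs = {v: sum(auth[u] for u in shares[v] if u in urls)
                for v in shares}
        # normalize so scores stay bounded; final auth vector sums to 1
        za = sum(auth.values()) or 1.0
        zh = sum(hubs.values()) or 1.0
        auth = {u: x / za for u, x in auth.items()}
        hubs = {v: x / zh for v, x in hubs.items()}
    return auth
```

Note that, consistent with the observation later in the paper, this computation never consults the follow graph: only who shared which URL matters.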
II-C Social Ranking with Maximum Flow
We define the following maximum flow algorithm to rank URLs on a social network. Given a graph $G = (V, E, U)$ and a subset of URLs $Q \subseteq U$, let $s \in V$ represent a node. We want to rank the URLs in $Q$ with respect to $s$ by constructing a directed flow graph denoted as $G_f$.
The first part of the construction requires copying the social structure of $G$ to $G_f$. For every node $v$ that $s$ follows, we add $v$ to $G_f$ and the edge $(s, v)$ into $G_f$. At the subsequent iteration, we repeat the same process for every node that has been added into $G_f$ from the previous iteration; that is, if $v$ was added into $G_f$ and there is an edge $(v, w) \in E$, then we add $w$ to $G_f$ if $w$ has not been added before. The edge $(v, w)$ will still be added into $G_f$ even if $w$ has been added before. This process of constructing the graph continues until all possible nodes from $G$ that are reachable from $s$ have been added into $G_f$. For practical reasons, it is wise to stop when the diameter of $G_f$ is small, e.g., three, to reflect the influence of nodes that are within network proximity. At the end of the process, an edge originating from node $v$ gets a weight equal to the inverse of the degree of node $v$ in $G_f$.
The second part of constructing $G_f$ introduces some additional nodes and edges. For every URL $u \in Q$, we add a node $u$ into $G_f$. For every spreader $v \in S(u)$ of the URL $u$, we add an edge $(v, u)$ with a weight of 1 into $G_f$ if $v$ is in $G_f$. We add a super sink denoted $t$ into $G_f$ and add an edge $(u, t)$ with an edge weight of $\infty$ for every URL $u$ in $Q$.
The maximum flow of the graph $G_f$ from source $s$ to super sink $t$ is a function that assigns a non-negative value $f(e)$ to each edge $e$ so that it maximizes the total flow coming from the source to the super sink while satisfying two conditions: first, it does not exceed the weight of an edge, i.e., $f(e) \le w(e)$; and second, it obeys the conservation of flow law except for the source $s$ and the super sink $t$, i.e.,
$$\sum_{(v', v) \in G_f} f(v', v) = \sum_{(v, v'') \in G_f} f(v, v''),$$
where $f(v', v)$ is the assigned flow for the edge between two nodes, and $f(v, u)$ is the assigned flow for the edge between the node $v$ and the URL $u$. The construction of the graph $G_f$ is illustrated in Fig. 2. Polynomial running time algorithms such as the Edmonds-Karp algorithm for finding the maximum flow can be found in  and .
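An end-to-end sketch of the construction plus Edmonds-Karp is given below. The depth cutoff of three, the use of outdegree for the edge weights, and the infinite sink capacities follow our reading of the construction above; the exact parameters in the paper may differ.

```python
from collections import deque

INF = float('inf')

def build_flow_graph(follows, spreaders, Q, src, depth=3):
    """follows: dict node -> set of followees; spreaders: url -> set of nodes.
    Returns a capacity dict cap[(a, b)] and the super sink label."""
    cap, seen, frontier = {}, {src}, [src]
    for _ in range(depth):  # BFS copy of the social structure around src
        nxt = []
        for v in frontier:
            for w in follows.get(v, ()):
                # weight: inverse of the (out)degree of v in the copy
                cap[(v, w)] = 1.0 / len(follows[v])
                if w not in seen:
                    seen.add(w)
                    nxt.append(w)
        frontier = nxt
    sink = ('SINK',)
    for u in Q:
        for v in spreaders.get(u, ()):
            if v in seen:
                cap[(v, ('URL', u))] = 1.0  # unit edge spreader -> URL node
        cap[(('URL', u), sink)] = INF  # URL nodes drain into the super sink
    return cap, sink

def max_flow(cap, s, t):
    """Edmonds-Karp: push flow along shortest augmenting paths (BFS)."""
    res = dict(cap)  # residual capacities
    adj = {}
    for a, b in cap:
        res.setdefault((b, a), 0.0)
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    total = 0.0
    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:
            a = q.popleft()
            for b in adj.get(a, ()):
                if b not in parent and res.get((a, b), 0.0) > 1e-12:
                    parent[b] = a
                    q.append(b)
        if t not in parent:
            return total, res  # res lets callers read off per-edge flows
        path, v = [], t
        while parent[v] is not None:  # walk back from sink to source
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[e] for e in path)  # bottleneck capacity
        for a, b in path:
            res[(a, b)] -= push
            res[(b, a)] += push
        total += push
```

To rank the URLs for a user, one can read the flow entering each URL node (capacity minus residual, summed over its spreader edges) rather than only the total flow.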
III Data Collection
We collected data from two networks on the web. The first one is Google Buzz, a platform that combines social relationships and mini-blogging for information dissemination. The second network is Twitter, where users choose to follow sources of information. These two networks have messages containing URLs that provide us clues into how users would rank the quality of the information coming from the URLs using the three techniques we described in Section II.
We collected the Google Buzz data from early September of 2011 to the middle of October of the same year. There were around 2.5M users who shared approximately 100M messages, of which about 30M messages had URLs embedded in them. We collected the Twitter data from early September of 2011 to late December of that year. There were around 1M users who shared approximately 300M messages, of which 50M messages had URLs embedded in them. Additional details of the datasets for Google Buzz and Twitter are provided in Tables I and II. Please note that all URLs refer to all representations of URLs embedded into messages, and two different representations could be the same URL when they are masked by redirect services. *URLs refer to the final destination of URLs that have been shared by at least two users within the network.
III-A Data Limitations
First, using Google Buzz and Twitter limits users’ demographics, which probably is not a representative sample of the entire population, as mentioned by authors in . Second, parsing URLs from messages is prone to errors since humans have multiple ways of writing supposedly the same link; examples are URLs containing typos and spelling mistakes, URLs masked by redirect services, and so on. Third, researchers in  have argued that BFS sampling of a network by starting at a seed generates a large connected component but causes skewness in degree centralities and higher degree averages than in the entire network.
With limits on hardware resources, bandwidth sharing and data access, we attempted to collect as much as we could for the purpose of ranking URLs on social media. We were able to collect the entire connected component with BFS sampling for Google Buzz, which resulted in the sum of indegree being equal to the sum of outdegree. Twitter is a much larger network that consists of hundreds of millions of accounts. When calculating the data summary of Twitter, we look at users who have been processed in terms of collecting their information and not users who are waiting to be processed, which resulted in the sum of indegree not being equal to the sum of outdegree.
III-B Data Analysis
Two sets of URLs are considered for the purpose of our data analysis. From both the Google Buzz and Twitter datasets, we have randomly chosen 2,000 URLs with equal probability, denoted as the random set of URLs. We also have chosen the top 2,000 shared URLs, denoted as the popular set of URLs. There are two sets of URLs in each network, giving us four sets of URLs in total. For each URL, we calculated the size of the affected set, which consists of nodes that received the URL from the spreaders but chose not to spread it further.
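The affected set can be computed directly from follower lists; the sketch below reflects our reading of the definition above (nodes who follow a spreader minus the spreaders themselves), with the `followers` mapping as an assumed representation.

```python
# Sketch: the affected set of a URL is everyone who received it from a
# spreader (i.e., follows a spreader) but did not spread it themselves.

def affected_set(url, spreaders, followers):
    """spreaders: dict url -> set of nodes that posted the URL.
    followers: dict node -> set of nodes that follow that node."""
    received = set()
    for v in spreaders.get(url, ()):
        received |= followers.get(v, set())
    return received - spreaders.get(url, set())
```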
We also computed the average length of all shortest paths from 10 randomly chosen users to members of a random subset of spreaders. The results are shown in Fig. 3(a) for Google Buzz and Fig. 3(b) for Twitter. We substitute the entire spreader set with a randomly selected subset simply as a matter of efficiency because shortest-path computations are expensive in large networks as mentioned by authors in .
In Fig. 3, we noticed that as the size of the affected set increases, the average distance from randomly selected users to the information on the web page decreases for the random and popular sets of URLs in Google Buzz. This is because very large affected sets increase the likelihood that a randomly chosen user has a path through an affected user reaching a spreader. This agrees with our intuition that information collectively shared by users with high outdegrees has a greater coverage of dissemination. However, this correlation is weaker in Twitter due to the celebrity effect of some users having millions of followers and creating large affected sets; consider, for instance, a URL that was shared in the network only by a celebrity. More importantly, affected sets influence our social ranking techniques, where the structure of the network instead of the web topology is used to rank pages or URLs. For example, PageRank on a social network (PRSN) would rank URLs that were shared by high outdegree spreaders higher because they absorb most of the scores distributed to them. Our maximum flow approach to personalized social ranking would be affected at the first level if a user directly follows a high outdegree spreader. Because of the celebrity effect in Twitter, this rank increase will also carry over to the subsequent levels because the scores could circulate to the rest of the network through the intricate social relationships. Interestingly, HITS is not affected by the network structure since the algorithm does not consider social relationships but only takes into account which person shares what URL.
IV Social Ranking Experiments
For each network, we selected 30 URLs from the popular and random URL sets. For each selected URL, we calculated its score by using PageRank and HITS, and ranked the URLs (i.e., 1st, 2nd, 3rd, etc.) with respect to the set. We also ranked the selected URLs tailored to four randomly chosen users using maximum flow. Results are shown in Table III for popular URLs in Google Buzz, where we enumerated the 30 selected URLs in the first column, ranking positions using PageRank in the second column, HITS in the third column, and maximum flow in the fourth column. In the fourth column, the first element corresponds to the first person, the second element corresponds to the second person, and so on. We did the same for the random set of URLs in Google Buzz, shown in Table IV. The ranking results of Twitter are not shown as a full table, and full representations of the URLs listed in these tables have been shortened to save space.
We compared the ranking results of PageRank and HITS shown in Fig. 4 for Twitter. Ranking results of Google Buzz are listed in Tables III and IV. The ranking of popular URLs using PageRank and HITS is more consistent than that of the random URLs. We measured the ranking consistency as the average difference of two ranking algorithms on a set of URLs, i.e., $\frac{1}{m}\sum_{u} |p_A(u) - p_B(u)|$, and the sum of differences, i.e., $\sum_{u} |p_A(u) - p_B(u)|$, where $p_X(u)$ is the position of the URL $u$ determined by the algorithm $X$ and $m$ is the number of URLs.
For the popular URLs in Google Buzz, the average difference was 2.9, meaning that on average HITS and PageRank were off by 3 positions, and the sum of differences between them was 86. For the random URLs in Google Buzz, the average difference was 9.6 and the sum of differences between them was 288. For the popular URLs in Twitter, the average difference was 5.9 and the sum of differences between them was 178. For the random URLs in Twitter, the average difference was 7.2 and the sum of differences between them was 216. In both networks, popular URLs are ranked more consistently than random URLs, which makes the HITS algorithm more suitable than PageRank for ranking viral information because it is computationally more efficient.
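The two consistency measures above can be sketched as follows; the score-to-position conversion (descending order, 1 = best) is our assumed convention.

```python
# Sketch of the two consistency measures: average and total absolute
# difference in ranking position between two algorithms over a URL set.

def rank_positions(scores):
    """scores: dict url -> score; returns dict url -> position (1 = best)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {u: i + 1 for i, u in enumerate(ordered)}

def consistency(scores_a, scores_b):
    """Returns (average difference, sum of differences) in positions."""
    pa, pb = rank_positions(scores_a), rank_positions(scores_b)
    diffs = [abs(pa[u] - pb[u]) for u in pa]
    return sum(diffs) / len(diffs), sum(diffs)
```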
We noticed that the ranking results determined for each individual user using maximum flow are less correlated with each other than the results computed by PageRank and HITS. First, we compared the ranking results of maximum flow with PageRank and HITS using popular and random URLs for Google Buzz, shown in Fig. 5. The first and second plots on the left show ranking results of popular URLs, and the third and fourth plots on the right show ranking results of random URLs, labelled by their sub-captions. A point on the graph is a URL where the x-axis is the ranking position of the URL determined by maximum flow and the y-axis is the ranking position determined by either PageRank or HITS, as labelled on the y-axis. The identical layout for Twitter is shown in Fig. 6.
For personalized ranking, we measured the ranking consistency as the average difference of a pair of users with respect to a URL set. For instance, in Table V, the left column and the top row are the four selected users, where the $(i, j)$ element corresponds to the average difference of users $i$ and $j$. Please note that the upper triangle, or elements above the diagonal, refers to the random URLs and the lower triangle, or elements below the diagonal, refers to the popular URLs. The right column refers to the outdegree of users in the random URLs, and the last row refers to the outdegree of users in the popular URLs. For Twitter, the ranking results in the same format are given in Table VI.
For random URLs in Google Buzz, we noticed that one pair of users has an average difference of 1.7 while another pair has an average difference of 6.7. For popular URLs, the variability is smaller: one pair has an average difference of 2.0 and another pair has an average difference of 3.2. Outdegree measures the number of people a user follows, since the ranking results are based on them. Finally, ties are expected when using maximum flow since the number of URLs shared among friends is minuscule compared to the number of pages in the deep Web. Therefore, we simply use PageRank or HITS to break ties among pages when necessary.
V Related Work
Our work lies at the intersection of the study of social network analysis and the ranking techniques in information retrieval. The closest to our work are references    in which the authors studied the problem of social search, while we study the problem of social ranking. In , authors proposed an approximation to an algorithm called Partitioned Multi-Indexing to rank queries on the content generated in social networks by using a distributed hash table and schemas for updating the content continuously generated by the users. One similarity is that both their approach and ours consider information shared by social ties to be an important element in searching and ranking. Still, their work approximates network distances between users while our work uses the maximum flow of a constructed network. Another difference is that we do not focus on answering queries with social ties but on designing ranking techniques for URLs, which could be used to answer friendship-related queries. In , authors proposed simple techniques to re-rank search results based on Similarity and Familiarity networks using their enterprise social network.
While social search has been introduced in multiple settings, from the Social Query Model (SQM)  to the implementation of social search applications for mobile devices , a good amount of work has focused on finding the right answer to a search query by routing the query to the right person in a social network graph . We study the structure of the network to socially and automatically rank URLs without user intervention. In the Social Query Model , routing paths of search queries are studied in decentralized systems where the nondeterministic behavior of each agent, willing to provide a correct answer with some level of accuracy and expertise, is taken into consideration when forming an optimal routing policy. In Aardvark , the focus was to route a query from the searcher to a designated user in a social network who was assumed to be able to provide an answer. We take the approach of using network flow, where the goal is to automatically rank a set of pages through the eyes of the searcher's social ties.
Indegree-based algorithms such as PageRank , SALSA , and HITS  are used for ranking pages on a web graph where an edge between two pages represents an endorsement of one page by another page. The intuition behind network flow is that it automatically incorporates indegree analysis where a node that does not share a web page will distribute its flow to the sources that it follows, and sources of high indegree will eventually get the largest share of flow if the information is not found locally. In , authors looked at direct annotations from users in Delicious to enhance searches while we look at shared messages embedded with URLs to rank pages. To the best of our knowledge, we are the first to propose using maximum flow to personalize the ranking of pages based on the messages containing URLs that users share in online social networks.
VI Conclusion
Information shared between users in online social networks, such as URLs, provides a unique perspective on the ranking of pages. In our approach, humans instead of pages are the ones who rank the URLs by sharing them, and the social network of the users instead of the web graph topology is used to propagate the ranking.
First, we collected two large-scale information networks of online users to study how users in these networks share URLs, which impacts the distance between a person and a URL. For instance, researchers in  estimated the number of hops between any two pages to be 19 on average, while Milgram estimated that the number of hops between any two people is no more than 6 . Since information propagates differently in social networks, the social structure bounds how far a person is from a shared URL.
Second, we reinterpreted the ranking techniques of PageRank and HITS and proposed to use maximum network flow to personalize the ranking of pages tailored to each individual user. Maximum flow detects the popularity of a shared URL among friends, but popularity does not necessarily reflect endorsement. We expected that each unique individual would rank the URLs differently, since no two people on a social network are the same. Interestingly, the ranking results of popular URLs using PageRank and HITS are more correlated than those of random URLs, suggesting that the overall view of users on ubiquitous information is more consistent, but everyone has their own opinion in the end. Instead of attempting to socially rank the entire web, we re-ranked a selected set of URLs to make our approach scalable and efficiently executable for search engines. If the size of the web doubles in the next few years, it would not affect our approach since only the subset of URLs that users shared is actually re-ranked.
More importantly, we believe that personalizing the ranking is useful for social search because it provides a mechanism for interaction between the searcher and the sharer, where the searcher can discuss with the sharer the item relating to a query on a search engine; for instance, a new product that the sharer posted from appleinsider.com or a piece of political news from nytimes.com. This potential interaction between the searcher and the sharer is valuable because the influence of the sharer on the searcher is stronger than the influence coming from the authorities detected by HITS and PageRank in many non-technical and social situations, though not all. This feature could be implemented in search engines where pages returned for a given query are re-ranked via social networks if there are pages shared among friends or other associates of the searcher that are related to the query.
-  J. Kleinberg and S. Lawrence, “The structure of the web,” Science, vol. 294, no. 5548, pp. 1849–1850, 2001.
-  D. Easley and J. Kleinberg, Networks, Crowds, Markets. Cambridge University Press, 2010.
-  J. Kleinberg, “Authoritative sources in a hyperlinked environment,” Journal ACM, vol. 46, pp. 604–632, 1999.
-  A. V. Goldberg, E. Tardos, and R. E. Tarjan, “Network flow algorithms,” in Paths, Flows, and VLSI-Design, pp. 101–164, 1990.
-  J. Kleinberg and E. Tardos, Algorithm Design. Pearson, 2006.
-  A. Mislove, S. Lehmann, Y. Ahn, J. Onnela, and J. Rosenquist, “Understanding the demographics of twitter users,” in Proceedings of the 5th Int. AAAI Conf. on Weblogs and Social Media, 2011.
-  M. Kurant, A. Markopoulou, and P. Thiran, “Towards unbiased bfs sampling,” Selected Areas in Communications, vol. 29, pp. 1799–1809, 2011.
-  A. Das Sarma, S. Gollapudi, M. Najork, and R. Panigrahy, “A sketch-based distance oracle for web-scale graphs,” in Proceedings of the 3rd ACM Int. Conf. on Web Search and Data Mining, pp. 401–410, 2010.
-  B. Bahmani and A. Goel, “Partitioned multi-indexing: bringing order to social search,” in Proceedings of the 21st Int. Conf. on World Wide Web, pp. 399–408, 2012.
-  S. Bao, G. Xue, X. Wu, Y. Yu, B. Fei, and Z. Su, “Optimizing web search using social annotations,” in Proceedings of the 16th Int. Conf. on World Wide Web, pp. 501–510, 2007.
-  D. Carmel, N. Zwerdling, I. Guy, S. Koifman, N. Har’el, I. Ronen, E. Uziel, S. Yogev, and S. Chernov, “Personalized social search based on the user’s social network,” in Proceedings of the 18th ACM Conf. on Information and Knowledge Management, pp. 1227–1236, 2009.
-  A. Banerjee and S. Basu, “A social query model for decentralized search,” in Proceedings in the 13th Int. Conf. on Knowledge Discovery and Data Mining, 2008.
-  D. Horowitz and S. D. Kamvar, “The anatomy of a large-scale social search engine,” in Proceedings of the 19th Int. Conf. on World Wide Web, pp. 431–440, 2010.
-  J. Davitz, J. Yu, S. Basu, D. Gutelius, and A. Harris, “ilink: Search and routing in social networks,” in Proceedings of the 13th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 931–940, 2007.
-  S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search engine,” in Proceedings of the 7th Int. Conf. on World Wide Web, pp. 107–117, 1998.
-  R. Lempel and S. Moran, “The stochastic approach for link-structure analysis (salsa) and the tkc effect,” in Proceedings of the 9th Int. Conf. on World Wide Web, pp. 387–401, 2000.
-  R. Albert, H. Jeong, and A. L. Barabasi, “The diameter of the world wide web,” Nature, vol. 401, pp. 130–131, 1999.
-  S. Milgram, “The small world problem,” Psychology Today, vol. 2, pp. 60–67, 1967.
Appendix A Acknowledgement
Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.