Real-Time Community Detection in Large Social Networks on a Laptop
Abstract
For a broad range of research, governmental and commercial applications it is important to understand the allegiances, communities and structure of key players in society. One promising direction towards extracting this information is to exploit the rich relational data in digital social networks (the social graph). As social media data sets are very large, most approaches make use of distributed computing systems for this purpose. Distributing graph processing requires solving many difficult engineering problems, which has led some researchers to look at single-machine solutions that are faster and easier to maintain. In this article, we present a single-machine real-time system for large-scale graph processing that allows analysts to interactively explore graph structures. The key idea is that the aggregate actions of large numbers of users can be compressed into a data structure that encapsulates user similarities while being robust to noise and queryable in real time. We achieve single-machine real-time performance by compressing the neighbourhood of each vertex using minhash signatures and facilitate rapid queries through Locality Sensitive Hashing. These techniques reduce query times from hours using industrial desktop machines operating on the full graph to milliseconds on standard laptops. Our method allows exploration of strongly associated regions (i.e. communities) of large graphs in real time on a laptop. It has been deployed in software that is actively used by social network analysts and offers another channel for media owners to monetise their data, helping them to continue to provide free services that are valued by billions of people globally.
Department of Computing
Imperial College London
London SW7 2AZ, UK

Josh Levy-Kramer josh@starcount.com
Starcount Insights
2 Riding House Street
London W1W 7FA

Clive Humby clive@starcount.com
Starcount Insights
2 Riding House Street
London W1W 7FA

Marc Peter Deisenroth m.deisenroth@imperial.ac.uk
Department of Computing
Imperial College London
London SW7 2AZ, UK
1 Introduction
Algorithms to discover groups of associated entities from relational (networked) data sets are often called community detection methods. They come in two forms: global methods, which partition the entire graph, and local methods, which look for vertices that are related to an input vertex and only work on a small part of the graph. We are concerned with community detection on large graphs that runs on a single commodity computer. To achieve this we combine the two approaches, using local community detection to identify an interesting region of a graph and then applying global community detection to help understand that region.
Our focus is on community detection using social media data. Social media data provides a record of global human interactions at an unprecedented scale. Discovering communities in the social graph has a large number of governmental and industrial applications, which include: security, where analysts explore a network looking for groups of potential adversaries; social sciences, where queries can establish the important relationships between individuals of interest; e-commerce, where queries reveal related products or users; and marketing, where companies seek to optimise advertising channels or celebrity endorsement portfolios. These applications do not disrupt user experience in the way that sponsored links or feed advertising do, offering an alternative means for social media providers to continue to offer free services.
As an illustration of a commercial application of community detection using Twitter data, take a company that wants to trade in a new geographic region. To do this they need to understand the region’s competitors, customers and marketing channels. Using our system they input the Twitter handles for their existing products, key people, brands and endorsers, and in real time receive the accounts closely related to their company in that market. The output is automatically structured into groups (communities) such as media titles, sports people and other related companies. Analysts examine the results and explore different regions by changing the input accounts. We show a high-level illustration for the drinks brand Diageo in Figure 1.
Throughout this paper we refer to graphs. In this context a graph is a collection of vertices and edges connecting them. A graph is usually written $G = (V, E)$, and a graph with weighted edges as $G = (V, E, W)$. A network is a richer structure than a graph, comprising a graph and a collection of metadata describing the vertices and/or edges of the graph. A community is a collection of vertices that share many more edges than would be expected from a random subset of vertices. In the context of the Twitter graph, a vertex is a Twitter account and an (undirected) edge between two accounts exists if either account Follows the other. A community might be the set of Twitter accounts belonging to machine learning researchers. In addition to the Twitter graph, the Twitter network also includes metadata associated with the accounts (e.g., name, description) and edges (e.g., creation time, direction).
Our algorithm focusses exclusively on the properties of the graph. We are particularly interested in the neighbourhood graph. The neighbourhood graph of a vertex consists of the set of all vertices that are directly connected to it, irrespective of the edge direction. Digital Social Network (DSN) neighbourhood graphs can be very large; in Twitter, the largest have almost 100 million members (as of June 2016). We propose that robust associations between social network accounts can be reached by considering the similarity of their neighbourhood graphs. This proposition relies on the existence of homophily in social networks. The homophily principle states that people with similar attributes are more likely to form relationships (McPherson et al., 2001). Accordingly, social media accounts with similar neighbourhood graphs are likely to have similar attributes.
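The neighbourhood definition above can be made concrete with a minimal sketch. The helper name `build_neighbourhoods` and the toy account names are hypothetical and chosen only for illustration; the sketch assumes the Follow relation is available as a list of (follower, followed) pairs.

```python
from collections import defaultdict

def build_neighbourhoods(follows):
    """Map each account to the set of accounts directly connected to it.

    `follows` is an iterable of (follower, followed) pairs. Edge direction
    is ignored, matching the undirected definition used in the text.
    """
    neighbours = defaultdict(set)
    for follower, followed in follows:
        neighbours[follower].add(followed)
        neighbours[followed].add(follower)
    return dict(neighbours)

# Toy data, invented for illustration only.
follows = [("ann", "bob"), ("cat", "bob"), ("bob", "ann"), ("cat", "dan")]
neighbourhoods = build_neighbourhoods(follows)
```

Storing neighbourhoods as sets means that a reciprocated Follow (ann and bob here) contributes a single undirected edge.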
We seek to build a system that (1) produces high quality communities from very noisy data, (2) is robust to failure and does not require engineering support, and (3) is parsimonious with the time of its users. The first constraint leads us to use the neighbourhood graph as the unit of comparison between vertices. The neighbourhood graph is generated by the actions of large numbers of independent users, in contrast to features like text content or group memberships, which are usually controlled by a single user. The second requirement leads us to search for a single-machine solution and the third prescribes a real-time system (or as close as possible). High performance is vital as analysts wish to interact with the data, combining the results of previous experiments to inform new ones. The difference between real-time and ‘quite quick’ is important. Real-time response is primary amongst the reasons that interactive programming languages like Python and R have replaced compiled languages like C++ as the tools of choice for data analysts. We aim to offer similar improvements in usability.
Currently, no tool exists that provides real-time analysis of large graphs on a single commodity machine. Existing methods to analyse local community structure in large graphs either rely on distributed computing facilities or incur excessive runtimes, making them impractical for exploratory and interactive work (Clauset, 2005; Bahmani et al., 2011). In this article, we describe our real-time analysis tool for detecting communities in large graphs using only a laptop. We focus on a 700 million user Twitter network. However, our work is more generally applicable as it does not rely upon Twitter-specific data, only the graph structure, and we provide some results from Facebook to demonstrate this.
There are two core problems to solve: (1) The graph must fit into the memory of a single (commodity) machine. (2) Many neighbourhood graphs containing up to 100 million vertices must be compared in milliseconds. The first step to solving these problems is to compress the neighbourhood graphs into fixed-length minhash signatures. Minhash signatures vastly reduce the size of the graph while at the same time encoding an efficient estimation of the Jaccard similarity between any two neighbourhood graphs. (The Jaccard similarity is a widely used symmetric measure of the likeness of two sets.) Choosing minhash signatures of an appropriate length squeezes the graph into memory and addresses problem (1). To solve problem (2) and achieve real-time querying we use the elements of the minhash signatures as the basis to build a Locality Sensitive Hashing (LSH) data structure. LSH facilitates querying of similar accounts in constant time. This combination of minhashing and LSH allows analysts to enter an account or a set of accounts and in milliseconds receive the set of most related accounts. From this set we use the minhash signatures to rapidly construct a weighted graph and apply the WALKTRAP community detection algorithm before visualising the results (Pons and Latapy, 2005).
Our system applies well-studied techniques in an innovative way: (1) To the best of our knowledge, minhashing has not been applied to the neighbourhood graph before; minhashing is normally only used for very similar sets. (2) We show that minhashing is effective for community detection when applied to a broad range of neighbourhood graph similarities. (3) We develop an agglomerative clustering algorithm and prove an original update procedure for minhash signatures in this setting. The novel combination of these techniques allows our system to perform real-time community detection on graphs that exceed 100 million vertices.
The contributions of this article are:

We establish that robust associations between social media users can be determined by means of the Jaccard similarity of their neighbourhood graphs.

We show that the approximations implicit in minhashing and LSH minimally degrade performance and allow querying of very large graphs in real time.

System design and evaluation: We have designed and evaluated an end-to-end Python system for extracting data from social media providers and compressing the data into a form where it can be efficiently queried in real time.

We demonstrate how queries can be applied to a range of problems in graph analysis, e.g., understanding the structure of industries, allegiances within political parties and the public image of a brand.
There are seven sections in this paper. Section 2 describes how to mine the Twitter graph and can be omitted by readers uninterested in replicating our work. Section 3 describes the related work, which is necessarily broad as our system brings together community detection, graph processing and data structures. Section 4 contains our detailed methodology with the exception of how we prepare and analyse the ground truth data, which is left until Section 5. In Section 6 we describe the results of three experiments, which validate our methodology. Conclusions and future work follow in Section 7.
2 Data and Preliminaries
In this article, we focus on Twitter data because Twitter is the most widely used Digital Social Network (DSN) for academic research. The Twitter Follower graph consists of roughly one billion vertices (Twitter accounts) and 30 billion edges (Follows).
To show that our method generalises to other social networks, we also present some results using a Facebook Pages engagement graph containing 450 million vertices (FB accounts) and 700 million edges (Page likes / comments) (see Section 6).
Most DSNs have public Application Programming Interfaces (APIs) so that third-party developers can build applications using their data. Delivering data at massive scale incurs significant costs and, to manage these, DSNs limit the rate at which data can be downloaded. Rate limiting varies between networks. Usually, when a DSN account holder logs into a third-party application using their social login, they grant the application owner one access token. Each access token allows the application owner to download a fixed amount of data in a given time window. This procedure gives more popular apps access to more data. Our work makes use of access tokens generated by several client-facing apps (Starcount Playlist, Starcount Vibe and Chatsnacks).
To collect Twitter data we use the REST API to crawl the network, identifying every account with more than 10,000 Followers (the number of Followers is contained in the Twitter account metadata, i.e., it is available without collecting and counting all edges), and gather their complete Follower lists. Our data set contains 675,000 such accounts with a total of Followers, of which were unique. We use accounts with greater than 10,000 Followers (though 700 million Twitter accounts are used to build the signatures) because accounts of this size tend to have public profiles (Wikipedia pages or Google hits), making the results interpretable. To generate data from Facebook we matched the Twitter accounts with greater than 10,000 Followers to Facebook Page accounts (Facebook Pages are the public equivalent of the private profiles; many influential users have a Facebook Page) using a combination of automatic account name matching and manual verification. Facebook Page likes are not available retrospectively, but can be collected through a real-time stream. We collected the stream over a period of two years, starting in late 2013. Downloading large quantities of social media data is an involved subject and we include details of how we did this in Appendix A.1 for reproducibility.
3 Related Work
Existing approaches to large-scale, efficient community detection come in three flavours: more efficient community detection algorithms, innovative ways to perform processing on large graphs, and data structures for graph compression and search. Table 1 shows related approaches to this problem and which constraints they satisfy.
Method  Real-time  Large graphs  Single commodity machine (SCM) 

Modularity optimisation (Newman, 2004a)  ✗  ✗  ✓ 
WALKTRAP (Pons and Latapy, 2005)  ✗  ✗  ✓ 
INFOMAP (Rosvall and Bergstrom, 2008)  ✗  ✗  ✓ 
Louvain method (Blondel et al., 2008)  ✗  ✓  ✓ 
BigCLAM (Yang and Leskovec, 2013)  ✗  ✓  ✓ 
GraphLab (Low et al., 2014)  ✗  ✓  ✗ 
Pregel (Malewicz et al., 2010)  ✗  ✓  ✗ 
Surfer (Chen et al., 2010)  ✗  ✓  ✗ 
GraphChi (Kyrola et al., 2012)  ✗  ✓  ✓ 
Twitter WTF (Gupta et al., 2013)  ✓  ✓  ✗ 
LEMON (Li et al., 2015)  ✗  ✓  ✓ 
Our Method  ✓  ✓  ✓ 
3.1 Community Detection Algorithms
Community detection methods have been developed in areas as diverse as neuronal firing (Bullmore and Sporns, 2009), electron spin alignment (Reichardt and Bornholdt, 2006) and social models (Yang and Leskovec, 2013). Fortunato (2010) and Newman (2003) both provide excellent and detailed overviews of the vast community detection literature. Approaches can be broadly categorised into local and global methods.
Global methods assign every vertex to a community, usually by partitioning the vertices. Many highly innovative schemes have been developed to do this. Modularity optimisation (Newman, 2004a) is one of the best known. Modularity is a metric used to evaluate the quality of a graph partition. Communities are determined by selecting the partition that maximises the modularity. An alternative to modularity was developed by Pons and Latapy (2005), who applied random walks on the graph to define communities as regions in which walkers become trapped (WALKTRAP). Rosvall and Bergstrom (2008) combined random walks with efficient coding theory to produce INFOMAP, a technique that provides a new perspective on community detection: communities are defined as the structural subunits that facilitate the most efficient encoding of information flows through a network. All three methods are well optimised for their motivating networks, but were not designed for graphs at the scale of modern Digital Social Networks (DSNs) and cannot easily scale to very large data sets.
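To illustrate what a modularity score measures, the following minimal sketch evaluates a given partition. It assumes the standard definition for an undirected, unweighted graph, which the text does not spell out: Q is the sum over communities of e_c / m - (d_c / 2m)^2, with m the number of edges, e_c the number of edges inside community c and d_c the total degree of its vertices. The function name and toy graph are illustrative only.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted simple graph.

    `edges` is a list of (u, v) pairs and `community` maps each vertex to a
    community label. Uses Q = sum_c [ e_c / m - (d_c / (2 m))^2 ].
    """
    m = len(edges)
    internal = defaultdict(int)  # edges with both endpoints in community c
    degree = defaultdict(int)    # total degree of the vertices in community c
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {v: "all" for v in range(6)}
q_split = modularity(edges, split)    # one community per triangle
q_merged = modularity(edges, merged)  # everything in one community
```

On this toy graph the two-triangle partition scores 5/14 (about 0.357) while the single-community partition scores 0, so a modularity optimiser would prefer the split.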
The availability of data from the Web, DSNs and services like Wikipedia has focussed research attention on methods that scale. An early success was the Louvain method, which allowed modularity optimisation to be applied to large graphs (the authors report 100 million vertices and 1 billion edges). However, the method was not intended to be real-time and the 152 minute runtime is too slow to achieve real-time performance, even allowing for 8 years of hardware advances (Blondel et al., 2008). Another noteworthy technique applied to very large graphs is the BigCLAM method, which, in addition to operating at scale, is able to detect overlapping communities (Yang and Leskovec, 2013). However, in common with the Louvain method, BigCLAM is not a real-time algorithm that could facilitate interactive exploration of social networks.
In contrast to global community detection methods, local methods do not assign every vertex to a community. Instead they find vertices that are in the same community as a set of input vertices (seeds). For this reason they are normally faster than global methods. Local community detection methods were originally developed as crawling strategies to cope with the rapidly expanding web graph (Flake et al., 2000). Following the huge impact of the PageRank algorithm (Page et al., 1998), many local random walk algorithms have been developed. Kloumann and Kleinberg (2014) conducted a comprehensive assessment of local community detection algorithms on large graphs. In their study Personal PageRank (PPR) (Haveliwala, 2002) was the clear winner. PPR is able to measure the similarity to a set of vertices instead of the global importance/influence of each vertex by applying a slight modification to PageRank. PageRank can be regarded as a sequence of two-step processes that are iterated until convergence: a random walk on the graph followed by (with small probability) a random teleport to any vertex. PPR modifies PageRank in two ways: only a small number of steps are run (often 4), and any random walker selected to teleport must return to one of the seed vertices. Recent extensions have shown that seeding PPR with the neighbourhood graph can improve performance (Gleich and Seshadhri, 2012) and that PPR can be used to initiate local spectral methods with good results (Li et al., 2015).
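The two PPR modifications described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the code used by any of the cited works: the teleport probability of 0.15 is a conventional choice that the text does not specify, walkers are assumed to start on the seeds, and the graph is assumed to be small enough to hold as a dictionary of neighbour sets.

```python
def personalised_pagerank(neighbours, seeds, steps=4, teleport=0.15):
    """Truncated personalised PageRank on an undirected graph.

    `neighbours` maps each vertex to a non-empty set of adjacent vertices.
    At each step a fraction `teleport` of the probability mass jumps back
    to the seeds and the rest moves to a uniformly chosen neighbour.
    """
    restart = {v: 1.0 / len(seeds) for v in seeds}
    scores = dict(restart)  # all walkers start on the seeds
    for _ in range(steps):
        nxt = {v: teleport * p for v, p in restart.items()}
        for v, p in scores.items():
            share = (1.0 - teleport) * p / len(neighbours[v])
            for u in neighbours[v]:
                nxt[u] = nxt.get(u, 0.0) + share
        scores = nxt
    return scores

# Toy graph: two triangles (0, 1, 2) and (3, 4, 5) joined by the edge 2-3.
neighbours = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
              3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores = personalised_pagerank(neighbours, seeds=[0])
```

Because teleporting walkers always return to the seeds and only four steps are run, most of the probability mass stays in the seed's own triangle, which is what makes the scores usable as a local similarity measure.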
Random walk methods are usually evaluated by power iteration: a series of matrix multiplications requiring the full adjacency matrix to be read into memory. The adjacency matrix of a large graph will not fit in memory, so distributed computing resources are used (e.g., Hadoop). While distributed systems are continually improving, they are not always available to analysts, require skilled operators and typically have an overhead of several minutes per query.
A major challenge when applying both local and global community detection algorithms to real-world networks is performance verification. Testing algorithms on a held-out labelled test set is complicated by the lack of any agreed definition of a community. Much early work makes use of small hand-labelled communities and treats the original researchers’ decisions as gold standards (Sampson, 1969; Zachary, 1977; Lusseau, 2003). Irrespective of the validity of this process, a single manual labeller (or a small number of them) cannot produce ground truth for large DSNs. Yang and Leskovec (2012) proposed a solution to the verification problem in community detection. They observe that in practice, community detection algorithms detect communities based on the structure of interconnections, but results are verified by discovering common attributes or functions of vertices within a community. Yang and Leskovec (2012) identified 230 real-world networks in which they define ground-truth communities based on vertex attributes. The specific attributes that they use are varied; examples include publication venues for academic co-authorship networks, chat group membership within social networks and product categories in co-purchasing networks.
3.2 Graph Processing Systems
A complementary approach to efficient community detection on large graphs is to develop more efficient and robust systems. This is an area of active research within the systems community. General-purpose tools for distributed computation on large-scale graphs include GraphLab, Pregel and Surfer (Chen et al., 2010; Malewicz et al., 2010; Low et al., 2014). Purpose-built distributed graph processing systems offer major advances over the widely used MapReduce framework (Pace, 2012). This is particularly true for iterative computations, which are common in graph processing and include random walk algorithms. However, distributed graph processing still presents major design, usability and latency challenges. Typically the run times of algorithms are dominated by communication between machines over the network. Much of the complexity comes from partitioning the graph to minimise network traffic. The general graph partitioning problem is NP-hard and remains unsolved. These concerns have led us and other researchers to buck the overarching trend for increased parallelisation on ever larger computing clusters and search for single-machine graph processing solutions. One such solution is GraphChi, a single-machine system that offers a powerful and efficient alternative for processing large graphs (Kyrola et al., 2012). The key idea is to store the graph on disk and optimise I/O routines for graph analysis operations. GraphChi achieves dramatic speedups compared to conventional systems, but the repeated disk I/O makes real-time operation impossible. Twitter also uses a single-machine recommendation system that serves “Who To Follow (WTF)” recommendations across its entire user base (Gupta et al., 2013). WTF provides real-time recommendations using random walk methods similar to PPR. They achieve this by loading the entire Twitter graph into memory. Following their design specification of 5 bytes per edge, roughly $5 \times 30 \times 10^{9}$ bytes $= 150$ GB of RAM would be required to load the current graph of 30 billion edges, which is an order of magnitude more than is available on our target platforms.
3.3 Graph Compression and Data Structures
The alternative to using large servers, clusters or disk storage for processing large graphs is to compress the whole graph to fit into the memory of a single machine. Graph compression techniques were originally motivated by the desire for single machine processing on the Web Graph. Approaches focus on ways to store the differences between graph structures instead of the raw graph. Adler and Mitzenmacher (2001) searched for web pages with similar neighbourhood graphs and encoded only the differences between edge lists. The seminal work by Boldi and Vigna (2004) ordered Web pages lexicographically endowing them with a measure of locality. Similar compression techniques were adapted to social networks by Chierichetti et al. (2009). They replaced the lexical ordering with an ordering based on a single minhash value of the outedges, but found social networks to be less compressible than the Web (14 versus 3 bits per edge). While the aforementioned techniques achieve remarkable compression levels, the cost is slower access to the data (Gupta et al., 2013).
Minhashing is a technique for representing large sets with fixed-length signatures that encode an estimate of the similarity between the original sets. When the sets are subgraphs, minhashing can be used for lossy graph compression. The pioneering work on minhashing was by Broder (1997), whose implementation dealt with binary vectors. This was extended to counts (integer vectors) by Charikar (2002) and later to continuous variables (Philbin, 2008). Efficient algorithms for generating the hashes are discussed by Manasse and McSherry (2008). Minhashing has been applied to clustering the Web by Haveliwala et al. (2000), who considered each web page to be a bag of words and built hashes from the count vectors.
Two important innovations that improve upon minhashing are b-bit minhashing (Li and König, 2009) and Odd Sketches (Mitzenmacher et al., 2014). When designing a minhashing scheme there is a trade-off between the size of the signatures and the variance of the similarity estimator. Li and König (2009) show that it is possible to improve on the size-variance trade-off by using longer signatures, but only keeping the lowest b bits of each element (instead of all 32 or 64). Their work delivers large improvements for very similar sets (more than half of the total elements are shared) and for sets that are large relative to the number of elements in the sample space. Mitzenmacher et al. (2014) improved upon b-bit minhashing by showing that for approximately identical sets (Jaccard similarities close to 1) there is a better estimation scheme.
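The truncation step of b-bit minhashing can be sketched as follows. The function names are illustrative, and the `debias` correction is a simplification rather than the full estimator of Li and König (2009): it assumes the sets are small relative to the sample space, so that two unequal hash values agree on their lowest b bits by chance with probability close to 2^-b.

```python
def to_b_bits(signature, b):
    """Keep only the lowest `b` bits of every hash value in a signature."""
    mask = (1 << b) - 1
    return [h & mask for h in signature]

def match_fraction(sig_a, sig_b):
    """Fraction of positions at which two equal-length signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def debias(fraction, b):
    """Remove chance collisions, assuming they occur at rate 2**-b.

    Simplifying assumption: sets are small relative to the sample space.
    """
    chance = 2.0 ** -b
    return (fraction - chance) / (1.0 - chance)

truncated = to_b_bits([0b1011, 0b0100, 255], b=2)  # lowest two bits of each value
```

Each element now costs b bits instead of 32 or 64, so a signature of the same size can hold more hash values, at the price of the chance collisions that `debias` approximately removes.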
Locality Sensitive Hashing (LSH) is a technique introduced by Indyk and Motwani (1998) for rapidly finding approximate near neighbours in high dimensional space. In the original paper they define a parameter $\rho$ that governs the quality of LSH algorithms. A lower value of $\rho$ leads to a better algorithm. There has been a great deal of work studying the limits on $\rho$. Of particular interest, Motwani et al. (2005) used a Fourier analytic argument to provide a tighter lower bound on $\rho$, which was later bettered by O’Donnell et al. (2009), who exploited properties of the noise stability of boolean functions. The latest LSH research uses the structure of the data, through data-dependent hash functions (Andoni et al., 2014), to get even tighter bounds. As the hash functions are data dependent, unlike earlier work, only static data structures can be addressed.
4 Real-Time Community Detection
System  Typical runtime (s)  Space requirement (GB) 

Naive edge list  8,000  240 
Minhash signatures  1  4 
LSH with minhash  0.25  5 
In this section, we detail our approach to real-time community detection in large social networks. Our method consists of two main stages. In stage one, we take a set of seed accounts and expand this set to a larger group containing the accounts most related to the seeds. This stage is depicted by the box labelled “Find similar accounts” in Figure 1. Stage one uses a very fast nearest neighbour search. In stage two, we embed the results of stage one into a weighted graph where each edge is weighted by the Jaccard similarity of the two accounts it connects. We apply a global community detection algorithm to the weighted graph and visualise the results. Stage two is depicted by the box labelled “Structure and visualise” in Figure 1.
In the remainder of the paper we use the following notation: the $i$-th user account (or interchangeably, vertex of the network) is denoted by $A_i$ and $N(A_i)$ gives the set of all accounts directly connected to $A_i$ (the neighbours of $A_i$). The set of accounts that are input by a user into the system are called seeds and denoted by $S$, while $C$ (community) is used for the set of accounts that are returned by stage one of the process.
4.1 Stage 1: Seed Expansion
The first stage of the process takes a set of seed accounts as input, orders all other accounts by similarity to the seeds and returns an expanded set of accounts similar to the seed account(s). For this purpose, we require three ingredients:

A similarity metric between accounts

An efficient system for finding similar accounts

A stopping criterion to determine the number of accounts to return
In the following, we detail these three ingredients of our system, which will allow for real-time community detection in large social networks on a standard laptop.
4.1.1 Similarity Metric
The property of each account that we choose to compare is the neighbourhood graph. The neighbourhood graph is an attractive feature as it is not controlled by an individual, but by the (approximately) independent actions of large numbers of individuals. The edge generation process in Digital Social Networks (DSNs) is very noisy, producing graphs with many extraneous and missing edges. As an illustrative example, the pop stars Eminem and Rihanna have collaborated on four records and a stadium tour (“Love the Way You Lie” (2010), “The Monster” (2013), “Numb” (2012), “Love the Way You Lie (Part II)” (2010) and the Monster Tour (2014)). Despite this clear association, Eminem is not one of Rihanna’s 40 million Twitter followers. However, Rihanna and Eminem have a Jaccard similarity of 18%, making Rihanna Eminem’s 6th strongest connection. Using the neighbourhood graph as the unit of comparison between accounts mitigates the noise associated with the unpredictable actions of individuals. The metric that we use to compare two neighbourhood graphs is the Jaccard similarity. The Jaccard similarity has two attractive properties for this task. Firstly, it is a normalised measure providing comparable results for sets that differ in size by orders of magnitude. Secondly, minhashing can be used to provide an unbiased estimator of the Jaccard similarity that is both time and space efficient. The Jaccard similarity is given by
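$$J(A_i, A_j) = \frac{|N(A_i) \cap N(A_j)|}{|N(A_i) \cup N(A_j)|} \qquad (1)$$
where $N(A_i)$ is the set of neighbours of the $i$-th account, $A_i$.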
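As a small worked example of the Jaccard similarity, the sketch below computes it exactly for Python sets. The function name and the toy sets are illustrative; the convention of returning 0 for two empty sets is an assumption made here to avoid division by zero.

```python
def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two collections."""
    a, b = set(a), set(b)
    union = len(a | b)
    # Convention assumed here: two empty sets have similarity 0.
    return len(a & b) / union if union else 0.0

overlap = jaccard({1, 2, 3}, {2, 3, 4})            # 2 shared out of 4 distinct
nested = jaccard(range(1, 11), range(1, 1001))     # 10 shared out of 1000 distinct
```

The second case shows the normalisation at work: a small neighbourhood that is entirely contained in one 100 times larger still only scores 0.01.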
4.1.2 Efficient Account Search
To efficiently search for accounts that are similar to a set of seeds we represent every account as a minhash signature and use a Locality Sensitive Hashing (LSH) data structure based on the minhash signatures for approximate nearest neighbour search.
Rapid Jaccard Estimation via Minhash Signatures
Computing the Jaccard similarities in (1) is very expensive as each set can have up to roughly $10^{8}$ members and calculating intersections is superlinear in the total number of members of the two sets being intersected. Multiple large intersection calculations cannot be processed in real time. There are two alternatives: either the Jaccard similarities can be precomputed for all possible pairs of vertices, or they can be estimated. Using precomputed values for $n = 675{,}000$ accounts would require caching $n^2/2 \approx 2.3 \times 10^{11}$ floating point values, which is approximately 1TB and so not possible using commodity hardware. Therefore an estimation procedure is required.
The minhashing compression technique of Broder et al. (2000) generates unbiased estimates of the Jaccard similarity in $O(K)$, where $K$ is the number of hash functions in the signature. Each hash function approximates a two-step process: an independent permutation of the indices associated with each member of a set followed by taking the minimum value of the permuted indices. Broder et al. (2000) showed that the unbiased estimate of the Jaccard similarity is attained by exploiting that
$$J(A_i, A_j) = P\big(h_k(A_i) = h_k(A_j)\big),$$
where $h_k$, $k = 1, \dots, K$, are hash functions and $h_k(A_i)$ denotes the $k$-th minhash of the neighbour set of account $A_i$. This means the probability that any minhash function is equal for both sets is given by the Jaccard coefficient. We create a signature vector $H(A_i) = \big(h_1(A_i), \dots, h_K(A_i)\big)$, which is made of $K$ independent hashes, and calculate the Monte-Carlo Jaccard estimate as
$$\hat{J}(A_i, A_j) = \frac{I}{K} \qquad (2)$$
where we define
$$I = \sum_{k=1}^{K} I_k \qquad (3)$$
$$I_k = \begin{cases} 1 & \text{if } h_k(A_i) = h_k(A_j) \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$
As each $h_k$ is independent, $I \sim \mathrm{Binomial}(K, J)$. The estimator is fully efficient, i.e., the variance is given by the Cramér-Rao lower bound
$$\mathrm{var}(\hat{J}) = \frac{J(1 - J)}{K} \qquad (5)$$
where we have dropped the Jaccard arguments for brevity. Equation 5 shows that Jaccard coefficients can be approximated to arbitrary precision using minhash signatures with an estimation error that scales as $O(1/\sqrt{K})$.
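The Monte-Carlo estimate described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes integer vertex ids, and it stands in for the random permutations with salted 64-bit BLAKE2b digests from Python's `hashlib`, where the salt is the index of the hash function. The names `h`, `minhash_signature` and `estimate_jaccard` and the signature length of 256 are illustrative choices.

```python
import hashlib

def h(k, x):
    """k-th hash of integer vertex id x: a salted 64-bit BLAKE2b digest."""
    digest = hashlib.blake2b(x.to_bytes(8, "little"), digest_size=8,
                             salt=k.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(neighbours, length):
    """Signature of a neighbour set: the minimum of each of `length` hashes."""
    return [min(h(k, x) for x in neighbours) for k in range(length)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two accounts agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Toy neighbour sets with 200 shared ids out of 400 distinct ids (J = 0.5).
set_a, set_b = set(range(0, 300)), set(range(100, 400))
length = 256
sig_a = minhash_signature(set_a, length)
sig_b = minhash_signature(set_b, length)
estimate = estimate_jaccard(sig_a, sig_b)
std_error = (0.5 * (1 - 0.5) / length) ** 0.5  # sqrt(J (1 - J) / K), about 0.031
```

Comparing two signatures touches `length` integers regardless of how large the original neighbour sets were, which is the source of the speed-up; doubling the precision requires a signature four times as long.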
The memory requirement of minhash signatures is $Kn$ integers, and so can be configured to fit into memory; for $K = 1000$ and $n = 675{,}000$ it is only about 4 GB. In comparison to calculating Jaccard similarities of the largest 675,000 Twitter accounts with neighbours minhashing reduces expected processing times by a factor of and storage space by a factor of . (Our method allows new accounts to be added quickly by simply calculating one additional minhash signature, without needing to add the pairwise similarity to all other accounts.)
Efficient Generation of Minhash Signatures
Minhash signatures allow for rapid estimation of the Jaccard similarities. However, care must be taken when implementing minhash generation. Calculation of the signatures is expensive: Algorithm 1 requires computations, where is the number of neighbours, is the average outdegree of each neighbour and is the length of the signature. For our Twitter data these values are , , . A naive implementation can run for several days. We have an efficient implementation that takes one hour allowing signatures to be regenerated overnight without affecting operational use (See Appendix A).
Locality Sensitive Hashing (LSH)
Calculating Jaccard similarities based on minhash signatures instead of full adjacency lists provides tremendous benefits in both space and time complexity. However, finding near neighbours of the input seeds is an onerous task. For a set of 100 seeds and our Twitter data set, nearly 70 million minhash signature comparisons would need to be performed, which dominates the run time. Locality Sensitive Hashing (LSH) is an efficient system for finding approximate near neighbours (Indyk and Motwani, 1998).
LSH works by partitioning the data space. Any two points that fall inside the same partition are regarded as similar. Multiple independent partitions are considered, which are invoked by a set of hash functions. LSH has an elegant formulation when combined with minhash signatures for near neighbour queries in Jaccard space. The minhash signatures are divided into bands containing fixed numbers of hash values and LSH exploits that similar minhash signatures are likely to have identical bands. An LSH table can then be constructed that points from each account to all accounts that have at least one identical minhash band. We apply LSH to every input seed independently to find all candidates that are ‘near’ to at least one seed. In our implementation, we use 500 bands, each containing two hashes. As most accounts share no neighbours, the LSH step dramatically reduces the number of candidate accounts and the algorithm runtime by a factor of roughly 100. LSH is essential for the realtime capability of our system.
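The banding scheme can be sketched as follows. This is an illustrative outline under stated assumptions, not the deployed implementation: the function names are hypothetical, the toy signatures have only four hashes, and `collision_probability` uses the standard banding formula, which assumes that hash agreements are independent.

```python
from collections import defaultdict


def band_keys(signature, rows_per_band):
    """Yield one hashable key per band: (band index, tuple of hash values)."""
    for start in range(0, len(signature), rows_per_band):
        yield start // rows_per_band, tuple(signature[start:start + rows_per_band])


def build_lsh_table(signatures, rows_per_band):
    """Map every band key to the set of accounts whose signature contains it."""
    table = defaultdict(set)
    for account, signature in signatures.items():
        for key in band_keys(signature, rows_per_band):
            table[key].add(account)
    return table


def query_candidates(seeds, signatures, table, rows_per_band):
    """Accounts sharing at least one identical band with at least one seed."""
    candidates = set()
    for seed in seeds:
        for key in band_keys(signatures[seed], rows_per_band):
            candidates |= table.get(key, set())
    return candidates - set(seeds)


def collision_probability(jaccard, bands, rows_per_band):
    """Chance that two signatures agree on at least one whole band."""
    return 1.0 - (1.0 - jaccard ** rows_per_band) ** bands


# Toy signatures of length 4, split into 2 bands of 2 hashes each.
toy = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 7, 7, 7], "d": [5, 6, 3, 4]}
table = build_lsh_table(toy, rows_per_band=2)
found = query_candidates(["a"], toy, table, rows_per_band=2)
print(sorted(found))
```

With 500 bands of two hashes, `collision_probability(0.1, 500, 2)` is above 0.99 while `collision_probability(0.01, 500, 2)` is below 0.06, which is consistent with LSH discarding most unrelated accounts while keeping moderately similar ones.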
Sorting Similarities
LSH produces a set of candidate accounts that are related to at least one of the input seeds. In general, we do not want every candidate returned by LSH. Therefore, we select the subset of candidates that are most associated with the whole seed set. We experimented with two sequential ranking schemes: Minhash Similarity (MS) and Agglomerative Clustering (AC). The rankings can best be understood through the Jaccard distance , which is used to define the centre of any set of vertices. At each step AC and MS augment the results set with the closest account to . However, MS uses a constant value of based on the input seeds while AC updates after each step. Formally, the centre of the input vertices used for MS is defined by
(6) 
At each iteration of Algorithm 2 and are updated by first setting and then adding the closest account given by
(7) 
leading to
The new centre is most efficiently calculated using the recursive online update equation
(8) 
where is the size of .
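The difference between the two ranking schemes can be sketched in a few lines. The sketch rests on an assumption that is not confirmed by the text above: the distance from a candidate to the centre is taken to be the mean Jaccard distance from the candidate to the current members, so that the online update is a running-mean update. The names `rank_ms`, `rank_ac` and `centre_distances` are hypothetical, and exact sets stand in for minhash estimates.

```python
def jaccard_distance(set_a, set_b):
    """Jaccard distance 1 - J between two neighbour sets."""
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def centre_distances(members, candidates, dist):
    """Assumed centre distance: mean distance from a candidate to all members."""
    return {c: sum(dist(c, m) for m in members) / len(members) for c in candidates}


def rank_ms(seeds, candidates, dist, top_n):
    """MS: the centre is fixed by the seeds, so a single sort suffices."""
    to_centre = centre_distances(seeds, candidates, dist)
    return sorted(candidates, key=to_centre.get)[:top_n]


def rank_ac(seeds, candidates, dist, top_n):
    """AC: after each inclusion the centre moves towards the added account."""
    to_centre = centre_distances(seeds, candidates, dist)
    size = len(seeds)
    remaining = set(candidates)
    ranked = []
    while remaining and len(ranked) < top_n:
        best = min(sorted(remaining), key=to_centre.get)
        remaining.remove(best)
        ranked.append(best)
        size += 1
        for c in remaining:
            # Running-mean update: fold dist(c, best) into the stored average.
            to_centre[c] += (dist(c, best) - to_centre[c]) / size
    return ranked


# Toy neighbourhoods; in practice distances would come from minhash estimates.
neigh = {
    "s1": {1, 2, 3, 4}, "s2": {1, 2, 3, 5},
    "x": {1, 2, 3, 6}, "y": {6, 7, 8, 9}, "z": {1, 2, 6, 7},
}


def dist(u, v):
    return jaccard_distance(neigh[u], neigh[v])


ms_order = rank_ms(["s1", "s2"], ["x", "y", "z"], dist, top_n=3)
ac_order = rank_ac(["s1", "s2"], ["x", "y", "z"], dist, top_n=3)
print(ms_order, ac_order)
```

MS needs one pass over the candidates, whereas AC revisits every remaining candidate after each inclusion, which is one plausible reason for the difference in run times between the two procedures.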
4.1.3 Stopping Criterion
Both AC and MS are sequential processes and will return every candidate account unless a stopping criterion is applied. Many stopping criteria have been used to terminate seed expansion processes. The simplest method is to terminate after a fixed number of inclusions. Alternative methods use local maxima in modularity (Lancichinetti et al., 2009) and conductance (Leskovec et al., 2010).
An application of our work is to help define an optimal set of celebrities to endorse a brand. In this context we want to answer questions like: “What is the smallest set of athletes that have influence on over half of the users of Twitter?”. We refer to the number of unique neighbours of a set of accounts as the coverage of that set. An exact solution to this problem is combinatorial and requires calculating large numbers of unions over very large sets. However, it can be efficiently approximated using minhash signatures. We exploit two properties of minhash signatures to do this: the unbiased Jaccard estimate through Equation 2, and the fact that the minhash signature of the union of two sets is the element-wise minimum of their respective minhash signatures. Minhash signatures allow coverage to be used as a stopping criterion to rank LSH candidates without losing real-time performance.
Efficient Coverage Computation
The coverage is given by
$$|N(C)| = \Big|\bigcup_{A \in C} N(A)\Big|, \tag{9}$$
the number of unique neighbours of the output vertices $C$. Every time a new account $A$ is added we need to calculate $|N(C \cup A)|$ to update the coverage. This is a large union operation and expensive to perform on each addition. Lemma 1 allows us to rephrase this expensive computation equivalently by using the Jaccard coefficient (available cheaply via the minhash signatures), which we subsequently use for a real-time iterative algorithm.
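Lemma 1
For a community $C$ and a new account $A$, the number of neighbours of the union $C \cup A$ is given as
$$|N(C \cup A)| = \frac{|N(A)| + |N(C)|}{1 + J(A,C)}. \tag{10}$$
Proof Following (1), the Jaccard coefficient of a new account $A$ and the community $C$ is
$$J(A,C) = \frac{|N(A) \cap N(C)|}{|N(A) \cup N(C)|}. \tag{11}$$
By considering the Venn diagram and utilising the inclusion-exclusion principle, we obtain
$$|N(C \cup A)| = |N(A) \cup N(C)| = |N(A)| + |N(C)| - |N(A) \cap N(C)|. \tag{12}$$
Substituting this expression into the Jaccard coefficient in (11) yields
$$J(A,C) = \frac{|N(A)| + |N(C)| - |N(C \cup A)|}{|N(C \cup A)|} = \frac{|N(A)| + |N(C)|}{|N(C \cup A)|} - 1,$$
which, rearranged for $|N(C \cup A)|$, proves (10) and the Lemma.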
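Lemma 2
A community $C \cup A$ can be represented by a minhash signature $H(C \cup A) = \big(h_1(C \cup A), \dots, h_K(C \cup A)\big)$ where
$$h_k(C \cup A) = \min\big(h_k(C), h_k(A)\big), \quad k = 1, \dots, K. \tag{13}$$
Proof A minhash signature is composed of $K$ independent minhash functions, each of which is a compound function made up of a general mapping and a minimum operation,
$$h_k(A) = \min_{x \in N(A)} g_k(x),$$
where $g_k$ denotes the general mapping applied to each neighbour $x$, and so
$$h_k(C \cup A) = \min_{x \in N(C) \cup N(A)} g_k(x) \tag{14}$$
$$= \min\Big(\min_{x \in N(C)} g_k(x),\ \min_{x \in N(A)} g_k(x)\Big) = \min\big(h_k(C), h_k(A)\big), \tag{15}$$
which proves (13) and the Lemma.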
We use Lemma 1 to update the unique neighbour count. Once the next account $A$ to add to the community $C$ is determined according to (7), the coverage becomes
$$|N(C \cup A)| = \frac{|N(C)| + |N(A)|}{1 + J(A,C)}. \tag{16}$$
The right hand side of (16) contains three terms: $|N(C)|$ is what we started with, $|N(A)|$ is the neighbour count of $A$, which is easily obtained from Twitter or Facebook metadata, and $J(A,C)$ is a Jaccard calculation between a community and an account. The minhash signature of a community is obtained via (13) and so we are able to calculate the coverage with negligible additional computational overhead.
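The two lemmas can be checked on a toy example. The sketch below is illustrative only: the four-function hash family in `PARAMS` and all function names are assumptions made for this example, and the exact Jaccard coefficient is used where the system would use the minhash estimate.

```python
PRIME = 2_147_483_647  # assumption: prime modulus for a toy hash family
PARAMS = [(3, 7), (5, 11), (7, 13), (11, 17)]  # assumption: toy (a, b) pairs


def signature(neighbours):
    """Toy minhash signature: minimum of (a*x + b) mod PRIME per hash function."""
    return [min((a * x + b) % PRIME for x in neighbours) for a, b in PARAMS]


def merge_signatures(sig_c, sig_a):
    """Lemma 2: the signature of a union is the element-wise minimum."""
    return [min(u, v) for u, v in zip(sig_c, sig_a)]


def updated_coverage(coverage_c, neighbours_a, jaccard_ac):
    """Lemma 1: |N(C u A)| = (|N(C)| + |N(A)|) / (1 + J(A, C))."""
    return (coverage_c + neighbours_a) / (1.0 + jaccard_ac)


n_c = set(range(1, 7))    # neighbours of the community: {1, ..., 6}
n_a = set(range(5, 11))   # neighbours of the new account: {5, ..., 10}

exact_j = len(n_c & n_a) / len(n_c | n_a)           # 2 shared out of 10
coverage = updated_coverage(len(n_c), len(n_a), exact_j)
merged = merge_signatures(signature(n_c), signature(n_a))

print(coverage, merged == signature(n_c | n_a))
```

In an iterative expansion, `merged` would become the community signature for the next step and `coverage` the running neighbour count, so the full neighbour sets never need to be materialised.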
4.2 Stage 2: Community Detection and Visualisation
Stage one expanded the seed accounts to find the related region. This was done by first finding a large group of candidates using LSH that were related to any one of the seeds and then filtering down to the accounts most associated to the whole seed set.
In Stage two, the vertices returned by Stage one are used to construct a weighted Jaccard similarity graph. Figure 2 depicts the process of transforming from the original unweighted graph to the weighted graph. The red vertices are those returned by stage one. Edge weights are calculated for all pairwise associations from the minhash signatures through Equation 2. This process effectively embeds the original graph in a metric Jaccard space (Broder, 1997). Community detection is run on the weighted graph.
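Constructing the weighted graph from signatures can be sketched as below. This is a minimal illustration with hypothetical names and toy four-hash signatures. The optional `threshold` parameter is an assumption added for the example, and the community detection step itself is not shown.

```python
from itertools import combinations


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(1 for u, v in zip(sig_a, sig_b) if u == v) / len(sig_a)


def similarity_graph(signatures, threshold=0.0):
    """Weighted edges {(u, v): estimated Jaccard} over all account pairs."""
    edges = {}
    for u, v in combinations(sorted(signatures), 2):
        weight = estimate_jaccard(signatures[u], signatures[v])
        if weight > threshold:  # drop pairs at or below the threshold
            edges[(u, v)] = weight
    return edges


toy_signatures = {"a": [1, 2, 3, 4], "b": [1, 2, 3, 9], "c": [7, 8, 3, 9]}
edges = similarity_graph(toy_signatures)
print(edges)
```

The resulting edge dictionary can be handed to any community detection routine that accepts weighted edges.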
The final element of the process is to visualise the community structure and association strengths in the region of the input seeds. We experimented with several global community detection algorithms. These included INFOMAP, Label Propagation, various spectral methods and Modularity Maximisation (Rosvall and Bergstrom, 2008; Raghavan et al., 2007; Newman, 2006, 2004b). The Jaccard similarity graph is weighted and almost fully connected and most community detection algorithms are designed for binary sparse graphs. As a result, all methods with the exception of label propagation and WALKTRAP were too slow for our use case. Label Propagation had a tendency to select a single giant cluster, thus adding no useful information. Therefore, we chose WALKTRAP for community visualisation.
5 GroundTruth Communities
To provide a quantitative assessment of our method we require groundtruth labelled communities. No groundtruth exists for the data sets of interest and so in this section we provide a methodology for generating groundtruth. This methodology itself must be verified and we provide an extensive evaluation of the quality of the derived groundtruth based on the axiomatic definitions described in Yang and Leskovec (2012). Most community detection algorithms (including ours) are based on the structure of the graph (Fortunato and Barthelemy, 2007). Axiomatically, good community structures are:

Compact

Densely interconnected

Well separated from the rest of the network

Internally homogeneous
However, while communities are detected using these properties, verification typically requires associating each vertex with some functional attributes, e.g., fans of Arsenal football club or Python programmers, and showing that the discovered communities group attributes together (Yang and Leskovec, 2012). The practice of relating community membership with personal attributes is justified by the homophily principle of social networks (McPherson et al., 2001), which states that people with similar attributes are more likely to be connected. We reverse the process of verification by generating ground-truth from personal attributes. To generate attributes we match Twitter accounts with Wikipedia pages and associate Wikipedia tags with each Twitter account. Wikipedia tags give hierarchical functions like ‘football:sportsperson:sport’ and ‘pop:musician:music’. It is not possible to match every Twitter account and our matching process discovered 127 tags that occur more than 100 times in the data. Of these, many were clearly too vague to be useful, such as ‘news:media’ or ‘Product Brand:Brands’. We selected 16 tags that had relatively high frequencies in the data set and evaluated 7 metrics for each that are related to the four axioms. These results are shown in Table 3. Separability and conductance measure how well separated a community is from the rest of the graph. Density and size measure the compactness and density. Cohesiveness, clustering and conductance ratio measure how internally homogeneous a community is. The mathematical formulation of these metrics and details of how they were calculated are provided in Appendix B. Table 3 is sorted by density and the bold rows are visualised in Figures 4, 5, 6 and 7. The density is the most important factor distinguishing good from bad communities, varying by more than an order of magnitude across the data.
This is followed by how well separated (separability) the community is from the rest of the network, which is inversely correlated with conductance by design (See Equations 21 and 25). High clustering is also a useful indicator of community goodness for the best communities, but is less useful for separating communities that are made up of many subunits like team sports from very bad communities like Food and Drink. Cohesiveness is generally not useful as most communities contain at least one well separated subunit.
Community  |  Size  |  Clustering  |  Cohesiveness  |  Conductance  |  CR  |  Density  |  Separability  |

Mixed Martial Arts  |  751  |  6.49E-02  |  4.29E-01  |  5.10E-01  |  1.19  |  3.06E-02  |  4.80E-01  |
Adult Actors  |  352  |  7.20E-02  |  1.29E-01  |  7.70E-01  |  5.98  |  2.94E-02  |  1.50E-01  |
Cycling  |  371  |  6.43E-02  |  4.51E-01  |  7.04E-01  |  1.56  |  2.50E-02  |  2.11E-01  |
Baseball  |  616  |  3.64E-02  |  1.49E-01  |  7.87E-01  |  5.29  |  1.63E-02  |  1.35E-01  |
Basketball  |  786  |  3.84E-02  |  3.30E-01  |  7.71E-01  |  2.34  |  1.60E-02  |  1.48E-01  |
American Football  |  1295  |  2.24E-02  |  3.82E-01  |  7.40E-01  |  1.94  |  9.33E-03  |  1.75E-01  |
Athletics  |  530  |  3.48E-02  |  4.13E-01  |  8.47E-01  |  2.05  |  8.21E-03  |  9.01E-02  |
Hotel Brand  |  836  |  2.20E-02  |  4.53E-01  |  8.37E-01  |  1.85  |  6.16E-03  |  9.71E-02  |
Airline  |  363  |  2.30E-02  |  4.41E-01  |  9.46E-01  |  2.15  |  4.35E-03  |  2.84E-02  |
Cosmetics  |  332  |  3.34E-02  |  4.87E-01  |  9.56E-01  |  1.96  |  3.55E-03  |  2.32E-02  |
Football  |  4111  |  3.69E-02  |  3.95E-01  |  7.07E-01  |  1.79  |  2.93E-03  |  2.07E-01  |
Alcohol  |  388  |  1.72E-02  |  2.34E-01  |  9.52E-01  |  4.06  |  2.66E-03  |  2.53E-02  |
Travel  |  2038  |  1.27E-02  |  4.25E-01  |  8.29E-01  |  1.95  |  2.50E-03  |  1.03E-01  |
Model  |  2096  |  2.62E-02  |  4.04E-01  |  9.01E-01  |  2.23  |  1.90E-03  |  5.50E-02  |
Electronics  |  689  |  1.40E-02  |  4.38E-01  |  9.75E-01  |  2.23  |  8.78E-04  |  1.30E-03  |
Food and Drink  |  2974  |  1.76E-02  |  4.57E-01  |  9.06E-01  |  1.98  |  7.69E-04  |  5.18E-02  |
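Figure 3 (dendrograms of the ground-truth communities), panel groups in order: (1) industrial groups with small highly connected groups due to sub-brands; (2) industrial groups with limited interaction; (3) strongly connected communities, with subcommunities mostly due to nationality; (4) team sports with many highly connected subgroups.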
To establish a clearer view of the density and homogeneity of the groundtruth we visualise the communities using network diagrams and dendrograms. Network diagrams are generated in Gephi (Bastian et al., 2009). The layout uses the Force Atlas 2 algorithm. Colours indicate clusters generated using Gephi’s modularity optimisation routine. The node (and label) sizes indicate the weighted degree of each node and are scaled to be between 5 and 20 pixels. The network diagrams reveal any substructure present within the groundtruth. They contain too much information to easily see the individual accounts and so we magnify small subregions and display Twitter profile images for accounts within them. A weakness of the network diagrams is that different edge weights are hard to perceive. To provide a visual representation of the general strength of interaction we generated dendrograms (Figure 3) for each groundtruth community. Dendrograms are agglomerative: All accounts with a Jaccard distance less than the value are fused together into a supernode. Any subgroups containing more than 10 nodes with no two nodes separated by a Jaccard distance greater than 0.85 have been coloured to indicate subcommunities.
Figure 4 shows the Mixed Martial Arts (MMA) community. From Table 3 we see that this community is densely connected, strongly clustered and very well separated from the rest of the network. The black region in Dendrogram 2(i) is a massive cluster where the distance between any two nodes is less than 0.8. It depicts MMA fighters, mostly fighting in the Ultimate Fighting Championship (UFC). There is a single well separated subcommunity, which is magnified in Figure 4 showing Olympic judo fighters. MMA is the best community in our study. The Cycling, Adult Actor and Athletics communities are similar in structure to MMA (See Table 3).
Figure 7 shows that the basketball community (largely NBA players) exhibits two large communities (the two NBA conferences). The individual team structure within the divisions is apparent from the fine banding in Figure 2(o) where many well-connected subclusters, each with a distance of less than 0.85 between all pairs of nodes, are visible. We have magnified a small disconnected region of Figure 7, which shows players of the Women’s National Basketball Association (WNBA). Baseball, football and American football exhibit similar structural properties.
Figure 5 shows that the industry is split into four major groups representing the different classes of alcoholic drink (wine, beer, cider and spirits). We have magnified a region of the network that contains mostly English craft ciders. Dendrogram 2(g) shows that the alcohol network is mostly poorly connected with only two coloured regions indicating well connected subcommunities. From Table 3 it can be seen that the alcohol network exhibits a low link density and separability, indicating that the community lacks distinction from the rest of the network. This is a consistent pattern for communities drawn from industrial segmentations.
Figure 6 shows an example of the final group of ground truth communities: industrial groups with prominent subcommunities. In this case the major subcommunities are the Four Seasons Hotel group and hotels located in Las Vegas (magnified). Dendrogram 2(c) shows that while the hotel network is generally poorly connected, there are sizeable highly interconnected subcommunities. Table 3 shows that the hotel community has low clustering as most accounts are disconnected, high cohesiveness as there are well connected subgroups and a low Conductance Ratio. The travel, airlines and cosmetics communities all share these traits.
In summary, we identify four groups of ground-truth communities and evaluate their quality based on the four axioms. We find that the groups differ greatly in quality. The group containing mixed martial arts, cycling, athletics and adult actors satisfies the four axioms and forms a good set of ground-truth for algorithm evaluation. The group comprising team sports (American football, baseball, basketball and football) satisfies three of the four axioms (they are not homogeneous). The remaining communities only contain subgroups that satisfy any of the axioms.
6 Experimental Evaluation
Our approach to real-time community detection relies on two approximations: minhashing for rapid Jaccard estimation and locality sensitive hashing to provide a fast query mechanism on top of minhashing. We assess the effect of these approximations, and demonstrate the quality of our results in three experiments: (1) We measure the sensitivity of the Jaccard similarity estimates with respect to the number of hash functions used to generate the signatures. This will justify the use of the minhash approximation for computing approximate Jaccard similarities. (2) We compare the run time and recall of our process on ground-truth communities against the Personalised PageRank (PPR) algorithm (state of the art) on a single laptop. (3) We visualise detected communities and demonstrate that association maps for social networks using minhashing and LSH produce intuitively interpretable maps of the Twitter and Facebook graphs in real time on a single machine.
6.1 Experiment 1: Assessing the Quality of Jaccard Estimates
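Twitter handle  |  True Jaccard  |  Rank  |  Estimate (100 hashes)  |  Rank  |  Estimate (1,000 hashes)  |  Rank  |

adidas  |  0.261  |  1  |  0.22  |  2  |  0.265  |  1  |
nikestore  |  0.246  |  2  |  0.25  |  1  |  0.255  |  2  |
adidasoriginals  |  0.200  |  3  |  0.18  |  3  |  0.222  |  3  |
Jumpman23  |  0.172  |  4  |  0.13  |  7  |  0.166  |  4  |
nikesportswear  |  0.147  |  5  |  0.18  |  4  |  0.137  |  5  |
nikebasketball  |  0.144  |  6  |  0.16  |  5  |  0.127  |  7  |
PUMA  |  0.132  |  7  |  0.13  |  6  |  0.132  |  6  |
nikefootball  |  0.127  |  8  |  0.08  |  17  |  0.110  |  9  |
adidasfootball  |  0.112  |  9  |  0.09  |  16  |  0.113  |  8  |
footlocker  |  0.096  |  10  |  0.08  |  17  |  0.096  |  11  |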
We empirically evaluate the minhash estimation error using a sample of 400,000 similarities taken from the 250 billion pairwise relationships between the Twitter accounts in our study. We compare estimates using Equation 2 to exact Jaccards obtained by exhaustive calculations on the full sets using Equation 1. Figure 8 shows the estimation error (L1 norm) as a function of the number of hashes comprising the minhash signature. Standard error bars are just visible up until 400 hashes. The graph shows an expected error in the Jaccard of just 0.001 at 1,000 hashes. The high degree of accuracy and diminishing improvements at this point led us to select a signature length of 1,000 hashes. This value provides an appropriate balance between accuracy and performance (both runtime and memory scale linearly with the signature length).
A top-ten list of Jaccard similarities is given in Table 4 for the Nike Twitter account (based on the true Jaccard). Possible matches include sports people, musicians, actors, politicians, educational institutions, media platforms and businesses from all sectors of the economy. Of these, our approach identified four of Nike’s biggest competitors, five Nike sub-brands and a major retailer of Nike products as the most associated accounts. This is consistent with our assertion that the Jaccard similarity of neighbourhood sets provides a robust similarity measure between accounts. We found similar trends throughout the data and this is consistent with the experience of analysts at Starcount, a London based social media analytics company, who are using the tool. Table 4 also shows how the size of the minhash signature affects the Jaccard estimate and the corresponding rank of similar accounts. Local community detection algorithms add accounts in similarity order. Therefore, approximating the true ordering is an important property. We measure the Spearman rank correlation between the true Jaccard similarities and those calculated from signatures of length 100 and 1,000 (the three value columns of Table 4, in that order) to be 0.89 and 0.97 respectively. The close correspondence of the rank vector using signatures of length 1,000 and the true rank supports our decision to use signatures containing 1,000 hashes.
6.2 Experiment 2: Comparison of Community Detection with PPR
In experiment 2 we move from assessing a single component (minhashing) to a system-wide evaluation: we evaluate the ability of our algorithm to detect related entities by measuring its performance as a local community detection algorithm seeded with members of the ground-truth communities listed in Table 3. As a baseline for comparison we use the PPR algorithm, which is considered to be the state of the art for this problem (Kloumann and Kleinberg, 2014). It is impossible to provide a fully like-for-like comparison with PPR: running PPR on the full graph (700 million vertices and 20 billion edges) that we extract features from requires cluster computing and could return results outside of the accounts we considered. The alternative is to restrict PPR to run on the directly observed network of the 675,000 largest Twitter accounts, which could then be run on a single machine. We adopt this latter approach as it is the only option that meets our requirements (single machine and real-time).
In our experimentation, we randomly sampled 30 seeds from each groundtruth community. To produce MS and AC results we followed the process depicted in Figure 9: The seeds are input to an LSH query, which produces a list of candidate nearneighbours. For each candidate the Jaccard similarity is estimated using minhash signatures and sorted by either the MS or AC procedures.
We compare MS and AC to PPR operating on the directly observed network of the 675,000 largest accounts. Our PPR implementation uses the 30 seeds as the teleport set and runs for three iterations returning a ranked list of similar Twitter accounts.
In all cases, we sequentially select accounts in similarity order and measure the recall after each selection. The recall is given by
(17) 
with as the initial seed set, as the ground truth community and as the set of accounts added to the output. For a community of size we do this for the most similar accounts so that a perfect system could achieve a recall of one.
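The evaluation loop can be sketched as follows. Two assumptions are made here and should be checked against Equation 17: recall is measured over community members that are not seeds, and the area under the curve is taken as the mean recall over the selections. The function names are hypothetical.

```python
def recall_curve(ranked, seeds, truth):
    """Recall after each selection, measured on non-seed community members."""
    targets = set(truth) - set(seeds)
    hits = 0
    curve = []
    for account in ranked:
        if account in targets:
            hits += 1
        curve.append(hits / len(targets))
    return curve


def area_under_curve(curve):
    """Assumed normalisation: mean recall over all selections, in [0, 1]."""
    return sum(curve) / len(curve)


truth = {"a", "b", "c", "d", "e"}   # toy ground-truth community
seeds = {"a"}                        # toy seed set
ranked = ["b", "x", "c", "d"]        # toy output, most similar first

curve = recall_curve(ranked, seeds, truth)
auc = area_under_curve(curve)
print(curve, auc)
```

Selecting exactly as many accounts as there are non-seed members means a perfect ranking ends with a recall of one.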
Tags  PPR  MS  AC 

travel  0.186  0.240  0.230 
airline  0.040  0.151  0.180 
hotel brand  0.160  0.294  0.285 
cosmetics  0.055  0.086  0.143 
food and drink  0.072  0.099  0.082 
electronics  0.035  0.069  0.059 
alcohol  0.069  0.199  0.229 
model  0.078  0.110  0.109 
mixed martial arts  0.317  0.363  0.386 
cycling  0.278  0.330  0.445 
athletics  0.219  0.285  0.365 
adult actor  0.269  0.347  0.397 
american football  0.240  0.371  0.240 
baseball  0.203  0.379  0.378 
basketball  0.252  0.380  0.353 
football  0.202  0.233  0.212 
The results of this experiment are shown in Figure 10 with the Area Under the Curves (AUC) given in Table 5. Bold entries in Table 5 indicate the best performing method. In every case MS outperforms PPR, and AC matches or outperforms PPR.
Figure 10 shows standard errors over five randomly chosen input sets of 30 accounts from each ground-truth community. The confidence bounds are tight, indicating that the methods are robust to the choice of input seeds. Figure 10 is grouped to correspond to the dendrograms in Figure 3. Performance of all methods is considerably affected by the quality of the communities. Communities with good values of the metrics given in Table 3 in general have superior recall across all methods. The third row of Figure 10 contains the best communities as measured by the metrics in Table 3. For this group recalls are as high as 80% (Cycling, AC). The worst group of communities are the transnational industrial communities in the second row. The lowest recall in row three (Athletics, PPR) is still higher than the highest recall in the second row of results (Alcohol, AC). The best performing method for every community in row three of the results is AC. This is because AC is an adaptive method that can incorporate information from early results. The downside of an adaptive method is that pollution from false positives can rapidly degrade performance. This can be seen in the step decrease in gradient of the AC curves for basketball, baseball and adult actors. The fourth row of the table contains team sports. Team sports also have good metrics in Table 3, but differ markedly in structure from the communities in row three. The communities in row four have well defined multimodal substructures generated by the different teams. Both AC and MS are unimodal procedures that store the centre of a set of data points. For a multimodal distribution the mean may not be particularly close to the distribution and so false positives will occur. As AC incorporates false positives into the estimation procedure for all future results, MS outperforms AC for all team sport communities. Of the communities in the first and second rows of Figure 10 AC is best performing in four and MS is best performing in four.
These communities are all diffuse, but some have a single densely connected region that can be found well by AC.
Much of the difference in performance of these methods derives from their respective ability to explore the graph: PPR is really a global algorithm that has been modified to find local relationships. After three iterations PPR uses first-, second- and third-order connections. First-order connection methods just use edges that directly connect to the seed nodes (neighbours). Second-order methods also give weight to the connections of the first-order nodes (neighbours of neighbours) and so on for third-order connections. The ability to explore higher-order connections is the principal reason identified by Kloumann and Kleinberg (2014) for the state-of-the-art performance of PPR. They also note that after two iterations most of the benefit is realised and that after three iterations there is no more improvement. Our implementations of MS and AC are effectively second-order methods as they operate on a derived graph where the edge weight between two vertices is calculated from the overlap of the respective neighbourhoods. MS and AC outperform PPR because they are based on many more second-order connections as they run on a compressed version of the full graph instead of a subgraph. PPR is expected to perform better given more computational resources, but the additional complexity, run time, latency or financial cost required for any scaled up/out solution would violate our system constraints.
Table 6 gives the mean and standard deviation of the run times averaged over the 16 communities.
Method  Mean(s)  Std.Dev. 

PPR  12.58  8.83 
MS  0.23  0.08 
AC  18.6  22.0 
MS is the fastest method by a factor of more than 50. Average human reaction times are approximately a quarter of a second and so MS delivers a real-time user experience (Hewett et al., 1992). As MS is the only method capable of operating in the real-time domain and this is a system requirement, we choose the MS procedure for experiment 3 and in our operational prototype.
6.3 Experiment 3: RealTime Graph Analysis and Visualisation
In the following, we provide example applications of our system to graph analysis. Users need only input a set of seeds, wait a quarter of a second and the system discovers the structure of the graph in the region of the seeds. Users can then iterate the input seeds based on what previous outputs reveal about the graph structure. Figure 12 shows results on the Facebook Page Engagements network while Figures 13 and 14 use the Twitter Followers graph. Each diagram is generated by the procedure shown in Figure 11: Seeds are passed to the MS process, which returns the 100 most related entities. All pairwise Jaccard estimates are then calculated using the minhash signatures and the resulting weighted adjacency matrix is passed to the WALKTRAP global community detection algorithm. The result is a weighted graph with community affiliations for each vertex. In our visualisations we use the Force Atlas 2 algorithm to lay out the vertices. The thickness of the edges between vertices represents the pairwise Jaccard similarity, which has been thresholded for image clarity. The vertex size represents the weighted degree of the vertex, but is logarithmically scaled to be between 1 and 50 pixels. The vertex colours depict the different communities found by the WALKTRAP community detection algorithm.
We show some results using the Facebook Pages engagement graph to demonstrate that our work is broadly applicable across digital social networks. However there are some key differences between the Facebook Pages engagement graph and the Twitter Followers graph. As Following is the method used to subscribe to a Twitter feed, Follows tend to represent genuine interest. In contrast Facebook engagement is often used to grant approval or because a user desires an association. In addition, the Twitter graph corresponds to actions occurring as far back as 2006 (relatively few edges are ever deleted), while the Facebook graph corresponds only to events since 2014, when we began collecting data. As a result the Twitter data set contains significantly more data, but with less relevance to current events.
Our work uses the vast scale and richness of social media data to provide insights into a broad range of questions. Here are some illustrative examples:

How would you describe the factions and relationships within the US Republican party? This is a question with a major temporal component, and so we use the Facebook Pages graph. We feed “Donald Trump”, “Marco Rubio”, “Ted Cruz”, “Ben Carson” and “Jeb Bush” as seeds into the system and wait for Figure 11(a), which shows a densely connected core group of active politicians with Donald Trump at the periphery surrounded by a largely disconnected set of right-wing interest bodies.

Which factions exist in global pop music? We feed the seeds “Justin Bieber”, “Lady Gaga” and “Katy Perry” into the system loaded with the Facebook Pages engagement graph and wait for Figure 11(b), which shows that the industry forms communities that group genders and races.

How are the major social networks used? We feed the seeds “Twitter”, “Facebook”, “YouTube” and “Instagram” into the system loaded with the Twitter Followers graph and wait for Figure 12(a), which shows that Google is highly associated with other technology brands while Instagram is closely related to celebrity and YouTube and Facebook are linked to sports and politics.

How is the brand RedBull perceived by Twitter users? We feed the single seed “RedBull” into the Twitter Followers graph and wait for Figure 12(b), which shows that RedBull has strong associations with motor racing, sports drinks, extreme sports, gaming and football.

How does sports brand marketing differ between the USA and Europe? We use the Twitter Followers graph. “Adidas” and “Puma” are the seeds for the European brands while “Nike”, “Reebok”, “UnderArmour” and “Dicks” are used to represent the US sports brands. Figures 13(a) and 13(b) show the enormous importance of football (soccer) to European sports brands, whereas US sports brands are associated with a broad range of sports including hunting, NFL, basketball, baseball and mixed martial arts (MMA).
In all cases, the user selects a group of seeds (or a single seed) and runs the system, which returns a figure and a table of community memberships in a fraction of a second. Analysts can then use the results to supplement the seed list with new entities or use the table of community members from a single WALKTRAP subcommunity to explore higher resolution.
Similar tasks are traditionally conducted with expensive and difficult to scale techniques, such as telephone polling and focus groups, which often take months to return results. In contrast, we are able to produce an automatic analysis in a fraction of a second and at minimal cost, which allows for interactive community detection in large social networks.
7 Conclusion and Future Work
We have presented a real-time system to automatically detect communities in large social networks. The system is so computationally and memory efficient that it runs on a standard laptop. This work represents a technical advance leading to performance gains that are useful in practice and contains a rigorous evaluation on large social media data sets. The key contributions of this article are to demonstrate that (1) using the Jaccard similarity of neighbourhood graphs provides a robust association metric between vertices of noisy social networks; (2) working with minhash signatures of the neighbourhood graph dramatically reduces the space and time requirements of the system with acceptable approximation error; (3) applying Locality Sensitive Hashing allows for approximate local community detection on very large graphs in real time with acceptable approximation error. For interactive and real-time community detection, we have demonstrated that our system finds higher quality communities in less time than the state-of-the-art algorithm operating under the constraints of a single machine. Our work has clear applications for knowledge discovery processes that currently rely upon slow and expensive manual procedures, such as focus groups and telephone polling. In general, our system offers the potential for organisations to rapidly acquire knowledge of new territories and supplies an alternative monetisation scheme for data owners.
In this article, we focussed on digital social networks, but our method is applicable to all large networks including bipartite networks. The useritem bipartite networks that are studied in the field of recommender systems would be particularly amenable to this treatment, where items could be compactly modelled as minhash signatures of the users who have purchased them.
We leave two extensions for future work. First, we treat the input social network as binary, whereas in many settings information is available to weight the edges. This might include message counts, the time since a connection was made or the type of connection. Efficient methods already exist for working with minhashes of weighted sets (Manasse and Mcsherry, 2008). Therefore, an interesting progression of this work is to incorporate data with edges that can contain counts, weights and categorisations. The second extension incorporates some of the latest developments in the theory of minhashing. b-bit minhashing and Odd Sketches provide two promising approaches to extend our system to even larger graphs (Li and König, 2009; Mitzenmacher et al., 2014). Both offer the best cost/benefit trade-off when sets are very similar (Jaccard similarity ) or when sets contain most of the elements in the sample space. DSN data typically contains sets that are very small relative to the sample space (Footnote 7: Our Twitter data has a sample space containing elements with a typical set containing elements.) and with Jaccard similarities . The strong theoretical bounds of these algorithms do not hold in these DSN-typical settings. Therefore, a cost/benefit analysis similar to Section 6.1 would be required before implementing either in an extension.
Acknowledgements
This work was partly funded by a Royal Commission for the Exhibition of 1851 Industrial Fellowship. The authors would like to thank Donal Simmie for his work on optimising the minhash generation procedure.
A Efficient Minhash Generation
A naive Python implementation for generating minhash signatures requires six days to run on a desktop computer with 6 physical (12 logical) Intel i7 5930k @3.5GHz cores and 64GB of RAM. This is prohibitive for nightly updates, so we heavily optimised this part of the code base. The code was ported to Cython, the Python-to-C bridge project, which allowed us to add type information and compiler directives to turn off array bounds checking (there is a large amount of array dereferencing). We stored the input matrices in contiguous memory and removed any superfluous code (logging, most inline error checking). The loops were then rewritten so that they could be vectorised using the processor's SIMD instructions. The fully optimised Cython version of the minhash implementation runs (in parallel on 6 cores) in approximately one hour.
A.1 Crawling Social Networks
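The idea of replacing per-element Python loops with array operations can be illustrated in NumPy rather than Cython. This is a minimal sketch, not the implementation described above: the hash family h(x) = (a·x + b) mod p, the choice of prime, the function names and the assumption that vertex IDs have been remapped to integers below 2^31 are all illustrative.

```python
import numpy as np

# Assumed Mersenne prime for the hash family h(x) = (a*x + b) mod PRIME.
# IDs must be < 2**31 so that a*x + b stays within int64.
PRIME = (1 << 31) - 1


def make_hash_params(num_hashes, seed=0):
    """Draw the (a, b) coefficients of num_hashes random linear hash functions."""
    rng = np.random.default_rng(seed)
    a = rng.integers(1, PRIME, size=num_hashes, dtype=np.int64)
    b = rng.integers(0, PRIME, size=num_hashes, dtype=np.int64)
    return a, b


def minhash_signature(neighbours, a, b):
    """Return the minhash signature of one vertex's neighbour ID set."""
    ids = np.asarray(list(neighbours), dtype=np.int64)
    if ids.size == 0:
        raise ValueError("cannot build a signature for an empty neighbourhood")
    # Broadcasting evaluates every hash on every ID in one array operation:
    # the result has shape (num_hashes, num_neighbours).
    hashed = (a[:, None] * ids[None, :] + b[:, None]) % PRIME
    # The signature keeps the minimum hash value per hash function.
    return hashed.min(axis=1)


def estimate_jaccard(sig_x, sig_y):
    """Estimate Jaccard similarity as the fraction of matching signature entries."""
    return float(np.mean(sig_x == sig_y))
```

With signatures of a few hundred entries, two neighbourhoods sharing a third of their combined members should give an estimate close to 1/3, with the error shrinking as the number of hash functions grows.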
To optimize data throughput while remaining within the DSN rate limits we developed an asynchronous distributed network crawler using Python’s Twisted library (Wysocki and Zabierowski, 2011). The crawler consists of a server responsible for token and work management and a distributed set of clients making http requests to DSNs (see Figure 15).
The server contains a credential manager that holds access tokens in a queue and monitors the number of calls to each API. Once a token has been exhausted it is put to the back of the queue and locked until its refresh time. The server communicates with the clients over TCP, responding to requests for work with account IDs and to requests for access tokens with either a fresh access token or a pause response.
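The token queue described above might be sketched as follows. The class name, the fixed per-window call limit, the single shared refresh window and the injectable clock are assumptions made for illustration; real DSN rate limits differ per endpoint and are not modelled here.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class _TokenState:
    token: str
    calls: int = 0             # calls made in the current window
    locked_until: float = 0.0  # time at which the token refreshes


class CredentialManager:
    """Hypothetical sketch of a queue of rate-limited access tokens."""

    def __init__(self, tokens, calls_per_window, window_seconds,
                 clock=time.monotonic):
        self._queue = deque(_TokenState(t) for t in tokens)
        self._limit = calls_per_window
        self._window = window_seconds
        self._clock = clock

    def acquire(self):
        """Return (token, None), or (None, seconds_to_pause) if all are locked."""
        now = self._clock()
        head = self._queue[0]
        if head.locked_until > now:
            # Exhausted tokens are queued in lock order, so a locked head
            # means every token is locked: send a pause response.
            return None, head.locked_until - now
        if head.calls >= self._limit:
            head.calls = 0  # the refresh time has passed
        head.calls += 1
        if head.calls >= self._limit:
            # Exhausted: lock until refresh and move to the back of the queue.
            head.locked_until = now + self._window
            self._queue.rotate(-1)
        return head.token, None
```

A client that receives a pause response would wait for the indicated number of seconds before asking again.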
The clients make asynchronous requests to the DSNs, handling response codes, parsing and storing data. A conventional program will block while waiting for an http response. When the principal function of a program is to download data, blocking time amounts to the vast majority of the run time. One solution is to run the program using multiple threads. However, for this application threads carry an unnecessary overhead and induce inefficiencies as data is naively moved between caches by the operating system. The asynchronous programming paradigm offers a superior alternative to explicit multithreading. Asynchronous programming makes use of an event loop that constantly listens for new jobs and does not block while waiting for http responses.
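The benefit of an event loop over blocking requests can be illustrated with a short sketch. It uses Python's standard asyncio module instead of Twisted, and the network request is simulated with a non-blocking sleep; the function names, the latency value and the returned payload are hypothetical.

```python
import asyncio
import time


async def fetch(account_id, latency):
    """Stand-in for a non-blocking HTTP request to a DSN API."""
    # Awaiting hands control back to the event loop, which can start or
    # resume other requests while this one waits for its response.
    await asyncio.sleep(latency)
    return account_id, f"profile-{account_id}"


async def crawl(account_ids, latency):
    """Issue all requests concurrently on a single thread."""
    pairs = await asyncio.gather(*(fetch(a, latency) for a in account_ids))
    return dict(pairs)


def timed_crawl(account_ids, latency=0.05):
    """Run the crawl and report the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    results = asyncio.run(crawl(account_ids, latency))
    return results, time.perf_counter() - start
```

Because the waits overlap, the elapsed time for many requests is close to the latency of a single request, whereas a blocking loop would take roughly the sum of all latencies.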
We originally implemented the system using an 80 MB shared fibre optic connection, but our downloads caused network blackouts. Therefore, we designed a distributed system that could be partially deployed in the cloud. The final system is depicted in Figure 15. The access tokens and account IDs to query (work) live on a server on our local network. Clients are deployed to Amazon’s elastic cloud, from where all interactions with DSN servers occur. We configured the clients to establish persistent connections to the API endpoints because a handshake must occur every time a connection is opened. For secure systems (communicating over HTTPS), the handshake is particularly onerous, requiring the exchange of security certificates.
B Community Axioms
Homophily only applies to attributes that ease information flow between individuals. Some attributes have no effect or are divisive (for instance, right-handed people feel no sense of kinship) and so should not be associated with communities. Additionally, attributes may be at the wrong scale to describe structural subunits (sports person rather than footballer). A community evaluation based on ground truth that did not correspond to communities would have no value. We therefore apply community goodness functions to each prospective ‘tag community’ to identify to what extent these functional traits generate structurally observable communities.
For each functional group we generate the fully connected weighted graph by calculating all pairwise Jaccard similarities and evaluate the six metrics in Table 3. They are adapted from Yang and Leskovec (2013) to apply to weighted graphs. As we work with a derived graph where each edge weight is the Jaccard similarity of neighbourhoods, the metrics have slightly different interpretations. Two entities in the derived graph are strongly connected if they have very similar neighbourhoods. Since for the large entities the neighbourhood normally has at least an order of magnitude more incoming than outgoing edges, entities are closely related if they have a similar fan/follower base. We define $C$ to be the set of vertices comprising a community and $G = (V, E, W)$ a weighted graph where $W$ is a weight matrix. The internal edge weight of $C$ is
$$m_C = \sum_{i, j \in C,\; i < j} W_{ij} \tag{18}$$
and the weight of edges that cross the boundary of $C$ is
$$c_C = \sum_{i \in C,\; j \notin C} W_{ij}. \tag{19}$$
The community goodness metrics are then given by:

Clustering exploits the idea that people in communities are likely to introduce their friends to each other. It measures how cliquey a community is. In our paradigm clustering is high if followers of a community recommend things for other followers of the community to like or follow. If a vertex has $k$ neighbours then $k(k-1)/2$ possible connections can exist between the neighbours. The clustering of a node gives the fraction of its neighbours’ possible connections that exist. The clustering of a community is the average clustering of each vertex. Clustering is sometimes referred to as the proportion of triadic closures in the network. The weighted clustering of the vertex $i$ is given by
$$c(i) = \frac{(W^3)_{ii}}{(W \, W_{\max} \, W)_{ii}}, \tag{20}$$
where $W_{\max}$ is a matrix where each entry is the maximum weight found within $W$ (Holme et al., 2007).

Conductance is an electrical analogy for how easily information entering the community can leave it. In our context, it is defined as
$$\phi(C) = \frac{c_C}{2 m_C + c_C}, \tag{21}$$
where $m_C$ and $c_C$ are the internal and boundary edge weights of Equations (18) and (19), i.e., it is the ratio of the community’s external to total edge weight. A low value means that the community is well separated from the rest of the network. In our paradigm, conductance is low if the followers of the community are not interested in other communities.

Cohesiveness measures how easily the community can be split into disconnected components. A good community is not easily broken up. The cohesiveness is given by the minimum conductance of any subcommunity. A low value indicates a bad community as there is at least one well-separated subcommunity. In our paradigm, low cohesiveness corresponds to members of the community having distinct, non-overlapping follower groups.
$$\min_{C' \subset C} \phi(C'), \tag{22}$$
where $\phi(C')$ is the conductance of the subcommunity $C'$. Iterating through all subsets of $C$ is impractical. Thus, we sample by randomly selecting 10 subsets of starting vertices, running PPR community detection for each and taking a sweep through the PageRank vector to find the minimum conductance cut.

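Conductance Ratio (CR) is the ratio of conductance to cohesiveness and defined as
$$\mathrm{CR}(C) = \frac{\phi(C)}{\min_{C' \subset C} \phi(C')}, \tag{23}$$
with $\phi$ denoting conductance. A large number indicates that the community could be broken up into structural subunits.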

Density is given by the ratio of the community’s total internal edge weight to the maximum possible if every edge was present with weight one:
$$\frac{m_C}{|C|(|C|-1)/2}, \tag{24}$$
where $m_C$ is the internal edge weight of Equation (18) and $|C|$ is the number of vertices in the community. A high number indicates a highly interconnected community. In our paradigm, this corresponds to a community with a well-defined follower base that is interested in most community members.

Separability measures how well the community is separated from the rest of the network. It is the ratio of internal to external edge weight and so is closely related to conductance:
$$\frac{m_C}{c_C}, \tag{25}$$
with $m_C$ and $c_C$ as in Equations (18) and (19). In our paradigm, a high value indicates that followers of the community are not interested in much else.
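Several of the goodness metrics above follow directly from their verbal definitions and can be sketched in NumPy. The sketch assumes a symmetric weight matrix with a zero diagonal, counts each internal edge once, and takes "total edge weight" in the conductance to mean twice the internal weight plus the boundary weight, as in the unweighted definitions of Yang and Leskovec; all function names are hypothetical. For the weighted clustering, the terms with j = k are excluded from the denominator, an assumption made so that the measure reduces to the usual clustering coefficient on binary graphs. Cohesiveness and the conductance ratio are omitted because they require the PPR sweep.

```python
import numpy as np


def internal_weight(W, members):
    """Total weight of edges with both endpoints in the community."""
    idx = np.asarray(members)
    sub = W[np.ix_(idx, idx)]
    return float(np.triu(sub, k=1).sum())  # upper triangle: each edge once


def boundary_weight(W, members):
    """Total weight of edges with exactly one endpoint in the community."""
    inside = np.zeros(W.shape[0], dtype=bool)
    inside[np.asarray(members)] = True
    return float(W[np.ix_(inside, ~inside)].sum())


def conductance(W, members):
    m, c = internal_weight(W, members), boundary_weight(W, members)
    return c / (2.0 * m + c)


def density(W, members):
    n = len(members)
    return internal_weight(W, members) / (n * (n - 1) / 2.0)


def separability(W, members):
    return internal_weight(W, members) / boundary_weight(W, members)


def weighted_clustering(W):
    """Per-vertex weighted clustering; vertices with < 2 neighbours score 0."""
    numer = np.diag(W @ W @ W)  # sum over j, k of W_ij * W_jk * W_ki
    strength = W.sum(axis=1)
    # Sum over j != k of W_ij * W_ik, scaled by the largest weight in W.
    denom = W.max() * (strength ** 2 - (W ** 2).sum(axis=1))
    return np.divide(numer, denom, out=np.zeros(len(W)), where=denom > 0)
```

For example, a triangle of unit-weight edges attached to the rest of the graph by a single edge of weight 0.5 has density 1, separability 6 and conductance 0.5/6.5.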
References
 Adler and Mitzenmacher (2001) M Adler and M Mitzenmacher. Towards compressing web graphs. Proceedings of the Data Compression Conference, pages 203–212, 2001.
 Andoni et al. (2014) A Andoni, P Indyk, HL Nguyen, and I Razenshteyn. Beyond locality-sensitive hashing. Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1018–1028, 2014. ISSN 9781611973389. doi: 10.1137/1.9781611973402.76. URL http://arxiv.org/abs/1306.1547.
 Bahmani et al. (2011) B Bahmani, K Chakrabarti, and D Xin. Fast personalized pagerank on mapreduce. Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 973–984, 2011. URL http://dl.acm.org/citation.cfm?id=1989425.
 Bastian et al. (2009) M Bastian, S Heymann, and M Jacomy. Gephi: an open source software for exploring and manipulating networks. Proceedings of the 3rd International AAAI Conference on Weblogs and Social Media, pages 361–362, 2009. ISSN 14753898. doi: 10.1136/qshc.2004.010033.
 Blondel et al. (2008) VD Blondel, JL Guillaume, R Lambiotte, and E Lefebvre. Fast unfolding of community hierarchies in large networks. Journal of Statistical Mechanics: Theory and Experiment, page 10008, 2008. doi: 10.1088/17425468/2008/10/P10008.
 Boldi and Vigna (2004) P Boldi and S Vigna. The webgraph framework I: compression techniques. Proceedings of the 13th International Conference on the World Wide Web, ACM, 2004. URL http://dl.acm.org/citation.cfm?id=988752.
 Broder et al. (2000) AZ Broder, M Charikar, AM Frieze, and M Mitzenmacher. Minwise independent permutations. Journal of Computer and System Sciences, 60(3):630–659, 2000.
 Broder (1997) AZ Broder. On the resemblance and containment of documents. IEEE Proceedings of Compression and Complexity of Sequences, pages 21–29, 1997. ISSN 0818681322. doi: 10.1109/SEQUEN.1997.666900.
 Bullmore and Sporns (2009) E Bullmore and O Sporns. Complex brain networks: graph theoretical analysis of structural and functional systems. Nature Reviews Neuroscience, 10.3, pages 186–198, 2009. URL http://www.nature.com/nrn/journal/v10/n3/abs/nrn2575.html.
 Charikar (2002) MoS Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th Annual ACM Symposium on the Theory of Computing (STOC), pages 380–388, 2002. ISBN 1581134959. doi: 10.1145/509907.509965. URL http://portal.acm.org/citation.cfm?doid=509907.509965.
 Chen et al. (2010) R Chen, X Weng, B He, and M Yang. Large graph processing in the cloud. Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010. URL http://dl.acm.org/citation.cfm?id=1807297.
 Chierichetti et al. (2009) F Chierichetti, R Kumar, and S Lattanzi. On compressing social networks. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009. URL http://www.se.cuhk.edu.hk/~hcheng/seg5010/slides/compress_kdd2009.pdf.
 Clauset (2005) A Clauset. Finding local community structure in networks. Physical Review E, 72.2, page 026132, 2005. URL http://journals.aps.org/pre/abstract/10.1103/PhysRevE.72.026132.
 Flake et al. (2000) GW Flake, S Lawrence, and CL Giles. Efficient identification of Web communities. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 150–160, 2000. doi: 10.1145/347090.347121. URL http://portal.acm.org/citation.cfm?doid=347090.347121.
 Fortunato and Barthelemy (2007) S Fortunato and M Barthelemy. Resolution limit in community detection. Proceedings of the National Academy of Sciences, 104.1, pages 36–41, 2007. URL http://www.pnas.org/content/104/1/36.short.
 Fortunato (2010) S Fortunato. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.
 Gleich and Seshadhri (2012) DF Gleich and C Seshadhri. Vertex neighborhoods, low conductance cuts, and good seeds for local community methods. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 597–605, 2012.
 Gupta et al. (2013) P Gupta, A Goel, J Lin, A Sharma, D Wang, and R Zadeh. WTF: The who to follow service at Twitter. Proceedings of the 22nd International Conference on the World Wide Web, pages 505–514, 2013. URL http://dl.acm.org/citation.cfm?id=2488388.2488433.
 Haveliwala et al. (2000) T Haveliwala, A Gionis, and P Indyk. Scalable techniques for clustering the Web. The 3rd International Workshop on the Web and Databases, 2000. URL http://ilpubs.stanford.edu:8090/445/.
 Haveliwala (2002) T Haveliwala. Topic-sensitive pagerank. Proceedings of the 11th International Conference on the World Wide Web, pages 517–526, 2002. ISSN 08963207. doi: 10.1145/511446.511513. URL http://doi.acm.org/10.1145/511446.511513.
 Hewett et al. (1992) TT Hewett, R Baecker, S Card, and T Carey. ACM SIGCHI curricula for human-computer interaction. ACM, 1992. URL http://dl.acm.org/citation.cfm?id=2594128.
 Holme et al. (2007) P Holme, SM Park, BJ Kim, and CR Edling. Korean university life in a network perspective: Dynamics of a large affiliation network. Physica A: Statistical Mechanics and its Applications, 373:821–830, 2007. ISSN 03784371. doi: 10.1016/j.physa.2006.04.066.
 Indyk and Motwani (1998) P Indyk and R Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. Proceedings of the 30th Annual ACM Symposium on the Theory of Computing, 126:604–613, 1998. ISSN 00123692. doi: 10.4086/toc.2012.v008a014. URL http://dl.acm.org/citation.cfm?id=276876.
 Kloumann and Kleinberg (2014) I Kloumann and J Kleinberg. Community membership identification from small seed sets. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1366–1375, ACM, 2014. URL http://dl.acm.org/citation.cfm?id=2623621.
 Kyrola et al. (2012) A Kyrola, G Blelloch, and C Guestrin. Graphchi: Large-scale graph computation on just a PC. Symposium on Operating Systems Design and Implementation (OSDI), 2012. URL https://www.usenix.org/conference/osdi12/technicalsessions/presentation/kyrola.
 Lancichinetti et al. (2009) A Lancichinetti, S Fortunato, and J Kertész. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics, 11.3(033015), 2009. URL http://iopscience.iop.org/13672630/11/3/033015.
 Leskovec et al. (2010) J Leskovec, KJ Lang, and MW Mahoney. Empirical comparison of algorithms for network community detection. Proceedings of the 19th International Conference on the World Wide Web, pages 631–640, 4 2010. URL http://arxiv.org/abs/1004.3539.
 Li et al. (2015) Y Li, K He, D Bindel, and JE Hopcroft. Uncovering the small community structure in large networks: A local spectral approach. In Proceedings of the 24th International Conference on the World Wide Web, pages 658–668, 2015.
 Li and König (2009) P Li and C König. b-Bit minwise hashing. Proceedings of the 19th International Conference on the World Wide Web, ACM, 2009. URL http://dl.acm.org/citation.cfm?id=1772759.
 Low et al. (2014) Y Low, JE Gonzalez, and A Kyrola. Graphlab: A new framework for parallel machine learning. arXiv preprint arXiv, 2014. URL http://arxiv.org/abs/1408.2041.
 Lusseau (2003) D Lusseau. The emergent properties of a dolphin social network. Proceedings of the Royal Society of London B: Biological Sciences, 270(Suppl 2), pages S186–S188, 2003.
 Malewicz et al. (2010) G Malewicz, MH Austern, AJC Bik, JC Dehnert, I Horn, N Leiser, and G Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 135–146, 2010. ISBN 9781450300322. doi: 10.1145/1807167.1807184. URL http://dl.acm.org/citation.cfm?id=1807167.1807184.
 Manasse and Mcsherry (2008) M Manasse and F Mcsherry. Consistent weighted sampling. Technical report, 2010. URL http://research.microsoft.com/pubs/132309/ConsistentWeightedSampling2.pdf.
 McPherson et al. (2001) M McPherson, L Smith-Lovin, and JM Cook. Birds of a feather: homophily in social networks. Annual Review of Sociology, 27(1):415–444, 8 2001. ISSN 03600572. doi: 10.1146/annurev.soc.27.1.415. URL http://www.annualreviews.org/doi/abs/10.1146/annurev.soc.27.1.415?journalCode=soc.
 Mitzenmacher et al. (2014) M Mitzenmacher, R Pagh, and N Pham. Efficient estimation for high similarities using odd sketches. Proceedings of the 23rd International Conference on the World Wide Web, pages 109–118, 2014. URL http://dl.acm.org/citation.cfm?id=2568017.
 Motwani et al. (2005) R Motwani, A Naor, and R Panigrahy. Lower bounds on locality sensitive hashing. SIAM Journal on Discrete Mathematics, pages 930–935, 2005. ISSN 08954801. doi: 10.1137/050646858. URL http://arxiv.org/abs/cs/0510088.
 Newman (2006) MEJ Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E  Statistical, Nonlinear, and Soft Matter Physics, 74(3):1–19, 2006. ISSN 15393755. doi: 10.1103/PhysRevE.74.036104.
 Newman (2003) MEJ Newman. The structure and function of complex networks. SIAM Review, 45.2, pages 167–256, 2003. URL http://epubs.siam.org/doi/abs/10.1137/S003614450342480.
 Newman (2004a) MEJ Newman. Fast algorithm for detecting community structure in networks. Physical review E, 69(6)(066133), 2004. URL http://journals.aps.org/pre/abstract/10.1103/PhysRevE.69.066133.
 Newman (2004b) MEJ Newman. Detecting community structure in networks. The European Physical Journal B-Condensed Matter and Complex Systems, 38.2, pages 321–330, 2004.
 O’Donnell et al. (2009) R O’Donnell, Y Wu, and Y Zhou. Optimal lower bounds for locality sensitive hashing (except when q is tiny). ACM Transactions on Computation Theory (TOCT), 6(1):9, 2009. ISSN 19423462. doi: 10.1145/2578221. URL http://arxiv.org/abs/0912.0250.
 Pace (2012) MF Pace. BSP Vs MapReduce. Procedia Computer Science 9, pages 246–255, 2012. doi: 10.1016/j.procs.2012.04.026. URL http://arxiv.org/abs/1203.2081.
 Page et al. (1998) L Page, S Brin, R Motwani, and T Winograd. The citation ranking: bringing order to the Web, 1998. URL http://ilpubs.stanford.edu:8090/422/1/199966.pdf.
 Philbin (2008) J Philbin. Near duplicate image detection: min-Hash and tf-idf weighting. Proceedings of the British Machine Vision Conference, 3:4, 2008. ISSN 10959203. doi: 10.5244/C.22.50.
 Pons and Latapy (2005) P Pons and M Latapy. Computing communities in large networks using random walks. Computer and Information Sciences-ISCIS, pages 284–293, 2005. URL http://arxiv.org/abs/physics/0512106.
 Raghavan et al. (2007) UN Raghavan, R Albert, and S Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 9 2007. ISSN 15393755. doi: 10.1103/PhysRevE.76.036106. URL http://arxiv.org/abs/0709.2938.
 Reichardt and Bornholdt (2006) J Reichardt and S Bornholdt. Statistical mechanics of community detection. Physical Review E  Statistical, Nonlinear, and Soft Matter Physics, 74, 2006. ISSN 15393755. doi: 10.1103/PhysRevE.74.016110.
 Rosvall and Bergstrom (2008) M Rosvall and CT Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105.4:1118–1123, 2008. URL http://www.pnas.org/content/105/4/1118.short.
 Sampson (1969) SF Sampson. Crisis in a cloister. PhD thesis, Cornell University, Ithaca, 1969.
 Schaeffer (2007) SE Schaeffer. Graph clustering. Computer Science Review, 1(1):27–64, 2007.
 Wysocki and Zabierowski (2011) R Wysocki and W Zabierowski. Twisted framework on game server example. Proceedings of the 11th International Conference on the Experience of Designing and Application of CAD Systems in Microelectronics (CADSM), pages 361–363, 2011.
 Yang and Leskovec (2012) J Yang and J Leskovec. Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems, pages 181–213, 2015. URL http://dl.acm.org/citation.cfm?id=2350193.
 Yang and Leskovec (2013) J Yang and J Leskovec. Overlapping community detection at scale: a nonnegative matrix factorization approach. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining, pages 587–596. ACM, 2013. ISBN 9781450318693. doi: 10.1145/2433396.2433471. URL http://dl.acm.org/citation.cfm?id=2433396.2433471.
 Zachary (1977) W Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, pages 452–473, 1977. URL http://www.jstor.org/stable/3629752.