# Comparing Community Structure to Characteristics in Online Collegiate Social Networks

## Abstract

We study the structure of social networks of students by examining the graphs of Facebook “friendships” at five American universities at a single point in time. We investigate each single-institution network’s community structure and employ graphical and quantitative tools, including standardized pair-counting methods, to measure the correlations between the network communities and a set of self-identified user characteristics (residence, class year, major, and high school). We review the basic properties and statistics of the pair-counting indices employed and recall, in simplified notation, a useful analytical formula for the $z$-score of the Rand coefficient. Our study illustrates how to examine different instances of social networks constructed in similar environments, emphasizes the array of social forces that combine to form “communities,” and leads to comparative observations about online social lives that can be used to infer comparisons about offline social structures. In our illustration of this methodology, we calculate the relative contributions of different characteristics to the community structure of individual universities and subsequently compare these relative contributions at different universities, measuring for example the importance of common high school affiliation to large state universities and the varying degrees of influence common major can have on the social structure at different universities. The heterogeneity of communities that we observe indicates that these networks typically have multiple organizing factors rather than a single dominant one.

## 1 Introduction

Social networks are a ubiquitous part of everyday life. Although they have long been studied by social scientists [36], the mainstream awareness of their ubiquity has arisen only recently, in part because of the rise of social networking sites (SNSs) on the World Wide Web. Since their introduction, SNSs such as Friendster, MySpace, Facebook, Orkut, LinkedIn, and hundreds of others have attracted hundreds of millions of users, many of whom have integrated SNSs into their daily lives to communicate with friends, send e-mails, solicit opinions or votes, organize events, spread ideas, find jobs, and more [2]. Facebook, an SNS launched in February 2004, now overwhelms numerous aspects of everyday life, having become an especially popular obsession among college and high school students (and, increasingly, among other members of society) [1, 2, 23, 25]. Facebook members can create self-descriptive profiles that include links to the profiles of their “friends,” who may or may not be offline friends. Facebook requires that anybody who one wants to add as a friend confirm the relationship, so Facebook friendships define a network (graph) of reciprocated ties (undirected edges) that connect individual users.

The global organization of real-world networks typically includes coexisting modular (horizontal) and hierarchical (vertical) organizational structures [33, 8, 28, 30, 5]. Myriad papers have attempted to interpret such organization through the computation of structural modules or communities [33, 8], which are defined in terms of mesoscopic groups of nodes with more internal connections (between nodes in the group) than external connections (between nodes in the group and nodes in other groups). Such communities, which are not typically identified in advance, are often considered to not be merely structural modules but are also expected to have functional importance because of the large number of common ties among nodes in a community. Additionally, prior empirical studies have observed some correspondence between communities and “ground truth” groups in social and biological networks [33]. For example, communities in social networks might correspond to circles of friends or business associates, communities in the World Wide Web might encompass pages on closely-related topics, communities in metabolic networks have been used to find functional modules [15], and communities have been used to identify and measure political polarization in legislative processes in the U.S. Congress [37, 38].

As discussed at length in two recent review articles [33, 8] and references therein, the classes of techniques available to detect communities are both numerous and diverse; they include hierarchical clustering methods such as single linkage clustering, centrality-based methods, local methods, optimization of quality functions such as modularity and similar quantities, spectral partitioning, likelihood-based methods, and more. In addition to remarkable successes on benchmark examples, investigations of community structure have led to success stories in diverse application areas—including the reconstruction of college football conferences [11] and the investigation of such structures in algorithmic rankings [6]; the analysis of committee assignments [32], legislation cosponsorship [38], and voting blocs [37] in the U.S. Congress; the examination of functional groups in metabolic networks [15]; the study of ethnic preferences in school friendship networks [13]; and the study of social structures in mobile-phone conversation networks [31].

In this paper, we investigate the community structures of complete Facebook networks whose links represent reciprocated “friendships” between user pages (nodes) within each of five American universities during a single-time snapshot in September 2005. Our primary aim in this paper is to use an unsupervised algorithm to compute the community structure—consisting of clusters of nodes—of these universities and to determine how well the demographic labels included in the data correspond to algorithmically computed clusters. We consider only ties between students at the same institution, yielding five separate realizations of university social networks and allowing us to compare the structures at different institutions.

The rest of this paper is organized as follows. In Section 2, we describe our principal methods: the employed community-detection method, visual exploration of identified communities, and standardized pair-counting methods for quantitative comparison of communities with demographic data. We present more details about the data in Section 3. We then describe and discuss the results that we obtained for the five institutions in Section 4 before concluding with a discussion in the final section.

## 2 Comparing Communities

A social network with a single type of connection between nodes can be represented as an adjacency matrix $A$ whose elements $A_{ij}$ give the weight of the tie between nodes $i$ and $j$. The Facebook networks we study are unweighted, so $A_{ij} \in \{0, 1\}$, where the value is $1$ if a tie exists and $0$ if it does not. The resulting tangle of nodes and links, which we show for the California Institute of Technology (Caltech) Facebook network in Fig. LABEL:kkexample, can obscure any organizational structure that might be present.

One approach to analyzing such data is to employ exponential random graph models (see, e.g., [35]), statistically fitting an underlying model for the presence of links. While such models (which can incorporate local network features) are potentially valuable for understanding the microscopic processes that underlie the links between individual nodes, we take a different approach, focusing on groups of friends that form structural “communities”: groups of nodes that contain more internal connections (links between nodes in the group) than external connections (between nodes of the group and nodes in other groups) [33, 8]. Our approach is motivated in part by the features of the Caltech data (discussed in detail in Sections 3 and 4). Although precise results obviously vary from one model specification to another, performing a logistic regression on the dyads (pairs of nodes) yields comparable coefficients for link presence between users from the same House as from the same high school. However, far more pairs of users at Caltech share a House than share a high school. While common high school is unsurprisingly important at the dyadic level (in the rare cases in which it occurs), common House affiliation is apparently much more important for understanding structures that consist of larger groups of individuals. Accordingly, our goal in this section is to discuss how to compare the composition of algorithmically-determined communities to groups defined based on common user characteristics.

We identify communities using spectral optimization [29] (followed by supplementary Kernighan-Lin node-swapping steps [21]) of the “modularity” quality function $Q = \sum_i (e_{ii} - a_i^2)$, where $e_{ij}$ denotes the fraction of ends of edges in group $i$ for which the other end of the edge lies in group $j$ and $a_i = \sum_j e_{ij}$ is the fraction of all ends of edges that lie in group $i$. High values of modularity correspond to community assignments with greater numbers of intra-community links than expected at random (with respect to a particular null model [29, 33, 8]). Numerous other community detection methods are also available. However, our focus in the present paper is on studying communities after they are obtained, and our methods can be applied to the output of any community-detection algorithm in which each node is assigned to precisely one community. Such an assignment of nodes to communities constitutes a partition of the original graph. We seek a means to compare an algorithmically-obtained partition to partitions based on information that we have about Facebook user characteristics (class year, dormitory (House), high school, and major) as a means of exploring the roles of such characteristics in the social structures of each institution. An online social network is an imperfect proxy for an offline network, but our comparisons are nevertheless expected to yield interesting insights about the social life at the universities we study.
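As a small illustration of the quality function only (not of the spectral optimization or the Kernighan-Lin refinement), the following sketch evaluates $Q = \sum_i (e_{ii} - a_i^2)$ for a given partition of an unweighted, undirected graph. The function name `modularity` and the six-node toy graph are hypothetical and are not taken from the Facebook data.

```python
def modularity(adjacency, labels):
    """Evaluate Q = sum_i (e_ii - a_i^2) for a symmetric 0/1 adjacency matrix."""
    n = len(adjacency)
    total_ends = sum(sum(row) for row in adjacency)  # twice the number of edges
    groups = sorted(set(labels))
    index = {g: k for k, g in enumerate(groups)}
    # e[i][j]: fraction of edge ends in group i whose other end lies in group j
    e = [[0.0] * len(groups) for _ in groups]
    for u in range(n):
        for v in range(n):
            if adjacency[u][v]:
                e[index[labels[u]]][index[labels[v]]] += adjacency[u][v] / total_ends
    a = [sum(row) for row in e]  # a_i: fraction of all edge ends in group i
    return sum(e[k][k] - a[k] ** 2 for k in range(len(groups)))


# Toy graph (an assumption for illustration): two triangles joined by one edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0] * 6 for _ in range(6)]
for u, v in edges:
    A[u][v] = A[v][u] = 1

q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_single = modularity(A, [0, 0, 0, 0, 0, 0])  # everything in one community
```

Placing each triangle in its own community gives a positive $Q$, whereas the trivial single-community partition gives $Q = 0$, which matches the interpretation of modularity as an excess of intra-community links over the null-model expectation.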

### 2.1 Visual Comparisons

The demographic composition of communities is sometimes clear from visual inspection. This is the case with the community structure of the Caltech network, which agrees closely with its undergraduate “House” system. In Fig. LABEL:fig:Caltech, we show a force-directed layout of Caltech’s 12 communities (yielding a modularity of ), which we show as pies with area proportional to the number of constituent nodes. Purple slices signify individuals who did not identify a House affiliation.

Unlike at the other universities (see Section 4), House affiliation is the primary organizing principle of the communities in the Caltech network, as we expected because Caltech’s House structure is so dominant socially. Indeed, each pie in Fig. LABEL:fig:Caltech is dominated by members of one House. Moreover, many pies include a significant number of people who identify “Avery House” as their affiliation (dark blue), which is expected because of its different residency rules (members of all Houses could live in Avery at the time the data was collected). Given the promotion of Avery House to official House status after our data snapshot, it is natural to wonder if community detection on current data would find a community dominated by Avery. Investigating the formation of such a community using longitudinal data would be even more interesting, but is beyond the scope of our data. In principle, one can also use the compositions of the communities to make limited predictions about users who did not volunteer their House affiliation.

Despite this demonstration of the utility of visualizing communities, it is typically necessary to perform quantitative analyses after detecting communities, as Caltech is unusual among universities in having a single characteristic that aligns so closely with its communities. For other institutions, we observe more heterogeneous communities, and it is typically difficult to visually assess which characteristics best correlate with the communities or even whether there is any strong correlation at all. To investigate the social organization of communities at such universities, it is thus essential to quantitatively compare the detected communities with the available demographic groups. Such considerations apply broadly to community detection in most networks [33].

### 2.2 Pair Counting

As discussed in Refs. [26, 20], methods to compare graph partitions can be classified roughly into three groups: (1) pair counting, (2) cluster matching, and (3) information-theoretic techniques. Cluster matching might be particularly problematic in the present context, as the numbers and sizes of groups vary significantly across the comparisons, which makes the essential identifications across partitions rather difficult. We focus on a collection of pair-counting methods, in part because of their convenient algebraic description, as one just needs to count the ways that pairs of nodes are grouped across two partitions. That same simplicity can also be a weakness, as it can present a serious interpretation difficulty because of the unclear range of “good” scores. However, as we will show in Section 2.3, standardization of pair-counting scores provides a unified interpretation of a number of seemingly disparate pair-counting measures and is particularly useful for the present setting. We also compare these results with those obtained using variation of information (VI) [26].

A pair-counting method defines a similarity score by counting each pair of nodes drawn from the $n$ nodes of a network according to whether the pair falls in the same or in different groups in each partition. Pair-counting methods comprise a subset of a more general class of association measures that can be used for studying unordered (i.e., categorical) contingency tables [18, 22, 26]. We denote the counts of node pairs in each classification as $w_{11}$ (pairs classified together in both partitions), $w_{10}$ (same in the first but different in the second), $w_{01}$ (different in the first but same in the second), and $w_{00}$ (different in both). The sum of these quantities is, by definition, equal to the total number of node pairs: $M = w_{11} + w_{10} + w_{01} + w_{00} = \binom{n}{2} = n(n-1)/2$. Given two partitions of a network, one can obtain many different pair-counting similarity coefficients using different algebraic combinations of the counts.
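The four pair counts can be tallied directly by looping over all node pairs. The sketch below does this by brute force in $O(n^2)$ time; the function name `pair_counts` and the toy labelings `communities` and `houses` are hypothetical illustrations rather than values from the data.

```python
from itertools import combinations


def pair_counts(part_a, part_b):
    """Return (w11, w10, w01, w00) for two labelings of the same nodes."""
    w11 = w10 = w01 = w00 = 0
    for i, j in combinations(range(len(part_a)), 2):
        same_a = part_a[i] == part_a[j]
        same_b = part_b[i] == part_b[j]
        if same_a and same_b:
            w11 += 1  # together in both partitions
        elif same_a:
            w10 += 1  # together in the first only
        elif same_b:
            w01 += 1  # together in the second only
        else:
            w00 += 1  # separated in both partitions
    return w11, w10, w01, w00


# Hypothetical example: six nodes, detected communities versus a House label.
communities = [0, 0, 0, 1, 1, 1]
houses = ["A", "A", "B", "B", "C", "C"]
counts = pair_counts(communities, houses)
total_pairs = sum(counts)  # equals n(n-1)/2 = 15 for n = 6
```

For networks with many thousands of nodes, the same counts are obtained far more cheaply from the contingency table of the two partitions.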

We first consider the Rand similarity coefficient $S_R = (w_{11} + w_{00})/M$ [34], which counts the fraction of node pairs identified the same way by both partitions (either together in both or separate in both). Bounded between $0$ (no similar pair placements) and $1$ (identical partitions), the Rand coefficient is extremely intuitive and can be used fruitfully in many settings. However, it has an important deficiency: the Rand coefficient for two network partitions that each contain large numbers of categories is skewed towards the value $1$ because of the large fraction of node pairs that are placed in different groups even when comparing two partitions with little in common.

If one wishes to exclude $w_{00}$ from having an explicit role, one can use the Jaccard index $S_J = w_{11}/(w_{11} + w_{10} + w_{01})$ or the Fowlkes-Mallows similarity coefficient $S_{FM} = w_{11}/\sqrt{(w_{11} + w_{10})(w_{11} + w_{01})}$. Both $S_J$ and $S_{FM}$ clearly avoid the problematic effects of large $w_{00}$, but because they ignore node pairs that both partitions place in different communities, they yield overly high values when comparing network partitions with very few categories (or when one partition consists of a single group). Another index is the Minkowski coefficient $S_M = \sqrt{(w_{10} + w_{01})/(w_{11} + w_{10})}$, which is asymmetric in its consideration of the two partitions. The first partition serves as a distinguished reference, and the coefficient measures the number of mismatches relative to the number of similarly-grouped pairs in that reference. Hence, values closer to $0$ are considered better. The $\Gamma$ similarity coefficient, defined as

$$
S_\Gamma = \frac{M w_{11} - (w_{11} + w_{10})(w_{11} + w_{01})}{\sqrt{(w_{11} + w_{10})(w_{11} + w_{01})\,[M - (w_{11} + w_{10})]\,[M - (w_{11} + w_{01})]}}\,,
$$

has the most complicated algebraic form of the similarity coefficients that we employ. Additional measures and discussions are available in Refs. [7, 19, 26]. Notably, each measure suffers from the difficulty that it is unclear what constitute “good” values, as they all depend intimately on the numbers and sizes of the groups in the partition. (We illustrate this in Section 4 with computations for the Caltech network and discuss further properties of the similarity indices in Section 2.3.)
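To make the algebra concrete, the following sketch evaluates the Rand, Jaccard, Fowlkes-Mallows, Minkowski, and $\Gamma$ coefficients from a given set of pair counts, using the standard definitions of these indices. The function name and the example counts are hypothetical; the counts correspond to a six-node toy comparison, not to any of the Facebook networks.

```python
from math import sqrt


def similarity_coefficients(w11, w10, w01, w00):
    """Standard pair-counting indices computed from the four pair counts."""
    M = w11 + w10 + w01 + w00   # total number of node pairs
    M1 = w11 + w10              # pairs grouped together in the first partition
    M2 = w11 + w01              # pairs grouped together in the second partition
    return {
        "rand": (w11 + w00) / M,
        "jaccard": w11 / (w11 + w10 + w01),
        "fowlkes_mallows": w11 / sqrt(M1 * M2),
        # Minkowski treats the first partition as the reference; lower is better.
        "minkowski": sqrt((w10 + w01) / M1),
        "gamma": (M * w11 - M1 * M2) / sqrt(M1 * M2 * (M - M1) * (M - M2)),
    }


# Hypothetical counts for six nodes (15 pairs in total).
scores = similarity_coefficients(2, 4, 1, 8)

# Identical partitions: every index takes its "best" value.
perfect = similarity_coefficients(6, 0, 0, 9)
```

Note that the sketch assumes neither partition is trivial (all nodes in one group, or every node in its own group); otherwise some denominators vanish. Even in this tiny example the raw values span a wide range, which illustrates why it is unclear what a “good” value is.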

One can also try to alleviate the problem of identifying good similarity values by introducing various “adjusted” indices that report comparisons as a similarity relative to that which might be obtained at random. For instance, one can construct adjusted indices by subtracting the expected value (under some null model, typically conditional on maintaining the numbers and sizes of groups in the two partitions) and then rescaling the result by the difference between the maximum allowed value and the mean value [18]. One such index, using a bound on the maximum allowed value, is the Adjusted Rand coefficient [18]

$$
S_{AR} = \frac{w_{11} - \frac{1}{M}(w_{11} + w_{10})(w_{11} + w_{01})}{\frac{1}{2}\left[(w_{11} + w_{10}) + (w_{11} + w_{01})\right] - \frac{1}{M}(w_{11} + w_{10})(w_{11} + w_{01})}\,.
$$
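The “subtract the expected value” construction can be checked numerically on a small example. The sketch below computes the Adjusted Rand coefficient in its standard pair-count form and then averages it over every relabeling of a six-node toy partition; under the null model of fixed group sizes, that average should vanish. The function name and the toy labelings are hypothetical.

```python
from itertools import combinations, permutations


def adjusted_rand(part_a, part_b):
    """Adjusted Rand coefficient in its standard pair-count form."""
    pairs = list(combinations(range(len(part_a)), 2))
    M = len(pairs)
    M1 = sum(part_a[i] == part_a[j] for i, j in pairs)
    M2 = sum(part_b[i] == part_b[j] for i, j in pairs)
    w11 = sum(part_a[i] == part_a[j] and part_b[i] == part_b[j] for i, j in pairs)
    expected = M1 * M2 / M  # mean of w11 when group sizes are held fixed
    # The denominator vanishes only if both partitions are trivial.
    return (w11 - expected) / ((M1 + M2) / 2 - expected)


communities = [0, 0, 0, 1, 1, 1]
houses = ["A", "A", "B", "B", "C", "C"]
observed = adjusted_rand(communities, houses)

# Average over all 6! node permutations of the second partition.
null_values = [
    adjusted_rand(communities, [houses[k] for k in perm])
    for perm in permutations(range(len(houses)))
]
null_mean = sum(null_values) / len(null_values)  # zero up to rounding error
```

Exhaustive enumeration is feasible only for very small $n$, but it makes explicit that the adjustment recenters the index at its null expectation without saying anything about the width of the null distribution.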

As described in Ref. [26], adjusted indices can be problematic because the focus on the maximum possible values does not guarantee accurate comparisons between similarity coefficients across different settings. In particular, this implies that one cannot necessarily use similarity scores to compare the correspondence between communities and House directly with the correspondence between communities and high school (which is something that we specifically aim to do). That is, even if the second comparison yields a larger Adjusted Rand value than the first, it is not at all clear that the second situation should be construed to yield a closer pair of partitions than the first. Consequently, the general problem of knowing what similarity-score values indicate a good correlation remains.

### 2.3 Standardized Pair Counting

Numerous studies have attempted to assess the utility of similarity measures. However, because partitioning according to demographic traits yields a graph partitioning that typically differs significantly from that obtained using algorithmic community detection, we use a classical statistical approach, advocated in Refs. [3, 9], wherein similarity measures are used in the context of testing significance levels of the obtained values versus those expected at random. We recommend using a proper metric (i.e., a quantity that is a metric in the mathematical sense rather than only in an informal sense) such as variation of information [26] for comparing partitions that are close to one another. However, in the Facebook networks, the mutual information of a pair of partitions is small compared to the total information in each. In such cases, two partitions can be relatively far from each other according to a distance measure but might nevertheless be very far in the tail of the distribution of what can be expected at random. It is consequently more appropriate to identify the pair-counting strength relative to that obtained at random, standardized by the width of the distribution via $z$-scores $z_i = (S_i - \mu_i)/\sigma_i$, which indicate the number of standard deviations that the $S_i$-value is more correlated than the mean (where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of $S_i$ under the null model, noting the need to multiply by $-1$ for $S_M$).

One can obtain $z$-scores non-parametrically using permutation tests [14], though we will identify analytical formulas for $z_R$ and show that the Fowlkes-Mallows, $\Gamma$, Rand, and Adjusted Rand $z$-scores are identical. The elements $n_{ij}$ of the contingency table indicate the number of nodes that are classified into the $i$th group of the first partition and $j$th group of the second partition. As long as partitions are constrained to have the same numbers and sizes of groups as the original partitions (i.e., as long as the row and column sums, $n_{i\cdot} = \sum_j n_{ij}$ and $n_{\cdot j} = \sum_i n_{ij}$, remain constant), the total number of pairs $M$, the number of pairs $M_1 = w_{11} + w_{10}$ classified the same way in the first partition, and the analogous quantity $M_2 = w_{11} + w_{01}$ for the second partition likewise remain constant. This implies that any pair-counting index specified by counts $\{w_{11}, w_{10}, w_{01}, w_{00}\}$ can be equivalently specified in terms of only $\{w_{11}, M, M_1, M_2\}$ because $w_{10} = M_1 - w_{11}$, $w_{01} = M_2 - w_{11}$, and $w_{00} = M - M_1 - M_2 + w_{11}$. It follows immediately that $S_R$, $S_{FM}$, $S_\Gamma$, and $S_{AR}$ are each linear functions of $w_{11}$ and hence linear functions of each other [19]. Any similarity index that is a linear function of $w_{11}$ must be statistically equivalent to $w_{11}$ in any null model (given constant $M$, $M_1$, and $M_2$), with the $z$-score and $p$-value equal to that associated with the specified $w_{11}$. Meanwhile, as we demonstrate in Section 4, the $S_i$ values can have different orderings in different comparisons because of their dependence on $M$, $M_1$, and $M_2$.
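The reduction to the four numbers $w_{11}$, $M$, $M_1$, and $M_2$ can be sketched as follows. The code builds the contingency table with a `Counter`, reads off the pair totals from binomial coefficients of the table entries and its marginals, and then recovers the remaining pair counts from the identities in the text. The function names and toy labelings are hypothetical.

```python
from collections import Counter
from math import comb


def pair_summary(part_a, part_b):
    """Return (w11, M, M1, M2) from the contingency table of two labelings."""
    n = len(part_a)
    n_ij = Counter(zip(part_a, part_b))   # contingency-table entries
    n_i = Counter(part_a)                 # row sums
    n_j = Counter(part_b)                 # column sums
    w11 = sum(comb(c, 2) for c in n_ij.values())
    M1 = sum(comb(c, 2) for c in n_i.values())
    M2 = sum(comb(c, 2) for c in n_j.values())
    return w11, comb(n, 2), M1, M2


def counts_from_summary(w11, M, M1, M2):
    """Recover (w11, w10, w01, w00) from the reduced description."""
    return w11, M1 - w11, M2 - w11, M - M1 - M2 + w11


communities = [0, 0, 0, 1, 1, 1]
houses = ["A", "A", "B", "B", "C", "C"]
w11, M, M1, M2 = pair_summary(communities, houses)
counts = counts_from_summary(w11, M, M1, M2)

# With M, M1, M2 fixed, the Rand coefficient is linear in w11.
rand_from_w11 = (M - M1 - M2 + 2 * w11) / M
```

Because only `w11` changes when node labels are permuted within fixed group sizes, any index that is linear in `w11` shares its $z$-score.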

It is also instructive to note the relationships between the linear-in-$w_{11}$ similarity coefficients and the Jaccard and Minkowski indices: $S_J = w_{11}/(M_1 + M_2 - w_{11})$ and $S_M = \sqrt{(M_1 + M_2 - 2w_{11})/M_1}$. The asymmetry in the Minkowski index is clearly limited, as switching which partition is the reference changes the coefficient by a multiplicative factor. Because the square root and multiplicative inverse are both monotonic operations in the domains of these indices ($S_J \in [0, 1]$ and $S_M \geq 0$), it follows that the $p$-values of the cumulative distributions of each are identical to the $p$-value of $w_{11}$ itself even though the corresponding $z$-scores can be different.

In deference to the seminal presentation of the Rand index [34], we refer to the $z$-score of the linear-in-$w_{11}$ scores as $z$-Rand: $z_R = (w_{11} - \mu_{w_{11}})/\sigma_{w_{11}}$, where $\mu_{w_{11}}$ and $\sigma_{w_{11}}$ are, respectively, the mean and standard deviation of $w_{11}$ (noting its equivalence by linearity to the $z$-score advocated explicitly by Brennan and Light [3]). In the absence of external information that indicates a need to impose specific correlations, we adopt the standard and analytically tractable assumption of a random hypergeometric distribution of equally likely assignments subject to fixed row and column sums. The expected value then becomes $\mu_{w_{11}} = M_1 M_2 / M$, as for the Adjusted Rand index [18]. The calculation of higher-order moments is more involved [3, 4, 17, 24]. In order to make $z_R$ as simple as possible to calculate, we rewrite the formulas of [17] as follows:

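$$
z_R = \frac{1}{\sigma_{w_{11}}}\left(w_{11} - \frac{M_1 M_2}{M}\right), \tag{2.1}
$$

$$
\sigma_{w_{11}}^2 = \frac{M}{16} - \frac{(4M_1 - 2M)^2 (4M_2 - 2M)^2}{256 M^2} + \frac{C_1 C_2}{16 n(n-1)(n-2)} \tag{2.2}
$$

$$
\qquad\quad + \frac{\left[(4M_1 - 2M)^2 - 4C_1 - 4M\right]\left[(4M_2 - 2M)^2 - 4C_2 - 4M\right]}{64 n(n-1)(n-2)(n-3)}, \tag{2.3}
$$

$$
C_1 = n(n^2 - 3n - 2) - 8(n+1)M_1 + 4\sum_i n_{i\cdot}^3, \tag{2.4}
$$

$$
C_2 = n(n^2 - 3n - 2) - 8(n+1)M_2 + 4\sum_j n_{\cdot j}^3. \tag{2.5}
$$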
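A direct transcription of the analytical $z$-Rand calculation into code might look as follows. This is a minimal sketch that assumes the standard closed form for the variance of $w_{11}$ under fixed marginals, with $M_1$, $M_2$, $C_1$, and $C_2$ computed from the row and column sums of the contingency table; the function name `z_rand` and the four-node example are hypothetical.

```python
from collections import Counter
from math import sqrt


def z_rand(part_a, part_b):
    """Analytical z-score of w11 under random assignment with fixed group sizes."""
    n = len(part_a)
    if n != len(part_b) or n < 4:
        raise ValueError("need two labelings of the same n >= 4 nodes")
    n_ij = Counter(zip(part_a, part_b))
    n_i = Counter(part_a)
    n_j = Counter(part_b)

    M = n * (n - 1) / 2
    M1 = sum(c * (c - 1) for c in n_i.values()) / 2
    M2 = sum(c * (c - 1) for c in n_j.values()) / 2
    w11 = sum(c * (c - 1) for c in n_ij.values()) / 2

    C1 = n * (n**2 - 3 * n - 2) - 8 * (n + 1) * M1 + 4 * sum(c**3 for c in n_i.values())
    C2 = n * (n**2 - 3 * n - 2) - 8 * (n + 1) * M2 + 4 * sum(c**3 for c in n_j.values())

    variance = (
        M / 16
        - (4 * M1 - 2 * M) ** 2 * (4 * M2 - 2 * M) ** 2 / (256 * M**2)
        + C1 * C2 / (16 * n * (n - 1) * (n - 2))
        + ((4 * M1 - 2 * M) ** 2 - 4 * C1 - 4 * M)
        * ((4 * M2 - 2 * M) ** 2 - 4 * C2 - 4 * M)
        / (64 * n * (n - 1) * (n - 2) * (n - 3))
    )
    if variance <= 0:
        # Happens when w11 cannot vary, e.g. if one partition is a single group.
        raise ValueError("w11 has zero variance under the null model")
    return (w11 - M1 * M2 / M) / sqrt(variance)


# Four nodes split into two pairs, compared with itself.
z_example = z_rand([0, 0, 1, 1], [0, 0, 1, 1])
```

In this example the second partition has three equally likely relabelings into two pairs, giving $w_{11} \in \{2, 0, 0\}$ with mean $2/3$ and variance $8/9$, so the $z$-score is $(2 - 2/3)/\sqrt{8/9} = \sqrt{2}$. The cost of the calculation is dominated by building the contingency table, so it scales to networks far larger than permutation tests can comfortably handle.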

While we advocate the use of the $z_i$, their associated significance levels (equivalently, the $p$-values of the cumulative distribution) are not equal to those for a Gaussian distribution. The distribution for large samples is asymptotically Gaussian [22], but the distribution associated with comparing a particular pair of partitions need not be. Indeed, the tails of the distribution can be quite heavy [4], so the probability of obtaining extreme $z$-scores can be orders of magnitude higher than in the normal distribution. Nevertheless, the Gaussian approximation is frequently sufficient to gauge statistical significance (past the 95% confidence interval). Given the straightforward calculation of (2.1)–(2.5), we prefer to use $z_R$ directly, with the caveat that the Rand indices do not translate directly to $p$-values.

Where simple formulas for the necessary moments do not appear to be available (i.e., for the Jaccard and Minkowski indices), we resort to the computationally straightforward (albeit intensive if one desires high accuracy) method of examining distributions obtained using permutation tests [14], again under the null model of equally-likely node assignments conditional on the constancy of the numbers and sizes of groups. Specifically, starting from two network partitions whose correlation we want to measure, we calculate the similarity values $S_i$ and obtain a context for these values by repeatedly computing $S_i$ under random permutation of the node assignments in one of the partitions. (Subsequent permutation of assignments in the second partition is redundant.) We thereby aim to compare the similarity coefficients between the two partitions to the distributions of such coefficients from the appropriate ensemble of partition pairs. Numerical estimation of $p$-values far in the tail of the distribution (where many of our points of interest lie) necessarily requires sampling a correspondingly large number of elements. In contrast, calculating $z$-scores only requires sampling the first two moments of the distribution. We typically use permutations (even for the larger networks, where the number of nodes is actually larger than the number of permutations considered), confirming that the obtained $z$-scores have converged to roughly two significant figures by comparing them with those obtained using half of the permutations and also comparing estimates with the analytical values obtained from (2.1)–(2.5).
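The permutation procedure can be sketched as follows for the Jaccard and Minkowski indices. The function names, the default of 2000 permutations, the fixed seed, and the twelve-node example are all assumptions made for illustration and do not reflect the settings used for the results in this paper.

```python
import random
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev


def pair_counts(part_a, part_b):
    """Brute-force (w11, w10, w01, w00) for two labelings of the same nodes."""
    w = [0, 0, 0, 0]
    for i, j in combinations(range(len(part_a)), 2):
        same_a = part_a[i] == part_a[j]
        same_b = part_b[i] == part_b[j]
        w[0 if same_a and same_b else 1 if same_a else 2 if same_b else 3] += 1
    return tuple(w)


def jaccard(w11, w10, w01, w00):
    return w11 / (w11 + w10 + w01)


def negative_minkowski(w11, w10, w01, w00):
    # Multiplied by -1 so that larger values mean closer partitions.
    return -sqrt((w10 + w01) / (w11 + w10))


def permutation_z(part_a, part_b, index, num_permutations=2000, seed=0):
    """Estimate the z-score of `index` by shuffling the labels of part_b."""
    rng = random.Random(seed)
    observed = index(*pair_counts(part_a, part_b))
    shuffled = list(part_b)
    samples = []
    for _ in range(num_permutations):
        rng.shuffle(shuffled)  # keeps the numbers and sizes of groups fixed
        samples.append(index(*pair_counts(part_a, shuffled)))
    return (observed - mean(samples)) / pstdev(samples)


# Hypothetical example: two identical partitions of twelve nodes.
part = [0] * 6 + [1] * 6
z_jaccard = permutation_z(part, part, jaccard)
z_minkowski = permutation_z(part, part, negative_minkowski)
```

Only the first two sample moments are needed, so a few thousand permutations can already give a stable $z$-score in small examples, whereas resolving a $p$-value deep in the tail would require vastly more samples.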

Of course, calculating $z$-scores of the pair-counting indices is not a panacea, particularly when comparing networks of different sizes. Nevertheless, we find them to be exceptionally useful for examining the correlations between communities and partitions by the available demographics in our Facebook data. Before we concentrate on using these $z$-scores to measure correlations, we compare test results (similar to those discussed in Section 4) against other methods, including variation of information [26] and the (non-standardized) Adjusted Rand index [18] using a scatter plot versus $z_R$ in Fig. LABEL:fig:ZvZR. While $S_{AR}$ trends positively with $z_R$ (recall that $z_{AR} = z_R$), there are clearly situations with very small $S_{AR}$ that have much larger $z_R$ values than should be expected at random. We additionally observe that $z_J$ and $z_M$ each appear to be closely approximated by $z_R$ at the scale of Fig. LABEL:fig:ZvZR, though closer inspection reveals relative differences occasionally as large as 10%.

We admit that we are arguably guilty of one of the major sins of statistical analysis, in that $z$-scores are typically a proxy for the likelihood with which one can reject a null hypothesis of independence. It is thus reasonable to question their effectiveness for the quite different task of measuring a correlation. We stress, however, that the underlying statistic that we have standardized is a pair counting of the similarities between partitions rather than a deviation from independence. (We note that $w_{11}$ reduces to a linear function of $\chi^2$ in the special case of uniform constant marginals [4].) Therefore, in the absence of enforcing a particular model for the form of the correlation between partitions, we believe this standardization of similarity scores is a reasonable way to proceed (if done with caution).

## 3 Data

Our data, which was sent directly to us in anonymized form by Adam D’Angelo of Facebook, consists of the complete set of users (nodes) from the Facebook networks for each of five American universities and all of the links between those users’ pages for a single-time snapshot from September 2005. (We have posted the data at http://people.maths.ox.ac.uk/~porterm/data/facebook5.zip.) Similar snapshots of Facebook data from 10 Texas universities were analyzed recently in Ref. [25], and a snapshot from “a diverse private college in the Northeast U.S.” was studied in Ref. [23]. Other studies of Facebook have typically obtained data either through surveys [2] or through various forms of automated sampling [12], thereby containing missing nodes and links that can strongly impact the resulting graph structures and analyses.

We consider only ties between people at the same institution, which yields five separate realizations of university social networks and allows us to compare the structures at different institutions. Our study includes a small technical institute (California Institute of Technology [Caltech]), a pair of private universities (Georgetown University and Princeton University), and a pair of large state universities (University of Oklahoma and University of North Carolina at Chapel Hill [UNC]).

We summarize basic properties of the university networks in Fig. LABEL:sumsum and Table LABEL:table:size. See [28, 30] and references therein for discussions of the measures that we use in this section. Although our focus in this paper is community structure, we remark that even these simple network characteristics can yield insights about Facebook networks. The mean degrees tend to increase with network size, potentially indicating that broader institutional use begets greater personal use (though this trend is clearly strongly influenced by the Caltech data). The degree distributions of these institutions (plotted in the top panels of Fig. LABEL:sumsum) have heavy tails compared to random graphs. In particular, the degree distributions appear to be approximately exponential. Although the mechanisms driving such distributions are impossible to ascertain without longitudinal data, the roughly exponential form of the degree distribution both above and below the mean degree potentially indicates a wide range in the willingness to participate (i.e., to add online friends) among Facebook users.

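The bottom panels of Fig. LABEL:sumsum compare node degree versus the local clustering coefficient,

$$
C_i = \frac{\text{number of pairs of neighbors of } i \text{ that are connected}}{\tfrac{1}{2} k_i (k_i - 1)}\,,
$$

where $k_i$ is the degree of node $i$.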

We note that even heavy users have much larger local clustering than that expected at random (e.g., when compared with the total graph densities). In Table LABEL:table:size, we provide the mean clustering coefficient and the transitivity for each network, given by the fraction of connected triples in the network that are fully connected triangles. Both measures of local clustering are much larger at Caltech than they are at the other institutions. It is of course not surprising that we observe large transitivities in social networks such as the Facebook networks. Nevertheless, as we have shown recently in Ref. [27], tree-based theories of various dynamical processes appear to be valid for Facebook networks (despite their high clustering, implying that they are most definitely not locally tree-like) because they are “sufficiently small” worlds, in that the mean distance between nodes is close to the expected value obtained in random networks with the same joint degree-degree distributions.
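For concreteness, the two clustering measures can be computed from an edge list as in the following sketch, which uses the standard definitions: the local clustering coefficient of a node is the fraction of pairs of its neighbors that are themselves connected, and the transitivity is the fraction of connected triples that are closed into triangles. The function name and the four-node toy graph are hypothetical, and nodes with fewer than two neighbors are assigned a local coefficient of zero by convention here.

```python
from itertools import combinations


def clustering_summary(edges):
    """Return (local coefficients, mean clustering coefficient, transitivity)."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    local = {}
    closed = 0   # connected triples that are closed (three per triangle)
    triples = 0  # all connected triples, counted at their center node
    for node, nbrs in neighbors.items():
        k = len(nbrs)
        possible = k * (k - 1) // 2
        linked = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
        local[node] = linked / possible if possible else 0.0
        closed += linked
        triples += possible

    mean_clustering = sum(local.values()) / len(local)
    transitivity = closed / triples if triples else 0.0
    return local, mean_clustering, transitivity


# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
local, mean_c, trans = clustering_summary([(0, 1), (0, 2), (1, 2), (2, 3)])
```

The two summary numbers differ even in this tiny graph because the mean clustering coefficient weights every node equally, whereas the transitivity weights nodes by their number of neighbor pairs and is therefore dominated by high-degree nodes.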

The data also includes limited demographic information provided by users on their individual pages: gender, class year, and data fields that represent (using anonymous numerical identifiers) high school, major, and dormitory residence (or “House” at Caltech). In situations in which individuals elected not to volunteer a demographic characteristic, we use an additional “Missing” label. These characteristics allow us to make comparisons between different universities, under the assumption (per the discussion in Ref. [2]) that the communities and other elements of structural organization in Facebook networks reflect (even if imperfectly) the social communities and organization of the offline networks on which they’re based.

For instance, at the level of individual ties, the tendency for users to be friends with other users who have similar characteristics can be quantified by the assortativity of the links relative to that characteristic. Degree assortativity (or degree correlation) can be calculated as the Pearson correlation coefficient of the degrees at the two ends of each edge. Although many social networks tend to be positively assortative with respect to degree, we find that the degree assortativity is negative for Caltech and is very small for UNC. A general measure of scalar assortativity relative to a categorical variable is given by

$$
r = \frac{\mathrm{tr}(e) - \|e^2\|}{1 - \|e^2\|}\,, \tag{3.1}
$$

where $e = E/\|E\|$ is the normalized mixing matrix, the elements $E_{ij}$ give the number of edges in the network that connect a node of type $i$ (e.g., a person with a given major) to a node of type $j$, and the entry-wise matrix $1$-norm $\|E\|$ is equal to the sum of all entries of $E$. Comparing assortativities for various categories shows, for example, that assortativities by dormitory and class year (treated as a categorical variable) are high for all five institutions; assortativities by major are low for all five institutions; and assortativities by high school and gender are less consistent across institutions. The relative sizes of the different assortativities also vary across institutions, which is similar to what we will see below with communities. Going beyond this measure of local assortativity by characteristics, our major focus for this article is on the organization of the communities of these five Facebook networks based on these various categories. We discuss this in detail in Section 4.
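The categorical assortativity coefficient can be evaluated from an edge list and a node-to-category map as sketched below, using the standard form $r = [\mathrm{tr}(e) - \|e^2\|]/[1 - \|e^2\|]$ with the normalized mixing matrix $e$. Each undirected edge is counted once in each direction so that the mixing matrix is symmetric. The function name, the category labels, and the toy graph are hypothetical.

```python
def assortativity(edges, node_type):
    """Categorical assortativity r computed from the normalized mixing matrix."""
    types = sorted(set(node_type.values()))
    idx = {t: k for k, t in enumerate(types)}
    m = len(types)

    # E[i][j]: number of edge ends joining a type-i node to a type-j node.
    E = [[0] * m for _ in range(m)]
    for u, v in edges:
        i, j = idx[node_type[u]], idx[node_type[v]]
        E[i][j] += 1
        E[j][i] += 1

    total = sum(map(sum, E))
    e = [[x / total for x in row] for row in E]
    trace = sum(e[k][k] for k in range(m))

    # The sum of all entries of the matrix product e*e equals
    # sum_k (column sum k) * (row sum k).
    row_sums = [sum(r) for r in e]
    col_sums = [sum(e[i][k] for i in range(m)) for k in range(m)]
    e2_norm = sum(col_sums[k] * row_sums[k] for k in range(m))

    # Undefined (division by zero) if every node has the same type.
    return (trace - e2_norm) / (1 - e2_norm)


# Toy example: two same-major friendships plus one friendship across majors.
majors = {0: "math", 1: "math", 2: "physics", 3: "physics"}
r_mixed = assortativity([(0, 1), (2, 3), (1, 2)], majors)
r_perfect = assortativity([(0, 1), (2, 3)], majors)
```

A value of $r = 1$ corresponds to perfectly assortative mixing (every tie joins nodes of the same type), and $r = 0$ corresponds to mixing that is indistinguishable from random with respect to the category.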

## 4 Facebook Communities

We algorithmically identify a set of communities in the largest connected component of each institution’s network using a modified version of Newman’s leading-eigenvector method [29] in conjunction with subsequent Kernighan-Lin node-swapping steps [21]. We compare the communities to partitions obtained by grouping users according to each of the self-identified characteristics: major, class year, high school, and dormitory/House.
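To make the spectral step concrete, the sketch below performs a single bisection in the spirit of the leading-eigenvector method [29]: it forms the modularity matrix, approximates its leading eigenvector by shifted power iteration, and splits the nodes by the signs of the eigenvector entries. It is only an illustration under simplifying assumptions. It does not include the modifications, the repeated subdivision, or the Kernighan-Lin refinement steps used in our computations, and all function names are hypothetical.

```python
import random


def modularity_matrix(adj):
    """B_ij = A_ij - k_i * k_j / (2m) for a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    two_m = sum(deg)
    return [[adj[i][j] - deg[i] * deg[j] / two_m for j in range(n)]
            for i in range(n)]


def leading_eigenvector(B, iters=500, seed=0):
    """Power iteration on B + shift*I, where the shift bounds the spectral
    radius, so the iteration converges toward the eigenvector of the
    largest (most positive) eigenvalue of B."""
    n = len(B)
    shift = max(sum(abs(x) for x in row) for row in B)
    rng = random.Random(seed)
    # Random start: the all-ones vector lies in the null space of B.
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(n)) + shift * v[i]
             for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def bisect(adj):
    """Assign each node to group 0 or 1 by the sign of its eigenvector entry."""
    v = leading_eigenvector(modularity_matrix(adj))
    return [1 if x >= 0 else 0 for x in v]


def modularity(adj, labels):
    """Modularity of a partition given as a list of group labels."""
    B = modularity_matrix(adj)
    two_m = sum(sum(row) for row in adj)
    n = len(adj)
    return sum(B[i][j] for i in range(n) for j in range(n)
               if labels[i] == labels[j]) / two_m
```

For a toy graph of two triangles joined by a single edge, this bisection separates the two triangles. A full implementation would recursively subdivide each group while the modularity increases and would then refine the result with node-swapping steps.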

We first revisit Caltech’s community structure, which we previously examined visually in Fig. LABEL:fig:Caltech. The partition of the largest connected component into 12 communities (which has modularity ) exhibits a strong correlation with House affiliation. To investigate this quantitatively, we calculate the similarity coefficients of this partition versus each partition constructed using one of the four available user characteristics (see Table LABEL:table:caltech_S). The raw values appear to be insufficient for the task of comparing these communities. Specifically, the ordering of the correlation strengths with the different demographics is not consistent across pair-counting indices, even among those that we know are linear transformations of one another. Additionally, although there is agreement that the correlation with House is strongest, the values differ widely in how much they set apart the House correlation, with and seemingly indicating that the correlation with House is only marginally stronger than that with high school even though Caltech contains very few students at any one time who come from the same high school.
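For readers who want to reproduce such comparisons, the following sketch computes three standard pair-counting similarity coefficients (Rand [34], Jaccard, and Fowlkes-Mallows [9]) for two partitions of the same node set. The variable names `w11`, `w10`, `w01`, and `w00` are labels chosen for this example: they count the node pairs placed together in both partitions, together only in the first, together only in the second, and together in neither.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(labels_a, labels_b):
    """Count node pairs by whether they share a group in each partition."""
    n = len(labels_a)
    total_pairs = comb(n, 2)
    # Pairs that share a group in partition A, in partition B, and in both.
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    w11 = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    w10 = same_a - w11
    w01 = same_b - w11
    w00 = total_pairs - w11 - w10 - w01
    return w11, w10, w01, w00


def rand_index(labels_a, labels_b):
    w11, w10, w01, w00 = pair_counts(labels_a, labels_b)
    return (w11 + w00) / (w11 + w10 + w01 + w00)


def jaccard_index(labels_a, labels_b):
    w11, w10, w01, _ = pair_counts(labels_a, labels_b)
    return w11 / (w11 + w10 + w01)


def fowlkes_mallows_index(labels_a, labels_b):
    w11, w10, w01, _ = pair_counts(labels_a, labels_b)
    return w11 / sqrt((w11 + w10) * (w11 + w01))
```

Because the coefficients weight the four pair counts differently, their raw values need not order a set of demographic partitions consistently, which is the difficulty described above.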

These apparent disagreements in interpretation across values occur even though we know that their corresponding $p$-values in the (unobtained) random distributions are identical. While we cannot directly calculate those $p$-values, the $z$-scores for each (see Section LABEL:subsec:zscores) in Table LABEL:table:caltech_S indicate that the correlation with high school is the only one of the four demographic characteristics that is not statistically significant. We note that the ordering of the VI scores in Table LABEL:table:caltech_S is consistent with that of the $z$-scores but recall that such agreement of ordering is not consistently observed in Fig. LABEL:fig:ZvZR. The $z$-scores provide a consistent interpretation of the roles of the four characteristics in this Caltech data: House is most important, followed distantly by year and major (in descending order), with no significant correlation with high school. Because of the close agreement between the , , and scores in Fig. LABEL:fig:ZvZR and Table LABEL:table:caltech_S, we henceforth restrict attention to the analytically-obtained values.
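The idea of standardizing a pair-counting index can be sketched with a permutation test. The code below is not the analytical formula discussed in this paper. It is a Monte Carlo stand-in that estimates the $z$-score of the number of node pairs grouped together in both partitions by repeatedly shuffling one set of labels. Shuffling keeps all group sizes fixed, so the Rand coefficient is an increasing linear function of this count and has the same $z$-score. The function names, the number of permutations, and the seed are assumptions for illustration.

```python
import random
from collections import Counter
from math import comb
from statistics import mean, pstdev


def shared_pairs(labels_a, labels_b):
    """Number of node pairs that share a group in both partitions."""
    return sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())


def permutation_z_score(labels_a, labels_b, n_perm=1000, seed=0):
    """Estimate the z-score of shared_pairs under random relabeling.

    The labels of the second partition are shuffled, which preserves the
    group sizes of both partitions (the usual permutation null model).
    """
    rng = random.Random(seed)
    observed = shared_pairs(labels_a, labels_b)
    shuffled = list(labels_b)
    null_values = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null_values.append(shared_pairs(labels_a, shuffled))
    sigma = pstdev(null_values)
    if sigma == 0:
        raise ValueError("null distribution is degenerate; z-score undefined")
    return (observed - mean(null_values)) / sigma
```

A large positive value indicates that the two partitions agree far more often than expected at random. An analytical expression avoids the sampling noise and the cost of the permutations.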

Before concluding our discussion of Caltech, we acknowledge the potentially important effects of missing demographic data, as a significant number of users did not volunteer an affiliation (as indicated in Table LABEL:table:sizemissing and by the purple wedges of Fig. LABEL:fig:Caltech). One can approach the issue of missing data using sophisticated tools such as multiple imputation, likelihood, or weighting methods [16]. A simpler approach is to investigate the effects on the measured correlations by various restrictions of the data. We consider three such protocols: inclusion, pairwise removal, and listwise removal. Inclusion, which we use in Table LABEL:table:caltech_S, treats the missing labels like any other category, erroneously grouping all such users together in the demographic partition. We apply pairwise removal separately for each demographic comparison with the community structure. In terms of a contingency table of demographic rows and community columns, this amounts to a deletion of the row corresponding to “Missing.” Listwise removal restricts the comparisons to the subset of users who volunteered all four of the studied demographic characteristics. We stress that these protocols do not affect the community assignments, which we obtained using the complete network data. Other restrictions or combinations of this data (such as single-gender restrictions) can also be fruitfully explored, but such investigations are beyond the scope of the present article.
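The three protocols can be expressed as simple filters applied before comparing partitions. In the sketch below, a missing response is represented by `None` and the demographic fields are stored as parallel lists. Both choices, and the function names, are assumptions for this example and not a description of our data format. In every case the community labels are computed beforehand from the complete network and are only subset here.

```python
def inclusion(communities, demographic):
    """Treat missing responses as one additional category."""
    labels = ["Missing" if d is None else d for d in demographic]
    return list(communities), labels


def pairwise_removal(communities, demographic):
    """Drop users who did not report this particular characteristic."""
    keep = [i for i, d in enumerate(demographic) if d is not None]
    return [communities[i] for i in keep], [demographic[i] for i in keep]


def listwise_removal(communities, demographics):
    """Keep only users who reported every characteristic.

    demographics : dict mapping a field name to its list of responses
    """
    keep = [i for i in range(len(communities))
            if all(column[i] is not None for column in demographics.values())]
    kept_communities = [communities[i] for i in keep]
    kept_fields = {name: [column[i] for i in keep]
                   for name, column in demographics.items()}
    return kept_communities, kept_fields
```

Pairwise removal is applied once per characteristic, so the retained users can differ from one comparison to the next, whereas listwise removal uses a single common subset for all four comparisons.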

In Table LABEL:table:zscores, we present the $z$-scores for all four community-demographic comparisons using each of the three missing data protocols at the five universities we study. We caution that because of network-size effects (reflecting the different numbers of nodes in different examples), $z$-score values cannot typically be directly compared across institutions. Accordingly, our primary conclusions are about the statistical significances and rank orderings of the demographic correlations separately in each university. Our previous conclusions about the Caltech community structure remain largely consistent across all three missing data protocols: House is most strongly correlated with the communities, followed distantly by year and major (in descending order), with no statistically significant correlation with high school. While House remains strongly correlated with communities in all three protocols, the correlations with year and major appear to be only marginally statistically significant in the analysis with listwise removal.

In contrast with Caltech, the communities at each of the other four institutions that we study correlate primarily with class year (see Table LABEL:table:zscores). Moreover, these correlations are not as dominant as House is at Caltech, as each of the four characteristics possesses statistically significant correlations with the community structures at the other four institutions (except high school in listwise removal at Georgetown). We show the 12 communities identified at Princeton colored both by class year and by major in Fig. LABEL:fig:PrincetonbyYearMajor. Compared with the strong correlation between communities and House affiliation at Caltech, these visual depictions of the Princeton communities do not seem to indicate as strong a correlation with year despite the very large corresponding $z$-score (which again cautions against direct comparison of values in networks of different sizes). We remark that the size of the Princeton data set, with over 8500 nodes (6575 in the largest connected component), is disproportionately large relative to the institution’s size; this is presumably a result of the relatively early Facebook adoption there.

The $z$-scores in Table LABEL:table:zscores reveal that Princeton students break up into communities primarily according to class year (among the four demographic categories available to us), and dormitory gives the second highest correlation. While major is also significant, the correlation with high school appears to be only marginally significant in protocols that remove missing data. One can draw similar conclusions about Georgetown from Table LABEL:table:zscores; the only qualitative difference is the possible lack of significance of high school at Georgetown (as compared to the marginal significance at Princeton) that is suggested by the more stringent missing-data protocols.

Similarly, the $z$-scores calculated for the UNC network partitioned into 5 communities suggest that class year is the primary organizing characteristic and that dormitory residence is also prominent. High school and major have smaller but significant positive correlations with the community structure. The other large state university that we consider is the University of Oklahoma, which is also partitioned into 5 communities. As at UNC, the dominant correlation of the Oklahoma communities is with year, the secondary correlation is with dormitory, and both high school and major have statistically significant correlations. Unlike at UNC, however, the disparity between the correlations with year and with dormitory does not appear to be as wide at Oklahoma. In contrast to Princeton and Georgetown, communities at both UNC and Oklahoma maintain unquestionably significant correlations with high school in both missing-data protocols.

We close this section by cautioning about interpretations of conclusions drawn from the numbers in Table LABEL:table:zscores, even though they indicate some interesting differences among the institutions that we studied. In particular, one should of course be careful about how such numbers might be influenced by our methodologies. Although we have provided three different protocols for handling missing data, other effects might be similarly worthy of study. For instance, one should be wary of the possible influence of the selected definition of “community” and the method of its detection. There are numerous definitions and methods available (again see Refs. [33, 8]), and a more definitive analysis of the connections between communities and characteristics in such networks should more fully explore multiple notions of community, possibly hierarchical structures, and communities at different resolutions.

As a simple example of comparing results from different community-detection methods, we compare the 12-community Caltech partition with that obtained for a 7-community partition (with ), which we obtained using a simpler spectral modularity-optimization implementation. Despite the necessarily different details of these two community structures, the qualitative conclusions from the two partitions are the same: House provides the dominant correlation, followed distantly by year and major, and there is again no significant correlation with high school. Applying this same “weaker” (in the sense of consistently resulting in partitions of lower modularity) community-detection implementation to the other four institutions also typically yields results that agree with those that we report above: year has the strongest correlation with communities and is followed by dormitory. The role of high school appears to be more pronounced in these lower-modularity partitions, as one obtains statistically significant correlations with the communities at Georgetown and Princeton and even stronger correlations with the communities at UNC and Oklahoma.

We also stress the difference between causation and correlation. In this paper, we have examined correlations. As discussed in the sociological literature on SNSs (see [2] and references therein), it is obviously very interesting and important to attempt to discern which common characteristics have resulted from friendships and which ones might perhaps influence the formation of friendships. In terms of the individual characteristics discussed above, high school and class year are known prior to the formation of these Facebook links, so one would expect those particular correlations to also indicate how some friendships might have formed. Common residences and majors, on the other hand, can both encourage new friendships and arise because of them. We note, finally, that SNS friendships provide only a surrogate for offline ones, so that one can also expect to find some differences between the community structures of Facebook networks and the real-life networks that they imperfectly represent [2].

## 5 Conclusions

We have demonstrated that analysis of community structure is useful for studying the online social networks of universities and inferring interesting insights about the prominent driving forces of community development in their corresponding offline social networks. We investigated various measures for comparing algorithmically-identified communities in Facebook networks with those obtained by grouping individuals according to self-identified characteristics. We found that $z$-scores of pair-counting indices provide an immediate (though not quantitatively perfect) interpretation about the likelihood that such values might arise at random, indicating significant correlations between the algorithmically-identified communities and multiple self-identified characteristics. Such calculations indicate that the organizational structure at Caltech, which depends very strongly on House affiliation, is starkly different from those of the other universities that we studied. The observed heterogeneity in the communities, even at an institution like Caltech whose social structure seems to be mostly dominated by a single feature (House affiliation), underscores the important point that networks typically have multiple organizational forces [33]. We hope that our work leads to a wider comparative study that might increase understanding about the different factors that drive the social organization of universities. The present paper attempts to provide foundational steps for such comparative investigations by conveying a meaningful methodology.

## Acknowledgements

We thank Adam D’Angelo and Facebook for providing the data used in this study. We also acknowledge Skye Bender-de Moll, Danah Boyd, Barry Cipra, Barbara Entwisle, Katie Faust, Avi Feller, Dan Fenn, James Gleeson, Sandra González-Bailón, Justin Howell, Nick Jones, Franziska Klingner, Marco van der Leij, Tom Maccarone, Jim Moody, Mark Newman, Andy Shaindlin, and Ashton Verdery for useful discussions. We are especially indebted to Aaron Clauset and James Fowler for thorough readings of a draft of this manuscript and to Christina Frost for developing some of the graph visualizations we used. (The code is available at http://netwiki.amath.unc.edu/VisComms.) ALT was funded by the NSF through the UNC AGEP (NSF HRD-0450099) and by the UNC ECHO program. EDK’s primary contributions to this project were funded by Caltech’s Summer Undergraduate Research Fellowship (SURF) program. PJM was funded by the NSF (DMS-0645369) and by start-up funds provided by the Institute for Advanced Materials, Nanoscience & Technology and the Department of Mathematics at the University of North Carolina at Chapel Hill. MAP did some work on this project while a member of the Center for the Physics of Information at Caltech and also acknowledges a research award (#220020177) from the James S. McDonnell Foundation.

## References

- [1] D. Boyd, Why youth (heart) social network sites: The role of networked publics in teenage social life, in MacArthur Foundation Series on Digital Learning - Youth, Identity, and Digital Media Volume, D. Buckingham, ed., MIT Press, Cambridge, MA, 2007, pp. 119–142.
- [2] D. M. Boyd and N. B. Ellison, Social network sites: Definition, history, and scholarship, Journal of Computer-Mediated Communication, 13 (2007), p. 11.
- [3] R. L. Brennan and R. J. Light, Measuring agreement when two observers classify people into categories not defined in advance, British Journal of Mathematical and Statistical Psychology, 27 (1974), pp. 154–163.
- [4] R. J. Brook and W. D. Stirling, Agreement between observers when the categories are not specified in advance, British Journal of Mathematical and Statistical Psychology, 37 (1984), pp. 271–282.
- [5] G. Caldarelli, Scale-Free Networks: Complex Webs in Nature and Technology, Oxford University Press, Oxford, United Kingdom, 2007.
- [6] T. Callaghan, P. J. Mucha, and M. A. Porter, Random walker ranking for NCAA division I-A football, American Mathematical Monthly, 114 (2007), pp. 761–777.
- [7] R. J. G. B. Campello, A fuzzy extension of the Rand index and other related indexes for clustering and classification assessment, Pattern Recognition Letters, 28 (2007), pp. 833–841.
- [8] S. Fortunato, Community detection in graphs, Physics Reports, 486 (2010), pp. 75–174.
- [9] E. B. Fowlkes and C. L. Mallows, A method for comparing two hierarchical clusterings, Journal of the American Statistical Association, 78 (1983), pp. 553–569.
- [10] T. M. J. Fruchterman and E. M. Reingold, Graph drawing by force-directed placement, Software—Practice and Experience, 21 (1991), pp. 1129–1164.
- [11] M. Girvan and M. E. J. Newman, Community structure in social and biological networks, Proceedings of the National Academy of Sciences, 99 (2002), pp. 7821–7826.
- [12] M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou, A walk in Facebook: Uniform sampling of users in online social networks. arXiv:0906.0060, 2009.
- [13] M. C. González, H. J. Herrmann, J. Kertész, and T. Vicsek, Community structure and ethnic preferences in school friendship networks, Physica A, 379 (2007), pp. 307–316.
- [14] P. Good, Permutation, Parametric, and Bootstrap Tests of Hypotheses, Springer-Verlag, New York, NY, 2005.
- [15] R. Guimerà and L. A. N. Amaral, Functional cartography of complex metabolic networks, Nature, 433 (2005), pp. 895–900.
- [16] N. J. Horton and K. P. Kleinman, Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models, The American Statistician, 61 (2007), pp. 79–90.
- [17] L. Hubert, Nominal scale response agreement as a generalized correlation, British Journal of Mathematical and Statistical Psychology, 30 (1977), pp. 98–103.
- [18] L. Hubert and P. Arabie, Comparing partitions, Journal of Classification, 2 (1985), pp. 193–218.
- [19] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1988.
- [20] B. Karrer, E. Levina, and M. E. J. Newman, Robustness of community structure in networks, Physical Review E, 77 (2008), p. 046119.
- [21] B. W. Kernighan and S. Lin, An efficient heuristic procedure for partitioning graphs, The Bell System Technical Journal, 49 (1970), pp. 291–307.
- [22] E. Kulinskaya, Large sample results for permutation tests of association, Communications in Statistics – Theory and Methods, 23 (1994), pp. 2939–2963.
- [23] K. Lewis, J. Kaufman, M. Gonzalez, M. Wimmer, and N. A. Christakis, Tastes, ties, and time: A new (cultural, multiplex, and longitudinal) social network dataset using Facebook.com, Social Networks, 30 (2008), pp. 330–342.
- [24] N. Mantel, The detection of disease clustering and a generalized regression approach, Cancer Research, 27 (1967), pp. 209–220.
- [25] A. Mayer and S. L. Puller, The old boy (and girl) network: Social network formation on university campuses, Journal of Public Economics, 92 (2008), pp. 328–347.
- [26] M. Meilă, Comparing clusterings — an information based distance, Journal of Multivariate Analysis, 98 (2007), pp. 873–895.
- [27] S. Melnik, A. Hackett, M. A. Porter, P. J. Mucha, and J. P. Gleeson, The unreasonable effectiveness of tree-based theory for networks with clustering. arXiv:1001.1439, 2010.
- [28] M. E. J. Newman, The structure and function of complex networks, SIAM Review, 45 (2003), pp. 167–256.
- [29] M. E. J. Newman, Finding community structure in networks using the eigenvectors of matrices, Physical Review E, 74 (2006), p. 036104.
- [30] M. E. J. Newman, Networks: An Introduction, Oxford University Press, Oxford, U.K., 2010.
- [31] J.-P. Onnela, J. Saramäki, J. Hyvönen, G. Szabó, D. Lazer, K. Kaski, J. Kertész, and A.-L. Barabási, Structure and tie strengths in mobile communication networks, Proceedings of the National Academy of Sciences, 104 (2007), pp. 7332–7336.
- [32] M. A. Porter, P. J. Mucha, M. E. J. Newman, and C. M. Warmbrand, A network analysis of committees in the United States House of Representatives, Proceedings of the National Academy of Sciences, 102 (2005), pp. 7057–7062.
- [33] M. A. Porter, J.-P. Onnela, and P. J. Mucha, Communities in networks, Notices of the American Mathematical Society, 56 (2009), pp. 1082–1097, 1164–1166.
- [34] W. M. Rand, Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66 (1971), pp. 846–850.
- [35] G. Robins, P. Pattison, Y. Kalish, and D. Lusher, An introduction to exponential random graph (p*) models for social networks, Social Networks, 29 (2007), pp. 173–191.
- [36] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications, Structural Analysis in the Social Sciences, Cambridge University Press, Cambridge, UK, 1994.
- [37] A. S. Waugh, L. Pei, J. H. Fowler, P. J. Mucha, and M. A. Porter, Party polarization in congress: A network science approach. arXiv:0907.3509, 2010.
- [38] Y. Zhang, A. J. Friend, A. L. Traud, M. A. Porter, J. H. Fowler, and P. J. Mucha, Community structure in Congressional cosponsorship networks, Physica A, 387 (2008), pp. 1705–1712.