Social Structure of Facebook Networks

Abstract

We study the social structure of Facebook “friendship” networks at one hundred American colleges and universities at a single point in time, and we examine the roles of user attributes—gender, class year, major, high school, and residence—at these institutions. We investigate the influence of common attributes at the dyad level in terms of assortativity coefficients and regression models. We then examine larger-scale groupings by detecting communities algorithmically and comparing them to network partitions based on the user characteristics. We thereby compare the relative importances of different characteristics at different institutions, finding for example that common high school is more important to the social organization of large institutions and that the importance of common major varies significantly between institutions. Our calculations illustrate how microscopic and macroscopic perspectives give complementary insights on the social organization at universities and suggest future studies to investigate such phenomena further.

1 Introduction

Since their introduction, social networking sites (SNSs) such as Friendster, MySpace, Facebook, Orkut, LinkedIn, and myriad others have attracted hundreds of millions of users, many of whom have integrated SNSs into their daily lives to communicate with friends, send e-mails, solicit opinions or votes, organize events, spread ideas, find jobs, and more (Boyd and Ellison, 2007). Facebook, an SNS launched in February 2004, now overwhelms numerous aspects of everyday life, and it has become an immensely popular societal obsession (Boyd, 2007b; Boyd and Ellison, 2007; Lewis et al., 2008b; Mayer and Puller, 2008). Facebook members can create self-descriptive profiles that include links to the profiles of their “friends,” who may or may not be offline friends. Facebook requires that anybody who one wants to add as a friend confirm the relationship, so Facebook friendships define a network (graph) of reciprocated ties (undirected edges) that connect individual users.

The emergence of SNSs such as Facebook and MySpace has revolutionized the availability of social and demographic data, which has in turn had a significant impact on the study of social networks (Boyd and Ellison, 2007; Krebs, 2008; Lievrouw and Livingstone, 2005). It is possible to acquire very large data sets from SNSs, though of course the population online and actively using SNSs is a biased sample of the broader population. Services like Facebook also contain large quantities of demographic data, as many users now voluntarily reveal voluminous amounts of detailed personal information. An especially exciting aspect of studying SNSs is that they provide an opportunity to examine social organization at unprecedented levels of size and detail, and they also provide new venues to test sampling effects (Kurant et al., 2011). One can investigate the structure of an SNS like Facebook to examine it as a network in its own right, and ideally one can also try to take one step further and infer interesting insights regarding the offline social networks that an SNS imperfectly parallels. Most people tend to draw their Facebook friends from their real-life social networks (Boyd and Ellison, 2007), so it is not entirely unreasonable to use Facebook networks as a proxy for an offline social network. (Of course, as noted by Hogan (2009), one does need to be aware of significant limitations when taking such a leap of faith.)

Social scientists, information scientists, and physical scientists have all jumped on the SNS data bandwagon (Rosenbloom, 2007). It would be impossible to exhaustively cite all of the research in this area, so we only highlight a few results; additional references can be found in the review by Boyd and Ellison (2007). Boyd (2007a) also wrote a popular essay about her empirical study of Facebook and MySpace, concluding that Facebook tends to appeal to a more elite and educated cross section than MySpace. The company RapLeaf (Sodera, 2008) has compiled global demographics on the age and gender usage of numerous SNSs. Other recent studies have investigated the manifestation on SNSs of race and ethnicity (Gajjala, 2007), religion (Nyland and Near, 2007), gender (Geidner et al., 2007; Hjorth and Kim, 2005), and national identity (Fragoso, 2006). Preliminary research has also suggested that online friendship networks can be exploited to improve shopper recommendation systems on websites such as Amazon (Zheng et al., 2007).

Several papers have attempted to increase understanding of how SNS friendships form. For example, Kumar et al. (2006) examined preferential attachment models of SNS growth, concluding that it is important to consider different classes of users. Lampe et al. (2007) explored the relationship between profile elements and number of Facebook friends, and other scholars have examined the importance of geography (Liben-Nowell et al., 2005) and online message activity (Golder et al., 2007) to online friendship formation. Other papers have established strong correlations between network participation and website activity, including the motivation of people to join particular groups (Backstrom et al., 2006), the recommendations of online groups (Spertus et al., 2005), online messages and friendship formation (Golder et al., 2007), interaction activity versus sense of belonging (Chin and Chignell, 2007), and the role of explicit ideological relationship designations in affecting voting behavior (Brzozowski et al., 2008; Hogg et al., 2008). Lewis et al. (2008b) used Facebook data for an entire class of freshmen at an unnamed, private American university to conduct a quantitative study of social networks and cultural preferences. The same data set was also used to examine user privacy settings on Facebook (Lewis et al., 2008a).

In the present paper, we study the complete Facebook networks of 100 American colleges and universities from a single-day snapshot in September 2005. This paper is a sequel to our previous research on 5 of these institutions (Traud et al., 2010), in which we developed some of the methodology that we employ here. In September 2005, one needed a .edu e-mail address to become a member of Facebook, and the majority of “friendship” ties were within the same institution. We thus ignore links between nodes at different institutions and study the Facebook networks of the 100 institutions as 100 separate networks. For each network, we have categorical data encompassing the gender, major, class year, high school, and residence (e.g., dormitory, House, or fraternity) of the users. We examine homophily and community structure (network partitions that are obtained algorithmically) for each of the networks and compare the community structure to partitions based on the given categorical data. We thereby compare and contrast the organizations of the 100 different Facebook networks, which arguably allows us to compare and contrast the organizations of the underlying university social networks that they imperfectly represent. In addition to the inherent interest of these Facebook networks, our investigation is important for subsequent use of these networks (which were formed via ostensibly the same generative mechanism online) as benchmark examples for numerous types of computations, such as new community detection methods.

The remainder of this paper is organized as follows. We first discuss the Facebook data and present the methods that we used for testing homophily at the dyad level and demographic prevalences at the community level. We then present and discuss results on the largest connected components of the networks, student-only subnetworks, and single-gender subnetworks. Finally, we summarize and discuss our findings.

2 Data

The data that we use was sent directly to us in anonymized form by Adam D’Angelo of Facebook. It consists of the complete set of users (nodes) from the Facebook networks at each of 100 American institutions (which we enumerate in Table 1) and all of the “friendship” links between those users’ pages as they existed in September 2005. The data clearly identifies most institutions, although there are a small number of disambiguation problems. For instance, 4 different “UC” institutions plus “Cal” are in the data, and there are 2 “Texas” listings. Each institution in the data includes a number appearing as part of its name that appears to correspond to the order in which each institution “joined” Facebook. The data can be downloaded at http://people.maths.ox.ac.uk/porterm/data/facebook100.zip.

Similar snapshots of Facebook data from 10 Texas institutions were analyzed recently by Mayer and Puller (2008), and a snapshot from “a diverse private college in the Northeast U.S.” was studied by Lewis et al. (2008b). Other studies of Facebook have typically obtained data either through surveys (Boyd and Ellison, 2007) or through various forms of automated sampling (Gjoka et al., 2010), thereby missing nodes and links that can impact the resulting graph structures and analyses. We consider only ties between people at the same institution, yielding 100 separate realizations of university social networks and allowing us to compare the structures at different institutions.

We consider four networks for each of the 100 Facebook data sets: the largest connected component of the full network (which we hereafter identify as “Full”), the largest connected component of the student-only network (“Student”), the largest connected component of the female-only network (“Female”), and the largest connected component of the male-only network (“Male”). The Male and Female networks are each subsets of the Full network rather than the Student network. Each network has a single type of unweighted, undirected connection between nodes and can thus be represented as an adjacency matrix $\mathbf{A}$ with elements $A_{ij}$ indicating the presence ($A_{ij} = 1$) or absence ($A_{ij} = 0$) of a tie between nodes $i$ and $j$. The resulting tangle of nodes and links, which we illustrate for the Reed College student Facebook network in Figure 1, can obfuscate any organizational structure that might be present.
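As a minimal sketch of the step described above, the following pure-Python function extracts the largest connected component of an undirected network, optionally restricted to a subset of nodes (for example, a single gender). The function name, the edge-list representation, and the toy data are assumptions made for illustration; they are not taken from the data files.

```python
from collections import deque


def largest_connected_component(nodes, edges):
    """Return the set of nodes in the largest connected component.

    nodes: iterable of hashable node ids to keep
    edges: iterable of (u, v) undirected edges; edges that touch a node
           outside `nodes` are ignored, so passing a subset of nodes
           yields the component of the induced subnetwork.
    """
    node_set = set(nodes)
    neighbors = {u: set() for u in node_set}
    for u, v in edges:
        if u in node_set and v in node_set and u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)

    seen = set()
    best = set()
    for start in node_set:
        if start in seen:
            continue
        # Breadth-first search from a node that has not been visited yet.
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical toy network: a triangle with a pendant node, plus a path.
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (5, 6), (6, 7)]
full_lcc = largest_connected_component(range(1, 8), toy_edges)
# Restricting to a subset of nodes can change which component is largest.
subset_lcc = largest_connected_component([1, 2, 4, 5, 6, 7], toy_edges)
```

In this toy example, removing node 3 breaks up the first group, so the largest component of the restricted network is the path on nodes 5, 6, and 7.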

The data also includes limited demographic (categorical) information that is volunteered by users on their individual pages: gender, class year, and (using anonymous numerical identifiers) high school, major, and residence. We use a “Missing” label for situations in which individuals did not volunteer a particular characteristic. The different characteristics allow us to make comparisons between institutions, under the assumption (see the discussion by Boyd and Ellison (2007)) that the communities and other elements of structural organization in Facebook networks reflect (even if imperfectly) the social communities and organization of the offline networks on which they’re based. It is an important research issue to determine just how imperfect this might be (Hogan, 2009), but this is far beyond the scope of the present paper (though we hope that others will take on this particular challenge). The conclusions that we draw in this paper apply directly to the university Facebook networks from September 2005, and we expect that they can provide insight about the real-world social networks at the institutions as well.

3 Methods

We study each network at both the dyad level and the community level. We first consider homophily (Wasserman and Faust, 1994; McPherson et al., 2001; Newman, 2010) quantified by assortativity coefficients using the available categorical data. For some of the smaller networks, we additionally perform independent logistic regression on node pairs to obtain the log odds contributions to edge presence between two nodes that have the same categorical-data value. We similarly fit exponential random graph models (ERGMs) (Handcock et al., 2008; Robins et al., 2007; Frank and Strauss, 1986; Wasserman and Pattison, 1996; Lubbers and Snijders, 2007) with triangle terms to these smaller networks. Finally, we partition the networks by algorithmically detecting communities (Porter et al., 2009; Fortunato, 2010), which we compare to the given categorical data using the technique in this paper’s prequel (Traud et al., 2010). Calculating assortativity values and log odds contributions allows us to examine “microscopic” features of the networks, while comparing algorithmic partitions of the networks to the categorical data allows us to examine their “macroscopic” features. As we illustrate below, both perspectives are important because they provide complementary insights.

3.1 Assortativity

A general measure of scalar assortativity relative to a categorical variable is given by Newman (2003, 2010):

$$ r = \frac{\mathrm{tr}(\mathbf{e}) - \|\mathbf{e}^2\|}{1 - \|\mathbf{e}^2\|} \tag{1} $$

where $\mathbf{e} = \mathbf{E}/\|\mathbf{E}\|$ is the normalized mixing matrix, the elements $E_{ij}$ indicate the number of edges in the network that connect a node of type $i$ (e.g., a person with a given major) to a node of type $j$, and the entry-wise matrix $1$-norm $\|\mathbf{E}\|$ is equal to the sum of all entries of $\mathbf{E}$. By construction, this formula yields $r = 0$ when the amount of assortative mixing is the same as that expected independently at random (i.e., $e_{ij}$ is simply the product of the fraction of nodes of type $i$ and the fraction of nodes of type $j$), and it yields $r = 1$ when the mixing is perfectly assortative.
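The assortativity coefficient can be sketched in a few lines of pure Python. The sketch below builds the mixing matrix from an undirected edge list, normalizes it, and evaluates the ratio of (trace minus the sum of entries of the squared matrix) to (one minus that sum). The function name and the toy data are hypothetical, and the code assumes that at least two categories appear on edge ends (otherwise the denominator vanishes).

```python
def assortativity(edges, category):
    """Assortativity coefficient r for a categorical node attribute.

    edges: iterable of (u, v) undirected edges
    category: dict mapping node -> category label
    """
    # E[(a, b)] counts edge ends: every undirected edge contributes to
    # both (a, b) and (b, a), so the mixing matrix is symmetric.
    E = {}
    for u, v in edges:
        a, b = category[u], category[v]
        E[(a, b)] = E.get((a, b), 0) + 1
        E[(b, a)] = E.get((b, a), 0) + 1
    total = sum(E.values())
    e = {key: count / total for key, count in E.items()}

    labels = {a for a, _ in e}
    trace = sum(e.get((a, a), 0.0) for a in labels)
    # Row sums of e. Because e is symmetric, the sum of all entries of
    # e^2 equals the sum of the squared row sums.
    row = {a: sum(e.get((a, b), 0.0) for b in labels) for a in labels}
    norm_e2 = sum(x * x for x in row.values())
    return (trace - norm_e2) / (1.0 - norm_e2)


# Hypothetical toy data: four nodes in two categories.
labels = {1: "x", 2: "x", 3: "y", 4: "y"}
r_assortative = assortativity([(1, 2), (3, 4)], labels)     # ties only within categories
r_disassortative = assortativity([(1, 3), (2, 4)], labels)  # ties only across categories
```

The two toy calls illustrate the extremes for this example: ties only within categories give perfectly assortative mixing, and ties only across the two categories give the most negative value.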

3.2 Logistic Regression and Exponential Random Graphs

We further measure the influence of the available user characteristics on the likelihood of a “friendship” tie via a fit by logistic regression (under an assumption of independent dyads) and by an ERGM specification that includes triangle terms. Our focus is on trying to calculate the propensity for two nodes with the same categorical value to form a tie. We consider each of the four categorical variables (major, residence, year, and high school) and use the ERGM package in R (Handcock et al., 2008) for both models (treating each network as undirected). We used R 2.11.1 and the statnet package version 2.1-1, and we note that different versions of R and statnet caused different degrees of convergence with the structural elements in the model. We obtained results for the 16 smallest institutions. (We did these calculations on a 32-bit operating system, which restricts the network sizes that can be processed.) Both models that we consider are based on a standard ERGM parametrization describing the distribution of graphs with model coefficients corresponding to statistics calculated from the adjacency matrix (with a normalizing factor to ensure that the formula yields a probability distribution) (Handcock et al., 2008; Robins et al., 2007; Frank and Strauss, 1986; Wasserman and Pattison, 1996; Lubbers and Snijders, 2007).

In the first model (logistic regression), we include five statistics (with five corresponding coefficients): the total density of ties (edges) and the common classifications (nodematch) from each of four node/user characteristics: residence, class year, major, and high school. For example, the nodematch coefficient for high school describes the additional log-odds predisposition for a “friendship” tie when two users are from the same high school. In all cases, we ignore possible contributions from missing characteristic data: two nodes with the same missing data field are not treated as having the same value for the characteristic. Rather than include gender explicitly in the model, we instead additionally fit the model to the single-gender subnetworks in order to be consistent with the treatment of gender in the community-level comparisons below. In the second model (an ERGM), we add a triangle statistic to account for the observed amount of transitivity in the network data. This gives a total of six coefficients: edges, common residence, common class year, common major, common high school, and the triangle coefficient.
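For orientation, the two models can be written schematically as follows. The symbols $\theta$, $g$, $\kappa$, and $\delta$ are notation assumed here for illustration and may differ from the notation of the cited references; $c$ ranges over residence, class year, major, and high school, and $\delta(c_i, c_j)$ equals 1 only when nodes $i$ and $j$ share the same non-missing value of characteristic $c$.

```latex
% General ERGM form: coefficients \theta weight graph statistics g(y),
% and \kappa(\theta) is the normalizing factor.
\[
  P_{\theta}(\mathbf{Y} = \mathbf{y})
    = \frac{\exp\{\theta^{\mathsf{T}} g(\mathbf{y})\}}{\kappa(\theta)}
\]

% Model 1 (independent dyads): ordinary logistic regression on node pairs.
\[
  \operatorname{logit} P(A_{ij} = 1)
    = \theta_{\mathrm{edges}} + \sum_{c} \theta_{c}\, \delta(c_i, c_j)
\]

% Model 2 (with a triangle statistic): the conditional log-odds of a tie
% also depends on the number of neighbors that i and j have in common.
\[
  \operatorname{logit} P(A_{ij} = 1 \mid \text{all other ties})
    = \theta_{\mathrm{edges}} + \sum_{c} \theta_{c}\, \delta(c_i, c_j)
      + \theta_{\mathrm{triangle}} \sum_{k \neq i,j} A_{ik} A_{jk}
\]
```

In the second expression, each coefficient $\theta_c$ is directly the additional log-odds of a tie for a matching pair. In the third, the triangle term couples the dyads, which is why that model can no longer be fit as an ordinary logistic regression.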

3.3 Community Detection

The global organization of social networks often includes coexisting modular (horizontal) and hierarchical (vertical) organizational structures, and myriad papers have attempted to interpret such organization through the computational identification of “community structure.” Communities are defined in terms of cohesive groups of nodes with more internal connections (between nodes in the same group) than external connections (between nodes in the group and nodes in other groups). As discussed at length in two recent review articles (Porter et al., 2009; Fortunato, 2010) and in references therein, the ensemble of techniques available to detect communities is both numerous and diverse. Existing techniques include hierarchical clustering methods such as single linkage clustering, centrality-based methods, local methods, optimization of quality functions such as modularity and similar quantities, spectral partitioning, likelihood-based methods, and more. Communities are considered to not be merely structural modules but are also expected to have functional importance because of the large number of common ties among nodes in a community. For example, communities in social networks might correspond to circles of friends or business associates and communities in the World Wide Web might encompass pages on closely-related topics. 
In addition to remarkable successes on benchmark problems, investigations of community structure have observed correspondence between communities and “ground truth” groups in diverse application areas, including the reconstruction of college football conferences (Girvan and Newman, 2002) and the investigation of such structures in algorithmic rankings (Callaghan et al., 2007); the investigation of committee assignments (Porter et al., 2005), legislation cosponsorship (Zhang et al., 2008), and voting blocs (Waugh et al., 2009; Mucha et al., 2010) in the United States Congress; the examination of functional groups in metabolic networks (Guimerà and Amaral, 2005); the study of ethnic preferences in school friendship networks (González et al., 2007); and the study of social structures in mobile-phone conversation networks (Onnela et al., 2007).

In the present paper, we investigate the community structures of the Facebook networks from each of the 100 colleges and universities. (See the visualization of the community structure for Reed College in Figure 2.) For each institution, we consider the Full, Student, Female, and Male networks. We seek to determine how well the demographic labels included in the data correspond to algorithmically computed communities. Assortativity provides a local measure of homophily, but that does not provide sufficient information to draw conclusions about the global organization of the Facebook networks. For example, two students who attended the same high school are typically more likely to be friends with each other than are two students who attended different high schools, but this will not necessarily have a meaningful community-level effect unless enough of the students went to common high schools. As we will see below, high school tends to be a much more dominant organizing characteristic of the social structure at the large institutions than at small institutions, presumably because of a significant frequency of common high school pairs at the large institutions.

We identify communities by optimizing the “modularity” quality function $Q = \sum_i (e_{ii} - b_i^2)$, where $e_{ij}$ denotes the fraction of ends of edges in group $i$ for which the other end of the edge lies in group $j$ and $b_i = \sum_j e_{ij}$ is the fraction of all ends of edges that lie in group $i$. High values of modularity correspond to community assignments with greater numbers of intra-community links than expected at random (with respect to a particular null model (Newman, 2006a; Porter et al., 2009; Fortunato, 2010)). Although numerous other community detection methods are also available, modularity optimization is perhaps the most popular way to detect communities and it has been successfully applied to many applications (Porter et al., 2009; Fortunato, 2010). One might also consider using a method that includes a resolution parameter (Reichardt and Bornholdt, 2006) to avoid issues with resolution limits (Fortunato and Barthelemy, 2007). However, our primary focus is on global organization of the networks, so we limit our attention to the default resolution of modularity. This focus arguably biases our study of communities to the largest structures, such as those influenced by common class year, making the observed correlations with other demographic characteristics even more striking.
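To make the quality function concrete, the following sketch evaluates modularity at the default resolution for a given hard partition: for each group, it takes the fraction of edge ends that lie inside the group with the other end also inside, minus the squared fraction of all edge ends attached to the group. It only scores a partition and does not perform any optimization; the function name and toy data are hypothetical.

```python
def modularity(edges, community):
    """Modularity of a hard partition of an undirected, unweighted network.

    edges: iterable of (u, v) undirected edges
    community: dict mapping node -> community label
    """
    edges = list(edges)
    ends = 2 * len(edges)   # total number of edge ends
    within = {}             # edge ends whose edge stays inside the group
    attached = {}           # all edge ends attached to the group
    for u, v in edges:
        cu, cv = community[u], community[v]
        attached[cu] = attached.get(cu, 0) + 1
        attached[cv] = attached.get(cv, 0) + 1
        if cu == cv:
            within[cu] = within.get(cu, 0) + 2
    return sum(
        within.get(c, 0) / ends - (attached[c] / ends) ** 2
        for c in attached
    )


# Hypothetical toy network: two triangles joined by a single edge.
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
one_group = {node: "A" for node in range(1, 7)}
q_split = modularity(toy_edges, two_groups)
q_trivial = modularity(toy_edges, one_group)
```

For this toy network, splitting along the bridging edge gives a modularity of 5/14, whereas placing every node in one group gives 0, which is why an optimizer would prefer the two-triangle partition.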

To try to ensure that the communities we detect are properties of the data rather than of the algorithms that we used, we optimize modularity (with default resolution) using 6 different combinations of spectral optimization, greedy optimization, and Kernighan and Lin (1970) (KL) node-swapping steps (in the manner discussed by Newman (2006b)). Specifically, we use (1) recursive partitioning by the leading eigenvector of a modularity matrix (Newman, 2006a), (2) recursive partitioning by the leading pair of eigenvectors (including the Richardson et al. (2009) extension of the method in Newman (2006a)), (3) the Louvain greedy method (Blondel et al., 2008), and each of these three supplemented with small increases in the quality that can be obtained using KL node swaps. Each of these 6 methods yields a community partition, and we obtain our comparisons (described in Section 3.4) by considering each of these 6 partitions.

Modularity optimization is NP-hard (Brandes et al., 2008), so one must be cautious about the large number of degenerate partitions in the modularity landscape (Good et al., 2010). However, by detecting coarse observables—in particular, the global organization of a Facebook network based on the given categorical data—and considering results that are averaged over multiple optimization methods, one can obtain interesting insights. The specific “best” partition will vary from one method to another, but some of the predicted coarse organizational structure of the networks (see below) is robust to the choice of community detection algorithm.

3.4 Comparing Communities to Node Data

Once we have detected communities for each institution, we will compare the algorithmically-obtained community structure to the available categorical data for the nodes. We recently developed a methodology to accomplish this goal in Traud et al. (2010) (where we considered only 5 institutions among the 100 in order to illustrate the techniques). This method of comparison can be applied to the output of any “hard partitioning” algorithm in which each node is assigned to precisely one community (cf. “soft partitioning” methods, in which communities can overlap). We briefly review that methodology here.

To compare a network partition to the categorical demographic data, we standardize (using a $z$-score) the Rand coefficient of the communities in that partition compared to partitioning based purely on each of the four categorical variables (one at a time). For each comparison, we calculate the Rand $z$-score in terms of the total number of pairs of nodes in the network $M$, the number of pairs that are in the same community $M_1$, the number of pairs that have the same categorical value $M_2$, and the number of pairs of nodes that are both in the same community and have the same categorical value $w$ (Traud et al., 2010). The Rand coefficient is given in terms of these quantities by $S = [w + (M - M_1 - M_2 + w)]/M$ (Rand, 1971). We then calculate the $z$-score for the Rand coefficient as (Hubert, 1977; Traud et al., 2010)

$$ z = \frac{1}{\sigma_w}\left(w - \frac{M_1 M_2}{M}\right), \tag{2} $$

where

\begin{align}
\sigma_w^2 &= \frac{M}{16} - \frac{(4M_1 - 2M)^2 (4M_2 - 2M)^2}{256 M^2} + \frac{C_1 C_2}{16\, n(n-1)(n-2)} \tag{3} \\
&\quad + \frac{\left[(4M_1 - 2M)^2 - 4C_1 - 4M\right]\left[(4M_2 - 2M)^2 - 4C_2 - 4M\right]}{64\, n(n-1)(n-2)(n-3)}, \tag{4}
\end{align}

$n$ is the number of nodes in the network, the coefficients $C_1$ and $C_2$ are given by

\begin{align}
C_1 &= n(n^2 - 3n - 2) - 8(n+1)M_1 + 4\sum_i n_{i\cdot}^3, \tag{5} \\
C_2 &= n(n^2 - 3n - 2) - 8(n+1)M_2 + 4\sum_j n_{\cdot j}^3, \tag{6}
\end{align}

$n_{ij}$ denotes an element of a contingency table and indicates the number of nodes that are classified into the $i$th group of the first partition and the $j$th group of the second partition, $n_{i\cdot} = \sum_j n_{ij}$ is a row sum, and $n_{\cdot j} = \sum_i n_{ij}$ is a column sum. Each $z$-score indicates the deviation from randomness in comparing the community structure with the partitioning based purely on that single demographic characteristic. One needs to be cautious when interpreting such deviations from randomness as a strength of correlation. In particular, given the dependence on system size inherent in this measure, one should not overinterpret the relative values of $z$-scores from different institutions. Nevertheless, the $z$-scores provide a reasonable proxy quantity both for the statistical significance of correlation and for the relative strength of correlation in a specified network.
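The comparison can be sketched in pure Python as follows. The function builds the contingency table of two hard partitions, counts the relevant node pairs, and standardizes the number of co-classified pairs by its mean and variance under random relabeling, using the Hubert (1977) variance expression as we understand it. The function name and the toy labels are hypothetical, and the sketch assumes at least four nodes and a nonzero variance.

```python
from math import sqrt


def rand_z_score(part_a, part_b):
    """Rand coefficient and its z-score for two hard partitions.

    part_a, part_b: equal-length sequences of group labels, one per node.
    Returns (S, z). Assumes at least 4 nodes and a nonzero variance.
    """
    n = len(part_a)
    # Contingency table with its row sums and column sums.
    table, rows, cols = {}, {}, {}
    for a, b in zip(part_a, part_b):
        table[(a, b)] = table.get((a, b), 0) + 1
        rows[a] = rows.get(a, 0) + 1
        cols[b] = cols.get(b, 0) + 1

    def pairs(k):
        return k * (k - 1) // 2

    M = pairs(n)                                # all pairs of nodes
    M1 = sum(pairs(k) for k in rows.values())   # pairs together in part_a
    M2 = sum(pairs(k) for k in cols.values())   # pairs together in part_b
    w = sum(pairs(k) for k in table.values())   # pairs together in both
    S = (w + (M - M1 - M2 + w)) / M

    base = n * (n * n - 3 * n - 2)
    C1 = base - 8 * (n + 1) * M1 + 4 * sum(k ** 3 for k in rows.values())
    C2 = base - 8 * (n + 1) * M2 + 4 * sum(k ** 3 for k in cols.values())
    t1 = (4 * M1 - 2 * M) ** 2
    t2 = (4 * M2 - 2 * M) ** 2
    var_w = (
        M / 16
        - t1 * t2 / (256 * M ** 2)
        + C1 * C2 / (16 * n * (n - 1) * (n - 2))
        + (t1 - 4 * C1 - 4 * M) * (t2 - 4 * C2 - 4 * M)
        / (64 * n * (n - 1) * (n - 2) * (n - 3))
    )
    z = (w - M1 * M2 / M) / sqrt(var_w)
    return S, z


# Hypothetical toy labels: six nodes, communities versus a characteristic.
communities = [0, 0, 0, 1, 1, 1]
S_same, z_same = rand_z_score(communities, communities)
S_diff, z_diff = rand_z_score(communities, ["p", "q", "r", "p", "q", "r"])
```

For the identical six-node partitions, enumerating all relabelings by hand gives a mean of 2.4 and a variance of 1.44 for the number of co-classified pairs, which matches the closed-form expression and yields a z-score of 3. The second call compares the communities to a characteristic that cuts across them, which produces a negative z-score.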

4 Results

We now use the methods outlined in the previous section to study the Facebook networks. We first follow the order of presentation above and then make some observations in combinations. Complete results are available in the tables in the appendix.

4.1 Assortativity

We tabulate the assortativities based on gender, major, residence, class year, and high school for all networks (and subsets thereof) in Table 2.

For almost all of the institutions and each of the 4 network subsets, the class year attribute produces higher assortativity values than the other available demographic characteristics. However, Rice University (31), California Institute of Technology (36), University of Georgia (50), University of Michigan (67), Auburn University (71), and University of Oklahoma (97) are each examples in which residence provides the highest assortativity values (again, for each of the 4 network subsets). We discussed Caltech as a focal example in Traud et al. (2010), in which we introduced the community comparison methods that we employ below.

Other institutions have varying orderings of class year and residence assortativity among the 4 network subsets. At MIT (8), USF (51), Notre Dame (57), University of Maine (59), UC (61), UC (64), and MU (78), residence gives the highest assortativity in the Male networks. The UCF (52) Female network has its highest assortativity with residence. Both the Full network and the Male network for University of California at Santa Cruz (68) have their highest assortativity values with residence. Both the Male and Female networks at University of Illinois at Urbana-Champaign (20), Tulane (29), UC (33), Florida State University (53), Cal (65), University of Mississippi (66), University of Indiana (69), Texas (80), Texas (84), University of Wisconsin (87), Baylor (93), University of Pennsylvania (94), and University of Tennessee (95) have their highest assortativity values with residence; all other networks from these institutions have their highest assortativity with class year.

Some outlying observations can be tied directly to small samples. For example, Simmons (81) is a female-only college. It has only four males in the Full network; none of the males had any connections with another male, so the gender assortativity values for both the Full and Student components are very close to $0$. Similar gender numbers are also present in the data from Wellesley (22) and Smith (60).

4.2 Dyad-Level Regression and Exponential Random Graphs

We use the two statistical models described in Section 3.2 to study the 16 smallest institutions. The (dyad-independent) logistic regression model includes contributions from edges (network density) and matched user (node) characteristics for each of four demographic variables. We present the results for this model in Table 3. The second model that we consider is an ERGM, which supplements the first model with a structural triangle contribution. We present the results for this model in Table 4. These calculations give views of the networks at the microscopic (dyad-level) scale that supplement the results that we obtained using the assortativity statistics.

We consider the results from the 16 smallest institutions by fitting the models to each of their Full, Student, Female, and Male networks. Because all of the resulting model coefficients appear to be statistically significant at a $p$-value of less than , we interpret the importance of node matching on the different demographic characteristics directly from the magnitude of the corresponding model coefficients. We summarize the results for these 16 institutions using the box plots in Figures 3 and 4. The box plots identify the outliers by institution number: Caltech (36), Oberlin (44), Smith (60), Simmons (81), Vassar (85), and Reed (98). (As we have only performed this regression analysis for the 16 smallest institutions in the data, one should not jump to conclusions from this list of outliers.) For all institutions and all four types of networks for each institution, the highest coefficient in the employed ERGM (with triangle terms) is given for matching the High School category, and the value of this coefficient is significantly higher than those for the other node-matching coefficients. Only the Caltech (36) Female network has ERGM coefficients for Year, Residence, and High School that are very close to each other.

4.3 Comparison of Communities

We now discuss community-level results for each network using $z$-scores of the Rand coefficient to compare partitions obtained via algorithmic community detection to partitions based on each characteristic. That is, each community-detection result identifies a group assignment for each node, thereby producing a partition (called a “hard” partition) in which each node is assigned to exactly one community. One can also obtain a hard partition for each network by selecting a single characteristic and grouping nodes according to that characteristic. Every network that we study (including the subnetworks) has at least one $z$-score in the set with a value greater than . Although the distribution of Rand coefficients is decidedly not Gaussian, particularly in the tails of the distributions (Traud et al., 2010; Brook and Stirling, 1984; Kulinskaya, 1994), this threshold indicates that at least one characteristic in each network exhibits strong statistical significance. Moreover, we will see that the vast majority of our comparisons below exceed the threshold. (That is, they essentially lie outside 95% confidence intervals.)

To visualize and compare the varied strengths of organization according to the different demographic characteristics, we represent the four $z$-scores obtained for each network (Full, Student, Female, and Male) of an institution using 3-dimensional barycentric (tetrahedral) coordinates (Weisstein, 2011; Franklin, 2002). We start by setting all negative $z$-scores to $0$, as all observed negative $z$-score values are small enough to be statistically insignificant. We then normalize by the sum of the $z$-scores to obtain

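$$ \hat{z}_{c} = \frac{z_{c}}{z_{\mathrm{Major}} + z_{\mathrm{Residence}} + z_{\mathrm{Year}} + z_{\mathrm{High\ School}}}, \qquad c \in \{\mathrm{Major}, \mathrm{Residence}, \mathrm{Year}, \mathrm{High\ School}\}. \tag{7–10} $$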

From these 4 $z$-score values, we calculate coordinates located inside a tetrahedron. For example, one can obtain a tetrahedron whose vertices are , , , and ) with the transformation

(11)

The information from is implicitly included in (11) because of the normalization. Each of the 4 vertices of the tetrahedron corresponds to a limit in which the corresponding $z$-score completely dominates the other three $z$-scores. That is, at a vertex, the entire $z$-score sum arises from the corresponding component.
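The mapping from four scores to a point in a tetrahedron can be sketched as follows. The vertex coordinates below describe a regular tetrahedron with unit edge length and are an assumption made purely for illustration, as is the assignment of characteristics to vertices; the particular vertices and transformation used for the figures may differ. The sketch clips negative scores to zero, normalizes by the sum, and takes the corresponding weighted average of the vertices.

```python
from math import sqrt

# Assumed vertex placement (regular tetrahedron, unit edges). Both the
# coordinates and the vertex-to-characteristic assignment are hypothetical.
VERTICES = {
    "Major": (0.0, 0.0, 0.0),
    "Residence": (1.0, 0.0, 0.0),
    "High School": (0.5, sqrt(3) / 2, 0.0),
    "Year": (0.5, sqrt(3) / 6, sqrt(6) / 3),
}


def tetrahedral_point(z_scores):
    """Map four z-scores to barycentric weights and a 3D point.

    z_scores: dict with one z-score per key of VERTICES.
    Returns (weights, point), where the weights sum to 1.
    """
    # Negative z-scores are set to zero before normalizing.
    clipped = {name: max(z_scores[name], 0.0) for name in VERTICES}
    total = sum(clipped.values())
    if total == 0:
        raise ValueError("at least one z-score must be positive")
    weights = {name: value / total for name, value in clipped.items()}
    # The point is the weighted average of the four vertices.
    point = tuple(
        sum(weights[name] * VERTICES[name][d] for name in VERTICES)
        for d in range(3)
    )
    return weights, point


# Hypothetical z-scores: class year dominates, one score is negative.
weights_year, point_year = tetrahedral_point(
    {"Major": 0.0, "Residence": 0.0, "High School": -1.0, "Year": 10.0}
)
# Equal z-scores land at the centroid of the tetrahedron.
weights_equal, point_equal = tetrahedral_point(
    {"Major": 2.0, "Residence": 2.0, "High School": 2.0, "Year": 2.0}
)
```

Because the four weights sum to 1, any three of them determine the fourth, which is why three spatial coordinates suffice to represent all four normalized scores.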

Because of the strong role of class year, we visualize the tetrahedra from a perspective located above the vertex corresponding to class year and project the result into the opposing face of the tetrahedron. We calculate the point for each of the 6 algorithmic partitions of each network (i.e., using the aforementioned 6 different community detection methods). For each institution, we plot a disk whose center lies at the midpoint of these 6 coordinates. The width of each disk is proportional to the maximum observed difference between these 6 sets of coordinates (with these distances separated into bins of width , as indicated in the legends of Figures 5–8). For example, in Figure 5, the Pepperdine (86) results have a maximum distance of between partitions, so Pepperdine (86) is represented by one of the smallest disks. Harvard (1) has a maximum distance of 0.1581 between partitions; this lies in , so Harvard (1) is represented by one of the disks of second smallest size. We emphasize that the computed differences are much larger than the span of the depicted disks, whose sizes allow one to discern the results from different institutions.

In Figures 5–8, we show each of the 100 institutions, identified by number (see Table 1), using a disk that we have color-coded according to the Cartesian distance of its center from the Year vertex. Class year is the predominant organizing category among the ones present in the data, so most of the institutions are located very close to the Year vertex. We zoom in on the Year vertex for each figure in order to better discern the relative importance of class year at the institutions. Importantly, the social organization of a few institutions differs considerably from that of the majority. Each of these institutions lies close to the Residence vertex, so their community structures are organized predominantly according to dormitory residence. Foremost among these institutions are Rice (31) and California Institute of Technology (36). As we discussed in Traud et al. (2010), California Institute of Technology (Caltech) is well-known to be organized almost exclusively according to its undergraduate “House” system (Looijen and Porter, 2007).

In repeatedly observing a strong correlation of class year with community structure, it is relevant to recall that the community detection method that we have employed optimizes modularity at the default resolution. Because of the resolution limit of modularity (Fortunato and Barthelemy, 2007), it might be interesting to explore individual networks at different scales using resolution parameters (Reichardt and Bornholdt, 2006; Fortunato, 2010; Porter et al., 2009). We reiterate, however, that our focus in the present paper is on large-scale features rather than precise node membership of network partitions.
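To illustrate what a resolution parameter changes, the following sketch evaluates modularity with a resolution factor gamma multiplying the null-model term, in the spirit of Reichardt and Bornholdt (2006). It only scores a given partition of an unweighted, undirected simple graph; it does not perform any optimization, and the function name and toy graph are illustrative assumptions.

```python
from collections import defaultdict


def modularity(edges, community_of, gamma=1.0):
    """Q(gamma) = sum over communities c of [ L_c / m - gamma * (d_c / (2 m))**2 ].

    edges        : list of (u, v) pairs, each undirected edge listed once
    community_of : dict mapping node -> community label
    gamma        : resolution parameter (gamma = 1 is standard modularity)
    """
    m = len(edges)
    internal = defaultdict(int)    # L_c: edges with both ends in community c
    degree_sum = defaultdict(int)  # d_c: total degree of nodes in community c
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(
        internal[c] / m - gamma * (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}
```

In this toy graph, the two-triangle split scores higher than the single merged community at gamma = 1, whereas a sufficiently small gamma (for example, 0.2) favors the merged community. Varying gamma in this way probes structure at different scales.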

In Figure 5, we show the social organization tetrahedron for the Full networks (i.e., for the largest connected components of the complete networks) for each institution. Although the community structure of nearly all of the Full networks is organized overwhelmingly by class year, a few of them are also heavily influenced by dormitory residence. (We already mentioned above that Rice (31) and Caltech (36) are organized predominantly by Residence.) For example, dormitory residence also dominates the community structure at UC Santa Cruz [UCSC] (68), though to a lesser extent than at Rice and Caltech. We also observe relatively high Residence z-scores at Smith (60), Auburn (71), and University of Oklahoma (97). Major seems to be most important relative to the other available characteristics at Oberlin (44) and Maine (59), though in both cases its relative correlation pales in comparison to that of class year. High School seems to be most important at USF (51) and Tennessee (95), though class year is again more important. Most of the institutions are clustered tightly near the Year vertex, but Residence can often be rather important (and sometimes even the most important category, as we have seen in three cases).

In Figure 6, we show the social organization tetrahedron for the Student networks (i.e., for the largest connected component of the student-only subnetworks) for each institution. As we saw with the Full networks, most of the institutions have community structures that are organized overwhelmingly according to class year. Rice, Caltech, Smith, UCSC, Auburn, and Oklahoma are again exceptions, as dormitory residence also exerts considerable (or even primary) influence at these institutions. Additionally, considering the Student network reduces the relative dominance of the Year vertex, although it clearly still dominates the social organization. This feature is illustrated by institutions such as UC (64), UF (21), and Rutgers (89).

In Figure 7, we show the social organization tetrahedron for the Female networks (i.e., for the largest connected component of the female-only subnetworks) for each institution. Class year is once again the overwhelmingly dominant organizing characteristic, and dormitory residence is again important at institutions such as Rice, Caltech, Smith, UCSC, Auburn, and Oklahoma. However, we now observe an increased importance of the High School vertex. USF (51), Tennessee (95), UF (21), FSU (53), and GWU (54) all lie closer to the High School vertex than was the case in the Full and Student networks.

In Figure 8, we show the social organization tetrahedron for the Male networks (i.e., for the largest connected component of the male-only subnetworks) for each institution. Class year is once again the overwhelmingly dominant organizing characteristic, and dormitory residence is again the most important category at institutions such as Rice, Caltech, and UCSC. Interestingly, considering the Male network suggests that Residence is the most important factor for the social organization for the males at Notre Dame (57). Residence also exerts an important influence on the males at Michigan (67). This is starkly different from what we observed for these institutions in the Full, Student, and Female networks (and would seem to be something interesting to investigate more thoroughly in the future using other data and methods). The Male UCF (52), MSU (24), USF (51), Auburn (71), and Maine (59) networks are strongly influenced by High School. The Male networks at Texas (80), Rutgers (89), and University of Illinois at Urbana-Champaign (20) stand out from other universities because of their proximity to the Major vertex.

4.4 Discussion

As described above, we see using the z-scores of the Rand coefficients for demographic characteristics versus algorithmic community assignments that Year is the strongest organizing factor at most institutions but that Residence is much more important for the community organization at some institutions than at others. The correlation with Residence is especially prominent at Rice (31) and Caltech (36). We also observe that the Male networks tend to be more scattered around Year, as some institutions exhibit a stronger correlation with Major, whereas others have a stronger correlation with High School. This suggests that there are potential differences in the gender patterns of friendships, which would be interesting to investigate in future studies with new data. We do not explore this general issue further and instead attempt to identify interesting comparisons with the results that we obtained above. Although it is of course impossible to be exhaustive in our observations, we present all of our assortativity values, regression model coefficients, and community-comparing z-scores in the tables in Appendix A. We also highlight some interesting facets of our results.
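For readers who want to experiment with this kind of partition comparison, the following is a rough sketch of a z-score obtained from a permutation null model. It is a substitute technique for illustration and not the computation behind the reported values. It counts the node pairs that are grouped together in both partitions; because the Rand coefficient is a linear function of this count when the group sizes of both partitions are held fixed, the two quantities share the same z-score under such a null model. All function names, the number of shuffles, and the seed are assumptions.

```python
import random
from itertools import combinations
from statistics import mean, pstdev


def same_group_pairs(a, b):
    """Count node pairs that share a group in partition a AND in partition b."""
    return sum(
        1
        for i, j in combinations(range(len(a)), 2)
        if a[i] == a[j] and b[i] == b[j]
    )


def permutation_z_score(a, b, n_shuffles=500, seed=0):
    """Estimate a z-score for the agreement between label lists a and b.

    The null distribution is built by shuffling the labels of b across nodes,
    which keeps the group sizes of both partitions fixed.
    """
    rng = random.Random(seed)
    observed = same_group_pairs(a, b)
    shuffled = list(b)
    null_counts = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        null_counts.append(same_group_pairs(a, shuffled))
    sigma = pstdev(null_counts)
    if sigma == 0:
        raise ValueError("null distribution has zero variance")
    return (observed - mean(null_counts)) / sigma
```

Here `a` could hold algorithmically detected community labels and `b` a categorical attribute such as class year, with one entry per node in the same node order.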

Of particular interest is the comparison of results from the dyad-level regression models to those from community-level correlations. We note in particular that the logistic regression and exponential random graph models that we employed for the smallest 16 institutions specify that almost all institutions and all of their subnetworks give the highest model coefficient contribution towards a link between nodes from a common High School. However, as we have seen—and which is particularly evident using the visualizations with tetrahedra—at the community level, most institutions are organized by class year and have a relatively small correlation with high school.

Even in the rare cases in which the rank ordering of the four correlations (with Year, Residence, Major, and High School) at the community level matches that obtained via dyad-level model coefficients, such as with the logistic regression model for the Full and Female networks from Caltech (36), the relative sizes of the contributions at the dyad level are completely different from those observed at the community level. Caltech supplies an illustrative example of the different insights obtained from community detection versus logistic regression and exponential random graph models both because of its small size and because of its outlying correlation with dormitory residence at the community level. A simple interpretation of the apparent dichotomy between the dyad-level model coefficients and the correlations at the community scale is that the presence of two students from the same high school at a small institution like Caltech yields a significant increase in the likelihood of a tie between those students. Even though the corresponding model coefficient is smaller than that at any of the other institutions among the 16 smallest, it is comparable to that for common residence (called “Houses” at Caltech). Nevertheless, the very small number of node pairs at Caltech that have the same high school relative to the total number of node pairs has a very small effect at the community level, as the algorithmically obtained communities are correlated overwhelmingly with House affiliation. The ERGM result with triangle contributions makes this distinction even more striking, as the common high school coefficient is actually larger than the coefficient from common House.
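The arithmetic behind this interpretation can be made concrete with a toy calculation. In a logistic regression with an intercept and a single binary nodematch covariate, the fitted coefficient equals the log odds ratio of the corresponding 2x2 table of pairs, so it can be computed directly from counts. The models discussed above include several nodematch terms, so this is a simplification, and every count below is hypothetical rather than taken from the data.

```python
import math


def nodematch_log_odds(ties_match, pairs_match, ties_other, pairs_other):
    """Log odds ratio of a tie for matching versus non-matching node pairs."""
    odds_match = ties_match / (pairs_match - ties_match)
    odds_other = ties_other / (pairs_other - ties_other)
    return math.log(odds_match / odds_other)


# Hypothetical small institution with 1000 users (499500 node pairs).
pairs_total = 1000 * 999 // 2
pairs_same_hs = 20   # hypothetical: very few pairs share a high school
ties_same_hs = 10    # hypothetical: half of those pairs are friends
ties_other = 15000   # hypothetical: ties among all remaining pairs

coefficient = nodematch_log_odds(
    ties_same_hs, pairs_same_hs, ties_other, pairs_total - pairs_same_hs
)
share_of_ties = ties_same_hs / (ties_same_hs + ties_other)
```

In this toy setting, `coefficient` is large and positive, yet `share_of_ties` is well below one percent. A characteristic can therefore carry a large dyad-level coefficient while contributing almost nothing to the large-scale community structure.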

We also observe other features that might be worthy of future investigation using other data sets and methodologies. We report the results of our calculations in depth in Tables 1–5. Here we highlight only a few potentially interesting examples in which different methods or different subnetworks yield apparently different qualitative conclusions. For example, we found that Major is the second most important factor for the organization of the communities in all of the Oberlin (44) networks, but only for the Full and Male networks does the logistic regression give the second highest coefficient for Major. We also observed that the relative ordering of Major at the same institution is sometimes gender-dependent. For example, Major gives the second largest z-score in the Female and Male networks of Stanford (3), but it gives the fourth largest z-score in Stanford’s Full network. Even more interesting, Major gives the second largest z-score for the Female network at UVA (16), the third largest z-score for UVA’s Male network, and the fourth largest z-score for its Full network. The communities in the Auburn (71) Female network are dominated by Residence, but those in the other Auburn networks are not. Similarly, the communities in the MIT (8) Male network are dominated by Residence, but those in the other MIT (8) networks are not. Another interesting disparity based on gender occurs in the communities in the Tennessee (95) Full and Student networks, which have their second largest contributions from High School, whereas those in the other two Tennessee networks have their second largest contributions from Residence.

5 Conclusions

We have studied the social structure of Facebook “friendship” networks at one hundred American institutions at a single point in time (using data from September 2005). To compare the organizations of the 100 institutions using categorical data, we considered both microscopic and macroscopic perspectives. In particular, calculating assortativity coefficients and regression model coefficients based on observed ties allows one to examine homophily at the local level, and algorithmic community detection allows a complementary macroscopic picture. These approaches complement each other, providing different perspectives on investigations of these Facebook networks. Such complementary calculations are particularly valuable when the microscopic and macroscopic perspectives identify different dominant contributions. For example, in the Caltech networks, the assumed ground truth of the importance of the House system is captured better by computing community structure.

This “real-world ensemble” of 100 networks formed by ostensibly similar mechanisms has the potential to provide a testing ground for various models of network formation. Because of the useful comparisons such an ensemble of data can facilitate, this data will similarly be useful for studies of dynamic processes on networks, algorithmic community detection, and so on. Because of the different rates of initial Facebook adoption at different institutions, the single point in time represented by the data might usefully describe different stages in the formation of an online social network. In order to pursue such ideas further, one needs to start by studying the networks for their own sake and comparing their structures. This was the goal of the present paper. In particular, we have identified some of the key differences across these 100 realizations of online social networks.

Some of our observations confirm conventional wisdom or are intuitively clear, providing soft verification of our investigation via expected results. For example, we found that class year is often important, Houses are important at Caltech, and high school plays a greater role in the social organization of large universities than it does at smaller institutions (where there are typically fewer pairs of people from the same high school). Other results are quite fascinating and merit further investigation. In particular, the differences in the community structures of the female-only and male-only networks would be interesting to investigate in both offline and online settings. The Facebook data suggests that women are typically more likely to have friends within their common residence (among the demographic data to which we have access) but that the characteristics in the communities in the male-only networks exhibit a wider variation. Investigating this thoroughly would require different data sets and methodologies, especially if one wishes to discern the causes of such friendships from observed correlations.

The Facebook networks that we study offer imperfect representations of corresponding real-life social networks, which have different properties from online social networks. It is thus crucial that our results are complemented by studies of the corresponding real networks in order to quantify the extent of such differences.

Acknowledgements

We thank Adam D’Angelo and Facebook for providing the data used in this study. We also acknowledge Sandra González-Bailón and Erik Kelsic for useful discussions. We thank Christina Frost for developing some of the graph visualization code that we used (available at http://netwiki.amath.unc.edu/VisComms). ALT was funded by the NSF through the UNC AGEP (NSF HRD-0450099) and by the UNC ECHO program. PJM was funded by the NSF (DMS-0645369) and the UNC ECHO program. MAP acknowledges a research award (#220020177) from the James S. McDonnell Foundation.

Figure 1: Largest connected component of the student-only subset of the Reed College Facebook network. (We used a Fruchterman and Reingold (1991) visualization.) Different node shapes and gray scale indicate different class years (gray circles denote users who did not identify an affiliation), and the edges are randomly shaded for easy viewing. Clusters of nodes with the same grayscale/shape suggest that common class year has an important effect on the aggregate Facebook structure.

Figure 2: [Color] (Left) Visualization of community structure of the Reed College Student Facebook network shown in Figure 1. Node shapes and colors indicate class year (gray dots denote users who did not identify an affiliation), and the edges are randomly shaded for easy viewing. We place the communities using a Fruchterman and Reingold (1991) layout and use a Kamada and Kawai (1989) layout to position the nodes within communities (Traud et al., 2009). (Right) The same network layout but with each community depicted as a pie. Larger pies represent communities with larger numbers of nodes. Darker edges indicate the presence of more connections between the corresponding communities.

Figure 3: Box plots (indicating median, quartiles, extent, and outliers of the distribution) of the logistic regression nodematch coefficients for the 16 smallest institutions in the data for the model described in the main text. We plot the values to present results with greater resolution. We separately present our results for the Full, Student, Female, and Male networks.

Figure 4: Box plots (indicating median, quartiles, extent, and outliers of the distribution) of the exponential random graph model coefficients described in the main text for the 16 smallest institutions in the data. We plot the values to present results with greater resolution. We separately present our results for the Full, Student, Female, and Male networks.
Figure 5: [Color online] (Upper Left) Social organization tetrahedron for the community structures of the Full component (largest connected component) of the networks for each of the 100 institutions. Lighter disks indicate an organization that is based more predominantly on class year. See the main text for a description of this figure. (Lower Right) Magnification near the Year vertex. The legend illustrates the disk size as a function of the maximum distance between the 6 different partitions of the network. Most cases (88 out of 100 institutions) have .
Figure 6: [Color online] (Upper Left) Social organization tetrahedron for the community structures of the Student component of the networks for each of the 100 institutions. Lighter disks indicate an organization that is based more predominantly on class year. See the main text for a description of this figure. (Lower Right) Magnification near the Year vertex. As in Figure 5, the disk sizes correspond to the maximum distances between partitions.
Figure 7: [Color online] (Upper Left) Social organization tetrahedron for the community structures of the Female component of the networks for each of the 100 institutions. Lighter disks indicate an organization that is based more predominantly on class year. See the main text for a description of this figure. (Lower Right) Magnification near the Year vertex. As in the two previous figures, the disk sizes indicate the maximum distances between partitions.
Figure 8: [Color online] (Upper Left) Social organization tetrahedron for the community structures of the Male component of the networks for each of the 100 institutions. Lighter disks indicate an organization that is based more predominantly on class year. See the main text for a description of this figure. (Lower Right) Magnification near the Year vertex. As in the three previous figures, disk size indicates the maximum distance between partitions. We note that there are more  cases here than in the previous figures, which illustrates the greater variability in the relative positions of the z-scores in the different Male networks than was the case for the Full, Student, and Female networks.

Appendix A Tables

In Table 1, we give for each of the 100 institutions the numbers of nodes and edges for each of the Facebook networks (and subsets thereof) that we have investigated. In Table 2, we give the assortativity values for each of the networks. For each institution, we calculate assortativity values for Gender only for the Full and Student network subsets. We calculate Major, Residence, Year, and High School assortativity values for each of the four network subsets (Full, Student, Female, and Male).

Recall that we studied regression models for the 16 institutions with the smallest Facebook networks. In Table 3, we report the results of a logistic regression model with edge and nodematch terms. (All coefficients differ from zero at a statistically significant level.) In Table 4, we similarly report the results of an ERGM that supplements the logistic regression model with triangle terms. (Again, all resulting model coefficients differ from zero at a statistically significant level.)

In Table 5, we report the maximum z-score for each demographic category that we obtained from the 6 different community detection partitions (described in the text) of each Facebook network (and their subsets) compared to categorical partitions based on each of Major, Residence, Year, and High School. We divide the networks in this table into five sections: (1) networks for which the High School category gives the highest z-score; (2) networks for which the Residence category gives the highest z-score; (3) networks for which Year gives the highest z-score and High School gives the second highest; (4) networks for which Year gives the highest z-score and Major gives the second highest; and (5) networks for which Year gives the highest z-score and Residence gives the second highest.

Institution Number Nodes (Full, Student, Female, Male) Edges (Full, Student, Female, Male)
Harvard
Columbia
Stanford
Yale
Cornell
Dartmouth
UPenn
MIT
NYU
BU
Brown
Princeton
Berkeley
Duke
Georgetown
UVA
BC
Tufts
Northeastern
U Illinois
UF
Wellesley
Michigan
MSU
Northwestern
UCLA
Emory
UNC
Tulane
UChicago
Rice
WashU
UC
UCSD
USC
Caltech
UCSB
Rochester
Bucknell
Williams
Amherst
Swarthmore
Wesleyan
Oberlin
Middlebury
Hamilton
Bowdoin
Vanderbilt
Carnegie
UGA
USF
UCF
FSU
GWU
Johns Hopkins
Syracuse
Notre Dame
Maryland
Maine
Smith
UC
Villanova
Virginia
UC
Cal
Mississippi
Michigan
UCSC
Indiana
Vermont
Auburn
USFCA
Wake
Santa
American
Haverford
Williams
MU
JMU
Texas
Simmons
Bingham
Temple
Texas
Vassar
Pepperdine
Wisconsin
Colgate
Rutgers
Howard
UConn
UMass
Baylor
Penn
Tennessee
Lehigh
Oklahoma
Reed
Brandeis
Trinity
Table 1: Characteristics for each of the networks and subnetworks: institution name, the identifying number given by Facebook, number of nodes in each network and subnetwork, and the number of edges in each network and subnetwork.
Institution No. Full Student Female Male

Harvard
Gender
Major
Residence
Year
High School
Columbia
Gender
Major
Residence
Year
High School
Stanford
Gender
Major
Residence
Year
High School
Yale
Gender
Major
Residence
Year
High School
Cornell
Gender
Major
Residence
Year
High School
Dartmouth
Gender
Major
Residence
Year
High School
UPenn
Gender
Major
Residence
Year
High School
MIT
Gender
Major
Residence
Year
High School
NYU
Gender
Major
Residence
Year
High School
BU
Gender
Major
Residence
Year
High School
Brown
Gender
Major
Residence
Year
High School
Princeton
Gender
Major
Residence
Year
High School
Berkeley
Gender
Major
Residence
Year
High School
Duke
Gender
Major
Residence
Year
High School
Georgetown
Gender
Major
Residence
Year
High School
UVA
Gender
Major
Residence
Year
High School
BC
Gender
Major
Residence
Year
High School
Tufts
Gender
Major
Residence
Year
High School
Northeastern
Gender
Major
Residence
Year
High School
UIllinois
Gender
Major
Residence
Year
High School
UF
Gender
Major
Residence
Year
High School
Wellesley
Gender
Major
Residence
Year
High School
Michigan
Gender
Major
Residence
Year
High School
MSU
Gender
Major
Residence
Year
High School
Northwestern
Gender
Major
Residence
Year
High School
UCLA
Gender
Major
Residence
Year
High School
Emory
Gender
Major
Residence
Year
High School
UNC
Gender
Major
Residence
Year
High School
Tulane
Gender
Major
Residence
Year
High School
UChicago
Gender
Major
Residence
Year
High School
Rice
Gender
Major
Residence
Year
High School
WashU
Gender
Major
Residence
Year
High School
UC
Gender
Major
Residence
Year
High School
UCSD
Gender
Major
Residence
Year
High School
USC
Gender
Major
Residence
Year
High School
Caltech
Gender
Major
Residence
Year
High School
UCSB
Gender
Major
Residence
Year
High School
Rochester
Gender
Major
Residence
Year
High School
Bucknell
Gender
Major
Residence
Year
High School
Williams
Gender
Major
Residence
Year
High School
Amherst
Gender
Major
Residence
Year
High School
Swarthmore
Gender
Major
Residence
Year
High School