Bias and variance in the social structure of gender

Kristen M. Altenburger, Department of Management Science & Engineering, Stanford University. Email: kaltenb@stanford.edu.
Johan Ugander, Department of Management Science & Engineering, Stanford University. Email: jugander@stanford.edu.
Abstract

The observation that individuals tend to be friends with people who are similar to themselves, commonly known as homophily, is a prominent and well-studied feature of social networks. Many machine learning methods exploit homophily to predict attributes of individuals based on the attributes of their friends. Meanwhile, recent work has shown that gender homophily can be weak or nonexistent in practice, making gender prediction particularly challenging. In this work, we identify another useful structural feature for predicting gender, an overdispersion of gender preferences introduced by individuals who have extreme preferences for a particular gender, regardless of their own gender. We call this property monophily for “love of one,” and jointly characterize the statistical structure of homophily and monophily in social networks in terms of preference bias and preference variance. For prediction, we find that this pattern of extreme gender preferences introduces friend-of-friend correlations, where individuals are similar to their friends-of-friends without necessarily being similar to their friends. We analyze a population of online friendship networks in U.S. colleges and offline friendship networks in U.S. high schools and observe a fundamental difference between the success of prediction methods based on friends, “the company you keep,” compared to methods based on friends-of-friends, “the company you’re kept in.” These findings offer an alternative perspective on attribute prediction in general and gender in particular, complicating the already difficult task of protecting attribute privacy.


Homophily is the observed phenomenon in social networks whereby friendships form frequently among similar individuals [29, 34]. Homophily can originate from an individual’s personal preference to become friends with similar others (choice homophily), structural opportunities to interact with similar others (induced homophily), or a combination of both [26]. An important consequence of homophily is that even if an individual does not disclose attribute information about themselves (such as their gender, age, or race), methods for relational learning [37, 22, 31, 45, 3, 48] can often leverage attributes disclosed by that individual’s friends to predict their private attributes. Gender prediction, however, is a difficult relational learning problem, as gender homophily can be weak or non-existent in both online and offline settings [53, 49, 46, 36, 27]. Weak gender homophily motivates us to examine alternative network structures useful for attribute prediction [13].

In this work, we focus on gender prediction and document the presence of individuals in social networks with extreme gender preferences for a particular gender, regardless of their own gender. We call this overdispersion of preferences “monophily” to indicate it as distinct from the preference bias introduced by homophily, and observe that monophily is nearly ubiquitous across the population of online and offline friendship networks that we study. The presence of these individuals with extreme preferences introduces similarity among friends-of-friends or along 2-hop relations. For the practical problem of attribute prediction, being friends with an individual with extreme gender preferences is a strong signal of one’s own gender and is therefore useful for gender prediction.

In order to model these empirical observations, as part of this work we also introduce an overdispersed stochastic block model that enables us to separately simulate homophily and monophily in social networks. We show how the 2-hop structural relationship induced by overdispersion (monophily) can exist in the complete absence of any 1-hop bias (homophily), and find that overdispersed friendship preferences can drive successful classification algorithms in settings with weak or even no homophily. Therefore, in networks with weak homophily but strong monophily, your friends-of-friends (“the company you’re kept in”) can then be responsible for disclosing private attribute information, as opposed to your friends (“the company you keep”). These findings extend the importance of privacy policies that protect relational data, while also proposing an intuitive structural property of social networks of independent interest.

In the spirit of a solution-oriented science [55], our analysis addresses the practical problem of inferring gender on social networks by revisiting the social theory of homophily and introducing alternative considerations for heterogeneity in friendship preferences. In addition to improving prediction, we also present monophily as an independent structure of interest when studying “gender as a social structure” [42] by explicitly quantifying the variability in gender preferences beyond the bias captured by homophily. Only recently has the role of variability in general and overdispersion in particular been studied on social networks, where classic perspectives have prioritized analyzing aggregate patterns of interaction [40]. This work follows other advances in incorporating variance and overdispersion in social data analysis, which include understanding the consequences of overdispersion when estimating the size of sub-populations [60], documenting variations in the homophily of political ideology [4], assessing gender variation in linguistic patterns [2], and inferring social structure based on indirectly observed data [32].

The paper proceeds by first establishing how we measure the bias (homophily) and excess variance (monophily) of gender preferences. We then examine how relational inference methods for node classification relate to the presence of homophily and/or monophily. While previous models of homophily have shown its statistical significance in network data [58, 47], we highlight that the statistical significance of homophily does not necessarily imply predictive power when the task is to infer private attributes. Following the empirical analysis, we introduce a network model of overdispersed preferences that generalizes the well-studied stochastic block model [21]. Throughout this work we view gender as a binary attribute and aim to measure homophily in a manner that encompasses all sources of preference due to both choice and induced homophily. While we focus on gender, the methods developed in this work contribute a broad statistical toolkit for the general study of variability in social group interactions across a wide range of attributes or traits.

We begin by showing how the conventional homophily index can be interpreted as the maximum likelihood estimate of a parameter within a simple generalized linear model. We then extend this model to capture overdispersed preferences using a quasi-likelihood approach, introducing an overdispersed model with additional parameters that concisely measure the overdispersion of gender preferences among females ($\phi_F$) and males ($\phi_M$), respectively. We propose estimates of these parameters as our measures of monophily among females and males in network data.

The homophily index of a graph [7, 10] characterizes the aggregate pattern of individuals’ biases or preferences in forming friendships with people of their own attribute class relative to people from other classes. For a generic attribute class $r$ and assuming there are $k$ classes, the homophily index with respect to class $r$ is defined as

$$\hat{h}_r = \frac{\sum_{i \in r} d_{i,\mathrm{in}}}{\sum_{i \in r} d_{i,\mathrm{in}} + \sum_{i \in r} d_{i,\mathrm{out}}} \tag{1}$$

where $d_{i,\mathrm{in}}$ denotes node $i$’s observed in-class degree with similar others, $d_{i,\mathrm{out}}$ denotes its observed out-class degree with different others, $d_i = d_{i,\mathrm{in}} + d_{i,\mathrm{out}}$ denotes its observed total degree, and $n_r$ will represent the total number of nodes with attribute $r$ such that $N = \sum_r n_r$. For notational simplicity, we use $r$ to refer to the set of all nodes with attribute value $r$.
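As a concrete illustration of the homophily index, the following minimal sketch computes it for one class from an undirected graph. The adjacency-dictionary representation, the function name `homophily_index`, and the toy graph are assumptions made for this example and are not taken from the paper's code.

```python
def homophily_index(adj, labels, r):
    """Share of class-r nodes' edge endpoints that land on class-r nodes.

    adj    : dict mapping node -> list of neighbors (undirected graph)
    labels : dict mapping node -> class label
    r      : the class whose homophily index is computed
    """
    d_in = 0   # sum of in-class degrees over nodes in class r
    d_tot = 0  # sum of total degrees over nodes in class r
    for i, nbrs in adj.items():
        if labels[i] != r:
            continue
        for j in nbrs:
            d_tot += 1
            if labels[j] == r:
                d_in += 1
    return d_in / d_tot if d_tot else float("nan")


# Toy graph (hypothetical): a triangle of F nodes with one F-M edge attached.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "m"], "m": ["c"]}
labels = {"a": "F", "b": "F", "c": "F", "m": "M"}
h_F = homophily_index(adj, labels, "F")  # 6 in-class endpoints out of 7
h_M = homophily_index(adj, labels, "M")  # 0 in-class endpoints out of 1
```

Comparing `h_F` with the share of F nodes in the graph (here 3/4) is the kind of comparison the homophily index is designed to support.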

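In measuring binary gender homophily (i.e. $r = F$ or $r = M$), we first illustrate how to measure homophily among females. We assume that each individual $i \in F$ forms in-class connections with the other $n_F$ individuals at a rate $p_{\mathrm{in}}$ and out-class ties with the other $n_M$ individuals at a rate $p_{\mathrm{out}}$ (and similarly, for each individual $i \in M$, connections with males form at a rate $p_{\mathrm{in}}$ and connections with females form at a rate $p_{\mathrm{out}}$). We therefore expect that each individual’s class-specific degrees obey the following distributions (permitting self-loops):

$$D_{i,\mathrm{in}} \mid p_{\mathrm{in}} \sim \mathrm{Binom}(n_F, p_{\mathrm{in}}) \tag{2}$$
$$D_{i,\mathrm{out}} \mid p_{\mathrm{out}} \sim \mathrm{Binom}(n_M, p_{\mathrm{out}}) \tag{3}$$
$$D_i \mid p_{\mathrm{in}}, p_{\mathrm{out}} \sim \mathrm{Binom}(n_F, p_{\mathrm{in}}) + \mathrm{Binom}(n_M, p_{\mathrm{out}}) \tag{4}$$

where $D_{i,\mathrm{in}}$ is a random variable describing the in-class degree, $D_{i,\mathrm{out}}$ describes the out-class degree, and $D_i$ describes the total degree of node $i$ in class $F$. We explicitly condition these random variables on the parameters $p_{\mathrm{in}}$ and $p_{\mathrm{out}}$ to make clear that these parameters are, for now, fixed and constant.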

The $n_F$ nodes have the same binomial degree distribution, specified by in-class degrees formed among the $n_F$ nodes at a rate $p_{\mathrm{in}}$ and out-class degrees formed among the $n_M$ nodes at a rate $p_{\mathrm{out}}$. With only $k = 2$ classes, for simplicity we write $p_{\mathrm{in}}$ and $p_{\mathrm{out}}$ without class subscripts, although the rates could depend on the specific in- and out-classes in the most general directed multi-class case. Note that the random variables in equations (2)–(4) are approximately independent, but not completely: constraints on the joint distribution of the degrees corresponding to the constraints of the Erdős-Gallai theorem (since the degrees must correspond to a graph) create a dependence, but this dependence is small for graphs of modest size or larger [54] and we safely ignore it here.

To show how the homophily index can be estimated using a generalized linear model (GLM) [33] of in- versus out-class degrees, let the observed degree data be $\{(d_{i,\mathrm{in}}, d_{i,\mathrm{out}})\}_{i \in F}$, where the set-up is analogous for $M$. Among the $n_F$ individuals, their in-class degree distribution conditional on their total observed degree is approximately distributed as

$$D_{i,\mathrm{in}} \mid D_i = d_i \sim \mathrm{Binom}(d_i, h_F), \qquad h_F = \frac{n_F \, p_{\mathrm{in}}}{n_F \, p_{\mathrm{in}} + n_M \, p_{\mathrm{out}}} \tag{5}$$

in the case of two attribute classes (Supplementary Note 1). By applying a logistic-binomial model [16, 1], an adaptation of the logistic regression model for count data, the logistic link function of the binomial logistic regression model is then specified as $\mathrm{logit}(h_F) = \beta_{0F}$, assuming there are no additional covariates (which could otherwise be incorporated). For this model we can then derive the maximum likelihood estimate of $\beta_{0F}$ as:

$$\hat{\beta}_{0F} = \log \left( \frac{\sum_{i \in F} d_{i,\mathrm{in}}}{\sum_{i \in F} d_{i,\mathrm{out}}} \right) \tag{6}$$

or equivalently $\hat{h}_F = \sum_{i \in F} d_{i,\mathrm{in}} / \sum_{i \in F} d_i$ (Supplementary Note 2). Here $\hat{h}_F$ is exactly the homophily index specified in equation (1) above, and hence the homophily index can be interpreted as the intercept term estimated from a GLM applied to the observed degree data.
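The step from the intercept-only logistic-binomial model to the homophily index can be made explicit with a short derivation. The symbols below ($\beta_0$ for the intercept, $d_{i,\mathrm{in}}$, $d_{i,\mathrm{out}}$ and $d_i$ for a node's in-class, out-class and total degree) are notation assumed for this sketch; the authors' own derivation is in their Supplementary Note 2.

```latex
% Log-likelihood of an intercept-only binomial model with logit link,
% where h = e^{\beta_0} / (1 + e^{\beta_0}) and d_i = d_{i,in} + d_{i,out}:
\ell(\beta_0) = \sum_{i} \Big[ d_{i,\mathrm{in}} \, \beta_0
                - d_i \log\big(1 + e^{\beta_0}\big) \Big] + \text{const}

% Setting the derivative to zero:
\frac{\partial \ell}{\partial \beta_0}
  = \sum_i d_{i,\mathrm{in}} - \frac{e^{\beta_0}}{1 + e^{\beta_0}} \sum_i d_i = 0
\;\Longrightarrow\;
\hat{h} = \frac{e^{\hat\beta_0}}{1 + e^{\hat\beta_0}}
        = \frac{\sum_i d_{i,\mathrm{in}}}{\sum_i d_i},
\qquad
\hat\beta_0 = \log \frac{\sum_i d_{i,\mathrm{in}}}{\sum_i d_{i,\mathrm{out}}}
```

The fitted probability is therefore the pooled share of in-class edge endpoints, which is the homophily index.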

Given this interpretation of the homophily index within a GLM framework, it is useful to refer to the quantity $h_r$ as the “homophily parameter” for each class $r$, letting the “homophily index” for each class embody the corresponding maximum likelihood estimate, $\hat{h}_r$. The homophily index is focused on assessing whether $h_r$ is different from the in-class’ relative proportion in the population, $n_r / N$. Meanwhile, this model gives a poor assessment of the variance of the data due to the constrained relationship between mean and variance [16]. More specifically, in this model of in-class degrees for class $r$, the variance of the in-class degrees is constrained to be $\mathrm{Var}(D_{i,\mathrm{in}} \mid D_i = d_i) = d_i h_r (1 - h_r)$ (Supplementary Note 3).

We observe that across the full population of 97 co-educational college online social networks from the Facebook100 dataset (FB100), the distributions of gender preferences are overdispersed, with a variance larger than the above model predicts (for details on the FB100 dataset, see Methods). As seen in Figure 1 for the Amherst College network, the empirical distributions of the gender preferences are more dispersed (less concentrated) than the homophily-only null distributions (for details of null model sampling, see Methods). Across the females and males at Amherst College, there is clear evidence that the variance of the distribution of in-class preferences is greater than what would be expected given the homophily-only null model.

Figure 1: Evidence of overdispersion in gender preferences. On the Amherst College network we compute the empirical distribution (filled bars) of in-class preferences for females (Left) and males (Right). We compare these distributions to a null distribution (solid lines) based on preferences with binomial variation (for details of null model sampling, see Methods). We observe overdispersion of in-class gender bias in friendship formation for females and males, as the observed empirical variance is greater than under the null.

We formally test the statistical significance of overdispersion of in-class degrees relative to out-class degrees among nodes with attribute class value $r$, given the fitted GLM with $\hat{h}_r$ and the nominal variance $d_i \hat{h}_r (1 - \hat{h}_r)$ of individual $i$’s in-class degree count under this model. The standard test for overdispersion compares the sum of squared standardized residuals to a $\chi^2_{n_r - 1}$ distribution, where there are $n_r - 1$ degrees of freedom since the model features only a single intercept parameter [57, 16] (Supplementary Note 4). We consistently observe that the variance of in-class degrees among the females and males is significantly greater than what can be explained by the homophily-only GLM across the 97 college networks, with for all networks. The friendship networks in the Add Health dataset show equivalent evidence of overdispersion in a directed setting (for details on the Add Health dataset, see Methods).
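The overdispersion test described above can be sketched in a few lines. This is a minimal illustration, assuming per-node in-class and total degree counts are available as plain lists; the function name `pearson_dispersion` and the toy data are hypothetical. It computes the Pearson statistic and the degrees of freedom but not the p-value.

```python
def pearson_dispersion(d_in, d_tot):
    """Pearson X^2 statistic for an intercept-only binomial model.

    d_in  : in-class degrees, one per node in the class
    d_tot : total degrees for the same nodes (all > 0)
    Assumes the pooled proportion h lies strictly between 0 and 1.
    Returns (X2, dof, X2 / dof); a ratio well above 1 suggests overdispersion.
    """
    h = sum(d_in) / sum(d_tot)          # fitted homophily index
    x2 = 0.0
    for y, d in zip(d_in, d_tot):
        var = d * h * (1.0 - h)         # nominal binomial variance
        x2 += (y - d * h) ** 2 / var    # squared standardized residual
    dof = len(d_in) - 1                 # one fitted intercept
    return x2, dof, x2 / dof


# Hypothetical data: four nodes of degree 10 each.
even = pearson_dispersion([5, 5, 5, 5], [10, 10, 10, 10])
extreme = pearson_dispersion([10, 0, 10, 0], [10, 10, 10, 10])
```

Both toy examples have the same homophily index (0.5), yet only the second shows excess variance. A p-value would follow from the upper tail of a chi-squared distribution with `dof` degrees of freedom, for instance via `scipy.stats.chi2.sf(x2, dof)` if SciPy is available.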

A variety of modeling methods have been proposed to measure and model extra variation in count data [56, 57, 33, 35]. We employ a quasi-likelihood approach [57], the least presumptive of these approaches, in order to adapt the GLM to accommodate this overdispersion. The quasi-likelihood set-up allows each node $i$ in class $r$ to have an individual latent preference for in-class friendships, $p_i$, such that $\mathbb{E}[p_i] = h_r$ and $\mathrm{Var}[p_i] = \phi_r h_r (1 - h_r)$ for some $\phi_r \geq 0$. The parameter $\phi_r$ is introduced to incorporate the extra variation, and the variance is parameterized as such for notational convenience (Supplementary Note 3). This set-up does not specify a distribution on $p_i$ but instead uses $\phi_r$ to quantify how much nodes in class $r$ vary in allocating their in-class versus out-class friendships.

The case when $\phi_r = 0$ corresponds to the typical homophily-only model (Williams’ Model I), which restricts $p_i$ to be constant across all nodes in the class. Letting $\phi_r > 0$ (Williams’ Model II) captures variation beyond the conventional model (Supplementary Note 3). Through an iterative procedure due to Williams that maximizes a quasi-likelihood function (Supplementary Note 4), we jointly estimate the homophily and overdispersion parameters among female nodes and among male nodes, allowing us to use $\hat{\phi}_F$ and $\hat{\phi}_M$ as measures of preference overdispersion in the data. Note that the homophily measures estimated under Williams’ Model II are slightly different from the traditional homophily indices, $\hat{h}_F$ and $\hat{h}_M$, but the two sets of estimates are highly correlated (Supplementary Note 5), and we focus our characterization of homophily on the traditional indices given the direct connection to the homophily index.
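To give a feel for how an overdispersion parameter can be estimated, here is a simplified moment-matching sketch. It is not the authors' procedure: it performs only a single unweighted step in the spirit of Williams' method, assuming the variance form $d_i h (1 - h)[1 + \phi (d_i - 1)]$ for a node of total degree $d_i$, and it skips the reweighting and iteration of the full procedure. The function name and data are hypothetical.

```python
def phi_moment_estimate(d_in, d_tot):
    """One-step moment estimate of the overdispersion parameter phi.

    Matches the Pearson X^2 statistic to its approximate expectation under
    Var(D_in) = d * h * (1 - h) * (1 + phi * (d - 1)), for an intercept-only fit.
    Negative estimates are clipped to 0 (no detectable overdispersion).
    """
    total = sum(d_tot)
    h = sum(d_in) / total                      # pooled in-class proportion
    x2 = sum((y - d * h) ** 2 / (d * h * (1 - h))
             for y, d in zip(d_in, d_tot))     # Pearson statistic
    n = len(d_in)
    # (1 - d / total) is one minus the leverage of a node in an intercept-only fit.
    denom = sum((d - 1) * (1 - d / total) for d in d_tot)
    return max(0.0, (x2 - (n - 1)) / denom)


# Hypothetical data: six nodes of degree 10 with spread-out in-class counts.
phi_hat = phi_moment_estimate([9, 1, 8, 2, 5, 5], [10] * 6)
phi_even = phi_moment_estimate([5, 5, 5, 5, 5, 5], [10] * 6)
```

In the first example the statistic is 20 against 5 degrees of freedom, giving an estimate of 1/3. In the second all nodes share the same proportion and the estimate is clipped to zero.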

In Figure 2, we evaluate both bias (homophily) and overdispersion (monophily) in gender preferences, using the conventional homophily index $\hat{h}_r$ to measure bias and the estimates $\hat{\phi}_r$ to measure overdispersion across the populations of college networks in the FB100 dataset. We see that across these networks the homophily measures closely follow the class proportion $n_r / N$, whereas the monophily measures depart significantly from zero and show no sign of varying with class proportion. We next show how overdispersed preferences help explain the “predictability” of gender in relational trait inference in settings with weak or nonexistent gender homophily.

Figure 2: Homophily and monophily across the population of friendship networks. Measuring homophily and monophily in social networks. (Left) The homophily index $\hat{h}_r$ and (Right) monophily index $\hat{\phi}_r$ for bias and overdispersion, respectively, in friendship formation among male and female students at each of 97 online college social networks. The homophily indices are concentrated around relative class proportions (dashed line), while the monophily indices all show overdispersed preferences independent of the relative class proportions. Dashed lines indicate the lines of no homophily and no monophily, respectively.

Having established $\hat{\phi}_F$ and $\hat{\phi}_M$ as our measures of overdispersion, we now illustrate the key role overdispersion can play in the success of some but not all methods for relational inference. Our specific focus is to understand how the efficacy of different relational inference methods varies in the presence or absence of homophily and/or monophily, building on the challenge of predicting gender on large-scale social networks with minimal gender homophily. We explore a typical setting where individuals reveal information completely at random [18, 31, 44, 19] (i.e. uniformly), meaning that the likelihood to be labeled or to provide public information does not depend on other attributes. The prediction task is then to infer private gender attributes using public gender attributes and the social network relationships. We address this prediction problem through the lens of homophily and monophily. While historically the social sciences have placed a strong emphasis on explanation at the expense of prediction [20], this work reverses this traditional focus by showing how statistically significant homophily does not necessarily imply high predictability of attributes. Instead, we highlight the role of variation in relational inference methods, especially in applications when the bias introduced by homophily is weak or nonexistent.

Relational inference methods can be categorized based on the neighborhood relationships they exploit for classification, either learning from 1-hop (immediate friends) or 2-hop (friend-of-friend) relations. This distinction in relational learning is not often considered, but we note that it is a direct analog of a common distinction between the PageRank [38] and Hubs and Authorities [24] algorithms in graph ranking. PageRank is based on the principle that “a node is important if it is linked to by other important nodes,” while Hubs and Authorities is based on the principle that “a node is important if it is linked to by nodes that link to important nodes.” These differing principles can extract very different notions of importance in graph ranking; the latter is motivated by web ranking problems where, e.g., car companies don’t link to other car companies but should still appear high in search results for “cars.” Analogously, we observe that 2-hop and 1-hop methods are differently well-suited for different node classification problems. We compare these classification methods relative to a baseline model that assigns scores based on the relative class proportions observed in the training sample.

Classification methods based on a node’s 1-hop (immediate) neighbors include:

  • The 1-hop Majority Vote (1-hop MV) classifier, also called the weighted-vote relational neighbor (wvRN) classifier [31], builds directly on similarities between connected nodes where unlabeled nodes are scored based on the proportion of labels among their neighbors. When a node does not have any labeled neighbors, the relative class proportions in the training data are used (Supplementary Note 6).

  • The ZGL method [61] scores unlabeled nodes by computing the relative probabilities of reaching each node in a graph under a random walk originating at the labeled node sets. The ZGL method can be characterized as an iterated/semi-supervised adaptation of 1-hop MV [3].

Methods that exploit 2-hop (neighbor-of-neighbor) relations include:

  • The 2-hop Majority Vote (2-hop MV) classifier uses the relationship between a node and its 2-hop neighbors weighted by the number of length-2 paths. Unlabeled nodes are scored based on the weighted proportion of labels among their 2-hop neighbors.

  • LINK-Logistic Regression [59] uses labeled nodes to fit a regularized logistic regression model (Supplementary Note 7) that interprets rows of the adjacency matrix as sparse binary feature vectors, striving to predict labels from these features. The trained model is then applied to the feature vectors (adjacency matrix rows) of unlabeled nodes, which are scored based on the probability estimates from the model. Small variations that use the same feature set but employ e.g. SVMs or Random Forests instead of Logistic Regression give qualitatively similar performance. Employing the LINK feature set as part of a Naive Bayes classifier gives a clear view of LINK as a family of 2-hop methods (Supplementary Note 8).
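The difference between the 1-hop and 2-hop majority vote classifiers described above can be seen on a small example. The sketch below assumes an adjacency-dictionary graph and a dictionary of revealed labels; the function names and the toy graph are illustrative assumptions and not the authors' implementation.

```python
from collections import Counter


def one_hop_mv(adj, known, node, prior):
    """Share of labeled neighbors in class 'F'; fall back to the class prior."""
    votes = Counter(known[j] for j in adj[node] if j in known)
    total = sum(votes.values())
    return votes["F"] / total if total else prior


def two_hop_mv(adj, known, node, prior):
    """Share of class 'F' among labeled 2-hop neighbors, with one vote per
    length-2 path, so a node reached by several paths counts several times."""
    votes = Counter()
    for j in adj[node]:                 # first hop
        for k in adj[j]:                # second hop
            if k != node and k in known:
                votes[known[k]] += 1
    total = sum(votes.values())
    return votes["F"] / total if total else prior


# Hypothetical graph: h1 (male) and h3 (female) both befriend only females.
adj = {
    "h1": ["f1", "f2", "f3"], "h3": ["f1", "f2", "f3"],
    "f1": ["h1", "h3"], "f2": ["h1", "h3"], "f3": ["h1", "h3"],
}
known = {"h1": "M", "h3": "F", "f1": "F", "f2": "F"}  # f3 is unlabeled
score_1hop = one_hop_mv(adj, known, "f3", prior=0.5)
score_2hop = two_hop_mv(adj, known, "f3", prior=0.5)
```

Here the unlabeled node has one male and one female friend, so the 1-hop score is uninformative (0.5), while every labeled friend-of-friend is female and the 2-hop score is 1.0. This mirrors, in miniature, the case where extreme preferences rather than homophily carry the signal.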

Figure 3: Comparison of 1-hop versus 2-hop classifiers and the relationship between classification performance and homophily versus monophily. (Top) A comparison of the performance of classification methods for gender inference on the Amherst College network with =1015 and =1017, measured by AUC, varying the percentage of nodes that are given as labeled (for details on the cross-validation, see Methods). Homophily and monophily measured for the Amherst College give , and , . We observe strong classification performance from the LINK method, which we attribute to the overdispersed gender preferences. (Bottom) Across FB100 networks we compare the correlation between 1-hop and 2-hop Majority Vote (with 50% initially labeled nodes) versus gender homophily and gender monophily. We observe that homophily has high explanatory power for the 1-hop Majority Vote AUC across schools while monophily has very little. Meanwhile, homophily has weak explanatory power of the 2-hop Majority Vote AUC across schools while monophily has strong explanatory power for that method.

We observe only slight gender homophily across the population of college networks in the FB100 dataset, and accordingly in Figure 3A we observe limited performance using 1-hop methods (1-hop MV and ZGL) to predict gender in a single representative network. Meanwhile, we see that 2-hop methods (2-hop MV and LINK) have higher performance, corroborating our intuition for 2-hop methods being able to surface structural signals for classification in the presence of overdispersed preferences. As illustrated in Figure 3B, classification for 2-hop Majority Vote considerably outperforms classification based on 1-hop Majority Vote across the population of FB100 schools, and we attribute this performance difference to the monophily in the network. In addition to the undirected FB100 networks, we also examined node classification on the directed Add Health school networks (Supplementary Note 9), where we observe similar results.

In order to generalize these empirical observations on the impact of homophily versus monophily on 1-hop and 2-hop inference methods, we generate synthetic graphs with extra-binomial variation by introducing a variant on the stochastic block model (SBM) [21], also known as the planted partition model [8]. The SBM is a well-studied statistical distribution over graphs with desired block structure, commonly employed to study network association patterns. An SBM models association preferences among node classes by specifying a set of block sizes and a preference matrix $P$, where $P_{rs}$ denotes the independent probability of an edge between nodes $i$ and $j$ in attribute classes $r$ and $s$. For modeling associations between two genders using SBMs, the matrix $P$ is simply a $2 \times 2$ matrix denoting the edge probabilities within and between the two genders. Assortative block structure is present when in-class probabilities are greater than out-class probabilities.

We propose an overdispersed extension of the stochastic block model to additionally capture monophily (extra-binomial heterogeneity in preferences) by relaxing this restriction of fixed class probabilities among all nodes in a given class and assuming a latent distribution on gender preferences [57]. We specifically employ latent Beta distributions on preferences [9] applied to graphs, though other latent distributions or other means of incorporating overdispersion [12, 17] could be just as reasonable; note that the measure of monophily developed earlier in this work (that uses a quasi-likelihood approach) is agnostic to the choice of latent distribution.

The proposed overdispersed stochastic block model (oSBM) is defined by the block sizes, the preference matrix $P$, and two additional overdispersion parameters. These are concrete parameters of a generative model, while we will continue to use $\phi$ to describe generic overdispersion in preferences. Networks are generated from the model via a multi-level approach, where first each node’s in- and out-class degrees are created by sampling class preference parameters from an appropriate latent Beta distribution with specified means $p_{\mathrm{in}}$ and $p_{\mathrm{out}}$ for in- and out-class probabilities respectively. We assume the same mean across all attribute classes $r$, so we denote this mean by $p_{\mathrm{in}}$ instead of $p_{\mathrm{in},r}$ for a given class $r$. Given the resulting individual preferences, a graph is generated analogously to how the degree-corrected SBM [23] attains prescribed degrees using a Chung-Lu construction [6], with expected in-class and out-class degrees determined by the sampled preferences (Supplementary Note 10). We note that this overdispersed stochastic block model complements related work on overdispersion in social network surveys [60] where an individual’s degree to a class is taken to be distributed Gamma-Poisson. Under an oSBM, the number of individuals from a specific class that a given node is connected to will approximately follow a Beta-Binomial distribution, a close relative of the Gamma-Poisson distribution [5].
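The multi-level generation idea can be sketched as follows. This is a simplified, hypothetical variant and not the authors' construction (which is specified in their Supplementary Note 10): it assumes two equal-sized classes, a single overdispersion parameter `phi` shared by in- and out-class preferences, a Beta parameterization with variance `phi * mean * (1 - mean)`, and a Chung-Lu style product of node preferences for edge probabilities.

```python
import random


def beta_params(mean, phi):
    """Beta(a, b) with the given mean and variance phi * mean * (1 - mean).

    Requires 0 < mean < 1 and 0 < phi < 1.
    """
    s = (1.0 - phi) / phi              # a + b, from Var = mean(1-mean)/(a+b+1)
    return mean * s, (1.0 - mean) * s


def sample_osbm(n_per_class, p_in, p_out, phi, seed=0):
    """Simplified two-class overdispersed SBM (undirected, no self-loops)."""
    rng = random.Random(seed)
    n = 2 * n_per_class
    cls = [0] * n_per_class + [1] * n_per_class

    def draw(mean):
        if phi <= 0:                   # no overdispersion: shared preference
            return mean
        a, b = beta_params(mean, phi)
        return rng.betavariate(a, b)

    pref_in = [draw(p_in) for _ in range(n)]     # latent in-class preferences
    pref_out = [draw(p_out) for _ in range(n)]   # latent out-class preferences
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if cls[i] == cls[j]:
                prob = pref_in[i] * pref_in[j] / p_in     # Chung-Lu style
            else:
                prob = pref_out[i] * pref_out[j] / p_out
            if rng.random() < min(1.0, prob):
                edges.add((i, j))
    return cls, edges


# Monophily without homophily: equal mean rates, nonzero overdispersion.
cls, edges = sample_osbm(50, p_in=0.1, p_out=0.1, phi=0.05, seed=1)
```

Setting `phi=0` in this sketch recovers an ordinary two-block SBM, and setting `p_in == p_out` with `phi > 0` gives the monophily-only regime discussed in the text.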

Figure 4: Four different overdispersed stochastic block models and the associated performance of 1-hop and 2-hop classifiers. (Top) Trait preference distributions for four instances of oSBMs (filled bars) varying , , and parameters: no homophily and no monophily (), monophily but no homophily (), homophily but no monophily (), and both homophily and monophily (). We then compute a null distribution (solid lines) based on affinities with binomial variation (for details of null model sampling, see Methods). (Bottom) Across the same corresponding oSBM settings, we compare the relative classification performance for different inference methods and observe a clear bifurcation of performance in the case of monophily but no homophily.

The oSBM allows us to validate and explore the relative performance of node inference methods on graphs with and without homophily and/or monophily. Figure 4A illustrates the distribution in gender preferences from four settings of the oSBM that vary the homophily and monophily parameters. In Figure 4B, we then compare the relative performance of 1-hop Majority Vote, ZGL, 2-hop Majority Vote, and LINK when attempting node classification on graphs from each of the four settings. We observe in the homophily-only setting (, ) that all inference methods perform well, while in the monophily-only setting (, ), 1-hop MV and ZGL have no predictive power while LINK-Logistic Regression and 2-hop MV show impressive performance despite the complete lack of homophily. We conclude that the presence of monophily can be sufficient, even in the complete absence of homophily, for accurate trait inference in networks.

The overarching bias-variance framework we develop for group preferences is highly interpretable, broadly enriches the tools available for studying prediction and explanation in social systems [20], and helps support the continued growth of studying variation in homophily. By adapting a quasi-likelihood approach, we can simultaneously estimate both bias and overdispersion in group preferences, where the traditional homophily index and our monophily index can be interpreted as parameters within a single extra-binomial generalized linear model. This model also offers straightforward techniques for testing the statistical significance of homophily and monophily in social networks.

The networks we study largely exhibit minimal gender homophily, and we attribute the success in gender prediction of the previously introduced LINK algorithm [59] to the presence of strongly overdispersed gender preferences in these networks. We verify and generalize these empirical observations by introducing overdispersion into a stochastic block model via a multi-level approach. We use this model to demonstrate how homophily is a sufficient but not necessary condition for gender inference, and that overdispersion provides an alternative sufficient condition. This model should be of independent interest to researchers looking to create realistic models of social data that can replicate the overdispersed preferences we observe.

These findings provide a new perspective on social network trait classification in general and gender in particular, and further complicate the already difficult task of preserving privacy in social networks. The overdispersion of preferences documented in this work motivates a re-examination of 2-hop network structure in network analysis very broadly, e.g. developing label-dependent inference methods [14] or community detection methods [11] that engage with relations among friends-of-friends, rather than only friends. Methods for studying privacy in bipartite affiliation networks [25] should also be revisited. We ultimately believe that the overdispersion of preferences deserves study as a social structure in its own right, and encourage investigations into social correlates of preference overdispersion. While preference biases have long been the predominant focus of group structure in social networks, this work highlights the need to give serious parallel consideration to variability.

Methods

Description of Data

We analyze populations of networks from two sources, the Facebook100 (FB100) network dataset [52] (Supplementary Note 5) and the Add Health in-school friendship nomination dataset [41] (Supplementary Note 9). For all networks in both datasets, we restrict the analysis to only nodes that disclose their gender, completely removing those with missing gender labels. We also restrict to nodes in the largest (weakly) connected component in order to benchmark against classification methods [61] that assume a connected graph. The Facebook100 dataset (FB100), analyzed in the main paper, consists of online friendship networks from Facebook that were collected in September 2005 from 100 U.S. colleges, primarily consisting of college-aged individuals [51]. We exclude Wellesley College, Smith College, and Simmons College from our analysis, which all have female nodes in the original network dataset.

Null distribution of gender preferences

In order to assess whether gender preferences are overdispersed in empirical networks, we compare the variance of the empirical distribution of $d_{i,\mathrm{in}} / d_i$ across all nodes in the same class to the variance of a Binomial null distribution without overdispersion. Since the basic model assumes that $D_{i,\mathrm{in}} \mid D_i = d_i \sim \mathrm{Binom}(d_i, h_r)$, we simulate draws from this distribution by repeatedly sampling from $\mathrm{Binom}(d_i, \hat{h}_r)$ for each node $i$ to produce a distribution of samples under the null.
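The null sampling described above can be sketched as follows. The function names, the number of repetitions, and the toy degree data are assumptions for illustration; the sketch draws each node's in-class count from a binomial with that node's observed total degree and the pooled homophily index.

```python
import random


def null_preference_samples(d_tot, h, n_rep=100, seed=0):
    """In-class proportions sampled under the homophily-only binomial null."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_rep):
        for d in d_tot:
            k = sum(rng.random() < h for _ in range(d))  # Binomial(d, h) draw
            samples.append(k / d)
    return samples


def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


# Hypothetical class of six nodes, each with total degree 10.
d_in = [9, 1, 8, 2, 5, 5]
d_tot = [10] * 6
h_hat = sum(d_in) / sum(d_tot)                       # pooled index, 0.5 here
empirical = [y / d for y, d in zip(d_in, d_tot)]
null = null_preference_samples(d_tot, h_hat)
emp_var, null_var = variance(empirical), variance(null)
```

An empirical variance well above the null variance, as in this toy example, is the pattern that the filled bars and solid lines in the figures are meant to contrast.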

Description of cross-validation

We vary the percentage of initially labeled nodes by selecting a labeled sample uniformly at random [31]. We train our models on the labeled individuals (training dataset), and measure classification performance on the remaining unlabeled nodes (testing dataset), using the same train/test splits across the different inference methods. We evaluate performance for 10 different random samples of initially labeled nodes, reporting the mean weighted Area Under the Curve (AUC) for each percentage of initially labeled nodes, where the weights are based on the relative number of true class training labels. The vertical error bars denote the standard deviation in AUC scores across the 10 samples.
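The evaluation loop can be sketched with two small helpers: one that reveals labels uniformly at random and one that computes AUC from scores. Note the assumptions: this sketch computes a plain (unweighted) AUC rather than the weighted variant reported in the paper, and the function names are hypothetical.

```python
import random


def auc(scores, truths):
    """Probability that a random positive outscores a random negative,
    counting ties as one half. Needs at least one example of each class."""
    pos = [s for s, t in zip(scores, truths) if t == 1]
    neg = [s for s, t in zip(scores, truths) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def random_split(nodes, frac_labeled, rng):
    """Reveal labels uniformly at random; the remaining nodes form the test set."""
    nodes = list(nodes)
    rng.shuffle(nodes)
    k = int(round(frac_labeled * len(nodes)))
    return nodes[:k], nodes[k:]


rng = random.Random(0)
train, test = random_split(range(10), 0.3, rng)
perfect = auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
chance = auc([0.5, 0.5], [1, 0])
```

Repeating `random_split` with different seeds, scoring the test nodes with each classifier on the same split, and averaging the resulting AUC values gives the kind of mean and standard deviation described above.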

Data availability

The Facebook100 (FB100) dataset is publicly available from the Internet Archive at https://archive.org/details/oxford-2005-facebook-matrix and other public repositories. The Add Health dataset can be obtained from the Carolina Population Center at the University of North Carolina by contacting addhealth_contracts@unc.edu.

Code availability

IPython notebooks are available at https://github.com/kaltenburger/gender_graph_code, documenting all results and figures.

Acknowledgements

We thank Bailey Fosdick, Jon Kleinberg, Isabel Kloumann, Daniel Larremore, Joel Nishimura, Mason Porter, Matthew Salganik, Sam Way, and attendees of the 2016 International Conference on Computational Social Science and the 2016 SIAM Workshop on Network Science for comments. Supported in part by a National Defense Science and Engineering Graduate (NDSEG) Fellowship, the Akiko Yamazaki and Jerry Yang Engineering Fellowship, and a David Morgenthaler II Faculty Fellowship.

References

  • [1] Alan Agresti and Maria Kateri. Categorical Data Analysis. Springer, 2011.
  • [2] David Bamman, Jacob Eisenstein, and Tyler Schnoebelen. Gender identity and lexical variation in social media. Journal of Sociolinguistics, 18(2):135–160, 2014.
  • [3] Smriti Bhagat, Graham Cormode, and S Muthukrishnan. Node classification in social networks. In Social Network Data Analytics, pages 115–148. Springer, 2011.
  • [4] Andrei Boutyline and Robb Willer. The social structure of political echo chambers: Variation in ideological homophily in online networks. Political Psychology, 2016.
  • [5] Christopher Chatfield and Gerald J Goodhardt. The beta-binomial model for consumer purchasing behaviour. In Mathematical Models in Marketing, pages 53–57. Springer, 1976.
  • [6] Fan Chung and Linyuan Lu. Connected components in random graphs with given expected degree sequences. Annals of Combinatorics, 6(2):125–145, 2002.
  • [7] James Coleman. Relational analysis: the study of social organizations with survey methods. Human Organization, 17(4):28–36, 1958.
  • [8] Anne Condon and Richard M Karp. Algorithms for graph partitioning on the planted partition model. Random Structures and Algorithms, 18(2):116–140, 2001.
  • [9] Martin J Crowder. Beta-binomial anova for proportions. Applied Statistics, pages 34–37, 1978.
  • [10] Sergio Currarini, Matthew O Jackson, and Paolo Pin. An economic model of friendship: Homophily, minorities, and segregation. Econometrica, 77(4):1003–1045, 2009.
  • [11] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6):066106, 2011.
  • [12] Thomas A DiPrete and Jerry D Forristal. Multilevel models: methods and substance. Annual Review of Sociology, pages 331–357, 1994.
  • [13] George T Duncan and Diane Lambert. Disclosure-limited data dissemination. Journal of the American Statistical Association, 81(393):10–18, 1986.
  • [14] Brian Gallagher and Tina Eliassi-Rad. Leveraging label-independent features for classification in sparsely labeled networks: An empirical study. In Advances in Social Network Mining and Analysis, pages 1–19. Springer, 2010.
  • [15] Paul H Garthwaite, Ian T Jolliffe, and Byron Jones. Statistical Inference. Oxford University Press, 2002.
  • [16] Andrew Gelman and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2006.
  • [17] Guang Guo and Hongxin Zhao. Multilevel modeling for binary data. Annual Review of Sociology, pages 441–462, 2000.
  • [18] Jianming He, Wesley W Chu, and Zhenyu Victor Liu. Inferring privacy information from social networks. In International Conference on Intelligence and Security Informatics, pages 154–165. Springer, 2006.
  • [19] Daniel F Heitjan and Srabashi Basu. Distinguishing “missing at random” and “missing completely at random”. The American Statistician, 50(3):207–213, 1996.
  • [20] Jake M Hofman, Amit Sharma, and Duncan J Watts. Prediction and explanation in social systems. Science, 355(6324):486–488, 2017.
  • [21] Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.
  • [22] David Jensen, Jennifer Neville, and Brian Gallagher. Why collective inference improves relational classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 593–598. ACM, 2004.
  • [23] Brian Karrer and Mark EJ Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83(1):016107, 2011.
  • [24] Jon M Kleinberg. Authoritative sources in a hyperlinked environment. In SODA, pages 668–677, 1998.
  • [25] Michal Kosinski, David Stillwell, and Thore Graepel. Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110(15):5802–5805, 2013.
  • [26] Gueorgi Kossinets and Duncan J Watts. Origins of homophily in an evolving social network. American Journal of Sociology, 115(2):405–450, 2009.
  • [27] David Laniado, Yana Volkovich, Karolin Kappler, and Andreas Kaltenbrunner. Gender homophily in online dyadic and triadic relationships. EPJ Data Science, 5(1):19, 2016.
  • [28] Daniel B Larremore, Aaron Clauset, and Abigail Z Jacobs. Efficiently inferring community structure in bipartite networks. Physical Review E, 90(1):012805, 2014.
  • [29] Paul F Lazarsfeld and Robert K Merton. Friendship as a social process: A substantive and methodological analysis. Freedom and Control in Modern Society, 18(1):18–66, 1954.
  • [30] Saskia Le Cessie and Johannes C Van Houwelingen. Ridge estimators in logistic regression. Applied Statistics, pages 191–201, 1992.
  • [31] Sofus A Macskassy and Foster Provost. Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research, 8:935–983, 2007.
  • [32] Tyler H McCormick, Amal Moussa, Johannes Ruf, Thomas A DiPrete, Andrew Gelman, Julien Teitler, and Tian Zheng. A practical guide to measuring social structure using indirectly observed network data. Journal of Statistical Theory and Practice, 7(1):120–132, 2013.
  • [33] Peter McCullagh and John A Nelder. Generalized Linear Models, volume 37. CRC press, 1989.
  • [34] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, pages 415–444, 2001.
  • [35] Jorge G Morel and Neerchal K Nagaraj. A finite mixture distribution for modelling multinomial extra variation. Biometrika, 80(2):363–371, 1993.
  • [36] Jennifer Watling Neal. Hanging out: Features of urban children’s peer social networks. Journal of Social and Personal Relationships, 2010.
  • [37] Jennifer Neville and David Jensen. Supporting relational knowledge discovery: Lessons in architecture and algorithm design. In Proceedings of the Data Mining Lessons Learned Workshop, 19th International Conference on Machine Learning, 2002.
  • [38] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
  • [39] Ross L Prentice. Binary regression using an extended beta-binomial distribution, with discussion of correlation induced by covariate measurement errors. Journal of the American Statistical Association, 81(394):321–327, 1986.
  • [40] Adrian E Raftery. Statistics in sociology, 1950–2000: A selective review. Sociological Methodology, 31(1):1–45, 2001.
  • [41] Michael D Resnick, Peter S Bearman, Robert Wm Blum, Karl E Bauman, Kathleen M Harris, Jo Jones, Joyce Tabor, Trish Beuhring, Renee E Sieving, Marcia Shew, et al. Protecting adolescents from harm: findings from the national longitudinal study on adolescent health. JAMA, 278(10):823–832, 1997.
  • [42] Barbara J Risman. Gender as a social structure theory wrestling with activism. Gender & Society, 18(4):429–450, 2004.
  • [43] Saharon Rosset, Ji Zhu, and Trevor Hastie. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5(Aug):941–973, 2004.
  • [44] Donald B Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.
  • [45] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93, 2008.
  • [46] Wesley Shrum, Neil H Cheek, and Saundra Hunter. Friendship in school: Gender and racial homophily. Sociology of Education, pages 227–239, 1988.
  • [47] Jeffrey A Smith, Miller McPherson, and Lynn Smith-Lovin. Social distance in the United States: Sex, race, religion, age, and education homophily among confidants, 1985 to 2004. American Sociological Review, 79(3):432–456, 2014.
  • [48] Ben Taskar, Pieter Abbeel, and Daphne Koller. Discriminative probabilistic models for relational data. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pages 485–492. Morgan Kaufmann Publishers Inc., 2002.
  • [49] Mike Thelwall. Homophily in MySpace. Journal of the American Society for Information Science and Technology, 60(2):219–231, 2009.
  • [50] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
  • [51] Amanda L Traud, Eric D Kelsic, Peter J Mucha, and Mason A Porter. Comparing community structure to characteristics in online collegiate social networks. SIAM Review, 53(3):526–543, 2011.
  • [52] Amanda L Traud, Peter J Mucha, and Mason A Porter. Social structure of Facebook networks. Physica A: Statistical Mechanics and its Applications, 391(16):4165–4180, 2012.
  • [53] Johan Ugander, Brian Karrer, Lars Backstrom, and Cameron Marlow. The anatomy of the Facebook social graph. arXiv preprint arXiv:1111.4503, 2011.
  • [54] Remco Van Der Hofstad. Random Graphs and Complex Networks, volume 1. Cambridge University Press, 2016.
  • [55] Duncan J Watts. Should social science be more solution-oriented? Nature Human Behaviour, 1:0015, 2017.
  • [56] Robert WM Wedderburn. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61(3):439–447, 1974.
  • [57] David A Williams. Extra-binomial variation in logistic linear models. Applied Statistics, pages 144–148, 1982.
  • [58] Andreas Wimmer and Kevin Lewis. Beyond and below racial homophily: ERG models of a friendship network documented on Facebook. American Journal of Sociology, 116(2):583–642, 2010.
  • [59] Elena Zheleva and Lise Getoor. To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles. In Proceedings of the 18th International Conference on World Wide Web, pages 531–540, 2009.
  • [60] Tian Zheng, Matthew J Salganik, and Andrew Gelman. How many people do you know in prison? using overdispersion in count data to estimate social structure in networks. Journal of the American Statistical Association, 101(474):409–423, 2006.
  • [61] Xiaojin Zhu, Zoubin Ghahramani, John Lafferty, et al. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, volume 3, pages 912–919, 2003.

Supplementary Information

The notation is explained in the main paper, and we repeat it here for clarity. Note that we use the terminology “nodes” and “individuals” interchangeably. For notational simplicity, we will use $i \in r$ to mean the set of all nodes $i$ with attribute value $r$, $n_r$ to be the number of nodes with attribute value $r$, and $n_s$ to be the number of nodes with attribute value $s$ (where we focus primarily on a 2-class set-up). The in-class degree $d_{i,\mathrm{in}}$ denotes the observed number of friendships node $i$ has with individuals that also have the same attribute value $r$, and the out-class degree $d_{i,\mathrm{out}}$ denotes the observed number of friendships node $i$ has with those that do not have attribute value $r$. We use capital letters ($D_{i,\mathrm{in}}$, $D_{i,\mathrm{out}}$) when treating the in-/out-class degrees as random variables. Finally, we represent the probability of an in-class link forming as $p_{\mathrm{in}}$ for nodes in class $r$ and represent the probability of an out-class link forming as $p_{\mathrm{out}}$, where we are assuming 2 classes, in which case $p_{\mathrm{out}}$ is necessarily equivalent for both classes.

Appendix A Distribution of in-class Degrees

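We analyze a 2-class set-up divided into attribute classes $r$ and $s$, where we give derivations for all nodes $i \in r$ and the set-up is similar for $i \in s$. For all nodes $i \in r$, node $i$’s total observed degree $d_i$ is partitioned between in-class degrees $d_{i,\mathrm{in}}$ and out-class degrees $d_{i,\mathrm{out}}$. We observe first that the conditional random variable $D_{i,\mathrm{in}} \mid D_{i,\mathrm{in}} + D_{i,\mathrm{out}} = d_i$ is approximately binomially distributed, for all $i$ in a particular class, according to the following argument: for large populations (where $n_r$ and $n_s$ are large with $n_r p_{\mathrm{in}}$ and $n_s p_{\mathrm{out}}$ constant), $D_{i,\mathrm{in}}$ and $D_{i,\mathrm{out}}$, which are binomially distributed, can be viewed as approximately Poisson distributed. Under this Poisson approximation, the conditional distribution is $\mathrm{Binom}\left(d_i, \frac{n_r p_{\mathrm{in}}}{n_r p_{\mathrm{in}} + n_s p_{\mathrm{out}}}\right)$. In full formality: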

(7)
(8)
(9)
(10)

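where the error term is asymptotically small when $n_r$ and $n_s$ are both large. These steps allow us to identify the conditional distribution as approximately $\mathrm{Binom}\left(d_i, \frac{n_r p_{\mathrm{in}}}{n_r p_{\mathrm{in}} + n_s p_{\mathrm{out}}}\right)$. When $n_r = n_s$, this distribution reduces to simply $\mathrm{Binom}\left(d_i, \frac{p_{\mathrm{in}}}{p_{\mathrm{in}} + p_{\mathrm{out}}}\right)$, and when $p_{\mathrm{in}} = p_{\mathrm{out}}$, this distribution reduces simply to $\mathrm{Binom}\left(d_i, \frac{n_r}{n_r + n_s}\right)$.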
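For reference, the Poisson step of this argument can be written out as below. This is a sketch of the standard argument and not necessarily the authors' exact sequence of steps. Here $D_{i,\mathrm{in}}$ and $D_{i,\mathrm{out}}$ denote node $i$'s in-class and out-class degrees as random variables, assumed independent, and $d_i$ is the total degree. The shorthand $\lambda_{\mathrm{in}} = n_r p_{\mathrm{in}}$ and $\lambda_{\mathrm{out}} = n_s p_{\mathrm{out}}$ is introduced here only for illustration, and the error of the Poisson approximation is suppressed.

```latex
\begin{align*}
\Pr\left(D_{i,\mathrm{in}} = k \mid D_{i,\mathrm{in}} + D_{i,\mathrm{out}} = d_i\right)
  &= \frac{\Pr(D_{i,\mathrm{in}} = k)\,\Pr(D_{i,\mathrm{out}} = d_i - k)}
          {\Pr(D_{i,\mathrm{in}} + D_{i,\mathrm{out}} = d_i)} \\
  &\approx \frac{\dfrac{e^{-\lambda_{\mathrm{in}}}\lambda_{\mathrm{in}}^{k}}{k!}\cdot
                 \dfrac{e^{-\lambda_{\mathrm{out}}}\lambda_{\mathrm{out}}^{d_i-k}}{(d_i-k)!}}
                {\dfrac{e^{-(\lambda_{\mathrm{in}}+\lambda_{\mathrm{out}})}
                        (\lambda_{\mathrm{in}}+\lambda_{\mathrm{out}})^{d_i}}{d_i!}} \\
  &= \binom{d_i}{k}
     \left(\frac{\lambda_{\mathrm{in}}}{\lambda_{\mathrm{in}}+\lambda_{\mathrm{out}}}\right)^{k}
     \left(\frac{\lambda_{\mathrm{out}}}{\lambda_{\mathrm{in}}+\lambda_{\mathrm{out}}}\right)^{d_i-k}
\end{align*}
```

The exponential factors cancel because a sum of independent Poisson variables is again Poisson with the summed rate, which leaves a binomial probability mass function with success probability $\lambda_{\mathrm{in}} / (\lambda_{\mathrm{in}} + \lambda_{\mathrm{out}})$.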

Appendix B Homophily Index as Intercept Term

Here we show that the maximum likelihood estimate of the intercept term in the logistic regression model applied to the in- and out-class degree counts among nodes in a particular class can be interpreted as the conventional homophily index $\hat{h}_r$. This result is derived specifically for a two-class setting.

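Consider $D_{i,\mathrm{in}} \mid d_i \sim \mathrm{Binom}(d_i, h_r)$ for nodes $i \in r$, as derived above with the total degree $d_i$ and the parameter $h_r$ treated as fixed for clarity and where we define the homophily parameter $h_r = \frac{n_r p_{\mathrm{in}}}{n_r p_{\mathrm{in}} + n_s p_{\mathrm{out}}}$. Then since the binomial distribution is a member of the exponential dispersion family and can therefore be modeled using a generalized linear model (GLM) with a logit link function, we have that
$\mathrm{logit}(h_r) = \log\left(\frac{h_r}{1 - h_r}\right) = \log\left(\frac{n_r p_{\mathrm{in}}}{n_s p_{\mathrm{out}}}\right) = \beta_0$, or equivalently $h_r = \frac{e^{\beta_0}}{1 + e^{\beta_0}}$.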

Given the observed degree counts for nodes with attribute value $r$ represented as $\{(d_{i,\mathrm{in}}, d_{i,\mathrm{out}})\}_{i \in r}$, which are approximately independent (but weakly dependent due to combinatorial constraints on the joint distribution of degrees), we derive the maximum likelihood estimate $\hat{\beta}_0^{\mathrm{MLE}}$ and show its connection with the homophily index $\hat{h}_r$. First consider the likelihood function:

(11)
(12)
(13)
(14)

We transform this likelihood function to a log-likelihood function:

(15)
(16)
(17)

and from here we set the derivative of the log-likelihood with respect to $\beta_0$ equal to zero and solve for $\beta_0$:

(18)
(19)
(20)

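Here $\hat{\beta}_0^{\mathrm{MLE}}$ is the maximum likelihood estimator, and we use the superscript “MLE” to make this clear. Thus when using binomial logistic regression applied to the in- and out-class degrees $(d_{i,\mathrm{in}}, d_{i,\mathrm{out}})$, we obtain that $\mathrm{logit}^{-1}(\hat{\beta}_0^{\mathrm{MLE}}) = \hat{h}_r = \frac{\sum_{i \in r} d_{i,\mathrm{in}}}{\sum_{i \in r} d_i}$, the conventional homophily index.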
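The result can be checked numerically. The sketch below fits the intercept-only binomial logistic model by Newton's method and compares the fitted probability with the pooled fraction of in-class links. The function name, the toy degree values, and the choice of Newton's method are assumptions of this illustration. It relies on the intercept-only log-likelihood $\ell(\beta_0) = \sum_i \left[ d_{i,\mathrm{in}} \beta_0 - d_i \log(1 + e^{\beta_0}) \right]$ up to an additive constant.

```python
import math

def fit_intercept(d_in, d, n_iter=50):
    """Newton's method for the intercept of an intercept-only binomial logit model."""
    s_in, s_tot = sum(d_in), sum(d)
    beta = 0.0
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + math.exp(-beta))  # inverse logit of the current intercept
        score = s_in - s_tot * mu           # first derivative of the log-likelihood
        info = s_tot * mu * (1.0 - mu)      # negative second derivative
        beta += score / info
    return beta

d_in = [3, 7, 2, 5]   # toy in-class degrees (illustrative values)
d = [10, 12, 4, 9]    # toy total degrees
beta_hat = fit_intercept(d_in, d)
fitted_prob = 1.0 / (1.0 + math.exp(-beta_hat))
h_hat = sum(d_in) / sum(d)  # pooled fraction of in-class links
```

The score depends on the data only through the two sums, which is why the fitted probability equals the pooled ratio rather than an average of per-node ratios. The sketch assumes the pooled ratio lies strictly between 0 and 1, since otherwise the maximum likelihood intercept is infinite.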

Appendix C Properties of Binomial Degree Data

For a realized expected degree sequence among nodes in class $r$, the conditional distribution of in-class degrees is (asymptotically, per Section 2): $D_{i,\mathrm{in}} \mid d_i \sim \mathrm{Binom}(d_i, h_r)$ as previously derived. In this section, we assess the unconditional expectation and variance of the in-class degree sequence in settings where $h_r$ is assumed to be constant for all nodes (Model I below) and when it is assumed to be random (Model II below). The derivations of Model I and Model II follow those presented in Chapter 10 of [15] and are adapted to this context in terms of in- and out-class degrees.

C.1 Without overdispersion (Model I)

The expectation of $D_{i,\mathrm{in}}$ when there is no overdispersion (when $h_r$ is constant for all nodes) is:

(21)
(22)
(23)
(24)
(25)

The variance (again with $h_r$ known) is:

(27)

Considering each of these two terms, we have:

(28)
(29)
(30)
(31)
(32)

and

(33)

As a result, we obtain that $\mathrm{Var}(D_{i,\mathrm{in}}) = d_i h_r (1 - h_r)$.

If the expectation and variance are rewritten in terms of , then they are: and , respectively.

C.2 With overdispersion (Model II)

Following previous notational set-ups, we introduce overdispersion by allowing $h_i$ to vary across nodes such that $E[h_i] = h_r$ and (for notational convenience as will be clearer later) that $\mathrm{Var}(h_i) = \phi h_r (1 - h_r)$. Note that the only assumption we’re making is that $\phi$ is constant across nodes in a given class, but we are not making any distributional assumptions on $h_i$.
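To build intuition for how node-level variation in preferences inflates the variance of in-class degrees, the simulation below draws each node's preference from a Beta distribution. The Beta choice is purely an assumption of this sketch, since the model above makes no distributional assumption. The symbols `h` (mean preference) and `phi` (dispersion, defined so that the preference variance is `phi * h * (1 - h)`) are notation introduced for the illustration. The comparison value is the standard variance formula for this mean-variance specification, $d\,h(1-h)\left[1 + (d-1)\phi\right]$.

```python
import random
from statistics import mean, pvariance

def simulate_in_class(d, a, b, n_nodes, seed=0):
    """Draw in-class degrees with node-level preferences h_i ~ Beta(a, b)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_nodes):
        h_i = rng.betavariate(a, b)  # node-specific preference (assumed Beta)
        # One Binomial(d, h_i) draw as a sum of d Bernoulli(h_i) trials.
        draws.append(sum(rng.random() < h_i for _ in range(d)))
    return draws

d, a, b = 20, 2.0, 2.0
h = a / (a + b)            # mean preference
phi = 1.0 / (a + b + 1.0)  # for Beta(a, b): Var(h_i) = phi * h * (1 - h)
draws = simulate_in_class(d, a, b, n_nodes=20000)

binomial_var = d * h * (1 - h)                     # variance without overdispersion
inflated_var = binomial_var * (1 + (d - 1) * phi)  # variance with overdispersion
empirical_mean = mean(draws)
empirical_var = pvariance(draws)
```

With these toy parameters the mean in-class degree stays near `d * h`, as it would without overdispersion, while the variance is several times the binomial value. Overdispersion therefore leaves the average preference unchanged and shows up only in the spread.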

Then, the unconditional expectation of $D_{i,\mathrm{in}}$ (unconditional on $h_i$) when there is overdispersion (when $h_i$ is random across all nodes) is:

(34)
(35)
(36)
(37)
(38)

And the unconditional variance (unconditional on $h_i$) is:

(40)

Considering each part, we have: