Analysis of Partially Observed Networks via Exponential-family Random Network Models
Exponential-family random network (ERN) models specify a joint representation of both the dyads of a network and nodal characteristics. This class of models allows the nodal characteristics to be modelled as stochastic processes, expanding the range and realism of exponential-family approaches to network modelling. In this paper we develop a theory of inference for ERN models when only part of the network is observed, as well as specific methodology for missing data, including non-ignorable mechanisms for network-based sampling designs and for latent class models. In particular, we consider data collected via contact tracing, which is of considerable importance to infectious disease epidemiology and public health.
Ian E. Fellows and Mark S. Handcock
It is not uncommon for researchers to collect data on a subset of a single network rather than observing the full network. This partially observed case has been studied within the framework of exponential-family random graph models (ERGM) by Handcock and Gile (2010); however, their formulation suffers from the limitation that any nodal attributes included in the model must be fully observed, and only dyads may be missing. This assumption is not met in most sampling designs, where only some of the nodes are surveyed by the researcher, and it limits the practical usefulness of ERGMs in the missing data setting.
By including nodal attributes as variates rather than fixed quantities, exponential-family random network models (ERNM, Fellows and Handcock, 2012) can provide a convenient basis for inference in cases where the data are partially unobserved, due either to design or to out-of-design (e.g., non-response) mechanisms. While our framework is applicable to all partial observation mechanisms, we consider three common mechanisms in more detail:
- Missing Data:
If the population is composed of a large number of units, or the number of edges is large, the resources to observe a full network are often not available. Often units or dyads are unavailable for sampling or do not provide complete responses to a survey instrument. In this case, only some of the dyads and nodal characteristics are collected. We treat missing data as a form of sampling in which the sampling mechanism is unknown and outside the control of the researcher, that is, an out-of-design missing data mechanism. A good example of this is the National Longitudinal Study of Adolescent Health (Add Health), a school-based, longitudinal study of the health-related behaviours of adolescents and their outcomes in young adulthood. The study design sampled 80 high schools and 52 middle schools from the U.S., representative with respect to region of country, urbanicity, school size, school type, and ethnicity (Harris et al., 2003). In 1994-95 an in-school questionnaire was administered to a nationally representative sample of students in grades 7 through 12. In addition to demographic and contextual information, each respondent was asked to nominate up to five boys and five girls within the school whom they regarded as their best friends. Thus each student could nominate up to ten students within the school (Udry, 2003). The nominations and contextual information were not available for some of the adolescents, either due to absence from school while the survey was being conducted, or refusal to participate. Thus, both the graph and nodal variates contained missing values.
- Network sampling designs:
Many studies in hard-to-reach populations use study designs that trace the linkages of an underlying social network. In these designs the network is partially observed, but it is not of primary interest to the researcher. Such sampling designs have been exploited to estimate population disease rates (Gile and Handcock, 2010; Gile, 2011; Gile and Handcock, 2011).
- Latent variables:
Some quantities of the network may be in principle unobservable. The probability model for a network may posit the existence of unknown variables which do not correspond to any observable quantity. For example, stochastic block models (Nowicki and Snijders, 2001) posit the existence of classes of nodes, conditional upon which the dyads are independent. These classes are unobservable nodal characteristics and must be inferred from the relational data. Similarly, latent position cluster models (Handcock et al., 2006) posit the existence of unobservable continuous nodal quantities that provide a spatial geometry for the network structure.
In this paper we develop approaches for each of these scenarios in the context of ERNMs. Sections 2 through 4 introduce ERNM and extend the theory to incorporate partially observed populations. Section 5 develops methodology for each of the scenarios. Sub-section 5.1 looks at the effect of random non-response, and sub-section 5.2 applies a latent class model to extract unknown clusters from a real data-set. Sub-section 5.3 develops estimates based on contact tracing designs, which are of vital importance to the public health community. To our knowledge, the methods outlined in this paper represent the first statistically justifiable approach to inference from contact tracing data.
2 Exponential-family random network models
Exponential-family random network models (Fellows and Handcock, 2012) are a generalisation of the exponential-family random graph model (Frank and Strauss, 1986; Hunter and Handcock, 2006), where both dyads and nodal characteristics are treated as random variates. Formally, in a population of $n$ units, let $Y_{ij} = 1$ indicate that unit $i$ has a tie to unit $j$. Let $Y$ be an $n \times n$ matrix and $X$ be an $n \times q$ matrix of unit covariates. We define a network as the union of the nodal covariates and the graph structure (i.e. $Z = (Y, X)$). An exponential family model of $Z$ is expressed as
$$P(Z = z \mid \eta) = \frac{1}{c(\eta, \mathcal{Z})} \exp\{\eta \cdot g(z)\}, \qquad z \in \mathcal{Z}, \tag{2}$$
where $\eta$ is a vector of parameters, $g$ is a vector valued function defining a set of sufficient statistics, $\mathcal{Z}$ is the sample space of networks and $c(\eta, \mathcal{Z})$ is the normalising constant. This model is developed in Fellows and Handcock (2012).
2.1 The Simple Homophily Model
Though any set of network statistics can be represented by $g$ in equation (2), the examples in this paper will focus on a particularly parsimonious, but powerful, network model. Suppose that $X$ is a univariate categorical variable with $K$ levels, labelled $1, \dots, K$. If $x_i = k$ we say that unit $i$ is in group $k$. A joint model for $Y$ and $X$ is
$$P(Y = y, X = x \mid \eta) \propto \exp\Big\{ \eta_1 \sum_{i,j} y_{ij} + \eta_2\, \mathrm{homophily}(y, x) + \sum_{k=1}^{K-1} \eta_{k+2}\, n_k(x) \Big\}.$$
The first term of this model is the number of edges, and controls the density of the graph. The last term represents the number of nodes in each category of $X$, except for the last level, which is dropped to maintain identifiability of the model. The second term is the regularised sample homophily of $X$, as introduced by Fellows and Handcock (2012), and is defined as
$$\mathrm{homophily}(y, x) = \sum_{i=1}^{n} \Big[ \sqrt{d_{i,x_i}(y, x)} - \mathrm{E}_{\perp}\big( \sqrt{d_{i,x_i}(Y, X)} \,\big|\, Y = y,\ n(X) = n(x) \big) \Big],$$
where $d_{i,k}(y, x)$ is the number of edges between node $i$ and nodes in group $k$, and $\mathrm{E}_{\perp}(\cdot)$ is the expectation of the statistic $\sqrt{d_{i,x_i}}$, conditional upon $Y$ and the category counts (that is, the number of nodes in each category of $X$, $n(X)$), assuming that $X$ and $Y$ are independent. Thus, each term in the sum is the square root of the number of neighbours of a node which share the same category, minus what would be expected by chance. Using this form of homophily avoids the degeneracy problems found in other formulations. For a more thorough justification, see Fellows and Handcock (2012).
While the examples in this paper focus on applications of the simple homophily model, the framework presented here applies to an arbitrary set of network statistics $g$. For example, in many applications the nodal attributes are multivariate, and their relationships are of interest to the researcher. Fellows and Handcock (2012) developed a network statistic that can be interpreted as a conditional logistic regression term which, if included, can model the relationship of several categorical variates.
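The three statistics of the simple homophily model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is undirected, and the expectation under independence is evaluated by treating the number of same-group neighbours of a node as hypergeometric (its neighbours being a random subset of the other nodes, given the category counts). That evaluation and the function names are assumptions made here, not details taken from Fellows and Handcock (2012).

```python
from math import comb, sqrt

def expected_sqrt_same_group(degree, n, n_k):
    """E[sqrt(D)] for D ~ Hypergeometric: `degree` neighbours drawn without
    replacement from the other n - 1 nodes, n_k - 1 of which share the group.
    (Assumed way of evaluating the independence expectation.)"""
    others, same = n - 1, n_k - 1
    total = comb(others, degree)
    return sum(
        sqrt(j) * comb(same, j) * comb(others - same, degree - j) / total
        for j in range(min(degree, same) + 1)
    )

def simple_homophily_stats(adj, x):
    """adj: symmetric 0/1 adjacency matrix (list of lists); x: group labels.
    Returns (edge count, regularised homophily, group counts without last level)."""
    n = len(x)
    levels = sorted(set(x))
    counts = {k: x.count(k) for k in levels}
    edges = sum(adj[i][j] for i in range(n) for j in range(i + 1, n))
    homophily = 0.0
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j]]
        same = sum(1 for j in nbrs if x[j] == x[i])
        # observed sqrt(same-group degree) minus its chance expectation
        homophily += sqrt(same) - expected_sqrt_same_group(len(nbrs), n, counts[x[i]])
    group_counts = [counts[k] for k in levels[:-1]]  # last level dropped
    return edges, homophily, group_counts

# Two disjoint within-group edges on four nodes.
adj = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
stats = simple_homophily_stats(adj, [0, 0, 1, 1])
```

In this toy graph every node has one neighbour, which is in its own group, so each node contributes 1 minus the chance expectation of 1/3.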
3 Likelihood-based Inference from Partially Observed Networks
In this section we develop likelihood-based inference for network models based on partial observation of the networks. The approach allows non-ignorable sampling mechanisms for the networks, including some common network-based sampling designs.
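Handcock and Gile (2010) developed a theory of missing data for ERG models, and the specification for ERN models proceeds similarly, though our formulation supports a more general class of missingness processes known as missing not at random (MNAR; see Rubin, 1976). Let $Z_{obs}$ and $Z_{mis}$ represent, respectively, the observed and unobserved part of the complete network $Z$. We write $Z = (Z_{obs}, Z_{mis})$, with realisations $z = (z_{obs}, z_{mis})$. Let $D$ be a random variable representing the sampling process with realisation $d$. The probabilistic distribution of $D$ is the sampling mechanism, and must fully specify the sample selection process, including the partition of $Z$ into $Z_{obs}$ and $Z_{mis}$. Typically, $D$ will consist of an $n$ by $n$ matrix indicating whether the dyad was sampled, and an $n$ by $q$ matrix indicating which nodal attributes are missing; however, $D$ may contain additional information about the sampling, such as the order of sampling.
We write the full data likelihood as
$$L(\eta \mid Z = z, D = d) = P(D = d \mid Z = z)\, P(Z = z \mid \eta),$$
and we wish to draw inferences about $\eta$ from the observed data likelihood, defined as
$$L(\eta \mid Z_{obs} = z_{obs}, D = d) = \sum_{z_{mis}} P(D = d \mid Z = (z_{obs}, z_{mis}))\, P(Z = (z_{obs}, z_{mis}) \mid \eta).$$
This probability model jointly represents the distribution of the network $Z$, and the sampling process $D$. The functional form of $P(D = d \mid Z = z)$ is dependent on the form of missingness, and will differ depending on how $D$ was obtained. Section 5.3 illustrates a design of particular interest known as biased seed link tracing. When the sampling probabilities only depend on the observed data, that is, $P(D = d \mid Z = z) = P(D = d \mid Z_{obs} = z_{obs})$, then the sampling design is amenable to the model (Handcock and Gile, 2010), and is ignorable in the sense of Rubin (1976). In this case, the likelihood simplifies to
$$L(\eta \mid Z_{obs} = z_{obs}) \propto \sum_{z_{mis}} P(Z = (z_{obs}, z_{mis}) \mid \eta).$$
Thus, when the sampling process is ignorable, inferences on $\eta$ are not affected by $D$, and so knowledge of the sampling process is not essential for the process of inference.
Having defined the full and observed likelihood, it is also useful to define the missing data likelihood:
$$P(Z_{mis} = z_{mis} \mid Z_{obs} = z_{obs}, D = d, \eta) = \frac{P(D = d \mid Z = z) \exp\{\eta \cdot g(z)\}}{c(\eta, z_{obs}, d)}, \qquad c(\eta, z_{obs}, d) = \sum_{z'_{mis}} P(D = d \mid Z = (z_{obs}, z'_{mis})) \exp\{\eta \cdot g(z_{obs}, z'_{mis})\}.$$
The (observed data) likelihood can then be rewritten as the ratio of two normalising constants
$$L(\eta \mid Z_{obs} = z_{obs}, D = d) = \frac{c(\eta, z_{obs}, d)}{c(\eta, \mathcal{Z})},$$
and using this, we may write the observed data log likelihood ratio of $\eta$ versus $\eta_0$ as
$$\ell(\eta) - \ell(\eta_0) = \log \mathrm{E}_{\eta_0}\big[ \exp\{(\eta - \eta_0) \cdot g(Z)\} \,\big|\, Z_{obs} = z_{obs}, D = d \big] - \log \mathrm{E}_{\eta_0}\big[ \exp\{(\eta - \eta_0) \cdot g(Z)\} \big]. \tag{4}$$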
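The identity expressing the observed data likelihood as a ratio of two normalising constants can be checked by brute force on a network small enough to enumerate. The sketch below assumes a toy setting that is not from the paper: an undirected graph described only by its dyads, an edges-only model with scalar parameter `eta`, and a user-supplied sampling probability `samp_prob` (a hypothetical argument that defaults to a constant, i.e. an ignorable design).

```python
from itertools import product
from math import exp

def observed_likelihood(eta, observed, n_dyads, stat=sum, samp_prob=lambda z: 1.0):
    """Observed data likelihood for a tiny dyad-only network.

    observed : dict mapping dyad index -> 0/1 for the observed dyads
    stat     : sufficient statistic g(z); defaults to the edge count
    samp_prob: P(D = d | Z = z) as a function of the complete network z
    """
    missing = [d for d in range(n_dyads) if d not in observed]
    # Normalising constant of the full model: sum over every network.
    c_full = sum(exp(eta * stat(z)) for z in product((0, 1), repeat=n_dyads))
    # Normalising constant of the missing data likelihood: sum over
    # completions of the observed part, weighted by the sampling probability.
    c_mis = 0.0
    for fill in product((0, 1), repeat=len(missing)):
        z = [0] * n_dyads
        for d, v in observed.items():
            z[d] = v
        for d, v in zip(missing, fill):
            z[d] = v
        c_mis += samp_prob(tuple(z)) * exp(eta * stat(z))
    return c_mis / c_full

# Three dyads: dyad 0 is an edge, dyad 1 is a non-edge, dyad 2 is unobserved.
value = observed_likelihood(0.5, {0: 1, 1: 0}, n_dyads=3)
```

For the edges-only model the dyads are independent, so the result should equal the probability of one edge times the probability of one non-edge, which gives a simple check on the enumeration.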
4 Calculating the MLE with MCMC
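For most models, equation (4) is not analytically solvable. However, we may approximate it by Markov Chain Monte Carlo (MCMC). Let $z^{(1)}, \dots, z^{(k)}$ and $\tilde z^{(1)}, \dots, \tilde z^{(m)}$, where $\tilde z^{(j)} = (z_{obs}, z_{mis}^{(j)})$, be samples from the full likelihood and missing data likelihood respectively with parameters $\eta_0$. Then equation (4) may be approximated by
$$\ell(\eta) - \ell(\eta_0) \approx \log\Big[ \frac{1}{m} \sum_{j=1}^{m} \exp\{(\eta - \eta_0) \cdot g(\tilde z^{(j)})\} \Big] - \log\Big[ \frac{1}{k} \sum_{i=1}^{k} \exp\{(\eta - \eta_0) \cdot g(z^{(i)})\} \Big].$$
As $\eta$ moves away from $\eta_0$ the quality of this approximation degrades. Because we will be optimising equation (4), it is useful to have both the first and second derivatives of the log likelihood, which are
$$\frac{\partial \ell}{\partial \eta} = \mathrm{E}_{\eta}\big[ g(Z) \mid Z_{obs} = z_{obs}, D = d \big] - \mathrm{E}_{\eta}\big[ g(Z) \big],$$
$$\frac{\partial^2 \ell}{\partial \eta\, \partial \eta^{\top}} = \mathrm{Cov}_{\eta}\big[ g(Z) \mid Z_{obs} = z_{obs}, D = d \big] - \mathrm{Cov}_{\eta}\big[ g(Z) \big].$$
The expectations and covariances in these derivatives can be approximated using the conditional and unconditional MCMC samples, and thus we can use the following algorithm to approximate the MLE.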
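1. Let $i = 0$ and choose initial parameter values.
2. Use MCMC to generate $k$ samples from the full likelihood at the current parameter values.
3. Use MCMC to generate $m$ samples from the missing data likelihood at the current parameter values.
4. Using the samples from steps 2 and 3 in the approximation to equation (4), find the parameter values maximising the likelihood ratio, subject to constraints that keep the update within the region where the approximation is reliable.
5. If the likelihood has not converged, set $i = i + 1$ and go to step 2.
6. Take the final parameter values as the MLE.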
Asymptotic standard errors for may be obtained using an MCMC approximation to the Fisher information (i.e. the second derivative of the log likelihood). While asymptotics of the Fisher information are not assured with respect to ERNM (or ERGM) models, Fellows and Handcock (2012) show strong empirical agreement between the Fisher information standard errors and parametric bootstrap simulations. Standard errors for the mean value parameters can be approximated by MCMC sampling.
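The Monte Carlo approximation to the log likelihood ratio and its gradient can be sketched as follows. This is a minimal illustration, assuming the sampling mechanism has no free parameters and that the sufficient statistics of the two MCMC samples have already been computed; the function names and the list-of-lists input format are choices made here, and no MCMC sampler is included.

```python
from math import exp, log

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + log(sum(exp(v - m) for v in values) / len(values))

def approx_llr(eta, eta0, full_stats, cond_stats):
    """Approximate l(eta) - l(eta0).

    full_stats: g(z) for samples from the full likelihood at eta0
    cond_stats: g(z) for samples from the missing data likelihood at eta0
    """
    delta = [a - b for a, b in zip(eta, eta0)]
    return (log_mean_exp([dot(delta, g) for g in cond_stats])
            - log_mean_exp([dot(delta, g) for g in full_stats]))

def approx_gradient(eta, eta0, full_stats, cond_stats):
    """Importance-weighted estimate of E[g | observed] - E[g] at eta."""
    delta = [a - b for a, b in zip(eta, eta0)]

    def weighted_mean(stats):
        logs = [dot(delta, g) for g in stats]
        m = max(logs)
        w = [exp(v - m) for v in logs]
        total = sum(w)
        return [sum(wi * g[j] for wi, g in zip(w, stats)) / total
                for j in range(len(eta))]

    cond, full = weighted_mean(cond_stats), weighted_mean(full_stats)
    return [c - f for c, f in zip(cond, full)]

# Hand-made statistics for a one-parameter model (illustrative values only).
full_stats = [[1.0], [3.0]]
cond_stats = [[2.0], [4.0]]
llr = approx_llr([log(2.0)], [0.0], full_stats, cond_stats)
grad0 = approx_gradient([0.0], [0.0], full_stats, cond_stats)
```

At the sampling parameter the importance weights are all equal, so the gradient reduces to the difference between the conditional and unconditional sample means of the statistics.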
5 Specific forms of partial observation
In this section we consider the three common forms of partial observation considered in the introduction, each corresponding to a different mechanism of partial observation or conceptualisation of that mechanism.
5.1 Missing Data: Unobserved Relational Information
It is common when surveying networked populations that there are insufficient resources to conduct a census of the population and their relations. For efficiency reasons, a sampling based survey is undertaken, or the full network is partially observed due to non-response. In this sub-section, we give an illustration of the effect of non-response where the dyad information is missing completely at random. We consider the relations of “liking” among 18 monks in a monastery (Sampson, 1969). The network analysed has a directed edge between two monks if the sender monk ranked the receiver monk in the top three monks for positive affection in any of the three interviews given over a twelve month period (Hoff et al., 2002). The sociogram of this data-set is shown in Figure 1. One nodal attribute of interest is an indicator of attendance at the minor “Cloisterville” seminary before coming to the monastery.
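Key to the node labels in Figure 1 (L = Loyal Opposition, T = Turks, O = Outcasts):
| ID | Monk (group) | ID | Monk (group) |
|---|---|---|---|
| 1 | Ramauld (L) | 10 | Gregory (T) |
| 2 | Bonaventure (L) | 11 | Hugh (T) |
| 3 | Ambrose (L) | 12 | Boniface (T) |
| 4 | Berthold (L) | 13 | Mark (T) |
| 5 | Peter (L) | 14 | Albert (T) |
| 6 | Louis (L) | 15 | Amand (O) |
| 7 | Victor (L) | 16 | Basil (O) |
| 8 | Winfred (T) | 17 | Elias (O) |
| 9 | John (T) | 18 | Simplicius (O) |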
We fit a simple homophily model on Cloisterville status using the full data. We then ran simulations on the effect of missingness by selecting dyads, and Cloisterville status variates, completely at random and setting them to missing. Figure 2 shows one simulated missingness pattern with 15% missing. We ran 100 simulations at each missingness percentage. Means and standard deviations of the parameter estimates from the ERNM fits to these simulated missingness patterns are displayed in Figure 3.
We see that the standard deviations of the estimates increase as the amount of missingness increases. At the higher missingness levels some bias is apparent relative to the full data MLE, but not more than one standard deviation. One possible explanation for this bias is that there were only six monks who attended Cloisterville, and so at 50% missingness, a significant number of samples will include no (or perhaps a single) Cloisterville monks.
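The missingness simulation described above can be sketched as follows. The code assumes a directed 0/1 adjacency matrix and marks each dyad and each nodal attribute as missing independently with the same probability; whether the study used independent draws or a fixed number of missing entries is not stated, so this is one possible reading. The function name and the use of `None` for missing values are illustrative choices.

```python
import random

def make_mcar(adj, x, frac, rng):
    """Return copies of (adj, x) with entries set to None completely at random.

    adj : directed 0/1 adjacency matrix (list of lists)
    x   : list of nodal attribute values
    frac: probability that any given dyad or attribute is missing
    rng : random.Random instance, so that simulations are reproducible
    """
    n = len(x)
    adj_obs = [row[:] for row in adj]
    x_obs = list(x)
    for i in range(n):
        for j in range(n):
            if i != j and rng.random() < frac:
                adj_obs[i][j] = None  # dyad (i, j) is unobserved
        if rng.random() < frac:
            x_obs[i] = None  # attribute of node i is unobserved
    return adj_obs, x_obs

# One simulated pattern at 15% missingness on a toy three-node network.
rng = random.Random(2013)
adj = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
adj_obs, x_obs = make_mcar(adj, [1, 0, 0], 0.15, rng)
```

Repeating the call with fresh draws and refitting the model each time would give the kind of replicate estimates summarised in Figure 3.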
5.2 Latent Variables: Stochastic Block Models
In this sub-section we consider the situation where some characteristics of the network are posited but unobserved. Specifically, we consider the case where each node of the network belongs to a latent class, and the structure of the network depends on that latent class. The traditional approach to this has been stochastic block models (Nowicki and Snijders, 2001), and here we show how these models fall naturally out of our general formulation.
It is apparent from Figure 1 that the pattern of “liking” between the monks may exhibit clustering. Through close sociological study, Sampson (1969) identified three clusters, which he dubbed the Turks, the Loyal Opposition and the Outcasts (see Figure 1). Here we will attempt to identify clusters by inferring class membership from the graph. We fit the simple homophily model of Section 2.1 to this data, assuming a class covariate, $X$, with three levels, and that all of the monks are “missing” their class covariate. The simple homophily model treated this way represents a novel latent block model in the spirit of Nowicki and Snijders (2001). Note that the missingness process here is ignorable: it does not depend on unobserved quantities, because all of the $X$ values are missing regardless of what those values are. We fit the model using the algorithm in Section 4. Table 1 shows the maximum likelihood parameter estimates, along with standard errors of the estimators based on the Fisher information.
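Table 1: Maximum likelihood estimates for the latent class simple homophily model.
| Statistic | Natural parameter | Mean value parameter | s.e. (natural) | s.e. (mean value) |
|---|---|---|---|---|
| # of edges | -0.58 | 88.23 | 0.14 | 7.48 |
| # in group 0 | -2.50 | 3.95 | 1.44 | 1.08 |
| # in group 1 | -0.02 | 6.95 | 1.31 | 0.99 |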
The natural parameter estimates indicate significant homophily in tie formation based on the class. They also indicate that the number of monks in the third class is significantly larger than in the other two classes, which are not statistically significantly different in size. The mean value parameters indicate that the expected number of ties is about 88, and the expected numbers in the three groups are 4, 7 and 7.
An advantage of this approach is that we can investigate the probability of class membership, which is well defined through our framework as the conditional probability of the class labels given the observed graph, $P(X = x \mid Y = y, \hat\eta)$. To compute it we simulated a large number of samples from this conditional distribution using MCMC, which showed the probability of the monks being in the classes displayed in Figure 1 to be above 0.9999. These clusters were also identical to those chosen by Sampson (1969) and verified by later research (Breiger et al., 1975; Handcock et al., 2006).
In addition to assuming a set number of latent classes for the model, we can also use the MLE procedure to select an appropriate number of clusters for the data. We fit the simple homophily model with a latent variable able to take a potentially large number of values (e.g., the number of monks). In this case the fitted model places zero mass on all but three of the groups. This is evidence that the three groups we have identified are a good classification for these data. More sophisticated model selection approaches for choosing the number of clusters are possible (Handcock et al., 2006), and are left for future work.
Our form of the stochastic block model is conceptually very clean, with the ability to naturally incorporate additional covariates, multiple membership variables, and extensions to an unbounded number of classes. Inference is straightforward, and quantities such as the probability of class membership are well defined and interpretable. We leave a full exploration of these for later work.
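Given MCMC draws of the class labels, the probability of a particular clustering can be estimated as the fraction of draws that reproduce it. The sketch below compares draws to the target up to a relabelling of the classes; treating the labels as interchangeable is an assumption made here for illustration, and the function names are hypothetical.

```python
def canonical(labels):
    """Relabel classes by order of first appearance, e.g. [2, 2, 0] -> (0, 0, 1)."""
    mapping = {}
    return tuple(mapping.setdefault(label, len(mapping)) for label in labels)

def partition_probability(samples, target):
    """Fraction of sampled label vectors that induce the same partition as target."""
    wanted = canonical(target)
    return sum(canonical(s) == wanted for s in samples) / len(samples)

def co_membership(samples, i, j):
    """Fraction of samples in which nodes i and j share a class."""
    return sum(s[i] == s[j] for s in samples) / len(samples)

# Three hypothetical draws for a three-node network.
draws = [[0, 0, 1], [1, 1, 0], [0, 1, 1]]
p_partition = partition_probability(draws, [2, 2, 5])
p_pair = co_membership(draws, 0, 1)
```

The first two draws place nodes 0 and 1 together and node 2 apart, matching the target partition, while the third does not.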
5.3 Network Sampling: Biased Seed Link-Tracing
Handcock and Gile (2010) explored the idea of sampling networks by tracing the edges. As a general concept, link tracing involves selecting one or more seed nodes, and then observing the edges connected to those seeds. One or more of these edges are then followed to the neighbouring node, whose ties are observed, and the process is continued. Each iteration of this process is known as a wave.
Provided that the seed nodes are chosen at random, and the method by which edges are chosen to be followed depends only on the observed data, this missingness process is ignorable. To be explicit, consider a link tracing process with waves. Let be the ordered set of nodes and edges sampled in the th wave in the order in which they were sampled, , and . If the seeds are chosen at random, and the edges followed by the sampling process are also chosen at random, then , implying that the missingness is ignorable.
In many cases, however, the seeds are not chosen at random from the population, but are some form of convenience sample. For example, in a population where some people have an infection and others do not, we may start with a sample of seeds picked at random from among the infected individuals, and seeds picked from the non-infected individuals. These seeds are then used as a starting point for standard link tracing. We may then write the sampling probability as
where and are the number of infected and non-infected in the population, respectively. Note that does not depend on and may be factored out of the likelihood in equation ( ‣ 3). Thus there is no need to calculate explicitly, as it makes no impact on the likelihood. Hence, in this case, we can compute the likelihood without knowing the specific mechanism of seed selection.
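The structure of this argument can be written out as a short sketch. The notation is assumed here rather than taken from the original equation: $s_I$ and $s_N$ denote the numbers of infected and non-infected seeds, $n_I(x)$ and $n_N(x)$ the numbers of infected and non-infected units in the population, and $d_{\text{trace}}$ the link-tracing steps that follow seed selection.

```latex
% Seeds are drawn uniformly at random within each infection stratum, and the
% subsequent tracing depends only on observed data (assumed notation).
P(D = d \mid Z = z)
  = \binom{n_I(x)}{s_I}^{-1} \binom{n_N(x)}{s_N}^{-1}
    \, P\big(d_{\text{trace}} \mid \text{seeds},\, z_{obs}\big)
```

Under these assumptions the tracing factor is constant across completions $z_{mis}$ and cancels from likelihood ratios, whereas the binomial coefficients depend on the infection statuses of unobserved units through $n_I(x)$ and $n_N(x)$, which is what makes the biased seed design non-ignorable.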
5.4 Network Sampling: Positive Contact Tracing
As emerging epidemics develop, control measures (e.g., treatment, isolation and culling) focus on those members of the population that are known to have the infection. Because there are often many infected people who are unobserved, control can be ineffective (e.g., HIV; Potterat et al., 1989). The alternative of applying control measures to the entire population can be economically infeasible or ineffective (e.g., some instances of safe sex education) (Potterat et al., 1989; Klinkenberg et al., 2006). Contact tracing is the hybrid approach of treating both the known infected individuals and those who may have been infected by them (Potterat et al., 1989; Klinkenberg et al., 2006). In U.S. public health, health clinics are required by state law to notify those at risk from infection due to their sexual relations with individuals tested, and found to be infected, by the clinic. The process of locating, notifying and then testing partners that may have been exposed to an infectious agent allows additional information about the partners to be collected. While the primary purpose of contact tracing is disease control via partner notification and partner services, it is also a form of data collection that is rarely utilised. Such approaches are used most commonly for syphilis and HIV/AIDS, but also for other STIs such as gonorrhea and chlamydia (Golden et al., 2004), as well as routinely for tuberculosis and infectious disease outbreaks. Contact tracing has also been applied in many recent epidemics (Fenner et al., 1988; Ferguson et al., 2001; Donnelly et al., 2003). In positive contact tracing, we follow all edges from infected nodes, but edges from uninfected nodes are not followed.
While the process varies from state to state and also by disease, we consider the following biased seed link tracing process:
1. Select the non-infected seed subjects at random from among the non-infected population, and observe them.
2. Select the infected seed subjects at random from among the infected population, and observe them.
3. Choose the next infected seed at random.
4. Observe all edges from the selected subject, and the infection status of the subjects they connect to.
5. For all infected neighbours of the selected subject, go to step 4.
6. If not all of the seeds have been chain sampled, go to step 3.
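The steps above can be sketched as a short simulation on an undirected network stored as adjacency sets. This is an illustrative reading of the process, not the implementation used for the simulations: it assumes that non-infected seeds contribute only their own infection status, and the function and argument names are hypothetical.

```python
import random
from collections import deque

def positive_contact_trace(neighbours, infected, n_inf_seeds, n_non_seeds, rng):
    """Simulate one positive contact tracing sample.

    neighbours: dict node -> set of adjacent nodes (undirected network)
    infected  : dict node -> bool infection status
    Returns (set of nodes with observed status, set of observed edges).
    """
    inf = [v for v in neighbours if infected[v]]
    non = [v for v in neighbours if not infected[v]]
    seeds_non = rng.sample(non, n_non_seeds)   # step 1
    seeds_inf = rng.sample(inf, n_inf_seeds)   # step 2, returned in random order
    observed = set(seeds_non) | set(seeds_inf)
    observed_edges = set()
    traced = set()
    for seed in seeds_inf:                     # steps 3 and 6
        queue = deque([seed])
        while queue:
            v = queue.popleft()
            if v in traced:
                continue
            traced.add(v)
            for u in neighbours[v]:            # step 4: observe edges and statuses
                observed_edges.add(frozenset((v, u)))
                observed.add(u)
                if infected[u] and u not in traced:
                    queue.append(u)            # step 5: follow infected neighbours
    return observed, observed_edges

# Path network 0-1-2-3-4 in which only nodes 0 and 1 are infected.
neighbours = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
infected = {0: True, 1: True, 2: False, 3: False, 4: False}
nodes, edges = positive_contact_trace(neighbours, infected, 1, 0, random.Random(1))
```

Whichever infected node is chosen as the seed in this example, tracing reaches both infected nodes and stops at node 2, whose edges are not followed because it is uninfected.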
We simulated a networked population of people from the simple homophily model of Section 2.1 with natural parameters of . The number of infected nodes was fixed at 150. The generated network had a mean degree of 3.1, and its degree distribution is displayed in Figure 4. There were 296 infected to non-infected ties, with the mixing distribution displayed in Figure 5 indicating moderate homophily.
Starting with infected seeds, we simulated 100 positive link tracing samples for each of . Figure 6 displays a histogram of the sizes of the samples when there are no non-infected seeds (i.e., ).
To provide a comparison for our method we considered two estimators that could be utilised. Neither uses a model for the networked population; both are motivated by approximations to the sampling design. The first treats the sample as a simple random sample
where and are the number of infected and uninfected in the sample respectively. The second adjusts for the sampling of the seeds
Our approach is to fit an ERNM to the contact tracing data. In this situation the contact tracing sampling design is clearly informative. For comparison, we compute two estimates of the model. The first takes into account the informativeness of the contact tracing design (MNAR) and the other assumes it is ignorable (MAR). These are based on the observed data likelihood of Section 3 and its ignorable simplification, respectively, and the algorithm in Section 4.
Figure 7 shows the results for each of the estimators over the samples. The median of the MNAR estimator is centred around the true value of 150 in all sampling scenarios, while the MAR estimator performs poorly with all infected seeds () and increasingly well as the number of non-infected seeds increases to . This is somewhat expected as the proportion of infected in the seeds approximately matches that of the population when . The two naive estimators are significantly biased across all samples. This is especially true for the sample mean which is biased both by the seed selection and by the link-tracing design. The adjusted sample mean corrects somewhat for the seed bias but does not represent the link-tracing.
This application illustrates the advantage of the model-based approach over the ad hoc estimators. By representing the structure of the networked population, the model-based approach can leverage the information in the data more efficiently.
In this paper we have given a concise and systematic statistical framework for dealing with partially observed network data when some knowledge is available on the sampling design. The framework includes, but is not restricted to, ignorable sampling designs. We have also shown that likelihood-based inference is practical under partial observation for ERN models, and that the likelihood framework naturally accommodates standard sampling designs.
We developed and implemented algorithms to compute Monte Carlo approximations to the likelihood, and showed how these can be used in practice. Three important special cases of these designs were demonstrated in Section 5. In Sub-section 5.1 we considered a missingness process which randomly selected dyads and nodal attributes to be missing. Sub-section 5.2 considered the case where all nodal attributes are missing, thus introducing a novel form of the latent cluster model.
In Sub-section 5.3 we consider non-ignorable sampling in the context of contact tracing data, a case of vital importance to public health. At present, this is the first statistically defensible approach to inference in this form of data. The example presented here shows that the MLE estimation task is robust, in that it can be applied successfully to moderately large networks (1000 nodes), with significant missingness (>70% of nodes unobserved), but is limited by the fact that inference was performed on a simulated network. Whether the model presented here would provide a good fit for real public health data remains an important research question that we hope to address in the future.
Appendix: Algorithmic and Computational Details
A.1: Alternate MLE Formulation
While the algorithm outlined in Section 4 works well, there are some situations where an alternate formulation using equation ( ‣ 3) may be useful. First let us consider the case where , then the likelihood is
The first expectation, and the expectation in the denominator of the third term, can be calculated using an MCMC sample from . The second can be approximated with an MCMC sample from . The numerator of the third term can be approximated by importance sampling.
If the sampling process is ignorable, then the third term drops out of the likelihood ratio. The first and second derivatives of the likelihood are useful in the maximisation process. For notational convenience let .
And if the missingness process is ignorable, these equations simplify to
If we fix , then the observed likelihood of
can be maximised to find the MLE of .
This motivates the following algorithm for maximising the observed data likelihood.
Let and choose initial parameter values , .
Use MCMC to generate k samples, from .
Use MCMC to generate m samples from .
Set , with samples from step 2 used to approximate the expectation.
Using the samples from steps 2 and 3 to approximate the relevant expectations, find maximising equation ( ‣ A.1: Alternate MLE Formulation) subject to .
Set , and go to step 2.
The disadvantage of this method is that if the networks generated by the MNAR process are very different from those generated assuming MAR, the estimates of the last expectation in the likelihood above can become unstable. The benefit of using this method is that the sampling probability only needs to be calculated for networks included in the sample, and not at every MCMC step as is required by the algorithm in Section 4. If the sampling probability is computationally expensive to calculate, this method can therefore be significantly faster than the one outlined in Section 4.
A.2: Estimating Network Statistics
We can use MCMC samples from to estimate the network statistics of the sampled network. Suppose that we have used MCMC to draw samples from the distribution , and . Then we can estimate the expectation of a set of network statistics as
However, this equation ignores the possible bias introduced by our sampling process . The distribution that we should be sampling from is the full conditional distribution of ,
We then use importance sampling to estimate the relevant quantity
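The importance sampling correction can be sketched as a self-normalised weighted average. The sketch assumes that the networks were drawn from the conditional distribution that ignores the sampling process, so that the importance weight of each draw is proportional to its sampling probability; the function name and inputs are illustrative, and the sampling probabilities are taken as already computed.

```python
def importance_estimate(stats, samp_probs):
    """Self-normalised importance sampling estimate of E[g(Z) | z_obs, D = d].

    stats     : list of statistic vectors g(z) for networks drawn while
                ignoring the sampling process
    samp_probs: P(D = d | Z = z) for each drawn network, up to a constant
    """
    total = sum(samp_probs)
    dim = len(stats[0])
    return [sum(w * g[j] for w, g in zip(samp_probs, stats)) / total
            for j in range(dim)]

# Two hypothetical draws; the second is three times as likely to have
# produced the observed sample, so it receives three times the weight.
estimate = importance_estimate([[1.0, 0.0], [3.0, 2.0]], [0.25, 0.75])
```

Because the weights are normalised by their sum, the sampling probability only needs to be known up to a multiplicative constant.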
- Breiger et al. (1975) Breiger, R. L., Boorman, S. A., and Arabie, P. (1975). An algorithm for clustering relational data, with applications to social network analysis and comparison with multidimensional scaling. Journal of Mathematical Psychology, 12, 328–383.
- Donnelly et al. (2003) Donnelly, C. A., Ghani, A. C., Leung, G. M., et al, and Anderson, R. M. (2003). Epidemiological determinants of spread of causal agent of severe acute respiratory syndrome in Hong Kong. Lancet, 361(9371), 1761–1766.
- Fellows and Handcock (2012) Fellows, I., and Handcock, M. S. (2012). Exponential-family Random Network Models. ArXiv e-prints.
- Fenner et al. (1988) Fenner, F., Henderson, D. A., Arita, I., Jezek, Z., and Ladnyi, I. (1988). Smallpox and its eradication. Tech. rep., Geneva: World Health Organization.
- Ferguson et al. (2001) Ferguson, N. M., Donnelly, C. A., and Anderson, R. M. (2001). Transmission intensity and impact of control policies on the foot and mouth epidemic in Great Britain. Nature, 413(6855), 542–548.
- Frank and Strauss (1986) Frank, O., and Strauss, D. (1986). Markov graphs. Journal of the American Statistical Association, 81(395), 832–842.
- Gile (2011) Gile, K. J. (2011). Improved inference for respondent-driven sampling data with application to HIV prevalence estimation. Journal of the American Statistical Association, 106(493), 135–146.
- Gile and Handcock (2010) Gile, K. J., and Handcock, M. S. (2010). Respondent-driven sampling: An assessment of current methodology. Sociological Methodology, 40, 285–327.
- Gile and Handcock (2011) Gile, K. J., and Handcock, M. S. (2011). Network model-assisted inference from respondent-driven sampling
- Golden et al. (2004) Golden, M. R., Hogben, M., Potterat, J. J., and Handsfield, H. H. (2004). HIV partner notification in the United States: a national survey of program coverage and outcomes. Sex Transm Dis, 31(12), 709–712.
- Handcock and Gile (2010) Handcock, M. S., and Gile, K. J. (2010). Modeling social networks from sampled data. Annals of Applied Statistics, 4(1), 5–25.
- Handcock et al. (2006) Handcock, M. S., Raftery, A. E., and Tantrum, J. M. (2006). Model-based clustering for social networks. Journal of the Royal Statistical Society Series A, 170, 1–22.
- Harris et al. (2003) Harris, K. M., Florey, F., Tabor, J., Bearman, P. S., Jones, J., and Udry, J. R. (2003). The national longitudinal study of adolescent health: Research design [WWW document]. Tech. rep., Carolina Population Center, University of North Carolina at Chapel Hill, Available at: http://www.cpc.unc.edu/projects/addhealth/design.
- Hoff et al. (2002) Hoff, P. D., Raftery, A. E., and Handcock, M. S. (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460), 1090–1098.
- Hunter and Handcock (2006) Hunter, D. R., and Handcock, M. S. (2006). Inference in curved exponential family models for networks. Journal of Computational and Graphical Statistics.
- Klinkenberg et al. (2006) Klinkenberg, D., Fraser, C., and Heesterbeek, H. (2006). The effectiveness of contact tracing in emerging epidemics. PLoS ONE, 1(1), e12.
- Nowicki and Snijders (2001) Nowicki, K., and Snijders, T. A. B. (2001). Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455), 1077–1087.
- Potterat et al. (1989) Potterat, J. J., Spencer, N. E., Woodhouse, D. E., and Muth, J. B. (1989). Partner notification in the control of human immunodeficiency virus infection. American Journal of Public Health, 79(7), 874–876.
- Rubin (1976) Rubin, D. (1976). Inference and missing data. Biometrika, 63, 581–592.
- Sampson (1969) Sampson, S. F. (1969). Crisis in a Cloister. PhD in Sociology, Cornell University.
- Udry (2003) Udry, J. R. (2003). The national longitudinal study of adolescent health (Add Health), waves I and II, 1994-1996; wave III, 2001-2002 [machine-readable data file and documentation]. Tech. rep., Carolina Population Center, University of North Carolina at Chapel Hill.