# Recent Advances on Estimating Population Size with Link-Tracing Sampling

###### Abstract

A new approach to estimate population size based on a stratified link-tracing sampling design is presented. The method extends on the Frank and Snijders (1994) approach by allowing for heterogeneity in the initial sample selection procedure. Rao-Blackwell estimators and corresponding resampling approximations similar to that detailed in Vincent and Thompson (2017) are explored. An empirical application is provided for a hard-to-reach networked population. The results demonstrate that the approach has much potential for application to such populations. Supplementary materials for this article are available online.

Keywords: Adaptive sampling; Hard-to-reach population; Markov chain Monte Carlo; Network sampling; Snowball sampling; Stratified sampling;

## 1 Introduction

There is a growing demand for practical methods to estimate population size and other quantities of hidden networked populations. A new and flexible approach that extends on previous work is presented.

The approach is applied to an empirical population at risk for HIV/AIDS. The design commences with the selection of an initial sample where selection probabilities depend on strata memberships. Links are traced from sampled members with probabilities also dependent on strata memberships. Consistent estimation for population quantities is made with a design-based approach to inference. Preliminary estimators are based on information in the initial sample. Improved estimators are obtained via the Rao-Blackwell theorem, which incorporates information from units added to the sample through link-tracing.

Advantages of using the novel approach over existing methods are: 1) it allows/accounts for heterogeneity in initial sample selection probabilities; 2) it has the ability to harness nominations from conspicuous/certainty individuals, or even those external to the target population, in the inference procedure to substantially improve the precision of estimators; for example, such nominations may come from those individuals conceivably sampled with probability one (typically, the social stars of the population), a pilot study, or from researchers familiar with the target population via a prior study; 3) it allows for heterogeneity in how nominations are defined between strata; for example, links originating from a stratum of males and which are directed towards females may be based on sexual contact, and within a stratum of males may be based on sharing drugs; and 4) it bases approximations for the computationally intensive improved (Rao-Blackwellized) estimators on an updated and efficient Markov chain Monte Carlo resampling procedure that depends on a suitable convergence diagnostic test.

## 2 Sampling Design

Let be the set of units/members of the population. Suppose there are strata the population is partitioned into, possibly based on demographic configurations. In keeping with the notation of Frank and Snijders (1994) define to be those individuals in stratum for . Define to be the size of stratum . Define if unit nominates unit and 0 otherwise, where nominations are based on predetermined relationships that may be functions of strata. Define for all . Define to be the number of nominations from unit to stratum . Define to be the number of links in the graph, , and to be the number of links from to , . Define to be the response(s) of interest attached to unit . For example, this may be an indicator of drug-use status or the individual/node-degree.

The initial sample, , is selected via a Bernoulli sampling design within each stratum; let be the probability a unit in stratum is selected for the initial sample, and . Define if unit is selected for and 0 otherwise. Selection for members of the first wave is carried out as follows. For any unit and where , is defined to be the probability the link is traced so that unit is added to the sample for the first wave. Define to be those units selected for the first wave of the sample, and . For each individual define if the unit is selected for the initial sample and if the unit is selected for the first wave. The data observed upon selecting the sample is .

## 3 Estimation

### 3.1 Population Size Estimation

Define , to be the number of non-self-nominated (non-loop) links within , to be the number of links from to for , to be the number of links from to , and to be the number of links from to for . The expectations of these statistics are

(1) | ||||

(2) | ||||

(3) | ||||

(4) | ||||

(5) |

The aforementioned equations lead to the following method-of-moments estimator for ,

(6) |

To show the consistency of this estimator, assume that for all , and in such a way that . Define to be the units in stratum nominated by unit , , and to be the units in stratum which nominate unit , . Assume that for all nominations from within are bounded so that , and for all nominations from to are bounded so that . Hence, and . As shown in the supplementary materials, these assumptions imply that

(7) | ||||

(8) | ||||

(9) | ||||

(10) |

Together, these equations imply that , and all converge in probability to 1 since their expectations are 1 and their variances tend to 0. Hence, is a consistent estimator for and is a consistent estimator for . In the simulation studies each of the stratum size estimators are stabilized in a manner that mimics the bias-adjusted Lincoln-Petersen estimator (Chapman, 1951); a value of one is added to each of and the corresponding sum is subtracted from the estimator.

Of considerable note is that utilizing the statistics based on nominations originating from all strata results in a strata size estimator with a faster rate of consistency than that based on nominations originating solely from within the stratum. Hence, one can expect less-bias with the estimator based on the stratified setup.

One argument for why the estimator in Expression 6 works as a stratum size estimator is given as follows. In a two-sample mark-recapture study the Lincoln-Petersen estimator is the typical choice for an estimator of the population size. To work as a population size estimator only one sample need be selected completely at random while the other can correspond with a “fixed-list”. In the setup presented in this paper the Bernoulli initial sample corresponds with the sample selected completely at random and nominated individuals correspond with the “fixed-list”.

Frank and Snijders (1994) developed the following jackknife procedure to obtain variance estimates of population size estimates for the homogeneous selection setup. The procedure is outlined as follows. For each define to be the estimate of the population size when unit is removed from . Define where . The variance estimate is

(11) |

In the heterogeneous selection setup, implications result from removing a unit from the initial sample on its contribution to estimation of the size of strata the unit is external to. Hence, the following estimator is proposed. When unit is removed from define to be the estimate of the size of strata and . The variance estimator is

(12) |

Although sampling is carried out independently between strata, there may be a positive covariance of the strata size estimates. It is therefore suggested to use an approach that results in conservative confidence intervals, such as that outlined in Chao (1987), to facilitate in meeting nominal levels of coverage.

### 3.2 Population Mean Estimation

In the one-stratum case, an unbiased estimator for the population mean is the initial sample mean,

(13) |

An estimate for the variance of is obtained by substituting the estimate of into the standard formula to give , where is the sample variance of the responses from .

In the multi-strata setup, a consistent estimator for the population mean is

(14) |

where is the mean of the responses of units selected from stratum for the initial sample. An estimate for the variance of is obtained by substituting the estimates of and into the standard formula to give , where is the sample variance of the responses from .

In some cases an estimate of the proportion of individuals in a stratum can be useful. Define to be the population quantity to be estimated. Then is a consistent estimator for this quantity. To obtain a variance estimate for this estimate the delete-one jackknife procedure is used as follows. Define to be the estimate when unit is removed from the initial sample, and . The standard formula with the estimate of substituted into the expression is

(15) |

## 4 Sufficiency Result

Recall that . Define the reduced data to be

, and , .

Theorem: is sufficient for .

Proof:

(16) |

The statistics and in Expression 16 are functions of the sampled members’ time of observation, , number of nominations to each strata, , and nominations within the sample, , for all , which correspond with . Therefore, by the Neyman-Factorization Theorem is sufficient for .

Rao-Blackwellized estimates are based on evaluating selection probabilities and estimates that correspond with sample reorderings that give rise to the same reduced data. For example, under a homogenous selection setup suppose a sample is selected as presented in the left of Figure 1; initial sample is selected with probability , first wave is selected conditional on with probability , and the corresponding estimate is , say. The reordering, labeled , presented in the right of Figure 1 is consistent with the reduced data, and is selected with probability , is selected conditional on with probability , and the corresponding estimate is , say.

There are possible sample reorderings, where and . Index these as . The Rao-Blackwell expression is

(17) |

Note that in Expression 17, is constant over all and cancels from the expression, an implication of being sufficient for and .

Data reduction comes from mapping the set of consistent sample reorderings to the sufficient statistic, . Hence, can be viewed as the set of all consistent sample reorderings and their corresponding observations. Preliminary estimation is based on the estimator corresponding with original sample ordering , whereas Rao-Blackwellized estimates are based on a weighted average of estimates corresponding with all reorderings in . Hence, improvement in estimation comes through utilizing more information than that provided solely with the original ordering of the sample.

## 5 Markov Chain Monte Carlo

Due to the potentially large number of reorderings, a Markov chain Monte Carlo (MCMC) procedure is used to approximate the Rao-Blackwellized estimators and their variance estimators. The procedure is outlined as follows.

Choose to be a sufficiently large number. For suppose at step of the Markov chain the most recently accepted reordering is for some . Define to be the probability of selecting reordering in the full graph setting and to be the probability of selecting reordering under the following proposal distribution. Define to be the MCMC parameters where if then , and if then for such that . Sample a value with probability equal to . Suppose the sampled value is . Select units from wave 1 of reordering completely at random. Interchange each of the units with one unit that nominates them from the initial sample, selected completely at random. Suppose this results in roerdering . If the reordering is consistent with the reduced data then with probability accept the proposal reordering.

The MCMC procedure starts in its stationary distribution with the original order the sample is selected in. With the aid of the parameters the chain has the potential to fully explore the distribution since pairs of units from up to different strata, where links may cross between strata, can be interchanged at any step. For example, consider the sample presented in Figure 1. Suppose units A and D belong to one stratum, and B and C to another. The reordering with units C and D comprising the initial sample is consistent with the reduced data, and can only be reached if units A and B are interchanged with units C and D, respectively and simultaneously. Hence, the procedure results in a Markov chain with the desired stationary distribution . Approximations to the Rao-Blackwellized version of a preliminary estimator and it’s corresponding variance estimator based on MCMC procedures are detailed in Vincent and Thompson (2017).

A test for convergence is based on the Gelman-Rubin statistic (Gelman and Rubin, 1992). Search algorithms for two “over-dispersed” reorderings are based on the proposal distribution, as follows. Choose to be of sufficient length for each search, and start with the original sample in the order it was selected. Suppose at some intermediate step the most recently accepted reordering is . Draw a sample reordering, , according to the proposal distribution. For the first over-dispersed reordering, if the reordering is consistent with the reduced data and the probability of selecting it in the full graph setting is less than that for , i.e. , then accept . Similarly, for the second over-dispersed reordering, a consistent reordering is accepted if the probability of selecting it in the full graph setting is greater than that for , i.e. . The algorithms each conclude with their last accepted reordering, and these are used as seeds in the MCMC chains for which the convergence test is based on.

The search for the first over-dispersed reordering is likely to result in one with a corresponding smaller estimate for the population size relative to the original ordering’s. The reason is that the algorithm will result in a reordering that has a smaller probability of being observed; under the sampling design this reordering will likely have more links emanating from the initial sample to individuals outside the initial sample, relative to the original ordering, because every link has a probability appended to it of being traced. Since link-tracing typically results in the selection of individuals with high-degree the search will likely result in a reordering whose initial sample is comprised of more well-connected individuals, resulting in many more nominations observed within the initial sample relative to the original ordering’s. Similarly, the search algorithm for the second over-dispersed reordering is likely to result in one with a corresponding larger estimate for the population size relative to the original ordering’s.

## 6 Empirical Study

The empirical study is based on the P90 Colorado Springs study of 595 drug-users (Darrow et al., 1999; Klovdahl et al., 1994; Rothenberg et al., 1995). Figure 2 gives a visual illustration of the population. The light-coloured nodes represent the stratum of non-injection drug users and dark-coloured nodes represent the stratum of injection drug-users. Links between individuals represent drug-sharing relationships. All links are reciprocated.

The one-stratum (homogeneous) and two-strata (heterogeneous) selection setup is considered for inference, as well as the use of a third/certainty strata; the ten individuals with largest node-degree are selected for each sample with probability one. Coverage rates and average lengths of confidence intervals corresponding with estimates for the population size are based on the log-transformation approach outlined in Chao (1987), and for the population proportion and average node-degree are based on the central limit theorem (CLT). In some cases a negative estimate is evaluated for the estimate of the variance of an improved estimate for a population quantity; there are several occurrences corresponding with an estimate for the population size with the one and two-strata setup in the second simulation study, and for the average node-degree with the three-strata setup in both simulation studies. The conservative approach presented in Vincent and Thompson (2017) is utilized, thereby inflating the average length of the confidence intervals.

### 6.1 Simulation Study 1

A simulation study is based on setting and . To determine a sufficient length of MCMC chain for approximating the Rao-Blackwellized population size estimators, the two-strata setup is considered and a search length of 10,000 is used to find over-dispersed reorderings. Figure 3 depicts a typical sample selected under the design with these sampling parameters, and presents output from search algorithms and MCMC chains starting with over-dispersed sample reorderings of the original sample.

Based on selecting 100 samples with chains set to length 2000 and , the mean and median of the Gelman-Rubin statistics are 1.08 and 1.04, respectively.

The simulation study is based on selecting 2000 samples. The average initial sample size is 89.7 and final sample size is 122.9. Table 1 presents the expectation and variance scores of the preliminary and Rao-Blackwell estimators for the population size, proportion of injection drug-users, and average node-degree. Acceptance rates for the MCMC procedure are found to be 49.6%, 31.9%, and 38.8% respectively for the one-, two-, and three-strata setup. In each case a significant improvement is found with the Rao-Blackwell estimators. The use of a multi-strata setup benefits population size estimation as bias is reduced (an implication of the rate of consistency of estimators of strata size based on external nominations) and precision is increased. The estimator of the proportion of drug-users sees a decrease in precision with the two-strata setup relative to the one-stratum setup, primarily due to estimating the relative sizes of the two strata. Note that with the one-stratum setup the estimator presented in Expression 13 is used where indicator values are the responses. With the three-strata setup some bias is introduced since convergence rates are unequal between strata size estimators. Similarly, the estimator of the average node-degree has some bias with the two- and three-strata setup due to weighting mean responses by estimated strata sizes.

Pop. Quantity | Estimator | Expectation | Var. (P) | Var. (RB) |

Size | One-stratum | 653 | 41,796 | 31,369 |

Two-strata | 636 | 31,947 | 20,230 | |

Three-strata | 617 | 20,212 | 17,102 | |

Proportion | One-stratum | 0.575 | 0.00242 | 0.00212 |

Two-strata | 0.575 | 0.01130 | 0.00733 | |

Three-strata | 0.584 | 0.00920 | 0.00755 | |

Avg. node-degree | One-stratum | 2.446 | 0.15683 | 0.12055 |

Two-strata | 2.385 | 0.16599 | 0.13042 | |

Three-strata | 2.415 | 0.14349 | 0.12701 |

Table 2 provides coverage rates and average lengths of confidence intervals corresponding with the estimates. Coverage rates of the population size are close to 95%. Hence, the pairing of the proposed variance estimator presented in Expression 12 and Chao (1987) approach to obtaining confidence intervals are ideal for the heterogeneous setup. The average length of the confidence interval corresponding with the estimate of the population proportion under the two-strata setup is profoundly wide due to the jackknife procedure over-approximating the standard error of the estimator. Coverage rates of average node degree are less than the desired level due to substituting the estimate of strata sizes into the variance expression and skewness of the degree distribution. Further work is needed to address these issues.

Pop. Quantity | Estimator | CR (P) | Length (P) | CR (RB) | Length (RB) |

Size | One-stratum | 0.970 | 848 | 0.982 | 792 |

Two-strata | 0.963 | 766 | 0.970 | 667 | |

Three-strata | 0.967 | 632 | 0.970 | 600 | |

Proportion | One-stratum | 0.942 | 0.18971 | 0.945 | 0.17850 |

Two-strata | 0.959 | 0.52033 | 0.987 | 0.47533 | |

Three-strata | 0.959 | 0.44793 | 0.964 | 0.42148 | |

Avg. node-degree | One-stratum | 0.922 | 1.51427 | 0.915 | 1.35786 |

Two-strata | 0.891 | 1.47232 | 0.908 | 1.33157 | |

Three-strata | 0.839 | 1.07306 | 0.816 | 0.95973 |

### 6.2 Simulation Study 2

A simulation study is based on setting , and . To determine a sufficient length of MCMC chain for approximating the Rao-Blackwellized population size estimators, the two-strata setup is considered and a search length of 10,000 is used to find over-dispersed reorderings. Figure 4 depicts a typical sample selected under the design with these sampling parameters, where stratum one refers to the light-coloured nodes and two to the dark-coloured nodes, and presents output from search algorithms and MCMC chains starting with the original and over-dispersed sample reorderings of the original sample.

Based on selecting 100 samples with chains set to length 2000 and , the mean and median of the Gelman-Rubin statistics are 1.11 and 1.01, respectively.

The simulation study is based on selecting 2000 samples. The average initial sample size is 47.1 and final sample size is 67.5. Table 3 presents the expectation and variance scores of the preliminary and Rao-Blackwell estimators for the population size, proportion of injection drug-users, and average node-degree. Acceptance rates for the MCMC procedure are found to be 42.7%, 28.9%, and 37.0% respectively for the one-, two-, and three-strata setup. As the one stratum setup assumes homogeneity in the selection probabilities for the initial sample, the preliminary estimators and those based on the Rao-Blackwell scheme will not necessarily coincide; the expectation of the latter is given in parentheses. In this case, there seems to be close agreement between the estimators for each of the population quantities. A significant reduction in variance is seen for all Rao-Blackwell estimators. Accounting for heterogeneity in the initial sample selection procedure results in a reduction in bias for all estimators. Adding the certainty stratum helps to further reduce the bias and variance of the population size estimator.

Pop. Quantity | Estimator | Expectation | Var. (P) | Var. (RB) |

Size | One-stratum | 739 (729) | 409,642 | 188,443 |

Two-strata | 530 | 45,961 | 29,224 | |

Three-strata | 582 | 28,541 | 24,602 | |

Proportion | One-stratum | 0.732 (0.718) | 0.00406 | 0.00385 |

Two-strata | 0.672 | 0.01526 | 0.01140 | |

Three-strata | 0.628 | 0.01474 | 0.01240 | |

Avg. node-degree | One-stratum | 2.528 (2.530) | 0.29964 | 0.26111 |

Two-strata | 2.437 | 0.29991 | 0.25962 | |

Three-strata | 2.436 | 0.28360 | 0.25536 |

Table 4 provides coverage rates and average lengths of confidence intervals corresponding with the estimates. Due to small sample sizes and resulting bias of estimators corresponding with the one and two-sample setup the population size coverage rates are less than the desired level of 95%. However, as seen with the three-strata setup, with enough information the variance estimator presented in Expression 12 and Chao (1987) approach to obtaining confidence intervals are ideal. Bias in the population proportion estimates lead to coverage rates less than 95%. Coverage rates of the average node degree are also less than 95%, primarily due to the skewness of the distribution.

Pop. Quantity | Estimator | CR (P) | Length (P) | CR (RB) | Length (RB) |

Size | One-stratum | 0.933 | 2737 | 0.944 | 2863 |

Two-strata | 0.882 | 891 | 0.906 | 799 | |

Three-strata | 0.954 | 847 | 0.962 | 828 | |

Proportion | One-stratum | 0.305 | 0.24291 | 0.351 | 0.23399 |

Two-strata | 0.837 | 0.57700 | 0.861 | 0.54061 | |

Three-strata | 0.908 | 0.56555 | 0.923 | 0.53933 | |

Avg. node-degree | One-stratum | 0.927 | 2.07014 | 0.912 | 1.91516 |

Two-strata | 0.905 | 2.06840 | 0.907 | 1.94320 | |

Three-strata | 0.797 | 1.49016 | 0.796 | 1.36777 |

## 7 Discussion

The new strategy is able to incorporate heterogeneity into the initial sample selection procedure, as well as to allow for individuals selected with probability one, to make significant contributions to inference. The gains in efficiency from these features are substantial. Further gains are made via Rao-Blackwellization, which directly utilizes observations of individuals sampled for the first wave.

The reduced data is sufficient for the strata sizes and initial sample selection probabilities . To implement Rao-Blackwellization, the first wave selection probabilities must be known. In the empirical setting these are likely to be unknown and may have to be approximated with sample data. Alternatively, one could explore a strategy that considers the set of nominations traced from a selected individual as a random sample of fixed size, similar to the approach used in a respondent driven sampling design. Future work on these topics would be invaluable for implementing this approach in practice.

Rao-Blackwellization requires observation of the presence/absence of links between all pairs of individuals in the final sample. This may be difficult to achieve in practice. An approach that appends probabilities of links between pairs of individuals for which these are unknown, possibly based on demographic information, would make for interesting future work.

As shown in the first empirical study, it may be advantageous to base strata assignments on more than (just) initial sample selection probabilities. Further investigation into how patterns of links within and between partitions can be exploited would be worthwhile.

In some cases a complete one-wave snowball sampling design, where all links are traced, is desired and/or feasible. When this is the case it is likely that few reorderings consistent with the sufficient statistic will exist. The reason is that, as required by design, a reordering consistent with the sufficient statistic must have all units selected for the corresponding initial sample to have their links traced. Allowing sampling to continue past the first wave may permit for greater improvements in the Rao-Blackwellized estimators. For example, with a complete snowball sampling design, where link-tracing continues until there are no links out of the sample, all reorderings that retain isolated members for the initial sample will be consistent with the sufficient statistic. Furthermore, consistent reorderings under such a design will have equal probabilities of being selected in the full graph setting.

Population size estimates based directly on waves succeeding the initial sample should be explored. One approach is to base strata assignments on the distribution of links from the prior wave. For example, individuals not recently sampled and not linked to the prior wave comprise one stratum, individuals linked to one individual in the prior wave comprise another stratum, and so forth. Furthermore, if a subset of links are conceivably traced with probability one then the selected individuals comprise a certainty stratum.

## 8 Supplementary Materials

The supplementary materials provide proofs required for deriving the population/strata size estimators.

## References

- Chao (1987) Chao, A. (1987). Estimating the population size for capture-recapture data with unequal catchability. Biometrics 43, 783–791.
- Chapman (1951) Chapman, D. (1951). Some properties of the hypergeometric distribution with applications to zoological sample census. University of California Publications in Statistics 1, 131–160.
- Darrow et al. (1999) Darrow, W. W., Potterat, J. J., Rothenberg, R. B., Woodhouse, D. E., Muth, S. Q., and Klovdahl, A. S. (1999). Using knowledge of social networks to prevent human immunodeficiency virus infections: The Colorado Springs Study. Sociological Focus 32, 143–158.
- Frank and Snijders (1994) Frank, O. and Snijders, T. (1994). Estimating the size of hidden populations using snowball sampling. Journal of Official Statistics 10, 53–67.
- Gelman and Rubin (1992) Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science 7, 457–472.
- Klovdahl et al. (1994) Klovdahl, A., Potterat, J., Woodhouse, D., Muth, J., Muth, S., and Darrow, W. (1994). Social networks and infectious disease: The Colorado Springs Study. Social Science & Medicine 38, 79–88.
- Rothenberg et al. (1995) Rothenberg, R. B., Potterat, J. J., Woodhouse, D. E., Darrow, W. W., Muth, S. Q., and Klovdahl, A. S. (1995). Choosing a centrality measure: Epidemiologic correlates in the Colorado Springs study of social networks. Social Networks 17, 273–297. Social networks and infectious disease: HIV/AIDS.
- Vincent and Thompson (2017) Vincent, K. and Thompson, S. (2017). Estimating population size with link-tracing sampling. Journal of the American Statistical Association .

## Appendix A Supplementary Materials

Claim: For any , .

Proof: Take any where . When squaring this entry corresponds with

(18) |

Now, the expectation of is

(19) |

Summing the above term over all where gives

(20) |

Therefore,

(21) |

Claim: For any with , .

Proof: Take any entry and . When squaring this entry corresponds with

(22) |

Now, the expectation of is

(23) |

Summing the above term over all and gives

(24) |

Therefore,

(25) |

Claim: For any , .

Proof: Take any where . When squaring this entry corresponds with

(26) |

Now, the expectation of is

(27) |

Summing the above term over all where gives