1 Introduction

Bayesian Marked Point Process Modeling for Generating Fully Synthetic Public Use Data with Point-Referenced Geography

Harrison Quick111(to whom correspondence should be addressed) Department of Statistics, University of Missouri-Columbia, 146 Middlebush Hall, Columbia, MO 65211, quickh@missouri.edu, Scott H. Holan222Department of Statistics, University of Missouri-Columbia, 146 Middlebush Hall, Columbia, MO 65211-6100, Christopher K. Wikle, Jerome P. Reiter333Department of Statistical Science, Box 90251, Duke University, Durham, North Carolina 27708-0251


Abstract

Many data stewards collect confidential data that include fine geography. When sharing these data with others, data stewards strive to disseminate data that are informative for a wide range of spatial and non-spatial analyses while simultaneously protecting the confidentiality of data subjects’ identities and attributes. Typically, data stewards meet this challenge by coarsening the resolution of the released geography and, as needed, perturbing the confidential attributes. When done with high intensity, these redaction strategies can result in released data with poor analytic quality. We propose an alternative dissemination approach based on fully synthetic data. We generate data using marked point process models that can maintain both the statistical properties and the spatial dependence structure of the confidential data. We illustrate the approach using data consisting of mortality records from Durham, North Carolina.

Keywords: Dimension reduction; Disclosure; Marked point processes; Multiple imputation; Privacy; Predictive process.

1 Introduction

Many statistical agencies, research centers, and individual researchers—henceforth all called agencies—collect confidential data that they intend to share with others as public use files. Many agencies are also obligated ethically, and often legally, to protect the confidentiality of data subjects’ identities and sensitive attributes. This can be particularly challenging for agencies seeking to include fine levels of geography, e.g., street addresses or tax parcel identifiers, in the public use files. While detailed spatial information offers enormous benefits for analysis, it also can enable ill-intentioned users—henceforth called intruders—to easily identify individuals in the file.

Because of these confidentiality risks, agencies typically alter geographies and sensitive attributes before disseminating public use files. Perhaps the most common redaction method is to aggregate geographies to high levels like states (or not to release geography at all). Unfortunately, aggregation sacrifices analyses that require finer geographic detail and potentially creates ecological fallacies (Wang and Reiter, 2012). Furthermore, when the file includes other variables known to intruders like demographic information, aggregation alone may not suffice to protect confidentiality. Another strategy is to move each record’s observed location to another randomly drawn location, e.g., within some circle of radius centered at the original location. When large movements are needed to protect confidentiality—as can be the case when released data include demographic and other variables possibly known by intruders—inferences involving spatial relationships can be seriously degraded (Armstrong et al., 1999; VanWey et al., 2005). Suppression and aggregation also are commonly used to redact non-geographic attributes (Willenborg and de Waal, 2001; Hundepool et al., 2012), as are perturbative methods like data swapping (Dalenius and Reiss, 1982) and adding noise to values (Fuller, 1993). When applied with high intensity, these methods can result in files having poor analytic quality without adequate confidentiality protection (Winkler, 2007; Holan et al., 2010; Drechsler and Reiter, 2010).

An alternative to aggregation and perturbation is to release multiply-imputed synthetic data, in which confidential values are replaced with draws from statistical models designed to capture important distributional features in the collected data. Synthetic data come in two flavors. Partially synthetic data comprise the original units surveyed with some collected values replaced with multiple imputations (Little, 1993; Kennickell, 1997; Abowd and Woodcock, 2001, 2004; Reiter, 2003, 2004; An and Little, 2007), and fully synthetic data comprise entirely simulated records (Rubin, 1993; Reiter, 2002, 2005a; Raghunathan et al., 2003). In this article, we focus on fully synthetic data; see Reiter and Raghunathan (2007) for a review of the differences in the two flavors. Fully synthetic data can offer low disclosure risks as the released data cannot be meaningfully matched to external databases, while allowing secondary analysts to make valid inferences for wide classes of estimands via standard likelihood-based methods (Raghunathan et al., 2003; Reiter, 2005b).

Thus far, synthetic data approaches have been used primarily for data with no or highly aggregated geography (e.g., Abowd et al., 2006; Hawala, 2008; Kinney et al., 2011) or with moderately aggregated geography like block groups or areal regions (e.g., Machanavajjhala et al., 2008; Burgette and Reiter, 2013; Paiva et al., 2014). For the latter, the basic idea is to aggregate from the point-level to discrete areal units, estimate a model that predicts areal units from individuals’ attributes, and draw new areal units from the model as individuals’ synthetic locations. These areal modeling approaches do not apply when the goal is to release point-referenced geography, although Paiva et al. (2014) make an ad hoc suggestion to randomly assign each individual to a point within its synthetic areal region. One exception is the work of Wang and Reiter (2012), who proposed that agencies treat latitude and longitude just like other continuous variables, approximating their conditional distributions given non-synthesized variables and releasing simulated locations by sampling from the models. They use regression trees (Reiter, 2005c) to approximate the conditional distributions. They also use trees to synthesize attributes conditional on latitude and longitude, treating the geographies as predictors in the tree models.

While the approach of Wang and Reiter (2012) is computationally efficient, it may not preserve local spatial dependence in the confidential data. To do so more effectively, we propose to take advantage of models developed specifically for point patterns, in particular models for marked point processes, to generate fully synthetic data with locations and attributes. In marked point process modeling, there are two general approaches to modeling locations and attributes simultaneously. Attributes (marks) can be modeled as conditional on the locations, where the locations are generated using a point process. Typically, this approach combines standard geostatistical methods with point process theory to create a model where both the marks and the points are considered random (Diggle, 2013). For example, Rathbun and Cressie (1994) create a space-time survival point process model for forest data where the birth of new trees is modeled using a point process, each tree’s growth is modeled using a geostatistical model, and then the lifespan of each tree is modeled using a geostatistical survival model. Alternatively, locations can be modeled as conditional on the marks, which is most sensible when the marks are categorical. In essence, this results in a collection of point processes—each with its own intensity surface—rather than a single point process. Liang et al. (2009) used this approach to model rates of colon and rectal cancer in the greater Twin Cities (Minneapolis-St. Paul, MN) area. A similar approach has been used by Guhanyiogi et al. (2011) in the context of specifying knots in the modified predictive process of Banerjee et al. (2010).

In this article, we present an approach to generating fully synthetic, point-referenced data based on marked point process modeling. We use a fully Bayesian hierarchical model that directly models data with exact geographic locations, categorical marks (e.g., race, gender, education level, cause of death), and non-categorical marks (subject’s age). We generate synthetic data using a three-step process, namely (i) generate synthetic versions of the categorical attributes without considering locations, (ii) generate synthetic locations conditional on categorical attributes, and (iii) generate synthetic versions of continuous attributes given the locations and categorical attributes. We illustrate the approach on mortality records from Durham, North Carolina (NC). We note that our approach differs from the that of Zhou et al. (2010), who use spatial smoothing to mask non-geographic attributes left at their original locations.

The remainder of the article is organized as follows. In Section 2, we describe the three-step method for modeling marked point processes. In Section 3, we describe how the parameter estimates from the marked point process model can be used to generate fully synthetic datasets in a computationally efficient manner. In Section 4, we illustrate the approach on publicly available mortality data from Durham, North Carolina. In particular, we demonstrate that the approach can preserve spatial structure in the original data and offer high quality estimates of nonspatial regression coefficients. Finally, in Section 5, we provide a concluding discussion.

2 Methodology for Modeling Marked Point Processes

The methodology we propose for modeling marked point processes is comprised of three main components: the categorical mark model, the point processes, and the non-categorical mark model. These components build upon each other sequentially utilizing a hierarchical (conditional) framework.

2.1 Model for categorical marks

Suppose the data comprise individuals with combinations of categorical marks. Let denote the number of individuals belonging to the -th combination, where . A natural choice for modeling the vector is the multinomial distribution,

(1)

where denotes the total sample size and denotes the vector of probabilities. It may be useful to put log-linear constraints on or to use mixtures of multinomial distributions (Dunson and Xing, 2009; Si and Reiter, 2013; Manrique-Vallier and Reiter, 2014), particularly when is large.

2.2 Point process model

To model locations, we take an approach similar to that of Liang et al. (2009) and use a log-Gaussian Cox process for each categorical mark combination. That is, for a set of locations corresponding to the -th combination, a spatial domain , and an intensity surface , we have

(2)

where . Here, denotes a vector of spatial predictors (e.g., elevation, proximity to a body of water, etc.) with the corresponding vector of regression coefficients , denotes a random effect that induces correlation in the intensity surface, and for and . In order to allow for correlation between intensity surfaces, one option is to specify such that , where accounts for the covariance between surfaces and controls spatial association (say, using a Matérn correlation structure).

Due to the random structure of , the integral in (2) is intractable and thus (2) cannot be computed in closed form. Instead, we approximate the integral in (2) via numerical integration, leading to

where denotes the set of points aligned using a grid for the numerical integration approximation. The choice of depends primarily on the spatial domain being analyzed, though larger values of will result in a better approximation of the integral in (2), provided covers evenly.

This increases the dimension of to . As may be large, and as we may require to be large in order to approximate the integral in (2) accurately, this can lead to a computational bottleneck. As such, we use the predictive process of Banerjee et al. (2008) to reduce the dimension of to a more manageable level. In this framework, we define a set of knots, , and such that . This results in

(3)

where with , where denotes the vector of spatial correlations between a location and the knot locations, . Details regarding the number and the placement of knots are provided in Section 2.5.

2.3 Model for non-categorical marks

The model for non-categorical marks proceeds using standard approaches from the geostastical literature; for further discussion see Cressie and Wikle (2011) and the references therein. The exact model for the non-categorical marks is problem specific and agencies should use appropriate models for the non-categorical marks found in the data. Here, we describe a model for a continuously varying mark based on a normal regression. In Section 4, where we model age as a non-categorical mark, we describe and use a truncated Poisson distribution.

Let be the value of the non-categorical mark for individual with categorical mark combination . For a continuously varying , a typical model is the regression

(4)

where is a vector of spatially varying covariates with a corresponding vector of regression coefficients, , and is a random effect that induces correlation between the responses. As before, we construct with , where and are defined similar to and , respectively. Here again, there may be computational issues pertaining to the dimension of ; thus, we consider the modified predictive process of Banerjee et al. (2010). This method is similar to the predictive process used for , but here we have

(5)

The extra variability added from (5) is necessary to correct for bias in the estimation of (see Finley et al. (2009) and Banerjee et al. (2010) for details). Replacing for in (4) yields

2.4 Linking the components

We assume conditional independence between the processes and parameters of the geographies and the marks given the data, allowing us to link the sub-models in Sections 2.12.3 sequentially. The resulting posterior distribution can be written as

where, for two random variables and , denotes the conditional distribution of given and denotes the marginal distribution of , denotes the matrix of covariate information, and , , and are vectors analogous to .

Using the likelihood in (1), for we have

where is a -vector of probabilities such that . In practice, we set for all for a noninformative prior specification, though this choice can be adapted depending on the particular application.

Using the model in (2), for we have

Here, we assign vague prior distributions for and , and we let the hyperparameters for depend on the spatial range of the data. That said, has been shown to be difficult to estimate in these types of models. For instance, Liang et al. (2009) opt to fix at a sensible value based on the maximum distance between knot points in the predictive process. An alternative would be to compare the fits for a number of values of and choose the value of which provides the best fit based on some model selection criteria, turning an estimation problem into one of model selection.

Using the likelihood in (4), for we have

Here, for and , is a mean vector with elements , and is a block diagonal matrix where the -th block has along the diagonal. We assign vague prior distributions for and , and let the hyperparameters for depend on the spatial range of the data. Finally, denotes the multivariate normal distribution from the modified predictive process defined in (5), and denotes an improper uniform prior for on the interval , following the recommendation of Gelman (2006).

2.5 Computational details

A benefit of the decomposable nature of this model is that the model can be estimated in parallel. That is, the MCMC algorithm can be run for each sub-model independently of the others, which saves computing time and resources. Further parallelization and hence computational savings can be realized by estimating the models independently for each categorical mark combination, though this would require either fixing the range parameter(s), and/or , at sensible values or allowing each combination to have its own range parameter, e.g., . These actions can limit borrowing of strength across combinations; thus, we recommend this strategy only when all are sufficiently large. In the NC mortality data synthesis, we fix both and to facilitate faster computation. We recommend assessing sensitivity by refitting the model with different choices of these parameters.

As in any knot-based method, the number and placement of knots are important decisions. Regarding the placement of knots, it is quite common to simply place knots by overlaying a grid of the spatial region (e.g. Liang et al., 2009). Others suggest the use of more sophisticated methods, such as space filling designs (Nychka and Saltzman, 1998). In the case of the modified predictive process, Guhanyiogi et al. (2011) used a point process model to adaptively select knots throughout the course of their MCMC algorithm. The authors found that an adaptive knot selection algorithm with only 25 knots performed as well as a non-adaptive algorithm with 81 knots aligned on a grid with respect to model fit and was on par with 225 knots with respect to prediction, despite requiring only 1/3 of the computing time.

Here, we offer a compromise by splitting the knots into two subsets. First, we place knots uniformly using a grid over , ensuring that we learn about the entire intensity surface. For the remaining knots, we first fit a general (e.g., unmarked) point process for the data to find the overall intensity surface; this can be done using the approach described in Section 2.2 using . After determining where the data are more heavily concentrated, we draw knot locations using this overall intensity surface. Unlike Guhanyiogi et al. (2011), however, we do not resample knot locations at each iteration of the MCMC, thereby avoiding additional computational burden. With this approach, our goal is to achieve the benefits of dense knot placement (increased predictive performance) and learn about the entire surface by filling the space, while preserving the computational benefits of having fixed knot locations. In general, we recommend choosing such that for all . Furthermore, based on the existing predictive process literature, we choose values of near 100 — aiming to strike a balance between predictive performance and computational burden — though in practice this will depend on the particular dataset.

3 Generating Fully Synthetic Data

Suppose we seek to generate fully synthetic datasets of size . We assign categorical marks to each of the samples by drawing from their posterior predictive distribution. Under the multinomial-Dirichlet model, we can create synthetic mark combinations for all records in one step. Let be the number of cases at each categorical mark combination in the -th synthetic dataset. We sample using

where is an approximately independent draw from the posterior distribution of .

For each record, we next draw a synthetic location from the intensity surface corresponding to its synthetic categorical mark combination. To do so conveniently, we first choose a large number of candidate locations; these serve as the sample space of the synthetic locations. For each of our candidate locations, , we compute from the -th sample from the posterior predictive distribution. Given our estimated intensity surface, , we can sample from

by letting

When available, the candidate pool can include all actual locations (e.g., residential addresses) in . Otherwise, the pool can be drawn uniformly across or placed using a fine grid. In the latter case, when the size of the candidate pool is not substantially larger than , or when certain areas are particularly densely populated, we recommend sampling the locations from the pool with replacement. While sampling with replacement can generate multiple observations at the same location, this is preferred to drawing unlikely locations (with respect to and the estimated intensity surface) solely due to an inadequate number of candidate locations. As such, we recommend sampling with replacement using at least 2,500 candidate locations (e.g., a grid). Of course, the size of the pool should depend on and the spatial resolution desired.

Once the synthetic individuals have been assigned categorical marks and locations, we next generate their non-categorical marks from their posterior predictive distributions. For example, when is assumed to follow (2.3), we sample each synthetic from

where is an approximately independent draw from the posterior distribution of , for . When is assumed to follow other models, we sample from the corresponding posterior predictive distribution, and similarly for the case of multiple non-categorical marks.

3.1 Evaluating data utility

To evaluate the utility of the fully synthetic data, we recommend comparing the distribution of synthetic and original geographies as well as the results of representative statistical analyses on the synthetic and original data. For geographies, we compare the function of the synthetic geographies to that of the confidential data. The function (Bartlett, 1964; Ripley, 1976) is a measure of spatial dependence such that, for a given radius, , is the expected number of events within distance of an arbitrary event. An estimate of can be obtained by computing

(6)

where is the area of spatial domain. As discussed in Cressie and Wikle (2011, pp. 210–212), this estimator is biased in the presence of edge effects; a discussion of edge-corrected estimators can be found in Cressie (1993, pp. 615–618). Using (6), we can then compute an estimate of the function,

(7)

where positive values of indicate spatial clustering. For our purposes, we compute (7) for a range of values of in each of the synthetic datasets and obtain the average over , denoted , and then compare the resulting curve (and its pointwise empirical 95% CI) to that obtained using the original data.

With respect to analyses of the marks, we use the methods in Reiter (2003) to determine point and interval estimates from the synthetic data. These inferential methods—developed for partially synthetic data—are also appropriate for fully synthetic data when (i) the original data can be analyzed as a simple random sample, and (ii) the number of synthetic observations equals the number of original observations. When this is not the case, we recommend using the methods of Si and Reiter (2011) or Raghunathan et al. (2003).

3.2 Evaluating disclosure risks

Intruders cannot meaningfully match fully synthetic records to external files; however, this does not mean that the synthetic data are risk free. In particular, the synthetic data can be subject to inferential disclosure risks (Hundepool et al., 2012), i.e., intruders can use the synthetic data to estimate confidential data values with high accuracy. In this section, we describe two inferential disclosure risk measures, one for estimating spatial locations given marks and one for estimating attributes given locations.

Before presenting the measures, we define two terms: spatially close and similar attributes. Two locations are said to be spatially close if the distance between them is less than , where is defined by the agency. Two individuals are said to have similar attributes if they belong to the same mark categories and their non-categorical marks are within of each other, where again is defined by the agency. The definition for similar attributes assumes a single non-categorical mark, but it could be extended to multiple non-categorical marks. For an individual at location with categorical mark combination and non-categorical mark , we denote the properties of being spatially close and having similar attributes as and , respectively.

In the first scenario, we assume that an intruder knows someone is at a particular location and seeks to learn that individual’s attributes. In this setting, we consider the inferential disclosure risk to be high if a large percentage of synthetic individuals who are spatially close to have similar attributes to an individual in the confidential dataset at location . To assess this, we estimate by computing

(8)

for . Uncertainty estimates for this type of risk — henceforth referred to as “Type S Risk” — can be found using the quantiles of .

In the second scenario, we assume than an intruder knows an individual’s attributes and wishes to learn the location. Here, the inferential disclosure risk for an individual at location is high if a large percentage of synthetic individuals with similar attributes are spatially close to . To assess this, we estimate by computing

(9)

for . As before, uncertainty estimates for this type of risk — henceforth “Type A Risk” — can be found using the quantiles of .

As written, both (8) and (9) assume that each synthetic dataset contains at least one observation that is spatially close and at least one observation that has similar attributes. This may not always be the case for any given or , particularly for outlying observations (either spatially or with respect to the attributes). When this occurs, one option is to choose sufficiently large values for these thresholds to ensure that neither (8) and (9) has a zero in the denominator. An alternative is to use only the synthetic datasets with non-zero denominators; for instance, if only 40 synthetic datasets have spatially close individuals for a given , base the estimate for only on these 40 datasets.

4 North Carolina Mortality Data Example

We now apply the methods presented in Section 2 to create fully synthetic datasets for a database on causes of mortality in Durham, NC in the year 2002. The data comprise records that include the precise latitude and longitude of each individual’s residence at time of death (these have been scaled to the unit square). Each record also includes the individual’s age (between 16 and 98), race (black or white), sex, level of education, and cause of death. We categorize the level of education into less than high school, high school, and some college. We dichotomize cause of death into an indicator for causes due to cancer or a failure of the immune system versus all other causes. Similar data were used by Wang and Reiter (2012) and Paiva et al. (2014).

The cross-tabulation of race, sex, education, and the cause of death indicator results in combinations, which we treat as the categorical marks. The values of range from 79 to 525, so that we have sufficient sample size to estimate the intensity surface for each combination separately. We treat age as a non-categorical mark. Since age is integer-valued and has restricted support, we model it using the truncated Poisson distribution,

where is from a predictive process akin to that used in (3). We use an intercept-only model as we do not have spatial covariates. We use a flat prior for each and conventional prior specifications for other parameters, so that the sub-model for age is

Here again, we opt to fix so that the effective range is half the maximum distance between any two points in our original dataset. To estimate the intensity and age surfaces, we use knots, allowing while also ensuring for all .

We generate synthetic datasets, each comprised of individuals. To do so, we follow the approach described in Section 3, replacing the normal model with the truncated Poisson model. For the pool of candidate synthetic locations, we sample locations from a grid, as we do not have a list of all possible locations in Durham. We generate synthetic ages from the appropriate posterior predictive distribution, which given parameter draws takes the form

To evaluate the utility of the synthetic data, we first compare the synthetic and original locations with the function. Figure 1 displays the comparison for white males with less than a high school education whose cause of death was not due to cancer or some failure of the immune system; similar findings hold for the remaining 23 groups. For , from the synthetic data is similar to from the confidential data, indicating that the synthetic data accurately reflect the degree of clustering in the real data for these values of . When , slightly underestimates from the confidential data. This bias may be due to the discontinuity in resulting from generating data on a grid.

(a) True Locations
(b) Synthetic Locations
(c) Function
Figure 1: Selected figures for white males with less than a high school education whose cause of death was not due to cancer or some failure of the immune system from the Durham, NC mortality data. Panel (a) displays locations of the observations from the true dataset while panel (b) displays the geographies from one of our synthetic datasets, both overlayed on our estimated intensity surface where off-white denotes low intensity and red denotes high. Panel (c) displays over a range of values of for our true (dashed red line) and synthetic data (black line), along with the pointwise empirical 95% CI from the synthetic data (gray band).

We next evaluate two analyses involving the marks. First, we fit a Poisson regression of age on the set of indicator variables for each categorical mark combination, i.e.,

where is an integer between 1 and 24 denoting which combination of race, gender, education level, and cause of death the individual belongs to. Figure 2(a) displays the estimated coefficients from this regression for both the synthetic and confidential data. Here, we use the methods from Reiter (2003) to form inferences from the synthetic data. The inferences from the synthetic data are quite similar to those based on the confidential data.

Second, we fit a logistic regression of cause of death on race, sex, and education level and their interactions; i.e., we let be an integer between 1 and 12 denoting which combination of race, gender, and education level each individual belongs to and assume

Figure 2(b) displays the estimated coefficients from the synthetic and confidential data. Once again, the synthetic data inferences are similar to the confidential data inferences. Of note, the cause of death in females without a high school education — in both white and black populations (groups 7 and 10 in Figure 2(b), respectively) — is significantly less likely to be cancer-related than it is for the rest of the population.

(a) Poisson Regression
(b) Logistic Regression
(c) Risk
Figure 2: Assessments of data utility and risk. Panels (a) and (b) compare parameter estimates from a Poisson regression on age and a logistic regression on the cause of death. Circles denote mean estimates while bars denote the 95% CI. Estimates from the real data are displayed in black (the leftmost interval in each pair), while estimates obtained from the synthetic data are displayed in red (the rightmost interval in each pair). Panel (c) plots the estimated Type S Risk versus the estimate Type A Risk. Estimates shown here are the medians of the values of (8) and (9) over our 100 synthetic datasets.

These evaluations of data utility suggest that the synthesizer can generate data with useful analytic properties. Of course, it is prudent for agencies to evaluate the quality of synthetic data on a much wider class of representative analyses; see Reiter and Drechsler (2010) for further discussion of this issue. We also require the synthesizer to generate synthetic data with reasonably low inferential disclosure risks. To evaluate this, we use the risk measures in (8) and (9). As shown in Figure 2(c), all 6294 individuals in the confidential data have estimated levels of Type S and Type A risk less than 0.20 when using (roughly of a mile) and years. Furthermore, many of the individuals who we identify as being at the highest risk are simply those from the most frequently observed sub-populations in the data. Given the nature of these data, it is quite possible that a number of these individuals resided in assisted living facilities in which a number of elderly persons with similar racial and socioeconomic backgrounds live.

5 Discussion

In addition to providing fully synthetic data with potentially high utility, there are several other benefits to the hierarchical marked point process approach. This approach allows one to model the various intensity surfaces and the response surface(s) in parallel (whether or not one fixes the spatial range parameter, ), yielding substantial benefits both computationally and in terms of model simplicity. In contrast, modeling a single intensity surface and jointly modeling a large number of responses would result in a substantially higher computational burden. In fact, even devising a joint model for all of the marks would pose a challenge (e.g., modeling gender as a function of location, age, race, education, and cause of death).

The framework is general and can be adapted to a variety of settings. Here, we presented two types of univariate responses—a normal distribution and a truncated Poisson distribution—but any geostatistical model can be used, including models for multivariate outcomes. For instance, we presented a normal model with constant variance over space, but one can instead use models with spatially-varying variances. One such example would be the modeling of individuals’ income, where individuals in rural areas may have less variable incomes than their urban counterparts. While we used (modified) predictive processes to achieve dimension reduction in the modeling of the intensity and response surfaces, one may use other dimension reduction approaches such as those described in Wikle (2010) if they are more appropriate for a given setting.

Although the methodology is intended to generate fully synthetic datasets, agencies can easily adapt the procedures to generate partially synthetic data. A natural option would be to reorganize the modeling strategy such that sensitive attributes are modeled conditionally on the non-sensitive data. For instance, it may be the case that only the individuals’ locations are deemed sensitive; in this case, we could define the intensity surfaces in Section 2 as functions of the non-categorical marks as well as the categorical marks. While intuitive structures for modeling these intensity surfaces can be difficult to define, the end result would be similar to the CART-based approach used by Wang and Reiter (2012).

These methods could naturally be extended to the spatio-temporal setting. In its simplest form, one could treat time as discrete (e.g., weekly incidents of a communicable disease) and consider each time-point as a separate categorical mark. The case of continuous time is more challenging, both in the modeling stage and in the generation of synthetic data. While one could treat space-time as a 3-dimensional object, generating synthetic space-time coordinates could easily require evaluating the 3-dimensional “intensity object” at over 100,000 different space-time proposal coordinates. For our example, we used a grid of spatial locations; expanding this to include 50 time-points suddenly requires 125,000 unique space-time coordinates. This is a subject of future research.

Another area for future work is to provide a comprehensive investigation of what to do when disclosure risks are deemed unacceptably high. For example, inferential disclosure risks may be unacceptably high when the models overfit the data or when the confidential data consist of relatively homogeneous clusters of individuals. In these cases, we speculate that agencies may need to incorporate differential amounts of spatial smoothing for locations and cases at high risk. Obviously, the challenge here will be to further reduce disclosure risks without sacrificing data utility.

Acknowledgements

This research was partially supported by the U.S. National Science Foundation (NSF) and the U.S. Census Bureau under NSF grants SES-1132031 and SES-1131897, funded through the NSF-Census Research Network (NCRN) program.

References

  • Abowd et al. (2006) Abowd, J., Stinson, M., and Benedetto, G. (2006). “Final report to the Social Security Administration on the SIPP/SSA/IRS Public Use File Project.” Tech. rep., U.S. Census Bureau Longitudinal Employer-Household Dynamics Program. Available at http://www.census.gov/sipp/synth_data.html.
  • Abowd and Woodcock (2001) Abowd, J. M. and Woodcock, S. D. (2001). “Disclosure limitation in longitudinal linked data.” In Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, eds. P. Doyle, J. Lane, L. Zayatz, and J. Theeuwes, 215–277. Amsterdam: North-Holland.
  • Abowd and Woodcock (2004) — (2004). “Multiply-imputing confidential characteristics and file links in longitudinal linked data.” In Privacy in Statistical Databases, eds. J. Domingo-Ferrer and V. Torra, 290–297. New York: Springer-Verlag.
  • An and Little (2007) An, D. and Little, R. (2007). “Multiple imputation: an alternative to top coding for statistical disclosure control.” Journal of the Royal Statistical Society, Series A, 170, 923–940.
  • Armstrong et al. (1999) Armstrong, M. P., Rushton, G., and Zimmerman, D. L. (1999). “Geographically masking health data to preserve confidentiality.” Statistics in Medicine, 18, 495–525.
  • Banerjee et al. (2010) Banerjee, S., Finley, A. O., Waldmann, P., and Ericsson, T. (2010). “Hierarchical spatial process models for multiple traits in large genetic trials.” Journal of the American Statistical Association, 105, 506–521.
  • Banerjee et al. (2008) Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). ‘‘Gaussian predictive process models for large spatial data sets.” J. R. Statist. Soc. B, 70, 825–848.
  • Bartlett (1964) Bartlett, M. S. (1964). “The spectral analysis of two-dimensional point process.” Biometrika, 51, 299–311.
  • Burgette and Reiter (2013) Burgette, L. F. and Reiter, J. P. (2013). “Multiple-shrinkage multinomial probit models with applications to simulating geographies in public use data.” Bayesian Analysis, 8, 1–26.
  • Cressie and Wikle (2011) Cressie, N. and Wikle, C. K. (2011). Statistics for Spatio-Temporal Data. Hoboken, NJ: Wiley.
  • Cressie (1993) Cressie, N. A. C. (1993). Statistics for Spatial Data. New York, NY: Wiley.
  • Dalenius and Reiss (1982) Dalenius, T. and Reiss, S. P. (1982). “Data-swapping: A technique for disclosure control.” Journal of Statistical Planning and Inference, 6, 73–85.
  • Diggle (2013) Diggle, P. J. (2013). Statistical Analysis of Spatial and Spatio-Temporal Point Patterns. Boca Raton: Chapmand & Hall/CRC Press.
  • Drechsler and Reiter (2010) Drechsler, J. and Reiter, J. P. (2010). “Sampling with synthesis: A new approach for releasing public use census microdata.” Journal of the American Statistical Association, 105, 1347–1357.
  • Dunson and Xing (2009) Dunson, D. B. and Xing, C. (2009). “Nonparametric Bayes modeling of multivariate categorical data.” Journal of the American Statistical Association, 104, 1042–1051.
  • Finley et al. (2009) Finley, A. O., Sang, H., Banerjee, S., and Gelfand, A. E. (2009). “Improving the performance of predictive process modeling for large datasets.” Comput. Stat. Data Anal., 8, 2873–2884.
  • Fuller (1993) Fuller, W. A. (1993). “Masking procedures for microdata disclosure limitation.” Journal of Official Statistics, 9, 383–406.
  • Gelman (2006) Gelman, A. (2006). “Prior distributions for variance parameters in hierarchical models.” Bayesian Analysis, 1, 515–533.
  • Guhanyiogi et al. (2011) Guhanyiogi, R., Finley, A. O., Banerjee, S., and Gelfand, A. E. (2011). “Adaptive Gaussian predictive process models for large spatial datasets.” Environmetrics, 22, 997–1007.
  • Hawala (2008) Hawala, S. (2008). “Producing partially synthetic data to avoid disclosure.” In Proceedings of the Joint Statistical Meetings. Alexandria, VA: American Statistical Association.
  • Holan et al. (2010) Holan, S. H., Toth, D., Ferreira, M. A. R., and Karr, A. (2010). “Bayesian multiscale multiple imputation with implications for data confidentiality.” Journal of the American Statistical Association, 105, 564–577.
  • Hundepool et al. (2012) Hundepool, A., Domingo-Ferrer, J., Franconi, L., Giessing, S., Nordholdt, E. S., Spicer, K., and de Wol, P. P. (2012). Statistical Disclosure Control. Chichester: Wiley.
  • Kennickell (1997) Kennickell, A. B. (1997). “Multiple imputation and disclosure protection: The case of the 1995 Survey of Consumer Finances.” In Record Linkage Techniques, 1997, eds. W. Alvey and B. Jamerson, 248–267. Washington, D.C.: National Academy Press.
  • Kinney et al. (2011) Kinney, S. K., Reiter, J. P., Reznek, A. P., Miranda, J., Jarmin, R. S., and Abowd, J. M. (2011). “Towards unrestricted public use business microdata: The synthetic Longitudinal Business Database.” International Statistical Review, 79, 363–384.
  • Liang et al. (2009) Liang, S., Carlin, B. P., and Gelfand, A. E. (2009). “Analysis of Minnesota colon and rectum cancer point patterns with spatial and nonspatial covariate information.” Annals of Applied Statistics, 3, 943–962.
  • Little (1993) Little, R. J. A. (1993). “Statistical analysis of masked data.” Journal of Official Statistics, 9, 407–426.
  • Machanavajjhala et al. (2008) Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J., and Vilhuber, L. (2008). “Privacy: Theory meets practice on the map.” In IEEE 24th International Conference on Data Engineering, 277–286.
  • Manrique-Vallier and Reiter (2014) Manrique-Vallier, D. and Reiter, J. P. (2014). “Bayesian estimation of discrete multivariate latent structure models with structural zeros.” Journal of Computational and Graphical Statistics, to appear.
  • Nychka and Saltzman (1998) Nychka, D. and Saltzman, N. (1998). “Design of air-quality monitoring networks.” Lect. Notes Statist., 132, 51–76.
  • Paiva et al. (2014) Paiva, T., Chakroborty, A., Reiter, J., and Gelfand, A. (2014). “Imputation of confidential data sets with spatial locations using disease mapping models.” Statistics in Medicine, 33, 1928–1945.
  • Raghunathan et al. (2003) Raghunathan, T. E., Reiter, J. P., and Rubin, D. B. (2003). “Multiple imputation for statistical disclosure limitation.” Journal of Official Statistics, 19, 1–16.
  • Rathbun and Cressie (1994) Rathbun, S. L. and Cressie, N. (1994). “A space-time survival point process for a longleaf pine forest in southern Georgia.” Journal of the American Statistical Association, 89, 1164–1174.
  • Reiter (2002) Reiter, J. P. (2002). “Satisfying disclosure restrictions with synthetic data sets.” Journal of Official Statistics, 18, 531–544.
  • Reiter (2003) — (2003). “Inference for partially synthetic, public use microdata sets.” Survey Methodology, 29, 181–188.
  • Reiter (2004) — (2004). “Simultaneous use of multiple imputation for missing data and disclosure limitation.” Survey Methodology, 30, 235–242.
  • Reiter (2005a) — (2005a). “Releasing multiply-imputed, synthetic public use microdata: An illustration and empirical study.” Journal of the Royal Statistical Society, Series A, 168, 185–205.
  • Reiter (2005b) — (2005b). “Significance tests for multi-component estimands from multiply-imputed, synthetic microdata.” Journal of Statistical Planning and Inference, 131, 365–377.
  • Reiter (2005c) — (2005c). “Using CART to generate partially synthetic, public use microdata.” Journal of Official Statistics, 21, 441–462.
  • Reiter and Drechsler (2010) Reiter, J. P. and Drechsler, J. (2010). “Releasing multiply-imputed, synthetic data generated in two stages to protect confidentiality.” Statistica Sinica, 20, 405–422.
  • Reiter and Raghunathan (2007) Reiter, J. P. and Raghunathan, T. E. (2007). “The multiple adaptations of multiple imputation.” Journal of the American Statistical Association, 102, 1462–1471.
  • Ripley (1976) Ripley, B. D. (1976). “The second-order analysis of stationary point processes.” Journal of Applied Probability, 13, 255–266.
  • Rubin (1993) Rubin, D. B. (1993). “Statistical disclosure limitation.” Journal of Official Statistics, 9, 461–468.
  • Si and Reiter (2011) Si, Y. and Reiter, J. P. (2011). “A comparison of posterior simulation and inference by combining rules for multiple imputation.” Journal of Statistical Theory and Practice, 5, 335–347.
  • Si and Reiter (2013) — (2013). “Nonparametric Bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys.” Journal of Educational and Behavioral Statistics, 38, 499 – 521.
  • VanWey et al. (2005) VanWey, L. K., Rindfuss, R. R., Guttman, M. P., Entwisle, B., and Balk, D. L. (2005). “Confidentiality and spatially explicit data: concerns and challenges.” Proceedings of the National Academy of Sciences, 102, 15337–15342.
  • Wang and Reiter (2012) Wang, H. and Reiter, J. P. (2012). “Multiple imputation for sharing precise geographies in public use data.” Annals of Applied Statistics, 6, 229–252.
  • Wikle (2010) Wikle, C. (2010). “Low rank representations as models for spatial processes.” In Handbook of Spatial Statistics, eds. A. E. Gelfand, P. J. Diggle, M. Fuentes, and P. Guttorp, 107–118. Chapman and Hall/CRC.
  • Willenborg and de Waal (2001) Willenborg, L. and de Waal, T. (2001). Elements of Statistical Disclosure Control. New York: Springer-Verlag.
  • Winkler (2007) Winkler, W. E. (2007). “Examples of easy-to-implement, widely used methods of masking for which analytic properties are not justified.” Tech. rep., U.S. Census Bureau Research Report Series, No. 2007-21.
  • Zhou et al. (2010) Zhou, Y., Dominici, F., and Louis, T. A. (2010). “A smoothing approach for masking spatial data.” Annals of Applied Statistics, 4, 1451 – 1475.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
101966
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description