# Multiple imputation for sharing precise geographies in public use data

## Abstract

When releasing data to the public, data stewards are ethically and often legally obligated to protect the confidentiality of data subjects’ identities and sensitive attributes. They also strive to release data that are informative for a wide range of secondary analyses. Achieving both objectives is particularly challenging when data stewards seek to release highly resolved geographical information. We present an approach for protecting the confidentiality of data with geographic identifiers based on multiple imputation. The basic idea is to convert geography to latitude and longitude, estimate a bivariate response model conditional on attributes, and simulate new latitude and longitude values from these models. We illustrate the proposed methods using data describing causes of death in Durham, North Carolina. In the context of the application, we present a straightforward tool for generating simulated geographies and attributes based on regression trees, and we present methods for assessing disclosure risks with such simulated data.

Annals of Applied Statistics, Volume 6, Issue 1 (2012), pages 229–252. DOI: 10.1214/11-AOAS506.


Supported by NIH Grant R21 AG032458-02. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Hao Wang (haowang@sc.edu) and Jerome P. Reiter (jerry@stat.duke.edu)

Keywords: confidentiality, disclosure, dissemination, spatial, synthetic, tree.

## 1 Introduction

Statistical agencies, research centers and individual researchers frequently collect geographic data as an integral part of their studies. Geographic data can be highly beneficial for analyses. In studies of aging, for example, they can reveal areas where elderly people live in high densities, which is useful for policy and planning; they can illuminate how environmental factors impact the health and quality of life of elderly people; and, through contextual data, they can yield insights into the social and economic conditions and lifestyle choices of the elderly. Analysts who do not account for spatial dependencies may miss important geographic trends and differences, potentially resulting in invalid inferences.

Geographic variables also are among the most challenging data to share when making a primary data source available to other researchers and the broader public. Very fine geography, while facilitating detailed spatial analyses, enables ill-intentioned users to infer the identities of individuals in the shared file. Even modestly coarse geography can be risky in the presence of demographic or other readily available attributes, which when combined may identify individuals in the shared file. Such identifications are problematic for data collectors, who are ethically and often legally obligated to protect data subjects’ confidentiality. To reduce the risks of disclosures, data collectors typically delete or aggregate geographies to high levels before sharing data. Unfortunately, deletion and aggregation sacrifice the quality of analyses that utilize finer geographic detail.

We propose to protect the confidentiality of data with fine geographic identifiers by simulating values of geographies and other identifying attributes from statistical models that capture the spatial dependencies among the variables in the collected data. These simulated values replace the collected ones when sharing data. To enable estimation of variances, the data steward generates several versions of the data sets for dissemination, resulting in multiply-imputed, partially synthetic data sets [Little (1993), Reiter (2003)]. Such data sets can protect confidentiality, since identification of units and their sensitive data can be difficult when the geographies and other quasi-identifiers in the released data are not actual, collected values. And, when the simulation models faithfully reflect the relationships in the collected data, the shared data can preserve spatial associations, avoid ecological inference problems, and facilitate small area estimation.

The remainder of the article is as follows. In Section 2 we describe some of the shortcomings of current approaches to protecting data with geographies, and we motivate the use of multiple imputation for releasing public use data with highly resolved geographies. In Section 3 we generate multiply-imputed, partially synthetic versions of a spatially-referenced data set describing causes of death in Durham, North Carolina. As part of the application, we present an easy-to-implement data simulator based on sequential regression trees for synthesizing highly-resolved geographies or attributes. We also describe methods for assessing disclosure risks for data with synthetic geographies. These include (i) a new measure for quantifying the risks that the original geographies could be recovered from the simulated data, and (ii) a measure for assessing risks of re-identifications based on the approach of Reiter and Mitra (2009). In Section 4 we conclude with issues for implementation of the approach.

## 2 Motivation for using simulated geographies

At first glance, releasing or sharing safe data seems a straightforward task: simply strip unique identifiers like names and tax identification numbers before releasing data. However, these actions alone may not suffice when other readily available variables, such as geographic or demographic data, remain on the file. These quasi-identifiers can be used to match units in the released data to other databases. When the quasi-identifiers include geographic variables, the risks of identification disclosures can be extremely high. For example, Sweeney [(2001), pages 51 and 52] showed that 97% of the records in a publicly available voter registration list for Cambridge, MA, could be identified using only birth date and 9-digit zip code. Because of the disclosive nature of geography, the U.S. Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule requires that, when sharing certain health data, the released geographic units comprise at least 20,000 people [Federal Register (2000), page 82543].

Data stewards can protect confidentiality by restricting public access to the data. For example, analysts can use the data only in secure data enclaves, such as the Research Data Centers operated by the U.S. Census Bureau. Or, analysts can submit queries to remote access systems that provide statistical output without revealing the data that generated the output. While useful, restricted access strategies are only a partial solution. Analysts who do not live near a secure data enclave, or do not have the resources to relocate temporarily to be near one, are shut out from this form of access. Gaining restricted access can require months of proposal preparation and background checks; analysts cannot simply walk in to any secure data enclave and immediately start working with the data. Remote access servers limit the scope of analyses and details of output, since clever queries can reveal individual data values [Gomatam et al. (2005)]. Performing exploratory data analysis and checking model fit are difficult without access to record-level data. Hence, as recommended by two recent National Research Council panels on data confidentiality, to maintain the benefits of wide dissemination, it is necessary to supplement restricted access strategies with readily available, record-level data [National Research Council (2005, 2007)].

### 2.1 Common approaches to protecting geography

Data stewards commonly employ several strategies for protecting confidentiality when sharing data with geographic identifiers. However, these methods can have serious impacts on the quality of the released data, as we now describe.

#### Data suppression

Data stewards can suppress geography or attributes from data releases. The intensity of suppression can range from not releasing entire variables, for example, stripping the file of all geographic identifiers, to not releasing small subsets of values, for example, blanking out sensitive attribute values. An example of the former is the Health and Retirement Study: the public use data do not contain any geographic information on relocations [Health and Retirement Study (2007), page 14]. Increasing the intensity of suppression generally increases data protection and decreases data quality. While intense suppression can reduce risks, it has repercussions for inferences. Wholesale deletion of geographic identifiers disables any spatial analysis. When relationships depend on the omitted geography, analysts’ inferences are biased. Selective suppression of geography or attributes creates data that are missing not at random, which complicates analyses for users. When there are many records at risk, as is likely the case when the data have fine geographic identifiers, data stewards may need to suppress so many values to achieve satisfactory protection that the released data have very limited quality for spatial analysis.

#### Data aggregation

Data stewards can coarsen geography or other variables, for example, releasing addresses at the block or county rather than parcel level, or releasing ages in five year intervals. Aggregation reduces disclosure risks by turning unique records—which generally are most at risk—into nonunique records. For example, there may be only one person with a particular combination of demographic characteristics in a street block, but many people with those characteristics in a state. Releasing data for this person with geography at the street level might have a high disclosure risk, whereas releasing the data at the state level might not. The amount of aggregation needed to protect confidentiality depends on the nature of the data. When other identifying attributes are present, such as demographic characteristics, high-level aggregation of the geographic identifiers may be needed to achieve adequate protection. For example, there may be only one person of a certain age, sex, race and marital status—which may be available to ill-intentioned users at low cost—in a particular county, so that coarsening geographies to the county level provides no greater protection for that person than does releasing the exact address.

Aggregation preserves analyses at the level of aggregation. However, it can create ecological inference fallacies [Robinson (1950), Freedman (2004)] at lower levels of aggregation. Additionally, when geography is highly aggregated, analysts may be unable to detect important local spatial dependencies. Despite these limitations, aggregation is the most widely used solution to protect data with geographic identifiers and is routinely implemented by government agencies and other data collectors. The U.S. Census Bureau, for example, does not release geographic identifiers below aggregates of at least 100,000 people in public use files of census data. The public use files for the Health and Retirement Study aggregate geography to “a level no higher than U.S. Census Region and Division” [Health and Retirement Study (2007), page 14].

Aggregation also is frequently used to disguise values in the tails of nongeographic quasi-identifiers, especially age. HIPAA requires that all ages above 89 be aggregated into and shared as a single category, “90 or older.”

#### Random noise addition

Data stewards can disguise geographic and other attribute values by adding some randomly selected amount to each confidential observed value. For geographic attributes, this involves moving an observed location to another randomly drawn location, usually within a circle of some radius $r$ centered at the original location. The quality of inferences and the amount of protection depend crucially on $r$. When a large $r$ is needed to protect confidentiality, as is likely the case when data contain readily available quasi-identifiers, inferences involving spatial relationships can be seriously degraded [Armstrong, Rushton and Zimmerman (1999), VanWey et al. (2005)]. Adding random noise to attribute values introduces measurement error, which inflates variances and attenuates regression coefficients [Fuller (1993)].
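As a concrete illustration of this perturbation idea, the sketch below moves a point to a uniformly drawn location within a circle of radius $r$. This is our own sketch, not code from any agency: the uniform-over-disc choice and the rough kilometers-per-degree conversion are assumptions.

```python
import math
import random

def jitter_location(lat, lon, max_radius_km):
    """Move a point to a uniformly drawn location within a circle of
    radius max_radius_km centered at the original location.

    Uses sqrt(u) so draws are uniform over the disc's area, and a crude
    km-to-degree conversion (assumes small radii away from the poles)."""
    r = max_radius_km * math.sqrt(random.random())  # uniform over the disc
    theta = random.uniform(0.0, 2.0 * math.pi)
    dlat = (r / 111.0) * math.sin(theta)            # ~111 km per degree latitude
    dlon = (r / (111.0 * math.cos(math.radians(lat)))) * math.cos(theta)
    return lat + dlat, lon + dlon
```

As the text notes, the larger `max_radius_km` must be to protect confidentiality, the more spatial analyses of the perturbed points degrade.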

#### Random data swapping

Data stewards can swap data values for selected records, for example, switch values of age, race and sex for at-risk records with those for other records, to discourage users from matching, since matches may be based on incorrect data [Dalenius and Reiss (1982), Fienberg and McIntyre (2004)]. Swapping is used extensively by government agencies. It is generally presumed that swapping fractions are low—agencies do not reveal the rates to the public—because swapping at high levels destroys relationships involving the swapped and unswapped variables. Because data stewards might have to swap all geographic identifiers to ensure released records do not have their actual geographies, swapping is not effective for highly resolved geographic identifiers.

### 2.2 Proposed approach: Simulate geographic identifiers

The main limitation of the approaches in Section 2.1 is that they perturb the geography or other quasi-identifiers with minimal or no consideration of the relationships among the variables. Our proposed approach explicitly aims to preserve relationships among the geographic and other attributes through statistical modeling. At the same time, replacing geographic and other quasi-identifiers with imputations makes it difficult for ill-intentioned users to know the original values of those variables, which reduces the chance of disclosures.

Our approach differs from the recent proposal of Zhou, Dominici and Louis (2010), who use spatial smoothing to mask nongeographic attributes at the original locations. Releasing the original locations can result in high risks of identification disclosures when the data include fine geography. Zhou, Dominici and Louis (2010) do not intend to deal with these risks, whereas we explicitly seek to do so. We note that spatial smoothing could be used to mask attribute values after synthesis of locations.

To illustrate how our approach might work in practice, we modify the setting described by Reiter (2004a). Suppose that a statistical agency has collected data on a random sample of 10,000 heads of households in a state. The data comprise each person’s street block, age, sex, income and an indicator of disease status. Suppose that combining street block, age and sex uniquely determines a large percentage of records in the sample and the population. Therefore, the agency wants to replace street block, age and sex for all people in the sample—or possibly only a fraction of the three variables, for example, only street block for some records and only age and sex for others—to disguise their identities. The agency generates values of street block, age and sex for these people by randomly simulating values from their joint distribution (see Section 2.3), conditional on their disease status and income values. This distribution is estimated with the collected data. The result is one partially synthetic data set. The agency repeats this process, say, ten times, and these ten data sets are released to the public.

To illustrate how a secondary data analyst might utilize these shared data sets, suppose that the analyst seeks to fit a logistic regression of disease status on income, age, sex and indicator variables for the person’s county (obtained by aggregating the released, simulated street blocks). The analyst first estimates the regression coefficients and their variances separately in each simulated data set using standard likelihood-based estimates and standard software. Then, the analyst averages the estimated coefficients and variances across the simulated data sets. These averages are used to form 95% confidence intervals based on the simple formulas developed by Reiter (2003), described below.

The agency creates $m$ partially synthetic data sets, $D^{(1)}, \ldots, D^{(m)}$, that it shares with the public. Let $Q$ be the secondary analyst's estimand of interest, such as a regression coefficient or population average. For $l = 1, \ldots, m$, let $q^{(l)}$ and $u^{(l)}$ be respectively the estimate of $Q$ and the estimate of the variance of $q^{(l)}$ in synthetic data set $D^{(l)}$. Secondary analysts use $\bar{q}_m = \sum_{l=1}^{m} q^{(l)}/m$ to estimate $Q$ and $T_m = b_m/m + \bar{u}_m$ to estimate $\operatorname{var}(\bar{q}_m)$, where $b_m = \sum_{l=1}^{m} (q^{(l)} - \bar{q}_m)^2/(m-1)$ and $\bar{u}_m = \sum_{l=1}^{m} u^{(l)}/m$. For large samples, inferences for $Q$ are obtained from the $t$-distribution, $(\bar{q}_m - Q)/\sqrt{T_m} \sim t_{\nu_m}$, where the degrees of freedom is $\nu_m = (m-1)\bigl(1 + \bar{u}_m/(b_m/m)\bigr)^2$. Details of the derivations of these methods are in Reiter (2003). Tests of significance for multicomponent null hypotheses are derived by Reiter (2005c).
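The combining rules above take only a few lines to implement. The sketch below (our own illustration, not code from the paper) computes $\bar{q}_m$, $T_m$, and $\nu_m$ from the per-dataset estimates:

```python
def combine_partial_synthesis(q, u):
    """Reiter (2003) combining rules for m partially synthetic data sets.

    q: list of per-dataset point estimates q^(l)
    u: list of per-dataset variance estimates u^(l)
    Returns (q_bar, T, df): the point estimate, its total variance
    estimate, and the t degrees of freedom."""
    m = len(q)
    q_bar = sum(q) / m
    u_bar = sum(u) / m                                  # within-imputation variance
    b = sum((qi - q_bar) ** 2 for qi in q) / (m - 1)    # between-imputation variance
    T = b / m + u_bar                                   # total variance of q_bar
    df = (m - 1) * (1.0 + u_bar / (b / m)) ** 2         # degrees of freedom
    return q_bar, T, df
```

A secondary analyst would apply this once per estimand, after computing `q` and `u` with standard software in each released data set.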

Partially synthetic data sets can have positive data utility features. When data are simulated from distributions that reflect the distributions of the collected data, Reiter (2003, 2004b, 2005c) shows that analysts can obtain valid inferences (e.g., 95% confidence intervals contain the true values 95% of the time) for wide classes of estimands. These inferences are determined by combining standard likelihood-based or survey-weighted estimates; the analyst need not learn new statistical methods or software to adjust for the effects of the disclosure limitation. The released data can include simulated values in the tails of distributions, for example, there is no top-coding of ages or incomes [however, it is challenging to develop synthesis models that simultaneously protect confidentiality and preserve inferences when data are very sparse in tails; see Reiter (2005b)]. Because many quasi-identifiers including geography can be simulated, finer details of geography can be released, facilitating estimation for small areas and spatial analyses.

There is a cost to these benefits: the validity of inferences depends on the validity of the models used to generate the simulated data. The extent of this dependence is driven by the nature of the synthesis. For example, when all of age and sex are synthesized, analyses involving those variables reflect only the relationships included in the data generation models. When the models fail to reflect certain relationships accurately, analysts’ inferences also will not reflect those relationships. Similarly, incorrect distributional assumptions built into the models will be passed on to the users’ analyses. On the other hand, when replacing only a select fraction of age and sex and leaving many original values on the file, inferences are less sensitive to the assumptions of the simulated data models. In practice, this dependence means that data stewards should release information that helps analysts decide whether or not the simulated data are reliable for their analyses. For example, data stewards might include the data generation models (without parameter estimates) as attachments to public releases of data. Or, they might include generic statements that describe the imputation models, such as “Main effects and interactions for age, sex, income and disease status are included in the imputation models for street blocks.” Analysts who desire finer detail than afforded by the imputations may have to apply for restricted access to the collected data.

When generating partially synthetic data, the data steward must choose which values to synthesize and must specify models to simulate replacements of those values. In most existing partially synthetic data sets, stewards replace all values of variables that they deem to be either (i) readily available to ill-intentioned users seeking to identify released records, or (ii) too sensitive to risk releasing exactly. However, it may be sufficient from a confidentiality perspective to replace only portions of some variables; see Little, Liu and Raghunathan (2004). The process of specifying synthesis models is typically iterative: the data steward creates synthetic data using a posited model, checks the quality of a large number of representative analyses with the synthetic data, and adjusts the models as necessary to improve quality while maintaining confidentiality protection. For examples of this process, see Drechsler and Reiter (2010) and Kinney et al. (2011).

The data steward also must determine $m$, that is, how many synthetic data sets to release. Generally, increasing $m$ results in decreased standard errors in secondary analyses. However, increasing $m$ results in greater data storage costs and possibly increased disclosure risks [Reiter and Mitra (2009)]. When small fractions of values are synthesized (e.g., around 10%), the efficiency gains from increasing $m$ are typically modest, so that data stewards can keep $m$ modest to hold risks and storage costs comparatively low. When large fractions of values are replaced, efficiency gains from increasing $m$ can be substantial [Drechsler and Reiter (2010)]. In such cases, we recommend that data stewards select the largest $m$ that still offers acceptable risks and storage costs.

### 2.3 Synthesis models for sharing precise geographies

Our strategy for simulating geographies involves four general steps. First, the data steward converts the geographic variables on the file to latitudes and longitudes (possibly, using UTM projection to Eastings and Northings). When the collected geographies are aggregated rather than precise locations, the data steward uses a typical value for the location of all records in that area; for example, use the latitude and longitude of the centroid of the street block. Second, the data steward estimates a model for latitudes and longitudes conditional on other variables in the data set. Third, using this model, the data steward simulates new latitudes and longitudes for every record in the file. Fourth and finally, the data steward releases multiple draws of the simulated latitudes and longitudes along with the other attributes—which also might be altered to protect confidentiality, for example, Zhou, Dominici and Louis (2010)—in the original file.
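To make the fit-then-simulate structure of steps two and three concrete, here is a deliberately crude sketch in which the "model" is just per-group means and standard deviations of latitude and longitude. All function names are ours, and a real implementation would use the richer response models discussed in the text.

```python
import random
import statistics

def fit_location_model(records):
    """Step 2 sketch: estimate a response model for (lat, lon) given a
    single categorical attribute. Here the 'model' is per-group means and
    standard deviations, a crude stand-in for the bivariate regression
    and mixture models discussed in the text."""
    groups = {}
    for lat, lon, attr in records:
        groups.setdefault(attr, []).append((lat, lon))
    model = {}
    for attr, pts in groups.items():
        lats = [p[0] for p in pts]
        lons = [p[1] for p in pts]
        model[attr] = (statistics.mean(lats), statistics.stdev(lats),
                       statistics.mean(lons), statistics.stdev(lons))
    return model

def simulate_locations(records, model, rng):
    """Step 3 sketch: draw a new (lat, lon) for every record from its
    group's fitted distribution, leaving the attributes untouched."""
    synthetic = []
    for _lat, _lon, attr in records:
        m_lat, s_lat, m_lon, s_lon = model[attr]
        synthetic.append((rng.gauss(m_lat, s_lat),
                          rng.gauss(m_lon, s_lon), attr))
    return synthetic
```

Step four corresponds to calling `simulate_locations` several times with independent random states and releasing each resulting data set.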

We expect that, in general, some attributes in the data will exhibit spatial dependence. When considering location as the response variable, this implies a joint distribution for latitude and longitude that depends on the attributes and is possibly multi-modal. For example, people of similar age, socio-economic status and other demographic characteristics tend to cluster in neighborhoods, and certain demographic characteristics may be highly prevalent in multiple locations but absent in others. If we ignore these features when simulating geographies—or alter geography with approaches that do not explicitly account for these associations—the spatial relationships in the data will be altered or destroyed.

To illustrate some possible response models for locations, let $y_{i1}$ and $y_{i2}$ denote the latitude and longitude, respectively, for data subject $i$. Let $x_i$ denote the nongeographical attributes for data subject $i$. One family of convenient response models is $(y_{i1}, y_{i2}) \sim N_2(\mu_i, \Sigma_i)$, where each $\mu_i = (\mu_{i1}, \mu_{i2})$ is a vector of unknown means modeled as a function of the covariates, $\mu_i = g(x_i)$, and each $\Sigma_i$ is an unknown covariance matrix. A simple implementation is a bivariate regression model with $\mu_{ij} = \sum_k f_{jk}(x_{ik})$, where each $f_{jk}$ is a spline for variable $x_{ik}$ and $\Sigma_i = \Sigma$ for all $i$. An alternative is a mixture model with $(y_{i1}, y_{i2}) \sim N_2(\mu_{c_i}, \Sigma_{c_i})$, where $\mu_{c_i}$ and $\Sigma_{c_i}$ come from mixture components.

In specifying a response model for locations, the data steward should include components of that vary with spatial locations. The data steward also should seek a flexible model that can adapt to a potentially complex response distribution. In the application, we describe a semi-automated approach for approximating the response distribution that can be easily implemented by data stewards. We emphasize, however, that the idea of treating latitude and longitude as a response is general, and that data stewards can improve the quality of the released data by tailoring the response model to their particular problem.

To our knowledge, treating geography as a continuous response and releasing simulated draws from its distribution has not been previously implemented. However, partially synthetic data are used to protect locations in the Census Bureau’s OnTheMap project [Machanavajjhala et al. (2008)]. In that project, Machanavajjhala et al. (2008) synthesize the street blocks where people live conditional on the street blocks where they work and other block-level attributes. They use multinomial regressions to simulate home-block values, constraining the possible outcome space for each individual based on where they work. Our approach differs from the OnTheMap modeling in that (i) we model more precise geography, that is, continuous versions of latitudes and longitudes, than discrete street blocks, and (ii) we do not rely on a fixed set of geographic locations, that is, where people work, to anchor the synthesis models. Furthermore, for settings with high-dimensional and no obvious way to set constraints on the outcome space, multinomial regressions can be computationally demanding if even estimable, whereas continuous response models are readily estimated.


**Table 1**

| Variable | Range |
|---|---|
| Longitude | Recoded to go from 1–100 |
| Latitude | Recoded to go from 1–100 |
| Sex | Male, female |
| Race | White, black |
| Age (years) | 16–99 |
| Autopsy performed | Yes, no, missing |
| Autopsy findings | Yes, no, missing |
| Marital status | 5 categories |
| Attendant | Physician, medical examiner, coroner |
| Hispanic | 7 categories |
| Education (years) | 0–17 |
| Hospital type | 8 categories |

## 3 Application: Protecting a cause of death file

We now apply the multiple imputation approach to create disclosure-protected data on a subset of North Carolina (NC) mortality records in 2002. The data include precise longitudes and latitudes of deceased individuals’ residences, as well as a variety of variables related to manner of death; we consider the subset of variables in Table 1. These mortality data are in fact publicly available and so do not require disclosure protection. Nonetheless, they are ideal test data for methods that protect confidentiality of geographies since, unlike many data sets on human individuals, actual locations are available and can be revealed for comparisons. Access to the data is managed by the Children’s Environmental Health Initiative at Duke University per agreement with the state of NC.

We use individuals whose place of residence was one of seven contiguous postal zones in Durham, NC. These areas are heterogeneous in terms of population density and characteristics. For simplicity, we include only individuals with race recorded as black or white, who account for most records in these postal zones; these individuals form our set of observed cases. We also collapse the cause of death variable into two levels: death from diseases of the circulatory and respiratory system, and death from all other causes. We consider this binary variable, which we label $D$, as the outcome for regression models.

In these data, $D$ does not exhibit strong residual spatial dependence after accounting for other variables. Therefore, for a more thorough test of the analytical validity of the synthetic data sets, we also generate a surrogate cause of death variable, $D^{*}$, that exhibits spatial clustering and is dependent on several nongeographic variables. To do so, we generate outcomes as follows:

$$D^{*}_i \sim \mathrm{Bernoulli}(p_i), \qquad \operatorname{logit}(p_i) = x_i' \beta + w(s_i), \tag{1}$$

where $x_i$ denotes the nongeographic covariates for record $i$, $\beta$ is a vector of coefficients, $s_i$ is the record's location, and $w(\cdot)$ is a mean zero Gaussian process with exponential covariance function $\operatorname{cov}(w(s), w(s')) = \sigma^2 \exp(-\phi \lVert s - s' \rVert)$. We set the parameters of the exponential covariance function so that the effective range (i.e., the distance at which the spatial correlation drops to 0.05) is about 50, which equals half of the overall range of the latitudes and longitudes in Table 1. The coefficients of the covariates are specified so that the covariates have strong effects. All results that follow use $D^{*}$ in place of the actual cause of death $D$; see the online supplement [Wang and Reiter (2011)] for selected results based on $D$. For both $D$ and $D^{*}$, the results are qualitatively similar, in that the actual spatial relationships (or lack thereof) in the original data are approximately preserved in the synthetic data sets.
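For readers who want to replicate the flavor of this surrogate outcome, the following sketch generates a spatially clustered binary outcome from a logistic model with an exponential-covariance Gaussian process. The function name, the coefficient inputs, and the default variance and decay parameters are our assumptions; the default `phi` is chosen only so that the effective range ($\exp(-\phi d) = 0.05$, so $d \approx 3/\phi$) is about 50.

```python
import numpy as np

def simulate_spatial_outcome(coords, X, beta, sigma2=1.0, phi=3.0 / 50.0, seed=0):
    """Generate a spatially clustered binary outcome in the spirit of (1):
    logit(p_i) = x_i' beta + w(s_i), with w a mean-zero Gaussian process
    with exponential covariance sigma2 * exp(-phi * d)."""
    rng = np.random.default_rng(seed)
    # Pairwise distance matrix between locations.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Exponential covariance, with a small jitter for numerical stability.
    K = sigma2 * np.exp(-phi * d) + 1e-8 * np.eye(len(coords))
    w = rng.multivariate_normal(np.zeros(len(coords)), K)
    p = 1.0 / (1.0 + np.exp(-(X @ beta + w)))
    return rng.binomial(1, p), p
```

Records that are close in space share similar draws of `w`, producing the spatial clustering the surrogate variable is designed to exhibit.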

### 3.1 Generation of synthetic data

We examined several methods for simulating latitude and longitude, including mixtures of bivariate regressions, bivariate partition models [De’ath (2002)] using the “mvpart” function in R, Bayesian additive regression trees [Chipman, George and McCulloch (2010)], and classification and regression trees (CART) [Breiman et al. (1984)]. Among these, the CART synthesizer resulted in data sets with a desirable profile in terms of low disclosure risks and high data usefulness. Furthermore, the CART synthesizer is fastest computationally and easy to implement, as it requires minimal tuning. It scales to large data sets with many predictors and many observations. In comparison to the CART synthesizer, the Bayesian trees and mixture model synthesizers were far more computationally demanding, and the bivariate partition model synthesizer resulted in unacceptably high disclosure risks. We therefore present results only for the CART synthesizer, which we now summarize; see Reiter (2005d) for further information on CART synthesizers.

Let $X$ include all nongeographic attributes in Table 1 plus $D^{*}$. First, we fit a regression tree of longitude on $X$; label the tree as $T_{\mathrm{lon}}$. Let $L_k$ be the $k$th leaf in $T_{\mathrm{lon}}$, and let $Y_k$ be the values of longitude in leaf $L_k$. In each $L_k$, we draw values from $Y_k$ using the Bayesian bootstrap [Rubin (1981)]. We then smooth the density of the bootstrapped values using a Gaussian kernel density estimator with bandwidth $h_{\mathrm{lon}}$ and support over the smallest to the largest value of $Y_k$. To get a synthetic longitude for the $i$th unit, we trace down $T_{\mathrm{lon}}$ based on the unit's values of $X$, and we sample randomly from the estimated mixture density in that unit's leaf. The result is a set of synthetic longitudes, $\tilde{Y}^{\mathrm{lon}}$.
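The leaf-level sampling step, that is, a Bayesian bootstrap followed by a draw from the bandwidth-$h$ Gaussian kernel mixture restricted to the leaf's observed support, can be sketched as follows. This is our illustration of the procedure; the authors fit the trees themselves with R's "tree" function rather than with this code.

```python
import random

def bayesian_bootstrap_kde_draw(leaf_values, bandwidth, rng):
    """One synthetic draw from a single tree leaf:
    (i) Bayesian bootstrap: Dirichlet(1,...,1) weights on the leaf's
        observed values [Rubin (1981)];
    (ii) sample from the Gaussian kernel mixture with the given bandwidth,
         rejecting draws outside the leaf's observed support."""
    n = len(leaf_values)
    gaps = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(gaps)
    weights = [g / total for g in gaps]          # one Dirichlet(1,...,1) draw
    lo, hi = min(leaf_values), max(leaf_values)
    while True:
        center = rng.choices(leaf_values, weights=weights, k=1)[0]
        draw = rng.gauss(center, bandwidth)
        if lo <= draw <= hi:                      # support restriction
            return draw
```

For each unit, one traces down the fitted tree to a leaf using the unit's covariates and calls this routine on that leaf's values.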

Next, we fit the regression tree of latitude on $X$ and the true longitude; label the tree as $T_{\mathrm{lat}}$. To locate the $i$th person's leaf in $T_{\mathrm{lat}}$, we use the synthetic longitude in place of the true longitude. For units with combinations of $X$ and synthetic longitude that do not belong to one of the leaves of $T_{\mathrm{lat}}$, we search up the tree until we find a node that contains the combination, and treat that node as if it were the unit's leaf. Once each unit's leaf is located, values of latitude are generated using the Bayesian bootstrap and kernel density procedure with bandwidth $h_{\mathrm{lat}}$. The result is a set of synthetic latitudes, $\tilde{Y}^{\mathrm{lat}}$, and, therefore, synthetic locations $(\tilde{Y}^{\mathrm{lat}}, \tilde{Y}^{\mathrm{lon}})$.

We repeat the process of generating synthetic locations independently $m$ times, resulting in the collection of partially synthetic data sets $\{D^{(1)}, \ldots, D^{(m)}\}$. With no further synthesis of the nongeographic attributes, these data sets would be released to the public.

We also performed the synthesis by generating latitude first and longitude second. As reported in the online supplement [Wang and Reiter (2011)], this ordering results in slightly decreased disclosure risks and slightly worse data utility. We recommend that data stewards try both orderings and choose the one that results in the more desirable risk-utility profile. For general discussions on the order of synthesis, see Reiter (2005d) and Caiola and Reiter (2010).

We also investigate simulating both geography and nongeographic identifiers to further improve confidentiality protection. Specifically, we simulate values of race and age in addition to the locations. We choose these two variables because (i) in many applications, age and race might be considered available to ill-intentioned users and hence prominent candidates for disclosure protection, (ii) their distributions clearly depend on location in the NC mortality data, and (iii) they encompass the generic modeling challenges of a continuous and a categorical variable.

The process proceeds as follows. First, we simulate locations using the CART synthesizers as before, but excluding race and age from $X$. Next, we simulate new values of race using a CART synthesizer fit with location as a predictor; each synthetic race value is drawn based on the record's synthetic location. Finally, we simulate new values of age using a CART synthesizer fit with location and race as predictors; each synthetic age value is drawn based on the record's synthetic location and synthetic race.

For all trees, we require the smallest node size to be at least five, and we cease splitting a leaf when the deviance of values in the leaf is less than 0.0001; see Section 4 for discussion of selecting these tuning parameters. All CART models are fit in R using the “tree” function. The bandwidth sizes are directly related to the analytical utility and disclosure risks of the synthetic data sets. Here, we investigate the risk-utility trade-offs for three bandwidth values for longitude. We set the bandwidth for generating latitudes equal to the bandwidth used for longitudes. We generate $m$ synthetic data sets.

### 3.2 Evaluation of confidentiality protection

For an initial evaluation of the protection engendered by simulation, we plot against for one simulated data set when only geography is imputed with ; see Figure 1. Clearly, can vary greatly from . However, Figure 1 is a crude evaluation, as intruders can utilize information from the multiple synthetic data sets and possibly other information to attempt disclosures.

We now outline frameworks for evaluating disclosure risks. We begin with an approach for quantifying how much intruders can learn about actual geographies from the synthetic data.

#### Risk of geography disclosure

In this section we assume that geography is the only synthesized variable, although the general ideas and approach apply to other attributes and with additional synthetic data.

Let ; let be the collection of for all persons in the sample; and let include all original values of . Let be all the original geography except for that of the th person. Let represent any meta-data released by the data steward about the synthesis models, for example, the code for the computer program that generated the synthetic data (without the original data or parameter estimates). Let represent the intruder’s prior information on persons’ geography in the sample; for example, might include . Either or could be empty.

We posit an intruder whose goal is to estimate for one or more target records in the database. Specifically, for any record , the intruder seeks the posterior distribution of given . With this posterior distribution, the intruder could identify high density regions for the unknown , which, if precise enough, could be used to pinpoint the true location of the target individual. Using Bayes’ rule, we have

(2)

where represents the intruder’s prior beliefs about .

The information in and play central roles in the likelihood function . For example, suppose that contains the code of the computer program used to generate the synthetic data (without original data or parameter estimates). If includes , the intruder could take guesses at according to his or her prior distribution and, with the resulting guess of , determine the likelihood of . If instead contains only a portion of the geographies or is empty, as is likely to be the case in practice, the computation of the likelihood becomes much more complex and uncertain, since the intruder needs to guess at multiple unknown geographies. In such cases, one simple approximation of the distribution for is the convex hull of the set . Given the variation in Figure 1, these regions in the mortality data could be quite large.
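For intuition, the convex hull of a record's synthetic draws can be computed with a standard algorithm. The sketch below uses Andrew's monotone chain on hypothetical synthetic (longitude, latitude) draws; it is illustrative only, not part of the original analysis.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Cross product of vectors o->a and o->b; positive for a left turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

# Hypothetical synthetic (longitude, latitude) draws for one record:
synthetic_locs = [(-78.95, 36.00), (-78.90, 36.03), (-78.92, 35.98),
                  (-78.88, 36.01), (-78.93, 36.02)]
hull = convex_hull(synthetic_locs)
```

The hull gives a crude region estimate for the unknown true location; the larger the region, the weaker the intruder's inference.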

The intruder’s prior distribution is also a key determinant of the posterior distribution of . An intruder may know the locations of all individuals in the population with certain characteristics contained in , and the prior distribution could be uniform over those locations. An intruder who knows and could estimate a model from these data to predict , and use that as a prior distribution. An intruder with no external information might use a uniform distribution on the map of possible locations. Unfortunately, it is nearly impossible for the data steward to know the information possessed by the intruder. Hence, it is prudent for the data steward to consider disclosure risks under a variety of assumptions about the intruders’ knowledge—including very extensive prior knowledge, which represents possible worst case scenarios—as we now demonstrate.

Using the CART synthesizer, we consider two scenarios for the NC mortality data: a high-risk scenario in which the intruder knows everything except for one target’s , that is, and , and a low-risk scenario in which the intruder does not know any records’ geographies. We assume that includes everything about the trees except the individual geographies in the nodes, that is, the data steward releases the splitting rules for each tree and the kernel bandwidths. For the high-risk scenario, we assume the intruder’s prior distribution is uniform on a grid over a small area containing the target’s true latitude and longitude, and estimate equation (2) using importance sampling; see the online supplement [Wang and Reiter (2011)] for details. Because the small area contains the true value, this prior distribution represents strong intruder prior knowledge. We note that other specifications for the prior distribution could change the value of the risk measure.

| Scenario | Risk |
| --- | --- |
| Low | |
| High | |

To summarize how much the CART synthesis protects geographies, we create two risk metrics. Let be a draw from . The metrics are

Here, measures the average Euclidean distance between the intruder’s guess of geography and the actual geography. Larger values of (up to a max of ) indicate larger uncertainty in predicting , so that intruders’ predictions are more likely to be further away from the true geography; thus, larger values of indicate smaller disclosure risks. Larger values of indicate that many actual locations (up to a max of ) are reasonable guesses at , thus smaller disclosure risks.
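In computational terms, the two summaries can be sketched as follows. This is an illustrative Python sketch; the function names and toy inputs are ours, not the paper's.

```python
import math

def avg_distance(posterior_draws, true_loc):
    """Average Euclidean distance between the intruder's posterior draws of a
    record's geography and the record's actual location: larger values mean
    more intruder uncertainty, hence smaller disclosure risk."""
    return sum(math.dist(d, true_loc) for d in posterior_draws) / len(posterior_draws)

def locations_within_radius(actual_locs, true_loc, radius):
    """Number of actual sample locations within `radius` of the true location:
    larger values mean more real cases are plausible guesses, hence smaller risk."""
    return sum(1 for loc in actual_locs if math.dist(loc, true_loc) <= radius)
```

Averaging the first quantity over posterior draws, and tabulating the second over all records, yields record-level summaries of how well the synthesis protects geography.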

Table 2 displays summary statistics for and for all records in the database. For the low-risk scenario, the medians of for all three bandwidth values are around 21 distance units, and the medians of are around 670 units, indicating that most are estimated with sizable uncertainty. In this scenario, each person’s -radius circle contains at least 27 other cases. Interestingly, for this scenario, increasing the bandwidth does not substantially increase the uncertainty in . For the high-risk scenario, the intruder can estimate with better accuracy than in the low-risk scenario. Here, both and decrease with . In fact, when , there are individuals in the data who are alone in their -radius circles. The boxplot of Figure 2 in the online supplement [Wang and Reiter (2011)] provides additional information about the distributions of and , including those under different scenarios when generating latitude first and longitude second.

#### Risk of identification

The approach in Section 3.2.1 can be used to estimate posterior distributions of any attribute, of which location is one example. Often, however, data stewards want to assess the risks that individuals in the released data can be re-identified. To quantify these risks, we now compute probabilities of identification [Duncan and Lambert (1989), Fienberg, Makov and Sanil (1997), Reiter (2005a)] by adapting the approach of Drechsler and Reiter (2008) and Reiter and Mitra (2009) for synthetic geographies.

In this approach, the data steward mimics the behavior of an intruder who possesses the true values of the quasi-identifiers, including geographies, for selected target records (or even the entire database). To illustrate, suppose the intruder has a vector of information, , on a particular target unit in the population, which may or may not correspond to a unit in the released synthetic data sets, . Let be the unique identifier (e.g., the individual’s name) of the target, and let be the (not released) unique identifier for record in , where . The intruder’s goal is to match unit in to the target when , and not to match when for any .

Let be a random variable that equals when for and equals when for some . The intruder thus seeks to calculate the for . He or she then would decide whether or not any of the identification probabilities for are large enough to declare an identification. Let be all original values of the variables that were synthesized. Because the intruder does not know the actual values in , he or she should integrate over its possible values when computing the match probabilities. Hence, for each record in we compute

This integral can be approximated using Monte Carlo approaches; details are in the online supplement. Once again, the data steward must make assumptions about , the information the intruder knows about the targets.

Data stewards can summarize the risks for the entire data set using functions of these match probabilities [Reiter (2005a)]. Let be the number of records in the data set with the highest match probability for the target . Let if the true match is among the units, and otherwise. The expected match risk equals . The true match risk equals , where when , and otherwise. The false match risk equals , where when and otherwise. Effective disclosure limitation techniques have low expected and true match risks, and high false match risks.
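A sketch of these summaries in code, assuming the common convention that a declared match requires a unique highest-probability record; the inputs are illustrative.

```python
def top_matches(probs, true_idx):
    """For one target: c = number of records tied at the highest match
    probability, K = 1 if the true record is among them."""
    top = max(probs)
    tied = [i for i, p in enumerate(probs) if p == top]
    return len(tied), int(true_idx in tied)

def match_risks(targets):
    """targets is a list of (probs, true_idx) pairs, one per target record.
    Returns (expected match risk, true match risk, false match risk)."""
    expected = true_risk = false_risk = 0.0
    for probs, true_idx in targets:
        c, K = top_matches(probs, true_idx)
        expected += K / c                              # contribution K_j / c_j
        true_risk += 1 if (c == 1 and K == 1) else 0   # unique and correct
        false_risk += 1 if (c == 1 and K == 0) else 0  # unique but wrong
    return expected, true_risk, false_risk
```

Effective disclosure limitation drives the first two quantities down and the third up.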

Using the mortality data, we consider three scenarios with different information in . In the first scenario, contains everything, that is, details of the CART models, the splitting rules and the real data values in each leaf and internal node. Essentially, is a data simulator that enables analysts to generate new synthetic data sets using the same process as the data steward. In the second scenario, contains descriptions of the CART models, but not the specific splitting rules nor the real data values in each leaf and internal node. Essentially, this is akin to releasing the code used to simulate the data without providing any parameter values. In the third scenario, is empty, that is, the data steward says nothing about how the data were synthesized.

For all scenarios, we suppose that intruders have a file containing the true values of sex, race, marital status, age and geography for all units in the data set, and that they seek to match records in to this file. We also suppose that the intruder knows which records were in the sample, so that . We compute each target’s probability independently of other targets’ probabilities and match with replacement.

Synthesizing geography only

| Information in | | | |
| --- | --- | --- | --- |
| Empty | 0.76 | 0.78 | 0.80 |
| Code, no parameters | 0.78 | 0.80 | 0.82 |
| Everything | 0.48 | – | – |

Synthesizing geography, age and race

| Information in | | | |
| --- | --- | --- | --- |
| Empty | 0.98 | 0.99 | 0.99 |
| Code, no parameters | 0.99 | 0.99 | 0.99 |
| Everything | 0.94 | – | – |

Table 3 summarizes risk measures in one set of synthetic data sets for each bandwidth and scenario. Three general trends are evident; these persist in two additional runs of the simulation as well. First, the synthesis of age and race dramatically decreases disclosure risks. Indeed, we suspect that many data stewards would consider the numbers of true matches unacceptably high for synthesizing geography only and perhaps acceptable for synthesizing geography, age and race. Second, releasing additional information in increases the disclosure risks. This trend is particularly pronounced when synthesizing only geography, and less so when synthesizing geography, age and race. For the latter synthesis strategy, the incremental risk of releasing the synthesis code without parameters over releasing nothing is modest, suggesting that it is worth releasing to improve analysts’ understanding of the disclosure limitation applied to the data. Third, the risks tend to increase as the bandwidth for geography synthesis decreases. This is because larger implies larger variances in the synthetic locations.

### 3.3 Evaluation of analytical validity

As with disclosure risks, the extent to which synthetic data sets can support analytically valid inferences depends on the properties of the synthesizer. In this section we examine the quality of synthetic data inferences for several estimands in the NC mortality data set. Given the large reductions in disclosure risks, we consider only scenarios with synthesized. The online supplement [Wang and Reiter (2011)] provides corresponding results with only synthesized.

| Estimand | ZIP | ME | MSE | ME | MSE | ME | MSE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| black | Z1 | | 2.4 | | 4.9 | | 5.6 |
| | Z2 | | 1.6 | | 1.7 | | 1.0 |
| | Z3 | | 1.7 | | 2.0 | | 3.0 |
| | Z4 | | 0.9 | | 0.9 | | 0.9 |
| | Z5 | | 1.9 | | 2.7 | | 3.6 |
| | Z6 | | 2.5 | | 3.7 | | 3.5 |
| | Z7 | | 1.9 | | 2.9 | | 4.8 |
| with | Z1 | | 1.8 | | 1.3 | | 2.2 |
| | Z2 | | 2.3 | | 2.3 | | 1.6 |
| | Z3 | | 1.4 | | 0.9 | | 2.2 |
| | Z4 | | 1.6 | | 1.1 | | 0.8 |
| | Z5 | | 1.6 | | 2.8 | | 1.9 |
| | Z6 | | 1.6 | | 1.8 | | 2.0 |
| | Z7 | | 2.6 | | 1.9 | | 1.7 |
| Avg. age | Z1 | | 1.9 | | 2.9 | | 3.4 |
| | Z2 | | 2.3 | | 2.5 | | 2.4 |
| | Z3 | | 0.6 | | 1.0 | | 1.3 |
| | Z4 | | 0.9 | | 1.2 | | 1.3 |
| | Z5 | | 1.2 | | 1.3 | | 1.3 |
| | Z6 | | 1.0 | | 0.8 | | 0.7 |
| | Z7 | | 0.5 | | 0.5 | | 0.6 |

Table 4 summarizes a repeated sampling experiment involving descriptive estimands at the zip code level. For each of 100 simulation runs, we create synthetic data sets using the observed mortality data (with ) and the CART synthesizers with . For the percentage-related estimands, the mean square error (MSE) is typically less than , and for age-related estimands, the MSE is typically less than 2.5 years. The MSEs for age-related estimands are generally smaller than the other MSEs because age does not vary spatially as much as the other variables do; hence, the synthesis process for age is comparatively robust to imperfect modeling of the relationship between geographies and the attributes. The MSEs tend to increase as increases, although the changes for the most part are only 3% or smaller. Overall, the results suggest that the synthetic data do a reasonable job of preserving the aggregated spatial relationships in the data for these variables.

We next evaluate inferences from two regression models. The first is a standard logistic regression of on main effects for sex, age and race. The second is a Bayesian spatial logistic regression of on main effects for sex, age and race that uses an exponential covariance function for spatial random effects, as in (1). To aid in the evaluation of the synthetic data sets, we randomly choose 2,470 people as a training set to fit the models and the remaining 200 people as a testing set to evaluate the predictive performance. Because the sample size of this training set is large for fitting hierarchical spatial random-effects models, we use Gaussian predictive process models [Banerjee et al. (2008)] to reduce computational burden. To do so, we select 100 knots by randomly choosing a subset of the locations in the training set. We assign flat prior distributions on regression coefficients , an inverse Gamma prior for and a uniform prior on for . The same training sample, testing sample and knots are used for all analyses, that is, we do not perform a repeated sampling experiment because of computational burden of estimating the spatial regression model. All models are estimated using the “spGLM” function in R.

Nonspatial GLM

| | Real data | | | |
| --- | --- | --- | --- | --- |
| Intercept | 0.18 | 0.21 | 0.20 | 0.19 |
| Sex | 0.08 | 0.09 | 0.09 | 0.08 |
| Race | 0.09 | 0.18 | 0.13 | 0.11 |
| Age | 0.24 | 0.27 | 0.28 | 0.26 |
| | MR | MR | MR | MR |
| In-sample | | | | |
| Out-of-sample | | | | |

Spatial GLM

| | Real data | | | |
| --- | --- | --- | --- | --- |
| Intercept | 0.43 | 0.37 | 0.44 | 0.33 |
| Sex | 0.10 | 0.09 | 0.09 | 0.10 |
| Race | 0.12 | 0.23 | 0.15 | 0.13 |
| Age | 0.25 | 0.31 | 0.28 | 0.28 |
| | 0.96 | 1.20 | 0.73 | 0.53 |
| | 0.02 | 0.02 | 0.01 | 0.01 |
| | MR | MR | MR | MR |
| In-sample | | | | |
| Out-of-sample | | | | |

Table 5 summarizes the original and synthetic data inferences and predictions. For standard logistic regression, we estimate the coefficients using the methods of Reiter (2003). Misclassification rates are based on predicting when and predicting otherwise, where is the vector of synthetic point estimates for the coefficients. For the Bayesian spatial logistic regression, we mix the posterior samples of the coefficients from each of the five synthetic data sets, and report the posterior mean and variance of the mixed samples. Misclassification rates are based on predicting when the posterior mean of across the five synthetic data sets exceeds and predicting otherwise. For both models, we compute the in-sample misclassification rates as the proportions of misclassified cases conditioned on the training set, and the out-of-sample misclassification rates as the proportions of misclassified cases conditioned on the test set. All out-of-sample predictions for the Bayesian spatial logistic regression are carried out using the “spPredict” function in R.
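The Reiter (2003) combining rules used for the standard logistic regression are straightforward: average the per-dataset point estimates, and estimate the variance as the mean within-dataset variance plus the between-dataset variance divided by the number of synthetic data sets. A minimal sketch, with hypothetical inputs:

```python
def combine_partially_synthetic(estimates, variances):
    """Reiter (2003) rules for partially synthetic data: the point estimate is
    the mean of the m per-dataset estimates; its variance estimate is
    u_bar + b_m / m, with b_m the between-dataset variance and u_bar the
    mean of the per-dataset variances."""
    m = len(estimates)
    q_bar = sum(estimates) / m
    b_m = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    u_bar = sum(variances) / m
    return q_bar, u_bar + b_m / m

# Hypothetical coefficient estimates and variances from five synthetic data sets:
point, var = combine_partially_synthetic(
    [0.21, 0.19, 0.22, 0.18, 0.20], [0.004, 0.005, 0.004, 0.005, 0.004])
```

Unlike the rules for multiple imputation of missing data, the between-dataset variance enters only through b_m / m, reflecting that the original records underlie every synthetic data set.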

For the logistic regression, Table 5 indicates that synthetic point estimates are generally close to those for the observed data, although there is attenuation in the coefficients for the synthesized variables. This attenuation increases with . Both in-sample and out-of-sample misclassification rates for the synthetic data are similar to those for the observed data.

For the spatial regression, Table 5 indicates that the synthetic point estimates are generally close to the observed data estimates, again with increasing attenuation as gets large. The spatial random effects parameters and in the synthetic data are similar to those from the observed data when , but declines toward zero as gets large. This indicates that large values of can weaken the spatial associations in the synthetic data.

It is also informative to compare the misclassification rates for the spatial logistic regression in the synthetic data with the rates for the nonspatial logistic regression in the observed data. In particular, both in-sample and out-of-sample misclassification rates are substantially lower for the spatial logistic regression fit to the synthetic data than for the nonspatial logistic regression fit to the observed data. This suggests that, when spatial dependencies are strong, releasing simulated geographies enables better predictions than suppressing geography, even when age and race are also simulated.

The online supplement [Wang and Reiter (2011)] reports the results of the descriptive analyses and the spatial regressions based on synthetic data sets generated from the actual cause of death , which does not exhibit strong spatial dependence. The results for the descriptive estimands are similar to, and even slightly better than, those from Table 4. For the spatial regressions, the synthetic data sets appropriately reflect the lack of spatial dependence in . As a final illustration of the usefulness of the synthetic data sets, Figure 2 displays maps of location by race for the actual data and for three synthetic data sets () based on a CART synthesizer with . Across all values of , the synthetic data sets preserve the spatial distribution of race reasonably well.

### 3.4 Comparison against random noise addition

When considering the merits of synthetic data approaches, another relevant comparison is against other disclosure limitation procedures rather than against the original data, which cannot be made publicly available. We now compare the synthetic data sets with only geography simulated against adding random noise to geography, that is, moving an observed location to another randomly drawn location. To make results comparable, we perturb each by drawing a random value from a bivariate normal distribution with mean equal to and a diagonal covariance matrix with both standard deviations set equal to the corresponding . Here, is computed assuming that, in the high-risk scenario, only geography is synthesized and that . In this way, the synthetic and noise-infused data sets have roughly the same risks, because , and, hence, . For comparisons, we repeat the analyses from Tables 4 and 5.
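The noise-infusion comparator amounts to the following (an illustrative Python sketch; the standard deviation shown is hypothetical rather than the risk-calibrated value described above):

```python
import random

def add_location_noise(locations, sd, rng):
    """Perturb each (longitude, latitude) with independent bivariate normal
    noise centered at the observed location (diagonal covariance, equal sd)."""
    return [(lon + rng.gauss(0.0, sd), lat + rng.gauss(0.0, sd))
            for lon, lat in locations]

rng = random.Random(506)
observed = [(-78.9, 36.0)] * 1000  # hypothetical repeated observed location
noisy = add_location_noise(observed, sd=0.05, rng=rng)
```

Unlike the CART synthesizer, this perturbation ignores the relationship between attributes and location, which is why it degrades spatially structured analyses more severely at comparable risk.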

For the repeated sampling experiment, we add random noise to each location independently 100 times, thus creating 100 noise-infused data sets. For the noise infusion, four of the fourteen percentage-related estimands in Table 4 have . In contrast, when synthesizing geography only with , none of the percentage-related estimands have ; these results are reported in the online supplement [Wang and Reiter (2011)]. For the age-related estimands, the MSEs are similar for synthetic and noise-infused data sets. Thus, for comparable levels of disclosure risks, adding random noise reduces the quality of inferences for the descriptive estimands relative to synthetic data.

For the regression analyses, we estimate the Bayesian spatial regression with one data set generated by adding random noise to geography only. The in-sample and out-of-sample misclassification rates are 0.32 and 0.38, respectively, for this noise-infused data set. We observed similar misclassification rates when repeating this analysis three more times. These misclassification rates are substantially larger than those for the corresponding synthetic data sets reported in the supplement (as well as those in Table 5), again suggesting that, for comparable risk levels, random noise does not preserve spatial relationships as well as synthetic data.

## 4 Concluding remarks

Although synthesizing geographies via modeling, such as the CART approach here, can preserve some spatial analyses, it does not preserve all of them. For example, two records close in space in the original data will not necessarily be close in space in the synthetic data, because their locations are independently generated from the response distribution. Additionally, simulated geographies may not preserve analyses when used to link the synthetic data with other data containing geography, since the simulated locations are conditionally independent of the variables in the linked data set that are not included in the synthesis model. Evaluating the impacts of synthetic geographies on linked analysis is a future extension of this research.

When synthesizing the nongeographic quasi-identifiers, we controlled for location by using it as a predictor in the models. An alternative approach is to simulate from hierarchical spatial models for point-referenced data, or perhaps from area-level models by aggregating locations [Banerjee, Gelfand and Carlin (2004)]. With large data sets, fitting spatial random effects models can be computationally challenging, although this can be overcome using approximations from the spatial statistics literature. Another strategy is to mask attribute data using spatial smoothing techniques [Zhou, Dominici and Louis (2010)]. We note that applying either of these approaches alone, that is, without simulating geography, leaves the original fine geography on the file, which may carry too high a disclosure risk. Evaluating the potential gains in disclosure risk and data usefulness of such strategies over the simple CART synthesizer for attributes used here is an open area for further theoretical and empirical investigation.

To implement the CART synthesizer, data stewards need to select the tuning parameters of the trees, that is, the minimum number of observations per leaf and the splitting criteria. These parameters control the size of the tree: increasing them results in smaller trees, and decreasing them results in larger trees. Based on our experience, we recommend that data stewards begin by setting the minimum deviance in the splitting criteria to a small number, like 0.0001 or even smaller, and requiring at least five records per leaf. These are typical default values for many applications and software routines for regression trees. The data steward then evaluates the disclosure risk and data utility associated with the synthetic data sets. If the risks are too high, the data steward can re-tune the parameters for the variables that are not sufficiently altered by the synthesis to grow smaller trees for those variables [Reiter (2005d)]. We did not prune the leaves further, as experiments with further pruning worsened the quality of the synthetic data sets without substantially improving the confidentiality protection. Growing larger trees can increase the quality of the synthetic data sets. However, it increases the time to run the synthesizer. Further, it can increase disclosure risks, for example, using trees with one observation per leaf reproduces the original data.
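The stopping rule just described amounts to a check of the following form before each candidate split. This is a sketch, not the internals of R's "tree" package; the default arguments mirror the values recommended above.

```python
def should_split(node_values, min_leaf=5, min_deviance=1e-4):
    """A node may be split only if both children could contain at least
    min_leaf records and the node's deviance (sum of squared deviations
    from the node mean) is at least min_deviance; otherwise it becomes a leaf."""
    n = len(node_values)
    mean = sum(node_values) / n
    deviance = sum((v - mean) ** 2 for v in node_values)
    return n >= 2 * min_leaf and deviance >= min_deviance
```

Raising min_leaf or min_deviance yields smaller trees and coarser synthesis models, which generally lowers data utility but also lowers disclosure risk.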

The CART synthesizer has appealing features: it handles continuous, categorical and mixed data; captures nonlinear relationships and complex interactions automatically; and runs quickly on large data sets. However, CART synthesizers can run into computational difficulties when categorical variables have many (e.g., 20) levels. Additionally, when some levels have low incidence rates in the data, the CART synthesizer can have difficulty preserving relationships involving those levels [Reiter (2005d)].

For simulation purposes, we illustrated the CART synthesizer using only records. This facilitated estimation of the spatial regressions with each of the resulting synthetic data sets. In extended investigations, we found that the CART synthesis process readily scaled to tens of thousands of mortality records. Other applications of CART synthesizers for nongeographic attributes [Reiter (2009), Drechsler (2011)] indicate that the approach can handle data of the dimensions typical of many government surveys. When data stewards need to synthesize locations for a very large number of records, for example, millions, a computationally convenient strategy is to partition the data into geographical strata of manageable size (tens of thousands of records) and simulate latitudes and longitudes (and attributes) by running the synthesizer independently within each stratum.

### Supplementary material

**Computational details and further results** (DOI: 10.1214/11-AOAS506SUPP). Available at http://lib.stat.cmu.edu/aoas/506/supplement.pdf. Computational details for the geography disclosure and identification risks in Sections 3.2.1 and 3.2.2; further analytical validity results; and results based on genuine cause of death.

### References

- Armstrong, M. P., Rushton, G. and Zimmerman, D. L. (1999). Geographically masking health data to preserve confidentiality. Stat. Med. 18 495–525.
- Banerjee, S., Gelfand, A. E. and Carlin, B. P. (2004). Hierarchical Modeling and Analysis for Spatial Data. Chapman and Hall/CRC, Boca Raton, FL.
- Banerjee, S., Gelfand, A. E., Finley, A. O. and Sang, H. (2008). Gaussian predictive process models for large spatial data sets. J. R. Stat. Soc. Ser. B Stat. Methodol. 70 825–848.
- Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont.
- Caiola, G. and Reiter, J. P. (2010). Random forests for generating partially synthetic, categorical data. Trans. Data Priv. 3 27–42.
- Chipman, H. A., George, E. I. and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. Ann. Appl. Stat. 4 266–298.
- Dalenius, T. and Reiss, S. P. (1982). Data-swapping: A technique for disclosure control. J. Statist. Plann. Inference 6 73–85.
- De’ath, G. (2002). Multivariate regression trees: A new technique for modeling species–environment relationships. Ecology 83 1105–1117.
- Drechsler, J. (2011). New data dissemination approaches in old Europe—Synthetic datasets for a German establishment survey. J. Appl. Stat. To appear.
- Drechsler, J. and Reiter, J. P. (2008). Accounting for intruder uncertainty due to sampling when estimating identification disclosure risks in partially synthetic data. In Privacy in Statistical Databases (LNCS 5262) (J. Domingo-Ferrer and Y. Saygin, eds.) 227–238. Springer, New York.
- Drechsler, J. and Reiter, J. P. (2010). Sampling with synthesis: A new approach for releasing public use census microdata. J. Amer. Statist. Assoc. 105 1347–1357.
- Duncan, G. T. and Lambert, D. (1989). The risk of disclosure for microdata. Journal of Business and Economic Statistics 7 207–217.
- Federal Register (2000). Standards for privacy of individually identifiable health information—Final privacy rule. 45 C.F.R. Parts 160 and 164, Dept. Health and Human Services, Office of the Secretary, Washington, DC.
- Fienberg, S. E., Makov, U. E. and Sanil, A. P. (1997). A Bayesian approach to data disclosure: Optimal intruder behavior for continuous data. Journal of Official Statistics 13 75–89.
- Fienberg, S. E. and McIntyre, J. (2004). Data swapping: Variations on a theme by Dalenius and Reiss. In Privacy in Statistical Databases (J. Domingo-Ferrer and V. Torra, eds.) 14–29. Springer, New York.
- Freedman, D. A. (2004). The ecological fallacy. In Encyclopedia of Social Science Research Methods (M. Lewis-Beck, A. Bryman and T. F. Liao, eds.) 1 293. Sage, Thousand Oaks, CA.
- Fuller, W. A. (1993). Masking procedures for microdata disclosure limitation. Journal of Official Statistics 9 383–406.
- Gomatam, S., Karr, A. F., Reiter, J. P. and Sanil, A. P. (2005). Data dissemination and disclosure limitation in a world without microdata: A risk-utility framework for remote access analysis servers. Statist. Sci. 20 163–177.
- Health and Retirement Study (2007). Data Description and Usage (2006 Core, Early, Version 2.0). Available at http://hrsonline.isr.umich.edu/meta/2006/core/desc/h06dd.pdf.
- Kinney, S. K., Reiter, J. P., Reznek, A. P., Miranda, J., Jarmin, R. S. and Abowd, J. M. (2011). Towards unrestricted public use business microdata: The synthetic Longitudinal Business Database. Working Paper CES-WP-11-04, Center for Economic Studies, Census Bureau, Washington, DC.
- Little, R. J. A. (1993). Statistical analysis of masked data. Journal of Official Statistics 9 407–426.
- Little, R. J. A., Liu, F. and Raghunathan, T. E. (2004). Statistical disclosure techniques based on multiple imputation. In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives (A. Gelman and X. L. Meng, eds.) 141–152. Wiley, New York.
- Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J. and Vilhuber, L. (2008). Privacy: Theory meets practice on the map. In IEEE 24th International Conference on Data Engineering 277–286.
- National Research Council (2005). Expanding Access to Research Data: Reconciling Risks and Opportunities. Panel on Data Access for Research Purposes, Committee on National Statistics, Division of Behavioral and Social Sciences and Education. The National Academies Press, Washington, DC.
- {bmisc}[author] \borganizationNational Research Council (\byear2007). \bhowpublishedPutting people on the map: Protecting confidentiality with linked social-spatial data. Panel on Confidentiality Issues Arising from the Integration of Remotely Sensed and Self-Identifying Data, Committee on the Human Dimensions of Global Change, Division of Behavioral and Social Sciences and Education. The National Academies Press, Washington, DC. \bptokimsref \endbibitem
- {barticle}[author] \bauthor\bsnmReiter, \bfnmJ. P.\binitsJ. P. (\byear2003). \btitleInference for partially synthetic, public use microdata sets. \bjournalSurvey Methodology \bvolume29 \bpages181–189. \bptokimsref \endbibitem
- {barticle}[mr] \bauthor\bsnmReiter, \bfnmJerome P.\binitsJ. P. (\byear2004a). \btitleNew approaches to data dissemination: A glimpse into the future (?). \bjournalChance \bvolume17 \bpages11–15. \bidissn=0933-2480, mr=2061931 \bptokimsref \endbibitem
- {barticle}[author] \bauthor\bsnmReiter, \bfnmJ. P.\binitsJ. P. (\byear2004b). \btitleSimultaneous use of multiple imputation for missing data and disclosure limitation. \bjournalSurvey Methodology \bvolume30 \bpages235–242. \bptokimsref \endbibitem
- {barticle}[author] \bauthor\bsnmReiter, \bfnmJ. P.\binitsJ. P. (\byear2005a). \btitleEstimating identification risks in microdata. \bjournalJ. Amer. Statist. Assoc. \bvolume100 \bpages1103–1113. \bptokimsref \endbibitem
- {barticle}[mr] \bauthor\bsnmReiter, \bfnmJerome P.\binitsJ. P. (\byear2005b). \btitleReleasing multiply imputed, synthetic public use microdata: An illustration and empirical study. \bjournalJ. Roy. Statist. Soc. Ser. A \bvolume168 \bpages185–205. \biddoi=10.1111/j.1467-985X.2004.00343.x, issn=0964-1998, mr=2113234 \bptokimsref \endbibitem
- {barticle}[mr] \bauthor\bsnmReiter, \bfnmJ. P.\binitsJ. P. (\byear2005c). \btitleSignificance tests for multi-component estimands from multiply imputed, synthetic microdata. \bjournalJ. Statist. Plann. Inference \bvolume131 \bpages365–377. \biddoi=10.1016/j.jspi.2004.02.003, issn=0378-3758, mr=2139378 \bptokimsref \endbibitem
- {barticle}[author] \bauthor\bsnmReiter, \bfnmJ. P.\binitsJ. P. (\byear2005d). \btitleUsing CART to generate partially synthetic, public use microdata. \bjournalJournal of Official Statistics \bvolume21 \bpages441–462. \bptokimsref \endbibitem
- {barticle}[author] \bauthor\bsnmReiter, \bfnmJ. P.\binitsJ. P. (\byear2009). \btitleUsing multiple imputation to integrate and disseminate confidential microdata. \bjournalInternational Statistical Review \bvolume77 \bpages179–195. \bptokimsref \endbibitem
- {barticle}[author] \bauthor\bsnmReiter, \bfnmJ. P.\binitsJ. P. \AND\bauthor\bsnmMitra, \bfnmR.\binitsR. (\byear2009). \btitleEstimating risks of identification disclosure in partially synthetic data. \bjournalJournal of Privacy and Confidentiality \bvolume1 \bpages99–110. \bptokimsref \endbibitem
- {barticle}[author] \bauthor\bsnmRobinson, \bfnmW. S.\binitsW. S. (\byear1950). \btitleEcological correlations and the behavior of individuals. \bjournalAmerican Sociological Review \bvolume15 \bpages351–357. \bptokimsref \endbibitem
- {barticle}[mr] \bauthor\bsnmRubin, \bfnmDonald B.\binitsD. B. (\byear1981). \btitleThe Bayesian bootstrap. \bjournalAnn. Statist. \bvolume9 \bpages130–134. \bidissn=0090-5364, mr=0600538 \bptokimsref \endbibitem
- {bmisc}[author] \bauthor\bsnmSweeney, \bfnmL. A.\binitsL. A. (\byear2001). \bhowpublishedComputational disclosure control: A primer on data privacy protection. Ph.D. thesis, MIT, Cambridge, MA. \bptokimsref \endbibitem
- {barticle}[author] \bauthor\bsnmVanWey, \bfnmL. K.\binitsL. K., \bauthor\bsnmRindfuss, \bfnmR. R.\binitsR. R., \bauthor\bsnmGutmann, \bfnmM. P.\binitsM. P., \bauthor\bsnmEntwisle, \bfnmB.\binitsB. \AND\bauthor\bsnmBalk, \bfnmD. L.\binitsD. L. (\byear2005). \btitleConfidentiality and spatially explicit data: Concerns and challenges. \bjournalProc. Natl. Acad. Sci. USA \bvolume102 \bpages15337–15342. \bptokimsref \endbibitem
- {bmisc}[author] \bauthor\bsnmWang, \bfnmHao\binitsH. \AND\bauthor\bsnmReiter, \bfnmJerome\binitsJ. (\byear2011). \bhowpublishedSupplement to “Multiple imputation for sharing precise geographies in public use data.” DOI:10.1214/11-AOAS506SUPP. \bptokimsref \endbibitem
- {barticle}[mr] \bauthor\bsnmZhou, \bfnmYijie\binitsY., \bauthor\bsnmDominici, \bfnmFrancesca\binitsF. \AND\bauthor\bsnmLouis, \bfnmThomas A.\binitsT. A. (\byear2010). \btitleA smoothing approach for masking spatial data. \bjournalAnn. Appl. Stat. \bvolume4 \bpages1451–1475. \biddoi=10.1214/09-AOAS325, issn=1932-6157, mr=2758336 \bptokimsref \endbibitem