Optimal Data Acquisition for Statistical Estimation

Yiling Chen Harvard University, Cambridge, MA. Email: yiling@seas.harvard.edu    Nicole Immorlica Microsoft Research New England, Cambridge, MA. Email: nicimm@microsoft.com    Brendan Lucier Microsoft Research New England, Cambridge, MA. Email: brlucier@microsoft.com    Vasilis Syrgkanis Microsoft Research New England, Cambridge, MA. Email: vasy@microsoft.com    Juba Ziani California Institute of Technology, Pasadena, CA. Email: jziani@caltech.edu. Much of this research was done while J. Ziani was at Microsoft Research.
Abstract

We consider a data analyst’s problem of purchasing data from strategic agents to compute an unbiased estimate of a statistic of interest. Agents incur private costs to reveal their data and the costs can be arbitrarily correlated with their data. Once revealed, data are verifiable. This paper focuses on linear unbiased estimators. We design an individually rational and incentive compatible mechanism that optimizes the worst-case mean-squared error of the estimation, where the worst-case is over the unknown correlation between costs and data, subject to a budget constraint in expectation. We characterize the form of the optimal mechanism in closed-form. We further extend our results to acquiring data for estimating a parameter in regression analysis, where private costs can correlate with the values of the dependent variable but not with the values of the independent variables.

1 Introduction

In the age of automation, data is king. The statistics and machine learning algorithms that help curate our online content, diagnose our diseases, and drive our cars, among other things, are all fueled by data. Typically, this data is mined by happenstance: as we click around on the internet, seek medical treatment, or drive “smart” vehicles, we leave a trail of data. This data is recorded and used to make estimates and train machine learning algorithms. So long as representative data is readily abundant, this approach may be sufficient. But some data is sensitive and therefore inaccurate, rare, or lacking detail in observable data traces. In such cases, it is more expedient to buy the necessary data directly from the population.

Consider, for example, the problem a public health administration faces in trying to learn the average weight of a population, perhaps as an input to estimating the risk of heart disease. Weight is a sensitive personal characteristic, and people may be loath to disclose it. It is also variable over time, and so must be collected close to the time of the average weight estimate in order to be accurate. Thus, while other characteristics, like height, age, and gender, are fairly accurately recorded in, for example, driver’s license databases, weight is not. The public health administration may try surveying the public to get estimates of the average weight, but these surveys are likely to have low response rates and be biased towards healthier low-weight samples.

In this paper, we propose a mechanism for buying verifiable data from a population in order to estimate a statistic of interest, such as the expected value of some function of the underlying data. We assume each individual has a private cost, or disutility, for revealing his or her sensitive data to the analyst. Importantly, this cost may be correlated with the private data. For example, overweight or underweight individuals may have a higher cost of revealing their data than people of a healthy weight. Individuals wish to maximize their expected utility, which is the expected payment they receive for their data minus their expected cost. The analyst has a fixed budget for buying data. The analyst does not know the distribution of the data: properties of the distribution are what she is trying to learn from the data samples, so it is important that she uses the data she collects to learn it rather than relying on an inaccurate prior distribution (for example, the analyst may have a prior on the weight distribution within a population from DMV records or previous surveys, but such a prior may be erroneous if people do not accurately report their weights). However, we do assume the analyst has a prior for the marginal distribution of costs, and that she estimates how much a survey may cost her as a function of said prior. (This prior could come from similar past exercises.)

The analyst would like to buy data subject to her budget, then use that data to obtain an unbiased estimator for the statistic of interest. To this end, the analyst posts a menu of probability-price pairs. Each individual with cost $c$ selects a pair $(A, P)$ from the menu, at which point the analyst buys the data with probability $A$ at price $P$. The expected utility of the individual is thus $(P - c)\cdot A$. (As we show, this menu-based formulation is fully general and captures arbitrary data-collection mechanisms.) To form an estimate based on this collected data, we assume the analyst uses inverse propensity scoring, pioneered by Horvitz and Thompson [15]. This is the unique unbiased linear estimator; it works by upweighting the data from individual $i$ by the inverse of his/her selection probability, $1/A$.

The Horvitz-Thompson estimator always generates an unbiased estimate of the statistic being measured, regardless of the price menu. However, the precision of the estimator, as measured by the variance or mean-squared error of the estimate, depends on the menu of probability-price pairs offered to each individual. For example, offering a high price would generate data samples with low bias (since many individuals would accept such an offer), but the budget would limit the number of samples. Offering low prices allows the mechanism to collect more samples, but these would be more heavily biased, requiring more aggressive correction which introduces additional noise. The goal of the analyst is to strike a balance between these forces and post a menu that minimizes the variance of her estimate in the worst-case over all possible joint distributions of the data and cost consistent with the cost prior. We note that this problem setting was first studied by [23], who characterized an approximately optimal mechanism for moment estimation.

1.1 Summary of results and techniques

Revisiting the example of estimating the weight of a population, our scheme suggests the following solution. Imagine the costs are with probability , and the per-agent budget of the analyst is . The analyst brings a scale to a public location and posts the following menu of pairs of allocation probability and price: . A simple calculation shows that individuals with cost or will pick the first menu option: stepping on the scale and having their weight recorded with probability , and receiving a payment of dollars. Individuals with cost will pick the second menu option; if they are selected to step on the scale, which happens with probability , the analyst records their weight scaled by . The estimate is the average of the scaled weights.

We show how to extend our approach in multiple directions. First, our characterization of the optimal mechanism holds even when the quantity to be estimated is the expected value of a multi-dimensional moment function of the data. Second, we extend our techniques beyond moment estimation to the common task of multi-dimensional linear regression. In this regression problem, an individual’s data includes both features (which are assumed to be insensitive or publicly available) and outcomes (which may be sensitive). The analyst’s goal is to estimate the linear regression coefficients that relate the outcomes to the features. We make the assumption that an individual’s cost is independent of her features, but may be arbitrarily correlated with her outcome. For example, the goal might be to regress a health outcome (such as severity of a disease) on demographic information. In this case, we might imagine that an agent incurs no cost for reporting his age, height or gender, but his cost might be highly correlated with his realized health outcome. In such a setting, we show that the asymptotically optimal allocation rule, given a fixed budget per agent as the number of agents grows large, can be calculated efficiently and exhibits a pooling region as before. However, unlike for moment estimation, agents with intermediate costs can also be pooled together. We further show that our results extend to non-linear regression in Appendix D, under mild additional conditions on the regression function.

Our techniques rely on i) reducing the mechanism design problem to an optimization problem through the classical notion of virtual costs, then ii) reducing the problem of optimizing the worst-case variance to that of finding an equilibrium of a zero-sum game between the analyst and an adversary. The adversary’s goal is to pick a distribution of data, conditional on agents’ costs, that maximizes the variance of the analyst’s estimator. We then characterize such an equilibrium through the optimality conditions for convex optimization described in [2].

1.2 Related work

A growing amount of attention has been placed on understanding interactions between the strategic nature of data holders and the statistical inference and learning tasks that use data collected from these holders. The work on this topic can be roughly divided into two categories according to whether money is used for incentive alignment.

In the first category, individuals as data holders do not directly derive utility from the accuracy of the inference or learning outcome, but in some cases may incur a privacy cost if the outcome leaks their private information. The analyst uses monetary payments to incentivize agents to reveal their data. Our work falls into this category. Prior papers by Roth and Schoenebeck [23] and Abernethy et al. [1] are closest to our setting. Similarly to our work, both Roth and Schoenebeck [23] and Abernethy et al. [1] consider an analyst’s problem of purchasing data from individuals with private costs subject to a budget constraint, allow the cost to be correlated with the value of data, and assume that individuals cannot fabricate their data. Roth and Schoenebeck [23] aim at obtaining an optimal unbiased estimator with minimum worst-case variance for population mean, while their mechanism achieves optimality only approximately: instead of the actual worst-case variance, a bound on the worst-case variance is minimized. Our work achieves optimality exactly (minimizing worst-case variance) and our results are extended to broader classes of statistical inference, moment estimation and linear regression. Abernethy et al. [1] consider general supervised learning. They do not seek to achieve a notion of optimality; instead, they take a learning-theoretic approach and design mechanisms to obtain learning guarantees (risk bounds).

Several papers consider data acquisition models with different objectives under the assumptions that (a) individuals do not fabricate their data, and (b) private costs and value of data are uncorrelated. For example, in Cummings et al. [6], the analyst can decide the level of accuracy for data purchased from each individual, and wishes to guarantee a certain desired level of accuracy of the aggregated information while minimizing the total privacy cost incurred by the agents. Cai, Daskalakis, and Papadimitriou [3] focus on incentivizing individuals to exert effort to obtain high-quality data for the purpose of linear regression. Another line of research in the first category examines the data acquisition problem under the lens of differential privacy [13, 10, 12, 20, 5]. The mechanism designer then uses payments to balance the trade-off between privacy and accuracy.

In the second category, individuals’ utilities directly depend on the inference or learning outcome (e.g. they want a regression line to be as close to their own data point as possible) and hence they have incentives to manipulate their reported data to influence the outcome. There often is no cost for reporting one’s data. The data analyst, without using monetary payments, attempts to design or identify inference or learning processes so that they are robust to potential data manipulations. Most papers in this category assume that independent variables (feature vectors) are unmanipulable public information and dependent variables are manipulable private information [7, 16, 17, 21], though some papers consider strategic manipulation of feature vectors [14, 8]. Such strategic data manipulations have been studied for estimation [4], classification [16, 17, 14], online classification [8], regression [22, 7], and clustering [21]. Work in this category is closer to mechanism design without money in the sense that they focus on incentive alignment in acquiring data (e.g., strategy-proof algorithms) but often do not evaluate the performance of the inference or learning, with a few notable exceptions [14, 8].

2 Model and Preliminaries

Survey Mechanisms

There is a population of $n$ agents. Each agent $i$ has a private pair $(z_i, c_i)$, where $z_i$ is a data point and $c_i$ is a cost. We think of $c_i$ as the disutility agent $i$ incurs by releasing her data $z_i$. The pair is drawn from a distribution $\mathcal{D}$, unknown to the mechanism designer. We denote with $F$ the CDF of the marginal distribution of costs, supported on a set $\mathcal{C}$. (Throughout the text we will use the CDF to refer to the distribution itself.) We assume that $F$ and the support of the data points are known. However, the joint distribution $\mathcal{D}$ of data and costs is unknown.

A survey mechanism is defined by an allocation rule $A$ and a payment rule $P$, both functions of the reported cost, and works as follows. Each agent $i$ arrives at the mechanism in sequence and reports a cost $\hat{c}_i$. The mechanism chooses to buy the agent’s data with probability $A(\hat{c}_i)$. If the mechanism buys the data, then it learns the value of $z_i$ (i.e., agents cannot misreport their data) and pays the agent $P(\hat{c}_i)$. Otherwise the data point is not learned and no payment is made.

We assume agents have quasi-linear utilities, so that the utility enjoyed by agent $i$ with cost $c_i$ when reporting $\hat{c}_i$ is

\[ u(\hat{c}_i; c_i) = \left(P(\hat{c}_i) - c_i\right)\cdot A(\hat{c}_i) \tag{1} \]

We will restrict attention to survey mechanisms that are truthful and individually rational.

Definition 1 (Truthful and Individually Rational - TIR).

A survey mechanism is truthful if for any cost $c$ it is in the agent’s best interest to report their true cost, i.e., for any report $\hat{c}$:

\[ u(c; c) \geq u(\hat{c}; c) \tag{2} \]

It is individually rational if, for any cost $c$, $u(c; c) \geq 0$.
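
As an illustration of Definition 1, the following minimal sketch checks truthfulness and individual rationality by brute force on a finite cost support. The function names and the two-type example are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def utility(report, cost, A, P):
    """Quasi-linear utility (P(report) - cost) * A(report), as in equation (1)."""
    return (P[report] - cost) * A[report]


def is_truthful_and_ir(costs, A, P, tol=1e-12):
    """Brute-force check of truthfulness and individual rationality."""
    for c in costs:
        truthful_utility = utility(c, c, A, P)
        if truthful_utility < -tol:
            return False  # individual rationality fails for cost c
        if any(utility(r, c, A, P) > truthful_utility + tol for r in costs):
            return False  # some misreport is strictly better than the truth
    return True


# Hypothetical two-type mechanism (illustrative numbers only).
costs = [1.0, 2.0]
A = {1.0: 1.0, 2.0: 0.5}  # allocation probability for each reported cost
P = {1.0: 1.5, 2.0: 2.0}  # payment, conditional on the data being bought
print(is_truthful_and_ir(costs, A, P))
```

In this example the low-cost type is indifferent between the two reports, so the mechanism is truthful only in the weak sense used in (2).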

We assume that the mechanism is constrained in the amount of payment it can make to the agents. We will formally define this as an expected budget constraint for the survey mechanism.

Definition 2 (Expected Budget Constraint).

A mechanism respects a budget constraint $B$ if:

\[ n\cdot \mathbb{E}_{c\sim F}\left[P(c)\cdot A(c)\right] \leq B \tag{3} \]

Estimators

The designer (or data analyst) wishes to use the survey mechanism to estimate some parameter $\theta$ of the marginal distribution of data points. (We also extend our results to multi-dimensional parameters; see Section 4.) For example, $\theta$ might be the mean of the distribution over data points in the population. To this end, the designer will apply an estimator to the collection $S$ of data points elicited by the survey mechanism. We will write $\hat{\theta}_S$ for the estimator used. Note that the value of the estimator depends on the sample $S$, but might also depend on the distribution of costs $F$ and the survey mechanism. Due to the randomness inherent in the survey mechanism (both in the choice of data points sampled and the values of those samples), we think of $\hat{\theta}_S$ as a random variable, drawn from a distribution $T(\mathcal{D}, A)$. We will focus exclusively on unbiased estimators.

Definition 3 (Unbiased Estimator).

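Given an allocation function $A$, an estimator $\hat{\theta}_S$ for $\theta$ is unbiased if for any instantiation of the true distribution $\mathcal{D}$ its expected value is equal to $\theta$:

\[ \mathbb{E}_{\hat{\theta}_S\sim T(\mathcal{D},A)}\left[\hat{\theta}_S\right] = \theta. \tag{4} \]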

Given a fixed choice of estimator, the mechanism designer wants to construct the survey mechanism to minimize the variance (finite sample or asymptotic as the population grows) of that estimator. Since the designer does not know the distribution $\mathcal{D}$, we will work with the worst-case variance over all instantiations of $\mathcal{D}$ that are consistent with the cost marginal $F$.

Definition 4 (Worst-Case Variance).

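Given an allocation function $A$ and an instance of the true distribution $\mathcal{D}$, the variance of an estimator $\hat{\theta}_S$ is defined as:

\[ V(\hat{\theta}_S; \mathcal{D}, A) = \mathbb{E}_{\hat{\theta}_S\sim T(\mathcal{D},A)}\left[\left(\hat{\theta}_S - \mathbb{E}[\hat{\theta}_S]\right)^2\right] \tag{5} \]

The worst-case variance of $\hat{\theta}_S$ is

\[ V^*(\hat{\theta}_S; F, A) = \sup_{\mathcal{D}\text{ consistent with }F} V(\hat{\theta}_S; \mathcal{D}, A). \tag{6} \]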

We are now ready to formally define the mechanism design problem faced by the data analyst.

Definition 5 (Analyst’s Mechanism Design Problem).

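Given an estimator $\hat{\theta}_S$ and cost distribution $F$, the goal of the designer is to design an allocation rule $A$ and payment rule $P$ so as to minimize worst-case variance subject to the truthfulness, individual rationality and budget constraints:

\[ \begin{aligned} \inf_{A,P}\quad & V^*(\hat{\theta}_S; F, A) \\ \text{s.t.}\quad & n\cdot \mathbb{E}_{c\sim F}\left[P(c)\cdot A(c)\right] \leq B \\ & A, P \text{ define a TIR mechanism} \end{aligned} \tag{7} \]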

The formulation above describes surveys as direct-revelation mechanisms, where agents report costs. We note that an equivalent indirect implementation might be more natural: a posted menu survey offers each agent a menu of (price, probability) pairs $(P, A)$. If the agent chooses $(P, A)$ then their data is elicited with probability $A$, in which case they are paid $P$. Each agent can choose the item that maximizes their expected utility, i.e., $(P - c)\cdot A$. By the well-known taxation principle, any survey mechanism can be implemented as a posted menu survey, and the number of menu items required is at most the size of the support of the cost distribution.
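
The posted menu implementation can be sketched as follows. This is a minimal illustration: the menu, the cost distribution, and the tie-breaking rule (ties go to the earliest menu item) are assumptions made for the example, not taken from the paper.

```python
def best_menu_item(cost, menu):
    """Return the (price, probability) pair maximizing (price - cost) * probability.

    Ties are broken in favor of the earliest item in the menu (an assumption).
    """
    return max(menu, key=lambda item: (item[0] - cost) * item[1])


def expected_payment_per_agent(cost_dist, menu):
    """Expected payment E[P * A] when every cost type picks its best item."""
    total = 0.0
    for cost, prob in cost_dist.items():
        price, alloc = best_menu_item(cost, menu)
        total += prob * price * alloc
    return total


# Hypothetical menu of (price, probability) pairs and cost distribution.
menu = [(1.5, 1.0), (2.0, 0.5)]
cost_dist = {1.0: 0.5, 2.0: 0.5}
print(best_menu_item(2.0, menu))
print(expected_payment_per_agent(cost_dist, menu))
```

Multiplying the per-agent expected payment by $n$ gives the left-hand side of the budget constraint (3).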

2.1 Reducing Mechanism Design to Optimization

We begin by reducing the mechanism design problem to a simpler full-information optimization problem where the designer knows the private cost of each player and can acquire their data by paying them exactly that cost. However, the designer is constrained to using monotone allocation rules, in which players with higher costs have weakly lower probability of being chosen.

Definition 6 (Analyst’s Optimization Problem).

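Given an estimator $\hat{\theta}_S$ and cost distribution $F$, the optimization version of the designer’s problem is to find a non-increasing allocation rule $A$ that minimizes worst-case variance subject to the budget constraint, assuming agents are paid their cost:

\[ \begin{aligned} \inf_{A}\quad & V^*(\hat{\theta}_S; F, A) \\ \text{s.t.}\quad & n\cdot \mathbb{E}_{c\sim F}\left[c\cdot A(c)\right] \leq B \\ & A \text{ is monotone non-increasing} \end{aligned} \tag{8} \]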

The mechanism design problem in Definition 5 reduces to the optimization problem given by Definition 6, albeit with a transformation of costs to virtual cost.

Definition 7 (Virtual Costs and Regular Distributions).

If is continuous and admits a density then define the virtual cost function as . If is discrete with support and PDF , then define the virtual cost function as: , with . We also denote with the distribution of virtual costs; i.e., the distribution created by first drawing from and then mapping it to . A distribution is regular if the virtual cost function is increasing.

The following is an analogue of Myerson’s [18] reduction of mechanism design to virtual welfare maximization, adapted to the survey design setting.

Lemma 1.

If the distribution of costs $F$ is regular, then solving the Analyst’s Mechanism Design Problem reduces to solving the Analyst’s Optimization Problem for the distribution of virtual costs.

Proof.

The proof is given in Appendix E.1. ∎
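
To make the notion of virtual costs concrete, here is a minimal sketch for a discrete cost distribution. It assumes the standard procurement form $\phi(c_t) = c_t + (c_t - c_{t-1})\cdot F(c_{t-1})/f(c_t)$ with $c_0 = 0$; this formula is an assumption of the sketch, since the exact expression of Definition 7 is not reproduced above. The function names are hypothetical.

```python
def virtual_costs(costs, probs):
    """Discrete virtual costs under the assumed form
    phi(c_t) = c_t + (c_t - c_{t-1}) * F(c_{t-1}) / f(c_t), with c_0 = 0.

    `costs` must be sorted in increasing order; `probs` are the point masses.
    """
    phis = []
    cdf_prev = 0.0  # F(c_{t-1}); zero before the first support point
    c_prev = 0.0    # c_0 = 0 by convention
    for c, f in zip(costs, probs):
        phis.append(c + (c - c_prev) * cdf_prev / f)
        cdf_prev += f
        c_prev = c
    return phis


def is_regular(costs, probs):
    """A distribution is regular if its virtual cost function is increasing."""
    phis = virtual_costs(costs, probs)
    return all(a < b for a, b in zip(phis, phis[1:]))


print(virtual_costs([1.0, 2.0, 3.0], [0.25, 0.5, 0.25]))
print(is_regular([1.0, 2.0, 3.0], [0.25, 0.5, 0.25]))
```

Under Lemma 1, the optimization problem would then be solved with these virtual costs in place of the original costs.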

2.2 Unbiased Estimation and Inverse Propensity Scoring

We now describe a class of estimators that we will focus on for the remainder of the paper. Note that simply calculating the quantity of interest, $\theta$, on the sampled data points can lead to bias, due to the potential correlation between costs and data. For instance, suppose that the goal is to estimate the mean of the distribution of $z$. A natural estimator is the average of the collected data: $\frac{1}{|S|}\sum_{i\in S} z_i$. However, if players with lower $z_i$ tend to have lower cost, and are therefore selected with higher probability by the analyst, then this estimator will consistently underestimate the true mean.

This problem can be addressed using inverse propensity scoring (IPS), pioneered by Horvitz and Thompson [15]. The idea is to recover unbiasedness by weighting each data point by the inverse of the probability of observing it. This IPS approach can be applied to any parameter estimation problem where the parameter of interest is the expected value of an arbitrary moment function $m$, i.e., $\theta = \mathbb{E}[m(z)]$.

Definition 8 (Horvitz-Thompson Estimator).

The Horvitz-Thompson estimator for the case when the parameter of interest is the expected value of a (moment) function $m$ is defined as:

\[ \hat{\theta}_S = \frac{1}{n}\sum_{i\in[n]} \frac{m(z_i)\cdot 1\{i\in S\}}{A(c_i)} \tag{9} \]

The Horvitz-Thompson estimator is the unique unbiased estimator that is a linear function of the observations [23]. It is therefore without loss of generality to focus on this estimator if one restricts to unbiased linear estimators.
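
The following small simulation sketches why the reweighting in (9) matters. It uses a hypothetical population in which the data point is perfectly correlated with the cost, takes $m(z) = z$, and compares the naive sample average with the Horvitz-Thompson estimate. All numbers and names are illustrative assumptions.

```python
import random


def horvitz_thompson(sample, n):
    """Equation (9): (1/n) * sum over selected agents of m(z_i) / A(c_i)."""
    return sum(m_z / a for m_z, a in sample) / n


def simulate(n, seed=0):
    """Return (population mean, naive sample mean, Horvitz-Thompson estimate)."""
    rng = random.Random(seed)
    alloc = {1.0: 1.0, 4.0: 0.5}  # hypothetical non-increasing allocation rule
    sample, population_total = [], 0.0
    for _ in range(n):
        cost = rng.choice([1.0, 4.0])
        z = 0.2 if cost == 1.0 else 0.8  # data correlated with cost
        population_total += z
        if rng.random() < alloc[cost]:   # data bought with probability A(c)
            sample.append((z, alloc[cost]))
    naive = sum(z for z, _ in sample) / len(sample)
    return population_total / n, naive, horvitz_thompson(sample, n)


true_mean, naive, ht = simulate(20000)
print(true_mean, naive, ht)
```

In this setup the naive average is pulled towards the low-cost agents, who are selected more often, while the Horvitz-Thompson estimate stays close to the population mean.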

IPS beyond moment estimation.

We defined the Horvitz-Thompson estimator with respect to moment estimation problems, $\theta = \mathbb{E}[m(z)]$. As it turns out, this approach to unbiased estimation extends even beyond the moment estimation problem to parameter estimation problems defined as the solution to a system of moment equations, or to parameters defined as the minimizer of a moment function. We defer this discussion to Section 4.

3 Estimating Moments of the Data Distribution

In this section we consider the case where the analyst’s goal is to estimate the mean of a given moment function of the distribution. That is, there is some function $m$ taking values in $[0,1]$ such that both $0$ and $1$ are in the support of the random variable $m(z)$, and the goal of the analyst is to estimate $\theta = \mathbb{E}[m(z)]$. (Observe that it is easy to deal with the more general case of a bounded moment function by a simple linear translation, i.e., estimate a rescaled moment instead, which is in $[0,1]$, and then translate the estimator back to recover $\theta$.) We assume that $\hat{\theta}_S$, the estimator being applied, is the Horvitz-Thompson estimator given in Definition 8.

For convenience we will assume that the cost distribution $F$ has finite support, say $\mathcal{C} = \{c_1, \ldots, c_{|\mathcal{C}|}\}$ with $c_1 < \ldots < c_{|\mathcal{C}|}$. (We relax the finite support assumption in Appendix A.) Write $\pi_t$ for the probability of cost $c_t$ in $F$. Also, for a given allocation rule $A$, we will write $A_t = A(c_t)$ for convenience. That is, we can interpret an allocation rule as a vector of $|\mathcal{C}|$ values $(A_1, \ldots, A_{|\mathcal{C}|})$. Finally, we will assume that the distribution of costs is regular.

Our goal is to address the analyst’s mechanism design problem for this restricted setting. By Lemma 1 it suffices to solve the analyst’s optimization problem. We start by characterizing the worst-case variance for this setting.

Lemma 2.

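The worst-case variance of the Horvitz-Thompson estimator of a moment $m$, given cost distribution $F$ and allocation rule $A$, is:

\[ n\cdot V^*(\hat{\theta}_S; F, A) = \sup_{q\in[0,1]^{|\mathcal{C}|}} \sum_{t=1}^{|\mathcal{C}|} \pi_t\cdot\frac{q_t}{A_t} - \left(\sum_{t=1}^{|\mathcal{C}|} \pi_t\cdot q_t\right)^2 \tag{10} \]

Proof.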

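For any distribution $\mathcal{D}$, observe that the Horvitz-Thompson estimator can be written as the average of $n$ i.i.d. random variables, each with variance:

\[ \sigma^2 = \mathbb{E}\left[\left(\frac{m(z_i)\cdot 1\{i\in S\}}{A(c_i)}\right)^2\right] - \mathbb{E}\left[\frac{m(z_i)\cdot 1\{i\in S\}}{A(c_i)}\right]^2 = \sum_{t=1}^{|\mathcal{C}|} \pi_t\cdot\frac{\mathbb{E}[m(z)^2 \mid c_t]}{A_t} - \mathbb{E}[m(z)]^2 \]

Hence, the variance of the estimator is $\sigma^2/n$. Observe that conditional on any value $c_t$, the worst-case distribution $\mathcal{D}$ will assign positive mass only to values $z$ such that $m(z)\in\{0,1\}$. This is because any other conditional distribution can be altered by a mean-preserving spread, pushing all the mass on these values, while preserving the conditional mean $\mathbb{E}[m(z) \mid c_t]$. This would strictly increase the latter variance. Thus we can assume without loss of generality that $m(z)\in\{0,1\}$, in which case $m(z)^2 = m(z)$ and $\mathbb{E}[m(z)^2 \mid c_t] = \mathbb{E}[m(z) \mid c_t]$. Let $q_t = \Pr[m(z) = 1 \mid c_t]$. Then we can simplify the variance as:

\[ n\cdot V(\hat{\theta}_S; \mathcal{D}, A) = \sum_{t=1}^{|\mathcal{C}|} \pi_t\cdot\frac{\mathbb{E}[m(z) \mid c_t]}{A_t} - \mathbb{E}[m(z)]^2 = \sum_{t=1}^{|\mathcal{C}|} \pi_t\cdot\frac{q_t}{A_t} - \left(\sum_{t=1}^{|\mathcal{C}|} \pi_t\cdot q_t\right)^2 \]

The lemma follows since the worst-case variance is a supremum over all possible consistent distributions, hence equivalently a supremum over conditional probabilities $q\in[0,1]^{|\mathcal{C}|}$. ∎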
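
The expression in Lemma 2 can be evaluated numerically for small supports. The sketch below approximates the supremum in (10) by brute force over a grid on $[0,1]^{|\mathcal{C}|}$; the grid search is a simplification chosen for illustration (it is not the paper's method), and the function names are hypothetical.

```python
from itertools import product


def variance_objective(pi, A, q):
    """n * V for conditional means q: sum_t pi_t*q_t/A_t - (sum_t pi_t*q_t)^2."""
    first = sum(p * qt / a for p, a, qt in zip(pi, A, q))
    mean = sum(p * qt for p, qt in zip(pi, q))
    return first - mean ** 2


def worst_case_variance(pi, A, grid=21):
    """Approximate the supremum in (10) over a regular grid on [0,1]^|C|."""
    points = [k / (grid - 1) for k in range(grid)]
    return max(variance_objective(pi, A, q)
               for q in product(points, repeat=len(pi)))


# Two equally likely cost types: buying everything vs. halving A_2.
print(worst_case_variance([0.5, 0.5], [1.0, 1.0]))
print(worst_case_variance([0.5, 0.5], [1.0, 0.5]))
```

Lowering the allocation probability of a cost type raises the worst-case variance, which is the trade-off the analyst balances against the budget.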

Given the above characterization of the variance of the estimator, we can greatly simplify the analyst’s optimization problem for this setting. Indeed, it suffices to find the allocation rule $A$ that minimizes (10), subject to $A$ being monotone non-increasing and satisfying the expected budget constraint.

3.1 Characterization of the Optimal Allocation Rule

We are now ready to solve the analyst’s optimization problem for moment estimation. We remark that if the budget per agent is at least the expected cost of an agent, then it is feasible (and hence optimal) for the analyst to set the allocation rule to pick every type with probability $1$. We therefore assume without loss of generality that the budget per agent is smaller than the expected cost, i.e., $B/n < \mathbb{E}_{c\sim F}[c]$.

Our analysis is based on an equilibrium characterization, where we view the analyst choosing $A$ and the adversary choosing $q$ as playing a zero-sum game and solve for its equilibria. We first present the characterization and some qualitative implications and then present an outline of our proof. We defer the full details of the proof to Appendix E.2.

Theorem 3 (Optimal Allocation for Moment Estimation).

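The optimal allocation rule $A$ is determined by two constants $\bar{A}$ and $t^*$ such that:

\[ A_t = \begin{cases} \bar{A} & \text{if } t\leq t^* \\ \dfrac{\alpha}{\sqrt{c_t}} & \text{o.w.} \end{cases} \tag{11} \]

with $\alpha$ uniquely determined such that the budget constraint is binding. (The explicit form of $\alpha$ is obtained by solving the binding budget constraint for $\alpha$.) Moreover, the parameters $\bar{A}$ and $t^*$ can be computed efficiently.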

The parameters and in Theorem 3 are explicitly derived in closed form in Appendix E.2. For instance, when , then and for all . When then and . More generally, the computational part of Theorem 3 follows by performing binary search over the support of , which can be done in time.

We note that the optimal rule essentially allocates to each agent inversely proportionally to the square root of their cost, but may also “pool” the allocation probability for agents at the lower end of the cost distribution. See Figure 1 for examples of optimal solutions. In comparison, the approximately optimal rule presented in [23] omits the pooling region.
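
The shape of the rule in Theorem 3 can be sketched in code. Given a pooled probability and a threshold, the snippet below solves the binding budget constraint $\sum_t \pi_t c_t A_t = B/n$ for $\alpha$ and builds the allocation vector. The inputs `a_bar` and `t_star` are hypothetical placeholders: the closed-form optimal values from Appendix E.2 are not reproduced here, so the numbers below are illustrative rather than optimal.

```python
import math


def allocation_from_threshold(costs, pi, budget_per_agent, a_bar, t_star):
    """Build A_t = a_bar for t <= t_star and alpha / sqrt(c_t) otherwise.

    alpha is chosen so that sum_t pi_t * c_t * A_t equals budget_per_agent.
    Requires t_star < len(costs) so that the non-pooled region is non-empty.
    """
    pooled_spend = sum(p * c * a_bar
                       for c, p in zip(costs[:t_star], pi[:t_star]))
    # Spend outside the pool is alpha * sum_t pi_t * sqrt(c_t).
    tail_weight = sum(p * math.sqrt(c)
                      for c, p in zip(costs[t_star:], pi[t_star:]))
    alpha = (budget_per_agent - pooled_spend) / tail_weight
    alloc = [a_bar] * t_star + [alpha / math.sqrt(c) for c in costs[t_star:]]
    return alpha, alloc


# Hypothetical instance: three cost types, the cheapest one pooled at A = 1.
costs, pi = [1.0, 4.0, 9.0], [0.5, 0.25, 0.25]
alpha, alloc = allocation_from_threshold(costs, pi, 1.5, a_bar=1.0, t_star=1)
print(alpha, alloc)
```

The resulting vector is non-increasing in the cost and exhausts the per-agent budget, matching the qualitative description above.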

The proof of Theorem 3 appears in Appendix E.2. The main idea is to view the optimization problem as a zero-sum game between the analyst who designs the allocation rule $A$, and an adversary who designs $q$ so as to maximize the variance of the estimate. We show how to compute an equilibrium of this zero-sum game via Lagrangian and KKT conditions, and then note that the obtained $A$ must in fact be an optimal allocation rule for worst-case variance.

The analysis above applied to a discrete cost distribution over a finite support of possible costs. We show how to extend this analysis to a continuous distribution over costs in Appendix A, noting that the continuous variant of the Optimization Problem for Moment Estimation can be derived by taking the limit over finer and finer discrete approximations of the cost distribution.

4 Further results

Multi-dimensional moment estimation:

We show in Appendix B that we can in fact extend our analysis to the case where the moment function $m(z)$ is a multi-dimensional vector and we are trying to estimate each coordinate of $\mathbb{E}[m(z)]$ so as to minimize the mean-squared error of the estimator. More precisely, we prove, under the condition that all of the corners of the hypercube are contained in the support of $m(z)$, that the multi-dimensional problem reduces exactly to solving the one-dimensional problem discussed in Section 3. One can therefore use the results of Section 3 to design an optimal allocation rule under the budget constraint in the multi-dimensional setting.

Linear regression:

We extend our results beyond moment estimation, to a multi-dimensional linear regression task. In this setting, an agent’s information consists of a feature vector $x_i$, an outcome value $y_i$, and a residual value $\epsilon_i$, drawn in the following manner: first, $x_i$ is drawn from an unknown distribution. Then, independently from $x_i$, the pair $(c_i, \epsilon_i)$ is drawn from a joint distribution $\mathcal{D}$. The marginal distribution over costs, $F$, is known to the designer, but not the full joint distribution $\mathcal{D}$. Then $y_i$ is defined to be

\[ y_i = x_i^\top\theta^* + \epsilon_i, \]

where $\theta^*\in\Theta$ with $\Theta$ a compact subset of $\mathbb{R}^d$. We assume the marginal distribution over $\epsilon_i$ is supported on some bounded range and has expected value $0$. (So, in particular, $\mathbb{E}[y_i \mid x_i] = x_i^\top\theta^*$.)

When a survey mechanism buys data from agent $i$, the pair $(x_i, y_i)$ is revealed. However, the value of $\epsilon_i$ is not revealed to the survey mechanism. The goal of the designer is to estimate the parameter vector $\theta^*$. The analyst wants to design a survey mechanism to buy from the agents, then compute an estimate of $\theta^*$, while not paying each agent more than the total budget per agent in expectation.

One can interpret $x_i$ as a vector of publicly-verifiable information about agent $i$, which might influence a (possibly sensitive) outcome $y_i$. For example, $x_i$ might consist of demographic information, and $y_i$ might indicate the severity of a medical condition. The coefficient vector $\theta^*$ describes the average effect of each feature on the outcome, over the entire population. Under this interpretation, $\epsilon_i$ is the residual agent-specific component of the outcome, beyond what can be accounted for by the agent’s features. We can interpret the independence of $(c_i, \epsilon_i)$ from $x_i$ as meaning that each agent’s cost to reveal information is potentially correlated with their (private) residual data, but is independent of the agent’s features.

As in Section 3, the analyst wants to design a survey mechanism to buy from the agents, obtain data from the set $S$ of elicited agents, then compute an estimate $\hat{\theta}_S$ of $\theta^*$, while not paying each agent more than the total budget per agent in expectation. To this end, the analyst designs an allocation rule $A$ and a pricing rule $P$ so as to minimize the normalized worst-case asymptotic mean-squared error of $\hat{\theta}_S$ as the population size goes to infinity. Our mechanism will essentially be optimizing the coefficient in front of the leading term in the mean-squared error, ignoring potential finite sample deviations that decay at a faster rate than that leading term. Note that we will design allocation and pricing rules to be independent of the population size $n$; hence, the analyst can use the designed mechanism even if the exact population size is unknown.

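The analyst’s estimate $\hat{\theta}_S$ is given by the value that minimizes the Horvitz-Thompson mean-squared error, i.e.,

\[ \hat{\theta}_S = \arg\min_{\theta\in\Theta} \sum_i \frac{1\{i\in S\}}{A(c_i)}\left(y_i - x_i^\top\theta\right)^2. \tag{12} \]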
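
The estimator in (12) is an inverse-propensity-weighted least-squares problem. The sketch below solves its normal equations in pure Python, under the simplifying assumption that the minimizer lies in the interior of $\Theta$, so the compactness constraint is ignored. The function name and the toy data are hypothetical.

```python
def ips_least_squares(samples):
    """Minimize sum_i w_i * (y_i - x_i . theta)^2 with w_i = 1 / A(c_i).

    `samples` holds (x, y, alloc_prob) tuples for the selected agents only.
    The constraint theta in Theta is ignored (assumed non-binding).
    """
    d = len(samples[0][0])
    # Augmented normal equations [X^T W X | X^T W y].
    M = [[0.0] * (d + 1) for _ in range(d)]
    for x, y, a in samples:
        w = 1.0 / a
        for r in range(d):
            for s in range(d):
                M[r][s] += w * x[r] * x[s]
            M[r][d] += w * x[r] * y
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(d):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * pv for v, pv in zip(M[r], M[col])]
    return [M[r][d] / M[r][r] for r in range(d)]


# Noiseless toy data generated from y = 1 + 2 * x, with varying A(c_i).
samples = [((1.0, 0.0), 1.0, 1.0), ((1.0, 1.0), 3.0, 0.5), ((1.0, 2.0), 5.0, 0.25)]
print(ips_least_squares(samples))
```

With noiseless data the weights do not change the solution; with noisy outcomes correlated with costs, the weights $1/A(c_i)$ are what correct for the selection induced by the allocation rule.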

Further, we make the following assumption on the distribution of data points:

Assumption 1 (Assumption on the distribution of features).

$\mathbb{E}[x_i x_i^\top]$ is finite and positive-definite, and hence invertible.

Finite expectation is a property one may expect real data such as age, height, weight, etc. to exhibit. The second part of the assumption is satisfied by common classes of distributions, such as multivariate normals.

We now state our main results, and defer proofs and technical details to Appendix C. We first show that our estimator is consistent, i.e. it converges in probability towards the true parameter as the population size grows to infinity:

Lemma 4.

Under Assumption 1, for any allocation rule $A$ that does not depend on $n$, $\hat{\theta}_S$ is a consistent estimator of $\theta^*$.

As in Section 3, we now assume costs are drawn from a discrete set, say $\mathcal{C} = \{c_1, \ldots, c_{|\mathcal{C}|}\}$. We will then write $A_t$ for an allocation rule conditional on the cost being $c_t$, and $\pi_t$ for the probability of the cost of an agent being $c_t$. We will further assume that the budget per agent is smaller than the expected cost $\sum_t \pi_t c_t$, meaning that it is not feasible to accept all data points, since otherwise it is trivially optimal to set $A_t = 1$ for all $t$. Our main result, which characterizes the form of the optimal allocation rule, is then given by:

Theorem 5.

Under the assumptions described above, if , an optimal allocation rule has the form

1. for

2. for all

3. for

for and positive constants that do not depend on , and and integers with .

When , the form of the allocation rule can be obtained by reversing the roles of and (for more details, see Appendix C). We remark that the solution for the linear regression case exhibits a structure that is similar to the structure of the optimal allocation rule for moment estimation (see Theorem 3): it exhibits a pooling region in which all cost types are treated the same way, and changes in the inverse of the square root of the cost outside said pooling region. However, we note that we may now choose to pool agents together in an intermediate range of costs, instead of pooling together agents whose costs are below a given threshold.

Non-linear regression:

We further show that our results extend to non-linear regression, i.e. when is generated by a process of the more general form

$$y_i = f(\theta^*, x_i) + \epsilon_i,$$

under a few additional assumptions on the distribution of and on the regression function . Said assumptions are discussed in more detail in Appendix D.

References

• [1] Jacob D. Abernethy, Yiling Chen, Chien-Ju Ho, and Bo Waggoner. Actively purchasing data for learning. CoRR, abs/1502.05774, 2015.
• [2] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
• [3] Y. Cai, C. Daskalakis, and C. H. Papadimitriou. Optimum statistical estimation with strategic data sources. In 28th, pages 280–296, 2015.
• [4] I. Caragiannis, A. D. Procaccia, and N. Shah. Truthful univariate estimators. In 33rd, pages 127–135, 2016.
• [5] R. Cummings, S. Ioannidis, and K. Ligett. Truthful linear regression. In 28th, pages 448–483, 2015.
• [6] Rachel Cummings, Katrina Ligett, Aaron Roth, Zhiwei Steven Wu, and Juba Ziani. Accuracy for sale: Aggregating data with a variance constraint. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, ITCS ’15, pages 317–324, New York, NY, USA, 2015. ACM.
• [7] O. Dekel, F. Fischer, and A. D. Procaccia. Incentive compatible regression learning. Journal of Computer and System Sciences, 76(8):759–777, 2010.
• [8] Jinshuo Dong, Aaron Roth, Zachary Schutzman, Bo Waggoner, and Zhiwei Steven Wu. Strategic classification from revealed preferences. 2017.
• [9] E. A. Feinberg, P. O. Kasyanov, and M. Z. Zgurovsky. Continuity of Equilibria for Two-Person Zero-Sum Games with Noncompact Action Sets and Unbounded Payoffs. ArXiv e-prints, September 2016.
• [10] Lisa Fleischer and Yu-Han Lyu. Approximately optimal auctions for selling privacy when costs are correlated with data. CoRR, abs/1204.4031, 2012.
• [11] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1):79 – 103, 1999.
• [12] Arpita Ghosh, Katrina Ligett, Aaron Roth, and Grant Schoenebeck. Buying private data without verification. CoRR, abs/1404.6003, 2014.
• [13] Arpita Ghosh and Aaron Roth. Selling privacy at auction. In Proceedings of the 12th ACM Conference on Electronic Commerce, EC ’11, pages 199–208, New York, NY, USA, 2011. ACM.
• [14] M. Hardt, N. Megiddo, C. H. Papadimitriou, and M. Wootters. Strategic classification. In 7th, pages 111–122, 2016.
• [15] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
• [16] R. Meir, S. Almagor, A. Michaely, and J. S. Rosenschein. Tight bounds for strategyproof classification. In 10th, pages 319–326, 2011.
• [17] R. Meir, A. D. Procaccia, and J. S. Rosenschein. Algorithms for strategyproof classification. Artificial Intelligence, 186:123–156, 2012.
• [18] Roger B. Myerson. Optimal auction design. Math. Oper. Res., 6(1):58–73, February 1981.
• [19] Whitney K. Newey and Daniel McFadden. Large sample estimation and hypothesis testing. In R. F. Engle and D. McFadden, editors, Handbook of Econometrics, volume 4 of Handbook of Econometrics, chapter 36, pages 2111–2245. Elsevier, January 1986.
• [20] Kobbi Nissim, Salil Vadhan, and David Xiao. Redrawing the boundaries on purchasing data from privacy-sensitive individuals. In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, ITCS ’14, pages 411–422, New York, NY, USA, 2014. ACM.
• [21] J. Perote and J. Perote-Peña. The impossibility of strategy-proof clustering. Economics Bulletin, 4(23):1–9, 2003.
• [22] J. Perote and J. Perote-Peña. Strategy-proof estimators for simple regression. Mathematical Social Sciences, 47:153–176, 2004.
• [23] Aaron Roth and Grant Schoenebeck. Conducting truthful surveys, cheaply. CoRR, abs/1203.0353, 2012.

Appendix A Extension: Continuous Costs for Moment Estimation

The analysis above applied to a discrete cost distribution over a finite support of possible costs. We now show how to extend this analysis to a continuous distribution over costs. We first note that by taking the limit over finer and finer discrete approximations of the cost distribution, one can derive the following continuous variant of the Optimization Problem for Moment Estimation.

Definition 9 (Continuous Optimization Problem for Moment Estimation).

When costs are supported on , the analyst’s optimization problem for the moment estimation problem based on the Horvitz-Thompson estimator can be written as:

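$$\inf_{A:[0,1]\to[0,1]}\ \sup_{x:[0,1]\to[0,1]}\ \int_0^1 \frac{x(c)}{A(c)}\,dF(c) - \left(\int_0^1 x(c)\,dF(c)\right)^2 \tag{13}$$

$$\text{s.t.}\quad \int_0^1 c\cdot A(c)\,dF(c) \le \bar{B}, \qquad A \text{ is monotone non-increasing}$$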
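As a rough numerical illustration of problem (13), the sketch below evaluates the objective on a discretized cost grid and checks the two constraints. The function names, the three-point grid, and the weights `pi` (a discrete stand-in for dF) are assumptions made for this example; they are not part of the paper, and the sketch does not solve the inf-sup problem.

```python
import numpy as np

def objective(x, A, pi):
    """Discretized objective of (13): sum_t pi_t * x_t / A_t - (sum_t pi_t * x_t)^2."""
    x, A, pi = (np.asarray(v, dtype=float) for v in (x, A, pi))
    return float(np.sum(pi * x / A) - np.sum(pi * x) ** 2)

def is_feasible(A, costs, pi, budget, tol=1e-12):
    """Check the constraints of (13); costs are assumed sorted in increasing order."""
    A, costs, pi = (np.asarray(v, dtype=float) for v in (A, costs, pi))
    within_budget = np.sum(pi * costs * A) <= budget + tol
    monotone = np.all(np.diff(A) <= tol)       # A must be non-increasing in cost
    in_range = np.all((A > 0.0) & (A <= 1.0))  # A > 0 keeps x / A well defined
    return bool(within_budget and monotone and in_range)

# Hypothetical three-point cost distribution with equal weights.
costs = [0.2, 0.5, 1.0]
pi = [1 / 3, 1 / 3, 1 / 3]
candidate = [1.0, 0.8, 0.4]
value = objective([0.5, 0.5, 0.5], candidate, pi)
```

For a fixed allocation, the inner supremum over x can then be approximated by evaluating `objective` over a grid of candidate x vectors.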

We can now establish the following continuous variant of Theorem 3, which describes the optimal survey mechanism for continuous cost distributions.

Theorem 6 (Continuous Limit of Optimal Allocation).

If the distribution of costs is atomless and supported in , then the optimal allocation rule is determined by two constants and such that:

$$A(c) = \begin{cases} \bar{A} & \text{if } c \le x^* \\ \dfrac{\alpha}{\sqrt{c}} & \text{o.w.} \end{cases} \tag{14}$$

with uniquely determined such that the budget constraint is binding. [Footnote 8: The explicit form of this is .] The quantities and are defined as follows: for any let

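$$Q_\infty(x) = E_{c\sim F}\left[\min\{c, \sqrt{cx}\}\right], \qquad R_\infty(x) = 2\,E_{c\sim F}\left[\min\left\{\frac{c}{x}, 1\right\}\right], \qquad G(x) = \frac{Q_\infty(x)}{\max(1, R_\infty(x))}$$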

Then and (see Figure 2). [Footnote 9: We take the convention that if lies above the range of , then .]
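To make these quantities concrete, the following sketch estimates them from a sample of costs, with the sample average standing in for the expectation over F. It assumes the reading R∞(x) = 2E[min{c/x, 1}], which matches the integral form in (17); the function names and the uniform cost grid are illustrative assumptions only.

```python
import numpy as np

def q_inf(x, costs):
    """Sample estimate of Q_inf(x) = E[min{c, sqrt(c * x)}]."""
    costs = np.asarray(costs, dtype=float)
    return float(np.mean(np.minimum(costs, np.sqrt(costs * x))))

def r_inf(x, costs):
    """Sample estimate of R_inf(x) = 2 * E[min{c / x, 1}], for x > 0."""
    costs = np.asarray(costs, dtype=float)
    return float(2.0 * np.mean(np.minimum(costs / x, 1.0)))

def g(x, costs):
    """G(x) = Q_inf(x) / max(1, R_inf(x))."""
    return q_inf(x, costs) / max(1.0, r_inf(x, costs))

# Hypothetical cost sample: an even grid on (0, 1], so the mean cost is 0.505.
costs = np.linspace(0.01, 1.0, 100)
xs = np.linspace(0.05, 1.0, 20)
q_values = [q_inf(x, costs) for x in xs]
r_values = [r_inf(x, costs) for x in xs]
g_values = [g(x, costs) for x in xs]
```

On this sample the mean cost exceeds 1/2, so `r_inf` stays above 1 on the whole grid and `g` reduces to the ratio of the other two functions.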

Let us give some intuition behind the form of the allocation rule described in Theorem 6. As in Theorem 3, the allocation rule will pool agents with low costs (i.e., less than some threshold ), then allocate to higher-cost agents inversely proportional to the square root of their costs. In the definition of and , note that is non-decreasing and is non-increasing, so is non-decreasing. We therefore have that , the boundary of the pooling region, increases with up to a maximum value of (at which point all agents are pooled).

Let’s restrict attention to the case where the mean of the distribution is at least as large as half of the maximum value of the support, i.e. . In this setting, we see that for all , so

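$$G(x) = \frac{Q_\infty(x)}{R_\infty(x)} = \frac{x}{2}\cdot\frac{E_{c\sim F}\left[\min\{c, \sqrt{cx}\}\right]}{E_{c\sim F}\left[\min\{c, x\}\right]} \approx \frac{x}{2} \quad \text{(see Figure 2)}$$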

So the optimal allocation sets . Moreover, the allocation for the pooling region is . So the optimal mechanism takes the following intuitive form: first, assign each agent an allocation probability that would, in an alternate world where costs are capped at , precisely exhaust the budget. Since costs can actually be greater than , this flat allocation goes over-budget. So, for agents whose costs are greater than , we remove allocation probability so that (a) the budget becomes balanced, and (b) the remaining probability of allocation is inversely proportional to the square root of the costs.
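The shape of this allocation rule can be sketched numerically. The code below takes the pooling level and the pooling threshold as given inputs (the values used are hypothetical, not derived from the theorem), and solves only for the scale of the high-cost branch so that the expected payment exactly exhausts the budget. The sample average again stands in for the expectation over F, and costs are assumed strictly positive.

```python
import numpy as np

def alpha_from_budget(costs, a_bar, x_star, budget):
    """Scale alpha that makes E[c * A(c)] equal the budget for the two-piece rule."""
    costs = np.asarray(costs, dtype=float)
    pooled = costs <= x_star
    spent_on_pool = a_bar * np.mean(costs * pooled)
    # For c > x_star the spend is c * alpha / sqrt(c) = alpha * sqrt(c).
    high_cost_mass = np.mean(np.sqrt(costs) * ~pooled)
    return float((budget - spent_on_pool) / high_cost_mass)

def allocation(costs, a_bar, x_star, alpha):
    """A(c) = a_bar in the pooling region, alpha / sqrt(c) above it."""
    costs = np.asarray(costs, dtype=float)
    return np.where(costs <= x_star, a_bar, alpha / np.sqrt(costs))

# Hypothetical inputs chosen only to illustrate the shape of the rule.
costs = np.linspace(0.01, 1.0, 100)
a_bar, x_star, budget = 0.6, 0.305, 0.2
alpha = alpha_from_budget(costs, a_bar, x_star, budget)
A = allocation(costs, a_bar, x_star, alpha)
expected_spend = float(np.mean(costs * A))
```

With these inputs the resulting rule is flat up to the threshold and then decays like one over the square root of the cost, while `expected_spend` matches the budget.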

A.1 Proof of Theorem 6

Consider any continuous atomless distribution supported in . Then we can approximate the density of any such distribution by considering a discretized -grid of the interval, i.e. and the discrete support distribution defined by pdf for . Since the loss of the zero-sum game for moment estimation is continuous in the CDF of the cost distribution, the minimax value of the game is also continuous in the CDF of the cost distribution (see e.g. [9] on continuity of minimax with respect to parameters of the game). Hence, the limit of the optimal discretized solutions will be the optimal solution to the continuous problem.

We now consider the limit structure of the optimal solutions to the discretized problems. We will use the more structural characterization of our main Theorem 3, presented as Theorem 18 in Appendix E.2. In particular, the optimal solution will look as follows, taking the limit of the form in Theorem 18: for

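$$A(x) = \begin{cases} \bar{A} & \text{if } x \le x^* \\ \dfrac{1}{\sqrt{x}}\cdot\dfrac{\bar{B} - \bar{A}\,E\left[c\cdot 1\{c \le x^*\}\right]}{E\left[\sqrt{c}\cdot 1\{c > x^*\}\right]} & \text{o.w.} \end{cases} \tag{15}$$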

Now let us examine how the point is defined in the limit. Consider the functions , and defined in Theorem 18. In the limit as , observe that for every . Therefore, it is easy to see that and . Hence, also . Hence, we will have that defined in Theorem 18 will satisfy and in the limit for some . Hence, we only need to consider these functions at and take their limit as . In this limit we observe that these functions take the simpler forms (since summations will converge to integrals) for

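$$Q(x/\epsilon, 1) \to \int_0^x c\,f(c)\,dc + \int_x^1 \sqrt{c\cdot x}\,f(c)\,dc = E_{c\sim F}\left[\min\{c, \sqrt{cx}\}\right] \triangleq Q_\infty(x) \tag{16}$$

$$R(x/\epsilon, 1) \to 2\left(\int_0^x f(c)\,\frac{c}{x}\,dc + \int_x^1 f(c)\,dc\right) = 2\,E_{c\sim F}\left[\min\left\{\frac{c}{x}, 1\right\}\right] \triangleq R_\infty(x) \tag{17}$$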

Hence, adapting the discrete characterization of and to these limits we have: the parameter is defined as the solution to the following process: let be the solution to the equation . If , then and , otherwise and .

Now we observe that since is atomless and has support , is a continuous increasing function of , with range . Hence, if (which we assumed holds as otherwise the problem is trivial), then , or equivalently if then it must be that is the unique solution to the equation .

Moreover, observe that is also a decreasing function of ranging in as varies from to . If , then for all and the second case of the characterization of never holds and we have that is the solution to the equation , or if is above . Moreover, , in both cases. Equivalently, . Hence, in this case the Theorem holds.

Otherwise, let be the solution to the equation . Thus above and below . Now, consider the function . This function is continuous increasing and is equal to for and is equal to for .

If the solution of the equation occurs at , then we have that and . Otherwise, if the solution to that equation is above , then (the latter always has a solution when ) and . Thus in this case we have that and , which concludes the proof.

Appendix B Extension: Multi-dimensional Parameters for moment estimation

Section 3 focused on the case of estimating a single-dimensional parameter of the data distribution. In this section we note that our characterization of the optimal mechanism extends to multi-dimensional moment estimation as well. In multi-dimensional moment estimation, there is a function , and our goal is to estimate . Here is the dimension of the estimation problem, which we assume to be a fixed constant.

As before, we will estimate by applying an estimator to the data collected from a survey mechanism. To evaluate an estimator, we must extend our definition of variance to the -dimensional setting, as follows.

Definition 10 (Worst-Case Mean Squared Error - Risk).

Given allocation function and distribution , the expected mean squared error (or risk) of an estimator is

$$R(\hat{\theta}_S; D, A) = E_{\hat{\theta}_S \sim T(D,A)}\left[\left\|\hat{\theta}_S - \theta_0\right\|_2^2\right] \tag{18}$$

and the worst-case variance of is

$$R^*(\hat{\theta}_S; F, A) = \sup_{D \text{ consistent with } F} R(\hat{\theta}_S; D, A). \tag{19}$$

When is unbiased, the risk has a natural interpretation: it is simply the sum of variances of each coordinate of , considered separately.

Claim 7 (Risk of Unbiased Estimators).

The risk of any unbiased estimator is equal to the sum of variances of every coordinate:

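$$R(\hat{\theta}_S; D, A) = E_{\hat{\theta}_S \sim T(D,A)}\left[\sum_{r=1}^{d}\left(\hat{\theta}_{S,r} - E[\hat{\theta}_{S,r}]\right)^2\right] = \sum_{r=1}^{d} V(\hat{\theta}_{S,r}). \tag{20}$$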

As in the single-dimensional case, the analyst obtains an estimate through the Horvitz-Thompson estimator, which is defined as follows for parameters in . Also as in the single-dimensional case, the Horvitz-Thompson estimator is an unbiased estimator of .

Definition 11 (Horvitz-Thompson Estimator).

The Horvitz-Thompson estimator for the case when the parameter of interest is the expected value of a vector of moments is defined as:

$$\hat{\theta}_S = \frac{1}{n}\sum_{i\in[n]} \frac{1\{i \in S\}}{A(c_i)}\cdot m(z_i) \tag{21}$$
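A minimal sketch of estimator (21) follows. Each agent is selected independently with probability A(c_i), and each selected moment vector is reweighted by the inverse of that probability. The function name, the simulated moment values, and the allocation probabilities are assumptions made for illustration only.

```python
import numpy as np

def horvitz_thompson(m_values, alloc_probs, selected):
    """Estimator (21): (1/n) * sum_i 1{i in S} / A(c_i) * m(z_i)."""
    m_values = np.asarray(m_values, dtype=float)        # shape (n, d)
    alloc_probs = np.asarray(alloc_probs, dtype=float)  # shape (n,), entries in (0, 1]
    selected = np.asarray(selected, dtype=bool)         # shape (n,)
    weights = selected / alloc_probs                    # 1{i in S} / A(c_i)
    return (weights[:, None] * m_values).mean(axis=0)

# Hypothetical population: d = 3 moments in [0, 1], allocation probabilities in [0.2, 1].
rng = np.random.default_rng(0)
n, d = 2000, 3
m_values = rng.uniform(size=(n, d))
alloc_probs = rng.uniform(0.2, 1.0, size=n)
selected = rng.uniform(size=n) < alloc_probs            # agent i is in S w.p. A(c_i)
estimate = horvitz_thompson(m_values, alloc_probs, selected)
```

Dividing by the selection probability is what makes the estimator unbiased: each term has expectation m(z_i)/n regardless of how the probabilities vary across agents.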

For our characterization of worst-case risk, we will assume that the moment function can take on the extreme points of the hypercube .

Assumption 2.

is such that the induced distribution of is supported on every extreme point of the hypercube.

Lemma 8.

Under Assumption 2, the worst-case risk of the Horvitz-Thompson estimator of moment is

 (22)
Proof.

See Appendix E.3. ∎

Lemma 8 implies that the optimal survey design problem in the -dimensional case is, in fact, identical to the problem considered in the single-dimensional case. We can conclude that Theorems 3 and 6, which characterized the optimal survey mechanisms for discrete and continuous single-parameter settings, respectively, also apply to the multi-dimensional setting without change.

Appendix C Extension: Multi-dimensional Parameter Estimation via Linear Regression

In this section, we extend beyond moment estimation to a multi-dimensional linear regression task. For this setting we will impose additional structure on the data held by each agent. Each agent’s private information consists of a feature vector , an outcome value , and a residual value , that are i.i.d. among agents. Each agent also has a cost . The data is generated in the following way: first, is drawn from an unknown distribution . Then, independently from , the pair is drawn from a joint distribution over . The marginal distribution over costs, , is known to the designer, but not the full joint distribution . Then is defined to be

$$y_i = x_i^\top \theta^* + \epsilon_i \tag{23}$$

where with a compact subset of . Without loss of generality, we pick large enough so that is in the interior of . We write for the marginal distribution over , which is supported on some bounded range , and has expected value . (So, in particular, .)

When a survey mechanism buys data from agent , the pair is revealed. Crucially, the value of is not revealed to the survey mechanism. The goal of the designer is to estimate the parameter vector .

Note that the single-dimensional moment estimation problem from Section 3 is a special case of linear regression. Indeed, consider setting , for each , , and to be the constant . Then, when the survey mechanism purchases data from agent , it learns , and estimating is equivalent to estimating the expected value of .

More generally, one can interpret as a vector of publicly-verifiable information about agent , which might influence a (possibly sensitive) outcome . For example, might consist of demographic information, and might indicate the severity of a medical condition. The coefficient vector describes the average effect of each feature on the outcome, over the entire population. Under this interpretation, is the residual agent-specific component of the outcome, beyond what can be accounted for by the agent’s features. We can interpret the independence of from as meaning that each agent’s cost to reveal information is potentially correlated with their (private) residual data, but is independent of the agent’s features.

As in Section 3, the analyst wants to design a survey mechanism to buy from the agents, obtain data from the set of elicited agents, then compute an estimate of , while not paying each agent more than the total budget per agent in expectation. As in Section 2.1, we note that the problem of designing a survey mechanism in fact reduces to that of designing an allocation rule that minimizes said variance and satisfies a budget constraint in which the prices are replaced by known virtual costs. To this end, the analyst designs an allocation rule and a pricing rule so as to minimize the -normalized worst-case asymptotic mean-squared error of as the population size goes to infinity. Our mechanism will essentially be optimizing the coefficient in front of the leading term in the mean squared error, ignoring potential finite sample deviations that decay at a faster rate than . Note that we will design allocation and pricing rules to be independent of the population size ; hence, the analyst can use the designed mechanism even if the exact population size is unknown.

C.1 Estimators for Regression

Let be the set of data points elicited by a survey mechanism. The analyst’s estimate will then be the value that minimizes the Horvitz-Thompson mean-squared error , i.e.,

$$\hat{\theta}_S = \arg\min_{\theta\in\Theta} \sum_i \frac{1\{i \in S\}}{A(c_i)}\left(y_i - x_i^\top\theta\right)^2. \tag{24}$$
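Estimator (24) is a weighted least-squares problem, which can be sketched as below. This sketch ignores the constraint that the minimizer lie in the compact parameter set, on the assumption that the set is large enough for the unconstrained minimizer to fall inside it. The function name and the simulated data are illustrative assumptions.

```python
import numpy as np

def weighted_regression_estimate(X, y, alloc_probs, selected):
    """Minimize sum_i 1{i in S} / A(c_i) * (y_i - x_i^T theta)^2 without constraints."""
    X = np.asarray(X, dtype=float)                      # shape (n, d)
    y = np.asarray(y, dtype=float)                      # shape (n,)
    alloc_probs = np.asarray(alloc_probs, dtype=float)  # shape (n,), entries in (0, 1]
    selected = np.asarray(selected, dtype=bool)         # shape (n,)
    weights = selected / alloc_probs                    # zero for agents not in S
    sqrt_w = np.sqrt(weights)
    # Scaling rows by sqrt(weight) turns the weighted problem into ordinary least squares.
    theta, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return theta

# Hypothetical noise-free data, so the estimate should recover theta_star exactly.
rng = np.random.default_rng(1)
n = 200
theta_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, 3))
y = X @ theta_star
alloc_probs = np.full(n, 0.5)
selected = rng.uniform(size=n) < alloc_probs
theta_hat = weighted_regression_estimate(X, y, alloc_probs, selected)
```

Agents outside the elicited set receive weight zero, so only revealed pairs influence the fit, consistent with the residual never being observed directly.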

Further, we make the following assumptions on the distribution of data points:

Assumption 3 (Assumption on the distribution of features).

is finite and positive-definite, and hence invertible.

Finite expectation is a property one may expect of real data such as age, height, and weight. The second part of the assumption is satisfied by common classes of distributions, such as multivariate normals. We first show that is a consistent estimator of .

Lemma 9.

Under Assumption 3, for any allocation rule that does not depend on , is a consistent estimator of .

Proof of Lemma 9.

Let , and let for simplicity. The following holds:

1. First, we note that is the unique parameter that minimizes ; indeed, take any , we have that

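$$E\left[(y_i - \theta^\top x_i)^2\right] = E\left[\left(y_i - x_i^\top\theta^* + x_i^\top(\theta^* - \theta)\right)^2\right] = E\left[(y_i - x_i^\top\theta^*)^2\right] + E\left[\left(x_i^\top(\theta^* - \theta)\right)^2\right] + 2\,E\left[\epsilon_i\,(\theta^* - \theta)^\top x_i\right]$$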

As and are independent, has mean , this simplifies to

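$$\begin{aligned} E\left[(y_i - \theta^\top x_i)^2\right] &= E\left[(y_i - \theta^{*\top} x_i)^2\right] + (\theta^* - \theta)^\top E\left[x_i x_i^\top\right](\theta^* - \theta) + 2(\theta^* - \theta)^\top E\left[\epsilon_i x_i\right] \\ &= E\left[(y_i - \theta^{*\top} x_i)^2\right] + (\theta^* - \theta)^\top E\left[x_i x_i^\top\right](\theta^* - \theta) \\ &> E\left[(y_i - \theta^{*\top} x_i)^2\right] \end{aligned}$$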

where the last step follows from being positive-definite by Assumption 3.

2. By definition, is compact.

3. is continuous in , and so is its expectation.

4. is also bounded (lower-bounded by , and upper-bounded by either or ), implying that is continuous and bounded. Hence, by the uniform law of large numbers, recalling that are i.i.d.,

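$$\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{i=1}^{n}\frac{w_i}{A(c_i)}\,m(\theta; x_i, y_i) - E\left[\frac{w_i}{A(c_i)}\,m(\theta; x_i, y_i)\right]\right| \to 0.$$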

Finally, noting that conditional on , and are independent, we have:

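$$E\left[\frac{w_i}{A(c_i)}\,m(\theta; x_i, y_i)\right] = E\left[E\left[\frac{w_i}{A(c_i)}\,\Big|\,c_i\right] E\left[m(\theta; x_i, y_i)\,\big|\,c_i\right]\right] = E\left[m(\theta; x_i, y_i)\right]$$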

using