Competitive Statistical Estimation with Strategic Data Sources
Abstract
In recent years, data has played an increasingly important role in the economy as a good in its own right. In many settings, data aggregators cannot directly verify the quality of the data they purchase, nor the effort exerted by data sources when creating the data. Recent work has explored mechanisms to ensure that the data sources share high-quality data with a single data aggregator, addressing the issue of moral hazard. Oftentimes, these mechanisms admit a unique, socially efficient solution.
In this paper, we consider data markets in which there is more than one data aggregator. Since data can be cheaply reproduced and transmitted once created, data sources may share the same data with more than one aggregator, leading to free-riding between data aggregators. This coupling can lead to non-uniqueness of equilibria and social inefficiency. We examine a particular class of mechanisms that has received recent attention in the literature, and we characterize all the generalized Nash equilibria of the resulting data market. We show that, in contrast to the single-aggregator case, there are either infinitely many generalized Nash equilibria or none at all. We also provide necessary and sufficient conditions for all equilibria to be socially inefficient. In our analysis, we identify the components of these mechanisms which give rise to these undesirable outcomes, showing the need for research into mechanisms for competitive settings with multiple data purchasers and sellers.
Competitive Statistical Estimation with Strategic Data Sources
Tyler Westenbroek, Student Member, IEEE, Roy Dong, Lillian J. Ratliff, Member, IEEE, and S. Shankar Sastry, Fellow, IEEE
T. Westenbroek and S. S. Sastry are with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94707, USA, email: westenbroekt,sastry@eecs.berkeley.edu. R. Dong is with the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Champaign, IL 61820, USA, email: roydong@illinois.edu. L. J. Ratliff is with the Department of Electrical Engineering, University of Washington, Seattle, WA 98195, USA, email: ratliffl@uw.edu. This work is partially funded by NSF CNS:1656873.
I Introduction
Data has played an increasingly important role in the economy as a good. As an input to machine learning algorithms, data can not only create new products and innovations, but can also be used to redesign business strategies and processes. As the demand for data increases, we have seen the formation of data aggregators, who collate data for either use or resale. A fundamental information asymmetry arises between data aggregators and data sources: how can aggregators verify the quality of the data they purchase from data sources?
In particular, data sources often incur an effort cost to obtain high-quality data. For example, devices require maintenance and upkeep to ensure accurate measurements, portable sensors need to use their limited energy resources to collect and transmit data, and human agents may need to be compensated to properly perform a desired task. As such, if a data aggregator wants a high-quality data point, they must appropriately compensate the data source. Furthermore, this problem is complicated by the fact that the data aggregators cannot observe the effort exerted, only the data received. As such, the payments must be calculated from the data sets alone, with no knowledge of the effort exerted or the noise levels of the data points. This problem has led to the design of a variety of mechanisms to ensure data sources provide quality data, which we outline in more detail in Section II.
The contribution of this paper is the study of the data market that forms when multiple data aggregators share the same pool of data sources. In particular, we note that data is nonrivalrous, in the sense that it can be cheaply copied and shared with multiple data aggregators. Since a data aggregator does not ‘consume’ the good after purchasing it, data sources have an incentive to share the same data with as many aggregators as are willing to pay. We show that the nonrivalrous nature of data introduces a coupling between data buyers: when a data aggregator incentivizes a data source to produce high-quality data, other data aggregators benefit. In particular, this coupling leads to undesirable properties of the equilibrium. In many single-aggregator formulations, equilibria are unique and there is no social inefficiency. In contrast, the multiple-aggregator case leads to a multiplicity of equilibria, and to social inefficiencies across all equilibria.
The rest of this paper is organized as follows. In Section II, we discuss the related literature and contextualize our contributions. In Section III, we introduce our model for data sources, data aggregators, and their interactions in the data market. In Section IV, we characterize the generalized Nash equilibria of the data market, and identify necessary and sufficient conditions for social inefficiency. In Section V, we extend the results to cases where data sources do not share their data with all data aggregators. Finally, we close with concluding remarks in Section VI.
II Related Literature
In recent years, there has been a rapidly growing body of literature on models for data exchange and data markets. Broadly speaking, the existing literature can be broken down into two categories: models with a single data purchaser and a single data source, and models with a single data purchaser and multiple data sources.
In the first category, we find a class of models which study a single data purchaser and a single data source. These works focus on the game-theoretic interactions and information states between the two agents. In particular, these works consider the strategies arising from direct signals, actions, and payments, rather than the indirect coupling that can arise from multiple sources or purchasers. Some of these papers feature multiple data sources, but these are ultimately separable into a collection of single-source models and, at their core, focus on the direct interactions between buyers and sellers of data. In [1], optimal mechanisms for a single data source to sell to a single buyer are developed using a signaling framework. The authors of [2] design a menu of prices for different data qualities, employing a screening framework. In [3], the authors consider a single aggregator and single source, and show how repeated interactions with noisy verification allow for mechanisms which elicit costly effort from a data source. A single data source charging data purchasers for queries about customer preferences is studied in [4].
In the second category, there is a class of models which study a single data purchaser with multiple data sources. These works focus on capturing how the data supplied by one data source affects another. In [5], the authors consider a single data aggregator and multiple data sources, and show how the robustness of the sample median provides protection against strategic data sources. In [6], the authors consider a single data aggregator and multiple data sources in a setting with verifiable data, and allow the data and the cost of revealing data to be arbitrarily correlated.
There is also a newer body of work in the single-aggregator, multiple-source case using peer prediction mechanisms, first introduced in [7]. These techniques often use scoring techniques to evaluate the ‘goodness’ of received data, and often examine classification tasks. In [8, 9], the authors develop mechanisms for eliciting the truth in crowdsourcing applications, while [10, 11, 12] consider theoretical extensions to strengthen the original results of [7], all in the context of a single aggregator. In [13], the authors consider a classification problem with a single aggregator and multiple data sources, which extends the classic peer prediction results by exploiting correlations between the queries and query responses.
A parallel literature considers similar ideas in the regression domain. These works design general payment mechanisms by which a central data aggregator may incentivize data sources to exert the effort necessary to produce and report readings which are deemed to be of high quality with respect to the estimation task the aggregator is performing. The roots of these approaches can be traced at least as far back as VCG mechanisms, a set of seminal results in mechanism design [14]. Indeed, numerous approaches for deciding payments based on the actions of other agents have been proposed [15]. Here, we again see attention given to crowdsourcing [16].
Several recent papers [17, 18, 19, 20, 21, 3] investigate new directions in this domain. In these settings, without the ability to directly determine the effort exerted by data sources, data buyers must design incentive mechanisms based solely on the data available to them. In [17], whose approach we extend here, the authors develop a mechanism which a data aggregator can use to precisely set the level of effort a collection of data sources exert when producing data. A similar mechanism is explored in [18]. Extensions are considered wherein data sources form coalitions [19], or wherein aggregators assess the quality of readings using a trusted data source [20]. Meanwhile, [21] and [3] investigate dynamic settings where data sources are repeatedly queried.
Our work is closest in spirit to the literature studying regression problems with multiple data sources, with our key contribution being the presence of multiple data aggregators that are coupled in their costs and actions. To our knowledge, this is one of the first papers which considers multiple data aggregators and multiple data sources simultaneously. In particular, we simultaneously model coupling between data aggregators in their cost functions, coupling in the payments to the same pool of data sources, and coupling between data sources due to payments that depend on their peers’ data.
We suppose all data aggregators are trying to estimate the same function and share the same pool of data sources. Additionally, we assume each data aggregator has already chosen an estimator, and now must determine how to issue payments to have low estimation error with their exogenously fixed estimator. Our model builds heavily on the model introduced in [17], which featured a single data aggregator. Our contribution is an extension that models cases with multiple data aggregators. For consistency, we will refer to data purchasers as data aggregators, and data sellers as data sources.
Furthermore, the work in this paper is a significant extension of our prior work [22], where we considered strategic data sources with a specific exponential function mapping effort to query response quality. In the present work, we characterize equilibria and the price of anarchy for a much broader class of games between data buyers, in which the data sources’ effort functions can be any nonnegative, strictly decreasing, convex, and twice continuously differentiable function. The characterization we provide considers both bounded and unbounded feasible effort sets for the data sources.
III Data Market Preliminaries
In this section, we outline the models for data sources, data aggregators, and the strategic interactions between them.
At a high level, each data aggregator collects data from data sources to construct an estimate of a given function. In exchange for this data, the data aggregator issues incentives to the data sources. The data aggregators have three terms in their cost function: 1) an estimation error term, which rewards the data aggregator for constructing a better estimate; 2) a competition term, which penalizes when other data aggregators have higher quality estimates; 3) a payment term, which is the cost incurred issuing incentives.
Each data source is able to produce a noisy sample of the desired function. The data sources can exert effort to reduce the variance of the data sample, and we assume the data sources are effort-averse, i.e., data sources will prefer to exert less effort unless they are provided incentives by the aggregators. As such, the data sources have two terms in their utility function: 1) an incentive term, which rewards payments received; 2) an effort term, which penalizes effort exerted.
The level of effort exerted and the variance of the data are not known by the data aggregator; this private information gives rise to moral hazard. One of the problems for the aggregator is the task of designing incentives which depend only on the information available to them. Another important nuance is that data is nonrivalrous; thus, when a data source produces a higherquality data sample, all the aggregators which receive this data benefit.
In order to simplify the initial introduction of our model, we will first assume that each data source provides data to all the aggregators in the data market, and receives payment from all aggregators as well. In Section V, we will outline how our results change when this assumption is removed.
III-A Overview
More formally, let be the index set of strategic data sources, and let be the index set of strategic data aggregators. Each data aggregator desires to construct an estimate for a given function , where is a feature space. Practically, one may think of as a set of features the data aggregators are capable of observing, while the mapping encapsulates the relationship between the observable features and the outcome of interest.
Each data source is able to produce a noisy sample of at the fixed point . The point is common knowledge among all data sources and aggregators. The variance of depends on the effort exerted by data source to produce the reading. Each data source is characterized by an effort-to-variance function , where represents the set of feasible efforts that data source can exert. When data source exerts effort , they produce the data point:
(1) 
Here, is a random variable with mean and variance . The function is common knowledge among all data sources and aggregators. However, while the function is known, the effort exerted is private. This means that the actual variance of , namely , is also private information of . We will delve into the assumptions of the data source model in greater detail in Section III-B.
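As a concrete illustration of the data source model, the following sketch simulates readings produced under two effort levels. The effort-to-variance map `sigma`, the ground-truth function `f`, and all other names are illustrative assumptions, not the paper's notation.

```python
import random

def sigma(effort, sigma0=1.0):
    # Illustrative effort-to-(standard deviation) map: strictly decreasing
    # and convex in effort, in the spirit of Assumption 3 below.
    return sigma0 / (1.0 + effort)

def produce_reading(f, x, effort, rng):
    # A data source samples f at its fixed query point x; the noise level
    # is controlled only indirectly, through the privately chosen effort.
    return f(x) + rng.gauss(0.0, sigma(effort))

rng = random.Random(0)
f = lambda x: 2.0 * x + 1.0  # ground truth, unknown to the aggregators
low  = [produce_reading(f, 0.5, 0.0, rng) for _ in range(20000)]
high = [produce_reading(f, 0.5, 9.0, rng) for _ in range(20000)]

def sample_var(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / (len(ys) - 1)
```

Both sets of readings are unbiased estimates of f(0.5) = 2.0; the high-effort readings simply concentrate much more tightly, mirroring how effort reduces variance without changing the mean.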
Now, suppose a data aggregator is granted access to a data set . At this point, the data aggregator processes this data to construct an estimate for . In exchange for this data set, the data aggregator issues payment to data source for each . Here, denotes the data given to each member of . Note that the payment to from depends not only on the data supplied by , but rather depends on all data available to .
The data aggregator then incurs a loss , which will depend on , the payments issued, as well as , the effort exerted by the data sources. We will formalize the data aggregator model in greater detail in Section III-C.
The interaction of the data market proceeds in three stages.
1) Aggregators declare incentives: Each data aggregator commits to a payment contract . The payments will depend on the data shared with , as well as the common knowledge information and functions .
2) Sources exert effort, realize and share data: In response to , each data source chooses an effort . Then, the random variable is realized according to (IIIA). The data is shared with each data aggregator. Note that has control over only through . In other words, the data source chooses the quality of the data they generate, but cannot arbitrarily manipulate the reported value of .
3) Aggregators construct estimates, issue payments: Each data buyer constructs their estimate , issues payments to the data sources, and incurs loss .
For convenience, we include a table summarizing the notation throughout this paper in Table I.
III-B Strategic Data Sources
As mentioned previously, each data source has their own feature vector , and samples the function at this point. We may also refer to as a query throughout the text, and to as the query response for data source . The data source is characterized by the effort-to-variance function . We assume so that each data source may exert no effort in producing her reading if she desires.
Assumption 1.
For each , the set is a closed, connected set and contains .
Assumption 1 means that we consider two cases:
1) , i.e. the data source's maximum allowed effort is unbounded.
2) for some , i.e. the data source's maximum allowed effort is bounded.
Imposing an upper bound on the amount of effort a data source can exert can be used to model constraints such as hardware limitations. As we shall see in Section IV, the imposition of such constraints can drastically affect equilibrium behavior in the data market.
Once the data source exerts effort , they produce the data point according to (IIIA). Again, we note that the data source only controls the effort level . They can only indirectly control through , and cannot report arbitrary values as their data. We also impose the assumption that the noise in the data is independent across data sources.
Assumption 2.
For each , is a random variable with mean and variance . Furthermore, the random variables are independent.
Both and the function are common knowledge, but the effort and , the actual variance of , are private.
For convenience, we let be the joint effort set and let be the tuple of effort-to-variance functions. We make the following assumptions on the effort-to-variance mappings .
Assumption 3.
For each data source , the mapping , which is the square root of , is (i) strictly decreasing, (ii) convex, and (iii) twice continuously differentiable.
The assumptions correspond to the variance of the estimate generated by data source decreasing in the effort exerted, with decreasing marginal returns.
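A quick numerical sanity check of Assumption 3, using one candidate map (an exponential form, chosen purely as an assumed example) and central finite differences:

```python
import math

def sigma(e):
    # Candidate effort-to-(standard deviation) map: nonnegative, strictly
    # decreasing, convex, and twice continuously differentiable.
    return math.exp(-0.5 * e)

h = 1e-3
def d1(g, e):  # first derivative, central difference
    return (g(e + h) - g(e - h)) / (2 * h)
def d2(g, e):  # second derivative, central difference
    return (g(e + h) - 2 * g(e) + g(e - h)) / (h * h)

grid = [0.1 * k for k in range(1, 101)]
decreasing = all(d1(sigma, e) < 0 for e in grid)
convex     = all(d2(sigma, e) > 0 for e in grid)
```

The decreasing derivative captures that extra effort always helps, while convexity captures the decreasing marginal returns noted above.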
Using the notation , we model each data source with the following utility function:
(2) 
where the expectation is with respect to the randomness in , the data generated by the data sources upon exerting effort . (Footnote: For simplicity and as a first-step analysis, we assume that the data sources only care about the payments received from the aggregators, and are indifferent to which aggregators they share their data with. An interesting and practical extension would be to consider the case where the data sources’ utility functions are aggregator-dependent; this could arise when data sources trust different aggregators differently, or over privacy concerns.) Note the form of (IIIB) implies that the data sources are risk-neutral and effort-averse. Additionally, the form of (IIIB) also implies the effort can be normalized to be comparable to the payments. We note that the timing of the game implies that data sources must commit to an effort level ex ante.
Thus, in the second stage of the game, data source has knowledge of the payment contracts , and chooses to maximize their utility, defined by (IIIB). However, since the utility of each data source depends on the effort exerted by the other data sources, the payments induce a game between the data sources. In Section III-F, we will fully characterize this game for the particular class of incentives introduced in Section III-D.
III-C Strategic Data Aggregators
The primary objective of each aggregator is to construct a low-variance estimate for the function . We adopt the following formal definition of an estimator.
Definition 1 (Estimator [17]).
Let be a family of functions . An estimator for takes as input a collection of examples and produces an estimated function .
As an example, may be the class of linear functions , in which case one may produce an estimated function of via linear regression.
Each data aggregator constructs his estimate for from the class of functions , using the readings . We let denote the estimate that aggregator constructs based on the readings they receive. (Footnote: In general, aggregators need not fit models of the same type; e.g., one data aggregator may choose to generate their estimate via linear regression, while another fits a polynomial of higher degree. Different estimator types across data aggregators may be used to encapsulate competitive advantages one has over another.)
Each data aggregator’s estimator is given, fixed, and common knowledge among all agents. In other words, this means that, for each data aggregator, the process by which a data set is turned into an estimate is exogenous. We focus on the design of incentives once each buyer has chosen an estimator.
First, we introduce some restrictions on the class of estimators allowed. The following assumption is required for us to be able to consider the contribution of data source to reducing aggregator ’s estimation cost. Also, note that the functions will be nonnegative by construction.
Assumption 4.
We assume the estimator for each is separable, in the following sense [17]. There exists a function such that for all queries , distributions over , and variances of the reported estimates at queries in the dataset :
(3) 
Here, the expectation is taken across the randomness in , as well as across .
For brevity, we will also define the function as follows:
(4) 
Let denote the index set of aggregators excluding and let be the payments of all aggregators excluding . Aggregator constructs payments so as to minimize:
(5)  
As in (4), the expectation in (IIIC) is taken with respect to and the randomness in the query responses . The distribution weighs the importance data aggregator places on accurately estimating for different query points .
The scalars parameterize the level of competition between aggregators and . When , aggregator is indifferent to the success of ’s estimation; interacts with entirely through the incentives issued to the data sources. We note that, even when for all and , we can still see degeneracies and social inefficiency arise, since data aggregators will still be coupled through the data sources. (Footnote: This is a stylized formulation of how competition can affect different data aggregators, but we see interesting results arise even in this simple model. In the future, we hope to consider more extensive models of competition for data aggregators.) The parameter denotes a conversion between dollar amounts allocated by the payment functions and the utility generated by the quality of the various estimates that are constructed. We make the assumption that aggregator has knowledge of what estimator every other data aggregator plans to use, as well as the weighting distributions. (Footnote: This is a fairly strong assumption, given that competing data aggregators are unlikely to inform their competitors how they intend to process the data supplied by the sources. Our work isolates how coupling between aggregators through the data sources affects the data market; an interesting avenue for future work is to consider extensions with different information sets, and to characterize the existence and severity of market inefficiencies in those situations.)
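The three-term structure of the aggregator's cost can be sketched as follows; the function and argument names are illustrative stand-ins for the notation of (5), with `betas` the competition weights and `lam` the payment conversion parameter:

```python
def aggregator_cost(own_error, rival_errors, total_payment, betas, lam):
    # Stylized three-term cost: the aggregator pays for its own estimation
    # error, is hurt when rivals estimate well (beta-weighted competition
    # terms: lower rival error raises this aggregator's cost), and pays
    # lambda-scaled incentives to the data sources.
    competition = sum(b * err for b, err in zip(betas, rival_errors))
    return own_error - competition + lam * total_payment
```

Setting every entry of `betas` to zero makes the cost independent of the rivals' errors, matching the remark that coupling through the shared data sources persists even without explicit competition terms.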
III-D Structure of Payment Contracts
Throughout this paper, we will assume a particular form for the payment contracts the aggregators offer to the data sources. Similar to previous notation, we let . For a given and we assume that is of the form:
(6) 
Here, and are nonnegative scalars. Also, denotes ’s data set excluding . Namely, is the set of data features for all sources excluding , and is the set of query responses to aggregator , excluding .
Note that these payments do not directly depend on the level of effort that any of the data sources exert, since the data aggregators have no means to directly observe these values. Rather, the payment to source from aggregator depends on ’s best estimate for constructed without ’s data, namely, . The payments depend only on the data reported to the aggregator, and can be computed by the aggregator.
Similar payment contracts are common in the literature [17, 18, 20], in part because of their intuitive structure. The aggregator constructs an unbiased estimate of what data source should report, and this estimate is not influenced by the data of . This estimate is used to overcome the problem of moral hazard: all data sources are appropriately incentivized to reduce the variance of their reported data accordingly.
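A minimal sketch of a payment of the form (6), assuming the quadratic leave-one-out scoring used in [17] and related work; the constant names `c` and `d` stand in for the contract parameters:

```python
def payment(report, loo_estimate, c, d):
    # Base amount c, minus a penalty d times the squared deviation of
    # source i's report from the aggregator's leave-one-out estimate of
    # f at x_i (an estimate built without source i's own data).
    return c - d * (report - loo_estimate) ** 2

close = payment(2.05, 2.0, c=1.0, d=4.0)   # report near the peers' estimate
far   = payment(2.80, 2.0, c=1.0, d=4.0)   # report far from it
```

Because the leave-one-out estimate is unaffected by source i's own report, reducing the variance of one's own reading is the only way to raise one's expected payment, which is how this structure addresses moral hazard.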
Given this payment structure, each data aggregator’s choice of payment contracts reduces to choosing parameters where and .
In the single-aggregator case (when ), it was shown in [17] that payments of the form in (6) induce a game between the data sources which has a unique dominant strategy equilibrium. That is, for each collection of parameters and , the data sources each exert a unique level of effort. The authors develop an algorithm by which the single aggregator may select these parameters such that (i) data sources are incentivized to exert any level of effort that the aggregator desires, and (ii) data sources are compensated at exactly the value of their effort, i.e. .
This paper’s contribution is the study of how pricing schemes of this form perform in the more general case where there is more than one data aggregator (when ), and data aggregators may compete with each other. The goal is to model multiple aggregators as strategic decisionmakers in competition, and understand the data market where these agents interact. Thus, while prior work captured moral hazard, we extend this model to capture competition and the nonrivalrous nature of data.
III-E Formulation of the Aggregator Optimization Problem
As mentioned previously, the aggregators hope to minimize their costs, as given in (IIIC). They do so by choosing the parameters . In this section, we will describe the aggregator’s optimization problem in more detail, and specify constraints that the parameter choice must satisfy.
The first constraint is individual rationality (IR). Individual rationality requires that each data source’s utility is nonnegative ex ante [23]. (Footnote: Alternatively, a data source’s utility may be compared to an outside option; for simplicity, we model the outside option as having zero utility.) This ensures that rational data sources are willing to exert effort to produce the data. The second constraint is nonnegative payments from each data aggregator. Given that there are multiple aggregators, we introduce a constraint that the payment each aggregator offers to each source is nonnegative ex ante. (Footnote: Negative payments could be handled via exchangeable utilities among the data aggregators or via a trusted third party to manage the allocations; however, in an effort to ensure clarity, we leave these scenarios aside.)
We now introduce some notation for brevity; we let denote the expected value of the payment :
(7) 
where denotes the probability measure with mass one at and . Similar to previous conventions, we define:
Thus, the IR constraint for each data source is formalized:
(8) 
Similarly, the nonnegativity constraint for each data source and data aggregator is given by:
(9) 
The third constraint is incentive compatibility (IC). Intuitively, IC states that when a data source is acting rationally and choosing actions to maximize their utility, they behave as the data aggregators intended. When there is a single aggregator, IC is typically enforced by the aggregator finding the effort that minimizes their cost, , and then designing such that . (Footnote: For notational brevity, we will use as a function rather than a set-valued function throughout this paper; this is well-defined by Assumption 3.)
In the competitive setting, IC for one aggregator is defined holding all other aggregators’ payments fixed. Each data aggregator makes their choice of payment subject to the fact that data source selects effort according to
(10) 
Note that the payment each source receives depends on the efforts exerted by the other data sources. Thus, for each set of contracts offered by the aggregators, a game is induced between the data sources to determine how much effort they will exert. The aggregators compete by issuing incentives, which influences the equilibrium behavior of this game.
From the perspective of the data aggregators, the IC constraint states the desired effort level must be a dominant strategy for data source ; that is, is the utilitymaximizing action for regardless of the actions taken by other sources . Formally, the following must hold for all :
With these constraints, we formulate a bilevel optimization problem for each aggregator. Consider a fixed aggregator . Given a fixed action profile for all other buyers , i.e. given , aggregator aims to solve:
s.t.  
where is defined in (IIIC).
Note that this problem has optimization problems as constraints, making it a difficult bilevel program. However, we will reformulate the aggregator’s problem into a more manageable nonlinear program in the sequel. This is possible, in part, due to the nice properties of the payment contract structure introduced in Section III-D; this tractability motivates the use of payment contracts of that particular form. Next, we analyze the induced game between the data sources and simplify the aggregator’s optimization problem.
III-F Induced Equilibrium Between Data Sources
To ensure a notion of incentive compatibility in equilibrium, we show there is a welldefined mapping from the parameters chosen by the aggregators to the equilibrium .
Definition 2.
For fixed payments , we say is an induced Nash equilibrium if for each data source :
(11) 
If (11) holds for all rather than just at , then we say that is an induced dominant strategy equilibrium.
Suppose now that we have a set of payments of the form discussed in Section III-D, characterized by the parameters . Data source chooses effort according to:
(12) 
for each choice of made by the other data sources. It is straightforward to verify that (12) is a concave maximization problem which admits a unique globally optimal solution. This follows from our assumption that is convex and decreasing, recalling that for each , and observing that is a convex set. Moreover, note that the choice of this optimal effort is not affected by the choice of , since each of the terms enters (12) as a constant from the perspective of . Thus, each choice of contract parameters selected by the aggregators leads to an induced dominant strategy equilibrium for the data sources. In particular, note that the choice of
(13) 
fully characterizes the level of effort that data source exerts in equilibrium. We reiterate that the constraints on the aggregators’ optimization problems will ensure the chosen contract parameters respect the IR and nonnegativity constraints.
Next, we define to be the implicitly-defined map such that returns the solution to (12) for a given choice of . In the following section, we will use this mapping to simplify the optimization problem facing each of the aggregators.
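The induced best-response map can be computed numerically. The sketch below assumes a reduced form of (12) in which, under quadratic payments, a source's expected utility is U(e) = -D * sigma_sq(e) - e, with D the total quadratic-penalty weight offered by all aggregators; all names here are assumptions for illustration.

```python
import math

def best_response(D, sigma_sq=lambda e: math.exp(-e), e_max=50.0):
    # Maximize the concave utility U(e) = -D * sigma_sq(e) - e over e >= 0
    # by bisection on U'(e), approximating sigma_sq' by central differences.
    h = 1e-6
    def dU(e):
        dsig = (sigma_sq(e + h) - sigma_sq(e - h)) / (2 * h)
        return -D * dsig - 1.0
    if dU(0.0) <= 0.0:
        return 0.0  # demand too weak: the source exerts no effort
    lo, hi = 0.0, e_max
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if dU(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For sigma_sq(e) = exp(-e), the first-order condition gives an effort of ln D whenever D > 1, and the induced effort increases continuously in D, consistent with Lemma 1 below.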
Definition 3.
For a given data source , let:
(14) 
When with , define where
(15) 
On the other hand, when , define .
The above definition implies is the minimum value of that the aggregators must offer data source to ensure they do not have an incentive to exert negative effort. (Footnote: This situation could correspond to source obfuscating their data, for example.) We have restricted to the nonnegative orthant, so we will add constraints to ensure we are operating within the domain of our model. Similarly, if the aggregators increase past , source cannot further increase the level of effort they exert, and the mapping ceases to be meaningful. Thus, when reformulating each buyer’s optimization problem in the following section, we will additionally constrain for each .
The following lemma provides properties on the mapping which are needed to prove existence of equilibria for the game between aggregators in the first stage.
Lemma 1.
Fix a data source . Then the mapping is continuous and strictly increasing in for all values of .
Proof.
The firstorder optimality condition for the data source is given by:
(16) 
By assumption is strictly decreasing and convex so that (16) has a unique solution for all . By definition, this solution is . Implicit differentiation of (16) then yields:
where we suppress the dependence of on . The righthand side of the above equation is strictly positive by Assumption 3. Continuity follows directly by Assumption 3. ∎
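Lemma 1 (and the thresholds of Definition 3) can be illustrated numerically. The sketch below adopts a hypothetical variance function $\sigma(e) = 1/(1+e)$, which is strictly decreasing and convex in the effort $e \geq 0$, and takes the source's cost of effort to be $e$ itself, so that the first-order condition takes the illustrative form $-p\,\sigma'(e) = 1$. These functional forms are assumptions made purely for this example, not part of the paper's model.

```python
import numpy as np

# Hypothetical variance function (illustration only): sigma(e) = 1/(1+e)
# is strictly decreasing and convex in the effort e >= 0.
def sigma_prime(e):
    return -1.0 / (1.0 + e) ** 2

def effort(p):
    """e(p): unique root of the illustrative first-order condition
    -p * sigma'(e) = 1, found by bisection (the FOC is decreasing in e)."""
    lo, hi = 0.0, 1e6
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if -p * sigma_prime(mid) - 1.0 > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

ps = np.linspace(1.5, 25.0, 48)
es = np.array([effort(p) for p in ps])

# For this sigma the FOC solves in closed form: e(p) = sqrt(p) - 1, so the
# minimum offer of Definition 3 (the point of zero effort) sits at p = 1.
assert np.allclose(es, np.sqrt(ps) - 1.0, atol=1e-6)
# e(p) is strictly increasing in p, as Lemma 1 asserts.
assert np.all(np.diff(es) > 0)
```

For this choice of variance function, increasing the total payment parameter always elicits strictly more effort, exactly the monotonicity the lemma establishes in general.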
III-G Reformulation of the Buyers’ Optimization Problems
Finally, using our previous analysis and assumptions, we reformulate the optimization problem faced by each aggregator. This reformulation will simplify our analysis of equilibrium behavior in the data market, and lend economic interpretability to the results presented in Section IV.
Previously, we assumed that aggregator ’s estimator is separable in Assumption 4. This allows us to write the loss function of as:
Recall that is fixed and common knowledge. Thus, we can replace each of the evaluations of the ’s with constants. Towards this end, for each and , we define:
(17) 
(18) 
Note that each , by definition of the . In addition, for each and , define:
(19) 
Since we defined such that , we can write:
Similarly, the expected payment for any data source and data aggregator is given by:
Before proceeding, we provide an interpretation of the constants introduced above. The constant denotes the relevance of data sampled from the point when constructing aggregator ’s estimate, given the distribution of all of the data sources. The parameter corresponds to the level of demand that aggregator has for high-quality data from source , factoring in the benefit this data supplies to the competitors of . In other words, parameters capture the effects of the nonrivalrous nature of data. The parameter denotes a measure of coupling that exists between the payment contracts and . In the case of a single aggregator (i.e. [17]), this coupling did not prove problematic. In contrast, when there are multiple aggregators, each aggregator has an incentive to try to exploit this coupling, as shall become clear in our ensuing analysis. This coupling will play a central role in determining the existence and efficiency of equilibrium behavior in the data market.
Collecting the various expressions we have introduced, aggregator ’s optimization problem can be rewritten as:
(20)  
Without loss of generality, we let , by normalizing the accordingly. Note that the constraint can be omitted, in light of the constraint , since each and .
Notation  Meaning  Defined or First Used in Equation 

index of data source  –  
index set of data sources  –  
index of aggregator  –  
index set of aggregators  –  
expected payment from aggregator to source  (7)  
linear term in ; used to adjust level of effort in equilibrium  (6)  
vector containing the parameters offered to source by the members of  –  
constant term in ; used to ensure incentive compatibility in equilibrium  (6)  
vector containing the parameters offered to sources by the members of  –  
sum of parameters offered to source across all members of  (13)  
minimum value of required to ensure source does not exert negative effort  (15)  
minimum value of at which data source exerts her maximum effort  (14)  
, the allowable range of  –  
implicit map which returns the equilibrium value of as a function of  –  
level of competition between  (IIIC)  
relevance of data from in constructing aggregator ’s estimator  (17)  
aggregate demand for from  (19)  
sum of demand for data source across all members of  (23)  
coupling between and  (36) 
IV Generalized Nash Equilibria in the Data Market
It is important to note that the constraints each aggregator faces in her optimization problem (20) depend on the actions taken by the rest of the aggregators in the data market. In particular, in order to ensure that the IR and IC constraints are maintained in equilibrium, we require an equilibrium concept which allows each aggregator’s admissible action space to depend on the choice of contract parameters selected by the other aggregators in the data market. Thus, we will employ the notion of a generalized Nash equilibrium [24] to study competitive outcomes in the data market, which is a natural extension of the typical notion of Nash equilibrium to this setting.
Let be aggregator ’s action space; that is, where and . Each aggregator solves a parametric nonlinear programming problem given by
(21) 
where with a finite set indexing the constraint functions of aggregator . Note that, unlike in the classic definition of a Nash equilibrium, the admissible action space of aggregator depends on , the actions of .
We refer to this collection of coupled optimization problems as a generalized Nash (GN) equilibrium problem. A GN equilibrium is defined as follows.
Definition 4.
A point is said to be a GN equilibrium for if for all , solves .
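To make the definition concrete, the following sketch checks GN equilibria in a standard two-player toy problem with a shared constraint, a textbook example from the GNEP literature (see, e.g., [24]) that is not part of our data-market model: player 1 minimizes $(x_1 - 1)^2$ and player 2 minimizes $(x_2 - \tfrac{1}{2})^2$, both subject to the shared constraint $x_1 + x_2 \leq 1$. Even this small example exhibits a continuum of equilibria, foreshadowing the degeneracy we identify below.

```python
# Toy GN problem with a shared constraint x1 + x2 <= 1 (illustration only):
#   Player 1: min (x1 - 1)^2    s.t. x1 + x2 <= 1
#   Player 2: min (x2 - 1/2)^2  s.t. x1 + x2 <= 1
# Both best responses are available in closed form.
def br1(x2):
    # Player 1's unconstrained optimum is 1; project onto x1 <= 1 - x2.
    return min(1.0, 1.0 - x2)

def br2(x1):
    # Player 2's unconstrained optimum is 1/2; project onto x2 <= 1 - x1.
    return min(0.5, 1.0 - x1)

def is_gne(x1, x2, tol=1e-12):
    """(x1, x2) is a GN equilibrium iff each player best-responds given
    the other's action AND the induced feasible set."""
    return abs(br1(x2) - x1) < tol and abs(br2(x1) - x2) < tol

# Every point (a, 1 - a) with 1/2 <= a <= 1 is a GN equilibrium:
assert all(is_gne(a, 1.0 - a) for a in [0.5, 0.6, 0.75, 0.9, 1.0])
# whereas, e.g., (0.25, 0.75) is not:
assert not is_gne(0.25, 0.75)
```

The continuum arises because the shared constraint binds at every equilibrium, but the players' actions can slide along it; an analogous mechanism drives the nonuniqueness in Theorem 1.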
We now analyze the game between the aggregators utilizing the notion of a GN problem and GN equilibrium. We will characterize the existence and uniqueness of GN equilibria in two scenarios. In Section IV-A, we will consider the case where the effort spaces of data sources are unbounded, i.e. . In Section IV-B, we will characterize the case where each data source has an upper bound on the level of effort they can exert, i.e. . In Section IV-C, we will then address the social efficiency of the equilibria identified in Section IV-A. A similar analysis of the equilibria identified in Section IV-B can be found in Appendix B.
Before proceeding to our main results, we provide a technical lemma that will play a central role in our ensuing analysis and introduce some notation which will simplify the statement of our results. For compactness, for a given set of parameters we define . (Recall that is the sum of parameters, as defined in Equation (13).)
Lemma 2.
Suppose , where , is a GN equilibrium for the game defined by (20). Then for each :
(22) 
In other words, the IR constraint is always binding in equilibrium, and the expected payment to data source is equal to the effort exerted in equilibrium:
Proof.
Suppose that there is an equilibrium in which the IR constraint is not binding for some data source . Then, there must exist an aggregator whose nonnegativity constraint corresponding to source is also not binding. Aggregator can then unilaterally improve their payoff by decreasing without causing any of the constraints to be violated, contradicting the assertion that the given selection of parameters is an equilibrium. ∎
The result of Lemma 2 is a wellknown result in contract design—that is, the individual rationality constraint always binds for the optimal contract [23]. As shall become clear in our analysis in the following sections, the equality (22) forms an implicit constraint that appears in each of the aggregators’ optimizations, which will be directly responsible for the degeneracy observed in the data market. Roughly speaking, while the parameters selected by the aggregators determine the level of effort that the data sources will exert, the parameters determine what portion of this effort each aggregator is expected to compensate.
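Lemma 2's argument can be sketched numerically under simplified, purely illustrative assumptions: suppose the source's cost of effort equals the effort level $e$ itself, and aggregator $j$'s expected payment is a number `pay[j]`, so that the IR constraint reads `sum(pay) - e >= 0`. These names and the specific numbers are hypothetical, chosen only to trace the proof's deviation argument.

```python
# Illustrative instance of Lemma 2's deviation argument (assumed setup):
# effort e is held fixed by the b parameters; pay[j] is aggregator j's
# expected payment; the IR constraint is sum(pay) - e >= 0.
e = 2.0
pay = [1.5, 1.0, 0.5]          # total payment 3.0 > e, so IR has slack
slack = sum(pay) - e
assert slack > 0

# Aggregator 0 can unilaterally cut its payment by the slack without
# violating IR or its own nonnegativity constraint, strictly lowering
# its cost:
cut = min(slack, pay[0])
pay[0] -= cut
assert sum(pay) - e >= 0 and all(p >= 0 for p in pay)
# Repeating this for any aggregator still paying a positive amount drives
# the IR constraint to bind: sum(pay) == e in equilibrium.
```

The same one-sided deviation is available whenever the IR constraint is slack, which is why the constraint (22) binds at every equilibrium.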
For each , define:
(23) 
which can be interpreted to be the total demand for high quality data from data source . Next, we define:
Note that Lemma 2 implies that if is a GN equilibrium in the game between the aggregators then will hold for each . Moreover, the nonnegativity constraints in the game between the buyers will hold only if for each and .
IV-A Unbounded Effort Spaces
Let us first consider the case where there is no upper bound on the effort the data sources may exert, i.e. .
Theorem 1.
Consider the game , where each aggregator’s objective is to solve the optimization in (20). Suppose that for each , and . Further, suppose that , . Then, there is either no GN equilibrium or an infinite number of GN equilibria. Moreover, if is a GN equilibrium, then the following conditions hold:

The set of infinite GN equilibria is given by:
That is, the parameters selected by the aggregators are the same across each GN equilibrium, and all degeneracy lies in the equilibrium parameters which lie in the dimensional convex polytope defined above.

The effort exerted by each data source is the same in each GN equilibrium, and the efforts constitute a unique induced dominant strategy equilibrium between the data sources. More precisely, each data source exerts effort in all GN equilibria.
Before going ahead with the proof of the theorem, we discuss its hypotheses and implications. The hypothesis that implies that there is enough demand for the data from source such that she does not have an incentive to exert negative effort in equilibrium. Together, the aggregators will provide sufficient incentive to so that she accepts each of the contracts offered to her and truthfully reports her query-response. When we investigate the case where only provides readings to a subset of the aggregators in Section V, only the relevant subset of aggregators must maintain this constraint. This condition places a restriction on what subsets of incentives from the aggregators each data source is willing to accept.
As we discovered in Section IIIF, the parameters selected by the aggregators uniquely determine how much effort the data sources exert in equilibrium. Intuitively, the fact that the parameters are constant across all GN equilibria means that, when GN equilibria do exist in the game between the aggregators, the aggregators have agreed to incentivize the data sources to each exert a particular level of effort. The proof of the theorem will shed some light on how this unique choice of parameters is selected when GN equilibria exist, and also demonstrate what ‘goes wrong’ in cases where the aggregators cannot agree on how much effort to incentivize the sources to exert. In the latter case, no GN equilibrium solution exists in the game between the aggregators. Further commentary on this point is provided after the proof of Theorem 2.
Meanwhile, for a fixed profile of parameters, the parameters determine how much of this effort each aggregator is responsible for compensating in expectation. Even when the aggregators are able to agree on how much effort to incentivize from the data sources and select the unique GN equilibrium choice for , there is a nonuniqueness in the parameters in equilibrium. This implies that there is a fundamental ambiguity in who will fund the effort exerted by the data sources. In the extreme case, it is possible for one aggregator to pay for the entirety of the expected compensation offered to the data sources, while the other aggregators pay nothing in expectation.
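This ambiguity can be made concrete under the same simplified, illustrative assumptions as before (source's effort cost equal to $e$, aggregator $j$'s expected payment a number `pay[j]`): with the effort level pinned down, the binding IR constraint fixes only the total expected payment, and any nonnegative split of that total across aggregators is consistent with equilibrium.

```python
# Degeneracy in who funds the sources (illustrative numbers): with effort
# e fixed by the b parameters, equilibrium pins down only sum(pay) == e.
e = 2.0
candidate_splits = [
    [1.0, 0.5, 0.5],   # cost shared across all three aggregators
    [2.0, 0.0, 0.0],   # one aggregator funds everything (the extreme case)
    [0.0, 1.2, 0.8],   # one aggregator free-rides entirely
]
for pay in candidate_splits:
    # Each split satisfies the binding IR constraint and every
    # aggregator's nonnegativity constraint.
    assert abs(sum(pay) - e) < 1e-12
    assert all(p >= 0.0 for p in pay)
```

Each split leaves every constraint satisfied while allocating the expected compensation arbitrarily, which is precisely the free-riding ambiguity described above.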
Proof of Theorem 1.
By Lemma 2, we have that:
(24) 
Plugging in this constraint, the cost function for aggregator can be expressed as:
By swapping the roles of and in the middle term above, aggregator ’s cost can be decomposed into a sum of costs, one for each data source. We define:
Then aggregator ’s optimization problem reduces to:
s.t.  
Note that the cost does not depend on , for any . We complete the argument by first ignoring the constraints and then showing that they are satisfied by the set of equilibria we characterize.
Differentiating the cost with respect to and applying (16) and for all , we have that:
where . Applying Lemma 1, which states , we get the following conditions: