
# Credible Sample Elicitation by Deep Learning, for Deep Learning

Yang Liu (University of California, Santa Cruz; yangliu@ucsc.edu), Zuyue Fu (Northwestern University; zuyue.fu@u.northwestern.edu), Zhuoran Yang (Princeton University; zy6@princeton.edu), Zhaoran Wang (Northwestern University; zhaoranwang@gmail.com). Equal contribution.

October 9, 2019
###### Abstract

It is important to collect credible training samples for building data-intensive learning systems (e.g., a deep learning system). In the literature, there is a line of studies on eliciting distributional information from self-interested agents who hold relevant information. Asking people to report a complex distribution, though theoretically viable, is challenging in practice, primarily due to the heavy cognitive load required for human agents to reason about and report this high-dimensional information. Consider the example where we are interested in building an image classifier by first collecting a certain category of high-dimensional image data. While classical elicitation results apply to eliciting a complex, generative (and continuous) distribution for this image data, we are instead interested in eliciting samples from agents. This paper introduces a deep learning aided method to incentivize credible sample contributions from selfish and rational agents. The challenge is to design an incentive-compatible score function that scores each reported sample so as to induce truthful reports, instead of arbitrary or even adversarial ones. We show that with accurate estimation of a certain $f$-divergence function we are able to achieve approximate incentive compatibility in eliciting truthful samples. We then present an efficient estimator with theoretical guarantees, obtained by studying the variational form of the $f$-divergence function. Our work complements the literature on information elicitation by introducing the problem of sample elicitation. We also show a connection between this sample elicitation problem and $f$-GAN, and how this connection can help reconstruct an estimator of the distribution based on collected samples.

## 1 Introduction

The availability of a large quantity of credible samples is crucial for building high-fidelity machine learning models. This is particularly true for deep learning systems that are data-hungry. Arguably, the most scalable way to collect a large number of training samples is crowdsourcing from a decentralized population of agents who hold relevant sample information. The most popular example is the construction of ImageNet (Deng et al., 2009).

The main challenge in eliciting private information is to properly score reported information such that a self-interested agent who holds private information is incentivized to report truthfully. At first look, this problem of eliciting quality data appears readily solvable with the seminal solution for eliciting distributional information, the strictly proper scoring rule (Brier, 1950; Winkler, 1969; Savage, 1971; Matheson and Winkler, 1976; Jose et al., 2006; Gneiting and Raftery, 2007): suppose we are interested in eliciting information about a random vector $X$. Denote its probability density function by $p$, with distribution $\mathbb{P}$. As the mechanism designer, if we have a sample $x$ drawn from the true distribution $\mathbb{P}$, we can apply strictly proper scoring rules (SPSR) to elicit $p$: the agent who holds $p$ will be scored using $S(p, x)$. $S$ is called strictly proper if the following condition holds for any report $q \neq p$: $\mathbb{E}_{x\sim\mathbb{P}}[S(p, x)] > \mathbb{E}_{x\sim\mathbb{P}}[S(q, x)]$. The above elicitation approach has two main caveats that limit its application:

• When the outcome space is large and is even possibly infinite, it is practically impossible for any human agents to report such a distribution with reasonable efforts. This partially inspired a line of follow-up works on eliciting property of the distributions, which we will discuss later.

• The mechanism designer may not possess any ground truth samples.
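To make the strictly proper scoring rule concrete, here is a small numerical sketch (our own illustration, not from the paper: a Bernoulli outcome and a grid search are hypothetical choices) showing that the log score is maximized in expectation by the truthful report:

```python
import numpy as np

# Illustrative check that the log score S(q, x) = log q(x) is strictly proper:
# for a Bernoulli outcome with true success probability p, the expected score
# E_{x~p}[S(q, x)] is maximized at the truthful report q = p.
def expected_log_score(q: float, p: float) -> float:
    """Expected log score of report q when outcomes follow Bernoulli(p)."""
    return p * np.log(q) + (1 - p) * np.log(1 - q)

p_true = 0.7
grid = np.linspace(0.01, 0.99, 99)                 # candidate reports
scores = [expected_log_score(q, p_true) for q in grid]
best_report = grid[int(np.argmax(scores))]
print(best_report)  # 0.7, the truthful report
```

The same grid search with any other strictly proper rule (e.g., the Brier score) picks out the same maximizer.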

In this work we aim to collect credible samples from self-interested agents by studying the question of sample elicitation. Instead of asking each agent to report the complete distribution $\mathbb{Q}$, we hope to elicit samples drawn truthfully from it. Denote the samples by $\{x_i\}_{i\in[n]}$. In analogy to strictly proper scoring rules (our specific formulation and goal will differ in details), we aim to design a score function $S$ such that $\mathbb{E}[S(x_i, R)] \ge \mathbb{E}[S(r(x_i), R)]$ for any misreport $r(x_i)$, where $R$ is a reference answer that can be defined using elicited reports. This setting relaxes the requirement of high reporting complexity, and has wide applications in collecting training samples for machine learning tasks. Indeed our goal resembles property elicitation (Lambert et al., 2008; Steinwart et al., 2014; Frongillo and Kash, 2015a), but we emphasize that our aims are different: property elicitation elicits statistical properties of a distribution, while we focus on eliciting samples drawn from the distribution. In scenarios where agents do not have the complete knowledge or the power to compute these properties, our setting enables elicitation of individual sample points.

Our challenge lies in accurately evaluating reported samples. We first observe that the $f$-divergence between two properly defined distributions of the samples can serve the purpose of incentivizing truthful reporting of samples. We then propose a variational approach that enables us to estimate these divergences efficiently from reported samples, by estimating the variational form of the $f$-divergence through a deep neural network. These estimation results help us establish approximate incentive compatibility in eliciting truthful samples. It is worth noting that our framework also generalizes to the setting where there is no access to ground truth samples and we can only rely on reported samples. There we show our estimation results admit an approximate Bayesian Nash Equilibrium for agents to report truthfully. Furthermore, in our estimation framework, we use a generative adversarial approach to reconstruct the distribution from the elicited samples.

Our contributions are threefold. (1) We tackle the problem of eliciting complex distributions by proposing a sample elicitation framework. Our deep learning aided solution concept makes it practical to solicit complex sample information from human agents. (2) Our framework covers the case where the mechanism designer has no access to ground truth information, which adds a contribution to the peer prediction literature. (3) On the technical side, we develop estimators via deep learning techniques with strong theoretical guarantees. This not only helps us establish approximate incentive compatibility, but also enables the designer to recover the targeted distribution from elicited samples. Our contribution can therefore be summarized as

“eliciting credible training samples by deep learning, for deep learning”.

### 1.1 Related works

The most relevant literature to our paper is strictly proper scoring rules and property elicitation. Scoring rules were developed for eliciting truthful predictions (probabilities) (Brier, 1950; Winkler, 1969; Savage, 1971; Matheson and Winkler, 1976; Jose et al., 2006; Gneiting and Raftery, 2007). Characterization results for strictly proper scoring rules are given in (McCarthy, 1956; Savage, 1971; Gneiting and Raftery, 2007). Property elicitation addresses the challenge of eliciting complex distributions (Lambert et al., 2008; Steinwart et al., 2014; Frongillo and Kash, 2015a). For instance, (Abernethy and Frongillo, 2012) characterizes the scoring functions for eliciting linear properties, and (Frongillo and Kash, 2015b) studies the complexity of eliciting properties. Another line of relevant research is peer prediction, whose solutions can help elicit private information when ground truth verification is missing (De Alfaro et al., 2016; Gao et al., 2016; Kong et al., 2016; Kong and Schoenebeck, 2018, 2019). Our work complements the information elicitation literature by proposing and studying the question of sample elicitation via a variational approach to estimating $f$-divergence functions.

As mentioned, our work borrows ideas from the statistical learning literature on divergence estimation. The simplest way to estimate a divergence starts with the estimation of density functions; see (Wang et al., 2009; Lee and Park, 2006; Wang et al., 2005) and the references therein. In recent years, another method based on the Donsker-Varadhan representation (Donsker and Varadhan, 1975) of the divergence function has come into play. Related works include (Ruderman et al., 2012; Nguyen et al., 2010; Kanamori et al., 2011; Sugiyama et al., 2012; Broniatowski and Keziou, 2004, 2009), where the estimation of the divergence is modeled as the estimation of the density ratio between two distributions. The Donsker-Varadhan representation of the divergence function also motivates the well-known Generative Adversarial Network (GAN) (Goodfellow et al., 2014), which learns the distribution by minimizing the Kullback-Leibler divergence (Kullback and Leibler, 1951). Follow-up works include $f$-GAN (Nowozin et al., 2016), Wasserstein-GAN (Arjovsky et al., 2017; Gulrajani et al., 2017), and Cramér-GAN (Bellemare et al., 2017), where different divergence functions are used to learn the distribution. Theoretical analyses of GANs are given in (Liang, 2018; Liu et al., 2017; Arora et al., 2017).

### 1.2 Notations

For a distribution $\mathbb{P}$, we denote by $\mathbb{P}_n$ the empirical distribution given a set of samples $\{x_i\}_{i\in[n]}$ following $\mathbb{P}$, i.e., $\mathbb{P}_n = n^{-1}\sum_{i\in[n]} \delta_{x_i}$, where $\delta_{x_i}$ is the Dirac measure at $x_i$. We denote by $\|v\|_p = (\sum_i |v_i|^p)^{1/p}$ the $\ell_p$ norm of a vector $v$ for $1 \le p < \infty$, and $\|v\|_\infty = \max_i |v_i|$, where $v_i$ is the $i$-th entry of $v$. For any real-valued continuous function $t$, we denote by $\|t\|_\infty = \sup_x |t(x)|$ the $L_\infty$ norm of $t$ and by $\|t\|_{L_2} = (\int t^2 \,\mathrm{d}\mu)^{1/2}$ the $L_2$ norm of $t$, where $\mu$ is the Lebesgue measure. For any real-valued functions $g$ and $h$ defined on some unbounded subset of the positive reals, such that $h$ is strictly positive for all large enough values of $x$, we write $g(x) = O(h(x))$ and $g(x) \lesssim h(x)$ if $g(x) \le C\, h(x)$ for some positive absolute constant $C$ and any $x \ge x_0$, where $x_0$ is a real number. We denote by $[n]$ the set $\{1, 2, \ldots, n\}$.

## 2 Preliminary

We formulate the question of sample elicitation.

### 2.1 Sample elicitation

We consider two scenarios. We start with an easier case where we, as the mechanism designer, have access to a certain number of ground truth samples. This setting resembles the proper scoring rule setting. Then we move to the harder case where the inputs to our mechanism can only be elicited samples from agents.

#### Multi-sample elicitation with ground truth samples.

Suppose the agent holds $n$ samples, each drawn from the distribution $\mathbb{Q}$ with density $q$, i.e., $x_i \sim \mathbb{Q}$ for $i \in [n]$ (though we use $x$ to denote the samples we are interested in, $x$ potentially includes both the features and the labels, as in the context of supervised learning). The agent can report each sample arbitrarily, denoted by $r(x_i)$. There are $n$ data points $\{x_i^*\}_{i\in[n]}$ drawn from the ground truth distribution $\mathbb{P}$ (the number of ground truth samples can differ from $n$, but we keep them the same for simplicity of presentation; it mainly affects the terms resulting from our estimations). We are interested in designing score functions $S(\cdot)$ that take as inputs each reported sample $r(x_i)$ and the ground truth samples $\{x_i^*\}_{i\in[n]}$, such that if the agent believes that $x_i^*$ is drawn from the same distribution $\mathbb{Q}$, then with probability at least $1-\delta$,

 $\mathbb{E}\bigl[S\bigl(x_i, \{x_j^*\}_{j\in[n]}\bigr)\bigr] \ge \mathbb{E}\bigl[S\bigl(r(x_i), \{x_j^*\}_{j\in[n]}\bigr)\bigr] - \epsilon.$

We name the above $(\epsilon, \delta)$-properness (per sample) for sample elicitation. When $\epsilon = \delta = 0$, it reduces to a notion similar to properness in the scoring rule literature. When there is no confusion, we will also use the shorthand $r_i = r(x_i)$. The agent believes that his samples are generated from the same distribution as the ground truth samples, i.e., $\mathbb{Q} = \mathbb{P}$.

#### Sample elicitation with peer samples.

Suppose there are $n$ agents, each holding a sample $x_i \sim \mathbb{P}_i$. The $\mathbb{P}_i$'s are not necessarily the same, which models the fact that agents can have subjective biases or local observation biases. This is a more standard peer prediction setting. Denote the agents' joint distribution as $\mathbb{P}$.

Again, each agent can report arbitrarily, denoted by $r(x_i)$. We are interested in designing and characterizing score functions $S(\cdot)$ that take as inputs each reported sample and the peer agents' reports, such that for every agent $i$, with probability at least $1-\delta$,

 $\mathbb{E}_{x\sim\mathbb{P}}\bigl[S\bigl(x_i, \{r_j(x_j) = x_j\}_{j\neq i}\bigr)\bigr] \ge \mathbb{E}_{x\sim\mathbb{P}}\bigl[S\bigl(r(x_i), \{r_j(x_j) = x_j\}_{j\neq i}\bigr)\bigr] - \epsilon.$

We name the above an $(\epsilon, \delta)$-Bayesian Nash Equilibrium (BNE) in truthful elicitation. We only require that agents are all aware of the above information structure as common knowledge, but they do not need to form beliefs about the details of other agents' sample distributions. Each agent's sample is private to themselves.

### 2.2 f-divergence

It is well known that maximizing the expected proper score is equivalent to minimizing a corresponding Bregman divergence (Gneiting and Raftery, 2007). More generically, we take the perspective that divergence functions have great potential to serve as scoring functions for eliciting samples. Denote the $f$-divergence between two distributions $\mathbb{Q}$ and $\mathbb{P}$, with probability density functions $q$ and $p$, as $D_f(q\|p)$:

 $D_f(q\|p) = \int p(x)\, f\Bigl(\frac{q(x)}{p(x)}\Bigr) \,\mathrm{d}\mu.$

Here $f$ is a function satisfying certain regularity conditions, which will be specified in the following section. Solving our elicitation problem involves evaluating the value of $D_f(q\|p)$ successively based on the distributions $\mathbb{Q}$ and $\mathbb{P}$, without knowing the probability density functions $q$ and $p$. Therefore, we have to resort to a form of $D_f(q\|p)$ that does not involve the exact forms of $q$ and $p$, but instead uses samples. Following Fenchel's convex duality, it holds that

 $D_f(q\|p) = \max_t\; \mathbb{E}_{x\sim\mathbb{Q}}[t(x)] - \mathbb{E}_{x\sim\mathbb{P}}[f^\dagger(t(x))], \qquad (1)$

where $\mathbb{Q}$ and $\mathbb{P}$ correspond to the distributions with probability density functions $q$ and $p$, and $f^\dagger$ is the Fenchel dual of $f$, defined as $f^\dagger(u) = \sup_v \{uv - f(v)\}$, and the max is taken over all functions $t$.
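As a quick numerical sanity check of the duality (1) (our own illustration for the KL case $f(u) = u\log u$, whose Fenchel dual is $f^\dagger(t) = e^{t-1}$, a standard fact), plugging the optimizer $t^*(x) = f'(q(x)/p(x)) = \log(q(x)/p(x)) + 1$ into the dual objective recovers the divergence exactly:

```python
import numpy as np

# Verify the variational (Fenchel dual) form of KL divergence on a 3-point
# space: D_KL(Q||P) = max_t { E_Q[t] - E_P[exp(t - 1)] }, attained at
# t*(x) = log(q(x)/p(x)) + 1.
p = np.array([0.5, 0.3, 0.2])   # density of P
q = np.array([0.4, 0.4, 0.2])   # density of Q on the same support

kl_direct = np.sum(q * np.log(q / p))            # D_KL(Q || P) directly
t_star = np.log(q / p) + 1.0                     # dual maximizer t*
dual_value = np.sum(q * t_star) - np.sum(p * np.exp(t_star - 1.0))
print(kl_direct, dual_value)   # the two values coincide
```

Any sub-optimal $t$ gives a strictly smaller dual objective, which is what makes the variational form usable as an estimation target.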

## 3 Sample Elicitation: A Generative Adversarial Approach

Recall from (1) that $D_f(q\|p)$ admits the following variational form:

 $D_f(q\|p) = \max_t\; \mathbb{E}_{x\sim\mathbb{Q}}[t(x)] - \mathbb{E}_{x\sim\mathbb{P}}[f^\dagger(t(x))]. \qquad (2)$

We highlight that, via a functional derivative argument, the above variational form is solved by $t^*(x) = f'(q(x)/p(x))$, where $q(x)/p(x)$ is the density ratio between $q$ and $p$. Our elicitation builds upon the variational form (2) and the following estimators:

 $\hat t(x; p, q) = \mathop{\arg\min}_t\; \mathbb{E}_{x\sim\mathbb{P}_n}[f^\dagger(t(x))] - \mathbb{E}_{x\sim\mathbb{Q}_n}[t(x)], \qquad \hat D_f(q\|p) = \mathbb{E}_{x\sim\mathbb{Q}_n}[\hat t(x)] - \mathbb{E}_{x\sim\mathbb{P}_n}[f^\dagger(\hat t(x))].$
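A minimal sample-based sketch of these estimators (our own illustrative construction, not the paper's deep-network version: on a finite outcome space, with $f(u) = u\log u$, the minimization over all functions $t$ decouples pointwise and has a closed-form solution, so no network is needed):

```python
import numpy as np

# On a finite space the dual objective decouples pointwise, so the empirical
# minimizer is t̂(x) = log(q_n(x)/p_n(x)) + 1, and the resulting divergence
# estimate is the plug-in empirical KL divergence.
rng = np.random.default_rng(0)
outcomes = np.arange(3)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
n = 200_000

x_p = rng.choice(outcomes, size=n, p=p)      # samples from P
x_q = rng.choice(outcomes, size=n, p=q)      # samples from Q
p_n = np.bincount(x_p, minlength=3) / n      # empirical density of P
q_n = np.bincount(x_q, minlength=3) / n      # empirical density of Q

t_hat = np.log(q_n / p_n) + 1.0              # pointwise dual maximizer
d_hat = np.sum(q_n * t_hat) - np.sum(p_n * np.exp(t_hat - 1.0))
d_true = np.sum(q * np.log(q / p))
print(d_hat, d_true)  # the estimate is close to the true KL divergence
```

In the continuous, high-dimensional setting of the paper this pointwise trick is unavailable, which is exactly why the minimization is carried out over a deep neural network family instead.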

### 3.1 Concentration and assumptions

Suppose we have the following concentration bound for estimating $D_f(q\|p)$: for any probability density functions $q$ and $p$, it holds with probability at least $1 - \delta(n)$ that

 $\bigl|\hat D_f(q\|p) - D_f(q\|p)\bigr| \le \epsilon(n). \qquad (3)$

This concentration bound will be established based on the following assumptions. {assumption}[Bounded Density Ratio] We assume that the density ratio is bounded from above and below, i.e., $0 < c_0 \le q(x)/p(x) \le c_1$ holds for some absolute constants $c_0$ and $c_1$.

The above assumption is rather standard in the related literature (Nguyen et al., 2010; Suzuki et al., 2008); it requires that the probability density functions $q$ and $p$ have the same support. For simplicity, we assume this support is $\Omega \subset \mathbb{R}^d$. We define the $\beta$-Hölder function class on $\Omega$ as follows. {definition}[$\beta$-Hölder Function Class] The $\beta$-Hölder function class with radius $M$ is defined as

 $\mathcal{C}_d^\beta(\Omega, M) = \Bigl\{t: \Omega\subset\mathbb{R}^d\to\mathbb{R} : \sum_{\|\alpha\|_1 < \beta} \|\partial^\alpha t\|_\infty + \sum_{\|\alpha\|_1 = \lfloor\beta\rfloor}\, \sup_{x,y\in\Omega,\, x\neq y} \frac{|\partial^\alpha t(x) - \partial^\alpha t(y)|}{\|x-y\|_\infty^{\beta - \lfloor\beta\rfloor}} \le M \Bigr\},$

where $\partial^\alpha = \partial^{\alpha_1}\cdots\partial^{\alpha_d}$ with $\alpha = (\alpha_1, \ldots, \alpha_d) \in \mathbb{N}^d$.

We assume that the function $t^*$ in (2) is $\beta$-Hölder, which characterizes its smoothness. {assumption}[$\beta$-Hölder Condition] We assume that $t^* \in \mathcal{C}_d^\beta(\Omega, M)$ for some positive absolute constant $M$, where $\mathcal{C}_d^\beta(\Omega, M)$ is the $\beta$-Hölder function class in Definition 3.1.

In addition, we assume that the following regularity conditions hold for the function $f$.

{assumption}

[Regularity of Divergence Function] The function $f$ is smooth on $[c_0, c_1]$ and $f(1) = 0$. Also,

• $f$ is $\mu_0$-strongly convex on $[c_0, c_1]$, where $\mu_0$ is a constant;

• $f$ has $L_0$-Lipschitz continuous gradient on $[c_0, c_1]$, where $L_0$ is a constant.

We highlight that we only require these conditions to hold on the interval $[c_0, c_1]$ in Assumption 3.3, where the constants $c_0$ and $c_1$ are specified in Assumption 3.1. Thus, the above assumptions are mild, and they hold for many functions commonly used to define divergences. For example, for the Kullback-Leibler (KL) divergence, we take $f(u) = u\log u$, which satisfies Assumption 3.3; for the Jensen-Shannon divergence, we take $f(u) = u\log u - (u+1)\log\frac{u+1}{2}$, which also satisfies Assumption 3.3.
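The KL case can be checked by hand: for $f(u) = u\log u$ we have $f''(u) = 1/u$, so on $[c_0, c_1]$ the strong convexity modulus is $1/c_1$ and the gradient Lipschitz constant is $1/c_0$. A small numerical confirmation (the interval endpoints below are illustrative choices):

```python
import numpy as np

# For f(u) = u*log(u), f''(u) = 1/u.  On [c0, c1] this second derivative is
# bounded below by 1/c1 (strong convexity) and above by 1/c0 (Lipschitz
# gradient), matching the regularity assumption on a bounded density ratio.
c0, c1 = 0.5, 2.0
u = np.linspace(c0, c1, 1001)
f_second = 1.0 / u                  # f''(u) on a fine grid over [c0, c1]
mu_0 = f_second.min()               # strong convexity modulus = 1/c1
L_0 = f_second.max()                # Lipschitz constant of f' = 1/c0
print(mu_0, L_0)                    # 0.5 and 2.0
```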

We will show that under Assumptions 3.1, 3.2, and 3.3, the bound (3) holds. See Theorem 4 in Section 4 for details.

### 3.2 Multi-sample elicitation with ground truth samples

In this setting, as a reminder, the agent reports multiple samples. After the agent has reported his samples, the mechanism designer has a set of ground truth samples to serve the purpose of evaluation. This falls into the standard strictly proper scoring rule setting.

Our mechanism is presented in Algorithm 1. It consists of two steps: the first is to compute $\hat D_f$, which enables us, in the second step, to pay the agent using a linearly transformed estimated divergence between the reported samples and the true samples.

And we have the following results.

{theorem}

The $f$-scoring mechanism in Algorithm 1 achieves $(\epsilon(n), \delta(n))$-properness. The proof is mainly based on the concentration of the $f$-divergence estimator and the non-negativity of the $f$-divergence. Not surprisingly, if the agent believes his samples are generated from the same distribution as the ground truth samples, and that our estimator can well characterize the difference between the two sets of samples, he is incentivized to report truthfully to minimize the difference. We defer the proof to Section B.1.
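The payment structure can be sketched as follows (a hypothetical sketch: Algorithm 1 itself is not reproduced in this excerpt, the finite outcome space and the plug-in KL estimator stand in for the paper's deep-network estimator, and the constants `a`, `b` are illustrative choices):

```python
import numpy as np

# Sketch of a divergence-based payment rule: pay a - b * D̂_f between the
# empirical distributions of reported and ground truth samples, so reports
# closer in distribution to the ground truth earn more.
def estimated_kl(reported: np.ndarray, truth: np.ndarray, k: int) -> float:
    """Plug-in KL estimate on a finite outcome space {0, ..., k-1}
    (tiny additive smoothing avoids log(0) on unseen outcomes)."""
    q_n = (np.bincount(reported, minlength=k) + 1e-9) / len(reported)
    p_n = (np.bincount(truth, minlength=k) + 1e-9) / len(truth)
    return float(np.sum(q_n * np.log(q_n / p_n)))

def payment(reported, truth, k, a=1.0, b=0.1):
    return a - b * estimated_kl(reported, truth, k)

rng = np.random.default_rng(1)
truth = rng.choice(3, size=10_000, p=[0.5, 0.3, 0.2])
honest = rng.choice(3, size=10_000, p=[0.5, 0.3, 0.2])   # truthful reports
skewed = rng.choice(3, size=10_000, p=[0.1, 0.1, 0.8])   # misreports
print(payment(honest, truth, 3) > payment(skewed, truth, 3))  # True
```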

### 3.3 Single-task elicitation without ground truth samples

The above mechanism, while intuitive, has two caveats:

• The agent needs to report multiple samples (multi-task/sample elicitation);

• Multiple samples from the ground truth distribution are needed.

Now consider single-point elicitation in an elicitation-without-verification setting. Suppose there are $2n$ agents, each holding a sample (the choice of $2n$ agents is simply for exposition). We randomly partition the agents into two groups, and denote the joint densities of the two groups' samples as $p$ and $q$, with corresponding distributions $\mathbb{P}$ and $\mathbb{Q}$. Correspondingly, there is a set of $n$ agents for each group, who are required to report their single data points according to the two distributions $\mathbb{P}$ and $\mathbb{Q}$. As an interesting note, this is also similar to the setup of a Generative Adversarial Network (GAN): one distribution corresponds to a generative distribution, and the other to the data distribution. This is a connection that we will further explore in Section 5 to recover distributions from elicited samples.

Denote the joint density of $p$ and $q$ as $p \oplus q$ (with distribution $\mathbb{P}\oplus\mathbb{Q}$), and the product of the marginal densities as $p \times q$ (with distribution $\mathbb{P}\times\mathbb{Q}$). Consider the divergence between the two distributions:

 $D_f(p\oplus q \,\|\, p\times q) = \max_t\; \mathbb{E}_{\mathbf{x}\sim\mathbb{P}\oplus\mathbb{Q}}[t(\mathbf{x})] - \mathbb{E}_{\mathbf{x}\sim\mathbb{P}\times\mathbb{Q}}[f^\dagger(t(\mathbf{x}))].$

The results below connect mutual information with divergence functions. The most famous is the relationship between the KL divergence and mutual information, but a generic connection between a generalized $f$-mutual information and the $f$-divergence holds too. {definition}[Kong and Schoenebeck (2019)] The generalized $f$-mutual information between $p$ and $q$ is defined as the $f$-divergence between the joint distribution $p\oplus q$ and the product of marginal distributions $p\times q$:

 $I_f(p; q) = D_f(p\oplus q \,\|\, p\times q).$

Further, it is shown in Kong and Schoenebeck (2018, 2019) that the data processing inequality for mutual information holds for $I_f$ when $f$ is strictly convex. Again define the following estimators (we use $\mathbf{x}$ to denote a sample drawn from the joint distribution):

 $\hat t(\mathbf{x}; p\oplus q, p\times q) = \mathop{\arg\min}_t\; \mathbb{E}_{\mathbf{x}\sim\mathbb{P}_n\times\mathbb{Q}_n}[f^\dagger(t(\mathbf{x}))] - \mathbb{E}_{\mathbf{x}\sim\mathbb{P}_n\oplus\mathbb{Q}_n}[t(\mathbf{x})], \qquad (4)$

 $\hat D_f(p\oplus q \,\|\, p\times q) = \mathbb{E}_{\mathbf{x}\sim\mathbb{P}_n\oplus\mathbb{Q}_n}[\hat t(\mathbf{x})] - \mathbb{E}_{\mathbf{x}\sim\mathbb{P}_n\times\mathbb{Q}_n}[f^\dagger(\hat t(\mathbf{x}))].$

Recall that $\mathbb{P}_n$ and $\mathbb{Q}_n$ are the empirical distributions of the reported samples. We denote by $\mathbb{Q} \mid x_i$ the conditional distribution when the first variable is fixed with realization $x_i$. Our mechanism is presented in Algorithm 2. Similar to Algorithm 1, the main step is to estimate the divergence between $\mathbb{P}\oplus\mathbb{Q}$ and $\mathbb{P}\times\mathbb{Q}$ using the reported samples. Then we pay agents using a linearly transformed form of it.
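As a concrete illustration of the generalized $f$-mutual information defined above (our own toy example on a discrete joint distribution; with $f(u) = u\log u$ it reduces to Shannon mutual information):

```python
import numpy as np

# Compute I_f(p; q) = D_f(p ⊕ q || p × q) for the KL case on a 2x2 joint
# distribution: the divergence of the joint from the product of its marginals,
# i.e. Shannon mutual information, which is positive iff X and Y are dependent.
joint = np.array([[0.30, 0.10],
                  [0.10, 0.50]])          # joint distribution of (X, Y)
marg_x = joint.sum(axis=1)
marg_y = joint.sum(axis=0)
product = np.outer(marg_x, marg_y)        # product of marginals

mi = np.sum(joint * np.log(joint / product))   # I_f = D_KL(joint || product)
print(mi)  # strictly positive, since X and Y are dependent
```

If `joint` were exactly `product` (independent coordinates), the $f$-mutual information would be zero, which is why correlation between truthful peer reports is what the mechanism can reward.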

And we have the following results. {theorem} The $f$-scoring mechanism in Algorithm 2 achieves an $(\epsilon(n), \delta(n))$-BNE. The theorem is proved using our concentration results in estimating the $f$-divergence, a max argument, and the data processing inequality for $f$-mutual information. We defer the proof to Section B.2.

The job left for us is to estimate the divergence functions as accurately as possible, so as to reduce $\epsilon(n)$ and $\delta(n)$. Roughly speaking, if we solve the optimization problem (4) using deep neural networks with a proper structure, it holds that $\epsilon(n) = c\cdot n^{-2\beta/(2\beta+d)}\cdot\log^7 n$ and $\delta(n) = \exp\{-n^{(d-2\beta)/(2\beta+d)}\cdot\log^{14} n\}$, where $c$ is a positive absolute constant. We state and prove this result formally in Section 4.

Several remarks follow: {remark} (1) As the number of samples $n$ grows, $\epsilon(n)$ and $\delta(n)$ decrease at least polynomially fast, and our guaranteed approximate incentive compatibility approaches a strict one. (2) Our framework handles arbitrarily complex information: $x$ can be sampled from a high-dimensional continuous space. (3) The score function requires no prior knowledge: we design estimation methods purely based on reported sample data. (4) Our framework also covers the case where the mechanism designer has no access to ground truth, which adds a contribution to the peer prediction literature. So far, peer prediction results have focused on eliciting simple categorical information. Besides handling complex information structures, our approach can also be viewed as a data-driven mechanism for peer prediction problems.

## 4 Estimation of f-divergence

In this section, we introduce an estimator of the $f$-divergence and establish its statistical rate of convergence, which characterizes $\epsilon(n)$ and $\delta(n)$. For ease of exposition, in the sequel we estimate the $f$-divergence between distributions $\mathbb{P}$ and $\mathbb{Q}$ with probability density functions $p$ and $q$, respectively. The rate of convergence for estimating the $f$-divergence easily extends to that of the mutual information.

Following Section 2.2, estimating the $f$-divergence between $q$ and $p$ is equivalent to solving the following optimization problem:

 $t^*(x; p, q) = \mathop{\arg\min}_t\; \mathbb{E}_{x\sim\mathbb{P}}[f^\dagger(t(x))] - \mathbb{E}_{x\sim\mathbb{Q}}[t(x)], \qquad D_f(q\|p) = \mathbb{E}_{x\sim\mathbb{Q}}[t^*(x; p, q)] - \mathbb{E}_{x\sim\mathbb{P}}[f^\dagger(t^*(x; p, q))]. \qquad (5)$

In what follows, we propose an estimator of $D_f(q\|p)$. By Assumption 3.2, it suffices to solve (5) over the function class $\mathcal{C}_d^\beta(\Omega, M)$. To this end, we approximate the solution to (5) by a family of deep neural networks.

We now define the family of deep neural networks.

{definition}

Given a vector $k = (k_0, k_1, \ldots, k_{L+1})$, where $k_0 = d$ and $k_{L+1} = 1$, the family of deep neural networks is defined as

 $\Phi(L, k) = \bigl\{\varphi(x; W, v) = W_{L+1}\,\sigma_{v_L}\cdots W_2\,\sigma_{v_1} W_1 x : W_j \in \mathbb{R}^{k_j\times k_{j-1}},\; v_j \in \mathbb{R}^{k_j}\bigr\}.$

Here we write $\sigma_v(x)$ for $\sigma(x - v)$ for notational convenience, where $\sigma(u) = \max\{u, 0\}$ is the ReLU activation function applied entrywise.
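The composition above can be sketched in a few lines of numpy (an illustrative forward pass only; the layer shapes and random weights below are our own choices, and the shifts $v_j$ are set to zero):

```python
import numpy as np

# Evaluate φ(x; W, v) = W_3 σ_{v_2} W_2 σ_{v_1} W_1 x, where σ_v(y) is the
# shifted ReLU max(y - v, 0) applied entrywise between linear maps.
def forward(x, weights, shifts):
    """Forward pass through the network family Φ(L, k)."""
    h = weights[0] @ x                    # first linear map W_1 x
    for W, v in zip(weights[1:], shifts):
        h = W @ np.maximum(h - v, 0.0)    # shifted ReLU, then next linear map
    return h

d, k1, k2 = 3, 4, 4                       # input dimension and hidden widths
rng = np.random.default_rng(0)
weights = [rng.standard_normal((k1, d)),
           rng.standard_normal((k2, k1)),
           rng.standard_normal((1, k2))]  # W_1, W_2, W_3 (L = 2 hidden layers)
shifts = [np.zeros(k1), np.zeros(k2)]     # v_1, v_2
out = forward(rng.standard_normal(d), weights, shifts)
print(out.shape)  # (1,), matching k_{L+1} = 1
```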

To avoid overfitting, sparsity of the deep neural networks is typically assumed in the deep learning literature. In practice, such a sparsity property is achieved through certain techniques, e.g., dropout (Srivastava et al., 2014), or certain network architectures, e.g., the convolutional neural network (Krizhevsky et al., 2012). We now define the family of sparse networks as follows:

 $\Phi_M(L, k, s) = \Bigl\{\varphi(x; W, v) \in \Phi(L, k) : \|\varphi\|_\infty \le M,\; \|W_j\|_\infty \le 1 \text{ for } j\in[L+1],\; \|v_j\|_\infty \le 1 \text{ for } j\in[L],\; \sum_{j=1}^{L+1}\|W_j\|_0 + \sum_{j=1}^{L}\|v_j\|_0 \le s\Bigr\}, \qquad (6)$

where $s$ is the sparsity. In contrast, another approach to avoiding overfitting is to control the norm of the parameters. See Section A.2 for details.

We now propose the following estimators:

 $\hat t(x; p, q) = \mathop{\arg\min}_{t\in\Phi_M(L,k,s)}\; \mathbb{E}_{x\sim\mathbb{P}_n}[f^\dagger(t(x))] - \mathbb{E}_{x\sim\mathbb{Q}_n}[t(x)], \qquad \hat D_f(q\|p) = \mathbb{E}_{x\sim\mathbb{Q}_n}[\hat t(x; p, q)] - \mathbb{E}_{x\sim\mathbb{P}_n}[f^\dagger(\hat t(x; p, q))]. \qquad (7)$

The following theorem characterizes the statistical rate of convergence of the estimators defined in (7).

{theorem}

Let the depth $L$, width vector $k$, and sparsity $s$ in (7) be chosen to scale suitably with the sample size $n$. Under Assumptions 3.1, 3.2, and 3.3, it holds with probability at least $1 - \exp\{-n^{(d-2\beta)/(2\beta+d)}\cdot\log^{14} n\}$ that

 $\bigl|D_f(q\|p) - \hat D_f(q\|p)\bigr| \lesssim n^{-\frac{2\beta}{2\beta+d}}\log^7 n.$

We defer the proof of the theorem to Section B.3. By Theorem 4, the estimators in (7) achieve the optimal nonparametric rate of convergence (Stone, 1982) up to a logarithmic term. By (3) and Theorem 4, we have

 $\delta(n) = \exp\bigl\{-n^{(d-2\beta)/(2\beta+d)}\cdot\log^{14} n\bigr\}, \qquad \epsilon(n) = c\cdot n^{-2\beta/(2\beta+d)}\cdot\log^7 n,$

where $c$ is a positive absolute constant.
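To get a feel for this rate (an illustrative computation: the smoothness $\beta = 2$, dimension $d = 4$, and constant $c = 1$ below are arbitrary choices, not values from the paper):

```python
import numpy as np

# Evaluate ε(n) = c * n^{-2β/(2β+d)} * log^7(n) at a few sample sizes.  The
# polynomial factor eventually dominates the large log^7(n) factor, so the
# bound shrinks for sufficiently large n.
def eps(n, beta=2.0, d=4.0, c=1.0):
    return c * n ** (-2 * beta / (2 * beta + d)) * np.log(n) ** 7

rates = [eps(n) for n in (10**4, 10**10, 10**16)]
print(rates)  # decreasing across these sample sizes
```

Note that for moderate $n$ the $\log^7 n$ factor is substantial, so the asymptotic rate should be read as a large-sample statement.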

## 5 Connection to GAN and Reconstruction of Distribution

After sample elicitation, a natural question to ask is how to learn a representative probability density function from the samples. Denote by $p$ the probability density function of the elicited samples. Then learning the probability density function amounts to solving

 $q^* = \mathop{\arg\min}_{q\in\mathcal{Q}} D_f(q\|p), \qquad (8)$

where $\mathcal{Q}$ is the probability density function space.

To see the connection between (8) and the formulation of $f$-GAN (Nowozin et al., 2016), we combine (1) and (8) to obtain

 $q^* = \mathop{\arg\min}_{q\in\mathcal{Q}} \max_t\; \mathbb{E}_{x\sim\mathbb{Q}}[t(x)] - \mathbb{E}_{x\sim\mathbb{P}}[f^\dagger(t(x))],$

which is the formulation of $f$-GAN. Here the probability density function $q$ plays the role of the generator, while the function $t$ plays the role of the discriminator.
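A toy version of this min-max problem (our own illustration, not the paper's procedure: on a finite space the inner maximization over $t$ has the closed form $D_{\mathrm{KL}}(q\|p)$, and the candidate family for the outer minimization is hypothetical):

```python
import numpy as np

# min over q of max over t of { E_Q[t] - E_P[f†(t)] }: with the inner max
# solved in closed form as D_KL(q || p), the outer minimization over a small
# candidate family should select the candidate equal to the target p.
p = np.array([0.5, 0.3, 0.2])                 # target density
candidates = [np.array([0.5, 0.3, 0.2]),
              np.array([1/3, 1/3, 1/3]),
              np.array([0.2, 0.3, 0.5])]      # illustrative generator family

def kl(q, p):
    return float(np.sum(q * np.log(q / p)))   # inner max, in closed form

best = min(candidates, key=lambda q: kl(q, p))
print(best)  # the candidate equal to p itself
```

This mirrors the statement below that, by non-negativity of the $f$-divergence, the target density itself minimizes (8) whenever it lies in $\mathcal{Q}$.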

By the non-negativity of the $f$-divergence, $q^* = p$ solves (8) when $p \in \mathcal{Q}$. We now propose the following estimator:

 $\hat q = \mathop{\arg\min}_{q\in\mathcal{Q}} \hat D_f(q\|p), \qquad (9)$

where $\hat D_f(q\|p)$ is given in (7).

We define the covering number as follows. {definition}[Covering Number] Let $(\mathcal{V}, \|\cdot\|)$ be a normed space and $\Theta \subset \mathcal{V}$. We say that $\{v_1, \ldots, v_N\}$ is a $\delta$-covering of $\Theta$ of size $N$ if $\Theta \subset \cup_{i\in[N]} B_\delta(v_i)$, where $B_\delta(v_i)$ is the $\delta$-ball centered at $v_i$. The covering number $N(\delta, \Theta, \|\cdot\|)$ is defined as the minimal size of a $\delta$-covering of $\Theta$.

We impose the following assumption on the covering number of the probability density function space .

{assumption}

The covering number $N(\delta, \mathcal{Q}, \|\cdot\|_\infty)$ of the probability density function space $\mathcal{Q}$ is suitably bounded.

Recall that $q^*$ is the unique minimizer of problem (8). Therefore, the $f$-divergence $D_f(\hat q\|p)$ characterizes the deviation of $\hat q$ from $p$. The following theorem characterizes the error of estimating $q^*$ by $\hat q$. {theorem} Under the same assumptions as in Theorem 4 and Assumption 5, for a sufficiently large sample size $n$, it holds with probability at least $1 - \delta(n)$ that

 $D_f(\hat q\|p) \lesssim n^{-\frac{2\beta}{2\beta+d}}\cdot\log^7 n + \min_{\tilde q\in\mathcal{Q}} D_f(\tilde q\|p). \qquad (10)$

We defer the proof of the theorem to Section B.4.

In Theorem 5, the first term on the right-hand side of (10) characterizes the generalization error of the estimator $\hat q$ in (9), while the second term characterizes the approximation error. If the approximation error in (10) vanishes, then the estimator $\hat q$ converges to the true density function at the optimal nonparametric rate of convergence (Stone, 1982), up to a logarithmic term.

## 6 Concluding Remarks

In this work, we introduce the problem of sample elicitation as an alternative to eliciting a complicated distribution. Our elicitation mechanism leverages the variational form of $f$-divergence functions to achieve accurate estimation of the divergences using samples. We provide theoretical guarantees for both our estimators and the achieved incentive compatibility.

It remains an interesting open problem to find more “organic” mechanisms for sample elicitation that require (i) fewer elicited samples and (ii) strict truthfulness instead of approximate truthfulness.

## References

• J. D. Abernethy and R. M. Frongillo (2012) A characterization of scoring rules for linear properties. In Conference on Learning Theory, pp. 27–1. Cited by: §1.1.
• M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875. Cited by: §1.1.
• S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang (2017) Generalization and equilibrium in generative adversarial nets (gans). In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 224–232. Cited by: §1.1.
• M. G. Bellemare, I. Danihelka, W. Dabney, S. Mohamed, B. Lakshminarayanan, S. Hoyer, and R. Munos (2017) The Cramér distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743. Cited by: §1.1.
• G. W. Brier (1950) Verification of forecasts expressed in terms of probability. Monthly Weather Review 78 (1), pp. 1–3. Cited by: §1.1, §1.
• M. Broniatowski and A. Keziou (2004) Parametric estimation and tests through divergences. Technical report Citeseer. Cited by: §1.1.
• M. Broniatowski and A. Keziou (2009) Parametric estimation and tests through divergences and the duality technique. Journal of Multivariate Analysis 100 (1), pp. 16–36. Cited by: §1.1.
• L. De Alfaro, M. Shavlovsky, and V. Polychronopoulos (2016) Incentives for truthful peer grading. arXiv preprint arXiv:1604.03178. Cited by: §1.1.
• J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1.
• M. D. Donsker and S. S. Varadhan (1975) Asymptotic evaluation of certain markov process expectations for large time, i. Communications on Pure and Applied Mathematics 28 (1), pp. 1–47. Cited by: §1.1.
• R. Frongillo and I. A. Kash (2015a) Vector-valued property elicitation. In Conference on Learning Theory, pp. 710–727. Cited by: §1.1, §1.
• R. Frongillo and I. Kash (2015b) On elicitation complexity. In Advances in Neural Information Processing Systems, pp. 3258–3266. Cited by: §1.1.
• A. Gao, J. R. Wright, and K. Leyton-Brown (2016) Incentivizing evaluation via limited access to ground truth: peer-prediction makes things worse. arXiv preprint arXiv:1606.07042. Cited by: §1.1.
• T. Gneiting and A. E. Raftery (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102 (477), pp. 359–378. Cited by: §1.1, §1, §2.2.
• I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.1.
• I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777. Cited by: §1.1.
• V. R. Jose, R. F. Nau, and R. L. Winkler (2006) Scoring rules, generalized entropy and utility maximization. Note: Working Paper, Fuqua School of Business, Duke University Cited by: §1.1, §1.
• T. Kanamori, T. Suzuki, and M. Sugiyama (2011) -Divergence estimation and two-sample homogeneity test under semiparametric density-ratio models. IEEE Transactions on Information Theory 58 (2), pp. 708–720. Cited by: §1.1.
• Y. Kong, K. Ligett, and G. Schoenebeck (2016) Putting peer prediction under the micro (economic) scope and making truth-telling focal. In International Conference on Web and Internet Economics, pp. 251–264. Cited by: §1.1.
• Y. Kong and G. Schoenebeck (2018) Water from two rocks: maximizing the mutual information. In Proceedings of the 2018 ACM Conference on Economics and Computation, pp. 177–194. Cited by: §1.1, §3.3.
• Y. Kong and G. Schoenebeck (2019) An information theoretic framework for designing information elicitation mechanisms that reward truth-telling. ACM Transactions on Economics and Computation (TEAC) 7 (1), pp. 2. Cited by: §B.2, §B.2, §1.1, §3.3.
• A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §4.
• S. Kullback and R. A. Leibler (1951) On information and sufficiency. The annals of mathematical statistics 22 (1), pp. 79–86. Cited by: §1.1.
• N.S. Lambert, D.M. Pennock, and Y. Shoham (2008) Eliciting properties of probability distributions. In Proceedings of the 9th ACM Conference on Electronic Commerce, EC ’08, pp. 129–138. Cited by: §1.1, §1.
• Y. K. Lee and B. U. Park (2006) Estimation of kullback–leibler divergence by local likelihood. Annals of the Institute of Statistical Mathematics 58 (2), pp. 327–340. Cited by: §1.1.
• X. Li, J. Lu, Z. Wang, J. Haupt, and T. Zhao (2018) On tighter generalization bound for deep neural networks: cnns, resnets, and beyond. arXiv preprint arXiv:1806.05159. Cited by: §B.6.
• T. Liang (2018) On how well generative adversarial networks learn densities: nonparametric and parametric results. arXiv preprint arXiv:1811.03179. Cited by: §1.1.
• S. Liu, O. Bousquet, and K. Chaudhuri (2017) Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems, pp. 5545–5553. Cited by: §1.1.
• J. E. Matheson and R. L. Winkler (1976) Scoring rules for continuous probability distributions. Management Science 22 (10), pp. 1087–1096. Cited by: §1.1, §1.
• J. McCarthy (1956) Measures of the value of information. PNAS: Proceedings of the National Academy of Sciences of the United States of America 42 (9), pp. 654–655. Cited by: §1.1.
• M. Mohri, A. Rostamizadeh, and A. Talwalkar (2018) Foundations of machine learning. MIT press. Cited by: §B.6.
• X. Nguyen, M. J. Wainwright, and M. I. Jordan (2010) Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory 56 (11), pp. 5847–5861. Cited by: §1.1, §3.1.
• S. Nowozin, B. Cseke, and R. Tomioka (2016) f-GAN: training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279. Cited by: §1.1, §5.
• A. Ruderman, M. Reid, D. García-García, and J. Petterson (2012) Tighter variational representations of f-divergences via restriction to probability measures. arXiv preprint arXiv:1206.4664. Cited by: §1.1.
• L. J. Savage (1971) Elicitation of personal probabilities and expectations. Journal of the American Statistical Association 66 (336), pp. 783–801. Cited by: §1.1, §1.
• J. Schmidt-Hieber (2017) Nonparametric regression using deep neural networks with ReLU activation function. arXiv preprint arXiv:1708.06633. Cited by: Appendix D.
• N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §4.
• I. Steinwart, C. Pasin, R. Williamson, and S. Zhang (2014) Elicitation and identification of properties. In Conference on Learning Theory, pp. 482–526. Cited by: §1.1, §1.
• C. J. Stone (1982) Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, pp. 1040–1053. Cited by: §4, §5.
• M. Sugiyama, T. Suzuki, and T. Kanamori (2012) Density ratio estimation in machine learning. Cambridge University Press. Cited by: §1.1.
• T. Suzuki, M. Sugiyama, J. Sese, and T. Kanamori (2008) Approximating mutual information by maximum likelihood density ratio estimation. In New challenges for feature selection in data mining and knowledge discovery, pp. 5–20. Cited by: §3.1.
• S. A. van de Geer (2000) Empirical processes in M-estimation. Vol. 6, Cambridge University Press. Cited by: Appendix D.
• Q. Wang, S. R. Kulkarni, and S. Verdú (2005) Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Transactions on Information Theory 51 (9), pp. 3064–3074. Cited by: §1.1.
• Q. Wang, S. R. Kulkarni, and S. Verdú (2009) Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Transactions on Information Theory 55 (5), pp. 2392–2405. Cited by: §1.1.
• R. L. Winkler (1969) Scoring rules and the evaluation of probability assessors. Journal of the American Statistical Association 64 (327), pp. 1073–1078. Cited by: §1.1, §1.
• X. Zhou (2018) On the fenchel duality between strong convexity and lipschitz continuous gradient. arXiv preprint arXiv:1803.06573. Cited by: Appendix D.

## Appendix A Auxiliary Analysis

### A.1 Auxiliary Results on Sparsity Control

In this section, we provide some auxiliary results on (4). We first state an oracle inequality characterizing the rate of convergence of the estimator $\hat{t}$.

###### Theorem A.1.

Given $\varepsilon$, for any sample size $n$ satisfying the required condition, under Assumptions 3.1, 3.1, and 3.1, it holds that

$$\|\hat{t}-t^*\|_{L_2(\mathbb{P})} \lesssim \min_{\tilde{t}\in\Phi_M(L,k,s)}\|\tilde{t}-t^*\|_{L_2(\mathbb{P})} + \gamma\, n^{-1/2}\log n + n^{-1/2}\Bigl[\sqrt{\log(1/\varepsilon)}+\gamma^{-1}\log(1/\varepsilon)\Bigr]$$

with probability at least $1-\varepsilon$.

We defer the proof to Section B.5.

As a by-product, based on the error bound established in Theorem A.1, we obtain the following result.

###### Corollary.

Given $\varepsilon$, under Assumptions 3.1, 3.1, and 3.1, it holds with probability at least $1-\varepsilon$ that

$$\|\hat{\theta}-\theta^*\|_{L_2(\mathbb{P})} \lesssim \min_{\tilde{t}\in\Phi_M(L,k,s)}\|\tilde{t}-t^*\|_{L_2(\mathbb{P})} + \gamma\, n^{-1/2}\log n + n^{-1/2}\Bigl[\sqrt{\log(1/\varepsilon)}+\gamma^{-1}\log(1/\varepsilon)\Bigr].$$


###### Proof.

By Assumption 3.1 and Lemma D, the objective has a Lipschitz continuous gradient, and we obtain the result from Theorem A.1. ∎

### A.2 Error Bound using Norm Control

In this section, we use the norms of the parameters (specifically, the norms of the weight matrices $W_j$ and vectors $v_j$ in (4)) to control the error bound, as an alternative to the network model in (4). We consider the family of $L$-layer neural networks with bounded spectral norms of the weight matrices $\{W_j\}$ and vectors $\{v_j\}$, which is denoted as

$$\Phi_{\mathrm{norm}}=\Phi_{\mathrm{norm}}(L,k,A,B)=\bigl\{\varphi(x;W,v)\in\Phi(L,k):\ \|v_j\|_2\le A_j \text{ for all } j\in[L],\ \|W_j\|_2\le B_j \text{ for all } j\in[L+1]\bigr\}, \tag{11}$$

where $[L]$ is short for $\{1,\ldots,L\}$ for any positive integer $L$. We consider the following program:

$$\hat{t}(x;p,q)=\operatorname*{argmin}_{t\in\Phi_{\mathrm{norm}}}\ \mathbb{E}_{x\sim\mathbb{P}_n}[f^\dagger(t(x))]-\mathbb{E}_{x\sim\mathbb{Q}_n}[t(x)],\qquad \hat{D}_f(q\,\|\,p)=\mathbb{E}_{x\sim\mathbb{Q}_n}[\hat{t}(x;p,q)]-\mathbb{E}_{x\sim\mathbb{P}_n}[f^\dagger(\hat{t}(x;p,q))]. \tag{12}$$
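As a concrete illustration of the program in (12), the following sketch estimates the KL divergence (an $f$-divergence with conjugate-type term $f^\dagger(u)=e^{u-1}$) between two one-dimensional Gaussians from samples. A simple linear-in-features discriminator $t$ stands in for the neural-network family $\Phi_{\mathrm{norm}}$; the particular distributions, feature map, and step size are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

# Variational estimation of D_f, mirroring (12): minimize over t the
# objective E_P[f*(t(x))] - E_Q[t(x)], with f*(u) = exp(u - 1) for KL,
# then report D_hat = E_Q[t_hat] - E_P[f*(t_hat)].
# Illustrative setup: P = N(0, 1), Q = N(0.5, 1), so the true KL is 0.125
# and the optimal t*(x) = 1 + log(q/p)(x) = 0.875 + 0.5 x is linear in x.
rng = np.random.default_rng(0)
n = 50_000
xp = rng.normal(0.0, 1.0, n)   # samples from P
xq = rng.normal(0.5, 1.0, n)   # samples from Q

def feats(x):
    # Linear feature map t(x) = theta[0] + theta[1] * x, a crude stand-in
    # for the neural-network class; here it happens to contain t*.
    return np.stack([np.ones_like(x), x], axis=1)

theta = np.zeros(2)
for _ in range(2000):
    tp = feats(xp) @ theta
    # Gradient ascent on the concave objective E_Q[t] - E_P[exp(t - 1)],
    # i.e. the negation of the minimization objective in (12).
    grad = feats(xq).mean(axis=0) - (np.exp(tp - 1)[:, None] * feats(xp)).mean(axis=0)
    theta += 0.1 * grad

d_hat = (feats(xq) @ theta).mean() - np.exp(feats(xp) @ theta - 1).mean()
print(f"estimated KL = {d_hat:.3f} (true value 0.125)")
```

In the paper's setting, the linear feature map would be replaced by a network from $\Phi_{\mathrm{norm}}$ trained by stochastic gradient methods; the estimator and the plug-in divergence are otherwise the same.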

Based on this formulation, we derive an error bound on the estimated $f$-divergence in the following theorem. We only consider the generalization error in this setting; therefore, we assume that the ground truth $t^*$ lies within $\Phi_{\mathrm{norm}}$. Before stating the theorem, we define two parameters of the family of neural networks as follows:

$$\gamma_1=\prod_{j=1}^{L+1}B_j\cdot\sqrt{\sum_{j=0}^{L+1}k_j^2},\qquad \gamma_2=\frac{L\cdot\Bigl(\sqrt{\sum_{j=1}^{L+1}k_j^2B_j^2}+\sum_{j=1}^{L}A_j\Bigr)}{\sum_{j=0}^{L+1}k_j^2\cdot\min_j B_j^2}. \tag{13}$$

We proceed to state the theorem.

###### Theorem A.2.

We assume that $t^*\in\Phi_{\mathrm{norm}}$. Then for any $\varepsilon>0$, with probability at least $1-\varepsilon$, it holds that

$$\bigl|\hat{D}_f(q\,\|\,p)-D_f(q\,\|\,p)\bigr|\lesssim \gamma_1\cdot n^{-1/2}\log(\gamma_2 n)+\prod_{j=1}^{L+1}B_j\cdot n^{-1/2}\sqrt{\log(1/\varepsilon)}.$$

Here $\gamma_1$ and $\gamma_2$ are defined in (13).

We defer the proof to Section B.6.

The next theorem builds on Theorem A.2. Recall that in Section A.2 we assume that the minimizer of the population version of problem (4) lies within the norm-controlled family of neural networks $\Phi_{\mathrm{norm}}$.

###### Theorem A.3.

Recall that we defined the parameters $\gamma_1$ and $\gamma_2$ of the family of neural networks in (13), the estimated distribution $\hat{q}$ in (9), and the ground truth $q$. We denote the covering number of the probability distribution function class $\mathcal{Q}$ by $N_2[\,\cdot\,,\mathcal{Q}]$. Then for any $\varepsilon>0$, with probability at least $1-\varepsilon$, we have

$$D_f(\hat{q}\,\|\,p)\lesssim b_2(n,\gamma_1,\gamma_2)+\prod_{j=1}^{L+1}B_j\cdot n^{-1/2}\cdot\sqrt{\log\bigl(N_2[b_2(n,\gamma_1,\gamma_2),\mathcal{Q}]/\varepsilon\bigr)}+\min_{\tilde{q}\in\mathcal{Q}}D_f(\tilde{q}\,\|\,p),$$

where $b_2(n,\gamma_1,\gamma_2)=\gamma_1\cdot n^{-1/2}\log(\gamma_2 n)$.

We defer the proof to Section B.7.

## Appendix B Proofs of Theorems

### B.1 Proof of Theorem 3.2

If the agent truthfully reports, he will receive the following expected payment per sample $r_i$: with probability at least $1-\delta(n)$,

$$\begin{aligned}
\mathbb{E}[S(r_i,\cdot)] &:= a-b\bigl(\mathbb{E}_{x\sim\mathbb{Q}_n}[\hat{t}(x)]-\mathbb{E}_{x_i\sim\mathbb{P}_n}[f^\dagger(\hat{t}(x_i))]\bigr) = a-b\cdot\hat{D}_f(q\,\|\,p)\\
&\ \ge a-b\cdot\bigl(D_f(p\,\|\,p)+\epsilon(n)\bigr) = a-b\,\epsilon(n). \qquad\text{(agent believes $p=q$)}
\end{aligned}$$

Similarly, any misreporting according to some other distribution $\tilde{p}$ will lead to the following derivation with probability at least $1-\delta(n)$:

$$\begin{aligned}
\mathbb{E}[S(r_i,\cdot)] &:= a-b\bigl(\mathbb{E}_{x\sim\mathbb{Q}_n}[\hat{t}(x)]-\mathbb{E}_{x_i\sim\tilde{\mathbb{P}}_n}[f^\dagger(\hat{t}(x_i))]\bigr) = a-b\cdot\hat{D}_f(q\,\|\,\tilde{p})\\
&\ \le a-b\cdot\bigl(D_f(p\,\|\,\tilde{p})-\epsilon(n)\bigr)+\delta(n)\cdot\bar{S} \le a+b\,\epsilon(n)+\delta(n)\cdot\bar{S}. \qquad\text{(non-negativity of $D_f$)}
\end{aligned}$$

Combining the bounds above and applying a union bound leads to the claimed properness.
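To make the score's incentive structure concrete, the following sketch evaluates the payment $S = a - b\cdot\hat{D}_f$ from the proof above, with $\hat{D}_f$ replaced by a closed-form KL divergence between Gaussian fits, a crude stand-in for the paper's variational estimator; the distributions, the constants $a$ and $b$, and the Gaussian-fit proxy are all illustrative assumptions.

```python
import numpy as np

def gaussian_kl(mu1, var1, mu0, var0):
    # Closed-form KL( N(mu1, var1) || N(mu0, var0) ).
    return 0.5 * (np.log(var0 / var1) + (var1 + (mu1 - mu0) ** 2) / var0 - 1.0)

def payment(reported, reference, a=1.0, b=0.5):
    # Score S = a - b * D_hat, with D_hat a Gaussian plug-in proxy for
    # the estimated f-divergence between reported and reference samples.
    d_hat = gaussian_kl(reported.mean(), reported.var(),
                        reference.mean(), reference.var())
    return a - b * d_hat

rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, 10_000)   # verifier's reference samples
truthful  = rng.normal(0.0, 1.0, 10_000)   # reports from the true N(0, 1)
misreport = rng.normal(1.0, 1.0, 10_000)   # reports shifted to N(1, 1)

print(payment(truthful, reference), payment(misreport, reference))
```

Truthful reports keep the estimated divergence near zero, so their payment stays near $a$, while the shifted reports incur a divergence penalty, matching the inequality chain in the proof.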

### B.2 Proof of Theorem 3.3

Consider an arbitrary agent $i$, and suppose every other agent reports truthfully.

$$\begin{aligned}
\mathbb{E}[S(r_i,\{r_j\}_{j\neq i})] &= a+b\bigl(\mathbb{E}_{\mathbf{x}\sim\mathbb{P}_n\oplus\mathbb{Q}_n\,|\,r_i}[\hat{t}(\mathbf{x})]-\mathbb{E}_{\mathbf{x}\sim\mathbb{P}_n\times\mathbb{Q}_n\,|\,r_i}[f^\dagger(\hat{t}(\mathbf{x}))]\bigr)\\
&= a+b\,\mathbb{E}\Bigl[\mathbb{E}_{\mathbf{x}\sim\mathbb{P}_n\oplus\mathbb{Q}_n\,|\,r_i}[\hat{t}(\mathbf{x})]-\mathbb{E}_{\mathbf{x}\sim\mathbb{P}_n\times\mathbb{Q}_n\,|\,r_i}[f^\dagger(\hat{t}(\mathbf{x}))]\Bigr].
\end{aligned}$$

Consider the divergence term. Reporting an $r_i$ drawn from a different distribution (denote it by $\tilde{p}$) leads to the following score:

$$\mathbb{E}_{r_i\sim\tilde{\mathbb{P}}_n}\Bigl[\mathbb{E}_{\mathbf{x}\sim\tilde{\mathbb{P}}_n\oplus\mathbb{Q}_n\,|\,r_i}[\hat{t}(\mathbf{x})]-\mathbb{E}_{\mathbf{x}\sim\tilde{\mathbb{P}}_n\times\mathbb{Q}_n\,|\,r_i}[f^\dagger(\hat{t}(\mathbf{x}))]\Bigr] = \mathbb{E}_{\mathbf{x}}