Incentive Compatible Active Learning

Federico Echenique (California Institute of Technology, fede@hss.caltech.edu) and Siddharth Prasad (Carnegie Mellon University, sprasad2@cs.cmu.edu)
Abstract

We consider active learning under incentive compatibility constraints. The main application of our results is to economic experiments, in which a learner seeks to infer the parameters of a subject’s preferences: for example their attitudes towards risk, or their beliefs over uncertain events. By cleverly adapting the experimental design, one can save on the time spent by subjects in the laboratory, or maximize the information obtained from each subject in a given laboratory session; but the resulting adaptive design raises complications due to incentive compatibility. A subject in the lab may answer questions strategically, and not truthfully, so as to steer subsequent questions in a profitable direction.

We analyze two standard economic problems: inference of preferences over risk from multiple price lists, and belief elicitation in experiments on choice over uncertainty. In the first setting, we tune a simple and fast learning algorithm to retain certain incentive compatibility properties. In the second setting, we provide an incentive compatible learning algorithm based on scoring rules with query complexity that differs from obvious methods of achieving fast learning rates only by subpolynomial factors. Thus, for these areas of application, incentive compatibility may be achieved without paying a large sample complexity price.

1 Introduction

We study active learning under incentive compatibility constraints. Consider a learner: Alice, who seeks to elicit the parameters governing the behavior of a human subject: Bob. The chief application of our paper is to the design of laboratory experiments in economics. In such applications, Alice is an experimenter observing choices made by Bob in her laboratory. The active learning paradigm seeks to save on the number of questions posed by Alice by making the formulation of each question dependent on Bob’s answers to previous questions [BBL09, DAS11]. Now, Bob may misrepresent his answers to some of Alice’s questions so as to guide Alice’s line of questioning in a direction that he can benefit from.

Our setting differs from standard applications of active learning in computer science, in that data are labeled by a self-interested human agent (in our story, Bob). Computer scientists have thought of active learning as applied to, for example, combinatorial chemistry, or image detection. A learner then makes queries that are always truthfully answered. In economic settings, in contrast, one must recognize the role of incentives.

The existing literature on applications of passive learning to preference elicitation (see for example [BV06, KAL03, BE18, CP18]) does not have to deal with agents’ incentives to manipulate the learning mechanism, but active learning does, because an agent who understands the learner’s algorithm may answer strategically early on in the experiment so as to influence the questions he faces later in the experiment.

We should emphasize that experimental orthodoxy in economics requires that subjects (such as Bob) know as much as possible about the experimental design. No deception is allowed in economic experiments. In addition, subjects’ participation is almost universally incentivized: Bob gets a payoff that depends on his answers to Alice’s questions. Our model relates to a long-standing interest among economists for adaptive experimental design, see [EMP93, RGK+12, CSW+18, IC16].

Consider a concrete example. Bob has a utility function over money, so that if he faces a random amount of money $x$, his expected utility is $E[x^\sigma]$. In other words, Bob has a utility of the “constant relative risk aversion” (CRRA) form $u(x) = x^\sigma$, and Alice wants to learn the value of the parameter $\sigma$, Bob’s relative risk aversion coefficient. (The coefficient $\sigma$ captures Bob’s willingness to assume risk. It is a parameter that economic experiments very often seek to measure, even when the experiment is ostensibly about a totally different question. Economic experimentalists want to understand the relation between risk and their general experimental findings, so they include risk elicitation as part of the design.) A standard procedure for estimating $\sigma$ is a multiple price list. (Multiple price lists are a very common experimental design, first used by [BIN81], and popularized by [HL02] as a method to estimate $\sigma$, as described here.)

In a multiple-price list (MPL), Alice successively asks Bob to choose between a sure payoff of $x$ dollars and a fixed lottery $L$: for example, a lottery that flips a fair coin and pays $1$ dollar if the coin turns up Heads, and $0$ dollars if it turns up Tails. Alice would first ask Bob to choose between a very small amount $x$ (almost zero) and $L$. Then Alice would raise $x$ a little and ask Bob to choose again. The procedure is repeated, each time increasing the amount $x$, until reaching a number equal to, or close to, 1. At some value $x^*$, Bob would switch from preferring the lottery to preferring the fixed amount of money. Then Alice would solve the equation

 $(x^*)^\sigma = (1/2)\,0^\sigma + (1/2)\,1^\sigma = 1/2$ (1)

to find the value of $\sigma$. Now, it is important to explain how the experiment is incentivized: When the experiment is over, Alice will actually implement one of the choices made by Bob. Conventional experimental methodology dictates [ACH18] that she chooses one of the questions at random and implements it.
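Inverting Equation (1) is a one-line computation. The sketch below (in Python, purely as an illustration; the function name is ours) recovers $\sigma$ from an observed switch point $x^*$, using $\sigma = \ln(1/2)/\ln(x^*)$ for the fair coin-flip lottery over $0$ and $1$ dollars:

```python
import math

def sigma_from_switch_point(x_star: float) -> float:
    """Solve (x*)^sigma = 1/2 for sigma, i.e. Equation (1) for the
    fair coin-flip lottery over 0 and 1 dollars. Assumes 0 < x_star < 1."""
    return math.log(0.5) / math.log(x_star)
```

For instance, a subject who switches at $x^* = 0.25$ is assigned $\sigma = 0.5$.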

A proponent of active learning will immediately remark that the MPL design asks too many questions. Alice only needs to know the value $x^*$ at which Bob is indifferent between the sure amount and the lottery $L$. We can thus imagine an adaptive design, where Alice raises $x$ until Bob switches from $L$ to $x$, and stops the experiment when that happens. This design will result in strictly fewer questions than the passive (supervised) learning design.

Bob, however, understands that Alice stops raising $x$ when he declares indifference to $L$. So he will manipulate Alice into offering him values of $x$ beyond what he truly views as indifferent to $L$. Specifically, suppose that Alice raises $x$ continuously (this is a simplifying assumption; see Section 3 for a realistic version of this design), and that if Bob declares indifference at $x$ then the last question is implemented with probability $p(x)$. The function $p$ is strictly decreasing, since reporting a larger value of $x$ increases the probability that a question for which Bob preferred $L$ will be implemented.

Bob’s payoff from stopping at $x$ is $p(x)u(x) + (1 - p(x))u(L)$, because with probability $p(x)$ the last question gets implemented, so he gets the sure amount $x$, and with the complementary probability one of the other questions is implemented and Bob gets his preference for those questions, namely the lottery $L$. The expected utility of $L$ is $u(L) = 1/2$.

Then it is clear that Bob would like to stop at an $x$ that is strictly greater than the value at which he would truly be indifferent to $L$, the value $x^*$ that solves (1). If he stops at the true value $x^*$, he gets for sure something that he values as much as $L$ (either $L$ itself or the amount $x^*$, which he values exactly as $L$). By stopping at a strictly greater $x$, he has a shot at getting a value of $x$ that he prefers over $L$.

The situation is, however, far from hopeless. Bob’s optimal value of $x$ is strictly increasing in $\sigma$. (If the payoff from stopping at $x$ is $V(x, \sigma) = p(x)x^\sigma + (1 - p(x))/2$ and we assume that $p$ is smooth, then $V$ is strictly supermodular in $(x, \sigma)$, roughly because $\log x < 0$ for $x < 1$ and $p$ is decreasing. Hence the optimal $x$ is increasing in $\sigma$.) Alice can then undo Bob’s strategic choice of $x$ and back out the true value of $\sigma$. (Alice’s approach is common in applied econometrics, often called the “structural” method.)

In this paper, we prove general possibility results, to illustrate that there are many situations where active learning is consistent with incentive compatibility. In Section 3 we shall present a formal model of multiple price lists, and show that it is possible to learn while satisfying incentive compatibility. In Section 4 we discuss incentive issues in active learning in a more general sense. We present a formal notion of incentive compatible active learning in a general preference elicitation environment, and provide characterizations of the complexity of incentive compatible learning in certain “nice” environments.

A recent and growing body of work studies the problem of inferring models of economic choice from a learning theoretic perspective [KAL03, BV06, ZR12, BDM+14, BE18, CP18, BAS19]. The learnability of preference relations has also received very recent attention [BE18, CP18]. Our investigation takes a different angle: in attempting to model an experimental situation where subjects are asked to make choices in an interactive manner, via, e.g., a computer program, or in person, we allow the analyst complete control over the learning data. In the active learning literature, this framework is known as the membership queries model. There is also an ongoing line of work that considers learning problems when the data provider is strategic [DFP10, ACH+15, LC17, CPP+18]. Finally, the recent work of [HMP+16] studies a model where an agent may (at a cost) manipulate the input to a classification algorithm.

The membership query model closely captures an adaptive economic experiment, while in this context the more traditional learning/active learning models (e.g. PAC learning, stream-based active learning) seem to place unnecessary restrictions on how the analyst learns. This notion is briefly discussed in [CP18], where classical learning theoretic approaches appear to give much weaker complexity guarantees than the membership queries model in learning time-dependent discounted utility preferences. For the remainder of this paper, whenever we use the phrase “active learning”, we refer to the membership query setting – all other forms of learning can be viewed as a special case of membership queries.

The other important component in modelling an economic experiment is a payment to the agent after the experiment has concluded. Experiments in economics are always incentivized, meaning that there are actual material consequences to subjects’ decisions in the lab. Subjects are paid for their decisions in the experiment. We incorporate this incentive payment into the execution of the algorithm by which the analyst chooses questions – the analyst implements the outcome chosen by the agent in the final round of the interaction. Thus, rather than treating the payment scheme as a separate problem, we use it to demand a certain level of robustness from our learning algorithms. As we demonstrate, this precludes the analyst from running naive learning algorithms that, despite achieving good query complexities, allow the agent to strategically and dishonestly answer questions to get offered higher payoff outcomes.

Finally, the framework we introduce engenders the following natural question: is there a combinatorial measure of complexity, akin to VC dimension for PAC learning concept classes, that precisely captures the complexity of incentive compatible learning in preference environments? Our results examine certain sufficient conditions for incentive compatible learning, a potential first step towards better understanding this new and interesting learning model.

Summary of results

We begin by discussing incentive issues in a very common experimental paradigm, that of convex budgets. We present an example to the effect that incentive problems are present and can be critical. Then we turn to Multiple Price Lists (MPL), another very common experimental design used to infer agents’ attitudes towards risk. In MPL experiments, an agent is asked to choose between receiving various deterministic monetary amounts and participating in a lottery. The goal of the analyst is to elicit the agent’s certainty equivalent, i.e. the deterministic quantity at which the agent values the lottery (in our previous discussion, the certainty equivalent is the quantity $x^*$ that solves Equation (1)). We analyze a simple sequential search mechanism that is used in practice: start from the lowest possible deterministic amount and keep increasing the offer until the agent prefers it to the lottery. The analyst pays the agent by implementing the agent’s decision on a randomly selected question that was asked. We show that while this mechanism is not incentive compatible, under relatively benign assumptions it satisfies a one-to-one condition where the analyst can accurately infer the agent’s true certainty equivalent after learning the agent’s reported certainty equivalent. We then show how a modified payment scheme that only depends on the final decision of the agent allows the analyst to do a binary search and retain incentive compatibility, giving a mechanism for learning the certainty equivalent of a strategic agent to within an error of $\varepsilon$ using $O(\log(1/\varepsilon))$ questions.

We then turn to an abstract model of learning preference parameters/types. The idea is, as in the MPL, to induce incentive compatibility by basing the payment on the last question asked of the agent. To this end, we coin a formal notion of incentive compatible (IC) learnability. A learning algorithm is simply an adaptive procedure that at each step asks the agent to choose between two outcomes. Informally, the IC learning complexity of an algorithm is the number of rounds required to both

1. Accurately learn (with high probability) the agent’s type with respect to some specified metric on the type space.

2. Ensure that (with high probability) the payment mechanism of implementing the agent’s choice on the final question cannot be strategically manipulated to yield a significant payoff gain over answering questions truthfully.

A simple structural condition allows a strong notion of incentive compatibility to be achieved via a deterministic exhaustive search (truthful reporting is the agent’s unique best response), and we give examples of commonly studied economic preference models that fit our condition. We demonstrate that a large class of preference relations over Euclidean space – those exhibiting strict convexity under a condition which we call hyperplane uniqueness (detailed in Section 4) – can be learned in an incentive compatible manner.

Theorem 1.1 (informal).

Let $\Theta$ be a type space such that the preferences induced by each $\theta \in \Theta$ are continuous, strictly convex, and satisfy hyperplane uniqueness. Then, $\Theta$ is IC learnable, under a suitably chosen metric.

However, this strong notion of incentive compatibility comes at a cost – the associated IC learning complexity can be massive (exponential in the preference parameters). In the abstract setting of preferences over outcomes, it is unclear how to obtain a tangible improvement in this complexity (even with randomization), and specifically it would appear that the problem parameters (e.g. the outcome space, the set of possible agent types) require much more structure for any sort of improvement.

We then analyze the specific setting of learning the beliefs of an expected utility agent, where we have the required structure. Here, an agent holds a belief represented by a distribution $p = (p_1, \dots, p_n)$ (there are $n$ uncertain states of the world, and $p_i$ is the probability with which the agent believes state $i$ will occur), and is asked to make choices between vectors of rewards $a \in \mathbb{R}^n$, where the utility an agent of type $p$ enjoys from $a$ is simply $p \cdot a$. We first observe that naive learning algorithms can vastly beat the learning complexity of the general preference framework, but fail to be incentive compatible. Our main result is an incentive compatible learning algorithm for eliciting an agent’s beliefs that significantly improves upon the complexity in the general framework, and only differs from the fast naive learning algorithms by subpolynomial factors.

Theorem 1.2.

There is an algorithm for learning the belief of an expected utility agent that when run for

 $O\!\left(n^{3/2}\log n \cdot \max\!\left(\log\tfrac{n}{\varepsilon},\ \log\tfrac{1}{\tau}\right)\right)$

rounds (with high probability) cannot be manipulated to yield more than a $\tau$ increase in payoff, and learns a truthful agent’s belief to within total variation distance $\varepsilon$. (Typical supervised learning bounds have a logarithmic dependence on the confidence parameter $\delta$, and so for the sake of brevity we omit terms depending on $\delta$ in our complexity bounds.)

Our algorithm is built upon disagreement-based active learning methods that provide the learning guarantees, and employs the spherical scoring rule to ensure its incentive compatibility properties.
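For concreteness, here is a minimal sketch of the spherical scoring rule just mentioned (the function names are ours, for illustration only): the score of a reported distribution $q$ when state $i$ is realized is $q_i/\lVert q \rVert_2$, and a risk-neutral agent maximizes his expected score by reporting his true belief.

```python
import math

def spherical_score(report, realized_state):
    # Spherical scoring rule: S(q, i) = q_i / ||q||_2.
    norm = math.sqrt(sum(q * q for q in report))
    return report[realized_state] / norm

def expected_score(belief, report):
    # Expected score, under true belief p, of announcing the report q.
    return sum(p * spherical_score(report, i) for i, p in enumerate(belief))
```

Properness can be checked numerically: for a true belief $p = (0.7, 0.3)$, reporting $p$ itself yields a strictly higher expected score than reporting $(0.5, 0.5)$ or $(0.9, 0.1)$.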

2 Example: Convex budgets

We present a simple example to illustrate how incentive issues can prevent a very popular experimental design from being implementable in an active learning setting.

Consider an experiment on choice under uncertainty, with an adaptive “convex budgets” design. Such designs are ubiquitous in experimental economics: see [AM02, CFG+07, ACG+14, FHJ+18, ACP03, ANS15] among (many) others. Convex budgets is very popular as a design because it parallels the most basic model in economic theory, the model of consumer choice. (Consumer choice is probably the first model a student of economics is ever exposed to. It captures optimal choice from economic budget sets, defined from linear prices and a maximum expenditure level.)

Bob, a subject in the lab, has expected utility preferences. Specifically, suppose that the experiment involves two possible states of the world, and that Bob chooses among vectors $x = (x_1, x_2) \in \mathbb{R}^2_+$. If Bob chooses the vector $x$ and the state of the world turns out to be $s$, then he is paid $x_s$. Bob believes that state of the world $s$ occurs with probability $\mu_s$, so his expected utility from choosing $x$ is $\mu_1 x_1 + \mu_2 x_2$ (we assume for simplicity that Bob is risk-neutral).

The experiment seeks to learn the subject’s beliefs with a design that has Bob choosing

 $x \in B(p, I) = \{y \in \mathbb{R}^2_+ : p \cdot y \le I\},$

at prices $p = (p_1, p_2)$ and income $I$. The problem is equivalent to learning the ratio $\mu_1/\mu_2$. It is obviously optimal for Bob to choose $(I/p_1, 0)$ if $\mu_1/\mu_2 > p_1/p_2$ and $(0, I/p_2)$ if $\mu_1/\mu_2 < p_1/p_2$.
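Under risk neutrality, Bob’s optimal choice from a budget set is a corner solution; a small sketch (function name ours, for illustration) computes it:

```python
def optimal_bundle(mu, p, income):
    # A risk-neutral agent spends all income on the state s with the
    # highest expected payoff per dollar, mu_s / p_s (a corner of the budget).
    best = max(range(len(p)), key=lambda s: mu[s] / p[s])
    x = [0.0] * len(p)
    x[best] = income / p[best]
    return x
```

For example, with beliefs $\mu = (0.6, 0.4)$ and prices $p = (1, 1)$ at income $1$, the bundle $(1, 0)$ is chosen, and Alice learns only that $\mu_1/\mu_2 > p_1/p_2$.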

The experimental design presents the subject with a sequence of prices and incomes $(p^t, I^t)$, and asks him to choose from $B(p^t, I^t)$. Usually only one of the choice problems in the sequence will actually be paid off. It is standard practice in experimental economics to pay out only one of the questions posed to a subject. For the purpose of this example, imagine that the sequence has a length of 2: $(p^1, I^1)$ and $(p^2, I^2)$. Moreover, suppose (again for simplicity) that incomes and prices are such that $I^t/p^t_s = 1$ for each $t$ and $s$.

Fix the first price at $p^1$. If Alice, the experimenter, observes a choice of $(I^1/p^1_1, 0)$ she should conclude that $\mu_1/\mu_2 > p^1_1/p^1_2$. And given such an inference, it would not make sense to set the second set of prices so that $p^2_1/p^2_2 < p^1_1/p^1_2$. Alice, following an active learning paradigm of adaptive experimental design, should adjust $p_1/p_2$ upwards. So let us assume that she decides to adjust the ratio by a factor of $\alpha > 1$ in the direction in which there is something to learn: If the choice from $B(p^1, I^1)$ is $(I^1/p^1_1, 0)$, Alice will set $p^2_1/p^2_2 = \alpha\, p^1_1/p^1_2$. If the choice is $(0, I^1/p^1_2)$, she will set $p^2_1/p^2_2 = (1/\alpha)\, p^1_1/p^1_2$.

Now consider the problem facing our subject, Bob. Suppose that Bob’s beliefs are such that $\mu_1/\mu_2 > p^1_1/p^1_2$. If he chooses “truthfully” according to his beliefs, he would choose $(I^1/p^1_1, 0)$ from $B(p^1, I^1)$ and thus face second-round prices with $p^2_1/p^2_2 = \alpha\, p^1_1/p^1_2$. This means that the relative price of payoffs in state 1 increases, the state that Bob values the most because he thinks it is the most likely to occur. If instead Bob “manipulates” the experiment by choosing $(0, I^1/p^1_2)$, he will face prices with $p^2_1/p^2_2 = (1/\alpha)\, p^1_1/p^1_2$. It is obvious that Bob is better off facing the second budget, because he will be able to afford a much larger payoff in state 1. If Alice only incentivizes (pays out) the choice from $B(p^2, I^2)$, then Bob is always better off by misrepresenting his choice from the first budget.

If, instead, Alice incentivizes the experiment by implementing one of the choices made by Bob at random (a common practice in economic experiments, see [ACH18] for a formal justification), then the utility from truthtelling averages Bob’s payoffs from his truthful choices in the two budgets, while the utility from manipulation averages the payoff from the misreported first choice with the much larger payoff he can afford in the second budget. As long as the price adjustment factor $\alpha$ is large enough relative to Bob’s beliefs, the manipulation yields a higher utility than truth telling.

The convex budgets example illustrates the perils of active learning as a guide to adaptive experimental design, when human subjects understand how the experiment unfolds conditional on how they make choices. The main result of our paper (see Section 4.2) considers belief elicitation, but proposes an active learning algorithm that is based on pairwise comparisons, not choices from convex budgets.

3 Multiple Price Lists

We begin by formally considering the application in the introduction: the use of Multiple Price Lists (MPL) to elicit an agent’s preferences over risk. MPL was first proposed by [BIN81], and popularized by [HL02], who used it to estimate risk attitudes along the lines of the discussion in the sequel.

We shall consider a version of MPL where a lottery with monetary outcomes is fixed, and an agent chooses between a sure (deterministic) monetary payment or the lottery. More specifically, consider a lottery where a coin is flipped: if the outcome is heads, the payoff is $\bar{x}$ dollars, while if the outcome is tails, the payoff is $\underline{x}$ dollars. An analyst wants to assess an agent’s willingness to participate in the lottery when presented with various deterministic alternatives. Denote this lottery by $L$.

At every round of the experiment, the analyst asks the agent to choose between a deterministic payoff of $x$ or participation in the lottery, and aims to learn the agent’s certainty equivalent: the deterministic amount that yields indifference. Conventionally (for example, see [HL02, AHL+06]), the experiment is run by presenting the agent with a list of pairs $(x_t, L)$. The agent makes a choice from each pair, either the sure amount $x_t$ or the lottery $L$. Then the experimenter draws one of the questions at random and pays the agent according to the decision he made for that question (i.e. if he preferred the deterministic amount $x_t$, he is paid $x_t$, and otherwise he gets to participate in the lottery). We now present a formal model of the MPL experimental design and analyze issues of incentive compatibility.

3.1 The model

We consider a lottery $L$ with a low outcome $\underline{x}$ and a high outcome $\bar{x}$, $\underline{x} < \bar{x}$. The lottery can operate in any number of ways, for example, by a coin flip. The analyst chooses a discretization $\underline{x} = x_0 < x_1 < \dots < x_N = \bar{x}$ of the interval $[\underline{x}, \bar{x}]$ such that the intervals $[x_t, x_{t+1}]$ all have equal length $\Delta$. This discretization of $[\underline{x}, \bar{x}]$ represents the deterministic amounts that the analyst will offer to the agent.

An agent’s certainty equivalent is the point $x^* \in [\underline{x}, \bar{x}]$ such that he is indifferent between receiving $x^*$ versus participating in the lottery. Certainty equivalents will be uniquely determined by an agent’s utility over money, as long as his utility function is strictly increasing.

For example, if an agent values money according to $u(x) = x^\sigma$, his certainty equivalent (assuming that $L$ is a coin flip) would be the point $x$ such that $x^\sigma = \frac{1}{2}\underline{x}^\sigma + \frac{1}{2}\bar{x}^\sigma$. In our model, we consider agents whose utility functions belong to a given family $\mathcal{U}$ of functions such that a given certainty equivalent uniquely determines the utility function of the agent, and vice versa. For example, if $\mathcal{U} = \{x \mapsto x^\sigma : \sigma > 0\}$, so utilities take the CRRA form we discussed in the introduction, then $\sigma$ uniquely determines the point $x$ such that $x^\sigma = \frac{1}{2}\underline{x}^\sigma + \frac{1}{2}\bar{x}^\sigma$, and vice versa.
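For the CRRA family, the bijection between $\sigma$ and the certainty equivalent can be computed directly. The sketch below (illustrative only, with function names ours) inverts the map numerically by bisection, using the fact that the certainty equivalent of a fair coin-flip lottery is a power mean and hence increasing in $\sigma$:

```python
def certainty_equivalent(sigma, x_low, x_high):
    # CE of a fair coin-flip lottery under CRRA utility u(x) = x**sigma:
    # solves ce**sigma = 0.5 * x_low**sigma + 0.5 * x_high**sigma.
    return (0.5 * x_low ** sigma + 0.5 * x_high ** sigma) ** (1.0 / sigma)

def sigma_from_ce(ce, x_low, x_high, lo=1e-6, hi=50.0, iters=100):
    # Invert sigma -> CE by bisection; the CE (a power mean of the two
    # outcomes) is increasing in sigma, so bisection is valid.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if certainty_equivalent(mid, x_low, x_high) < ce:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With the lottery over $0$ and $1$ dollars from the introduction, $\sigma = 0.5$ gives certainty equivalent $0.25$, and the inversion recovers $\sigma = 0.5$ from the observed certainty equivalent.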

3.2 Sequential Search

We first consider a simple mechanism that aims to find the agent’s certainty equivalent by performing a sequential search on $\{x_1, \dots, x_N\}$. On round $t$ of the experiment, the agent chooses between the lottery $L$ and a deterministic payoff of $x_t$. If he chooses the lottery, the experiment continues, and if he chooses $x_t$ or claims indifference, the experiment terminates. If the experiment terminates at round $t$, the analyst can conclude that the agent reported a certainty equivalent lying in the interval $(x_{t-1}, x_t]$.
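The mechanics of this mechanism can be sketched as follows (illustrative Python; `chooses_lottery` stands in for the agent’s round-by-round decisions):

```python
def sequential_search(chooses_lottery, grid):
    """Offer the grid points x_1 < x_2 < ... in increasing order and stop as
    soon as the agent takes the sure amount (or claims indifference).
    chooses_lottery(x) -> True if the agent picks the lottery over x.
    Returns the stopping round t and the reported bracket (x_{t-1}, x_t]."""
    for t, x in enumerate(grid, start=1):
        if not chooses_lottery(x):
            low = grid[t - 2] if t >= 2 else None
            return t, (low, x)
    return len(grid), (grid[-1], None)  # the agent never switched
```

A truthful agent with certainty equivalent $0.37$ on the grid $0.1, 0.2, \dots, 0.9$ stops at round 4, placing his report in $(0.3, 0.4]$.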

The goal of the analyst is to make a payment to the agent at the end of the experiment such that the agent is incentivized to answer questions according to his true certainty equivalent. We analyze a common scheme used in experiments: if the experiment terminates after $t$ rounds, choose $s$ uniformly at random from $\{1, \dots, t\}$, and pay the agent based on his preference on the $s$th question: so if $s = t$, the agent receives $x_t$; otherwise the agent receives a payment that is the outcome of the lottery. However, as discussed in the introduction, this scheme is not incentive compatible. Indeed, if $x_t$ is the agent’s true certainty equivalent, he has a profitable deviation to push the experiment to terminate at a later round. The agent is indifferent between receiving $x_t$ and participating in the lottery, so by declaring a certainty equivalent that is higher than $x_t$ he may possibly win an amount larger than $x_t$, which he values strictly more than the lottery.

We now show that under some simplifying assumptions, this kind of payment scheme can at least be implemented in a manner such that the agent’s true certainty equivalent can be accurately inferred based on his report. Let $\mathcal{U}$ be a family of utility functions such that each $u \in \mathcal{U}$ satisfies an inverse Lipschitz condition with constant $K$: for all $x, y \in [\underline{x}, \bar{x}]$, $|u(x) - u(y)| \ge K|x - y|$. Finally, let $M = \sup_{u \in \mathcal{U}} \max_{x \in [\underline{x}, \bar{x}]} |u(x)|$ be a uniform bound on the utilities in $\mathcal{U}$.

For $t = 1, \dots, N$, let $p_t$ denote the probability that the agent is paid the deterministic amount $x_t$ if the experiment stops on round $t$ (so the agent participates in the lottery with probability $1 - p_t$). An agent with true certainty equivalent $x$ and corresponding utility function $u$ has an expected payoff of

 $\mathrm{Payoff}(x, x_t) = p_t u(x_t) + (1 - p_t) u(x)$

for reporting a certainty equivalent in $(x_{t-1}, x_t]$ (note that $u(x)$ is exactly his expected utility from the lottery, since $x$ is his certainty equivalent).

Let $r(x)$ be the best response of an agent with certainty equivalent $x$:

 $r(x) = \operatorname{argmax}_{x_t} \mathrm{Payoff}(x, x_t).$

We refer to $r$ as the report function.
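The report function can be computed directly once the $p_t$ are fixed. The sketch below assumes the uniform randomization scheme $p_t = 1/t$ and a CRRA agent, purely for illustration; it exhibits the upward bias discussed above, with the best response lying strictly above the true certainty equivalent:

```python
import math

def report_index(x_true, u, grid, p=lambda t: 1.0 / t):
    """Best response r(x): the round t maximizing
    Payoff(x, x_t) = p_t * u(x_t) + (1 - p_t) * u(x),
    where u(x) is also the agent's value for the lottery (x is his CE).
    The uniform scheme p_t = 1/t is an assumption made for illustration."""
    def payoff(t):
        return p(t) * u(grid[t - 1]) + (1.0 - p(t)) * u(x_true)
    return max(range(1, len(grid) + 1), key=payoff)
```

For $u(x) = \sqrt{x}$ and the grid $0.1, \dots, 0.9$, an agent with true certainty equivalent $0.3$ best-responds well above round 3.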

We now show that when the $p_t$ are decreasing in $t$ (this holds for the standard uniform randomization scheme, as $p_t = 1/t$; more generally, without this condition the agent will have incentives to report high certainty equivalents, as doing so is not penalized by lower probabilities of winning the certain amount) and $2M\,p_{t+1}/p_t < K\Delta$ for each $t$, the analyst can recover the agent’s true certainty equivalent up to some low error via the sequential search mechanism.

We should emphasize that the agent will not be truthful, in the sense of reporting his true certainty equivalent. However, we are still able to back out the true certainty equivalent from understanding the agent’s strategic incentives.

We proceed in steps. First, we characterize the best responses for agents with certainty equivalents belonging to the discretization $\{x_0, \dots, x_N\}$. We show that the report function is one-to-one on the discretization. This implies that $r(x_t) = x_{t+1}$, since $r(x_t) > x_t$, for each $t$.

Proposition 3.1. For $t = 0, \dots, N-1$, $r(x_t) = x_{t+1}$.

Proof.

Note that since $r(x_t) > x_t$ for each $t$, it suffices to show that $r$ is injective on $\{x_0, \dots, x_{N-1}\}$.

Suppose $r(x_{t_1}) = r(x_{t_2})$, and without loss of generality let $x_{t_2} < x_{t_1}$. Let $x_t = r(x_{t_1}) = r(x_{t_2})$. Since any agent is incentivized to report higher than his true certainty equivalent, $x_t > x_{t_1}$, so in particular $t \ge t_1 + 1$. Let $u_1$ denote the utility function of the agent with true type $x_{t_1}$, $u_2$ that of the agent with true type $x_{t_2}$.

For any $s \ne t$ we have (since $x_t$ gives the best response):

 $p_t u_1(x_t) - p_t u_1(x_{t_1}) > p_s u_1(x_s) - p_s u_1(x_{t_1}), \qquad p_t u_2(x_t) - p_t u_2(x_{t_2}) > p_s u_2(x_s) - p_s u_2(x_{t_2}).$

Adding the two inequalities and rearranging gives

 $\dfrac{p_t}{p_s} > \dfrac{u_1(x_s) - u_1(x_{t_1}) + u_2(x_s) - u_2(x_{t_2})}{u_1(x_t) - u_1(x_{t_1}) + u_2(x_t) - u_2(x_{t_2})}.$ (2)

At $s = t_1$ (we know $t_1 \ne t$, since $x_t > x_{t_1}$), Equation (2) simplifies to

 $\dfrac{p_t}{p_{t_1}} > \dfrac{u_2(x_{t_1}) - u_2(x_{t_2})}{u_1(x_t) - u_1(x_{t_1}) + u_2(x_t) - u_2(x_{t_2})} > \dfrac{u_2(x_{t_1}) - u_2(x_{t_2})}{2M},$

so

 $u_2(x_{t_1}) - u_2(x_{t_2}) < 2M\,\dfrac{p_t}{p_{t_1}} \le 2M\,\dfrac{p_{t_1+1}}{p_{t_1}}.$

The inverse Lipschitz condition on $u_2$ then implies that $K(x_{t_1} - x_{t_2}) \le u_2(x_{t_1}) - u_2(x_{t_2}) < 2M\,p_{t_1+1}/p_{t_1}$, which cannot happen unless $t_1 = t_2$. ∎

Thus, if an agent’s true certainty equivalent happens to coincide with one of the points of the discretization, the agent will answer questions as if his certainty equivalent is the next point in the discretization.

For the next step, we need an additional Lipschitz-type condition on utility functions. Suppose there are constants $C_1$ and $C_2$ such that for any $x$, $x^*$, with $u$, $u^*$ the corresponding utility functions, and for any $x'$, $x''$,

 $|u(x') - u^*(x'')| \le C_1 |x - x^*| + C_2 |x' - x''|.$

Moreover, let

 $\lambda = \inf_{x \in (\underline{x}, \bar{x})} \min_{s \ne t} \left|\mathrm{Payoff}(x, x_s) - \mathrm{Payoff}(x, x_t)\right|,$

be the smallest possible deviation in payoff obtained by changing one’s report.

We also require the assumption that if $x < x^*$, then $u(y) \le u^*(y)$ for any $y$, where $u$ and $u^*$ are the utility functions corresponding to certainty equivalents $x$ and $x^*$, respectively. This is an intuitive condition stating that agents with a higher certainty equivalent value money more than agents with a lower certainty equivalent (note that the CRRA utilities discussed previously satisfy this property). This in particular implies that $u^*(x^*) \ge u(x)$, so $\mathrm{Payoff}(x^*, x_k) \ge \mathrm{Payoff}(x, x_k)$ for any $x_k$ in the discretization. We will use this in the proof of the following proposition, which establishes that $r$ satisfies a certain weak monotonicity property.

Proposition 3.2.

Let $x, x^* \in (\underline{x}, \bar{x})$ with $x < x^*$, and suppose $(C_1 + C_2)\,|x - x^*| \le \lambda/2$. Then, $r(x) \le r(x^*)$.

Proof.

Let $u$, $u^*$ be the utility functions corresponding to certainty equivalents $x$ and $x^*$, respectively. We first bound the increase in payoff an agent of type $x^*$ experiences over an agent of type $x$ for making the same report. For any $x_k$ in the discretization, we have

 $\mathrm{Payoff}(x^*, x_k) - \mathrm{Payoff}(x, x_k) = p_k\big(u^*(x_k) - u(x_k)\big) - p_k\big(u^*(x^*) - u(x)\big) + \big(u^*(x^*) - u(x)\big) \le (C_1 + C_2)\,|x - x^*| \le \frac{\lambda}{2}.$

As the reports lie in the discretization, write $r(x^*) = x_{t+1}$; either $r(x) \le x_{t+1}$ or $r(x) > x_{t+1}$. Suppose $r(x) = x_s$ with $s > t + 1$. We show that an agent of type $x$ cannot increase his payoff by reporting such an $x_s$ above $x_{t+1}$.

Plugging $x_k = x_{t+1}$ into the above bound gives

 $\mathrm{Payoff}(x^*, x_{t+1}) \le \frac{\lambda}{2} + \mathrm{Payoff}(x, x_{t+1}),$

and the definition of $\lambda$ gives that

 $\mathrm{Payoff}(x^*, x_{t+1}) \ge \mathrm{Payoff}(x^*, x_s) + \lambda.$

Combining the two inequalities yields

 $\mathrm{Payoff}(x, x_{t+1}) \ge \mathrm{Payoff}(x^*, x_{t+1}) - \frac{\lambda}{2} > \mathrm{Payoff}(x^*, x_s) + \frac{\lambda}{2} > \mathrm{Payoff}(x^*, x_s) > \mathrm{Payoff}(x, x_s),$

so $r(x) \le x_{t+1} = r(x^*)$.

In the case that $r(x) \le x_{t+1}$ to begin with, we immediately get $r(x) \le r(x^*)$. ∎

We can then repeatedly apply this proposition, starting with $x = x_t$, to conclude that for any $x \in (x_t, x_{t+1}]$, we have $r(x) \in \{x_{t+1}, x_{t+2}\}$.

Putting things together, we get:

Theorem 3.3.

If $r(x) = x_t$, then $x \in (x_{t-2}, x_t]$.

Thus, to learn the agent’s true certainty equivalent to within $\varepsilon$-error, the analyst chooses a discretization with $\Delta \le \varepsilon/2$, and runs a sequential search over the discretization. The number of questions the analyst asks is $O\!\left(\frac{\bar{x} - \underline{x}}{\varepsilon}\right)$.

Of course, to lower the number of questions asked, the analyst could instead perform a binary search. It is easy to see that, as in the sequential search mechanism, simply implementing a uniformly random question is not incentive compatible. For example, consider a discretization with deterministic amounts $x_1 < \dots < x_7$, and consider an agent with true certainty equivalent at $x_2$. For simplicity, we assume that if, when presented with the pair $(x_t, L)$, the agent is indifferent between $x_t$ and $L$, he chooses $x_t$. If the agent answers truthfully, the pairs offered by a binary search would be $(x_4, L)$, $(x_2, L)$, and $(x_1, L)$, and his choices would have been $x_4$, $x_2$, and $L$, respectively. The agent’s expected payoff is $\frac{1}{3}\big(u(x_4) + 2u(x_2)\big)$, where $u$ is his utility function (recall that he values the lottery exactly as $x_2$). Suppose instead the agent answers as if his true certainty equivalent is $x_6$. Then, the pairs he gets offered would be $(x_4, L)$, $(x_6, L)$, and $(x_5, L)$, and his choices would have been $L$, $x_6$, and $L$, respectively. His expected payoff is then $\frac{1}{3}\big(u(x_6) + 2u(x_2)\big)$, which is clearly a profitable deviation.

It is unclear whether this scheme can be directly modified to satisfy incentive compatibility properties, but since the payments in the sequential search mechanism only depended on the last question asked, we can use the same payment scheme here so that Theorem 3.3 holds. So now the analyst can learn the agent’s certainty equivalent to within an error of $\varepsilon$ with $O\!\left(\log\frac{\bar{x} - \underline{x}}{\varepsilon}\right)$ questions.
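A sketch of the resulting adaptive mechanism (illustrative Python; only the agent’s final answer matters for payment, which is what lets Theorem 3.3 carry over):

```python
def binary_search_ce(prefers_sure, grid):
    """Adaptive MPL via binary search over the discretization.
    prefers_sure(x) -> True if the agent takes the sure amount x over the
    lottery. Returns the lowest grid point the agent accepts (None if he
    always keeps the lottery); the reported CE lies just below it."""
    lo, hi = 0, len(grid) - 1
    accepted = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if prefers_sure(grid[mid]):
            accepted = grid[mid]
            hi = mid - 1  # try smaller sure amounts
        else:
            lo = mid + 1  # the agent kept the lottery; offer more
    return accepted
```

A truthful agent with certainty equivalent $0.37$ on the grid $0.1, \dots, 0.9$ is asked only $\lceil \log_2 9 \rceil = 4$ questions and ends at the offer $0.4$.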

4 General Preference Elicitation

Our discussion so far has focused on a specific, albeit ubiquitous, preference elicitation environment. In the rest of the paper we introduce a general model of incentive compatible active learning. We introduce the idea of incentive compatible query complexity: the sample size that guarantees some learning objective while maintaining incentive compatibility.

The main application of our tools will be to expected utility theory. We shall introduce a learning algorithm that is incentive compatible for learning the beliefs of an agent that has expected utility preferences.

We focus on learning an agent’s preferences. The agent will be modeled as having a utility function, parameterized by some type, that generates the agent’s choices and that the learner wishes to infer. To this end, $\Theta$ is a type space equipped with a metric $d$, and $\Theta$ is bounded with respect to $d$. $\mathcal{O}$ is the space of possible outcomes. An agent of type $\theta \in \Theta$ has utility $u_\theta(o)$ if the outcome is $o \in \mathcal{O}$. The function $u_\theta$ induces a preference relation $\succsim_\theta$ over $\mathcal{O}$ defined by $o \succsim_\theta o'$ if and only if $u_\theta(o) \ge u_\theta(o')$.

An analyst aims to learn the agent’s type by asking him to make a sequence of choices between pairs of outcomes. (One can imagine many other protocols for learning. We constrain ourselves to protocols that are based on a sequence of pairwise comparisons. Such protocols are common in practice, and are the obvious empirical counterpart to the decision theory literature in economics and statistics. This stands in contrast with the literature on scoring rules, which allows for richer message spaces.) The agent makes choices among the pairs presented to him.

The agent’s choices can be thought of as the result of a strategy. Formally, a strategy is a mapping

 $$\sigma:\bigcup_{t}\Bigl\{\bigl((o_1,o'_1),\mathbf{1}_{o_1\succsim o'_1}\bigr),\dots,\bigl((o_t,o'_t),\mathbf{1}_{o_t\succsim o'_t}\bigr),(o_{t+1},o'_{t+1})\Bigr\}\to\Delta\{0,1\}$$

that dictates a (potentially randomized) response for every possible history of the interaction up to any given time. Let denote the collection of all possible consistent strategies (a strategy is consistent if its outputs up to any given time are consistent with some preference relation in the type space).

For any strategy , let denote an oracle with memory that responds to queries of the form “is preferred to ?” according to , given the history of queries made so far. Let denote the collection of oracles corresponding to all possible strategies. For a type , let denote the oracle that responds truthfully according to (i.e. on query it returns ).
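The oracle-with-memory abstraction can be sketched in code. This is an illustrative sketch with our own naming, not an interface from the paper:

```python
class Oracle:
    """Oracle with memory: wraps a (possibly randomized) strategy sigma,
    which maps the history of (query, answer) pairs plus the current
    query to a response in {0, 1} (1 = 'o is preferred to o_prime')."""

    def __init__(self, sigma):
        self.sigma = sigma
        self.history = []

    def query(self, o, o_prime):
        ans = self.sigma(self.history, (o, o_prime))
        self.history.append(((o, o_prime), ans))
        return ans


def truthful_oracle(utility, theta):
    """Oracle that responds truthfully for an agent of type theta."""
    return Oracle(
        lambda hist, pair: int(utility(theta, pair[0]) >= utility(theta, pair[1]))
    )
```

A strategic agent corresponds to wrapping some other (history-dependent) `sigma` in the same interface, which is exactly the freedom the incentive compatibility definitions below must guard against.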

We imagine the oracle playing the role of the agent: in an interaction with the analyst, an agent of true type chooses to act as an oracle for some strategy (departing from standard terminology, we allow the oracle to have randomized responses).

The analyst implements a learning mechanism, which consists of the following steps:

1. Run a (potentially randomized) learning algorithm that has access to oracle and can make queries to of the form for .

2. Arrive at a hypothesis for the agent’s type.

3. Implement the agent’s response on the last query. (Within adaptive experimental design, the idea of making a last choice on behalf of the agent is due to Ian Krajbich.)

We now establish the notion of learnability that we work with. This definition is not concerned with incentive compatibility: it simply refines the standard notion of a learning algorithm by stipulating that a truthfully reported type is learned accurately. Since in our setting the analyst has full control over the data he learns from, our requirements on the error of the algorithm are with respect to the metric on the space of types .

Definition 4.1.

is an -learning algorithm if for all ,

 $$\Pr_{\theta_h\sim A(\hat\theta)}\bigl[d(\theta,\theta_h)\le\varepsilon\bigr]\ge 1-\delta.$$

The number of queries made by to the oracle is the query complexity of , denoted by .

Next, we define what it means for a learning algorithm to be incentive compatible. Intuitively, we require that if the learning algorithm is terminated on round , and the analyst implements the agent’s preferred outcome on the th query , then the agent cannot (with high probability) gain a non-negligible advantage over truthful reporting by answering questions strategically. Let denote the th query to made by an execution of .

Definition 4.2.

is -incentive compatible if there exists a such that for all , the following holds for any type and strategy :

 $$\Pr_{\substack{(o_T,p_T)\sim A_T(\hat\theta)\\ (o'_T,p'_T)\sim A_T(\hat\sigma)}}\bigl[u(\theta,q_T)\ge u(\theta,q'_T)-\tau\bigr]\ge 1-\nu,$$

where () is the preferred outcome between and ( and ) according to oracle (). The quantity is the IC complexity of . (In this definition, and are drawn from independent executions of .)

Our goal is to design mechanisms that learn the agent’s true type in an incentive compatible manner.

Definition 4.3.

is an -IC learning algorithm if it is an -learning algorithm that is -IC. We refer to the quantity as the IC learning complexity of .

4.1 An incentive compatible exhaustive search

We first give a very simple method of achieving incentive compatible learning in the general framework introduced in Section 4. The method proceeds by exhaustively searching over the type space, and requires a simple structural assumption. The assumption connects agents’ payoffs to the distance metric used by the learner to assess learning accuracy. In a sense, this lines up the agent’s incentives with the learner’s objective, and makes it easy to obtain a satisfactory algorithm.

Suppose there exists a one-to-one assignment of outcomes to types such that

 $$u(\theta,s(\theta'))>u(\theta,s(\theta''))\iff d(\theta,\theta')<d(\theta,\theta''),$$

so in particular . In the literature on scoring rules, is called effective with respect to [FRI83].

The following is an incentive compatible learning algorithm. Recall that an -cover of a subset of a metric space is a set of points such that for every , there is an such that .

1. Initialize an -cover of with respect to .

2. Initialize .

3. For to :

1. Query .

2. If is preferred, .

4. Output .

5. Pay the agent .

By definition of the function , allowing the algorithm to exhaustively search over all points of the cover will yield a that is the agent’s most preferred point in the cover, and also the closest point in the cover to the report. So truthful reporting yields . Moreover, this (deterministic) algorithm satisfies -incentive compatibility for any runtime , since lying at any round would simply reduce the payoff from stopping at any round. The learning complexity is the covering number of the type space (which is finite, as is bounded).
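The exhaustive search can be sketched as follows. This is a minimal illustration under an assumed environment (Euclidean preferences on the line, so the assignment s is the identity); the function names are ours:

```python
def exhaustive_ic_search(cover, utility, theta_report):
    """Exhaustively search an epsilon-cover of the type space.

    Assumes an effective assignment s; here s is the identity, and
    utility(theta, s(theta')) is decreasing in d(theta, theta'), e.g.
    Euclidean preferences u(theta, o) = -|o - theta|.
    Returns the cover point the (reported) agent most prefers, which
    is also the cover point closest to the report.
    """
    best = cover[0]
    for o in cover[1:]:
        # Query: is s(o) preferred to s(best)?
        if utility(theta_report, o) > utility(theta_report, best):
            best = o
    return best
```

Because the payment is the most preferred point found so far, misreporting at any round can only move `best` to a less preferred point, which is the source of the incentive compatibility claim above.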

We now present some natural preference environments in which such an assignment function can be constructed. In the following discussion, the outcome space is , and is assumed to be bounded so that the search above terminates.

• Euclidean preferences. Each agent has an “ideal point” , and iff . Let be the identity.

• Linear preferences. The type of an agent is a vector and iff (in order for preferences to be distinguishable we assume that no two are scalar multiples of one another, and so for simplicity we normalize so that all types have the same length). The indifference sets of an agent of type are the hyperplanes , for . For each , there is a unique indifference set that is tangent to the unit -sphere . Let be that tangent point.

Euclidean and linear preferences are characterized by natural axioms for preference relations [CE19].
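The assignment functions for these two environments are straightforward to write down; the following sketch uses our own naming:

```python
import math

def s_euclidean(theta):
    # Euclidean preferences: the ideal point itself is the assigned outcome.
    return tuple(theta)

def s_linear(theta):
    # Linear preferences: the unique point where an indifference
    # hyperplane {x : theta . x = c} is tangent to the unit sphere
    # is the normalized type vector.
    norm = math.sqrt(sum(t * t for t in theta))
    return tuple(t / norm for t in theta)
```

For linear preferences, u(θ, s(θ')) is proportional to the cosine of the angle between θ and θ', so it decreases as θ' moves away from θ, which is the effectiveness property needed above.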

More generally, suppose the preferences of each agent are continuous and strictly convex, which we define as the upper contour sets being closed, convex, and any supporting hyperplane of being unique, for all . For a type , real number , and outcome such that , let denote the supporting hyperplane of the upper contour set at .

Suppose that the following uniqueness requirement holds: for every pair of types , real number , and outcome such that , if is such that , it holds that . We call this property hyperplane uniqueness. (Hyperplane uniqueness is reminiscent of the single-crossing property in mechanism design.) Then, the argument for IC learnability in the case of linear preferences can be adapted to this setting as well. Though the assignment function we construct may not necessarily be effective, we show that exhaustively searching over a sufficiently fine cover is nevertheless incentive compatible.

Theorem 1.1.

Let be a type space such that the preferences induced by each are continuous, strictly convex, and satisfy hyperplane uniqueness. Then, there exists a metric on with respect to which is -IC learnable.

Proof.

For each , let be the unique maximizer of over the unit -sphere ; uniqueness follows from the strict convexity of preferences. Note that and , with , are on different sides of the supporting hyperplane of at . Hyperplane uniqueness ensures that is one-to-one. Let be the metric where is the Euclidean distance between and .

Let with and let . Clearly the diameter of converges to as . Therefore, for each and , there is an open neighborhood of such that for all and all (where is of the form , for sufficiently close to ). (The neighborhoods and balls here are with respect to the subspace topology on .)

Now fix the learning parameter , and let be sufficiently small that if is an -cover of , then for all . Then any satisfies , for all . Thus the most preferred point of an agent of type is contained in , and so exhaustively searching over this -cover is an -learning algorithm with respect to that is incentive compatible. ∎

It is an interesting question which structural conditions on the type space, the outcome space, and so on permit better learning mechanisms. For example, one might hope to achieve a learning complexity that is logarithmic in the size of the cover . As we will see in the case of expected utility, naive learning algorithms achieve this sample complexity but fail to be incentive compatible. More generally, one can ask whether there is a combinatorial complexity measure (such as VC dimension in the case of PAC learning) that characterizes the complexity of incentive compatible learning.

4.2 The expected utility model of choice under uncertainty

We now turn to the case of belief elicitation for an expected utility agent. Belief elicitation has a long history in experimental economics, and in the theoretical literature on scoring rules (e.g. [CL17]; see [CON09] for a survey). A major difference with the theory of scoring rules is that we take as given a protocol based on pairwise comparisons among uncertain prospects. (This follows experimental practice, as well as the standard model of choice under uncertainty, starting from von Neumann and Morgenstern [VM53] and Savage [SAV72]. In the scoring rule model, subjects are asked to report beliefs rather than carrying out a sequence of binary choices.) In any case, we shall use scoring rules in our solution, just not by asking subjects to report their beliefs. The case of passive learning was studied in [BE18].

There are states of the world, indexed by . An agent has a subjective belief , where is the probability the agent assigns to state occurring. The agent evaluates the payoff of a vector of rewards by computing expectation according to . An agent’s belief defines a preference relation , where

 $$x\succsim y\iff \alpha\cdot x\ge \alpha\cdot y.$$

An analyst would like to learn by asking the agent to make several choices between vectors of rewards. The analyst presents the agent with a sequence of pairs , and if the agent chooses she infers that . So the problem is related to that of learning halfspaces, but with the added complication of having to respect incentive compatibility. An important assumption is that the analyst is able to simulate the states of the world and observe a state according to the “ground truth” process governing the states (so, for example, if the states were “rain”, “snow”, and “shine”, the analyst could simply observe the weather on the given day).

Using the notation of the previous section, , , and .

In the context of learning the agent’s true belief, the analyst uses total variation distance to measure accuracy/error.
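Each pairwise choice between reward vectors thus yields one halfspace constraint on the belief. A minimal sketch (with hypothetical helper names):

```python
def choice_constraint(x, y, chose_x):
    """From a choice between reward vectors x and y, the analyst infers
    a halfspace constraint on the belief alpha: alpha . (x - y) >= 0 if
    x was chosen, and alpha . (x - y) <= 0 otherwise.
    Returns the constraint as (normal vector, sense)."""
    d = tuple(xi - yi for xi, yi in zip(x, y))
    return (d, +1) if chose_x else (d, -1)

def consistent(alpha, constraints, tol=1e-12):
    """Check a candidate belief alpha against the inferred constraints."""
    for d, sense in constraints:
        if sense * sum(a * di for a, di in zip(alpha, d)) < -tol:
            return False
    return True
```

The analyst's problem is to choose the pairs (x, y) adaptively so that these halfspace constraints shrink the set of consistent beliefs quickly, while keeping the agent's answers truthful.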

4.2.1 Naive algorithms are not incentive compatible

First, to illustrate the restrictions imposed by our definitions, we write down a naive algorithm for eliciting that achieves a good query complexity but is not incentive compatible.

Consider a mechanism that tries to elicit each by performing a search (sequential or binary) on each state. That is, for each state , the algorithm makes queries , varying over a -cover of to find the indifference points, which reveal to within an error of . A binary search, for example, uses questions to arrive at a hypothesis within total variation distance of . Note that a -cover of the simplex with respect to total variation distance contains elements, so a state-wise binary search exponentially improves upon a search over the entire cover.
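The state-wise binary search can be sketched as follows; `prefers_sure(i, z)` is a stand-in for the agent's answer to "sure amount z versus a bet paying 1 if state i occurs" (for a truthful expected utility agent this is just z >= alpha_i):

```python
def naive_statewise_search(prefers_sure, n, grid):
    """Naive (not incentive compatible) belief elicitation: for each
    state i, binary-search a sorted grid of sure amounts for the point
    of indifference with a bet paying 1 if state i occurs.

    prefers_sure(i, z) is the agent's answer to 'sure amount z versus
    a bet on state i'; for a truthful agent it is monotone in z."""
    belief = []
    for i in range(n):
        lo, hi = 0, len(grid) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if prefers_sure(i, grid[mid]):
                hi = mid       # indifference point at or below mid
            else:
                lo = mid + 1   # indifference point above mid
        belief.append(grid[lo])
    return belief
```

This recovers a truthful agent's belief with n log(1/epsilon) questions, but as the example below shows, a strategic agent can steer the final questions to his advantage.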

However, incentive compatibility breaks rather easily, since the agent has a great deal of control over which questions the analyst asks (in a manner similar to the situation in MPL). Consider the following simple example: suppose the analyst fixes a discretization of with sure amounts , as in the binary search MPL example from Section 3, and suppose an agent has a true belief , with . If, instead of , the agent reports an with , the final question he is asked would be . The agent would prefer , and thus would be paid a sure amount of . It is clear that truthfully reporting yields a strictly lower payoff than the misrepresentation. Notice that this situation is even worse than that of MPL: if the binary search ends on state , then regardless of the probabilities the agent assigns to states , he will want to answer questions as if he assigns most weight to state , so there is no hope of backing out the agent’s true belief using this kind of scheme.

A strategic agent can easily outwit minor modifications to this scheme: for example, if the analyst performs the binary searches over the states in a random order, the agent can adaptively report a belief that assigns most weight to the last state over which the analyst performs a binary search.

4.2.2 A mechanism based on scoring rules.

In this section we present an IC learning algorithm with IC learning complexity

 $$O\!\left(n^{3/2}\log n\cdot\max\!\left(\log\frac{n}{\varepsilon},\ \log\frac{1}{\tau}\right)\right).$$

The algorithm is based on ideas from active learning, and specifically leverages convergence bounds on so-called disagreement based methods. Let denote the norm, let denote the unit -sphere, and let denote the projection map onto the unit sphere defined by .

We now present an incentive compatible learning algorithm that we henceforth refer to as .

1. Initialize .

2. For to :

1. Choose uniformly at random from . If the hyperplane does not intersect , resample.

2. Let be any elements of such that is a scalar multiple of .

3. Query oracle on pair .

3. Output any .

4. Pay the agent based on his preference from . If is the preferred vector, simulate the states of the world, and pay if state occurs.

Before analyzing the algorithm, let us briefly remark that the analyst can always find , satisfying the required conditions to query the agent. Let normal vector define a hyperplane that cuts through the projection of the current hypothesis set onto the unit sphere. Let be a point in the interior of such that . (The interior of can be written as for some , which is a non-empty intersection of open half-spaces, as the agent’s responses are required to be consistent.) We can find an open ball (with respect to the subspace topology on induced by ) of radius centered at such that . Then, take a point in the positive direction from and in the negative direction from such that . Then, .

Choosing and in this manner has no effect on the analysis of the learning rate, but is the main ingredient in achieving incentive compatibility. The learning guarantees we obtain are due to standard bounds on the label complexity of disagreement based active learning.
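To convey the flavor of the mechanism, here is an illustrative simulation that replaces the exact version-space bookkeeping with a finite sample of candidate beliefs. It is a sketch under our own simplifying assumptions (the query pair is centered at the simplex midpoint rather than at an interior point of the current hypothesis set, and the names are ours), not the paper's exact algorithm:

```python
import math
import random

def sample_sphere(n, rng):
    # Uniform direction on the unit (n-1)-sphere via normalized Gaussians.
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    s = math.sqrt(sum(vi * vi for vi in v))
    return [vi / s for vi in v]

def elicit_belief(oracle, n, rounds, n_candidates=2000, seed=0):
    """Sketch of disagreement-based belief elicitation: each round, draw
    a random hyperplane normal v and build a query pair (x, y) with
    x - y parallel to v, so the agent's answer reveals sgn(alpha . v);
    then prune candidate beliefs inconsistent with the answer.
    oracle(x, y) -> True if x is (reported as) preferred to y."""
    rng = random.Random(seed)
    # Approximate the version space by a uniform sample from the simplex.
    cands = []
    for _ in range(n_candidates):
        e = [rng.expovariate(1.0) for _ in range(n)]
        t = sum(e)
        cands.append([ei / t for ei in e])
    base = [1.0 / n] * n  # interior point used to center the query pair
    for _ in range(rounds):
        v = sample_sphere(n, rng)
        sides = {sum(ci * vi for ci, vi in zip(c, v)) >= 0 for c in cands}
        if len(sides) < 2:
            continue  # hyperplane misses the hypothesis set; resample
        eps = 1e-3
        x = [b + eps * vi for b, vi in zip(base, v)]
        y = [b - eps * vi for b, vi in zip(base, v)]
        prefers_x = oracle(x, y)  # reveals sgn(alpha . v)
        cands = [c for c in cands
                 if (sum(ci * vi for ci, vi in zip(c, v)) >= 0) == prefers_x]
        if not cands:
            return base  # candidate sample too coarse; fall back to center
    m = len(cands)
    return [sum(c[i] for c in cands) / m for i in range(n)]
```

The key point, mirrored from the mechanism above, is that every query is a comparison of two reward vectors whose difference is the sampled normal, so the agent's answer carries exactly one bit about the direction of his belief.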

Theorem 4.1.

is a learning algorithm of query complexity with respect to total variation distance.

Proof.

Suppose receives as input an oracle . If on a given round we sample a normal vector and correspondingly query points , the truthful agent’s (oracle’s) preference from precisely reveals ; this is simply because and determine the same hyperplane.

The VC dimension of the expected utility model is linear (Theorem 2 of [BE18]), and the disagreement coefficient of the class of homogeneous linear separators with respect to the uniform distribution over normal vectors is bounded above by (Theorem 1 of [HAN07]). Standard convergence results in active learning (see, e.g., [DAS11]) then imply that with queries, it holds with high probability that for all in the final hypothesis set, where

 $$\mathrm{err}_{\alpha}(\beta)=\Pr_{v\sim S^{n-1}}\bigl[\operatorname{sgn}(v\cdot\alpha)\ne\operatorname{sgn}(v\cdot\beta)\bigr]=\frac{\arccos\bigl(\rho(\alpha)\cdot\rho(\beta)\bigr)}{\pi}.$$
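The identity between the disagreement probability and the normalized angle can be checked numerically; a quick Monte Carlo sketch (function names are ours):

```python
import math
import random

def mc_disagreement(alpha, beta, trials=100000, seed=1):
    """Monte Carlo estimate of Pr_{v ~ S^{n-1}}[sgn(v.alpha) != sgn(v.beta)].
    Gaussian directions suffice: normalization does not change the sign."""
    rng = random.Random(seed)
    n = len(alpha)
    bad = 0
    for _ in range(trials):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        a = sum(vi * ai for vi, ai in zip(v, alpha))
        b = sum(vi * bi for vi, bi in zip(v, beta))
        bad += (a >= 0) != (b >= 0)
    return bad / trials

def angle_over_pi(alpha, beta):
    """The closed form: arccos of the cosine similarity, divided by pi."""
    na = math.sqrt(sum(a * a for a in alpha))
    nb = math.sqrt(sum(b * b for b in beta))
    dot = sum(a * b for a, b in zip(alpha, beta)) / (na * nb)
    return math.acos(max(-1.0, min(1.0, dot))) / math.pi
```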

For , let , so .

Running for rounds yields that for any hypothesis ,

 $$\|\rho(\alpha)-\rho(\alpha_h)\| = \sqrt{\bigl(\rho(\alpha)-\rho(\alpha_h)\bigr)\cdot\bigl(\rho(\alpha)-\rho(\alpha_h)\bigr)} = \sqrt{2-2\,\rho(\alpha)\cdot\rho(\alpha_h)}$$