# Informational Substitutes

Yiling Chen and Bo Waggoner
Harvard, University of Pennsylvania
yiling@seas.harvard.edu, bwag@seas.upenn.edu
Updated: March 2017
###### Abstract

We propose definitions of substitutes and complements for pieces of information (“signals”) in the context of a decision or optimization problem, with game-theoretic and algorithmic applications. In a game-theoretic context, substitutes capture diminishing marginal value of information to a rational decision maker. We use the definitions to address the question of how and when information is aggregated in prediction markets. Substitutes characterize “best-possible” equilibria with immediate information aggregation, while complements characterize “worst-possible”, delayed aggregation. Game-theoretic applications also include settings such as crowdsourcing contests and Q&A forums. In an algorithmic context, where substitutes capture diminishing marginal improvement of information to an optimization problem, substitutes imply efficient approximation algorithms for a very general class of (adaptive) information acquisition problems.

In tandem with these broad applications, we examine the structure and design of informational substitutes and complements. They have equivalent, intuitive definitions from disparate perspectives: submodularity, geometry, and information theory. We also consider the design of scoring rules or optimization problems so as to encourage substitutability or complementarity, with positive and negative results. Taken as a whole, the results give some evidence that, in parallel with substitutable items, informational substitutes play a natural conceptual and formal role in game theory and algorithms.

## 1 Introduction and Overview

### 1.1 Motivation and challenge

An agent living in an uncertain world wishes to make some decision (whether to bring an umbrella on her commute, how to design her company’s website, …). She can improve her expected utility by obtaining pieces of information about the world prior to acting (a weatherman’s forecast or a barometer reading, market research or automated A/B testing, …). This naturally leads her to assign value to different pieces of information and combinations thereof. The value of information arises as the expected improvement it imparts to her optimization problem.

We would like to generally understand, predict, or design algorithms to guide such agents in acquiring and using information. Consider the analogous case where the agent has value for items or goods, represented by a valuation function over subsets of items. A set of items are substitutes if, intuitively, each item’s value decreases given more of the others; they are complements if it increases. Here, we have rich game-theoretic and algorithmic theories leveraging the structure of substitutes and complements (S&C). For instance, in many settings, foundational work shows that substitutability captures positive results for existence of market equilibria, while complements capture negative results (Kelso Jr and Crawford, 1982; Roth, 1984; Gul and Stacchetti, 1999; Hatfield and Milgrom, 2005; Ostrovsky, 2008). When substitutes are captured by submodular valuation functions (Lehmann et al., 2001), algorithmic results show how to efficiently optimize (or approximately optimize) subject to constraints imposed by the environment (e.g. Calinescu et al. (2011)). For example, an agent wishing to select from a set of costly items with a budget constraint has a $(1-1/e)$-approximation algorithm if her valuation function is submodular (Sviridenko, 2004).

Can we obtain similar structural and algorithmic results for information? Here, a piece of information is modeled as a signal or random variable that is correlated in some way with the state of the world that the agent cares about (whether it will rain, how profitable are different website designs, …). Intuitively, one might often expect information to satisfy substitutable or complementary structure. For instance, a barometer reading and an observation of whether the sky is cloudy both yield valuable information about whether it will rain to an umbrella-toting commuter; but these are substitutable observations for our commuter in that each is probably worth less once one has observed the other. On the other hand, the dew point and the temperature tend to be complementary observations for our commuter: Rain may be only somewhat correlated with dew point and only somewhat correlated with temperature, but is highly correlated with cases where temperature and dew point are close (i.e. the relative humidity is high).

Despite this appealing intuition, there are significant challenges to overcome in defining informational S&C. Pieces of information, unlike items, may have complex probabilistic structure and relationships. But on the other hand, this structure alone cannot capture the value of that information, which (again unlike items) seemingly must arise from the context in which it is used. Next, even given a measure of value, it is unclear how to formalize an intuition such as “diminishing marginal value”. Finally, it remains to demonstrate that the definitions are tractable and have game-theoretic and/or algorithmic applications. These challenges seem to have prevented a successful theory of informational S&C thus far.

### 1.2 This paper: summary and contributions

This paper has four components.

1.  We propose a definition of informational substitutes and complements (S&C). Beginning from the very general notion of value of information in the context of any specific decision or optimization problem, we define S&C in terms of diminishing (increasing) marginal value for that problem. This requires a definition of “marginal unit” of information. We consider a hierarchy of three kinds of marginal information: learning another signal, learning some deterministic function of another signal, and learning some randomized function (“garbling”) of another signal. These correspond respectively to weak, moderate, and strong versions of the definitions, formalized by strengthenings of submodularity and supermodularity.

We also investigate some useful tools and equivalent definitions. From an information-theoretic perspective, substitutes can be defined as signals that reveal a diminishing amount of information about a particular event, where the measure of information is some generalized entropy function. From a geometric perspective, substitutes can be defined using a measure of distance, namely some Bregman divergence, on the space of beliefs about the event; signals are substitutes if the average change in one’s belief about the event is diminishing in the amount of information already known.

2.  We give game-theoretic applications of these definitions, primarily on information aggregation in prediction markets. When strategic agents have heterogeneous, valuable information, we would like to understand when and how their information is revealed and aggregated in an equilibrium of strategic play. Prediction markets, which are toy models of financial markets, are possibly the simplest setting capturing the essence of this question. The efficient market hypothesis states that information is quickly aggregated in financial markets (Fama, 1970). However, despite much research on this question in economics (e.g. Kyle (1985); Ostrovsky (2012)) and computer science (e.g. Chen et al. (2007); Dimitrov and Sami (2008); Gao et al. (2013)), very little was previously known about how quickly information is aggregated in markets except in very special cases.

We address the main open question regarding strategic play in prediction markets: When and how is information aggregated? We show that informational substitutes imply that all equilibria are of the “best possible” form where information is aggregated immediately, while complements imply “worst possible” equilibria where aggregation is delayed as long as possible. Furthermore, the respective converses hold as well; e.g., if an information structure guarantees the “best possible” equilibria, then it must satisfy substitutes.

###### Theorem (Informal).

In a prediction market where traders’ signals are strict, strong substitutes with respect to the market scoring rule, in all equilibria, traders rush to reveal and aggregate information immediately. Conversely, if signals are not strong substitutes, then there are arrival orders for traders where no equilibrium has immediate aggregation.

###### Theorem (Informal).

In a prediction market where traders’ signals are strict, strong complements with respect to the market scoring rule, in all equilibria, traders delay revealing and aggregating information as long as possible. Conversely, if signals are not strong complements, then there are arrival orders where no equilibrium exhibits full delaying.

Informational S&C thus seem as fundamental to equilibria of (informational) markets as substitutable items are in markets for goods.

We believe that informational S&C have the potential for broad applicability in other game-theoretic settings involving strategic information revelation, and toward this end, give some additional example applications. We show that S&C characterize analogous “rush/delay” equilibria in some models of machine-learning or crowdsourcing contests (Abernethy and Frongillo, 2011; Waggoner et al., 2015) and question-and-answer forums (Jain et al., 2009).

3.  We give algorithmic applications, focusing on the complexity of approximately-optimal information acquisition. Namely, we define a very broad class of problems, termed SignalSelection, in which a decision maker wishes to acquire information prior to making a decision, but has constraints on the acquisition process. For instance, a company wishes to purchase heterogeneous, pricey data sets subject to a budget constraint, or to place up to $k$ sensors in an environment. We show that substitutes imply efficient approximation algorithms in many cases such as a budget constraint; this extends to an adaptive version of the problem as well. We also show that the problem is hard in general and in the complements case, even when signals are independent uniform bits.

###### Theorem (Informal).

For the SignalSelection problem with e.g. cardinality or budget (“knapsack”) constraints, in the oracle model of input: If signals are weak substitutes, then polynomial-time approximation algorithms exist; but in general, or even if signals are assumed to be complements, no algorithm can achieve nonzero approximation with subexponentially many oracle queries.

These results offer a unifying perspective on a variety of similar “submodularity-based” solutions in the literature (Krause and Guestrin, 2005b; Guestrin et al., 2005; Krause and Guestrin, 2009; Golovin and Krause, 2011).

4.  We investigate the structure of informational S&C. We give a variety of tools and insights for both identifying substitutable structure and designing for it. For instance, we provide natural geometric and information-theoretic definitions of S&C and show they are equivalent to the submodularity-based definitions.

We address two fundamental questions: Are there (nontrivial) signals that are substitutes for every decision problem? Second, given a set of signals, can we always design a decision problem for which they are substitutes? In the game-theoretic settings above, this corresponds to design of mechanisms for immediate aggregation, somewhat of a holy grail for prediction markets. In algorithmic settings, it has relevance for the design of submodular surrogates (Chen et al., 2015). Unfortunately, we give quite general negative answers to both questions. Surprisingly, more positive results arise for complements. We give the geometric intuition behind these results and point toward heuristics for substitutable design in practice.

In summary, the contributions of this paper are twofold: (a) the definitions of informational S&C, along with a body of evidence that they are natural, tractable, and useful; and (b) the applications, in which we resolve a major open problem on strategic information revelation as well as give a unifying and general framework for a broad algorithmic problem. Our results on the structure and design of informational S&C point to the potential for these very general definitions and results to have concrete applications.

Taken all together, we believe these results give evidence that informational S&C, in analogy with the successful theories of substitutable goods, have a natural and useful role to play in game theory, algorithms, and in connecting the two.

### 1.3 Outline

Two sections have been placed in the appendix for convenience. Appendix A gives a detailed survey of related work in a variety of areas. In Appendix B, we overview and justify our general approach to defining informational S&C, including historical context, tradeoffs, and intuition. We particularly focus on a comparison to the proposed definitions of Börgers et al. (2013), which we build on in this paper.

In Section 2, we concisely and formally define informational S&C. We show that prediction problems, and the modern convex analysis understanding of them, can be used to analyze general decision problems. Leveraging these tools, we give three equivalent definitions from seemingly-disparate perspectives.

In Section 3, we present game-theoretic applications. Primarily, we show that informational substitutes (complements) characterize best-case (worst-case) information aggregation in prediction markets.

In Section 4, we present algorithmic applications. We define SignalSelection, a class of information acquisition problems, and show that substitutes correspond to efficient approximation algorithms while there are strong hardness results in general.

In Section 5, we investigate informational S&C themselves with an eye toward the previous applications. We give some results on general classes of S&C and on the design of prediction or optimization problems for which a given information structure is substitutable, with both game-theoretic and algorithmic implications.

Section 6 summarizes and discusses future work.

## 2 Definitions and Foundations

### 2.1 Setting: information structure and decision problems

We now formally present the setting and definitions. Motivation for the choices made and relation to prior work, particularly Börgers et al. (2013), are described in depth in Appendix B.

#### 2.1.1 Model of information and decision problems

##### Information structure.

We take a standard Bayesian model of probabilistic information. There is a random event $E$ of interest to the decision-maker, e.g. $E \in \{\text{rain}, \text{no rain}\}$. There are also $n$ “base signals” $A_1, \dots, A_n$, modeled as random events. These represent potential information obtained by a decision-maker, e.g. $A_1 \in \{\text{cloudy}, \text{sunny}\}$. An information structure is given by $E$, $A_1, \dots, A_n$, and a prior distribution $P$ on outcomes $(e, a_1, \dots, a_n)$. For simplicity, we assume that all $A_i$ and $E$ have a finite set of possible outcomes.

In addition to the base signals, there will be other signals that intuitively represent combinations of base signals. Formally, there is a set of signals, with a generic signal usually denoted $A$ or $B$. Any subscripted $A_i$ always refers to a base signal, while $A$ may in general be any signal in this set. We will describe how the set is generated from $A_1, \dots, A_n$ momentarily, in Section 2.1.2.

We will use lower-case $p, q$ to refer to probability distributions on $E$, the event of interest. The notation $P(e)$ refers to the probability that $E = e$, while $P(a)$ refers to the probability that $A = a$, and so on. The notation $P(e \mid a)$ refers to the probability that $E = e$ conditioned on $A = a$, obtained from the prior via a Bayesian update: $P(e \mid a) = P(e, a) / P(a)$. We will sometimes use the shorthand notation $p_a$ to refer to the posterior distribution on $E$ conditioned on $A = a$, similarly for $p_{ab}$ when $A = a$ and $B = b$, and so on. We will abuse notation and write $E$ to represent a set of outcomes, so for instance we may write $e \in E$; similarly for signals. We also sometimes write $\mathbb{E}_a$ for the expectation over outcomes $a$ of $A$.

##### Decision problems and value function.

A single-agent decision problem consists of a set of event outcomes $E$, a decision space $D$, and a utility function $u : D \times E \to \mathbb{R}$, where $u(d, e)$ is the utility for taking action $d$ when the event’s outcome is $e$. This decision problem, in the context of an information structure, will be how signals derive their value.

Specifically, given the prior $P$, the decision that maximizes expected utility is $\arg\max_{d \in D} \mathbb{E}_e\, u(d, e)$. But now suppose a Bayesian, rational agent knows $P$ and will first observe the signal $A$, then update to the posterior on $E$, and then choose a decision maximizing expected utility for this posterior belief. In this case, her expected utility will be given by the following “value” function:

$$V^{u,P}(A) := \mathbb{E}_{a}\left[\, \max_{d \in D} \, \mathbb{E}_{e}\left[ u(d,e) \mid A = a \right] \right]. \tag{1}$$

We will use $\bot$ to denote a null signal, so that $V^{u,P}(\bot)$ is the expected utility for deciding based only on the prior distribution. Where the decision problem and information structure are evident from context, we will omit the superscripts $u, P$.
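As a rough illustration of the value function in Equation (1), the following sketch evaluates it for a small finite information structure. The function name `value_of_signal`, the dictionary encoding of the prior, and all numbers are assumptions made for this example only; they do not come from the paper.

```python
from collections import defaultdict

def value_of_signal(prior, utility, decisions, signal):
    """V(A) = E_a[ max_d E_e[ u(d, e) | A = a ] ] for a finite prior.

    prior:  dict mapping (e, w) -> probability, where w is the realization
            of the underlying base signal(s).
    signal: function mapping w to the realization a that the agent observes.
    """
    # Group the joint probability mass P(e, a) by observed realization a.
    mass = defaultdict(lambda: defaultdict(float))
    for (e, w), p in prior.items():
        mass[signal(w)][e] += p
    # sum_a max_d sum_e P(e, a) u(d, e) equals E_a[ max_d E_e[u(d, e) | a] ].
    return sum(
        max(sum(p * utility(d, e) for e, p in by_e.items()) for d in decisions)
        for by_e in mass.values()
    )

# Hypothetical numbers: a joint prior over (event, sky observation).
prior = {
    ("rain", "cloudy"): 0.3, ("dry", "cloudy"): 0.2,
    ("rain", "clear"): 0.1, ("dry", "clear"): 0.4,
}
decisions = ["rain", "dry"]  # assumed decision: guess the event
payoff = lambda d, e: 1.0 if d == e else 0.0

v_null = value_of_signal(prior, payoff, decisions, lambda w: None)  # null signal
v_sky = value_of_signal(prior, payoff, decisions, lambda w: w)      # observe sky
```

Under these assumed numbers the null signal is worth about 0.6 (always guess "dry") and the sky observation about 0.7, so the marginal value of the observation is roughly 0.1.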

Intuitively, $V$ is analogous to a valuation function over subsets of items. However, inputs to $V$ may not only represent subsets of $\{A_1, \dots, A_n\}$, but also signals that give partial information about them.

#### 2.1.2 Signal lattices

We will consider three kinds of signal sets, leading to “weak”, “moderate”, and “strong” substitutes and complements. In each case, the set of signals will form a lattice.

###### Definition 2.1.1.

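A lattice is a set together with a partial order $\preceq$ on it such that for all $A, B$ in the set, there are a meet $A \wedge B$ and join $A \vee B$ in the set satisfying:

1. $A \wedge B \preceq A \preceq A \vee B$ and $A \wedge B \preceq B \preceq A \vee B$; and

2. the meet and join are the “highest” and “lowest” (respectively) elements in the order satisfying these inequalities.

In a lattice, $\bot$ denotes the “bottom” element and $\top$ the “top” element, i.e. $\bot \preceq A \preceq \top$ for all $A$, if they exist.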

The following definition illustrates one very common lattice, that of subsets of a ground set.

###### Definition 2.1.2.

The subsets signal lattice generated by $A_1, \dots, A_n$ consists of an element $A_S$ for each subset $S$ of $\{1, \dots, n\}$, where $A_S$ is the signal conveying all realizations $\{A_i = a_i : i \in S\}$. Its partial order is $A_S \preceq A_{S'}$ if and only if $S \subseteq S'$. Hence, its meet operation is given by set intersection and join by set union.

The bottom element of the subsets lattice exists and is a null signal $\bot$ corresponding to the empty set (we will use this notation somewhat often), while the top element also exists and corresponds to observing all signals. Also, the partial ordering $\preceq$ denotes “less informative”. These facts will continue to hold for the other two signal lattices we define.

For the other two lattices, we utilize the main idea from the classic model of information due to Aumann (1976). Let the set $\Gamma \subseteq A_1 \times \cdots \times A_n$ consist of all signal realizations in the support of the prior distribution. Now, a partition of $\Gamma$ is a collection of subsets of $\Gamma$ such that each $\gamma \in \Gamma$ is in exactly one subset. Each signal $A$ corresponds to a partition of $\Gamma$ with one subset for each outcome $a$, namely, the set of realizations $\gamma = (a_1, \dots, a_n)$ for which $A = a$. Example 2.1.1, given after the definition of discrete signal lattice, illustrates the partition model.

As in Aumann’s model, the partitions of $\Gamma$ form a lattice, each partition corresponding to a possible signal. The partial ordering is that $A \preceq B$ if the partition of $A$ is “coarser” than that of $B$. One partition is coarser than another (which is finer) if each element of the former is partitioned by elements of the latter. The join of two partitions is the coarsest common refinement (the coarsest partition that is finer than each of the two), while the meet is the finest common coarsening. Example 2.1.2, given after the definition, illustrates coarsenings and refinements.
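The partition operations just described can be sketched in a few lines. This is an illustrative sketch only: partitions are encoded as sets of `frozenset` blocks, the helper names `join`, `meet`, and `is_coarser` are hypothetical, and the six-outcome ground set with its "small/large" and "even/odd" partitions is an assumed toy instance.

```python
def join(p, q):
    """Coarsest common refinement: the nonempty pairwise intersections."""
    return {a & b for a in p for b in q if a & b}

def meet(p, q):
    """Finest common coarsening: merge blocks that overlap, transitively."""
    blocks = [set(b) for b in p]
    for b in q:
        touching = [x for x in blocks if x & b]
        rest = [x for x in blocks if not (x & b)]
        # Every current block touched by b must end up in one merged block.
        blocks = rest + [set(b).union(*touching)]
    return {frozenset(b) for b in blocks}

def is_coarser(p, q):
    """True if every block of q lies inside some block of p."""
    return all(any(b <= a for a in p) for b in q)

# Assumed toy instance on the ground set {1, ..., 6}.
singletons = {frozenset({i}) for i in range(1, 7)}
small_large = {frozenset({1, 2, 3}), frozenset({4, 5, 6})}
even_odd = {frozenset({2, 4, 6}), frozenset({1, 3, 5})}

both = join(small_large, even_odd)     # observe both reports
neither = meet(small_large, even_odd)  # what the two reports have in common
```

Here `both` has the four blocks {1, 3}, {2}, {4, 6}, {5}, while `neither` is the single block {1, ..., 6}, i.e. a null signal. The two binary partitions are each coarser than `singletons` but incomparable with one another.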

###### Definition 2.1.3.

The discrete signal lattice generated by $A_1, \dots, A_n$ consists of all signals corresponding to partitions of $\Gamma$, where $\Gamma$ is the subset of $A_1 \times \cdots \times A_n$ with positive probability. Its partial order has $A \preceq B$ if the partition associated to $A$ is coarser than that of $B$.

###### Example 2.1.1 (Signals modeled as partitions).

(a) We have two independent uniform bits $A_1$ and $A_2$. In this case $\Gamma = \{(0,0), (0,1), (1,0), (1,1)\}$. Here $A_1$ is modeled as the partition consisting of two elements: $\{(0,0), (0,1)\}$ and $\{(1,0), (1,1)\}$. The first element of the partition is the set of realizations where $A_1 = 0$, while the second is the set of realizations where $A_1 = 1$.

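(b) Now modify the example so that $A_1$ and $A_2$ are perfectly correlated: with probability $1/2$, $A_1 = A_2 = 0$, and with probability $1/2$, $A_1 = A_2 = 1$. Here, $\Gamma = \{(0,0), (1,1)\}$ and $A_1$ corresponds to the partition consisting of $\{(0,0)\}$ and $\{(1,1)\}$.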

(c) Now revisit the first case where $A_1$ and $A_2$ are independent. Imagine an agent who observes both base signals and wishes to reveal only the XOR of the bits. This new signal released by the agent, call it $B$, may also be modeled as a partition of $\Gamma$, where the elements of the partition are $\{(0,0), (1,1)\}$ and $\{(0,1), (1,0)\}$.

###### Example 2.1.2 (The order given by coarsenings).

(a) We have a single signal which is distributed uniformly on . Then and ’s partition contains these six subsets: .

(b) Given the above information structure, suppose that an agent holding will commit to releasing some deterministic function of . In terms of information revealed, the agent may map each realization to a different report – this is the same as just revealing – or she may map some realizations to the same report. Suppose that she reports “small” whenever and reports “large” whenever . The information revealed by this report is captured by a binary signal corresponding to the partition with two elements: and . The partition of is coarser than that of , so on the discrete lattice.

(c) Given the same information structure, imagine that the agent will commit to releasing “even” whenever and “odd” whenever . This corresponds to a signal whose partition has these two elements and is again coarser than that of .

(d) Consider the above two signals and . They are incomparable: Neither is coarser nor finer than the other. The meet will be the null signal111Formally, this is the signal whose partition contains a single element: all of . , intuitively because given , one cannot guarantee anything about the outcome of . The join intuitively corresponds to observing both signals. Let . The partition corresponding to has the following four elements: , , , . These each correspond to a realization of the signal ; call the realizations respectively . Here, when for example “small” and “even”, then and an observer of would know that . When “large” and “even”, then and an observer of would know that , updating to a posterior on these possibilities.

For the third and strongest notion, we extend the model by, intuitively, appending randomness to the signals on the discrete lattice. Given any signal on the discrete lattice, a “garbling” of can be captured by a randomized function of ; but this may be modeled as a deterministic function where is a uniform random variable222In some applications, it may be more desirable to use an infinite string of independent uniform bits.. This observation allows us to “reduce” to the deterministic case, but where each possible signal carries extra information in the form of some independent randomness.

Specifically, let be defined as above (the subset of with positive probability) and, for each partition of , let drawn independently from the uniform distribution. Let where . Now, we proceed as before, but using .333To be precise, we should restrict to measurable subsets using the Lebesgue measure on . We will omit this discussion for simplicity; if concerned, the reader may alternatively assume that each is drawn uniformly from a massive but finite set, with some tiny approximation carried through our results.

###### Definition 2.1.4.

The continuous signal lattice consists of a signal corresponding to each partition of . Its partial order has if the partition associated to is coarser than that of .

###### Example 2.1.3 (Modeling garblings via the continuous lattice).

Consider a uniformly random bit as the only base signal; the resulting is . Now consider the garbling where, if , then output “happy” with probability and “sad” otherwise; if it equals , then output “happy” with probability and “sad” otherwise. Call the output of the garbling . Then can be modeled as a partition of with the following two subsets: , and . Here the first realization of corresponds to the output “happy”, while the second corresponds to output “sad”. To see this, note for instance that the first realization contains all the elements of where and the randomness variable . So when , assuming is drawn uniformly and independently from , then the outcome of is “happy” with probability . On the continuous lattice, corresponds to the partition of singletons such as , , and so on. That is, it corresponds to observing both the original binary bit as well as the random real number . Because this partition is finer than that corresponding to , we have on the continuous lattice.

The use of “happy” and “sad” for the outputs in the above example illustrates that it is not important, when considering the information conveyed by signal , to consider what its realizations were named. All that matters is their distributions, e.g. the partitions they represent.

###### Example 2.1.4 (Modeling garblings, continued).

Again let be a uniformly random bit, and now suppose is obtained by adding to independent Gaussian noise with mean and variance . In this case, intuitively, each outcome of (say ) represents two possibilities (such as and the Gaussian is , or and the Gaussian is ). can be modeled as a partition of where is interpreted as the quantile of the outcome of the Gaussian. Each member of the partition has two elements. These can be written and with and , using the standard normal CDF . Given the realization , the posterior distribution on is given by a Bayesian update depending on the probability density of the Gaussian at and at .

### 2.2 The definitions of substitutes and complements

We utilize a common notion of diminishing and increasing marginal value. For example, the idea of submodularity is that a lattice element $B$’s marginal contribution to $A$ should be smaller than its marginal contribution to some $A' \preceq A$.

###### Definition 2.2.1.

A function $f$ from a lattice to the reals is submodular if it exhibits diminishing marginal value: For all $A', A, B$ on the lattice with $A' \preceq A$,

$$f(B \vee A') - f(A') \;\geq\; f(B \vee A) - f(A).$$

It is supermodular if it exhibits increasing marginal value: For all on the lattice with , the above inequality is reversed. The sub- or super-modularity is strict if, whenever and are incomparable on the lattice’s ordering and , the inequality is strict.

###### Definition 2.2.2 (Weak Informational S&C).

In the context of a decision problem $u$ and prior $P$, the signals $A_1, \dots, A_n$ are weak substitutes if $V^{u,P}$ is submodular on the subsets signal lattice generated by $A_1, \dots, A_n$. They are weak complements if $V^{u,P}$ is supermodular on their subsets signal lattice. The signals are strict substitutes (complements) if $V^{u,P}$ is strictly submodular (supermodular).

For moderate and strong substitutes, we will use a strengthening of submodularity by requiring diminishing marginal value with respect to, respectively, all deterministic and randomized functions of a signal. (Footnote: Previous versions of this paper used submodularity on the discrete and continuous signal lattices, which is a more restrictive definition. We only need this weaker definition for results in this paper, but in general both are interesting and the right choice may be context-dependent; or there could be other interesting variations.)

###### Definition 2.2.3 (Moderate and Strong Informational S&C).

In the context of a decision problem and prior , the signals are moderate (respectively, strong) substitutes if, for all on the subsets signal lattice, and for all on the discrete (respectively, continuous) signal lattice with ,

$$V^{u,P}(B \vee A') - V^{u,P}(A') \;\geq\; V^{u,P}(B \vee A) - V^{u,P}(A).$$

They are moderate (strong) complements if, for all on the subsets signal lattice and all on their discrete (continuous) signal lattice with , the above inequality is reversed.

The signals are strict substitutes (complements) if whenever and are incomparable and , the respective inequality is strict.

Verbally, S&C capture that the more pieces of information one has, the less valuable (respectively, more valuable) $B$ becomes. The levels of weak, moderate, and strong capture the senses in which “pieces of information” is interpreted. Weak substitutes satisfy diminishing marginal value whenever a whole signal is added to a subset of signals. However, they do not give guarantees about marginal value with respect to partial information about signals. Moderate and strong substitutes respectively satisfy diminishing marginal value when deterministic (randomized) partial information about a signal is revealed.

###### Observation 2.2.1.

Strong substitutes imply moderate substitutes, which imply weak substitutes. The same holds for complements.

###### Proof.

The respective lattices are supersets, i.e. the continuous signal lattice is a superset of the discrete lattice, which is a superset of the subsets lattice, and the partial orderings agree. So each substitutes definition requires $V$ to satisfy a set of inequalities at various points, and in going from weak to moderate to strong substitutes, we simply increase the set of required inequalities that signals must satisfy. ∎

###### Example 2.2.1 (Substitutes).

The event $E$ is a uniformly random bit and the two signals are $A_1 = E$ and $A_2 = E$. That is, both signals are always equal to $E$. The decision problem is to predict the outcome of $E$ by deciding either $0$ or $1$, with a payoff of $1$ for correctness and $0$ otherwise. In this case, one can immediately see that $A_1$ and $A_2$ are e.g. weak substitutes, as a second signal never gives marginal benefit over the first.

###### Example 2.2.2 (Complements).

The event and decision problem are the same as in Example 2.2.1, but this time $A_1$ and $A_2$ are uniformly random bits with $E = A_1 \oplus A_2$, the XOR of $A_1$ and $A_2$. In this case, $A_1$ and $A_2$ are immediately seen to be e.g. weak complements, as a first signal never gives marginal benefit over the prior.
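The two examples above can be checked numerically. The following sketch assumes, for illustration, that a correct guess of the bit pays 1 and an incorrect guess pays 0; the encoding of the prior and the helper names `value` and `marginals` are hypothetical.

```python
from itertools import product

def value(prior, observed):
    """Value of observing the base signals indexed by `observed`, when the
    decision is a guess of the bit E worth 1 if correct and 0 otherwise."""
    mass = {}
    for (e, signals), p in prior.items():
        seen = tuple(signals[i] for i in sorted(observed))
        # mass[seen][e] accumulates the joint probability P(E = e, seen).
        mass.setdefault(seen, [0.0, 0.0])[e] += p
    # For each observation, guess the more likely outcome of E.
    return sum(max(m) for m in mass.values())

# Example 2.2.1: both signals are copies of the uniform bit E.
copies = {(e, (e, e)): 0.5 for e in (0, 1)}
# Example 2.2.2: independent uniform bits A1, A2 with E = A1 XOR A2.
xor = {(a1 ^ a2, (a1, a2)): 0.25 for a1, a2 in product((0, 1), repeat=2)}

def marginals(prior):
    """Marginal value of signal 0 alone, and of signal 0 given signal 1."""
    alone = value(prior, {0}) - value(prior, set())
    given_other = value(prior, {0, 1}) - value(prior, {1})
    return alone, given_other
```

For `copies`, the marginal value of a signal drops from 0.5 to 0 once the other is known (diminishing, hence substitutes); for `xor`, it rises from 0 to 0.5 (increasing, hence complements). With only two signals, this comparison and its mirror image are the substantive inequalities on the subsets lattice; the remaining ones reduce to the fact that more information never lowers the value.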

###### Example 2.2.3 (Weak vs moderate).

Here is an example of weak substitutes that are not moderate substitutes. Intuitively, we will pair the previous two examples. The event consists of a pair of independent uniformly random bits. The decision problem is to predict both components of , getting one point for each correct answer. Let the random variable and . Let the random variables and be uniformly random bits such that .

Now, consider the signals and . Intuitively, the first component of each signal completely determines , while the second component gives no information about until combined with the other signal. Hence these signals intuitively have both substitutable and complementary internal structure. Consider the subsets lattice . If we modify the decision problem such that predicting the first component of is worth points, then these signals are weak substitutes: Each alone is worth points, while together they are worth points. On the other hand, if we modify the decision problem such that the second component of is worth points, then these signals become weak complements for analogous reasons.

On the other hand, these signals are neither moderate substitutes nor moderate complements. One way to see this is to consider “coarsening” into the signal ; this has diminishing marginal value when added to . However, we could also coarsen into the signal , which has increasing marginal value when added to .

### 2.3 Scoring rules and a revelation principle

We now introduce proper scoring rules and prove a useful “revelation principle”.

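A scoring rule for an event $E$ is a function $S : \Delta_E \times E \to \mathbb{R}$, where $\Delta_E$ denotes the set of probability distributions on $E$, so that $S(\hat{q}, e)$ is the score assigned to a prediction (probability distribution) $\hat{q}$ when the true outcome realized is $e$. Define the useful notation $S(\hat{q}; q) = \mathbb{E}_{e \sim q}\, S(\hat{q}, e)$ for the expected score under true belief $q$ for reporting $\hat{q}$ to the scoring rule.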

The scoring rule is (strictly) proper if for all $q$, setting $\hat{q} = q$ (uniquely) maximizes the expected score $S(\hat{q}; q)$. In other words, if $E$ is distributed according to $q$, then truthfully reporting $q$ to the scoring rule (uniquely) maximizes expected score.

A fundamental characterization of scoring rules is as follows:

###### Fact 2.3.1 (McCarthy (1956); Savage (1971); Gneiting and Raftery (2007)).

For every (strictly) proper scoring rule $S$, there exists a (strictly) convex function $G$ with (1) $G(q) = S(q; q)$ and (2)

$$S(\hat{q}, e) = G(\hat{q}) + \langle G'(\hat{q}), \delta_e - \hat{q} \rangle$$

where $G'(\hat{q})$ is a subgradient of $G$ at $\hat{q}$ and $\delta_e$ is the probability distribution on $E$ putting probability $1$ on $e$ and $0$ elsewhere.

Furthermore, for every (strictly) convex function $G$, there exists a (strictly) proper scoring rule $S$ such that (1) and (2) hold.

###### Proof.

Given any (strictly) convex $G$, we first check that the induced $S$ is (strictly) proper. Select a subgradient $G'(\hat{q})$ at each point $\hat{q}$. The expected score for reporting $\hat{q}$ when $E$ is distributed according to $q$ is

$$
\begin{aligned}
S(\hat{q}; q) &= \mathbb{E}_{e \sim q} S(\hat{q}, e) \\
&= G(\hat{q}) + \langle G'(\hat{q}), q - \hat{q} \rangle \\
&\leq G(q) && \text{by convexity of } G \\
&= S(q; q).
\end{aligned}
$$

Note that the inequality follows simply because, for any convex $G$, if we take the linear approximation at some point $\hat{q}$ and evaluate it at a different point $q$, this lies below $G(q)$. Furthermore, if $G$ is strictly convex, then this inequality is strict, implying strict properness.

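Now, given a (strictly) proper $S$, we show that it has the stated form. Define $G(q) = S(q; q)$. Note that $S(\hat{q}; q)$ is a linear function of $q$. By properness, each $G(q) = \max_{\hat{q}} S(\hat{q}; q)$. Since $G$ is a pointwise maximum over a set of linear functions of $q$, $G$ is convex. If $S$ was strictly proper, then the report $\hat{q} = q$ was the unique maximizer at every point $q$, implying that $G$ is strictly convex.

Now we claim that $S(\hat{q}; \cdot)$ is a subtangent of $G$ at $\hat{q}$: it is linear, equal to $G$ at $\hat{q}$, and everywhere below $G$ by definition of $G$. So in particular $S(\hat{q}, e) = S(\hat{q}; \delta_e) = G(\hat{q}) + \langle G'(\hat{q}), \delta_e - \hat{q} \rangle$, as promised. ∎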

###### Example 2.3.1.

The $\log$ scoring rule is $S(\hat{q}, e) = \log \hat{q}(e)$, i.e. the logarithm (usually base $2$) of the probability assigned to the realized event. The expected score function is $G(q) = \sum_e q(e) \log q(e) = -H(q)$, where $H$ is the Shannon entropy function.
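To make the construction in Fact 2.3.1 concrete, the following minimal sketch builds a scoring rule from a convex function and checks properness numerically. The choice $G(q) = \sum_e q(e)^2$ and all function names are illustrative assumptions, not taken from the paper.

```python
import random

def quadratic_G(q):
    """Assumed convex expected-score function G(q) = sum_e q(e)^2."""
    return sum(x * x for x in q)

def quadratic_G_grad(q):
    """Gradient of G at q (the unique subgradient, since G is differentiable)."""
    return [2.0 * x for x in q]

def score(q_hat, e):
    """S(q_hat, e) = G(q_hat) + <G'(q_hat), delta_e - q_hat>."""
    grad = quadratic_G_grad(q_hat)
    inner = grad[e] - sum(g * x for g, x in zip(grad, q_hat))
    return quadratic_G(q_hat) + inner

def expected_score(q_hat, q):
    """S(q_hat; q): expected score for reporting q_hat when e is drawn from q."""
    return sum(q[e] * score(q_hat, e) for e in range(len(q)))

def random_distribution(n, rng):
    """A random probability vector with n outcomes."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
q = random_distribution(3, rng)
truthful = expected_score(q, q)
# Best expected score among 1000 random misreports; properness says none beats truth.
best_lie = max(expected_score(random_distribution(3, rng), q) for _ in range(1000))
print(truthful >= best_lie)
```

For this particular $G$ the induced rule simplifies to $2\hat{q}(e) - \sum_{e'} \hat{q}(e')^2$, so the truthful expected score equals $G(q)$, matching condition (1).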

Notice that a scoring rule is a special case of a decision problem: The utility function is the scoring rule $S$, $E$ is the event picked by nature, and the decision space $D$ is the set of probability distributions on $E$. We now show that in a sense, scoring rules capture all decision problems. This is not surprising or difficult, and may have been observed prior to this work; but we formalize it because it captures a very nice and useful intuition.

###### Theorem 2.3.1 (Revelation principle).

For any decision problem $u$, there exists a proper scoring rule $S$ that is equivalent to the original decision problem in that for all information structures $P$ and signals $A$, $V^{u,P}(A) = V^{S,P}(A)$.

###### Proof.

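The idea of the proof, as suggested by the name, is simply for the agent to report her belief about $E$ to the scoring rule and for the scoring rule to simulate the optimal decision for this belief, paying the agent according to the utility derived from that decision. For a given distribution (“belief”) $q$ on $E$, let $d^*_q$ be the optimal decision, i.e. $d^*_q \in \arg\max_{d \in D} \mathbb{E}_{e \sim q} u(d, e)$. Now, given $u$, let

$$S(\hat{q}, e) = u(d^*_{\hat{q}}, e).$$

First let us show properness, i.e. $S(\hat{q}; q) \leq S(q; q)$. We have

$$
\begin{aligned}
S(\hat{q}; q) &= \mathbb{E}_{e \sim q} S(\hat{q}, e) \\
&= \mathbb{E}_{e \sim q} u(d^*_{\hat{q}}, e) \\
&\leq \mathbb{E}_{e \sim q} u(d^*_q, e) \\
&= \mathbb{E}_{e \sim q} S(q, e) \\
&= S(q; q)
\end{aligned}
$$

using the definition of $d^*_q$.

Now let us check equivalence to the original problem. Let $q_a$ be the distribution on $E$ conditioned on $A = a$. We have

$$
\begin{aligned}
V^{u,P}(A) &= \mathbb{E}_a \max_{d \in D} \mathbb{E}_{e \sim q_a} u(d, e) \\
&= \mathbb{E}_a \mathbb{E}_{e \sim q_a} u(d^*_{q_a}, e) \\
&= \mathbb{E}_a \mathbb{E}_{e \sim q_a} S(q_a, e) \\
&= \mathbb{E}_a \max_{\hat{q}} \mathbb{E}_{e \sim q_a} S(\hat{q}, e) && \text{by properness} \\
&= V^{S,P}(A).
\end{aligned}
$$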

This reduction is not necessarily computationally efficient, because the input to the scoring rule is a probability distribution over $E$, which may have a large number of outcomes. We note two positives, however. First, the reduction does not necessarily need to be computationally efficient to be useful for proofs and analysis. Second, in any case where it seems reasonable to assume that the agent can solve her decision problem, which involves an expectation over possible outcomes of $E$, it seems reasonable to suppose that she can efficiently represent or query her beliefs. In this case we may often expect a computationally efficient reduction and construction of $S$. This is a direction for future work.
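The reduction in the proof can be sketched on a toy instance. The decision problem, the joint distribution, and every name below are hypothetical and chosen only for illustration: the scoring rule pays the utility of the decision that is optimal for the reported belief.

```python
# Hypothetical decision problem: utility[d][e] is the payoff u(d, e),
# with outcome e = 0 (rain) or e = 1 (sun).
utility = {
    "umbrella": [1.0, 0.2],
    "no_umbrella": [0.0, 1.0],
    "stay_home": [0.6, 0.6],
}

def expected_utility(d, q):
    """Expected utility of decision d under belief q."""
    return sum(q[e] * utility[d][e] for e in range(len(q)))

def best_decision(q):
    """d*_q: a decision maximizing expected utility under belief q."""
    return max(utility, key=lambda d: expected_utility(d, q))

def score(q_hat, e):
    """S(q_hat, e) = u(d*_{q_hat}, e): simulate the optimal decision for the report."""
    return utility[best_decision(q_hat)][e]

def expected_score(q_hat, q):
    """S(q_hat; q): expected score of report q_hat under true belief q."""
    return sum(q[e] * score(q_hat, e) for e in range(len(q)))

# Assumed toy joint distribution over (signal realization a, outcome e).
joint = {("wet", 0): 0.35, ("wet", 1): 0.15, ("dry", 0): 0.05, ("dry", 1): 0.45}

def posteriors(joint):
    """Map each signal realization a to (Pr[a], posterior q_a over e)."""
    out = {}
    for a in {a for a, _ in joint}:
        pa = sum(p for (a2, _), p in joint.items() if a2 == a)
        out[a] = (pa, [joint[(a, e)] / pa for e in (0, 1)])
    return out

def value_decision(joint):
    """Value of the signal in the original decision problem."""
    return sum(pa * max(expected_utility(d, qa) for d in utility)
               for pa, qa in posteriors(joint).values())

def value_scoring(joint):
    """Value of the signal when reporting truthfully to the induced scoring rule."""
    return sum(pa * expected_score(qa, qa) for pa, qa in posteriors(joint).values())

# Properness check on a grid of beliefs and reports.
grid = [[i / 20, 1 - i / 20] for i in range(21)]
is_proper = all(expected_score(q_hat, q) <= expected_score(q, q) + 1e-12
                for q in grid for q_hat in grid)
print(is_proper, value_decision(joint), value_scoring(joint))
```

Note that the induced rule here is proper but not strictly proper: any two reports that lead to the same optimal decision earn the same score.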

The revelation principle (Theorem 2.3.1) and scoring rule characterization (Fact 2.3.1) together imply the following extremely useful fact about general decision problems. We do not claim originality for it; the idea can be found in Savage (1971) and similar ideas or statements are present in e.g. Frongillo and Kash (2014) and Babaioff et al. (2012). But it is worth emphasizing because we will put it to extensive use in this paper.

###### Corollary 2.3.1.

For any decision problem $u$ there exists a corresponding convex function $G$, and for every such $G$ there exists a decision problem $u$, such that $G(q)$ is the expected utility for acting optimally when the agent’s posterior belief on $E$ is $q$.

Hence for instance $V(A) = \mathbb{E}_a G(p_a)$, where $p_a$ is the posterior on $E$ given $A = a$.

As an example of usefulness, we provide a concise proof of the following classic theorem.

In any decision problem, for any signals $A_1, A_2$, $V(A_1 \vee A_2) \geq V(A_1)$. In other words, more information always (weakly) improves the expected utility of a decision problem. Equivalently, $V$ is a monotone increasing function on the signal lattices.

###### Proof.

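Recall that we are using the notation $p_{a_1}$ for the distribution on $E$ conditioned on $A_1 = a_1$, and so on. In particular, $p_{a_1}$ is a vector indexed by the outcomes of $E$, with coordinates $p(e \mid a_1)$. By the revelation principle, for some convex $G$ we have $V(A_1) = \mathbb{E}_{a_1} G(p_{a_1})$, and

$$
\begin{aligned}
V(A_1 \vee A_2) &= \mathbb{E}_{a_1} \Big[ \sum_{a_2} p(a_2 \mid a_1)\, G(p_{a_1 a_2}) \Big] \\
&\geq \mathbb{E}_{a_1} G\Big( \sum_{a_2} p(a_2 \mid a_1)\, p_{a_1 a_2} \Big) && \text{by Jensen's inequality} \\
&= \mathbb{E}_{a_1} G(p_{a_1}) \\
&= V(A_1).
\end{aligned}
$$

To obtain the last equality: Each term in the sum consists of the scalar $p(a_2 \mid a_1)$ multiplied by the vector $p_{a_1 a_2}$, and for each coordinate $e$ of the vector, we have $p(a_2 \mid a_1)\, p(e \mid a_1, a_2) = p(e, a_2 \mid a_1)$. Then $\sum_{a_2} p(e, a_2 \mid a_1) = p(e \mid a_1)$. ∎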

### 2.4 Characterizations

In this section, we show how the substitutes and complements conditions can be phrased using the convexity connection just derived. We will leverage this structure to identify characterizations or alternative definitions of substitutes and complements. For brevity, we will focus on substitutes, but in all cases the extension to complements is immediate.

From Corollary 2.3.1, we also get immediately the following characterization:

###### Definition 2.4.1 (Substitutes via convex functions).

For any decision problem, letting $G$ be the associated expected score function, signals are respectively (weak, moderate, strong) substitutes for that decision problem if and only if, for all $A', B$ on the subsets lattice and all $A$ on the respectively (subsets, discrete, continuous) lattice with $A' \preceq A$,

$$\mathbb{E}_{a',b}\, G(p_{a'b}) - \mathbb{E}_{a'}\, G(p_{a'}) \;\geq\; \mathbb{E}_{a,b}\, G(p_{ab}) - \mathbb{E}_{a}\, G(p_{a}).$$

To be clear, our use of parentheses here means that we are giving three definitions: one for weak substitutes, where the condition must hold for all $A$ on the subsets lattice; and analogously for moderate substitutes with the discrete lattice and strong substitutes with the continuous lattice.

We view Definition 2.4.1 mostly as a tool, although it may convey some intuition on its own as well. Definition 2.4.1 will be pictured in Figure 3 along with the final characterization.

#### 2.4.1 Generalized entropies

Here, we seek an alternative interpretation of the definitions of S&C in terms of information and uncertainty. To this end, for any decision problem, consider the convex expected score function $G$ and define $h = -G$. Then $h$ is concave, and we interpret $h$ as a generalized entropy or measure of information. The justification for this is as follows: Define the notation $h(E|A) = \mathbb{E}_a h(p_a)$, where $p_a$ is the distribution on $E$ conditioned on $A = a$. Then concavity of $h$ implies via Jensen’s inequality that for all $A, B$, we have $h(E|A) \geq h(E|A \vee B)$. In other words, more information always decreases uncertainty/entropy.

We propose that this is the critical axiom a generalized entropy must satisfy: If more information always decreases $h$, then in a sense it measures uncertainty, while if more information sometimes increases $h$, then it should not be considered a measure of uncertainty. However, admittedly, the appeal of this definition may increase by adding additional axioms as are common in the literature, such as maximization at the uniform distribution and value zero at degenerate distributions. Another very intriguing axiom would be a relaxation of the “chain rule” in either direction: the entropy of a joint observation is restricted to be either greater than or less than the sum of the marginal and conditional entropies. (Note that the chain rule itself, along with concavity, uniquely characterizes Shannon entropy.) Such axioms may have interesting consequences for informational S&C. Examining the structure of S&C under such axioms represents an intriguing direction for future work.

Under this interpretation, Definition 2.4.1 can be restated:

###### Definition 2.4.2 (Substitutes via generalized entropies).

For any decision problem $u$, let the generalized entropy function $h = -G$ where $G$ is the corresponding expected score function. Signals are (weak, moderate, strong) substitutes for $u$ if and only if, for all $A', B$ on the subsets lattice and all $A$ on the (subsets, discrete, continuous) lattice with $A' \preceq A$,

$$h(E|A') - h(E|A' \vee B) \;\geq\; h(E|A) - h(E|A \vee B).$$

Intuitively, Definition 2.4.2 says this: Consider the expected amount of information about $E$ that is revealed upon learning $B$, given that some information $A$ will already be known. Use the generalized entropy $h$ to measure this information gain. Then substitutes imply that, the more information one has, the less information $B$ reveals. On the other hand, complements imply that, the more information one has, the more information $B$ reveals.

###### Example 2.4.1.

Revisiting Example 2.2.1, where was a uniform bit and , imagine predicting against the scoring rule. Our previous observations imply that here the generalized entropy function is Shannon entropy . We have and , which already shows that and are weak substitutes.

If we instead revisit Example 2.2.2, where with uniformly random bits, and again consider predicting according to the scoring rule, then we see that , while , already proving that and are weak complements.
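The two cases in this example can be checked numerically. The sketch below assumes Shannon entropy in bits and uses hypothetical names (`copies`, `xor`, signal indices 1 and 2) for the two information structures: two signals that both equal a uniform bit, and two independent uniform bits whose XOR is the event.

```python
import math
from collections import defaultdict

def conditional_entropy(joint, observed):
    """Shannon entropy (bits) of the event given the signals at indices `observed`.

    `joint` maps tuples (e, s1, s2) to probabilities; index 0 is the event.
    """
    groups = defaultdict(lambda: defaultdict(float))
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in observed)
        groups[key][outcome[0]] += p
    total = 0.0
    for dist in groups.values():
        mass = sum(dist.values())
        for p in dist.values():
            if p > 0:
                total -= p * math.log2(p / mass)
    return total

def information_gain(joint, known, new):
    """Drop in conditional entropy from learning `new` on top of `known`."""
    return conditional_entropy(joint, known) - conditional_entropy(joint, known + new)

# Both signals are copies of a uniform bit e.
copies = {(e, e, e): 0.5 for e in (0, 1)}
# Signals are independent uniform bits and the event is their XOR.
xor = {(s1 ^ s2, s1, s2): 0.25 for s1 in (0, 1) for s2 in (0, 1)}

for name, joint in (("copies", copies), ("xor", xor)):
    alone = information_gain(joint, (), (2,))
    after_other = information_gain(joint, (1,), (2,))
    print(name, alone, after_other)
```

In the `copies` structure the second signal's gain shrinks once the first is known (diminishing, as for substitutes), while in the `xor` structure it grows (as for complements).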

#### 2.4.2 Bregman divergences

Given a convex function $G$, the Bregman divergence of $G$ is defined as

$$D_G(p, q) = G(p) - \big( G(q) + \langle G'(q), p - q \rangle \big).$$

In other words, it is the difference between $G(p)$ and the linear approximation of $G$ at $q$, evaluated at $p$. (See Figure 3.) Another interpretation is to consider the proper scoring rule $S$ associated with $G$, by Fact 2.3.1, and note that $D_G(p, q) = S(p; p) - S(q; p)$, the difference in expected score when reporting one’s true belief $p$ versus lying and reporting $q$. The defining property of a convex function is that this quantity is always nonnegative. This can be observed geometrically in Figure 1 as well.

This notion is useful to us because, it turns out, all marginal values of information can be exactly characterized as Bregman divergences between beliefs.

###### Lemma 2.4.1.

In a decision problem, for any signals $A, B$, the marginal value of $B$ given $A$ is

$$V(A \vee B) - V(A) = \mathbb{E}_{a,b}\, D_G(p_{ab}, p_a)$$

where $G$ is the associated expected score function and $p_a$ is the distribution of $E$ given $A = a$.

###### Proof.
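$$
\begin{aligned}
V(A \vee B) - V(A) &= \mathbb{E}_{a,b}\, G(p_{ab}) - \mathbb{E}_{a}\, G(p_a) \\
&= \mathbb{E}_{a,b} \big( G(p_{ab}) - G(p_a) \big) \\
&= \mathbb{E}_{a,b} \big( D_G(p_{ab}, p_a) + \langle G'(p_a), p_{ab} - p_a \rangle \big) \\
&= \mathbb{E}_{a,b}\, D_G(p_{ab}, p_a) + \mathbb{E}_{a} \Big\langle G'(p_a), \sum_b p(b \mid a)(p_{ab} - p_a) \Big\rangle \\
&= \mathbb{E}_{a,b}\, D_G(p_{ab}, p_a)
\end{aligned}
$$

because $\sum_b p(b \mid a)\, p_{ab} = p_a$, so $\sum_b p(b \mid a)(p_{ab} - p_a) = 0$. ∎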
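Lemma 2.4.1 can be sanity-checked numerically. The sketch below assumes the quadratic expected score function $G(q) = \sum_e q(e)^2$ and a randomly generated joint distribution; both, along with all function names, are illustrative choices rather than anything fixed by the paper.

```python
import random

def G(q):
    """Assumed convex expected-score function (quadratic)."""
    return sum(x * x for x in q)

def G_grad(q):
    """Gradient of G at q."""
    return [2.0 * x for x in q]

def bregman(p, q):
    """D_G(p, q) = G(p) - (G(q) + <G'(q), p - q>)."""
    return G(p) - G(q) - sum(g * (x - y) for g, x, y in zip(G_grad(q), p, q))

def random_joint(n_e, n_a, n_b, rng):
    """Random joint distribution, indexed as joint[e][a][b]."""
    w = [[[rng.random() for _ in range(n_b)] for _ in range(n_a)] for _ in range(n_e)]
    total = sum(x for plane in w for row in plane for x in row)
    return [[[x / total for x in row] for row in plane] for plane in w]

def posterior_a(joint, a):
    """Return (Pr[a], posterior on e given a)."""
    mass = sum(sum(joint[e][a]) for e in range(len(joint)))
    return mass, [sum(joint[e][a]) / mass for e in range(len(joint))]

def posterior_ab(joint, a, b):
    """Return (Pr[a, b], posterior on e given a and b)."""
    mass = sum(joint[e][a][b] for e in range(len(joint)))
    return mass, [joint[e][a][b] / mass for e in range(len(joint))]

def marginal_value(joint):
    """V(A v B) - V(A), computed from V(.) = E G(posterior)."""
    n_a, n_b = len(joint[0]), len(joint[0][0])
    v_ab = sum(m * G(p) for a in range(n_a) for b in range(n_b)
               for m, p in [posterior_ab(joint, a, b)])
    v_a = sum(m * G(p) for a in range(n_a) for m, p in [posterior_a(joint, a)])
    return v_ab - v_a

def expected_divergence(joint):
    """E_{a,b} D_G(p_ab, p_a)."""
    n_a, n_b = len(joint[0]), len(joint[0][0])
    total = 0.0
    for a in range(n_a):
        _, p_a = posterior_a(joint, a)
        for b in range(n_b):
            m_ab, p_ab = posterior_ab(joint, a, b)
            total += m_ab * bregman(p_ab, p_a)
    return total

joint = random_joint(3, 2, 4, random.Random(1))
print(abs(marginal_value(joint) - expected_divergence(joint)) < 1e-12)
```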

###### Definition 2.4.3 (Substitutes via divergences).

For any decision problem $u$, let $D_G$ be the Bregman divergence of the corresponding expected score function $G$. Signals are (weak, moderate, strong) substitutes if and only if, for all $A', B$ on the subsets lattice and all $A$ on the (subsets, discrete, continuous) lattice with $A' \preceq A$,

$$\mathbb{E}_{a',b}\, D_G(p_{a'b}, p_{a'}) \;\geq\; \mathbb{E}_{a,b}\, D_G(p_{ab}, p_{a}).$$

This can be interpreted as a characterization of S&C where $D_G$ serves as a distance measure of sorts (although it is not in general a distance metric). The characterization says that, if we look at how “far” the agent’s beliefs move upon learning $B$, on average, then for substitutes this distance is decreasing in how much other information is available to the agent. But for complements, the more information the agent already has, the farther she expects her beliefs to move on average upon learning $B$.

###### Example 2.4.2.

For the scoring rule, is exactly the KL-divergence or relative entropy between distributions on and . If we recall Example 2.2.1, in which was a random bit and , we can consider the decision problem of prediction against the scoring rule. In this case, the prior , while the posteriors , , and the same for . Hence . But the posteriors conditioned on both signals are the same, e.g. . Hence .

This already shows that and are weak substitutes. And in fact, if , then this argument extends to show that and are substitutes in any decision problem (as they should be), because given , an update on moves the posterior belief a distance .
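As a worked check of the claim that the log scoring rule's Bregman divergence is the KL-divergence, take $G(q) = \sum_e q(e) \log q(e)$ as in Example 2.3.1. The derivation below is a sketch assuming full-support distributions $p, q$ and natural logarithms, so that the gradient has coordinates $\log q(e) + 1$; another base only rescales every term by a constant.

$$
\begin{aligned}
D_G(p, q) &= G(p) - G(q) - \langle G'(q), p - q \rangle \\
&= \sum_e p(e) \log p(e) - \sum_e q(e) \log q(e) - \sum_e \big( \log q(e) + 1 \big)\big( p(e) - q(e) \big) \\
&= \sum_e p(e) \log p(e) - \sum_e p(e) \log q(e) - \sum_e \big( p(e) - q(e) \big) \\
&= \sum_e p(e) \log \frac{p(e)}{q(e)},
\end{aligned}
$$

where the last step uses $\sum_e p(e) = \sum_e q(e) = 1$.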

## 3 Game-Theoretic Applications

### 3.1 Prediction markets

A prediction market is modeled as a Bayesian extensive-form game. The market’s setting is specified by a strictly proper scoring rule and an information structure with prior and event , and set of signals . We assume these signals are “nontrivial” in that, given all signals but , the distribution of changes conditioned on .

An instantiation of the market is specified by a set of traders, each trader observing some subset of the signals (call the resulting signal ), and an order of trading , where at each time step , it is the turn of agent to trade. We assume that no trader participates twice in a row (if they do, it is without loss to delete one of these trading opportunities).

The market proceeds as follows. First, each trader simultaneously and privately observes , updating to a posterior belief. Then the market sets the initial prediction , which we assume to be the prior distribution on . We will also refer to a market prediction as the market prices. Then, for each , trader arrives, observes the current market prediction , and may update it to (“report”) any .

After the last trade step , the true outcome of is observed and each trader receives payoff . Thus, at each time , trader is paid according to the scoring rule applied to , but must pay the previous trader according to the scoring rule applied to . The total payment made by market “telescopes” into .
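The payment rule and its telescoping property can be illustrated with a small sketch. It assumes the log scoring rule, writes the market prediction after step $t$ as `predictions[t]` (with `predictions[0]` the initial prediction), and uses made-up numbers; these names and values are assumptions for illustration only.

```python
import math

def log_score(q, e):
    """Log scoring rule (base 2): score of prediction q when outcome e occurs."""
    return math.log2(q[e])

def trade_payments(predictions, e, score=log_score):
    """Net payment for each trade step t = 1..T.

    The trader at step t receives the score of the new prediction minus the
    score of the prediction she replaced.
    """
    return [score(predictions[t], e) - score(predictions[t - 1], e)
            for t in range(1, len(predictions))]

# Hypothetical sequence of market predictions over two outcomes.
predictions = [[0.5, 0.5], [0.7, 0.3], [0.6, 0.4], [0.9, 0.1]]
outcome = 0

payments = trade_payments(predictions, outcome)
total_paid = sum(payments)
# The market's total payment depends only on the first and last predictions.
telescoped = log_score(predictions[-1], outcome) - log_score(predictions[0], outcome)
print(payments, total_paid, telescoped)
```

A trade that moves the prediction away from the realized outcome, like the second step here, earns a negative payment.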

At any given time step , trader is said to be reporting truthfully if she moves the market prediction to her current posterior belief on . In other words, she makes the myopically optimal trade.

The natural solution concept for Bayesian games is that they be in Bayes-Nash equilibrium, where for every player, her (randomized) strategy — specifying how to trade at each time step as a function of her signal and all past history of play — maximizes expected utility given the prior and others’ strategies.

Because this is a broad class of equilibria and can in general include undesirable equilibria involving “non-credible threats”, it is often of interest in extensive-form games to consider the refinement of perfect Bayesian equilibrium. Here, at each time step and for each past history, a player’s strategy is required to maximize expected utility given her beliefs at that time and the strategies of the other players. (Note the difference from Bayes-Nash equilibrium, in which this optimality is only required a priori rather than for every time step.) In addition, at any time step and history of play, players’ beliefs are required to be consistent with Bayesian updating wherever possible. (It may be that one player deviates to an action not in the support of her strategy; in this case other players may have arbitrary beliefs about the deviator’s signal.)

To be clear, every perfect Bayesian equilibrium is also a Bayes-Nash equilibrium. Hence, we note that an existence result is strongest if it guarantees existence of perfect Bayesian equilibrium. Meanwhile, a uniqueness or nonexistence result is strongest if it refers to Bayes-Nash equilibrium.

##### Distinguishability criterion.

For most of our results, we will need a condition on signals equivalent or similar to those used in prior works (Chen et al., 2010; Ostrovsky, 2012; Gao et al., 2013) in order to ensure that traders can correctly interpret others’ reports. Formally, we say that signals are distinguishable if for all subsets $S$ and realizations $\{a_i : i \in S\}$, $\{a'_i : i \in S\}$ of the signals such that, for some $i \in S$, $a_i \neq a'_i$,

$$\Pr[e \mid a_i : i \in S] \neq \Pr[e \mid a'_i : i \in S].$$

We believe it may be possible to relax this criterion and/or interpret such criteria within the S&C framework, and this is a direction for future work.

##### Our notation in prediction markets.

Note that, for the proper scoring rule $S$ with associated convex $G$, along with the prior, we have the associated “signal value” function $V$:

$$V(A) = \mathbb{E}_{a,e}\, S(p_a, e) = \mathbb{E}_{a}\, G(p_a).$$

In other words, $V(A)$ is the expected score for reporting the posterior distribution conditioned on the realization of $A$.

A second key point is that, to a trader whose current information is captured by a signal , the set of strategies available to that trader can be captured by the space of signals on the continuous lattice. This follows because any strategy is a randomized function of her information, so the outcomes of the strategy can be labeled as outcomes of a signal .

#### 3.1.1 Substitutes and “all-rush”

We now formally define an “all-rush” equilibrium and show that it corresponds to informational substitutes. The naive definition would be that each trader reports truthfully at their first opportunity, or (hence) at every opportunity. This turns out to be correct except for one subtlety. Consider, for example, the final trader to enter the market. Because all others have already revealed all information, this last trader will be indifferent between revealing immediately or delaying. Similarly, consider three traders and the order of trading . If trader truthfully reports at time , then trader is not strictly incentivized to report truthfully at time . She could also delay information revelation until time .

###### Definition 3.1.1.

An all-rush strategy profile in a prediction market is one where, if the traders are numbered in order of the first trading opportunity, then each trader reports truthfully at some time prior to ’s first trading opportunity (with the final trader reporting truthfully prior to the close of the market).

Before presenting the main theorem of this section, we give the following useful lemma (which is quite well known):

###### Lemma 3.1.1.

In every Bayes-Nash equilibrium, every trader reports truthfully at her final trading opportunity.

###### Proof.

Consider a time at which trader makes her final trade. Fix all strategies and any history of trades until time ; then ’s total expected payoff from all previous time steps is fixed as well and cannot be changed by any subsequent activity. Meanwhile, ’s unique utility-maximizing action at time is to report truthfully, by the strict properness of the scoring rule. If does not take this action, then her entire strategy is not a best response: She could take the same strategy until time and modify this last report to obtain higher expected utility. Therefore, in Bayes-Nash equilibrium, reports her posterior on her final trading opportunity. ∎

###### Theorem 3.1.1.

If signals are distinguishable and are strict, strong substitutes, then for every set of traders and trading orders, every Bayes-Nash equilibrium is all-rush.

###### Proof.

Let the traders be numbered in order of their first trading opportunity and let be the signal of trader . Before diving in, we develop a key idea. In equilibrium, we can view the market prediction at time as a random variable. Then, construct a “signal” capturing the information contained in . This can be pictured as the information conveyed by to an “outside observer” who knows the prior distribution and the strategy profile, but does not have any private information. Furthermore, if traders have participated thus far, then is an element of the continuous signal lattice with , because is a well-defined, possibly-randomized function of . Finally, if all participating traders have been truthful, then is a member of the subsets signal lattice , as it exactly reveals the subset of the signals held by those traders.

Now, let be ’s final trading opportunity prior to ’s first trading opportunity. We prove by backward induction on that, in BNE and for any participant participating at some time , the following holds: if is an element of the subsets lattice , then reports truthfully at . Now, suppose we have successfully proven this claim by backward induction; let us finish the proof. The claim implies that really is in and really is truthful at all such time steps for the following reason: is the null signal and is an element of , so trader participates truthfully, which implies that , which implies that participates truthfully, and so on. So in any BNE, all participants play all-rush strategies.

Now let us prove the statement. For the base case , the trader participating at the final time step is truthful by Lemma 3.1.1.

Now for the inductive step, consider any for some . If is ’s final trading opportunity, then by Lemma 3.1.1, in BNE reports truthfully at . If , then there is nothing to prove.

Otherwise, let be ’s next trading opportunity after . By inductive hypothesis, is truthful at time and thereafter. We compute ’s expected utility for any strategy, and show that if is not truthful at , she can improve by deviating to the following strategy: Copy the previous strategy up until , report truthfully at , and make no subsequent updates.

At , ’s strategy can be described as reporting truthfully according to . For this trade, obtains expected profit , and obtains no subsequent profit once her information is revealed. Meanwhile, consider ’s strategy at time , which induces some signal . For this trade, obtains expected profit at most . This follows because a trade conveying signal obtains at most .

[Footnote 5: Although we don’t explicitly use it here, this implies that in equilibrium, every , that is, the price at time equals the posterior distribution on conditioned on all information that has been revealed so far, including at time .]

Let be ’s total expected utility at time and greater. Once reports truthfully at , she expects to make no further profit in equilibrium. So

 U≤V(C(t))−EG(p(t−1))+V(C(t′−1)\raisebox{1.075pt}{% \scalebox{0.8}{∨}}B