
Bayesian Incentive-Compatible Bandit Exploration

Yishay Mansour [Footnote: Most of this research was done while Y. Mansour was a researcher at Microsoft Research, Herzelia, Israel.]

Tel Aviv University, Tel Aviv, Israel

Aleksandrs Slivkins

Microsoft Research, New York, NY 10011, USA

Vasilis Syrgkanis

Microsoft Research, Cambridge, MA 02142, USA

As self-interested individuals (“agents”) make decisions over time, they utilize information revealed by other agents in the past, and produce information that may help agents in the future. This phenomenon is common in a wide range of scenarios in the Internet economy, as well as in medical decisions. Each agent would like to exploit: select the best action given the current information, but would prefer the previous agents to explore: try out various alternatives to collect information. A social planner, by means of a carefully designed recommendation policy, can incentivize the agents to balance the exploration and exploitation so as to maximize social welfare.

We model the planner’s recommendation policy as a multi-arm bandit algorithm under incentive-compatibility constraints induced by agents’ Bayesian priors. We design a bandit algorithm which is incentive-compatible and has asymptotically optimal performance, as expressed by regret. Further, we provide a black-box reduction from an arbitrary multi-arm bandit algorithm to an incentive-compatible one, with only a constant multiplicative increase in regret. This reduction works for very general bandit settings that incorporate contexts and arbitrary partial feedback.

Key words: mechanism design, multi-armed bandits, regret, Bayesian incentive-compatibility

History: First version: February 2015. This version: March 2017. [Footnote: An extended abstract of this paper has been published in ACM EC 2015 (16th ACM Conf. on Economics and Computation). Compared to the version in conference proceedings, this version contains complete proofs, revamped introductory sections, and thoroughly revised presentation of the technical material. Further, two major extensions are fleshed out, resp. to more than two actions and to more general machine learning settings, whereas they were only informally described in the conference version. The main results are unchanged, but their formulation and presentation are streamlined, particularly regarding assumptions on the common prior. This version also contains a discussion of potential applications to medical trials.]


Decisions made by an individual often reveal information about the world that can be useful to others. For example, the decision to dine in a particular restaurant may reveal some observations about this restaurant. This revelation could be achieved, for example, by posting a photo, tweeting, or writing a review. Others can consume this information either directly (via photo, review, tweet, etc.) or indirectly through aggregations, summarizations or recommendations. Thus, individuals have a dual role: they both consume information from previous individuals and produce information for future consumption. This phenomenon applies very broadly: the choice of a product or experience, be it a movie, hotel, book, home appliance, or virtually any other consumer’s choice, leads to an individual’s subjective observations pertaining to this choice. These subjective observations can be recorded and collected, e.g., when the individual ranks a product or leaves a review, and can help others make similar choices in similar circumstances in a more informed way. Collecting, aggregating and presenting such observations is a crucial value proposition of numerous businesses in the modern Internet economy, such as TripAdvisor, Yelp, Netflix, Amazon, Waze and many others (see Table 1). Similar issues, albeit possibly with much higher stakes, arise in medical decisions: selecting a doctor or a hospital, choosing a drug or a treatment, or deciding whether to participate in a medical trial. First, the individual can consult information from similar individuals in the past, to the extent that such information is available, and later she can contribute her experience as a review or as an outcome in a medical trial.

Watch this movie           Netflix
Dine in this restaurant    Yelp
Vacation in this resort    TripAdvisor
Buy this product           Amazon
Drive this route           Waze
Do this exercise           FitBit
See this doctor            SuggestADoctor
Take this medicine         medical trial

Table 1: Systems for recommendations and collecting feedback.

If a social planner were to direct the individuals in the information-revealing decisions discussed above, she would have two conflicting goals: exploitation, choosing the best alternative given the information available so far, and exploration, trying out less known alternatives for the sake of gathering more information, at the risk of worsening the individual experience. A social planner would like to combine exploration and exploitation so as to maximize the social welfare, which results in the exploration-exploitation tradeoff, a well-known subject in Machine Learning, Operations Research and Economics.

However, when the decisions are made by individuals rather than enforced by the planner, we have another problem dimension based on the individuals’ incentives. While the social planner benefits from both exploration and exploitation, each individual’s incentives are typically skewed in favor of the latter. (In particular, many people prefer to benefit from exploration done by others.) Therefore, the society as a whole may suffer from an insufficient amount of exploration. In particular, if a given alternative appears suboptimal given the information available so far, however sparse and incomplete, then this alternative may remain unexplored, even though in reality it may be the best.

The focus of this work is how to incentivize self-interested decision-makers to explore. We consider a social planner who cannot control the decision-makers, but can communicate with them, e.g., recommend an action and observe the outcome later on. Such a planner would typically be implemented via a website, either one dedicated to recommendations and feedback collection (such as Yelp or Waze), or one that actually provides the product or experience being recommended (such as Netflix or Amazon). In medical applications, the planner would be either a website that rates/recommends doctors and collects reviews on their services, or an organization conducting a medical trial. We are primarily interested in exploration that is efficient from the social planner’s perspective, i.e., exploration that optimizes the social welfare. [Footnote 1: In the context of the Internet economy, the “planner” would be a for-profit company. Yet, the planner’s goal, for the purposes of incentivizing exploration, would typically be closely aligned with the social welfare.]

Following Kremer et al. [2014], we consider a basic scenario when the only incentive offered by the social planner is the recommended experience itself (or rather, the individual’s belief about the expected utility of this experience). In particular, the planner does not offer payments for following the recommendation. On a technical level, we study a mechanism design problem with an explore-exploit tradeoff and auxiliary incentive-compatibility constraints. Absent these constraints, our problem reduces to multi-armed bandits (MAB) with stochastic rewards, the paradigmatic and well-studied setting for exploration-exploitation tradeoffs, and various generalizations thereof. The interaction between the planner and a single agent can be viewed as a version of the Bayesian Persuasion game [Kamenica and Gentzkow 2011] in which the planner has more information due to the feedback from the previous agents; in fact, this information asymmetry is crucial for ensuring the desired incentives.

We consider the following abstract framework, called incentive-compatible exploration. The social planner is an algorithm that interacts with the self-interested decision-makers (henceforth, agents) over time. In each round, an agent arrives, chooses one action among several alternatives, receives a reward for the chosen action, and leaves forever. Before an agent makes her choice, the planner sends a message to the agent which includes a recommended action. Everything that happens in a given round is observed by the planner, but not by other agents. The agent has a Bayesian prior on the reward distribution, and chooses an action that maximizes its Bayesian expected reward given the algorithm’s message (breaking ties in favor of the recommended action). The agent’s prior is known to the planner, either fully or partially. We require the planner’s algorithm to be Bayesian incentive-compatible (henceforth, BIC), in the sense that each agent’s Bayesian expected reward is maximized by the recommended action. The basic goal is to design a BIC algorithm so as to maximize social welfare, i.e., the cumulative reward of all agents.

The algorithm’s message to each agent is restricted to the recommended action (call such algorithms message-restricted). Any BIC algorithm can be turned into a message-restricted BIC algorithm which chooses the same actions, as long as the agents’ priors are exactly known to the planner. [Footnote 2: This is due to a suitable version of the “revelation principle”, as observed in Kremer et al. [2014].] Note that a message-restricted algorithm (BIC or not) is simply an MAB-like learning algorithm for the same setting.
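For a message-restricted algorithm, the BIC requirement can be written as a family of inequalities. The notation below is an assumption introduced here for illustration ($\mu_i$ for the expected reward of action $i$, $I_t$ for the action recommended in round $t$), and the expectation is taken over the agent's prior and the algorithm's randomness:

```latex
\[
  \mathbb{E}\left[\, \mu_i - \mu_j \;\middle|\; I_t = i \,\right] \;\ge\; 0
  \qquad \text{for all rounds } t \text{ and all actions } i, j
  \text{ with } \Pr[I_t = i] > 0 .
\]
```

In words: conditional on being recommended action $i$, the agent cannot gain in expectation by switching to any other action $j$.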

A paradigmatic example is the setting where the reward is an independent draw from a distribution determined only by the chosen action. All agents share a common Bayesian prior on the reward distribution; the prior is also known to the planner. No other information is received by the algorithm or an agent (apart from the prior, the recommended action, and the reward for the chosen action). We call this setting BIC bandit exploration. Absent the BIC constraint, it reduces to the MAB problem with IID rewards and Bayesian priors. We also generalize this setting in several directions, both in terms of the machine learning problem being solved by the planner’s algorithm, and in terms of the mechanism design assumptions on the information structure.

We assume that each agent knows which round he is arriving in. A BIC algorithm for this version is also BIC for a more general version in which each agent has a Bayesian prior on his round.

Discussion. BIC exploration does not rely on “external” incentives such as monetary payments or discounts, social status distinctions (e.g., leaderboards or merit badges for prolific feedback contributors), or people’s affinity towards experimentation. This mitigates the potential for selection bias, when the population that participates in the experiment differs from the target population. Indeed, paying patients for participation in a medical trial may be more appealing to poorer patients; offering discounts for new services may attract customers who are more sensitive to such discounts; and relying on people who like to explore for themselves would lead to a dataset that represents this category of people rather than the general population. While all these approaches are reasonable and in fact widely used (with well-developed statistical tools to mitigate the selection bias), an alternative intrinsically less prone to selection bias is, in our opinion, worth investigating.

The “intrinsic” incentives offered by BIC exploration can be viewed as a guarantee of fairness for the agents: indeed, even though the planner imposes experimentation on the agents, the said experimentation does not degrade expected utility of any one agent. (This is because an agent can always choose to ignore the experimentation and select an action with the highest prior mean reward.) This is particularly important for settings in which “external” incentives described above do not fully substitute for the intrinsic utility of the chosen actions. For example, a monetary payment does not fully substitute for an adverse outcome of a medical trial, and a discounted meal at a restaurant does not fully substitute for a bad experience.

We focus on message-restricted algorithms, and rely on the BIC property to convince agents to follow our recommendations. We do not attempt to make our recommendations more convincing by revealing additional information, because doing so does not help in our model, and because the desirable kinds of additional information to be revealed are likely to be application-specific (whereas with message-restricted algorithms we capture many potential applications at once). Further, message-restricted algorithms are allowed, and even recommended, in the domain of medical trials (see Section id1 for discussion).

Objectives. We seek BIC algorithms whose performance is near-optimal for the corresponding setting without the BIC constraint. This is a common viewpoint for welfare-optimizing mechanism design problems, which often leads to strong positive results, both prior-independent and Bayesian, even if Bayesian-optimal BIC algorithms are beyond one’s reach. Prior-independent guarantees are particularly desirable because priors are almost never completely correct in practice.

We express prior-independent performance of an algorithm via a standard notion of regret: the difference, in terms of the cumulative expected reward, between the algorithm and the best fixed action. Intuitively, it is the extent to which the algorithm “regrets” not knowing the best action in advance. For Bayesian performance, we consider Bayesian regret: ex-post regret in expectation over the prior, and also the average Bayesian-expected reward per agent. (For clarity, we will refer to the prior-independent version as ex-post regret.) Moreover, we consider a version in which the algorithm outputs a prediction after each round (visible only to the planner), e.g., the predicted best action; then we are interested in the rate at which this prediction improves over time.
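As a small illustration of the regret notion above, the following sketch computes the regret of a fixed sequence of choices when the vector of expected rewards is known. The function name and the list-based representation are assumptions made for this example only; they are not notation from the paper.

```python
def ex_post_regret(mean_rewards, chosen_actions):
    """Regret of a sequence of choices against the best fixed action.

    mean_rewards:   expected reward of each action (one realization of the prior).
    chosen_actions: index of the action chosen in each round.
    """
    best = max(mean_rewards)
    # Each round contributes the expected-reward gap of the chosen action.
    return sum(best - mean_rewards[a] for a in chosen_actions)


if __name__ == "__main__":
    # Two actions; the second one is better by a gap of about 0.2.
    print(ex_post_regret([0.5, 0.7], [0, 1, 1, 0]))
```

Bayesian regret would average this quantity over draws of `mean_rewards` from the prior (and over the randomness of the algorithm that produces `chosen_actions`).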

On a high level, we make two contributions:

  Regret minimization

We provide an algorithm for BIC bandit exploration whose ex-post regret is asymptotically optimal among all MAB algorithms (assuming a constant number of actions). Our algorithm is detail-free, in that it requires very limited knowledge of the prior.

  Black-box reduction

We provide a reduction from an arbitrary learning algorithm to a BIC one, with only a minor loss in performance; this reduction “works” for a very general setting.

In what follows we discuss our results in more detail.

Regret minimization. Following the literature on regret minimization, we focus on the asymptotic ex-post regret rate as a function of the time horizon (which in our setting corresponds to the number of agents).

We establish that the BIC restriction does not affect the asymptotically optimal ex-post regret rate for a constant number of actions. The optimality is two-fold: in the worst case over all realizations of the common prior (i.e., for every possible vector of expected rewards), and for every particular realization (which may allow much smaller ex-post regret than in the worst-case).

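More formally, if $T$ is the time horizon, $m$ is the number of actions, and $\Delta$ is the “gap” in expected reward between the best action and the second-best action, then we achieve ex-post regret

$$c_{\mathcal{P}} \log(T) + c_0 \cdot \min\left( \frac{m}{\Delta} \log T,\ \sqrt{m T \log T} \right),$$

where $c_0$ is an absolute constant, and $c_{\mathcal{P}}$ is a constant that depends only on the common prior $\mathcal{P}$. A well-known lower bound states that one cannot achieve ex-post regret better than $O\left(\min\left(\frac{m}{\Delta} \log T,\ \sqrt{mT}\right)\right)$. [Footnote 3: More precisely: any MAB algorithm has ex-post regret at least $\Omega\left(\min\left(\frac{1}{\Delta} \log T,\ \sqrt{T}\right)\right)$ in the worst case over all MAB instances with two actions, time horizon $T$ and gap $\Delta$ [Lai and Robbins 1985, Auer et al. 2002b].]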

Conceptually, our algorithm implements adaptive exploration: the exploration schedule is adapted to the previous observations, so that the exploration of low-performing actions is phased out early. This is known to be a vastly superior approach compared to exploration schedules that are fixed in advance; in particular, the latter approach yields a much higher ex-post regret, both per-realization and in the worst case.

Further, our algorithm is detail-free, requiring very little knowledge of the common prior. This is desirable because in practice it may be complicated or impossible to elicit the prior exactly. Moreover, this feature allows the agents to have different priors (as long as they are “compatible” with the planner’s prior, in a precise sense specified later). In fact, an agent does not even need to know her prior exactly: instead, she would trust the planner’s recommendation as long as she believes that their priors are compatible.

Black-box reduction. Given an arbitrary MAB algorithm $\mathcal{A}$, we provide a BIC algorithm $\mathcal{A}^{IC}$ which internally uses $\mathcal{A}$ as a “black-box”. That is, $\mathcal{A}^{IC}$ simulates a run of $\mathcal{A}$, providing inputs and recording the respective outputs, but does not depend on the internal workings of $\mathcal{A}$. In addition to recommending an action, the original algorithm $\mathcal{A}$ can also output a prediction after each round (visible only to the planner), e.g., the predicted best action; then $\mathcal{A}^{IC}$ outputs a prediction, too. A reduction such as ours allows a modular design: one can design a non-BIC algorithm (or take an existing one), and then use the reduction to inject incentive-compatibility. Modular designs are very desirable in complex economic systems, especially for settings such as MAB with a rich body of existing work.
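The “black-box” interface can be sketched as a wrapper that only calls two methods on the inner algorithm. Everything in this sketch is an illustrative assumption: the class and function names, the phase structure (one randomly placed round per phase is handed to the inner algorithm, the rest exploit), and the toy inner algorithm. It omits the initial sampling stage, the Bayesian computation of the exploit action, and the prior-dependent choice of phase length that a BIC guarantee would require.

```python
import random


class RoundRobin:
    """Toy stand-in for an arbitrary bandit algorithm (hypothetical)."""

    def __init__(self, num_arms):
        self.num_arms = num_arms
        self.calls = 0

    def select_arm(self):
        arm = self.calls % self.num_arms
        self.calls += 1
        return arm

    def update(self, arm, reward):
        pass  # a real algorithm would learn from (arm, reward) here


def run_wrapped(inner, exploit_arm_fn, pull, num_phases, phase_len, rng):
    """Simulate the inner algorithm, one of its rounds per phase.

    inner:          object with select_arm() and update(arm, reward).
    exploit_arm_fn: maps the history of inner-algorithm rounds to an arm.
    pull:           maps an arm to a realized reward.
    """
    history = []          # (arm, reward) pairs from the inner algorithm's rounds
    recommendations = []  # what each arriving agent is told
    for _ in range(num_phases):
        slot = rng.randrange(phase_len)        # hidden position of the inner round
        exploit_arm = exploit_arm_fn(history)  # fixed for the whole phase
        for i in range(phase_len):
            if i == slot:
                arm = inner.select_arm()
                reward = pull(arm)
                inner.update(arm, reward)
                phase_obs = (arm, reward)
            else:
                arm = exploit_arm
                pull(arm)  # the agent is rewarded; this sketch does not record it
            recommendations.append(arm)
        history.append(phase_obs)  # revealed to the exploit rule only after the phase
    return recommendations


if __name__ == "__main__":
    inner = RoundRobin(num_arms=2)
    recs = run_wrapped(inner, lambda h: 0, lambda a: 1.0, 5, 4, random.Random(0))
    print(len(recs), inner.calls)
```

The wrapper never inspects the inner algorithm's state, which is the sense in which the reduction is modular: any object exposing the same two methods could be substituted.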

Our reduction incurs only a small loss in performance, which can be quantified in several ways. In terms of Bayesian regret, the performance of $\mathcal{A}$ worsens by at most a constant multiplicative factor that only depends on the prior. In terms of the average rewards, we guarantee the following: for any duration $T$, the average Bayesian-expected reward of $\mathcal{A}^{IC}$ between rounds $c_{\mathcal{P}}$ and $c_{\mathcal{P}} + T$ is at least that of the first $T / L_{\mathcal{P}}$ rounds in the original algorithm $\mathcal{A}$; here $c_{\mathcal{P}}$ and $L_{\mathcal{P}}$ are prior-dependent constants. Finally, if $\mathcal{A}$ outputs a prediction $\phi_t$ after each round $t$, then $\mathcal{A}^{IC}$ learns as fast as $\mathcal{A}$, up to a prior-dependent constant factor $c_{\mathcal{P}}$: for every realization of the prior, its prediction in round $t$ has the same distribution as $\phi_{\lfloor t / c_{\mathcal{P}} \rfloor}$. [Footnote 4: So if the original algorithm $\mathcal{A}$ gives an asymptotically optimal error rate as a function of $t$, compared to the “correct” prediction, then so does the transformed algorithm $\mathcal{A}^{IC}$, up to a prior-dependent multiplicative factor.]

The black-box reduction has several benefits other than “modular design”. Most immediately, one can plug in an MAB algorithm that takes into account the Bayesian prior or any other auxiliary information that a planner might have. Moreover, one may wish to implement a particular approach to exploration, e.g., incorporate some constraints on the losses, or preferences about which arms to favor or to throttle. Further, the planner may wish to predict things other than the best action. To take a very stark example, the planner may wish to learn what are the worst actions (in order to eliminate these actions later by other means such as legislation). While the agents would not normally dwell on low-performing actions, our reduction would then incentivize them to explore these actions in detail.

Beyond BIC bandit exploration. Our black-box reduction supports much richer scenarios than BIC bandit exploration. Most importantly, it allows for agent heterogeneity, as expressed by observable signals. We adopt the framework of contextual bandits, well-established in the Machine Learning literature (see Section id1 for citations). In particular, each agent is characterized by a signal, called context, observable by both the agent and the planner before the planner issues the recommendation. The context can include demographics, tastes, preferences and other agent-specific information. It impacts the expected rewards received by this agent, as well as the agent’s beliefs about these rewards. Rather than choose the best action, the planner now wishes to optimize a policy that maps contexts to actions. This type of agent heterogeneity is practically important: for example, websites that issue recommendations may possess a huge amount of information about their customers, and routinely use this “context” to adjust their recommendations (e.g., Amazon and Netflix). Our reduction turns an arbitrary contextual bandit algorithm into a BIC one, with performance guarantees similar to those for the non-contextual version.

Moreover, the reduction allows learning algorithms to incorporate arbitrary auxiliary feedback that agents’ actions may reveal. For example, a restaurant review may contain not only the overall evaluation of the agent’s experience (i.e., her reward), but also reveal her culinary preferences, which in turn may shed light on the popularity of other restaurants (i.e., on the expected rewards of other actions). Further, an action can consist of multiple “sub-actions”, perhaps under common constraints, and the auxiliary feedback may reveal the reward for each sub-action. For instance, a detailed restaurant recommendation may include suggestions for each course, and a review may contain evaluations thereof. Such problems (without incentive constraints) have been actively studied in machine learning, under the names “MAB with partial monitoring” and “combinatorial semi-bandits”; see Section id1 for relevant citations.

In particular, we allow for scenarios when the planner wishes to optimize his own utility which is misaligned with the agents’. Then rewards in our model still correspond to the agents’ utilities, and the planner’s utility is observed by the algorithm as auxiliary feedback. For example, a vendor who recommends products to customers may favor more expensive products or products that are tied in with his other offerings. In a different setting, a planner may prefer less expensive options: in a medical trial with substantial treatment costs, patients (who are getting these treatments for free) are only interested in their respective health outcomes, whereas a socially responsible planner may also factor in the treatment costs. Another example is a medical trial of several available immunizations for the same contagious disease, potentially offering different tradeoffs between the strength and duration of immunity and the severity of side effects. Hoping to free-ride on the immunity of others, a patient may assign a lower utility to a successful outcome than the government, and therefore prefer safer but less efficient options.

A black-box reduction such as ours is particularly desirable for the extended setting described above, essentially because it is not tied up to a particular variant of the problem. Indeed, contextual bandit algorithms in the literature heavily depend on the class of policies to optimize over, whereas our reduction does not. Likewise, algorithms for bandits with auxiliary feedback heavily depend on the particular kind of feedback.

Our techniques. An essential challenge in BIC exploration is to incentivize agents to explore actions that appear suboptimal according to the agent’s prior and/or the information currently available to the planner. The desirable incentives are created due to information asymmetry: the planner knows more than the agents do, and the recommendation reveals a carefully calibrated amount of additional information. The agent’s beliefs are then updated so that the recommended action now seems preferable to others, even though the algorithm may in fact be exploring in this round, and/or the prior mean reward of this action may be small.

Our algorithms are based on (versions of) a common building block: an algorithm that incentivizes agents to explore at least once during a relatively short time interval (a “phase”). The idea is to hide one round of exploration among many rounds of exploitation. An agent receiving a recommendation does not know whether this recommendation corresponds to exploration or to exploitation. However, the agents’ Bayesian posterior favors the recommended action because the exploitation is much more likely. Information asymmetry arises because the agent cannot observe the previous rounds and the algorithm’s randomness.
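The idea of hiding one exploration round among many exploitation rounds can be illustrated with a short sketch for two actions. The function names, the parameter `q`, and the simplified posterior calculation are assumptions made for this illustration; the sketch does not compute expected rewards and is not the paper's algorithm.

```python
import random


def phase_recommendations(exploit_arm, explore_arm, phase_len, rng):
    """Recommendations for one phase.

    Exactly one uniformly random position carries the exploration arm;
    every other position carries the exploitation arm.
    """
    if phase_len < 1:
        raise ValueError("phase_len must be positive")
    explore_pos = rng.randrange(phase_len)
    return [explore_arm if i == explore_pos else exploit_arm
            for i in range(phase_len)]


def exploration_share(q, phase_len):
    """Agent's posterior probability that a recommendation of the explore arm
    is pure exploration rather than exploitation.

    q: prior probability (from the agent's viewpoint) that the explore arm
       is also the exploit arm in this phase, in which case every agent in
       the phase is recommended it.
    """
    pure_exploration = (1.0 - q) / phase_len  # exploit arm differs, agent was picked
    return pure_exploration / (q + pure_exploration)


if __name__ == "__main__":
    print(phase_recommendations(0, 1, 10, random.Random(0)))
    for phase_len in (9, 99, 999):
        print(phase_len, exploration_share(0.5, phase_len))
```

With `q` fixed and positive, `exploration_share` shrinks as the phase grows, which is the sense in which exploitation becomes “much more likely” from the agent's viewpoint. If `q` were zero, a recommendation of the explore arm would reveal that the agent is exploring, so the hiding argument would not apply.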

To obtain BIC algorithms with good performance, we overcome a number of technical challenges, some of which are listed below. First, an algorithm needs to convince an agent not to switch to several other actions: essentially, all actions with larger prior mean reward than the recommended action. In particular, the algorithm should accumulate sufficiently many samples of these actions beforehand. Second, we ensure that the phase length (i.e., the sufficient size of the exploitation pool) does not need to grow over time. In particular, it helps not to reveal any information to future agents (e.g., after each phase or at other “checkpoints” throughout the algorithm). Third, for the black-box reduction we ensure that the choice of the bandit algorithm to reduce from does not reveal any information about rewards. In particular, this consideration was essential in formulating the main assumption in the analysis (Property (Pid1) on page id1). Fourth, the detail-free algorithm cannot use Bayesian inference, and relies on sample average rewards to make conclusions about Bayesian posterior rewards, even though the former is only an approximation for the latter.

The common prior. Our problem is hopeless for some priors. For a simple example, consider a prior on two actions (whose expected rewards are denoted $\mu_1$ and $\mu_2$) such that $\mathbb{E}[\mu_1] > \mathbb{E}[\mu_2]$ and $\mu_1 - \mu_2$ is statistically independent from $\mu_1$. Then, since no amount of samples from action 1 has any bearing on $\mu_1 - \mu_2$, a BIC algorithm cannot possibly incentivize agents to try action 2. To rule out such pathological examples, we make some assumptions. Our detail-free result assumes that the prior is independent across actions, and additionally posits minor restrictions in terms of bounded rewards and full support. The black-box reduction posits an abstract condition which allows for correlated priors, and includes independent priors as a special case (with similar minor restrictions).

Map of the technical content. We discuss technical preliminaries in Sections id1. The first technical result in the paper is a BIC algorithm for initial exploration in the special case of two arms (Section id1), the most lucid incarnation of the “common building block” discussed above. Then we present the main results for BIC bandit exploration: the black-box reduction (Section id1) and the detail-free algorithm with optimal ex-post regret (Section id1). Then we proceed with major extensions to contexts and auxiliary feedback (Section id1). The proofs pertaining to the properties of the common prior are deferred to Section id1. Conclusions and open questions are in Section id1. The detail-free algorithm becomes substantially simpler for the special case of two actions. For better intuition, we provide a standalone exposition of this special case in Appendix id1.

We view patients’ incentives as one of the major obstacles that inhibit medical trials in practice, or prevent some of them from happening altogether. This obstacle may be particularly damaging for large-scale trials that concern wide-spread medical conditions with relatively inexpensive treatments. Then finding suitable patients and providing them with appropriate treatments would be fairly realistic, but incentivizing patients to participate in sufficient numbers may be challenging. BIC exploration is thus a theoretical (and so far, highly idealized) attempt to mitigate this obstacle.

Medical trials have been one of the original motivations for studying MAB and the exploration-exploitation tradeoff [Thompson 1933, Gittins 1979]. Bandit-like designs for medical trials belong to the realm of adaptive medical trials [see Chow and Chang 2008, for background], which also include other “adaptive” features such as early stopping, sample size re-estimation, and changing the dosage.

“Message-restricted” algorithms (which recommend particular treatments to patients and reveal no other information) are appropriate for this domain. Revealing some (but not all) information about the medical trial is required to meet the standards of “informed consent”, as prescribed by various guidelines and regulations [see Arango et al. 2012, for background]. However, revealing information about clinical outcomes in an ongoing trial is currently not required, to the best of our understanding. In fact, revealing such information is seen as a significant threat to the statistical validity of the trial (because both patients and doctors may become biased in favor of better-performing treatments), and care is advised to prevent information leaks as much as possible [see Detry et al. 2012, for background and discussion, particularly pp. 26-30].

Medical trials provide additional motivation for BIC bandit exploration with multiple actions. While traditional medical trials compare a new treatment against the placebo or a standard treatment, designs of medical trials with multiple treatments have been studied in the biostatistics literature [e.g., see Hellmich 2001, Freidlin et al. 2008], and are becoming increasingly important in practice [Parmar et al. 2014, Redig and Jänne 2015]. Note that even for the placebo or the standard treatment the expected reward is often not known in advance, as it may depend on the particular patient population.

BIC contextual bandit exploration is particularly relevant to medical trials, as patients come with a lot of “context” which can be used to adjust and personalize the recommended treatment. The context can include age, fitness levels, race or ethnicity, various aspects of the patient’s medical history, as well as genetic markers (increasingly so as genetic sequencing is becoming more available). Context-dependent treatments (known as personalized medicine) have been an important trend in the pharmaceutical industry in the past decade, especially genetic-marker-dependent treatments in oncology [e.g., see Maitland and Schilsky 2011, Garimella 2015, Vaidyanathan 2012]. Medical trials for context-dependent treatments are more complex, as they must take the context into account. To reduce costs and address patient scarcity, a number of novel designs for such trials have been deployed [e.g., see Maitland and Schilsky 2011, Garimella 2015]. Some of the deployed designs are explicitly “contextual”, in that they seek the best policy, i.e., a mapping from the patient’s context to treatment [Redig and Jänne 2015]. More advanced “contextual” designs have been studied in biostatistics [e.g., see Freidlin and Simon 2005, Freidlin et al. 2007, 2010].

Exploration, exploitation, and incentives. There is a growing literature about a three-way interplay of exploration, exploitation, and incentives, comprising a variety of scenarios.

The study of mechanisms to incentivize exploration has been initiated by Kremer et al. [2014]. They mainly focus on deriving the Bayesian-optimal policy for the case of only two actions and deterministic rewards, and only obtain a preliminary result for stochastic rewards; a detailed comparison is provided below. Bimpikis et al. [2017] [Footnote 5: Bimpikis et al. [2017] is concurrent and independent work with respect to the conference publication of this paper.] consider a similar model with time-discounted rewards, focusing on the case of two arms. If expected rewards are known for one arm, they provide a BIC algorithm that achieves the “first-best” utility. For the general case, they design an optimal BIC algorithm that is computationally inefficient, and propose a tractable heuristic based on the same techniques. Motivated by similar applications in the Internet economy, Che and Hörner [2015] propose a model with a continuous information flow and a continuum of consumers arriving to a recommendation system and derive a Bayesian-optimal incentive-compatible policy. Their model is technically different from ours, and is restricted to two arms and binary rewards. Frazier et al. [2014] consider a similar setting with monetary transfers, where the planner not only recommends an action to each agent, but also offers a payment for taking this action. In their setting, incentives are created via the offered payments rather than via information asymmetry.

Rayo and Segal [2010], Kamenica and Gentzkow [2011] and Manso [2011] examine related settings in which a planner incentivizes agents to make better choices, either via information disclosure or via a contract. All three papers focus on settings with only two rounds. Ely et al. [2015] and Hörner and Skrzypacz [2015] consider information disclosure over time, in very different models: resp., releasing news over time to optimize suspense and surprise, and selling information over time.

Exploration-exploitation problems with self-interested agents have been studied in several other scenarios: multiple agents engaging in exploration and benefiting from exploration performed by others, without a planner to coordinate them [e.g., Bolton and Harris 1999, Keller et al. 2005], dynamic pricing with model uncertainty [e.g., Kleinberg and Leighton 2003, Besbes and Zeevi 2009, Badanidiyuru et al. 2013], dynamic auctions [e.g., Athey and Segal 2013, Bergemann and Välimäki 2010, Kakade et al. 2013], pay-per-click ad auctions with unknown click probabilities [e.g., Babaioff et al. 2014, Devanur and Kakade 2009, Babaioff et al. 2015b], as well as human computation [e.g., Ho et al. 2014, Ghosh and Hummel 2013, Singla and Krause 2013]. In particular, a black-box reduction from an arbitrary MAB algorithm to an incentive-compatible algorithm is the main result in Babaioff et al. [2015b], in the setting of pay-per-click ad auctions with unknown click probabilities. The technical details in all this work (who are the agents, what are the actions, etc.) are very different from ours, so a direct comparison of results is uninformative.

Subsequent work. Several related papers appeared subsequent to the conference publication of this paper. Kleinberg et al. [2016] consider costly information acquisition by self-interested agents (e.g., real estate buyers or venture capitalists investigating a potential purchase/investment), and design a BIC mechanism to coordinate this process so as to improve social welfare. Bahar et al. [2016] extend the setting in Kremer et al. [2014] (BIC exploration with two actions and deterministic rewards) to a known social network on the agents, where each agent can observe friends’ recommendations in the previous rounds (but not their rewards). Finally, Mansour et al. [2016] extend BIC exploration to scenarios when multiple agents arrive in each round and their chosen actions may affect others’ rewards. The paradigmatic motivating example is driving directions on Waze, where drivers’ choice of routes can create congestion for other drivers.

Detailed comparison to Kremer, Mansour, and Perry [2014]. While the expected reward is determined by the chosen action, we allow the realized reward to be stochastic. Kremer et al. [2014] mainly focus on deterministic rewards, and only obtain a preliminary result for stochastic rewards. We improve over the latter result in several ways.

Kremer et al. [2014] only consider the case of two actions, whereas we handle an arbitrary constant number of actions. Handling more than two actions is important in practice, because recommendation systems for products or experiences are rarely faced with only binary decisions. Further, medical trials with multiple treatments are important, too [Hellmich 2001, Freidlin et al. 2008, Parmar et al. 2014]. For multiple actions, convergence to the best action is a new and interesting result on its own, even regardless of the rate of convergence. Our extension to more than two actions is technically challenging, requiring several new ideas compared to the two-action case; especially so for the detail-free version.

We implement adaptive exploration, in which the exploration schedule is adapted to the previous observations, whereas the algorithm in Kremer et al. [2014] is a BIC implementation of fixed exploration: a “naive” MAB algorithm in which the exploration schedule is fixed in advance. This leads to a stark difference in ex-post regret. To describe these improvements, let us define the MAB instance as a mapping from actions to their expected rewards, and let us be more explicit about the asymptotic ex-post regret rates as a function of the time horizon $T$ (i.e., the number of agents). Kremer et al. [2014] only provide a regret bound of $\tilde{O}(T^{2/3})$ for all MAB instances (here and elsewhere, the $\tilde{O}(\cdot)$ notation hides $\mathrm{polylog}(T)$ factors), whereas our algorithm achieves ex-post regret $\tilde{O}(\sqrt{T})$ for all MAB instances, and $\mathrm{polylog}(T)$ for MAB instances with constant “gap” in the expected reward between the best and the second-best action. The literature on MAB considers this a significant improvement (more on this below). In particular, the $\mathrm{polylog}(T)$ result is important in two ways: it quantifies the advantage of “nice” MAB instances over the worst case, and of IID rewards over adversarially chosen rewards. (The MAB problem with adversarially chosen rewards only admits $\tilde{O}(\sqrt{T})$ ex-post regret in the worst case.) The sub-par regret rate in Kremer et al. [2014] is indeed a consequence of fixed exploration: such algorithms cannot have worst-case regret better than $\Omega(T^{2/3})$, and cannot achieve $\mathrm{polylog}(T)$ regret for MAB instances with constant gap [Babaioff et al. 2014].

In terms of information structure, the algorithm in Kremer et al. [2014] requires all agents to have the same prior, and requires a very detailed knowledge of that prior; both are significant impediments in practice. In contrast, our detail-free result allows the planner to have only a very limited knowledge of the prior, and allows the agents to have different priors.

Finally, Kremer et al. [2014] does not provide an analog of our black-box reduction, and does not handle the various generalizations of BIC bandit exploration that the reduction supports.

Multi-armed bandits. Multi-armed bandits (MAB) have been studied extensively in Economics, Operations Research and Computer Science since [Thompson 1933]. Motivations and applications range from medical trials to pricing and inventory optimization to driving directions to online advertising to human computation. A reader may refer to [Bubeck and Cesa-Bianchi 2012] and [Gittins et al. 2011] for background on regret-minimizing and Bayesian formulations, respectively. Further background on related machine learning problems can be found in Cesa-Bianchi and Lugosi [2006]. Our results are primarily related to regret-minimizing MAB formulations with IID rewards [Lai and Robbins 1985, Auer et al. 2002a].

Our detail-free algorithm builds on Hoeffding races [Maron and Moore 1993, 1997], a well-known technique in reinforcement learning. Its incarnation in the context of MAB is also known as active arms elimination [Even-Dar et al. 2006].

The general setting for the black-box reduction is closely related to three prominent directions in the work on MAB: contextual bandits, MAB with budgeted exploration, and MAB with partial monitoring. Contextual bandits have been introduced, under various names and models, in [Woodroofe 1979, Auer et al. 2002b, Wang et al. 2005, Langford and Zhang 2007], and actively studied since then. We follow the formulation from [Langford and Zhang 2007] and a long line of subsequent work [e.g., Dudík et al. 2011, Agarwal et al. 2014]. In MAB with budgeted exploration, the algorithm’s goal is to predict the best arm in a given number of rounds, and performance is measured via prediction quality rather than cumulative reward [Even-Dar et al. 2002, Mannor and Tsitsiklis 2004, Guha and Munagala 2007, Goel et al. 2009, Bubeck et al. 2011]. In MAB with partial monitoring, auxiliary feedback is revealed in each round along with the reward. A historical account for this direction can be found in Cesa-Bianchi and Lugosi [2006]; see Audibert and Bubeck [2010], Bartók et al. [2014] for examples of more recent progress. Important special cases are MAB with graph-structured feedback [Alon et al. 2013, 2015], where choosing an arm would also bring feedback for adjacent arms, and “combinatorial semi-bandits” [György et al. 2007, Kale et al. 2010, Audibert et al. 2011, Wen et al. 2015], where in each round the algorithm chooses a subset from some fixed ground set, and the reward for each chosen element of this set is revealed.

Improving the asymptotic regret rate from $\tilde{O}(T^{\gamma})$, $\gamma > 1/2$, to $\tilde{O}(\sqrt{T})$ and from $\tilde{O}(\sqrt{T})$ to $\mathrm{polylog}(T)$ has been a dominant theme in the literature on regret-minimizing MAB (a thorough survey is beyond our scope, see Bubeck and Cesa-Bianchi [2012] for background and citations). In particular, the improvement from $\tilde{O}(T^{2/3})$ to $\tilde{O}(\sqrt{T})$ regret due to the distinction between fixed and adaptive exploration has been a major theme in [Babaioff et al. 2014, Devanur and Kakade 2009, Babaioff et al. 2015a].

As much of our motivation comes from human computation, we note that MAB-like problems have been studied in several other settings motivated by human computation. Most of this work has focused on crowdsourcing markets, see Slivkins and Vaughan [2013] for a discussion; specific topics include matching of tasks and workers [e.g., Ho et al. 2013, Abraham et al. 2013] and pricing decisions [Ho et al. 2014, Singla and Krause 2013, Badanidiyuru et al. 2013]. Also, Ghosh and Hummel [2013] considered incentivizing high-quality user-generated content.

We define the basic model, called BIC bandit exploration; the generalizations are discussed in Section id1. A sequence of $T$ agents arrive sequentially to the planner. In each round $t$, the interaction protocol is as follows: a new agent arrives, the planner sends this agent a signal $\sigma_t$, the agent chooses an action $i_t$ in a set $A$ of actions, receives a reward for this action, and leaves forever. The signal $\sigma_t$ includes a recommended action $I_t \in A$. This entire interaction is not observed by other agents. The planner knows the value of $T$. However, a coarse upper bound would suffice in most cases, with only a constant degradation of our results.

The planner chooses signals $\sigma_t$ using an algorithm, called recommendation algorithm. If $\sigma_t = I_t$ (i.e., the signal is restricted to the recommended action, which is followed by the corresponding agent) then the setting reduces to multi-armed bandits (MAB), and the recommendation algorithm is a bandit algorithm. To follow the MAB terminology, we will use arms synonymously with actions; we sometimes write “play/pull an arm” rather than “choose an action”.

Rewards. For each arm $i$ there is a parametric family $\mathcal{D}_i(\cdot)$ of reward distributions, parameterized by the expected reward $\mu_i$. The reward vector $\mu = (\mu_1, \ldots, \mu_m)$ is drawn from some prior distribution $\mathcal{P}^0$. Conditional on the mean $\mu_i$, the realized reward when a given agent chooses action $i$ is drawn independently from distribution $\mathcal{D}_i(\mu_i)$. In this paper we restrict attention to the case of single-parameter families of distributions; however, we do not believe this restriction is essential for our results to hold.

The prior $\mathcal{P}^0$ and the tuple $\mathcal{D} = (\mathcal{D}_i(\cdot) : i \in A)$ constitute the (full) Bayesian prior on rewards, denoted $\mathcal{P}$. It is known to all agents and to the planner. (As mentioned in the Introduction, we also consider an extension to a partially known prior.) The expected rewards $\mu_i$ are not known to either.

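For each arm $i$, let $\mathcal{P}^0_i$ be the marginal of $\mathcal{P}^0$ on this arm, and let $\mu_i^0 = \mathbb{E}[\mu_i]$ be the prior mean reward. W.l.o.g., re-order the arms so that $\mu_1^0 \ge \mu_2^0 \ge \ldots \ge \mu_m^0$. The prior $\mathcal{P}$ is independent across arms if the distribution $\mathcal{P}^0$ is a product distribution, i.e., $\mathcal{P}^0 = \mathcal{P}^0_1 \times \ldots \times \mathcal{P}^0_m$.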

Incentive-compatibility. Each agent maximizes her own Bayesian expected reward, conditional on any information that is available to her. Recall that the agent observes the planner’s message $\sigma_t$, and does not observe anything about the previous rounds. Therefore, the agent simply chooses an action $i$ that maximizes the posterior mean reward $\mathbb{E}[\mu_i \mid \sigma_t]$. In particular, if the signal does not contain any new information about the reward vector $\mu$, then the agent simply chooses an action $i$ that maximizes the prior mean reward $\mu_i^0$.

We are interested in algorithms that respect incentives, in the sense that the recommended action $I_t$ maximizes the posterior mean reward. To state this property formally, let $\mathcal{E}_{t-1}$ be the event that the agents have followed the algorithm’s recommendations up to (and not including) round $t$.

Definition 3.1

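A recommendation algorithm is Bayesian incentive-compatible (BIC) if

$$\mathbb{E}[\mu_i \mid \sigma_t,\, I_t = i,\, \mathcal{E}_{t-1}] \;\ge\; \max_{j \in A \setminus \{i\}} \mathbb{E}[\mu_j \mid \sigma_t,\, I_t = i,\, \mathcal{E}_{t-1}] \qquad \forall t \in [T],\ \forall i \in A. \tag{3.1}$$

The algorithm is strongly BIC if Inequality (3.1) is always strict.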

Throughout this paper, we focus on BIC recommendation algorithms with $\sigma_t = I_t$. As observed in Kremer et al. [2014], this restriction is w.l.o.g. in the following sense. First, any recommendation algorithm can be made BIC by re-defining $I_t$ to lie in $\arg\max_{i \in A} \mathbb{E}[\mu_i \mid \sigma_t, \mathcal{E}_{t-1}]$. Second, any BIC recommendation algorithm can be restricted to $\sigma_t = I_t$, preserving the BIC property. Note that the first step may require full knowledge of the prior and may be computationally expensive.

Thus, for each agent the recommended action is at least as good as any other action. For simplicity, we assume the agents break ties in favor of the recommended action. Then the agents always follow the recommended action.

Regret. The goal of the recommendation algorithm is to maximize the expected social welfare, i.e., the expected total reward of all agents. For BIC algorithms, this is just the total expected reward of the algorithm in the corresponding instance of MAB.

We measure the algorithm’s performance via the standard definitions of regret.

Definition 3.2 (Ex-post Regret)

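The ex-post regret of the algorithm is:

$$R_\mu(T) = T \max_{i \in A} \mu_i - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{I_t} \,\Big|\, \mu\Big]. \tag{3.2}$$

The Bayesian regret of the algorithm is:

$$R_{\mathcal{P}}(T) = \mathbb{E}\Big[T \max_{i \in A} \mu_i - \sum_{t=1}^{T} \mu_{I_t}\Big] = \mathbb{E}_{\mu \sim \mathcal{P}^0}\big[R_\mu(T)\big]. \tag{3.3}$$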

The ex-post regret is specific to a particular reward vector $\mu$; in the MAB terminology, $\mu$ is sometimes called an MAB instance. In (3.2) the expectation is taken over the randomness in the realized rewards and the algorithm, whereas in (3.3) it is also taken over the prior. The last summand in (3.2) is the expected reward of the algorithm.

Ex-post regret lets us capture “nice” MAB instances. Define the gap of a problem instance as $\Delta = \mu^* - \max_{i:\, \mu_i < \mu^*} \mu_i$, where $\mu^* = \max_{i} \mu_i$. In words: the difference between the largest and the second largest expected reward. One way to characterize “nice” MAB instances is via large gap. There are MAB algorithms with regret $O\big(\min(\tfrac{1}{\Delta}\log T,\ \sqrt{T \log T})\big)$ for a constant number of arms [Auer et al. 2002a].
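As an illustration only, the gap and the regret of one realized sequence of recommendations can be computed directly from a fixed reward vector. The sketch below is not part of the paper: the function names `gap` and `realized_regret` are hypothetical, arms are indexed from 0, and `realized_regret` evaluates a single sequence of pulls, whereas ex-post regret additionally averages over the randomness of the algorithm and of the realized rewards.

```python
from typing import Sequence


def gap(mu: Sequence[float]) -> float:
    """Largest expected reward minus the largest strictly smaller one.

    Returns 0.0 when all arms have the same expected reward.
    """
    best = max(mu)
    lower = [m for m in mu if m < best]
    return best - max(lower) if lower else 0.0


def realized_regret(mu: Sequence[float], pulls: Sequence[int]) -> float:
    """Regret of one fixed sequence of pulled arms against the best arm.

    Assumption: `pulls` holds 0-based arm indices, one per round.
    """
    best = max(mu)
    return sum(best - mu[arm] for arm in pulls)
```

For instance, with expected rewards `[0.5, 0.7]` and pulls `[0, 1, 1, 0]`, two rounds each lose 0.2 relative to the best arm.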

The basic performance guarantee is expressed via Bayesian regret. Bounding the ex-post regret is particularly useful if the prior is not exactly known to the planner, or if the prior that everyone believes in may not be quite the same as the true prior. In general, a bound on ex-post regret also guards against “unlikely realizations”. Besides, a bound on the ex-post regret is valid for every realization of the prior, which is reassuring for the social planner, and it can also take advantage of “nice” realizations such as the ones with large “gap”.

Conditional expectations. Throughout, we often use conditional expectations of the form $\mathbb{E}[A \mid B = b]$ and $\mathbb{E}[A \mid B]$, where $A$ and $B$ are random variables. To avoid confusion, let us emphasize that $\mathbb{E}[A \mid B = b]$ evaluates to a scalar, whereas $\mathbb{E}[A \mid B]$ is a random variable that maps values of $B$ to the corresponding conditional expectations of $A$. At a high level, we typically use $\mathbb{E}[A \mid B = b]$ in the algorithms’ specifications, and we often consider $\mathbb{E}[A \mid B]$ in the analysis.

Concentration inequalities. For our detail-free algorithm, we will use a well-known concentration inequality known as the Chernoff-Hoeffding Bound. This concentration inequality, in slightly different formulations, can be found in many textbooks and surveys [e.g., Mitzenmacher and Upfal 2005]. We use a formulation from the original paper [Hoeffding 1963].

Theorem 3.1 (Chernoff-Hoeffding Bound)

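Consider $n$ I.I.D. random variables $X_1, \ldots, X_n$ with values in $[0,1]$. Let $X = \frac{1}{n}\sum_{i=1}^{n} X_i$ be their average, and let $\nu = \mathbb{E}[X]$. Then:

$$\Pr\Big(|X - \nu| < \sqrt{\alpha/n}\Big) \;\ge\; 1 - 2e^{-2\alpha} \qquad \forall \alpha > 0. \tag{3.4}$$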

In a typical usage, we consider the high-probability event in Eq. 3.4 for suitably chosen random variables (which we call the Chernoff event), use the above theorem to argue that the failure probability is negligible, and proceed with the analysis conditional on the Chernoff event.
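To make this usage concrete, the following Python sketch computes a confidence radius from the standard two-sided Hoeffding inequality for averages of $[0,1]$-valued IID variables, $\Pr(|X - \nu| \ge \epsilon) \le 2e^{-2n\epsilon^2}$, and checks empirically how often the corresponding event fails. It is an illustration under stated assumptions, not the paper's code: the names `hoeffding_radius` and `empirical_failure_rate`, the Bernoulli reward model, and the default parameters are all hypothetical.

```python
import math
import random


def hoeffding_radius(n: int, delta: float) -> float:
    """Radius eps such that 2 * exp(-2 * n * eps**2) == delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def empirical_failure_rate(n: int, delta: float, p: float = 0.3,
                           trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which the average of n Bernoulli(p) samples
    deviates from p by at least the Hoeffding radius."""
    rng = random.Random(seed)
    radius = hoeffding_radius(n, delta)
    failures = 0
    for _ in range(trials):
        average = sum(rng.random() < p for _ in range(n)) / n
        if abs(average - p) >= radius:
            failures += 1
    return failures / trials
```

The bound is distribution-free, so the observed failure rate is typically well below `delta`; the analysis then proceeds conditional on the event that no failure occurs.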

A fundamental sub-problem in BIC bandit exploration is to incentivize agents to choose any arm $i \ge 2$ even once, as initially they would only choose arm $1$. (Recall that arms are ordered according to their prior mean rewards.) We provide a simple stand-alone BIC algorithm that samples each arm at least $k$ times, for a given $k$, and completes in time $k$ times a prior-dependent constant. This algorithm is the initial stage of the black-box reduction, and (in a detail-free extension) of the detail-free algorithm.

In this section we focus on the special case of two arms, so as to provide a lucid introduction to the techniques and approaches in this paper. (Extension to many arms, which requires some additional ideas, is postponed to Section id1). We allow the common prior to be correlated across arms, under a mild assumption which we prove is necessary. For intuition, we explicitly work out the special case of Gaussian priors.

Restricting the prior. While we consider the general case of correlated per-action priors, we need to restrict the common prior so as to give our algorithms a fighting chance, because our problem is hopeless for some priors. As discussed in the Introduction, an easy example is a prior such that $\mu_1$ and $\mu_1 - \mu_2$ are independent. Then samples from arm $1$ have no bearing on the conditional expectation of $\mu_1 - \mu_2$, and therefore cannot possibly incentivize an agent to try arm $2$.

For a “fighting chance”, we assume that after seeing sufficiently many samples of arm $1$ there is a positive probability that arm $2$ is better. To state this property formally, we denote with $S_1^k$ the random variable that captures the first $k$ outcomes of arm $1$, and we let $X^k = \mathbb{E}[\mu_2 - \mu_1 \mid S_1^k]$ be the conditional expectation of $\mu_2 - \mu_1$ as a function of $S_1^k$. We make the following assumption:

  (P1) $\Pr(X^k > 0) > 0$ for some prior-dependent constant $k = k_{\mathcal{P}} < \infty$.

In fact, Property (P1) is “almost necessary”: it is necessary for a strongly BIC algorithm.

Lemma 4.1

Consider an instance of BIC bandit exploration with two actions such that Property (P1) does not hold. Then a strongly BIC algorithm never plays arm $2$.


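Since Property (P1) does not hold, by Borel-Cantelli Lemma we have

$$\Pr(X^t \le 0) = 1 \qquad \forall t \in \mathbb{N}. \tag{4.1}$$

We prove by induction on $t$ that the $t$-th agent cannot be recommended arm $2$. This is trivially true for $t = 1$. Suppose the induction hypothesis is true for some $t$. Consider an execution of a strongly BIC algorithm. Such an algorithm cannot recommend arm $2$ before round $t+1$. Then the decision whether to recommend arm $2$ in round $t+1$ is determined by the $t$ outcomes of action $1$. In other words, the event $U = \{I_{t+1} = 2\}$ belongs to $\sigma(S_1^t)$, the sigma-algebra generated by $S_1^t$. Therefore,

$$\mathbb{E}[\mu_2 - \mu_1 \mid U] = \frac{1}{\Pr(U)} \int_U (\mu_2 - \mu_1) = \frac{1}{\Pr(U)} \int_U \mathbb{E}[\mu_2 - \mu_1 \mid S_1^t] = \mathbb{E}[X^t \mid U] \le 0.$$

The last inequality holds by (4.1). So a strongly BIC algorithm cannot recommend arm $2$ in round $t+1$. This completes the induction proof.  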

Algorithm and analysis. We provide a simple BIC algorithm that samples both arms at least $k$ times. The time is divided into $k+1$ phases of $L$ rounds each, except the first one, where $L$ is a parameter that depends on the prior. In the first phase the algorithm recommends arm 1 to $K = \max(k_{\mathcal{P}}, k)$ agents, and picks the “exploit arm” $a^*$ as the arm with the highest posterior mean conditional on these $K$ samples. In each subsequent phase, it picks one agent independently and uniformly at random to explore arm $2$. All other $L-1$ agents exploit using arm $a^*$. A more formal description is given in Algorithm 1.

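In the first $K = \max(k_{\mathcal{P}}, k)$ rounds recommend arm $1$;
Let $s_1^K = (r_1^1, \ldots, r_1^K)$ be the corresp. tuple of rewards;
Let $a^* = \arg\max_{a \in \{1,2\}} \mathbb{E}[\mu_a \mid S_1^K = s_1^K]$, breaking ties in favor of arm $1$;
foreach phase $n = 2, \ldots, k+1$ do
      From the set $P$ of the next $L$ agents, pick one agent $p_0$ uniformly at random;
      Every agent $p \in P \setminus \{p_0\}$ is recommended arm $a^*$;
      Player $p_0$ is recommended arm $2$
ALGORITHM 1 A BIC algorithm to collect $k$ samples of both arms.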
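The two-arm sampling procedure described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes a Normal prior on the mean of arm 1 with Normal reward noise and a prior that is independent across arms (so samples of arm 1 leave the posterior mean of arm 2 equal to its prior mean); the initial phase length `K`, the phase length `L`, the callable `pull`, and the function names are hypothetical parameters supplied by the caller.

```python
import random
from typing import Callable, Dict, List, Tuple


def normal_posterior_mean(prior_mean: float, prior_var: float,
                          noise_var: float, rewards: List[float]) -> float:
    """Posterior mean of a Normal mean under a Normal prior and Normal noise."""
    precision = 1.0 / prior_var + len(rewards) / noise_var
    return (prior_mean / prior_var + sum(rewards) / noise_var) / precision


def run_algorithm1(pull: Callable[[int], float], k: int, K: int, L: int,
                   prior_means: Tuple[float, float], prior_var1: float,
                   noise_var1: float, rng: random.Random
                   ) -> Tuple[List[int], Dict[int, List[float]]]:
    """Return the recommendations and the rewards observed for arms 1 and 2."""
    recs: List[int] = []
    samples: Dict[int, List[float]] = {1: [], 2: []}
    # Initial phase: the first K agents are all recommended arm 1.
    for _ in range(K):
        recs.append(1)
        samples[1].append(pull(1))
    # Exploit arm: highest posterior mean given the K samples of arm 1.
    # Assumption: independent prior, so arm 2 keeps its prior mean.
    post1 = normal_posterior_mean(prior_means[0], prior_var1, noise_var1,
                                  samples[1])
    post2 = prior_means[1]
    exploit = 2 if post2 > post1 else 1  # ties go to arm 1
    # k further phases of L agents; one random agent per phase explores arm 2.
    for _ in range(k):
        explorer = rng.randrange(L)
        for position in range(L):
            arm = 2 if position == explorer else exploit
            recs.append(arm)
            samples[arm].append(pull(arm))
    return recs, samples
```

An agent who is recommended arm 2 cannot tell whether she is the single explorer of her phase or whether arm 2 is the exploit arm; the sketch makes this explicit by fixing the exploit arm once and hiding the explorer's position inside the phase.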

We prove that the algorithm is BIC as long as $L$ is larger than some prior-dependent constant. The idea is that, due to information asymmetry, an agent recommended arm $2$ does not know whether this is because of exploration or exploitation, but knows that the latter is much more likely. For large enough $L$, the expected gain from exploitation exceeds the expected loss from exploration, making the recommendation BIC.

Lemma 4.2

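Consider BIC bandit exploration with two arms. Assume Property (P1) holds for some $k_{\mathcal{P}}$, and let $Y = X^{k_{\mathcal{P}}}$. Algorithm 1 is BIC as long as

$$L \;\ge\; 1 + \frac{\mu_1^0 - \mu_2^0}{\mathbb{E}[Y \mid Y > 0]\,\Pr(Y > 0)}. \tag{4.2}$$

The algorithm collects at least $k$ samples from each arm, and completes in $Lk + \max(k_{\mathcal{P}}, k)$ rounds.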

Note that the denominator in (4.2) is strictly positive by Property (P1). Indeed, the property implies that $\Pr(Y > \tau) = \delta > 0$ for some $\tau > 0$, so $\mathbb{E}[Y \mid Y > 0]\,\Pr(Y > 0) \ge \tau\delta > 0$.


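The BIC constraint is trivially satisfied in the initial phase. By a simple observation from Kremer et al. [2014], which also applies to our setting, it suffices to show that an agent is incentivized to follow a recommendation for arm $2$. Focus on one phase $n \ge 2$ of the algorithm, and fix some agent $p$ in this phase. Let $Z = \mu_2 - \mu_1$. It suffices to prove that $\mathbb{E}[Z \mid I_p = 2]\,\Pr(I_p = 2) \ge 0$, i.e., that an agent recommended arm $2$ is not incentivized to switch to arm $1$.

Let $S_1^K$ be the random variable which represents the tuple of the initial samples $s_1^K$. Observe that the event $\{I_p = 2\}$ is uniquely determined by the initial samples $S_1^K$ and by the random bits of the algorithm, which are independent of $Z$. Thus by the law of iterated expectations:

$$\mathbb{E}[Z \mid I_p = 2] = \mathbb{E}\big[\,\mathbb{E}[Z \mid S_1^K] \,\big|\, I_p = 2\,\big] = \mathbb{E}[X \mid I_p = 2],$$

where $X = X^K = \mathbb{E}[Z \mid S_1^K]$. There are two possible disjoint events under which agent $p$ is recommended arm $2$: either $\{X > 0\}$ or $\{X \le 0 \text{ and } p = p_0\}$. Thus,

$$\mathbb{E}[Z \mid I_p = 2]\,\Pr(I_p = 2) = \mathbb{E}[X \mid X > 0]\,\Pr(X > 0) + \mathbb{E}[X \mid X \le 0 \text{ and } p = p_0]\,\Pr(X \le 0 \text{ and } p = p_0).$$

Observe that:

$$\Pr(X \le 0 \text{ and } p = p_0) = \Pr(p = p_0 \mid X \le 0)\,\Pr(X \le 0) = \tfrac{1}{L}\,\Pr(X \le 0).$$

Moreover, $X$ is independent of the event $\{p = p_0\}$. Therefore, we get:

$$\mathbb{E}[Z \mid I_p = 2]\,\Pr(I_p = 2) = \mathbb{E}[X \mid X > 0]\,\Pr(X > 0) + \tfrac{1}{L}\,\mathbb{E}[X \mid X \le 0]\,\Pr(X \le 0).$$

Thus, for $\mathbb{E}[Z \mid I_p = 2] \ge 0$ it suffices to pick $L$ such that:

$$L \;\ge\; \frac{-\mathbb{E}[X \mid X \le 0]\,\Pr(X \le 0)}{\mathbb{E}[X \mid X > 0]\,\Pr(X > 0)} \;=\; 1 - \frac{\mathbb{E}[X]}{\mathbb{E}[X \mid X > 0]\,\Pr(X > 0)}.$$

The Lemma follows because $\mathbb{E}[X] = \mathbb{E}[Z] = \mu_2^0 - \mu_1^0$ and

$$\mathbb{E}[X \mid X > 0]\,\Pr(X > 0) \;\ge\; \mathbb{E}[Y \mid Y > 0]\,\Pr(Y > 0).$$

The latter inequality follows from Lemma 4.3, as stated and proved below. Essentially, it holds because $X = X^K$ conditions on more information than $Y = X^{k_{\mathcal{P}}}$: respectively, the first $K$ samples and the first $k_{\mathcal{P}}$ samples of arm $1$, where $K \ge k_{\mathcal{P}}$.  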

Lemma 4.3

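Let $S$ and $T$ be two random variables such that $T$ contains more information than $S$, i.e. if $\sigma(S)$ and $\sigma(T)$ are the sigma algebras generated by these random variables then $\sigma(S) \subset \sigma(T)$. Let $X^S = \mathbb{E}[f(\mu) \mid S]$ and $X^T = \mathbb{E}[f(\mu) \mid T]$ for some function $f(\cdot)$ on the vector of means $\mu$. Then:

$$\mathbb{E}[X^T \mid X^T > 0]\,\Pr(X^T > 0) \;\ge\; \mathbb{E}[X^S \mid X^S > 0]\,\Pr(X^S > 0).$$

Since the event $\{X^S > 0\}$ is measurable in $\sigma(S)$ we have:

$$\mathbb{E}[X^S \mid X^S > 0] = \mathbb{E}\big[\,\mathbb{E}[f(\mu) \mid S] \,\big|\, X^S > 0\,\big] = \mathbb{E}[f(\mu) \mid X^S > 0].$$

Since $\sigma(S) \subset \sigma(T)$:

$$\mathbb{E}[f(\mu) \mid X^S > 0] = \mathbb{E}\big[\,\mathbb{E}[f(\mu) \mid T] \,\big|\, X^S > 0\,\big] = \mathbb{E}[X^T \mid X^S > 0].$$

Thus we can conclude:

$$\mathbb{E}[X^S \mid X^S > 0]\,\Pr(X^S > 0) = \mathbb{E}[X^T \mid X^S > 0]\,\Pr(X^S > 0) = \int_{\{X^S > 0\}} X^T \;\le\; \int_{\{X^T > 0\}} X^T = \mathbb{E}[X^T \mid X^T > 0]\,\Pr(X^T > 0).$$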

Gaussian priors. To give more intuition on the constants $k_{\mathcal{P}}$, $L$ and the random variable $Y$ from Lemma 4.2, we present a concrete example where the common prior is independent across arms, the prior on the expected reward of each arm is a Normal distribution, and the rewards are normally distributed conditional on their mean. We will use $N(\mu, \sigma^2)$ to denote a normal distribution with mean $\mu$ and variance $\sigma^2$.

Example 4.1

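Assume the prior $\mathcal{P}$ is independent across arms, $\mu_i \sim N(\mu_i^0, \sigma_i^2)$ for each arm $i$, and the respective reward $r_i^t$ is conditionally distributed as $N(\mu_i, \rho_i^2)$. Then for each $k \in \mathbb{N}$

$$X^k \sim N\Big(\mu_2^0 - \mu_1^0,\ \ \sigma_1^2 \cdot \frac{\sigma_1^2}{\rho_1^2/k + \sigma_1^2}\Big). \tag{4.4}$$

It follows that Property (P1) is satisfied for any $k = k_{\mathcal{P}} \ge 1$.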


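Observe that $r_1^t = \mu_1 + \epsilon_1^t$, where $\epsilon_1^t \sim N(0, \rho_1^2)$ and $\mu_1 = \mu_1^0 + \zeta_1$, with $\zeta_1 \sim N(0, \sigma_1^2)$. Therefore,

$$\mathbb{E}[\mu_1 \mid S_1^k] = \frac{\frac{1}{\sigma_1^2}\,\mu_1^0 + \frac{1}{\rho_1^2}\sum_{t=1}^{k} r_1^t}{\frac{1}{\sigma_1^2} + \frac{k}{\rho_1^2}} = \mu_1^0 + \frac{\sigma_1^2}{\rho_1^2/k + \sigma_1^2}\Big(\zeta_1 + \frac{1}{k}\sum_{t=1}^{k}\epsilon_1^t\Big).$$

Since $\zeta_1 + \frac{1}{k}\sum_{t=1}^{k}\epsilon_1^t \sim N(0,\ \sigma_1^2 + \rho_1^2/k)$, it follows that $\mathbb{E}[\mu_1 \mid S_1^k]$ is a linear transformation of a Normal distribution and therefore

$$\mathbb{E}[\mu_1 \mid S_1^k] \sim N\Big(\mu_1^0,\ \ \sigma_1^2 \cdot \frac{\sigma_1^2}{\rho_1^2/k + \sigma_1^2}\Big).$$

Eq. 4.4 follows because $X^k = \mu_2^0 - \mathbb{E}[\mu_1 \mid S_1^k]$.  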

Observe that as $k \to \infty$, the variance of $X^k$ increases and converges to the variance of $\mu_1$. Intuitively, more samples can move the posterior mean further away from the prior mean $\mu_1^0$; in the limit of many samples $X^k$ will be distributed exactly the same as $\mu_2^0 - \mu_1$.
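The behaviour of the posterior mean in the Gaussian example can be checked numerically. The sketch below is an illustration only: it draws the mean of one arm from a Normal prior with variance `sigma_sq`, draws `k` rewards with noise variance `rho_sq`, and compares the empirical variance of the posterior mean with the standard closed form `sigma_sq**2 / (sigma_sq + rho_sq / k)`. All function names and parameter values are hypothetical.

```python
import random
from typing import List


def posterior_mean(mu0: float, sigma_sq: float, rho_sq: float,
                   rewards: List[float]) -> float:
    """Normal-Normal posterior mean given the observed rewards."""
    k = len(rewards)
    return (mu0 / sigma_sq + sum(rewards) / rho_sq) / (1.0 / sigma_sq + k / rho_sq)


def simulate_posterior_means(mu0: float, sigma_sq: float, rho_sq: float,
                             k: int, trials: int, seed: int = 0) -> List[float]:
    """Draw the true mean from the prior, then k rewards, and record the
    resulting posterior mean; repeat `trials` times."""
    rng = random.Random(seed)
    draws = []
    for _ in range(trials):
        mu = rng.gauss(mu0, sigma_sq ** 0.5)
        rewards = [rng.gauss(mu, rho_sq ** 0.5) for _ in range(k)]
        draws.append(posterior_mean(mu0, sigma_sq, rho_sq, rewards))
    return draws


def predicted_variance(sigma_sq: float, rho_sq: float, k: int) -> float:
    """Closed-form variance of the posterior mean after k samples."""
    return sigma_sq ** 2 / (sigma_sq + rho_sq / k)
```

The predicted variance increases with `k` and approaches `sigma_sq`, the prior variance, which matches the limiting statement above.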

While any value of the parameter is sufficient for BIC, one could choose to minimize the phase length . Recall from Lemma 4.2 that the (smallest possible) value of is inversely proportional to the quantity . This is simply an integral over the positive region of a normally distributed random variable with negative mean, as given in Eq. 4.4. Thus, increasing increases the variance of this random variable, and therefore increases the integral. For example, is a reasonable choice if .
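The trade-off described here can be explored numerically. For a Normal random variable $Y \sim N(m, s^2)$, the integral over the positive region has the closed form $\mathbb{E}[Y \cdot 1\{Y > 0\}] = m\,\Phi(m/s) + s\,\phi(m/s)$. The Python sketch below uses it to evaluate a sufficient phase length, under the assumption that the condition of Lemma 4.2 reads $L \ge 1 + (\mu_1^0 - \mu_2^0)/\mathbb{E}[Y \cdot 1\{Y > 0\}]$ and that $Y$ has mean $\mu_2^0 - \mu_1^0$ and variance $\sigma_1^4/(\sigma_1^2 + \rho_1^2/k)$. The function names and the numbers in the usage note are hypothetical.

```python
import math


def std_normal_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def positive_part_mean(mean: float, std: float) -> float:
    """E[Y * 1{Y > 0}] for Y ~ N(mean, std**2)."""
    z = mean / std
    return mean * std_normal_cdf(z) + std * std_normal_pdf(z)


def sufficient_phase_length(mu1_0: float, mu2_0: float, sigma1_sq: float,
                            rho1_sq: float, k: int) -> float:
    """Assumed Lemma 4.2 bound on L for the Gaussian example, using k
    initial samples of arm 1."""
    variance = sigma1_sq ** 2 / (sigma1_sq + rho1_sq / k)
    denominator = positive_part_mean(mu2_0 - mu1_0, math.sqrt(variance))
    return 1.0 + (mu1_0 - mu2_0) / denominator
```

With prior means 0.6 and 0.5 and unit variances, the bound shrinks as `k` grows, reflecting that more initial samples of arm 1 make exploitation more valuable relative to the cost of exploration.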

In this section we present a “black-box reduction” that uses a given bandit algorithm $\mathcal{A}$ as a black box to create a BIC algorithm for BIC bandit exploration, with only a minor loss in performance.

The reduction works as follows. There are two parameters: $k, L$. We start with a “sampling stage” which collects $k$ samples of each arm in only a constant number of rounds. (The sampling stage for two arms is implemented in Section id1.) Then we proceed with a “simulation stage”, divided in phases of $L$ rounds each. In each phase, we pick one round uniformly at random and dedicate it to $\mathcal{A}$. We run $\mathcal{A}$ in the “dedicated” rounds only, ignoring the feedback from all other rounds. In the non-dedicated rounds, we recommend an “exploit arm”: an arm with the largest posterior mean reward conditional on the samples collected in the previous phases. A formal description is given in Algorithm 2.

Input : A bandit algorithm $\mathcal{A}$; parameters $k, L \in \mathbb{N}$.
Input : A dataset $S_1$ which contains $k$ samples from each arm.
Split all rounds into consecutive phases of $L$ rounds each;
foreach phase $n = 1, 2, \ldots$ do
      Let $a_n^* = \arg\max_{i \in A} \mathbb{E}[\mu_i \mid S_n]$ be the “exploit arm”;
      Query algorithm $\mathcal{A}$ for its next arm selection $i_n$;
      Pick one agent $p$ from the $L$ agents in the phase uniformly at random;
      Every agent in the phase is recommended $a_n^*$, except agent $p$ who is recommended $i_n$;
      Return to algorithm $\mathcal{A}$ the reward of agent $p$;
      Set $S_{n+1} = S_n \cup \{\text{all samples collected in this phase}\}$
ALGORITHM 2 Black-box reduction: the simulation stage
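A minimal Python sketch of the simulation stage follows. It is an illustration under stated assumptions rather than the authors' implementation: the bandit algorithm is any object with hypothetical methods `select_arm()` and `update(arm, reward)`; the posterior mean is supplied by the caller as a function of the arm and the dataset; and the dataset is assumed to grow by all samples collected in a phase, only after that phase ends.

```python
import random
from typing import Callable, Dict, List


def run_simulation_stage(algo,
                         pull: Callable[[int], float],
                         posterior_mean: Callable[[int, Dict[int, List[float]]], float],
                         dataset: Dict[int, List[float]],
                         L: int, num_phases: int,
                         rng: random.Random) -> List[int]:
    """Return the recommendation issued to each agent, in arrival order."""
    arms = sorted(dataset)
    recs: List[int] = []
    for _ in range(num_phases):
        # Exploit arm: best posterior mean given data from previous phases only.
        exploit = max(arms, key=lambda a: posterior_mean(a, dataset))
        explore = algo.select_arm()        # next selection of the black box
        dedicated = rng.randrange(L)       # the one round given to the black box
        phase_samples = []
        for position in range(L):
            arm = explore if position == dedicated else exploit
            reward = pull(arm)
            recs.append(arm)
            phase_samples.append((arm, reward))
            if position == dedicated:
                algo.update(arm, reward)   # only this reward is fed back
        # Assumption: the dataset is extended at the end of the phase.
        for arm, reward in phase_samples:
            dataset[arm].append(reward)
    return recs
```

Keeping the exploit arm fixed within a phase and updating the dataset only between phases mirrors the requirement that the exploit arm depend on the samples from previous phases; the black box sees exactly one reward per phase.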

The sampling stage for $m > 2$ arms is a non-trivial extension of the two-arms case. In order to convince an agent to pull an arm $i > 1$, we need to convince her not to switch to any other arm $j < i$. We implement a “sequential contest” among the arms. We start with collecting $k$ samples of arm $1$. The rest of the process is divided into $m-1$ phases $2, 3, \ldots, m$, where the goal of each phase $i$ is to collect the samples of arm $i$. We maintain the “exploit arm” $a^*$, the arm with the best posterior mean given the previously collected samples of arms $j < i$. The $i$-th phase collects the samples of arm $i$ using a procedure similar to that in Section id1: $k$ agents chosen u.a.r. are recommended arm $i$, and the rest are recommended the exploit arm. Then the current exploit arm is compared against arm $i$, and the “winner” of this contest becomes the new exploit arm. The pseudocode is in Algorithm 3, following the description above.

Parameters $k, L \in \mathbb{N}$; the number of arms $m$.
For the first $k$ agents recommend arm $1$, let $s_1^k = (r_1^1, \ldots, r_1^k)$ be the tuple of rewards;
foreach arm $i > 1$ in increasing lexicographic order do
      Let $a^* = \arg\max_{a \in A} \mathbb{E}[\mu_a \mid s_1^k, \ldots, s_{i-1}^k]$, breaking ties lexicographically;
      From the set $P$ of the next $L \cdot k$ agents, pick a set $Q$ of $k$ agents uniformly at random;
      Every agent $p \in P \setminus Q$ is recommended arm $a^*$;
      Every agent $p \in Q$ is recommended arm $i$;
      Let $s_i^k$ be the tuple of rewards of arm $i$ returned by agents in $Q$
ALGORITHM 3 Black-box reduction: the sampling stage.
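The sequential contest can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: arms are indexed from 0; each phase is assumed to contain `L * k` agents, of which `k` chosen uniformly at random explore the new arm; and the posterior mean is a caller-supplied function of the arm and of the samples collected so far (falling back to the prior mean for arms without samples).

```python
import random
from typing import Callable, Dict, List, Tuple


def run_sampling_stage(pull: Callable[[int], float],
                       posterior_mean: Callable[[int, Dict[int, List[float]]], float],
                       m: int, k: int, L: int,
                       rng: random.Random
                       ) -> Tuple[List[int], Dict[int, List[float]]]:
    """Return the recommendations and k recorded samples for each of m arms."""
    samples: Dict[int, List[float]] = {i: [] for i in range(m)}
    recs: List[int] = []
    # The first k agents are recommended the arm with the best prior mean.
    for _ in range(k):
        recs.append(0)
        samples[0].append(pull(0))
    for i in range(1, m):
        # Exploit arm given the samples of arms 0..i-1; max() keeps the
        # first maximizer, so ties are broken in favor of the lower index.
        exploit = max(range(m), key=lambda a: posterior_mean(a, samples))
        explorers = set(rng.sample(range(L * k), k))
        new_rewards: List[float] = []
        for position in range(L * k):
            if position in explorers:
                recs.append(i)
                new_rewards.append(pull(i))
            else:
                recs.append(exploit)
                pull(exploit)  # reward is realized but not recorded here
        samples[i] = new_rewards
    return recs, samples
```

Only the rewards of the `k` explorers are recorded in each phase, so the exploit arm of the next phase depends on exactly `k` samples per previously explored arm.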

The rest of this section is organized as follows. First we state the incentive-compatibility guarantees (Section id1), then we prove them for the simulation stage (Section id1) and the sampling stage (Section id1). Then we characterize the performance of the black-box reduction, in terms of average rewards, total rewards, and Bayesian regret (Section id1). We also consider a version in which $\mathcal{A}$ outputs a prediction in each round (Section id1). The technically difficult part is to prove BIC, and trace out assumptions on the prior which make it possible.

As in the previous section, we need an assumption on the prior to guarantee incentive-compatibility.

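  (P2) Let $S_i^k$ be the random variable representing $k$ rewards of action $i$, and let

$$X_i^k = \min_{j \ne i}\ \mathbb{E}\big[\mu_i - \mu_j \mid S_1^k, \ldots, S_{i-1}^k\big].$$

    There exist prior-dependent constants $k_{\mathcal{P}}, \tau_{\mathcal{P}}, \rho_{\mathcal{P}} > 0$ such that

$$\Pr\big(X_i^k > \tau_{\mathcal{P}}\big) \ge \rho_{\mathcal{P}} \qquad \forall k > k_{\mathcal{P}},\ i \in A.$$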

Informally: any given arm $i$ can “a posteriori” be the best arm by margin $\tau_{\mathcal{P}}$ with probability at least $\rho_{\mathcal{P}}$ after seeing sufficiently many samples of each arm $j < i$. We deliberately state this property in such an abstract manner in order to make our BIC guarantees as inclusive as possible, e.g., allow for correlated priors, and avoid diluting our main message by technicalities related to convergence properties of Bayesian estimators.

For the special case of two arms, Property (P2) follows from Property (P1), see Section id1. Recall that the latter property is necessary for any strongly BIC algorithm.

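Theorem 5.1 (BIC)

Assume Property (P2) holds with constants $k_{\mathcal{P}}, \tau_{\mathcal{P}}, \rho_{\mathcal{P}}$. Then the black-box reduction is BIC, for any bandit algorithm $\mathcal{A}$, if the parameters $k, L$ are larger than some prior-dependent constant. More precisely, it suffices to take $k \ge k_{\mathcal{P}}$ and $L \ge 2 + \frac{\mu^0_{\max} - \mu^0_m}{\tau_{\mathcal{P}} \cdot \rho_{\mathcal{P}}}$, where $\mu^0_{\max} = \mathbb{E}[\max_{i \in A} \mu_i]$.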

If the common prior puts positive probability on the event that $\{\mu_i - \max_{j \ne i} \mu_j \ge \tau\}$, then Property (P2) is plausible because the conditional expectations tend to approach the true means as the number of samples grows. We elucidate this intuition by giving a concrete, and yet fairly general, example, where we focus on priors that are independent across arms. (The proof can be found in Section id1.)

Lemma 5.1

Property (P2) holds for a constant number of arms under the following assumptions:

  • The prior is independent across arms.

  • The prior on each $\mu_i$ has full support on some bounded interval $[a, b]$. (That is, the prior on $\mu_i$ assigns a strictly positive probability to every open subinterval of $[a, b]$, for each arm $i$.)

  • The prior mean rewards are all distinct: $\mu_1^0 > \mu_2^0 > \ldots > \mu_m^0$.

  • The realized rewards are bounded (perhaps on a larger interval).

Instead of using Property (P2) directly, we will use a somewhat weaker corollary thereof (derived in Section id1):

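  (P3) Let $\Lambda_k$ be a random variable representing $k_i \ge k$ samples from each action $i$. There exist prior-dependent constants $k^*_{\mathcal{P}} < \infty$ and $\tau_{\mathcal{P}}, \rho_{\mathcal{P}} > 0$ such that

$$\forall k \ge k^*_{\mathcal{P}},\ \forall i \in A: \qquad \Pr\Big(\min_{j \in A \setminus \{i\}} \mathbb{E}[\mu_i - \mu_j \mid \Lambda_k] \ge \tau_{\mathcal{P}}\Big) \ge \rho_{\mathcal{P}}.$$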

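Let us consider phase $n$ of the algorithm. We will argue that any agent $p$ who is recommended some arm $j$ does not want to switch to some other arm $i$. More formally, we assert that

$$\mathbb{E}[\mu_j - \mu_i \mid I_p = j]\,\Pr(I_p = j) \ge 0.$$

Let $S_n$ be the random variable which represents the dataset collected by the algorithm by the beginning of phase $n$. Let $X_{ji}^n = \mathbb{E}[\mu_j - \mu_i \mid S_n]$ and $X_j^n = \min_{i \in A \setminus \{j\}} X_{ji}^n$. It suffices to show:

$$\mathbb{E}[X_j^n \mid I_p = j]\,\Pr(I_p = j) \ge 0.$$

There are two possible ways that agent $p$ is recommended arm $j$. Either arm $j$ is the best posterior arm, i.e., $X_j^n > 0$, and agent $p$ is not the unlucky agent; or $X_j^n < 0$, in which case arm $j$ cannot possibly be the exploit arm $a_n^*$, and therefore $I_p = j$ happens only if agent $p$ was the one unlucky agent among the $L$ agents of the phase to be recommended the arm of algorithm $\mathcal{A}$, and algorithm $\mathcal{A}$ recommended arm $j$. There are also the following two cases: (i) $X_j^n = 0$, or (ii) $X_j^n > 0$, agent $p$ is the unlucky agent and algorithm $\mathcal{A}$ recommends arm $j$. However, since we only want to lower bound the expectation and since under both these events $X_j^n \ge 0$, we will disregard these events, which can only contribute positively.

Denote with $\hat{p}_n$ the index of the “unlucky” agent selected in phase $n$ to be given the recommendation of the original algorithm $\mathcal{A}$. Denote with $A_j^n$ the event that algorithm $\mathcal{A}$ pulls arm $j$ at iteration $n$. Thus by our reasoning in the previous paragraph:

$$\mathbb{E}[X_j^n \mid I_p = j]\,\Pr(I_p = j) \;\ge\; \mathbb{E}[X_j^n \mid X_j^n > 0,\ \hat{p}_n \ne p]\,\Pr(X_j^n > 0,\ \hat{p}_n \ne p) + \mathbb{E}[X_j^n \mid X_j^n < 0,\ \hat{p}_n = p,\ A_j^n]\,\Pr(X_j^n < 0,\ \hat{p}_n = p,\ A_j^n). \tag{5.2}$$

Since $\hat{p}_n$ is independent of $X_j^n$ and of $A_j^n$, we can drop conditioning on the event $\{\hat{p}_n = p\}$ (and likewise on $\{\hat{p}_n \ne p\}$) from the conditional expectations in Eq. 5.2. Further, observe that:

$$\Pr(X_j^n > 0,\ \hat{p}_n \ne p) = \big(1 - \tfrac{1}{L}\big)\Pr(X_j^n > 0), \qquad \Pr(X_j^n < 0,\ \hat{p}_n = p,\ A_j^n) = \tfrac{1}{L}\,\Pr(X_j^n < 0,\ A_j^n).$$

Plugging this into Eq. 5.2, we can re-write it as follows:

$$\mathbb{E}[X_j^n \mid I_p = j]\,\Pr(I_p = j) \;\ge\; \big(1 - \tfrac{1}{L}\big)\,\mathbb{E}[X_j^n \mid X_j^n > 0]\,\Pr(X_j^n > 0) + \tfrac{1}{L}\,\mathbb{E}[X_j^n \mid X_j^n < 0,\ A_j^n]\,\Pr(X_j^n < 0,\ A_j^n). \tag{5.3}$$

Observe that conditional on $X_j^n < 0$, the integrand of the second part in the latter summation is always non-positive. Thus integrating over a bigger set of events can only decrease it, i.e.:

$$\mathbb{E}[X_j^n \mid X_j^n < 0,\ A_j^n]\,\Pr(X_j^n < 0,\ A_j^n) \;\ge\; \mathbb{E}[X_j^n \mid X_j^n < 0]\,\Pr(X_j^n < 0).$$

Plugging this into Eq. 5.3, we can lower bound the expected gain as:

$$\mathbb{E}[X_j^n \mid I_p = j]\,\Pr(I_p = j) \;\ge\; \big(1 - \tfrac{1}{L}\big)\,\mathbb{E}[X_j^n \mid X_j^n > 0]\,\Pr(X_j^n > 0) + \tfrac{1}{L}\,\mathbb{E}[X_j^n \mid X_j^n < 0]\,\Pr(X_j^n < 0) = \big(1 - \tfrac{2}{L}\big)\,\mathbb{E}[X_j^n \mid X_j^n > 0]\,\Pr(X_j^n > 0) + \tfrac{1}{L}\,\mathbb{E}[X_j^n].$$

Now observe that:

$$\mathbb{E}[X_j^n \mid X_j^n > 0]\,\Pr(X_j^n > 0) \;\ge\; \mathbb{E}[X_j^n \mid X_j^n \ge \tau_{\mathcal{P}}]\,\Pr(X_j^n \ge \tau_{\mathcal{P}}) \;\ge\; \tau_{\mathcal{P}}\,\Pr(X_j^n \ge \tau_{\mathcal{P}}).$$

By property (P3), we have that $\Pr(X_j^n \ge \tau_{\mathcal{P}}) \ge \rho_{\mathcal{P}}$. Therefore:

$$\mathbb{E}[X_j^n \mid X_j^n > 0]\,\Pr(X_j^n > 0) \;\ge\; \tau_{\mathcal{P}} \cdot \rho_{\mathcal{P}}.$$

Thus, for $L \ge 2$, the expected gain is lower bounded by:

$$\big(1 - \tfrac{2}{L}\big)\,\tau_{\mathcal{P}} \cdot \rho_{\mathcal{P}} + \tfrac{1}{L}\,\mathbb{E}[X_j^n].$$

The latter is non-negative if:

$$L \;\ge\; 2 + \frac{-\mathbb{E}[X_j^n]}{\tau_{\mathcal{P}} \cdot \rho_{\mathcal{P}}}.$$

Since this must hold for every arm $j$, and since $\mathbb{E}[X_j^n] \ge \mathbb{E}[\mu_j - \max_{i \in A} \mu_i] \ge \mu_m^0 - \mathbb{E}[\max_{i \in A} \mu_i]$ by Jensen's inequality, we get that for incentive compatibility it suffices to pick:

$$L \;\ge\; 2 + \frac{\mathbb{E}[\max_{i \in A} \mu_i] - \mu_m^0}{\tau_{\mathcal{P}} \cdot \rho_{\mathcal{P}}}.$$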

Compared to Theorem 5.1, a somewhat weaker bound on parameter suffices: .

Denote and .

The algorithm can be split into $m$ phases, where each phase except the first one lasts $L \cdot k$ rounds, and agents are aware which phase they belong to. We will argue that each agent $p$ in each phase $i$ has no incentive not to follow the recommended arm. If an agent in phase $i$ is recommended any arm $j \ne i$, then she knows that this is because this is the exploit arm $a_i^*$; so by definition of the exploit arm it is incentive-compatible for the agent to follow it. Thus it remains to argue that an agent $p$ in phase $i$ who is recommended arm $i$ does not want to deviate to some other arm $j \ne i$, i.e., that $\mathbb{E}[\mu_i - \mu_j \mid I_p = i]\,\Pr(I_p = i) \ge 0$.

By the law of iterated expectations: