# Incentivized Exploration for Multi-Armed Bandits under Reward Drift

Zhiyuan Liu
Department of Computer Science
Department of Computer Science
University of Virginia
hw7ww@virginia.edu
Fan Shen
Technology, Cybersecurity and Policy
Computer Science Division
Clemson University
kail@clemson.edu
Lijun Chen
Department of Computer Science
###### Abstract

We study incentivized exploration for the multi-armed bandit (MAB) problem where the players receive compensation for exploring arms other than the greedy choice and may provide biased feedback on reward. We seek to understand the impact of this drifted reward feedback by analyzing the performance of three instantiations of the incentivized MAB algorithm: UCB, ε-Greedy, and Thompson Sampling. Our results show that they all achieve $\mathcal{O}(\log T)$ regret and compensation under the drifted reward, and are therefore effective in incentivizing exploration. Numerical examples are provided to complement the theoretical analysis.

## Introduction

The multi-armed bandit (MAB) problem is a classical model for sequential decision making under uncertainty, and finds applications in many real-world systems such as recommender systems [Li et al.2010, Bouneffouf, Bouzeghoub, and Gançarski2012], search engine systems [Radlinski, Kleinberg, and Joachims2008, Yue and Joachims2009], and cognitive radio networks [Gai, Krishnamachari, and Jain2010], to name just a few. In the traditional MAB model, the decision maker (the principal) who selects the arm to pull and the action performers (the players) who actually pull the arm are assumed to be the same entity. This is, however, not true for several important real-world applications where the principal and players are different entities with different interests. Take the Amazon product rating as an example: Amazon (the principal) would like the customers (the players) to buy and try different products (arms) of a certain type in order to identify the best product (i.e., exploration), while the customers are heavily influenced by the current ratings on the products and behave myopically, i.e., select the product that currently has the highest rating (i.e., exploitation). It is well known that such exploitation-only behavior can be far from optimal [Bubeck and Cesa-Bianchi2012, Sutton and Barto2018]. In the traditional MAB setting, the principal, who is also the player, strives to find the optimal tradeoff between exploration and exploitation and executes it accordingly. When the principal and players are different entities, the misaligned interests between them need to be reconciled in order to balance exploration and exploitation in an optimal manner.

Incentivized learning has been proposed for the MAB problem to reconcile different interests between the principal and the players [Frazier et al.2014, Mansour, Slivkins, and Syrgkanis2015, Wang and Huang2018, Immorlica et al.2019]. In order to incentivize exploration, the principal provides certain compensation to the player so that s/he will pull an arm other than the greedy choice, i.e., the arm that currently has the best empirical reward. The goal of the principal is to maximize the cumulative reward while minimizing the total compensation to the players.

However, existing incentivized MAB models [Han, Kempe, and Qiang2015, Wang and Huang2018, Immorlica et al.2018, Liu and Ho2018, Hirnschall et al.2018] assume that the players provide unbiased stochastic feedback on reward even after they receive certain incentive from the principal. (We use reward and feedback interchangeably in this paper, whichever is convenient for exposition.) This assumption does not always hold in real-world scenarios: work based on industrial-level experiments in [Martensen, Gronholdt, and Kristensen2000, Razak, Nirwanto, and Triatmanto2016, Ehsani and Ehsani2015] shows that customers are inclined to give a higher evaluation (i.e., increased reward) when offered an incentive such as a discount or coupon. The compensation could even be the primary driver for customer satisfaction [Martensen, Gronholdt, and Kristensen2000, Lee and Lin2005]. This drift in reward feedback may have a negative impact on the exploration and exploitation tradeoff, e.g., a suboptimal arm may be mistaken for the optimal one because of the incentives, and the players will keep pulling it even after the compensation is removed. This effect has been ignored in previous research.

In this paper, we aim to investigate the impact of drifted reward feedback in the incentivized MAB problem. Specifically, we consider a general incentivized exploration algorithm where the player receives a compensation that is the difference in reward between the principal’s choice and the greedy choice, and provides biased feedback that is the sum of the true reward of an arm and a drift term that is a non-decreasing function of the compensation received for pulling this arm. We seek to answer the important question of whether the compensation scheme is effective in incentivizing exploration under reward drift from two intertwining aspects: (1) whether the algorithm is robust to drifted reward so that the sequential decisions based on biased feedback still enjoy a small regret, and (2) whether the proposed incentive mechanism is cost-efficient to the principal. We analyze the regret and compensation for three instantiations of the algorithm where the principal employs UCB, ε-Greedy, and Thompson Sampling, respectively. Our analytical results, complemented by numerical experiments, show that the proposed compensation scheme achieves $\mathcal{O}(\log T)$ regret and compensation and is thus effective in incentivizing exploration. Experimental results validate the performance of the proposed algorithms and also coincide with the theoretical analysis.

### Related Work

Incentivized learning has attracted a lot of attention since the work [Frazier et al.2014]. In [Frazier et al.2014], the authors propose a Bayesian incentivized model with discounted regret and compensation, and characterize the relationship between the reward, compensation, and discount factor. In [Mansour, Slivkins, and Syrgkanis2015], the authors study the non-discount case and propose an algorithm with a regret upper bound. In [Wang and Huang2018], the authors analyze the non-Bayesian and non-discount reward case and show $\mathcal{O}(\log T)$ regret and compensation for incentivized exploration based on simplified MAB algorithms. But all these models and analyses assume that the players’ feedback is unbiased under compensation. In contrast, we consider biased feedback under compensation, and show that incentivized exploration with reward drift can still achieve $\mathcal{O}(\log T)$ regret and compensation.

Related work also includes studies of the robustness of MAB under adversarial attack. In [Lykouris, Mirrokni, and Paes Leme2018], the authors propose a multi-layer active arm elimination race algorithm for stochastic bandits with adversarial corruptions whose performance degrades linearly with the amount of corruption, and show that this linear degradation is necessary. In [Feng, Parkes, and Xu2019], the authors study strategic behavior of rational arms and show that UCB, ε-Greedy, and Thompson Sampling achieve a regret upper bound, under any strategy of the strategic arms, that depends on the total budget across arms. On the other hand, in [Jun et al.2018], the authors construct attacks by decreasing the reward of non-target arms, and show that their algorithm can trick UCB and ε-Greedy into pulling the optimal arm only $o(T)$ times under an $\mathcal{O}(\log T)$ attack budget. All the modeled attacks are from exogenous sources, e.g., malicious users, while in our paper the reward drift can be interpreted as an attack that is generated endogenously by the incentivized exploration algorithm itself.

## Model, Notation, and Algorithm

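Consider a variant of the multi-armed bandit problem where a principal has $K$ arms, denoted by the set $\{1, \dots, K\}$. The reward of each arm $i$ follows a distribution with support $[0,1]$ and mean $\mu_i$ that is unknown. Without loss of generality, we assume that arm $1$ is the unique optimum with the maximum mean. Denote by $\Delta_i = \mu_1 - \mu_i$ the reward gap between arm $1$ and arm $i$, and let $\Delta = \min_{i \neq 1} \Delta_i$. At each time $t$, a new player will pull one arm $I_t$ and receive a reward $r_t$ that will be fed back to the principal and other players. Let $n_i(t) = \sum_{\tau=1}^{t-1} \mathbb{I}(I_\tau = i)$ denote the number of times that arm $i$ has been pulled up to time $t$ and $\hat{\mu}_i(t) = \frac{1}{n_i(t)} \sum_{\tau=1}^{t-1} r_\tau \mathbb{I}(I_\tau = i)$ the corresponding empirical average reward, where the indicator function $\mathbb{I}(A) = 1$ if $A$ is true and $0$ otherwise.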

In real-world applications, the principal and players may exhibit different behaviors. The principal would like to see the players select the best arm and maximize the cumulative reward. On the other hand, the players may be heavily influenced by other players’ feedback, e.g., the reward history of the arms, and behave myopically, i.e., pull the arm that currently achieves the highest empirical reward (exploitation). It is well known that such a myopic exploitation-only behavior can be far from the optimum due to the lack of exploration [Sutton and Barto2018]. The principal cannot pull the arm directly, but can provide certain compensation to incentivize the players to pull arms with suboptimal empirical reward (exploration). However, this compensation may affect the players’ feedback [Martensen, Gronholdt, and Kristensen2000], which results in a biased reward history and disturbs both the principal’s and the players’ decisions. Specifically, we assume that at time $t$ there is a drift $b_t$ in feedback that is caused by the compensation $x_t$, captured by an unknown function $f_t$, i.e., $b_t = f_t(x_t)$, with the following properties.

###### Assumption 1.

The reward drift function $f_t$ is non-decreasing with $f_t(0) = 0$, and is Lipschitz continuous, i.e., there exists a constant $l_t$ such that for any $x$ and $y$,

$$|f_t(x) - f_t(y)| \le l_t |x - y|. \tag{1}$$

The biased feedback $r_t + b_t$, i.e., the true reward $r_t$ plus the drift $b_t$, is then collected, and the principal and players know only the sum and cannot distinguish the two parts.

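Let $l = \max_t l_t$ for later use. Let $E_{i,t} = 1$ denote the event that arm $i$ is pulled with compensation at time $t$, and $E_{i,t} = 0$ otherwise. Denote by $B_i(t) = \sum_{\tau=1}^{t} b_\tau \mathbb{I}(E_{i,\tau} = 1)$ the cumulative drift of arm $i$ up to time $t$, where $b_\tau$ is the drift at time $\tau$, and by $\bar{\mu}_i(t) = \hat{\mu}_i(t) + \frac{B_i(t)}{n_i(t)}$ the corresponding average drifted reward. The general incentive mechanism and algorithm are described in Algorithm 1.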

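We characterize the performance of the incentivized exploration algorithm in terms of two metrics: the expected cumulative regret, which quantifies the total loss from not pulling the best arm, and the cumulative compensation that the principal pays for incentivizing exploration. Here $G_t$ denotes the greedy arm at time $t$, i.e., the arm with the highest average drifted reward:

$$\mathbb{E}(R(T)) = \mathbb{E}\left(\sum_{t=1}^{T} \left(\mu_1 - \mu_{I_t}\right)\right) = \sum_{i=2}^{K} \Delta_i \, \mathbb{E}(n_i(T+1)),$$

$$\mathbb{E}(C(T)) = \mathbb{E}\left(\sum_{t=1}^{T} \left(\bar{\mu}_{G_t}(t) - \bar{\mu}_{I_t}(t)\right)\right).$$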

Notice that in Algorithm 1 the compensation and decision are made based on biased feedback, which may not be an accurate reflection of an arm’s reward, while the regret is in terms of the “true” reward that is unknown. We seek to answer the important question of whether the proposed compensation scheme is effective in incentivizing exploration from two intertwining aspects: (1) whether the algorithm is robust to drifted reward so that the sequential decisions based on biased feedback still enjoy a small regret, and (2) whether the proposed incentive mechanism is cost-efficient to the principal. There are different arm selection strategies that the principal can employ in Step 2 of Algorithm 1. In the next section, in order to answer the above question, we analyze the cumulative regret and compensation under several typical multi-armed bandit algorithms: UCB, ε-Greedy, and Thompson Sampling.
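The mechanism described in this section can be sketched as a short simulation. This is only an illustrative sketch, not the paper's Algorithm 1: the function name `run_incentivized`, the policy signature `select_arm(avg, counts, t, rng)`, the Bernoulli rewards, the linear drift `drift_coef * pay`, and the unpaid initialization round are all assumptions made for the example.

```python
import numpy as np

def run_incentivized(select_arm, means, drift_coef, horizon, seed=0):
    """Simulate incentivized exploration with linearly drifted feedback.

    select_arm(avg, counts, t, rng) is the principal's policy (hypothetical
    signature); it sees only the biased averages, never the true means.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    k = len(means)
    counts = np.zeros(k)        # n_i(t): pulls of each arm so far
    feedback_sum = np.zeros(k)  # sum of biased feedback per arm
    regret, compensation = 0.0, 0.0
    for t in range(1, horizon + 1):
        pay = 0.0
        if t <= k:
            arm = t - 1  # assumed initialization: pull every arm once, unpaid
        else:
            avg = feedback_sum / counts       # average drifted rewards
            greedy = int(np.argmax(avg))      # the player's myopic choice
            arm = int(select_arm(avg, counts, t, rng))
            if arm != greedy:
                pay = avg[greedy] - avg[arm]  # compensation, never negative
        reward = float(rng.random() < means[arm])       # Bernoulli reward (assumption)
        feedback_sum[arm] += reward + drift_coef * pay  # true reward plus drift
        counts[arm] += 1
        regret += means.max() - means[arm]  # regret uses the true means
        compensation += pay
    return regret, compensation

# A principal that simply follows the greedy arm never pays anything.
greedy_policy = lambda avg, counts, t, rng: int(np.argmax(avg))
regret, paid = run_incentivized(greedy_policy, [0.9, 0.5, 0.4], drift_coef=1.0, horizon=2000)
```

Passing a different `select_arm` (for instance an upper-confidence or sampling rule) changes only the principal's choice; the payment and drift logic stay the same, which mirrors how the three instantiations analyzed next share one mechanism.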

## Regret and Compensation Analysis

In this section, we consider three instantiations of Algorithm 1 when the principal employs UCB, ε-Greedy, and Thompson Sampling at Step 2, respectively. As will be seen later, our analysis shows that the proposed compensation scheme is effective in incentivizing exploration under reward drift.

### UCB policy

Consider first the case where the principal applies the UCB policy, i.e., uses the sum of average biased reward and upper confidence bound as the criterion to choose the arm to explore, as shown in Algorithm 2. The main result is summarized in Theorem 1.
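As a hedged illustration of the selection rule just described (not a reproduction of Algorithm 2), the sketch below computes the UCB index from the average biased reward with the bonus $\sqrt{2\log t / n_i(t)}$ that appears in the proof, and the compensation as the gap to the greedy arm. The function names and the sample numbers are hypothetical.

```python
import math

def ucb_choice(drifted_avg, counts, t):
    """Return the arm maximizing average biased reward + sqrt(2 log t / n_i)."""
    scores = [m + math.sqrt(2.0 * math.log(t) / n)
              for m, n in zip(drifted_avg, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

def compensation_for(drifted_avg, principal_arm):
    """Gap between the greedy arm's average and the principal's arm's average."""
    return max(drifted_avg) - drifted_avg[principal_arm]

# Hypothetical state: arm 0 looks better but arm 1 is barely explored.
avg, counts, t = [0.6, 0.5], [50, 5], 55
arm = ucb_choice(avg, counts, t)
pay = compensation_for(avg, arm)
bonus = math.sqrt(2.0 * math.log(t) / counts[arm])
```

In this example the under-explored arm is selected, and the payment stays below the confidence bonus of the selected arm, in line with the bound on per-round compensation used in the proof of Theorem 1.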

###### Theorem 1.

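For the incentivized UCB algorithm, the expected regret and compensation are bounded as follows:

$$\mathbb{E}(R(T)) \le \sum_{i=2}^{K} \left( \frac{8(l+1)^2 \log T}{\Delta_i} + \frac{\Delta_i (K-1) \pi^2}{3} \right), \tag{2}$$

$$\mathbb{E}(C(T)) \le \sum_{i=2}^{K} \frac{16(l+1)\log T}{\Delta_i} + \frac{16(l+1)\log T}{\Delta} + 2\pi K \sqrt{\frac{2\log T}{3}}. \tag{3}$$
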
###### Proof.

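Notice that compensation is incurred under the conditions:

$$\bar{\mu}_{I_t}(t) \le \bar{\mu}_{G_t}(t), \qquad \bar{\mu}_{I_t}(t) + \sqrt{\frac{2\log t}{n_{I_t}(t)}} \ge \bar{\mu}_{G_t}(t) + \sqrt{\frac{2\log t}{n_{G_t}(t)}}.$$

By the second condition, the compensation

$$\bar{\mu}_{G_t}(t) - \bar{\mu}_{I_t}(t) \le \sqrt{\frac{2\log t}{n_{I_t}(t)}}, \tag{4}$$

and further by Assumption 1, the drift $b_t \le l_t \sqrt{\frac{2\log t}{n_{I_t}(t)}}$. The total drift of arm $i$ can be bounded as follows (due to space limit, the details of inequality (5) are provided in supplementary material):

$$B_i(t) = \sum_{\tau=1}^{t} b_\tau \mathbb{I}(E_{i,\tau} = 1) \le 2l\sqrt{2 n_i(t) \log t}. \tag{5}$$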

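For each sub-optimal arm $i \neq 1$, if this arm is pulled by the player at time $t$ (with or without compensation), it must hold that

$$\hat{\mu}_i(t) + \frac{B_i(t)}{n_i(t)} + \sqrt{\frac{2\log t}{n_i(t)}} \ge \hat{\mu}_1(t) + \frac{B_1(t)}{n_1(t)} + \sqrt{\frac{2\log t}{n_1(t)}}.$$

So, the probability that arm $i$ is pulled by the player at time $t$ can be bounded by the following:

$$
\begin{aligned}
\Pr(I_t = i) &\le \Pr\left(\hat{\mu}_i(t) + \frac{B_i(t)}{n_i(t)} + \sqrt{\frac{2\log t}{n_i(t)}} \ge \hat{\mu}_1(t) + \frac{B_1(t)}{n_1(t)} + \sqrt{\frac{2\log t}{n_1(t)}}\right) \\
&\le \Pr\left(\hat{\mu}_i(t) + (2l+1)\sqrt{\frac{2\log t}{n_i(t)}} \ge \hat{\mu}_1(t) + \frac{B_1(t)}{n_1(t)} + \sqrt{\frac{2\log t}{n_1(t)}}\right) \\
&\le \Pr\left(\hat{\mu}_i(t) + (2l+1)\sqrt{\frac{2\log t}{n_i(t)}} \ge \hat{\mu}_1(t) + \sqrt{\frac{2\log t}{n_1(t)}}\right),
\end{aligned}
$$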

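where the second inequality is due to the bound (5) on cumulative drift, and the third holds because the drift on arm $1$ is non-negative. Similar to the analysis in [Auer, Cesa-Bianchi, and Fischer2002], notice that if the event $\hat{\mu}_i(t) + (2l+1)\sqrt{\frac{2\log t}{n_i(t)}} \ge \hat{\mu}_1(t) + \sqrt{\frac{2\log t}{n_1(t)}}$ happens, one of the following three events must happen:

$$
\begin{aligned}
X_i(t) &: \hat{\mu}_i(t) \ge \mu_i + \sqrt{\frac{2\log t}{n_i(t)}}, \\
Y_1(t) &: \hat{\mu}_1(t) \le \mu_1 - \sqrt{\frac{2\log t}{n_1(t)}}, \\
Z_i(t) &: 2(l+1)\sqrt{\frac{2\log t}{n_i(t)}} \ge \Delta_i.
\end{aligned}
$$

Therefore, $\Pr(I_t = i) \le \Pr(X_i(t)) + \Pr(Y_1(t)) + \Pr(Z_i(t))$. By the Chernoff-Hoeffding inequality,

$$\Pr(X_i(t)) \le \frac{1}{t^2}, \qquad \Pr(Y_1(t)) \le \frac{1}{t^2},$$

and their sum from $t = 1$ to $T$ is bounded by $\frac{\pi^2}{3}$. If $n_i(t) > \frac{8(l+1)^2 \log T}{\Delta_i^2}$, the event $Z_i(t)$ will not happen, and thus $\Pr(Z_i(t)) = 0$. We can bound $\mathbb{E}(n_i(T))$ as follows:

$$
\begin{aligned}
\mathbb{E}(n_i(T)) &= \sum_{t=1}^{T} \Pr(I_t = i) \\
&\le \sum_{t=1}^{T} \Big(\Pr(X_i(t)) + \Pr(Y_1(t)) + \Pr(Z_i(t))\Big) \\
&\le \frac{8(l+1)^2 \log T}{\Delta_i^2} + \frac{\pi^2}{3}.
\end{aligned}
$$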

So, the expected regret

$$\mathbb{E}(R(T)) \le \sum_{i=2}^{K} \left( \frac{8(l+1)^2 \log T}{\Delta_i} + \frac{\Delta_i (K-1) \pi^2}{3} \right).$$

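The calculation of compensation is a bit different from that of regret, since compensation can be incurred even if the best arm is pulled. The player will be compensated to pull arm 1 only when, for some arm $i \neq 1$,

$$\bar{\mu}_1(t) \le \bar{\mu}_i(t),$$

$$\bar{\mu}_1(t) + \sqrt{\frac{2\log t}{n_1(t)}} \ge \bar{\mu}_i(t) + \sqrt{\frac{2\log t}{n_i(t)}},$$

which requires $n_1(t) \le n_i(t)$. So, the average number of times when the players are compensated to pull arm $1$ is smaller than $\max_{i \neq 1} \mathbb{E}(n_i(T))$. Denote by $C_i(T)$ the total compensation the players have received to pull arm $i$ up to time $T$. Recalling the bound (4), we can bound the total compensation as follows:

$$
\begin{aligned}
\mathbb{E}(C(T)) &= \mathbb{E}\left(C_1(T) + \sum_{i=2}^{K} C_i(T)\right) \\
&\le \sum_{m=1}^{\max_{i \neq 1} \mathbb{E}(n_i(T))} \sqrt{\frac{2\log T}{m}} + \sum_{i=2}^{K} \sum_{m=1}^{\mathbb{E}(n_i(T))} \sqrt{\frac{2\log T}{m}} \\
&\le \frac{16(l+1)\log T}{\Delta} + 2K\pi\sqrt{\frac{2\log T}{3}} + \sum_{i=2}^{K} \frac{16(l+1)\log T}{\Delta_i}.
\end{aligned}
$$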

### ε-Greedy policy

We now consider the case where the principal uses the ε-Greedy policy as shown in Algorithm 3, with the exploration probability $\varepsilon_t$ chosen following [Auer, Cesa-Bianchi, and Fischer2002], which shows that a diminishing $\varepsilon_t$ achieves better performance. Algorithm 3 involves a random exploration phase (Step 3), and its analysis is more involved. Recall that the “true” reward has a normalized support of $[0,1]$; we therefore assume that the drifted reward is projected onto $[0,1]$. This assumption is also consistent with real-world applications such as Amazon and Yelp, as their rating systems usually have lower and upper bounds.
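To make the diminishing exploration probability concrete, here is a minimal sketch under an assumed schedule $\varepsilon_t = \min(1, cK/t)$, in the style of the decreasing schedule of Auer et al.; the exact schedule and constants used in Algorithm 3 are not reproduced here, and the function names are hypothetical. With this choice the expected number of exploration rounds directed at any single arm is at most $c(\log T + 1)$, the leading term of the exploration bound used in the analysis.

```python
import math

def epsilon_schedule(t, c, k):
    """Assumed diminishing exploration probability: min(1, c*K/t)."""
    return min(1.0, c * k / t)

def expected_explorations(horizon, c, k):
    """Expected number of exploration rounds that land on one fixed arm,
    when an exploring round picks uniformly among the k arms."""
    return sum(epsilon_schedule(t, c, k) / k for t in range(1, horizon + 1))

# Hypothetical parameters chosen only for illustration.
c, k, horizon = 5.0, 10, 20000
explored = expected_explorations(horizon, c, k)
log_bound = c * (math.log(horizon) + 1.0)
```

Because each term satisfies $\min(1, cK/t)/K \le c/t$, the sum is dominated by $c$ times the harmonic number, which is why exploration (and hence compensated pulls) grows only logarithmically in the horizon.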

###### Theorem 2.

For the incentivized ε-Greedy algorithm with and , with high probability the expected regret and compensation are bounded as follows:

$$\mathbb{E}(R(T)) \le \sum_{i=2}^{K} c\,S_i(l)(\log T + 1) + c(K-1)\left(K + \frac{\pi^2}{6}\right), \tag{6}$$

$$\mathbb{E}(C(T)) \le \max(l,1)\,(c + \sqrt{3c})\,K \log T, \tag{7}$$

where .

###### Proof.

Since the biased feedback lies in the interval $[0,1]$, the drift at each time is bounded by a constant, denoted by $g$. A compensation for pulling arm $i$ will be incurred only when the arm is chosen by the principal to explore. By Lemma 2 in supplementary material, the number of explorations $m_i(t)$ that arm $i$ can receive up to time $t$ is bounded by

$$m_i(t) \le c(\log t + 1) + \sqrt{3c \log\frac{K}{\delta}\,(\log t + 1)} \tag{8}$$

with a probability of at least $1 - \delta$. When $t$ is large enough such that $\log\frac{K}{\delta} \le \log t + 1$, the right hand side of (8) is upper bounded by

$$\bar{m}_i(t) = (c + \sqrt{3c})(\log t + 1),$$

and the total drift on arm $i$ up to time $t$ is upper bounded by $g\,\bar{m}_i(t)$ with a probability of at least $1 - \delta$.

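Let $L = \frac{3 g\,\bar{m}_i(T)}{\Delta_i}$, which is chosen to facilitate the analysis. We can bound $\mathbb{E}(n_i(T))$ as follows:

$$
\begin{aligned}
\mathbb{E}(n_i(T)) &\le \sum_{t=1}^{T} \frac{\varepsilon_t}{K} + \mathbb{E}\left(\sum_{t=1}^{T} (1-\varepsilon_t)\,\mathbb{I}(I_t = i,\, n_i(t) \le L)\right) + \mathbb{E}\left(\sum_{t=1}^{T} (1-\varepsilon_t)\,\mathbb{I}(I_t = i,\, n_i(t) \ge L)\right) \\
&\le \underbrace{\sum_{t=1}^{T} \frac{\varepsilon_t}{K} + L}_{A} + \mathbb{E}\left(\sum_{t=1}^{T} \mathbb{I}(I_t = i,\, n_i(t) \ge L)\right) \\
&\le A + \sum_{t=1}^{T} \Pr\left(\hat{\mu}_i(t) + \frac{B_i(t)}{n_i(t)} \ge \hat{\mu}_1(t) + \frac{B_1(t)}{n_1(t)},\; n_i(t) \ge L\right) \\
&\le A + \sum_{t=1}^{T} \Pr\left(\hat{\mu}_i(t) + \frac{\Delta_i}{3} \ge \hat{\mu}_1(t)\right) \\
&\le A + \sum_{t=1}^{T} \Pr\left(\hat{\mu}_i(t) \ge \mu_i + \frac{\Delta_i}{3}\right) + \sum_{t=1}^{T} \Pr\left(\hat{\mu}_1(t) \le \mu_1 - \frac{\Delta_i}{3}\right),
\end{aligned}
$$

where the second last inequality is due to

$$\frac{B_i(t)}{n_i(t)} \le \frac{g\,\bar{m}_i(t)}{3 g\,\bar{m}_i(T)/\Delta_i} \le \frac{\Delta_i}{3},$$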

and the last inequality uses the fact that . By Lemma 3 in supplementary material, when , we have

 ≤(c2+18Δ2i)logT+c(K+π2Δ2i)+18Δ2i.

We can also show that , and further obtain the bound (6) on expected regret after some straightforward mathematical manipulations.

For the compensation analysis, notice again that the drifted reward is in $[0,1]$, so the compensation at each time is less than $1$ and the total compensation the players receive to pull arm $i$ is bounded by the bound on the number of explorations it receives. To be consistent with the case with no drift, we write the bound on expected compensation as

$$\mathbb{E}(C(T)) \le \max(l,1)\,(c + \sqrt{3c})\,K(\log T + 1).$$

### Thompson Sampling

Consider now the case where the principal uses Thompson Sampling as shown in Algorithm 4. Thompson Sampling starts with a (prior) distribution on each arm’s reward, and updates the distribution after the arm is pulled. At each time, the principal samples the reward of each arm according to its posterior distribution, and then selects the arm with the highest sampled reward. In this paper, we consider the Gaussian prior adopted from [Agrawal and Goyal2013], since the often-used Beta priors are usually for binary reward feedback.
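The sampling step can be sketched as follows. This is an illustrative sketch rather than Algorithm 4 itself: it assumes the Gaussian-prior form of Agrawal and Goyal, in which the sample for arm $i$ is drawn from a normal distribution centered at the arm's empirical average with variance $1/(n_i + 1)$, and it assumes that the principal plugs in the average drifted reward as the center. The function name `thompson_choice` is hypothetical.

```python
import numpy as np

def thompson_choice(drifted_avg, counts, rng):
    """Sample theta_i ~ N(avg_i, 1 / (n_i + 1)) per arm and return the argmax."""
    avg = np.asarray(drifted_avg, dtype=float)
    std = 1.0 / np.sqrt(np.asarray(counts, dtype=float) + 1.0)
    samples = rng.normal(avg, std)  # one posterior sample per arm
    return int(np.argmax(samples))

rng = np.random.default_rng(0)
# With few pulls the samples are spread out, so any arm may be chosen.
early_arm = thompson_choice([0.2, 0.9], [1, 1], rng)
# With many pulls the posterior concentrates and the best-looking arm wins.
late_arm = thompson_choice([0.2, 0.9], [1e12, 1e12], rng)
```

The randomness of the samples is what drives exploration here: a compensation is needed only in rounds where the sampled winner differs from the arm with the highest average drifted reward.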

Before we analyze the performance of Algorithm 4, we first introduce some definitions and notations that are adopted from [Agrawal and Goyal2017, Agrawal and Goyal2013].

###### Definition 1.

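For each arm $i$, we denote two thresholds $x_i$ and $y_i$ such that $\mu_i < x_i < y_i < \mu_1$. $E_i^{\mu}(t)$ denotes the event $\bar{\mu}_i(t) < x_i$ and $E_i^{\theta}(t)$ the event $\theta_i(t) \le y_i$, where $\theta_i(t)$ is the reward sampled for arm $i$ at time $t$. Also, let $p_{i,t} = \Pr(\theta_1(t) > y_i \mid \mathcal{F}_{t-1})$, where $\mathcal{F}_{t-1}$ is the history of plays until time $t-1$.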

###### Definition 2.

For two arms and , if , there exists a constant such that . Let .

We have the following result on the frequency of compensation the players receive for pulling each arm .

###### Lemma 1.

The expected frequency of compensation for pulling arm is bounded by .

###### Proof.

The proof is provided in the supplemental material. ∎

Our analysis of regret generalizes that in [Agrawal and Goyal2017, Feng, Parkes, and Xu2019] to incorporate the effect of drift caused by compensation.

###### Theorem 3.

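For the incentivized Thompson Sampling algorithm, the expected regret and compensation can be bounded as follows:

$$\mathbb{E}(R(T)) \le \sum_{i=2}^{K} \left( (4e^{11} + 21)\,P_i(T) + \frac{5}{\Delta_i^2} + Q_i(T) + \frac{\pi^2}{6} \right), \tag{9}$$

$$\mathbb{E}(C(T)) \le \frac{2\max(l,1)\,K \log T}{\underline{\Delta}^2}, \tag{10}$$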

where and .

###### Proof.

The analysis of compensation is straightforward, similar to that for the incentivized ε-Greedy algorithm. By Lemma 1, the expected compensation is bounded as in (10).

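Consider now the regret for choosing suboptimal arm $i$. We can bound $\mathbb{E}(n_i(T))$ as follows:

$$
\mathbb{E}(n_i(T)) \le \sum_{t=1}^{T} \Pr\left(I_t = i,\, E_i^{\mu}(t),\, E_i^{\theta}(t)\right) + \sum_{t=1}^{T} \Pr\left(I_t = i,\, E_i^{\mu}(t),\, \overline{E_i^{\theta}(t)}\right) + \sum_{t=1}^{T} \Pr\left(I_t = i,\, \overline{E_i^{\mu}(t)}\right).
$$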

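The first two terms can be bounded by the results of [Agrawal and Goyal2017] (see the details in the supplemental material), since their analysis is not affected by the reward drift. Specifically, by Lemma 4, the sum of the first two terms is upper bounded by $(4e^{11} + 21)\,P_i(T) + \frac{5}{\Delta_i^2}$. As for the third term, the analysis is similar to that of the UCB and ε-Greedy algorithms, where the drift $B_i(t)$ is bounded by $\frac{2l\log T}{\underline{\Delta}^2}$:

$$
\begin{aligned}
\sum_{t=1}^{T} \Pr\left(I_t = i,\, \overline{E_i^{\mu}(t)}\right) &\le \sum_{t=1}^{T} \Pr\left(\overline{E_i^{\mu}(t)}\right) = \sum_{t=1}^{T} \Pr\left(\bar{\mu}_i(t) \ge x_i\right) \\
&= \sum_{t=1}^{T} \Pr\left(\hat{\mu}_i(t) + \frac{B_i(t)}{n_i(t)} \ge x_i\right) \\
&= \sum_{t=1}^{T} \Pr\Big(\hat{\mu}_i(t) - \mu_i \ge \underbrace{\frac{\Delta_i}{3} - \frac{B_i(t)}{n_i(t)}}_{Y_i(t)}\Big) \\
&\le \sum_{t=1}^{T} \Pr\left(\hat{\mu}_i(t) - \mu_i \ge Y_i(t),\, n_i(t) \le Q_i\right) + \sum_{t=1}^{T} \Pr\left(\hat{\mu}_i(t) - \mu_i \ge Y_i(t),\, n_i(t) \ge Q_i\right) \\
&\le Q_i + \sum_{t=1}^{T} e^{-2 n_i(t) Y_i(t)^2} \le Q_i + \frac{\pi^2}{6},
\end{aligned}
$$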

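where the second last inequality is due to Hoeffding’s inequality. We then choose $Q_i$ such that, when $n_i(t) \ge Q_i$,

$$\frac{\Delta_i}{3} - \frac{B_i(t)}{n_i(t)} \ge \frac{\Delta_i}{3} - \frac{2l\log T}{\underline{\Delta}^2\, n_i(t)} \ge 0, \tag{11}$$

$$n_i(t)\,Y_i(t)^2 \ge \log T. \tag{12}$$

Since $\frac{2l\log T}{\underline{\Delta}^2\, n_i(t)}$ is non-increasing in $n_i(t)$, equation (12) requires

$$\frac{\Delta_i^2}{9}\, n_i + \frac{4 l^2 \log^2 T}{\underline{\Delta}^4}\,\frac{1}{n_i} \ge \left(1 + \frac{4\Delta_i l}{3\underline{\Delta}^2}\right)\log T.$$

The above two equations lead to

$$Q_i \ge \left\lceil \frac{9}{2\Delta_i^2} \left( \left(1 + \frac{4\Delta_i l}{3\underline{\Delta}^2}\right)\log T + \sqrt{1 + \frac{8\Delta_i l \log T}{3\underline{\Delta}^2}} \right) \right\rceil.$$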

### Discussion of Results

As can be seen from the above analysis, all three instantiations of the incentivized exploration algorithm attain $\mathcal{O}(\log T)$ upper bounds on regret and compensation under drifted reward. Our results match both the theoretical lower bound for regret in [Lai and Robbins1985] and the lower bound for compensation in [Wang and Huang2018] without reward drift. Although explicit lower bounds on the regret and compensation with drifted feedback in our setting remain unknown, we argue that these lower bounds should be no smaller than the lower bounds without reward drift, since the non-drifting environment is a special case of drifted reward feedback with drift function $f_t = 0$. On the other hand, the proposed incentive mechanism is still cost-efficient even though the payment leads to biased feedback, as the principal can reduce the regret from $\mathcal{O}(T)$ for the players’ myopic choices to $\mathcal{O}(\log T)$ by paying merely $\mathcal{O}(\log T)$ in incentive.

In terms of sensitivity to the unknown drift functions $f_t$, both incentivized ε-Greedy and Thompson Sampling attain $\mathcal{O}(l)$ regret and compensation, while the incentivized UCB attains $\mathcal{O}(l^2)$ regret and $\mathcal{O}(l)$ compensation. This difference comes from two aspects: 1) UCB is deterministic given the history, while ε-Greedy and Thompson Sampling have a randomized exploration phase which makes them less sensitive to the drift. 2) For UCB, the drift effect is bounded by the amount of compensation, which affects the frequency of compensation and in turn shapes the amount of compensation, while for ε-Greedy and Thompson Sampling, the cumulative drift can be directly bounded by the frequency of compensation. Numerical experiments reported in the next section are consistent with these analytical results.

## Numerical Examples

In this section, we carry out numerical experiments using synthetic data to complement the previous analysis of the incentivized MAB algorithms under reward drift, including UCB, ε-Greedy, and Thompson Sampling.

We generate a pool of arms with mean reward vector . In each iteration, after the player pulls an arm , reward is set to the arm’s mean reward plus a random term drawn from , i.e. . Because of the randomness in sample rewards, the greedy algorithm without exploration suffers a linear regret, e.g., we observe a regret of nearly 6000 over 20000 trials. For the reward drift under compensation, we consider a linear drifting function $f_t(x) = lx$, where $x$ is the compensation offered by the principal and $l$ is the drift coefficient. The player reveals the drifted reward feedback, i.e., the reward plus the drift $f_t(x)$.
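A minimal sketch of how one round of drifted feedback could be generated under a linear drift is given below. The function name `drifted_feedback`, the Gaussian noise, and the default `noise_std` are assumptions made for illustration; they are not the settings used in the experiments.

```python
import numpy as np

def drifted_feedback(mean_reward, compensation, drift_coef, rng, noise_std=0.1):
    """Return (true reward, reported feedback) for a single pull.

    The reward is the arm's mean plus zero-mean noise (noise_std is a
    hypothetical value), and the report adds the linear drift l * x.
    """
    reward = mean_reward + rng.normal(0.0, noise_std)
    return reward, reward + drift_coef * compensation

rng = np.random.default_rng(0)
# A compensated pull: the report exceeds the true reward by l * x.
reward, report = drifted_feedback(0.5, compensation=0.2, drift_coef=1.5, rng=rng)
# An uncompensated pull: the report is unbiased.
plain_reward, plain_report = drifted_feedback(0.5, compensation=0.0, drift_coef=1.5, rng=rng)
```

Setting `drift_coef` to zero recovers the non-drifting environment, in which the player reports the reward unchanged regardless of the payment.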

For the incentivized exploration, we first compare regret and compensation in a non-drifting environment ($l = 0$) and a drifted reward environment ($l > 0$). In a non-drifting reward environment the player always gives unbiased feedback even when offered compensation. The result is shown in Fig. 1. As expected, all three instantiations of the incentivized MAB algorithms have a sub-linear regret and compensation. Thompson Sampling outperforms the other two in both regret (which is consistent with observations from previous work [Vermorel and Mohri2005, Chapelle and Li2011]) and compensation.

In Fig. 2 we show the performance of the incentivized MAB algorithms under drifted reward with drift coefficient . We first observe that, among the three algorithms, Thompson Sampling still performs the best. While their relative performance is in the same order as in Fig. 1, the regret and compensation are worse than in the non-drifting setting, e.g., the regret of UCB increases from 350 to 800 because of the biased feedback.

To better understand the effect of drifted reward, we vary the coefficient from to and present the results in Table 1. We notice that the incentivized UCB incurs the largest regret and compensation. This is due to the fact that, as time goes on, a larger UCB and uncertainty are assigned to those arms that are less explored but may in fact have small mean rewards, and the resulting higher chance of those suboptimal arms being selected leads to larger regret and compensation. We also notice that the gap between the regret and compensation of UCB increases faster compared to the other two. This is consistent with our theoretical analysis that the regret of UCB is in the order of $l^2$ and its compensation is in the order of $l$.

We then examine the frequency of compensation, as well as the estimation error for arm 1 in terms of the relative error of the average drifted reward compared to the mean reward, and present the result in Table 2. We see that all three incentivized exploration algorithms achieve small estimation errors that are not sensitive to the drift coefficient $l$. This is expected, as the expected compensation and thus the drift per time approaches 0 as $T$ increases. However, while the incentivized ε-Greedy and Thompson Sampling have a roughly constant frequency of compensation across different values of $l$, the incentivized UCB is more sensitive to the coefficient in the frequency of compensation. The constant frequency of compensation for ε-Greedy and Thompson Sampling can be seen from the proof of Theorem 2 and Lemma 1, which show that the frequency does not depend on the drift. In contrast, as seen from the proof of Theorem 1, the frequency of compensation for UCB depends on the drift through equation (6).

## Conclusion

We propose and study multi-armed bandit algorithms with incentivized exploration under reward drift, where the player provides a biased reward feedback that is the sum of the true reward and a drift term that is non-decreasing in compensation. We analyze the regret and compensation for three instantiations of the incentivized MAB algorithm where the principal employs UCB, ε-Greedy, and Thompson Sampling, respectively. Our results show that the algorithms achieve $\mathcal{O}(\log T)$ regret and compensation, and are therefore effective in incentivizing exploration. Our current analysis is based on the assumption that the reward drift is non-decreasing in the compensation. In future work, we would like to study other assumptions about the drift function and their corresponding impact on regret and compensation. It is also important to explore whether an algorithm can leverage the drifted reward to reduce the compensation.

## References

• [Abramowitz1965] Abramowitz, M. 1965. Handbook of mathematical functions with formulas. Graphs, and Mathematical Tables.
• [Agarwal et al.2014] Agarwal, A.; Hsu, D.; Kale, S.; Langford, J.; Li, L.; and Schapire, R. 2014. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, 1638–1646.
• [Agrawal and Goyal2013] Agrawal, S., and Goyal, N. 2013. Further optimal regret bounds for thompson sampling. In Artificial intelligence and statistics, 99–107.
• [Agrawal and Goyal2017] Agrawal, S., and Goyal, N. 2017. Near-optimal regret bounds for thompson sampling. Journal of the ACM (JACM) 64(5):30.
• [Auer, Cesa-Bianchi, and Fischer2002] Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine learning 47(2-3):235–256.
• [Bouneffouf, Bouzeghoub, and Gançarski2012] Bouneffouf, D.; Bouzeghoub, A.; and Gançarski, A. L. 2012. A contextual-bandit algorithm for mobile context-aware recommender system. In International Conference on Neural Information Processing, 324–331. Springer.
• [Bubeck and Cesa-Bianchi2012] Bubeck, S., and Cesa-Bianchi, N. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning 5(1):1–122.
• [Chapelle and Li2011] Chapelle, O., and Li, L. 2011. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, 2249–2257.
• [Ehsani and Ehsani2015] Ehsani, Z., and Ehsani, M. H. 2015. Effect of quality and price on customer satisfaction and commitment in iran auto industry. International Journal of Service Science, Management and Engineering 1(5):52.
• [Feng, Parkes, and Xu2019] Feng, Z.; Parkes, D. C.; and Xu, H. 2019. The intrinsic robustness of stochastic bandits to strategic manipulation. arXiv preprint arXiv:1906.01528.
• [Frazier et al.2014] Frazier, P.; Kempe, D.; Kleinberg, J.; and Kleinberg, R. 2014. Incentivizing exploration. In Proceedings of the fifteenth ACM conference on Economics and computation, 5–22. ACM.
• [Gai, Krishnamachari, and Jain2010] Gai, Y.; Krishnamachari, B.; and Jain, R. 2010. Learning multiuser channel allocations in cognitive radio networks: A combinatorial multi-armed bandit formulation. In 2010 IEEE Symposium on New Frontiers in Dynamic Spectrum (DySPAN), 1–9. IEEE.
• [Han, Kempe, and Qiang2015] Han, L.; Kempe, D.; and Qiang, R. 2015. Incentivizing exploration with heterogeneous value of money. In International Conference on Web and Internet Economics, 370–383. Springer.
• [Hirnschall et al.2018] Hirnschall, C.; Singla, A.; Tschiatschek, S.; and Krause, A. 2018. Learning user preferences to incentivize exploration in the sharing economy. In Thirty-Second AAAI Conference on Artificial Intelligence.
• [Hoeffding1994] Hoeffding, W. 1994. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding. Springer. 409–426.
• [Immorlica et al.2018] Immorlica, N.; Mao, J.; Slivkins, A.; and Wu, Z. S. 2018. Incentivizing exploration with unbiased histories. arXiv preprint arXiv:1811.06026.
• [Immorlica et al.2019] Immorlica, N.; Mao, J.; Slivkins, A.; and Wu, Z. S. 2019. Bayesian exploration with heterogeneous agents. In The World Wide Web Conference, 751–761. ACM.
• [Jun et al.2018] Jun, K.-S.; Li, L.; Ma, Y.; and Zhu, J. 2018. Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems, 3640–3649.
• [Lai and Robbins1985] Lai, T. L., and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6(1):4–22.
• [Lee and Lin2005] Lee, G.-G., and Lin, H.-F. 2005. Customer perceptions of e-service quality in online shopping. International Journal of Retail & Distribution Management 33(2):161–176.
• [Li et al.2010] Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, 661–670. ACM.
• [Liu and Ho2018] Liu, Y., and Ho, C.-J. 2018. Incentivizing high quality user contributions: new arm generation in bandit learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
• [Lykouris, Mirrokni, and Paes Leme2018] Lykouris, T.; Mirrokni, V.; and Paes Leme, R. 2018. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 114–122. ACM.
• [Mansour, Slivkins, and Syrgkanis2015] Mansour, Y.; Slivkins, A.; and Syrgkanis, V. 2015. Bayesian incentive-compatible bandit exploration. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, 565–582. ACM.
• [Martensen, Gronholdt, and Kristensen2000] Martensen, A.; Gronholdt, L.; and Kristensen, K. 2000. The drivers of customer satisfaction and loyalty: cross-industry findings from denmark. Total Quality Management 11(4-6):544–553.
• [Radlinski, Kleinberg, and Joachims2008] Radlinski, F.; Kleinberg, R.; and Joachims, T. 2008. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th international conference on Machine learning, 784–791. ACM.
• [Razak, Nirwanto, and Triatmanto2016] Razak, I.; Nirwanto, N.; and Triatmanto, B. 2016. The impact of product quality and price on customer satisfaction with the mediator of customer value. Journal of Marketing and Consumer Research 30(1):59–68.
• [Sutton and Barto2018] Sutton, R. S., and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT press.
• [Vermorel and Mohri2005] Vermorel, J., and Mohri, M. 2005. Multi-armed bandit algorithms and empirical evaluation. In European conference on machine learning, 437–448. Springer.
• [Wang and Huang2018] Wang, S., and Huang, L. 2018. Multi-armed bandits with compensation. In NeurIPS.
• [Yue and Joachims2009] Yue, Y., and Joachims, T. 2009. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, 1201–1208. ACM.

## Supplementary Material

###### Fact 1.

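(Hoeffding’s inequality [Hoeffding1994]) Assume that $X_1, \dots, X_n$ are drawn i.i.d. from any distribution with mean $\mu$ and support $[0,1]$. Let $\bar{X} = \frac{1}{n}\sum_{j=1}^{n} X_j$; then for any $\delta \ge 0$,

$$\Pr(\bar{X} - \mu \ge \delta) \le e^{-2n\delta^2}.$$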
###### Fact 2.

[Abramowitz1965] For a Gaussian distributed random variable $X$ with mean $\mu$ and variance $\sigma^2$, for any $x$,

$$\Pr(|X - \mu| \ge x\sigma) \le \frac{1}{2} e^{-x^2/2}.$$
###### Lemma 2.

(Lemma 4 in [Jun et al.2018] and Lemma 9 in [Agarwal et al.2014]) Let and suppose satisfy With probability at least , the number of exploration of arm up to time is bounded as follows:

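$$m_i(t) \le \sum_{\tau=1}^{t} \frac{\varepsilon_\tau}{K} + \sqrt{3 \sum_{\tau=1}^{t} \frac{\varepsilon_\tau}{K} \log\frac{K}{\delta}}. \tag{13}$$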
###### Lemma 3.

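(Theorem 3 in [Auer, Cesa-Bianchi, and Fischer2002] and Theorem 3.3 in [Feng, Parkes, and Xu2019]) For the ε-Greedy algorithm with , for any arm $i$, we have

$$\Pr\left(|\hat{\mu}_i(t) - \mu_i| \ge \frac{\Delta_i}{3}\right) \le x_t e^{-x_t/5} + \frac{18}{\Delta_i^2} e^{-\Delta_i^2 \lfloor x_t \rfloor / 18},$$

$$\Pr\left(|\hat{\mu}_1(t) - \mu_1| \ge \frac{\Delta_i}{3}\right) \le x_t e^{-x_t/5} + \frac{18}{\Delta_i^2} e^{-\Delta_i^2 \lfloor x_t \rfloor / 18},$$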

where . If , the sum of this probability up to has the following upper bound:

 (c2+18Δ2i)logT+c(K+π2Δ2i)+18Δ2i.
###### Lemma 4.

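(Lemmas 2.14 and 2.16 in [Agrawal and Goyal2017]) For the Thompson Sampling algorithm, if we choose , then

$$\sum_{t=1}^{T} \Pr\left(I_t = i,\, E_i^{\mu}(t),\, E_i^{\theta}(t)\right) \le (4e^{11} + 20)\,P_i(T) + \frac{4}{\Delta_i^2},$$

$$\sum_{t=1}^{T} \Pr\left(I_t = i,\, E_i^{\mu}(t),\, \overline{E_i^{\theta}(t)}\right) \le P_i(T) + \frac{1}{\Delta_i^2},$$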

where .

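Details of inequality (5)

$$
\begin{aligned}
B_i(t) &= \sum_{\tau=1}^{t} b_\tau \mathbb{I}(E_{i,\tau} = 1) \\
&\le \sum_{\tau=1}^{t} l_\tau \sqrt{\frac{2\log \tau}{n_i(\tau)}}\, \mathbb{I}(E_{i,\tau} = 1) \\
&\le \sum_{\tau=1}^{t} l \sqrt{\frac{2\log t}{n_i(\tau)}}\, \mathbb{I}(E_{i,\tau} = 1) \\
&= \sum_{m=1}^{n_i(t)} l \sqrt{\frac{2\log t}{m}} \le 2l\sqrt{2 n_i(t) \log t},
\end{aligned}
$$

where the last inequality is due to $\sum_{m=1}^{n} \frac{1}{\sqrt{m}} \le 2\sqrt{n}$.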
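The final step of the derivation above relies on the elementary bound $\sum_{m=1}^{n} 1/\sqrt{m} \le 2\sqrt{n}$, which follows from comparing the sum with the integral of $1/\sqrt{x}$. The short check below (with a hypothetical helper name) evaluates both sides numerically for a few values of $n$.

```python
import math

def partial_sum_inv_sqrt(n):
    """Compute sum_{m=1}^{n} 1 / sqrt(m)."""
    return sum(1.0 / math.sqrt(m) for m in range(1, n + 1))

# Compare the partial sum with the closed-form upper bound 2 * sqrt(n).
checks = {n: (partial_sum_inv_sqrt(n), 2.0 * math.sqrt(n)) for n in (1, 10, 1000)}
for n, (lhs, rhs) in checks.items():
    print(f"n={n}: sum={lhs:.4f}, bound={rhs:.4f}")
```

Multiplying both sides by $l\sqrt{2\log t}$ and setting $n = n_i(t)$ gives exactly the last inequality in the chain.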

###### Proof of Lemma 1.

A compensation will be incurred for choosing arm at time only when

 i≠argmaxj¯μ