Online Second Price Auction with Semi-bandit Feedback Under the Non-Stationary Setting

Haoyu Zhao
IIIS, Tsinghua University
zhaohy16@mails.tsinghua.edu.cn
Wei Chen
Microsoft Research
weic@microsoft.com

Abstract

In this paper, we study the non-stationary online second price auction problem. We assume that the seller sells the same type of item over multiple rounds using the second price auction, and that she can set the reserve price in each round. In each round, the bidders draw their private values from a joint distribution unknown to the seller. Then, the seller announces the reserve price for this round. Next, bidders with private values higher than the announced reserve price report their values to the seller as their bids. The bidder with the highest bid above the reserve price wins the item and pays the seller the second-highest bid or the reserve price, whichever is larger. The seller wants to maximize her total revenue over the time horizon while learning the distribution of private values over time. The problem is more challenging than the standard online learning scenario because the private value distribution is non-stationary, meaning that the distribution of bidders’ private values may change over time, so we need to use the non-stationary regret to measure the performance of our algorithm. To our knowledge, this paper is the first to study the repeated auction in the non-stationary setting theoretically. Our algorithm achieves the non-stationary regret upper bound $\tilde{O}(\min\{\sqrt{\mathcal{S}T}, \bar{\mathcal{V}}^{1/3}T^{2/3}\})$, where $\mathcal{S}$ is the number of switches in the distribution and $\bar{\mathcal{V}}$ is the sum of total variation; $\mathcal{S}$ and $\bar{\mathcal{V}}$ need not be known by the algorithm. We also prove regret lower bounds of $\Omega(\sqrt{\mathcal{S}T})$ in the switching case and $\Omega(\bar{\mathcal{V}}^{1/3}T^{2/3})$ in the dynamic case, showing that our algorithm has nearly optimal non-stationary regret.

1 Introduction

As the Internet develops rapidly, online repeated auctions are becoming more and more common in daily life, such as the auctions on eBay and the online advertisement auctions on Google and Facebook. Perhaps the most studied and applied auction mechanism is the online repeated second price auction with a reserve price. In this auction format, a seller repeatedly sells the same type of item to a group of bidders. In each round , the seller selects and announces a reserve price while the bidders draw their private values on the item from a joint value distribution, which is unknown to the seller. For each bidder , if her private value is at least the reserve price , she will submit her bid to the seller; otherwise she will not submit a bid, since she cannot win if her value is less than the announced reserve price. After the seller collects the bids in this round (if any), she gives the item to the highest bidder and collects from this winner a payment equal to the second-highest bid or the reserve price, whichever is higher. If no bidder submits a bid in this round, the reserve price the seller announced was too high, and the seller receives no payment. Such repeated auctions are common in online advertising applications on search engines and social network platforms. The seller’s objective is to maximize her cumulative revenue, which is the total payment she collects from the bidders over rounds. Since the seller does not know the private value distribution of the bidders, she has to adjust the reserve price over time, hoping to learn the optimal reserve price.

The above setting falls under the multi-armed bandit framework, where reserve prices can be treated as arms and payments as rewards. As in the multi-armed bandit framework, the performance of an online auction algorithm is measured by its regret, which is the difference between the expected cumulative reward of always choosing the best reserve price and the expected cumulative reward of the algorithm. When the distribution of private values does not change over time, results from [6, 24] can be applied to solve the above problem, whereas the work in [7] considers a somewhat different setting in which the seller only receives the reward as feedback but does not see the bids (full-bandit feedback) and the private values of the bidders are i.i.d.

In real-world applications, however, the private value distribution of the bidders is likely to change over time, e.g., when important events occur that greatly influence market perception. When the private value distribution changes over time, the optimal reserve price also changes, and there is no single optimal reserve price. None of the above studies would work under this realistic setting without resetting the algorithms through human intervention. Since it is difficult to predict distribution changes, we prefer algorithms that automatically detect distribution changes, adjust their actions accordingly, and still provide nearly optimal performance over the long run.

In this paper, we design the first online learning algorithm for the online second price auction with non-stationary distributions of private values. We assume that the private values of the bidders at time follow the joint distribution , and we assume that is the best reserve price at time . We use non-stationary regret to measure the performance of the algorithm, which is the difference between the expected cumulative reward of the best reserve prices at each round and the expected cumulative reward of the algorithm. We use two quantities to measure the change in the distributions : switchings and total variation. The number of switchings is defined as , and the total variation is given as , where denotes the total variation of the distribution and is the total time horizon (Section 2).

In this paper, we provide an elimination-based algorithm that achieves a non-stationary regret of $\tilde{O}(\min\{\sqrt{\mathcal{S}T}, \bar{\mathcal{V}}^{1/3}T^{2/3}\})$ (Section 3). This regret bound shows that if the switchings or the total variation are not large (sublinear in $T$ in particular), our algorithm can still achieve sublinear non-stationary regret. We give a proof sketch in Section 4 to show the main technical ideas of the regret analysis. We further show that the non-stationary regret is lower bounded by $\Omega(\sqrt{\mathcal{S}T})$ in the switching case and by $\Omega(\bar{\mathcal{V}}^{1/3}T^{2/3})$ in the dynamic case (Section 5), which means that our algorithm achieves nearly optimal regret in the non-stationary environment. Moreover, our algorithm is parameter-free: it does not need to know the parameters $\mathcal{S}$ and $\bar{\mathcal{V}}$ in advance, and it adapts to them automatically. Our main method is to reduce the non-stationary online auction problem to a variant of the non-stationary multi-armed bandit problem called the non-stationary one-sided full information bandit, and to solve this problem with some novel techniques.

The proof sketch covering all essential ideas is included in the main text, and the detailed technical proofs are included in the appendix.

1.1 Related Work

Multi-armed bandit: The multi-armed bandit (MAB) problem was first introduced in [19]. MAB problems can be classified into stochastic bandits and adversarial bandits. In the stochastic case, the reward is drawn from an unknown distribution, and in the adversarial case, the reward is determined by an adversary. Our model is a generalization of the stochastic case, as discussed below. The classical MAB algorithms include UCB [1] and Thompson sampling [21] for the stochastic case and EXP3 [2] for the adversarial case. We refer to [5] for comprehensive coverage of MAB problems.

Non-stationary MAB: Non-stationary MAB can be viewed as a generalization of the stochastic MAB, where the reward distributions change over time. Non-stationary MAB problems are analyzed mainly under two settings. The first considers the switching case, where there are number of switchings in the distribution, and derives switching regret in terms of and [10, 22, 14]. The second considers the dynamic case, where the distribution changes continuously but the variation is bounded, and presents dynamic regret in terms of and [11, 4]. However, most of the studies need to use or as algorithm parameters, which may not be easy to obtain in practice. Designing parameter-free algorithms has been studied in the full-information case [15, 12, 23]. There are also several attempts to design parameter-free algorithms in the bandit case [13, 16, 9], but the regret bound is not optimal. A recent and innovative study [3] solves the problem in the bandit case and achieves optimal regret. Then, [8] significantly generalizes the previous work by extending it to the non-stationary contextual bandit and also achieves optimal regret. Our study is the first on the non-stationary one-sided full information bandit and its application to the online auction setting.

Online auction: For the online case where the private value distribution is unknown, [7, 6, 24] consider different forms of the online second price auction. These studies assume that bidders truthfully follow their private value distributions, as we do in this work. [17] further considers the online second price auction with strategic bidders, whose bids may not be truthful. [20] studies the online second price auction with bidder-specific reserve prices. However, that work needs to use all the bidding information, and it also assumes that the bidders are truthful. For the offline case where the private value distribution is known, the classical work by Myerson [18] provides an optimal auction algorithm when the private value distributions of all bidders are independent and known, and the seller can set different reserve prices for different bidders.

2 Preliminary and Model

In this section, we introduce the non-stationary online second price auction with semi-bandit feedback. We also introduce the non-stationary regret used to measure the performance of the algorithm. As mentioned before, we reduce the non-stationary online second price auction problem to a non-stationary bandit problem, which we call the non-stationary one-sided full information bandit. We also give the formal definition of the bandit problem and the performance measure for the corresponding bandit problem.

Definition 1 (Non-stationary Online Second Price Auction).

There are a fixed number of bidders and a seller, and the seller sells the same item in each round . In each round , the seller sells the item through second price auction with reserve price , where is chosen by the seller at the beginning of each round and is announced to the bidders before the bidders give their private values. The values of the bidders follow a distribution with support in round , and the environment draws a vector of realized values for the bidders . For each bidder , if her value , she will report her value to the seller, otherwise she will not report her value and will not attend the auction in this round (Footnote 1). The seller then allocates the item using the second price auction with reserve price . We assume that the distributions are generated obliviously, i.e. are generated before our algorithm starts, or equivalently, are generated independently of the randomness of for all and the randomness of the algorithm.

Footnote 1: We fully understand that in the repeated online second price auction, the bidder may not be truthful since she may participate in the auction in several rounds. However, this is out of the scope of the current paper. We will assume that the bidders are truthful in each round, and it is a good approximation in some cases.

The performance of a reserve price in the auction is measured by the revenue: , where denotes the amount bidder needs to pay when the reserve price is and the private value vector of the bidders is . In particular, if bidder has the highest bid among all bidders and her bid is also larger than the reserve price , then pays the maximum value among all other bids and the reserve price and gets the auction item; otherwise the bidder pays nothing and does not get the item. Note that if we fix a reserve price , whether bidders with values less than report their values or not does not affect the revenue. Given the revenue of a reserve price, we have the following definition for the non-stationary regret in the online second price auction.
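To make the revenue rule concrete, the following minimal Python sketch computes the seller's revenue for a single round. The function name `second_price_revenue` and the tie convention (a bid equal to the reserve price counts as a valid bid, in line with Definition 1) are assumptions made for illustration and are not taken from the paper.

```python
from typing import Sequence


def second_price_revenue(reserve: float, values: Sequence[float]) -> float:
    """Seller revenue in one round of a second price auction with a reserve.

    Assumption: bidders are truthful, and a bidder submits a bid exactly
    when her value is at least the reserve price.
    """
    # Bids actually submitted, sorted from highest to lowest.
    bids = sorted((v for v in values if v >= reserve), reverse=True)
    if not bids:
        # Reserve too high: nobody bids and the seller earns nothing.
        return 0.0
    if len(bids) == 1:
        # Only the winner cleared the reserve, so she pays the reserve.
        return reserve
    # Every submitted bid is >= reserve, so the second-highest bid is
    # already the larger of (second-highest bid, reserve).
    return bids[1]
```

For values (0.9, 0.7, 0.2), a reserve of 0.5 yields revenue 0.7, a reserve of 0.8 yields 0.8, and a reserve of 0.95 yields 0. Dropping the value 0.2 changes none of these, which illustrates the remark that bidders below the reserve do not affect the revenue.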

Definition 2 (Non-stationary Regret for Online Second Price Auction).

The non-stationary regret of algorithm for the online second price auction is defined as follows:

where and is the reserve price algorithm chooses in round , and the expectation is taken over all the randomness, including the randomness of the algorithm itself and the randomness of leading to the randomness in the selection of .

We now introduce the measurements of non-stationarity. In general, there are two measurements of the change of the environment: the first is the number of switchings , and the second is the total variation . For any interval , we define the number of switchings on to be . As for the total variation, the formal definition is given as , where denotes the total variation of the distribution. For convenience, we use and to denote and .

Next, we briefly discuss how to reduce the online second price auction to the one-sided full-information bandit: 1) We can discretize the reserve price into . Because the revenue of the second price auction is one-sided Lipschitz, when is large enough, the revenue of the best discretized reserve price does not differ much from that of the best reserve price on the whole domain. 2) The distribution of the values induces a distribution of rewards on . More specifically, any private value vector induces a reward vector for the discretized reserve prices , and the reward vector follows a distribution . 3) At time , because all bidders with values at least will report their values, we can compute the rewards for all given the specific private values larger than or equal to . This gives us the following definition of the non-stationary one-sided full-information bandit. The formal reduction from the online auction to the bandit problem is given in the proof of Theorem 3.
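The one-sided feedback described in step 3) can be sketched in Python as follows. This is only an illustration under stated assumptions: values lie in [0, 1], the grid is the uniform one with points k/K, and the helper names `revenue`, `uniform_grid` and `one_sided_rewards` are hypothetical. The paper's exact discretization is not reproduced here.

```python
from typing import Dict, List, Sequence


def revenue(reserve: float, bids: Sequence[float]) -> float:
    """Second price revenue; bids below the reserve are ignored."""
    valid = sorted((b for b in bids if b >= reserve), reverse=True)
    if not valid:
        return 0.0
    return valid[1] if len(valid) > 1 else reserve


def uniform_grid(num_arms: int) -> List[float]:
    """Hypothetical discretization: reserves 1/K, 2/K, ..., 1, increasing."""
    return [k / num_arms for k in range(1, num_arms + 1)]


def one_sided_rewards(grid: Sequence[float], played: int,
                      values: Sequence[float]) -> Dict[int, float]:
    """Rewards the seller can compute after playing arm `played`.

    The seller only sees values >= grid[played]; these are enough to
    evaluate every arm whose reserve is at least as high.
    """
    observed = [v for v in values if v >= grid[played]]
    return {j: revenue(grid[j], observed) for j in range(played, len(grid))}
```

With four arms and values (0.9, 0.6, 0.1), playing the reserve 0.5 hides the value 0.1, yet the rewards of the arms with reserves 0.5, 0.75 and 1.0 computed from the observed bids coincide with the rewards computed from the full value vector. The reward of the arm with reserve 0.25 is not observable.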

Definition 3 (Non-stationary One-sided Full Information Bandit).

There is a set of arms , and for each arm at time , it corresponds to an unknown distribution with support , where is the marginal distribution of with support . In each round , the environment draws a reward vector , where is drawn from distribution . The player then chooses an arm to play, gains the reward and observes the reward of arms , i.e. observes . We assume that the distribution at each round is generated obliviously, i.e. are generated before the algorithm starts.

We use to denote the mean of , i.e. . We also use to denote the mean of the best arm at time . Then we have the following definition of the non-stationary regret.

Definition 4 (Non-stationary Regret).

We use the following to denote the non-stationary regret of algorithm .
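As an illustration of this notion, the sketch below computes the regret of one realized sequence of plays against the best arm of each round, following the verbal description given in the introduction. The input layout (`means[t][i]` for the mean of arm `i` in round `t`) and the function name are assumptions. Averaging the result over the randomness of the algorithm would give the expected quantity.

```python
from typing import Sequence


def non_stationary_regret(means: Sequence[Sequence[float]],
                          plays: Sequence[int]) -> float:
    """Regret of a realized play sequence against the per-round best arm.

    means[t][i] is the mean reward of arm i in round t (hypothetical input
    layout); plays[t] is the index of the arm chosen in round t.
    """
    if len(means) != len(plays):
        raise ValueError("means and plays must cover the same rounds")
    # Each round is compared with that round's own best arm, not with a
    # single fixed arm as in the stationary regret.
    return sum(max(row) - row[arm] for row, arm in zip(means, plays))
```

For example, with means ((0.25, 0.75), (0.5, 0.25)), always playing the second arm is optimal in the first round but loses 0.25 in the second round, after the best arm has switched.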

For convenience, we will simply use regret to denote the non-stationary regret. We now introduce the measurements for the non-stationarity for the one-sided bandit case. Similar to the auction case, we have switchings and variation . For any interval , we define the number of switchings on to be . As for the sum of variation, the formal definition is given as , which sums up the max difference of mean in each round. For convenience, we use and to denote and . Note that the number of switchings in the bandit case is the same as that of the auction case, so we reuse the notations, and the variation definition in the bandit case uses the sum of the maximal differences in the consecutive mean vectors instead of the sum of total variations in the auction case, so we use notation instead of for differentiation. The variation defined for the bandit case is consistent with the variation defined in other non-stationary bandit papers.
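The two measurements for the bandit case can be sketched as below. This is a hedged illustration: it uses a change in the mean vector as a stand-in for a change in the distribution, and it counts change points only, so whether the switching count also includes the initial segment is a convention this sketch does not fix. The function name and input layout are hypothetical.

```python
from typing import Sequence, Tuple


def switches_and_variation(
        means: Sequence[Sequence[float]]) -> Tuple[int, float]:
    """Count change points and sum the per-round maximal mean shifts.

    Assumption: a change in the mean vector is used as a stand-in for a
    change in the distribution, so switches that keep every mean fixed
    are not counted here.
    """
    change_points = 0
    variation = 0.0
    for prev, cur in zip(means, means[1:]):
        # Largest absolute change in any arm's mean between two rounds.
        shift = max(abs(c - p) for p, c in zip(prev, cur))
        if shift > 0:
            change_points += 1
        variation += shift
    return change_points, variation
```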

We will use switching regret to denote the non-stationary regret in the switching case, and dynamic regret to denote the non-stationary regret in the dynamic case.

3 Algorithm

In this section, we present our algorithm for the non-stationary one-sided full-information bandit problem and its regret bounds. The algorithm for the online auction problem can be easily derived from it, as outlined in Section 2, and we present its regret bound in Theorem 3.

1: Total time horizon , total number of arms . Parameters .
2: . is the starting time of epoch .
3: .
4: Let denote the empirical mean of arm in the time interval .
5: while  do
6:     ▷ Step 1. Randomly select the exploration phases
7:     if  then
8:         .
9:     end if
10:     Let for every , and . We define the notation for convenience.
11:     For every , independently add pair into with probability .
12:     (Let and be the values of and respectively at this point, to be used in the proof)
13:     ▷ Step 2. Choose an action to play
14:     if  such that  then    ▷ Choose the arm based on if is in an exploration phase.
15:         .
16:         Play arm and observe the reward for all .
17:     else
18:         Play arm and observe the reward for all .
19:     end if
20:     ▷ Step 3. Perform the elimination process
21:     while  such that  do
22:         Let be a vector with length .
23:         Let be the arm such that is maximized.
24:         , , and for all .
25:         .
26:     end while
27:     ▷ Step 4. Perform the non-stationarity check
28:     if  such that and  then
29:         .
30:     end if
31:     .
32: end while
Algorithm 1

Our algorithm borrows ideas from [24] and [3]. [24] introduces an elimination-based algorithm for the one-sided full-information bandit, and [3] presents an elimination-based algorithm that adaptively follows the best arm in the switching case without knowing the number of switches . Our algorithm is a non-trivial combination of these ideas, and our innovation relies heavily on the feedback structure of the one-sided bandit problem. The algorithm is given in Algorithm 1.

Generally speaking, our algorithm maintains a set to record the exploration phases for the adaptive detection of the dynamic changes in the distribution, and a set to record the information when an arm is eliminated. If we were dealing with the stationary case where the distribution of arms does not change, then after observing the arms enough times, we could eliminate an empirically sub-optimal arm, and with high probability, the eliminated arm would indeed be sub-optimal. However, in the non-stationary case, the optimal arm is changing, and thus we need to properly add exploration phases to observe the eliminated arms with some probability. When we detect from these exploration phases that the distribution has indeed changed, the algorithm starts a new epoch and resets and to empty sets (Footnote 2).

Footnote 2: We mark the actual values of and in each round as and in the algorithm, to be used in our analysis.

Set records the information at the time when an arm is eliminated. Each element is a tuple, where records the empirical gap, which is the difference of the empirical means of the empirically optimal arm and that of the eliminated arm ; records the index of the eliminated arm; and for records the empirical mean of arm when the arm is eliminated (). An exploration phase is a pair where and interval . Each such phase is stored independently into with a probability (in line 11 of Step 1). The purpose of these exploration phases is to re-examine arms that have been eliminated to detect possible changes in the distribution, with indicating the range of rounds for an exploration. Intuitively, if there is no change in the distribution, such an exploration would pay an extra regret. To control this extra regret, we use to indicate the per-round regret that such an exploration could tolerate, and the length of is controlled to be to bound the total regret.

At each round, our algorithm has the following four steps. In Step 1, we randomly add exploration phases into the set . We set to be the probability of adding an exploration phase into in epoch at time . This probability is chosen carefully: it is neither so small that non-stationarity goes undetected nor so large that it induces large regret.

In Step 2, we choose the action to play. If the current round is not in any exploration phase, then we will play the arm that is not eliminated and has the smallest index. If is in an exploration phase , we will find the maximum value . We will play arm and observe the reward for all . This arm selection in the exploration phase guarantees that the arm we play would induce the regret of at most per round if the distribution has not changed.

In Step 3, we perform arm elimination when the proper condition holds. In particular, when we find that an arm is empirically sub-optimal among the remaining arms, we eliminate this arm in this epoch. When an arm is eliminated, the algorithm adds a tuple into the set to store the information at this point, where stores the empirical gap with the best arm, stores the index of the eliminated arm, and for stores the empirical mean of arm .

In Step 4, we apply the non-stationarity check. At the end of an exploration phase, we check whether there is a tuple and an arm , such that the gap between the current empirical mean of arm during the exploration phase and the stored empirical mean is . If so, the empirical mean has deviated significantly, indicating a change in distribution, and thus we start a new epoch to redo the entire process from scratch.

The algorithm incorporates ideas from [3, 24], and its main novelty is related to the maintenance and use of set in arm selection (Step 2), arm elimination (Step 3) and stationarity check (Step 4), which make use of the feedback observation to balance the exploration and exploitation.

Now, we use a simple example to illustrate how the algorithm detects the distribution changes in the switching case. Suppose that we have three arms. At first, arm 1 always outputs , arm 2 always outputs , and arm 3 always outputs . Then arm 1 will be eliminated first, and the tuple will be stored in , where is the empirical gap between the means of arm 1 and the empirically best arm 3. Next, arm 2 will be eliminated, and the algorithm will store in , where means that the value at that position has no meaning. At this point, the algorithm may have randomly selected many exploration phases, but they all fail to start a new epoch since the distribution has not changed and no non-stationarity is detected. Then suppose that at round , the distribution changes, and arm 1 will output from now on and thus becomes the best arm. Suppose that after round , we randomly select an exploration phase with , and in this exploration phase, we will play arm 2 but not arm 1 (since ), and thus we will still not detect the non-stationarity of arm 1. However, when we randomly select an exploration phase with in Step 1 (perhaps in a later round), we will play arm 1 according to the key selection criterion for arm exploration in line 16 of Step 2. This allows us to observe the distribution change on arm 1 in the exploration phase and then start a new epoch, which restarts the algorithm from scratch by playing arm 1 again.

The following two theorems summarize the regret bounds of Algorithm 1 in the switching case and the dynamic case for the one-sided full-information bandit.

Theorem 1 (Switching Regret).

Suppose that we choose parameters , then the algorithm has regret in the switching case bounded by , where hides the polynomial factor of and .

Theorem 2 (Dynamic Regret).

Suppose that we choose parameters , and suppose that the variation is not too small (). Then the algorithm has regret in the dynamic case bounded by , where hides the polynomial factor of and .

As outlined in Section 2, Algorithm 1 can be easily adapted to solve the online second price auction problem by discretizing the reserve price. The following theorem provides the regret bound of Algorithm 1 when solving the online second price auction problem.

Theorem 3 (Regret for Online Second Price Auction).

For every , let , and we only set reserve price . Each time we set reserve price and get all the private value , we compute the reward for all and receive the reward . Then we apply our algorithm and set appropriately, and the regret is bounded by

where we assume that is not too small.

4 Proof Sketch for the Regret Analysis

In this section, we will give a proof sketch of the regret analysis in the switching case (Theorem 1) and the dynamic case (Theorem 2). In general, we first give a proof in the switching case, and then we reduce the dynamic case into the switching case. The proof strategy in the dynamic case is nearly the same as that in the switching case, and we will briefly discuss how to do the reduction.

4.1 Proof Sketch of Theorem 1

Generally speaking, our proof strategy for Theorem 1 is to define several events (Definitions 5 to 8), and decompose the regret by these events. We show that each term in the decomposition is bounded by .

Definition 5 (Sampling is nice).

We say that the sampling is nice if for every interval and every arm , we have

where is the length of interval . We use to denote this event. We use to denote the event when the above inequality holds for all .

Definition 6.

We use to denote the event such that is in an exploration phase, i.e. .

Definition 7 (Records are consistent).

We say that the records are consistent at time if for every , for every arm , we have . We use to denote this event.

We have the following definition when doesn’t happen.

Definition 8 (Playing bad arm).

Let denote the smallest index of an arm such that , and there exists , i.e.

We use to denote the event .

Generally speaking, is the smallest index of an eliminated arm such that the recorded mean when is eliminated induces the event .

Based on the above definitions, we decompose the regret into four mutually exclusive events and bound the regret for each event in the order of . These four event cases are listed below, where the first three are when the sampling is nice, and the last case is when sampling is not nice.

Case 1: . This means that the sampling is nice, the records are consistent at time , and round is not in an exploration phase. The regret should be bounded in this case, since when happens, the distribution does not change much and it is also not in an exploration phase (Lemma 1).

Case 2: or . The sampling is still nice. When is true, round is in an exploration phase and the records are consistent, meaning that the current arm means have not deviated much from the records. In this case, as discussed before, the definition of the exploration phase and the setting in line 16 guarantee that the arm explored would not have a large regret. When is true, we first claim that implies . This is because if the records are not consistent (i.e. ) but (i.e. ), it means that the arm played in round has a smaller index than , but is an eliminated arm according to Definition 8, and thus arm must be played due to exploration. Next, since , the arm played is not a bad arm with a large gap, so its regret is still bounded (Lemma 2).

Case 3: . The sampling is nice, the records are not consistent, and in round we play a bad arm with a large gap between the current mean and the recorded mean. Although the regret in this case cannot be bounded by where , the key observation is that, due to the random selection of the exploration phase, we will observe the non-stationarity (since does not happen and happens) with some probability, and the expected regret can be bounded (Lemma 3).

Case 4: . The sampling is not nice, which is a low probability event, and its regret can be easily bounded by a constant (Lemma 4).

Lemma 1.

The proof of Lemma 1 is similar to the analysis in [24] and can be viewed as a generalization of the original proof. The key difference is that in the proof of Lemma 1, we divide the interval into

and we sum the regret in each interval first, and get the regret in each interval to be . Then we sum them up and show that the regret is in the order of .

Lemma 2.

This lemma bounds the regret when does not happen and is in an exploration phase. In this case, we show that the number of different lengths of exploration phases can be bounded by . Then, we show that the regret induced by the specific length exploration phase is bounded by . Finally, we combine the previous argument and apply the union bound to show that the total regret considered is bounded by .

Lemma 3.

This lemma bounds the regret when happens, and it is the most technical one. The proof strategy is similar to [3], which partitions the total time horizon into several intervals with identical distribution, and applies a two-dimensional induction from back to front. As discussed before, the regret in this case in each round cannot be bounded by where . However, due to the random selection of the exploration phases, with some probability, we will observe the non-stationarity (since does not happen and happens), and the expected regret can be bounded.

Finally, by a simple application of the high probability result on , we can get the following lemma.

Lemma 4.

.

Combining these lemmas together, we complete the proof of Theorem 1.

4.2 Proof Sketch of Theorem 2

In this part, we briefly introduce how to reduce the dynamic case to the switching case. The proof is an imitation of the proof strategy of Theorem 1. Although the means can be changing at every time , we can approximately divide them into several sub-intervals such that in each interval, the change of means is not large. Recall that for interval , and we use . We have the following lemma,

Lemma 5 (Interval Partition [8]).

There is a way to partition the interval into such that , and for any , and .

Suppose that we have a partition shown in the above lemma. We construct a new instance such that for all and all , i.e. we take the average mean of each interval and make them all the same.

Generally speaking, the dynamic regret can be bounded by the sum of two parts: the switching regret of the new instance and the difference between the switching regret of the new instance and the dynamic regret. As for the first part, since , we know that the switching regret can be bounded by . As for the difference between the two regrets, since for , summing over all , we know that the difference is bounded by . Combining the two parts completes the proof.
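The construction of the new instance, which replaces the means on each interval of the partition by their average, can be sketched as follows. The sketch assumes that the partition is already given as half-open index ranges. It does not construct the partition of Lemma 5, and the function name and input layout are hypothetical.

```python
from typing import List, Sequence, Tuple


def averaged_instance(means: Sequence[Sequence[float]],
                      intervals: Sequence[Tuple[int, int]]) -> List[List[float]]:
    """Replace each arm's means on every interval by their average.

    intervals are half-open index ranges (start, end) that partition
    range(len(means)); this layout is an assumption of the sketch.
    """
    num_arms = len(means[0])
    new_means: List[List[float]] = []
    for start, end in intervals:
        length = end - start
        avg = [sum(means[t][i] for t in range(start, end)) / length
               for i in range(num_arms)]
        # The new instance is constant on the interval, so it can only
        # switch at interval boundaries.
        new_means.extend(list(avg) for _ in range(length))
    return new_means
```

Because the averaged instance is constant within each interval, its number of switchings is at most the number of intervals, which is what allows the switching-case analysis to be reused.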

4.3 Proof Sketch of Theorem 3

In the proof of Theorem 3, we first show that the online second price auction has a one-sided Lipschitz property, and thus discretizing the reserve price will not lead to a large regret. Next, we briefly discuss why discretizing the reserve price leads to a one-sided full information bandit instance, and then it is easy to show that the regret can be bounded by in the switching case. To bound the regret in the dynamic case, we only have to set up the connection between the total variation in the online auction and the variation in the bandit problem. The bridge between these two quantities can be set up easily from the definition and properties of total variation.
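The one-sided Lipschitz property can be checked numerically. The sketch below assumes the property takes the following standard form: lowering the reserve price from r to r' ≤ r loses at most r − r' in revenue, whereas raising it slightly can lose the whole payment. The exact statement used in the appendix is not reproduced here, and all function names are hypothetical.

```python
import random
from typing import Sequence


def revenue(reserve: float, values: Sequence[float]) -> float:
    """Second price revenue with truthful bidders and a reserve price."""
    bids = sorted((v for v in values if v >= reserve), reverse=True)
    if not bids:
        return 0.0
    return bids[1] if len(bids) > 1 else reserve


def lowering_is_safe(r_high: float, r_low: float,
                     values: Sequence[float], tol: float = 1e-12) -> bool:
    """Check revenue(r_low) >= revenue(r_high) - (r_high - r_low)."""
    loss_bound = r_high - r_low
    return revenue(r_low, values) >= revenue(r_high, values) - loss_bound - tol


def check_random_instances(trials: int = 1000, seed: int = 0) -> bool:
    """Test the inequality on random value vectors and reserve pairs."""
    rng = random.Random(seed)
    for _ in range(trials):
        values = [rng.random() for _ in range(rng.randint(1, 5))]
        r_low, r_high = sorted((rng.random(), rng.random()))
        if not lowering_is_safe(r_high, r_low, values):
            return False
    return True
```

The asymmetry is visible with a single bidder of value 0.9: a reserve of 0.9 earns 0.9, while the slightly higher reserve 0.91 earns nothing. This is why the discretized grid only needs to contain a reserve slightly below the optimal one.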

5 Lower Bounds for Online Second Price Auction in Non-stationary Environment

In this section, we show that for the online second price auction problem, the regret upper bounds achieved by our algorithm are almost tight, by giving a regret lower bound of $\Omega(\sqrt{\mathcal{S}T})$ for the switching case and a lower bound of $\Omega(\bar{\mathcal{V}}^{1/3}T^{2/3})$ for the dynamic case.

Theorem 4.

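For any algorithm and any $\mathcal{S}$, there exists a set of distributions of bids, with $\mathcal{S}$ the number of switchings of the distribution, such that the non-stationary regret is at least $\Omega(\sqrt{\mathcal{S}T})$. Moreover, for any algorithm and any $\bar{\mathcal{V}}$, there exists a set of distributions of bids with total variation $\bar{\mathcal{V}}$ such that the regret is at least $\Omega(\bar{\mathcal{V}}^{1/3}T^{2/3})$.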

Our theorem is based on the following result in [7].

Proposition 1 (Theorem 2 of [7]).

For any deterministic algorithm, there exists a distribution of bids with two bidders such that the stationary regret is at least $\Omega(\sqrt{T})$.

The above proposition shows that in the full-information case, any deterministic algorithm will have stationary regret lower bounded by for the online second price auction problem. Generally speaking, we divide the time interval into segments, each with length . We construct an instance such that the regret in each segment is , and the total non-stationary regret sums up to be .

As for the regret in the dynamic case, the proof is very similar. We also divide the time horizon into segments, and the total variation between the distribution of adjacent segments is bounded by .
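The segment argument can be made concrete with a short calculation. This is a sketch under assumptions, not the formal proof: we write $T$ for the horizon, $S$ for the number of segments in the switching case, $V$ for the variation budget, $m$ for the number of segments in the dynamic case, and $c$ for an unspecified constant, and we assume that the stationary lower bound of Proposition 1 scales as the square root of the segment length. The per-segment change $\varepsilon = \sqrt{m/T}$ in the dynamic case is an assumed scaling that is typical of such constructions, not a quantity taken from the proof.

```latex
% Switching case: S segments of length T/S, each contributing c*sqrt(T/S).
\sum_{j=1}^{S} c\sqrt{\frac{T}{S}} \;=\; c\,S\sqrt{\frac{T}{S}} \;=\; c\sqrt{ST}.

% Dynamic case: m segments of length T/m, adjacent segments differing by
% at most \varepsilon = \sqrt{m/T} (assumed scaling).
m\,\varepsilon \;=\; m\sqrt{\frac{m}{T}} \;\le\; V
\quad\Longrightarrow\quad
m \;\le\; V^{2/3}\,T^{1/3},
\qquad
c\sqrt{mT} \;=\; c\,V^{1/3}\,T^{2/3}
\quad\text{at } m = V^{2/3}T^{1/3}.
```

Under these assumptions, the two totals match the orders of the switching and dynamic lower bounds stated in the abstract.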

6 Conclusion and Further Work

We study the non-stationary online second price auction with the “semi-bandit” feedback structure in this paper. We reduce it to the non-stationary one-sided full-information bandit and show an algorithm that solves the problem. Our algorithm is parameter-free, which means that we do not have to know the number of switchings or the variation in advance. Our algorithm is also nearly optimal in both cases. There are also some future directions to explore:

First, in this work, we consider the online auction with “semi-bandit” feedback, where all the bidders with private values greater than or equal to the reserve price will report their private values. We can also consider the “full-bandit” feedback, where the seller only gets the reward in each round but does not observe the private values, and design parameter-free algorithms to solve it in the non-stationary case. Second, in this work we use the second price auction and assume that the bidders are truthful. We can also study how to generalize this non-stationary result to the case of strategic bidders or to other auction formats such as the generalized second price auction.

References

  • [1] P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2-3), pp. 235–256.
  • [2] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire (2002) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32 (1), pp. 48–77.
  • [3] P. Auer, P. Gajane, and R. Ortner (2019) Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, pp. 138–158.
  • [4] O. Besbes, Y. Gur, and A. J. Zeevi (2015) Non-stationary stochastic optimization. Operations Research 63 (5), pp. 1227–1244.
  • [5] S. Bubeck and N. Cesa-Bianchi (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5 (1), pp. 1–122.
  • [6] N. Cesa-Bianchi, P. Gaillard, C. Gentile, and S. Gerchinovitz (2017) Algorithmic chaining and the role of partial feedback in online nonparametric learning. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, pp. 465–481.
  • [7] N. Cesa-Bianchi, C. Gentile, and Y. Mansour (2015) Regret minimization for reserve prices in second-price auctions. IEEE Transactions on Information Theory 61 (1), pp. 549–564.
  • [8] Y. Chen, C. Lee, H. Luo, and C. Wei (2019) A new algorithm for non-stationary contextual bandits: efficient, optimal and parameter-free. In Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, pp. 696–726.
  • [9] W. C. Cheung, D. Simchi-Levi, and R. Zhu (2019) Learning to optimize under non-stationarity. In Proceedings of Machine Learning Research, K. Chaudhuri and M. Sugiyama (Eds.), Vol. 89, pp. 1079–1087.
  • [10] A. Garivier and E. Moulines (2011) On upper-confidence bound policies for switching bandit problems. In Algorithmic Learning Theory - 22nd International Conference, ALT 2011, Espoo, Finland, October 5-7, 2011. Proceedings, pp. 174–188.
  • [11] Y. Gur, A. J. Zeevi, and O. Besbes (2014) Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 199–207.
  • [12] K. Jun, F. Orabona, S. Wright, and R. Willett (2017) Online learning for changing environments using coin betting. CoRR abs/1711.02545.
  • [13] Z. S. Karnin and O. Anava (2016) Multi-armed bandits: competing with optimal sequences. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 199–207.
  • [14] F. Liu, J. Lee, and N. B. Shroff (2018) A change-detection based framework for piecewise-stationary multi-armed bandit problem. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pp. 3651–3658.
  • [15] H. Luo and R. E. Schapire (2015) Achieving all with no parameters: adanormalhedge. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, pp. 1286–1304.
  • [16] H. Luo, C. Wei, A. Agarwal, and J. Langford (2018) Efficient contextual bandits in non-stationary worlds. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pp. 1739–1776.
  • [17] M. Mohri and A. M. Medina (2015) Revenue optimization against strategic buyers. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 2530–2538.
  • [18] R. B. Myerson (1981) Optimal auction design. Mathematics of Operations Research 6 (1).
  • [19] H. Robbins (1952) Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc. 58 (5), pp. 527–535.
  • [20] T. Roughgarden and J. R. Wang (2016) Minimizing regret with multiple reserves. In Proceedings of the 2016 ACM Conference on Economics and Computation, EC ’16, Maastricht, The Netherlands, July 24-28, 2016, pp. 601–616.
  • [21] W. R. Thompson (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294.
  • [22] C. Wei, Y. Hong, and C. Lu (2016) Tracking the best expert in non-stationary stochastic environments. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 3972–3980.
  • [23] L. Zhang, T. Yang, R. Jin, and Z. Zhou (2018) Dynamic regret of strongly adaptive methods. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pp. 5877–5886.
  • [24] H. Zhao and W. Chen (2019) Stochastic one-sided full-information bandit. arXiv preprint arXiv:1906.08656.

Appendix A Proofs of Theorems 1 and 2

In this section, we give the detailed proofs of Theorems 1 and 2. We first present some key observations and lemmas, which are helpful in both the switching and the dynamic case. Then we give the detailed proof for the switching case (Theorem 1). Next, we give the proof for the dynamic case (Theorem 2). Since that proof is very similar to the proof of Theorem 1, we do not repeat it in full and only point out the differences.

First, we need the following definition and probability bound. The definition (Definition 5) is straightforward, and the probability bound follows directly from Hoeffding’s inequality and the union bound.

See Definition 5.

Lemma 6.

We have the following probability bound,

Proof.

By Hoeffding’s inequality, we have for any interval and any arm ,

Since there are at most possible intervals, the union bound gives

Then

follows directly since . ∎
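The pattern of this proof, a Hoeffding bound for one interval and one arm followed by a union bound over all intervals and arms, can be sketched numerically. The code below is a minimal illustration under our own assumptions, not the exact constants of Lemma 6: rewards are assumed to lie in [0, 1], the function names are hypothetical, and the number of intervals is counted as T(T+1)/2, the number of pairs of start and end rounds.

```python
import math


def hoeffding_radius(n, delta):
    """Radius r such that the empirical mean of n samples in [0, 1] deviates
    from its expectation by at least r with probability at most delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def union_bound_radius(n, delta, num_events):
    """Radius that holds for num_events events at once: each event is given
    failure probability delta / num_events, so the total is at most delta."""
    return hoeffding_radius(n, delta / num_events)


# Hypothetical sizes: T rounds and K arms.
T, K = 1000, 10
num_events = K * T * (T + 1) // 2  # one event per (interval, arm) pair
single = hoeffding_radius(100, 0.05)
uniform = union_bound_radius(100, 0.05, num_events)
```

Because the radius depends on the number of events only through a logarithm, making the bound hold for all intervals and all arms at once widens it by a logarithmic factor in T and K.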

The next observation is not hard to prove, but it is one of the key observations of the proof. The observation highly depends on the feedback structure of the one-sided full-information bandit problem.

Lemma 7.

Suppose that an exploration phase ends at time where . For any arm such that , it is observed for times during the exploration phase .

Proof.

Suppose that time is in epoch , then we know that , since then we only add into the exploration set for .

Because the algorithm adds into (the original set is and the new set is ), we know that there exists such that . This is due to the fact that from the definition of our algorithm, we have and .

Next, we can observe that for any , the number remains the same, because we eliminate arms from small index to large index, and the arms eliminated in the interval must have index larger than .

Let , and we only have to show that in every time , we play the arm , which is also true since from the definition of the algorithm, we have and then

We also have the following definition to describe a time that is in an exploration phase.

See Definition 6.

A.1 Switching Regret

As for the parameters in the algorithm, we choose in the switching regret analysis.

The next lemma shows that, with high probability, the number of epochs in our algorithm is at most (the number of switchings).

Lemma 8.

When happens, we have at time , , i.e., the number of epochs will not exceed the number of switchings.

Proof.

We partition the time interval into intervals with the same distribution. We set

where , for all , and for all in the same interval. We only have to show that, if happens and epoch starts at time in interval , epoch will not start in the interval .

We prove this by contradiction. Suppose that epoch . Since epoch ends in time interval , we know that from the definition of the algorithm (Step 4), such that and , where denotes the set in time just before Step 4. Moreover, . From now on, we will fix the variables . However, we will show that when happens, .

First we will show that . Because , we know that , and when is added into , there exists such that . Then, by the same argument as in Lemma 7, since is contained in epoch , we only add elements into the set , and we have and the added elements do not affect the minimum .

Then from Lemma 7, we know that arm has been observed for times in the interval , because and from the definition of (Definition 5), we have