# Selling to a No-Regret Buyer

## Abstract

We consider the problem of a single seller repeatedly selling a single item to a single buyer (specifically, the buyer has a value drawn fresh from a known distribution $D$ in every round). Prior work assumes that the buyer is fully rational and will perfectly reason about how their bids today affect the seller’s decisions tomorrow. In this work we initiate a different direction: the buyer simply runs a no-regret learning algorithm over possible bids. We provide a fairly complete characterization of optimal auctions for the seller in this domain. Specifically:

• If the buyer bids according to EXP3 (or any “mean-based” learning algorithm), then the seller can extract expected revenue arbitrarily close to the expected welfare. This auction is independent of the buyer’s valuation distribution $D$, but somewhat unnatural as it is sometimes in the buyer’s interest to overbid.

• There exists a learning algorithm $\mathcal{A}$ such that if the buyer bids according to $\mathcal{A}$, then the optimal strategy for the seller is simply to post the Myerson reserve for $D$ in every round.

• If the buyer bids according to EXP3 (or any “mean-based” learning algorithm), but the seller is restricted to “natural” auction formats where overbidding is dominated (e.g. Generalized First-Price or Generalized Second-Price), then the optimal strategy for the seller is a pay-your-bid format with decreasing reserves over time. Moreover, the seller’s optimal achievable revenue is characterized by a linear program, and can be unboundedly better than the best truthful auction yet simultaneously unboundedly worse than the expected welfare.

## 1 Introduction

Consider a bidder trying to decide how much to bid in an auction (for example, a sponsored search auction). If the auction happens to be the truthful Vickrey-Clarke-Groves auction [31], then the bidder’s decision is easy: simply bid your value. If instead, the bidder is participating in a Generalized First-Price (GFP) or Generalized Second-Price (GSP) auction, the optimal strategy is less clear. Bidders can certainly attempt to compute a Bayes-Nash equilibrium of the associated game and play accordingly, but this is unrealistic due to the need for accurate priors and extensive computation.

Alternatively, the bidders may try to learn a best-response over time (possibly offloading the learning to commercial bid optimizers). We specifically consider bidders who no-regret learn, as empirical work of Nekipelov et al. [27] shows that bidder behavior on Bing is largely consistent with no-regret learning (i.e. for most bidders, there exists a per-click value such that their behavior guarantees no-regret for this value). From the perspective of a revenue-maximizing auction designer, this motivates the following question: If a seller knows that buyers are no-regret learning over time, how should they maximize revenue?

This question is already quite interesting even when there is just a single item for sale to a single buyer. We consider a model where in every round $t$, the seller solicits a bid $b_t$ from the buyer, then allocates the item according to some allocation rule $x_t(\cdot)$ and charges the bidder according to some pricing rule $p_t(\cdot)$ (satisfying $p_t(b) \le b \cdot x_t(b)$ for all $t, b$). Note that the allocation and pricing rules (henceforth, auction) can differ from round to round, and that the auction need not be truthful. Each round, the bidder has a value $v_t$ drawn independently from $D$, and uses some no-regret learning algorithm to decide which bid to place in round $t$, based on the outcomes in rounds $1, \ldots, t-1$ (we will make clear exactly what it means for a buyer with changing valuation to play no-regret in Section 2, but one can think of $v_t$ as providing a “context” for the bidder during round $t$).

One default strategy for the seller is simply to set Myerson’s revenue-optimal reserve price for $D$, $p(D)$, in every round (that is, $x_t(b) = I(b \ge p(D))$ and $p_t(b) = p(D) \cdot I(b \ge p(D))$ for all $t$, where $I(\cdot)$ is the indicator function). It’s not hard to see that any no-regret learning algorithm will eventually learn to submit a winning bid during all rounds where $v_t > p(D)$, and a losing bid whenever $v_t < p(D)$. So if $\mathsf{Mye}(D)$ denotes the expected revenue of the optimal reserve price when a single buyer is drawn from $D$, the default strategy guarantees the seller revenue $T \cdot \mathsf{Mye}(D) - o(T)$ over $T$ rounds. The question then becomes whether or not the seller can beat this benchmark, and if so by how much.

The answer to this question isn’t a clear-cut yes or no, so let’s start with the following instantiation: how much revenue can the seller extract if the buyer runs EXP3 [1]? In Theorem ?, we show that the seller can actually do much better than the default strategy: it’s possible to extract revenue per round equal to (almost) the full expected welfare! That is, if $\mathsf{Val}(D) = \mathbb{E}_{v \sim D}[v]$, there exists an auction that extracts revenue $(1-\varepsilon) \cdot T \cdot \mathsf{Val}(D) - o(T)$ for all $\varepsilon > 0$. It turns out this result holds not only for EXP3, but for any learning algorithm with the following (roughly stated) property: if at time $t$, the mean reward of action $a$ is significantly larger than the mean reward of action $b$, the learning algorithm will choose action $b$ with negligible probability. We call a learning algorithm with this property a “mean-based” learning algorithm and note that many commonly used learning algorithms - EXP3, Multiplicative Weights Update [3], and Follow-the-Perturbed-Leader [16] - are “mean-based” (see Section 2 for a formal definition).

We postpone all intuition until Section 3.1 with a worked-through example, but just note here that the auction format is quite unnatural: it “lures” the bidder into submitting high bids early on by giving away the item for free, and then charging very high prices (but still bounded in $[0,1]$) near the end. The transition from “free” to “high-price” is carefully coordinated across different bids to achieve the revenue guarantee.

This result motivates two further directions. First, do there exist other no-regret algorithms for which full surplus extraction is impossible for the seller? In Theorem ?, we show that the answer is yes. In fact, there is a simple no-regret algorithm $\mathcal{A}$, such that when the bidder uses algorithm $\mathcal{A}$ to bid, the default strategy (set the Myerson reserve every round) is optimal for the seller. We again postpone a formal statement and intuition to Section 3.2, but just note here that the algorithm is a natural adaptation of EXP3 (or in fact, any existing no-regret algorithm) to our setting.

Finally, it is reasonable to expect that bidders might use off-the-shelf no-regret learning algorithms like EXP3, so it is still important to understand what the seller can hope to achieve if the buyer is specifically using such a “mean-based” algorithm (formal definition in Section 2). Theorem ? is perhaps unsatisfying in this regard because the proposed auction is so unnatural, and looks nothing like the GSP or GFP auctions that initially motivated this study. It turns out that the key property separating GFP/GSP from the unnatural auction above is whether overbidding is a dominated strategy. That is, in our unnatural auction, if the bidder truly hopes to guarantee low regret they must seriously consider overbidding (and this is how the auction lures them into bidding way above their value). In both GSP and GFP, overbidding is dominated, so the bidder can guarantee no regret while overbidding with probability $0$ in every round.

The final question we ask is the following: if the buyer is using EXP3 (or any “mean-based” algorithm), but only considering undominated strategies, how much revenue can the seller extract using an auction where overbidding is dominated in every round? It turns out that the auctioneer can still outperform the default strategy, but not extract full welfare. Instead, we identify a linear program (as a function of $D$) that tightly characterizes the optimal revenue the seller can achieve in this setting when the buyer’s values are drawn from $D$. Moreover, we show that the auction that achieves this guarantee is natural, and can be thought of as a first-price auction with decreasing reserves over time. Finally, we show that this “mean-based revenue” benchmark, $\mathsf{MBRev}(D)$, lies truly in between the Myerson revenue and the expected welfare: for all $c$, there exists a distribution $D$ over values such that $c \cdot \mathsf{Mye}(D) < \mathsf{MBRev}(D) < \frac{1}{c} \cdot \mathsf{Val}(D)$. In other words, the seller’s mean-based revenue may be unboundedly better than the default strategy, yet simultaneously unboundedly far from the expected welfare. We provide formal statements and a detailed proof overview of these results in Section 3.3. To briefly recap, our main results are the following:

1. If the buyer uses a “mean-based” learning algorithm like EXP3, the seller can extract revenue $(1-\varepsilon) \cdot T \cdot \mathsf{Val}(D) - o(T)$ for any constant $\varepsilon > 0$ (Theorem ?).

2. There exists a natural no-regret algorithm $\mathcal{A}$ such that when the buyer bids according to $\mathcal{A}$, the seller’s default strategy (charging the Myerson reserve every round) is optimal (Theorem ?).

3. If the buyer uses a “mean-based” algorithm only over undominated strategies, the seller can extract revenue $T \cdot \mathsf{MBRev}(D) - o(T)$ using an auction where overbidding is dominated in every round. Moreover, we characterize $\mathsf{MBRev}(D)$ as the value of a linear program, and show it can be simultaneously unboundedly better than $\mathsf{Mye}(D)$ and unboundedly worse than $\mathsf{Val}(D)$ (Theorems ?, ? and ?).

Our plan for the remaining sections is as follows. Below, we overview our connection to related work. Section 2 formally defines our model. Section 3 works through a concrete example, providing intuition for all three results. Section 4 discusses conclusions and open problems.

### 1.1 Related Work

There are two lines of work that are most related to ours. The first is that of dynamic auctions, such as [28]. Like our model, there are $T$ rounds where the seller has a single item for sale to a single buyer, whose value is drawn from some distribution every round. However, the buyer is fully strategic and processes fully how their choices today affect the seller’s decisions tomorrow (e.g. they engage with deals of the form “pay today to get the item tomorrow”). Additional closely related work is that of Devanur et al. studying the Fishmonger problem [11]. Here, there is again a single buyer and seller, and $T$ rounds of sale. Unlike our model, the buyer draws a value from $D$ only once, at the start, and that value is fixed through all rounds (so the seller could try to learn the buyer’s value over time). Also unlike our model, they study perfect Bayesian equilibria (where again the buyer is fully strategic, and reasons about how their actions today affect the seller’s behavior tomorrow).

In contrast to these works, while buyers in our model do care about the future (e.g. they value learning), they don’t reason about how their actions today might affect the seller’s decisions tomorrow. Our model is more realistic for sponsored search auctions, where search engines rarely release proprietary algorithms for setting reserves based on past data (and fully strategic reasoning is simply impossible without the necessary information).

Other related work considers the Price of Anarchy of simple combinatorial auctions when bidders no-regret learn [29]. One key difference between this line of work and ours is that these all study welfare maximization for combinatorial auctions with rich valuation functions. In contrast, our work studies revenue maximization while selling a single item. Additionally, in these works the seller commits to a publicly known auction format, and the only reason for learning is due to the strategic behavior of other buyers. In contrast, buyers in our model have to learn even when they are the only buyer, due to the strategic nature of the seller.

Recent work has also considered learning from the perspective of the seller. In these works, the buyer’s (or buyers’) valuations are drawn from an unknown distribution, and the seller’s goal is to learn an approximately optimal auction with as few samples as possible [8]. These works consider numerous different models and achieve a wide range of guarantees, but all study the learning problem from the perspective of the seller, whereas the buyer is simply myopic and participates in only one round. In contrast, it is the buyer in our model who does the learning (and there is no information for the seller to learn: the buyer’s values are drawn fresh in every round).

Finally, no-regret learning in online decision problems is an extremely well-studied problem. When feedback is revealed for every possible action, one well-known solution is the multiplicative weights update (MWU) rule, which has been rediscovered and applied in many fields (see survey [3] for more details). Another algorithmic scheme for the online decision problem is known as Follow the Perturbed Leader [16]. When only feedback for the selected action is revealed, the problem is referred to as the multi-armed bandit problem. Here, similar ideas to the MWU rule are used in developing the EXP3 algorithm [1] for the adversarial bandit model, and also for the contextual bandit problem [21]. Our algorithm in Theorem ? bears some similarities to the low swap regret algorithm introduced in [5]. See the survey [4] for more details about the multi-armed bandit problem. Our results hold in both models (i.e. whether the buyer receives feedback for every bid they could have made, or only the bid they actually make), so we will make use of both classes of algorithms.

In summary, while there is already extensive work related to repeated sales in auctions, and even no-regret learning with respect to auctions (from both the buyer and seller perspective), our work is the first to address how a seller might adapt their selling strategy when faced with a no-regret buyer.

## 2 Model and Preliminaries

We consider a setting with 1 buyer and 1 seller. There are $T$ rounds, and in each round the seller has one item for sale. At the start of each round $t$, the buyer’s value $v_t$ (known only to the buyer) for the item is drawn independently from some distribution $D$ (known to both the seller and the buyer). For simplicity, we assume $D$ has a finite support of size $m$, supported on values $0 \le v_1 < v_2 < \cdots < v_m \le 1$. For each $i \in [m]$, $v_i$ has probability $q_i$ of being drawn under $D$.

The seller then presents $K$ options for the buyer, which can be thought of as “possible bids” (we will interchangeably refer to these as options, bids, or arms throughout the paper, depending on context). Each arm $i$ is labelled with a bid value $b_i \in [0,1]$, with $b_1 < b_2 < \cdots < b_K$. Upon pulling this arm at round $t$, the buyer receives the item with some allocation probability $a_{i,t}$, and must pay a price $p_{i,t} \in [0, a_{i,t} \cdot b_i]$. These values $a_{i,t}$ and $p_{i,t}$ are chosen by the seller during time $t$, but remain unknown to the buyer until he plays an arm, upon which he learns the values for that arm. All of our positive results (i.e. strategies for the seller) are non-adaptive (in some places called oblivious), in the sense that $a_{i,t}$ and $p_{i,t}$ are set before the first round starts. All of our negative results (i.e. upper bounds on how much a seller can possibly attain) hold even against fully adaptive sellers, where $a_{i,t}$ and $p_{i,t}$ can be set even after learning the distribution of arms the buyer intends to pull in round $t$.

In order for the selling strategies to possibly represent sponsored search auctions, we require the allocation/price rules to be monotone. That is, if $i > j$, then for all $t$, $a_{i,t} \ge a_{j,t}$ and $p_{i,t} \ge p_{j,t}$. In other words, bidding higher should result in a (weakly) higher probability of receiving the item and (weakly) higher expected payment. We’ll also insist on the existence of an arm $0$ with bid $b_0 = 0$ and $a_{0,t} = p_{0,t} = 0$ for all $t$; i.e., an arm which charges nothing but does not give the item. Playing this arm can be thought of as not participating in the auction.

We’ll be interested in one final property of allocation/price rules that we call critical, and buyer behavior that we call clever. We won’t require that all auctions considered be critical, but this is an important property that greatly affects the optimal revenue that a seller can extract (see Theorems ? and ?).

The above definition captures the property that in many auctions like GFP and GSP (both of which are critical), it makes no sense for a buyer to ever play dominated strategies - they need only learn over the undominated strategies. Note that if overbidding is strictly dominated, any low-regret or mean-based learning algorithm will quickly learn not to overbid, and therefore play similarly to clever bidders in critical auctions.

### 2.1 Bandits and experts

Our goal is to understand the behavior of such mechanisms when the buyer plays according to some no-regret strategy for the multi-armed bandit problem. In the classic multi-armed bandit problem a learner (in our case, the buyer) chooses one of $K$ arms per round, over $T$ rounds. On round $t$, the learner receives a reward $r_{i,t} \in [0,1]$ for pulling arm $i$ (where the values $r_{i,t}$ are possibly chosen adversarially). The learner’s goal is to maximize his total reward.

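Let $I_t$ denote the arm pulled by the learner at round $t$. The regret of an algorithm $\mathcal{A}$ for the learner is the random variable $\mathsf{Reg}(\mathcal{A}) = \max_i \sum_{t=1}^{T} r_{i,t} - \sum_{t=1}^{T} r_{I_t,t}$. We say an algorithm $\mathcal{A}$ for the multi-armed bandit problem is $\delta$-no-regret if $\mathbb{E}[\mathsf{Reg}(\mathcal{A})] \le \delta$ (where the expectation is taken over the randomness of $\mathcal{A}$). We say an algorithm is no-regret if it is $\delta$-no-regret for some $\delta = o(T)$.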
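To make the regret definition concrete, here is a minimal sketch that computes the realized external regret of one sequence of pulls against the best fixed arm in hindsight. The function name `external_regret` and the toy reward table are illustrative assumptions, not part of the paper.

```python
from typing import List


def external_regret(rewards: List[List[float]], pulls: List[int]) -> float:
    """Regret of a pull sequence against the best fixed arm in hindsight.

    rewards[t][i] is the reward of arm i in round t (assumed to lie in [0, 1]);
    pulls[t] is the arm the learner actually pulled in round t.
    """
    num_arms = len(rewards[0])
    # Total reward of each fixed arm, had it been played in every round.
    best_fixed = max(sum(row[i] for row in rewards) for i in range(num_arms))
    # Total reward the learner actually collected.
    realized = sum(row[arm] for row, arm in zip(rewards, pulls))
    return best_fixed - realized


# Hypothetical 4-round, 2-arm instance.
toy_rewards = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
toy_pulls = [0, 0, 0, 1]
toy_regret = external_regret(toy_rewards, toy_pulls)
print(toy_regret)
```

An algorithm is no-regret when the expectation of this quantity grows sublinearly in the number of rounds.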

In the multi-armed bandits setting, the learner only learns the value $r_{i,t}$ for the arm $i$ which he pulls on round $t$. In our setting, the learner will learn $a_{i,t}$ and $p_{i,t}$ explicitly (from which they can compute $r_{i,t}$). Our results (both positive and negative) also hold when the learner learns the value $r_{i,t}$ for all arms $i$ (we refer to this full-information setting as the experts setting, in contrast to the partial-information bandits setting). Simple no-regret algorithms exist in both the experts setting and the bandits setting. Of special interest in this paper will be a class of learning algorithms for the bandits problem and experts problem which we term “mean-based”.

Intuitively, ‘mean-based’ algorithms will rarely pick an arm whose current mean is significantly worse than the current best mean. Many no-regret algorithms, including commonly used variants of EXP3 (for the bandits setting), the Multiplicative Weights algorithm (for the experts setting) and the Follow-the-Perturbed-Leader algorithm (experts setting), are mean-based (Appendix D).
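As a rough illustration of this intuition, the sketch below computes the sampling distribution of exponential weights (Hedge, the experts-setting Multiplicative Weights algorithm) from cumulative rewards. The learning rate $\sqrt{\ln K / T}$ is one standard choice and is an assumption here, as are the function name and the numbers.

```python
import math
from typing import List


def hedge_probabilities(cumulative_rewards: List[float], eta: float) -> List[float]:
    """Exponential-weights distribution over arms given cumulative rewards."""
    top = max(cumulative_rewards)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(eta * (r - top)) for r in cumulative_rewards]
    total = sum(weights)
    return [w / total for w in weights]


T = 10_000
K = 2
eta = math.sqrt(math.log(K) / T)  # assumed learning rate

# Arm 1 trails arm 0 by 0.1 * T in cumulative reward.
probs = hedge_probabilities([0.6 * T, 0.5 * T], eta)
print(probs)
```

With a cumulative gap that is linear in $T$, the trailing arm is sampled with probability that vanishes as $T$ grows, which is the behaviour the seller exploits in Section 3.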

#### Contextual bandits

In our setting, the buyer has the additional information of their current value for the item, and hence is actually facing a contextual bandits problem. In (our variant of) the contextual bandits problem, each round $t$ the learner is additionally provided with a context $c_t$ drawn from some distribution $D$ supported on a finite set $C$ (in our setting, $c_t = v_t$, the buyer’s valuation for the item at time $t$). The adversary now specifies rewards $r_{i,t}(c)$, the reward the learner receives if he pulls arm $i$ on round $t$ while having context $c$. If we are in the full-information (experts) setting, the learner learns the values of $r_{i,t}(c_t)$ for all arms $i$ after round $t$, whereas if we are in the partial-information (bandits) setting, the learner only learns the value of $r_{i,t}(c_t)$ for the arm $i$ that he pulled.

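In the contextual bandits setting, we now define the regret of an algorithm $\mathcal{A}$ in terms of regret against the best “context-specific” policy $\pi$; that is, $\mathsf{Reg}(\mathcal{A}) = \max_{\pi : C \to [K]} \sum_{t=1}^{T} r_{\pi(c_t),t}(c_t) - \sum_{t=1}^{T} r_{I_t,t}(c_t)$, where again $I_t$ is the arm pulled by $\mathcal{A}$ on round $t$. As before, we say an algorithm is $\delta$-no-regret if $\mathbb{E}[\mathsf{Reg}(\mathcal{A})] \le \delta$, and say an algorithm is no-regret if it is $\delta$-no-regret for some $\delta = o(T)$.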

If the size of the context set $C$ is constant with respect to $T$, then there is a simple way to construct a no-regret algorithm for the contextual bandits problem from a no-regret algorithm $\mathcal{A}$ for the classic bandits problem: simply maintain a separate instance of $\mathcal{A}$ for every different context (in the contextual bandits literature, this is sometimes referred to as the $S$-EXP3 algorithm [4]). We call the algorithm obtained this way the contextualization of $\mathcal{A}$.
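The contextualization can be sketched as a thin wrapper that lazily creates one base learner per context. Everything named here (`Learner`, `Contextualized`, `FollowTheLeader`) is hypothetical; in particular, `FollowTheLeader` is only a placeholder base learner to keep the example short and is not itself a no-regret algorithm against adversarial rewards.

```python
from typing import Callable, Dict, Hashable, List, Protocol


class Learner(Protocol):
    def choose(self) -> int: ...
    def update(self, arm: int, reward: float) -> None: ...


class FollowTheLeader:
    """Placeholder base learner: plays the arm with highest cumulative reward."""

    def __init__(self, num_arms: int) -> None:
        self.totals: List[float] = [0.0] * num_arms

    def choose(self) -> int:
        return max(range(len(self.totals)), key=lambda i: self.totals[i])

    def update(self, arm: int, reward: float) -> None:
        self.totals[arm] += reward


class Contextualized:
    """Keeps a separate instance of the base learner for every context seen."""

    def __init__(self, make_learner: Callable[[], Learner]) -> None:
        self._make_learner = make_learner
        self._instances: Dict[Hashable, Learner] = {}

    def _instance(self, context: Hashable) -> Learner:
        if context not in self._instances:
            self._instances[context] = self._make_learner()
        return self._instances[context]

    def choose(self, context: Hashable) -> int:
        return self._instance(context).choose()

    def update(self, context: Hashable, arm: int, reward: float) -> None:
        self._instance(context).update(arm, reward)


buyer = Contextualized(lambda: FollowTheLeader(num_arms=2))
buyer.update("low value", 0, 1.0)   # feedback seen while holding the low value
buyer.update("high value", 1, 1.0)  # feedback seen while holding the high value
print(buyer.choose("low value"), buyer.choose("high value"))
```

Because the instances never share information, what the buyer learns under one value has no influence on how they bid under another, which is the gap Section 3.1 exploits.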

If we start with a mean-based learning algorithm, then we can show that its contextualization satisfies an analogue of the mean-based property for the contextual-bandits problem (proof in Appendix D).

### 2.2 Welfare and monopoly revenue

In order to evaluate the performance of our mechanisms for the seller, we will compare the revenue the seller obtains to two benchmarks from the single-round setting of a seller selling a single item to a buyer with value drawn from distribution $D$.

The first benchmark we consider is the welfare of the buyer, $\mathsf{Val}(D) = \mathbb{E}_{v \sim D}[v]$, the expected value the buyer assigns to the item. This quantity clearly upper bounds the expected revenue that the seller can hope to extract per round.

The second benchmark we consider is the monopoly revenue, the maximum possible revenue attainable by the seller in one round against a rational buyer. Seminal work of Myerson [26] shows that this revenue is attainable by setting a fixed price (“monopoly/Myerson reserve”) for the item, and hence can be characterized as $\mathsf{Mye}(D) = \max_{p} \, p \cdot \Pr_{v \sim D}[v \ge p]$.
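Both benchmarks are easy to compute for a finite-support distribution. The sketch below is illustrative only: the helper names are hypothetical, and the distribution `example_dist` (value $1/4$ with probability $1/2$, and values $1/2$ and $1$ with probability $1/4$ each) is an assumed instance chosen for the demonstration. It relies on the fact that, with finite support, some optimal posted price lies in the support.

```python
from fractions import Fraction as F
from typing import Dict


def welfare(dist: Dict[F, F]) -> F:
    """Expected value of the buyer: sum over v of v * Pr[v]."""
    return sum(v * q for v, q in dist.items())


def myerson_revenue(dist: Dict[F, F]) -> F:
    """Best posted-price revenue: max over prices p of p * Pr[v >= p]."""
    def sale_probability(price: F) -> F:
        return sum(q for v, q in dist.items() if v >= price)

    # Only prices in the support need to be checked.
    return max(p * sale_probability(p) for p in dist)


# Assumed instance: maps each value to its probability.
example_dist = {F(1, 4): F(1, 2), F(1, 2): F(1, 4), F(1): F(1, 4)}
print(welfare(example_dist), myerson_revenue(example_dist))
```

For this instance every support price earns the same posted-price revenue, so the gap between the two benchmarks is a factor of two.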

### 2.3 A final note on the model

For concreteness, we chose to phrase our problem as one where a single bidder whose value is repeatedly drawn independently from $D$ each round engages in no-regret learning with their value as context. Alternatively, we could imagine a population of $m$ different buyers, each with a fixed value $v_i$. Each round, exactly one buyer arrives at the auction, and it is buyer $i$ with probability $q_i$. The buyers are indistinguishable to the seller, and each buyer no-regret learns (without context, because their value is always $v_i$). This model is mathematically equivalent to ours, so all of our results hold in this model as well if the reader prefers this interpretation instead.

## 3 An Illustrative Example

In this section, we overview an illustrative example to show the difference between mean-based and non-mean-based learning algorithms, and between critical and arbitrary auctions. We will not prove all claims in this section (nor carry out all calculations) as it is only meant to illustrate and provide intuition. Throughout this section, the running example will be when $D$ samples $1/4$ with probability $1/2$, $1/2$ with probability $1/4$, and $1$ with probability $1/4$. Note that $\mathsf{Val}(D) = 1/2$ and $\mathsf{Mye}(D) = 1/4$.

### 3.1 Mean-Based Learning and Arbitrary Auctions

Let’s first consider what the seller can do with an arbitrary (not critical) auction when the buyer is running a mean-based learning algorithm like EXP3. The seller will let the buyer bid $0$ or $1$. If the buyer bids $0$, they pay nothing but do not receive the item (recall that an arm of this form is required). If the buyer bids $1$ in round $t$, they receive the item and pay some price $p_t$ as follows: for the first half of the game ($1 \le t \le T/2$), the seller sets $p_t = 0$. For the second half of the game ($T/2 < t \le T$), the seller sets $p_t = 1$.

Let’s examine the behaviour of the buyer, recalling that they run a mean-based learning algorithm, and therefore (almost) always pull the arm with highest cumulative utility. The buyer with value $1$ will happily bid $1$ all the way through, since he is always offered the item for less than or equal to his value for the item. The buyer with value $1/2$ will bid $1$ for the first $T/2$ rounds, accumulating a surplus (i.e., negative regret) of $1/2$ per round. For the next $T/2$ rounds, this surplus slowly disappears at the rate of $1/2$ per round until it disappears at time $T$, so the bidder with value $1/2$ will bid $1$ all the way through. Finally, the bidder with value $1/4$ will bid $1$ for the first $T/2$ rounds, accumulating surplus at a rate of $1/4$ per round. After round $T/2$, this surplus decreases at a rate of $3/4$ per round, until at round $2T/3$ his cumulative utility from bidding $1$ reaches $0$ and he switches to bidding $0$.

Now let’s compute the revenue. The first $T/2$ rounds yield no revenue, since the item is given away for free. From round $T/2$ through $2T/3$, the buyer always buys the item at a price of $1$, so the seller obtains $T/6$ revenue. Finally, from round $2T/3$ through $T$, the buyer purchases the item with probability $1/2$ and pays $1$, for another $T/6$. The total revenue is $T/6 + T/6 = T/3$. Note that if the seller used the default strategy, they would extract revenue only $T/4$.
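The revenue calculation can be checked with a small simulation. This is a sketch under stated assumptions rather than the construction from the paper's appendix: the buyer is idealized as follow-the-leader with full-information feedback and ties broken toward bidding, and the instance (value $1/4$ with probability $1/2$, values $1/2$ and $1$ with probability $1/4$ each, price $0$ in the first half and $1$ in the second) is assumed. The names `simulate_lure_auction` and `example_dist` are hypothetical.

```python
from fractions import Fraction as F
from typing import Dict


def simulate_lure_auction(T: int, dist: Dict[F, F]) -> F:
    """Expected seller revenue against an idealized mean-based buyer.

    Arms are "bid 0" (no item, no payment) and "bid 1" (item at the posted
    price). The posted price is 0 for rounds 1..T/2 and 1 afterwards.
    """
    half = T // 2
    revenue = F(0)
    for value, prob in dist.items():
        # Cumulative utility of always bidding 1 in this context.
        # The "bid 0" arm always has cumulative utility 0.
        cumulative = F(0)
        for t in range(1, T + 1):
            price = F(0) if t <= half else F(1)
            if cumulative >= 0:  # follow the leader, ties go to bidding 1
                revenue += prob * price
            cumulative += value - price  # full-information update
    return revenue


# Assumed instance: maps each value to its probability.
example_dist = {F(1, 4): F(1, 2), F(1, 2): F(1, 4), F(1): F(1, 4)}
T = 600
lure_revenue = simulate_lure_auction(T, example_dist)
print(float(lure_revenue / T))  # per-round revenue, close to 1/3
```

The per-round revenue lands close to $1/3$, above the per-round default of $1/4$; the small discrepancy comes from tie-breaking at the single round where the low-value buyer's surplus hits exactly zero.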

Where did our extra revenue come from? First, note that the welfare of the buyer in this example is quite high: the bidder gets the item the whole way through when $v \ge 1/2$, and two-thirds of the way through when $v = 1/4$. One reason why the welfare is so high is because we give the item away for free in the early rounds. But notice also that the utility of the buyer is quite low: the buyer actually has zero utility when $v \in \{1/4, 1/2\}$, and an average utility of $1/2$ per round when $v = 1$. The reason we’re able to keep the utility low, despite giving the item away for free in the early rounds, is because we overcharge the bidders in later rounds (and they choose to overpay, exactly because their learning is mean-based).

Getting the intervals to line up properly so that any mean-based learner will pick the desired arms still requires some work. But interestingly, our constructed mechanism is non-adaptive and prior-independent (i.e. the same mechanism extracts full welfare for all $D$). Theorem ? below formally states the guarantees. The construction itself and the proof appear in Appendix B.

Two properties should jump out as key in enabling the result above. The first is that the buyer only has no regret towards fixed arms and not towards the policy they would have used with a lower value (this is what leads the buyer to continue bidding $1$ with value $1/2$ even though they have already learned to bid $0$ with value $1/4$). This suggests an avenue towards an improved learning algorithm: have the bidder attempt to have no regret not only towards each fixed arm, but also towards the policy of play produced when having different values. This turns out to be exactly the right idea, and is discussed in the following subsection.

The second key property is that we were able to “lure” the bidders into playing an arm with a free item, then overcharge them later to make up for lost revenue. This requires that the bidder consider pulling an arm with maximum bid exceeding their value, which will never happen in a critical auction with clever bidders. It turns out it is still possible to do better than the default strategy with a critical auction against clever bidders, but not as well as with an arbitrary auction. Section 3.3 explores critical auctions for this example.

### 3.2 Better Learning and Arbitrary Auctions

In our bad example above, the buyer with value $1/2$ for the item slowly spends the second half of the game losing utility. While his behaviour is still no-regret (he ends up with zero net utility, which indeed is at least as good as only bidding $0$), he would have been much happier to follow the actions of the buyer with value $1/4$, who started bidding $0$ at round $2T/3$.

Using this idea, we show how to construct a no-regret algorithm for the buyer such that the seller receives at most the Myerson revenue every round. We accomplish this by extending an arbitrary no-regret algorithm (e.g. EXP3) by introducing “virtual arms” for each value, so that each buyer with value $v$ has low regret not just with respect to every fixed bid, but also no-regret with respect to the policy of play as if they had a different value $v'$ for the item (for all $v' < v$). In some ways, our construction is very similar to the construction of low internal-regret (or swap-regret) algorithms from low external-regret algorithms. The main difference is that instead of having low regret with respect to swapping actions, we have low regret with respect to swapping contexts (i.e. values). Theorem ? below states that the seller cannot outperform the default strategy against buyers who use such algorithms to learn.

The algorithm’s description and proof appear in Appendix A. The key observation in the proof is that “not regretting playing as if my value were $v'$” sounds a lot like “not preferring to report value $v'$ instead of $v$.” This suggests that the aggregate allocation probabilities and prices paid by any buyer using our algorithm should satisfy the same constraints as a truthful auction, proving that the resulting revenue cannot exceed the default strategy (and indeed the proof follows this approach).

### 3.3 Mean-Based Learning and Critical Auctions

Recall in our example that to extract revenue $T/3$, bidders with values $1/4$ and $1/2$ had to consider bidding $1$. If the seller is using a critical auction, overbidding is dominated, so there is no reason for bidders to do this. In fact, the analysis and results of this section hold as long as the bidders never consider overbidding (even if the auction isn’t critical).

Although the auction in Section 3.1 is no longer viable, consider the following auction instead: in addition to the zero arm, the bidder can bid $1/4$ or $1/2$. If they bid $1/2$ in any round, they will get the item with probability $1$ and pay $1/2$. If they bid $1/4$ in round $t \le T/2$, they get nothing. If they bid $1/4$ in round $t > T/2$, they get the item and pay $1/4$. Let’s again see what the bidder will choose to do, remembering that they will always pull the arm that has provided highest cumulative utility (due to being mean-based).

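Clearly, the bidder with value $1/4$ will bid $1/4$ every round (since they are clever, they won’t even consider bidding $1/2$), making a total payment of $\frac{T}{2} \cdot \frac{1}{4} \cdot \frac{1}{2} = \frac{T}{16}$ (the last factor is the probability of having this value). The bidder with value $1/2$ will bid $1/2$ for the first $T/2$ rounds, and then immediately switch to bidding $1/4$, making a total payment of $\left(\frac{T}{2} \cdot \frac{1}{2} + \frac{T}{2} \cdot \frac{1}{4}\right) \cdot \frac{1}{4} = \frac{3T}{32}$.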

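The bidder with value $1$ will actually bid $1/2$ for the entire $T$ rounds. To see this, observe that their cumulative surplus through round $t$ from bidding $1/2$ is $t/8$ ($t$ rounds by $1/2$ utility per round by $1/4$ probability of having value $1$). Their cumulative surplus through round $t$ from bidding $1/4$ is instead $(t - T/2) \cdot \frac{3}{4} \cdot \frac{1}{4}$ for $t \ge T/2$ (and $0$ before that), which is strictly less than $t/8$ for all $t \le T$. Because they are mean-based, they will indeed bid $1/2$ for the entire duration due to its strictly higher utility. So their total payment will be $T \cdot \frac{1}{2} \cdot \frac{1}{4} = \frac{T}{8}$. The total revenue is then $\frac{T}{16} + \frac{3T}{32} + \frac{T}{8} = \frac{9T}{32}$, again surpassing the default strategy (but not reaching the $T/3$ achieved by our non-critical auction).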
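The same kind of idealized simulation can be run for this auction. As before, this is a sketch under assumptions: a follow-the-leader buyer with full-information feedback who never overbids, ties broken toward the higher bid, and an assumed instance (value $1/4$ with probability $1/2$, values $1/2$ and $1$ with probability $1/4$ each; bid $1/2$ always wins at price $1/2$; bid $1/4$ wins at price $1/4$ only in the second half). The names are hypothetical.

```python
from fractions import Fraction as F
from typing import Dict, Tuple


def simulate_critical_auction(T: int, dist: Dict[F, F]) -> F:
    """Expected seller revenue against a clever, idealized mean-based buyer."""
    half = T // 2
    bids = [F(0), F(1, 4), F(1, 2)]

    def outcome(bid: F, t: int) -> Tuple[int, F]:
        """Return (item allocated?, price paid) for a bid in round t."""
        if bid == F(1, 2):
            return 1, F(1, 2)
        if bid == F(1, 4) and t > half:
            return 1, F(1, 4)
        return 0, F(0)

    revenue = F(0)
    for value, prob in dist.items():
        allowed = [b for b in bids if b <= value]  # clever: never overbid
        cumulative = {b: F(0) for b in allowed}
        for t in range(1, T + 1):
            # Follow the leader; ties are broken toward the higher bid.
            chosen = max(allowed, key=lambda b: (cumulative[b], b))
            revenue += prob * outcome(chosen, t)[1]
            for b in allowed:  # full-information update for every allowed arm
                allocated, price = outcome(b, t)
                cumulative[b] += allocated * value - price
    return revenue


# Assumed instance: maps each value to its probability.
example_dist = {F(1, 4): F(1, 2), F(1, 2): F(1, 4), F(1): F(1, 4)}
T = 640
critical_revenue = simulate_critical_auction(T, example_dist)
print(float(critical_revenue / T))  # per-round revenue, close to 9/32
```

Under these assumptions the per-round revenue is within a vanishing term of $9/32$, which sits strictly between the default $1/4$ and the $1/3$ obtained when overbidding is allowed.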

Let’s again see where our extra revenue comes from in comparison to a truthful auction. Notice that the bidder receives the item with probability $1$ conditioned on having value $1$, and also conditioned on having value $1/2$. Yet somehow the bidder pays an average of $1/2$ conditioned on having value $1$, but an average of $3/8$ conditioned on having value $1/2$. This could never happen in a truthful auction, as the bidder would strictly prefer to pretend their value was $1/2$ rather than $1$. But it is entirely possible when the buyer does mean-based learning, as evidenced by this example.

In Appendix C, we define $\mathsf{MBRev}(D)$ as the value of the LP in Figure ?. In Theorems ? and ?, we show that $\mathsf{MBRev}(D) \cdot T$ tightly characterizes (up to $\pm o(T)$) the optimal revenue a seller can extract with a critical auction against a clever buyer. We state the theorem statements more generally to remind the reader that they hold as long as the buyer never overbids (even if the auction is arbitrary). The proofs can be found in Appendix C.1.

Before stating our theorems, let’s parse this LP. $q_i$ is a constant representing the probability that the buyer has value $v_i$ (also a constant). $x_i$ is a variable representing the average probability that the bidder gets the item with value $v_i$, and $u_i$ is a variable representing the average utility of the bidder when having value $v_i$. Therefore, this bidder’s average value is $v_i x_i$, the average price they pay is $v_i x_i - u_i$, and the objective function is simply the average revenue. The second constraints are just normalization, ensuring that everything lies in $[0,1]$. The first line of constraints are the interesting ones. These look a lot like IC constraints that a truthful auction must satisfy, but something’s missing: the LHS is clearly the utility of the buyer with value $v_i$ for “telling the truth,” but the utility of the buyer for “reporting $v_j$ instead” is $u_j + (v_i - v_j) \cdot x_j$. So the $u_j$ term is missing on the RHS.
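For reference, the LP as it can be read off from the description above is written out below. This is a reconstruction from the prose (the original figure is not reproduced here), so the exact index range of the first constraint family is an assumption; constraints with $j > i$ would be implied by $u_i \ge 0$ in any case.

```latex
\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{m} q_i \, (v_i x_i - u_i) \\
\text{subject to} \quad & u_i \ge (v_i - v_j) \cdot x_j && \text{for all } j < i, \\
& u_i \ge 0, \quad 0 \le x_i \le 1 && \text{for all } i \in [m].
\end{aligned}
```

Adding $u_j$ to the right-hand side of the first family would recover the usual incentive-compatibility constraints of a truthful auction.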

Let’s also see a very brief proof outline for why no seller can extract more revenue than $\mathsf{MBRev}(D)$ per round:

1. Because the buyer has no regret conditioned on having value $v_i$, their utility is at least as high as playing arm $j$ every round.

2. Because the auction never charges arm $j$ more than $v_j$ (conditioned on awarding the item), the buyer’s utility for playing arm $j$ every round is at least $(v_i - v_j) \cdot y_j$, where $y_j$ is the average probability that arm $j$ awards the item.

3. Because the auction is monotone, and the buyer never considers overbidding, if the buyer gets the item with probability $x_j$ conditioned on having value $v_j$, we must have $y_j \ge x_j$.

These three facts together give $u_i \ge (v_i - v_j) \cdot x_j$, and so show that no seller can extract more than $\mathsf{MBRev}(D)$ per round against a no-regret buyer who doesn’t overbid. Observe also that step 3 is exactly the step that doesn’t hold for buyers who consider overbidding (and is exactly what’s violated in our example in Section 3.1): if the buyer ever overbids, then they might receive the item with higher probability than had they just played their own arm every round.
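The constraint $u_i \ge (v_i - v_j) \cdot x_j$ and the revenue objective can be checked numerically for a candidate outcome. The sketch below is illustrative: the helper names are hypothetical, the instance (values $1/4, 1/2, 1$ with probabilities $1/2, 1/4, 1/4$) is assumed, and the two candidate points are supplied by hand. No claim of LP-optimality is made for either point.

```python
from fractions import Fraction as F
from typing import List


def is_feasible(values: List[F], x: List[F], u: List[F]) -> bool:
    """Check u_i >= (v_i - v_j) * x_j for j < i, u_i >= 0 and 0 <= x_i <= 1."""
    m = len(values)
    in_range = all(0 <= x[i] <= 1 and u[i] >= 0 for i in range(m))
    no_regret = all(
        u[i] >= (values[i] - values[j]) * x[j]
        for i in range(m)
        for j in range(i)
    )
    return in_range and no_regret


def average_revenue(values: List[F], probs: List[F], x: List[F], u: List[F]) -> F:
    """Objective: sum over i of q_i * (v_i * x_i - u_i)."""
    return sum(q * (v * xi - ui) for v, q, xi, ui in zip(values, probs, x, u))


values = [F(1, 4), F(1, 2), F(1)]   # assumed instance
probs = [F(1, 2), F(1, 4), F(1, 4)]

# Outcome of a "bid 1/4 wins only in the second half" auction on this instance.
x_half, u_half = [F(1, 2), F(1), F(1)], [F(0), F(1, 8), F(1, 2)]
# A second hand-picked point in which the lowest value wins 2/3 of the time.
x_alt, u_alt = [F(2, 3), F(1), F(1)], [F(0), F(1, 6), F(1, 2)]

rev_half = average_revenue(values, probs, x_half, u_half)
rev_alt = average_revenue(values, probs, x_alt, u_alt)
print(is_feasible(values, x_half, u_half), rev_half)
print(is_feasible(values, x_alt, u_alt), rev_alt)
```

Both points satisfy the constraints, and the second has a larger objective, which suggests that the simple two-phase auction need not attain the LP value on this instance.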

### 3.4 A Final Note on the Example

While reading through our examples, the reader may think that the mean-based learner’s behavior is clearly irrational: why would you continue paying above your value? Why would you continue paying more than necessary, when you can safely get the item for less?

But this is exactly the point: a more thoughtful learner can indeed do better (for instance, by using the algorithm of Section 3.2). It is also perhaps misleading to believe that the bidder should “obviously” stop overpaying: we only know this because we know the structure of the example. But in principle, how is the bidder supposed to know that the overcharged rounds are the new norm and not an anomaly? Given that most standard no-regret algorithms are mean-based, it’s important to nail down the seller’s options for exploiting this behavior.

## 4 Conclusion and Future Directions

Motivated by the prevalence of bidders who use no-regret learning to play non-truthful auctions in practice [27], we consider a revenue-maximizing seller with a single item (each round) to sell to a single buyer. We show that when the buyer uses mean-based algorithms like EXP3, the seller can extract revenue arbitrarily close to the expected welfare with an unnatural auction. We then provide a modified no-regret algorithm such that the seller cannot extract revenue exceeding the monopoly revenue when the buyer bids according to it. Finally, we consider a mean-based buyer who never overbids. We tightly characterize the seller’s optimal revenue with a linear program, and show that a pay-your-bid auction with decreasing reserves over time achieves this guarantee. Moreover, we show that the mean-based revenue can be unboundedly better than the monopoly revenue while simultaneously unboundedly worse than the expected welfare. In particular, for the equal revenue curve truncated at , the monopoly revenue is , the expected welfare is , and the mean-based revenue is .
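To make the first two quantities concrete, here is a worked calculation under an assumed normalization: the equal revenue curve on $[1, H]$, meaning $\Pr[v \ge p] = 1/p$ for $p \in [1, H]$ with a point mass of $1/H$ at $H$. The symbol $H$ and this normalization are assumptions made for illustration.

```latex
\begin{align*}
\text{revenue at posted price } p \in [1, H]:\quad
  & p \cdot \Pr[v \ge p] = p \cdot \frac{1}{p} = 1, \\
\text{expected welfare}:\quad
  & \int_{1}^{H} v \cdot \frac{1}{v^{2}} \, dv + H \cdot \frac{1}{H} = \ln H + 1 .
\end{align*}
```

Under this normalization every posted price earns the same revenue, while the welfare grows logarithmically in the truncation point, which is why the gap between the two is unbounded.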

While our work has already shown the single-buyer problem is quite interesting, the most natural direction for future work is understanding revenue maximization with multiple learning buyers. Of our three main results, only Theorem ? extends easily (namely, if every buyer uses our modified learning algorithm, then the default strategy, which now runs Myerson’s optimal auction every round, is optimal; see Theorem ? for details). Our work certainly provides good insight into the multi-bidder problem, but there are still clear barriers. For example, in order to obtain revenue equal to the expected welfare, the auction must necessarily also maximize welfare. In our single-bidder model, this means that we can give away the item for free for rounds, but with multiple bidders, such careless behaviour would immediately make it impossible to achieve the optimal welfare. Regarding the mean-based revenue, while there is a natural generalization of our LP to multiple bidders, it’s no longer clear how to achieve this revenue with a critical auction, as all the relevant variables now implicitly depend on the actions of the other bidders. These are just examples of concrete barriers, and there are likely interesting conceptual barriers for this extension as well.

Another interesting direction is understanding the consequences of our work from the perspective of the buyer. Aside from certain corner configurations (e.g. the seller extracting the buyer’s full welfare), it’s not obvious how the buyer’s utility changes. For instance, is it possible that the buyer’s utility actually increases as the seller switches from the default strategy to the optimal mean-based revenue? Does the buyer ever benefit from using an “exploitable” learning strategy, so that the seller can exploit it and make them both happier?

## A Good no-regret algorithms for the buyer

In this section we show that there exists a (contextual) no-regret algorithm for the buyer which guarantees that the seller receives at most the Myerson revenue per round (i.e., in total). As mentioned earlier, it does not suffice for the buyer to simply run the contextualization for some no-regret learning algorithm (and in fact, if is mean-based, the seller can extract strictly more than , as we will see later). However, by modifying so that it not only has no regret with respect to the best stationary policy, but additionally does not regret playing as if it had some other context, we obtain a no-regret algorithm for the buyer which guarantees the seller receives no more than per round.

The details of the algorithm are presented in Algorithm ?. Recall that the distribution is supported over values , where for each , has probability under . The algorithm takes a no-regret algorithm for the classic multi-armed bandit problem, and runs instances of it, one per possible value . Each instance of learns not only over the possible actions, but also over virtual actions corresponding to values through . Picking the virtual action associated with corresponds to the buyer pretending they have value , and playing accordingly (i.e., querying ).

This algorithm is very similar in structure to the construction of a low swap-regret bandits algorithm from a generic no-regret bandits algorithm (see [5]). The main difference is that whereas swap regret guarantees no-regret with respect to swapping actions (i.e. always playing action instead of action ), this algorithm guarantees no-regret with respect to swapping contexts (i.e., always pretending you have context when you actually have context ). In addition, the auction structure of our problem allows us to only consider contexts with valuations smaller than our current valuation ; this puts a limit of on the number of recursive calls per round, as opposed to the low swap regret algorithm where one must solve for the stationary distribution of a Markov chain over states each round.
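The structure described in the last two paragraphs (one learner per value, with extra virtual arms for lower values) can be sketched as follows. This is a hedged illustration rather than the paper's algorithm: the `Exp3` class is a simple stand-in for the generic bandit learner, the class and method names are hypothetical, and the choices to update only the learner for the true value and to rescale utilities from $[-1, 1]$ to $[0, 1]$ are assumptions.

```python
import math
import random


class Exp3:
    """Small EXP3-style learner; a stand-in for the generic bandit algorithm."""

    def __init__(self, n_arms, rng, eta=0.05, gamma=0.05):
        self.n_arms, self.rng, self.eta, self.gamma = n_arms, rng, eta, gamma
        self.scores = [0.0] * n_arms  # importance-weighted cumulative rewards

    def pick(self):
        top = max(self.scores)
        weights = [math.exp(self.eta * (s - top)) for s in self.scores]
        total = sum(weights)
        # Mix in uniform exploration so every arm keeps positive probability.
        probs = [(1 - self.gamma) * w / total + self.gamma / self.n_arms
                 for w in weights]
        arm = self.rng.choices(range(self.n_arms), weights=probs)[0]
        return arm, probs[arm]

    def update(self, arm, prob, reward):
        self.scores[arm] += reward / prob


class ContextSwapBuyer:
    """One learner per value; learner i also has virtual arms for values below i."""

    def __init__(self, n_values, n_bids, seed=0):
        rng = random.Random(seed)
        self.n_bids = n_bids
        # Learner i sees n_bids bid arms followed by i virtual (value) arms.
        self.learners = [Exp3(n_bids + i, rng) for i in range(n_values)]

    def choose_bid(self, value_index):
        """Follow virtual arms downward until some learner picks a real bid."""
        path = []
        i = value_index
        while True:
            arm, prob = self.learners[i].pick()
            path.append((i, arm, prob))
            if arm < self.n_bids:
                return arm, path
            i = arm - self.n_bids  # pretend to have this lower value

    def observe(self, path, utility):
        """Assumption: only the learner for the true value is updated."""
        i, arm, prob = path[0]
        self.learners[i].update(arm, prob, (utility + 1) / 2)
```

Because each virtual arm points to a strictly lower value, the loop in `choose_bid` makes at most one call per value, which mirrors the bound on recursive calls mentioned in the text.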

We now proceed to show that Algorithm ? has our desired guarantees.

For each , define to be the expected number of rounds the buyer receives the item when they have value . For each define to be the expected total payment from the buyer to the seller when the buyer has value . Our goal is to upper bound , the total revenue the seller receives.

Recall that every strategy must contain a zero option in its menu, where the buyer pays nothing and doesn’t receive the item (and hence receives zero utility). Since each is a -no-regret algorithm, we know that the buyer does not regret always choosing the zero option when they have value . It follows that, for all , we have that

The following lemma shows that when , the buyer does not regret pretending to have value when they have value .

From the algorithm, we know that does not regret always playing the value arm corresponding to . We define the following notation. For all and any history of rounds (including, for each round, which option is chosen and the utility of that round), define to be the probability of getting the item in round given history when the buyer has value , and define to be the expected price paid in round when the buyer has value given history .

Let be the distribution of histories at round , for . The no-regret guarantee tells us that

Note that

Dividing (Equation 2) through by and substituting in these relations, we arrive at the statement of the lemma.

Now define , and define

It follows from Lemma ? that for all ,

From (Equation 1), we also have for all ,

We will argue from these constraints that . To do this, we will construct a single-round mechanism for selling an item to a buyer with value distribution such that this mechanism has expected revenue ; the result then follows from the optimality of the Myerson mechanism ([26]).
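For a single buyer, the optimal single-round mechanism is a posted price, so the Myerson benchmark used in this argument is easy to compute for a discrete distribution. The sketch below assumes the distribution is given as parallel lists of values and probabilities; the function names are hypothetical.

```python
def myerson_revenue(values, probs):
    """Best posted-price revenue: max over prices p of p * Pr[v >= p]."""
    best = 0.0
    for price in values:
        # Probability that the buyer's value is at least the posted price.
        sale_prob = sum(q for v, q in zip(values, probs) if v >= price)
        best = max(best, price * sale_prob)
    return best


def expected_welfare(values, probs):
    """Expected value of the buyer, the most any seller could hope to extract."""
    return sum(v * q for v, q in zip(values, probs))
```

It suffices to try prices in the support, since any other price can be raised to the next support point without losing a sale.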

To construct this mechanism, first find a sequence of indices via the following algorithm.

It is easy to verify that following this algorithm results in . For any (assuming ), .

Consider some value in . We will show that the buyer with value will pay at least , thus proving the lemma. Assume .

We have (from (Equation 5) and the monotonicity of ) that

This means the buyer with value receives non-negative utility by choosing option . For any , we have (from (Equation 4)) that

Since , the above inequality implies that

It follows that

This means the buyer with value prefers option to all options . Therefore this buyer will choose an option from . Since , we know that this buyer will pay at least , as desired.

It follows from the optimality of the Myerson auction that , and therefore that . Expanding out via (Equation 3), we have that

from which the theorem follows.

We can remove the explicit dependence on by filtering out all values which occur with small enough probability.

Ignore all values with (whenever a round with this value arises, choose an arbitrary action for this round). There are total values, so this happens with at most probability , and therefore modifies the regret and revenue in expectation by at most .
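The filtering step can be illustrated with a short sketch. The cutoff is passed in as a hypothetical `threshold` parameter, since the exact cutoff is not reproduced here; the function name is also an assumption.

```python
def drop_rare_values(values, probs, threshold):
    """Split the support into kept values and the total mass of ignored ones."""
    kept = [(v, q) for v, q in zip(values, probs) if q >= threshold]
    # Each ignored value has mass below the threshold, so the ignored mass
    # is at most len(values) * threshold.
    dropped_mass = sum(q for q in probs if q < threshold)
    return kept, dropped_mass
```

Since values lie in a bounded range, the ignored rounds change regret and revenue by an amount proportional to `dropped_mass` times the number of rounds.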

The regret bound from Theorem ? then holds with , from which the result follows.

### A.1 Multiple bidders

Interestingly, we show that by slightly modifying Algorithm ?, we obtain an algorithm (Algorithm ?) that works for the case where there are multiple bidders. In the multiple bidder setting, there are bidders with independent valuations for the item. Each round , bidder receives a value for the item drawn from a distribution (independently of all other values). Each distribution is supported over values, , where occurs under with probability . Every round each bidder submits a bid , and the auctioneer decides on an allocation rule , which maps -tuples of bids to -tuples of probabilities and a pricing rule , which maps -tuples of bids to -tuples of prices . The allocation rule must additionally obey the supply constraint that . Bidder wins the item with probability and pays .

We show that if every bidder plays the no-regret algorithm Algorithm ?, then the auctioneer (even if playing adaptively) is guaranteed to receive no more than revenue, where is the optimal revenue obtainable by an auctioneer selling a single item to bidders with valuations drawn independently from distributions . In other words, if every bidder plays according to Algorithm ?, the seller can do nothing better than running the single-round optimal Myerson auction every round.

The only difference between Algorithm ? and Algorithm ? is that instance in Algorithm ? has a value arm for every possible value, not only the values less than . This means that the recursion depth of this algorithm is potentially unbounded; however, it still terminates in finite expected time, since we insist that has a positive probability of picking any arm (in particular, it will eventually pick a bid arm). We can optimize the runtime of step 11 of Algorithm ? by eliciting a probability distribution over arms from each instance , constructing a Markov chain, and solving for the stationary distribution. This takes time per step of this algorithm.
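One way to avoid simulating the recursion is to compute directly the distribution over the bid arm that is eventually played. The sketch below uses a plain fixed-point iteration in place of the stationary-distribution computation the text mentions, so it is a swapped-in technique for illustration; the function name and the matrix layout are assumptions.

```python
def eventual_bid_distribution(bid_probs, value_probs, iters=200):
    """Distribution of the bid arm reached when starting from each instance.

    bid_probs[i][b]:   probability that instance i picks bid arm b directly
    value_probs[i][j]: probability that instance i picks the value arm for j
    Returns d with d[i][b] = probability that starting at instance i ends at b.
    """
    m, n_bids = len(bid_probs), len(bid_probs[0])
    d = [row[:] for row in bid_probs]
    for _ in range(iters):
        # One step of d = bid_probs + value_probs * d; this converges because
        # every instance picks some bid arm with positive probability.
        d = [[bid_probs[i][b]
              + sum(value_probs[i][j] * d[j][b] for j in range(m))
              for b in range(n_bids)]
             for i in range(m)]
    return d
```

With two instances that each defer to the other half of the time, the starting instance's preferred bid ends up played with probability 2/3.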

Similarly as before, let equal the expected number of rounds bidder receives the item while having value , and let equal the expected total amount bidder pays to the auctioneer while having value . Again, our goal is to upper bound , the total expected revenue the seller receives.

Note that, as before, since every strategy contains a zero option in its menu, we have that (for all and )

Repeating the argument of Lemma ? (which still holds in the multiple bidder setting), we additionally have that (for all and ),

We will now (as in the proof of Theorem ?) construct a mechanism for the single-round instance of the problem of an auctioneer selling a single item to bidders with valuations independently drawn from . Our mechanism will work as follows:

1. The auctioneer will begin by asking each of the bidders for their valuations. Assume that bidder reports valuation (we will insist that belongs to the support of ).

2. The auctioneer will then sample a uniformly at random.

3. For each bidder , the auctioneer will calculate and , the expected allocation probability and price bidder has to pay in round of the dynamic -round mechanism, conditioned on for all .

4. The auctioneer will then give the item to bidder with probability , and charge bidder a price .
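The four steps above can be sketched as follows. This is a minimal illustration in which `round_rules[r]` is a hypothetical callable returning the expected allocation probabilities and prices for round `r` of the multi-round mechanism given the reported values (under the conditioning described in step 3); none of these names come from the paper.

```python
import random


def single_round_mechanism(round_rules, reports, rng):
    """Sample one round uniformly and replay its expected outcome on the reports."""
    r = rng.randrange(len(round_rules))        # step 2: uniformly random round
    allocs, prices = round_rules[r](reports)   # step 3: expected outcome in round r
    # Each round obeys the supply constraint, so the sampled outcome does too.
    assert sum(allocs) <= 1 + 1e-9
    return allocs, prices                      # step 4: allocate and charge
```

Averaged over the random choice of round, each bidder's allocation probability and payment equal the per-round averages of the multi-round mechanism.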

Note that since the allocation rules must always satisfy the supply constraint, the probabilities we sample also obey this supply constraint, and therefore this is a valid mechanism for the single-round problem. We will now show it is approximately incentive compatible.

To begin, we claim that if bidder reports valuation (and everyone else reports truthfully), then the probability that bidder receives the item (under this single-round mechanism) is equal to . Likewise, we claim that if bidder reports valuation (and everyone else reports truthfully), their expected payment is equal to .

To see why this is true, let equal the probability bidder gets the item (in the multi-round mechanism) at time conditioned on for all . By construction, the probability bidder receives the item (in mechanism ) after reporting valuation is equal to

On the other hand, we can write in terms of our function a

It follows that . A similar calculation shows that if is the expected payment of bidder (if they report valuation and everyone else reports truthfully), then .

Now, recall that a mechanism is -BIC if misreporting your value increases your expected utility by at most (assuming everyone else reports truthfully). To show that mechanism is -BIC, it therefore suffices to show that, for all ,

But for , this follows from (Equation 7). Similarly, is -ex-interim IR if, for all ,

Again, this follows from (Equation 6), and the result therefore follows.

We now apply the following lemma from [13], which lets us transform an -BIC mechanism into a BIC mechanism at the cost of revenue.

See Theorem 3.3 in [13].

Applying Lemma ? to our mechanism, we obtain a mechanism that satisfies . Finally, note that since the Myerson auction is the optimal Bayesian-incentive compatible mechanism for this problem, . On the other hand, since (from the proof of Lemma ?) the expected payment bidder pays under mechanism when being truthful is equal to:

It follows that

and thus that

## B Achieving full welfare against non-conservative buyers

In this section, we will show that if the buyer uses a mean-based algorithm instead of Algorithm ?, the seller has a strategy which extracts the entire welfare from the buyer (hence leaving the buyer with zero utility).

If every element in the support of is at least , then the seller can simply always sell the item at price (since is supported on , this ensures a approximation to the buyer’s welfare). From now on, we will assume that is not entirely supported on .

Recall that is supported on values , where is chosen with probability . Define , and define . Since and , we know that and therefore . Notice that here we can make the strategy independent of if we just pick and (but setting and according to information about can reduce the number of arms).

Consider the following strategy for the seller. In addition to the zero arm, the seller will offer possible options, each with maximum bid value . We divide the timeline of each arm into three “sessions” in the following way:

1. session:

For the first rounds, the seller charges 0 and does not give the item to the buyer (i.e. ).

2. 0 session:

For the next rounds, the seller charges 0 and gives the item to the buyer (i.e. ).

3. 1 session:

For the final rounds, the seller charges 1 and gives the item to the buyer (i.e. ).

Note that this strategy is monotone; if , then and .
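The three-session schedule and the monotonicity remark can be sketched in code. The session start rounds are passed in as hypothetical parameters (`give_start`, `charge_start`), since the exact session lengths are not reproduced here. Having higher-indexed arms start their sessions earlier is an assumption made so that the example schedule is monotone.

```python
def arm_outcome(t, give_start, charge_start):
    """(allocation, price) offered by one arm in round t."""
    if t < give_start:
        return (0, 0)  # first session: no item, no charge
    if t < charge_start:
        return (1, 0)  # "0 session": item awarded for free
    return (1, 1)      # "1 session": item awarded at price 1


def is_monotone(give_starts, charge_starts, horizon):
    """Check that higher-indexed arms never allocate less or charge less."""
    arms = range(len(give_starts))
    for t in range(horizon):
        outcomes = [arm_outcome(t, give_starts[i], charge_starts[i]) for i in arms]
        for lower, higher in zip(outcomes, outcomes[1:]):
            if higher[0] < lower[0] or higher[1] < lower[1]:
                return False
    return True
```

A schedule passes the check when both lists of start rounds are non-increasing in the arm index, for example `give_starts=[8, 4, 0]` and `charge_starts=[10, 7, 4]` over 12 rounds.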

Assume that the buyer is running a -mean-based algorithm, for some . Define and . Note that is the round where arm starts its session; we show in the following Lemma that (by the mean-based property), the buyer with value will prefer arm over any arm over all rounds in the interval .

Note that arm starts its 1 session at round . It follows that

Now consider the cumulative utility of playing some arm . It is easy to verify that , and therefore arm is still either in its session or its session. Since arm starts its session the earliest, it follows that , so from now on, assume without loss of generality that . There are two cases:

1. If , the utility is 0.

2. If , the utility is .

It suffices to show that

We have that

Similarly

It follows from the mean-based condition (Definition ?) that in the interval the buyer with value will, with probability at least , choose an arm currently in its 1-session (i.e. an arm with label at most ) and hence pay each round. Since the buyer has value for the item with probability , the total contribution of the buyer with value to the expected revenue of the seller is given by

Here we have used the fact that