# Learning to Bid Without Knowing your Value

###### Abstract

We address online learning in complex auction settings, such as sponsored search auctions, where the value of the bidder is unknown to her, evolving in an arbitrary manner and observed only if the bidder wins an allocation. We leverage the structure of the utility of the bidder to provide algorithms with regret rates against the best fixed bid in hindsight, that are exponentially faster in convergence in terms of dependence on the action space, than what would have been derived by applying a generic bandit algorithm. Our results are enabled by analyzing a new online learning setting with outcome-based feedback, which generalizes learning with feedback graphs. We provide an online learning algorithm for this setting, of independent interest, with regret that grows only logarithmically with the number of actions and linearly only in the number of potential outcomes (the latter being very small in most auction settings).

## 1 Introduction

A standard assumption in the majority of the literature on auction theory and mechanism design is that participants arriving in the market have a clear assessment of their valuation for the goods for sale. This assumption might seem acceptable in small markets with infrequent auctions and ample time for participants to do market research on the goods. However, it is severely violated in the context of the digital economy.

In settings like online advertisement auctions or eBay auctions, bidders participate very frequently in auctions where they have very little knowledge about the good for sale, e.g. the value produced by a user clicking on an ad. It is unreasonable, therefore, to believe that the participant has a clear picture of this value. However, the inability to pre-assess the value of the good before arriving at the market is alleviated by the fact that, due to the large volume of auctions in the digital economy, participants can employ learning-by-doing approaches.

In this paper we address exactly the question of how a bidder can learn to bid approximately optimally in a repeated auction setting where she does not know her value for the good for sale and where that value could be changing over time. The setting of learning in auctions with an unknown value poses an interesting interplay between exploration and exploitation that is not standard in the online learning literature: in order for the bidder to get feedback on her value, she has to bid high enough to win the good with higher probability and hence receive some information about that underlying value. However, the latter requires paying a higher price. Thus, there is an inherent trade-off between value-learning and cost. The main point of this paper is to address the problem of learning how to bid in such unknown valuation settings with partial win-only feedback, so as to minimize regret with respect to the best fixed bid in hindsight.

On one extreme, one can treat the problem as a Multi-Armed Bandit (MAB) problem, where each possible bid that the bidder could submit (e.g. any multiple of a cent between 0 and some upper bound on her value) is treated as an arm. Then, standard MAB algorithms (see e.g. [13]) can achieve regret rates that scale linearly with the number of such discrete bids. The latter can be very slow and does not leverage the structure of utilities and the form of partial feedback that arises in online auction markets. Recently, the authors in [34] addressed learning with this type of partial feedback in the context of repeated single-item second-price auctions. However, their approach is tailored to the second-price auction and does not address more complex auctions.

#### Contributions.

Our work addresses learning with partial feedback in general mechanism design environments. Importantly, we allow for randomized auctions with probabilistic outcomes, encompassing the case of sponsored search auctions, where the outcome of the mechanism (getting a click) is inherently randomized.

Our results are enabled by analyzing a novel setting in online learning of independent interest, which we call learning with outcome-based feedback. In particular, our setting generalizes the setting of learning with feedback graphs [4], in a way that is crucial for applying it to the auction settings of interest. Roughly, the setting is defined as follows: The learner chooses an action $b \in B$ (e.g. a bid in an auction). The adversary chooses an allocation function $x_t$, which maps an action to a distribution over a set of potential outcomes $O$ (e.g. the probability of getting a click), and a reward function $r_t$ that maps an action-outcome pair to a reward (e.g. the utility conditional on getting a click with a bid of $b$). Then, an outcome $o_t$ is chosen based on the distribution $x_t(b)$ and a reward $r_t(b, o_t)$ is incurred. The learner gets to observe the function $x_t$ and the reward function $r_t(\cdot, o_t)$ for the realized outcome $o_t$ (i.e. she learns the probability of a click and the expected payment as a function of her bid and, if she gets clicked, also learns her value).

Our main contribution is an algorithm which we call WIN-EXP, which achieves regret $O(\sqrt{T |O| \log(|B|)})$. The latter is inherently better than the generic multi-armed bandit regret of $O(\sqrt{T |B|})$ and takes advantage of the particular feedback structure. Our algorithm is a variant of the EXP3 algorithm [7], with a carefully crafted unbiased estimate of the utility of each action, which has lower variance than the unbiased estimate used in the standard EXP3 algorithm. This result could be of independent interest and applicable beyond learning in auction settings.

This setting encompasses learning in many auctions of interest, where bidders learn their value for a good only when they win it and where the good allocated to the bidder is determined by some randomized allocation function. In particular, we give a detailed application to the context of sponsored search, where our feedback assumptions match the type of feedback that advertisers receive from the system.

We also extend our results to cases where the space of actions is a continuum (e.g. all bids in an interval $[0,1]$). We show that in many auction settings, under appropriate assumptions on the utility functions, a regret of $O(\sqrt{T \log(T)})$ can be achieved by simply discretizing the action space to a sufficiently small uniform grid and running our WIN-EXP algorithm. This result encompasses the results of [34] for second price auctions, learning in first-price and all-pay auctions, as well as learning in sponsored search with smoothness assumptions on the utility function. We also show how smoothness of the utility can easily arise due to the inherent randomness that exists in the mechanism run in sponsored search.

Finally, we provide two further extensions: switching regret and feedback graphs over outcomes. The former adapts our algorithm to achieve good regret against a sequence of bids rather than a fixed bid. This has implications for faster convergence to approximate efficiency of the outcome (price of anarchy). Feedback graphs address the idea that in many cases the learner could be receiving information about items other than the item he won (through correlations in the values for these items). This essentially corresponds to adding a feedback graph over outcomes: when an outcome is chosen, the learner learns the reward function for all neighboring outcomes in the feedback graph. We provide improved results that mainly depend on the independence number of the graph rather than the number of possible outcomes.

#### Related Work.

Our work lies at the intersection of two main areas: no-regret learning in Game Theory and Mechanism Design, and Contextual Bandits.

No regret learning in Game Theory and Mechanism Design. No-regret learning has received a lot of attention in the Game Theory and Mechanism Design literature [18]. Most of the existing literature, however, focuses on the problem from the side of the auctioneer, who tries to maximize revenue through repeated rounds without knowing a priori the valuations of the bidders [4, 5, 11, 12, 16, 20, 21, 25, 29, 30, 31].
These works are centered around different auction formats, such as sponsored search ad auctions, the pricing of inventory and single-item auctions. Closely related to our work are the works of Dikkala and Tardos [22] and Balseiro and Gur [8]. Dikkala and Tardos [22] show, in a setting where bidders have to experiment in order to learn their valuations, that the seller can increase revenue by offering them an initial credit as an incentive to experiment. Balseiro and Gur [8] introduce a family of dynamic bidding strategies in repeated second-price auctions, where advertisers adjust their bids throughout the campaign, and analyze both regret minimization and market stability. There are two key differences from our setting: first, Balseiro and Gur consider bidders whose goal is an expenditure rate that guarantees that the available campaign budget is spent at an optimal pace; second, because their target is the expenditure rate at every timestep, they assume that the bidders get information about the value of the slot being auctioned and decide how to adjust their bid based on this information. Moreover, several works analyze the properties of auctions when bidders adopt a no-regret learning strategy [10, 14, 32]. None of these works, however, addresses the question of learning more efficiently in the unknown valuation model; they either invoke generic MAB algorithms or develop tailored full-information algorithms for the case where the bidder knows his value. Another line of research takes a Bayesian approach to learning in repeated auctions and makes large market assumptions, analyzing learning to bid with an unknown value under a Mean Field Equilibrium condition [1, 9, 24]. No-regret learning is complementary and orthogonal to the mean field approach, as it does not impose any stationarity assumption on the evolution of the valuations of the bidder or the behavior of his opponents.

Contextual Bandits. Our work is also related to the literature on contextual bandits [13]. To establish this connection we observe that the policies and the actions in contextual bandit terminology translate into discrete bids and groups of bids for which we learn the rewards in our work. The difference between the two is that for each action in contextual bandits we get a single reward, whereas in our setting we observe a group of rewards, one for each action in the group. Moreover, the fact that we allow for randomized outcomes adds an extra complication, nonexistent in contextual bandits. In addition, our work is closely related to the literature on online learning with feedback graphs [2, 3, 19, 28]. In fact, we propose a new setting in online learning, namely learning with outcome-based feedback, which is a generalization of learning with feedback graphs and is essential when applied to a variety of auctions, including sponsored search, single-item second-price, single-item first-price and single-item all-pay auctions.

Finally, our work is most closely related to Weed et al. [34], who adopt the point of view of the bidders for the sequential Vickrey auction. In their work, the true valuation of the item is revealed to the bidders only when they win the item. The authors identify and analyze bidding strategies that mitigate both overbidding (potential losses) and underbidding (opportunity cost) in two scenarios of sequential valuations, the stochastic and the adversarial one. However, their setting falls into the family of settings for which our novel and generic WIN-EXP algorithm produces good regret bounds and, as a result, we are able to fully retrieve the regret that their algorithms yield, up to a tiny increase in the constants.

## 2 Learning in Auctions without Knowing your Value

For simplicity of exposition, we start with a simple single-dimensional mechanism design setting, but our results extend to multi-dimensional (multi-item) mechanisms, as we will see in Section 4. Let $n$ be the number of bidders. Each bidder $i$ has a value $v_i \in [0,1]$ per unit of a good and submits a bid $b_i \in B$, where $B$ is a discrete set of bids (e.g. a uniform $\epsilon$-grid of $[0,1]$). Given the bid profile of all bidders, the auction allocates a unit of the good to the bidders. The allocation rule for bidder $i$ is given by $X_i(b_i, b_{-i})$, which we assume to be an increasing function of the bid of the bidder. Moreover, the mechanism defines a per-unit payment function $P_i(b_i, b_{-i}) \in [0,1]$, which is also increasing in the bid of the bidder. The overall utility of the bidder is quasi-linear, i.e. $u_i(b_i, b_{-i}) = (v_i - P_i(b_i, b_{-i})) \cdot X_i(b_i, b_{-i})$.

#### Online Learning with Partial Feedback.

The bidders participate in this mechanism repeatedly. At each iteration $t$, each bidder $i$ has some value $v_{it}$ and submits a bid $b_{it}$. The mechanism has some time-varying allocation function $X_{it}(\cdot, \cdot)$ and payment function $P_{it}(\cdot, \cdot)$. We assume that the bidder does not know his value $v_{it}$, nor the bids of his opponents $b_{-i,t}$, nor the allocation and payment functions, before submitting a bid.

At the end of each iteration, he gets an item with probability $X_{it}(b_{it}, b_{-i,t})$ and observes his value $v_{it}$ for the item only when he gets one (e.g. in sponsored search, the good allocated is the probability of getting clicked, and the bidder only observes his value if he gets clicked). Moreover, we assume that he gets to observe his allocation and payment functions for that iteration, i.e. he gets to observe the two functions $x_{it}(\cdot) = X_{it}(\cdot, b_{-i,t})$ and $p_{it}(\cdot) = P_{it}(\cdot, b_{-i,t})$. Finally, he receives utility $(v_{it} - p_{it}(b_{it})) \cdot \mathbb{1}\{\text{item is allocated to him}\}$, or in other words expected utility $u_{it}(b_{it}) = (v_{it} - p_{it}(b_{it})) \cdot x_{it}(b_{it})$. Given that we focus on learning from the perspective of a single bidder, we will drop the index $i$ from all notation and instead write $x_t(\cdot)$, $p_t(\cdot)$, $u_t(\cdot)$, $v_t$, etc. The goal of the bidder is to achieve small expected regret with respect to any fixed bid in hindsight: $R(T) = \sup_{b^* \in B} \mathbb{E}\left[\sum_{t=1}^{T} \left(u_t(b^*) - u_t(b_t)\right)\right]$.
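To make the objective concrete, the following minimal sketch computes the hindsight regret of a given bid sequence against the best fixed bid on a grid. It assumes the quasi-linear expected utility described in this section (value minus per-unit payment, times the allocation probability). The function names `expected_utility` and `regret_vs_best_fixed_bid` are hypothetical and not part of the paper, and the sketch evaluates one realized bid sequence rather than the expectation over the learner's randomness.

```python
from typing import Callable, Sequence

Curve = Callable[[float], float]  # maps a bid to a number in [0, 1]


def expected_utility(value: float, bid: float, alloc: Curve, pay: Curve) -> float:
    """Assumed quasi-linear form: (value - per-unit payment) * allocation probability."""
    return (value - pay(bid)) * alloc(bid)


def regret_vs_best_fixed_bid(
    values: Sequence[float],
    bids: Sequence[float],
    allocs: Sequence[Curve],
    pays: Sequence[Curve],
    grid: Sequence[float],
) -> float:
    """Hindsight regret of the submitted bids against the best single bid in `grid`."""
    rounds = list(zip(values, allocs, pays))
    # Expected utility actually collected by the submitted bids.
    realized = sum(expected_utility(v, b, x, p) for (v, x, p), b in zip(rounds, bids))
    # Best cumulative expected utility of any one bid used in every round.
    best_fixed = max(sum(expected_utility(v, c, x, p) for v, x, p in rounds) for c in grid)
    return best_fixed - realized
```

For instance, with a step allocation curve that wins only above 0.5 and a constant payment of 0.5, a learner who bids 0.25 and then 0.75 with value 0.9 in both rounds forgoes the first round's utility relative to always bidding 0.75.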

## 3 Abstraction: Learning with Win-Only Feedback

We abstract the learner’s problem a bit further, to a setting that could be of interest beyond auctions.

#### Learning with Win-Only Feedback.

Every day a learner picks an action $b_t$ from a finite set $B$. The adversary chooses a reward function $r_t : B \to [-1,1]$ and an allocation function $x_t : B \to [0,1]$. The learner wins a reward $r_t(b_t)$ with probability $x_t(b_t)$. Let $u_t(b) = r_t(b)\, x_t(b)$ be the learner’s expected utility from action $b$. After each iteration, if he won the reward then he learns the whole reward function $r_t(\cdot)$, while he always learns the allocation function $x_t(\cdot)$.

Can the learner achieve regret $O(\sqrt{T \log(|B|)})$ rather than bandit-feedback regret $O(\sqrt{T |B|})$?

In the case of the auction learning problem, the reward function takes the parametric form $r_t(b) = v_t - p_t(b)$ and the learner needs to learn $v_t$ and $p_t(\cdot)$ at the end of each iteration, when he wins the item. This is in line with the feedback structure we described in the previous section.

We consider the following adaptation of the EXP3 algorithm with unbiased estimates based on the information received. It is also notationally useful throughout the section to denote with $A_t$ the event of winning a reward at time $t$. Then, we can write: $\Pr[A_t \mid b_t = b] = x_t(b)$ and $\Pr[A_t] = \sum_{b' \in B} \pi_t(b')\, x_t(b')$, where with $\pi_t(\cdot)$ we denote the multinomial distribution from which bid $b_t$ is drawn. With this notation we define our WIN-EXP algorithm in Algorithm 1. We note here that our generic family of WIN-EXP algorithms can be parametrized by the step-size $\eta$, the estimate of the utility $\tilde{u}_t$ that the learner gets at each round and the feedback structure that he receives.
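Algorithm 1 itself is not reproduced in this section, so the following is only a minimal sketch of an exponential-weights learner with win-only feedback. The estimator in `win_exp_estimates` is an assumption: it is one importance-weighted form that is unbiased for the utility shifted down by 1, and it is not quoted from the paper. All function names and the `environment` callback are hypothetical.

```python
import math
import random


def win_exp_estimates(won, reward_fn, alloc_fn, probs, bids):
    """Estimate u_t(b) - 1 for every bid b from win-only feedback.

    Assumed form (not quoted from the paper):
      won:  (r_t(b) - 1) * x_t(b) / Pr[A_t]
      lost: -(1 - x_t(b)) / (1 - Pr[A_t])
    `reward_fn` is only consulted when `won` is True.
    """
    # Probability of winning under the learner's current bid distribution.
    p_win = sum(pi * alloc_fn(b) for pi, b in zip(probs, bids))
    if won:
        return [(reward_fn(b) - 1.0) * alloc_fn(b) / p_win for b in bids]
    return [-(1.0 - alloc_fn(b)) / (1.0 - p_win) for b in bids]


def exp_weights_update(probs, estimates, eta):
    """Multiplicative-weights step: pi(b) is scaled by exp(eta * estimate(b))."""
    weights = [pi * math.exp(eta * u) for pi, u in zip(probs, estimates)]
    total = sum(weights)
    return [w / total for w in weights]


def run_win_exp(bids, environment, rounds, eta, seed=0):
    """Play `rounds` iterations; `environment(t)` returns (reward_fn, alloc_fn)."""
    rng = random.Random(seed)
    probs = [1.0 / len(bids)] * len(bids)
    for t in range(rounds):
        reward_fn, alloc_fn = environment(t)
        bid = rng.choices(bids, weights=probs)[0]
        won = rng.random() < alloc_fn(bid)
        # The reward function is revealed only on a win; the allocation curve always is.
        estimates = win_exp_estimates(won, reward_fn, alloc_fn, probs, bids)
        probs = exp_weights_update(probs, estimates, eta)
    return probs
```

Because the assumed estimates are never positive, every weight shrinks or stays put in each round; the sketch does not guard against numerical underflow over very long horizons.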

#### Bounding the Regret.

We first bound the first and second moment of the unbiased estimates built at each iteration in the WIN-EXP algorithm.

###### Lemma 1.

At each iteration , for any action , the random variable is an unbiased estimate of the true expected utility , i.e.: and has expected second moment bounded by: .

###### Proof.

Let denote the event that the reward was won. We have:

Similarly for the second moment:

where the last inequality holds since and . ∎

We are now ready to prove our main theorem:

###### Theorem 2 (Regret of Win-Exp).

The regret of the WIN-EXP algorithm with the aforementioned unbiased estimates and step size is: .

###### Proof.

Observe that regret with respect to utilities is equal to regret with respect to the translated utilities . We use the fact that the exponential weights update with an unbiased estimate of the true utilities achieves expected regret of the following form (a detailed proof of this claim can be found in Appendix G):

Invoking the bound on the second moment by Lemma 1, we get:

Picking , we get the theorem. ∎

## 4 Beyond Binary Outcomes: Outcome-Based Feedback

In the win-only feedback framework there were two possible outcomes that could happen: either you win the reward or not. We now consider a more general problem, where there are more than two outcomes and you learn your reward function for the outcome you won. Moreover, the outcome that you won is also a probabilistic function of your action. The proofs for the results presented in this section can be found in Appendix B.

#### Learning with Outcome-Based Feedback.

Every day a learner picks an action $b_t$ from a finite set $B$. There is a set of payoff-relevant outcomes $O$. The adversary chooses a reward function $r_t : B \times O \to [-1,1]$, which maps an action and an outcome to a reward, and he also chooses an allocation function $x_t : B \to \Delta(O)$, which maps an action to a distribution over the outcomes. Let $x_t^b(o)$ be the probability of outcome $o$ under action $b$. An outcome $o_t \in O$ is chosen based on the distribution $x_t(b_t)$. The learner wins reward $r_t(b_t, o_t)$ and observes the whole outcome-specific reward function $r_t(\cdot, o_t)$. He always learns the allocation function $x_t(\cdot)$ after the iteration. Let $u_t(b) = \sum_{o \in O} r_t(b, o)\, x_t^b(o)$ be the expected utility from action $b$.

We consider the following adaptation of the EXP3 algorithm with unbiased estimates based on the information received. It is also notationally useful throughout the section to consider $o_t$ as the random variable of the outcome chosen at time $t$. Then, we can write: $\Pr_t[o_t \mid b] = x_t^b(o_t)$ and $\Pr_t[o_t] = \sum_{b \in B} \pi_t(b)\, \Pr_t[o_t \mid b]$. With this notation and based on the feedback structure, we define our WIN-EXP algorithm with parameters $\eta$ and $\tilde{u}_t$. The full Algorithm 2 can be found in Appendix A.

###### Theorem 3 (Regret of Win-Exp with outcome-based feedback).

The regret of Algorithm 2 with and step size is: .
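As a rough illustration of how an estimate can be built from outcome-based feedback, the sketch below importance-weights the observed outcome-specific reward by the ratio of the outcome's probability under each bid to its probability under the learner's bid distribution. This particular form is an assumption chosen because it is unbiased for the utility shifted down by 1; Algorithm 2 lives in the appendix and is not reproduced here. The name `outcome_estimates` and the dictionary layout of `alloc` are hypothetical.

```python
def outcome_estimates(outcome, reward_fn, alloc, probs, bids):
    """Estimate u_t(b) - 1 for every bid after observing `outcome`.

    alloc[b][o] is the probability of outcome o under bid b, and
    reward_fn(b) is the observed reward function r_t(b, outcome).
    Assumed form: (r_t(b, o_t) - 1) * Pr[o_t | b] / Pr[o_t].
    """
    # Probability of the realized outcome under the learner's bid distribution.
    p_outcome = sum(pi * alloc[b][outcome] for pi, b in zip(probs, bids))
    return [(reward_fn(b) - 1.0) * alloc[b][outcome] / p_outcome for b in bids]
```

With two outcomes (win or lose) and a zero reward on losing, this reduces to the win-only case.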

#### Applications to Learning in Auctions.

We now present a series of applications of the main result of this section to several settings of learning in auctions, even beyond single-item or single-dimensional ones.

###### Example 1 (Second-price auction).

Suppose that the mechanism run at each iteration is just the second price auction. Then, we know that the allocation function is simply of the form: and the payment function is simply the second highest bid. In this case, observing the allocation and payment functions at the end of the auction simply boils down to observing the highest other bid. In fact, in this case we have a trivial setting where the bidder gets an allocation of either or and if we let , then the unbiased estimate of the utility takes the simpler form (assuming the bidder always loses in case of ties) of: if and in any other case. Our main theorem gives regret . We note that this theorem recovers exactly the results of Weed et al. [34], by simply using as a uniform discretization of the bidding space, for an appropriately defined constant (see Appendix B.1 for an exact comparison of the results).
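The simplification described in this example can be sketched as follows. The code assumes the importance-weighted win-only estimate (unbiased for the utility shifted down by 1) specialized to a 0/1 allocation, with the player losing ties; the exact expression in the paper is not reproduced here, and the name `second_price_estimates` is hypothetical.

```python
def second_price_estimates(realized_bid, highest_other_bid, value, probs, bids):
    """Win-only estimates of u_t(b) - 1 in a second-price auction (ties lose).

    `value` is only used when the realized bid won, mirroring win-only feedback.
    """
    won = realized_bid > highest_other_bid
    wins = [b > highest_other_bid for b in bids]
    # Probability, under the learner's bid distribution, of the realized event.
    p_event = sum(pi for pi, w in zip(probs, wins) if w == won)
    if won:
        # Every winning bid would have paid the same price: the highest other bid.
        gain = value - highest_other_bid - 1.0
        return [gain / p_event if w else 0.0 for w in wins]
    return [0.0 if w else -1.0 / p_event for w in wins]
```

Only the highest other bid and, on a win, the value are needed, which matches the observation that the feedback boils down to the highest competing bid.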

###### Example 2 (Value-per-click auctions).

This is a variant of the binary outcome case analyzed in Section 3, where , i.e. get clicked or not. Hence, , and , while . Our main theorem gives regret .

###### Example 3 (Multi-slot auctions).

Consider the case of multiple slot auctions where the bidder has value per impression for appearing in slot . Given a bid , the mechanism defines a probability distribution over the slots that the bidder will be allocated and also defines a payment function, which depends on the bid of the bidder and the slot acquired. When a bidder gets allocated a slot, he gets to observe his value for that slot. Thus, the set of outcomes is equal to , with slot associated with not getting any slot. The rewards are also of the form: for some payment function dependent on the auction format. Our main theorem then gives regret .

###### Example 4 (Unit-demand multi-item auctions).

Consider the case of items at an auction where the bidder has value for only one item . Given a bid , the mechanism defines a probability distribution over the items that the bidder will be allocated and also defines a payment function, which depends on the bid of the bidder and the item allocated. When a bidder gets allocated an item he gets to observe his value for that item. Thus, the set of outcomes is equal to , with outcome associated with not getting any item. The rewards are also of the form: for some payment function dependent on the auction format. Our main theorem then gives regret .

### 4.1 Batch Rewards Per-Iteration and Sponsored Search Auctions

We now consider the case of sponsored search auctions, where the learner participates in multiple auctions per-iteration. At each of these auctions he has a chance to win and get feedback on his value. To this end, we abstract the learning with win-only feedback setting to a setting where multiple rewards are awarded per-iteration. The allocation function remains the same throughout the iteration but the reward functions can change.

#### Outcome-Based Feedback with Batch Rewards.

Every iteration is associated with a set of reward contests . The learner picks an action , which is used at all reward contests. For each the adversary picks an outcome specific reward function . Moreover, the adversary chooses an allocation function , which is not -dependent. At each , an outcome is chosen based on distribution and independently. The learner receives reward from that contest. The overall realized utility from that iteration is the average reward: , while the expected utility from any bid is: . We assume that at the end of each iteration the learner receives as feedback the average reward function conditional on each realized outcome, i.e. if we let , then the learner learns the function: (with the convention that if ) as well as the realized frequencies for all outcomes .

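With this at hand we can define the batch analogue of our unbiased estimates of the previous section. To avoid any confusion we define: $\Pr_t[o \mid b] = x_t^b(o)$ and $\Pr_t[o] = \sum_{b \in B} \pi_t(b)\, \Pr_t[o \mid b]$, to denote that these probabilities only depend on $t$ and not on the individual contest $\tau$. Writing $f_t(o)$ for the realized frequency of outcome $o$ and $Q_t(b, o)$ for the average reward function conditional on outcome $o$, the estimate of the utility will be:

$$\tilde{u}_t(b) = \sum_{o \in O} \frac{\Pr_t[o \mid b]}{\Pr_t[o]}\, f_t(o)\, \left(Q_t(b, o) - 1\right) \qquad (1)$$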

###### Corollary 4.

The WIN-EXP algorithm with the latter unbiased utility estimates and step size , achieves regret in the outcome-based feedback with batch rewards setting at most: .

The proof can be found in Appendix D. It is also interesting to note that the same result holds if instead of using in the expected utility (Equation (10)), we used its mean value, which is . This would not change any of the derivations above. The nice property of this alternative is that the learner does not need to learn the realized fraction of each outcome, but only the expected fraction of each outcome. This is already contained in the function , that we already assumed was given to the learner at the end of each iteration. Thus, with these new estimates, the learner does not need to observe . In Appendix C we also discuss the case where different periods can have different number of rewards and how to extend our estimate to that case. The batch rewards setting finds an interesting application in the case of learning in sponsored search, as we describe below.
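One way to read the batch estimate discussed above is as a frequency-weighted sum of per-outcome importance-weighted terms. The sketch below implements that reading; the formula in the docstring is an assumption consistent with the feedback described in this subsection (average reward per realized outcome plus realized frequencies), and the names `batch_estimates`, `frequencies` and `avg_reward` are hypothetical.

```python
def batch_estimates(frequencies, avg_reward, alloc, probs, bids):
    """Batch estimate of u_t(b) - 1 for every bid b.

    frequencies[o]: fraction of the iteration's contests that ended in outcome o
    avg_reward(b, o): average reward function over the contests with outcome o
    alloc[b][o]: probability of outcome o under bid b
    Assumed form: sum_o Pr[o | b] / Pr[o] * frequencies[o] * (avg_reward(b, o) - 1).
    """
    estimates = []
    for b in bids:
        total = 0.0
        for o, freq in frequencies.items():
            if freq == 0.0:
                continue  # outcome never realized, so it contributes nothing
            p_outcome = sum(pi * alloc[c][o] for pi, c in zip(probs, bids))
            total += alloc[b][o] / p_outcome * freq * (avg_reward(b, o) - 1.0)
        estimates.append(total)
    return estimates
```

Passing the expected frequencies under the submitted bid in place of the realized ones corresponds to the alternative mentioned above, in which the learner does not need to observe the realized fractions.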

###### Example 5 (Sponsored Search).

In the case of sponsored search auctions, the latter boils down to learning the average value for the clicks that were generated, as well as the cost-per-click function , which is assumed to be constant throughout the period . Given these quantities, the learner can compute: and . An advertiser can keep track of the traffic generated by a search engine ad and hence, can keep track of the number of clicks from the search engine and the value generated by each of these clicks (conversion). Thus, he can estimate . Moreover, he can elicit the probability of click (aka click-through-rate or CTR) curves and the cost-per-click (CPC) curves over reasonably small periods of time of about a few hours. Thus, with these at hand we can apply our batch reward outcome based feedback algorithm and get regret that does not grow linearly with , but only as . Our main assumption is that the expected CTR and CPC curves during this small period of a few hours remains constant. This is a reasonable assumption when feedback can be elicited frequently, which is the case in practice.

## 5 Continuous Actions with Piecewise-Lipschitz Rewards

In this section, we extend our discussions to continuous action spaces; that is, we allow the action of each bidder to lie in a continuous action space (e.g. a uniform interval in ). To assist us in our analysis, we are going to use the following discretization result by Kleinberg [27] (in [27], Kleinberg discusses the uniform discretization of continuum-armed bandits, and in [26] the authors extend the results to the case of Lipschitz-armed bandits). For what follows in this section, let
be the regret of the bidder, after rounds with respect to an action space . Moreover, for any pairs of action spaces and we let: ,
denote the discretization error incurred by optimizing over instead of . The proofs of this section can be found in Appendix E.

Observe now that in the setting of Weed et al [34] the discretization error was: if , as we discussed in Section 4 and that was the key that allowed us to recover this result, without adding an extra in the regret of the bidder. To achieve that, we construct the following general class of utility functions:

###### Definition 6 (-Piecewise Lipschitz Average Utilities).

A learning setting with action space , is said to have -Piecewise Lipschitz Average Utilities if the average utility function satisfies the following conditions: the bidding space is divided into -dimensional cubes with edge length at least and within each cube the utility is -Lipschitz with respect to the norm. Moreover, for any boundary point there exists a sequence of non-boundary points whose limit average utility is at least as large as the average utility of the boundary point.

###### Lemma 7 (Discretization Error for Piecewise Lipschitz).

Let be a continuous action space and a uniform -grid of , such that (i.e. consists of all the points whose coordinates are multiples of a given ). Assume that the average utility function is -Piecewise -Lipschitz. Then, the discretization error of is bounded as: .

If we know the Lipschitzness constant mentioned above, the time horizon and , then our WIN-EXP algorithm for Outcome-Based Feedback with Batch Rewards yields regret as defined by the following theorem. In Appendix E, we also show how to deal with unknown parameters , and by applying a standard doubling trick.

###### Theorem 8.

Let be the action space as defined in our model and let be a uniform -grid of . The WIN-EXP algorithm with unbiased estimates given by Equation 10 on space with step size and
achieves expected regret at most in the outcome-based feedback with batch rewards and -Piecewise Lipschitz average utilities. Interestingly, the above regret bound can help to retrieve two familiar expressions for the regret. First, when (i.e. when the function is constant within each cube), which is the case for the second price auction analyzed by [34], . Hence, we recover the bounds from the prior sections up to a tiny increase. Second, when , then we have functions that are -Lipschitz in the whole space and the regret bound that we retrieve is: , which is of the type achieved in continuous Lipschitz bandit settings.

###### Example 6 (First Price and All-Pay Auctions).

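Consider the case of learning in first price or all-pay auctions. In the former, the highest bidder wins and pays his bid, while in the latter the highest bidder wins and every player pays his bid whether he wins or loses. Let $B_t$ be the highest other bid at time $t$. Then the average hindsight utility of the player in each auction is as follows (for simplicity, assume the player loses in case of ties, though we can handle arbitrary random tie-breaking rules):

$$\frac{1}{T} \sum_{t=1}^{T} u_t(b) = \frac{1}{T} \sum_{t=1}^{T} (v_t - b) \cdot \mathbb{1}\{b > B_t\} \qquad \text{(first price)}$$

$$\frac{1}{T} \sum_{t=1}^{T} u_t(b) = \frac{1}{T} \sum_{t=1}^{T} \left( v_t \cdot \mathbb{1}\{b > B_t\} - b \right) \qquad \text{(all-pay)}$$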

Let be the smallest difference between the highest other bid at any two iterations and (this is an analogue of the used by [34] in second price auctions). Then observe that the average utilities in this setting are -Piecewise -Lipschitz: Between any two highest other bids, the average allocation, , of the player remains constant and the only thing that changes is his payment which grows linearly. Hence, the derivative at any bid between any two such highest other bids is upper bounded by .
Hence, by applying Theorem 8, our WIN-EXP algorithm with a uniform discretization on a -grid, for , achieves regret , where we used that and for any of these auctions.
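The piecewise structure in this example can be explored numerically. The sketch below evaluates the average hindsight utility of a fixed bid in first price and all-pay auctions, following the verbal description above (ties lose), and finds the best bid on a uniform grid of $[0,1]$. The helper names are hypothetical, and the grid constructor assumes that one over the grid spacing is close to an integer.

```python
def uniform_grid(eps):
    """Bids 0, eps, 2*eps, ... up to 1 (assumes 1/eps is close to an integer)."""
    steps = int(round(1.0 / eps))
    return [k * eps for k in range(steps + 1)]


def avg_first_price_utility(bid, values, highest_other_bids):
    """Winner pays his bid; the player loses ties and pays nothing when he loses."""
    total = sum(v - bid for v, d in zip(values, highest_other_bids) if bid > d)
    return total / len(values)


def avg_all_pay_utility(bid, values, highest_other_bids):
    """The player always pays his bid and collects the value only when he wins."""
    total = sum((v if bid > d else 0.0) - bid for v, d in zip(values, highest_other_bids))
    return total / len(values)


def best_on_grid(avg_utility, grid, values, highest_other_bids):
    """Best average hindsight utility attainable with a bid on the grid."""
    return max(avg_utility(b, values, highest_other_bids) for b in grid)
```

With values of 1 in two rounds and highest other bids of 0.3 and 0.5, a first price bidder can get arbitrarily close to an average utility of 0.5 by bidding just above 0.5, while the best bid on a grid of spacing 0.25 earns 0.25; refining the grid shrinks this gap, which is the discretization error the section bounds.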

### 5.1 Sponsored Search with Lipschitz Utilities

In this subsection, we extend our analysis of learning in the sponsored search auction model (Example 5) to the continuous bid space case, i.e., each bidder can submit a bid $b \in [0,1]$. As a reminder, the utility function is $u_t(b) = x_t(b)\,(\hat{v}_t - p_t(b))$, where $b \in [0,1]$, $\hat{v}_t$ is the average value for the clicks at iteration $t$, $x_t(\cdot)$ is the CTR curve and $p_t(\cdot)$ is the CPC curve. These curves are implicitly formed by running some form of a Generalized Second Price auction (GSP) at each iteration to determine the allocation and payment rules.

We show in this section that the form of GSP run in reality gives rise to Lipschitz utilities, under some minimal assumptions. Therefore, we can apply the results in Section 5 to get regret bounds even with respect to the continuous bid space. The aforementioned Lipschitzness is also reinforced by real world data sets from Microsoft’s sponsored search auction system. We begin by providing a brief description of the type of Generalized Second Price auction run in practice.

###### Definition 9 (Weighted-GSP).

Each bidder is assigned a quality score . Bidders are ranked according to their score-weighted bid , typically called the rank-score. Every bidder whose rank-score does not pass a reserve is discarded. Bidders are allocated slots in decreasing order of rank-score. Each bidder is charged per-click the lowest bid he could have submitted and maintained the same slot. Hence, if a bidder is allocated a slot and is the rank-score of the bidder in slot , then he is charged per-click. We denote with , the utility of bidder under a bid profile and score profile .

The quality scores are typically highly random and dependent on the features of the advertisement and the user that is currently viewing the page. Hence, a reasonable modeling assumption is that the scores at each auction are drawn i.i.d. from some distribution with CDF . We now show that if the CDF is Lipschitz (i.e. admits a bounded density), then the utilities of the bidders are also Lipschitz.

###### Theorem 10 (Lipschitzness of the utility of Weighted GSP).

Suppose that the score of each bidder in a weighted GSP is drawn independently from a distribution with an Lipschitz CDF . Then, the expected utility is Lipschitz wrt .

Thus, we see that when the quality scores in sponsored search are drawn from -Lipschitz CDFs and the reserve is lower bounded by , then the utilities are -Lipschitz and we can achieve good regret bounds by using the WIN-EXP algorithm with batch rewards, with action space being a uniform -grid, and unbiased estimates given by Equation (10) or Equation (1). In the case of sponsored search the second unbiased estimate takes the following simple form:

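$$\tilde{u}_t(b) = \frac{x_t(b) \cdot x_t(b_t)}{\sum_{b' \in B} \pi_t(b')\, x_t(b')} \left(\hat{v}_t - p_t(b) - 1\right) - \frac{(1 - x_t(b)) \cdot (1 - x_t(b_t))}{\sum_{b' \in B} \pi_t(b')\, (1 - x_t(b'))} \qquad (2)$$

where $\hat{v}_t$ is the average value from the clicks that happened during iteration $t$, $x_t(\cdot)$ is the CTR curve, $p_t(\cdot)$ is the CPC curve, $b_t$ is the realized bid that the bidder submitted and $\pi_t(\cdot)$ is the distribution over discretized bids of the algorithm at that iteration. We can then apply Theorem 8 to get the following guarantee: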

## 6 Further Extensions

#### Switching Regret and Implications for Price of Anarchy

We show below that our results extend to the case where, instead of a single optimal bid, there is a sequence of switches in the optimal bid. Using the results presented in [23] and adapting them to our setting, we get the following corollary (with proof in Appendix F).

###### Corollary 12.

Let be the number of times that the optimal bid switches in a horizon of rounds. Then, using Algorithm in [23] with and any we can achieve expected switching regret at most:

This result has implications for the price of anarchy (PoA) of auctions. In the case of sponsored search where bidders' valuations change over time adversarially but non-adaptively, our result shows that if the valuation changes no more often than the number of switches allowed in Corollary 12, we can compete with any bid that is a function of the value of the bidder at each iteration, with regret rate given by that corollary. Therefore, by standard PoA arguments [15, 33], this would imply convergence to an approximately efficient outcome at a faster rate than bandit regret rates.

#### Feedback Graphs over Outcomes

We now extend Section 5, by assuming that there is a directed feedback graph over the outcomes. When outcome is chosen, the player observes not only the outcome specific reward function , for that outcome, but also for any outcome in the out-neighborhood of in the feedback graph, which we denote with . Correspondingly, we denote with the incoming neighborhood of in . Both neighborhoods include self-loops. Let be the sub-graph of that contains only outcomes for which and subsequently, let and be the in and out neighborhoods of this sub-graph.

Based on this feedback graph we construct a WIN-EXP algorithm with step-size , utility estimate and feedback structure as described in the previous paragraph. With these changes we can show that the regret grows as a function of the independence number of the feedback graph, denoted with , rather than the number of outcomes. The full Algorithm 4 can be found in Appendix A.

###### Theorem 13 (Regret of Win-Exp-G).

The regret of the WIN-EXP-G algorithm with step size is bounded by: .

In the case of learning in auctions, the feedback graph over outcomes can encode the possibility that winning an item helps the bidder uncover her value for other items. For instance, in a combinatorial auction for multiple items, the reader should think of each node in the feedback graph as a bundle of items. The graph then encodes the fact that winning a bundle can reveal the value of all bundles in its out-neighborhood. If the feedback graph has a small independence number, then a much better regret is achieved than the dependence on the total number of bundles that would have been derived from our outcome-based feedback results of prior sections, had we treated each bundle of items separately as an outcome.
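The gap between the number of outcomes and the independence number can be checked by brute force on a small example. The sketch below assumes a hypothetical feedback structure that the text does not specify: winning a bundle reveals the value of every sub-bundle. Edge directions are ignored when testing independence, as is usual for feedback graphs. The function names are illustrative only, and the exhaustive search is only practical for a handful of nodes.

```python
from itertools import combinations


def independence_number(nodes, edges):
    """Size of the largest set of nodes with no edge between any two of them.

    Edge directions and self-loops are ignored. Exhaustive search, so only
    suitable for very small graphs.
    """
    adjacent = {frozenset((u, v)) for u, v in edges if u != v}
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            if all(frozenset(pair) not in adjacent
                   for pair in combinations(subset, 2)):
                return size
    return 0


def sub_bundle_graph(num_items):
    """Hypothetical feedback graph: bundle a points to every bundle b that is a subset of a."""
    items = range(num_items)
    bundles = [frozenset(c)
               for k in range(num_items + 1)
               for c in combinations(items, k)]
    edges = [(a, b) for a in bundles for b in bundles if b <= a]
    return bundles, edges


bundles, edges = sub_bundle_graph(3)
print(len(bundles), independence_number(bundles, edges))
# -> 8 3
```

With three items there are eight bundles, but under this assumed structure no more than three of them are pairwise unrelated by inclusion, so a bound that scales with the independence number depends on 3 rather than 8.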

## References

- [1] Sachin Adlakha and Ramesh Johari. Mean field equilibrium in dynamic games with strategic complementarities. Operations Research, 61(4):971–989, 2013.
- [2] Noga Alon, Nicolò Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In Conference on Learning Theory, pages 23–35, 2015.
- [3] Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. From bandits to experts: A tale of domination and independence. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1610–1618. 2013.
- [4] Kareem Amin, Rachel Cummings, Lili Dworkin, Michael Kearns, and Aaron Roth. Online learning and profit maximization from revealed preferences. In AAAI, pages 770–776, 2015.
- [5] Kareem Amin, Afshin Rostamizadeh, and Umar Syed. Repeated contextual auctions with strategic buyers. In Advances in Neural Information Processing Systems, pages 622–630, 2014.
- [6] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
- [7] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
- [8] Santiago Balseiro and Yonatan Gur. Learning in repeated auctions with budgets: Regret minimization and equilibrium. 2017.
- [9] Santiago R Balseiro, Omar Besbes, and Gabriel Y Weintraub. Repeated auctions with budgets in ad exchanges: Approximations and design. Management Science, 61(4):864–884, 2015.
- [10] Avrim Blum, MohammadTaghi Hajiaghayi, Katrina Ligett, and Aaron Roth. Regret minimization and the price of total anarchy. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 373–382. ACM, 2008.
- [11] Avrim Blum, Vijay Kumar, Atri Rudra, and Felix Wu. Online learning in online auctions. Theoretical Computer Science, 324(2-3):137–146, 2004.
- [12] Avrim Blum, Yishay Mansour, and Jamie Morgenstern. Learning valuation distributions from partial observation. In AAAI, pages 798–804, 2015.
- [13] Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- [14] Ioannis Caragiannis, Christos Kaklamanis, Panagiotis Kanellopoulos, Maria Kyropoulou, Brendan Lucier, Renato Paes Leme, and Éva Tardos. Bounding the inefficiency of outcomes in generalized second price auctions. Journal of Economic Theory, 156:343–388, 2015.
- [15] Ioannis Caragiannis, Christos Kaklamanis, Panagiotis Kanellopoulos, Maria Kyropoulou, Brendan Lucier, Renato Paes Leme, and Éva Tardos. Bounding the inefficiency of outcomes in generalized second price auctions. Journal of Economic Theory, 156:343–388, 2015.
- [16] Nicolò Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Regret minimization for reserve prices in second-price auctions. IEEE Transactions on Information Theory, 61(1):549–564, 2015.
- [17] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
- [18] Shuchi Chawla, Jason D. Hartline, and Denis Nekipelov. Mechanism design for data science. In ACM Conference on Economics and Computation, EC ’14, Stanford, CA, USA, June 8-12, 2014, pages 711–712, 2014.
- [19] Alon Cohen, Tamir Hazan, and Tomer Koren. Online learning with feedback graphs without the graphs. In International Conference on Machine Learning, pages 811–819, 2016.
- [20] Richard Cole and Tim Roughgarden. The sample complexity of revenue maximization. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 243–252. ACM, 2014.
- [21] Peerapong Dhangwatnotai, Tim Roughgarden, and Qiqi Yan. Revenue maximization with a single sample. Games and Economic Behavior, 91:318–333, 2015.
- [22] Nishanth Dikkala and Éva Tardos. Can credit increase revenue? In Web and Internet Economics - 9th International Conference, WINE 2013, Cambridge, MA, USA, December 11-14, 2013, Proceedings, pages 121–133, 2013.
- [23] András György, Tamás Linder, and Gábor Lugosi. Efficient tracking of large classes of experts. IEEE Transactions on Information Theory, 58(11):6709–6725, 2012.
- [24] Krishnamurthy Iyer, Ramesh Johari, and Mukund Sundararajan. Mean field equilibria of dynamic auctions with learning. ACM SIGecom Exchanges, 10(3):10–14, 2011.
- [25] Yash Kanoria and Hamid Nazerzadeh. Dynamic reserve prices for repeated auctions: Learning from bids - working paper. In Web and Internet Economics - 10th International Conference, WINE 2014, Beijing, China, December 14-17, 2014. Proceedings, page 232, 2014.
- [26] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 681–690. ACM, 2008.
- [27] Robert D Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pages 697–704, 2005.
- [28] Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. In John Shawe-Taylor, Richard S. Zemel, Peter L. Bartlett, Fernando C. N. Pereira, and Kilian Q. Weinberger, editors, NIPS, pages 684–692, 2011.
- [29] Andres M Medina and Mehryar Mohri. Learning theory and algorithms for revenue optimization in second price auctions with reserve. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 262–270, 2014.
- [30] Andrés Muñoz Medina and Sergei Vassilvitskii. Revenue optimization with approximate bid predictions. CoRR, abs/1706.04732, 2017.
- [31] Michael Ostrovsky and Michael Schwarz. Reserve prices in internet advertising auctions: A field experiment. In Proceedings of the 12th ACM conference on Electronic commerce, pages 59–60. ACM, 2011.
- [32] Tim Roughgarden. Intrinsic robustness of the price of anarchy. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 513–522. ACM, 2009.
- [33] Vasilis Syrgkanis and Éva Tardos. Composable and efficient mechanisms. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, STOC ’13, pages 211–220, New York, NY, USA, 2013. ACM.
- [34] Jonathan Weed, Vianney Perchet, and Philippe Rigollet. Online learning in repeated auctions. In Conference on Learning Theory, pages 1562–1583, 2016.

## Appendix A Omitted Algorithms

The family of WIN-EXP algorithms is parametrized by the step size, the estimate of the utility that the learner obtains at every timestep, and the type of feedback that he receives after every timestep. Both the step size and the estimate of the utility depend crucially on the particular type of feedback.

In this section, we present the specifics of the algorithms that we omitted from the main body of the text, due to lack of space.
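The three ingredients named above (step size, utility estimate, feedback type) can be seen in a generic exponential-weights loop. The following is a sketch of that generic template under stated assumptions, not a reproduction of any of the omitted algorithms: the function `exponential_weights` and the callback `play_and_estimate`, which hides the feedback model and returns one utility estimate per action, are hypothetical names introduced here.

```python
import math
import random
from typing import Callable, List, Sequence


def _softmax(log_weights: Sequence[float]) -> List[float]:
    """Normalize log-weights into a probability vector, stably."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    total = sum(w)
    return [x / total for x in w]


def exponential_weights(
    actions: Sequence[float],
    eta: float,
    num_rounds: int,
    play_and_estimate: Callable[[int, float, List[float]], List[float]],
    seed: int = 0,
) -> List[float]:
    """Generic exponential-weights loop; returns the final distribution.

    play_and_estimate(t, chosen_action, probs) is a hypothetical callback
    that plays the chosen action, observes whatever feedback is available,
    and returns a utility estimate for every action.
    """
    rng = random.Random(seed)
    log_weights = [0.0] * len(actions)
    for t in range(num_rounds):
        probs = _softmax(log_weights)
        idx = rng.choices(range(len(actions)), weights=probs)[0]
        estimates = play_and_estimate(t, actions[idx], probs)
        for k, est in enumerate(estimates):
            log_weights[k] += eta * est  # multiplicative update in log space
    return _softmax(log_weights)


# Toy full-feedback estimator (an assumption): non-positive utilities
# peaked at the bid 0.5, revealed for every action in every round.
grid = [0.0, 0.5, 1.0]
final = exponential_weights(
    grid, eta=0.5, num_rounds=200,
    play_and_estimate=lambda t, a, p: [-(b - 0.5) ** 2 for b in grid],
)
print([round(x, 3) for x in final])
```

Swapping in a different `play_and_estimate` is where a particular feedback model would enter; the update rule itself stays the same.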

### A.1 Outcome-based feedback

(3) |

(4) |

### A.2 Outcome-based batch-reward feedback

(5) |

(6) |

### A.3 Outcome-based feedback graph over outcomes

(7) |

(8) |

## Appendix B Omitted proofs from Section 4

We first give a lemma that bounds the moments of our utility estimate.

###### Lemma 14.

At each iteration , for any action , the random variable is an unbiased estimate of the true expected utility , i.e.: and has expected second moment bounded by: .

###### Proof of Lemma 14.

According to the notation we introduced before we have:

Similarly for the second moment:

where the last inequality holds since . ∎

###### Proof of Theorem 3.

Observe that regret with respect to the utilities is equal to regret with respect to the translated utilities. We use the fact that exponential weight updates with an unbiased estimate of the true utilities achieve expected regret of the form:

For a detailed proof of the above, we refer the reader to Appendix G. Invoking the bound on the second moment by Lemma 14, we get:

Picking