A Note on Tight Lower Bound for MNL-Bandit Assortment Selection Models

Xi Chen, Stern School of Business, New York University
Yining Wang, Machine Learning Department, Carnegie Mellon University

In this note we prove a tight lower bound for the MNL-bandit assortment selection model that matches the upper bound given in (Agrawal et al., 2016a, b) for all parameters, up to logarithmic factors.

1 Introduction

We consider the dynamic MNL-bandit model for assortment selection (Agrawal et al., 2016a), where $N$ items are present, each associated with a known revenue parameter $r_i$ and an unknown preference parameter $v_i$. For a total of $T$ epochs, at each epoch $t$ a retailer, based on previous purchasing experiences, selects an assortment $S_t \subseteq [N]$ of size at most $K$ (i.e., $|S_t| \le K$); the retailer then observes a purchasing outcome $i_t \in S_t \cup \{0\}$ sampled from the following discrete distribution:

$$\Pr[i_t = i \mid S_t] = \frac{v_i}{v_0 + \sum_{j \in S_t} v_j}, \qquad i \in S_t \cup \{0\},$$

where $v_0$ is the preference parameter of the no-purchase outcome $0$, and collects the corresponding revenue $r_{i_t}$ (if $i_t = 0$ then no revenue is collected). The objective is to find a policy $\pi$ that minimizes the worst-case expected regret

$$\sup_{v} \; \mathbb{E}^{\pi}_{v} \sum_{t=1}^{T} \big( R(S^*) - R(S_t) \big),$$

where $R(S) = \sum_{i \in S} r_i v_i / (v_0 + \sum_{i \in S} v_i)$ is the expected revenue collected on assortment $S$ and $S^*$ is the optimal assortment of size at most $K$ in hindsight. It was shown in (Agrawal et al., 2016b) that a Thompson sampling based policy achieves a regret of at most $\tilde{O}(\sqrt{NT})$, and furthermore Agrawal et al. (2016a) show that no policy can achieve a regret smaller than $\Omega(\sqrt{NT/K})$. There is an apparent gap between the upper and lower bounds when $K$ is large.
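The choice model and revenue function described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard MNL form in which an offered item is purchased with probability proportional to its preference weight and outcome 0 (no purchase) carries weight `v0`. The function names, the default `v0 = 1.0`, and the toy numbers are assumptions made for the example and do not come from the note itself.

```python
import itertools
import random


def choice_probabilities(assortment, v, v0=1.0):
    """Map each outcome to its probability; outcome 0 means 'no purchase'."""
    denom = v0 + sum(v[i] for i in assortment)
    probs = {0: v0 / denom}
    for i in assortment:
        probs[i] = v[i] / denom
    return probs


def expected_revenue(assortment, v, r, v0=1.0):
    """Expected one-epoch revenue R(S) of offering `assortment`."""
    denom = v0 + sum(v[i] for i in assortment)
    return sum(r[i] * v[i] for i in assortment) / denom


def sample_purchase(assortment, v, rng, v0=1.0):
    """Draw one purchasing outcome from the MNL distribution."""
    probs = choice_probabilities(assortment, v, v0)
    outcomes = list(probs)
    weights = [probs[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def best_assortment(v, r, K, v0=1.0):
    """Brute-force the revenue-maximizing assortment of size at most K."""
    items = sorted(v)
    candidates = (
        subset
        for size in range(1, K + 1)
        for subset in itertools.combinations(items, size)
    )
    return max(candidates, key=lambda s: expected_revenue(s, v, r, v0))


# Toy instance (illustrative numbers): 4 items, capacity K = 2, unit revenues.
v = {1: 0.5, 2: 0.5, 3: 0.25, 4: 0.25}
r = {i: 1.0 for i in v}
S_star = best_assortment(v, r, K=2)
# Per-epoch regret of always offering the sub-optimal assortment (3, 4).
gap = expected_revenue(S_star, v, r) - expected_revenue((3, 4), v, r)
# One simulated purchasing outcome when the optimal assortment is offered.
outcome = sample_purchase(S_star, v, random.Random(0))
```

Multiplying `gap` by the number of epochs gives the regret of the fixed policy that always offers `(3, 4)`. The brute-force search is only practical for small instances and is used here purely to make the benchmark assortment concrete.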

In this note we close this gap by proving the following result:

Theorem 1.

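Suppose $K \le N/4$. There exists an absolute constant $C_0 > 0$ independent of $N$, $K$ and $T$ such that

$$\inf_{\pi} \sup_{v} \; \mathbb{E}^{\pi}_{v} \sum_{t=1}^{T} \big( R(S^*) - R(S_t) \big) \;\ge\; C_0 \sqrt{NT}.$$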

Theorem 1 matches the upper bound for all three parameters $N$, $T$ and $K$, except for a logarithmic factor in $T$. The proof technique is similar to the proof of (Bubeck et al., 2012, Theorem 3.5). The major difference is that for the MNL-bandit model with assortment size $K$, a “neighboring” subset of size $K-1$ rather than the empty set is considered in the calculation of the KL-divergence. This approach removes a $\sqrt{K}$ factor from the resulting lower bound, which then matches the existing upper bound in (Agrawal et al., 2016a, b) up to poly-logarithmic factors.

2 Proof of Theorem 1

Throughout the proof we set and for some parameter to be specified later. For any subset , we use to indicate the parameterization where if and if . We use to denote all subsets of of size . Clearly, . We use and to denote the law and expectation under the parameterization .

2.1 The counting argument

We first prove the following lemma that bounds the regret of any assortment selection:

Lemma 1.

Fix arbitrary and let be the parameter associated with ; that is, for and for . For any with , it holds that

where .


Proof. By construction of , it is clear that . On the other hand, . Subsequently,

where the last inequality holds because . ∎

For each assortment selection , let be an arbitrary subset of size that contains . (Footnote 1: If then .) Define . Using Lemma 1 and the fact that suffers less regret than , we have


Here Eq. (1) holds because the maximum regret is always lower bounded by the average regret (averaging over all parameterizations for ), Eq. (2) follows from Lemma 1, and Eq. (3) holds because for any . The lower bound proof is then reduced to finding the largest such that the summation term in Eq. (3) is upper bounded by, say, for some constant .

2.2 Pinsker’s inequality

The major challenge in bounding the summation term on the right-hand side of Eq. (3) is the term. Ideally, we expect this term to be small (e.g., around fraction of ) because is of size . However, a bandit assortment selection algorithm, with knowledge of , could potentially allocate its assortment selections so that becomes significantly larger for than . To overcome this difficulty, we use an analysis similar to the proof of Theorem 3.5 in (Bubeck et al., 2012) to exploit the property and Pinsker’s inequality (Tsybakov, 2009) to bound the discrepancy in expectations under different parameterizations.

Let be all subsets of size that do not include . Re-arranging summation order we have

Denote and . Also note that almost surely under both and . Using Pinsker’s inequality we have that

Here and are the total variation and the Kullback-Leibler (KL) divergence between and , respectively. Subsequently,


The first term on the right-hand side of Eq. (4) is easily bounded:

Here the last inequality holds because . Combining all inequalities we have that


It remains to bound the KL divergence between two “neighboring” parameterizations and for all and , which we do in the next section.
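The role Pinsker's inequality plays in this step can be checked numerically. The sketch below, written for generic categorical distributions rather than the laws used in the proof, verifies the chain |E_p f − E_q f| ≤ T · TV(p, q) ≤ T · sqrt(KL(p‖q)/2) for a statistic f taking values in [0, T]. The helper names, the support size 6, and the value `T = 10.0` are hypothetical choices made only for illustration.

```python
import math
import random


def total_variation(p, q):
    """Total variation distance between two categorical distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[j] > 0 wherever p[j] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def random_categorical(rng, size):
    """Draw a strictly positive probability vector of the given length."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def expectation(f, p):
    """Expectation of the statistic f under the distribution p."""
    return sum(x * a for x, a in zip(f, p))


rng = random.Random(0)
T = 10.0  # upper bound on the statistic f (hypothetical value)
worst_slack = float("inf")
for _ in range(1000):
    p = random_categorical(rng, 6)
    q = random_categorical(rng, 6)
    f = [rng.uniform(0.0, T) for _ in range(6)]
    tv = total_variation(p, q)
    pinsker = math.sqrt(0.5 * kl_divergence(p, q))
    gap = abs(expectation(f, p) - expectation(f, q))
    # Chain: |E_p f - E_q f| <= T * TV(p, q) <= T * sqrt(KL(p || q) / 2).
    assert gap <= T * tv + 1e-9
    assert tv <= pinsker + 1e-12
    worst_slack = min(worst_slack, T * pinsker - gap)
```

The first inequality uses only that f is bounded in [0, T]; the second is Pinsker's inequality. Together they turn a KL bound between two parameterizations into a bound on how much an expected count can differ between them.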

2.3 KL-divergence between assortment selections

Define . Note that because , we have almost surely and hence for all .

Lemma 2.

For any and ,

Before proving Lemma 2 we first prove an upper bound on KL-divergence between categorical distributions.

Lemma 3.

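Suppose $P$ is a categorical distribution with parameters $p_0, \ldots, p_J$ (meaning that $P(X = j) = p_j$ for $j = 0, \ldots, J$) and $Q$ is a categorical distribution with parameters $q_0, \ldots, q_J$. Suppose $p_j = q_j + \varepsilon_j$ for all $j = 0, \ldots, J$. Then

$$\mathrm{KL}(P \,\|\, Q) \le \sum_{j=0}^{J} \frac{\varepsilon_j^2}{q_j}.$$

Proof. We have that

$$\mathrm{KL}(P \,\|\, Q) = \sum_{j=0}^{J} p_j \log\frac{p_j}{q_j} \overset{(a)}{\le} \sum_{j=0}^{J} p_j \cdot \frac{p_j - q_j}{q_j} = \sum_{j=0}^{J} \varepsilon_j + \sum_{j=0}^{J} \frac{\varepsilon_j^2}{q_j} \overset{(b)}{=} \sum_{j=0}^{J} \frac{\varepsilon_j^2}{q_j}.$$

Here (a) holds because $\log(1+x) \le x$ for all $x > -1$ and (b) holds because $\sum_{j} \varepsilon_j = 0$. ∎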
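A quick numerical sanity check of this kind of bound, assuming it takes the chi-square form KL(P‖Q) ≤ Σ_j (p_j − q_j)² / q_j that follows from log(1 + x) ≤ x. The second part applies it to two MNL choice distributions that differ in a single item's weight. The parameterization used there (no-purchase weight 1, base weight 1/K, perturbed weight (1 + eps)/K) is an assumption chosen for illustration and is not taken from the proof; all function names are hypothetical.

```python
import math
import random


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[j] > 0 wherever p[j] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def chi_square_bound(p, q):
    """Sum of (p_j - q_j)^2 / q_j, the assumed form of the upper bound."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))


def normalize(weights):
    """Scale positive weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def mnl_distribution(weights, v0=1.0):
    """Choice probabilities [no purchase, item 1, ..., item K] for offered weights."""
    denom = v0 + sum(weights)
    return [v0 / denom] + [w / denom for w in weights]


# 1) KL <= chi-square bound on random strictly positive categorical pairs.
rng = random.Random(1)
for _ in range(1000):
    p = normalize([rng.random() + 1e-3 for _ in range(5)])
    q = normalize([rng.random() + 1e-3 for _ in range(5)])
    assert kl_divergence(p, q) <= chi_square_bound(p, q) + 1e-12

# 2) Two "neighboring" MNL instances: only the first item's weight is bumped.
eps = 0.1  # perturbation size (illustrative value)
scaled_kl = {}
for K in (1, 2, 5, 20, 100):
    base = mnl_distribution([1.0 / K] * K)
    bumped = mnl_distribution([(1.0 + eps) / K] + [1.0 / K] * (K - 1))
    kl = kl_divergence(bumped, base)
    assert kl <= chi_square_bound(bumped, base) + 1e-15
    # If the per-epoch KL is O(eps^2 / K), this ratio stays bounded as K grows.
    scaled_kl[K] = K * kl / eps ** 2
```

Under these assumed weights the ratio `K * kl / eps**2` stays below 1 for every tested `K`, which is the kind of per-epoch scaling that a KL bound between neighboring parameterizations needs in order to avoid losing a factor that grows with the assortment size.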

We are now ready to prove Lemma 2.


Proof. It is clear that for any such that , . Therefore, we shall focus only on those with , which happens for epochs in expectation. Define and . Re-write the probability of as and under and , respectively, where . We then have that

Note that and for . Invoking Lemma 3 we have that

2.4 Putting everything together

Using Hölder’s inequality, we have that

By Jensen’s inequality and the concavity of the square root, we have

Invoking Lemma 2, we obtain

Subsequently, setting the term inside the bracket on the right-hand side of Eq. (5) can be lower bounded by . The overall regret is thus lower bounded by . Theorem 1 is thus proved.


References

  • Agrawal et al. (2016a) Agrawal, S., Avadhanula, V., Goyal, V., & Zeevi, A. (2016a). A near-optimal exploration-exploitation approach for assortment selection. In Proceedings of the 2016 ACM Conference on Economics and Computation (EC).
  • Agrawal et al. (2016b) Agrawal, S., Avadhanula, V., Goyal, V., & Zeevi, A. (2016b). Thompson sampling for the MNL-bandit. In Proceedings of the 2017 Conference on Learning Theory (COLT).
  • Bubeck et al. (2012) Bubeck, S., Cesa-Bianchi, N., et al. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1), 1–122.
  • Tsybakov (2009) Tsybakov, A. B. (2009). Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York.