A Note on Tight Lower Bound for MNL-Bandit Assortment Selection Models
We consider the dynamic MNL-bandit model for assortment selection (Agrawal et al., 2016a), where $N$ items are present, each associated with a known revenue parameter $r_i \in [0,1]$ and an unknown preference parameter $v_i > 0$. For a total of $T$ epochs, at each epoch $t$ a retailer, based on previous purchasing experiences, selects an assortment $S_t \subseteq [N]$ of size at most $K$ (i.e., $|S_t| \le K$); the retailer then observes a purchasing outcome $i_t \in S_t \cup \{0\}$ sampled from the following discrete distribution:
$$\Pr\big[i_t = i \,\big|\, S_t\big] = \frac{v_i}{1 + \sum_{j\in S_t} v_j}, \qquad i \in S_t \cup \{0\},$$
where $i_t = 0$ indicates no purchase and $v_0 = 1$, and collects the corresponding revenue $r_{i_t}$ (if $i_t = 0$ then no revenue is collected). The objective is to find a policy $\pi$ that minimizes the worst-case expected regret
$$\sup_{v}\, \mathbb{E}\Big[\sum_{t=1}^{T} R(S^*, v) - R(S_t, v)\Big],$$
where $R(S, v) = \sum_{i\in S} \frac{r_i v_i}{1 + \sum_{j\in S} v_j}$ is the expected revenue collected on assortment $S$ and $S^* = \arg\max_{S\subseteq[N],\, |S|\le K} R(S, v)$ is the optimal assortment of size at most $K$ in hindsight. It was shown in (Agrawal et al., 2016b) that a Thompson sampling based policy achieves a regret at most $\widetilde O(\sqrt{NT})$, and furthermore Agrawal et al. (2016a) show that no policy can achieve a regret smaller than $\Omega(\sqrt{NT/K})$. There is an apparent gap of order $\sqrt{K}$ between the upper and lower bounds when $K$ is large.
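For concreteness, the choice model above can be simulated in a few lines. The sketch below is ours (the helper names `choice_probs`, `sample_purchase`, and `expected_revenue` are hypothetical, not from any reference implementation); it samples a purchasing outcome and evaluates $R(S, v)$ with the no-purchase weight $v_0 = 1$:

```python
import random

def choice_probs(S, v):
    """MNL choice probabilities over S ∪ {0}; index 0 is the
    no-purchase option with weight v_0 = 1."""
    denom = 1.0 + sum(v[i] for i in S)
    probs = {i: v[i] / denom for i in S}
    probs[0] = 1.0 / denom
    return probs

def sample_purchase(S, v, rng=random):
    """Sample the purchasing outcome i_t for assortment S."""
    probs = choice_probs(S, v)
    outcomes = list(probs)
    weights = [probs[i] for i in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

def expected_revenue(S, v, r):
    """R(S, v) = sum_{i in S} r_i v_i / (1 + sum_{j in S} v_j)."""
    denom = 1.0 + sum(v[j] for j in S)
    return sum(r[i] * v[i] for i in S) / denom
```

For instance, with $S = \{1, 2\}$, $v_1 = v_2 = 1$ and unit revenues, each of the three outcomes (item 1, item 2, or no purchase) is equally likely and $R(S, v) = 2/3$.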
In this note we close this gap by proving the following result:
Theorem 1. Suppose $K \le N/4$ and $T \ge N$. There exists an absolute constant $C_0 > 0$ independent of $N$, $K$ and $T$ such that
$$\inf_{\pi}\, \sup_{v}\, \mathbb{E}\Big[\sum_{t=1}^{T} R(S^*, v) - R(S_t, v)\Big] \ \ge\ C_0\sqrt{NT}.$$
Theorem 1 matches the upper bound in all three parameters $N$, $K$ and $T$, except for logarithmic factors. The proof technique is similar to that of (Bubeck et al., 2012, Theorem 3.5). The major difference is that for the MNL-bandit model with assortment size $K$, a “neighboring” subset of size $K-1$, rather than the empty set, is considered in the calculation of the KL-divergence. This approach removes an $O(\sqrt{K})$ loss from the resulting lower bound, which then matches the existing upper bounds in (Agrawal et al., 2016a, b) up to poly-logarithmic factors.
2 Proof of Theorem 1
Throughout the proof we set $r_1 = \cdots = r_N = 1$ and $v_i \in \{1/K, (1+\epsilon)/K\}$ for some parameter $\epsilon \in (0, 1/2]$ to be specified later. For any subset $S \subseteq [N]$, we use $v^{(S)}$ to indicate the parameterization where $v_i = (1+\epsilon)/K$ if $i \in S$ and $v_i = 1/K$ if $i \notin S$. We use $\mathcal{S}_K$ to denote all subsets of $[N]$ of size $K$. Clearly, $|\mathcal{S}_K| = \binom{N}{K}$. We use $\Pr_S$ and $\mathbb{E}_S$ to denote the law and the expectation under the parameterization $v^{(S)}$.
2.1 The counting argument
We first prove the following lemma that bounds the regret of any assortment selection:
Lemma 1. Fix arbitrary $S \in \mathcal{S}_K$ and let $v = v^{(S)}$ be the parameter associated with $S$; that is, $v_i = (1+\epsilon)/K$ for $i \in S$ and $v_i = 1/K$ for $i \notin S$. For any $S_t \subseteq [N]$ with $|S_t| \le K$, it holds that
$$R(S^*, v) - R(S_t, v) \ \ge\ \frac{\epsilon}{8K}\,\big|S\setminus S_t\big|.$$
By construction of $v$, it is clear that $S^* = S$ and $R(S^*, v) = \frac{1+\epsilon}{2+\epsilon}$. On the other hand, because all revenues equal one and $x \mapsto x/(1+x)$ is increasing, $R(S_t, v) \le \frac{1 + \epsilon|S\cap S_t|/K}{2 + \epsilon|S\cap S_t|/K}$. Subsequently,
$$R(S^*, v) - R(S_t, v) \ \ge\ \frac{1+\epsilon}{2+\epsilon} - \frac{1 + \epsilon|S\cap S_t|/K}{2 + \epsilon|S\cap S_t|/K} \ =\ \frac{\epsilon\,|S\setminus S_t|/K}{(2+\epsilon)\big(2 + \epsilon|S\cap S_t|/K\big)} \ \ge\ \frac{\epsilon}{8K}\,\big|S\setminus S_t\big|,$$
where the last inequality holds because $0 < \epsilon \le 1/2$, so that $(2+\epsilon)(2+\epsilon|S\cap S_t|/K) \le (5/2)^2 \le 8$. ∎
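Lemma 1 is elementary enough to be verified by exhaustive enumeration on small instances. The sketch below (function names are ours) builds $v^{(S)}$ for every $S \in \mathcal{S}_K$ and checks the stated inequality against every assortment of size at most $K$:

```python
from itertools import combinations

def unit_revenue(S, v):
    # With unit revenues, R(S, v) = V / (1 + V) where V = sum_{i in S} v_i.
    V = sum(v[i] for i in S)
    return V / (1.0 + V)

def check_lemma1(N, K, eps):
    """Exhaustively verify R(S*, v) - R(S_t, v) >= eps/(8K) * |S \\ S_t|
    for every S of size K and every assortment S_t of size at most K."""
    items = range(N)
    for S in combinations(items, K):
        v = {i: (1.0 + eps) / K if i in S else 1.0 / K for i in items}
        r_star = unit_revenue(S, v)  # S itself is optimal under v^(S)
        for k in range(K + 1):
            for S_t in combinations(items, k):
                slack = r_star - unit_revenue(S_t, v)
                if slack + 1e-12 < eps / (8.0 * K) * len(set(S) - set(S_t)):
                    return False
    return True
```

This brute-force check is only feasible for small $N$ and $K$, but it exercises exactly the inequality proven above.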
For each assortment selection $S_t$, let $\tilde S_t \supseteq S_t$ be an arbitrary subset of size exactly $K$ that contains $S_t$ (if $|S_t| = K$ then $\tilde S_t = S_t$). Because all revenues equal one and the total preference weight of $\tilde S_t$ is at least that of $S_t$, we have $R(\tilde S_t, v) \ge R(S_t, v)$; that is, $\tilde S_t$ suffers less regret than $S_t$. Using Lemma 1 and this fact, we have
$$\sup_v \mathbb{E}\Big[\sum_{t=1}^T R(S^*, v) - R(S_t, v)\Big] \ \ge\ \frac{1}{\binom{N}{K}}\sum_{S\in\mathcal{S}_K}\mathbb{E}_S\Big[\sum_{t=1}^T R(S^*, v^{(S)}) - R(\tilde S_t, v^{(S)})\Big] \tag{1}$$
$$\ge\ \frac{\epsilon}{8K}\cdot\frac{1}{\binom{N}{K}}\sum_{S\in\mathcal{S}_K}\sum_{t=1}^T \mathbb{E}_S\big|S\setminus\tilde S_t\big| \tag{2}$$
$$=\ \frac{\epsilon T}{8} - \frac{\epsilon}{8K}\cdot\frac{1}{\binom{N}{K}}\sum_{S\in\mathcal{S}_K}\sum_{t=1}^T \mathbb{E}_S\big|S\cap\tilde S_t\big|. \tag{3}$$
Here Eq. (1) holds because the maximum regret is always lower bounded by the average regret (averaging over all parameterizations $v^{(S)}$, $S \in \mathcal{S}_K$) and because $R(\tilde S_t, v^{(S)}) \ge R(S_t, v^{(S)})$, Eq. (2) follows from Lemma 1, and Eq. (3) holds because $|S| = K$ for any $S \in \mathcal{S}_K$, so that $|S\setminus\tilde S_t| = K - |S\cap\tilde S_t|$. The lower bound proof is then reduced to finding the largest $\epsilon$ such that the summation term in Eq. (3) is upper bounded by, say, $(1-c)TK$ for some constant $c > 0$, in which case the regret is lower bounded by $c\epsilon T/8$.
2.2 Pinsker’s inequality
The major challenge in bounding the summation term on the right-hand side of Eq. (3) is the $\mathbb{E}_S|S\cap\tilde S_t|$ term. Ideally, we expect this term to be small (e.g., around a $K/N$ fraction of $K$) because $S$ is only one of $\binom{N}{K}$ equally weighted subsets of size $K$. However, a bandit assortment selection algorithm could potentially adapt to the observed feedback and allocate its assortment selections so that $\mathbb{E}_S|S\cap\tilde S_t|$ becomes significantly larger under $\Pr_S$ than under other parameterizations. To overcome such difficulties, we use an analysis similar to the proof of Theorem 3.5 in (Bubeck et al., 2012), exploiting the property $|S\cap\tilde S_t| \le K$ and Pinsker’s inequality (Tsybakov, 2009) to bound the discrepancy in expectations under different parameterizations.
Let $\mathcal{S}^{(i)}_{K-1}$ be all subsets of $[N]\setminus\{i\}$ of size $K-1$, that is, all size-$(K-1)$ subsets that do not include $i$. Re-arranging summation order we have
$$\frac{1}{\binom{N}{K}}\sum_{S\in\mathcal{S}_K}\sum_{t=1}^T \mathbb{E}_S\big|S\cap\tilde S_t\big| \ =\ \frac{1}{\binom{N}{K}}\sum_{i=1}^N\sum_{S'\in\mathcal{S}^{(i)}_{K-1}}\sum_{t=1}^T \Pr_{S'\cup\{i\}}\big[i\in\tilde S_t\big].$$
Denote $\tilde N_i = \sum_{t=1}^T \mathbb{1}[i\in\tilde S_t]$ and $N_i = \sum_{t=1}^T \mathbb{1}[i\in S_t]$, so that $\sum_{t=1}^T \Pr_{S'\cup\{i\}}[i\in\tilde S_t] = \mathbb{E}_{S'\cup\{i\}}[\tilde N_i]$. Also note that $0 \le \tilde N_i \le T$ almost surely under both $\Pr_{S'}$ and $\Pr_{S'\cup\{i\}}$. Using Pinsker’s inequality we have that
$$\mathbb{E}_{S'\cup\{i\}}\big[\tilde N_i\big] - \mathbb{E}_{S'}\big[\tilde N_i\big] \ \le\ T\cdot\big\|\Pr_{S'\cup\{i\}} - \Pr_{S'}\big\|_{\mathrm{TV}} \ \le\ T\sqrt{\tfrac{1}{2}\,\mathrm{KL}\big(\Pr_{S'}\,\big\|\,\Pr_{S'\cup\{i\}}\big)}.$$
Here $\|\cdot\|_{\mathrm{TV}}$ and $\mathrm{KL}(\cdot\,\|\,\cdot)$ are the total variation distance and the Kullback-Leibler (KL) divergence between $\Pr_{S'}$ and $\Pr_{S'\cup\{i\}}$, respectively. Subsequently,
$$\frac{1}{\binom{N}{K}}\sum_{i=1}^N\sum_{S'\in\mathcal{S}^{(i)}_{K-1}}\mathbb{E}_{S'\cup\{i\}}\big[\tilde N_i\big] \ \le\ \frac{1}{\binom{N}{K}}\sum_{i=1}^N\sum_{S'\in\mathcal{S}^{(i)}_{K-1}}\Big(\mathbb{E}_{S'}\big[\tilde N_i\big] + T\sqrt{\tfrac{1}{2}\,\mathrm{KL}\big(\Pr_{S'}\,\big\|\,\Pr_{S'\cup\{i\}}\big)}\Big). \tag{4}$$
The first term on the right-hand side of Eq. (4) is easily bounded:
$$\frac{1}{\binom{N}{K}}\sum_{i=1}^N\sum_{S'\in\mathcal{S}^{(i)}_{K-1}}\mathbb{E}_{S'}\big[\tilde N_i\big] \ =\ \frac{1}{\binom{N}{K}}\sum_{S'\in\mathcal{S}_{K-1}}\sum_{i\notin S'}\mathbb{E}_{S'}\big[\tilde N_i\big] \ \le\ \frac{1}{\binom{N}{K}}\sum_{S'\in\mathcal{S}_{K-1}}\mathbb{E}_{S'}\Big[\sum_{i=1}^N\tilde N_i\Big] \ =\ \frac{\binom{N}{K-1}}{\binom{N}{K}}\cdot TK \ \le\ \frac{2TK^2}{N}.$$
Here $\mathcal{S}_{K-1}$ denotes all subsets of $[N]$ of size $K-1$, the second-to-last step holds because $\sum_{i=1}^N \tilde N_i = \sum_{t=1}^T |\tilde S_t| = TK$, and the last inequality holds because $\binom{N}{K-1}\big/\binom{N}{K} = \frac{K}{N-K+1} \le \frac{2K}{N}$ when $K \le N/2$. Combining all inequalities we have that
$$\frac{1}{\binom{N}{K}}\sum_{S\in\mathcal{S}_K}\sum_{t=1}^T \mathbb{E}_S\big|S\cap\tilde S_t\big| \ \le\ \frac{2TK^2}{N} + \frac{T}{\binom{N}{K}}\sum_{i=1}^N\sum_{S'\in\mathcal{S}^{(i)}_{K-1}}\sqrt{\tfrac{1}{2}\,\mathrm{KL}\big(\Pr_{S'}\,\big\|\,\Pr_{S'\cup\{i\}}\big)}.$$
It remains to bound the KL-divergence between two “neighboring” parameterizations $v^{(S')}$ and $v^{(S'\cup\{i\})}$ for all $i\in[N]$ and $S'\in\mathcal{S}^{(i)}_{K-1}$, which we elaborate in the next section.
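The binomial-coefficient step used above is easy to sanity-check mechanically; the sketch below (function name ours) verifies both the exact identity and the $2K/N$ bound for $K \le N/2$:

```python
from math import comb

def ratio_identity_holds(N, K):
    """Check binom(N, K-1) / binom(N, K) == K / (N - K + 1), and that this
    ratio is at most 2K/N whenever K <= N/2."""
    ratio = comb(N, K - 1) / comb(N, K)
    if abs(ratio - K / (N - K + 1)) > 1e-12:
        return False
    if 2 * K <= N and ratio > 2 * K / N + 1e-12:
        return False
    return True
```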
2.3 KL-divergence between assortment selections
Define $V_t = \sum_{j\in S_t} v_j$. Note that because $v_j \le (1+\epsilon)/K \le \frac{3}{2K}$ for all $j$ and $|S_t| \le K$, we have $V_t \le 3/2$ almost surely and hence $\Pr[i_t = j\,|\,S_t] = \frac{v_j}{1+V_t} \ge \frac{2v_j}{5}$ for all $j \in S_t\cup\{0\}$.
Lemma 2. For any $i\in[N]$ and $S'\in\mathcal{S}^{(i)}_{K-1}$,
$$\mathrm{KL}\big(\Pr_{S'}\,\big\|\,\Pr_{S'\cup\{i\}}\big) \ \le\ \frac{8\epsilon^2}{K}\,\mathbb{E}_{S'}\big[N_i\big].$$
Before proving Lemma 2 we first prove an upper bound on KL-divergence between categorical distributions.
Lemma 3. Suppose $P$ is a categorical distribution with parameters $p_0, p_1, \ldots, p_n$ (meaning that $\Pr_P[X=j] = p_j$ for $j = 0, 1, \ldots, n$) and $Q$ is a categorical distribution with parameters $q_0, q_1, \ldots, q_n$. Suppose $q_j \ge p_j/2$ for all $j$. Then
$$\mathrm{KL}(P\|Q) \ \le\ 2\sum_{j=0}^n \frac{(p_j - q_j)^2}{p_j}.$$
We have that
$$\mathrm{KL}(P\|Q) \ =\ \sum_{j=0}^n p_j\ln\frac{p_j}{q_j} \ \overset{(a)}{\le}\ \sum_{j=0}^n p_j\Big(\frac{p_j}{q_j}-1\Big) \ =\ \sum_{j=0}^n \frac{p_j(p_j-q_j)}{q_j} \ \overset{(b)}{=}\ \sum_{j=0}^n \frac{(p_j-q_j)^2}{q_j} \ \le\ 2\sum_{j=0}^n \frac{(p_j-q_j)^2}{p_j}.$$
Here (a) holds because $\ln x \le x - 1$ for all $x > 0$, (b) holds because $\sum_{j=0}^n (p_j - q_j) = 0$, and the final inequality holds because $q_j \ge p_j/2$ for all $j$. ∎
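Lemma 3 can likewise be spot-checked numerically; the sketch below (names ours) computes the exact categorical KL-divergence and compares it against the claimed quadratic upper bound:

```python
from math import log

def categorical_kl(p, q):
    """Exact KL divergence sum_j p_j * ln(p_j / q_j)."""
    return sum(pj * log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

def lemma3_bound_holds(p, q):
    """Check KL(P || Q) <= 2 * sum_j (p_j - q_j)^2 / p_j, assuming q_j >= p_j / 2."""
    assert all(qj >= pj / 2 for pj, qj in zip(p, q))
    bound = 2.0 * sum((pj - qj) ** 2 / pj for pj, qj in zip(p, q) if pj > 0)
    return categorical_kl(p, q) <= bound + 1e-12
```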
We are now ready to prove Lemma 2.
It is clear that for any epoch $t$ such that $i \notin S_t$, the conditional distribution of the feedback $i_t$ given $S_t$ is identical under $\Pr_{S'}$ and $\Pr_{S'\cup\{i\}}$, so such epochs contribute nothing to the KL-divergence (by the chain rule). Therefore, we shall focus only on those epochs with $i \in S_t$, which happens for $\mathbb{E}_{S'}[N_i]$ epochs in expectation. Fix such an epoch, and define $m = |S_t|$ and $m' = |S_t \cap S'|$. Re-write the probability of $i_t = j$ as $p_j$ and $q_j$ under $\Pr_{S'}$ and $\Pr_{S'\cup\{i\}}$, respectively, where $j \in S_t\cup\{0\}$. We then have that
$$p_j = \frac{v_j}{D_p}, \quad q_j = \frac{v'_j}{D_q}, \qquad D_p = 1 + \frac{m + m'\epsilon}{K}, \quad D_q = D_p + \frac{\epsilon}{K},$$
where $v_j = v^{(S')}_j$ and $v'_j = v^{(S'\cup\{i\})}_j$; in particular $v'_j = v_j$ for all $j \ne i$, while $v_i = 1/K$ and $v'_i = (1+\epsilon)/K$.
Note that $|p_j - q_j| = p_j\cdot\frac{\epsilon/K}{D_q} \le \frac{\epsilon}{K}p_j$ for $j \ne i$, and that $p_i \le q_i \le p_i + \frac{\epsilon}{K}$ with $p_i = \frac{1/K}{D_p} \ge \frac{1}{3K}$ (because $1 \le D_p \le 2+\epsilon \le 3$); one also verifies $q_j \ge p_j/2$ for all $j$. Invoking Lemma 3 we have that
$$\mathrm{KL}\big(p\,\|\,q\big) \ \le\ 2\sum_{j\in S_t\cup\{0\}}\frac{(p_j-q_j)^2}{p_j} \ \le\ 2\Big(\frac{\epsilon^2}{K^2}\sum_{j\ne i}p_j + \frac{(\epsilon/K)^2}{1/(3K)}\Big) \ \le\ 2\Big(\frac{\epsilon^2}{K^2} + \frac{3\epsilon^2}{K}\Big) \ \le\ \frac{8\epsilon^2}{K}.$$
Summing these per-epoch contributions via the chain rule of the KL-divergence yields $\mathrm{KL}(\Pr_{S'}\|\Pr_{S'\cup\{i\}}) \le \frac{8\epsilon^2}{K}\,\mathbb{E}_{S'}[N_i]$, completing the proof. ∎
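The per-epoch computation above can be reproduced exactly for small $K$. The sketch below (function names hypothetical) builds the two feedback distributions for an epoch with $i \in S_t$ and checks that the single-epoch KL-divergence is at most $8\epsilon^2/K$ over all feasible $(m, m')$:

```python
from math import log

def per_epoch_kl(K, eps, m, m_prime):
    """Exact KL between the feedback distributions of one epoch with
    i in S_t, |S_t| = m and |S_t ∩ S'| = m_prime, under v^(S') vs. v^(S' ∪ {i})."""
    D_p = 1.0 + (m + m_prime * eps) / K
    D_q = D_p + eps / K
    # weights over S_t ∪ {0}: item i, the m_prime elevated items, the
    # m - 1 - m_prime regular items, and the no-purchase option (v_0 = 1)
    w_p = [1.0 / K] + [(1.0 + eps) / K] * m_prime \
          + [1.0 / K] * (m - 1 - m_prime) + [1.0]
    w_q = [(1.0 + eps) / K] + [(1.0 + eps) / K] * m_prime \
          + [1.0 / K] * (m - 1 - m_prime) + [1.0]
    p = [w / D_p for w in w_p]
    q = [w / D_q for w in w_q]
    return sum(pj * log(pj / qj) for pj, qj in zip(p, q))

def lemma2_epoch_bound_holds(K, eps):
    """Check the per-epoch bound KL <= 8 * eps^2 / K over all feasible (m, m')."""
    return all(per_epoch_kl(K, eps, m, mp) <= 8.0 * eps ** 2 / K
               for m in range(1, K + 1) for mp in range(m))
```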
2.4 Putting everything together
Using Hölder’s inequality (equivalently, Jensen’s inequality and the concavity of the square root), we have that
$$\frac{1}{\binom{N}{K}}\sum_{i=1}^N\sum_{S'\in\mathcal{S}^{(i)}_{K-1}}\sqrt{\tfrac{1}{2}\,\mathrm{KL}\big(\Pr_{S'}\,\|\,\Pr_{S'\cup\{i\}}\big)} \ \le\ K\sqrt{\frac{1}{2N\binom{N-1}{K-1}}\sum_{i=1}^N\sum_{S'\in\mathcal{S}^{(i)}_{K-1}}\mathrm{KL}\big(\Pr_{S'}\,\|\,\Pr_{S'\cup\{i\}}\big)},$$
where we also used the identity $N\binom{N-1}{K-1} = K\binom{N}{K}$. Invoking Lemma 2, together with $N_i \le \tilde N_i$ and the bound $\sum_{i=1}^N\sum_{S'\in\mathcal{S}^{(i)}_{K-1}}\mathbb{E}_{S'}[\tilde N_i] \le TK\binom{N}{K-1} \le 2TK\binom{N-1}{K-1}$ (the last step because $K \le N/2$), we obtain
$$\frac{1}{\binom{N}{K}}\sum_{i=1}^N\sum_{S'\in\mathcal{S}^{(i)}_{K-1}}\sqrt{\tfrac{1}{2}\,\mathrm{KL}\big(\Pr_{S'}\,\|\,\Pr_{S'\cup\{i\}}\big)} \ \le\ K\sqrt{\frac{1}{2N\binom{N-1}{K-1}}\cdot\frac{8\epsilon^2}{K}\cdot 2TK\binom{N-1}{K-1}} \ =\ \sqrt{8}\,K\epsilon\sqrt{\frac{T}{N}} \ \le\ 3K\epsilon\sqrt{\frac{T}{N}}.$$
Combining this with Eq. (3) and the inequalities of Section 2.2, we conclude that
$$\sup_{v}\,\mathbb{E}\Big[\sum_{t=1}^T R(S^*,v)-R(S_t,v)\Big] \ \ge\ \frac{\epsilon T}{8}\Big(1-\frac{2K}{N}-3\epsilon\sqrt{\frac{T}{N}}\Big) \ \ge\ \frac{\epsilon T}{8}\Big(\frac{1}{2}-3\epsilon\sqrt{\frac{T}{N}}\Big),$$
where the last inequality holds because $K \le N/4$. Setting $\epsilon = \frac{1}{12}\sqrt{N/T}$, so that $3\epsilon\sqrt{T/N} = \frac14$ (and $\epsilon \le 1/2$ because $T \ge N$), yields
$$\sup_{v}\,\mathbb{E}\Big[\sum_{t=1}^T R(S^*,v)-R(S_t,v)\Big] \ \ge\ \frac{\epsilon T}{32} \ =\ \frac{\sqrt{NT}}{384},$$
which completes the proof of Theorem 1 with $C_0 = 1/384$. ∎
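Finally, the arithmetic of the last step can be checked mechanically. The sketch below (names ours) confirms that $\epsilon = \frac{1}{12}\sqrt{N/T}$ is feasible under the theorem’s conditions and that the assembled expression dominates $\sqrt{NT}/384$:

```python
from math import sqrt

def final_bound_holds(N, K, T):
    """With eps = sqrt(N/T)/12, check eps <= 1/2 and that
    (eps*T/8) * (1 - 2K/N - 3*eps*sqrt(T/N)) >= sqrt(N*T)/384,
    assuming K <= N/4 and T >= N."""
    eps = sqrt(N / T) / 12.0
    lb = (eps * T / 8.0) * (1.0 - 2.0 * K / N - 3.0 * eps * sqrt(T / N))
    return eps <= 0.5 and lb + 1e-9 >= sqrt(N * T) / 384.0
```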
- Agrawal et al. (2016a) Agrawal, S., Avadhanula, V., Goyal, V., & Zeevi, A. (2016a). A near-optimal exploration-exploitation approach for assortment selection. In Proceedings of the 2016 ACM Conference on Economics and Computation (EC).
- Agrawal et al. (2016b) Agrawal, S., Avadhanula, V., Goyal, V., & Zeevi, A. (2016b). Thompson sampling for the MNL-bandit. In Proceedings of the 2017 Conference on Learning Theory (COLT).
- Bubeck et al. (2012) Bubeck, S., & Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1), 1–122.
- Tsybakov (2009) Tsybakov, A. B. (2009). Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York.