Shipra Agrawal sa3305@columbia.edu
Industrial Engineering and Operations Research, Columbia University.
Vashist Avadhanula vavadhanula18@gsb.columbia.edu
Decision, Risk and Operations, Columbia Business School.
Vineet Goyal vgoyal@ieor.columbia.edu
Industrial Engineering and Operations Research, Columbia University.
Assaf Zeevi assaf@gsb.columbia.edu
Decision, Risk and Operations, Columbia Business School.
Accepted for presentation at Conference on Learning Theory (COLT) 2017

We consider a sequential subset selection problem under parameter uncertainty, where at each time step, the decision maker selects a subset of cardinality $K$ from $N$ possible items (arms), and observes a (bandit) feedback in the form of the index of one of the items in said subset, or none. Each item in the index set is ascribed a certain value (reward), and the feedback is governed by a Multinomial Logit (MNL) choice model whose parameters are a priori unknown. The objective of the decision maker is to maximize the expected cumulative rewards over a finite horizon $T$, or alternatively, minimize the regret relative to an oracle that knows the MNL parameters. We refer to this as the MNL-Bandit problem. This problem is representative of a larger family of exploration-exploitation problems that involve a combinatorial objective, and arise in several important application domains. We present an approach to adapt Thompson Sampling to this problem and show that it achieves near-optimal regret as well as attractive numerical performance.

Thompson Sampling for the MNL-Bandit

Keywords: Thompson Sampling, Exploration-Exploitation, Multinomial Logit Choice Model

1 Introduction and Problem Formulation

Background. In the traditional stochastic multi-armed bandit (MAB) problem, the decision maker selects one of, say, $N$ arms in each round and receives feedback in the form of a noisy reward characteristic of that arm. Regret minimizing strategies are typically based on the principle of optimism in the face of uncertainty, a prime example of which is the family of upper confidence bound (UCB) policies, which allow the player to learn the identity of the best arm through sequential experimentation, while concurrently not spending “too much” of the sampling effort on the sub-optimal arms. In this paper we consider a combinatorial variant of this problem where in each time step the player selects a bundle of $K$ arms, after which s/he gets to see the reward associated with one of the arms in that bundle, or observes no reward at all. One can think of the “no reward” as the result of augmenting each bundle with a further index that belongs to a “null arm” that cannot be directly chosen but can be manifest as a feedback; this structure will be further motivated shortly. The identity of the arm within the bundle that yields the reward observation (or the “null” arm that yields no observation) is determined by means of a probability distribution on the index set of cardinality $K+1$ (the $K$ arms plus the “null” arm). In this paper the distribution is specified by means of a multinomial logit model (MNL); hence the name MNL-Bandit.

A possible interpretation of this MNL-Bandit problem is as follows. A decision maker is faced with the problem of determining which subset (of cardinality at most $K$) of $N$ items to present to users that arrive sequentially, where user preferences for said items are unknown. Each user either selects one of the items s/he is offered or selects none (the “null arm” option described above). Every item presents some reward which is item-specific. Based on the observations of items users have selected, the decision maker needs to ascertain the composition of the “best bundle,” which involves balancing an exploration over bundles to learn the users’ preferences, while simultaneously exploiting the bundles that exhibit good reward. (The exact mathematical formulation is given below.) A significant challenge here is the combinatorial nature of the problem just described, as the space of possible subsets of cardinality $K$ is exponentially large, and for reasonably sized time horizons cannot be efficiently explored.

The problem as stated above is not new, but there is surprisingly little antecedent literature on it; the review below will expound on its history and related strands of work. It arises in many real-world instances, perhaps most notably in display-based online advertising. Here the publisher has to select a set of advertisements to display to users. Due to competing ads, the click rates for an individual ad depend on the overall subset of ads displayed; this is referred to as a substitution effect. For example, consider a user presented with two similar vacation packages from two different sources. The user’s likelihood of clicking on one of the ads in this scenario would most likely differ from the situation where one of the ads is presented as a standalone. Because every advertisement is valued differently from the publisher’s perspective, the set of ads selected for display has a significant impact on revenues. A similar problem arises in online retail settings, where the retailer needs to select a subset (assortment) of products to offer. Here demand for a specific product is influenced by the assortment of products offered. To capture these substitution effects, choice models are often used to specify user preferences in the form of a probability distribution over items in a subset.

The MNL-Bandit is a natural way to cast the exploration-exploitation problem discussed above into a well studied machine learning paradigm, and makes it easier to adapt algorithmic ideas developed in that setting. In particular, this paper focuses on a Thompson Sampling (TS) approach to the MNL-Bandit problem. This is primarily motivated by the attractive empirical properties of TS, observed over a stream of recent papers, relative to more traditional approaches such as upper confidence bound (UCB) policies. For the MNL-Bandit this has further importance given the combinatorial nature of the dynamic optimization problem one is attempting to solve. One of the main contributions of the present paper is in highlighting the salient features of TS that need to be adapted or customized to facilitate the design of an algorithm in the MNL-Bandit, and to elucidate their role in proving regret-optimality for this variant of TS. To the best of our knowledge some of these ideas are new in the TS context, and can hopefully extend its scope to combinatorial-type problems that go beyond the MNL-Bandit.

Problem Formulation. To formally state our problem, consider an option space containing $N$ distinct elements, indexed by $1, 2, \ldots, N$ and their values denoted by $r_1, \ldots, r_N$, with $r$ mnemonic for reward, though we will also use the term revenue in this context. Since the user need not necessarily choose any of the options presented, we model this “outside alternative” as an additional item denoted with an index of “0” which augments the index set. We assume that for any offer set, $S \subseteq \{1, \ldots, N\}$, the user will be selecting only one of the offered alternatives or item $0$, and this selection is given by a Multinomial Logit (MNL) choice model. Under this model, the probability that a user chooses item $i$ is given by,

$$p_i(S) = \begin{cases} \dfrac{v_i}{v_0 + \sum_{j \in S} v_j}, & \text{if } i \in S \cup \{0\}, \\ 0, & \text{otherwise,} \end{cases} \tag{1}$$

where $v_i$ is a parameter of the MNL model corresponding to item $i$. Without loss of generality, we can assume that $v_0 = 1$. (The focus on MNL is due to its prevalent use in the context of modeling substitution effects, and its tractability; see further discussion in related work.)
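As a concrete illustration of the MNL choice probabilities, the following minimal sketch evaluates them for a small offer set. The function name and the example parameter values are our own assumptions, not part of the paper.

```python
def mnl_choice_probabilities(offer_set, v):
    """Return {item: probability} for the offered items plus the outside option 0.

    `v` maps item index -> MNL parameter v_i (hypothetical example values);
    the outside option is normalized to v_0 = 1.
    """
    denom = 1.0 + sum(v[i] for i in offer_set)
    probs = {0: 1.0 / denom}            # probability of choosing nothing
    for i in offer_set:
        probs[i] = v[i] / denom
    return probs


v = {1: 0.5, 2: 0.25, 3: 0.25}
probs = mnl_choice_probabilities({1, 2}, v)
```

Items that are not offered (item 3 here) do not appear in the result, matching the zero probability assigned to them by the model.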

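Given the above, the expected revenue corresponding to the offer set $S$ is given by

$$R(S, v) = \sum_{i \in S} r_i \, p_i(S) = \sum_{i \in S} \frac{r_i v_i}{1 + \sum_{j \in S} v_j}, \tag{2}$$

and the corresponding static optimization problem, i.e., when the parameter vector $v = (v_0, \ldots, v_N)$, and hence $p_i(S)$, is known a priori, is given by,

$$\max \left\{ R(S, v) \;:\; S \subseteq \{1, \ldots, N\},\ |S| \le K \right\}. \tag{3}$$

The cardinality constraints specified above arise naturally in many applications. Specifically, a publisher/retailer is constrained by the space for advertisements/products and has to limit the number of ads/products that can be displayed.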
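To make the static problem concrete, here is a minimal sketch that evaluates the expected revenue of an offer set and solves the cardinality-constrained problem by exhaustive enumeration. Enumeration is only feasible for tiny instances and is used purely for illustration; the function names and the numbers are assumptions.

```python
from itertools import combinations


def expected_revenue(offer_set, r, v):
    """Expected revenue of an offer set under the MNL model with v_0 = 1."""
    denom = 1.0 + sum(v[i] for i in offer_set)
    return sum(r[i] * v[i] for i in offer_set) / denom


def best_assortment_bruteforce(r, v, K):
    """Enumerate all non-empty sets of size at most K (exponential; illustration only)."""
    items = sorted(v)
    best_set, best_rev = (), 0.0
    for size in range(1, K + 1):
        for S in combinations(items, size):
            rev = expected_revenue(S, r, v)
            if rev > best_rev:
                best_set, best_rev = S, rev
    return set(best_set), best_rev


r = {1: 1.0, 2: 0.8, 3: 0.3}   # hypothetical rewards
v = {1: 0.2, 2: 0.5, 3: 1.0}   # hypothetical MNL parameters
S_star, rev_star = best_assortment_bruteforce(r, v, K=2)
```

In this example the most attractive item (item 3) is left out of the best set because its low reward dilutes the revenue of the other two, which is the substitution effect at work.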

Consider a time horizon $T$, where a subset of items can be offered at time periods $t = 1, \ldots, T$. Let $S^*$ be the offline optimal offer set for (3) under full information, namely, when the values of $p_i(S)$, as given by (1), are known a priori. In the MNL-Bandit, the decision maker does not know the values of $p_i(S)$ and can only make sequential offer set decisions, $S_1, \ldots, S_T$, at times $1, \ldots, T$, respectively. The objective is to design an algorithm that selects a (non-anticipating) sequence of offer sets in a path-dependent manner (namely, based on past choices and observed responses) to maximize cumulative expected revenues over the said horizon, or alternatively, minimize the regret defined as

$$Reg(T, v) = \sum_{t=1}^{T} \Big( R(S^*, v) - E[R(S_t, v)] \Big),$$

where $R(S, v)$ is the expected revenue when the offer set is $S$, and is as defined in (2). Here we make explicit the dependence of regret on the time horizon $T$ and the parameter vector $v$ of the MNL model that determines the user preferences and choices.

Outline. We review related literature and describe our contributions in Section 2. In Section 3, we present our adaptations of the Thompson Sampling algorithm for the MNL-bandit, and in Section 4, we prove our main result that our algorithm achieves an $\tilde{O}(\sqrt{NT})$ regret upper bound. Section 5 demonstrates the empirical efficiency of our algorithm design.

2 Related Work and Overview of Contribution

A basic pillar in the MNL-Bandit problem is the MNL choice model, originally introduced (independently) by Luce (1959) and Plackett (1975); see also Train (2003); McFadden (1978); Ben-Akiva and Lerman (1985) for further discussion and survey of other commonly used choice models. This model is by far the most widely used choice model insofar as capturing substitution effects that are a significant element in our problem. Initial motivation for this traces to online retail, where a retailer has to decide on a subset of items to offer from a universe of substitutable products for display. In this context, Rusmevichientong et al. (2010) and Sauré and Zeevi (2013) are, to our knowledge, the first two papers to consider a dynamic learning problem, in particular, focusing on minimizing regret under the MNL choice model. Both papers develop an “explore first and exploit later” approach. Assuming knowledge of the “gap” between the optimal and the next-best assortment, they show an asymptotic $O(N \log T)$ regret bound. (This assumption is akin to the “separated arm” case in the MAB setting.) It is worth noting that the algorithms developed in those papers require a priori knowledge of this gap as a tuning input, which makes the algorithms parameter dependent. In a more recent paper, Agrawal et al. (2016) show how to exploit specific characteristics of the MNL model to develop a policy based on the principle of “optimism under uncertainty” (UCB-like algorithm, see Auer et al. (2002)) which does not rely on the a priori knowledge of this gap or separation information and achieves a worst-case regret bound of $O(\sqrt{NT \log T})$. A regret lower bound of $\Omega(\sqrt{NT/K})$ for this problem is also presented in this work.

It is widely recognized that UCB-type algorithms that optimize the worst case regret typically tend to spend “too much time” in the exploration phase, resulting in poor performance in practice (regret-optimality bounds notwithstanding). To that end, several studies (Oliver and Li (2011), Graepel et al. (2010), May et al. (2012)) have demonstrated that TS significantly outperforms the state of the art methods in practice. Despite being easy to implement and often empirically superior, TS based algorithms are hard to analyze and theoretical work on TS is limited. To the best of our knowledge, Agrawal and Goyal (2013a) is the first work to provide finite-time worst-case regret bounds for the MAB problem that are independent of problem parameters.

A naive translation of the MNL-bandit problem to an MAB-type setting would create $\binom{N}{K}$ “arms” (one for each offer set of size $K$). For an “arm” corresponding to subset $S$, the reward is given by $R(S, v)$ as in (2). Managing this exponentially large arm space is prohibitive for obvious reasons. Popular extensions of MAB for “large scale” problems include the linear bandit (e.g., Auer (2003), Rusmevichientong and Tsitsiklis (2010)) for which Agrawal and Goyal (2013b) present a TS-based algorithm and provide finite time regret bounds. However, these approaches do not apply directly to our problem, since the revenue corresponding to each offered set is not linear in problem parameters. Moreover, for the regret bounds in those settings to be attractive, the dimension of parameters should be small; this dimension would be $N$ here. Gopalan et al. (2014) consider a variant of MAB where one can play a subset of arms in each round and the expected reward is a function of rewards of the arms played. This setting is similar to the MNL-bandit, though the regret bounds they develop are dependent on the instance parameters as well as the number of possible actions, which can be large in our combinatorial problem setting. Moreover, the computational tractability of updating the posterior and computing the optimal action set is not immediately clear.

Our Contributions.

In this work, relying on structural properties of the MNL model, we develop a TS approach that is computationally efficient and yet achieves parameter independent (optimal in order) regret bounds. Specifically, we present a computationally efficient TS algorithm for the MNL-bandit which uses a prior distribution on the parameters of the MNL model such that the posterior update under the MNL-bandit feedback is tractable. A key ingredient in our approach is a two moment approximation of the posterior and the ability to judiciously correlate samples, which is done by embedding the two-moment approximation in a normal family. It is shown that our algorithm achieves a worst-case (prior-free) regret bound of $O(\sqrt{NT} \log TK)$ under a mild assumption that $v_0 \ge v_i$ for all $i$ (more on the practicality of this assumption later in the text); the bound is non-asymptotic; the “big oh” notation is used for brevity. This regret bound is independent of the parameters of the MNL choice model and hence holds uniformly over all problem instances. The regret is comparable to the existing upper bound of $\tilde{O}(\sqrt{NT})$ and the lower bound of $\Omega(\sqrt{NT/K})$ provided by Agrawal et al. (2016) under the same assumption, yet the numerical results demonstrate that our Thompson Sampling based approach significantly outperforms the UCB-based approach of Agrawal et al. (2016). The methods developed in this paper highlight some of the key challenges involved in adapting the TS approach to the MNL-bandit, and present a blueprint to address these issues that we hope will be more broadly applicable, and form the basis for further work in the intersection of combinatorial optimization and machine learning.

3 Algorithm

In this section, we describe our posterior sampling (aka Thompson Sampling) based algorithm for the MNL-bandit problem. The basic structure of Thompson Sampling involves maintaining a posterior on the unknown problem parameters, which is updated every time new feedback is obtained. In the beginning of every round, a sample set of parameters is generated from the current posterior distribution, and the algorithm chooses the best option according to these sample parameters. Due to its combinatorial nature, designing an algorithm in this framework for the MNL-bandit problem involves several new challenges as we describe below, along with our algorithm design choices to address them.

3.1 Challenges and key ideas

Conjugate priors for the MNL parameters.

In the MNL-bandit problem, there is one unknown parameter $v_i$ associated with each item. To adapt the TS algorithm for the classical MAB problem, here we would need to maintain a joint posterior for $(v_1, \ldots, v_N)$. However, updating such a joint posterior is non-trivial since the feedback observed in every round is the choice made by the user among the offered set of items $S$, and the observed choice provides a sample from the multinomial choice probability $v_i / (1 + \sum_{j \in S} v_j)$, which clearly depends on the subset $S$ offered in that round. In particular, even if we initialize with an independent prior from a nice analytical family such as multivariate Gaussian, the posterior distribution after observing the MNL choice feedback can have a complex description.

Another possibility is to maintain a posterior for the revenue function of each of the roughly $N^K$ possible assortments, where the posterior for the set $S$ is updated only when that set is offered. However, due to the exponential number of possible offered sets, such an approach would learn very slowly and result in regret exponential in $K$, in addition to being computationally inefficient.

One of the key ideas utilized in our algorithm design is that of repeated offering of assortments in a way that allows us to efficiently maintain independent conjugate (Beta) priors for parameters $v_i$ for each $i$. Details of the resulting TS algorithm are presented in Algorithm 1 in Section 3.2.

Posterior approximation and Correlated sampling.

Algorithm 1 samples the posterior distribution for each parameter independently in each round. However, this algorithm presents unique challenges in theoretical analysis. A worst case regret analysis of Thompson Sampling based algorithms for MAB typically proceeds by showing that the best arm is optimistic at least once every few steps, in the sense that its sampled parameter is better than its true parameter. Such a proof approach for our combinatorial problem requires that every few steps, all the $K$ items in the optimal offer set have sampled parameters that are better than their true counterparts. This makes the probability of being optimistic exponentially small in $K$.

We address this challenge by employing correlated sampling across items. To implement correlated sampling, we find it useful to approximate the Beta posterior by a Gaussian distribution with approximately the same mean and variance as the Beta distribution; what was referred to in the introduction as a two-moment approximation. This allows us to generate correlated samples from the $N$ Gaussian distributions as linear transforms of a single standard Gaussian. Under such correlated sampling, the probability of all $K$ optimal items to be simultaneously optimistic is a constant, as opposed to being exponentially small (in $K$) in the case of independent samples. However, such correlated sampling reduces the overall variance of the maximum of $N$ samples severely, thus reducing exploration. We boost the variance by taking $K$ samples instead of a single sample of the standard Gaussian. The resulting variant of Thompson Sampling algorithm is presented in Algorithm 2 in Section 3.3. We prove a near-optimal regret bound for this algorithm in Section 4.
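The gap between independent and correlated sampling can be illustrated with a small Monte Carlo experiment. The sketch below makes a strong simplifying assumption that is ours, not the paper's: each posterior is a Gaussian centered exactly at the true parameter, so an item is "optimistic" precisely when its standardized draw is non-negative. Under that assumption the independent probability is $2^{-K}$ while the correlated one is $1/2$.

```python
import random


def prob_all_optimistic(K, correlated, trials, rng):
    """Estimate P(all K standardized draws are >= 0) under the stated assumption."""
    hits = 0
    for _ in range(trials):
        if correlated:
            z = rng.gauss(0.0, 1.0)              # one shared standard Gaussian
            draws = [z] * K
        else:
            draws = [rng.gauss(0.0, 1.0) for _ in range(K)]
        hits += all(d >= 0.0 for d in draws)
    return hits / trials


rng = random.Random(1)
K = 5
p_independent = prob_all_optimistic(K, correlated=False, trials=20000, rng=rng)
p_correlated = prob_all_optimistic(K, correlated=True, trials=20000, rng=rng)
```

With $K = 5$ the independent estimate should land near $1/32$ and the correlated estimate near $1/2$; real posteriors are not centered at the truth, so this only conveys the qualitative effect.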

To the best of our knowledge the idea of correlated sampling for combinatorial arms is novel, and potentially useful for further extensions to other combinatorial problems. In fact, by reducing sampling variance, correlated sampling may lead to better empirical performance, and may even compensate for the boosting due to multiple samples. In Section 5, we present some preliminary numerical simulation results, which illustrate this intuition.

3.2 Warmup: A TS algorithm with independent conjugate Beta priors

In this first version of our Thompson sampling algorithm, we maintain a Beta posterior distribution for each item $i = 1, \ldots, N$, which is updated as we observe users’ choice of items from the offered subsets. A key challenge here is to design priors that can be efficiently updated on observing user choice feedback, in order to obtain increasingly accurate estimates of parameters $\{v_i\}$. To address this, we use a technique introduced in Agrawal et al. (2016). The idea is to offer a set $S$ multiple times; in particular, a chosen $S$ is offered repeatedly until an “outside option” is picked (in the motivating application discussed earlier, this corresponds to displaying the same subset of ads until we observe a user who does not click on any of the displayed ads). Proceeding in this manner, the average number of times an item $i$ is selected provides an unbiased estimate of parameter $v_i$. Moreover, the number of times an item $i$ is selected is also independent of the displayed set and is a geometric distribution with success probability $1/(1+v_i)$ and mean $v_i$. Precise statements are provided in Lemma 1. This observation is used as the basis for our epoch based algorithmic structure and our choice of prior/posterior, as a conjugate to this geometric distribution.
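The claim that the pick count of an item within an epoch does not depend on what else is displayed can be checked by simulation. The following is a minimal sketch with hypothetical parameter values; `simulate_epoch` is our own helper name.

```python
import random


def simulate_epoch(offer_set, v, rng):
    """Offer the same set until the outside option (index 0) is chosen; return pick counts."""
    items = list(offer_set)
    weights = [1.0] + [v[i] for i in items]      # outside option has weight v_0 = 1
    picks = {i: 0 for i in items}
    while True:
        choice = rng.choices([0] + items, weights=weights)[0]
        if choice == 0:
            return picks
        picks[choice] += 1


rng = random.Random(0)
v = {1: 0.8, 2: 0.3, 3: 0.5}                     # hypothetical MNL parameters
n_epochs = 20000
avg_alone = sum(simulate_epoch([1], v, rng)[1] for _ in range(n_epochs)) / n_epochs
avg_bundle = sum(simulate_epoch([1, 2, 3], v, rng)[1] for _ in range(n_epochs)) / n_epochs
```

Both averages should be close to $v_1 = 0.8$, whether item 1 is shown alone or together with items 2 and 3.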

Epoch based offerings:

Our algorithm proceeds in epochs $\ell = 1, 2, \ldots$. An epoch is a group of consecutive time steps, where a set $S_\ell$ is offered repeatedly until the outside option is picked in response to offering $S_\ell$. The set $S_\ell$ to be offered in an epoch $\ell$ is picked in the beginning of the epoch based on the sampled parameters from the current posterior distribution; the construction of these posteriors and choice of $S_\ell$ is described in the next paragraph. We denote the group of time steps in an epoch as $\mathcal{E}_\ell$, which includes the time step at which an outside option was preferred.

Construction of conjugate prior/posterior:

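Suppose that the prior distribution for parameter $v_i$ in the beginning of an epoch $\ell$ is the same as that of $X_\beta = \frac{1}{\mathrm{Beta}(n, V)} - 1$. In Lemma 2, we show that after observing the geometric variable $\tilde{v}_{i,\ell} = m$, the number of picks of item $i$ in epoch $\ell$, the posterior distribution of $v_i$ is the same as that of $X'_\beta = \frac{1}{\mathrm{Beta}(n+1, V+m)} - 1$. Therefore, we use the distribution of $\frac{1}{\mathrm{Beta}(1,1)} - 1$ as the starting prior for $v_i$, and then, in the beginning of epoch $\ell$, the posterior is distributed as $\frac{1}{\mathrm{Beta}(n_i(\ell), V_i(\ell))} - 1$, with $n_i(\ell)$ being the number of epochs the item $i$ has been offered before epoch $\ell$ (as part of an assortment), and $V_i(\ell)$ being the number of times it was picked by the user. This posterior distribution has expected value $\frac{V_i(\ell)}{n_i(\ell) - 1}$ and variance close to $\frac{1}{n_i(\ell)} \cdot \frac{V_i(\ell)}{n_i(\ell)} \left( \frac{V_i(\ell)}{n_i(\ell)} + 1 \right)$ (refer to Lemma 3 on posterior moments below), so that the samples for parameter $v_i$ from this distribution will approach $v_i$ as the number of epochs $n_i(\ell)$ where item $i$ is offered becomes large.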
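The posterior bookkeeping described above amounts to two counters per item. The sketch below is a minimal illustration, assuming a single item that is always offered on its own and a hypothetical true parameter of 0.6; the helper names are ours.

```python
import random


def sample_v(n_i, V_i, rng):
    """Draw from the distribution of 1/Beta(n_i, V_i) - 1."""
    theta = rng.betavariate(n_i, V_i)
    return 1.0 / max(theta, 1e-12) - 1.0      # guard against a numerically zero draw


def update(n_i, V_i, picks):
    """Conjugate update after one epoch in which the item was picked `picks` times."""
    return n_i + 1, V_i + picks


def picks_in_epoch(v_i, rng):
    """Geometric number of picks before the outside option is chosen."""
    picks = 0
    while rng.random() < v_i / (1.0 + v_i):
        picks += 1
    return picks


rng = random.Random(2)
true_v = 0.6                                   # hypothetical true parameter
n_i, V_i = 1, 1                                # Beta(1, 1) starting prior
for _ in range(5000):
    n_i, V_i = update(n_i, V_i, picks_in_epoch(true_v, rng))
posterior_mean = V_i / (n_i - 1)
draws = [sample_v(n_i, V_i, rng) for _ in range(2000)]
```

After many epochs the posterior mean and the individual draws both concentrate around the true parameter, which is the behaviour the text describes.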

Selection of subset to be offered:

To choose the subset to be offered in epoch $\ell$, the algorithm samples a set of parameters $\mu_1(\ell), \ldots, \mu_N(\ell)$ independently from the current posteriors and finds the set that maximizes the expected revenue as per the sampled parameters. In particular, the set $S_\ell$ to be offered in epoch $\ell$ is chosen as:

$$S_\ell := \arg\max_{|S| \le K} R(S, \mu(\ell)).$$

There are efficient polynomial time algorithms available to solve this optimization problem (e.g., refer to Davis et al. (2013), Avadhanula et al. (2016), Rusmevichientong et al. (2010)).

The details of our procedure are provided in Algorithm 1.

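Initialization: For each item $i = 1, \ldots, N$, $V_i = 1$, $n_i = 1$.
$t = 1$, keeps track of the time steps
$\ell = 1$, keeps count of total number of epochs
while $t \le T$ do

  (a) (Posterior Sampling) For each item $i = 1, \ldots, N$, sample $\theta_i(\ell)$ from the $\mathrm{Beta}(n_i, V_i)$ distribution and compute $\mu_i(\ell) = \frac{1}{\theta_i(\ell)} - 1$.

  (b) (Subset Selection) Compute $S_\ell = \arg\max_{|S| \le K} R(S, \mu(\ell))$, where $R(S, \mu(\ell)) = \frac{\sum_{i \in S} r_i \mu_i(\ell)}{1 + \sum_{j \in S} \mu_j(\ell)}$.

  (c) (Epoch-based offering)
      repeat
        • Offer the set $S_\ell$, and observe the user choice $c_t$;
        • Update $\mathcal{E}_\ell = \mathcal{E}_\ell \cup \{t\}$, time indices corresponding to epoch $\ell$; $t = t + 1$;
      until $c_{t-1} = 0$ (the outside option is chosen);

  (d) (Posterior update)
        • For each item $i \in S_\ell$, compute $\tilde{v}_{i,\ell} = \sum_{t \in \mathcal{E}_\ell} \mathbb{1}(c_t = i)$, no. of picks of item $i$ in epoch $\ell$.
        • Update $V_i = V_i + \tilde{v}_{i,\ell}$, $n_i = n_i + 1$ for each $i \in S_\ell$, and $\ell = \ell + 1$.

end while
Algorithm 1 A TS algorithm for MNL-bandit with Independent Beta priors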
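The epoch structure of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the authors' implementation: subset selection is done by brute-force enumeration (the paper points to efficient polynomial-time methods instead), the user is simulated from hypothetical true parameters, items are indexed from 0, and `None` stands for the outside option.

```python
import random
from itertools import combinations


def revenue(S, r, mu):
    return sum(r[i] * mu[i] for i in S) / (1.0 + sum(mu[i] for i in S))


def best_set(r, mu, K):
    """Brute-force subset selection (illustration only; exponential in K)."""
    candidates = (S for k in range(1, K + 1) for S in combinations(range(len(r)), k))
    return max(candidates, key=lambda S: revenue(S, r, mu))


def user_choice(S, v, rng):
    """Simulated MNL user: returns the chosen item, or None for the outside option."""
    return rng.choices([None] + list(S), weights=[1.0] + [v[i] for i in S])[0]


def ts_independent_beta(r, v, K, T, seed=0):
    rng = random.Random(seed)
    N = len(r)
    V, n = [1] * N, [1] * N                     # Beta(1, 1) priors
    t, offered = 0, []
    while t < T:
        # (a) posterior sampling: mu_i = 1/theta_i - 1 with theta_i ~ Beta(n_i, V_i)
        mu = [1.0 / max(rng.betavariate(n[i], V[i]), 1e-12) - 1.0 for i in range(N)]
        # (b) subset selection on the sampled parameters
        S = best_set(r, mu, K)
        # (c) offer S until the outside option is chosen (or the horizon ends)
        picks = {i: 0 for i in S}
        while t < T:
            choice = user_choice(S, v, rng)
            offered.append(S)
            t += 1
            if choice is None:
                break
            picks[choice] += 1
        # (d) posterior update for the offered items
        for i in S:
            V[i] += picks[i]
            n[i] += 1
    return offered, n, V


r = [1.0, 0.8, 0.5, 0.3]                        # hypothetical rewards
v = [0.3, 0.6, 0.9, 1.0]                        # hypothetical true MNL parameters
offered, n, V = ts_independent_beta(r, v, K=2, T=2000)
```

The final epoch may be cut short by the horizon, in which case its update is based on a truncated epoch; the sketch accepts this small inaccuracy for brevity.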

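The following lemmas provide important building blocks for our construction. Their proofs have been deferred to the appendix.

Lemma 1 (Agrawal et al. (2016))

Let $\tilde{v}_{i,\ell}$ be the number of times an item $i \in S_\ell$ is picked when the set $S_\ell$ is offered repeatedly until no-click (outside option is picked). Then, for each item $i$, the variables $\tilde{v}_{i,\ell}$, $\ell = 1, 2, \ldots$, are i.i.d. geometric random variables with success probability $\frac{1}{1+v_i}$, and expected value $v_i$.

Lemma 2 (Conjugate Priors)

For any $\alpha > 3$, $\beta > 0$, let $X_{\alpha,\beta} = \frac{1}{\mathrm{Beta}(\alpha,\beta)} - 1$ and $f_{\alpha,\beta}$ be a probability distribution of the random variable $X_{\alpha,\beta}$. If $v_i$ is distributed as $f_{\alpha,\beta}$ and $\tilde{v}_{i,\ell}$ is a geometric random variable with success probability $\frac{1}{v_i+1}$, then we have,

$$P\left(v_i \mid \tilde{v}_{i,\ell} = m\right) = f_{\alpha+1,\,\beta+m}(v_i).$$

Lemma 3 (Moments of the Posterior Distribution)

If $X$ is a random variable distributed as $\frac{1}{\mathrm{Beta}(\alpha,\beta)} - 1$, then

$$E[X] = \frac{\beta}{\alpha-1}, \qquad \mathrm{Var}(X) = \frac{\frac{\beta}{\alpha-1}\left(\frac{\beta}{\alpha-1}+1\right)}{\alpha-2}.$$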
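The conjugacy stated in Lemma 2 can be seen from a short change of variables. The derivation below is our own summary rather than the appendix proof; it introduces $\theta = 1/(1+v_i)$ for convenience and writes $\pi$ for a density on $\theta$.

```latex
\begin{align*}
\theta = \frac{1}{1+v_i} \sim \mathrm{Beta}(\alpha,\beta)
  &\;\Longrightarrow\; \pi(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}, \\
P(\tilde v_{i,\ell} = m \mid \theta)
  &= \Big(\frac{v_i}{1+v_i}\Big)^{m}\frac{1}{1+v_i} = (1-\theta)^{m}\,\theta, \\
\pi(\theta \mid \tilde v_{i,\ell} = m)
  &\propto \theta^{\alpha}(1-\theta)^{\beta+m-1}
  \;\Longrightarrow\; \theta \mid m \sim \mathrm{Beta}(\alpha+1,\beta+m).
\end{align*}
```

Since $v_i = 1/\theta - 1$ is a one-to-one function of $\theta$, the posterior of $v_i$ is the distribution of $\frac{1}{\mathrm{Beta}(\alpha+1,\beta+m)} - 1$: each epoch adds one to the first parameter and the number of picks to the second.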

3.3 A TS algorithm with posterior approximation and correlated sampling

Motivated by the challenges in theoretical analysis of Algorithm 1 described earlier, in this section we design a variant, Algorithm 2, with the main changes being the introduction of a posterior approximation by means of a Gaussian distribution, correlated sampling, and taking multiple samples (“variance boosting”). We describe each of these below.

• Posterior approximation: We approximate the posterior distributions used in Algorithm 1 for the MNL parameters $v_i$ by Gaussian distributions with approximately the same mean and variance (refer to Lemma 3 on posterior moments in Section 3.2). In particular, let

  $$\hat{v}_i(\ell) := \frac{V_i(\ell)}{n_i(\ell)}, \qquad \hat{\sigma}_i(\ell) := \sqrt{\frac{50\, \hat{v}_i(\ell)\left(\hat{v}_i(\ell)+1\right)}{n_i(\ell)}} + 75\, \frac{\sqrt{\log TK}}{n_i(\ell)}, \qquad \ell = 1, 2, \ldots,$$

  then the posterior distribution used for item $i$ in the beginning of epoch $\ell$ is $\mathcal{N}\left(\hat{v}_i(\ell), \hat{\sigma}_i^2(\ell)\right)$.

• Correlated sampling: Given the posterior approximation by Gaussian distributions, we correlate the samples by using a common standard Gaussian sample and constructing our posterior samples as an appropriate transform of this common standard Gaussian sample. This allows us to generate sample parameters for $i = 1, \ldots, N$ that are either simultaneously high or simultaneously low, thereby boosting the probability that the sample parameters for all the $K$ items in the best assortment are optimistic.

• Multiple ($K$) samples: The correlated sampling decreases the joint variance of the sample set. In order to boost this joint variance and ensure sufficient exploration, we generate multiple sets of samples. In particular, in the beginning of an epoch $\ell$, we generate $K$ independent samples from the standard normal distribution, $\theta^{(j)} \sim \mathcal{N}(0,1)$, $j = 1, \ldots, K$. Then, the $j$th sample set is generated as:

  $$\mu_i^{(j)}(\ell) = \hat{v}_i(\ell) + \theta^{(j)} \hat{\sigma}_i(\ell), \qquad i = 1, \ldots, N,$$

  and we use the highest valued samples

  $$\mu_i(\ell) = \max_{j = 1, \ldots, K} \mu_i^{(j)}(\ell), \qquad \forall i,$$

  to decide the assortment to offer in epoch $\ell$,

  $$S_\ell := \arg\max_{|S| \le K} R(S, \mu(\ell)).$$
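The three ingredients above (Gaussian approximation, a shared standard Gaussian, and taking the maximum over $K$ draws) can be sketched as follows. This is a hedged illustration: the constants `c1` and `c2` in the standard-deviation formula are exposed as parameters and their default values should be treated as assumptions, as should the example counters.

```python
import math
import random


def correlated_boosted_samples(V, n, K, T, rng, c1=50.0, c2=75.0):
    """Return boosted correlated samples mu_i plus the pieces used to build them.

    V[i], n[i] are the pick and epoch counters of item i; c1, c2 are assumed constants.
    """
    thetas = [rng.gauss(0.0, 1.0) for _ in range(K)]   # shared across all items
    v_hat, sigma_hat, mu = [], [], []
    for V_i, n_i in zip(V, n):
        center = V_i / n_i
        scale = (math.sqrt(c1 * center * (center + 1.0) / n_i)
                 + c2 * math.sqrt(math.log(T * K)) / n_i)
        v_hat.append(center)
        sigma_hat.append(scale)
        # highest of the K correlated samples for this item
        mu.append(max(center + th * scale for th in thetas))
    return mu, thetas, v_hat, sigma_hat


rng = random.Random(3)
V = [12, 40, 7]        # hypothetical pick counts
n = [30, 50, 20]       # hypothetical epoch counts
mu, thetas, v_hat, sigma_hat = correlated_boosted_samples(V, n, K=3, T=1000, rng=rng)
```

Because every scale is positive, taking the maximum over the $K$ sample sets is the same as using the largest of the $K$ standard Gaussians for every item, so all items move above or below their means together. The sketch does not clip negative samples.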

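We summarize the steps in Algorithm 2. Here, we also have an “initial exploration period,” where for every item $i$, we offer a set containing only $i$ until the user selects the outside option.

Initialization: $t = 0$, $\ell = 0$, $n_i = 0$ for all $i = 1, \ldots, N$.
for each item $i = 1, \ldots, N$ do
  Display item $i$ to users until the user selects the “outside option”. Let $\tilde{v}_{i,1}$ be the number of times item $i$ was offered. Update: $V_i = \tilde{v}_{i,1} - 1$, $t = t + \tilde{v}_{i,1}$, $\ell = \ell + 1$, and $n_i = n_i + 1$.
end for
while $t \le T$ do

  (a) (Correlated sampling) for $j = 1, \ldots, K$ do
        • Sample $\theta^{(j)}(\ell)$ from the distribution $\mathcal{N}(0,1)$; update $\hat{v}_i = \frac{V_i}{n_i}$.
        • For each item $i \le N$, compute $\mu_i^{(j)}(\ell) = \hat{v}_i + \theta^{(j)}(\ell) \cdot \left( \sqrt{\frac{50\, \hat{v}_i (\hat{v}_i + 1)}{n_i}} + \frac{75 \sqrt{\log TK}}{n_i} \right)$.
      end for
      For each item $i \le N$, compute $\mu_i(\ell) = \max_{j = 1, \ldots, K} \mu_i^{(j)}(\ell)$.

  (b) (Subset selection) Same as step (b) of Algorithm 1.

  (c) (Epoch-based offering) Same as step (c) of Algorithm 1.

  (d) (Posterior update) Same as step (d) of Algorithm 1.

end while
Algorithm 2 A TS algorithm for MNL-bandit with Gaussian approximation and correlated sampling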

Intuitively, while the second moment approximation by a Gaussian distribution and the multiple samples in Algorithm 2 may make the posterior converge more slowly and increase exploration, the correlated sampling may compensate for these effects by reducing the variance of the maximum of $N$ samples, and therefore reducing the overall exploration effort. In Section 5, we illustrate some of these insights through some preliminary numerical simulations, where correlated sampling performs significantly better compared to independent sampling, and posterior approximation by Gaussian distribution has little effect.

4 Regret Analysis

We prove an upper bound on the regret of Algorithm 2 for the MNL-bandit problem, under the following assumption.

Assumption 1

For every item $i \in \{1, \ldots, N\}$, the MNL parameter $v_i$ satisfies $v_i \le v_0 = 1$.

This assumption is equivalent to the outside option being at least as preferable as any other item. This assumption holds for many applications like display advertising, where users do not click on any of the displayed ads more often than not. Our main theoretical result is the following upper bound on the regret of Algorithm 2.

Theorem 1

For any instance $v = (v_0, \ldots, v_N)$ of the MNL-bandit problem with $N$ products, $r_i \in [0,1]$, and satisfying Assumption 1, the regret of Algorithm 2 in time $T$ is bounded as,

$$Reg(T, v) \le C_1 \sqrt{NT} \log TK + C_2 N \log^2 TK,$$

where $C_1$ and $C_2$ are absolute constants (independent of problem parameters).

4.1 Proof Sketch

We break down the expression for total regret

$$Reg(T, v) := E\left[ \sum_{t=1}^{T} R(S^*, v) - R(S_t, v) \right]$$

into regret per epoch, and rewrite it as follows:

$$Reg(T, v) = \underbrace{E\left[ \sum_{\ell=1}^{L} |\mathcal{E}_\ell| \left( R(S^*, v) - R(S_\ell, \mu(\ell)) \right) \right]}_{Reg_1(T, v)} + \underbrace{E\left[ \sum_{\ell=1}^{L} |\mathcal{E}_\ell| \left( R(S_\ell, \mu(\ell)) - R(S_\ell, v) \right) \right]}_{Reg_2(T, v)},$$

where $|\mathcal{E}_\ell|$ is the number of time steps in epoch $\ell$, $L$ is the total number of epochs, and $S_\ell$ is the set repeatedly offered by our algorithm in epoch $\ell$. Then, we bound the two terms $Reg_1(T, v)$ and $Reg_2(T, v)$ separately.

The first term $Reg_1(T, v)$ is essentially the difference between the optimal revenue of the true instance and the optimal revenue of the sampled instance (since $S_\ell$ was chosen by the algorithm to be an optimal $K$-sized subset for the sampled instance). Therefore, this term would contribute no regret if the sampled instances were always optimistic. Unlike optimism under uncertainty approaches like UCB, this property is not ensured by our Thompson Sampling based algorithm. To bound this term, we utilize anti-concentration properties of the posterior, as well as the dependence between samples for different items, in order to prove that at least one of our $K$ sampled instances is optimistic often enough.

The second term $Reg_2(T, v)$ captures the difference in the revenue of the offered set $S_\ell$ when evaluated on sampled parameters vs. true parameters. This is bounded by utilizing the concentration properties of our posterior distributions. It involves showing that for the sets that are played often, the posterior will converge quickly, so that revenue on the sampled parameters will be close to that on the true parameters.

Some further details are provided below. A complete proof is provided in Appendix C.

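Bounding the first term $Reg_1(T, v)$. Firstly, by our assumption $v_0 \ge v_i$ for all $i$, the outside option is picked at least as often as any particular item $i$, and therefore, it is not difficult to see that the expected value of epoch length $|\mathcal{E}_\ell|$ is bounded by $K + 1$, so that $Reg_1(T, v)$ is bounded as

$$(K+1)\, E\left[ \sum_{\ell=1}^{L} R(S^*, v) - R(S_\ell, \mu(\ell)) \right].$$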
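The bound on the expected epoch length follows from a one-line computation, spelled out here as a worked step. It conditions on the offered set $S_\ell$ and writes $p_0(S_\ell)$ for the probability of the outside option; the only facts used are $v_j \le v_0 = 1$ and $|S_\ell| \le K$.

```latex
\begin{align*}
P\big(|\mathcal{E}_\ell| = k \,\big|\, S_\ell\big)
  &= \big(1 - p_0(S_\ell)\big)^{k-1} p_0(S_\ell),
  \qquad p_0(S_\ell) = \frac{1}{1 + \sum_{j \in S_\ell} v_j}, \\
E\big[|\mathcal{E}_\ell| \,\big|\, S_\ell\big]
  &= \frac{1}{p_0(S_\ell)} = 1 + \sum_{j \in S_\ell} v_j
  \;\le\; 1 + |S_\ell| \;\le\; K + 1 .
\end{align*}
```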

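Let us call an epoch $\ell$ optimistic if the sampled parameters $\mu(\ell)$ are such that the optimal set $S^*$ has at least as much revenue on the sampled parameters as on the true parameters, i.e., if $R(S^*, \mu(\ell)) \ge R(S^*, v)$. Then, clearly, such epochs don’t contribute to this term:

$$R(S^*, v) - R(S_\ell, \mu(\ell)) \le R(S^*, \mu(\ell)) - R(S_\ell, \mu(\ell)) \le 0 \qquad \text{(by optimality of } S_\ell \text{)}.$$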

To bound the contribution of the remaining epochs, we first bound the number of consecutive epochs between two optimistic epochs, by analyzing the probability of being optimistic. Intuitively, using anti-concentration properties of the Gaussian distribution, we have that, with a constant probability, a sampled parameter $\mu_i^{(j)}(\ell)$ for any item $i$ will exceed the posterior mean by a few standard deviations. Now, since our Gaussian posterior’s mean is equal to the unbiased estimate $\hat{v}_i$, and its standard deviation is close to the expected deviation of estimate $\hat{v}_i$ from the true parameter $v_i$, we can conclude that any sampled parameter $\mu_i^{(j)}(\ell)$ will be optimistic with a constant probability, i.e., $\mu_i^{(j)}(\ell) \ge v_i$. However, for an epoch to be optimistic, sampled parameters for all the items in $S^*$ may need to be optimistic. This is where the correlated sampling feature of our algorithm is crucially utilized. Using the dependence structure between samples for different items in $S^*$, and variance boosting provided by the sampling of $K$ independent sample sets, we prove an upper bound of roughly $1/K$ on the number of consecutive epochs between two optimistic epochs. The precise lemma is as follows, which forms one of the primary technical components of our proof:

Lemma 4 (Spacing of optimistic epochs)

Let $\mathcal{E}^{An}(\tau)$ be the group of consecutive epochs between an optimistic epoch $\tau$ and the next optimistic epoch $\tau'$. Then, for any $p \in [1, 2]$, we have,

$$E^{1/p}\left[ \left| \mathcal{E}^{An}(\tau) \right|^p \right] \le \frac{e^{12}}{K} + 30^{1/p}.$$

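Proof for the above lemma is provided in Appendix C.1. Next, we bound the individual contribution of any “non-optimistic” epoch $\ell$ (i.e., any epoch in $\mathcal{E}^{An}(\tau)$) by relating it to the closest optimistic epoch $\tau$ before it. By definition of an optimistic epoch,

$$R(S^*, v) - R(S_\ell, \mu(\ell)) \le R(S^*, \mu(\tau)) - R(S_\ell, \mu(\ell)),$$

and by the choice of $S_\ell$ as the revenue maximizing set for the sampled parameters $\mu(\ell)$ (together with the optimality of $S_\tau$ for $\mu(\tau)$):

$$R(S^*, \mu(\tau)) - R(S_\ell, \mu(\ell)) \le R(S_\tau, \mu(\tau)) - R(S_\tau, \mu(\ell)).$$

What remains to bound is the difference in the revenue of the set $S_\tau$ for two different sample parameters: $\mu(\tau)$ and $\mu(\ell)$. Over time, as the posterior distributions concentrate around their means, which in turn concentrate around the true parameters, this difference becomes smaller. In fact, using the Lipschitz property of $R(S_\tau, \cdot)$, the difference $\left| R(S_\tau, \mu(\tau)) - R(S_\tau, \mu(\ell)) \right|$ can be bounded by $\tilde{O}\left( \sum_{i \in S_\tau} \left( \hat{\sigma}_i(\tau) + \hat{\sigma}_i(\ell) \right) \right)$ (refer to Lemma 11 in the appendix), where $\hat{\sigma}_i(\tau)$ was defined as the standard deviation of the posterior distribution in the beginning of epoch $\tau$, which is larger than $\hat{\sigma}_i(\ell)$, and roughly equal to the deviation of posterior mean from the true parameter $v_i$.

To summarize, since between two optimistic epochs $\tau$ and $\tau'$ there are $|\mathcal{E}^{An}(\tau)|$ non-optimistic epochs, and each of their contributions to $Reg_1(T, v)$ is bounded by some multiple of $\sum_{i \in S_\tau} \hat{\sigma}_i(\tau)$, this term can be bounded roughly as:

$$Reg_1(T, v) \le \tilde{O}(K)\, E\left[ \sum_{\tau\ \text{optimistic}} \left| \mathcal{E}^{An}(\tau) \right| \sum_{i \in S_\tau} \hat{\sigma}_i(\tau) \right].$$

A bound of $\tilde{O}(\sqrt{NT})$ on the sum of these deviations can be derived, which will also be useful for bounding the second term, as discussed next.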

    Bounding the second term .

    Noting that the expected epoch length when set is offered is , where , can be reformulated as

    Again, as discussed above, using the Lipschitz property of the revenue function, this can be bounded in terms of the posterior standard deviation (refer to Lemma 11)

    Overall, the above analysis on and implies roughly the following bound on regret

    where is the total number of times was offered in time . Then, utilizing the bound of on the expected number of total picks, i.e., , and doing a worst-case analysis, we obtain a bound of on .
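The worst-case step can be made concrete with a short numerical sketch. It rests on an assumption that is not spelled out in the text above: that the deviation attached to an item shrinks like the inverse square root of the number of times it has been offered, so that an item offered `n` times contributes on the order of `sqrt(n)` in total. Under that assumption, the sum over items is largest when the offers are spread evenly, which is the Cauchy-Schwarz inequality. All names and numbers below are illustrative.

```python
import math
import random


def sum_inverse_sqrt(n):
    """Sum of 1/sqrt(k) for k = 1..n; grows like 2*sqrt(n)."""
    return sum(1.0 / math.sqrt(k) for k in range(1, n + 1))


def total_deviation(counts):
    """Sum over items of sqrt(n_i), the per-item total under the assumed rate."""
    return sum(math.sqrt(n) for n in counts)


rng = random.Random(1)
num_items, budget = 10, 10_000

# Split the budget of offers unevenly at random cut points.
cuts = sorted(rng.randint(0, budget) for _ in range(num_items - 1))
counts = [b - a for a, b in zip([0] + cuts, cuts + [budget])]

uneven = total_deviation(counts)
even = total_deviation([budget // num_items] * num_items)
cap = math.sqrt(num_items * budget)  # Cauchy-Schwarz upper bound

print(f"uneven split: {uneven:.2f}, even split: {even:.2f}, cap: {cap:.2f}")
```

The even split attains the cap, so any constraint on the total number of picks translates directly into a bound on the summed deviations.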

    5 Empirical study

    In this section, we analyze the various design components of our Thompson Sampling approach through numerical simulations. The aim is to isolate and understand the effect of individual features of our algorithm design like Beta posteriors vs. Gaussian approximation, independent sampling vs. correlated sampling, and single sample vs. multiple samples, on the practical performance.

    Figure 1: Regret growth with for various heuristics on a randomly generated MNL-bandit instance with .

    We simulate an instance of the MNL-bandit problem with , and , and the MNL parameters generated randomly from . We compute the average regret based on independent simulations over the randomly generated instance. In Figure 1, we report the performance of successive variants of TS: the basic version of TS with independent Beta priors, as described in Algorithm 1, referred to as , Gaussian posterior approximation with independent sampling, referred to as , Gaussian posterior approximation with correlated sampling, referred to as , and finally, Gaussian posterior approximation with correlated sampling and boosting by using multiple () samples, referred to as , which is essentially the version with all the features of Algorithm 2. For comparison, we also present the performance of the UCB approach of Agrawal et al. (2016). We repeated this experiment on several randomly generated instances and observed similar performance.

    The performance of all the variants of TS is observed to be better than the UCB approach in our experiments, which is consistent with the other empirical evidence in the literature.

    Among the TS variants, the performance of , i.e., the basic version with independent Beta priors (essentially Algorithm 1), is quite similar to , the version with independent Gaussian (approximate) posteriors, indicating that the effect of posterior approximation is minor. The performance of , where we generated correlated samples from the Gaussian distributions, is significantly better than all the other variants of the algorithm. This is consistent with our remark earlier that to adapt the Thompson Sampling approach of the classical MAB problem to our setting, ideally we would like to maintain a joint prior over the parameters and update it to a joint posterior on observing the bandit feedback. However, since this can be quite challenging and intractable, we used independent priors over the parameters. The superior performance of demonstrates the potential benefits of considering a joint (correlated) prior/posterior in such settings with combinatorial arms. Finally, we observe that the performance of , where an additional “variance boosting” is provided through independent samples, is worse than , as expected, but still significantly better than the independent Beta posterior version . Therefore, the significant improvements in performance due to the correlated sampling feature of Algorithm 2 compensate for the slight deterioration caused by boosting.


    V. Goyal is supported in part by NSF Grants CMMI-1351838 (CAREER) and CMMI-1636046. A. Zeevi is supported in part by NSF Grants NetSE-0964170 and BSF-2010466.


    • Abramowitz and Stegun (1964) M. Abramowitz and I. A. Stegun. 1964. Handbook of mathematical functions: with formulas, graphs, and mathematical tables.
    • Agrawal et al. (2016) S. Agrawal, V. Avadhanula, V. Goyal, and A. Zeevi. 2016. A Near-Optimal Exploration-Exploitation Approach for Assortment Selection. Proceedings of the 2016 ACM Conference on Economics and Computation (EC) , 599–600.
    • Agrawal and Goyal (2013a) S. Agrawal and N. Goyal. 2013a. Further Optimal Regret Bounds for Thompson Sampling. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS), Vol. (31). 99–107.
    • Agrawal and Goyal (2013b) S. Agrawal and N. Goyal. 2013b. Thompson Sampling for Contextual Bandits with Linear Payoffs. In Proceedings of the 30th International Conference on Machine Learning (ICML), Vol. (28). 127–135.
    • Auer (2003) P. Auer. 2003. Using Confidence Bounds for Exploitation-exploration Trade-offs. Journal of Machine Learning Research (3) , 397–422.
    • Auer et al. (2002) P. Auer, N. Cesa-Bianchi, and P. Fischer. 2002. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47 , 235–256.
    • Avadhanula et al. (2016) V. Avadhanula, J. Bhandari, V. Goyal, and A. Zeevi. 2016. On the tightness of an LP relaxation for rational optimization and its applications. Operations Research Letters 44, (5), 612–617.
    • Ben-Akiva and Lerman (1985) M. Ben-Akiva and S. Lerman. 1985. Discrete choice analysis: theory and application to travel demand. Vol. 9. MIT press.
    • Davis et al. (2013) J. Davis, G. Gallego, and H. Topaloglu. 2013. Assortment planning under the multinomial logit model with totally unimodular constraint structures. Technical Report .
    • Gopalan et al. (2014) A. Gopalan, S. Mannor, and Y. Mansour. 2014. Thompson Sampling for Complex Online Problems. In Proceedings of the 31st International Conference on Machine Learning (ICML), Vol. (32). 100–108.
    • Graepel et al. (2010) T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich. 2010. Web-scale bayesian click-through rate prediction for sponsored search advertising in microsoft’s bing search engine. In Proceedings of the 27th international conference on machine learning (ICML). 13–20.
    • Luce (1959) R.D. Luce. 1959. Individual choice behavior: A theoretical analysis. Wiley.
    • May et al. (2012) B. C. May, N. Korda, A. Lee, and D. S. Leslie. 2012. Optimistic Bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research (13) , 2069–2106.
    • McFadden (1978) D. McFadden. 1978. Modelling the choice of residential location. Institute of Transportation Studies, University of California.
    • Oliver and Li (2011) C. Oliver and L. Li. 2011. An Empirical Evaluation of Thompson Sampling. In Advances in Neural Information Processing Systems (NIPS) 24 (2011), 2249–2257.
    • Plackett (1975) R. L. Plackett. 1975. The Analysis of Permutations. Journal of the Royal Statistical Society. Series C (Applied Statistics)
    • Rusmevichientong and Tsitsiklis (2010) P. Rusmevichientong and J.N. Tsitsiklis. 2010. Linearly Parameterized Bandits. Mathematics of Operations Research 35(2), 395–411.
    • Rusmevichientong et al. (2010) P. Rusmevichientong, Z. M. Shen, and D.B. Shmoys. 2010. Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations Research 58, (6) , 1666–1680.
    • Sauré and Zeevi (2013) D. Sauré and A. Zeevi. 2013. Optimal Dynamic Assortment Planning with Demand Learning. Manufacturing & Service Operations Management 15, (3) , 387–404.
    • Train (2003) K. Train. 2003. Discrete Choice Methods with Simulation. Cambridge University Press.

    Appendix A Unbiased Estimate and Conjugate priors

    Some of the results in this section are adapted from Agrawal et al. (2016), but we provide the proofs again for the sake of completeness.

    We first prove that the estimate obtained from epoch-based offerings in Algorithm 1 is an unbiased estimate and is geometrically distributed with probability of success . Specifically, we have the following result.

    Lemma ?? (Agrawal et al. (2016))

    Proof. We prove the result by computing the moment generating function, from which we can establish that is a geometric random variable with parameter , thereby also establishing that are unbiased estimators of . Specifically, we show the following result.

    The moment generating function of estimate conditioned on , , is given by,

    We focus on proving the above result. From (1), we have that the probability of a no-purchase event when assortment is offered is given by

    Let be the total number of offerings in epoch before a no-purchase occurred, i.e., . Therefore, is a geometric random variable with probability of success . And, given any fixed value of , is a binomial random variable with trials and probability of success given by

    In the calculations below, for brevity we use and respectively to denote and . Hence, we have

    Since the moment generating function for a binomial random variable with parameters is , we have

    For any , such that, is a geometric random variable with parameter , we have

    Note that for all , we have Therefore, we have
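The unbiasedness claim can be checked empirically. The sketch below assumes the standard MNL parametrization with the no-purchase weight normalized to 1, so that item `i` in an assortment is chosen with probability `v[i] / (1 + sum of v over the assortment)`; the parameter values and function names are illustrative and not taken from the paper. It offers the same assortment repeatedly until a no-purchase occurs and records how often each item was picked during that epoch.

```python
import random


def simulate_epoch(v, assortment, rng):
    """Offer `assortment` until a no-purchase occurs; return per-item pick counts."""
    # Position 0 is the no-purchase option with weight 1 (assumed normalization).
    weights = [1.0] + [v[i] for i in assortment]
    counts = {i: 0 for i in assortment}
    while True:
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == 0:
            return counts
        counts[assortment[choice - 1]] += 1


def average_counts(v, assortment, epochs=20000, seed=0):
    """Average the per-epoch pick counts and the fraction of epochs with zero picks."""
    rng = random.Random(seed)
    totals = {i: 0 for i in assortment}
    zeros = {i: 0 for i in assortment}
    for _ in range(epochs):
        counts = simulate_epoch(v, assortment, rng)
        for i, c in counts.items():
            totals[i] += c
            zeros[i] += (c == 0)
    means = {i: totals[i] / epochs for i in assortment}
    zero_frac = {i: zeros[i] / epochs for i in assortment}
    return means, zero_frac


v = [0.8, 0.5, 0.3]
means, zero_frac = average_counts(v, assortment=[0, 1, 2])
for i in range(3):
    print(f"item {i}: v = {v[i]:.2f}, mean picks per epoch = {means[i]:.3f}")
```

If the per-epoch count is geometric with success probability `1 / (1 + v[i])`, its mean is `v[i]` and the fraction of epochs with zero picks is `1 / (1 + v[i])`; both can be read off the simulation output.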

    Building on this result, we will prove Lemma 3.2, which helped construct Algorithm 1. Recall Lemma 3.2.

    Lemma ?? (Conjugate Priors)

    Proof. The proof of the lemma follows from the following result on the probability density function of the random variable . Specifically, we have for any


    where and is the gamma function. Since we assume that the prior distribution of the parameter is the same as that of , we have from (6) and Lemma 3.2,

    Given the pdf of the posterior in (6), it is possible to compute the mean and variance of the posterior distribution. We show that they have simple closed-form expressions. Recall Lemma 3.2.

    Lemma ?? (Moments of the Posterior Distribution)

    Proof. We prove the result by relating the mean of the posterior to the mean of the Beta distribution. Let From (6), we have

    Substituting , we have

    Similarly, we can derive the expression for the .
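A quick simulation can serve as a sanity check on closed-form moments of this kind. The sketch below assumes that the quantity `1 / (1 + v)` has a Beta(alpha, beta) posterior, so that `v = 1/X - 1` with `X` Beta distributed. The closed-form expressions in `posterior_moments` are the standard moments of `1/X - 1` for a Beta variable (valid for `alpha > 2`); they are not copied from the lemma statement, which is not reproduced here, and the parameter values are illustrative.

```python
import random


def posterior_moments(alpha, beta):
    """Mean and variance of 1/X - 1 for X ~ Beta(alpha, beta), alpha > 2."""
    mean = beta / (alpha - 1.0)
    var = mean * (mean + 1.0) / (alpha - 2.0)
    return mean, var


def sampled_moments(alpha, beta, draws=100_000, seed=0):
    """Monte Carlo estimate of the same two moments."""
    rng = random.Random(seed)
    values = [1.0 / rng.betavariate(alpha, beta) - 1.0 for _ in range(draws)]
    mean = sum(values) / draws
    var = sum((x - mean) ** 2 for x in values) / (draws - 1)
    return mean, var


closed_mean, closed_var = posterior_moments(10.0, 5.0)
mc_mean, mc_var = sampled_moments(10.0, 5.0)
print(f"mean: closed form {closed_mean:.4f}, simulated {mc_mean:.4f}")
print(f"variance: closed form {closed_var:.4f}, simulated {mc_var:.4f}")
```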

    Appendix B Bounds on the deviation of MNL Expected Revenue

    Here, we prove a Lipschitz-type bound on the deviation of the function with a change in the parameter .

    Lemma 7

    For any and such that we have,

    Proof. Define sets and as

    and vector as,

    By construction of , we have and for all . Therefore, from Lemma 8, we have

    The result follows from the fact that and for all

    Following a similar proof, we can also establish the following result.

    Lemma 8

    Assume for all . Suppose is an optimal assortment when the MNL parameters are given by , i.e.


    Lemma 9

    For any and , we have for any ,

    where .

    Proof. Note that we have Therefore, by the union bound, we have,

    The result follows from the above inequality and the following anti-concentration bound for the normal random variable (see formula 7.1.13 in Abramowitz and Stegun (1964)).
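For reference, the cited bound can be checked numerically. The sketch below assumes that formula 7.1.13 takes the form 1/(x + sqrt(x^2 + 2)) < e^{x^2} * integral_x^inf e^{-t^2} dt <= 1/(x + sqrt(x^2 + 4/pi)) for x >= 0, and rewrites the lower bound as a lower bound on the standard normal tail via the substitution x = z / sqrt(2). Function names and the evaluation grid are illustrative.

```python
import math


def scaled_tail(x):
    """e^{x^2} * integral_x^inf e^{-t^2} dt, using the complementary error function."""
    return math.exp(x * x) * 0.5 * math.sqrt(math.pi) * math.erfc(x)


def lower_bound(x):
    return 1.0 / (x + math.sqrt(x * x + 2.0))


def upper_bound(x):
    return 1.0 / (x + math.sqrt(x * x + 4.0 / math.pi))


def normal_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def normal_tail_lower(z):
    """Anti-concentration lower bound on P(Z > z) implied by the assumed formula."""
    x = z / math.sqrt(2.0)
    return math.exp(-x * x) * lower_bound(x) / math.sqrt(math.pi)


grid = [0.1 * k for k in range(51)]
# A small tolerance absorbs rounding at x = 0, where the upper bound is tight.
holds = all(lower_bound(x) < scaled_tail(x) <= upper_bound(x) + 1e-12 for x in grid)
tail_holds = all(normal_tail_lower(z) <= normal_tail(z) for z in grid)
print(holds, tail_holds)
```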

    We need the following result to prove Theorem 1, which specifies the convergence rate of the estimate to the true value . For the sake of presentation and continuity, the proof is deferred to the next section.

    Lemma 10

    If for all , then for all , and any , we have,

    1. .

    From Lemma 7, Lemma 9 and Lemma 10, we have the following result.

    Lemma 11

    For any epoch , if

    where and are absolute constants (independent of problem parameters).

    The proof of Lemma 9 in the next section will specify the exact values for and .

    Appendix C Proof of Theorem 1