Near-Optimal Policies for Dynamic Multinomial Logit Assortment Selection Models
Abstract
In this paper we consider the dynamic assortment selection problem under an uncapacitated multinomial-logit (MNL) model. By carefully analyzing a revenue potential function, we show that a trisection-based algorithm achieves an item-independent regret bound of $O(\sqrt{T\log\log T})$, which matches information-theoretic lower bounds up to iterated logarithmic terms. Our proof technique draws tools from the unimodal/convex bandit literature as well as adaptive confidence parameters in minimax multi-armed bandit problems.
Keywords: dynamic assortment planning, multinomial logit choice model, trisection algorithm, regret analysis.
1 Introduction
Assortment planning has a wide range of applications in e-commerce and online advertising. Given a large number of substitutable products, the assortment planning problem refers to the selection of a subset of products (a.k.a. an assortment) to offer to a customer such that the expected revenue is maximized [2, 3, 16, 19, 13]. Suppose there are $N$ items, each associated with a revenue parameter $r_i \in [0,1]$ representing the revenue a retailer collects once a customer purchases the $i$th item (the constraint $r_i \le 1$ is without loss of generality, because it is only a normalization of revenues). The revenue parameters are typically known to the retailer, who has full knowledge of each item’s prices and costs. In a dynamic assortment planning problem with a total of $T$ time epochs, at each epoch $t$ the retailer presents an assortment $S_t \subseteq [N]$ to an incoming customer and observes his/her purchasing action $i_t \in S_t \cup \{0\}$. (If $i_t = 0$ then the customer makes no purchase at time $t$.) If a purchase is made (i.e., $i_t \neq 0$), the corresponding revenue $r_{i_t}$ is collected. It is worth noting that since items are substitutable (e.g., different models of cell phones), a typical setting of assortment planning restricts a purchase to a single item.
The retailer’s objective is to maximize the expected revenue over the $T$ time periods. Such objectives are best measured and evaluated under a “regret minimization” framework, in which the retailer’s assortment sequence is compared against the optimal assortment. More specifically, consider
\[ \mathrm{Regret}(\{S_t\}_{t=1}^T) := \sum_{t=1}^T \big( R(S^*) - \mathbb{E}[R(S_t)] \big), \qquad S^* \in \arg\max_{S \subseteq [N]} R(S) \tag{1} \]
as the regret measure of an assortment sequence $\{S_t\}_{t=1}^T$, where $R(S_t) = \mathbb{E}[r_{i_t} \mid S_t]$ is the expected revenue the retailer collects on assortment $S_t$ (for notational convenience we define $r_0 := 0$, corresponding to the “no-purchase” action).
For the regret measure in Eq. (1) to be well-defined, it is conventional to specify a probabilistic model (known as a “choice model”) that governs a customer’s purchasing choice $i_t$ on a provided assortment $S_t$. Perhaps the most popular choice model is the multinomial-logit (MNL) choice model [21, 17, 5], which assigns each item $i$ a “preference parameter” $v_i \ge 0$ and models the purchasing choice by
\[ \Pr[i_t = j \mid S_t] = \frac{v_j}{v_0 + \sum_{k \in S_t} v_k}, \qquad j \in S_t \cup \{0\}. \tag{2} \]
Subsequently, the expected revenue can be expressed as
\[ R(S_t) = \sum_{j \in S_t} r_j \Pr[i_t = j \mid S_t] = \frac{\sum_{j \in S_t} r_j v_j}{v_0 + \sum_{j \in S_t} v_j}. \tag{3} \]
For normalization purposes the preference parameter for the “no-purchase” action is assumed to be $v_0 = 1$. Apart from that, the rest of the preference parameters $\{v_i\}_{i=1}^N$ are unknown to the retailer and have to be either explicitly or implicitly learnt from customers’ purchasing actions $\{i_t\}$.
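As a concrete illustration of the MNL choice probabilities and the expected revenue of an assortment, here is a minimal Python sketch. The function names, the dictionary-based representation, and the two-item instance are assumptions made for illustration; key `0` stands for the no-purchase option with preference 1.

```python
from typing import Dict, Iterable


def choice_probabilities(assortment: Iterable[int],
                         v: Dict[int, float],
                         v0: float = 1.0) -> Dict[int, float]:
    """MNL purchase probabilities over the assortment plus no-purchase (key 0)."""
    items = list(assortment)
    denom = v0 + sum(v[j] for j in items)
    probs = {0: v0 / denom}          # probability that nothing is purchased
    for j in items:
        probs[j] = v[j] / denom      # probability that item j is purchased
    return probs


def expected_revenue(assortment: Iterable[int],
                     r: Dict[int, float],
                     v: Dict[int, float],
                     v0: float = 1.0) -> float:
    """Expected revenue of an assortment: sum of r_j times the purchase probability."""
    probs = choice_probabilities(assortment, v, v0)
    return sum(r[j] * p for j, p in probs.items() if j != 0)


# Hypothetical two-item instance (values chosen only for illustration).
v = {1: 1.0, 2: 0.5}
r = {1: 0.8, 2: 0.4}
probs = choice_probabilities([1, 2], v)
rev = expected_revenue([1, 2], r, v)
```

In this toy instance, offering both items gives the same expected revenue as offering item 1 alone, which shows that adding a low-revenue item need not help: it draws demand both from the no-purchase option and from the higher-revenue item.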
1.1 Our results and techniques
The main contribution of this paper is an optimal characterization of the worst-case regret under the MNL assortment selection model specified in Eqs. (1) and (2). More specifically, we have the following informal statement of the main results in this paper.
Theorem 1 (informal).
There exists a policy whose worst-case regret over $T$ time periods is upper bounded by $C_1\sqrt{T\log\log T}$ for some universal constant $C_1 > 0$; furthermore, there exists another universal constant $C_2 > 0$ such that no policy can achieve worst-case regret smaller than $C_2\sqrt{T}$.
An important aspect of Theorem 1 is that our regret bound is completely independent of the number of items $N$, which improves the existing dynamic regret minimization results on the MNL assortment selection problem [2, 3, 19]. This property makes our result more favorable for scenarios in which a large number of potential items are available, e.g., online sales or online advertisement.
To enable such an $N$-independent regret, we provide a refined analysis of a certain unimodal revenue potential function first studied in [19] and consider a trisection algorithm on revenue levels, borrowing ideas from the literature on unimodal bandits on either discrete or continuous arm domains [22, 10, 1]. An important challenge is that the revenue potential function (defined in Eq. (4)) does not satisfy convexity or local Lipschitz growth (see the related work in Sec. 1.2 for details), and therefore previous results on unimodal bandits cannot be directly applied. On the other hand, it is a simple exercise to show that mere unimodality in multi-armed bandits cannot lead to regret smaller than the order of $\sqrt{NT}$ with $N$ arms, because the worst-case constructions in the classical lower bound of multi-armed bandits have unimodal arms [6, 7]. To overcome such difficulties, we establish additional properties of the potential function in Eq. (4) which are different from classical convexity or Lipschitz growth properties. In particular, we prove connections between the potential function $F(\theta)$ and the straight line $F(\theta) = \theta$, which are then used as guidelines in the update rules of our trisection. Also, because the potential function behaves differently on the two sides of this line, our trisection algorithm is asymmetric in its treatment of the two trisection midpoints, in contrast to previous trisection-based methods for unimodal bandits [22, 10] that treat both trisection midpoints symmetrically.
We also remark that the upper and lower bounds in Theorem 1 match except for an $O(\sqrt{\log\log T})$ term. Under the “gap-free” setting where $O(\sqrt{T})$ regret is to be expected, the removal of additional $\log T$ terms in dynamic assortment selection and unimodal bandit problems is highly nontrivial. Most previous results on dynamic assortment selection [19, 2, 3] and unimodal/convex bandits [22, 10, 1] have additional $\log T$ terms in their regret upper bounds. (The work of [10] also derived gap-dependent regret bounds for unimodal bandits, which are not easily comparable to our bounds.) The improvement from $\log T$ to $\log\log T$ achieved in this paper is obtained by using sharper law-of-the-iterated-logarithm (LIL) type concentration inequalities [15] and an adaptive confidence strategy similar to the MOSS algorithm for multi-armed bandits [4]. Its analysis, however, is quite different from the analysis of the MOSS algorithm in [4] and also yields an additional $\log\log T$ factor. We conjecture that the additional $\log\log T$ factor can also be removed by resorting to much more complicated procedures, as we discuss in Sec. 6.
1.2 Related work
The question of dynamic optimization of commodity assortments has received increasing attention in both the machine learning and operations management communities [8, 18, 20, 2, 3], as the mean utilities of customers (corresponding to the preference parameters $\{v_i\}$ in our model) are typically unknown and have to be learnt on the fly.
The work of [18] is perhaps the closest to our paper: it analyzed the same revenue potential function and designed a golden-ratio search algorithm whose regret depends only logarithmically on the number of items. The analysis of [18] assumes a constant gap between any two assortment level sets, which might fail to hold when the number of items is large. In this work we relax the gap assumption and also remove the additional $\log N$ dependency through a more refined analysis of the properties of the revenue potential function and by borrowing “trisection” ideas from the unimodal bandit literature [22, 10, 1].
The works of [2, 3] considered variants of UCB/Thompson sampling type methods and focused primarily on the capacitated MNL assortment model, in which the size of each assortment is not allowed to exceed a prespecified parameter $K$. It is known that the regret behavior in capacitated and uncapacitated models can be vastly different: in the capacitated case a regret lower bound that grows polynomially with $N$ exists provided that the capacity $K$ is sufficiently small relative to $N$, while for the uncapacitated model it is possible to achieve $\log N$ or even $N$-independent regret.
Another relevant line of research is unimodal bandits [22, 10, 1, 11], in which discrete or continuous multi-armed bandit problems are considered with additional unimodality constraints on the means of the arms. Apart from unimodality, additional structures such as “inverse Lipschitz continuity” or convexity are imposed to ensure improvement of regret, both of which fail to hold for the potential function arising from uncapacitated MNL assortment choice problems. In addition, under the “gap-free” setting where an $O(\sqrt{T})$ regret is to be expected, most previous works have additional $\log T$ terms in their regret upper bounds, except for the work of [11], which introduces additional strong regularity conditions on the underlying functions.
2 The revenue potential function and its properties
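For the MNL assortment selection model without capacity constraints, it is a classical result that the optimal assortment must consist of items with the largest revenue parameters (see, e.g., [16]):
Proposition 1.
There exists $\theta \in [0,1]$ such that the level set $\mathcal{L}_\theta(r) := \{i \in [N] : r_i \ge \theta\}$ satisfies $R(\mathcal{L}_\theta(r)) = R(S^*)$.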
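The level-set structure in Proposition 1 can be checked numerically on a small instance. The sketch below compares a brute-force search over all subsets with a search restricted to the "top-$k$ by revenue" assortments; the function names and the four-item instance are hypothetical, and the prefix construction coincides with level sets only because the revenues in this example are distinct.

```python
from itertools import combinations


def expected_revenue(S, r, v):
    """MNL expected revenue of assortment S (no-purchase preference normalized to 1)."""
    S = list(S)
    return sum(r[i] * v[i] for i in S) / (1.0 + sum(v[i] for i in S))


def best_over_all_subsets(r, v):
    """Exhaustive search over all 2^N assortments (feasible only for small N)."""
    n = len(r)
    return max(expected_revenue(S, r, v)
               for k in range(n + 1)
               for S in combinations(range(n), k))


def best_over_level_sets(r, v):
    """Search only over the N+1 assortments made of the k highest-revenue items."""
    order = sorted(range(len(r)), key=lambda i: r[i], reverse=True)
    return max(expected_revenue(order[:k], r, v) for k in range(len(r) + 1))


# Hypothetical instance with distinct revenues.
r = [0.9, 0.7, 0.5, 0.3]
v = [0.3, 0.8, 0.5, 1.2]
brute = best_over_all_subsets(r, v)
level = best_over_level_sets(r, v)
```

On this instance both searches select the three highest-revenue items, so restricting attention to level sets loses nothing while reducing the search space from $2^N$ to $N+1$ candidates.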
Proposition 1 suggests that it suffices to consider “level-set” type assortments $\mathcal{L}_\theta(r) = \{i \in [N] : r_i \ge \theta\}$ and to find the threshold $\theta \in [0,1]$ that gives rise to the largest $R(\mathcal{L}_\theta(r))$. This motivates the following “potential” function, which takes a revenue threshold $\theta$ as input and outputs the expected revenue of its corresponding level set assortment:
\[ F(\theta) := R(\mathcal{L}_\theta(r)), \qquad \mathcal{L}_\theta(r) = \{i \in [N] : r_i \ge \theta\}. \tag{4} \]
The potential function $F$ was first introduced and considered in [16], in which it was proved that $F$ is left-continuous, piecewise-constant and unimodal in its input revenue threshold $\theta$. Using such unimodality, a golden-ratio search based policy was designed whose regret depends only logarithmically on $N$ under additional consecutive gap assumptions on the level set assortments. To derive gap-independent results and to get rid of the additional $\log N$ dependency, we provide a more refined analysis of the properties of the potential function $F$ in this paper, summarized in the following three lemmas:
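Lemma 1.
There exists $\theta^* \in [0,1]$ such that $\theta^* = F^* = F(\theta^*)$, where $F^* := \max_{\theta} F(\theta)$.
Lemma 2.
For any $\theta \ge \theta^*$, $F(\theta) \le \theta$ and $F(\theta^+) \le \theta$, where $F(\theta^+) := \lim_{\Delta \to 0^+} F(\theta + \Delta)$.
Lemma 3.
For any $\theta \le \theta^*$, $F(\theta) \ge \theta$ and $F(\theta^+) \ge \theta$.
The proofs of the above lemmas are given in Sec. 7. They give a rather complete picture of the behavior of the potential function $F$, and most importantly of the relationship between $F$ and the central straight line $F(\theta) = \theta$, as depicted in Figure 1. More precisely, the mode of $F$ occurs at its intersection with this line, and $F$ decreases monotonically moving away from $\theta^*$ in both directions. This helps us gauge the position of a particular revenue level $\theta$ relative to $\theta^*$ by simply comparing the expected revenue $F(\theta)$ with $\theta$ itself, motivating an asymmetric trisection algorithm which we describe in the next section.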
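The relationship between the potential function and the line $F(\theta) = \theta$ can be checked on a toy instance. The following sketch evaluates the potential function on a grid of thresholds and verifies that it lies on or above the line to the left of its mode and on or below it to the right; the instance, the grid, and the tolerance `1e-12` are assumptions made for illustration.

```python
def potential(theta, r, v):
    """Expected revenue of the level set {i : r_i >= theta} under the MNL model."""
    S = [i for i in range(len(r)) if r[i] >= theta]
    return sum(r[i] * v[i] for i in S) / (1.0 + sum(v[i] for i in S))


# Hypothetical instance with distinct revenues.
r = [0.9, 0.7, 0.5, 0.3]
v = [0.3, 0.8, 0.5, 1.2]

# The potential is piecewise constant, so its maximum over nonempty level sets
# is attained at some threshold equal to one of the revenues.
F_star = max(potential(ri, r, v) for ri in r)
theta_star = F_star  # candidate fixed point: the maximum value used as a threshold

grid = [k / 1000 for k in range(1001)]
above_line_on_left = all(potential(t, r, v) >= t - 1e-12
                         for t in grid if t <= theta_star)
below_line_on_right = all(potential(t, r, v) <= t + 1e-12
                          for t in grid if t >= theta_star)
fixed_point_gap = abs(potential(theta_star, r, v) - theta_star)
```

Here `fixed_point_gap` is zero up to rounding, so the maximum of the potential is also a fixed point, and comparing the potential at a threshold with the threshold itself reveals on which side of the mode that threshold lies.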
3 Trisection and regret analysis
We propose an algorithm based on trisections of the potential function $F$ in order to locate the level $\theta^*$ at which the maximum expected revenue is attained. Our algorithm avoids explicitly estimating individual items’ mean utilities $\{v_i\}$, and subsequently yields a regret independent of the number of items $N$. We first give a simplified algorithm (pseudocode description in Algorithm 1) with an additional $\log T$ term in the regret upper bound and outline its proof. We further show how the additional dependency on $\log T$ can be improved to $\log\log T$ by using more advanced techniques. Due to space constraints, complete proofs of all results are deferred to Sec. 7.
To assist with readability, below we list the notations used in the algorithm description together with their meanings:

$a_\tau$ and $b_\tau$: left and right boundaries that contain $\theta^*$; it is guaranteed that $a_\tau \le \theta^* \le b_\tau$ with high probability, and the regret incurred on failure events is strictly controlled;

$x_\tau$ and $y_\tau$: trisection points; $x_\tau$ is closer to $a_\tau$ and $y_\tau$ is closer to $b_\tau$;

$\ell_t(y_\tau)$ and $u_t(y_\tau)$: lower and upper confidence bands for $F(y_\tau)$ established at iteration $t$; it is guaranteed that $\ell_t(y_\tau) \le F(y_\tau) \le u_t(y_\tau)$ with high probability, and the regret incurred on failure events is strictly controlled;

$\rho_t(y_\tau)$: accumulated reward by exploring level set $\mathcal{L}_{y_\tau}(r)$ up to iteration $t$.
With these notations in place, we provide a detailed description of Algorithm 1. The algorithm operates in epochs (outer iterations) $\tau = 0, 1, \dots$ until a total of $T$ assortment selections are made. The objective of each outer iteration is to find the relative position between the trisection points ($x_\tau$, $y_\tau$) and the “reference” location $\theta^*$, after which the algorithm either moves $a_\tau$ to $x_\tau$ or $b_\tau$ to $y_\tau$, effectively shrinking the interval that contains $\theta^*$ to two thirds of its previous length. Furthermore, to avoid a large cumulative regret, the level set corresponding to the left endpoint $a_\tau$ is exploited in each time period within the epoch to offset the potentially large regret incurred by exploring $y_\tau$.
In Algorithm 1, lower and upper confidence bands for $F(y_\tau)$ are constructed using concentration inequalities (e.g., Hoeffding’s inequality [14]). These confidence bands are updated until the relationship between $y_\tau$ and $F(y_\tau)$ is clear, or a prespecified number of inner iterations for outer iteration $\tau$ has been reached (the cap is specified in Algorithm 1). Algorithm 2 gives a detailed description of how such confidence intervals are built, based on repeated exploration of the level set $\mathcal{L}_{y_\tau}(r)$.
After sufficiently many explorations of $\mathcal{L}_{y_\tau}(r)$, a decision is made on whether to advance the left boundary (i.e., $a_{\tau+1} \leftarrow x_\tau$) or the right boundary (i.e., $b_{\tau+1} \leftarrow y_\tau$). Below we give high-level intuitions on how such decisions are made, with rigorous justifications presented later as part of the proof of the main regret theorem for Algorithm 1.

If there is sufficient evidence that $F(y_\tau) < y_\tau$ (e.g., $u_t(y_\tau) < y_\tau$), then $y_\tau$ must be to the right of $\theta^*$ (i.e., $y_\tau \ge \theta^*$) due to Lemma 2. Therefore, we shrink the value of the right boundary by setting $b_{\tau+1} \leftarrow y_\tau$.

On the other hand, when $u_t(y_\tau) \ge y_\tau$, we can conclude that $x_\tau$ must be to the left of $\theta^*$ (i.e., $x_\tau \le \theta^*$). We show this by contradiction. Assume that $x_\tau > \theta^*$. Since $y_\tau$ is always greater than $x_\tau$ (and thus $y_\tau > \theta^*$) and the gap between $y_\tau$ and $F(y_\tau)$ is at least $y_\tau - x_\tau$ (a consequence of Lemma 2), the gap will be detected by the confidence bands and thus we will have $u_t(y_\tau) < y_\tau$ with high probability. This leads to a contradiction.
Therefore, since $x_\tau$ is to the left of $\theta^*$, we should increase the value of the left boundary by setting $a_{\tau+1} \leftarrow x_\tau$.
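The explore/exploit loop and the asymmetric boundary update described above can be sketched as a small simulation. This is a simplified illustration under stated assumptions, not a reproduction of Algorithms 1 and 2: the Hoeffding width, the confidence level `1/T**2`, the inner-iteration cap with constant 16, and all function names are assumed here.

```python
import math
import random


def level_set(theta, r):
    """Indices of items whose revenue is at least theta."""
    return [i for i, ri in enumerate(r) if ri >= theta]


def sample_revenue(S, r, v, rng):
    """Draw one MNL purchase from assortment S and return the revenue collected."""
    if not S:
        return 0.0
    options = [None] + S                      # None encodes the no-purchase action
    weights = [1.0] + [v[i] for i in S]
    choice = rng.choices(options, weights=weights, k=1)[0]
    return 0.0 if choice is None else r[choice]


def trisection(r, v, T, seed=0):
    """Simplified trisection on revenue thresholds; returns (a, b, total revenue)."""
    rng = random.Random(seed)
    a, b = 0.0, 1.0
    t, revenue = 0, 0.0
    delta = 1.0 / T ** 2                      # assumed confidence level
    while t < T:
        x, y = (2 * a + b) / 3, (a + 2 * b) / 3
        # Assumed cap on inner iterations; at least one step so the loop progresses.
        n_max = max(1, math.ceil(16 * math.log(T ** 2) / (y - x) ** 2))
        lo, hi = 0.0, 1.0                     # confidence band for the potential at y
        total, n = 0.0, 0
        for _ in range(n_max):
            if t >= T:
                break
            if lo <= y <= hi:                 # position of y still unclear: explore y
                reward = sample_revenue(level_set(y, r), r, v, rng)
                total += reward
                n += 1
                width = math.sqrt(math.log(2 / delta) / (2 * n))  # Hoeffding width
                lo, hi = max(0.0, total / n - width), min(1.0, total / n + width)
            else:                             # position resolved: exploit left endpoint
                reward = sample_revenue(level_set(a, r), r, v, rng)
            revenue += reward
            t += 1
        if hi < y:
            b = y                             # evidence that the potential at y is below y
        else:
            a = x                             # otherwise advance the left boundary
    return a, b, revenue


# Hypothetical instance and horizon.
r = [0.9, 0.7, 0.5, 0.3]
v = [0.3, 0.8, 0.5, 1.2]
a, b, revenue = trisection(r, v, T=20000, seed=1)
```

Only the right trisection point is ever explored, and the left one is updated by inference, which is the asymmetry discussed above; with the assumed constants the interval shrinks only a few times within this horizon because each epoch needs many samples.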
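The following theorem is our main upper bound result for the (worst-case) regret incurred by Algorithm 1.
Theorem 2.
There exists a universal constant $C > 0$ such that for all parameters $\{v_i\}_{i=1}^N$ and $\{r_i\}_{i=1}^N$ satisfying $r_i \in [0,1]$, the regret incurred by Algorithm 1 satisfies
\[ \mathrm{Regret}(\{S_t\}_{t=1}^T) \le C\sqrt{T\log T}. \tag{5} \]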
3.1 Improved regret with LIL confidence intervals
In this section we consider a variant of Algorithm 1 that achieves an improved regret of $O(\sqrt{T\log\log T})$. The key idea is to use the finite-sample law-of-the-iterated-logarithm (LIL, [12]) confidence intervals [15] together with an adaptive choice of confidence parameters similar to the MOSS strategy [4] in order to carefully upper bound the regret induced by failure probabilities.
More specifically, most steps in Algorithms 1 and 2 remain unchanged, and the changes we make are summarized below.
The first change we make to achieve improved regret is the way confidence intervals for $F(y_\tau)$ are constructed. Comparing the new confidence interval in Eq. (6) with the original one in Algorithm 2, the important difference is the $\log\log$ term arising from the law of the iterated logarithm, which makes the confidence intervals hold uniformly for all $t$. This also leads to a different choice of confidence parameter in constructing confidence intervals, which is the second important change we make. In particular, instead of using a universal confidence level throughout the entire procedure (chosen small enough to absorb the additional union bound over all inner iterations in each outer iteration that is required for confidence intervals constructed via Hoeffding’s inequality), “adaptive” confidence levels are used, which increase as the algorithm moves on to later iterations. Such a choice of confidence parameters is motivated by the fact that the accumulated regret suffers less from a confidence interval failure at later iterations. Indeed, since we are relatively closer to the optimal assortment, the “excess regret” suffered when the confidence interval fails to cover the true potential function value is smaller. We also remark that similar confidence parameter choices were adopted in [4] to remove additional logarithmic factors in multi-armed bandit problems.
The following theorem shows that the algorithm variant presented above achieves an asymptotic regret of $O(\sqrt{T\log\log T})$, considerably improving on Theorem 2, which establishes an $O(\sqrt{T\log T})$ regret bound. Its proof is rather technical and involves a careful analysis of failure events at each outer iteration of the trisection algorithm. Due to space constraints, we defer the entire proof of Theorem 3 to Sec. 7.
Theorem 3.
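There exists a universal constant $C' > 0$ such that for all parameters $\{v_i\}_{i=1}^N$ and $\{r_i\}_{i=1}^N$ satisfying $r_i \in [0,1]$, the regret incurred by the variant of Algorithm 1 satisfies
\[ \mathrm{Regret}(\{S_t\}_{t=1}^T) \le C'\sqrt{T\log\log T}. \tag{7} \]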
4 Lower bound
We prove the following theorem showing that no policy can achieve an accumulated regret smaller than the order of $\sqrt{T}$ in the worst case.
Theorem 4.
Let $N$ and $T$ be the number of items and the time horizon, both of which can be arbitrary. There exist revenue parameters $r_1, \dots, r_N \in [0,1]$ such that for any policy $\pi$,
(8) 
Theorem 4 shows that our regret upper bounds in Theorems 2 and 3 are tight up to $\sqrt{\log T}$ or $\sqrt{\log\log T}$ factors and numerical constants. We conjecture (in Sec. 6) that the additional $\sqrt{\log\log T}$ term can also be removed, leading to upper and lower bounds that match up to universal constants.
We next sketch the proof of Theorem 4. Due to space constraints, we only present an outline and defer the proofs of all technical lemmas to Sec. 7.
We first describe the underlying parameter values on which our lower bound proof is built. Fix revenue parameters as , and , which are known a priori. We then consider two constructions of the unknown mean utility parameters :
We note that and also give the probability distributions that characterize the customer random purchasing actions; and thus we will use to denote the probability of event under the utility parameters specified by for .
The first lemma shows that there do not exist estimators that can identify from with high probability with only observations of random purchasing actions. Its proof involves a careful calculation of the Kullback-Leibler (KL) divergence between the two hypothesized distributions and a subsequent application of Le Cam’s lemma to the testing question between and .
Lemma 4.
For any estimator whose inputs are random purchasing actions , it holds that .
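For orientation, two-point arguments of this kind typically rest on the following standard chain of inequalities (Le Cam's two-point bound followed by Pinsker's inequality). The notation is assumed here rather than taken from the text above: $P_0$ and $P_1$ denote the laws of the observed purchasing actions under the two hypotheses, $\psi$ ranges over all tests taking values in $\{0, 1\}$, and $\mathrm{TV}$ and $\mathrm{KL}$ are the total variation distance and the Kullback-Leibler divergence.

```latex
\[
  \inf_{\psi}\Big\{ P_0[\psi \neq 0] + P_1[\psi \neq 1] \Big\}
  \;=\; 1 - \mathrm{TV}(P_0, P_1)
  \;\ge\; 1 - \sqrt{\tfrac{1}{2}\,\mathrm{KL}(P_0 \,\|\, P_1)} .
\]
```

Consequently, whenever the KL divergence between the two hypothesized distributions is bounded by a small constant, every test must err with constant probability under at least one of the two hypotheses.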
On the other hand, the following lemma shows that, if the policy can achieve a small regret under both and , then one can construct an estimator based on such that with large probability the estimator can distinguish between and from observed customers’ purchasing actions.
Lemma 5.
Suppose a policy satisfies for both and . Then there exists an estimator such that for both and .
Lemma 5 is proved by explicitly constructing a classifier (tester) from any low-regret assortment sequence. In particular, for any assortment sequence , we construct as if and otherwise. Using Markov’s inequality and the construction of , it can be shown that if then is a good tester with small testing error. Detailed calculations and the complete proof are deferred to Sec. 7.
5 Numerical results
We present simple numerical results for our proposed trisection algorithm and its LIL-improved variant, and compare their performance with several competitors on synthetic data.
Experimental setup.
We generate each of the revenue parameters independently and identically from the uniform distribution on . For the preference parameters , they are generated independently and identically from the uniform distribution on , where is the total number of items available.
To motivate our parameter setting, consider the following three types of assortments: the “single assortment” for some , the “full assortment” , and the “appropriate” assortment . For the single assortment , because the preference parameter for each item is rather small (), no single assortment can produce an expected revenue exceeding . For the full assortment , because and by the law of large numbers, the expected revenue of is around . Finally, for the “appropriate” assortment , we have and . Therefore, the expected revenue of is around . The above discussion shows that a revenue threshold is mandatory to extract a portion of the items that attain the optimal expected revenue, which is highly nontrivial for a dynamic assortment selection algorithm to identify.
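The comparison between single-item, full, and thresholded assortments can be reproduced on a synthetic instance. In the sketch below the sampling ranges (revenues uniform on $[0.4, 0.5]$, preferences uniform on $[10/N, 20/N]$), the seed, and the function names are hypothetical choices made for illustration and should not be read as the settings used in the experiments.

```python
import random


def expected_revenue(S, r, v):
    """MNL expected revenue of assortment S (no-purchase preference normalized to 1)."""
    S = list(S)
    return sum(r[i] * v[i] for i in S) / (1.0 + sum(v[i] for i in S))


def make_instance(N, seed=0, r_range=(0.4, 0.5), v_scale=(10.0, 20.0)):
    """Sample revenues uniformly on r_range and preferences uniformly on v_scale / N."""
    rng = random.Random(seed)
    r = [rng.uniform(*r_range) for _ in range(N)]
    v = [rng.uniform(v_scale[0] / N, v_scale[1] / N) for _ in range(N)]
    return r, v


N = 200
r, v = make_instance(N)

# Best assortment consisting of a single item.
best_single = max(expected_revenue([i], r, v) for i in range(N))
# Assortment containing every item.
full = expected_revenue(range(N), r, v)
# Best assortment made of the k highest-revenue items (a level set).
order = sorted(range(N), key=lambda i: r[i], reverse=True)
best_level = max(expected_revenue(order[:k], r, v) for k in range(1, N + 1))
```

Under these assumed ranges each item alone attracts little demand, the full assortment is diluted by lower-revenue items, and the best thresholded assortment does at least as well as both, so identifying the right revenue threshold is what matters.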
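Table 1: Mean and maximum regret over 20 independent runs of each algorithm, for different settings of $(N, T)$.
(N, T)  |  Ucb mean  |  Ucb max  |  Thompson mean  |  Thompson max  |  Grs mean  |  Grs max  |  Trisec mean  |  Trisec max  |  LIL-Trisec mean  |  LIL-Trisec max
(100, 500)  |  34.9  |  38.1  |  1.28  |  2.97  |  10.9  |  22.4  |  7.68  |  7.68  |  5.17  |  5.17
(250, 500)  |  54.3  |  56.2  |  2.81  |  4.95  |  7.93  |  34.2  |  7.57  |  7.57  |  5.02  |  5.02
(500, 500)  |  73.4  |  75.5  |  4.90  |  4.95  |  7.02  |  43.4  |  7.43  |  7.43  |  4.91  |  4.91
(1000, 500)  |  90.3  |  93.5  |  8.17  |  10.7  |  5.34  |  45.1  |  7.44  |  7.44  |  4.74  |  4.74
(100, 1000)  |  73.1  |  78.2  |  1.36  |  2.79  |  139.9  |  175.0  |  8.69  |  8.69  |  5.36  |  5.36
(250, 1000)  |  113.7  |  119.3  |  3.36  |  5.17  |  90.1  |  110.1  |  8.69  |  8.69  |  5.31  |  5.31
(500, 1000)  |  136.8  |  140.3  |  5.65  |  7.64  |  65.7  |  113.9  |  9.38  |  9.38  |  6.01  |  6.01
(1000, 1000)  |  160.8  |  165.4  |  9.31  |  12.4  |  8.43  |  22.8  |  9.77  |  9.77  |  6.39  |  6.39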
Comparative methods.
Our trisection algorithm with regret $O(\sqrt{T\log T})$ is denoted as Trisec, and its LIL variant (with regret $O(\sqrt{T\log\log T})$) is denoted as LIL-Trisec. The other methods we compare against include the Upper Confidence Bound algorithm of [2] (denoted as Ucb), the Thompson sampling algorithm of [3] (denoted as Thompson), and the Golden Ratio Search algorithm of [18] (denoted as Grs). Note that both Ucb and Thompson proposed in [2, 3] were initially designed for the capacitated MNL model, in which the number of items each assortment contains is restricted to be at most $K$. In our experiments, we operate both the Ucb and Thompson algorithms under the uncapacitated setting, simply by removing the constraint set when performing each assortment optimization.
Most hyperparameters (such as constants in confidence bands) are set directly using the theoretical values. One exception is our LIL-Trisec algorithm, in which we remove the coefficient of 4 in front of the square root term in the confidence bands in Eq. (6), which can be thought of as taking in the finite-sample LIL inequality (see Lemma 14) and was also adopted in [15]. Another exception is the Grs algorithm: in [18] the number of exploration iterations is set to where , which is inappropriate for our “gap-free” synthetic setting in which . Instead, we use the common choice of exploration iterations in typical gap-independent bandit problems for Grs.
Results.
In Table 1 we report the mean and maximum regret from 20 independent runs of each algorithm on our synthetic data, with different settings of $N$ (number of items) and $T$ (time horizon). We observe that as the number of items $N$ becomes large, our algorithms (Trisec and LIL-Trisec) achieve smaller mean and maximum regret than their competitors, and LIL-Trisec consistently outperforms Trisec in all settings. Unlike Ucb and Thompson, whose regret depends polynomially on $N$, our Trisec and LIL-Trisec algorithms have no dependency on $N$ and hence their regret does not increase significantly with $N$. While Grs also has a weak (logarithmic) dependency on $N$, its pure-exploration plus pure-exploitation structure makes its performance rather unstable, which is evident from the large gaps between the mean and maximum regret of Grs.
6 Discussion and conclusion
In this paper we consider the dynamic assortment allocation problem under uncapacitated MNL models and derive near-optimal regret bounds. One important open question is to further remove the $\sqrt{\log\log T}$ term in the upper bound in Theorem 3 and eventually achieve upper and lower regret bounds that match each other up to universal numerical constants. We conjecture that such an improvement is possible by considering a sharper LIL concentration inequality which, instead of holding uniformly for all $t$, holds only at “doubling” checking points.
Other questions worth investigating include the design of “horizon-free” algorithms that automatically adapt to a time horizon $T$ that is not known a priori, and “instance-optimal” regret bounds whose regret depends explicitly on the problem parameters and which match corresponding (instance-dependent) minimax lower bounds in which are known up to permutations. Such instance-optimal regret might potentially depend on “revenue gaps” , where $S^*$ is the optimal assortment and is the revenue parameter of the item with the th largest revenue.
7 Proofs
7.1 Proof of technical lemmas in Sec. 2
We first state a simple proposition that outlines the basic properties of the potential function . Its verification is easy from the definition and the discretized nature of .
Proposition 2.
There exists satisfying for all , and , such that
(9) 
where .
7.1.1 Proof of Lemma 1
Let be the two endpoints such that (if there are multiple such pairs, pick any one of them). We will prove that , which then implies Lemma 1.
We first prove . Assume by contradiction that . Clearly because . By definition of and , we have
(10) 
Because , adding we have that
(11) 
This contradicts with the fact that and that is the maximum value of .
We next prove . Assume by contradiction that . Removing all items corresponding to in Eq. (10), we have
(12) 
This contradicts with the fact that and that is the maximum value of .
7.1.2 Proof of Lemma 2
Because and is the maximum value of , we have for all . In addition, for any , by definition of we have
(13)  
(14)  
(15)  
(16)  
(17) 
Because holds for all , we conclude that also holds for all . Subsequently, the righthand side of Eq. (17) is nonnegative and therefore .
7.1.3 Proof of Lemma 3
If for all then the lemma clearly holds. In the rest of the proof we shall assume that there is at least one jumping point strictly smaller than . Formally, we let be all jumping points that are strictly smaller than . To prove Lemma 3, it suffices to show that and for all .
We use induction to establish the above claims. The base case is . Because is the maximum value of , we conclude that . In addition, because , invoking Eq. (17) we have that . The base case is then proved.
We next prove the claim for , assuming it holds for by induction. By inductive hypothesis, . Also, because there is no jump points between and , and subsequently . Invoking Eq. (17) we proved .
To prove , define . It is clear that . By Eq. (17), we have
(18)  
(19)  
(20) 
As we have already proved , the righthand side of the above inequality is nonnegative and therefore .
7.2 Proof of Theorem 2
We first prove two technical lemmas showing that, with high probability, the confidence intervals constructed in Algorithm 2 contain the true value $F(y_\tau)$, and the optimal revenue level $\theta^*$ is contained in $[a_\tau, b_\tau]$ for all $\tau$.
Lemma 6.
With probability , for all .
Proof.
Lemma 7.
With probability , for all , where is the last outer iteration of Algorithm 1.
Proof.
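We use induction to prove this lemma. We also condition on the event that $\ell_t(y_\tau) \le F(y_\tau) \le u_t(y_\tau)$ for all $t$ and $\tau$, which holds with the probability guaranteed by Lemma 6.
We first prove the lemma for the base case of $\tau = 0$. According to the initialization step in Algorithm 1, we have $a_0 = 0$ and $b_0 = 1$. On the other hand, for any $\theta$ it holds that $0 \le F(\theta) \le 1$. Therefore, $\theta^* = F^* \in [0,1]$ and hence $a_\tau \le \theta^* \le b_\tau$ for $\tau = 0$.
We next prove the lemma for outer iteration $\tau + 1$, assuming the lemma holds for outer iteration $\tau$ (i.e., $a_\tau \le \theta^* \le b_\tau$). According to the trisection parameter update step in Algorithm 1, the proof can be divided into two cases:
Case 1: $u_t(y_\tau) < y_\tau$. Because $F(y_\tau) \le u_t(y_\tau)$ always holds, we conclude in this case that $F(y_\tau) < y_\tau$. Invoking Lemma 3 we conclude that $y_\tau > \theta^*$. On the other hand, by the inductive hypothesis $a_\tau \le \theta^*$. Therefore, $a_{\tau+1} = a_\tau \le \theta^* < y_\tau = b_{\tau+1}$.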
Case 2: . In this case, the revenue level must be explored at every inner iteration in Algorithm 1 at outer iteration , because is a nonincreasing function of . Denote and as the number of inner iterations in outer iteration . Subsequently, the length of the confidence intervals on at the end of all inner iterations can be upper bounded by
(23) 
Invoking Lemma 6 we then have
(24) 
The next lemma upper bounds the expected regret incurred at each outer iteration , conditioned on the success events in Lemmas 6 and 7.
Lemma 8.
Proof.
We analyze the regret incurred at outer iteration from exploration of and exploitation of separately.

Regret from exploring : suppose the level set is explored for times at outer iteration . Then we have . In addition, by Lemma 6 and widths in the constructed confidence bands and , we have with probability that