Conservative Exploration for Semi-Bandits with Linear Generalization: A Product Selection Problem for Urban Warehouses
Rong Jin
Alibaba Group, San Mateo, CA 94402, jinrong.jr@alibaba-inc.com
David Simchi-Levi
Institute for Data, Systems, and Society, Department of Civil and Environmental Engineering, and Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139, dslevi@mit.edu
Li Wang
Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139, li_w@mit.edu
Xinshang Wang
Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02139, xinshang@mit.edu
Sen Yang
Alibaba Group, San Mateo, CA 94402, senyang.sy@alibaba-inc.com
The recent rising popularity of ultra-fast delivery services on retail platforms fuels the increasing use of urban warehouses, whose proximity to customers makes fast deliveries viable. The space limit in urban warehouses poses a problem for the online retailers: the number of products (SKUs) they carry is no longer “the more, the better”, yet it can still be significantly large, reaching hundreds or thousands in a product category. In this paper, we study algorithms for dynamically identifying a large number of products (i.e., SKUs) with top customer purchase probabilities on the fly, from an ocean of potential products to offer on retailers’ ultra-fast delivery platforms.
We distill the product selection problem into a semi-bandit model with linear generalization. There are in total $N$ different arms, each with a feature vector of dimension $d$. The player pulls $K$ arms in each period and observes the bandit feedback from each of the pulled arms. We focus on the setting where $K$ is much greater than the number of total time periods $T$ or the dimension of product features $d$. We first analyze a standard UCB algorithm and show its regret bound can be expressed as the sum of a $T$-independent part $\tilde{O}(K d^{3/2})$ and a $T$-dependent part $\tilde{O}(d\sqrt{KT})$, which we refer to as “fixed cost” and “variable cost” respectively. To reduce the fixed cost for large $K$ values, we propose a novel online learning algorithm, with more conservative exploration steps, and show its fixed cost is reduced by a factor of $d$ to $\tilde{O}(K\sqrt{d})$. Moreover, we test the algorithms on an industrial dataset from Alibaba Group. Experimental results show that our new algorithm reduces the total regret of the standard UCB algorithm by at least 10%.
Key words: sequential decision making, product selection, online learning, online retailing, stochastic optimization, regret analysis
In this paper, we study a large-scale product selection problem, motivated by the rising popularity of ultra-fast delivery on retail platforms. With ultra-fast delivery services such as same-day or instant delivery, customers receive parcels within hours after placing online orders, whereas traditional deferred delivery usually takes one to five days. The demand for ultra-fast delivery is strong: a survey conducted in 2016 shows that 25% of customers are willing to pay significant premiums for the service (Joerss et al. 2016). On the supply side, many online retailers offering ultra-fast delivery have emerged in the market, for example, Amazon Prime Now, Jet.com, Ocado, and Alibaba Hema. As a result, the retail market for ultra-fast delivery has enjoyed exponential growth over the past few years. In the United States, the total order value of same-day delivery merchandise reached 4.03 billion dollars in 2018, up from 0.1 billion dollars in 2014 (eMarketer 2018).
The rapid growth in the ultra-fast delivery retail market is accompanied by the increasing use of urban warehouses (JLL 2018). Their proximity to customers is the key to making ultra-fast delivery viable for retailing models, especially in high-traffic cities like New York City and San Francisco. Due to the limited space in urban warehouses, retail platforms offering ultra-fast delivery carry fewer products than traditional online retailers fulfilled by large suburban distribution centers. Hence, retailers often face the challenge of selecting a good subset of different products (i.e., SKUs) from an ocean of potential products to offer on the ultra-fast delivery platforms.
Thanks to modern inventory-management technologies, such as Kiva robots, the cost of maintaining a diversified inventory is no more than that of managing an inventory with only a few types of products (Bhattacharya 2016). Thus, online retailers offering ultra-fast delivery are disposed to carry more SKUs than the set of most popular items such as the market-leader or mainstream products with major market shares. By increasing product breadth, they gain an edge over local grocery stores or supermarkets, in terms of satisfying customers’ sporadic demands for products with medium-to-low popularities.
For example, as of October 2018, Prime Now, Amazon’s two-hour delivery retail platform, offers around 1,500 different products in the toys category at zip code 02141 (Cambridge, MA). Meanwhile, a local CVS store at the same zip code only carries around 100 popular toys, whereas Amazon.com includes more than 90,000 SKUs related to toys. Table 1 summarizes product coverages of different retailers, where products are partitioned into two types: 1) market-leader, and 2) medium/low-demand. The focus in this paper is on selecting an optimal set of products with medium-to-low demands for online retailers with urban warehouses.
Table 1: Product coverage of different retailers.

                             Brick and mortar   Online retailer:     Online retailer:
                                                urban warehouse      distribution center
Market-leader products       Most/All           All                  All
Medium/low-demand products   None/Few           Some (paper focus)   Most
For urban warehouse retailers, the most popular items can be readily identified from their sales volumes. However, with their total number of SKUs in each category reaching hundreds, if not thousands, it is a more difficult task to accurately estimate the popularities of a large number of relatively low-demand products. Fortunately, online retailers are typically capable of dynamically adjusting the offered product set according to sales and market feedback. This enables them to learn each product’s probability of being sold through dynamic product offering, which in turn makes the problem a more complicated task of sequential decision-making than a pure demand estimation problem.
In this paper, we study different algorithms for dynamically selecting an optimal set of products for online retailers that use urban warehouses to support their ultra-fast delivery services. The algorithms learn the demand levels of all products on the fly, and give retailers the ability to have their product sets tailored to every city, or even every zip code, better catering to customers’ geodemographic preferences. We distill the product selection problem into a semi-bandit model with linear generalization. For this model, we first provide an alternative analysis of a popular existing algorithm, which we call SemiUCB, that is adapted from the OFUL algorithm proposed by Abbasi-Yadkori et al. (2011). Then, we propose a novel algorithm called ConsUCB, and show it is superior in both theoretical and numerical performances, especially when the retailer selects products and observes their sales in large batches.
We consider an online retailer who wants to find an optimal set of products (SKUs) in each category to be offered on her ultra-fast delivery retail platform in a certain city area. In the rest of the paper, we use “product” and “SKU” interchangeably.
Due to the use of urban warehouses, the storage space, after including the market-leader products identified from past sales data, is limited. In each product category, assuming products take similar storage space, the retailer would like to select $K$ additional SKUs from a catalog of $N$ products that have medium-to-low customer demands. Because of their relatively low popularities, there is a very limited amount of available sales data related to the candidate products in that city. Therefore, the retailer is interested in learning the optimal set of products through experimentation. Specifically, in each period over a $T$-period sales time horizon, she selects $K$ products to offer and observes whether customers purchase those products, which helps her learn the product demands more accurately and then adjust the product set accordingly in the next periods.
By only considering products with relatively low demands in this problem (see the discussion of product coverage summarized in Table 1), we assume their sales are independent. The reason is that the demands for the candidate products are typically direct demands, in the sense that customers specifically search for these products for some particular reasons. In the case where some products are not included in the offered product set, their demands, if not lost, are often captured by the market-leader products. Therefore, the probability of shifting from one low-popularity product to another is very low.
We assume the candidate products’ probabilities of realizing positive sales in each period are an unknown linear function of their $d$-dimensional feature vectors that are known a priori and fixed over time. Specifically, for any product with feature vector $x \in \mathbb{R}^d$, its positive-sales probability in any period is $x^\top \theta^*$, where $\theta^* \in \mathbb{R}^d$ is unknown.
We examine the linear model assumption between products’ features and their probabilities of positive sales on industrial data from Alibaba Tmall, the largest business-to-consumer online retail platform in China. We select around 20,000 products in the toys category with medium-to-low popularities in a city region. Each product is associated with a $d$-dimensional feature vector, which is derived from the product’s intrinsic features as well as product-related user activities, like clicks, additions to cart, and purchases, by Alibaba’s deep learning team using representation learning (Bengio et al. 2013). Two consecutive weeks of the region’s unit sales data are translated into binary sales data indicating whether a product had positive sales in each week.
To test the validity of the linear assumption, we first fit a linear model for the first-week binary sales with the product features as covariates. Then, for all products, we compute their scores based on the feature vectors using the fitted model, and divide the products into 20 groups of size around 1,000 according to their scores in ascending order. For each group, the positive-sales probability is estimated by the arithmetic average of the second-week binary sales of the products in the group. In Figure 1, the estimated positive-sales probabilities are plotted against the average scores for all 20 groups, and a near-linear relationship is clearly identified. Since the scores are linear in the features, this suggests that a linear relationship between the products’ features and probabilities of positive sales can be reasonably assumed.
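The grouping step of this check can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `grouped_sales_rates`, its arguments, and the toy data are hypothetical, and the scores are taken as given rather than produced by a fitted model.

```python
def grouped_sales_rates(scores, second_week_sales, num_groups):
    """Sort products by score, split them into near-equal groups, and return
    (average score, fraction with positive sales) for each group."""
    if len(scores) != len(second_week_sales):
        raise ValueError("scores and sales must have the same length")
    if not 1 <= num_groups <= len(scores):
        raise ValueError("num_groups must be between 1 and the number of products")
    # Product indices in ascending order of score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    results = []
    for g in range(num_groups):
        members = order[g * n // num_groups:(g + 1) * n // num_groups]
        avg_score = sum(scores[i] for i in members) / len(members)
        sales_rate = sum(second_week_sales[i] for i in members) / len(members)
        results.append((avg_score, sales_rate))
    return results


# Toy data (hypothetical): six products split into three groups of two.
toy_scores = [0.30, 0.10, 0.50, 0.20, 0.60, 0.40]
toy_sales = [0, 0, 1, 1, 1, 0]
toy_groups = grouped_sales_rates(toy_scores, toy_sales, 3)
```

Plotting the second element of each pair against the first would give a picture analogous to the one described for Figure 1.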
The retailer’s goal is to maximize the number of products with positive sales, which represent successes in matching customers’ demands. Then, the expected reward in each period is the sum of positive-sales probabilities of the offered products. We measure an algorithm’s performance by its regret, the loss in total expected reward compared to the optimal expected reward.
Such a model is usually referred to as semi-bandits with linear generalization. In the literature, more general models like contextual combinatorial semi-bandits have been studied mainly for different applications such as personalized movie recommendations or online advertising (Yue and Guestrin 2011, Qin et al. 2014, Wen et al. 2015). In those applications, the number of recommendations $K$ is typically very small compared to the number of periods $T$. Indeed, Yue and Guestrin (2011) explicitly make the assumption that $K \le d$ (recall that $d$ is the dimension of the feature vector space).
In the task of choosing optimal product sets for urban-warehouse retailers, however, the number of products selected in every period is usually very large, whereas the learning process typically needs to be completed within a quarter (with semi-weekly updates), to reduce operations cost and potential long-term consumer confusion. Motivated by this problem setting, we focus on a different regime of problem parameters in the semi-bandit model with linear generalization, where the number of selections $K$ is large but the number of periods $T$ is small.
Directly applying existing algorithms and analyses in this setting results in regret bounds that grow rapidly in $K$ (see the discussion of regret bounds of different algorithms summarized in Table 2). Therefore, this raises the following research question that we aim to address in this work:
How can a retailer optimize the exploration-exploitation tradeoff in selecting optimal product sets from a pool of $N$ products, when there are only a few time periods available for experimentation, but she is able to observe sales information for a large number of products in each period?
Regarding whether this task is achievable, we point out that, even though the number of periods is small, a good algorithm can still effectively learn the true model parameter on the fly, because there are a significant number of sales observations after all.
Since the time horizon we consider is no more than three months, we assume the set of candidate products remains unchanged. Moreover, products’ intrinsic characteristics and their related user behaviors are in general unlikely to have major changes during the relatively short time horizon. Hence, we assume that the product features are fixed over time in our problem setting. It is worth noting that such assumptions can be removed once the time horizon is over, given that $\theta^*$ is accurately estimated in the end.
For the purpose of presenting a clean model, we defer the discussion on extensions and variants that address some of the model’s potential limitations to Section id1. In addition to numerically testing algorithm performances on an online retailer’s data in Section id1, we make two main technical contributions in this paper:

We provide an alternative analysis of a common Upper Confidence Bound algorithm, which we call SemiUCB, in our model setting (Section id1). The main idea behind SemiUCB is to use an ellipsoid to construct the confidence region for $\theta^*$. This technique has been studied in various semi-bandit models (cf. Yue and Guestrin (2011), Qin et al. (2014)). We contribute to the literature by proving an alternative regret bound for this algorithm when $K$ is large.
Our new regret bound of SemiUCB can be expressed as a combination of two parts: (i) The first part is $\tilde{O}(d\sqrt{KT})$, which is sublinear in both $K$ and $T$ ($\tilde{O}(\cdot)$ hides logarithmic factors). We call this part the “variable cost”, as it increases with the length of time horizon $T$. The variable cost is a standard regret term in linear contextual bandit problems after playing $KT$ arms (selecting $KT$ products). (ii) The second part is $\tilde{O}(K d^{3/2})$, which is linear in $K$ and independent of $T$. We call it the “fixed cost”, since it is independent of the length of the time horizon. The fixed cost arises because no sales feedback can be observed while the $K$ products of a period are being selected.
We also prove a lower bound on the fixed cost of SemiUCB. From the business point of view, this lower bound is discouraging, because over the $T$ periods, the total regret of the algorithm could be of the same order as the total reward $KT$. In other words, the standard UCB technique and its immediate extensions could lead to arbitrarily bad performance over the entire season. This motivates us to devise new exploration-exploitation techniques to beat this lower bound on the fixed cost.

In order to improve on the fixed cost of SemiUCB, we propose a novel “conservative” confidence bound and a new algorithm, ConsUCB (Section id1). Using the conservative confidence bound and selecting products sequentially within each period, the new algorithm intelligently takes advantage of the abundance of sales observations in each period and enjoys an improved fixed cost.
The regret bounds proved in this paper are summarized and compared with existing results in Table 2. Although the models in Yue and Guestrin (2011), Wen et al. (2015), Qin et al. (2014) are more general than ours, their regret bounds grow rapidly in $K$ and are thus not suitable when $K$ is large. By contrast, in the regret bound of ConsUCB, the fixed cost that is linear in $K$ is only $\tilde{O}(K\sqrt{d})$.
It is also worth noting that the $\sqrt{d}$ factor in the fixed cost of ConsUCB is due to linear generalization. In practice, this factor and other logarithmic factors are replaced by a parameter $\alpha$ that is tuned to achieve good empirical performance (see Section id1). Apart from these factors related to linear generalization, the fixed cost of ConsUCB is only of order $K$. For online retailers, it implies that the fixed cost of ConsUCB is no more than the optimal reward over only a few periods. On the other hand, there is a simple lower bound $\Omega(K)$ on the fixed cost of any algorithm, since an $\Omega(K)$ regret in the first time period is unavoidable. Therefore, the fixed cost of ConsUCB is optimal except for a $\sqrt{d}$ factor caused by linear generalization.
Table 2: Regret bounds of different algorithms when adapted to our model setting where $K$ is much larger than $T$ or $d$.

Paper                      Algorithm technique                                  “Fixed cost” regret     “Variable cost” regret
Yue and Guestrin (2011)    UCB
Wen et al. (2015)          Thompson sampling
Wen et al. (2015)          UCB
Qin et al. (2014)          UCB
This paper (Section id1)   UCB (standard algorithm SemiUCB)                     $\tilde{O}(K d^{3/2})$   $\tilde{O}(d\sqrt{KT})$
This paper (Section id1)   UCB with conservative exploration (new algorithm ConsUCB)   $\tilde{O}(K\sqrt{d})$   $\tilde{O}(d\sqrt{KT})$
In the final part of the paper, we use Alibaba’s data to test and compare the performances of SemiUCB and ConsUCB. Numerically, we show that ConsUCB reduces the regret of SemiUCB in the first ten periods.
Our model is a type of multi-armed bandit problem, if we view the products as distinct arms. In classic multi-armed bandit problems, a player sequentially pulls arms without initially knowing which arm returns the highest reward in expectation. The player needs to update the strategy of pulling arms based on the bandit feedback of the pulled arm in each round, in order to minimize the total regret over time. For a broader review on multi-armed bandit problems, we refer the reader to Bubeck and Cesa-Bianchi (2012), Slivkins (2017).
There are two special characteristics of our model. First, we assume the reward of pulling an arm is a linear function of an embedding feature vector of the arm. Second, in each step the player is able to pull a very large number of different arms and then observe the bandit feedback for each of the pulled arms.
In the literature, multi-armed bandit models assuming linear reward/payoff functions are often referred to as linear contextual bandits. Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Chu et al. (2011), Abbasi-Yadkori et al. (2011), Bubeck et al. (2012), Agrawal and Goyal (2012) propose and analyze different algorithms for linear models in which the player is able to pull only one arm in each period. The sampling algorithm in Russo and Van Roy (2014) also applies to this linear setting. Notably, the algorithm SemiUCB that we analyze in Section id1 is a direct extension of the OFUL algorithm in Abbasi-Yadkori et al. (2011). We are able to provide a new type of analysis of SemiUCB for our special model (see Section id1), in which the player can pull a large number of different arms (each arm is associated with a feature vector) in each period.
When the reward of each arm is a linear function of covariates that are drawn i.i.d. from a diverse distribution, Goldenshluger and Zeevi (2013), Bastani and Bayati (2015), Wang et al. (2018) propose online learning algorithms whose regret bounds are proved to be polylogarithmic in $T$. However, their regret bounds scale at least linearly with the total number of arms $N$. By contrast, the regret bounds of our algorithms are independent of $N$.
When the player can pull multiple arms in each period and observe the bandit feedback from each of the pulled arms, the model is often referred to as semi-bandits (Audibert et al. 2014). In the literature, most semi-bandit models view the player’s action as a vector in $\{0,1\}^N$ (cf. Cesa-Bianchi and Lugosi (2012), Gai et al. (2012), Chen et al. (2013)); they do not further assume the reward of each arm to be a linear function.
To our knowledge, only Yue and Guestrin (2011), Gabillon and Eriksson (2014), Qin et al. (2014), Wen et al. (2015) study semibandit models in which the reward of each arm is generalized to a linear function. The models in these papers are more general than ours, as they allow for combinatorial constraints or submodular reward functions. However, their research focuses on applications in which is much smaller than . Thus, their regret bounds grow rapidly in . The regret bounds in Yue and Guestrin (2011), Qin et al. (2014), Wen et al. (2015) contain . The regret bound in Gabillon and Eriksson (2014) contains , where are gaps between suboptimal and optimal arms. By contrast, the regret bound of our algorithm is , in which the term that is linear in is only .
Throughout the paper, we use $[n]$ to denote the set $\{1, 2, \ldots, n\}$ for any positive integer $n$.
There are $N$ distinct products with low-to-medium customer demands. There are $T$ periods, and in each period $t \in [T]$, the retailer selects a set of $K$ products, denoted as $S_t$, from $[N]$ to offer on her retail platform.
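Each product $i \in [N]$ has a feature vector $x_i \in \mathbb{R}^d$ that is known to the retailer in advance and stays fixed over time. The probability of positive sales of product $i$ in any period is denoted as $\mu_i$, which is linear in its feature vector $x_i$, i.e., $\mu_i = x_i^\top \theta^*$ for some unknown vector $\theta^* \in \mathbb{R}^d$. We assume $\|\theta^*\|_2 \le 1$ and $\|x_i\|_2 \le 1$, for each product $i \in [N]$.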
In each period $t$, the binary random variable $Y_{it}$ denotes whether the realized sales of product $i$ in period $t$ are positive, for each product $i$ in the selected product set $S_t$. We assume $Y_{it}$ is independent across products and across time periods, and its expected value is $\mathbb{E}[Y_{it}] = \mu_i$.
The expected reward in each period is the sum of positive-sales probabilities of the offered products, and the $T$-period total expected reward is $\sum_{t=1}^{T} \sum_{i \in S_t} \mu_i$. The optimal product set $S^*$ is a set of $K$ products with the highest probabilities of positive sales. The performance of any online algorithm is measured by its regret, which is defined as
$$R(T) = \sum_{t=1}^{T} \Big( \sum_{i \in S^*} \mu_i - \sum_{i \in S_t} \mu_i \Big).$$
In each period $t$, the retailer makes the decision $S_t$ based on all the past information, including the product sets offered and the sales outcomes observed in earlier periods. The goal is to minimize the total regret over the $T$ time periods.
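As a small numerical illustration of the regret measure, the sketch below computes the total expected regret of a given sequence of offered sets. It assumes that the per-period regret is the sum of the $K$ largest positive-sales probabilities minus the sum over the offered products; the function name `expected_regret` and the toy numbers are hypothetical.

```python
def expected_regret(probs, offered_sets, k):
    """Total expected regret of a sequence of offered sets of size k.

    probs[i] is the positive-sales probability of product i, and
    offered_sets[t] lists the products offered in period t.
    """
    # Expected reward of an optimal set: the k largest probabilities.
    best = sum(sorted(probs, reverse=True)[:k])
    total = 0.0
    for offered in offered_sets:
        if len(set(offered)) != k:
            raise ValueError("each period must offer exactly k distinct products")
        total += best - sum(probs[i] for i in offered)
    return total


# Toy example (hypothetical numbers): four products, two offered per period.
toy_probs = [0.5, 0.25, 0.125, 0.0]
toy_regret = expected_regret(toy_probs, [[2, 3], [0, 3], [0, 1]], 2)
```

In this toy run the per-period regrets are 0.625, 0.25 and 0, so the regret stops growing once the two best products are offered.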
In this section, we focus on a popular existing UCB algorithm, which we call SemiUCB, that has been widely used in practice and analyzed in theory.
We provide an alternative analysis of SemiUCB and prove a new regret bound, which consists of a $T$-independent “fixed cost” $\tilde{O}(K d^{3/2})$ and a $T$-dependent “variable cost” $\tilde{O}(d\sqrt{KT})$. This alternative analysis allows us to more accurately evaluate the regret terms for SemiUCB, especially in our model where $K$ is significantly larger than $T$.
We also give an example that establishes a lower bound on the regret of SemiUCB, which illustrates the algorithm’s potential weakness in some cases. This lower bound implies that, in the regret bound of SemiUCB, the fixed cost is much more significant than the variable cost in our problem setting. Hence, the idea of modifying SemiUCB to reduce the fixed cost leads to the development of the new algorithm ConsUCB in Section id1.
SemiUCB is adapted from OFUL, the standard UCB algorithm designed for linear contextual bandits (Abbasi-Yadkori et al. 2011). The difference is that, for SemiUCB, the model is updated once $K$ products (arms) are offered (pulled) in each period, whereas, for OFUL, the model is updated every time one product (arm) is offered (pulled). If $K$ is set to 1, the two algorithms become the same. SemiUCB is presented step by step in the following part.
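SemiUCB algorithm for semi-bandits with linear generalization (with input parameters $\lambda$, $\alpha$):

1. Initialize $V_0 = \lambda I_d$ and $b_0 = 0$.

2. Repeat for $t = 1, 2, \ldots, T$:

(a) Set $\hat{\theta}_t = V_{t-1}^{-1} b_{t-1}$.

(b) Calculate $\mathrm{UCB}_t(i) = x_i^\top \hat{\theta}_t + \alpha \|x_i\|_{V_{t-1}^{-1}}$, for all $i \in [N]$.

(c) Offer product set $S_t \in \arg\max_{S \subseteq [N],\, |S| = K} \sum_{i \in S} \mathrm{UCB}_t(i)$. Observe outcomes $Y_{it}$ for $i \in S_t$.

(d) Update $V_t = V_{t-1} + \sum_{i \in S_t} x_i x_i^\top$ and $b_t = b_{t-1} + \sum_{i \in S_t} x_i Y_{it}$.
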
The algorithm we present above is an extension of (AbbasiYadkori et al. 2011), and it can also be considered as a special case of its combinatorial version (Qin et al. 2014). From Qin et al. (2014), we know that if is run with and , then the regret of the algorithm is with probability at least .
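The batch structure of the algorithm can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the implementation used in the paper: it assumes the usual ridge-regression estimate and a UCB value equal to the point estimate plus `alpha` times the norm of the feature vector weighted by the inverse design matrix, and all function names, the simulated Bernoulli feedback, and the toy instance are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m, v):
    return [dot(row, v) for row in m]


def rank_one_update(v_inv, x):
    """Sherman-Morrison: inverse of (V + x x^T), given the inverse of a symmetric V."""
    vx = mat_vec(v_inv, x)
    denom = 1.0 + dot(x, vx)
    d = len(x)
    return [[v_inv[r][c] - vx[r] * vx[c] / denom for c in range(d)] for r in range(d)]


def semi_ucb(features, theta_star, k, periods, alpha, lam=1.0, seed=0):
    """Simulate a batch UCB rule; returns the list of offered sets, one per period."""
    rng = random.Random(seed)
    d = len(features[0])
    v_inv = [[(1.0 / lam if r == c else 0.0) for c in range(d)] for r in range(d)]
    b = [0.0] * d
    history = []
    for _ in range(periods):
        theta_hat = mat_vec(v_inv, b)
        # UCB values are computed once, at the start of the period.
        ucb = [dot(x, theta_hat) + alpha * math.sqrt(max(0.0, dot(x, mat_vec(v_inv, x))))
               for x in features]
        offered = sorted(range(len(features)), key=lambda i: ucb[i], reverse=True)[:k]
        # Feedback for the whole batch is folded into the model after the selection.
        for i in offered:
            x = features[i]
            y = 1.0 if rng.random() < dot(x, theta_star) else 0.0  # simulated sale
            v_inv = rank_one_update(v_inv, x)
            b = [bj + xj * y for bj, xj in zip(b, x)]
        history.append(offered)
    return history


# Toy instance (hypothetical): two groups of identical products, only group two sells.
toy_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
toy_history = semi_ucb(toy_features, [0.0, 1.0], k=2, periods=3, alpha=1.0)
```

In the toy instance all initial UCB values tie, so the first batch consists entirely of the non-selling products 0 and 1; the selling products 2 and 3 are offered only from the second period on. This is a small-scale version of the lack of diversification within a period that the analysis attributes to the batch rule.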
In this section, we provide an alternative analysis of the regret of SemiUCB, when applied to semi-bandit models with linear generalization.
We prove the regret of SemiUCB is $\tilde{O}(K d^{3/2} + d\sqrt{KT})$, which is a sum of two parts. The first part is largely due to the model’s inability to observe product sales feedback while the $K$ products of a period are being selected, and is shown to be independent of the number of time periods, $T$. The second part is a common regret term for UCB-type algorithms in linear contextual bandit problems.
If we consider the regret terms as “costs” that an algorithm has to pay in the learning process, the first part of the regret resembles a “fixed cost”, as it does not increase with $T$, while the second part is similar to a “variable cost”, as it increases with $T$.
We first introduce some existing results in the literature that we will use later on. For any positive definite matrix $A \in \mathbb{R}^{d \times d}$, we define the weighted norm of any vector $x \in \mathbb{R}^d$ as
$$\|x\|_A = \sqrt{x^\top A x}.$$
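For concreteness, a weighted norm of this kind can be evaluated as in the sketch below, which assumes the standard definition $\|x\|_A = \sqrt{x^\top A x}$; the helper name `weighted_norm` and the example inputs are hypothetical.

```python
import math


def weighted_norm(x, a):
    """Return sqrt(x^T A x) for a symmetric positive definite matrix A (list of rows)."""
    ax = [sum(a_rc * x_c for a_rc, x_c in zip(row, x)) for row in a]
    return math.sqrt(sum(x_r * ax_r for x_r, ax_r in zip(x, ax)))


# With A equal to the identity, the weighted norm is the Euclidean norm.
toy_norm = weighted_norm([3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]])
# A smaller weight along a direction shrinks the norm of vectors in that direction.
scaled_norm = weighted_norm([1.0, 0.0], [[0.25, 0.0], [0.0, 1.0]])
```

When $A$ is the inverse of a matrix that accumulates outer products of selected feature vectors, the norm shrinks along directions that have already been explored.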
Recall that, for each product , is and is defined as for each period in .
Lemma 1 (Qin et al. (2014), Lemma 4.1)
If we run with and , then we have, with probability at least , for all periods and all products ,
For each period , let denote an arbitrary permutation of the feature vectors .
Define
for each period and each . The following lemma is a direct result of Lemma 3 in Chu et al. (2011), by assuming the product selections are made in a model where only one product needs to be selected in each period for a total of periods.
Lemma 2 (Chu et al. (2011), Lemma 3)
If we run with , then
The next lemma is the key lemma of our analysis for SemiUCB. It upper-bounds the difference between two norms of the same vector weighted by the inverses of two matrices, $V$ and $V + \sum_{j=1}^{n} x_j x_j^\top$. The additional sum of outer products corresponds to updating matrix $V$ using feature vectors selected by SemiUCB. We defer its proof to the appendix.
Lemma 3
Let be any symmetric positive definite matrix, and be any vectors. Let be the eigenvalues of , and be the eigenvalues of . We have, for any such that ,
Our new regret bound for is presented in the following theorem.
Theorem 1
If is run with
then with probability at least , the regret of the algorithm is
By Lemma 1 and the property that picks products with the largest UCB values , we have, with probability at least ,
Recall that is a sequence of feature vectors of products in . Moreover, , for each period and each . Then, we can continue to obtain
(1) 
Let be the eigenvalues of for all . For the first term in (1), by Lemma 3 and the fact that for all and , we obtain
Since , we have for all . Hence, we have
(3) 
As shown by Abbasi-Yadkori et al. (2011), OFUL has a regret of $\tilde{O}(d\sqrt{T})$ in linear contextual bandit models, in which there is only one bandit observation in each period. Hence, the variable cost of SemiUCB, $\tilde{O}(d\sqrt{KT})$, matches the same regret, since there are in total $KT$ bandit observations.
In this section, we show a lower bound on the fixed cost of SemiUCB by analyzing its regret in a simple example.
Theorem 2
The regret of is .
Proof. Suppose there are products and they are split into groups, . For each , has products with the same feature vector , where denotes the unit vector with the th element being one.
Suppose . Then the optimal solution is to offer in all periods, and the expected reward in every period is .
Initially, for each , the UCB value used by for products in group is , which is increasing in .
Since the feature vectors of products in different groups are mutually perpendicular, the UCB value of a product will not be affected by products selected from other groups. Therefore, the initial UCB value for group will not change until one of the products in is picked by .
Given that always picks products with the highest UCB values, and the initial UCB values increase in the group index , will not select any product in in the first periods.
Therefore, the total reward of in the first periods must be zero. It follows that the regret of is at least
Q.E.D.
We make two remarks regarding this lower bound result:

When the retailer selects products in the first period, she has no prior information for the estimation of . Thus, regardless of the value of , the regret of any nonanticipating algorithm is at least . Compared to this, the lower bound for has an extra factor of . Therefore, in order to achieve a fixed cost close to , we have to design a new algorithm, which is our goal in the next section.

When , the lower bound on the regret for is , and there is only a gap compared to the fixed cost that we have proved in Theorem 1. This factor originates from the parameter that scales the lengths of confidence intervals.
In this section, we present a new algorithm, ConsUCB, for the semi-bandit model with linear generalization. The new algorithm is based on a novel construction of exploration steps that are more conservative than standard UCB procedures. Our analysis of ConsUCB shows its regret bound is $\tilde{O}(K\sqrt{d} + d\sqrt{KT})$, which improves on SemiUCB’s fixed cost by a factor of $d$.
As illustrated in the lower bound proof in Section id1, SemiUCB’s potential weakness lies in its tendency to select products with feature vectors that have the same or similar directions. In other words, the set of selected products in each period is sometimes not diversified enough. This is because, in any period $t$, the products in $S_t$ are selected independently based on the UCB values, which are calculated at the start of the period and sometimes form clusters for product groups with similar feature vector directions.
The new algorithm ConsUCB addresses this weakness of SemiUCB by offering more diversified product sets. This is achieved through a sequential selection mechanism in each period, based on an adaptive product score, which is updated per selection to make dissimilar products more likely to be chosen in subsequent selections.
More precisely, in ConsUCB, the products in $S_t$ are selected sequentially, in a fashion that the $k$th selection (for all $k \in [K]$) is based on scores that are updated using the feature vectors of the previously selected products in the period. If product $i$ is the $k$th selection in period $t$, then the scores for all products that share similar feature vectors with product $i$ are decreased. This encourages more diversified product selections in each period. Because of the way the score is defined in ConsUCB, it is always less than or equal to the UCB value. This makes the new algorithm a variant of SemiUCB with more conservative exploration steps, hence the name ConsUCB.
ConsUCB algorithm for semi-bandits with linear generalization (with input parameter $\alpha$):

Initialize and .

Repeat for

(a) Set and . Initialize .

(b) Repeat for

i. Calculate , for all .

ii. Add a product to . Update .


(c) Offer product set . Observe outcomes for .

(d) Update and .

ConsUCB shares a similar framework with SemiUCB, but it differs from SemiUCB in Step 2(b), in which a different product score is maintained and the products in $S_t$ are chosen in a sequential manner. We stress that the construction of product scores, i.e., Step 2(b)i., is novel.
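The sequential selection within a period can be sketched as follows. The exact score formula of Step 2(b)i. is not reproduced here, so this sketch makes an assumption: it scores each product by the point estimate plus `alpha` times the norm of its feature vector weighted by the inverse of a matrix that is updated after every selection in the period. Such a score never exceeds the start-of-period UCB value and drops for products similar to earlier picks, in line with the description above, but it may differ from the paper’s construction. The name `conservative_select` and the toy inputs are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m, v):
    return [dot(row, v) for row in m]


def conservative_select(features, theta_hat, v_inv, k, alpha):
    """Pick k distinct products one at a time, shrinking the exploration bonus of
    products whose feature vectors resemble those already picked in this period."""
    if k > len(features):
        raise ValueError("k cannot exceed the number of products")
    v_inv = [row[:] for row in v_inv]  # work on a copy of the period-start inverse
    chosen = []
    for _ in range(k):
        best, best_score = None, -math.inf
        for i, x in enumerate(features):
            if i in chosen:
                continue
            width = math.sqrt(max(0.0, dot(x, mat_vec(v_inv, x))))
            score = dot(x, theta_hat) + alpha * width  # assumed score form
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
        # Sherman-Morrison update with the chosen feature vector only; no sales
        # outcome is needed, since outcomes are observed after the period ends.
        x = features[best]
        vx = mat_vec(v_inv, x)
        denom = 1.0 + dot(x, vx)
        d = len(x)
        v_inv = [[v_inv[r][c] - vx[r] * vx[c] / denom for c in range(d)] for r in range(d)]
    return chosen


# Toy period (hypothetical): two clusters of products, no prior information.
toy_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.9], [0.0, 0.9]]
identity = [[1.0, 0.0], [0.0, 1.0]]
toy_choice = conservative_select(toy_features, [0.0, 0.0], identity, k=2, alpha=1.0)
```

In this toy period the start-of-period UCB values are 1, 1, 0.9 and 0.9, so a batch rule would take products 0 and 1 from the same cluster, whereas the sequential rule takes product 0 and then product 2, one from each cluster.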
Figure 2 further illustrates the conservative exploration technique. In the figure, the sets of products selected by SemiUCB and ConsUCB in a given time period are compared. In this example, products are clustered into two groups along the feature dimensions, and the state at the beginning of the period, which is the same in both algorithms, slightly favors exploration in feature dimension 2. The upper graphs illustrate that SemiUCB selects products all in one group, reducing the uncertainty related to $\theta^*$ almost only in one dimension (shown by the shrinkages from the larger ellipses to the smaller ones), whereas ConsUCB selects a more diversified product set, significantly reducing uncertainty in both dimensions. The lower two graphs demonstrate that SemiUCB calculates the product scores only once at the beginning and chooses the top $K$ products, while the scores in ConsUCB are updated every time a product is selected.
In this section, we prove that the regret of ConsUCB is $\tilde{O}(K\sqrt{d} + d\sqrt{KT})$.
Proposition 1 demonstrates the key benefit of using conservative exploration. It shows that, under ConsUCB, the regret per product selection can be upper-bounded by the sum of a standard confidence interval term and the increase in the lower confidence bound of a product in the optimal product set. The rest of the regret analysis follows from Proposition 1 and is completed in Theorem 3.
For convenience, let denote , for all , and . For each product and each period , define and as the lower and upper confidence bounds for , respectively. Let denote the th product selected in period by .
Proposition 1
If is run with , then with probability at least , the following conditions hold for all and :

If , then for all ,

If , then
Consider any product , which is selected in the th step in period by . Conditioned on (4), we want to show, for all ,

If , then for all ,
(5) 
If , then
(6)
Consider the first case where . We have
Above, inequality ① follows from condition (4); equalities ② and ④ are by definition of , and ; inequality ③ is because product is selected, instead of , by in the th step in period ; inequality ⑤ is because
Now consider the second case where .
Given condition (4) and , we have