Inventory Balancing with Online Learning
Wang Chi Cheung
National University of Singapore, NUS Engineering, Department of Industrial Systems Engineering and Management, Singapore, SG 117576, wangchimit@gmail.com
Will Ma
Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139, willma@mit.edu
David Simchi-Levi
Institute for Data, Systems, and Society, Department of Civil and Environmental Engineering, and Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139, dslevi@mit.edu
Xinshang Wang
Institute for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, MA 02139, xinshang@mit.edu
We study a general problem of allocating limited resources to heterogeneous customers over time under model uncertainty. Each type of customer can be serviced using different actions, each of which stochastically consumes some combination of resources, and returns different rewards for the resources consumed. We consider a general model where the resource consumption distribution associated with each (customer type, action) combination is not known, but is consistent and can be learned over time. In addition, the sequence of customer types to arrive over time is arbitrary and completely unknown.
We overcome both the challenges of model uncertainty and customer heterogeneity by judiciously synthesizing two algorithmic frameworks from the literature: inventory balancing, which “reserves” a portion of each resource for high-reward customer types which could later arrive; and online learning, which shows how to “explore” the resource consumption distributions of each customer type under different actions. We define an auxiliary problem, which allows for existing competitive ratio and regret bounds to be seamlessly integrated. Furthermore, we show that the performance guarantee generated by our framework is tight; that is, we provide an information-theoretic lower bound which shows that both the loss from competitive ratio and the loss from regret are relevant in the combined problem.
Finally, we demonstrate the efficacy of our algorithms on a publicly available hotel data set. Our framework is highly practical in that it requires no historical data (no fitted customer choice models, nor forecasting of customer arrival patterns) and can be used to initialize allocation strategies in fast-changing environments.
Online resource allocation is a fundamental topic in many applications of operations research, such as revenue management, display advertisement allocation and appointment scheduling. In each of these settings, a manager needs to allocate limited resources to a heterogeneous pool of customers arriving in real time, while maximizing a certain notion of cumulative reward. The starting amount of each resource is exogenous, and these resources cannot be replenished during the planning horizon.
In many applications, the manager can observe a list of feature values of each arriving customer, which allows the manager to customize allocation decisions in real time. For example, a display advertising platform operator is usually provided with the internet cookie from a website visitor, upon the latter’s arrival. Consequently, the operator is able to display relevant advertisements to each website visitor, in a bid to maximize the total revenue earned from clicks on these advertisements.
To achieve an optimal allocation in the presence of resource constraints, the manager’s allocation decisions at any moment have to take into account the features of both the arriving customer and the customers who will arrive in the future. For example, in selling airline tickets, it is profitable to judiciously reserve a number of seats for business class customers, who often purchase tickets close to departure time (Talluri and van Ryzin 1998, Ball and Queyranne 2009). In healthcare applications, when making advance appointments for outpatients, it is critical to reserve certain physicians’ hours for urgent patients (Feldman et al. 2014, Truong 2015). In these examples, the manager’s central task is to reserve the right amount of each resource for future customers so as to maximize the expected reward.
While resource reservation is vital for optimizing online resource allocations, the implementation of resource reservation is often hindered by the following two challenges. First, the manager often lacks an accurate forecast model about the arrival patterns of future demand. For example, it may be difficult to model the demand spike on Black Friday as a stochastic process.
Second, the manager is often uncertain about the relationship between an arriving customer’s behavior (e.g., click-through rate) and his/her observed features. For example, when selling a new product at an online platform, the manager initially has very little information about the relationship between a customer’s observed feature values and his/her willingness to pay for a product.
These challenges in implementing resource reservation raise the following research question: Can the manager perform resource reservation effectively, in the absence of any demand forecast model and under uncertain customer behavior?
We describe an online resource allocation problem using general, neutral terminology. A central platform starts with finite and discrete amounts of inventory for multiple resources. Each unit of a resource yields a reward when consumed by a customer. Customers arrive sequentially, each of whom is characterized by a context vector that describes the customer’s features. Upon the arrival of each customer, the platform selects an action, which corresponds to offering a subset of resources to the customer. Then the platform accumulates the reward value for each unit of resource consumed by the customer. The objective of the central platform is to maximize the total reward collected from all the resources.
We make the following two important assumptions in our model:

The number of future customer arrivals and the context vectors of each one of them are unknown and picked by an adversary. As a result, the historical observation at any time step does not provide any information about future customers.

For each potential combination of context vector and action, there is a fixed unknown distribution over the consumption outcome. That is, two customers arriving at different time periods with identical context vectors will have the same consumption distribution. As a concrete example, in e-commerce, the context vector represents the characteristics (e.g., age, location) of an online shopper. We are assuming that the conversion rate only depends on the characteristics of the shopper and the product offered. The platform needs to learn these conversion rates in an online fashion.
Each of these two assumptions has been studied extensively, but only separately, in the literature. In models with the first assumption alone, customer behavior such as purchase probabilities is known, and the difficulty is in conducting resource reservation without any demand forecast. The conventional approach is to balance the cost of reserving each unit of a resource with the opportunity cost of allocating it. We call such techniques inventory balancing. Meanwhile, in models with the second assumption alone, the tradeoff is between “exploring” the outcomes from playing different actions on different customers, and “exploiting” actions which are known to yield desirable outcomes. Online learning techniques are designed for managing this tradeoff. However, in the presence of resource constraints, online learning techniques assume that the context vectors are drawn IID from a known distribution, and there is no element of “hedging” against an adversarial input sequence.
In this research, we present a unified analysis of resource-constrained online allocation in the presence of both of these assumptions. We make the following contributions:

We propose a framework that integrates the inventory balancing technique with a broad class of online learning algorithms (Section id1). The framework produces online allocation algorithms with provable performance guarantees (Section id1), which can be informally expressed as
$$\mathrm{ALG} \ge \alpha \cdot \mathrm{OPT} - \mathrm{REG}, \tag{1}$$
where $\mathrm{ALG}$ is the performance of the algorithm produced by our framework; $\mathrm{OPT}$ is an upper bound on the expected revenue of an optimal algorithm which knows both the arrival sequence and the click probabilities in advance; $\mathrm{REG}$ represents the regret, i.e., the loss from exploring customer behavior; and $\alpha$ can be viewed as the competitive ratio when customer behavior is known, i.e., when $\mathrm{REG} = 0$.

As an application of the framework, we analyze an online bipartite matching problem where edges, once assigned, are only matched with an unknown probability (Section id1). We use the framework to generate an online matching algorithm based on the Upper Confidence Bound (UCB) technique. We prove that the algorithm has performance guarantee
(2) As a result, is bounded from below by , which approaches the best-possible competitive ratio of as becomes large (i.e., the regret from learning the matching probabilities becomes negligible). We also show that this is tight: we construct a setting where in (2), the loss of is unavoidable due to not knowing the arrival sequence in advance, and the loss of is unavoidable due to not knowing the matching probabilities in advance.

We study a dynamic assortment planning problem in which each resource can be sold at different reward rates (Section id1). We propose an online algorithm based on Thompson sampling, and test it on the hotel dataset of Bodea et al. (2009) (Section id1).
We summarize the positioning of our paper in Table 1. Our analysis incorporates the loss from two unknown aspects: the adversarial sequence of customer contexts, and the probabilistic decision for a given customer context. When one or both of these aspects are known, many papers have analyzed the corresponding metrics of interest (competitive ratio, regret, approximation ratio), which we review in Section 1.
Table 1: Positioning of this paper. Columns give the sequence of customer contexts; rows give the decisions of a customer with a given context.

                              | (Distributionally) known | Unknown, adversarial (must hedge)
(Distributionally) known      | Approximation Algorithms | Competitive Analysis
Unknown, IID (can learn)      | Online Learning          | [this paper]
To our understanding, we are the first to give a unified analysis for online algorithms involving (i) resource constraints, (ii) learning customer behavior, and (iii) adversarial customer arrivals. We review past work which has considered some subset of these aspects, as outlined in Table 1.
When both the arrival sequence and customer decisions are distributionally known, many algorithms have been proposed for overcoming the “curse of dimensionality” in solving the corresponding dynamic programming problem. Performance guarantees of bid-pricing algorithms were initially analyzed in Talluri and van Ryzin (1998). Later, Alaei et al. (2012) and Wang et al. (2015) proposed new algorithms with improved bounds, for models with time-varying customer arrival probabilities. These performance guarantees are relative to a deterministic LP relaxation (see Section id1) instead of the optimal dynamic programming solution, and hence still represent a form of “competitive ratio” relative to a clairvoyant which knows the arrival sequence in advance (see Wang et al. (2015)).
In addition, the special case in which customer arrival probabilities are time-invariant has been studied in Feldman et al. (2009) and its subsequent research. We refer to Brubach et al. (2016) for discussions of recent research in this direction.
We briefly review the literature analyzing the competitive ratio for resource allocation problems under adversarial arrivals. This technique is often called competitive analysis, and for a more extensive background, we refer the reader to Borodin and El-Yaniv (2005). For more on the application of competitive analysis in online matching and allocation problems, we refer to Mehta (2013). For more on the application of competitive analysis in airline revenue management problems, we refer to the discussions in Ball and Queyranne (2009).
Our work is focused on the case where competitive analysis is used to manage the consumption of resources. The prototypical problem in this domain is the Adwords problem (Mehta et al. 2007). Often, the resources are considered to have large starting capacities; this assumption is equivalently called the “small bids assumption” (Mehta et al. 2007), “large inventory assumption” (Golrezaei et al. 2014), or “fractional matching assumption” (Kalyanasundaram and Pruhs 2000). In our work, we use the best-known bound that is parametrized by the starting inventory amounts (Ma and Simchi-Levi 2017). The Adwords problem originated from the classical online matching problem (Karp et al. 1990); see Devanur et al. (2013) for a recent unified analysis. The competitive ratio aspect of our analysis uses ideas from this analysis as well as the primal-dual analysis of Adwords (Buchbinder et al. 2007). We also refer to Devanur and Jain (2012), Kell and Panigrahi (2016), Ma and Simchi-Levi (2017) for recent generalizations of the Adwords problem.
Our model also allows for probabilistic resource consumption, resembling many recent papers in the area starting with Mehta and Panigrahi (2012). We incorporate the assortment framework of Golrezaei et al. (2014), where the probabilistic consumption comes in the form of a random customer choice; see also Chen et al. (2016), Ma and Simchi-Levi (2017). However, unlike those three papers on assortment planning, which assume some substitutability assumption in the customer choice model, we instead allow for resources which have run out to still be consumed, but at zero reward.
The problem of learning customer behavior is conventionally studied in the field of online learning. For a comprehensive review on recent advances in online learning, we refer the reader to Bubeck and Cesa-Bianchi (2012), Slivkins (2017).
Our research focuses on online learning problems with resource constraints. Badanidiyuru et al. (2014), Agrawal and Devanur (2014) incorporate resource constraints into the standard multi-armed bandit problem, and propose allocation algorithms with provable upper bounds on the regret. Badanidiyuru et al. (2013), Agrawal and Devanur (2016), Agrawal et al. (2016) study extensions in which customers are associated with independently and identically distributed context vectors; the values of reward and resource consumption are determined by the customer context. Besbes and Zeevi (2009, 2012), Babaioff et al. (2015), Wang et al. (2014), Ferreira et al. (2016) study pricing strategies for revenue management problems, where a resource-constrained seller offers a price from a potentially infinite price set to each arriving customer. Customers are homogeneous, in the sense that each customer has the same purchase probability under the same offered price.
Those models with resource constraints in the current literature assume that the type (if there is any) of each customer is drawn from a fixed distribution that does not change over time. As a result, there exists an underlying fixed randomized allocation strategy (typically based on an optimal linear programming solution) that converges to optimality as the number of customers becomes large. The idea of the online learning techniques involved in the above-mentioned research works is to try to converge to that fixed allocation strategy. In our model, however, there is no such fixed allocation strategy that we can discover over time. For instance, the optimal algorithm in our model may reject all the low-fare customers who arrive first and reserve all the resources for high-fare customers who arrive at the end. As a result, the optimal algorithm does not earn any reward at first, and thus cannot be identified as the best strategy by any learning technique. Our analysis is innovative as we construct learning algorithms with strong performance guarantees without trying to converge to any benchmark allocation strategy.
Throughout this paper, we let $\mathbb{N}$ denote the set of positive integers. For any $n \in \mathbb{N}$, let $[n]$ denote the set $\{1, \ldots, n\}$.
Consider a class of online resource allocation problems, generically modeled as follows. A central platform starts with $n \in \mathbb{N}$ resources. Each resource $i \in [n]$ has a reward $r_i > 0$ associated with it (later, in Section id1, we will allow each resource to have multiple reward values). Each resource $i$ also has an unreplenishable discrete starting inventory $b_i \in \mathbb{N}$. We denote $r_{\max} = \max_{i \in [n]} r_i$ and $b_{\min} = \min_{i \in [n]} b_i$.
There is a latent sequence of customers who will arrive sequentially. Each customer is associated with a context vector , where is a known context set. The sequence of context vectors is revealed sequentially. That is, for each , the platform must make the decision for customer , without knowing the values of nor the value of . We will use the phrases “customer ” and “time period ” interchangeably.
When a customer arrives at the platform, the platform observes the customer context and takes an action . Under context and action , the customer’s behavior is governed by a distribution over outcomes. For each , the distribution is not known to the platform. An outcome is defined by the subset of resources consumed, given by a vector in . If and resource is not yet depleted, then one unit of the inventory of resource is consumed, and a reward of is earned. If but resource is depleted, or if , then no resource is consumed and no reward is earned. For all , , and , let be the probability of outcome when action is played on context .
In each period , events occur in the following sequence. First, the customer context is revealed. Second, the platform plays an action on . The action is determined by an online algorithm that sees and all the information prior to period . Third, the platform observes the outcome in period , drawn according to , and collects rewards.
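The third step of this sequence, drawing an outcome and applying the depletion rule, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `step`, the dictionary representation of the outcome distribution, and the argument names are all assumptions made for the example.

```python
import random


def step(inventory, rewards, outcome_probs, rng):
    """Simulate the end of one period for a chosen (context, action) pair.

    inventory     -- remaining units of each resource (updated in place)
    rewards       -- reward earned per unit of each resource
    outcome_probs -- hypothetical encoding of the outcome distribution:
                     maps each outcome tuple in {0,1}^n to its probability
    rng           -- a random.Random instance
    """
    outcomes = list(outcome_probs)
    weights = [outcome_probs[y] for y in outcomes]
    # Draw the consumption outcome for this period.
    y = rng.choices(outcomes, weights=weights, k=1)[0]

    reward = 0.0
    for i, y_i in enumerate(y):
        # A unit is consumed (and its reward earned) only if the
        # resource is requested and not yet depleted.
        if y_i == 1 and inventory[i] > 0:
            inventory[i] -= 1
            reward += rewards[i]
    return y, reward
```

Calling `step` repeatedly with the same `inventory` list shows the depletion rule at work: once a resource reaches zero, later outcomes that request it earn nothing.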
The sequence of context vectors and the mapping can be interpreted as being chosen by an oblivious adversary, who cannot see any information related to . As a result, we will treat all of the adversarially chosen parameters , , and as being deterministic and chosen ahead of time. An algorithm is evaluated on the expected reward it earns, for any fixed set of . In Section id1, we will bound the expected reward earned by our algorithms from Section id1, in comparison to that earned by an optimal algorithm which knows all of in advance, with the bound holding for any values of .
In this section, we present a framework which generates online allocation algorithms by integrating a broad class of online learning techniques, such as Upper Confidence Bounds (UCBs) and Thompson Sampling, with the inventory-balancing technique that hedges against an unknown sequence of customer contexts.
The framework first creates an auxiliary problem, which exclusively focuses on the exploration-exploitation tradeoff, by removing all the inventory constraints from the original problem. In other words, there is no need to conduct resource reservation in the auxiliary problem. As a result, we can apply existing online learning techniques on the auxiliary problem, and achieve regrets sublinear in the number of customers. Next, given any online learning algorithm for the auxiliary problem, the framework converts it into another algorithm that performs both learning and resource reservation for the original model.
The auxiliary problem is a contextual stochastic bandit problem, in which we define the context set , action set , and distributions in the same way as in the original problem. The distributions are still unknown from the beginning and need to be learned.
The auxiliary problem differs from the original problem in two ways.
First, we define all of the resources to have unlimited inventory in the auxiliary problem. As a result, algorithms for the auxiliary problem are not concerned with any global inventory constraints, i.e. if the distributions were known, then the optimal algorithm would simply maximize the immediate reward for each period.
Second, the reward of resource in period is now defined as , which depends on . In each period , the online algorithm is given before having to make decision ; however, it does not know the reward values for future periods. Thus, we can view as additional contextual information that is observed by online algorithms at the beginning of period . We assume that is chosen by an adaptive adversary, so that may depend on the actions played and outcomes realized in periods (whereas the sequence of context vectors is still fixed a priori in both the original problem and the auxiliary problem). We restrict the adversary so that all of the chosen rewards are bounded from above by .
Let denote a random variable that encapsulates all the external information used by an online algorithm. For example, can represent the random seed used by a sampling algorithm. The adversary cannot see the realization of . Without loss of generality, let
denote all the information that an online algorithm uses to make decision . If is a constant, the algorithm is a deterministic algorithm; otherwise, the algorithm is randomly drawn from a family of deterministic algorithms, according to the distribution of . Then, an online algorithm for the auxiliary problem can be represented by a list of oracles such that the algorithm makes decision in period .
Given an online algorithm , we define its regret for any realized sample path as
(3) 
where
is the expected reward of taking action in period in the auxiliary problem and
is the optimal action in period .
The goal in the auxiliary problem is to minimize the expected value of the regret (3). Typically, we expect , where the notation omits logarithmic factors. Depending on the specific problem setting, the constants hidden in may depend on parameters that are specific to the structure of . The value of will be used in our performance guarantee in Section id1.
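To make the regret of the auxiliary problem concrete, the sketch below sums, over periods, the gap between the best expected per-period reward and the expected reward of the action actually played. It is an illustration under stated assumptions: the names `expected_reward` and `regret` are hypothetical, and each action is summarized by its marginal consumption probability for every resource, which is enough to compute expected rewards when inventory is unlimited.

```python
def expected_reward(period_rewards, consume_prob):
    """Expected one-period reward when resource i is consumed with
    probability consume_prob[i] and pays period_rewards[i]."""
    return sum(r * p for r, p in zip(period_rewards, consume_prob))


def regret(rewards_by_period, probs_by_period, actions_played):
    """Cumulative regret on one sample path of the auxiliary problem.

    rewards_by_period -- per-period list of (time-dependent) resource rewards
    probs_by_period   -- per-period dict: action -> consumption probabilities
                         under that period's context (hypothetical encoding)
    actions_played    -- the action chosen by the algorithm in each period
    """
    total = 0.0
    for r_t, probs_t, a_t in zip(rewards_by_period, probs_by_period,
                                 actions_played):
        values = {a: expected_reward(r_t, p) for a, p in probs_t.items()}
        # Gap between the best action this period and the one played.
        total += max(values.values()) - values[a_t]
    return total
```

Because the per-period rewards change over time, the best action can differ between two periods with the same context, which is why the benchmark is recomputed in every period.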
Suppose we are given an online learning algorithm for the auxiliary problem. Our integrated algorithm defines the time-dependent rewards for the auxiliary problem based on the resources that have been consumed by that time period.
For each time and resource , let denote the number of units of resource that have been consumed by the end of time . is understood to equal for all .
Then, we define the function $\Psi(x) = \frac{e^{x} - 1}{e - 1}$, which is commonly used in inventory-constrained online problems to hedge against adversarial arrivals (see Buchbinder et al. (2007)). This is a convex function which increases from 0 to 1 over $[0, 1]$.
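The stated properties of this function are easy to check numerically. The sketch below assumes the exponential form $(e^{x} - 1)/(e - 1)$ that is standard in the primal-dual analysis of Adwords; the name `psi` is chosen for the example.

```python
import math


def psi(x):
    """Assumed penalty function (e^x - 1) / (e - 1) on [0, 1]."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("psi is only defined on [0, 1]")
    # expm1(x) computes e^x - 1 accurately for small x.
    return math.expm1(x) / math.expm1(1.0)


# Evaluate on a grid to inspect monotonicity and convexity.
grid = [k / 10 for k in range(11)]
values = [psi(x) for x in grid]
```

Convexity means the penalty grows slowly while most of the inventory remains and sharply as the resource approaches depletion.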
We are now ready to define our integrated algorithm. For each time period :

For each resource , define its discounted reward for time to be
(4) where is the amount of resource that has been consumed at the start of time ;

Play action , where the input
for the oracle is constructed based on the discounted rewards generated in the previous step.
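The two steps above can be sketched as follows. This is an illustration, not the authors' code: it assumes that rule (4) scales each reward by one minus the penalty function evaluated at the consumed fraction of inventory, and that the penalty has the exponential form $(e^{x} - 1)/(e - 1)$. The oracle interface and `greedy_oracle` are hypothetical stand-ins; a real oracle would learn the consumption probabilities rather than read them from the context.

```python
import math


def psi(x):
    # Assumed penalty function, increasing from 0 to 1 on [0, 1].
    return math.expm1(x) / math.expm1(1.0)


def discounted_rewards(rewards, consumed, start_inventory):
    """Assumed form of rule (4): shrink each reward as inventory is used."""
    return [r * (1.0 - psi(n / b))
            for r, n, b in zip(rewards, consumed, start_inventory)]


def choose_action(oracle, context, rewards, consumed, start_inventory, history):
    """One period of the integrated algorithm: discount, then ask the oracle."""
    r_t = discounted_rewards(rewards, consumed, start_inventory)
    return oracle(context, r_t, history), r_t


def greedy_oracle(context, period_rewards, history):
    # Hypothetical stand-in: here the context maps each action to known
    # per-resource consumption probabilities, so no learning takes place.
    return max(context,
               key=lambda a: sum(r * p for r, p in zip(period_rewards, context[a])))
```

With this discounting, a depleted resource has discounted reward zero, so the oracle has no incentive to offer it, while a nearly untouched resource keeps almost its full reward.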
Our integrated algorithm has a very specific rule (4) of choosing the value of , which depends on the random outcomes in previous periods. In the auxiliary problem, however, we more generally allow to take any value generated in an adaptive way. Such a relaxation does not restrict the scope of online learning techniques that we can apply. This is because for models with adaptive contextual information (recall that we view as part of the contextual information), most online learning algorithms achieve near optimality under context vectors generated by an adaptive adversary.
In this section, we prove a performance guarantee for algorithms generated by our framework. Later in Section id1, we will prove that this performance guarantee is tight for a special case of our model.
The expected reward of an algorithm which knows in advance both the distributions and the arrival sequence can be upper-bounded by the following LP, which is a standard result in the revenue management literature.
Primal:
(5)  
In the LP, the variable encapsulates the unconditional probability of an algorithm taking action in period . Given a fixed underlying problem instance, we set to be the optimal objective of the LP. Its dual can be written as follows.
Dual:
(6)  
(7)  
We prove the performance guarantee using a primal-dual approach. More precisely, we construct a primal solution to the LP (5), whose objective value equals the total reward of our algorithm plus the cumulative regret due to not knowing the outcome distributions in advance. We also construct a dual solution to (6), which bounds the optimal LP objective from above. The performance guarantee is obtained by finding a relationship in the form of (1) between the objective values of the constructed primal and dual solutions.
We will let be the random variable representing the reward earned by the algorithm under consideration. Recall that denotes the optimal action during period in the auxiliary problem, while denotes the number of units of resource consumed by the end of time ; these values will also be treated as random in the analysis.
Definition 1 (Random Dual Variables)
Define the following dual variables for all , which are random variables:
Lemma 1 (Feasibility)
The dual variables defined in Definition 1 are feasible on every sample path. Therefore, if we set for all and for all , then this provides a feasible solution to the Dual LP.
The following theorem gives our main result, which states that the optimality gap of our online allocation algorithm is at most a constant fraction of the optimal LP objective plus the regret in the auxiliary problem.
Theorem 1
The total reward of the algorithm generated by our framework satisfies
(8) 
When , the above expression can be written as
(9) 
Recall that $b_{\min}$ denotes the smallest starting inventory among the resources. The expression in bound (8) represents the best-known dependence on $b_{\min}$ in the competitive ratio (Ma and Simchi-Levi 2017). The expression decreases to 1 as $b_{\min} \to \infty$.
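The behavior of the inventory-dependent factor can be tabulated numerically. The sketch below is an illustration under an explicit assumption: it takes the factor to be $(1 + b_{\min})(1 - e^{-1/b_{\min}})$, a form that decreases to 1 as described above, and it takes the competitive ratio with zero regret to be $(1 - 1/e)$ divided by that factor. Both function names are hypothetical.

```python
import math


def inventory_factor(b_min):
    """Assumed factor (1 + b)(1 - exp(-1/b)); it decreases to 1 as b grows."""
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x.
    return (1 + b_min) * (-math.expm1(-1.0 / b_min))


def implied_ratio(b_min):
    """Competitive ratio implied by the assumed factor when regret is zero."""
    return (1.0 - 1.0 / math.e) / inventory_factor(b_min)


for b in (1, 2, 5, 10, 100):
    print(b, round(inventory_factor(b), 4), round(implied_ratio(b), 4))
```

Under this assumed form, a starting inventory of one unit gives a ratio of exactly one half, and the ratio rises toward $1 - 1/e$ as the smallest starting inventory grows.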
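Proof. By Lemma 1 and weak duality, the optimal objective value of the LP is at most the expected objective value of the dual solution built from the random dual variables. Using the definitions of , , and , we obtain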
Recall that is the input to the auxiliary problem at the start of time , which determines the values of . Conditioned on , the algorithm’s action is determined. Thus, for any resource ,
(10) 
We explain equation (10). Conditional on any , the vector of outcomes is distributed according to . If and resource is not yet depleted, i.e. , then a unit of resource is consumed, leading to . If but resource is depleted, i.e. , then resource cannot be consumed further, leading to .
Using the tower property of conditional expectation over the randomness in , and substituting in equation (10), we obtain
(11) 
In this section, we present a specific application of the framework on the online matching problem. In this problem, each resource corresponds to an advertiser who is willing to spend at most dollars for receiving clicks on the advertiser’s advertisements. The context set is , where we recall that is the number of resources/advertisers. For each , indicates whether a customer with context can be matched to advertiser .
Each advertiser has different advertisements, e.g., videos/banners. Upon the arrival of a customer , the platform needs to pick an advertiser such that , and display an advertisement that belongs to advertiser .
The action set can be written as . When action is played on customer context , the customer will click on the displayed advertisement with probability . The platform earns reward from each click on any advertisement belonging to advertiser .
The values of are unknown from the beginning. For each and , the distribution can be written as
The auxiliary problem is a variant of the classic stochastic multi-armed bandit problem, in which the expected reward of each arm in each period is scaled by an observable factor . The following algorithm for the auxiliary problem is based on the UCB technique.
Let denote the set of periods such that action is taken in period and . Define function . Recall that is the random outcome in period . Let
be an estimate for , and
be the size of its confidence interval.
Upon the arrival of customer , the algorithm takes action that maximizes the upper confidence bound .
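The selection rule can be sketched as a small class that tracks an empirical click rate per action and picks the action with the largest scaled upper confidence bound. This is a sketch under stated assumptions, not the paper's exact algorithm: the class name `ScaledUCB`, the confidence radius (a variance-sensitive form with a logarithmic constant), and the choice to update on every period are all assumptions made for the example.

```python
import math


class ScaledUCB:
    """UCB for click probabilities, with rewards scaled by an observed factor."""

    def __init__(self, actions, horizon):
        self.counts = {a: 0 for a in actions}
        self.clicks = {a: 0 for a in actions}
        # Assumed logarithmic constant used in the confidence radius.
        self.log_term = math.log(max(horizon, 2))

    def mean(self, a):
        n = self.counts[a]
        return self.clicks[a] / n if n else 0.0

    def radius(self, a):
        n = self.counts[a]
        if n == 0:
            return 1.0  # no data yet: maximal optimism
        # Assumed radius: sqrt(C * mean / n) + C / n.
        return math.sqrt(self.log_term * self.mean(a) / n) + self.log_term / n

    def ucb(self, a):
        # A probability never exceeds 1, so clip the optimistic estimate.
        return min(1.0, self.mean(a) + self.radius(a))

    def select(self, feasible, scaled_reward):
        """Pick the feasible action maximizing scaled reward times UCB."""
        return max(feasible, key=lambda a: scaled_reward[a] * self.ucb(a))

    def update(self, a, clicked):
        self.counts[a] += 1
        self.clicks[a] += int(clicked)
```

In the integrated algorithm, `scaled_reward` would hold the discounted reward of the advertiser behind each action, so the same class handles both learning and reservation through its inputs.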
Lemma 2
(Kleinberg et al. 2008) Consider independently and identically distributed random variables in . Let , and . Then, for any , we have with probability at least ,
Proposition 1
In any fixed period , with probability at least we have
for all .
For any fixed , applying Lemma 2 to the sequence of random variables , we can obtain with probability at least ,
Since
we have with probability at least ,
The proposition is proved by choosing .
The following proposition gives a regret upper bound on the UCB algorithm on the auxiliary problem. The bound is sublinear in the expected optimal value of the Primal linear program, and the bound is in particular sublinear in .
Proposition 2
In the auxiliary problem, the total regret of the UCB algorithm is
Conditioned on the high probability event that for all , , the total regret can be bounded from above as follows:
Recall that , so we must have . Let denote the number of periods in which action is taken and resource still has positive remaining inventory. For any fixed , we have