Meta Dynamic Pricing: Learning Across Experiments
Hamsa Bastani
Operations, Information and Decisions, Wharton School, hamsab@wharton.upenn.edu
David Simchi-Levi
Institute for Data, Systems, and Society, Massachusetts Institute of Technology, dslevi@mit.edu
Ruihao Zhu
Statistics and Data Science Center, Massachusetts Institute of Technology, rzhu@mit.edu
We study the problem of learning across a sequence of price experiments for related products, focusing on implementing the Thompson sampling algorithm for dynamic pricing. We consider a practical formulation of this problem where the unknown parameters of the demand function for each product come from a prior that is shared across products, but is unknown a priori. Our main contribution is a meta dynamic pricing algorithm that learns this prior online while solving a sequence of non-overlapping pricing experiments (each with horizon $T$) for $N$ different products. Our algorithm addresses two challenges: (i) balancing the need to learn the prior (meta-exploration) with the need to leverage the current estimate of the prior to achieve good performance (meta-exploitation), and (ii) accounting for uncertainty in the estimated prior by appropriately “widening” the prior as a function of its estimation error, thereby ensuring convergence of each price experiment. We prove that the price of an unknown prior for Thompson sampling is negligible in experiment-rich environments (large $N$). In particular, we upper bound our algorithm’s meta regret both when the covariance of the prior is known and when it is unknown. Numerical experiments on synthetic and real auto loan data demonstrate that our algorithm significantly speeds up learning compared to prior-independent algorithms or a naive approach of greedily using the updated prior across products.
Key words: Thompson sampling, transfer learning, dynamic pricing, meta learning
Experimentation is popular on online platforms to optimize a wide variety of elements such as search engine design, homepage promotions, and product pricing. This has led firms to perform an increasing number of experiments, and several platforms have emerged to provide the infrastructure for these firms to perform experiments at scale (see, e.g., Optimizely 2019). State-of-the-art techniques in these settings employ bandit algorithms (e.g., Thompson sampling), which seek to adaptively learn treatment effects while optimizing performance within each experiment (Thompson 1933, Scott 2015). However, the large number of related experiments raises the question: can we transfer knowledge across experiments?
We study this question for Thompson sampling algorithms in dynamic pricing applications that involve a large number of related products. Dynamic pricing algorithms enable retailers to optimize profits by sequentially experimenting with product prices, and learning the resulting customer demand (Kleinberg and Leighton 2003, Besbes and Zeevi 2009). Such algorithms have been shown to be especially useful for products that exhibit relatively short life cycles (Ferreira et al. 2015), stringent inventory constraints (Xu et al. 2019), strong competitive effects (Fisher et al. 2017), or the ability to offer personalized coupons/pricing (Zhang et al. 2017, Ban and Keskin 2017). In all these cases, the demand of a product is estimated as a function of the product’s price (chosen by the decision-maker) and a combination of exogenous features as well as product-specific and customer-specific features. Through carefully chosen price experimentation, the decision-maker can learn the price-dependent demand function for a given product, and choose an optimal price to maximize profits (Qiang and Bayati 2016, Cohen et al. 2016, Javanmard and Nazerzadeh 2016). Dynamic pricing algorithms based on Thompson sampling have been shown to be particularly successful in striking the right balance between exploring (learning the demand) and exploiting (offering the estimated optimal price), and are widely considered to be state-of-the-art (Thompson 1933, Agrawal and Goyal 2013, Russo and Van Roy 2014, Ferreira et al. 2018).
The decision-maker typically runs a separate pricing experiment (i.e., dynamic pricing algorithm) for each product. However, this approach can waste valuable samples rediscovering information shared across different products. For example, students may be more price-sensitive than general customers; as a result, many firms such as restaurants, retailers and movie theaters offer student discounts. This implies that the coefficient of student-specific price elasticity in the demand function is positive for many products (although the specific value of the coefficient likely varies across products). Similarly, winter clothing may have higher demand in the fall and lower demand at the end of winter. This implies that the demand functions of winter clothing may have similar coefficients for the features indicating time of year. In general, there may even be complex correlations between coefficients of the demand functions of products that are shared. For example, the price elasticities of products are often negatively correlated with their demands, i.e., customers are willing to pay higher prices when the demand for a product is high. Thus, one may expect that the demand functions for related products may share some (a priori unknown) common structure, which can be learned across products. Note that the demand functions are unlikely to be exactly the same, so a decision-maker would still need to conduct separate pricing experiments for each product. However, accounting for shared structure during these experiments may significantly speed up learning per product, thereby improving profits.
In this paper, we propose an approach to learning shared structure across pricing experiments. We begin by noting that the key (and only) design decision in Thompson sampling methods is the Bayesian prior over the unknown parameters. This prior captures shared structure of the kind we described above; e.g., the mean of the prior on the student-specific price-elasticity coefficient may be positive with a small standard deviation. It is well known that choosing a good (bad) prior significantly improves (hurts) the empirical performance of the algorithm (Chapelle and Li 2011, Honda and Takemura 2014, Liu and Li 2015, Russo et al. 2018). However, the prior is typically unknown in practice, particularly when the decision-maker faces a cold start. While the decision-maker can use a prior-independent algorithm (Agrawal and Goyal 2013), such an approach achieves poor empirical performance due to over-exploration; in contrast, knowledge of the correct prior enables Thompson sampling to appropriately balance exploration and exploitation (Russo and Van Roy 2014). Thus, the decision-maker needs to learn the true prior (i.e., shared structure) across products to achieve good performance. We propose a meta dynamic pricing algorithm that efficiently achieves this goal.
We first formulate the problem of learning the true prior online while solving a sequence of pricing experiments for different products. Our meta dynamic pricing algorithm requires two key ingredients. First, for each product, we must balance the need to learn about the prior (“meta-exploration”) with the need to leverage the prior to achieve strong performance for the current product (“meta-exploitation”). In other words, our algorithm balances an additional exploration-exploitation tradeoff across price experiments. Second, a key technical challenge is that finite-sample estimation errors of the prior may significantly impact the performance of Thompson sampling for any given product. In particular, vanilla Thompson sampling may fail to converge with an incorrect prior; as a result, directly using the estimated prior across products can result in poor performance. In order to maintain strong performance guarantees for every product, we increase the variance of the estimated prior by a term that is a function of the prior’s estimated finite-sample error. Thus, we use a more conservative approach (a wide prior) for earlier products when the prior is uncertain; over time, we gain a better estimate of the prior, and can leverage this knowledge for better empirical performance. Our algorithm provides an exact prior correction path over time to achieve strong performance guarantees across all pricing problems. We prove that, when using our algorithm, the price of an unknown prior for Thompson sampling is negligible in experiment-rich environments (i.e., as the number of products grows large).
Experimentation is widely used to optimize decisions in a data-driven manner. This has led to a rich literature on bandits and A/B testing (Lai and Robbins 1985, Auer 2002, Dani et al. 2008, Rusmevichientong and Tsitsiklis 2010, Besbes et al. 2014, Johari et al. 2015, Bhat et al. 2019). This literature primarily proposes learning algorithms for a single experiment, while our focus is on meta-learning across experiments. There has been some work on meta-learning algorithms in the bandit setting (Hartland et al. 2006, Maes et al. 2012, Wang et al. 2018, Sharaf and Daumé III 2019) as well as the more general reinforcement learning setting (Finn et al. 2017, 2018, Yoon et al. 2018). Relatedly, Raina et al. (2006) propose constructing an informative prior based on data from similar learning problems. These papers provide heuristics for learning exploration strategies given a fixed set of past problem instances. However, they do not prove any theoretical guarantees on the performance or regret of the meta-learning algorithm. To the best of our knowledge, our paper is the first to propose a meta-learning algorithm in a bandit setting with provable regret guarantees.
We study the specific case of dynamic pricing, which aims to learn an unknown demand curve in order to optimize profits. We focus on dynamic pricing because meta-learning is particularly important in this application, e.g., online retailers such as Rue La La may run numerous pricing experiments for related fashion products. We believe that a similar approach could be applied to multi-armed or contextual bandit problems, in order to inform the prior for Thompson sampling across a sequence of related bandit problems.
Dynamic pricing has been found to be especially useful in settings with short life cycles or limited inventory (e.g., fast fashion or concert tickets, see Ferreira et al. 2015, Xu et al. 2019), among online retailers that constantly monitor competitor prices and adjust their own prices in response (Fisher et al. 2017), or when prices can be personalized based on customer-specific price elasticities (e.g., through personalized coupons, see Zhang et al. 2017). Several papers have designed near-optimal dynamic pricing algorithms for pricing a product by balancing the resulting exploration-exploitation tradeoff (Kleinberg and Leighton 2003, Besbes and Zeevi 2009, Araman and Caldentey 2009, Farias and Van Roy 2010, Harrison et al. 2012, Broder and Rusmevichientong 2012, den Boer and Zwart 2013, Keskin and Zeevi 2014). Recently, this literature has shifted focus to pricing policies that dynamically optimize the offered price with respect to exogenous features (Qiang and Bayati 2016, Cohen et al. 2016, Javanmard and Nazerzadeh 2016) as well as customer-specific features (Ban and Keskin 2017). We adopt the linear demand model proposed by Ban and Keskin (2017), which allows for feature-dependent heterogeneous price elasticities.
We note that the existing dynamic pricing literature largely focuses on the single-product setting. A few papers consider performing price experiments jointly on a set of products with overlapping inventory constraints, or with substitutable demand (Keskin and Zeevi 2014, Agrawal and Devanur 2014, Ferreira et al. 2018). However, in these papers, price experimentation is still performed independently per product, and any learned parameter knowledge is not shared across products to inform future learning. In contrast, we propose a meta dynamic pricing algorithm that learns the distribution of unknown parameters of the demand function across products.
Our learning strategy is based on Thompson sampling, which is widely considered to be state-of-the-art for balancing the exploration-exploitation tradeoff (Thompson 1933). Several papers have studied the sensitivity of Thompson sampling to prior misspecification. For example, Honda and Takemura (2014) show that Thompson sampling still achieves the optimal theoretical guarantee with an incorrect but uninformative prior, but can fail to converge if the prior is not sufficiently conservative. Liu and Li (2015) provide further support for this finding by showing that the performance of Thompson sampling for any given problem instance depends on the probability mass (under the provided prior) placed on the underlying parameter; thus, one may expect that Thompson sampling with a more conservative prior (i.e., one that places non-trivial probability mass on a wider range of parameters) is more likely to converge when the true prior is unknown. It is worth noting that Agrawal and Goyal (2013) and Bubeck and Liu (2013) propose a prior-independent form of Thompson sampling, which is guaranteed to converge to the optimal policy even when the prior is unknown by conservatively increasing the variance of the posterior over time. However, the use of a more conservative prior creates a significant cost in empirical performance (Chapelle and Li 2011). For instance, Bastani et al. (2017) empirically find through simulations that the conservative prior-independent Thompson sampling is significantly outperformed by vanilla Thompson sampling even when the prior is misspecified. We empirically find, through experiments on synthetic and real datasets, that learning and leveraging the prior can yield much better performance compared to a prior-independent approach. As such, the choice of prior remains an important design choice in the implementation of Thompson sampling (Russo et al. 2018).
We propose a meta-learning algorithm that learns the prior across pricing experiments on related products to attain better performance. We also empirically demonstrate that a naive approach of greedily using the updated prior performs poorly, since it may cause Thompson sampling to fail to converge to the optimal policy for some products. Instead, our algorithm gracefully tunes the width of the estimated prior as a function of the uncertainty in the estimate over time.
We highlight our main contributions below:

Model: We formulate our problem as a sequence of $N$ different dynamic pricing problems, each with horizon $T$. Importantly, the unknown parameters of the demand function for each product are drawn i.i.d. from a shared (unknown) multivariate Gaussian prior.

Algorithm: We propose two meta-learning pricing policies, MetaDP and MetaDP++. The former learns only the mean of the prior, while the latter learns both the mean and the covariance of the prior across products. Both algorithms address two challenges: (i) balancing the need to learn the prior (meta-exploration) with the need to leverage the current estimate of the prior to achieve good performance (meta-exploitation), and (ii) accounting for uncertainty in the estimated prior by conservatively widening the prior as a function of its estimation error (as opposed to directly using the estimated prior, which may cause Thompson sampling to fail on some products).

Theory: Unlike standard approaches, our algorithm can leverage shared structure across products to achieve regret that scales sublinearly in the number of products $N$. In particular, we prove upper bounds on the meta regret of both MetaDP and MetaDP++.

Numerical Experiments: We demonstrate on both synthetic and real auto loan data that our approach significantly speeds up learning compared to ignoring shared structure (i.e., using prior-independent Thompson sampling) or greedily using the updated prior across products.
Throughout the paper, all vectors are column vectors by default. We define $[n]$ to be the set $\{1, 2, \ldots, n\}$ for any positive integer $n$. We use $\|x\|_p$ to denote the $\ell_p$ norm of a vector $x$. For a positive definite matrix $A$ and a vector $x$, let $\|x\|_A$ denote the matrix norm $\sqrt{x^\top A x}$. We also denote $x \vee y$ and $x \wedge y$ as the maximum and minimum between $x$ and $y$ respectively. When logarithmic factors are omitted, we use tilde notation such as $\widetilde{O}(\cdot)$ to denote function growth.
We first describe the classical dynamic pricing formulation for a single product; we then formalize our meta-learning formulation over a sequence of products.
Consider a seller who offers a single product over a selling horizon of $T$ periods. The seller can dynamically adjust the offered price in each period. At the beginning of each period $t$, the seller observes a random feature vector (capturing exogenous and/or customer-specific features) that is independently and identically distributed from an unknown distribution. Upon observing the feature vector, the seller chooses a price for that period. The seller then observes the resulting demand, which is a noisy function of both the observed feature vector and the chosen price. The seller’s revenue in each period is given by the chosen price multiplied by the corresponding realized demand. The goal in this setting is to develop a policy that maximizes the seller’s cumulative revenue by balancing exploration (learning the demand function) with exploitation (offering the estimated revenue-maximizing price).
We consider a seller who sequentially offers $N$ related products, each with a selling horizon of $T$ periods. For simplicity, a new product is not introduced until the life cycle of the previous product ends. We call each product’s life cycle an epoch, i.e., there are $N$ epochs that last $T$ periods each. Each product (and corresponding epoch) is associated with a different (unknown) demand function, and constitutes a different instance of the classical dynamic pricing problem described above. We now formalize the problem.
In epoch $i \in [N]$ at time $t \in [T]$, the seller observes a random feature vector $x_{i,t} \in \mathbb{R}^d$, which is independently and identically distributed from an unknown distribution (note that the distribution may vary across products/epochs). She then chooses a price $p_{i,t}$ for that period. Based on practical constraints, we will assume that the allowable price range is bounded across periods and products, i.e., $p_{i,t} \in [p_{\min}, p_{\max}]$ and $0 < p_{\min} < p_{\max} < \infty$. The seller then observes the resulting induced demand
$$D_{i,t}(p_{i,t}, x_{i,t}) = \langle \alpha_i, x_{i,t} \rangle + p_{i,t} \langle \beta_i, x_{i,t} \rangle + \varepsilon_{i,t},$$
where $\alpha_i \in \mathbb{R}^d$ and $\beta_i \in \mathbb{R}^d$ are unknown fixed constants throughout epoch $i$, and $\varepsilon_{i,t}$ is zero-mean $\sigma$-subgaussian noise (see Definition 1 below). This demand model was recently proposed by Ban and Keskin (2017), and captures several salient aspects. In particular, the observed feature vector $x_{i,t}$ in period $t$ determines both the baseline demand (through the parameter $\alpha_i$) and the price-elasticity of the demand (through the parameter $\beta_i$) of product $i$.
Definition 1
A random variable $z \in \mathbb{R}$ is $\sigma$-subgaussian if $\mathbb{E}[e^{tz}] \le e^{\sigma^2 t^2 / 2}$ for every $t \in \mathbb{R}$.
This definition implies $\mathrm{Var}[z] \le \sigma^2$. Many classical distributions are subgaussian; typical examples include any bounded, centered distribution, or the normal distribution. Note that the errors need not be identically distributed.
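As a small numerical illustration of the "bounded, centered" example, the sketch below checks the subgaussian condition for a Rademacher variable (uniform on $\{-1, +1\}$). It assumes the standard moment-generating-function form of the condition, $\mathbb{E}[e^{tz}] \le e^{\sigma^2 t^2/2}$; the function names are illustrative and not part of the paper.

```python
import math

def mgf_rademacher(t):
    """E[exp(t*z)] for z uniform on {-1, +1}; this equals cosh(t)."""
    return 0.5 * (math.exp(t) + math.exp(-t))

def subgaussian_bound(t, sigma):
    """Right-hand side exp(sigma^2 * t^2 / 2) of the assumed condition."""
    return math.exp(0.5 * sigma ** 2 * t ** 2)

# Check the inequality on a grid of t values in [-5, 5] with sigma = 1.
grid = [k / 10.0 for k in range(-50, 51)]
holds_sigma_1 = all(
    mgf_rademacher(t) <= subgaussian_bound(t, 1.0) + 1e-12 for t in grid
)

# A smaller sigma is too optimistic: the bound already fails at t = 1.
fails_sigma_half = mgf_rademacher(1.0) > subgaussian_bound(1.0, 0.5)
print(holds_sigma_1, fails_sigma_half)
```

A grid check of this kind is only a sanity test, not a proof; the inequality $\cosh(t) \le e^{t^2/2}$ itself holds for all real $t$.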
For ease of notation, we define $\theta_i = [\alpha_i; \beta_i] \in \mathbb{R}^{2d}$; following the classical formulation of dynamic pricing, $\theta_i$ is the unknown parameter vector that must be learned within a given epoch in order for the seller to maximize her revenues over $T$ periods. When there is no shared structure between the $\theta_i$, our problem reduces to $N$ independent dynamic pricing problems.
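The demand model can be simulated in a few lines. The sketch below assumes the linear form described in the text (baseline demand $\langle \alpha, x \rangle$ plus price times $\langle \beta, x \rangle$ plus noise) and uses Gaussian noise as one convenient subgaussian choice; all numerical values and function names are hypothetical.

```python
import random

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def expected_demand(alpha, beta, x, price):
    """Mean demand: <alpha, x> + price * <beta, x>."""
    return dot(alpha, x) + price * dot(beta, x)

def realized_demand(alpha, beta, x, price, noise_sd, rng):
    """Mean demand plus zero-mean Gaussian noise (an assumed noise model)."""
    return expected_demand(alpha, beta, x, price) + rng.gauss(0.0, noise_sd)

# Hypothetical product with d = 2 features; all numbers are illustrative.
alpha = [2.0, 0.5]     # baseline-demand coefficients
beta = [-0.6, -0.1]    # price-elasticity coefficients
theta = alpha + beta   # stacked parameter vector of length 2d
x = [1.0, 0.3]         # observed feature vector
price = 1.5

rng = random.Random(0)
mean_demand = expected_demand(alpha, beta, x, price)
demand = realized_demand(alpha, beta, x, price, noise_sd=0.1, rng=rng)
revenue = price * demand  # revenue = chosen price times realized demand
print(round(mean_demand, 3))
```

Here the mean demand is $2.15 - 1.5 \times 0.63 = 1.205$; the realized demand and revenue fluctuate around it with the noise draw.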
However, we may expect that related products share a similar potential market, and thus may have some shared structure that can be learned across products. We model this relationship by positing that the product demand parameter vectors are independent and identically distributed draws from a common unknown distribution, i.e., $\theta_i \sim \mathcal{N}(\theta_*, \Sigma_*)$ for each $i \in [N]$. As discussed earlier, knowledge of the distribution over the unknown demand parameters can inform the prior for Thompson sampling, thereby avoiding the need to use a conservative prior that can result in poor empirical performance (Honda and Takemura 2014, Liu and Li 2015). The mean $\theta_*$ of the shared distribution is unknown; we will consider settings where the covariance $\Sigma_*$ of this distribution is known and unknown. We propose using meta-learning to learn this distribution from past epochs to inform and improve the current product’s pricing strategy.
Remark 1
Following the literature on Thompson sampling, we consider a multivariate Gaussian distribution since the posterior has a simple closed form, thereby admitting a tractable theoretical analysis. When implementing such an algorithm in practice, more complex distributions can be considered (e.g., see discussion in Russo et al. 2018).
In this subsection, we consider the setting where the true prior over the unknown product demand parameters is known. This setting will inform our definition of the meta oracle and meta regret in the next subsection. When the prior is known, a natural candidate policy for minimizing Bayes regret is the Thompson sampling algorithm (Thompson 1933). The Thompson sampling algorithm adapted to our dynamic pricing setting for a single epoch is formally given in Algorithm 1 below. Since the prior is known, there is no additional shared structure to exploit across products, so we can treat each epoch independently.
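The algorithm begins with the true prior, and performs a single initialization period ($t = 1$). For each time $t \ge 2$, the Thompson sampling algorithm (1) samples the unknown product demand parameters $\tilde{\theta}_{i,t} = [\tilde{\alpha}_{i,t}; \tilde{\beta}_{i,t}]$ from the current posterior, and (2) solves and offers the resulting optimal price based on the demand function given by the sampled parameters
$$p_{i,t} = \arg\max_{p \in [p_{\min}, p_{\max}]} \; p \langle \tilde{\alpha}_{i,t}, x_{i,t} \rangle + p^2 \langle \tilde{\beta}_{i,t}, x_{i,t} \rangle. \tag{1}$$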
Upon observing the actual realized demand $D_{i,t}$, the algorithm computes the posterior for round $t + 1$. The same algorithm is applied independently to each epoch $i \in [N]$.
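The sample-then-optimize loop described above can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: Gaussian noise with known variance (so the posterior update is the standard conjugate one), an intercept plus uniform features, and an initialization price of `p_min`. All function names are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            # Floor the pivot to guard against round-off in tiny variances.
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    return L

def sample_gaussian(mean, cov, rng):
    """Draw one sample from N(mean, cov)."""
    L = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + dot(row, z) for m, row in zip(mean, L)]

def posterior_update(mean, cov, m, y, noise_var):
    """Gaussian posterior after observing y = <m, theta> + N(0, noise_var)."""
    cm = [dot(row, m) for row in cov]           # cov @ m (cov is symmetric)
    denom = noise_var + dot(m, cm)
    gain = (y - dot(m, mean)) / denom
    new_mean = [mu + c * gain for mu, c in zip(mean, cm)]
    new_cov = [[cov[r][c] - cm[r] * cm[c] / denom for c in range(len(m))]
               for r in range(len(m))]
    return new_mean, new_cov

def best_price(a, b, p_min, p_max):
    """Maximize p*a + p^2*b over [p_min, p_max]."""
    candidates = [p_min, p_max]
    if b < 0:  # concave revenue: also try the clipped stationary point
        candidates.append(min(max(-a / (2.0 * b), p_min), p_max))
    return max(candidates, key=lambda p: p * a + p * p * b)

def thompson_epoch(theta, prior_mean, prior_cov, T, p_min, p_max, noise_sd, rng):
    """Run one epoch; theta = alpha + beta is the product's true parameter."""
    d = len(theta) // 2
    mean, cov = list(prior_mean), [row[:] for row in prior_cov]
    prices = []
    for t in range(1, T + 1):
        x = [1.0] + [rng.uniform(-1.0, 1.0) for _ in range(d - 1)]
        if t == 1:
            price = p_min  # initialization period; this price is an assumption
        else:
            s = sample_gaussian(mean, cov, rng)
            price = best_price(dot(s[:d], x), dot(s[d:], x), p_min, p_max)
        m = x + [price * xi for xi in x]        # regressor (x, p*x)
        demand = dot(theta, m) + rng.gauss(0.0, noise_sd)
        mean, cov = posterior_update(mean, cov, m, demand, noise_sd ** 2)
        prices.append(price)
    return prices, mean, cov

rng = random.Random(1)
theta = [2.0, 0.3, -0.5, -0.1]
prior_mean = [2.0, 0.0, -0.5, 0.0]
prior_cov = [[0.25 if r == c else 0.0 for c in range(4)] for r in range(4)]
prices, post_mean, post_cov = thompson_epoch(
    theta, prior_mean, prior_cov, T=200, p_min=1.0, p_max=4.0, noise_sd=0.1, rng=rng)
```

The rank-one covariance update avoids a matrix inverse, which keeps the sketch dependency-free; a production implementation would more likely rely on a linear algebra library.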
As evidenced by the large literature on the practical success of Thompson sampling (Chapelle and Li 2011, Russo and Van Roy 2014, Ferreira et al. 2018), Algorithm 1 is a very attractive choice for implementation in practice.
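It is worth noting that Algorithm 1 attains a strong performance guarantee under the classical formulation compared to the classical oracle that knows all product demand parameters in advance. In particular, this oracle would offer the expected optimal price in each period $t$ in epoch $i$, i.e.,
$$p^*_{i,t} = \arg\max_{p \in [p_{\min}, p_{\max}]} \; p \langle \alpha_i, x_{i,t} \rangle + p^2 \langle \beta_i, x_{i,t} \rangle,$$
where $\alpha_i$ and $\beta_i$ are the baseline-demand and price-elasticity parameters of product $i$, and $x_{i,t}$ is the observed feature vector.
Writing $D_{i,t}(p)$ for the demand induced by price $p$ in period $t$ of epoch $i$, the resulting Bayes regret (Russo and Van Roy 2014) of a given policy $\pi$ (offering prices $p^{\pi}_{i,t}$) relative to the oracle is defined as:
$$\text{BayesRegret}_{N,T}(\pi) = \mathbb{E}\left[ \sum_{i=1}^{N} \sum_{t=1}^{T} \left( p^*_{i,t} D_{i,t}(p^*_{i,t}) - p^{\pi}_{i,t} D_{i,t}(p^{\pi}_{i,t}) \right) \right], \tag{2}$$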
where the expectation is taken with respect to the unknown product demand parameters, the observed random feature vectors, and the noise in the realized demand. The following theorem bounds the Bayes regret of the Thompson sampling dynamic pricing algorithm:
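As a worked illustration of the oracle's per-period problem, consider the linear demand model with the shorthand $a$ and $b$ below (this shorthand and the symbol $p^{\circ}$ are introduced here for convenience and are not part of the original notation). Expected revenue is a quadratic in the price, so when $b < 0$ the oracle price is the stationary point projected onto the allowable range, and the expected revenue lost by any other price is quadratic in its distance from the stationary point:

$$
\begin{aligned}
r(p) &= p\,(a + b\,p), \qquad a = \langle \alpha_i, x_{i,t} \rangle, \quad b = \langle \beta_i, x_{i,t} \rangle, \\
r'(p) &= a + 2bp = 0 \;\Longrightarrow\; p^{\circ} = -\frac{a}{2b}, \\
p^{*}_{i,t} &= \min\{\max\{p^{\circ}, p_{\min}\}, p_{\max}\} \quad \text{when } b < 0, \\
r(p^{\circ}) - r(p) &= -b\,(p - p^{\circ})^2 \;\ge\; 0 \quad \text{when } b < 0.
\end{aligned}
$$

When $b \ge 0$ the revenue is not strictly concave in $p$, and the maximum over the price range is attained at one of the endpoints.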
Theorem 1
The Bayes regret of Algorithm 1 satisfies
when the prior over the product demand parameters is known.
The proof of Theorem 1 follows from a similar argument used for the linear bandit setting presented in Russo and Van Roy (2014), coupled with standard concentration bounds for multivariate normal distributions. The proof is given in the appendix for completeness. Note that the regret scales linearly in $N$, since each epoch is an independent learning problem.
However, we cannot directly implement Algorithm 1 in our setting, since the prior over the product demand parameters is unknown. In this paper, we seek to learn the prior (shared structure) across products in order to leverage the superior performance of Thompson sampling with a known prior. Thus, a natural question to ask is:
What is the price of not knowing the prior in advance?
To answer this question, we first define our performance metric. Since our goal is to converge to the policy given in Algorithm 1 (which knows the true prior), we define this policy as our meta oracle (we use the term meta oracle to distinguish it from the oracle in the classical formulation). Comparing the revenue of our policy relative to the meta oracle leads naturally to the definition of meta regret for a policy $\pi$, i.e., the expected cumulative revenue of the meta oracle minus the expected cumulative revenue of $\pi$,
where the expectation is taken with respect to the unknown product demand parameters, the observed random feature vectors, and the noise in the realized demand.
Our goal is to design a policy with meta regret that grows sublinearly in $N$ and no faster in $T$ than the Bayes regret of Algorithm 1. In particular, recall that Theorem 1 bounds the Bayes regret of Thompson sampling with a known prior by a quantity that scales linearly in $N$. Thus, if our meta regret (i.e., the performance of our meta-learning policy relative to Algorithm 1) grows sublinearly in $N$ (and no faster in $T$ than that bound), it would imply that the price of not knowing the prior in advance is negligible in experiment-rich environments (i.e., as the number of products $N$ grows large) compared to the cost of learning the actual demand parameters for each product (i.e., the Bayes regret of Algorithm 1).
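To make the "negligible price" argument concrete, consider a purely hypothetical scaling: suppose the meta regret is at most $C N^{\gamma} g(T)$ for some exponent $\gamma < 1$, while the Bayes regret of Algorithm 1 is at least $c N g(T)$. The constants $C, c > 0$, the exponent $\gamma$, and the function $g$ are illustrative assumptions and not results from this paper. Then the cost of not knowing the prior, relative to the cost of learning the demand parameters, vanishes as the number of products grows:

$$
\frac{\text{meta regret}}{\text{Bayes regret of Algorithm 1}} \;\le\; \frac{C N^{\gamma} g(T)}{c\, N\, g(T)} \;=\; \frac{C}{c}\, N^{\gamma - 1} \;\longrightarrow\; 0 \quad \text{as } N \to \infty.
$$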
We restrict ourselves to the family of non-anticipating policies $\Pi$: each $\pi \in \Pi$ is a sequence of random functions $\{\pi_{i,t}\}$ that depend only on price and demand observations collected until time $t$ in epoch $i$ (including all times from prior epochs), and feature vector observations up to time $t + 1$ in epoch $i$. In particular, let $\mathcal{H}_{i,t}$ denote the history of prices and corresponding demand realizations from prior epochs and time periods, as well as the observed feature vectors up to the next time period; let $\mathcal{F}_{i,t}$ denote the $\sigma$-field generated by $\mathcal{H}_{i,t}$. Then, we impose that $p_{i,t+1}$ is $\mathcal{F}_{i,t}$-measurable.
The values of the prior mean $\theta_*$ as well as the actual product demand parameter vectors $\theta_i$ are unknown; we consider two settings: known and unknown $\Sigma_*$ (covariance of the prior).
We now describe some mild assumptions on the parameters of the problem for our regret analysis.
Assumption 1 (Boundedness)
The support of the features is bounded, i.e., $\|x_{i,t}\| \le x_{\max}$ for all $i \in [N]$ and $t \in [T]$.
Furthermore, there exists a positive constant $S$ such that $\|\theta_*\| \le S$.
Our first assumption is that the observed feature vectors as well as the mean of the product demand parameters are bounded. This is a standard assumption made in the bandit and dynamic pricing literature, ensuring that the average regret at any time step is bounded. This is likely satisfied since features and outcomes are typically bounded in practice.
Assumption 2 (Positive-Definite Feature Covariance)
The minimum eigenvalue of the feature covariance matrix in every epoch is lower bounded by some positive constant $\lambda_0$, i.e., $\lambda_{\min}\left(\mathbb{E}[x_{i,t} x_{i,t}^\top]\right) \ge \lambda_0 > 0$ for all $i \in [N]$.
Our second assumption imposes that the covariance matrix of the observed feature vectors in every epoch is positive-definite. This is a standard assumption for the convergence of OLS estimators; in particular, our demand model is linear, and therefore requires that no features are perfectly collinear in order to identify each product’s true demand parameters.
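The role of this assumption can be seen numerically. The sketch below estimates the smallest eigenvalue of the empirical matrix $\frac{1}{n}\sum_j x_j x_j^\top$ for two hypothetical two-dimensional feature distributions, one well-conditioned and one perfectly collinear. Treating the "feature covariance matrix" as this second-moment matrix is an assumption of the sketch, as are the function names and distributions.

```python
import math
import random

def second_moment(samples):
    """Entries (a, b, c) of the empirical matrix [[a, b], [b, c]] = mean of x x^T."""
    n = len(samples)
    a = sum(x[0] * x[0] for x in samples) / n
    b = sum(x[0] * x[1] for x in samples) / n
    c = sum(x[1] * x[1] for x in samples) / n
    return a, b, c

def min_eig_2x2(a, b, c):
    """Smallest eigenvalue of the symmetric matrix [[a, b], [b, c]]."""
    return 0.5 * (a + c) - math.sqrt(0.25 * (a - c) ** 2 + b * b)

rng = random.Random(0)
# Intercept plus an independent uniform feature: no collinearity.
good = [[1.0, rng.uniform(-1.0, 1.0)] for _ in range(5000)]
# Second feature is an exact multiple of the first: perfectly collinear.
bad = [[u, 2.0 * u] for u in (rng.uniform(-1.0, 1.0) for _ in range(5000))]

lam_good = min_eig_2x2(*second_moment(good))
lam_bad = min_eig_2x2(*second_moment(bad))
print(lam_good > 0.25, abs(lam_bad) < 1e-9)
```

In the first case the smallest eigenvalue is close to $1/3$ (the second moment of a uniform variable on $[-1, 1]$), so the demand parameters are identifiable; in the collinear case it is zero up to round-off, and no amount of data separates the two coefficients.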
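Assumption 3 (Positive-Definite Prior Covariance)
The maximum and minimum eigenvalues of $\Sigma_*$ are upper and lower bounded by positive constants $\overline{\lambda}$ and $\underline{\lambda}$ respectively, i.e., $\lambda_{\max}(\Sigma_*) \le \overline{\lambda}$ and $\lambda_{\min}(\Sigma_*) \ge \underline{\lambda} > 0$.
We further assume that the trace of $\Sigma_*$ is upper bounded by a positive constant.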
Our final assumption imposes that the covariance matrix of the random product demand parameter is also positive-definite. Again, this assumption ensures that each product’s true demand parameter is identifiable using standard OLS estimators.
We begin with the case where the prior’s covariance matrix $\Sigma_*$ is known, and describe the Meta Dynamic Pricing (MetaDP) algorithm for this setting. We will consider the case of unknown $\Sigma_*$ in the next section.
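Throughout the rest of the paper, we use $m_{i,t} = [x_{i,t};\, p_{i,t} x_{i,t}]$ to denote the price and feature information of round $t$ in epoch $i$ for all $i \in [N]$ and $t \in [T]$. We also define the following quantities for each epoch $i$:
$$M_i = [m_{1,1}, \ldots, m_{i,1}]^\top, \qquad D_i = [D_{1,1}, \ldots, D_{i,1}]^\top. \tag{3}$$
Here, $M_i$ is the price and feature design matrix, and $D_i$ is the corresponding vector of realized demands from all initialization steps ($t = 1$) in epochs $1, \ldots, i$.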
The MetaDP algorithm begins by using $N_0$ initial product epochs as an exploration phase to initialize our estimate of the prior mean $\theta_*$. These exploration epochs use the prior-independent UCB algorithm to control the meta regret accrued in each such epoch. After this initial exploration period, our algorithm leverages the estimated prior within each subsequent epoch, and continues to sequentially update the estimated prior after each epoch. The key challenge is that the estimated prior has finite-sample estimation error, and can thus result in poor performance within a given epoch. At the same time, we can no longer employ a prior-independent approach, since this will cause our meta regret to grow linearly in $N$. Our algorithm addresses this challenge by carefully widening the covariance of the prior (beyond the known covariance $\Sigma_*$) within each epoch by a term that scales as the expected error of the estimated $\theta_*$. This correction approaches zero as the epoch index grows large, ensuring that our meta regret grows sublinearly in $N$.
The MetaDP algorithm is presented in Algorithm 2. We now describe the algorithm in detail.
The first $N_0$ epochs are treated as exploration epochs, where we define
(4) 
and the constants are given by
As described in the overview, the MetaDP algorithm proceeds in two phases. In particular, we distinguish the following two cases for all $t \ge 2$ (similar to Algorithm 1, the first period of each epoch is reserved for initialization):

Epoch $i \le N_0$: the MetaDP algorithm runs the prior-independent UCB algorithm proposed by Abbasi-Yadkori et al. (2011) for the rest of the epoch. In particular, for each $t \ge 2$, we construct the UCB estimate using the regularized least squares estimator on the price and feature data, and the corresponding demands observed so far (Eq. (5)). The MetaDP algorithm then offers the price with the largest upper confidence bound (Eq. (6)), and observes the realized demand $D_{i,t}$.

Epoch $i > N_0$: the MetaDP algorithm utilizes the data collected from the initialization step of all past epochs and the current epoch to compute the estimated mean of the prior $\hat{\theta}_i$. We use the ordinary least squares estimator, i.e.,
$$\hat{\theta}_i = \Big( \sum_{j=1}^{i} m_{j,1} m_{j,1}^\top \Big)^{-1} \sum_{j=1}^{i} m_{j,1} D_{j,1}, \tag{7}$$
where $m_{j,1}$ and $D_{j,1}$ denote the price and feature vector and the realized demand from the initialization step of epoch $j$. However, as noted earlier, using the estimated prior directly can cause Thompson sampling to fail due to finite-sample estimation error. Thus, we widen the prior by increasing the covariance beyond $\Sigma_*$. In particular, we set the prior as specified in Eqs. (8) and (9). Note that the extent of prior widening approaches zero for later epochs (i.e., $i$ large), when we expect the estimation error of the prior mean to be small.
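Next, the MetaDP algorithm follows the TS algorithm armed with the widened prior. In particular, for each $t \ge 2$, the algorithm (1) samples the unknown product demand parameters $\tilde{\theta}_{i,t} = [\tilde{\alpha}_{i,t}; \tilde{\beta}_{i,t}]$ from the posterior, and (2) solves and offers the resulting optimal price based on the demand function given by the sampled parameters
$$p_{i,t} = \arg\max_{p \in [p_{\min}, p_{\max}]} \; p \langle \tilde{\alpha}_{i,t}, x_{i,t} \rangle + p^2 \langle \tilde{\beta}_{i,t}, x_{i,t} \rangle. \tag{10}$$
Upon observing the actual realized demand $D_{i,t}$, it computes the posterior for round $t + 1$.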
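The two ingredients of the non-exploration epochs, pooling initialization-step data to estimate the prior mean and then widening the prior, can be sketched as below. The setup is deliberately small (a single constant feature, so each parameter vector has two entries) and entirely hypothetical. In particular, the widening schedule `1 + c / sqrt(epoch)` is an illustrative placeholder and should not be read as the expression used in the paper's Eqs. (8) and (9).

```python
import math
import random

def solve_2x2(A, b):
    """Solve the 2x2 linear system A z = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def estimate_prior_mean(init_data):
    """OLS on initialization-step data [(m, demand), ...] with m = (x, p*x)."""
    A = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for m, y in init_data:
        for r in range(2):
            b[r] += m[r] * y
            for c in range(2):
                A[r][c] += m[r] * m[c]
    return solve_2x2(A, b)

def widened_cov(cov, epoch, c=1.0):
    """Inflate the prior covariance; the schedule is an assumed placeholder."""
    scale = 1.0 + c / math.sqrt(epoch)
    return [[scale * v for v in row] for row in cov]

rng = random.Random(0)
theta_star = [2.0, -0.5]          # hypothetical prior mean (alpha, beta)
prior_sd = [0.2, 0.1]             # hypothetical diagonal prior std. deviations
p_min, p_max, noise_sd = 1.0, 3.0, 0.1

init_data = []
for i in range(1, 2001):
    # Each product's parameters are an i.i.d. draw from the shared prior.
    theta_i = [rng.gauss(mu, sd) for mu, sd in zip(theta_star, prior_sd)]
    # Initialization prices alternate between the two ends of the price range.
    price = p_min if i % 2 == 1 else p_max
    x = 1.0
    demand = theta_i[0] * x + price * theta_i[1] * x + rng.gauss(0.0, noise_sd)
    init_data.append(([x, price * x], demand))

theta_hat = estimate_prior_mean(init_data)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_hat, theta_star)))
sigma_star = [[0.04, 0.0], [0.0, 0.01]]
early, late = widened_cov(sigma_star, 4), widened_cov(sigma_star, 10000)
print(err < 0.1, early[0][0], late[0][0])
```

Each initialization-step demand is the prior-mean demand plus heteroscedastic noise (the product's deviation from the prior mean plus the demand noise), which is why a single observation per epoch suffices to estimate the prior mean as the number of epochs grows.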
We now prove an upper bound on the meta regret of the MetaDP algorithm.
We begin by noting that the prior-independent UCB algorithm employed in the exploration epochs satisfies a meta regret guarantee:
Lemma 1
The meta regret of the UCB algorithm in a single epoch is
The proof of this result is essentially the same as that of Theorem 1, and is thus omitted. Lemma 1 ensures that we do not accrue much regret in the exploration epochs as long as $N_0$ is small. From Eq. (4), we know that $N_0$ grows merely polylogarithmically in $N$ and $T$.
Next, after the exploration epochs conclude, we begin using the estimated prior mean. The following theorem bounds the error of this estimate with high probability:
Theorem 2
For any fixed with probability at least the distance between and is upper bounded as
where and are constants that depends only on and
Proof Sketch. The complete proof is provided in the appendix. Let $m_{i,1}$ be the initial feature and price vector of the first round of each epoch $i$. Then, for an epoch $i$, the initial demand realization satisfies
where Note that is an independent random variable across different epochs, since the feature vectors are drawn i.i.d. from , and the prices alternate between and by construction. Thus, we can equivalently view the demand realization as the mean demand corrupted by the price dependent (or heteroscedastic) noise It can be verified that is subgaussian with
Next, applying Lemma A.1 of Zhu and Modiano (2018), we can bound the difference between and under the matrix norm with high probability, i.e.,
Thus, it suffices to lower bound the smallest eigenvalue of To this end, we employ matrix Chernoff bounds by Tropp (2011). First, we show that there exists a positive constant (that depends only on and ) such that the minimum eigenvalue of the expectation is lower bounded by , i.e.,
We apply the matrix Chernoff inequality (Tropp 2011) to provide a high probability lower bound on the minimum eigenvalue of the random matrix , i.e.,
Finally, by a simple union bound, we conclude the proof. Q.E.D.
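The estimation argument above can be illustrated numerically. The sketch below is an assumption-laden toy, not the paper's estimator verbatim: it assumes each epoch's parameter is drawn from a Gaussian prior, that the first-round regressor stacks a feature vector with its price-scaled copy, and that prices alternate between two fixed values. It regresses the first-round demands on these regressors by ordinary least squares and reports the smallest eigenvalue of the Gram matrix, the quantity the matrix Chernoff step lower bounds. All names and default values are hypothetical.

```python
import numpy as np

def estimate_prior_mean(n_epochs, theta_star, prior_cov,
                        p_lo=1.0, p_hi=2.0, noise_sd=0.1, seed=0):
    """OLS estimate of the prior mean from one observation per epoch.
    ASSUMED regressor: m = [x; price * x] with x = [1, uniform features]."""
    rng = np.random.default_rng(seed)
    theta_star = np.asarray(theta_star, dtype=float)
    d = len(theta_star) // 2                       # feature dimension
    rows, demands = [], []
    for j in range(n_epochs):
        # epoch-specific parameter drawn from the shared prior
        theta_j = rng.multivariate_normal(theta_star, prior_cov)
        x = np.concatenate(([1.0], rng.uniform(0.0, 1.0, size=d - 1)))
        price = p_lo if j % 2 == 0 else p_hi       # alternate the two prices
        m = np.concatenate((x, price * x))
        rows.append(m)
        # demand = mean demand + prior deviation (heteroscedastic) + noise
        demands.append(m @ theta_j + rng.normal(0.0, noise_sd))
    M = np.vstack(rows)
    D = np.array(demands)
    theta_hat, *_ = np.linalg.lstsq(M, D, rcond=None)
    lam_min = np.linalg.eigvalsh(M.T @ M)[0]       # smallest Gram eigenvalue
    return theta_hat, lam_min
```

In this toy setup the smallest eigenvalue grows with the number of epochs, so the estimation error shrinks accordingly.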
We now state our main result upper bounding the meta regret of the MetaDP algorithm.
Theorem 3
If the number of products is at least then the meta regret of the proposed MetaDP algorithm is upper bounded as
Proof Sketch. The complete proof is provided in the appendix.
We begin by defining some helpful notation. First, let be the expected revenue obtained by running the Thompson sampling algorithm in Algorithm 1 with the (possibly incorrect) prior after initialization in an epoch whose true parameter is Second, let be the maximum expected revenue that can be obtained from an epoch parametrized by after initialization. We also define the clean event over all non-exploration epochs:
When holds, our estimate of the prior mean has bounded error from the true prior mean in all non-exploration epochs. Theorem 2 implies that holds with probability at least Note that the meta regret over non-exploration epochs is trivially bounded by Then, the cumulative contribution to the expected meta regret when the clean event is violated is We then proceed to analyze the regret of each epoch conditioned on the clean event. For an epoch the expected meta regret of this epoch can be written as
Now, from Section 3 of Russo and Van Roy (2014), we upper bound the first term as
where is the Radon-Nikodym derivative of with respect to and is the essential supremum magnitude with respect to Therefore,
and, by applying Theorem 1, the total meta regret can be upper bounded as
(11) 
The first term in (11) is simply the regret accrued by UCB in the first exploration epochs. Applying Lemma 1 and the definition of from Eq. (4), we can bound this term as
For the second term in (11), we use the definition of the multivariate normal to compute
(12) 
Since we have assumed that is positive definite, it follows that is positive definite as well. Recalling that note that
Furthermore, since we have conditioned on the clean event , Eq. (12) does not exceed
Finally, using the identity that for all we can simplify
Therefore, the second term in (11) can be bounded as
By definition of in Eq. (4), we can write Using the identity that for any it follows that
Combining the above results yields the result. Q.E.D.
Remark 2
Note that if we are in the regime where prescribed by Eq. (4), then the decision-maker can choose to instead be
and set for any choice of , without affecting the theoretical guarantee stated in Theorem 3. In other words, we can trade off the number of exploration epochs with the extent of prior widening in non-exploration epochs.
We now pause to comment on the necessity of our prior widening technique. An immediate and tempting alternative to the MetaDP algorithm is the following “greedy” algorithm: it is identical to the MetaDP algorithm, but in each non-exploration epoch, the greedy approach uses the updated prior directly without any prior widening, i.e., setting for all in Algorithm 2. In other words, after the initial exploration epochs, the algorithm greedily applies Thompson sampling with the current estimated prior (which is updated at the end of every epoch) in each subsequent epoch.
However, the estimated prior naturally has finite-sample estimation error. Empirical evidence from Lattimore and Szepesvári (2018) shows that even a small misspecification in the prior can lead to significant performance degradation of the Thompson sampling algorithm. This raises the concern that the simple greedy approach may fail to perform well in some epochs due to estimation error. In Section 3, we compare the performance of the greedy approach described above to our proposed approach on a range of numerical experiments on both synthetic and real auto loan data. We consistently find that our proposed approach performs better, suggesting that prior widening is in fact necessary. In what follows, we provide intuition from our theoretical analysis on why the greedy approach may fail, and explain how prior widening helps overcome this challenge.
Consider inequality (11) in the proof sketch of Theorem 3. When applied to the greedy approach, the upper bound for the meta regret becomes
(13) 
Following the same steps as in Eq. (12), we can write
(14) 
Suppose we take to be the form for some then Eq. (14) becomes
Note that is positive definite, so the quadratic form is positive as long as i.e., there exists any estimation error in . It is thus easy to verify that as as well. This suggests that for some realizations of , the Thompson sampling algorithm with the greedy prior estimate can fail to converge and achieve worst-case performance. In contrast, by widening the prior, we ensure that this term is always bounded above (see Eq. (12)), thereby ensuring convergence within every epoch. The MetaDP algorithm provides an exact prior correction path over time to ensure low meta regret in every non-exploration epoch.
We note that the above argument simply indicates that the same analysis of the MetaDP algorithm cannot be applied to the greedy approach; we do not prove that the greedy approach fails. However, the empirical evidence from Lattimore and Szepesvári (2018) and our numerical experiments in Section 3 together suggest that the greedy approach in fact performs poorly.
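A one-dimensional example makes the contrast between the greedy and widened priors concrete. The sketch below is our own illustration, on the reading that the relevant quantity is the supremum of the ratio of the true prior density to the density of the prior actually used. For two Gaussians N(mu, sigma^2) and N(mu_hat, sigma_w^2), this supremum is finite when sigma_w > sigma, and unbounded when sigma_w = sigma and mu_hat differs from mu. The function names are hypothetical.

```python
import numpy as np

def log_sup_density_ratio(mu, mu_hat, sigma, sigma_w):
    """Log of sup_x N(x; mu, sigma^2) / N(x; mu_hat, sigma_w^2).
    Finite only if the used prior is strictly wider (sigma_w > sigma),
    or if the two priors coincide."""
    if sigma_w <= sigma:
        if sigma_w == sigma and mu == mu_hat:
            return 0.0                 # identical densities, ratio is 1
        return np.inf                  # ratio grows without bound in a tail
    return (np.log(sigma_w / sigma)
            + (mu - mu_hat) ** 2 / (2.0 * (sigma_w**2 - sigma**2)))

def log_ratio_on_grid(mu, mu_hat, sigma, sigma_w, half_width):
    """Numerical maximum of the log density ratio over [-half_width, half_width]."""
    x = np.linspace(-half_width, half_width, 200001)
    log_p = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma)
    log_q = -0.5 * ((x - mu_hat) / sigma_w) ** 2 - np.log(sigma_w)
    return float(np.max(log_p - log_q))
```

With a widened prior, the grid maximum agrees with the closed form and stays fixed as the grid widens. With the greedy choice sigma_w = sigma and any nonzero mean error, the grid maximum keeps increasing with the grid width, which mirrors the unbounded term discussed above.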
In this section, we consider the setting where the prior covariance matrix is also unknown. We propose the MetaDP++ algorithm, which builds on top of the MetaDP algorithm and additionally estimates the unknown prior covariance
A key difference is that the MetaDP algorithm estimates the unknown prior mean using only the initial samples from each epoch; notably, it does not need to recover the unknown product demand parameters across epochs. However, in order to estimate the prior covariance matrix , we will need to estimate the unknown product parameters for at least some epochs. Therefore, the MetaDP++ algorithm performs uniform exploration steps in the first several epochs to collect enough data to reconstruct the prior covariance matrix
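The idea of recovering the prior covariance from per-epoch parameter estimates can be sketched as follows. This is a simplified illustration under stated assumptions, not the MetaDP++ procedure itself: it assumes a scalar linear demand model D = alpha + beta * p + noise without features, two alternating exploration prices, and it uses the plain sample covariance of the per-epoch OLS estimates. The function name and defaults are hypothetical.

```python
import numpy as np

def estimate_prior_covariance(n_epochs, n_rounds, theta_star, prior_cov,
                              p_lo=1.0, p_hi=2.0, noise_sd=0.1, seed=0):
    """Estimate the prior mean and covariance from per-epoch OLS estimates.
    ASSUMED demand model: D = alpha + beta * p + Gaussian noise."""
    rng = np.random.default_rng(seed)
    # uniform exploration: alternate the two prices within each epoch
    prices = np.where(np.arange(n_rounds) % 2 == 0, p_lo, p_hi)
    M = np.column_stack((np.ones(n_rounds), prices))   # regressors [1, p]
    estimates = []
    for _ in range(n_epochs):
        theta_j = rng.multivariate_normal(theta_star, prior_cov)
        D = M @ theta_j + rng.normal(0.0, noise_sd, size=n_rounds)
        theta_hat_j, *_ = np.linalg.lstsq(M, D, rcond=None)  # per-epoch OLS
        estimates.append(theta_hat_j)
    estimates = np.array(estimates)
    # rows are epochs, columns are parameters
    return estimates.mean(axis=0), np.cov(estimates, rowvar=False)
```

The sample covariance of the OLS estimates includes the within-epoch estimation noise in addition to the prior covariance, so it is slightly inflated. More exploration rounds per epoch shrink that extra term, which is one way to see why enough data per epoch is needed.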
We start by defining the following quantities
(15) 
For each of the first
(16) 
epochs, the MetaDP++ algorithm uses its first
(17) 
rounds to offer prices and for times each, and computes the estimate for via the OLS estimator, i.e.,