Semi-Parametric Dynamic Contextual Pricing
Abstract
Motivated by the application of real-time pricing in e-commerce platforms, we consider the problem of revenue maximization in a setting where the seller can leverage contextual information describing the customer's history and the product's type to predict her valuation of the product. However, her true valuation is unobservable to the seller; only a binary outcome in the form of success/failure of a transaction is observed. Unlike in usual contextual bandit settings, the optimal price/arm given a covariate in our setting is sensitive to the detailed characteristics of the residual uncertainty distribution. We develop a semiparametric model in which the residual distribution is nonparametric and provide the first algorithm which learns both the regression parameters and the residual distribution with $\tilde{O}(\sqrt{T})$ regret. We empirically test a scalable implementation of our algorithm and observe good performance.
1 Introduction
Many e-commerce platforms are experimenting with approaches to personalized dynamic pricing based on the customer's context (i.e., the customer's prior search/purchase history and the product's type). However, the mapping from context to optimal price needs to be learned. Our paper develops a bandit learning approach towards solving this problem motivated by practical considerations faced by online platforms. In our model, customers arrive sequentially, and each customer is interested in buying one product. The customer purchases the product if her valuation (unobserved by the platform) for the product exceeds the price set by the seller. The platform observes the covariate vector corresponding to the context, and chooses a price. The customer buys the item if and only if the price is lower than her valuation.
We emphasize three salient features of this model; taken together, these are the features that distinguish our work. First, feedback is only binary: either the customer buys the item, or she does not. In other words, the platform must learn from censored feedback. This type of binary feedback is a common feature of practical demand estimation problems, since typically exact observation of the valuation of a customer is not possible.
Second, the platform must learn the functional form of the relationship between the covariates and the expected valuation. In our work, we assume a parametric model for this relationship. In particular, we presume that the expected value of the logarithm of the valuation is linear in the covariates. Among other things, this formulation has the benefit that it ensures valuations are always nonnegative. Further, from a technical standpoint, we demonstrate that this formulation also admits efficient estimation of the parametric model.
Third, the platform must also learn the distribution of residual uncertainty that determines the actual valuation given the covariates; in other words, the distribution of the error between the expected logarithm of the valuation, and the actual logarithm of the valuation, given covariates. In our work we make minimal assumptions about the distribution of this residual uncertainty. Thus while the functional relationship between covariates and the expected logarithm of the valuation is parametric (i.e., linear), the distribution of the error is nonparametric; for this reason, we refer to our model as a semiparametric dynamic pricing model.
The challenge is to ensure that we can efficiently learn both the coefficients in the parametric model and the distribution of the error. A key observation we leverage is that our model exhibits free exploration: testing a single covariate-vector-to-price mapping at a given time can simultaneously provide information about several such mappings. We develop an arm elimination approach which maintains a set of active prices at each time, where the set depends on the covariate vector of the current customer. The set is reduced over time by eliminating empirically suboptimal choices.
We analyze our approach both theoretically and empirically. We analyze regret against the following standard oracle: the policy that optimally chooses prices given the true coefficients in the parametric linear model, as well as the distribution of the error, but without knowledge of the exact valuation of each arriving customer. The regret of our policy scales as $\tilde{O}(\sqrt{T})$ with respect to the time horizon $T$, which is optimal up to polylogarithmic factors. Further, it scales polynomially in the covariate dimension $d$, as well as in two smoothness parameters $c_1$ and $c_2$ defined as part of our model. In addition, we develop a scalable implementation of our approach which leverages a semiparametric regression technique based on convex optimization. Our simulations show that this scalable policy performs well.
1.1 Related work
Non-contextual dynamic pricing. There is a significant literature on regret analysis of the dynamic pricing problem without covariates; see [den Boer, 2015] for a detailed survey. For example, the works [Le Guen, 2008, Broder and Rusmevichientong, 2012, den Boer and Zwart, 2013, den Boer, 2014, Keskin and Zeevi, 2014] consider a parametric model, whereas [Kleinberg and Leighton, 2003] consider a nonparametric model for the unknown demand function. Our methodology is most closely aligned with that of [Kleinberg and Leighton, 2003], in that we extend their techniques to incorporate side information from the covariates.
Contextual dynamic pricing. Recently, the problem of dynamic pricing with high-dimensional covariates has garnered significant interest among researchers; see, e.g., [Javanmard and Nazerzadeh, 2019, Ban and Keskin, 2019, Cohen et al., 2016, Mao et al., 2018, Qiang and Bayati, 2019, Nambiar et al., 2019]. In summary, in contrast to the prior works in dynamic pricing with covariates, ours is the first work to address a setting where the only feedback from each transaction is binary and the residual uncertainty given covariates is nonparametric; see Table 1. We believe that these features are relevant to several online platforms implementing dynamic pricing with high-dimensional covariates, and thus our work bridges a gap between the state-of-the-art in the academic literature and practical considerations. Below, we describe some of these prior works.

[Javanmard and Nazerzadeh, 2019] consider a model where the expected valuation given covariates is a linear function of the covariates, and where the noise distribution is known. In other words, their model is fully parametric. Under certain conditions, they show that the expected regret is logarithmic in the time horizon $T$. They also briefly consider a scenario where the noise distribution is unknown, but the expected regret they obtain there is linear in $T$.

[Ban and Keskin, 2019] consider a semiparametric setting where the relationship between the expected demand, the covariates, and prices is parametric (in particular, generalized linear), and the residual noise is nonparametric; however, in their setting the true demand (analogous to the valuation in our model) is observed by the platform. Their model, as a special case, allows for binary feedback as well; however, in this special case the model is fully parametric. Under a sparsity assumption where only $s$ out of $d$ covariates impact the demand, they show that the optimal regret scales with the sparsity $s$ rather than with the ambient dimension $d$.

[Qiang and Bayati, 2019] consider a model where the expected demand is a linear function of covariates and prices, and where the true demand is observed by the platform. Under certain conditions they show that a greedy iterative least squares policy is optimal and achieves $O(\log T)$ regret.

[Nambiar et al., 2019] consider a setup where the model is misspecified; in particular, the expected demand is assumed to be a linear function of covariates and prices, but in reality the relationship of demand to covariates is nonlinear. Here again, the true demand at each time is observed by the platform. Due to misspecification, the noise term in the assumed model is correlated with the price. They develop an optimal policy where a random perturbation is added to a greedy choice of price, and use the perturbation as an instrument to obtain unbiased estimates.

[Cohen et al., 2016] consider a model similar to ours but with known noise distribution, and with the covariates chosen adversarially. [Cohen et al., 2016] develop an algorithm based on an ellipsoid method for solving a system of linear equations, which has regret polynomial in $d$ and logarithmic in $T$. [Mao et al., 2018] consider a variant which generalizes the linear model to a Lipschitz function, but with no noise.
Learning techniques: There is extensive prior work on high-dimensional contextual bandits, e.g., [Langford and Zhang, 2008, Slivkins, 2011, Perchet and Rigollet, 2013]; however, their techniques do not directly apply to our setup (in part due to the censored nature of the feedback). Our work is also loosely related to the literature on learning and auctions, e.g., [Amin et al., 2014, Morgenstern and Roughgarden, 2016]. We leverage the semiparametric regression technique with binary feedback from [Plan and Vershynin, 2013] to reduce the computational complexity of our algorithm.
|  | Contextual | Nonparametric residuals | Binary feedback |
|---|---|---|---|
| [Kleinberg and Leighton, 2003] |  | ✓ | ✓ |
| [Javanmard and Nazerzadeh, 2019] | ✓ |  | ✓ |
| [Qiang and Bayati, 2019] | ✓ | ✓ |  |
| [Cohen et al., 2016, Mao et al., 2018] | ✓ |  | ✓ |
| [Ban and Keskin, 2019] (demand observed) | ✓ | ✓ |  |
| [Ban and Keskin, 2019] (binary special case) | ✓ |  | ✓ |
| [Nambiar et al., 2019] | ✓ | ✓ |  |
| Our work | ✓ | ✓ | ✓ |

Table 1: Comparison of our work with the most closely related prior literature.
2 Preliminaries
In this section we first describe our model and then our objective, which is to minimize regret relative to a natural oracle policy.
2.1 Model
At each time $t \in \{1, 2, \dots\}$, we have a new user arrival with covariate vector $x_t$ taking values in $\mathcal{X} \subset \mathbb{R}^d$ for $d \ge 1$. Throughout the paper all vectors are encoded as column vectors. The platform observes $x_t$ upon the arrival of the user. The user's reservation value $v_t$ is modeled as
$$v_t = e^{\theta^{*\top} x_t + z_t} \qquad (1)$$
where $\theta^* \in \mathbb{R}^d$ is a fixed unknown parameter vector, and $z_t$ for $t \ge 1$ captures the residual uncertainty in demand given covariates. Similar to the linear model $v_t = \theta^{*\top} x_t + z_t$, this model is quite flexible in that linearity is a restriction only on the parameters, while the predictor variables themselves can be arbitrarily transformed. However, our formulation additionally has the feature that it ensures that $v_t > 0$ for each $t$, a key practical consideration.
We conjecture that unlike our model, the linear model $v_t = \theta^{*\top} x_t + z_t$ does not admit a learning algorithm with $\tilde{O}(\sqrt{T})$ regret. This is due to the censored nature of the feedback, the structure of revenue as a function of price, and our nonparametric assumption on the distribution of $z_t$, as described below.
The platform sets price $p_t$, upon which the user buys the product if $v_t \ge p_t$. Without loss of generality, we will assume the setting where users are buyers; one can equivalently derive exactly the same results in a setting where users are sellers, who sell the product if $v_t \le p_t$. The revenue/reward at time $t$ is $r_t = p_t \mathbb{1}\{v_t \ge p_t\}$, where $\mathbb{1}\{\cdot\}$ is the indicator function. We assume that $p_t$ is measurable with respect to $\sigma(x_1, \dots, x_t, r_1, \dots, r_{t-1}, U_t)$, where $U_t$ for each $t$ is an auxiliary random variable independent of the sources of randomness in the past. In other words, the platform does not know the future, but it can use randomized algorithms which may leverage past covariates, the current covariate, and binary feedback from the past.
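To fix ideas, the following minimal sketch simulates one customer interaction under model (1); the parameter values and helper names are our illustration, not the paper's. Note that the platform's only feedback is the boolean `bought`:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 5                                          # covariate dimension (illustrative)
theta_star = rng.normal(size=d) / np.sqrt(d)   # hypothetical true parameter

def one_round(price_fn):
    """Simulate one customer arrival under v_t = exp(theta*' x_t + z_t)."""
    x = rng.uniform(-1.0, 1.0, size=d)   # covariates with compact support (A1)
    z = rng.uniform(-0.5, 0.5)           # residual noise with compact support (A1)
    v = np.exp(theta_star @ x + z)       # valuation; always positive
    p = price_fn(x)                      # platform sees x only, never v or z
    bought = v >= p                      # censored, binary feedback
    return x, p, bought, (p if bought else 0.0)

# Example: a naive constant-price policy.
x, p, bought, reward = one_round(lambda x: 1.0)
```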
The goal of the platform is to design a pricing policy to maximize the expected total reward
$$\mathbb{E}[W_T], \qquad \text{where } W_T \triangleq \sum_{t=1}^{T} r_t.$$
In this paper we are interested in the performance characterization of optimal pricing policies as the time horizon $T$ grows large.
We make the following assumption on the statistics of $x_t$ and $z_t$.
A 1
We assume that $\{x_t\}_{t \ge 1}$ and $\{z_t\}_{t \ge 1}$ are i.i.d. and mutually independent. Their distributions are unknown to the platform. Their supports $\mathcal{X}$ and $\mathcal{Z}$ are compact and known. In particular, we assume that $\mathcal{X} \subset \mathbb{R}^d$ and that $\mathcal{Z} = [z_{\min}, z_{\max}]$ is an interval in $\mathbb{R}$.
A1 can be significantly relaxed, as we discuss in Appendix E (both in terms of the i.i.d. distribution of random variables, and the compactness of their supports).
A 2
The unknown parameter vector $\theta^*$ lies within a known, connected, compact set $\Theta$. In particular, $\Theta \subset \mathbb{R}^d$.
2.2 The oracle and regret
It is common in multi-armed bandit problems to measure the performance of an algorithm against a benchmark, or Oracle, which may have more information than the platform, and for which the optimal policy is easier to characterize. Likewise, we measure the performance of our algorithm against the following Oracle.
Definition 1
The Oracle knows the true value of $\theta^*$ and the distribution of $z_1$.
Now, let
$$g(a) \triangleq e^{a}\, \Pr(z_1 \ge a) \quad \text{for } a \in \mathbb{R}, \qquad \text{and} \qquad a^* \in \arg\max_{a \in \mathcal{Z}} g(a).$$
The following proposition is easy to show, so the proof is omitted.
Proposition 1
The following pricing policy is optimal for the Oracle: at each time $t$, set price $p_t = e^{\theta^{*\top} x_t + a^*}$, where $a^*$ is as defined above.
Clearly, the total reward obtained by the Oracle with this policy, denoted as $W_T^*$, satisfies $\mathbb{E}[W_T^*] \ge \mathbb{E}[W_T]$ for every feasible policy.
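Under Proposition 1 the Oracle's problem reduces to a one-dimensional maximization of $g(a) = e^a \Pr(z_1 \ge a)$. The sketch below (our own illustration; function names and the grid-search approach are our choices) computes the Oracle's markup and price:

```python
import numpy as np

def oracle_markup(survival, z_min, z_max, grid_size=10_000):
    """Grid search for a* maximizing g(a) = e^a * P(z >= a) over [z_min, z_max].

    `survival` is the survival function of the noise z, known to the Oracle.
    """
    a = np.linspace(z_min, z_max, grid_size)
    return a[np.argmax(np.exp(a) * survival(a))]

def oracle_price(x, theta_star, a_star):
    """Proposition 1: the Oracle's price is exp(theta*' x + a*)."""
    return float(np.exp(theta_star @ x + a_star))

# Example with z ~ Uniform[-0.5, 0.5], so P(z >= a) = 0.5 - a on the support.
a_star = oracle_markup(lambda a: np.clip(0.5 - a, 0.0, 1.0), -0.5, 0.5)
```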
Our goal: Regret minimization. Given a feasible policy, define the regret against the Oracle as:
$$R_T \triangleq \mathbb{E}[W_T^*] - \mathbb{E}[W_T].$$
Our goal in this paper is to design a pricing policy which minimizes $R_T$ asymptotically to leading order in $T$.
2.3 Smoothness Assumption
Let
$$r(x, \vartheta, a) \;\triangleq\; e^{\vartheta^\top x + a}\; \Pr\!\big( v_1 \ge e^{\vartheta^\top x + a} \,\big|\, x_1 = x \big), \qquad \vartheta \in \Theta, \; a \in \mathcal{Z},$$
which can be thought of as the expected revenue of a single transaction when the platform sets price $e^{\vartheta^\top x + a}$ after observing a covariate $x$. We impose the following assumption on $r$.
A 3
Let $x^{(i)}$ be the $i$th component of $x$, i.e., $x = (x^{(1)}, \dots, x^{(d)})$. We assume that there exist constants $c_1, c_2 > 0$ such that for each $x \in \mathcal{X}$ and $(\vartheta, a) \in \Theta \times \mathcal{Z}$ we have
$$r(x, \theta^*, a^*) - r(x, \vartheta, a) \;\ge\; c_1 \Big( (a - a^*)^2 + \sum_{i=1}^d \big( (\vartheta^{(i)} - \theta^{*(i)})\, x^{(i)} \big)^2 \Big),$$
and for each $(\vartheta, a)$,
$$r(x, \theta^*, a^*) - r(x, \vartheta, a) \;\le\; c_2 \Big( (a - a^*)^2 + \sum_{i=1}^d \big( (\vartheta^{(i)} - \theta^{*(i)})\, x^{(i)} \big)^2 \Big).$$
Recall that $v_1 = e^{\theta^{*\top} x_1 + z_1}$. It follows from A1, by conditioning on $x_1 = x$, that
$$r(x, \vartheta, a) = e^{\theta^{*\top} x}\, g\big( (\vartheta - \theta^*)^\top x + a \big),$$
with $g$ as defined in Section 2.2. We will use this representation throughout our development.
Note that A3 subsumes that $(\theta^*, a^*)$ is the unique optimizer of $r(x, \cdot, \cdot)$ for each $x$. This is true if $a^*$ is the unique maximizer of $g(\cdot)$ and $\theta^*$ is identifiable in the parameter space $\Theta$.
Below we will also provide sufficient conditions for A3 to hold. In particular, we develop sufficient conditions which are a natural analog of the assumptions made in [Kleinberg and Leighton, 2003].
2.4 Connection to assumptions in [Kleinberg and Leighton, 2003]
The ‘stochastic valuations’ model considered in [Kleinberg and Leighton, 2003] is equivalent to our model with no covariates, i.e., with $x_t$ absent. In that case the revenue function is equal to $h(p) \triangleq p\, \Pr(v_1 \ge p)$. In [Kleinberg and Leighton, 2003] it is assumed that the valuations $v_t$ are i.i.d., and that $v_1$ has bounded support. Clearly, A1 and A2 are a natural analog to these assumptions. They also assume that $h$ has a unique optimizer $p^*$, and is locally concave at the optimal value, i.e., $h''(p^*) < 0$. We show below that a natural analog of these conditions is sufficient for A3 to hold.
Suppose that $(\theta^*, a^*)$ is the unique optimizer of $r(x, \cdot, \cdot)$ for each $x \in \mathcal{X}$. Also suppose that A1 and A2 hold. Then A3 holds if $r(x, \cdot, \cdot)$ is strictly locally concave at $(\theta^*, a^*)$, i.e., if the Hessian of $r(x, \cdot, \cdot)$ at $(\theta^*, a^*)$ exists and is negative definite. To see why this is the case, note that strict local concavity at $(\theta^*, a^*)$ implies that there exists an $\epsilon > 0$ such that the assumption holds for each $(\vartheta, a) \in B\big((\theta^*, a^*), \epsilon\big)$, where $B(w, \epsilon)$ is the $(d+1)$-dimensional ball with center $w$ and radius $\epsilon$. This, together with the compactness of $\Theta$ and $\mathcal{Z}$, implies A3.
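In symbols, a sketch of the local step: write $w = (\vartheta, a)$, $w^* = (\theta^*, a^*)$, and let $H(x)$ denote the Hessian of $r(x, \cdot, \cdot)$ at $w^*$. Since the gradient vanishes at the optimizer, a second-order Taylor expansion gives
$$r(x, w) \;=\; r(x, w^*) + \tfrac{1}{2}(w - w^*)^\top H(x)\,(w - w^*) + o(\|w - w^*\|^2) \;\le\; r(x, w^*) - \tfrac{1}{4}\lambda_{\min}(x)\,\|w - w^*\|^2$$
for all $w$ in a sufficiently small ball around $w^*$, where $\lambda_{\min}(x) > 0$ denotes the smallest eigenvalue of $-H(x)$.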
It is somewhat surprising that to incorporate covariates in a setting where the distribution of $z_t$ is nonparametric, only minor modifications are needed relative to the assumptions in [Kleinberg and Leighton, 2003]. For completeness, in the Appendix we provide a class of examples for which it is easy to check that the Hessian is indeed negative definite and that all our assumptions are satisfied.
3 Pricing policies
Any successful algorithm must set prices to balance price exploration to learn with exploitation to maximize revenue. Because prices are adaptively controlled, the outputs $\mathbb{1}\{v_t \ge p_t\}$ will not be conditionally independent given the covariates $x_t$, as is typically assumed in semiparametric regression with binary outputs (e.g., see [Plan and Vershynin, 2013]). This issue is referred to as price endogeneity in the pricing literature.
We address this problem by first designing our own bandit-learning policy, Dynamic Experimentation and Elimination of Prices with Covariates (DEEPC), which uses only a basic statistical learning technique: it dynamically eliminates suboptimal values of $(\vartheta, a)$ by employing confidence intervals. At first glance, such a learning approach seems to suffer from the curse of dimensionality, in terms of both sample complexity and computational complexity. As we will see, our DEEPC algorithm yields low sample complexity by cleverly exploiting the structure of our semiparametric model. We then address computational complexity by presenting a variant of our policy which incorporates sparse semiparametric regression techniques.
The rest of the section is organized as follows. We first present the DEEPC policy. We then discuss three variants: (a) DEEPC with Rounds, a slight variant of DEEPC which is a bit more complex to implement but simpler to analyze theoretically, and thus enables us to obtain regret bounds; (b) Decoupled DEEPC, which decouples the estimation of $\theta^*$ from that of $a^*$ and thus allows us to leverage low-complexity sparse semiparametric regression to estimate $\theta^*$, but at the cost of higher regret; and (c) Sparse DEEPC, which combines DEEPC and sparse semiparametric regression to achieve low complexity without decoupling, thus achieving the best of both worlds. We provide a theoretical analysis of the first variant, and use simulation to study the others.
While we discuss below the key ideas behind these three variants, their formal definitions are provided in Appendix B.
3.1 DEEPC policy
We now describe DEEPC. As noted in Proposition 1, the Oracle achieves optimal performance by choosing at each time $t$ the price $e^{\theta^{*\top} x_t + a^*}$, where $a^*$ is the maximizer of $g(\cdot)$ over $\mathcal{Z}$. We view the problem as a multi-armed bandit over the space $\Theta \times \mathcal{Z}$. Viewed this way, before the context $x_t$ at time $t$ arrives, the decision maker must choose a value $\vartheta \in \Theta$ and an $a \in \mathcal{Z}$. Once $x_t$ arrives, the price $p_t = e^{\vartheta^\top x_t + a}$ is set, and revenue is realized. Through this lens, we can see that the Oracle is equivalent to pulling the arm $(\theta^*, a^*)$ at every $t$ in the new multi-armed bandit we have defined. DEEPC is an arm-elimination algorithm for this multi-armed bandit.
From a learning standpoint, the goal is to learn the optimal arm $(\theta^*, a^*)$, which at first sight seems to suffer from the curse of dimensionality. However, we observe that in fact our problem allows for "free exploration" that lets us learn efficiently in this setting; in particular, given $x_t$, for each choice of price we simultaneously obtain information about the expected revenue of a range of pairs $(\vartheta, a)$. This is specifically because we observe the context $x_t$, and because of the particular structure of demand that we consider. However, to ensure that each candidate arm has sufficiently high probability of being pulled at any time step, DEEPC selects prices at random from a set of active prices, and ensures that this set is kept small via arm elimination. The speedup in learning thus afforded enables us to obtain low regret.
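To see the free exploration concretely (our own restatement, using the representation from Section 2.3): if two arms $(\vartheta_1, a_1)$ and $(\vartheta_2, a_2)$ satisfy $\vartheta_1^\top x_t + a_1 = \vartheta_2^\top x_t + a_2 = \log p_t$, then
$$r(x_t, \vartheta_1, a_1) = r(x_t, \vartheta_2, a_2) = e^{\theta^{*\top} x_t}\, g\big( \log p_t - \theta^{*\top} x_t \big),$$
so the single observed reward $p_t \mathbb{1}\{v_t \ge p_t\}$ is an unbiased sample of the expected revenue of every such arm.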
Formally, our procedure is defined as follows. We partition the support $\mathcal{Z}$ of $z_t$ into intervals of length $\delta > 0$. If the boundary sets are smaller, we enlarge the support slightly (by an amount less than $\delta$) so that each interval is of equal length, equal to $\delta$. Let the corresponding intervals be $Z_1, \dots, Z_K$, and their centroids be $a_1, \dots, a_K$, where $K$ is less than or equal to $\lceil (z_{\max} - z_{\min})/\delta \rceil + 1$. Similarly, for each $i \in \{1, \dots, d\}$, we partition the projection of the support $\Theta$ onto the $i$th dimension into intervals of equal length $\delta$, with sets $\Theta_{i,1}, \dots, \Theta_{i,K_i}$ and centroids $\vartheta_{i,1}, \dots, \vartheta_{i,K_i}$. Again, if the boundary sets are smaller, we enlarge the support so that each interval is of equal length $\delta$.
Our algorithm keeps a set of active cells and eliminates those for which we have sufficient evidence of being far from $(\theta^*, a^*)$. We let $\mathcal{C}_t$ represent the set of active cells at time $t$, where a cell $c$ represents a tuple $(\Theta_{1,k_1}, \dots, \Theta_{d,k_d}, Z_k)$ of one interval per coordinate together with an interval for $a$; we write $(\vartheta_c, a_c)$ for its centroid. Then, $\{(\vartheta_c, a_c) : c \in \mathcal{C}_t\}$ represents the set of active pairs. Here, $\mathcal{C}_1$ contains all cells.
At each time $t$ we have a set of active prices, which depends on $x_t$ and $\mathcal{C}_t$, i.e.,
$$P_t = \big\{ e^{\vartheta_c^\top x_t + a_c} : c \in \mathcal{C}_t \big\}.$$
At time $t$ we pick a price $p_t$ from $P_t$ uniformly at random. We say that cell $c$ is checked if $\log p_t - \vartheta_c^\top x_t \in Z_c$, where $Z_c$ is the $a$-interval of cell $c$.
Each price selection checks one or more cells $c \in \mathcal{C}_t$.
Recall that the reward generated at time $t$ is $r_t = p_t \mathbb{1}\{v_t \ge p_t\}$. Let $N_t(c)$ be the number of times cell $c$ is checked until time $t$, and let $S_t(c)$ be the total reward obtained at these times. Let
$$\hat{r}_t(c) = \frac{S_t(c)}{N_t(c)}.$$
We also compute confidence bounds for $\hat{r}_t(c)$, as follows. Fix $\alpha > 0$. For each active $c$, let
$$\mathrm{UCB}_t(c) = \hat{r}_t(c) + \sqrt{\frac{\alpha \log t}{N_t(c)}}$$
and
$$\mathrm{LCB}_t(c) = \hat{r}_t(c) - \sqrt{\frac{\alpha \log t}{N_t(c)}}.$$
These represent the upper and lower confidence bounds, respectively.
We eliminate $c$ from $\mathcal{C}_t$ if there exists $c' \in \mathcal{C}_t$ such that
$$\mathrm{UCB}_t(c) < \mathrm{LCB}_t(c').$$
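The sketch below (ours, under the reconstructed check rule above; not the paper's exact pseudocode) implements one DEEPC time step, including the free-exploration update and the elimination rule:

```python
import numpy as np

def deepc_step(t, x_t, cells, stats, delta, alpha, rng, sell):
    """One simplified DEEPC step.

    cells : list of (theta_centroid, a_centroid) pairs, the active cells
    stats : per-cell [times_checked, total_reward], aligned with `cells`
    delta : grid resolution; a cell is "checked" when the chosen log-price
            falls within delta/2 of its induced log-price at x_t
    sell  : callback sell(price) -> bool, the binary market feedback
    """
    # One candidate price per active cell; pick one uniformly at random.
    log_prices = np.array([th @ x_t + a for th, a in cells])
    j = rng.integers(len(cells))
    p_t = float(np.exp(log_prices[j]))
    reward = p_t * float(sell(p_t))

    # Free exploration: the single transaction updates every cell whose
    # centroid induces (nearly) the same price at this context.
    for i, lp in enumerate(log_prices):
        if abs(lp - log_prices[j]) <= delta / 2:
            stats[i][0] += 1
            stats[i][1] += reward

    # Arm elimination with confidence radius sqrt(alpha * log t / n).
    def bounds(i):
        n, s = stats[i]
        if n == 0:
            return np.inf, -np.inf
        rad = np.sqrt(alpha * np.log(max(t, 2)) / n)
        return s / n + rad, s / n - rad

    ucb_lcb = [bounds(i) for i in range(len(cells))]
    best_lcb = max(l for _, l in ucb_lcb)
    keep = [i for i, (u, _) in enumerate(ucb_lcb) if u >= best_lcb]
    return [cells[i] for i in keep], [stats[i] for i in keep], reward
```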
3.2 Variants of DEEPC
DEEPC with Rounds: Theoretical analysis of regret for arm elimination algorithms typically involves tracking the number of times each suboptimal arm is pulled before being eliminated. However, this is challenging in our setting, since the set of arms which get “pulled” at an offered price depends on the covariate vector at that time. To resolve this challenge, we consider a variant where the algorithm operates in rounds, as follows.
Within a round the set of active cells remains unchanged. Further, we ensure that within each round each arm in the active set is pulled at least once. For our analysis, we keep track of only the first time an arm is pulled in each round, and ignore the rest. While this may seem wasteful, a surprising aspect of our analysis is that the regret cost incurred by this form of exploration is only polylogarithmic in $T$. Further, since the number of times each arm is "explored" in each round is exactly one, theoretical analysis now becomes tractable. For formal definitions of this policy and also of the policies below, we refer the reader to Appendix B.
Decoupled DEEPC: We now present a policy which has low computational complexity under sparsity and which does not suffer from price endogeneity, but may incur higher regret. At times $t \le t_0$, where $t_0$ is a tuning parameter, the price is set independently and uniformly at random from a compact set. This ensures that the outputs are conditionally independent given the covariates $x_t$, i.e., there is no price endogeneity. We then use a low-complexity semiparametric regression technique from [Plan and Vershynin, 2013] to estimate $\theta^*$ under a sparsity assumption. With the estimate of $\theta^*$ in place, at times $t > t_0$, we use a one-dimensional version of DEEPC to simultaneously estimate $a^*$ and maximize revenue. The best possible regret achievable with this policy is $\tilde{O}(T^{2/3})$, achieved when $t_0$ is of order $T^{2/3}$ [Plan and Vershynin, 2013].
Sparse DEEPC: This policy also leverages sparsity, but without decoupling the estimation of $\theta^*$ from the estimation of $a^*$ and revenue maximization. At each time $t$, using the data collected in the past, we estimate $\theta^*$ via the semiparametric regression technique from [Plan and Vershynin, 2013]. Using this estimate of $\theta^*$, the estimates of rewards for different values of $a$ from samples collected in the past, and the corresponding confidence bounds, we obtain a set of active prices at each time, similar to that of DEEPC, from which the price is picked at random.
While Sparse DEEPC suffers from price endogeneity, with an appropriate choice of the tuning parameters we conjecture that its cost in terms of expected regret can be made polylogarithmic in $T$; proving this result remains an important open direction. The intuition for this comes from our theoretical analysis of DEEPC with Rounds and the following observation: even though the set of active prices may be different at different times, we still choose prices at random, and prices are eliminated only upon reception of sufficient evidence of suboptimality. We conjecture that these features are sufficient to ensure that the error in the estimate of $\theta^*$ is kept small with high probability. Our simulation results indeed show that this algorithm performs relatively well.
4 Regret analysis
The main theoretical result of this paper is the following. The regret bound below is achieved by DEEPC with Rounds (as defined in Section 3.2). For its proof see Appendix C.
Theorem 1
Suppose A1, A2, and A3 hold. Then the expected regret of DEEPC with Rounds satisfies
$$\mathbb{E}[R_T] \;\le\; C\,\mathrm{poly}(d, c_1, c_2)\, \sqrt{T}\, \mathrm{polylog}(T),$$
where $C$ is a constant depending on the supports $\mathcal{X}$, $\Theta$, and $\mathcal{Z}$, but not on $T$.
First, note that the above scaling is optimal w.r.t. $T$ (up to polylogarithmic factors), as even for the case where $x_t = 1$ w.p.1 it is known that achieving $o(\sqrt{T})$ expected regret is not possible (see [Kleinberg and Leighton, 2003]).
Second, we state our results with explicit dependence on various parameters discussed in our assumptions in order for the reader to track the ultimate dependence on the dimension $d$. Note that, as $d$ scales, the supports $\mathcal{X}$, $\Theta$, and $\mathcal{Z}$, and the distribution of $z_t$ may change. In turn, the parameters $c_1$ and $c_2$, which are constants for a given $d$, may scale as $d$ scales. These scalings need to be computed case by case, as they depend on how one models the changes in the supports and distributions. Below we discuss briefly how these may scale in practice.
Recall that the compact supports in A1 and A2 induce lower and upper bounds on $v_t$, namely, the user valuations. Thus, it is meaningful to postulate that these bounds do not scale with the covariate dimension, as the role of covariates is to aid prediction of user valuations and not to change them. For example, one may postulate that $\theta^*$ is "sparse", i.e., the number of nonzero coordinates of $\theta^*$ is bounded from above by a known constant, in which case the valuation bounds do not scale with $d$. The dependence of $c_1$ and $c_2$ on $d$ is more subtle, as it may depend on the details of the modeling assumptions; for example, their scaling may depend on how the gap between the largest and second-largest achievable expected revenues scales. One of the virtues of Theorem 1 is that it succinctly characterizes the scaling of regret via a small set of parameters.
Finally, the above result can be viewed through the lens of sample complexity. The arguments used in Lemma 1 and in the derivation of equation (4) imply that the sample complexity is "roughly" polynomial in $d$ and $1/\epsilon$. More precisely, suppose that at a covariate vector $x$, we set the price $p(x)$. We say the mapping $p(\cdot)$ is probably approximately revenue optimal if for any $x$ the difference between the achieved revenue and the optimal revenue is at most $\epsilon$ with probability at least $1 - \beta$. The number of samples $n$ required to learn such a policy satisfies $n \le f(d, 1/\epsilon, \log(1/\beta))$, where $f$ is a polynomial function.
5 Simulation Results
Below we summarize our simulation setting and then briefly describe our findings.
Simulation setup: First, we simulate our model in a low-dimensional setting, where covariate vectors are i.i.d. standard normal random vectors, and the parameter space $\Theta$, the true parameter vector $\theta^*$, the noise support $\mathcal{Z}$, and the noise distribution are held fixed across runs. Note that even though we assumed that the covariate distribution has bounded support for ease of analysis, our policies do not require that assumption; hence, we are able to use a covariate distribution with unbounded support in our simulations. In this setting, we simulate the policies DEEPC, Decoupled DEEPC, and Sparse DEEPC for a fixed time horizon $T$ and for different values of the grid-size parameter $\delta$. Each policy is simulated 5,000 times for each set of parameters.
Next, we also simulate our model in a higher-dimensional sparse setting, with a small number of nonzero entries in $\theta^*$, each nonzero entry equal to a common constant; each policy is simulated 1,500 times for each set of parameters, with the rest of the setup being the same as earlier. For this setup, we only simulate Decoupled DEEPC and Sparse DEEPC, as the computational complexity of DEEPC does not scale well with $d$.
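A minimal harness for experiments of this kind is sketched below. All configuration values here (dimension, sparsity, $\theta^*$, noise law, horizon, grid width) are our placeholders, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative configuration only.
d, T, delta = 10, 20_000, 0.02
theta_star = np.zeros(d)
theta_star[:3] = 0.3                         # sparse truth (hypothetical)

def run_once(policy):
    """Total reward of `policy(t, x) -> price` over one sample path."""
    total = 0.0
    for t in range(T):
        x = rng.normal(size=d)               # unbounded covariates, as in Section 5
        v = np.exp(theta_star @ x + rng.uniform(-0.5, 0.5))
        p = policy(t, x)
        total += p * (v >= p)
    return total

# Oracle benchmark: markup a* by grid search over g(a) = e^a * P(z >= a).
a_grid = np.arange(-0.5, 0.5, delta)
a_star = a_grid[np.argmax(np.exp(a_grid) * np.clip(0.5 - a_grid, 0.0, 1.0))]
oracle_reward = run_once(lambda t, x: float(np.exp(theta_star @ x + a_star)))
```

Regret of a candidate policy can then be estimated as the difference of `run_once` averages against the oracle benchmark over repeated sample paths.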
Main findings: First, we find that the performance of each policy is sensitive to the choice of $\delta$, and that the range of $\delta$ where expected regret is low may be different for different policies. The expected regret typically increases as $\delta$ increases; however, its variability typically reduces with $\delta$. This is similar to the usual bias-variance tradeoff in learning problems. For our low-dimensional setup, the reward of the Oracle concentrates at around 4,150. As Figure 1 shows, each policy performs well in the plotted range of $\delta$.
We find that the main metric where the performance of the policies is differentiated is in fact the high quantiles of the regret distribution. For example, while the expected regret of DEEPC and those of Decoupled DEEPC and Sparse DEEPC (each at its best-performing choice of $\delta$) are all roughly the same, the upper percentiles of the regret distribution under DEEPC and Sparse DEEPC are substantially lower than that under Decoupled DEEPC.
For our higher-dimensional setup, while Decoupled DEEPC and Sparse DEEPC perform similarly in average regret, we find that Sparse DEEPC significantly outperforms Decoupled DEEPC in the standard deviation and in the upper percentiles of the regret distribution.
6 Acknowledgments
This work was supported in part by National Science Foundation Grants DMS-1820942, DMS-1838576, CNS-1544548, and CNS-1343253. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
References
 [Amin et al., 2014] Amin, K., Rostamizadeh, A., and Syed, U. (2014). Repeated contextual auctions with strategic buyers. In Advances in Neural Information Processing Systems, pages 622–630.
 [Ban and Keskin, 2019] Ban, G.Y. and Keskin, N. B. (2019). Personalized dynamic pricing with machine learning.
 [Broder and Rusmevichientong, 2012] Broder, J. and Rusmevichientong, P. (2012). Dynamic pricing under a general parametric choice model. Operations Research, 60(4):965–980.
 [Cohen et al., 2016] Cohen, M. C., Lobel, I., and Paes Leme, R. (2016). Featurebased dynamic pricing. In Proceedings of the 2016 ACM Conference on Economics and Computation, EC ’16.
 [den Boer, 2014] den Boer, A. V. (2014). Dynamic pricing with multiple products and partially specified demand distribution. Mathematics of Operations Research, 39(3):863–888.
 [den Boer, 2015] den Boer, A. V. (2015). Dynamic pricing and learning: Historical origins, current research, and new directions. Surveys in Operations Research and Management Science, 20(1):1–18.
 [den Boer and Zwart, 2013] den Boer, A. V. and Zwart, B. (2013). Simultaneously learning and optimizing using controlled variance pricing. Management Science, 60(3):770–783.
 [Frahm, 2004] Frahm, G. (2004). Generalized elliptical distributions: theory and applications. PhD thesis, Universität zu Köln.
 [Javanmard and Nazerzadeh, 2019] Javanmard, A. and Nazerzadeh, H. (2019). Dynamic pricing in highdimensions. Journal of Machine Learning Research.
 [Keskin and Zeevi, 2014] Keskin, N. B. and Zeevi, A. (2014). Dynamic pricing with an unknown demand model: Asymptotically optimal semimyopic policies. Operations Research, 62(5):1142–1167.
 [Kleinberg and Leighton, 2003] Kleinberg, R. and Leighton, T. (2003). The value of knowing a demand curve: Bounds on regret for online postedprice auctions. In IEEE Symposium on Foundations of Computer Science.
 [Langford and Zhang, 2008] Langford, J. and Zhang, T. (2008). The epochgreedy algorithm for multiarmed bandits with side information. In Advances in Neural Information Processing Systems.
 [Le Guen, 2008] Le Guen, T. (2008). Data-driven pricing. Master's thesis, Massachusetts Institute of Technology.
 [Mao et al., 2018] Mao, J., Leme, R., and Schneider, J. (2018). Contextual pricing for Lipschitz buyers. In Advances in Neural Information Processing Systems, pages 5643–5651.
 [Morgenstern and Roughgarden, 2016] Morgenstern, J. and Roughgarden, T. (2016). Learning simple auctions. In Annual Conference on Learning Theory, pages 1298–1318.
 [Nambiar et al., 2019] Nambiar, M., SimchiLevi, D., and Wang, H. (2019). Dynamic learning and pricing with model misspecification. Management Science.
 [Perchet and Rigollet, 2013] Perchet, V. and Rigollet, P. (2013). The multiarmed bandit problem with covariates. The Annals of Statistics, pages 693–721.
 [Plan and Vershynin, 2013] Plan, Y. and Vershynin, R. (2013). Robust 1bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494.
 [Qiang and Bayati, 2019] Qiang, S. and Bayati, M. (2019). Dynamic pricing with demand covariates.
 [Slivkins, 2011] Slivkins, A. (2011). Contextual bandits with similarity information. In Annual Conference On Learning Theory.
Appendix A A class of examples where assumptions A1, A2, and A3 are satisfied
First consider a spherically distributed $d$-dimensional random vector $x$, i.e., for each $d$-dimensional orthonormal matrix $U$ the distributions of $Ux$ and $x$ are identical. It is known that a $d$-dimensional random vector $x$ is spherically distributed iff there exists a positive (one-dimensional) random variable $R$, called the generating random variable, such that $x \stackrel{d}{=} R\,u$, where $u$ is uniformly distributed on the $(d-1)$-dimensional unit hypersphere [Frahm, 2004]. For example, if $x$ is a standard normal random vector then $R^2$ is a chi-squared distributed random variable. Further, it is also known that for each spherically distributed $x$ there exists a function $\psi$ such that the MGF of $\vartheta^\top x$, namely $\mathbb{E}[e^{\vartheta^\top x}]$, is equal to $\psi(\|\vartheta\|_2)$, where $\|\cdot\|_2$ represents the 2-norm [Frahm, 2004].
Now, suppose that the $x_t$ are i.i.d. with a spherical distribution such that the generating random variable $R$ has a density with support contained in a compact interval. Further suppose that the $z_t$ are i.i.d. uniform on a compact interval, and that $\Theta$ is a known, connected, compact set. Thus A1 and A2 readily hold.
The following facts are easy to show: (i) the revenue function $r(x, \cdot, \cdot)$ is smooth, (ii) $g(\cdot)$ has a unique maximizer $a^*$, (iii) $\theta^*$ is identifiable, and (iv) $(\theta^*, a^*)$ is the unique optimizer of $r(x, \cdot, \cdot)$ for each $x$. Further, $\vartheta \mapsto \mathbb{E}[e^{\vartheta^\top x}]$ is a linear combination of MGFs [Frahm, 2004], which are convex, and is thus convex itself. Now, let $H$ be the Hessian of $r(x, \cdot, \cdot)$ at $(\theta^*, a^*)$. With some calculations one can show that for any nonzero $u \in \mathbb{R}^{d+1}$, we have that
$$u^\top H u < 0.$$
Appendix B Variants of DEEPC: Formal Definitions
B.1 DEEPC with Rounds
We partition the support $\mathcal{Z}$ of $z_t$ into intervals of length $\delta$. If the boundary sets are smaller, we enlarge the support slightly (by an amount less than $\delta$) so that each interval is of equal length, equal to $\delta$. Let the corresponding intervals be $Z_1, \dots, Z_K$, and their centroids be $a_1, \dots, a_K$, where $K$ is less than or equal to $\lceil (z_{\max} - z_{\min})/\delta \rceil + 1$. Similarly, for each $i \in \{1, \dots, d\}$, we partition the projection of the support $\Theta$ onto the $i$th dimension into intervals of equal length, with sets $\Theta_{i,1}, \dots, \Theta_{i,K_i}$ and centroids $\vartheta_{i,1}, \dots, \vartheta_{i,K_i}$. Again, if the boundary sets are smaller, we enlarge the support so that each interval is of equal length $\delta$.
Our algorithm keeps a set of active cells and eliminates those for which we have sufficient evidence of being far from $(\theta^*, a^*)$.
Our algorithm operates in rounds. We use $s$ to index the round. Each round lasts for one or more time steps. Let $\mathcal{Z}_s \subseteq \{Z_1, \dots, Z_K\}$ represent the set of active $a$-cells in round $s$. For each $i \in \{1, \dots, d\}$, let $\Theta_{i,s} \subseteq \{\Theta_{i,1}, \dots, \Theta_{i,K_i}\}$ represent the set of active $\vartheta^{(i)}$-cells in round $s$. Then, $\mathcal{C}_s = \Theta_{1,s} \times \cdots \times \Theta_{d,s} \times \mathcal{Z}_s$ represents the set of active cells.
During each time $t$ in round $s$ we have a set of active prices, which depends on $x_t$ and $\mathcal{C}_s$. Let
$$P_t = \big\{ e^{\vartheta_c^\top x_t + a_c} : c \in \mathcal{C}_s \big\},$$
where $(\vartheta_c, a_c)$ is the centroid of cell $c$. During round $s$, at each time $t$ we pick a price $p_t$ from $P_t$ uniformly at random. At time $t$, we say that cell $c \in \mathcal{C}_s$ is 'checked' if $\log p_t - \vartheta_c^\top x_t \in Z_c$, where $Z_c$ is the $a$-interval of cell $c$.
Each price selection checks one or more cells $c \in \mathcal{C}_s$. The round lasts until all active cells are checked.
Let $\tau_s(c)$ be the first time in round $s$ when the cell $c$ is checked. Recall that the reward generated at time $t$ is $r_t = p_t \mathbb{1}\{v_t \ge p_t\}$. At the end of each round $s$, for each active cell $c$ we compute the empirical average of the rewards generated at the times $\tau_{s'}(c)$ for $s' \le s$, i.e., we compute
$$\hat{r}_s(c) = \frac{1}{s} \sum_{s'=1}^{s} r_{\tau_{s'}(c)}.$$
Note that for each cell, in each round we only record the reward at the first time the cell is checked and ignore rewards at the rest of the times in that round. We also compute confidence bounds for $\hat{r}_s(c)$, as follows. Fix $\alpha > 0$. For each active $c$, let
$$\mathrm{UCB}_s(c) = \hat{r}_s(c) + \sqrt{\frac{\alpha \log T}{s}}$$
and
$$\mathrm{LCB}_s(c) = \hat{r}_s(c) - \sqrt{\frac{\alpha \log T}{s}}.$$
These represent the upper and lower confidence bounds, respectively.
We eliminate $Z$ from $\mathcal{Z}_s$ if, for every cell $c \in \mathcal{C}_s$ whose $a$-interval is $Z$, there exists $c' \in \mathcal{C}_s$ such that
$$\mathrm{UCB}_s(c) < \mathrm{LCB}_s(c').$$
Similarly, for each $i$, we eliminate $I$ from $\Theta_{i,s}$ if, for every cell $c \in \mathcal{C}_s$ whose $i$th interval is $I$, there exists $c' \in \mathcal{C}_s$ such that
$$\mathrm{UCB}_s(c) < \mathrm{LCB}_s(c').$$
The time-complexity of this policy is driven by the number of cells, which increases exponentially in $d$, and thus scales poorly with $d$.
B.2 Decoupled DEEPC
We assume that there exists an integer $s_0$ such that at most $s_0$ entries in $\theta^*$ are nonzero. The value of $s_0$ is known to the platform. Here, $s_0$ represents sparsity and could be significantly smaller than $d$. We also assume that $\|\theta^*\|_2 \le 1$.
At times $t \le t_0$, where $t_0$ is a tuning parameter, select price $p_t$ uniformly at random from a compact interval. Let $y_t = 2\,\mathbb{1}\{v_t \ge p_t\} - 1 \in \{-1, +1\}$ denote the binary outcome at time $t$. Then, we estimate $\theta^*$ by solving the following convex optimization problem:
$$\max_{\vartheta \in \mathbb{R}^d} \; \sum_{t=1}^{t_0} y_t\, \vartheta^\top x_t \qquad (2)$$
$$\text{subject to} \quad \|\vartheta\|_1 \le \sqrt{s_0}, \quad \|\vartheta\|_2 \le 1.$$
We denote the estimate at time $t_0$ by $\hat{\theta}$.
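For concreteness, here is a minimal sketch of estimator (2) as reconstructed above, in the spirit of the one-bit estimator of [Plan and Vershynin, 2013]. The use of cvxpy, the function name, and the $\pm 1$ encoding are our illustration choices:

```python
import numpy as np
import cvxpy as cp

def estimate_theta(X, y_pm, s0):
    """Sketch of problem (2): one-bit estimation of theta*.

    X    : (n, d) matrix of covariates from the random-price phase
    y_pm : length-n vector of +/-1 purchase outcomes
    s0   : assumed sparsity of theta*
    Maximizes the linear objective sum_t y_t <x_t, theta> over the convex set
    { ||theta||_1 <= sqrt(s0), ||theta||_2 <= 1 }.
    """
    d = X.shape[1]
    theta = cp.Variable(d)
    objective = cp.Maximize((y_pm @ X) @ theta)
    constraints = [cp.norm(theta, 1) <= np.sqrt(s0), cp.norm(theta, 2) <= 1.0]
    cp.Problem(objective, constraints).solve()
    return theta.value
```

Because the objective is linear and the feasible set is an intersection of norm balls, the problem is a small second-order cone program and solves quickly even for large $d$.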
We partition the support $\mathcal{Z}$ of $z_t$ into intervals of length $\delta$ as above, and let the corresponding intervals be $Z_1, \dots, Z_K$ with centroids $a_1, \dots, a_K$.
Fix $\alpha > 0$. For $t > t_0$ we do the following.
We let $\mathcal{A}_t \subseteq \{1, \dots, K\}$ represent the set of active cells. Then, $\{a_k : k \in \mathcal{A}_t\}$ represents the set of active $a$'s. Here, $\mathcal{A}_{t_0 + 1} = \{1, \dots, K\}$.
We let
$$P_t = \big\{ e^{\hat{\theta}^\top x_t + a_k} : k \in \mathcal{A}_t \big\}.$$
At time $t$ we pick a price $p_t$ from $P_t$ uniformly at random. We say that cell $k$, i.e., set $Z_k$, is 'checked' if $\log p_t - \hat{\theta}^\top x_t \in Z_k$.
Each price selection checks one or more cells $k \in \mathcal{A}_t$. Let $N_t(k)$ be the number of times cell $k$ is checked until time $t$ and $S_t(k)$ be the total reward obtained at such times. Let
$$\hat{r}_t(k) = \frac{S_t(k)}{N_t(k)}.$$
We also compute confidence bounds for $\hat{r}_t(k)$, as follows. For each active $k$, let
$$\mathrm{UCB}_t(k) = \hat{r}_t(k) + \sqrt{\frac{\alpha \log t}{N_t(k)}}$$
and
$$\mathrm{LCB}_t(k) = \hat{r}_t(k) - \sqrt{\frac{\alpha \log t}{N_t(k)}}.$$
These represent the upper and lower confidence bounds, respectively.
We eliminate $k$ from $\mathcal{A}_t$ if there exists $k' \in \mathcal{A}_t$ such that
$$\mathrm{UCB}_t(k) < \mathrm{LCB}_t(k').$$
The time-complexity of this policy is driven by that of the convex optimization problem (2), whose size scales linearly in $t_0$ and $d$. Note also that the total number of cells in this policy is only $K$, independent of $d$.
B.3 Sparse DEEPC
Again, we assume that there exists an integer $s_0$ such that at most $s_0$ entries in $\theta^*$ are nonzero, and that the value of $s_0$ is known to the platform. We also assume that $\|\theta^*\|_2 \le 1$.
We partition the support $\mathcal{Z}$ of $z_t$ into intervals of length $\delta$ as above, and let the corresponding intervals be $Z_1, \dots, Z_K$ with centroids $a_1, \dots, a_K$. We let $\mathcal{A}_t \subseteq \{1, \dots, K\}$ represent the set of active cells at time $t$. Here, $\mathcal{A}_1 = \{1, \dots, K\}$. Fix $\alpha > 0$.
At each time $t$, estimate $\theta^*$ by solving the following convex optimization problem:
$$\max_{\vartheta \in \mathbb{R}^d} \; \sum_{t'=1}^{t-1} y_{t'}\, \vartheta^\top x_{t'} \qquad (3)$$
$$\text{subject to} \quad \|\vartheta\|_1 \le \sqrt{s_0}, \quad \|\vartheta\|_2 \le 1,$$
where $y_{t'} = 2\,\mathbb{1}\{v_{t'} \ge p_{t'}\} - 1$. We denote the estimate as $\hat{\theta}_t$.
We let
$$P_t = \big\{ e^{\hat{\theta}_t^\top x_t + a_k} : k \in \mathcal{A}_t \big\}.$$
At time $t$ we pick a price $p_t$ from $P_t$ uniformly at random. We say that cell $k$, i.e., set $Z_k$, is 'checked' if $\log p_t - \hat{\theta}_t^\top x_t \in Z_k$.
Each price selection checks one or more cells $k \in \mathcal{A}_t$. Let $N_t(k)$ be the number of times cell $k$ is checked until time $t$ and $S_t(k)$ be the total reward obtained at such times. Let
$$\hat{r}_t(k) = \frac{S_t(k)}{N_t(k)}.$$
We also compute confidence bounds for $\hat{r}_t(k)$, as follows. For each active $k$, let
$$\mathrm{UCB}_t(k) = \hat{r}_t(k) + \sqrt{\frac{\alpha \log t}{N_t(k)}}$$
and
$$\mathrm{LCB}_t(k) = \hat{r}_t(k) - \sqrt{\frac{\alpha \log t}{N_t(k)}}.$$
These represent the upper and lower confidence bounds, respectively.
We eliminate $k$ from $\mathcal{A}_t$ if there exists $k' \in \mathcal{A}_t$ such that
$$\mathrm{UCB}_t(k) < \mathrm{LCB}_t(k').$$
The time-complexity of this policy is driven by having to solve the convex optimization problem (3) at each time $t$, whose size scales linearly in $d$ and $t$. Its implementation at time $t$ can be sped up by using the solution from time $t-1$ for initialization. Note also that the total number of cells in this policy is $K$.
Appendix C Proof of Theorem 1
Consider policy DEEPC with Rounds as defined in Appendix B. The proof follows from a few technical results that we state now. We provide the statements of these results and delegate their proofs to Appendix D to not interrupt the logical flow of the proof of the theorem.
First, at the end of each round, with high probability, the set of active arms corresponds to cells with guaranteed small expected regret. More precisely, recall the definitions of $\hat{r}_s(c)$, $\mathrm{UCB}_s(c)$, and $\mathrm{LCB}_s(c)$. Let $c^*$ denote the cell whose closure contains the optimal arm $(\theta^*, a^*)$.
We have the following result.
Lemma 1
For each round $s$, let $E_s$ be the event that the following holds: the cell $c^*$ remains active at the end of round $s$, i.e., $c^* \in \mathcal{C}_{s+1}$, and for each $c \in \mathcal{C}_{s+1}$ and each $x \in \mathcal{X}$,
$$r(x, \theta^*, a^*) - r(x, \vartheta_c, a_c) = O\Big( \sqrt{\frac{\alpha \log T}{s}} \Big).$$
Then, $\Pr(E_s) \ge 1 - O(T^{-1})$.
Second, not only are the corresponding active cells guaranteed to have small expected regret with high probability, but the size (Lebesgue measure) of the set of active prices is guaranteed to be small with high probability. The next result provides an explicit bound on this size.
Lemma 2
For each round $s$, the event