Dynamic Pricing in High-dimensions
Abstract
We study the pricing problem faced by a firm that sells a large number of products, described via a wide range of features, to customers that arrive over time. Customers independently make purchasing decisions according to a general choice model that includes product features and customers’ characteristics, encoded as $d$-dimensional numerical vectors, as well as the price offered. The parameters of the choice model are a priori unknown to the firm, but can be learned as the (binary-valued) sales data accrues over time. The firm’s objective is to minimize the regret, i.e., the expected revenue loss against a clairvoyant policy that knows the parameters of the choice model in advance and always offers the revenue-maximizing price. This setting is motivated in part by the prevalence of online marketplaces that allow for real-time pricing.
We assume a structured choice model, the parameters of which depend on $s_0$ out of the $d$ product features. We propose a dynamic policy, called Regularized Maximum Likelihood Pricing (RMLP), that leverages the (sparsity) structure of the high-dimensional model and obtains a logarithmic regret in $T$. More specifically, the regret of our algorithm is of $O(s_0 \log d \cdot \log T)$. Furthermore, we show that no policy can obtain regret better than $O(s_0 (\log d + \log T))$.
1 Introduction
A central challenge in revenue management is determining the optimal pricing policy when there is uncertainty about customers’ willingness to pay. Due to its importance, this problem has been studied extensively [KL03, BZ09, BKS13, WDY14, BR12, KZ14, dBZ14, CLPL16]. Most of these models are built around the following classic setting: customers arrive over time; the seller posts a price for each customer; if the customer’s valuation is above the posted price, a sale occurs and the seller collects a revenue in the amount of the posted price; otherwise, no sale occurs and no revenue is generated. Based on this and previous feedback, the seller updates the posted price. Therefore, the seller faces an exploration-exploitation trade-off, as he needs to choose between learning about the valuations and exploiting what has been learned so far to collect revenue.
In this work, we consider a setting with a large number of products which are defined via a wide range of features. The valuations are given by $v(\theta_0, x_t)$, with $x_t$ being the (observable) feature vector of the product and $\theta_0$ representing the customer’s characteristics (the true parameters of the choice model, which are initially unknown to the seller; cf. [ARS14, CLPL16]). An important special case of this setting is the linear model in which
$$v(\theta_0, x_t) = \theta_0 \cdot x_t + \alpha_0 + z_t,$$
where $z_t$ captures the idiosyncratic noise in valuations and $\alpha_0$ is an unknown intercept.
Our setting is motivated in part by applications in online marketplaces. For instance, a company such as Airbnb recommends prices to hosts based on many features including the space (number of rooms, beds, bathrooms, etc.), amenities (AC, WiFi, washer, parking, etc.), the location (accessibility to public transportation, walk score of the neighborhood, etc.), house rules (pet-friendly, non-smoking, etc.), as well as the prediction of the demand, which itself depends on many factors including the date, events in the area, availability and prices of nearby hotels, etc. [Air15]. Therefore, the vector describing each property can have hundreds of features. Another important application comes from online advertising. Online publishers set the (reserve) price of ads based on many features including the user’s demographics, browsing history, the context of the webpage, the size and location of the ad on the page, etc.
In this work, we propose the Regularized Maximum Likelihood Pricing (RMLP) policy for dynamic pricing in high-dimensional environments. As suggested by its name, the policy uses the maximum likelihood method to estimate the true parameters of the choice model. In addition, using an ($\ell_1$-norm) regularizer, our policy exploits the structure of the optimal solution; namely, the performance of the RMLP policy significantly improves if the valuations are essentially determined by a small subset of features. More formally, the difference between the revenue obtained by our policy and the benchmark policy that knows in advance the true parameters of the choice model, $\theta_0$, is bounded by $O(s_0 \log d \cdot \log T)$, where $T$, $d$, and $s_0$ respectively denote the length of the horizon, the number of features, and the sparsity (i.e., the number of nonzero elements of $\theta_0$). We show that our results are tight up to a logarithmic factor. Namely, no policy can obtain regret better than $O(s_0 (\log d + \log T))$.
We point out that our results can be applied to applications where the feature dimension is larger than the time horizon of interest. A powerful pricing policy for these applications should obtain regret that scales gracefully with the dimension. Note that in general, little can be learned about the model parameters if $T < d$, because the number of degrees of freedom $d$ exceeds the number of observations $T$, and therefore any estimator can be arbitrarily erroneous. However, when there is prior knowledge about the structure of the unknown parameter $\theta_0$ (e.g., sparsity), accurate estimation is attainable even when $T < d$.
1.1 Organization
The rest of the paper is organized as follows: In the remaining part of the introduction, we discuss how our work is positioned with respect to the literature and highlight our contributions. In Section 2, we formally present our model and discuss the technical assumptions and the benchmark policy. The RMLP policy is presented in Section 3, followed by its analysis in Section 4. In Section 5, we provide a bound on the performance of any dynamic pricing policy that does not know the choice model in advance. In Section 6, we generalize the RMLP policy to nonlinear valuation functions. The proofs are relegated to the appendix.
1.2 Related Work
Our work contributes to the literature on dynamic pricing as well as high-dimensional statistics. In the following, we briefly overview the work closest to ours in these contexts.

Dynamic Pricing and Learning. The literature on dynamic pricing and learning has been growing over the past few years, motivated in part by the advances in big data technology that allow firms to easily collect and utilize information. We briefly discuss some of the recent lines of research in this literature. We refer to [dB15] for an excellent survey on this topic.

Parametric Approach. A natural approach to capture uncertainty about the customers’ valuations is to model the uncertainty using a small number of parameters, and then estimate those parameters using classical statistical methods such as maximum likelihood [BR12, dBZ13, dBZ14] or least squares estimation [GZ13, Kes14, BB16]. Our work is similar to this line of work, in that we assume a parametric model for customers’ valuations and apply the maximum likelihood method using the randomness of the idiosyncratic noise in valuations. However, the parameter vector $\theta_0$ is high-dimensional, with a dimension $d$ that can even exceed the time horizon of interest $T$. We use regularized maximum likelihood in order to promote sparsity structure in the estimated parameter. Further, our pricing policy has an episodic theme which makes the posted prices in each episode independent of the idiosyncratic noise in valuations, $z_t$, in that episode. This is in contrast to other policies based on maximum likelihood, such as MLE-GREEDY [BR12], or greedy iterative least squares (GILS) [Kes14, dBZ14, QB16], that use the entire history of observations to update the estimate of the model parameters at each step.

Bayesian Approach. One of the earliest works on the Bayesian parametric approach in this context is by [Rot74], who considers a Bayesian framework where the firm can choose from two prices with unknown demand and shows that (myopic) Bayesian policies may lead to “incomplete learning.” However, carefully designed variations of the myopic policies can (optimally) learn the optimal price [HKZ12]; see also [KR99, AC09, FVR10, KZ14].

Non-Parametric Models. An early work in the non-parametric setting is by [KL03]. They model the dynamic pricing problem as a multi-armed bandit (MAB) where each arm corresponds to a (discretized) posted price. They propose an algorithm with regret sublinear in $T$, where $T$ is the length of the horizon. Similar results have been obtained in more general settings [BKS13, AD14], including settings with inventory constraints [BZ09, BDKS12, WDY14].

Feature-based Models. Recent papers on dynamic pricing consider models with features/covariates. [ARS14], in a model similar to ours, present an algorithm that obtains regret $O(T^{2/3})$; they also study dynamic incentive compatibility in repeated auctions. Another work closely related to ours is by [CLPL16]. Their model differs from ours in two main aspects: their model is deterministic (no idiosyncratic noise), and the arrivals (of feature vectors) are modeled as adversarial. They propose a clever binary-search approach using the Ellipsoid method which obtains regret of $O(d^2 \log(T/d))$. [QB16] study a model where the seller can observe the demand itself, not a binary signal as in our setting. They show that a myopic policy based on least-squares estimation can obtain a logarithmic regret. To the best of our knowledge, ours is the first work that highlights the role of structure/sparsity in dynamic pricing.
[BB16] study a multi-armed bandit setting, with discrete arms and high-dimensional covariates, generalizing results of [GZ13]. [BB16] present an algorithm, using a LASSO estimator, that obtains regret $O(K s_0^2 (\log d + \log T)^2)$, where $K$ denotes the number of arms. In contrast, our setting can be interpreted as a multi-armed bandit with continuous arms in a high-dimensional space.


High-Dimensional Statistics. There has been a great deal of work on regularized estimators under high-dimensional scaling; see e.g. [VdG08]. Closer to the spirit of our work is the problem of 1-bit compressed sensing [PV13, BJ15]. In this problem, linear measurements are taken of an unknown parameter of interest, but only the signs of these measurements are observed. Note that in our problem, the seller is involved in both the learning task and the policy design. Specifically, he should decide on the prices, which directly affect the collected revenue and also indirectly influence the difficulty of the learning task. The market values are then compared with the posted prices, in contrast to 1-bit compressed sensing where the measurements are compared with zero (sign information). In addition, the pricing problem has an online nature while 1-bit compressed sensing is mostly studied in the offline setting. Finally, note that prices are set based on customers’ purchase behavior, and hence introduce dependency among the collected information about the model parameters.
1.3 Notations
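For a vector $u$, $\mathrm{supp}(u)$ represents the positions of nonzero entries of $u$. Further, for a vector $u$ and a subset $J$, $u_J$ is the restriction of $u$ to indices in $J$. We write $\|u\|_p$ for the standard $\ell_p$ norm of a vector $u$, i.e., $\|u\|_p = (\sum_i |u_i|^p)^{1/p}$, and $\|u\|_0$ for the number of nonzero entries of $u$. If the subscript $p$ is omitted, it should be deemed as the $\ell_2$ norm. For two vectors $u, v$, the notation $\langle u, v \rangle = \sum_i u_i v_i$ represents the standard inner product. For two functions $f(n)$ and $g(n)$, the notation $f(n) = O(g(n))$ means that $f$ is bounded above by $g$ asymptotically, namely, $f(n) \le C g(n)$ for some fixed positive constant $C$. Throughout, $\phi(x) = e^{-x^2/2}/\sqrt{2\pi}$ is the Gaussian density and $\Phi(x) = \int_{-\infty}^{x} \phi(u)\,\mathrm{d}u$ is the Gaussian distribution.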
2 Choice model
We consider a seller who has a product for sale in each period $t = 1, 2, \dots, T$, where $T$ denotes the length of the horizon and may be unknown to the seller. Each product is represented by an observable vector of features (covariates) $x_t \in \mathcal{X} \subseteq \mathbb{R}^d$. Products may vary across periods and we assume that feature vectors $x_t$ are sampled independently from a fixed, but a priori unknown, distribution $\mathbb{P}_X$, supported on a bounded set $\mathcal{X}$.
The product at time $t$ has a market value $v_t = v(x_t)$, which is not observed by the seller, and the function $v$ is (a priori) unknown. At each period $t$, the seller posts a price $p_t$. If $p_t \le v_t$, a sale occurs, and the seller collects revenue $p_t$. If the price is set higher than the market value, $p_t > v_t$, no sale occurs and no revenue is obtained. The goal of the seller is to design a pricing policy that maximizes the collected revenue.
We first assume that the market value of a product is a linear function of its covariates, namely
$$v(x_t) = \theta_0 \cdot x_t + \alpha_0 + z_t, \tag{1}$$
where $a \cdot b$ denotes the inner product of vectors $a$ and $b$. Here, $\{z_t\}_{t \ge 1}$ are idiosyncratic shocks, referred to as noise, which are drawn independently and identically from a distribution with mean zero and cumulative distribution function $F$, with density $f(x) = F'(x)$; cf. [KZ14]. The noise can account for the features that are not measured. We generalize our model to nonlinear valuation functions in Section 6.
The parameter $\theta_0$ is a priori unknown to the seller. Therefore, the seller faces an exploration-exploitation trade-off, as he needs to choose between learning and exploiting what has been learned so far to collect revenue.
Henceforth, we let $\mu_0 = (\theta_0, \alpha_0) \in \mathbb{R}^{d+1}$ denote the true model parameters and also define the augmented feature vectors $\tilde{x}_t = (x_t, 1)$.
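Let $y_t$ be the response variable that indicates whether a sale has occurred at period $t$:
$$y_t = \begin{cases} +1 & \text{if } v_t \ge p_t, \\ -1 & \text{if } v_t < p_t. \end{cases} \tag{2}$$
Note that the above model can be represented as the following probabilistic model:
$$y_t = \begin{cases} +1 & \text{with probability } 1 - F(p_t - \langle \mu_0, \tilde{x}_t \rangle), \\ -1 & \text{with probability } F(p_t - \langle \mu_0, \tilde{x}_t \rangle). \end{cases} \tag{3}$$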
Our proposed algorithm exploits the structure (sparsity) of the feature space to improve its performance. To this aim, let $s_0$ denote the number of nonzero coordinates of $\mu_0$, i.e., $s_0 = \|\mu_0\|_0$. We remark that $s_0$ is a priori unknown to the seller.
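As an illustration of the choice model above, the following minimal sketch simulates market values and binary sale indicators under a linear valuation with an intercept. It is only a sketch under assumptions that are not from the paper: the noise is taken to be standard logistic, the features are uniform on $[-1, 1]^d$, and the names `simulate_sales`, `mu0`, `X`, and `prices` are hypothetical.

```python
import numpy as np

def simulate_sales(mu0, X, prices, rng):
    """Return (v, y): market values and +1/-1 sale indicators under the linear model."""
    n = X.shape[0]
    X_aug = np.hstack([X, np.ones((n, 1))])        # augmented features (x_t, 1)
    z = rng.logistic(loc=0.0, scale=1.0, size=n)   # ASSUMPTION: standard logistic noise
    v = X_aug @ mu0 + z                            # market value = <mu0, x_aug> + noise
    y = np.where(v >= prices, 1, -1)               # sale (+1) iff value >= posted price
    return v, y

rng = np.random.default_rng(0)
d, n, s0 = 50, 1000, 5
mu0 = np.zeros(d + 1)
mu0[: s0 - 1] = 0.5      # a few active feature coefficients (sparse parameter)
mu0[-1] = 1.0            # intercept, stored as the last coordinate
X = rng.uniform(-1.0, 1.0, size=(n, d))   # ASSUMPTION: bounded features in [-1, 1]^d
prices = np.full(n, 1.0)                  # a fixed posted price, for illustration only
v, y = simulate_sales(mu0, X, prices, rng)
print("fraction of sales:", np.mean(y == 1))
```

The seller in the paper's setting would observe only `X`, `prices`, and `y`; the vector `v` is kept here purely to check the simulation.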
2.1 Technical assumptions
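To simplify the presentation, we assume that $\|x_t\|_\infty \le 1$ for all $x_t \in \mathcal{X}$, and $\|\mu_0\|_1 \le W$ for a known constant $W$, where for a vector $u$, $\|u\|_\infty$ denotes the maximum absolute value of its entries and $\|u\|_1 = \sum_i |u_i|$. We denote by $\Omega$ the set of feasible parameters, i.e.,
$$\Omega = \{\mu \in \mathbb{R}^{d+1} : \|\mu\|_0 \le s_0,\ \|\mu\|_1 \le W\}.$$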
We also make the following assumption on the distribution of noise .
Assumption 2.1.
The function $F(v)$ is strictly increasing. Further, $F(v)$ and $1 - F(v)$ are log-concave in $v$.
Log-concavity is a widely used assumption in the economics literature [BB05]. Note that if the density $f$ is symmetric and the distribution $F$ is log-concave, then $1 - F$ is also log-concave. Assumption 2.1 is satisfied by several common probability distributions including normal, uniform, Laplace, exponential, and logistic. Note that the cumulative distribution function of every log-concave density is also log-concave [BV04].
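A quick numerical sanity check of the log-concavity requirement can be sketched as follows. This is not a proof: it only tests that second differences of $\log h$ are non-positive on a finite grid. The helper name `is_log_concave_on_grid`, the grid range, and the tolerance are hypothetical choices, and the Cauchy distribution is included only as a contrast case that is not discussed in the text.

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))

def cauchy_cdf(x):
    return 0.5 + math.atan(x) / math.pi

def is_log_concave_on_grid(h, lo=-5.0, hi=5.0, n=2001, tol=1e-9):
    """Discrete check: second differences of log h are <= tol on a uniform grid."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    logs = [math.log(h(x)) for x in xs]
    return all(logs[i - 1] - 2.0 * logs[i] + logs[i + 1] <= tol
               for i in range(1, n - 1))

for name, cdf in [("normal", normal_cdf), ("logistic", logistic_cdf), ("cauchy", cauchy_cdf)]:
    ok_F = is_log_concave_on_grid(cdf)                          # log F concave?
    ok_S = is_log_concave_on_grid(lambda x, c=cdf: 1.0 - c(x))  # log(1 - F) concave?
    print(name, ok_F, ok_S)
```

On this grid the normal and logistic distributions pass both checks, while the heavy-tailed Cauchy distribution fails, which is consistent with its density not being log-concave.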
Our second assumption is on the product feature vectors.
Assumption 2.2.
Product feature vectors are generated independently from a probability distribution $\mathbb{P}_X$ with a bounded support $\mathcal{X} \subset \mathbb{R}^d$. We further assume that $\mathbb{E}(x_t)$ is normalized to zero. (This normalization does not imply any restriction because if $\mathbb{E}(x_t) = \bar{x} \neq 0$, then it can be absorbed in the intercept term $\alpha_0$. More precisely, we consider the model with centered features $x_t - \bar{x}$ and intercept parameter $\alpha_0 + \theta_0 \cdot \bar{x}$.) Denoting by $\Sigma = \mathbb{E}(x_t x_t^{\mathsf{T}})$ the covariance matrix of $x_t$, we assume that $\Sigma$ is a positive definite matrix. Namely, all of its eigenvalues are bounded from below by a constant $C_{\min} > 0$. We also denote the maximum eigenvalue of $\Sigma$ by $C_{\max}$.
The above assumption holds for many common probability distributions, such as the uniform and truncated normal distributions, and in general truncated versions of many more distributions. Generally, if the density of $\mathbb{P}_X$ is bounded away from zero on an open set around the origin, then it has a positive definite covariance matrix. Let us stress that we know neither the distribution $\mathbb{P}_X$ nor its covariance $\Sigma$.
2.2 Clairvoyant policy and performance metric
We evaluate the performance of our algorithm using the common notion of regret: the expected revenue loss compared with the optimal pricing policy that knows $\mu_0$ in advance (but not the realizations of $\{z_t\}_{t \ge 1}$). Let us first characterize this benchmark policy.
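Using Eq. (1), the expected revenue from a posted price $p$ is equal to
$$p \cdot \mathbb{P}(v_t \ge p) = p\,\big(1 - F(p - \langle \mu_0, \tilde{x}_t \rangle)\big).$$
Therefore, using first-order conditions, for the optimal posted price, denoted by $p^*(x_t)$, we have
$$p^*(x_t) = \frac{1 - F(p^*(x_t) - \langle \mu_0, \tilde{x}_t \rangle)}{f(p^*(x_t) - \langle \mu_0, \tilde{x}_t \rangle)}. \tag{4}$$
To simplify the presentation, let $p_t^* = p^*(x_t)$ denote the optimal price at time $t$.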
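We now define $\varphi(v) \equiv v - \frac{1 - F(v)}{f(v)}$, corresponding to the virtual valuation function commonly used in mechanism design [Mye81]. By Assumption 2.1, $\varphi$ is injective and hence we can define the function $g$ as follows:
$$g(v) \equiv v + \varphi^{-1}(-v). \tag{5}$$
It is easy to verify that $g$ is nonnegative. Note that by Eq. (4), for the optimal price we have
$$\langle \mu_0, \tilde{x}_t \rangle + \varphi\big(p_t^* - \langle \mu_0, \tilde{x}_t \rangle\big) = 0.$$
Therefore, by rearranging the terms, for the optimal price at time $t$ we have
$$p_t^* = g(\langle \mu_0, \tilde{x}_t \rangle). \tag{6}$$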
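To make the pricing function concrete, here is a minimal sketch that evaluates $g(a) = a + \varphi^{-1}(-a)$ numerically. It assumes standard logistic noise (an illustrative choice, not the paper's), for which $(1 - F(v))/f(v) = 1 + e^{-v}$ and hence $\varphi(v) = v - 1 - e^{-v}$. The inverse is found by bisection, and the bracket $[-50, 50]$ and the function names are hypothetical.

```python
import math

def virtual_value(v):
    """phi(v) = v - (1 - F(v)) / f(v); for standard logistic noise this is v - 1 - exp(-v)."""
    return v - 1.0 - math.exp(-v)

def g(a, lo=-50.0, hi=50.0, tol=1e-12):
    """g(a) = a + phi^{-1}(-a); phi is strictly increasing, so bisection applies."""
    target = -a
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if virtual_value(mid) < target:
            lo = mid
        else:
            hi = mid
    return a + 0.5 * (lo + hi)

def expected_revenue(p, a):
    """p * P(v >= p) = p * (1 - F(p - a)) for standard logistic F."""
    return p * (1.0 - 1.0 / (1.0 + math.exp(-(p - a))))

a = 1.0                       # plays the role of <mu_0, x_t> for one product
p_star = g(a)
grid = [i / 1000.0 for i in range(10001)]
p_grid = max(grid, key=lambda q: expected_revenue(q, a))
print(p_star, p_grid)         # the two prices should agree up to the grid resolution
```

Comparing `p_star` with a brute-force grid search over the expected revenue is a simple way to confirm that the first-order condition and the closed form in terms of $g$ select the same price.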
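We can now formally define the regret of a policy. Let $\pi$ be the seller’s policy that sets price $p_t$ at period $t$, where $p_t$ can depend on the history of events up to time $t$. The worst-case regret is defined as:
$$\mathrm{Regret}_\pi(T) \equiv \max_{\mu_0 \in \Omega,\ \mathbb{P}_X \in Q(\mathcal{X})} \mathbb{E}\left[\sum_{t=1}^{T} \Big(p_t^*\, \mathbb{I}(v_t \ge p_t^*) - p_t\, \mathbb{I}(v_t \ge p_t)\Big)\right], \tag{7}$$
where the expectation is with respect to the distribution of the idiosyncratic noise, $z_t$, and $\mathbb{P}_X$, the distribution of feature vectors. Moreover, $Q(\mathcal{X})$ represents the set of probability distributions supported on a bounded set $\mathcal{X}$.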
Our algorithm uses the sparsity structure of $\mu_0$ and learns the model with an order of magnitude less data compared to a structure-ignorant algorithm. In Section 4, we show that our pricing scheme achieves a regret bound of $O(s_0 \log d \cdot \log T)$.
3 A Regularized Maximum Likelihood Pricing (RMLP) Policy
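$$\hat{\mu}^k = \arg\min_{\|\mu\|_1 \le W} \big\{ \mathcal{L}(\mu) + \lambda_k \|\mu\|_1 \big\} \tag{8}$$
$$\mathcal{L}(\mu) = -\frac{1}{\tau_{k-1}} \sum_{t=\tau_{k-1}}^{\tau_k - 1} \Big\{ \mathbb{I}(y_t = +1) \log\big(1 - F(p_t - \langle \mu, \tilde{x}_t \rangle)\big) + \mathbb{I}(y_t = -1) \log F(p_t - \langle \mu, \tilde{x}_t \rangle) \Big\} \tag{9}$$
$$p_t = g(\langle \hat{\mu}^k, \tilde{x}_t \rangle) \tag{10}$$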
In this section, we present our dynamic pricing policy. Our policy runs in an episodic fashion. Episodes are indexed by $k$ and time periods are indexed by $t$. The length of episode $k$ is denoted by $\tau_k$. Throughout episode $k$, we set the prices equal to $p_t = g(\langle \hat{\mu}^k, \tilde{x}_t \rangle)$, where $\hat{\mu}^k$ denotes the estimate of $\mu_0$ obtained from the observations in the previous episode. Note that by Eq. (5), $p_t$ is the optimal posted price if $\hat{\mu}^k$ were the true underlying parameter of the model.
We estimate $\mu_0$ using a regularized maximum-likelihood estimator; see Eq. (25), where the (normalized) negative log-likelihood function for $\mu$ is given by Eq. (26). We note that as a consequence of the log-concavity assumption on $F$ and $1 - F$, the optimization problem (25) is a convex problem. There is a large toolkit of optimization methods (e.g., alternating direction method of multipliers (ADMM), fast iterative shrinkage-thresholding algorithm (FISTA), accelerated projected gradient descent, among many others) that can be used to solve this optimization problem. There are also recent developments on distributed solvers for $\ell_1$-regularized cost functions [BPC11].
Observe that by design, prices posted in the $k$-th episode are independent of the market value noises in this episode, i.e., $\{z_t\}_{t=\tau_k}^{\tau_{k+1}-1}$. This allows us to estimate $\mu_0$ for each episode separately; see Proposition 8.1 in Section 8.1. Compared with policies that use the entire sales history in making decisions, some remarks are in order:
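A regularized maximum-likelihood step of this kind can be sketched with a plain proximal-gradient (ISTA) loop. The sketch below rests on several assumptions that are not from the paper: the noise is standard logistic, so that the negative log-likelihood reduces to the mean of $\log(1 + e^{y_t u_t})$ with $u_t = p_t - \langle \mu, \tilde{x}_t \rangle$; any $\ell_1$-ball constraint on $\mu$ is ignored; and the names `l1_mle`, `lam`, and the synthetic data are hypothetical.

```python
import numpy as np

def neg_log_lik(mu, X_aug, prices, y):
    """Normalized negative log-likelihood for logistic noise: mean log(1 + exp(y * u))."""
    u = prices - X_aug @ mu
    return np.mean(np.logaddexp(0.0, y * u))

def soft_threshold(w, kappa):
    """Proximal operator of kappa * ||.||_1."""
    return np.sign(w) * np.maximum(np.abs(w) - kappa, 0.0)

def l1_mle(X_aug, prices, y, lam, n_iter=500):
    """ISTA for min_mu neg_log_lik(mu) + lam * ||mu||_1 (no l1-ball projection)."""
    n, m = X_aug.shape
    # The gradient is Lipschitz with constant at most ||X||_2^2 / (4 n) for the logistic loss.
    step = 4.0 * n / (np.linalg.norm(X_aug, 2) ** 2)
    mu = np.zeros(m)
    for _ in range(n_iter):
        u = prices - X_aug @ mu
        s = 0.5 * (1.0 + np.tanh(0.5 * y * u))   # sigmoid(y * u), numerically stable
        grad = -(X_aug.T @ (s * y)) / n
        mu = soft_threshold(mu - step * grad, step * lam)
    return mu

rng = np.random.default_rng(1)
n, d = 400, 30
X_aug = np.hstack([rng.uniform(-1.0, 1.0, (n, d)), np.ones((n, 1))])
mu0 = np.zeros(d + 1)
mu0[[0, 1, d]] = [1.0, -1.0, 0.5]              # sparse truth, intercept in last slot
prices = rng.uniform(0.0, 2.0, n)              # illustrative prices, independent of the noise
y = np.where(X_aug @ mu0 + rng.logistic(size=n) >= prices, 1, -1)
lam = 0.05                                     # illustrative regularization level
mu_hat = l1_mle(X_aug, prices, y, lam)
print("nonzeros in estimate:", np.count_nonzero(mu_hat))
```

Accelerated variants such as FISTA, or ADMM, would follow the same pattern; plain ISTA is used here only because it is the shortest to state.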

Perishability of data: In practical applications, the unknown demand parameters will change over time, raising the concern of perishability of data. Namely, collected data becomes obsolete after a while and cannot be relied on for estimating the model parameters [KZ16, Jav17]. Common practical policies to mitigate this problem (discussed in [KZ16]) include moving windows and decaying weights, which use only recent data to learn the model parameters. In contrast, methods that use the entire historical data suffer from this problem.

Simplicity and efficiency: In the RMLP policy, estimates of the model parameters are updated only at the first period of each episode ($O(\log T)$ updates). Further, at each update, the policy uses only the historical data from the previous episode. These two ideas together not only allow for a neat analysis of the statistical dependency among samples but also decrease the computational cost. Scalability of the pricing policy is indispensable in practical applications, as sales data is collected at an unprecedented rate.

Effect on regret: By using about half of the historical data at each update, our policy loses at most a constant factor in the total regret. (This becomes clear shortly when we discuss the estimation error rate in terms of the number of samples.)
The lengths of episodes in our algorithm increase geometrically ($\tau_k = 2^{k-1}$), allowing for a more accurate estimate of $\mu_0$ as the episode index grows. The algorithm terminates at the end of the horizon (period $T$), but note that it does not need to know the length of the horizon in advance.
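The doubling schedule can be sketched in a few lines. The sketch assumes that episode $k$ has length $2^{k-1}$ and that the first episode starts at period 1; the generator name `episode_bounds` is hypothetical. The horizon `T` is used here only to truncate the last episode, in line with the remark that the policy itself does not need to know it.

```python
def episode_bounds(T):
    """Yield (k, start, end) for episodes of length 2**(k-1) covering periods 1..T inclusive."""
    k, start = 1, 1
    while start <= T:
        length = 2 ** (k - 1)
        end = min(start + length - 1, T)   # the last episode is cut off at the horizon
        yield k, start, end
        start, k = end + 1, k + 1

episodes = list(episode_bounds(20))
print(episodes)
# [(1, 1, 1), (2, 2, 3), (3, 4, 7), (4, 8, 15), (5, 16, 20)]
```

Because episode $k$ starts at period $2^{k-1}$, the number of episodes by time $T$ is $\lfloor \log_2 T \rfloor + 1$, which is why only logarithmically many parameter updates are needed.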
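The regularization parameter $\lambda_k$ constrains the $\ell_1$ norm of the estimator $\hat{\mu}^k$. Selecting the value of $\lambda_k$ is of crucial importance as it affects the estimation error. We set it as $\lambda_k \sim \sqrt{(\log d)/\tau_k}$. More precisely, define
$$u_W \equiv \sup_{|x| \le 3W} \max\big\{ \log' F(x),\ -\log'(1 - F(x)) \big\},$$
where the derivatives are w.r.t. $x$. By the log-concavity property of $F$ and $1 - F$, we have
$$u_W = \max\big\{ \log' F(-3W),\ -\log'(1 - F(3W)) \big\}.$$
Hence, $u_W$ captures the steepness of $\log F$.
In order to minimize the regret, we run the RMLP policy with
$$\lambda_k = 4 u_W \sqrt{\frac{\log d}{\tau_k}}. \tag{11}$$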
Note that exploration and exploitation tasks are mixed in our algorithm. At the beginning of each episode, we use what was learned from the previous episode to improve the estimate of $\mu_0$, and then we exploit this estimate throughout the current episode to incur little regret. Meanwhile, the observations gathered in the current episode are used to update our estimate of $\mu_0$ for the next episode. We analyze the performance of RMLP in the next section.
4 Regret analysis
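Although the description of RMLP is oblivious to the sparsity $s_0$, its performance depends on the structure of the optimal solution. The following theorem bounds the regret of our dynamic pricing policy.
Theorem 4.1 (Regret Upper Bound).
Suppose that Assumptions 2.1 and 2.2 hold. Then, the regret of the RMLP policy is of $O(s_0 \log d \cdot \log T)$.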
Below we provide an outline for the proof of Theorem 4.1 and defer its complete proof to Section 8.1.

1. In RMLP, the updates of the model parameter estimate occur only at the beginning of each episode, using only the samples collected in the previous episode. Therefore, the prices posted in each episode are independent of the market value noises in that episode. This observation also verifies that $\mathcal{L}(\mu)$, given by (26), is indeed the negative log-likelihood of the samples collected in the $k$-th episode. Note that this independence is not mere serendipity; rather, it holds because of the specific design of the RMLP policy. Using this property, we use tools from high-dimensional statistics to bound the estimation error. To bound the error term $\|\hat{\mu}^k - \mu_0\|_2$, we compare the function values $\mathcal{L}(\mu_0)$ and $\mathcal{L}(\hat{\mu}^k)$. The main challenge here is that $\mathcal{L}(\mu)$ is not strictly convex in $\mu$. (Note that the Hessian $\nabla^2 \mathcal{L}(\mu)$ is a weighted sum of the rank-one matrices $\tilde{x}_t \tilde{x}_t^{\mathsf{T}}$ over the periods $t$ of the previous episode. Therefore, $\nabla^2 \mathcal{L}(\mu)$ is a matrix of rank at most $\tau_{k-1}$. Hence, $\mathcal{L}(\mu)$ is strictly convex in $\mu$ only if $\tau_{k-1} \ge d + 1$. However, since we are not updating our estimates in the middle of an episode, episodes of length at least $d$ would cause the regret to scale linearly in $d$, which is not desired.) Hence, there can be, in principle, parameter vectors $\mu_1$ and $\mu_2$ that are far from each other while the values of the function $\mathcal{L}$ at these points are nevertheless close to each other.
To cope with this challenge, we show that a so-called restricted eigenvalue condition holds for the product features. This notion implies that $\mathcal{L}(\mu)$ is strictly convex on the set of sparse vectors. (It is strictly convex over the set of $s_0$-sparse vectors in dimension $d$ if the number of samples is above $c\, s_0 \log d$ for a suitable constant $c$.) Using the restricted eigenvalue condition, we show that the following error bound for the regularized log-likelihood estimate in the $k$-th episode, $\hat{\mu}^k$, holds true:
$$\|\hat{\mu}^k - \mu_0\|_2^2 = O\!\left(\frac{s_0 \log d}{\tau_{k-1}}\right).$$
As expected, the estimate gets more accurate as the episode’s length increases; see Section 8.1 for more details.

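2. For any $t$, denote by $r_t(p) = p\,\big(1 - F(p - \langle \mu_0, \tilde{x}_t \rangle)\big)$ the expected revenue under price $p$. We bound $r_t(p_t^*) - r_t(p_t)$ in terms of $(p_t - p_t^*)^2$. Since $p_t^* = \arg\max_p r_t(p)$, we have $r_t'(p_t^*) = 0$, and by Taylor expansion of $r_t$ around $p_t^*$, we obtain $r_t(p_t^*) - r_t(p_t) = O\big((p_t - p_t^*)^2\big)$.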

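3. For $t$ in the $k$-th episode, namely $\tau_k \le t < \tau_{k+1}$, we have
$$|p_t - p_t^*| = \big|g(\langle \hat{\mu}^k, \tilde{x}_t \rangle) - g(\langle \mu_0, \tilde{x}_t \rangle)\big| \le C\, \big|\langle \tilde{x}_t, \hat{\mu}^k - \mu_0 \rangle\big|$$
for a constant $C$, which follows by showing that $g$ is Lipschitz. Further, by Assumption 2.2 (without loss of generality assume ), we have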
where the equality holds because is independent of . The inequality holds because and therefore
(12) from which we obtain that the maximum eigenvalue of is at most .
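4. Let $R_t$ be the regret incurred at step $t$. Combining the above bounds (steps 2 and 3), we arrive at $\mathbb{E}[R_t] = O(s_0 \log d / \tau_{k-1})$ for $t$ in the $k$-th episode. Therefore, the cumulative expected regret in episode $k$ works out at $O(s_0 \log d)$. Since the lengths of episodes increase geometrically, there are $O(\log T)$ episodes by time $T$. This implies that the total expected regret by time $T$ is $O(s_0 \log d \cdot \log T)$.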
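The chain of bounds in the outline can be collected into a single display. This is only a sketch under stated assumptions: episode lengths $\tau_k = 2^{k-1}$, a squared price error of order $s_0 \log d / \tau_{k-1}$ in episode $k$, and generic constants $c_1, c_2$ (hypothetical names) that absorb the curvature of the revenue function, the Lipschitz constant of $g$, and the feature covariance. The first episode contributes only a bounded amount and is left out of the sum.

```latex
\begin{align*}
r_t(p_t^*) - r_t(p_t)
  &= -\tfrac{1}{2}\, r_t''(\tilde{p})\,(p_t - p_t^*)^2
     && \text{(Taylor expansion, } r_t'(p_t^*) = 0,\ \tilde{p} \text{ between } p_t \text{ and } p_t^*) \\
  &\le c_1\,(p_t - p_t^*)^2 , \\
\mathbb{E}\big[(p_t - p_t^*)^2\big]
  &\le c_2\,\frac{s_0 \log d}{\tau_{k-1}}
     && \text{for } t \text{ in episode } k \ge 2 , \\
\sum_{t \,\in\, \text{episode } k} \mathbb{E}\big[r_t(p_t^*) - r_t(p_t)\big]
  &\le c_1 c_2\,\tau_k\,\frac{s_0 \log d}{\tau_{k-1}}
   = 2\,c_1 c_2\, s_0 \log d , \\
\sum_{k=2}^{K} 2\,c_1 c_2\, s_0 \log d
  &\le 2\,c_1 c_2\, s_0 \log d \cdot \log_2 T
     && \text{since } K \le \lfloor \log_2 T \rfloor + 1 .
\end{align*}
```

The key point is that the doubling schedule makes the ratio $\tau_k / \tau_{k-1}$ constant, so every episode contributes the same order of regret and the total grows only with the number of episodes.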
4.1 Comparison with the “common” regret bound of $O(\sqrt{T})$
There is an often-seen $O(\sqrt{T})$ regret bound in the literature on online decision making, which can be improved to a logarithmic regret bound if some type of “separability assumption” holds true [DHK08, AYPS12]. The separability assumption posits that there is a positive constant gap between the rewards of the best and the second-best actions. In our framework, the parameter $\mu_0$ belongs to a continuous set in $\mathbb{R}^{d+1}$, and therefore the separability assumption cannot be enforced: by choosing $\mu$ arbitrarily close to $\mu_0$, one can obtain a suboptimal (but arbitrarily close to optimal) reward. However, our policy achieves $O(\log T)$ regret. Here, we contrast our logarithmic bound with the folklore $\sqrt{T}$ bound to build further insight on our results.
Uninformative prices and the $\Omega(\sqrt{T})$ lower bound.
We focus on [BR12], whose framework is close to ours in that it considers dynamic pricing based on purchasing decisions and presents a pricing policy based on maximum likelihood estimation with regret $O(\sqrt{T})$. Adopting their notation, it is assumed that market values are independent and identically distributed random variables coming from a distribution function that belongs to some family parametrized by $z$. Denote by $d(p; z)$ the demand curve. This curve determines the probability of a purchase at a given price, i.e., $d(p; z) = \mathbb{P}_z(V \ge p)$. [BR12] show that the worst-case regret of any pricing policy must be at least $\Omega(\sqrt{T})$ (see Theorem 3.1 therein). The bound is proved by considering a specific family of demand curves such that all demand curves in this family intersect at a common price. Further, the common price is the optimal price for a specific choice of parameter $z_0$, i.e., $p^*(z_0)$. (Specifically, they consider $d(p; z) = 1/2 + z - zp$. Hence $d(1; z) = 1/2$ for all $z$, and it is shown that $p^*(z_0) = 1$ for $z_0 = 1/2$.) Therefore, the price $p^*(z_0)$ is “uninformative” since no policy can gain information about the demand parameter $z$ while pricing at $p^*(z_0)$. The idea behind the derived lower bound for the worst-case regret is that for a policy to learn the underlying demand curve fast enough, it must necessarily choose prices that are away from the (uninformative) price $p^*(z_0)$, and this leads to a large regret when the true demand curve is indeed $d(\cdot\,; z_0)$.
Intuition behind our results.
In contrast to the previous case, for our framework there is no such uninformative price. First, note that for a choice model with parameters $\mu$, the demand curve at time $t$ is given by
$$d_t(p; \mu) = \mathbb{P}(v_t \ge p) = 1 - F(p - \langle \mu, \tilde{x}_t \rangle).$$
For , we define the aggregate demand function up to time as . In the following, we argue that under our setting, there is no uninformative price. For any price and any , , we have
where is the matrix with rows , for . We also used the fact that for some constant because $F$ is strictly increasing by Assumption 2.1. As we show in Appendix A, for (with a proper constant), satisfies a so-called “restricted eigenvalue” condition, by which we have
(13) 
Therefore, for any fixed price $p$, if we vary the demand parameters $\mu$ to some other value $\mu'$, then the aggregate demand at price $p$ also changes by an amount proportional to $\|\mu - \mu'\|$. Hence, any price in this setting is informative about the model parameters.
To build further insight, let us consider a more general choice model, where the utility of the customer from buying a product with feature vector $x_t$ at price $p_t$ is given by
$$u_t = \langle \mu_0, \tilde{x}_t \rangle - \beta_0\, p_t + z_t, \tag{14}$$
where $\mu_0$ and $\beta_0$ are unknown model parameters and $z_t$ is the noise term. The customer buys the product iff $u_t \ge 0$. Note that the model we studied in this paper (see Equation (2)) is the special case in which the price sensitivity $\beta_0$ is known and hence can be normalized to $1$. We next argue that in the case of unknown $\beta_0$, uninformative prices do exist and hence the $\Omega(\sqrt{T})$ lower bound is still in place.
To see this, fix arbitrary , and let and . Then, the demand curves will be unaltered over time and are given by
It is easy to verify that is the optimal price for the specific choice of . Further, all the demand curves intersect at (they all have the value at this price). Therefore, is an uninformative price and no policy can gain information about by pricing at . However, when , choosing prices that are away from this uninformative price leads to a large regret. Prices that are close to do not yield any information gain, and contrasting these two points, it can be shown that the worst-case regret is of order $\sqrt{T}$. A formal proof follows the same lines as the proof of [BR12, Theorem 3.1] and is omitted.
Finally, it is worth noting that the rate of learning the demand parameter is chiefly driven by three factors:

Non-smoothness of the distribution function $F$, as it controls the amount of information obtained about $\mu_0$ at each period $t$. This is captured by the quantity defined by (34).

The rate at which the feature vectors span the parameter space. This is controlled through the minimum eigenvalue of $\Sigma$, i.e., $C_{\min}$. If $C_{\min}$ is small, the randomly generated features are relatively aligned and one requires a larger sample size to estimate $\mu_0$ within a specified accuracy.

Complexity of $\mu_0$. This is captured through the sparsity measure $s_0$.
The contribution of these factors to the learning rate can be clearly seen in our derived learning bound (105).
4.2 Role of $C_{\min}$
In establishing our results, we relied on Assumption 2.2, which requires the population covariance of features to be positive definite. The lower bound on its eigenvalues, denoted by $C_{\min}$, enters our regret bound as a multiplicative factor.
As evident from the proof of Proposition 8.1, Assumption 2.2 can be replaced by the weaker restricted eigenvalue condition [BvdG11, CT07], which is a common assumption in high-dimensional statistical learning. While the assumption $C_{\min} > 0$ allows for a fast learning rate of model parameters and a regret bound of $O(\log T)$, the RMLP policy can still provably achieve regret $O(\sqrt{(\log d)\, T})$, even when $C_{\min} = 0$.
Theorem 4.2.
Suppose that product feature vectors are generated independently from a probability distribution $\mathbb{P}_X$ with a bounded support $\mathcal{X}$. Under Assumption 2.1, the regret of the RMLP policy is of $O(\sqrt{(\log d)\, T})$.
5 Lower bound on regret
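As discussed in Section 2.2, if the true parameter $\mu_0$ is known, the optimal policy (in terms of expected revenue) is the one that sets prices as $p_t^* = g(\langle \mu_0, \tilde{x}_t \rangle)$. Let $\mathcal{H}_t$ denote the history set up to time $t$, and recall that $\Omega$ denotes the set of feasible parameters, i.e., $\Omega = \{\mu \in \mathbb{R}^{d+1} : \|\mu\|_0 \le s_0,\ \|\mu\|_1 \le W\}$. We consider the following set of policies, $\Pi$: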
(15) 
Here denotes the price posted by policy at time .
We provide a lower bound on the regret achievable by any policy in the set $\Pi$. Indeed, this lower bound applies to an oracle who fully observes the market values after the price is either accepted or rejected. Compared to our setting, where the seller observes only the binary feedback (purchase/no purchase), this oracle appears exceedingly powerful at first sight, but surprisingly, the derived lower bound matches the regret of our dynamic policy up to a logarithmic factor.
Theorem 5.1.
In the following we give an outline for the proof of Theorem 5.1, summarizing its main steps and defer the complete proof to Section 8.3.

We derive a lower bound for regret in terms of the minimax estimation error. Specifically, for , let
(17) be the regret at period . Define . We show that
(18) for some constants .

Let and define . We use a standard argument (Le Cam’s method) that relates the minimax risk to the error in a multi-way hypothesis testing problem [Tsy08]. We first construct a maximal set of points in $\Omega$ such that the minimum pairwise distance among them is at least $\delta$. (Such a set is usually referred to as a $\delta$-packing in the literature.) Here $\delta$ is a free parameter to be determined in the proof. We then use a standard reduction to show that any estimator with small minimax risk should necessarily solve a hypothesis testing problem over the packing set with small error probability. More specifically, suppose that nature chooses one point from the packing set uniformly at random and, conditional on nature’s choice of the parameter vector, the market values are generated according to the linear model (1) with that parameter. The problem is reduced to lower bounding the error probability in distinguishing among the candidates in the packing set using the observed market values.

We apply Fano’s inequality from information theory to lower bound the probability of error [Tsy08]. The Fano bound involves the logarithm of the cardinality of the packing set as well as the mutual information between the observations (market values) and the random parameter vector chosen uniformly at random from the packing set. Le Cam’s method is used to derive a minimax risk lower bound for a single estimator, while here we have a sequence of estimators and need to adjust Le Cam’s method to obtain the corresponding lower bound for the whole sequence.
6 Nonlinear valuation function
In previous sections, we focused exclusively on linear valuation function given by Eq (1). Here, we extend our results and assume that the market valuations are modeled by a nonlinear function that depends on products’ features and an independent noise term. Specifically, the market value of a product with feature vector is given by
(19) 
where the original features are transformed by a feature mapping , and function is a general function that is log-concave and strictly increasing. Important examples of this model include the log-log model (, ), the semi-log model (, ), and the logistic model (, ).
Model (19) allows us to capture correlations and nonlinear dependencies on the features. We next state our assumption on the feature mapping and then discuss our dynamic pricing policy and its regret bound for the general setting (19).
Assumption 6.1.
Let be an (unknown) distribution from which the original features are sampled independently. Suppose that the feature mapping has a continuous derivative and denote by , the covariance of the feature vector under . We assume that there exist constants and such that for every eigenvalue of , we have .
By Assumption 2.1, has bounded support, and since has a continuous derivative, it is Lipschitz on ; hence the image of under remains bounded. Therefore, the new features are also sampled independently from a bounded set. The condition on is analogous to that on , as required by Assumption 2.2 for the linear setting.
Depending on the feature mapping , the validity of Assumption 6.1 may depend on all moments of the distribution . We provide an alternative to this assumption, which depends only on the feature mapping and the second moment of . In stating the assumption, we use the notation to denote the derivative matrix of a feature mapping . Precisely, for , with real-valued functions defined on , we write .
Assumption 6.2.
Suppose that the feature mapping has a continuous derivative and that its derivative is full-rank for almost all . In addition, there exist constants and such that for every eigenvalue of the covariance , we have .
Recall that the noise terms are drawn independently and identically from a distribution with cumulative distribution function and density . Let be the hazard rate function of the distribution . For a log-concave function , we define
(20) 
Note that and, since is log-concave, this term is decreasing. Further, since is log-concave, its hazard rate is increasing (see the proof of Lemma C.1). Combining these observations, we have that is increasing. Consequently,

The right-hand side of (20) is strictly increasing and hence is well-defined.

We have , for all . This implies that , for all .
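The monotonicity of the hazard rate invoked above can be checked numerically for a concrete log-concave distribution. The sketch below uses standard normal noise purely as an illustrative assumption; the function names are hypothetical and do not come from the paper.

```python
import math


def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_sf(z):
    """Survival function 1 - F(z); erfc keeps the right tail accurate."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def hazard_rate(z):
    """Hazard rate f(z) / (1 - F(z)) of the standard normal distribution."""
    return normal_pdf(z) / normal_sf(z)


# Evaluate the hazard rate on a grid over [-5, 5] and test monotonicity.
grid = [-5.0 + 0.1 * i for i in range(101)]
rates = [hazard_rate(z) for z in grid]
is_increasing = all(a < b for a, b in zip(rates, rates[1:]))
```

A grid check of this kind is only a sanity test for one distribution, not a proof; the general statement is the one established in Lemma C.1.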
It is worth noting that for (linear model), we have , where is defined by (5). Our pricing policy for the nonlinear model is conceptually similar to that of the linear setting: the policy runs in an episodic manner. During episode , the prices are set as , where denotes the estimate of the true parameters using a regularized maximum-likelihood estimator applied to observations in the previous episode, and .
We describe our (modified) RMLP policy in Algorithm 2. There are a few differences between Algorithm 2 and Algorithm 1. First, the features are replaced by . Second, in the regularized estimator, prices are replaced by . Third, in the last step of the algorithm, prices are set as , with defined by Equation (20).
(21) 
(23) 
Our next theorem bounds the regret of our pricing policy (Algorithm 2).
Theorem 6.3.
The proof of Theorem 6.3 is given in Appendix 8.4. Here, we summarize its key ingredients.

By the increasing property of , a sale occurs at period when . Hence, the log-likelihood for this setting takes the form used in the regularized estimator of Algorithm 2. By virtue of Assumption 6.1 (or its alternative, Assumption 6.2), we obtain an estimation error bound for the regularized estimator similar to the one in Proposition 8.1.

Similar to our derivation for the linear setting, we show that the optimal pricing policy that knows in advance is given by , where is defined based on Equation (20).

The difference between the posted price and the optimal price can be bounded as , for a constant . This bound is similar to the corresponding bound for the linear setting, and following the same line of argument as in our regret analysis for that case, we get .
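To make the notion of a clairvoyant optimal price concrete, the following sketch computes the revenue-maximizing price numerically in the linear special case. It assumes that a sale occurs exactly when the mean valuation plus noise is at least the posted price, and it uses standard logistic noise as an illustrative choice. The names `logistic_sf`, `expected_revenue`, and `optimal_price` are hypothetical, and the ternary search is a generic stand-in for the closed-form pricing function of the paper.

```python
import math


def logistic_sf(z):
    """Survival function 1 - F(z) of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(z))


def expected_revenue(price, mean_value):
    """Expected revenue: price times the probability that a sale occurs."""
    return price * logistic_sf(price - mean_value)


def optimal_price(mean_value, iters=200):
    """Maximize expected revenue over the price by ternary search.

    For log-concave noise the revenue is unimodal in the price on
    (0, infinity). For logistic noise the maximizer is below
    max(mean_value, 0) + 2, which is used as the upper end of the bracket.
    """
    lo, hi = 0.0, max(mean_value, 0.0) + 2.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if expected_revenue(m1, mean_value) < expected_revenue(m2, mean_value):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


price_at_zero = optimal_price(0.0)
price_at_one = optimal_price(1.0)
```

Under these assumptions the maximizer satisfies the first-order condition that the price equals the inverse hazard rate at the price minus the mean valuation, which for logistic noise reads `p = 1 + exp(-(p - mean_value))`. The computed price also increases with the mean valuation.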
7 Knowledge of market noise distribution
The proposed RMLP policy assumes that the market noise distribution is known to the seller. Knowledge of is used both in estimating the model parameters and in setting the prices . On the other hand, the benchmark policy is also assumed to have access to the model parameters and the distribution . Therefore, the regret bound established in Theorem 4.1 essentially measures how much revenue the seller loses due to lack of knowledge of the underlying model parameters. In practice, however, the underlying distribution of valuations is not given, and this raises the question of distribution-independent pricing policies.
It is worth mentioning that in some applications, although the underlying distribution of valuations is unknown, it belongs to a known class of distributions. For example, lognormal distributions have proved to be a good fit for the distribution of valuations of advertisers in online advertising markets [EOS07, LP07, XYL09, BFMM14]. In Section 7.1, we consider a model where the underlying distribution belongs to a known class of log-concave distributions and propose a policy whose regret is . We also argue that no policy can get a better regret bound.
Next, we pursue pricing policies under completely unknown distribution. Here, the regret is measured against an optimal clairvoyant policy that has full knowledge of the model parameters and market noise realizations, , and thus extracts the customers’ valuation at each step. Note that such a clairvoyant policy is much more powerful than the one considered in previous sections, as now it has access to noise realizations while before it only had knowledge of the noise distribution .
7.1 Unknown distribution from a known class
Suppose that the market noise terms are generated from a log-concave distribution (e.g., Lognormal), with unknown mean and unknown variance . Without loss of generality, we can assume that ; otherwise, in the valuation model (1), can be absorbed in the intercept term . We next explain how the RMLP policy can be adapted to this case.
Define and consider the transformation , , , . Then, the valuation model (1) can be written as
(24) 
where are drawn from . To lighten the notation, we use the shorthand . We also let . The response variables are then given by .
(25) 
(26) 
(27) 
We propose a variant of the RMLP policy, called RMLP2, for this case. Similar to RMLP, it runs in an episodic manner, but the length of the episodes grows linearly. (Episode is of length periods.) At the first period of each episode, the price is chosen randomly and independently of the feature vectors. To be concrete, we set the price uniformly at random from . At the other periods of the episode, the price is set optimally based on the current estimate of the model parameters. Specifically, for episode , we set , where the pricing function is defined based on distribution , given by (5), and the estimates are obtained via regularized log-likelihood. In forming the log-likelihood loss, we only consider the first period of each episode, where the prices are set randomly; for , we denote by the set of first periods in episodes , and write the log-likelihood based on the samples in :