Revenue Optimization with Approximate Bid Predictions
Abstract
In the context of advertising auctions, finding good reserve prices is a notoriously challenging learning problem. This is due to the heterogeneity of ad opportunity types and the non-convexity of the objective function. In this work, we show how to reduce reserve price optimization to the standard setting of prediction under squared loss, a well-understood problem in the learning community. We further bound the gap between the expected bid and revenue in terms of the average loss of the predictor. This is the first result that formally relates the revenue gained to the quality of a standard machine-learned model.
Andrés Muñoz Medina Google Research 76 9th Ave New York, NY 10011 Sergei Vassilvitskii Google Research 76 9th Ave New York, NY 10011
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
1 Introduction
A crucial task for revenue optimization in auctions is setting a good reserve (or minimum) price. Set it too low, and the sale may yield little revenue; set it too high, and there may not be anyone willing to buy the item. The celebrated work by Myerson (1981) shows how to optimally set reserves in second-price auctions, provided the value distribution of each bidder is known.
In practice there are two challenges that make this problem significantly more complicated. First, the value distribution is never known directly; rather, the auctioneer can only observe samples drawn from it. Second, in the context of ad auctions, the items for sale (impressions) are heterogeneous, and there are literally trillions of different types of items being sold. It is therefore likely that a specific type of item has never been observed previously, and no information about its value is known.
A standard machine learning approach addressing the heterogeneity problem is to parametrize each impression by a feature vector, with the underlying assumption that bids observed from auctions with similar features will be similar. In online advertising, these features encode, for instance, the ad size, whether it's mobile or desktop, etc.
The question is, then, how to use the features to set a good reserve price for a particular ad opportunity. On the face of it, this sounds like a standard machine learning question: given a set of features, predict the value of the maximum bid. The difficulty comes from the shape of the loss function. Much of the machine learning literature is concerned with optimizing well-behaved loss functions, such as squared loss or hinge loss. The revenue function, on the other hand, is discontinuous and strongly non-concave, making a direct attack a challenging proposition.
In this work we take a different approach and reduce the problem of finding good reserve prices to a prediction problem under the squared loss. In this way we can rely upon many widely available and scalable algorithms developed to minimize this objective. We proceed by using the predictor to define a judicious clustering of the data, and then compute the empirically maximizing reserve price for each group. Our reduction is simple and practical, and directly ties the revenue gained by the algorithm to the prediction error.
1.1 Related Work
Optimizing revenue in auctions has been a rich area of study, beginning with the seminal work of Myerson (1981), who introduced optimal auction design. Follow-up work by Chawla et al. (2007) and Hartline and Roughgarden (2009), among others, extended his results to increasingly complex settings, taking into account multiple items, diverse demand functions, and weaker assumptions on the shape of the value distributions.
Most of the classical literature on revenue optimization focuses on the design of optimal auctions when the bidding distribution of buyers is known. More recent work has considered the computational and information theoretic challenges in learning optimal auctions from data. A long line of work (Cole and Roughgarden, 2015; Devanur et al., 2016; Dhangwatnotai et al., 2015; Morgenstern and Roughgarden, 2015, 2016) analyzes the sample complexity of designing optimal auctions. The main contribution of this direction is to show that under fairly general bidding scenarios, a near-optimal auction can be designed knowing only a polynomial number of samples from bidders' valuations. Other authors (Leme et al., 2016; Roughgarden and Wang, 2016) have focused on the computational complexity of finding optimal reserve prices from samples, showing that even for simple mechanisms the problem is often NP-hard to solve directly.
Another well-studied approach to data-driven revenue optimization is that of online learning. Here, auctions occur one at a time, and the learning algorithm must compute prices as a function of the history of the algorithm. These algorithms generally make no distributional assumptions and measure their performance in terms of regret: the difference between the algorithm's performance and the performance of the best fixed reserve price in hindsight. Kleinberg and Leighton (2003) developed an online revenue optimization algorithm for posted-price auctions that achieves low regret. Their work was later extended to second-price auctions by Cesa-Bianchi et al. (2015).
A natural approach in both of these settings is to attempt to predict an optimal reserve price, equivalently the highest bid submitted by any of the buyers. While the problem of learning this reserve price is well understood for the simplistic model of buyers with i.i.d. valuations (Cesa-Bianchi et al., 2015; Devanur et al., 2016; Kleinberg and Leighton, 2003), the problem becomes much more challenging in practice, when the valuations of a buyer also depend on features associated with the ad opportunity (for instance, user demographics and publisher information).
This problem is not nearly as well understood as its i.i.d. counterpart. Mohri and Medina (2014) provide learning guarantees and an algorithm based on DC programming to optimize revenue in second-price auctions with reserve. The proposed algorithm, however, does not easily scale to large auction data sets, as each iteration involves solving a convex optimization problem. A smoother version of this algorithm is given by Rudolph et al. (2016). However, because the underlying problem is highly non-convex, neither algorithm provides a guarantee on the revenue attainable by its output. Devanur et al. (2016) give sample complexity bounds on the design of optimal auctions with side information. However, the authors consider only cases where this side information is given by . More importantly, their proposed algorithm only works under the unverifiable assumption that the conditional distributions of bids given satisfy stochastic dominance.
Our results. We show that given a predictor of the bid with squared loss of , we can construct a reserve function that extracts all but revenue, for a simple increasing function . (See Theorem 2 for the exact statement.) To the best of our knowledge, this is the first result that ties the revenue one can achieve directly to the quality of a standard prediction task. Our algorithm for computing is scalable, practical, and efficient.
Along the way we show what kinds of distributions are amenable to revenue optimization via reserve prices. We prove that when bids are drawn i.i.d. from a distribution , the ratio between the mean bid and the revenue extracted with the optimum monopoly reserve scales as – Theorem 5. This result refines the bound derived by Goldberg et al. (2001), and formalizes the intuition that reserve prices are more successful for low variance distributions.
2 Setup
We consider a repeated posted price auction setup where every auction is parametrized by a feature vector and a bid . Let be a distribution over . Let , be a bid prediction function and denote by the squared loss incurred by :
We assume is given, and make no assumption on the structure of or how it is obtained. Notice that while the existence of such is not guaranteed for all values of , using historical data one could use one of multiple readily available regression algorithms to find the best hypothesis .
Let be a set of i.i.d. samples drawn from and denote by its projection on . Given a price let denote the revenue obtained when the bidder bids . For a reserve price function we let:
denote the expected and empirical revenue of reserve price function .
We also let , denote the population and empirical mean bid, and , denote the expected and empirical separation between bid values and the revenue. Notice that for a given reserve price function , corresponds to revenue left on the table. Our goal is, given and , to find a function that maximizes or equivalently minimizes .
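The quantities above can be sketched in a few lines of Python. This is only an illustration: the formulas themselves are not reproduced in the text here, so the sketch assumes the usual posted-price revenue (the seller earns the price when the bid is at least the price, and nothing otherwise), and the function names are hypothetical.

```python
import numpy as np

def revenue(price, bid):
    # Assumed posted-price revenue: earn `price` if the bid meets it, else 0.
    price = np.asarray(price, dtype=float)
    bid = np.asarray(bid, dtype=float)
    return np.where(bid >= price, price, 0.0)

def empirical_revenue(reserves, bids):
    # Average revenue over the sample for the given per-auction reserves.
    return float(np.mean(revenue(reserves, bids)))

def empirical_separation(reserves, bids):
    # Empirical mean bid minus empirical revenue: revenue left on the table.
    return float(np.mean(bids) - empirical_revenue(reserves, bids))
```

Under this assumption, maximizing the empirical revenue and minimizing the empirical separation are the same problem, since the mean bid does not depend on the reserve function.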
2.1 Generalization Error
Note that in our setup we are only given samples from the distribution, , but aim to maximize the expected revenue. Understanding the difference between the empirical performance of an algorithm and its expected performance, also known as the generalization error, is a key tenet of learning theory.
At a high level, the generalization error is a function of the training set size: larger training sets lead to smaller generalization error; and the inherent complexity of the learning algorithm: simple rules such as linear classifiers generalize better than more complex ones.
In this paper we characterize the complexity of a class of functions by its growth function . The growth function corresponds to the maximum number of binary labelings that can be obtained by over all possible samples . It is closely related to the VC-dimension when takes values in and to the pseudo-dimension (Morgenstern and Roughgarden, 2015; Mohri et al., 2012) when takes values in .
We can give a bound on the generalization error associated with minimizing the empirical separation over a class of functions . The following theorem is an adaptation of Theorem 1 of (Mohri and Medina, 2014) to our particular setup.
Theorem 1.
Let , with probability at least over the choice of the sample the following bound holds uniformly for
(1) 
Therefore, in order to minimize the expected separation it suffices to minimize the empirical separation over a class of functions whose growth function scales polynomially in .
3 Warmup
In order to better understand the problem at hand, we begin by introducing a straightforward mechanism for transforming the hypothesis function to a reserve price function with guarantees on its achievable revenue.
Lemma 1.
Let be defined by . The function then satisfies
The proof is a simple application of Jensen’s and Markov’s inequalities and it is deferred to Appendix B.
This surprisingly simple algorithm shows there are ways to obtain revenue guarantees from a simple regressor. To the best of our knowledge this is the first guarantee of its kind. The reader may be curious about the choice of as the offset in our reserve price function. We will show that the dependence on is not a simple artifact of our analysis, but a cost inherent to the problem of revenue optimization.
Moreover, observe that this simple algorithm fixes a static offset, and does not make a distinction between those parts of the feature space, where the algorithm makes a low error, and those where the error is relatively high. By contrast our proposed algorithm partitions the space appropriately and calculates a different reserve for each partition. More importantly we will provide a data dependent bound on the performance of our algorithm that only in the worst case scenario behaves like .
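A minimal sketch of the static-offset rule discussed in this section is given below. The exact offset used in Lemma 1 is not reproduced in the text here, so it is left as a parameter; clipping the reserve at zero and the function name are assumptions made for illustration.

```python
import numpy as np

def offset_reserve(predictions, offset):
    # Shift every predicted bid down by the same offset, which guards
    # against overpredictions that would otherwise forfeit the sale.
    # Clipping at zero is an assumption made for this sketch.
    shifted = np.asarray(predictions, dtype=float) - offset
    return np.maximum(shifted, 0.0)
```

The same offset is applied everywhere, regardless of how accurate the predictor is in a given region of the feature space, which is the limitation the partition-based approach is meant to address.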
4 Results Overview
In principle, to maximize revenue we need to find a class of functions with small complexity that nevertheless contains a function which approximately minimizes the empirical separation. The challenge comes from the fact that the revenue function, Rev, is discontinuous and highly non-concave: a small change in the price, , may lead to very large changes in revenue. This is the main reason why simply using the predictor as a proxy for a reserve function is a poor choice, even if its average error, is small. For example a function , that is just as likely to overpredict by as to underpredict by will have very small error, but lead to revenue in half the cases.
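The failure mode in this example can be checked numerically. The sketch below uses hypothetical values (a constant bid of 0.5 and an error of 0.001) and assumes posted-price revenue, where the seller earns the price only if the bid is at least the price.

```python
import numpy as np

eps = 1e-3                                # hypothetical prediction error
bids = np.full(1000, 0.5)                 # hypothetical constant bids
signs = np.tile([1.0, -1.0], 500)         # overpredict and underpredict equally often
predictions = bids + eps * signs

# Use the raw prediction as the reserve price (assumed posted-price revenue).
rev = np.where(bids >= predictions, predictions, 0.0)

squared_loss = float(np.mean((predictions - bids) ** 2))
zero_fraction = float(np.mean(rev == 0.0))
```

Here the squared loss is about `eps ** 2`, yet every overprediction, no matter how slight, prices the bidder out, so half of the auctions earn nothing.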
A solution on the other end of the spectrum would simply memorize the optimum prices from the sample , setting . While this leads to optimal empirical revenue, a function class containing r would satisfy , making the bound of Theorem 1 vacuous.
In this work we introduce a family of classes parameterized by . This family admits an approximate minimizer that can be computed in polynomial time, has low generalization error, and achieves provable guarantees to the overall revenue.
More precisely, we show that given , and a hypothesis with expected squared loss of :

For every there exists a set of functions such that .

For every , there is a polynomial time algorithm that outputs such that in the worst case scenario is bounded by .
Effectively, we show how to transform any predictor with low squared loss, , into a reserve price predictor that recovers all but revenue in expectation.
4.1 Algorithm Description
In this section we give an overview of the algorithm that uses both the predictor and the set of samples in to develop a pricing function . Our approach has two steps. First we partition the set of feasible prices, into partitions, . The exact boundaries between partitions depend on the samples and their predicted values, as given by . For each partition we find the price that maximizes the empirical revenue in the partition. We let return the empirically optimum price in the partition that contains .
For a more formal description, let be the set of partitions of the interval that is:
We define . A function in chooses level sets of and reserve prices. Given , price is chosen if falls on the th level set.
It remains to define the function . Given a partition vector , let the partition of be given by . Let be the number of elements that fall into the th partition.
We define the predicted mean and variance of each group as
We are now ready to present algorithm RIC for computing .
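The two-step procedure described in this section can be sketched as follows. This is not the paper's algorithm listing, which is not reproduced in the text here: the thresholds on the predictions are taken as given, posted-price revenue is assumed, and all function names are hypothetical.

```python
import numpy as np

def best_empirical_reserve(bids):
    # The empirically optimal price is always one of the observed bids.
    bids = np.sort(np.asarray(bids, dtype=float))
    n = len(bids)
    # Pricing at bids[i] sells to the n - i bids that are at least bids[i].
    revenues = bids * (n - np.arange(n)) / n
    return float(bids[np.argmax(revenues)])

def fit_reserves(predictions, bids, thresholds):
    # Group the samples by the level set their prediction falls into,
    # then compute the empirically optimal reserve inside each group.
    predictions = np.asarray(predictions, dtype=float)
    bids = np.asarray(bids, dtype=float)
    groups = np.searchsorted(thresholds, predictions, side="right")
    reserves = np.zeros(len(thresholds) + 1)
    for j in range(len(thresholds) + 1):
        in_group = groups == j
        if in_group.any():
            reserves[j] = best_empirical_reserve(bids[in_group])
    return reserves

def reserve_function(prediction, thresholds, reserves):
    # At evaluation time only the prediction is needed, never the bid.
    return reserves[np.searchsorted(thresholds, prediction, side="right")]
```

For instance, `fit_reserves([0.1, 0.2, 0.8, 0.9], [0.1, 0.3, 0.7, 0.9], [0.5])` produces one reserve for predictions below 0.5 and another for predictions at or above it. Empty groups default to a reserve of zero in this sketch, which is a design choice rather than something stated in the text.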
Our main theorem states that the separation of is bounded by the cluster variance of . For a partition of let denote the empirical variance of bids for auctions in . We define the weighted empirical variance by:
(2) 
Theorem 2.
Let and let denote the output of Algorithm 4.1 then and with probability at least over the samples :
Notice that our bound is data dependent and only in the worst case scenario does it behave like . In general it could be much smaller.
We also show that the complexity of admits a favorable bound. The proof is similar to that in (Morgenstern and Roughgarden, 2015); we include it in Appendix E for completeness.
Theorem 3.
The growth function of the class can be bounded as:
We can combine these results with Equation 1 and an easy bound on in terms of to conclude:
Corollary 1.
Let and let denote the output of Algorithm 4.1 then and with probability at least over the samples :
Since , this implies that when , the separation is bounded by plus additional error factors that go to 0 with the number of samples, , as
5 Bounding Separation
In this section we prove the main bound motivating our algorithm. This bound relates the variance of the bid distribution to the maximum revenue that can be extracted when a buyer's bids follow such a distribution. It formally shows what makes a distribution amenable to revenue optimization.
To gain intuition for the kind of bound we are striving for, consider a bid distribution . If the variance of is , that is is a point mass at some value , then setting a reserve price to leads to no separation. On the other hand, consider the equal revenue distribution, with . Here any reserve price leads to revenue of . However, the distribution has unbounded expected bid and variance, so it is not too surprising that more revenue cannot be extracted. We make this connection precise, showing that after setting the optimal reserve price, the separation can be bounded by a function of the variance of the distribution.
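As a worked instance of the second example, assume the standard form of the equal revenue distribution, supported on $[1, \infty)$ with $\Pr[b \ge p] = 1/p$. The formula is not spelled out in the text here, so this form is an assumption. Every reserve price $p \ge 1$ then earns the same expected revenue, while the mean bid diverges:

```latex
\Pr[b \ge p] = \frac{1}{p} \quad (p \ge 1), \qquad
p \cdot \Pr[b \ge p] = 1, \qquad
\mathbb{E}[b] = \int_0^\infty \Pr[b \ge t]\, dt
             = 1 + \int_1^\infty \frac{dt}{t} = \infty .
```

The separation between mean bid and revenue is therefore unbounded for this distribution, which is consistent with its unbounded variance.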
Given any bid distribution over we denote by the probability that a bid is greater than or equal to . Finally, we will let denote the maximum revenue achievable when facing a bidder whose bids are drawn from distribution . As before we denote by the mean bid and by the expected separation of distribution .
Theorem 4.
Let denote the variance of . Then
The proof of this theorem is highly technical and we present it in Appendix A.
Corollary 2.
The following bound holds for any distribution F:
The proof of this corollary follows immediately by an application of Taylor’s theorem to the bound of Theorem 4. It is also easy to show that this bound is tight (see Appendix D).
5.1 Approximating Maximum Revenue
In their seminal work, Goldberg et al. (2001) showed that when faced with a bidder drawing values from a distribution on with mean , an auctioneer setting the optimum monopoly reserve would recover at least revenue. We show how to adapt the result of Theorem 4 to refine this approximation ratio as a function of the variance of . We defer the proof to Appendix B.
Theorem 5.
For any distribution with mean and variance , the maximum revenue with monopoly reserves, , satisfies:
Note that since this always leads to a tighter bound on the revenue.
5.2 Partition of
Corollary 2 suggests clustering points in such a way that the variance of the bids in each cluster is minimized. Given a partition of we denote by , , . Let also and .
Lemma 2.
Let then
Proof.
Let , Corollary 2 applied to the empirical bid distribution in yields . Multiplying by , summing over all clusters and using Hölder’s inequality gives:
∎
6 Clustering Algorithm
In view of Lemma 2 and since the quantity is fixed, we can find a function minimizing the expected separation by finding a partition of that minimizes the weighted variance defined in Section 4.1. From the definition of , this problem resembles a traditional k-means clustering problem with distance function . Thus, one could use one of several clustering algorithms to solve it. Nevertheless, in order to allocate a new point to a cluster, we would require access to the bid , which is unknown at evaluation time. Instead, we show how to utilize the predictions of to define an almost optimal clustering of .
For any partition of define
Notice that is the function minimized by Algorithm 4.1. The following lemma, proved in Appendix B, bounds the cluster variance achieved by clustering bids according to their predictions.
Lemma 3.
Let be a function such that , and let denote the partition that minimizes . If minimizes then .
Corollary 3.
Let be the output of Algorithm 4.1. If then:
(3) 
Proof.
The proof of Theorem 2 is now straightforward. Define a partition by if . Since for we have
(4) 
Furthermore since , Hoeffding’s inequality implies that with probability :
(5) 
In view of inequalities (4) and (5) as well as Corollary 3 we have:
This completes the proof of the main result. To implement the algorithm, note that the problem of minimizing reduces to finding a partition such that the sum of the variances within the partitions is minimized. It is clear that it suffices to consider points in the set . With this observation, a simple dynamic program leads to a polynomial time algorithm with an running time (see Appendix C).
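A dynamic program of this flavor can be sketched as follows. It is an illustration rather than the procedure of Appendix C: it assumes the objective is the sum of within-group squared deviations of the sorted predictions, it restricts attention to contiguous groups, and its running time is quadratic in the number of points times the number of groups, which need not match the bound stated above. The function name is hypothetical.

```python
import numpy as np

def min_variance_partition(values, k):
    # Split the sorted values into at most k contiguous groups so that the
    # total within-group sum of squared deviations is as small as possible.
    v = np.sort(np.asarray(values, dtype=float))
    m = len(v)
    k = min(k, m)
    s1 = np.concatenate([[0.0], np.cumsum(v)])
    s2 = np.concatenate([[0.0], np.cumsum(v ** 2)])

    def sse(i, j):
        # Sum of squared deviations of v[i:j] from its mean, via prefix sums.
        total = s1[j] - s1[i]
        return max((s2[j] - s2[i]) - total * total / (j - i), 0.0)

    cost = np.full((k + 1, m + 1), np.inf)
    split = np.zeros((k + 1, m + 1), dtype=int)
    cost[0, 0] = 0.0
    for g in range(1, k + 1):            # number of groups used so far
        for j in range(g, m + 1):        # groups cover v[:j]
            for i in range(g - 1, j):    # last group is v[i:j]
                c = cost[g - 1, i] + sse(i, j)
                if c < cost[g, j]:
                    cost[g, j] = c
                    split[g, j] = i

    # Walk the split table backwards to recover the group boundaries.
    bounds, j = [], m
    for g in range(k, 0, -1):
        i = split[g, j]
        bounds.append((i, j))
        j = i
    groups = [v[i:j] for i, j in reversed(bounds)]
    return groups, float(cost[k, m])
```

Because the values are one-dimensional and sorted first, considering only contiguous groups loses nothing for this objective, which is what makes a simple table-filling approach sufficient.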
7 Experiments
We now compare the performance of our algorithm against the following baselines: the offset algorithm of Section 3 and the DC algorithm of Mohri and Medina (2014).
Synthetic data. We begin by running experiments on synthetic data to demonstrate the regimes where each algorithm excels. We generate feature vectors with coordinates sampled from a mixture of lognormal distributions with means , , variance and mixture parameter . Let denote the vector with entries set to . Bids are generated according to two different scenarios:

Bids generated according to where is a Gaussian random variable with mean , and standard deviation .

Bids generated according to the following rule: let if then otherwise . Here has the same distribution as .
The linear scenario demonstrates what happens when we have a good estimate of the bids. The bimodal scenario models a buyer who, for the most part, bids as a continuous function of the features but who is interested in a particular set of objects (for instance, retargeting buyers in online advertisement) for which she is willing to pay a much higher price.
For each experiment we generated a training dataset , a holdout set and a test set each with 16,000 examples. The function used by RIC and the offset algorithm is found by training a linear regressor over . For efficiency, we ran the RIC algorithm on quantizations of predictions . Quantized predictions belong to one of 1000 buckets over the interval .
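The quantization step can be sketched as below. The interval over which the buckets are laid out is not reproduced in the text here, so the default range of 0 to 1, the use of equal-width buckets, and the choice to represent each bucket by its midpoint are all assumptions; the function name is hypothetical.

```python
import numpy as np

def quantize(predictions, n_buckets=1000, low=0.0, high=1.0):
    # Map each prediction to the midpoint of its equal-width bucket.
    p = np.clip(np.asarray(predictions, dtype=float), low, high)
    width = (high - low) / n_buckets
    idx = np.minimum(((p - low) / width).astype(int), n_buckets - 1)
    return low + (idx + 0.5) * width
```

After quantization there are at most `n_buckets` distinct prediction values, so any partitioning step only has to consider that many candidate boundaries regardless of the sample size.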
Finally, the choice of hyperparameters for the Lipschitz loss and for the clustering algorithm was done by selecting the best performing parameter over the holdout set. Following the suggestions in (Mohri and Medina, 2014) we chose and .
Figures 1(a) and 1(b) show the average revenue of the three approaches across 20 replicas of the experiment as a function of the log of . Revenue is normalized so that the DC algorithm revenue is when . The error bars at one standard deviation are indistinguishable in the plot. It is not surprising to see that in the linear scenario, the DC algorithm of Mohri and Medina (2014) and the offset algorithm outperform RIC under low noise conditions. Both algorithms will recover a solution close to the true weight vector . In this case the offset is minimal, thus recovering virtually all revenue. RIC, on the other hand, sets the optimal reserve price for every cluster, but the inherent variance of each cluster still makes it leave some revenue on the table. Nevertheless, notice that as the noise increases all three algorithms seem to achieve the same revenue. This is due to the fact that the variance in each cluster is comparable with the error in the prediction function .
The results are reversed for the bimodal scenario where RIC outperforms both algorithms under low noise. This is due to the fact that RIC recovers virtually all revenue obtained from high bids while the offset and DC algorithms must set conservative prices to avoid losing revenue from lower bids.
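[Figure 1: panels (a) and (b) show average revenue on the synthetic data; panel (c) shows results on the auction data.]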
Auction data. In practice, however, neither of the synthetic regimes is fully representative of the bidding patterns. In order to fully evaluate RIC, we collected auction bid data from AdExchange for 4 different publisher-advertiser pairs. For each pair we sampled 100,000 examples with a set of discrete and continuous features. The final feature vectors are in for depending on the publisher-buyer pair. For each experiment, we extract a random training sample of 20,000 points as well as a holdout and test sample. We repeated this experiment 20 times and present the results in Figure 1(c), where we have normalized the data so that the performance of the DC algorithm is always 1. The error bars represent one standard deviation from the mean revenue lift. Notice that our proposed algorithm achieves on average up to improvement over the DC algorithm. Moreover, the simple offset strategy never outperforms the clustering algorithm, and in some cases achieves significantly less revenue.
8 Conclusion
We provided a simple, scalable reduction of the problem of revenue optimization with side information to the well-studied problem of minimizing the squared loss. Our reduction provides the first polynomial time algorithm with a quantifiable bound on the achieved revenue. In the analysis of our algorithm we also provided the first variance dependent lower bound on the revenue attained by setting optimal monopoly prices. Finally, we provided extensive empirical evidence of the advantages of RIC over the current state of the art.
References
Cesa-Bianchi et al. [2015] Nicolò Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Regret minimization for reserve prices in second-price auctions. IEEE Trans. Information Theory, 61(1):549–564, 2015.
Chawla et al. [2007] Shuchi Chawla, Jason D. Hartline, and Robert D. Kleinberg. Algorithmic pricing via virtual valuations. In Proceedings 8th ACM Conference on Electronic Commerce (EC-2007), San Diego, California, USA, June 11-15, 2007, pages 243–251, 2007. doi: 10.1145/1250910.1250946.
Cole and Roughgarden [2015] Richard Cole and Tim Roughgarden. The sample complexity of revenue maximization. CoRR, abs/1502.00963, 2015.
Devanur et al. [2016] Nikhil R. Devanur, Zhiyi Huang, and Christos-Alexandros Psomas. The sample complexity of auctions with side information. In Proceedings of STOC, pages 426–439, 2016.
Dhangwatnotai et al. [2015] Peerapong Dhangwatnotai, Tim Roughgarden, and Qiqi Yan. Revenue maximization with a single sample. Games and Economic Behavior, 91:318–333, 2015.
Goldberg et al. [2001] Andrew V. Goldberg, Jason D. Hartline, and Andrew Wright. Competitive auctions and digital goods. In Proceedings of the Twelfth Annual Symposium on Discrete Algorithms, January 7-9, 2001, Washington, DC, USA, pages 735–744, 2001.
Hartline and Roughgarden [2009] Jason D. Hartline and Tim Roughgarden. Simple versus optimal mechanisms. In Proceedings 10th ACM Conference on Electronic Commerce (EC-2009), Stanford, California, USA, July 6-10, 2009, pages 225–234, 2009.
Kleinberg and Leighton [2003] Robert D. Kleinberg and Frank Thomson Leighton. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Proceedings of FOCS, pages 594–605, 2003.
Leme et al. [2016] Renato Paes Leme, Martin Pál, and Sergei Vassilvitskii. A field guide to personalized reserve prices. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11-15, 2016, pages 1093–1102, 2016. doi: 10.1145/2872427.2883071.
Mohri and Medina [2014] Mehryar Mohri and Andres Muñoz Medina. Learning theory and algorithms for revenue optimization in second-price auctions with reserve. In Proceedings of ICML, pages 262–270, 2014.
Mohri et al. [2012] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012. ISBN 026201825X, 9780262018258.
Morgenstern and Roughgarden [2015] Jamie Morgenstern and Tim Roughgarden. On the pseudo-dimension of nearly optimal auctions. In Proceedings of NIPS, pages 136–144, 2015.
Morgenstern and Roughgarden [2016] Jamie Morgenstern and Tim Roughgarden. Learning simple auctions. In Proceedings of COLT, pages 1298–1318, 2016.
Myerson [1981] R. Myerson. Optimal auction design. Mathematics of Operations Research, 6(1):58–73, 1981.
Roughgarden and Wang [2016] Tim Roughgarden and Joshua R. Wang. Minimizing regret with multiple reserves. In Proceedings of the 2016 ACM Conference on Economics and Computation, EC '16, Maastricht, The Netherlands, July 24-28, 2016, pages 601–616, 2016. doi: 10.1145/2940716.2940792.
Rudolph et al. [2016] Maja R. Rudolph, Joseph G. Ellis, and David M. Blei. Objective variables for probabilistic revenue maximization in second-price auctions with reserve. In Proceedings of WWW 2016, pages 1113–1122, 2016.
Appendix A Proof of Theorem 4
Recall that given any bid distribution over we denote by the probability that a bid is greater than or equal to . Let denote the pseudo-inverse of . Notice in particular that when is strictly decreasing then . When it is clear from context we will refer to a distribution interchangeably by , or .
We will use the following expressions for the expected bid and second moment of a distribution.
Lemma 4.
The expected bid and second moments of any distribution are given respectively by:
Proof.
We show the result only for the mean as the proof for the second moment is similar. It is well known that for a positive random variable, the mean can be expressed as:
where . Let . It is immediate that as implies by definition that . We can thus decompose the above integral as:
The proof will be complete by showing that has Lebesgue measure . Indeed, in that case the above expression reduces to:
Let us then characterize points . Notice that if then but this again by definition implies . If is also in then we must have . From which we conclude that . Thus implies that is a discontinuity of . Finally, since is decreasing there can be at most a countable number of discontinuities and thus has measure 0. ∎
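The well-known identity that the proof starts from, namely that the mean of a nonnegative random variable equals the integral of its tail probability, can be checked on an empirical bid distribution. The sketch below is only a numerical sanity check of that starting point, not of the pseudo-inverse expressions in Lemma 4; the function name is hypothetical and bids are assumed nonnegative.

```python
import numpy as np

def mean_via_tail_integral(bids):
    # Integrate the empirical tail probability t -> P(b > t) over [0, max bid].
    b = np.sort(np.asarray(bids, dtype=float))
    n = len(b)
    edges = np.concatenate([[0.0], b])
    widths = np.diff(edges)
    # Between edges[i] and edges[i + 1], exactly n - i sample bids exceed t.
    survival = (n - np.arange(n)) / n
    return float(np.sum(widths * survival))
```

The tail probability of an empirical distribution is a step function, so the integral reduces to a finite sum of rectangle areas and agrees with the sample mean up to floating point error.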
In order to show the bound of Theorem 4 holds, we consider the following optimization problem over the space of square integrable functions :
(6)  
s.t. 
We show that the value of this optimization problem is greater than . Since any distribution achieving revenue and separation is feasible for (6) it follows that it must satisfy .
Proposition 1.
The objective value of (6) is lower bounded by:
(7) 
Proof.
For any and define the Lagrangian
It is immediate to see that optimization problem (6) is equivalent to
By taking variational derivatives of the function with respect to we see that the minimizing solution satisfies:
Replacing this value in the function we see that problem (6) is lower bounded by:
We can solve for the unconstrained variable to obtain . Replacing this value in the above expression yields:
Expanding the quadratic term yields:
∎
To obtain a lower bound on (7) we simply need to evaluate the objective function at a feasible function . In particular we let
(8) 
with . Notice that is clearly in and . The choice of this function is highly motivated by the solution to the unconstrained version of problem (7).
Proposition 2.
The optimization problem
(9) 
is lower bounded by .
Proof.
Let be defined by (8). Using the fact that we have the following equalities:
(10)  
(11) 
In view of (10) and (11) we have that for all
Multiplying the above equality by and integrating we see the objective function of (9) evaluated at is given by
Replacing (10) and (11) on the expression above we obtain:
∎
Appendix B Additional proofs
Lemma 1.
Let be defined by . The function then satisfies
Proof.
Corollary 2.
The following bound holds for any distribution F:
Proof.
By Theorem 4 and using a third order Taylor expansion we have:
The proof follows by rearranging terms. ∎
Theorem 5.
For any distribution with mean and variance , the maximum revenue with monopoly reserves, , satisfies:
Proof.
Let . Note that . We begin by dividing both sides of the statement of Theorem 4 by :
Rearranging, we have:
(14) 
Since , . Therefore, if Equation 14 holds, then:
Suppose for some fixed . Note that the function is increasing in for . Moreover, at we have , since for . Therefore .
In our situation, we can then conclude that
∎
Lemma 3.
Let be a function such that , and let denote the partition that minimizes . If minimizes then
Proof.
From the definition of and a straightforward application of the triangle inequality we have:
where we have used the Cauchy-Schwarz inequality for the last line. Using the property of we can further bound the above expression as