On Bayes risk lower bounds
This paper provides a general technique for lower bounding the Bayes risk of statistical estimation, applicable to arbitrary loss functions and arbitrary prior distributions. A lower bound on the Bayes risk not only serves as a lower bound on the minimax risk, but also characterizes the fundamental limit of any estimator given the prior knowledge. Our bounds are based on the notion of $f$-informativity (Csiszár, 1972), which is a function of the underlying class of probability measures and the prior. Application of our bounds requires upper bounds on the $f$-informativity, thus we derive new upper bounds on $f$-informativity which often lead to tight Bayes risk lower bounds. Our technique leads to generalizations of a variety of classical minimax bounds (e.g., generalized Fano’s inequality). Our Bayes risk lower bounds can be directly applied to several concrete estimation problems, including Gaussian location models, generalized linear models, and principal component analysis for spiked covariance models. To further demonstrate the applications of our Bayes risk lower bounds to machine learning problems, we present two new theoretical results: (1) a precise characterization of the minimax risk of learning spherical Gaussian mixture models under the smoothed analysis framework, and (2) lower bounds for the Bayes risk under a natural prior for both the prediction and estimation errors for high-dimensional sparse linear regression under an improper learning setting.
Consider a standard setting where we observe data points $X$ taking values in a sample space $\mathcal{X}$. The distribution of $X$ depends on an unknown parameter $\theta$ and is denoted by $P_\theta$. The goal is to compute an estimate of $\theta$ based on the observed samples. Formally, we denote the estimator by $\hat{\theta} = \hat{\theta}(X)$, where $\hat{\theta}$ is a mapping from the sample space $\mathcal{X}$ to the parameter space $\Theta$. The risk of the estimator is defined by $\mathbb{E}_\theta L(\theta, \hat{\theta}(X))$, where $L$ is a non-negative loss function. This framework applies to a broad scope of machine learning problems. Taking sparse linear regression as a concrete example, the data $X$ represents the design matrix and the response vector; the parameter space $\Theta$ is the set of sparse vectors; the loss function can be chosen as a squared loss.
Given an estimation problem, we are interested in the lowest possible risk achievable by any estimator, which will be useful in justifying the potential of improving existing algorithms. The classical notion of optimality is formalized by the so-called minimax risk. More specifically, we assume that the statistician chooses an optimal estimator $\hat{\theta}$, then the adversary chooses the worst parameter $\theta$ by knowing the choice of $\hat{\theta}$. The minimax risk is defined as:
$$R_{\mathrm{minimax}}(L; \Theta) := \inf_{\hat{\theta}} \sup_{\theta \in \Theta} \mathbb{E}_\theta L(\theta, \hat{\theta}(X)). \tag{1}$$
The minimax risk has been determined up to multiplicative constants for many important problems. Examples include sparse linear regression (Raskutti et al., 2011), classification (Yang, 1999), additive models over kernel classes (Raskutti et al., 2012), and crowdsourcing (Zhang et al., 2016).
The assumption that the adversary is capable of choosing a worst-case parameter is sometimes over-pessimistic. In practice, the parameter that incurs a worst-case risk may appear with very small probability. To capture the hardness of the problem with this prior knowledge, it is reasonable to assume that the true parameter is sampled from an underlying prior distribution $w$. In this case, we are interested in the Bayes risk of the problem. That is, the lowest possible risk when the true parameter is sampled from the prior distribution:
$$R_{\mathrm{Bayes}}(w, L; \Theta) := \inf_{\hat{\theta}} \int_\Theta \mathbb{E}_\theta L(\theta, \hat{\theta}(X)) \, w(d\theta). \tag{2}$$
If the prior distribution is known to the learner, then the Bayes estimator attains the Bayes risk (Berger, 2013). But in general, the Bayes estimator is computationally hard to evaluate, and the Bayes risk has no closed-form expression. It is thus unclear what is the fundamental limit of estimators when the prior knowledge is available.
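When the parameter set, the sample space, and the action space are all finite, the Bayes risk can be computed exactly, which makes its relationship to the worst-case risk concrete. The following is a minimal sketch under that finiteness assumption; the function names and the two-point example are hypothetical and are not taken from the paper.

```python
def bayes_risk(prior, likelihood, loss):
    """Exact Bayes risk when Theta, the sample space, and the action space are finite.

    prior[i]         -- prior mass w(theta_i)
    likelihood[i][x] -- P_{theta_i}(X = x)
    loss[i][a]       -- L(theta_i, a)
    """
    n_theta, n_x, n_a = len(prior), len(likelihood[0]), len(loss[0])
    total = 0.0
    for x in range(n_x):
        # For each observation, the Bayes rule picks the action minimizing the
        # posterior expected loss (left unnormalized by the marginal of x).
        total += min(
            sum(prior[i] * likelihood[i][x] * loss[i][a] for i in range(n_theta))
            for a in range(n_a)
        )
    return total


def worst_case_risk(rule, likelihood, loss):
    """Maximum over theta of the risk of a deterministic rule x -> rule[x]."""
    return max(
        sum(likelihood[i][x] * loss[i][rule[x]] for x in range(len(rule)))
        for i in range(len(likelihood))
    )


# Hypothetical two-point problem with the indicator loss.
prior = [0.5, 0.5]
likelihood = [[0.8, 0.2], [0.3, 0.7]]
loss = [[0.0, 1.0], [1.0, 0.0]]

print(bayes_risk(prior, likelihood, loss))        # Bayes risk under the uniform prior
print(worst_case_risk([0, 1], likelihood, loss))  # worst-case risk of the rule d(x) = x
```

In this toy problem the Bayes risk never exceeds the worst-case risk of any deterministic rule, in line with the fact that a Bayes risk lower bound is also a minimax lower bound.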
In this paper, we present a technique for establishing lower bounds on the Bayes risk for a general prior distribution $w$. When the lower bound matches the risk of an existing algorithm, it captures the convergence rate of the Bayes risk. The Bayes risk lower bounds are useful for three main reasons:
They provide an idea of the difficulty of the problem under a specific prior $w$.
They automatically provide lower bounds for the minimax risk and, because the minimax regret is always larger than or equal to the minimax risk (see, for example, Rakhlin et al. (2013)), they also yield lower bounds for the minimax regret.
As we will show, they have an important application in establishing the minimax lower bound under the smoothed analysis framework.
Throughout this paper, when the loss function $L$ and the parameter space $\Theta$ are clear from the context, we simply denote the Bayes risk by $R_{\mathrm{Bayes}}(w)$. When the prior $w$ is also clear, the notation is further simplified to $R_{\mathrm{Bayes}}$.
1.1 Our Main Results
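In order to give the reader a flavor of the kind of results proved in this paper, let us consider Fano’s classical inequality (Han and Verdú, 1994; Cover and Thomas, 2006; Yu, 1997), which is one of the most widely used Bayes risk lower bounds in statistics and information theory. The standard version of Fano’s inequality applies to the case when $\Theta = \mathcal{A} = \{1, \dots, N\}$ for some positive integer $N$ with the indicator loss $L(\theta, a) := \mathbb{I}\{\theta \neq a\}$ ($\mathbb{I}$ stands for the zero-one valued indicator function) and the prior $w$ being the discrete uniform distribution on $\Theta$. In this setting, Fano’s inequality states that
$$R_{\mathrm{Bayes}}(w, L; \Theta) \ge 1 - \frac{I(w, \mathcal{P}) + \log 2}{\log N}, \tag{3}$$
where $I(w, \mathcal{P})$ is the mutual information between the random variables $\theta \sim w$ and $X$ with $X \mid \theta \sim P_\theta$ (note that this mutual information only depends on $w$ and $\mathcal{P} := \{P_\theta : \theta \in \Theta\}$, which is why we denote it by $I(w, \mathcal{P})$). Fano’s inequality implies that when $I(w, \mathcal{P})$ is small, i.e., when the information that $X$ has about $\theta$ is small, the risk of estimation is large.
A natural question regarding Fano’s inequality, which does not seem to have been asked until very recently, is the following: does there exist an analogue of (3) when $w$ is not necessarily the uniform prior and/or when $\Theta$ and $\mathcal{A}$ are arbitrary sets, and/or when the loss function is not necessarily $\mathbb{I}\{\theta \neq a\}$? An interesting result in this direction is the following inequality, recently proved by Duchi and Wainwright (2013), who termed it the continuum Fano inequality. This inequality applies to the case when $\Theta = \mathcal{A}$ is a subset of Euclidean space with finite strictly positive Lebesgue measure, $L(\theta, a) := \mathbb{I}\{\|\theta - a\|_2 \ge \epsilon\}$ for a fixed $\epsilon > 0$ ($\|\cdot\|_2$ is the usual Euclidean metric) and the prior $w$ being the uniform probability measure (i.e., normalized Lebesgue measure) on $\Theta$. In this setting, Duchi and Wainwright (2013) proved that
$$R_{\mathrm{Bayes}}(w, L; \Theta) \ge 1 - \frac{I(w, \mathcal{P}) + \log 2}{\log\bigl(1 / \sup_{a \in \mathcal{A}} w\{\theta \in \Theta : \|\theta - a\|_2 < \epsilon\}\bigr)}. \tag{4}$$
Inequalities (3) and (4) share a common form:
$$R_{\mathrm{Bayes}}(w, L; \Theta) \ge 1 - \frac{I(w, \mathcal{P}) + \log 2}{\log\bigl(1 / \sup_{a \in \mathcal{A}} w\{\theta \in \Theta : L(\theta, a) = 0\}\bigr)}. \tag{5}$$
Indeed, the quantity $\sup_{a \in \mathcal{A}} w\{\theta \in \Theta : L(\theta, a) = 0\}$ equals $1/N$ in the setting of (3) and equals $\sup_{a \in \mathcal{A}} w\{\theta \in \Theta : \|\theta - a\|_2 < \epsilon\}$ in the setting of (4).
Since both (3) and (4) are special instances of (5), one might reasonably conjecture that inequality (5) holds more generally. In Section 3, we give an affirmative answer by proving that inequality (5) holds for any zero-one valued loss function $L$ and any prior $w$. No assumptions on $\Theta$, $\mathcal{A}$ and $\mathcal{P}$ are needed. We refer to this result as generalized Fano’s inequality. Our proof of (5) is quite succinct and is based on the data processing inequality (Cover and Thomas, 2006; Liese, 2012) for Kullback-Leibler (KL) divergence. The use of the data processing inequality for proving Fano-type inequalities was introduced by Gushchin (2003).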
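To see how a Fano-type bound behaves numerically, the sketch below evaluates the right-hand side of inequality (5), assuming it takes the form $1 - (I + \log 2)/\log(1/m)$, where $m$ is the largest prior mass of a zero-loss set. The function name and the numerical values are hypothetical, and logarithms are natural.

```python
import math


def generalized_fano_bound(mutual_information, max_zero_loss_mass):
    """Evaluate 1 - (I + log 2) / log(1 / m), with I in nats.

    m = sup_a w{theta : L(theta, a) = 0} is the largest prior mass of a
    zero-loss set; this closed form is an assumption of the sketch.
    """
    if not 0.0 < max_zero_loss_mass < 1.0:
        raise ValueError("max_zero_loss_mass must lie strictly between 0 and 1")
    return 1.0 - (mutual_information + math.log(2.0)) / math.log(1.0 / max_zero_loss_mass)


# Classical setting: uniform prior on N hypotheses with the indicator loss,
# so every zero-loss set is a single point of mass 1/N.
N = 1024
info = 1.0  # hypothetical value of the mutual information, in nats
print(generalized_fano_bound(info, 1.0 / N))
```

With a uniform prior on $N$ points the expression reduces to the classical form with $\log N$ in the denominator, and the bound weakens as the mutual information grows.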
The data processing inequality is not only available for the KL divergence. It can be generalized to any divergence belonging to a general family known as $f$-divergences (Csiszár, 1963; Ali and Silvey, 1966). This family includes the KL divergence, chi-squared divergence, squared Hellinger distance, total variation distance and power divergences as special cases. The usefulness of $f$-divergences in machine learning has been illustrated in Reid and Williamson (2011); García-García and Williamson (2012); Reid and Williamson (2009).
For every $f$-divergence, one can define a quantity called $f$-informativity (Csiszár, 1972) which plays the same role as the mutual information for KL divergence. The precise definitions of $f$-divergences and $f$-informativities are given in Section 2. Utilizing the data processing inequality for $f$-divergence, we prove general Bayes risk lower bounds which hold for every zero-one valued loss $L$ and for arbitrary $\Theta$, $\mathcal{A}$ and $\mathcal{P}$ (Theorem 3.2). The generalized Fano’s inequality (5) is a special case by choosing the $f$-divergence to be KL. The proposed Bayes risk lower bounds can also be specialized to other $f$-divergences and have a variety of interesting connections to existing lower bounds in the literature such as Le Cam’s inequality, Assouad’s lemma (see Theorem 2.12 in Tsybakov (2010)), and the Birgé-Gushchin inequality (Gushchin, 2003; Birgé, 2005). These results are provided in Section 3.
In Section 4, we deal with nonnegative valued loss functions which are not necessarily zero-one valued. Basically, we use the standard method of lower bounding the general loss function by a zero-one valued function and then use our results from Section 3 for lower bounding the Bayes risk. This technique, in conjunction with the generalized Fano’s inequality, gives the following lower bound (proved in Corollary 4.4):
$$R_{\mathrm{Bayes}}(w, L; \Theta) \ge \frac{1}{2} \sup\left\{ t > 0 : \sup_{a \in \mathcal{A}} w\{\theta \in \Theta : L(\theta, a) < t\} < \frac{1}{4} e^{-2 I(w, \mathcal{P})} \right\}. \tag{6}$$
A special case of the above inequality has appeared previously in Zhang (2006, Theorem 6.1) (please refer to Remark 4.5 for a detailed explanation of the connection between inequality (6) and (Zhang, 2006, Theorem 6.1)).
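We also prove analogues of the above inequality for different divergences. Specifically, using our $f$-divergence inequalities from Section 3, we prove, in Theorem 4.1, the following inequality which holds for every $f$-divergence:
$$R_{\mathrm{Bayes}}(w, L; \Theta) \ge \frac{1}{2} \sup\left\{ t > 0 : \sup_{a \in \mathcal{A}} w(B_t(a, L)) < 1 - u_f(I_f(w, \mathcal{P})) \right\}, \tag{7}$$
where $B_t(a, L) := \{\theta \in \Theta : L(\theta, a) < t\}$, $I_f(w, \mathcal{P})$ represents the $f$-informativity and $u_f(\cdot)$ is a non-decreasing $[1/2, 1]$-valued function that depends only on $f$. This function (see its definition in (31)) can be explicitly computed for many $f$-divergences of interest, which gives useful lower bounds in terms of $f$-informativity. For example, for the case of KL divergence and chi-squared divergence, inequality (7) gives the lower bound in (6) and the following inequality respectively,
$$R_{\mathrm{Bayes}}(w, L; \Theta) \ge \frac{1}{2} \sup\left\{ t > 0 : \sup_{a \in \mathcal{A}} w(B_t(a, L)) < \frac{1}{4\,(1 + I_{\chi^2}(w, \mathcal{P}))} \right\}, \tag{8}$$
where $I_{\chi^2}(w, \mathcal{P})$ is the chi-squared informativity.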
Intuitively, inequality (7) shows that the Bayes risk is lower bounded by half of the largest possible $t$ such that the maximum prior mass of any $t$-radius “ball” ($\{\theta \in \Theta : L(\theta, a) < t\}$) is less than some function of the $f$-informativity. To apply (7), one needs to obtain upper bounds on the following two quantities:
The “small ball probability” $\sup_{a \in \mathcal{A}} w\{\theta \in \Theta : L(\theta, a) < t\}$, which does not depend on the family of probability measures $\mathcal{P}$.
The $f$-informativity $I_f(w, \mathcal{P})$, which does not depend on the loss function $L$.
We note that a nice feature of (7) is that $L$ and $\mathcal{P}$ play separate roles. One may first obtain an upper bound $I_f^{\mathrm{up}}$ for the $f$-informativity $I_f(w, \mathcal{P})$, then choose $t$ so that the small ball probability can be bounded from above by $1 - u_f(I_f^{\mathrm{up}})$. The Bayes risk will then be bounded from below by $t/2$. It is noteworthy that the terminology “small ball probability” was used by Xu and Raginsky (2014), who proved information-theoretic lower bounds on the minimum time in a distributed function computation problem.
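The two-step recipe above can be sketched numerically. The code below assumes the KL and chi-squared thresholds take the forms $\frac{1}{4}e^{-2I}$ and $\frac{1}{4(1+I_{\chi^2})}$, as in (6) and (8), and uses a hypothetical example (uniform prior on $[0,1]$ with absolute-error loss, for which the small ball probability is $\min(2t, 1)$). All names and values are illustrative assumptions rather than the paper's own code.

```python
import math


def largest_radius(small_ball_mass, threshold, t_max, tol=1e-12):
    """Bisection for sup{t in (0, t_max] : small_ball_mass(t) < threshold}.

    Assumes small_ball_mass is non-decreasing in t.
    """
    if small_ball_mass(t_max) < threshold:
        return t_max
    lo, hi = 0.0, t_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if small_ball_mass(mid) < threshold:
            lo = mid
        else:
            hi = mid
    return lo


def kl_threshold(mutual_information):
    """Assumed threshold (1/4) exp(-2 I) for the KL divergence."""
    return 0.25 * math.exp(-2.0 * mutual_information)


def chi2_threshold(chi2_informativity):
    """Assumed threshold 1 / (4 (1 + I_chi2)) for the chi-squared divergence."""
    return 0.25 / (1.0 + chi2_informativity)


def bayes_risk_lower_bound(small_ball_mass, threshold, t_max):
    """Half of the largest radius whose small ball probability stays below the threshold."""
    return 0.5 * largest_radius(small_ball_mass, threshold, t_max)


# Hypothetical example: uniform prior on [0, 1] and loss |theta - a|.
uniform_small_ball = lambda t: min(2.0 * t, 1.0)
info_upper_bound = 0.5  # hypothetical upper bound on the mutual information, in nats
print(bayes_risk_lower_bound(uniform_small_ball, kl_threshold(info_upper_bound), 1.0))
```

The design mirrors the separation of roles noted above: the informativity only enters through the threshold, and the prior and loss only enter through `small_ball_mass`.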
We do not have a general guideline for bounding the small ball probability. It needs to be dealt with case by case based on the prior and the loss function. But for upper bounding the $f$-informativity, we offer a general recipe in Section 5 for a subclass of divergences of interest (a range of power divergences), which covers the chi-squared divergence, one of the most important divergences in our applications. These bounds generalize results of Haussler and Opper (1997) and Yang and Barron (1999) for mutual information to $f$-informativities involving power divergences. As an illustration of our techniques (inequality (7) combined with the $f$-informativity upper bounds), we apply them to a concrete estimation problem in Section 5. We further apply our results to several popular machine learning and statistics problems (e.g., generalized linear model, spiked covariance model, and Gaussian model with general loss) in Appendix C.
In Section 6 and Section 7, we present non-trivial applications of our Bayes risk lower bounds to two learning problems: the first is an unsupervised learning problem, while the second is a supervised learning problem. Section 6 studies smoothed analysis for learning mixtures of spherical Gaussians with uniform weights. Although learning mixtures of Gaussians is a computationally hard problem, it has been shown recently by Hsu and Kakade (2013) that, under the assumption that the Gaussian means are linearly independent, it can be learnt in polynomial time by a spectral method. We perform a smoothed analysis on a variant of the algorithm of Hsu and Kakade (2013), showing that the linear independence assumption can be replaced by perturbing the true parameters by a small random noise. The method described in Section 6 achieves a better convergence rate than the original algorithm of Hsu and Kakade (2013). Furthermore, we apply the Bayes risk lower bound techniques to show that the algorithm’s convergence rate is unimprovable, even under smoothed analysis (i.e., when the true parameters are randomly perturbed). Section 6 highlights the usefulness of our techniques in proving lower bounds for smoothed analysis, which appears to be challenging using traditional techniques of the minimax theory.
In Section 7, we consider the high-dimensional sparse linear regression problem and we provide Bayes risk lower bounds for both prediction error and estimation error under a natural prior on the regression parameter belonging to the set of sparse vectors. Although lower bounds for sparse linear regression have been well-studied (see, e.g., Raskutti et al. (2011); Zhang et al. (2014) and references therein), these bounds only focus on the minimax or the worst-case scenario and thus are too pessimistic in practice. Indeed, the parameters that usually attain these minimax lower bounds have zero probability under any continuous prior, so that their average effects might be negligible. The fundamental limits of sparse linear regression under a realistic prior are, to the best of our knowledge, unknown. The developed tool of lower bounding Bayes risks can be directly applied to characterize these limits. Moreover, our Bayes risk lower bound is flexible in the sense that by tuning the variance of the prior of the non-zero elements of the regression parameter, it provides a wide spectrum of lower bounds. For one particular choice of the variance, our Bayes risk lower bounds match the minimax risk lower bounds. This gives a natural least favorable prior for sparse linear regression, while the known least favorable prior in Raskutti et al. (2011) is a non-constructive discrete prior over a packing set of the parameter space that cannot be sampled from. We also work under the improper learning setting where we allow non-sparse estimators for the true regression vector (even though the true regression vector is assumed to be sparse).
1.2 Related Works
Before closing the introduction, we briefly describe related work on Bayes risk lower bounds. There are a few results dealing with special cases of finite dimensional estimation problems under (weighted/truncated) quadratic losses. The first results of this kind were established by Van Trees (1968), and Borovkov and Sakhanienko (1980) with extensions by Brown and Gajek (1990); Brown (1993); Gill and Levit (1995); Sato and Akahira (1996); Takada (1999). A few additional papers dealt with even more specialized problems, e.g., the Gaussian white noise model (Brown and Liu, 1993), scale models (Gajek and Kaluszka, 1994) and estimating Gaussian variance (Vidakovic and DasGupta, 1995). Most of these results are based on the van Trees inequality (see Gill and Levit (1995) and Theorem 2.13 in Tsybakov (2010)). Although the van Trees inequality usually leads to sharp constants in the Bayes risk lower bounds, it only applies to weighted quadratic loss functions (as its proof relies on the Cauchy-Schwarz inequality) and requires the underlying Fisher information to be easily computable, which limits its applicability. There is also a vast body of literature on minimax lower bounds (see, e.g., Tsybakov (2010)) which can be viewed as Bayes risk lower bounds for certain priors. These priors are usually discrete and specially constructed so that the lower bounds do not apply to more general (continuous) priors. Another related area of work involves finding lower bounds on posterior contraction rates (see, e.g., Castillo (2008)).
1.3 Outline of the Paper
The rest of the paper is organized in the following way. In Section 2, we describe notation and review preliminaries such as $f$-divergences, $f$-informativity, and the data processing inequality. Section 3 deals with inequalities for zero-one valued loss functions. These inequalities have many connections to existing lower bound techniques. Section 4 deals with nonnegative loss functions and provides inequality (7) and its special cases. Section 5 presents upper bounds on the $f$-informativity for power divergences. Some examples are also given in this section. Section 6 studies smoothed analysis for learning mixtures of spherical Gaussians with uniform weights using our technique. Section 7 provides Bayes risk lower bounds for high-dimensional sparse linear regression. We conclude the paper in the final section. Due to space constraints, we have relegated some proofs and additional examples and results to the appendix.
2 Preliminaries and Notations
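We first review the notions of $f$-divergence (Csiszár, 1963; Ali and Silvey, 1966) and $f$-informativity (Csiszár, 1972). Let $\mathcal{C}$ denote the class of all convex functions $f : (0, \infty) \to \mathbb{R}$ which satisfy $f(1) = 0$. Because of convexity, the limits $f(0) := \lim_{x \downarrow 0} f(x)$ and $f'(\infty) := \lim_{x \to \infty} f(x)/x$ exist (even though they may be $+\infty$) for each $f \in \mathcal{C}$. Each function $f \in \mathcal{C}$ defines a divergence between probability measures which is referred to as $f$-divergence. For two probability measures $P$ and $Q$ on a sample space having densities $p$ and $q$ with respect to a common measure $\mu$, the $f$-divergence $D_f(P \| Q)$ between $P$ and $Q$ is defined as follows:
$$D_f(P \| Q) := \int_{\{q > 0\}} f\left(\frac{p}{q}\right) q \, d\mu + f'(\infty) \, P\{q = 0\}. \tag{9}$$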
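We note that the convention $0 \cdot \infty = 0$ is adopted here so that $f'(\infty) P\{q = 0\} = 0$ when $f'(\infty) = \infty$ and $P\{q = 0\} = 0$. Note that $D_f(P \| Q) = +\infty$ when $f'(\infty) = \infty$ and $P\{q = 0\} > 0$. Also note that $f(1) = 0$ implies that $D_f(P \| Q) = 0$ when $P = Q$.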
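Certain divergences are commonly used because they can be easily computed or bounded when $P$ and $Q$ are product measures. These divergences are the power divergences corresponding to the functions $f_\alpha$ defined by
$$f_\alpha(x) = \begin{cases} x^\alpha - 1 & \text{for } \alpha \notin [0, 1], \\ 1 - x^\alpha & \text{for } \alpha \in (0, 1), \\ x \log x & \text{for } \alpha = 1, \\ -\log x & \text{for } \alpha = 0. \end{cases}$$
Popular examples of power divergences include:
1) Kullback-Leibler (KL) divergence: $\alpha = 1$, $D_{f_1}(P \| Q) = \int p \log(p/q) \, d\mu$, if $P$ is absolutely continuous with respect to $Q$ (and it is infinite if $P$ is not absolutely continuous with respect to $Q$). Following the conventional notation, we denote the KL divergence by $D(P \| Q)$ (instead of $D_{f_1}(P \| Q)$).
2) Chi-squared divergence: $\alpha = 2$, $D_{f_2}(P \| Q) = \int (p^2/q) \, d\mu - 1$, if $P$ is absolutely continuous with respect to $Q$ (and it is infinite if $P$ is not absolutely continuous with respect to $Q$). We denote the chi-squared divergence by $\chi^2(P \| Q)$ following the conventional notation.
3) When $\alpha = 1/2$, one has $D_{f_{1/2}}(P \| Q) = 1 - \int \sqrt{pq} \, d\mu$, which is a half of the squared Hellinger distance. That is, $D_{f_{1/2}}(P \| Q) = H^2(P, Q)/2$, where $H^2(P, Q) = \int (\sqrt{p} - \sqrt{q})^2 \, d\mu$ is the squared Hellinger distance between $P$ and $Q$.
The total variation distance $\|P - Q\|_{\mathrm{TV}}$ is another $f$-divergence (with $f(x) = |x - 1|/2$) but not a power divergence.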
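For distributions on a finite set, the divergences listed above can be evaluated from one generic routine. The sketch below is an illustration only: it assumes the usual generating functions ($x \log x$ for KL, $x^2 - 1$ for chi-squared, $1 - \sqrt{x}$ for half the squared Hellinger distance, $|x - 1|/2$ for total variation) together with their limiting slopes $f'(\infty)$, and the helper names are hypothetical.

```python
import math


def f_divergence(p, q, f, f_prime_inf):
    """D_f(P||Q) for distributions on a finite set.

    Sums q(x) f(p(x)/q(x)) over {q > 0} and adds f'(inf) * P{q = 0},
    with the convention 0 * inf = 0.
    """
    on_support = sum(qi * f(pi / qi) for pi, qi in zip(p, q) if qi > 0)
    off_support = sum(pi for pi, qi in zip(p, q) if qi == 0)
    if off_support == 0:
        return on_support
    return on_support + f_prime_inf * off_support


# (f, f'(inf)) pairs; these generating functions are assumptions of the sketch.
KL = (lambda x: x * math.log(x) if x > 0 else 0.0, math.inf)
CHI2 = (lambda x: x * x - 1.0, math.inf)
HALF_HELLINGER = (lambda x: 1.0 - math.sqrt(x), 0.0)
TV = (lambda x: abs(x - 1.0) / 2.0, 0.5)

p = [0.5, 0.5]
q = [0.25, 0.75]
for name, (f, slope) in [("KL", KL), ("chi2", CHI2), ("H^2/2", HALF_HELLINGER), ("TV", TV)]:
    print(name, f_divergence(p, q, f, slope))
```

Passing $f'(\infty)$ explicitly keeps the treatment of mass that $P$ places outside the support of $Q$ visible: it makes the KL and chi-squared divergences infinite, while the total variation distance stays finite.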
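One of the most important properties of $f$-divergences is the “data processing inequality” (Csiszár (1972) and Liese (2012, Theorem 3.1)) which states the following: let $\mathcal{X}$ and $\mathcal{Y}$ be two measurable spaces and let $\Gamma : \mathcal{X} \to \mathcal{Y}$ be a measurable function. For every $f \in \mathcal{C}$ and every pair of probability measures $P$ and $Q$ on $\mathcal{X}$, we have
$$D_f(P\Gamma^{-1} \| Q\Gamma^{-1}) \le D_f(P \| Q), \tag{10}$$
where $P\Gamma^{-1}$ and $Q\Gamma^{-1}$ denote the induced measures of $\Gamma$ on $\mathcal{Y}$, i.e., for any measurable set $B$ on the space $\mathcal{Y}$, $P\Gamma^{-1}(B) := P(\Gamma^{-1}(B))$, $Q\Gamma^{-1}(B) := Q(\Gamma^{-1}(B))$ (see the definition of induced measure from Definition 2.2.1 in Athreya and Lahiri (2006)).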
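The data processing inequality can be checked numerically on a finite space, where a measurable map simply merges outcomes. The sketch below is illustrative: the distributions and the map `gamma` are hypothetical, and only the KL and chi-squared divergences are checked.

```python
import math


def kl(p, q):
    """KL divergence between finite distributions (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def chi2(p, q):
    """Chi-squared divergence between finite distributions (assumes q > 0)."""
    return sum(pi * pi / qi for pi, qi in zip(p, q)) - 1.0


def push_forward(p, gamma, n_out):
    """Induced measure of p under the map x -> gamma[x] into {0, ..., n_out - 1}."""
    out = [0.0] * n_out
    for x, mass in enumerate(p):
        out[gamma[x]] += mass
    return out


p = [0.1, 0.2, 0.3, 0.4]
q = [0.25, 0.25, 0.25, 0.25]
gamma = [0, 0, 1, 1]  # hypothetical map that merges outcomes pairwise

p_img, q_img = push_forward(p, gamma, 2), push_forward(q, gamma, 2)
for name, div in [("KL", kl), ("chi2", chi2)]:
    print(name, div(p_img, q_img), "<=", div(p, q))
```

Merging outcomes can only discard information that distinguishes the two measures, so each divergence of the induced measures is no larger than the original one.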
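Next, we introduce the notion of $f$-informativity (Csiszár, 1972). Let $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ be a family of probability measures on a space $\mathcal{X}$ and $w$ be a probability measure on $\Theta$. For each $f \in \mathcal{C}$, the $f$-informativity, $I_f(w, \mathcal{P})$, is defined as
$$I_f(w, \mathcal{P}) := \inf_{Q} \int_\Theta D_f(P_\theta \| Q) \, w(d\theta), \tag{11}$$
where the infimum is taken over all possible probability measures $Q$ on $\mathcal{X}$. When $f(x) = x \log x$ (so that the corresponding $f$-divergence is the KL divergence), the $f$-informativity is equal to the mutual information and is denoted by $I(w, \mathcal{P})$. We denote the informativity corresponding to the power divergence $D_{f_\alpha}$ by $I_{f_\alpha}(w, \mathcal{P})$. For the special case $\alpha = 2$, we use the more suggestive notation $I_{\chi^2}(w, \mathcal{P})$. The informativity corresponding to the total variation distance will be denoted by $I_{\mathrm{TV}}(w, \mathcal{P})$.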
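For a finite family on a finite space, two informativities have closed forms that make the infimum over $Q$ explicit. The sketch below relies on two standard facts that are stated here as assumptions rather than results of this section: the average KL divergence is minimized by the prior mixture of the family, and the average chi-squared divergence is minimized by $q(x) \propto \sqrt{\sum_\theta w(\theta) p_\theta(x)^2}$ (by the Cauchy-Schwarz inequality). The function names and example are hypothetical.

```python
import math


def kl(p, q):
    """KL divergence between finite distributions (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def chi2(p, q):
    """Chi-squared divergence between finite distributions (assumes q > 0)."""
    return sum(pi * pi / qi for pi, qi in zip(p, q)) - 1.0


def mixture(prior, family):
    """Prior-weighted mixture of the distributions in the family."""
    return [sum(w * p[x] for w, p in zip(prior, family)) for x in range(len(family[0]))]


def average_divergence(prior, family, q, div):
    """Integral of div(P_theta, Q) against the prior, for a fixed Q."""
    return sum(w * div(p, q) for w, p in zip(prior, family))


def mutual_information(prior, family):
    """KL informativity, evaluated at the mixture (the assumed minimizer)."""
    return average_divergence(prior, family, mixture(prior, family), kl)


def chi2_informativity(prior, family):
    """Chi-squared informativity via q(x) proportional to sqrt(sum_theta w p_theta(x)^2)."""
    root_sum = sum(
        math.sqrt(sum(w * p[x] ** 2 for w, p in zip(prior, family)))
        for x in range(len(family[0]))
    )
    return root_sum ** 2 - 1.0


prior = [0.5, 0.5]
family = [[0.8, 0.2], [0.3, 0.7]]
print(mutual_information(prior, family), chi2_informativity(prior, family))
```

Any fixed choice of $Q$ gives an upper bound on the informativity, which is the mechanism that covering-number arguments exploit; here the mixture and the uniform distribution can be plugged into `average_divergence` to see this.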
Additional notations and definitions are described as follows. Recall the Bayes risk (2) and the minimax risk (1). When the loss function and parameter space are clear from the context, we drop the dependence on and . When the prior is also clear from the context, we denote the Bayes risk by and the minimax risk by . We need certain notation for covering numbers. For a given -divergence and a subset , let denote any upper bound on the smallest number for which there exist probability measures that form an -cover of under the -divergence i.e.,
We write the covering number as when and when . We write when for other . We note that is an upper bound on the metric entropy. The quantity can be infinite if is arbitrary. For a vector and a real number , denote by the -norm of . In particular, denotes the Euclidean norm of . denotes the indicator function which takes value 1 when is true and 0 otherwise. We use , , etc. to denote generic constants whose values might change from place to place.
3 Bayes Risk Lower Bounds for Zero-one Valued Loss Functions and Their Applications
In this section, we consider zero-one loss functions and present a principled approach to derive Bayes risk lower bounds involving $f$-informativity for every $f \in \mathcal{C}$. Our results hold for any given prior $w$ and zero-one loss $L$. By specializing the $f$-divergence to KL divergence, we obtain the generalized Fano’s inequality (5). When specializing to other $f$-divergences, our bounds lead to some classical minimax bounds of Le Cam and Assouad (Assouad, 1983), more recent minimax results of Gushchin (2003); Birgé (2005) and also results in Tsybakov (2010, Chapter 2). Bayes risk lower bounds for general nonnegative loss functions will be presented in the next section.
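We need additional notation to state the main results of this section. For each $f \in \mathcal{C}$, let $\phi_f : [0, 1]^2 \to \mathbb{R}$ be the function defined in the following way: for $(a, b) \in [0, 1]^2$, $\phi_f(a, b)$ is the $f$-divergence between the two probability measures $P$ and $Q$ on $\{0, 1\}$ given by $P\{1\} = a$ and $Q\{1\} = b$. By the definition (9), it is easy to see that $\phi_f(a, b)$ has the following expression (recall that $f'(\infty) := \lim_{x \to \infty} f(x)/x$):
$$\phi_f(a, b) = \begin{cases} b f\left(\frac{a}{b}\right) + (1 - b) f\left(\frac{1 - a}{1 - b}\right) & \text{for } 0 < b < 1, \\ f(1 - a) + a f'(\infty) & \text{for } b = 0, \\ f(a) + (1 - a) f'(\infty) & \text{for } b = 1. \end{cases} \tag{13}$$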
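The convexity of $f$ implies monotonicity and convexity properties of $\phi_f$, which are stated in the following lemma.
Lemma 3.1. For each $f \in \mathcal{C}$, for every fixed $b \in [0, 1]$, the map $a \mapsto \phi_f(a, b)$ is non-increasing for $a \in [0, b]$ and is convex and continuous in $a$. Further, for every fixed $a \in [0, 1]$, the map $b \mapsto \phi_f(a, b)$ is non-decreasing for $b \in [a, 1]$.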
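The monotonicity properties in the lemma can be probed numerically. The sketch below evaluates the two-point $f$-divergence directly from its definition and checks, on a grid, that it is non-increasing in its first argument on $[0, b]$ and non-decreasing in its second argument on $[a, 1)$; these intervals, the generating functions, and the grid values are assumptions of the illustration.

```python
import math


def phi(f, f_prime_inf, a, b):
    """f-divergence between the two-point laws P{1} = a and Q{1} = b."""
    def term(p, q):
        # Contribution of one atom, with the convention 0 * inf = 0.
        if q > 0:
            return q * f(p / q)
        return 0.0 if p == 0 else f_prime_inf * p
    return term(a, b) + term(1.0 - a, 1.0 - b)


kl_f = lambda x: x * math.log(x) if x > 0 else 0.0
chi2_f = lambda x: x * x - 1.0


def is_monotone(values, increasing):
    """Check monotonicity of a sequence up to a small numerical tolerance."""
    pairs = zip(values, values[1:])
    return all((y >= x - 1e-12) if increasing else (y <= x + 1e-12) for x, y in pairs)


b_fixed, a_fixed = 0.6, 0.3
grid_a = [b_fixed * k / 50 for k in range(51)]                     # a in [0, b]
grid_b = [a_fixed + (0.99 - a_fixed) * k / 50 for k in range(51)]  # b in [a, 0.99]
for f in (kl_f, chi2_f):
    print(
        is_monotone([phi(f, math.inf, a, b_fixed) for a in grid_a], increasing=False),
        is_monotone([phi(f, math.inf, a_fixed, b) for b in grid_b], increasing=True),
    )
```

A grid check of this kind is not a proof, but it is a quick way to catch sign or argument-order mistakes when implementing $\phi_f$ for a new divergence.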
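We also define the quantity
$$R_0 := \inf_{a \in \mathcal{A}} \int_\Theta L(\theta, a) \, w(d\theta),$$
where the decision $a$ does not depend on data $X$. Note that $R_0$ represents the Bayes risk with respect to $w$ in the “no data” problem i.e., when one only has information on $\Theta$, $\mathcal{A}$, $L$ and the prior $w$ but not the data $X$. For simplicity, our notation for $R_0$ suppresses its dependence on these quantities. Because the loss function is zero-one valued so that $L(\theta, a) = 1 - \mathbb{I}\{L(\theta, a) = 0\}$, the quantity $R_0$ has the following alternative expression:
$$R_0 = 1 - \sup_{a \in \mathcal{A}} w(B(a)), \quad \text{where } B(a) := \{\theta \in \Theta : L(\theta, a) = 0\},$$
and $w(B(a))$ is the prior mass of the “ball” $B(a)$. It will be important in the sequel to observe that the Bayes risk $R_{\mathrm{Bayes}}(w, L; \Theta)$ is bounded from above by $R_0$. This is obvious because the risk with some data cannot be greater than the risk in the no data problem (which can be viewed as an application of the data processing inequality). Formally, if $\mathcal{D}_0$ is the class of the constant decision rules, then the Bayes risk is at most the infimum of the risk over $\mathcal{D}_0$, which equals $R_0$. Because $0 \le R_{\mathrm{Bayes}}(w, L; \Theta) \le R_0$, we have $R_{\mathrm{Bayes}}(w, L; \Theta) = 0$ when $R_0 = 0$. We shall therefore assume throughout this section that $R_0 > 0$.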
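For a finite parameter set and a zero-one loss, the “no data” risk can be computed either from its definition or from the largest prior mass of a zero-loss set, and the two must agree. The sketch below illustrates this on a hypothetical five-point example; the function names and the prior are assumptions made for illustration.

```python
def no_data_risk(prior, loss):
    """R_0 as the minimum over actions of the prior-expected loss (no data observed)."""
    n_a = len(loss[0])
    return min(sum(w * row[a] for w, row in zip(prior, loss)) for a in range(n_a))


def no_data_risk_via_balls(prior, loss):
    """Same quantity for a zero-one loss: 1 - max_a w{theta : L(theta, a) = 0}."""
    n_a = len(loss[0])
    return 1.0 - max(
        sum(w for w, row in zip(prior, loss) if row[a] == 0) for a in range(n_a)
    )


# Hypothetical example: Theta = A = {0, ..., 4}, loss 1 when |theta - a| > 1.
prior = [0.4, 0.3, 0.15, 0.1, 0.05]
loss = [[1 if abs(theta - a) > 1 else 0 for a in range(5)] for theta in range(5)]
print(no_data_risk(prior, loss), no_data_risk_via_balls(prior, loss))
```

Here the best constant decision is $a = 1$, whose zero-loss set $\{0, 1, 2\}$ carries the largest prior mass.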
The main result of this section is presented next. It provides an implicit lower bound for the Bayes risk in terms of and the -informativity for every . The only assumption is that is zero-one valued and we do not assume the existence of the Bayes decision rule.
where is the generalized inverse function of the non-increasing . As an illustration, we plot for and the corresponding Bayes risk lower bound in Figure 1. The lower bound (18) can be immediately applied to obtain Bayes risk lower bounds when the -divergence in (17) is chi-squared divergence, total variation distance, or Hellinger distance (see Corollary 3.7). However, for the KL divergence, there is no simple form of . To obtain the corresponding Bayes risk lower bound, we can invert (17) by utilizing the convexity of , which will give a generalized Fano’s inequality (see Corollary 3.5). In particular, since is convex (see Lemma 3.1),
where denotes the left derivative of at . The monotonicity of in (Lemma 3.1) gives and we thus have,
Inequality (17) can now be used to deduce that (note that )
Theorem 3.2 is new, but its special case , and the uniform prior is known (see Gushchin (2003) and Guntuboyina (2011a)). In such a discrete setting, for any and thus . The proof of Theorem 3.2 heavily relies on the following lemma, which is a consequence of the data processing inequality for -divergences (see (10) in Section 2).
Suppose that the loss function is zero-one valued. For every , every probability measure on and every decision rule , we have
We note that Lemma 3.3 is of independent interest, which can be applied to establish minimax lower bound as shown in the following remark.
Proof of Lemma 3.3.
Let denote the joint distribution of and under the prior i.e., and . For any decision rule , in (21) can be written as . Let denote the joint distribution of and under which they are independently distributed according to and respectively. The quantity in (21) can then be written as .
Because the loss function is zero-one valued, the function maps into . Our strategy is to fix and apply the data processing inequality (10) to the probability measures and the mapping . This gives
where and are induced measures on the space of . In other words, since is zero-one valued, both and are two-point distributions on with
Proof of Theorem 3.2.
We write as a shorthand notation of . By the definition (11) of , it suffices to prove that
for every probability measure .
Notice that . If , then the right hand side of (17) is zero and hence the inequality immediately holds. Assume that . Let be small enough so that . Let denote any decision rule for which and note that such a rule exists since . It is easy to see that
We thus have . By Lemma 3.3, we have
Because is non-increasing on , we have
Because is non-decreasing on , we have
Combining the above three inequalities, we have
Lemma 3.3 can also be used to derive minimax lower bounds in a different way. For example, when the minimax decision rule exists (e.g., for finite space and (Ferguson, 1967)), we have . If the probability measure is chosen so that , then, by Lemma 3.1, the right hand side of (17) can be lower bounded by replacing with which yields
Similarly, this inequality can be converted to an explicit lower bound on minimax risk. We will show an application of this inequality in deriving Birgé-Gushchin inequality (Gushchin, 2003; Birgé, 2005) in Section 3.3.
3.1 Generalized Fano’s Inequality
Corollary 3.5 (Generalized Fano’s inequality).
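For any given prior $w$ and zero-one loss $L$, we have
$$R_{\mathrm{Bayes}}(w, L; \Theta) \ge 1 + \frac{I(w, \mathcal{P}) + \log(1 + R_0)}{\log(1 - R_0)},$$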
where is defined in (16).
Proof of Corollary 3.5.
As mentioned in the introduction, the classical Fano inequality (3) and the recent continuum Fano inequality (4) are both special cases (restricted to uniform priors) of Corollary 3.5. The proof of (4) given in Duchi and Wainwright (2013) is rather complicated with a stronger assumption and a discretization-approximation argument. Our proof based on Theorem 3.2 is much simpler. Lemma 3.3 also has its independent interest. Using Lemma 3.3, we are able to recover another recently proposed variant of Fano’s inequality in Braun and Pokutta (2014, Proposition 2.2). Details of this argument are provided in Appendix A.2.
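The relationship between the implicit bound of Theorem 3.2 and the explicit Fano-type bound can be illustrated numerically for the KL divergence. The sketch below assumes that Theorem 3.2 reads $I(w, \mathcal{P}) \ge \phi_f(R, R_0)$ with $R$ the Bayes risk, and that the generalized Fano bound has the closed form $1 + (I + \log(1 + R_0))/\log(1 - R_0)$; it inverts the first numerically by bisection and compares the result with the second. The function names and numerical values are hypothetical.

```python
import math


def phi_kl(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b), for 0 < b < 1."""
    total = 0.0
    if a > 0:
        total += a * math.log(a / b)
    if a < 1:
        total += (1.0 - a) * math.log((1.0 - a) / (1.0 - b))
    return total


def invert_phi(info, r0, tol=1e-12):
    """Smallest a in [0, r0] with phi_kl(a, r0) <= info, found by bisection.

    Relies on a -> phi_kl(a, r0) being non-increasing on [0, r0] and on info >= 0.
    """
    if phi_kl(0.0, r0) <= info:
        return 0.0
    lo, hi = 0.0, r0  # phi_kl(lo, r0) > info >= phi_kl(hi, r0) = 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi_kl(mid, r0) <= info:
            hi = mid
        else:
            lo = mid
    return hi


def fano_closed_form(info, r0):
    """1 + (I + log(1 + R_0)) / log(1 - R_0); assumed form of the explicit bound."""
    return 1.0 + (info + math.log(1.0 + r0)) / math.log(1.0 - r0)


# Hypothetical values: 16 equally likely hypotheses (R_0 = 15/16) and I = 1 nat.
r0, info = 1.0 - 1.0 / 16, 1.0
print(invert_phi(info, r0), fano_closed_form(info, r0))
```

Because the explicit bound comes from a tangent-line (convexity) argument, the numerically inverted bound is at least as large; the explicit form trades a little tightness for a formula that is easy to manipulate.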
3.2 Specialization of Theorem 3.2 to Different -divergences and Their Applications
In addition to the generalized Fano’s inequality, Theorem 3.2 allows us to derive a class of lower bounds on Bayes risk for zero-one losses by plugging other -divergences. In the next corollary, we consider some widely used -divergences and provide the corresponding Bayes risk lower bounds by inverting (17) in Theorem 3.2.
Let be zero-one valued, be any prior on and . We then have the following inequalities
Total variation distance:
provided . Here
See Appendix A.3 for the proof of the corollary. The special case of Corollary 3.7 for , and being the uniform prior has been discovered previously in Guntuboyina (2011a). It is clear from Corollary 3.7 that the choice of -divergence will affect the tightness of the lower bound for . In Appendix A.5, we provide a qualitative comparison of the lower bounds (25), (26) and (28). In particular, we show that in the discrete setting with , the lower bounds induced by the KL divergence and the chi-squared divergence are much stronger than the bounds given by the Hellinger distance. Therefore, in most applications in this paper, we shall only use the bounds involving the KL divergence and the chi-squared divergence.
Corollary 3.7 can be used to recover classical inequalities of Le Cam (for two point hypotheses) and Assouad (Theorem 2.12 in Tsybakov (2010) with both total variation distance and Hellinger distance) and Theorem 2.15 in Tsybakov (2010) that involves fuzzy hypotheses. The details are presented in Appendix A.4.
3.3 Birgé-Gushchin’s Inequality
In this section, we expand (24) in Remark 3.4 to obtain a minimax risk lower bound due to Gushchin (2003) and Birgé (2005), which presents an improvement of the classical Fano’s inequality when specializing to KL divergence.
Proof of Proposition 3.8.
To prove Proposition 3.8, it is enough to prove that for every . Without loss of generality, we assume that . We apply (20) with the uniform distribution on as , and the minimax rule for the problem as . Because is the minimax rule, . Also
It is easy to verify that . We thus have . Because is minimax, and thus
On the other hand, we have . To see this, note that the minimax risk is upper bounded by the maximum risk of a random decision rule, which chooses among the hypotheses uniformly at random. For this random decision rule, its risk is no matter what the true hypothesis is. Thus, is an upper bound on the minimax risk. We thus have, from (30), that . We can thus apply (24) to obtain
which completes the proof of Proposition 3.8. ∎
4 Bayes Risk Lower Bounds for Nonnegative Loss Functions
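In the previous section, we discussed Bayes risk lower bounds for zero-one valued loss functions. We deal with general nonnegative loss functions in this section. The main result of this section, Theorem 4.1, provides lower bounds for $R_{\mathrm{Bayes}}(w, L; \Theta)$ for any given loss $L$ and prior $w$. To state this result, we need the following notion. Fix $f \in \mathcal{C}$ and recall the definition of $\phi_f$ in (13). We define $u_f : [0, \infty) \to [1/2, 1]$ by
$$u_f(x) := \inf\{1/2 \le b \le 1 : \phi_f(1/2, b) > x\}, \tag{31}$$
and if $\phi_f(1/2, b) \le x$ for every $b \in [1/2, 1]$, then we take $u_f(x)$ to be 1. By Lemma 3.1, it is easy to see that $u_f(x)$ is a non-decreasing function of $x$. For example, for KL-divergence with $f(x) = x \log x$, we have $\phi_f(1/2, b) = \frac{1}{2} \log \frac{1}{4b(1 - b)}$ and $u_f(x) = \frac{1}{2} + \frac{1}{2}\sqrt{1 - e^{-2x}}$ (see Figure 2). We are now ready to state the main theorem of this paper.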
For every and , we have
Proof of Theorem 4.1.
Fix and . Let be a shorthand notation. Suppose is such that
We prove below that and this would complete the proof. Let denote the zero-one valued loss function . It is obvious that and hence the proof will be complete if we establish that . Let for a shorthand notation.
Because is a zero-one valued loss function, Theorem 3.2 gives
By (34), it then follows that . By definition of , it is clear that there exists such that (this in particular implies that ). Lemma 3.1 implies that is non-decreasing for , which yields . The above two inequalities imply . Combining this inequality with (35), we have
Lemma 3.1 shows that is non-increasing for . Thus, we have . ∎