Distribution-dependent concentration inequalities for tighter generalization bounds

Abstract

Concentration inequalities are indispensable tools for studying the generalization capacity of learning models. Hoeffding’s and McDiarmid’s inequalities are commonly used, giving bounds independent of the data distribution. Although this makes them widely applicable, a drawback is that the bounds can be too loose in some specific cases. While efforts have been devoted to improving the bounds, we find that the bounds can be further tightened in some distribution-dependent scenarios and that the conditions of the inequalities can be relaxed. In particular, we propose four types of conditions for probabilistic boundedness and bounded differences, and derive several distribution-dependent extensions of Hoeffding’s and McDiarmid’s inequalities. These extensions provide bounds for functions that do not satisfy the conditions of the existing inequalities and, in some special cases, tighter bounds. Furthermore, we obtain generalization bounds for unbounded and hierarchy-bounded loss functions. Finally, we discuss the potential applications of our extensions to learning theory.

1 Introduction

Concentration inequalities play a crucial role in statistical learning theory because they are useful for deriving the generalization capacity of learning models. Generally, they can be used to estimate the deviations between empirical risk and expectation risk [1]. Some important learning theories such as Rademacher complexity have been developed by applying concentration inequalities to bound such deviations [2].

Two commonly used concentration inequalities in learning theory are Hoeffding’s [11] and McDiarmid’s [12] inequalities. Besides being used to analyze algorithmic stability [5], Hoeffding’s and McDiarmid’s inequalities, both giving bounds independent of the distribution, are powerful tools for estimating the VC dimension and Rademacher complexity [16].

These two inequalities, however, have two major limitations: 1) they cannot deal with unbounded functions; 2) their bounds are weak for functions that are bounded by a large constant only on a small exceptional set. In the former case, because these inequalities hold for any distribution, the estimate of the deviations is likely to be loose. In the latter case, the bounds become less tight because the large constant dominates them. To address these issues, [14] proved two extensions of McDiarmid’s inequality for strongly and weakly difference-bounded functions and used them to study the generalization capacity. [15] proved an extension of McDiarmid’s inequality based on the subgaussian diameter. Recently, [22] proposed an extension of McDiarmid’s inequality for functions with bounded differences on a high-probability set and no restriction outside this set. [23] extended McDiarmid’s inequality by relaxing the Lipschitz condition, requiring Lipschitz bounds only when a single variable is changed.

However, both the strong and weak bounded-difference conditions proposed by [14] have shortcomings in practice. The approach proposed by [15] requires an extra metric on the sample space and a bounded subgaussian diameter. The weaker Lipschitz condition given by [23] is only useful for bounded functions. Meanwhile, the bound discussed in [22] can be further tightened, which will be detailed in Section 2. After examining the assumptions of Hoeffding’s and McDiarmid’s inequalities, we propose extensions of these two inequalities that handle probabilistic boundedness and probabilistic bounded differences. Our results improve the bound in [22] and the bounds of the original inequalities, and can also handle unbounded functions without introducing extra metrics.

Main results. We prove several distribution-dependent extensions of Hoeffding’s and McDiarmid’s inequalities and obtain tighter generalization bounds. In Theorem 1 and Corollary 1, we obtain new inequalities for the cases of probabilistic boundedness and bounded differences. These extensions are distribution dependent and consequently yield better estimates (for example, Theorem 2 and Corollary 2) for some examples in learning theory.

Motivation. Unbounded loss functions often occur in the analysis of regression and classification [24]. Since Hoeffding’s and McDiarmid’s inequalities provide bounds independent of the distribution, we expect our proposed distribution-dependent bounds to be tighter in each specific case.

Outline. In Section 2, we analyze the disadvantages of Hoeffding’s and McDiarmid’s inequalities and discuss related work. In Section 3, we introduce the basic notations and definitions. In Section 4, we show the limitations of the conditions of existing inequalities using two examples; furthermore, we propose four assumptions on probabilistic boundedness and bounded differences and compare them with the previous bounded-difference conditions. In Section 5, we present the probabilistic extensions of Hoeffding’s and McDiarmid’s inequalities. In Section 6, we discuss the potential applications of our results in learning theory and show how to use them to analyze the generalization of learning models. We conclude in Section 7.

2 Related Work

Although Hoeffding’s and McDiarmid’s inequalities have achieved great success in learning theory, [13] noted their limitations in applications due to the fact that these inequalities are distribution independent and cannot provide generalization bounds for unbounded loss functions.

To address these issues, researchers have studied more general conditions under which concentration inequalities exist. Specifically, assuming that the function is bounded on one side, [28] gave an extension of Hoeffding’s inequality to unbounded random variables with bounded mathematical expectation. [14] proved two extensions of McDiarmid’s inequality to strongly and weakly difference-bounded functions (See Definitions 2 and 3 in Section 3) for the study of the generalization error.

[14] assumed that there exist constant vectors, e.g., and , with for all , such that the function has bounded differences on a subset of and bounded differences on the complement of . [15] noted that the strong and weak bounded-difference conditions proposed by [14] have limitations in practice and that the bounds of the inequalities are uninformative if is infinite. To relax the difference-bounded conditions, [15] introduced the subgaussian diameter and proved an extension of McDiarmid’s inequality based on it. Nevertheless, the approach proposed in [15] requires an extra metric on the sample space and a bounded subgaussian diameter. Recently, [22] developed more general difference-bounded conditions: the function has bounded differences on a high-probability set and is arbitrary outside of , the measure of which is controlled by a probability (this is similar to Assumption 3 in Section 4). Finally, [22] proposed an extension of McDiarmid’s inequality:

It is worth pointing out that the bound in (Equation 1) given by [22] can be further tightened.

In contrast to the boundedness conditions discussed above, our extended conditions require neither 1) bounded loss functions, as in [14], nor 2) extra metrics, as in [15]. Roughly speaking, they can be classified into two cases:

Conditions similar to the boundedness conditions given by [22] (see Assumptions 1 and 3 in Section 4); in this case, we obtain a refined bound.

Conditions that refine the boundedness and bounded differences (see Assumptions 2 and 4 in Section 4); in this case, we obtain a tighter bound.

3 Notations and Definitions

In this section, we introduce notations and definitions.

Let denote the indicator function.

Let be the set of natural numbers, be the set of real numbers. Let .

Let be a probability space, where is the sample space, is a -algebra on it, and is a probability measure. The sample space has the structure , where and are the input space and output space, respectively. The set denotes the empty set.

Let be the set of all measurable functions . Assume that is a subset of , i.e., ; this set is called the hypothesis class.

Let be a finite set of labeled training samples, and assume that these samples are independent and identically distributed (i.i.d.) according to . Bold letters denote vectors; for example, the bold represents the vector .

Let be the loss function, , and the loss of on a sample point is defined by . The function is nonnegative but not necessarily bounded. Three well-known examples often used in machine learning are the absolute loss , the squared loss and the loss [15]. Here, and is a statistical model of conditional densities for .

In learning theory, one of the goals is to find a function in hypothesis space that minimizes the following generalization error . Generally speaking, the distribution in the equation is unknown. Rather than minimizing , we usually minimize the following training error:

In this paper, we are interested in the uniform estimation of .
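As a concrete illustration of the quantities introduced above (not part of the original development), the following Python sketch estimates the training error of a fixed hypothesis under a bounded loss and evaluates the standard Hoeffding-type deviation term $\sqrt{\log(2/\delta)/(2m)}$ that holds for a single hypothesis. The data-generating process, the threshold classifier and the 0-1 loss are hypothetical choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_one_loss(prediction, label):
    """0-1 loss, bounded in [0, 1]."""
    return float(prediction != label)

def training_error(predict, xs, ys):
    """Empirical (training) error of a fixed hypothesis on a sample."""
    return np.mean([zero_one_loss(predict(x), y) for x, y in zip(xs, ys)])

def hoeffding_deviation(m, delta):
    """With probability at least 1 - delta, the gap between expected and
    empirical error is at most this value for one fixed hypothesis and a
    loss bounded in [0, 1]."""
    return np.sqrt(np.log(2.0 / delta) / (2.0 * m))

# Hypothetical data: x ~ N(0, 1), label sign(x) flipped with probability 0.1.
m = 2000
xs = rng.normal(size=m)
ys = np.sign(xs) * np.where(rng.random(m) < 0.1, -1.0, 1.0)

predict = lambda x: np.sign(x)  # a fixed threshold hypothesis
print("training error:", training_error(predict, xs, ys))
print("Hoeffding deviation at delta=0.05:", hoeffding_deviation(m, 0.05))
```

Uniform estimation over a whole hypothesis class requires additional tools (e.g., Rademacher complexity); the sketch only illustrates the single-hypothesis deviation that the concentration inequalities below control.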

Definition 1 (Uniformly difference-bounded [14]) Let be a function. We say that is uniformly difference-bounded by , if the following holds:

For any , if differ only in the th coordinate, that is, there exists , s.t. and , then we have .

Definition 2 (Strongly difference-bounded [14]) Let be a function. We say that is strongly difference-bounded by , if the following holds:

There exists a bad subset , where . For any , if differ only in the th coordinate and , then ; if differ only in the th coordinate, then .

Definition 3 (Weakly difference-bounded [14]) Let be a function. We say that is weakly difference-bounded by , if the following holds:

For any , we have

where , and for .

For any and differing only in the th coordinate, moreover, .

Note 1 The equation (Equation 2) means that if we construct by replacing the th entry of with , then holds for all but a fraction of the choices.

For the discussion in later sections and to be self-contained, we recall the original forms of Hoeffding’s and McDiarmid’s inequalities:

Hoeffding’s inequality [11] Let $X_1,\ldots,X_n$ be independent random variables on a probability space, s.t. $a_i \le X_i \le b_i$ for each $i$. Set $S_n=\sum_{i=1}^{n}X_i$. Then, for all $\varepsilon>0$ we have

$$P\big(|S_n-\mathbb{E}[S_n]|\ge \varepsilon\big)\le 2\exp\!\left(-\frac{2\varepsilon^2}{\sum_{i=1}^{n}(b_i-a_i)^2}\right), \qquad \text{(Equation 3)}$$

where $\mathbb{E}[S_n]$ denotes the expectation of $S_n$.

McDiarmid’s inequality [12] Let $X_1,\ldots,X_n$ be independent random variables on a probability space. Then, for all $\varepsilon>0$ we have

$$P\big(|f(X_1,\ldots,X_n)-\mathbb{E}[f(X_1,\ldots,X_n)]|\ge \varepsilon\big)\le 2\exp\!\left(-\frac{2\varepsilon^2}{\sum_{i=1}^{n}c_i^2}\right), \qquad \text{(Equation 4)}$$

where $f$ is a real-valued function of the sequence $X_1,\ldots,X_n$ that is uniformly difference-bounded by $(c_1,\ldots,c_n)$, i.e., $|f(\mathbf{x})-f(\mathbf{x}')|\le c_i$ whenever $\mathbf{x}$ and $\mathbf{x}'$ differ only in the $i$th coordinate.
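As a sanity check on the statement above (not part of the original text), the following sketch compares the empirical tail probability of $|S_n-\mathbb{E}[S_n]|$ with the right-hand side of (Equation 3) for i.i.d. Uniform[0, 1] variables; the distribution, the value of $n$ and the grid of $\varepsilon$ values are arbitrary choices made for illustration. The gap between the two columns also illustrates the looseness of the distribution-free bound discussed in the Introduction.

```python
import numpy as np

rng = np.random.default_rng(1)

n, trials = 50, 100_000
a, b = 0.0, 1.0  # X_i takes values in [a, b]

# Simulate S_n = X_1 + ... + X_n for i.i.d. Uniform[a, b] samples.
samples = rng.uniform(a, b, size=(trials, n))
S = samples.sum(axis=1)
deviations = np.abs(S - n * (a + b) / 2.0)

for eps in (2.0, 4.0, 6.0):
    empirical = np.mean(deviations >= eps)
    hoeffding = 2.0 * np.exp(-2.0 * eps**2 / (n * (b - a) ** 2))
    print(f"eps={eps}: empirical tail = {empirical:.5f}, Hoeffding bound = {hoeffding:.5f}")
```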

4 Addressing the Limitations of Previous Concentration Inequalities

In this section, we analyze the conditions of the inequalities in previous works, and show that they have limitations in some examples. At the end of this section, we discuss several general assumptions for the boundedness conditions of concentration inequalities.

4.1 Limitations in two cases

Here, we analyze two examples to illustrate the limitations of previous concentration inequalities.

Example 1 Let , set . For all , if we set , then we have the identity . By the above definition of , it follows that and .

Note 2 In this example, the random variable is unbounded, and thus does not satisfy the condition of Hoeffding’s inequality. If the random variable is limited in a certain range (for example, ), the condition of Hoeffding’s inequality will be satisfied.
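The distribution of Example 1 is not reproduced here, but the remark about restricting the range can be illustrated generically. The following sketch (with a hypothetical heavy-tailed lognormal variable, clipping level B and confidence level delta) clips the variable to [0, B] so that Hoeffding's condition holds for the clipped mean, and also prints the bias that this artificial clipping introduces, which the clipped analysis ignores.

```python
import numpy as np

rng = np.random.default_rng(4)

m, B, delta = 5000, 10.0, 0.05  # hypothetical sample size, clipping level, confidence level
x = rng.lognormal(mean=0.0, sigma=2.0, size=m)  # an unbounded, heavy-tailed variable

x_clip = np.clip(x, 0.0, B)  # after clipping, the values lie in [0, B] and Hoeffding applies
deviation = B * np.sqrt(np.log(2.0 / delta) / (2.0 * m))  # Hoeffding deviation for the clipped mean

print("mean of clipped sample :", x_clip.mean())
print("Hoeffding deviation    :", deviation)
print("clipping bias (sample) :", x.mean() - x_clip.mean())
```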

Example 2 Let , set . Here, we set , is a constant, .

For all , we assume that , where denotes the cardinality of . By the above definition of , it follows that and .

Note 3 In this example, is not uniformly difference-bounded and thus fails to satisfy the condition of McDiarmid’s inequality. But if the sample space is limited to a certain range (for example, ), the condition of McDiarmid’s inequality is satisfied. In addition, we observe that is neither weakly difference-bounded nor strongly difference-bounded.

4.2 Assumptions

The aforementioned Examples 1 and 2 do not satisfy the existing definitions of boundedness and bounded differences. A possible reason is that these existing definitions are either somewhat restrictive or neglect the robustness of learning theory in some cases. We therefore propose the four assumptions below to alleviate these issues.

Assumption 1 ( bounded) Let be the independent random variable on a probability , s.t. . If this is true, then we say that is bounded by the pair .

Assumption 2 ( hierarchy-bounded) Let be the independent random variable on a probability , s.t. there exists an integer , we have and . If this is true, then is hierarchy-bounded by the pair .

Assumption 3 ( difference-bounded)1 Let be a function, s.t. for any , , for any and differ only in the th coordinate, we have and . If this is true, then is difference-bounded by .

Assumption 4 ( hierarchy-difference-bounded) Let be a function, s.t. for any , there exists an integer , , and , for any and differ only in the th coordinate, we have and . If this is true, then is hierarchy-difference-bounded by .

Under Assumptions 1 and 3, we study whether Hoeffding’s and McDiarmid’s inequalities still hold. Under Assumptions 2 and 4, we investigate whether the convergence bounds of Hoeffding’s and McDiarmid’s inequalities can be improved.

In fact, it is not difficult to see from Example 2 that it does not always concentrate around its expectation. For instance, will be close to as tends to infinity. Therefore, Hoeffding’s and McDiarmid’s inequalities do not hold for such functions as in Example 2. Based on these assumptions, we will present several Hoeffding- and McDiarmid-type inequalities (see Theorems 1, 2 and Corollaries 1, 2 in Section 5), which can deal with unbounded and hierarchy-bounded functions, as illustrated by the sketch below.
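The following sketch is a numerical illustration of this point under assumed parameters (it is not the construction of Example 2): each coordinate is bounded by 1 except on a rare bad event of probability p_bad, where it takes a much larger value M. The sum then stays close to its conditional expectation given the good set, while the unconditional expectation is pulled away by the rare large values, so a bound centered at the expectation is uninformative.

```python
import numpy as np

rng = np.random.default_rng(2)

n, trials = 100, 50_000
p_bad, M = 1e-3, 1e4  # probability of the rare event and the large value taken on it

# Each coordinate is Uniform[0, 1], except with probability p_bad it equals M.
good = rng.random((trials, n)) >= p_bad
X = np.where(good, rng.uniform(0.0, 1.0, size=(trials, n)), M)
S = X.sum(axis=1)

all_good = good.all(axis=1)            # runs in which every coordinate is bounded
expectation = n * ((1 - p_bad) * 0.5 + p_bad * M)
conditional_expectation = n * 0.5      # E[S | all coordinates are in [0, 1]]

print("P(good set)                    :", all_good.mean())
print("E[S]                           :", expectation)
print("E[S | good set]                :", conditional_expectation)
print("P(|S - E[S]| <= 5)             :", np.mean(np.abs(S - expectation) <= 5))
print("P(|S - E[S|good]| <= 5 | good) :",
      np.mean(np.abs(S[all_good] - conditional_expectation) <= 5))
```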

At the end of this section, we compare the four proposed bounded conditions with the three previous difference-bounded definitions (Definitions 1-3 in Section 3). Since a sum of random variables can be regarded as a special case of a multivariate random function, we only discuss the relationships between the difference-bounded, hierarchy-difference-bounded, uniformly difference-bounded, strongly difference-bounded and weakly difference-bounded conditions. We have the following pairwise comparisons . Here the symbol “” means that the condition on the left is strictly stronger than the one on the right. From this relation, we can see that the difference-bounded condition is the weakest and the hierarchy-difference-bounded condition is the strongest.

5 Extensions of Hoeffding’s and McDiarmid’s Inequalities

In this section, we will show several extensions to Hoeffding’s and McDiarmid’s inequalities.

Essentially, Hoeffding’s inequality (Equation 3) in Section 3 can be proved by combining the properties of convex functions, Taylor expansion, the monotonicity of probability measures, the exponential Markov inequality and the independence of random variables. Meanwhile, McDiarmid’s inequality (Equation 4) in Section 3 can be proved by constructing martingale difference sequences in combination with a proof similar to that of Hoeffding’s inequality. To extend these two concentration inequalities, we prove Theorems 1 and 2 using conditional mathematical expectation. We assume that are independent random variables on a probability space , and give the following condition:

Condition 1 (Partition of product space) Let . Set

the set is called a partition of .

We first provide two lemmas which will be used later.

Lemma 1 Assume that is hierarchy-bounded by the pair , and Condition 1 holds. Let . Set and , , . Then, for any , we have

where , is a constant vector, and we adopt the convention that and .

Proof By the assumptions, the definition of conditional mathematical expectation, and the additivity and monotonicity of the probability measure, for any and , there exists a constant vector such that

Here, . Taking , the right-hand side of the inequality (Equation 6) attains its minimum value . Finally, we have

The proof of Lemma 1 is completed.

Similarly, we have the following Lemma 2.

Lemma 2 Let the function be a map from to . Assume that is difference-bounded by . Set . Then, for any , we have

where .

Denote as the algebra , and .

Theorem 1 Under the assumptions of Lemma 1, for any , we have

where , and we adopt the convention that and , .

Proof From the assumptions, we have

By the equation (Equation 10) and definition of conditional mathematical expectation and the additivity and monotonicity of the probability measure, for any , we have

where.

By Lemma 1, it is easy to show that Theorem 1 holds.

Theorem 2 Let the function be a map from to . Assume that is hierarchy-difference-bounded by , and Condition 1 holds. Let . Set , , . For any , we have

where , and we adopt the convention that .

Proof By combining Lemma 2 with the method employed in the proof of Theorem 1, Theorem 2 can be easily proved.

From the above Theorems, we have the following corollaries:

Corollary 1 Assume that is bounded by the pair . Set and . Then, for any , we have

where .

Corollary 2 Under the assumptions of Lemma 2, for any , we have

where .

Note 4 From the above theorems and corollaries, we can conclude that:

Under the four assumptions proposed in this paper, the random variable or multivariate random function need not concentrate around its mathematical expectation. Theorems 1 and 2 imply that the random variable or multivariate random function instead concentrates around its conditional expectation. To some extent, Corollaries 1 and 2 also imply such concentration in these two cases: 1) is close to as tends to , see Example 3, or 2) .

The original inequalities (Equation 3) and (Equation 4) can be viewed as special cases of the extensions (Equation 12) and (Equation 13): if increases up to , then the inequality (Equation 12) reduces to the inequality (Equation 3) in Section 3; if increases up to , then the inequality (Equation 13) reduces to the inequality (Equation 4) in Section 3. Here equals , and equals . In addition, the bounds of the extensions (the inequalities (Equation 12) and (Equation 13)) improve the bound of (Equation 1) given by [22]: 1) the bound of (Equation 1) is trivial if the term in (Equation 1) is larger than , whereas our extensions have no such limitation; 2) the factor of the term ‘’ in (Equation 1) is always , whereas our factor is a probability .

The bounds of the extensions (the inequalities (Equation 9) and (Equation 11)) are tighter than (Equation 3) and (Equation 4): let and , where . Then we have

where .

According to the definition of conditional expectation, implies . Substituting the inequality (Equation 14) into the inequality (Equation 9), we thus get

The inequality (Equation 15) means that if we take and , in Theorem 1, then the inequality (Equation 9) reduces to the inequality (Equation 3) in Section 3. Thus, we conclude that if the bound of the random variables is refined on a sample space , then Theorem 1 yields a tighter bound.

Similarly, if we take , in Theorem 2, then the inequality (Equation 11) reduces to the inequality (Equation 4) in Section 3, leading to the conclusion that if the bounded differences of the function are refined on a sample space , then Theorem 2 yields a tighter bound.
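The exact constants of Theorems 1 and 2 are not reproduced here, but the effect of refining a bound can already be seen in the classical inequality. The following sketch compares the Hoeffding tail bound (Equation 3) computed with a loose range [0, 1] against the same bound computed with a refined range [0, 0.2] that the variables are assumed to actually satisfy; the ranges, n and epsilon are hypothetical values chosen only for illustration.

```python
import numpy as np

def hoeffding_bound(n, eps, lo, hi):
    """Right-hand side of (Equation 3) for n i.i.d. variables with values in [lo, hi]."""
    return 2.0 * np.exp(-2.0 * eps**2 / (n * (hi - lo) ** 2))

n, eps = 100, 2.0
print("loose range [0, 1]    :", hoeffding_bound(n, eps, 0.0, 1.0))
print("refined range [0, 0.2]:", hoeffding_bound(n, eps, 0.0, 0.2))
```

With the loose range the bound exceeds 1 and is trivial, whereas the refined range gives a nontrivial bound; this mirrors, in the simplest setting, the refinement effect described above.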

Note 5 For unbounded random variables, there are also some Bernstein-like results [?]. These results all require that a moment (for example, the variance) exists or is uniformly bounded, which limits their applicability in some settings, whereas our results, based on Corollary 1 or 2, place no restrictions on the moments.

6 Applications in Statistical Learning Theory

In the previous section, we proposed several extensions and compared them with existing bounds. We now discuss applications of these extensions in learning theory through four examples. We show that our extensions decay slightly faster than the existing results in some special cases, and that artificially bounding an unbounded loss function may fail to detect overfitting.

Example 3 [22] gave an example as follows:

Let , follows a Bernoulli distribution , , , there exists a constant . Set a piecewise function: if , ; if , ; otherwise, . Then, [22] obtained the generalization bound as follows

From Corollary 1, we have the following generalization bound

From Figure 1(1) it can be seen that the error probability from Corollary 1 decays slightly faster than the error probability from [22].
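The exact forms of the two bounds in Example 3 depend on quantities defined above that are not reproduced here. The following sketch therefore illustrates the underlying phenomenon with a generic, well-known comparison: for the mean of i.i.d. Bernoulli(p) variables, the distribution-free Hoeffding bound exp(-2 m eps^2) versus the distribution-dependent Chernoff bound exp(-m KL(p+eps, p)), which is much tighter when p is small. The sample size m, the success probability p and the deviation eps are hypothetical values.

```python
import numpy as np

def kl_bernoulli(q, p):
    """KL divergence KL(Bernoulli(q) || Bernoulli(p))."""
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def hoeffding_tail(m, eps):
    """Distribution-free bound on P(sample mean - p >= eps)."""
    return np.exp(-2.0 * m * eps**2)

def chernoff_kl_tail(m, p, eps):
    """Distribution-dependent (Chernoff) bound on P(sample mean - p >= eps) for Bernoulli(p)."""
    return np.exp(-m * kl_bernoulli(p + eps, p))

m, p, eps = 500, 0.01, 0.05  # hypothetical sample size, success probability and deviation
print("Hoeffding bound  :", hoeffding_tail(m, eps))
print("Chernoff-KL bound:", chernoff_kl_tail(m, p, eps))
```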

Example 4 We assume that , follows a multinomial distribution , . Let . By the assumptions, we have , and . Then, from the equation (), we have

where .

From Theorem 2 we have

where , and . The values of , and can be seen as the weights obtained according to the proportion of samples.

From Figure 1(2) it can be seen that the error probability from Theorem 2 initially decays faster than that from McDiarmid’s inequality, and then the distinction between the two error probabilities gradually becomes less visible.

Example 5 We assume that , follows a multinomial distribution , . Let . By the assumptions, we have . Then, from Corollary 2 we have

where .

From Figure 1(3) it can be seen that the error probability decreases as the sample complexity increases when the sample complexity is smaller than ; when the sample complexity is larger than , the error probability increases with the sample complexity. In contrast, the result from McDiarmid’s inequality is trivial. Examples 1 and 2 in Section 4 can be analyzed similarly.

For simplicity, in the discussions of Examples 3-5 we do not involve loss functions. In the last example, we will introduce a loss function and show how to apply our extensions.

Example 6 We assume that the linear model for regression is , where is a standard Cauchy random variable with the density function . The loss function is defined by the absolute loss (Denoted by ).

It is obvious that the expected value of does not exist. Therefore, Hoeffding’s and McDiarmid’s inequalities, which are distribution independent, cannot be applied here (the sketch below illustrates this failure). Our results, however, remain valid, and we can employ Corollary 2 to analyze the generalization bound.
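To see numerically why distribution-independent bounds break down here, the following sketch (with hypothetical model parameters w_true and w_hat) simulates the empirical mean of the absolute loss when the noise is standard Cauchy. Because the expectation of the loss does not exist, the sample means do not settle down as the sample size grows, so no deviation bound centered at an expectation can apply.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical linear model y = w_true * x + xi with standard Cauchy noise xi.
w_true, w_hat = 2.0, 2.0  # the learned coefficient happens to equal the true one

for m in (10**3, 10**4, 10**5, 10**6):
    x = rng.uniform(-1.0, 1.0, size=m)
    xi = rng.standard_cauchy(size=m)
    y = w_true * x + xi
    abs_loss = np.abs(w_hat * x - y)  # absolute loss; equals |xi| since w_hat == w_true
    # The running mean of a Cauchy-tailed loss keeps drifting instead of converging.
    print(f"m = {m:>7}: mean absolute loss = {abs_loss.mean():.3f}")
```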

Let the set in Corollary 2 be . Then, we have

where , .

From Figure 1(4) it can be seen that the error probability is smaller than when the sample complexity is smaller than ; when the sample complexity is larger than , the error probability becomes larger than and tends to as the sample complexity increases.

Note 6 Examples 3 and 4 indicate that our extensions decay slightly faster than the existing results when the sample complexity is not too large. Examples 5 and 6 show that our extensions describe how the error probability evolves with the sample complexity, whereas the existing results fail or are trivial. Furthermore, from Example 4, we find that weighting by the proportion of samples is effective when the sample complexity is small and becomes negligible as the sample complexity increases. From Examples 5 and 6, we can tell that generalization analysis based on artificially bounding an unbounded loss function (or a loss function whose expectation does not exist) may not catch the overfitting, and that the classical learning framework may not be sufficient for the generalization analysis of such loss functions. Such a phenomenon has also been observed by [29].

7 Conclusion

In this paper, we review the conditions and limitations of Hoeffding’s and McDiarmid’s inequalities. We propose four new conditions and compare them with the existing difference-bounded conditions. Based on the proposed conditions, we obtain several extensions of Hoeffding’s and McDiarmid’s inequalities. Through four examples, we also discuss the potential applications of our extensions in learning theory. As future work, we will study how to design effective machine learning algorithms for problems in learning theory based on the proposed extensions.

Footnotes

  1. This assumption is similar to the assumption in [22].

References

  1. The Nature of Statistical Learning Theory.
    V. N. Vapnik. Springer-Verlag, New York, 1995.
  2. Rademacher penalties and structural risk minimization.
    V. Koltchinskii. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
  3. Rademacher and Gaussian complexities: risk bounds and structural results.
    P. L. Bartlett and S. Mendelson. Journal of Machine Learning Research, 3(3):463–482, 2003.
  4. Local Rademacher complexities.
    P. L. Bartlett and S. Mendelson. Annals of Statistics, 33(4):1497–1537, 2005.
  5. Stability and generalization.
    O. Bousquet and A. Elisseeff. Journal of Machine Learning Research, 2(3):499–526, 2001.
  6. On the convergence rate of good-turing estimators.
    D. Mcallester and R. E. Schapire. COLT, 2000.
  7. Why averaging classifiers can protect against overfitting.
    Y. Freund, Y. Mansour, and R. E. Schapire. AISTATS, 2001.
  8. Theory of classification: a survey of some recent advances.
    S. Boucheron, O. Bousquet, and G. Lugosi. ESAIM: Probability & Statistics, 3(2005):323–375, 2005.
  9. Predictive PAC learning and process decompositions.
    C. R. Shalizi and A. Kontorovich. NIPS, 2013.
  10. Rademacher complexity bounds for Non-IID.
    M. Mohri and A. Rostamizadeh. NIPS, 2010.
  11. Probability inequalities for sums of bounded random variables.
    W. Hoeffding. Journal of the American Statistical Association, 58(301):13–30, 1963.
  12. On the method of bounded differences.
    C. McDiarmid. Surveys in Combinatorics, 141:148–188, 1989.
  13. The interaction of stability and weakness in Adaboost.
    S. Kutin and P. Niyogi. Technical Report TR-2001-30, Computer Science Department, University of Chicago, 2001.
  14. Extensions to McDiarmid’s inequality when differences are bounded with high probability.
    S. Kutin. Technical Report TR-2002-04, Department Computer Science, University of Chicago, 2002.
  15. Concentration in unbounded metric spaces and algorithmic stability.
    A. Kontorovich. ICML, 2014.
  16. Improved bounds on the sample complexity of learning.
    L. Yi and M. L. Philip. http://ssltest.cs.umd.edu/srin/PDF/learning-jou.pdf, 2000.
  17. Refined Rademacher chaos complexity bounds with applications to the multikernel learning problem.
    Y. Lei and L. Ding. Neural Computation, 26(4):739–760, 2014.
  18. Generalization bounds for metric and similarity learning.
    Q. Cao, Z. Guo, and Y. Ying. Machine Learning, 102(1):1–18, 2012.
  19. Rademacher chaos complexities for learning the kernel problem.
    Y. Ying and C. Campbell. Neural Computation, 22(11):2858–2886, 2010.
  20. Dropout Rademacher complexity of deep neural networks.
    W. Gao and Z. Zhou. http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/sciChina16dropout.pdf, 2014.
  21. Almost-everywhere algorithmic stability and generalization error.
    S. Kutin and P. Niyogi. http://arxiv.org/pdf/1301.0579v1.pdf, 2002.
  22. An extension of McDiarmid’s inequality.
    R. Combes. http://arxiv.org/pdf/1511.05240v1.pdf, 2015.
  23. On the method of typical bounded differences.
    L. Warnke. Combinatorics Probability & Computing, 25(2):269–299, 2016.
  24. Biological and Artificial Intelligence Environments, chapter Consistency of empirical risk minimization for unbounded loss functions, pages 261–270.
    M. Muselli and F. Ruffino. Springer Netherlands, 1 edition, 2005.
  25. Relative deviation learning bounds and generalization with unbounded loss functions.
    C. Cortes, S. Greenberg, and M. Mohri. Comunicazioni Sociali, 23:234–255, 2013.
  26. Correcting sample selection bias in maximum entropy density estimation.
    M. Dudík, R. Schapire, and S. Phillips. NIPS, 2005.
  27. Fast rates with unbounded losses.
    P. D. Grünwald and N. A. Mehta. http://arxiv.org/pdf/1605.00252v1.pdf, 2016.
  28. An extension of the Hoeffding inequality to unbounded random variables.
    V. Bentkus. Lithuanian Mathematical Journal, 48(48):137–157, 2008.
  29. Learning without concentration.
    S. Mendelson. Journal of ACM, 62(3):21:1–25, 2014.