# A note on a confidence bound of Kuzborskij and Szepesvári

## Abstract

In an interesting recent work, Kuzborskij and Szepesvári derived a confidence bound for functions of independent random variables, which is based on an inequality that relates concentration to squared perturbations of the chosen function. Kuzborskij and Szepesvári also established the PAC-Bayes-ification of their confidence bound. Two important aspects of their work are that the random variables could be of unbounded range, and not necessarily of an identical distribution. The purpose of this note is to advertise/discuss these interesting results, with streamlined proofs. This expository note is written for persons who, metaphorically speaking, enjoy the ‘featured movie’ but prefer to skip the preview sequence.

## 1 Introduction

In an interesting recent work, Kuzborskij and Szepesvári derived a confidence bound for the random variable

$$\Delta = f(S) - \mathbb{E}[f(S)]$$

where $S = (Z_1, \ldots, Z_n)$ is a size-$n$ random sample composed of independent $\mathcal{Z}$-valued random elements $Z_1, \ldots, Z_n$, and $f : \mathcal{Z}^n \to \mathbb{R}$ is a measurable function. Notice, however, that the components $Z_i$ are not required to be identically distributed: each $Z_i$ may be distributed according to a different¹ distribution $P_i \in \mathcal{M}_1(\mathcal{Z})$. Accordingly, the distribution of the size-$n$ random sample is the product measure $P_1 \otimes \cdots \otimes P_n$.

Their confidence bound is based on an estimator of the variance of $f(S)$. Recall that McDiarmid’s inequality, which is based on the bounded differences property, relates concentration of $\Delta$ around zero (its mean) to the sensitivity of $f$ to coordinatewise perturbations (“first-order”). By contrast, the bound of Kuzborskij and Szepesvári relates concentration to squared perturbations (“second-order”), which leads to an inequality based on a variance estimator. The latter resembles a well-known estimator, recalled next.

### The variance estimator used in the Efron-Stein inequality.

This is defined as follows:

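$$V^{\mathrm{ES}} = \sum_{k=1}^{n} \mathbb{E}\Big[\big(f(S) - f(S^{(k)})\big)_+^2 \,\Big|\, S\Big], \tag{1}$$

where $(u)_+ = \max\{u, 0\}$ is the positive part, and the notation $S^{(k)}$ indicates that the $k$th element of $S$ is replaced with $Z_k'$, where $S' = (Z_1', \ldots, Z_n')$ is an independent copy of $S$. Further details about this estimator, with context and references, can be found in Boucheron et al. [2013].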

Problem: In order to prove a confidence bound for $\Delta$ based on $V^{\mathrm{ES}}$, one needs a priori assumptions on the moments of $V^{\mathrm{ES}}$. To avoid this limitation, Kuzborskij and Szepesvári used a modified variance estimator.

### The variance estimator used in the Kuzborskij-Szepesvári inequality.

This is defined as follows:

$$V^{\mathrm{KS}} = \sum_{k=1}^{n} \mathbb{E}\Big[\big(f(S) - f(S^{(k)})\big)^2 \,\Big|\, Z_1, \ldots, Z_k\Big]. \tag{2}$$

Kuzborskij and Szepesvári called it a “semi-empirical” estimator, because of its dependence on both the sample and the distribution of the sample.
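To make Eq. (2) concrete, consider the sample mean $f(S) = \frac{1}{n}\sum_{i} Z_i$ with real-valued $Z_i$. This choice of $f$ is an illustration introduced here, not an example taken from Kuzborskij and Szepesvári. In this case $f(S) - f(S^{(k)}) = (Z_k - Z_k')/n$, and conditioning on $Z_1, \ldots, Z_k$ leaves only $Z_k'$ random, so the $k$th summand equals $\big((Z_k - \mu_k)^2 + \sigma_k^2\big)/n^2$, where $\mu_k$ and $\sigma_k^2$ denote the mean and variance of $Z_k$. The sketch below evaluates this closed form; the function name and argument names are hypothetical.

```python
from typing import Sequence


def v_ks_sample_mean(
    z: Sequence[float],
    means: Sequence[float],
    variances: Sequence[float],
) -> float:
    """Semi-empirical estimator of Eq. (2) for f = sample mean (illustrative assumption).

    z[k] is the observed value of Z_k; means[k] and variances[k] are the mean
    and variance of the distribution of Z_k, which may differ across k.
    """
    n = len(z)
    if n == 0 or len(means) != n or len(variances) != n:
        raise ValueError("z, means and variances must be non-empty and of equal length")
    # Summand k: E[(Z_k - Z_k')^2 | Z_k] / n^2 = ((Z_k - mu_k)^2 + sigma_k^2) / n^2
    total = sum((zk - mk) ** 2 + vk for zk, mk, vk in zip(z, means, variances))
    return total / n**2
```

The dependence on both the observed values and the (usually unknown) means and variances is what makes the estimator "semi-empirical". In this example its expectation is $2\sum_k \sigma_k^2 / n^2$, twice the variance of the sample mean.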

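The main result of Kuzborskij and Szepesvári is the following high-confidence bound: For any $x \ge 1$ and $y > 0$, with probability at least $1 - e^{-x}$ one has

$$|\Delta| \le \sqrt{2\,\big(V^{\mathrm{KS}} + y\big)\Big[1 + \tfrac{1}{2}\log\Big(1 + \frac{V^{\mathrm{KS}}}{y}\Big)\Big]\,x}. \tag{3}$$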

### Remark:

Inequality (3) does not require boundedness of the random variables $Z_i$, nor of the function $f$; the only crucial assumption is independence of the elements in the sample $S$. Observe that inequality (3) depends on $V^{\mathrm{KS}}$ and a positive free parameter $y$, which must be selected by the user. For instance, choosing $y = 1/n^2$ gives: For any $x \ge 1$, with probability at least $1 - e^{-x}$ one has

$$|\Delta| \le \sqrt{2\,\big(V^{\mathrm{KS}} + 1/n^2\big)\Big[1 + \tfrac{1}{2}\log\big(1 + n^2 V^{\mathrm{KS}}\big)\Big]\,x}.$$

Paraphrasing Kuzborskij and Szepesvári: With this particular choice of $y$, the resulting inequality shows a Bernstein-type behavior, in the sense that the upper bound is dominated by the lower-order term whenever $V^{\mathrm{KS}}$ is small enough; and the price for such a simple choice of $y$ is in the logarithmic term.
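As a rough illustration of how the free parameter enters, the right-hand side of inequality (3) can be evaluated numerically. The sketch below is a minimal helper written for this note's formula; the function names are hypothetical, and the value of the variance estimator is assumed to be supplied by the caller.

```python
import math


def ks_bound(v_ks: float, y: float, x: float) -> float:
    """Right-hand side of inequality (3) for a given estimator value v_ks."""
    if v_ks < 0 or y <= 0 or x < 0:
        raise ValueError("need v_ks >= 0, y > 0 and x >= 0")
    # log1p(v/y) = log(1 + v/y), computed accurately when v/y is small
    log_factor = 1.0 + 0.5 * math.log1p(v_ks / y)
    return math.sqrt(2.0 * (v_ks + y) * log_factor * x)


def ks_bound_default_y(v_ks: float, n: int, x: float) -> float:
    """Same bound with the choice y = 1/n^2 discussed in the remark."""
    if n <= 0:
        raise ValueError("need n >= 1")
    return ks_bound(v_ks, 1.0 / n**2, x)
```

When the estimator value is zero, the bound with $y = 1/n^2$ reduces to $\sqrt{2x}/n$, which is the lower-order term mentioned in the remark.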

### Remark:

In addition to inequality (3), Kuzborskij and Szepesvári showed a bound that does not involve $y$ and, in particular, is scale-free: For any $x > 0$, with probability at least $1 - \sqrt{2}\,e^{-x}$ one has $|\Delta| \le 2\sqrt{\big(V^{\mathrm{KS}} + \mathbb{E}[V^{\mathrm{KS}}]\big)\,x}$.

The remainder of this note is organized as follows. The confidence bound of Kuzborskij and Szepesvári is presented and proved in Section 2; and the ‘PAC-Bayes-ified’ version of this bound is presented and proved in Section 3.

## 2 The main result and its proof

###### Theorem 1.

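Let $f : \mathcal{Z}^n \to \mathbb{R}$ be a measurable function, let $\Delta = f(S) - \mathbb{E}[f(S)]$ be the random gap with $S$ randomly chosen from a distribution $P_1 \otimes \cdots \otimes P_n$, and let $V = V^{\mathrm{KS}}$ be the variance estimator defined in Eq. 2.
(i) For any $x > 0$,

$$\mathbb{P}\Big(|\Delta| > 2\sqrt{(V + \mathbb{E}[V])\,x}\Big) \le \sqrt{2}\,e^{-x}.$$

(ii) For any $x \ge 1$, and any $y > 0$,

$$\mathbb{P}\Big(|\Delta| > \sqrt{2\,(V + y)\big[1 + \tfrac{1}{2}\log(1 + V/y)\big]\,x}\Big) \le e^{-x}.$$

To discuss the proof of Theorem 1, the following definition will be convenient: A pair of random variables $(A, B)$ is called a canonical pair if $B \ge 0$ and

$$\sup_{\lambda \in \mathbb{R}} \mathbb{E}\Big[\exp\Big(\lambda A - \frac{\lambda^2}{2} B^2\Big)\Big] \le 1. \tag{4}$$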

See de la Peña et al. [2009, Section 10.2] for further discussion on this condition, and its connection with the so-called self-normalized processes.

A key step of the proof of Theorem 1 consists of establishing that $(\Delta, \sqrt{V})$ is a canonical pair. We state this as a lemma for convenient reference:

###### Lemma 2.

$(\Delta, \sqrt{V})$ is a canonical pair.

The rest of the proof of Theorem 1 relies on the following technical result, which essentially gives subgaussian tail probabilities for some functions of a canonical pair (cf. de la Peña et al. [2009, Theorem 12.4 & Corollary 12.5]):

###### Lemma 3.

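Suppose $(A, B)$ is a canonical pair. Then:
(i) For any $t > 0$,

$$\mathbb{P}\left(\frac{|A|}{\sqrt{B^2 + (\mathbb{E}[B])^2}} \ge t\right) \le \sqrt{2}\,e^{-t^2/4}.$$

(ii) For any $t \ge \sqrt{2}$ and $y > 0$,

$$\mathbb{P}\left(\frac{|A|}{\sqrt{(B^2 + y)\Big[1 + \frac{1}{2}\log\Big(1 + \frac{B^2}{y}\Big)\Big]}} \ge t\right) \le e^{-t^2/2}.$$

The proof of Theorem 1 then follows by combining Lemma 2 and Lemma 3. Hence, it remains to prove Lemma 2. This uses the martingale method, which is at the core of the proofs of the McDiarmid and Azuma-Hoeffding inequalities.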
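For concreteness, the combination can be sketched as follows. This is a worked substitution written for this note, taking $(A, B) = (\Delta, \sqrt{V})$ from Lemma 2 and assuming $t$ lies in the range allowed by Lemma 3; the choices $t = \sqrt{2x}$ and $t = 2\sqrt{x}$ are the ones that match the exponents in Theorem 1.

```latex
% Part (ii): Lemma 3(ii) with B^2 = V and t = \sqrt{2x}.
\[
\mathbb{P}\Big( |\Delta| \ge \sqrt{2\,(V + y)\big[1 + \tfrac12 \log(1 + V/y)\big]\,x} \Big)
\le e^{-(\sqrt{2x})^2/2} = e^{-x}.
\]
% Part (i): Lemma 3(i) with t = 2\sqrt{x}, plus Jensen's inequality
% (\mathbb{E}\sqrt{V})^2 \le \mathbb{E}[V], which enlarges the threshold.
\[
\mathbb{P}\Big( |\Delta| > 2\sqrt{(V + \mathbb{E}[V])\,x} \Big)
\le \mathbb{P}\Big( |\Delta| \ge 2\sqrt{x}\,\sqrt{V + (\mathbb{E}\sqrt{V})^2} \Big)
\le \sqrt{2}\,e^{-(2\sqrt{x})^2/4} = \sqrt{2}\,e^{-x}.
\]
```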

###### Proof of Lemma 2.

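Let $\mathbb{E}_k$ stand for $\mathbb{E}[\,\cdot \mid Z_1, \ldots, Z_k]$, so that $\mathbb{E}_0 = \mathbb{E}$. Using the martingale difference decomposition, the gap can be written as

$$\Delta = \sum_{k=1}^{n} D_k$$

where $D_k = \mathbb{E}_k[f(S)] - \mathbb{E}_{k-1}[f(S)]$. Notice that $D_k = \mathbb{E}_k[f(S) - f(S^{(k)})]$, which follows from the elementary identity $\mathbb{E}_{k-1}[f(S)] = \mathbb{E}_k[f(S^{(k)})]$.

The variance estimator (cf. Eq. 2) can be rewritten as

$$V = \sum_{k=1}^{n} V_k$$

where $V_k = \mathbb{E}_k\big[(f(S) - f(S^{(k)}))^2\big]$. This is just a convenient notation.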

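Assume for now that for every $k \in \{1, \ldots, n\}$ and every $\lambda \in \mathbb{R}$ the following holds:

$$\mathbb{E}_{k-1}\Big[\exp\Big(\lambda D_k - \frac{\lambda^2}{2} V_k\Big)\Big] \le 1. \tag{5}$$

Then, using a recursive argument and Eq. 5, we get

$$\begin{aligned}
\mathbb{E}\Big[\exp\Big(\lambda \Delta - \frac{\lambda^2}{2} V\Big)\Big]
&= \mathbb{E}\Big[\prod_{k=1}^{n} \exp\Big(\lambda D_k - \frac{\lambda^2}{2} V_k\Big)\Big] \\
&= \mathbb{E}\Big[\underbrace{\mathbb{E}_{n-1}\Big[\exp\Big(\lambda D_n - \frac{\lambda^2}{2} V_n\Big)\Big]}_{\le 1} \prod_{k=1}^{n-1} \exp\Big(\lambda D_k - \frac{\lambda^2}{2} V_k\Big)\Big] \\
&\le \mathbb{E}\Big[\underbrace{\mathbb{E}_{n-2}\Big[\exp\Big(\lambda D_{n-1} - \frac{\lambda^2}{2} V_{n-1}\Big)\Big]}_{\le 1} \prod_{k=1}^{n-2} \exp\Big(\lambda D_k - \frac{\lambda^2}{2} V_k\Big)\Big] \\
&\le \cdots \le 1.
\end{aligned}$$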

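Thus, it remains to prove Eq. 5. Fix $k$ and let $\varepsilon$ be a random variable independent of $(S, S')$ such that $\mathbb{P}(\varepsilon = 1) = \mathbb{P}(\varepsilon = -1) = 1/2$. Let $\Delta_k = f(S) - f(S^{(k)})$. Notice that $\lambda D_k - \frac{\lambda^2}{2} V_k = \mathbb{E}_k\big[\lambda \Delta_k - \frac{\lambda^2}{2} \Delta_k^2\big]$, and by Jensen’s inequality

$$\exp\Big(\mathbb{E}_k\Big[\lambda \Delta_k - \frac{\lambda^2}{2} \Delta_k^2\Big]\Big) \le \mathbb{E}_k\Big[\exp\Big(\lambda \Delta_k - \frac{\lambda^2}{2} \Delta_k^2\Big)\Big].$$

Let $\mathbb{E}_{-k}$ denote conditioning on $(S, S')$ without $(Z_k, Z_k')$. Then we have

$$\begin{aligned}
\mathbb{E}_{k-1}\Big[\exp\Big(\lambda D_k - \frac{\lambda^2}{2} V_k\Big)\Big]
&\le \mathbb{E}_{k-1}\Big[\exp\Big(\lambda \Delta_k - \frac{\lambda^2}{2} \Delta_k^2\Big)\Big] \\
&= \mathbb{E}_{k-1}\Big[\mathbb{E}_{-k}\,\mathbb{E}\Big[\exp\Big(\varepsilon \lambda \Delta_k - \frac{\lambda^2}{2} (\varepsilon \Delta_k)^2\Big) \,\Big|\, S, S'\Big]\Big].
\end{aligned}$$

The last equality follows from the assumption on the distributions, that is, given $(Z_i, Z_i')_{i \ne k}$, the random variables $Z_k$ and $Z_k'$ are identically distributed, hence so are $\Delta_k$ and $-\Delta_k$. Since $\varepsilon$ is subgaussian ($\mathbb{E}[e^{a\varepsilon}] = \cosh(a) \le e^{a^2/2}$ for any $a \in \mathbb{R}$), the innermost expectation in the last display is upper-bounded by one. ∎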
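The final step of the proof rests on the fact that a symmetric random sign satisfies $\mathbb{E}[\exp(\varepsilon a - a^2/2)] = \cosh(a)\,e^{-a^2/2} \le 1$ for every real $a$ (here $a$ plays the role of $\lambda \Delta_k$ with $S, S'$ held fixed). The short numerical sanity check below is only an illustration of that inequality on a grid, not a proof; the function name is hypothetical.

```python
import math


def rademacher_exp_moment(a: float) -> float:
    """E[exp(eps * a - a^2 / 2)] for eps uniform on {-1, +1}.

    Written as an average of two exponentials (rather than cosh(a) * exp(-a^2/2))
    so that large |a| does not overflow.
    """
    half_sq = 0.5 * a * a
    return 0.5 * (math.exp(a - half_sq) + math.exp(-a - half_sq))


# Evaluate on a grid of values of a in [-10, 10].
grid = [i / 10.0 for i in range(-100, 101)]
largest = max(rademacher_exp_moment(a) for a in grid)
```

On this grid the largest value is attained at $a = 0$, where the expectation equals one.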

### Remark:

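The proof makes it clear that this inequality holds in the slightly more general setting in which $f : \mathcal{Z}_1 \times \cdots \times \mathcal{Z}_n \to \mathbb{R}$ and $S = (Z_1, \ldots, Z_n)$ has independent components, where each $Z_i$ is a $\mathcal{Z}_i$-valued random variable with distribution $P_i$.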

## 3 PAC-Bayes-ification

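We adapt the notation for $\Delta$ and $V$ to make explicit their dependence on $f$, and see them as being defined over $f$’s from some function class $\mathcal{F}$:

$$\Delta(f) = f(S) - \mathbb{E}[f(S)], \tag{1'}$$

$$V(f) = \sum_{k=1}^{n} \mathbb{E}_k\big[(f(S) - f(S^{(k)}))^2\big]. \tag{2'}$$

It might be convenient to make explicit the dependence of $\Delta(f)$ and $V(f)$ on the sample as well; to do so, we may write $\Delta_S(f)$ and $V_S(f)$. Recall that the distribution of the (size-$n$) random sample is the product of the distributions of $Z_1, \ldots, Z_n$; we denote it by $P_n$. Notice that for a fixed nonrandom $s \in \mathcal{Z}^n$, the gap is

$$\Delta_s(f) = f(s) - \int_{\mathcal{Z}^n} f(s')\,P_n(ds').$$

The expression for $V_s(f)$ is longer to write, but easy to imagine. The point is that $\Delta_s$ and $V_s$ are real-valued functions defined over $\mathcal{F}$.

Let $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$ be a parametric family of functions $f_\theta : \mathcal{Z}^n \to \mathbb{R}$. For each $\theta \in \Theta$, define $\Delta_S(\theta) = \Delta_S(f_\theta)$ and $V_S(\theta) = V_S(f_\theta)$, the gap and the variance estimator for $f_\theta$. Then $(\Delta_S(\theta), \sqrt{V_S(\theta)})$ is a canonical pair, for each $\theta \in \Theta$, by Lemma 2.

Given a probability kernel $Q$ from $\mathcal{Z}^n$ to $\Theta$ and $s \in \mathcal{Z}^n$, we write expectations with respect to the distribution $Q_s$ as $Q_s[\Delta_s] = \int_\Theta \Delta_s(\theta)\,Q_s(d\theta)$, and similarly $Q_s[V_s]$. If $S$ is the random sample, then expectations with respect to the random measure $Q_S$ are conditional expectations:

$$Q_S[\Delta_S] = \mathbb{E}[\Delta_S(\theta) \mid S], \quad \text{and} \quad Q_S[V_S] = \mathbb{E}[V_S(\theta) \mid S].$$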

The joint distribution over defined by and the probability kernel , denoted , is so that choosing a random pair corresponds to choosing and then choosing . Accordingly, integrals under correspond to the ‘total expectation’ with respect to the random choice of and . For instance,

With a slight abuse of notation, we may write instead of .

The ‘PAC-Bayes-ification’ of Theorem 1 is as follows.

###### Theorem 4.

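Fix an arbitrary ‘data-free’ probability distribution $Q_0$ over $\Theta$, and an arbitrary probability kernel $Q$ from $\mathcal{Z}^n$ to $\Theta$. Then
(i) For any $x \ge 0$, with probability at least $1 - 2e^{-x}$ we have

$$|Q_S[\Delta_S]| \le \sqrt{2\,\big(\mathbb{E}[V_S(\theta)] + Q_S[V_S]\big)\big(\mathrm{KL}(Q_S \| Q_0) + 2x\big)}. \tag{6}$$

(ii) For all $x \ge 1$ and $y > 0$, with probability at least $1 - e^{-x}$ we have

$$|Q_S[\Delta_S]| \le \sqrt{2\,\big(Q_S[V_S] + y\big)\Big[\mathrm{KL}(Q_S \| Q_0) + x + \frac{x}{2}\log\Big(1 + \frac{Q_S[V_S]}{y}\Big)\Big]}. \tag{7}$$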

The statement of this theorem uses the language of probability kernels for representing data-dependent distributions (cf. Rivasplata et al. [2020]).
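To get a feel for how the KL term and the free parameter interact in Eq. 7, its right-hand side can be evaluated directly. The sketch below assumes the posterior-averaged variance estimator and the KL divergence have already been computed by some other means; the function name and argument names are hypothetical.

```python
import math


def pac_bayes_ks_bound(v_post: float, kl: float, x: float, y: float) -> float:
    """Right-hand side of Eq. 7.

    v_post: value of Q_S[V_S], the posterior average of the variance estimator
    kl:     value of KL(Q_S || Q_0)
    x, y:   confidence parameter and free positive parameter
    """
    if v_post < 0 or kl < 0 or x < 0 or y <= 0:
        raise ValueError("need v_post >= 0, kl >= 0, x >= 0 and y > 0")
    # log1p(v/y) = log(1 + v/y)
    bracket = kl + x + 0.5 * x * math.log1p(v_post / y)
    return math.sqrt(2.0 * (v_post + y) * bracket)
```

When the KL term is zero, the expression has the same form as the right-hand side of inequality (3), with the variance estimator replaced by its posterior average.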

In the remainder of this note, we switch back to the usual notation in terms of conditional expectations: $\mathbb{E}[\Delta_S(\theta) \mid S]$ and $\mathbb{E}[V_S(\theta) \mid S]$. Also recall that $\mathbb{E}[V_S(\theta)]$ is the total expectation.

The proof of Theorem 4 is based on the following lemma.

###### Lemma 5.

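Under the same conditions as in Theorem 4, the following hold.
(i) For all $x \ge 0$,

$$\mathbb{E}\left[\exp\left\{x \sqrt{\left(\frac{\mathbb{E}[\Delta_S(\theta) \mid S]^2}{\mathbb{E}[V_S(\theta)] + \mathbb{E}[V_S(\theta) \mid S]} - 2\,\mathrm{KL}(Q_S \| Q_0)\right)_+}\right\}\right] \le 2e^{x^2}. \tag{8}$$

(ii) For any $y > 0$, we have

$$\mathbb{E}\left[\frac{y}{\sqrt{y^2 + \mathbb{E}[V_S(\theta) \mid S]}} \exp\left\{\frac{\mathbb{E}[\Delta_S(\theta) \mid S]^2}{2\big(y^2 + \mathbb{E}[V_S(\theta) \mid S]\big)} - \mathrm{KL}(Q_S \| Q_0)\right\}\right] \le 1. \tag{9}$$

###### Proof of Lemma 5.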

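For convenience, we start with the proof of Eq. 9. Recall the following change of measure, which is the basis of the PAC-Bayesian analysis: Let $\pi$ and $\pi_0$ be probability measures on $\mathcal{X}$, and let the induced expectation operators be $\mathbb{E}$ and $\mathbb{E}_0$, respectively. Let $X$ be an $\mathcal{X}$-valued random variable. Then, for any measurable function $f : \mathcal{X} \to \mathbb{R}$ we have

$$\mathbb{E}[f(X)] \le \mathrm{KL}(\pi \| \pi_0) + \log \mathbb{E}_0\big[e^{f(X)}\big].$$

Below we use this with $\pi = Q_S$, $\pi_0 = Q_0$, and $f(\theta) = \lambda \Delta_S(\theta) - \frac{\lambda^2}{2} V_S(\theta)$.

Let $\mathbb{E}$ and $\mathbb{E}_0$ be the expectation with respect to $Q_S$ and $Q_0$, respectively. Conditioning on the random sample $S$ we have:

$$\mathbb{E}\Big[\lambda \Delta_S(\theta) - \frac{\lambda^2}{2} V_S(\theta) \,\Big|\, S\Big] \le \mathrm{KL}(Q_S \| Q_0) + \log \mathbb{E}_0\Big[e^{\lambda \Delta_S(\theta) - \frac{\lambda^2}{2} V_S(\theta)} \,\Big|\, S\Big].$$

Subtracting the KL term, and taking exponential on both sides gives

$$e^{\mathbb{E}\left[\lambda \Delta_S(\theta) - \frac{\lambda^2}{2} V_S(\theta) \,\middle|\, S\right] - \mathrm{KL}(Q_S \| Q_0)} \le \mathbb{E}_0\Big[e^{\lambda \Delta_S(\theta) - \frac{\lambda^2}{2} V_S(\theta)} \,\Big|\, S\Big].$$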

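Then, taking expectation over the random sample $S$ on both sides, and keeping in mind that $(\Delta_S(\theta), \sqrt{V_S(\theta)})$ is a canonical pair for any fixed $\theta$, we have

$$\mathbb{E}\Big[e^{\mathbb{E}\left[\lambda \Delta_S(\theta) - \frac{\lambda^2}{2} V_S(\theta) \,\middle|\, S\right] - \mathrm{KL}(Q_S \| Q_0)}\Big] \le \mathbb{E}_0\Big[e^{\lambda \Delta_S(\theta) - \frac{\lambda^2}{2} V_S(\theta)}\Big] = \mathbb{E}_0\Big[\mathbb{E}_0\Big[e^{\lambda \Delta_S(\theta) - \frac{\lambda^2}{2} V_S(\theta)} \,\Big|\, \theta\Big]\Big] \le 1,$$

where the equality is by swapping the order of expectation, which is possible since $Q_0$ is a data-free distribution (cf. Rivasplata et al. [2020]). Next, multiplying both sides by $\frac{y}{\sqrt{2\pi}}\,e^{-\lambda^2 y^2/2}$ for some fixed $y > 0$, integrating with respect to $\lambda$, and applying Fubini’s theorem, gives²

$$\mathbb{E}\left[e^{-\mathrm{KL}(Q_S \| Q_0)}\,\frac{y}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{\lambda \mathbb{E}[\Delta_S(\theta) \mid S] - \frac{\lambda^2}{2} \mathbb{E}[V_S(\theta) \mid S] - \frac{\lambda^2}{2} y^2}\,d\lambda\right] \le 1.$$

Carrying out the Gaussian integration we arrive at

$$\mathbb{E}\left[\frac{y}{\sqrt{y^2 + \mathbb{E}[V_S(\theta) \mid S]}} \exp\left\{\frac{\mathbb{E}[\Delta_S(\theta) \mid S]^2}{2\big(y^2 + \mathbb{E}[V_S(\theta) \mid S]\big)} - \mathrm{KL}(Q_S \| Q_0)\right\}\right] \le 1,$$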

which finishes the proof of Eq. 9.
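The Gaussian integration step can be made explicit by completing the square. In the sketch below, $a$ and $c$ are abbreviations introduced only for this computation: $a = \mathbb{E}[\Delta_S(\theta) \mid S]$ and $c = y^2 + \mathbb{E}[V_S(\theta) \mid S] > 0$.

```latex
\[
\int_{-\infty}^{\infty} e^{\lambda a - \frac{\lambda^2}{2} c}\, d\lambda
= e^{\frac{a^2}{2c}} \int_{-\infty}^{\infty} e^{-\frac{c}{2}\left(\lambda - \frac{a}{c}\right)^2} d\lambda
= \sqrt{\frac{2\pi}{c}}\; e^{\frac{a^2}{2c}} ,
\]
% so that multiplying by the normalizing factor y / \sqrt{2\pi} gives
\[
\frac{y}{\sqrt{2\pi}} \sqrt{\frac{2\pi}{c}}\; e^{\frac{a^2}{2c}}
= \frac{y}{\sqrt{c}}\; e^{\frac{a^2}{2c}} .
\]
```

Substituting back the values of $a$ and $c$ recovers the integrand of Eq. 9, up to the factor $e^{-\mathrm{KL}(Q_S \| Q_0)}$ that does not depend on $\lambda$.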

For the other part of the lemma, we consider the following:

###### Claim 6.

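Let $U$ be a non-negative random variable, and for $\alpha > 0$ define $M_\alpha = \mathbb{E}[\exp(\alpha U^2)]$. Then, for any $x \ge 0$, $\mathbb{E}[\exp(xU)] \le M_\alpha\,e^{x^2/(4\alpha)}$.

The proof of this claim is as follows. Fix $\alpha > 0$ and $x \ge 0$. Using the inequality $ab \le (a^2 + b^2)/2$ with $a = x/\sqrt{2\alpha}$ and $b = \sqrt{2\alpha}\,U$ we have

$$xU = \frac{x}{\sqrt{2\alpha}}\,\sqrt{2\alpha}\,U \le \frac{x^2}{4\alpha} + \alpha U^2.$$

Then take exponential on both sides, and take expectations.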

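Next, we prove part (i) of the lemma.

Consider the random variable

$$U = \sqrt{\left(\frac{\mathbb{E}[\Delta_S(\theta) \mid S]^2}{\mathbb{E}[V_S(\theta)] + \mathbb{E}[V_S(\theta) \mid S]} - 2\,\mathrm{KL}(Q_S \| Q_0)\right)_+}$$

and notice that Eq. 8 follows from the claim with $\alpha = 1/4$, provided that we show that $\mathbb{E}[\exp(U^2/4)] \le 2$. For this, consider an arbitrary $y > 0$, and consider the abbreviations

$$A = \frac{y}{\sqrt{y^2 + \mathbb{E}[V_S(\theta) \mid S]}}, \qquad B = \frac{\mathbb{E}[\Delta_S(\theta) \mid S]^2}{2\big(y^2 + \mathbb{E}[V_S(\theta) \mid S]\big)} - \mathrm{KL}(Q_S \| Q_0).$$

We need to upper-bound $\mathbb{E}[\exp((B)_+/2)]$, which coincides with $\mathbb{E}[\exp(U^2/4)]$ when $y^2 = \mathbb{E}[V_S(\theta)]$. Keeping in mind that $A > 0$ (in fact, $0 < A \le 1$), by Cauchy-Schwarz,

$$\mathbb{E}\big[\exp((B)_+/2)\big] = \mathbb{E}\big[\exp((B)_+/2)\,A^{1/2} A^{-1/2}\big] \le \sqrt{\mathbb{E}\big[A \exp((B)_+)\big]\;\mathbb{E}\Big[\frac{1}{A}\Big]}.$$

Observe that $\mathbb{E}[A \exp(B)] \le 1$ by Eq. 9, and $A \le 1$. Now, we have

$$\sqrt{\mathbb{E}\big[A \exp((B)_+)\big]\;\mathbb{E}\Big[\frac{1}{A}\Big]} = \sqrt{\Big(\mathbb{E}\big[A\,\mathbb{1}\{B \ge 0\} \exp(B)\big] + \mathbb{E}\big[A\,\mathbb{1}\{B < 0\}\big]\Big)\,\mathbb{E}\Big[\frac{1}{A}\Big]} \le \sqrt{\mathbb{E}\Big[\frac{2}{A}\Big]}.$$

Finally, by subadditivity of the square root function and Jensen’s inequality,

$$\sqrt{\mathbb{E}\Big[\frac{2}{A}\Big]} = \sqrt{2\,\mathbb{E}\left[\sqrt{\frac{y^2 + \mathbb{E}[V_S(\theta) \mid S]}{y^2}}\right]} \le \sqrt{2 + \frac{2\sqrt{\mathbb{E}[V_S(\theta)]}}{y}} \le 2,$$

where the last inequality is by taking any $y \ge \sqrt{\mathbb{E}[V_S(\theta)]}$. Thus, $\mathbb{E}[\exp(U^2/4)] \le 2$ for the chosen $y = \sqrt{\mathbb{E}[V_S(\theta)]}$. Applying Claim 6 with $\alpha = 1/4$ completes the proof. ∎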

To complete the argument, the proof of Theorem 4 is given next.

###### Proof of Theorem 4.

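Applying Chernoff’s bounding technique with Eq. 8 gives, for any $t \ge 0$,

$$\mathbb{P}\left(\sqrt{\left(\frac{\mathbb{E}[\Delta_S(\theta) \mid S]^2}{\mathbb{E}[V_S(\theta)] + \mathbb{E}[V_S(\theta) \mid S]} - 2\,\mathrm{KL}(Q_S \| Q_0)\right)_+} \ge t\right) \le 2 \inf_{x \ge 0} e^{x^2 - tx}.$$

The infimum is $e^{-t^2/4}$, attained at $x = t/2$. Thus, taking $t = 2\sqrt{x}$, with probability at least $1 - 2e^{-x}$ we have

$$\left(\frac{\mathbb{E}[\Delta_S(\theta) \mid S]^2}{\mathbb{E}[V_S(\theta)] + \mathbb{E}[V_S(\theta) \mid S]} - 2\,\mathrm{KL}(Q_S \| Q_0)\right)_+ \le 4x.$$

With some algebra, this event implies

$$\big|\mathbb{E}[\Delta_S(\theta) \mid S]\big| \le \sqrt{\big(\mathbb{E}[V_S(\theta)] + \mathbb{E}[V_S(\theta) \mid S]\big)\big(2\,\mathrm{KL}(Q_S \| Q_0) + 4x\big)}.$$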

The last display is equivalent to Eq. 6. Hence Theorem 4(i) is proved.
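The algebra behind part (i) can be spelled out as follows. Here $a$, $W$ and $K$ are abbreviations introduced only for this sketch: $a = \mathbb{E}[\Delta_S(\theta) \mid S]$, $W = \mathbb{E}[V_S(\theta)] + \mathbb{E}[V_S(\theta) \mid S]$ (assumed positive), and $K = \mathrm{KL}(Q_S \| Q_0)$.

```latex
% A positive part bounded by 4x bounds the quantity itself:
\[
\Big(\frac{a^2}{W} - 2K\Big)_+ \le 4x
\;\Longrightarrow\;
\frac{a^2}{W} \le 2K + 4x
\;\Longrightarrow\;
a^2 \le W\,(2K + 4x) .
\]
% Taking square roots and factoring out 2 gives the form of Eq. 6:
\[
|a| \le \sqrt{W\,(2K + 4x)} = \sqrt{2\,W\,(K + 2x)} .
\]
```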

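Next, observe that for any $t \ge \sqrt{2}$ and any $y > 0$,

$$\begin{aligned}
&\mathbb{P}\left(\frac{\mathbb{E}[\Delta_S(\theta) \mid S]^2}{2\big(y^2 + \mathbb{E}[V_S(\theta) \mid S]\big)} - \mathrm{KL}(Q_S \| Q_0) \ge \frac{t^2}{2}\left[1 + \frac{1}{2}\log\left(1 + \frac{\mathbb{E}[V_S(\theta) \mid S]}{y^2}\right)\right]\right) \\
&\quad \le \mathbb{P}\left(\frac{\mathbb{E}[\Delta_S(\theta) \mid S]^2}{2\big(y^2 + \mathbb{E}[V_S(\theta) \mid S]\big)} - \mathrm{KL}(Q_S \| Q_0) \ge \frac{t^2}{2} + \frac{1}{2}\log\left(1 + \frac{\mathbb{E}[V_S(\theta) \mid S]}{y^2}\right)\right) \\
&\quad = \mathbb{P}\left(\frac{\mathbb{E}[\Delta_S(\theta) \mid S]^2}{2\big(y^2 + \mathbb{E}[V_S(\theta) \mid S]\big)} - \mathrm{KL}(Q_S \| Q_0) - \frac{1}{2}\log\left(1 + \frac{\mathbb{E}[V_S(\theta) \mid S]}{y^2}\right) \ge \frac{t^2}{2}\right) \\
&\quad \le e^{-t^2/2},
\end{aligned}$$

where the first inequality uses $t^2/2 \ge 1$, and the last inequality follows from Markov’s inequality and Eq. 9. Setting $x = t^2/2$, this implies that for all $x \ge 1$, with probability at least $1 - e^{-x}$, one has

$$\frac{\mathbb{E}[\Delta_S(\theta) \mid S]^2}{2\big(y^2 + \mathbb{E}[V_S(\theta) \mid S]\big)} - \mathrm{KL}(Q_S \| Q_0) \le x\left(1 + \frac{1}{2}\log\left(1 + \frac{\mathbb{E}[V_S(\theta) \mid S]}{y^2}\right)\right).$$

Notice that $y^2$ may be replaced with $y$, since $y > 0$ is a free variable. Doing this replacement, and rearranging the terms, we get the equivalent of Eq. 7. Hence Theorem 4(ii) is proved. ∎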

### Closing remarks.

Kuzborskij and Szepesvári deserve fair credit for showing that the pair $(\Delta, \sqrt{V})$ meets de la Peña et al. [2009]’s ‘canonical condition’ (Lemma 2), which enabled powerful tools for bounding exponential moments. Of course, this was possible with their variance estimator $V^{\mathrm{KS}}$. Apart from that, the main part of the work of Kuzborskij and Szepesvári is in the proofs of Lemma 5 and Theorem 4, which cleverly use the techniques of de la Peña et al. In the next iteration of this note (provided that enough readers care about it) I intend to add discussions about Theorem 1 and Theorem 4, and applications.

### Footnotes

1. $\mathcal{M}_1(\mathcal{X})$ denotes the family of probability measures defined on a measurable space $(\mathcal{X}, \Sigma_{\mathcal{X}})$; the $\sigma$-algebra $\Sigma_{\mathcal{X}}$ is left implicit when it is clear from the context.
2. This is inspired by the proof of [de la Peña et al., 2009, Theorem 12.4], which uses the method of mixtures with a Gaussian distribution.

### References

1. Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
2. Victor H de la Peña, Tze Leung Lai, and Qi-Man Shao. Self-normalized processes: Limit theory and Statistical Applications. Springer, 2009.
3. Ilja Kuzborskij and Csaba Szepesvári. Efron-Stein PAC-Bayesian Inequalities. arXiv:1909.01931, 2019.
4. Omar Rivasplata, Ilja Kuzborskij, Csaba Szepesvári, and John Shawe-Taylor. PAC-Bayes Analysis Beyond the Usual Bounds. In Advances in Neural Information Processing Systems, 2020.