Generalization and Robustness of Batched Weighted Average Algorithm with V-geometrically Ergodic Markov Data

Nguyen Viet Cuong (Department of Computer Science, National University of Singapore, 117417, Singapore; nvcuong@comp.nus.edu.sg) · Lam Si Tung Ho (Department of Statistics, University of Wisconsin-Madison, WI 53706, USA; lamho@stat.wisc.edu) · Vu Dinh (Department of Mathematics, Purdue University, IN 47907, USA; vdinh@math.purdue.edu)
Abstract

We analyze the generalization and robustness of the batched weighted average algorithm for V-geometrically ergodic Markov data. This algorithm is a good alternative to the empirical risk minimization algorithm when the latter suffers from overfitting or when optimizing the empirical risk is hard. For the generalization of the algorithm, we prove a PAC-style bound on the training sample size for the expected -loss to converge to the optimal loss when training data are V-geometrically ergodic Markov chains. For the robustness, we show that if the training target variable’s values contain bounded noise, then the generalization bound of the algorithm deviates at most by the range of the noise. Our results can be applied to the regression problem, the classification problem, and the case where there exists an unknown deterministic target hypothesis.

1 Introduction

The generalization ability of learning algorithms has been studied extensively in statistical learning theory [1]. One main assumption in traditional learning theory when studying this problem is that data, drawn from an unknown distribution, are independent and identically distributed (IID) [2]. Although this assumption is useful for proving theoretical results, it may not hold in applications such as speech recognition or market prediction where data are usually temporal in nature [3].

One attempt to relax this IID data assumption is to consider cases where training data form a Markov chain with certain mixing properties. A common algorithm that has been analyzed is the empirical risk minimization (ERM) algorithm, which tries to find the hypothesis minimizing the empirical loss on the training data. Generalization bounds of this well-known algorithm were proven for exponentially strongly mixing data [4], uniformly ergodic data [5], and V-geometrically ergodic data [6].

In this paper, we investigate another learning algorithm, the batched weighted average (BWA) algorithm, when training data form a V-geometrically ergodic Markov chain. This algorithm is a batch version of the online weighted average algorithm with -loss [7]. Given the training data and a set of real-valued hypotheses, the BWA algorithm learns a weight for each hypothesis based on the hypothesis's predictions on the training data. During testing, the algorithm predicts with the weighted average of all the hypotheses' predictions on the test example.

An advantage of the BWA algorithm over the ERM algorithm is that the former may suffer less from overfitting when the hypothesis space is large or complex [8, 9]. The BWA algorithm is also a good alternative to the ERM algorithm in cases where optimizing the empirical risk is hard.

We prove the generalization of the BWA algorithm by providing a PAC-style bound on the training sample size for the expected -loss of the algorithm to converge to the optimal loss with high probability, assuming that training data are V-geometrically ergodic. The main idea of our proof is to bound the normalized weights of all the bad hypotheses whose expected loss is far from the optimal. This idea comes from the observation that when more training data are seen, the normalized weights of the bad hypotheses will eventually be dominated by those of the better hypotheses.

Using the same proof technique, we then prove the robustness of the BWA algorithm when training data form a V-geometrically ergodic Markov chain with noise. By robustness, we mean the ability of an algorithm to generalize when there is a small amount of noise in the training data. For the BWA algorithm, we show that if the training values of the target variable are allowed to contain bounded noise, then the generalization bound of the algorithm deviates at most by the range of the noise.

Our main results are proven mainly for the regression problem and the case where the pairs of observation and target variables’ values are V-geometrically ergodic. However, we also give two lemmas to show that the results can be easily applied to other common settings such as the classification problem and the case where there exists an unknown deterministic target hypothesis.

This paper chooses to analyze the BWA algorithm for data that are V-geometrically ergodic. Theoretically, V-geometrically ergodic Markov chains have many good properties that make them appealing for analyses. Firstly, they are “nice” general state space Markov chains as they mix geometrically fast [10]. Secondly, the fact that these chains can be defined on a general, possibly uncountable, state space makes their learning models more general than previous models which learn from finite or countable state space Markov chains [11]. Thirdly, the V-geometrically ergodic assumption is not too restrictive since it includes all uniformly ergodic chains as well as all ergodic chains on a finite state space [6, 12]. Nevertheless, we emphasize that our proof idea can be applied to other types of mixing Markov chains if we have the uniform convergence rate of the empirical loss for these chains.

2 Related Work

The BWA algorithm considered in this paper is a batch version of the online weighted average algorithm [7]. The main differences are that the BWA algorithm uses a possibly infinite space of real-valued hypotheses and is trained on batch data. The original weighted average algorithm is a generalization of the weighted majority algorithm [13]. Both algorithms were analyzed in the online setting [7, 13], and a variant of the weighted majority algorithm was analyzed for the classification problem with batched IID data [8]. However, to the best of our knowledge, there has been no rigorous treatment of the generalization and robustness of the BWA algorithm for non-IID data.

The proofs in our paper use a previous result on the uniform convergence rate of the empirical loss for V-geometrically ergodic Markov chains [6]. Convergence of the empirical loss is a fundamental problem in statistics and statistical learning theory, and it has been studied for other types of Markov chains such as -mixing [4, 14, 15], -mixing [16, 17], -mixing [16], and uniformly ergodic [5] chains. These results can be used with our proof idea to prove generalization and robustness bounds of the BWA algorithm for those chains.

The robustness of learning algorithms in the presence of noise has been studied for Valiant’s PAC model with IID data [18, 19, 20, 21]. Recently, Xu et al. [12] analyzed the generalization of learning algorithms based on their algorithmic robustness, the ability of an algorithm to achieve similar performances on similar training and testing data. Their analyses hold for both IID and uniformly ergodic Markov data. Another related concept is stability, the ability of an algorithm to return similar hypotheses when small changes are made to the training data [22]. Stability-based generalization bounds of learning algorithms were proven by Mohri et al. for -mixing and -mixing data [22]. Our bounds, in contrast, are obtained without measuring the algorithmic robustness or stability of the BWA algorithm.

3 Preliminaries

We now introduce the V-geometrically ergodic Markov chains and the settings for our analyses. We will follow the definitions in [6]. We also review a result on the uniform convergence rate of the empirical loss for V-geometrically ergodic Markov data [6] which will be used in the subsequent sections.

3.1 V-geometrically Ergodic Markov Chain

Let be a measurable space, where is a compact subset of () and is a σ-algebra on . A Markov chain on is a sequence of random variables together with a set of transition probabilities , where denotes the probability that a chain starting from will be in after steps. By the Markov property,

where is the probability of an event. For any two probability measures and on , we define their total variation distance as . A V-geometrically ergodic Markov chain can be defined as follows.

Definition 1.

A Markov chain is called V-geometrically ergodic with respect to a measurable function if there exist , , and such that for every and , we have

and

where is the stationary distribution of the Markov chain .
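For reference, one standard way to write the main condition in this definition (following the V-norm convention of [10]; the symbols $P^n$, $\pi$, $B$, $\gamma$, and $V$ below are our choices, and the exact parameterization in [6] may differ slightly) is
\[
\sup_{|f| \le V} \left| \int f(y)\, P^n(x, dy) - \int f(y)\, \pi(dy) \right| \;\le\; B\, \gamma^{n}\, V(x) \qquad \text{for all } x \text{ and } n \ge 1,
\]
typically accompanied by an integrability requirement on $V$ (for instance, $V \ge 1$ and $\int V \, d\pi < \infty$). This expresses that the chain converges to its stationary distribution geometrically fast.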

A special case of a V-geometrically ergodic Markov chain is a uniformly ergodic Markov chain, which has $V \equiv 1$ (the constant function $1$) [6, 10]. So, the results in this paper also hold for uniformly ergodic Markov data. Throughout our paper, we mostly consider only the first finitely many elements (the training sample) of a V-geometrically ergodic Markov chain. For convenience, we will also call this finite sequence a V-geometrically ergodic Markov chain, and whenever we refer to its constants and the function V, we mean those of the full chain.

3.2 Settings

We assume that the training data form a V-geometrically ergodic Markov chain on a state space , where is a compact subset of () and is a compact subset of . The variables ’s are usually called the observation variables and ’s are usually called the target variables.

Let be the set of all hypotheses, where a hypothesis is a function from to . Throughout this paper, we make the following assumption: is contained in a ball of a Hölder space for some , which is similar to the assumption in [6]. The Hölder space is the space of all continuous functions on with the following norm [6, 23]:

where and is a metric defined on .
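In explicit form (the symbols $f$, $\alpha$, and the metric $d$ below are our choices for presentation; the conventions follow [6, 23]), the Hölder norm is the standard one,
\[
\|f\|_{C^{\alpha}} \;=\; \|f\|_{\infty} \;+\; \sup_{x \neq x'} \frac{|f(x) - f(x')|}{\big(d(x, x')\big)^{\alpha}},
\]
where $\|f\|_{\infty} = \sup_{x} |f(x)|$ and $0 < \alpha \le 1$.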

In this paper, we consider the -loss of a hypothesis on an example . Because of the boundedness of and , there exist and such that

and

For any data , we define the empirical loss of the hypothesis on as

and the expected loss of with respect to the stationary distribution of the Markov chain as
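Written out explicitly (with $z_i = (x_i, y_i)$ denoting the training examples, $\ell(h, z)$ the loss of $h$ on an example $z$, $n$ the sample size, and $\pi$ the stationary distribution; these symbols are ours for presentation), the two quantities are
\[
\widehat{\mathcal{E}}(h) \;=\; \frac{1}{n} \sum_{i=1}^{n} \ell(h, z_i)
\qquad \text{and} \qquad
\mathcal{E}(h) \;=\; \mathbb{E}_{z \sim \pi}\big[\ell(h, z)\big].
\]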

3.3 Uniform Convergence Rate of the Empirical Loss

We review a previous result [6] which gives a PAC-style bound on the training set size for the empirical loss to converge uniformly to the expected loss when training data are V-geometrically ergodic Markov chains. This result will be used to prove the generalization and robustness bounds for the BWA algorithm in subsequent sections. To state the result, we first need to define the covering number, a quantity that measures the capacity of a hypothesis space.

Definition 2.

For every , the covering number of the hypothesis space is the smallest integer such that can be covered by balls with radius .

Note that the covering number is defined with respect to the norm and thus is data independent. This is different from another type of covering number which is data dependent [24]. With the assumption that , there exists such that for every , we have (see [23]). Thus, the covering number is finite in our setting.

We also need a concept of effective sample size for a V-geometrically ergodic Markov chain. The effective sample size plays the same role in our analyses as the sample size in the IID case. This concept is usually used when the observations are not independent (e.g., hierarchical autocorrelated observations [25]).

Definition 3.

Let be a V-geometrically ergodic Markov chain with satisfying Definition 1. The effective sample size is

where $\lfloor \cdot \rfloor$ and $\lceil \cdot \rceil$ denote the floor and ceiling functions, respectively.

For a V-geometrically ergodic Markov chain, the effective sample size tends to infinity as the sample size tends to infinity. The uniform convergence rate for the empirical loss when training data are V-geometrically ergodic Markov chains is stated in Lemma 1 below. This lemma is a direct consequence of Theorem 2 in [6].

Lemma 1.

Let the data be a V-geometrically ergodic Markov chain with , and satisfying Definition 1. For all , , if the effective sample size satisfies

then

4 The Batched Weighted Average Algorithm

In this section, we introduce the BWA algorithm. In contrast to the ERM algorithm, which makes predictions based on a single empirical-loss-minimizing hypothesis, the BWA algorithm makes predictions based on the weighted average of the predictions of all hypotheses in the hypothesis space. The pseudocode for the BWA algorithm is given in Algorithm 1.

Inputs for the BWA algorithm are a parameter and a training data sequence , which is a V-geometrically ergodic Markov chain on the state space . The algorithm computes a weight for each hypothesis in the hypothesis space by:

Then, the weights of the hypotheses are normalized to obtain a probability density function with respect to the measure (probability mass function if is finite) over the hypothesis space:

We will call the normalized weight of . Given a new example , we use the normalized weights to compute the weighted average prediction of all the hypotheses on :

In the algorithm, we assume there exists a probability measure on such that . The measure plays a similar role to the prior distribution in Bayesian analysis [26]. It reflects our initial belief about the distribution of the hypotheses in . During the execution of the algorithm, we gradually update our belief, via the weights, based on the prediction of each hypothesis on the training data. The existence of such a measure was also assumed in [8] for averaged classifiers.

When is infinite, we usually cannot compute the value of exactly. In practice, we can apply Markov chain Monte Carlo (MCMC) methods [27] to approximate it. For instance, we can sample hypotheses from the unnormalized weight density and approximate by .

Input: a parameter and the training data.
Initialize the weight of every hypothesis.
for  do
     for  do: accumulate the loss of the hypothesis on the example and update its weight
     end for
end for
Normalize the weights over the hypothesis space.
return the predictor given by the weighted average prediction of all hypotheses.
Algorithm 1 The Batched Weighted Average (BWA) Algorithm
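To make the computation concrete, the following Python sketch implements the BWA procedure for a finite hypothesis set. The absolute loss and the exponential form of the weights, as well as the function names and the parameter c, are assumptions made here for illustration; they are not taken from the formulas of Algorithm 1.

import numpy as np

def bwa_fit(hypotheses, prior, x_train, y_train, c=1.0):
    # hypotheses: list of callables h(x) -> real-valued prediction
    # prior: array of prior masses over the hypotheses (the role of the measure on the hypothesis space)
    # c: weighting parameter (exponential weighting scheme assumed)
    # Cumulative training loss of each hypothesis (absolute loss assumed).
    losses = np.array([sum(abs(h(x) - y) for x, y in zip(x_train, y_train))
                       for h in hypotheses])
    # Unnormalized weight: prior mass times exponentially decayed cumulative loss.
    weights = prior * np.exp(-c * losses)
    # Normalized weights form a probability mass function over the hypotheses.
    return weights / weights.sum()

def bwa_predict(hypotheses, weights, x):
    # Weighted average prediction of all hypotheses on a new example x.
    return float(np.dot(weights, np.array([h(x) for h in hypotheses])))

# Hypothetical usage with a small hypothesis set:
#   hypotheses = [lambda x, a=a: a * x for a in np.linspace(0.0, 2.0, 21)]
#   weights = bwa_fit(hypotheses, np.full(21, 1 / 21), x_train, y_train)
#   y_hat = bwa_predict(hypotheses, weights, x_new)

When the hypothesis space is infinite, the same two steps can be approximated by averaging over hypotheses sampled (for example, by MCMC) from the unnormalized weight density, as discussed above.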

5 Generalization Bound for BWA Algorithm

In this section, we prove the generalization bound for the BWA algorithm when training data are V-geometrically ergodic Markov chains. For the analyses to be valid, we assume the following sets are measurable with respect to :

Since Algorithm 1 does not assume the existence of a perfect hypothesis in , we need to define the optimal expected loss of . Let , the optimal expected loss of is defined as . Note that always exists since and . For all , let be the volume of all the hypotheses with expected loss at most . By definition of , for all , we always have .

The idea of using was proposed in [8] to analyze the generalization bounds of averaged classifiers in the IID case. The argument for considering is that when is uncountable, a comparison between the average hypothesis and a single best hypothesis is meaningless because a single hypothesis typically has measure zero. Hence, we should compare to a set of good hypotheses that has positive measure, as suggested in [8].

To prove the generalization bound, we need Lemma 2 that bounds the normalized weights of all the bad hypotheses. Specifically, this lemma proves that if the effective sample size is large enough, the normalized weights of all the bad hypotheses are sufficiently small with high probability.

Lemma 2.

Let the data be a V-geometrically ergodic Markov chain with , and satisfying Definition 1. For all and , if the effective sample size satisfies

then

Proof.

Denote and . We can write: . If the effective sample size satisfies

then by Lemma 1, with probability at least , we both have:

For all and , we also have . Therefore, with probability at least , for all and ,

Since , we have . Hence, . Note that this inequality holds for all and . Therefore,

Let , we have

Therefore, . ∎

Using Lemma 2, we now prove the following generalization bound for the BWA algorithm.

Theorem 5.1.

Let the data be a V-geometrically ergodic Markov chain with , and satisfying Definition 1. For all and , if the effective sample size satisfies

then

Proof.

We have

Notice that for all , we have: . On the other hand, from Lemma 2, if the effective sample size satisfies

then with probability at least , we have: .

Thus,

Note that when , we have . From the definition of the effective sample size, in order to ensure the previous condition for the sample size , it is sufficient to let

Hence, for

we have . ∎

In Theorem 5.1, the convergence rate of the expected loss to the optimal loss depends not only on the covering number but also on . From the definition of , this value depends mostly on the distribution on . If gives higher probability to hypotheses with small expected loss, will be closer to and the convergence rate will be better. Thus, it is desirable for the BWA algorithm to choose a good distribution . This is analogous to the Bayesian setting where we also need to choose a good prior for the learning algorithm. When is finite, for sufficiently small . In this case, does not depend on , but only depends on .

The bound in Theorem 5.1 and all the subsequent bounds depend on the values of , and . For a given V-geometrically ergodic Markov chain, there may be many values of (, , ) satisfying Definition 1. Thus, to obtain good bounds, we need to choose a value of (, , ) that makes the bounds as tight as possible. This corresponds to selecting small values for these parameters.

When comparing various V-geometrically ergodic Markov chains, Theorem 5.1 suggests that the convergence rate is better if , and are smaller. Small values of these parameters correspond to chains that converge quickly to the stationary distribution . This result is expected because the expected loss is defined with respect to a random example drawn from . In the limit when and , the chains become more IID-like and the effective sample size bound tends to .

From the discussion in Section 3.3, there exists such that for , we have . Therefore, we can deduce the following corollary of Theorem 5.1 in which the bound does not depend on the covering number.

Corollary 1.

Let the data be a V-geometrically ergodic Markov chain with , and satisfying Definition 1. For all and , if the effective sample size satisfies

then .

Since as , by the above corollary, we have for every . Hence, the BWA algorithm is consistent.

6 Robustness Bound for BWA Algorithm

In this section, we consider the robustness of the BWA algorithm when the target variable's values in the training data contain a small amount of noise. In particular, instead of the settings in Section 3.2, we assume that the training data are now , where and form a V-geometrically ergodic Markov chain with stationary distribution . We further assume that the noise is bounded, i.e., for all . However, we will not make any assumption on the distribution of the noise.

With this setting, the BWA algorithm that we consider is essentially the same as Algorithm 1, except that now the algorithm does not have access to the true target variables ’s. Instead, it uses the noisy target variables and updates the hypothesis weights according to the following formula:

Hence, , where is the (noisy) empirical loss of the hypothesis on the noisy dataset :

For any hypothesis , the expected loss is defined as in Section 3.2 with respect to the stationary distribution of the Markov chain . We also let , and be the parameters satisfying Definition 1 for the chain . The optimal expected loss is defined as in Section 5.

We now prove that with this setting, the generalization bound of the BWA algorithm deviates at most by . The steps of the proof are similar to those in Section 5. First, we prove the following uniform convergence bound for V-geometrically ergodic Markov chains with bounded noise.

Lemma 3.

Let the data be a V-geometrically ergodic Markov chain with bounded noise. For all , , if the effective sample size satisfies

then .

Proof.

Let and be defined as in Section 3.2. For all ,

By Lemma 1, if the effective sample size satisfies

then . In this case, . Hence, Lemma 3 holds. ∎

Using Lemma 3, we can prove the following lemma, which is an analogy of Lemma 2.

Lemma 4.

Let the data be a V-geometrically ergodic Markov chain with bounded noise. For all and , if the effective sample size satisfies

then .

Proof.

The proof for this lemma uses the same technique as that of Lemma 2, except that we define and replace Lemma 1 by Lemma 3 with all and . ∎

Using Lemma 4, we can prove the following robustness bound.

Theorem 6.1.

Let the data be a V-geometrically ergodic Markov chain with bounded noise. For all and , if the effective sample size satisfies

then .

Proof.

The proof for this theorem is essentially the same as that of Theorem 5.1, except that we partition into and after the first inequality and then apply Lemma 4 instead of Lemma 2. ∎

From Theorem 6.1, with high probability, the expected loss of is at most larger than the optimal loss when we allow noise with range in the training data. This shows that the BWA algorithm is robust in the sense that it does not perform too badly if the level of noise in the training data is small. In the noiseless case where , we can recover Theorem 5.1. Thus, Theorem 6.1 is a generalization of Theorem 5.1 to the bounded noise case.

7 Applications to other Settings

Our results in Sections 5 and 6 are proven for the regression problem when the pairs of observation and target variables are V-geometrically ergodic. We now prove that our results can be easily applied to other common settings such as the classification problem and the case where there exists an unknown deterministic target hypothesis. The discussion in Section 7.1 is for noiseless training data, while the discussion in Section 7.2 applies to both the noiseless and noisy cases. In this section, we let be the indicator function for the event .

7.1 The Classification Problem

For the classification problem, the training data satisfy for ; and during testing, we need to predict the label of a given data point . If the hypothesis space contains the hypotheses satisfying for all , we can apply Algorithm 1 to compute and use its value to construct the following random classifier:

Let be the expected error of . The following lemma shows that is equal to the expected loss of . Thus, we can bound the probability using this lemma and Theorem 5.1.

Lemma 5.

For all , we have .

Proof.

Note that . Thus,
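A minimal way to complete this computation, under our reading that the labels take values in $\{0,1\}$, that the loss is the absolute loss, and that the random classifier predicts $1$ with probability equal to the weighted average prediction $g(x) \in [0,1]$ (the symbols $g$ and $C$ are ours):
\[
\Pr\big(C(x) \neq y \mid x, y\big) \;=\; g(x)\,(1-y) + \big(1 - g(x)\big)\,y \;=\; |g(x) - y| \qquad \text{for } y \in \{0,1\},
\]
and taking expectations over an example drawn from the stationary distribution shows that the expected error of the random classifier equals the expected loss of the averaged predictor, which is the statement of Lemma 5.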

7.2 When a Target Hypothesis Exists

When there exists an unknown deterministic target hypothesis such that for all and the observation variables form a V-geometrically ergodic Markov chain, the following lemma shows that the chain is V-geometrically ergodic. Thus, our previous results can still be applied in this situation. Note that in our lemma, may not be in .

Lemma 6.

Let be a measurable function and be a -geometrically ergodic Markov chain on . For any deterministic function , the chain is a V-geometrically ergodic Markov chain on with respect to some measurable function .

Proof.

Let be the one-step transition probability of . It is easy to see that is a Markov chain on with the following one-step transition probability :

Intuitively, after taking the first step (from onwards), the new Markov chain on will transit around the points in