# Passive Learning with Target Risk

Mehrdad Mahdavi
Department of Computer Science
Michigan State University
mahdavim@cse.msu.edu
Rong Jin
Department of Computer Science
Michigan State University
rongjin@cse.msu.edu
###### Abstract

In this paper we consider learning in the passive setting, but with a slight modification. We assume that the target expected loss, also referred to as the target risk, is provided to the learner in advance as prior knowledge. Unlike most studies in learning theory, which only incorporate prior knowledge into the generalization bounds, we are able to explicitly utilize the target risk in the learning process. Our analysis reveals a surprising result on the sample complexity of learning: by exploiting the target risk in the learning algorithm, we show that when the loss function is both strongly convex and smooth, the sample complexity reduces to $O(\log\frac{1}{\epsilon})$, an exponential improvement over the $O(\frac{1}{\epsilon})$ sample complexity for learning with strongly convex loss functions. Furthermore, our proof is constructive and is based on a computationally efficient stochastic optimization algorithm for such settings, which demonstrates that the proposed algorithm is practically useful.

## 1 Introduction

In the standard passive supervised learning setting, the learning algorithm is given a set $\mathcal{S}$ of labeled examples drawn i.i.d. from a fixed but unknown distribution $\mathcal{D}$. The goal, with the help of the labeled examples, is to output a classifier from a predefined hypothesis class that does well on unseen examples coming from the same distribution. The sample complexity of an algorithm is the number of examples sufficient to ensure that, with probability at least $1-\delta$ (w.r.t. the random choice of $\mathcal{S}$), the algorithm picks a hypothesis whose error exceeds that of the optimal one by at most $\epsilon$. The sample complexity of passive learning is well established and goes back to early work in learning theory, where lower bounds were obtained in the classic PAC and the general agnostic PAC settings [9, 5, 1].

In light of the no-free-lunch theorem, learning is impossible unless we make assumptions regarding the nature of the problem at hand. Therefore, when approaching a particular learning problem, it is desirable to take into account whatever prior knowledge we have about the problem and to use a specialized algorithm that incorporates this knowledge into the learning process or the theoretical analysis. A key issue in this regard is the formalization of prior knowledge. Such prior knowledge can be expressed by restricting the hypothesis class, by making assumptions on the nature of the unknown distribution or the structure of the data space, or through analytical properties of the loss function used to evaluate performance, sparsity, and margin, to name a few.

There has been an upsurge of interest over the last decade in finding tight upper bounds on the sample complexity by utilizing prior knowledge of the analytical properties of the loss function, which has led to stronger generalization bounds in the agnostic PAC setting. In [17], fast rates were obtained for the squared loss by exploiting the strong convexity of this loss function, a result that holds only under a pseudo-dimensionality assumption. With the recent developments in online strongly convex optimization [11], fast rates approaching $O(\frac{1}{\epsilon}\log\frac{1}{\delta})$ for convex, Lipschitz, strongly convex loss functions have been obtained in [29, 15]. For smooth non-negative loss functions, [27] improved the sample complexity to the optimistic rate

$$O\left(\frac{1}{\epsilon}\left(\frac{\epsilon_{\mathrm{opt}}+\epsilon}{\epsilon}\right)\left(\log^3\frac{1}{\epsilon}+\log\frac{1}{\delta}\right)\right)$$

for non-parametric learning using the notion of local Rademacher complexity [3], where $\epsilon_{\mathrm{opt}}$ is the optimal risk.

In this work, we consider a slightly different setup for passive learning. We assume that before the start of the learning process, the learner has in mind a target expected loss, also referred to as the target risk and denoted by $\epsilon_{\mathrm{prior}}$ (we write $\epsilon_{\mathrm{prior}}$ rather than $\epsilon$ to emphasize that this parameter is known to the learner in advance), and tries to learn a classifier with an expected risk of $O(\epsilon_{\mathrm{prior}})$ by labeling a small number of training examples. We further assume the target risk is feasible, i.e., $\epsilon_{\mathrm{prior}} \ge \epsilon_{\mathrm{opt}}$. To address this problem, we develop an efficient algorithm, based on stochastic optimization, for passive learning with target risk. The most surprising property of the proposed algorithm is that when the loss function is both smooth and strongly convex, it needs only $O(d\log\frac{1}{\epsilon_{\mathrm{prior}}})$ labeled examples to find a classifier with an expected risk of $O(\epsilon_{\mathrm{prior}})$, where $d$ is the dimension of the data. This is a significant improvement over the $O(\frac{1}{\epsilon_{\mathrm{prior}}})$ sample complexity of empirical risk minimization.

The key intuition behind our algorithm is that by knowing the target risk as prior knowledge, the learner has better control over the variance in stochastic gradients, which is the main cause of slow convergence in stochastic optimization and, consequently, of the large sample complexity in passive learning. The trick is to run the stochastic optimization in multiple stages of a fixed size and to decrease the variance of the stochastically perturbed gradients at each iteration by a properly designed mechanism. Another crucial feature of the proposed algorithm is that it uses the target risk to gradually refine the hypothesis space as the algorithm proceeds. Our algorithm differs significantly from standard stochastic optimization algorithms and is able to achieve a geometric convergence rate with the knowledge of the target risk $\epsilon_{\mathrm{prior}}$.

We note that our work does not contradict the lower bound in [27], because a feasible target risk is given in our learning setup and is fully exploited by the proposed algorithm. Knowing that the target risk is feasible makes it possible to improve the sample complexity from $O(\frac{1}{\epsilon_{\mathrm{prior}}})$ to $O(\log\frac{1}{\epsilon_{\mathrm{prior}}})$. We also note that although a logarithmic sample complexity is known for active learning [10, 2], we are unaware of any existing passive learning algorithm that achieves a logarithmic sample complexity by incorporating any kind of prior knowledge.

### 1.1 More Related Work

#### Stochastic Optimization and Learnability

Our work is related to recent studies that examined learnability from the viewpoint of stochastic convex optimization. In [28, 26], the authors presented learning problems that are learnable by stochastic convex optimization but not by empirical risk minimization (ERM). Our work follows this line of research. The proposed algorithm achieves a sample complexity that is logarithmic in $1/\epsilon_{\mathrm{prior}}$ by explicitly incorporating the target expected risk into the stochastic convex optimization algorithm. It is, however, difficult to incorporate such knowledge into the framework of ERM. Furthermore, it is worth noting that in [23, 28, 22, 4], the authors explored the connection between online optimization and statistical learning in the opposite direction, by applying the complexity measures developed in statistical learning to the learnability of online learning.

#### Online and Stochastic Optimization

The proposed algorithm is closely related to recent works showing that $O(1/T)$ is the optimal convergence rate for stochastic optimization when the objective function is strongly convex [14, 12, 21]. In contrast, the proposed algorithm is able to achieve a geometric convergence rate for a target optimization error. As in the previous argument, our result does not contradict the lower bound given in [12], because a feasible optimization error is known in advance. Moreover, in contrast to the multistage algorithm in [12], where the size of the stages increases exponentially, in our algorithm the size of each stage is fixed to a constant.

#### Outline

The remainder of the paper is organized as follows: In Section 2, we set up notation, describe the setting, and discuss the assumptions on which our algorithm relies. Section 3 motivates the problem and discusses the main intuition of our algorithm. The proposed algorithm and main result are discussed in Section 4. We prove the main result in Section 5. Section 6 concludes the paper and the appendix contains the omitted proofs.

## 2 Preliminaries

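As usual in the framework of statistical learning theory, we consider a domain $\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}$ is the space of instances and $\mathcal{Y}$ is the set of labels, and $\mathcal{H}$ is a hypothesis class. We assume that the domain space $\mathcal{Z}$ is endowed with an unknown Borel probability measure $\mathcal{D}$. We measure the performance of a specific hypothesis $h$ by defining a nonnegative loss function $\ell: \mathcal{H}\times\mathcal{Z}\to\mathbb{R}_+$. We denote the risk of a hypothesis $h$ by $L(h) = \mathrm{E}_{z\sim\mathcal{D}}[\ell(h,z)]$. Given a sample $\mathcal{S}$ of examples drawn i.i.d. from $\mathcal{D}$, the goal of a learning algorithm is to pick a hypothesis $h$ from $\mathcal{H}$ in such a way that its risk $L(h)$ is close to the minimum possible risk of a hypothesis in $\mathcal{H}$.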

Throughout this paper we pursue the stochastic optimization viewpoint of risk minimization, as detailed in Section 3. Precisely, we focus on convex learning problems, for which we assume that the hypothesis class $\mathcal{H}$ is a parametrized convex set $\mathcal{H} = \{h_w : x\mapsto\langle w, x\rangle \,:\, w\in\mathbb{R}^d, \|w\|\le R\}$ and that, for all $z = (x,y)\in\mathcal{Z}$, the loss function $\ell(\cdot, z)$ is a non-negative convex function. Thus, in the remainder we simply use the vector $w$ to represent $h_w$, rather than working with the hypothesis $h_w$. We will assume throughout that $\mathcal{X}\subseteq\mathbb{R}^d$ is the unit ball, so that $\|x\|\le 1$. Finally, the conditions under which we can get the desired result on sample complexity depend on analytic properties of the loss function. In particular, we assume that the loss function is strongly convex and smooth [20].

###### Definition 1 (Strong convexity).

A loss function $\ell(w)$ is said to be $\alpha$-strongly convex w.r.t. a norm $\|\cdot\|$ (throughout this paper, we only consider the $\ell_2$-norm) if there exists a constant $\alpha > 0$ (often called the modulus of strong convexity) such that, for any $\lambda\in[0,1]$ and for all $w_1, w_2\in\mathcal{H}$, it holds that

$$\ell(\lambda w_1+(1-\lambda)w_2)\le\lambda\,\ell(w_1)+(1-\lambda)\,\ell(w_2)-\frac{1}{2}\lambda(1-\lambda)\alpha\|w_1-w_2\|^2.$$

When $\ell(w)$ is differentiable, strong convexity is equivalent to

$$\ell(w_1)\ge\ell(w_2)+\langle\nabla\ell(w_2),w_1-w_2\rangle+\frac{\alpha}{2}\|w_1-w_2\|^2,\qquad\forall w_1,w_2\in\mathcal{H}.$$

We would like to emphasize that in our setting, we only need the expected loss function to be strongly convex, without having to assume strong convexity of the individual loss functions.

Another property of the loss function that underlies our analysis is its smoothness. Smooth functions arise, for instance, in logistic and least-squares regression, and in general in learning linear predictors where the loss function has a Lipschitz-continuous gradient.

###### Definition 2 (Smoothness).

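A differentiable loss function $\ell(w)$ is said to be $\beta$-smooth with respect to a norm $\|\cdot\|$ if it holds that

$$\ell(w_1)\le\ell(w_2)+\langle\nabla\ell(w_2),w_1-w_2\rangle+\frac{\beta}{2}\|w_1-w_2\|^2,\qquad\forall w_1,w_2\in\mathcal{H}. \tag{1}$$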
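As a small numerical illustration of Definitions 1 and 2 (not part of the original analysis), the sketch below takes a quadratic $L(w)=\frac{1}{2}w^\top A w$ with a symmetric positive definite $2\times 2$ matrix $A$. For such a function the strong-convexity modulus is the smallest eigenvalue of $A$ and the smoothness constant is the largest one. The matrix, the helper names, and the random test points are assumptions chosen purely for illustration; the check uses the first-order lower bound $L(w_1)\ge L(w_2)+\langle\nabla L(w_2),w_1-w_2\rangle+\frac{\alpha}{2}\|w_1-w_2\|^2$ together with the upper bound (1).

```python
import math
import random

def quad_value_and_grad(A, w):
    """Return L(w) = 0.5 * w^T A w and its gradient A w for a 2x2 symmetric A."""
    Aw = [A[0][0] * w[0] + A[0][1] * w[1], A[1][0] * w[0] + A[1][1] * w[1]]
    return 0.5 * (w[0] * Aw[0] + w[1] * Aw[1]), Aw

def moduli(A):
    """Smallest and largest eigenvalues (alpha, beta) of a 2x2 symmetric A."""
    mean = 0.5 * (A[0][0] + A[1][1])
    radius = math.hypot(0.5 * (A[0][0] - A[1][1]), A[0][1])
    return mean - radius, mean + radius

def satisfies_definitions(A, trials=200, seed=0):
    """Check the strong-convexity lower bound and the smoothness upper bound (1)."""
    rng = random.Random(seed)
    alpha, beta = moduli(A)
    for _ in range(trials):
        w1 = [rng.gauss(0, 1), rng.gauss(0, 1)]
        w2 = [rng.gauss(0, 1), rng.gauss(0, 1)]
        f1, _ = quad_value_and_grad(A, w1)
        f2, g2 = quad_value_and_grad(A, w2)
        diff = [w1[0] - w2[0], w1[1] - w2[1]]
        linear = f2 + g2[0] * diff[0] + g2[1] * diff[1]  # first-order expansion at w2
        gap = diff[0] ** 2 + diff[1] ** 2                # squared distance
        lower = linear + 0.5 * alpha * gap
        upper = linear + 0.5 * beta * gap
        if not (lower - 1e-9 <= f1 <= upper + 1e-9):
            return False
    return True

A = [[2.0, 0.5], [0.5, 1.0]]  # illustrative positive definite matrix
alpha, beta = moduli(A)
```

The ratio `beta / alpha` computed this way is the condition number that governs how hard the problem is for first-order methods.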

## 3 The Curse of Stochastic Oracle

We begin by discussing stochastic optimization for risk minimization, convex learnability, and then the main intuition that motivates this work.

Most existing learning algorithms follow the framework of empirical risk minimization (ERM) or regularized ERM, which was developed to a great extent by Vapnik and Chervonenkis [30]. Essentially, ERM methods use the empirical loss over $\mathcal{S}$, i.e., the average of $\ell(h,z)$ over the examples $z\in\mathcal{S}$, as a criterion to pick a hypothesis. In regularized ERM methods, the learner picks a hypothesis that jointly minimizes the empirical loss and a regularization function over $h$. We note that ERM resembles the widely used Sample Average Approximation (SAA) method in the optimization community when the hypothesis space and the loss function are convex. If uniform convergence holds, then the empirical risk minimizer is consistent, i.e., the population risk of the ERM converges to the optimal population risk, and the problem is learnable using ERM.

A rather different paradigm for risk minimization is stochastic optimization. Recall that the goal of learning is to approximately minimize the risk $L(w)=\mathrm{E}_{z\sim\mathcal{D}}[\ell(w,z)]$. However, since the distribution $\mathcal{D}$ is unknown to the learner, we cannot use standard gradient methods to minimize the expected loss. Stochastic optimization methods circumvent this problem by allowing the optimization method to take a step that is only in expectation along the negative of the gradient. To motivate stochastic optimization as an alternative to ERM, [25, 24] challenged the ERM method and showed that there is a real gap between learnability and uniform convergence, by investigating non-trivial problems where no uniform convergence holds but which are still learnable using the Stochastic Gradient Descent (SGD) algorithm [18]. These results uncovered an important relationship between learnability and stability, and showed that stability, together with approximate empirical risk minimization, assures learnability [26]. We note that Lipschitzness or smoothness of the loss function is necessary for an algorithm to be stable, and that boundedness and convexity alone are not sufficient to ensure that a convex learning problem is learnable.

To directly solve $\min_{w\in\mathcal{H}}L(w)$, a typical stochastic optimization algorithm initially picks some point in the feasible set $\mathcal{H}$ and iteratively updates this point based on first-order perturbed gradient information about the function at the current point. For instance, the widely used SGD algorithm starts with an initial point $w_0\in\mathcal{H}$; at each iteration $t$, it queries the stochastic oracle ($\mathcal{SO}$) at $w_t$ to obtain a perturbed but unbiased gradient $\widehat{g}_t$ and updates the current solution by

$$w_{t+1}=\Pi_{\mathcal{H}}(w_t-\eta_t\widehat{g}_t),$$

where $\Pi_{\mathcal{H}}$ projects the solution onto the domain $\mathcal{H}$ and $\eta_t$ is the step size. To capture the efficiency of optimization procedures in a general sense, one can use the oracle complexity of the algorithm which, roughly speaking, is the minimum number of calls to any oracle needed by any method to achieve the desired accuracy [20]. We note that the oracle complexity corresponds to the sample complexity of learning from the stochastic optimization viewpoint discussed previously. The following theorem states a lower bound on the sample complexity of stochastic optimization algorithms [19].
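The projected SGD update described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the setup used in the paper: the domain is taken to be a Euclidean ball, the oracle is a hypothetical noise-free least-squares oracle with a made-up minimizer `w_true`, and the step size schedule $\eta_t = 0.5/\sqrt{t}$ is an arbitrary choice.

```python
import math
import random

def project_ball(w, radius):
    """Euclidean projection of w onto the ball {u : ||u|| <= radius}."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm <= radius:
        return list(w)
    return [wi * radius / norm for wi in w]

def projected_sgd(oracle, d, radius, steps, step_size):
    """Run w_{t+1} = Pi_H(w_t - eta_t * g_t), starting from the origin."""
    w = [0.0] * d
    for t in range(1, steps + 1):
        g = oracle(w)                 # unbiased stochastic gradient at w
        eta = step_size(t)
        w = project_ball([wi - eta * gi for wi, gi in zip(w, g)], radius)
    return w

# Hypothetical oracle: gradient of 0.5 * (<w, x> - y)^2 with y = <w_true, x>.
rng = random.Random(0)
w_true = [0.5, -0.3]

def squared_loss_oracle(w):
    x = [rng.uniform(-1.0, 1.0) for _ in w]
    residual = sum((wi - ti) * xi for wi, ti, xi in zip(w, w_true, x))
    return [residual * xi for xi in x]

w_out = projected_sgd(squared_loss_oracle, d=2, radius=1.0, steps=2000,
                      step_size=lambda t: 0.5 / math.sqrt(t))
```

Each call to `oracle` consumes one fresh example, which is why the number of oracle calls and the number of training examples coincide in this viewpoint.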

###### Theorem 3 (Lower Bound on Oracle Complexity).

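Suppose $L(w)$ is an $\alpha$-strongly convex and $\beta$-smooth function defined over a convex domain $\mathcal{H}$. Let $\mathcal{SO}$ be a stochastic oracle that, for any point $w\in\mathcal{H}$, returns an unbiased estimate $\widehat{g}$, i.e., $\mathrm{E}[\widehat{g}]=\nabla L(w)$, such that $\mathrm{E}[\|\widehat{g}-\nabla L(w)\|^2]\le\sigma^2$ holds. Then for any stochastic optimization algorithm to find a solution $\widehat{w}$ with accuracy $\epsilon$ with respect to the optimal solution $w^*$, i.e., $\mathrm{E}[L(\widehat{w})-L(w^*)]\le\epsilon$, the number of calls to $\mathcal{SO}$ is lower bounded by

$$O(1)\left(\sqrt{\frac{\beta}{\alpha}}\log\left(\frac{\beta\|w_0-w^*\|^2}{\epsilon}\right)+\frac{\sigma^2}{\alpha\epsilon}\right). \tag{2}$$

The first term in (2) comes from the deterministic oracle complexity, and the second term is due to the noisy gradient information provided by $\mathcal{SO}$. As indicated in (2), the slow convergence rate of stochastic optimization is due to the variance in stochastic gradients, which forces at least $O(\frac{\sigma^2}{\alpha\epsilon})$ queries to be issued. We note that the idea of mini-batches [7, 8], although it reduces the variance in stochastic gradients, does not reduce the oracle complexity.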

We close this section by informally explaining why a logarithmic sample complexity is, in principle, possible under the assumption that the target risk is known to the learner. To this end, consider the setting of Theorem 3 and assume that the learner is given the prior accuracy $\epsilon_{\mathrm{prior}}$ and is asked to find an $\epsilon_{\mathrm{prior}}$-accurate solution. If it happens that the variance of $\mathcal{SO}$ has the same magnitude as $\epsilon_{\mathrm{prior}}$, i.e., $\sigma^2 = O(\epsilon_{\mathrm{prior}})$, then from (2) it follows that the second term reduces to a constant and the learner needs to issue only $O(\log\frac{1}{\epsilon_{\mathrm{prior}}})$ queries to find the solution. However, since there is no control over $\mathcal{SO}$ beyond the fact that the variance of the stochastic gradients is bounded, the learner needs a mechanism to manage the variance of the perturbed gradients at each iteration in order to alleviate the influence of noisy gradients. One strategy is to replace the unbiased estimate of the gradient with a biased one, which unfortunately may yield loose bounds. To overcome this problem, we introduce a strategy that shrinks the solution space with respect to the target risk to control the damage caused by biased estimates.
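To make the informal argument concrete, the following sketch evaluates the two terms of the bound (2), $\sqrt{\beta/\alpha}\,\log(\beta\|w_0-w^*\|^2/\epsilon)$ and $\sigma^2/(\alpha\epsilon)$, up to the unspecified $O(1)$ factor. The function name and all numerical values are assumptions chosen only for illustration.

```python
import math

def oracle_bound_terms(alpha, beta, sigma2, eps, dist0_sq=1.0):
    """Return the deterministic and the stochastic term of bound (2), constants dropped."""
    deterministic = math.sqrt(beta / alpha) * math.log(beta * dist0_sq / eps)
    stochastic = sigma2 / (alpha * eps)
    return deterministic, stochastic

eps = 1e-4
# Fixed noise level: the stochastic term grows like 1/eps and dominates.
det_fixed, sto_fixed = oracle_bound_terms(alpha=0.5, beta=2.0, sigma2=1.0, eps=eps)
# Noise level comparable to the accuracy: the stochastic term stays at 1/alpha.
det_small, sto_small = oracle_bound_terms(alpha=0.5, beta=2.0, sigma2=eps, eps=eps)
```

With a fixed noise level the second term is of order $1/\epsilon$, whereas with $\sigma^2$ of the order of $\epsilon$ only the logarithmic first term grows as $\epsilon$ shrinks.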

## 4 Algorithm and Main Result

In this section we proceed to describe the proposed algorithm and state the main result on its sample complexity.

### 4.1 Description of Algorithm

We now turn to describing our algorithm. Interestingly, our algorithm is quite dissimilar to classic stochastic optimization methods. It proceeds by running an online algorithm on fixed-size chunks of examples, and by using the intermediate hypotheses and the target risk to gradually refine the hypothesis space. As mentioned above, we assume in our setting that the target expected risk $\epsilon_{\mathrm{prior}}$ is provided to the learner a priori. We further assume the target risk is feasible for the solution within the domain $\mathcal{H}$, i.e., $\epsilon_{\mathrm{prior}}\ge\epsilon_{\mathrm{opt}}$. The proposed algorithm explicitly takes advantage of the knowledge of the expected risk $\epsilon_{\mathrm{prior}}$ to attain an $O(\log\frac{1}{\epsilon_{\mathrm{prior}}})$ sample complexity.

Throughout we shall consider linear predictors of the form $\langle w,x\rangle$ and assume that the loss function of interest $\ell(\langle w,x\rangle,y)$ is $\beta$-smooth. It is straightforward to see that $L(w)=\mathrm{E}[\ell(\langle w,x\rangle,y)]$ is also $\beta$-smooth. In addition to the smoothness of the loss function, we also assume $L(w)$ to be $\alpha$-strongly convex. We denote by $w^*$ the optimal solution that minimizes $L(w)$, i.e., $w^*=\arg\min_{w\in\mathcal{H}}L(w)$, and denote its optimal value by $\epsilon_{\mathrm{opt}}$.

Let $(x_1,y_1),\ldots,(x_T,y_T)$ be a sequence of i.i.d. training examples. The proposed algorithm divides the $T$ iterations into $m$ stages, where each stage consists of $T_1$ training examples, i.e., $T=mT_1$. Let $(x_k^t,y_k^t)$ be the $t$-th training example received at stage $k$, and let $\eta$ be the step size used by all the stages. At the beginning of each stage $k$, we initialize the solution by the average solution $\widehat{w}_k$ obtained from the last stage, i.e.,

$$\widehat{w}_k=\frac{1}{T_1}\sum_{t=1}^{T_1}w_{k-1}^t, \tag{3}$$

where $w_{k-1}^t$ denotes the $t$-th solution at stage $k-1$. Another feature of the proposed algorithm is a domain shrinking strategy that adjusts the domain as the algorithm proceeds, using the intermediate hypotheses and the target risk. We define the domain $\mathcal{H}_k$ used at stage $k$ as

$$\mathcal{H}_k=\{w\in\mathcal{H}:\|w-\widehat{w}_k\|\le\Delta_k\}, \tag{4}$$

where $\Delta_k$ is the domain size, whose value will be discussed later. Similar to the SGD method, at each iteration $t$ of stage $k$, we receive a training example $(x_k^t,y_k^t)$ and compute the gradient $\widehat{g}_k^t$ of the loss on this example at $w_k^t$. Instead of using the gradient directly, following [13], a clipped version of the gradient, denoted by $v_k^t$, will be used for updating the solution. More specifically, the clipped vector $v_k^t$ is defined as

$$[v_k^t]_i=\mathrm{clip}(\gamma_k,[\widehat{g}_k^t]_i)=\mathrm{sign}([\widehat{g}_k^t]_i)\min(\gamma_k,|[\widehat{g}_k^t]_i|),\qquad i=1,\ldots,d, \tag{5}$$

where $\gamma_k=2\xi\beta\Delta_k$ is the clipping threshold, with the parameter $\xi$ set as in Theorem 4. Given the clipped gradient $v_k^t$, we follow the standard framework of stochastic gradient descent and update the solution by

$$w_k^{t+1}=\Pi_{\mathcal{H}_k}(w_k^t-\eta v_k^t). \tag{6}$$

The purpose of introducing the clipped version of the gradient is to effectively control the variance in stochastic gradients, an important step toward achieving the geometric convergence rate. At the end of each stage, we update the domain size by explicitly exploiting the target expected risk $\epsilon_{\mathrm{prior}}$ as

$$\Delta_{k+1}=\sqrt{\varepsilon\Delta_k^2+\tau\epsilon_{\mathrm{prior}}}, \tag{7}$$

where $\varepsilon$ and $\tau$ are two parameters, both of which will be discussed later.

Algorithm 1 gives the detailed steps for the proposed method. The three important aspects of Algorithm 1, all crucial to achieve a geometric convergence rate, are highlighted as follows:

• Each stage of the proposed algorithm is comprised of the same number of training examples. This is in contrast to the epoch gradient algorithm [12] which divides iterations into exponentially increasing epochs, and runs SGD with averaging on each epoch. Also, in our case the learning rate is fixed for all iterations.

• The proposed algorithm uses a clipped gradient for updating the solution in order to better control the variance in stochastic gradients; this stands in contrast to the SGD method, which uses original gradients to update the solution.

• The proposed algorithm takes into account the target expected risk and the intermediate hypotheses when updating the domain size at each stage. The purpose of domain shrinking is to reduce the damage caused by the biased gradients that result from the clipping operation.
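The three ingredients listed above (fixed-size stages, clipped gradients, and domain shrinking) can be put together in a short sketch. This is an illustrative reading of the update rules (3) to (7), not the authors' reference implementation of Algorithm 1. Several details are assumptions: the starting point is the origin, the initial domain size is taken to be $R$, the clipping threshold is $\gamma_k = 2\xi\beta\Delta_k$ as it appears in the analysis, and the projection enforces only the stage ball around $\widehat{w}_k$ (it assumes that ball lies inside $\mathcal{H}$). The least-squares demo and its parameter values are hypothetical and are not the values prescribed by Theorem 4.

```python
import math
import random

def clip(gamma, g):
    """Coordinate-wise clipping to [-gamma, gamma], following (5)."""
    return [math.copysign(min(gamma, abs(gi)), gi) for gi in g]

def project_ball(w, center, radius):
    """Euclidean projection of w onto {u : ||u - center|| <= radius}."""
    diff = [wi - ci for wi, ci in zip(w, center)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm <= radius:
        return list(w)
    return [ci + x * radius / norm for ci, x in zip(center, diff)]

def target_risk_sgd(examples, grad, d, m, T1, eta, xi, beta, eps, tau, eps_prior, R):
    """Multistage clipped SGD with domain shrinking (illustrative sketch)."""
    stream = iter(examples)
    w_hat = [0.0] * d                  # assumed starting point
    delta = R                          # assumed initial domain size
    for _ in range(m):
        gamma = 2.0 * xi * beta * delta            # clipping threshold gamma_k
        w = list(w_hat)
        running_sum = [0.0] * d
        for _ in range(T1):
            x, y = next(stream)
            running_sum = [s + wi for s, wi in zip(running_sum, w)]
            v = clip(gamma, grad(w, x, y))         # clipped stochastic gradient
            step = [wi - eta * vi for wi, vi in zip(w, v)]
            w = project_ball(step, w_hat, delta)   # update (6), stage ball only
        w_hat = [s / T1 for s in running_sum]      # stage average, as in (3)
        delta = math.sqrt(eps * delta ** 2 + tau * eps_prior)  # update (7)
    return w_hat

# Hypothetical demo: noise-free least squares with inputs of norm below one.
rng = random.Random(0)
w_true = [0.5, -0.3]

def sample(n):
    for _ in range(n):
        x = [rng.uniform(-0.7, 0.7), rng.uniform(-0.7, 0.7)]
        yield x, sum(a * b for a, b in zip(w_true, x))

def sq_grad(w, x, y):
    r = sum(a * b for a, b in zip(w, x)) - y
    return [r * xi for xi in x]

w_out = target_risk_sgd(sample(800), sq_grad, d=2, m=4, T1=200, eta=0.5,
                        xi=4.0, beta=1.0, eps=0.5, tau=0.25, eps_prior=0.01, R=1.0)
```

Note that the stage length `T1` and the step size `eta` stay constant across stages; only the clipping threshold and the domain size change from one stage to the next.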

### 4.2 Main Result on Sample Complexity

The main theoretical result of Algorithm 1 is given in the following theorem.

###### Theorem 4 (Convergence Rate).

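Assume that the hypothesis space $\mathcal{H}$ is compact and the loss function $\ell$ is $\alpha$-strongly convex and $\beta$-smooth. Let $T=mT_1$ be the size of the sample and $\epsilon_{\mathrm{prior}}$ be the target expected loss given to the learner in advance such that $\epsilon_{\mathrm{opt}}\le\epsilon_{\mathrm{prior}}$ holds. Given the parameters $\varepsilon$ and $\tau$, set $\xi$, $T_1$, and $\eta$ as

$$\xi=\frac{4\beta}{\alpha\tau},\qquad T_1=4\max\left\{\frac{\xi^3\beta d+2\xi\beta\sqrt{d}}{\varepsilon\alpha}\ln\frac{ms}{\delta},\ \frac{16\xi^2\beta^2}{\alpha^2\varepsilon^2}\right\},\qquad \eta=\frac{1}{2\xi\beta\sqrt{T_1}},$$

where

$$s=\left\lceil\log_2\frac{\xi\beta R^2}{\epsilon_{\mathrm{prior}}}\right\rceil. \tag{8}$$

After running Algorithm 1 over $m$ stages, we have, with probability $1-\delta$,

$$L(\widehat{w}_{m+1})\le\frac{\beta R^2}{2}\varepsilon^m+\left(1+\frac{\tau}{1-\varepsilon}\right)\epsilon_{\mathrm{prior}},$$

implying that only $O(d\log\frac{1}{\epsilon_{\mathrm{prior}}})$ training examples are needed in order to achieve a risk of $O(\epsilon_{\mathrm{prior}})$.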

We note that, compared to the bound in Theorem 3, for Algorithm 1 the level of error down to which linear convergence holds is determined not by the noise level in the stochastic gradients but by the target risk. In other words, by knowing the target risk as prior knowledge, the algorithm is able to tolerate the noise and achieves linear convergence to the level of the target risk even when the variance of the stochastic gradients is much larger than the target risk. In addition, although the result given in Theorem 4 assumes a bounded domain with $\|w\|\le R$, this assumption can be lifted by exploiting the strong convexity of the loss function and further assuming that the loss function is Lipschitz continuous with constant $G$, i.e., $|L(w)-L(w')|\le G\|w-w'\|$. More specifically, since $L(w)$ is $\alpha$-strongly convex, the first-order optimality condition for the optimal solution $w^*$ gives

$$L(w)-L(w^*)\ge\frac{\alpha}{2}\|w-w^*\|^2,\qquad\forall w\in\mathcal{H}.$$

Combined with the Lipschitz continuity assumption, this inequality implies that $\frac{\alpha}{2}\|w-w^*\|^2\le G\|w-w^*\|$, so that $\|w-w^*\|\le 2G/\alpha$ holds for any $w\in\mathcal{H}$, and therefore we can simply set $R=2G/\alpha$. We also note that this dependency can be resolved under a weaker assumption than Lipschitz continuity, one that depends only on the gradient of the loss function at the origin: combining a bound on the norm of that gradient with the strong convexity of $L(w)$ again yields a bound on $\|w^*\|$, which can then be used as $R$.

We now use our analysis of Algorithm 1 to obtain a sample complexity analysis for learning smooth strongly convex problems with a bounded hypothesis class. To make it easier to parse, we only keep the dependency on the main parameters , , , , and and hide the dependency on other constants in notation. Let denote the output of Algorithm 1. By setting and letting to be an arbitrary small number, Theorem 4 yields the following:

###### Corollary 5 (Sample Complexity).

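Under the same conditions as Theorem 4, by running Algorithm 1 for minimizing $L(w)$ with a number of iterations (i.e., number of training examples) $T$, if it holds that

$$T\ge O\left(d\kappa^4\left(\log\frac{1}{\epsilon_{\mathrm{prior}}}\log\log\frac{1}{\epsilon_{\mathrm{prior}}}+\log\frac{1}{\delta}\right)\right),$$

where $\kappa=\beta/\alpha$ denotes the condition number of the loss function and $d$ is the dimension of the data, then with probability $1-\delta$, $\widehat{w}$ attains a risk of $O(\epsilon_{\mathrm{prior}})$, i.e., $L(\widehat{w})\le O(\epsilon_{\mathrm{prior}})$.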
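To get a feel for the constants involved, the sketch below evaluates the parameter choices of Theorem 4 as read here: $\xi = 4\beta/(\alpha\tau)$, $s=\lceil\log_2(\xi\beta R^2/\epsilon_{\mathrm{prior}})\rceil$, $T_1 = 4\max\{\frac{\xi^3\beta d+2\xi\beta\sqrt{d}}{\varepsilon\alpha}\ln\frac{ms}{\delta}, \frac{16\xi^2\beta^2}{\alpha^2\varepsilon^2}\}$, and $\eta = 1/(2\xi\beta\sqrt{T_1})$. The function name, the rounding of $T_1$ up to an integer, and the numerical inputs are assumptions made for illustration.

```python
import math

def theorem4_parameters(alpha, beta, d, R, eps_prior, eps, tau, delta, m):
    """Return (xi, s, T1, eta) following the parameter choices of Theorem 4."""
    xi = 4.0 * beta / (alpha * tau)
    s = math.ceil(math.log2(xi * beta * R ** 2 / eps_prior))
    c = xi ** 3 * beta * d + 2.0 * xi * beta * math.sqrt(d)
    T1 = 4.0 * max(c / (eps * alpha) * math.log(m * s / delta),
                   16.0 * xi ** 2 * beta ** 2 / (alpha ** 2 * eps ** 2))
    T1 = math.ceil(T1)                      # stage length must be an integer
    eta = 1.0 / (2.0 * xi * beta * math.sqrt(T1))
    return xi, s, T1, eta

xi, s, T1, eta = theorem4_parameters(alpha=1.0, beta=1.0, d=2, R=1.0,
                                     eps_prior=0.01, eps=0.5, tau=0.25,
                                     delta=0.05, m=10)
```

Since $\xi$ scales with $\kappa=\beta/\alpha$ and $T_1$ contains the term $\xi^3\beta d/(\varepsilon\alpha)$, the stage length grows like $d\kappa^4$, which matches the dependence stated in Corollary 5.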

A concrete problem that fits the setting of the present work is regression with the squared loss. It is easy to show that the average squared loss has a Lipschitz-continuous gradient, with a constant $\beta$ given by the largest eigenvalue of the second-moment matrix computed from the data matrix $X$. Strong convexity is guaranteed as long as the population data covariance matrix is not rank-deficient and its minimum eigenvalue is lower bounded by a constant $\alpha>0$. For this problem, the optimal minimax sample complexity is known to be $O(\frac{1}{\epsilon})$ but, as Corollary 5 implies, with knowledge of the target risk $\epsilon_{\mathrm{prior}}$ it is possible to reduce the sample complexity to $O(\log\frac{1}{\epsilon_{\mathrm{prior}}})$.

###### Remark 6.

It is worth noting that the sample complexity of Theorem 4 has a $\kappa^4$ dependency on the condition number of the loss function, which is worse than the $\sqrt{\kappa}$ dependency in the lower bound in (2). Also, the explicit dependency of the sample complexity on the dimension $d$ makes the proposed algorithm inappropriate for non-parametric settings.

## 5 Analysis

Now we turn to proving the main theorem. The proof is given in a series of lemmas and theorems, the proofs of some of which are deferred to the appendix. The proof makes use of the Bernstein inequality for martingales, the idea of the peeling process, the self-bounding property of smooth loss functions, the standard analysis of stochastic optimization, and novel ideas to derive the claimed sample complexity for the proposed algorithm.

The proof of Theorem 4 is by induction and we start with the key step given in the following theorem.

###### Theorem 7.

Assume . For a fixed stage , if , then, with a probability , we have

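$$\|\widehat{w}_{k+1}-w^*\|^2\le a\Delta_k^2+b\,\epsilon_{\mathrm{prior}}$$

where

$$a=\frac{2}{\alpha T_1}\left(2\xi\beta\sqrt{T_1}+\left[\xi^3\beta d+2\xi\beta\sqrt{d}\right]\ln\frac{s}{\delta}\right),\qquad b=\frac{8}{\alpha\xi} \tag{9}$$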

and is given in (8), provided that and hold.

Taking this statement as given for the moment, we proceed with the proof of Theorem 4, returning later to establish the claim stated in Theorem 7.

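###### Proof of Theorem 4.

Set $a$ and $b$ in (9) of Theorem 7 so that $a\le\varepsilon$ and $b=2\tau/\beta$. The second condition gives $\xi=\frac{4\beta}{\alpha\tau}$, and the first is equivalent to

$$T_1\ge\frac{2}{\alpha\varepsilon}\left(2\xi\beta\sqrt{T_1}+\left[\xi^3\beta d+2\xi\beta\sqrt{d}\right]\ln\frac{s}{\delta}\right),$$

which is ensured whenever

$$T_1\ge 4\max\left\{\frac{\xi^3\beta d+2\xi\beta\sqrt{d}}{\varepsilon\alpha}\ln\frac{s}{\delta},\ \frac{16\xi^2\beta^2}{\alpha^2\varepsilon^2}\right\}.$$

Thus, using Theorem 7 with these choices, we have, with probability $1-\delta$,

$$\Delta_{k+1}^2\le\varepsilon\Delta_k^2+\frac{2\tau}{\beta}\epsilon_{\mathrm{prior}}.$$

After $m$ stages, with probability $1-m\delta$ (by a union bound over the stages; replacing $\delta$ with $\delta/m$ recovers the probability $1-\delta$ and the $\ln\frac{ms}{\delta}$ term of Theorem 4), we have

$$\Delta_{m+1}^2\le\varepsilon^m\Delta_1^2+\frac{2\tau}{\beta}\epsilon_{\mathrm{prior}}\sum_{i=0}^{m-1}\varepsilon^i\le\varepsilon^m\Delta_1^2+\frac{2\tau}{\beta(1-\varepsilon)}\epsilon_{\mathrm{prior}}.$$

By the $\beta$-smoothness of $L(w)$, this implies that

$$\begin{aligned}
L(\widehat{w}_{m+1})-L(w^*)&\le\frac{\beta}{2}\|\widehat{w}_{m+1}-w^*\|^2\\
&\le\frac{\beta}{2}\varepsilon^m\Delta_1^2+\frac{\tau}{1-\varepsilon}\epsilon_{\mathrm{prior}}\\
&\le\frac{\beta R^2}{2}\varepsilon^m+\frac{\tau}{1-\varepsilon}\epsilon_{\mathrm{prior}},
\end{aligned}$$

where the last inequality follows from $\Delta_1\le R$. The bound stated in the theorem then follows from the assumption that $L(w^*)=\epsilon_{\mathrm{opt}}\le\epsilon_{\mathrm{prior}}$. ∎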
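The geometric-series step in the proof can be checked numerically. The sketch below iterates a recursion of the form $\Delta_{k+1}^2=\varepsilon\Delta_k^2+c$ (with $c$ standing for the additive term $\frac{2\tau}{\beta}\epsilon_{\mathrm{prior}}$) and compares the result with the closed-form bound $\varepsilon^m\Delta_1^2+\frac{c}{1-\varepsilon}$. The function names and the numerical values are illustrative assumptions.

```python
def unroll(delta1_sq, eps, c, m):
    """Iterate Delta_{k+1}^2 = eps * Delta_k^2 + c for m stages."""
    value = delta1_sq
    for _ in range(m):
        value = eps * value + c
    return value

def closed_form_bound(delta1_sq, eps, c, m):
    """Upper bound eps^m * Delta_1^2 + c / (1 - eps) from the geometric series."""
    return eps ** m * delta1_sq + c / (1.0 - eps)

# Illustrative values: Delta_1^2 = 1, eps = 0.5, c = 2 * tau / beta * eps_prior.
c = 2 * 0.25 / 1.0 * 0.01
values = [(unroll(1.0, 0.5, c, m), closed_form_bound(1.0, 0.5, c, m))
          for m in range(0, 30)]
```

The first component decays geometrically in `m` until it reaches the floor `c / (1 - eps)`, which is proportional to the target risk; this is the level at which the linear convergence of the algorithm stops.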

### 5.1 Proof of Theorem 7

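To bound $\|\widehat{w}_{k+1}-w^*\|$ in terms of $\Delta_k$, we start with the standard analysis of online learning. In particular, from the strong convexity of $L(w)$ and the updating rule (6) we have

$$\begin{aligned}
L(w_k^t)-L(w^*)&\le\langle\nabla L(w_k^t),w_k^t-w^*\rangle-\frac{\alpha}{2}\|w_k^t-w^*\|^2\\
&=\langle v_k^t,w_k^t-w^*\rangle+\langle\nabla L(w_k^t)-v_k^t,w_k^t-w^*\rangle-\frac{\alpha}{2}\|w_k^t-w^*\|^2\\
&\le\frac{\|w_k^t-w^*\|^2-\|w_k^{t+1}-w^*\|^2}{2\eta}+\frac{\eta d}{2}\gamma_k^2+\langle\nabla L(w_k^t)-v_k^t,w_k^t-w^*\rangle-\frac{\alpha}{2}\|w_k^t-w^*\|^2,
\end{aligned} \tag{10}$$

where the last step follows from $\|v_k^t\|^2\le d\gamma_k^2$. By adding all the inequalities of (10) at stage $k$, we have

$$\begin{aligned}
\sum_{t=1}^{T_1}L(w_k^t)-L(w^*)&\le\frac{\|\widehat{w}_k-w^*\|^2}{2\eta}+\frac{d\eta}{2}\gamma_k^2T_1+\sum_{t=1}^{T_1}\langle\nabla L(w_k^t)-v_k^t,w_k^t-w^*\rangle-\frac{\alpha}{2}\sum_{t=1}^{T_1}\|w_k^t-w^*\|^2\\
&\le\frac{\Delta_k^2}{2\eta}+\frac{d\eta}{2}\gamma_k^2T_1+V_k-\frac{\alpha}{2}W_k,
\end{aligned} \tag{11}$$

where $V_k$ and $W_k$ are defined as $V_k=\sum_{t=1}^{T_1}\langle\nabla L(w_k^t)-v_k^t,w_k^t-w^*\rangle$ and $W_k=\sum_{t=1}^{T_1}\|w_k^t-w^*\|^2$, respectively. In order to bound $V_k$, using the fact that $\nabla L(w_k^t)=\mathrm{E}_t[\widehat{g}_k^t]$, we rewrite $V_k$ as

$$V_k=\sum_{t=1}^{T_1}\langle\mathrm{E}_t[v_k^t]-v_k^t,w_k^t-w^*\rangle+\sum_{t=1}^{T_1}\langle\mathrm{E}_t[\widehat{g}_k^t]-\mathrm{E}_t[v_k^t],w_k^t-w^*\rangle=D_k+E_k,$$

where $D_k=\sum_{t=1}^{T_1}\langle\mathrm{E}_t[v_k^t]-v_k^t,w_k^t-w^*\rangle$ and $E_k=\sum_{t=1}^{T_1}\langle\mathrm{E}_t[\widehat{g}_k^t-v_k^t],w_k^t-w^*\rangle$ represent the variance and the bias of the clipped gradient $v_k^t$, respectively. We now turn to upper bounding each term separately.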

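The following lemma bounds the variance term $D_k$ using the Bernstein inequality for martingales. Its proof can be found in Appendix A.

###### Lemma 1.

For any $L>0$ and $\mu>0$, we have

$$\Pr\left(W_k\le\frac{\epsilon_{\mathrm{prior}}T_1}{2\mu\beta}\right)+\Pr\left(D_k\le\frac{1}{L}W_k+\left(L\gamma_k^2d+\gamma_k\Delta_k\sqrt{d}\right)\ln\frac{s}{\delta}\right)\ge 1-\delta$$

where $s$ is given by

$$s=\left\lceil\log_2\frac{8\beta\mu R^2}{\epsilon_{\mathrm{prior}}}\right\rceil.$$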

The following lemma bounds the bias term $E_k$ using the self-bounding property of smooth functions; the proof is deferred to Appendix B.

###### Lemma 2.

$$E_k\le\frac{4T_1}{\xi}\epsilon_{\mathrm{opt}}+\frac{4\beta}{\xi}W_k\le\frac{4T_1}{\xi}\epsilon_{\mathrm{prior}}+\frac{4\beta}{\xi}W_k.$$

Note that without the knowledge of $\epsilon_{\mathrm{prior}}$, the optimal risk $\epsilon_{\mathrm{opt}}$ can only be bounded crudely, resulting in a very loose bound for the bias term $E_k$. It is the knowledge of the target expected risk $\epsilon_{\mathrm{prior}}$ that allows us to come up with a significantly more accurate bound for the bias term $E_k$, which consequently leads to a geometric convergence rate.

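We now proceed to bound $\|\widehat{w}_{k+1}-w^*\|$ using the two bounds in Lemmas 1 and 2. To this end, based on the result obtained in Lemma 1, we consider two scenarios. In the first scenario, we assume

$$W_k\le\frac{\epsilon_{\mathrm{prior}}T_1}{2\mu\beta}. \tag{12}$$

In this case, by the smoothness of $L(w)$, we have

$$\sum_{t=1}^{T_1}L(w_k^t)-L(w^*)\le\frac{\beta}{2}W_k\le\frac{\epsilon_{\mathrm{prior}}}{2\mu}T_1. \tag{13}$$

In the second scenario, we assume

$$D_k\le\frac{1}{L}W_k+\left(L\gamma_k^2d+\gamma_k\Delta_k\sqrt{d}\right)\ln\frac{s}{\delta}. \tag{14}$$

In this case, by combining the bounds for $D_k$ and $E_k$ and setting $L=\frac{\xi}{4\beta}$, we have

$$\begin{aligned}
V_k&\le\frac{8\beta}{\xi}W_k+\left(\frac{\xi d}{4\beta}\gamma_k^2+\gamma_k\Delta_k\sqrt{d}\right)\ln\frac{s}{\delta}+\frac{4T_1}{\xi}\epsilon_{\mathrm{prior}}\\
&=\frac{8\beta}{\xi}W_k+\left(\xi^3\beta d+2\xi\beta\sqrt{d}\right)\Delta_k^2\ln\frac{s}{\delta}+\frac{4T_1}{\xi}\epsilon_{\mathrm{prior}},
\end{aligned}$$

where the last equality follows from the fact that $\gamma_k=2\xi\beta\Delta_k$. If we choose $\xi$ such that $\frac{8\beta}{\xi}\le\frac{\alpha}{2}$, or equivalently $\xi\ge\frac{16\beta}{\alpha}$, holds, we get

$$V_k\le\frac{\alpha}{2}W_k+\left(\xi^3\beta d+2\xi\beta\sqrt{d}\right)\Delta_k^2\ln\frac{s}{\delta}+\frac{4T_1}{\xi}\epsilon_{\mathrm{prior}}.$$

Substituting the above bound for $V_k$ into the inequality (11), we have

$$\sum_{t=1}^{T_1}L(w_k^t)-L(w^*)\le\frac{\Delta_k^2}{2\eta}+\frac{\eta}{2}\gamma_k^2T_1+\left(\xi^3\beta d+2\xi\beta\sqrt{d}\right)\Delta_k^2\ln\frac{s}{\delta}+\frac{4T_1}{\xi}\epsilon_{\mathrm{prior}}.$$

By choosing $\eta$ as $\eta=\frac{1}{2\xi\beta\sqrt{T_1}}$, and using the convexity of $L(w)$ together with the fact that $\widehat{w}_{k+1}$ is the average of the iterates of stage $k$, we have

$$L(\widehat{w}_{k+1})-L(w^*)\le\frac{1}{T_1}\left(2\xi\beta\sqrt{T_1}+\left[\xi^3\beta d+2\xi\beta\sqrt{d}\right]\ln\frac{s}{\delta}\right)\Delta_k^2+\frac{4}{\xi}\epsilon_{\mathrm{prior}}. \tag{15}$$

By combining the bounds in (13) and (15), under the assumption that at least one of the two conditions in (12) and (14) is true, and by setting $\mu=\xi/8$, we have

$$L(\widehat{w}_{k+1})-L(w^*)\le\frac{1}{T_1}\left(2\xi\beta\sqrt{T_1}+\left[\xi^3\beta d+2\xi\beta\sqrt{d}\right]\ln\frac{s}{\delta}\right)\Delta_k^2+\frac{4}{\xi}\epsilon_{\mathrm{prior}},$$

implying, by the strong convexity of $L(w)$,

$$\|\widehat{w}_{k+1}-w^*\|^2\le\frac{2}{\alpha T_1}\left(2\xi\beta\sqrt{T_1}+\left[\xi^3\beta d+2\xi\beta\sqrt{d}\right]\ln\frac{s}{\delta}\right)\Delta_k^2+\frac{8}{\alpha\xi}\epsilon_{\mathrm{prior}}.$$

We complete the proof by using Lemma 1, which states that the probability that at least one of the two conditions holds is no less than $1-\delta$.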

## 6 Conclusions

In this paper, we have studied the sample complexity of passive learning when the target expected risk is given to the learner as prior knowledge. The crucial fact about the target risk assumption is that it can be fully exploited by the learning algorithm. This stands in contrast to most common types of prior knowledge, which usually enter only into the generalization bounds, an approach often perceived as a rather crude way to incorporate such assumptions. We showed that by explicitly employing the target risk in a properly designed stochastic optimization algorithm, it is possible to attain the given target risk with a logarithmic sample complexity $O(\log\frac{1}{\epsilon_{\mathrm{prior}}})$, under the assumption that the loss function is both strongly convex and smooth.

There are various directions for future research. The current study is restricted to the parametric setting, where the hypothesis space is of finite dimension. It would be interesting to see how to achieve a logarithmic sample complexity in a non-parametric setting, where hypotheses lie in a function space of infinite dimension. Evidently, the current algorithm cannot be extended directly to the non-parametric setting; additional analysis tools are therefore needed to address the challenge of infinite dimension. It is also an interesting problem to relate the target risk assumption made here to the low-noise margin condition that is often assumed in active learning for binary classification, since both settings appear to share the same sample complexity. However, it is currently unclear how to derive a connection between these two settings. We believe this issue is worthy of further exploration and leave it as an open problem.

## Appendix A Proof of Lemma 1

The proof is based on the Bernstein inequality for martingales (see, e.g., [6]).

###### Lemma 3.

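(Bernstein inequality for martingales). Let $X_1,\ldots,X_n$ be a bounded martingale difference sequence with respect to the filtration $\mathcal{F}=(\mathcal{F}_i)_{1\le i\le n}$ and with $|X_i|\le M$. Let $S_i=\sum_{j=1}^{i}X_j$ be the associated martingale. Denote the sum of the conditional variances by

$$\Sigma_n^2=\sum_{t=1}^{n}\mathrm{E}\left[X_t^2\,|\,\mathcal{F}_{t-1}\right].$$

Then for all constants $\rho,\nu>0$,

$$\Pr\left[\max_{i=1,\ldots,n}S_i>\rho\ \text{and}\ \Sigma_n^2\le\nu\right]\le\exp\left(-\frac{\rho^2}{2(\nu+M\rho/3)}\right)$$

and therefore,

$$\Pr\left[\max_{i=1,\ldots,n}S_i>\sqrt{2\nu\rho}+\frac{\sqrt{2}}{3}M\rho\ \text{and}\ \Sigma_n^2\le\nu\right]\le e^{-\rho}.$$

###### Proof of Lemma 1.

Define the martingale difference $d_k^t=\langle\mathrm{E}_t[v_k^t]-v_k^t,w_k^t-w^*\rangle$ and the martingale $D_k=\sum_{t=1}^{T_1}d_k^t$. Let $\Sigma_T^2$ denote the conditional variance,

$$\Sigma_T^2=\sum_{t=1}^{T_1}\mathrm{E}_t\left[(d_k^t)^2\right]\le\sum_{t=1}^{T_1}\mathrm{E}_t\left[\|\mathrm{E}_t[v_k^t]-v_k^t\|^2\right]\|w_k^t-w^*\|^2\le\sum_{t=1}^{T_1}d\gamma_k^2\|w_k^t-w^*\|^2=d\gamma_k^2W_k,$$

which follows from the Cauchy-Schwarz inequality and the definition of clipping. Let $M$ be an upper bound on $|d_k^t|$, as required by Lemma 3. To prove the inequality in Lemma 1, we follow the idea of the peeling process [16]. Since $W_k\le 4R^2T_1$, we have

$$\begin{aligned}
&\Pr\left(D_k\ge 2\gamma_k\sqrt{W_kd\rho}+\sqrt{2}M\rho/3\right)\\
&\quad=\Pr\left(D_k\ge 2\gamma_k\sqrt{W_kd\rho}+\sqrt{2}M\rho/3,\ W_k\le 4R^2T_1\right)\\
&\quad=\Pr\left(D_k\ge 2\gamma_k\sqrt{W_kd\rho}+\sqrt{2}M\rho/3,\ \Sigma_T^2\le\gamma_k^2dW_k,\ W_k\le 4R^2T_1\right)\\
&\quad\le\Pr\left(D_k\ge 2\gamma_k\sqrt{W_kd\rho}+\sqrt{2}M\rho/3,\ \Sigma_T^2\le\gamma_k^2dW_k,\ W_k\le\frac{\epsilon_{\mathrm{prior}}T_1}{2\beta\mu}\right)\\
&\qquad+\sum_{i=1}^{s}\Pr\left(D_k\ge 2\gamma_k\sqrt{W_kd\rho}+\sqrt{2}M\rho/3,\ \Sigma_T^2\le\gamma_k^2dW_k,\ \frac{\epsilon_{\mathrm{prior}}2^{i-1}T_1}{2\beta\mu}<W_k\le\frac{\epsilon_{\mathrm{prior}}2^{i}T_1}{2\beta\mu}\right)\\
&\quad\le\Pr\left(W_k\le\frac{\epsilon_{\mathrm{prior}}T_1}{2\beta\mu}\right)+s\,e^{-\rho},
\end{aligned}$$

where $s$ is given by

$$s=\left\lceil\log_2\frac{8\beta\mu R^2}{\epsilon_{\mathrm{prior}}}\right\rceil.$$

The last step follows from the Bernstein inequality for martingales, applied to each of the $s$ terms of the sum. We complete the proof by setting $\rho=\ln\frac{s}{\delta}$ and using the fact that

$$2\gamma_k\sqrt{W_k\rho d}\le\frac{1}{L}W_k+\gamma_k^2\rho dL.$$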

## Appendix B Proof of Lemma 2

To bound $E_k$, we need the following two lemmas. The first lemma bounds the deviation of the expected value of a clipped random variable from that of the original variable in terms of its variance (Lemma A.2 from [13]).

###### Lemma 4.

Let $X$ be a random variable, let $\widetilde{X}=\mathrm{clip}(C,X)$, and assume that $|\mathrm{E}[X]|\le C/2$ for some $C>0$. Then

$$|\mathrm{E}[\widetilde{X}]-\mathrm{E}[X]|\le\frac{2}{C}|\mathrm{Var}[X]|.$$
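The clipping-bias bound can be checked exactly on a small discrete distribution. The sketch below computes the bias $|\mathrm{E}[\mathrm{clip}(C,X)]-\mathrm{E}[X]|$ and the bound $\frac{2}{C}\mathrm{Var}[X]$ for a random variable with finitely many values, under the lemma's condition $|\mathrm{E}[X]|\le C/2$. The two-point distribution and the helper names are assumptions chosen for illustration.

```python
def clip_scalar(c, x):
    """Clip x to the interval [-c, c]."""
    return max(-c, min(c, x))

def clipping_bias(values, probs, c):
    """Return (bias, bound, mean) for a discrete random variable X.

    bias  = |E[clip(c, X)] - E[X]|
    bound = (2 / c) * Var[X]
    """
    mean = sum(p * v for v, p in zip(values, probs))
    var = sum(p * (v - mean) ** 2 for v, p in zip(values, probs))
    clipped_mean = sum(p * clip_scalar(c, v) for v, p in zip(values, probs))
    return abs(clipped_mean - mean), 2.0 * var / c, mean

# X = 0 with probability 0.9 and X = 10 with probability 0.1, clipped at C = 2.
bias, bound, mean = clipping_bias([0.0, 10.0], [0.9, 0.1], c=2.0)
```

In this example the rare large value is responsible for both the bias of the clipped variable and the variance of the original one, which is the intuition behind bounding the former by the latter.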

Another key observation used for bounding $E_k$ is the fact that any non-negative $\beta$-smooth convex function satisfies the following self-bounding property. We note that this self-bounding property has been used in [27] to obtain better (optimistic) rates of convergence for non-negative smooth losses.

###### Lemma 5.

For any -smooth non-negative function , we have

As a simple proof, first from the smoothness assumption, by setting in (1) and rearranging the terms we obtain . On the other hand, from the convexity of loss function we have . Combining these inequalities and considering the fact that the function is non-negative gives the desired inequality.

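###### Proof of Lemma 2.

To apply the above lemmas, we write $E_k=\sum_{t=1}^{T_1}e_k^t$, where

$$e_k^t=\sum_{i=1}^{d}\mathrm{E}_t\left[\ell'(\langle w_k^t,x_k^t\rangle,y_k^t)[x_k^t]_i-\mathrm{clip}\left(\gamma_k,\ell'(\langle w_k^t,x_k^t\rangle,y_k^t)[x_k^t]_i\right)\right][w_k^t-w^*]_i.$$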

In order to apply Lemma 4, we check if the following condition holds

 γk≥2∣∣Et[ℓ′(⟨wtk,xtk⟩