# Online Newton Step Algorithm with Estimated Gradient

Binbin Liu (College of Science, China University of Petroleum, China), Jundong Li (Computer Science and Engineering, Arizona State University, USA), Yunquan Song, Xijun Liang, Ling Jian, Huan Liu
###### Abstract

Online learning with limited information feedback (bandit) addresses the problem where an online learner receives only partial feedback from the environment in the course of learning. Under this setting, Flaxman extended Zinkevich's classical Online Gradient Descent (OGD) algorithm Zinkevich [2003] by proposing the Online Gradient Descent with Expected Gradient (OGDEG) algorithm. Specifically, it uses a simple trick to approximate the gradient of the loss function by evaluating it at a single point and bounds the expected regret as $O(T^{5/6})$ Flaxman et al. [2005], where $T$ is the number of rounds. It has been shown that, compared with first-order algorithms, second-order online learning algorithms such as Online Newton Step (ONS) Hazan et al. [2007] can significantly accelerate the convergence rate in traditional online learning. Motivated by this, this paper aims to exploit second-order information to speed up the convergence of OGDEG. In particular, we extend the ONS algorithm with the trick of expected gradient and develop a novel second-order online learning algorithm, i.e., Online Newton Step with Expected Gradient (ONSEG). Theoretically, we show that the proposed ONSEG algorithm significantly reduces the expected regret of OGDEG from $O(T^{5/6})$ to $O(T^{2/3})$ in the bandit feedback scenario. Empirically, we demonstrate the advantages of the proposed algorithm on several real-world datasets.

###### keywords:
Expected Gradient, Online Learning, Bandit Feedback, Bandit Convex Optimization

## 1 Introduction

Online learning algorithms differ from conventional learning paradigms in that they learn the model incrementally from data in a sequential manner, given (possibly partial) knowledge of the answers pertaining to previously made decisions. They have been shown to be effective in handling large-scale, high-velocity streaming data and have become popular in the big data era Hoi et al. [2018, 2014]. In recent years, a number of effective online learning algorithms have been investigated and applied in a variety of high-impact domains, ranging from game theory and information theory to machine learning and data mining Ding et al. [2017], Shalev-Shwartz [2011], Wang et al. [2003]. Most previously proposed online learning algorithms fall into the well-established framework of online convex optimization Gordon [1999], Zinkevich [2003].

In terms of the optimization algorithms, online learning algorithms can be grouped into the following categories: (i) first-order algorithms, which optimize the objective function using the first-order (sub)gradient, such as the well-known OGD algorithm Zinkevich [2003]; and (ii) second-order algorithms, which exploit second-order information to speed up the convergence of the optimization, such as the ONS algorithm Hazan et al. [2007]. In online convex optimization, previous approaches are mainly based on first-order optimization, i.e., optimization using the first-order derivative of the cost function. The regret bound achieved by these algorithms is polynomial in the number of rounds $T$. For example, Zinkevich Zinkevich [2003] showed that the simple OGD achieves a regret bound of $O(\sqrt{T})$. Later on, Elad Hazan Hazan et al. [2007] introduced ONS by exploiting the second-order derivative of the cost function, which can be viewed as an online analogue of the Newton-Raphson method Ypma and Tjalling [1995] in offline learning. Although the per-iteration time complexity of ONS, $O(d^2)$ (where $d$ denotes the number of features), is higher than the $O(d)$ of OGD, ONS guarantees a logarithmic regret bound under a mild exp-concavity assumption on the cost function.

Additionally, according to the form of feedback information on the prediction, existing online convex optimization algorithms can be broadly classified into two categories Abernethy et al. [2012]: (i) online learning with full information feedback; and (ii) online learning with limited feedback (or bandit feedback) Dani et al. [2007], Hazan and Li [2016], Mcmahan and Blum [2004], Neu and Bartók []. In the former scenario, full information feedback on the prediction is always revealed to the learner at the end of each round; in the latter scenario, the learner only receives partial feedback on the prediction from the environment. In many real-world applications, full information feedback is often difficult to acquire, while the cost of obtaining bandit feedback is often much lower. For example, on many e-commerce websites, users only provide positive feedback (e.g., clicks or purchasing behaviors) Suhara et al. [2013], Zoghi et al. [2017] but do not necessarily disclose full information feedback (e.g., fine-grained preferences). In this regard, online convex optimization with bandit feedback has motivated a surge of research interest in recent years. For example, Flaxman et al. Flaxman et al. [2005] extended the OGD algorithm to the bandit setting, where the learner only knows the value of the cost function at the current prediction point, while the cost function value at other points remains opaque. In particular, it uses a simple approximation of the gradient of the cost function at the current point and bounds the expected regret (against an oblivious adversary) as $O(T^{5/6})$.

As second-order algorithms such as ONS often enjoy lower regret bounds than first-order methods when full information feedback is available, one natural question is whether this success carries over to the bandit feedback scenario. To answer this question, in this paper we make an initial investigation of the ONS algorithm when only partial feedback is available. Our main contribution is the development of a novel second-order online convex optimization algorithm which reduces the regret bound from $O(T^{5/6})$ Flaxman et al. [2005] to $O(T^{2/3})$. Furthermore, if the cost function is $L$-Lipschitz, we can further bound the regret as $O(\sqrt{T\log T})$, which is often desired in practical usage.

The remainder of this paper is organized as follows. In Section 2, we summarize some symbols used throughout this paper and present the bandit convex optimization problem. In Section 3, we introduce the proposed Online Newton Step algorithm with Estimated Gradient. In Section 4, empirical evaluations on benchmark datasets are given to show the superiority of the proposed algorithm. In Section 5, we briefly review related work on bandit convex learning. The conclusion is presented in Section 6.

## 2 Preliminaries

In this section, we first present the notations used throughout the paper and then formally define the problem of bandit convex optimization.

### 2.1 Notation

Table 1 lists the main symbols used in this paper. We use bold lowercase characters to denote vectors (e.g., $\mathbf{a}$), bold uppercase characters to denote matrices (e.g., $\mathbf{A}$), and $\mathbf{A}^\top$ to denote the transpose of $\mathbf{A}$. For a positive definite matrix $\mathbf{A}$, we use $\|\mathbf{x}\|_{\mathbf{A}} = \sqrt{\mathbf{x}^\top\mathbf{A}\mathbf{x}}$ to denote the Mahalanobis norm of a vector $\mathbf{x}$ with respect to $\mathbf{A}$.

$\nabla$ denotes the gradient operator, and $v \sim \mathcal{U}(\mathcal{S})$ denotes a random variable distributed uniformly over the unit sphere $\mathcal{S}$. Under the bandit feedback setting, we use $\mathbb{E}_t[\cdot]$ to denote the conditional expectation given the observations up to time $t$.

### 2.2 Bandit Convex Optimization

Bandit convex optimization (BCO) is performed in a sequence of consecutive rounds. At each round $t$, the online learner picks a data sample $x_t$ from a convex set $\mathcal{P} \subseteq \mathbb{R}^d$. After the data sample is picked and used to make the prediction, a convex cost function $f_t$ is revealed, and the online learner suffers an instantaneous loss $f_t(x_t)$. Under the online convex optimization framework, we assume that the sequence of loss functions $f_1, \ldots, f_T$ is fixed in advance. The goal of the online learner is to choose a sequence of predictions $x_1, \ldots, x_T$ such that the regret, defined as $\sum_{t=1}^{T} f_t(x_t) - \min_{x\in\mathcal{P}}\sum_{t=1}^{T} f_t(x)$, is minimized, where the first term denotes the cumulative error across all rounds and the second term denotes the cumulative error of the optimal fixed decision in hindsight. In the full information feedback setting, the learner has access to the gradient of the loss function at any point in the feasible set $\mathcal{P}$. Conversely, in the BCO setting the only feedback given is $f_t(x_t)$, the value of the loss function at the point the learner chooses.
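As a toy illustration of the regret definition above, the snippet below computes the regret of a made-up prediction sequence against the best fixed point in hindsight, using scalar squared losses on $\mathcal{P} = [-1, 1]$; all numbers are invented for illustration.

```python
import numpy as np

# Losses f_t(x) = (x - c_t)^2 on P = [-1, 1]; targets and predictions are made up.
c = np.array([0.5, -0.2, 0.8, 0.1])        # per-round loss parameters
x_played = np.array([0.0, 0.4, 0.1, 0.3])  # the learner's predictions

cumulative = np.sum((x_played - c) ** 2)
# For squared loss the best fixed point in hindsight is the mean of the c_t
# (clipped to P; here it already lies inside), giving the benchmark term.
x_star = np.clip(c.mean(), -1, 1)
benchmark = np.sum((x_star - c) ** 2)
regret = cumulative - benchmark
print(round(regret, 4))  # cumulative loss minus best-fixed-point loss
```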

In this paper, we assume that the origin is contained in the feasible set $\mathcal{P}$, whose diameter is $D$, and that $\mathcal{P}$ contains the ball of radius $r$ centered at the origin, so $r\mathbb{B} \subseteq \mathcal{P} \subseteq D\mathbb{B}$. We further assume that the loss functions are bounded by $F$ (i.e., $|f_t(x)| \le F$ for all $x \in \mathcal{P}$ and all $t$) and are nice functions in the sense of Hazan et al. [2016a].

## 3 The Proposed Online Newton Step Algorithm with Estimated Gradient

As mentioned previously, this paper focuses on the online learning problem with partial information feedback, i.e., the bandit setting. For example, in the multi-armed bandit (MAB) problem Salehi et al. [2017], Tekin and Liu [2010], there are $d$ different arms, and on each round the learner chooses one of the arms, denoted by a basis unit vector $x_t \in \{e_1, \ldots, e_d\}$ (indicating which arm is pulled). Then the learner receives the cost of choosing this arm, $f_t(x_t)$. The function $f_t$ associates a cost with each arm, but the learner only has access to the cost of the arm she/he pulls.

In this setting, the functions $f_t$ change adversarially over time, we can only evaluate each function once, and we cannot access the gradient of $f_t$ directly for gradient descent. To tackle this issue, Flaxman et al. Flaxman et al. [2005] proposed to use a one-point estimate of the gradient. Specifically, for a uniformly random unit vector $v$ and a small $\delta > 0$, the vector $\frac{d}{\delta}f(x+\delta v)v$ is an estimate of the gradient with low bias, and thus an approximation of the gradient. In fact, it is an unbiased estimator of the gradient of a smoothed version of $f$, which can be mathematically formulated as $\hat f(x) = \mathbb{E}_{u\sim\mathcal{U}(\mathbb{B})}[f(x+\delta u)]$, where $\mathbb{B}$ denotes the unit ball. As we show below, the advantage of $\hat f$ is that it is differentiable and we can estimate its gradient with a single function evaluation.

###### Lemma 1

Fix $\delta > 0$. For a vector $v$ selected uniformly at random from the unit sphere $\mathcal{S}$, we have:

$$\mathbb{E}_{v\sim\mathcal{U}(\mathcal{S})}\Big[\frac{d}{\delta}f(x+\delta v)\,v\Big]=\nabla\hat f(x). \tag{1}$$

Proof: By definition,

$$\hat f(x)=\mathbb{E}_{u\sim\mathcal{U}(\mathbb{B})}\big[f(x+\delta u)\big]=\frac{\int_{\delta\mathbb{B}}f(x+u)\,du}{\delta^{d}\cdot\mathrm{vol}_{d}(\mathbb{B})}, \tag{2}$$

and similarly,

$$\mathbb{E}\big[f(x+\delta v)\,v\big]=\frac{\int_{\delta\mathcal{S}}f(x+v)\cdot\frac{v}{\|v\|}\,dv}{\delta^{d-1}\cdot\mathrm{vol}_{d-1}(\mathcal{S})}. \tag{3}$$

According to Stokes' theorem Flaxman et al. [2005], we have:

$$\nabla\int_{\delta\mathbb{B}}f(x+u)\,du=\int_{\delta\mathcal{S}}f(x+v)\,\frac{v}{\|v\|}\,dv. \tag{4}$$

The above identity, along with the fact that the ratio of the volume of a $d$-dimensional ball of radius $\delta$ to its surface area is $\frac{\delta}{d}$, completes the proof.
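The identity in Lemma 1 can be sanity-checked numerically. The sketch below uses an illustrative quadratic loss (not from the paper) whose smoothed gradient is known in closed form; $f(x)$ is subtracted as a zero-mean control variate (since $\mathbb{E}[v] = 0$) purely to tame the sampling variance of the one-point estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d, delta = 5, 0.01

# Illustrative loss f(x) = ||x||^2, whose true gradient is 2x; for this
# quadratic the smoothed gradient equals the true gradient exactly.
x = rng.standard_normal(d)

n = 200_000
v = rng.standard_normal((n, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)   # uniform on the unit sphere
# One-point estimates (d/delta) * f(x + delta*v) * v, with f(x)*v subtracted
# as a control variate (it has zero mean, so the expectation is unchanged).
fx = np.sum((x + delta * v) ** 2, axis=1) - x @ x
g_hat = (d / delta) * (fx[:, None] * v).mean(axis=0)

print(np.max(np.abs(g_hat - 2 * x)))  # small Monte Carlo error
```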

Based on the merit of the above lemma, we extend ONS to the setting of limited feedback and propose a novel second-order online learning method, the Online Newton Step with Estimated Gradient (ONSEG) algorithm. The proposed ONSEG algorithm is summarized as follows:
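To make the procedure concrete, here is a minimal sketch of the main loop in Python, assuming a Euclidean-ball feasible set so the projection step stays simple. The parameter values, the toy quadratic losses, and the plain rescaling used in place of the exact generalized projection are all illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

def onseg(loss_fns, d, D=1.0, delta=0.1, gamma=0.1, beta=1.0, eps=1.0, seed=0):
    """Sketch of ONSEG on the ball of radius D. Illustrative simplification:
    projecting onto a centered ball is done by plain rescaling (exact only in
    the Euclidean norm, not the A-norm generalized projection)."""
    rng = np.random.default_rng(seed)
    y = np.zeros(d)                        # prediction center
    A_inv = (1.0 / eps) * np.eye(d)        # A_t^{-1}, kept via rank-one updates
    total_loss = 0.0
    for f in loss_fns:
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        x = y + delta * v                  # point actually played
        cost = f(x)                        # the only feedback (bandit setting)
        total_loss += cost
        g = (d / delta) * cost * v         # one-point gradient estimate
        Ag = A_inv @ g                     # Sherman-Morrison rank-one update
        A_inv -= np.outer(Ag, Ag) / (1.0 + g @ Ag)
        z = y - (1.0 / beta) * (A_inv @ g) # Newton-style step
        radius = (1.0 - gamma) * D         # keep y + delta*v inside the set
        n = np.linalg.norm(z)
        y = z if n <= radius else z * (radius / n)
    return total_loss

# Toy run with fixed quadratic losses f_t(x) = ||x - c||^2 (made up).
c = np.array([0.3, -0.2, 0.1])
losses = [lambda x: float((x - c) @ (x - c))] * 500
print(onseg(losses, d=3))
```

The rank-one inverse update here mirrors the Sherman-Morrison trick mentioned in Section 4, keeping each round at $O(d^2)$ cost.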

We begin with some key observations and give the regret analysis of the proposed Online Newton Step Algorithm with Estimated Gradient (ONSEG).

###### Observation 1

The optimum over $(1-\gamma)\mathcal{P}$ is near the optimum over $\mathcal{P}$:

$$\min_{x\in(1-\gamma)\mathcal{P}}\sum_{t=1}^{T}f_t(x)\le 2\gamma FT+\min_{x\in\mathcal{P}}\sum_{t=1}^{T}f_t(x).$$

Proof: It is clear that:

$$\min_{x\in(1-\gamma)\mathcal{P}}\sum_{t=1}^{T}f_t(x)=\min_{x\in\mathcal{P}}\sum_{t=1}^{T}f_t((1-\gamma)x).$$

Since each $f_t$ is convex and $|f_t(x)|\le F$ holds for all $x\in\mathcal{P}$ and all $t$, we have:

$$\begin{aligned}
\min_{x\in\mathcal{P}}\sum_{t=1}^{T}f_t((1-\gamma)x)&\le\min_{x\in\mathcal{P}}\sum_{t=1}^{T}\big[\gamma f_t(0)+(1-\gamma)f_t(x)\big]\\
&=\min_{x\in\mathcal{P}}\sum_{t=1}^{T}\big[\gamma\big(f_t(0)-f_t(x)\big)+f_t(x)\big]\\
&\le\min_{x\in\mathcal{P}}\sum_{t=1}^{T}\big[2\gamma F+f_t(x)\big].
\end{aligned}$$
###### Observation 2

For any point $x\in(1-\gamma)\mathcal{P}$, the ball of radius $\gamma r$ centered at $x$ is contained in $\mathcal{P}$.

Since $r\mathbb{B}\subseteq\mathcal{P}$ and $\mathcal{P}$ is convex, writing $x=(1-\gamma)y$ for some $y\in\mathcal{P}$, we have $x+\gamma r\mathbb{B}=(1-\gamma)y+\gamma(r\mathbb{B})\subseteq(1-\gamma)\mathcal{P}+\gamma\mathcal{P}\subseteq\mathcal{P}$. The next observation establishes a bound on how fast the function value can change within $(1-\gamma)\mathcal{P}$, which is an implicit Lipschitz condition.

###### Observation 3

For any $x,y\in(1-\gamma)\mathcal{P}$ and any $t$, we have:

$$|f_t(x)-f_t(y)|\le\frac{2F}{\gamma r}\|x-y\|.$$

Proof: Let $\Delta=x-y$. If $\|\Delta\|\ge\gamma r$, the observation follows directly from $|f_t|\le F$. Otherwise, let $z=y+\gamma r\frac{\Delta}{\|\Delta\|}$, which is the point at distance $\gamma r$ from $y$ in the direction of $\Delta$. By the previous observation, we know $z\in\mathcal{P}$. Also, $x=\frac{\|\Delta\|}{\gamma r}z+\big(1-\frac{\|\Delta\|}{\gamma r}\big)y$, so by convexity we have:

$$\begin{aligned}
f_t(x)&\le\frac{\|\Delta\|}{\gamma r}f_t(z)+\Big(1-\frac{\|\Delta\|}{\gamma r}\Big)f_t(y)\\
&=f_t(y)+\frac{f_t(z)-f_t(y)}{\gamma r}\|\Delta\|\\
&\le f_t(y)+\frac{2F}{\gamma r}\|\Delta\|.
\end{aligned}$$
###### Theorem 1

Assume that for all $t$ the smoothed version $\hat f_t$ of the loss function $f_t$ is $\alpha$-exp-concave and that $\|\nabla\hat f_t(x)\|\le G$ for all $x$ in $(1-\gamma)\mathcal{P}$, with the notation $\nabla_t=\nabla\hat f_t(y_t)$. For suitable choices of $\delta$ and $\gamma$ (given in the proof), ONSEG gives the following guarantee on the expected regret bound:

$$\mathbb{E}\Big[\sum_{t=1}^{T}f_t(x_t)\Big]-\min_{x\in\mathcal{P}}\sum_{t=1}^{T}f_t(x)\le 6FT^{\frac{2}{3}}\sqrt[3]{\frac{15d^2D\log T}{r}}+\frac{5d}{\alpha}\log T.$$

Proof: To improve readability, we list Lemmas 2-4 after this proof. For any $\delta>0$, Observation 2 shows that with a proper setting of the parameters $\gamma$ and $\delta$, the picked points $x_t=y_t+\delta v_t$ lie in $\mathcal{P}$. Suppose we run the ONS algorithm on the functions $\hat f_t$ (i.e., the smoothed versions of $f_t$) over the feasible set $(1-\gamma)\mathcal{P}$. Let $g_t=\frac{d}{\delta}f_t(y_t+\delta v_t)v_t$; then Lemma 1 shows that $\mathbb{E}[g_t]=\nabla\hat f_t(y_t)$. For convenience, we denote $\nabla\hat f_t(y_t)$ as $\nabla_t$ in the following context. We first show that the corresponding expected regret is upper bounded by:

$$\mathbb{E}\Big[\sum_{t=1}^{T}\hat f_t(y_t)\Big]-\min_{y\in(1-\gamma)\mathcal{P}}\sum_{t=1}^{T}\hat f_t(y)\le\frac{10d^2FD\log T}{\delta}+\frac{5d}{\alpha}\log T.$$

Let $y^\star\in\arg\min_{y\in(1-\gamma)\mathcal{P}}\sum_{t=1}^{T}\hat f_t(y)$ be the best decision chosen with the benefit of hindsight. By Lemma 2, we have:

$$\hat f_t(y_t)-\hat f_t(y^\star)\le R_t\triangleq\nabla_t^\top(y_t-y^\star)-\frac{\beta}{2}(y^\star-y_t)^\top\nabla_t\nabla_t^\top(y^\star-y_t)$$

for $\beta=\frac{1}{2}\min\{\frac{1}{4GD},\alpha\}$. For convenience, we define $z_{t+1}=y_t-\frac{1}{\beta}A_t^{-1}\nabla_t$ according to the update rule of ONSEG, so that $y_{t+1}$ is the generalized projection of $z_{t+1}$ onto $(1-\gamma)\mathcal{P}$. In this way, by the definition of $z_{t+1}$, we have:

$$z_{t+1}-y^\star=y_t-y^\star-\frac{1}{\beta}A_t^{-1}\nabla_t, \tag{6}$$
$$A_t(z_{t+1}-y^\star)=A_t(y_t-y^\star)-\frac{1}{\beta}\nabla_t. \tag{7}$$

Multiplying the transpose of Eq.(6) by Eq.(7) we get:

$$(z_{t+1}-y^\star)^\top A_t(z_{t+1}-y^\star)=(y_t-y^\star)^\top A_t(y_t-y^\star)-\frac{2}{\beta}\nabla_t^\top(y_t-y^\star)+\frac{1}{\beta^2}\nabla_t^\top A_t^{-1}\nabla_t.$$

Since $y_{t+1}$ is the projection of $z_{t+1}$ in the norm induced by $A_t$, it is a well-known fact that (see Lemma 3):

$$(z_{t+1}-y^\star)^\top A_t(z_{t+1}-y^\star)\ge(y_{t+1}-y^\star)^\top A_t(y_{t+1}-y^\star).$$

This inequality justifies the use of generalized projections as opposed to standard projections. This fact, together with the equality above, gives:

$$\nabla_t^\top(y_t-y^\star)\le\frac{\beta}{2}(y_t-y^\star)^\top A_t(y_t-y^\star)+\frac{1}{2\beta}\nabla_t^\top A_t^{-1}\nabla_t-\frac{\beta}{2}(y_{t+1}-y^\star)^\top A_t(y_{t+1}-y^\star).$$

Now, summing up over $t$ from $1$ to $T$, we get:

$$\begin{aligned}
\sum_{t=1}^{T}\nabla_t^\top(y_t-y^\star)&\le\frac{1}{2\beta}\sum_{t=1}^{T}\nabla_t^\top A_t^{-1}\nabla_t+\frac{\beta}{2}(y_1-y^\star)^\top A_1(y_1-y^\star)\\
&\quad+\frac{\beta}{2}\sum_{t=2}^{T}(y_t-y^\star)^\top(A_t-A_{t-1})(y_t-y^\star)-\frac{\beta}{2}(y_{T+1}-y^\star)^\top A_T(y_{T+1}-y^\star)\\
&\le\frac{1}{2\beta}\sum_{t=1}^{T}\nabla_t^\top A_t^{-1}\nabla_t+\frac{\beta}{2}\sum_{t=1}^{T}(y_t-y^\star)^\top\nabla_t\nabla_t^\top(y_t-y^\star)\\
&\quad+\frac{\beta}{2}(y_1-y^\star)^\top(A_1-\nabla_1\nabla_1^\top)(y_1-y^\star).
\end{aligned}$$

In the last inequality we use the fact that $A_t-A_{t-1}=\nabla_t\nabla_t^\top$. By moving the term $\frac{\beta}{2}\sum_{t=1}^{T}(y_t-y^\star)^\top\nabla_t\nabla_t^\top(y_t-y^\star)$ to the LHS, we get the expression for $\sum_{t=1}^{T}R_t$.

Using the facts that $A_1-\nabla_1\nabla_1^\top=\varepsilon I_d$ and $\|y_1-y^\star\|^2\le D^2$, together with the choice $\varepsilon=\frac{1}{\beta^2D^2}$, we get:

$$\sum_{t=1}^{T}\big(\hat f_t(y_t)-\hat f_t(y^\star)\big)\le\sum_{t=1}^{T}R_t\le\frac{1}{2\beta}\sum_{t=1}^{T}\nabla_t^\top A_t^{-1}\nabla_t+\frac{\varepsilon D^2\beta}{2}\le\frac{1}{2\beta}\sum_{t=1}^{T}\|\nabla_t\|_{A_t^{-1}}^2+\frac{1}{2\beta}.$$

Taking the expectation, and using the fact that $\mathbb{E}[g_t]=\nabla_t$, we obtain:

$$\mathbb{E}\Big[\sum_{t=1}^{T}\big(\hat f_t(y_t)-\hat f_t(y^\star)\big)\Big]\le\frac{1}{2\beta}\sum_{t=1}^{T}\big\|\mathbb{E}[g_t]\big\|_{A_t^{-1}}^{2}+\frac{1}{2\beta}. \tag{8}$$

According to Lemma 4, Eq.(8) can be bounded by $\frac{1}{2\beta}\big[d\log\big(\frac{d^2F^2T}{\delta^2\varepsilon}+1\big)+1\big]$, via the settings $u_t=\mathbb{E}[g_t]$, $V_t=A_t$, and the bound $\|g_t\|\le\frac{dF}{\delta}$ (since $|f_t|\le F$). Now since $G\le\frac{dF}{\delta}$, $\beta=\frac{1}{2}\min\{\frac{1}{4GD},\alpha\}$, and $\varepsilon=\frac{1}{\beta^2D^2}$, we have $\frac{1}{2\beta}\le\frac{4dFD}{\delta}+\frac{1}{\alpha}$ and $\log\big(\frac{d^2F^2T}{\delta^2\varepsilon}+1\big)=O(\log T)$. Then we get:

$$\begin{aligned}
\mathbb{E}\Big[\sum_{t=1}^{T}\hat f_t(y_t)\Big]-\min_{y\in(1-\gamma)\mathcal{P}}\sum_{t=1}^{T}\hat f_t(y)&\le\frac{1}{2\beta}\Big[d\log\Big(\frac{d^2F^2T}{\delta^2\varepsilon}+1\Big)+1\Big]\\
&\le 4\Big(\frac{2dFD}{\delta}+\frac{1}{\alpha}\Big)\big[d\log T+1\big]\\
&\le\frac{10d^2FD}{\delta}\log T+\frac{5d}{\alpha}\log T.
\end{aligned}$$

Let $L=\frac{2F}{\gamma r}$ be the implicit Lipschitz constant given in Observation 3. For any $y\in(1-\gamma)\mathcal{P}$, as $\hat f_t(y)$ is an average of $f_t$ over inputs within distance $\delta$ of $y$, Observation 3 shows that $|\hat f_t(y)-f_t(y)|\le L\delta$. According to Observation 1, we have $\min_{y\in(1-\gamma)\mathcal{P}}\sum_{t=1}^{T}f_t(y)\le 2\gamma FT+\min_{y\in\mathcal{P}}\sum_{t=1}^{T}f_t(y)$. With the above observations, we can obtain the expected regret upper bound of ONSEG as:

$$\begin{aligned}
\mathbb{E}\Big[\sum_{t=1}^{T}f_t(y_t+\delta v_t)\Big]-\min_{y\in\mathcal{P}}\sum_{t=1}^{T}f_t(y)&\le\mathbb{E}\Big[\sum_{t=1}^{T}f_t(y_t+\delta v_t)\Big]-\min_{y\in(1-\gamma)\mathcal{P}}\sum_{t=1}^{T}f_t(y)+2\gamma FT\\
&\le\mathbb{E}\Big[\sum_{t=1}^{T}\hat f_t(y_t)\Big]-\min_{y\in(1-\gamma)\mathcal{P}}\sum_{t=1}^{T}\hat f_t(y)+3L\delta T+2\gamma FT\\
&\le\frac{10d^2FD}{\delta}\log T+3L\delta T+2\gamma FT+\frac{5d}{\alpha}\log T.
\end{aligned} \tag{9}$$

By plugging in $L=\frac{2F}{\gamma r}$, we get an expression of the form $\frac{a\log T}{\delta}+\frac{b\delta}{\gamma}+c\gamma+\frac{5d}{\alpha}\log T$, where $a=10d^2FD$, $b=\frac{6FT}{r}$, and $c=2FT$. Optimizing the choices of $\delta$ and $\gamma$ gives a value of $3\sqrt[3]{abc\log T}=6FT^{\frac{2}{3}}\sqrt[3]{\frac{15d^2D\log T}{r}}$ for the first three terms. This gives the stated expected regret bound.

###### Lemma 2

For a function $\hat f:\mathcal{P}\to\mathbb{R}$, where $\mathcal{P}$ has diameter $D$, such that $\|\nabla\hat f(x)\|\le G$ for all $x\in\mathcal{P}$ and $e^{-\alpha\hat f(x)}$ is concave, the following holds for $\beta\le\frac{1}{2}\min\{\frac{1}{4GD},\alpha\}$:

$$\hat f(x)\ge\hat f(y)+\nabla\hat f(y)^\top(x-y)+\frac{\beta}{2}(x-y)^\top\nabla\hat f(y)\nabla\hat f(y)^\top(x-y).$$

Since $e^{-\alpha\hat f(x)}$ is concave and $2\beta\le\alpha$, the function $h(x)=e^{-2\beta\hat f(x)}$ is also concave. Then, by the concavity of $h(x)$, we have:

$$h(x)\le h(y)+\nabla h(y)^\top(x-y).$$

Plugging in $h(x)=e^{-2\beta\hat f(x)}$ and $\nabla h(y)=-2\beta e^{-2\beta\hat f(y)}\nabla\hat f(y)$ gives:

$$\hat f(x)\ge\hat f(y)-\frac{1}{2\beta}\log\big[1-2\beta\nabla\hat f(y)^\top(x-y)\big].$$

Next, note that $|2\beta\nabla\hat f(y)^\top(x-y)|\le 2\beta GD\le\frac{1}{4}$, and that for $|z|\le\frac{1}{4}$, $-\log(1-z)\ge z+\frac{z^2}{4}$. Applying this inequality with $z=2\beta\nabla\hat f(y)^\top(x-y)$ completes the proof of the lemma.

###### Lemma 3 (Folklore)

Let $\mathcal{K}\subseteq\mathbb{R}^d$ be a convex set, and let $z$ be the generalized projection of $y$ onto $\mathcal{K}$ according to the positive semidefinite matrix $A$, i.e., $z=\arg\min_{x\in\mathcal{K}}(x-y)^\top A(x-y)$. Then, for any point $a\in\mathcal{K}$, it holds that Hazan et al. [2007]:

$$(y-a)^\top A(y-a)\ge(z-a)^\top A(z-a).$$
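Lemma 3 can be checked numerically in a setting where the generalized projection has a closed form, e.g., a halfspace constraint; the matrix, halfspace, and sample points below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5
G = rng.standard_normal((d, d))
A = np.eye(d) + G @ G.T               # positive definite
c = rng.standard_normal(d)
y = rng.standard_normal(d)
b = float(c @ y) - 2.0                # choose b so that y lies outside K = {x : c.x <= b}

# Generalized projection of y onto K in the A-norm: move along A^{-1} c
# until the constraint becomes tight (closed form for a single halfspace).
A_inv_c = np.linalg.solve(A, c)
lam = (c @ y - b) / (c @ A_inv_c)
z = y - lam * A_inv_c                 # argmin over K of (x - y)^T A (x - y)

ok = True
for _ in range(1000):
    a = rng.standard_normal(d) * 2
    a -= max(0.0, float(c @ a - b)) * c / (c @ c)   # push a into K if needed
    dy, dz = y - a, z - a
    ok &= (dy @ A @ dy) >= (dz @ A @ dz) - 1e-9
print(bool(ok))  # Lemma 3 predicts True for every a in K
```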
###### Lemma 4

Let $u_t\in\mathbb{R}^d$, for $t=1,\ldots,T$, be a sequence of vectors such that $\|u_t\|\le r$ for some $r>0$. Define $V_t=\varepsilon I_d+\sum_{\tau=1}^{t}u_\tau u_\tau^\top$. Then we have:

$$\sum_{t=1}^{T}u_t^\top V_t^{-1}u_t\le d\log\Big(\frac{r^2T}{\varepsilon}+1\Big).$$

For real numbers $a\ge b>0$, the inequality $1-x\le-\log x$ for $x\in(0,1]$ implies that $\frac{a-b}{a}\le\log\frac{a}{b}$ (taking $x=\frac{b}{a}$). A similar fact holds for positive definite matrices. Define the inner product of two matrices as $A\bullet B=\sum_{i,j}A_{ij}B_{ij}$. Then, for matrices $A\succeq B\succ 0$, we have $A^{-1}\bullet(A-B)\le\log\frac{|A|}{|B|}$, where $|A|$ is the determinant of $A$. Using the above facts, we have (for convenience, let $V_0=\varepsilon I_d$):

$$\sum_{t=1}^{T}u_t^\top V_t^{-1}u_t=\sum_{t=1}^{T}V_t^{-1}\bullet u_tu_t^\top=\sum_{t=1}^{T}V_t^{-1}\bullet(V_t-V_{t-1})\le\sum_{t=1}^{T}\log\frac{|V_t|}{|V_{t-1}|}=\log\frac{|V_T|}{|V_0|}.$$

Since $V_T=\varepsilon I_d+\sum_{t=1}^{T}u_tu_t^\top$ and $\|u_t\|\le r$, the largest eigenvalue of $V_T$ is at most $\varepsilon+r^2T$. Hence, the determinant of $V_T$ can be bounded by $|V_T|\le(\varepsilon+r^2T)^d$, which together with $|V_0|=\varepsilon^d$ gives $\log\frac{|V_T|}{|V_0|}\le d\log\big(\frac{r^2T}{\varepsilon}+1\big)$.
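A quick numerical sanity check of Lemma 4, with made-up vectors satisfying $\|u_t\|\le r$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, r, eps = 4, 300, 2.0, 0.5

# Random vectors with norms at most r (directions and lengths are arbitrary).
u = rng.standard_normal((T, d))
u *= (r * rng.random(T)[:, None]) / np.linalg.norm(u, axis=1, keepdims=True)

V = eps * np.eye(d)
lhs = 0.0
for ut in u:
    V += np.outer(ut, ut)               # V_t = eps*I + sum of u_s u_s^T
    lhs += ut @ np.linalg.solve(V, ut)  # accumulate u_t^T V_t^{-1} u_t
rhs = d * np.log(r * r * T / eps + 1)
print(lhs <= rhs)  # the lemma guarantees True
```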

Under the Lipschitzness of $f_t$, it is clear that the difference between $f_t$ and its smoothed version $\hat f_t$ is no more than $L\delta$. We can then further tighten the expected regret upper bound of the proposed ONSEG algorithm to $O(\sqrt{T\log T})$.

###### Theorem 2

In the proposed Online Newton Step Algorithm with Estimated Gradient, if each function $f_t$ is $L$-Lipschitz, then for suitably chosen $\delta$ and $\gamma$, we have:

$$\mathbb{E}\Big[\sum_{t=1}^{T}f_t(x_t)\Big]-\min_{x\in\mathcal{P}}\sum_{t=1}^{T}f_t(x)\le 2dT^{\frac{1}{2}}\sqrt{\frac{30FD(Lr+F)\log T}{r}}+\frac{5d}{\alpha}\log T.$$

The proof is similar to that of Theorem 1. We now have an explicit Lipschitz constant $L$, so we can use it directly in Eq.(9). Plugging in the chosen values of $\delta$ and $\gamma$ gives the theorem.

#### Reshaping

The above regret bounds in Theorem 1 and Theorem 2 depend on the ratio $\frac{D}{r}$, which can be very large. To remove this dependence, we reshape the feasible set $\mathcal{P}$ to make it more "round". Any convex set $\mathcal{P}$ with $\mathbb{B}\subseteq\mathcal{P}\subseteq D\mathbb{B}$, where $\mathbb{B}$ is the unit ball centered at the origin in $d$ dimensions, can be put in isotropic position Milman and Pajor [1989] by applying an affine transformation $T$. A body in isotropic position has several nice properties, including $\mathbb{B}\subseteq T(\mathcal{P})\subseteq d\mathbb{B}$. So, in practice, it is preferable to apply a preprocessing step that finds a transformation $T$ which puts $\mathcal{P}$ in isotropic position. This gives us the new parameters $r=1$ and $D=d$. The following observation shows that we can construct a Lipschitz function on $T(\mathcal{P})$ with Lipschitz constant $LD$ under the assumption that $f$ is $L$-Lipschitz.

###### Observation 4

Let $g(u)=f(T^{-1}(u))$. Then $g$ is $LD$-Lipschitz.

Let $u_1,u_2\in T(\mathcal{P})$ and $x_i=T^{-1}(u_i)$, $i=1,2$. Observe that $|g(u_1)-g(u_2)|=|f(x_1)-f(x_2)|\le L\|x_1-x_2\|$. To prove that $g$ satisfies the $LD$-Lipschitz condition, it suffices to show that $\|x_1-x_2\|\le D\|u_1-u_2\|$. Suppose this does not hold, i.e., $\|x_1-x_2\|>D\|u_1-u_2\|$. Let $y_1=T^{-1}\big(\frac{u_1-u_2}{\|u_1-u_2\|}\big)$ and $y_2=T^{-1}\big(\frac{u_2-u_1}{\|u_1-u_2\|}\big)$. It is clear that $\big\|\frac{u_1-u_2}{\|u_1-u_2\|}\big\|=1$, and since $T(\mathcal{P})$ contains the ball of radius 1, both $\frac{u_1-u_2}{\|u_1-u_2\|}$ and $\frac{u_2-u_1}{\|u_1-u_2\|}$ are in $T(\mathcal{P})$. Thus, $y_1$ and $y_2$ are in $\mathcal{P}$. Then, since $T$ is affine,

$$\begin{aligned}
\|y_1-y_2\|&=\Big\|T^{-1}\Big(\frac{u_1-u_2}{\|u_1-u_2\|}\Big)-T^{-1}\Big(\frac{u_2-u_1}{\|u_1-u_2\|}\Big)\Big\|\\
&=\frac{1}{\|u_1-u_2\|}\big\|T^{-1}(u_1-u_2)-T^{-1}(u_2-u_1)\big\|\\
&=\frac{2}{\|u_1-u_2\|}\big\|T^{-1}(u_1)-T^{-1}(u_2)\big\|\\
&=\frac{2}{\|u_1-u_2\|}\|x_1-x_2\|>2D,
\end{aligned}$$

where the last line uses the assumption $\|x_1-x_2\|>D\|u_1-u_2\|$. This inequality contradicts the fact that $\mathcal{P}$ is contained in a sphere of radius $D$.

There exist other algorithms, such as MCMC-based algorithms, which attempt to put an arbitrary convex set into isotropic position Kannan [1997], Lovasz and Vempala [2003]. Among them, the algorithm proposed by Lovasz and Vempala is an efficient one (running in time poly-$d$) Lovasz and Vempala [2003]. This algorithm puts the feasible set into nearly isotropic position, meaning the containment properties above hold up to a constant factor. This gives rise to the following corollary:
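The preprocessing idea can be illustrated with a simple covariance-whitening sketch. This is plain Monte Carlo whitening, not the Lovász-Vempala algorithm, and its normalization (zero mean, identity covariance) differs from the ball-containment convention used above; the ellipsoidal feasible set is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative "elongated" feasible set: an axis-aligned ellipsoid,
# sampled uniformly by rejection from its bounding box.
semi_axes = np.array([5.0, 1.0, 0.5])

def sample_P(n):
    pts = []
    while len(pts) < n:
        x = rng.uniform(-semi_axes, semi_axes, size=3)
        if np.sum((x / semi_axes) ** 2) <= 1.0:
            pts.append(x)
    return np.array(pts)

X = sample_P(20_000)
mu = X.mean(axis=0)
cov = np.cov(X.T)
# Affine map T(x) = cov^{-1/2} (x - mu) whitens the sampled body.
evals, evecs = np.linalg.eigh(cov)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T
Y = (X - mu) @ W.T
print(np.round(np.cov(Y.T), 1))  # approximately the identity matrix
```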

###### Corollary 1

For a set $\mathcal{P}$, after putting $\mathcal{P}$ into near-isotropic position (so that $r=1$ and $D=d$), the ONSEG algorithm has an expected regret bound of

$$6FT^{\frac{2}{3}}\sqrt[3]{15d^3\log T}+\frac{5d}{\alpha}\log T.$$

And under the assumption that each $f_t$ is $L$-Lipschitz, we can further tighten the regret bound to

$$2dT^{\frac{1}{2}}\sqrt{30Fd(L+F)\log T}+\frac{5d}{\alpha}\log T.$$

Proof: Plugging the parameters $r=1$ and $D=d$ into Theorem 1 and Theorem 2, respectively, gives the stated regret bounds.

## 4 Empirical Evaluations

Empirical evaluations of the proposed method were performed on two regression datasets and two classification datasets, i.e., abalone, kin, ionosphere, and cancer. All datasets are from the libSVM repository (see Table 1 for details of each dataset). We compare the proposed second-order bandit learning algorithm ONSEG with the first-order bandit learning algorithm OGDEG Flaxman et al. [2005]. In addition, OGD and ONS serve as baseline methods with full information feedback. To improve computational efficiency, ONSEG and ONS employ the Sherman-Morrison-Woodbury formula Brookes [2011]:

$$A_t^{-1}=\big(A_{t-1}+g_tg_t^\top\big)^{-1}=A_{t-1}^{-1}-\frac{A_{t-1}^{-1}g_tg_t^\top A_{t-1}^{-1}}{1+g_t^\top A_{t-1}^{-1}g_t}.$$
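As a sanity check, the rank-one inverse update can be compared against a direct matrix inversion; the dimensions and values below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
G = rng.standard_normal((10, d))
A_prev = np.eye(d) + G.T @ G          # positive definite, as maintained by ONS
g = rng.standard_normal(d)

A_prev_inv = np.linalg.inv(A_prev)
Ag = A_prev_inv @ g
# O(d^2) Sherman-Morrison update of the inverse after the rank-one change.
sm_inv = A_prev_inv - np.outer(Ag, Ag) / (1.0 + g @ Ag)
# O(d^3) recomputation from scratch for comparison.
direct = np.linalg.inv(A_prev + np.outer(g, g))
print(np.allclose(sm_inv, direct))  # True
```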