Weighted Last-Step Min-Max Algorithm with Improved Sub-Logarithmic Regret

Edward Moroshko Koby Crammer Department of Electrical Engineering, Technion, Israel
Abstract

In online learning the performance of an algorithm is typically compared to the performance of a fixed function from some class, with a quantity called regret. Forster Forster () proposed a last-step min-max algorithm which was somewhat simpler than the algorithm of Vovk vovkAS (), yet with the same regret. In fact the algorithm he analyzed assumed that the choices of the adversary are bounded, yielding artificially only the two extreme cases. We fix this problem by weighting the examples in such a way that the min-max problem is well defined, and provide an analysis with logarithmic regret that may have a better multiplicative factor than both bounds of Forster Forster () and Vovk vovkAS (). We also derive a new bound that may be sub-logarithmic, as a recent bound of Orabona et al. OrabonaCBG12 (), but may have a better multiplicative factor. Finally, we analyze the algorithm in a weak type of non-stationary setting, and show a bound that is sublinear if the non-stationarity is sublinear as well.

keywords:
Online learning, Regression, Min-max learning
journal: Theoretical Computer Science

1 Introduction

We consider the online learning regression problem, in which a learning algorithm tries to predict real numbers $y_t$ in a sequence of rounds given some side-information or inputs $x_t \in \mathbb{R}^d$. Real-world example applications for these algorithms are weather or stock-market predictions. The goal of the algorithm is to have a small discrepancy between its predictions $\hat{y}_t$ and the associated outcomes $y_t$. This discrepancy is measured with a loss function, such as the square loss. It is common to evaluate algorithms by their regret, the difference between the cumulative loss of an algorithm and the cumulative loss of any function taken from some class.

Forster Forster () proposed a last-step min-max algorithm for online regression that makes a prediction assuming it is the last example to be observed, and the goal of the algorithm is indeed to minimize the regret with respect to linear functions. The resulting optimization problem he obtained was convex in both the choice of the algorithm and the choice of the adversary, yielding an unbounded optimization problem. Forster circumvented this problem by assuming a bound $Y$ over the choices of the adversary that should be known to the algorithm, yet his analysis is for the version with no bound.

We propose a modified last-step min-max algorithm with weights over examples, which are controlled so as to obtain a problem that is concave over the choices of the adversary and convex over the choices of the algorithm. We analyze our algorithm and show a logarithmic regret that may have a better multiplicative factor than the analysis of Forster. We derive an additional analysis that is logarithmic in the loss of the reference function, rather than in the number of rounds $T$. This behaviour was recently shown by Orabona et al. OrabonaCBG12 () for a certain online gradient descent algorithm. Yet, their bound OrabonaCBG12 () has a similar multiplicative factor to that of Forster Forster (), while our bound has a potentially better multiplicative factor and the same dependency on the cumulative loss of the reference function as Orabona et al. OrabonaCBG12 (). Additionally, our algorithm and analysis are totally free of assuming a bound on the choices of the adversary or knowing its value.

Competing with the best single function might not suffice for some problems. In many real-world applications, the true target function is not fixed, but may change from time to time. We also bound the performance of our algorithm in a non-stationary environment, where we measure the complexity of the non-stationary environment by the total deviation of a collection of linear functions from some fixed reference point. We show that our algorithm maintains an average loss close to that of the best sequence of functions, as long as the total of this deviation is sublinear in the number of rounds $T$.

A short version appeared in The 23rd International Conference on Algorithmic Learning Theory (ALT 2012). This journal version of the paper includes additionally: (1) Recursive form of the algorithm and comparison to other algorithms of the same form (Sec. 3.1). (2) Kernel version of the algorithm (Sec. 3.2). (3) MAP interpretation of the minimization problems (Remark 1 and Remark 2). (4) All proofs and extended related-work section.

2 Problem Setting

We work in the online setting for regression evaluated with the squared loss. Online algorithms work in rounds or iterations. On each iteration an online algorithm receives an instance $x_t \in \mathbb{R}^d$ and predicts a real value $\hat{y}_t \in \mathbb{R}$; it then receives a label $y_t \in \mathbb{R}$, possibly chosen by an adversary, suffers loss $\ell_t(\mathrm{alg}) = (y_t - \hat{y}_t)^2$, updates its prediction rule, and proceeds to the next round. The cumulative loss suffered by the algorithm over $T$ iterations is,

$$ L_T(\mathrm{alg}) = \sum_{t=1}^{T} \ell_t(\mathrm{alg}) . \tag{1} $$

The goal of the algorithm is to perform well compared to any predictor from some function class.

A common choice is to compare the performance of an algorithm with respect to a single function, or specifically a single linear function, $f(x) = x^\top u$, parameterized by a vector $u \in \mathbb{R}^d$. Denote by $\ell_t(u) = (y_t - u^\top x_t)^2$ the instantaneous loss of a vector $u$, and by $L_T(u) = \sum_{t=1}^{T} \ell_t(u)$ its cumulative loss. The regret with respect to $u$ is defined to be,

$$ R_T(u) = \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - L_T(u) . $$

A desired goal of the algorithm is to have $R_T(u) = o(T)$, that is, the average loss suffered by the algorithm will converge to the average loss of the best linear function $u$.

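Below in Sec. 5 we will also consider an extension of this form of regret, and evaluate the performance of an algorithm against some $T$-tuple of functions, $(u_1, \ldots, u_T) \in \mathbb{R}^d \times \cdots \times \mathbb{R}^d$,

$$ R_T(u_1, \ldots, u_T) = \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - L_T(u_1, \ldots, u_T) , $$

where $L_T(u_1, \ldots, u_T) = \sum_{t=1}^{T} (y_t - u_t^\top x_t)^2$. Clearly, with no restriction on the $T$-tuple, any algorithm may suffer a regret linear in $T$, as one can set $u_t = x_t \, y_t / \|x_t\|^2$ (so that $u_t^\top x_t = y_t$) and suffer zero quadratic loss in all rounds. Thus, we restrict below the possible choices of the $T$-tuple either explicitly, or implicitly via some penalty.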
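The two notions of regret above can be illustrated with a minimal sketch. The function names and the toy data are hypothetical and only serve the illustration; the sketch also shows why an unrestricted $T$-tuple comparator is too strong, since it can fit every label exactly.

```python
def dot(u, x):
    """Inner product of two equal-length vectors."""
    return sum(ui * xi for ui, xi in zip(u, x))


def cumulative_loss(ys, preds):
    """Sum of squared errors between labels and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))


def regret_fixed(xs, ys, preds, u):
    """Regret against one fixed linear function u."""
    ref_preds = [dot(u, x) for x in xs]
    return cumulative_loss(ys, preds) - cumulative_loss(ys, ref_preds)


def regret_tuple(xs, ys, preds, us):
    """Regret against a T-tuple: the comparator may use a different u_t each round."""
    ref_preds = [dot(u, x) for u, x in zip(us, xs)]
    return cumulative_loss(ys, preds) - cumulative_loss(ys, ref_preds)


# Toy data (hypothetical): an algorithm that always predicts 0.
xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ys = [1.0, 2.0, 3.0]
preds = [0.0, 0.0, 0.0]

# Unrestricted comparator u_t = x_t * y_t / ||x_t||^2 has zero loss in every round.
us = [tuple(y * xi / dot(x, x) for xi in x) for x, y in zip(xs, ys)]

print(regret_fixed(xs, ys, preds, (1.0, 2.0)))
print(regret_tuple(xs, ys, preds, us))
```

In this toy example the per-round comparator suffers zero loss, so the regret equals the full loss of the algorithm, which is why the $T$-tuple has to be restricted or penalized.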

3 A Last Step Min-Max Algorithm

Our algorithm is derived based on a last-step min-max prediction, proposed by Forster Forster () and Takimoto and Warmuth TakimotoW00 (). See also the work of Azoury and Warmuth AzouryWa01 (). An algorithm following this approach outputs the min-max prediction assuming the current iteration is the last one. The algorithm we describe below is based on an extension of this notion. For this purpose we introduce a weighted cumulative loss using positive input-dependent weights $\{a_t\}_{t=1}^{T}$,

$$ L_T^a(u) = \sum_{t=1}^{T} a_t (y_t - u^\top x_t)^2 , \qquad L_T^a(u_1, \ldots, u_T) = \sum_{t=1}^{T} a_t (y_t - u_t^\top x_t)^2 . $$

The exact values of the weights will be defined below.

Our variant of the last-step min-max algorithm predicts

$$ \hat{y}_T = \arg\min_{\hat{y}_T} \max_{y_T} \left[ \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - \inf_{u} \left( b\|u\|^2 + L_T^a(u) \right) \right] , \tag{2} $$

for some positive constant $b > 0$. Here $\hat{y}_T$ and $y_T$ serve both as quantifiers (over the $\min$ and $\max$ operators, respectively), and as the optimal values of this optimization problem. We next compute the actual prediction based on the optimal last-step min-max solution. We start with additional notation,

$$ A_t = bI + \sum_{s=1}^{t} a_s x_s x_s^\top \in \mathbb{R}^{d \times d} \tag{3} $$

$$ b_t = \sum_{s=1}^{t} a_s y_s x_s \in \mathbb{R}^{d} . \tag{4} $$

The solution of the internal infimum over $u$ is summarized in the following lemma.

Lemma 1.

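For all $t \geq 1$, the function $f(u) = b\|u\|^2 + \sum_{s=1}^{t} a_s (y_s - u^\top x_s)^2$ is minimal at a unique point $u_t$ given by,

$$ u_t = A_t^{-1} b_t \quad \text{and} \quad f(u_t) = \sum_{s=1}^{t} a_s y_s^2 - b_t^\top A_t^{-1} b_t . \tag{5} $$

Proof.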

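From

$$ \begin{aligned} f(u) &= b\|u\|^2 + \sum_{s=1}^{t} a_s (y_s - u^\top x_s)^2 \\ &= \sum_{s=1}^{t} a_s y_s^2 - 2 \sum_{s=1}^{t} u^\top (a_s y_s x_s) + u^\top \left( bI + \sum_{s=1}^{t} a_s x_s x_s^\top \right) u \\ &= \sum_{s=1}^{t} a_s y_s^2 - 2 u^\top b_t + u^\top A_t u \end{aligned} $$

it follows that $\nabla f(u) = 2 A_t u - 2 b_t$ and $\nabla^2 f(u) = 2 A_t$. Since $A_t$ is positive definite, $f$ is strictly convex and it is minimal if $\nabla f(u) = 0$, i.e. for $u = A_t^{-1} b_t$. This shows that $u_t = A_t^{-1} b_t$ and we obtain

$$ f(u_t) = f(A_t^{-1} b_t) = \sum_{s=1}^{t} a_s y_s^2 - 2 b_t^\top A_t^{-1} b_t + b_t^\top A_t^{-1} A_t A_t^{-1} b_t = \sum_{s=1}^{t} a_s y_s^2 - b_t^\top A_t^{-1} b_t . $$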
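As a sanity check of Lemma 1, the following sketch evaluates the closed form (5) for $d = 2$ and compares it with a direct evaluation of $f$. The helper names, the data, and the weights are hypothetical, and the explicit $2 \times 2$ inverse is used only to keep the example free of external libraries.

```python
def dot(u, x):
    """Inner product of two equal-length vectors."""
    return sum(ui * xi for ui, xi in zip(u, x))


def objective(u, xs, ys, a, b):
    """f(u) = b*||u||^2 + sum_s a_s * (y_s - u^T x_s)^2."""
    fit = sum(a_s * (y - dot(u, x)) ** 2 for a_s, x, y in zip(a, xs, ys))
    return b * dot(u, u) + fit


def lemma1_minimizer_2d(xs, ys, a, b):
    """Return (u_t, f(u_t)) as in (5) for d = 2, via an explicit 2x2 inverse."""
    a00 = b + sum(a_s * x[0] * x[0] for a_s, x in zip(a, xs))
    a01 = sum(a_s * x[0] * x[1] for a_s, x in zip(a, xs))
    a11 = b + sum(a_s * x[1] * x[1] for a_s, x in zip(a, xs))
    b0 = sum(a_s * y * x[0] for a_s, x, y in zip(a, xs, ys))
    b1 = sum(a_s * y * x[1] for a_s, x, y in zip(a, xs, ys))
    det = a00 * a11 - a01 * a01  # positive because A_t is positive definite
    u = ((a11 * b0 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det)
    f_min = sum(a_s * y * y for a_s, y in zip(a, ys)) - (b0 * u[0] + b1 * u[1])
    return u, f_min


# Hypothetical data and arbitrary positive weights.
xs = [(0.5, 0.0), (0.3, 0.4), (0.0, 0.6)]
ys = [1.0, -0.5, 2.0]
a = [1.2, 1.5, 1.1]
b = 2.0

u_t, f_min = lemma1_minimizer_2d(xs, ys, a, b)
print(u_t, f_min, objective(u_t, xs, ys, a, b))
```

The closed-form value and the direct evaluation should agree up to rounding, and perturbing $u_t$ in any direction should only increase the objective.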

Remark 1.

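The minimization problem in Lemma 1 can be interpreted as the MAP estimator of $u$ based on the sequence $\{(x_s, y_s)\}_{s=1}^{t}$ in the following generative model:

$$ u \sim N(0, \sigma_b^2 I) , \qquad y_s \sim N(x_s^\top u, \sigma_s^2) , $$

where $\sigma_b^2 = 1/(2b)$ and $\sigma_s^2 = 1/(2a_s)$.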

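Under the model we calculate,

$$ \begin{aligned} u_{\mathrm{MAP}} &= \arg\max_{u} P(u \mid \{x_s\}, \{y_s\}) \\ &= \arg\max_{u} \left[ P(u) \prod_{s=1}^{t} P(y_s \mid u, x_s) \right] \\ &= \arg\min_{u} \left[ -\log P(u) - \sum_{s=1}^{t} \log P(y_s \mid u, x_s) \right] . \end{aligned} \tag{6} $$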

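By our Gaussian generative model,

$$ \begin{aligned} -\log P(u) &= \log (2\pi\sigma_b^2)^{d/2} + \frac{1}{2\sigma_b^2} \|u\|^2 \\ -\log P(y_s \mid u, x_s) &= \log (2\pi\sigma_s^2)^{1/2} + \frac{1}{2\sigma_s^2} (y_s - x_s^\top u)^2 . \end{aligned} $$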

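Substituting in (6) we get

$$ u_{\mathrm{MAP}} = \arg\min_{u} \left[ \frac{1}{2\sigma_b^2} \|u\|^2 + \sum_{s=1}^{t} \frac{1}{2\sigma_s^2} (y_s - x_s^\top u)^2 \right] , $$

and by using $\sigma_b^2 = 1/(2b)$ and $\sigma_s^2 = 1/(2a_s)$, we get the minimization problem of Lemma 1.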

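Substituting (5) back in (2) we obtain the following form of the minmax problem,

$$ \min_{\hat{y}_T} \max_{y_T} G(y_T, \hat{y}_T) \quad \text{for} \quad G(y_T, \hat{y}_T) = \alpha(a_T)\, y_T^2 + 2 \beta(a_T, \hat{y}_T)\, y_T + \hat{y}_T^2 , \tag{7} $$

for some functions $\alpha(a_T)$ and $\beta(a_T, \hat{y}_T)$. Clearly, for this problem to be well defined the function $G$ should be convex in $\hat{y}_T$ and concave in $y_T$.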

A previous choice, proposed by Forster Forster (), is to have uniform weights and set $a_t = 1$ (for $t = 1, \ldots, T$), which for the particular function $\alpha(a_T)$ yields $\alpha(1) > 0$. Thus, $G(y_T, \hat{y}_T)$ is a convex function in $y_T$, implying that the optimal value of $G$ is not bounded from above. Forster Forster () addressed this problem by restricting $y_T$ to belong to a predefined interval $[-Y, Y]$, known also to the learner. As a consequence, the adversary's optimal choice is in fact either $y_T = Y$ or $y_T = -Y$, which in turn yields an optimal predictor which is clipped at this bound, $\hat{y}_T = \mathrm{clip}(b_{T-1}^\top A_T^{-1} x_T, Y)$, where for $y > 0$ we define $\mathrm{clip}(x, y) = \mathrm{sign}(x)\, y$ if $|x| > y$ and $\mathrm{clip}(x, y) = x$ otherwise.

This phenomenon is illustrated in the left panel of Fig. 1 (best viewed in color). For the minmax optimization function defined by Forster Forster (), fixing some value of $\hat{y}_T$, the function is convex in $y_T$, and the adversary would achieve a maximal value at the boundary of the feasible interval. That is, either $y_T = Y$ or $y_T = -Y$, as indicated by the two magenta lines at $y_T = \pm Y$. The optimal predictor $\hat{y}_T$ is achieved somewhere along the lines $y_T = Y$ or $y_T = -Y$.

We propose an alternative approach to make the minmax optimal solution bounded by appropriately setting the weight $a_T$ such that $G(y_T, \hat{y}_T)$ is concave in $y_T$ for a constant $\hat{y}_T$. We explicitly consider two cases. First, set $a_T$ such that $G(y_T, \hat{y}_T)$ is strictly concave in $y_T$, and thus attains a single maximum with no need to artificially restrict the value of $y_T$. In this case the maximum point is the worst-case adversary, and the optimal predictor $\hat{y}_T$ is achieved in the unique saddle point, as illustrated in the center panel of Fig. 1. A second case is to set $a_T$ such that $\alpha(a_T) = 0$ and the minmax function $G(y_T, \hat{y}_T)$ becomes linear in $y_T$. Here, the optimal prediction is achieved by choosing $\hat{y}_T$ such that $\beta(a_T, \hat{y}_T) = 0$, which makes $G(y_T, \hat{y}_T)$ invariant to $y_T$, as illustrated in the right panel of Fig. 1.

Equipped with Lemma 1 we develop the optimal solution of the min-max predictor, summarized in the following theorem.

Theorem 2.

Assume that $1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T \leq 0$. Then the optimal prediction for the last round is

$$ \hat{y}_T = b_{T-1}^\top A_{T-1}^{-1} x_T . \tag{8} $$

The proof of the theorem makes use of the following technical lemma.

Lemma 3.

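For all $t = 1, 2, \ldots, T$,

$$ a_t^2 x_t^\top A_t^{-1} x_t + 1 - a_t = \frac{1 + a_t x_t^\top A_{t-1}^{-1} x_t - a_t}{1 + a_t x_t^\top A_{t-1}^{-1} x_t} . \tag{9} $$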
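Identity (9) can be checked numerically. The sketch below is restricted to the scalar case $d = 1$, an assumption made only to avoid matrix inversion; the function name and the numbers are hypothetical. It also shows how the sign of the right-hand side, which is $\alpha(a_t)$ in (7), changes with the weight.

```python
def lemma3_sides(a_t, x_t, A_prev):
    """Both sides of (9) for d = 1, where A_prev plays the role of A_{t-1}."""
    c = x_t * x_t / A_prev                 # x_t^T A_{t-1}^{-1} x_t
    A_t = A_prev + a_t * x_t * x_t         # A_t = A_{t-1} + a_t x_t x_t^T
    lhs = a_t * a_t * x_t * x_t / A_t + 1.0 - a_t
    rhs = (1.0 + a_t * c - a_t) / (1.0 + a_t * c)
    return lhs, rhs


A_prev, x_t = 2.0, 1.0
c = x_t * x_t / A_prev

# Uniform weight, the boundary weight 1 / (1 - c), and a larger weight.
for a_t in (1.0, 1.0 / (1.0 - c), 3.0):
    lhs, rhs = lemma3_sides(a_t, x_t, A_prev)
    print(a_t, lhs, rhs)
```

With the uniform weight the common value is positive (the convex case), at the boundary weight it vanishes (the linear case), and for larger weights it is negative (the strictly concave case).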

The proof appears in Appendix A. We now prove Theorem 2.

Proof.

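The adversary can choose any $y_T$, thus the algorithm should predict $\hat{y}_T$ such that the following quantity is minimal,

$$ \max_{y_T} \left( \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - \inf_{u \in \mathbb{R}^d} \left( b\|u\|^2 + \sum_{t=1}^{T} a_t (y_t - u^\top x_t)^2 \right) \right) . $$

That is, we need to solve the following minmax problem

$$ \min_{\hat{y}_T} \max_{y_T} \left( \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - \sum_{t=1}^{T} a_t y_t^2 + b_T^\top A_T^{-1} b_T \right) . $$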

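We use the following relation to rewrite the optimization problem,

$$ b_T^\top A_T^{-1} b_T = b_{T-1}^\top A_T^{-1} b_{T-1} + 2 a_T y_T b_{T-1}^\top A_T^{-1} x_T + a_T^2 y_T^2 x_T^\top A_T^{-1} x_T . \tag{10} $$

Omitting all terms that do not depend on $y_T$ and $\hat{y}_T$,

$$ \min_{\hat{y}_T} \max_{y_T} \left( (y_T - \hat{y}_T)^2 - a_T y_T^2 + 2 a_T y_T b_{T-1}^\top A_T^{-1} x_T + a_T^2 y_T^2 x_T^\top A_T^{-1} x_T \right) . $$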

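We manipulate the last problem to be of form (7) using Lemma 3,

$$ \min_{\hat{y}_T} \max_{y_T} \left( \frac{1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T}{1 + a_T x_T^\top A_{T-1}^{-1} x_T}\, y_T^2 + 2 y_T \left( a_T b_{T-1}^\top A_T^{-1} x_T - \hat{y}_T \right) + \hat{y}_T^2 \right) , \tag{11} $$

where

$$ \alpha(a_T) = \frac{1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T}{1 + a_T x_T^\top A_{T-1}^{-1} x_T} \quad \text{and} \quad \beta(a_T, \hat{y}_T) = a_T b_{T-1}^\top A_T^{-1} x_T - \hat{y}_T . $$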

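We consider two cases: (1) $1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T < 0$ (corresponding to the middle panel of Fig. 1), and (2) $1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T = 0$ (corresponding to the right panel of Fig. 1), starting with the first case,

$$ 1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T < 0 . \tag{12} $$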

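Denote the inner-maximization problem by,

$$ f(y_T) = \frac{1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T}{1 + a_T x_T^\top A_{T-1}^{-1} x_T}\, y_T^2 + 2 y_T \left( a_T b_{T-1}^\top A_T^{-1} x_T - \hat{y}_T \right) + \hat{y}_T^2 . $$

This function is strictly concave with respect to $y_T$ because of (12). Thus, it has a unique maximal value given by,

$$ \begin{aligned} f_{\max}(\hat{y}_T) = {} & - \frac{a_T}{1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T}\, \hat{y}_T^2 + \frac{2 a_T b_{T-1}^\top A_T^{-1} x_T \left( 1 + a_T x_T^\top A_{T-1}^{-1} x_T \right)}{1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T}\, \hat{y}_T \\ & - \frac{\left( a_T b_{T-1}^\top A_T^{-1} x_T \right)^2 \left( 1 + a_T x_T^\top A_{T-1}^{-1} x_T \right)}{1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T} . \end{aligned} $$

Next, we solve $\min_{\hat{y}_T} f_{\max}(\hat{y}_T)$, which is strictly convex with respect to $\hat{y}_T$ because of (12). Solving this problem we get the optimal last-step minmax predictor,

$$ \hat{y}_T = b_{T-1}^\top A_T^{-1} x_T \left( 1 + a_T x_T^\top A_{T-1}^{-1} x_T \right) . \tag{13} $$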

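We further develop the last equation. From (3) we have,

$$ A_T^{-1} a_T x_T x_T^\top A_{T-1}^{-1} = A_T^{-1} (A_T - A_{T-1}) A_{T-1}^{-1} = A_{T-1}^{-1} - A_T^{-1} . \tag{14} $$

Substituting (14) in (13) we have the following equality as desired,

$$ \hat{y}_T = b_{T-1}^\top A_T^{-1} x_T + b_{T-1}^\top A_T^{-1} a_T x_T x_T^\top A_{T-1}^{-1} x_T = b_{T-1}^\top A_T^{-1} x_T + b_{T-1}^\top \left( A_{T-1}^{-1} - A_T^{-1} \right) x_T = b_{T-1}^\top A_{T-1}^{-1} x_T . \tag{15} $$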

We now move to the second case, for which $1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T = 0$, which is written equivalently as,

$$ a_T = \frac{1}{1 - x_T^\top A_{T-1}^{-1} x_T} . \tag{16} $$

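Substituting (16) in (11) we get,

$$ \min_{\hat{y}_T} \max_{y_T} \left( 2 y_T \left( a_T b_{T-1}^\top A_T^{-1} x_T - \hat{y}_T \right) + \hat{y}_T^2 \right) . $$

For $\hat{y}_T \neq a_T b_{T-1}^\top A_T^{-1} x_T$, the value of the optimization problem is unbounded, as the adversary may choose $y_T = z \left( a_T b_{T-1}^\top A_T^{-1} x_T - \hat{y}_T \right)$ for $z \to \infty$. Thus, the optimal last-step minmax prediction is to set $\hat{y}_T = a_T b_{T-1}^\top A_T^{-1} x_T$. Substituting $a_T = 1 + a_T x_T^\top A_{T-1}^{-1} x_T$, which follows from (16), and following the derivation from (13) to (15) above, yields the desired identity.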
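The first case of the proof can be illustrated numerically. The sketch below assumes $d = 1$ and a hypothetical weight satisfying the strict inequality (12); all names and numbers are illustrative. It computes the value of the inner maximization in closed form, minimizes it over a grid of candidate predictions, and compares the grid minimizer with the closed-form prediction (8).

```python
def minmax_pieces(a_T, x_T, A_prev, b_prev):
    """Return alpha(a_T) and a_T * b_{T-1}^T A_T^{-1} x_T from (11), for d = 1."""
    c = x_T * x_T / A_prev                 # x_T^T A_{T-1}^{-1} x_T
    A_T = A_prev + a_T * x_T * x_T
    m = b_prev * x_T / A_T                 # b_{T-1}^T A_T^{-1} x_T
    alpha = (1.0 + a_T * c - a_T) / (1.0 + a_T * c)
    return alpha, a_T * m


def inner_max(y_hat, alpha, am):
    """max over y of alpha*y^2 + 2*(am - y_hat)*y + y_hat^2, valid for alpha < 0."""
    beta = am - y_hat
    return y_hat ** 2 - beta ** 2 / alpha


# Hypothetical state after T-1 rounds and a weight with 1 + a*c - a < 0.
A_prev, b_prev, x_T, a_T = 2.0, 1.0, 1.0, 3.0
alpha, am = minmax_pieces(a_T, x_T, A_prev, b_prev)

closed_form = b_prev * x_T / A_prev        # prediction (8) in one dimension

# Brute-force minimization of the worst-case value over a grid of predictions.
grid = [k / 1000.0 for k in range(-2000, 2001)]
best = min(grid, key=lambda y_hat: inner_max(y_hat, alpha, am))

print(alpha, closed_form, best)
```

In this toy instance the grid minimizer should coincide with the closed-form prediction, which is what Theorem 2 states for the strictly concave case.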

We conclude by noting that although we did not restrict the form of the predictor $\hat{y}_T$, it turns out that it is a linear predictor defined by $\hat{y}_T = x_T^\top w_{T-1}$ for $w_{T-1} = A_{T-1}^{-1} b_{T-1}$. In other words, the functional form of the optimal predictor is the same as the form of the comparison function class, which is linear functions in our case. We call the algorithm (defined using (3), (4) and (8)) WEMM for weighted min-max prediction. We note that WEMM can also be seen as an incremental off-line algorithm AzouryWa01 () or follow-the-leader, on a weighted sequence. The prediction $\hat{y}_T = x_T^\top w_{T-1}$ is made with a model that is optimal over a prefix of length $T-1$. The prediction of the optimal predictor defined in (5) is $x_T^\top u_{T-1} = x_T^\top A_{T-1}^{-1} b_{T-1} = \hat{y}_T$, where $\hat{y}_T$ was defined in (8).
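A minimal sketch of WEMM in this batch form is given below. It maintains $A_t$ and $b_t$ from (3) and (4), predicts with (8), and, as an assumption of this sketch, picks the weight that makes the stated condition hold with equality, $a_t = 1/(1 - x_t^\top A_{t-1}^{-1} x_t)$. The sketch also assumes $\|x_t\|^2 < b$ so that this weight is positive; the function names and the small linear solver are hypothetical helpers.

```python
def dot(u, x):
    """Inner product of two equal-length vectors."""
    return sum(ui * xi for ui, xi in zip(u, x))


def solve(A, v):
    """Solve A z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    M = [list(row) + [vi] for row, vi in zip(A, v)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def wemm_batch_predictions(xs, ys, b):
    """Predict with (8), then update A_t and b_t as in (3) and (4)."""
    d = len(xs[0])
    A = [[b if i == j else 0.0 for j in range(d)] for i in range(d)]
    b_vec = [0.0] * d
    preds = []
    for x, y in zip(xs, ys):
        z = solve(A, x)                    # z = A_{t-1}^{-1} x_t
        preds.append(dot(b_vec, z))        # b_{t-1}^T A_{t-1}^{-1} x_t (A is symmetric)
        a_t = 1.0 / (1.0 - dot(x, z))      # assumed weight: equality case
        for i in range(d):
            b_vec[i] += a_t * y * x[i]
            for j in range(d):
                A[i][j] += a_t * x[i] * x[j]
    return preds


print(wemm_batch_predictions([(1.0, 0.0), (0.0, 1.0), (1.0, 0.0)], [1.0, 1.0, 1.0], 2.0))
```

Solving a linear system on every round costs cubic time in $d$; the sketch favors a direct transcription of (3), (4) and (8) over efficiency.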

3.1 Recursive form

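Although Theorem 2 is correct for $1 + a_t x_t^\top A_{t-1}^{-1} x_t - a_t \leq 0$, in the rest of the paper we will (almost always) assume an equality, that is

$$ a_t = \frac{1}{1 - x_t^\top A_{t-1}^{-1} x_t} , \qquad t = 1, \ldots, T . \tag{17} $$

For this case, the WEMM algorithm can be expressed in a recursive form in terms of a weight vector $w_t$ and a covariance-like matrix $\Sigma_t$. We denote $w_t = u_t = A_t^{-1} b_t$ and $\Sigma_t = A_t^{-1}$, and develop recursive update rules for $w_t$ and $\Sigma_t$:

$$ \begin{aligned} w_t &= A_t^{-1} b_t \\ &= \left( A_{t-1} + a_t x_t x_t^\top \right)^{-1} \left( b_{t-1} + a_t y_t x_t \right) \\ &= \left( A_{t-1}^{-1} - \frac{A_{t-1}^{-1} x_t x_t^\top A_{t-1}^{-1}}{a_t^{-1} + x_t^\top A_{t-1}^{-1} x_t} \right) \left( b_{t-1} + a_t y_t x_t \right) \\ &= w_{t-1} + \frac{y_t A_{t-1}^{-1} x_t - A_{t-1}^{-1} x_t x_t^\top w_{t-1}}{a_t^{-1} + x_t^\top A_{t-1}^{-1} x_t} \\ &= w_{t-1} + \left( y_t - x_t^\top w_{t-1} \right) A_{t-1}^{-1} x_t \\ &= w_{t-1} + \left( y_t - x_t^\top w_{t-1} \right) \Sigma_{t-1} x_t , \end{aligned} \tag{18} $$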

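and

$$ \Sigma_t^{-1} = A_t = A_{t-1} + a_t x_t x_t^\top = A_{t-1} + \frac{x_t x_t^\top}{1 - x_t^\top A_{t-1}^{-1} x_t} = \Sigma_{t-1}^{-1} + \frac{x_t x_t^\top}{1 - x_t^\top \Sigma_{t-1} x_t} $$

or

$$ \Sigma_t = \Sigma_{t-1} - \Sigma_{t-1} x_t x_t^\top \Sigma_{t-1} . \tag{19} $$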

A summary of the algorithm in a recursive form appears in the right column of Table 1.
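The recursive updates (18) and (19) can be sketched as follows. This is a minimal illustration under the equality assumption (17), initialized with $w_0 = 0$ and $\Sigma_0 = b^{-1} I$ (which follows from $A_0 = bI$), and assuming $\|x_t\|^2 < b$; the function name is hypothetical.

```python
def dot(u, x):
    """Inner product of two equal-length vectors."""
    return sum(ui * xi for ui, xi in zip(u, x))


def wemm_recursive(xs, ys, b):
    """Run the recursive form: predict, then apply updates (18) and (19)."""
    d = len(xs[0])
    w = [0.0] * d
    Sigma = [[1.0 / b if i == j else 0.0 for j in range(d)] for i in range(d)]
    preds = []
    for x, y in zip(xs, ys):
        y_hat = dot(w, x)                              # prediction x_t^T w_{t-1}
        preds.append(y_hat)
        Sx = [dot(row, x) for row in Sigma]            # Sigma_{t-1} x_t
        err = y - y_hat
        w = [wi + err * si for wi, si in zip(w, Sx)]   # update (18)
        Sigma = [[Sigma[i][j] - Sx[i] * Sx[j] for j in range(d)]
                 for i in range(d)]                    # update (19)
    return preds, w, Sigma


preds, w, Sigma = wemm_recursive([(1.0,), (1.0,)], [1.0, 1.0], 2.0)
print(preds, w, Sigma)
```

Each round needs only matrix-vector products, so the cost per round is quadratic in $d$ and no matrix inversion is required.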

It is instructive to compare similar second-order online algorithms for regression. Ridge regression Foster91 (), summarized in the third column of Table 1, uses the previous examples to generate a weight-vector, which is used to predict the current example. On round $t$ it sets a weight-vector $w_{t-1}$ to be the solution of the following optimization problem,

$$ w_{t-1} = \arg\min_{w} \left[ \sum_{i=1}^{t-1} (y_i - x_i^\top w)^2 + b\|w\|^2 \right] , $$

and outputs a prediction $\hat{y}_t = x_t^\top w_{t-1}$. The recursive least squares (RLS) algorithm Hayes () is similar, yet it uses a forgetting factor $r$, and sets the weight-vector according to

$$ w_{t-1} = \arg\min_{w} \left[ \sum_{i=1}^{t-1} r^{t-i-1} (y_i - x_i^\top w)^2 \right] . $$

The Aggregating Algorithm for regression (AAR) Vovk01 (), summarized in the second column of Table 1, was introduced by Vovk and it is similar to ridge regression, except it contains additional regularization, which eventually makes it shrink the predictions. It is an application of the Aggregating Algorithm vovkAS () (a general algorithm for merging prediction strategies) to the problem of linear regression with square loss. On round $t$, the weight-vector is obtained according to

$$ w_t = \arg\min_{w} \left[ \sum_{i=1}^{t-1} (y_i - x_i^\top w)^2 + (x_t^\top w)^2 + b\|w\|^2 \right] , $$

and the algorithm predicts $\hat{y}_t = x_t^\top w_t$. Compared to ridge regression, the AAR algorithm uses an additional input pair $(x_t, 0)$. The AAR algorithm was shown to be last-step min-max optimal by Forster Forster (), that is, the predictions can be obtained by solving (2) for $a_t = 1$.
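The shrinkage effect of the additional pair $(x_t, 0)$ can be seen in a small sketch. It is restricted to $d = 1$, an assumption made only so that both minimizers have simple scalar closed forms; the function name and the data are hypothetical.

```python
def ridge_and_aar_predictions(xs, ys, b):
    """Scalar (d = 1) predictions of ridge regression and AAR on the same sequence."""
    A = b          # b + sum of x_i^2 over past rounds
    b_sum = 0.0    # sum of y_i * x_i over past rounds
    ridge, aar = [], []
    for x, y in zip(xs, ys):
        ridge.append(x * b_sum / A)            # w_{t-1} minimizes the ridge objective
        aar.append(x * b_sum / (A + x * x))    # extra term (x_t w)^2 enlarges the denominator
        A += x * x
        b_sum += y * x
    return ridge, aar


ridge, aar = ridge_and_aar_predictions([1.0, 1.0, 0.5], [1.0, 1.0, 2.0], 1.0)
print(ridge)
print(aar)
```

In every round the AAR prediction is no larger in absolute value than the ridge-regression prediction, which is the shrinkage mentioned above.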

The AROWR algorithm VaitsCr11 (); CrammerKuDr12 (), summarized in the left column of Table 1, is a modification of the AROW algorithm CrammerKuDr09 () for regression. It maintains a Gaussian distribution parameterized by a mean $w \in \mathbb{R}^d$ and a full covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$. Intuitively, the mean represents a current linear function, while the covariance matrix captures the uncertainty in the linear function $w$. Given a new example $(x_t, y_t)$ the algorithm uses its current mean to make a prediction $\hat{y}_t = x_t^\top w_{t-1}$. AROWR then sets the new distribution to be the solution of the following optimization problem,

$$ \arg\min_{w, \Sigma} \left[ D_{\mathrm{KL}} \left( N(w, \Sigma) \,\|\, N(w_{t-1}, \Sigma_{t-1}) \right) + \frac{1}{2r} (y_t - w^\top x_t)^2 + \frac{1}{2r} \left( x_t^\top \Sigma x_t \right) \right] . $$

Crammer et al. CrammerKuDr12 () derived regret bounds for this algorithm.

Comparing WEMM to other algorithms we note two differences. First, for the weight-vector update rule, we do not have the normalization term that appears in the update rules of the other algorithms in Table 1. Second, for the covariance matrix update rule, our algorithm gives a non-constant scale to the increment $x_t x_t^\top$, namely $1/(1 - x_t^\top \Sigma_{t-1} x_t)$. This scale is small when the current instance $x_t$ lies along the directions spanned by previously observed inputs $\{x_i\}_{i=1}^{t-1}$, and large when the current instance lies along previously unobserved directions.

3.2 Kernel version of the algorithm

In this section we show that the WEMM algorithm can be expressed in dual variables, which allows an efficient run of the algorithm in any reproducing kernel Hilbert space. We show by induction that the weight-vector $w_t$ and the covariance matrix $\Sigma_t$ computed by the WEMM algorithm in the right column of Table 1 can be written in the form

$$ w_t = \sum_{i=1}^{t} \alpha_i^{(t)} x_i , \qquad \Sigma_t = \sum_{j=1}^{t} \sum_{k=1}^{t} \beta_{j,k}^{(t)} x_j x_k^\top + b^{-1} I , $$

where the coefficients $\alpha_i^{(t)}$ and $\beta_{j,k}^{(t)}$ depend only on inner products of the input vectors.

For the initial step we have $w_0 = 0$ and $\Sigma_0 = b^{-1} I$, which are trivially written in the desired form by setting $\alpha^{(0)} = 0$ and $\beta^{(0)} = 0$. We proceed to the induction step. From the weight-vector update rule (18) we get

 wt = wt−1+(yt−x⊤twt−1)Σt−1xt= = = = t−1∑i=1[α(t−1)i+(yt−t−1∑l=1α(t−1)l(x⊤txl))t−1∑k=1β(t−1)i,k(x⊤kxt)]xi+b−1