Weighted Last-Step Min-Max Algorithm with Improved Sub-Logarithmic Regret
Abstract
In online learning, the performance of an algorithm is typically compared to the performance of a fixed function from some class, using a quantity called regret. Forster [Forster] proposed a last-step min-max algorithm which is somewhat simpler than the algorithm of Vovk [vovkAS], yet attains the same regret. In fact, the algorithm he analyzed assumed that the choices of the adversary are bounded, artificially restricting the analysis to the two extreme cases. We fix this problem by weighting the examples in such a way that the min-max problem is well defined, and provide an analysis with logarithmic regret that may have a better multiplicative factor than both the bounds of Forster [Forster] and Vovk [vovkAS]. We also derive a new bound that may be sub-logarithmic, as is a recent bound of Orabona et al. [OrabonaCBG12], but may have a better multiplicative factor. Finally, we analyze the algorithm in a weak type of non-stationary setting, and show a bound that is sublinear if the non-stationarity is sublinear as well.
keywords:
Online learning, Regression, Min-max learning
1 Introduction
We consider the online learning regression problem, in which a learning algorithm tries to predict real numbers $\hat y_t$ in a sequence of rounds, given some side-information or inputs $x_t$. Real-world example applications for these algorithms are weather or stock-market predictions. The goal of the algorithm is to have a small discrepancy between its predictions $\hat y_t$ and the associated outcomes $y_t$. This discrepancy is measured with a loss function, such as the square loss. It is common to evaluate algorithms by their regret, the difference between the cumulative loss of an algorithm and the cumulative loss of the best function taken from some class.
Forster [Forster] proposed a last-step min-max algorithm for online regression that makes a prediction assuming the current example is the last one to be observed, where the goal of the algorithm is indeed to minimize the regret with respect to linear functions. The resulting optimization problem he obtained is convex in both the choice of the algorithm and the choice of the adversary; since the adversary maximizes, the problem is unbounded. Forster circumvented this issue by assuming a bound over the choices of the adversary that must be known to the algorithm, yet his analysis is for the version with no bound.
We propose a modified last-step min-max algorithm with weights over examples, controlled in a way that yields a problem that is concave in the choices of the adversary and convex in the choices of the algorithm. We analyze our algorithm and show a logarithmic regret that may have a better multiplicative factor than the analysis of Forster. We derive an additional analysis that is logarithmic in the loss of the reference function, rather than in the number of rounds $T$. This behaviour was recently obtained by Orabona et al. [OrabonaCBG12] for a certain online gradient descent algorithm. Yet, their bound [OrabonaCBG12] has a multiplicative factor similar to that of Forster [Forster], while our bound has a potentially better multiplicative factor and the same dependency on the cumulative loss of the reference function as Orabona et al. [OrabonaCBG12]. Additionally, our algorithm and analysis are entirely free of assuming a bound on the labels or knowing its value.
Competing with the best single function might not suffice for some problems. In many real-world applications, the true target function is not fixed, but may change from time to time. We also bound the performance of our algorithm in a non-stationary environment, where we measure the complexity of the non-stationary environment by the total deviation of a collection of linear functions from some fixed reference point. We show that our algorithm maintains an average loss close to that of the best sequence of functions, as long as the total of this deviation is sublinear in the number of rounds $T$.
A short version of this paper appeared in The 23rd International Conference on Algorithmic Learning Theory (ALT 2012). This journal version additionally includes: (1) a recursive form of the algorithm and a comparison to other algorithms of the same form (Sec. 3.1); (2) a kernel version of the algorithm (Sec. 3.2); (3) a MAP interpretation of the minimization problems (Remark 1 and Remark 2); (4) all proofs and an extended related-work section.
2 Problem Setting
We work in the online setting for regression evaluated with the squared loss. Online algorithms work in rounds, or iterations. On each iteration an online algorithm receives an instance $x_t \in \mathbb{R}^d$ and predicts a real value $\hat y_t \in \mathbb{R}$; it then receives a label $y_t \in \mathbb{R}$, possibly chosen by an adversary, suffers loss $\left(\hat y_t - y_t\right)^2$, updates its prediction rule, and proceeds to the next round. The cumulative loss suffered by the algorithm over $T$ iterations is,
(1) $L_T = \sum_{t=1}^{T}\left(\hat y_t - y_t\right)^2\,.$
The goal of the algorithm is to perform well compared to any predictor from some function class.
A common choice is to compare the performance of an algorithm with respect to a single function, or specifically a single linear function, $f(x) = x^\top u$, parameterized by a vector $u \in \mathbb{R}^d$. Denote by $\ell_t(u) = \left(x_t^\top u - y_t\right)^2$ the instantaneous loss of a vector $u$, and by $L_T(u) = \sum_{t=1}^{T} \ell_t(u)$ its cumulative loss. The regret with respect to $u$ is defined to be,
$R_T(u) = L_T - L_T(u)\,.$
A desired goal of the algorithm is to have $R_T(u) = o(T)$, that is, the average loss suffered by the algorithm will converge to the average loss of the best linear function $u$.
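As a concrete illustration of this protocol, the following sketch runs a simple online learner against a linear data stream and measures its regret against the best fixed linear function in hindsight. The learner here is plain online ridge regression, used only as a stand-in (it is not the algorithm of this paper), and the constant `reg` and the data model are arbitrary choices for the example:

```python
import numpy as np

# Online regression protocol: predict, observe label, suffer square loss.
rng = np.random.default_rng(0)
d, T, reg = 3, 200, 1.0
u_true = rng.normal(size=d)

A = reg * np.eye(d)          # running matrix  reg*I + sum_s x_s x_s^T
bvec = np.zeros(d)           # running vector  sum_s y_s x_s
loss_alg = 0.0
X, Y = [], []

for t in range(T):
    x = rng.normal(size=d)
    y_hat = x @ np.linalg.solve(A, bvec)   # predict using past data only
    y = u_true @ x + 0.1 * rng.normal()    # label revealed after predicting
    loss_alg += (y_hat - y) ** 2
    A += np.outer(x, x)
    bvec += y * x
    X.append(x)
    Y.append(y)

# Cumulative loss of the best fixed linear predictor in hindsight.
X, Y = np.array(X), np.array(Y)
u_star, *_ = np.linalg.lstsq(X, Y, rcond=None)
loss_u = np.sum((X @ u_star - Y) ** 2)
regret = loss_alg - loss_u
```

For a well-behaved learner the quantity `regret / T` shrinks as `T` grows, which is exactly the $o(T)$ goal stated above.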
Below in Sec. 5 we will also consider an extension of this form of regret, and evaluate the performance of an algorithm against some tuple of functions, $\left(u_1, \dots, u_T\right)$,
$R_T\left(u_1, \dots, u_T\right) = L_T - \sum_{t=1}^{T} \ell_t\left(u_t\right)\,,$
where $\ell_t\left(u_t\right) = \left(x_t^\top u_t - y_t\right)^2$. Clearly, with no restriction on the tuple, any algorithm may suffer a regret linear in $T$, as one can set $u_t$ with $x_t^\top u_t = y_t$ and suffer zero quadratic loss in all rounds. Thus, we restrict below the possible choices of the tuple either explicitly, or implicitly via some penalty.
3 A Last-Step Min-Max Algorithm
Our algorithm is derived based on a last-step min-max prediction, proposed by Forster [Forster] and Takimoto and Warmuth [TakimotoW00]. See also the work of Azoury and Warmuth [AzouryWa01]. An algorithm following this approach outputs the min-max prediction assuming the current iteration is the last one. The algorithm we describe below is based on an extension of this notion. For this purpose we introduce a weighted cumulative loss using positive input-dependent weights $a_t$,
$L_{T,a}(u) = \sum_{t=1}^{T} a_t\left(x_t^\top u - y_t\right)^2\,.$
The exact values of the weights will be defined below.
Our variant of the last-step min-max algorithm predicts (here $\hat y_T$ and $y_T$ serve both as quantifiers, over the $\min$ and $\sup$ operators respectively, and as the optimal values of this optimization problem)
(2) $\hat y_T = \arg\min_{\hat y_T}\,\sup_{y_T}\left[\,\sum_{t=1}^{T}\left(\hat y_t - y_t\right)^2 - \inf_u\left(b\|u\|^2 + L_{T,a}(u)\right)\right]$
for some positive constant $b$. We next compute the actual prediction based on the optimal last-step min-max solution. We start with additional notation,
(3) $A_t = bI + \sum_{s=1}^{t} a_s\, x_s x_s^\top$
(4) $b_t = \sum_{s=1}^{t} a_s\, y_s x_s\,.$
The solution of the internal infimum over $u$ is summarized in the following lemma.
Lemma 1.
For all $t$, the function $f_t(u) = b\|u\|^2 + \sum_{s=1}^{t} a_s\left(x_s^\top u - y_s\right)^2$ is minimal at a unique point, given by,
(5) $u_t = A_t^{-1} b_t\,.$
Proof.
From
$\nabla_u f_t(u) = 2\left(A_t u - b_t\right)$
it follows that $\nabla_u^2 f_t(u) = 2A_t \succ 0$. Thus $f_t$ is convex, and it is minimal iff $\nabla_u f_t(u) = 0$, i.e. for $u_t = A_t^{-1} b_t$. This shows that the minimizer is unique, and we obtain
$\inf_u f_t(u) = \sum_{s=1}^{t} a_s y_s^2 - b_t^\top A_t^{-1} b_t\,.$
∎
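Lemma 1 admits a direct numerical sanity check: the closed form $u_t = A_t^{-1} b_t$ should beat any perturbation of itself on the weighted regularized objective. The sketch below assumes only the weighted ridge form of the loss stated in the lemma; the weights, dimensions, and constant `b` are arbitrary:

```python
import numpy as np

# Check: for f(u) = b*||u||^2 + sum_s a_s (x_s.u - y_s)^2, the minimizer
# is u* = A^{-1} c, with A = b*I + sum_s a_s x_s x_s^T, c = sum_s a_s y_s x_s.
rng = np.random.default_rng(1)
d, n, b = 4, 30, 0.7
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
a = rng.uniform(0.5, 2.0, size=n)   # arbitrary positive weights

A = b * np.eye(d) + (a[:, None] * X).T @ X
c = (a * y) @ X
u_star = np.linalg.solve(A, c)

def f(u):
    return b * u @ u + np.sum(a * (X @ u - y) ** 2)

# The closed-form solution should not be beaten by random perturbations.
base = f(u_star)
ok = all(f(u_star + 0.01 * rng.normal(size=d)) >= base for _ in range(100))
```

Since $f_t$ is strictly convex, every perturbation strictly increases the objective, so the check above holds exactly (up to floating point).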
Remark 1.
The minimization problem in Lemma 1 can be interpreted as a MAP estimator of $u$ based on the sequence $\left(x_1, y_1\right), \dots, \left(x_t, y_t\right)$ in the following generative model:
(6) $y_s = x_s^\top u + \epsilon_s\,,$
where $\epsilon_s \sim \mathcal{N}\left(0, \frac{1}{2 a_s}\right)$ and $u \sim \mathcal{N}\left(0, \frac{1}{2b} I\right)$.
Substituting (5) back in (2), we obtain the following form of the min-max problem,
(7) $\min_{\hat y_T}\,\sup_{y_T}\left[\left(\hat y_T - y_T\right)^2 + \alpha\, y_T^2 + \beta\, y_T\right]$
for some quantities $\alpha$ and $\beta$ that do not depend on $\hat y_T$ and $y_T$. Clearly, for this problem to be well defined, the objective should be convex in $\hat y_T$ and concave in $y_T$.
A previous choice, proposed by Forster [Forster], is to have uniform weights and set $a_t = 1$ (for $t = 1, \dots, T$), which makes the coefficient of the quadratic term in $y_T$ equal to $x_T^\top A_T^{-1} x_T > 0$. Thus, the objective is a convex function of $y_T$, implying that the optimal value of $y_T$ is not bounded from above. Forster [Forster] addressed this problem by restricting $y_T$ to belong to a predefined interval $[-Y, Y]$, known also to the learner. As a consequence, the adversary's optimal choice is in fact either $-Y$ or $Y$, which in turn yields an optimal predictor which is clipped at this bound, $\mathrm{clip}\left(\hat y_T, Y\right)$, where for $Y > 0$ we define $\mathrm{clip}(z, Y) = z$ if $|z| \le Y$, and $\mathrm{clip}(z, Y) = Y\,\mathrm{sign}(z)$ otherwise.
This phenomenon is illustrated in the left panel of Fig. 1 (best viewed in color). For the min-max optimization function defined by Forster [Forster], fixing some value of $\hat y_T$, the objective is convex in $y_T$, and the adversary achieves a maximal value at the boundary of the interval of feasible values, that is, either $y_T = -Y$ or $y_T = Y$, as indicated by the two magenta lines at $y_T = \pm Y$. The optimal predictor is achieved somewhere along the lines $y_T = -Y$ or $y_T = Y$.
We propose an alternative approach to make the min-max optimal solution bounded, by appropriately setting the weight $a_T$ such that the objective is concave in $y_T$. We explicitly consider two cases. First, set $a_T$ such that the objective is strictly concave in $y_T$, and thus attains a single maximum, with no need to artificially restrict the value of $y_T$. In this case the objective is concave in $y_T$ and has a maximum point, which is the worst-case adversary. The optimal prediction is achieved at the unique saddle point, as illustrated in the center panel of Fig. 1. A second case is to set $a_T$ such that the min-max objective becomes linear in $y_T$. Here, the optimal prediction is achieved by choosing $\hat y_T$ such that the coefficient of $y_T$ vanishes, which makes the objective invariant to $y_T$, as illustrated in the right panel of Fig. 1.
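The convex-versus-concave behavior of the inner problem can be seen already in a scalar caricature (the coefficients below are hypothetical and only mimic the structure of the objective, not its exact terms): the adversary maximizes $h(y) = (p - y)^2 - q\,y^2$ for a fixed prediction $p$. For $q < 1$ the map is convex in $y$ and the supremum is unbounded, which is the situation that forces a clipping bound; for $q > 1$ it is strictly concave and attains an interior maximum:

```python
import numpy as np

def h(p, y, q):
    # (prediction error) minus a weighted penalty on the adversary's label.
    return (p - y) ** 2 - q * y ** 2

p = 0.5
ys = np.linspace(-100.0, 100.0, 20001)

vals_convex = h(p, ys, q=0.5)    # grows without bound as |y| -> infinity
vals_concave = h(p, ys, q=2.0)   # bounded, maximized at an interior point

idx = int(np.argmax(vals_concave))
y_star = ys[idx]                  # interior maximizer for q = 2
# Analytically: d/dy [(p - y)^2 - 2*y^2] = 0  gives  y* = -p  for q = 2.
```

In the convex case the maximum sits on the boundary of whatever interval the adversary is confined to; in the concave case it is interior and no artificial bound is needed, mirroring the left versus center panels of Fig. 1.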
Equipped with Lemma 1, we develop the optimal solution of the min-max problem, summarized in the following theorem.
Theorem 2.
Assume that $a_T\left(1 - a_T\, x_T^\top A_T^{-1} x_T\right) \ge 1$. Then the optimal prediction for the last round is
(8) $\hat y_T = x_T^\top A_{T-1}^{-1}\, b_{T-1}\,.$
The proof of the theorem makes use of the following technical lemma.
Lemma 3.
For all $y_T \in \mathbb{R}$,
(9) $b_T^\top A_T^{-1} b_T = b_{T-1}^\top A_T^{-1} b_{T-1} + 2 a_T y_T\, x_T^\top A_T^{-1} b_{T-1} + a_T^2 y_T^2\, x_T^\top A_T^{-1} x_T\,.$
Proof (of Theorem 2).
The adversary can choose any $y_T \in \mathbb{R}$; thus the algorithm should predict $\hat y_T$ such that the following quantity is minimal,
$\sup_{y_T}\left[\,\sum_{t=1}^{T}\left(\hat y_t - y_t\right)^2 - \inf_u\left(b\|u\|^2 + \sum_{t=1}^{T} a_t\left(x_t^\top u - y_t\right)^2\right)\right].$
That is, we need to solve the following min-max problem,
$\min_{\hat y_T}\,\sup_{y_T}\left[\,\sum_{t=1}^{T}\left(\hat y_t - y_t\right)^2 - \inf_u\left(b\|u\|^2 + \sum_{t=1}^{T} a_t\left(x_t^\top u - y_t\right)^2\right)\right].$
We use the following relation, which follows from Lemma 1, to rewrite the optimization problem,
(10) $\inf_u\left(b\|u\|^2 + \sum_{t=1}^{T} a_t\left(x_t^\top u - y_t\right)^2\right) = \sum_{t=1}^{T} a_t y_t^2 - b_T^\top A_T^{-1} b_T\,.$
Omitting all terms that do not depend on $\hat y_T$ and $y_T$, we are left with
$\min_{\hat y_T}\,\sup_{y_T}\left[\left(\hat y_T - y_T\right)^2 - a_T y_T^2 + b_T^\top A_T^{-1} b_T\right].$
We manipulate the last problem to be of the form (7) using Lemma 3,
(11) $\min_{\hat y_T}\,\sup_{y_T}\left[\left(\hat y_T - y_T\right)^2 + \alpha_T\, y_T^2 + \beta_T\, y_T\right]$
where
$\alpha_T = -a_T + a_T^2\, x_T^\top A_T^{-1} x_T\,, \qquad \beta_T = 2 a_T\, x_T^\top A_T^{-1} b_{T-1}\,,$
and terms that do not depend on $\hat y_T$ and $y_T$ are again omitted.
We consider two cases: (1) $a_T\left(1 - a_T\, x_T^\top A_T^{-1} x_T\right) > 1$ (corresponding to the middle panel of Fig. 1), and (2) $a_T\left(1 - a_T\, x_T^\top A_T^{-1} x_T\right) = 1$ (corresponding to the right panel of Fig. 1), starting with the first case,
(12) $1 + \alpha_T = 1 - a_T + a_T^2\, x_T^\top A_T^{-1} x_T < 0\,.$
Denote the inner maximization problem by,
$F\left(y_T\right) = \left(\hat y_T - y_T\right)^2 + \alpha_T\, y_T^2 + \beta_T\, y_T\,.$
This function is strictly concave with respect to $y_T$ because of (12). Thus, it has a unique maximal value, attained at
$y_T^{*} = \frac{2\hat y_T - \beta_T}{2\left(1 + \alpha_T\right)}\,.$
Next, we solve $\min_{\hat y_T} F\left(y_T^{*}\right)$, which is strictly convex with respect to $\hat y_T$ because of (12). Solving this problem we get the optimal last-step min-max predictor,
(13) $\hat y_T = -\frac{\beta_T}{2\alpha_T} = \frac{x_T^\top A_T^{-1} b_{T-1}}{1 - a_T\, x_T^\top A_T^{-1} x_T}\,.$
We further derive the last equation. From (3) we have,
(14) $A_T = A_{T-1} + a_T\, x_T x_T^\top\,.$
Substituting (14) in (13) and applying the Sherman-Morrison formula, we have the following equality, as desired,
(15) $\hat y_T = x_T^\top A_{T-1}^{-1}\, b_{T-1}\,.$
We now move to the second case, for which $a_T\left(1 - a_T\, x_T^\top A_T^{-1} x_T\right) = 1$, which is written equivalently as,
(16) $1 + \alpha_T = 1 - a_T + a_T^2\, x_T^\top A_T^{-1} x_T = 0\,.$
Substituting (16) in (11) we get,
$\min_{\hat y_T}\,\sup_{y_T}\left[\hat y_T^2 + \left(\beta_T - 2\hat y_T\right) y_T\right].$
For $\hat y_T \ne \beta_T / 2$, the value of the optimization problem is not bounded, as the adversary may choose $y_T = c\left(\beta_T - 2\hat y_T\right)$ for $c \to \infty$. Thus, the optimal last-step min-max prediction is to set
$\hat y_T = \frac{\beta_T}{2} = a_T\, x_T^\top A_T^{-1} b_{T-1}\,.$
Substituting and following the derivation from (13) to (15) above yields the desired identity.
∎
We conclude by noting that although we did not restrict the form of the predictor, it turns out to be a linear predictor, $\hat y_T = x_T^\top w$ for $w = A_{T-1}^{-1} b_{T-1}$. In other words, the functional form of the optimal predictor is the same as the form of the comparison function class: linear functions in our case. We call the algorithm (defined using (3), (4) and (8)) WEMM, for weighted min-max prediction. We note that WEMM can also be seen as an incremental offline algorithm [AzouryWa01], or follow-the-leader, on a weighted sequence. The prediction is made with a model that is optimal over the prefix of length $T-1$: the prediction of the optimal predictor defined in (5) is $x_T^\top u_{T-1}$, which equals $\hat y_T$ as defined in (8).
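The follow-the-(weighted-)leader view admits a compact sketch: on each round, predict with the model that is optimal on the weighted prefix seen so far. In the sketch below the weights are set to $1$ purely for illustration (WEMM's actual weights come from its concavity condition), and the data model is arbitrary:

```python
import numpy as np

# Follow-the-leader on a weighted prefix: predict with
#   u_{t-1} = argmin_u  b*||u||^2 + sum_{s<t} a_s (x_s.u - y_s)^2
#           = A_{t-1}^{-1} b_{t-1}.
rng = np.random.default_rng(2)
d, T, b = 3, 100, 1.0
u_true = rng.normal(size=d)

A = b * np.eye(d)     # A_{t-1}
c = np.zeros(d)       # b_{t-1}
for t in range(T):
    x = rng.normal(size=d)
    y_hat = x @ np.linalg.solve(A, c)   # prediction from the prefix model
    y = u_true @ x                      # noiseless linear labels
    a_t = 1.0                           # placeholder weight for illustration
    A += a_t * np.outer(x, x)
    c += a_t * y * x

last_err = abs(y_hat - y)   # shrinks as the prefix grows
```

On a noiseless linear stream the prefix model approaches the true linear function, so the last prediction error is small; the regularizer `b` only biases the early rounds.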
3.1 Recursive form
Although Theorem 2 is correct whenever $a_t\left(1 - a_t\, x_t^\top A_t^{-1} x_t\right) \ge 1$, in the rest of the paper we will (almost always) assume an equality, that is,
(17) $a_t\left(1 - a_t\, x_t^\top A_t^{-1} x_t\right) = 1\,.$
For this case, the WEMM algorithm can be expressed in a recursive form in terms of a weight vector and a covariance-like matrix. We denote $w_t = A_t^{-1} b_t$ and $\Sigma_t = A_t^{-1}$, and develop recursive update rules for $w_t$ and $\Sigma_t$:
(18) $w_t = w_{t-1} + a_t\, \Sigma_t x_t\left(y_t - x_t^\top w_{t-1}\right)$
and
$\Sigma_t^{-1} = \Sigma_{t-1}^{-1} + a_t\, x_t x_t^\top\,,$
or
(19) $\Sigma_t = \Sigma_{t-1} - \frac{a_t\, \Sigma_{t-1} x_t x_t^\top \Sigma_{t-1}}{1 + a_t\, x_t^\top \Sigma_{t-1} x_t}\,,$
where (17) yields $a_t = \left(1 - x_t^\top \Sigma_{t-1} x_t\right)^{-1}$.
A summary of the algorithm in a recursive form appears in the right column of Table 1.
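The recursions of this subsection can be checked numerically against the direct definitions of $A_t$ and $b_t$. The sketch below is our reading of the recursive form: the weight rule $a_t = (1 - x_t^\top \Sigma_{t-1} x_t)^{-1}$ follows from the equality case (17), and inputs are rescaled so that $\|x_t\|^2 < b$, an assumption needed for that rule to give positive weights:

```python
import numpy as np

# Recursive form: maintain w_t = A_t^{-1} b_t and Sigma_t = A_t^{-1} via
# rank-one (Sherman-Morrison) updates, and compare with direct bookkeeping
# of A_t = b*I + sum a_s x_s x_s^T and c_t = sum a_s y_s x_s.
rng = np.random.default_rng(3)
d, T, b = 3, 50, 10.0
u_true = rng.normal(size=d)

Sigma = np.eye(d) / b
w = np.zeros(d)
A = b * np.eye(d)        # direct bookkeeping, for the consistency check
c = np.zeros(d)

ok = True
for t in range(T):
    x = rng.normal(size=d)
    x /= max(1.0, np.linalg.norm(x))      # keep ||x|| <= 1 < sqrt(b)
    y = u_true @ x + 0.1 * rng.normal()

    y_hat = w @ x                          # prediction with the prefix model
    a = 1.0 / (1.0 - x @ Sigma @ x)        # assumed weight rule from (17)
    Sx = Sigma @ x
    Sigma = Sigma - a * np.outer(Sx, Sx) / (1.0 + a * (x @ Sx))
    w = w + a * (Sigma @ x) * (y - y_hat)

    A += a * np.outer(x, x)
    c += a * y * x
    ok &= np.allclose(Sigma, np.linalg.inv(A)) and np.allclose(w, np.linalg.solve(A, c))
```

The consistency flag stays true because the rank-one update is the exact Sherman-Morrison inverse of the rank-one increment to $A_t$, so the recursive and direct computations agree up to floating point.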
It is instructive to compare WEMM to similar second-order online algorithms for regression. Ridge regression [Foster91], summarized in the third column of Table 1, uses the previous examples to generate a weight vector, which is used to predict on the current example. On round $t$ it sets the weight vector to be the solution of the following optimization problem,
$w_t = \arg\min_w\left\{b\|w\|^2 + \sum_{s=1}^{t-1}\left(x_s^\top w - y_s\right)^2\right\},$
and outputs the prediction $\hat y_t = x_t^\top w_t$. The recursive least squares (RLS) algorithm [Hayes] is similar, yet it uses a forgetting factor $0 < \lambda \le 1$, and sets the weight vector according to
$w_t = \arg\min_w\, \sum_{s=1}^{t-1} \lambda^{t-1-s}\left(x_s^\top w - y_s\right)^2\,.$
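The textbook RLS recursion makes the forgetting factor concrete; conventions for the initialization vary, and with $P_0 = \delta^{-1} I$ the recursion solves exactly the exponentially weighted problem with a decaying regularizer $\delta \lambda^t \|w\|^2$, as the sketch verifies:

```python
import numpy as np

# RLS with forgetting factor lam: recursive updates vs. the direct
# exponentially weighted least-squares solution.
rng = np.random.default_rng(4)
d, n, lam, delta = 3, 40, 0.95, 1.0
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

P = np.eye(d) / delta
w = np.zeros(d)
for t in range(n):
    x = X[t]
    k = P @ x / (lam + x @ P @ x)          # gain vector
    w = w + k * (y[t] - x @ w)
    P = (P - np.outer(k, x @ P)) / lam     # P_t^{-1} = lam*P_{t-1}^{-1} + x x^T

# Direct solve of the weighted problem after n steps.
wts = lam ** (n - 1 - np.arange(n))        # lam^(t-s) weights
G = delta * lam ** n * np.eye(d) + (wts[:, None] * X).T @ X
r = (wts * y) @ X
w_direct = np.linalg.solve(G, r)
```

Recent examples thus dominate the solution, which is what lets RLS track drifting targets at the cost of a larger variance.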
The Aggregating Algorithm for regression (AAR) [Vovk01], summarized in the second column of Table 1, was introduced by Vovk and is similar to ridge regression, except that it contains an additional regularization term, which eventually makes it shrink the predictions. It is an application of the Aggregating Algorithm [vovkAS] (a general algorithm for merging prediction strategies) to the problem of linear regression with square loss. On round $t$, the weight vector is obtained according to
$w_t = \arg\min_w\left\{b\|w\|^2 + \sum_{s=1}^{t-1}\left(x_s^\top w - y_s\right)^2 + \left(x_t^\top w\right)^2\right\},$
and the algorithm predicts $\hat y_t = x_t^\top w_t$. Compared to ridge regression, the AAR algorithm uses the additional input pair $\left(x_t, 0\right)$. The AAR algorithm was shown to be last-step min-max optimal by Forster [Forster]; that is, its predictions can be obtained by solving (2) for $a_t = 1$.
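The shrinkage effect of the extra pair $(x_t, 0)$ has a simple closed form: by the Sherman-Morrison formula, the AAR prediction equals the ridge prediction divided by $1 + x_t^\top A^{-1} x_t$, where $A$ is the regularized correlation matrix of the past inputs. The sketch below checks this identity on random data (dimensions and constants arbitrary):

```python
import numpy as np

# AAR treats the current input as an extra example with label 0,
# which shrinks the plain ridge prediction toward zero.
rng = np.random.default_rng(5)
d, n, b = 4, 25, 1.0
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
x_new = rng.normal(size=d)

A = b * np.eye(d) + X.T @ X     # past inputs only
c = y @ X

pred_ridge = x_new @ np.linalg.solve(A, c)
pred_aar = x_new @ np.linalg.solve(A + np.outer(x_new, x_new), c)
shrink = 1.0 + x_new @ np.linalg.solve(A, x_new)
# Identity: pred_aar == pred_ridge / shrink, so |pred_aar| <= |pred_ridge|.
```

The shrinkage is largest exactly when $x_t$ points in a direction poorly covered by past data, which is when an aggressive prediction would be most risky.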
The AROWR algorithm [VaitsCr11; CrammerKuDr12], summarized in the left column of Table 1, is a modification of the AROW algorithm [CrammerKuDr09] for regression. It maintains a Gaussian distribution parameterized by a mean $\mu_t \in \mathbb{R}^d$ and a full covariance matrix $\Sigma_t \in \mathbb{R}^{d \times d}$. Intuitively, the mean $\mu_t$ represents a current linear function, while the covariance matrix $\Sigma_t$ captures the uncertainty in this linear function. Given a new example $\left(x_t, y_t\right)$, the algorithm uses its current mean to make a prediction $\hat y_t = x_t^\top \mu_{t-1}$. AROWR then sets the new distribution to be the solution of the following optimization problem,
$\min_{\mu, \Sigma}\; D_{\mathrm{KL}}\left(\mathcal{N}\left(\mu, \Sigma\right) \,\middle\|\, \mathcal{N}\left(\mu_{t-1}, \Sigma_{t-1}\right)\right) + \frac{1}{2r}\left(y_t - x_t^\top \mu\right)^2 + \frac{1}{2r}\, x_t^\top \Sigma\, x_t$
for a parameter $r > 0$. Crammer et al. [CrammerKuDr12] derived regret bounds for this algorithm.
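A common closed form for a single AROW-style regression step is sketched below; the parameter `r` and the exact update follow our reading of the description above, so treat them as assumptions rather than the definitive AROWR recursion. Two properties are easy to verify: the residual shrinks by exactly the factor $r / (r + x^\top \Sigma x)$, and the covariance stays positive definite:

```python
import numpy as np

# One AROW-style regression update (a sketch under assumed closed forms).
rng = np.random.default_rng(6)
d, r = 3, 1.0
mu = rng.normal(size=d)       # current mean (linear model)
Sigma = np.eye(d)             # current uncertainty
x = rng.normal(size=d)
y = 2.0

res_before = y - x @ mu
denom = r + x @ Sigma @ x
mu_new = mu + (Sigma @ x) * res_before / denom
Sigma_new = Sigma - np.outer(Sigma @ x, Sigma @ x) / denom

res_after = y - x @ mu_new    # equals res_before * r / denom
```

The damping by $r + x^\top \Sigma x$ is precisely the normalization term that, as noted next, WEMM's weight-vector update does not have.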
Comparing WEMM to the other algorithms, we note two differences. First, the weight-vector update rule does not have a normalization term of the form $\left(r + x_t^\top \Sigma_{t-1} x_t\right)$ in the denominator. Second, in the covariance matrix update rule, our algorithm gives a non-constant scale $a_t$ to the increment $x_t x_t^\top$. This scale is small when the current instance lies along the directions spanned by previously observed inputs, and large when the current instance lies along previously unobserved directions.
Table 1: Summary of the algorithms in recursive form. Columns, left to right: AROWR [VaitsCr11; CrammerKuDr12]; AAR [Vovk01] / Min-Max [Forster]; Ridge-Regression [Foster91]; WEMM (this work). Rows: parameters; initialization of the weight vector and the covariance-like matrix; then, for $t = 1, \dots, T$: receive an instance $x_t$, output a prediction $\hat y_t$, receive the correct label $y_t$, update the weight vector, update the covariance-like matrix; finally, output the learned model.
3.2 Kernel version of the algorithm
In this section we show that the WEMM algorithm can be expressed in dual variables, which allows running the algorithm efficiently in any reproducing kernel Hilbert space. We show by induction that the weight vector $w_t$ and the covariance-like matrix $\Sigma_t$ computed by the WEMM algorithm in the right column of Table 1 can be written in the form
$w_t = \sum_{i=1}^{t} \alpha_{t,i}\, x_i\,, \qquad \Sigma_t = \frac{1}{b}\, I + \sum_{i=1}^{t}\sum_{j=1}^{t} \eta_{t,i,j}\, x_i x_j^\top\,,$
where the coefficients $\alpha_{t,i}$ and $\eta_{t,i,j}$ depend only on inner products of the input vectors.
For the initial step we have $w_0 = 0$ and $\Sigma_0 = \frac{1}{b} I$, which are trivially written in the desired form by setting all the coefficients $\alpha_{0,i}$ and $\eta_{0,i,j}$ to zero. We proceed to the induction step. From the weight-vector update rule (18) we get