Sparse regret minimization

# Gains and Losses are Fundamentally Different in Regret Minimization: The Sparse Case

Joon Kwon, Institut de Mathématiques de Jussieu, Université Pierre-et-Marie-Curie, Paris, France,
and Vianney Perchet, INRIA & Laboratoire de Probabilités et Modèles Aléatoires, Université Paris-Diderot, Paris, France.
###### Abstract.

We demonstrate that, in the classical non-stochastic regret minimization problem with $d$ decisions, gains and losses to be respectively maximized or minimized are fundamentally different. Indeed, by considering the additional sparsity assumption (at each stage, at most $s$ decisions incur a nonzero outcome), we derive optimal regret bounds of different orders. Specifically, with gains, we obtain an optimal regret guarantee after $T$ stages of order $\sqrt{T\log s}$, so the classical dependency on the dimension is replaced by the sparsity size. With losses, we provide matching upper and lower bounds of order $\sqrt{Ts\log(d)/d}$, which is decreasing in $d$. Finally, we also study the bandit setting, and obtain an upper bound of order $\sqrt{Ts\log(d/s)}$ when outcomes are losses. This bound is proven to be optimal up to the logarithmic factor $\sqrt{\log(d/s)}$.

###### Key words and phrases:
online optimization; regret minimization; adversarial; sparse; bandit

## 1. Introduction

We consider the classical problem of regret minimization [15], which has been well developed during the last decade [10, 20, 5, 22, 16, 6]. We recall that in this sequential decision problem, a decision maker (or agent, player, algorithm, strategy, policy, depending on the context) chooses at each stage a decision in a finite set (that we write as $[d]=\{1,\dots,d\}$) and obtains as an outcome a real number in $[0,1]$. We specifically chose the word outcome, as opposed to gain or loss, as our results show that there exists a fundamental discrepancy between these two concepts.

The criterion used to evaluate the policy of the decision maker is the regret, i.e., the difference between the cumulative performance of the best stationary policy (that always picks a given action $i\in[d]$) and the cumulative performance of the policy of the decision maker.

We focus here on the non-stochastic framework, where no assumption (apart from boundedness) is made on the sequence of possible outcomes. In particular, they are not i.i.d. and we can even assume, as usual, that they depend on the past choices of the decision maker. This broad setup, sometimes referred to as individual sequences (since a policy must be good against any sequence of possible outcomes), incorporates prediction with expert advice [10], data with time-evolving laws, etc. Perhaps the most fundamental results in this setup are the upper bound of order $\sqrt{T\log d}$ achieved by the Exponential Weight Algorithm [19, 24, 8, 4] and the asymptotic lower bound of the same order [9]. This general bound is the same whether outcomes are gains in $[0,1]$ (in which case, the objective is to maximize the cumulative sum of gains) or losses in $[0,1]$ (where the decision maker aims at minimizing the cumulative sum). Indeed, a loss $\ell$ can easily be turned into a gain $g$ by defining $g=1-\ell$, the regret being invariant under this transformation.

This idea no longer applies under structural assumptions. For instance, consider the framework where the outcomes are limited to $s$-sparse vectors, i.e. vectors that have at most $s$ nonzero coordinates. The coordinates which are nonzero may change arbitrarily over time. In this framework, the aforementioned transformation does not preserve the sparsity assumption. Indeed, if $\ell$ is an $s$-sparse loss vector, the corresponding gain vector $1-\ell$ may even have full support. Consequently, results for loss vectors do not apply directly to sparse gains, and vice versa. It turns out that both setups are fundamentally different.
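As a small numerical illustration of this point, the sketch below builds a 2-sparse loss vector in dimension 6 and applies the transformation $g=1-\ell$. The use of NumPy, the helper name `sparsity` and the particular numbers are illustrative assumptions, not part of the paper.

```python
import numpy as np

def sparsity(v):
    """Number of nonzero coordinates of v."""
    return int(np.count_nonzero(v))

d = 6
loss = np.zeros(d)
loss[[1, 4]] = [0.3, 0.8]   # a 2-sparse loss vector in [0, 1]^d (hypothetical values)
gain = 1.0 - loss           # the usual loss-to-gain transformation

print(sparsity(loss))  # 2
print(sparsity(gain))  # 6: the gain vector has full support
```

Only coordinates whose loss equals exactly 1 are mapped to a zero gain, so sparsity is lost as soon as most losses are zero.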

The sparsity assumption is actually quite natural in learning and has also received some attention in online learning [14, 7, 1, 11]. In the case of gains, it reflects the fact that the problem has some hidden structure and that many options are irrelevant. For instance, in the canonical click-through-rate example, a website displays an ad and gets rewarded if the user clicks on it; we can safely assume that there are only a small number of ads on which a user would click.

The sparse scenario can also be seen through the lens of prediction with experts. Given a finite set of experts, we call the winner of a stage the expert with the highest revenue (or the smallest loss); ties are broken arbitrarily. The objective would then be to win as many stages as possible. The $s$-sparse setting would represent the case where $s$ experts are designated as winners (or non-losers) at each stage.

In the case of losses, the sparsity assumption is motivated by situations where rare failures might happen at each stage, and the decision maker wants to avoid them. For instance, in network routing problems, it could be assumed that only a small number of paths would lose packets as a result of a single, rare, server failure. Or a learner could have access to a finite number of classification algorithms that perform ideally most of the time; unfortunately, some of them make mistakes on some examples and the learner would like to prevent that. The general setup is therefore a number of algorithms/experts/actions that mostly perform well (i.e., find the correct path, classify correctly, optimize correctly some target function, etc.); however, at each time instance, there are rare mistakes/accidents and the objective would be to find the action/algorithm that has the smallest number (or probability in the stochastic case) of failures.

### 1.1. Summary of Results

We investigate regret minimization scenarios both when outcomes are gains on the one hand, and losses on the other hand. We recall that our objectives are to prove that they are fundamentally different by exhibiting rates of convergence of different order.

When outcomes are gains, we construct an algorithm based on the Online Mirror Descent family [21, 22, 5]. By choosing a regularizer based on the $\ell^p$ norm, and then tuning the parameter $p$ as a function of $s$, we get in Theorem 2.2 a regret bound of order $\sqrt{T\log s}$, which has the interesting property of being independent of the number of decisions $d$. This bound is trivially optimal, up to the constant.

If outcomes are losses instead of gains, although the previous analysis remains valid, a much better bound can be obtained. We build upon a regret bound for the Exponential Weight Algorithm [19, 12] and we manage to get in Theorem 3.1 a regret bound of order $\sqrt{Ts\log(d)/d}$, which is decreasing in $d$, for a given $s$. A nontrivial matching lower bound is established in Theorem 3.3.

Both of these algorithms need to be tuned as a function of $s$. In Theorem 4.1 and Theorem 4.2, we construct algorithms which essentially achieve the same regret bounds without prior knowledge of $s$, by adapting over time to the sparsity level of past outcome vectors, using an adapted version of the doubling trick.

Finally, we investigate the bandit setting, where the only feedback available to the decision maker is the outcome of his decisions (and not the outcome of all possible decisions). In the case of losses we obtain in Theorem 5.1 an upper bound of order $\sqrt{Ts\log(d/s)}$, using the Greedy Online Mirror Descent family of algorithms [2, 3, 5]. This bound is proven to be optimal up to a logarithmic factor, as Theorem 5.3 establishes a lower bound of order $\sqrt{Ts}$.

The rates of convergence achieved by our algorithms are summarized in Figure 1.

### 1.2. General Model and Notation

We recall the classical non-stochastic regret minimization problem. At each time instance $t\geqslant 1$, the decision maker chooses a decision $d_t$ in the finite set $[d]=\{1,\dots,d\}$, possibly at random, according to $x_t\in\Delta_d$, where

$$\Delta_d=\left\{x=(x^{(1)},\dots,x^{(d)})\in\mathbb{R}_+^d \,\middle|\, \sum_{i=1}^d x^{(i)}=1\right\}$$

is the set of probability distributions over $[d]$. Nature then reveals an outcome vector $\omega_t\in[0,1]^d$ and the decision maker receives $\omega_t^{(d_t)}\in[0,1]$. As outcomes are bounded, we can easily replace $\omega_t^{(d_t)}$ by its expectation, which we denote by $\langle\omega_t,x_t\rangle$. Indeed, the Hoeffding-Azuma concentration inequality implies that all the results we state in expectation also hold with high probability.

Given a time horizon $T\geqslant 1$, the objective of the decision maker is to minimize his regret, whose definition depends on whether outcomes are gains or losses. In the case of gains (resp. losses), the notation $\omega_t$ is then changed to $g_t$ (resp. $\ell_t$) and the regret is:

$$R_T=\max_{i\in[d]}\sum_{t=1}^T g_t^{(i)}-\sum_{t=1}^T\langle g_t,x_t\rangle \quad\left(\text{resp. } R_T=\sum_{t=1}^T\langle \ell_t,x_t\rangle-\min_{i\in[d]}\sum_{t=1}^T\ell_t^{(i)}\right).$$
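To make the two definitions concrete, here is a minimal sketch that evaluates the expected regret from a table of outcome vectors and a table of mixed actions. The function name `regret`, its `kind` argument and the toy data are assumptions made for illustration only.

```python
import numpy as np

def regret(outcomes, plays, kind="gain"):
    """Expected regret R_T for outcome vectors and mixed actions.

    outcomes: array of shape (T, d), rows are the outcome vectors in [0, 1]^d
    plays:    array of shape (T, d), rows are the distributions x_t
    """
    outcomes = np.asarray(outcomes, dtype=float)
    plays = np.asarray(plays, dtype=float)
    earned = float(np.sum(outcomes * plays))   # sum over t of <omega_t, x_t>
    totals = outcomes.sum(axis=0)              # cumulative outcome of each fixed decision
    if kind == "gain":
        return float(totals.max()) - earned
    if kind == "loss":
        return earned - float(totals.min())
    raise ValueError("kind must be 'gain' or 'loss'")

# Two decisions, three stages, uniform play at every stage (toy data).
outcomes = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
plays = np.full((3, 2), 0.5)
print(regret(outcomes, plays, "gain"))  # 0.5
print(regret(outcomes, plays, "loss"))  # 0.5
```

In this example the uniform player collects 1.5 in total, the best fixed decision collects 2 when outcomes are gains, and the best fixed decision suffers 1 when they are losses.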

In both cases, the well-known Exponential Weight Algorithm guarantees a bound on the regret of order $\sqrt{T\log d}$. Moreover, this bound cannot be improved in general as it matches a lower bound.

We shall consider an additional structural assumption on the outcomes, namely that $\omega_t$ is $s$-sparse in the sense that $\|\omega_t\|_0\leqslant s$, i.e., the number of nonzero components of $\omega_t$ is at most $s$, where $s\in[d]$ is a fixed known parameter. The set of components which are nonzero is neither fixed nor known, and may change arbitrarily over time.

We aim at proving that it is then possible to drastically improve the previously mentioned guarantee of order $\sqrt{T\log d}$ and that losses and gains are two fundamentally different settings with minimax regrets of different orders.

## 2. When Outcomes are Gains to be Maximized

### 2.1. Online Mirror Descent Algorithms

We quickly present the general Online Mirror Descent algorithm [22, 5, 6, 18] and state the regret bound it incurs; it will be used as a key element in Theorem 2.2.

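A convex function $h:\mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}$ is called a regularizer on $\Delta_d$ if $h$ is strictly convex and continuous on its domain $\Delta_d$, and $h(x)=+\infty$ outside $\Delta_d$. Denote $\delta_h=\max_{\Delta_d}h-\min_{\Delta_d}h$ and $h^*:\mathbb{R}^d\to\mathbb{R}$ the Legendre-Fenchel transform of $h$:

$$h^*(y)=\sup_{x\in\mathbb{R}^d}\{\langle y,x\rangle-h(x)\},\quad y\in\mathbb{R}^d,$$

which is differentiable since $h$ is strictly convex. For all $y\in\mathbb{R}^d$, it holds that $\nabla h^*(y)=\arg\max_{x\in\Delta_d}\{\langle y,x\rangle-h(x)\}\in\Delta_d$.

Let $\eta\in\mathbb{R}$ be a parameter to be tuned. The Online Mirror Descent Algorithm associated with regularizer $h$ and parameter $\eta$ is defined by:

$$x_t=\nabla h^*\left(\eta\sum_{k=1}^{t-1}\omega_k\right),\quad t\geqslant 1,$$

where $\omega_t$ denotes the vector of outcomes and $x_t$ the probability distribution chosen at stage $t$. The specific choice $h(x)=\sum_{i=1}^d x^{(i)}\log x^{(i)}$ for $x\in\Delta_d$ (and $h(x)=+\infty$ otherwise) gives the celebrated Exponential Weight Algorithm, which can be written explicitly, component by component:

$$x_t^{(i)}=\frac{\exp\left(\eta\sum_{k=1}^{t-1}\omega_k^{(i)}\right)}{\sum_{j=1}^d\exp\left(\eta\sum_{k=1}^{t-1}\omega_k^{(j)}\right)},\quad t\geqslant 1,\ i\in[d].$$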
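The component-wise formula of the Exponential Weight Algorithm translates directly into code. The following is a minimal sketch, assuming the cumulative outcome vector is maintained by the caller; the function name `exponential_weights` and the max-shift used for numerical stability are our own choices.

```python
import numpy as np

def exponential_weights(cum_outcomes, eta):
    """Distribution proportional to exp(eta * cumulative outcomes), computed stably."""
    z = eta * np.asarray(cum_outcomes, dtype=float)
    z -= z.max()            # subtracting a constant leaves the normalized weights unchanged
    w = np.exp(z)
    return w / w.sum()

# With no past outcomes, the algorithm plays the uniform distribution.
x_first = exponential_weights(np.zeros(4), eta=0.5)

# A positive eta (gains) favors the decision with the largest cumulative outcome,
# a negative eta (losses) favors the decisions with the smallest one.
cum = np.array([3.0, 0.0, 0.0, 1.0])
x_gain = exponential_weights(cum, eta=0.5)
x_loss = exponential_weights(cum, eta=-0.5)
```

The sign convention matches the text: the same update is used for gains and losses, and only the sign of the parameter changes.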

The following general regret guarantee for strongly convex regularizers is expressed in terms of the dual norm $\|\cdot\|_*$ of $\|\cdot\|$.

###### Theorem 2.1 ([22] Th. 2.21; [6] Th. 5.6; [18] Th. 5.1).

Let $K>0$ and assume $h$ to be $K$-strongly convex with respect to a norm $\|\cdot\|$. Then, for every sequence of outcome vectors $(\omega_t)_{t\geqslant 1}$ in $\mathbb{R}^d$, the Online Mirror Descent strategy associated with $h$ and $\eta$ (with $\eta>0$ in the case of gains and $\eta<0$ in the case of losses) guarantees, for $T\geqslant 1$, the following regret bound:

$$R_T\leqslant\frac{\delta_h}{|\eta|}+\frac{|\eta|}{2K}\sum_{t=1}^T\|\omega_t\|_*^2.$$

### 2.2. Upper Bound on the Regret

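We first assume $s\geqslant 3$. Let $p\in(1,2]$ and define the following regularizer:

$$h_p(x)=\begin{cases}\frac12\|x\|_p^2 & \text{if } x\in\Delta_d\\ +\infty & \text{otherwise.}\end{cases}$$

One can easily check that $h_p$ is indeed a regularizer on $\Delta_d$ and that $\delta_{h_p}\leqslant 1/2$. Moreover, it is $(p-1)$-strongly convex with respect to $\|\cdot\|_p$ (see [5, Lemma 5.7] or [17, Lemma 9]).

We can now state our first result, the general upper bound on regret when outcomes are $s$-sparse gains.

###### Theorem 2.2.

Let $\eta>0$ and $s\geqslant 3$. Against every sequence of $s$-sparse gain vectors $g_t$, i.e., $g_t\in[0,1]^d$ and $\|g_t\|_0\leqslant s$, the Online Mirror Descent algorithm associated with regularizer $h_p$ and parameter $\eta$ guarantees:

$$R_T\leqslant\frac{1}{2\eta}+\frac{\eta Ts^{2/q}}{2(p-1)},$$

where $1/p+1/q=1$. In particular, the choices $\eta=\sqrt{(p-1)/(Ts^{2/q})}$ and $p=1+(2\log s-1)^{-1}$ give:

$$R_T\leqslant\sqrt{2eT\log s}.$$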
###### Proof.

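$h_p$ being $(p-1)$-strongly convex with respect to $\|\cdot\|_p$, and $\|\cdot\|_q$ being the dual norm of $\|\cdot\|_p$, Theorem 2.1 gives:

$$R_T\leqslant\frac{\delta_{h_p}}{\eta}+\frac{\eta}{2(p-1)}\sum_{t=1}^T\|g_t\|_q^2.$$

For each $t\geqslant 1$, the norm of $g_t$ can be bounded as follows:

$$\|g_t\|_q^2=\left(\sum_{i=1}^d\left|g_t^{(i)}\right|^q\right)^{2/q}\leqslant\left(\sum_{s\text{ terms}}\left|g_t^{(i)}\right|^q\right)^{2/q}\leqslant s^{2/q},$$

which, together with $\delta_{h_p}\leqslant 1/2$, yields

$$R_T\leqslant\frac{1}{2\eta}+\frac{\eta Ts^{2/q}}{2(p-1)}.$$

We can now balance both terms by choosing $\eta=\sqrt{(p-1)/(Ts^{2/q})}$ and get:

$$R_T\leqslant\sqrt{\frac{Ts^{2/q}}{p-1}}.$$

Finally, since $s\geqslant 3$, we have $2\log s>1$ and we set $p=1+(2\log s-1)^{-1}\in(1,2]$, which gives:

$$\frac1q=1-\frac1p=\frac{p-1}{p}=\frac{(2\log s-1)^{-1}}{1+(2\log s-1)^{-1}}=\frac{1}{2\log s},$$

and thus, using $1/(p-1)=2\log s-1\leqslant 2\log s$:

$$R_T\leqslant\sqrt{\frac{Ts^{2/q}}{p-1}}\leqslant\sqrt{2T\log s\cdot e^{2\log s/q}}=\sqrt{2eT\log s}.$$

We emphasize the fact that we obtain, up to a multiplicative constant, the exact same rate as when the decision maker only has a set of $s$ decisions.

In the case $s=1,2$, we can easily derive a bound of respectively $\sqrt{T}$ and $\sqrt{2T}$ using the same regularizer with $p=2$.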
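The tuning used in the proof of Theorem 2.2 can be checked numerically. The sketch below only computes the parameters and the resulting bounds; it does not implement the mirror map of $h_p$. The function name `sparse_gain_tuning` is hypothetical, and logarithms are natural.

```python
import math

def sparse_gain_tuning(s, T):
    """Parameters p, q, eta of Theorem 2.2 (s >= 3) and the two bounds on the regret."""
    if s < 3:
        raise ValueError("this tuning requires s >= 3")
    p = 1.0 + 1.0 / (2.0 * math.log(s) - 1.0)
    q = p / (p - 1.0)                                      # conjugate exponent: 1/p + 1/q = 1
    eta = math.sqrt((p - 1.0) / (T * s ** (2.0 / q)))
    balanced = math.sqrt(T * s ** (2.0 / q) / (p - 1.0))   # bound after balancing both terms
    stated = math.sqrt(2.0 * math.e * T * math.log(s))     # bound stated in the theorem
    return p, q, eta, balanced, stated

p, q, eta, balanced, stated = sparse_gain_tuning(s=10, T=1000)
print(1 < p <= 2, balanced <= stated)
```

With this choice of $p$, the conjugate exponent is $q=2\log s$, so that $s^{2/q}=e$ regardless of $s$; this is what removes the dependency on the dimension.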

### 2.3. Matching Lower Bound

For $T\geqslant 1$ and $s\in[d]$, we denote by $v_T^{g,s,d}$ the minimax regret of the $T$-stage decision problem with outcome vectors restricted to $s$-sparse gains:

$$v_T^{g,s,d}=\min_{\text{strat.}}\ \max_{(g_t)_t}R_T$$

where the minimum is taken over all possible policies of the decision maker, and the maximum over all sequences of $s$-sparse gain vectors.

To establish a lower bound in the present setting, we can assume that only the first $s$ coordinates of $g_t$ might be positive (for all $t\geqslant 1$) and even that the decision maker is aware of that. Therefore he has no interest in assigning positive probabilities to any decision but the first $s$ ones. That setup, which is simpler for the decision maker than the original one, is obviously equivalent to the basic regret minimization problem with only $s$ decisions. Therefore, the classical lower bound [9, Theorem 3.2.3] holds and we obtain the following.

###### Theorem 2.3.
$$\liminf_{\substack{s\to+\infty\\ d\geqslant s}}\ \liminf_{T\to+\infty}\frac{v_T^{g,s,d}}{\sqrt{T\log s}}\geqslant\frac{\sqrt2}{2}.$$

The same lower bound, up to the multiplicative constant, actually holds non-asymptotically, see [10, Theorem 3.6].

An immediate consequence of Theorem 2.3 is that the regret bound derived in Theorem 2.2 is asymptotically minimax optimal, up to a multiplicative constant.

## 3. When Outcomes are Losses to be Minimized

### 3.1. Upper Bound on the Regret

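We now consider the case of losses, and the regularizer shall no longer depend on $s$ (as with gains), as we will always use the Exponential Weight Algorithm. Instead, it is the parameter $\eta$ that will be tuned as a function of $s$.

###### Theorem 3.1.

Let $s\geqslant 1$. For every sequence of $s$-sparse loss vectors $(\ell_t)_{t\geqslant 1}$, i.e., $\ell_t\in[0,1]^d$ and $\|\ell_t\|_0\leqslant s$, the Exponential Weight Algorithm with parameter $-\eta$ where $\eta=\log\left(1+\sqrt{2d\log d/(sT)}\right)$ guarantees, for $T\geqslant 1$:

$$R_T\leqslant\sqrt{\frac{2sT\log d}{d}}+\log d.$$

We build upon the following regret bound for losses which is written in terms of the performance of the best action.

###### Theorem 3.2 ([19]; [10] Th 2.4).

Let $\eta>0$. For every sequence of loss vectors $(\ell_t)_{t\geqslant 1}$ in $[0,1]^d$, the Exponential Weight Algorithm with parameter $-\eta$ guarantees, for all $T\geqslant 1$:

$$R_T\leqslant\frac{\log d}{1-e^{-\eta}}+\left(\frac{\eta}{1-e^{-\eta}}-1\right)L_T^*,$$

where $L_T^*=\min_{i\in[d]}\sum_{t=1}^T\ell_t^{(i)}$ is the loss of the best stationary decision.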

###### Proof of Theorem 3.1.

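Let $T\geqslant 1$ and $L_T^*=\min_{i\in[d]}\sum_{t=1}^T\ell_t^{(i)}$ be the loss of the best stationary policy. First note that since the loss vectors $\ell_t$ are $s$-sparse, we have $s\geqslant\sum_{i=1}^d\ell_t^{(i)}$. By summing over $t\in[T]$:

$$sT\geqslant\sum_{t=1}^T\sum_{i=1}^d\ell_t^{(i)}=\sum_{i=1}^d\left(\sum_{t=1}^T\ell_t^{(i)}\right)\geqslant d\left(\min_{i\in[d]}\sum_{t=1}^T\ell_t^{(i)}\right)=dL_T^*,$$

and therefore, we have $L_T^*\leqslant Ts/d$.

Then, by using the inequality $\eta\leqslant(e^\eta-e^{-\eta})/2$, the bound from Theorem 3.2 becomes:

$$R_T\leqslant\frac{\log d}{1-e^{-\eta}}+\left(\frac{e^\eta-e^{-\eta}}{2(1-e^{-\eta})}-1\right)L_T^*.$$

The factor of $L_T^*$ in the second term can be transformed as follows:

$$\frac{e^\eta-e^{-\eta}}{2(1-e^{-\eta})}-1=\frac{(1+e^{-\eta})(e^\eta-e^{-\eta})}{2(1-e^{-2\eta})}-1=\frac{(1+e^{-\eta})e^\eta}{2}-1=\frac{e^\eta-1}{2},$$

and therefore the bound on the regret becomes:

$$R_T\leqslant\frac{\log d}{1-e^{-\eta}}+\frac{e^\eta-1}{2}L_T^*\leqslant\frac{\log d}{1-e^{-\eta}}+\frac{(e^\eta-1)Ts}{2d},$$

where we have been able to use the upper bound on $L_T^*$ since $(e^\eta-1)/2\geqslant 0$. Along with the choice $\eta=\log\left(1+\sqrt{2d\log d/(sT)}\right)$ and standard computations, this yields:

$$R_T\leqslant\sqrt{\frac{2Ts\log d}{d}}+\log d.$$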
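For completeness, the final computation can be spelled out as follows, for the choice $\eta=\log\left(1+\sqrt{2d\log d/(sT)}\right)$. The shorthand $a=e^{\eta}-1$ is introduced here only for readability and is not part of the original notation.

```latex
\begin{aligned}
a = e^{\eta}-1 \quad&\Longrightarrow\quad 1-e^{-\eta}=\frac{a}{1+a},
\qquad \frac{\log d}{1-e^{-\eta}}=\log d+\frac{\log d}{a},\\
R_T &\leqslant \log d+\frac{\log d}{a}+\frac{aTs}{2d},\\
a=\sqrt{\frac{2d\log d}{sT}} \quad&\Longrightarrow\quad
\frac{\log d}{a}=\frac{aTs}{2d}=\sqrt{\frac{Ts\log d}{2d}},\\
R_T &\leqslant \log d+2\sqrt{\frac{Ts\log d}{2d}}
=\sqrt{\frac{2Ts\log d}{d}}+\log d.
\end{aligned}
```

The value of $a$ is the one that equalizes the two terms depending on $a$, which is the usual way of minimizing a sum of the form $\alpha/a+\beta a$.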

Interestingly, the bound from Theorem 3.1 shows that $\sqrt{2Ts\log d/d}$, the dominating term of the regret bound, is decreasing when the number of decisions $d$ increases. This is due to the sparsity assumption (as the regret increases with $s$, the maximal number of decisions with positive losses). Indeed, when $s$ is fixed and $d$ increases, more and more decisions are optimal at each stage, a proportion $1-s/d$ to be precise. As a consequence, it becomes easier to find an optimal decision when $d$ increases. However, this intuition will turn out not to be valid in the bandit framework.

On the other hand, if the proportion $s/d$ of positive losses remains constant then the regret bound achieved is of the same order as in the usual case.
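The guarantee of Theorem 3.1 can be observed on simulated data. The sketch below runs the Exponential Weight Algorithm with the tuned parameter on sparse losses and compares the expected regret with the bound; the function name `ewa_sparse_losses`, the random test sequence and the seed are assumptions made for illustration.

```python
import numpy as np

def ewa_sparse_losses(losses, s):
    """Run the Exponential Weight Algorithm on a (T, d) array of s-sparse losses.

    Uses eta = log(1 + sqrt(2 d log d / (s T))) and weights proportional to
    exp(-eta * cumulative loss). Returns (regret, bound of Theorem 3.1).
    """
    T, d = losses.shape
    eta = np.log1p(np.sqrt(2.0 * d * np.log(d) / (s * T)))
    cum = np.zeros(d)
    incurred = 0.0
    for loss in losses:
        w = np.exp(-eta * (cum - cum.min()))   # shift for numerical stability
        x = w / w.sum()
        incurred += float(loss @ x)            # expected loss <l_t, x_t>
        cum += loss
    regret = incurred - float(cum.min())
    bound = float(np.sqrt(2.0 * s * T * np.log(d) / d) + np.log(d))
    return regret, bound

# Hypothetical test sequence: at each stage, s random coordinates incur a loss of 1.
rng = np.random.default_rng(0)
T, d, s = 2000, 50, 5
losses = np.zeros((T, d))
for t in range(T):
    losses[t, rng.choice(d, size=s, replace=False)] = 1.0

regret, bound = ewa_sparse_losses(losses, s)
print(regret <= bound)  # True, as guaranteed by Theorem 3.1 for any s-sparse sequence
```

A random sequence is not the worst case, so the observed regret is typically well below the bound; the check only confirms that the guarantee is respected.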

### 3.2. Matching Lower Bound

When outcomes are losses, the argument from Section 2.3 does not allow us to derive a lower bound. Indeed, if we assume that only the first $s$ coordinates of the loss vectors can be positive, and that the decision maker knows it, then he just has to take at each stage the decision $d_t=d$, which incurs a loss of 0. As a consequence, he trivially has a regret $R_T=0$. Choosing at random, but once and for all, a fixed subset of $s$ coordinates does not provide any interesting lower bound either. Instead, the key idea of the following result is to choose at random and at each stage the $s$ coordinates associated with positive losses. We therefore use the following classical probabilistic argument. Assume that we have found a probability distribution on $(\ell_t)_t$ such that the expected regret can be bounded from below by a quantity which does not depend on the strategy of the decision maker. This would imply that for any algorithm, there exists a sequence $(\ell_t)_t$ such that the regret is greater than the same quantity.

In the following statement, $v_T^{\ell,s,d}$ stands for the minimax regret in the case where outcomes are losses.

###### Theorem 3.3.

For all $s\geqslant 1$,

$$\liminf_{d\to+\infty}\ \liminf_{T\to+\infty}\frac{v_T^{\ell,s,d}}{\sqrt{T\frac{s}{d}\log d}}\geqslant\frac{\sqrt2}{2}.$$

The main consequences of this theorem are that the algorithm described in Theorem 3.1 is asymptotically minimax optimal (up to a multiplicative constant) and that gains and losses are fundamentally different from the point of view of regret minimization.

###### Proof.

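We first define the sequence of loss vectors $\ell_t$ ($t\geqslant 1$) i.i.d. as follows. First, we draw a set $I_t\subset[d]$ of cardinality $s$ uniformly among the $\binom{d}{s}$ possibilities. Then, if $i\in I_t$, set $\ell_t^{(i)}=1$ with probability $1/2$ and $\ell_t^{(i)}=0$ with probability $1/2$, independently for each component. If $i\notin I_t$, we set $\ell_t^{(i)}=0$.

As a consequence, we always have that $\ell_t$ is $s$-sparse. Moreover, for each $t\geqslant 1$ and each coordinate $i\in[d]$, $\ell_t^{(i)}$ satisfies:

$$\mathbb{P}\left[\ell_t^{(i)}=1\right]=\frac{s}{2d}\quad\text{and}\quad\mathbb{P}\left[\ell_t^{(i)}=0\right]=1-\frac{s}{2d},$$

thus $\mathbb{E}[\ell_t^{(i)}]=s/2d$. Therefore we obtain that for any algorithm $(x_t)_{t\geqslant 1}$, $\mathbb{E}[\langle\ell_t,x_t\rangle]=s/2d$. This yields that

$$\begin{aligned}\mathbb{E}\left[\frac{R_T}{\sqrt T}\right]&=\mathbb{E}\left[\frac{1}{\sqrt T}\left(\sum_{t=1}^T\langle\ell_t,x_t\rangle-\min_{i\in[d]}\sum_{t=1}^T\ell_t^{(i)}\right)\right]\\&=\mathbb{E}\left[\max_{i\in[d]}\frac{1}{\sqrt T}\sum_{t=1}^T\left(\frac{s}{2d}-\ell_t^{(i)}\right)\right]\\&=\mathbb{E}\left[\max_{i\in[d]}\frac{1}{\sqrt T}\sum_{t=1}^T X_t^{(i)}\right],\end{aligned}$$

where, for $t\geqslant 1$, we have defined the random vector $X_t$ by $X_t^{(i)}=s/2d-\ell_t^{(i)}$ for all $i\in[d]$. For $t\geqslant 1$, the $X_t$ are i.i.d. zero-mean random vectors with values in $[-1,1]^d$. We can therefore apply the comparison Lemma 3.5 to get:

$$\liminf_{T\to+\infty}\mathbb{E}\left[\frac{R_T}{\sqrt T}\right]=\liminf_{T\to+\infty}\mathbb{E}\left[\max_{i\in[d]}\frac{1}{\sqrt T}\sum_{t=1}^T X_t^{(i)}\right]\geqslant\mathbb{E}\left[\max_{i\in[d]}Z^{(i)}\right],$$

where $Z\sim\mathcal{N}(0,\Sigma)$ with $\Sigma=\left(\operatorname{cov}(X_1^{(i)},X_1^{(j)})\right)_{i,j\in[d]}$.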

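We now appeal to Slepian’s lemma, recalled in Proposition 3.4 below. To this end, we introduce the Gaussian vector $W\sim\mathcal{N}(0,\tilde\Sigma)$ where

$$\tilde\Sigma=\operatorname{diag}\left(\operatorname{Var}X_1^{(1)},\dots,\operatorname{Var}X_1^{(1)}\right).$$

As a consequence, the first two hypotheses of Proposition 3.4 follow from the definitions of $Z$ and $W$. Let $i\neq j$, then

$$\mathbb{E}\left[Z^{(i)}Z^{(j)}\right]=\operatorname{cov}\left(Z^{(i)},Z^{(j)}\right)=\operatorname{cov}\left(\ell_1^{(i)},\ell_1^{(j)}\right)=\mathbb{E}\left[\ell_1^{(i)}\ell_1^{(j)}\right]-\mathbb{E}\left[\ell_1^{(i)}\right]\mathbb{E}\left[\ell_1^{(j)}\right].$$

By definition of $\ell_1$, $\ell_1^{(i)}\ell_1^{(j)}=1$ if and only if $\ell_1^{(i)}=\ell_1^{(j)}=1$, and $\ell_1^{(i)}\ell_1^{(j)}=0$ otherwise. Therefore, using the random subset $I_1$ that appears in the definition of $\ell_1$:

$$\mathbb{E}\left[Z^{(i)}Z^{(j)}\right]=\mathbb{P}\left[\ell_1^{(i)}=\ell_1^{(j)}=1\right]-\left(\frac{s}{2d}\right)^2=\frac14\cdot\frac{\binom{d-2}{s-2}}{\binom{d}{s}}-\left(\frac{s}{2d}\right)^2.$$

The right-hand side equals $\frac14\left(\frac{s(s-1)}{d(d-1)}-\frac{s^2}{d^2}\right)\leqslant 0$, and since $\mathbb{E}[W^{(i)}W^{(j)}]=0$, the third hypothesis of Slepian’s lemma is also satisfied. It yields that, for all $\theta\in\mathbb{R}$:

$$\mathbb{P}\left[\max_{i\in[d]}Z^{(i)}\leqslant\theta\right]=\mathbb{P}\left[Z^{(1)}\leqslant\theta,\dots,Z^{(d)}\leqslant\theta\right]\leqslant\mathbb{P}\left[W^{(1)}\leqslant\theta,\dots,W^{(d)}\leqslant\theta\right]=\mathbb{P}\left[\max_{i\in[d]}W^{(i)}\leqslant\theta\right].$$

This inequality between two cumulative distribution functions implies the reverse inequality on expectations:

$$\mathbb{E}\left[\max_{i\in[d]}Z^{(i)}\right]\geqslant\mathbb{E}\left[\max_{i\in[d]}W^{(i)}\right].$$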

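The components of the Gaussian vector $W$ being independent, and of variance $\operatorname{Var}\ell_1^{(1)}$, we have

$$\mathbb{E}\left[\max_{i\in[d]}W^{(i)}\right]=\kappa_d\sqrt{\operatorname{Var}\ell_1^{(1)}}=\kappa_d\sqrt{\frac{s}{2d}\left(1-\frac{s}{2d}\right)}\geqslant\kappa_d\sqrt{\frac{s}{4d}},$$

where $\kappa_d$ is the expectation of the maximum of $d$ independent standard Gaussian variables. Combining everything gives:

$$\liminf_{T\to+\infty}\frac{v_T^{\ell,s,d}}{\sqrt T}\geqslant\liminf_{T\to+\infty}\mathbb{E}\left[\frac{R_T}{\sqrt T}\right]\geqslant\mathbb{E}\left[\max_{i\in[d]}Z^{(i)}\right]\geqslant\mathbb{E}\left[\max_{i\in[d]}W^{(i)}\right]\geqslant\kappa_d\sqrt{\frac{s}{4d}}.$$

And for large $d$, since $\kappa_d$ is equivalent to $\sqrt{2\log d}$ (see e.g. [13]):

$$\liminf_{d\to+\infty}\ \liminf_{T\to+\infty}\frac{v_T^{\ell,s,d}}{\sqrt{T\frac{s}{d}\log d}}\geqslant\frac{\sqrt2}{2}.$$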
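The random loss vectors used in this proof are easy to simulate. The sketch below draws them as described (a uniformly random subset of $s$ coordinates, then a fair coin on each selected coordinate) and checks empirically that each coordinate equals 1 with frequency close to $s/2d$. The function name `sample_loss_vector`, the parameter values and the seed are illustrative assumptions.

```python
import numpy as np

def sample_loss_vector(d, s, rng):
    """Draw one loss vector: pick s coordinates uniformly, then flip a fair coin on each."""
    loss = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)   # the random subset of cardinality s
    loss[support] = rng.integers(0, 2, size=s)       # 1 with probability 1/2, else 0
    return loss

rng = np.random.default_rng(0)
d, s, n = 20, 4, 5000
samples = np.array([sample_loss_vector(d, s, rng) for _ in range(n)])
print(samples.mean(axis=0))   # every entry should be close to s / (2 d) = 0.1
```

Note that the realized sparsity of a sample can be strictly smaller than $s$, since selected coordinates are set to 0 half of the time; the vectors are nonetheless always $s$-sparse.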

###### Proposition 3.4 (Slepian’s lemma [23]).

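Let $Z=(Z^{(1)},\dots,Z^{(d)})$ and $W=(W^{(1)},\dots,W^{(d)})$ be Gaussian random vectors in $\mathbb{R}^d$ satisfying:

1. $\mathbb{E}[Z]=\mathbb{E}[W]=0$;

2. $\mathbb{E}[(Z^{(i)})^2]=\mathbb{E}[(W^{(i)})^2]$ for $i\in[d]$;

3. $\mathbb{E}[Z^{(i)}Z^{(j)}]\leqslant\mathbb{E}[W^{(i)}W^{(j)}]$ for $i\neq j\in[d]$.

Then, for all real numbers $\theta_1,\dots,\theta_d$, we have:

$$\mathbb{P}\left[Z^{(1)}\leqslant\theta_1,\dots,Z^{(d)}\leqslant\theta_d\right]\leqslant\mathbb{P}\left[W^{(1)}\leqslant\theta_1,\dots,W^{(d)}\leqslant\theta_d\right].$$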

The following lemma is an extension of e.g. [10, Lemma A.11] to random vectors with correlated components.

###### Lemma 3.5 (Comparison lemma).

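For $T\geqslant 1$, let $(X_t)_{1\leqslant t\leqslant T}$ be i.i.d. zero-mean random vectors in $[-1,1]^d$, $\Sigma$ be the covariance matrix of $X_t$ and $Z\sim\mathcal{N}(0,\Sigma)$. Then,

$$\liminf_{T\to+\infty}\mathbb{E}\left[\max_{i\in[d]}\frac{1}{\sqrt T}\sum_{t=1}^T X_t^{(i)}\right]\geqslant\mathbb{E}\left[\max_{i\in[d]}Z^{(i)}\right].$$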
###### Proof.

Denote

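$$Y_T=\max_{i\in[d]}\frac{1}{\sqrt T}\sum_{t=1}^T X_t^{(i)}.$$

Let $A<0$ and consider the function $\phi_A:\mathbb{R}\to\mathbb{R}$ defined by $\phi_A(x)=\max(x,A)$. Then:

$$\mathbb{E}[Y_T]=\mathbb{E}\left[Y_T\cdot\mathbb{1}\{Y_T\geqslant A\}\right]+\mathbb{E}\left[Y_T\cdot\mathbb{1}\{Y_T<A\}\right]=\mathbb{E}[\phi_A(Y_T)]-\mathbb{E}\left[(A-Y_T)\cdot\mathbb{1}\{A-Y_T>0\}\right].$$

Let us estimate the second term. Denote $Z_T=(A-Y_T)\cdot\mathbb{1}\{A-Y_T>0\}$. We clearly have, for all $u>0$, $\mathbb{P}[Z_T>u]=\mathbb{P}[A-Y_T>u]$. And $Z_T$ being nonnegative, we can write:

$$\begin{aligned}0\leqslant\mathbb{E}\left[(A-Y_T)\cdot\mathbb{1}\{A-Y_T>0\}\right]=\mathbb{E}[Z_T]&=\int_0^{+\infty}\mathbb{P}[Z_T>u]\,du=\int_0^{+\infty}\mathbb{P}[A-Y_T>u]\,du\\&=\int_{-A}^{+\infty}\mathbb{P}[Y_T<-u]\,du=\int_{-A}^{+\infty}\mathbb{P}\left[\max_{i\in[d]}\frac{1}{\sqrt T}\sum_{t=1}^T X_t^{(i)}<-u\right]du\\&\leqslant\int_{-A}^{+\infty}\mathbb{P}\left[\sum_{t=1}^T X_t^{(1)}<-u\sqrt T\right]du.\end{aligned}$$

For $u>0$, using Hoeffding’s inequality together with the assumptions $\mathbb{E}[X_t^{(1)}]=0$ and $X_t^{(1)}\in[-1,1]$, we can bound the last integrand:

$$\mathbb{P}\left[\sum_{t=1}^T X_t^{(1)}<-u\sqrt T\right]\leqslant e^{-u^2/2},$$

which gives:

$$0\leqslant\mathbb{E}\left[(A-Y_T)\cdot\mathbb{1}\{A-Y_T>0\}\right]\leqslant\int_{-A}^{+\infty}e^{-u^2/2}\,du\leqslant\frac{e^{-A^2/2}}{-A}.$$

Therefore:

$$\mathbb{E}[Y_T]\geqslant\mathbb{E}[\phi_A(Y_T)]+\frac{e^{-A^2/2}}{A}.$$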

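We now take the liminf on both sides as $T\to+\infty$. The left-hand side is the quantity that appears in the statement. We now focus on the term $\mathbb{E}[\phi_A(Y_T)]$ of the right-hand side. The central limit theorem gives the following convergence in distribution, where $X\sim\mathcal{N}(0,\Sigma)$ has the same distribution as $Z$ in the statement:

$$\frac{1}{\sqrt T}\sum_{t=1}^T X_t\xrightarrow[T\to+\infty]{\mathcal{L}}X.$$

The map $x\mapsto\max_{i\in[d]}x^{(i)}$ being continuous, we can apply the continuous mapping theorem:

$$Y_T=\max_{i\in[d]}\frac{1}{\sqrt T}\sum_{t=1}^T X_t^{(i)}\xrightarrow[T\to+\infty]{\mathcal{L}}\max_{i\in[d]}X^{(i)}.$$

This convergence in distribution allows the use of the portmanteau lemma: $\phi_A$ being lower semi-continuous and bounded from below, we have:

$$\liminf_{T\to+\infty}\mathbb{E}[\phi_A(Y_T)]\geqslant\mathbb{E}\left[\phi_A\left(\max_{i\in[d]}X^{(i)}\right)\right],$$

and thus:

$$\liminf_{T\to+\infty}\mathbb{E}[Y_T]\geqslant\mathbb{E}\left[\phi_A\left(\max_{i\in[d]}X^{(i)}\right)\right]+\frac{e^{-A^2/2}}{A}.$$

We would now like to take the limit as $A\to-\infty$. By definition of $\phi_A$, for $A\leqslant 0$, we have the following domination:

$$\left|\phi_A\left(\max_{i\in[d]}X^{(i)}\right)\right|\leqslant\left|\max_{i\in[d]}X^{(i)}\right|\leqslant\max_{i\in[d]}\left|X^{(i)}\right|\leqslant\sum_{i=1}^d\left|X^{(i)}\right|,$$

where each $|X^{(i)}|$ is integrable since $X^{(i)}$ is a normal random variable. We can therefore apply the dominated convergence theorem as $A\to-\infty$:

$$\mathbb{E}\left[\phi_A\left(\max_{i\in[d]}X^{(i)}\right)\right]\xrightarrow[A\to-\infty]{}\mathbb{E}\left[\max_{i\in[d]}X^{(i)}\right],$$

and finally, since $e^{-A^2/2}/A\to 0$ as $A\to-\infty$, we get the stated result:

$$\liminf_{T\to+\infty}\mathbb{E}[Y_T]\geqslant\mathbb{E}\left[\max_{i\in[d]}X^{(i)}\right].$$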

## 4. When the sparsity level s is unknown

We no longer assume in this section that the decision maker knows the sparsity level $s$. We modify our algorithms to be adaptive to the sparsity level of the observed gain/loss vectors, following the same ideas as the classical doubling trick (which, however, cannot be directly applied here). The algorithms are proved to essentially achieve the same regret bounds as in the case where $s$ is known.

Specifically, let $T\geqslant 1$ be the number of rounds and $s^*$ the highest sparsity level of the gain/loss vectors chosen by Nature up to time $T$. In the following, we construct algorithms which achieve regret bounds of order $\sqrt{T\log s^*}$ and $\sqrt{Ts^*\log(d)/d}$ for gains and losses respectively, without prior knowledge of $s^*$.

### 4.1. For Losses

Let $(\ell_t)_{t\geqslant 1}$ be the sequence of loss vectors in $[0,1]^d$ chosen by Nature, and $T\geqslant 1$ the number of rounds. We denote by $s^*=\max_{1\leqslant t\leqslant T}\|\ell_t\|_0$ the highest sparsity level of the loss vectors up to time $T$. The goal is to construct an algorithm which achieves a regret bound of order $\sqrt{Ts^*\log(d)/d}$ without any prior knowledge about the sparsity level of the loss vectors.

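The time instances will be divided into several time intervals. On each of those, the previous loss vectors will be left aside, and a new instance of the Exponential Weight Algorithm with a specific parameter will be run. Let $M=\lceil\log_2 s^*\rceil$ and $\tau(0)=0$. Then, for $1\leqslant m<M$ we define

$$\tau(m)=\min\left\{1\leqslant t\leqslant T\,\middle|\,\|\ell_t\|_0>2^m\right\}\quad\text{and}\quad\tau(M)=T.$$

In other words, $\tau(m)$ is the first time instance at which the sparsity level of the loss vector exceeds $2^m$. $(\tau(m))_{0\leqslant m\leqslant M}$ is thus a nondecreasing sequence. We can then define the time intervals as follows. For $1\leqslant m\leqslant M$, let

$$I(m)=\begin{cases}\{\tau(m-1)+1,\dots,\tau(m)\} & \text{if }\tau(m-1)<\tau(m)\\ \varnothing & \text{if }\tau(m-1)=\tau(m).\end{cases}$$

The sets $(I(m))_{1\leqslant m\leqslant M}$ clearly form a partition of $[T]$ (some of the intervals may be empty). For $t\in[T]$, we define $m_t=\min\{m\geqslant 1\mid\tau(m)\geqslant t\}$, which implies $t\in I(m_t)$. In other words, $m_t$ is the index of the only interval $t$ belongs to.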
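The construction of the time intervals can be sketched in a few lines. The code below takes the sequence of sparsity levels of the loss vectors and returns the intervals; it assumes the conventions that the first breakpoint is 0, the last one is the horizon, and the number of intervals is the smallest integer $M\geqslant 1$ such that $2^M$ is at least the highest sparsity level. These conventions, like the function name `sparsity_intervals`, are our reading and should be treated as assumptions.

```python
import math

def sparsity_intervals(sparsities):
    """Split stages 1..T into intervals I(1), ..., I(M) from the sparsity levels.

    sparsities[t-1] is the number of nonzero entries of the loss vector at stage t.
    Assumed conventions: tau(0) = 0, tau(M) = T, M = max(1, ceil(log2(s_star))).
    """
    T = len(sparsities)
    s_star = max(sparsities)
    M = max(1, math.ceil(math.log2(max(s_star, 1))))
    tau = [0]
    for m in range(1, M):
        # first stage whose sparsity level exceeds 2^m (exists because 2^m < s_star)
        tau.append(next(t for t, k in enumerate(sparsities, start=1) if k > 2 ** m))
    tau.append(T)
    return [list(range(tau[m - 1] + 1, tau[m] + 1)) for m in range(1, M + 1)]

intervals = sparsity_intervals([1, 2, 3, 1, 5, 2, 8])
print(intervals)  # [[1, 2, 3], [4, 5], [6, 7]]
```

In the example, sparsity first exceeds 2 at stage 3 and first exceeds 4 at stage 5, which gives the breakpoints 0, 3, 5 and 7. A new instance of the Exponential Weight Algorithm would be started at the beginning of each nonempty interval.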

Let $C>0$ be a constant to be chosen later and for $1\leqslant m\leqslant M$, let

$$\eta(m)=\log\left(1+C\sqrt{\frac{d\log d}{2^mT}}\right)$$

be the parameter of the Exponential Weight Algorithm to be used on interval $I(m)$. In this section, $h$ will be the entropic regularizer on the simplex $\Delta_d$, so that $\nabla h^*$ is the logit map used in the Exponential Weight Algorithm. We can then define the played actions to be:

$$x_t=\nabla h^*\left(-\eta(m_t)\sum_{\substack{t'<t\\ t'\in I(m_t)}}\ell_{t'}\right).$$
###### Theorem 4.1.

The above algorithm with guarantees

$$R_T\leqslant 4\sqrt{\frac{Ts^*\log d}{d}}+\frac{\lceil\log s^*\rceil\log d}{2}+5s^*\sqrt{\frac{\log d}{dT}}.$$
###### Proof.

Let $1\leqslant m\leqslant M$. On time interval $I(m)$, the Exponential Weight Algorithm is run with parameter $\eta(m)$ against loss vectors in $[0,1]^d$. Therefore, the following regret bound derived in the proof of Theorem 3.1 applies:

 R(m):= ∑t∈I(m)⟨ℓt,xt⟩−mini∈[d]∑t∈I(m)ℓ(i)t ⩽logd1−e−η(m)