# Logistic Regression: Tight Bounds for Stochastic and Online Optimization

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 336078 – ERC-SUBLRN.

Elad Hazan, Tomer Koren, Kfir Y. Levy

Technion – Israel Institute of Technology, Haifa 32000, Israel. Emails: ehazan@ie.technion.ac.il, tomerk@technion.ac.il, kfiryl@tx.technion.ac.il.
May 2014
###### Abstract

The logistic loss function is often advocated in machine learning and statistics as a smooth and strictly convex surrogate for the 0-1 loss. In this paper we investigate whether these smoothness and convexity properties make the logistic loss preferable to other widely considered options such as the hinge loss. We show that, in contrast to known asymptotic bounds, as long as the number of prediction/optimization iterations is sub-exponential, the logistic loss provides no improvement over a generic non-smooth loss function such as the hinge loss. In particular, we show that the convergence rate of stochastic logistic optimization is bounded from below by a polynomial in the diameter of the decision set and the number of prediction iterations, and provide a matching tight upper bound. This resolves the COLT open problem of McMahan and Streeter (2012).

## 1 Introduction

In many applications, such as estimating click-through rates in web advertising and predicting whether a patient has a certain disease, the logistic loss is often the loss of choice. It appeals as a convex surrogate of the 0-1 loss, and as a tool that not only yields categorical predictions but is also able to estimate the underlying probabilities of the categories. Moreover, Friedman et al. (2000) and Collins et al. (2002) have shown that logistic regression is strongly connected to boosting.

A long-standing debate in the machine learning community has been the optimal choice of surrogate loss function for binary prediction problems (see Langford, 2009; Bulatov, 2007). Amongst the arguments in support of the logistic loss are its smoothness and strict-convexity properties, which, unlike those of other loss functions (such as the hinge loss), permit the use of more efficient optimization methods. In particular, the logistic loss is exp-concave, and thus second-order methods are applicable and give rise to theoretically superior convergence and/or regret bounds.

More technically, under standard assumptions on the training data, the logistic loss is 1-Lipschitz and -exp-concave over the set of linear -dimensional classifiers whose -norm is at most . Thus, the Online Newton Step algorithm (Hazan et al., 2007) can be applied to the logistic regression problem and gives a convergence rate of over iterations. On the other hand, first order methods can be used to attain a rate of , which is attainable in general for any Lipschitz convex loss function. The exponential dependence on  of the first bound suggests that second order methods might present poor performance in practical logistic regression problems, even when compared to the slow rate of first-order methods. The gap between the two rates raises the question: is a fast convergence rate of the form achievable for logistic regression?

This question has received much attention lately. Bach (2013), relying on a property called “generalized self-concordance”, gave an algorithm with convergence rate of , where is the smallest eigenvalue of the Hessian at the optimal point. This translates to a  rate whenever the expected loss function is “locally strongly convex” at the optimum. More recently, Bach and Moulines (2013) extended this result and presented an elegant algorithm that attains a rate of the form , without assuming strong convexity (neither global or local) — but rather depending on a certain data-dependent constant .

In this paper, we resolve the above question and give tight characterization of the achievable convergence rates for logistic regression. We show that as long as the target accuracy is not exponentially small in , a rate of the form is not attainable. Specifically, we prove a lower bound of on the convergence rate, that can also be achieved (up to a factor) by stochastic gradient descent algorithms. In particular, this shows that in the worst case, the magnitude of data-dependent parameters used in previous works are exponentially large in the diameter . The latter lower bound only applies for multi-dimensional regression (i.e., when ); surprisingly, in one-dimensional logistic regression we find a rate of to be tight. As far as we know, this is the first natural setting demonstrating such a phase transition in the optimal convergence rates, with respect to the dimensionality of the problem.

We also consider the closely-related online optimization setting, where on each round  an adversary chooses a certain logistic function and our goal is to minimize the -round regret, with respect to the best fixed decision chosen with the benefit of hindsight. In this setting, McMahan and Streeter (2012) investigated the one-dimensional case and showed that if the adversary is restricted to pick binary (i.e. ) labels, a simple follow-the-leader algorithm attains a regret bound of . This discovery led them to conjecture that bounds of the form should be achievable in the general multi-dimensional case with continuous labels set.

Our results extend to the online optimization setup and resolve the COLT 2012 open problem of McMahan and Streeter (2012) on the negative side. Namely, we show that as long as the number of rounds is not exponentially large in , an upper bound of cannot be attained in general. We obtain lower bounds on the regret of in the multi-dimensional case and  in the one-dimensional case, when allowing the adversary to use a continuous label set. We are not aware of any other natural problem that exhibits such a dichotomy between the minimax regret rates in the one-dimensional and multi-dimensional cases.

It is interesting to note that our bounds apply to a finite interval of time, namely when $T = O(e^{D})$, which is arguably the regime of interest for reasonable values of $D$. This is the reason our lower bounds do not contradict the known logarithmic regret bounds.

We prove the tightness of our one-dimensional lower bounds, in both the stochastic and online settings, by devising an online optimization algorithm specialized for one-dimensional online logistic regression that attains a regret of $O(D^3 T^{1/3})$. This algorithm maintains approximations of the observed logistic loss functions, and uses these approximate losses to form the next prediction by a follow-the-regularized-leader (FTRL) procedure. As opposed to previous works that utilize approximate losses based on local structure (Zinkevich, 2003; Hazan et al., 2007), we find it necessary to employ approximations that rely on the global structure of the logistic loss.

The rest of the paper is organized as follows. In Section 2 we describe the settings we consider and give the necessary background. We present our lower bounds in Section 3, and in Section 4 we prove our upper bound for one-dimensional logistic regression. In Section 5 we give complete proofs of our results. We conclude in Section 6.

## 2 Setting and Background

In this section we formalize the settings of stochastic logistic regression and online logistic regression and give the necessary background on both problems.

### 2.1 Stochastic Logistic Regression

In the problem of stochastic logistic regression, there is an unknown distribution over instances . For simplicity, we assume that . The goal of an optimization algorithm is to minimize the expected loss of a linear predictor ,

$$L(w) = \mathbb{E}_{x \sim \mathcal{D}}[\ell(w,x)] , \tag{1}$$

where $\ell$ is the logistic loss function (the logistic loss is commonly defined as for instances ; for ease of notation and without loss of generality, we ignore the variable in the instance by absorbing it into ),

$$\ell(w,x) = \log\left(1+\exp(x \cdot w)\right)$$

that expresses the negative log-likelihood of the instance under the logit model. While we may try to optimize over the entire Euclidean space, for generalization purposes we usually restrict the optimization domain to some bounded set. In this paper, we focus on optimizing the expected loss over the set , the Euclidean ball of radius . We define the excess loss of a linear predictor as the difference between the expected loss of and the expected loss of the best predictor in the class .
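As a minimal sketch of the objects just defined, the following Python snippet evaluates the logistic loss $\log(1+\exp(x\cdot w))$, averages it over a sample as a Monte Carlo stand-in for the expected loss in Eq. (1), and projects a predictor onto a Euclidean ball. The function names are illustrative and not part of the paper; the softplus rewriting is only there to avoid overflow.

```python
import math

def logistic_loss(w, x):
    """l(w, x) = log(1 + exp(x . w)) for equal-length sequences w and x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    # Stable softplus: log(1 + e^z) = max(z, 0) + log(1 + e^{-|z|}).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def empirical_loss(w, sample):
    """Average loss over a sample; a Monte Carlo estimate of L(w) in Eq. (1)."""
    return sum(logistic_loss(w, x) for x in sample) / len(sample)

def project_to_ball(w, radius):
    """Euclidean projection of w onto the ball of the given radius."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm <= radius:
        return list(w)
    return [wi * radius / norm for wi in w]
```

The excess loss of a predictor can then be estimated by comparing `empirical_loss` at that predictor with its value at the best point found inside the ball.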

An algorithm for the stochastic optimization problem, given a sample budget as a parameter, may use a sample of instances sampled independently from the distribution , and produce an approximate solution . The rate of convergence of the algorithm is then defined as the expected excess loss of the predictor , given by

$$\mathbb{E}[L(\bar{w}_T)] - \min_{w^* \in \mathcal{W}} L(w^*) ,$$

where the expectation is taken with respect to both the random choice of the training set and the internal randomization of the algorithm (which is allowed to be randomized).

### 2.2 Online Logistic Regression

Another optimization framework we consider is that of online logistic optimization, which we formalize as the following game between a player and an adversary. On each round of the game, the adversary first picks an instance , the player then chooses a linear predictor , observes and incurs loss

For simplicity we again assume that for all . The goal of the player is to minimize his regret with respect to a fixed prediction from the set , which is defined as

$$\mathrm{Regret}_T = \sum_{t=1}^{T} \ell(w_t,x_t) - \min_{w^* \in \mathcal{W}} \sum_{t=1}^{T} \ell(w^*,x_t) .$$
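The regret can be computed directly in the one-dimensional case. The sketch below is an illustration under stated assumptions, not the paper's procedure: it takes scalar predictors and instances, and finds the best fixed predictor in $[-D, D]$ by ternary search, which is valid because the cumulative logistic loss is convex in $w$. The helper names and the iteration count are hypothetical.

```python
import math

def loss(w, x):
    """One-dimensional logistic loss log(1 + exp(x * w)), computed stably."""
    z = w * x
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def best_fixed_loss(xs, D, iters=200):
    """Minimize the convex map w -> sum_t loss(w, x_t) over [-D, D] by ternary search."""
    total = lambda w: sum(loss(w, x) for x in xs)
    lo, hi = -D, D
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if total(m1) <= total(m2):
            hi = m2          # a minimizer lies in [lo, m2]
        else:
            lo = m1          # a minimizer lies in [m1, hi]
    return total((lo + hi) / 2)

def regret(ws, xs, D):
    """Cumulative loss of the played predictors minus that of the best fixed one."""
    player = sum(loss(w, x) for w, x in zip(ws, xs))
    return player - best_fixed_loss(xs, D)
```

For example, with all instances equal to 1 the best fixed predictor is $-D$, and a player who always plays $+D$ pays a regret of $2D$ per round.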

### 2.3 Information-theoretic Tools

As a part of our lower bound proofs, we utilize two impossibility theorems that assert the minimal number of samples needed in order to distinguish between two distributions. We prove the following lower bound on the performance of any algorithm for this task.

###### Theorem 1.

Assume a coin with bias either or , where , is given. Any algorithm that correctly identifies the coin’s bias with probability at least , needs no less than tosses.

The theorem applies to both deterministic and randomized algorithms; in the case of randomized algorithms, the probability is with respect to both the underlying distribution of the samples and the randomization of the algorithm. The proof of Theorem 1 is given, for completeness, in Appendix A.

## 3 Lower Bounds for Logistic Regression

In this section we derive lower bounds for the convergence rate of stochastic logistic regression. For clarity, we lower bound the number of observations required in order to attain excess loss of at most , which we directly translate to a bound for the convergence rate. The stochastic optimization lower bounds are then used to obtain corresponding bounds for the online setting.

In Section 3.1 we prove a lower bound for the one dimensional case, in Section 3.2 we prove another lower bound for the multidimensional case, and in Section 3.3 we present our lower bounds for the online setting.

### 3.1 One-dimensional Lower Bound for Stochastic Optimization

We now show that any algorithm for one-dimensional stochastic optimization with logistic loss must observe at least $\Omega(D/\epsilon^{1.5})$ instances before it provides a predictor with expected excess loss $\epsilon$. This directly translates to a convergence rate of $\Omega(D^{2/3}/T^{2/3})$. Formally, the main theorem of this section is the following.

###### Theorem 2.

Consider the one dimensional stochastic logistic regression setting with a fixed sample budget . For any algorithm there exists a distribution for which the expected excess loss of ’s output is at least .

The proof of Theorem 2 is given at the end of this section; here we give an informal proof sketch. Consider distributions over the two-element set $\{1-\theta/2, -\theta\}$. For and , the losses of these instances are approximately linear/quadratic with opposed slopes (see Fig. 1(a)). Consequently, we can build a distribution with an expected loss which is quadratic in $w$; upon perturbing the latter distribution by $\pm\epsilon/D$ we get two distributions with expected losses that are approximately linear in $w$ with slopes $\pm\epsilon/D$ (see Fig. 1(b)). An algorithm that attains a low expected excess loss on both these distributions can be used to distinguish between them; we then utilize an information-theoretic impossibility theorem to bound the number of observations needed in order to distinguish between two distributions.

In Fig. 2 we present two distributions, which we denote by and . We denote by the expected logistic loss of a predictor with respect to , i.e.,

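$$L_\chi(w) = \mathbb{E}_{\mathcal{D}_\chi}[\ell(w,x)] = \left(\tfrac{\theta}{2}+\chi\tfrac{\epsilon}{D}\right)\ell\!\left(w,1-\tfrac{\theta}{2}\right) + \left(1-\tfrac{\theta}{2}-\chi\tfrac{\epsilon}{D}\right)\ell(w,-\theta) , \qquad \chi\in\{-1,1\} .$$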

The following lemma states that it is impossible to attain a low expected excess loss on both distributions simultaneously. Here we only give a sketch of the proof; the complete proof is deferred to Section 5.1.

###### Lemma 3.

Given and , consider the distributions defined in Fig. 2. Then the following holds:

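$$
\begin{aligned}
L_+(w)-\min_{w^*\in\mathcal{W}}L_+(w^*) &\ge \epsilon/20 , && \forall\, w\in\left[\tfrac{3}{4}D,\,D\right] , \\
L_-(w)-\min_{w^*\in\mathcal{W}}L_-(w^*) &\ge \epsilon/20 , && \forall\, w\in\left[-D,\,\tfrac{3}{4}D\right] .
\end{aligned}
$$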
###### Proof (sketch).

First we show that for , the losses of the instances are approximately linear/quadratic, i.e.,

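$$
\begin{aligned}
\left|\ell\!\left(w,1-\tfrac{\theta}{2}\right)-\left(1-\tfrac{\theta}{2}\right)w\right| &\le \tfrac{\epsilon}{40} , && \forall\, w\in\left[\tfrac{1}{2}D,\,D\right] , \\
\left|\ell(w,-\theta)-\left(\log 2-\tfrac{\theta}{2}w+\tfrac{1}{8}(\theta w)^2\right)\right| &\le \tfrac{\epsilon}{40} , && \forall\, w\in\left[\tfrac{1}{2}D,\,D\right] .
\end{aligned}
$$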

Using the above approximations and , we show that and for , where “” denotes equality up to an additive term of . Thus,

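$$
\begin{aligned}
L_+(w)-\min_{w^*\in\mathcal{W}}L_+(w^*) &\ge L_+(w)-L_+(D/2) \ge \epsilon/20 , && \forall\, w\in\left[\tfrac{3}{4}D,\,D\right] , \\
L_-(w)-\min_{w^*\in\mathcal{W}}L_-(w^*) &\ge L_-(w)-L_-(D) \ge \epsilon/20 , && \forall\, w\in\left[\tfrac{1}{2}D,\,\tfrac{3}{4}D\right] .
\end{aligned}
$$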

Showing that is monotonically decreasing in , extends the latter inequality to . ∎
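The statement of Lemma 3 can be sanity-checked numerically. The sketch below builds the two expected losses $L_\pm$ from their definition and compares a predictor against the best point on a grid over $[-D, D]$. It assumes $\theta = \sqrt{\epsilon}/D$, a value inferred from the constants used in the proof of Theorem 2 rather than read off Fig. 2 (which is not reproduced here); the function names and grid size are likewise illustrative.

```python
import math

def loss(w, x):
    """One-dimensional logistic loss log(1 + exp(x * w)), computed stably."""
    z = w * x
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def expected_loss(w, chi, D, eps):
    """L_chi(w) for the two-point distribution over {1 - theta/2, -theta}."""
    theta = math.sqrt(eps) / D          # assumed value of theta (see lead-in)
    p = theta / 2 + chi * eps / D       # probability of the instance 1 - theta/2
    return p * loss(w, 1 - theta / 2) + (1 - p) * loss(w, -theta)

def excess_loss(w, chi, D, eps, grid=4001):
    """Excess of w over the best grid point in [-D, D].

    The grid minimum can only overestimate the true minimum, so this
    value never exceeds the true excess loss.
    """
    ws = [-D + 2 * D * i / (grid - 1) for i in range(grid)]
    best = min(expected_loss(v, chi, D, eps) for v in ws)
    return expected_loss(w, chi, D, eps) - best
```

With, say, $D = 20$ and $\epsilon = 0.04$, one can evaluate `excess_loss(w, +1, D, eps)` for $w \in [\tfrac34 D, D]$ and `excess_loss(w, -1, D, eps)` for $w \in [-D, \tfrac34 D]$ and compare both against $\epsilon/20$.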

We are now ready to prove Theorem 2.

###### Proof of Theorem 2.

Consider some algorithm ; we will show that if observes samples from a distribution which is either or , then the expected excess loss that can guarantee is lower bounded by .

The excess loss is non negative; therefore, if guarantees an expected excess loss smaller than , then by Markov’s inequality it achieves an excess loss smaller than , w.p. . Denoting by the predictor that outputs after samples, then according to Lemma 3, attaining an excess loss smaller than on the distribution (respectively ) implies (respectively ).

Since achieves an excess loss smaller than w.p. for any distribution we can use its output to identify the right distribution w.p. . This can be done as follows:

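$$\text{If } \bar{w}_T \le \tfrac{3}{4}D, \text{ return “}\mathcal{D}_+\text{”} ; \qquad \text{if } \bar{w}_T > \tfrac{3}{4}D, \text{ return “}\mathcal{D}_-\text{”} .$$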

According to Theorem 1, distinguishing between these two distributions (“coins”) w.p. requires the number of observations to be lower bounded as follows:

$$T \ge \frac{\theta/2-\epsilon/D}{16\,(2\epsilon/D)^2} \ge \frac{1}{256}\,\frac{D}{\epsilon^{1.5}} ,$$

We used as a lower bound on the bias of ; since and it follows that . We also used as the bias between the “coins” , . Using the above inequality together with yields a lower bound of on the expected excess loss. ∎
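To make the arithmetic behind the last display explicit, the following worked chain assumes $\theta = \sqrt{\epsilon}/D$ (the value consistent with the constants above; $\theta$ itself is fixed in Fig. 2, which is not reproduced here) and uses $\theta/2-\epsilon/D \ge \theta/4$, which under that assumption holds whenever $\epsilon \le 1/16$:

$$
\frac{\theta/2-\epsilon/D}{16\,(2\epsilon/D)^2}
\;\ge\; \frac{\theta/4}{64\,\epsilon^2/D^2}
\;=\; \frac{\theta D^2}{256\,\epsilon^2}
\;=\; \frac{D}{256\,\epsilon^{1.5}} ,
\qquad\text{and}\qquad
T \ge \frac{D}{256\,\epsilon^{1.5}}
\;\Longleftrightarrow\;
\epsilon \ge \left(\frac{D}{256\,T}\right)^{2/3} .
$$

Read this way, a budget of $T$ samples cannot resolve accuracies below the order of $D^{2/3}T^{-2/3}$, which gives the scaling of the lower bound in $D$ and $T$ (the constant lost in the Markov step aside).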

### 3.2 Multidimensional Lower Bound for Stochastic Optimization

We now construct two distributions over instance vectors from the unit ball of $\mathbb{R}^2$, and prove that any algorithm that attains an expected excess loss of at most $\epsilon$ on both distributions requires $\Omega(D/\epsilon^2)$ samples in the worst case. This directly translates to a convergence rate of $\Omega(\sqrt{D/T})$. For $n>2$ dimensions, we can embed the same construction in the unit ball of $\mathbb{R}^n$, thus our bound holds in any dimension greater than one. The main theorem of this section is the following.

###### Theorem 4.

Consider the multidimensional stochastic logistic regression setting with and a fixed sample budget . For any algorithm there exists a distribution such that the expected excess loss of ’s output is at least .

Theorem 4 is proved at the end of this section. Here we give an informal description of the proof.

Consider distributions that choose instances among the set depicted in Fig. 3. The shaded areas in Fig. 3 depict regions in the domain where either or is approximately linear. The dark area represents the region in which both loss functions are approximately linear. By setting the probability of much larger than the others we can construct a distribution over the instances such that the minimum of the induced expected loss function lies in the dark area. Perturbing this distribution by over the odds of choosing we attain two distributions whose induced expected losses are almost linear over in the dark area, with opposed slopes. An algorithm that attains a low expected excess loss on both distributions can be used to distinguish between them. This allows us to use information-theoretic arguments to lower bound the number of samples needed for the optimization algorithm.

In Fig. 4 we present the distributions . We denote by and the expected loss functions induced by and respectively, that are given by

$$L_\chi(w) = p\cdot\ell(w,x_0) + \frac{1+\chi\epsilon}{2}(1-p)\cdot\ell(w,x_l) + \frac{1-\chi\epsilon}{2}(1-p)\cdot\ell(w,x_r) , \qquad \chi\in\{-1,1\} .$$

In the following lemma we state that it is impossible to attain a low expected excess loss on both distributions simultaneously.

###### Lemma 5.

Given and , consider as defined in Fig. 4. Then the following holds:

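$$
\begin{aligned}
L_+(w)-\min_{w^*\in\mathcal{W}}L_+(w^*) &\ge \epsilon/20 , && \forall\, w : w[1]\le 0 , \quad\text{and} \\
L_-(w)-\min_{w^*\in\mathcal{W}}L_-(w^*) &\ge \epsilon/20 , && \forall\, w : w[1]\ge 0 .
\end{aligned}
$$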

Here we only give a sketch of the proof; for the complete proof, refer to Section 5.2.

###### Proof (sketch).

Let be the unperturbed () version of , i.e.,

$$L_0(w) = p\,\ell(w,x_0) + \frac{1-p}{2}\,\ell(w,x_l) + \frac{1-p}{2}\,\ell(w,x_r) .$$

Note that is constructed such that its minimum is attained at , which belongs to the shaded area in Fig. 3. Thus, in the neighborhood of this minimum both are approximately linear. Using linear approximations of around , we show that the value of at is smaller by than the minimal value of , hence

$$\min_{w^*\in\mathcal{W}}L_+(w^*) \le L_+(w_a) \le L_0(w_0)-\epsilon/20 . \tag{2}$$

Moreover, is shown to be the sum of and a function which is positive whenever , thus

$$L_+(w) \ge L_0(w) , \qquad \forall\, w : w[1]\le 0 . \tag{3}$$

Combining Eqs. 3 and 2 we get

$$L_+(w)-\min_{w^*\in\mathcal{W}}L_+(w^*) \ge L_0(w)-\left(L_0(w_0)-\epsilon/20\right) \ge \epsilon/20 , \qquad \forall\, w : w[1]\le 0 ,$$

where the last inequality follows from being the minimizer of . A similar argument shows that for predictors such that , it holds that . ∎

For the proof of Theorem 4 we require a lemma that lower-bounds the minimal number of samples needed in order to distinguish between the distributions defined in Fig. 4. To this end, we use the following modified version of Theorem 1.

###### Lemma 6.

Let . Consider a distribution supported on three atoms with probabilities , with being either or . Any algorithm that identifies the distribution correctly with probability at least , needs no less than samples.

Lemma 6 can be proved similarly to Theorem 1 (see Appendix A). We are now ready to prove Theorem 4.

###### Proof of Theorem 4.

Consider some algorithm ; we will show that if observes samples from a distribution which is either or , then the expected excess loss that can guarantee is lower bounded by .

The excess loss is non negative; therefore if guarantees an expected excess loss smaller than , then by Markov’s inequality it achieves an excess loss smaller than , w.p. . Denoting by the predictor that outputs after samples, then according to Lemma 5, attaining an excess loss smaller than on distribution (respectively ) implies (respectively ).

Since achieves an excess loss smaller than w.p. for any among we can use its output to identify the right distribution w.p. . This can be done as follows:

$$\text{if } \bar{w}_T[1] \ge 0, \text{ return “}\mathcal{D}_+\text{”} ; \qquad \text{if } \bar{w}_T[1] < 0, \text{ return “}\mathcal{D}_-\text{”} .$$

According to Lemma 6, distinguishing between these two distributions w.p. requires the number of observations to be lower bounded as follows:

$$T \ge \frac{0.5\,(1-\epsilon)}{16\,(1-p)\,(2\epsilon)^2} \ge \frac{D}{256}\,\frac{1}{\epsilon^2} ,$$

We used as a lower bound on the bias of distribution conditioned that the instance was not chosen; since , it follows that . We also used as the bias between the distributions and conditioned that the label was not chosen. Finally we used . The above inequality together with yields a lower bound of on the expected excess loss. ∎
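The last step, passing from a sample bound to a bound on the accuracy, is a one-line inversion of the final inequality in the display above:

$$
T \ge \frac{D}{256\,\epsilon^{2}}
\;\Longleftrightarrow\;
\epsilon \ge \sqrt{\frac{D}{256\,T}} = \frac{1}{16}\sqrt{\frac{D}{T}} .
$$

So a budget of $T$ samples cannot resolve accuracies below the order of $\sqrt{D/T}$, which gives the scaling of the lower bound in $D$ and $T$ (again up to the constant lost in the Markov step).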

### 3.3 Lower Bounds for Online Optimization

In Section 3 we proved two lower bounds for the convergence rate of stochastic logistic regression. Standard online-to-batch conversion (Cesa-Bianchi et al., 2004) shows that any online algorithm attaining a regret of can be used to attain a convergence rate of for stochastic optimization. Hence, the lower bounds stated in Theorems 4 and 2 imply the following:

###### Corollary 7.

Consider the one dimensional online logistic regression setting with . For any algorithm there exists a sequence of loss functions such that suffers a regret of at least .

###### Corollary 8.

Consider the multidimensional online logistic regression setting with , . For any algorithm there exists a sequence of loss functions such that suffers a regret of at least .
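The online-to-batch conversion invoked above can be illustrated with a short one-dimensional sketch: run an online learner over the sample, one instance per round, and output the average of the predictors it played. Projected online gradient descent is used here purely as a stand-in learner (it is not the paper's Algorithm 1), and the default step size $D/\sqrt{T}$ is an assumption made for the illustration.

```python
import math

def grad(w, x):
    """Derivative in w of log(1 + exp(x * w)), i.e. x * sigmoid(x * w)."""
    z = x * w
    if z >= 0:
        s = 1.0 / (1.0 + math.exp(-z))
    else:
        s = math.exp(z) / (1.0 + math.exp(z))
    return x * s

def online_to_batch(sample, D, step=None):
    """Run projected online gradient descent on [-D, D]; return the averaged iterate."""
    T = len(sample)
    eta = step if step is not None else D / math.sqrt(T)   # assumed step size
    w, total = 0.0, 0.0
    for x in sample:
        total += w                                  # w_t is fixed before x_t is used
        w = max(-D, min(D, w - eta * grad(w, x)))   # gradient step, then projection
    return total / T
```

By convexity of the loss, the expected excess loss of the averaged predictor is at most the learner's expected regret divided by $T$, which is how regret lower bounds follow from the stochastic ones.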

## 4 Upper Bound for One-dimensional Regression

In this section we consider online logistic regression in one dimension; here an adversary chooses instances , then a learner chooses predictors , and suffers a logistic loss . We provide an upper bound of for logistic online regression in one dimension, thus showing that the lower bound found in Theorem 2 is tight. Formally, we prove:

###### Theorem 9.

Consider the one dimensional online regression with logistic loss. Then a player that chooses predictors according to Algorithm 1 with and , achieves the following guarantee:

$$\mathrm{Regret}_T = \sum_{t=1}^{T}\log\left(1+e^{x_t w_t}\right) - \min_{w\in\mathcal{W}}\sum_{t=1}^{T}\log\left(1+e^{x_t w}\right) = O(D^3 T^{1/3}) .$$

Using standard online-to-batch conversion techniques (Cesa-Bianchi et al., 2004), we can translate the upper bound given in the above theorem to an upper bound for stochastic optimization.

###### Corollary 10.

Consider the one dimensional stochastic logistic regression setting with and a budget of samples. Then for any distribution over instances, an algorithm that chooses predictors according to Algorithm 1 with and outputs , achieves the following guarantee:

$$\mathbb{E}[L(\bar{w}_T)] - \min_{w^*\in[-D,D]} L(w^*) = O(D^3/T^{2/3}) .$$

Following Zinkevich (2003) and Hazan et al. (2007), we approximate the losses received from the adversary, and use the approximate losses in a follow-the-regularized-leader (FTRL) procedure in order to choose the predictors.

First note the following lemma due to Zinkevich (2003) (proof is found in Hazan et al. (2007)):

###### Lemma 11.

Let be an arbitrary sequence of loss functions, and let . Let be a sequence of loss functions that satisfy , and for all . Then

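$$\sum_{t=1}^{T}\ell_t(w_t) - \min_{w\in\mathcal{K}}\sum_{t=1}^{T}\ell_t(w) \;\le\; \sum_{t=1}^{T}\tilde{\ell}_t(w_t) - \min_{w\in\mathcal{K}}\sum_{t=1}^{T}\tilde{\ell}_t(w) .$$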

Thus, the regret on the original losses is bounded by the regret of the approximate losses. For the logistic losses, , we define approximate losses that satisfy the conditions of the last lemma. Depending on , we divide into 3 cases:

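$$
\tilde{\ell}_t(w) =
\begin{cases}
a_0 + y_t w + \frac{\beta}{2}\,y_t^2 w^2\,\mathbb{1}_{w\le 0} & \text{if } w_t \ge 0 \text{ and } x_t \ge \frac{1}{D} ; \\[4pt]
a_0 + y_t w + \frac{\beta}{2}\,y_t^2 w^2\,\mathbb{1}_{w\ge 0} & \text{if } w_t \le 0 \text{ and } x_t \le -\frac{1}{D} ; \\[4pt]
a_0 + y_t w + \frac{\beta}{2}\,y_t^2 (w-w_t)^2 & \text{if } |x_t| \le \frac{1}{D} \text{ or } x_t w_t \le 0 ,
\end{cases}
\tag{4}
$$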

where,

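$$y_t = \left.\frac{\partial \ell(w,x_t)}{\partial w}\right|_{w_t} = g_t x_t , \qquad g_t = \frac{e^{x_t w_t}}{1+e^{x_t w_t}} , \qquad \beta = 1/8D , \qquad a_0 = \log\left(1+e^{x_t w_t}\right) - g_t x_t w_t .$$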

Thus, if or , then we use a quadratic approximation, else we use a loss that changes from linear to quadratic on . Note that if the approximation loss is partially linear, then the magnitude of its slope is greater than .
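The case analysis of Eq. (4) translates into a few lines of Python. This is a sketch under stated assumptions: the constant is read as $\beta = 1/(8D)$, the quadratic case is tested first (the listed conditions overlap only on boundary inputs), and the function names are illustrative rather than taken from the paper.

```python
import math

def logistic(w, x):
    """One-dimensional logistic loss log(1 + exp(x * w)), computed stably."""
    z = w * x
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def approx_loss(w, w_t, x_t, D):
    """Evaluate the approximate loss of Eq. (4) at w, built around the played point w_t."""
    z = x_t * w_t
    g_t = 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))
    y_t = g_t * x_t                      # derivative of the logistic loss at w_t
    beta = 1.0 / (8.0 * D)               # assumption: "1/8D" read as 1/(8D)
    a_0 = logistic(w_t, x_t) - y_t * w_t
    linear = a_0 + y_t * w               # tangent line of the loss at w_t
    if abs(x_t) <= 1.0 / D or x_t * w_t <= 0:
        # Third case: quadratic around w_t everywhere.
        return linear + 0.5 * beta * y_t ** 2 * (w - w_t) ** 2
    if w_t >= 0 and x_t >= 1.0 / D:
        # First case: linear for w > 0, quadratic around 0 for w <= 0.
        return linear + (0.5 * beta * y_t ** 2 * w ** 2 if w <= 0 else 0.0)
    # Second case (w_t <= 0 and x_t <= -1/D): quadratic around 0 for w >= 0.
    return linear + (0.5 * beta * y_t ** 2 * w ** 2 if w >= 0 else 0.0)
```

In every case the approximation agrees with the true loss at the played point, since $a_0 + y_t w_t = \log(1+e^{x_t w_t})$ and the quadratic term vanishes there.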

The approximations are depicted in Fig. 5. In Fig. 5(a) the approximate loss changes from linear to quadratic in , where in Fig. 5(b) the approximate loss is quadratic everywhere. The following technical lemma states that the losses satisfy the conditions of Lemma 11.

###### Lemma 12.

Assume that . Let be a sequence of logistic loss functions and let . The approximate losses defined above satisfy and for all .

Lemma 12 is proved in Section 5.4. We are now ready to describe our algorithm that obtains a regret of for one-dimensional online regression, given in Algorithm 1.

We conclude with a proof sketch of Theorem 9; the complete proof is deferred to Section 5.3.

###### Proof of Theorem 9 (sketch).

First we show that the regret of Algorithm 1 is upper bounded by the sum of differences , and then divide the analysis into two cases. In the first case we show that the accumulated regret in rounds where is quadratic around is upper bounded by . The second case analyses rounds in which is linear around ; due to the regularization, in the first such rounds our regret is bounded by and if the number of such rounds is greater than we show that the quadratic part of the accumulated losses is large enough so the above sum of differences is smaller than . Since the approximations may change from linear to quadratic in , our analysis splits into two cases: the case where consecutive predictors have the same sign, and the case where they have opposite signs. ∎

## 5 Proofs

### 5.1 Proof of Lemma 3

###### Proof.

We assume that the following holds:

$$\Omega(e^{-D}) = 40\,e^{-0.45D} \le \epsilon \le \tfrac{1}{25} .$$

In the proof we use the following:

$$\theta D \le 0.2 ; \qquad 1-\tfrac{\theta}{2} \ge 0.9 ,$$

the first follows since: , combining the latter with we get . Next we prove the lemma in three steps:

##### Step 1: Linear/quadratic approximation in [D/2,D].

We show that for , the logistic losses of the instances are linear/quadratic, up to an additive term of :

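$$
\begin{aligned}
\left|\ell\!\left(w,1-\tfrac{\theta}{2}\right)-\left(1-\tfrac{\theta}{2}\right)w\right|
&= \log\left(1+e^{-(1-\theta/2)w}\right) \le e^{-(1-\theta/2)w} \le e^{-0.45D} \le \Delta ,
&& \forall\, w\in[D/2,D] \qquad (5) \\
\left|\ell(w,-\theta)-\left(\log 2-\tfrac{\theta}{2}w+\tfrac{(\theta w)^2}{8}\right)\right|
&\le \max_{\bar{w}\in[-D,D]}\frac{(\theta\bar{w})^4}{192} \le \frac{(\theta D)^4}{192} \le \Delta ,
&& \forall\, w\in[-D,D] \qquad (6)
\end{aligned}
$$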

recalling , in the first equality of Eq. 5 we used , next we used , and finally we used and . In Eq. 6 we used the second-order Taylor approximation of the loss around , and the RHS of the second inequality is an upper bound on the error of this approximation. We define ; using , and we can bound:

$$\Delta \le \epsilon/40 .$$
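The two approximation bounds of Step 1 are easy to check numerically. The sketch below computes the gap between each logistic loss and its linear or quadratic surrogate; it assumes $\theta = \sqrt{\epsilon}/D$ (inferred from the constants used in the proof of Theorem 2, not stated in this excerpt), and the function names are illustrative.

```python
import math

def softplus(z):
    """log(1 + exp(z)), computed stably."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def linear_gap(w, theta):
    """|l(w, 1 - theta/2) - (1 - theta/2) w|, the left-hand side of Eq. (5)."""
    x = 1 - theta / 2
    return abs(softplus(x * w) - x * w)

def quadratic_gap(w, theta):
    """|l(w, -theta) - (log 2 - theta w / 2 + (theta w)^2 / 8)|, the left-hand side of Eq. (6)."""
    u = theta * w
    return abs(softplus(-u) - (math.log(2) - u / 2 + u * u / 8))
```

For instance, with $D = 20$ and $\epsilon = 0.04$, `linear_gap(w, theta)` can be compared with $e^{-0.45D}$ on $[D/2, D]$ and `quadratic_gap(w, theta)` with $(\theta D)^4/192$ on $[-D, D]$; both thresholds are themselves below $\epsilon/40$ for these values.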
##### Step 2: proving the lemma for w∈[D/2,D].

Recall the notation for the expected losses according to ; using Eqs. 6 and 5, we can write:

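$$
\begin{aligned}
L_+(w) &= \left(\tfrac{\theta}{2}+\tfrac{\epsilon}{D}\right)\left(1-\tfrac{\theta}{2}\right)w + \left(1-\tfrac{\theta}{2}-\tfrac{\epsilon}{D}\right)\left(\log 2-\tfrac{\theta}{2}w+\tfrac{(\theta w)^2}{8}\right) \pm \Delta \\
&= \tfrac{\epsilon}{D}\,w + \left(1-\tfrac{\theta}{2}-\tfrac{\epsilon}{D}\right)\log 2 + \left(1-\tfrac{\theta}{2}-\tfrac{\epsilon}{D}\right)\tfrac{(\theta w)^2}{8} \pm \Delta , \qquad \forall\, w\in[D/2,D] .
\end{aligned}
$$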

Using the latter expression for we can bound the excess loss for as follows:

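$$
\begin{aligned}
L_+(w)-\min_{w^*\in\mathcal{W}}L_+(w^*) &\ge L_+(w)-L_+(D/2) \\
&\ge \tfrac{\epsilon}{D}\left(w-\tfrac{D}{2}\right) + \tfrac{\theta^2}{8}\left(1-\tfrac{\theta}{2}-\tfrac{\epsilon}{D}\right)\left(w^2-\tfrac{D^2}{4}\right) - 2\Delta \\
&\ge \tfrac{\epsilon}{D}\left(w-\tfrac{D}{2}\right) + \tfrac{\theta^2}{10}\left(w^2-\tfrac{D^2}{4}\right) - 2\Delta ,
\end{aligned}
$$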

where in the last inequality we used and . Hence, for , we have

$$L_+(w)-\min_{w^*}L_+(w^*) \ge \frac{\epsilon}{4} + \frac{\theta^2}{10}\cdot\frac{5D^2}{16} - 2\Delta$$