# A Stochastic Line Search Method with Convergence Rate Analysis

Courtney Paquette, Department of Industrial and Systems Engineering, Lehigh University, Harold S. Mohler Laboratory, 200 West Packer Avenue, Bethlehem, PA 18015-1582, USA. cop318@lehigh.edu. The work of this author was partially supported by NSF TRIPODS Grant 17-40796 and DMS 18-03289.

Katya Scheinberg, Department of Industrial and Systems Engineering, Lehigh University, Harold S. Mohler Laboratory, 200 West Packer Avenue, Bethlehem, PA 18015-1582, USA. katyas@lehigh.edu. The work of this author was partially supported by NSF Grants CCF 16-18717 and TRIPODS 17-40796, and DARPA Lagrange award HR-001117S0039.

###### Abstract

For deterministic optimization, line-search methods augment algorithms by providing stability and improved efficiency. We adapt a classical backtracking Armijo line-search to the stochastic optimization setting. While traditional line-search relies on exact computations of the gradient and values of the objective function, our method assumes that these values are available up to some dynamically adjusted accuracy which holds with some sufficiently large, but fixed, probability. We show that the expected number of iterations to reach a near-stationary point matches the worst-case efficiency of typical first-order methods, while for convex and strongly convex objectives, the method achieves the rates of deterministic gradient descent in function values.

## 1 Introduction

In this paper we consider the classical stochastic optimization problem

$$\min_{x\in\mathbb{R}^n} \left\{ f(x) = \mathbb{E}[\tilde{f}(x;\xi)] \right\}, \tag{1.1}$$

where $\xi$ is a random variable obeying some distribution. In the case of empirical risk minimization with a finite training set, $\xi$ is a random variable that is defined by a single random sample drawn uniformly from the training set. More generally, $\xi$ may represent a sample or a set of samples drawn from the data distribution.

The most widely used method to solve (1.1) is stochastic gradient descent (SGD) [16]. Due to its low iteration cost, SGD is often preferred to the standard gradient descent (GD) method for empirical risk minimization. Despite the prevalent use of SGD, it has known challenges and inefficiencies. First, the direction may not represent a descent direction, and second, the method is sensitive to the step-size (learning rate), which is often poorly overestimated. Various authors have attempted to address this last issue, see [8, 10, 12, 13]. Motivated by these facts, we turn to the deterministic optimization approach for adaptively selecting step sizes: GD with Armijo back-tracking line-search.

#### Related work.

In [4] and [9] a practical back-tracking line search is proposed, combined with their sample size selection. In both cases the backtracking is based on the Armijo line search condition applied to function estimates that are computed on the same batch as the gradient estimates, and is essentially a heuristic. A very different type of line-search, based on a probabilistic Wolfe condition, is proposed in [14]; however, it aims at improving step size selection for SGD and has no theoretical guarantees.

#### Our contribution.

In this work we propose an adaptive backtracking line-search method, where the sample sizes for gradient and function estimates are chosen adaptively using knowable quantities along with the step-size. We show that this method converges to the optimal solution with probability one and derive strong convergence rates that match those of the deterministic gradient descent methods in the nonconvex, convex, and strongly convex cases. This paper offers the first stochastic line search method with convergence rate analysis, and is the first to provide convergence rate analysis for adaptive sample size selection based on knowable quantities.

#### Background.

There are many types of (deterministic) line-search methods, see [15, Chapter 3], but all share a common philosophy. First, at each iteration, the method computes a search direction $d_k$ by e.g. the gradient or (quasi) Newton directions. Next, it determines how far to move in that direction through the univariate function $\phi(\alpha) = f(x_k + \alpha d_k)$, to find the stepsize $\alpha_k$. Typical line-searches try out a sequence of potential values for the stepsize, accepting $\alpha$ once some verifiable criterion is satisfied. One popular line-search criterion specifies that an acceptable step length should give sufficient decrease in the objective function $f$:

(Armijo condition [1])

$$f(x_k + \alpha d_k) \le f(x_k) - \theta\alpha\|\nabla f(x_k)\|^2, \tag{1.2}$$

where the constant $\theta \in (0,1)$ is chosen by the user and $d_k = -\nabla f(x_k)$. Larger step sizes imply larger gains towards optimality and lead to fewer overall iterations. When step sizes get too small or, worse, shrink to $0$, no progress is made and the algorithm stagnates. A popular way to systematically search the domain of $\alpha$ while simultaneously preventing small step sizes is backtracking. Backtracking starts with an overestimate of $\alpha$ and decreases it until (1.2) becomes true. Our exposition is on a stochastic version of backtracking using the stochastic gradient estimate as a search direction and stochastic function estimates in (1.2). In the remainder of the paper, all random quantities will be denoted by capitalized letters and their respective realizations by corresponding lower case letters.
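The deterministic backtracking procedure described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the function name `armijo_backtracking`, the halving factor `shrink`, the value of `theta`, and the test objective are all assumptions chosen for the example.

```python
def armijo_backtracking(f, grad, x, alpha0=1.0, theta=0.1, shrink=0.5,
                        max_backtracks=50):
    """Backtrack from alpha0 until the Armijo condition (1.2) holds
    along the steepest-descent direction d = -grad(x)."""
    g = grad(x)
    g_norm_sq = sum(gi * gi for gi in g)
    fx = f(x)
    alpha = alpha0
    for _ in range(max_backtracks):
        trial = [xi - alpha * gi for xi, gi in zip(x, g)]
        # Sufficient decrease test: f(x + alpha d) <= f(x) - theta * alpha * ||grad||^2
        if f(trial) <= fx - theta * alpha * g_norm_sq:
            return alpha, trial
        alpha *= shrink  # step was too long: shrink and retry
    return alpha, list(x)  # budget exhausted: stay at the current point


# Hypothetical test problem: f(x) = 0.5 * (x1^2 + 10 * x2^2).
f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)
grad = lambda x: [x[0], 10.0 * x[1]]

alpha, x_new = armijo_backtracking(f, grad, [1.0, 1.0])
```

On this example the trial steps $1$, $0.5$ and $0.25$ are rejected and $\alpha = 0.125$ is accepted.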

## 2 Stochastic back-tracking line search method

We present here our main algorithm for GD with back-tracking line search. We impose the standard assumption on the objective function.

###### Assumption 2.1.

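We assume that all iterates $x_k$ of Algorithm 1 satisfy $x_k \in \Omega$, where $\Omega$ is a set in $\mathbb{R}^n$. Moreover, the gradient of $f$ is $L$-Lipschitz continuous for all $x \in \Omega$ and

$$f_{\min} \le f(x), \quad \text{for all } x \in \Omega.$$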

### 2.1 Outline of method

At each iteration, our scheme computes a random direction $g_k$ via e.g. a minibatch stochastic gradient estimate or sampling the function itself and using finite differences. Then, we compute stochastic function estimates at the current iterate and the prospective new iterate, resp. $f_k^0$ and $f_k^s$. We check the Armijo condition [1] using the stochastic estimates

$$f_k^s \le f_k^0 - \theta\alpha_k\|g_k\|^2. \tag{2.1}$$

If (2.1) holds, the next iterate becomes $x_{k+1} = x_k - \alpha_k g_k$ and the stepsize $\alpha_k$ increases; otherwise $x_{k+1} = x_k$ and $\alpha_k$ decreases, as is typical in (deterministic) back-tracking line searches.

Algorithm 1 describes our method. (We state the algorithm using lower case notation to represent a realization of the algorithm.) Unlike classical back-tracking line search, there is an additional control, $\delta_k$, which serves as a guess of the true function decrease and controls the accuracy of the function estimates. We discuss this further next.

#### Challenges with randomized line-search.

Due to the stochasticity of the gradient and/or function values, two major challenges result:

• a series of erroneous unsuccessful steps causes $\alpha_k$ to become arbitrarily small;

• steps may falsely satisfy (2.1), leading to an objective value at the next iterate that is arbitrarily larger than at the current iterate.

Convergence proofs for deterministic line searches rely on the fact that neither of the above problems arises. Our approach controls the probability with which the random gradients and function values are representative of their true counterparts. When this probability is large enough, the method tends to make successful steps when $\alpha_k$ is sufficiently small; hence $\alpha_k$ behaves like a random walk with an upward drift, thus staying away from $0$.

Yet, even when the probability of good gradients/function estimates is near 1, it is not guaranteed that the objective decreases in expectation at each iteration, due to the second issue: a possible arbitrary increase of the objective. Since the random gradient may not be representative of the true gradient, the function estimate accuracy, and thus the expected improvement, needs to be controlled by a different quantity, $\delta_k^2$. When the predicted decrease in the true function matches the expected function estimate accuracy ($\alpha_k\|g_k\|^2 \ge \delta_k^2$), we call the step reliable and increase the parameter $\delta_k$ for the next iteration; otherwise our prediction does not match the expectation and we decrease $\delta_k$.
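The interplay between successful, unsuccessful, and reliable steps can be sketched as below. This is only an illustration under stated assumptions: the listing of Algorithm 1 is not reproduced in this text, so the update factor `gamma`, the cap `alpha_max`, the exact update rules for $\alpha_k$ and $\delta_k^2$, and the oracle names `grad_est` and `f_est` are hypothetical. The oracles here have fixed accuracy; adapting their accuracy to $\alpha_k\|g_k\|$ and $\delta_k$ is not shown.

```python
import random


def stochastic_line_search(grad_est, f_est, x0, alpha0=1.0, alpha_max=1.0,
                           delta0=1.0, theta=0.1, gamma=2.0, iters=200):
    """Sketch of a stochastic backtracking loop with a reliability control.

    grad_est(x) and f_est(x) are user-supplied (possibly noisy) oracles.
    """
    x, alpha, delta_sq = list(x0), alpha0, delta0 ** 2
    for _ in range(iters):
        g = grad_est(x)                                  # direction estimate g_k
        g_sq = sum(gi * gi for gi in g)
        trial = [xi - alpha * gi for xi, gi in zip(x, g)]
        f0, fs = f_est(x), f_est(trial)                  # estimates f_k^0, f_k^s
        if fs <= f0 - theta * alpha * g_sq:              # condition (2.1): successful
            reliable = alpha * g_sq >= delta_sq          # predicted decrease vs. delta_k^2
            x = trial
            alpha = min(alpha_max, gamma * alpha)
            delta_sq = gamma * delta_sq if reliable else delta_sq / gamma
        else:                                            # unsuccessful: stay and shrink
            alpha /= gamma
            delta_sq /= gamma
    return x, alpha, delta_sq


def f(x):
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


def grad(x):
    return [x[0], 10.0 * x[1]]


# Noisy oracles with fixed (assumed) noise levels, seeded for reproducibility.
rng = random.Random(0)
noisy_grad = lambda x: [gi + rng.gauss(0.0, 1e-3) for gi in grad(x)]
noisy_f = lambda x: f(x) + rng.gauss(0.0, 1e-6)
x_noisy, alpha_noisy, _ = stochastic_line_search(noisy_grad, noisy_f, [1.0, 1.0])
```

Passing exact oracles (`grad` and `f`) recovers a deterministic backtracking scheme in which the step size grows after every accepted step.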

Moreover, unlike the typical stochastic convergence rate analysis, which bounds the expected improvement in either $\|\nabla f\|$ or $f - f_{\min}$ after a given number of iterations, our convergence rate analysis bounds the total expected number of steps that the algorithm takes before either $\|\nabla f(x)\| \le \varepsilon$ or $f(x) - f_{\min} \le \varepsilon$ is reached. Our results rely on a stochastic process framework introduced and analyzed in [3] to provide convergence rates for a stochastic trust region method.

### 2.2 Random gradient and function estimates

#### Overview.

At each iteration, we compute a stochastic gradient and stochastic function values. With probability $p_g$, the random direction $g_k$ is close to the true gradient. We measure closeness or accuracy of the random direction using the current step length, which is a known quantity. This procedure naturally adapts the required accuracy as the algorithm progresses. As the steps get shorter (i.e. either the gradient gets smaller or the step-size parameter does), we require the accuracy to increase, but the probability of encountering a good gradient at any iteration is the same.

A similar procedure applies to the function estimates, $f_k^0$ and $f_k^s$. The accuracy of the function estimates relative to the true function values at the points $x_k$ and $x_k + s_k$ is tied to the size of the step, $\alpha_k\|g_k\|$. At each iteration, there is a probability $p_f$ of obtaining good function estimates. By choosing the probabilities of good gradient and function estimates appropriately, we show Algorithm 1 converges. To formalize this procedure, we introduce the following.

#### Notation and definitions.

Algorithm 1 generates a random process $\{G_k, X_k, A_k, \Delta_k, S_k, F_k^0, F_k^s\}$. In what follows we denote all random quantities by capital letters and their realizations by small letters. Hence the random gradient estimate is denoted by $G_k$ and its realizations by $g_k$. Similarly, the random quantities $X_k$ (iterates), $A_k$ (stepsize), $\Delta_k$ (control size), and $S_k$ (step) have realizations $x_k$, $\alpha_k$, $\delta_k$, and $s_k$, respectively. We let $F_k^0$ and $F_k^s$ denote estimates of $f(X_k)$ and $f(X_k + S_k)$, with their realizations denoted by $f_k^0$ and $f_k^s$. Our goal is to show that, under some assumptions on $G_k$ and $\{F_k^0, F_k^s\}$, the resulting stochastic process converges with probability one and at an appropriate rate. In particular, we assume that the estimates $G_k$, $F_k^0$ and $F_k^s$ are sufficiently accurate with sufficiently high probability, conditioned on the past.

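To formalize the conditioning on the past, let $\mathcal{F}_{k-1}^{G\cdot F}$ denote the $\sigma$-algebra generated by the random variables $G_0, \ldots, G_{k-1}$ and $F_0^0, F_0^s, \ldots, F_{k-1}^0, F_{k-1}^s$, and let $\mathcal{F}_{k-1/2}^{G\cdot F}$ denote the $\sigma$-algebra generated by the random variables $G_0, \ldots, G_k$ and $F_0^0, F_0^s, \ldots, F_{k-1}^0, F_{k-1}^s$. For completeness, we set $\mathcal{F}_{-1}^{G\cdot F} = \sigma(x_0)$. As a result, $\{\mathcal{F}_k^{G\cdot F}\}$ for $k \ge -1$ is a filtration. By construction of the random variables $X_k$ and $A_k$ in Algorithm 1, we see that $X_k$ and $A_k$ are $\mathcal{F}_{k-1}^{G\cdot F}$-measurable for all $k \ge 0$.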

We measure accuracy of the gradient estimates $G_k$ and function estimates $F_k^0$ and $F_k^s$ using the following definitions.

###### Definition 2.2.

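We say that a sequence of random directions $\{G_k\}$ is $p_g$-probabilistically $\kappa_g$-sufficiently accurate for Algorithm 1 for the corresponding sequence $\{A_k, X_k\}$, if there exists a constant $\kappa_g > 0$ such that the events

$$I_k = \{\|G_k - \nabla f(X_k)\| \le \kappa_g A_k \|G_k\|\}$$

satisfy the condition (here, given a measurable set $A$, we use $\mathbb{1}_A$ as the indicator function for the set $A$; $\mathbb{1}_A(\omega) = 1$ if $\omega \in A$ and $0$ otherwise)

$$\Pr(I_k \mid \mathcal{F}_{k-1}^{G\cdot F}) = \mathbb{E}[\mathbb{1}_{I_k} \mid \mathcal{F}_{k-1}^{G\cdot F}] \ge p_g.$$

In addition to sufficiently accurate gradients, we require estimates of the function values $f(x_k)$ and $f(x_k + s_k)$ to also be sufficiently accurate.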

###### Definition 2.3.

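A sequence of random estimates $\{F_k^0, F_k^s\}$ is said to be $p_f$-probabilistically $\varepsilon_f$-accurate with respect to the corresponding sequence $\{X_k, A_k, S_k\}$ if the events

$$J_k = \{|F_k^0 - f(x_k)| \le \varepsilon_f A_k^2 \|G_k\|^2 \ \text{ and } \ |F_k^s - f(x_k + s_k)| \le \varepsilon_f A_k^2 \|G_k\|^2\}$$

satisfy the condition

$$\Pr(J_k \mid \mathcal{F}_{k-1/2}^{G\cdot F}) = \mathbb{E}[\mathbb{1}_{J_k} \mid \mathcal{F}_{k-1/2}^{G\cdot F}] \ge p_f.$$

We note here that the filtration $\mathcal{F}_{k-1/2}^{G\cdot F}$ includes $A_k$ and $G_k$; hence the accuracy of the estimates is measured with respect to fixed quantities. Next, we state the key assumption on the nature of the stochastic information in Algorithm 1.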
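For a single realization, the events $I_k$ and $J_k$ of Definitions 2.2 and 2.3 reduce to simple inequality checks. The sketch below assumes access to the true gradient and function values, which is only possible in a synthetic experiment; the function names are hypothetical.

```python
import math


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(vi * vi for vi in v))


def gradient_is_accurate(g, true_grad, alpha, kappa_g):
    """Event I_k: ||g - grad f(x)|| <= kappa_g * alpha * ||g||."""
    diff = [gi - ti for gi, ti in zip(g, true_grad)]
    return norm(diff) <= kappa_g * alpha * norm(g)


def estimates_are_accurate(f0, fs, true_f0, true_fs, g, alpha, eps_f):
    """Event J_k: both estimates lie within eps_f * alpha^2 * ||g||^2 of the truth."""
    tol = eps_f * alpha ** 2 * norm(g) ** 2
    return abs(f0 - true_f0) <= tol and abs(fs - true_fs) <= tol
```

Averaging these indicators over repeated draws of the estimates at a fixed iterate gives an empirical estimate of $p_g$ and $p_f$. Note that the tolerances tighten automatically as $\alpha_k\|g_k\|$ shrinks.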

###### Assumption 2.4.

The following hold for the quantities in the algorithm:

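1. The sequence of random gradients $\{G_k\}$ generated by Algorithm 1 is $p_g$-probabilistically $\kappa_g$-sufficiently accurate for some sufficiently large $p_g \in (0,1]$.

2. The sequence of estimates $\{F_k^0, F_k^s\}$ generated by Algorithm 1 is $p_f$-probabilistically $\varepsilon_f$-accurate for some $\varepsilon_f \le \frac{\theta}{4\alpha_{\max}}$ and sufficiently large $p_f \in (0,1]$.

3. The sequence of estimates $\{F_k^0, F_k^s\}$ generated by Algorithm 1 satisfies a $\kappa_f$-variance condition for all $k \ge 0$ (we implicitly assume $|F_k^s - f(X_k + S_k)|^2$ and $|F_k^0 - f(X_k)|^2$ are integrable for all $k$; thus it is straightforward to deduce that $|F_k^s - f(X_k + S_k)|$ and $|F_k^0 - f(X_k)|$ are integrable for all $k$),

$$\begin{aligned} \mathbb{E}\left[|F_k^s - f(X_k + S_k)|^2 \mid \mathcal{F}_{k-1/2}^{G\cdot F}\right] &\le \max\{\kappa_f^2 A_k^2 \|\nabla f(X_k)\|^4, \ \theta^2 \Delta_k^4\} \\ \text{and} \quad \mathbb{E}\left[|F_k^0 - f(X_k)|^2 \mid \mathcal{F}_{k-1/2}^{G\cdot F}\right] &\le \max\{\kappa_f^2 A_k^2 \|\nabla f(X_k)\|^4, \ \theta^2 \Delta_k^4\}. \end{aligned} \tag{2.3}$$

A simple calculation shows that under Assumption 2.4 the following hold:

$$\mathbb{E}[\mathbb{1}_{I_k \cap J_k} \mid \mathcal{F}_{k-1}^{G\cdot F}] \ge p_g p_f, \quad \mathbb{E}[\mathbb{1}_{I_k^c \cap J_k} \mid \mathcal{F}_{k-1}^{G\cdot F}] \le 1 - p_g, \quad \text{and} \quad \mathbb{E}[\mathbb{1}_{J_k^c} \mid \mathcal{F}_{k-1}^{G\cdot F}] \le 1 - p_f.$$
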
###### Remark 1.

We are interested in deriving convergence results for the case when $\kappa_g$ may be large. For the rest of the exposition, we impose a lower bound on $\kappa_g$ without loss of generality. It is clear that if $\kappa_g$ happens to be smaller, somewhat better bounds than the ones we derive here will result, since the gradients give tighter approximations of the true gradient. Equation (2.3) includes the maximum of two terms, one of which, $\|\nabla f(X_k)\|$, is unknown. When one possesses external knowledge of $\|\nabla f(X_k)\|$, one could use this value. This is particularly useful when $\kappa_f$ is big since it allows large variance in the function estimates; for example, the assumption that $\|\nabla f(X_k)\| \ge \varepsilon$ implies that this variance does not have to be driven to zero before the algorithm reaches a desired accuracy. Yet, for convergence, and since a useful lower bound on $\|\nabla f(X_k)\|$ may be unknown, we include the parameter $\Delta_k$ as a way to adaptively control the variance. As such, $\kappa_f$ should be small; in fact, it can be set equal to $0$. The analysis can be performed for any other values of the above constants; the choices here are for simplicity and convenience.

This assumption on the accuracy of the gradient and function estimates is key in our convergence rate analysis. We derive specific bounds on $p_g$ and $p_f$ under which these rates hold. We note here that if $p_f = 1$ then Assumption 2.4(iii) is not needed and Assumption 2.4(ii) is sufficient for the convergence results. This case can be considered as an extension of results in [6]. Before concluding this section, we state a result showing the relationship between the variance assumption on the function values and the probability of inaccurate estimates.

###### Lemma 2.5.

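Let Assumption 2.4 hold. Suppose $\{X_k, G_k, F_k^0, F_k^s, A_k, \Delta_k\}$ is a random process generated by Algorithm 1 and $\{F_k^0, F_k^s\}$ are $p_f$-probabilistically accurate estimates. Then for every $k \ge 0$ we have

$$\begin{aligned} \mathbb{E}\left[\mathbb{1}_{J_k^c} |F_k^s - f(X_k + S_k)| \mid \mathcal{F}_{k-1/2}^{G\cdot F}\right] &\le (1 - p_f)^{1/2} \max\{\kappa_f A_k \|\nabla f(X_k)\|^2, \ \theta \Delta_k^2\} \\ \text{and} \quad \mathbb{E}\left[\mathbb{1}_{J_k^c} |F_k^0 - f(X_k)| \mid \mathcal{F}_{k-1/2}^{G\cdot F}\right] &\le (1 - p_f)^{1/2} \max\{\kappa_f A_k \|\nabla f(X_k)\|^2, \ \theta \Delta_k^2\}. \end{aligned}$$

###### Proof.

We show the result for $F_k^0 - f(X_k)$, but the proof for $F_k^s - f(X_k + S_k)$ is the same. Using Hölder's inequality for conditional expectations, we deduce

$$\mathbb{E}\left[\frac{\mathbb{1}_{J_k^c} |F_k^0 - f(X_k)|}{\max\{\kappa_f A_k \|\nabla f(X_k)\|^2, \theta \Delta_k^2\}} \,\Big|\, \mathcal{F}_{k-1/2}^{G\cdot F}\right] \le \left(\mathbb{E}\left[\mathbb{1}_{J_k^c} \mid \mathcal{F}_{k-1/2}^{G\cdot F}\right]\right)^{1/2} \left(\mathbb{E}\left[\frac{|F_k^0 - f(X_k)|^2}{\max\{\kappa_f^2 A_k^2 \|\nabla f(X_k)\|^4, \theta^2 \Delta_k^4\}} \,\Big|\, \mathcal{F}_{k-1/2}^{G\cdot F}\right]\right)^{1/2}.$$

The result follows after noting that, by (2.3),

$$\left(\mathbb{E}\left[\frac{|F_k^0 - f(X_k)|^2}{\max\{\kappa_f^2 A_k^2 \|\nabla f(X_k)\|^4, \theta^2 \Delta_k^4\}} \,\Big|\, \mathcal{F}_{k-1/2}^{G\cdot F}\right]\right)^{1/2} \le 1. \qquad ∎$$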

### 2.3 Computing $G_k$, $F_k^0$, and $F_k^s$ to satisfy Assumption 2.4

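Assuming that the variance of random function and gradient realizations is bounded as

$$\mathbb{E}\left(\|\nabla \tilde{f}(x, \xi_i) - \nabla f(x)\|^2\right) \le V_g \quad \text{and} \quad \mathbb{E}\left(|\tilde{f}(x, \xi_i) - f(x)|^2\right) \le V_f,$$

Assumption 2.4 can be made to hold if $G_k$, $F_k^0$ and $F_k^s$ are computed using a sufficient number of samples. In particular, let $S_k$ be a sample of realizations $\nabla \tilde{f}(X_k, \xi_i)$, $i \in S_k$, and $G_k = \frac{1}{|S_k|} \sum_{i \in S_k} \nabla \tilde{f}(X_k, \xi_i)$. By using results e.g. in [18, 19] we can show that if

$$|S_k| \ge \tilde{O}\left(\frac{V_g}{\kappa_g^2 A_k^2 \|G_k\|^2}\right) \tag{2.4}$$

(where $\tilde{O}$ hides the log factor of $1/(1 - p_g)$), then Assumption 2.4(i) is satisfied. While $G_k$ is not known when $|S_k|$ is chosen, one can design a simple loop by guessing the value of $\|G_k\|$ and increasing the number of samples until (2.4) is satisfied; this procedure is discussed in [6]. Similarly, to satisfy Assumption 2.4(ii), it is sufficient to compute $F_k^0 = \frac{1}{|S_k^0|} \sum_{i \in S_k^0} \tilde{f}(X_k, \xi_i)$ with

$$|S_k^0| \ge \tilde{O}\left(\frac{V_f}{\kappa_f^2 A_k^2 \|G_k\|^4}\right)$$

(where $\tilde{O}$ hides the log factor of $1/(1 - p_f)$) and to obtain $F_k^s$ analogously. Finally, it is easy to see that Assumption 2.4(iii) is simply satisfied if $|S_k^0| \ge \frac{V_f}{\theta^2 \Delta_k^4}$, by standard properties of variance.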
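The guess-and-grow loop mentioned for (2.4) might look like the sketch below. It is an assumption-laden illustration: the constant and log factor hidden in $\tilde{O}$ are dropped, the doubling rule and the cap `max_n` are choices made for the example, and `sample_grad` is a hypothetical oracle returning one stochastic gradient per call.

```python
import math


def adaptive_gradient_estimate(sample_grad, x, alpha, kappa_g, v_g,
                               n0=1, max_n=10 ** 6):
    """Grow the sample until n >= V_g / (kappa_g^2 * alpha^2 * ||g||^2),
    where g is the current sample mean (a constant-free version of (2.4))."""
    samples = [sample_grad(x) for _ in range(n0)]
    while True:
        n = len(samples)
        g = [sum(col) / n for col in zip(*samples)]      # minibatch mean
        g_sq = sum(gi * gi for gi in g)
        if g_sq == 0.0:
            required = math.inf                          # no finite sample size certifies (2.4)
        else:
            required = v_g / (kappa_g ** 2 * alpha ** 2 * g_sq)
        if n >= required or n >= max_n:
            return g, n
        # Use the current guess of ||g|| to pick the next sample size.
        target = 2 * n if math.isinf(required) else max(2 * n, math.ceil(required))
        target = min(target, max_n)
        samples.extend(sample_grad(x) for _ in range(target - n))
```

Because the required size scales like $1/(\alpha_k^2\|g_k\|^2)$, halving the step size quadruples the number of samples in this sketch.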

We observe that:

• Unlike [5, 9], the number of samples for gradient and function estimation does not increase at any pre-defined rate, but is closely related to the progress of the algorithm. In particular, if the step size $A_k\|G_k\|$ and the control parameter $\Delta_k$ increase, then the sample set sizes can decrease.

• Also, unlike [18], where the number of samples is simply chosen large enough a priori for all $k$ so that the right hand side in Assumption 2.4(i) is bounded by a predefined accuracy $\varepsilon$, our algorithm can be applied without knowledge of $\varepsilon$.

• Finally, unlike [4], where theoretical results require that $|S_k|$ depends on $\|\nabla f(X_k)\|$, which is unknown, our bounds on the sample set sizes all use knowable quantities, such as a bound on the variance and quantities computed by the algorithm.

We also point out that $\kappa_g$ can be arbitrarily big, and that $p_g$ depends only on the backtracking factor $\gamma$ and is not close to $1$; hence the number of samples needed to satisfy Assumption 2.4(i) is moderate. On the other hand, $p_f$ will have to depend on $\kappa_g$; hence a looser control of the gradient estimates results in tighter control, i.e. larger sample sets, for function estimates.

Our last comment is that $G_k$ does not have to be an unbiased estimate of $\nabla f(X_k)$ and does not need to be computed via gradient samples. Instead it can be computed via stochastic finite differences, as is discussed for example in [7].

## 3 Renewal-Reward Process

In this section, we define a general random process introduced in [3], together with its stopping time, which serves as a general framework for analyzing the behavior of the stochastic trust region method in [3] and of the stochastic line search in this paper. We state the relevant definitions, assumptions, and theorems and refer the reader to the proofs in [3].

###### Definition 3.1.

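Given a discrete time stochastic process $\{X_k\}$, a random variable $T$ is a stopping time for $\{X_k\}$ if the event $\{T = k\} \in \sigma(X_0, \ldots, X_k)$.

Let $\{\Phi_k, A_k\}$ be a random process such that $\Phi_k \in [0, \infty)$ and $A_k \in [0, \infty)$ for $k \ge 0$. Let us also define a biased random walk process, $\{W_k\}_{k=0}^{\infty}$, defined on the same probability space as $\{\Phi_k, A_k\}$. We denote by $\mathcal{F}_k$ the $\sigma$-algebra generated by $\{\Phi_0, A_0, W_0, \ldots, \Phi_k, A_k, W_k\}$, where $W_0 = 1$. In addition, $W_k$ obeys the following dynamics

$$\Pr(W_{k+1} = 1 \mid \mathcal{F}_k) = p \quad \text{and} \quad \Pr(W_{k+1} = -1 \mid \mathcal{F}_k) = 1 - p. \tag{3.1}$$

We define $T_\varepsilon$ to be a family of stopping times parameterized by $\varepsilon$. In [3] a bound on $\mathbb{E}[T_\varepsilon]$ is derived under the following assumption on $\{\Phi_k, A_k\}$.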

###### Assumption 3.2.

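The following hold for the process $\{\Phi_k, A_k, W_k\}$.

1. $A_0$ is a constant. There exists a constant $\lambda \in (0, \infty)$ and $A_{\max} = A_0 e^{\lambda j_{\max}}$ (for some $j_{\max} \in \mathbb{Z}$) such that $A_k \le A_{\max}$ for all $k$.

2. There exists a constant $\bar{A} = A_0 e^{\lambda \bar{j}}$ for some $\bar{j} \in \mathbb{Z}$ and $\bar{j} \le 0$, such that the following holds for all $k$,

$$\mathbb{1}_{\{T_\varepsilon > k\}} A_{k+1} \ge \mathbb{1}_{\{T_\varepsilon > k\}} \min\{A_k e^{\lambda W_{k+1}}, \bar{A}\},$$

where $W_{k+1}$ satisfies (3.1) with $p > \frac{1}{2}$.

3. There exists a nondecreasing function $h : [0, \infty) \to (0, \infty)$ and a constant $\Theta > 0$ such that

$$\mathbb{1}_{\{T_\varepsilon > k\}} \mathbb{E}[\Phi_{k+1} \mid \mathcal{F}_k] \le \mathbb{1}_{\{T_\varepsilon > k\}} (\Phi_k - \Theta h(A_k)).$$

Assumption 3.2 (iii) states that, conditioned on the event $T_\varepsilon > k$ and the past, the random variable $\Phi_k$ decreases in expectation by at least $\Theta h(A_k)$ at each iteration. Assumption 3.2 (ii), in turn, says that once $A_k$ falls below the fixed constant $\bar{A}$, the sequence has a tendency to increase. Assumptions 3.2 (i) and (ii) together also ensure that $\bar{A}$ belongs to the sequence of values taken by the sequence $\{A_k\}$. As we will see, this is a simple technical assumption that can be satisfied w.l.o.g.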
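To build intuition for Assumption 3.2 (ii), one can simulate the lower-bounding process $A_{k+1} = \min\{A_k e^{\lambda W_{k+1}}, \bar{A}\}$ driven by the biased random walk (3.1). The sketch below takes the inequality with equality, which is an assumption made purely for illustration; the function and parameter names are hypothetical.

```python
import math
import random


def simulate_step_sizes(a0, a_bar, lam, p, steps, seed=0):
    """Simulate A_{k+1} = min(A_k * exp(lam * W_{k+1}), a_bar),
    where W_{k+1} = +1 with probability p and -1 otherwise."""
    rng = random.Random(seed)
    a, path = a0, [a0]
    for _ in range(steps):
        w = 1 if rng.random() < p else -1   # biased random walk increment (3.1)
        a = min(a * math.exp(lam * w), a_bar)
        path.append(a)
    return path


# With p > 1/2 the walk drifts upward and spends much of its time at the cap a_bar.
path = simulate_step_sizes(a0=1.0, a_bar=1.0, lam=math.log(2.0), p=0.7, steps=1000)
fraction_at_cap = sum(1 for a in path if a == 1.0) / len(path)
```

The fraction of iterations spent at $\bar{A}$ is what makes the guaranteed decrease $\Theta h(\bar{A})$ available frequently enough to bound the stopping time.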

###### Remark 2.

Computational complexity (in deterministic methods) measures the number of iterations until an event such as $\|\nabla f(x)\|$ is small or $f(x) - f_{\min}$ is small, or equivalently, the rate at which the gradient/function values decrease as a function of the iteration counter $k$. For randomized or stochastic methods, previous works tended to focus on the second definition, i.e. showing that the expected size of the gradient or function values decreases at a given rate in $k$. Instead, here we bound the expected number of iterations until the size of the gradient or function values is small, which is the same as bounding the stopping times $T_\varepsilon = \inf\{k : \|\nabla f(X_k)\| \le \varepsilon\}$ and $T_\varepsilon = \inf\{k : f(X_k) - f_{\min} \le \varepsilon\}$, for a fixed $\varepsilon > 0$.

###### Remark 3.

In the context of deterministic line search, when the stepsize $\alpha_k$ falls below a constant of order $1/L$, where $L$ is the Lipschitz constant of $\nabla f$, the iterate always satisfies the sufficient decrease condition (1.2). Thus $\alpha_k$ never falls much below that constant. To match the dynamics behind deterministic line search, we expect the constant $\bar{A}$ to be of this order. However, in the stochastic setting there is a positive probability of $A_k$ being arbitrarily small. Theorem 3.3, below, is derived by observing that on average $A_k \ge \bar{A}$ occurs frequently due to the upward drift in the random walk process. Consequently, $\mathbb{E}[\Phi_{k+1} - \Phi_k]$ can be bounded by a negative fixed value (dependent on $\varepsilon$) frequently; thus we can derive a bound on $\mathbb{E}[T_\varepsilon]$.

The following theorem (Theorem 2.2 in [3]) bounds $\mathbb{E}[T_\varepsilon]$ in terms of $h(\bar{A})$ and $\Phi_0$.

###### Theorem 3.3.

Under Assumption 3.2,

$$\mathbb{E}[T_\varepsilon] \le \frac{p}{2p - 1} \cdot \frac{\Phi_0}{\Theta h(\bar{A})} + 1.$$
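The bound of Theorem 3.3 is simple enough to evaluate directly. The helper below is a small illustration; its name and argument names are hypothetical, and `h_a_bar` stands for the value $h(\bar{A})$.

```python
def expected_stopping_time_bound(p, phi0, theta_const, h_a_bar):
    """Evaluate p / (2p - 1) * phi0 / (theta_const * h_a_bar) + 1.

    Requires p in (1/2, 1], as in Assumption 3.2 (ii)."""
    if not 0.5 < p <= 1.0:
        raise ValueError("p must lie in (1/2, 1]")
    return p / (2.0 * p - 1.0) * phi0 / (theta_const * h_a_bar) + 1.0


bound = expected_stopping_time_bound(p=0.75, phi0=10.0, theta_const=0.5, h_a_bar=2.0)
```

The factor $p/(2p-1)$ blows up as $p \downarrow 1/2$, reflecting that a nearly unbiased walk rarely keeps the step size near $\bar{A}$.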

## 4 Convergence of Stochastic Line Search

Our primary goal is to prove convergence of Algorithm 1 by showing a lim-inf convergence result, $\liminf_{k \to \infty} \|\nabla f(X_k)\| = 0$ a.s. We note that typical convergence results for stochastic algorithms prove either high probability results or that the expected gradient at an averaged point converges. Our result is slightly stronger than these results since we show that a subsequence of the gradient norms $\|\nabla f(X_k)\|$ converges to zero a.s. With this convergence result, stopping times based on either $\|\nabla f(x)\| \le \varepsilon$ and/or $f(x) - f_{\min} \le \varepsilon$ are finite almost surely. Our approach for the liminf proof is twofold: (1) construct a function ($\Phi_k$) whose expected progress decreases proportionally to $\|\nabla f(X_k)\|^2$ and (2) show that the $\limsup$ of the step sizes is strictly larger than $0$ a.s.

### 4.1 Useful results

Before delving into the convergence statement and proof, we state some lemmas similar to those derived in [6, 2, 7].

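###### Lemma 4.1 (Accurate gradients ⇒ lower bound on $\|g_k\|$).

Suppose $g_k$ is $\kappa_g$-sufficiently accurate. Then

$$\frac{\|\nabla f(x_k)\|}{\kappa_g \alpha_{\max} + 1} \le \|g_k\|.$$

###### Proof.

The $\kappa_g$-sufficient accuracy of $g_k$, together with the triangle inequality, implies

$$\|\nabla f(x_k)\| \le (\kappa_g \alpha_k + 1) \|g_k\| \le (\kappa_g \alpha_{\max} + 1) \|g_k\|. \qquad ∎$$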
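For completeness, the triangle-inequality step in this proof can be written out as follows, using only the definition of the event $I_k$ for a realization and $\alpha_k \le \alpha_{\max}$:

```latex
\begin{align*}
\|\nabla f(x_k)\|
  &\le \|\nabla f(x_k) - g_k\| + \|g_k\|
     && \text{(triangle inequality)} \\
  &\le \kappa_g \alpha_k \|g_k\| + \|g_k\|
     && \text{($g_k$ is $\kappa_g$-sufficiently accurate)} \\
  &= (\kappa_g \alpha_k + 1)\|g_k\|
   \le (\kappa_g \alpha_{\max} + 1)\|g_k\|
     && \text{($\alpha_k \le \alpha_{\max}$).}
\end{align*}
```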

###### Lemma 4.2 (Accurate gradients and estimates ⇒ successful iteration).

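Suppose $g_k$ is $\kappa_g$-sufficiently accurate and $\{f_k^0, f_k^s\}$ are $\varepsilon_f$-accurate estimates. If

$$\alpha_k \le \frac{1 - \theta}{\kappa_g + \frac{L}{2} + 2\varepsilon_f},$$

then the trial step $x_k + s_k$ is successful. In particular, this means $f_k^s \le f_k^0 - \theta \alpha_k \|g_k\|^2$.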

###### Proof.

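The $L$-smoothness of $f$ and the $\kappa_g$-sufficiently accurate gradient immediately yield

$$\begin{aligned} f(x_k + s_k) &\le f(x_k) - \alpha_k (\nabla f(x_k) - g_k)^T g_k - \alpha_k \|g_k\|^2 + \frac{L \alpha_k^2}{2} \|g_k\|^2 \\ &\le f(x_k) + \kappa_g \alpha_k^2 \|g_k\|^2 - \alpha_k \|g_k\|^2 + \frac{L \alpha_k^2}{2} \|g_k\|^2. \end{aligned}$$

Since the estimates are $\varepsilon_f$-accurate, we obtain

$$\begin{aligned} f_k^s - \varepsilon_f \alpha_k^2 \|g_k\|^2 &\le f(x_k + s_k) - f_k^s + f_k^s \\ &\le f(x_k) - f_k^0 + f_k^0 + \kappa_g \alpha_k^2 \|g_k\|^2 - \alpha_k \|g_k\|^2 + \frac{L \alpha_k^2}{2} \|g_k\|^2 \\ &\le f_k^0 + \varepsilon_f \alpha_k^2 \|g_k\|^2 + \kappa_g \alpha_k^2 \|g_k\|^2 - \alpha_k \|g_k\|^2 + \frac{L \alpha_k^2}{2} \|g_k\|^2. \end{aligned}$$

The result follows by noting $f_k^s \le f_k^0 - \alpha_k \|g_k\|^2 \left(1 - \alpha_k \left(\kappa_g + \frac{L}{2} + 2\varepsilon_f\right)\right)$, where the bound on $\alpha_k$ makes the factor in parentheses at least $\theta$. ∎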
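The step-size threshold in Lemma 4.2 depends only on $\theta$, $\kappa_g$, $L$ and $\varepsilon_f$, so it is easy to tabulate. The helper below is a small illustration with hypothetical names; it simply evaluates the right-hand side of the lemma's condition.

```python
def successful_step_threshold(theta, kappa_g, lipschitz, eps_f):
    """Largest alpha for which Lemma 4.2 guarantees a successful step:
    (1 - theta) / (kappa_g + L / 2 + 2 * eps_f)."""
    return (1.0 - theta) / (kappa_g + lipschitz / 2.0 + 2.0 * eps_f)


# Looser gradient accuracy (larger kappa_g) shrinks the guaranteed step size.
tight = successful_step_threshold(theta=0.5, kappa_g=1.0, lipschitz=2.0, eps_f=0.0)
loose = successful_step_threshold(theta=0.5, kappa_g=7.0, lipschitz=2.0, eps_f=0.0)
```

With exact gradients and function values ($\kappa_g = \varepsilon_f = 0$) the expression reduces to $2(1-\theta)/L$, the familiar deterministic threshold.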

###### Lemma 4.3 (Good estimates ⇒ decrease in function).

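Suppose $\varepsilon_f \le \frac{\theta}{4\alpha_{\max}}$ and $\{f_k^0, f_k^s\}$ are $\varepsilon_f$-accurate estimates. If the trial step is successful, then the improvement in function value is

$$f(x_{k+1}) \le f(x_k) - \frac{\theta \alpha_k}{2} \|g_k\|^2. \tag{4.1}$$

If, in addition, the step is reliable, then the improvement in function value is

$$f(x_{k+1}) \le f(x_k) - \frac{\theta \alpha_k}{4} \|g_k\|^2 - \frac{\theta}{4} \delta_k^2. \tag{4.2}$$
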
###### Proof.

The iterate is successful and the estimates are accurate so we conclude

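$$\begin{aligned} f(x_k + s_k) &\le f(x_k + s_k) - f_k^s + f_k^0 - f(x_k) + f(x_k) - \alpha_k \theta \|g_k\|^2 \\ &\le f(x_k) + 2\varepsilon_f \alpha_k^2 \|g_k\|^2 - \alpha_k \theta \|g_k\|^2 \\ &\le f(x_k) - \alpha_k \|g_k\|^2 (\theta - 2\varepsilon_f \alpha_{\max}), \end{aligned}$$

where the last inequality follows because $\alpha_k \le \alpha_{\max}$. The condition $\varepsilon_f \le \frac{\theta}{4\alpha_{\max}}$ immediately implies (4.1). By noticing that $\alpha_k \|g_k\|^2 \ge \delta_k^2$ holds for reliable steps, we deduce (4.2). ∎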

###### Lemma 4.4.

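Suppose the iterate is successful. Then

$$\|\nabla f(x_{k+1})\|^2 \le 2\left(L^2 \alpha_k^2 \|g_k\|^2 + \|\nabla f(x_k)\|^2\right).$$

In particular, the inequality holds

$$\frac{1}{L^2}\left(\alpha_{k+1} \|\nabla f(x_{k+1})\|^2 - \alpha_k \|\nabla f(x_k)\|^2\right) \le 2\gamma \alpha_k \left(\alpha_k^2 \|g_k\|^2 + \frac{1}{L^2} \|\nabla f(x_k)\|^2\right).$$

###### Proof.

An immediate consequence of $L$-smoothness of $f$ is $\|\nabla f(x_{k+1})\| \le L \alpha_k \|g_k\| + \|\nabla f(x_k)\|$. The result follows from squaring both sides and applying the bound $(a + b)^2 \le 2(a^2 + b^2)$. To obtain the second inequality, we note that in the case $x_k + s_k$ is successful, $\alpha_{k+1} \le \gamma \alpha_k$. ∎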

###### Lemma 4.5 (Accurate gradients and estimates ⇒ decrease in function).

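Suppose $g_k$ is $\kappa_g$-sufficiently accurate and $\{f_k^0, f_k^s\}$ are $\varepsilon_f$-accurate estimates where $\varepsilon_f \le \frac{\theta}{4\alpha_{\max}}$. If the trial step is successful, then

$$f(x_{k+1}) - f(x_k) \le -\frac{\theta \alpha_k}{4} \|g_k\|^2 - \frac{\theta \alpha_k}{4(\kappa_g \alpha_{\max} + 1)^2} \|\nabla f(x_k)\|^2. \tag{4.3}$$

In addition, if the trial step is reliable, then

$$f(x_{k+1}) - f(x_k) \le -\frac{\theta \alpha_k}{8} \|g_k\|^2 - \frac{\theta}{8} \delta_k^2 - \frac{\theta \alpha_k}{4(\kappa_g \alpha_{\max} + 1)^2} \|\nabla f(x_k)\|^2. \tag{4.4}$$
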
###### Proof.

Lemma 4.1 implies

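$$-\frac{\theta}{2} \alpha_k \|g_k\|^2 \le -\frac{\theta}{4} \alpha_k \|g_k\|^2 - \frac{\theta}{4(\kappa_g \alpha_{\max} + 1)^2} \alpha_k \|\nabla f(x_k)\|^2. \tag{4.5}$$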

We combine this result with Lemma 4.3 to conclude the first result. For the second result, since the step is reliable, equation (4.5) improves to

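$$-\frac{\theta}{2} \alpha_k \|g_k\|^2 \le -\frac{\theta}{8} \alpha_k \|g_k\|^2 - \frac{\theta}{8} \delta_k^2 - \frac{\theta}{4(\kappa_g \alpha_{\max} + 1)^2} \alpha_k \|\nabla f(x_k)\|^2,$$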

and again the result follows from Lemma 4.3. ∎

### 4.2 Definition and analysis of the $\{\Phi_k, A_k, W_k\}$ process for Algorithm 1

We base our proof of convergence on properties of the random function

$$\Phi_k = \nu (f(X_k) - f_{\min}) + (1 - \nu) \frac{1}{L^2} A_k \|\nabla f(X_k)\|^2 + (1 - \nu) \theta \Delta_k^2 \tag{4.6}$$

for some (deterministic) $\nu \in (0,1)$ and for all $k \ge 0$. The goal is to show that $\{\Phi_k, A_k\}$ satisfies Assumption 3.2, in particular, that $\Phi_k$ is expected to decrease on each iteration. Due to inaccuracy in function estimates and gradients, the algorithm may take a step that increases the objective and thus $\Phi_k$. We will show that such an increase is bounded by a value proportional to $\|\nabla f(x_k)\|^2$. On the other hand, as we will show, on a successful iteration with accurate function estimates, the objective decreases proportionally to $\|\nabla f(x_k)\|^2$, while on unsuccessful steps, the change $\Phi_{k+1} - \Phi_k$ in (4.6) is always negative because both $A_k$ and $\Delta_k$ are decreased. The function $\Phi_k$ is chosen to balance the potential increases and decreases in the objective with changes inflicted by unsuccessful steps.

###### Theorem 4.6.

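Let Assumptions 2.1 and 2.4 hold. Suppose $\{X_k, G_k, F_k^0, F_k^s, A_k, \Delta_k\}$ is the random process generated by Algorithm 1. Then there exist probabilities $p_g, p_f$ and a constant $\nu \in (0,1)$ such that the expected decrease in $\Phi_k$ is

$$\mathbb{E}[\Phi_{k+1} - \Phi_k \mid \mathcal{F}_{k-1}^{G\cdot F}] \le -\frac{p_g p_f (1 - \nu)(1 - \gamma^{-1})}{4} \left(\frac{A_k}{L^2} \|\nabla f(X_k)\|^2 + \theta \Delta_k^2\right). \tag{4.7}$$

In particular, the constant $\nu$ and probabilities $p_g, p_f$ satisfy

$$\frac{\nu}{1 - \nu} \ge \max\left\{\frac{32 \gamma \alpha_{\max}^2}{\theta}, \ 16(\gamma - 1), \ \frac{16 \gamma (\kappa_g \alpha_{\max} + 1)^2}{\theta}\right\}, \tag{4.8}$$

$$p_g \ge \frac{2\gamma}{\frac{1}{2}(1 - \gamma^{-1}) + 2\gamma}, \tag{4.9}$$

$$\text{and} \quad \frac{p_g p_f}{\sqrt{1 - p_f}} \ge \max\left\{\frac{8 L^2 \nu \kappa_f + 16 \gamma (1 - \nu)}{(1 - \nu)(1 - \gamma^{-1})}, \ \frac{8 \nu}{(1 - \nu)(1 - \gamma^{-1})}\right\}. \tag{4.10}$$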