A Stochastic Levenberg-Marquardt Method Using Random Models with Application to Data Assimilation

# A Stochastic Levenberg-Marquardt Method Using Random Models with Application to Data Assimilation

E. Bergou MaIAGE, INRA, Université Paris-Saclay, 78350 Jouy-en-Josas, France (elhoucine.bergou@inra.fr).    Y. Diouane ISAE-SUPAERO, Université de Toulouse, 31055 Toulouse Cedex 4, France (youssef.diouane@isae.fr).    V. Kungurtsev Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University in Prague. Support for this author was provided by the Czech Science Foundation project 17-26999S (vyacheslav.kungurtsev@fel.cvut.cz).    C. W. Royer Wisconsin Institute for Discovery, University of Wisconsin-Madison, 330 N Orchard St, Madison, WI 53715, USA (croyer2@wisc.edu). Support for this author was provided by Subcontract 3F-30222 from Argonne National Laboratory.
###### Abstract

Globally convergent variants of the Gauss-Newton algorithm are often the preferred methods to tackle nonlinear least squares problems. Among such frameworks, the Levenberg-Marquardt and the trust-region methods are two well-established paradigms, and their similarities have often enabled to derive similar analyses of these schemes. Both algorithms have indeed been successfully studied when the Gauss-Newton model is replaced by a random model, only accurate with a given probability. Meanwhile, problems where even the objective value is subject to noise have gained interest, driven by the need for efficient methods in fields such as data assimilation.

In this paper, we describe a stochastic Levenberg-Marquardt algorithm that can handle noisy objective function values as well as random models, provided sufficient accuracy is achieved in probability. Our method relies on a specific scaling of the regularization parameter, which clarifies further the correspondences between the two classes of methods, and allows us to leverage existing theory for trust-region alorithms. Provided the probability of accurate function estimates and models is sufficiently large, we establish that the proposed algorithm converges globally to a first-order stationary point of the objective function with probability one. Furthermore, we derive a bound the expected number of iterations needed to reach an approximate stationary point. We finally describe an application of our method to variational data assimilation, where stochastic models are computed by the so-called ensemble methods.

Keywords: Levenberg-Marquardt method, nonlinear least squares, regularization, random models, noisy functions, data assimilation.

## 1 Introduction

Minimizing a nonlinear least-squares function is one of the most classical problems in numerical optimization, that arises in a variety of fields. In many applications, the objective function to be optimized can only be accessed through noisy estimates. Typical occurrences of such a formulation can be found when solving inverse problems [16, 27, 28] or while minimizing the error of a model in the context of machine learning [9]. In such cases, the presence of noise is often due to the estimation of the objective function via cheaper, less accurate calculations: this is for instance true when part of the data is left aside while computing this estimate. In fact, in data-fitting problems such as those coming from machine learning, a huge amount of data is available, and considering the entire data throughout the optimization process can be extremely costly. Moreover, the measurements can be redundant and possibly corrupted: in that context, a full evaluation of the function or the gradient may be unnecessary.

Such concerns have motivated the development of optimization frameworks that cope with inexactness in the objective function or its derivatives. In particular, the field of derivative-free optimization [15], where it is assumed that the derivatives exist but are unavailable for use in an algorithm, has expanded in the recent years with the introduction of random models. One seminal work in this respect is [1], where the authors applied arguments from compressed sensing to guarantee accuracy of quadratic models whenever the Hessian had a certain (unknown) sparsity pattern. Trust-region methods based on general probabilistic models were then proposed in [2], where convergence to first- and second-order stationary points was established under appropriate accuracy assumptions on the models. Global convergence rates were derived for this approach in [19], in expectation and with high probability. Of particular interest to us is the extension of trust-region methods with probabilistic models to the case of noisy function values [13]: the corresponding algorithm considers two sources of randomness, respectively arising from the noisy function estimates and the random construction of the models. A global convergence rate in expectation for this method was derived in [7], where it was established that the method needed iterations in expectation to drive the gradient norm below some threshold .

In the context of derivative-free least-squares problems where exact function values are available, various deterministic approaches based on globalization of the Gauss-Newton method have been studied. The algorithms developed in the derivative-free community are mostly of trust-region type, and rely on building models that satisfy the so-called fully linear property, which requires the introduction of a so-called criticality step to guarantee its satisfaction throughout the algorithmic process [12, 31, 32, 29]. The recent DFO-GN algorithm [12] was equipped with a complexity result, showing a bound of the same order than derivative-free trust-region methods for generic functions [18]. As for general problems, considering random models is a possible way of relaxing the need for accuracy at every iteration. A Levenberg-Marquardt algorithm based in this idea was proposed in [6], motivated by problems from data assimilation. The authors of [6] proposed an extension of the classical LM algorithm that replaces the gradient of the objective function by a noisy estimate, that is accurate only with a certain probability. Using similar arguments than for the trust-region case [2], almost-sure global convergence to a first-order stationary point was established.

The case of noisy least squares has also been examined. A very recent preprint [10] proposed a efficient approach for handling noisy values in practice, but did not provide theoretical guarantees. A Levenberg-Marquardt framework for noisy optimization without derivatives was proposed in [4], with similar goals as those aimed in this paper. The method proposed in [4] assumes that function values can be estimated to a prescribed accuracy level, and explicitly maintains a sequence of these accuracies throughout the algorithm. Although such an approach is relevant when any accuracy level can be used (for instance, all the data can be utilized to estimate the function), it does not allow for arbitrarily bad estimations on any iteration: moreover, the noise level must be small compared to the norm of the upcoming Levenberg-Marquardt step, a condition that may force to reduce this noise level, and resembles the criticality step of derivative-free model-based methods. By contrast, the use of random models and estimates with properties only guaranteed in probability allows for arbitrarily bad estimates, which seems more economical at the iteration level, and does not necessarily mean that a good step will not be computed in that case. Probabilistic properties thus emerges as an interesting alternative, particularly when it is expensive to compute accurate estimates, and one can then think of exploiting the connection between Levenberg-Marquardt and trust-region methods [23] to analyze the former in the case of noisy problems.

In this paper, we propose a stochastic framework that builds upon the approach developed in [6] to handle both random models and noise in the function evaluations. This new algorithm is also inspired by a recently proposed variant of the Levenberg-Marquardt framework [5], where a specific scaling of the regularization parameter enabled the derivation of worst-case complexity results. We adapt the analysis of the stochastic trust-region framework using random models proposed in [7, 13] to prove that our framework enjoys comparable convergence and complexity guarantees. Unlike [4] , our setup allows for arbitrarily inaccurate models or function estimates, as long as it happens with a small probability. Our method is particularly suited for applications in data assimilation, which we illustrate in the context of ensemble methods.

The remainder of the paper is organized as follows. Section 2 presents our Levenberg-Marquardt framework; Section 3 established the accuracy requirements we make on the function values and the models, as well as their probabilistic counterparts. Global convergence and worst-case complexity of the method are analyzed in Sections 4 and 5, respectively. Finally, Section 6 describes an application of our method in data assimilation.

## 2 A Levenberg-Marquardt algorithm based on estimated values

In this paper, we consider the following nonlinear least squares problem:

 minx∈Rnf(x)\lx@stackrelΔ=12∥r(x)∥2, (1)

where is the residual vector-valued function, assumed to be continuously differentiable, and most likely . During the minimization process, the optimizer can only have access to estimates of - referred as . This estimate is assumed to be noisy, i.e., one has for all , where the noise is a random variable.

This section recalls the main features of the Levenberg-Marquardt method, then describes our extension of this algorithm to handle noisy function values and gradients.

### 2.1 Deterministic Levenberg-Marquardt algorithm

Whenever the function and its Jacobian can be accessed, one possible approach for solving the problem (1) is based on the Gauss-Newton model. More precisely, at a given iterate , a step is computed as a solution of the linearized least squares subproblem

 mins∈Rn12∥rj+Jjs∥2,

where and denotes the Jacobian of at . The subproblem has a unique solution if has full column rank, and in that case the step is a descent direction for . When is not of full column rank, the introduction of a regularization parameter can lead to similar properties. This is the underlying idea behind the Levenberg-Marquardt [21, 22, 24] algorithm, a globally convergent method based upon the Gauss-Newton model. At each iteration, one considers a step of the form , corresponding to the unique solution of

 mins∈Rn12∥rj+Jjs∥2+12γj∥s∥2, (2)

where is an appropriately chosen regularization parameter, typically updated in the spirit of the classical trust-region radius update strategy at each iteration. Several strategies were then developed to update . Several approaches have considered scaling this parameter using the norm of the gradient of the Gauss-Newton model [5, 33]. A similar choice will be adopted in this paper.

### 2.2 Algorithmic framework based on estimates

In this work, we are interested in the case where and cannot be directly accessed, but noisy estimates are available. As a result, we will consider a variant of the Levenberg-Marquardt algorithm in which both the function and gradient values are approximated.

Algorithm 1 presents a description of our method. At every iteration, estimates of the values of and its derivative at the current iterate are obtained, and serve to define a regularized Gauss-Newton model (3), where the regularization parameter is defined using a specific scaling formula: where . The model is then approximately minimized, yielding a trial step . The resulting new point is accepted only if the ratio between estimated decrease ( is again estimated at the new trial point) and model decrease is sufficiently high.

The Levenberg-Marquardt parameter is updated depending on the value of , and also on a condition involving the model gradient. Such updates have been widely used in derivative-free model-based methods based on random estimates [2, 6, 13, 19].

## 3 Probabilistic properties for the Levenberg-Marquardt method

We are interested in the case where the objective function values, the gradient and the Jacobian are noisy, and we only have their approximations.

### 3.1 Gradient and function estimates

We begin by describing our accuracy requirements for the models computed based on sampled values, of the form given in (3). Following previous work on derivative-free Levenberg-Marquardt methods [6], we propose the following accuracy definition, and motivate further its use below.

###### Definition 3.1

Consider a realization of Algorithm 1, and the model of defined around the iterate of the form (3), and let . Then, the model is called -first-order accurate with respect to if

 ∥gmj−J⊤jrj∥≤κegμj (5)

and

 ∣∣f(xj)−mj(xj)∣∣≤κefμ2j. (6)
###### Remark 3.1

The accuracy requirement for the model gradient (5) is similar to the first-order accuracy property introduced by Bergou, Gratton and Vicente [6]. However, it is not exactly equivalent as we use instead of . The purpose of this new parametrization is twofold. First, it allows us to measure the accuracy in formulas (5) and (6) through a parameter that is updated in an explicit fashion throughout the algorithmic run: this is a key property for performing a probabilistic analysis of optimization methods. Secondly, we believe this choice to be a better reflection of the relationship between the Levenberg-Marquardt and the trust-region parameter. Indeed, for a realization of the method, the Levenberg-Marquardt direction minimizing is given by

 dj=−(J⊤jJj+γjI)−1J⊤jrj, (7)

which is also the solution of the trust-region subproblem

 {mind12∥rj+Jjd∥2s.t.∥d∥≤δj=∥dj∥. (8)

As a result, we see that for a large value of , one would have:

 δj=O(∥J⊤jrj∥γj), (9)

which suggests that is not exactly equivalent to the inverse of the trust-region radius, as suggested in [6], but that it rather is an equivalent to . Still, this relation implies that can be seen as an equivalent to : in that sense, (5) matches the gradient assumption for fully linear models [15].

Note that Definition 3.1 contains two requirements: in the absence of noise, (6) is trivially satisfied by setting . In this work, we consider that even function values cannot be accessed inexactly, thus (6) appears to be necessary.

In the case of noisy function values, we also expect the estimates computed by Algorithm 1 to be sufficiently accurate with a suitable probability. This is formalized in the following definitions.

###### Definition 3.2

Given , we say that two values and are -accurate estimates of and , respectively, for a given , if

 ∣∣f0j−f(xj)∣∣≤εfμ2jand∣∣f1j−f(xj+sj)∣∣≤εfμ2j. (10)

### 3.2 Probabilistic accuracy of model gradients and function estimates

We are further interested in the case where the models are built in some random fashion. We will thus consider random models of the form , and we use the notation for its realizations. Correspondingly, let random variables and denote the estimates of the gradient and the Jacobian , with their realizations denoted by , and .

Note that the randomness of the models implies the randomness of the iterate , the parameters , and the step , and so , , and will denote their respective realizations.

As described in the introduction, another source of randomness from our problem in that the objective function is accessed through a randomized estimator . For a given iteration index , we define and . The realizations of and (taken over the randomness of as well as that of the iterate ) will be denoted by and .

We can now provide probabilistic equivalents of Definitions 3.1 and 3.2.

###### Definition 3.3

Let , and . A sequence of random models is said to be -probabilistically -first-order accurate with respect to the sequence if the events

 Uj\lx@stackrelΔ={∥∥gMj−J(Xj)⊤r(Xj)∥∥≤κeguj & ∣∣f(Xj)−Mj(Xj)∣∣≤κefu2j}

satisfy the following condition

 p∗j\lx@stackrelΔ=P(Vj|FM⋅Fj−1)≥p, (11)

where is the -algebra generated by and .

###### Definition 3.4

Given constants , and , the sequences of random quantities and is called -probabilistically -accurate, for corresponding sequences , , if the events

 Vj\lx@stackrelΔ={∣∣F0j−f(Xj)∣∣≤εfu2jand∣∣F1j−f(Xj+Sj)∣∣≤εfu2j}

satisfy the following condition

 q∗j\lx@stackrelΔ=P(Vj|FM⋅Fj−1/2)≥q, (12)

where is the -algebra generated by .

Here again, we point out that the parameter plays the role of a reciprocal of the trust-region radius. In that sense, the previous definitions are consistent with the definitions of sufficient accuracy presented in the case of stochastic trust-region methods [13].

## 4 Global convergence to first-order critical points

In this section, we aim at establishing convergence of Algorithm 1 when the function estimates and the models satisfy the probabilistic properties described in Section 3. Our analysis bears strong similarities with that of the STORM algorithm [13], but possesses significant differences induced by the use of probabilistic gradients rather than probabilistic fully linear models.

### 4.1 Assumptions and deterministic results

We will analyze Algorithm 1 under the following assumptions.

###### Assumption 4.1

is continuously differentiable on an open set containing the level set
, with Lipschitz continuous gradient, of Lipschitz constant .

We also require that the Jacobian model is uniformly bounded. Note that the result is assumed to hold for every realization of the algorithm, therefore such an assumption will be valid in both a deterministic and random context.

###### Assumption 4.2

There exists such that for all and all realizations of the -th model Jacobian, one has:

 ∥Jmj∥≤κJm.

Additionally, we assume that the subproblem is approximately solved so that a fraction of a Cauchy decrease is satisfied for the model.

###### Assumption 4.3

There exists such that for every iteration of every realization of the algorithm,

 mj(xj)−mj(xj+sj)≥θfcd2∥gmj∥2∥Jmj∥2+γj. (13)

We will also assume the following bounds hold.

###### Assumption 4.4

At each iteration and for each realization of the algorithm, the step size satisfies

 ∥sj∥≤2∥gmj∥γj=2μj, (14)

and there exists such that

 |s⊤j(γjsj+gmj)|≤4∥Jmj∥2∥gmj∥2+2θin∥gmj∥2γ2j=4∥Jmj∥2+2θinμ2j. (15)

Several choices for the approximate minimization of that verify (13),(14) and (15) can be proposed; in particular, the result holds for steps computed via a truncated Conjugate Gradient algorithm (initialized with the null vector) applied to the quadratic [6, Lemma 5.1].

###### Lemma 4.1

Let Assumptions 4.1, 4.2, and 4.4 hold for a realization of Algorithm 1. Consider the -th iteration of that realization, and suppose that is -first-order accurate. Then,

 ∣∣f(xj+sj)−mj(xj+sj)∣∣≤κefsμ2j, (16)

where .

Proof. Using Assumptions 4.1, 4.2, and 4.4 within a Taylor expansion of the function around , we obtain:

 ∣∣f(xj+sj)−mj(xj+sj)∣∣ ≤ ∣∣∣f(xj)+∇f(xj)⊤sj+ν2∥sj∥2−mj(xj)−g⊤mjsj−12s⊤jJ⊤mjJmjsj∣∣∣ ≤ ∣∣f(xj)−mj(xj)∣∣+∣∣∣(∇f(xj)−gmj)⊤sj∣∣∣+ν2∥sj∥2+12∥J⊤mjJmj∥∥sj∥2 ≤ κefμ2j+κegμj∥sj∥+ν+κ2Jm2∥sj∥2 ≤ κef+2κeg+ν+4κ2Jm21μ2j,

hence the result.

Lemma 4.1 illustrates that our accuracy requirements are enough to guarantee accuracy of any computed step.

We now state various results holding for a realization of Algorithm 1 that do not make direct use of the probabilistic nature of the method. These will be instrumental in proving Theorem 4.1.

###### Lemma 4.2

Let Assumptions 4.1, 4.2, 4.3, and 4.4 hold for a realization of Algorithm 1, and consider its -th iteration. . If the model is -first-order accurate and

 μj≥max{κ2Jm,8(κef+κefs)η1θfcd}1∥gmj∥, (17)

then the trial step satisfies

 f(xj+sj)−f(xj)≤−η1θfcd8∥gmj∥μj. (18)

Proof. Since the model is -first-order accurate, we have:

 f(xj+sj)−f(xj) = f(xj+sj)−m(xj+sj)+m(xj+sj)−mj(xj)+mj(xj)−f(xj) ≤ κefsμ2j+m(xj+sj)−mj(xj)+κefμ2j ≤ κef+κefsμ2j−η1θfcd2∥gmj∥2κ2Jm+γj = κef+κefsμ2j−η1θfcd2∥gmj∥2κ2Jm+μj∥gmj∥,

where we used the result of Lemma 4.1 and Assumption 4.3. Using the first part of (17), we have and thus

 f(xj+1)−f(xj) ≤ κef+κefsμ2j−η1θfcd2∥gmj∥2κ2Jm+μj∥gmj∥ ≤ κef+κefsμ2j−η1θfcd2∥gmj∥22μj∥gmj∥ = 1μj[κef+κefsμj−η1θfcd4∥gmj∥] ≤ 1μj[−η1θfcd8∥gmj∥],

where the second part of the maximum in (17) was used in the last line, yielding the expected result.

The next result is a consequence of Lemma 4.2.

###### Lemma 4.3

Let the assumptions of Lemma 4.2 hold. If is -first-order accurate and

 μj≥(κeg+max{κ2Jm,8(κef+κefs)η1θfcd})1∥∇f(xj)∥, (19)

then the trial step satisfies

 f(xj+sj)−f(xj)≤−C1∥∇f(xj)∥μj, (20)

where .

Proof. Since the model is -first-order accurate, we have:

 ∥∇f(xj)∥≤∥∇f(xj)−gmj∥+∥gmj∥≤κegμj+∥gmj∥. (21)

Using (19) to bound the left-hand side, we obtain:

 κeg+max{κ2Jm,8(κef+κefs)η1θfcd}μj≤κegμj+∥gmj∥,

which gives . We are thus in the assumptions of Lemma 4.2, and (18) holds.

Using again the fact that the model is -first-order accurate together with (17) and (21), we have:

 ∥∇f(xj)∥≤κegμj+∥gmj∥≤κegκeg+max{κ2Jm,8(κef+κefs)η1θfcd}∥∇f(xj)∥+∥gmj∥,

 ∥gmj∥≥max{κ2Jm,8(κef+κefs)η1θfcd}κeg+max{κ2Jm,8(κef+κefs)η1θfcd}∥∇f(xj)∥.

Combining this relation with (18) finally gives (20).

###### Lemma 4.4

Let Assumptions 4.1, 4.2, 4.3 and 4.4 hold. Consider the -th iteration of a realization of Algorithm 1 such that is not a critical point of .
Suppose further that is -first-order accurate, is -accurate, and

 μj≥max⎧⎪ ⎪⎨⎪ ⎪⎩α+√α2+4ακ2Jm(1−η1)2(1−η1),η2⎫⎪ ⎪⎬⎪ ⎪⎭1∥gmj∥\lx@stackrelΔ=κμg∥gmj∥, (22)

holds, where . Then, the -th iteration is successful (i.e. and ).

Proof. To simplify the notations, we will omit the indices in the proof.

 ∣∣∣1−ρ2∣∣∣ = ∣∣∣1−12f0−f1m(x)−m(x+s)∣∣∣ = = ∣∣−g⊤ms−12s⊤(J⊤mJm+γI)s−12f0+12f1∣∣|m(x)−m(x+s)| = ∣∣12(f1−f0−g⊤ms−s⊤J⊤mJms)−12s⊤(gm+γs)∣∣|m(x)−m(x+s)| ≤ 12∣∣f1−f0−g⊤ms−s⊤J⊤mJms∣∣+12|s⊤(gm+γs)||m(x)−m(x+s)|.

We look in more detail at the first term arising in the numerator; we have:

 ∣∣f1−f0−g⊤ms−s⊤J⊤mJms∣∣ = ∣∣f1−f(x+s)+f(x+s)−f(x)+f(x)−f0−g⊤ms−12s⊤J⊤mJms∣∣ = ∣∣∣f1−f(x+s)+∇f(x)⊤s+∫10(∇f(x+ts)−∇f(x))⊤s +f(x)−f0−g⊤ms−12s⊤J⊤mJms∣∣ = ∣∣∣f1−f(x+s)+[∇f(x)−gm]⊤s+∫10(∇f(x+ts)−∇f(x))⊤s +f(x)−f0−12s⊤J⊤mJms∣∣∣ ≤ ∣∣f1−f(x+s)∣∣+∣∣[∇f(x)−gm]⊤s∣∣ +∫10∥∇f(x+ts)−∇f(x)∥∥s∥dt+∣∣f(x)−f0∣∣+12|s⊤J⊤mJms| ≤ εfμ2+κeg∥s∥μ+ν2∥s∥2+εfμ2+12∥Jm∥2∥s∥2 = 2εfμ2+κeg∥s∥μ+ν+κ2Jm2∥s∥2.

Thus, we obtain:

 ∣∣∣1−ρ2∣∣∣ ≤ εfμ2+κeg∥s∥2μ+ν+κ2Jm4∥s∥2+12s⊤(gm+γs)m(x)−m(x+s).

Using Assumption 4.4 on the numerator and Assumption 4.3 on the denominator, we arrive at

 ∣∣∣1−ρ2∣∣∣ ≤ εfμ2+κeg∥s∥2μ+ν+κ2Jm4∥s∥2+s⊤(gm+γs)m(x)−m(x+s) ≤ εfμ2+κegμ2+ν+κ2Jmμ2+4κ2Jm+2θinμ2m(x)−m(x+s) ≤ (εf+κeg+ν+5κ2Jm+2θin)1μ2θfcd2∥gm∥2∥Jm∥2+γ = (εf+κeg+ν+5κ2Jm+2θin)∥gm∥2γ2θfcd2∥gm∥2κ2Jm+γ = (εf+κeg+ν+5κ2Jm+2θin)κ2Jm+γγ2 = ακ2Jm+γγ2.

As a result, we have

 ∣∣∣1−ρ2∣∣∣≥1−η1 ⇒ ακ2Jm+γγ2≥1−η1 ⇔ 0≥(1−η1)γ2−αγ−ακ2Jm.

Since the right-hand side is a second-order polynomial in , this gives

 γ≤α+√α2+4ακ2Jm(1−η1)2(1−η1)⇔μ≤α+√α2+4ακ2Jm(1−η1)2(1−η1)1∥gm∥.

But this contradicts (22), from which we conclude that we necessarily have , and thus . Since as a direct consequence of (22), the iteration is a successful one, and the parameter is not increased.

We point out that Lemma 4.4 only involves the accuracy requirements on the model gradient, thanks to the accuracy of the function estimates.

###### Lemma 4.5

Let Assumptions 4.1, 4.2, 4.3, and 4.4 hold. Consider a successful iteration of index for a realization of Algorithm 1, such that is not a critical point of .
Suppose further that is -accurate, with

 η2≥max{κ2Jm,8εfη1θfcd} (23)

Then, one has:

 f(xj+sj)−f(xj)≤−C2μ2j. (24)

where .

Proof. By definition of a successful iteration and using the accuracy properties of the models and the estimates, we have

 f(