Consensus-Based Optimization on the Sphere I: Well-Posedness and Mean-Field Limit

# Consensus-Based Optimization on the Sphere I: Well-Posedness and Mean-Field Limit

## Abstract

We introduce a new stochastic Kuramoto-Vicsek-type model for global optimization of nonconvex functions on the sphere. This model belongs to the class of Consensus-Based Optimization methods. In fact, particles move on the sphere driven by a drift towards an instantaneous consensus point, computed as a convex combination of the particle locations weighted by the cost function according to Laplace’s principle. The consensus point represents an approximation to a global minimizer. The dynamics is further perturbed by a random vector field to favor exploration, whose variance is a function of the distance of the particles to the consensus point. In particular, as soon as the consensus is reached, then the stochastic component vanishes. In this paper, we study the well-posedness of the model and we derive rigorously its mean-field approximation for large particle limit.

Keywords: consensus-based optimization, stochastic Kuramoto-Vicsek model, well-posedness, mean-field limit

## 1 Introduction

Over the last decades, large systems of interacting particles (or agents) are widely used to investigate self-organization and collective behavior. They frequently appear in modeling phenomena such as biological swarms [15, 19], crowd dynamics [2, 8, 9], self-assembly of nanoparticles [38] and opinion formation [3, 45, 35]. Similar particle models are also used in metaheuristics [1, 7, 11, 31], which provide empirically robust solutions to tackle hard optimization problems with fast algorithms. Metaheuristics are methods that orchestrate an interaction between local improvement procedures and global/high level strategies, and combine random and deterministic decisions, to create a process capable of escaping from local optima and performing a robust search of a solution space. Starting with the groundbreaking work of Rastrigin on Random Search in 1963 [49], numerous mechanisms for multi-agent global optimization have been considered, among the most prominent instances we recall the Simplex Heuristics [46], Evolutionary Programming [27], the Metropolis-Hastings sampling algorithm [34], Genetic Algorithms [36], Particle Swarm Optimization (PSO) [40, 48], Ant Colony Optimization (ACO) [23], Simulated Annealing (SA), [37, 41]. Despite the tremendous empirical success of these techniques, most metaheuristical methods still lack proper mathematical proof of robust convergence to global minimizers. In fact, due to the random component of metaheuristics, their asymptotic analysis would require to discern the stochastic dependencies, which is a dounting task, especially for those methods that combine instantaneous decisions with memory mechanisms.

Recent work [47, 14] on Consensus-Based Optimization (CBO) focuses on instantaneous stochastic and deterministic decisions in order to establish a consensus among particles on the location of the global minimizers within a domain. In view of the instantaneous nature of the dynamics, the evolution of the system can be interpreted as a system of first order stochastic differential equations, whose large particle limit is approximated by a deterministic partial differential equation of mean-field type. The large time behavior of such a deterministic PDE can be analyzed by classical techniques of large deviation bounds and the global convergence of the mean-field model can be mathematically proven in a rigorous way for a large class of optimization problems. Certainly CBO is a significantly simpler mechanism with respect to more sophisticated metaheuristics, which can include different features including memory of past exploration. Nevertheless, it seems to be powerful and robust enough to tackle many interesting nonconvex optimizations [16, 28, 33], which would be harder to solve by gradient descent methods that have been dominating the field of optimization, especially in relevant applications such as machine learning. In fact in many of these problems the objective function is not differentiable and CBO do not use any gradient information for their exploration. Moreover, in certain models, such as training of deep neural networks, the gradient tends to explode or vanish [10]. Finally, although there exist situations where global optimization by gradient descent methods can be ensured under ad hoc conditions, see, e.g., [17, 43, 42], they do not offer in general guarantees of global convergence. Instead, CBO has potential of a rather general and rigorous global asymptotic/convergence analysis. Some theoretical gaps remain open in the analysis of CBO though, in particular the rigorous derivation of the mean-field limit, due to the difficulty in establishing bounds on the moments of the probability distribution of the particles [14].

Motivated by these theoretical gaps and several potential applications in machine learning, the main task of the present work is to develop a CBO approach to solve the following constrained optimization problem

 v∗∈argminv∈Sd−1E(v), (1.1)

where is a given continuous cost function, which we wish to minimize over the sphere. In particular, we consider a system of interacting particles satisfying the following stochastic Kuramoto-Vicsek-type dynamics expressed in Itô’s form

 dVit =λP(Vit)vα,E(ρNt)dt+σ|Vit−vα,E(ρNt)|P(Vit)dBit−σ22(Vit−vα,E(ρNt))2(d−1)Vit|Vit|2dt, (1.2)

where is a stuitable drift parameter, a diffusion parameter,

 ρNt=1NN∑i=1δVit, (1.3)

is the empirical measure of the particles ( is the Dirac measure at ), and

 vα,E(ρNt)=∑Nj=1Vjte−α~E(Vjt)∑Nj=1e−α~E(Vjt)=∫RdvωEα(v)dρNt∫RdωEα(v)dρNt with ωEα(v):=e−αE(v). (1.4)

For ease of notation, for any vector we may write to mean and is the Euclidean norm. This stochastic system is considered complemented with independent and identical distributed (i.i.d.) initial data with , and the common law is denoted by . The trajectories denote independent standard Brownian motions in . In (1.2) the projection operator is defined by

 P(v)=I−vvT|v|2. (1.5)

The choice of the weight function in (1.4) comes from the well-known Laplace principle [44, 22, 47], a classical asymptotic method for integrals, which states that for any probability measure , it holds

 limα→∞(−1αlog(∫Rde−αE(v)dρ(v)))=infv∈suppρE(v). (1.6)

Let us discuss the mechanism of the dynamics. The right-hand-side of the equation (1.2) is made of three terms. The first deterministic term imposes a drift to the dynamics towards , which is the current consensus point at time as approximation to the global minimizer. The second stochastic term introduces a random decision to favor exploration, whose variance is function of the distance of the particles to the current consensus points. In particular, as soon as consensus is reached, then the stochastic component vanishes. The last term combined with is needed to ensure that the dynamics stays on the sphere despite the Brownian motion component. In fact, we will initially analyze the well-posedness of the system (1.2) as defined in the whole space instead of being immediately considered constrained on the sphere , and we will check afterwards that, if the initial data is set on the sphere, then the particles remain there at all times. We further notice that the dynamics does not make use of any derivative of , but only of its pointwise evaluations, which appear integrated in (1.4). Hence, the equation can be in principle numerically implemented at discrete times also for cost functions which are just continuous and with no further smoothness. We require below more regularity to exclusively to ensure formal well-posedness of the evolution.

The main results of this paper are about the well-posedness of (1.2) and its rigorous mean-field limit - which is an open issue for unconstrained CBO [14] - to the following nonlocal, nonlinear Fokker-Planck equation

 ∂tρt=λ∇Sd−1⋅((⟨vα,E(ρt),v⟩v−vα,E(ρt))ρt)+σ22ΔSd−1(|v−vα,E(ρt)|2ρt),t>0,\leavevmode\nobreak v∈Sd−1, (1.7)

with the initial data . Here is a Borel probability measure on and

The operators and denote the divergence and Laplace-Beltrami operator on the sphere respectively. The mean-field limit will be achieved through the coupling method [50, 25, 39, 12] by introducing an auxiliary mono-particle process, satisfying the self-consistent nonlinear SDE

 d¯¯¯¯Vt=λP(¯¯¯¯Vt)vα,E(ρt)dt+σ|¯¯¯¯Vt−vα,E(ρt)|P(¯¯¯¯Vt)dBt−(d−1)σ22(¯¯¯¯Vt−vα,E(ρt))2¯¯¯¯Vt|¯¯¯¯Vt|2dt, (1.8)

with the initial data distributed according to . Here we require , and we will show that as a measure on solves the PDE (1.7). We call the SDE (1.8) the mean-field dynamic, and the PDE (1.7) is called the mean-field PDE. In our follow up paper [28], for more regular datum , we also prove existence and uniqueness of distributional solutions at any finite time . This model and the analysis we provide in this paper are very much inspired by the homogenous version of the kinetic Kolmogorov-Kuramoto-Vicsek model, which was formally derived by Degond and Motsch [21] as a mean-field limit of the discrete Vicsek model [4, 18, 51]. Recently, the stochastic Vicsek model has received extensive attention in the mathematical community to establish its mean-field limit, hydrodynamic limit, and phase transitions. Bolley, Cañizo and Carrillo [13] have rigorously justified the mean-field limit in case of smoothed force field, and Gamba and Kang [30] and Figalli, Kang, and Morales [26] extended the global existence and uniqueness of weak solutions of the kinetic Kolmogorov-Kuramoto-Vicsek model to the case of singular mean-field force field. Recently, Degond, Frouvelle and Liu [20] provided also a complete and rigorous description of phase transitions such as the number and nature of equilibria, stability, convergence rate, phase diagram and hysteresis. Our results build very much upon these developments.

Before starting our analysis of (1.2), (1.8), and (1.7), let us illustrate numerically the behavior of the dynamics in the case of the minimum solution of the Ackley function for constrained over the sphere

 E(v)=−Aexp⎛⎜⎝−a ⎷b2dd∑k=1(vk−v∗k)2⎞⎟⎠−exp(1dd∑k=1cos(2πb(vk−v∗k)))+e+B, (1.9)

with , , , and with .

The global minimum is attained at . In Figure 1 we report the standard Ackley function on the plane for and the case constrained over the half sphere .

In the reported simulations we initialize the particles with a uniform distribution over the half sphere characterized by and employ a simple Euler-Maruyama scheme with projection and time step . We report in Figure 2 the particle trajectories for in the case of , , and . On the left we consider the case with minimum at , on the right the case with minimum at . The time evolution of the particle distribution in the numerical mean field limit for is also reported in the upper part of the same figure.

As these figures show it, the dynamics consistently converges to the global minimizers totally regardless of the many local minimizers and the large particle system does converges numerically to its mean-field approximation.

Let us conclude this introduction by mentioning that the optimization on the sphere offers further numerous advantages, besides allowing a rigorous proof of mean-field limit. First of all a vast class of optimization problems can be reduced to constrained optimizations over the sphere: in the companion paper [28], where we focus on the numerical implementation of the method and its asymptotic/convergence analysis to global minimizers, we present also a few relevant and challenging applications in signal processing and machine learning. Due to compactness of the sphere, local smoothness and boundedness requirements on are necessarily a uniform and global property. However, against these properties that greatly simplify the analysis of the well-posedness of the system and its mean-field limit, the specific topology of the sphere makes it surprisingly harder to prove asymptotic convergence of the dynamics to global miminizers, requiring major technical variations with respect to the approach of unconstrained CBO [14]. We refer to the companion paper [28] for the details of the large time analysis of solutions of (1.7) and examples of applications in high-dimension.

The rest of the paper is organized as follows: in Section 2 we show that the equations (1.2), (1.7) and (1.8) are well-posed, that is, there is a unique solution for any initial distribution , which depends continuously on the initial datum, and in Section 3 we address the mean-field limit.

## 2 Well-posedness

This section focuses on proving the well-posedness for the particle system (1.2), the mean-field dynamic (1.8) and the mean-field PDE (1.7). To ensure this, we make the following smoothness assumption on the objective function .

###### Assumption 2.1.

The objective function is locally Lipschitz continuous.

### 2.1 Well-posedness for the interacting particle system (1.2)

The difficulty in showing first the well-posedness of (1.2) in the ambient space is that the projection is not defined for (singularity) and is unbounded for (blow-up) . In order to overcome this problem, we regularize the diffusion and drift coefficients, that is, we replace them with appropriate functions and respectively: let be a matrix valued map on with bounded derivatives of all orders such that for all , and be a valued map on , again with bounded derivatives of all orders, such that if . It is also useful to mention that

 P(v)v=0 (2.1)

and

 v⋅P(v)y=0 (2.2)

hold for any . For later use, let us denote . Additionally, we regularize the locally Lipschitz function outside : Let us introduce satisfying the following assumptions

###### Assumption 2.2.

The regularized extension function is globally Lipschitz continuous and satisfies the properties

• when , and when ;

• for all for a suitable global Lipschitz constant ;

•  .

Given such , and from Assumption 2.2, we introduce the following regularized particle system

 dVit =P1(Vit)vα,~E(ρNt)dt+σ|Vit−vα,~E(ρNt)|P1(Vit)dBit−(d−1)σ22(Vit−vα,~E(ρNt))2P2(Vit)dt, (2.3)

for , where

 vα,~E(ρNt)=∫Rdvω~Eα(v)dρNt∫Rdω~Eα(v)dρNt,ω~Eα(v)=e−α~E(v). (2.4)

Next we can deduce that the coefficients in (2.3) are locally Lipschitz continuous and have linear growth. More precisely, we have the following result.

###### Lemma 2.1.

Let , be arbitrary and satisfy Assumption 2.2. Then for any , and corresponding empirical measures and it holds

 |vα,~E(ρN)|≤1NCα,~E∥VN∥1 (2.5)

and

 |vα,~E(ˆρN)−vα,~E(ρN)|≤(Cα,~EN+2αLCα,~EN∥ˆVN∥∞)∥VN−ˆVN∥1, (2.6)

where . Here we used the notations for norms of vectors and .

###### Proof.

First, one notes that

 e−α¯¯¯~E≤ω~Eα(Vj)=e−α~E(Vj)≤e−α~E––. (2.7)

Then we have

 |vα,~E(ρN)|=∣∣ ∣∣∑Nj=1Vje−α~E(Vj)∑Nj=1e−α~E(Vj)∣∣ ∣∣≤1Neα(¯¯¯~E−~E––)N∑j=1|Vj|=1NCα,~E∥VN∥1，

where . Next we split the error

 vα,~E(ˆρN)−vα,~E(ρN) =∑Nj=1(Vj−ˆVj)ω~Eα(Vj)∑Nj=1ω~Eα(Vj)+∑Nj=1ˆVj(ω~Eα(Vj)−ω~Eα(ˆVj))∑Nj=1ω~Eα(Vj) +∑Nj=1ˆVjω~Eα(ˆVj)∑Nj=1(ω~Eα(ˆVj)−ω~Eα(Vj))(∑Nj=1ω~Eα(Vj))(∑Nj=1ω~Eα(ˆVj)) =:I1+I2+I3. (2.8)

For , the previous computation yields

 |I1|≤1NCα,~E∥VN−ˆVN∥1, (2.9)

Notice that it holds

 |ω~Eα(Vj)−ω~Eα(ˆVj)|=|e−α~E(Vj)−e−α~E(ˆVj)|≤αe−α~E––|~E(Vj)−~E(ˆVj)|≤αLe−α~E––|Vj−ˆVj|.

Thus we conclude that

 max{|I2|,|I3|}≤αLCα,~EN∥ˆVN∥∞∥VN−ˆVN∥1. (2.10)

Collecting estimates (2.9) and (2.10) completes the proof. ∎

###### Theorem 2.1.

Let be a probability measure on and, for every , be i.i.d. random variables with the common law . For every , there exists a pathwise unique strong solution to the particle system (1.2) with the initial data . Moreover it holds that for all and any .

###### Proof.

Given , and , the SDE (2.3) has locally Lipschitz coefficients, so it admits a pathwise unique local strong solution by standard SDE well-posedness result [24, Chap. 5, Theorem 3.1]. Moreover, it follows from Itô’s formula (5.1) (choosing ), as long as , that

 d|Vit|2 =2Vit⋅P(Vit)vα,~E(ρNt)dt+2σ|Vit−vα,~E(ρNt)|Vit⋅P(Vit)dBit −σ2(d−1)(Vit−vα,~E(ρNt))2dt+12d∑k=12σ2(Vit−vα,~E(ρNt))2(1−(Vi,kt)2|Vit|2)dt =−σ2(d−1)(Vit−vα,~E(ρNt))2dt+σ2(d−1)(Vit−vα,~E(ρNt))2dt=0, (2.11)

where is the -th component of , and we have used the property of as in (2.2). Hence for all , which ensures that the solution keeps bounded at finite time, hence we have a global solution. Since all have norm 1, the solution to the regularized system (2.3) is a solution to (1.2), which provides the global existence of solutions to (1.2).

To show pathwise uniqueness let us consider two solutions to (1.2) for the same initial distribution and Brownian motion. According to the above computation these two solutions stay on the sphere for any , hence they are solutions to the regularised system (2.3), whose solutions are pathwise unique due to the locally Lipschitz continuous coefficients. Hence we have uniqueness for solutions to (1.2). ∎

### 2.2 Well-posedness for the mean-field dynamic (1.8)

For readers’ convenience, we give a brief introduction of the Wasserstein metric in the following definition, we refer to [5] for more details.

###### Definition 2.1 (Wasserstein Metric).

Let and be the space of Borel probability measures on with finite second moment. We equip this space with the Wasserstein distance

 Wpp(μ,ν):=inf{∫Rd×Rd|z−^z|p dπ(μ,ν) ∣∣ π∈Π(μ,ν)} (2.12)

where denotes the collection of all Borel probability measures on with marginals and in the first and second component respectively. The Wasserstein distance can also be expressed as

 Wpp(μ,ν)=inf{E[|Z−¯¯¯¯Z|p]} (2.13)

where the infimum is taken over all joint distributions of the random variables , with marginals , respectively.

Notice now that, for any

 (2.14)

A direct application of above leads to

 vα,~E(ρ):=∫Rdvω~Eα(v)dρ∫Rdω~Eα(v)dρ=∫Rdve−α~E(v)dρ∫Rde−α~E(v)dρ≤Cα,~E∫Rd|v|dρ≤Cα,~E(1+m2)2 (2.15)

with .

###### Lemma 2.2.

Assume that (with compact support), then the following stability estimate holds

 |vα,~E(ρ)−vα,~E(ˆρ)|≤CWp(ρ,ˆρ), (2.16)

for any , where .

###### Proof.

Let us compute the difference

 vα,~E(ρ)−vα,~E(ˆρ) =∫Rdve−α~E(v)dρ(v)∥e−α~E∥L1(ρ)−∫Rdˆve−α~E(ˆv)dˆρ(ˆv)∥e−α~E∥L1(ˆρ) =∬Rd×Rdve−α~E(v)∥e−α~E∥L1(ρ)−ˆve−α~E(ˆv)∥e−α~E∥L1(ˆρ):=h(v)−h(ˆv)dπ(v,ˆv)

where is an arbitrary coupling of and . We can write the integrand as follows

 h(v)−h(ˆv) =(v−ˆv)e−α~E(ˆv)∥e−α~E∥L1(ρ)=:I1+ˆv(e−α~E(v)−e−α~E(ˆv))∥e−α~E∥L1(ρ)=:I2+ˆve−α~E(ˆv)∥e−α~E∥L1(ρ)−ˆve−α~E(ˆv)∥e−α~E∥L1(ˆρ):=I3

where the last two terms on the right hand side can be rewritten as

 ˆve−α~E(ˆv)∥e−α~E∥L1(ρ)−ˆve−α~E(ˆv)∥e−α~E∥L1(ˆρ) =ˆve−α~E(ˆv)∫e−α~E(ˆv) dˆρ(ˆv)∥e−α~E∥L1(ρ)∥e−α~E∥L1(ˆρ)−ˆve−α~E(ˆv)∫e−α~E(v) dρ(v)∥e−α~E∥L1(ˆρ)∥e−α~E∥L1(ρ) =ˆve−α~E(ˆv)∬Rd×Rd(e−α~E(ˆv)−e−α~E(v))dπ(v,ˆv)∥e−α~E∥L1(ρ)∥e−α~E∥L1(ˆρ)=:I3.

Under Assumption 2.2, one can obtain the bounds

 |I1|≤e−α~E––∥e−α~E∥L1(ρ)|v−ˆv|≤Cα,~E|v−ˆv|, (2.17)

and

 |I2|≤|ˆv|αe−α~E––C′|v−ˆv|∥e−α~E∥L1(ρ)≤αCα,~EL|ˆv||v−ˆv|. (2.18)

Similar to the estimate of , we estimate

 |I3|≤αC2α,~EL|ˆv|∬Rd×Rd|v−ˆv|dπ(v,ˆv). (2.19)

Notice that , so for any one has

 max{∫Rd|v|pdρ(v),∫Rd|ˆv|pdˆρ(ˆv)}<∞. (2.20)

Collecting the above estimates, we obtain

 |vα,~E(ρ)−vα,~E(ˆρ)| ≤Cα,~E∬Rd×Rd|v−ˆv|dπ(v,ˆv)+αCα,~EL∬Rd×Rd|ˆv||v−ˆv|dπ(v,ˆv) +αC2α,~EL∫Rd|ˆv|dˆρ(ˆv)∬Rd×Rd|v−ˆv|dπ(v,ˆv) ≤C(∬Rd×Rd|v−ˆv|pdπ(v,ˆv))1p, (2.21)

where depends only on and . Lastly, optimizing over all couplings yields (2.16). ∎

The following theorem states the well-posedness for the mean-field dynamic (1.8).

###### Theorem 2.2.

Let satisfy Assumption 2.1. Then there exists a unique process , , satisfying the nonlinear SDE (1.8)

 d¯¯¯¯Vt=λP(¯¯¯¯Vt)vα,E(ρt)dt+σ|¯¯¯¯Vt−vα,E(ρt)|P(¯¯¯¯Vt)dBt−(d−1)σ22(¯¯¯¯Vt−vα,E(ρt))2¯¯¯¯Vt|¯¯¯¯Vt|2dt,

in strong sense for any initial data distributed according to , where

 vα,E(ρt)=∫Rdve−αE(v)dρt∫Rde−αE(v)d