A Mean Field Approach for Optimization in Particles Systems and Applications

# A Mean Field Approach for Optimization in Particles Systems and Applications

Nicolas Gast    Bruno Gaujal

INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE


Theme NUM: Numerical Systems

Project-Team MESCAL

Research Report No. 6877, version 2, initial version March 2009, revised version June 2009

Abstract: This paper investigates the limit behavior of Markov decision processes (MDPs) made of independent particles evolving in a common environment, when the number of particles goes to infinity.

In the finite horizon case or with a discounted cost and an infinite horizon, we show that when the number of particles becomes large, the optimal cost of the system converges almost surely to the optimal cost of a deterministic system (the “optimal mean field”). Convergence also holds for optimal policies.

We further provide insights on the speed of convergence by proving several central limit theorems for the cost and the state of the Markov decision process, with explicit formulas for the variance of the limit Gaussian laws.

Then, our framework is applied to a brokering problem in grid computing. The optimal policy for the limit deterministic system is computed explicitly. Several simulations with growing numbers of processors are reported. They compare the performance of the optimal policy of the limit system used in the finite case with classical policies (such as Join the Shortest Queue) by measuring its asymptotic gain.

Key-words: Markov Decision Processes, Mean Field, Optimization, Particles System, Grid Broker

A Mean Field Approach for Optimization in Particle Systems and Its Applications

Summary: This article examines the limit behavior of Markov decision processes made of independent particles evolving in a common environment, as the number of particles tends to infinity.

For a finite-horizon cost, or for a discounted infinite-horizon cost, we show that when the number of particles becomes large, the optimal cost of the system converges almost surely to the optimal cost of the deterministic system. Convergence also holds for the optimal policies.

In addition, we give insight into the speed of convergence by proving several central limit theorems for the cost and for the mean state of the process, with explicit formulas for the variance of the limiting Gaussian laws.

Finally, this model is applied to a resource management problem in computational grids. We give an explicit algorithm to compute the optimal policy of the limit, and then study several simulations with a varying number of processors. We compare the performance of the optimal policy of the limit, applied to the original system, with several classical policies (such as Join the Shortest Queue). We measure the asymptotic gain as well as the threshold beyond which it outperforms the classical policies.

Keywords: Markov decision processes, mean field, optimization, particle systems, resource manager

## 1 Introduction

The general context of this paper is the optimization of the behavior of controlled Markovian systems, namely Markov Decision Processes composed of a large number of particles evolving in a common environment.

Consider a discrete time system made of $N$ particles, $N$ being large, that evolve randomly and independently (according to a transition probability kernel $K$). At each step, the state of each particle changes according to a probability kernel that depends on the environment. The evolution of the environment only depends on the number of particles in each state. Furthermore, at each step, a central controller makes a decision that changes the transition probability kernel. The problem addressed in this paper is to study the limit behavior of such systems when $N$ becomes large, and the speed of convergence to the limit.

Several papers ([benaim:cmf], [bordenave2007psi]) study the limit behavior of Markovian systems in the case of vanishing intensity (the expected number of transitions per time slot is $o(N)$). In these cases, the system converges to a differential system in continuous time. In the case considered here, time remains discrete at the limit. This requires a rather different approach to construct the limit.

In [boudec2007gmf], discrete time systems are considered and the authors show that under certain conditions, as $N$ grows large, a Markovian system made of $N$ particles converges to a deterministic system. Since a Markov decision process can be seen as a family of Markovian kernels, the class of systems studied in [boudec2007gmf] corresponds to the case where this family is reduced to a unique kernel and no decision can be made. Here, we show that under conditions similar to those of [boudec2007gmf], a Markov decision process also converges to a deterministic one. More precisely, we show that the optimal costs (as well as the corresponding states) converge almost surely to the optimal costs (resp. the corresponding states) of a deterministic system (the “optimal mean field”).

From a practical point of view, this allows one to compute the optimal policy in a deterministic system, which can often be done very efficiently, and then to use this policy in the original random system as a good approximation of the optimal policy, which cannot be computed efficiently because of the curse of dimensionality. This is illustrated by an application of our framework to optimal brokering in computational grids. We consider a set of multi-processor clusters (forming a computational grid, like EGEE [EGEE]) and a set of users submitting tasks to be executed. A central broker assigns the tasks to the clusters (where tasks are buffered and served in FIFO order) and tries to minimize the average processing time of all tasks. Computing the optimal policy (solving the associated MDP) is known to be hard [Tsitsiklis]. Numerical computations can only be carried out up to a total of 10 processors and two users. However, our approach shows that when the number of processors per cluster and the number of users submitting tasks grow, the system converges to a mean field deterministic system. For this deterministic mean field system, the optimal brokering policy can be explicitly computed. Simulations reported in Section 4 show that, when this policy is used over a grid with a growing number of processors, the performance converges to the optimal sojourn time of the deterministic system, as expected. Simulations also show that this deterministic static policy outperforms classical dynamic policies such as Join the Shortest Queue as soon as the total number of processors and users is over 50.

In general, how good the deterministic approximation is and how fast convergence takes place can also be estimated. To that end, we provide bounds on the speed of convergence by proving central limit theorems for the state of the system under the optimal policy as well as for the cost function.

## 2 Notations and definitions

The system is composed of $N$ particles. There are $S$ possible states for each particle; the state space is denoted by $\mathcal{S} = \{1, \dots, S\}$. The state of the $n$th particle at time $t$ is denoted $X^N_n(t)$. We assume that the particles are distinguishable only through their state and that the dynamics of the system is homogeneous in $N$. In other words, the behavior of the system only depends on the proportion of particles in every state $i$. For all $i \in \mathcal{S}$, $(M^N_t)_i$ is the proportion of particles in state $i$ at time $t$, and we denote by $M^N_t$ the vector $((M^N_t)_1, \dots, (M^N_t)_S)$. The set of possible values for $M^N_t$ is the set of probability measures $p$ on $\mathcal{S}$ such that $N p(i) \in \mathbb{N}$ for all $i \in \mathcal{S}$, denoted by $\mathcal{P}_N(\mathcal{S})$. For each $N$, $\mathcal{P}_N(\mathcal{S})$ is a finite set. When $N$ goes to infinity, it converges to the set $\mathcal{P}(\mathcal{S})$ of probability measures on $\mathcal{S}$.

The system of particles evolves depending on their common environment. We call $C^N_t \in \mathbb{R}^d$ the context of the environment. Its evolution depends on the mean state of the particles $M^N_{t+1}$, on its own value at the previous time slot, and on the action $a_t$ chosen by the controller (see below):

$$C^N_{t+1} = g(C^N_t, M^N_{t+1}, a_t),$$

where $g$ is a continuous function.

### 2.1 Actions and policies

At each time $t$, the system’s state is $(M^N_t, C^N_t)$. The decision maker may choose an action $a_t$ from the set of possible actions $\mathcal{A}$. $\mathcal{A}$ is assumed to be a compact set (finite or infinite). The action determines how the system will evolve. For an action $a$ and an environment $C$, we have a transition probability kernel $K(a, C)$ such that the probability that a particle goes from state $i$ to state $j$ is $K_{i,j}(a, C)$:

$$\mathbb{P}\left(X^N_n(t+1) = j \mid X^N_n(t) = i,\ a_t = a,\ C^N_t = C\right) = K_{i,j}(a, C).$$

The evolutions of the particles are supposed to be independent once $C$ is given. Moreover, we assume that $K_{i,j}(a, C)$ is continuous in $a$ and $C$. The independence of the users is a rather common assumption in mean field models [boudec2007gmf]. However, other papers ([benaim:cmf], [bordenave2007psi]) have shown that similar results can be obtained using asymptotic independence only (see [Graham] for results of this type).
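To make the dynamics concrete, one time slot of the finite system can be simulated directly from a kernel and a context update. The sketch below is only an illustration under stated assumptions: the two-state kernel `example_kernel`, the update `example_g`, and the helper `step_particles` are hypothetical names and values chosen for this example, not taken from the paper.

```python
import random


def step_particles(counts, context, action, kernel, g, rng):
    """Advance the particle system by one time slot.

    counts[i] is the number of particles in state i (so M = counts / N).
    kernel(action, context) returns the row-stochastic matrix K(a, C).
    g(context, new_measure, action) returns the next context.
    """
    n_states = len(counts)
    n_particles = sum(counts)
    rows = kernel(action, context)
    new_counts = [0] * n_states
    for i in range(n_states):
        if counts[i] == 0:
            continue
        # Given the context, every particle in state i jumps independently.
        for j in rng.choices(range(n_states), weights=rows[i], k=counts[i]):
            new_counts[j] += 1
    new_measure = [x / n_particles for x in new_counts]
    # The context is updated with the NEW empirical measure.
    return new_counts, g(context, new_measure, action)


# Hypothetical two-state example (assumption, not from the paper):
# the action is the probability of leaving state 0, and the probability
# of returning from state 1 decreases continuously with the context.
def example_kernel(action, context):
    back = 0.5 / (1.0 + context)
    return [[1.0 - action, action], [back, 1.0 - back]]


def example_g(context, measure, action):
    return 0.5 * context + 0.5 * measure[1]


rng = random.Random(0)
counts, context = step_particles([600, 400], 0.0, 0.3, example_kernel, example_g, rng)
print(counts, context)
```

Working with the vector of counts rather than with individual particle labels reflects the assumption that particles are distinguishable only through their state.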

Here, the focus is on Markov Decision Process theory and on the computation of optimal policies. A policy $\Pi = (\Pi_0, \Pi_1, \dots)$ specifies the decision rules to be used at each time slot. A decision rule $\Pi_t$ is a procedure that provides an action at time $t$. In general, $\Pi_t$ is a random measurable function that depends on the past events $(M^N_0, C^N_0), \dots, (M^N_t, C^N_t)$, but it can be shown that when the state space is finite and the action space is compact, deterministic Markovian policies (i.e. policies in which $\Pi_t$ depends deterministically on the current state only) are dominant. We will therefore only focus on them [puterman1994mdp].

### 2.2 Reward functions

To each possible state $(M^N_t, C^N_t)$ of the system at time $t$, we associate a reward $r_t(M^N_t, C^N_t)$. The reward is assumed to be continuous in $M$ and $C$. This function can be seen either as a reward, in which case the controller wants to maximize it, or as a cost, in which case the goal of the controller is to minimize it. In this paper, we will focus on two problems: finite-horizon reward and discounted reward.

In the finite-horizon case, we want to maximize the sum of the rewards over all time slots before $T$ plus a final reward that depends on the final state, $r_T(M^N_T, C^N_T)$. The expected reward of the policies $\Pi_0 \dots \Pi_T$ is:

$$V^N_{\Pi_0 \dots \Pi_T}(M^N_0, C^N_0) \stackrel{\text{def}}{=} \mathbb{E}\left[\sum_{t=1}^{T-1} r_t(M^N_t, C^N_t) + r_T(M^N_T, C^N_T)\right],$$

where the expectation is taken over all possible $(M^N_t, C^N_t)$ when the actions are $a_t = \Pi_t(M^N_t, C^N_t)$, for all $t$.

Let $\delta \in [0, 1)$ be a discount factor. The discounted reward associated to $\delta$ and the policy $\Pi_0, \Pi_1, \dots$ is the quantity:

$$V^N_{(\delta), \Pi_0 \dots}(M^N_0, C^N_0) \stackrel{\text{def}}{=} \mathbb{E}\left[\sum_{t=1}^{\infty} \delta^t r_t(M^N_t, C^N_t)\right].$$

Again, the expectation is taken over all possible $(M^N_t, C^N_t)$ when the action at time $t$ is $a_t = \Pi_t(M^N_t, C^N_t)$, for all $t$.

In both cases, the goal of the controller is to find a policy that maximizes the expected reward:

$$V^{*N}(M^N_0, C^N_0) \stackrel{\text{def}}{=} \sup_{\Pi_0 \dots \Pi_T} V^N_{\Pi_0 \dots \Pi_T}(M^N_0, C^N_0),$$

$$V^{*N}_{(\delta)}(M^N_0, C^N_0) \stackrel{\text{def}}{=} \sup_{\Pi_0 \dots} V^N_{(\delta), \Pi_0 \dots}(M^N_0, C^N_0).$$
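The two criteria are closely related when the rewards are uniformly bounded. As a sketch, assume that $|r_t(M, C)| \le R$ for all $t$, $M$ and $C$ (the constant $R$ is an assumption of this example, not a hypothesis stated in the text). Truncating the discounted sum at time $T$ then leaves only a geometric tail:

```latex
\left| \mathbb{E}\Big[ \sum_{t=T+1}^{\infty} \delta^{t}\, r_t(M^N_t, C^N_t) \Big] \right|
\;\le\; R \sum_{t=T+1}^{\infty} \delta^{t}
\;=\; \frac{R\,\delta^{T+1}}{1-\delta}.
```

For instance, with $\delta = 0.9$ and $R = 1$, the tail is below $0.01$ as soon as $T \ge 65$, uniformly in $N$ and in the policy.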

### 2.3 Summary of the assumptions

Here is the list of the assumptions under which all our results will hold, together with some comments on their tightness and their degree of generality and applicability.

• (A1) Independence of the users, Markov system: if at time $t$ the environment is $C$ and the action is $a$, then the behavior of each particle is independent of the other particles and its evolution is Markovian with kernel $K(a, C)$.

• (A2) Compact action set: the set of actions $\mathcal{A}$ is compact.

• (A3) Continuity of $K$, $g$, $r$: the mappings $K$, $g$ and $r_t$ are continuous deterministic functions, uniformly continuous in $a$.

• (A4) Almost sure initial state: almost surely, the initial state $(M^N_0, C^N_0)$ converges to a deterministic value $(m_0, c_0)$. Moreover, there exists $B < \infty$ such that almost surely $\|C^N_0\|_\infty \le B$, where $\|C\|_\infty = \sup_i |C_i|$.

To simplify the notations, we choose the functions $K$ and $g$ not to depend on time. However, as the proofs will be done for each time step, they also hold if the functions are time-dependent (in the finite horizon case).

Also, $K$ and $g$ do not depend on $N$, although in most practical cases they do. Adding a uniform continuity assumption on these functions for all $N$ makes all the proofs work the same way.

Here are some comments on the uniform bound $B$ on the initial condition (A4). In fact, as $C^N_0$ converges almost surely, $C^N_0$ is almost surely bounded. Here we add a bound that is uniform over all events in order to be sure that the variable $C^N_0$ is dominated by an integrable function. As $g$ is continuous and the sets $\mathcal{A}$ and $\mathcal{P}(\mathcal{S})$ are compact, this shows that for all $t$, there exists $B_t < \infty$ such that

$$\|C^N_t\|_\infty \le B_t. \tag{1}$$

Finally, in many cases the rewards also depend on the action. This is not the case here, at a small loss of generality.

## 3 Convergence results and optimal policy

In the case where there is no control, one can adapt the results proved in [boudec2007gmf] to show that when $N$ goes to infinity, the system converges almost surely to a deterministic one. In our case, this means that if the actions are fixed, the system converges.

For any fixed action $a$ and any value $(M, C)$, we define the random variable $\Phi^N_a(M, C)$ that corresponds to the state of the system after one iteration started from $(M, C)$. For $(m_t, c_t)$, we define the (deterministic) value $\Phi_a(m_t, c_t)$ corresponding to one iteration of the mean field system: $\Phi_a(m_t, c_t) = (m_{t+1}, c_{t+1})$, where

$$m_{t+1} = m_t\, K(a, c_t), \qquad c_{t+1} = g(c_t, m_{t+1}, a).$$

We call $\Phi^N_{a_0 \dots a_{t-1}}$ (resp. $\Phi_{a_0 \dots a_{t-1}}$) the composition $\Phi^N_{a_{t-1}} \circ \dots \circ \Phi^N_{a_0}$ (resp. $\Phi_{a_{t-1}} \circ \dots \circ \Phi_{a_0}$).
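The mean field iteration is cheap to evaluate, since it only involves a vector-matrix product and one call to the context update. The following minimal sketch computes the deterministic trajectory for a fixed action sequence. The kernel `example_kernel` and the update `example_g` are hypothetical choices made for this illustration, and the argument order of `g` follows the context update equation of Section 2.

```python
def mean_field_step(m, c, action, kernel, g):
    """One iteration of the mean field system: m' = m K(a, c), c' = g(c, m', a)."""
    rows = kernel(action, c)
    size = len(m)
    m_next = [sum(m[i] * rows[i][j] for i in range(size)) for j in range(size)]
    return m_next, g(c, m_next, action)


def mean_field_trajectory(m0, c0, actions, kernel, g):
    """Return [(m_0, c_0), (m_1, c_1), ...] under the given action sequence."""
    trajectory = [(list(m0), c0)]
    m, c = list(m0), c0
    for action in actions:
        m, c = mean_field_step(m, c, action, kernel, g)
        trajectory.append((m, c))
    return trajectory


# Hypothetical two-state example (assumption, not from the paper).
def example_kernel(action, context):
    back = 0.5 / (1.0 + context)
    return [[1.0 - action, action], [back, 1.0 - back]]


def example_g(context, measure, action):
    return 0.5 * context + 0.5 * measure[1]


for t, (m, c) in enumerate(mean_field_trajectory([0.6, 0.4], 0.0, [0.3, 0.3], example_kernel, example_g)):
    print(t, m, c)
```

The cost of one step is quadratic in the number of states and does not depend on the number of particles.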

In [boudec2007gmf], the system is homogeneous in time. However, the proofs are done for each time step and the results still hold without time homogeneity. With our notations, Theorem 4.1 of [boudec2007gmf] says that if the actions are $a_0, \dots, a_{T-1}$ and if the initial state converges almost surely, then the system of size $N$ converges almost surely.

###### Theorem 1 (Mean Field Limit, Theorem 4.1 of [boudec2007gmf]).

Under assumptions (A1,A3,A4), if the controller takes the action $a_t$ at time $t$, then for any fixed $t$:

$$(M^N_t, C^N_t) \xrightarrow{a.s.} \Phi_{a_0 \dots a_{t-1}}(m_0, c_0).$$

In the following, we will first show that if we fix the actions, the total reward of the system converges when $N$ grows. We will then show that the optimal reward also converges.

### 3.1 Finite horizon model

In this section, the horizon $T$ is fixed; the infinite horizon case will be treated in Section 3.3. Using the same notation and hypotheses as in Theorem 1, we define the reward of the deterministic system starting at $(m_0, c_0)$ under the actions $a_0, \dots, a_{T-1}$:

$$v_{a_0 \dots a_{T-1}}(m_0, c_0) = \sum_{t=1}^{T} r_t\left(\Phi_{a_0 \dots a_{t-1}}(m_0, c_0)\right).$$

For any $t$, if the action taken at instant $t$ is fixed equal to $a_t$, then $(M^N_t, C^N_t)$ converges almost surely to $\Phi_{a_0 \dots a_{t-1}}(m_0, c_0)$. Since the reward at time $t$ is continuous, this means that the finite-horizon expected reward converges as $N$ grows large:

###### Lemma 2 (Convergence of the reward).

Under assumptions (A1,A3,A4), if the controller takes actions $a_0, \dots, a_{T-1}$, the finite-horizon expected reward of the stochastic system converges to the finite-horizon reward of the deterministic system:

$$\lim_{N \to \infty} V^N_{a_0 \dots a_{T-1}}(M^N_0, C^N_0) = v_{a_0 \dots a_{T-1}}(m_0, c_0) \quad \text{a.s.}$$

###### Proof.

For all $t$, $(M^N_t, C^N_t)$ converges almost surely to $(m_t, c_t) = \Phi_{a_0 \dots a_{t-1}}(m_0, c_0)$. Since the reward at time $t$ is continuous in $(M, C)$, $r_t(M^N_t, C^N_t)$ converges almost surely to $r_t(m_t, c_t)$. Moreover, as the $(M^N_t, C^N_t)$ are bounded (see Equation (1)), the dominated convergence theorem shows that $\mathbb{E}[r_t(M^N_t, C^N_t)]$ goes to $r_t(m_t, c_t)$, which concludes the proof. ∎
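Lemma 2 can be checked numerically on a toy model by comparing a Monte Carlo estimate of the expected reward of the $N$-particle system with the reward of the deterministic system under the same fixed actions. The sketch below is only an illustration: the two-state `kernel`, the update `g`, the `reward` (mass in state 1) and the action sequence are hypothetical choices, not taken from the paper.

```python
import random


# Hypothetical two-state model (assumption, not from the paper).
def kernel(action, context):
    back = 0.5 / (1.0 + context)
    return [[1.0 - action, action], [back, 1.0 - back]]


def g(context, measure, action):
    return 0.5 * context + 0.5 * measure[1]


def reward(measure, context):
    return measure[1]


def deterministic_reward(m0, c0, actions):
    """Reward of the mean field system: sum over t = 1..T of r(m_t, c_t)."""
    m, c, total = list(m0), c0, 0.0
    for a in actions:
        rows = kernel(a, c)
        m = [sum(m[i] * rows[i][j] for i in range(len(m))) for j in range(len(m))]
        c = g(c, m, a)
        total += reward(m, c)
    return total


def simulated_reward(n, m0, c0, actions, rng):
    """One sample of sum over t = 1..T of r(M_t, C_t) for about n particles."""
    counts = [round(n * x) for x in m0]
    n_particles = sum(counts)
    states = range(len(counts))
    c, total = c0, 0.0
    for a in actions:
        rows = kernel(a, c)
        new_counts = [0] * len(counts)
        for i in states:
            if counts[i]:
                for j in rng.choices(states, weights=rows[i], k=counts[i]):
                    new_counts[j] += 1
        counts = new_counts
        m = [x / n_particles for x in counts]
        c = g(c, m, a)
        total += reward(m, c)
    return total


def expected_reward_estimate(n, m0, c0, actions, rng, runs=50):
    """Monte Carlo average of the simulated reward over independent runs."""
    return sum(simulated_reward(n, m0, c0, actions, rng) for _ in range(runs)) / runs


rng = random.Random(0)
actions = [0.3, 0.8, 0.5]
v = deterministic_reward([0.6, 0.4], 0.0, actions)
for n in (10, 100, 1000):
    gap = abs(expected_reward_estimate(n, [0.6, 0.4], 0.0, actions, rng) - v)
    print(n, gap)
```

The printed gap mixes the true bias of the finite system with Monte Carlo noise, so it should be read as a qualitative check rather than as a convergence rate.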

Now, let us consider the problem of convergence of the reward under the optimal strategy of the controller. First, it should be clear that the optimal strategy exists for the limit system. Indeed, the limit system being deterministic, starting at state $(m_0, c_0)$, one only needs to know the actions to take for all $t$ to compute the reward. The optimal policy is deterministic and $v^*_T(m_0, c_0) = \sup_{a_0 \dots a_{T-1}} v_{a_0 \dots a_{T-1}}(m_0, c_0)$. Since the action set is compact, this supremum is a maximum: there exist $a^*_0, \dots, a^*_{T-1}$ such that $v^*_T(m_0, c_0) = v_{a^*_0 \dots a^*_{T-1}}(m_0, c_0)$. In fact, in many cases there is more than one optimal action sequence. In the following, $a^*_0, \dots, a^*_{T-1}$ is one of them, and will be called the sequence of optimal limit actions.

###### Theorem 3 (Convergence of the optimal reward).

Under assumptions (A1,A2,A3,A4), as $N$ goes to infinity, the optimal reward of the stochastic system converges to the optimal reward of the deterministic limit system: almost surely,

$$\lim_{N \to \infty} V^{*N}_T(M^N_0, C^N_0) = \lim_{N \to \infty} V^N_{a^*_0 \dots a^*_{T-1}}(M^N_0, C^N_0) = v^*_T(m_0, c_0).$$

In words, this theorem says that, at the limit, the reward of the optimal policy under full information is the same as the reward obtained when the optimal limit actions are used in the original system, both being equal to the optimal reward of the limit deterministic system, $v^*_T(m_0, c_0)$.

###### Proof.

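For all $t$, $M$ and $C$, let us define by induction on $t$ the function $V^{*N}_{t \dots T}(M, C)$:

$$V^{*N}_{T \dots T}(M, C) = r_T(M, C), \qquad V^{*N}_{t \dots T}(M, C) = r_t(M, C) + \sup_{a \in \mathcal{A}} \mathbb{E}_{M,C}\left[V^{*N}_{t+1 \dots T}\left(\Phi^N_a(M, C)\right)\right], \tag{2}$$

where the expectation is taken over all possible values of $\Phi^N_a(M, C)$ given $(M, C)$. Also notice that $V^{*N}_{t \dots T}(M, C)$ is the maximal expected reward between time $t$ and time $T$ starting in $(M, C)$, and therefore $V^{*N}_T = V^{*N}_{0 \dots T}$.

Let us also define $v^*_{t \dots T}$ for the limit system, similarly (by removing the expectation):

$$v^*_{T \dots T}(m, c) = r_T(m, c), \qquad v^*_{t \dots T}(m, c) = r_t(m, c) + \sup_{a \in \mathcal{A}} \left[v^*_{t+1 \dots T}\left(\Phi_a(m, c)\right)\right], \tag{3}$$

and let $\Pi^*_t(m, c)$ be an action that attains the $\sup$ in the previous equation (it exists because of (A2): $\mathcal{A}$ is compact).

We will show by induction on $t$ that $V^{*N}_{t \dots T}(\cdot)$ is continuous (note that since $\mathcal{P}_N(\mathcal{S})$ is discrete, the continuity in $M$ is trivial) and that we can define an optimal policy $\Pi^{*N}_t$ such that:

$$V^{*N}_{t \dots T}(M, C) = r_t(M, C) + \mathbb{E}\left[V^{*N}_{t+1 \dots T}\left(\Phi^N_{\Pi^{*N}_t(M, C)}(M, C)\right)\right]. \tag{4}$$

For $t = T$, the assumption holds by the continuity of $r_T$ (A3).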

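Let us assume that it holds for $t+1$. By assumption (A3), the mapping $g$ and the kernel $K$ are continuous in $a$; thus if $a_n$ is a sequence of actions converging to $a$, $\Phi^N_{a_n}(M, C)$ converges (in law) to $\Phi^N_a(M, C)$. As $V^{*N}_{t+1 \dots T}$ is continuous, $a \mapsto \mathbb{E}\left[V^{*N}_{t+1 \dots T}(\Phi^N_a(M, C))\right]$ is continuous. Using this continuity and the compactness of $\mathcal{A}$, the optimal action $\Pi^{*N}_t(M, C)$ exists. The functions $K$, $g$, $r$ are uniformly continuous in $a$, therefore the continuity in $C$ of this expectation is uniform in $a$. This shows that $V^{*N}_{t \dots T}$ is continuous, and the property is proved for all $t$.

Let us now prove by induction on $t$ that for all sequences $(M^N, C^N)$ converging almost surely to $(m, c)$, $\lim_{N \to \infty} V^{*N}_{t \dots T}(M^N, C^N) = v^*_{t \dots T}(m, c)$ almost surely. This is clearly true for $t = T$. Assume that it holds for $t+1$ and let us call $a^*_t, \dots, a^*_{T-1}$ a sequence of optimal actions for the deterministic limit. Lemma 2 shows that $\lim_{N \to \infty} V^N_{a^*_t \dots a^*_{T-1}}(M^N, C^N) = v^*_{t \dots T}(m, c)$. In particular, this shows the equality (which holds a.s.) in the following equation:

$$\liminf_{N \to \infty} V^{*N}_{t \dots T}(M^N, C^N) \ge \liminf_{N \to \infty} V^N_{a^*_t \dots a^*_{T-1}}(M^N, C^N) = v^*_{t \dots T}(m, c). \tag{5}$$

Let $a^N$ be a sequence of actions maximizing the expectation in (2). As $\mathcal{A}$ is compact, there exists a subsequence converging to a value $a$. Again by Lemma 2 and the induction hypothesis, along this subsequence $V^{*N}_{t \dots T}(M^N, C^N)$ converges a.s. to $r_t(m, c) + v^*_{t+1 \dots T}(\Phi_a(m, c)) \le v^*_{t \dots T}(m, c)$, which bounds the $\limsup$ from above. Using both inequalities, this shows that $\lim_{N \to \infty} V^{*N}_{t \dots T}(M^N, C^N) = v^*_{t \dots T}(m, c)$.

To conclude the proof, remark that since the limit system is deterministic and takes the values $(m_t, c_t)$, fixing the policy at time $t$ to the action $a^*_t$ achieves the optimal reward. ∎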

This result has several practical consequences. Recall that the limit actions $a^*_0, \dots, a^*_{T-1}$ form a sequence of optimal actions in the limit case, i.e. such that $v_{a^*_0 \dots a^*_{T-1}}(m_0, c_0) = v^*_T(m_0, c_0)$. This result proves that in the limit case, the optimal policy does not depend on the state of the system. This also shows that incomplete information policies are as good as complete information policies. However, the state $(M^N_t, C^N_t)$ is not deterministic and, on one trajectory of the system, it could be quite far from its deterministic limit $(m_t, c_t)$. In the proof of Theorem 3, we also defined the policy $\Pi^*_t(m, c)$, which is optimal for the deterministic system starting at time $t$ in state $(m, c)$. The least we can say is that this strategy is also asymptotically optimal, that is:

$$\lim_{N \to \infty} V^N_{\Pi^*_0 \dots \Pi^*_{T-1}}(M, C) = \lim_{N \to \infty} V^N_{a^*_0 \dots a^*_{T-1}}(M, C).$$

In practical situations, using this policy will decrease the risk of being far from the optimal state. On the other hand, using this policy has some drawbacks. The first one is that the complexity of computing the optimal policy for all states can be much larger than the complexity of computing $a^*_0, \dots, a^*_{T-1}$. Another one is that the system becomes very sensitive to random perturbations: the policy $\Pi^*_t$ is not necessarily continuous and may not have a limit. In Section 4, a comparison between the performances of $a^*$ and $\Pi^*$ is provided over an example.
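When the action set is finite, or is replaced by a finite grid, a sequence of optimal limit actions can be obtained by unrolling recursion (3) on the deterministic system. The sketch below does this by exhaustive recursion. It is a minimal illustration under assumptions: the grid `action_grid`, the model functions `example_kernel`, `example_g` and `example_reward`, and the use of a time-independent reward are hypothetical choices, and discretizing a continuous action set only yields an approximation of the optimum.

```python
def mean_field_step(m, c, action, kernel, g):
    """One iteration of the mean field system: m' = m K(a, c), c' = g(c, m', a)."""
    rows = kernel(action, c)
    size = len(m)
    m_next = [sum(m[i] * rows[i][j] for i in range(size)) for j in range(size)]
    return m_next, g(c, m_next, action)


def optimal_limit_actions(m, c, t, horizon, action_grid, kernel, g, reward):
    """Return (v*_{t..T}(m, c), [a*_t, ..., a*_{T-1}]) following recursion (3)."""
    if t == horizon:
        return reward(m, c), []
    best_value, best_tail = float("-inf"), []
    for a in action_grid:
        m_next, c_next = mean_field_step(m, c, a, kernel, g)
        value, tail = optimal_limit_actions(
            m_next, c_next, t + 1, horizon, action_grid, kernel, g, reward
        )
        if value > best_value:
            best_value, best_tail = value, [a] + tail
    # As in (3), the reward of the current state is added to the best continuation.
    return reward(m, c) + best_value, best_tail


# Hypothetical two-state example (assumption, not from the paper).
def example_kernel(action, context):
    back = 0.5 / (1.0 + context)
    return [[1.0 - action, action], [back, 1.0 - back]]


def example_g(context, measure, action):
    return 0.5 * context + 0.5 * measure[1]


def example_reward(measure, context):
    return measure[1]


value, actions = optimal_limit_actions(
    [0.6, 0.4], 0.0, 0, 3, [0.0, 0.5, 1.0], example_kernel, example_g, example_reward
)
print(value, actions)
```

The recursion visits one deterministic trajectory per action sequence, so its cost grows exponentially with the horizon. It is nevertheless independent of $N$, unlike a search for the state-dependent optimal policy of the finite system.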

### 3.2 Central Limit Theorems

In this part we prove central limit theorems for interacting particles. This result provides estimates on the speed of convergence to the mean field limit. This section contains two main results:

The first one is that when the control action sequence is fixed, the gap to the mean field limit decreases as the inverse square root of the number of particles. The second result states that the gap between the optimal reward for the finite system and the optimal reward for the limit system also decreases as fast as $1/\sqrt{N}$. These properties are formalized in Theorems 5 and 4 respectively.

To prove these results, we will need the additional assumptions (A4-bis) and (A5) or (A5-bis).

• (A4-bis) Initial Gaussian variable: there exists a Gaussian vector $G_0$ of mean $0$ with covariance $\Gamma_0$ such that the vector $\sqrt{N}\left((M^N_0, C^N_0) - (m_0, c_0)\right)$ (with $S + d$ components) converges in law to $G_0$. (This is denoted by $\xrightarrow{\mathcal{L}}$.) This assumption also includes (A4), i.e. almost sure convergence of the initial state.

• (A5) Continuous differentiability: for all $t$ and all $i, j$, the functions $g$, $K_{ij}$ and $r_t$ are continuously differentiable.

• (A5-bis) Differentiability in $(m_t, c_t)$: let $(m_t, c_t)$ be the deterministic limit of the system if the controller takes the actions $a_0, a_1, \dots$; then for all $t$, the functions $g$, $K_{ij}$ and $r_t$ are differentiable at the points $(m_t, c_t)$.

These assumptions are slightly stronger than (A3) and (A4) but remain very natural. (A4-bis) is clearly necessary for Theorems 5 and 4 to hold. The differentiability condition implies that if the gap between $(M^N_t, C^N_t)$ and $(m_t, c_t)$ is of order $1/\sqrt{N}$, it remains of the same order at time $t+1$. For Theorem 5, (A5-bis) is necessary, but it can be replaced by a Lipschitz continuity condition for Theorem 4. This will be further discussed in Section 4.2.

###### Theorem 4 (Central limit theorem for costs).

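Under assumptions (A1,A2,A3,A4-bis,A5),
(i) there exist constants $\beta$ and $\gamma$ such that for all $x$:

$$\limsup_{N \to \infty} \mathbb{P}\left(\sqrt{N} \left|V^{*N}_T(M^N_0, C^N_0) - v^*_T(m_0, c_0)\right| \ge x\right) \le \mathbb{P}\left(\beta \|G_0\|_\infty + \gamma \ge x\right); \tag{6}$$

(ii) there exist constants $\beta'$ and $\gamma'$ such that for all $x$:

$$\limsup_{N \to \infty} \mathbb{P}\left(\sqrt{N} \left|V^{*N}_T(M^N_0, C^N_0) - V^N_{a^*_0 \dots a^*_{T-1}}(M^N_0, C^N_0)\right| \ge x\right) \le \mathbb{P}\left(\beta' \|G_0\|_\infty + \gamma' \ge x\right); \tag{7}$$

where $G_0$ is the Gaussian vector of assumption (A4-bis).

This theorem is the main result of this section. The previous result (Theorem 3) says that $\lim_{N \to \infty} V^{*N}_T(M^N_0, C^N_0) = \lim_{N \to \infty} V^N_{a^*_0 \dots a^*_{T-1}}(M^N_0, C^N_0) = v^*_T(m_0, c_0)$. This new theorem says that both the gap between the cost under the optimal policy and the optimal cost of the limit system (i) and the gap between the cost under the optimal policy and the cost when using the limit actions (ii) are random variables that decrease to 0 with speed $1/\sqrt{N}$, with tails asymptotically bounded by those of an affine function of the norm of a Gaussian vector. Actually, a stronger result (using almost sure convergence instead of convergence in law) will be shown in Corollary 8. A direct consequence of this result is that there exists a constant $\gamma''$ such that:

$$\mathbb{E}\left[\sqrt{N} \left|V^{*N}_T(M^N_0, C^N_0) - v^*_T(m_0, c_0)\right|\right] \to \gamma''. \tag{8}$$

The rest of this section is devoted to the proof of this theorem. A first step in the proof of Theorem 4 is a central limit theorem for the states, which is of interest in its own right.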

###### Theorem 5 (Mean field central limit theorem).

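Under assumptions (A1,A2,A3,A4-bis,A5-bis), if the actions taken by the controller are $a_0, \dots, a_{T-1}$, there exist Gaussian vectors $G_1, \dots, G_T$ of mean $0$ such that for every $t$:

$$\sqrt{N}\left((M^N_0, C^N_0) - (m_0, c_0), \dots, (M^N_t, C^N_t) - (m_t, c_t)\right) \xrightarrow{\mathcal{L}} (G_0, \dots, G_t). \tag{9}$$

Moreover, if $\Gamma_t$ is the covariance matrix of $G_t$, then:

$$\Gamma_{t+1} = \begin{bmatrix} P_t & F_t \\ Q_t & H_t \end{bmatrix}^{T} \Gamma_t \begin{bmatrix} P_t & F_t \\ Q_t & H_t \end{bmatrix} + \begin{bmatrix} D_t & 0 \\ 0 & 0 \end{bmatrix} \tag{10}$$

where $P_t = K(a_t, c_t)$, $(Q_t)_{kj} = \sum_{i=1}^{S} m_{t,i} \frac{\partial K_{ij}}{\partial c_k}(a_t, c_t)$ for all $1 \le j \le S$ and $1 \le k \le d$, the matrices $F_t$ and $H_t$ are built from the first-order partial derivatives of $g$, and $D_t$ is the matrix (13) of Lemma 6 with $m = m_t$ and $p = P_t$.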
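Once the blocks are known, recursion (10) is a plain matrix computation that propagates the covariance from one time slot to the next. The sketch below implements one step in pure Python. The function name `covariance_step` and its argument layout are assumptions made for this illustration: `p` is $S \times S$, `f` is $S \times d$, `q` is $d \times S$, `h` is $d \times d$, and `noise` stands for $D_t$.

```python
def transpose(a):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def covariance_step(gamma, p, f, q, h, noise):
    """Return Gamma_{t+1} = A^T Gamma_t A + blockdiag(noise, 0), with A = [[P, F], [Q, H]]."""
    size_s = len(p)
    block = [list(p[i]) + list(f[i]) for i in range(size_s)]
    block += [list(q[k]) + list(h[k]) for k in range(len(q))]
    nxt = mat_mul(mat_mul(transpose(block), gamma), block)
    # The jump noise only affects the block associated with the measure M.
    for j in range(size_s):
        for k in range(size_s):
            nxt[j][k] += noise[j][k]
    return nxt


# Toy example with S = 1 state and d = 1 context dimension (arbitrary numbers).
gamma_next = covariance_step(
    [[1.0, 0.0], [0.0, 1.0]], [[0.5]], [[1.0]], [[0.0]], [[2.0]], [[0.5]]
)
print(gamma_next)
```

Iterating this step from $\Gamma_0$ gives the covariance of every $G_t$ without simulating the particle system.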

###### Proof.

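Let us assume that Equation (9) holds for some $t$.

As $\sqrt{N}\left((M^N_t, C^N_t) - (m_t, c_t)\right)$ converges in law to $G_t$, there exists another probability space and random variables $\tilde{M}^N_t$, $\tilde{C}^N_t$ and $\tilde{G}_t$ with the same distributions as $M^N_t$, $C^N_t$ and $G_t$ such that $\sqrt{N}\left((\tilde{M}^N_t, \tilde{C}^N_t) - (m_t, c_t)\right)$ converges almost surely to $\tilde{G}_t$ [durrett1991pta]. In the rest of the proof, by abuse of notation, we will write $M^N_t$, $C^N_t$ and $G_t$ instead of $\tilde{M}^N_t$, $\tilde{C}^N_t$ and $\tilde{G}_t$, and we then assume that $\sqrt{N}\left((M^N_t, C^N_t) - (m_t, c_t)\right) \xrightarrow{a.s.} G_t$.

$G_t$ being a Gaussian vector, there exists a vector $U$ of $S + d$ independent Gaussian variables and a matrix $X$ of size $(S+d) \times (S+d)$ such that $G_t = XU$.

Let us call $P^N_t = K(a_t, C^N_t)$ and $P_t = K(a_t, c_t)$. According to Lemma 6, there exists a Gaussian variable $H_t$ independent of $G_t$ and of covariance $D_t$ such that we can replace $M^N_{t+1}$ (without changing $M^N_t$ and $C^N_t$) by a random variable $\tilde{M}^N_{t+1}$ with the same law such that:

$$\sqrt{N}\left(\tilde{M}^N_{t+1} - M^N_t P^N_t\right) \xrightarrow{a.s.} H_t. \tag{11}$$

In the following, by abuse of notation, we write $M^N_{t+1}$ instead of $\tilde{M}^N_{t+1}$. Therefore we have

$$\sqrt{N}\left(M^N_{t+1} - m_t P_t\right) = \sqrt{N}\left(M^N_{t+1} - M^N_t P^N_t + m_t (P^N_t - P_t) + (M^N_t - m_t) P_t + (M^N_t - m_t)(P^N_t - P_t)\right) \xrightarrow{a.s.} H_t + m_t \lim_{N \to \infty} \sqrt{N}(P^N_t - P_t) + \lim_{N \to \infty} \sqrt{N}(M^N_t - m_t) P_t.$$

By assumption, $\lim_{N \to \infty} \sqrt{N}(M^N_t - m_t)_i = (XU)_i$ and $\lim_{N \to \infty} \sqrt{N}(C^N_t - c_t)_k = (XU)_{S+k}$. Moreover, the first order Taylor expansion with respect to all components of $C$ gives a.s.

$$\lim_{N \to \infty} \left(m_t \sqrt{N}(P^N_t - P_t)\right)_j = \sum_{i=1}^{S} m_{t,i} \sum_{k=1}^{d} \frac{\partial K_{ij}}{\partial c_k}(a_t, c_t)\, (XU)_{S+k} = \sum_{k=1}^{d} Q_{kj}\, (XU)_{S+k}.$$

Thus, the $j$th component of $\sqrt{N}\left(M^N_{t+1} - m_{t+1}\right)$ tends to

$$(H_t)_j + \sum_{k=1}^{d} Q_{kj}\, (XU)_{S+k} + \sum_{i=1}^{S} (XU)_i\, P_{ij}. \tag{12}$$

Using similar ideas, we can prove that $\sqrt{N}\left(C^N_{t+1} - c_{t+1}\right)$ also converges almost surely. Thus $\sqrt{N}\left((M^N_{t+1}, C^N_{t+1}) - (m_{t+1}, c_{t+1})\right)$ converges almost surely to a Gaussian vector.

Let us write the covariance matrices at time $t$ and time $t+1$ as two block matrices:

$$\Gamma_t = \begin{bmatrix} M & O \\ O^T & C \end{bmatrix} \quad \text{and} \quad \Gamma_{t+1} = \begin{bmatrix} M' & O' \\ O'^T & C' \end{bmatrix}.$$

For $1 \le j, k \le S$, $M'_{jk}$ is the expectation of (12) taken in $j$ times (12) taken in $k$. Using the facts that $H_t$ is centered and independent of $XU$, that $D_t$ is the covariance matrix of $H_t$, and that $\Gamma_t$ is the covariance matrix of $XU$, this leads to:

$$M'_{jk} = (D_t)_{jk} + (P_t^T M P_t)_{jk} + (P_t^T O\, Q_t)_{jk} + (Q_t^T O^T P_t)_{jk} + (Q_t^T C\, Q_t)_{jk}.$$

By similar computations, we can write similar equations for $O'$ and $C'$ that lead to Equation (10). ∎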

###### Lemma 6.

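Let $M^N$ be a sequence of random measures on $\{1, \dots, S\}$ and $P^N$ a sequence of random stochastic matrices on $\{1, \dots, S\}$ such that $(M^N, P^N) \xrightarrow{a.s.} (m, p)$. Let $(U_{ik})_{i,k}$ be a collection of iid random variables following the uniform distribution on $[0, 1]$ and independent of $M^N$ and $P^N$, and let us define $Y^N$: for all $1 \le j \le S$,

$$Y^N_j \stackrel{\text{def}}{=} \frac{1}{N} \sum_{i=1}^{S} \sum_{k=1}^{N M^N_i} \mathbf{1}_{\left\{\sum_{l < j} P^N_{il} \,<\, U_{ik} \,\le\, \sum_{l \le j} P^N_{il}\right\}},$$

then there exists a Gaussian vector $G$ independent of $M^N$ and $P^N$ and a random variable $Z^N$ with the same law as $Y^N$ such that

$$\sqrt{N}\left(Z^N - M^N P^N\right) \xrightarrow{a.s.} G.$$

Moreover, the covariance of the vector $G$ is the matrix $D$:

$$\begin{cases} D_{jj} = \sum_i m_i\, p_{ij} (1 - p_{ij}) \\ D_{jk} = -\sum_i m_i\, p_{ij}\, p_{ik} \quad (j \ne k). \end{cases} \tag{13}$$

###### Proof.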

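As $(M^N, P^N)$ and $(U_{ik})$ are independent, they can be viewed as functions on independent probability spaces $\Omega$ and $\Omega'$. For all $(\omega, \omega') \in \Omega \times \Omega'$, let $X^N_\omega(\omega') = \sqrt{N}\left(Y^N(\omega, \omega') - M^N(\omega) P^N(\omega)\right)$.

By assumption, for almost all $\omega$, $(M^N(\omega), P^N(\omega))$ converges to $(m, p)$. A direct computation shows that, when $N$ grows, the characteristic function of $X^N_\omega$ converges to $\xi \mapsto \exp\left(-\tfrac{1}{2} \xi^T D\, \xi\right)$. Therefore, for almost all $\omega$, $X^N_\omega$ converges in law to $G$, a Gaussian random variable on $\Omega'$.

Therefore, for almost all $\omega$, there exists a random variable $\tilde{X}^N_\omega$ with the same law as $X^N_\omega$ that converges $\omega'$-almost surely to $G$. Let $Z^N(\omega, \omega') = M^N(\omega) P^N(\omega) + \tilde{X}^N_\omega(\omega') / \sqrt{N}$. By construction of $Z^N$, for almost all $\omega$, $Z^N(\omega, \cdot)$ has the same distribution as $Y^N(\omega, \cdot)$ and $\sqrt{N}\left(Z^N(\omega, \cdot) - M^N(\omega) P^N(\omega)\right) \to G$. Thus there exists a function $Z^N$ that has the same distribution as $Y^N$ for all $N$ and such that $\sqrt{N}\left(Z^N - M^N P^N\right)$ converges $(\omega, \omega')$-almost surely to $G$. ∎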
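The covariance (13) can be computed directly and compared with simulation. In the sketch below, each particle jumps according to its own uniform variable, in the spirit of the construction of $Y^N$ in Lemma 6, and the sample covariance of $\sqrt{N}(Y^N - M^N P^N)$ is compared with $D$. The function names, the fixed measure `m`, the fixed matrix `p` and the sample sizes are hypothetical choices for this illustration.

```python
import random


def jump_covariance(m, p):
    """Matrix D of Equation (13) for a measure m and a stochastic matrix p."""
    size = len(m)
    d = [[0.0] * size for _ in range(size)]
    for j in range(size):
        for k in range(size):
            if j == k:
                d[j][k] = sum(m[i] * p[i][j] * (1.0 - p[i][j]) for i in range(size))
            else:
                d[j][k] = -sum(m[i] * p[i][j] * p[i][k] for i in range(size))
    return d


def jump_target(u, row):
    """Index j whose cumulative interval of probabilities contains u."""
    cumulative = 0.0
    for j, prob in enumerate(row):
        cumulative += prob
        if u <= cumulative:
            return j
    return len(row) - 1  # guard against rounding when u is close to 1


def empirical_covariance(n, m, p, rng, runs=2000):
    """Sample covariance of sqrt(n) * (Y - M P) over independent one-step jumps."""
    size = len(m)
    counts = [round(n * x) for x in m]
    total = sum(counts)
    mean = [sum(counts[i] / total * p[i][j] for i in range(size)) for j in range(size)]
    acc = [[0.0] * size for _ in range(size)]
    for _ in range(runs):
        new_counts = [0] * size
        for i in range(size):
            for _ in range(counts[i]):
                new_counts[jump_target(rng.random(), p[i])] += 1
        dev = [(total ** 0.5) * (new_counts[j] / total - mean[j]) for j in range(size)]
        for j in range(size):
            for k in range(size):
                acc[j][k] += dev[j] * dev[k] / runs
    return acc


m, p = [0.6, 0.4], [[0.7, 0.3], [0.5, 0.5]]
print(jump_covariance(m, p))
print(empirical_covariance(200, m, p, random.Random(0)))
```

Each row of $D$ sums to zero, which reflects the fact that the components of $Y^N$ always sum to one.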

The first application of the mean field CLT is to show that it also works for the cost. Let us assume that the controller takes actions and let us introduce the definition of and