
# Q-Learning for Mean-Field Controls

## Abstract

Multi-agent reinforcement learning (MARL) has been applied to many challenging problems including two-team computer games, autonomous driving, and real-time bidding. Despite the empirical success, there is a conspicuous absence of theoretical study of different MARL algorithms: this is mainly due to the curse of dimensionality caused by the exponential growth of the joint state-action space as the number of agents increases. Mean-field controls (MFC) with infinitely many agents and deterministic flows, meanwhile, provide good approximations to $N$-agent collaborative games in terms of both game values and optimal strategies. In this paper, we study collaborative MARL under an MFC approximation framework: we develop a model-free kernel-based Q-learning algorithm (CDD-Q) and show that its convergence rate and sample complexity are independent of the number of agents. Our empirical studies on MFC examples demonstrate strong performance of CDD-Q. Moreover, the CDD-Q algorithm can be applied to a general class of Markov decision problems (MDPs) with deterministic dynamics and continuous state-action space.

## 1 Introduction

Multi-agent reinforcement learning (MARL) has enjoyed substantial successes for studying the otherwise challenging collaborative games, including two-agent or two-team computer games [31, 34], self-driving [30], real-time bidding [12], ride-sharing [17], and traffic routing [8]. Despite the empirical success of MARL, there is a conspicuous absence of theoretical studies on different MARL algorithms: this is primarily due to the curse of dimensionality, i.e., the exponential growth of the state-action space as the number of agents increases.

One approach to address this scalability issue is to focus on local policies, which leads to significant dimension reduction: using value-based algorithms, [16] develops a distributed Q-learning algorithm that converges for deterministic and finite Markov decision problems (MDPs); and recently [26, 27] assume special dependence structures among agents, again for the purpose of dimension reduction. (See also the review paper [37] and the references therein.)

An alternative (and opposite) approach, largely unexplored, is to consider instead a large number of controlled collaborative agents and take advantage of the phenomenon of propagation of chaos documented in [13, 19, 32, 9]. Indeed, as the number of agents goes to infinity, the problem of a large number of controlled collaborative agents with mean-field interaction becomes a mean-field control (MFC) problem. This approach is plausible as the dimension of MFC is independent of the number of agents, and the analysis of MFC has been shown to provide a good approximation to the corresponding $N$-agent collaborative game in terms of both game values and optimal strategies [15, 21].

However, unlike the standard MFC framework which assumes full information, learning MFC involves unknown transition dynamics of the system and the reward, and considers the problem of simultaneously controlling and learning the system.

Our work. In this paper, we study MARL collaborative games with interchangeable agents under the MFC approximation framework. The key idea is to reformulate the learning MFC problem as a general MDP, by “lifting” the original finite state-action space into an appropriate compact continuous state-action space. By this “lifting”, the dynamics become deterministic as a result of aggregation over the original finite state-action space. We then design a Q-learning based algorithm (CDD-Q) using kernel regression and an approximate Bellman operator. The lifting approach facilitates the convergence study, and in particular significantly reduces the complexity of the sample complexity analysis for this model-free algorithm. (See Remark 3.1.) We test the CDD-Q algorithm on a network traffic congestion control problem in the framework of learning MFC.

Related theoretical works on deterministic dynamics. [35, 36] study the sample complexity and regret for online learning in episodic deterministic systems. [3, 38] consider the learning problem for deterministic systems under a non-episodic and PAC framework. However, the adaptive-resolution RL algorithm proposed in [3] requires a certain degree of system knowledge for its implementation. Meanwhile, the MEC algorithm proposed in [38] uses a grid to partition the continuous state space and works with a discrete action space.

Related works on kernel-based reinforcement learning. [24, 23] present a kernel-based reinforcement learning algorithm (KBRL) for both discounted-cost and average-cost problems with continuous state space. A kernel-based approximation is constructed for the conditional expectation that appears in the Bellman operator. A subsequent work demonstrates the applicability of KBRL to large-scale problems [2]. However, there is neither a convergence rate study nor a finite-sample performance guarantee. The idea of non-parametric kernel regression has also been used in the context of discrete state-space problems [4]. Most relevant to our work is [29], which combines Q-learning with a kernel-based nearest neighbor regression method to study continuous-state stochastic MDPs. However, all these works focus on a finite action space, and their techniques and approaches do not apply to our deterministic setting.

Indeed, our framework of learning MFC by an MDP with deterministic dynamics involves a discontinuous transition kernel, which is a Dirac measure. Therefore, classical convergence studies for Q-learning algorithms, which in general impose regularity assumptions on the transition kernel, are not directly applicable either.

Related works on learning MFC. [6] takes a different approach by adding common noises to the underlying dynamics. This relaxation enables application of the existing learning theory for stochastic dynamics. Additional works on learning MFC with stochastic dynamics include [5, 18], both of which focus on linear-quadratic mean-field control: the former exploring policy gradient method and the latter developing an actor-critic type algorithm.

From the algorithmic perspective, the singularity of the transition kernel in the deterministic system leads to fundamental differences between analyzing deterministic dynamics and stochastic dynamics, in terms of learning rate and ergodicity (see for instance [36]). From the application point of view, problems of controlling deterministic physical systems are fundamental: robotics [1, 25], self-driving cars [28], and computer games such as Atari [20, 14].

All these works are different from ours, which provides the convergence and complexity analysis for non-episodic kernel-based Q-learning in an MDP with deterministic dynamics and with a continuous state-action space. From the theoretical perspective, the study of Q-learning for MFC is virtually non-existent, with the exception of [10, 21], which establish the Dynamic Programming Principle for both the Q function and the value function for learning MFC. These works pave the way for developing various DPP-based algorithms for MFC including Q-learning, policy gradient, and actor-critic algorithms. In particular, these works have inspired the lifting idea in this paper.

Organization. The paper is organized as follows. Section 2 introduces the set-up of learning MFC in an MDP framework. Section 3.1 proposes a model-free kernel-based Q-learning algorithm (CDD-Q), with its convergence and sample complexity analysis in Section 3.2 and Section 3.3. Section 3.4 specializes to the MFC setting. Finally, Section 4 tests the CDD-Q algorithm on network congestion control problems, with different parameters and kernels.

## 2 Problem set-up

### 2.1 Learning MFC

Recall that MFC (with a central controller) deals with collaborative games with infinitely many exchangeable agents under mean-field interaction. Mathematically it goes as follows. Let $\mathcal{X}$ and $\mathcal{U}$ be respectively the finite state space and action space of each representative agent. Let $\mathcal{P}(\mathcal{X})$ (resp. $\mathcal{P}(\mathcal{U})$) be the space of all probability measures on $\mathcal{X}$ (resp. $\mathcal{U}$). At each non-negative integer time step $t$, the state of each representative agent is $x_t\in\mathcal{X}$. The central controller observes $\mu_t\in\mathcal{P}(\mathcal{X})$, the probability distribution of the state $x_t$, with $\mu_0=\mu$. Each agent takes an action $u_t\in\mathcal{U}$ according to the policy $\pi_t$ assigned by the central controller. Mathematically, this means the policy $\pi_t$ maps the current state $x_t$ and current state distribution $\mu_t$ to $\pi_t(x_t,\mu_t)$, a distribution over the action space $\mathcal{U}$, and $u_t\sim\pi_t(x_t,\mu_t)$. We call $\nu_t$ the population action distribution. Each agent will then receive a reward $\tilde{r}(x_t,\mu_t,u_t,\nu_t)$ and move to the next state $x_{t+1}$ according to a probability transition function $P(x_t,\mu_t,u_t,\nu_t)$. It is worth mentioning that in the context of learning MFC, $P$ and $\tilde{r}$ are possibly unknown.

The central controller aims to maximize over all admissible policies the accumulated reward, i.e., to find

 $\sup_{\pi}\ \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}\,\tilde{r}(x_t,\mu_t,u_t,\nu_t)\,\middle|\,x_0\sim\mu\right],$ (2.1)

subject to

 $x_{t+1}\sim P(x_t,\mu_t,u_t,\nu_t),\qquad u_t\sim\pi_t(x_t,\mu_t),$ (2.2)

where $\gamma\in(0,1)$ is a discount factor.

### 2.2 MDP framework for learning MFC

According to [10], the above learning MFC problem can be reformulated in a general MDP framework with continuous state-action space and deterministic dynamics. The idea is to lift the finite state space $\mathcal{X}$ and action space $\mathcal{U}$ to the compact continuous spaces $\mathcal{P}(\mathcal{X})$ and $(\mathcal{P}(\mathcal{U}))^{\mathcal{X}}$. Moreover, the dynamics will become deterministic as a result of aggregation over the state-action space.

Based on this idea, throughout the paper, we consider the following general MDP setup. Let $\mathcal{S}$ (resp. $\mathcal{A}$) be a continuous state (resp. action) space which is a complete compact metric space with metric $d_S$ (resp. $d_A$). Let $\mathcal{C}=\mathcal{S}\times\mathcal{A}$ be a complete metric space with the metric given by

 $d_C\big((s,a),(s',a')\big)=d_S(s,s')+d_A(a,a').$ (2.3)

At time $t$, let $s_t\in\mathcal{S}$ be the state of the representative agent. Once the agent takes the action $a_t\in\mathcal{A}$ according to a policy $\pi_t$, the agent moves to the next state $s_{t+1}=\Phi(s_t,a_t)$ according to the deterministic dynamics $\Phi:\mathcal{S}\times\mathcal{A}\to\mathcal{S}$ and receives an immediate reward $r(s_t,a_t)$. Here the policy $\pi=\{\pi_t\}_{t\ge 0}$ is Markovian, so that at each stage $t$, $\pi_t$ maps the state $s_t$ to $\pi_t(s_t)$, a distribution over the action space. We say the agent follows the policy $\pi$ if $a_t\sim\pi_t(s_t)$.

The agent’s objective is to maximize the expected cumulative reward starting from an arbitrary state $s\in\mathcal{S}$,

 $V_C(s)=\sup_{\pi}V_C^{\pi}(s):=\sup_{\pi}\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_t,a_t)\,\middle|\,s_0=s\right],$

as well as to maximize the expected cumulative reward starting from an arbitrary state-action pair $(s,a)$,

 $Q_C(s,a)=\sup_{\pi}\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_t,a_t)\,\middle|\,s_0=s,\ a_0=a\right].$

We will call this problem MDP-CDD.

MFC as an MDP-CDD problem. Given the above general setting, we have

###### Theorem 2.1

The MFC problem (2.1)-(2.2) is equivalent to the following MDP-CDD problem,

 $\underset{h_t\in(\mathcal{P}(\mathcal{U}))^{\mathcal{X}}}{\text{maximize}}\ \sum_{t=0}^{\infty}\gamma^{t}r(\mu_t,h_t),\qquad\text{subject to }\ \mu_{t+1}=\Phi(\mu_t,h_t),$

where $r$ and $\Phi$ are the aggregated reward and dynamics such that

 $r(\mu,h)=\sum_{x\in\mathcal{X}}\sum_{u\in\mathcal{U}}\tilde{r}\big(x,\mu,u,\nu(\mu,h)\big)\,\mu(x)\,h(x,u),\qquad \Phi(\mu,h)=\sum_{x\in\mathcal{X}}\sum_{u\in\mathcal{U}}P\big(x,\mu,u,\nu(\mu,h)\big)\,\mu(x)\,h(x,u),$

with $\nu(\mu,h)=\sum_{x\in\mathcal{X}}\mu(x)h(x,\cdot)$. Here $\mathcal{S}=\mathcal{P}(\mathcal{X})$ and $\mathcal{A}=(\mathcal{P}(\mathcal{U}))^{\mathcal{X}}$ form a compact continuous state-action space embedded in a finite-dimensional Euclidean space.

Note that the transition dynamics are now deterministic, thanks to the aggregation over the state-action space. Moreover, according to [10, 21], we have

###### Proposition 2.1

The Bellman equation for the $Q$ function of the above MDP-CDD problem is

 $Q_C(\mu,h)=r(\mu,h)+\gamma\max_{h'\in(\mathcal{P}(\mathcal{U}))^{\mathcal{X}}}Q_C\big(\Phi(\mu,h),h'\big).$ (2.4)

Moreover, $\mathcal{P}(\mathcal{X})\times(\mathcal{P}(\mathcal{U}))^{\mathcal{X}}$ is the minimal space on which the Bellman equation (2.4) holds.

One can find one specific MDP-CDD formulation of a network congestion control problem in Section 4.1.
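As an illustration, the aggregation in Theorem 2.1 can be sketched in a few lines of code. This is a simplified sketch, not the paper's implementation: the mean-field dependence of $P$ and $\tilde{r}$ on $(\mu,\nu)$ is suppressed, and all array shapes are assumptions of the sketch.

```python
import numpy as np

def aggregate(mu, h, P, r_tilde):
    """Aggregated reward r(mu, h) and deterministic flow Phi(mu, h).

    mu      : (X,)    state distribution
    h       : (X, U)  local action distributions h(x, .), rows sum to 1
    P       : (X, U, X) per-agent transition P(. | x, u); mean-field
              dependence on (mu, nu) suppressed in this sketch
    r_tilde : (X, U)  per-agent reward, mean-field dependence suppressed
    """
    joint = mu[:, None] * h                      # joint mass mu(x) h(x, u)
    nu = joint.sum(axis=0)                       # population action distribution
    r = float((r_tilde * joint).sum())           # aggregated reward r(mu, h)
    mu_next = np.einsum('xu,xuy->y', joint, P)   # next distribution Phi(mu, h)
    return mu_next, nu, r
```

Starting from any probability vector `mu`, the returned `mu_next` is again a probability vector, so the lifted dynamics indeed evolve deterministically on the simplex.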

## 3 Algorithm, convergence, and complexity

### 3.1 Algorithm via kernel method and approximate Bellman operator

We will design a kernel-based algorithm for the MDP-CDD in Section 2.2 with convergence and complexity analysis.

Like other Q-learning algorithms, we assume access to an existing policy that is used to sample data points for learning. This policy serves the purpose of “exploration”. For example, we can choose an $\epsilon$-greedy policy or the Boltzmann policy.

Our algorithm for MDP-CDD consists of two steps.

Kernel method. The first step is to develop a kernel-based nearest neighbor method to handle the curse of dimensionality of the continuous state-action space for the MDP-CDD. Kernel nearest neighbor is a local averaging approach, useful for approximating the value at an unobserved state-action pair from observed data.

We start by introducing the concept of an $\epsilon$-net and defining the state-action space discretization based on this $\epsilon$-net. This $\epsilon$-net will serve as the building block of the kernel regression, and the choice of $\epsilon$ will be reflected in the convergence and sample complexity analysis.

$C_\epsilon:=\{c_1,\dots,c_{N_\epsilon}\}\subset\mathcal{C}$ is called an $\epsilon$-net on $\mathcal{C}$ if for all $c\in\mathcal{C}$ there exists some $c_i\in C_\epsilon$ such that $d_C(c,c_i)<\epsilon$. Note that compactness of $\mathcal{C}$ implies the existence of an $\epsilon$-net. Denote by $A_\epsilon$ an $\epsilon$-net on $\mathcal{A}$ induced from $C_\epsilon$, i.e., $A_\epsilon$ contains all the possible action choices in $C_\epsilon$, whose size is denoted by $N_{\epsilon,A}$.
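For concreteness, one hypothetical way to build such a net when the state space is a probability simplex (as in the MFC setting of Section 2.2) is a uniform grid; the resolution-to-$\epsilon$ relation below is a rough estimate under the total variation metric, not the paper's construction.

```python
import itertools
import numpy as np

def simplex_eps_net(dim, k):
    """All grid points with coordinates j/k on the probability simplex
    in R^dim.  Under the (scaled l1) total variation distance this is
    an eps-net for eps on the order of dim / (2k)."""
    pts = []
    for head in itertools.product(range(k + 1), repeat=dim - 1):
        if sum(head) <= k:
            # last coordinate is fixed so the point sums to one
            pts.append(np.array(head + (k - sum(head),)) / k)
    return pts
```

The net has $\binom{k+\mathrm{dim}-1}{\mathrm{dim}-1}$ points, i.e., its size grows polynomially in the resolution $k$ for fixed dimension, consistent with $N_\epsilon$ scaling like a power of $1/\epsilon$.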

Denote by $B(\mathcal{C})$ the set of all bounded functions on $\mathcal{C}$ with the supremum norm, and by $B(C_\epsilon)$ the set of all bounded functions on $C_\epsilon$ with the supremum norm. Then define the so-called nearest neighbor (NN) operator $\Gamma_K:B(C_\epsilon)\to B(\mathcal{C})$ such that

 $\Gamma_K f(c)=\sum_{i=1}^{N_\epsilon}K(c_i,c)f(c_i),$ (3.5)

where $K$ is a weighted kernel function satisfying $\sum_{i=1}^{N_\epsilon}K(c_i,c)=1$ for all $c\in\mathcal{C}$. $K$ is assumed to satisfy

 $K(x,y)=0\ \text{ if }\ d_C(x,y)\ge\epsilon,\qquad\text{for all }x\in C_\epsilon,\ y\in\mathcal{C}.$ (3.6)

This requirement on $K$ ensures that the kernel regression operator is computed locally, in order to reduce the computational cost. In general, $K$ can be set as

 $K(c_i,c)=\frac{\phi(c_i,c)}{\sum_{i=1}^{N_\epsilon}\phi(c_i,c)},$ (3.7)

with some function $\phi\ge 0$ satisfying $\phi(c_i,c)=0$ when $d_C(c_i,c)\ge\epsilon$. (See Section 4 for some choices of $\phi$.)
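As one concrete (hypothetical) choice, the triangular bump $\phi(c_i,c)=\max(0,\,1-d_C(c_i,c)/\epsilon)$ satisfies both requirements. A sketch of the resulting weights (3.7) and the NN operator (3.5) follows, with Euclidean distance standing in for $d_C$:

```python
import numpy as np

def kernel_weights(net, c, eps):
    """Weights K(c_i, c) of (3.7) with phi(c_i, c) = max(0, 1 - d/eps),
    which vanishes outside the eps-ball as (3.6) requires.
    net : (N, d) array of eps-net points; c : (d,) query point."""
    d = np.linalg.norm(net - c, axis=1)
    phi = np.maximum(0.0, 1.0 - d / eps)
    total = phi.sum()
    if total == 0.0:
        raise ValueError("query point is not covered by the eps-net")
    return phi / total                # weights are nonnegative and sum to 1

def nn_operator(net, f_vals, c, eps):
    """Nearest neighbor operator (3.5): (Gamma_K f)(c) = sum_i K(c_i, c) f(c_i)."""
    return float(kernel_weights(net, c, eps) @ f_vals)
```

Because the weights form a convex combination, $\Gamma_K$ is a non-expansive mapping in the supremum norm, a property used in the convergence proof of Section 3.2.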

Approximate Bellman operator. The second step of the algorithm is to approximate the optimal Q function of the MDP-CDD in Section 2.2, which satisfies the following Bellman optimality equation

 $Q(s,a)=BQ(s,a),$

where $B$ is the Bellman operator for this MDP-CDD:

 $(Bq)(s,a):=r(s,a)+\gamma\sup_{\tilde{a}\in\mathcal{A}}q\big(\Phi(s,a),\tilde{a}\big).$ (3.8)

To deal with the issue of the continuous state-action space, we introduce an approximate Bellman operator $B_K$ such that

 $(B_K q)(c_i)=r(c_i)+\gamma\max_{\tilde{a}\in A_\epsilon}\Gamma_K q\big(\Phi(c_i),\tilde{a}\big),$ (3.9)

which is an approximate Bellman operator acting on functions defined on the net $C_\epsilon$.

Note from the above discussion that there are two layers of approximation. First, since $\Phi(c_i)$ may not be on the $\epsilon$-net, we can only approximate the value at that point by the kernel regression $\Gamma_K$. Second, in order to avoid solving an optimization problem over a continuous action space $\mathcal{A}$, we take the maximum over the $\epsilon$-net $A_\epsilon$ on the action space. Now assume we have sampled data on the $\epsilon$-net from the exploration policy; then we expect that under mild assumptions, applying the fixed point iteration of the approximate Bellman operator will lead to an accurate estimate of the true Q function for the MDP-CDD problem. This leads to the following algorithm (CDD-Q Algorithm).
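A minimal sketch of this fixed point iteration follows. It is a stand-in for Algorithm 1, which is not reproduced here; the triangular kernel and all helper names are assumptions of the sketch, not the paper's notation.

```python
import numpy as np

def cdd_q(net, rewards, next_states, actions_net, eps, gamma, n_iter=500):
    """Fixed point iteration on the approximate Bellman operator (3.9).

    net         : (N, d)  sampled eps-net of state-action pairs c_i
    rewards     : (N,)    observed rewards r(c_i)
    next_states : (N, ds) observed deterministic successors Phi(c_i)
    actions_net : (M, da) eps-net A_eps on the action space
    """
    def weights(query):
        # triangular kernel, local by construction as (3.6) requires
        d = np.linalg.norm(net - query, axis=1)
        phi = np.maximum(0.0, 1.0 - d / eps)
        return phi / phi.sum()

    Q = np.zeros(len(net))
    for _ in range(n_iter):                    # B_K is a gamma-contraction
        Q_new = np.array([
            r + gamma * max(weights(np.concatenate([s_next, a])) @ Q
                            for a in actions_net)
            for r, s_next in zip(rewards, next_states)
        ])
        if np.max(np.abs(Q_new - Q)) < 1e-12:
            return Q_new
        Q = Q_new
    return Q
```

On a toy deterministic MDP with states and actions in $\{0,1\}$, $\Phi(s,a)=a$ and $r(s,a)=s$, the iteration recovers the closed-form fixed point $Q(s,a)=s+\gamma a+\gamma^2/(1-\gamma)$ up to the tolerance.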

###### Remark 3.1

We emphasize that unlike learning algorithms for stochastic dynamics, where the learning rate is chosen to guarantee the convergence of the Q iterates, our algorithm directly conducts the fixed point iteration for the approximate Bellman operator on the sampled data set, and sets the learning rate to $1$, taking full advantage of the deterministic dynamics. Moreover, in analyzing the deterministic dynamics for MDP-CDD, the complexity analysis for the algorithm is reduced significantly as it suffices to visit each component in the $\epsilon$-net once. This is in contrast with the traditional stochastic environment, where each component in the $\epsilon$-net has to be visited sufficiently many times for a good estimate in Q-learning.

This point will be elaborated further in the next section, on the convergence and complexity analysis for this algorithm.

### 3.2 Convergence of CDD-Q Algorithm

###### Assumption 3.1 (Continuity of Φ)

There exists $L_\Phi>0$ such that for all $(s,a),(s',a')\in\mathcal{C}$, $d_S\big(\Phi(s,a),\Phi(s',a')\big)\le L_\Phi\cdot d_C\big((s,a),(s',a')\big)$.

###### Assumption 3.2 (Continuity and boundedness of r)

There exist $R,L_r>0$ such that for all $(s,a),(s',a')\in\mathcal{C}$, $|r(s,a)|\le R$ and $|r(s,a)-r(s',a')|\le L_r\cdot d_C\big((s,a),(s',a')\big)$.

###### Assumption 3.3 (Discount factor γ)

$\gamma\in(0,1)$ and $\gamma\,L_\Phi<1$.

Assumptions 3.1 and 3.2 are standard for deterministic dynamics (see [3] and [36]). The essence of these assumptions is to guarantee that the optimal value function and Q function are Lipschitz continuous (shown in Proposition 3.2), in order to establish the convergence and bounds on the sample complexity of the algorithm.

Indeed, Lipschitz continuity assumptions ensure that state-action pairs that are close to each other will have close values, which is needed for an accurate estimate. In the case of stochastic dynamics, the Lipschitz continuity of the value function or Q function is either assumed directly [36] or guaranteed by sufficient regularity of the transition kernel [29].

###### Theorem 3.2

Given Assumptions 3.1, 3.2, and 3.3, $B_K$ has a unique fixed point $Q_{C_\epsilon}$ in $B(C_\epsilon)$, and $B$ has a unique fixed point $Q_C$ in $B(\mathcal{C})$. Moreover,

 $\big\|\Gamma_K Q_{C_\epsilon}-Q_C\big\|_\infty\le\frac{(1+\gamma)L_r}{(1-\gamma L_\Phi)(1-\gamma)}\cdot\epsilon.$ (3.10)

In particular, for a fixed $\epsilon$, Algorithm 1 converges linearly to $Q_{C_\epsilon}$.

The claim regarding the convergence rate comes from the fact that the operator $B_K$ is a contraction, according to Lemma 3.1.

This theorem suggests that as long as one has enough samples to form an $\epsilon$-net, one can run the fixed point iteration defined by (3.9) to obtain $Q_{C_\epsilon}$, which is shown to be an accurate estimate of the true $Q_C$.

In order to prove Theorem 3.2, one additional operator we need to introduce is $B_{A_\epsilon}$, the Bellman operator for the MDP with the discretized action space $A_\epsilon$:

 $B_{A_\epsilon}q(s,a)=r(s,a)+\gamma\max_{\tilde{a}\in A_\epsilon}q\big(\Phi(s,a),\tilde{a}\big).$ (3.11)

$B_{A_\epsilon}$ builds a bridge between the Bellman operator $B$ and the approximate Bellman operator $B_K$. Lemma 3.1 then shows that under the assumptions of Theorem 3.2, the operators $B$, $B_{A_\epsilon}$, and $B_K$ all admit a unique fixed point. The proof of Lemma 3.1 can be found in Appendix B.1.

###### Lemma 3.1

Under Assumptions 3.1, 3.2, and 3.3, the following hold.

• $B$ has a unique fixed point, $Q_C$, in $B(\mathcal{C})$.

• $B_{A_\epsilon}$ has a unique fixed point, $\tilde{Q}_C$, in $B(\mathcal{C})$.

• $B_K$ has a unique fixed point, $Q_{C_\epsilon}$, in $B(C_\epsilon)$.

In order to facilitate the convergence analysis, we expect both $Q_C$ and $\tilde{Q}_C$ to be Lipschitz continuous. Indeed, this can be shown under Assumptions 3.1, 3.2, and 3.3. That is,

###### Proposition 3.2

Under Assumptions 3.1, 3.2, and 3.3, $Q_C$ and $\tilde{Q}_C$ are Lipschitz continuous on $\mathcal{C}$.

Based on Lemma 3.1 and Proposition 3.2, we are ready to prove Theorem 3.2.

###### Proof of Theorem 3.2.

We aim to bound $\|\Gamma_K Q_{C_\epsilon}-\tilde{Q}_C\|_\infty$ and $\|\tilde{Q}_C-Q_C\|_\infty$. For the first bound, we have

 $\begin{aligned}\|\Gamma_K Q_{C_\epsilon}-\tilde{Q}_C\|_\infty&=\|\Gamma_K B_K Q_{C_\epsilon}-\tilde{Q}_C\|_\infty=\|\Gamma_K B_{A_\epsilon}\Gamma_K Q_{C_\epsilon}-\tilde{Q}_C\|_\infty\\&\le\|\Gamma_K B_{A_\epsilon}\Gamma_K Q_{C_\epsilon}-\Gamma_K B_{A_\epsilon}\tilde{Q}_C\|_\infty+\|\Gamma_K B_{A_\epsilon}\tilde{Q}_C-\tilde{Q}_C\|_\infty\\&=\|\Gamma_K B_{A_\epsilon}\Gamma_K Q_{C_\epsilon}-\Gamma_K B_{A_\epsilon}\tilde{Q}_C\|_\infty+\|\Gamma_K\tilde{Q}_C-\tilde{Q}_C\|_\infty\\&\le\gamma\|\Gamma_K Q_{C_\epsilon}-\tilde{Q}_C\|_\infty+\|\Gamma_K\tilde{Q}_C-\tilde{Q}_C\|_\infty.\end{aligned}$

Here the first and the third equalities come from the fact that $Q_{C_\epsilon}$ is the fixed point of $B_K$ and that $\tilde{Q}_C$ is the fixed point of $B_{A_\epsilon}$. The second inequality is by the fact that $\Gamma_K$ is a non-expansive mapping, i.e., $\|\Gamma_K f-\Gamma_K g\|_\infty\le\|f-g\|_\infty$, and that $B_{A_\epsilon}$ is a contraction with modulus $\gamma$ under the supremum norm. Meanwhile, for any Lipschitz function $f$ with Lipschitz constant $L$, we have, for all $c\in\mathcal{C}$,

 $|\Gamma_K f(c)-f(c)|\le\sum_{i=1}^{N_\epsilon}K(c_i,c)\,|f(c_i)-f(c)|\le\sum_{i=1}^{N_\epsilon}K(c_i,c)\,\epsilon L=\epsilon L.$

Note here the inequality follows from (3.6): $K(c_i,c)=0$ whenever $d_C(c_i,c)\ge\epsilon$, so only points within distance $\epsilon$ of $c$ contribute. Therefore,

 $\|\Gamma_K Q_{C_\epsilon}-\tilde{Q}_C\|_\infty\le\frac{L_{\tilde{Q}_C}}{1-\gamma}\,\epsilon,$

where $L_{\tilde{Q}_C}$ is the Lipschitz constant of $\tilde{Q}_C$.

In order to prove the second part, first note that $\|Q_C-\tilde{Q}_C\|_\infty\le\gamma\,\|V_C-\tilde{V}_C\|_\infty$, where $V_C$ is the optimal value function of the MDP on $\mathcal{S}$ and $\mathcal{A}$, and $\tilde{V}_C$ is the optimal value function of the MDP on $\mathcal{S}$ and $A_\epsilon$. Hence it suffices to bound $V_C(s)-\tilde{V}_C(s)$. We adopt a strategy similar to the first part.

Let $\pi^*$ be the optimal policy of the deterministic MDP on $\mathcal{S}$ and $\mathcal{A}$, whose existence is shown in Appendix B.7. For any $s\in\mathcal{S}$, let $\{(s_t,a_t)\}_{t\ge 0}$ be the trajectory of the system under the optimal policy $\pi^*$, starting from state $s_0=s$. We have $V_C(s)=\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t)$.

Now let $a_{i_t}$ be the nearest neighbor of $a_t$ in $A_\epsilon$, so that $d_A(a_{i_t},a_t)<\epsilon$. Consider the trajectory of the system starting from $s'_0=s$ and then taking $a'_t=a_{i_t}$, and denote the corresponding state by $s'_t$. We have $\tilde{V}_C(s)\ge\sum_{t=0}^{\infty}\gamma^t r(s'_t,a'_t)$, since $\tilde{V}_C$ is the optimal value function. Then

 $d_S(s'_t,s_t)=d_S\big(\Phi(s'_{t-1},a_{i_{t-1}}),\Phi(s_{t-1},a_{t-1})\big)\le L_\Phi\cdot\big(d_S(s'_{t-1},s_{t-1})+\epsilon\big).$

Iterating, we have $d_S(s'_t,s_t)\le\frac{L_\Phi^{t+1}-L_\Phi}{L_\Phi-1}\cdot\epsilon$,

which implies

 $\begin{aligned}0\le V_C(s)-\tilde{V}_C(s)&\le\sum_{t=0}^{\infty}\gamma^t\big(r(s_t,a_t)-r(s'_t,a'_t)\big)\\&\le\sum_{t=0}^{\infty}\gamma^t\cdot L_r\cdot\frac{L_\Phi^{t+1}-1}{L_\Phi-1}\cdot\epsilon=\frac{L_r}{(1-\gamma L_\Phi)(1-\gamma)}\cdot\epsilon.\end{aligned}$

Here the first inequality is by the optimality of $\tilde{V}_C$. This completes the proof. ∎

### 3.3 Complexity analysis for CDD-Q algorithm

Note that in classical Q-learning for a stochastic environment, it is necessary that every component in the $\epsilon$-net be visited sufficiently many times for a good estimate. The term covering time refers to the expected number of time steps for a certain exploration policy to visit every component in the $\epsilon$-net at least once. The complexity analysis would then focus on how many rounds of covering time are needed. With deterministic dynamics, however, visiting each component in the $\epsilon$-net once is sufficient, thus reducing the complexity analysis to designing an exploration scheme that guarantees the boundedness of the covering time with high probability. To this end, the following assumption on the dynamics is needed.

###### Assumption 3.4 (Controllability of the dynamics)

For all $\epsilon>0$, there exists $M_\epsilon\in\mathbb{N}$, such that for any $\epsilon$-net $C_\epsilon$ on $\mathcal{C}$, for any $s\in\mathcal{S}$ and any $c_i=(s_i,a_i)\in C_\epsilon$, there always exists an action sequence $\{a^{(1)},\dots,a^{(m)}\}\subset A_\epsilon$ with $m\le M_\epsilon$, such that starting from $s$, taking $\{a^{(1)},\dots,a^{(m)}\}$ will drive the state to an $\epsilon$-neighborhood of $s_i$.

Let us denote by $T_{C,\pi}$ the covering time of the $\epsilon$-net under policy $\pi$, such that

 $T_{C,\pi}:=\sup_{s\in\mathcal{S}}\inf\big\{t>0:\ s_0=s,\ \forall c_i\in C_\epsilon,\ \exists t_i\le t\ \text{such that}\ (s_{t_i},a_{t_i})\ \text{is in the }\epsilon\text{-neighborhood of }c_i,\ \text{under the policy }\pi\big\}.$

Recall that an $\epsilon'$-greedy policy on $A_\epsilon$ is a policy which, with probability at least $\epsilon'$, explores uniformly over the actions in $A_\epsilon$. Note that such a policy always exists. Then we have the following sample complexity result.
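A generic sketch of such an exploration policy is given below. This is not a specification of the paper's exploration scheme; `greedy_action` stands for whatever the current Q estimate recommends, and the function name is an assumption of the sketch.

```python
import numpy as np

def eps_prime_greedy(greedy_action, actions_net, eps_prime, rng):
    """With probability eps_prime, explore uniformly over the action net
    A_eps; otherwise follow the current greedy action.  Every action in
    the net is then chosen with probability >= eps_prime / len(actions_net),
    which is what the covering-time bound (3.12) relies on."""
    if rng.random() < eps_prime:
        return actions_net[rng.integers(len(actions_net))]
    return greedy_action
```

The per-step exploration probability $\epsilon'/N_{\epsilon,A}$ for each action is exactly the quantity raised to the power $M_\epsilon+1$ in the bound of Theorem 3.3.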

###### Theorem 3.3

(Bound for the covering time) Under Assumption 3.4, let $\pi_{\epsilon'}$ be an $\epsilon'$-greedy policy on $A_\epsilon$. Then

 $\mathbb{E}\big[T_{C,\pi_{\epsilon'}}\big]\le(M_\epsilon+1)\cdot\frac{(N_{\epsilon,A})^{M_\epsilon+1}}{(\epsilon')^{M_\epsilon+1}}\cdot\log(N_\epsilon).$ (3.12)

Moreover, with probability at least $1-\delta$, for any initial state $s$, under the $\epsilon'$-greedy policy, the dynamics will visit each $\epsilon$-neighborhood of the elements in $C_\epsilon$ at least once, after

 $(M_\epsilon+1)\cdot\frac{(N_{\epsilon,A})^{M_\epsilon+1}}{(\epsilon')^{M_\epsilon+1}}\cdot\log(N_\epsilon)\cdot e\cdot\log(1/\delta)$ (3.13)

time steps.

Theorem 3.3 can be proved based on Lemma 3.2, whose proof can be found in Appendix B.3.

###### Lemma 3.2

Assume that for some policy $\pi$, $\mathbb{E}[T_{C,\pi}]\le T$. Then with probability at least $1-\delta$, for any initial state, under the policy $\pi$, the dynamics will visit each $\epsilon$-neighborhood of the elements in $C_\epsilon$ at least once after $T\cdot e\cdot\log(1/\delta)$ time steps, i.e. $\mathbb{P}\big(T_{C,\pi}>T\cdot e\cdot\log(1/\delta)\big)\le\delta$.

###### Proof of Theorem 3.3.

Recall there are $N_\epsilon$ different state-action pairs in the $\epsilon$-net. Denote the $\epsilon$-neighborhoods of those pairs by $B_1,\dots,B_{N_\epsilon}$. Without loss of generality, we may assume that $B_1,\dots,B_{N_\epsilon}$ are disjoint, since the covering time will only become smaller if they overlap with each other. Let $T_0=0$, and let $T_k$ be the time to visit a new neighborhood after $k-1$ neighborhoods are visited. By Assumption 3.4, for any $B_i$ with center $c_i=(s_i,a_i)$, $1\le i\le N_\epsilon$, there exists a sequence of actions in $A_\epsilon$, whose length is at most $M_\epsilon$, such that starting from $s_{T_{k-1}}$ and taking that sequence of actions will let the agent visit the $\epsilon$-neighborhood of $s_i$. Then, at that point, taking $a_i$ will let the agent visit $B_i$. Hence for all $s$ and all $1\le k\le N_\epsilon$,

 $\mathbb{P}\big(B_i\text{ is visited in }M_\epsilon+1\text{ steps}\,\big|\,s_{T_{k-1}}=s\big)\ge\Big(\frac{\epsilon'}{N_{\epsilon,A}}\Big)^{M_\epsilon+1},$

 $\mathbb{P}\big(\text{a new neighborhood is visited in }M_\epsilon+1\text{ steps}\,\big|\,s_{T_{k-1}}=s,\ k-1\text{ neighborhoods are visited}\big)\ge(N_\epsilon-k+1)\cdot\Big(\frac{\epsilon'}{N_{\epsilon,A}}\Big)^{M_\epsilon+1}.$

This implies $\mathbb{E}[T_k-T_{k-1}]\le\frac{M_\epsilon+1}{N_\epsilon-k+1}\cdot\big(\frac{N_{\epsilon,A}}{\epsilon'}\big)^{M_\epsilon+1}$. Summing from $k=1$ to $N_\epsilon$ yields the desired result. The second part follows directly from Lemma 3.2. ∎

Combining Theorem 3.2 and Theorem 3.3 yields the following convergence and sample complexity results for Algorithm 1.

###### Theorem 3.4

Assume Assumptions 3.1, 3.2, 3.3, and 3.4. Then under the $\epsilon'$-greedy policy, with probability at least $1-\delta$, for any initial state, after $(M_\epsilon+1)\cdot\frac{(N_{\epsilon,A})^{M_\epsilon+1}}{(\epsilon')^{M_\epsilon+1}}\cdot\log(N_\epsilon)\cdot e\cdot\log(1/\delta)$ samples, (1) Algorithm 1 converges linearly to $Q_{C_\epsilon}$; and (2) $\|\Gamma_K Q_{C_\epsilon}-Q_C\|_\infty\le\frac{(1+\gamma)L_r}{(1-\gamma L_\Phi)(1-\gamma)}\cdot\epsilon$.

### 3.4 Sample complexity result for mean-field control

Now recall the MFC setting: the state space $\mathcal{S}=\mathcal{P}(\mathcal{X})$ is the probability simplex in $\mathbb{R}^{|\mathcal{X}|}$, and the action space $\mathcal{A}=(\mathcal{P}(\mathcal{U}))^{\mathcal{X}}$ is the product of $|\mathcal{X}|$ copies of the probability simplex in $\mathbb{R}^{|\mathcal{U}|}$. As they are embedded in finite-dimensional Euclidean spaces, there are many commonly used and topologically equivalent metrics for $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{C}$. Here, we use the total variation distance as the metric on $\mathcal{P}(\mathcal{X})$ and $\mathcal{P}(\mathcal{U})$, i.e., $d_S(\mu_1,\mu_2)=d_{TV}(\mu_1,\mu_2)$, and define

 $d_A(h_1,h_2)=\max_{x\in\mathcal{X}}d_{TV}\big(h_1(x),h_2(x)\big).$ (3.14)

Recall $d_{TV}(\mu_1,\mu_2)=\frac{1}{2}\|\mu_1-\mu_2\|_1$ for any probability vectors $\mu_1,\mu_2$ in the probability simplex of $\mathbb{R}^{|\mathcal{X}|}$. The metric on the joint state-action space is defined as $d_C\big((\mu,h),(\mu',h')\big)=d_S(\mu,\mu')+d_A(h,h')$.
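These metrics are straightforward to compute; a short sketch with distributions represented as arrays (helper names are ours, not the paper's):

```python
import numpy as np

def d_tv(p, q):
    """Total variation distance between probability vectors: half the l1 norm."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def d_action(h1, h2):
    """Metric (3.14) on the action space (P(U))^X: worst-case TV over states x.
    h1, h2 : (X, U) arrays whose rows are distributions over U."""
    return max(d_tv(a, b) for a, b in zip(h1, h2))
```

The joint metric $d_C$ is then just the sum `d_tv(mu1, mu2) + d_action(h1, h2)`, matching (2.3).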

###### Assumption 3.5 (Continuity and boundedness of ~r)

There exist $\tilde{R},L_{\tilde{r}}>0$, such that for all $x\in\mathcal{X}$, $u\in\mathcal{U}$, $\mu_1,\mu_2\in\mathcal{P}(\mathcal{X})$, and $\nu_1,\nu_2\in\mathcal{P}(\mathcal{U})$,

 $|\tilde{r}(x,\mu_1,u,\nu_1)|\le\tilde{R},\qquad|\tilde{r}(x,\mu_1,u,\nu_1)-\tilde{r}(x,\mu_2,u,\nu_2)|\le L_{\tilde{r}}\cdot\big(d_{TV}(\mu_1,\mu_2)+d_{TV}(\nu_1,\nu_2)\big).$ (3.15)
###### Assumption 3.6 (Continuity of P)

There exists $L_P>0$ such that for all $x\in\mathcal{X}$, $u\in\mathcal{U}$, $\mu_1,\mu_2\in\mathcal{P}(\mathcal{X})$, and $\nu_1,\nu_2\in\mathcal{P}(\mathcal{U})$,

 $d_{TV}\big(P(x,\mu_1,u,\nu_1),P(x,\mu_2,u,\nu_2)\big)\le L_P\cdot\big(d_{TV}(\mu_1,\mu_2)+d_{TV}(\nu_1,\nu_2)\big).$ (3.16)
###### Theorem 3.5

Assume Assumptions 3.4, 3.5, and 3.6, and assume $\gamma\cdot(2L_P+1)<1$. Then under the $\epsilon'$-greedy policy, with probability at least $1-\delta$, for any initial state distribution $\mu$, after $(M_\epsilon+1)\cdot\frac{(N_{\epsilon,A})^{M_\epsilon+1}}{(\epsilon')^{M_\epsilon+1}}\cdot\log(N_\epsilon)\cdot e\cdot\log(1/\delta)$ samples, (1) Algorithm 1 converges linearly to $Q_{C_\epsilon}$; and (2)

 $\|\Gamma_K Q_{C_\epsilon}-Q_C\|_\infty\le\frac{4(\tilde{R}+L_{\tilde{r}})}{\big(1-\gamma\cdot(2L_P+1)\big)(1-\gamma)}\cdot\epsilon,$ (3.17)

where $L_r=2(\tilde{R}+L_{\tilde{r}})$ and $L_\Phi=2L_P+1$.

The proof of Theorem 3.5 relies on several lemmas. Please refer to Appendix C for their proofs.

###### Lemma 3.3 (Continuity of r)

Under Assumption 3.5,

 $|r(\mu,h)-r(\mu',h')|\le 2(\tilde{R}+L_{\tilde{r}})\,d_C\big((\mu,h),(\mu',h')\big).$ (3.18)
###### Lemma 3.4 (Continuity of Φ)

Under Assumption 3.6,

 $d_{TV}\big(\Phi(\mu,h),\Phi(\mu',h')\big)\le(2L_P+1)\,d_C\big((\mu,h),(\mu',h')\big).$ (3.19)
###### Proof of Theorem 3.5.

Now by Lemma 3.3 and Lemma 3.4, Assumptions 3.1, 3.2, and 3.3 hold with $L_r=2(\tilde{R}+L_{\tilde{r}})$ and $L_\Phi=2L_P+1$. Meanwhile, $N_\epsilon$, the size of the $\epsilon$-net in $\mathcal{C}$, grows polynomially in $1/\epsilon$, because $\mathcal{C}$ is a compact finite-dimensional manifold. Similarly,