
# Finite-Sample Analyses for Fully Decentralized Multi-Agent Reinforcement Learning

Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, Tamer Başar

Department of Electrical and Computer Engineering & Coordinated Science Laboratory, University of Illinois at Urbana-Champaign; Department of Operations Research and Financial Engineering, Princeton University; Department of Electrical Engineering and Computer Science and Statistics, Northwestern University; Tencent AI Lab
###### Abstract

Despite the increasing interest in multi-agent reinforcement learning (MARL) in the community, understanding its theoretical foundation has long been recognized as a challenging problem. In this work, we address this problem by providing finite-sample analyses for fully decentralized MARL. Specifically, we consider two fully decentralized MARL settings, where teams of agents are connected by time-varying communication networks and either collaborate or compete in a zero-sum game, without any central controller. These settings cover many conventional MARL settings in the literature. For both settings, we develop batch MARL algorithms that can be implemented in a fully decentralized fashion, and we quantify the finite-sample errors of the estimated action-value functions. Our error analyses characterize how the function class, the number of samples within each iteration, and the number of iterations determine the statistical accuracy of the proposed algorithms. Compared to the finite-sample bounds for single-agent RL, our results identify additional error terms caused by decentralized computation, which is inherent in our decentralized MARL setting. To our knowledge, this work provides the first finite-sample analyses for MARL, shedding light on both the sample and computational efficiency of MARL algorithms.

## 1 Introduction

Multi-agent reinforcement learning (MARL) has received increasing attention in the reinforcement learning community, with recent advances in both empirical (Foerster et al., 2016; Lowe et al., 2017; Lanctot et al., 2017) and theoretical studies (Perolat et al., 2016; Zhang et al., 2018f, e). With various models of multi-agent systems, MARL has been applied to a wide range of domains, including distributed control, telecommunications, and economics. See Busoniu et al. (2008) for a comprehensive survey on MARL.

Various settings exist in the literature on multi-agent RL, which are mainly categorized into three types: the collaborative setting, the competitive setting, and a mix of the two. In particular, collaborative MARL is usually modeled as either a multi-agent Markov decision process (MDP) (Boutilier, 1996) or a team Markov game (Wang and Sandholm, 2003), where the agents are assumed to share a common reward function. A more general yet more challenging setting for collaborative MARL considers heterogeneous reward functions for different agents, while the collective goal is to maximize the average of the long-term returns among all agents (Kar et al., 2013; Zhang et al., 2018f). This setting makes it nontrivial to design fully decentralized MARL algorithms, in which agents make globally optimal decisions using only local information, without the coordination of any central controller. The fully decentralized protocol is favored over a centralized one due to its better scalability, privacy-preserving property, and computational efficiency (Kar et al., 2013; Zhang et al., 2018f, a). Such a protocol has been broadly advocated in practical multi-agent systems, including unmanned (aerial) vehicles (Alexander and Murray, 2004; Zhang et al., 2018b), smart power grids (Zhang et al., 2018c, d), and robotics (Corke et al., 2005). Several preliminary attempts have been made towards the development of MARL algorithms in this setting (Kar et al., 2013; Zhang et al., 2018f), with theoretical guarantees for convergence. However, these guarantees are essentially asymptotic results, i.e., the algorithms are shown to converge as the number of iterations increases to infinity. No analysis has been conducted to quantify the performance of these algorithms with a finite number of iterations/samples.

Competitive MARL, on the other hand, is usually investigated under the framework of Markov games, especially zero-sum Markov games. Most existing work on competitive MARL concerns two-player games (Littman, 1994; Bowling and Veloso, 2002; Perolat et al., 2015), which can be viewed as a generalization of the standard MDP-based RL model (Littman and Szepesvári, 1996). In fact, the fully decentralized protocol above can be incorporated to generalize such a two-player competitive setting. Specifically, one may consider two teams of collaborative agents that form a zero-sum Markov game. Within each team, no central controller exists to coordinate the agents. We note that this generalized setting applies to many practical examples of multi-agent systems, including two-team-battle video games (Do Nascimento Silva and Chaimowicz, 2017), robot soccer games (Kitano et al., 1997), and security for cyber-physical systems (Cardenas et al., 2009). This setting can also be viewed as a special case of the mixed setting, which is usually modeled as a general-sum Markov game and is nontrivial to solve using RL algorithms (Littman, 2001; Zinkevich et al., 2006). This specialization facilitates non-asymptotic analysis of MARL algorithms in the mixed setting.

In general, finite-sample analyses are relatively scarce even for single-agent RL, in contrast to its empirical studies and asymptotic convergence analyses. One line of work studies the probably approximately correct (PAC) learnability and sample complexity of RL algorithms (Kakade et al., 2003; Strehl et al., 2009; Jiang et al., 2016). Though built upon solid theoretical foundations, most of these algorithms are developed for tabular RL only (Kakade et al., 2003; Strehl et al., 2009) and are computationally intractable for large-scale RL. Another significant line of work concentrates on finite-sample analyses for batch RL algorithms (Munos and Szepesvári, 2008; Antos et al., 2008a, b; Yang et al., 2018, 2019), using tools from statistical learning theory. Specifically, these results characterize the errors of the output value functions after a finite number of iterations using a finite number of samples. These algorithms are built upon value function approximation and can use data from a single trajectory, which allows them to handle massive state-action spaces and enjoy the advantages of off-policy exploration (Antos et al., 2008a, b).

Following the batch RL framework, we aim to establish finite-sample analyses for both collaborative and competitive MARL problems. For both settings, we propose fully decentralized fitted Q-iteration algorithms with value function approximation and provide finite-sample analyses. Specifically, for the collaborative setting, where a team of finitely many agents on a communication network aims to maximize the global average of the cumulative discounted rewards obtained by all the agents, we establish the statistical error of the action-value function returned by a decentralized variant of the fitted Q-iteration algorithm, measured by the \ell_{2}-distance between the estimated action-value function and the optimal one. Interestingly, we show that the statistical error can be decomposed into a sum of three error terms that reflect the effects of the function class, the number of samples in each fitted Q-iteration, and the number of iterations, respectively. Similarly, for competitive MARL, where the goal is to achieve the Nash equilibrium of a zero-sum game played by two teams of networked agents, we propose a fully decentralized algorithm that is shown to approximately achieve the desired Nash equilibrium. Moreover, in this case, we also bound the \ell_{2}-distance between the action-value function returned by the algorithm and the one at the Nash equilibrium.

Main Contribution. Our contribution is two-fold. First, we propose MARL algorithms for both the collaborative and competitive settings, where the agents are only allowed to communicate over a network and the reward of each individual agent is observed only by itself. Here, our formulation of the two-team competitive MARL problem is novel. Our algorithms incorporate function approximation of the value functions and can be implemented in a fully decentralized fashion. Second, and most importantly, we provide finite-sample analyses for these MARL algorithms, which characterize the statistical errors of the action-value functions returned by our algorithms. Our error bounds provide a clear characterization of how the function class, the number of samples, and the number of iterations affect the statistical accuracy of our algorithms. To the best of our knowledge, this work provides the first finite-sample analyses for MARL, in either the collaborative or the competitive setting.

Related Work. The original formulation of MARL traces back to Littman (1994), based on the framework of Markov games. Since then, various settings of MARL have been advocated. For the collaborative setting, Boutilier (1996); Lauer and Riedmiller (2000) proposed a multi-agent MDP model, where all agents are assumed to share identical reward functions. The formulations in Wang and Sandholm (2003); Mathkar and Borkar (2017) are based on the setting of team Markov games, which also assumes that the agents have a common reward function. More recently, Kar et al. (2013); Zhang et al. (2018f); Lee et al. (2018); Wai et al. (2018) advocated the fully decentralized MARL setting, which allows heterogeneous rewards across agents. For the competitive setting, the model of two-player zero-sum Markov games has been studied extensively in the literature (Littman, 1994; Bowling and Veloso, 2002; Perolat et al., 2015; Yang et al., 2019), and can be readily recovered from the two-team competitive MARL in the present work. In particular, the recent work of Yang et al. (2019), which studies the finite-sample performance of minimax deep Q-learning for two-player zero-sum games, can be viewed as a specialization of our two-team setting. We note that the most relevant early work that also considered such a two-team competitive MARL setting is Lagoudakis and Parr (2003), where the value function was assumed to have a factored structure among agents. As a result, a computationally efficient algorithm integrated with least-squares policy iteration was proposed to learn good strategies for one team of agents against the other. However, no theoretical, let alone finite-sample, analysis was established in that work. Several recent works have also investigated the mixed setting with both collaborative and competitive agents (Foerster et al., 2016; Tampuu et al., 2017; Lowe et al., 2017), but with a focus on empirical rather than theoretical studies.
Besides, Chakraborty and Stone (2014) also considered multi-agent learning with sample complexity analysis under the framework of repeated matrix games.

To lay theoretical foundations for RL, increasing attention has been paid to finite-sample, namely non-asymptotic, analyses of the algorithms. One line of work studies the sample complexity of RL algorithms under the framework of PAC-MDP (Kakade et al., 2003; Strehl et al., 2009; Jiang et al., 2016), focusing on the efficient exploration of the algorithms. Another, more relevant, line of work investigates the finite-sample performance of batch RL algorithms, using tools from approximate dynamic programming for the analysis. Munos and Szepesvári (2008) established finite-sample bounds for the fitted value iteration algorithm, followed by Antos et al. (2008b, a) on fitted policy iteration and continuous-action fitted Q-iteration. Similar ideas and techniques were also explored in (Scherrer et al., 2015; Farahmand et al., 2016; Yang et al., 2019) for modified policy iteration, nonparametric function spaces, and deep neural networks, respectively. Besides, for online RL algorithms, Liu et al. (2015); Dalal et al. (2018) recently carried out finite-sample analyses for temporal-difference learning algorithms. None of these existing finite-sample analyses, however, concerns multi-agent RL. The most relevant analyses for batch MARL were provided by Perolat et al. (2015, 2016), focusing on the error propagation of the algorithms, also using tools from approximate dynamic programming. However, neither a statistical rate nor the computational complexity of solving the fitting problem at each iteration was provided in the error analysis. We note that the latter becomes inevitable in our fully decentralized setting, since the computation procedure explicitly shows up in our design of fully decentralized MARL algorithms. This was not touched upon in the single-agent analyses (Antos et al., 2008b, a; Scherrer et al., 2015), since they assume that the exact solution to the fitting problem at each iteration can be obtained.

In addition, in the regime of fully decentralized decision-making, the decentralized partially observable MDP (Dec-POMDP) (Oliehoek et al., 2016) is recognized as the most general and powerful model. Accordingly, some finite-sample analyses based on the PAC framework have been established in the literature (Amato and Zilberstein, 2009; Banerjee et al., 2012; Ceren et al., 2016). However, since Dec-POMDPs are known to be NEXP-complete and thus difficult to solve in general, these Dec-POMDP solvers are built upon some or all of the following requirements: i) a centralized planning procedure to optimize the policies for all the agents (Amato and Zilberstein, 2009); ii) the availability of the model or a simulator for sampling (Amato and Zilberstein, 2009; Banerjee et al., 2012), or a method that is not completely model-free (Ceren et al., 2016); iii) a special structure of the reward (Amato and Zilberstein, 2009) or of the policy-learning process (Banerjee et al., 2012). Also, these PAC results only apply to the small-scale setting with tabular state-action spaces and mostly finite time horizons. In contrast, our analysis is amenable to batch RL algorithms that utilize function approximation for settings with large state-action spaces.

Notation. For a measurable space with domain {\mathcal{S}}, we denote by \mathcal{F}({\mathcal{S}},V) the set of measurable functions on {\mathcal{S}} that are bounded by V in absolute value. Let \mathcal{P}({\mathcal{S}}) be the set of all probability measures over {\mathcal{S}}. For any \nu\in\mathcal{P}({\mathcal{S}}) and any measurable function f\colon{\mathcal{S}}\rightarrow\mathbb{R}, we denote by \|f\|_{\nu,p} the \ell_{p}-norm of f with respect to the measure \nu for p\geq 1. For simplicity, we write \|f\|_{\nu} for \|f\|_{\nu,2}. In addition, we define \|f\|_{\infty}=\max_{s\in{\mathcal{S}}}|f(s)|. We use \bm{1} to denote the column vector of proper dimension with all elements equal to 1. We use a\vee b to denote \max\{a,b\} for any a,b\in\mathbb{R}, and define the set [K]=\{1,2,\cdots,K\}.

## 2 Problem Formulation

In this section, we introduce the formal formulation of our fully decentralized MARL problem. We first present the formulation of the fully decentralized MARL problem in a collaborative setting. Then, we lay out the competitive setting where two teams of collaborative agents form a zero-sum Markov game.

### 2.1 Fully Decentralized Collaborative MARL

Consider a team of N agents, denoted by \mathcal{N}=[N], that operate in a common environment in a collaborative fashion. In the fully decentralized setting, there exists no central controller that is able to either collect rewards or make the decisions for the agents. Alternatively, to foster the collaboration, agents are assumed to be able to exchange information via a possibly time-varying communication network. We denote the communication network by a graph G_{\tau}=(\mathcal{N},E_{\tau}), where the edge set E_{\tau} represents the set of communication links at time \tau\in\mathbb{N}. Formally, we define the following model of networked multi-agent MDP (M-MDP).

###### Definition 2.1 (Networked Multi-Agent MDP).

A networked multi-agent MDP is characterized by a tuple ({\mathcal{S}},\{\mathcal{A}^{i}\}_{i\in\mathcal{N}},P,\{R^{i}\}_{i\in\mathcal{N}},\{G_{\tau}\}_{\tau\geq 0},\gamma), where {\mathcal{S}} is the global state space shared by all the agents in \mathcal{N}, and \mathcal{A}^{i} is the set of actions that agent i can choose from. Let \mathcal{A}=\prod_{i=1}^{N}\mathcal{A}^{i} denote the joint action space of all agents. Moreover, P:{\mathcal{S}}\times\mathcal{A}\to\mathcal{P}({\mathcal{S}}) is the probability distribution of the next state, and R^{i}:{\mathcal{S}}\times\mathcal{A}\to\mathcal{P}(\mathbb{R}) is the distribution of the local reward of agent i; both depend on the global state s and the joint action a. \gamma\in(0,1) is the discount factor. {\mathcal{S}} is a compact subset of \mathbb{R}^{d}, which can be infinite; \mathcal{A} has finite cardinality A=|\mathcal{A}|; and the rewards have absolute values uniformly bounded by R_{\max}. At time t, the agents are connected by the communication network G_{t}. The states and joint actions are globally observable, while the rewards are observed only locally.

By this definition, agents observe the global state s_{t} and perform joint actions a_{t}=(a_{t}^{1},\ldots,a_{t}^{N})\in\mathcal{A} at time t. Consequently, each agent i receives an instantaneous reward r_{t}^{i} sampled from the distribution R^{i}(\cdot{\,|\,}s_{t},a_{t}). Moreover, the environment evolves to a new state s_{t+1} according to the transition probability P(\cdot{\,|\,}s_{t},a_{t}). We refer to this model as a fully decentralized one because each agent makes individual decisions based on the local information acquired from the network. In particular, we assume that given the current state, each agent i chooses actions independently of the others, following its own policy \pi^{i}:{\mathcal{S}}\to\mathcal{P}(\mathcal{A}^{i}). Thus, the joint policy of all agents, denoted by \pi\colon{\mathcal{S}}\to\mathcal{P}(\mathcal{A}), satisfies \pi(a{\,|\,}s)=\prod_{i\in\mathcal{N}}\pi^{i}(a^{i}{\,|\,}s) for any s\in{\mathcal{S}} and a\in\mathcal{A}.
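As a concrete illustration, the interaction protocol above (a globally observed state, a globally observed joint action, and privately observed heterogeneous rewards) can be sketched with a small toy instantiation; all sizes, transition biases, and reward forms below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy networked M-MDP: N = 3 agents, S = 2 global states,
# each agent picks a binary local action; a joint action is a tuple.
N, S = 3, 2

def transition(s, a_joint):
    # Hypothetical dynamics: bias toward state 1 when more agents play 1.
    p1 = 0.2 + 0.6 * sum(a_joint) / N
    return rng.choice(S, p=[1.0 - p1, p1])

def local_rewards(s, a_joint):
    # Each agent i observes only its own reward r^i(s, a); heterogeneous
    # across agents (invented form for illustration).
    return [float(s) - 0.1 * i * a_joint[i] for i in range(N)]

s = 0
a = (1, 0, 1)                 # joint action, globally observable
s_next = transition(s, a)     # environment transition s -> s'
r = local_rewards(s, a)       # r[i] is private to agent i
```

The key structural point mirrored here is that `r[i]` never leaves agent i, while `s` and `a` are common knowledge; any collaboration on the averaged reward must therefore go through the communication network.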

The collaborative goal of the agents is to maximize the global average of the cumulative discounted reward obtained by all agents over the network, which can be formally written as

 \displaystyle\max_{\pi}~\frac{1}{N}\sum_{i\in\mathcal{N}}\mathbb{E}_{a_{t}\sim\pi(\cdot{\,|\,}s_{t})}\bigg[\sum_{t=0}^{\infty}\gamma^{t}\cdot r^{i}_{t}\bigg].

Accordingly, under any joint policy \pi, the action-value function Q_{\pi}:{\mathcal{S}}\times\mathcal{A}\to{\mathbb{R}} can be defined as

 \displaystyle Q_{\pi}(s,a)=\frac{1}{N}\sum_{i\in\mathcal{N}}\mathbb{E}_{a_{t}\sim\pi(\cdot{\,|\,}s_{t})}\bigg[\sum_{t=0}^{\infty}\gamma^{t}\cdot r^{i}_{t}{\,\bigg|\,}s_{0}=s,a_{0}=a\bigg].

Notice that since r^{i}_{t}\in[-R_{\max},R_{\max}] for any i\in\mathcal{N} and t\geq 0, Q_{\pi} is bounded by R_{\max}/(1-\gamma) in absolute value for any policy \pi. We let Q_{\max}=R_{\max}/(1-\gamma) for notational convenience. Thus, we have Q_{\pi}\in\mathcal{F}({\mathcal{S}}\times\mathcal{A},Q_{\max}) for any \pi. We refer to Q_{\pi} as the global Q-function hereafter. For notational convenience, under joint policy \pi, we define the operator P_{\pi}:\mathcal{F}({\mathcal{S}}\times\mathcal{A},Q_{\max})\to\mathcal{F}({\mathcal{S}}\times\mathcal{A},Q_{\max}) and the Bellman operator {\mathcal{T}}_{\pi}:\mathcal{F}({\mathcal{S}}\times\mathcal{A},Q_{\max})\to\mathcal{F}({\mathcal{S}}\times\mathcal{A},Q_{\max}) that correspond to the globally averaged reward as follows:

 \displaystyle(P_{\pi}Q)(s,a)=\mathbb{E}_{s^{\prime}\sim P(\cdot{\,|\,}s,a),\,a^{\prime}\sim\pi(\cdot{\,|\,}s^{\prime})}\bigl[Q(s^{\prime},a^{\prime})\bigr],\qquad({\mathcal{T}}_{\pi}Q)(s,a)=\overline{r}(s,a)+\gamma\cdot(P_{\pi}Q)(s,a), (2.1)

where \overline{r}(s,a)=\sum_{i\in\mathcal{N}}r^{i}(s,a)\cdot N^{-1} denotes the globally averaged reward with r^{i}(s,a)=\int rR^{i}(dr{\,|\,}s,a). Note that the action-value function Q_{\pi} is the unique fixed point of {\mathcal{T}}_{\pi}. Similarly, we also define the optimal Bellman operator corresponding to the averaged reward \overline{r} as
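To make the operators in (2.1) concrete, a tabular sketch (with invented sizes and randomly generated P, r^{i}, and \pi) can implement P_{\pi} and {\mathcal{T}}_{\pi} directly and recover Q_{\pi} by fixed-point iteration, using the fact that {\mathcal{T}}_{\pi} is a \gamma-contraction in the sup norm:

```python
import numpy as np

# Illustrative sizes and randomly generated model (not from the paper).
S, A, N, gamma = 4, 3, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a] -> dist over s'
r_local = rng.uniform(-1, 1, size=(N, S, A))    # local rewards r^i(s, a)
r_bar = r_local.mean(axis=0)                    # globally averaged reward
pi = rng.dirichlet(np.ones(A), size=S)          # joint policy pi(a | s)

def P_pi(Q):
    # (P_pi Q)(s, a) = E_{s' ~ P(.|s,a), a' ~ pi(.|s')} [Q(s', a')]
    v = (pi * Q).sum(axis=1)    # inner expectation over a' for each s'
    return P @ v                # outer expectation over s'

def T_pi(Q):
    # (T_pi Q)(s, a) = r_bar(s, a) + gamma * (P_pi Q)(s, a)
    return r_bar + gamma * P_pi(Q)

# Q_pi is the unique fixed point of T_pi; plain fixed-point iteration
# converges geometrically at rate gamma.
Q = np.zeros((S, A))
for _ in range(500):
    Q = T_pi(Q)
```

After the loop, `Q` approximates Q_{\pi} and satisfies the fixed-point relation Q_{\pi}={\mathcal{T}}_{\pi}Q_{\pi} up to numerical precision.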

 \displaystyle({{\mathcal{T}}}Q)(s,a)=\overline{r}(s,a)+\gamma\cdot\mathbb{E}_{s^{\prime}\sim P(\cdot{\,|\,}s,a)}\bigl[\max_{a^{\prime}\in\mathcal{A}}Q(s^{\prime},a^{\prime})\bigr].

Given a vector of Q-functions \mathbf{Q}\in[\mathcal{F}({\mathcal{S}}\times\mathcal{A},Q_{\max})]^{N} with \mathbf{Q}=[Q^{i}]_{i\in\mathcal{N}}, we also define the average Bellman operator \widetilde{{\mathcal{T}}}:[\mathcal{F}({\mathcal{S}}\times\mathcal{A},Q_{\max})]^{N}\to\mathcal{F}({\mathcal{S}}\times\mathcal{A},Q_{\max}) as

 \displaystyle(\widetilde{{\mathcal{T}}}\mathbf{Q})(s,a)=\frac{1}{N}\sum_{i\in\mathcal{N}}({\mathcal{T}}^{i}Q^{i})(s,a)=\overline{r}(s,a)+\gamma\cdot\mathbb{E}_{s^{\prime}\sim P(\cdot{\,|\,}s,a)}\biggl[\frac{1}{N}\sum_{i\in\mathcal{N}}\max_{a^{\prime}\in\mathcal{A}}Q^{i}(s^{\prime},a^{\prime})\biggr]. (2.2)

Note that \widetilde{{\mathcal{T}}}\mathbf{Q}={{\mathcal{T}}}Q if Q^{i}=Q for all i\in\mathcal{N}.
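This identity is easy to verify numerically in a tabular sketch (all sizes and data invented for illustration): when every agent holds the same Q-function, applying \widetilde{{\mathcal{T}}} to the N stacked copies coincides with applying the optimal Bellman operator {\mathcal{T}} to Q:

```python
import numpy as np

S, A, N, gamma = 4, 3, 2, 0.9
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a] -> dist over s'
r_local = rng.uniform(-1, 1, size=(N, S, A))
r_bar = r_local.mean(axis=0)                    # globally averaged reward

def T(Q):
    # Optimal Bellman operator: r_bar + gamma * E_{s'}[max_a' Q(s', a')].
    return r_bar + gamma * (P @ Q.max(axis=1))

def T_tilde(Qs):
    # Average Bellman operator (2.2): the max is taken per local Q^i,
    # then averaged over agents before the expectation multiplies gamma.
    avg_max = np.mean([Q.max(axis=1) for Q in Qs], axis=0)
    return r_bar + gamma * (P @ avg_max)

Q = rng.uniform(-1, 1, size=(S, A))
# If every agent holds the same Q, the two operators coincide.
assert np.allclose(T_tilde([Q] * N), T(Q))
```

When the local Q^{i} disagree, \widetilde{{\mathcal{T}}}\mathbf{Q} and {\mathcal{T}} applied to their average generally differ, which is precisely the consensus-error effect the finite-sample analysis has to track.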

In addition, for any action-value function Q\colon{\mathcal{S}}\times\mathcal{A}\rightarrow\mathbb{R}, one can define the one-step greedy policy \pi_{Q} to be the deterministic policy that chooses the action with the largest Q-value, i.e., for any s\in{\mathcal{S}}, it holds that

 \displaystyle\pi_{Q}(a{\,|\,}s)=1\quad\text{if}~~a=\mathop{\mathrm{argmax}}_{a^{\prime}\in\mathcal{A}}~Q(s,a^{\prime}).

If more than one action a^{\prime} maximizes Q(s,a^{\prime}), ties are broken randomly. Furthermore, we can define an operator \mathcal{G}, which generates the average greedy policy of a vector of Q-functions, i.e., \mathcal{G}(\mathbf{Q})=N^{-1}\sum_{i\in\mathcal{N}}\pi_{Q^{i}}, where \pi_{Q^{i}} denotes the greedy policy with respect to Q^{i}.
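In the tabular case, the greedy policy \pi_{Q} and the averaging operator \mathcal{G} admit a direct sketch (sizes invented); note that \mathcal{G}(\mathbf{Q}) is in general a stochastic policy even though each \pi_{Q^{i}} is deterministic:

```python
import numpy as np

S, A, N = 3, 4, 2
rng = np.random.default_rng(3)

def greedy(Q):
    # Deterministic one-step greedy policy pi_Q as an (S, A) row-stochastic
    # matrix: probability 1 on the argmax action of each state.
    pi = np.zeros_like(Q)
    pi[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0
    return pi

def G(Qs):
    # Average greedy policy G(Q) = (1/N) * sum_i pi_{Q^i}.
    return np.mean([greedy(Q) for Q in Qs], axis=0)

Qs = [rng.uniform(-1, 1, size=(S, A)) for _ in range(N)]
pi_K = G(Qs)   # each row is a valid distribution over actions
```

If the agents' argmax actions disagree at some state, `pi_K` mixes over those actions, which is how the algorithm's final joint policy stays consistent across agents.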

### 2.2 Fully Decentralized Two-Team Competitive MARL

Now, we extend the fully decentralized collaborative MARL model to a competitive setting. In particular, we consider two teams, referred to as Team 1 and Team 2, that operate in a common environment. Let \mathcal{N} and \mathcal{M} be the sets of agents in Team 1 and Team 2, respectively, with |\mathcal{N}|=N and |\mathcal{M}|=M. We assume the two teams form a zero-sum Markov game, i.e., the instantaneous rewards of all agents sum to zero. Moreover, within each team, the agents can exchange information via a communication network, and collaborate in a fully decentralized fashion as defined in §2.1. We give a formal definition of such a model of networked zero-sum Markov game as follows.

###### Definition 2.2 (Networked Zero-Sum Markov Game).

A networked zero-sum Markov game is characterized by a tuple

 \displaystyle\Bigl({\mathcal{S}},\bigl\{\{\mathcal{A}^{i}\}_{i\in\mathcal{N}},\{\mathcal{B}^{j}\}_{j\in\mathcal{M}}\bigr\},P,\bigl\{\{R^{1,i}\}_{i\in\mathcal{N}},\{R^{2,j}\}_{j\in\mathcal{M}}\bigr\},\bigl\{\{G^{1}_{\tau}\}_{\tau\geq 0},\{G^{2}_{\tau}\}_{\tau\geq 0}\bigr\},\gamma\Bigr),

where {\mathcal{S}} is the global state space shared by all the agents in \mathcal{N} and \mathcal{M}, and \mathcal{A}^{i} (resp. \mathcal{B}^{j}) is the set of actions of agent i\in\mathcal{N} (resp. j\in\mathcal{M}). Let \mathcal{A}=\prod_{i\in\mathcal{N}}\mathcal{A}^{i} (resp. \mathcal{B}=\prod_{j\in\mathcal{M}}\mathcal{B}^{j}) denote the joint action space of all agents in Team 1 (resp. Team 2). Moreover, P:{\mathcal{S}}\times\mathcal{A}\times\mathcal{B}\to\mathcal{P}({\mathcal{S}}) is the probability distribution of the next state, and R^{1,i},R^{2,j}:{\mathcal{S}}\times\mathcal{A}\times\mathcal{B}\to\mathcal{P}(\mathbb{R}) are the distributions of the local rewards of agents i\in\mathcal{N} and j\in\mathcal{M}, respectively. \gamma\in(0,1) is the discount factor. The two teams form a zero-sum Markov game, i.e., at each time t, \sum_{i\in\mathcal{N}}r^{1,i}_{t}+\sum_{j\in\mathcal{M}}r^{2,j}_{t}=0, where r^{1,i}_{t}\sim R^{1,i}(\cdot{\,|\,}s_{t},a_{t},b_{t}) and r^{2,j}_{t}\sim R^{2,j}(\cdot{\,|\,}s_{t},a_{t},b_{t}) for all i\in\mathcal{N} and j\in\mathcal{M}. Moreover, {\mathcal{S}} is a compact subset of \mathbb{R}^{d}, both \mathcal{A} and \mathcal{B} have finite cardinalities A=|\mathcal{A}| and B=|\mathcal{B}|, and the rewards have absolute values uniformly bounded by R_{\max}. All the agents in Team 1 (resp. Team 2) are connected by the communication network G^{1}_{t} (resp. G^{2}_{t}) at time t. The states and joint actions are globally observable, while the rewards are observed only locally by each agent.

We note that this model generalizes the most common competitive MARL setting, which is usually modeled as a two-player zero-sum Markov game (Littman, 1994; Perolat et al., 2015). Additionally, we allow collaboration among agents within the same team, which establishes a mixed MARL setting with both competitive and collaborative agents. This mixed setting finds broad practical applications, including team-battle video games (Do Nascimento Silva and Chaimowicz, 2017), robot soccer games (Kitano et al., 1997), and security for cyber-physical systems (Cardenas et al., 2009).

For two-team competitive MARL, each team aims to find the minimax joint policy that maximizes the average of the cumulative rewards over all agents in that team. With the zero-sum assumption, this goal can also be viewed as one team, e.g., Team 1, maximizing the globally averaged return of the agents it contains, while Team 2 minimizes that globally averaged return. Accordingly, without loss of generality, we refer to Team 1 (resp. Team 2) as the maximizer (resp. minimizer) team. Let \pi^{i}:{\mathcal{S}}\to\mathcal{P}(\mathcal{A}^{i}) and \sigma^{j}:{\mathcal{S}}\to\mathcal{P}(\mathcal{B}^{j}) be the local policies of agents i\in\mathcal{N} and j\in\mathcal{M}, respectively. Then, given any fixed joint policies \pi=\prod_{i\in\mathcal{N}}\pi^{i} of Team 1 and \sigma=\prod_{j\in\mathcal{M}}\sigma^{j} of Team 2, one can similarly define the action-value function Q_{\pi,\sigma}:{\mathcal{S}}\times\mathcal{A}\times\mathcal{B}\to\mathbb{R} as

 \displaystyle Q_{\pi,\sigma}(s,a,b)=\frac{1}{N}\sum_{i\in\mathcal{N}}\mathbb{E}_{a_{t}\sim\pi(\cdot{\,|\,}s_{t}),\,b_{t}\sim\sigma(\cdot{\,|\,}s_{t})}\bigg[\sum_{t=0}^{\infty}\gamma^{t}\cdot r^{1,i}_{t}{\,\bigg|\,}s_{0}=s,a_{0}=a,b_{0}=b\bigg],

for any (s,a,b)\in{\mathcal{S}}\times\mathcal{A}\times\mathcal{B}, and define the value function V_{\pi,\sigma}:{\mathcal{S}}\to\mathbb{R} by

 V_{\pi,\sigma}(s)=\mathbb{E}_{a\sim\pi(\cdot{\,|\,}s),\,b\sim\sigma(\cdot{\,|\,}s)}[Q_{\pi,\sigma}(s,a,b)].

Note that Q_{\pi,\sigma} and V_{\pi,\sigma} are both bounded by Q_{\max}=R_{\max}/(1-\gamma) in absolute value, i.e., Q_{\pi,\sigma}\in\mathcal{F}({\mathcal{S}}\times\mathcal{A}\times\mathcal{B},Q_{\max}) and V_{\pi,\sigma}\in\mathcal{F}({\mathcal{S}},Q_{\max}). Formally, the collective goal of Team 1 is to find the optimal \pi that solves \max_{\pi}\min_{\sigma}V_{\pi,\sigma}. The collective goal of Team 2 is thus to solve \min_{\sigma}\max_{\pi}V_{\pi,\sigma}. By the minimax theorem (Von Neumann and Morgenstern, 1947; Patek, 1997), there exists some minimax value V^{*}\in\mathcal{F}({\mathcal{S}},Q_{\max}) of the game such that

 \displaystyle V^{*}=\max_{\pi}\min_{\sigma}~V_{\pi,\sigma}=\min_{\sigma}\max_{\pi}~V_{\pi,\sigma}.

Similarly, one can define the minimax Q-value of the game as Q^{*}=\max_{\pi}\min_{\sigma}Q_{\pi,\sigma}=\min_{\sigma}\max_{\pi}Q_{\pi,\sigma}. We also define the optimal Q-value of Team 1 under policy \pi as Q_{\pi}=\min_{\sigma}Q_{\pi,\sigma}, where the opponent, Team 2, is assumed to perform the best response to \pi.

Moreover, under a fixed joint policy (\pi,\sigma) of the two teams, one can define the operators P_{\pi,\sigma},P^{*}:\mathcal{F}({\mathcal{S}}\times\mathcal{A}\times\mathcal{B},Q_{\max})\to\mathcal{F}({\mathcal{S}}\times\mathcal{A}\times\mathcal{B},Q_{\max}) and the Bellman operators {\mathcal{T}}_{\pi,\sigma},{\mathcal{T}}:\mathcal{F}({\mathcal{S}}\times\mathcal{A}\times\mathcal{B},Q_{\max})\to\mathcal{F}({\mathcal{S}}\times\mathcal{A}\times\mathcal{B},Q_{\max}) by

 \displaystyle(P_{\pi,\sigma}Q)(s,a,b)=\mathbb{E}_{s^{\prime}\sim P(\cdot{\,|\,}s,a,b),\,a^{\prime}\sim\pi(\cdot{\,|\,}s^{\prime}),\,b^{\prime}\sim\sigma(\cdot{\,|\,}s^{\prime})}\bigl[Q(s^{\prime},a^{\prime},b^{\prime})\bigr],
 \displaystyle(P^{*}Q)(s,a,b)=\mathbb{E}_{s^{\prime}\sim P(\cdot{\,|\,}s,a,b)}\Bigl\{\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\mathbb{E}_{a^{\prime}\sim\pi^{\prime},\,b^{\prime}\sim\sigma^{\prime}}\bigl[Q(s^{\prime},a^{\prime},b^{\prime})\bigr]\Bigr\}, (2.3)
 \displaystyle({\mathcal{T}}_{\pi,\sigma}Q)(s,a,b)=\overline{r}^{1}(s,a,b)+\gamma\cdot(P_{\pi,\sigma}Q)(s,a,b),
 \displaystyle({\mathcal{T}}Q)(s,a,b)=\overline{r}^{1}(s,a,b)+\gamma\cdot(P^{*}Q)(s,a,b),

where \overline{r}^{1}(s,a,b)=N^{-1}\cdot\sum_{i\in\mathcal{N}}r^{1,i}(s,a,b) denotes the globally averaged reward of Team 1, with r^{1,i}(s,a,b)=\int rR^{1,i}(dr{\,|\,}s,a,b). Note that the operators P_{\pi,\sigma}, P^{*}, {\mathcal{T}}_{\pi,\sigma}, and {\mathcal{T}} are all defined with respect to the globally averaged reward of Team 1, i.e., \overline{r}^{1}. One can also define all the quantities above based on that of Team 2, i.e., \overline{r}^{2}(s,a,b)=M^{-1}\cdot\sum_{j\in\mathcal{M}}r^{2,j}(s,a,b), with the \max and \min operators exchanged. Also note that the \max\min problem on the right-hand side of (2.3), which essentially solves a matrix game given Q(s^{\prime},\cdot,\cdot), may only admit mixed-strategy solutions. For notational brevity, we also define the operator {\mathcal{T}}_{\pi} by {\mathcal{T}}_{\pi}Q=\min_{\sigma}{\mathcal{T}}_{\pi,\sigma}Q for any Q\in\mathcal{F}({\mathcal{S}}\times\mathcal{A}\times\mathcal{B},Q_{\max}). Moreover, with a slight abuse of notation, we also define the average Bellman operator \widetilde{{\mathcal{T}}}:[\mathcal{F}({\mathcal{S}}\times\mathcal{A}\times\mathcal{B},Q_{\max})]^{N}\to\mathcal{F}({\mathcal{S}}\times\mathcal{A}\times\mathcal{B},Q_{\max}) for the maximizer team, i.e., Team 1, similar to the definition in (2.2) (for convenience, we write \mathbb{E}_{a\sim\pi^{\prime},b\sim\sigma^{\prime}}\big[Q(s,a,b)\big] as \mathbb{E}_{\pi^{\prime},\sigma^{\prime}}\big[Q(s,a,b)\big] hereafter):

 \displaystyle(\widetilde{{\mathcal{T}}}\mathbf{Q})(s,a,b)=\overline{r}^{1}(s,a,b)+\gamma\cdot\mathbb{E}_{s^{\prime}\sim P(\cdot{\,|\,}s,a,b)}\biggl[\frac{1}{N}\sum_{i\in\mathcal{N}}\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}\bigl[Q^{i}(s^{\prime},a^{\prime},b^{\prime})\bigr]\biggr]. (2.4)

In addition, in a zero-sum Markov game, one can also define the greedy policy, or equilibrium policy, of one team with respect to a value or action-value function, where the policy acts optimally against the best response of the opponent team. Specifically, given any Q\in\mathcal{F}({\mathcal{S}}\times\mathcal{A}\times\mathcal{B},Q_{\max}), the equilibrium joint policy of Team 1, denoted by \pi_{Q}, is defined as

 \displaystyle\pi_{Q}(\cdot{\,|\,}s)=\mathop{\mathrm{argmax}}_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}\big[Q(s,a,b)\big]. (2.5)

With this definition, we can define an operator \mathcal{E}^{1} that generates the average equilibrium policy with respect to a vector of Q-functions for agents in Team 1. Specifically, with \mathbf{Q}=[Q^{i}]_{i\in\mathcal{N}}, we define \mathcal{E}^{1}(\mathbf{Q})=N^{-1}\cdot\sum_{i\in\mathcal{N}}\pi_{Q^{i}}, where \pi_{Q^{i}} is the equilibrium policy as defined in (2.5).
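At a fixed state s, evaluating (2.5) amounts to solving the zero-sum matrix game with payoff matrix Q(s,\cdot,\cdot), whose mixed-strategy equilibrium can be computed by a standard linear program. The sketch below assumes SciPy is available and uses the matching-pennies payoff matrix purely for illustration:

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(Q):
    """Value v and maximizer mixed strategy x of max_x min_y x^T Q y.

    Q is the (A, B) payoff matrix Q(s, ., .) of the maximizer team at one
    state. LP: maximize v subject to x in the simplex and, for every
    opponent action b, v <= sum_a x_a Q[a, b].
    """
    A, B = Q.shape
    c = np.zeros(A + 1)
    c[-1] = -1.0                                  # minimize -v
    A_ub = np.hstack([-Q.T, np.ones((B, 1))])     # v - x^T Q[:, b] <= 0
    b_ub = np.zeros(B)
    A_eq = np.zeros((1, A + 1))
    A_eq[0, :A] = 1.0                             # sum_a x_a = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * A + [(None, None)])
    return res.x[-1], res.x[:A]

# Matching pennies: the equilibrium is mixed, as noted in the text.
v, x = matrix_game_value(np.array([[1.0, -1.0], [-1.0, 1.0]]))
```

For matching pennies the LP recovers value 0 and the uniform strategy, illustrating why (2.5) generally yields stochastic equilibrium policies.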

## 3 Algorithms

In this section, we introduce the fully decentralized MARL algorithms proposed for both the collaborative and the competitive settings.

### 3.1 Fully Decentralized Collaborative MARL

Our fully decentralized MARL algorithm builds upon the fitted-Q iteration algorithm for single-agent RL (Riedmiller, 2005). In particular, all agents in the team have access to a dataset \mathcal{D}=\{(s_{t},\{a^{i}_{t}\}_{i\in\mathcal{N}},s_{t+1})\}_{t=1,\cdots,T} that records the transitions of the multi-agent system along a trajectory generated under a fixed joint behavior policy. The local reward function, however, is available only to each agent itself.

At iteration k, each agent i maintains an estimate of the globally averaged Q-function, denoted by \widetilde{Q}^{i}_{k}. Then, agent i samples local rewards \{r^{i}_{t}\}_{t=1,\cdots,T} along the trajectory \mathcal{D}, and calculates the local target data \{Y^{i}_{t}\}_{t=1,\cdots,T} following Y^{i}_{t}=r^{i}_{t}+\gamma\cdot\max_{a\in\mathcal{A}}\widetilde{Q}^{i}_{k}(s_{t+1},a). With the local data available, all agents aim to collaboratively find a common estimate of the global Q-function, by solving the following least-squares fitting problem

where \mathcal{H}\subseteq\mathcal{F}({\mathcal{S}}\times\mathcal{A},Q_{\max}) denotes the function class used for Q-function approximation. The exact solution to (3.1), denoted by \widetilde{Q}_{k+1}, can be viewed as an improved estimate of the global Q-function, which is then used to generate the targets for the next iteration k+1. In practice, however, agents have to solve (3.1) in a distributed fashion; with a finite number of iterations of any distributed optimization algorithm, the estimates at the agents may not reach exact consensus. Instead, each agent i may obtain an estimate \widetilde{Q}^{i}_{k+1} that differs from the exact solution \widetilde{Q}_{k+1}. This mismatch then propagates, since agents can only use the local \widetilde{Q}^{i}_{k+1} to generate the targets for iteration k+1. This is in fact one of the departures of our finite-sample analyses for MARL from the analyses for the single-agent setting (Munos and Szepesvári, 2008; Lazaric et al., 2010). After K iterations, each agent i finds the local greedy policy with respect to \widetilde{Q}_{K}^{i}, its local estimate of the global Q-function. To obtain a consistent joint greedy policy, all agents output the average of their local greedy policies, i.e., \pi_{K}=\mathcal{G}(\widetilde{\mathbf{Q}}_{K}). The proposed fully decentralized algorithm for collaborative MARL is summarized in Algorithm 1.
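One iteration of this scheme can be sketched as follows with a linear class f(s,a;\theta)=\theta^{\top}\varphi(s,a). For illustration we solve the fitting problem centrally (in the paper the agents solve it by decentralized optimization); all function and variable names here are our own. The sketch exploits that the quadratic objective, averaged over agents, is minimized exactly by the least-squares fit to the agent-averaged targets.

```python
import numpy as np

def fitted_q_step(phi, phi_next_all, rewards, theta_k, gamma=0.9):
    """One iteration of the fitting step with a linear function class,
    solved centrally for illustration.

    phi:          (T, d) features phi(s_t, a_t) along the trajectory
    phi_next_all: (T, A, d) features phi(s_{t+1}, a) for every action a
    rewards:      (N, T) local rewards r^i_t, one row per agent
    theta_k:      (N, d) each agent's current parameter estimate
    """
    # Local targets Y^i_t = r^i_t + gamma * max_a Qtilde^i_k(s_{t+1}, a).
    q_next = phi_next_all @ theta_k.T                 # (T, A, N)
    targets = rewards + gamma * q_next.max(axis=1).T  # (N, T)
    # The objective is quadratic and separable over agents, so its exact
    # minimizer is the least-squares fit to the agent-averaged target.
    y_bar = targets.mean(axis=0)
    theta_next, *_ = np.linalg.lstsq(phi, y_bar, rcond=None)
    return theta_next
```

In the decentralized algorithm, each agent holds only its own row of `rewards`, so the averaging over agents is implemented implicitly through consensus optimization rather than the explicit `mean` above.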

When a parametric function class is considered, we denote \mathcal{H} by \mathcal{H}_{\Theta}, where \mathcal{H}_{\Theta}=\{f(\cdot,\cdot;\theta)\in\mathcal{F}({\mathcal{S}}\times\mathcal{A},Q_{\max}):\theta\in\Theta\} and \Theta\subseteq\mathbb{R}^{d} is a compact set of parameters \theta. In this case, (3.1) becomes a vector-valued optimization problem with an objective that is separable among agents. For notational convenience, we denote g^{i}(\theta)={T}^{-1}\cdot\sum_{t=1}^{T}\bigl[Y^{i}_{t}-f(s_{t},a_{t};\theta)\bigr]^{2}; then (3.1) can be written as

Since the target data are distributed, i.e., each agent i only has access to its own g^{i}(\theta), the agents need to exchange local information over the network G_{t} to solve (3.2), which admits fully decentralized optimization algorithms. Note that problem (3.2) may be nonconvex in \theta when \mathcal{H}_{\Theta} is a nonlinear function class, e.g., deep neural networks, which makes the exact minimum of (3.2) intractable to attain. In addition, even if \mathcal{H}_{\Theta} is a linear function class, which turns (3.2) into a convex problem, decentralized optimization algorithms run for only a finite number of steps can at best converge to a neighborhood of the global minimizer. Thus, the mismatch between \widetilde{Q}^{i}_{k} and \widetilde{Q}_{k} mentioned above is inevitable in our finite-iteration analyses.

There exists a rich family of decentralized or consensus optimization algorithms for networked agents that can solve the vector-valued optimization problem (3.2). Since we consider a general setting with a time-varying communication network, several recent algorithms (Zhu and Martínez, 2013; Nedic et al., 2017; Tatarenko and Touri, 2017; Hong and Chang, 2017) may apply. When the overall objective function is strongly convex, the DIGing algorithm of Nedic et al. (2017) is, to the best of our knowledge, the most advanced one that is guaranteed to achieve a geometric/linear convergence rate. Thus, we use DIGing as an example to solve (3.2). In particular, each agent i in DIGing maintains two vectors at iteration l, i.e., the solution estimate \theta^{i}_{l}\in\mathbb{R}^{d} and the average gradient estimate \gamma^{i}_{l}\in\mathbb{R}^{d}. Each agent exchanges these two vectors with its neighbors over the time-varying network \{G_{l}\}_{l\geq 0}, weighted by some consensus matrix \mathbf{C}_{l}=[c_{l}(i,j)]_{N\times N} that respects the topology of the graph G_{l}. (Note that here we allow the communication graph to be time-varying even within each iteration k of Algorithm 1; thus, we use l as the time index of the decentralized optimization algorithm, instead of the general time index \tau in Definition 2.1.) Details on choosing the consensus matrix \mathbf{C}_{l} are provided in §4. The updates of the DIGing algorithm are summarized in Algorithm 2. If \mathcal{H}_{\Theta} is a linear function class, then (3.2) is strongly convex under mild conditions. In this case, one can quantitatively characterize the mismatch between the global minimizer of (3.2) and the output of Algorithm 2 after a finite number of iterations, thanks to the linear convergence rate of the algorithm.
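To make the two DIGing recursions concrete (consensus plus a gradient step for \theta^{i}_{l}, and gradient tracking for the average-gradient estimate), the sketch below runs them on a toy strongly convex problem. For simplicity we fix a single doubly stochastic matrix rather than a time-varying sequence \{\mathbf{C}_{l}\}; all names and the toy objective are our own.

```python
import numpy as np

def diging(grad_fns, C, theta0, alpha=0.05, iters=500):
    """Sketch of the DIGing recursions (Nedic et al., 2017):
      theta^i_{l+1} = sum_j c(i,j) theta^j_l - alpha * y^i_l
      y^i_{l+1}     = sum_j c(i,j) y^j_l
                      + grad g^i(theta^i_{l+1}) - grad g^i(theta^i_l),
    with y^i_0 = grad g^i(theta^i_0)."""
    theta = theta0.copy()
    grads = np.stack([g(theta[i]) for i, g in enumerate(grad_fns)])
    y = grads.copy()                       # gradient trackers
    for _ in range(iters):
        theta = C @ theta - alpha * y      # consensus + descent step
        new_grads = np.stack([g(theta[i]) for i, g in enumerate(grad_fns)])
        y = C @ y + new_grads - grads      # average-gradient tracking
        grads = new_grads
    return theta

# Toy problem: g^i(t) = (t - c_i)^2, so the team-wide minimizer is mean(c_i).
targets = np.array([0.0, 3.0, 6.0])
grad_fns = [lambda t, c=c: 2.0 * (t - c) for c in targets]
C = np.array([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]])
theta = diging(grad_fns, C, theta0=np.zeros((3, 1)))
```

The gradient-tracking variable is what allows every agent's iterate to converge to the minimizer of the summed objective, not merely of its local one.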

For general nonlinear function classes \mathcal{H}_{\Theta}, algorithms for nonconvex decentralized optimization (Zhu and Martínez, 2013; Hong et al., 2016; Tatarenko and Touri, 2017) can be applied. Nonetheless, the mismatch between the algorithm output and the global minimizer is very difficult to quantify, which is a fundamental issue in general nonconvex optimization problems.

### 3.2 Fully Decentralized Two-team Competitive MARL

The proposed algorithm for two-team competitive MARL is also based on the fitted-Q iteration algorithm. Similarly, agents in both teams receive their rewards along the single trajectory data \mathcal{D}=\{(s_{t},\{a^{i}_{t}\}_{i\in\mathcal{N}},\{b^{j}_{t}\}_{j\in\mathcal{M}},s_{t+1})\}_{t=1,\cdots,T}. To avoid repetition, in the sequel, we focus on the update and analysis for agents in Team 1.

At iteration k, each agent i\in\mathcal{N} in Team 1 maintains an estimate \widetilde{Q}^{1,i}_{k} of the globally averaged Q-function of its team. With the local reward r^{1,i}_{t}\sim R^{1,i}(s_{t},a_{t},b_{t}) available, agent i computes the target Y^{i}_{t}=r^{1,i}_{t}+\gamma\cdot\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}\big[\widetilde{Q}^{1,i}_{k}(s_{t+1},a,b)\big]. Then, all agents in Team 1 aim to improve the estimate of the minimax global Q-function by collaboratively solving the following least-squares fitting problem

Here, with a slight abuse of notation, we also use \mathcal{H}\subset\mathcal{F}({\mathcal{S}}\times\mathcal{A}\times\mathcal{B},Q_{\max}) to denote the function class for Q-function approximation. Similar to the discussion in §3.1, with fully decentralized algorithms solving (3.3), agents in Team 1 may not reach consensus on the estimate within a finite number of iterations. Thus, the output of the algorithm at iteration k is a vector of Q-functions, i.e., \mathbf{Q}^{1}_{k}=[Q^{1,i}_{k}]_{i\in\mathcal{N}}, whose entries may differ from the exact minimizer \widetilde{Q}^{1}_{k} of (3.3); this vector is used to compute the targets at the next iteration k+1. The final output of the algorithm after K iterations is the average equilibrium policy with respect to the vector \mathbf{Q}^{1}_{K}, i.e., \mathcal{E}^{1}(\mathbf{Q}^{1}_{K}). The fully decentralized algorithm for competitive two-team MARL is summarized in Algorithm 3. Moreover, if the function class \mathcal{H} is parameterized as \mathcal{H}_{\Theta}, especially as a linear function class, then the mismatch between \widetilde{Q}^{1,i}_{k} and \widetilde{Q}^{1}_{k} after a finite number of iterations of the distributed optimization algorithm, e.g., Algorithm 2, can be quantified. We provide a detailed discussion in §4.3.

## 4 Theoretical Results

In this section, we provide the main results on the sample complexity of the algorithms proposed in Section 3. We first introduce several common assumptions for both the collaborative and competitive settings.

The function class \mathcal{H} used for action-value function approximation greatly influences the performance of the algorithms. Here we use the concept of pseudo-dimension to capture the capacity of the function class, as in the following assumption.

###### Assumption 4.1 (Capacity of Function Classes).

Let V_{\mathcal{H}^{+}} denote the pseudo-dimension of a function class \mathcal{H}, i.e., the VC-dimension of the subgraphs of functions in \mathcal{H}. Then the function class \mathcal{H} used in both Algorithm 1 and Algorithm 3 has finite pseudo-dimension, i.e., V_{\mathcal{H}^{+}}<\infty.

In our fully decentralized setting, the agents may not have access to a simulator of the MDP transition model. Thus, the data \mathcal{D} have to be collected from an actual trajectory of the networked M-MDP (or the Markov game), under some joint behavior policy of all agents. Note that the behavior policies of the other agents are not required to be known in order to generate such a sample path. Our assumption regarding the sample path is as follows.

###### Assumption 4.2 (Sample Path).

The sample path used in the collaborative (resp. competitive team) setting, i.e., \mathcal{D}=\{(s_{t},\{a^{i}_{t}\}_{i\in\mathcal{N}},s_{t+1})\}_{t=1,\cdots,T} (resp. \mathcal{D}=\{(s_{t},\{a^{i}_{t}\}_{i\in\mathcal{N}},\{b^{j}_{t}\}_{j\in\mathcal{M}},s_{t+1})\}_{t=1,\cdots,T}), is collected from a sample path of the networked M-MDP (resp. Markov game) under some stochastic behavior policy. Moreover, the process \{(s_{t},a_{t})\} (resp. \{(s_{t},a_{t},b_{t})\}) is stationary, i.e., (s_{t},a_{t})\sim\nu\in\mathcal{P}({\mathcal{S}}\times\mathcal{A}) (resp. (s_{t},a_{t},b_{t})\sim\nu\in\mathcal{P}({\mathcal{S}}\times\mathcal{A}\times\mathcal{B})), and exponentially \beta-mixing with a rate defined by (\overline{\beta},g,\zeta). (See Definition A.1 in Appendix §A for the definitions of \beta-mixing and exponential \beta-mixing of a stochastic process.)

Here we assume a mixing property of the random process along the sample path. Informally, this means that the future of the process depends only weakly on its past, which allows us to derive tail inequalities for certain empirical processes. Note that Assumption 4.2 is standard in the literature on finite-sample analyses of batch RL using single-trajectory data (Antos et al., 2008b; Lazaric et al., 2010). We also note that the mixing coefficients need not be known when implementing the proposed algorithms.
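For concreteness, exponential \beta-mixing with rate (\overline{\beta},g,\zeta) is typically formalized as the following requirement (a sketch of the standard condition; see Definition A.1 in Appendix §A for the precise statement used here):

```latex
\beta(m)\;\leq\;\overline{\beta}\cdot\exp\bigl(-g\,m^{\zeta}\bigr),
\qquad\text{for all } m\geq 1,
```

where \beta(m) denotes the \beta-mixing coefficient between portions of the process separated by m time steps.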

In addition, we also make the following standard assumption on the concentrability coefficient of the networked M-MDP and the networked zero-sum Markov game, as in Munos and Szepesvári (2008); Antos et al. (2008b). The definitions of concentrability coefficients follow from those in Munos and Szepesvári (2008); Perolat et al. (2015). For completeness, we provide the formal definitions in Appendix §A.

###### Assumption 4.3 (Concentrability Coefficient).

Let \nu be the stationary distribution of the samples \{(s_{t},a_{t})\} (resp. \{(s_{t},a_{t},b_{t})\}) in \mathcal{D} from the networked M-MDP (resp. Markov game) in Assumption 4.2. Let \mu be a fixed distribution on {\mathcal{S}}\times\mathcal{A} (resp. on {\mathcal{S}}\times\mathcal{A}\times\mathcal{B}). We assume that there exist constants \phi_{\mu,\nu}^{\text{MDP}},\phi_{\mu,\nu}^{\text{MG}}<\infty such that

 \displaystyle(1-\gamma)^{2}\cdot\sum_{m\geq 1}\gamma^{m-1}\cdot m\cdot\kappa^{\text{MDP}}(m;\mu,\nu)\leq\phi_{\mu,\nu}^{\text{MDP}}, (4.1)
 \displaystyle(1-\gamma)^{2}\cdot\sum_{m\geq 1}\gamma^{m-1}\cdot m\cdot\kappa^{\text{MG}}(m;\mu,\nu)\leq\phi_{\mu,\nu}^{\text{MG}}, (4.2)

where \kappa^{\text{MDP}} and \kappa^{\text{MG}} are concentrability coefficients for the networked M-MDP and zero-sum Markov game as defined in §A, respectively.

The concentrability coefficient measures the similarity between \nu and the distribution of the future states of the networked M-MDP (or zero-sum Markov game) when starting from \mu. The boundedness of the concentrability coefficient can be interpreted as a controllability property of the underlying system, and holds for a large class of regular MDPs and Markov games. See Munos and Szepesvári (2008); Perolat et al. (2015) for more interpretations of concentrability coefficients.
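As a representative form (following Munos and Szepesvári (2008); the exact definition used here is deferred to Appendix §A), the m-step concentrability coefficient for the M-MDP can be written as

```latex
\kappa^{\text{MDP}}(m;\mu,\nu)
=\sup_{\pi_{1},\dots,\pi_{m}}
\Bigl\|\frac{\mathrm{d}\bigl(\mu P_{\pi_{1}}\cdots P_{\pi_{m}}\bigr)}{\mathrm{d}\nu}\Bigr\|_{\infty},
```

where \mu P_{\pi_{1}}\cdots P_{\pi_{m}} denotes the distribution of state-action pairs after following policies \pi_{1},\dots,\pi_{m} for m steps starting from \mu; the coefficient is finite whenever these future distributions are uniformly dominated by \nu.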

As mentioned in §3.1, in practice, at iteration k of Algorithm 1, with a finite number of iterations of the decentralized optimization algorithm, the output \widetilde{Q}^{i}_{k} is different from the exact minimizer of (3.1). Such mismatches between the output of the decentralized optimization algorithm and the exact solution to the fitting problem (3.3) also exist in Algorithm 3. Thus, we make the following assumption on this one-step computation error in both cases.

###### Assumption 4.4 (One-step Decentralized Computation Error).

At iteration k of Algorithm 1, the computation error from solving (3.1) is uniformly bounded, i.e., there exists some \epsilon^{i}_{k}>0 such that for any (s,a)\in{\mathcal{S}}\times\mathcal{A}, it holds that |\widetilde{Q}^{i}_{k}(s,a)-\widetilde{Q}_{k}(s,a)|\leq\epsilon^{i}_{k}, where \widetilde{Q}_{k} is the exact minimizer of (3.1) and \widetilde{Q}^{i}_{k} is the output of the decentralized optimization algorithm at agent i\in\mathcal{N}. Similarly, at iteration k of Algorithm 3, there exist some \epsilon^{1,i}_{k},\epsilon^{2,j}_{k}>0 such that for any (s,a,b)\in{\mathcal{S}}\times\mathcal{A}\times\mathcal{B}, it holds that |\widetilde{Q}^{1,i}_{k}(s,a,b)-\widetilde{Q}^{1}_{k}(s,a,b)|\leq\epsilon^{1,i}_{k} and |\widetilde{Q}^{2,j}_{k}(s,a,b)-\widetilde{Q}^{2}_{k}(s,a,b)|\leq\epsilon^{2,j}_{k}, where \widetilde{Q}^{1}_{k} and \widetilde{Q}^{2}_{k} are the exact minimizers of (3.3) for the two teams, and \widetilde{Q}^{1,i}_{k} and \widetilde{Q}^{2,j}_{k} are the outputs of the decentralized optimization algorithm at agents i\in\mathcal{N} and j\in\mathcal{M}, respectively.

The computation error, e.g., |\widetilde{Q}^{i}_{k}(s,a)-\widetilde{Q}_{k}(s,a)| in the collaborative setting, usually stems from two sources: 1) the error caused by running only a finite number of iterations of the decentralized optimization algorithm in practice; and 2) the error caused by the nonconvexity of (3.2) when a nonlinear parametric function class \mathcal{H}_{\Theta} is used. The error is always bounded for function classes \mathcal{H}\subset\mathcal{F}({\mathcal{S}}\times\mathcal{A},Q_{\max}) with bounded absolute values. Moreover, the error can be further quantified when \mathcal{H}_{\Theta} is a linear function class, as will be detailed in §4.3.

### 4.1 Fully Decentralized Collaborative MARL

Now we are ready to lay out the main results on the finite-sample error bounds for fully decentralized collaborative MARL.

###### Theorem 4.5 (Finite-sample Bounds for Decentralized Collaborative MARL).

Recall that \{\widetilde{\mathbf{Q}}_{k}\}_{0\leq k\leq K} are the estimator vectors generated from Algorithm 1, and \pi_{K}=\mathcal{G}(\widetilde{\mathbf{Q}}_{K}) is the joint average greedy policy with respect to the estimate vector \widetilde{\mathbf{Q}}_{K}. Let Q_{\pi_{K}} be the Q-function corresponding to \pi_{K}, Q^{*} be the optimal Q-function, and \widetilde{R}_{\max}=(1+\gamma)Q_{\max}+R_{\max}. Also, recall that A=|\mathcal{A}|,N=|\mathcal{N}|, and T=|\mathcal{D}|. Then, under Assumptions 4.1-4.4, for any fixed distribution \mu\in\mathcal{P}({\mathcal{S}}\times\mathcal{A}) and \delta\in(0,1], there exist constants K_{1} and K_{2} with

 \displaystyle K_{1}=K_{1}\big(V_{\mathcal{H}^{+}}\log(T),\log(1/\delta),\log(\widetilde{R}_{\max}),V_{\mathcal{H}^{+}}\log(\overline{\beta})\big),
 \displaystyle K_{2}=K_{2}\big(V_{\mathcal{H}^{+}}\log(T),V_{\mathcal{H}^{+}}\log(\overline{\beta}),V_{\mathcal{H}^{+}}\log[\widetilde{R}_{\max}(1+\gamma)],V_{\mathcal{H}^{+}}\log(Q_{\max}),V_{\mathcal{H}^{+}}\log(A)\big),

and \Lambda_{T}(\delta)=K_{1}+K_{2}\cdot N, such that with probability at least 1-\delta

 \displaystyle\|Q^{*}-Q_{\pi_{K}}\|_{\mu}\leq\underbrace{C^{\text{MDP}}_{\mu,\nu}\cdot E(\mathcal{H})}_{\text{Approximation error}}+\underbrace{C^{\text{MDP}}_{\mu,\nu}\cdot\bigg\{\frac{\Lambda_{T}(\delta/K)[\Lambda_{T}(\delta/K)/b\vee 1]^{1/\zeta}}{T/(2048\cdot\widetilde{R}^{4}_{\max})}\bigg\}^{1/4}}_{\text{Estimation error}}+\underbrace{\sqrt{2}\gamma\cdot C^{\text{MDP}}_{\mu,\nu}\cdot\overline{\epsilon}+\frac{2\sqrt{2}\gamma}{1-\gamma}\cdot\overline{\epsilon}_{K}}_{\text{Decentralized computation error}}+\frac{4\sqrt{2}\cdot Q_{\max}}{(1-\gamma)^{2}}\cdot\gamma^{K/2},

where \overline{\epsilon}_{K}=[N^{-1}\cdot\sum_{i\in\mathcal{N}}(\epsilon^{i}_{K})^{2}]^{1/2}, and

Moreover, \phi^{\text{MDP}}_{\mu,\nu}, given in (4.1), is a constant that only depends on the distributions \mu and \nu.

###### Proof.

The proof is mainly based on the following theorem that quantifies the propagation of one-step errors as Algorithm 1 proceeds.

###### Theorem 4.6 (Error Propagation for Decentralized Collaborative MARL).

Under Assumptions 4.3 and 4.4, for any fixed distribution \mu\in\mathcal{P}({\mathcal{S}}\times\mathcal{A}), we have

 \displaystyle\|Q^{*}-Q_{\pi_{K}}\|_{\mu}\leq\underbrace{C^{\text{MDP}}_{\mu,\nu}\cdot\|\varrho\|_{\nu}}_{\text{Statistical error}}+\underbrace{\sqrt{2}\gamma\cdot C^{\text{MDP}}_{\mu,\nu}\cdot\overline{\epsilon}+\frac{2\sqrt{2}\gamma}{1-\gamma}\cdot\overline{\epsilon}_{K}}_{\text{Decentralized computation error}}+\frac{4\sqrt{2}\cdot Q_{\max}}{(1-\gamma)^{2}}\cdot\gamma^{K/2},

where we define

 \displaystyle\|\varrho\|_{\nu}=\max_{k\in[K]}\|\widetilde{{\mathcal{T}}}\widetilde{\mathbf{Q}}_{k-1}-\widetilde{Q}_{k}\|_{\nu},

and \overline{\epsilon}_{K},C^{\text{MDP}}_{\mu,\nu}, and \overline{\epsilon} are as defined in Theorem 4.5.

Theorem 4.6 shows that both the one-step statistical error and the decentralized computation error propagate, and together they constitute the fundamental error that does not vanish even as the number of iterations K\to\infty. See §5.1 for the proof of Theorem 4.6.

To obtain the main results in Theorem 4.5, now it suffices to characterize the one-step statistical error \|\varrho\|_{\nu}. The following theorem establishes a high probability bound for this statistical error.

###### Theorem 4.7 (One-step Statistical Error for Decentralized Collaborative MARL).

Let \mathbf{Q}=[Q^{i}]_{i\in\mathcal{N}}\in\mathcal{H}^{N} be a vector of real-valued random functions (not necessarily independent of the sample path), let (s_{t},\{a^{i}_{t}\}_{i\in\mathcal{N}},s_{t+1}) be the samples from the trajectory data \mathcal{D}, and let \{r^{i}_{t}\}_{i\in\mathcal{N}} be the rewards sampled at (s_{t},\{a^{i}_{t}\}_{i\in\mathcal{N}},s_{t+1}). We also define Y^{i}_{t}=r^{i}_{t}+\gamma\cdot\max_{a\in\mathcal{A}}Q^{i}(s_{t+1},a), and define f^{\prime} by

 \displaystyle f^{\prime}\in\mathop{\mathrm{argmin}}_{f\in\mathcal{H}}~\frac{1}{N}\sum_{i\in\mathcal{N}}\frac{1}{T}\sum_{t=1}^{T}\bigl[Y^{i}_{t}-f(s_{t},a_{t})\bigr]^{2}. (4.3)

Then, under Assumptions 4.1 and 4.2, for \delta\in(0,1], T\geq 1, there exists some \Lambda_{T}(\delta) as defined in Theorem 4.5, such that with probability at least 1-\delta,

 \displaystyle\|f^{\prime}-\widetilde{{\mathcal{T}}}{\mathbf{Q}}\|^{2}_{\nu}\leq\inf_{f\in\mathcal{H}}\|f-\widetilde{{\mathcal{T}}}{\mathbf{Q}}\|^{2}_{\nu}+\sqrt{\frac{\Lambda_{T}(\delta)[\Lambda_{T}(\delta)/b\vee 1]^{1/\zeta}}{T/(2048\cdot\widetilde{R}^{4}_{\max})}}. (4.4)

The proof of Theorem 4.7 is provided in §5.2. Similar to the existing results in the single-agent setting (e.g., Lemma 10 in Antos et al. (2008b)), the one-step statistical error consists of two parts, the approximation error that depends on the richness of the function class \mathcal{H}, and the estimation error that vanishes with the number of samples T.

By replacing \mathbf{Q} with \widetilde{\mathbf{Q}}_{k-1} and f^{\prime} with \widetilde{Q}_{k}, Theorem 4.7 characterizes the one-step statistical error \|\widetilde{{\mathcal{T}}}\widetilde{\mathbf{Q}}_{k-1}-\widetilde{Q}_{k}\|_{\nu}. Together with Theorem 4.6, this concludes the proof of Theorem 4.5. ∎

Theorem 4.5 establishes a high probability bound on the quality of the output policy \pi_{K} obtained from Algorithm 1 after K iterations. Here we use the \mu-weighted norm of the difference between Q^{*} and Q_{\pi_{K}} as the performance metric. Theorem 4.5 shows that the finite-sample error of decentralized MARL is controlled by three fundamental terms: 1) the approximation error, which depends on the richness of the function class \mathcal{H}, i.e., how well \mathcal{H} preserves the average Bellman operator \widetilde{{\mathcal{T}}}; 2) the estimation error incurred by the fitting step (3.1), which vanishes as the number of samples T increases; and 3) the computation error from solving the least-squares problem (3.1) in a decentralized way with a finite number of updates; the remaining term of order \gamma^{K/2} decays geometrically with the number of iterations K. Note that the estimation error, after some simplification and suppression of constant and logarithmic terms, has the form

 \displaystyle\bigg\{\frac{[V_{\mathcal{H}^{+}}(N+1)\log(T)+V_{\mathcal{H}^{+}}N\log(A)+\log(K/\delta)]^{1+1/\zeta}}{T}\bigg\}^{1/4}. (4.5)

Compared with the existing results in the single-agent setting, e.g., (Antos et al., 2008b, Theorem 4), our result has an additional dependence on O(N\log(A)), where N=|\mathcal{N}| is the number of agents in the team and A=|\mathcal{A}| is the cardinality of the joint action set. The dependence on N is due to the fact that the target data used in the fitting step are collections of local target data from the N agents, while the dependence on \log(A) characterizes the difficulty of estimating Q-functions, each of which requires finding the maximum over A choices given any state s. Similar terms of order \log(A) also show up in the single-agent setting (Antos et al., 2008a, b), induced by the capacity of the action space. In addition, a close examination of the proof shows that the effective dimension (Antos et al., 2008b) is (N+1)V_{\mathcal{H}^{+}}, since we allow the N agents to have their own estimates of the Q-function, each lying in the function class \mathcal{H} with pseudo-dimension V_{\mathcal{H}^{+}}. We note that it may be possible to sharpen the dependence of the rate and the effective dimension on N via different proof techniques, which is left as future work.

### 4.2 Two-team Competitive MARL

In the sequel, we establish the finite-sample error bounds for the fully decentralized competitive MARL as follows.

###### Theorem 4.8 (Finite-sample Bounds for Decentralized Two-team Competitive MARL).

Recall that \{\widetilde{\mathbf{Q}}^{1}_{k}\}_{0\leq k\leq K} are the estimator vectors obtained by Team 1 via Algorithm 3, and \pi_{K}=\mathcal{E}^{1}(\widetilde{\mathbf{Q}}^{1}_{K}) is the joint average equilibrium policy with respect to the estimate vector \widetilde{\mathbf{Q}}^{1}_{K}. Let Q_{\pi_{K}} be the Q-function corresponding to \pi_{K}, Q^{*} be the minimax Q-function of the game, and \widetilde{R}_{\max}=(1+\gamma)Q_{\max}+R_{\max}. Also, recall that A=|\mathcal{A}|, B=|\mathcal{B}|, N=|\mathcal{N}|, and T=|\mathcal{D}|. Then, under Assumptions 4.1-4.4, for any fixed distribution \mu\in\mathcal{P}({\mathcal{S}}\times\mathcal{A}\times\mathcal{B}) and \delta\in(0,1], there exist constants K_{1} and K_{2} (which may have different values from those in Theorem 4.5) with

 \displaystyle K_{1}=K_{1}\big(V_{\mathcal{H}^{+}}\log(T),\log(1/\delta),\log(\widetilde{R}_{\max}),V_{\mathcal{H}^{+}}\log(\overline{\beta})\big),
 \displaystyle K_{2}=K_{2}\big(V_{\mathcal{H}^{+}}\log(T),V_{\mathcal{H}^{+}}\log(\overline{\beta}),V_{\mathcal{H}^{+}}\log[\widetilde{R}_{\max}(1+\gamma)],V_{\mathcal{H}^{+}}\log(Q_{\max}),V_{\mathcal{H}^{+}}\log(AB)\big),

and \Lambda_{T}(\delta)=K_{1}+K_{2}\cdot N, such that with probability at least 1-\delta

 \displaystyle\|Q^{*}-Q_{\pi_{K}}\|_{\mu}\leq\underbrace{C^{\text{MG}}_{\mu,\nu}\cdot E(\mathcal{H})}_{\text{Approximation error}}+\underbrace{C^{\text{MG}}_{\mu,\nu}\cdot\bigg\{\frac{\Lambda_{T}(\delta/K)[\Lambda_{T}(\delta/K)/b\vee 1]^{1/\zeta}}{T/(2048\cdot\widetilde{R}^{4}_{\max})}\bigg\}^{1/4}}_{\text{Estimation error}}+\underbrace{\sqrt{2}\gamma\cdot C^{\text{MG}}_{\mu,\nu}\cdot\overline{\epsilon}^{1}+\frac{2\sqrt{2}\gamma}{1-\gamma}\cdot\overline{\epsilon}^{1}_{K}}_{\text{Decentralized computation error}}+\frac{4\sqrt{2}\cdot Q_{\max}}{(1-\gamma)^{2}}\cdot\gamma^{K/2},

where \overline{\epsilon}^{1}_{K}=[N^{-1}\cdot\sum_{i\in\mathcal{N}}(\epsilon^{1,i}_{K})^{2}]^{1/2}, and

Moreover, \phi^{\text{MG}}_{\mu,\nu}, given in (4.2), is a constant that only depends on the distributions \mu and \nu.

###### Proof.

Note that, with a slight abuse of notation, the operator \widetilde{{\mathcal{T}}} here follows the definition in (2.4). The proof is mainly built upon the following error propagation bound, whose proof is provided in §5.3.

###### Theorem 4.9 (Error Propagation for Decentralized Two-team Competitive MARL).

Under Assumptions 4.3 and 4.4, for any fixed distribution \mu\in\mathcal{P}({\mathcal{S}}\times\mathcal{A}\times\mathcal{B}), we have

 \displaystyle\|Q^{*}-Q_{\pi_{K}}\|_{\mu}\leq\underbrace{C^{\text{MG}}_{\mu,\nu}\cdot\|\varrho^{1}\|_{\nu}}_{\text{Statistical error}}+\underbrace{\sqrt{2}\gamma\cdot C^{\text{MG}}_{\mu,\nu}\cdot\overline{\epsilon}^{1}+\frac{2\sqrt{2}\gamma}{1-\gamma}\cdot\overline{\epsilon}^{1}_{K}}_{\text{Decentralized computation error}}+\frac{4\sqrt{2}\cdot Q_{\max}}{(1-\gamma)^{2}}\cdot\gamma^{K/2},

where we define

 \displaystyle\|\varrho^{1}\|_{\nu}=\max_{k\in[K]}\|\widetilde{{\mathcal{T}}}\widetilde{\mathbf{Q}}^{1}_{k-1}-\widetilde{Q}^{1}_{k}\|_{\nu},

and \overline{\epsilon}^{1}_{K},C^{\text{MG}}_{\mu,\nu}, and \overline{\epsilon}^{1} are as defined in Theorem 4.8.

Similar to the error propagation results in Theorem 4.6, as the iteration number K increases, the fundamental error of the Q-function under policy \pi_{K} is bounded by two terms: the statistical error and the decentralized computation error. The former is characterized by the following theorem.

###### Theorem 4.10 (One-step Statistical Error for Decentralized Two-team Competitive MARL).

Let \mathbf{Q}=[Q^{i}]_{i\in\mathcal{N}}\in\mathcal{H}^{N} be a vector of real-valued random functions (not necessarily independent of the sample path), let (s_{t},\{a^{i}_{t}\}_{i\in\mathcal{N}},\{b^{j}_{t}\}_{j\in\mathcal{M}},s_{t+1}) be the samples from the trajectory data \mathcal{D}, and let \{r^{1,i}_{t}\}_{i\in\mathcal{N}} be the rewards sampled by agents in Team 1 at (s_{t},\{a^{i}_{t}\}_{i\in\mathcal{N}},\{b^{j}_{t}\}_{j\in\mathcal{M}},s_{t+1}). We also define Y^{i}_{t}=r^{1,i}_{t}+\gamma\cdot\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}\big[Q^{i}(s_{t+1},a,b)\big], and define f^{\prime} as the solution to (4.3). Then, under Assumptions 4.1 and 4.2, for \delta\in(0,1] and T\geq 1, there exists some \Lambda_{T}(\delta) as defined in Theorem 4.8 such that a bound of the form (4.4) holds.

By substituting the results of Theorem 4.10 into Theorem 4.9, we obtain the desired results and conclude the proof. ∎

Theorem 4.8 characterizes, in a high-probability sense, the quality of the output policy \pi_{K} for Team 1 obtained from Algorithm 3. We use the same performance metric, i.e., the weighted norm of the difference between Q_{\pi_{K}} and the minimax action-value function Q^{*}, as in the literature (Patek, 1997; Perolat et al., 2015). For brevity, we only include the error bound for Team 1 as in Perolat et al. (2015); the bound for Team 2 can be obtained immediately by exchanging the order of the \max and \min operators and some notation in the proof.

Similar to the results for the collaborative setting, the error bound is composed of three main terms: the inherent approximation error depending on the function class \mathcal{H}, the estimation error vanishing with the increasing number of samples, and the decentralized computation error. The simplified estimation error has a nearly identical form to (4.5), except that the dependence on N\log(A) is replaced by N\log(AB). Moreover, the effective dimension remains (N+1)V_{\mathcal{H}^{+}} as in (4.5). These observations further substantiate the discussion following Theorem 4.5: the dependence on N is due to the local target data distributed over the N agents in the team, and the dependence on \log(AB) follows from the capacity of the joint action space. Also note that the number of agents M=|\mathcal{M}| in Team 2 does not show up in the bound, thanks to the zero-sum structure of the rewards.

### 4.3 Using Linear Function Approximation

Now we provide more concrete finite-sample bounds for both settings when a linear function class is used for Q-function approximation. In particular, we quantify the one-step computation error assumed in Assumption 4.4, after L iterations of the decentralized optimization algorithm that solves (3.1) or (3.3). We first make the following assumption on the features of the linear function class used in both settings.

###### Assumption 4.11.

For collaborative MARL, the function class \mathcal{H}_{\Theta} used in Algorithm 1 is a parametric linear function class, i.e., \mathcal{H}_{\Theta}\subset\mathcal{F}({\mathcal{S}}\times\mathcal{A},Q_{\max}) and \mathcal{H}_{\Theta}=\{f(s,a;\theta)=\theta^{\top}\varphi(s,a):\theta\in\Theta\}, where for any (s,a)\in{\mathcal{S}}\times\mathcal{A}, \varphi(s,a)\in\mathbb{R}^{d} is the feature vector. Moreover, let \mathbf{M}^{\text{MDP}}=T^{-1}\cdot\sum_{t=1}^{T}\varphi(s_{t},a_{t})\varphi^{\top}(s_{t},a_{t}) with \{(s_{t},a_{t})\}_{t\in[T]} being samples from the data set \mathcal{D}; then the matrix \mathbf{M}^{\text{MDP}} is full rank. Similarly, for two-team competitive MARL, the function class \mathcal{H}_{\Theta}\subset\mathcal{F}({\mathcal{S}}\times\mathcal{A}\times\mathcal{B},Q_{\max}) used in Algorithm 3 is a parametric linear function class, with \varphi(s,a,b)\in\mathbb{R}^{d} being the feature vector for any (s,a,b)\in{\mathcal{S}}\times\mathcal{A}\times\mathcal{B}. Moreover, let \mathbf{M}^{\text{MG}}=T^{-1}\cdot\sum_{t=1}^{T}\varphi(s_{t},a_{t},b_{t})\varphi^{\top}(s_{t},a_{t},b_{t}) with \{(s_{t},a_{t},b_{t})\}_{t\in[T]} being samples from the data set \mathcal{D}; then the matrix \mathbf{M}^{\text{MG}} is full rank.

Since every f\in\mathcal{H}_{\Theta} has absolute value bounded by Q_{\max}, Assumption 4.11 implies that the norms of the features are uniformly bounded. The assumption on the rank of the matrices \mathbf{M}^{\text{MDP}} and \mathbf{M}^{\text{MG}} ensures that the least-squares problems (3.1) and (3.3) are strongly convex, which enables the DIGing algorithm to achieve the desired geometric convergence rate over time-varying communication networks. We note that this assumption can readily be satisfied in practice. Let \varphi(s,a)=[\varphi_{1}(s,a),\cdots,\varphi_{d}(s,a)]^{\top}. In conventional RL with linear function approximation (Tsitsiklis and Van Roy, 1997; Geramifard et al., 2013), the functions \{\varphi_{1}(s,a),\cdots,\varphi_{d}(s,a)\} (or vectors of dimension |{\mathcal{S}}|\times|\mathcal{A}| if the state space {\mathcal{S}} is finite) are required to be linearly independent. Thus, with a rich enough data set \mathcal{D}, it is not difficult to find d\ll T samples from \mathcal{D} such that the matrix [\varphi(s_{1},a_{1}),\cdots,\varphi(s_{d},a_{d})]^{\top} has full rank d. In this case, with some algebra (see Lemma B.1 in Appendix §B), one can show that the matrix \mathbf{M}^{\text{MDP}} is also full rank. A similar argument applies to the matrix \mathbf{M}^{\text{MG}}. These arguments justify the rationale behind Assumption 4.11.
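The rank argument can also be checked numerically: if d of the T sampled feature vectors are linearly independent, the empirical second-moment matrix is full rank, since \mathrm{rank}(\Phi^{\top}\Phi)=\mathrm{rank}(\Phi). A small sketch with synthetic features of our own making:

```python
import numpy as np

# If d of the T sampled feature vectors phi(s_t, a_t) are linearly
# independent, then M^MDP = T^{-1} sum_t phi_t phi_t^T is full rank,
# since rank(Phi^T Phi) = rank(Phi).
rng = np.random.default_rng(0)
d, T = 3, 50
independent = np.eye(d)                   # d linearly independent samples
others = rng.standard_normal((T - d, d))  # the remaining T - d samples
Phi = np.vstack([independent, others])    # (T, d) stacked feature matrix
M = Phi.T @ Phi / T                       # empirical second-moment matrix
```

Full rank of M then makes the least-squares objective strongly convex in \theta, as required for the linear convergence of DIGing.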

Moreover, we make the following assumption on the time-varying consensus matrix \mathbf{C}_{l} used in the DIGing algorithm (see also Assumption 1 in Nedic et al. (2017)).

###### Assumption 4.12 (Consensus Matrix Sequence \{\mathbf{C}_{l}\}).

For any l=0,1,\cdots, the consensus matrix \mathbf{C}_{l}=[c_{l}(i,j)]_{N\times N} satisfies the following relations:

1) (Decentralized property) If i\neq j, and edge (j,i)\notin E_{l}, then c_{l}(i,j)=0;

2) (Double stochasticity) \mathbf{C}_{l}\bm{1}=\bm{1} and \bm{1}^{\top}\mathbf{C}_{l}=\bm{1}^{\top};

3) (Joint spectrum property) There exists a positive integer B such that

 \displaystyle\chi<1,~{}~{}\text{where}~{}~{}\chi=\sup_{l\geq B-1}\sigma_{\max}\Big\{\mathbf{C}_{l}\mathbf{C}_{l-1}\cdots\mathbf{C}_{l-B+1}-\frac{1}{N}\bm{1}\bm{1}^{\top}\Big\},

and \sigma_{\max}(\cdot) denotes the largest singular value of a matrix.

Assumption 4.12 is standard and can be satisfied by many matrix sequences used in decentralized optimization. Specifically, condition 1) reflects the restriction imposed by the physical connections of the network; condition 2) ensures that the vector to which the iterates converge is consensual among all agents; and condition 3) imposes joint connectivity of the time-varying graphs \{G_{l}\}_{l\geq 0}. See more discussions on this assumption in (Nedic et al., 2017, Section 3).
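As an illustration of conditions 1)-3), the snippet below builds a consensus matrix via the standard Metropolis rule (a common construction in decentralized optimization; the specific graph below is hypothetical) and verifies double stochasticity and the spectral condition with B=1 for a static connected graph:

```python
import numpy as np

def metropolis_weights(adj):
    """Consensus matrix from an undirected graph via the Metropolis rule:
    c(i, j) = 1 / (1 + max(deg_i, deg_j)) on edges, zero off edges, and the
    diagonal absorbs the remaining mass, so conditions 1) and 2) hold."""
    N = adj.shape[0]
    deg = adj.sum(axis=1)
    C = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if i != j and adj[i, j]:
                C[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        C[i, i] = 1.0 - C[i].sum()
    return C

# A path graph over N = 4 agents (a hypothetical topology).
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
C = metropolis_weights(adj)
ones = np.ones(4)
assert np.allclose(C @ ones, ones) and np.allclose(ones @ C, ones)  # condition 2)

# Condition 3) with B = 1: largest singular value of C - (1/N) 11^T below one.
chi = np.linalg.svd(C - np.outer(ones, ones) / 4, compute_uv=False).max()
assert chi < 1.0
```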

Now we are ready to present the following corollary on the sample and iteration complexity of Algorithms 1 and 3, when Algorithm 2 and linear function approximation are used.

###### Corollary 4.13 (Sample and Iteration Complexity with Linear Function Approximation).

Suppose Assumptions 4.1- 4.4, and 4.11-4.12 hold, and Algorithm 2 is used in the fitting steps (3.1) and (3.3) for decentralized optimization. Let \pi_{K} be the output policy of Algorithm 1. Then for any \delta\in(0,1], \epsilon>0, and fixed distribution \mu\in\mathcal{P}({\mathcal{S}}\times\mathcal{A}), there exist integers K, T, and L, where K is linear in \log(1/\epsilon), \log[1/(1-\gamma)], and \log(Q_{\max}); T is polynomial in 1/\epsilon, \gamma/(1-\gamma), \widetilde{R}_{\max}, \log(1/\delta), \log(\overline{\beta}), and N\log(A); and L is linear in \log(1/\epsilon) and \log[\gamma/(1-\gamma)], such that

 \displaystyle\|Q^{*}-Q_{\pi_{K}}\|_{\mu}\leq C^{\text{MG}}_{\mu,\nu}\cdot E(\mathcal{H})+\epsilon

holds with probability at least 1-\delta. If \pi_{K} is the output policy of Team 1 from Algorithm 3, the same arguments also hold for any fixed distribution \mu\in\mathcal{P}({\mathcal{S}}\times\mathcal{A}\times\mathcal{B}), but with T being polynomial in N\log(AB) instead of N\log(A).

Corollary 4.13, whose proof is deferred to §LABEL:proof:coro:LFA, builds upon Theorems 4.5 and 4.8. It shows that the proposed Algorithms 1 and 3 are efficient with the aid of Algorithm 2 under mild assumptions, in the sense that a finite number of samples and iterations, scaling at most polynomially with the problem parameters, suffices to achieve an arbitrarily small Q-value error, provided that the inherent approximation error is small.

We note that if the full-rank condition in Assumption 4.11 does not hold, the fitting problems (3.1) and (3.3) are merely convex. Even so, over time-varying communication networks, it is still possible to establish a convergence rate of O(1/l) using the proximal gradient consensus algorithm (Hong and Chang, 2017). We skip a detailed discussion of other decentralized optimization algorithms, since it is beyond the scope of this paper.
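To make the decentralized fitting step concrete, the following is a minimal gradient-tracking sketch in the style of DIGing for a decentralized least-squares problem; the network, step size, and data below are illustrative assumptions, not the configuration analyzed in this paper:

```python
import numpy as np

def diging(Phi, targets, C, alpha=0.1, iters=300):
    """Gradient-tracking (DIGing-style) sketch for decentralized least squares:
    agent i holds its own local targets d^i (its fitting responses) and the
    shared features Phi; all agents converge to the minimizer of the averaged
    objective (1/N) * sum_i (1/2T) ||Phi theta - d^i||^2."""
    N = len(targets)
    T, d = Phi.shape
    grad = lambda i, th: Phi.T @ (Phi @ th - targets[i]) / T  # local gradient
    x = np.zeros((N, d))
    y = np.stack([grad(i, x[i]) for i in range(N)])           # gradient trackers
    for _ in range(iters):
        x_new = C @ x - alpha * y
        y = C @ y + np.stack([grad(i, x_new[i]) - grad(i, x[i]) for i in range(N)])
        x = x_new
    return x

rng = np.random.default_rng(1)
Phi = rng.normal(size=(40, 2))
targets = [Phi @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=40) for _ in range(3)]
C = np.full((3, 3), 1 / 3)  # complete graph with uniform consensus weights
x = diging(Phi, targets, C)

# All agents agree on the centralized least-squares solution.
theta_star, *_ = np.linalg.lstsq(Phi, np.mean(targets, axis=0), rcond=None)
assert np.allclose(x, theta_star, atol=1e-4)
```

Under the full-rank (strong convexity) condition of Assumption 4.11, iterates of this form converge geometrically, which is the property invoked in the corollary above.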

## 5 Proofs of the Main Results

In this section, we provide proofs for the main results presented in §4.

### 5.1 Proof of Theorem 4.6

###### Proof.

We start our proof by introducing some notation. For any k\in\{1,\ldots,K\}, we define

 \displaystyle\varrho_{k}=\widetilde{{\mathcal{T}}}\widetilde{\mathbf{Q}}_{k-1}-\widetilde{Q}_{k},\qquad\eta_{k}={\mathcal{T}}\widetilde{Q}_{k-1}-\widetilde{{\mathcal{T}}}\widetilde{\mathbf{Q}}_{k-1}, (5.1)

where recall that \widetilde{Q}_{k} is the exact minimizer of the least-squares problem (3.1), and \widetilde{\mathbf{Q}}_{k}=[\widetilde{Q}^{i}_{k}]_{i\in\mathcal{N}} is the vector of Q-function estimates output by the agents in Algorithm 1, both at iteration k. Also, by the definition of \widetilde{{\mathcal{T}}} in (2.2), the expression \widetilde{{\mathcal{T}}}\widetilde{\mathbf{Q}}_{k} has the form

 \displaystyle(\widetilde{{\mathcal{T}}}\widetilde{\mathbf{Q}}_{k})(s,a)=\overline{r}(s,a)+\gamma\cdot\mathbb{E}_{s^{\prime}\sim P(\cdot{\,|\,}s,a)}\biggl[\frac{1}{N}\cdot\sum_{i=1}^{N}\max_{a^{\prime}\in\mathcal{A}}\widetilde{Q}^{i}_{k}(s^{\prime},a^{\prime})\biggr].

The term \varrho_{k} captures the one-step approximation error of the fitting problem (3.1), which is caused by the finite number of samples used and by the capacity and expressive power of the function class \mathcal{H}. The term \eta_{k} captures the computational error of the decentralized optimization algorithm after a finite number of updates. Also, we denote by \pi_{k} the average greedy policy obtained from the estimator vector \widetilde{\mathbf{Q}}_{k}, i.e., \pi_{k}=\mathcal{G}(\widetilde{\mathbf{Q}}_{k}).

The proof mainly contains the following three steps.

Step (i): First, we establish a recursion between the errors of the exact minimizers of (3.1) at consecutive iterations with respect to the optimal Q-function, i.e., the recursion between Q^{*}-\widetilde{Q}_{k+1} and Q^{*}-\widetilde{Q}_{k}. To this end, we first split Q^{*}-\widetilde{Q}_{k+1} as follows by the definitions of \varrho_{k+1} and \eta_{k+1}

 \displaystyle Q^{*}-\widetilde{Q}_{k+1}=Q^{*}-\big(\widetilde{{\mathcal{T}}}\widetilde{\mathbf{Q}}_{k}-\varrho_{k+1}\big)=\big(Q^{*}-{\mathcal{T}}_{\pi^{*}}\widetilde{Q}_{k}\big)+\big({\mathcal{T}}_{\pi^{*}}\widetilde{Q}_{k}-{\mathcal{T}}\widetilde{Q}_{k}\big)+\eta_{k+1}+\varrho_{k+1}, (5.2)

where we denote by \pi^{*} the greedy policy with respect to Q^{*}.

First note that for any s^{\prime}\in{\mathcal{S}} and a^{\prime}\in\mathcal{A}, \max_{a^{\prime}}\widetilde{Q}_{k}(s^{\prime},a^{\prime})\geq\widetilde{Q}_{k}% (s^{\prime},a^{\prime}), which yields

 \displaystyle({\mathcal{T}}\widetilde{Q}_{k})(s,a)=\overline{r}(s,a)+\gamma\cdot\mathbb{E}_{s^{\prime}\sim P(\cdot{\,|\,}s,a)}\bigl[\max_{a^{\prime}}\widetilde{Q}_{k}(s^{\prime},a^{\prime})\bigr]\geq\overline{r}(s,a)+\gamma\cdot\mathbb{E}_{s^{\prime}\sim P(\cdot{\,|\,}s,a),a^{\prime}\sim\pi^{*}(\cdot{\,|\,}s^{\prime})}\bigl[\widetilde{Q}_{k}(s^{\prime},a^{\prime})\bigr]=({\mathcal{T}}_{\pi^{*}}\widetilde{Q}_{k})(s,a).

Thus, it follows that {\mathcal{T}}_{\pi^{*}}\widetilde{Q}_{k}\leq{\mathcal{T}}\widetilde{Q}_{k}. Combined with (5.2), we further obtain

 \displaystyle Q^{*}-\widetilde{Q}_{k+1}\leq\big(Q^{*}-{\mathcal{T}}_{\pi^{*}}\widetilde{Q}_{k}\big)+\eta_{k+1}+\varrho_{k+1}. (5.3)

Similarly, we can establish a lower bound for Q^{*}-\widetilde{Q}_{k+1} based on Q^{*}-\widetilde{Q}_{k}. Note that

 \displaystyle Q^{*}-\widetilde{Q}_{k+1}=\big(Q^{*}-{\mathcal{T}}_{\widetilde{\pi}_{k}}Q^{*}\big)+\big({\mathcal{T}}_{\widetilde{\pi}_{k}}Q^{*}-{\mathcal{T}}\widetilde{Q}_{k}\big)+\eta_{k+1}+\varrho_{k+1},

where \widetilde{\pi}_{k} is the greedy policy with respect to \widetilde{Q}_{k}, i.e., {\mathcal{T}}\widetilde{Q}_{k}={\mathcal{T}}_{\widetilde{\pi}_{k}}\widetilde{Q}_{k}. Since Q^{*}={\mathcal{T}}Q^{*}\geq{\mathcal{T}}_{\widetilde{\pi}_{k}}Q^{*}, it holds that

 \displaystyle Q^{*}-\widetilde{Q}_{k+1}\geq\big({\mathcal{T}}_{\widetilde{\pi}_{k}}Q^{*}-{\mathcal{T}}_{\widetilde{\pi}_{k}}\widetilde{Q}_{k}\big)+\eta_{k+1}+\varrho_{k+1}. (5.4)

By combining (5.3) and (5.4), we obtain that for any k\in\{0,\ldots,K-1\},

 \displaystyle\big({\mathcal{T}}_{\widetilde{\pi}_{k}}Q^{*}-{\mathcal{T}}_{\widetilde{\pi}_{k}}\widetilde{Q}_{k}\big)+\eta_{k+1}+\varrho_{k+1}\leq Q^{*}-\widetilde{Q}_{k+1}\leq\big({\mathcal{T}}_{\pi^{*}}Q^{*}-{\mathcal{T}}_{\pi^{*}}\widetilde{Q}_{k}\big)+\eta_{k+1}+\varrho_{k+1}. (5.5)

(5.5) shows that one can both upper and lower bound the error Q^{*}-\widetilde{Q}_{k+1} using terms related to Q^{*}-\widetilde{Q}_{k}, plus two error terms \eta_{k+1} and \varrho_{k+1} as defined in (5.1). With the definition of P_{\pi} in (2.1), we can write (5.5) in a more compact form as

 \displaystyle\gamma\cdot P_{\widetilde{\pi}_{k}}(Q^{*}-\widetilde{Q}_{k})+\eta_{k+1}+\varrho_{k+1}\leq Q^{*}-\widetilde{Q}_{k+1}\leq\gamma\cdot P_{\pi^{*}}(Q^{*}-\widetilde{Q}_{k})+\eta_{k+1}+\varrho_{k+1}. (5.6)

Note that since P_{\pi} is a linear operator, we can derive the following bounds for multi-step error propagation.

###### Lemma 5.1 (Multi-step Error Propagation in Collaborative MARL).

For any k,\ell\in\{0,1,\ldots,K-1\} with k<\ell, we have

 \displaystyle Q^{*}-\widetilde{Q}_{\ell}\geq\sum_{j=k}^{\ell-1}\gamma^{\ell-1-j}\cdot(P_{\widetilde{\pi}_{\ell-1}}\cdots P_{\widetilde{\pi}_{j+1}})(\eta_{j+1}+\varrho_{j+1})+\gamma^{\ell-k}\cdot(P_{\widetilde{\pi}_{\ell-1}}\cdots P_{\widetilde{\pi}_{k}})(Q^{*}-\widetilde{Q}_{k}),
 \displaystyle Q^{*}-\widetilde{Q}_{\ell}\leq\sum_{j=k}^{\ell-1}\gamma^{\ell-1-j}\cdot(P_{\pi^{*}})^{\ell-1-j}(\eta_{j+1}+\varrho_{j+1})+\gamma^{\ell-k}\cdot(P_{\pi^{*}})^{\ell-k}(Q^{*}-\widetilde{Q}_{k}),

where \varrho_{j+1} and \eta_{j+1} are defined in (5.1), and we use P_{\pi}P_{\pi^{\prime}} and (P_{\pi})^{k} to denote the composition of operators.

###### Proof.

By the linearity of the operator P_{\pi}, we can obtain the desired results by applying the inequalities in (5.6) multiple times. ∎

The bounds for multi-step error propagation in Lemma 5.1 conclude the first step of our proof.

Step (ii): Step (i) only establishes the propagation of error Q^{*}-\widetilde{Q}_{k}. To evaluate the output of Algorithm 1, we need to further derive the propagation of error Q^{*}-Q_{\pi_{k}}, where Q_{\pi_{k}} is the Q-function corresponding to the output joint policy \pi_{k} from Algorithm 1. The error Q^{*}-Q_{\pi_{k}} quantifies the sub-optimality of the output policy \pi_{k} at iteration k.

By definition of Q^{*}, we have Q^{*}\geq Q_{\pi_{k}} and Q^{*}={\mathcal{T}}_{\pi^{*}}Q^{*}. Also note Q_{\pi_{k}}={\mathcal{T}}_{\pi_{k}}Q_{\pi_{k}} and {\mathcal{T}}\widetilde{Q}^{i}_{k}={\mathcal{T}}_{\widetilde{\pi}^{i}_{k}}\widetilde{Q}^{i}_{k}, where we denote the greedy policy with respect to \widetilde{Q}^{i}_{k} by \widetilde{\pi}^{i}_{k}, i.e., \widetilde{\pi}^{i}_{k}=\pi_{\widetilde{Q}^{i}_{k}}, for notational convenience. Hence, it follows that

 \displaystyle Q^{*}-Q_{\pi_{k}}={\mathcal{T}}_{\pi^{*}}Q^{*}-{\mathcal{T}}_{\pi_{k}}Q_{\pi_{k}}=\bigg({\mathcal{T}}_{\pi^{*}}Q^{*}-\frac{1}{N}\sum_{i\in\mathcal{N}}{\mathcal{T}}_{\pi^{*}}\widetilde{Q}^{i}_{k}\bigg)+\frac{1}{N}\sum_{i\in\mathcal{N}}\big({\mathcal{T}}_{\pi^{*}}-{\mathcal{T}}_{\widetilde{\pi}^{i}_{k}}\big)\widetilde{Q}^{i}_{k}+\bigg(\frac{1}{N}\sum_{i\in\mathcal{N}}{\mathcal{T}}_{\widetilde{\pi}^{i}_{k}}\widetilde{Q}^{i}_{k}-{\mathcal{T}}_{\pi_{k}}\widetilde{Q}_{k}\bigg)+\big({\mathcal{T}}_{\pi_{k}}\widetilde{Q}_{k}-{\mathcal{T}}_{\pi_{k}}Q_{\pi_{k}}\big). (5.7)

Now we show that the four terms on the right-hand side of (5.7) can be bounded, respectively. First, by definition of \widetilde{\pi}^{i}_{k}, we have

 \displaystyle{\mathcal{T}}_{\pi^{*}}\widetilde{Q}^{i}_{k}-{\mathcal{T}}_{\widetilde{\pi}^{i}_{k}}\widetilde{Q}^{i}_{k}={\mathcal{T}}_{\pi^{*}}\widetilde{Q}^{i}_{k}-{\mathcal{T}}\widetilde{Q}^{i}_{k}\leq 0,\quad\text{for all }i\in\mathcal{N}. (5.8)

Moreover, since \pi_{k}=\mathcal{G}(\widetilde{\mathbf{Q}}_{k}), it holds that for any Q, {\mathcal{T}}_{\pi_{k}}Q={N}^{-1}\sum_{i\in\mathcal{N}}{\mathcal{T}}_{\widetilde{\pi}^{i}_{k}}Q, where \widetilde{\pi}^{i}_{k} is the greedy policy with respect to \widetilde{Q}^{i}_{k}. Then, by definition of the operator P_{\pi}, we have

 \displaystyle{\mathcal{T}}_{\pi^{*}}Q^{*}-{\mathcal{T}}_{\pi^{*}}\widetilde{Q}^{i}_{k}=\gamma\cdot P_{\pi^{*}}(Q^{*}-\widetilde{Q}^{i}_{k}),\qquad{\mathcal{T}}_{\widetilde{\pi}^{i}_{k}}\widetilde{Q}^{i}_{k}-{\mathcal{T}}_{\widetilde{\pi}^{i}_{k}}\widetilde{Q}_{k}=\gamma\cdot P_{\widetilde{\pi}^{i}_{k}}(\widetilde{Q}^{i}_{k}-\widetilde{Q}_{k}). (5.9)

By substituting (5.8) and (5.9) into (5.7), we obtain

 \displaystyle Q^{*}-Q_{\pi_{k}}\leq\gamma\cdot P_{\pi^{*}}\bigg(Q^{*}-\frac{1}{N}\sum_{i\in\mathcal{N}}\widetilde{Q}^{i}_{k}\bigg)+\gamma\cdot\frac{1}{N}\sum_{i\in\mathcal{N}}P_{\widetilde{\pi}^{i}_{k}}(\widetilde{Q}^{i}_{k}-\widetilde{Q}_{k})+\big({\mathcal{T}}_{\pi_{k}}\widetilde{Q}_{k}-{\mathcal{T}}_{\pi_{k}}Q_{\pi_{k}}\big)
 \displaystyle\qquad=\gamma\cdot(P_{\pi^{*}}-P_{\pi_{k}})(Q^{*}-\widetilde{Q}_{k})+\gamma\cdot\frac{1}{N}\sum_{i\in\mathcal{N}}\big(P_{\widetilde{\pi}^{i}_{k}}-P_{\pi^{*}}\big)(\widetilde{Q}^{i}_{k}-\widetilde{Q}_{k})+\gamma\cdot P_{\pi_{k}}(Q^{*}-Q_{\pi_{k}}).

This further implies

 \displaystyle(I-\gamma\cdot P_{\pi_{k}})(Q^{*}-Q_{\pi_{k}})\leq\gamma\cdot(P_{\pi^{*}}-P_{\pi_{k}})(Q^{*}-\widetilde{Q}_{k})+\gamma\cdot\frac{1}{N}\sum_{i\in\mathcal{N}}\big(P_{\widetilde{\pi}^{i}_{k}}-P_{\pi^{*}}\big)(\widetilde{Q}^{i}_{k}-\widetilde{Q}_{k}),

where I is the identity operator. Note that for any policy \pi, the operator {\mathcal{T}}_{\pi} is \gamma-contractive. Thus the operator I-\gamma\cdot P_{\pi} is invertible and it follows that

 \displaystyle 0\leq Q^{*}-Q_{\pi_{k}}\leq\gamma\cdot(I-\gamma\cdot P_{\pi_{k}})^{-1}\bigl[P_{\pi^{*}}(Q^{*}-\widetilde{Q}_{k})-P_{\pi_{k}}(Q^{*}-\widetilde{Q}_{k})\bigr]+\gamma\cdot(I-\gamma\cdot P_{\pi_{k}})^{-1}\biggl[\frac{1}{N}\sum_{i\in\mathcal{N}}\big(P_{\widetilde{\pi}^{i}_{k}}-P_{\pi^{*}}\big)(\widetilde{Q}^{i}_{k}-\widetilde{Q}_{k})\biggr]. (5.10)

With the term (Q^{*}-\widetilde{Q}_{k}) appearing on the right-hand side, we can further bound (5.10) by applying Lemma 5.1. To this end, first note that for any f_{1},f_{2}\in\mathcal{F}({\mathcal{S}}\times\mathcal{A},Q_{\max}) such that f_{1}\geq f_{2}, it holds that P_{\pi}f_{1}\geq P_{\pi}f_{2} by definition of P_{\pi}. Thus, applying Lemma 5.1 with k=0 and \ell=K and using this monotonicity, we obtain the following upper and lower bounds

 \displaystyle P_{\pi^{*}}(Q^{*}-\widetilde{Q}_{K})\leq P_{\pi^{*}}\biggl[\sum_{j=0}^{K-1}\gamma^{K-1-j}\cdot(P_{\pi^{*}})^{K-1-j}(\eta_{j+1}+\varrho_{j+1})+\gamma^{K}\cdot(P_{\pi^{*}})^{K}(Q^{*}-\widetilde{Q}_{0})\biggr], (5.11)

 \displaystyle P_{\pi_{K}}(Q^{*}-\widetilde{Q}_{K})\geq P_{\pi_{K}}\biggl[\sum_{j=0}^{K-1}\gamma^{K-1-j}\cdot(P_{\widetilde{\pi}_{K-1}}\cdots P_{\widetilde{\pi}_{j+1}})(\eta_{j+1}+\varrho_{j+1})+\gamma^{K}\cdot(P_{\widetilde{\pi}_{K-1}}\cdots P_{\widetilde{\pi}_{0}})(Q^{*}-\widetilde{Q}_{0})\biggr]. (5.12)

Moreover, we denote the second term on the right-hand side of (5.10) by \xi_{k}, i.e.,

 \displaystyle\xi_{k}=\gamma\cdot(I-\gamma\cdot P_{\pi_{k}})^{-1}\biggl[\frac{1}{N}\sum_{i\in\mathcal{N}}\big(P_{\widetilde{\pi}^{i}_{k}}-P_{\pi^{*}}\big)(\widetilde{Q}^{i}_{k}-\widetilde{Q}_{k})\biggr].

Note that \xi_{k} depends on the accuracy of the output of the decentralized optimization algorithm at iteration k, i.e., on the error \widetilde{Q}^{i}_{k}-\widetilde{Q}_{k}, which vanishes as the number of updates of the decentralized optimization algorithm grows, since the least-squares problem (3.1) is convex. With this definition, together with (5.11) and (5.12), we obtain the bound for the error Q^{*}-Q_{\pi_{K}} at the final iteration K as

 \displaystyle Q^{*}-Q_{\pi_{K}}\leq(I-\gamma\cdot P_{\pi_{K}})^{-1}\bigg\{\sum_{j=0}^{K-1}\gamma^{K-j}\cdot\bigl[(P_{\pi^{*}})^{K-j}-(P_{\pi_{K}}P_{\widetilde{\pi}_{K-1}}\cdots P_{\widetilde{\pi}_{j+1}})\bigr](\eta_{j+1}+\varrho_{j+1})+\gamma^{K+1}\cdot\bigl[(P_{\pi^{*}})^{K+1}-(P_{\pi_{K}}P_{\widetilde{\pi}_{K-1}}\cdots P_{\widetilde{\pi}_{0}})\bigr](Q^{*}-\widetilde{Q}_{0})\bigg\}+\xi_{K}. (5.13)

To simplify the notation, we introduce the coefficients

 \displaystyle\alpha_{j}=\frac{(1-\gamma)\gamma^{K-j-1}}{1-\gamma^{K+1}},\quad\text{for}\quad 0\leq j\leq K-1,\quad\text{and}\quad\alpha_{K}=\frac{(1-\gamma)\gamma^{K}}{1-\gamma^{K+1}}. (5.14)
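As a quick sanity check, the coefficients in (5.14) form a convex combination, i.e., \sum_{j=0}^{K}\alpha_{j}=1 (the values of \gamma and K below are illustrative):

```python
# Geometric-series identity behind (5.14): the alpha_j sum to one.
gamma, K = 0.9, 25  # illustrative values, not those of the theorem
alphas = [(1 - gamma) * gamma ** (K - j - 1) / (1 - gamma ** (K + 1)) for j in range(K)]
alphas.append((1 - gamma) * gamma ** K / (1 - gamma ** (K + 1)))
assert abs(sum(alphas) - 1.0) < 1e-12
```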

Also, we introduce K+1 linear operators \{\mathcal{L}_{k}\}_{k=0}^{K} that are defined as

 \displaystyle\mathcal{L}_{j}=\frac{1-\gamma}{2}\cdot(I-\gamma P_{\pi_{K}})^{-1}\bigl[(P_{\pi^{*}})^{K-j}+(P_{\pi_{K}}P_{\widetilde{\pi}_{K-1}}\cdots P_{\widetilde{\pi}_{j+1}})\bigr],\quad\text{for}\quad 0\leq j\leq K-1,
 \displaystyle\mathcal{L}_{K}=\frac{1-\gamma}{2}\cdot(I-\gamma P_{\pi_{K}})^{-1}\bigl[(P_{\pi^{*}})^{K+1}+(P_{\pi_{K}}P_{\widetilde{\pi}_{K-1}}\cdots P_{\widetilde{\pi}_{0}})\bigr].

Then, by taking the absolute value on both sides of (5.13), we obtain that for any (s,a)\in{\mathcal{S}}\times\mathcal{A}

 \displaystyle\bigl|Q^{*}(s,a)-Q_{\pi_{K}}(s,a)\bigr|\leq\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^{2}}\cdot\biggl[\sum_{j=0}^{K-1}\alpha_{j}\cdot\bigl(\mathcal{L}_{j}|\eta_{j+1}+\varrho_{j+1}|\bigr)(s,a)+\alpha_{K}\cdot\bigl(\mathcal{L}_{K}|Q^{*}-\widetilde{Q}_{0}|\bigr)(s,a)\biggr]+|\xi_{K}(s,a)|, (5.15)

where the functions \mathcal{L}_{j}|\eta_{j+1}+\varrho_{j+1}| and \mathcal{L}_{K}|Q^{*}-\widetilde{Q}_{0}| are both defined over {\mathcal{S}}\times\mathcal{A}. The upper bound in (5.15) concludes the second step of the proof.

Step (iii): Now we establish the final step to complete the proof. In particular, we upper bound the weighted norm \|Q^{*}-Q_{\pi_{K}}\|_{\mu} for some probability distribution \mu\in\mathcal{P}({\mathcal{S}}\times\mathcal{A}), based on the point-wise bound on |Q^{*}-Q_{\pi_{K}}| from (5.15). For notational simplicity, we define \mu(f) to be the expectation of f under \mu, that is, \mu(f)=\int_{{\mathcal{S}}\times\mathcal{A}}f(s,a){\mathrm{d}}\mu(s,a). Squaring both sides of (5.15), we obtain

 \displaystyle\bigl|Q^{*}(s,a)-Q_{\pi_{K}}(s,a)\bigr|^{2}\leq 2\cdot\bigg[\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^{2}}\bigg]^{2}\cdot\biggl[\sum_{j=0}^{K-1}\alpha_{j}\cdot\bigl(\mathcal{L}_{j}|\eta_{j+1}+\varrho_{j+1}|\bigr)(s,a)+\alpha_{K}\cdot\bigl(\mathcal{L}_{K}|Q^{*}-\widetilde{Q}_{0}|\bigr)(s,a)\biggr]^{2}+2\cdot|\xi_{K}(s,a)|^{2}.

Then, by applying Jensen’s inequality twice, we arrive at

 \displaystyle\|Q^{*}-Q_{\pi_{K}}\|_{\mu}^{2}=\mu\big(|Q^{*}-Q_{\pi_{K}}|^{2}\big)\leq 2\cdot\bigg[\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^{2}}\bigg]^{2}\cdot\mu\biggl[\sum_{j=0}^{K-1}\alpha_{j}\cdot\bigl(\mathcal{L}_{j}|\eta_{j+1}+\varrho_{j+1}|^{2}\bigr)+\alpha_{K}\cdot\bigl(\mathcal{L}_{K}|Q^{*}-\widetilde{Q}_{0}|^{2}\bigr)\biggr]+2\cdot\mu\big(|\xi_{K}|^{2}\big), (5.16)

where we also use the fact that \sum_{j=0}^{K}\alpha_{j}=1 and for all j=0,\cdots,K, the linear operators \mathcal{L}_{j} are positive and satisfy \mathcal{L}_{j}\bm{1}=\bm{1}. Since both Q^{*} and \widetilde{Q}_{0} are bounded by Q_{\max}=R_{\max}/(1-\gamma) in absolute value, we have

 \displaystyle\mu\bigl[\mathcal{L}_{K}|Q^{*}-\widetilde{Q}_{0}|^{2}\bigr]\leq(2Q_{\max})^{2}. (5.17)

Also, by definition of the concentrability coefficients \kappa^{\text{MDP}} in §A, we have

 \displaystyle\mu(\mathcal{L}_{j})\leq(1-\gamma)\sum_{m\geq 0}\gamma^{m}\kappa^{\text{MDP}}(m+K-j)\nu, (5.18)

where recall that \nu is the distribution over {\mathcal{S}}\times\mathcal{A} from which the data \{(s_{t},a_{t})\}_{t=1,\cdots,T} in trajectory \mathcal{D} are sampled. Moreover, we can also bound \mu(|\xi_{K}|^{2}) by Jensen’s inequality as

 \displaystyle\mu\big(|\xi_{K}|^{2}\big)\leq\bigg(\frac{2\gamma}{1-\gamma}\bigg)^{2}\cdot\mu\biggl[\frac{1-\gamma}{2N}\cdot(I-\gamma\cdot P_{\pi_{K}})^{-1}\sum_{i\in\mathcal{N}}\big(P_{\widetilde{\pi}^{i}_{K}}+P_{\pi^{*}}\big)\big|\widetilde{Q}^{i}_{K}-\widetilde{Q}_{K}\big|\biggr]^{2}\leq\bigg(\frac{2\gamma}{1-\gamma}\bigg)^{2}\cdot\mu\biggl[\frac{1-\gamma}{2N}\cdot(I-\gamma\cdot P_{\pi_{K}})^{-1}\sum_{i\in\mathcal{N}}\big(P_{\widetilde{\pi}^{i}_{K}}+P_{\pi^{*}}\big)\big|\widetilde{Q}^{i}_{K}-\widetilde{Q}_{K}\big|^{2}\biggr]. (5.19)

By Assumption 4.4, we can further bound the right-hand side of (5.19) as

 \displaystyle\mu\big(|\xi_{K}|^{2}\big)\leq\bigg(\frac{2\gamma}{1-\gamma}\bigg)^{2}\cdot\frac{1}{N}\sum_{i\in\mathcal{N}}\bigl\|\widetilde{Q}^{i}_{K}-\widetilde{Q}_{K}\bigr\|^{2}_{\mu}\leq\bigg(\frac{2\gamma}{1-\gamma}\bigg)^{2}\cdot\frac{1}{N}\sum_{i\in\mathcal{N}}(\epsilon^{i}_{K})^{2}. (5.20)

Therefore, by plugging (5.17), (5.18), and (5.20) into (5.16), we obtain

 \displaystyle\|Q^{*}-Q_{\pi_{K}}\|_{\mu}^{2}\leq\bigg[\frac{4\gamma(1-\gamma^{K+1})}{\sqrt{2}(1-\gamma)^{2}}\bigg]^{2}\cdot\biggl[\sum_{j=0}^{K-1}\frac{(1-\gamma)^{2}}{1-\gamma^{K+1}}\cdot\sum_{m\geq 0}\gamma^{m+K-j-1}\kappa^{\text{MDP}}(m+K-j)\cdot\big\|\eta_{j+1}+\varrho_{j+1}\big\|^{2}_{\nu}+\frac{(1-\gamma)\gamma^{K}}{1-\gamma^{K+1}}\cdot(2Q_{\max})^{2}\biggr]+\bigg(\frac{2\sqrt{2}\gamma}{1-\gamma}\bigg)^{2}\cdot\frac{1}{N}\sum_{i\in\mathcal{N}}(\epsilon^{i}_{K})^{2}. (5.21)

Furthermore, from Assumption 4.3 and the definition of \phi_{\mu,\nu}^{\text{MDP}}, and letting \overline{\epsilon}_{K}=[N^{-1}\cdot\sum_{i\in\mathcal{N}}(\epsilon^{i}_{K})^{2}]^{1/2}, (5.21) further yields that

 \displaystyle\|Q^{*}-Q_{\pi_{K}}\|_{\mu}\leq\frac{4\gamma(1-\gamma^{K+1})}{\sqrt{2}(1-\gamma)^{2}}\cdot\biggl[\frac{\big(\phi^{\text{MDP}}_{\mu,\nu}\big)^{1/2}}{1-\gamma^{K+1}}\cdot(\|\eta\|_{\nu}+\|\varrho\|_{\nu})+\frac{\gamma^{K/2}}{1-\gamma^{K+1}}\cdot(2Q_{\max})\biggr]+\frac{2\sqrt{2}\gamma}{1-\gamma}\cdot\overline{\epsilon}_{K}
 \displaystyle\qquad\leq\frac{4\gamma\cdot\big(\phi^{\text{MDP}}_{\mu,\nu}\big)^{1/2}}{\sqrt{2}(1-\gamma)^{2}}\cdot(\|\eta\|_{\nu}+\|\varrho\|_{\nu})+\frac{4\sqrt{2}\cdot Q_{\max}}{(1-\gamma)^{2}}\cdot\gamma^{K/2}+\frac{2\sqrt{2}\gamma}{1-\gamma}\cdot\overline{\epsilon}_{K}, (5.22)

where we denote \|\eta\|_{\nu}=\max_{j=0,\cdots,K-1}\|\eta_{j+1}\|_{\nu} and \|\varrho\|_{\nu}=\max_{j=0,\cdots,K-1}\|\varrho_{j+1}\|_{\nu}. Recall that \eta_{j+1} is defined as \eta_{j+1}={\mathcal{T}}\widetilde{Q}_{j}-\widetilde{{\mathcal{T}}}\widetilde{\mathbf{Q}}_{j}, which can be further bounded by the one-step decentralized computation error from Assumption 4.4. Specifically, we have the following lemma regarding the difference between {\mathcal{T}}\widetilde{Q}_{j} and \widetilde{{\mathcal{T}}}\widetilde{\mathbf{Q}}_{j}.

###### Lemma 5.2.

Under Assumption 4.4, for any j=0,\cdots,K-1, it holds that \|\eta_{j+1}\|_{\nu}\leq\sqrt{2}\gamma\cdot\overline{\epsilon}_{j}, where \overline{\epsilon}_{j}=[N^{-1}\cdot\sum_{i\in\mathcal{N}}(\epsilon^{i}_{j})^{2}]^{1/2} and \epsilon^{i}_{j} is defined as in Assumption 4.4.

###### Proof.

By definition, we have

 \displaystyle|\eta_{j+1}(s,a)|=\Big|\big({\mathcal{T}}\widetilde{Q}_{j}\big)(s,a)-\big(\widetilde{{\mathcal{T}}}\widetilde{\mathbf{Q}}_{j}\big)(s,a)\Big|\leq\gamma\cdot\frac{1}{N}\sum_{i\in\mathcal{N}}\mathbb{E}_{s^{\prime}\sim P(\cdot{\,|\,}s,a)}\Bigl[\big|\max_{a^{\prime}\in\mathcal{A}}\widetilde{Q}_{j}(s^{\prime},a^{\prime})-\max_{a^{\prime}\in\mathcal{A}}\widetilde{Q}^{i}_{j}(s^{\prime},a^{\prime})\big|\Bigr].

Now we claim that for any s^{\prime}\in{\mathcal{S}}

 \displaystyle\big|\max_{a^{\prime}\in\mathcal{A}}\widetilde{Q}_{j}(s^{\prime},a^{\prime})-\max_{a^{\prime}\in\mathcal{A}}\widetilde{Q}^{i}_{j}(s^{\prime},a^{\prime})\big|\leq(1+C_{0})\cdot\epsilon^{i}_{j}, (5.23)

for any constant C_{0}>0. Suppose (5.23) does not hold; then either \max_{a^{\prime}\in\mathcal{A}}\widetilde{Q}^{i}_{j}(s^{\prime},a^{\prime})\geq\max_{a^{\prime}\in\mathcal{A}}\widetilde{Q}_{j}(s^{\prime},a^{\prime})+(1+C_{0})\cdot\epsilon^{i}_{j} or \max_{a^{\prime}\in\mathcal{A}}\widetilde{Q}^{i}_{j}(s^{\prime},a^{\prime})\leq\max_{a^{\prime}\in\mathcal{A}}\widetilde{Q}_{j}(s^{\prime},a^{\prime})-(1+C_{0})\cdot\epsilon^{i}_{j}. In the first case, let a^{\prime}_{*}\in\mathop{\mathrm{argmax}}_{a^{\prime}\in\mathcal{A}}\widetilde{Q}^{i}_{j}(s^{\prime},a^{\prime}); then by Assumption 4.4, the values of \widetilde{Q}_{j}(s^{\prime},\cdot) and \widetilde{Q}^{i}_{j}(s^{\prime},\cdot) are close at a^{\prime}_{*} up to the small error \epsilon^{i}_{j}, i.e., \widetilde{Q}_{j}(s^{\prime},a^{\prime}_{*})\geq\widetilde{Q}^{i}_{j}(s^{\prime},a^{\prime}_{*})-\epsilon^{i}_{j}. This implies that \widetilde{Q}_{j}(s^{\prime},a^{\prime}_{*})\geq\max_{a^{\prime}\in\mathcal{A}}\widetilde{Q}_{j}(s^{\prime},a^{\prime})+C_{0}\cdot\epsilon^{i}_{j}, which cannot hold since \max_{a^{\prime}\in\mathcal{A}}\widetilde{Q}_{j}(s^{\prime},a^{\prime})\geq\widetilde{Q}_{j}(s^{\prime},a^{\prime}_{*}). Similarly, one can show that the second case cannot occur. Thus, the claim (5.23) is proved. Letting C_{0}=\sqrt{2}-1, we obtain that

 \displaystyle|\eta_{j+1}(s,a)|^{2}\leq\gamma^{2}\bigg(\frac{1}{N}\sum_{i\in\mathcal{N}}\sqrt{2}\epsilon^{i}_{j}\bigg)^{2}\leq\big(\sqrt{2}\gamma\big)^{2}\cdot\frac{1}{N}\sum_{i\in\mathcal{N}}\big(\epsilon^{i}_{j}\big)^{2},

where the second inequality follows from Jensen’s inequality. Taking expectation over \nu, we obtain the desired bound. ∎
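The case analysis above amounts to the standard fact that the max operator is non-expansive in the sup-norm, i.e., |\max_{a^{\prime}}\widetilde{Q}_{j}-\max_{a^{\prime}}\widetilde{Q}^{i}_{j}|\leq\|\widetilde{Q}_{j}-\widetilde{Q}^{i}_{j}\|_{\infty}. A quick numerical check with synthetic Q-values (the sizes and tolerance below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
Q = rng.normal(size=100)                   # Q_j(s', .) over 100 actions
eps = 0.05
Qi = Q + rng.uniform(-eps, eps, size=100)  # agent i's copy, off by at most eps

# Maxima of two uniformly eps-close functions are eps-close, so in particular
# any (1 + C_0) * eps bound as in (5.23) holds.
assert abs(Q.max() - Qi.max()) <= eps
```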

From Lemma 5.2, we can further simplify (5.22) to obtain the desired bound in Theorem 4.6, which concludes the proof. ∎

### 5.2 Proof of Theorem 4.7

###### Proof.

The proof integrates ideas from the proofs in Munos and Szepesvári (2008) and Antos et al. (2008b). First, for any fixed \mathbf{Q}=[Q^{i}]_{i\in\mathcal{N}} and f, we define

 \displaystyle d(\mathbf{Q})(s,a,s^{\prime})=\overline{r}(s,a)+\gamma\cdot\frac{1}{N}\sum_{i\in\mathcal{N}}\max_{a^{\prime}\in\mathcal{A}}Q^{i}(s^{\prime},a^{\prime}),\qquad\ell_{f,\mathbf{Q}}(s,a,s^{\prime})=[d(\mathbf{Q})(s,a,s^{\prime})-f(s,a)]^{2},

where \overline{r}(s,a)=N^{-1}\cdot\sum_{i\in\mathcal{N}}r^{i}(s,a) with r^{i}(s,a)\sim R^{i}(s,a). We also define \mathcal{L}_{\mathcal{H}}=\{\ell_{f,\mathbf{Q}}:f\in\mathcal{H},\mathbf{Q}\in\mathcal{H}^{N}\}. For convenience, we denote d_{t}(\mathbf{Q})=d(\mathbf{Q})(s_{t},a_{t},s_{t+1}), for any data (s_{t},a_{t},s_{t+1}) drawn from \mathcal{D}. Then, we define \widehat{L}_{T}(f;\mathbf{Q}) and {L}(f;\mathbf{Q}) as

 \displaystyle\widehat{L}_{T}(f;\mathbf{Q})=\frac{1}{T}\sum_{t=1}^{T}\bigl[d_{t}(\mathbf{Q})-f(s_{t},a_{t})\bigr]^{2}=\frac{1}{T}\sum_{t=1}^{T}\ell_{f,\mathbf{Q}}(s_{t},a_{t},s_{t+1}),\qquad{L}(f;\mathbf{Q})=\|f-\widetilde{{\mathcal{T}}}\mathbf{Q}\|^{2}_{\nu}+\mathbb{E}_{\nu}\{\operatorname{{\rm Var}}[d_{1}(\mathbf{Q}){\,|\,}s_{1},a_{1}]\}. (5.24)

Obviously {L}(f;\mathbf{Q})=\mathbb{E}[\widehat{L}_{T}(f;\mathbf{Q})], and \mathop{\mathrm{argmin}}_{f\in\mathcal{H}}\widehat{L}_{T}(f;\mathbf{Q}) is exactly the minimizer of the fitting objective defined in (4.3). Also, note that the second term on the right-hand side of (5.24) does not depend on f; thus \mathop{\mathrm{argmin}}_{f\in\mathcal{H}}{L}(f;\mathbf{Q})=\mathop{\mathrm{argmin}}_{f\in\mathcal{H}}\|f-\widetilde{{\mathcal{T}}}\mathbf{Q}\|^{2}_{\nu}. Letting f^{\prime}\in\mathop{\mathrm{argmin}}_{f\in\mathcal{H}}\widehat{L}_{T}(f;\mathbf{Q}), we have

 \displaystyle\|f^{\prime}-\widetilde{{\mathcal{T}}}\mathbf{Q}\|^{2}_{\nu}-\inf_{f\in\mathcal{H}}\|f-\widetilde{{\mathcal{T}}}\mathbf{Q}\|^{2}_{\nu}=L(f^{\prime};\mathbf{Q})-\widehat{L}_{T}(f^{\prime};\mathbf{Q})+\widehat{L}_{T}(f^{\prime};\mathbf{Q})-\inf_{f\in\mathcal{H}}L(f;\mathbf{Q})
 \displaystyle\qquad\leq|\widehat{L}_{T}(f^{\prime};\mathbf{Q})-L(f^{\prime};\mathbf{Q})|+\inf_{f\in\mathcal{H}}\widehat{L}_{T}(f;\mathbf{Q})-\inf_{f\in\mathcal{H}}L(f;\mathbf{Q})\leq 2\sup_{f\in\mathcal{H}}|\widehat{L}_{T}(f;\mathbf{Q})-L(f;\mathbf{Q})|
 \displaystyle\qquad\leq 2\sup_{f\in\mathcal{H},\mathbf{Q}\in\mathcal{H}^{N}}|\widehat{L}_{T}(f;\mathbf{Q})-L(f;\mathbf{Q})|=2\sup_{\ell_{f,\mathbf{Q}}\in\mathcal{L}_{\mathcal{H}}}\bigg|\frac{1}{T}\sum_{t=1}^{T}\ell_{f,\mathbf{Q}}(Z_{t})-\mathbb{E}[\ell_{f,\mathbf{Q}}(Z_{1})]\bigg|, (5.25)

where we use the definition of f^{\prime} and \ell_{f,\mathbf{Q}}, and let Z_{t}=(s_{t},\{a^{i}_{t}\}_{i\in\mathcal{N}},s_{t+1}) for notational convenience.
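The identity {L}(f;\mathbf{Q})=\mathbb{E}[\widehat{L}_{T}(f;\mathbf{Q})] used above is the bias-variance decomposition of the squared loss in (5.24); a quick Monte Carlo check for a single (s,a) pair, with illustrative values of the mean, variance, and predictor:

```python
import numpy as np

rng = np.random.default_rng(3)
m, v, f = 2.0, 0.25, 1.5  # E[d | s, a], Var[d | s, a], and the predictor value
d = m + np.sqrt(v) * rng.normal(size=200_000)  # noisy targets d_t(Q)
empirical = np.mean((d - f) ** 2)              # plays the role of L_hat_T
# Expected squared loss = squared bias + conditional variance, as in (5.24).
assert np.isclose(empirical, (m - f) ** 2 + v, atol=1e-2)
```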

In addition, we define two constants C_{1} and C_{2} as

 \displaystyle C_{1}=16\cdot e^{N+1}(V_{\mathcal{H}^{+}}+1)^{N+1}A^{NV_{\mathcal{H}^{+}}}Q_{\max}^{(N+1)V_{\mathcal{H}^{+}}}\big[64\cdot e\widetilde{R}_{\max}(1+\gamma)\big]^{(N+1)V_{\mathcal{H}^{+}}}, (5.26)

and C_{2}=1/(2048\cdot\widetilde{R}^{4}_{\max}), and also define \Lambda_{T}(\delta) and \epsilon as

 \displaystyle\Lambda_{T}(\delta)=\frac{V}{2}\log(T)+\log\Big(\frac{e}{\delta}\Big)+\log^{+}\big(C_{1}C_{2}^{V/2}\vee\overline{\beta}\big),\qquad\epsilon=\sqrt{\frac{\Lambda_{T}(\delta)[\Lambda_{T}(\delta)/b\vee 1]^{1/\zeta}}{C_{2}T}}, (5.27)

where V=(N+1)V_{\mathcal{H}^{+}} and \log^{+}(x)=\max\{\log(x),0\}. Let

 \displaystyle P_{0}=\mathbb{P}\bigg(\sup_{\ell_{f,\mathbf{Q}}\in\mathcal{L}_{\mathcal{H}}}\bigg|\frac{1}{T}\sum_{t=1}^{T}\ell_{f,\mathbf{Q}}(Z_{t})-\mathbb{E}[\ell_{f,\mathbf{Q}}(Z_{1})]\bigg|>\frac{\epsilon}{2}\bigg). (5.28)

Then, from (5.25), it suffices to show P_{0}<\delta in order to conclude the proof. To this end, we use the same technique as in Antos et al. (2008b), which splits the T samples in \mathcal{D} into 2m_{T} blocks that come in pairs, with each block containing k_{T} samples, i.e., T=2m_{T}k_{T}. Then, we introduce m_{T} blocks of “ghost” samples, H_{1},H_{2},\cdots,H_{m_{T}}, where each block has the same marginal distribution as every second block in \mathcal{D}, but these new m_{T} blocks are independent of one another. We let H=\bigcup_{i=1}^{m_{T}}H_{i}. Recall that \widetilde{R}_{\max}=(1+\gamma)Q_{\max}+R_{\max}; then for any f\in\mathcal{H} and \mathbf{Q}\in\mathcal{H}^{N}, \ell_{f,\mathbf{Q}} has absolute value bounded by \widetilde{R}_{\max}^{2}. Thus, we can apply an extended version of Pollard’s tail inequality for \beta-mixing sequences (Lemma 5 in Antos et al. (2008b)) to obtain that

 \displaystyle\mathbb{P}\bigg(\sup_{\ell_{f,\mathbf{Q}}\in\mathcal{L}_{\mathcal{H}}}\bigg|\frac{1}{T}\sum_{t=1}^{T}\ell_{f,\mathbf{Q}}(Z_{t})-\mathbb{E}[\ell_{f,\mathbf{Q}}(Z_{1})]\bigg|>\frac{\epsilon}{2}\bigg)\leq 16\cdot\mathbb{E}\big[\mathcal{N}_{1}\big(\epsilon/16,\mathcal{L}_{\mathcal{H}},(Z^{\prime}_{t};t\in H)\big)\big]e^{-\frac{m_{T}}{2}\cdot\frac{\epsilon^{2}}{(16\widetilde{R}_{\max}^{2})^{2}}}+2m_{T}\beta_{k_{T}}, (5.29)

where \{\beta_{m}\} denote the mixing coefficients of the sequence Z_{1},\cdots,Z_{T} in \mathcal{D}, and \mathcal{N}_{1}(\epsilon/16,\mathcal{L}_{\mathcal{H}},(Z^{\prime}_{t};t\in H)) is the empirical covering number (see the formal definition in §A) of the function class \mathcal{L}_{\mathcal{H}} evaluated on the ghost samples (Z^{\prime}_{t};t\in H).
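The blocking construction used above can be sketched as follows (a minimal illustration of how the 2m_{T} contiguous blocks are formed and every second one is retained; the values of T and k_{T} are illustrative):

```python
def ghost_blocks(T_total, k_T):
    """Split indices 0..T-1 into 2*m_T contiguous blocks of length k_T and keep
    every second one: these are the blocks replaced by independent "ghost"
    copies with matching marginals in the argument above."""
    m_T = T_total // (2 * k_T)
    blocks = [list(range(b * k_T, (b + 1) * k_T)) for b in range(2 * m_T)]
    return blocks[::2]  # index sets for H_1, ..., H_{m_T}

H = ghost_blocks(T_total=24, k_T=3)
assert len(H) == 4 and H[0] == [0, 1, 2] and H[1] == [6, 7, 8]
```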

To bound the empirical covering number \mathcal{N}_{1}(\epsilon/16,\mathcal{L}_{\mathcal{H}},(Z^{\prime}_{t};t\in H)), we establish the following technical lemma.

###### Lemma 5.3.

Let Z^{1:T}=(Z_{1},\cdots,Z_{T}), with Z_{t}=(s_{t},\{a^{i}_{t}\}_{i\in\mathcal{N}},s_{t+1}). Recall that \widetilde{R}_{\max}=(1+\gamma)Q_{\max}+R_{\max}, and A=|\mathcal{A}| is the cardinality of the joint action set. Then, under Assumption 4.1, it holds that

 \displaystyle\mathcal{N}_{1}(\epsilon,\mathcal{L}_{\mathcal{H}},Z^{1:T})\leq e^{N+1}(V_{\mathcal{H}^{+}}+1)^{N+1}A^{NV_{\mathcal{H}^{+}}}Q_{\max}^{(N+1)V_{\mathcal{H}^{+}}}\bigg(\frac{4e\widetilde{R}_{\max}(1+\gamma)}{\epsilon}\bigg)^{(N+1)V_{\mathcal{H}^{+}}}.
###### Proof.

For any \ell_{f,\mathbf{Q}} and \ell_{\widetilde{f},\widetilde{\mathbf{Q}}} in \mathcal{L}_{\mathcal{H}}, the empirical \ell^{1}-distance between them can be bounded as

 \displaystyle\frac{1}{T}\sum_{t=1}^{T}\big|\ell_{f,\mathbf{Q}}(Z_{t})-\ell_{\widetilde{f},\widetilde{\mathbf{Q}}}(Z_{t})\big|\leq 2\widetilde{R}_{\max}\cdot\biggl[\frac{1}{T}\sum_{t=1}^{T}\big|f(s_{t},a_{t})-\widetilde{f}(s_{t},a_{t})\big|+\gamma\cdot\frac{1}{T}\sum_{t=1}^{T}\bigg|\frac{1}{N}\sum_{i\in\mathcal{N}}\max_{a^{\prime}\in\mathcal{A}}Q^{i}(s_{t+1},a^{\prime})-\frac{1}{N}\sum_{i\in\mathcal{N}}\max_{a^{\prime}\in\mathcal{A}}\widetilde{Q}^{i}(s_{t+1},a^{\prime})\bigg|\biggr]. (5.30)

Let \mathcal{D}_{Z}=[(s_{1},\{a^{i}_{1}\}_{i\in\mathcal{N}}),\cdots,(s_{T},\{a^{i}_{T}\}_{i\in\mathcal{N}})] and y_{Z}=(s_{2},\cdots,s_{T+1}); then the first term in the bracket of (5.30) is the \mathcal{D}_{Z}-based \ell^{1}-distance of functions in \mathcal{H}, while the second term is the y_{Z}-based \ell^{1}-distance (times \gamma) of functions in the set \mathcal{H}^{\vee}_{N}=\{V:V(\cdot)=N^{-1}\cdot\sum_{i\in\mathcal{N}}\max_{a^{\prime}\in\mathcal{A}}Q^{i}(\cdot,a^{\prime})\text{~{}for some~{}}\mathbf{Q}\in\mathcal{H}^{N}\}. This implies that

 \displaystyle\mathcal{N}_{1}\big{(}2\widetilde{R}_{\max}(1+\gamma)\epsilon,\mathcal{L}_{\mathcal{H}},Z^{1:T}\big{)}\leq\mathcal{N}_{1}(\epsilon,\mathcal{H}^{\vee}_{N},y_{Z})\cdot\mathcal{N}_{1}(\epsilon,\mathcal{H},\mathcal{D}_{Z}). (5.31)
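To make the step from (5.30) to (5.31) explicit, suppose \{f_{m}\} is an \epsilon-cover of \mathcal{H} on \mathcal{D}_{Z} and \{V_{n}\} is an \epsilon-cover of \mathcal{H}^{\vee}_{N} on y_{Z}; writing \ell_{m,n} (a shorthand introduced here, not notation from the text) for the loss built from the nearest f_{m} and V_{n}, the bound (5.30) gives

```latex
\frac{1}{T}\sum_{t=1}^{T}\big|\ell_{f,\mathbf{Q}}(Z_{t})-\ell_{m,n}(Z_{t})\big|
\leq 2\widetilde{R}_{\max}\big(\epsilon+\gamma\cdot\epsilon\big)
= 2\widetilde{R}_{\max}(1+\gamma)\,\epsilon,
```

so the products of the two covers form a 2\widetilde{R}_{\max}(1+\gamma)\epsilon-cover of \mathcal{L}_{\mathcal{H}}, and the covering numbers multiply as in (5.31).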

The first empirical covering number \mathcal{N}_{1}(\epsilon,\mathcal{H}^{\vee}_{N},y_{Z}) can be further bounded by the following lemma, whose proof is deferred to §B.

###### Lemma 5.4.

For any fixed y_{Z}=(y_{1},\cdots,y_{T}), let \mathcal{D}_{y}=\{(y_{t},a_{j})\}_{t\in[T],j\in[A]}, where recall that A=|\mathcal{A}| and \mathcal{A}=\{a_{1},\cdots,a_{A}\}. Then, under Assumption 4.1, it holds that

 \displaystyle\mathcal{N}_{1}(\epsilon,\mathcal{H}^{\vee}_{N},y_{Z})\leq\big{[}\mathcal{N}_{1}({\epsilon}/{A},\mathcal{H},\mathcal{D}_{y})\big{]}^{N}\leq\bigg{[}e(V_{\mathcal{H}^{+}}+1)\bigg{(}\frac{2eQ_{\max}A}{\epsilon}\bigg{)}^{V_{\mathcal{H}^{+}}}\bigg{]}^{N}.

In addition, the second empirical covering number \mathcal{N}_{1}(\epsilon,\mathcal{H},\mathcal{D}_{Z}) in (5.31) can be bounded directly by Corollary 3 in Haussler (1995) (see also Proposition B.2 in §B). Combined with the bound from Lemma 5.4, we finally obtain

 \displaystyle\mathcal{N}_{1}\big{(}2\widetilde{R}_{\max}(1+\gamma)\epsilon,\mathcal{L}_{\mathcal{H}},Z^{1:T}\big{)}\leq e^{N+1}(V_{\mathcal{H}^{+}}+1)^{N+1}A^{NV_{\mathcal{H}^{+}}}Q_{\max}^{(N+1)V_{\mathcal{H}^{+}}}\bigg{(}\frac{2e}{\epsilon}\bigg{)}^{(N+1)V_{\mathcal{H}^{+}}}.

Replacing 2\widetilde{R}_{\max}(1+\gamma)\epsilon by \epsilon in the bound above, we arrive at the desired bound and complete the proof. ∎

### 5.3 Proof of Theorem 4.9

###### Proof.

The proof is similar to the proof of Theorem 4.6. For brevity, we will only emphasize the differences between them. We first define two quantities \varrho^{1}_{k} and \eta^{1}_{k} as follows

 \displaystyle\eta^{1}_{k+1}={\mathcal{T}}\widetilde{Q}^{1}_{k}-\widetilde{{\mathcal{T}}}\widetilde{\mathbf{Q}}^{1}_{k},\qquad\varrho^{1}_{k+1}=\widetilde{{\mathcal{T}}}\widetilde{\mathbf{Q}}^{1}_{k}-\widetilde{Q}^{1}_{k+1}, (5.33)

where recall that at iteration k, \widetilde{Q}^{1}_{k} is the exact minimizer of the least-squares problem (3.3) among f^{1}\in\mathcal{F}^{1}, and \widetilde{\mathbf{Q}}^{1}_{k}=[\widetilde{Q}^{1,i}_{k}]_{i\in\mathcal{N}} collects the outputs of the Q-function estimators at all agents in Team 1 from Algorithm 3. Recall from (2.4) that \widetilde{{\mathcal{T}}}\widetilde{\mathbf{Q}}^{1}_{k} here has the form

 \displaystyle(\widetilde{{\mathcal{T}}}\widetilde{\mathbf{Q}}^{1}_{k})(s,a,b)=\overline{r}(s,a,b)+\gamma\cdot\mathbb{E}_{s^{\prime}\sim P(\cdot{\,|\,}s,a,b)}\biggl{[}\frac{1}{N}\sum_{i=1}^{N}\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}\big{[}\widetilde{Q}^{1,i}_{k}(s^{\prime},a^{\prime},b^{\prime})\big{]}\biggr{]}.

The term \varrho^{1}_{k} captures the approximation error of the first fitting problem in (3.3), which can be characterized using tools from nonparametric regression. The term \eta^{1}_{k}, on the other hand, captures the computational error of the decentralized optimization algorithm after a finite number of updates. Also, we denote by \pi_{k} the average equilibrium policy obtained from the estimator vector \widetilde{\mathbf{Q}}^{1}_{k}, i.e., \pi_{k}=\mathcal{E}^{1}(\widetilde{\mathbf{Q}}^{1}_{k}). For notational convenience, we also introduce several different minimizer policies, \widetilde{\sigma}_{k}^{i}, \widetilde{\sigma}_{k}^{i,*}, \widetilde{\sigma}_{k}, \widetilde{\sigma}^{*}_{k}, \sigma^{*}_{k}, \widehat{\sigma}^{i}_{k}, and \overline{\sigma}_{k}, which satisfy

We also separate the proof into three main steps, similar to the procedure in the proof of Theorem 4.6.

Step (i): The first step is to establish a recursion between the errors of the exact minimizers of the least-squares problem (3.3) with respect to Q^{*}, the minimax Q-function of the game. The error Q^{*}-\widetilde{Q}^{1}_{k+1} can be written as

 \displaystyle Q^{*}-\widetilde{Q}^{1}_{k+1}=\big{(}Q^{*}-{\mathcal{T}}_{\pi^{*}}\widetilde{Q}^{1}_{k}\big{)}+\big{(}{\mathcal{T}}_{\pi^{*}}\widetilde{Q}^{1}_{k}-{\mathcal{T}}\widetilde{Q}^{1}_{k}\big{)}+\eta^{1}_{k+1}+\varrho^{1}_{k+1}, (5.35)

where we denote by \pi^{*} the equilibrium policy with respect to Q^{*}. Then, by the definition of {\mathcal{T}}_{\pi^{*}} and {\mathcal{T}} in zero-sum Markov games, we have {\mathcal{T}}_{\pi^{*}}\widetilde{Q}^{1}_{k}\leq{\mathcal{T}}\widetilde{Q}^{1}_{k}. Also, from (5.34) and by the relation between {\mathcal{T}}_{\pi,\sigma} and P_{\pi,\sigma} for any (\pi,\sigma), it holds that

 \displaystyle Q^{*}-{\mathcal{T}}_{\pi^{*}}\widetilde{Q}^{1}_{k}={\mathcal{T}}_{\pi^{*}}Q^{*}-{\mathcal{T}}_{\pi^{*}}\widetilde{Q}^{1}_{k}={\mathcal{T}}_{\pi^{*},\sigma^{*}}Q^{*}-{\mathcal{T}}_{\pi^{*},\sigma^{*}_{k}}\widetilde{Q}^{1}_{k}\leq{\mathcal{T}}_{\pi^{*},\sigma^{*}_{k}}Q^{*}-{\mathcal{T}}_{\pi^{*},\sigma^{*}_{k}}\widetilde{Q}^{1}_{k}.

Thus, (5.35) can be upper bounded by

 \displaystyle Q^{*}-\widetilde{Q}^{1}_{k+1}\leq\gamma\cdot P_{\pi^{*},\sigma^{*}_{k}}(Q^{*}-\widetilde{Q}^{1}_{k})+\eta^{1}_{k+1}+\varrho^{1}_{k+1}. (5.36)

Moreover, we can also establish a lower bound for Q^{*}-\widetilde{Q}^{1}_{k+1}. Note that

 \displaystyle Q^{*}-\widetilde{Q}^{1}_{k+1}=\big{(}Q^{*}-{\mathcal{T}}_{\widetilde{\pi}_{k}}Q^{*}\big{)}+\big{(}{\mathcal{T}}_{\widetilde{\pi}_{k}}Q^{*}-{\mathcal{T}}\widetilde{Q}^{1}_{k}\big{)}+\eta^{1}_{k+1}+\varrho^{1}_{k+1},

where \widetilde{\pi}_{k} is the \max\min policy with respect to \widetilde{Q}^{1}_{k}. Since Q^{*}={\mathcal{T}}Q^{*}\geq{\mathcal{T}}_{\widetilde{\pi}_{k}}Q^{*} and {\mathcal{T}}_{\widetilde{\pi}_{k}}Q^{*}-{\mathcal{T}}_{\widetilde{\pi}_{k}}\widetilde{Q}^{1}_{k}={\mathcal{T}}_{\widetilde{\pi}_{k},\widetilde{\sigma}_{k}^{*}}Q^{*}-{\mathcal{T}}_{\widetilde{\pi}_{k},\widetilde{\sigma}_{k}}\widetilde{Q}^{1}_{k}\geq{\mathcal{T}}_{\widetilde{\pi}_{k},\widetilde{\sigma}_{k}^{*}}Q^{*}-{\mathcal{T}}_{\widetilde{\pi}_{k},\widetilde{\sigma}_{k}^{*}}\widetilde{Q}^{1}_{k}, it follows that

 \displaystyle Q^{*}-\widetilde{Q}^{1}_{k+1}\geq\gamma\cdot P_{\widetilde{\pi}_{k},\widetilde{\sigma}_{k}^{*}}(Q^{*}-\widetilde{Q}^{1}_{k})+\eta^{1}_{k+1}+\varrho^{1}_{k+1}. (5.37)

Thus, using the notation in (5.34) and combining (5.36) and (5.37), we obtain

 \displaystyle\gamma\cdot P_{\widetilde{\pi}_{k},\widetilde{\sigma}_{k}^{*}}(Q^{*}-\widetilde{Q}^{1}_{k})+\eta^{1}_{k+1}+\varrho^{1}_{k+1}\leq Q^{*}-\widetilde{Q}^{1}_{k+1}\leq\gamma\cdot P_{\pi^{*},\sigma^{*}_{k}}(Q^{*}-\widetilde{Q}^{1}_{k})+\eta^{1}_{k+1}+\varrho^{1}_{k+1}, (5.38)

from which we obtain the following multi-step error propagation bounds and conclude the first step of our proof.

###### Lemma 5.5 (Multi-step Error Propagation in Competitive MARL).

For any k,\ell\in\{0,1,\ldots,K\} with k<\ell, we have

 \displaystyle\sum_{j=k}^{\ell-1}\gamma^{\ell-1-j}\cdot\big{(}P_{\widetilde{\pi}_{\ell-1},\widetilde{\sigma}^{*}_{\ell-1}}\cdots P_{\widetilde{\pi}_{j+1},\widetilde{\sigma}^{*}_{j+1}}\big{)}(\eta^{1}_{j+1}+\varrho^{1}_{j+1})+\gamma^{\ell-k}\cdot\big{(}P_{\widetilde{\pi}_{\ell-1},\widetilde{\sigma}^{*}_{\ell-1}}\cdots P_{\widetilde{\pi}_{k},\widetilde{\sigma}^{*}_{k}}\big{)}(Q^{*}-\widetilde{Q}^{1}_{k})\leq Q^{*}-\widetilde{Q}^{1}_{\ell}\leq\sum_{j=k}^{\ell-1}\gamma^{\ell-1-j}\cdot\big{(}P_{\pi^{*},\sigma^{*}_{\ell-1}}\cdots P_{\pi^{*},\sigma^{*}_{j+1}}\big{)}(\eta^{1}_{j+1}+\varrho^{1}_{j+1})+\gamma^{\ell-k}\cdot\big{(}P_{\pi^{*},\sigma^{*}_{\ell-1}}\cdots P_{\pi^{*},\sigma^{*}_{k}}\big{)}(Q^{*}-\widetilde{Q}^{1}_{k}),

where \varrho^{1}_{j+1} and \eta^{1}_{j+1} are defined in (5.33), and we use P_{\pi,\sigma}P_{\pi^{\prime},\sigma^{\prime}} to denote the composition of operators.

###### Proof.

By the linearity of the operator P_{\pi,\sigma}, we can obtain the desired results by applying the inequalities in (5.38) multiple times. ∎
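For concreteness (a sketch, not part of the original text), unrolling the upper inequality in (5.38) twice, from iteration k to k+2, reads

```latex
Q^{*}-\widetilde{Q}^{1}_{k+2}
\leq \gamma\cdot P_{\pi^{*},\sigma^{*}_{k+1}}\big(Q^{*}-\widetilde{Q}^{1}_{k+1}\big)
     +\eta^{1}_{k+2}+\varrho^{1}_{k+2}
\leq \gamma^{2}\cdot P_{\pi^{*},\sigma^{*}_{k+1}}P_{\pi^{*},\sigma^{*}_{k}}\big(Q^{*}-\widetilde{Q}^{1}_{k}\big)
     +\gamma\cdot P_{\pi^{*},\sigma^{*}_{k+1}}\big(\eta^{1}_{k+1}+\varrho^{1}_{k+1}\big)
     +\eta^{1}_{k+2}+\varrho^{1}_{k+2},
```

where the second step uses the linearity and positivity of P_{\pi^{*},\sigma^{*}_{k+1}}; iterating \ell-k times gives the multi-step bounds of the lemma.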

Step (ii): Now we quantify the sub-optimality of the output policy at iteration k of Algorithm 3 for Team 1, i.e., the error Q^{*}-Q_{\pi_{k}}. Note that Q_{\pi_{k}} here represents the action-value when the maximizer Team 1 plays \pi_{k} and the minimizer Team 2 plays the optimal counter-policy against \pi_{k}. As argued in Perolat et al. (2015), this is a natural measure of the quality of the policy \pi_{k}. The error Q^{*}-Q_{\pi_{k}} can be separated as

 \displaystyle Q^{*}-Q_{\pi_{k}}={\mathcal{T}}Q^{*}-{\mathcal{T}}_{\pi_{k}}Q_{\pi_{k}}=\bigg{(}{\mathcal{T}}_{\pi^{*}}Q^{*}-\frac{1}{N}\sum_{i\in\mathcal{N}}{\mathcal{T}}_{\pi^{*}}\widetilde{Q}^{1,i}_{k}\bigg{)}+\frac{1}{N}\sum_{i\in\mathcal{N}}\big{(}{\mathcal{T}}_{\pi^{*}}-{\mathcal{T}}_{\widetilde{\pi}^{i}_{k}}\big{)}\widetilde{Q}^{1,i}_{k}+\bigg{(}\frac{1}{N}\sum_{i\in\mathcal{N}}{\mathcal{T}}_{\widetilde{\pi}^{i}_{k}}\widetilde{Q}^{1,i}_{k}-{\mathcal{T}}_{\pi_{k}}\widetilde{Q}^{1}_{k}\bigg{)}+\big{(}{\mathcal{T}}_{\pi_{k}}\widetilde{Q}^{1}_{k}-{\mathcal{T}}_{\pi_{k}}Q_{\pi_{k}}\big{)}. (5.39)

Now we bound the four terms on the right-hand side of (5.39) as follows. First, by definition of \widetilde{\pi}^{i}_{k}, we have

 \displaystyle{\mathcal{T}}_{\pi^{*}}\widetilde{Q}^{1,i}_{k}-{\mathcal{T}}_{\widetilde{\pi}^{i}_{k}}\widetilde{Q}^{1,i}_{k}={\mathcal{T}}_{\pi^{*}}\widetilde{Q}^{1,i}_{k}-{\mathcal{T}}\widetilde{Q}^{1,i}_{k}\leq 0,\quad\text{for~{}all~{}}i\in\mathcal{N}. (5.40)

Moreover, since \pi_{k}=\mathcal{E}^{1}(\widetilde{\mathbf{Q}}^{1}_{k}), by definition, {\mathcal{T}}_{\pi_{k}}Q={N}^{-1}\sum_{i\in\mathcal{N}}{\mathcal{T}}_{\widetilde{\pi}^{i}_{k}}Q for any Q, where \widetilde{\pi}^{i}_{k} is the equilibrium policy of Team 1 with respect to \widetilde{Q}^{1,i}_{k}. Thus, we have from (5.34) that

 \displaystyle{\mathcal{T}}_{\pi^{*}}Q^{*}-{\mathcal{T}}_{\pi^{*}}\widetilde{Q}^{1,i}_{k}\leq\gamma\cdot P_{\pi^{*},\sigma^{i,*}_{k}}(Q^{*}-\widetilde{Q}^{1,i}_{k}),\quad{\mathcal{T}}_{\widetilde{\pi}^{i}_{k}}\widetilde{Q}^{1,i}_{k}-{\mathcal{T}}_{\widetilde{\pi}^{i}_{k}}\widetilde{Q}^{1}_{k}\leq\gamma\cdot P_{\widetilde{\pi}^{i}_{k},\widehat{\sigma}^{i}_{k}}(\widetilde{Q}^{1,i}_{k}-\widetilde{Q}^{1}_{k}), (5.41)
 \displaystyle{\mathcal{T}}_{\pi_{k}}\widetilde{Q}^{1}_{k}-{\mathcal{T}}_{\pi_{k}}Q_{\pi_{k}}\leq\gamma\cdot P_{\pi_{k},\overline{\sigma}_{k}}(\widetilde{Q}^{1}_{k}-Q_{\pi_{k}}), (5.42)

where the inequalities follow from the fact that {\mathcal{T}}_{\pi^{*}}Q^{*}\leq{\mathcal{T}}_{\pi^{*},\sigma}Q^{*}, {\mathcal{T}}_{\widetilde{\pi}^{i}_{k}}\widetilde{Q}^{1,i}_{k}\leq{\mathcal{T}}_{\widetilde{\pi}^{i}_{k},\sigma}\widetilde{Q}^{1,i}_{k}, and {\mathcal{T}}_{\pi_{k}}\widetilde{Q}^{1}_{k}\leq{\mathcal{T}}_{\pi_{k},\sigma}\widetilde{Q}^{1}_{k} for any \sigma\in\mathcal{P}(\mathcal{B}).

By substituting (5.40), (5.41), and (5.42) into (5.39), we obtain

 \displaystyle Q^{*}-Q_{\pi_{k}}\leq\frac{\gamma}{N}\sum_{i\in\mathcal{N}}P_{\pi^{*},\sigma^{i,*}_{k}}(Q^{*}-\widetilde{Q}^{1,i}_{k})+\frac{\gamma}{N}\sum_{i\in\mathcal{N}}P_{\widetilde{\pi}^{i}_{k},\widehat{\sigma}^{i}_{k}}(\widetilde{Q}^{1,i}_{k}-\widetilde{Q}^{1}_{k})+\gamma\cdot P_{\pi_{k},\overline{\sigma}_{k}}(\widetilde{Q}^{1}_{k}-Q_{\pi_{k}})
 \displaystyle\qquad=\frac{\gamma}{N}\sum_{i\in\mathcal{N}}(P_{\pi^{*},\sigma^{i,*}_{k}}-P_{\pi_{k},\overline{\sigma}_{k}})(Q^{*}-\widetilde{Q}^{1}_{k})+\gamma\cdot P_{\pi_{k},\overline{\sigma}_{k}}(Q^{*}-Q_{\pi_{k}})+\frac{\gamma}{N}\sum_{i\in\mathcal{N}}\big{(}P_{\widetilde{\pi}^{i}_{k},\widehat{\sigma}^{i}_{k}}-P_{\pi^{*},\sigma^{i,*}_{k}}\big{)}(\widetilde{Q}^{1,i}_{k}-\widetilde{Q}^{1}_{k}).

Since I-\gamma\cdot P_{\pi_{k},\overline{\sigma}_{k}} is invertible, we further obtain

 \displaystyle 0\leq Q^{*}-Q_{\pi_{k}}\leq\frac{\gamma}{N}\cdot(I-\gamma\cdot P_{\pi_{k},\overline{\sigma}_{k}})^{-1}\cdot\sum_{i\in\mathcal{N}}(P_{\pi^{*},\sigma^{i,*}_{k}}-P_{\pi_{k},\overline{\sigma}_{k}})(Q^{*}-\widetilde{Q}^{1}_{k})+\frac{\gamma}{N}\cdot(I-\gamma\cdot P_{\pi_{k},\overline{\sigma}_{k}})^{-1}\cdot\sum_{i\in\mathcal{N}}\big{(}P_{\widetilde{\pi}^{i}_{k},\widehat{\sigma}^{i}_{k}}-P_{\pi^{*},\sigma^{i,*}_{k}}\big{)}(\widetilde{Q}^{1,i}_{k}-\widetilde{Q}^{1}_{k}). (5.43)
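The inverse (I-\gamma\cdot P_{\pi_{k},\overline{\sigma}_{k}})^{-1} above exists because P_{\pi_{k},\overline{\sigma}_{k}} is a row-stochastic (Markov) operator and \gamma<1, so the Neumann series \sum_{t\geq 0}(\gamma P)^{t} converges. As an illustrative numerical sanity check (not part of the paper; the matrix P below is a randomly generated stand-in for P_{\pi_{k},\overline{\sigma}_{k}} on a finite space):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n = 0.9, 5

# Random row-stochastic matrix, standing in for P_{pi_k, sigma_bar_k}.
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)

# Direct inverse of (I - gamma * P).
inv_direct = np.linalg.inv(np.eye(n) - gamma * P)

# Truncated Neumann series: sum_{t=0}^{199} (gamma * P)^t.
inv_series = sum(np.linalg.matrix_power(gamma * P, t) for t in range(200))

assert np.allclose(inv_direct, inv_series, atol=1e-6)
# (I - gamma P)^{-1} applied to the all-ones vector equals 1/(1 - gamma) times
# the all-ones vector, the fact used to normalize the operators below.
assert np.allclose(inv_direct @ np.ones(n), np.ones(n) / (1 - gamma))
```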

Moreover, by setting \ell=K and k=0 in Lemma 5.5, we obtain that for any i\in\mathcal{N}

Next, we denote the second term on the right-hand side of (5.43) by \xi^{1}_{k}, i.e.,

 \displaystyle\xi^{1}_{k}=\frac{\gamma}{N}\cdot(I-\gamma\cdot P_{\pi_{k},\overline{\sigma}_{k}})^{-1}\cdot\sum_{i\in\mathcal{N}}\big{(}P_{\widetilde{\pi}^{i}_{k},\widehat{\sigma}^{i}_{k}}-P_{\pi^{*},\sigma^{i,*}_{k}}\big{)}(\widetilde{Q}^{1,i}_{k}-\widetilde{Q}^{1}_{k}), (5.45)

which depends on the quality of the solution to (3.1) returned by the decentralized optimization algorithm. By combining (5.43), the multi-step expansion from Lemma 5.5, and (5.45), we obtain the bound at the final iteration K as

 \displaystyle Q^{*}-Q_{\pi_{K}}\leq\frac{(I-\gamma\cdot P_{\pi_{K},\overline{\sigma}_{K}})^{-1}}{N}\cdot\sum_{i\in\mathcal{N}}\bigg{\{}\sum_{j=0}^{K-1}\gamma^{K-j}\cdot\bigl{[}(P_{\pi^{*},\sigma^{i,*}_{K}}P_{\pi^{*},\sigma^{*}_{K-1}}\cdots P_{\pi^{*},\sigma^{*}_{j+1}})-(P_{\pi_{K},\overline{\sigma}_{K}}P_{\widetilde{\pi}_{K-1},\widetilde{\sigma}^{*}_{K-1}}\cdots P_{\widetilde{\pi}_{j+1},\widetilde{\sigma}^{*}_{j+1}})\bigr{]}(\eta^{1}_{j+1}+\varrho^{1}_{j+1})
 \displaystyle\qquad+\gamma^{K+1}\cdot\bigl{[}(P_{\pi^{*},\sigma^{i,*}_{K}}P_{\pi^{*},\sigma^{*}_{K-1}}\cdots P_{\pi^{*},\sigma^{*}_{0}})-(P_{\pi_{K},\overline{\sigma}_{K}}P_{\widetilde{\pi}_{K-1},\widetilde{\sigma}^{*}_{K-1}}\cdots P_{\widetilde{\pi}_{0},\widetilde{\sigma}^{*}_{0}})\bigr{]}\cdot(Q^{*}-\widetilde{Q}^{1}_{0})\bigg{\}}+\xi^{1}_{K}. (5.46)

For simplicity, we define the coefficients \{\alpha_{j}\}_{j=0}^{K} as in (5.14), and K+1 linear operators \{\mathcal{L}^{1}_{j}\}_{j=0}^{K} as

 \displaystyle\mathcal{L}^{1}_{j}=\frac{(1-\gamma)}{2N}\cdot(I-\gamma\cdot P_{\pi_{K},\overline{\sigma}_{K}})^{-1}\sum_{i\in\mathcal{N}}\bigl{[}(P_{\pi^{*},\sigma^{i,*}_{K}}P_{\pi^{*},\sigma^{*}_{K-1}}\cdots P_{\pi^{*},\sigma^{*}_{j+1}})+(P_{\pi_{K},\overline{\sigma}_{K}}P_{\widetilde{\pi}_{K-1},\widetilde{\sigma}^{*}_{K-1}}\cdots P_{\widetilde{\pi}_{j+1},\widetilde{\sigma}^{*}_{j+1}})\bigr{]},~{}~{}\text{for}~{}~{}0\leq j\leq K-1,
 \displaystyle\mathcal{L}^{1}_{K}=\frac{(1-\gamma)}{2N}\cdot(I-\gamma\cdot P_{\pi_{K},\overline{\sigma}_{K}})^{-1}\sum_{i\in\mathcal{N}}\bigl{[}(P_{\pi^{*},\sigma^{i,*}_{K}}P_{\pi^{*},\sigma^{*}_{K-1}}\cdots P_{\pi^{*},\sigma^{*}_{0}})+(P_{\pi_{K},\overline{\sigma}_{K}}P_{\widetilde{\pi}_{K-1},\widetilde{\sigma}^{*}_{K-1}}\cdots P_{\widetilde{\pi}_{0},\widetilde{\sigma}^{*}_{0}})\bigr{]}.

Then, we take the absolute value on both sides of (5.46) to obtain

 \displaystyle\bigl{|}Q^{*}(s,a,b)-Q_{\pi_{K}}(s,a,b)\bigr{|}\leq\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^{2}}\cdot\biggl{[}\sum_{j=0}^{K-1}\alpha_{j}\cdot\bigl{(}\mathcal{L}^{1}_{j}|\eta^{1}_{j+1}+\varrho^{1}_{j+1}|\bigr{)}(s,a,b)+\alpha_{K}\cdot\bigl{(}\mathcal{L}^{1}_{K}|Q^{*}-\widetilde{Q}^{1}_{0}|\bigr{)}(s,a,b)\biggr{]}+|\xi^{1}_{K}(s,a,b)|, (5.47)

for any (s,a,b)\in{\mathcal{S}}\times\mathcal{A}\times\mathcal{B}. This completes the second step of the proof.

Step (iii): We note that (5.47) has almost the same form as its counterpart in §5.1, using the facts that \sum_{j=0}^{K}\alpha_{j}=1 and that, for all j=0,\cdots,K, the linear operators \mathcal{L}^{1}_{j} are positive and satisfy \mathcal{L}^{1}_{j}\bm{1}=\bm{1}. Hence, the proof here follows directly from Step (iii) in §5.1, from which we obtain that for any fixed \mu\in\mathcal{P}({\mathcal{S}}\times\mathcal{A}\times\mathcal{B})

 \displaystyle\|Q^{*}-Q_{\pi_{K}}\|_{\mu}\leq\frac{4\gamma\cdot\big{(}\phi^{\text{MG}}_{\mu,\nu}\big{)}^{1/2}}{\sqrt{2}(1-\gamma)^{2}}\cdot(\|\eta^{1}\|_{\nu}+\|\varrho^{1}\|_{\nu})+\frac{4\sqrt{2}\cdot Q_{\max}}{(1-\gamma)^{2}}\cdot\gamma^{K/2}+\frac{2\sqrt{2}\gamma}{1-\gamma}\cdot\overline{\epsilon}^{1}_{K}, (5.48)

where we denote by \|\eta^{1}\|_{\nu}=\max_{j=0,\cdots,K-1}\|\eta^{1}_{j+1}\|_{\nu}, \|\varrho^{1}\|_{\nu}=\max_{j=0,\cdots,K-1}\|\varrho^{1}_{j+1}\|_{\nu}, and \overline{\epsilon}^{1}_{K}=[N^{-1}\cdot\sum_{i\in\mathcal{N}}(\epsilon^{1,i}_{K})^{2}]^{1/2}, with \epsilon^{1,i}_{K} being the one-step decentralized computation error from Assumption 4.3. To obtain (5.48), we use the definition of the concentrability coefficient \kappa^{\text{MG}} and Assumption 4.3, in the same way that the definition of \kappa^{\text{MDP}} and Assumption 4.3 were used in Step (iii) in §5.1.

Recall that \eta^{1}_{j+1} is defined as \eta^{1}_{j+1}={\mathcal{T}}\widetilde{Q}^{1}_{j}-\widetilde{{\mathcal{T}}}\widetilde{\mathbf{Q}}^{1}_{j}, which can be further bounded by the one-step decentralized computation error from Assumption 4.4. The following lemma characterizes this relationship, playing a role similar to that of Lemma 5.2.

###### Lemma 5.6.

Under Assumption 4.4, for any j=0,\cdots,K-1, it holds that \|\eta^{1}_{j+1}\|_{\nu}\leq\sqrt{2}\gamma\cdot\overline{\epsilon}^{1}_{j}, where \overline{\epsilon}^{1}_{j}=[N^{-1}\cdot\sum_{i\in\mathcal{N}}(\epsilon^{1,i}_{j})^{2}]^{1/2} and \epsilon^{1,i}_{j} is defined as in Assumption 4.4.

###### Proof.

Note that

 \displaystyle\eta^{1}_{j+1}(s,a,b)=\gamma\cdot\mathbb{E}_{s^{\prime}\sim P(\cdot{\,|\,}s,a,b)}\bigg{[}\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}\big{[}\widetilde{Q}^{1}_{j}(s^{\prime},a^{\prime},b^{\prime})\big{]}-\frac{1}{N}\sum_{i\in\mathcal{N}}\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}\big{[}\widetilde{Q}^{1,i}_{j}(s^{\prime},a^{\prime},b^{\prime})\big{]}\bigg{]}.

Now we claim that for any s^{\prime}\in{\mathcal{S}}

 \displaystyle\Big{|}\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}\big{[}\widetilde{Q}^{1}_{j}(s^{\prime},a^{\prime},b^{\prime})\big{]}-\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}\big{[}\widetilde{Q}^{1,i}_{j}(s^{\prime},a^{\prime},b^{\prime})\big{]}\Big{|}\leq C_{1}\cdot\epsilon^{1,i}_{j}, (5.49)

for any constant C_{1}>1. For notational simplicity, let g_{j}(\pi^{\prime},\sigma^{\prime})=\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}\big{[}\widetilde{Q}^{1}_{j}(s^{\prime},a^{\prime},b^{\prime})\big{]} and g^{i}_{j}(\pi^{\prime},\sigma^{\prime})=\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}\big{[}\widetilde{Q}^{1,i}_{j}(s^{\prime},a^{\prime},b^{\prime})\big{]}. Suppose (5.49) does not hold; then either

 \displaystyle\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}g_{j}(\pi^{\prime},\sigma^{\prime})\geq\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}g^{i}_{j}(\pi^{\prime},\sigma^{\prime})+C_{1}\cdot\epsilon^{1,i}_{j}, (5.50)

or

 \displaystyle\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}g_{j}(\pi^{\prime},\sigma^{\prime})\leq\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}g^{i}_{j}(\pi^{\prime},\sigma^{\prime})-C_{1}\cdot\epsilon^{1,i}_{j}.

In the first case, let \pi^{\prime}_{*} be a maximin strategy for g_{j}, so that \min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}g_{j}(\pi^{\prime}_{*},\sigma^{\prime})=\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}g_{j}(\pi^{\prime},\sigma^{\prime}), and let \sigma^{\prime}_{*}=\sigma^{\prime}_{*}(\pi^{\prime}_{*})\in\mathop{\mathrm{argmin}}_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}g^{i}_{j}(\pi^{\prime}_{*},\sigma^{\prime}), which is a function of \pi^{\prime}_{*}. By Assumption 4.4, \widetilde{Q}^{1}_{j}(s^{\prime},\cdot,\cdot) and \widetilde{Q}^{1,i}_{j}(s^{\prime},\cdot,\cdot) are within \epsilon^{1,i}_{j} of each other uniformly over \mathcal{A}\times\mathcal{B}. Thus, by the linearity of g_{j} and g^{i}_{j} in the strategy pair, we have g^{i}_{j}(\pi^{\prime}_{*},\sigma^{\prime}_{*})\geq g_{j}(\pi^{\prime}_{*},\sigma^{\prime}_{*})-\epsilon^{1,i}_{j}. Together with (5.50) and the fact that g_{j}(\pi^{\prime}_{*},\sigma^{\prime}_{*})\geq\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}g_{j}(\pi^{\prime}_{*},\sigma^{\prime}), we obtain

 \displaystyle g^{i}_{j}(\pi^{\prime}_{*},\sigma^{\prime}_{*})\geq\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}g^{i}_{j}(\pi^{\prime},\sigma^{\prime})+(C_{1}-1)\cdot\epsilon^{1,i}_{j}. (5.51)

However, (5.51) cannot hold for any C_{1}>1, since \max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}g^{i}_{j}(\pi^{\prime},\sigma^{\prime}) is no smaller than g^{i}_{j}(\pi,\sigma) for any pair (\pi,\sigma) that satisfies \sigma=\sigma(\pi)\in\mathop{\mathrm{argmin}}_{\sigma\in\mathcal{P}(\mathcal{B})}g^{i}_{j}(\pi,\sigma), including the pair (\pi^{\prime}_{*},\sigma^{\prime}_{*}). Similarly, one can show that the second case cannot occur. Thus, the claim (5.49) is proved. Letting C_{1}=\sqrt{2}, we obtain that

 \displaystyle|\eta^{1}_{j+1}(s,a,b)|^{2}\leq\gamma^{2}\bigg{(}\frac{1}{N}\sum_{i\in\mathcal{N}}\sqrt{2}\epsilon^{1,i}_{j}\bigg{)}^{2}\leq\big{(}\sqrt{2}\gamma\big{)}^{2}\frac{1}{N}\sum_{i\in\mathcal{N}}\big{(}\epsilon^{1,i}_{j}\big{)}^{2},

where the second inequality follows from Jensen’s inequality. Taking expectation over \nu, we obtain the desired bound. ∎
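The Jensen step in the proof above is simply the convexity of x\mapsto x^{2} applied to the uniform average over agents:

```latex
\bigg(\frac{1}{N}\sum_{i\in\mathcal{N}}\sqrt{2}\,\epsilon^{1,i}_{j}\bigg)^{2}
\leq \frac{1}{N}\sum_{i\in\mathcal{N}}\big(\sqrt{2}\,\epsilon^{1,i}_{j}\big)^{2}
= \frac{2}{N}\sum_{i\in\mathcal{N}}\big(\epsilon^{1,i}_{j}\big)^{2}.
```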

From Lemma 5.6, we can further simplify (5.48) to obtain the desired bound in Theorem 4.9, which concludes the proof. ∎

### 5.4 Proof of Theorem 4.10

###### Proof.

The proof is similar to the proof of Theorem 4.7 in §5.2. To avoid repetition, we will only emphasize the differences from there. First, for any fixed \mathbf{Q}=[Q^{i}]_{i\in\mathcal{N}}\in\mathcal{H}^{N} and f\in\mathcal{H}, we define \ell_{f,\mathbf{Q}} as

 \displaystyle\ell_{f,\mathbf{Q}}(s,a,b,s^{\prime})=\Big{\{}\overline{r}(s,a,b)+\gamma/N\cdot\sum_{i\in\mathcal{N}}\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}[Q^{i}(s^{\prime},a^{\prime},b^{\prime})]-f(s,a,b)\Big{\}}^{2},

where \overline{r}(s,a,b)=N^{-1}\cdot\sum_{i\in\mathcal{N}}r^{i}(s,a,b) with r^{i}(s,a,b)\sim R^{i}(s,a,b). Thus, we can similarly define \widehat{L}_{T}(f;\mathbf{Q})=T^{-1}\cdot\sum_{t=1}^{T}\ell_{f,\mathbf{Q}}(s_{t},a_{t},b_{t},s_{t+1}) and {L}(f;\mathbf{Q})=\mathbb{E}[\widehat{L}_{T}(f;\mathbf{Q})]. Letting f^{\prime}\in\mathop{\mathrm{argmin}}_{f\in\mathcal{H}}\widehat{L}_{T}(f;\mathbf{Q}) and Z_{t}=(s_{t},\{a^{i}_{t}\}_{i\in\mathcal{N}},\{b^{j}_{t}\}_{j\in\mathcal{M}},s_{t+1}), we have

 \displaystyle\|f^{\prime}-\widetilde{{\mathcal{T}}}\mathbf{Q}\|^{2}_{\nu}-\inf_{f\in\mathcal{H}}\|f-\widetilde{{\mathcal{T}}}\mathbf{Q}\|^{2}_{\nu}\leq 2\sup_{\ell_{f,\mathbf{Q}}\in\mathcal{L}_{\mathcal{H}}}\bigg{|}\frac{1}{T}\sum_{t=1}^{T}\ell_{f,\mathbf{Q}}(Z_{t})-\mathbb{E}[\ell_{f,\mathbf{Q}}(Z_{1})]\bigg{|}. (5.52)

Now it suffices to show that P_{0} as defined in (5.28) satisfies P_{0}<\delta, with \Lambda_{T}(\delta) and \epsilon having the same form as (5.27). We will identify the choice of C_{1} and C_{2} in \Lambda_{T}(\delta) shortly. To show P_{0}<\delta, from (5.29), we need to bound the empirical covering number \mathcal{N}_{1}(\epsilon/16,\mathcal{L}_{\mathcal{H}},(Z^{\prime}_{t};t\in H)), where H is the set of ghost samples. The bound is characterized by the following lemma.

###### Lemma 5.7.

Let Z^{1:T}=(Z_{1},\cdots,Z_{T}), with Z_{t}=(s_{t},\{a^{i}_{t}\}_{i\in\mathcal{N}},\{b^{j}_{t}\}_{j\in\mathcal{M}},s% _{t+1}). Recall that \widetilde{R}_{\max}=(1+\gamma)Q_{\max}+R_{\max}, and A=|\mathcal{A}|,B=|\mathcal{B}|. Then, under Assumption 4.1, it holds that

 \displaystyle\mathcal{N}_{1}(\epsilon,\mathcal{L}_{\mathcal{H}},Z^{1:T})\leq e^{N+1}(V_{\mathcal{H}^{+}}+1)^{N+1}(AB)^{NV_{\mathcal{H}^{+}}}Q_{\max}^{(N+1)V_{\mathcal{H}^{+}}}\bigg{(}\frac{4e\widetilde{R}_{\max}(1+\gamma)}{\epsilon}\bigg{)}^{(N+1)V_{\mathcal{H}^{+}}}.
###### Proof.

We first bound the empirical \ell^{1}-distance between any l_{f,\mathbf{Q}} and l_{\widetilde{f},\widetilde{\mathbf{Q}}} in \mathcal{L}_{\mathcal{H}} as

 \displaystyle\frac{1}{T}\sum_{t=1}^{T}\big{|}l_{f,\mathbf{Q}}(Z_{t})-l_{\widetilde{f},\widetilde{\mathbf{Q}}}(Z_{t})\big{|}\leq 2\widetilde{R}_{\max}\bigg{\{}\frac{1}{T}\sum_{t=1}^{T}\big{|}f(s_{t},a_{t},b_{t})-\widetilde{f}(s_{t},a_{t},b_{t})\big{|}+\frac{\gamma}{T}\sum_{t=1}^{T}\Big{|}\frac{1}{N}\sum_{i\in\mathcal{N}}\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}[Q^{i}(s_{t+1},a^{\prime},b^{\prime})]-\frac{1}{N}\sum_{i\in\mathcal{N}}\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}[\widetilde{Q}^{i}(s_{t+1},a^{\prime},b^{\prime})]\Big{|}\bigg{\}}. (5.53)

Let \mathcal{D}_{Z}=\{(s_{1},\{a^{i}_{1}\}_{i\in\mathcal{N}},\{b^{j}_{1}\}_{j\in\mathcal{M}}),\cdots,(s_{T},\{a^{i}_{T}\}_{i\in\mathcal{N}},\{b^{j}_{T}\}_{j\in\mathcal{M}})\} and y_{Z}=(s_{2},\cdots,s_{T+1}). Then, from (5.53), the empirical covering number \mathcal{N}_{1}\big{(}2\widetilde{R}_{\max}(1+\gamma)\epsilon,\mathcal{L}_{\mathcal{H}},Z^{1:T}\big{)} can be bounded by

 \displaystyle\mathcal{N}_{1}\big{(}2\widetilde{R}_{\max}(1+\gamma)\epsilon,\mathcal{L}_{\mathcal{H}},Z^{1:T}\big{)}\leq\mathcal{N}_{1}(\epsilon,\mathcal{H}^{\vee}_{N},y_{Z})\cdot\mathcal{N}_{1}(\epsilon,\mathcal{H},\mathcal{D}_{Z}), (5.54)

where \mathcal{H}^{\vee}_{N} here is defined as

 \displaystyle\mathcal{H}^{\vee}_{N}=\bigg{\{}V:V(\cdot)=N^{-1}\cdot\sum_{i\in\mathcal{N}}\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}[Q^{i}(\cdot,a^{\prime},b^{\prime})]\text{~{}and~{}}\mathbf{Q}\in\mathcal{H}^{N}\bigg{\}}.

We further bound the first covering number \mathcal{N}_{1}(\epsilon,\mathcal{H}^{\vee}_{N},y_{Z}) by the following lemma, which is proved later in §B.

###### Lemma 5.8.

For any fixed y_{Z}=(y_{1},\cdots,y_{T}), let \mathcal{D}_{y}=\{(y_{t},a_{j},b_{k})\}_{t\in[T],j\in[A],k\in[B]}, where recall that A=|\mathcal{A}|,B=|\mathcal{B}|, and \mathcal{A}=\{a_{1},\cdots,a_{A}\},\mathcal{B}=\{b_{1},\cdots,b_{B}\}. Then, under Assumption 4.1, it holds that

 \displaystyle\mathcal{N}_{1}(\epsilon,\mathcal{H}^{\vee}_{N},y_{Z})\leq\big{[}\mathcal{N}_{1}({\epsilon}/{(AB)},\mathcal{H},\mathcal{D}_{y})\big{]}^{N}\leq\bigg{[}e(V_{\mathcal{H}^{+}}+1)\bigg{(}\frac{2eQ_{\max}AB}{\epsilon}\bigg{)}^{V_{\mathcal{H}^{+}}}\bigg{]}^{N}.

Furthermore, we can bound \mathcal{N}_{1}(\epsilon,\mathcal{H},\mathcal{D}_{Z}) by Proposition B.2, which together with Lemma 5.8 yields a bound for (5.54)

 \displaystyle\mathcal{N}_{1}\big{(}2\widetilde{R}_{\max}(1+\gamma)\epsilon,\mathcal{L}_{\mathcal{H}},Z^{1:T}\big{)}\leq e^{N+1}(V_{\mathcal{H}^{+}}+1)^{N+1}(AB)^{NV_{\mathcal{H}^{+}}}Q_{\max}^{(N+1)V_{\mathcal{H}^{+}}}\bigg{(}\frac{2e}{\epsilon}\bigg{)}^{(N+1)V_{\mathcal{H}^{+}}},

which completes the proof by replacing 2\widetilde{R}_{\max}(1+\gamma)\epsilon by \epsilon. ∎

## 6 Conclusions

In this paper, we provided finite-sample analyses for fully decentralized multi-agent reinforcement learning. Specifically, we considered both collaborative and competitive MARL settings. In the collaborative setting, a team of heterogeneous agents connected by a time-varying communication network aims to maximize the globally averaged return of all agents, without any central controller. In the competitive setting, two teams of such fully decentralized RL agents form a zero-sum Markov game. Our settings cover several conventional MARL settings as special cases, e.g., the conventional collaborative MARL with a common reward function for all agents, and the competitive MARL that is usually modeled as a two-player zero-sum Markov game.

In both settings, we proposed fitted-Q-iteration-based MARL algorithms, aided by decentralized optimization algorithms that solve the fitting problem at each iteration. We quantified the performance bound of the output action-value function, using a finite number of samples drawn from a single RL trajectory. In addition, with linear function approximation, we further derived the performance bound after a finite number of iterations of decentralized computation. To our knowledge, these are the first finite-sample analyses for multi-agent RL, in either the collaborative or the competitive setting. We believe these theoretical results provide some insights into the fundamental performance of MARL algorithms implemented with finite samples and finite computation iterations in practice. One interesting future direction is to extend the finite-sample analyses to more general MARL settings, e.g., general-sum Markov games. It is also promising to sharpen the bounds we obtained, in order to better understand and improve both the sample and computational efficiency of MARL algorithms.

## References

• Alexander and Murray (2004) Alexander, F. J. and Murray, R. M. (2004). Information flow and cooperative control of vehicle formations. IEEE Transactions on Automatic Control, 49 1465–1476.
• Amato and Zilberstein (2009) Amato, C. and Zilberstein, S. (2009). Achieving goals in decentralized POMDPs. In International Conference on Autonomous Agents and Multiagent Systems.
• Antos et al. (2008a) Antos, A., Szepesvári, C. and Munos, R. (2008a). Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems.
• Antos et al. (2008b) Antos, A., Szepesvári, C. and Munos, R. (2008b). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71 89–129.
• Banerjee et al. (2012) Banerjee, B., Lyle, J., Kraemer, L. and Yellamraju, R. (2012). Sample bounded distributed reinforcement learning for decentralized POMDPs. In AAAI Conference on Artificial Intelligence.
• Boutilier (1996) Boutilier, C. (1996). Planning, learning and coordination in multi-agent decision processes. In Conference on Theoretical Aspects of Rationality and Knowledge.
• Bowling and Veloso (2002) Bowling, M. and Veloso, M. (2002). Multiagent learning using a variable learning rate. Artificial Intelligence, 136 215–250.
• Busoniu et al. (2008) Busoniu, L., Babuska, R. and De Schutter, B. (2008). A comprehensive survey of multi-agent reinforcement learning. IEEE Transactions on Systems, Man, And Cybernetics-Part C: Applications and Reviews, 38 (2), 2008.
• Cardenas et al. (2009) Cardenas, A., Amin, S., Sinopoli, B., Giani, A., Perrig, A., Sastry, S. et al. (2009). Challenges for securing cyber physical systems. In Workshop on Future Directions in Cyber-Physical Systems Security, vol. 5.
• Ceren et al. (2016) Ceren, R., Doshi, P. and Banerjee, B. (2016). Reinforcement learning in partially observable multiagent settings: Monte Carlo exploring policies with PAC bounds. In International Conference on Autonomous Agents and Multiagent Systems.
• Chakraborty and Stone (2014) Chakraborty, D. and Stone, P. (2014). Multiagent learning in the presence of memory-bounded agents. Autonomous Agents and Multi-agent Systems, 28 182–213.
• Chen et al. (2018) Chen, T., Zhang, K., Giannakis, G. B. and Başar, T. (2018). Communication-efficient distributed reinforcement learning. arXiv preprint arXiv:1812.03239.
• Corke et al. (2005) Corke, P., Peterson, R. and Rus, D. (2005). Networked robots: Flying robot navigation using a sensor net. Robotics Research 234–243.
• Dalal et al. (2018) Dalal, G., Szorenyi, B., Thoppe, G. and Mannor, S. (2018). Finite sample analyses for TD(0) with function approximation. In AAAI Conference on Artificial Intelligence.
• Do Nascimento Silva and Chaimowicz (2017) Do Nascimento Silva, V. and Chaimowicz, L. (2017). MOBA: A new arena for game AI. arXiv preprint arXiv:1705.10443.
• Farahmand et al. (2016) Farahmand, A.-m., Ghavamzadeh, M., Szepesvári, C. and Mannor, S. (2016). Regularized policy iteration with nonparametric function spaces. The Journal of Machine Learning Research, 17 4809–4874.
• Foerster et al. (2016) Foerster, J., Assael, Y. M., de Freitas, N. and Whiteson, S. (2016). Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems.
• Geramifard et al. (2013) Geramifard, A., Walsh, T. J., Tellex, S., Chowdhary, G., Roy, N., How, J. P. et al. (2013). A tutorial on linear function approximators for dynamic programming and reinforcement learning. Foundations and Trends® in Machine Learning, 6 375–451.
• Haussler (1995) Haussler, D. (1995). Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69 217–232.
• Hong and Chang (2017) Hong, M. and Chang, T.-H. (2017). Stochastic proximal gradient consensus over random networks. IEEE Transactions on Signal Processing, 65 2933–2948.
• Hong et al. (2016) Hong, M., Luo, Z.-Q. and Razaviyayn, M. (2016). Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26 337–364.
• Jiang et al. (2016) Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J. and Schapire, R. E. (2016). Contextual decision processes with low Bellman rank are PAC-learnable. arXiv preprint arXiv:1610.09512.
• Kakade (2003) Kakade, S. M. (2003). On the sample complexity of reinforcement learning. Ph.D. thesis, University of London, England.
• Kar et al. (2013) Kar, S., Moura, J. M. and Poor, H. V. (2013). QD-learning: A collaborative distributed strategy for multi-agent reinforcement learning through Consensus + Innovations. IEEE Transactions on Signal Processing, 61 1848–1862.
• Kitano et al. (1997) Kitano, H., Asada, M., Kuniyoshi, Y., Noda, I., Osawa, E. and Matsubara, H. (1997). RoboCup: A challenge problem for AI. AI Magazine, 18 73.
• Lagoudakis and Parr (2003) Lagoudakis, M. G. and Parr, R. (2003). Learning in zero-sum team Markov games using factored value functions. In Advances in Neural Information Processing Systems.
• Lanctot et al. (2017) Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Perolat, J., Silver, D., Graepel, T. et al. (2017). A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems.
• Lauer and Riedmiller (2000) Lauer, M. and Riedmiller, M. (2000). An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In International Conference on Machine Learning.
• Lazaric et al. (2010) Lazaric, A., Ghavamzadeh, M. and Munos, R. (2010). Finite-sample analysis of LSTD. In International Conference on Machine Learning.
• Lee et al. (2018) Lee, D., Yoon, H. and Hovakimyan, N. (2018). Primal-dual algorithm for distributed reinforcement learning: Distributed GTD2. arXiv preprint arXiv:1803.08031.
• Littman (1994) Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings. Elsevier.
• Littman (2001) Littman, M. L. (2001). Friend-or-Foe Q-learning in general-sum games. In International Conference on Machine Learning.
• Littman and Szepesvári (1996) Littman, M. L. and Szepesvári, C. (1996). A generalized reinforcement-learning model: Convergence and applications. In International Conference on Machine Learning.
• Liu et al. (2015) Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S. and Petrik, M. (2015). Finite-sample analysis of proximal gradient TD algorithms. In Conference on Uncertainty in Artificial Intelligence.
• Lowe et al. (2017) Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P. and Mordatch, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275.
• Mathkar and Borkar (2017) Mathkar, A. and Borkar, V. S. (2017). Distributed reinforcement learning via gossip. IEEE Transactions on Automatic Control, 62 1465–1470.
• Munos and Szepesvári (2008) Munos, R. and Szepesvári, C. (2008). Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9 815–857.
• Nedic et al. (2017) Nedic, A., Olshevsky, A. and Shi, W. (2017). Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27 2597–2633.
• Oliehoek et al. (2016) Oliehoek, F. A., Amato, C. et al. (2016). A Concise Introduction to Decentralized POMDPs, vol. 1. Springer.
• Patek (1997) Patek, S. D. (1997). Stochastic and shortest path games: Theory and algorithms. Ph.D. thesis, Massachusetts Institute of Technology.
• Perolat et al. (2015) Perolat, J., Scherrer, B., Piot, B. and Pietquin, O. (2015). Approximate dynamic programming for two-player zero-sum Markov games. In International Conference on Machine Learning.
• Perolat et al. (2016) Perolat, J., Strub, F., Piot, B. and Pietquin, O. (2016). Learning Nash equilibrium for general-sum Markov games from batch data. In International Conference on Artificial Intelligence and Statistics.
• Riedmiller (2005) Riedmiller, M. (2005). Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning.
• Scherrer et al. (2015) Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B. and Geist, M. (2015). Approximate modified policy iteration and its application to the game of Tetris. Journal of Machine Learning Research, 16 1629–1676.
• Strehl et al. (2009) Strehl, A. L., Li, L. and Littman, M. L. (2009). Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10 2413–2444.
• Suttle et al. (2019) Suttle, W., Yang, Z., Zhang, K., Wang, Z., Başar, T. and Liu, J. (2019). A multi-agent off-policy actor-critic algorithm for distributed reinforcement learning. arXiv preprint arXiv:1903.06372.
• Tampuu et al. (2017) Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Korjus, K., Aru, J., Aru, J. and Vicente, R. (2017). Multiagent cooperation and competition with deep reinforcement learning. PloS one, 12 e0172395.
• Tatarenko and Touri (2017) Tatarenko, T. and Touri, B. (2017). Non-convex distributed optimization. IEEE Transactions on Automatic Control, 62 3744–3757.
• Tsitsiklis and Van Roy (1997) Tsitsiklis, J. N. and Van Roy, B. (1997). Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems.
• Von Neumann and Morgenstern (1947) Von Neumann, J. and Morgenstern, O. (1947). Theory of Games and Economic Behavior (commemorative edition). Princeton University Press.
• Wai et al. (2018) Wai, H.-T., Yang, Z., Wang, Z. and Hong, M. (2018). Multi-agent reinforcement learning via double averaging primal-dual optimization. In Advances in Neural Information Processing Systems.
• Wang and Sandholm (2003) Wang, X. and Sandholm, T. (2003). Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In Advances in Neural Information Processing Systems.
• Yang et al. (2019) Yang, Z., Xie, Y. and Wang, Z. (2019). A theoretical analysis of deep Q-learning. arXiv preprint arXiv:1901.00137.
• Yang et al. (2018) Yang, Z., Zhang, K., Hong, M. and Başar, T. (2018). A finite sample analysis of the actor-critic algorithm. In Proceedings of IEEE Conference on Decision and Control.
• Zhang et al. (2018a) Zhang, K., Liu, Y., Liu, J., Liu, M. and Başar, T. (2018a). Distributed learning of average belief over networks using sequential observations. arXiv preprint arXiv:1811.07799.
• Zhang et al. (2018b) Zhang, K., Lu, L., Lei, C., Zhu, H. and Ouyang, Y. (2018b). Dynamic operations and pricing of electric unmanned aerial vehicle systems and power networks. Transportation Research Part C: Emerging Technologies, 92 472–485.
• Zhang et al. (2018c) Zhang, K., Shi, W., Zhu, H. and Başar, T. (2018c). Distributed equilibrium-learning for power network voltage control with a locally connected communication network. In IEEE Annual American Control Conference. IEEE.
• Zhang et al. (2018d) Zhang, K., Shi, W., Zhu, H., Dall’Anese, E. and Başar, T. (2018d). Dynamic power distribution system management with a locally connected communication network. IEEE Journal of Selected Topics in Signal Processing, 12 673–687.
• Zhang et al. (2018e) Zhang, K., Yang, Z. and Başar, T. (2018e). Networked multi-agent reinforcement learning in continuous spaces. In Proceedings of IEEE Conference on Decision and Control.
• Zhang et al. (2018f) Zhang, K., Yang, Z., Liu, H., Zhang, T. and Başar, T. (2018f). Fully decentralized multi-agent reinforcement learning with networked agents. In International Conference on Machine Learning.
• Zhu and Martínez (2013) Zhu, M. and Martínez, S. (2013). An approximate dual subgradient algorithm for multi-agent non-convex optimization. IEEE Transactions on Automatic Control, 58 1534–1539.
• Zinkevich et al. (2006) Zinkevich, M., Greenwald, A. and Littman, M. L. (2006). Cyclic equilibria in Markov games. In Advances in Neural Information Processing Systems.

## Appendix A Definitions

In this section, we provide the detailed definitions of some terms used in the main text for the sake of completeness.

We first give the definition of \beta-mixing of a stochastic process mentioned in Assumption 4.2.

###### Definition A.1 (\beta-mixing).

Let \{Z_{t}\}_{t=1,2,\cdots} be a stochastic process, and denote the collection of (Z_{1},\cdots,Z_{t}) by Z^{1:t}. Let \sigma(Z^{i:j}) denote the \sigma-algebra generated by Z^{i:j}. The m-th \beta-mixing coefficient of \{Z_{t}\}, denoted by \beta_{m}, is defined as

 \displaystyle\beta_{m}=\sup_{t\geq 1}~\mathbb{E}\bigg[\sup_{B\in\sigma(Z^{t+m:\infty})}\big|P(B{\,|\,}Z^{1:t})-P(B)\big|\bigg].

Then, \{Z_{t}\} is said to be \beta-mixing if \beta_{m}\to 0 as m\to\infty. In particular, we say that a \beta-mixing process mixes at an exponential rate with parameters \overline{\beta},g,\zeta>0 if \beta_{m}\leq\overline{\beta}\cdot\exp(-gm^{\zeta}) holds for all m\geq 0.
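For a stationary finite-state Markov chain, \beta_{m} reduces to the \pi-averaged total-variation distance between the m-step transition kernel and the stationary distribution \pi. A minimal NumPy sketch (the two-state chain below is an illustrative choice, not part of the paper's setting):

```python
import numpy as np

# Two-state Markov chain: stays with prob 0.9, switches with prob 0.1.
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

def beta_mixing(P, pi, m):
    """For a stationary Markov chain, beta_m equals the pi-averaged
    total-variation distance between the m-step kernel and pi."""
    Pm = np.linalg.matrix_power(P, m)
    tv = 0.5 * np.abs(Pm - pi[None, :]).sum(axis=1)  # TV per start state
    return float(pi @ tv)

betas = [beta_mixing(P, pi, m) for m in range(1, 6)]
# This chain mixes at an exponential rate: beta_m = 0.5 * 0.8**m,
# since the second eigenvalue of P is 0.8.
```

Here \beta_{m} decays geometrically, matching the exponential-rate regime with \zeta=1 in the definition above.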

We then provide formal definitions of the concentrability coefficients used in Assumption 4.3, following Munos and Szepesvári (2008) and Perolat et al. (2015), for networked multi-agent MDPs and team zero-sum Markov games, respectively.

###### Definition A.2 (Concentrability Coefficient for Networked Multi-agent MDPs).

Let \nu_{1},\nu_{2}\in\mathcal{P}({\mathcal{S}}\times\mathcal{A}) be two probability measures that are absolutely continuous with respect to the Lebesgue measure on {\mathcal{S}}\times\mathcal{A}. Let \{\pi_{t}\} be a sequence of joint policies for all the agents in the networked multi-agent MDP, with \pi_{t}:{\mathcal{S}}\to\mathcal{P}(\mathcal{A}) for all t. Suppose the initial state-action pair (s_{0},a_{0}) has distribution \nu_{1}, and the action a_{t} is sampled from the joint policy \pi_{t}. For any integer m, we denote by P_{\pi_{m}}P_{\pi_{m-1}}\cdots P_{\pi_{1}}\nu_{1} the distribution of (s_{m},a_{m}) under the policy sequence \{\pi_{t}\}_{t=1,\cdots,m}. Then, the m-th concentrability coefficient is defined as

 \displaystyle\kappa^{\text{MDP}}(m;\nu_{1},\nu_{2})=\sup_{\pi_{1},\ldots,\pi_{m}}\biggl[\mathbb{E}_{\nu_{2}}\biggl|\frac{{\mathrm{d}}(P_{\pi_{m}}P_{\pi_{m-1}}\cdots P_{\pi_{1}}\nu_{1})}{{\mathrm{d}}\nu_{2}}\biggr|^{2}\biggr]^{1/2},

where {{\mathrm{d}}(P_{\pi_{m}}P_{\pi_{m-1}}\cdots P_{\pi_{1}}\nu_{1})}/{{\mathrm{d}}\nu_{2}} is the Radon-Nikodym derivative of P_{\pi_{m}}P_{\pi_{m-1}}\cdots P_{\pi_{1}}\nu_{1} with respect to \nu_{2}, and the supremum is taken over all possible joint policy sequences \{\pi_{t}\}_{t=1,\cdots,m}.
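For a small finite MDP, the first coefficient \kappa^{\text{MDP}}(1;\nu_{1},\nu_{2}) can be computed exactly by enumerating deterministic policies, which suffice because the objective is convex in the policy. A sketch under illustrative assumptions (the random kernel P, random \nu_{1}, and uniform \nu_{2} below are not from the paper):

```python
import itertools
import numpy as np

S, A = 2, 2
rng = np.random.default_rng(0)

# Transition kernel P[s, a, s'] and two distributions over (s, a).
P = rng.dirichlet(np.ones(S), size=(S, A))          # shape (S, A, S)
nu1 = rng.dirichlet(np.ones(S * A)).reshape(S, A)   # initial (s0, a0) dist
nu2 = np.full((S, A), 1.0 / (S * A))                # uniform reference measure

def kappa_mdp_1(P, nu1, nu2):
    """First concentrability coefficient kappa(1; nu1, nu2): the supremum
    over policies pi_1 of (E_{nu2}|d(P_{pi_1} nu1)/d nu2|^2)^{1/2}.
    Deterministic policies suffice: mu is affine in pi_1 and the
    chi-square-type norm is convex in mu."""
    s1_dist = np.einsum('sa,sap->p', nu1, P)  # distribution of s_1
    best = 0.0
    for pi in itertools.product(range(A), repeat=S):  # deterministic policies
        mu = np.zeros((S, A))
        for s in range(S):
            mu[s, pi[s]] = s1_dist[s]          # law of (s_1, a_1) under pi
        best = max(best, np.sqrt(np.sum(mu ** 2 / nu2)))
    return best

kappa = kappa_mdp_1(P, nu1, nu2)
```

By Jensen's inequality \kappa\geq 1 always, and with the uniform reference measure it is at most \sqrt{SA}.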

###### Definition A.3 (Concentrability Coefficient for Networked Zero-sum Markov Games).

Let \nu_{1},\nu_{2}\in\mathcal{P}({\mathcal{S}}\times\mathcal{A}\times\mathcal{B}) be two probability measures that are absolutely continuous with respect to the Lebesgue measure on {\mathcal{S}}\times\mathcal{A}\times\mathcal{B}. Let \{(\pi_{t},\sigma_{t})\} be a sequence of joint policies for all the agents in the networked zero-sum Markov game, with \pi_{t}:{\mathcal{S}}\to\mathcal{P}(\mathcal{A}) and \sigma_{t}:{\mathcal{S}}\to\mathcal{P}(\mathcal{B}) for all t. Suppose the initial state-action triple (s_{0},a_{0},b_{0}) has distribution \nu_{1}, and the actions a_{t} and b_{t} are sampled from the joint policies \pi_{t} and \sigma_{t}, respectively. For any integer m, we denote by P_{\pi_{m},\sigma_{m}}P_{\pi_{m-1},\sigma_{m-1}}\cdots P_{\pi_{1},\sigma_{1}}\nu_{1} the distribution of (s_{m},a_{m},b_{m}) under the policy sequence \{(\pi_{t},\sigma_{t})\}_{t=1,\cdots,m}. Then, the m-th concentrability coefficient is defined as

 \displaystyle\kappa^{\text{MG}}(m;\nu_{1},\nu_{2})=\sup_{\pi_{1},\sigma_{1},\ldots,\pi_{m},\sigma_{m}}\biggl[\mathbb{E}_{\nu_{2}}\biggl|\frac{{\mathrm{d}}(P_{\pi_{m},\sigma_{m}}P_{\pi_{m-1},\sigma_{m-1}}\cdots P_{\pi_{1},\sigma_{1}}\nu_{1})}{{\mathrm{d}}\nu_{2}}\biggr|^{2}\biggr]^{1/2},

where {{\mathrm{d}}(P_{\pi_{m},\sigma_{m}}P_{\pi_{m-1},\sigma_{m-1}}\cdots P_{\pi_{1},\sigma_{1}}\nu_{1})}/{{\mathrm{d}}\nu_{2}} is the Radon-Nikodym derivative of P_{\pi_{m},\sigma_{m}}P_{\pi_{m-1},\sigma_{m-1}}\cdots P_{\pi_{1},\sigma_{1}}\nu_{1} with respect to \nu_{2}, and the supremum is taken over all possible joint policy sequences \{(\pi_{t},\sigma_{t})\}_{t=1,\cdots,m}.

To characterize the capacity of function classes, we define the (empirical) covering numbers of a function class, following the definition in Antos et al. (2008b).

###### Definition A.4 (Definition 3 in Antos et al. (2008b)).

Fix \epsilon>0 and a pseudo-metric space \mathcal{M}=(\mathcal{M},d). (A pseudo-metric satisfies all the properties of a metric except that the requirement of distinguishability is removed.) We say that \mathcal{M} is covered by m discs D_{1},\cdots,D_{m} if \mathcal{M}\subset\bigcup_{j}D_{j}. The covering number \mathcal{N}_{c}(\epsilon,\mathcal{M},d) of \mathcal{M} is the smallest integer m such that \mathcal{M} can be covered by m discs, each of radius less than \epsilon. In particular, for a class \mathcal{H} of real-valued functions with domain \mathcal{X} and points x^{1:T}=(x_{1},\cdots,x_{T}) in \mathcal{X}, the empirical covering number is defined as the covering number of \mathcal{H} equipped with the empirical \ell^{1} pseudo-metric,

 \displaystyle l_{x^{1:T}}(f,g)=\frac{1}{T}\sum_{t=1}^{T}d\big(f(x_{t}),g(x_{t})\big),\text{~for any~}f,g\in\mathcal{H},

where d is a distance function on the range of functions in \mathcal{H}. When this range is the reals, we use d(a,b)=|a-b|. For brevity, we denote \mathcal{N}_{c}(\epsilon,\mathcal{H},l_{x^{1:T}}) by \mathcal{N}_{1}(\epsilon,\mathcal{H},x^{1:T}).

## Appendix B Auxiliary Results

In this section, we present several auxiliary results used previously and some of their proofs.

###### Lemma B.1 (Sum of Rank-1 Matrices).

Suppose there are T vectors \{\varphi_{t}\}_{t=1,\cdots,T} with \varphi_{t}\in\mathbb{R}^{d} and d\leq T, and the matrix [\varphi_{1},\cdots,\varphi_{T}]^{\top} has full column-rank, i.e., rank d. Let \mathbf{M}=T^{-1}\cdot\sum_{t=1}^{T}\varphi_{t}\varphi_{t}^{\top}. Then, the matrix \mathbf{M} is full-rank, i.e., has rank d.

###### Proof.

Since the matrix [\varphi_{1},\cdots,\varphi_{T}]^{\top} has full column-rank, there exists at least one subset of d vectors \{\varphi_{t_{1}},\cdots,\varphi_{t_{d}}\} that are linearly independent. Let \mathcal{I}_{d}=\{t_{1},\cdots,t_{d}\}, and consider the matrix \widetilde{\mathbf{M}}=T^{-1}\cdot\sum_{i=1}^{d}\varphi_{t_{i}}\varphi_{t_{i}}^{\top}. We now show that \widetilde{\mathbf{M}} is full-rank. Take a nonzero v_{1}\in\mathbb{R}^{d} orthogonal to \text{span}\{\varphi_{t_{2}},\cdots,\varphi_{t_{d}}\}; then \widetilde{\mathbf{M}}v_{1}=T^{-1}\cdot\varphi_{t_{1}}(\varphi_{t_{1}}^{\top}v_{1}). If \varphi_{t_{1}}^{\top}v_{1}=0, then all the vectors \varphi_{t_{i}} for i=1,\cdots,d would lie in the orthogonal complement of v_{1}, a subspace of dimension d-1, contradicting the linear independence of \{\varphi_{t_{1}},\cdots,\varphi_{t_{d}}\}. Hence \widetilde{\mathbf{M}}v_{1} is a nonzero scalar multiple of \varphi_{t_{1}}. In this way, for each i=1,\cdots,d we can construct a nonzero scalar multiple of \varphi_{t_{i}} lying in the range space of \widetilde{\mathbf{M}}; these d vectors are linearly independent. Hence \widetilde{\mathbf{M}} is full-rank, with smallest eigenvalue \lambda_{\min}(\widetilde{\mathbf{M}})>0.

In addition, for any t_{j}\in[T] with t_{j}\notin\mathcal{I}_{d}, it holds that \lambda_{\min}(\widetilde{\mathbf{M}}+T^{-1}\cdot\varphi_{t_{j}}\varphi_{t_{j}}^{\top})\geq\lambda_{\min}(\widetilde{\mathbf{M}}), since for any x\in\mathbb{R}^{d}, x^{\top}\widetilde{\mathbf{M}}x+T^{-1}\cdot(x^{\top}\varphi_{t_{j}})^{2}\geq x^{\top}\widetilde{\mathbf{M}}x. Therefore, \widetilde{\mathbf{M}}+T^{-1}\cdot\varphi_{t_{j}}\varphi_{t_{j}}^{\top} is also full-rank; adding the remaining rank-one terms one by one shows that \mathbf{M} is full-rank as well, which completes the proof. ∎

Lemma B.1 can be used directly to show that with a rich enough data set \mathcal{D}=\{(s_{t},a_{t})\}_{t=1,\cdots,T}, the matrix \mathbf{M}^{\text{MDP}} defined in Assumption 4.11 is full rank. This argument also applies to the matrix \mathbf{M}^{\text{MG}}, and can be used to justify the rationale behind Assumption 4.11.
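Lemma B.1 is easy to check numerically; a minimal NumPy sketch with random Gaussian features (an illustrative choice, for which full column-rank holds with probability one):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 10, 3

# T feature vectors whose stacked matrix [phi_1, ..., phi_T]^T has
# full column rank d (true almost surely for Gaussian draws).
Phi = rng.standard_normal((T, d))
assert np.linalg.matrix_rank(Phi) == d

# M = T^{-1} * sum_t phi_t phi_t^T, as in Lemma B.1.
M = Phi.T @ Phi / T
lam_min = np.linalg.eigvalsh(M).min()
# Lemma B.1: M is full-rank, so its smallest eigenvalue is positive.
```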

We have the following proposition to bound the empirical covering number of a function class using its pseudo-dimension. The proof of the proposition can be found in the proof of (Haussler, 1995, Corollary 3).

###### Proposition B.2 (Haussler (1995), Corollary 3).

For any set \mathcal{X}, any points x^{1:T}\in\mathcal{X}^{T}, any class \mathcal{H} of functions on \mathcal{X} taking values in [0,K] with pseudo-dimension V_{\mathcal{H}^{+}}<\infty, and any \epsilon>0, it holds that

 \displaystyle\mathcal{N}_{1}(\epsilon,\mathcal{H},x^{1:T})\leq e(V_{\mathcal{H}^{+}}+1)\bigg(\frac{2eK}{\epsilon}\bigg)^{V_{\mathcal{H}^{+}}}.

For distributed optimization with strongly convex objectives, the following lemma from Nedic et al. (2017) characterizes the geometric convergence rate of the DIGing algorithm (see Algorithm 2) under time-varying communication networks. Lemma B.3 is used in the proof of Corollary 4.13.

###### Lemma B.3 (Nedic et al. (2017), Theorem 10).

Consider the decentralized optimization problem

 \displaystyle\min_{x\in\mathbb{R}^{p}}~\frac{1}{N}\sum_{i=1}^{N}f_{i}(x), (B.1)

where for each i\in[N], f_{i}:\mathbb{R}^{p}\to\mathbb{R} is differentiable, \mu_{i}-strongly convex, and has L_{i}-Lipschitz continuous gradients. Suppose Assumption 4.12 on the consensus matrix \mathbf{C} holds, and let \delta and B be the parameters in that assumption. Then there exists a constant \lambda=\lambda(\chi,B,N,\overline{L},\overline{\mu})\in[0,1), where \overline{\mu}=N^{-1}\cdot\sum_{i=1}^{N}\mu_{i} and \overline{L}=\max_{i\in[N]}L_{i}, such that the sequence of matrices \{[x^{1}_{l},\cdots,x^{N}_{l}]^{\top}\} generated by the DIGing algorithm over iterations l=0,1,\cdots converges at a geometric rate O(\lambda^{l}) to the matrix \bm{x}^{*}=\bm{1}(x^{*})^{\top}, where x^{*} is the unique solution to (B.1).
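As an illustration of Lemma B.3, the DIGing updates — consensus on the iterates plus gradient tracking — can be sketched on a simple decentralized quadratic problem. The complete three-agent network, fixed consensus matrix, step size, and local quadratics below are illustrative choices, not those of the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 3, 2  # number of agents, dimension

# Local strongly convex quadratics f_i(x) = 0.5 x^T H_i x - b_i^T x,
# with H_i diagonal and positive definite (illustrative choice).
H = [np.diag(rng.uniform(1.0, 2.0, size=p)) for _ in range(N)]
b = [rng.standard_normal(p) for _ in range(N)]
grad = lambda i, x: H[i] @ x - b[i]

# Unique minimizer of sum_i f_i.
x_star = np.linalg.solve(sum(H), sum(b))

# Doubly stochastic consensus matrix over a complete 3-agent network.
C = np.array([[0.5, 0.25, 0.25],
              [0.25, 0.5, 0.25],
              [0.25, 0.25, 0.5]])

# DIGing: consensus step on the iterates x plus gradient tracking in y.
alpha = 0.02
x = np.zeros((N, p))
y = np.stack([grad(i, x[i]) for i in range(N)])  # y_i^0 = grad f_i(x_i^0)
for _ in range(3000):
    x_new = C @ x - alpha * y
    y = C @ y + np.stack([grad(i, x_new[i]) - grad(i, x[i]) for i in range(N)])
    x = x_new

# All local iterates reach consensus on the global minimizer x_star.
err = max(np.linalg.norm(x[i] - x_star) for i in range(N))
```

Plotting the error against the iteration index on a log scale shows the straight line that the geometric rate O(\lambda^{l}) predicts.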

### B.1 Proof of Lemma 5.4

###### Proof.

First consider N=1, i.e., \mathbf{Q}=Q^{1}\in\mathcal{H}. Let \{Q_{l}\}_{l\in[\mathcal{N}_{1}(\epsilon^{\prime},\mathcal{H},\mathcal{D}_{y})]} be the \epsilon^{\prime}-covering of \mathcal{H}(\mathcal{D}_{y}) for some \epsilon^{\prime}>0. Recall that \mathcal{A}=\{a_{1},\cdots,a_{A}\}. We now show that \{\max_{a\in\mathcal{A}}Q_{l}(\cdot,a)\}_{l\in[\mathcal{N}_{1}(\epsilon^{\prime},\mathcal{H},\mathcal{D}_{y})]} is a covering of \mathcal{H}^{\vee}_{1}.

By definition, for any Q^{1}\in\mathcal{H}, there exists an l^{\prime}\in[\mathcal{N}_{1}(\epsilon^{\prime},\mathcal{H},\mathcal{D}_{y})] such that

 \displaystyle\frac{1}{AT}\sum_{t=1}^{T}\sum_{j=1}^{A}\Big|Q^{1}(y_{t},a_{j})-Q_{l^{\prime}}(y_{t},a_{j})\Big|\leq\epsilon^{\prime}.

Let \epsilon^{\prime}=\epsilon/A, and let a^{Q}_{t}\in\mathop{\mathrm{argmax}}_{a\in\mathcal{A}}|Q^{1}(y_{t},a)-Q_{l^{\prime}}(y_{t},a)|. Then we have

 \displaystyle\frac{1}{T}\sum_{t=1}^{T}\Big|\max_{a\in\mathcal{A}}Q^{1}(y_{t},a)-\max_{a\in\mathcal{A}}Q_{l^{\prime}}(y_{t},a)\Big|\leq\frac{1}{T}\sum_{t=1}^{T}\max_{a\in\mathcal{A}}\Big|Q^{1}(y_{t},a)-Q_{l^{\prime}}(y_{t},a)\Big|
 \displaystyle\quad=\frac{1}{T}\sum_{t=1}^{T}\Big|Q^{1}(y_{t},a^{Q}_{t})-Q_{l^{\prime}}(y_{t},a^{Q}_{t})\Big|\leq\frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{A}\Big|Q^{1}(y_{t},a_{j})-Q_{l^{\prime}}(y_{t},a_{j})\Big|\leq\frac{\epsilon}{A}\cdot A={\epsilon},

where the second inequality follows from the fact that

 \displaystyle\big|Q^{1}(y_{t},a^{Q}_{t})-Q_{l^{\prime}}(y_{t},a^{Q}_{t})\big|\leq\sum_{j=1}^{A}\big|Q^{1}(y_{t},a_{j})-Q_{l^{\prime}}(y_{t},a_{j})\big|.

This shows that \mathcal{N}_{1}({\epsilon},\mathcal{H}^{\vee}_{1},y_{Z})\leq\mathcal{N}_{1}({\epsilon}/{A},\mathcal{H},\mathcal{D}_{y}). Moreover, since each function in \mathcal{H}^{\vee}_{N} is an average of N functions from \mathcal{H}^{\vee}_{1}, we have

 \displaystyle\mathcal{N}_{1}({\epsilon},\mathcal{H}^{\vee}_{N},y_{Z})\leq\big[\mathcal{N}_{1}({\epsilon},\mathcal{H}^{\vee}_{1},y_{Z})\big]^{N}\leq\big[\mathcal{N}_{1}({\epsilon}/{A},\mathcal{H},\mathcal{D}_{y})\big]^{N}.

On the other hand, by Corollary 3 in Haussler (1995) (see also Proposition B.2 in §B), we can bound \mathcal{N}_{1}({\epsilon}/{A},\mathcal{H},\mathcal{D}_{y}) by the pseudo-dimension of \mathcal{H}, i.e.,

 \displaystyle\mathcal{N}_{1}({\epsilon},\mathcal{H}^{\vee}_{N},y_{Z})\leq\bigg[e(V_{\mathcal{H}^{+}}+1)\bigg(\frac{2eQ_{\max}A}{\epsilon}\bigg)^{V_{\mathcal{H}^{+}}}\bigg]^{N},

which concludes the proof. ∎
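The covering argument above can be sanity-checked numerically on a tiny finite class: any \epsilon/A-cover of \mathcal{H} induces an \epsilon-cover of the max-class, so the (proper, centers-in-class) covering numbers obey the stated inequality. A brute-force sketch, where the random value tables standing in for \mathcal{H}(\mathcal{D}_{y}) are illustrative:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
T, A, n_fun = 4, 3, 6

# A small finite class H: each member is a table Q[t, a] of values
# at T data points and A actions (illustrative stand-in for H(D_y)).
H = [rng.uniform(0, 1, size=(T, A)) for _ in range(n_fun)]
H_max = [Q.max(axis=1) for Q in H]  # induced class {max_a Q(., a)}

def min_proper_cover(points, dist, eps):
    """Smallest number of class members whose eps-discs cover the class
    (exhaustive search over subsets; proper covering, centers in class)."""
    n = len(points)
    for m in range(1, n + 1):
        for centers in itertools.combinations(range(n), m):
            if all(any(dist(points[i], points[c]) < eps for c in centers)
                   for i in range(n)):
                return m
    return n

d_full = lambda f, g: np.abs(f - g).mean()  # empirical l1 over (t, a) pairs
d_max = lambda f, g: np.abs(f - g).mean()   # empirical l1 over t

eps = 0.3
n_cover_max = min_proper_cover(H_max, d_max, eps)
n_cover_H = min_proper_cover(H, d_full, eps / A)
# Lemma 5.4 with N = 1: N_1(eps, H_max) <= N_1(eps / A, H).
```

Mapping each center Q_{c} of the \epsilon/A-cover to \max_{a}Q_{c}(\cdot,a) yields the \epsilon-cover, exactly as in the displayed chain of inequalities.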

### B.2 Proof of Lemma 5.8

###### Proof.

The proof is similar to that of Lemma 5.4 in §B.1. Let \{Q_{l}\}_{l\in[\mathcal{N}_{1}(\epsilon^{\prime},\mathcal{H},\mathcal{D}_{y})]} be the \epsilon^{\prime}-covering of \mathcal{H}(\mathcal{D}_{y}) for some \epsilon^{\prime}>0. Then we claim that

 \displaystyle\Big\{\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}[Q_{l}(\cdot,a^{\prime},b^{\prime})]\Big\}_{l\in[\mathcal{N}_{1}(\epsilon^{\prime},\mathcal{H},\mathcal{D}_{y})]}

covers \mathcal{H}^{\vee}_{1}. By definition, for any Q^{1,1}\in\mathcal{H}, there exists l^{\prime}\in[\mathcal{N}_{1}(\epsilon^{\prime},\mathcal{H},\mathcal{D}_{y})] such that

 \displaystyle\frac{1}{ABT}\sum_{t=1}^{T}\sum_{j=1}^{A}\sum_{k=1}^{B}\Big|Q^{1,1}(y_{t},a_{j},b_{k})-Q_{l^{\prime}}(y_{t},a_{j},b_{k})\Big|\leq\epsilon^{\prime}.

Let (\pi^{Q}_{t},\sigma^{Q}_{t}) be the strategy pair given y_{t} such that

 \displaystyle\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A}),\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\Big|\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}[Q^{1,1}(y_{t},a^{\prime},b^{\prime})]-\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}[Q_{l^{\prime}}(y_{t},a^{\prime},b^{\prime})]\Big|
 \displaystyle\quad=\Big|\mathbb{E}_{\pi^{Q}_{t},\sigma^{Q}_{t}}[Q^{1,1}(y_{t},a^{\prime},b^{\prime})]-\mathbb{E}_{\pi^{Q}_{t},\sigma^{Q}_{t}}[Q_{l^{\prime}}(y_{t},a^{\prime},b^{\prime})]\Big|.

Then, letting \epsilon^{\prime}=\epsilon/(AB), we have

 \displaystyle\frac{1}{T}\sum_{t=1}^{T}\Big|\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}[Q^{1,1}(y_{t},a^{\prime},b^{\prime})]-\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A})}\min_{\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}[Q_{l^{\prime}}(y_{t},a^{\prime},b^{\prime})]\Big|
 \displaystyle\quad\leq\frac{1}{T}\sum_{t=1}^{T}\max_{\pi^{\prime}\in\mathcal{P}(\mathcal{A}),\sigma^{\prime}\in\mathcal{P}(\mathcal{B})}\Big|\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}[Q^{1,1}(y_{t},a^{\prime},b^{\prime})]-\mathbb{E}_{\pi^{\prime},\sigma^{\prime}}[Q_{l^{\prime}}(y_{t},a^{\prime},b^{\prime})]\Big|
 \displaystyle\quad=\frac{1}{T}\sum_{t=1}^{T}\Big|\mathbb{E}_{\pi^{Q}_{t},\sigma^{Q}_{t}}[Q^{1,1}(y_{t},a^{\prime},b^{\prime})]-\mathbb{E}_{\pi^{Q}_{t},\sigma^{Q}_{t}}[Q_{l^{\prime}}(y_{t},a^{\prime},b^{\prime})]\Big|
 \displaystyle\quad\leq\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\pi^{Q}_{t},\sigma^{Q}_{t}}\Big|Q^{1,1}(y_{t},a^{\prime},b^{\prime})-Q_{l^{\prime}}(y_{t},a^{\prime},b^{\prime})\Big|\leq\frac{\epsilon}{AB}\cdot(AB)={\epsilon}.

Thus, by the relation between \mathcal{H}^{\vee}_{N} and \mathcal{H}^{\vee}_{1}, and Proposition B.2, we further obtain

 \displaystyle\mathcal{N}_{1}({\epsilon},\mathcal{H}^{\vee}_{N},y_{Z})\leq\big[\mathcal{N}_{1}({\epsilon}/{(AB)},\mathcal{H},\mathcal{D}_{y})\big]^{N}\leq\bigg[e(V_{\mathcal{H}^{+}}+1)\bigg(\frac{2eQ_{\max}AB}{\epsilon}\bigg)^{V_{\mathcal{H}^{+}}}\bigg]^{N},

which concludes the proof. ∎
