THE STOCHASTIC ASYMPTOTIC STABILITY OF NASH EQUILIBRIA IN FTRL DYNAMICS

# FROM LEARNING WITH PARTIAL INFORMATION TO BANDITS: ONLY STRICT NASH EQUILIBRIA ARE STABLE.

Angeliki Giannou^{\ast,c}, Emmanouil V. Vlatakis–Gkaragkounis^{\ddagger}, and Panayotis Mertikopoulos^{\sharp,\diamond}

^{c} Corresponding author. ^{\ast} School of Electrical & Computer Engineering, National Technical University of Athens, Athens, Greece. ^{\ddagger} Department of Computer Science, Columbia University, New York, NY 10025, USA. ^{\sharp} Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France. ^{\diamond} Criteo AI Lab.
###### Abstract.

In this paper, we examine the Nash equilibrium convergence properties of no-regret learning in general N-player games. Despite the importance and widespread applications of no-regret algorithms, their long-run behavior in multi-agent environments is still far from understood, and most of the literature has focused by necessity on certain, specific classes of games (typically zero-sum or congestion games). Instead of focusing on a fixed class of games, we take a structural approach and examine different classes of equilibria in generic games. For concreteness, we focus on the archetypal "follow the regularized leader" (FTRL) class of algorithms, and we consider the full spectrum of information uncertainty that the players may encounter – from noisy, oracle-based feedback, to bandit, payoff-based information. In this general context, we establish a comprehensive equivalence between the stability of a Nash equilibrium and its support: a Nash equilibrium is stable and attracting with arbitrarily high probability if and only if it is strict (i.e., each equilibrium strategy has a unique best response). This result extends existing continuous-time versions of the "folk theorem" of evolutionary game theory to a bona fide discrete-time learning setting, and provides an important link between the literature on multi-armed bandits and the equilibrium refinement literature.


## 1. Introduction

The prototypical framework for online learning in games can be summarized as follows:

1. At each stage of the process, every participating agent chooses an action from some finite set.

2. Each agent receives a reward based on an (a priori unknown) payoff function and the actions of all participating players.

3. The players record their rewards and any other feedback generated during the payoff phase, and the process repeats.

This multi-agent framework has both important similarities and major differences with the setting of single-agent online learning. Indeed, if we isolate a single player and abstract away all others as a generic “environmental influence”, we essentially recover a version of the multi-armed bandit (MAB) problem – stochastic or adversarial, depending on the assumptions governing the actions of the non-focal players [CBL06, BCB12]. In this case, the most widely used figure of merit is the agent’s regret, i.e., the difference between the agent’s cumulative payoff and that of the best fixed action in hindsight. Accordingly, much of the literature on online learning has focused on deriving regret bounds that are min-max optimal, both in terms of the horizon T of the process, as well as the number of actions A available to the focal player.

On the other hand, from a game-theoretic standpoint, the main question that arises is whether players eventually settle on an equilibrium profile from which no player has an incentive to deviate. In this regard, a well-known blanket prediction states that, if all players follow a no-regret policy, the empirical frequency of play converges to the game’s Hannan set [Han57] – otherwise referred to as the set of coarse correlated equilibria [CBL06, NRTV07]. However, this notion is considerably weaker than that of a Nash equilibrium, which embodies the individual rationality principle stated above (resilience to unilateral deviations). In particular, even in 2-player games, coarse correlated equilibria may be supported exclusively on strictly dominated strategies [VZ13], in which case the “Hannan consistency” prediction fails even the most basic axioms of rationalizability [FT91, DF90].

In this paper, we focus on the interface of these considerations, namely the Nash equilibrium convergence properties of no-regret learning in games. In more detail, we examine the widely studied class of online learning algorithms known as follow the regularized leader (FTRL) and we ask the following questions: If all players of a finite game employ an FTRL policy, does the induced sequence of play converge to a Nash equilibrium? And, if so, are all Nash equilibria equally likely to emerge as end states of the players’ learning process?

##### Related work.

Existing works in the literature provide a seemingly contradictory account. To begin with, in a recent series of papers, it was shown that the well-known exponential weights (EW) algorithm exhibits chaotic behavior, even in 2\times 2 (symmetric, non-splittable) congestion games [PPP17, CP19]. Such games have a relatively simple equilibrium structure: they are potential games which, generically, admit strict Nash equilibria (i.e., pure-strategy equilibria in which every player has a unique best response). In view of this, the appearance of chaos – due itself to an intrinsic link between the EW algorithm and the replicator dynamics of evolutionary biology [HS98, San10] – provides a resoundingly negative answer to the above questions.

On the flip side of this coin, a related strand of literature has shown that strict Nash equilibria are stable and attracting under all FTRL dynamics [CGM15, MS16]; moreover, these are the only equilibria enjoying this property – known in the theory of dynamical systems as asymptotic stability [CGM15, FVGL+20]. This equivalence mirrors the so-called “folk theorem” of evolutionary game theory, which states that an action profile is asymptotically stable under the replicator dynamics if and only if it is a strict Nash equilibrium [Wei95, HS03]. The common ground of these results is the case of the exponential weights algorithm, whose underlying mean-field dynamics is precisely the replicator equation [Rus99, HSV09]. In this case, the results mentioned above show that the EW dynamics converge to strict Nash equilibria (the only stable attractors of the dynamics).

This apparent paradox – chaos versus the asymptotic stability of strict Nash equilibria under FTRL – is due to a fundamental difference in the modeling assumptions of the two series of results. Chaotic behavior arises when an FTRL algorithm is executed in discrete time, as in the presentation of the game-theoretic learning framework before. By contrast, the results establishing the convergence of FTRL to strict Nash equilibria concern the continuous-time counterpart of the dynamics. This gap between discrete- and continuous-time models is not new in the online learning literature [Sor09], but it has a game-changing impact on the range of questions outlined above: in particular, it shows that the behavior of no-regret learning in games can be qualitatively different in these two regimes.

A further consideration which is of crucial importance in discrete-time models of learning is the information available to the players at each stage of the process. In terms of regret, an agent running the EW algorithm with full knowledge of their payoff vector – or even a noisy estimate thereof – can achieve \operatorname{\mathcal{O}}(\sqrt{\log A\cdot T}) regret against an informed adversary [ACBFS95, FS99, ACBFS02]. However, if the player only observes their realized, in-game payoffs, the corresponding guarantee becomes \operatorname{\mathcal{O}}(\sqrt{A\log A\cdot T}), and this bound can only be improved up to logarithmic factors in A (the number of different strategies) [ACBFS02, AB10, BCB12]. It is therefore plausible to expect that the type of feedback received by the players plays a key role in the equilibrium convergence properties of FTRL in discrete time.

##### Our contributions.

Our aim in this paper is to provide answers to the following questions:

1. Is there a link between strict Nash equilibria and discrete-time FTRL algorithms?

2. Is this link affected by the type of feedback available to the players?

To study the role of the players’ feedback model, we examine in detail two contrasting settings: (a) the case where players have access to a payoff oracle that provides an estimate of their payoff vectors; and (b) the bandit case, where players only observe their in-game payoffs and use an inverse propensity scoring scheme to estimate the payoff of their other actions. As discussed above, these regimes lead to starkly different guarantees from a regret minimization perspective. However, as far as the algorithms’ equilibrium convergence properties are concerned (and quite surprisingly at that), it turns out that they are equivalent.

In more detail, we show that under the standard hyperparameter choices used to achieve no-regret (step-size, explicit exploration weights, etc.), all FTRL algorithms enjoy the following properties:

1. Strict Nash equilibria are stochastically asymptotically stable – i.e., they are stable and attracting with arbitrarily high probability.

2. Only strict Nash equilibria have this property: mixed Nash equilibria cannot attract an FTRL algorithm with positive probability.

Importantly, this two-way equivalence between strict equilibria and stochastic asymptotic stability is independent of the type of information available to the players (payoff-based, oracle-based, or otherwise). To the best of our knowledge, this is the first result of its kind for realistic models of learning in games, and it represents a clear refinement criterion for the prediction of the day-to-day behavior of no-regret learners.

Because learning in discrete time is an inherently stochastic process, our results are also stochastic in nature – hence the requirement for asymptotic stability with arbitrarily high probability. This constitutes a major point of departure from continuous-time models of learning, such as the strict equilibrium stability results of [MS16, FVGL+20]. As a result, our proof techniques are also radically different: instead of relying on volume-preservation arguments, the crucial argument in the proof of the instability of mixed equilibria is provided by a direct probabilistic estimate which shows that a certain class of stochastic processes avoids zero with arbitrarily high probability. In the converse direction, the principal challenge in our proof of the stability of strict Nash equilibria is to control the aggregation of error terms with unbounded variance. Because of this, stochastic approximation techniques that have been used in the literature to show convergence in potential games [CHM17-NIPS] or strict equilibria with L^{2}-bounded feedback [MZ19] cannot be applied in our setting. Instead, our approach relies on a combination of subsequence extraction arguments and (sub)martingale limit and convergence theory.

## 2. Problem setup and preliminaries

### 2.1. Notation

If \mathcal{V} is a finite-dimensional space, then \mathcal{V}^{*} is its dual space and \langle x,y\rangle will denote the ordinary pairing between x\in\mathcal{V} and y\in\mathcal{V}^{*}. Furthermore, if \mathcal{S} is a finite set then the vector space spanned by \mathcal{S} will be denoted by \mathbb{R}^{\mathcal{S}} and its canonical basis by \{e_{s}\}_{s\in\mathcal{S}}. The simplex of \mathbb{R}^{\mathcal{S}} will be denoted by \Delta(\mathcal{S})=\{x\in\mathbb{R}^{\mathcal{S}}:\sum_{\alpha}x_{\alpha}=1\text{ and }x_{\alpha}\geq 0\}.

### 2.2. The game

Throughout this work we consider finite games in normal form. Formally, a game in normal form is defined by a tuple \Gamma=\Gamma(\mathcal{N},\mathcal{A},u) consisting of the set of players \mathcal{N}=\{1,2,...,N\}; the finite set of pure strategies \mathcal{A}_{i} of each player i\in\mathcal{N}; and the players’ payoff functions u_{i}:\mathcal{A}\to\mathbb{R}, where \mathcal{A}\equiv\prod_{i}\mathcal{A}_{i} denotes the set of all pure strategy profiles \alpha=(\alpha_{1},...,\alpha_{N}), \alpha_{i}\in\mathcal{A}_{i}. Players can play either a pure strategy or a probability distribution x_{i}\in\Delta(\mathcal{A}_{i}) over their pure strategies, called a mixed strategy. The mixed strategy space of a player i will be denoted as \mathcal{X}_{i}\equiv\Delta(\mathcal{A}_{i}), while \mathcal{X}=\prod_{i}\mathcal{X}_{i} denotes the set of all mixed strategy profiles x=(x_{1},...,x_{N}). For convenience, we adopt the shorthand x=(x_{1},\ldots,x_{N})\equiv(x_{i};x_{-i}).

Given a mixed strategy profile x\in\mathcal{X}, the expected payoff of player i will be

 u_{i}(x)\equiv u_{i}(x_{i};x_{-i})=\sum_{\alpha_{1}\in\mathcal{A}_{1}}\cdots\sum_{\alpha_{N}\in\mathcal{A}_{N}}u_{i}(\alpha_{1},...,\alpha_{N})x_{1,\alpha_{1}}\cdots x_{N,\alpha_{N}} (1)

where u_{i}(\alpha_{1},...,\alpha_{N}) is the corresponding payoff of player i to the pure strategy profile \alpha=(\alpha_{1},...,\alpha_{N})\in\mathcal{A}. The payoff corresponding to an individual action of player i, \alpha_{i}\in\mathcal{A}_{i}, is

 v_{i\alpha_{i}}(x)\equiv u_{i}(\alpha_{i};x_{-i}) (2)

Thus, the vector v_{i}(x)\equiv(v_{i\alpha_{i}}(x))_{\alpha_{i}\in\mathcal{A}_{i}} constitutes the mixed payoff vector of player i, satisfying

 u_{i}(x)=\langle v_{i}(x),x_{i}\rangle=\sum_{\alpha_{i}\in\mathcal{A}_{i}}x_{i\alpha_{i}}v_{i\alpha_{i}}(x) (3)

For convenience, when all players employ pure strategies, the pure payoff vector of player i against the pure strategy profile \alpha_{-i}\in\mathcal{A}_{-i} of the other players will be denoted as

 v_{i}(\alpha)\equiv(u_{i}(\alpha_{i};\alpha_{-i}))_{\alpha_{i}\in\mathcal{A}_{i}} (4)
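As a minimal illustration of (1)–(3), the following Python sketch computes the expected payoff u_{i}(x) and the mixed payoff vector v_{i}(x) by direct enumeration of pure strategy profiles. The payoff table, the mixed profile and all function names are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from itertools import product

# Hypothetical 2-player game (matching-pennies-like payoffs, for illustration only).
# PAYOFFS[i][(a1, a2)] is u_i at the pure strategy profile (a1, a2).
ACTIONS = [2, 2]  # number of pure strategies of each player
PAYOFFS = [
    {(0, 0): 1.0, (0, 1): -1.0, (1, 0): -1.0, (1, 1): 1.0},
    {(0, 0): -1.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): -1.0},
]


def expected_payoff(i, x):
    """u_i(x) as in (1): sum over pure profiles, weighted by the product of probabilities."""
    total = 0.0
    for profile in product(*(range(n) for n in ACTIONS)):
        prob = 1.0
        for j, a in enumerate(profile):
            prob *= x[j][a]
        total += PAYOFFS[i][profile] * prob
    return total


def payoff_vector(i, x):
    """v_i(x) as in (2): payoff of each pure strategy of player i against x_{-i}."""
    vec = []
    for a in range(ACTIONS[i]):
        x_dev = list(x)
        # Replace player i's mixed strategy by the pure strategy a.
        x_dev[i] = [1.0 if b == a else 0.0 for b in range(ACTIONS[i])]
        vec.append(expected_payoff(i, x_dev))
    return vec


x = [[0.25, 0.75], [0.6, 0.4]]   # an arbitrary mixed strategy profile
v0 = payoff_vector(0, x)
u0 = expected_payoff(0, x)
inner = sum(p * v for p, v in zip(x[0], v0))  # <v_0(x), x_0>, cf. identity (3)
```

In this sketch `u0` and `inner` agree up to floating-point error, which is exactly the content of (3).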

### 2.3. Nash equilibrium

The most widely used solution concept is that of Nash equilibrium (\operatorname{NE}), i.e., a state x^{\ast}\in\mathcal{X} such that each player has no incentive to unilaterally deviate. Formally,

 u_{i}(x^{\ast})\geq u_{i}(x_{i};x^{\ast}_{-i})\text{ for all }x_{i}\in\mathcal{X}_{i}\text{ and all }i\in\mathcal{N} (NE)

We will write \operatorname{supp}(x^{\ast}_{i})=\{\alpha_{i}\in\mathcal{A}_{i}:x^{\ast}_{i\alpha_{i}}>0\} for the support of x^{\ast}_{i}, so Nash equilibria can be equivalently characterized via the inequality

 v_{i\alpha^{*}_{i}}(x^{\ast})\geq v_{i\alpha_{i}}(x^{\ast})\text{ for all }\alpha^{*}_{i}\in\operatorname{supp}(x^{\ast}_{i})\text{ and all }\alpha_{i}\in\mathcal{A}_{i},\;i\in\mathcal{N} (5)

The above characterization gives rise to the following classification of Nash equilibria:

1. x^{\ast} is a pure Nash equilibrium if x^{\ast}_{i} attributes positive probability to only one pure strategy for all i\in\mathcal{N}, i.e., if each \operatorname{supp}(x^{\ast}_{i}) is a singleton.

2. x^{\ast} is a mixed Nash equilibrium in any other case. If \operatorname{supp}(x^{\ast}_{i})=\mathcal{A}_{i} for all i\in\mathcal{N} then x^{\ast} is a fully mixed Nash equilibrium.

It is of interest to notice that by definition, pure Nash equilibria correspond to vertices of \mathcal{X}, mixed Nash equilibria lie in the relative interior of the face of the simplex spanned by the support of each equilibrium and fully mixed Nash equilibria lie in the relative interior of \mathcal{X}, \operatorname{ri}(\mathcal{X}).

Additionally, Nash equilibria can be further distinguished according to the inequality (5): if this inequality is strict for all \alpha_{i}\in\mathcal{A}_{i}\setminus\operatorname{supp}(x^{\ast}_{i}), i\in\mathcal{N}, then the equilibrium is called quasi-strict. A result from [Rit94] states that all Nash equilibria in all but a measure-zero set of games are quasi-strict. A game that satisfies this property is referred to as a generic game. A quasi-strict equilibrium can be either mixed or pure. A pure Nash equilibrium that is quasi-strict is known as a strict Nash equilibrium; at such an equilibrium, any unilateral deviation from a player’s equilibrium strategy results in a strictly worse payoff.
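To make the distinction between pure and strict Nash equilibria concrete, the sketch below checks the unilateral deviation gaps of a pure profile in a small game. The payoff table (a Prisoner's Dilemma) and the helper names are hypothetical and serve only as an illustration.

```python
# Hypothetical 2x2 game (a Prisoner's Dilemma): action 0 = cooperate, 1 = defect.
# U[i][(a0, a1)] is the payoff of player i at the pure profile (a0, a1).
U = [
    {(0, 0): 3, (0, 1): 0, (1, 0): 5, (1, 1): 1},  # player 0
    {(0, 0): 3, (0, 1): 5, (1, 0): 0, (1, 1): 1},  # player 1
]
N_ACTIONS = [2, 2]


def deviation_gaps(profile):
    """Payoff at `profile` minus the payoff of every unilateral pure deviation."""
    gaps = []
    for i, n in enumerate(N_ACTIONS):
        for a in range(n):
            if a == profile[i]:
                continue
            dev = list(profile)
            dev[i] = a
            gaps.append(U[i][tuple(profile)] - U[i][tuple(dev)])
    return gaps


def is_pure_nash(profile):
    # No unilateral deviation is strictly profitable.
    return all(g >= 0 for g in deviation_gaps(profile))


def is_strict_nash(profile):
    # Every unilateral deviation is strictly worse.
    return all(g > 0 for g in deviation_gaps(profile))
```

In this hypothetical game, mutual defection `(1, 1)` passes the strictness check, whereas mutual cooperation `(0, 0)` is not even a Nash equilibrium.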

### 2.4. No regret learning

In the context of online optimization, a key requirement is the minimization of the players’ regret. Each player seeks to minimize the cumulative payoff difference between her best possible fixed strategy in hindsight and her choice of mixed strategies up to a time T. Thus the (external) regret of player i against a sequence of play x_{k} is defined as:

 \operatorname{Reg}_{i}(T)=\max_{x_{i}\in\mathcal{X}_{i}}\sum_{k=0}^{T}[u_{i}(x_{i};x_{-i,k})-u_{i}(x_{i,k};x_{-i,k})] (6)

and we will say that player i has no regret if \operatorname{Reg}_{i}(T)=o(T).
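The regret of a single player can be computed directly from the sequence of payoff vectors she faced and the mixed strategies she played. The sketch below is a hypothetical helper (the function name and input format are our own); it uses the fact that the payoff is linear in the player's own strategy, so the best fixed strategy in hindsight can be taken to be a pure one.

```python
def external_regret(payoff_vectors, strategies):
    """External regret of one player.

    payoff_vectors[k] is the payoff vector v_i(x_k) faced at round k,
    strategies[k] is the mixed strategy x_{i,k} played at round k.
    """
    n_actions = len(payoff_vectors[0])
    # Cumulative payoff of each fixed pure action over all rounds.
    cumulative = [sum(v[a] for v in payoff_vectors) for a in range(n_actions)]
    # Cumulative expected payoff actually obtained: sum_k <v_i(x_k), x_{i,k}>.
    realized = sum(
        sum(p * va for p, va in zip(x, v))
        for x, v in zip(strategies, payoff_vectors)
    )
    return max(cumulative) - realized


# Toy data: three rounds, two actions, the player always mixes uniformly.
vs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
xs = [[0.5, 0.5]] * 3
reg = external_regret(vs, xs)  # best fixed action earns 2.0, the player earned 1.5
```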

One of the most widely used no-regret schemes is follow the regularized leader (FTRL). Roughly speaking, at each round FTRL prescribes the strategy that maximizes the cumulative payoff up to this round minus a strongly convex function, the regularizer, which smooths the transition between different strategies during play. A standard representation of FTRL is

 \begin{aligned} Y_{i,n+1}&=Y_{i,0}+\sum_{k=1}^{n}\gamma_{k}\hat{v}_{i,k} &&\text{(aggregate payoffs)}\\ X_{i,n+1}&=Q_{i}(Y_{i,n+1}) &&\text{(choice of strategy)} \end{aligned}

or recursively

 \begin{aligned} Y_{i,n+1}&=Y_{i,n}+\gamma_{n}\hat{v}_{i,n}\\ X_{i,n+1}&=Q_{i}(Y_{i,n+1}) \end{aligned} (FTRL)

where \gamma_{n}>0 is a step-size parameter, \hat{v}_{i,n} is an imperfect "model" of v_{i}(X_{n}), Q_{i}(Y)=\operatorname*{arg\,max}_{x\in\mathcal{X}_{i}}\{\langle Y,x\rangle-h_{i}(x)\} for all Y\in\mathcal{Y}_{i}, and h_{i} is the individual regularizer that each player employs. Notice that the regularizer yields a smoother version of \operatorname*{arg\,max}_{x\in\mathcal{X}_{i}}\{\langle Y_{i,n},x\rangle\}, which does not permit players to abruptly change their decisions. We assume that the regularizers satisfy the following properties:

1. h_{i} is continuous over the simplex \Delta(\mathcal{A}_{i}) for all players i\in\mathcal{N}.

2. h_{i} has locally Lipschitz continuous gradient for all players i\in\mathcal{N} on the relative interior of every face of \Delta(\mathcal{A}_{i}).

3. h_{i} is decomposable for all i\in\mathcal{N}, i.e., h_{i}(x_{i})=\sum_{\alpha_{i}\in\mathcal{A}_{i}}\theta_{i}(x_{i\alpha_{i}}), where \theta_{i} is continuous and strongly convex on [0,1] and differentiable on (0,1].

4. h_{i} is strongly convex on \Delta(\mathcal{A}_{i}), i.e., there exists K_{i}>0 such that

 h_{i}(tx+(1-t)y)\leq th_{i}(x)+(1-t)h_{i}(y)-\frac{K_{i}}{2}t(1-t)\lVert y-x\rVert^{2} (7)

for all x,y\in\Delta(\mathcal{A}_{i}), i\in\mathcal{N} and for all t\in[0,1].

Furthermore, h will symbolize the aggregate function of all the regularizers i.e., h(x)=\sum_{i}h_{i}(x_{i}), with strong convexity parameter K\equiv\min_{i}K_{i}. Different regularizers give rise to algorithms of different underlying dynamics; below we present two characteristic examples of the algorithms generated by the choice of different regularizers.

###### Example 2.1 (Exponential Weights/ Multiplicative Weights Update).

A classic choice of regularizer is the (negative) Gibbs-Shannon entropy h_{i}(x)=\sum_{\alpha}x_{\alpha}\log x_{\alpha}, which after some standard calculations leads to the choice map \Lambda_{\alpha}(y)=\exp(y_{\alpha})/\sum_{\beta}\exp(y_{\beta}) and the algorithm known as multiplicative weights update (MWU).

###### Example 2.2 (Online Gradient Descent).

Another popular choice of regularizer is the quadratic h_{i}(x)=(1/2)\sum_{\alpha}x_{\alpha}^{2}. In this case the induced choice map is the Euclidean projection map \operatorname{\Pi}_{i}(y)=\operatorname*{arg\,min}_{x\in\Delta(\mathcal{A}_{i})}\lVert y-x\rVert^{2}.
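The two choice maps of Examples 2.1 and 2.2 can be sketched in a few lines of Python. The function names are our own, and the projection uses the standard sort-based routine for projecting onto the simplex, which is an implementation choice made here and not something specified in the text.

```python
import math


def logit_map(y):
    """Choice map of the entropic regularizer: exp(y_a) / sum_b exp(y_b)."""
    m = max(y)  # subtract the maximum for numerical stability
    w = [math.exp(v - m) for v in y]
    s = sum(w)
    return [v / s for v in w]


def euclidean_projection(y):
    """Euclidean projection of y onto the probability simplex (sort-based method)."""
    u = sorted(y, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, uk in enumerate(u, start=1):
        cumsum += uk
        t = (cumsum - 1.0) / k
        if uk - t > 0:  # coordinate k is still in the support of the projection
            tau = t
    return [max(v - tau, 0.0) for v in y]


y = [2.0, 0.0]
x_logit = logit_map(y)            # strictly positive on every coordinate
x_proj = euclidean_projection(y)  # lands on a vertex of the simplex
```

For the same score vector, the entropic choice map stays in the interior of the simplex, while the Euclidean projection reaches the boundary.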

### 2.5. Steep vs non-steep

As mentioned above, players may employ different regularizers h_{i} in their choice maps Q_{i}(y)=\operatorname*{arg\,max}_{x\in\mathcal{X}_{i}}\left\{\langle x,y\rangle-h_{i}(x)\right\}. Depending on the regularizer chosen, FTRL dynamics may differ significantly. To formally express this difference, it is convenient to view h as an extended-real-valued function h:\mathcal{V}\to\mathbb{R}\cup\{\infty\} with value \infty outside of the simplex \mathcal{X}. Then the subdifferential of h at x\in\mathcal{V} is defined as:

 \partial h(x)=\{y\in{\mathcal{V}}^{*}:h({x}^{\prime})\geq h(x)+\langle y,{x}^{\prime}-x\rangle\;\forall{x}^{\prime}\in\mathcal{V}\} (8)

If \partial h(x) is nonempty, then h is called subdifferentiable at x\in\mathcal{X}. When x\in\operatorname{ri}(\mathcal{X}), \partial h(x) is always nonempty, i.e., \operatorname{ri}(\mathcal{X})\subseteq\operatorname{dom}\partial h\equiv\{x\in\mathcal{X}:\partial h(x)\neq\emptyset\}. Notice that when the gradient of h exists, the subdifferential always contains it. With these in mind, we present a typical distinction between regularizers. On the one hand, steep regularizers like the negative Shannon entropy become infinitely steep as x approaches the boundary, i.e., \lVert\nabla h(x)\rVert\to\infty. On the other hand, non-steep regularizers, like the Euclidean one, are everywhere differentiable, allowing the sequence of play to move between the different faces of the simplex. In the dual space of payoffs, steepness implies that the choice map is not surjective (since it cannot map payoffs to points of the boundary); it is, however, injective up to a multiple of (1,1,\ldots,1) (it maps a payoff vector plus a multiple of (1,1,\ldots,1) to the same strategy). Non-steep regularizers give rise to surjective maps, which are not injective, not even up to a multiple of (1,1,\ldots,1), when mapping to the boundary.

### 2.6. Polar cone

The notion of the polar cone is tightly connected with the notion of duality. Given a finite dimensional vector space \mathcal{V}, a convex set \mathcal{C}\subseteq\mathcal{V} and a point x\in\mathcal{C}, the tangent cone \operatorname{TC}_{\mathcal{C}}(x) is the closure of the set of all rays emanating from x and intersecting \mathcal{C} in at least one other point. The dual of the tangent cone is the polar cone \operatorname{PC}_{\mathcal{C}}(x)=\{y\in\mathcal{V}^{*}:\langle y,z\rangle\leq 0\text{ for all }z\in\operatorname{TC}_{\mathcal{C}}(x)\}.
When the convex set under consideration is the simplex of the players’ strategies, the polar cone corresponding to the boundary differs significantly from the one corresponding to the interior. Formally, the polar cone at a point x of the simplex is

 \operatorname{PC}(x)=\{y\in\mathcal{Y}:y_{a}\geq y_{b}\text{ for all }a\in\operatorname{supp}(x),\,b\in\mathcal{A}\} (9)

In particular, y_{a}=y_{b} whenever a,b\in\operatorname{supp}(x).

An illustration of this is depicted in Fig. 1. When (FTRL) is run, the notion of the polar cone emerges from the choice map Q:\mathcal{Y}\to\mathcal{X}, connecting the primal space of the strategies with the dual space of the payoffs. The proposition below presents this exact connection.

###### Proposition 1.

Let h be a strongly convex regularizer that satisfies the properties described in Section 2.4 and let Q:\mathcal{Y}\to\mathcal{X} be the induced choice map. Then:

1. x=Q(y)\Leftrightarrow y\in\partial h(x)

2. \partial h(x)=\nabla h(x)+\operatorname{PC}(x) for all x\in\mathcal{X}.

### 2.7. Feedback separation

Depending on the setting, the feedback which players receive may differ significantly. Below we examine three different cases of payoff feedback \hat{v}_{n} in (FTRL).

#### 2.7.1. Noisy first-order feedback

It is rather common to assume that players have access to a payoff oracle, perfect or imperfect. More precisely, at each round n, after each player i has played a mixed strategy X_{i,n}\in\mathcal{X}_{i}, all players are able to observe their whole payoff vector, either intact or perturbed by noise, i.e., \hat{v}_{i,n}=v_{i}(X_{n})+\xi_{i,n}, where \xi_{i,n} denotes the oracle’s error. If the oracle is perfect (\xi_{i,n}=0), the players fully observe their payoffs corresponding to any X\in\mathcal{X}, without any uncertainty involved.

However, a perfect payoff oracle is generally a demanding requirement. As such, we will focus on the case of an imperfect (stochastic) payoff oracle. For the noise \xi_{n} introduced by the oracle, we make the following mild assumptions:

1. Zero mean:

 \operatorname{\mathbb{E}}[\xi_{i,n}\mid\mathcal{F}_{n}]=0\text{ for all }i\in\mathcal{N},\;n=1,2,\ldots (A1)

2. Bounded variance:

 \operatorname{\mathbb{E}}[\lVert\xi_{i,n}\rVert^{2}_{*}\mid\mathcal{F}_{n}]\leq\sigma_{\textrm{noise}}^{2}\text{ for all }i\in\mathcal{N},\;n=1,2,\ldots (A2)

3. Non-strict coordinate-correlation: For each equilibrium x^{\ast} of the game, there exists player i, strategies a,b\in\operatorname{supp}(x^{\ast}_{i}), \pi>0 and \beta>0 such that

 \operatorname{\mathbb{P}}(\lvert\xi_{ia,n}-\xi_{ib,n}\rvert\geq\beta)\geq\pi\text{ for all }n=1,2,\ldots (A3)

It is worth mentioning that our noise model covers not only the widely used family of isotropic noise distributions, such as Gaussian noise, but also a broad range of error processes, including all compactly supported, (sub-)Gaussian, (sub-)exponential and log-normal distributions. In fact, we will not be assuming i.i.d. errors; this point is crucial for applications to distributed control, where measurements are typically correlated with the state of the system. Intuitively, the last assumption only demands that, with positive probability, the noise does not take the same value on two different strategies of the player that belong to the support of the equilibrium.
Equivalently, in terms of payoffs, the first two assumptions on the noise are:

1. Unbiasedness: \operatorname{\mathbb{E}}[\hat{v}_{i,n}\mid\mathcal{F}_{n}]=v_{i}(X_{n}).

2. Bounded variance: \operatorname{\mathbb{E}}[\lVert\hat{v}_{i,n}\rVert^{2}_{*}\mid\mathcal{F}_{n}]\leq\sigma^{2}.
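To fix ideas, the following sketch runs (FTRL) with the entropic regularizer and a noisy payoff oracle in a hypothetical 2x2 game. The game (a Prisoner's Dilemma, whose unique equilibrium is strict), the i.i.d. Gaussian noise, the step-size schedule \gamma_{n}=1/\sqrt{n} and all function names are assumptions made for illustration and are not prescribed by the text.

```python
import math
import random

# Hypothetical symmetric Prisoner's Dilemma: U[i][a][b] is the payoff of player i
# when she plays a and her opponent plays b. Action 1 (defect) is strictly
# dominant, so the profile (1, 1) is a strict Nash equilibrium.
U = [[[3.0, 0.0], [5.0, 1.0]],
     [[3.0, 0.0], [5.0, 1.0]]]


def logit_map(y):
    """Choice map induced by the entropic regularizer."""
    m = max(y)
    w = [math.exp(v - m) for v in y]
    s = sum(w)
    return [v / s for v in w]


def payoff_vector(i, x_other):
    """v_i(x): expected payoff of each pure action of player i."""
    return [sum(U[i][a][b] * x_other[b] for b in range(2)) for a in range(2)]


def run_ftrl(steps=500, noise_std=1.0, seed=0):
    rng = random.Random(seed)
    Y = [[0.0, 0.0], [0.0, 0.0]]       # aggregate payoffs Y_{i,n}
    X = [logit_map(y) for y in Y]      # mixed strategies X_{i,n}
    for n in range(1, steps + 1):
        gamma = 1.0 / math.sqrt(n)     # assumed step-size schedule
        # Noisy oracle: v_hat_{i,n} = v_i(X_n) + xi_{i,n}, with zero-mean Gaussian xi.
        v_hat = [
            [va + rng.gauss(0.0, noise_std) for va in payoff_vector(i, X[1 - i])]
            for i in range(2)
        ]
        for i in range(2):
            Y[i] = [y + gamma * g for y, g in zip(Y[i], v_hat[i])]
        X = [logit_map(y) for y in Y]
    return X


X_final = run_ftrl()
```

In this toy example the score gap in favor of defection increases by at least \gamma_{n} in expectation at every round, so both players end up placing almost all of their probability on action 1.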

#### 2.7.2. Semi-bandit feedback

In this case, at each round, players choose a strategy to submit based on some probability distribution. Once this is done, each player receives the payoffs corresponding to each of their pure strategies against the pure strategies chosen by all others. Formally, at each round n, each player i\in\mathcal{N} employs a pure strategy \alpha_{i,n}\in\mathcal{A}_{i} based on X_{i,n}\in\mathcal{X}_{i} and then receives the payoff vector v_{i}(\alpha_{n})\equiv(u_{i}(a;\alpha_{-i,n}))_{a\in\mathcal{A}_{i}}. So, in this case \hat{v}_{i,n}=v_{i}(\alpha_{n}). This type of payoff feedback can be viewed as a special case of the one described above, in the sense that:

1. Unbiasedness: \operatorname{\mathbb{E}}[\hat{v}_{i,n}\mid\mathcal{F}_{n}]=v_{i}(X_{n}).

2. Bounded variance: \operatorname{\mathbb{E}}[\lVert\hat{v}_{i,n}\rVert^{2}_{*}\mid\mathcal{F}_{n}]\leq\sigma^{2}.

where \mathcal{F}_{n} is the history of X_{n}. The estimate \hat{v}_{n-1} is adapted to \mathcal{F}_{n}, in the sense that the value of \hat{v}_{n-1} is known in the n^{th} round but not during the (n-1)^{th} round. Thus, the payoff feedback in this case is \hat{v}_{i,n}=v_{i}(X_{n})+\xi_{i,n}, where \xi_{i,n}=v_{i}(\alpha_{n})-v_{i}(X_{n}). Notice that in this case the noise depends not only on the structure of the game, but also on the current state X_{n} of the sequence of play.

#### 2.7.3. Payoff-based/Bandit feedback

Sometimes even access to a stochastic first-order oracle is not feasible. Such settings are known as bandit settings, and players need to somehow estimate their payoff vectors. More specifically, at each round the players choose a pure action to submit based on some probability distribution; after all players have submitted their actions, each one of them receives only the payoff induced by the chosen action profile. Thus, at each round n, each player i employs a pure action \alpha_{i,n}\in\mathcal{A}_{i} based on X_{i,n}\in\mathcal{X}_{i} and then receives u_{i}(\alpha_{i,n};\alpha_{-i,n}). In order to estimate the whole payoff vector of each player, we utilize the importance weighted estimator (IWE):

 \hat{v}_{i\alpha_{i},n}=\begin{cases}u_{i}(\alpha_{i,n};\alpha_{-i,n})/\hat{X}_{i\alpha_{i,n}},&\alpha_{i}=\alpha_{i,n}\\ 0,&\alpha_{i}\neq\alpha_{i,n}\end{cases} (IWE)

where \hat{X}_{i,n}=(1-\epsilon_{n})X_{i,n}+\epsilon_{n}/\lvert\mathcal{A}_{i}\rvert is a convex combination of the strategy X_{i,n} produced by (FTRL) and the uniform distribution over the action set \mathcal{A}_{i} of each player. Thus, at each round n, players perform the following steps: they update their aggregate payoffs and compute the induced strategy X_{i,n} (as described in (FTRL)); they recalibrate this distribution to \hat{X}_{i,n}; they sample from \hat{X}_{i,n} the action to be submitted; and they apply the importance weighted estimator. Schematically,

 Y_{i,n}\to X_{i,n}=Q_{i}(Y_{i,n})\to\hat{X}_{i,n}\to\alpha_{i,n}\sim\hat{X}_{i,n}\to\hat{v}_{i,n}\to Y_{i,n+1}

This estimator satisfies the following properties:

1. Unbiasedness: \operatorname{\mathbb{E}}[\hat{v}_{i,n}\mid\mathcal{F}_{n}]=v_{i}(\hat{X}_{n}).

2. Bounded variance: \operatorname{\mathbb{E}}[\lVert\hat{v}_{i,n}\rVert^{2}_{*}\mid\mathcal{F}_{n}]\sim\dfrac{1}{\min_{\alpha_{i}\in\mathcal{A}_{i}}\hat{X}_{i\alpha_{i}}}\leq\dfrac{\lvert\mathcal{A}_{i}\rvert}{\epsilon_{n}}.

It is easy to prove that the above variance is indeed proportional to 1/\min_{\alpha_{i}}\hat{X}_{i\alpha_{i}}. For the rest of this work we will write:

 \operatorname{\mathbb{E}}[\lVert\hat{v}_{i,n}\rVert_{*}^{2}\mid\mathcal{F}_{n}]\leq\dfrac{\sigma^{2}}{\epsilon_{n}} (10)

It should be noted that running (FTRL) with the (IWE) estimator directly on X_{i,n} would have some undesirable properties: the noise can be uncontrollably large, due to the potentially arbitrarily small probability attributed to one or more strategies, leading to instability of the model. Inspired by the exploration-exploitation dilemma, players employ an exploration parameter that decreases over time, i.e., \epsilon_{n}\to 0, and choose their actions based on the recalibrated distribution \hat{X}_{i,n}. The idea is that, even if a strategy has zero probability of being chosen under X_{i,n}, the players will explore their other options by attributing positive probability to all of their strategies.
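The unbiasedness of the importance weighted estimator, and the role of explicit exploration, can be verified on a small example by enumerating all pure action profiles exactly rather than sampling. The payoff table, the strategies, the value of \epsilon and the helper names below are hypothetical and only meant as a sketch.

```python
def explore(x, eps):
    """Recalibrated strategy: X_hat = (1 - eps) * x + eps / |A|."""
    n = len(x)
    return [(1.0 - eps) * p + eps / n for p in x]


def iwe(n_actions, played, payoff, x_hat):
    """Importance weighted estimate of the payoff vector from one realized payoff."""
    return [payoff / x_hat[a] if a == played else 0.0 for a in range(n_actions)]


# Hypothetical payoff table of player 0: U0[a0][a1].
U0 = [[3.0, 0.0], [5.0, 1.0]]
x0, x1, eps = [1.0, 0.0], [0.3, 0.7], 0.1   # x0 gives zero probability to action 1
xh0, xh1 = explore(x0, eps), explore(x1, eps)

# Exact conditional expectation of the estimator: enumerate all pure profiles,
# each drawn with probability xh0[a0] * xh1[a1].
mean = [0.0, 0.0]
for a0 in range(2):
    for a1 in range(2):
        prob = xh0[a0] * xh1[a1]
        est = iwe(2, a0, U0[a0][a1], xh0)
        mean = [m + prob * e for m, e in zip(mean, est)]

# Payoff vector of player 0 against the recalibrated strategy of the opponent.
target = [sum(U0[a][b] * xh1[b] for b in range(2)) for a in range(2)]
```

Here `mean` coincides with `target` up to rounding. Note that sampling directly from `x0` would make the estimator undefined for action 1 (division by zero), whereas `xh0` keeps every probability at least `eps / 2`.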

## 3. Results

Turning now our attention to the equilibrium convergence properties of FTRL with imperfect payoff feedback, the core of our pursuit can be summarized in the following question: Which Nash equilibria can be Lyapunov stable and attracting under (FTRL) with arbitrarily high probability?

### 3.1. Asymptotic Stability

A game may admit more than one Nash equilibrium. Since FTRL algorithms depend on their initialization, global convergence results are not feasible in such a game. However, local convergence results are, and the standard criterion to characterize them is asymptotic stability. Roughly speaking, a point is asymptotically stable if, whenever the sequence of play starts "close enough" to it, it remains "close enough" and converges to it. For our stochastic algorithms, convergence guarantees become more intricate. For example, in the case of payoff-based/bandit feedback, a small number of unfortunate choices of strategies could lead the sequence of play to deviate from the equilibrium. As a result,

These allow us to conclude our analysis with the proof of our first theorem.

###### Proof of LABEL:theorem_1.

We start by fixing all the parameters of the algorithm (FTRL) and we assume ad absurdum that x^{\ast} is a mixed Nash equilibrium which is stochastically asymptotically stable. Then, for all \varepsilon,\varrho>0, there exists a neighborhood U\equiv U(\varepsilon,\varrho) such that, whenever X_{0}\in U, \lVert X_{n}-x^{\ast}\rVert<\varepsilon for all n\geq 0 with probability at least 1-\varrho. We leave \varepsilon to be chosen at the end of our analysis, but we will consider it to be fixed.
We focus on the player i who has the property described in (A3) for the case of noisy first-order feedback, and the one described in LABEL:not_equal_payoffs for the cases of semi-bandit and payoff-based/bandit feedback; it follows immediately that \lVert X_{i,n}-x^{\ast}_{i}\rVert<\varepsilon for all n\geq 0 with probability at least 1-\varrho. Assume that X_{i,n},X_{i,n+1} are two consecutive instances of the sequence of play; then \lVert X_{i,n}-x^{\ast}_{i}\rVert<\varepsilon, \lVert X_{i,n+1}-x^{\ast}_{i}\rVert<\varepsilon and, by the triangle inequality,

 \lVert X_{i,n+1}-X_{i,n}\rVert<2\varepsilon (11)

Depending on the feedback received, \pi will denote the probability of the event:

• Noisy first-order feedback: described in (A3).

• Semi-bandit feedback: the specific pure profile \alpha_{-i}\in\operatorname{supp}(x^{\ast}_{-i}) i.e., \alpha_{-i,n}=\alpha_{-i} described in LABEL:not_equal_payoffs to be chosen at round n.

• Payoff-based/Bandit feedback: the pure strategy profile (\alpha_{i};\alpha_{-i})\in\operatorname{supp}(x^{\ast}) described in LABEL:non_zero_payoff to be chosen at round n.

In all three cases \pi is strictly positive. (In the last two cases, \pi>0 since the strategies belonging to the support of the equilibrium have strictly positive probabilities, while in the first case it holds by assumption.) For each case separately, \varrho is fixed and satisfies:

 1-\varrho>1-\pi

This is possible, since \pi is strictly positive and \varrho is arbitrarily small.
Consider now the projection of the aggregate payoffs Y_{i,n},Y_{i,n+1} in the difference of the directions of these two strategies. From LABEL:Useful_Expression we have

 \langle Y_{i,n+1}-Y_{i,n},e_{a}-e_{b}\rangle=\langle\nabla h_{i}(X_{i,n+1})-\nabla h_{i}(X_{i,n}),e_{a}-e_{b}\rangle (12)

However, by definition of (FTRL), Y_{i,n+1}-Y_{i,n}=\gamma_{n}\hat{v}_{i,n} and thus

 \left(\theta^{\prime}_{i}(X_{ia,n+1})-\theta^{\prime}_{i}(X_{ib,n+1})\right)-\left(\theta^{\prime}_{i}(X_{ia,n})-\theta^{\prime}_{i}(X_{ib,n})\right)=\gamma_{n}\langle\hat{v}_{i,n},e_{a}-e_{b}\rangle (13)

By rearranging and taking into consideration that the regularizers used are decomposable we have

 \left(\theta^{\prime}_{i}(X_{ia,n+1})-\theta^{\prime}_{i}(X_{ia,n})\right)-\left(\theta^{\prime}_{i}(X_{ib,n+1})-\theta^{\prime}_{i}(X_{ib,n})\right)=\gamma_{n}\langle\hat{v}_{i,n},e_{a}-e_{b}\rangle (14)
 \left(\theta^{\prime}_{i}(X_{ia,n+1})-\theta^{\prime}_{i}(X_{ia,n})\right)-\left(\theta^{\prime}_{i}(X_{ib,n+1})-\theta^{\prime}_{i}(X_{ib,n})\right)=\gamma_{n}(\hat{v}_{ia,n}-\hat{v}_{ib,n}) (15)

As a consequence of \theta being continuously differentiable, there exist finite constants C_{a},C_{b}, corresponding to a and b respectively, such that

 \left|\theta^{\prime}_{i}(X_{ia,n+1})-\theta^{\prime}_{i}(X_{ia,n})\right|\leq C_{a}\lvert X_{ia,n+1}-X_{ia,n}\rvert<2C_{a}\varepsilon (16)
 \left|\theta^{\prime}_{i}(X_{ib,n+1})-\theta^{\prime}_{i}(X_{ib,n})\right|\leq C_{b}\lvert X_{ib,n+1}-X_{ib,n}\rvert<2C_{b}\varepsilon (17)

By applying the triangle inequality in (15) and using (16) and (17) we get:

 \gamma_{n}\lvert\hat{v}_{ia,n}-\hat{v}_{ib,n}\rvert<(2C_{a}+2C_{b})\varepsilon (18)

Equivalently,

 \lvert\hat{v}_{ia,n}-\hat{v}_{ib,n}\rvert<\dfrac{2C_{a}+2C_{b}}{\gamma_{n}}\varepsilon (19)

The above inequality holds with probability 1-\varrho.

We now fix \varepsilon to be

 \varepsilon<\min\left\{\dfrac{\gamma_{n}}{2C_{a}+2C_{b}}\beta^{\prime},\dfrac{\gamma_{n}}{2C_{a}+2C_{b}}\beta^{\prime\prime},\dfrac{\gamma_{n}}{2C_{a}+2C_{b}+2L\gamma_{n}}\beta,\dfrac{\beta}{2L}\right\} (20)

Focusing on the noisy case we have:

 \lvert\hat{v}_{ia,n}-\hat{v}_{ib,n}\rvert=\lvert\xi_{ia,n}-\xi_{ib,n}+u_{i}(a;X_{i,n})-u_{i}(b;X_{i,n})\rvert (21)
 \geq\lvert\lvert\xi_{ia,n}-\xi_{ib,n}\rvert-\lvert u_{i}(a;X_{i,n})-u_{i}(b;X_{i,n})\rvert\rvert (22)

From LABEL:lemma_for_noise we have \lvert\xi_{ia,n}-\xi_{ib,n}\rvert-\lvert u_{i}(a;X_{i,n})-u_{i}(b;X_{i,n})\rvert\geq\beta-2L\varepsilon, and from (20), \varepsilon<\beta/2L, so this difference is positive. Thus

 \lvert\hat{v}_{ia,n}-\hat{v}_{ib,n}\rvert\geq\lvert\lvert\xi_{ia,n}-\xi_{ib,n}\rvert-\lvert u_{i}(a;X_{i,n})-u_{i}(b;X_{i,n})\rvert\rvert (23)
 =\lvert\xi_{ia,n}-\xi_{ib,n}\rvert-\lvert u_{i}(a;X_{i,n})-u_{i}(b;X_{i,n})\rvert (24)
 \geq\beta-2L\varepsilon (25)

Since from (20), \varepsilon<\dfrac{\gamma_{n}}{2C_{a}+2C_{b}+2L\gamma_{n}}\beta, we have

 \lvert\hat{v}_{ia,n}-\hat{v}_{ib,n}\rvert\geq\dfrac{2C_{a}+2C_{b}}{2C_{a}+2C_{b}+2L\gamma_{n}}\beta\equiv\mu (26)

In the bandit case, since a is the strategy employed by player i in round n, the importance-weighted estimator gives by construction \hat{v}_{ia,n}=u_{i}(a;\alpha_{-i,n})/\hat{X}_{ia,n} and \hat{v}_{ib,n}=0. From LABEL:payoffs_bounded there exists {\beta}^{\prime\prime}\equiv\mu>0 such that \lvert\hat{v}_{ia,n}-\hat{v}_{ib,n}\rvert\geq\mu. Finally, in the semi-bandit (limited first-order) feedback case, \hat{v}_{ia,n}-\hat{v}_{ib,n}=u_{i}(a;\alpha_{-i,n})-u_{i}(b;\alpha_{-i,n}). From LABEL:payoffs_bounded there exists {\beta}^{\prime}\equiv\mu>0 such that \lvert\hat{v}_{ia,n}-\hat{v}_{ib,n}\rvert\geq\mu. Thus, in all three cases it holds that

 \lvert\hat{v}_{ia,n}-\hat{v}_{ib,n}\rvert\geq\mu (27)

with probability \pi>0, which has already been specified.
Combining (19),(20) we have that with probability at least 1-\varrho, it holds

 \lvert\hat{v}_{ia,n}-\hat{v}_{ib,n}\rvert<\mu. (28)

The above analysis implies that there exists a neighborhood of x^{\ast}, defined by \varepsilon in which it holds with probability at least 1-\varrho that \lvert\hat{v}_{ia,n}-\hat{v}_{ib,n}\rvert<\mu, while with probability \pi it holds that \lvert\hat{v}_{ia,n}-\hat{v}_{ib,n}\rvert\geq\mu. Formally,

 1=\operatorname{\mathbb{P}}\left[\left\{\lvert\hat{v}_{ia,n}-\hat{v}_{ib,n}\rvert\geq\mu\right\}\bigcup\left\{\lvert\hat{v}_{ia,n}-\hat{v}_{ib,n}\rvert<\mu\right\}\right] (29)
 =\operatorname{\mathbb{P}}\left[\lvert\hat{v}_{ia,n}-\hat{v}_{ib,n}\rvert\geq\mu\right]+\operatorname{\mathbb{P}}\left[\lvert\hat{v}_{ia,n}-\hat{v}_{ib,n}\rvert<\mu\right] (30)
 \geq\pi+1-\varrho (31)
 >1 (32)

which is a contradiction (the last step uses \varrho<\pi). Thus, a mixed Nash equilibrium cannot be stochastically asymptotically stable under (FTRL) for any of the three types of payoff feedback. Notice that this analysis holds even for the first round: once the parameters of the algorithm have been fixed, instability can be derived in any finite round. ∎
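The bandit step of the argument above relies on the importance-weighted estimator, which puts u_{i}(a;\alpha_{-i,n})/\hat{X}_{ia,n} on the played strategy and 0 elsewhere. A minimal Python sketch of this estimator follows; the function names, the payoff values and the sampling probabilities are hypothetical and chosen only for illustration.

```python
def importance_weighted_estimate(payoff, chosen, sampling_probs):
    """Bandit estimate: payoff / probability on the chosen action, 0 elsewhere."""
    estimate = [0.0] * len(sampling_probs)
    estimate[chosen] = payoff / sampling_probs[chosen]
    return estimate


def expected_estimate(payoffs, sampling_probs):
    """Average the estimate over the player's own random choice of action."""
    n = len(payoffs)
    mean = [0.0] * n
    for a in range(n):
        est = importance_weighted_estimate(payoffs[a], a, sampling_probs)
        for b in range(n):
            mean[b] += sampling_probs[a] * est[b]
    return mean


# Hypothetical data (assumptions for illustration, not taken from the paper).
payoffs = [0.6, 0.2, 0.9]          # u_i(a; alpha_{-i}) for a fixed alpha_{-i}
sampling_probs = [0.5, 0.3, 0.2]   # mixed strategy \hat{X}_{i,n}

est = importance_weighted_estimate(payoffs[0], 0, sampling_probs)
gap = abs(est[0] - est[1])         # |v_hat_a - v_hat_b|: a was played, b was not
mean = expected_estimate(payoffs, sampling_probs)
```

In this sketch the gap between the played and the unplayed coordinate is at least the realized payoff in absolute value, because sampling probabilities are at most 1, while averaging over the player's own choice recovers the payoff vector.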

### 3.2. Stability of strict Nash equilibria

Below we prove that strict Nash equilibria are indeed stochastically asymptotically stable. Since they are variationally stable (see Eq. (VS)), there exists a neighborhood in which they strictly dominate. By properly controlling the parameters of (FTRL), asymptotic stability can be guaranteed with arbitrarily high probability. Notice that the exploration parameter \epsilon described in Section 2.7.3 is integrated into the algorithm for Payoff-based/Bandit feedback, since the actual sequence of play, on which the choice of strategies of each player i is based, is \hat{X}_{i,n}=(1-\epsilon_{n})X_{i,n}+\epsilon_{n}/\lvert\mathcal{A}_{i}\rvert.
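To make the role of the exploration parameter concrete, the following Python sketch computes X_{i,n} from a score vector and then mixes it with the uniform distribution. It assumes the entropic regularizer, for which the choice map is the softmax; this is only one member of the class of decomposable regularizers, and the function names and numbers are hypothetical.

```python
import math


def logit_choice(scores):
    """Choice map for the entropic regularizer: a numerically stable softmax."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def mix_with_uniform(strategy, explore):
    """Return (1 - explore) * strategy + explore / |A|, coordinate by coordinate."""
    n = len(strategy)
    return [(1.0 - explore) * p + explore / n for p in strategy]


scores = [2.0, 0.0, -1.0]          # hypothetical aggregate payoffs Y_{i,n}
x = logit_choice(scores)           # X_{i,n}
x_hat = mix_with_uniform(x, 0.1)   # \hat{X}_{i,n} with epsilon_n = 0.1
```

Every coordinate of x_hat is at least epsilon_n / |A_i|, which keeps the importance weights 1/\hat{X}_{ia,n} bounded by |A_i| / epsilon_n.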

### 3.3. Bregman divergence - Fenchel coupling

The Bregman divergence provides a way to measure the distance between two points of the simplex, and its properties render it a useful tool for proving convergence results. Below we state its definition and prove those properties that are crucial for our proof. Given a fixed point p\in\mathcal{X}, the Bregman divergence of a function h is defined for all points x\in\mathcal{X} as

 D_{h}(p,x)=h(p)-h(x)-h^{\prime}(x;p-x)\text{ for all }p,x\in\mathcal{X} (33)

where h^{\prime}(x;p-x) is the one-sided derivative

 h^{\prime}(x;p-x)\equiv\lim_{t\to 0^{+}}t^{-1}[h(x+t(p-x))-h(x)] (34)

Notice that this definition of the Bregman divergence also allows us to work with points on the boundary. D_{h} may attain the value +\infty if h^{\prime}(x;p-x)=-\infty; moreover, for a sequence x_{j}\to p, with p a point of the boundary, it is possible that D_{h}(p,x_{j})\to+\infty. However, the condition below ensures that this is not the case.

 D_{h}(p,x_{j})\to 0\text{ whenever }x_{j}\to p (reciprocity)

This is known as the reciprocity condition. What this property actually means is that the sublevel sets of D_{h}(p,\cdot) are neighborhoods of p. This is illustrated in Fig. 2 for the case where the function employed is the negative Shannon entropy and the induced Bregman divergence is the Kullback–Leibler divergence. For our purposes, since h_{i} is a decomposable function, this assumption is satisfied. Additionally, the Bregman divergence satisfies the properties described below.

###### Proposition 2.

Let h be a K-strongly convex function defined on the simplex \mathcal{X}=\Delta(\mathcal{A}), that has the properties described in Section 2.4 and let \Delta_{p} be the union of the relative interiors of the faces of \mathcal{X} that contain p i.e.,

 \Delta_{p}=\{x\in\mathcal{X}:\operatorname{supp}(p)\subseteq\operatorname{supp}(x)\}=\{x\in\mathcal{X}:x_{a}>0\text{ whenever }p_{a}>0\} (35)

Then

1. D_{h}(p,x)<\infty whenever x\in\Delta_{p}.

2. D_{h}(p,x)\geq 0 for all x\in\mathcal{X}, with equality if and only if p=x, more particularly

 D_{h}(p,x)\geq\dfrac{1}{2}K\lVert x-p\rVert^{2}\text{ for all }x\in\mathcal{X} (36)
###### Proof.

For the first part, if x\in\Delta_{p} then h(x+t(p-x)) is finite and smooth for t in a neighborhood of 0, and thus D_{h}(p,x) is also finite.
For the second part of the proposition, let z=p-x; then strong convexity yields

 h(x+tz)\leq th(p)+(1-t)h(x)-\dfrac{1}{2}Kt(1-t)\lVert x-p\rVert^{2}
 t^{-1}(h(x+tz)-h(x))\leq h(p)-h(x)-\dfrac{1}{2}(1-t)K\lVert x-p\rVert^{2}
 h(p)-h(x)-t^{-1}(h(x+tz)-h(x))\geq\dfrac{1}{2}(1-t)K\lVert x-p\rVert^{2}

By taking t\to 0^{+}, we obtain the result. ∎
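As a numerical illustration of Proposition 2, the sketch below evaluates the Bregman divergence of the negative Shannon entropy, which is the Kullback–Leibler divergence. It assumes the \ell^{1} norm, for which the entropy is 1-strongly convex on the simplex (so K=1 and the lower bound (36) is Pinsker's inequality); the test points and function names are hypothetical.

```python
import math


def neg_entropy(x):
    """h(x) = sum_a x_a log x_a, with the convention 0 log 0 = 0."""
    return sum(v * math.log(v) for v in x if v > 0.0)


def bregman_neg_entropy(p, x):
    """h(p) - h(x) - <grad h(x), p - x>, valid for x in the relative interior."""
    grad = [math.log(v) + 1.0 for v in x]
    pairing = sum(g * (pa - xa) for g, pa, xa in zip(grad, p, x))
    return neg_entropy(p) - neg_entropy(x) - pairing


def kl_divergence(p, x):
    """Closed form of D_h(p, x); +inf when supp(p) is not contained in supp(x)."""
    total = 0.0
    for pa, xa in zip(p, x):
        if pa > 0.0:
            if xa == 0.0:
                return math.inf
            total += pa * math.log(pa / xa)
    return total


def l1_distance(p, x):
    return sum(abs(pa - xa) for pa, xa in zip(p, x))


p = [1.0, 0.0, 0.0]            # a vertex of the simplex (pure strategy)
x_inside = [0.7, 0.2, 0.1]     # x lies in Delta_p: finite divergence
x_outside = [0.0, 0.5, 0.5]    # x lies outside Delta_p: infinite divergence

d_inside = kl_divergence(p, x_inside)
d_outside = kl_divergence(p, x_outside)
lower_bound = 0.5 * l1_distance(p, x_inside) ** 2   # (1/2) K ||x - p||^2, K = 1
```

The divergence is finite exactly when x belongs to \Delta_{p}, and it dominates the quadratic lower bound, in line with the two parts of the proposition.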

Even though the Bregman divergence is a useful tool, (FTRL) evolves in the dual space of payoffs. Thus, dually to the above, we define the Fenchel coupling (a term due to [MS16]) F_{h}:\mathcal{X}\times\mathcal{Y}\to\mathbb{R} as

 F_{h}(p,y)=h(p)+h^{*}(y)-\langle y,p\rangle\text{ for all }p\in\mathcal{X},y\in\mathcal{Y} (37)

where h^{*}:\mathcal{Y}\to\mathbb{R} is the convex conjugate of h, h^{*}(y)=\sup_{x\in\mathcal{X}}\{\langle y,x\rangle-h(x)\}. The convex conjugate is differentiable on \mathcal{Y} and it holds that

 \nabla h^{*}(y)=Q(y)\text{ for all }y\in\mathcal{Y} (38)

The Fenchel coupling is also a measure that connects the primal with the dual space. As we mentioned above, (FTRL) evolves in the dual space, and thus we use the Fenchel coupling to trace its convergence properties. As the next proposition states, whenever the Fenchel coupling F_{h}(p,y) is bounded from above, so is \lVert Q(y)-p\rVert. This proposition in its entirety is critical for our proof, since we first need to find a neighborhood U of attraction (see LABEL:stability). For this step, the Bregman divergence is necessary in order to define the aforementioned neighborhood, since \lVert Q(y)-p\rVert<c for some constant c does not necessarily define a neighborhood of p (see Section 2.5).

###### Proposition 3.

Let h be a K-strongly convex function on \mathcal{X} that has the properties described in Section 2.4, and let p\in\mathcal{X}. Then

1. F_{h}(p,y)\geq\dfrac{1}{2}K\lVert Q(y)-p\rVert^{2} for all y\in\mathcal{Y} and whenever F_{h}(p,y)\to 0, Q(y)\to p.

2. F_{h}(p,y)=D_{h}(p,x) whenever Q(y)=x and x\in\Delta_{p}.

3. F_{h}(p,y^{\prime})\leq F_{h}(p,y)+\langle y^{\prime}-y,Q(y)-p\rangle+\dfrac{1}{2K}\lVert y^{\prime}-y\rVert_{*}^{2}.

###### Remark 3.1.

Notice that the first part of the proposition is not implied by the second one, since it is possible that \operatorname{im}Q=\operatorname{dom}\partial h is not always contained in \Delta_{p} (see Section 2.5).

###### Proof.

For the first part, let x=Q(y); then h^{*}(y)=\langle y,x\rangle-h(x) and thus

 F_{h}(p,y)=h(p)-h(x)-\langle y,p-x\rangle (39)

Since y\in\partial h(x) (Proposition 1), it is

 h(x+t(p-x))\geq h(x)+t\langle y,p-x\rangle (40)

and by strong convexity of h, we have

 h(x+t(p-x))\leq th(p)+(1-t)h(x)-\dfrac{1}{2}Kt(1-t)\lVert p-x\rVert^{2} (41)

Thus by combining (40),(41) and taking t\to 0 we get

 F_{h}(p,y)\geq h(p)-h(x)-h(p)+h(x)+\dfrac{K}{2}\lVert p-x\rVert^{2}\geq\dfrac{K}{2}\lVert p-x\rVert^{2} (42)

For the second part of the proposition, notice that x+t(p-x) lies in the relative interior of some face of \mathcal{X} for t in a neighborhood of 0 and thus h(x+t(p-x)) is smooth and finite. So, h admits a two-sided derivative along x-p and since y\in\partial h(x), \langle y,p-x\rangle=h^{\prime}(x;p-x) and our claim naturally follows.
Finally for the last part of the proposition, we have

 F_{h}(p,y^{\prime})=h(p)+h^{*}(y^{\prime})-\langle y^{\prime},p\rangle
 \leq h(p)+h^{*}(y)+\langle y^{\prime}-y,\nabla h^{*}(y)\rangle+\dfrac{1}{2K}\lVert y^{\prime}-y\rVert^{2}_{*}-\langle y^{\prime},p\rangle
 =F_{h}(p,y)+\langle y^{\prime}-y,Q(y)-p\rangle+\dfrac{1}{2K}\lVert y^{\prime}-y\rVert^{2}_{*}

where the inequality follows from the fact that h^{*} is 1/K-strongly smooth [RW98]. ∎

In terms of Fenchel coupling our reciprocity assumption can be written as

 F_{h}(p,y)\to 0\text{ whenever }Q(y)\to p (reciprocity)

Again for h decomposable, the assumption is turned into a property.
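The three properties of Proposition 3 can be checked numerically in a special case. The Python sketch below assumes the entropic regularizer, for which h^{*} is the log-sum-exp function and Q is the softmax, together with the \ell^{1} norm on strategies (so K=1 and the dual norm is the sup-norm). The score vectors and function names are hypothetical.

```python
import math


def logsumexp(y):
    """Convex conjugate h* of the negative entropy on the simplex."""
    top = max(y)
    return top + math.log(sum(math.exp(v - top) for v in y))


def softmax(y):
    """Choice map Q(y) = grad h*(y) for the entropic regularizer."""
    lse = logsumexp(y)
    return [math.exp(v - lse) for v in y]


def neg_entropy(p):
    return sum(v * math.log(v) for v in p if v > 0.0)


def fenchel_coupling(p, y):
    """F_h(p, y) = h(p) + h*(y) - <y, p>."""
    return neg_entropy(p) + logsumexp(y) - sum(ya * pa for ya, pa in zip(y, p))


def kl_divergence(p, x):
    return sum(pa * math.log(pa / xa) for pa, xa in zip(p, x) if pa > 0.0)


p = [1.0, 0.0, 0.0]            # base point (a pure strategy)
y = [1.0, 0.5, -0.5]           # hypothetical score vector
y_next = [1.3, 0.4, -0.5]      # hypothetical updated score vector
x = softmax(y)

coupling = fenchel_coupling(p, y)
divergence = kl_divergence(p, x)                      # property 2: equals coupling
quadratic_lower = 0.5 * sum(abs(a - b) for a, b in zip(x, p)) ** 2  # property 1

# Property 3 with K = 1 and the sup-norm as the dual norm.
step = [b - a for a, b in zip(y, y_next)]
linear = sum(s * (xa - pa) for s, xa, pa in zip(step, x, p))
upper_bound = coupling + linear + 0.5 * max(abs(s) for s in step) ** 2
```

In this special case the Fenchel coupling coincides with the Kullback–Leibler divergence between p and Q(y), which is why it can serve as an energy function for the score sequence.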

### 3.4. Variational stability

###### Definition 1 (Variational stability).

A point x^{\ast}\in\mathcal{X} is said to be variationally stable if there exists a neighborhood U of x^{\ast} such that

 \langle v(x),x-x^{\ast}\rangle\leq 0\text{ for all }x\in U (VS)

with equality if and only if x=x^{\ast}.

This property states that, in a neighborhood of x^{\ast}, x^{\ast} strictly dominates all other strategies. Interestingly, strict Nash equilibria have this property:

###### Proposition 4.

For finite games in normal form, the following are equivalent:

1. x^{\ast} is a strict Nash equilibrium.

2. \langle v(x^{\ast}),z\rangle\leq 0 for all z\in\operatorname{TC}(x^{\ast}) with equality if and only if z=0.

3. x^{\ast} is variationally stable.

###### Proof.

We will prove that i)\Rightarrow ii)\Rightarrow iii)\Rightarrow i).
i)\Rightarrow ii) Since x^{\ast} is a Nash equilibrium by definition it holds for each player i that

 \langle v(x^{\ast}),x-x^{\ast}\rangle\leq 0\text{ for all }x\in\mathcal{X} (43)

For the strict part of the inequality, by definition of strict Nash equilibria it holds that \langle v_{i}(x^{\ast}),x_{i}-x^{\ast}_{i}\rangle<0 whenever x_{i}\neq x^{\ast}_{i} and thus

 \langle v(x^{\ast}),z\rangle=\sum_{i=1}^{N}\langle v_{i}(x^{\ast}),x_{i}-x^{\ast}_{i}\rangle<0\text{ if }x_{i}\neq x^{\ast}_{i}\text{ for some }i\text{, i.e., if }z\neq 0 (44)

ii)\Rightarrow iii) By definition of the polar cone, v(x^{\ast}) belongs to the interior of \operatorname{PC}(x^{\ast}); indeed, if it belonged to the boundary, then equality in ii) would also hold for some z\neq 0. Thus, by continuity, there exists a neighborhood of x^{\ast} such that v(x) also belongs to \operatorname{PC}(x^{\ast}) for all x in it, i.e., x^{\ast} is variationally stable.
iii)\Rightarrow i) Assume now that x^{\ast} is variationally stable but not strict. Then there exist a player i and strategies a,b\in\mathcal{A}_{i} such that u_{i}(a;x^{\ast}_{-i})=u_{i}(b;x^{\ast}_{-i}). Then for x_{i}=x^{\ast}_{i}+\lambda(e_{a}-e_{b}) and x_{-i}=x^{\ast}_{-i} we have

 \langle v(x^{\ast}),x-x^{\ast}\rangle=\langle v_{i}(x^{\ast}),\lambda(e_{a}-e_{b})\rangle=0 (45)

Since v_{i}(x) depends only on x_{-i}=x^{\ast}_{-i}, the same holds with v(x) in place of v(x^{\ast}), so (VS) holds with equality at a point x\neq x^{\ast}, a contradiction. ∎
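As a sanity check of the link between strict equilibria and (VS), the following Python sketch evaluates the pairing \langle v(x),x-x^{\ast}\rangle on a grid of mixed profiles of a 2×2 Prisoner's Dilemma, whose unique Nash equilibrium (mutual defection) is strict. The payoff matrices and the function names are hypothetical and chosen only for illustration.

```python
def payoff_vectors(x1, x2, u1, u2):
    """Payoff vectors v_1(x), v_2(x) of a two-player game with two actions each."""
    v1 = [sum(u1[a][b] * x2[b] for b in range(2)) for a in range(2)]
    v2 = [sum(u2[a][b] * x1[a] for a in range(2)) for b in range(2)]
    return v1, v2


def vs_pairing(x1, x2, star1, star2, u1, u2):
    """The quantity <v(x), x - x*> that appears in (VS)."""
    v1, v2 = payoff_vectors(x1, x2, u1, u2)
    first = sum(v * (a - s) for v, a, s in zip(v1, x1, star1))
    second = sum(v * (a - s) for v, a, s in zip(v2, x2, star2))
    return first + second


# Hypothetical Prisoner's Dilemma; action 0 = cooperate, action 1 = defect.
U1 = [[3.0, 0.0], [5.0, 1.0]]   # U1[a1][a2]: payoff of player 1
U2 = [[3.0, 5.0], [0.0, 1.0]]   # U2[a1][a2]: payoff of player 2
STAR = ([0.0, 1.0], [0.0, 1.0])  # strict equilibrium: both players defect

grid = [k / 10 for k in range(11)]
values = {
    (q1, q2): vs_pairing([q1, 1.0 - q1], [q2, 1.0 - q2], STAR[0], STAR[1], U1, U2)
    for q1 in grid
    for q2 in grid
}
```

Here q1 and q2 are the cooperation probabilities; the pairing vanishes only at the equilibrium and is negative everywhere else on the grid. In this example (VS) holds on the whole strategy space because defection is strictly dominant; in general it is only guaranteed locally.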

In the following preliminary result, we focus on the case of Payoff-based/Bandit feedback and show that, if x^{\ast} is a strict Nash equilibrium, there exists a subsequence of (X_{n})_{n=0}^{\infty} that converges to it. To obtain this convergence result, it is necessary to assume that the sequence (\hat{X}_{n})_{n=0}^{\infty} is contained in a neighborhood of x^{\ast} in which (VS) holds. Our argument proceeds by contradiction and combines the results of the previous section. We first outline the basic steps:
0. Assume that there exists a neighborhood of x^{\ast} in which X_{n} is not contained for all sufficiently large n\geq 0.
1. We show that the terms on the RHS of the third property described in Proposition 3 converge almost surely to finite values, except for one. This term, which is a consequence of x^{\ast} being variationally stable, goes to -\infty as n\to\infty.
2. The crucial observation is then that, thanks to the first property in Proposition 3, the Fenchel coupling is bounded from below by 0, which gives us the contradiction.

###### Lemma 1.

Let x^{\ast}\in\mathcal{A} be a strict Nash equilibrium. If (FTRL) is run with zero-order feedback and the sequence of play (\hat{X}_{n})_{n=1}^{\infty} does not exit a neighborhood R of x^{\ast} in which variational stability holds, then there exists a subsequence \hat{X}_{n_{k}} of \hat{X}_{n} that converges to x^{\ast} almost surely, under the hypotheses: \sum_{k=1}^{\infty}\gamma_{k}=\infty, \sum_{k=1}^{\infty}\gamma_{k}\epsilon_{k}<\infty, \sum_{k=1}^{\infty}\dfrac{\gamma_{k}^{2}}{\epsilon_{k}}<\infty.

###### Proof.

Suppose that there exists a neighborhood U\subseteq R of x^{\ast} , such that \hat{X}_{n}\notin U for all large enough n. Assume without loss of generality that this is true for all n\geq 0. Since variational stability holds in R, we have

 \langle v(x),x-x^{\ast}\rangle<0\text{ for all }x\in R,\;x\neq x^{\ast} (46)

Furthermore, from Proposition 3 we have that for each round n:

 F_{h}(Y_{n+1},x^{\ast})\leq F_{h}(Y_{n},x^{\ast})+\gamma_{n}\langle\hat{v}_{n},X_{n}-x^{\ast}\rangle+\dfrac{1}{2K}\gamma_{n}^{2}\lVert\hat{v}_{n}\rVert^{2}_{*} (47)

By applying the above inequality for all rounds from 0,...,n and creating the telescopic sum we get

 F_{h}(Y_{n+1},x^{\ast})\leq F_{h}(Y_{0},x^{\ast})+\sum_{k=1}^{n}\gamma_{k}\langle\hat{v}_{k},X_{k}-x^{\ast}\rangle+\dfrac{1}{2K}\sum_{k=1}^{n}\gamma_{k}^{2}\lVert\hat{v}_{k}\rVert_{*}^{2} (48)

Remember that our estimator is unbiased and thus

 \operatorname{\mathbb{E}}[\hat{v}_{n}\mid\mathcal{F}_{n}]=v(\hat{X}_{n}) (49)

So we can write \hat{v}_{n} as

 \hat{v}_{n}=v(\hat{X}_{n})+\xi_{n} (50)

where \xi_{n}=\hat{v}_{n}-\operatorname{\mathbb{E}}[\hat{v}_{n}\mid\mathcal{F}_{n}]. We now rewrite (48) as

 F_{h}(Y_{n+1},x^{\ast})\leq F_{h}(Y_{0},x^{\ast})+\sum_{k=1}^{n}\gamma_{k}\langle v(\hat{X}_{k}),X_{k}-x^{\ast}\rangle+\sum_{k=1}^{n}\gamma_{k}\langle\xi_{k},X_{k}-x^{\ast}\rangle+\dfrac{1}{2K}\sum_{k=1}^{n}\gamma_{k}^{2}\lVert\hat{v}_{k}\rVert_{*}^{2} (51)

Let \tau_{n}=\sum_{k=1}^{n}\gamma_{k}; then

 F_{h}(Y_{n+1},x^{\ast})\leq F_{h}(Y_{0},x^{\ast})+\sum_{k=1}^{n}\gamma_{k}\langle v(\hat{X}_{k}),X_{k}-x^{\ast}\rangle+\tau_{n}\left(\dfrac{\sum_{k=1}^{n}\gamma_{k}\langle\xi_{k},X_{k}-x^{\ast}\rangle}{\tau_{n}}+\dfrac{1}{2K}\dfrac{\sum_{k=1}^{n}\gamma_{k}^{2}\lVert\hat{v}_{k}\rVert_{*}^{2}}{\tau_{n}}\right) (52)

We focus on the asymptotic behavior of each particular term of the previous inequality.
Let R_{n}=\sum_{k=1}^{n}\gamma_{k}^{2}\lVert\hat{v}_{k}\rVert^{2}_{*}. Then

 \operatorname{\mathbb{E}}[R_{n}]\leq\sum_{k=1}^{n}\gamma_{k}^{2}\operatorname{\mathbb{E}}[\lVert\hat{v}_{k}\rVert^{2}_{*}]\leq\sum_{k=1}^{n}\gamma_{k}^{2}\sum_{i=1}^{\lvert\mathcal{N}\rvert}\operatorname{\mathbb{E}}[\lVert\hat{v}_{i,k}\rVert^{2}_{*}]\leq\lvert\mathcal{N}\rvert\sigma^{2}\sum_{k=1}^{n}\dfrac{\gamma_{k}^{2}}{\epsilon_{k}}<\infty (53)

Since R_{n} is an L^{1}-bounded submartingale, Doob's convergence theorem for submartingales ([HH80], Theorem 2.5) implies that R_{n} converges almost surely to some finite random value. Since \tau_{n}\to\infty, it follows that almost surely

 \lim_{n\to\infty}\tau_{n}^{-1}R_{n}=0 (54)

Let S_{n}=\sum_{k=1}^{n}\gamma_{k}\langle\xi_{k},X_{k}-x^{\ast}\rangle and \psi_{k}=\langle\xi_{k},X_{k}-x^{\ast}\rangle. For the expected value of \psi_{n} we have

 \operatorname{\mathbb{E}}[\psi_{n}\mid\mathcal{F}_{n}]=\langle\operatorname{\mathbb{E}}[\xi_{n}\mid\mathcal{F}_{n}],X_{n}-x^{\ast}\rangle=0 (55)

while for the expectation of the absolute value of \psi_{n}

 \operatorname{\mathbb{E}}[\lvert\psi_{n}\rvert\mid\mathcal{F}_{n}]\leq\operatorname{\mathbb{E}}[\lVert\xi_{n}\rVert_{*}\lVert X_{n}-x^{\ast}\rVert\mid\mathcal{F}_{n}] (56)
 \leq\left(\operatorname{\mathbb{E}}[\lVert\hat{v}_{n}\rVert_{*}\mid\mathcal{F}_{n}]+\lVert\operatorname{\mathbb{E}}[\hat{v}_{n}\mid\mathcal{F}_{n}]\rVert_{*}\right)\lVert\mathcal{X}\rVert<\infty (57)

Obviously, \sum_{n=1}^{\infty}\tau_{n}^{-1}\operatorname{\mathbb{E}}[\lvert\psi_{n}\rvert\mid\mathcal{F}_{n}]<\infty and so the law of large numbers for martingales ([HH80], Theorem 2.18) yields that almost surely

 \lim_{n\to\infty}\tau_{n}^{-1}S_{n}=0 (58)

Finally, we will examine the term \sum_{k=1}^{n}\gamma_{k}\langle v(\hat{X}_{k}),X_{k}-x^{\ast}\rangle. Since \hat{X}_{n}\in R\setminus U and variational stability holds in R, by continuity there exists c>0, such that

 \langle v(\hat{X}_{n}),\hat{X}_{n}-x^{\ast}\rangle\leq-c (59)

Considering the definition of \hat{X}_{i,n} we can write

 \sum_{i=1}^{\lvert\mathcal{N}\rvert}\langle v_{i}(\hat{X}_{n}),(1-\epsilon_{n})X_{i,n}+\dfrac{\epsilon_{n}}{\lvert\mathcal{A}_{i}\rvert}-x^{\ast}_{i}\rangle\leq-c (60)

Equivalently,

 \langle v(\hat{X}_{n}),X_{n}-x^{\ast}\rangle\leq-c+\epsilon_{n}\sum_{i=1}^{\lvert\mathcal{N}\rvert}\langle v_{i}(\hat{X}_{n}),X_{i,n}-\dfrac{1}{\lvert\mathcal{A}_{i}\rvert}\rangle (61)

Hence,

 \sum_{k=1}^{n}\gamma_{k}\langle v(\hat{X}_{k}),X_{k}-x^{\ast}\rangle\leq-c\sum_{k=1}^{n}\gamma_{k}+\sum_{k=1}^{n}\gamma_{k}\epsilon_{k}\sum_{i=1}^{\lvert\mathcal{N}\rvert}\langle v_{i}(\hat{X}_{k}),X_{i,k}-\dfrac{1}{\lvert\mathcal{A}_{i}\rvert}\rangle (62)

We will prove that the last term on the RHS is finite when \sum_{k=1}^{\infty}\gamma_{k}\epsilon_{k}<\infty. Indeed,

 \sum_{k=1}^{\infty}\left\lvert\gamma_{k}\epsilon_{k}\sum_{i=1}^{\lvert\mathcal{N}\rvert}\langle v_{i}(\hat{X}_{k}),X_{i,k}-\dfrac{1}{\lvert\mathcal{A}_{i}\rvert}\rangle\right\rvert=\sum_{k=1}^{\infty}\gamma_{k}\epsilon_{k}\left\lvert\sum_{i=1}^{\lvert\mathcal{N}\rvert}\langle v_{i}(\hat{X}_{k}),X_{i,k}-\dfrac{1}{\lvert\mathcal{A}_{i}\rvert}\rangle\right\rvert (63)
 \leq\sum_{k=1}^{\infty}\gamma_{k}\epsilon_{k}\sum_{i=1}^{\lvert\mathcal{N}\rvert}\lVert v_{i}(\hat{X}_{k})\rVert\lVert X_{i,k}-1/\lvert\mathcal{A}_{i}\rvert\rVert<\infty (64)

Combining the above, the RHS of (52) tends to -\infty almost surely, whereas the Fenchel coupling is nonnegative by Proposition 3; this is a contradiction. It follows that every neighborhood U of x^{\ast} contains \hat{X}_{n} for infinitely many n, and thus there exists a subsequence \hat{X}_{n_{k}} of \hat{X}_{n} that converges to x^{\ast} almost surely. ∎
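The three summability hypotheses of Lemma 1 are easy to check for polynomial schedules. The sketch below assumes \gamma_{k}=k^{-p} and \epsilon_{k}=k^{-q}, a parametrization chosen here only for illustration, and uses the fact that \sum_{k}k^{-s} converges if and only if s>1. The function name is hypothetical.

```python
def schedule_is_admissible(p, q):
    """Check Lemma 1's hypotheses for gamma_k = k**-p and eps_k = k**-q."""
    steps_diverge = p <= 1.0                  # sum of gamma_k is infinite
    bias_summable = p + q > 1.0               # sum of gamma_k * eps_k is finite
    variance_summable = 2.0 * p - q > 1.0     # sum of gamma_k**2 / eps_k is finite
    return steps_diverge and bias_summable and variance_summable


ok_standard = schedule_is_admissible(1.0, 0.5)    # gamma_k = 1/k, eps_k = 1/sqrt(k)
ok_slower = schedule_is_admissible(0.8, 0.4)
too_slow = schedule_is_admissible(0.5, 0.25)      # exploration bias is not summable
```

Under this parametrization the last two conditions require 1-p<q<2p-1, which is only possible for 2/3<p\leq 1.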

One of the cases presented in Theorem 1 has already been treated in [MZ19]. Below we prove that strict Nash equilibria are stochastically asymptotically stable for the case of Payoff-based/Bandit feedback.

###### Theorem 1.

Let x^{\ast} be a strict Nash equilibrium; if (FTRL) is run with Payoff-based/Bandit feedback then x^{\ast} is stochastically asymptotically stable, if we appropriately choose parameters \epsilon_{n},\gamma_{n} satisfying \sum_{k=1}^{\infty}\dfrac{\gamma_{k}^{2}}{\epsilon_{k}}<\infty and \sum_{k=1}^{\infty}\gamma_{k}\epsilon_{k}<\infty.

###### Proof.

Let U_{\varepsilon}^{*}=\left\{y\in\mathcal{Y}:F_{h}(y,x^{\ast})<\varepsilon\right\}; then by Proposition 3, for all x=Q(y) with y\in U_{\varepsilon}^{*} it holds that \lVert x-x^{\ast}\rVert^{2}<2\varepsilon/K. Let now U_{\varepsilon}=\{x:D_{h}(x^{\ast},x)<\varepsilon\}; by Proposition 2 we have \lVert x-x^{\ast}\rVert^{2}<2\varepsilon/K for all x\in U_{\varepsilon}, and by Proposition 3, Q(U_{\varepsilon}^{*})\subseteq U_{\varepsilon} and Q^{-1}(U_{\varepsilon})=U_{\varepsilon}^{*}. Thus U_{\varepsilon} is a neighborhood of x^{\ast} and, whenever y\in U_{\varepsilon}^{*}, x=Q(y)\in U_{\varepsilon}. Furthermore, if x\in U_{\varepsilon}, for each individual player we have

 \lVert(1-\epsilon_{n})x_{i}+\epsilon_{n}/\lvert\mathcal{A}_{i}\rvert-x^{\ast}_{i}\rVert\leq\lVert x_{i}-x^{\ast}_{i}\rVert+\epsilon_{n}\lVert x_{i}-1/\lvert\mathcal{A}_{i}\rvert\rVert<\sqrt{2\varepsilon/K}+\epsilon_{n}\lVert\mathcal{X}\rVert (65)

Let U_{\varepsilon,\epsilon_{1}}=\left\{x:\forall i\in\mathcal{N}\;\lVert x_{i}-x^{\ast}_{i}\rVert<\sqrt{2\varepsilon/K}+\epsilon_{1}\lVert\mathcal{X}\rVert\right\} and pick \varepsilon,\epsilon_{1} such that (VS) holds for all x\in U_{4\varepsilon,\epsilon_{1}}.

From Proposition 3, writing C for the strong convexity constant of h (denoted K there), we have

 F_{h}(Y_{n+1},x^{\ast})\leq F_{h}(Y_{n},x^{\ast})+\gamma_{n}\langle\hat{v}_{n},X_{n}-x^{\ast}\rangle+\dfrac{1}{2C}\gamma_{n}^{2}\lVert\hat{v}_{n}\rVert^{2}_{*} (66)

Since our estimator is unbiased, we rewrite \hat{v}_{n}=v(\hat{X}_{n})+\xi_{n}, where \xi_{n}=\hat{v}_{n}-\operatorname{\mathbb{E}}[\hat{v}_{n}\mid\mathcal{F}_{n}]. Then telescoping the above inequality yields

 F_{h}(Y_{n+1},x^{\ast})\leq F_{h}(Y_{0},x^{\ast})+\sum_{k=1}^{n}\gamma_{k}\langle v(\hat{X}_{k}),X_{k}-x^{\ast}\rangle+\sum_{k=1}^{n}\gamma_{k}\langle\xi_{k},X_{k}-x^{\ast}\rangle+\dfrac{1}{2C}\sum_{k=1}^{n}\gamma_{k}^{2}\lVert\hat{v}_{k}\rVert^{2}_{*} (67)

We will study each term of the inequality separately. Let R_{n}=\dfrac{1}{2C}\sum_{k=1}^{n}\gamma_{k}^{2}\lVert\hat{v}_{k}\rVert^{2}_{*} and F_{n,\varepsilon}=\left\{\sup_{1\leq k\leq n}R_{k}\geq\varepsilon\right\}. Since R_{n} is a submartingale, Doob's maximal inequality ([HH80], Theorem 2.1) yields

 \operatorname{\mathbb{P}}(F_{n,\varepsilon})\leq\dfrac{\operatorname{\mathbb{E}}[R_{n}]}{\varepsilon}\leq\dfrac{\sigma^{2}\lvert\mathcal{N}\rvert\sum_{k=1}^{n}\frac{\gamma_{k}^{2}}{\epsilon_{k}}}{2C\varepsilon} (68)

By demanding \sum_{k=1}^{\infty}\dfrac{\gamma_{k}^{2}}{\epsilon_{k}}\leq\dfrac{C\varepsilon\delta}{\lvert\mathcal{N}\rvert\sigma^{2}}, the event F_{\varepsilon}=\bigcup_{n=1}^{\infty}F_{n,\varepsilon} will occur with probability at most \delta/2.

Now let S_{n}=\sum_{k=1}^{n}\gamma_{k}\langle\xi_{k},X_{k}-x^{\ast}\rangle and E_{n,\varepsilon}=\left\{\sup_{1\leq k\leq n}S_{k}\geq\varepsilon\right\}. Since S_{n} is a martingale, Doob’s maximal inequality yields

 \operatorname{\mathbb{P}}(E_{n,\varepsilon})\leq\dfrac{\operatorname{\mathbb{E}}[\lvert S_{n}\rvert^{2}]}{\varepsilon^{2}}\leq\dfrac{\sigma^{2}\lvert\mathcal{N}\rvert\lVert\mathcal{X}\rVert^{2}\sum_{k=1}^{n}\frac{\gamma_{k}^{2}}{\epsilon_{k}}}{\varepsilon^{2}} (69)

To derive the above upper bound, we define \psi_{n}=\langle\xi_{n},X_{n}-x^{\ast}\rangle; then

 \operatorname{\mathbb{E}}[\lvert\psi_{k}\rvert^{2}\mid\mathcal{F}_{k}]\leq\operatorname{\mathbb{E}}[\lVert\xi_{k}\rVert_{*}^{2}\lVert X_{k}-x^{\ast}\rVert^{2}\mid\mathcal{F}_{k}]
 \leq\operatorname{\mathbb{E}}[\lVert\xi_{k}\rVert^{2}_{*}\mid\mathcal{F}_{k}]\lVert\mathcal{X}\rVert^{2}

where

 \operatorname{\mathbb{E}}[\lVert\xi_{k}\rVert_{*}^{2}\mid\mathcal{F}_{k}]=\operatorname{\mathbb{E}}[\lVert\hat{v}_{k}-\operatorname{\mathbb{E}}[\hat{v}_{k}\mid\mathcal{F}_{k}]\rVert^{2}_{*}\mid\mathcal{F}_{k}]
 =\operatorname{\mathbb{E}}[\lVert\hat{v}_{k}\rVert^{2}_{*}-2\langle\hat{v}_{k},\operatorname{\mathbb{E}}[\hat{v}_{k}\mid\mathcal{F}_{k}]\rangle+\lVert\operatorname{\mathbb{E}}[\hat{v}_{k}\mid\mathcal{F}_{k}]\rVert_{*}^{2}\mid\mathcal{F}_{k}]
 =\operatorname{\mathbb{E}}[\lVert\hat{v}_{k}\rVert^{2}_{*}\mid\mathcal{F}_{k}]-\lVert\operatorname{\mathbb{E}}[\hat{v}_{k}\mid\mathcal{F}_{k}]\rVert^{2}_{*}
 \leq\operatorname{\mathbb{E}}[\lVert\hat{v}_{k}\rVert^{2}_{*}\mid\mathcal{F}_{k}]\leq\lvert\mathcal{N}\rvert\sigma^{2}/\epsilon_{k}

Furthermore, for k\neq l it holds that \operatorname{\mathbb{E}}[\psi_{k}\psi_{l}]=\operatorname{\mathbb{E}}[\operatorname{\mathbb{E}}[\psi_{k}\psi_{l}\mid\mathcal{F}_{k\vee l}]]=0, so that \operatorname{\mathbb{E}}[\lvert S_{n}\rvert^{2}]=\sum_{k=1}^{n}\gamma_{k}^{2}\operatorname{\mathbb{E}}[\lvert\psi_{k}\rvert^{2}].
Thus, if we choose the parameters so that \sum_{k=1}^{\infty}\dfrac{\gamma_{k}^{2}}{\epsilon_{k}}\leq\dfrac{\varepsilon^{2}\delta}{2\sigma^{2}\lvert\mathcal{N}\rvert\lVert\mathcal{X}\rVert^{2}}, we ensure that the event E_{\varepsilon}=\bigcup_{n=1}^{\infty}E_{n,\varepsilon} will occur with probability at most \delta/2.

Furthermore, if \hat{X}_{k} belongs to a neighborhood in which (VS) holds for all 1\leq k\leq n, we have

 \langle v(\hat{X}_{k}),\hat{X}_{k}-x^{\ast}\rangle\leq 0
 \sum_{i=1}^{\lvert\mathcal{N}\rvert}\langle v_{i}(\hat{X}_{k}),\hat{X}_{i,k}-x^{\ast}_{i}\rangle\leq 0

Equivalently,

 \sum_{i=1}^{\lvert\mathcal{N}\rvert}\left[\langle v_{i}(\hat{X}_{k}),X_{i,k}-x^{\ast}_{i}\rangle+\epsilon_{k}\langle v_{i}(\hat{X}_{k}),1/\lvert\mathcal{A}_{i}\rvert-X_{i,k}\rangle\right]\leq 0
 \sum_{i=1}^{\lvert\mathcal{N}\rvert}\langle v_{i}(\hat{X}_{k}),X_{i,k}-x^{\ast}_{i}\rangle\leq\epsilon_{k}\sum_{i=1}^{\lvert\mathcal{N}\rvert}\langle v_{i}(\hat{X}_{k}),X_{i,k}-1/\lvert\mathcal{A}_{i}\rvert\rangle

As a result,

 \sum_{k=1}^{n}\gamma_{k}\langle v(\hat{X}_{k}),X_{k}-x^{\ast}\rangle\leq\sum_{k=1}^{n}\gamma_{k}\epsilon_{k}\sum_{i=1}^{\lvert\mathcal{N}\rvert}\langle v_{i}(\hat{X}_{k}),X_{i,k}-1/\lvert\mathcal{A}_{i}\rvert\rangle (70)

If we control \sum_{k=1}^{n}\gamma_{k}\epsilon_{k}, we can ensure that \sum_{k=1}^{n}\gamma_{k}\langle v(\hat{X}_{k}),X_{k}-x^{\ast}\rangle is always bounded from above by \varepsilon. More specifically,

 \left\lvert\sum_{k=1}^{n}\gamma_{k}\epsilon_{k}\sum_{i=1}^{\lvert\mathcal{N}\rvert}\langle v_{i}(\hat{X}_{k}),X_{i,k}-1/\lvert\mathcal{A}_{i}\rvert\rangle\right\rvert\leq\sum_{k=1}^{n}\gamma_{k}\epsilon_{k}\left\lvert\sum_{i=1}^{\lvert\mathcal{N}\rvert}\langle v_{i}(\hat{X}_{k}),X_{i,k}-1/\lvert\mathcal{A}_{i}\rvert\rangle\right\rvert
 \leq\sum_{k=1}^{n}\gamma_{k}\epsilon_{k}\sum_{i=1}^{\lvert\mathcal{N}\rvert}\lVert v_{i}(\hat{X}_{k})\rVert_{*}\lVert X_{i,k}-1/\lvert\mathcal{A}_{i}\rvert\rVert
 \leq\sum_{k=1}^{n}\gamma_{k}\epsilon_{k}\lvert\mathcal{N}\rvert\lVert\mathcal{Y}\rVert_{*}\lVert\mathcal{X}\rVert\leq\varepsilon

provided that \sum_{k=1}^{n}\gamma_{k}\epsilon_{k}\leq\dfrac{\varepsilon}{\lvert\mathcal{N}\rvert\lVert\mathcal{X}\rVert\lVert\mathcal{Y}\rVert_{*}}.

Assume now that Y_{0}\in U_{\varepsilon}^{*} and thus F_{h}(Y_{0},x^{\ast})<\varepsilon\leq 4\varepsilon. We will prove by induction that Y_{n}\in U_{4\varepsilon}^{*} for all n\geq 1. Suppose that F_{h}(Y_{k},x^{\ast})<4\varepsilon for all 1\leq k\leq n and we will prove that Y_{n+1}\in U_{4\varepsilon}^{*}.
Choose \epsilon_{n},\gamma_{n} such that \sum_{k=1}^{\infty}\dfrac{\gamma_{k}^{2}}{\epsilon_{k}}\leq\min\left\{\dfrac{% \varepsilon^{2}\delta}{V_{*}\lVert\mathcal{X}\rVert^{2}},\dfrac{C\varepsilon% \delta}{\lvert\mathcal{N}\rvert V_{*}}\right\} and \sum_{k=1}^{\infty}\gamma_{k}\epsilon_{k}\leq\dfrac{\varepsilon}{\lvert% \mathcal{N}\rvert\lVert\mathcal{X}\rVert\lVert\mathcal{Y}\rVert_{*}}555Since \gamma_{n},\epsilon_{n}>0 for all n>0 the sum is bounded from above by \varepsilon/\lvert\mathcal{N}\rvert\lVert\mathcal{X}\rVert\lVert\mathcal{Y}% \rVert_{*} for all n>0. If both \bar{E}_{\varepsilon},\bar{F}_{\varepsilon} hold, which happens with probability \operatorname{\mathbb{P}}(\bar{E}_{\varepsilon}\operatorname*{\bigcap}\bar{F}_% {\varepsilon})\geq 1-\delta, from (67) we have F_{h}(Y_{n+1},x^{\ast})<4\varepsilon, . This immediately yields that Y_{n+1}\in U_{4\varepsilon}^{*}, X_{n+1}\in U_{4\varepsilon} and \hat{X}_{n+1}\in U_{4\varepsilon,\epsilon_{1}}. Thus \hat{X}_{n+1}\in U_{4\varepsilon,\epsilon_{1}}, in which variational stability holds, with probability at least 1-\delta.
By Lemma 1, there exists a subsequence \hat{X}_{n_{k}} that converges to x^{\ast}. Hence, the corresponding subsequence X_{n_{k}} of X_{n} also converges to x^{\ast}. By Proposition 3, we have \liminf_{n\to\infty}F_{h}(Y_{n},x^{\ast})=0. To complete the proof, it suffices to show that the limit of F_{h}(Y_{n},x^{\ast}) exists. Since variational stability holds, \langle v(\hat{X}_{n}),\hat{X}_{n}-x^{\ast}\rangle\leq 0. Using Proposition 3 again, we obtain

\begin{align}
F_{h}(Y_{n+1},x^{\ast})
&\leq F_{h}(Y_{n},x^{\ast})+\gamma_{n}\langle v(\hat{X}_{n}),X_{n}-x^{\ast}\rangle+\gamma_{n}\langle\xi_{n},X_{n}-x^{\ast}\rangle+\dfrac{1}{2K}\gamma_{n}^{2}\lVert\hat{v}_{n}\rVert_{*}^{2} \tag{71}\\
&=F_{h}(Y_{n},x^{\ast})+\gamma_{n}\sum_{i=1}^{\lvert\mathcal{N}\rvert}\langle v_{i}(\hat{X}_{n}),\hat{X}_{i,n}-x^{\ast}_{i}+\epsilon_{n}(X_{i,n}-1/\lvert\mathcal{A}_{i}\rvert)\rangle+\gamma_{n}\langle\xi_{n},X_{n}-x^{\ast}\rangle+\dfrac{1}{2K}\gamma_{n}^{2}\lVert\hat{v}_{n}\rVert_{*}^{2} \tag{72}\\
&\leq F_{h}(Y_{n},x^{\ast})+\gamma_{n}\epsilon_{n}\sum_{i=1}^{\lvert\mathcal{N}\rvert}\langle v_{i}(\hat{X}_{n}),X_{i,n}-1/\lvert\mathcal{A}_{i}\rvert\rangle+\gamma_{n}\langle\xi_{n},X_{n}-x^{\ast}\rangle+\dfrac{1}{2K}\gamma_{n}^{2}\lVert\hat{v}_{n}\rVert_{*}^{2}. \tag{73}
\end{align}

Taking conditional expectations with respect to \mathcal{F}_{n}, the noise term vanishes (\xi_{n} has zero conditional mean), so

\begin{align*}
\operatorname{\mathbb{E}}[F_{h}(Y_{n+1},x^{\ast})\mid\mathcal{F}_{n}]
&\leq F_{h}(Y_{n},x^{\ast})+\gamma_{n}\epsilon_{n}\sum_{i=1}^{\lvert\mathcal{N}\rvert}\langle v_{i}(\hat{X}_{n}),X_{i,n}-1/\lvert\mathcal{A}_{i}\rvert\rangle+\dfrac{1}{2K}\gamma_{n}^{2}\operatorname{\mathbb{E}}[\lVert\hat{v}_{n}\rVert_{*}^{2}\mid\mathcal{F}_{n}] \\
&\leq F_{h}(Y_{n},x^{\ast})+\gamma_{n}\epsilon_{n}\sum_{i=1}^{\lvert\mathcal{N}\rvert}\langle v_{i}(\hat{X}_{n}),X_{i,n}-1/\lvert\mathcal{A}_{i}\rvert\rangle+\dfrac{\lvert\mathcal{N}\rvert V_{*}\gamma_{n}^{2}}{2K\epsilon_{n}}.
\end{align*}

Let

\[
R_{n}=F_{h}(Y_{n},x^{\ast})+\sum_{k=n}^{\infty}\gamma_{k}\epsilon_{k}\sum_{i=1}^{\lvert\mathcal{N}\rvert}\langle v_{i}(\hat{X}_{k}),X_{i,k}-1/\lvert\mathcal{A}_{i}\rvert\rangle+\dfrac{\lvert\mathcal{N}\rvert V_{*}}{2K}\sum_{k=n}^{\infty}\dfrac{\gamma_{k}^{2}}{\epsilon_{k}}.
\]

Then

\begin{align}
\operatorname{\mathbb{E}}[R_{n+1}\mid\mathcal{F}_{n}]
&\leq F_{h}(Y_{n},x^{\ast})+\gamma_{n}\epsilon_{n}\sum_{i=1}^{\lvert\mathcal{N}\rvert}\langle v_{i}(\hat{X}_{n}),X_{i,n}-1/\lvert\mathcal{A}_{i}\rvert\rangle+\dfrac{\lvert\mathcal{N}\rvert V_{*}\gamma_{n}^{2}}{2K\epsilon_{n}} \tag{74}\\
&\quad+\sum_{k=n+1}^{\infty}\gamma_{k}\epsilon_{k}\sum_{i=1}^{\lvert\mathcal{N}\rvert}\langle v_{i}(\hat{X}_{k}),X_{i,k}-1/\lvert\mathcal{A}_{i}\rvert\rangle+\dfrac{\lvert\mathcal{N}\rvert V_{*}}{2K}\sum_{k=n+1}^{\infty}\dfrac{\gamma_{k}^{2}}{\epsilon_{k}} \tag{75}\\
&=R_{n}. \tag{76}
\end{align}

Thus R_{n} is an L^{1}-bounded supermartingale (each of the series is bounded), and by ([HH80], Theorem 2.5) R_{n} converges to a finite random variable. Since the tails of both series vanish, F_{h}(Y_{n},x^{\ast}) converges as well. Consequently, \liminf_{n\to\infty}F_{h}(Y_{n},x^{\ast})=\lim_{n\to\infty}F_{h}(Y_{n},x^{\ast})=0 and, by (reciprocity), Q(Y_{n})=X_{n}\to x^{\ast}.
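
The last step, from F_{h}(Y_{n},x^{\ast})\to 0 to X_{n}\to x^{\ast}, can be made concrete in a special case. The sketch below assumes the entropic regularizer (so that Q is the logit map) and a pure target x^{\ast}; both are illustrative assumptions, and the function names are hypothetical. In that case the Fenchel coupling reduces to F_{h}(y,x^{\ast})=\log\sum_{a}e^{y_{a}}-y_{a^{\ast}}=-\log Q(y)_{a^{\ast}}, and Pinsker's inequality gives F_{h}(y,x^{\ast})\geq\tfrac{1}{2}\lVert Q(y)-x^{\ast}\rVert_{1}^{2}, so a vanishing coupling forces Q(y) towards x^{\ast}.

```python
import math


def logit_choice(y):
    """Mirror map Q for the entropic regularizer: softmax of the scores y."""
    m = max(y)
    w = [math.exp(v - m) for v in y]
    total = sum(w)
    return [v / total for v in w]


def fenchel_coupling_pure(y, a_star):
    """F_h(y, x*) for the entropic regularizer when x* is the pure strategy a_star:
    log(sum_a exp(y_a)) - y[a_star], i.e. -log Q(y)[a_star]."""
    m = max(y)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in y))
    return log_sum_exp - y[a_star]


def l1_distance_to_pure(x, a_star):
    """L1 distance between the mixed strategy x and the pure strategy a_star."""
    return sum(abs(v - (1.0 if a == a_star else 0.0)) for a, v in enumerate(x))


# As the score gap in favour of action 0 grows, the coupling and the distance both shrink.
for gap in [0.0, 1.0, 5.0, 20.0]:
    y = [gap, 0.0, 0.0]
    coupling = fenchel_coupling_pure(y, 0)
    distance = l1_distance_to_pure(logit_choice(y), 0)
    print(gap, coupling, distance, coupling >= 0.5 * distance ** 2)
```
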

The above analysis shows that whenever Y_{0}\in U_{\varepsilon}^{*}, and thus X_{0}\in U_{\varepsilon}\cap\operatorname{im}Q, the sequence X_{n} remains in U_{4\varepsilon}\cap\operatorname{im}Q and converges to x^{\ast} with arbitrarily high probability. Hence, x^{\ast} is stochastically asymptotically stable.
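
As an informal numerical companion to this stability statement, the following sketch runs a bandit variant of FTRL with explicit exploration in a small game. Everything specific here is an assumption made for illustration: the 2x2 common-payoff coordination game, the entropic regularizer (exponential weights), the importance-weighted payoff estimator, the power-law schedules \gamma_{n}=\gamma/n^{p} and \epsilon_{n}=\epsilon_{0}/n^{q}, and all function and parameter names. The run starts from scores that favour the strict equilibrium (0, 0). The outcome of a single run depends on the random seed, so it illustrates the statement rather than verifying it.

```python
import math
import random

# Hypothetical common-payoff coordination game (illustrative assumption):
# both players receive 1 at (0, 0), 0.5 at (1, 1), and 0 otherwise,
# so (0, 0) and (1, 1) are strict Nash equilibria.
PAYOFF = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}


def logit_choice(y):
    """Entropic mirror map Q: softmax of the score vector y."""
    m = max(y)
    w = [math.exp(v - m) for v in y]
    total = sum(w)
    return [v / total for v in w]


def run_bandit_ftrl(steps, y0, gamma=0.05, eps0=0.1, p=1.0, q=0.5, seed=0):
    """Run bandit FTRL with explicit exploration; return the final mixed strategies."""
    rng = random.Random(seed)
    scores = [list(y0), list(y0)]  # Y_n, one score vector per player
    for n in range(1, steps + 1):
        gamma_n = gamma / n ** p   # step size (assumed power-law schedule)
        eps_n = eps0 / n ** q      # exploration parameter (assumed power-law schedule)
        mixed = [logit_choice(y) for y in scores]  # X_n = Q(Y_n)
        # Sampling strategy: X_n mixed with the uniform distribution.
        explored = [[(1 - eps_n) * x + eps_n / len(xi) for x in xi] for xi in mixed]
        actions = [0 if rng.random() < xh[0] else 1 for xh in explored]
        payoff = PAYOFF[(actions[0], actions[1])]
        for i, a in enumerate(actions):
            # Importance-weighted estimate: only the played action's score changes.
            scores[i][a] += gamma_n * payoff / explored[i][a]
    return [logit_choice(y) for y in scores]


final = run_bandit_ftrl(steps=2000, y0=[2.0, 0.0])
print(final)
```

Repeating the run over several seeds and initial scores gives a rough empirical sense of how often the iterates stay near the equilibrium they started from.
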

## 4. Conclusion

In this work, we examined the stochastic asymptotic stability of Nash equilibria under the broad class of “Follow The Regularized Leader” (FTRL) algorithms across different types of feedback. From the noisy first-order setting, where each player observes a slightly perturbed version of the whole payoff vector, to the bandit case, where the agents may not even know they are playing a game, we showed that only strict Nash equilibria are stochastically asymptotically stable with arbitrarily high probability. In particular, we first established a strong impossibility result for convergence to mixed Nash equilibria under any discrete-time FTRL implementation, for all the types of payoff feedback examined. Although equilibration to mixed Nash equilibria is ruled out, we proved that this is not the case for strict Nash equilibria: under standard hyperparameter choices, they are stable and attracting with arbitrarily high probability.

This work gives rise to several open problems. First, extending these asymptotic results to more general families of games, such as games with continuous action sets, and examining possible generalizations to optimistic variants of FTRL are fascinating questions. Similar dichotomies have recently been proven for continuous-time, deterministic FTRL dynamics [FVGL+20]. It would be interesting to examine whether such dynamics could enhance mixed Nash equilibration in stochastic or Brownian settings as well. Additionally, since our result reinforces the belief that only strict (and hence, pure) Nash equilibria can emerge as stable single-point limits of this kind of no-regret process, it would be interesting to explore the topology of more complex limiting structures, such as limit cycles or even more complicated asymptotically stable sets.

## Acknowledgments

This research was partially supported by the COST Action CA16228 “European Network for Game Theory” (GAMENET), the French National Research Agency (ANR) under grant ALIAS, and the Onassis Foundation under Scholarship ID: F ZN 010-1/2017-2018.

E.V. Vlatakis-Gkaragkounis is grateful to be supported by NSF grants CCF-1703925, CCF-1763970, CCF-1814873, CCF-1563155, and by the Simons Collaboration on Algorithms and Geometry.

P. Mertikopoulos is grateful for financial support by the French National Research Agency (ANR) in the framework of the “Investissements d’avenir” program (ANR-15-IDEX-02), the LabEx PERSYVAL (ANR-11-LABX-0025-01), and MIAI@Grenoble Alpes (ANR-19-P3IA-0003).

## References
