Response-Based Approachability and its Application to Generalized No-Regret Algorithms
Approachability theory, introduced by Blackwell (1956), provides fundamental results on repeated games with vector-valued payoffs, and has since been usefully applied in the theory of learning in games and to learning algorithms in the online adversarial setup. Given a repeated game with vector payoffs, a target set is approachable by a certain player (the agent) if he can ensure that the average payoff vector converges to that set no matter what his adversarial opponent does. Blackwell provided two equivalent sets of conditions for a convex set to be approachable. The first (primal) condition is a geometric separation condition, while the second (dual) condition requires that the set be non-excludable, namely that for every mixed action of the opponent there exists a mixed action of the agent (a response) such that the resulting payoff vector belongs to the target set. Existing approachability algorithms rely on the primal condition and essentially require computing, at each stage, a projection direction from a given point to the target set. In this paper, we introduce an approachability algorithm that relies on Blackwell’s dual condition. Thus, rather than projection, the algorithm relies on computation of the response to a certain action of the opponent at each stage. The utility of the proposed algorithm is demonstrated by applying it to certain generalizations of the classical regret minimization problem, which include regret minimization with side constraints and regret minimization for global cost functions. In these problems, computation of the required projections is generally complex but a response is readily obtainable.
Consider a repeated matrix game with vector-valued rewards that is played by two players, the agent and the adversary or opponent. In a learning context the agent may represent the learning algorithm, while the adversary stands for an arbitrary or unpredictable learning environment. For each pair of simultaneous actions $a$ and $z$ (of the agent and the opponent, respectively) in the one-stage game, a reward vector $r(a,z) \in \mathbb{R}^\ell$, $\ell \geq 1$, is obtained. In Blackwell’s approachability problem (Blackwell, 1956), the agent’s goal is to ensure that the long-term average reward vector approaches a given target set $S$, namely converges to $S$ almost surely in the point-to-set distance. If that convergence can be ensured irrespective of the opponent’s strategy, the set $S$ is said to be approachable, and a strategy of the agent that satisfies this property is an approaching strategy (or algorithm) for $S$.
Blackwell’s approachability results have been broadly used in theoretical work on learning in games, including equilibrium analysis in repeated games with incomplete information (Aumann and Maschler, 1995), calibrated forecasting (Foster, 1999), and convergence to correlated equilibria (Hart and Mas-Colell, 2000). The earliest application, however, concerned the notion of regret minimization, or no-regret strategies, that was introduced in Hannan (1957). Even before Hannan’s paper was published, it was shown in Blackwell (1954) that regret minimization can be formulated as a particular approachability problem, which led to a distinct class of no-regret strategies. More recently, approachability was used in Rustichini (1999) to prove an extended no-regret result for games with imperfect monitoring, while Hart and Mas-Colell (2001) proposed an alternative formulation of no-regret as an approachability problem (see Section 2). An extensive overview of approachability and no-regret in the context of learning in games can be found in Fudenberg and Levine (1998), Young (2004), and Cesa-Bianchi and Lugosi (2006). The latter monograph also makes the connection to the modern theory of on-line learning and prediction algorithms. In a somewhat different direction, approachability theory was applied in Mannor and Shimkin (2004) to a problem of multi-criterion reinforcement learning in an arbitrarily-varying environment.
Standard approachability algorithms require, at each stage of the game, the computation of the direction from the current average reward vector to a closest point in the target set $S$. This is implied by Blackwell’s primal geometric separation condition, which is a sufficient condition for approachability of a target set. For convex sets, this step is equivalent to computing the projection direction of the average reward onto $S$. In this paper, we introduce an approachability algorithm that avoids this projection computation step. Instead, the algorithm relies on the availability of a response map that assigns to each mixed action $q$ of the opponent a mixed action $p$ of the agent so that $r(p,q)$, the expected reward vector under these two mixed actions, is in $S$. Existence of such a map is based on Blackwell’s dual condition, which is also a necessary and sufficient condition for approachability of a convex target set.
The idea of constructing an approachable set in terms of a general response map was employed in Lehrer and Solan (2007) (updated in Lehrer and Solan (2013)), in the context of internal no-regret strategies. An explicit approachability algorithm which is based on computing the response to calibrated forecasts of the opponent’s actions has been proposed in Perchet (2009), and further analyzed in Bernstein et al. (2013). However, the algorithms in these papers are essentially based on the computation of calibrated forecasts of the opponent’s actions, a task which is known to be computationally hard (Hazan and Kakade, 2012). In contrast, the algorithm proposed in the present paper operates strictly in the payoff space, similarly to Blackwell’s approachability algorithm.
The main motivation for the proposed algorithm comes from certain generalizations of the basic no-regret problem, where the set to be approached is complex so that computing the projection direction may be hard, while the response map is explicit by construction. These generalizations include the constrained regret minimization problem (Mannor et al., 2009), regret minimization with global cost functions (Even-Dar et al., 2009), regret minimization in variable duration repeated games (Mannor and Shimkin, 2008), and regret minimization in stochastic game models (Mannor and Shimkin, 2003). In these cases, the computation of a response reduces to computing a best-response in the underlying regret minimization problem, and hence can be carried out efficiently. The application of our algorithm to some of these problems is discussed in Section 5 of this paper.
The paper proceeds as follows. In Section 2 we review the approachability problem and existing approachability algorithms, and illustrate the formulation of standard no-regret problems as approachability problems. Section 3 presents our basic algorithm and establishes its approachability properties. In Section 4, we provide an interpretation of certain aspects of the proposed algorithm, and propose some variants and extensions to the basic algorithm. Section 5 applies the proposed algorithms to generalized no-regret problems, including constrained regret minimization and online learning with global cost functions. We conclude the paper in Section 6.
2 Review of Approachability and Related No-Regret Algorithms
In this section, we present the approachability problem and review Blackwell’s approachability conditions. We further discuss existing approachability algorithms, and illustrate the application of the approachability framework to classical regret minimization problems.
2.1 Approachability Theory
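Consider a repeated two-person matrix game, played between an agent and an arbitrary opponent. The agent chooses its actions from a finite set $\mathcal{A}$, while the opponent chooses its actions from a finite set $\mathcal{Z}$. At each time instance $n = 1, 2, \dots$, the agent selects its action $a_n \in \mathcal{A}$, observes the action $z_n \in \mathcal{Z}$ chosen by the opponent, and obtains a vector-valued reward $R_n = r(a_n, z_n) \in \mathbb{R}^\ell$, where $\ell \geq 1$, and $r: \mathcal{A} \times \mathcal{Z} \to \mathbb{R}^\ell$ is a given reward function. The average reward vector obtained by the agent up to time $n$ is then $\bar{R}_n = \frac{1}{n}\sum_{k=1}^{n} R_k$. A mixed action of the agent is a probability vector $p \in \Delta(\mathcal{A})$, where $p(a)$ specifies the probability of choosing action $a \in \mathcal{A}$, and $\Delta(\mathcal{A})$ denotes the set of probability vectors over $\mathcal{A}$. Similarly, $q \in \Delta(\mathcal{Z})$ denotes a mixed action of the opponent. Let $\bar{q}_n \in \Delta(\mathcal{Z})$ denote the empirical distribution of the opponent’s actions at time $n$, namely
$$ \bar{q}_n(z) = \frac{1}{n}\sum_{k=1}^{n} \mathbb{I}\{z_k = z\}, \quad z \in \mathcal{Z}, $$
where $\mathbb{I}\{\cdot\}$ is the indicator function. Further define the Euclidean span of the reward vector as
$$ \rho = \max_{a, z, a', z'} \| r(a,z) - r(a',z') \|, \qquad (1) $$
where $\|\cdot\|$ is the Euclidean norm. The inner product between two vectors $v \in \mathbb{R}^\ell$ and $w \in \mathbb{R}^\ell$ is denoted by $v \cdot w$.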
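In what follows, we find it convenient to use the notation
$$ r(p,q) = \sum_{a \in \mathcal{A}} \sum_{z \in \mathcal{Z}} p(a)\, q(z)\, r(a,z) $$
for the expected reward under mixed actions $p \in \Delta(\mathcal{A})$ and $q \in \Delta(\mathcal{Z})$; the distinction between $r(a,z)$ and $r(p,q)$ should be clear from their arguments. Occasionally, we will use $r(p,z)$ for the expected reward under mixed action $p$ and pure action $z$. The notation $r(a,q)$ is to be interpreted similarly.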
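The notation of this subsection can be made concrete with a short numerical sketch. It assumes, purely for illustration, that the reward function is stored as an array `r[a, z, :]` and that actions are integer indices; the function names are hypothetical and not taken from the paper.

```python
import numpy as np

def expected_reward(r, p, q):
    """Expected reward vector r(p, q) = sum_a sum_z p(a) q(z) r(a, z)."""
    return np.einsum("a,z,azl->l", p, q, r)

def empirical_distribution(z_seq, num_z):
    """Empirical distribution of the opponent's observed (integer) actions."""
    return np.bincount(z_seq, minlength=num_z) / len(z_seq)

def euclidean_span(r):
    """Largest Euclidean distance between any two one-stage reward vectors."""
    flat = r.reshape(-1, r.shape[-1])
    diffs = flat[:, None, :] - flat[None, :, :]
    return float(np.linalg.norm(diffs, axis=-1).max())

# Toy game (assumed data): 2 agent actions, 2 opponent actions, 2-dim rewards.
r = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 0.0], [1.0, 1.0]]])
p = np.array([0.5, 0.5])
q = empirical_distribution([0, 1, 1, 1], num_z=2)   # -> [0.25, 0.75]
print(expected_reward(r, p, q), euclidean_span(r))
```

Storing the rewards as a dense array keeps the expected reward a single tensor contraction, which is convenient for the small finite games considered here.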
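Let $h_n = (a_1, z_1, \dots, a_n, z_n) \in (\mathcal{A} \times \mathcal{Z})^n$ denote the history of the game up to (and including) time $n$. A strategy $\pi = (\pi_n)$ of the agent is a collection of decision rules $\pi_n: (\mathcal{A} \times \mathcal{Z})^{n-1} \to \Delta(\mathcal{A})$, $n \geq 1$, where each mapping $\pi_n$ specifies a mixed action $p_n = \pi_n(h_{n-1})$ for the agent at time $n$. The agent’s pure action $a_n$ is sampled from $p_n$. Similarly, the opponent’s strategy is denoted by $\sigma = (\sigma_n)$, with $\sigma_n: (\mathcal{A} \times \mathcal{Z})^{n-1} \to \Delta(\mathcal{Z})$. Let $\mathbb{P}^{\pi,\sigma}$ denote the probability measure on $(\mathcal{A} \times \mathcal{Z})^{\infty}$ induced by the strategy pair $(\pi, \sigma)$.
Let $S \subseteq \mathbb{R}^\ell$ be a given target set. Below is the classical definition of an approachable set from Blackwell (1956).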
Definition 1 (Approachable Set)
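A closed set $S \subseteq \mathbb{R}^\ell$ is approachable by the agent’s strategy $\pi$ if the average reward $\bar{R}_n$ converges to $S$ in the Euclidean point-to-set distance $d(\cdot, S)$, almost surely for every strategy $\sigma$ of the opponent, at a uniform rate over all strategies $\sigma$ of the opponent. That is, for every $\epsilon > 0$ there is an integer $N$ such that, for every strategy $\sigma$ of the opponent,
$$ \mathbb{P}^{\pi,\sigma}\left\{ d(\bar{R}_n, S) \geq \epsilon \ \text{for some } n \geq N \right\} \leq \epsilon. $$
The set $S$ is approachable if there exists such a strategy for the agent.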
In the sequel, we find it convenient to state most of our results in terms of the expected average reward, where expectation is applied only to the agent’s mixed actions:
$$ \bar{r}_n = \frac{1}{n}\sum_{k=1}^{n} r(p_k, z_k). $$
With this modified reward, the stated convergence results will be shown to hold pathwise, for any possible sequence of the opponent’s actions. See, e.g., Theorem 6, where we show that $d(\bar{r}_n, S) \leq \rho/\sqrt{n}$ for all $n \geq 1$. The corresponding almost sure convergence for the actual average reward $\bar{R}_n$ can be easily deduced using martingale convergence theory. Indeed, note that
$$ d(\bar{R}_n, S) \leq \| \bar{R}_n - \bar{r}_n \| + d(\bar{r}_n, S). $$
But the first term is the norm of the mean of the vector martingale difference sequence . This can be easily shown to converge to zero at a uniform rate of , using standard results (e.g., from Shiryaev (1995)); see for instance Shimkin and Shwartz (1993), Proposition 4.1. In particular, it can be shown that there exists a finite constant so that for each
with probability at least .
Next, we present a formulation of Blackwell’s results (Blackwell, 1956) which provides us with conditions for approachability of general and convex sets. To this end, for any $x \notin S$, let $c(x) \in \arg\min_{s \in S} \|x - s\|$ denote a closest point in $S$ to $x$. Also, for any $p \in \Delta(\mathcal{A})$ let $T(p) = \{ r(p,q) : q \in \Delta(\mathcal{Z}) \}$, which coincides with the convex hull of the vectors $\{ r(p,z) \}_{z \in \mathcal{Z}}$.
Definition 2
(i) B-sets: A closed set $S \subseteq \mathbb{R}^\ell$ will be called a B-set (where B stands for Blackwell) if for every $x \notin S$ there exists a mixed action $p = p(x) \in \Delta(\mathcal{A})$ such that the hyperplane through $c(x)$ perpendicular to the line segment from $x$ to $c(x)$ separates $x$ from $T(p)$.
(ii) D-sets: A closed set $S \subseteq \mathbb{R}^\ell$ will be called a D-set (where D stands for Dual) if for every $q \in \Delta(\mathcal{Z})$ there exists a mixed action $p \in \Delta(\mathcal{A})$ so that $r(p,q) \in S$. We shall refer to such $p$ as an $S$-response (or just response) of the agent to $q$.
Theorem 3
Primal Condition and Algorithm. A B-set $S$ is approachable, by using at time $n$ the mixed action $p(\bar{R}_{n-1})$ whenever $\bar{R}_{n-1} \notin S$. If $\bar{R}_{n-1} \in S$, an arbitrary action can be used.
Dual Condition. A closed set $S$ is approachable only if it is a D-set.
Convex Sets. Let $S$ be a closed convex set. Then, the following statements are equivalent: (a) $S$ is approachable, (b) $S$ is a B-set, (c) $S$ is a D-set.
The convex hull of a D-set is approachable (and is also a B-set).
Proof The convex hull of a D-set is a convex D-set. The claim then follows by Theorem 3.
Since Blackwell’s original construction, some other approachability algorithms that are based on similar geometric ideas have been proposed in the literature. Hart and Mas-Colell (2001) proposed a class of approachability algorithms that use a general steering direction with separation properties. As shown there, this is essentially equivalent to the computation of the projection to the target set in some norm. When Euclidean norm is used, the resulting algorithm is equivalent to Blackwell’s original scheme. Recently, Abernethy et al. (2012) proposed an elegant scheme which generates the required steering directions through a no-regret algorithm (in the online convex programming framework). We provide in the Appendix a somewhat simplified version of that algorithm which is meant to clarify the geometric basis of the algorithm, which involves the support function of the target set.
We mention in passing some additional theoretical results and extensions. Vieille (1992) studied the weaker notions of weak approachability and excludability, and showed that these notions are complementary even for non-convex sets. Spinat (2002) formulated a necessary and sufficient condition for approachability of general (not necessarily convex) sets. In Shimkin and Shwartz (1993) and Milman (2006), approachability was extended to stochastic (Markov) game models. An extension of approachability theory to infinite dimensional reward spaces was carried out in Lehrer (2002), while Lehrer and Solan (2009) considered approachability strategies with bounded memory.
Recently, Mannor et al. (2011) proposed a robust approachability algorithm for repeated games with partial monitoring and applied it to the corresponding regret minimization problem.
In all these papers, at each time step, either the computation of the projection to the target set, or that of a steering direction with separation properties is required.
2.2 Approachability and No-Regret Algorithms
We next present the problem of regret minimization in repeated matrix games, and show how these problems can be formulated in terms of approachability with an appropriately defined reward vector and target set. We start with Blackwell’s original formulation, and proceed to the alternative one by Hart and Mas-Colell (2001). In the final subsection, we consider briefly the more elaborate problem of internal regret minimization. We will mainly emphasize the role of the dual condition and the simple computation of the response for these problems, and refer to the respective references for details of the (primal) resulting algorithms.
Consider, as before, the agent that faces an arbitrarily varying environment (the opponent). The repeated game model is the same as above, except that the vector reward function $r$ is replaced by a scalar reward (or utility) function $u: \mathcal{A} \times \mathcal{Z} \to \mathbb{R}$. Let $\bar{U}_n = \frac{1}{n}\sum_{k=1}^{n} u(a_k, z_k)$ denote the average reward by time $n$, and let
$$ U^*_n(z_1, \dots, z_n) = \max_{a \in \mathcal{A}} \frac{1}{n}\sum_{k=1}^{n} u(a, z_k) $$
denote the best reward-in-hindsight of the agent after observing $z_1, \dots, z_n$. That is, $U^*_n$ is the maximal average reward the agent could obtain at time $n$ if he knew the opponent’s actions beforehand and used a single fixed action. It is not hard to see that the best reward-in-hindsight can be defined as a convex function $u^*$ of the empirical distribution $\bar{q}_n$ of the opponent’s actions:
$$ U^*_n(z_1, \dots, z_n) = u^*(\bar{q}_n), \qquad u^*(q) = \max_{a \in \mathcal{A}} u(a, q). $$
This motivates the definition of the average regret as $u^*(\bar{q}_n) - \bar{U}_n$, and the following definition of a no-regret algorithm:
Definition 5 (No-Regret Algorithm)
We say that a strategy of the agent is a no-regret algorithm (also termed a Hannan consistent strategy) if
$$ \limsup_{n \to \infty} \left( u^*(\bar{q}_n) - \bar{U}_n \right) \leq 0 $$
almost surely, for any strategy of the opponent.
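The quantities entering this definition are simple to compute from a play history. The following is a minimal sketch, assuming the utility function is stored as a matrix `u[a, z]` and actions are integer indices; the function names and the toy data are illustrative assumptions only.

```python
import numpy as np

def best_reward_in_hindsight(u, q_bar):
    """Maximum over fixed actions of the expected utility against q_bar.
    As a maximum of linear functions of q_bar, it is convex in q_bar."""
    return float(np.max(u @ q_bar))

def average_regret(u, a_seq, z_seq):
    """Best reward-in-hindsight minus the average reward actually obtained."""
    a_seq = np.asarray(a_seq)
    z_seq = np.asarray(z_seq)
    q_bar = np.bincount(z_seq, minlength=u.shape[1]) / len(z_seq)
    avg_reward = float(np.mean(u[a_seq, z_seq]))
    return best_reward_in_hindsight(u, q_bar) - avg_reward

# Assumed toy utility matrix and play history.
u = np.array([[1.0, 0.0],
              [0.0, 1.0]])
a_seq = [0, 0, 1, 1]
z_seq = [1, 1, 1, 0]
print(average_regret(u, a_seq, z_seq))
```

In this toy history the opponent plays action 1 three times out of four, so the best fixed action earns 0.75 on average while the agent earned 0.25.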
2.2.1 Blackwell’s Formulation
Following Hannan’s seminal paper, Blackwell (1954) used approachability theory in order to elegantly show the existence of regret minimizing algorithms. Define the vector-valued rewards $R_n = (u(a_n, z_n), \mathbf{1}(z_n)) \in \mathbb{R} \times \Delta(\mathcal{Z})$, where $\mathbf{1}(z)$ is the probability vector in $\Delta(\mathcal{Z})$ supported on $z$. The corresponding average reward is then $\bar{R}_n = (\bar{U}_n, \bar{q}_n)$. Finally, define the target set
$$ S = \{ (u, q) \in \mathbb{R} \times \Delta(\mathcal{Z}) : u \geq u^*(q) \}. $$
It is easily verified that this set is a D-set: by construction, for each $q$ there exists an $S$-response $p \in \arg\max_{p'} u(p', q)$ so that $r(p,q) = (u(p,q), q) \in S$, namely $u(p,q) \geq u^*(q)$. Also, $S$ is a convex set by the convexity of $u^*(q)$ in $q$. Hence, by Theorem 3, $S$ is approachable, and by the continuity of $u^*(q)$, an algorithm that approaches $S$ also minimizes the regret in the sense of Definition 5. Application of Blackwell’s approachability strategy to the set $S$ therefore results in a no-regret algorithm. We note that the required projection of the average reward vector onto $S$ is somewhat implicit in this formulation.
2.2.2 Regret Matching
An alternative formulation, proposed in Hart and Mas-Colell (2001), leads to a simple and explicit no-regret algorithm for this problem. Let
$$ L_n(a') = \frac{1}{n}\sum_{k=1}^{n} \left( u(a', z_k) - u(a_k, z_k) \right) $$
denote the regret accrued due to not using action $a'$ constantly up to time $n$. The no-regret requirement in Definition 5 is then equivalent to
$$ \limsup_{n \to \infty} L_n(a) \leq 0, \quad a \in \mathcal{A}, $$
almost surely, for any strategy of the opponent. In turn, this goal is equivalent to the approachability of the non-positive orthant $S = \mathbb{R}^{\mathcal{A}}_{-}$ in the game with vector payoff $r = (r_{a'})_{a' \in \mathcal{A}}$, defined as $r_{a'}(a, z) = u(a', z) - u(a, z)$.
To verify the dual condition, observe that $r_{a'}(p, q) = u(a', q) - u(p, q)$. Choosing $p \in \arg\max_{p'} u(p', q)$ clearly ensures $r(p,q) \in S$, hence $p$ is an $S$-response to $q$ (in the sense of Definition 2(ii)), and $S$ is a D-set. Note that the response here can always be taken as a pure action.
It was shown in Hart and Mas-Colell (2001) that application of Blackwell’s approachability strategy in this formulation leads to the so-called regret matching algorithm, where the probability of action $a$ at time step $n$ is given by:
$$ p_n(a) = \frac{[L_{n-1}(a)]_+}{\sum_{a' \in \mathcal{A}} [L_{n-1}(a')]_+}, $$
with an arbitrary mixed action used when the denominator is zero. Here, $[x]_+ = \max\{x, 0\}$. In fact, using their generalization of Blackwell’s approachability strategies, the authors of that paper obtained a whole class of no-regret algorithms with different weighting of the components of $L_{n-1}$.
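The regret matching rule can be sketched in a few lines. The code below is an illustration under stated assumptions: the utility is a matrix `u[a, z]`, cumulative rather than average regrets are tracked (the normalization cancels in the ratio), and a uniform action is used when no regret is positive, which is one admissible arbitrary choice. The function names are hypothetical.

```python
import numpy as np

def regret_matching_action(cum_regret):
    """Mixed action proportional to the positive parts of the regrets."""
    pos = np.maximum(cum_regret, 0.0)
    total = pos.sum()
    if total <= 0.0:
        # No positive regret: any mixed action is allowed; uniform is one choice.
        return np.full(len(cum_regret), 1.0 / len(cum_regret))
    return pos / total

def run_regret_matching(u, z_seq, seed=0):
    """Play regret matching against a fixed opponent sequence.
    Returns the vector of average regrets, one entry per fixed action."""
    rng = np.random.default_rng(seed)
    num_a = u.shape[0]
    cum_regret = np.zeros(num_a)
    for z in z_seq:
        p = regret_matching_action(cum_regret)
        a = rng.choice(num_a, p=p)            # sample the pure action a_n from p_n
        cum_regret += u[:, z] - u[a, z]       # regret of each fixed action at this step
    return cum_regret / len(z_seq)

u = np.array([[1.0, 0.0],
              [0.0, 1.0]])                    # assumed toy utility matrix
avg_regret = run_regret_matching(u, [0, 1] * 10)
print(avg_regret)
```

Because the agent's action is sampled, the resulting regrets are random; the no-regret guarantee concerns their almost sure limit, not any single short run.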
2.2.3 Internal Regret
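We close this section with another application of approachability to the stronger notion of internal regret. Given a pair of different actions $a, a' \in \mathcal{A}$, suppose the agent were to replace action $a$ with $a'$ every time $a$ was played in the past. His reward at time $k$ would become:
$$ W_k(a, a') = \begin{cases} u(a', z_k), & a_k = a, \\ u(a_k, z_k), & \text{otherwise}. \end{cases} $$
The internal average regret of the agent for not playing $a'$ instead of $a$ is then given by
$$ I_n(a, a') = \frac{1}{n}\sum_{k=1}^{n} \left( W_k(a, a') - u(a_k, z_k) \right). $$
A no-internal-regret strategy must ensure that
$$ \limsup_{n \to \infty} \max_{a, a' \in \mathcal{A}} I_n(a, a') \leq 0. $$
To show existence of such strategies, define the vector-valued reward function $r(a, z) \in \mathbb{R}^{\mathcal{A} \times \mathcal{A}}$ by setting its $(a', a'')$ coordinate, $a' \neq a''$, to
$$ r_{a', a''}(a, z) = \mathbb{I}\{a = a'\} \left( u(a'', z) - u(a', z) \right). $$
Internal no-regret is then equivalent to approachability of the negative quadrant $S = \{ r : r \leq 0 \}$. It is easy to verify that $S$ is a D-set, by pointing out the response map: Given a mixed action $q$ of the opponent, choosing $a^* \in \arg\max_{a \in \mathcal{A}} u(a, q)$ clearly results in $r(a^*, q) \in S$. Therefore, by Theorem 3, the set $S$ is approachable.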
The formulation of internal no-regret as the approachability problem above, along with explicit approaching strategies, is due to Hart and Mas-Colell (2000). The importance of internal regret in game theory stems from the fact that if each player in a repeated multi-player game uses such a no-internal-regret strategy, then the empirical distribution of the players’ actions converges to the set of correlated equilibria. Some interesting relations between internal and external (Hannan’s) regret are discussed in Blum and Mansour (2007).
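To make the internal regret and the associated response concrete, the sketch below computes the matrix of internal average regrets from a play history, and checks that a pure best response to a mixed action of the opponent yields a componentwise non-positive expected reward vector. It assumes a utility matrix `u[a, z]` with integer actions; names and data are illustrative only.

```python
import numpy as np

def internal_regret(u, a_seq, z_seq):
    """Matrix I[a, b]: average gain from having played b whenever a was played."""
    num_a = u.shape[0]
    regret = np.zeros((num_a, num_a))
    for a, z in zip(a_seq, z_seq):
        # Only the row of the action actually played is affected at this step.
        regret[a, :] += u[:, z] - u[a, z]
    return regret / len(a_seq)

def reward_of_best_response(u, q):
    """Expected internal-regret reward vector when a best response to q is played."""
    vals = u @ q                               # expected utility of each pure action
    a_star = int(np.argmax(vals))
    reward = np.zeros((len(vals), len(vals)))
    reward[a_star, :] = vals - vals[a_star]    # all entries are <= 0 by optimality
    return reward

u = np.array([[1.0, 0.0],
              [0.0, 1.0]])                     # assumed toy utility matrix
I = internal_regret(u, a_seq=[0, 0, 1, 1], z_seq=[1, 1, 1, 0])
print(I)
print(reward_of_best_response(u, np.array([0.3, 0.7])))
```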
3 Response-Based Approachability
In this section, we present our basic algorithm and establish its approachability properties.
Throughout the paper, we consider a target set $S$ that satisfies the following assumption.
Assumption 1
The set $S$ is a convex and approachable set. Hence, by Theorem 3, $S$ is a D-set: For all $q \in \Delta(\mathcal{Z})$ there exists an $S$-response $p \in \Delta(\mathcal{A})$ such that $r(p,q) \in S$.
Under this assumption, we may define a response map $p^*: \Delta(\mathcal{Z}) \to \Delta(\mathcal{A})$ that assigns to each mixed action $q$ a response $p^*(q)$ so that $r(p^*(q), q) \in S$.
We note that in some cases of interest, including those discussed in Section 5, the target $S$ may itself be defined through an appropriate response map. Suppose that for each $q \in \Delta(\mathcal{Z})$, we are given a response $p^*(q)$, devised so that $r(p^*(q), q)$ satisfies some desired properties. Then the set $S = \mathrm{conv}\{ r(p^*(q), q) : q \in \Delta(\mathcal{Z}) \}$ is, by construction, a convex D-set, hence approachable.
We next present our main results and the basic form of the related approachability algorithm. The general idea is the following. By resorting to the response map, we create a specific sequence of target points $r^*_n$ with $r^*_n \in S$. Letting
$$ \bar{r}^*_n = \frac{1}{n}\sum_{k=1}^{n} r^*_k $$
denote the $n$-step average target point, it follows that $\bar{r}^*_n \in S$ by convexity of $S$. Finally, the agent’s actions are chosen so that the difference $\bar{r}^*_n - \bar{r}_n$ converges to zero, implying that $\bar{r}_n$ converges to $S$.
Let
$$ \lambda_n = \bar{r}^*_n - \bar{r}_n $$
denote the difference between the average target vector and the average reward vector.
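Theorem 6
Let $\lambda_0 = 0$. Suppose that at each time step $n \geq 1$, the agent chooses its mixed action $p_n$ (from which $a_n$ is sampled) and two additional mixed actions $q^*_n \in \Delta(\mathcal{Z})$ and $p^*_n \in \Delta(\mathcal{A})$ as follows:
(i) $p_n$ and $q^*_n$ are equilibrium strategies in the zero-sum game with payoff matrix defined by $r(a,z)$ projected in the direction $\lambda_{n-1}$, namely,
$$ p_n \in \arg\max_{p \in \Delta(\mathcal{A})} \min_{q \in \Delta(\mathcal{Z})} \lambda_{n-1} \cdot r(p,q), \qquad (8) $$
$$ q^*_n \in \arg\min_{q \in \Delta(\mathcal{Z})} \max_{p \in \Delta(\mathcal{A})} \lambda_{n-1} \cdot r(p,q). \qquad (9) $$
(ii) $p^*_n$ is chosen as an $S$-response to $q^*_n$, so that $r(p^*_n, q^*_n) \in S$; set $r^*_n = r(p^*_n, q^*_n)$.
Then
$$ d(\bar{r}_n, S) \leq \| \lambda_n \| \leq \frac{\rho}{\sqrt{n}}, \quad n \geq 1, $$
for any strategy of the opponent.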
Observe that the required choice of $p^*_n$ as an $S$-response to $q^*_n$ is possible due to our standing Assumption 1. The conclusion of this theorem clearly implies that the set $S$ is approached by the specified strategy, and provides an explicit bound on the rate of convergence. The approachability algorithm implied by Theorem 6 is summarized in Algorithm 1.
The computational requirements of Algorithm 1 are as follows. The algorithm has two major computations at each time step $n$:
The computation of the pair $(p_n, q^*_n)$: the equilibrium strategies in the zero-sum matrix game with the reward function $\lambda_{n-1} \cdot r(a,z)$. This boils down to the solution of the related primal and dual linear programs, and hence can be done efficiently. Note that, given the vector $\lambda_{n-1}$, this computation does not involve the target set $S$.
The computation of the target point $r^*_n = r(p^*_n, q^*_n)$, which is problem dependent. For example, in the constrained regret minimization problem this reduces to the computation of a best-response action to $q^*_n$. This problem is further discussed in Section 5.
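The first of these two computations can be sketched with an off-the-shelf linear programming solver. The code below uses `scipy.optimize.linprog` to solve a zero-sum matrix game for both players; the payoff matrix `M[a, z]` stands for the reward function projected onto a given direction. The function name and the matching-pennies test matrix are assumptions for illustration, not part of the paper.

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(M):
    """Return (p, q, value) for the matrix game M[a, z], where the row player
    (the agent) maximizes p^T M q and the column player minimizes it."""
    num_a, num_z = M.shape
    # Agent: maximize v subject to (p^T M)_z >= v for all z, p in the simplex.
    res_p = linprog(
        c=np.r_[np.zeros(num_a), -1.0],                  # minimize -v
        A_ub=np.c_[-M.T, np.ones(num_z)],                # v - (p^T M)_z <= 0
        b_ub=np.zeros(num_z),
        A_eq=np.r_[np.ones(num_a), 0.0][None, :],        # sum of p equals 1
        b_eq=[1.0],
        bounds=[(0, None)] * num_a + [(None, None)],
        method="highs",
    )
    # Opponent: minimize w subject to (M q)_a <= w for all a, q in the simplex.
    res_q = linprog(
        c=np.r_[np.zeros(num_z), 1.0],
        A_ub=np.c_[M, -np.ones(num_a)],                  # (M q)_a - w <= 0
        b_ub=np.zeros(num_a),
        A_eq=np.r_[np.ones(num_z), 0.0][None, :],
        b_eq=[1.0],
        bounds=[(0, None)] * num_z + [(None, None)],
        method="highs",
    )
    return res_p.x[:num_a], res_q.x[:num_z], -res_p.fun

# Matching pennies: the unique equilibrium is uniform for both players, value 0.
p, q, value = solve_zero_sum(np.array([[1.0, -1.0], [-1.0, 1.0]]))
print(p, q, value)
```

For a vector-valued reward array `r[a, z, :]` and a direction `lam`, the scalar game matrix would be obtained as `r @ lam`; note that the target set does not enter this step.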
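The proof of the last Theorem follows from the next result, which also provides less specific conditions on the required choice of $(p_n, q^*_n)$.
Proposition 7
Suppose that at each time step $n \geq 1$, the agent chooses the triple $(p_n, q^*_n, p^*_n)$ so that
$$ \lambda_{n-1} \cdot \left( r(p_n, z) - r(p^*_n, q^*_n) \right) \geq 0, \quad \text{for all } z \in \mathcal{Z}, \qquad (11) $$
and sets $r^*_n = r(p^*_n, q^*_n)$. Then it holds that
(i) $\| \lambda_n \| \leq \rho / \sqrt{n}$ for all $n \geq 1$.
(ii) If, in addition, $p^*_n$ is chosen as an $S$-response to $q^*_n$, so that $r^*_n \in S$, then
$$ d(\bar{r}_n, S) \leq \| \lambda_n \| \leq \frac{\rho}{\sqrt{n}}, \quad n \geq 1. $$
The specific choice of $(p_n, q^*_n)$ in equations (8)-(9) satisfies the requirement in (11), as argued below. Indeed, the latter requirement is less restrictive, and can replace (8)-(9) in the definition of the basic algorithm. However, the former choice is convenient as it ensures that (11) holds for any choice of $p^*_n$.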
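Algorithm 1
Initialization: At time step $n = 1$, use an arbitrary mixed action $p_1$ and set an arbitrary target point $r^*_1 \in S$.
At time step $n = 2, 3, \dots$:
1. Set an approachability direction
$$ \lambda_{n-1} = \bar{r}^*_{n-1} - \bar{r}_{n-1}, $$
where
$$ \bar{r}_{n-1} = \frac{1}{n-1}\sum_{k=1}^{n-1} r(p_k, z_k), \qquad \bar{r}^*_{n-1} = \frac{1}{n-1}\sum_{k=1}^{n-1} r^*_k $$
are, respectively, the average (smoothed) reward vector and the average target point.
2. Solve a zero-sum matrix game with the scalar reward function $\lambda_{n-1} \cdot r(a,z)$. In particular, find the optimal mixed actions $p_n$ and $q^*_n$ that satisfy
$$ p_n \in \arg\max_{p} \min_{q} \lambda_{n-1} \cdot r(p,q), \qquad q^*_n \in \arg\min_{q} \max_{p} \lambda_{n-1} \cdot r(p,q). $$
3. Choose action $a_n$ according to $p_n$.
4. Pick $p^*_n$ so that $r(p^*_n, q^*_n) \in S$, and set the target point
$$ r^*_n = r(p^*_n, q^*_n). $$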
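The steps of the algorithm can be sketched end to end as follows. This is a minimal illustration under several assumptions: rewards are stored as an array `r[a, z, :]`, the response map is supplied as a function `response(q)`, the zero-sum game is solved with `scipy.optimize.linprog`, and only the smoothed rewards (expected under the agent's mixed action) are tracked, so the pure action is not sampled. The example instantiates the regret-matching formulation of Section 2.2.2 (non-positive orthant as target, best response as the response map) with a toy utility matrix; all names and data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def maximin(M):
    """Maximin mixed strategy of the row player in the matrix game M."""
    m, k = M.shape
    res = linprog(np.r_[np.zeros(m), -1.0],
                  A_ub=np.c_[-M.T, np.ones(k)], b_ub=np.zeros(k),
                  A_eq=np.r_[np.ones(m), 0.0][None, :], b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)], method="highs")
    p = np.clip(res.x[:m], 0.0, None)      # guard against tiny negative round-off
    return p / p.sum()

def response_based_approachability(r, response, z_seq):
    """Run the response-based scheme against the opponent sequence z_seq.
    Returns the norms of the difference between the average target point
    and the average smoothed reward, one entry per time step."""
    dim = r.shape[2]
    sum_target = np.zeros(dim)
    sum_reward = np.zeros(dim)
    norms = []
    for n, z in enumerate(z_seq, start=1):
        lam = (sum_target - sum_reward) / max(n - 1, 1)   # direction; zero at n = 1
        M = r @ lam                                       # scalar game lam . r(a, z)
        p = maximin(M)                                    # agent's mixed action
        q_star = maximin(-M.T)                            # opponent's minimax action
        p_star = response(q_star)                         # response to q_star
        sum_target += np.einsum("a,z,azl->l", p_star, q_star, r)
        sum_reward += p @ r[:, z, :]                      # smoothed reward r(p, z)
        norms.append(np.linalg.norm(sum_target - sum_reward) / n)
    return np.array(norms)

# Example: regret formulation, r[a, z, i] = u[i, z] - u[a, z], target = non-positive orthant.
u = np.array([[1.0, 0.0], [0.0, 1.0]])
r = u.T[None, :, :] - u[:, :, None]

def best_response(q):
    p = np.zeros(u.shape[0])
    p[np.argmax(u @ q)] = 1.0
    return p

z_seq = np.random.default_rng(0).integers(0, 2, size=200)
norms = response_based_approachability(r, best_response, z_seq)
rho = 2.0   # Euclidean span of r for this particular u
print(norms[-1], rho / np.sqrt(len(norms)))
```

No projection onto the target set is computed anywhere: the set enters only through the `response` function, which is the point of the response-based construction.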
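Lemma 8
For any $n \geq 1$, we have that
$$ n^2 \| \lambda_n \|^2 \leq (n-1)^2 \| \lambda_{n-1} \|^2 + 2(n-1)\, \lambda_{n-1} \cdot \left( r^*_n - r(p_n, z_n) \right) + \rho^2, $$
where $\rho$ is the span of the reward function (1).
Proof We have that
$$ \| \lambda_n \|^2 = \left\| \frac{n-1}{n} \lambda_{n-1} + \frac{1}{n} \left( r^*_n - r(p_n, z_n) \right) \right\|^2 \leq \left( \frac{n-1}{n} \right)^2 \| \lambda_{n-1} \|^2 + \frac{2(n-1)}{n^2}\, \lambda_{n-1} \cdot \left( r^*_n - r(p_n, z_n) \right) + \frac{\rho^2}{n^2}, $$
where $\rho$ is the reward bound defined in (1). The proof is concluded by multiplying both sides of the inequality by $n^2$.
Turning to Proposition 7, condition (11) implies that $\lambda_{n-1} \cdot ( r^*_n - r(p_n, z_n) ) \leq 0$. Hence, by Lemma 8,
$$ n^2 \| \lambda_n \|^2 \leq (n-1)^2 \| \lambda_{n-1} \|^2 + \rho^2. $$
Applying this inequality recursively, we obtain that
$$ n^2 \| \lambda_n \|^2 \leq n \rho^2, \quad \text{that is,} \quad \| \lambda_n \| \leq \frac{\rho}{\sqrt{n}}, $$
as claimed in part (i). Part (ii) now follows since $r^*_n \in S$ (for all $n$) implies that $\bar{r}^*_n \in S$ (recall that $S$ is a convex set), hence
$$ d(\bar{r}_n, S) \leq \| \bar{r}_n - \bar{r}^*_n \| = \| \lambda_n \|. $$
To complete the proof of Theorem 6, observe that by (8)-(9), for every $z \in \mathcal{Z}$ and every $p^*_n \in \Delta(\mathcal{A})$,
$$ \lambda_{n-1} \cdot r(p_n, z) \geq \max_{p} \min_{q} \lambda_{n-1} \cdot r(p,q) = \min_{q} \max_{p} \lambda_{n-1} \cdot r(p,q) \geq \lambda_{n-1} \cdot r(p^*_n, q^*_n), $$
where the equality follows by the minimax theorem for matrix games.
Therefore, condition (11) is satisfied for any $p^*_n$, and in particular for the one satisfying $r(p^*_n, q^*_n) \in S$. This concludes the proof of the Theorem.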
4 Interpretation and Extensions
We open this section with an illuminating interpretation of the proposed algorithm in terms of a certain approachability problem in an auxiliary game, and proceed to present several variants and extensions to the basic algorithm. While each of these variants is presented separately, they may also be combined when appropriate.
4.1 An Auxiliary Game Interpretation
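A central part of Algorithm 1 is the choice of the pair $(p_n, q^*_n)$ so that $\bar{r}_n$ tracks $\bar{r}^*_n$, namely $\lambda_n \to 0$ (see Equations (8)-(9) and Proposition 7). In fact, the choice of $(p_n, q^*_n)$ in (8)-(9) can be interpreted as Blackwell’s strategy for a specific approachability problem in an auxiliary game, which we define next.
Suppose that at time $n$, the agent chooses a pair of actions $(a, z^*) \in \mathcal{A} \times \mathcal{Z}$ and the opponent chooses a pair of actions $(a^*, z) \in \mathcal{A} \times \mathcal{Z}$. The vector payoff function, now denoted by $v$, is given by
$$ v((a, z^*), (a^*, z)) = r(a^*, z^*) - r(a, z). $$
Consider the single-point target set $S_0 = \{0\}$. This set is clearly convex, and we next show that it is a D-set in the auxiliary game. We need to show that for any $\eta \in \Delta(\mathcal{A} \times \mathcal{Z})$ there exists $\mu \in \Delta(\mathcal{A} \times \mathcal{Z})$ so that $v(\mu, \eta) \in S_0$, namely $v(\mu, \eta) = 0$. To that end, observe that
$$ v(\mu, \eta) = r(p^*, q^*) - r(p, q), $$
where $p$ and $q^*$ are the marginal distributions of $\mu$ on $\mathcal{A}$ and $\mathcal{Z}$, respectively, while $p^*$ and $q$ are the respective marginal distributions of $\eta$. Therefore we obtain $v(\mu, \eta) = 0$ by choosing $\mu$ with the same marginals as $\eta$, for example $\mu = p \otimes q^*$ with $p = p^*$ and $q^* = q$. Thus, by Theorem 3, $S_0$ is approachable.
We may now apply Blackwell’s approachability strategy to this auxiliary game. Since $S_0$ is the origin, the direction from $S_0$ to the average reward $\bar{v}_{n-1}$ is just the average reward vector itself. Therefore, the primal (geometric separation) condition here is equivalent to
$$ \bar{v}_{n-1} \cdot \left( r(p^*, q^*) - r(p, q) \right) \leq 0, \quad \text{for all } p^* \in \Delta(\mathcal{A}),\ q \in \Delta(\mathcal{Z}). $$
Now, a pair $(p, q^*)$ that satisfies this inequality is any pair of equilibrium strategies in the zero-sum game with reward $r$ projected in the direction of $\bar{v}_{n-1}$. That is, for
$$ p \in \arg\max_{p} \min_{q} \bar{v}_{n-1} \cdot r(p,q), \qquad (13) $$
$$ q^* \in \arg\min_{q} \max_{p} \bar{v}_{n-1} \cdot r(p,q), \qquad (14) $$
it is easily verified that
$$ \bar{v}_{n-1} \cdot r(p, q) \geq \bar{v}_{n-1} \cdot r(p^*, q^*), \quad \text{for all } p^* \in \Delta(\mathcal{A}),\ q \in \Delta(\mathcal{Z}). $$
The choice of $(p_n, q^*_n)$ in Equations (8)-(9) follows (13)-(14), with $\lambda_{n-1}$ replacing $\bar{v}_{n-1}$. We note that the two are not identical, as $\bar{v}_n$ is the temporal average of $v_n = r(a^*_n, z^*_n) - r(a_n, z_n)$ while $\lambda_n$ is the average of the smoothed difference $r(p^*_n, q^*_n) - r(p_n, z_n)$; however this does not change the approachability result above, and in fact either can be used. More generally, any approachability algorithm in the auxiliary game can be used to choose the pair $(p_n, q^*_n)$ in Algorithm 1.
We note that in our original problem, the mixed action $p^*_n$ is not chosen by an “opponent” but rather specified as part of Algorithm 1. But since the approachability result above holds for an arbitrary choice of $p^*_n$, it also holds for this particular one.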
We proceed to present some additional variants of our algorithm.
4.2 Idling when Inside
Recall that in the original approachability algorithm of Blackwell, an arbitrary action can be chosen by the agent whenever the current average reward is in $S$. This may reduce the computational burden of the algorithm, and adds another degree of freedom that may be used to optimize other criteria.
Such arbitrary choice of $p_n$ (or $a_n$) when the average reward is in $S$ is also possible in our algorithm. However, some care is required in the setting of the average target point over these time instances, as otherwise the two terms of the difference $\lambda_n = \bar{r}^*_n - \bar{r}_n$ may drift apart. As it turns out, what is required is simply to shift the average target point to $\bar{r}_n$ at these time instances, and use the modified point in the computation of the steering direction $\lambda_n$. In recursive form, we obtain the following modified recursion:
$$ \tilde{\lambda}_n = \begin{cases} 0, & \bar{r}_n \in S, \\ \frac{n-1}{n}\, \tilde{\lambda}_{n-1} + \frac{1}{n} \left( r^*_n - r(p_n, z_n) \right), & \bar{r}_n \notin S. \end{cases} $$
It may be seen that the steering direction $\tilde{\lambda}_n$ is reset to $0$ whenever the average reward is in $S$. With this modified definition, we are able to maintain the same convergence properties of the algorithm.
Proof We establish the claim in two steps. We first show that bounds the Euclidean distance of from . We then show that satisfies an analogue of Lemma 8, and therefore the analysis of the previous section holds.
To see that for all , observe that if , then trivially . Assume next that . Let be the last instant such that . Using the abbreviated notation
and similarly for , we obtain
On the other hand,
where the first inequality follows by the convexity of the point-to-set Euclidean distance to a convex set, and the second inequality holds since
Hence, when , we have that , and arbitrary and can be chosen. Also, similarly to the analysis in Section 3, whenever , the solution of the zero-sum game in the direction ensures that
and thus the convergence of to zero is implied.
4.3 Directionally Unbounded Target Sets
In some applications of interest, the target set $S$ may be unbounded in certain directions. Indeed, this is the case in the approachability formulation of the no-regret problem, where the goal is essentially to make the average reward as large as possible. In particular, in Blackwell’s formulation (Section 2.2.1), the set $S$ is unbounded in the direction of the first coordinate $u$. In Hart and Mas-Colell’s formulation (Section 2.2.2), the set $S$ is unbounded in the negative direction of all the coordinates of $r$.
In such cases, the requirement that $\lambda_n \to 0$, which is a property of our basic algorithm, may be too strong, and may even be counter-productive. For example, in Blackwell’s no-regret formulation mentioned above, we would like to increase the first coordinate of $\bar{r}_n$ as much as possible, hence allowing negative values of the corresponding coordinate of $\lambda_n$ makes sense (rather than steering that coordinate to $0$ by reducing $\bar{r}_n$). We propose next a modification of our algorithm that addresses this issue.
Given the (closed and convex) target set $S$, let $Q$ be the set of vectors $v \in \mathbb{R}^\ell$ such that $v + S \subseteq S$. It may be seen that $Q$ is a closed and convex cone, which trivially equals $\{0\}$ if (and only if) $S$ is bounded. We refer to the unit vectors in $Q$ as directions in which $S$ is unbounded.
Referring to the auxiliary game interpretation of our algorithm in Section 4.1, we may now relax the requirement that $\lambda_n$ approaches $\{0\}$ to the requirement that $\lambda_n$ approaches $-Q$. Indeed, if we maintain $\bar{r}^*_n \in S$ as before, then $-\lambda_n \in Q$ suffices to verify that $\bar{r}_n = \bar{r}^*_n - \lambda_n \in S$.
We may now apply Blackwell’s approachability strategy to the cone $-Q$ in place of the origin. The required modification to the algorithm is simple: replace the steering direction $\lambda_{n-1}$ in (8)-(9) or (11) with the direction from the closest point in $-Q$ to $\lambda_{n-1}$:
$$ \tilde{\lambda}_{n-1} = \lambda_{n-1} - \mathrm{Proj}_{-Q}(\lambda_{n-1}). $$
That projection is particularly simple in case $S$ is unbounded along primary coordinates, so that the cone $Q$ is a quadrant, generated by a collection $\{e_j\}_{j \in J}$ of orthogonal unit vectors. In that case, clearly,
$$ \tilde{\lambda}_{n-1} = \lambda_{n-1} - \sum_{j \in J} \min\{ e_j \cdot \lambda_{n-1},\, 0 \}\, e_j. $$
Thus, the negative components of $\lambda_{n-1}$ in directions $e_j$ are nullified.
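For the coordinate case just described, the modified steering direction amounts to clipping. The sketch below assumes that the directions of unboundedness are positive coordinate axes, identified by their indices; the function name is hypothetical. For directions along negative axes (as in the regret matching formulation), the same idea applies with the sign of the clipped components reversed.

```python
import numpy as np

def steering_direction(lam, unbounded_coords):
    """Nullify the negative components of lam along the coordinates in which
    the target set is unbounded (assumed here to be positive axis directions)."""
    out = np.array(lam, dtype=float)
    idx = np.asarray(list(unbounded_coords), dtype=int)
    out[idx] = np.maximum(out[idx], 0.0)
    return out

print(steering_direction([-1.0, 2.0, -3.0], unbounded_coords=[0]))
```

Only the listed coordinates are affected, so components along directions in which the target set is bounded are still steered to zero.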
4.4 Using the Non-smoothed Rewards
In the basic algorithm of Section 3, the definition of the steering direction $\lambda_n$ employs the smoothed rewards $r(p_n, z_n)$ rather than the actual ones, namely $R_n = r(a_n, z_n)$. We consider here the case where the latter are used. This is essential in case that the opponent’s action $z_n$ is not observed, so that $r(p_n, z_n)$ cannot be computed, but rather the reward vector $R_n$ is observed directly. It also makes sense since the quantity we are actually interested in is the average reward $\bar{R}_n$, and not its smoothed version $\bar{r}_n$.
Thus, we replace $\lambda_n$ with
$$ \hat{\lambda}_n = \bar{r}^*_n - \bar{R}_n. $$
The rest of the algorithm is the same as Algorithm 1. We have the following result for this variant.