# An axiomatic basis for Blackwell optimality

Adam Jonsson Department of Engineering Sciences and Mathematics
Luleå University of Technology, 97187 Luleå, Sweden
###### Abstract.

In the theory of Markov decision processes (MDPs), a Blackwell optimal policy is a policy that is optimal for every discount factor sufficiently close to one. This paper provides an axiomatic basis for Blackwell optimality in discrete-time MDPs with finitely many states and finitely many actions.

###### Key words and phrases:
Markov decision processes; Blackwell optimality
###### 2010 Mathematics Subject Classification:
Primary 90C40, Secondary 91B06

## 1. Introduction

In his foundational paper, Blackwell [4] showed that for any discrete-time Markov decision process (MDP) with finitely many states and finitely many actions, there exists a stationary policy that is optimal for every discount factor sufficiently close to one. Following Veinott [15], policies that possess this property are now referred to as Blackwell optimal. Blackwell optimality and the related concept of 1-optimality (also known as near optimality, 0-discount optimality, and bias optimality) have come to provide two of the most well-studied optimality criteria for undiscounted MDPs (see, e.g., [10, 9, 13, 12, 14, 6, 7]). However, the question of which assumptions on a decision maker’s preferences lead to these criteria has not been answered in the literature.

To address this question, we consider a decision maker with preferences over streams of rewards. The preference relation $\succsim$ is postulated to be reflexive and transitive, where $u \succsim v$ means that $u$ is at least as good as $v$, $u \succ v$ means that $u$ is better than $v$ ($u \succsim v$ but not $v \succsim u$), and $u \sim v$ means that $u$ and $v$ are equally good ($u \succsim v$ and $v \succsim u$). A policy generates a stream $u = (u_1, u_2, \dots)$ of expected rewards (see Eq. (3) below), where $u_t$ is the expected reward at time $t$. We consider the set of streams generated by stationary policies, that is, policies for which the action chosen at time $t$ depends only on the state at time $t$. The principal result of this paper (Theorem 1) provides conditions on $\succsim$ that ensure that $\succsim$ and $\succsim_{\mathrm{B}}$ coincide on this set, where

$$u \succsim_{\mathrm{B}} v \iff \liminf_{\beta \to 1^-} \sum_{t=1}^{\infty} \beta^t (u_t - v_t) \ge 0 \tag{1}$$

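is the preference relation induced by the 1-optimality criterion. To state this result, we use the following notation: For a stream $u$ and $c \in \mathbb{R}$, we let $(c, u)$ denote $(c, u_1, u_2, u_3, \dots)$. If $u_t \ge v_t$ for all $t$ and $u_t > v_t$ for some $t$, we write $u > v$. The long-run average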

$$\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} u_t \tag{2}$$

of $u$ is denoted by $\bar{u}$ if the limit (2) exists.
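As a rough numerical illustration of (1) and (2), the sketch below truncates the discounted sum at a finite horizon and evaluates it for discount factors approaching one. The two alternating streams and the names `discounted_gap` and `running_average` are hypothetical choices made for this example and do not come from the paper.

```python
def discounted_gap(u, v, beta, horizon=20000):
    """Truncation of sum_{t>=1} beta**t * (u(t) - v(t)) at `horizon` terms."""
    total, weight = 0.0, 1.0
    for t in range(1, horizon + 1):
        weight *= beta                      # weight == beta**t
        total += weight * (u(t) - v(t))
    return total


def running_average(u, n):
    """Finite-n version of the long-run average in (2)."""
    return sum(u(t) for t in range(1, n + 1)) / n


# Hypothetical streams, indexed from t = 1.
def u(t):
    return 1.0 if t % 2 == 1 else 0.0       # (1, 0, 1, 0, ...)


def v(t):
    return 0.0 if t % 2 == 1 else 1.0       # (0, 1, 0, 1, ...)


for beta in (0.9, 0.99, 0.999):
    # For these two streams the exact sum is beta / (1 + beta).
    print(beta, discounted_gap(u, v, beta))
print(running_average(u, 1000), running_average(v, 1000))
```

For these streams the discounted difference tends to 1/2 as the discount factor approaches one, so the first stream is ranked above the second by (1) even though both have the same long-run average.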

###### Theorem 1.

Let be a preference relation on with the following three properties.

A1. For all $u, v$, if $u > v$, then $u \succ v$.
A2. For all , if , then .
A3. For all $u$, if $\bar{u}$ is well defined, then $(\bar{u}, u) \sim u$.

Then $\succsim$ and $\succsim_{\mathrm{B}}$ coincide on the set of streams generated by stationary policies.

This result is proved in [8] on a different domain (the set of streams that are either summable or eventually periodic). To prove Theorem 1, we extend the result from [8] to a larger domain (Lemma 2) and show that this domain contains the set of streams generated by stationary policies (Lemma 3).

The first two assumptions in Theorem 1, A1 and A2, are standard (cf. [2, 3]). To interpret A3, which is the Compensation Principle from [8], imagine that the decision maker is faced with two different scenarios: In the first scenario, a stream of rewards $u$ is received. In the second scenario, there is a one-period postponement of $u$, for which a compensation of $c$ is received in the first period. According to A3, the decision maker is indifferent between $u$ and $(c, u)$ if $c = \bar{u}$. For an axiomatic defence of this assertion, see [8, Prop. 1].
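The compensation idea can be checked numerically on a simple case. The sketch below, which assumes a hypothetical alternating stream with long-run average 1/2, compares the postponed stream with the original one under the discounted criterion (1) for a discount factor close to one. The helper names `postpone` and `discounted_gap` are illustrative only.

```python
def postpone(c, u):
    """Return the stream (c, u_1, u_2, ...) as a function of t >= 1."""
    return lambda t: c if t == 1 else u(t - 1)


def discounted_gap(u, v, beta, horizon=200000):
    """Truncation of sum_{t>=1} beta**t * (u(t) - v(t)) at `horizon` terms."""
    total, weight = 0.0, 1.0
    for t in range(1, horizon + 1):
        weight *= beta                  # weight == beta**t
        total += weight * (u(t) - v(t))
    return total


def u(t):
    # Hypothetical stream (1, 0, 1, 0, ...); its long-run average is 1/2.
    return 1.0 if t % 2 == 1 else 0.0


beta = 0.9999
gap_at_average = discounted_gap(postpone(0.5, u), u, beta)
gap_above_average = discounted_gap(postpone(0.6, u), u, beta)
print(gap_at_average, gap_above_average)
```

For this stream the exact gap is `beta * c - beta / (1 + beta)`, which tends to `c - 1/2` as the discount factor approaches one. The gap therefore vanishes in the limit only when the compensation equals the long-run average, which is in line with the indifference asserted by A3.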

Theorem 1 tells us that if a decision maker restricts attention to stationary policies and respects A1, A2, and A3, then any stationary 1-optimal policy is (weakly) best possible with respect to his or her preferences. (The same conclusion holds for Blackwell optimal policies since such policies are 1-optimal by definition.) While restricting attention to stationary policies is often natural, it is well known that not all optimality criteria admit stationary optimal policies [5, 13, 11]. The arguments used in the proof of Theorem 1 apply to sequences that are asymptotically periodic (see Eq. (8) below). We mention without proof that as a consequence, the conclusion in Theorem 1 holds also on the set of streams generated by eventually periodic policies.

## 2. Definitions

We use Blackwell’s [4] formulation of a discrete-time MDP, with a finite set of states, a finite set of actions, and the set of all functions . Thus at each time , a system is observed to be in one of states, an action is chosen from , and a reward is received. The reward is assumed to be a function from to . The transition probability matrix and reward (column) vector that correspond to are denoted by and , respectively. So, if the system is observed to be in state and action is chosen, then a reward of is received and the system moves to with probability .

A policy is a sequence , each . The set of all policies is denoted by . A policy is stationary if for all , and eventually periodic if there exist such that for all .

The stream of expected rewards that generates, given an initial state , is the sequence defined (see [4, p. 719])

$$u_1 = [R(f_1)]_s, \qquad u_t = [Q(f_1) \cdots Q(f_{t-1}) \cdot R(f_t)]_s, \quad t \ge 2. \tag{3}$$

The set of streams generated by stationary policies consists of all $u$ that can be written as in (3) for some stationary policy and some initial state $s$, in an MDP with finitely many states and finitely many actions.
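For a stationary policy, (3) reduces to repeated multiplication of the reward vector by a single transition matrix. The following sketch computes the first terms of such a stream in plain Python. The two-state chain, the reward vector, and the function names are hypothetical and serve only to illustrate the formula.

```python
def mat_vec(Q, x):
    """Product of a square matrix Q (list of rows) with a column vector x."""
    return [sum(q * xi for q, xi in zip(row, x)) for row in Q]


def reward_stream(Q, R, s, n):
    """First n terms of u_t = [Q^(t-1) R]_s, with states indexed from 0."""
    stream = []
    x = list(R)                 # x holds Q^(t-1) R, starting at t = 1
    for _ in range(n):
        stream.append(x[s])
        x = mat_vec(Q, x)
    return stream


# Hypothetical example: two states that alternate deterministically,
# with reward 1 in state 0 and reward 0 in state 1.
Q = [[0.0, 1.0],
     [1.0, 0.0]]
R = [1.0, 0.0]

print(reward_stream(Q, R, s=0, n=6))
print(reward_stream(Q, R, s=1, n=6))
```

Starting from state 0 the stream alternates between 1 and 0, and starting from state 1 it alternates between 0 and 1.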

## 3. Proof of Theorem 1

The proof of Theorem 1 will be completed through three lemmas. The first lemma shows that if $\succsim$ satisfies A1–A3, then $\succsim$ and $\succsim_{\mathrm{V}}$ coincide on the set of pairs $(u, v)$ for which the series $\sum_{t=1}^{\infty} (u_t - v_t)$ is Cesàro-summable and has bounded partial sums, where

$$u \succsim_{\mathrm{V}} v \iff \liminf_{n \to \infty} \frac{1}{n} \sum_{T=1}^{n} \sum_{t=1}^{T} (u_t - v_t) \ge 0 \tag{4}$$

is the preference relation induced by Veinott’s [15] average overtaking criterion. All results presented in this paper hold with $\succsim_{\mathrm{V}}$ in the role of $\succsim_{\mathrm{B}}$.
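The quantity inside the limit inferior in (4) is the Cesàro mean of the partial sums of the differences, and it can be evaluated in a single pass for any finite $n$. The sketch below does this for two hypothetical alternating streams. The function name and the streams are assumptions made for illustration.

```python
def average_overtaking_mean(u, v, n):
    """(1/n) * sum_{T=1}^{n} sum_{t=1}^{T} (u(t) - v(t)), computed in one pass."""
    partial_sum = 0.0           # sum_{t=1}^{T} (u(t) - v(t)) for the current T
    total = 0.0                 # running sum of the partial sums
    for t in range(1, n + 1):
        partial_sum += u(t) - v(t)
        total += partial_sum
    return total / n


# Hypothetical streams, indexed from t = 1.
def u(t):
    return 1.0 if t % 2 == 1 else 0.0       # (1, 0, 1, 0, ...)


def v(t):
    return 0.0 if t % 2 == 1 else 1.0       # (0, 1, 0, 1, ...)


for n in (10, 100, 1000):
    print(n, average_overtaking_mean(u, v, n))
```

Here the partial sums alternate between 1 and 0, so they are bounded but do not converge, while their Cesàro mean equals 1/2 for every even $n$. This is the kind of pair covered by the Cesàro-summability and bounded-partial-sums condition.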

###### Lemma 1.

(a) The preference relation $\succsim_{\mathrm{V}}$ satisfies A1–A3.

(b) Let $\succsim$ be a preference relation that satisfies A1–A3. For all $u, v$, if the series $\sum_{t=1}^{\infty} (u_t - v_t)$ is Cesàro-summable and has bounded partial sums, then $u \succsim v$ if and only if $u \succsim_{\mathrm{V}} v$.

###### Proof.

(a) See [8, Theorem 1]. (b) A consequence of (a) and Lemma 2 in [8]. ∎

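That the conclusion in Lemma 1(b) holds with $\succsim_{\mathrm{B}}$ in the role of $\succsim_{\mathrm{V}}$ follows from the fact that $\succsim_{\mathrm{B}}$ satisfies A1–A3. The rest of the proof consists of identifying a superset of the set of streams generated by stationary policies to which the conclusion in Lemma 1(b) extends. Lemma 2 shows that this conclusion holds on the set of $u$ that can be written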

$$u = w + \triangle, \tag{5}$$

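where $w$ is eventually periodic and the series $\sum_{t=1}^{\infty} \triangle_t$ is Cesàro-summable (the limit $\lim_{n \to \infty} \frac{1}{n} \sum_{T=1}^{n} \sum_{t=1}^{T} \triangle_t$ exists and is finite) and has bounded partial sums. Lemma 2 concerns the set of streams that can be written in this way.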

###### Lemma 2.

A preference relation $\succsim$ that satisfies A1–A3 is complete on the set of streams that can be written as in (5) and coincides with $\succsim_{\mathrm{V}}$ on this domain.

That $\succsim$ is complete on this set means that for all $u, v$ in the set, if $u \succsim v$ does not hold, then $v \succ u$.

###### Proof.

Let be a preference relation that satisfies A1–A3, and let . Then . We show that if and only if . Take and such that , where for all and where is Cesàro-summable with bounded partial sums. Without loss of generality, we may assume that .

Case 1: . Then is Cesàro-summable and has bounded partial sums. This means that is Cesàro-summable and has bounded partial sums. By Lemma 1, .

Case 2: . (A similar argument applies when .) Then as . Since has bounded partial sums, . We show that . Choose and with the following properties.

(i) is eventually periodic with period .

(ii) for all and for all .

(iii) for all .

(iv) for all .

Since , (iv) follows from (i)–(ii) by taking sufficiently large. Let . By (iii), for all . This means that , is eventually periodic. Thus and hence is Cesàro-summable with bounded partial sums. Since , the Cesàro sum of is nonnegative by (iv). This means that , so by Lemma 1. Here , so by A1 and transitivity. By A2, . Since also satisfies A1–A3, the same argument shows that . ∎

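It remains to verify that the domain of Lemma 2 contains the set of streams generated by stationary policies. For this it is sufficient to show that every stream $u$ generated by a stationary policy can be written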

$$u = w + \triangle, \tag{6}$$

where $w$ is eventually periodic and $\triangle_t$ goes to zero at exponential rate as $t \to \infty$. We say that $u$ is asymptotically periodic if $u$ can be written in this way.

###### Lemma 3.

If $u$ is generated by a stationary policy, then $u$ is asymptotically periodic.

###### Proof.

Let $u$ be generated by applying $f$ given an initial state $s$, so that $u_t$ is the $s$th component of (here $Q(f)^0$ is the identity matrix)

$$Q(f)^{t-1} \cdot R(f), \quad t \ge 1. \tag{7}$$

We need to show that there exist $w$ and $\triangle$ with

$$u = w + \triangle, \tag{8}$$

where is eventually periodic and . A well known corollary of the Perron-Frobenius theorem for nonnegative matrices says that for any stochastic matrix and , the sequence converges exponentially to a periodic orbit (see, e.g., [1].) That is, there exist , and such that for all and where

$$\lim_{t \to \infty} \left| \left( P^t \cdot x - y(t) \right)_s \right| e^{\rho t} = 0$$

for every . Thus we can take , and such that

$$(Q(f))^{t-1} \cdot R(f) = w(t) + e(t) \tag{9}$$

for every , where for all and where each component of goes to zero faster than . If we now set , then , where is eventually periodic and . ∎
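The decomposition in (8) can be seen concretely on a small chain. The sketch below uses a hypothetical three-state transition matrix in which state 0 is transient and states 1 and 2 alternate, and checks that the stream started in state 0 differs from a period-2 sequence by a geometrically decaying term. The matrix, the rewards, and the closed-form expressions in the comments are assumptions worked out for this example only.

```python
def mat_vec(Q, x):
    """Product of a square matrix Q (list of rows) with a column vector x."""
    return [sum(q * xi for q, xi in zip(row, x)) for row in Q]


def reward_stream(Q, R, s, n):
    """First n terms of u_t = [Q^(t-1) R]_s, with states indexed from 0."""
    stream = []
    x = list(R)
    for _ in range(n):
        stream.append(x[s])
        x = mat_vec(Q, x)
    return stream


# Hypothetical chain: state 0 stays put or moves to state 1 with equal
# probability; states 1 and 2 alternate deterministically.
Q = [[0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
R = [0.0, 2.0, 0.0]

u = reward_stream(Q, R, s=0, n=30)


def w(t):
    # Periodic part for this example: 2/3 at odd t, 4/3 at even t.
    return 2.0 / 3.0 if t % 2 == 1 else 4.0 / 3.0


# Remainder u_t - w_t; for this example it equals -(2/3) * 0.5**(t - 1).
remainder = [u[t - 1] - w(t) for t in range(1, 31)]
print(u[:6])
print(remainder[:6])
```

In this example the remainder halves at every step, so it goes to zero at an exponential rate and the stream is asymptotically periodic in the sense of (8).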

## References

• [1] Mustafa A. Akcoglu and Ulrich Krengel. Nonlinear models of diffusion on a finite space. Probability Theory and Related Fields, 76(4):411–420, 1987.
• [2] Geir B. Asheim, Claude d’Aspremont, and Kuntal Banerjee. Generalized time-invariant overtaking. Journal of Mathematical Economics, 46(4):519–533, 2010.
• [3] Kaushik Basu and Tapan Mitra. Utilitarianism for infinite utility streams: A new welfare criterion and its axiomatic characterization. Journal of Economic Theory, 133(1):350–373, 2007.
• [4] David Blackwell. Discrete dynamic programming. Annals of Mathematical Statistics, 33(2):719–726, 1962.
• [5] Barry W. Brown. On the iterative method of dynamic programming on a finite space discrete time Markov process. Annals of Mathematical Statistics, 36(4):1279–1285, 1965.
• [6] Arie Hordijk and Alexander A. Yushkevich. Blackwell optimality. In E. A. Feinberg and A. Shwartz, editors, Handbook of Markov Decision Processes, Imperial College Press Optimization Series. Springer, Boston, MA, 2002.
• [7] Héctor Jasso-Fuentes and Onésimo Hernández-Lerma. Blackwell optimality for controlled diffusion processes. Journal of Applied Probability, 46(2):372–391, 2009.
• [8] Adam Jonsson and Mark Voorneveld. The limit of discounted utilitarianism. Theoretical Economics, 2017. To appear, available at: https://econtheory.org.
• [9] J. B. Lasserre. Conditions for existence of average and Blackwell optimal stationary policies in denumerable Markov decision processes. Journal of Mathematical Analysis and Applications, 136(2):479–489, 1988.
• [10] Steven A. Lippman. Letter to the Editor — Criterion equivalence in discrete dynamic programming. Operations Research, 17(5):920–923, 1969.
• [11] Andrzej S. Nowak and Oscar Vega-Amaya. A counterexample on overtaking optimality. Mathematical Methods of Operations Research, 49(3):435–439, 1999.
• [12] Alexey B. Piunovskiy. Examples in Markov decision processes, volume 2 of Imperial College Press Optimization Series. Imperial College Press, London, 2013.
• [13] Martin L. Puterman. Markov decision processes: discrete stochastic dynamic programming. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., New York, 1994.
• [14] Dinah Rosenberg, Eilon Solan, and Nicolas Vieille. Blackwell optimality in Markov decision processes with partial observation. The Annals of Statistics, 30(4):1178–1193, 2002.
• [15] Arthur F. Veinott. On finding optimal policies in discrete dynamic programming with no discounting. Annals of Mathematical Statistics, 37(5):1284–1294, 1966.