# An axiomatic basis for Blackwell optimality

###### Abstract.

In the theory of Markov decision processes (MDPs), a Blackwell optimal policy is a policy that is optimal for every discount factor sufficiently close to one. This paper provides an axiomatic basis for Blackwell optimality in discrete-time MDPs with finitely many states and finitely many actions.

###### Key words and phrases:

Markov decision processes; Blackwell optimality

###### 2010 Mathematics Subject Classification:

Primary 90C40, Secondary 91B06

## 1. Introduction

In his foundational paper, Blackwell [4] showed that for any discrete-time Markov decision process (MDP) with finitely many states and finitely many actions, there exists a stationary policy that is optimal for every discount factor sufficiently close to one. Following Veinott [15], policies that possess this property are now referred to as Blackwell optimal. Blackwell optimality and the related concept of 1-optimality (also known as near optimality, 0-discount optimality, and bias optimality) have come to provide two of the best studied optimality criteria for undiscounted MDPs (see, e.g., [10, 9, 13, 12, 14, 6, 7]). However, the question of which assumptions on a decision maker's preferences lead to these criteria has not been answered in the literature.

To address this question, we consider a decision maker with preferences over the set $\mathbb{R}^{\infty}$ of real-valued streams $x = (x_0, x_1, x_2, \dots)$. The preference relation $\succsim$ is postulated to be reflexive and transitive, where $x \succsim y$ means that $x$ is at least as good as $y$, $x \succ y$ means that $x$ is better than $y$ ($x \succsim y$ but not $y \succsim x$), and $x \sim y$ means that $x$ and $y$ are equally good ($x \succsim y$ and $y \succsim x$). A policy $\pi$ generates a stream $x(\pi)$ of expected rewards (see Eq. (3) below), where $x_t(\pi)$ is the expected reward at time $t$. Let $X_S$ denote the set of streams generated by stationary policies, that is, policies for which the action chosen at time $t$ depends only on the state at time $t$. The principal result of this paper (Theorem 1) provides conditions on $\succsim$ that ensure that $\succsim$ and $\succsim_1$ coincide on $X_S$, where

(1)  $x \succsim_1 y \iff \liminf_{\beta \uparrow 1} \sum_{t=0}^{\infty} \beta^{t} (x_t - y_t) \ge 0$

is the preference relation induced by the 1-optimality criterion. To state this result, we use the following notation: For $x \in \mathbb{R}^{\infty}$ and $a \in \mathbb{R}$, we let $(a, x)$ denote the stream $(a, x_0, x_1, \dots)$. If $x_t \ge y_t$ for all $t$ and $x_t > y_t$ for some $t$, we write $x > y$. The long-run average

(2)  $\lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} x_t$

of $x \in \mathbb{R}^{\infty}$ is denoted by $a(x)$ if the limit (2) exists.
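As a small numeric illustration (the alternating stream and horizon below are our own choices, not from the paper), the running averages of $x = (1, 0, 1, 0, \dots)$ converge to the long-run average $1/2$:

```python
def running_averages(x_terms):
    """Running averages (1/T) * (x_0 + ... + x_{T-1}); the long-run
    average a(x) is the limit of these values, when it exists."""
    avgs, total = [], 0.0
    for T, xt in enumerate(x_terms, start=1):
        total += xt
        avgs.append(total / T)
    return avgs

# Alternating stream x = (1, 0, 1, 0, ...): the averages tend to 1/2.
x = [1.0 if t % 2 == 0 else 0.0 for t in range(10000)]
avgs = running_averages(x)
```

For an exactly periodic stream, the long-run average is simply the mean over one period, so the limit here can be checked by hand.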

###### Theorem 1.

Let $\succsim$ be a preference relation on $\mathbb{R}^{\infty}$ with the following three properties.

A1. For all $x, y \in \mathbb{R}^{\infty}$, if $x_t \ge y_t$ for all $t$, then $x \succsim y$.

A2. For all $x, y \in \mathbb{R}^{\infty}$, if $x > y$, then $x \succ y$.

A3. For all $x \in \mathbb{R}^{\infty}$, if $a(x)$ is well defined, then $(a(x), x) \sim x$.

Then $\succsim$ and $\succsim_1$ coincide on $X_S$.

This result is proved in [8] on a different domain (the set of streams that are either summable or eventually periodic). To prove Theorem 1, we extend the result from [8] to a larger domain (Lemma 2) and show that this domain contains $X_S$ (Lemma 3).

The first two assumptions in Theorem 1, A1 and A2, are standard (cf. [2, 3]). To interpret A3, which is the Compensation Principle from [8], imagine that the decision maker is faced with two different scenarios: In the first scenario, a stream $x$ of rewards is received. In the second scenario, there is a one-period postponement of $x$, for which a compensation is received in the first period. According to A3, the decision maker is indifferent between the two scenarios if the compensation equals the long-run average $a(x)$, that is, $(a(x), x) \sim x$. For an axiomatic defence of this assertion, see [8, Prop. 1].
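The discounted values of $x$ and of its compensated postponement $(a(x), x)$ come together as the discount factor approaches one, which is the sense in which the compensation is exact in the limit (cf. [8]). A minimal numeric sketch, with stream, discount factor, and truncation horizon chosen by us for illustration:

```python
def discounted_value(x_terms, beta):
    """Truncated discounted value sum_t beta^t * x_t."""
    return sum(beta ** t * xt for t, xt in enumerate(x_terms))

T = 40000
x = [1.0 if t % 2 == 0 else 0.0 for t in range(T)]  # long-run average a(x) = 1/2
postponed = [0.5] + x[:-1]  # compensation a(x) up front, then x one period late

beta = 0.999
gap = discounted_value(x, beta) - discounted_value(postponed, beta)
# For this stream the gap equals 1/(1 + beta) - 1/2 up to truncation error,
# so it vanishes as beta -> 1.
```

The gap is already below $3 \cdot 10^{-4}$ at $\beta = 0.999$ and shrinks to zero as $\beta \uparrow 1$.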

Theorem 1 tells us that if a decision maker restricts attention to stationary policies and respects A1, A2, and A3, then any stationary 1-optimal policy is (weakly) best possible with respect to his or her preferences. (The same conclusion holds for Blackwell optimal policies, since such policies are 1-optimal by definition.) While restricting attention to stationary policies is often natural, it is well known that not all optimality criteria admit stationary optimal policies [5, 13, 11]. The arguments used in the proof of Theorem 1 apply to sequences that are asymptotically periodic (see Eq. (8) below). We mention without proof that, as a consequence, the conclusion of Theorem 1 also holds on the set of streams generated by eventually periodic policies.

## 2. Definitions

We use Blackwell's [4] formulation of a discrete-time MDP, with a finite set $S = \{1, \dots, s\}$ of states, a finite set $A$ of actions, and the set $F$ of all functions $f \colon S \to A$. Thus at each time $t = 0, 1, 2, \dots$, a system is observed to be in one of $s$ states, an action is chosen from $A$, and a reward is received. The reward is assumed to be a function from $S \times A$ to $\mathbb{R}$. The transition probability matrix and reward (column) vector that correspond to $f \in F$ are denoted by $P(f)$ and $r(f)$, respectively. So, if the system is observed to be in state $i$ and action $f(i)$ is chosen, then a reward of $r_i(f)$ is received and the system moves to state $j$ with probability $P_{ij}(f)$.

A policy is a sequence $\pi = (f_0, f_1, f_2, \dots)$, each $f_t \in F$. The set of all policies is denoted by $\Pi$. A policy $\pi$ is stationary if $f_t = f_0$ for all $t$, and eventually periodic if there exist $N \ge 0$ and $p \ge 1$ such that $f_{t+p} = f_t$ for all $t \ge N$.
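To make the definitions concrete, here is a hedged Python sketch of the expected reward stream generated by a stationary policy in a hypothetical two-state MDP; the matrix $P = P(f)$ and vector $r = r(f)$ below are our own illustrative numbers, not from the paper.

```python
# Hypothetical two-state MDP under a fixed stationary policy f.
P = [[0.9, 0.1],   # transition probabilities P(f)
     [0.4, 0.6]]
r = [1.0, 0.0]     # rewards r(f)

def reward_stream(P, r, i, T):
    """Expected rewards x_t = (P^t r)_i for initial state i, t = 0..T-1."""
    x, v = [], list(r)
    for _ in range(T):
        x.append(v[i])
        v = [sum(P[k][j] * v[j] for j in range(len(v))) for k in range(len(P))]
    return x

x = reward_stream(P, r, 0, 60)
# This chain is aperiodic, so x_t converges to the stationary-distribution
# average 0.8 * 1.0 + 0.2 * 0.0 = 0.8, which is the long-run average a(x).
```

Since this stream converges, it is in particular asymptotically periodic (with period one), in line with Lemma 3 below.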

## 3. Proof of Theorem 1

The proof of Theorem 1 is completed through three lemmas. The first lemma shows that if $\succsim$ satisfies A1–A3, then $\succsim$ and $\succsim_{AO}$ coincide on the set of pairs $(x, y)$ for which the series $\sum_t (x_t - y_t)$ is Cesàro-summable and has bounded partial sums, where

(4)  $x \succsim_{AO} y \iff \liminf_{T \to \infty} \frac{1}{T} \sum_{n=0}^{T-1} \sum_{t=0}^{n} (x_t - y_t) \ge 0$

is the preference relation induced by Veinott's [15] average overtaking criterion. All results presented in this paper hold with $\succsim_1$ in the role of $\succsim_{AO}$.
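A hedged numeric rendering of the average overtaking comparison (the streams and horizon are our own choices): the criterion averages the partial sums of the difference stream.

```python
def cesaro_of_partial_sums(diff):
    """Values (1/T) * sum_{n<T} S_n, where S_n = diff_0 + ... + diff_n.
    The liminf of these values being >= 0 is the average overtaking test."""
    vals, S, total = [], 0.0, 0.0
    for T, d in enumerate(diff, start=1):
        S += d
        total += S
        vals.append(total / T)
    return vals

# x = (1, 0, 1, 0, ...) versus y = (0.5, 0.5, ...): equal long-run averages,
# but x front-loads its rewards, so the averaged partial sums tend to +1/4.
diff = [(1.0 if t % 2 == 0 else 0.0) - 0.5 for t in range(10000)]
vals = cesaro_of_partial_sums(diff)
```

So $x$ is strictly preferred to $y$ under this criterion even though the two streams have the same long-run average.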

###### Lemma 1.

(a) The preference relation $\succsim_{AO}$ satisfies A1–A3.

(b) Let $\succsim$ be a preference relation that satisfies A1–A3. For all $x, y \in \mathbb{R}^{\infty}$, if the series $\sum_t (x_t - y_t)$ is Cesàro-summable and has bounded partial sums, then $x \succsim y$ if and only if $x \succsim_{AO} y$.

That the conclusion in Lemma 1(b) holds with $\succsim_1$ in the role of $\succsim_{AO}$ follows from the fact that $\succsim_1$ satisfies A1–A3. The rest of the proof consists of identifying a superset of $X_S$ to which the conclusion in Lemma 1(b) extends. Lemma 2 shows that this conclusion holds on the set of streams $x \in \mathbb{R}^{\infty}$ that can be written

(5)  $x = y + z,$

where $y$ is eventually periodic and the series $\sum_t z_t$ is Cesàro-summable (the averages of its partial sums converge to a finite limit) and has bounded partial sums. Let $D$ denote the set of streams that can be written in this way.
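As a hedged example of the tail class in (5) (the example is ours, not from the paper): the series with terms $z_t = (-1)^t$ diverges, yet its partial sums are bounded and their averages converge, so it is Cesàro-summable with bounded partial sums.

```python
def partial_sums_and_averages(z_terms):
    """Partial sums S_n of sum_t z_t, and the averages (1/T) * sum_{n<T} S_n."""
    sums, avgs, S, total = [], [], 0.0, 0.0
    for T, zt in enumerate(z_terms, start=1):
        S += zt
        sums.append(S)
        total += S
        avgs.append(total / T)
    return sums, avgs

z = [(-1) ** t for t in range(10001)]
sums, avgs = partial_sums_and_averages(z)
# Partial sums alternate between 1 and 0 (bounded); their averages
# converge to the Cesàro sum 1/2.
```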

###### Lemma 2.

A preference relation on $\mathbb{R}^{\infty}$ that satisfies A1–A3 is complete on $D$ and coincides with $\succsim_{AO}$ on this domain.

That $\succsim$ is complete on $D$ means that for all $x, y \in D$, if $x \succsim y$ does not hold, then $y \succsim x$.

###### Proof.

Let $\succsim$ be a preference relation that satisfies A1–A3, and let $x, y \in D$. Then $x - y \in D$. We show that $x \succsim y$ if and only if $x \succsim_{AO} y$. Take $u$ and $z$ such that $x - y = u + z$, where $u_{t+p} = u_t$ for all $t \ge N$ and where $\sum_t z_t$ is Cesàro-summable with bounded partial sums. Without loss of generality, we may assume that $y = 0$.

Case 1: $a(u) = 0$. Then $\sum_t u_t$ is Cesàro-summable and has bounded partial sums. This means that $\sum_t x_t$ is Cesàro-summable and has bounded partial sums. By Lemma 1, $x \succsim 0$ if and only if $x \succsim_{AO} 0$.

Case 2: $a(u) > 0$. (A similar argument applies when $a(u) < 0$.) Then $\sum_{t=0}^{n} u_t \to \infty$ as $n \to \infty$. Since $z$ has bounded partial sums, $\sum_{t=0}^{n} x_t \to \infty$ as well. We show that $x \succ 0$. Choose $w \in \mathbb{R}^{\infty}$ and $N' \ge N$ with the following properties.

(i) $w$ is eventually periodic with period $p$ and $a(w) = 0$.

(ii) $w_t = x_t$ for all $t < N'$ and $w_t \le x_t$ for all $t \ge N'$.

(iii) $x_t - w_t \ge 0$ for all $t$.

(iv) $\sum_{t=0}^{n} w_t \ge 0$ for all $n \ge N'$.

Since $\sum_{t=0}^{n} x_t \to \infty$, (iv) follows from (i)–(ii) by taking $N'$ sufficiently large. Let $v = x - w$. By (iii), $v_t \ge 0$ for all $t$. This means that $x - v$, that is $w$, is eventually periodic with zero long-run average. Thus $\sum_t w_t$ is Cesàro-summable and has bounded partial sums. Since $\sum_{t=0}^{n} w_t \ge 0$ for all $n \ge N'$, the Cesàro sum of $\sum_t w_t$ is nonnegative by (iv). This means that $w \succsim_{AO} 0$, so $w \succsim 0$ by Lemma 1. Here $x \ge w$, so $x \succsim 0$ by A1 and transitivity. By A2, $x \succ 0$. Since $\succsim_{AO}$ also satisfies A1–A3, the same argument shows that $x \succ_{AO} 0$. ∎

It remains to verify that $D$ contains $X_S$. For this it is sufficient to show that every $x \in X_S$ can be written

(6)  $x = y + z,$

where $y$ is eventually periodic and $z_t$ goes to zero at an exponential rate as $t \to \infty$. We say that $x$ is asymptotically periodic if it can be written in this way. Since a $z$ that goes to zero at an exponential rate is absolutely summable, $\sum_t z_t$ is in particular Cesàro-summable with bounded partial sums, so every asymptotically periodic stream belongs to $D$.

###### Lemma 3.

If $x \in \mathbb{R}^{\infty}$ is generated by a stationary policy, then $x$ is asymptotically periodic.

###### Proof.

Let $x$ be generated by applying $f \in F$ given an initial state $i$, so that $x_t$ is the $i$:th component of (here $P(f)^0$ is the identity matrix)

(7)  $P(f)^t r(f), \qquad t = 0, 1, 2, \dots$

We need to show that there exist $y$ and $z$ with

(8)  $x = y + z,$

where $y$ is eventually periodic and $z_t \to 0$ at an exponential rate. A well known corollary of the Perron–Frobenius theorem for nonnegative matrices says that for any stochastic matrix $P$ and $v \in \mathbb{R}^s$, the sequence $P^t v$ converges exponentially to a periodic orbit (see, e.g., [1]). That is, there exist $p \ge 1$, $c \in (0, 1)$, and vectors $w_0, \dots, w_{p-1}$ such that $\|P^t v - w_{t \bmod p}\| = O(c^t)$ for all $t$, where

$w_k = \lim_{m \to \infty} P^{k + mp} v$

for every $k \in \{0, 1, \dots, p-1\}$. Thus we can take $p$, $\bar y$, and $\bar z$ such that

(9)  $P(f)^t r(f) = \bar y_t + \bar z_t$

for every $t$, where $\bar y_{t+p} = \bar y_t$ for all $t$ and where each component of $\bar z_t$ goes to zero faster than $c^t$. If we now set $y_t = (\bar y_t)_i$ and $z_t = (\bar z_t)_i$, then $x = y + z$, where $y$ is eventually periodic and $z_t \to 0$ at an exponential rate. ∎
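A minimal numeric sketch of the periodic-orbit convergence invoked above; the three-state matrix below, with a period-two recurrent class $\{0, 1\}$ and a transient state $2$, is our own illustrative example, not from the paper.

```python
# Stochastic matrix with a period-2 cycle {0, 1} and a transient state 2.
P = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.5, 0.25, 0.25]]
r = [1.0, 0.0, 2.0]

def stream(P, r, i, T):
    """x_t = (P^t r)_i for t = 0..T-1."""
    x, v = [], list(r)
    for _ in range(T):
        x.append(v[i])
        v = [sum(P[k][j] * v[j] for j in range(3)) for k in range(3)]
    return x

x = stream(P, r, 2, 60)
# Started in the transient state, x_t approaches a period-2 orbit
# (values near 0.4 at even t and 0.6 at odd t), with the deviation
# shrinking geometrically, so x = y + z as in (8).
```

The eventually periodic part $y$ here is the period-two orbit, and $z_t$ is the geometrically vanishing deviation.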

## References

- [1] Mustafa A. Akcoglu and Ulrich Krengel. Nonlinear models of diffusion on a finite space. Probability Theory and Related Fields, 76(4):411–420, 1987.
- [2] Geir B. Asheim, Claude d’Aspremont, and Kuntal Banerjee. Generalized time-invariant overtaking. Journal of Mathematical Economics, 46(4):519–533, 2010.
- [3] Kaushik Basu and Tapan Mitra. Utilitarianism for infinite utility streams: A new welfare criterion and its axiomatic characterization. Journal of Economic Theory, 133(1):350–373, 2007.
- [4] David Blackwell. Discrete dynamic programming. Annals of Mathematical Statistics, 33(2):719–726, 1962.
- [5] Barry W. Brown. On the iterative method of dynamic programming on a finite space discrete time Markov process. Annals of Mathematical Statistics, 36(4):1279–1285, 1965.
- [6] Arie Hordijk and Alexander A. Yushkevich. Blackwell optimality. In E. A. Feinberg and A. Shwartz, editors, Handbook of Markov Decision Processes. Kluwer Academic Publishers, Boston, MA, 2002.
- [7] Héctor Jasso-Fuentes and Onésimo Hernández-Lerma. Blackwell optimality for controlled diffusion processes. Journal of Applied Probability, 46(2):372–391, 2009.
- [8] Adam Jonsson and Mark Voorneveld. The limit of discounted utilitarianism. Theoretical Economics, 2017. To appear, available at: https://econtheory.org.
- [9] J. B. Lasserre. Conditions for existence of average and Blackwell optimal stationary policies in denumerable Markov decision processes. Journal of Mathematical Analysis and Applications, 136(2):479–489, 1988.
- [10] Steven A. Lippman. Letter to the Editor — Criterion equivalence in discrete dynamic programming. Operations Research, 17(5):920–923, 1969.
- [11] Andrzej S. Nowak and Oscar Vega-Amaya. A counterexample on overtaking optimality. Mathematical Methods of Operations Research, 49(3):435–439, 1999.
- [12] Alexey B. Piunovskiy. Examples in Markov decision processes, volume 2 of Imperial College Press Optimization Series. Imperial College Press, London, 2013.
- [13] Martin L. Puterman. Markov decision processes: discrete stochastic dynamic programming. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, Inc., New York, 1994.
- [14] Dinah Rosenberg, Eilon Solan, and Nicolas Vieille. Blackwell optimality in Markov decision processes with partial observation. The Annals of Statistics, 30(4):1178–1193, 2002.
- [15] Arthur F. Veinott, Jr. On finding optimal policies in discrete dynamic programming with no discounting. Annals of Mathematical Statistics, 37(5):1284–1294, 1966.