An axiomatic basis for Blackwell optimality
In the theory of Markov decision processes (MDPs), a Blackwell optimal policy is a policy that is optimal for every discount factor sufficiently close to one. This paper provides an axiomatic basis for Blackwell optimality in discrete-time MDPs with finitely many states and finitely many actions.
Key words and phrases: Markov decision processes; Blackwell optimality
2010 Mathematics Subject Classification: Primary 90C40; Secondary 91B06
In his foundational paper, Blackwell [4] showed that for any discrete-time Markov decision process (MDP) with finitely many states and finitely many actions, there exists a stationary policy that is optimal for every discount factor sufficiently close to one. Following Veinott [15], policies that possess this property are now referred to as Blackwell optimal. Blackwell optimality and the related concept of 1-optimality (also known as near optimality, 0-discount optimality, and bias optimality) have come to provide two of the most well studied optimality criteria for undiscounted MDPs (see, e.g., [10, 9, 13, 12, 14, 6, 7]). However, the question of which assumptions on a decision maker's preferences lead to these criteria has not been answered in the literature.
To address this question, we consider a decision maker with preferences over the set ℝ^∞ of bounded real-valued streams x = (x_1, x_2, …). The preference relation ≽ is postulated to be reflexive and transitive, where x ≽ y means that x is at least as good as y, x ≻ y means that x is better than y (x ≽ y but not y ≽ x), and x ∼ y means that x and y are equally good (x ≽ y and y ≽ x). A policy π generates a stream x^π = (x_1^π, x_2^π, …) of expected rewards (see Eq. (3) below), where x_t^π is the expected reward at time t. Let X_S denote the set of streams generated by stationary policies, that is, policies for which the action chosen at time t depends only on the state at time t. The principal result of this paper (Theorem 1) provides conditions on ≽ that ensure that ≽ and ≽* coincide on X_S, where

    x ≽* y  ⟺  liminf_{β ↑ 1} Σ_{t=1}^∞ β^{t−1}(x_t − y_t) ≥ 0        (1)

is the preference relation induced by the 1-optimality criterion. To state this result, we use the following notation: For x ∈ ℝ^∞ and c ∈ ℝ, we let (c, x) denote (c, x_1, x_2, …). If x_t ≥ y_t for all t and x_t > y_t for some t, we write x > y. The long-run average

    lim_{N→∞} (1/N) Σ_{t=1}^N x_t        (2)

of x is denoted by A(x) if the limit (2) exists.
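For intuition, the long-run average in (2) can be approximated by truncating the limit at a large horizon. The following minimal sketch does this in Python; the alternating stream is an illustrative choice of ours, not an example from the text.

```python
# Illustration (ours): the long-run average A(x) of Eq. (2),
# approximated by truncating the limit at a large horizon N.
def long_run_average(x, N=100000):
    """Approximate A(x) = lim_{N->inf} (1/N) * sum_{t=1}^{N} x_t."""
    return sum(x(t) for t in range(1, N + 1)) / N

# The alternating stream 1, 0, 1, 0, ... has long-run average 1/2.
alternating = lambda t: t % 2
print(long_run_average(alternating))  # -> 0.5
```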
Theorem 1. Let ≽ be a preference relation on ℝ^∞ with the following three properties.

A1. For all x, y ∈ ℝ^∞, if x > y, then x ≻ y.

A2. For all x, y ∈ ℝ^∞ and all c ∈ ℝ, if x ≽ y, then (c, x) ≽ (c, y).

A3. For all x ∈ ℝ^∞, if A(x) is well defined, then (A(x), x) ∼ x.

Then ≽ and ≽* coincide on X_S.
This result is proved in [8] on a different domain (the set of streams that are either summable or eventually periodic). To prove Theorem 1, we extend the result from [8] to a larger domain (Lemma 2) and show that this domain contains X_S (Lemma 3).
The first two assumptions in Theorem 1, A1 and A2, are standard (cf. [2, 3]). To interpret A3, which is the Compensation Principle from [8], imagine that the decision maker is faced with two different scenarios: In the first scenario, a stream x of rewards is received. In the second scenario, there is a one-period postponement of x, for which a compensation of c is received in the first period. According to A3, the decision maker is indifferent between x and (c, x) if c = A(x). For an axiomatic defence of this assertion, see [8, Prop. 1].
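To make the Compensation Principle concrete, consider an alternating stream (our illustrative choice, not an example from the text):

```latex
% Worked instance of A3 with the alternating stream x = (1, 0, 1, 0, \dots).
% Its long-run average exists and equals one half:
\[
  A(x) \;=\; \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} x_t \;=\; \frac{1}{2}.
\]
% A3 therefore asserts indifference between x and its one-period
% postponement compensated by A(x) in the first period:
\[
  (1, 0, 1, 0, \dots) \;\sim\; \left(\tfrac{1}{2},\, 1, 0, 1, 0, \dots\right).
\]
```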
Theorem 1 tells us that if a decision maker restricts attention to stationary policies and respects A1, A2, and A3, then any stationary 1-optimal policy is (weakly) best possible with respect to his or her preferences. (The same conclusion holds for Blackwell optimal policies since such policies are 1-optimal by definition.) While restricting attention to stationary policies is often natural, it is well known that not all optimality criteria admit stationary optimal policies [5, 13, 11]. The arguments used in the proof of Theorem 1 apply to sequences that are asymptotically periodic (see Eq. (8) below). We mention without proof that as a consequence, the conclusion in Theorem 1 also holds on the set of streams generated by eventually periodic policies.
We use Blackwell's [4] formulation of a discrete-time MDP, with a finite set S of states, a finite set A of actions, and the set F of all functions from S to A. Thus at each time t = 1, 2, …, a system is observed to be in one of the states in S, an action is chosen from A, and a reward is received. The reward is assumed to be a function r from S × A to ℝ. The transition probability matrix and reward (column) vector that correspond to f ∈ F are denoted by Q(f) and r(f), respectively. So, if the system is observed to be in state s and action f(s) is chosen, then a reward of r(s, f(s)) is received and the system moves to state s′ with probability Q(f)_{s,s′}.

A policy is a sequence π = (f_1, f_2, …), with each f_t ∈ F. The set of all policies is denoted by Π. A policy is stationary if f_t = f_1 for all t, and eventually periodic if there exist N and p ≥ 1 such that f_{t+p} = f_t for all t ≥ N. Given an initial state s, a policy π = (f_1, f_2, …) generates the stream x^π of expected rewards whose t:th term is the s:th component of

    Q(f_1) Q(f_2) ⋯ Q(f_{t−1}) r(f_t),        (3)

with the convention that the empty product (t = 1) is the identity matrix.
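The expected reward stream of Eq. (3) can be computed by iterating the transition matrix. The following minimal sketch shows this for a stationary policy; the two-state chain and rewards are illustrative choices of ours.

```python
# A sketch (ours) of Eq. (3) for a stationary policy: with transition
# matrix Q = Q(f) and reward vector r = r(f), the expected reward at
# time t from initial state s is the s:th component of Q^(t-1) r.
def reward_stream(Q, r, s, horizon):
    """Return (x_1, ..., x_horizon) for the stationary policy f."""
    n = len(r)
    # v holds Q^(t-1) r; the empty product (t = 1) is the identity.
    v = list(r)
    stream = []
    for _ in range(horizon):
        stream.append(v[s])
        v = [sum(Q[i][j] * v[j] for j in range(n)) for i in range(n)]
    return stream

# Two states: from either state, move to the other with probability 1.
Q = [[0.0, 1.0], [1.0, 0.0]]
r = [1.0, 0.0]
print(reward_stream(Q, r, 0, 4))  # -> [1.0, 0.0, 1.0, 0.0]
```

With this deterministic two-state cycle, the stream alternates between the two rewards, an example of the (eventually) periodic behavior exploited in Section 3.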
3. Proof of Theorem 1
The proof of Theorem 1 will be completed through three lemmas. The first lemma shows that if ≽ satisfies A1–A3, then ≽ and ≽_AO coincide on the set of pairs x, y for which the series Σ_t (x_t − y_t) is Cesàro-summable and has bounded partial sums, where

    x ≽_AO y  ⟺  liminf_{N→∞} (1/N) Σ_{m=1}^N Σ_{t=1}^m (x_t − y_t) ≥ 0        (4)

is the preference relation induced by Veinott's [15] average overtaking criterion. All results presented in this paper hold with ≽* in the role of ≽_AO.
Lemma 1. (a) The preference relation ≽_AO satisfies A1–A3.

(b) Let ≽ be a preference relation that satisfies A1–A3. For all x, y ∈ ℝ^∞, if the series Σ_t (x_t − y_t) is Cesàro-summable and has bounded partial sums, then x ≽ y if and only if x ≽_AO y.
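The average overtaking relation can be explored numerically by truncating its defining limit at a finite horizon. The sketch below is ours, with illustrative streams.

```python
# A numerical sketch (ours) of the average overtaking relation: x is
# weakly preferred to y when the Cesaro average of the partial sums
# of (x_t - y_t) stays nonnegative in the limit.
def avg_overtaking_value(x, y, N):
    """(1/N) * sum_{m=1}^{N} sum_{t=1}^{m} (x_t - y_t), truncated at N."""
    partial, total = 0.0, 0.0
    for t in range(1, N + 1):
        partial += x(t) - y(t)
        total += partial
    return total / N

# x = (1, 0, 1, 0, ...) vs y = (0, 1, 0, 1, ...): the partial sums of
# x - y alternate 1, 0, 1, 0, ..., so the average tends to 1/2 > 0.
x = lambda t: t % 2
y = lambda t: 1 - t % 2
print(avg_overtaking_value(x, y, 10000))  # -> 0.5
```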
That the conclusion in Lemma 1(b) holds with ≽* in the role of ≽_AO follows from the fact that ≽* satisfies A1–A3. The rest of the proof consists of identifying a superset of X_S to which the conclusion in Lemma 1(b) extends. Lemma 2 shows that this conclusion holds on the set of x ∈ ℝ^∞ that can be written

    x = x̃ + z,        (5)

where x̃ is eventually periodic and the series Σ_t z_t is Cesàro-summable (the limit

    lim_{N→∞} (1/N) Σ_{m=1}^N Σ_{t=1}^m z_t

exists and is finite) and has bounded partial sums. Let D denote the set of streams that can be written in this way.
Lemma 2. A preference relation ≽ on ℝ^∞ that satisfies A1–A3 is complete on D and coincides with ≽_AO on this domain.

That ≽ is complete on D means that for all x, y ∈ D, if x ≽ y does not hold, then y ≽ x.
Proof. Let ≽ be a preference relation that satisfies A1–A3, and let x, y ∈ D. Then x − y ∈ D. We show that x ≽ y if and only if x ≽_AO y. Take w̃ and z such that x − y = w̃ + z, where w̃_{t+p} = w̃_t for all t ≥ N_0 and where Σ_t z_t is Cesàro-summable with bounded partial sums. Without loss of generality, we may assume that N_0 = 1: the finitely many terms on which w̃ fails to be periodic can be absorbed into z.
Case 1: A(w̃) = 0. Then Σ_t w̃_t is Cesàro-summable and has bounded partial sums. This means that Σ_t (x_t − y_t) is Cesàro-summable and has bounded partial sums. By Lemma 1, x ≽ y if and only if x ≽_AO y.
Case 2: A(w̃) > 0. (A similar argument applies when A(w̃) < 0.) Then Σ_{t=1}^m w̃_t → ∞ as m → ∞. Since z has bounded partial sums, Σ_{t=1}^m (x_t − y_t) → ∞ as well. We show that x ≻ y. Choose N and d ∈ ℝ^∞ with the following properties.

(i) d is eventually periodic with period p, where p is a period of w̃.

(ii) d_t = 0 for all t < N and d_t > 0 for all t ≥ N.

(iii) A(d) = A(w̃).

(iv) Σ_{s=1}^t (x_s − y_s − d_s) ≥ 0 for all t ≥ N.

Since A(w̃) > 0, (iv) follows from (i)–(iii) by taking N sufficiently large: the partial sums Σ_{s=1}^{N−1} (x_s − y_s) tend to infinity as N grows, while by (i)–(iii) the remaining part x − y − d has zero average and bounded fluctuations beyond N. Let u = y + d. By (ii), u_t ≥ y_t for all t, with strict inequality for t ≥ N, so u > y. Moreover, x − u = (w̃ − d) + z, where w̃ − d is eventually periodic with A(w̃ − d) = 0. Thus, as in Case 1, the series Σ_t (x_t − u_t) is Cesàro-summable with bounded partial sums, and since the finitely many partial sums with t < N do not affect the Cesàro limit, the Cesàro sum of Σ_t (x_t − u_t) is nonnegative by (iv). This means that x ≽_AO u, so x ≽ u by Lemma 1. Here u > y, so x ≻ y by A1 and transitivity. Since ≽_AO also satisfies A1–A3 (Lemma 1(a)), the same argument shows that x ≻_AO y. ∎
It remains to verify that D contains X_S. For this it is sufficient to show that every x ∈ X_S can be written

    x = x̃ + z,        (8)

where x̃ is eventually periodic and z_t goes to zero at an exponential rate as t → ∞; such z is absolutely summable, so the series Σ_t z_t is in particular Cesàro-summable with bounded partial sums. We say that x is asymptotically periodic if x can be written in this way.
Lemma 3. If x ∈ ℝ^∞ is generated by a stationary policy, then x is asymptotically periodic.
Proof. Let x be generated by applying f ∈ F given an initial state s, so that x_t is the s:th component of (here Q(f)^0 is the identity matrix)

    Q(f)^{t−1} r(f).

We need to show that there exist x̃ and z with

    x = x̃ + z,

where x̃ is eventually periodic and z_t → 0 at an exponential rate. A well known corollary of the Perron–Frobenius theorem for nonnegative matrices says that for any stochastic matrix Q and any vector v, the sequence (Q^{t−1} v) converges exponentially to a periodic orbit (see, e.g., [1]). That is, there exist p ≥ 1, λ ∈ (0, 1) and vectors u_1, u_2, … such that

    ‖Q^{t−1} v − u_t‖ ≤ λ^t

for all sufficiently large t and where

    u_{t+p} = u_t

for every t. Thus we can take p, λ and u_1, u_2, … such that

    Q(f)^{t−1} r(f) = u_t + e_t

for every t, where u_{t+p} = u_t for all t and where each component of e_t goes to zero at the exponential rate λ^t. If we now set x̃_t = (u_t)_s and z_t = (e_t)_s, then x = x̃ + z, where x̃ is eventually periodic and z_t → 0 at an exponential rate. ∎
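The periodic-orbit fact invoked in the proof can be checked numerically. In the following sketch (the stochastic matrix and reward vector are illustrative choices of ours), iterates of a three-state chain with a transient state and a period-two recurrent class approach a two-periodic orbit.

```python
# A check (ours) of the periodic-orbit fact used above: for a stochastic
# matrix whose recurrent class has period 2, the iterates Q^t r approach
# a period-2 orbit at an exponential rate.
def mat_vec(Q, v):
    return [sum(Qi[j] * v[j] for j in range(len(v))) for Qi in Q]

# State 0 is transient; states 1 and 2 form a class of period 2.
Q = [[0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
r = [1.0, 2.0, 3.0]

v = list(r)
iterates = []
for t in range(60):
    iterates.append(v[0])       # the s:th component, here s = 0
    v = mat_vec(Q, v)

# Far down the sequence, the gap between terms two steps apart is tiny,
# reflecting exponential convergence to a period-2 orbit:
print(abs(iterates[-1] - iterates[-3]) < 1e-9)  # -> True
```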
References

[1] Mustafa A. Akcoglu and Ulrich Krengel. Nonlinear models of diffusion on a finite space. Probability Theory and Related Fields, 76(4):411–420, 1987.
[2] Geir B. Asheim, Claude d'Aspremont, and Kuntal Banerjee. Generalized time-invariant overtaking. Journal of Mathematical Economics, 46(4):519–533, 2010.
[3] Kaushik Basu and Tapan Mitra. Utilitarianism for infinite utility streams: A new welfare criterion and its axiomatic characterization. Journal of Economic Theory, 133(1):350–373, 2007.
[4] David Blackwell. Discrete dynamic programming. Annals of Mathematical Statistics, 33(2):719–726, 1962.
[5] Barry W. Brown. On the iterative method of dynamic programming on a finite space discrete time Markov process. Annals of Mathematical Statistics, 36(4):1279–1285, 1965.
[6] Arie Hordijk and Alexander A. Yushkevich. Blackwell optimality. In E.A. Feinberg and A. Shwartz, editors, Handbook of Markov Decision Processes. Springer, Boston, MA, 2002.
[7] Héctor Jasso-Fuentes and Onésimo Hernández-Lerma. Blackwell optimality for controlled diffusion processes. Journal of Applied Probability, 46(2):372–391, 2009.
[8] Adam Jonsson and Mark Voorneveld. The limit of discounted utilitarianism. Theoretical Economics, 2017. To appear, available at: https://econtheory.org.
[9] J.B. Lasserre. Conditions for existence of average and Blackwell optimal stationary policies in denumerable Markov decision processes. Journal of Mathematical Analysis and Applications, 136(2):479–489, 1988.
[10] Steven A. Lippman. Letter to the Editor — Criterion equivalence in discrete dynamic programming. Operations Research, 17(5):920–923, 1969.
[11] Andrzej S. Nowak and Oscar Vega-Amaya. A counterexample on overtaking optimality. Mathematical Methods of Operations Research, 49(3):435–439, 1999.
[12] Alexey B. Piunovskiy. Examples in Markov Decision Processes, volume 2 of Imperial College Press Optimization Series. Imperial College Press, London, 2013.
[13] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York, 1994.
[14] Dinah Rosenberg, Eilon Solan, and Nicolas Vieille. Blackwell optimality in Markov decision processes with partial observation. The Annals of Statistics, 30(4):1178–1193, 2002.
[15] Arthur F. Veinott. On finding optimal policies in discrete dynamic programming with no discounting. Annals of Mathematical Statistics, 37(5):1284–1294, 1966.