Servant of Many Masters:
Shifting priorities in Pareto-optimal sequential decision-making
Abstract
It is often argued that an agent making decisions on behalf of two or more principals who have different utility functions should adopt a Pareto-optimal policy, i.e., a policy that cannot be improved upon for one principal without making sacrifices for another. A famous theorem of Harsanyi shows that, when the principals have a common prior on the outcome distributions of all policies, a Pareto-optimal policy for the agent is one that maximizes a fixed, weighted linear combination of the principals’ utilities.
In this paper, we show that Harsanyi’s theorem does not hold for principals with different priors, and derive a more precise generalization which does hold, which constitutes our main result. In this more general case, the relative weight given to each principal’s utility should evolve over time according to how well the agent’s observations conform with that principal’s prior. The result has implications for the design of contracts, treaties, joint ventures, and robots.
1 Introduction
As AI systems take on an increasingly pivotal decision-making role in human society, an important question arises: Whose values should a powerful decision-making machine be built to serve? (Bostrom, 2014)
Consider, informally, a scenario wherein two or more principals—perhaps individuals, companies, or states—are considering cooperating to build or otherwise obtain an “agent” that will then interact with an environment on their behalf. The “agent” here could be anything that follows a policy, such as a robot, a corporation, or a web-based AI system. In such a scenario, the principals will be concerned with the question of “how much” the agent will prioritize each principal’s interests, a question which this paper addresses quantitatively.
One might be tempted to model the agent as maximizing the expected value, given its observations, of some utility function $U$ of the environment that equals a weighted sum

(1)  $U = w_1 U^{(1)} + w_2 U^{(2)}$

of the principals’ individual utility functions $U^{(1)}$ and $U^{(2)}$, as Harsanyi’s social aggregation theorem (Harsanyi, 1980) recommends. Then the question of prioritization could be reduced to that of choosing values for the weights $w_1$ and $w_2$.
However, this turns out to be a suboptimal approach from the perspective of the principals. As we shall see in Proposition 3, this solution form is not generally compatible with Pareto-optimality when the principals have different beliefs. Harsanyi’s setting does not account for agents having different priors, nor for decisions being made sequentially, after future observations.
In such a setting, we need a new form of solution, exhibited in this paper. The solution is presented along with a recursion (Theorem 3) that characterizes solutions by a process algebraically similar to, but meaningfully different from, Bayesian updating. The updating process resembles a kind of bet-settling between the principals, which allows them each to expect to benefit from the veracity of their own beliefs.
Qualitatively, this phenomenon can be seen in isolation whenever two people make a bet on a piece of decision-irrelevant trivia. If neither Alice nor Bob would base any important decision on whether Michael Jackson was born in 1958 or 1959, they might still make a bet for $100 on the answer. For a person chosen to arbitrate the bet (their “agent”), Michael Jackson’s birth year now becomes a decision-relevant observation: it determines which of Alice and Bob gets the money!
Even in scenarios where differences in belief are not decision-irrelevant, one might expect some “degree” of bet-settling to arise from the disagreement. The main result of this paper (Theorem 3) is a precise formulation of exactly how, and how much, a Pareto-optimal agent will tend to prioritize each of its principals over time, as a result of differences in their implicit predictions about the agent’s observations.
Related work
This paper may be viewed as extending or complementing results in several areas:
Value alignment theory.
The “single principal” value alignment problem—that of aligning the value function of an agent with the values of a single human, or a team of humans in close agreement with one another—is already a very difficult one and should not be swept under the rug; approaches like inverse reinforcement learning (IRL) (Russell, 1998; Ng et al., 2000; Abbeel and Ng, 2004) and cooperative inverse reinforcement learning (CIRL) (Hadfield-Menell et al., 2016) have only begun to address it.
Social choice theory.
The whole of social choice theory and voting theory may be viewed as an attempt to specify an agreeable formal policy to enact on behalf of a group. Harsanyi’s utility aggregation theorem (Harsanyi, 1980) suggests one form of solution: maximizing a linear combination of group members’ utility functions. The present work shows that this solution is inappropriate when principals have different beliefs, and Theorem 3 may be viewed as an extension of Harsanyi’s form that accounts simultaneously for differing priors and the prospect of future observations. Indeed, Harsanyi’s form follows as a direct corollary of Theorem 3 when the principals do share the same beliefs (Corollary 3).
Bargaining theory.
The formal theory of bargaining, as pioneered by Nash (1950) and carried on by Myerson (1979, 2013) and Myerson and Satterthwaite (1983), is also topical. Future investigation in this area might be aimed at generalizing their work to sequential decision-making settings, and this author recommends a focus on research specifically targeted at resolving conflicts.
Multi-agent systems.
There is ample literature examining multi-agent systems using sequential decision-making models. Shoham and Leyton-Brown (2008) survey various models of multi-player games using an MDP to model each agent’s objectives. Chapter 9 of the same text surveys social choice theory, but does not account for sequential decision-making.
Zhang and Shah (2014) may be considered a sequential decision-making approach to social choice: they use MDPs to represent the decisions of players in a competitive game, and exhibit an algorithm for the players that, if followed, arrives at a Pareto-optimal Nash equilibrium satisfying a certain fairness criterion. Among the literature surveyed here, that paper is the closest to the present work in terms of its intended application: roughly speaking, achieving mutually desirable outcomes via sequential decision-making. However, that work is concerned with an ongoing interaction between the players, rather than with selecting a policy for a single agent to follow, as in this paper.
Multi-objective sequential decision-making.
There is also a good deal of work on Multi-Objective Optimization (MOO) (Tzeng and Huang, 2011), including for sequential decision-making, where solution methods have been called Multi-Objective Reinforcement Learning (MORL). For instance, Gábor et al. (1998) introduce a MORL method for learning a set of Pareto-optimal policies for a Multi-Objective MDP (MOMDP). Soh and Demiris (2011) define Multi-Reward Partially Observable Markov Decision Processes (MR-POMDPs), and use genetic algorithms to produce non-dominated sets of policies for them. Roijers et al. (2015) refer to the same problems as Multi-Objective POMDPs (MO-POMDPs), and provide a bounded approximation method for the optimal solution set for all possible weightings of the objectives. Wang (2014) surveys MORL methods, and contributes Multi-Objective Monte-Carlo Tree Search (MO-MCTS) for discovering multiple Pareto-optimal solutions to a multi-objective optimization problem. Wray and Zilberstein (2015) introduce Lexicographic Partially Observable Markov Decision Processes (LPOMDPs), along with two accompanying solution methods.
However, none of these or related works addresses scenarios where the objectives are derived from principals with differing beliefs, from which the priority-shifting phenomenon of Theorem 3 arises. Differing beliefs are likely to play a key role in negotiations, so for that purpose, the formulation of multi-objective decision-making adopted here is preferable.
2 Notation
Random variables are denoted by uppercase letters, e.g., $X$, and lowercase letters, e.g., $x$, are used as indices ranging over the values of a variable, as in the equation

$\mathbb{E}[f(X)] = \sum_x \Pr(X = x)\, f(x).$

Given a set $S$, the set of probability distributions on $S$ is denoted $\Delta(S)$.
Sequences are denoted by overbars, so e.g., given a sequence $(x_1, \dots, x_n)$, $\bar{x}$ stands for the whole sequence. Subsequences are denoted by subscripted inequalities, so e.g., $\bar{x}_{<t}$ stands for $(x_1, \dots, x_{t-1})$, and $\bar{x}_{\leq t}$ stands for $(x_1, \dots, x_t)$.
3 Formalism
N.B.: All results in this paper generalize directly from agents with two principals to agents with several, but for clarity of exposition, the case of two principals will be prioritized.
Consider a scenario wherein Alice and Bob will share some cake, and have different predictions of the cake’s color. Even if the color would be decision-irrelevant for either Alice or Bob on their own (they don’t care what color the cake is), we will show that the difference between their predictions will tend to make the cake color a decision-relevant observation for a Pareto-optimal cake-splitting policy that is adopted before they see the cake. Specifically, we will show that Pareto-optimal policies tend to incorporate some degree of bet-settling between Alice and Bob, where the person who was more right about the color of the cake will end up getting more of it.
Serving multiple principals as a single POMDP
To formalize such scenarios, where a single agent acts on behalf of multiple principals, we need some definitions.
We encode each principal $j$’s view of the agent’s decision problem as a finite-horizon POMDP,

$D^{(j)} = (S, A, T^{(j)}, U^{(j)}, O, \Omega^{(j)}, n),$

which simultaneously represents that principal’s beliefs about the environment, and the principal’s utility function (see Russell et al. (2003) for an introduction to POMDPs). These symbols take on their usual meaning:

- $S$ represents a set of possible states of the environment,
- $A$ represents the set of possible actions available to the agent,
- $T^{(j)}$ represents the conditional probabilities principal $j$ believes will govern the environment state transitions, i.e., $T^{(j)}(s_{t+1} \mid s_t, a_t)$,
- $U^{(j)} : S^n \to \mathbb{R}$ represents principal $j$’s utility function from sequences of environmental states to $\mathbb{R}$; for the sake of generality, $U^{(j)}$ is not assumed to be additive over time, as reward functions often are,
- $O$ represents the set of possible observations of the agent,
- $\Omega^{(j)}$ represents the conditional probabilities principal $j$ believes will govern the agent’s observations, i.e., $\Omega^{(j)}(o_t \mid s_t)$, and
- $n$ is the horizon (number of time steps).
This POMDP structure is depicted by the Bayesian network in Figure 1. (See Darwiche (2009) for an introduction to Bayesian networks.) At each point in time $t$, the agent has a time-specific policy $\pi_t$, which receives the agent’s history,

$\bar{h}_{\leq t} = (o_1, a_1, \dots, a_{t-1}, o_t),$

and returns a distribution on actions $\pi_t(\bar{h}_{\leq t}) \in \Delta(A)$, which will then be used to generate an action $a_t$ with probability $\pi_t(a_t \mid \bar{h}_{\leq t})$. Thus, principal $j$’s subjective probability of an outcome $(\bar{s}, \bar{o}, \bar{a})$ is given by a probability distribution that takes $\pi$ as a parameter:

(2)  $P^{(j)}_{\pi}(\bar{s}, \bar{o}, \bar{a}) = \prod_{t=1}^{n} T^{(j)}(s_t \mid s_{t-1}, a_{t-1})\, \Omega^{(j)}(o_t \mid s_t)\, \pi_t(a_t \mid \bar{h}_{\leq t}),$

where $T^{(j)}(s_1 \mid s_0, a_0)$ denotes, by convention, the distribution of the initial state.
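To make the chain of factors in Equation 2 concrete, here is a minimal Python sketch that multiplies the transition, observation, and policy factors along one trajectory of a hypothetical two-state, one-action model (all names and numbers are invented for illustration, not taken from the paper):

```python
# Hypothetical model: transition probabilities T(s'|s,a), observation
# probabilities Omega(o|s), an initial-state distribution, and a policy.
T = {("s0", "a"): {"s0": 0.7, "s1": 0.3}, ("s1", "a"): {"s0": 0.4, "s1": 0.6}}
OMEGA = {"s0": {"o0": 0.9, "o1": 0.1}, "s1": {"o0": 0.2, "o1": 0.8}}
INIT = {"s0": 1.0, "s1": 0.0}

def pi(action, history):
    """A trivial full-memory policy that always plays action "a"."""
    return 1.0 if action == "a" else 0.0

def trajectory_prob(states, obs, actions):
    """Probability of a full outcome (s-bar, o-bar, a-bar), as in Equation 2:
    one transition factor, one observation factor, and one policy factor per
    time step (the first "transition" is the initial-state distribution)."""
    prob = INIT[states[0]] * OMEGA[states[0]][obs[0]]
    history = [obs[0]]
    for t, a in enumerate(actions):
        prob *= pi(a, history)
        if t + 1 < len(states):
            prob *= T[(states[t], a)][states[t + 1]]
            prob *= OMEGA[states[t + 1]][obs[t + 1]]
            history += [a, obs[t + 1]]
    return prob

# 1.0 * 0.9 (observe o0 in s0) * 1.0 (policy) * 0.3 (s0 -> s1) * 0.8 (o1 in s1)
p = trajectory_prob(["s0", "s1"], ["o0", "o1"], ["a"])
assert abs(p - 0.216) < 1e-12
```

Summing this quantity over all outcomes consistent with a given history recovers the conditional probabilities used in the rest of the paper.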
Full-memory assumption. Every policy in this paper will be assumed to employ a “full memory”, so it decomposes into a sequence of policies $(\pi_1, \dots, \pi_n)$, one for each time step. In Figure 1, the part of the Bayes net governed by the full-memory policy is highlighted in green.
Common knowledge assumptions.
It is assumed that the principals will have common knowledge of the (full-memory) policy they select for the agent to implement, but that the principals may have different beliefs about how the environment works, and of course different utility functions. It is also assumed that the principals have common knowledge of one another’s current beliefs at the time of the agent’s creation, which we refer to as their priors.
This last assumption is critical. During the agent’s creation, one should expect each principal’s beliefs to have updated somewhat in response to disagreement with the other. Assuming common knowledge of their priors means assuming the principals to have reached an equilibrium where, each knowing what the other believes, they do not wish to further update their own beliefs.¹

¹It is enough to assume the principals have reached a “persistent disagreement” that cannot be mediated by the agent in some way. Future work should design solutions for facilitating the process of attaining common knowledge, or for obviating the need to assume it.
Pareto-optimal policies
A policy will be considered Pareto-optimal relative to a set of POMDPs it could be deployed to solve.

Definition (Compatible POMDPs). We say that two POMDPs, $D^{(1)}$ and $D^{(2)}$, are compatible if any policy for one may be viewed as a policy for the other, i.e., they have the same set of actions $A$ and observations $O$, and the same number of time steps $n$.

In this context, where a single policy may be evaluated relative to more than one POMDP, we use superscripts to represent which POMDP is governing the probabilities and expectations, e.g.,

$\mathbb{E}^{(j)}_{\pi}[U^{(j)}]$

represents the expectation in $D^{(j)}$ of the utility function $U^{(j)}$, assuming policy $\pi$ is followed.

Definition (Pareto-optimal policies). A policy $\pi$ is Pareto-optimal for a set of compatible POMDPs if there is no other policy $\pi'$ such that $\mathbb{E}^{(j)}_{\pi'}[U^{(j)}] \geq \mathbb{E}^{(j)}_{\pi}[U^{(j)}]$ for every $j$, with strict inequality for some $j$.

It is assumed that, before the agent’s creation, the principals will be seeking a Pareto-optimal (full-memory) policy for the agent to follow, relative to the POMDPs describing each principal’s view of the agent’s task.
Example: cake betting
A quantitative model of a cake betting scenario is laid out in Table 1, and described as follows.
Alice (Principal 1) and Bob (Principal 2) are about to be presented with a cake which they can choose to split in half to share, or give entirely to one of them. They have (built or purchased) a robot that will make the cake-splitting decision on their behalf. Alice’s utility function $U^{(1)}$ returns 0 if she gets no cake, 20 if she gets half a cake, or 30 if she gets a whole cake. Bob’s utility function $U^{(2)}$ values Bob getting cake in the same way.
Table 1: Utilities for each allocation (Alice’s share, Bob’s share).

Color        Allocation      $U^{(1)}$   $U^{(2)}$
red cake     (all, none)     30          0
             (half, half)    20          20
             (none, all)     0           30
green cake   (all, none)     30          0
             (half, half)    20          20
             (none, all)     0           30
However, Alice and Bob have different beliefs about the color of the cake. Alice is 90% sure that the cake is red ($P^{(1)}(\text{red}) = 0.9$), versus 10% sure it will be green ($P^{(1)}(\text{green}) = 0.1$), whereas Bob’s probabilities are reversed.
Upon seeing the cake, the robot must decide to either give Alice the entire cake (all, none), split the cake half-and-half (half, half), or give Bob the entire cake (none, all). Moreover, Alice and Bob have common knowledge of all these facts.
Now, consider the following Pareto-optimal full-memory policy $\pi$ that favors Alice (Principal 1) when the cake is red, and Bob (Principal 2) when the cake is green:

$\pi(\text{red}) = (\text{all}, \text{none}), \qquad \pi(\text{green}) = (\text{none}, \text{all}).$

This policy can be viewed intuitively as a bet between Alice and Bob about the color of the cake, and is highly appealing to both principals:

$\mathbb{E}^{(1)}_{\pi}[U^{(1)}] = 0.9 \cdot 30 + 0.1 \cdot 0 = 27, \qquad \mathbb{E}^{(2)}_{\pi}[U^{(2)}] = 0.1 \cdot 0 + 0.9 \cdot 30 = 27.$

In particular, $\pi$ is more appealing to both Alice and Bob than an agreement to deterministically split the cake (half, half), which would yield them each an expected utility of only 20. However,
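These numbers are easy to verify directly. The following Python sketch (variable names are illustrative; the 90/10 beliefs match the example’s expected utility of 27) compares the bet-settling policy against the deterministic split:

```python
# Beliefs: each principal's probability that the cake is red vs. green.
BELIEFS = {
    "alice": {"red": 0.9, "green": 0.1},
    "bob":   {"red": 0.1, "green": 0.9},
}

# Utilities from Table 1, indexed by allocation (Alice's share, Bob's share).
UTILITY = {
    "alice": {"all_none": 30, "half_half": 20, "none_all": 0},
    "bob":   {"all_none": 0,  "half_half": 20, "none_all": 30},
}

def expected_utility(principal, policy):
    """Expected utility of a color -> allocation policy under the principal's prior."""
    return sum(p * UTILITY[principal][policy[color]]
               for color, p in BELIEFS[principal].items())

bet_policy = {"red": "all_none", "green": "none_all"}      # the policy above
split_policy = {"red": "half_half", "green": "half_half"}  # always split

for who in ("alice", "bob"):
    assert abs(expected_utility(who, bet_policy) - 27) < 1e-12
    assert abs(expected_utility(who, split_policy) - 20) < 1e-12
```

Both principals expect 27 from the bet but only 20 from the unconditional split, which is why each would agree to the bet ex ante.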
Proposition.
The Pareto-optimal strategy $\pi$ above cannot be implemented by any agent that naïvely maximizes a fixed-over-time linear combination of the conditionally expected utilities of the two principals. That is, it cannot be implemented by any policy $\pi'$ satisfying

(3)  $\pi'_t(\bar{h}_{\leq t}) \in \operatorname*{argmax}_{\delta \in \Delta(A)} \big( w_1\, \mathbb{E}^{(1)}[U^{(1)} \mid \bar{h}_{\leq t}, a_t \sim \delta] + w_2\, \mathbb{E}^{(2)}[U^{(2)} \mid \bar{h}_{\leq t}, a_t \sim \delta] \big)$

for some fixed weights $(w_1, w_2)$. Moreover, every such policy is strictly worse than $\pi$ in expectation for at least one of the principals.
Proof.
See appendix. ∎
This proposition is relatively unsurprising when one considers the full-memory policy intuitively as a bet-settling mechanism, because the nature of betting is to favor different preferences based on future observations. However, to be sure of this impossibility claim, one must rule out the possibility that $\pi$ could be implemented by having the agent choose which element of the argmax in Equation 3 to use based on whether the cake appears red or green. (See appendix.)
Characterizing Pareto-optimality geometrically
With the definitions above, we can characterize Pareto-optimality as a geometric condition.
Policy mixing assumption.
Given policies $\pi^1, \dots, \pi^k$ and a distribution $p \in \Delta(\{1, \dots, k\})$, we assume that the agent may construct a new policy by choosing at time 0 between the $\pi^i$ with probability $p_i$, and then executing the chosen policy for the rest of time. We write this policy as $\sum_i p_i \pi^i$, whence we derive:

(4)  $\mathbb{E}^{(j)}_{\sum_i p_i \pi^i}[U^{(j)}] = \sum_i p_i\, \mathbb{E}^{(j)}_{\pi^i}[U^{(j)}].$
Lemma (Polytope Lemma). A full-memory policy $\pi$ is Pareto-optimal for principals 1 and 2 if and only if there exist weights $(w_1, w_2)$ with $w_1 + w_2 = 1$ and $w_j \geq 0$ such that

(5)  $\pi \in \operatorname*{argmax}_{\pi'} \big( w_1\, \mathbb{E}^{(1)}_{\pi'}[U^{(1)}] + w_2\, \mathbb{E}^{(2)}_{\pi'}[U^{(2)}] \big).$
Proof.
The mixing assumption gives the set of policies the structure of a convex space, which the maps $\pi \mapsto \mathbb{E}^{(j)}_{\pi}[U^{(j)}]$ respect, by Equation 4. This ensures that the image $V \subseteq \mathbb{R}^2$ of the map given by

$\pi \mapsto \big(\mathbb{E}^{(1)}_{\pi}[U^{(1)}],\ \mathbb{E}^{(2)}_{\pi}[U^{(2)}]\big)$

is a closed, convex polytope. As such, a point $v \in V$ lies on the Pareto boundary of $V$ if and only if there exist non-negative weights $(w_1, w_2)$, not both zero, such that

$v \in \operatorname*{argmax}_{v' \in V} \big( w_1 v'_1 + w_2 v'_2 \big).$

After normalizing $w_1 + w_2$ to equal 1, this implies the result. ∎
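The lemma can be checked exhaustively in a small finite case. The sketch below enumerates the nine deterministic full-memory policies of the cake example (Table 1 utilities; the 90/10 beliefs implied by the example’s expected utility of 27), computes each policy’s payoff pair, and confirms that the bet-settling policy maximizes the equal weighting $w = (1/2, 1/2)$, as Equation 5 requires of a Pareto optimum; names are illustrative only:

```python
from itertools import product

ACTIONS = {"all_none": (30, 0), "half_half": (20, 20), "none_all": (0, 30)}
BELIEFS = {"alice": {"red": 0.9, "green": 0.1},
           "bob":   {"red": 0.1, "green": 0.9}}

def payoff_pair(policy):
    """(Alice's, Bob's) expected utility of a color -> action policy,
    each computed under that principal's own prior."""
    eu_alice = sum(p * ACTIONS[policy[c]][0] for c, p in BELIEFS["alice"].items())
    eu_bob = sum(p * ACTIONS[policy[c]][1] for c, p in BELIEFS["bob"].items())
    return (eu_alice, eu_bob)

# All 9 deterministic full-memory policies: one action per observed color.
policies = [dict(zip(("red", "green"), acts))
            for acts in product(ACTIONS, repeat=2)]

# The bet-settling policy uniquely maximizes the weighting w = (1/2, 1/2),
# so by the Polytope Lemma it lies on the Pareto boundary.
w = (0.5, 0.5)
best = max(policies,
           key=lambda pol: w[0] * payoff_pair(pol)[0] + w[1] * payoff_pair(pol)[1])
assert best == {"red": "all_none", "green": "none_all"}
assert all(abs(x - 27) < 1e-12 for x in payoff_pair(best))
```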
Characterizing Pareto-optimality probabilistically
To help us apply the Polytope Lemma, we will adopt an interpretation wherein the weights $(w_1, w_2)$ are subjective probabilities for the agent, as follows.
For any $w = (w_1, w_2)$ with $w_1 + w_2 = 1$ and $w_j \geq 0$, we define a new POMDP, $D$, that works by flipping a weighted coin, and then running $D^{(1)}$ or $D^{(2)}$ thereafter, according to the coin flip. We denote this by

$D = w_1 D^{(1)} + w_2 D^{(2)},$

and call $D$ a POMDP mixture. A formal definition of $D$ is given in the appendix. It can be depicted by a Bayes net by adding an additional environmental node for the coin flip $Z$ in the combined diagram of $D^{(1)}$ and $D^{(2)}$ (see Figure 2).
Given any full-memory policy $\pi$, the expected payoff of $\pi$ in $D$ is exactly

$\mathbb{E}^{D}_{\pi}[U] = w_1\, \mathbb{E}^{(1)}_{\pi}[U^{(1)}] + w_2\, \mathbb{E}^{(2)}_{\pi}[U^{(2)}].$

Therefore, using the above definitions, Lemma 3 may be restated in the following equivalent form:

Lemma (Mixture Lemma). Given a pair of compatible POMDPs, a full-memory policy $\pi$ is Pareto-optimal for that pair if and only if there exists $w$ such that $\pi$ is an optimal full-memory policy for the single POMDP given by $D = w_1 D^{(1)} + w_2 D^{(2)}$.
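The payoff identity above can be verified numerically in the cake example. In this sketch (hypothetical names; 90/10 beliefs as in the example), the mixture expectation is computed from the joint distribution over the coin flip $Z$ and the cake color, then compared against the weighted sum of the principals’ separate expectations:

```python
BELIEFS = {1: {"red": 0.9, "green": 0.1}, 2: {"red": 0.1, "green": 0.9}}
UTILITY = {1: {"all_none": 30, "half_half": 20, "none_all": 0},
           2: {"all_none": 0,  "half_half": 20, "none_all": 30}}

def eu(j, policy):
    """Principal j's expected utility of a color -> allocation policy."""
    return sum(p * UTILITY[j][policy[c]] for c, p in BELIEFS[j].items())

def eu_in_mixture(w1, policy):
    """Expected payoff in D = w1*D1 + w2*D2, computed directly from the
    joint distribution over (coin flip Z, cake color)."""
    total = 0.0
    for z, wz in ((1, w1), (2, 1.0 - w1)):
        for color, p in BELIEFS[z].items():
            total += wz * p * UTILITY[z][policy[color]]
    return total

always_alice = {"red": "all_none", "green": "all_none"}
w1 = 0.3
# E^D[U] = w1 * E^(1)[U1] + w2 * E^(2)[U2], as the identity claims:
lhs = eu_in_mixture(w1, always_alice)
rhs = w1 * eu(1, always_alice) + (1.0 - w1) * eu(2, always_alice)
assert abs(lhs - rhs) < 1e-12
```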
Expressed in the form of Equation 5, it might not be clear how a Pareto-optimal full-memory policy makes use of its observations over time, aside from storing them in memory. For example, is there any sense in which the agent carries “beliefs” about the environment that it “updates” at each time step? Lemma 3 allows us to reduce some such questions about Pareto-optimal policies to questions about single POMDPs.
If $\pi$ is an optimal full-memory policy for a single POMDP, the optimality of each action distribution $\pi_t(\bar{h}_{\leq t})$ can be characterized without reference to the previous policy components $\pi_{t'}$ for $t' < t$, nor to $\pi_t(\bar{h}')$ for any alternate history $\bar{h}'$. This can be expressed using Pearl’s “do” notation (Pearl, 2009):

Definition (“do” notation). The probability of the observations $\bar{o}_{\leq t}$ causally conditioned on the actions $\bar{a}_{<t}$ is defined as

$P(\bar{o}_{\leq t} \mid \mathrm{do}(\bar{a}_{<t})) = \sum_{\bar{s}_{\leq t}} \prod_{k=1}^{t} T(s_k \mid s_{k-1}, a_{k-1})\, \Omega(o_k \mid s_k),$

i.e., the probability of the observations if the actions were externally imposed rather than generated by a policy.

Definition (Expected utility abbreviation). For brevity, given any POMDP $D$ and policy $\pi$, we write

$\mathbb{E}[U \mid \mathrm{do}(\bar{h}_{\leq t}), a_t \sim \delta, \pi]$

for the total expected utility in $D$ that would result from replacing $\pi_t(\bar{h}_{\leq t})$ by the action distribution $\delta$. This quantity does not depend on $\pi_{t'}$ for $t' \leq t$.
Proposition (Classical separability).
If $D$ is a POMDP described by conditional probabilities $T$, $\Omega$ and utility function $U$ (as in Equation 2), then a full-memory policy $\pi$ is optimal for $D$ if and only if for each time step $t$ and each observation/action history $\bar{h}_{\leq t}$, the action distribution $\pi_t(\bar{h}_{\leq t})$ satisfies the following backward recursion:

$\pi_t(\bar{h}_{\leq t}) \in \operatorname*{argmax}_{\delta \in \Delta(A)} P(\bar{o}_{\leq t} \mid \mathrm{do}(\bar{a}_{<t}))\, \mathbb{E}[U \mid \mathrm{do}(\bar{h}_{\leq t}), a_t \sim \delta, \pi_{>t}].$

This characterization of $\pi_t(\bar{h}_{\leq t})$ does not refer to $\pi_{t'}$ for $t' < t$, nor to $\pi_t(\bar{h}')$ for any alternate history $\bar{h}'$.
Proof.
This is just Bellman’s Principle of Optimality; see Bellman (1957), Ch. III.3. ∎
N.B.: Unlike Bellman’s “backup” equation, the above proposition requires no assumption whatsoever on the form of the utility function. Note also that when the probability term $P(\bar{o}_{\leq t} \mid \mathrm{do}(\bar{a}_{<t}))$ is nonzero, it may be removed from the argmax without changing the theorem statement. But when the term is zero, its presence is essential, and implies that $\pi_t(\bar{h}_{\leq t})$ can be anything.
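One step of this recursion can be sketched as follows, in a hypothetical horizon-one problem invented for illustration (one hidden state drawn once, one noisy observation, one action): for each possible observation, the agent selects the action maximizing the probability term times the conditionally expected utility.

```python
# A toy one-step decision problem (hypothetical numbers).
STATES = ("red", "green")
P_STATE = {"red": 0.5, "green": 0.5}   # prior over the hidden state
OBS_NOISE = 0.8                        # P(observation matches the state)
ACTIONS = ("guess_red", "guess_green")

def p_obs(o, s):
    return OBS_NOISE if o == s else 1.0 - OBS_NOISE

def utility(s, a):
    return 1.0 if a == "guess_" + s else 0.0

# Backward recursion at the final (here, only) time step: for each
# observation history, maximize P(o) * E[U | o, a]. The score below is the
# unnormalized version, sum_s P(s) P(o|s) U(s, a), which has the same argmax.
policy = {}
for o in STATES:
    def score(a, o=o):
        return sum(P_STATE[s] * p_obs(o, s) * utility(s, a) for s in STATES)
    policy[o] = max(ACTIONS, key=score)

assert policy == {"red": "guess_red", "green": "guess_green"}
```

At longer horizons the same maximization is applied history by history, from the final time step backwards, with the later policy components held fixed at each step.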
It turns out that Pareto-optimality can be characterized in a similar way by backward recursion from the final time step. The resulting recursion reveals a pattern in how the weights on the principals’ conditionally expected utilities must change over time, which is the main result of this paper:

Theorem (Pareto-optimal control theorem). Given a pair of compatible POMDPs with horizon $n$, a full-memory policy $\pi$ is Pareto-optimal if and only if its components $\pi_t$ for $1 \leq t \leq n$ satisfy the following backward recursion for some weights $(w_1, w_2)$:

$\pi_t(\bar{h}_{\leq t}) \in \operatorname*{argmax}_{\delta \in \Delta(A)} \sum_{j \in \{1, 2\}} w_j\, P^{(j)}(\bar{o}_{\leq t} \mid \mathrm{do}(\bar{a}_{<t}))\, \mathbb{E}^{(j)}[U^{(j)} \mid \mathrm{do}(\bar{h}_{\leq t}), a_t \sim \delta, \pi_{>t}].$

In words, to achieve Pareto-optimality, the agent must

- use each principal’s own world-model when estimating the degree to which a decision favors that principal’s utility function, and
- shift the relative priority of each principal’s expected utility in the agent’s maximization target over time, by a factor proportional to how well that principal’s prior predicts the agent’s observations, $P^{(j)}(\bar{o}_{\leq t} \mid \mathrm{do}(\bar{a}_{<t}))$.
N.B.: The analogous result for more than two POMDPs holds as well, with essentially the same proof.
Proof of Theorem 3.
By Lemma 3, the Pareto-optimality of $\pi$ for the pair $(D^{(1)}, D^{(2)})$ is equivalent to its classical optimality for $D = w_1 D^{(1)} + w_2 D^{(2)}$ for some $w$. Writing $P^{D}$ for probabilities in $D$, Proposition 3 says this is equivalent to $\pi_t(\bar{h}_{\leq t})$ maximizing the following expression for each $t$ and $\bar{h}_{\leq t}$:

(6)  $P^{D}(\bar{o}_{\leq t} \mid \mathrm{do}(\bar{a}_{<t}))\, \mathbb{E}^{D}[U \mid \mathrm{do}(\bar{h}_{\leq t}), a_t \sim \delta, \pi_{>t}].$

The expectation factor on the right equals

$\sum_{j} P^{D}(Z = j \mid \bar{o}_{\leq t}, \mathrm{do}(\bar{a}_{<t}))\, \mathbb{E}^{(j)}[U^{(j)} \mid \mathrm{do}(\bar{h}_{\leq t}), a_t \sim \delta, \pi_{>t}].$

Multiplying by

$P^{D}(\bar{o}_{\leq t} \mid \mathrm{do}(\bar{a}_{<t}))$

and applying Bayes’ rule yields that Expression 6 equals

$\sum_{j} w_j\, P^{(j)}(\bar{o}_{\leq t} \mid \mathrm{do}(\bar{a}_{<t}))\, \mathbb{E}^{(j)}[U^{(j)} \mid \mathrm{do}(\bar{h}_{\leq t}), a_t \sim \delta, \pi_{>t}],$

hence the result. ∎
To see the necessity of the terms $P^{(j)}(\bar{o}_{\leq t} \mid \mathrm{do}(\bar{a}_{<t}))$ that shift the expectation weights in Theorem 3 over time, recall from Proposition 3 that, without these, some Pareto-optimal policies cannot be implemented. These terms are responsible for the “bet-settling” phenomena discussed in the introduction.
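The priority-shifting rule is easy to simulate. In the hypothetical sketch below, two principals hold different i.i.d. models of a coin that the agent observes; following the theorem, the effective weight on each principal is its initial weight times the likelihood its prior assigns to the observations so far (normalized for readability, which does not change any argmax). All names and probabilities are invented:

```python
def likelihood(p_heads, observations):
    """Probability an i.i.d. coin model assigns to an observation sequence."""
    out = 1.0
    for o in observations:
        out *= p_heads if o == "heads" else 1.0 - p_heads
    return out

def effective_weights(preds, init_weights, observations):
    """Normalized weights w_j * P^(j)(observations), per the theorem."""
    raw = [w * likelihood(p, observations)
           for p, w in zip(preds, init_weights)]
    total = sum(raw)
    return [r / total for r in raw]

# Alice predicts heads with probability 0.9, Bob with 0.5; equal initial weights.
w_alice, w_bob = effective_weights((0.9, 0.5), (0.5, 0.5), ["heads", "heads"])
assert w_alice > w_bob  # Alice's prior predicted better, so her priority rises

# With identical predictions, the likelihood factors cancel and the weights
# stay fixed -- recovering fixed-weight (Harsanyi-style) aggregation:
w1, w2 = effective_weights((0.9, 0.9), (0.5, 0.5), ["heads", "heads"])
assert abs(w1 - 0.5) < 1e-12
```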
However, when the principals have the same beliefs, they always assign the same probability to the agent’s observations, so the weights on their respective valuations do not change over time. Hence, as a special instance, we derive:
Corollary (Harsanyi’s utility aggregation formula).
Suppose that principals 1 and 2 share the same beliefs about the environment, i.e., the pair of compatible POMDPs agree on all parameters except the principals’ utility functions $U^{(1)}$ and $U^{(2)}$. Then a full-memory policy $\pi$ is Pareto-optimal if and only if there exists $w = (w_1, w_2)$ such that for $1 \leq t \leq n$, $\pi_t$ satisfies

$\pi_t(\bar{h}_{\leq t}) \in \operatorname*{argmax}_{\delta \in \Delta(A)} \mathbb{E}\big[\, w_1 U^{(1)} + w_2 U^{(2)} \mid \mathrm{do}(\bar{h}_{\leq t}), a_t \sim \delta, \pi_{>t} \,\big],$

where $\mathbb{E}$ denotes the shared expectations of both principals.
Proof.
Setting $P^{(1)} = P^{(2)}$ in Theorem 3, factoring out the common coefficient $P(\bar{o}_{\leq t} \mid \mathrm{do}(\bar{a}_{<t}))$, and applying linearity of expectation yields the result. ∎
4 Conclusion
Theorem 3 exhibits a novel form for the objective of a sequential decision-making policy that is Pareto-optimal according to principals with differing beliefs.
This form represents two departures from naïve utility aggregation: to achieve Pareto-optimality for principals with differing beliefs, an agent must (1) use each principal’s own beliefs (updated on the agent’s observations) when evaluating how well an action will serve that principal’s utility function, and (2) shift the relative priority it assigns to each principal’s expected utilities over time, by a factor proportional to how well that principal’s prior predicts the agent’s observations.
Implications for contract design
Theorem 3 has implications for modeling and structuring the process of contract design. If a contract is being created between principals with different beliefs, then to the extent that the principals target Pareto-optimality as an objective, there will be a tendency for the contract to end up implicitly settling bets between the principals. Making the bet-settling nature of Pareto-optimal contract design more explicit might help to design contracts that are more attractive to both principals, along the lines illustrated by Proposition 3. This could potentially lead to more successful negotiations, provided the principals remain willing to uphold the contract after its implicit bets have been settled.
Implications for shareable AI systems
Proposition 3 shows how the Pareto-optimal form of Theorem 3 is more attractive—from the perspective of the principals—than policies that do not account for differences in their beliefs. The relative attractiveness of shared ownership versus individual ownership of AI systems may be essential to the technological adoption of shared systems. Consider the following product substitutions that might be enabled by the development of shareable machine learning systems:

- Office assistant software jointly controlled by a team, as an improvement over personal assistant software for each member of the team.
- A team of domestic robots controlled by a family, as an improvement over individual robots each controlled by a separate family member.
- A web-based security system shared by several interested companies or nations, as an improvement over individual security systems deployed by each group.

It may represent a significant technical challenge for any of these substitutions to become viable. However, machine learning systems that are able to approximate Pareto-optimality as an objective are more likely to be sufficiently appealing to motivate the switch from individual control to sharing.
Implications for bargaining versus racing
Consider two nations—allies or adversaries—who must decide whether to cooperate in the deployment of a very powerful and autonomous AI system.
If the nations cannot reach agreement as to what policy a jointly owned AI system should follow, joint ownership may be less attractive than building separate AI systems, one for each party. This could lead to an arms race between nations competing under time pressure to develop ever more powerful militarized AI systems. Under such race conditions, everyone loses, as each nation is afforded less time to ensure the safety and value alignment of its own system.
The first author’s primary motivation for this paper is to initiate a research program with the mission of averting such scenarios. Beginning work today on AI architectures that are more amenable to joint ownership could help lead to futures wherein powerful entities are more likely to share and less likely to compete for the ownership of such systems.
Future work
Theorem 3 is not particularly mathematically sophisticated—it employs only basic facts about convexity and linear algebra—which suggests there may be more low-hanging fruit to be found in the domain of “machine-implementable social choice theory”. Future work should address methods for helping the principals to share information—perhaps in exchange for adjustments to the weights in Theorem 3—so as to reach either a state of agreement or a persistent disagreement that allows the theorem to be applied. More ambitiously, bargaining models that account for a degree of transparency between the principals should be employed, as individual humans and institutions have some capacity for detecting one another’s intentions.
As well, scenarios where the principals continue to exhibit some active control over the system after its creation should be modeled in detail. In real life, principals usually continue to exist in their agents’ environments, and accounting for this will be a separate technical challenge.
As a final motivating remark, consider that social choice theory and bargaining theory were both pioneered during the Cold War, when it was particularly compelling to understand the potential for cooperation between human institutions that might behave competitively. In the coming decades, machine intelligence will likely bring many new challenges for cooperation, as well as new means to cooperate, and new reasons to do so. As such, new technical aspects of social choice and bargaining will likely continue to emerge.
5 Appendix
Here we make available the technical details for defining POMDP mixtures, and for proving that certain Pareto-optimal expectations cannot be obtained without priority shifting.
Definition (POMDP mixtures). Suppose that $D^{(1)}$ and $D^{(2)}$ are compatible POMDPs, with parameters $(S, A, T^{(j)}, U^{(j)}, O, \Omega^{(j)}, n)$. Given weights $(w_1, w_2)$, define a new POMDP compatible with both, denoted $D = w_1 D^{(1)} + w_2 D^{(2)}$, with parameters $(S', A, T, U, O, \Omega, n)$, as follows:

- $S' = \{1, 2\} \times S$, with states written as pairs $(z, s)$;
- environmental transition probabilities given by $T(z, s_1) = w_z\, T^{(z)}(s_1)$ for any initial state $(z, s_1)$, and thereafter $T\big((z', s_{t+1}) \mid (z, s_t), a_t\big) = T^{(z)}(s_{t+1} \mid s_t, a_t)$ if $z' = z$, and $0$ otherwise. Hence, the value of $z$ will be constant over time, so a full history for the environment may be represented by a pair $(z, \bar{s})$. Let $Z$ denote the binary random variable that equals whichever constant value of $z$ obtains, so then $P^{D}(Z = z) = w_z$;
- the utility function is given by $U(z, \bar{s}) = U^{(z)}(\bar{s})$; and
- the observation probabilities are given by $\Omega\big(o_t \mid (z, s_t)\big) = \Omega^{(z)}(o_t \mid s_t)$.

In particular, the agent does not observe directly whether $Z = 1$ or $Z = 2$.
Proof of Proposition 3.
Suppose $\pi'$ is any policy satisfying Equation 3 for some fixed $(w_1, w_2)$ with $w_1 + w_2 = 1$, and consider the following cases for $w_1$:

- If $w_1 < 1/3$, then $\pi'$ must satisfy $\pi'(\text{red}) = \pi'(\text{green}) = (\text{none}, \text{all})$, since $30 w_2 > 20 > 30 w_1$. Here, $\mathbb{E}^{(1)}_{\pi'}[U^{(1)}] = 0 < 27$, so $\pi'$ is strictly worse than $\pi$ in expectation for Alice.
- If $w_1 = 1/3$, then $\pi'$ must satisfy $\pi'(\bar{h}) = q \cdot (\text{none}, \text{all}) + (1 - q) \cdot (\text{half}, \text{half})$ for some $q \in [0, 1]$ depending on $\bar{h}$. Here, $\mathbb{E}^{(1)}_{\pi'}[U^{(1)}] \leq 20 < 27$ (with equality when $q \equiv 0$), so $\pi'$ is strictly worse than $\pi$ in expectation for Alice.
- If $1/3 < w_1 < 2/3$, then $\pi'$ must satisfy $\pi'(\text{red}) = \pi'(\text{green}) = (\text{half}, \text{half})$. Here, $\mathbb{E}^{(1)}_{\pi'}[U^{(1)}] = \mathbb{E}^{(2)}_{\pi'}[U^{(2)}] = 20 < 27$, so $\pi'$ is strictly worse than $\pi$ in expectation for both Alice and Bob.

The remaining cases, $w_1 = 2/3$ and $w_1 > 2/3$, are symmetric to the first two, with Bob in place of Alice and (all, none) in place of (none, all).
Hence, no fixed linear combination of the principals’ utility functions can be maximized to simultaneously achieve an expected utility of 27 for both players. ∎
References
 Abbeel and Ng [2004] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
 Bellman [1957] Richard Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ., 1957.
 Bostrom [2014] Nick Bostrom. Superintelligence: Paths, dangers, strategies. OUP Oxford, 2014.
 Darwiche [2009] Adnan Darwiche. Modeling and reasoning with Bayesian networks (Chapter 4). Cambridge University Press, 2009.
 Gábor et al. [1998] Zoltán Gábor, Zsolt Kalmár, and Csaba Szepesvári. Multi-criteria reinforcement learning. In ICML, volume 98, pages 197–205, 1998.
 Hadfield-Menell et al. [2016] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. Cooperative inverse reinforcement learning, 2016.
 Harsanyi [1980] John C Harsanyi. Cardinal welfare, individualistic ethics, and interpersonal comparisons of utility. In Essays on Ethics, Social Behavior, and Scientific Explanation, pages 6–23. Springer, 1980.
 Myerson and Satterthwaite [1983] Roger B Myerson and Mark A Satterthwaite. Efficient mechanisms for bilateral trading. Journal of economic theory, 29(2):265–281, 1983.
 Myerson [1979] Roger B Myerson. Incentive compatibility and the bargaining problem. Econometrica: Journal of the Econometric Society, pages 61–73, 1979.
 Myerson [2013] Roger B Myerson. Game theory. Harvard university press, 2013.
 Nash [1950] John F Nash. The bargaining problem. Econometrica: Journal of the Econometric Society, pages 155–162, 1950.
 Ng et al. [2000] Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.
 Pearl [2009] Judea Pearl. Causality. Cambridge university press, 2009.
 Roijers et al. [2015] Diederik M Roijers, Shimon Whiteson, and Frans A Oliehoek. Point-based planning for multi-objective POMDPs. In IJCAI 2015: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, pages 1666–1672, 2015.
 Russell et al. [2003] Stuart Russell, Peter Norvig, John F Canny, Jitendra M Malik, and Douglas D Edwards. Artificial intelligence: a modern approach (Chapter 17.1), volume 2. Prentice hall Upper Saddle River, 2003.
 Russell [1998] Stuart Russell. Learning agents for uncertain environments. In Proceedings of the eleventh annual conference on Computational learning theory, pages 101–103. ACM, 1998.
 Shoham and Leyton-Brown [2008] Yoav Shoham and Kevin Leyton-Brown. Multiagent systems: Algorithmic, game-theoretic, and logical foundations. Cambridge University Press, 2008.
 Soh and Demiris [2011] Harold Soh and Yiannis Demiris. Evolving policies for multi-reward partially observable Markov decision processes (MR-POMDPs). In Proceedings of the 13th annual conference on Genetic and evolutionary computation, pages 713–720. ACM, 2011.
 Tzeng and Huang [2011] Gwo-Hshiung Tzeng and Jih-Jeng Huang. Multiple attribute decision making: methods and applications. CRC press, 2011.
 Wang [2014] Weijia Wang. Multi-objective sequential decision making. PhD thesis, Université Paris Sud - Paris XI, 2014.
 Wray and Zilberstein [2015] Kyle Hollins Wray and Shlomo Zilberstein. Multi-objective POMDPs with lexicographic reward preferences. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), pages 1719–1725, 2015.
 Zhang and Shah [2014] Chongjie Zhang and Julie A Shah. Fairness in multi-agent sequential decision-making. In Advances in Neural Information Processing Systems, pages 2636–2644, 2014.