Risk-Sensitive Cooperative Games for Human-Machine Systems
Abstract
Autonomous systems can substantially enhance a human’s efficiency and effectiveness in complex environments. Machines, however, are often unable to observe the preferences of the humans that they serve. Although the human’s and machine’s objectives are aligned, asymmetric information, along with heterogeneous sensitivities to risk by the human and machine, makes their joint optimization process a game with strategic interactions. We propose a framework based on risk-sensitive dynamic games; the human seeks to optimize her risk-sensitive criterion according to her true preferences, while the machine seeks to adaptively learn the human’s preferences and at the same time provide a good service to the human. We develop a class of performance measures for the proposed framework based on the concept of regret. We then evaluate their dependence on the risk-sensitivity and the degree of uncertainty. We present applications of our framework to self-driving taxis and robo-financial advising.
1 Introduction
Autonomous systems can substantially enhance human effectiveness in complex environments by handling routine or cognitively challenging operations. It is crucial, however, that both the human and the machine can communicate their preferences to execute their objectives accordingly. A machine can only provide a useful or reliable service if its valuation of the costs and risks associated with each action is aligned with that of the human it serves. The resulting value alignment problem is critical to the success or failure of an operation.
In this paper, we propose a framework for the analysis of human-machine interactions. Although the objectives of the human and the machine are aligned, there are informational asymmetries. The machine is unable to observe the human’s preferences, and must infer them via a dynamic learning process by observing the effect of joint human-machine actions on the system’s state. Our model is designed to capture a wide variety of situations in which a human wants to delegate a task to a machine with the objective of enhancing the efficiency and effectiveness of the task’s execution. The machine is designed to serve a broad audience of humans, rather than tailored to a specific category of humans. It is thus important for the machine to personalize itself to the human and self-calibrate as the human reveals information regarding her risk preferences and objectives.
Our framework is designed to capture decision making processes arising in a large class of autonomous systems, including those for logistic operations, digital assistance, financial advising, defense, robotics, and self-driving taxi systems. As illustrative examples, consider the following two practical applications of our framework. Assume that a network of self-driving taxis tracks each time a user hails a ride. As a part of the service, the human is able to select one of the several routes or destinations that match a search query; for example, to local restaurants or retail stores. The network maintaining these cars would then keep track of each selection, marking when the human chose longer or shorter routes to higher or lower rated destinations. Our framework can generate a dynamic assessment of each user’s preferences towards various destinations and of the user’s sensitivity towards the risks involved in travel, such as the uncertainty of the arrival time at each destination. This assessment enables the taxicab to provide better service to the human, by presenting options that are customized for the user on the next ride.
Another example applies to the growing industry of robo-finance. Consider an investment firm that develops a portfolio-bot to manage a client’s investments autonomously. In each period, the bot can reallocate the client’s investments into various assets, based on information gathered by the firm regarding the expected return and risk profile of each asset. Any allocation decision can be overridden by the client herself, through manual purchases and sales. In order to provide good service, the bot needs to understand the client’s preferences for risk and return. Through our framework, the bot can estimate the preferences of the client by observing her manual investment decisions. Additionally, the firm faces risk, as aggressive allocations by the bot may reflect badly on the firm if the estimates of the client’s preferences are incorrect. The tolerance that the firm has towards the uncertainty over the human’s preferences presents a separate form of risk-sensitivity that we define explicitly in our framework.
Context-driven and Human-driven risk
The distinguishing feature of our framework is the simultaneous handling of human-driven and context-driven risks. The uncertainty over the human’s characteristics, such as her risk preferences, goals, and objectives, presents a human-driven risk to the machine. Depending on the machine’s attitude toward risk, it could, for example, operate to provide a good service to the typical human. Alternatively, it could target humans whose characteristics belong to a specific quartile. We refer to the set of characteristics which uniquely identify the human’s behavior as her type. On the other hand, the unpredictable conditions or hazardous environments in which the task needs to be executed present context-driven risks to the human. The human executes actions on the basis of her risk preferences, and in doing so she reveals information about her type to the machine. Both human and machine share the cooperative goal of minimizing the human’s costs. However, informational asymmetries and heterogeneous sensitivities to risk lead to strategic behavior of the agents, and make the joint minimization of the human’s costs a strategic game. In the absence of informational asymmetries, the objectives of human and machine are perfectly aligned, so that the game becomes cooperative.
Relation to cooperative inverse reinforcement learning
Our framework recovers the cooperative inverse reinforcement learning (CIRL) setup recently explored by Hadfield-Menell et al. (2017), in the special case that there is indifference with respect to both human-driven and context-driven risk. Under these circumstances, the machine aims at providing a good service to the average human’s type, and the human is only concerned about minimizing her expected costs, being neutral with respect to the risk present in the context in which she operates. The CIRL setup, however, is no longer applicable in the cases where the human is sensitive to context-driven risk or the machine is sensitive to human-driven risk.
Risk-sensitive equilibrium and performance measures
For models featuring both human-driven and context-driven risk, we introduce a new equilibrium concept, risk-sensitive Bayesian equilibria, a departure from the classical Nash equilibrium concept. Furthermore, we introduce decentralized optimization techniques to reduce the problem to a related single-agent, risk-sensitive optimization problem. We remark that risk-sensitive optimization in the context of Markov decision processes has been the subject of considerable investigation recently (see Section 2 for additional details). In addition, our study paves the way for a systematic study of human-driven risk and its implications.
We develop a numerical study to demonstrate the power of our framework. We consider the stochastic shortest path problem with context-driven risk. This formulation captures a wide range of scenarios in which the objective of the human is to reach a goal state in the least costly way, as measured by a context-risk criterion, using actions with probabilistic outcomes. We analyze measures of performance, including the regret of our solution concept against a complete-information benchmark, and trace the impact of varying degrees of uncertainty over the human’s type.
The paper proceeds as follows. Section 2 puts our paper in perspective with existing literature. Section 3 provides a brief review of the theory of risk measures. Section 4 develops the framework and presents solution concepts. Section 5 provides a numerical study for the self-driving taxicab application using an abstraction based on the stochastic shortest path problem. Section 6 concludes the paper and discusses how the proposed framework opens the door to a new class of multi-agent decision making problems.
2 Contributions and Related Work
Our dynamic learning framework draws upon tools from machine learning, economics and finance to model human-machine interactions. The proposed framework describes the cooperative decision making problem of a human and a robot, who are both sensitive towards risk. In this section, we review related work, and put it in perspective with our contributions to the development of the framework and solution concepts.
In a recent work, Hadfield-Menell et al. (2017) define a framework for human-machine interactions, based on the theory of inverse reinforcement learning (IRL). Both the machine and the human are risk-neutral agents and, as such, their framework does not capture human-driven or context-driven risk. They reduce the two-agent model to a joint optimization problem, and compare their solution concepts to existing IRL methods. In our study, we introduce a notion of risk-sensitive equilibrium to deal with the aversion to risk of both agents in the model.
One of the defining features of our framework is that both agents share the common goal of optimizing the objective of the human. Nayyar et al. (2013) introduce a model of decentralized stochastic control, where a team of agents work together to minimize a common objective. They show that this problem can be reduced to a POMDP by constructing a coordinator that determines strategies for the agents, based on the common information available in each period. Similar approaches have been recently employed by Vasal and Anastasopoulos (2016a), Vasal and Anastasopoulos (2016b), and Sinha and Anastasopoulos (2016) to solve incomplete information games between agents with conflicting objectives. The coordinator technique is appealing because it reduces a game of multiple agents to a single-agent optimization problem. In our framework, we show that the solution of this single-agent problem corresponds to an equilibrium between the human and machine.
Our paper is related to existing literature on risk-sensitive Markov decision processes (MDP). Recent contributions by Bäuerle and Rieder (2014) and Bäuerle and Ott (2011) solve the utility maximization problem and the conditional value at risk criterion for an MDP. Haskell and Jain (2015) generalize these studies to a wider class of risk measures using a convex analytic approach. All these studies deal with the optimization of a single agent. In contrast, our framework features strategic interactions between agents, and employs risk-sensitive optimization to solve for a new class of equilibria corresponding to the optimal pair of human-machine actions.
3 Background on Risk Aversion
In the proposed framework, humans and machines are modeled as risk-averse agents. This allows us to simultaneously capture context-driven and human-driven risk. We next provide a discussion of risk aversion, and its connection to utility functions, risk measures, and uncertainty of the outcome. In particular, utility functions are contained in the class of risk measures, and therefore we will consider risk measures in the following analysis. Informally speaking, a risk-averse agent is an agent who assigns higher weight to bad states of the world and lower weight to good states of the world, relative to a risk-neutral agent.
Consider a probability space $(\Omega, \mathcal{F}, P)$, and let $L^\infty(\Omega, \mathcal{F}, P)$ be the space of essentially bounded random variables. (A random variable $X$ is essentially bounded if there exists $M \ge 0$ such that $|X| \le M$ almost surely.) A risk measure is a mapping $\rho: L^\infty(\Omega, \mathcal{F}, P) \to \mathbb{R}$ from an uncertain outcome onto the set of real numbers. Risk measures can thus account for the entire probability distribution of an uncertain outcome, whereas expected utility functions can only depend on the realization of that outcome.
A widely used class of risk measures is the class of convex risk measures, which satisfy
Monotonicity: $\rho(X) \le \rho(Y)$ if $X \le Y$ almost surely.
Translation invariance: $\rho(X + c) = \rho(X) + c$, if $c \in \mathbb{R}$.
Convexity: $\rho(\lambda X + (1 - \lambda) Y) \le \lambda \rho(X) + (1 - \lambda) \rho(Y)$ for all $\lambda \in [0, 1]$.
The monotonicity axiom states that higher risk is associated with higher loss. The translation invariance axiom states that a sure loss of $c$ simply increases the risk by $c$. The convexity axiom states that merging positions does not create additional risk. This property captures the benefits of diversification: a joint position containing the individual positions $X$ and $Y$ results in a lower risk overall than the sum of the risk in the position $X$ plus the risk in the position $Y$ separately.
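The three axioms can be checked numerically on a concrete example. The sketch below uses the entropic risk measure $\rho(X) = \frac{1}{\gamma} \log E[e^{\gamma X}]$, a standard convex risk measure for losses that is not taken from this paper; the sample sizes, the parameter `gamma`, and all variable names are illustrative assumptions.

```python
import math
import random

def entropic_risk(losses, gamma=1.0):
    """Entropic risk (1/gamma) * log E[exp(gamma * X)] of equally likely losses."""
    m = max(losses)  # shift by the maximum for numerical stability
    mean_exp = sum(math.exp(gamma * (x - m)) for x in losses) / len(losses)
    return m + math.log(mean_exp) / gamma

random.seed(0)
# X and Y are losses on the same scenarios, with Y >= X in every scenario.
X = [random.gauss(1.0, 1.0) for _ in range(1000)]
Y = [x + abs(random.gauss(0.0, 0.5)) for x in X]
lam, c = 0.3, 2.5
mix = [lam * x + (1 - lam) * y for x, y in zip(X, Y)]

# Monotonicity: the larger loss carries at least as much risk.
monotone = entropic_risk(X) <= entropic_risk(Y)
# Translation invariance: a sure loss c raises the risk by exactly c.
translation_gap = abs(entropic_risk([x + c for x in X]) - (entropic_risk(X) + c))
# Convexity: the merged position is no riskier than the weighted risks.
convex = entropic_risk(mix) <= lam * entropic_risk(X) + (1 - lam) * entropic_risk(Y) + 1e-12
```

A check on samples is only a sanity test on one scenario set, not a proof that the axioms hold in general.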
The theory of convex risk measures can be related to utilities. Concretely, let $u: \mathbb{R} \to \mathbb{R}$ be a convex nondecreasing continuous disutility function, i.e. the minimum represents the point of lowest disutility. If $E[u(X)]$ is well defined for all $X$, then the risk measure $\rho(X) = E[u(X)]$ is a convex risk measure (see also Shapiro et al. (2009), Chapter 6.3). An important class of risk measures considered in this paper are those of the form
$$\rho_\theta(X) = E[u(X)] + \theta\, E\big[g(X - E[X])\big],$$
where $u$ is a strictly increasing and convex disutility function, and $\theta$ is a coefficient quantifying risk-aversion. The function $g$ measures the deviation of the losses from the expectation. If we choose $u(x) = x$ and $g(x) = x^2$, we recover the classical mean-variance problem. If we set $g$ to be a convex function with $g(x) = 0$ for $x \le 0$, then the second component of $\rho_\theta$ captures downside risk of losses.
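As a small worked example of the mean-plus-deviation family described above, the sketch below evaluates $E[u(X)] + \theta\, E[g(X - E[X])]$ on equally likely loss scenarios. The functional form is our reading of the text, and the function names, the sample losses, and the semideviation choice of `g` are illustrative assumptions.

```python
def mean_deviation_risk(losses, theta, u=lambda x: x, g=lambda d: d * d):
    """E[u(X)] + theta * E[g(X - E[X])] for equally likely loss scenarios."""
    n = len(losses)
    mean = sum(losses) / n
    expected_u = sum(u(x) for x in losses) / n
    deviation = sum(g(x - mean) for x in losses) / n
    return expected_u + theta * deviation

def upper_semideviation(d):
    """Penalize only losses above the mean (downside risk of losses)."""
    return max(d, 0.0) ** 2

losses = [1.0, 2.0, 3.0]  # hypothetical loss scenarios
# Defaults u(x) = x and g(d) = d^2 give the mean-variance criterion.
mean_variance = mean_deviation_risk(losses, theta=0.5)
# A g that vanishes for non-positive deviations captures downside risk only.
downside = mean_deviation_risk(losses, theta=0.5, g=upper_semideviation)
```

With these inputs the mean is 2 and the variance is 2/3, so the mean-variance value is 2 + 0.5 · 2/3, while the downside version only counts the scenario above the mean.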
4 The Framework
Definition 1.
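A human-machine interaction game is a $T$-period, dynamic game with asymmetric information played between two risk-sensitive agents: a human, H, and a machine, M. The game is described by a tuple $\langle \mathcal{S}, \mathcal{A}^H, \mathcal{A}^M, \Theta, \rho^H_\theta, \rho^M, P, c, \pi_0 \rangle$, whose elements are defined as:
a set of system states: $\mathcal{S}$;
a set of actions for H: $\mathcal{A}^H$;
a set of actions for M: $\mathcal{A}^M$;
a set of possible risk parameters, only observed by H: $\Theta$;
H’s convex risk measure, parameterized by $\theta \in \Theta$: $\rho^H_\theta$;
M’s convex risk measure over a probability distribution on $\Theta$: $\rho^M$;
the probability transition function on the future state, given the current state and joint action: $P: \mathcal{S} \times \mathcal{A}^H \times \mathcal{A}^M \to \Delta(\mathcal{S})$;
an instantaneous cost function that maps the system state and joint actions to a vector of real numbers: $c: \mathcal{S} \times \mathcal{A}^H \times \mathcal{A}^M \to \mathbb{R}^n$;
a common prior distribution over the risk parameters: $\pi_0 \in \Delta(\Theta)$.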
We allow for $n$ different cost drivers, such as time, labor, or consumed materials, depending on the application considered. In a self-driving taxi application, the system costs may include the time required to drive the passenger to her destination, and the toll amount paid by the passenger to the taxi driver. For example, a high-toll bridge would allow the passenger to arrive faster at her destination, but it would require her to pay additional fees. Above, we have used $\Delta(X)$ to denote the set of probability distributions on a set $X$. After each period $t$, the human and the machine incur common costs, $c(s_t, a^H_t, a^M_t)$, depending on the current state of the system and their joint action. Their incentives are partially aligned, as both the human and the machine prefer to keep the total system costs low over the $T$-period horizon. The human’s objective is to minimize the costs using her risk measure as the optimization criterion $\rho^H_\theta$, where $\theta$ is the true type of the human. For example, the mean-variance risk measure maps an $n$-dimensional random outcome for the total costs to a scalar quantity through the parameter vector $\theta$. In this case, the vector $\theta$ not only describes the human’s risk sensitivity towards each of the costs, but it also describes the relative weight that the human assigns to each cost. The machine does not know the value of $\theta$ at the initial stage of the game, but begins with a prior distribution $\pi_0$. The machine’s objective is to minimize the risk measure criterion $\rho^M(\rho^H_\theta(\cdot))$.
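Denote the set of public histories as
$$\mathcal{H}_t = \big\{ (s_1, a^H_1, a^M_1, \dots, s_{t-1}, a^H_{t-1}, a^M_{t-1}, s_t) \big\},$$
where $(s_k, a^H_k, a^M_k) \in \mathcal{S} \times \mathcal{A}^H \times \mathcal{A}^M$ for $k = 1, \dots, t-1$ and $s_t \in \mathcal{S}$. A public history contains information that is observed by both the human and the machine, which includes the realization of the system’s states and the actions executed by both agents. The machine maintains the posterior distribution over the human’s type, $\pi_t \in \Delta(\Theta)$, which we refer to as the machine’s belief in period $t$.
A Markov strategy for the human is a sequence of measurable maps $\sigma^H = (\sigma^H_1, \dots, \sigma^H_T)$ so that
$$\sigma^H_t : \mathcal{S} \times \Delta(\Theta) \times \Theta \to \mathcal{A}^H.$$
A Markov strategy for the machine is a sequence of measurable maps $\sigma^M = (\sigma^M_1, \dots, \sigma^M_T)$ so that
$$\sigma^M_t : \mathcal{S} \times \Delta(\Theta) \to \mathcal{A}^M.$$
Notice that the human’s Markov strategy depends on the machine’s current beliefs because the action of the human is influenced by the action of the machine, which in turn depends on its belief over the human’s type.
The total (cumulative) cost is given by the random variable
$$C^{\sigma^H, \sigma^M} = \sum_{t=1}^{T} c(s_t, a^H_t, a^M_t).$$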
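Given the conflicting objectives of both agents, the framework as presently defined is a two-player strategic game. As such, we define the corresponding risk-sensitive Bayesian equilibrium as a pair of strategies $(\sigma^{H*}, \sigma^{M*})$ and a belief profile $\pi^* = (\pi^*_1, \dots, \pi^*_T)$ such that
$$\rho^H_\theta\big(C^{\sigma^{H*}, \sigma^{M*}}\big) \le \rho^H_\theta\big(C^{\sigma^{H}, \sigma^{M*}}\big), \qquad \rho^M\big(\rho^H_\theta(C^{\sigma^{H*}, \sigma^{M*}})\big) \le \rho^M\big(\rho^H_\theta(C^{\sigma^{H*}, \sigma^{M}})\big) \tag{1}$$
for all strategies $\sigma^H$, $\sigma^M$. Furthermore, the machine’s belief profile must be consistent with the strategies in that Bayes’ rule is used to update the beliefs whenever possible. Specifically, the machine’s belief on the true value of the human’s risk parameter satisfies the standard nonlinear filter equation (Fudenberg and Tirole (1991)),
$$\pi^*_{t+1}(\theta) = \frac{\pi^*_t(\theta)\, \mathbb{1}\{\sigma^{H*}_t(s_t, \pi^*_t, \theta) = a^H_t\}}{\sum_{\theta' \in \Theta} \pi^*_t(\theta')\, \mathbb{1}\{\sigma^{H*}_t(s_t, \pi^*_t, \theta') = a^H_t\}} \tag{2}$$
provided there exists a value of $\theta$ such that $\pi^*_t(\theta) > 0$ and $\sigma^{H*}_t(s_t, \pi^*_t, \theta) = a^H_t$. In period 1, the belief profile is equal to the prior, $\pi^*_1 = \pi_0$.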
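The belief update can be sketched in a few lines for a finite type set. This is a minimal sketch under stated assumptions: the human's strategy is deterministic, the state transition does not depend on the type (so it cancels in Bayes' rule), and the belief is left unchanged when the observed action has zero probability under every type. The function name `update_belief`, the toy strategy, and the numbers are hypothetical.

```python
def update_belief(belief, human_strategy, state, human_action):
    """One step of Bayes' rule on the human's type.

    belief: dict mapping type -> probability.
    human_strategy: callable (state, belief, theta) -> action (deterministic).
    Returns the posterior, or a copy of the belief if no type with positive
    weight would have played the observed action (Bayes' rule is silent).
    """
    weights = {
        theta: p if human_strategy(state, belief, theta) == human_action else 0.0
        for theta, p in belief.items()
    }
    total = sum(weights.values())
    if total == 0.0:
        return dict(belief)
    return {theta: w / total for theta, w in weights.items()}

def strategy(state, belief, theta):
    """Toy strategy: every type stays silent in state 0; types split in state 1."""
    if state == 0:
        return "none"
    return "left" if theta < 1.0 else "right"

prior = {0.5: 0.4, 2.0: 0.6}
pooled = update_belief(prior, strategy, 0, "none")      # uninformative action
revealed = update_belief(prior, strategy, 1, "right")   # separating action
off_path = update_belief(prior, strategy, 1, "none")    # zero-probability action
```

An action that every type would take leaves the belief at the prior, while an action taken by only one type collapses the belief onto that type.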
The first of the two inequalities in equation (1) indicates that the human has no incentive to unilaterally deviate from her action to any other action because her risk-adjusted total cost would increase. Similarly, the second inequality stipulates that the machine’s action yields the smallest risk-adjusted total cost, according to both the human’s type and the machine’s beliefs over the human’s type.
The canonical solution concept for dynamic games of incomplete information is the Bayesian equilibrium (BE). However, standard equilibrium concepts rely on maximizing the expectation of utility functions assigned to each player. A Bayesian equilibrium in our setup would require that both agents minimize the expected disutility of total system costs, rather than the general risk measures we present.
Context-driven risk in our model is captured by applying the risk measure $\rho^H_\theta$ to the total system cost. This allows us to capture a wide variety of cost criteria that depend on the statistical properties of the cumulative costs, including value at risk, conditional value at risk, and worst case measures. A special case of context-driven risk is when the human minimizes the expected disutility from costs, where disutility is quantified by a convex utility function. Human-driven risk is quantified by the risk measure $\rho^M$ over the distribution of the human’s type. For example, if $\rho^M$ is the expectation operator, then the machine aims for the best service to the average human type. On the other hand, if $\rho^M$ represents the value at risk for some level of service $\alpha$, then the machine aims to provide good service for a fraction $\alpha$ of the human types. Lastly, if the human’s type is revealed to the machine before the first period, then there is no human-driven risk. In this case, $\rho^M(\rho^H_\theta(C)) = \rho^H_\theta(C)$, so that the two inequalities in Eq. (1) coincide, and the game becomes cooperative.
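The two choices of the machine's risk measure mentioned above can be compared on a finite type set. The sketch assumes that each type's risk-adjusted cost has already been computed, and uses the usual quantile definition of value at risk (the smallest value whose cumulative probability reaches $\alpha$); the belief, the per-type values, and the function names are hypothetical.

```python
def expectation(values, belief):
    """Risk-neutral machine: average the per-type risk over the belief."""
    return sum(belief[theta] * values[theta] for theta in belief)

def value_at_risk(values, belief, alpha):
    """Smallest v such that the belief assigns probability >= alpha to values <= v."""
    cumulative = 0.0
    for theta in sorted(belief, key=lambda t: values[t]):
        cumulative += belief[theta]
        if cumulative >= alpha - 1e-12:
            return values[theta]
    raise ValueError("belief must sum to at least alpha")

belief = {0.0: 0.5, 1.0: 0.3, 4.0: 0.2}             # machine's belief over types
risk_by_type = {0.0: 10.0, 1.0: 12.0, 4.0: 20.0}    # hypothetical per-type risk

average = expectation(risk_by_type, belief)
var_80 = value_at_risk(risk_by_type, belief, 0.8)
var_90 = value_at_risk(risk_by_type, belief, 0.9)
```

Raising the service level from 0.8 to 0.9 forces the machine to account for the most risk-averse type, which the average alone would underweight.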
The solution methodology that we propose to address the human-machine framework is to transform the strategic game to a single-agent problem by introducing a coordinator agent. The coordinator assigns a policy $(\sigma^M, \gamma)$ such that $\sigma^M$ is a strategy for the machine and $\gamma$ is a decision function, which prescribes the human’s strategy $\gamma(\theta)$ for each possible realization of $\theta$. Hence, the coordinator is unaware of the human’s risk parameter, but instead chooses a strategy for every possible type of human. The coordinator’s objective is to minimize the machine’s risk measure using these controls,
$$\min_{\sigma^M, \gamma} \; \rho^M\big(\rho^H_\theta(C^{\gamma(\theta), \sigma^M})\big).$$
The resulting problem is a partially observable, risk-sensitive Markov decision process (risk-POMDP). The following theorem connects the solution to the coordinator problem with the equilibrium concept for the human-machine interaction game.
Theorem 1.
A solution to the coordinator problem is a risk-sensitive Bayesian equilibrium of the two-agent human-machine interaction game.
Remark 1.
The risk-POMDP formulation can be reduced to a fully observable, risk-sensitive MDP, where the state space is enlarged to include the belief profile. The resulting single-agent problem can then be solved using existing risk optimization techniques described in Section 2.
5 Applications and Numerical Study
Consider the following risk-sensitive version of the stochastic path finding problem. A human hires a self-driving taxicab to travel to one of several possible locations, but each path carries an uncertain cost for the human. This cost may take the form of a fuel consumption cost required to travel, tolls or fees required for access to certain paths, or a cost associated with the time the car takes to reach one of the destinations. In each period, the car chooses a direction to move along a grid towards a destination. Making this determination requires the car to estimate the human’s risk preferences associated with the cost of travel. For instance, if the human’s tolerance for risk were known to be high, then the car would choose a path with a lower expected cost, irrespective of its risk level. Conversely, if the car knew that the human was sensitive to risk, it would avoid paths with widely varying costs, even if those paths had lower average costs.
As an illustrative example, assume that the human’s risk aversion coefficient is $\theta$ and that the optimal path minimizes the weighted sum of its expected cost and variance. Note that, for this example, the risk-aversion coefficient is the human’s type. The criterion is
$$\min \; E\Big[\sum_{t} c_t\Big] + \theta\, \mathrm{Var}\Big(\sum_{t} c_t\Big), \tag{3}$$
where each $c_t$ represents the random cost of traveling from gridpoint $s_t$ to the gridpoint $s_{t+1}$ determined by the joint actions $a^H_t$ and $a^M_t$. Once the car stops moving, the human receives a reward (or negative cost) associated with its final position $s_T$. Assuming that $\theta$ is known, the optimal solution solves for the stochastic shortest path adjusted for risk aversion. The figure below depicts optimal paths for two values of the risk aversion parameter, with the larger value representing a stronger aversion to the variance of total travel costs. Each edge is labeled with the mean and variance of its cost $c_t$.
Figure 1. Left: the shortest path for the first value of the human’s risk aversion coefficient (red arrows denote the shortest path). Right: the shortest path for the second value of the human’s risk aversion coefficient (yellow arrows denote the shortest path).
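When the type is known, the risk-adjusted shortest path can be computed with a standard shortest-path routine under one additional assumption that the text does not state: if edge costs are independent, the variance of the total cost is the sum of the edge variances, so the mean-variance criterion is additive over edges and Dijkstra's algorithm applies to the edge weight mean + θ · variance. The graph, its edge statistics, and the function name below are hypothetical.

```python
import heapq

def risk_adjusted_shortest_path(edges, source, goal, theta):
    """Dijkstra on edge weights mean + theta * variance (independent edge costs).

    edges: dict node -> list of (next_node, mean, variance).
    Returns (risk-adjusted cost, list of nodes on the path).
    """
    best = {source: 0.0}
    parent = {}
    frontier = [(0.0, source)]
    while frontier:
        dist, node = heapq.heappop(frontier)
        if node == goal:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, mean, var in edges.get(node, []):
            cand = dist + mean + theta * var
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                parent[nxt] = node
                heapq.heappush(frontier, (cand, nxt))
    path = [goal]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return best[goal], path[::-1]

# Hypothetical grid fragment: the route via B is cheap but volatile.
edges = {
    "A": [("B", 1.0, 4.0), ("C", 2.0, 0.5)],
    "B": [("D", 1.0, 4.0)],
    "C": [("D", 2.0, 0.5)],
}
neutral_cost, neutral_path = risk_adjusted_shortest_path(edges, "A", "D", theta=0.0)
averse_cost, averse_path = risk_adjusted_shortest_path(edges, "A", "D", theta=1.0)
```

A risk-neutral passenger is routed through the volatile edges, while a risk-averse one is routed through the steadier edges despite their higher mean cost. With correlated edge costs this additive shortcut no longer holds.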
Each of these paths would be followed by the car, provided it knew the human’s risk aversion before proceeding along a chosen direction. Instead the machine is calibrated, perhaps by its manufacturer, with a prior distribution $\pi_0$ over the two possible risk coefficients. Left unattended by the human, the car would proceed along the path minimizing the risk-adjusted cost corresponding to the expected risk-aversion coefficient, $E_{\pi_0}[\theta]$. Accordingly, the car chooses an action $a^M_t$ in each period. The human observes the car as it moves and decides in each period whether or not to send a signal to override the car’s action and replace it with one that better reflects the human’s true risk aversion parameter. The human’s action in period $t$, $a^H_t$, incurs a fixed transmission cost each time a signal is sent to the car.
Whenever the human overrides the car’s action, the car receives more information about the human’s underlying risk parameter. Consider once again the example illustrated in the figure. Before the period in which the two optimal paths diverge, the risk-adjusted shortest path is the same for both values of $\theta$. The human would therefore not send a signal to the car in these periods, since the car would execute the optimal path without additional and costly guidance. In the period in which the paths diverge, however, the human would direct the car towards the path that is optimal for her true value of $\theta$. After the human sends this action to the car, the car would learn the human’s true risk parameter, as this action is fully revealing. The human would then refrain from sending any more transmissions, as the car would act optimally according to the human’s revealed preferences until reaching the destination.
Figure 2. Left and right: the paths followed under each of the two possible values of the human’s actual risk aversion coefficient. The human’s action is fully revealing and allows the machine to perfectly learn the human’s risk aversion coefficient.
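The override logic in this example can be sketched as a one-shot decision, which is a simplification of the dynamic game: the car defaults to the path that is optimal for the prior-mean coefficient, and the human signals only when switching saves more than the transmission cost. Path statistics, the prior, the signal cost, and the function names are hypothetical.

```python
def risk(path, theta):
    """Mean-variance criterion for a path summarized by (mean, variance)."""
    mean, var = path
    return mean + theta * var

def car_default(paths, prior):
    """Path the unattended car follows: optimal for the prior-mean coefficient."""
    theta_bar = sum(theta * p for theta, p in prior.items())
    return min(paths, key=lambda name: risk(paths[name], theta_bar))

def human_override(paths, prior, true_theta, signal_cost):
    """Return the path the human signals, or None if she stays silent."""
    default = car_default(paths, prior)
    preferred = min(paths, key=lambda name: risk(paths[name], true_theta))
    gain = risk(paths[default], true_theta) - risk(paths[preferred], true_theta)
    return preferred if gain > signal_cost else None

paths = {"fast": (2.0, 8.0), "steady": (4.0, 1.0)}   # hypothetical (mean, variance)
prior = {0.1: 0.5, 1.0: 0.5}                         # two possible coefficients

default = car_default(paths, prior)
tolerant = human_override(paths, prior, true_theta=0.1, signal_cost=0.5)
averse = human_override(paths, prior, true_theta=1.0, signal_cost=0.5)
costly = human_override(paths, prior, true_theta=0.1, signal_cost=2.0)
```

In this toy instance only the risk-tolerant type ever signals, so observing a signal (or its absence at the point where the paths diverge) tells the car which type it is serving, provided the transmission cost is low enough for the signal to be worthwhile.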
5.1 Measures of Performance
To measure the effectiveness of the coordinator risk-POMDP, we assess the following measures of performance. Using the self-driving taxicab framework, consider the case in which the car is undecided between three possible values of the human’s risk aversion coefficient. The mean and variance of the edges can be found in Appendix B. We calculate the regret of various strategies, including the risk-POMDP solution, against the best possible outcome. This benchmark is defined to be the solution in a setup where the human is allowed to directly communicate her risk parameter to the machine in the first period. The “Machine-Neutral” strategy represents the policy of a car which ignores the risk of each path; that is, the car assumes incorrectly that $\theta = 0$. The “Machine-Average” strategy uses the correct prior over the human’s risk parameter, but does not interact with the human while it drives. As such, the human never gets the opportunity to reveal more information about her risk parameter, and the car provides the best service to the average human. This policy is similar to the strategies considered in Hadfield-Menell et al. (2017) for apprenticeship learning models. In these models, the machine observes the human for a fixed number of periods and generates a probability distribution over the human’s type. The machine then takes actions to optimize the objective of the average human according to the prior distribution on the human’s type. The machine-average strategy mirrors this learning model in the sense that the machine’s prior could be generated from earlier interactions with the human. Lastly, the “Human-Machine” strategy solves the risk-POMDP in a setup where the human is allowed to override the machine’s actions and teach the car her risk preferences. Mathematical formulations for each policy are available in Appendix B.
5.2 Numerical Results
The graphs in Figure 3 depict the regret from the different strategies. The optimal joint human-machine policy always achieves the smallest regret. Moreover, the regret from the strategy in which the machine ignores the context-driven risk is the highest. Consistent with intuition, the left graph indicates that if the prior probability on the lowest value of the human’s risk aversion increases, the regret from not accounting for the risk parameter in the machine’s strategy gets lower. Consequently, the machine-neutral strategy has a small regret because the true sensitivity to context-driven risk is very low. In contrast, if the prior probability on the highest value of the human’s risk aversion increases (right graph), the regret of the strategy in which the machine does not account for risk becomes higher. In this case, ignoring risk is costly because the human is very sensitive to context-driven risk.
The machine-average strategy outperforms the machine-neutral strategy in each instance since it incorporates the typical (average) human’s preference towards context-driven risk. As the prior probability on the human’s risk aversion parameter approaches a point mass distribution, i.e. the prior probability of one fixed parameter value increases toward one, the machine under the machine-average strategy provides the same service as in the best case policy, in which it knows the human’s risk aversion. When the prior probability for one of the parameters is low, the machine-average strategy effectively takes the average of the remaining two parameter values to determine its policy. As a result, the regret at low values of the prior probability shown in the left graph is lower than the regret for intermediate values of the prior. Low values of this probability imply that the true parameter is one of the two remaining values, and so the optimal policy corresponding to their average performs closer to the best case policy than, for example, the optimal policy at an intermediate value of the prior.
6 Conclusion and Future Work
In this paper, we presented a framework for human-machine decision making, accounting for both human-driven and context-driven risk. Due to the different risk sensitivities of the human and the machine, respectively, to the context in which the task is being executed and to the category of humans served, the optimal decision making problem may be formulated as a game with strategic interactions. We have introduced the concept of risk-sensitive equilibria to deal with the corresponding game, and introduced risk-sensitive optimization techniques to solve a related coordinator problem. The optimal solution to the risk-POMDP is then a risk-sensitive Bayesian equilibrium for the human-machine framework. Using our framework, we have developed measures of performance quantifying the regret of various human-machine cooperative strategies against the idealized situation in which both human and machine know the human’s characteristics. Our numerical analysis indicates that human-machine strategies may exhibit poor performance if they ignore context-driven risk when this is actually present, or if they target the profile of a human having the typical (average) characteristics.
Future directions for this research include the development of new solution methods to integrate risk optimization techniques with concepts from game theory. A key refinement to equilibrium in dynamic games is the notion of subgame perfection. This enforces incentive compatibility for both agents in each subgame initiated at the start of each period. However, many commonly used risk measures are not time-separable, i.e. the risk over the entire horizon cannot be decomposed into a set of risks, each allocated to a different time period. Consider, for example, a self-driving taxi application, in which heavy traffic conditions at a specific hour of the day are likely to be followed by severe traffic conditions at nearby time periods. In this situation, risks are correlated over time, and thus an error would be introduced if they were to be treated as time-separable. Without time separability, the risk-POMDP no longer satisfies the Markov property.
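The point about correlated risks can be illustrated with the variance term of a mean-variance criterion. In the sketch below, which uses hypothetical travel-cost scenarios for two consecutive hours, heavy traffic in the first hour tends to persist into the second, so the variance of the total cost exceeds the sum of the per-period variances.

```python
def variance(xs):
    """Population variance of equally likely outcomes."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Hypothetical equally likely scenarios: (cost in hour 1, cost in hour 2).
scenarios = [(1.0, 1.0), (1.0, 2.0), (5.0, 6.0), (5.0, 5.0)]
first = [a for a, _ in scenarios]
second = [b for _, b in scenarios]
total = [a + b for a, b in scenarios]

sum_of_variances = variance(first) + variance(second)
variance_of_sum = variance(total)
gap = variance_of_sum - sum_of_variances  # equals twice the covariance
```

Treating the two hours as separable would understate the risk by the gap, which is why a period-by-period decomposition of such a criterion introduces an error.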
Appendix
Appendix A Proof of Theorem 1
Given a solution to the coordinator problem, $(\sigma^{M*}, \gamma^*)$, assume that the human’s true type is an arbitrary value $\theta \in \Theta$. Then it is sufficient to confirm that the strategies $\sigma^{H*} = \gamma^*(\theta)$ and $\sigma^{M*}$ satisfy the three properties for the risk-sensitive Bayesian equilibrium:
(I) Machine’s incentive compatibility
(II) Human’s incentive compatibility
(III) The consistent belief profile
The machine’s incentive compatibility (I) is satisfied since the objective function of the coordinator is equal to the objective function of the machine. The consistent belief profile (III) follows directly from the formulation of the coordinator problem as a POMDP. The human’s incentive compatibility condition (II) follows from the monotonicity property of the convex risk measures and by the following logic. Assume that there exists a strategy $\sigma^H$ such that
$$\rho^H_\theta\big(C^{\sigma^{H}, \sigma^{M*}}\big) < \rho^H_\theta\big(C^{\sigma^{H*}, \sigma^{M*}}\big).$$
Then the monotonicity property implies that the coordinator’s objective function can be reduced using the strategy $\sigma^H$. This contradicts the optimality of the solution to the coordinator problem.
Appendix B Numerical Study
The figure below depicts optimal paths for the three values of the risk aversion parameter, one per panel from left to right. Each edge is labeled with the mean and variance of its cost, and the terminal nodes are labeled with reward distributions. The three terminal nodes labeled with reward distributions are the only nodes where the human-machine tandem can choose the “STOP” action.
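We mathematically define the regret for each of the strategies discussed in Section 5.1. First, we define the best-case policy as the one in which the machine optimizes the correct human’s criterion, that is:
$$V^{*} = \sum_{i=1}^{N} \pi_0(\theta_i)\, \min_{a} \Big\{ E^{a}[C] + \theta_i\, \mathrm{Var}^{a}(C) \Big\},$$
where $N$ is the number of distinct admissible values of the parameter $\theta$. The regret of the “Machine-Neutral” strategy is defined by
$$R^{\mathrm{MN}} = \sum_{i=1}^{N} \pi_0(\theta_i) \Big\{ E^{a^{0}}[C] + \theta_i\, \mathrm{Var}^{a^{0}}(C) \Big\} - V^{*},$$
where the first expectation above is computed under the policy in which the machine incorrectly minimizes the expected cumulative cost, i.e. it assumes $\theta = 0$. We denote by $a^{0}$ the corresponding optimal action chosen by the machine. The regret of the “Machine-Average” strategy is defined by
$$R^{\mathrm{MA}} = \sum_{i=1}^{N} \pi_0(\theta_i) \Big\{ E^{\bar{a}}[C] + \theta_i\, \mathrm{Var}^{\bar{a}}(C) \Big\} - V^{*},$$
where the first expectation above is computed under the policy in which only the machine takes actions and the human never intervenes. We denote by $\bar{a}$ the corresponding optimal action by the machine. The regret of the “Human-Machine” strategy evaluates the performance of our model against the best-case policy. The regret is defined as
$$R^{\mathrm{HM}} = \sum_{i=1}^{N} \pi_0(\theta_i) \Big\{ E^{a^{H*}, a^{M*}}[C] + \theta_i\, \mathrm{Var}^{a^{H*}, a^{M*}}(C) \Big\} - V^{*},$$
where the first expectation above is computed under the joint optimal human-machine policy, and $a^{H*}$, $a^{M*}$ denote the corresponding optimal actions to the risk-POMDP.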
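The regret of the two non-interactive strategies can be computed directly once each candidate policy is summarized by the mean and variance of its total cost. The sketch below assumes that regret is the prior-weighted gap between the mean-variance criterion under the chosen policy and under the complete-information benchmark; the policy statistics, the prior, and the function names are hypothetical. The "Human-Machine" regret is omitted because it requires solving the risk-POMDP.

```python
def risk(stats, theta):
    """Mean-variance criterion for a policy summarized by (mean, variance)."""
    mean, var = stats
    return mean + theta * var

def best_policy(policies, theta):
    """Policy the machine would pick if it knew the coefficient theta."""
    return min(policies, key=lambda name: risk(policies[name], theta))

def expected_regret(policies, prior, chosen):
    """Prior-weighted gap between a fixed policy and the best-case benchmark."""
    return sum(
        p * (risk(policies[chosen], theta)
             - risk(policies[best_policy(policies, theta)], theta))
        for theta, p in prior.items()
    )

policies = {"fast": (2.0, 8.0), "middle": (3.0, 2.5), "steady": (4.0, 1.0)}
prior = {0.0: 0.2, 0.5: 0.5, 2.0: 0.3}   # three admissible coefficients

# Machine-Neutral: ignore risk, i.e. act as if theta = 0.
neutral_choice = best_policy(policies, 0.0)
# Machine-Average: act on the prior-mean coefficient, never ask the human.
theta_bar = sum(theta * p for theta, p in prior.items())
average_choice = best_policy(policies, theta_bar)

regret_neutral = expected_regret(policies, prior, neutral_choice)
regret_average = expected_regret(policies, prior, average_choice)
```

In this toy instance the risk-neutral machine commits to the volatile policy and pays heavily whenever the human is risk averse, while the averaging machine loses only a little against each type.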
For each policy, the optimal value of the coordinator risk-POMDP is computed by expressing the belief profile of the machine as a belief-state and solving for the optimal value function via backward induction.
Acknowledgments
This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA).
References
Bäuerle, N. and Ott, J. (2011). Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361–379.
Bäuerle, N. and Rieder, U. (2014). More risk-sensitive Markov decision processes. Mathematics of Operations Research, 39(1):105–120.
Fudenberg, D. and Tirole, J. (1991). Game Theory. MIT Press, Cambridge, Massachusetts.
Hadfield-Menell, D., Dragan, A., Abbeel, P., and Russell, S. (2017). Cooperative inverse reinforcement learning. Advances in Neural Information Processing Systems, 25.
Haskell, W. B. and Jain, R. (2015). A convex analytic approach to risk-aware Markov decision processes. SIAM Journal on Control and Optimization, 53(3):1569–1598.
Nayyar, A., Mahajan, A., and Teneketzis, D. (2013). Decentralized stochastic control with partial history sharing: A common information approach. IEEE Transactions on Automatic Control, 58(7):1644–1658.
Shapiro, A., Dentcheva, D., and Ruszczynski, A. (2009). Lectures on Stochastic Programming. SIAM, Philadelphia.
Sinha, A. and Anastasopoulos, A. (2016). Structured perfect Bayesian equilibrium in infinite horizon dynamic games with asymmetric information. arXiv preprint arXiv:1609.04221.
Vasal, D. and Anastasopoulos, A. (2016a). Signaling equilibria for dynamic LQG games with asymmetric information. In Decision and Control (CDC), 2016 IEEE 55th Conference on, pages 6901–6908. IEEE.
Vasal, D. and Anastasopoulos, A. (2016b). A systematic process for evaluating structured perfect Bayesian equilibria in dynamic games with asymmetric information. In American Control Conference (ACC), 2016, pages 3378–3385. IEEE.