Learning Compact Models for Planning
with Exogenous Processes
Abstract
We address the problem of approximate model minimization for mdps in which the state is partitioned into endogenous and (much larger) exogenous components. An exogenous state variable is one whose dynamics are independent of the agent’s actions. We formalize the mask-learning problem, in which the agent must choose a subset of exogenous state variables to reason about when planning; planning in such a reduced state space can often be significantly more efficient than planning in the full model. We then explore the various value functions at play within this setting, and describe conditions under which a policy for a reduced model will be optimal for the full mdp. The analysis leads us to a tractable approximate algorithm that draws upon the notion of mutual information among exogenous state variables. We validate our approach in simulated robotic manipulation domains where a robot is placed in a busy environment, in which there are many other agents also interacting with the objects. Visit http://tinyurl.com/chitnisexogenous for a supplementary video.
Keywords: learning to plan efficiently, exogenous events, model minimization
1 Introduction
Most aspects of the world are exogenous to us: they are not affected by our actions. Even so, these processes (e.g., weather and traffic) often play a major role in how we should perform the task at hand. Despite facing such a daunting space of processes outside our control, humans are extremely adept at quickly reasoning about which aspects of this space matter for a particular task.
Consider the setting of a household robot tasked with doing laundry inside a home. It should not get bogged down by reasoning about the current weather or traffic situation, because these factors are irrelevant to its task. If, instead, it were tasked with mowing the lawn, then good decision-making would require it to reason about the time of day and weather (so it can finish the task by sunset, say).
In this work, we address the problem of approximately solving Markov decision processes (mdps), without too much loss in solution quality, by leveraging the structure of their exogenous processes. We begin by formalizing the mask-learning problem. An autonomous agent is given a generative model of an mdp that is partitioned into an endogenous state (i.e., that which can be affected by the agent’s actions) and a much higher-dimensional exogenous state. The agent must choose a mask, a subset of the exogenous state variables, that yields a policy not too much worse than a policy that would be obtained by reasoning about the entire exogenous state.
After formalizing the mask-learning problem, we discuss how we can leverage exogeneity to quickly learn transition models for only the relevant variables from data. Then, we explore the various value functions of interest within the problem, and discuss the conditions under which a policy for a particular mask will be optimal for the full mdp. Our analysis lends theoretical credence to the idea that a good mask should contain not only the exogenous state variables that directly influence the agent’s reward function, but also ones whose dynamics are correlated with theirs. This idea leads to a tractable approximate algorithm for the mask-learning problem that leverages the structure of the mdp, drawing upon the notion of mutual information among exogenous state variables.
We experiment in simulated robotic manipulation domains where a robot is put in a busy environment, among many other agents that also interact with the objects. We show that 1) in small domains where we can plan directly in the full mdp, the masks learned by our approximate algorithm yield competitive returns; 2) our approach outperforms strategies that do not leverage the structure of the mdp; and 3) our algorithm can scale up to planning problems with large exogenous state spaces.
2 Related Work
The notion of an exogenous event, one that is outside the agent’s control, was first introduced in the context of planning problems by Boutilier et al. [1]. The work most similar to ours is that by Dietterich et al. [2], who also consider the problem of model minimization in mdps by removing exogenous state variables. Their formulation of an mdp with exogenous state variables is similar to ours, but the central assumption they make is that the reward decomposes additively into an endogenous state component and an exogenous state component. Under this assumption, the value function of the full mdp decomposes into two parts, and any policy that is optimal for the endogenous mdp is also optimal for the full mdp. On the other hand, we do not make this reward decomposition assumption, and so our value function does not decompose; instead, our work focuses on a different set of questions: 1) what are the conditions under which an optimal policy in a given reduced model is optimal in the full mdp? 2) can we build an algorithm that leverages exogeneity to efficiently (approximately) discover such a reduced model?
Model minimization of factored Markov decision processes is often defined using the notion of stochastic bisimulation [3, 4], which describes an equivalence relation among states based upon their transition dynamics. Other prior work in state abstraction tries to remove irrelevant state variables in order to form reduced models [5, 6]. Our approach differs from these techniques in two major ways: 1) we consider only reducing the exogenous portion of the state, allowing us to develop algorithms which leverage the computational benefits enjoyed by the exogeneity assumption; 2) rather than trying to build a reduced model that is faithful to the full mdp, we explicitly optimize a different objective (Equation 1), which tries to find a reduced model yielding high rewards in the full mdp. Recent work in model-free reinforcement learning has considered how to exploit exogenous events for better sample complexity [7, 8], whereas we tackle the problem from a model-based perspective.
If we view the state as a vector of features, then another perspective on our approach is that it is a technique for feature selection [9] applied to mdps with exogenous state variables. This is closely related to the typical subset selection problem in supervised learning [10, 11, 12, 13], in which a learner must determine a subset of features that suffices for making predictions with little loss.
3 Problem Formulation
In this section, we introduce the notion of exogeneity in the context of a Markov decision process, and use this idea to formalize the mask-learning problem.
3.1 Markov Decision Processes with Exogenous Variables
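We will formalize our setting as a discrete Markov decision process (mdp) with special structure. An mdp is a tuple $\langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$ where: $\mathcal{S}$ is the state space; $\mathcal{A}$ is the action space; $T(s_t, a_t, s_{t+1}) = P(s_{t+1} \mid s_t, a_t)$ is the state transition distribution with $s_t, s_{t+1} \in \mathcal{S}$ and $a_t \in \mathcal{A}$; $R(s_t, a_t)$ is the reward function; and $\gamma$ is the discount factor. On each timestep, the agent chooses an action, transitions to a new state sampled from $T$, and receives a reward as specified by $R$. The solution to an mdp is a policy $\pi$, a mapping from states in $\mathcal{S}$ to actions in $\mathcal{A}$, such that the expected discounted sum of rewards over trajectories resulting from following $\pi$, which is $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t))\right]$, is maximized. Here, the expectation is with respect to the stochasticity in the initial state and state transitions. The value of a state $s$ under policy $\pi$ is defined as the expected discounted sum of rewards from following $\pi$, starting at state $s$: $V_\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(s_t)) \mid s_0 = s\right]$. The value function for $\pi$ is the mapping from $\mathcal{S}$ to $\mathbb{R}$ defined by $V_\pi(s)$.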
In this work, we assume that the agent is given a generative model of an mdp, in the sense of Kearns et al. [14]. Concretely, the agent is given the following:

Knowledge of $\mathcal{S}$, $\mathcal{A}$, and $\gamma$.

A black-box sampler of the transition model, which takes as input a state $s_t$ and action $a_t$, and returns a next state $s_{t+1} \sim T(s_t, a_t, \cdot)$.

A black-box reward function, which takes as input a state $s_t$ and action $a_t$, and returns the reward $R(s_t, a_t)$ for that state-action pair.

A black-box sampler of an initial state $s_0$ from some distribution $P(s_0)$.
We note that this assumption of having a generative model of an mdp lies somewhere in between that of only receiving execution traces (as in the typical reinforcement learning setting, assuming no ability to reset the environment) and that of having knowledge of the full analytical model. One can also view this assumption as saying that a simulator is available. Generative models are a natural way to specify a large mdp, as it is typically easier to produce samples from a transition model than to write out the complete next-state distributions.
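The four items the agent is given can be read as a small programming interface. The sketch below is only an illustration: the `GenerativeModel` protocol, its method names, and the `ToyModel` domain (one endogenous position and one exogenous bit) are assumptions made for this example and do not come from the paper.

```python
import random
from typing import Protocol, Tuple

State = Tuple[int, int]  # hypothetical encoding: (endogenous n, exogenous x)
Action = int


class GenerativeModel(Protocol):
    """Hypothetical interface mirroring the four items the agent is given."""

    actions: Tuple[Action, ...]  # known action space
    gamma: float                 # known discount factor

    def sample_initial_state(self, rng: random.Random) -> State: ...
    def sample_next_state(self, s: State, a: Action, rng: random.Random) -> State: ...
    def reward(self, s: State, a: Action) -> float: ...


class ToyModel:
    """Toy instance: position n in {0..4} (endogenous) and one bit x (exogenous)."""

    actions = (-1, +1)
    gamma = 0.9

    def sample_initial_state(self, rng: random.Random) -> State:
        return (0, rng.randint(0, 1))

    def sample_next_state(self, s: State, a: Action, rng: random.Random) -> State:
        n, x = s
        n_next = min(4, max(0, n + a))               # endogenous: driven by the action
        x_next = x if rng.random() < 0.8 else 1 - x  # exogenous: ignores the action
        return (n_next, x_next)

    def reward(self, s: State, a: Action) -> float:
        n, x = s
        return 1.0 if (n == 4 and x == 1) else 0.0


model: GenerativeModel = ToyModel()
```

The agent can call these samplers as often as it likes, but it never sees the transition probabilities themselves, which is what distinguishes a generative model from a full analytical model.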
We assume that the state of our mdp decomposes into an endogenous component and a (much larger) exogenous component, and that the agent knows this partition. Thus, we write $s = [n, x]$, where $n$ is the endogenous component of the state, whose transitions the agent can affect through its actions, and $x$ is the exogenous component of the state. This assumption means that $P(x_{t+1} \mid x_t, a_t) = P(x_{t+1} \mid x_t)$, and therefore $T(s_t, a_t, s_{t+1}) = P(n_{t+1} \mid n_t, x_t, a_t) \cdot P(x_{t+1} \mid x_t)$.
Though unaffected by the agent’s actions, the exogenous state variables influence the agent through the rewards and endogenous state transitions. Therefore, to plan, the agent will need to reason about future values of the exogenous state $x$.
We will focus on the setting where the exogenous state $x$ is made up of (not necessarily independent) state variables $x^1, x^2, \ldots, x^m$, with $m$ large. The fact that $m$ is large precludes reasoning about the entire exogenous state. We will define the mask-learning problem (Section 3.2), an optimization problem for deciding which variables the agent should reason about.
To be able to unlink the effects of each exogenous state variable on the agent’s reward, we will need to make an assumption on the form of the reward function $R$. This assumption is necessary for the agent to be able to reason about the effect of dropping a particular exogenous state variable from consideration. Specifically, we assume that $R(s_t, a_t) = R(n_t, x_t, a_t) = \sum_{i=1}^{m} R^i(n_t, x^i_t, a_t)$. In words, this says that the reward function decomposes into a sum over the individual effects of each exogenous state variable $x^i$. Although this means that the computation of the agent’s reward cannot include logic based on any combinations of $x^i$, this assumption is not too restrictive: one could always construct “supervariables” encompassing state variables coupled in the reward. Note that despite this assumption, the value function might depend non-additively on the exogenous variables.
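To make the additive assumption and the supervariable workaround concrete, here is a minimal sketch. The reward terms `coupled_penalty` and `goal_bonus` are invented for illustration; the only point is that a reward coupling two variables becomes additive once those two variables are merged into a single tuple-valued variable.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical per-variable reward term: (endogenous n, value of one variable, action) -> reward.
RewardTerm = Callable[[int, object, int], float]


def total_reward(terms: Sequence[RewardTerm], n: int, x: Sequence[object], a: int) -> float:
    """Additive reward: the sum over i of term_i(n, x[i], a)."""
    return sum(term(n, xi, a) for term, xi in zip(terms, x))


def coupled_penalty(n: int, pair: Tuple[int, int], a: int) -> float:
    """Penalty only when BOTH underlying bits are 1. This is not additive in the
    two bits separately, so they are merged into one supervariable `pair`."""
    x1, x2 = pair
    return -5.0 if (x1 == 1 and x2 == 1) else 0.0


def goal_bonus(n: int, goal: int, a: int) -> float:
    """Bonus when the endogenous position equals the (exogenous) goal location."""
    return 1.0 if n == goal else 0.0


terms = [coupled_penalty, goal_bonus]
x = [(1, 1), 3]  # first entry is the supervariable, second is the goal location
r = total_reward(terms, 3, x, 0)
```

The cost of this workaround is that a mask can then only keep or drop the merged variables together.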
3.2 The Mask-Learning Problem
Our central problem of focus is that of finding a mask, a subset of the exogenous state variables, that is “good enough” for planning, in the sense that we do not lose too much by ignoring the others. Before being able to formalize the problem, we must define precisely what it means to plan with only a subset of the exogenous variables. Suppose we have an mdp $M$ as described in Section 3.1. Let $X = \{x^1, x^2, \ldots, x^m\}$, and $\tilde{X} \subseteq X$ be an arbitrary subset (a mask).
We define the reduced model $\tilde{M}$ corresponding to mask $\tilde{X}$ and mdp $M$ as another mdp:

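$\tilde{\mathcal{A}}$ and $\tilde{\gamma}$ are the same as in $M$.

$\tilde{\mathcal{S}}$ is $\mathcal{S}$ reduced by removing the dimensions corresponding to any $x^i \notin \tilde{X}$.

$\tilde{T}(\tilde{s}_t, a_t, \tilde{s}_{t+1}) = P(n_{t+1} \mid n_t, \tilde{x}_t, a_t) \cdot P(\tilde{x}_{t+1} \mid \tilde{x}_t)$. Here, $\tilde{x}_t$ denotes the values at time $t$ of the exogenous variables in the mask $\tilde{X}$.^{1}
^{1} The $\tilde{x}$ might, in actuality, not be Markov, since we are ignoring the variables $X \setminus \tilde{X}$. Nevertheless, this expression is well-formed, and estimating it from data marginalizes out these ignored variables. If they were not exogenous, such estimates would depend on the data-gathering policy (and thus could be very error-prone).

$\tilde{R}(\tilde{s}_t, a_t) = \sum_{i : x^i \in \tilde{X}} R^i(n_t, x^i_t, a_t)$. Here, we are leveraging the assumption that the reward function decomposes as discussed at the end of Section 3.1.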
Since the agent only has access to a generative model of the mdp $M$, planning in $\tilde{M}$ will require estimating the reduced dynamics and reward models, $\tilde{T}$ and $\tilde{R}$.
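As a rough illustration of what a mask does to states and rewards, the sketch below projects a full state onto a mask and sums only the reward terms of the kept variables. Representing the mask as a set of variable indices, and the three example reward terms, are assumptions made for this example.

```python
from typing import Callable, FrozenSet, Sequence, Tuple

RewardTerm = Callable[[int, int, int], float]  # hypothetical: (n, value of one variable, a) -> reward


def reduce_state(n: int, x: Sequence[int], mask: FrozenSet[int]) -> Tuple[int, Tuple[int, ...]]:
    """Drop every exogenous variable whose index is not in the mask."""
    return (n, tuple(x[i] for i in sorted(mask)))


def reduced_reward(terms: Sequence[RewardTerm], n: int, x: Sequence[int], a: int,
                   mask: FrozenSet[int]) -> float:
    """Sum only the reward terms belonging to variables kept by the mask."""
    return sum(terms[i](n, x[i], a) for i in sorted(mask))


# Three illustrative exogenous variables with made-up reward terms.
terms = [
    lambda n, xi, a: float(xi),
    lambda n, xi, a: 10.0 * xi,
    lambda n, xi, a: -1.0,
]
x = (1, 1, 0)
mask = frozenset({0, 2})
s_reduced = reduce_state(2, x, mask)
r_reduced = reduced_reward(terms, 2, x, 0, mask)
```

The reduced transition model has no such closed form: it has to be estimated from samples, since the ignored variables are marginalized out.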
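Formally, the mask-learning problem is to determine:
$$\tilde{X}^* = \arg\max_{\tilde{X} \subseteq X} J(\tilde{X}), \qquad J(\tilde{X}) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \tilde{\pi}(\tilde{s}_t))\right] - \lambda \cdot \text{cost}(\tilde{X}), \qquad (1)$$
where $\tilde{\pi}$ is the policy (mapping reduced states to actions) that is obtained by planning in $\tilde{M}$. In words, we seek the mask $\tilde{X}^*$ that yields a policy maximizing the expected total reward accrued in the actual environment (the complete mdp $M$), minus a cost on $\tilde{X}$. Note that if $\lambda = 0$, then the choice $\tilde{X} = X$ is always optimal, and so the $\lambda$ serves as a regularization hyperparameter that balances the expected reward with the complexity of the considered mask. Reasonable choices of $\text{cost}(\tilde{X})$ include $|\tilde{X}|$ or the amount of time needed to produce the policy corresponding to $\tilde{X}$.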
4 Leveraging Exogeneity
The agent has only a generative model of the mdp $M$ (Section 3.1). In order to build a reduced mdp $\tilde{M}$, it must estimate $\tilde{T}$ and $\tilde{R}$.
Recall that we are considering the setting where the space of exogenous variables is much larger than the space of endogenous variables. At first glance, then, it seems challenging to estimate $\tilde{T}$ using only the generative model for $M$. However, this estimation problem is in fact greatly simplified due to the exogeneity of the $x^i$. To see why, consider the typical strategy for estimating a transition model from data: generate trajectories starting from some initial state $s_0$, then fit a one-step prediction model to this data. Now, if the $x^i$ were endogenous, then we would need to commit to some policy $\pi$ in order to generate these trajectories, and the reduced transition model we learn would depend heavily on this $\pi$. However, since the $x^i$ are exogenous, we can roll out simulations conditioned only on the initial state $s_0$, not on a policy, and use this data to efficiently estimate the transition model of the reduced mdp $\tilde{M}$.^{2}
^{2} In high dimensions, we may still need a lot of data, unless the state factors nicely.
Because we can efficiently estimate the reduced model dynamics of exogenous state variables, we can not only tractably construct the reduced mdp, but also allow ourselves to explore algorithms that depend heavily on estimating these variables’ dynamics, as we will do in Section 5.3.
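The policy-free estimation described in this section can be sketched with a simple count-based estimator. Everything specific here is an assumption for illustration: the three-bit exogenous process in `step_exogenous`, the tabular model, and the sample sizes. The paper does not prescribe a particular model class.

```python
import random
from collections import Counter, defaultdict


def step_exogenous(x, rng):
    """Hypothetical exogenous dynamics over three bits: x[0] is sticky,
    x[1] copies the previous x[0], and x[2] is pure noise."""
    x0 = x[0] if rng.random() < 0.9 else 1 - x[0]
    return (x0, x[0], rng.randint(0, 1))


def estimate_reduced_transitions(mask, num_rollouts=200, horizon=20, seed=0):
    """Count-based estimate of the one-step transition model restricted to the
    variables in `mask`. No policy is needed to generate the data, because the
    exogenous variables evolve regardless of the agent's actions."""
    rng = random.Random(seed)
    counts = defaultdict(Counter)
    for _ in range(num_rollouts):
        x = tuple(rng.randint(0, 1) for _ in range(3))
        for _ in range(horizon):
            x_next = step_exogenous(x, rng)
            key = tuple(x[i] for i in mask)
            counts[key][tuple(x_next[i] for i in mask)] += 1
            x = x_next
    # Normalize counts into conditional probabilities.
    return {
        key: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
        for key, ctr in counts.items()
    }


reduced_model = estimate_reduced_transitions(mask=(0,))
```

Variables outside the mask are marginalized out simply by never being recorded, which matches the remark that the reduced process need not be Markov.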
5 Algorithms for Mask-Learning
In this section, we explore some value functions induced by the mask-learning problem, and use this analysis to develop a tractable but effective approximate algorithm for finding good masks.
5.1 Preliminaries
Observe that the expectation in the objective (Equation 1) is the value of an initial state under the policy $\tilde{\pi}$, corresponding to mask $\tilde{X}$. Because the full mdp is very large, computing this value (the expected discounted sum of rewards) exactly will not be possible. Instead, we can use rollouts of $\tilde{\pi}$ to produce an estimate $\hat{V}_{\tilde{\pi}}$, which in turn yields an estimate of the objective, $\hat{J}(\tilde{X})$. We refer to this procedure as EstimateObjective.
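The original pseudocode listing for EstimateObjective did not survive in this copy, so the following is a reconstruction sketch rather than the authors' listing. It assumes a finite rollout horizon, a cost equal to the mask size, and a caller-supplied `plan_in_reduced_model` function that stands in for estimating the reduced model and planning in it; all function and parameter names are hypothetical.

```python
import random


def estimate_objective(mask, plan_in_reduced_model, sample_initial_state,
                       sample_next_state, reward, gamma, lam,
                       num_rollouts=500, horizon=50, seed=0):
    """Monte Carlo estimate of: expected discounted return of the reduced-model
    policy in the FULL model, minus lam * |mask| (assumed cost)."""
    rng = random.Random(seed)
    # Step 1: estimate the reduced model for this mask and plan in it.
    policy = plan_in_reduced_model(mask)
    kept = sorted(mask)
    total = 0.0
    # Step 2: roll the policy out in the full generative model.
    for _ in range(num_rollouts):
        n, x = sample_initial_state(rng)
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy((n, tuple(x[i] for i in kept)))  # policy sees the reduced state only
            ret += discount * reward((n, x), a)         # reward comes from the full state
            discount *= gamma
            n, x = sample_next_state((n, x), a, rng)
        total += ret
    # Step 3: average return minus the regularizer.
    return total / num_rollouts - lam * len(mask)


# Toy problem: the reward is 1 when the action matches the single exogenous bit.
def toy_planner(mask):
    if 0 in mask:
        return lambda s_reduced: s_reduced[1][0]  # read the bit and copy it
    return lambda s_reduced: 0                    # blind policy


def toy_init(rng):
    return (0, (1,))


def toy_step(s, a, rng):
    return s


def toy_reward(s, a):
    return 1.0 if a == s[1][0] else 0.0


score_with_bit = estimate_objective(frozenset({0}), toy_planner, toy_init, toy_step,
                                    toy_reward, gamma=0.5, lam=0.25,
                                    num_rollouts=10, horizon=3)
score_without = estimate_objective(frozenset(), toy_planner, toy_init, toy_step,
                                   toy_reward, gamma=0.5, lam=0.25,
                                   num_rollouts=10, horizon=3)
```

In the toy problem the mask containing the bit scores higher despite paying the regularization cost, because the blind policy never matches the bit.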
As discussed in Section 4, the first step of EstimateObjective, which estimates the reduced model, is tractable due to the exogeneity of the $x^i$ variables, which make up most of the dimensionality of the state. With EstimateObjective in hand, we can write down some very simple strategies for solving the mask-learning problem:

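MaskLearningBruteForce: Evaluate $\hat{J}(\tilde{X})$ for every possible mask $\tilde{X} \subseteq X$. Return the highest-scoring mask.

MaskLearningGreedy: Start with an empty mask $\tilde{X} = \emptyset$. While $\hat{J}(\tilde{X})$ increases, pick a variable $x^i \in X \setminus \tilde{X}$ at random, add it into $\tilde{X}$, and re-evaluate $\hat{J}(\tilde{X})$.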
While optimal, MaskLearningBruteForce is of course intractable for even medium-sized values of $m$, as it will not be feasible to evaluate all $2^m$ possible subsets of $X$. Unfortunately, even MaskLearningGreedy will likely be ineffective for medium and large mdps, as it does not leverage the structure of the mdp whatsoever. To develop a better algorithm, we will start by exploring the connection between the value functions of the reduced mdp $\tilde{M}$ and the full mdp $M$.
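Both baseline strategies can be sketched against an arbitrary scoring function standing in for the estimated objective. The reading of the greedy stopping rule (stop at the first addition that fails to improve the score, and discard that addition) and the toy `score` are assumptions made for this example.

```python
import random
from itertools import combinations


def mask_learning_brute_force(variables, score):
    """Score every subset of `variables` (2^m of them) and return the best."""
    all_masks = (
        frozenset(c)
        for r in range(len(variables) + 1)
        for c in combinations(variables, r)
    )
    return max(all_masks, key=score)


def mask_learning_greedy(variables, score, seed=0):
    """Add randomly chosen variables one at a time while the score increases."""
    rng = random.Random(seed)
    order = list(variables)
    rng.shuffle(order)
    mask = frozenset()
    best = score(mask)
    for v in order:
        candidate = mask | {v}
        candidate_score = score(candidate)
        if candidate_score <= best:
            break  # the score stopped increasing
        mask, best = candidate, candidate_score
    return mask


def score(mask):
    """Toy objective: variables 0 and 1 only pay off together; each variable costs 1."""
    return (10.0 if {0, 1} <= mask else 0.0) - len(mask)


best_mask = mask_learning_brute_force([0, 1, 2], score)
greedy_mask = mask_learning_greedy([0, 1, 2], score)
```

In this toy objective the greedy strategy returns the empty mask, because no single variable improves the score on its own; the brute-force search finds the pair.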
5.2 Analyzing the Value Functions of Interest
It is illuminating to outline the various value functions at play within our problem. We have:

$V^*(s)$: the value function under an optimal policy; unknown and difficult to compute exactly.

$V_{\tilde{\pi}}(s)$: given a mask $\tilde{X}$, the value function under the policy $\tilde{\pi}$; unknown and difficult to compute exactly. Note that $V_{\tilde{\pi}}(s) \leq V^*(s)$ for all $s$, by the definition of an optimal value function.

$\hat{V}_{\tilde{\pi}}(s)$: given a mask $\tilde{X}$, the empirical estimate of $V_{\tilde{\pi}}(s)$, which the agent can obtain by rolling out $\tilde{\pi}$ in the environment many times, as was done in the EstimateObjective procedure.

$\tilde{V}_{\tilde{\pi}}(\tilde{s})$: given a mask $\tilde{X}$, the value function of policy $\tilde{\pi}$ within the reduced mdp $\tilde{M}$. Here, $\tilde{s}$ is the reduced form of state $s$ (i.e., the endogenous state $n$ concatenated with $\tilde{x}$).
Intuitively, $\tilde{V}_{\tilde{\pi}}(\tilde{s})$ corresponds to the expected reward that the agent believes it will receive by following $\tilde{\pi}$, which typically will not match $V_{\tilde{\pi}}(s)$, the actual expected reward. In general, we cannot say anything about the ordering between $\tilde{V}_{\tilde{\pi}}(\tilde{s})$ and $V_{\tilde{\pi}}(s)$. For instance, if the mask ignores some negative effect in the environment, then the agent will expect to receive higher reward than it actually receives during its rollouts. On the other hand, if the mask ignores some positive effect in the world, then the agent will expect to receive lower reward than it actually receives.
It is now natural to ask: under what conditions would an optimal policy for the reduced mdp $\tilde{M}$ also be optimal for the full mdp $M$? The following theorem describes sufficient conditions:
Theorem 1.
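Consider an mdp $M$ as defined in Section 3.1, with exogenous state variables $X$, and a mask $\tilde{X} \subseteq X$. Let $\bar{X} = X \setminus \tilde{X}$ be the variables not included in the mask. If the following conditions hold: (1) $R^i(n_t, x^i_t, a_t) = 0$ for all $x^i \in \bar{X}$, (2) $P(n_{t+1} \mid n_t, x_t, a_t) = P(n_{t+1} \mid n_t, \tilde{x}_t, a_t)$, (3) $P(x_{t+1} \mid x_t) = P(\tilde{x}_{t+1} \mid \tilde{x}_t) \cdot P(\bar{x}_{t+1} \mid \bar{x}_t)$; then $\tilde{V}_{\tilde{\pi}}(\tilde{s}) = V_{\tilde{\pi}}(s)$ for all $s$. If $\tilde{\pi}$ is optimal for the reduced mdp $\tilde{M}$, then it must also be true that $V_{\tilde{\pi}}(s) = V^*(s)$ for all $s$.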
Proof: See Appendix A. ∎
The conditions in Theorem 1 are very intuitive: (1) all exogenous variables not in the mask do not influence the agent’s reward, (2) the endogenous state transitions do not depend on variables not in the mask, and (3) the variables in the mask transition independently of the ones not in the mask. If these conditions hold, and we use an optimal planner for the reduced model, then we will obtain a policy that is optimal for the full mdp, not just the reduced one. Based on these conditions, it is clear that our mask-learning algorithm should be informed by two things: 1) the agent’s reward function and 2) the degree of correlation among the dynamics of the various state variables.
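Hoeffding’s inequality [15] allows us to bound the difference between $V_{\tilde{\pi}}$ and the empirical estimate $\hat{V}_{\tilde{\pi}}$. For rewards in the range $[0, R_{\max}]$, discount factor $\gamma$, number of rollouts $N$, and policy $\tilde{\pi}$, we have that for any state $s$, $\left|V_{\tilde{\pi}}(s) - \hat{V}_{\tilde{\pi}}(s)\right| \leq \frac{R_{\max}}{1 - \gamma} \sqrt{\frac{1}{2N} \ln \frac{2}{\delta}}$ with probability at least $1 - \delta$. This justifies the use of $\hat{V}_{\tilde{\pi}}$ in place of $V_{\tilde{\pi}}$, as was done in the EstimateObjective procedure. Next, we describe a tractable algorithm for mask-learning that draws on Theorem 1.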
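To get a feel for how many rollouts such a bound asks for, the sketch below evaluates the standard two-sided Hoeffding bound for the mean of bounded samples. It assumes rewards in [0, r_max] and a discount factor below 1, so that every rollout return lies in [0, r_max / (1 - gamma)]; the function names are hypothetical, and the exact constants used by the authors may differ.

```python
import math


def hoeffding_epsilon(r_max: float, gamma: float, n: int, delta: float) -> float:
    """Half-width of the Hoeffding confidence interval for the mean of n
    rollout returns, each lying in [0, r_max / (1 - gamma)]."""
    value_range = r_max / (1.0 - gamma)
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def rollouts_needed(r_max: float, gamma: float, epsilon: float, delta: float) -> int:
    """Smallest n for which hoeffding_epsilon(...) is at most epsilon."""
    value_range = r_max / (1.0 - gamma)
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))


# Example: unit rewards, gamma = 0.9, target accuracy 1.0 at 95% confidence.
n_required = rollouts_needed(r_max=1.0, gamma=0.9, epsilon=1.0, delta=0.05)
eps_at_500 = hoeffding_epsilon(r_max=1.0, gamma=0.9, n=500, delta=0.05)
```

The required number of rollouts grows with the square of the return range, so discount factors close to 1 make tight estimates expensive.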
5.3 A Correlational Algorithm for Mask-Learning
Of course, it is very challenging to directly search for a low-cost mask that meets all the conditions of Theorem 1, if one even exists. However, we can use that intuition to develop an approximate algorithm based upon greedy forward selection techniques for feature selection [9], which at each iteration add a single variable that most improves some performance metric.
Algorithm description. In line with Condition (1) of Theorem 1, we begin by estimating the set of exogenous state variables relevant to the agent’s reward function, which we can do by leveraging the generative model of the mdp. As shown by the EstimateRewardVariables subroutine, the idea is to compute, for each variable $x^i$, the average amount of variance in the reward across different values of $x^i$ when the remainder of the state is held fixed, and to keep the variables for which this average exceeds a threshold. We define our initial mask $\tilde{X}$ to be this set. The efficiency-accuracy tradeoff of this subroutine can be controlled by tuning its sample counts. In practice, we can make a further improvement by heuristically biasing the sampling to favor states that are more likely to be encountered by the agent (such heuristics can be computed, for instance, by planning in a relaxed version of the problem).
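The EstimateRewardVariables listing is not present in this copy, so the following is a sketch of the idea as described in the text, not the authors' code. The parameter names (`num_states`, `num_values`, `threshold`), the uniform resampling of each variable, and the toy reward are assumptions.

```python
import random
import statistics


def estimate_reward_variables(reward, sample_state, domains, actions, threshold,
                              num_states=50, num_values=5, seed=0):
    """Return indices of exogenous variables whose value visibly changes the reward.

    For each variable i: sample a state and action, resample only x[i] several
    times, record the variance of the resulting rewards, and average that
    variance over the sampled states."""
    rng = random.Random(seed)
    relevant = set()
    for i, domain in enumerate(domains):
        variances = []
        for _ in range(num_states):
            n, x = sample_state(rng)
            a = rng.choice(actions)
            rewards = []
            for _ in range(num_values):
                x_mod = list(x)
                x_mod[i] = rng.choice(domain)  # vary x[i], hold everything else fixed
                rewards.append(reward(n, tuple(x_mod), a))
            variances.append(statistics.pvariance(rewards))
        if statistics.mean(variances) > threshold:
            relevant.add(i)
    return relevant


# Toy problem: only the first of three exogenous bits affects the reward.
def toy_sample_state(rng):
    return (rng.randint(0, 4), tuple(rng.randint(0, 1) for _ in range(3)))


def toy_reward(n, x, a):
    return 1.0 if x[0] == 1 else 0.0


initial_mask = estimate_reward_variables(
    toy_reward, toy_sample_state,
    domains=[(0, 1), (0, 1), (0, 1)], actions=(0, 1), threshold=0.05,
)
```

Larger sample counts reduce the chance of missing a variable that affects the reward only in rare states, at the price of more calls to the reward function.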
Then, we employ an iterative procedure to approximately build toward Conditions (2) and (3): that the variables not in the mask transition independently of those in the mask and the endogenous state. To do so, we greedily add variables into the mask based upon the mutual information between the empirical transition model of the reduced state and that of each remaining variable. Intuitively, this quantitatively measures: “How much better would we be able to predict the dynamics of $\tilde{s}$ if $\tilde{X}$ included variable $x^i$, versus if it didn’t?” This mutual information will be 0 if $x^i$ transitions independently of all variables in $\tilde{s}$. To calculate the mutual information between $\tilde{s}$ and variable $x^i$, we must first learn three empirical transition models from data: that of $\tilde{s}$ alone, that of $x^i$ alone, and that of the two jointly. The exogeneity of most of the state is critical here: not only does it make learning these models much more efficient, but also, without exogeneity, we could not be sure whether two variables actually transition independently of each other or we just happened to follow a data-gathering policy that led us to believe so.
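One plausible way to compute such a quantity is a plug-in estimate that treats each observed one-step transition as a sample of a discrete random variable. The sketch below does this for a purely exogenous toy process; the dynamics in `step_exogenous`, the restriction to exogenous mask variables, and the estimator itself are assumptions for illustration, and the estimator used in the paper may differ.

```python
import math
import random
from collections import Counter


def step_exogenous(x, rng):
    """Hypothetical dynamics over three bits: x[0] is sticky,
    x[1] copies the previous x[0], and x[2] is pure noise."""
    x0 = x[0] if rng.random() < 0.9 else 1 - x[0]
    return (x0, x[0], rng.randint(0, 1))


def sample_transitions(num_rollouts=200, horizon=20, seed=0):
    """Policy-free rollouts of the exogenous process, as (x_t, x_{t+1}) pairs."""
    rng = random.Random(seed)
    transitions = []
    for _ in range(num_rollouts):
        x = tuple(rng.randint(0, 1) for _ in range(3))
        for _ in range(horizon):
            x_next = step_exogenous(x, rng)
            transitions.append((x, x_next))
            x = x_next
    return transitions


def transition_mutual_information(transitions, mask, i):
    """Plug-in mutual information (in nats) between the transitions of the
    masked variables and the transitions of candidate variable i."""
    pairs = [
        ((tuple(x[j] for j in mask), tuple(xn[j] for j in mask)), (x[i], xn[i]))
        for x, xn in transitions
    ]
    n = len(pairs)
    joint = Counter(pairs)
    marg_u = Counter(u for u, _ in pairs)
    marg_v = Counter(v for _, v in pairs)
    return sum(
        (c / n) * math.log(c * n / (marg_u[u] * marg_v[v]))
        for (u, v), c in joint.items()
    )


data = sample_transitions()
mi_correlated = transition_mutual_information(data, mask=(0,), i=1)
mi_noise = transition_mutual_information(data, mask=(0,), i=2)
```

A greedy selection step would add the remaining variable with the largest such score; here the variable that copies the masked one scores far above the noise variable.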
6 Experiments
Our experiments are designed to answer the following questions: (1) In small domains, are the masks learned by our algorithm competitive with the optimal masks? (2) Quantitatively, how well do the learned masks perform in large, complicated domains? (3) Qualitatively, do the learned masks properly reflect different goals given to the robot? (4) What are the limitations of our approach?
We experiment in simulated robotic manipulation domains in which a robot is placed in a busy environment with objects on tables, among many other agents that are also interacting with objects. The robot is rewarded for navigating to a given goal object (which changes on each episode) and penalized for crashing into other agents. The exogenous variables are the states of the other agents^{3}, a binary-valued occupancy grid discretizing the environment, the object placements on tables, and information specifying the goal. We plan using value iteration with a timeout of 60 seconds. Empirical value estimates are computed as averages across 500 independent rollouts. Each result reports an average across 50 independent trials. We regularize masks by setting $\text{cost}(\tilde{X}) = |\tilde{X}|$; our initial experimentation suggested that other choices of cost, such as planning time, perform similarly.
^{3} It is true that typically, the other agents will react to the robot’s actions, but it is well worth making the exogeneity assumption in these complicated domains we are considering. This assumption is akin to how we treat traffic patterns as exogenous to ourselves, even though technically we can slightly affect them.
6.1 In small domains, are our learned masks competitive with the optimal masks?
Finding the optimal mask $\tilde{X}^*$ is only possible in small models. Therefore, to answer this question we constructed a small gridworld representation of our experimental domain, with only 600 states and 5 exogenous variables, in which we can plan exactly. We find $\tilde{X}^*$ using the MaskLearningBruteForce strategy given in Section 5.1. We compare the optimal mask with both 1) Ours, the mask returned by our algorithm in Section 5.3; and 2) Greedy, the mode mask chosen across 10 independent trials of the MaskLearningGreedy strategy given in Section 5.1.
Discussion. For higher values of $\lambda$ especially, our algorithm performs on par with the optimal brute-force strategy, which is only viable in small domains. The gap widens as $\lambda$ decreases because as the optimal mask gets larger, it becomes harder to find using a forward selection strategy such as ours. In practical settings, one should typically set $\lambda$ quite high so that smaller masks are preferred, as these will yield the most compact reduced models. Meanwhile, the greedy strategy does not perform well because it disregards the structure of the mdp and the conditions of Theorem 1. We also observe that our algorithm takes significantly less time than the optimal brute-force strategy; we should expect this gap to widen further in larger domains. Our algorithm’s stopping condition is that the score function begins to decrease, and so it tends to terminate more quickly for higher values of $\lambda$, as smaller masks become increasingly preferred.
6.2 Quantitatively, how well do the learned masks perform in large, complicated domains?
To answer this question, we consider two large domains: Factory, in which a robot must fulfill manipulation tasks being issued in a stream; and Crowd, in which a robot must navigate to target objects among a crowd of agents executing their own fixed stochastic policies. Even after discretization, these domains have and states respectively; Factory has 22 exogenous variables and Crowd has 124. In either domain, both planning exactly in the full mdp and searching for $\tilde{X}^*$ by brute force are prohibitively expensive. All results use , , , , and . We compare our algorithm with the Greedy one described earlier, which disregards the structure of the mdp. We also compare to only running our algorithm’s first phase, which chooses the mask to be the estimated set of variables directly influencing the reward.
Algorithm            | Domain  | Average True Returns | Time / Run (sec)
Greedy               | Factory | 0                    | 13.4
Ours (first phase)   | Factory | 186                  | —
Ours (full)          | Factory | 226                  | 53.8
Greedy + heuristics  | Crowd   | 188                  | 43.7
Ours (first phase)   | Crowd   | 260                  | —
Ours (full)          | Crowd   | 545                  | 123.5
Discussion. In the Factory domain, the baseline greedy algorithm did not succeed even once at the task. The reason is that in this domain, several exogenous variables directly influence the reward function, but the greedy algorithm starts with an empty mask and only adds one variable at a time, and so cannot detect the improvement arising from adding in several variables at once. To give the baseline a fair chance, we initialized it by hand to a better mask for the Crowd domain, which is why it sees success there (Greedy + heuristics in the table). The results suggest that our algorithm, explicitly framed around all conditions of Theorem 1, performs well. The graph of the example execution in the Crowd domain shows that adding a 55th variable to the mask yields a decrease in the estimated objective even though the average returns slightly increase, due to the regularization term $\lambda \cdot \text{cost}(\tilde{X})$.
6.3 Qualitatively, do the learned masks properly reflect different goals given to the robot?
Please see the supplementary video for more qualitative results, examples, and explanations.
An important characteristic of our algorithm is its ability to learn different masks based on what the goal is; we illustrate this concept with an example. Let us explore the masks resulting from two different goals in the Crowd domain: Goal (1) is for the robot to navigate to an object that cannot be moved by any of the other agents, and Goal (2) is for the robot to navigate to an object that is manipulable by the other agents. In either case, the variables that directly affect the reward function are the object placements on the tables (which tell the robot where it must navigate to) and the occupancy grid (which helps the robot avoid crashing). However, for Goal (2), there is another variable that is important to consider: the states of any other agents that can manipulate the objects. This desideratum gets captured by the second phase of our algorithm: reasoning about the states of the other agents will allow the robot to better predict the dynamics of the object placements, enabling it to succeed at its task more efficiently and earn higher rewards. See Figure 2 for a visualization of this concept in our experimental domain, simulated using pybullet [16].
Discussion. In a real-world setting, all of the exogenous variables could potentially be relevant to solving some problem, but typically only a small subset will be relevant to a particular problem. Under this lens, our method gives a way of deriving option policies in lower-dimensional subspaces.
6.4 What are the limitations of our approach?
Our experimentation revealed some limitations of our approach that are valuable to discuss. If the domains of the exogenous variables are large (or continuous), then it is expensive to compute the necessary mutual information quantities. To remedy this issue, one could turn to techniques for estimating mutual information, such as MINE [17]. Another limitation is that the algorithm as presented greedily adds one variable to the mask at a time, after the initial mask is built. In some settings, it can be useful to instead search over groups of variables to add in all at once, since these may contain information for better predicting dynamics that is not present in any single variable.
7 Conclusion and Future Work
We have formalized and given a tractable algorithm for the mask-learning problem, in which an agent must choose a subset of the exogenous state variables of an mdp to reason about when planning. An important avenue for future work is to remove the assumption that the agent knows the partition of endogenous versus exogenous aspects of the state. An interesting fact to ponder is that the agent can actually control this partition by choosing its actions appropriately. Thus, the agent can commit to a particular choice of exogenous variables in the world, and plan under the constraint of never influencing these variables. Another avenue for future work is to develop an incremental, real-time version of the algorithm, necessary in settings where the agent’s task constantly changes.
We gratefully acknowledge support from NSF grants 1523767 and 1723381; from AFOSR grant FA9550-17-1-0165; from ONR grant N00014-18-1-2847; from Honda Research; and from the MIT-Sensetime Alliance on AI. Rohan is supported by an NSF Graduate Research Fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.
References
[1] C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1–94, 1999.
[2] T. Dietterich, G. Trimponias, and Z. Chen. Discovering and removing exogenous state variables and rewards for reinforcement learning. In International Conference on Machine Learning, pages 1261–1269, 2018.
[3] T. Dean and R. Givan. Model minimization in Markov decision processes. In AAAI/IAAI, pages 106–111, 1997.
[4] R. Givan, T. Dean, and M. Greig. Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147(1-2):163–223, 2003.
[5] N. K. Jong and P. Stone. State abstraction discovery from irrelevant state variables. In IJCAI, volume 8, pages 752–757. Citeseer, 2005.
[6] N. Mehta, S. Ray, P. Tadepalli, and T. Dietterich. Automatic discovery and transfer of MAXQ hierarchies. In Proceedings of the 25th International Conference on Machine Learning, pages 648–655. ACM, 2008.
[7] H. Mao, S. B. Venkatakrishnan, M. Schwarzkopf, and M. Alizadeh. Variance reduction for reinforcement learning in input-driven environments. arXiv preprint arXiv:1807.02264, 2018.
[8] J. Choi, Y. Guo, M. L. Moczulski, J. Oh, N. Wu, M. Norouzi, and H. Lee. Contingency-aware exploration in reinforcement learning. 2019.
[9] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157–1182, 2003.
[10] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.
[11] A. Miller. Subset selection in regression. Chapman and Hall/CRC, 2002.
[12] G. H. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In Machine Learning Proceedings 1994, pages 121–129. Elsevier, 1994.
[13] J. Yang and V. Honavar. Feature subset selection using a genetic algorithm. In Feature extraction, construction and selection, pages 117–136. Springer, 1998.
[14] M. Kearns, Y. Mansour, and A. Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49(2-3):193–208, 2002.
[15] W. Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409–426. Springer, 1994.
[16] E. Coumans, Y. Bai, and J. Hsu. Pybullet physics engine. 2018. URL http://pybullet.org/.
[17] M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm. MINE: mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.
[18] M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
Appendix A: Proof of Theorem 1
Theorem 1.
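Consider an mdp $M$ as defined in Section 3.1, with exogenous state variables $X$, and a mask $\tilde{X} \subseteq X$. Let $\bar{X} = X \setminus \tilde{X}$ be the variables not included in the mask. If the following conditions hold: (1) $R^i(n_t, x^i_t, a_t) = 0$ for all $x^i \in \bar{X}$, (2) $P(n_{t+1} \mid n_t, x_t, a_t) = P(n_{t+1} \mid n_t, \tilde{x}_t, a_t)$, (3) $P(x_{t+1} \mid x_t) = P(\tilde{x}_{t+1} \mid \tilde{x}_t) \cdot P(\bar{x}_{t+1} \mid \bar{x}_t)$; then $\tilde{V}_{\tilde{\pi}}(\tilde{s}) = V_{\tilde{\pi}}(s)$ for all $s$. If $\tilde{\pi}$ is optimal for the reduced mdp $\tilde{M}$, then it must also be true that $V_{\tilde{\pi}}(s) = V^*(s)$ for all $s$.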
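Proof: Consider an arbitrary state $s = [n, x]$, with corresponding reduced state $\tilde{s} = [n, \tilde{x}]$. We begin by showing that under the stated conditions, $\tilde{V}_{\tilde{\pi}}(\tilde{s}) = V_{\tilde{\pi}}(s)$. Writing $a = \tilde{\pi}(\tilde{s})$, the recursive form of these value functions is:
$$V_{\tilde{\pi}}(s) = R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V_{\tilde{\pi}}(s'),$$
$$\tilde{V}_{\tilde{\pi}}(\tilde{s}) = \tilde{R}(\tilde{s}, a) + \gamma \sum_{\tilde{s}'} \tilde{T}(\tilde{s}, a, \tilde{s}')\, \tilde{V}_{\tilde{\pi}}(\tilde{s}').$$
Now, consider an iterative procedure for obtaining these value functions, which repeatedly applies the above equations starting from $V^0_{\tilde{\pi}}(s) = \tilde{V}^0_{\tilde{\pi}}(\tilde{s}) = 0$ for all $s$. Let the value functions at iteration $k$ be denoted as $V^k_{\tilde{\pi}}$ and $\tilde{V}^k_{\tilde{\pi}}$. We will show by induction on $k$ that $V^k_{\tilde{\pi}}(s) = \tilde{V}^k_{\tilde{\pi}}(\tilde{s})$ for all $s$.
The base case is immediate. Suppose $V^k_{\tilde{\pi}}(s) = \tilde{V}^k_{\tilde{\pi}}(\tilde{s})$ for all $s$, for some value of $k$. Writing $s' = [n', x']$ and $x' = [\tilde{x}', \bar{x}']$, we compute:
\begin{align*}
V^{k+1}_{\tilde{\pi}}(s) &= R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V^k_{\tilde{\pi}}(s') \\
&= \sum_{i=1}^{m} R^i(n, x^i, a) + \gamma \sum_{n', x'} P(n' \mid n, x, a)\, P(x' \mid x)\, V^k_{\tilde{\pi}}(s') && \text{Model assumptions.} \\
&= \sum_{i : x^i \in \tilde{X}} R^i(n, x^i, a) + \gamma \sum_{n', x'} P(n' \mid n, x, a)\, P(x' \mid x)\, V^k_{\tilde{\pi}}(s') && \text{Condition (1).} \\
&= \tilde{R}(\tilde{s}, a) + \gamma \sum_{n', x'} P(n' \mid n, x, a)\, P(x' \mid x)\, V^k_{\tilde{\pi}}(s') && \text{Defn. of reduced } R. \\
&= \tilde{R}(\tilde{s}, a) + \gamma \sum_{n', x'} P(n' \mid n, \tilde{x}, a)\, P(x' \mid x)\, V^k_{\tilde{\pi}}(s') && \text{Condition (2).} \\
&= \tilde{R}(\tilde{s}, a) + \gamma \sum_{n', \tilde{x}'} \sum_{\bar{x}'} P(n' \mid n, \tilde{x}, a)\, P(x' \mid x)\, V^k_{\tilde{\pi}}(s') && \text{Split up sum.} \\
&= \tilde{R}(\tilde{s}, a) + \gamma \sum_{n', \tilde{x}'} \sum_{\bar{x}'} P(n' \mid n, \tilde{x}, a)\, P(\tilde{x}' \mid \tilde{x})\, P(\bar{x}' \mid \bar{x})\, V^k_{\tilde{\pi}}(s') && \text{Condition (3).} \\
&= \tilde{R}(\tilde{s}, a) + \gamma \sum_{n', \tilde{x}'} \sum_{\bar{x}'} P(n' \mid n, \tilde{x}, a)\, P(\tilde{x}' \mid \tilde{x})\, P(\bar{x}' \mid \bar{x})\, \tilde{V}^k_{\tilde{\pi}}(\tilde{s}') && \text{Inductive assumption.} \\
&= \tilde{R}(\tilde{s}, a) + \gamma \sum_{n', \tilde{x}'} P(n' \mid n, \tilde{x}, a)\, P(\tilde{x}' \mid \tilde{x})\, \tilde{V}^k_{\tilde{\pi}}(\tilde{s}') \sum_{\bar{x}'} P(\bar{x}' \mid \bar{x}) && \text{Rearrange.} \\
&= \tilde{R}(\tilde{s}, a) + \gamma \sum_{\tilde{s}'} \tilde{T}(\tilde{s}, a, \tilde{s}')\, \tilde{V}^k_{\tilde{\pi}}(\tilde{s}') = \tilde{V}^{k+1}_{\tilde{\pi}}(\tilde{s}) && \text{Defn. of reduced } T.
\end{align*}
The last step uses $\sum_{\bar{x}'} P(\bar{x}' \mid \bar{x}) = 1$. We have shown that $V^{k+1}_{\tilde{\pi}}(s) = \tilde{V}^{k+1}_{\tilde{\pi}}(\tilde{s})$ for all $s$. By standard arguments (e.g., in [18]), this iterative procedure converges to the true $V_{\tilde{\pi}}$ and $\tilde{V}_{\tilde{\pi}}$ respectively. Therefore, we have that $\tilde{V}_{\tilde{\pi}}(\tilde{s}) = V_{\tilde{\pi}}(s)$ for all $s$.
Now, if $\tilde{\pi}$ is optimal for $\tilde{M}$, then it is optimal for the full mdp $M$ as well. This is because Condition (1) assures us that the variables not considered in the mask do not affect the reward, implying that $\tilde{\pi}$ optimizes the expected reward in not just $\tilde{M}$, but also $M$. Therefore, under this assumption, we have that $V_{\tilde{\pi}}(s) = V^*(s)$ for all $s$. ∎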