A method for the online construction of the set of states of a Markov Decision Process using Answer Set Programming
Abstract
Nonstationary domains, which change in unpredicted ways, are a challenge for agents searching for optimal policies in sequential decision-making problems. This paper presents a combination of Markov Decision Processes (MDP) with Answer Set Programming (ASP), named Online ASP for MDP (oASP(MDP)), which is a method capable of constructing the set of domain states while the agent interacts with a changing environment. oASP(MDP) updates previously obtained policies, learnt by means of Reinforcement Learning (RL), using rules that represent the domain changes observed by the agent. These rules represent a set of domain constraints that are processed as ASP programs, reducing the search space. Results show that oASP(MDP) is capable of finding solutions for problems in nonstationary domains without interfering with the action-value function approximation process.
Leonardo A. Ferreira, Reinaldo A. C. Bianchi, Paulo E. Santos, Ramon Lopez de Mantaras. Universidade Metodista de São Paulo, São Bernardo do Campo, Brazil. Centro Universitário FEI, São Bernardo do Campo, Brazil. IIIA-CSIC, Bellaterra, España. leonardo.ferreira@metodista.br, {rbianchi,psantos}@fei.edu.br, mantaras@iiia.csic.es
1 Introduction
A key issue in Artificial Intelligence (AI) is to equip autonomous agents with the ability to operate in changing domains by adapting the agents’ processes at a cost that is equivalent to the complexity of the domain changes. This ability is called elaboration tolerance [?; ?]. Consider, for instance, an autonomous robot learning to navigate in an unknown environment. Unforeseen events may happen that could block passages (or open previously unavailable ones). The autonomous agent should be able to find new solutions in this changed domain using the knowledge previously acquired plus the knowledge acquired from the observed changes in the environment, without having to undergo a complete code rewriting or start a new cycle of domain exploration from scratch.
Reinforcement Learning (RL) is an AI framework in which an agent interacts with its environment in order to find a sequence of actions (a policy) to perform a given task [?]. RL is capable of finding optimal solutions to Markov Decision Processes (MDP) without assuming total information about the problem’s domain. However, in spite of having the optimal solution to a particular task, an RL agent may still perform poorly on a new task, even if the latter is similar to the former [?]. Therefore, Reinforcement Learning alone does not provide elaboration-tolerant solutions. Nonmonotonic reasoning can be used as a tool to increase the generality of domain representations [?] and may provide the appropriate element to build agents more adaptable to changing situations. In this work we consider Answer Set Programming (ASP) [?; ?], which is a declarative nonmonotonic logic programming language, to bridge the gap between RL and elaboration-tolerant solutions. The present paper tackles this problem by introducing a novel algorithm, Online ASP for MDP (oASP(MDP)), that updates previously obtained policies, learned by means of Reinforcement Learning (RL), using rules that represent the domain changes as observed by the agent. These rules are constructed by the agent in an online fashion (i.e., as the agent perceives the changes) and they impose constraints on the domain states that are further processed by an ASP engine, reducing the search space. Tests performed in nonstationary nondeterministic grid worlds show not only that oASP(MDP) is capable of finding the action-value function for an RL agent and, consequently, the optimal solution, but also that using ASP does not hinder the performance of a learning agent and can improve the agent’s overall performance.
To model an oASP(MDP) learning agent (Section 3), we propose the combination of Markov Decision Processes and Reinforcement Learning (Section 2.1) with ASP (Section 2.2). Tests were performed in two different nonstationary nondeterministic grid worlds (Section 4), whose results show a considerable increase in the agent’s performance when compared with a baseline RL algorithm, as presented in Sections 4.1 and 4.2.
2 Background
This section introduces Markov Decision Processes (MDP), Reinforcement Learning (RL) and Answer Set Programming (ASP), which are the foundations of the work reported in this paper.
2.1 MDP and Reinforcement Learning
In a sequential decision-making problem, an agent is required to execute a series of actions in an environment in order to find the solution of a given problem. Such a sequence of actions, which forms a feasible solution, is known as a policy ($\pi$), which leads the agent from an initial state to a goal state [?; ?]. Given a set of feasible solutions, an optimal policy $\pi^*$ can be found by using Bellman’s Principle of Optimality [?], which states that “an optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision”; $\pi^*$ can be defined as the policy that maximises/minimises a desired reward/cost function.
A formalisation that can be used to describe sequential decision-making problems is a Markov Decision Process (MDP), which is defined as a tuple $\langle S, A, T, R \rangle$, where:

$S$ is the set of states that can be observed in the domain;

$A$ is the set of actions that the agent can execute;

$T : S \times A \times S \to [0, 1]$ is the transition function that provides the probability of, being in $s \in S$ and executing $a \in A$, reaching the future state $s' \in S$;

$R : S \times A \times S \to \mathbb{R}$ is the reward function that provides a real number when executing $a \in A$ in the state $s \in S$ and observing $s' \in S$ as the future state.
One method that can be used to find an optimal policy for MDPs, which does not need a priori knowledge of the transition and reward functions, is the model-free, off-policy reinforcement learning method known as Q-Learning [?; ?].
Given an MDP, Q-Learning learns while an agent interacts with its environment by executing an action $a$ in the current state $s$ and observing both the future state $s'$ and the reward $r$. With these observations, Q-Learning updates an action-value function $Q(s, a)$ using
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
where $\alpha$ is the learning rate and $\gamma$ is a discount factor. By using these reward values to approximate a function $Q(s, a)$ that maps a real value to pairs of states and actions, Q-Learning is capable of finding the policy $\pi^*$ which maximises the reward function. Since Q-Learning is a well-known and widely used RL method, we omit its detailed description here, which can be found in [?; ?].
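The update rule can be made concrete with a small tabular sketch. This is only an illustration, not the authors' implementation: the function name `q_update`, the dictionary-based table, the action labels and the numeric values of `alpha` and `gamma` are all assumptions chosen for the example.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Apply one tabular Q-Learning update to Q[(s, a)] and return the new value."""
    # Greedy estimate of the value of the future state.
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

# Hypothetical usage: a zero-initialised table and illustrative parameter values.
Q = defaultdict(float)
ACTIONS = ("up", "down", "left", "right")
first = q_update(Q, "s0", "up", 1.0, "s1", ACTIONS, alpha=0.5, gamma=0.9)
```

Because the table is a `defaultdict`, states never seen before simply read as zero, which fits a setting where the set of states is not known in advance.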
Although Q-Learning does not need information about $T$ and $R$, it still needs to know the set of states $S$ before starting the interaction with the environment. For finding this set, this work uses Answer Set Programming.
2.2 Answer Set Programming
Answer Set Programming (ASP) is a declarative nonmonotonic logic programming language that has been successfully used for NP-complete problems such as planning [?; ?; ?].
An ASP rule is represented as
$$A \leftarrow L_1, L_2, \ldots, L_n \qquad (1)$$
where $A$ is an atom (the head of the rule) and the conjunction of literals $L_1, L_2, \ldots, L_n$ is the rule’s body.
An ASP program $\Pi$ is a set of rules in the form of Formula 1. ASP is based on the stable model semantics of logic programs [?]. A stable model of $\Pi$ is an interpretation that makes every rule in $\Pi$ true, and is a minimal model of $\Pi$. ASP programs are executed by computing stable models, which is usually accomplished by inference engines called answer set solvers [?].
Two important aspects of ASP are its third truth value for unknown, along with true and false, and its two types of negation: strong (or classical) negation and weak negation, representing negation as failure. As it is defined over the stable model semantics, ASP respects the rationality principle that one shall not believe anything one is not forced to believe [?].
Although ASP does not allow explicit reasoning with or about probabilities, ASP’s choice rules are capable of generating distinct outcomes for the same input. That is, given a current state $s$ and an action $a$, it is possible to describe in an ASP logic program the states $s_1$, $s_2$ and $s_3$ as possible outcomes of executing $a$ in $s$ as “1{ s1, s2, s3 }1 :- a, s.”. Such choice rules can be read as “given that $s$ and $a$ are true, choose at least one and at most one state from $s_1$, $s_2$ and $s_3$”. Thus, the answer sets [s, a, s1], [s, a, s2] and [s, a, s3] represent the possible transitions that are the effects of executing action $a$ in state $s$.
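To make the reading of a choice rule concrete, the following sketch enumerates the answer sets of a single cardinality-bounded choice rule in plain Python. It is not an ASP solver and does not use Clingo: it assumes the body atoms are already known to be true, and the function name `choice_rule_answer_sets` is introduced here only for illustration.

```python
from itertools import combinations

def choice_rule_answer_sets(body, candidates, lower=1, upper=1):
    """Enumerate answer sets of `lower{candidates}upper :- body`, assuming body holds."""
    answer_sets = []
    for size in range(lower, upper + 1):
        # Each admissible subset of head atoms yields one answer set.
        for chosen in combinations(candidates, size):
            answer_sets.append(list(body) + list(chosen))
    return answer_sets

# One state must be chosen out of three possible outcomes of action a in state s.
transitions = choice_rule_answer_sets(["s", "a"], ["s1", "s2", "s3"])
```

With both bounds equal to one, each answer set contains exactly one future state, so every answer set corresponds to one possible transition.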
This work assumes that for each state $s \in S$ there is an ASP logic program with choice rules describing the consequences of each action $a \in A_s$ (where $A_s \subseteq A$ is the set of actions for the state $s$). ASP programs can also be used to represent domain constraints: the allowed or forbidden states or actions. In this context, to find the set $S$ of an MDP and its function $T$ is to find every answer set for every state that the agent is allowed to visit, i.e. every allowed transition for each state-action pair. In this paper ASP is used to find the set of states $S$ of an MDP and Q-Learning is used to approximate $Q(s, a)$ without assuming prior knowledge of $T$ and $R$. The next section describes this idea in more detail.
3 Online ASP for MDP: oASP(MDP)
Given the sets $S$ and $A$ of an MDP, an RL method can approximate an action-value function $Q(s, a)$. If $S$ is constructed state by state while the agent is interacting with the world, the RL method is still able to approximate $Q(s, a)$, as it only uses the current and past states for that. By using choice rules in ASP, it is possible to describe a transition in the form 1{s’}1 :- a for each action $a \in A$ and each state $s \in S$. By describing possible transitions for each action in each state as a logic program, an ASP engine can be used to provide a set of observed states $S$, a set of actions $A_s$ for each state and, finally, an action-value function $Q(s, a)$ defined from the interaction with the environment, which can be used to further operate in this environment. This is the essence of the oASP(MDP) method, represented in Algorithm 1.
In order to illustrate oASP(MDP) (Algorithm 1), let us consider the grid world in Figure 1, and an oASP(MDP) agent, initially located at the state “S” (blue cell in the grid), that is capable of executing any action in the following set: {go up, go down, go left, go right}. This grid world has walls (represented by the letter “W”), which are cells that the agent cannot occupy and through which it is unable to pass. If an agent moves toward a wall (or toward an external border of the grid) it stays at its original location. When the interaction with the environment starts, the agent only has information about the set of actions $A$. The set of observed states $S$ is initially empty.
At the beginning of the agent’s interactions with the environment, the agent observes the initial state $s_0$ and verifies if it is in $S$. Since $s_0 \notin S$, the agent adds $s_0$ to $S$ (line 1 of Algorithm 1) and executes a random action, let this action be go up. As a consequence of this choice, the agent moves to a new state $s_1$ (the cell above S) and receives a reward $r_0$. At this moment, the agent has information about the previous state, allowing it to write the choice rule “1{ s1 }1 :- go up, s0.” as an ASP logic program. In this first interaction, the only answer set that can be found for this choice rule is “[s0, go up, s1]”. With this information the agent can initialize a $Q(s_0, \text{go up})$ and update this value using the reward $r_0$ (line 1).
After this first interaction, the agent is in the state $s_1$ (the cell above S). Again, this is an unknown state ($s_1 \notin S$), thus, as with the previous state, the agent adds $s_1$ to $S$, chooses a random action, let it be go up again, and executes this action in the environment. By performing go up in this state, the agent hits a wall and stays in the same state. With this observation, the agent writes the choice rule “1{ s1 }1 :- go up, s1.” and updates the value of $Q(s_1, \text{go up})$ using the received reward $r_1$.
Since the agent is in the same state as in the previous interaction, it knows the consequence of the action go up in this state, but has no information about any other actions for this state. At this moment, the agent selects an action using the action-selection function defined by the learning method and executes it in the environment. For example, let it choose go down, returning to the blue cell (S). The state $s_1$ now has two choice rules, “1{ s1 }1 :- go up, s1.” and “1{ s0 }1 :- go down, s1.”, which lead to the answer sets “[s1, go up, s1]” and “[s1, go down, s0]” respectively. Once again, the agent updates the function $Q(s_1, \text{go down})$ using the chosen learning method with the reward $r_2$ received. After this transition, the agent finds itself once again in the initial state and continues the domain exploration just described. If, for example, the agent chooses to execute the action go up again but, due to the nondeterministic nature of the environment, goes to the state on the right of the blue square, then a new state $s_2$ is observed and the choice rule for the previous state is updated to “1{ s1, s2 }1 :- go up, s0.”. The answer sets that can be found considering this choice rule are “[s0, go up, s1]” and “[s0, go up, s2]”. With the reward $r_3$ received, the agent updates the value of $Q(s_0, \text{go up})$.
The learning process of oASP(MDP) continues according to the chosen action-value function approximation method (from line 1 onwards). After a number of interactions with the environment, the oASP(MDP) agent has executed every possible action in every state that can be visited and has the complete environment description. Note that this method excludes states of the MDP that are unreachable by the agent, which improves the efficiency of an RL agent in cases where the environment imposes state constraints (as we shall see in the next section).
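The bookkeeping described in this walk-through can be sketched as follows. This is a simplified illustration and not Algorithm 1 itself: the class `OnlineStateModel`, its method names, the state labels `s0`, `s1`, `s2` and the underscore action names are hypothetical, and the choice rules are only rendered as strings rather than handed to an ASP engine.

```python
class OnlineStateModel:
    """Hypothetical record of states and transitions observed online."""

    def __init__(self, actions):
        self.actions = tuple(actions)
        self.states = set()   # observed set of states S
        self.outcomes = {}    # (s, a) -> set of observed future states

    def observe_state(self, s):
        """Add s to S; return True if s was previously unknown."""
        is_new = s not in self.states
        self.states.add(s)
        return is_new

    def record_transition(self, s, a, s_next):
        """Extend the head of the choice rule for (s, a) with the observed outcome."""
        self.observe_state(s)
        self.observe_state(s_next)
        self.outcomes.setdefault((s, a), set()).add(s_next)

    def choice_rule(self, s, a):
        """Render the choice rule for (s, a) in the informal notation of the text."""
        heads = ", ".join(sorted(self.outcomes.get((s, a), set())))
        return "1{ %s }1 :- %s, %s." % (heads, a, s)

    def known_pairs(self):
        """Number of state-action pairs executed at least once."""
        return len(self.outcomes)

# Replaying the four interactions of the walk-through.
model = OnlineStateModel(["go_up", "go_down", "go_left", "go_right"])
model.record_transition("s0", "go_up", "s1")    # first move up
model.record_transition("s1", "go_up", "s1")    # blocked by a wall
model.record_transition("s1", "go_down", "s0")  # back to the start
model.record_transition("s0", "go_up", "s2")    # nondeterministic outcome
rule = model.choice_rule("s0", "go_up")
```

Only states that were actually reached ever enter `states`, which mirrors the observation that unreachable states are excluded from the constructed MDP.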
The next section presents the tests applied to evaluate oASP(MDP) implemented with Q-Learning as the action-value function approximation method.
4 Tests and Results
The oASP(MDP) algorithm was evaluated with tests performed in nondeterministic, nonstationary, grid-world domains. Two test sets were considered where, in each set, one of the following domain variables was randomly changed: the number and location of walls in the grid (first test, Section 4.1), and the transition probabilities (second test, Section 4.2).
Four actions were allowed in the test domains considered in this work: go up, go down, go left and go right. Each action has a predefined probability of moving the agent in the desired direction and also of moving the agent to an orthogonal (undesired) location. The transition probability for each action depends on the grid world and will be defined for each test, as described below. In all tests, the initial state was fixed at the lower-leftmost square (e.g., cell ‘S’ in Fig. 1) and the goal state was fixed at the upper-rightmost square (e.g., cell ‘G’ in Fig. 1).
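A possible way to simulate such an action model is sketched below. It is an assumption-laden illustration rather than the environment used in the tests: the coordinate convention, the names `step`, `MOVES` and `ORTHOGONAL`, and the equal split of the remaining probability between the two orthogonal directions are choices made for this example. Blocked moves leave the agent in place, as described for walls and borders in Section 3.

```python
import random

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
ORTHOGONAL = {"up": ("left", "right"), "down": ("left", "right"),
              "left": ("up", "down"), "right": ("up", "down")}

def step(pos, action, walls, size, p_desired, rng=random):
    """Return the next cell after executing `action` from `pos` in a size x size grid."""
    roll = rng.random()
    p_side = (1.0 - p_desired) / 2.0  # assumed equal split between orthogonal moves
    if roll < p_desired:
        direction = action
    elif roll < p_desired + p_side:
        direction = ORTHOGONAL[action][0]
    else:
        direction = ORTHOGONAL[action][1]
    dx, dy = MOVES[direction]
    nxt = (pos[0] + dx, pos[1] + dy)
    # Bumping into a wall or the outer border leaves the agent where it was.
    inside = 0 <= nxt[0] < size and 0 <= nxt[1] < size
    return nxt if inside and nxt not in walls else pos

# Hypothetical usage: a deterministic move from the lower-left corner.
next_cell = step((0, 0), "up", walls=set(), size=10, p_desired=1.0)
```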
In the test domains, walls were distributed randomly in the grid as obstacles. For each grid, the ratio of walls per grid size is defined. The initial and goal states are the only cells that do not accept obstacles. The placement of walls in the grid changed at the 1000th and 2000th episodes during each test trial. An example of a grid used in this work is shown in Figure 1.
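The random placement of walls can be sketched as below. This is a hypothetical helper, not the code used in the experiments: the name `place_walls`, the rounding of the wall count and the use of coordinate pairs are assumptions, and the sketch makes no attempt to guarantee that the goal remains reachable.

```python
import random

def place_walls(size, wall_ratio, start, goal, rng=random):
    """Sample wall cells covering about `wall_ratio` of a size x size grid."""
    # The initial and goal cells never receive an obstacle.
    free_cells = [(x, y) for x in range(size) for y in range(size)
                  if (x, y) not in (start, goal)]
    n_walls = int(round(wall_ratio * size * size))
    return set(rng.sample(free_cells, n_walls))

# Hypothetical usage: a quarter of a 10 x 10 grid occupied by walls.
walls = place_walls(10, 0.25, start=(0, 0), goal=(9, 9), rng=random.Random(0))
```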
Results show the data obtained from executing Q-Learning and oASP(MDP) (with Q-Learning as the action-value function approximation method) in the same environment configuration. The values used for the learning variables were: learning rate , discount factor , exploration/exploitation rate for the greedy action selection method: and the maximum number of steps before an episode is finished was 1000.
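The action selection step can be illustrated with the sketch below, which assumes that the selection method referred to is the usual epsilon-greedy rule. The function name `epsilon_greedy` and the dictionary-based table are hypothetical, and the exploration rate is left as a parameter since no value is assumed here.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    # Unseen state-action pairs are read as 0.0, like a zero-initialised table.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Hypothetical usage: with epsilon = 0 the choice is purely greedy.
Q_example = {("s", "up"): 1.0, ("s", "down"): 2.0}
greedy_action = epsilon_greedy(Q_example, "s", ("up", "down"), epsilon=0.0)
```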
In each test, three variables were used to compare Q-Learning and oASP(MDP). First, the root-mean-square deviation (RMSD), which provides information related to the convergence of the methods by comparing values of the $Q(s, a)$ function in the current episode with those obtained in the previous episode. Second, we considered the return (sum of the rewards) received in an episode. Third, the number of steps needed to go from the initial state to the goal state was evaluated. The results obtained were also compared with those of an agent using the optimal policy in a deterministic grid world (the best performance possible, shown as a red-dashed line in the results below).
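One way to compute such an RMSD between the tables of two consecutive episodes is sketched here. The text does not give the exact formula, so this is an assumption: the deviation is averaged over the union of known state-action pairs and entries missing from one table are read as zero. The name `rmsd` is introduced for illustration.

```python
import math

def rmsd(q_prev, q_curr):
    """Root-mean-square deviation between two Q tables keyed by (state, action)."""
    keys = set(q_prev) | set(q_curr)
    if not keys:
        return 0.0
    # Entries absent from one table are treated as 0.0 (an assumption).
    total = sum((q_curr.get(k, 0.0) - q_prev.get(k, 0.0)) ** 2 for k in keys)
    return math.sqrt(total / len(keys))

# Hypothetical usage with two tiny tables.
deviation = rmsd({("s", "a"): 0.0, ("s", "b"): 0.0},
                 {("s", "a"): 3.0, ("s", "b"): 4.0})
```

A value of zero then means that the table did not change between the two episodes, which is how convergence shows up in this measure.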
For oASP(MDP), the number of state-action pairs known by the agent was also measured and compared with the size of Q-Learning’s fixed tabular implementation. This variable provides information on how far an oASP(MDP) agent is from knowing the complete environment, along with how much the $Q(s, a)$ function could be reduced.
The test domains and related results are described in detail in the next sections.
4.1 First test: changes in the wall–free-space ratio
In the first test, the size of the grid was fixed at $10 \times 10$ and the transition probabilities were set to 90% for moving in the desired direction and 5% for moving in each of the two directions that are orthogonal to the desired one. In this test, changes in the environment occurred in the number and location of walls in the grid. Initially the domain starts with no walls (0%), then it changes to a world where 10% of the grid is occupied by walls placed at random locations and, finally, the grid world changes to a situation where 25% of the grid is occupied by walls. Each change occurs after 1000 episodes.
The results obtained in the first test are represented in Figure 2. Figure 2(a) shows that the RMSD values of oASP(MDP) decrease faster than those of Q-Learning, thus converging to the optimal policy ahead of Q-Learning. It is worth observing that when a change occurs in the environment (at episodes 1000 and 2000) there is no increase in the RMSD values of oASP(MDP), in contrast with the significant increase in Q-Learning’s values. A similar behaviour is shown in Figure 2(c), where there is no change in the number of steps of oASP(MDP) after a change occurs, while the number of steps of Q-Learning increases considerably at that point.
The return values obtained in this test are shown in Figure 2(b), where it can be observed that both oASP(MDP) and Q-Learning reach the maximum value together during the initial episodes, but there is no reduction in the return values of oASP(MDP) when the environment changes, whereas the returns of Q-Learning drop to the initial figures.
Figure 2(d) shows the number of state-action pairs that oASP(MDP) has found for the grid world. Values obtained after the 15th episode were omitted since they presented no variation. This figure shows that oASP(MDP) has explored every state of the grid world and performed every action allowed in each state, resulting in a complete description of the environment. Since oASP(MDP) has provided the complete description of the environment, the agent that uses oASP(MDP) optimizes the same action-value function as the agent that uses Q-Learning, thus the optimal policy found by both agents is the same. Due to the exploration of the environment performed at the beginning of the interaction, before the 10th episode the agent has executed every action in every possible state at least once and, as can be seen in line 1 of Algorithm 1, the agent then uses the underlying RL procedure to find the action-value function.
4.2 Second test: changes in the transition probabilities
In this test, the grid was fixed at a $10 \times 10$ size, with the wall–free-space ratio fixed at 25%. Changes in the environment occurred with respect to the transition probabilities. Initially, the agent’s actions had a 50% probability of moving the agent in the desired direction and 25% of moving it in each of the two orthogonal directions. The first change set the probabilities at 75% (assigned to the desired action effect) and 12.5% (for each of the directions orthogonal to the desired one). The final change assigned 90% for moving in the desired direction and 5% for moving in each of the orthogonal directions.
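The schedule of transition probabilities in this test can be written down as a small lookup. The sketch assumes a zero-based episode index and a change after every 1000 episodes; the function name `transition_probabilities` is hypothetical.

```python
def transition_probabilities(episode):
    """Return (p_desired, p_each_orthogonal) for a zero-based episode index."""
    schedule = [(0.50, 0.25), (0.75, 0.125), (0.90, 0.05)]
    # The environment changes after every 1000 episodes; the last phase persists.
    phase = min(episode // 1000, len(schedule) - 1)
    return schedule[phase]

# Hypothetical usage: probabilities in force right after the first change.
p_desired, p_orthogonal = transition_probabilities(1000)
```

In every phase the desired probability plus twice the orthogonal probability sums to one, so the three outcomes form a complete distribution.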
The RMSD values for oASP(MDP), in this case, decreased faster than those of Q-Learning, reaching zero before the first change occurred, while Q-Learning at that point had not yet converged, as shown in Figure 3(a). Analogously to the first test, there is no change in the RMSD values of oASP(MDP) when the environment changes, whereas Q-Learning presents reinitializations. In the results on return and the number of steps, shown in Figures 3(b) and 3(c) respectively, the performance of oASP(MDP) improves faster than that of Q-Learning when there is a change in the environment. This is explained by the fact that, after oASP(MDP) approximates the action-value function (in the periods between the changes), when a change occurs, the information about it, acquired by the agent, is used to find solutions in the new world situation. In this case, the current action-value function is simply updated. Q-Learning, on the other hand, is restarted each time a change occurs, resulting in the application of an inefficient policy in the new environment.
The number of state-action pairs that oASP(MDP) was able to describe is shown in Figure 3(d). Once more, values obtained after the 15th episode were omitted, as they present no variation after this point. Analogously to the results obtained in the first experiment, oASP(MDP) was capable of executing at least once every allowed action in every state that can be visited. As before, by exploring the environment oASP(MDP) could efficiently find the set of allowed states, defining the complete set $S$.
In summary, the tests performed in the domains considered show that the information previously obtained is beneficial to an agent that learns by interacting with a changing environment. The action-value function obtained by oASP(MDP) before a change occurs accelerates the approximation of this function in a new version of the environment, avoiding the various reinitializations observed in Q-Learning alone (as shown in Figures 2 and 3). However, as the action-value function approximation method used in oASP(MDP) (in this work) was Q-Learning, the policies learnt by oASP(MDP) and Q-Learning alone were analogous. This can be observed when comparing the curves for oASP(MDP) and Q-Learning in Figures 2 and 3 after convergence, noticing also that they keep the same distance with respect to the best performance possible (red-dashed lines in the graphs).
Tests were performed in virtual machines in AWS EC2 with the t2.micro configuration, which provides one virtual core of an Intel Xeon at 2.4GHz, 1GB of RAM and 8GB of SSD, running standard Debian 8 (Jessie). oASP(MDP) was implemented in Python 3.4 using ZeroMQ for message exchange between agent and environment, and Clingo [?] was used as the ASP engine. The source code for the tests can be found at the following (anonymous) URL: http://bit.ly/2k03lkl.
5 Related Work
Previous attempts at combining RL with ASP include [?], which proposes the use of ASP to find a predefined plan for an RL agent. This plan is described as a hierarchical MDP and RL is used to find the optimal policy for this MDP. However, changes in the environment, as used in the present work, were not considered in [?].
Analogous methods were proposed by [?; ?], in which an agent interacts with an environment and updates an action’s cost function. While [?] uses the action language BC, [?] uses ASP to find a description of the environment. Although both methods consider action costs, neither of them uses Reinforcement Learning and they do not deal with changes in the action-value function description during the agent’s interaction with the environment.
An approach to nondeterministic answer set programs is P-log [?; ?]. While P-log is capable of calculating transition probabilities from sampling, it is not capable of using this information to generate policies. Also, P-log does not consider action costs. Thus, although P-log can be used to find the transition function, it cannot find the optimal solution, as proposed here.
Works related to nonstationary MDPs such as [?; ?], which deal only with changes in reward function, are more associated with RL alone than with a hybrid method such as oASP(MDP), since RL methods are already capable of handling changes in the reward and transition functions. The advantage of ASP is to find the set of states so that it is possible to search for an optimal solution regardless of the agent’s transition and reward functions.
A proposal that closely resembles oASP(MDP) is [?]. This method proposes the use of deep learning to find a description of a set of states, which are then described as rules of a probabilistic logic program and, finally, an RL agent interacts with the environment using the results and learns the optimal policy.
6 Conclusion
This paper presented the method oASP(MDP) for approximating action-value functions of Markov Decision Processes, in nonstationary domains, with an unknown set of states and unknown transition and reward functions. This method is defined on a combination of Reinforcement Learning (RL) and Answer Set Programming (ASP). The main advantage of RL is that it does not need a priori knowledge of transition and reward functions, but it relies on having complete knowledge of the set of domain states. In oASP(MDP), ASP is used to construct the set of states of an MDP to be used by an RL algorithm. ASP programs representing domain states and transitions are obtained as the agent interacts with the environment. This provides an efficient solution to finding optimal policies in changing environments.
Tests were performed in two nonstationary nondeterministic grid-world domains, where each domain had one property of the grid world changed over time. In the first domain, the ratio of obstacles to free space in the grid was changed, whereas in the second domain changes occurred in the transition probabilities. The changes happened at intervals of 1000 episodes in both domains. Results show that, when a change occurs, oASP(MDP) (with Q-Learning as the action-value function approximation method) is capable of approximating the function faster than Q-Learning alone. Therefore, the combination of ASP with RL was effective in the definition of a method that provides more general (or more elaboration-tolerant) solutions to changing domains than RL methods alone.
Future work will be directed toward the development of an interface to facilitate the use of oASP(MDP) with distinct domains, such as those provided by the DeepMind Lab [?]. Also, a comparison of oASP(MDP) with the framework proposed in [?] is an interesting subject for future research.
References
 [Baral et al., 2009] Chitta Baral, Michael Gelfond, and Nelson Rushton. Probabilistic reasoning with answer sets. Theory and Practice of Logic Programming, 9(1):57, 2009.
 [Beattie et al., 2016] Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. DeepMind Lab. arXiv preprint arXiv:1612.03801v2 [cs.AI], December 2016.
 [Bellman and Dreyfus, 1971] Richard Ernest Bellman and Stuart E. Dreyfus. Applied dynamic programming. Princeton Univ. Press, 4 edition, 1971.
 [Bellman, 1957] Richard Bellman. A Markovian decision process. Indiana University Mathematics Journal, 6(4):679–684, 1957.
 [Even-Dar et al., 2009] Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
 [Garnelo et al., 2016] Marta Garnelo, Kai Arulkumaran, and Murray Shanahan. Towards deep symbolic reinforcement learning. arXiv preprint arXiv:1609.05518 [cs], September 2016.
 [Gebser et al., 2013] Martin Gebser, Roland Kaminski, and Benjamin Kaufmann. Answer set solving in practice. Morgan & Claypool Publishers, 2013.
 [Gelfond and Lifschitz, 1988] Michael Gelfond and Vladimir Lifschitz. The stable model semantics for logic programming. In Robert Kowalski and Kenneth Bowen, editors, Proceedings of International Logic Programming Conference and Symposium, pages 1070–1080. MIT Press, 1988.
 [Gelfond and Rushton, 2010] Michael Gelfond and Nelson Rushton. Causal and probabilistic reasoning in P-log. Heuristics, Probabilities and Causality. A tribute to Judea Pearl, pages 337–359, 2010.
 [Gelfond, 2008] Michael Gelfond. Answer sets. In Frank van Harmelen, Vladimir Lifschitz, and Bruce Porter, editors, Handbook of Knowledge Representation, pages 285–316. Elsevier, 2008.
 [Khandelwal et al., 2014] Piyush Khandelwal, Fangkai Yang, Matteo Leonetti, Vladimir Lifschitz, and Peter Stone. Planning in action language BC while learning action costs for mobile robots. In Proceedings of the Twenty-Fourth International Conference on Automated Planning and Scheduling, ICAPS 2014, Portsmouth, New Hampshire, USA, June 21-26, 2014.
 [Lifschitz, 2002] Vladimir Lifschitz. Answer set programming and plan generation. Artificial Intelligence, 138(1):39–54, 2002.
 [McCarthy, 1987] John McCarthy. Generality in artificial intelligence. Communications of the ACM, 30(12):1030–1035, 1987.
 [McCarthy, 1998] John McCarthy. Elaboration tolerance. In Proc. of the Fourth Symposium on Logical Formalizations of Commonsense Reasoning (Common Sense 98), volume 98, London, UK, 1998.
 [Sutton and Barto, 2015] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: an introduction – Second edition, in progress (Draft). MIT Press, 2015.
 [Watkins, 1989] Christopher J. C. H. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge, England, 1989.
 [Yang et al., 2014] Fangkai Yang, Piyush Khandelwal, Matteo Leonetti, and Peter Stone. Planning in answer set programming while learning action costs for mobile robots. In AAAI Spring 2014 Symposium on Knowledge Representation and Reasoning in Robotics (AAAI-SSS), 2014.
 [Yu et al., 2009] Jia Yuan Yu, Shie Mannor, and Nahum Shimkin. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737–757, 2009.
 [Zhang et al., 2015] Shiqi Zhang, Mohan Sridharan, and Jeremy L. Wyatt. Mixed logical inference and probabilistic planning for robots in unreliable worlds. IEEE Transactions on Robotics, 31(3):699–713, 2015.