Planning with Learned Binarized Neural Network Transition Models in Factored State and Action Spaces

Parts of this work appeared in preliminary form in Say and Sanner, 2018 Say2018 ().
Abstract
In this paper, we leverage the efficiency of Binarized Neural Networks (BNNs) to learn complex state transition models of planning domains with discretized factored state and action spaces. In order to directly exploit this transition structure for planning, we present two novel compilations of the learned factored planning problem with BNNs based on reductions to Weighted Partial Maximum Boolean Satisfiability (FD-SAT-Plan+) as well as Binary Linear Programming (FD-BLP-Plan+). Theoretically, we show that our SAT-based Bi-Directional Neuron Activation Encoding maintains the generalized arc-consistency property through unit propagation, which is one of the most important properties certifying the efficiency of a SAT-based encoding. Experimentally, we validate the computational efficiency of our Bi-Directional Neuron Activation Encoding in comparison to the existing neuron activation encoding, demonstrate the effectiveness of learning complex transition models with BNNs, and test the runtime efficiency of both FD-SAT-Plan+ and FD-BLP-Plan+ on the learned factored planning problem. Finally, we present a finite-time incremental constraint generation algorithm based on generalized landmark constraints to improve the planning accuracy of our encodings.
keywords:
data-driven planning, binarized neural networks, Weighted Partial Maximum Boolean Satisfiability, Binary Linear Programming

1 Introduction
Deep neural networks have significantly improved the ability of autonomous systems to perform complex tasks, such as image recognition Krizhevsky2012 (), speech recognition Deng2013 () and natural language processing Collobert2011 (), and can outperform humans and human-designed superhuman systems in complex planning tasks such as Go Alphago2016 () and Chess Alphazero2017 ().
In the area of learning and online planning, recent work on HD-MILP-Plan Say2017 () has explored a two-stage framework that (i) learns transition models from data with ReLU-based deep networks and (ii) plans optimally with respect to the learned transition models using mixed-integer linear programming, but did not provide encodings that are able to learn and plan with discrete state variables. As an alternative to ReLU-based deep networks, Binarized Neural Networks (BNNs) Hubara2016 () have been introduced with the specific ability to learn compact models over discrete variables, providing a new formalism for transition learning and planning in factored Boutilier1999 () discretized state and action spaces that we explore in this paper. However, planning with these BNN transition models poses two non-trivial questions: (i) What is the most efficient compilation of BNNs for planning in domains with factored state and (concurrent) action spaces? (ii) Given that BNNs may learn incorrect domain models, how can a planner repair BNN compilations to improve their planning accuracy (or prove that retraining of the BNN is necessary)?
To answer question (i), we present two novel compilations of the learned factored planning problem with BNNs based on reductions to Weighted Partial Maximum Boolean Satisfiability (FD-SAT-Plan+) and Binary Linear Programming (FD-BLP-Plan+). Theoretically, we show that the SAT-based Bi-Directional Neuron Activation Encoding has the generalized arc-consistency property through unit propagation. Experimentally, we demonstrate the computational efficiency of our Bi-Directional Neuron Activation Encoding compared to the existing neuron activation encoding. Then, we test the effectiveness of learning complex state transition models with BNNs, and test the runtime efficiency of both FD-SAT-Plan+ and FD-BLP-Plan+ on the learned factored planning problems over four factored planning domains with multiple size and horizon settings. While there are methods for learning PDDL models from data Yang2007 (); Amir2008 () and excellent PDDL planners Helmert2006 (); Richter2010 (), we remark that BNNs are strictly more expressive than PDDL-based learning paradigms for learning concurrent effects in factored action spaces that may depend on the joint execution of one or more actions.
Furthermore, while Monte Carlo Tree Search (MCTS) methods Kocsis2006 (); Keller2013 () including AlphaGo Alphago2016 () and AlphaGoZero Alphago2016 () could technically plan with a BNN-learned black-box model of transition dynamics, unlike this work, they would not be able to exploit the BNN transition structure and they would not be able to provide optimality guarantees with respect to the learned model.
To answer question (ii), we introduce a finite-time incremental algorithm based on generalized landmark constraints from the decomposition-based cost-optimal classical planner Davies2015 (), where we detect and constrain invalid sets of action selections from the decision space of the planners and efficiently improve their planning accuracy.
In summary, this work provides the first two planners capable of learning complex transition models in domains with mixed (continuous and discrete) factored state and action spaces as BNNs, and capable of exploiting their structure in weighted partial maximum satisfiability (or binary linear optimization) encodings for planning purposes. Theoretically, we show the efficiency of our SAT-based encoding and the incremental algorithm. Empirical results show the computational efficiency of our new Bi-Directional Neuron Activation Encoding, demonstrate strong performance for FD-SAT-Plan+ and FD-BLP-Plan+ in both the learned and original domains, and provide a new transition learning and planning formalism to the data-driven model-based planning community.
2 Preliminaries
Before we present the Weighted Partial Maximum Boolean Satisfiability (WPMaxSAT) and Binary Linear Programming (BLP) compilations of the learned planning problem, we review the preliminaries motivating this work. We begin this section by describing the formal notation and the problem definition that is used in this work.
2.1 Problem Definition
A deterministic factored planning problem is a tuple consisting of: a mixed set of state variables with discrete and continuous domains; a mixed set of action variables with discrete and continuous domains; a function that returns true if the action and state variables satisfy the global constraints; the stationary transition function; and the reward function. Finally, the initial state constraints assign values to all state variables, and the goal state constraints are defined over a subset of the state variables.
Given a planning horizon, a solution (i.e., plan) is a value assignment to the action and state variables such that the transition function, the global constraints over all time steps, and the initial and goal state constraints are all satisfied. Similarly, given a planning horizon, an optimal solution is a plan that maximizes the total reward accumulated over the horizon.
Next, we introduce an example domain for motivating this work.
2.2 Example Domain: Cellda
Influenced by the famous video game The Legend of Zelda Nintendo1986 (), the Cellda domain models an agent in a two-dimensional (4-by-4) dungeon cell. As visualized in Figure 1, the agent Cellda (C) must escape a dungeon through an initially locked door (D) by obtaining its key (K) without getting hit by her enemy (E). The gridworld-like dungeon is made up of two types of cells: i) regular cells (blank), on which Cellda and her enemy can move deterministically up, down, right or left, and ii) blocks (B), through which neither Cellda nor her enemy can walk. The state variables of this domain include two integer variables for describing the location of Cellda, two integer variables for describing the location of the enemy, one Boolean variable for describing whether the key is obtained or not, and one Boolean variable for describing whether Cellda is alive or not. The action variables of this domain include four mutually exclusive Boolean variables for describing the movement of Cellda (i.e., up, down, right or left). The enemy has a deterministic policy that is unknown to Cellda and that tries to minimize the total Manhattan distance between itself and Cellda, breaking symmetry first along the vertical axis. The goal of this domain is to learn the unknown policy of the enemy from previous plays (i.e., data) and escape the dungeon without getting hit. The complete description of this domain can be found in C.
Given that the state transition function describing the location of the enemy is unknown, a planner that ignores the adversarial policy of the enemy E (e.g., as visualized in Figure 1(0(a)–0(c))) will get hit by the enemy, as opposed to a planner that learns the adversarial policy of the enemy E (e.g., as visualized in Figure 1(0(d)–0(f))), which avoids getting hit by waiting for two time steps to trap her enemy, who will try to move up for the remainder of the time steps.
To remedy this problem, next we describe a learning and planning framework that i) learns an unknown transition function from data, and ii) plans optimally with respect to the learned deterministic factored planning problem.
2.3 Factored Planning with DeepNet Learned Transition Models
Factored planning with deep-net learned transition models is a two-stage framework for learning and solving nonlinear factored planning problems, as first introduced in HD-MILP-Plan Say2017 (). Given samples of state transition data, the first stage of the HD-MILP-Plan framework learns the transition function using a deep neural network with Rectified Linear Units (ReLUs) Nair2010 () and linear activation units. In the second stage, the learned transition function is used to construct the learned factored planning problem. That is, the trained deep neural network with fixed weights is used to predict the state at the next time step from the free state and action variables at the current time step. As visualized in Figure 2, the learned transition function is sequentially chained over the horizon and compiled into a Mixed-Integer Linear Program, yielding the planner HD-MILP-Plan Say2017 (). Since HD-MILP-Plan utilizes only ReLUs and linear activation units in its learned transition models, the state variables are restricted to have only continuous domains.
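The sequential chaining of a learned transition function over the horizon can be sketched as follows; `rollout` and `toy_transition` are hypothetical stand-ins for illustration, not the paper's learned network or its MILP compilation.

```python
# Hedged sketch: a learned transition model, once trained, is chained over
# the planning horizon so the predicted state at time t+1 feeds back in as
# the input state of the next step. All names here are illustrative.

def rollout(transition, initial_state, actions):
    """Chain the learned transition over the horizon: s_{t+1} = f(s_t, a_t)."""
    state = initial_state
    trajectory = [state]
    for action in actions:
        state = transition(state, action)
        trajectory.append(state)
    return trajectory

# Toy stand-in for a learned deterministic transition model.
def toy_transition(state, action):
    return state + action

states = rollout(toy_transition, 0, [1, 1, -1])  # [0, 1, 2, 1]
```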
Next, we describe an efficient deep neural network structure for learning discrete models, namely Binarized Neural Networks.
2.4 Binarized Neural Networks
Binarized Neural Networks (BNNs) are neural networks with binary weights and activation functions Hubara2016 (). As a result, BNNs naturally learn discrete models by replacing most arithmetic operations with bitwise operations. BNN layers are stacked in the following order:
Real or Binary Input Layer: Binary units in all layers, with the exception of the first layer, receive binary input. When the input of the first layer has real-valued domains, a fixed number of bits of precision can be used for a practical fixed-point representation Hubara2016 ().
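As an illustration of representing a bounded real-valued input with a fixed number of bits of precision, here is a minimal fixed-point quantization sketch; the exact scaling scheme used in Hubara2016 () may differ, so treat this as an assumption for intuition.

```python
# Hedged sketch: quantize a bounded real input into m bits so it can be fed
# to a BNN input layer, and recover the (quantized) value back.

def to_fixed_point_bits(x, lo, hi, m):
    """Quantize x in [lo, hi] to an m-bit unsigned fixed-point code, MSB first."""
    levels = (1 << m) - 1
    q = min(levels, max(0, round((x - lo) / (hi - lo) * levels)))
    return [(q >> (m - 1 - i)) & 1 for i in range(m)]

def from_fixed_point_bits(bits, lo, hi):
    """Recover the quantized real value from its bit representation."""
    q = sum(b << (len(bits) - 1 - i) for i, b in enumerate(bits))
    return lo + q * (hi - lo) / ((1 << len(bits)) - 1)

bits = to_fixed_point_bits(0.5, 0.0, 1.0, 8)
err = abs(from_fixed_point_bits(bits, 0.0, 1.0) - 0.5)  # within one level
```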
Binarization Layer: Given the input to a binary unit at a given layer, the deterministic activation function used to compute its output returns 1 if the input is nonnegative and -1 otherwise, for every binary unit in every layer of the network.
Batch Normalization Layer: For all layers, Batch Normalization Ioffe2015 () is a method for transforming the weighted sum of the outputs of the previous layer into the input of a binary unit, using parameters that denote the weight, input mean, input variance, numerical stability constant (i.e., epsilon), input scaling and input bias, respectively, all of which are computed at training time.
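Putting the layer types together, one forward pass through a single BNN layer can be sketched as follows, assuming the standard batch normalization formula y = gamma * (x - mu) / sqrt(var + eps) + beta followed by the sign activation; all parameter values below are made up for illustration.

```python
# Hedged sketch of one BNN layer's forward pass: a weighted sum of +-1
# inputs with +-1 weights, batch normalization with parameters fixed at
# training time, and the deterministic sign activation.

def bnn_layer_forward(inputs, weights, mu, var, eps, gamma, beta):
    """inputs and each weight row are lists of +-1; returns list of +-1 outputs."""
    outputs = []
    for j, w_row in enumerate(weights):
        x = sum(w * a for w, a in zip(w_row, inputs))                 # weighted sum
        y = gamma[j] * (x - mu[j]) / (var[j] + eps) ** 0.5 + beta[j]  # batch norm
        outputs.append(1 if y >= 0 else -1)                           # sign activation
    return outputs

out = bnn_layer_forward(
    inputs=[1, -1, 1],
    weights=[[1, 1, -1], [-1, 1, 1]],
    mu=[0.0, 0.0], var=[1.0, 1.0], eps=1e-5,
    gamma=[1.0, 1.0], beta=[0.0, 0.0],
)
```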
2.5 Weighted Partial Maximum Boolean Satisfiability Problem
The Weighted Partial Maximum Boolean Satisfiability Problem (WPMaxSAT) is the problem of finding a value assignment to the variables of a Boolean formula that consists of hard and weighted soft clauses such that i) all hard clauses evaluate to true (i.e., SAT) Davis1960 (), and ii) the total weight of the unsatisfied soft clauses is minimized.

While the theoretical worst-case complexity of WPMaxSAT is NP-complete, state-of-the-art WPMaxSAT solvers have been experimentally shown to scale well on large instances Davies2013 ().
2.6 Boolean Cardinality Constraints
Boolean cardinality constraints describe bounds on the number of Boolean variables that are allowed to be true. Cardinality Networks provide an efficient encoding in conjunctive normal form (CNF) for counting an upper bound on the number of true assignments to Boolean variables, using auxiliary Boolean counting variables Asin2009 (). The detailed CNF encoding is outlined in A. The corresponding upper-bound Boolean cardinality constraint is defined as
(1) 
where is the size of additional input variables.
Similarly, lower-bound Boolean cardinality constraints are encoded given the CNF encoding of the Cardinality Networks that count a lower bound on the number of true assignments to Boolean variables. The detailed CNF encoding is also outlined in A. The corresponding lower-bound Boolean cardinality constraint is defined as follows.
(2) 
Note that an equality cardinality constraint is equivalent to the conjunction of the corresponding upper-bound and lower-bound constraints. Since Cardinality Networks require the bound to be at most half the number of input variables, Boolean cardinality constraints whose bound exceeds this limit must be converted into the complementary form.
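For concreteness, here is a sketch of one CNF encoding of an at-most-k Boolean cardinality constraint: the sequential counter encoding Sinz2005 (), which is simpler to state than Cardinality Networks and appears later in the paper as the baseline encoding. The naive `satisfiable` checker below exists only to validate the generated clauses on small inputs.

```python
from itertools import product

def at_most_k_clauses(x, k):
    """Sequential-counter CNF for x[0] + ... + x[n-1] <= k (Sinz 2005).
    x: positive DIMACS variable ids. Returns (clauses, auxiliary counting vars)."""
    n, top = len(x), max(x)
    s = {(i, j): top + (i - 1) * k + j
         for i in range(1, n) for j in range(1, k + 1)}
    cls = [[-x[0], s[1, 1]]]                      # x1 -> s(1,1)
    cls += [[-s[1, j]] for j in range(2, k + 1)]  # prefix of length 1 counts <= 1
    for i in range(2, n):
        cls.append([-x[i - 1], s[i, 1]])
        cls.append([-s[i - 1, 1], s[i, 1]])
        for j in range(2, k + 1):
            cls.append([-x[i - 1], -s[i - 1, j - 1], s[i, j]])
            cls.append([-s[i - 1, j], s[i, j]])
        cls.append([-x[i - 1], -s[i - 1, k]])     # forbid exceeding k
    cls.append([-x[n - 1], -s[n - 1, k]])
    return cls, sorted(s.values())

def satisfiable(clauses, fixed, free):
    """Naive check: can some assignment to `free` satisfy all clauses?"""
    for bits in product([False, True], repeat=len(free)):
        a = {**fixed, **dict(zip(free, bits))}
        if all(any(a[abs(l)] == (l > 0) for l in c) for c in clauses):
            return True
    return False
```

Brute-forcing all input assignments confirms that the clauses are satisfiable exactly when at most k of the input variables are true.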
Finally, a Boolean cardinality constraint is generalized arc consistent if and only if, for every value assignment to every Boolean variable in the constrained set, there exists a feasible value assignment to all the remaining Boolean variables. In practice, the ability to maintain generalized arc consistency through efficient algorithms such as unit propagation (as opposed to search) is one of the most important properties for the efficiency of a Boolean cardinality constraint encoded in CNF Sinz2005 (); Bailleux2006 (); Asin2009 (); Jabbour2014 (), and the Cardinality Network encoding maintains generalized arc consistency through unit propagation Asin2009 ().
2.7 Binary Linear Programming Problem
The Binary Linear Programming (BLP) problem requires finding the optimal value assignment to the variables of a mathematical model with linear constraints, a linear objective function, and binary decision variables. Similar to WPMaxSAT, the theoretical worst-case complexity of BLP is NP-complete. State-of-the-art BLP solvers IBM2017 () utilize branch-and-bound algorithms and can handle cardinality constraints efficiently.
2.8 Generalized Landmark Constraints
A generalized landmark constraint is a linear inequality over a set of action landmarks, where the counts on actions denote the minimum number of times an action must occur in a plan Davies2015 (). Davies et al. introduced a decomposition-based planner, OpSeq, that incrementally updates generalized landmark constraints to find cost-optimal plans to classical planning problems.
3 Weighted Partial Maximum Boolean Satisfiability Compilation of the Learned Factored Planning Problem
In this section, we show how to reduce the learned factored planning problem with BNNs into WPMaxSAT, which we denote as the Factored Deep SAT Planner (FD-SAT-Plan+).
3.1 Propositional Variables
First, we describe the set of propositional variables used in FD-SAT-Plan+. We use three sets of propositional variables: action variables, state variables and BNN binary units, where variables use a bitwise encoding.

denotes if th bit of action is executed at time .

denotes if th bit of state is true at time .

denotes if BNN binary unit at layer is activated at time .
3.2 Parameters
Next, we define the additional parameters used in FD-SAT-Plan+.

is the initial (i.e., at ) value of the th bit of state variable .

is the function that maps the th bit of a state or an action variable to the corresponding binary unit in the input layer of the BNN such that where .

is the function that maps the th bit of a state variable to the corresponding binary unit in the output layer of the BNN such that where .
The global constraints and goal state constraints are in the form of , and the reward function is in the form of for state and action variables where and .
3.3 The WPMaxSAT Compilation
Below, we define the WPMaxSAT encoding of the learned factored planning problem with BNNs. First, we present the set of hard clauses used in FD-SAT-Plan+.
3.3.1 Initial State Clauses
The following conjunction of hard clauses encode the initial state constraints .
(3) 
where hard clause (3) sets the initial values of the state variables at the first time step.
Next, we describe an efficient CNF encoding to model the activation behaviour of BNN binary unit .
3.3.2 Bi-Directional Neuron Activation Encoding
Given the input, the activation threshold and the binary activation function of a binary neuron, its output can be efficiently encoded in CNF by combining the base hard clauses and the recursive hard clauses of the upper-bound and lower-bound Cardinality Network encodings from A, adding the auxiliary input variables and their respective unit hard clauses, and adding the following bidirectional activation hard clause:
(4) 
where the Boolean variable represents the activation of the binary neuron, being true if and only if the activation threshold is met, and hard clause (4) is a biconditional logical connective between the output of the neuron and its activation function. Intuitively, the conjunctions of hard clauses in the two Cardinality Network encodings together count the number of true inputs, combining the respective lower and upper bounds.
Instead of the Uni-Directional encoding Say2018 (), which utilizes two separate sets of auxiliary Boolean counting variables (i.e., the upper-bound and lower-bound constraints are encoded with two different sets of auxiliary Boolean counting variables), the Bi-Directional encoding shares the same set of decision variables. Further, the previous work Say2018 () uses Sequential Counters Sinz2005 () for encoding the cardinality constraints, which require more variables and hard clauses than the Cardinality Networks used by the Bi-Directional encoding. For notational clarity, we will refer to the conjunction of hard clauses in the Bi-Directional Neuron Activation Encoding collectively.
Next, we will prove that the Bi-Directional Neuron Activation Encoding has the generalized arc-consistency property through unit propagation, which is considered to be one of the most important theoretical properties that certify the efficiency of a Boolean cardinality constraint encoded in CNF Sinz2005 (); Bailleux2006 (); Asin2009 (); Jabbour2014 ().
Definition 0.1 (Generalized Arc-Consistency of Neuron Activation Encoding).
A neuron activation encoding has the generalized arcconsistency property through unit propagation if and only if unit propagation is sufficient to deduce the following:

For any set with size , value assignment to variables and for all assigns the remaining variables from the set to true,

For any set with size value assignment to variables and for all assigns the remaining variables from the set to false,

Partial value assignment of variables from to true assigns variable , and

Partial value assignment of variables from to false assigns variable .
Theorem 1 (Generalized Arc-Consistency of the Bi-Directional Neuron Activation Encoding).
The Bi-Directional Neuron Activation Encoding has the generalized arc-consistency property through unit propagation.
Proof.
To show that the encoding maintains the generalized arc-consistency property through unit propagation, we need to show exhaustively, for all four cases of Definition 0.1, that unit propagation is sufficient to maintain generalized arc-consistency.
Case 1 ( where , and by unit propagation): When , unit propagation assigns using the hard clause . Given uses the same set of variables as (excluding variable which is assigned to true) and value assignment to variables for any set with size , unit propagation will set the remaining variables from the set to true using the conjunction of hard clauses that encode excluding the unit clause () Asin2009 ().
Case 2 ( where , and by unit propagation): When , unit propagation assigns using the hard clause . Given uses the same set of variables as (excluding variable which is assigned to false) and value assignment to variables for any set with size , unit propagation will set the remaining variables from the set to false using the conjunction of hard clauses that encode excluding the unit clause () Asin2009 ().
Case 3 ( where , by unit propagation): When variables from the set are set to true, unit propagation assigns the counting variable using the conjunction of hard clauses that encode excluding the unit clause () Asin2009 (). Given the assignment , unit propagation assigns using the hard clause .
Case 4 ( where , by unit propagation): When variables from the set are set to false, unit propagation assigns the counting variable using the conjunction of hard clauses that encode excluding the unit clause () Asin2009 (). Given the assignment , unit propagation assigns using the hard clause . ∎
3.3.3 BNN Clauses
Given the efficient CNF encoding , we present the conjunction of hard clauses to model the complete BNN model.
(5)  
(6)  
(7)  
(8)  
(9) 
where the activation constant in hard clauses (8)–(9) is computed at training time from the batch normalization parameters of each binary unit. The computation of the activation constant ensures that it is at most half the size of the previous layer, as the Bi-Directional Neuron Activation Encoding only counts up to that bound.
Hard clauses (5)–(6) map the binary units at the input layer of the BNN to a unique state or action variable, respectively. Similarly, hard clause (7) maps the binary units at the output layer of the BNN to a unique state variable. Hard clauses (8)–(9) encode the binary activation of every unit in the BNN.
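Although the paper's exact formula for the activation constant is given in its equations (not reproduced here), one standard derivation converts the batch-norm-plus-sign activation into an integer count threshold over the neuron's binary inputs: with n inputs of value +-1, the pre-activation sum equals 2p - n when p inputs "agree" with their weights. The sketch below follows that logic under the assumption gamma > 0; the parameter values are made up.

```python
import math
from itertools import product

def activation_count_threshold(n, mu, var, eps, gamma, beta):
    """Minimum number of 'agreeing' inputs (those with w_j * a_j = +1, out of
    n binary inputs) for the batch-normalized pre-activation to be >= 0.
    Assumes gamma > 0; for gamma < 0 the inequality direction flips."""
    c = mu - beta * math.sqrt(var + eps) / gamma  # activate iff sum_j w_j a_j >= c
    return max(0, math.ceil((c + n) / 2))         # sum = 2p - n when p inputs agree

# Brute-force check against the batch-norm + sign definition (weights all +1,
# so an input 'agrees' exactly when it is +1). Parameters are made up.
n, mu, var, eps, gamma, beta = 5, 0.3, 2.0, 1e-5, 1.2, -0.4
thr = activation_count_threshold(n, mu, var, eps, gamma, beta)
matches = all(
    ((gamma * (sum(a) - mu) / math.sqrt(var + eps) + beta) >= 0)
    == (sum(1 for v in a if v == 1) >= thr)
    for a in product([-1, 1], repeat=n)
)
```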
3.3.4 Global Constraint Clauses
The following conjunction of hard clauses encode the global constraints .
(10) 
where hard clause (10) represents domain-dependent global constraints on state and action variables. Some common examples of global constraints, such as mutual exclusion on Boolean action variables and one-hot encodings for the output of the BNN (i.e., exactly one Boolean state variable must be true), are respectively encoded by hard clauses (11)–(12) as follows.
(11)  
(12) 
In general, linear global constraints, such as bounds on state and action variables, can be encoded in CNF given positive integer coefficients and decision variables with nonnegative integer domains Abio2014 ().
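The mutual exclusion and one-hot global constraints of hard clauses (11)–(12) can be illustrated with the standard pairwise CNF encoding; this is a sketch for intuition, not necessarily the exact clauses used in the compilation.

```python
from itertools import combinations, product

def mutex_clauses(vs):
    """At most one of vs is true: pairwise negative binary clauses."""
    return [[-a, -b] for a, b in combinations(vs, 2)]

def one_hot_clauses(vs):
    """Exactly one of vs is true: an at-least-one clause plus pairwise mutexes."""
    return [list(vs)] + mutex_clauses(vs)

def holds(clauses, assignment):
    return all(any(assignment[abs(l)] == (l > 0) for l in c) for c in clauses)

# Brute-force check for 3 Boolean action variables with DIMACS-style ids.
vs = [1, 2, 3]
ok = all(
    holds(one_hot_clauses(vs), dict(zip(vs, bits))) == (sum(bits) == 1)
    for bits in product([False, True], repeat=3)
)
```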
3.3.5 Goal State Clauses
The following conjunction of hard clauses encode the goal state constraints .
(13) 
where hard clause (13) sets the goal constraints on the state variables at the final time step.
3.3.6 Reward Clauses
Given that the reward function for each time step is expressed over the action and state variables, the following weighted soft clauses:
(14) 
can be written to represent it, where the weights of the soft clauses correspond to each bit of the action and state variables, respectively.
4 BLP Compilation of the Learned Factored Planning Problem
Given FD-SAT-Plan+, we present the Binary Linear Programming (BLP) compilation of the learned factored planning problem with BNNs, which we denote as the Factored Deep BLP Planner (FD-BLP-Plan+).
4.1 Binary Variables and Parameters
FD-BLP-Plan+ uses the same set of decision variables and parameters as FD-SAT-Plan+.
4.2 The BLP Compilation
FD-BLP-Plan+ replaces hard clauses (3) and (5)–(7) with equivalent linear constraints as follows.
(15)  
(16)  
(17)  
(18) 
Given the activation constant of a binary unit, FD-BLP-Plan+ replaces hard clauses (8)–(9) representing the activation of the binary unit with the following linear constraints:
(19)  
(20) 
where .
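Constraints (19)–(20) linearize neuron activation; one common way such a linearization works is a big-M formulation tying a binary activation variable to an integer count threshold. The following is an assumed illustrative formulation, not necessarily the paper's exact constraints.

```python
def activation_constraints_hold(p, z, thr, n):
    """A big-M linearization of z = [p >= thr], of the kind constraints
    (19)-(20) express. p: number of activated inputs out of n, thr: integer
    count threshold, z: binary activation variable, M: any bound > n."""
    M = n + 1
    return (p - thr + 1 <= M * z) and (thr - p <= M * (1 - z))

# Brute-force check: for every input count p, exactly one value of z
# satisfies both inequalities, and it equals the activation indicator.
n, thr = 5, 3
consistent = all(
    [z for z in (0, 1) if activation_constraints_hold(p, z, thr, n)]
    == [1 if p >= thr else 0]
    for p in range(n + 1)
)
```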
5 Incremental Factored Planning Algorithm for FD-SAT-Plan+ and FD-BLP-Plan+
Given that the plans found for the learned factored planning problem by FD-SAT-Plan+ and FD-BLP-Plan+ can be infeasible for the original factored planning problem, we introduce an incremental algorithm for finding plans by iteratively excluding invalid plans from the search space of FD-SAT-Plan+ and FD-BLP-Plan+. Similar to OpSeq Davies2015 (), FD-SAT-Plan+ and FD-BLP-Plan+ are updated with the following generalized landmark hard clauses or constraints
(22)  
(23) 
respectively, where is the set of bits of actions executed at time at the th iteration of the algorithm outlined below.
For a given horizon, Algorithm 1 iteratively computes a set of actions, or returns infeasibility for the learned factored planning problem. If the set of actions is non-empty, we evaluate whether it is a valid plan for the original factored planning problem (i.e., line 3), either in the actual domain or using a high-fidelity domain simulator – in our case RDDLsim Sanner2010 (). If the set of actions constitutes a plan for the original problem, Algorithm 1 returns it as a plan. Otherwise, the planner is updated with the new set of generalized landmarks to exclude, and the loop repeats. Since the original action space is discretized and represented up to a fixed number of bits of precision, Algorithm 1 can be shown to terminate in a finite number of iterations by constructing an inductive proof similar to the termination criteria of OpSeq, where either a feasible plan is returned or there does not exist a plan to both the original and the learned problems for the given horizon. The outline of the proof can be found in B.
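The incremental loop of Algorithm 1 can be sketched as follows; `toy_solve` and the validator are hypothetical stand-ins for the WPMaxSAT/BLP planner and the domain simulator, not the paper's implementation.

```python
# Hedged sketch of Algorithm 1: solve the learned problem, validate the plan
# against the original domain (or a simulator), and if invalid, exclude that
# exact action selection with a landmark-style cut; repeat until a valid
# plan is found or the learned problem becomes infeasible.

def incremental_plan(solve, is_valid_plan):
    cuts = []                         # generalized-landmark-style cuts so far
    while True:
        plan = solve(cuts)            # plan for the learned problem under cuts
        if plan is None:
            return None               # learned problem infeasible at this horizon
        if is_valid_plan(plan):
            return plan               # valid for the original problem
        cuts.append(frozenset(plan))  # exclude this exact action selection

# Toy stand-ins: a 'solver' that tries candidate plans in order, and a
# validator that only accepts the last candidate.
candidates = [("a",), ("b",), ("a", "b")]
def toy_solve(cuts):
    return next((p for p in candidates if frozenset(p) not in cuts), None)

plan = incremental_plan(toy_solve, lambda p: p == ("a", "b"))
```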
6 Experimental Results
In this section, we evaluate the effectiveness of factored planning with BNNs. First, we present the benchmark domains used to test the efficiency of our learning and factored planning framework with BNNs. Second, we present the accuracy of BNNs for learning complex state transition models for factored planning problems. Third, we test the efficiency and scalability of planning with FD-SAT-Plan+ and FD-BLP-Plan+ on the learned factored planning problems across multiple problem sizes and horizon settings. Finally, we demonstrate the effectiveness of Algorithm 1 in finding a plan for the original factored planning problem.
6.1 Domain Descriptions
The RDDL Sanner2010 () formalism is extended to handle goal specifications and used to describe the problem. Below, we summarize the extended deterministic RDDL domains used in the experiments, namely Navigation Sanner2011 (), Inventory Control (Inventory) Mann2014 (), System Administrator (SysAdmin) Guestrin2001 (); Sanner2011 (), and Cellda Nintendo1986 (). A detailed presentation of the RDDL domains and instances is provided in C.
Navigation
models an agent in a two-dimensional maze with obstacles where the goal of the agent is to move from the initial location to the goal location by the end of the horizon. The transition function describes the movement of the agent as a function of the topological relation of its current location to the maze, the moving direction, and whether the location the agent tries to move to is an obstacle or not. This domain is a deterministic version of its original from IPPC2011 Sanner2011 (). Both the action and the state spaces are Boolean. We report the results on instances with three maze sizes and three horizon settings per maze size.
Inventory
describes the inventory management control problem with alternating demands for a product over time, where the management can order a fixed amount of units to increase the number of units in stock at any given time. The transition function updates the state based on the change in stock as a function of the demand, the time, the current order quantity, and whether an order has been made or not. The action space is Boolean (either order a fixed positive integer amount, or do not order) and the state space is non-negative integer. We report the results on instances with two demand cycle lengths and three horizon settings per demand cycle length.
SysAdmin
models the behavior of a computer network where the administrator can reboot a limited number of computers to keep the number of computers running above a specified safety threshold over time. The transition function describes the status of a computer, which depends on its topological relation to other computers, its age and whether it has been rebooted or not, and the age of the computer, which depends on its current age and whether it has been rebooted or not. This domain is a deterministic modified version of its original from IPPC2011 Sanner2011 (). The action space is Boolean and the state space is non-negative integer, where concurrency between actions is allowed. We report the results on instances with two network sizes and three horizon settings.
Cellda
models an agent in a two-dimensional (4-by-4) dungeon cell. The agent Cellda must escape her cell through an initially locked door by obtaining the key without getting hit by her enemy. Each grid of the cell is of one of two types: i) regular grids, on which Cellda and her enemy can move deterministically up, down, right or left, and ii) blocks, on which neither Cellda nor her enemy can stand. The enemy has a deterministic policy that is unknown to Cellda and that tries to minimize the total Manhattan distance between itself and Cellda. Given the locations of Cellda and the enemy, the adversarial deterministic policy will always try to minimize the distance between the two along a preferred axis. The state space is mixed: integer variables describe the locations of Cellda and the enemy, and Boolean variables describe whether the key is obtained or not and whether Cellda is alive or not. The action space is Boolean, for moving up, down, right or left. The transition function updates states as a function of the previous locations of Cellda and the enemy, the moving direction of Cellda, whether the key was obtained or not, and whether Cellda was alive or not. We report results on instances with two adversarial deterministic policies and three horizon settings per policy.
6.2 Transition Learning Performance
In Table 1, we present test errors for different configurations of the BNNs on each domain instance, where the sample data was generated from the RDDL-based domain simulator RDDLsim Sanner2010 () using a simple stochastic exploration policy. For each instance of a domain, state transition pairs were collected and the data was treated as independent and identically distributed. After random permutation, the data was split into training and test sets with a 9:1 ratio. The BNNs were trained on a MacBook Pro with a 2.8 GHz Intel Core i7 and 16 GB of memory using publicly available code Hubara2016 (). Overall, the Navigation instances required the smallest BNN structures for learning due to their purely Boolean state and action spaces, while the Inventory, SysAdmin and Cellda instances required larger BNN structures for accurate learning, owing to their non-Boolean state and action spaces.
Domain  Network Structure  Test Error (%) 

Navigation(3)  13:36:36:9  0.0 
Navigation(4)  20:96:96:16  0.0 
Navigation(5)  29:128:128:25  0.0 
Inventory(2)  7:96:96:5  0.018 
Inventory(4)  8:128:128:5  0.34 
SysAdmin(4)  16:128:128:12  2.965 
SysAdmin(5)  20:128:128:128:15  0.984 
Cellda(x)  12:128:128:4  0.645 
Cellda(y)  12:128:128:4  0.65 
6.3 Planning Performance on the Learned Factored Planning Problems
In this section, we present the results of two computational comparisons. First, we test the efficiency of the Bi-Directional Neuron Activation Encoding against the existing neuron activation encoding to select the best WPMaxSAT-based encoding for FD-SAT-Plan+. Second, we compare the effectiveness of the selected WPMaxSAT-based encoding against a BLP-based encoding, namely FD-SAT-Plan+ and FD-BLP-Plan+, in finding plans for the learned factored planning problem. We ran the experiments on a MacBook Pro with a 2.8 GHz Intel Core i7 and 16 GB of memory. For FD-SAT-Plan+ and FD-BLP-Plan+, we used MaxHS Davies2013 () with the underlying LP solver CPLEX 12.7.1 IBM2017 (), and CPLEX 12.7.1 IBM2017 (), respectively, with a 1-hour total time limit per domain instance.
6.3.1 Comparison of neuron activation encodings
The runtime efficiency of both neuron activation encodings is tested on the learned factored planning problems over 27 problem instances, where we compare our Bi-Directional encoding, which utilizes Cardinality Networks Asin2009 (), against the previous Uni-Directional encoding Say2018 (), which uses Sequential Counters Sinz2005 ().
Figure 3 visualizes the runtime comparison of both neuron activation encodings. Inspection of Figure 3 clearly demonstrates that FDSATPlan+ with Cardinality Networks and the BiDirectional Encoding significantly outperforms FDSATPlan+ with Sequential Counters and the UniDirectional Encoding in all problem instances due to (i) its smaller encoding size (i.e., versus ) with respect to both the number of variables and the number of hard clauses used, and (ii) its generalized arc-consistency property. Therefore, we use FDSATPlan+ with Cardinality Networks and the BiDirectional Encoding in the remaining experiments and omit the results for FDSATPlan+ with the UniDirectional Encoding and Sequential Counters.
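For concreteness, the Sequential Counter construction Sinz2005 () used by the UniDirectional baseline can be sketched as a CNF generator over DIMACS-style integer literals. This is an illustrative implementation of the standard at-most-k encoding, not the exact code inside FDSATPlan+.

```python
def at_most_k_seq(xs, k, top):
    """Sequential-counter CNF encoding (Sinz 2005) of sum(xs) <= k.

    xs:  input literals (DIMACS-style positive/negative ints)
    k:   the cardinality bound (k >= 1)
    top: the largest variable index already in use

    Returns (clauses, new_top), where s(i, j) are fresh auxiliary
    variables meaning "at least j+1 of the first i+1 inputs are true".
    Illustrative helper; interface names are assumptions."""
    n = len(xs)
    clauses = []
    s = [[top + i * k + j + 1 for j in range(k)] for i in range(n)]
    top += n * k
    clauses.append([-xs[0], s[0][0]])          # x1 -> s(1,1)
    for j in range(1, k):
        clauses.append([-s[0][j]])             # at most one of the first input
    for i in range(1, n):
        clauses.append([-xs[i], s[i][0]])      # xi -> s(i,1)
        clauses.append([-s[i - 1][0], s[i][0]])  # counter is monotone
        for j in range(1, k):
            clauses.append([-xs[i], -s[i - 1][j - 1], s[i][j]])
            clauses.append([-s[i - 1][j], s[i][j]])
        clauses.append([-xs[i], -s[i - 1][k - 1]])  # forbid overflowing k
    return clauses, top
```

The BiDirectional Cardinality Network encoding replaces this counter chain with a merge-sort-style network and adds the reverse implications, which is what yields the smaller size and the generalized arc-consistency property discussed above.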
6.3.2 Comparison of FDSATPlan+ and FDBLPPlan+
Next, we test the runtime efficiency of FDSATPlan+ and FDBLPPlan+ for solving the learned factored planning problem.
Domains            | FDSATPlan+ | FDBLPPlan+
-------------------|------------|-----------
Navigation         | 529.11     | 1282.82
Inventory          | 54.88      | 0.54
SysAdmin           | 1627.35    | 3006.27
Cellda             | 344.03     | 285.45
Coverage           | 27/27      | 20/27
Optimality Proved  | 25/27      | 19/27
In Table 2, we present a summary of the computational results, including the average runtimes in seconds, the total number of instances for which a feasible solution was returned (i.e., coverage), and the total number of instances for which an optimal solution was returned (i.e., optimality proved), for both FDSATPlan+ and FDBLPPlan+ over all four domains for the learned factored planning problem within the 1 hour time limit. The analysis of Table 2 shows that FDSATPlan+ covers all problem instances by returning an incumbent solution to the learned factored planning problem, compared to FDBLPPlan+, which ran out of the 1 hour time limit in 7 out of 27 instances before finding an incumbent solution. Similarly, FDSATPlan+ proves the optimality of the solutions found in 25 out of 27 problem instances, compared to FDBLPPlan+, which only proves the optimality of 19 out of 27 solutions within the 1 hour time limit.
In Figure 4, we compare the runtime performance of FDSATPlan+ (x-axis) and FDBLPPlan+ (y-axis) per instance, labeled by domain. The analysis of Figure 4 across all 27 instances shows that FDBLPPlan+ proved the optimality of problem instances from domains with lower computational demand (e.g., Inventory) more efficiently than FDSATPlan+. In contrast, FDSATPlan+ proved the optimality of problem instances from domains with higher computational demand (e.g., SysAdmin) more efficiently than FDBLPPlan+. As the instances got harder to solve, FDBLPPlan+ timed out more often than FDSATPlan+, mainly due to its inability to find incumbent solutions, as evident from Table 2.
The detailed inspection of Figure 4 and Tables 1 and 2 shows that the computational effort required to solve the benchmark instances increases significantly more for FDBLPPlan+ than for FDSATPlan+ as the learned BNN structure gets more complex (i.e., from the smallest BNN structure of Inventory, to the moderate-size BNN structures of Navigation and Cellda, to the largest BNN structure of SysAdmin). A detailed presentation of the runtime results for each instance is provided in Appendix D.
6.4 Planning Performance on the Factored Planning Problems
Finally, we test the planning efficiency of the incremental factored planning algorithm for solving the factored planning problem.
Domains            | FDSATPlan+ | FDBLPPlan+
-------------------|------------|-----------
Navigation         | 529.11     | 1282.82
Inventory          | 68.88      | 0.66
SysAdmin           | 2463.79    | 3006.27
Cellda             | 512.51     | 524.53
Coverage           | 23/27      | 19/27
Optimality Proved  | 23/27      | 19/27
In Table 3, we present a summary of the computational results, including the average runtimes in seconds, the total number of instances for which a feasible solution was returned (i.e., coverage), and the total number of instances for which an optimal solution was returned (i.e., optimality proved), for both FDSATPlan+ and FDBLPPlan+ using Algorithm 1 over all four domains for the factored planning problem within the 1 hour time limit. The analysis of Table 3 shows that FDSATPlan+ with Algorithm 1 covers 23 out of 27 problem instances by returning an incumbent solution to the factored planning problem, compared to FDBLPPlan+ with Algorithm 1, which ran out of the 1 hour time limit in 8 out of 27 instances before finding an incumbent solution. Similarly, FDSATPlan+ with Algorithm 1 proves the optimality of the solutions found in 23 out of 27 problem instances, compared to FDBLPPlan+ with Algorithm 1, which only proves the optimality of 19 out of 27 solutions within the 1 hour time limit.
Figures 4(a) and 4(b) visualize the comparative runtime performance of Algorithm 1 with (i) FDSATPlan+ (orange/red) and (ii) FDBLPPlan+ (green/blue) per domain, where the additional computational requirement of solving the factored planning problem is stacked on top of the computational requirement of solving the learned factored planning problem. The detailed inspection of Figures 4(a), 4(b), and 4(d) demonstrates that the constraint generation algorithm successfully verified the plans found for the factored planning problem in three out of four domains with low computational cost. In contrast, the incremental factored planning algorithm spent significantly more time in the SysAdmin domain, as evident in Figure 4(c). Over all instances, we observed that at most 5 instances required constraint generation to find a plan, where the maximum number of constraints required was at least 6, namely for the (Sys,4,3) instance. A detailed presentation of the runtime results and the number of generalized landmark constraints generated for each instance is provided in Appendix D.
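Algorithm 1 itself is not reproduced in this section, but its overall control flow can be sketched as follows. This is a minimal sketch under assumptions: `solve` and `verify` are placeholders for solving the learned factored planning problem (via FDSATPlan+ or FDBLPPlan+) and for validating a candidate plan against the true domain, and a generalized landmark constraint is abstracted here as excluding a previously returned set of action assignments.

```python
def incremental_plan(solve, verify, max_iters=10):
    """Sketch of an incremental constraint-generation loop: repeatedly
    solve the learned problem, verify the returned plan, and, on failure,
    add a (generalized landmark) constraint excluding that plan's action
    assignments before re-solving. Finite-time because only finitely many
    distinct action-assignment sets exist.

    solve(excluded) -> plan or None; verify(plan) -> bool. Both are
    hypothetical placeholders, not the paper's actual interfaces."""
    excluded = []  # accumulated constraints (excluded action-assignment sets)
    for _ in range(max_iters):
        plan = solve(excluded)
        if plan is None:
            return None          # learned problem infeasible under constraints
        if verify(plan):
            return plan          # plan is also valid in the true domain
        excluded.append(frozenset(plan))  # forbid this plan and re-solve
    return None
```

The stacked bars of Figures 4(a) and 4(b) correspond to the extra `solve` calls this loop issues beyond the first one.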
7 Conclusion
In this work, we utilized the efficiency and representational ability of BNNs to learn complex state transition models of factored planning domains with discretized state and action spaces. We introduced two novel compilations, a WPMaxSAT encoding (FDSATPlan+) and a BLP encoding (FDBLPPlan+), that directly exploit the structure of BNNs to plan for the learned factored planning problem and that provide optimality guarantees with respect to the learned model if they successfully terminate. Theoretically, we have shown that our SAT-based BiDirectional Neuron Activation Encoding has the generalized arc-consistency property, which is one of the most important efficiency certificates of a SAT-based encoding.
We further introduced a finite-time incremental factored planning algorithm based on generalized landmark constraints that improves the planning accuracy of both FDSATPlan+ and FDBLPPlan+. Experimentally, we demonstrated the computational efficiency of our BiDirectional Neuron Activation Encoding in comparison to the existing neuron activation encoding. Overall, our empirical results showed that we can accurately learn complex state transition models using BNNs, and demonstrated strong performance in both the learned and original domains. In sum, this work provides a novel and effective factored state and action transition learning and planning formalism to the data-driven model-based planning community.
Appendices
Appendix A CNF Encoding of the Cardinality Networks
The CNF encoding of Cardinality Networks () is as follows Asin2009 ().
Half Merging Networks
Given two sequences of Boolean variables ⟨a_1, …, a_n⟩ and ⟨b_1, …, b_n⟩, Half Merging (HM) Networks merge the two inputs into a single sequence ⟨c_1, …, c_{2n}⟩ of size 2n using the CNF encoding as follows.
For input size n = 1:

(¬a_1 ∨ c_1) ∧ (¬b_1 ∨ c_1) ∧ (¬a_1 ∨ ¬b_1 ∨ c_2)  (24)

For input size n = 2m, HM(⟨a_1, …, a_{2m}⟩, ⟨b_1, …, b_{2m}⟩) = ⟨c_1, …, c_{4m}⟩ is defined recursively over the odd- and even-indexed subsequences:

⟨d_1, …, d_{2m}⟩ = HM(⟨a_1, a_3, …, a_{2m−1}⟩, ⟨b_1, b_3, …, b_{2m−1}⟩)  (25)
⟨e_1, …, e_{2m}⟩ = HM(⟨a_2, a_4, …, a_{2m}⟩, ⟨b_2, b_4, …, b_{2m}⟩)  (26)
c_1 = d_1,  c_{4m} = e_{2m}  (27)
(¬d_{i+1} ∨ ¬e_i ∨ c_{2i+1}) ∧ (¬d_{i+1} ∨ c_{2i}) ∧ (¬e_i ∨ c_{2i})  for all 1 ≤ i ≤ 2m−1  (28)
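A minimal clause generator for the HM construction of Asin2009 () can be sketched as follows, assuming DIMACS-style integer literals and emitting only the unidirectional (forward) clauses; the BiDirectional encoding would additionally assert the reverse implications. Interface names are illustrative.

```python
def hmerge(a, b, fresh):
    """Recursively generate the Half Merging network clauses.

    a, b:  equal-length lists of literals (length a power of two),
           each assumed sorted in unary (ones before zeros)
    fresh: iterator yielding unused variable indices

    Returns (outputs, clauses). Illustrative sketch, not FDSATPlan+ code."""
    if len(a) == 1:
        c1, c2 = next(fresh), next(fresh)
        # a1 -> c1, b1 -> c1, (a1 and b1) -> c2
        return [c1, c2], [[-a[0], c1], [-b[0], c1], [-a[0], -b[0], c2]]
    d, cd = hmerge(a[0::2], b[0::2], fresh)   # merge odd-indexed inputs
    e, ce = hmerge(a[1::2], b[1::2], fresh)   # merge even-indexed inputs
    m = len(a)                                # input size
    c = [d[0]] + [None] * (2 * m - 2) + [e[-1]]
    clauses = cd + ce
    for i in range(1, m):                     # interleave d and e
        c[2 * i - 1], c[2 * i] = next(fresh), next(fresh)
        clauses += [[-d[i], -e[i - 1], c[2 * i]],
                    [-d[i], c[2 * i - 1]],
                    [-e[i - 1], c[2 * i - 1]]]
    return c, clauses
```

With k true literals among the (unary-sorted) inputs, unit propagation over these clauses forces the first k outputs to true, which is the one-directional counting behavior the Sequential Counter baseline also provides.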
Half Sorting Networks
Given a sequence of Boolean variables ⟨a_1, …, a_{2n}⟩, Half Sorting (HS) Networks sort the variables with respect to their value assignment as follows.
For input size 2: