Planning with Learned Binarized Neural Network Transition Models in Factored State and Action Spaces[1]

[1] Parts of this work appeared in preliminary form in Say and Sanner, 2018 Say2018 ().

Buser Say, Scott Sanner
{bsay,ssanner}@mie.utoronto.ca
Department of Mechanical & Industrial Engineering, University of Toronto, Canada
Abstract

In this paper, we leverage the efficiency of Binarized Neural Networks (BNNs) to learn complex state transition models of planning domains with discretized factored state and action spaces. In order to directly exploit this transition structure for planning, we present two novel compilations of the learned factored planning problem with BNNs based on reductions to Weighted Partial Maximum Boolean Satisfiability (FD-SAT-Plan+) as well as Binary Linear Programming (FD-BLP-Plan+). Theoretically, we show that our SAT-based Bi-Directional Neuron Activation Encoding maintains the generalized arc-consistency property through unit propagation, one of the most important properties that certify the efficiency of a SAT-based encoding. Experimentally, we validate the computational efficiency of our Bi-Directional Neuron Activation Encoding in comparison to the existing neuron activation encoding, demonstrate the effectiveness of learning complex transition models with BNNs, and test the runtime efficiency of both FD-SAT-Plan+ and FD-BLP-Plan+ on the learned factored planning problem. Finally, we present a finite-time incremental constraint generation algorithm based on generalized landmark constraints to improve the planning accuracy of our encodings.

keywords:
data-driven planning, binarized neural networks, Weighted Partial Maximum Boolean Satisfiability, Binary Linear Programming
journal: AIJ

1 Introduction

Deep neural networks have significantly improved the ability of autonomous systems to perform complex tasks, such as image recognition Krizhevsky2012 (), speech recognition Deng2013 () and natural language processing Collobert2011 (), and can outperform humans and human-designed super-human systems in complex planning tasks such as Go Alphago2016 () and Chess Alphazero2017 ().

In the area of learning and online planning, recent work on HD-MILP-Plan Say2017 () has explored a two-stage framework that (i) learns transition models from data with ReLU-based deep networks and (ii) plans optimally with respect to the learned transition models using mixed-integer linear programming, but it did not provide encodings that are able to learn and plan with discrete state variables. As an alternative to ReLU-based deep networks, Binarized Neural Networks (BNNs) Hubara2016 () have been introduced with the specific ability to learn compact models over discrete variables, providing a new formalism for transition learning and planning in factored Boutilier1999 () discretized state and action spaces that we explore in this paper. However, planning with these BNN transition models poses two non-trivial questions: (i) What is the most efficient compilation of BNNs for planning in domains with factored state and (concurrent) action spaces? (ii) Given that BNNs may learn incorrect domain models, how can a planner repair BNN compilations to improve their planning accuracy (or prove that re-training of the BNN is necessary)?

To answer question (i), we present two novel compilations of the learned factored planning problem with BNNs based on reductions to Weighted Partial Maximum Boolean Satisfiability (FD-SAT-Plan+) and Binary Linear Programming (FD-BLP-Plan+). Theoretically, we show that the SAT-based Bi-Directional Neuron Activation Encoding has the generalized arc-consistency property through unit propagation. Experimentally, we demonstrate the computational efficiency of our Bi-Directional Neuron Activation Encoding compared to the existing neuron activation encoding. Then, we test the effectiveness of learning complex state transition models with BNNs, and test the runtime efficiency of both FD-SAT-Plan+ and FD-BLP-Plan+ on the learned factored planning problems over four factored planning domains with multiple size and horizon settings. While there are methods for learning PDDL models from data Yang2007 (); Amir2008 () and excellent PDDL planners Helmert2006 (); Richter2010 (), we remark that BNNs are strictly more expressive than PDDL-based learning paradigms for learning concurrent effects in factored action spaces that may depend on the joint execution of one or more actions.

Furthermore, while Monte Carlo Tree Search (MCTS) methods Kocsis2006 (); Keller2013 (), including AlphaGo Alphago2016 () and AlphaGoZero Alphago2016 (), could technically plan with a BNN-learned black-box model of the transition dynamics, unlike this work they could neither exploit the BNN transition structure nor provide optimality guarantees with respect to the learned model.

To answer question (ii), we introduce a finite-time incremental algorithm based on generalized landmark constraints from the decomposition-based cost-optimal classical planner Davies2015 (), where we detect and constrain invalid sets of action selections from the decision space of the planners and efficiently improve their planning accuracy.

In summary, this work provides the first two planners capable of learning complex transition models in domains with mixed (continuous and discrete) factored state and action spaces as BNNs, and capable of exploiting their structure in Weighted Partial Maximum Boolean Satisfiability (or Binary Linear Programming) encodings for planning purposes. Theoretically, we show the efficiency of our SAT-based encoding and of the incremental algorithm. Empirical results show the computational efficiency of our new Bi-Directional Neuron Activation Encoding, demonstrate strong performance for FD-SAT-Plan+ and FD-BLP-Plan+ in both the learned and original domains, and provide a new transition learning and planning formalism to the data-driven model-based planning community.

2 Preliminaries

Before we present the Weighted Partial Maximum Boolean Satisfiability (WP-MaxSAT) and Binary Linear Programming (BLP) compilations of the learned planning problem, we review the preliminaries motivating this work. We begin this section by describing the formal notation and the problem definition used in this work.

2.1 Problem Definition

A deterministic factored planning problem is a tuple $\Pi = \langle S, A, C, T, I, G, R \rangle$ where $S$ is a mixed set of state variables with discrete and continuous domains, $A$ is a mixed set of action variables with discrete and continuous domains, $C(\bar{A}, \bar{S})$ is a function that returns true if and only if the value assignments $\bar{A}$ and $\bar{S}$ to the action and state variables satisfy the global constraints, $T(\bar{A}, \bar{S})$ denotes the stationary transition function, and $R(\bar{A}, \bar{S})$ is the reward function. Finally, $I$ is the set of initial state constraints that assign values to all state variables $s \in S$, and $G$ is the set of goal state constraints over a subset of the state variables.

Given a planning horizon $H$, a solution (i.e., plan) to $\Pi$ is a value assignment $\bar{A}^t$ to the action variables and $\bar{S}^t$ to the state variables such that $\bar{S}^{t+1} = T(\bar{A}^t, \bar{S}^t)$ and the global constraints $C(\bar{A}^t, \bar{S}^t)$ hold over all time steps $t \in \{1, \dots, H\}$, and the initial and goal state constraints are satisfied such that $I(\bar{S}^1)$ and $G(\bar{S}^{H+1})$ hold, respectively. Similarly, given a planning horizon $H$, an optimal solution to $\Pi$ is a plan that maximizes the total reward $\sum_{t=1}^{H} R(\bar{A}^t, \bar{S}^{t+1})$.
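To make the definition concrete, the following is a minimal Python sketch (not part of the planners themselves) of what it means to be a plan and an optimal plan candidate, with hypothetical callables T, C, I, G and R standing in for the transition, global-constraint, initial-state, goal and reward functions defined above.

def is_plan(actions, T, C, I, G, s1):
    # Check the initial state constraints at t = 1.
    if not I(s1):
        return False
    s = s1
    for a in actions:          # time steps t = 1..H
        if not C(a, s):        # global constraints must hold at every step
            return False
        s = T(a, s)            # deterministic transition to t + 1
    return G(s)                # goal constraints at t = H + 1

def total_reward(actions, T, R, s1):
    # Total reward of a candidate plan; an optimal plan maximizes this value.
    s, total = s1, 0.0
    for a in actions:
        s = T(a, s)
        total += R(a, s)       # reward R over the pair (action, next state)
    return total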

Next, we introduce an example domain for motivating this work.

2.2 Example Domain: Cellda

[Figure 1 shows six 4-by-4 maze panels with Cellda (C), the key (K), the door (D), blocks (B) and the enemy (E) at times t=0, t=2 and t=8 for each of two plans.]

Figure 1: Visualization of the Cellda domain in a 4-by-4 maze for two plans (panels a-c and panels d-f). A plan that ignores the adversarial policy of the enemy E (panels a-c) will get hit by the enemy, as opposed to a plan that takes into account the adversarial policy of the enemy E (panels d-f). With the latter plan, Cellda avoids getting hit by waiting for two time steps to trap her enemy, who will try to move up for the remaining time steps.

Influenced by the famous video game The Legend of Zelda Nintendo1986 (), the Cellda domain models an agent in a two-dimensional (4-by-4) dungeon cell. As visualized in Figure 1, the agent Cellda (C) must escape the dungeon through an initially locked door (D) by obtaining its key (K) without getting hit by her enemy (E). The gridworld-like dungeon is made up of two types of cells: i) regular cells (blank), from/to which Cellda and her enemy can deterministically move up, down, right or left, and ii) blocks (B), through which neither Cellda nor her enemy can walk. The state variables of this domain include two integer variables for the location of Cellda, two integer variables for the location of the enemy, one Boolean variable for whether the key has been obtained, and one Boolean variable for whether Cellda is alive. The action variables of this domain include four mutually exclusive Boolean variables for the movement of Cellda (i.e., up, down, right or left). The enemy has a deterministic policy that is unknown to Cellda: it tries to minimize the total Manhattan distance between itself and Cellda, breaking the symmetry first along the vertical axis. The goal of this domain is to learn the unknown policy of the enemy from previous plays (i.e., data) and escape the dungeon without getting hit. The complete description of this domain can be found in Appendix C.
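As an illustration of the kind of policy to be learned, the following is a minimal Python sketch of a greedy Manhattan-distance policy that breaks symmetry along the vertical axis first; the coordinate convention and the helper blocked(x, y) are hypothetical, and the exact policy used in the benchmark is specified in Appendix C.

def enemy_move(ex, ey, cx, cy, blocked):
    # Greedy step toward Cellda at (cx, cy) from the enemy at (ex, ey):
    # prefer the vertical axis when both axes reduce the Manhattan distance.
    candidates = []
    if ey != cy:
        candidates.append((ex, ey + (1 if cy > ey else -1)))
    if ex != cx:
        candidates.append((ex + (1 if cx > ex else -1), ey))
    for nx, ny in candidates:
        if not blocked(nx, ny):    # blocks can be neither entered nor crossed
            return nx, ny
    return ex, ey                  # no distance-reducing move is available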

Given that the state transition function describing the location of the enemy is unknown, a planner that ignores the adversarial policy of the enemy E (e.g., as visualized in Figure 1, panels a-c) will get hit by the enemy, as opposed to a planner that learns the adversarial policy of the enemy E (e.g., as visualized in Figure 1, panels d-f), which avoids getting hit by waiting for two time steps to trap the enemy, who will try to move up for the remaining time steps.

To remedy this problem, we next describe a learning and planning framework that i) learns an unknown transition function from data, and ii) plans optimally with respect to the learned deterministic factored planning problem.

2.3 Factored Planning with Deep-Net Learned Transition Models

Factored planning with deep-net learned transition models is a two-stage framework for learning and solving nonlinear factored planning problems, as first introduced in HD-MILP-Plan Say2017 (). Given samples of state transition data, the first stage of the HD-MILP-Plan framework learns the transition function $\tilde{T}$ using a deep neural network with Rectified Linear Units (ReLUs) Nair2010 () and linear activation units. In the second stage, the learned transition function $\tilde{T}$ is used to construct the learned factored planning problem $\tilde{\Pi} = \langle S, A, C, \tilde{T}, I, G, R \rangle$. That is, the trained deep neural network with fixed weights is used to predict the state at time $t+1$ from the free state and action variables at time $t$ such that $\bar{S}^{t+1} = \tilde{T}(\bar{A}^t, \bar{S}^t)$. As visualized in Figure 2, the learned transition function $\tilde{T}$ is sequentially chained over the horizon $H$ and compiled into a Mixed-Integer Linear Program, yielding the planner HD-MILP-Plan Say2017 (). Since HD-MILP-Plan utilizes only ReLUs and linear activation units in its learned transition models, the state variables are restricted to have only continuous domains.

Figure 2: Visualization of HD-MILP-Plan Say2017 (), where blue circles represent state variables, red circles represent action variables, gray circles represent hidden units (i.e., ReLUs and linear activation units) and w represents the weights of the deep neural network. During the learning stage, the weights w are learned from data. In the planning stage, the weights are fixed and HD-MILP-Plan optimizes a given reward function with respect to the free action and state variables.

Next, we describe an efficient deep neural network structure for learning discrete models, namely Binarized Neural Networks.

2.4 Binarized Neural Networks

Binarized Neural Networks (BNNs) are neural networks with binary weights and activation functions Hubara2016 (). As a result, BNNs naturally learn discrete models by replacing most arithmetic operations with bit-wise operations. BNN layers are stacked in the following order:

Real or Binary Input Layer: Binary units in all layers, with the exception of the first layer, receive binary input. When the input of the first layer has real-valued domains, a fixed number of bits of precision can be used for a practical approximate representation Hubara2016 ().

Binarization Layer: Given input $x_{j,l}$ of binary unit $j \in J(l)$ at layer $l$, the deterministic activation function used to compute its output $y_{j,l}$ is: $y_{j,l} = 1$ if $x_{j,l} \geq 0$, and $y_{j,l} = -1$ otherwise, where $L$ denotes the number of layers and $J(l)$ denotes the set of binary units in layer $l \in \{1, \dots, L\}$.

Batch Normalization Layer: For all layers $l > 1$, Batch Normalization Ioffe2015 () is a method for transforming the weighted sum $\Delta_{j,l} = \sum_{i \in J(l-1)} w_{i,j,l} \, y_{i,l-1}$ of outputs at layer $l-1$ into the input $x_{j,l}$ of binary unit $j$ at layer $l$ such that: $x_{j,l} = \frac{\Delta_{j,l} - \mu_{j,l}}{\sqrt{\sigma^2_{j,l} + \epsilon_{j,l}}} \gamma_{j,l} + \beta_{j,l}$, where parameters $w_{i,j,l}$, $\mu_{j,l}$, $\sigma^2_{j,l}$, $\epsilon_{j,l}$, $\gamma_{j,l}$ and $\beta_{j,l}$ denote the weight, input mean, input variance, numerical stability constant (i.e., epsilon), input scaling and input bias, respectively, and all parameters are computed at training time.
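For illustration, the following is a minimal NumPy sketch of a single BNN layer as stacked above (weighted sum of the previous layer's +/-1 outputs, batch normalization with parameters fixed at training time, then sign binarization); the shapes and parameter values are hypothetical.

import numpy as np

def bnn_layer(y_prev, W, mu, var, eps, gamma, beta):
    # Weighted sum of +/-1 outputs from the previous layer; W entries are in {-1, +1}.
    x = y_prev @ W
    # Batch normalization with parameters computed at training time.
    x = (x - mu) / np.sqrt(var + eps) * gamma + beta
    # Deterministic binarization: +1 if the normalized input is non-negative.
    return np.where(x >= 0, 1, -1)

# Example with 4 binary inputs and 3 binary units (random +/-1 weights).
y0 = np.array([1, -1, 1, 1])
W = np.where(np.random.randn(4, 3) >= 0, 1, -1)
out = bnn_layer(y0, W, mu=np.zeros(3), var=np.ones(3),
                eps=1e-5, gamma=np.ones(3), beta=np.zeros(3))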

2.5 Weighted Partial Maximum Boolean Satisfiability Problem

The Weighted Partial Maximum Boolean Satisfiability problem (WP-MaxSAT) is the problem of finding a value assignment to the variables of a Boolean formula that consists of hard and weighted soft clauses such that i) all hard clauses evaluate to true (i.e., SAT) Davis1960 (), and ii) the total weight of the unsatisfied soft clauses is minimized. While WP-MaxSAT is NP-hard in the worst case, state-of-the-art WP-MaxSAT solvers have been shown experimentally to scale well on large instances Davies2013 ().

2.6 Boolean Cardinality Constraints

Boolean cardinality constraints describe bounds on the number of Boolean variables that are allowed to be true, and are of the form $x_1 + \dots + x_n \leq k$ or $x_1 + \dots + x_n \geq k$. Cardinality Networks $CN^{\leq}$ provide an efficient encoding in conjunctive normal form (CNF) for counting an upper bound on the number of true assignments to Boolean variables $x_1, \dots, x_n$ using auxiliary Boolean counting variables $y_1, \dots, y_{k+1}$ such that $y_i = 1$ holds for all $i \leq \min(x_1 + \dots + x_n, k+1)$ Asin2009 (). The detailed CNF encoding of $CN^{\leq}$ is outlined in Appendix A. Given $CN^{\leq}$, the Boolean cardinality constraint $x_1 + \dots + x_n \leq k$ is defined as

$CN^{\leq}(\langle x_1, \dots, x_n, x_{n+1}, \dots, x_{n+m} \rangle, \langle y_1, \dots, y_{k+1} \rangle) \wedge \neg y_{k+1}$ (1)

where $m$ is the size of the set of additional input variables (all set to false) that pad the input to the size required by the recursive construction.

Similarly, Boolean cardinality constraints of the form $x_1 + \dots + x_n \geq k$ are encoded given the CNF encoding of the Cardinality Networks $CN^{\geq}$ that count a lower bound on the number of true assignments to Boolean variables $x_1, \dots, x_n$ using auxiliary Boolean counting variables $y_1, \dots, y_k$ such that $y_i = 1$ holds only if $x_1 + \dots + x_n \geq i$. The detailed CNF encoding of $CN^{\geq}$ is also outlined in Appendix A. Given $CN^{\geq}$, the Boolean cardinality constraint $x_1 + \dots + x_n \geq k$ is defined as follows.

$CN^{\geq}(\langle x_1, \dots, x_n, x_{n+1}, \dots, x_{n+m} \rangle, \langle y_1, \dots, y_k \rangle) \wedge y_k$ (2)

Note that the cardinality constraint $x_1 + \dots + x_n = k$ is equivalent to the conjunction of $x_1 + \dots + x_n \leq k$ and $x_1 + \dots + x_n \geq k$. Since Cardinality Networks require the value of $k$ to be less than or equal to $\frac{n}{2}$, Boolean cardinality constraints with $k > \frac{n}{2}$ must be converted into their complements over negated literals (e.g., $x_1 + \dots + x_n \geq k$ becomes $\neg x_1 + \dots + \neg x_n \leq n - k$).

Finally, a Boolean cardinality constraint is generalized arc consistent if and only if, for every value assignment to every Boolean variable in the set $x_1, \dots, x_n$ that participates in some satisfying assignment, there exists a feasible value assignment to all the remaining Boolean variables. In practice, the ability to maintain generalized arc consistency through efficient algorithms such as unit propagation (as opposed to search) is one of the most important properties for the efficiency of a Boolean cardinality constraint encoded in CNF Sinz2005 (); Bailleux2006 (); Asin2009 (); Jabbour2014 (), and the Cardinality Networks encoding maintains generalized arc consistency through unit propagation Asin2009 ().
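Since unit propagation is central to this property, the following minimal Python sketch shows the deduction loop itself, with clauses given as lists of nonzero integers in the DIMACS convention (literal i is variable x_i, literal -i is its negation); the clause sets of a CNF cardinality encoding would be supplied as input.

def unit_propagate(clauses, assignment):
    # Repeatedly find clauses with all but one literal falsified and
    # assign the remaining literal; assignment maps variable -> bool.
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned, satisfied = [], False
            for lit in clause:
                var, val = abs(lit), lit > 0
                if var in assignment:
                    if assignment[var] == val:
                        satisfied = True
                        break
                else:
                    unassigned.append((var, val))
            if satisfied:
                continue
            if not unassigned:
                return None              # conflict: clause is falsified
            if len(unassigned) == 1:     # unit clause: forced assignment
                var, val = unassigned[0]
                assignment[var] = val
                changed = True
    return assignment

# Example: from (x1 or x2) and (not x1), deduce x2 = True.
print(unit_propagate([[1, 2], [-1]], {}))   # {1: False, 2: True}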

2.7 Binary Linear Programming Problem

The Binary Linear Programming (BLP) problem requires finding the optimal value assignment to the variables of a mathematical model with linear constraints, a linear objective function, and binary decision variables. Similar to WP-MaxSAT, BLP is NP-hard in the worst case. The state-of-the-art BLP solvers IBM2017 () utilize branch-and-bound algorithms and can handle cardinality constraints efficiently, as each cardinality constraint is expressed compactly as a single linear constraint.

2.8 Generalized Landmark Constraints

A generalized landmark constraint is a linear inequality of the form $\sum_{a \in L} n_a \geq k$, where $L$ denotes the set of action landmarks and $n_a$ denotes the count on action $a \in L$, that is, the minimum number of times action $a$ must occur in a plan Davies2015 (). Davies et al. introduced a decomposition-based planner, OpSeq, that incrementally updates generalized landmark constraints to find cost-optimal plans to classical planning problems.
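As an illustration with hypothetical counts from the Cellda domain of Section 2.2, the generalized landmark constraint $n_{up} + n_{down} + n_{left} + n_{right} \geq 6$ would assert that any valid plan must execute at least six movement actions in total, pruning every shorter action sequence from the decision space at once.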

3 Weighted Partial Maximum Boolean Satisfiability Compilation of the Learned Factored Planning Problem

In this section, we show how to reduce the learned factored planning problem with BNNs into WP-MaxSAT which we denote as Factored Deep SAT Planner (FD-SAT-Plan+).

3.1 Propositional Variables

First, we describe the set of propositional variables used in FD-SAT-Plan+. We use three sets of propositional variables: action variables, state variables and BNN binary units, where variables use a bitwise encoding.

  • $A^t_{a,m}$ denotes whether the $m$-th bit of action $a \in A$ is executed at time $t$.

  • $S^t_{s,m}$ denotes whether the $m$-th bit of state $s \in S$ is true at time $t$.

  • $B^t_{l,j}$ denotes whether BNN binary unit $j \in J(l)$ at layer $l$ is activated at time $t$.

3.2 Parameters

Next we define the additional parameters used in FD-SAT-Plan+.

  • $V_{s,m}$ is the initial (i.e., at time $t = 1$) value of the $m$-th bit of state variable $s \in S$.

  • $\mathit{in}(\cdot, m)$ is the function that maps the $m$-th bit of a state or an action variable to the corresponding binary unit $j \in J(1)$ in the input layer of the BNN.

  • $\mathit{out}(s, m)$ is the function that maps the $m$-th bit of a state variable $s \in S$ to the corresponding binary unit $j \in J(L)$ in the output layer of the BNN.

The global constraints $C$ and the goal state constraints $G$ are in the form of linear constraints over the bits of state and action variables, and the reward function $R$ is in the form of a linear expression over the bits of state and action variables.

3.3 The WP-MaxSAT Compilation

Below, we define the WP-MaxSAT encoding of the learned factored planning problem with BNNs. First, we present the set of hard clauses used in FD-SAT-Plan+.

3.3.1 Initial State Clauses

The following conjunction of hard clauses encodes the initial state constraints $I$.

$\bigwedge_{s \in S, m : V_{s,m} = 1} S^1_{s,m} \wedge \bigwedge_{s \in S, m : V_{s,m} = 0} \neg S^1_{s,m}$ (3)

where hard clause (3) sets the initial values of the state variables at time $t = 1$.

Next, we describe an efficient CNF encoding to model the activation behaviour of a BNN binary unit.

3.3.2 Bi-Directional Neuron Activation Encoding

Given inputs $x_1, \dots, x_n$, an activation threshold $C$, and the binary activation function $z = 1$ if $x_1 + \dots + x_n \geq C$, else $z = 0$, the output of a binary neuron can be efficiently encoded in CNF by combining the base hard clauses (i.e., the conjunction of hard clauses (24) with (44), and (34) with (44) from Appendix A) and the recursive hard clauses (i.e., the conjunction of hard clauses (38) with (45) from Appendix A) of $CN^{\leq}$ and $CN^{\geq}$, adding the auxiliary input variables $x_{n+1}, \dots, x_{n+m}$ and their respective unit hard clauses $\neg x_{n+1}, \dots, \neg x_{n+m}$, and adding the following bi-directional activation hard clause:

$z \leftrightarrow y_C$ (4)

where the Boolean variable $z$ represents the activation of the binary neuron such that $z = 1$ if and only if $x_1 + \dots + x_n \geq C$, and hard clause (4) is a biconditional logical connective between the output of the neuron and its activation function. Intuitively, the conjunction of hard clauses in $CN^{\leq}$ and $CN^{\geq}$ together count the number of true inputs, combining the respective upper and lower bounds such that $y_i = 1$ if and only if $x_1 + \dots + x_n \geq i$ for all $1 \leq i \leq C$.

Instead of the Uni-Directional encoding Say2018 (), which utilizes two separate sets of auxiliary Boolean counting variables (i.e., $CN^{\leq}$ and $CN^{\geq}$ are encoded with two different sets of auxiliary Boolean counting variables), the Bi-Directional encoding shares the same set of counting variables between both networks. Further, the previous work Say2018 () uses Sequential Counters Sinz2005 () for encoding the cardinality constraints using $O(n \cdot C)$ variables and hard clauses, whereas the Bi-Directional encoding uses only $O(n \log_2^2 C)$ variables and hard clauses. For notational clarity, we will refer to the conjunction of hard clauses in the Bi-Directional Neuron Activation Encoding as $\mathit{Act}(\langle x_1, \dots, x_n \rangle, C, z)$.
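As a rough illustration of these asymptotic bounds, for a layer with $n = 128$ inputs and an activation threshold $C = 64$, an $O(n \cdot C)$ encoding is on the order of $128 \times 64 = 8192$ auxiliary variables and hard clauses, whereas an $O(n \log_2^2 C)$ encoding is on the order of $128 \times 6^2 = 4608$, and the gap widens with layer width.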

Next, we will prove that the Bi-Directional Neuron Activation Encoding has the generalized arc-consistency property through unit propagation, which is considered to be one of the most important theoretical properties that certify the efficiency of a Boolean cardinality constraint encoded in CNF Sinz2005 (); Bailleux2006 (); Asin2009 (); Jabbour2014 ().

Definition 0.1 (Generalized Arc-Consistency of Neuron Activation Encoding).

A neuron activation encoding has the generalized arc-consistency property through unit propagation if and only if unit propagation is sufficient to deduce the following:

  1. For any set $X' \subseteq X = \{x_1, \dots, x_n\}$ with size $|X'| = C$, the value assignment $z = 1$ and $x_i = 0$ for all $x_i \in X \setminus X'$ assigns the remaining variables from the set $X'$ to true,

  2. For any set $X' \subseteq X$ with size $|X'| = n - C + 1$, the value assignment $z = 0$ and $x_i = 1$ for all $x_i \in X \setminus X'$ assigns the remaining variables from the set $X'$ to false,

  3. Partial value assignment of any $C$ variables from $X$ to true assigns variable $z = 1$, and

  4. Partial value assignment of any $n - C + 1$ variables from $X$ to false assigns variable $z = 0$.

Theorem 1 (Generalized Arc-Consistency of $\mathit{Act}$).

The Bi-Directional Neuron Activation Encoding has the generalized arc-consistency property through unit propagation.

Proof.

To show that $\mathit{Act}$ maintains the generalized arc-consistency property through unit propagation, we show exhaustively, for all four cases of Definition 0.1, that unit propagation is sufficient to deduce the required assignments.

Case 1 ($z = 1$ and $x_i = 0$ for all $x_i \in X \setminus X'$ given, $x_i = 1$ for all $x_i \in X'$ by unit propagation): When $z = 1$, unit propagation assigns $y_C = 1$ using hard clause (4). Given that $CN^{\geq}$ uses the same set of counting variables as $CN^{\leq}$ (excluding variable $y_C$, which is assigned to true) and the value assignment $x_i = 0$ for all $x_i \in X \setminus X'$ for any set $X'$ with size $|X'| = C$, unit propagation will set the remaining variables from the set $X'$ to true using the conjunction of hard clauses that encode $CN^{\geq}$, excluding the unit clause $y_C$ Asin2009 ().

Case 2 ($z = 0$ and $x_i = 1$ for all $x_i \in X \setminus X'$ given, $x_i = 0$ for all $x_i \in X'$ by unit propagation): When $z = 0$, unit propagation assigns $y_C = 0$ using hard clause (4). Given that $CN^{\leq}$ uses the same set of counting variables as $CN^{\geq}$ (excluding variable $y_C$, which is assigned to false) and the value assignment $x_i = 1$ for all $x_i \in X \setminus X'$ for any set $X'$ with size $|X'| = n - C + 1$, unit propagation will set the remaining variables from the set $X'$ to false using the conjunction of hard clauses that encode $CN^{\leq}$, excluding the unit clause $\neg y_C$ Asin2009 ().

Case 3 ($C$ variables from $X$ set to true given, $z = 1$ by unit propagation): When $C$ variables from the set $X$ are set to true, unit propagation assigns the counting variable $y_C = 1$ using the conjunction of hard clauses that encode $CN^{\geq}$, excluding the unit clause $y_C$ Asin2009 (). Given the assignment $y_C = 1$, unit propagation assigns $z = 1$ using hard clause (4).

Case 4 ($n - C + 1$ variables from $X$ set to false given, $z = 0$ by unit propagation): When $n - C + 1$ variables from the set $X$ are set to false, unit propagation assigns the counting variable $y_C = 0$ using the conjunction of hard clauses that encode $CN^{\leq}$, excluding the unit clause $\neg y_C$ Asin2009 (). Given the assignment $y_C = 0$, unit propagation assigns $z = 0$ using hard clause (4). ∎

3.3.3 BNN Clauses

Given the efficient CNF encoding $\mathit{Act}$, we present the conjunction of hard clauses that model the complete BNN.

$S^t_{s,m} \leftrightarrow B^t_{1,\mathit{in}(s,m)} \quad \forall t \in \{1, \dots, H\}, s \in S, m$ (5)
$A^t_{a,m} \leftrightarrow B^t_{1,\mathit{in}(a,m)} \quad \forall t \in \{1, \dots, H\}, a \in A, m$ (6)
$S^{t+1}_{s,m} \leftrightarrow B^t_{L,\mathit{out}(s,m)} \quad \forall t \in \{1, \dots, H\}, s \in S, m$ (7)
$\mathit{Act}(\langle \tilde{B}^t_{l-1,i} \mid i \in J(l-1) \rangle, C_{l,j}, B^t_{l,j}) \quad \forall t \in \{1, \dots, H\}, 1 < l \leq L, j \in J(l)$ (8)
$\tilde{B}^t_{l-1,i} = B^t_{l-1,i} \text{ if } w_{i,j,l} = 1, \text{ and } \tilde{B}^t_{l-1,i} = \neg B^t_{l-1,i} \text{ otherwise} \quad \forall t \in \{1, \dots, H\}, 1 < l \leq L, i \in J(l-1), j \in J(l)$ (9)

where the activation constant $C_{l,j}$ in hard clauses (8-9) is computed from the batch normalization parameters of binary unit $j \in J(l)$ in layer $l$ at training time such that:

$C_{l,j} = \left\lceil \frac{n + \mu_{j,l} - \beta_{j,l} \frac{\sqrt{\sigma^2_{j,l} + \epsilon_{j,l}}}{\gamma_{j,l}}}{2} \right\rceil$

where $n = |J(l-1)|$ denotes the size of set $J(l-1)$. The computation of the activation constant ensures that $C_{l,j}$ is less than or equal to half the size of the previous layer $l-1$, as the Bi-Directional Neuron Activation Encoding only counts up to $k \leq \frac{n}{2}$.
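As a sanity check on this computation, the following Python sketch derives the counting threshold from fixed batch normalization parameters, assuming $\gamma_{j,l} > 0$ (for a negative scaling parameter the inequality direction would flip) and using the fact that a sum of $n$ weighted $\pm 1$ inputs equals twice the number of weight-agreeing inputs minus $n$.

import math

def activation_constant(n, mu, var, eps, gamma, beta):
    # The unit activates iff its normalized input is non-negative, i.e.,
    # sum(w * y) >= mu - beta * sqrt(var + eps) / gamma (assuming gamma > 0).
    assert gamma > 0, "for gamma < 0 the inequality direction flips"
    threshold = mu - beta * math.sqrt(var + eps) / gamma
    # With y, w in {-1, +1}: sum(w * y) = 2 * agreements - n, so the unit
    # activates iff at least ceil((n + threshold) / 2) inputs agree with
    # their weights.
    return math.ceil((n + threshold) / 2)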

Hard clauses (5-6) map the binary units at the input layer of the BNN (i.e., $l = 1$) to a unique state or action variable, respectively. Similarly, hard clause (7) maps the binary units at the output layer of the BNN (i.e., $l = L$) to a unique state variable. Hard clauses (8-9) encode the binary activation of every unit in the BNN.

3.3.4 Global Constraint Clauses

The following conjunction of hard clauses encodes the global constraints $C$.

$C(\bar{A}^t, \bar{S}^t) \quad \forall t \in \{1, \dots, H\}$ (10)

where hard clause (10) represents the domain-dependent global constraints on state and action variables. Some common examples of global constraints, such as mutual exclusion on Boolean action variables and one-hot encodings for the output of the BNN (i.e., exactly one Boolean state variable must be true), are respectively encoded by hard clauses (11-12) as follows.

$\sum_{a \in A} A^t_{a,1} \leq 1 \quad \forall t \in \{1, \dots, H\}$ (11)

$\sum_{s \in S} S^t_{s,1} = 1 \quad \forall t \in \{1, \dots, H+1\}$ (12)

In general, linear global constraints of the form $c_1 v_1 + \dots + c_n v_n \leq k$, such as bounds on state and action variables, can be encoded in CNF, where $c_1, \dots, c_n$ are positive integer coefficients and $v_1, \dots, v_n$ are decision variables with non-negative integer domains Abio2014 ().

3.3.5 Goal State Clauses

The following conjunction of hard clauses encodes the goal state constraints $G$.

$\bigwedge_{s \in S', m : G_{s,m} = 1} S^{H+1}_{s,m} \wedge \bigwedge_{s \in S', m : G_{s,m} = 0} \neg S^{H+1}_{s,m}$ (13)

where hard clause (13) sets the goal constraints on the subset of state variables $S' \subseteq S$ at time $t = H+1$, and $G_{s,m}$ denotes the goal value of the $m$-th bit of state variable $s \in S'$.

3.3.6 Reward Clauses

Given that the reward function for each time step $t$ is in the form of a linear expression over the bits of action and state variables, the following weighted soft clauses:

$(A^t_{a,m}, c_{a,m})$ and $(S^{t+1}_{s,m}, c_{s,m}) \quad \forall t \in \{1, \dots, H\}, a \in A, s \in S, m$ (14)

can be written to represent $R(\bar{A}^t, \bar{S}^{t+1})$, where $c_{a,m}$ and $c_{s,m}$ are the weights of the soft clauses for each bit of action and state variables, respectively.

4 BLP Compilation of the Learned Factored Planning Problem

Given FD-SAT-Plan+, we present the Binary Linear Programming (BLP) compilation of the learned factored planning problem with BNNs, which we denote as Factored Deep BLP Planner (FD-BLP-Plan+).

4.1 Binary Variables and Parameters

FD-BLP-Plan+ uses the same set of decision variables and parameters as FD-SAT-Plan+.

4.2 The BLP Compilation

FD-BLP-Plan+ replaces hard clauses (3) and (5-7) with equivalent linear constraints as follows.

$S^1_{s,m} = V_{s,m} \quad \forall s \in S, m$ (15)
$S^t_{s,m} = B^t_{1,\mathit{in}(s,m)} \quad \forall t \in \{1, \dots, H\}, s \in S, m$ (16)
$A^t_{a,m} = B^t_{1,\mathit{in}(a,m)} \quad \forall t \in \{1, \dots, H\}, a \in A, m$ (17)
$S^{t+1}_{s,m} = B^t_{L,\mathit{out}(s,m)} \quad \forall t \in \{1, \dots, H\}, s \in S, m$ (18)

Given the activation constant $C_{l,j}$ of binary unit $j \in J(l)$ in layer $l$, FD-BLP-Plan+ replaces hard clauses (8-9) representing the activation of binary unit $j$ with the following linear constraints:

$C_{l,j} B^t_{l,j} \leq \sum_{i \in J(l-1)} \tilde{B}^t_{l-1,i} \quad \forall t \in \{1, \dots, H\}, 1 < l \leq L, j \in J(l)$ (19)

$\sum_{i \in J(l-1)} \tilde{B}^t_{l-1,i} \leq C_{l,j} - 1 + (n - C_{l,j} + 1) B^t_{l,j} \quad \forall t \in \{1, \dots, H\}, 1 < l \leq L, j \in J(l)$ (20)

where $\tilde{B}^t_{l-1,i} = B^t_{l-1,i}$ if $w_{i,j,l} = 1$, $\tilde{B}^t_{l-1,i} = 1 - B^t_{l-1,i}$ otherwise, and $n = |J(l-1)|$.

Global constraint hard clauses (10) and goal state hard clauses (13) are compiled into linear constraints given that they are in the form of linear expressions. Finally, the reward function with linear expressions is maximized over time such that:

$\max \sum_{t=1}^{H} R(\bar{A}^t, \bar{S}^{t+1})$ (21)
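For illustration, the following is a minimal sketch of constraints (19-20) for a single binary unit using the PuLP modeling library; the input size, weights and activation constant are hypothetical, and the full compilation adds one such pair of constraints per unit, layer and time step.

from pulp import LpProblem, LpVariable, LpBinary, LpMaximize, lpSum

n, C, w = 4, 2, [1, -1, 1, 1]        # hypothetical unit: 4 inputs, threshold 2
model = LpProblem("neuron_activation", LpMaximize)
x = [LpVariable(f"in_{i}", cat=LpBinary) for i in range(n)]
z = LpVariable("activation", cat=LpBinary)

# Number of inputs agreeing with their weights: x_i if w = +1, 1 - x_i if w = -1.
agree = lpSum(x[i] if w[i] == 1 else 1 - x[i] for i in range(n))
model += agree >= C * z, "act_19"                    # z = 1 forces agree >= C
model += agree <= C - 1 + (n - C + 1) * z, "act_20"  # z = 0 forces agree <= C - 1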

5 Incremental Factored Planning Algorithm for FD-SAT-Plan+ and FD-BLP-Plan+

Given that the plans found for the learned factored planning problem $\tilde{\Pi}$ by FD-SAT-Plan+ and FD-BLP-Plan+ can be infeasible for the factored planning problem $\Pi$, we introduce an incremental algorithm for finding plans for $\Pi$ by iteratively excluding invalid plans from the search space of FD-SAT-Plan+ and FD-BLP-Plan+. Similar to OpSeq Davies2015 (), FD-SAT-Plan+ and FD-BLP-Plan+ are updated with the following generalized landmark hard clauses or constraints

$\bigvee_{(a,m,t) \in \bar{A}_o} \neg A^t_{a,m}$ (22)

$\sum_{(a,m,t) \in \bar{A}_o} A^t_{a,m} \leq |\bar{A}_o| - 1$ (23)

respectively, where $\bar{A}_o$ is the set of bits $(a, m, t)$ of actions executed at time $t$ at the $o$-th iteration of the algorithm outlined below.

1: $o \leftarrow 1$, planner $\leftarrow$ FD-SAT-Plan+ or FD-BLP-Plan+
2: $\bar{A}_o \leftarrow$ solve the learned factored planning problem $\tilde{\Pi}$ using planner
3: if $\tilde{\Pi}$ is infeasible or $\bar{A}_o$ is a plan for $\Pi$ then
4:      return $\bar{A}_o$
5: else
6:      if planner = FD-SAT-Plan+ then
7:           planner $\leftarrow$ planner with hard clause (22)
8:      else
9:           planner $\leftarrow$ planner with constraint (23)
10: $o \leftarrow o + 1$, go to line 2.
Algorithm 1 Incremental Factored Planning Algorithm

For a given horizon $H$, Algorithm 1 iteratively computes a set of executed action bits $\bar{A}_o$, or returns infeasibility for the learned factored planning problem $\tilde{\Pi}$. If the set $\bar{A}_o$ is non-empty, we evaluate whether it constitutes a valid plan for the original factored planning problem $\Pi$ (i.e., line 3), either in the actual domain or using a high-fidelity domain simulator (in our case, RDDLsim Sanner2010 ()). If $\bar{A}_o$ constitutes a plan for $\Pi$, Algorithm 1 returns it as a plan. Otherwise, the planner is updated with the new generalized landmark to exclude $\bar{A}_o$, and the loop repeats. Since the original action space is discretized and represented with finitely many bits of precision, Algorithm 1 can be shown to terminate in a finite number of iterations (at most the number of distinct sets of executed action bits over horizon $H$) by constructing an inductive proof similar to the termination criteria of OpSeq, where either a feasible plan for $\Pi$ is returned or there does not exist a plan for both $\tilde{\Pi}$ and $\Pi$ for the given horizon $H$. The outline of the proof can be found in Appendix B.
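The following Python sketch summarizes the loop of Algorithm 1, with hypothetical callables solve (the underlying planner applied to the learned problem), is_valid_plan (validation in the actual domain or a simulator such as RDDLsim) and add_landmark (adding hard clause (22) or constraint (23) to the planner).

def incremental_plan(solve, is_valid_plan, add_landmark):
    # Solve the learned problem, validate the returned set of executed
    # action bits against the original problem, and exclude invalid sets
    # until a valid plan is found or the learned problem becomes infeasible.
    while True:
        actions = solve()             # FD-SAT-Plan+ or FD-BLP-Plan+
        if actions is None:           # learned problem is infeasible
            return None
        if is_valid_plan(actions):    # line 3 of Algorithm 1
            return actions
        add_landmark(actions)         # hard clause (22) / constraint (23)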

6 Experimental Results

In this section, we evaluate the effectiveness of factored planning with BNNs. First, we present the benchmark domains used to test the efficiency of our learning and factored planning framework with BNNs. Second, we present the accuracy of BNNs at learning complex state transition models for factored planning problems. Third, we test the efficiency and scalability of planning with FD-SAT-Plan+ and FD-BLP-Plan+ on the learned factored planning problems across multiple problem sizes and horizon settings. Finally, we demonstrate the effectiveness of Algorithm 1 at finding plans for the factored planning problem $\Pi$.

6.1 Domain Descriptions

The RDDL Sanner2010 () formalism is extended to handle goal specifications and is used to describe the problem $\Pi$. Below, we summarize the extended deterministic RDDL domains used in the experiments, namely Navigation Sanner2011 (), Inventory Control (Inventory) Mann2014 (), System Administrator (SysAdmin) Guestrin2001 (); Sanner2011 (), and Cellda Nintendo1986 (). A detailed presentation of the RDDL domains and instances is provided in Appendix C.

Navigation

models an agent in a two-dimensional ($N$-by-$N$) maze with obstacles, where the goal of the agent is to move from its initial location to the goal location by the end of horizon $H$. The transition function describes the movement of the agent as a function of the topological relation of its current location to the maze, the moving direction, and whether the location the agent tries to move to is an obstacle or not. This domain is a deterministic version of its original from IPPC2011 Sanner2011 (). Both the action and the state spaces are Boolean. We report the results on instances with three maze sizes ($N$-by-$N$ for $N \in \{3, 4, 5\}$, cf. Table 1) and three horizon settings per maze size.

Inventory

describes an inventory management control problem with alternating demand for a product over time, where the management can order a fixed amount of units to increase the number of units in stock at any given time. The transition function updates the state based on the change in stock as a function of the demand, the time, the current order quantity, and whether an order has been made or not. The action space is Boolean (either order a fixed positive integer amount, or do not order) and the state space is non-negative integer. We report the results on instances with two demand cycle lengths (2 and 4, cf. Table 1) and three horizon settings per demand cycle length.

SysAdmin

models the behavior of a computer network of size $N$, where the administrator can reboot a limited number of computers to keep the number of running computers above a specified safety threshold over time. The transition function describes the status of each computer, which depends on its topological relation to other computers, its age, and whether it has been rebooted or not, and the age of each computer, which depends on its current age and whether it has been rebooted or not. This domain is a deterministic modified version of its original from IPPC2011 Sanner2011 (). The action space is Boolean and the state space is non-negative integer, and concurrency between actions is allowed. We report the results on instances with two network sizes (4 and 5, cf. Table 1) and three horizon settings per network size.

Cellda

models an agent in a two-dimensional (4-by-4) dungeon cell. The agent Cellda must escape her cell through an initially locked door by obtaining the key without getting hit by her enemy. Each cell of the dungeon is of one of two types: i) regular cells, from/to which Cellda and her enemy can deterministically move up, down, right or left, and ii) blocks, on which neither Cellda nor her enemy can stand. The enemy has a deterministic policy that is unknown to Cellda: given the locations of Cellda and the enemy, the adversarial deterministic policy will always try to minimize the total Manhattan distance between the two, breaking ties along a fixed axis. The state space is mixed; integer variables describe the locations of Cellda and the enemy, and Boolean variables describe whether the key has been obtained and whether Cellda is alive. The action space is Boolean, for moving up, down, right or left. The transition function updates the states as a function of the previous locations of Cellda and the enemy, the moving direction of Cellda, whether the key was obtained, and whether Cellda was alive. We report results on instances with two adversarial deterministic policies (breaking ties on the x- or the y-axis, cf. Table 1) and three horizon settings per policy.

6.2 Transition Learning Performance

In Table 1, we present test errors for different configurations of the BNNs on each domain instance, where the sample data was generated from the RDDL-based domain simulator RDDLsim Sanner2010 () using a simple stochastic exploration policy. For each instance of a domain, state transition pairs were collected and the data was treated as independent and identically distributed. After random permutation, the data was split into training and test sets with a 9:1 ratio. The BNNs were trained on a MacBook Pro with a 2.8 GHz Intel Core i7 and 16 GB of memory using the publicly available code Hubara2016 (). Overall, Navigation instances required the smallest BNN structures for learning due to their purely Boolean state and action spaces, while the Inventory, SysAdmin and Cellda instances required larger BNN structures for accurate learning, owing to their non-Boolean state and action spaces.

Domain Network Structure Test Error (%)
Navigation(3) 13:36:36:9 0.0
Navigation(4) 20:96:96:16 0.0
Navigation(5) 29:128:128:25 0.0
Inventory(2) 7:96:96:5 0.018
Inventory(4) 8:128:128:5 0.34
SysAdmin(4) 16:128:128:12 2.965
SysAdmin(5) 20:128:128:128:15 0.984
Cellda(x) 12:128:128:4 0.645
Cellda(y) 12:128:128:4 0.65
Table 1: Transition Learning Performance Table measured by error on test data (in %) for all domains and instances.

6.3 Planning Performance on the Learned Factored Planning Problems

In this section, we present the results of two computational comparisons. First, we compare the efficiency of the Bi-Directional Neuron Activation Encoding against the existing neuron activation encoding to select the best WP-MaxSAT-based encoding for FD-SAT-Plan+. Second, we compare the effectiveness of the selected WP-MaxSAT-based encoding against a BLP-based encoding, namely FD-SAT-Plan+ against FD-BLP-Plan+, at finding plans for the learned factored planning problem $\tilde{\Pi}$. We ran the experiments on a MacBook Pro with a 2.8 GHz Intel Core i7 and 16 GB of memory. For FD-SAT-Plan+ and FD-BLP-Plan+, we used MaxHS Davies2013 () with the underlying LP solver CPLEX 12.7.1 IBM2017 (), and CPLEX 12.7.1 IBM2017 (), respectively, with a 1 hour total time limit per domain instance.

6.3.1 Comparison of neuron activation encodings

Figure 3: Timing comparison between FD-SAT-Plan+ with Sequential Counters Sinz2005 () and the Uni-Directional Encoding Say2018 () (x-axis) and FD-SAT-Plan+ with Cardinality Networks Asin2009 () and the Bi-Directional Encoding (y-axis). Over all problem settings, FD-SAT-Plan+ with Cardinality Networks and the Bi-Directional Encoding significantly outperformed FD-SAT-Plan+ with Sequential Counters and the Uni-Directional Encoding on all problem instances due to its i) smaller encoding size, and ii) generalized arc-consistency property.

The runtime efficiency of both neuron activation encodings is tested on the learned factored planning problems over 27 problem instances, where we compare our Bi-Directional encoding, which utilizes Cardinality Networks Asin2009 (), against the previous Uni-Directional encoding Say2018 (), which uses Sequential Counters Sinz2005 ().

Figure 3 visualizes the runtime comparison of both neuron activation encodings. The inspection of Figure 3 clearly demonstrates that FD-SAT-Plan+ with Cardinality Networks and the Bi-Directional Encoding significantly outperforms FD-SAT-Plan+ with Sequential Counters and the Uni-Directional Encoding on all problem instances due to its i) smaller encoding size (i.e., $O(n \log_2^2 C)$ versus $O(n \cdot C)$) with respect to both the number of variables and the number of hard clauses used, and ii) generalized arc-consistency property. Therefore, we use FD-SAT-Plan+ with Cardinality Networks and the Bi-Directional Encoding in the remaining experiments and omit the results for FD-SAT-Plan+ with the Uni-Directional encoding and Sequential Counters.

6.3.2 Comparison of FD-SAT-Plan+ and FD-BLP-Plan+

Next, we test the runtime efficiency of FD-SAT-Plan+ and FD-BLP-Plan+ for solving the learned factored planning problem.

Domains FD-SAT-Plan+ FD-BLP-Plan+
Navigation 529.11 1282.82
Inventory 54.88 0.54
SysAdmin 1627.35 3006.27
Cellda 344.03 285.45
Coverage 27/27 20/27
Optimality Proved 25/27 19/27
Table 2: Summary of the computational results presented in Appendix D, including the average runtimes in seconds for both FD-SAT-Plan+ and FD-BLP-Plan+ over all four domains for the learned factored planning problem $\tilde{\Pi}$ within the 1 hour time limit.

In Table 2, we present the summary of the computational results, including the average runtimes in seconds, the total number of instances for which a feasible solution is returned (i.e., coverage), and the total number of instances for which an optimal solution is returned (i.e., optimality proved), for both FD-SAT-Plan+ and FD-BLP-Plan+ over all four domains for the learned factored planning problem $\tilde{\Pi}$ within the 1 hour time limit. The analysis of Table 2 shows that FD-SAT-Plan+ covers all problem instances by returning an incumbent solution to the learned factored planning problem, compared to FD-BLP-Plan+, which exceeds the 1 hour time limit in 7 out of 27 instances before finding an incumbent solution. Similarly, FD-SAT-Plan+ proves the optimality of the solutions found in 25 out of 27 problem instances, compared to FD-BLP-Plan+, which only proves the optimality of 19 out of 27 solutions within the 1 hour time limit.

Figure 4: Timing comparison between FD-SAT-Plan+ (x-axis) and FD-BLP-Plan+ (y-axis). Over all problem settings, FD-BLP-Plan+ performed better on instances that require less than approximately 100 seconds to solve (i.e., computationally easy instances), whereas FD-SAT-Plan+ outperformed FD-BLP-Plan+ on instances that require more than approximately 100 seconds to solve (i.e., computationally hard instances).

In Figure 4, we compare the runtime performances of FD-SAT-Plan+ (x-axis) and FD-BLP-Plan+ (y-axis) per instance, labeled by domain. The analysis of Figure 4 across all 27 instances shows that FD-BLP-Plan+ proved the optimality of problem instances from domains with lower computational demands (e.g., Inventory) more efficiently than FD-SAT-Plan+. In contrast, FD-SAT-Plan+ proved the optimality of problem instances from domains with higher computational demands (e.g., SysAdmin) more efficiently than FD-BLP-Plan+. As the instances got harder to solve, FD-BLP-Plan+ timed out more often than FD-SAT-Plan+, mainly due to its inability to find incumbent solutions, as evident from Table 2.

The detailed inspection of Figure 4 and Table 2, together with Table 1, shows that the computational effort required to solve the benchmark instances increases significantly more for FD-BLP-Plan+ than for FD-SAT-Plan+ as the learned BNN structure gets more complex (i.e., from the smallest BNN structure of Inventory, to the moderately sized BNN structures of Navigation and Cellda, to the largest BNN structure of SysAdmin). A detailed presentation of the runtime results for each instance is provided in Appendix D.

6.4 Planning Performance on the Factored Planning Problems

Finally, we test the planning efficiency of the incremental factored planning algorithm for solving the factored planning problem $\Pi$.

Domains FD-SAT-Plan+ FD-BLP-Plan+
Navigation 529.11 1282.82
Inventory 68.88 0.66
SysAdmin 2463.79 3006.27
Cellda 512.51 524.53
Coverage 23/27 19/27
Optimality Proved 23/27 19/27
Table 3: Summary of the computational results presented in Appendix D, including the average runtimes in seconds for both FD-SAT-Plan+ and FD-BLP-Plan+ over all four domains for the factored planning problem $\Pi$ within the 1 hour time limit.

In Table 3, we present the summary of the computational results, including the average runtimes in seconds, the total number of instances for which a feasible solution is returned (i.e., coverage), and the total number of instances for which an optimal solution is returned (i.e., optimality proved), for both FD-SAT-Plan+ and FD-BLP-Plan+ using Algorithm 1 over all four domains for the factored planning problem $\Pi$ within the 1 hour time limit. The analysis of Table 3 shows that FD-SAT-Plan+ with Algorithm 1 covers 23 out of 27 problem instances by returning an incumbent solution to the factored planning problem, compared to FD-BLP-Plan+ with Algorithm 1, which exceeds the 1 hour time limit in 8 out of 27 instances before finding an incumbent solution. Similarly, FD-SAT-Plan+ with Algorithm 1 proves the optimality of the solutions found in 23 out of 27 problem instances, compared to FD-BLP-Plan+ with Algorithm 1, which only proves the optimality of 19 out of 27 solutions within the 1 hour time limit.

(a) Navigation
(b) Inventory
(c) SysAdmin
(d) Cellda
Figure 5: Timing comparison between FD-SAT-Plan+ (orange/red) and FD-BLP-Plan+ (green/blue) for solving the learned factored planning problems (orange/green) and the factored planning problems (red/blue) per domain.

Figure 5 visualizes the comparative runtime performance of Algorithm 1 with i) FD-SAT-Plan+ (orange/red) and ii) FD-BLP-Plan+ (green/blue) per domain, where the additional computational requirement of solving the factored planning problem $\Pi$ is stacked on top of the computational requirement of solving the learned factored planning problem $\tilde{\Pi}$. The detailed inspection of Figures 5(a), 5(b) and 5(d) demonstrates that the constraint generation algorithm successfully verified the plans found for the factored planning problem in three out of four domains with low computational cost. In contrast, the incremental factored planning algorithm spent significantly more time in the SysAdmin domain, as evident in Figure 5(c). Over all instances, we observed that at most 5 instances required constraint generation to find a plan, and the maximum number of constraints required was 6, namely for the (Sys,4,3) instance. A detailed presentation of the runtime results and the number of generalized landmark constraints generated for each instance is provided in Appendix D.

7 Conclusion

In this work, we utilized the efficiency and ability of BNNs to learn complex state transition models of factored planning domains with discretized state and action spaces. We introduced two novel compilations, a WP-MaxSAT encoding (FD-SAT-Plan+) and a BLP encoding (FD-BLP-Plan+), that directly exploit the structure of BNNs to plan for the learned factored planning problem and that provide optimality guarantees with respect to the learned model when they successfully terminate. Theoretically, we have shown that our SAT-based Bi-Directional Neuron Activation Encoding has the generalized arc-consistency property, which is one of the most important efficiency certificates of a SAT-based encoding.

We further introduced a finite-time incremental factored planning algorithm based on generalized landmark constraints that improves the planning accuracy of both FD-SAT-Plan+ and FD-BLP-Plan+. Experimentally, we demonstrated the computational efficiency of our Bi-Directional Neuron Activation Encoding in comparison to the existing neuron activation encoding. Overall, our empirical results showed that we can accurately learn complex state transition models using BNNs and demonstrated strong performance in both the learned and original domains. In sum, this work provides a novel and effective factored state and action transition learning and planning formalism to the data-driven model-based planning community.

Appendices

Appendix A CNF Encoding of the Cardinality Networks

The CNF encoding of the $k$-Cardinality Networks is as follows Asin2009 ().

Half Merging Networks

Given two sequences of Boolean variables $\langle x_1, \dots, x_n \rangle$ and $\langle x'_1, \dots, x'_n \rangle$, Half Merging (HM) Networks merge the inputs into a single sorted sequence $\langle y_1, \dots, y_{2n} \rangle$ of size $2n$ using the CNF encoding as follows.

For input size $n = 1$:

$(\neg x_1 \vee y_1) \wedge (\neg x'_1 \vee y_1) \wedge (\neg x_1 \vee \neg x'_1 \vee y_2)$ (24)

For input size $n > 1$:

(25)
(26)
(27)
(28)
Half Sorting Networks

Given a sequence of Boolean variables $\langle x_1, \dots, x_{2n} \rangle$, Half Sorting (HS) Networks sort the variables with respect to their value assignments as follows.

For input size :