# Neural-Symbolic Descriptive Action Model from Images: The Search for STRIPS

## Abstract

Recent work on Neural-Symbolic systems that learn a discrete planning model from images has opened a promising direction for expanding the scope of Automated Planning and Scheduling to raw, noisy data. However, previous work only partially addressed this problem, because it used a black-box neural model as the successor generator. In this work, we propose Double-Stage Action Model Acquisition (DSAMA), a system that obtains a descriptive PDDL action model with explicit preconditions and effects over propositional variables learned from images without supervision. DSAMA trains a set of Random Forest rule-based classifiers and compiles them into logical formulae in PDDL. While the resulting PDDL model is competitively accurate compared to a black-box model, we observed that it is too large and complex for state-of-the-art standard planners such as Fast Downward, primarily due to the PDDL-SAS+ translator bottleneck. From this negative result, we show that this translator bottleneck cannot be addressed just by using a different, existing rule-based learning method, and we point to potential future directions.


## 1 Introduction

Recently, the Latplan system [\citeauthoryearAsai and Fukunaga2018] successfully connected a subsymbolic neural network (NN) system and a symbolic Classical Planning system to solve various visually presented puzzle domains. The system consists of four parts: 1) The State AutoEncoder (SAE) neural network learns a bidirectional mapping between images and propositional states with unsupervised training. 2) The Action Model Acquisition module generates an action model from the propositional state transitions encoded from the images. 3) The Classical Planning module solves the problem in the propositional state space with the learned action model. 4) The decoding module maps the propositional plan back to an image sequence. The proposed framework opened a promising direction for applying a variety of symbolic methods to the real world. For example, the search space generated by Latplan was shown to be compatible with a symbolic Goal Recognition system [\citeauthoryearAmado et al.2018a, \citeauthoryearAmado et al.2018b]. Several variations replacing the state encoding modules have also been proposed: Causal InfoGAN [\citeauthoryearKurutach et al.2018] uses a GAN-based framework, First-Order SAE [\citeauthoryearAsai2019] learns First Order Logic symbols (instead of propositional ones), and Zero-Suppressed SAE [\citeauthoryearAsai and Kajino2019, ZSAE] addresses the Symbol Stability issue of the regular SAE with regularization.

Despite these efforts, Latplan is missing a critical feature of traditional Classical Planning systems: the use of State-of-the-Art heuristic functions. The main reason behind this limitation is the lack of a descriptive action model consisting of logical formulae for the preconditions and the effects, whose causal structure the heuristics can exploit. Obtaining descriptive action models from raw observations with minimal human interference is the next key milestone for expanding Automated Planning applications to raw, unstructured inputs, as it fully unleashes the pruning power of state-of-the-art Classical Planning heuristic functions, which allow the planner to scale up to much larger problems.

In this paper, we propose an approach called Double-Stage Action Model Acquisition (DSAMA), a two-stage process that first learns the set of action symbols and action effects via the Action AutoEncoder neural network module in Latplan AMA [\citeauthoryearAsai and Fukunaga2018] and then trains a rule-based machine learning system, which is then converted into propositional formulae in PDDL format. We tested DSAMA with the Random Forest (RF) framework [\citeauthoryearHo1998] as the machine learning module due to its maturity and performance. As a result, we successfully generated a descriptive action model, as depicted for example in Fig. 1, which is as accurate as its black-box neural counterpart.

Despite the success in terms of model accuracy, the proposed approach turned out to be an impractical solution for descriptive action model acquisition, and it gave us an insight into the core problem of this approach. The generated logical formulae and the resulting PDDL were too large and complex for the recipient classical planning system (Fast Downward) to solve the given instance within a reasonable runtime and memory, and if we trade accuracy for model simplicity, the goal becomes unreachable. We provide an analysis of the reason and discuss possible future directions. The code reproducing the experiments will be published at github.com/guicho271828/latplan/.

## 2 Preliminaries

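We denote a tensor (multi-dimensional array) in bold and denote its elements with a subscript, e.g. when $\mathbf{x} \in \mathbb{R}^{N \times M}$, the second row is $\mathbf{x}_2 \in \mathbb{R}^{M}$. We use dotted subscripts to denote a subarray, e.g. $\mathbf{x}_{2..5} = (\mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4)$. For a vector or a set $X$, $|X|$ denotes the number of elements. $\mathbf{1}^{D}$ and $\mathbf{0}^{D}$ denote the constant matrix of shape $D$ with all elements being 1/0, respectively. $\mathbf{a}; \mathbf{b}$ denotes a concatenation of tensors $\mathbf{a}$ and $\mathbf{b}$ in the first axis where the rest of the dimensions are the same between $\mathbf{a}$ and $\mathbf{b}$. For a dataset, we generally denote its $i$-th data point with a superscript $^{i}$ which we may sometimes omit for clarity.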

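Let $\mathcal{F}(V)$ be a propositional formula consisting of logical operations $\{\land, \lor, \lnot\}$, constants $\{\top, \bot\}$, and a set of propositional variables $V$. We define a grounded (propositional) Classical Planning problem as a 4-tuple $\langle P, A, I, G \rangle$ where $P$ is a set of propositions, $A$ is a set of actions, $I \subseteq P$ is the initial state, and $G \subseteq P$ is a goal condition. Each action $a \in A$ is a 3-tuple $\langle \text{pre}(a), e^{+}(a), e^{-}(a) \rangle$ where $\text{pre}(a) \in \mathcal{F}(P)$ is a precondition and $e^{+}(a)$, $e^{-}(a)$ are the sets of effects called add-effects and delete-effects, respectively. Each effect is denoted as $c \triangleright e$ where $c \in \mathcal{F}(P)$ is an effect condition and $e \in P$. A state $s \subseteq P$ is a set of true propositions, an action $a$ is applicable when $s \vDash \text{pre}(a)$ ($s$ satisfies $\text{pre}(a)$), and applying an action $a$ to $s$ yields a new successor state $a(s)$ which is $a(s) = (s \setminus \{e \mid (c \triangleright e) \in e^{-}(a), s \vDash c\}) \cup \{e \mid (c \triangleright e) \in e^{+}(a), s \vDash c\}$.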
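The semantics of applicability and conditional effects described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: formulae are modeled as Python predicates over states rather than as syntax trees, delete-effects are applied before add-effects, and the names `Action`, `applicable`, and `apply_action` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

State = FrozenSet[str]               # a state is the set of true propositions
Formula = Callable[[State], bool]    # assumption: a formula is a predicate on states


@dataclass
class Action:
    pre: Formula                         # precondition
    add: List[Tuple[Formula, str]]       # (effect condition, proposition) pairs
    delete: List[Tuple[Formula, str]]    # (effect condition, proposition) pairs


def applicable(action: Action, state: State) -> bool:
    """An action is applicable when the state satisfies its precondition."""
    return action.pre(state)


def apply_action(action: Action, state: State) -> State:
    """Fire every conditional effect whose condition holds in the current state."""
    added = {p for cond, p in action.add if cond(state)}
    deleted = {p for cond, p in action.delete if cond(state)}
    # Assumption: deletes are applied first, then adds.
    return frozenset((state - deleted) | added)


# Hypothetical action: requires p0, always adds p1, deletes p0 if p1 was false.
move = Action(
    pre=lambda s: "p0" in s,
    add=[(lambda s: True, "p1")],
    delete=[(lambda s: "p1" not in s, "p0")],
)
```

Note that all effect conditions are evaluated against the state before the action is applied, which is the usual reading of conditional effects.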

Modern classical planners such as Fast Downward [\citeauthoryearHelmert2004] take a PDDL [\citeauthoryearMcDermott2000] input which specifies the above planning problem, and return an action sequence that reaches the goal state from the initial state. Recent planners typically convert a propositional planning model into the SAS+ [\citeauthoryearBäckström and Nebel1995] format. During this conversion, the disjunctions in the action model must be eliminated by moving them to the root of the formula and splitting the actions [\citeauthoryearHelmert2009].

#### Latplan

Latplan [\citeauthoryearAsai and Fukunaga2018] is a framework for domain-independent image-based classical planning. It learns the state representation as well as the transition rules entirely from the image-based observation of the environment with deep neural networks and solves the problem using a classical planner.

Latplan takes two inputs. The first input is the transition input Tr, a set of pairs of raw data randomly sampled from the environment. The $i$-th data pair in the dataset represents a randomly sampled transition from one environment observation to another, where some unknown action took place. The second input is the planning input, a pair of raw data corresponding to the initial and the goal state of the environment. The output of Latplan is a data sequence representing a plan execution that reaches the goal state from the initial state. While the original paper used an image-based implementation (“data” = raw images), the type of data is arbitrary as long as it is compatible with neural networks.

Latplan works in 3 steps. In Step 1, a State AutoEncoder (SAE) (Fig. 2, left) neural network learns a bidirectional mapping between raw data (e.g., images) and propositional states, where each propositional state is represented by a bit vector of fixed dimension. The network consists of two functions, Encode and Decode: Encode maps an image to a propositional state, and Decode maps the propositional state back to an image. The training is performed by minimizing the reconstruction loss between the input and the decoded image under some norm (e.g., Mean Square Error for images). In order to guarantee that the latent state is a binary vector, the network must use a discrete latent representation learning method such as Gumbel Softmax [\citeauthoryearJang, Gu, and Poole2017, \citeauthoryearMaddison, Mnih, and Teh2017] or a step function with a straight-through estimator [\citeauthoryearKoul, Fern, and Greydanus2019, \citeauthoryearBengio, Léonard, and Courville2013]. We used Gumbel Softmax, an annealing-based continuous relaxation, in this paper. After learning the mapping, the SAE encodes the image transitions in Tr into propositional transitions. In Step 2, an Action Model Acquisition (AMA) method learns an action model from the propositional transitions. In Step 3, a planning problem instance is generated from the planning input: the initial and goal observations are converted to discrete states, and the classical planner finds a path connecting them. For example, an 8-puzzle problem instance consists of an image of the start (scrambled) configuration of the puzzle and an image of the solved state. In the final step, Latplan obtains a step-by-step, human-comprehensible visualization of the plan execution by decoding the latent bit vector of each intermediate state and, for evaluation purposes, validates the visualized result using a custom domain-specific validator. This is because the SAE-generated latent bit vectors are learned without supervision and are not directly verifiable through human knowledge.

#### Action Model Acquisition (AMA)

The original Latplan paper proposed two approaches for AMA. AMA$_1$ is an oracular model that directly generates a PDDL without learning, and AMA$_2$ is a neural model that approximates AMA$_1$ by learning from examples.

AMA$_1$ is an oracular, idealistic AMA that does not incorporate machine learning, and instead generates the entire set of propositional state transitions from the entire set of image transitions in the search space. Each propositional transition is turned into a single, grounded action schema. For example, in a state space represented by 2 latent propositions, each observed transition from one bit vector to another is translated into one action whose precondition and effects are fully determined by that pair of states. It is impractical not only because it requires the entire set of image transitions, but also because the size of the PDDL is proportional to the number of transitions in the state space, slowing down the PDDL-SAS+ translation, preprocessing, and heuristic calculation at each search node.
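To make the idea of one grounded action per transition concrete, the following sketch emits a PDDL action string from a single pair of bit vectors. The exact encoding used by the oracular AMA is not spelled out in this section, so the choices here are assumptions: the precondition is the full current state, the effects list only the bits that change, and the proposition names `p0`, `p1`, ... and the function `transition_to_pddl` are hypothetical.

```python
from typing import Sequence


def literal(index: int, value: int) -> str:
    """PDDL literal for latent bit `index` being true (1) or false (0)."""
    atom = f"(p{index})"
    return atom if value else f"(not {atom})"


def transition_to_pddl(name: str, before: Sequence[int], after: Sequence[int]) -> str:
    """Turn one observed transition into one grounded action (illustrative only)."""
    # Assumption: the precondition pins down every bit of the current state.
    pre = " ".join(literal(i, b) for i, b in enumerate(before))
    # Assumption: the effect lists only the bits that differ between the two states.
    eff = " ".join(
        literal(i, a) for i, (b, a) in enumerate(zip(before, after)) if a != b
    )
    return (
        f"(:action {name}\n"
        f"  :precondition (and {pre})\n"
        f"  :effect (and {eff}))"
    )


# Hypothetical transition in a 2-bit latent space: (0, 1) -> (1, 0).
example_action = transition_to_pddl("a0", (0, 1), (1, 0))
```

Because every transition yields its own action, the number of actions in the resulting domain file grows linearly with the number of transitions, which is the scaling problem noted above.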

AMA$_2$ consists of two neural networks: Action AutoEncoder (AAE) and Action Discriminator (AD). AAE is an autoencoder that learns to cluster the state transitions into a (preset) finite number of action labels. See Fig. 2 (middle) for the illustration.

AAE’s encoder takes a propositional state pair (the current state and the successor state) as the input. The last layer of the encoder is activated by a discrete activation function (such as Gumbel Softmax) to become a one-hot vector whose number of categories is a hyperparameter specifying the maximum number of action labels; the active category represents an action label. For clarity, we use the one-hot vector and the index interchangeably. AAE’s decoder takes the current state and the action label as the input and outputs a reconstruction of the successor state. The encoder acts as a function that tells “what action has happened” and the decoder can be seen as a progression function (Fig. 2, middle).

AD is a binary classifier that models the preconditions of the actions. AD learns the condition from the observed propositional state transitions and from “fake” state transitions. The fake transitions can be generated by applying a random action to the states in the observed transitions.

This learning task is a Positive-Unlabeled learning task [\citeauthoryearElkan and Noto2008, PU-learning]. While all examples in the observed transitions are guaranteed to be positive (valid) examples obtained from the observations, the examples in the fake transitions are unlabeled, i.e., we cannot guarantee that they are always negative (invalid). Unlike the standard binary classification task, which takes a purely positive and a purely negative dataset, PU-learning takes such a positive and an unlabeled dataset and returns a positive-negative classifier. Under the assumption that the positive examples are i.i.d.-sampled from the entire distribution of positive examples, one can obtain a positive-negative classifier for the input $x$ by correcting the confidence value $g(x)$ of a labeled-unlabeled classifier by the equation $p(\text{positive} \mid x) = g(x) / c$, where $c$ is a constant computed after the training of $g$ as the average of $g(x)$ over a positive validation set [\citeauthoryearElkan and Noto2008]. In AD, the input $x$ is the concatenation of the propositional current state and the successor state, unlike the standard STRIPS setting where the precondition only sees the current state.
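The confidence correction of Elkan and Noto can be sketched as follows. The sketch assumes a labeled-unlabeled classifier `g` is already trained and is exposed as a plain Python callable returning a probability; the names `estimate_c` and `pu_probability`, the clipping to 1.0, and the toy classifier are illustrative assumptions rather than the implementation used in the experiments.

```python
from typing import Callable, Sequence

Scorer = Callable[[float], float]   # g(x): probability that x is labeled


def estimate_c(g: Scorer, positive_validation: Sequence[float]) -> float:
    """Estimate the constant c as the mean of g over held-out positive examples."""
    scores = [g(x) for x in positive_validation]
    return sum(scores) / len(scores)


def pu_probability(g: Scorer, c: float, x: float) -> float:
    """Corrected probability that x is positive: g(x) / c, clipped to at most 1."""
    return min(1.0, g(x) / c)


def toy_g(x: float) -> float:
    """Hypothetical labeled-vs-unlabeled classifier on a 1-D toy input."""
    return 0.4 if x > 0 else 0.05


c_hat = estimate_c(toy_g, [1.0, 2.0, 3.0])
```

Intuitively, `c` measures how often a truly positive example ends up labeled, so dividing by it rescales the labeled-unlabeled score into a positive-negative score.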

Combining AAE and AD yields a successor function that can be used for graph search algorithms: It first enumerates the potential successor states from the current state by iterating over the action labels and applying the AAE decoder, then prunes the generated successor states using AD, i.e., by checking whether AD classifies each transition as valid. The major drawback of this approach is that both AAE and AD are black-box neural networks, and thus are incompatible with the standard PDDL-based planners and heuristics, and require a custom heuristic graph search solver.

## 3 Double-Stage Learning

To overcome the lack of PDDL compatibility of the black-box NNs in AMA$_2$, we propose the Double-Stage Action Model Acquisition method (DSAMA), which consists of 3 steps: (1) It trains the same AAE networks to identify actions and perform the clustering, (2) transfers the knowledge to a set of Random Forest binary classifiers (Fig. 2, right), then finally (3) converts the classifiers into logical preconditions / effects in PDDL. We assume a training process that returns a Random Forest binary classifier and a conversion function that turns a classifier into a logical formula. The overall DSAMA process is shown in Algorithm 1.

In order to learn the action preconditions, DSAMA performs PU-learning following the Action Discriminator (Sec. 2). Similar to AD, it takes both the current and successor states as the input; in the later experiments, we show that the accuracy drops when we limit the input to the current state. Unlike AD in AMA$_2$, DSAMA trains a specific classifier for each action. For the action effects, DSAMA learns the effect condition of the conditional effect in PDDL (Sec. 2). DSAMA iterates over every action and every bit and trains a binary classifier that translates to the effect condition of the corresponding conditional effect.

#### Random Forest (RF)

While the binary classifier in DSAMA could be any binary classifier that could be converted to a logical formula, we chose Random Forest [\citeauthoryearHo1998], a machine learning method based on decision trees. It constructs an ensemble of decision trees and averages the predictions returned by each tree. We do not describe the details of its training, which is beyond the scope of this paper. It is one of the most widely used rule-based learning algorithms, and its implementations are available in various machine learning packages.

A decision tree for classification consists of decision nodes and leaf nodes. A decision node is a 4-tuple consisting of a feature index, a threshold, and the left / right child nodes. A leaf node contains a class probability vector whose length is the number of classes to be classified, which is 2 in our binary case. To classify a feature vector, the tree compares the indexed feature against the threshold at each decision node and recurses into the left or right child depending on the outcome. When a single tree is used for classification, it takes the $\arg\max$ over the probability vector at the leaf node and returns the result as the predicted classification. For an ensemble of decision trees, classification is performed either by taking the average of the probability vectors returned by the trees and then taking the $\arg\max$ over the classes, or by taking the $\arg\max$ at each leaf node and returning the majority class voted by the trees.
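The two ensemble inference schemes described above (probability averaging and majority voting) can be sketched as below. This is a minimal illustration and not tied to any particular Random Forest library; the class names, the convention that a feature value at or below the threshold goes to the left child, and the toy forest are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union


@dataclass
class Leaf:
    probs: Sequence[float]   # class probability vector


@dataclass
class Node:
    feature: int             # feature index
    threshold: float
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]


def leaf_probs(tree: Tree, x: Sequence[float]) -> Sequence[float]:
    """Walk from the root to a leaf and return its class probability vector."""
    while isinstance(tree, Node):
        # Assumption: values at or below the threshold go to the left child.
        tree = tree.left if x[tree.feature] <= tree.threshold else tree.right
    return tree.probs


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=lambda k: values[k])


def predict_average(forest: List[Tree], x: Sequence[float]) -> int:
    """Average the probability vectors of all trees, then take the arg max."""
    vectors = [leaf_probs(tree, x) for tree in forest]
    mean = [sum(column) / len(vectors) for column in zip(*vectors)]
    return argmax(mean)


def predict_vote(forest: List[Tree], x: Sequence[float]) -> int:
    """Take the arg max in every tree, then return the most voted class."""
    votes = [argmax(leaf_probs(tree, x)) for tree in forest]
    return max(set(votes), key=votes.count)


# Toy forest on which the two schemes disagree.
toy_forest = [
    Node(0, 0.5, Leaf([0.9, 0.1]), Leaf([0.2, 0.8])),
    Node(1, 0.5, Leaf([0.6, 0.4]), Leaf([0.4, 0.6])),
    Leaf([0.55, 0.45]),
]
```

On the input `[1, 0]` the toy forest predicts class 1 by averaging but class 0 by voting, which illustrates that the two schemes are not interchangeable in general.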

Since STRIPS/PDDL cannot directly represent the numeric averaging operation, we simulate the voting-based inference of Random Forest in the PDDL framework by compiling the RF into a majority gate boolean circuit.

First, converting a decision tree into a boolean logic formula is straightforward. Since we assume a binary input dataset, the decision nodes and the leaf nodes can be recursively converted into a Negation Normal Form as in the toPDDL(tree) method (Algorithm 2).
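A recursive conversion of this kind can be sketched as follows. It is not a transcription of Algorithm 2: the tuple-based tree encoding, the proposition names `z0`, `z1`, ..., and the use of the empty `(and)` / `(or)` for true / false are assumptions made for this illustration.

```python
from typing import Tuple, Union

# Assumed tree encoding for binary inputs:
#   ("leaf", probability_of_class_1)
#   ("node", feature_index, subtree_if_feature_is_0, subtree_if_feature_is_1)
Tree = Union[Tuple[str, float], Tuple[str, int, tuple, tuple]]


def to_pddl(tree: Tree) -> str:
    """Return a PDDL formula that is true exactly when the tree predicts class 1."""
    if tree[0] == "leaf":
        # An empty conjunction is true and an empty disjunction is false.
        return "(and)" if tree[1] > 0.5 else "(or)"
    _, index, if_false, if_true = tree
    atom = f"(z{index})"
    # Negation appears only directly on atoms, so the result stays in NNF.
    return (
        f"(or (and (not {atom}) {to_pddl(if_false)}) "
        f"(and {atom} {to_pddl(if_true)}))"
    )


stump = ("node", 0, ("leaf", 0.1), ("leaf", 0.9))
stump_formula = to_pddl(stump)
```

Each decision node contributes one disjunction, so the number of disjunctions in the output grows with the number of decision nodes in the tree.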

Next, recall that we now take the votes from all trees and choose the class with the largest number of votes as the final classification prediction. Since we are handling binary classification, finding the most voted class is equivalent to computing a majority function [\citeauthoryearLee and Jen1992], a boolean function of $n$ fan-ins and a single fan-out which returns 1 when more than $n/2$ inputs are 1. One practical approach for implementing such a function is based on the bitonic sorting network [\citeauthoryearKnuth1997, \citeauthoryearBatcher1968]. First, we apply the bitonic sorting algorithm proposed by Batcher (Algorithm 2) to the boolean outputs of the trees, except that instead of taking the $\max$ and $\min$ in the compareAndSwap function, we take the $\lor$ and $\land$ of the elements being swapped, because $\max(x, y) = x \lor y$ and $\min(x, y) = x \land y$ for 0/1 values seen as boolean values. We then use the middle element of the sorted result as the output. See examples in Fig. 3, Fig. 4, Fig. 1.
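The majority gate built from a bitonic sorting network can be sketched as follows. The sketch assumes the number of inputs is a power of two, sorts in descending order, and reads the output at index `n // 2`; these choices, as well as the function names, are assumptions for illustration and may differ from the construction in the paper.

```python
from typing import List, Sequence


def bitonic_sort_desc(bits: Sequence[bool]) -> List[bool]:
    """Sort booleans in descending order with a bitonic network of OR/AND gates."""
    n = len(bits)
    assert n > 0 and n & (n - 1) == 0, "this sketch assumes a power-of-two length"
    v = list(bits)
    k = 2
    while k <= n:                      # size of the bitonic blocks being merged
        j = k // 2
        while j >= 1:                  # distance between compared wires
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    high = v[i] or v[partner]     # plays the role of max
                    low = v[i] and v[partner]     # plays the role of min
                    if i & k == 0:                # descending block
                        v[i], v[partner] = high, low
                    else:                         # ascending block
                        v[i], v[partner] = low, high
            j //= 2
        k *= 2
    return v


def majority(bits: Sequence[bool]) -> bool:
    """True iff more than half of the inputs are true."""
    return bitonic_sort_desc(bits)[len(bits) // 2]
```

Since the network only uses OR and AND on pairs of wires, unrolling the loops yields a fixed boolean circuit that can be written directly as a nested PDDL formula over the per-tree formulae.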

Finally, since our preconditions take the current and the successor states as the input, we need to take care of the decision nodes that point to the successor state. The input vector to the random forest has twice as many dimensions as the binary latent space, where the first and the second halves of the dimensions correspond to the current and the successor state, respectively. In toPDDL, when the index of the decision node points into the second half, we insert the effect condition of the corresponding successor bit instead of a proposition, because our DSAMA formulation guarantees that the successor bit is true when the effect condition is satisfied. This can be seen as a trick to implement a one-step lookahead in the action model.

## 4 Evaluation

We evaluated our approach on the dataset used by Asai and Fukunaga, which consists of 5 image-based domains. MNIST 8-puzzle is an image-based version of the 8-puzzle, where tiles contain hand-written digits (0-9) from the MNIST database [\citeauthoryearLeCun et al.1998]. Valid moves in this domain swap the “0” tile with a neighboring tile, i.e., the “0” serves as the “blank” tile in the classic 8-puzzle. The Scrambled Photograph 8-puzzle (Mandrill, Spider) cuts and scrambles real photographs, similar to the puzzles sold in stores. These differ from the MNIST 8-puzzle in that “tiles” are not cleanly separated by black regions (we re-emphasize that Latplan has no built-in notion of square or movable region). LightsOut is a video game where a 4x4 grid of lights is in some on/off configuration, and pressing a light toggles its state as well as the states of its neighbors. The goal is all lights Off. Twisted LightsOut distorts the original LightsOut game image by a swirl effect, showing that Latplan is not limited to handling rectangular “objects”/regions. In all domains, we used 9000 transitions for training and 1000 transitions for testing. Note that 8-puzzle contains 362880 states and 967680 transitions, and LightsOut contains 65536 states and 1048576 transitions.
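The state and transition counts quoted above can be reproduced with a short calculation. The derivation is our own reading of the numbers: the 8-puzzle figure counts all 9! tile arrangements and, for each blank position on the 3x3 grid, the number of neighboring cells; the LightsOut figure counts all on/off configurations of the 4x4 grid and one transition per button.

```python
from math import factorial


def eight_puzzle_counts() -> tuple:
    """States and transitions of the 8-puzzle over all 9! tile arrangements."""
    states = factorial(9)
    # Number of moves of the blank, summed over its 9 possible positions.
    moves = sum(
        (r > 0) + (r < 2) + (c > 0) + (c < 2)
        for r in range(3)
        for c in range(3)
    )
    # For each blank position there are 8! arrangements of the other tiles.
    transitions = factorial(8) * moves
    return states, transitions


def lights_out_counts(side: int = 4) -> tuple:
    """States and transitions of LightsOut on a side x side grid."""
    states = 2 ** (side * side)
    transitions = states * side * side   # every button can be pressed in every state
    return states, transitions
```

With 9000 training transitions, the learner therefore sees under 1% of the transitions in either state space.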

We used an SAE that produces 100 latent propositions. Following the work of [\citeauthoryearAsai and Kajino2019], we used the standard version of SAE and a regularized version of SAE (ZSAE) with a fixed regularization constant. For the AAE, we tuned the upper bound on the number of actions by iteratively increasing it until the mean absolute error of the AAE goes below 0.01, i.e., below 1 bit on average. We keep this upper bound as small as possible because a large upper bound reduces the number of transitions that fall into a single action label and makes the random forest training harder due to the lack of examples.

### 4.1 Accuracy

We compared the accuracy of DSAMA and AMA$_2$. DSAMA has two primary controlling hyperparameters for Random Forest: the maximum depth of the tree and the number of trees. Other hyperparameters of Random Forest follow the standard parameters for classification tasks: Entropy-based variable selection, out-of-bag ratio 0.33 (each decision tree is trained on a random subset of the entire dataset), and a number of variables considered by each tree that is determined by the dimensionality of the dataset. (Note: this is not equivalent to the tree depth because the same variable may be selected several times.)

We first compared the successor generation accuracy between AAE (black-box NN model) and DSAMA. The dataset is divided into the training set and the test set by 9:1. AAE uses the same hyperparameters used in Latplan [\citeauthoryearAsai and Fukunaga2018]. DSAMA uses a Random Forest with the largest number of trees and the largest maximum tree depth that we used in this paper. Note that, in general, Random Forest is claimed to achieve monotonically higher accuracy as the ensemble size and the depth increase. Table 1 shows the average reconstruction accuracy for the successor states over bits, over all transitions in the test dataset. The results indicate that DSAMA based on Random Forest is competitive against the black-box neural model.

Next, we compared the F-measure based on the true positive rate (=recall) and the true negative rate (=specificity) of the black-box precondition model (Action Discriminator) and the DSAMA precondition model using Random Forest. Note that this task is not only a PU-learning task, but also a classification task on a potentially highly imbalanced dataset and therefore we cannot use the accuracy as the evaluation metric as it could be skewed toward the majority dataset [\citeauthoryearWallace et al.2011].
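The motivation for this metric can be illustrated numerically. The exact formula is not spelled out in this section, so the sketch below assumes the F-measure is the harmonic mean of recall and specificity; the function names and the confusion-matrix counts are hypothetical.

```python
def f_measure(tp: int, fn: int, tn: int, fp: int) -> float:
    """Assumed definition: harmonic mean of recall (TPR) and specificity (TNR)."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    if recall + specificity == 0:
        return 0.0
    return 2 * recall * specificity / (recall + specificity)


def accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical imbalanced test set: 100 positives, 10 negatives.
counts = dict(tp=90, fn=10, tn=5, fp=5)
```

On these counts the accuracy is about 0.86 even though half of the negatives are misclassified, while the harmonic mean of recall and specificity is about 0.64 and reflects the weak minority-class performance.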

Similar results are obtained in Table 2: The rule-based method (Random Forest) is competitive against the black-box method when a sufficiently large capacity is provided to the model. To address the concern about using the successor states as part of the precondition, we also tested variants that learn only from the current state. We observed a significant drop in accuracy both in the black-box NN (AD) and in DSAMA.

Next, in order to see the effect of the random forest hyperparameters on the learned results, we performed an exhaustive experiment over combinations of the maximum depth and the number of trees, and compared the precondition accuracy, the effect accuracy and the size of the PDDL files. Note that an ensemble of size one is the degenerate case of a single decision tree without ensembles. Due to space constraints, we only show the results for Mandrill 8-Puzzle with ZSAE, but the overall characteristics were the same across domains and the choice of SAE / ZSAE.

We observed that the effect of a larger depth and a larger number of trees saturates quickly, while small values negatively affect the performance. The action applicability prediction (i.e., the precondition accuracy, Table 3, left) tends to be more affected by the depth, while the successor state reconstruction accuracy (i.e., the effect accuracy, Table 3, middle) tends to be more affected by the number of trees. A larger depth and a larger number of trees also imply larger file sizes. (Note: When generating the PDDL file, we apply De Morgan’s law and simplify obvious invariants when one is encountered.)

### 4.2 Evaluation in the Latent Space

To measure the effectiveness of our approach for planning, we ran Fast Downward on the domain PDDL files generated by the DSAMA system.

In each of the 5 domains, we generated 20 problem instances by generating the initial state with a random walk from the goal state using a problem-specific simulator. 10 instances are generated with 7 steps away from the goal state while the others are generated with 14 steps.
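Instance generation by a random walk from the goal can be sketched for LightsOut as below. This is not the problem-specific simulator used in the experiments; the state encoding (the set of lit cells), the function names, and the seeding scheme are assumptions for illustration.

```python
import random
from typing import FrozenSet, Tuple

SIDE = 4
Cell = Tuple[int, int]
State = FrozenSet[Cell]     # assumption: a state is the set of cells that are lit


def press(state: State, row: int, col: int) -> State:
    """Toggle the pressed light and its in-grid orthogonal neighbors."""
    candidates = [(row, col), (row - 1, col), (row + 1, col),
                  (row, col - 1), (row, col + 1)]
    toggled = {(r, c) for r, c in candidates if 0 <= r < SIDE and 0 <= c < SIDE}
    return frozenset(state ^ toggled)


def random_walk_instance(steps: int, seed: int = 0) -> State:
    """Start from the goal (all lights off) and press `steps` random lights."""
    rng = random.Random(seed)
    state: State = frozenset()
    for _ in range(steps):
        state = press(state, rng.randrange(SIDE), rng.randrange(SIDE))
    return state
```

The walk length is only an upper bound on the distance to the goal, since random presses can cancel each other (pressing the same light twice restores the previous state).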

We tested three scenarios: Blind search with $A^*$, FF heuristics [\citeauthoryearHoffmann and Nebel2001] with Greedy Best First Search, and max-heuristics [\citeauthoryearHaslum and Geffner2000] with $A^*$. We gave the planner a 1-hour time limit and a maximum of 256 GB of memory.

We tested these configurations on a variety of PDDL domain files generated with different numbers of trees and maximum depths. As stated in the introduction, despite our RF models achieving high accuracy in terms of prediction, we did not manage to find a plan using Fast Downward. The failure modes are threefold: The planner failed to find the goal after exhaustively searching the state space, the initial heuristic value was infinity in the reachability analysis (with the FF and max heuristics), or the problem transformation to SAS+ did not finish within the resource limit.

From the results in the previous tables, the reason for the failure is obvious: There is a trade-off between the accuracy and the PDDL file size. When the PDDL model is inaccurate, the search graph becomes disconnected and the search fails. If we increase the accuracy of the PDDL model, the file size increases and Fast Downward fails even to start the search. Moreover, we observed that the translation fails even with a PDDL domain file of moderate size (e.g., 11MB).

In order to narrow down the reason for failure, we tested the domain files whose preconditions are removed, i.e., replaced with (and) and made always applicable. We expected the planner to find some sequence of actions, which may not be a valid solution. The results were the same: The goal state is unreachable for the small PDDL files due to the lack of accuracy, and the translation does not finish for the large PDDL files. Considering the fact that the effect of an action is modeled by one random forest per latent bit while the precondition is modeled by a single random forest, we conclude that the effect modeling is the main bottleneck of the translator failure. Note that, however, the maximum accuracy of the effect modeling with DSAMA is comparable to the neural model and quite high. We analyze this phenomenon in the next section.

## 5 Discussion

Our experiments showed that the Random Forest based DSAMA approach does not work even if it achieves the same or superior accuracy with the best hyperparameters. The main bottleneck turned out to be the effect modeling, which is accurate but is too complex for Fast Downward to handle. Based on this observation, one question arises: Can the translator bottleneck be addressed just by using a different rule-based learning method, such as MAX-SAT based approaches [\citeauthoryearYang, Wu, and Jiang2007] or planning based approaches [\citeauthoryearAineto, Jiménez, and Onaindia2018]? We argue that this is not the case because (1) our Random Forest based DSAMA approach can be considered as the upper bound of existing Action Model Acquisition methods in terms of accuracy and (2) should the same accuracy be achieved by other approaches, the resulting PDDL must have the same complexity. We explain the reasoning below.

First, we note that the translation failure is due to the heavy use of disjunctions in the PDDL file for the compilation of Random Forest because, in Fast Downward, disjunctions are “flattened” [\citeauthoryearHelmert2009], i.e., compiled away by making separate actions for each branch of the disjunction. This causes an exponential blowup when a huge number of disjunctions are presented to the translator, which is exactly the case in our scenario. The use of effect conditions is not an issue because Fast Downward uses them directly.
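The blowup caused by compiling disjunctions away can be illustrated with a toy flattening routine. This is not Fast Downward's translator code; it only distributes a conjunction of disjunctions into disjunction-free conjunctions, each of which would correspond to one split action. The function name and the literal names are hypothetical.

```python
from itertools import product
from typing import FrozenSet, List, Sequence


def flatten(conjunction_of_disjunctions: Sequence[Sequence[str]]) -> List[FrozenSet[str]]:
    """Pick one disjunct from every disjunction; each choice is one flat conjunction."""
    return [frozenset(choice) for choice in product(*conjunction_of_disjunctions)]


def binary_clauses(n: int) -> List[List[str]]:
    """A precondition of the form (a0 or b0) and (a1 or b1) and ... (n clauses)."""
    return [[f"a{i}", f"b{i}"] for i in range(n)]


split_counts = {n: len(flatten(binary_clauses(n))) for n in (1, 5, 10)}
```

With `n` independent binary disjunctions the number of split actions is `2 ** n`, so a formula with only a few dozen such disjunctions already exceeds any practical limit.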

Next, in order to avoid this exponential blowup, the resulting rules learned by the binary classifier must be disjunction-free. In fact, existing approaches [\citeauthoryearYang, Wu, and Jiang2007, \citeauthoryearAineto, Jiménez, and Onaindia2018] learn disjunction-free action models. One trivial approach to achieve this in DSAMA is to compile a decision tree into Decision Lists [\citeauthoryearCohen1995], the degenerate case of decision trees where the children (left, right) of every decision node can contain at most one decision node. However, this is trivially ineffective because compiling a decision tree into a decision list is equivalent to how Fast Downward makes the actions disjunction-free by splitting them. Both cases end up in an exponentially large list of disjunction-free actions.

Finally, given that our Random Forest based DSAMA achieved almost-perfect accuracy in the successor generation task (effect condition), we could argue that the rules generated by our approach are quite close to the ground truth rules and, therefore, the ground truth rules are at least as complex as the rules found by DSAMA. Therefore, if the existing approaches achieved the same accuracy on the same task, their resulting disjunction-free set of conditions would be as large and complex as the exponentially large “flattened” form of our rules. This argument also applies to the variants of DSAMA using Decision List based classifiers (e.g., [\citeauthoryearHolte1993, OneR],[\citeauthoryearCohen1995, RIPPER],[\citeauthoryearMaliotov and Meel2018, MLIC]).

## 6 Related Work

Traditionally, symbolic action learners tend to require a certain type of human domain knowledge and have been positioned merely as assistance tools for humans, rather than as systems that build knowledge from scratch, e.g., from unstructured images. Many systems require a structured input representation (i.e., First Order Logic) that is partially hand-crafted, and they exploit the symmetry and the structures provided by the structured representation, although the requirements of the systems may vary. For example, some systems require state sequences [\citeauthoryearYang, Wu, and Jiang2007], while others require action sequences [\citeauthoryearCresswell and Gregory2011, \citeauthoryearCresswell, McCluskey, and West2013]. Some support noisy input [\citeauthoryearMourão et al.2012, \citeauthoryearZhuo and Kambhampati2013], partial observations in a state and missing state/actions in a plan trace [\citeauthoryearAineto, Jiménez, and Onaindia2018], or a disordered plan trace [\citeauthoryearZhuo, Peng, and Kambhampati2019]. Approach-wise, they can be grouped into 3 categories: MAX-SAT based approaches [\citeauthoryearYang, Wu, and Jiang2007, \citeauthoryearZhuo and Kambhampati2013, \citeauthoryearZhuo, Peng, and Kambhampati2019], Object-centric approaches [\citeauthoryearCresswell and Gregory2011, \citeauthoryearCresswell, McCluskey, and West2013] and learning-as-planning approaches [\citeauthoryearAineto, Jiménez, and Onaindia2018]. AMA$_2$ and DSAMA work on a factored but non-structured propositional representation. While we do not address the problem of lifting the action description, combining these approaches with the FOL symbols (relations/predicates) found by NN [\citeauthoryearAsai2019] is an interesting avenue for future work.

There are several lines of work that extract a PDDL action model from a natural language corpus. Framer [\citeauthoryearLindsay et al.2017] uses a CoreNLP language model while EASDRL [\citeauthoryearFeng, Zhuo, and Kambhampati2018] uses Deep Reinforcement Learning [\citeauthoryearMnih et al.2015]. The difference from our approach is that they reuse the symbols found in the corpus, while we generate the discrete propositional symbols from visual perception, which completely lacks such a predefined set of discrete symbols.

While there are recent efforts in handling the complex state space without having the action description [\citeauthoryearFrances et al.2017], action models could be used for other purposes, including Goal Recognition [\citeauthoryearRamírez and Geffner2009], macro-action generation [\citeauthoryearBotea and Braghin2015, \citeauthoryearChrpa, Vallati, and McCluskey2015], or plan optimization [\citeauthoryearChrpa and Siddiqui2015].

There are three lines of work that learn the binary representation of the raw environment. Latplan SAE [\citeauthoryearAsai and Fukunaga2018] uses the Gumbel-Softmax VAE [\citeauthoryearMaddison, Mnih, and Teh2017, \citeauthoryearJang, Gu, and Poole2017] which was modified from the original to maximize the KL divergence term for the Bernoulli distribution [\citeauthoryearAsai and Kajino2019]. Causal InfoGAN [\citeauthoryearKurutach et al.2018] uses a GAN [\citeauthoryearGoodfellow et al.2014]-based approach combined with a Gumbel Softmax prior and a Mutual Information prior. Mutual Information and the negated KL term are both the same entropy term, i.e., the randomness of the latent vector given a particular input image. Latplan ZSAE [\citeauthoryearAsai and Kajino2019] additionally penalizes the “true” category in the binary categorical distribution to suppress the chance of random flips in the latent vector caused by the input noise. It was shown that these random flips negatively affect the performance of the recipient symbolic systems by violating the uniqueness assumption of the representation, dubbed the “symbol stability problem”. Quantized Bottleneck Network [\citeauthoryearKoul, Fern, and Greydanus2019] uses quantized activations (i.e., step functions) in the latent space to obtain the discrete representation. It trains the network with the Straight-Through gradient estimator [\citeauthoryearBengio, Léonard, and Courville2013], which enables backpropagation through the step function. There are more complex variations such as VQVAE [\citeauthoryearvan den Oord, Vinyals, and others2017], DVAE++ [\citeauthoryearVahdat et al.2018], and DVAE# [\citeauthoryearVahdat, Andriyash, and Macready2018].

In the context of modern machine learning, Deep Reinforcement Learning (DRL) has solved complex problems, including Atari video games [\citeauthoryearMnih et al.2015, DQN] and the game of Go [\citeauthoryearSilver and others2016, AlphaGo]. However, both have a hard-coded list of action symbols (e.g., levers, Fire button, grids to put stones) and rely on a hard-coded simulator for both learning and correct execution.

In another line of work, Neural Networks model the external environment captured by video cameras by explicitly taking the temporal dependency into account [\citeauthoryearLotter, Kreiman, and Cox2017], unlike Latplan SAE, which processes each image frame one by one.

## 7 Conclusion

In this paper, we negatively answered the question of whether simply replacing a neural, black-box Action Model Acquisition model with a rule-based machine learning model would generate a useful descriptive action model from raw, unstructured input. Our approach combines a neural unsupervised learning approach to action label generation with precondition/effect-condition learning using state-of-the-art rule-based machine learning. While the proposed method was able to generate accurate PDDL models, the models are too complex for a standard planner to preprocess within reasonable runtime and memory.

It is worth noting that the rather straightforward modeling of effects in DSAMA causes such a huge problem. Planning domains written by humans tend to have a specific human-originated property that makes this type of phenomenon less frequent, and this might be reflected in the fact that STRIPS (without disjunctions) was the first common language adopted by the community. Unlike planning models written by humans, we found that the set of propositions generated by the State AutoEncoder network, as well as the set of action labels generated by the Action AutoEncoder network, do not have such a property. The state space and the clustering of transitions are “less organized” compared to typical human models, and the lack of human-like regularization makes the otherwise trivial task of PDDL-SAS+ translation intractable in a modern planner.

The future directions are twofold. The first is to find the right regularization or the right architecture for the neural networks in order to further constrain the space of the ground-truth transition model in the latent space. This is similar to the approach pursued by [\citeauthoryearAsai and Kajino2019], which tries to suppress the instability of the propositional values in the latent space. The machine learning community is increasingly focusing on disentangled representation learning [\citeauthoryearHiggins et al.2017], which tries to separate the meanings of the feature values in the latent space. Finding the right structural bias for neural networks has a long history, notably convolutional neural networks [\citeauthoryearFukushima1980, \citeauthoryearLeCun et al.1989, \citeauthoryearKrizhevsky, Sutskever, and Hinton2012] for images, and LSTMs [\citeauthoryearHochreiter and Schmidhuber1997] and transformers [\citeauthoryearVaswani et al.2017] for sequence modeling.

The second direction is to develop a planner that can directly and efficiently handle complex logical conditions. Fast Downward requires converting the input PDDL into SAS+ with a rather slow translator, on the assumption that this conversion tends to be easy and tractable. While this may hold for most hand-crafted domains (such as IPC domains), it may not be a viable approach when the symbolic input is generated by neural networks.
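As a rough illustration of why complex conditions can overwhelm a translation step, the sketch below assumes that disjunctions are eliminated by rewriting each condition into disjunctive normal form, with one conjunctive term per resulting operator or conditional effect. This is an assumption made for the example rather than a description of any particular translator's internals, and the helper names and the integer literal encoding are hypothetical. A conjunction of n two-literal disjunctions over distinct propositions expands into 2^n conjunctive terms.

```python
from itertools import product


def cnf_to_dnf(clauses):
    """Distribute a conjunction of disjunctions into a set of conjunctive terms.

    Literals are nonzero integers: k stands for a proposition, -k for its negation.
    """
    terms = set()
    for choice in product(*clauses):
        term = frozenset(choice)
        if any(-lit in term for lit in term):
            continue  # drop contradictory terms such as {p, not p}
        terms.add(term)
    return terms


def disjoint_clauses(n):
    """n two-literal clauses over 2n distinct propositions (hypothetical input)."""
    return [(2 * i + 1, 2 * i + 2) for i in range(n)]


# Number of conjunctive terms after expansion, for growing condition sizes.
sizes = {n: len(cnf_to_dnf(disjoint_clauses(n))) for n in (2, 5, 10)}
```

The sketch performs no simplification beyond dropping contradictory terms, so it shows the worst case for this input shape. Formulae compiled from learned classifiers need not have this exact shape, but the example shows how quickly a normalization step can grow.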

### Footnotes

- We used cl-random-forest [\citeauthoryearImai2017].

### References

- Aineto, D.; Jiménez, S.; and Onaindia, E. 2018. Learning STRIPS Action Models with Classical Planning. In Twenty-Eighth International Conference on Automated Planning and Scheduling.
- Amado, L.; Pereira, R. F.; Aires, J.; Magnaguagno, M.; Granada, R.; and Meneguzzi, F. 2018a. Goal Recognition in Latent Space. In Proc. of International Joint Conference on Neural Networks (IJCNN).
- Amado, L.; Pereira, R. F.; Aires, J.; Magnaguagno, M.; Granada, R.; and Meneguzzi, F. 2018b. LSTM-based Goal Recognition in Latent Space. arXiv preprint arXiv:1808.05249.
- Asai, M., and Fukunaga, A. 2018. Classical Planning in Deep Latent Space: Bridging the Subsymbolic-Symbolic Boundary. In Proc. of AAAI Conference on Artificial Intelligence.
- Asai, M., and Kajino, H. 2019. Towards Stable Symbol Grounding with Zero-Suppressed State AutoEncoder. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS).
- Asai, M. 2019. Unsupervised Grounding of Plannable First-Order Logic Representation from Images. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS).
- Bäckström, C., and Nebel, B. 1995. Complexity Results for SAS+ Planning. Computational Intelligence 11(4):625–655.
- Batcher, K. E. 1968. Sorting Networks and Their Applications. In Proceedings of the April 30–May 2, 1968, spring joint computer conference, 307–314. ACM.
- Bengio, Y.; Léonard, N.; and Courville, A. 2013. Estimating or Propagating Gradients through Stochastic Neurons for Conditional Computation. arXiv preprint arXiv:1308.3432.
- Botea, A., and Braghin, S. 2015. Contingent versus Deterministic Plans in Multi-Modal Journey Planning. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), 268–272.
- Chen, C.; Liaw, A.; and Breiman, L. 2004. Using Random Forest to Learn Imbalanced Data. Technical Report 666, Department of Statistics, UC Berkeley.
- Chrpa, L., and Siddiqui, F. H. 2015. Exploiting Block Deordering for Improving Planners Efficiency. In Proc. of International Joint Conference on Artificial Intelligence (IJCAI).
- Chrpa, L.; Vallati, M.; and McCluskey, T. L. 2015. On the Online Generation of Effective Macro-Operators. In Proc. of International Joint Conference on Artificial Intelligence (IJCAI).
- Cohen, W. W. 1995. Fast Effective Rule Induction. In Proc. of the International Conference on Machine Learning, 115–123.
- Cresswell, S., and Gregory, P. 2011. Generalised Domain Model Acquisition from Action Traces. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS).
- Cresswell, S.; McCluskey, T. L.; and West, M. M. 2013. Acquiring planning domain models using LOCM. Knowledge Eng. Review 28(2):195–213.
- Elkan, C., and Noto, K. 2008. Learning Classifiers from Only Positive and Unlabeled Data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 213–220. ACM.
- Feng, W.; Zhuo, H. H.; and Kambhampati, S. 2018. Extracting Action Sequences from Texts Based on Deep Reinforcement Learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, 4064–4070.
- Frances, G.; Ramírez, M.; Lipovetzky, N.; and Geffner, H. 2017. Purely Declarative Action Representations are Overrated: Classical Planning with Simulators. In Proc. of International Joint Conference on Artificial Intelligence (IJCAI), 4294–4301.
- Fukushima, K. 1980. Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position. Biological cybernetics 36(4):193–202.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems, 2672–2680.
- Haslum, P., and Geffner, H. 2000. Admissible Heuristics for Optimal Planning. In Proc. of the International Conference on Artificial Intelligence Planning and Scheduling.
- Helmert, M. 2004. A Planning Heuristic Based on Causal Graph Analysis. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS), 161–170.
- Helmert, M. 2009. Concise Finite-Domain Representations for PDDL Planning Tasks. Artificial Intelligence 173(5-6):503–535.
- Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; and Lerchner, A. 2017. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. volume 2, 6.
- Ho, T. K. 1998. The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8):832–844.
- Hochreiter, S., and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation 9(8):1735–1780.
- Hoffmann, J., and Nebel, B. 2001. The FF Planning System: Fast Plan Generation through Heuristic Search. J. Artif. Intell. Res. (JAIR) 14:253–302.
- Holte, R. C. 1993. Very Simple Classification Rules Perform Well on Most Commonly Used Datasets. Machine learning 11(1):63–90.
- Imai, S. 2017. cl-random-forest. https://github.com/masatoi/cl-random-forest.
- Jang, E.; Gu, S.; and Poole, B. 2017. Categorical Reparameterization with Gumbel-Softmax. In Proc. of the International Conference on Learning Representations.
- Knuth, D. E. 1997. The Art of Computer Programming, volume 3. Pearson Education.
- Koul, A.; Fern, A.; and Greydanus, S. 2019. Learning Finite State Representations of Recurrent Policy Networks. In Proc. of the International Conference on Learning Representations.
- Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, 1097–1105.
- Kurutach, T.; Tamar, A.; Yang, G.; Russell, S.; and Abbeel, P. 2018. Learning Plannable Representations with Causal InfoGAN. In Proceedings of ICML / IJCAI / AAMAS 2018 Workshop on Planning and Learning (PAL-18).
- LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; and Jackel, L. D. 1989. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation 1(4):541–551.
- LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-Based Learning Applied to Document Recognition. Proc. of the IEEE 86(11):2278–2324.
- Lee, C. L., and Jen, C.-W. 1992. Bit-Sliced Median Filter Design based on Majority Gate. IEE Proceedings G (Circuits, Devices and Systems) 139(1):63–71.
- Lindsay, A.; Read, J.; Ferreira, J. F.; Hayton, T.; Porteous, J.; and Gregory, P. J. 2017. Framer: Planning Models from Natural Language Action Descriptions. In Proc. of the International Conference on Automated Planning and Scheduling (ICAPS).
- Lotter, W.; Kreiman, G.; and Cox, D. 2017. Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning. In Proc. of the International Conference on Learning Representations.
- Maddison, C. J.; Mnih, A.; and Teh, Y. W. 2017. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In Proc. of the International Conference on Learning Representations.
- Malioutov, D., and Meel, K. S. 2018. MLIC: A MaxSAT-Based Framework for Learning Interpretable Classification Rules. In Proc. of the International Conference on Principles and Practice of Constraint Programming (CP), 312–327. Springer.
- McDermott, D. V. 2000. The 1998 AI Planning Systems Competition. AI Magazine 21(2):35–55.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-Level Control through Deep Reinforcement Learning. Nature 518(7540):529–533.
- Mourão, K.; Zettlemoyer, L. S.; Petrick, R. P. A.; and Steedman, M. 2012. Learning STRIPS Operators from Noisy and Incomplete Observations. In Proc. of the International Conference on Uncertainty in Artificial Intelligence, 614–623.
- Ramírez, M., and Geffner, H. 2009. Plan Recognition as Planning. In Proc. of AAAI Conference on Artificial Intelligence.
- Silver, D., et al. 2016. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature 529(7587):484–489.
- Vahdat, A.; Andriyash, E.; and Macready, W. 2018. DVAE#: Discrete variational autoencoders with relaxed Boltzmann priors. In Advances in Neural Information Processing Systems, 1864–1874.
- Vahdat, A.; Macready, W. G.; Bian, Z.; Khoshaman, A.; and Andriyash, E. 2018. DVAE++: Discrete variational autoencoders with overlapping transformations. arXiv preprint arXiv:1802.04920.
- van den Oord, A.; Vinyals, O.; et al. 2017. Neural Discrete Representation Learning. In Advances in Neural Information Processing Systems, 6306–6315.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in neural information processing systems, 5998–6008.
- Wallace, B. C.; Small, K.; Brodley, C. E.; and Trikalinos, T. A. 2011. Class Imbalance, Redux. In Proc. of IEEE International Conference on Data Mining, 754–763. IEEE.
- Yang, Q.; Wu, K.; and Jiang, Y. 2007. Learning Action Models from Plan Examples using Weighted MAX-SAT. Artificial Intelligence 171(2-3):107–143.
- Zhuo, H. H., and Kambhampati, S. 2013. Action-Model Acquisition from Noisy Plan Traces. In Twenty-Third International Joint Conference on Artificial Intelligence.
- Zhuo, H. H.; Peng, J.; and Kambhampati, S. 2019. Learning Action Models from Disordered and Noisy Plan Traces. arXiv preprint arXiv:1908.09800.