Extracting Action Sequences from Texts Based on Deep Reinforcement Learning

Wenfeng Feng, Hankz Hankui Zhuo, Subbarao Kambhampati,
School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
Department of Computer Science and Engineering, Arizona State University, Tempe, Arizona, US
fengwf@mail2.sysu.edu.cn, zhuohank@mail.sysu.edu.cn, rao@asu.edu

Extracting action sequences from natural-language texts is challenging because it requires commonsense inferences based on world knowledge. Although there has been work on extracting action scripts, instructions, navigation actions, etc., these approaches require either that the set of candidate actions be provided in advance or that action descriptions be restricted to a specific form, e.g., description templates. In this paper, we aim to extract action sequences from texts in free natural language, i.e., without any restricted templates, when the candidate set of actions is unknown. We propose to extract action sequences from texts based on the deep reinforcement learning framework. Specifically, we view “selecting” or “eliminating” words from texts as “actions”, and texts associated with actions as “states”. We then build Q-networks to learn the policy of extracting actions and extract plans from the labelled texts. We demonstrate the effectiveness of our approach on several datasets in comparison with state-of-the-art approaches, including online experiments interacting with humans.

1 Introduction

Artificially intelligent agents, such as robots and unmanned aerial vehicles, are increasingly common in real-world applications. They serve as assistants in homes, labs or public places, transport cargo in warehouses, deliver goods, and so forth. Instruction texts, which describe actions, are an important medium through which robots communicate with humans. Extracting action sequences from action descriptions in natural language is challenging, since it requires agents to understand the complex contexts of actions, given the complicated syntax and semantics of action descriptions.

For example, in Figure 1, given a document of action descriptions (the left part of Figure 1) such as “Cook the rice the day before, or use leftover rice in the refrigerator. The important thing to remember is not to heat up the rice, but keep it cold.”, which describes the procedure of making egg fried rice, an action sequence of “cook(rice), keep(rice, cold)” or “use(leftover rice), keep(rice, cold)” is expected to be extracted from the action descriptions. This task is challenging. For the first sentence, we need to figure out that “cook” and “use” are exclusive (denoted by “EX” in the middle of Figure 1), meaning that we can extract only one of them; for the second sentence, we need to understand that among the three verbs “remember”, “heat” and “keep”, the last one is the best because the goal of this step is to “keep the rice cold” (denoted by “ES”, indicating this action is essential). There is also another action, “Recycle”, denoted by “OP”, indicating this action can be extracted optionally. We also need to consider action arguments, which can be either “EX” or “ES” as well (as shown in the middle of Figure 1). The possible action sequences extracted are shown in the right part of Figure 1. This action sequence extraction problem is different from sequence labelling, which has been successfully dealt with by many approaches such as BiLSTM-CNNs-CRF [?], since we aim to extract “meaningful” or “correct” action sequences (which means some actions should be ignored because they are exclusive), such as “cook(rice), keep(rice, cold)”, instead of “cook(rice), use(leftover rice), remember(thing), heat(rice), keep(rice, cold)” as extracted by [?] (adapted to allow “labelling” both actions and parameters).

Figure 1: Illustration of our action sequence extraction problem

There has been work on extracting action sequences from action descriptions. For example, [?] propose to map instructions to sequences of executable actions using reinforcement learning. [??] interpret natural instructions as action sequences or generate navigational action descriptions using an encoder-aligner-decoder structure. Despite the success of those approaches, they all require a limited set of action names given as input, to which action descriptions are mapped. Another approach, proposed by [?], builds action sequences from texts based on dependency parsers and then builds planning models, assuming texts follow restricted templates when describing actions. Unlike previous approaches, we do not require action names to be provided as input, or texts to be restricted to specific templates. In addition, we consider complicated relations among actions, including “exclusive” and “optional” relations, for extracting meaningful action sequences.

In this paper, we aim to extract meaningful action sequences from texts in free natural language, i.e., without any restricted templates, when the candidate set of actions is unknown. We propose an approach called EASDRL, which stands for Extracting Action Sequences from texts based on Deep Reinforcement Learning. In our EASDRL approach, we view texts associated with extracted actions as “states”, and the operation of labelling words from texts as “actions”, and then build deep Q-networks to extract action sequences from texts. We capture the complicated relations among actions by considering previously extracted actions as part of the state when deciding the next operation. In other words, once we know the action “cook(rice)” has been extracted and included as part of the state, we will choose to extract the next action “keep(rice, cold)” instead of “use(leftover rice)” in the above-mentioned example.

In the remainder of the paper, we first review previous work related to our approach. After that we give a formal definition of our plan extraction problem and present our EASDRL approach in detail. We then evaluate our EASDRL approach in comparison with state-of-the-art approaches and conclude the paper with future work.

2 Related Work

There are approaches related to our work besides the ones we mentioned in the introduction section. Mapping SAIL route instructions [?] to action sequences has aroused great interest in the natural language processing community. Early approaches, like [????], largely depend on specialized resources, i.e., semantic parsers, learned lexicons and re-rankers. Recently, the LSTM encoder-decoder structure [?] has been applied to this field and achieves decent performance in processing single-sentence instructions; however, it cannot handle multi-sentence texts well.

There is also a lot of work on learning STRIPS representations of actions [??] from texts. [??] learn sentence patterns and lexicons or use off-the-shelf toolkits, i.e., OpenNLP (https://opennlp.apache.org/) and Stanford CoreNLP (http://stanfordnlp.github.io/CoreNLP/). [?] also build action models with the help of LOCM [?] after extracting action sequences by using NLP tools. Because these tools are trained for general natural language processing tasks, they cannot solve the complicated action sequence extraction problem well, and their performance is greatly affected by POS-tagging and dependency parsing results. We thus want to build a model that learns to directly extract action sequences without external tools.

Recently [?] apply deep Q-networks (DQN) to solve reinforcement learning problems and obtain state-of-the-art results. [??] propose DQN models to play more challenging games with enormous search spaces or sparse feedback. [??] extend the DQN structure to continuous action spaces and improve the experience replay trick. Unlike those works, which mainly focus on image-based games, [??] take as input text descriptions of games, which also sheds light on our task.

3 Problem Definition

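Our training data can be defined by a set of pairs $\langle X, Y\rangle$, where $X = \langle w_1, w_2, \ldots, w_N\rangle$ is a sequence of words and $Y = \langle y_1, y_2, \ldots, y_N\rangle$ is a sequence of annotations. If $w_i$ is not an action name, $y_i$ is $\emptyset$. Otherwise, $y_i$ is a tuple $(ActType, \{ExActId\}, \{\langle ArgId, \{ExArgId\}\rangle\})$ describing the type of the action name and its corresponding arguments. $ActType$ indicates the type of the action $a_i$ corresponding to $w_i$, which can be one of essential, optional and exclusive. The type essential suggests the corresponding action $a_i$ should be extracted, optional suggests $a_i$ can be “optionally” extracted, and exclusive suggests $a_i$ is “exclusive” with other actions indicated by the set $\{ExActId\}$ (in other words, either $a_i$ or exactly one action in $\{ExActId\}$ can be extracted). $ExActId$ is the index of an action exclusive with $a_i$. We denote the size of $\{ExActId\}$ by $M$, i.e., $|\{ExActId\}| = M$. Note that “$M = 0$” indicates the type of action $a_i$ is either essential or optional, and “$M > 0$” indicates $a_i$ is exclusive. $ArgId$ is the index of a word composing the arguments of $a_i$, and $ExArgId$ is the index of a word exclusive with $ArgId$.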

For example, as shown in Figure 2, given a text denoted by , its corresponding annotation is shown in the figure denoted by . In , “{11}” indicates the action exclusive with (i.e., “Hang”) is “opt” with index 11. “” indicates the corresponding arguments “engraving” and “lithograph” are exclusive, and the other argument “frame” with index 9 is essential since it is exclusive with an empty index, likewise for . For and , they are empty since their corresponding words are not action names. From , we can generate three possible actions as shown at the bottom of Figure 2.

Figure 2: Illustration of text X and its corresponding annotation Y

As we can see from the training data, it is not easy to build a supervised learning model to directly predict annotations for new texts $X$, since the annotations $y_i$ are complicated and their size varies with respect to different $w_i$ (different action names have different arguments with different lengths). We seek to build a unified framework to predict simple “labels” (corresponding to “actions” in reinforcement learning) for extracting action names and their arguments. We exploit the framework to learn two models to predict action names and arguments, respectively. Specifically, given a new text $X$, we would like to predict a sequence of operations $O = \langle o_1, o_2, \ldots, o_N\rangle$ (instead of annotations in $Y$) on $X$, where $o_i$ is an operation that selects or eliminates word $w_i$ in $X$. In other words, when predicting action names (or arguments), $o_i = Select$ indicates $w_i$ is extracted as an action name (or argument), while $o_i = Eliminate$ indicates $w_i$ is not extracted as an action name (or argument).
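The relationship between an operation sequence and the extracted words can be illustrated with a minimal Python sketch. The string labels `"select"` and `"eliminate"` and the helper name `apply_operations` are illustrative assumptions and are not taken from the original system.

```python
from typing import List

SELECT, ELIMINATE = "select", "eliminate"  # hypothetical operation labels


def apply_operations(words: List[str], operations: List[str]) -> List[str]:
    """Return the words whose operation is SELECT, in text order."""
    if len(words) != len(operations):
        raise ValueError("one operation is expected per word")
    return [w for w, o in zip(words, operations) if o == SELECT]


words = ["Cook", "the", "rice", "the", "day", "before"]
# One operation per word when predicting action names.
name_ops = [SELECT, ELIMINATE, ELIMINATE, ELIMINATE, ELIMINATE, ELIMINATE]
print(apply_operations(words, name_ops))
```

The same helper would apply to argument extraction, with a second operation sequence predicted for each extracted action name.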

In summary, our action sequence extraction problem can be defined as follows: given a set of training data $\{\langle X, Y\rangle\}$, we aim to learn two models (with the same framework) to predict action names and arguments for new texts $X$, respectively. The two models are

$F_1(o_1, \ldots, o_N \mid X; \theta_1)$   (1)

$F_2(o_1, \ldots, o_N \mid X, a; \theta_2)$   (2)

where $\theta_1$ and $\theta_2$ are parameters to be learnt for predicting action names and arguments, respectively, and $a$ is an action name extracted based on $F_1$. We will present how to build these two models in the following section.

4 Our EASDRL Approach

In this section we present the details of our EASDRL approach. As mentioned in the introduction section, our action sequence extraction problem can be viewed as a reinforcement learning problem. We thus first describe how to build states and operations given a text $X$, and then present deep Q-networks to build the Q-functions. Finally we present the training procedure and give an overview of our EASDRL approach. Note that we use the term operation for what reinforcement learning calls an “action”, since the term “action” already denotes an action name with arguments in this work.

4.1 Generating State Representations

In this subsection we address how to generate state representations from texts. As defined in the problem definition section, the space of operations is $\{Select, Eliminate\}$. We view texts associated with operations as “states”. Specifically, we represent a text $X$ by a sequence of vectors $\langle \mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_N\rangle$, where $\mathbf{w}_i$ is a $K_1$-dimension real-value vector [?], representing the $i$th word in $X$. Words of texts stay the same when we perform operations, so we embed operations in state representations to generate state transitions. We extend the set of operations to $\{NULL, Select, Eliminate\}$, where “NULL” indicates a word has not been processed. We represent the operation sequence corresponding to $X$ by a sequence of vectors $\langle \mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_N\rangle$, where $\mathbf{o}_i$ is a $K_2$-dimension real-value vector. In order to balance the dimensions of $\mathbf{o}_i$ and $\mathbf{w}_i$, we generate each $\mathbf{o}_i$ by a repeat-representation, i.e., the scalar code assigned to the operation is repeated $K_2$ times, with one code each for NULL, Select and Eliminate. We define a state $s$ as a tuple $\langle X, O\rangle$, where $X$ is a matrix with $N$ rows and $K_1$ columns and $O$ is a matrix with $N$ rows and $K_2$ columns. The $i$th row of $s$ is denoted by $[\mathbf{w}_i, \mathbf{o}_i]$. The space of states is denoted by $S$. A state $s$ is changed into a new state $s'$ after performing an operation $o_i'$ on $s$, such that $s' = \langle X, O'\rangle$, where $O' = \langle \mathbf{o}_1, \ldots, \mathbf{o}_i', \ldots, \mathbf{o}_N\rangle$. For example, consider the text “Cook the rice the day before…”; a state $s$ corresponding to it is shown in the left part of Figure 3. After performing an operation on $s$, a new state $s'$ (the right part) will be generated. In this way, we can learn $\theta_1$ in $F_1$ (Equation (1)) based on $s$ with deep Q-networks as introduced in the next subsection.

Figure 3: Illustration of states and operations
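The state construction and transition described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the integer codes 0, 1 and 2 for the three operations, the toy two-dimensional word vectors and the helper names are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical integer codes for the three operations.
OP_CODES = {"NULL": 0, "Select": 1, "Eliminate": 2}


def repeat_vector(op: str, k2: int) -> List[float]:
    """Repeat the scalar code of an operation k2 times."""
    return [float(OP_CODES[op])] * k2


def build_state(word_vectors: List[List[float]], ops: List[str], k2: int) -> List[List[float]]:
    """Concatenate each word vector with its repeated operation vector, one row per word."""
    return [w + repeat_vector(o, k2) for w, o in zip(word_vectors, ops)]


def transition(word_vectors: List[List[float]], ops: List[str], index: int,
               new_op: str, k2: int) -> Tuple[List[str], List[List[float]]]:
    """Apply new_op to the word at `index`; the word vectors stay unchanged."""
    new_ops = list(ops)
    new_ops[index] = new_op
    return new_ops, build_state(word_vectors, new_ops, k2)


# Toy 2-dimensional "embeddings" for a three-word text.
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
ops = ["NULL", "NULL", "NULL"]  # initial state: nothing processed yet
state = build_state(vectors, ops, k2=2)
ops, next_state = transition(vectors, ops, 0, "Select", k2=2)
print(next_state[0])
```

Only the operation part of a row changes between `state` and `next_state`, which is what makes the transition visible to the Q-network even though the words themselves are fixed.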

After $F_1$ is learnt, we can use it to predict action names, and then exploit the predicted action names to extract action arguments by training $F_2$ (Equation (2)). To do this, we encode the predicted action names in states to generate a new state representation $\hat{s}$ for learning $\theta_2$ in $F_2$. We denote by $w_a$ the word corresponding to the action name. We build $\hat{s}$ by appending the distance between $w_a$ and $w_j$ based on their indices, such that $\hat{s} = \langle X, D, O\rangle$, where $D = \langle \mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_N\rangle$ and $d_j = |a - j|$. Note that $\mathbf{d}_j$ is a $K_3$-dimension real-value vector obtained from $d_j$ by the repeat-representation. In this way we can learn $\theta_2$ based on $\hat{s}$ with the same deep Q-networks. Note that in our experiments, we found that the results were the best when we set $K_1 = K_2 = K_3$, suggesting the impact of word vectors, distance vectors and operation vectors was generally identical.
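A short Python sketch of the distance encoding follows. It assumes that the distance is the absolute difference between word indices and that the repeat-representation simply copies that scalar; the function name and the toy sizes are hypothetical.

```python
from typing import List


def distance_vectors(num_words: int, action_index: int, k3: int) -> List[List[float]]:
    """For each word j, repeat |action_index - j| k3 times (assumed distance measure)."""
    return [[float(abs(action_index - j))] * k3 for j in range(num_words)]


# A five-word text whose action name is the word at index 1.
d = distance_vectors(num_words=5, action_index=1, k3=2)
print(d)
```

Each row of `d` would be placed alongside the word vector and operation vector of the same word to form the state used for argument extraction.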

4.2 Deep Q-networks for Operation Execution

Given the formulation of states and operations, we aim to extract a sequence of actions from texts. We construct sequences by repeatedly choosing operations given current states, and applying operations on current states to achieve new states.

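In Q-learning, this process can be described by a Q-function that is updated iteratively according to the Bellman equation. In our action sequence extraction problem, actions are composed of action names and action arguments. We need to first extract action names from texts and then use the extracted action names to further extract action arguments. Specifically, we define two Q-functions $Q(s, o)$ and $Q(\hat{s}, o)$, where $\hat{s}$ contains the information of extracted action names, as defined in the last subsection. The update procedure based on the Bellman equation and deep Q-networks can be defined by:

$Q_{i+1}(s, o; \theta_1) = \mathbb{E}\left[ r + \gamma \max_{o'} Q_i(s', o'; \theta_1) \mid s, o \right]$   (3)

$Q_{i+1}(\hat{s}, o; \theta_2) = \mathbb{E}\left[ r + \gamma \max_{o'} Q_i(\hat{s}', o'; \theta_2) \mid \hat{s}, o \right]$   (4)

where $Q(s, o; \theta_1)$ and $Q(\hat{s}, o; \theta_2)$ correspond to the deep Q-networks [?] for extracting action names and arguments, respectively, $r$ is the reward and $\gamma$ is the discount factor. As $i \to \infty$, $Q_i \to Q^*$. In this way, we can define $F_1$ and $F_2$ in Equations (1) and (2), and then use $F_1$ and $F_2$ to extract action names and arguments, respectively.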
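The one-step target that a Bellman update of this kind regresses toward can be sketched as follows. This is the standard Q-learning target rather than code from the original system; the function names and the numeric values are illustrative assumptions.

```python
from typing import Sequence


def q_target(reward: float, next_q_values: Sequence[float],
             gamma: float, terminal: bool) -> float:
    """One-step Q-learning target: r + gamma * max over next operations of Q(s', o')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def squared_loss(target: float, predicted_q: float) -> float:
    """Squared error between the target and the network's current estimate."""
    return (target - predicted_q) ** 2


# Two candidate operations in the next state (e.g. select and eliminate).
target = q_target(reward=1.0, next_q_values=[0.5, 2.0], gamma=0.9, terminal=False)
loss = squared_loss(target, predicted_q=2.0)
```

The same two functions would serve both Q-networks, since they differ only in the state representation that produces the Q-values.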

Since Convolutional Neural Networks (CNNs) are widely applied in natural language processing [???], we build CNN models to learn the two Q-functions. We adopt the CNN architecture of [?]. To build the kernels of our CNN models, we test contexts from uni-gram to ten-gram and observe that a five-word context is generally sufficient for our task. We thus design four types of kernels, which correspond to bigram, trigram, four-gram and five-gram contexts, respectively.
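To make the role of the n-gram kernels concrete, the sketch below slides a single kernel over consecutive rows of a state matrix and applies max pooling over positions. It is a simplified illustration in plain Python, assuming each kernel spans the full row width and omitting bias terms and non-linearities; the function name and toy values are hypothetical.

```python
from typing import List


def ngram_feature(rows: List[List[float]], kernel: List[List[float]]) -> float:
    """Slide a kernel covering len(kernel) consecutive rows and max-pool the responses."""
    n = len(kernel)
    if len(rows) < n:
        raise ValueError("the text is shorter than the kernel height")
    responses = []
    for start in range(len(rows) - n + 1):
        window = rows[start:start + n]
        responses.append(sum(x * w
                             for row, kernel_row in zip(window, kernel)
                             for x, w in zip(row, kernel_row)))
    return max(responses)


rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three words, two features each
bigram_kernel = [[1.0, 1.0], [1.0, 1.0]]      # covers two consecutive words
feature = ngram_feature(rows, bigram_kernel)
```

With four kernel heights (two to five words), each kernel contributes one pooled feature per feature map, and the pooled features are combined to score the candidate operations.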

4.3 Computing Rewards

In this subsection we compute a reward $r$ based on a state $s$ and an operation $o$. Specifically, $r$ is composed of two parts, i.e., a basic reward and an additional reward. For a basic reward at time step $t$, denoted by , if a word is not an item (we use item to denote an action name or action argument in this paper when there is no ambiguity), when the operation is correct and when it is incorrect. If a word is an essential item, when the operation is correct and when it is incorrect. If the word is an optional item, when the operation is correct and when it is incorrect. If a word is an exclusive item, when the operation is correct and when it is incorrect. We say that an operation is correct when it selects essential items, selects optional items, selects only one item of exclusive items or eliminates words that are not items.

Note that action names are key verbs of a text and action arguments are some nominal words, so the percentage of these words in a text is closely related to action sequence extracting process. We thus calculate the percentage, namely an item rate, denoted by , where indicates the amount of action names or action arguments in all the annotated texts and indicates the total words of these texts. We define a real-time item rate as to denote the percentage of words that have been selected as action names or action arguments in a text after training steps, and . On one hand, when , a positive additional reward is added to if (i.e., the operation is correct), otherwise a negative additional reward is added to . On the other hand, when , which means that words selected as action names or action arguments are out of the expected number and it is more likely to be incorrect if subsequent words are selected, then a negative additional reward should be added to the basic reward. In this way, the reward at time step can be obtained by Equation (5),


where is a positive constant and .

4.4 Training Our Model

To learn the parameters $\theta_1$ and $\theta_2$ of our two DQNs, we store transitions $(s, o, r, s')$ and $(\hat{s}, o, r, \hat{s}')$ in two separate replay memories, respectively, and exploit a mini-batch sampling strategy. As indicated in [?], transitions that provide positive rewards can be used more often to learn optimal Q-values faster. We thus develop a positive-rate based experience replay instead of randomly sampling transitions from the replay memories, where positive-rate indicates the percentage of transitions with positive rewards. To do this, we set a positive rate $\rho \in [0, 1]$ and require the proportion of positive samples in each mini-batch to be $\rho$ and the proportion of negative samples to be $1 - \rho$. We also use the $\epsilon$-greedy policy for exploration and exploitation.
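A minimal sketch of positive-rate based sampling is given below. It assumes transitions are stored as tuples whose third field is the reward, treats zero reward as non-positive, and falls back to fewer samples when one pool is too small; these choices and the names used are assumptions rather than details from the original implementation.

```python
import random
from typing import List, Tuple

Transition = Tuple[str, str, float, str]  # (state, operation, reward, next_state)


def sample_minibatch(memory: List[Transition], batch_size: int,
                     positive_rate: float, rng: random.Random) -> List[Transition]:
    """Draw a mini-batch whose share of positive-reward transitions is about positive_rate."""
    positives = [t for t in memory if t[2] > 0]
    negatives = [t for t in memory if t[2] <= 0]
    n_pos = min(round(batch_size * positive_rate), len(positives))
    n_neg = min(batch_size - n_pos, len(negatives))
    batch = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(batch)
    return batch


# Toy replay memory with 10 positive and 10 negative transitions.
memory = [(f"s{i}", "select", 1.0, f"s{i + 1}") for i in range(10)]
memory += [(f"s{i}", "eliminate", -1.0, f"s{i + 1}") for i in range(10, 20)]
batch = sample_minibatch(memory, batch_size=8, positive_rate=0.75, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which is convenient when comparing runs with different positive rates.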

We present the learning procedure of our EASDRL approach in Algorithm 1, for building $F_1$. We can simply replace $s$, the replay memory and $\theta_1$ with $\hat{s}$, the second replay memory and $\theta_2$ for building $F_2$. In Step 4 of Algorithm 1, we generate the initial state $s_0$ ($\hat{s}_0$ for learning $F_2$) for each training data by setting all operations in $O$ to be NULL. We perform $N$ steps to execute one of the operations in Steps 6, 7 and 8. In Steps 10 and 11, we do a positive-rate based experience replay according to the positive rate. In Steps 12 and 13, we update parameters using gradient descent on the loss function as shown in Step 13.

With Algorithm 1, we are able to build the Q-function $Q(s, o; \theta_1)$ and apply operations to a new text by iteratively maximizing the Q-function. Once we obtain operation sequences, we can generate action names and use the action names to build $Q(\hat{s}, o; \theta_2)$ with $\hat{s}$ and the same framework of Algorithm 1. We then exploit the built $Q(\hat{s}, o; \theta_2)$ to extract action arguments. As a result, we can extract action sequences from texts using both of the built Q-functions.

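Input: a training set of pairs $\langle X, Y\rangle$, positive rate, item rate
Output: the parameters $\theta_1$

1:  Initialize the replay memory, CNN with random values for $\theta_1$
2:  for epoch = 1: number of epochs do
3:     for each training data $\langle X, Y\rangle$ do
4:        Generate the initial state $s_0$ based on $X$
5:        for $t$ = 1: $N$ do
6:           Perform a random operation $o_t$ with probability $\epsilon$
7:           Otherwise select $o_t = \arg\max_{o} Q(s_t, o; \theta_1)$
8:           Perform $o_t$ on $s_t$ to generate $s_{t+1}$
9:           Calculate $r_t$ based on $s_t$, $o_t$ and $Y$
10:           Store transition $(s_t, o_t, r_t, s_{t+1})$ in the replay memory
11:           Sample transitions $(s_j, o_j, r_j, s_{j+1})$ from the replay memory according to the positive rate
12:           Set $y_j = r_j$ if $s_{j+1}$ is terminal, otherwise $y_j = r_j + \gamma \max_{o'} Q(s_{j+1}, o'; \theta_1)$
13:           Perform a gradient descent step on the loss $(y_j - Q(s_j, o_j; \theta_1))^2$
14:        end for
15:     end for
16:  end for
17:  return  The parameters $\theta_1$
Algorithm 1 Our EASDRL algorithm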

5 Experiments

5.1 Datasets and Evaluation Metric

We conducted experiments on three datasets, i.e., “Microsoft Windows Help and Support” (WHS) documents [?], and two datasets collected from “WikiHow Home and Garden” (WHG, https://www.wikihow.com/Category:Home-and-Garden) and “CookingTutorial” (CT, http://cookingtutorials.com/). Details are presented in Table 1. Supervised learning models require that training data are one-to-one pairs (i.e., each word has a unique label), so we generate input-texts-to-output-labels based on the annotations (as defined in Section 3). In our task, a single text with $n$ optional items or exclusive pairs will generate more than $2^n$ label sequences (i.e., each item can either be extracted or not). In particular, we observe that $n$ is larger than 30 in some texts of our datasets, which means more than 1 billion sequences would be generated. We thus restrict (no more than label sequences) to generate a reasonable number of sequences.

                            WHS      CT       WHG
Labelled texts              154      116      150
Total words                 6927     24284    62214
Action name rate (%)        19.47    10.37    7.61
Action argument rate (%)    15.45    7.44     6.30
Unlabelled texts            0        0        80
Table 1: Datasets used in our experiments

For evaluation, we first feed test texts to each model to output sequences of labels or operations. We then extract action sequences based on these labels or operations. After that, we compare these action sequences to their corresponding annotations and calculate $\#TotalTruth$ (total ground truth items), $\#TotalTagged$ (total extracted items) and $\#TotalRight$ (total correctly extracted items). Finally we compute the metrics: $precision = \#TotalRight / \#TotalTagged$, $recall = \#TotalRight / \#TotalTruth$, and $F1 = 2 \times precision \times recall / (precision + recall)$. We will use the F1 metric in our experiments.
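The three counts map onto the metrics as in the following Python sketch, which uses the standard definitions of precision, recall and F1. The function name, the zero-division convention and the example counts are illustrative assumptions.

```python
from typing import Tuple


def precision_recall_f1(total_truth: int, total_tagged: int,
                        total_right: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from item counts, returning 0.0 on empty denominators."""
    precision = total_right / total_tagged if total_tagged else 0.0
    recall = total_right / total_truth if total_truth else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts: 10 annotated items, 8 extracted, 6 of them correct.
p, r, f1 = precision_recall_f1(total_truth=10, total_tagged=8, total_right=6)
```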

5.2 Experimental Results

We compare EASDRL to four baselines, as shown below:

  • STFC: Stanford CoreNLP, an off-the-shelf tool, denoted by STFC, extracts action sequences by viewing root verbs as action names and objects as action arguments [?].

  • BLCC: We adapt a state-of-the-art sequence labelling method (denoted by BLCC), Bi-directional LSTM-CNNs-CRF model [??], to extract action sequences.

  • EAD: The Encoder-Aligner-Decoder approach proposed by [?], denoted by EAD, maps instructions to action sequences.

  • CMLP: We consider a Combined Multi-layer Perceptron (CMLP), which consists of MLP classifiers. for action names extraction and for action arguments extraction. Each MLP classifier focuses on not only a single word but also the k-gram context.

When comparing with baselines, we set the input dimension of our CNN model to be for action names and for action arguments, the size of four kernels to be for action names and for action arguments, the number of feature-maps of convolutional layers to be . We apply max pooling after the convolution layer. We set the replay memory , positive rate , discount factor . We set for action names rate, for action arguments rate, constant . The learning rate of Adam is 0.001 and the probability $\epsilon$ for $\epsilon$-greedy decreases from 1 to 0.1 over 1000 training steps.
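The exploration schedule can be sketched in Python as follows. The text only states that the probability decreases from 1 to 0.1 over 1000 training steps, so the linear shape of the decay, the function names and the example Q-values are assumptions.

```python
import random
from typing import Sequence


def epsilon_at(step: int, start: float = 1.0, end: float = 0.1,
               decay_steps: int = 1000) -> float:
    """Linearly anneal epsilon from start to end over decay_steps, then hold it (assumed shape)."""
    fraction = min(max(step, 0), decay_steps) / decay_steps
    return start + fraction * (end - start)


def epsilon_greedy(q_values: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """Pick a random operation index with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


rng = random.Random(0)
halfway = epsilon_at(500)
greedy_choice = epsilon_greedy([0.1, 0.9], epsilon=0.0, rng=rng)
```

Early in training almost every operation is random, while after the decay window nine out of ten operations follow the current Q-network.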

Comparison with Baselines

          Action Names                 Action Arguments
Method    WHS      CT       WHG        WHS      CT       WHG
EAD-2     86.25    64.74    53.49      57.71    51.77    37.70
EAD-8     85.32    61.66    48.67      57.71    51.77    37.70
CMLP-2    83.15    83.00    67.36      47.29    34.14    32.54
CMLP-8    80.14    73.10    53.50      47.29    34.14    32.54
BLCC-2    90.16    80.50    69.46      93.30    76.33    70.32
BLCC-8    89.95    72.87    59.63      93.30    76.33    70.32
STFC      62.66    67.39    62.75      38.79    43.31    42.75
EASDRL    93.46    84.18    75.40      95.07    74.80    75.02
Table 2: F1 scores of different methods in extracting all types of action names and all types of action arguments

We set the restriction and for EAD, CMLP and BLCC, which need one-to-one sequence pairs, and no restriction for STFC and EASDRL. In all of our datasets, the arguments of an action are either all essential arguments or one exclusive argument pair together with all other essential arguments, which means at most sequences can be generated. Therefore, the results of action arguments extraction are identical when and . The experimental results are shown in Table 2. From Table 2, we can see that our EASDRL approach performs the best on extracting both action names and action arguments in most datasets, except for the CT dataset. We observe that the number of arguments in most texts of the CT dataset is very small, such that BLCC performs well on extracting arguments in the CT dataset. On the other hand, we can also observe that BLCC, EAD and CMLP get worse performance when relaxing the restriction ( and ). We can also see that both the sequence labelling method and the encoder-decoder structure do not work well, which shows that, in this task, our reinforcement learning framework can indeed perform better than traditional methods.

          Action Names                 Action Arguments
Method    WHS      CT       WHG        WHS      CT       WHG
EAD-2     26.60    21.76    22.75      40.78    47.91    39.81
EAD-8     22.12    17.01    23.12      40.78    47.91    39.81
CMLP-2    31.54    54.75    51.29      35.52    25.07    29.78
CMLP-8    26.90    51.80    41.03      35.52    25.07    29.78
BLCC-2    16.35    38.27    54.34      12.50    13.45    18.57
BLCC-8    19.55    35.01    41.27      12.50    13.45    18.57
STFC      46.40    50.28    44.32      50.00    46.40    50.32
EASDRL    56.19    66.37    68.29      66.67    54.24    55.67
Table 3: F1 scores of different methods in extracting exclusive action names and exclusive action arguments
Figure 4: Results of EASDRL after removing some components

In order to test and verify whether or not our EASDRL method can deal with complex action types well, we compare with baselines in extracting exclusive action names and exclusive action arguments. Results are shown in Table 3. In this part, our EASDRL model outperforms all baselines and leads more than absolutely, which demonstrates the effectiveness of our EASDRL model in this task.

We would like to evaluate the impact of additional reward and positive-rate based experience replay. We test our EASDRL model by removing positive-rate based experience replay (denoted by “-PR”) or additional reward (denoted by “-AR”). Results are shown in Figure 4. We observe that removing either positive-rate based experience replay or additional reward will degrade the performance of our model.

Online Training Results

To further test the robustness and self-learning ability of our approach, we design a human-agent interaction environment to collect feedback from humans. The environment takes a text as input (as shown in the upper left part of Figure 5) and presents the results of our EASDRL approach in the upper right part of Figure 5. Humans adjust the output results by entering values in the “function panel” (as shown in the middle row) and pressing the buttons (at the bottom). After that, the environment updates the deep Q-networks of our EASDRL approach based on the humans’ adjustments (or feedback) and outputs new results in the upper right part. Note that the marked parts in the upper right part are the extracted action sequence. For example, the action “Remove(tape)”, which is indicated in the upper right part in orange, should be “Remove(tape, deck)”. The user can delete, revise or insert words (corresponding to the buttons with labels “Delete”, “Revise” and “Insert”, respectively) by entering “values” in the middle row, where “Act/Arg” is used to decide whether the entered words belong to action names or action arguments, “ActType/ArgType” is used to decide whether the entered words are essential, optional or exclusive, “SentId” and “ActId/ArgId” are used to enter the sentence indices and word indices of the entered words, and “ExSentId” and “ExActId/ExArgId” are used to enter the indices of exclusive action names or arguments. After that, the modified text with its annotations will be used to update our model.

Before online training, we pre-train an initial model of EASDRL by combining all labelled texts of WHS, CT and WHG, with labelled texts of WHG for testing. The accuracy of this initial model is low since it is domain-independent. We then use the unlabelled texts in WHG (i.e., 80 texts as indicated in the last row in Table 1) for online training. We “invited” humans to provide feedback for these 80 texts (with an average of 5 texts for each human). When a human finishes the assigned job, we update our model (as well as the baseline model). We compare EASDRL to the best offline-trained baseline BLCC-2. Figure 6 shows the results of online training, where “online collected texts” indicates the number of texts on which humans provide feedback. We can see that EASDRL outperforms BLCC-2 significantly, which demonstrates the effectiveness of our reinforcement learning framework.

Figure 5: A snapshot of our human-agent interacting environment
Figure 6: Online test results of WHG dataset

6 Conclusion

In this paper, we propose a novel approach, EASDRL, to automatically extract action sequences from texts based on deep reinforcement learning. To the best of our knowledge, our EASDRL approach is the first approach that explores deep reinforcement learning to extract action sequences from texts. In the experiments, we demonstrated that our EASDRL model outperforms state-of-the-art baselines on three datasets. We showed that our EASDRL approach could better handle complex action types and arguments. We also demonstrated the effectiveness of our EASDRL approach in an online learning environment. In the future, it would be interesting to explore the feasibility of learning more structured knowledge from texts, such as state sequences or action models for supporting planning.


  • [Branavan et al., 2009] S. R. K. Branavan, Harr Chen, Luke S. Zettlemoyer, and Regina Barzilay. Reinforcement learning for mapping instructions to actions. In ACL, pages 82–90, 2009.
  • [Chen and Mooney, 2011] David L. Chen and Raymond J. Mooney. Learning to interpret natural language navigation instructions from observations. In AAAI, 2011.
  • [Chen, 2012] David L Chen. Fast online lexicon learning for grounded language acquisition. ACL, pages 430–439, 2012.
  • [Cresswell et al., 2009] Stephen Cresswell, Thomas Leo McCluskey, and Margaret Mary West. Acquisition of object-centred domain models from planning examples. In ICAPS, 2009.
  • [Daniele et al., 2017] Andrea F Daniele, Mohit Bansal, and Matthew R Walter. Navigational instruction generation as inverse reinforcement learning with neural machine translation. In ACM/IEEE, pages 109–118, 2017.
  • [Fikes and Nilsson, 1971] Richard E Fikes and Nils J Nilsson. Strips: A new approach to the application of theorem proving to problem solving. Artificial intelligence, 2(3-4):189–208, 1971.
  • [He et al., 2016] Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. Deep reinforcement learning with a natural language action space. In ACL, 2016.
  • [Kim and Mooney, 2013a] Joohyun Kim and Raymond Mooney. Adapting discriminative reranking to grounded language learning. In ACL, pages 218–227, 2013.
  • [Kim and Mooney, 2013b] Joohyun Kim and Raymond J. Mooney. Unsupervised pcfg induction for grounded language learning with highly ambiguous supervision. In EMNLP, pages 433–444, 2013.
  • [Kim, 2014] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of EMNLP, pages 1746–1751, 2014.
  • [Kulkarni et al., 2016] Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In NIPS, pages 3675–3683, 2016.
  • [Lillicrap et al., 2015] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. Computer Science, 8(6):A187, 2015.
  • [Lindsay et al., 2017] Alan Lindsay, Jonathon Read, Joao F Ferreira, Thomas Hayton, Julie Porteous, and PJ Gregory. Framer: Planning models from natural language action descriptions. 2017.
  • [Ma and Hovy, 2016] Xuezhe Ma and Eduard H. Hovy. End-to-end sequence labeling via bi-directional lstm-cnns-crf. In Proceedings of ACL, 2016.
  • [Macmahon et al., 2006] Matt Macmahon, Brian Stankiewicz, and Benjamin Kuipers. Walk the talk: Connecting language, knowledge, and action in route instructions. In AAAI, pages 1475–1482, 2006.
  • [Mei et al., 2016] Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. Listen, attend, and walk: neural mapping of navigational instructions to action sequences. In AAAI, pages 2772–2778, 2016.
  • [Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
  • [Mnih et al., 2015] V Mnih, K Kavukcuoglu, D Silver, A. A. Rusu, J Veness, M. G. Bellemare, A Graves, M Riedmiller, A. K. Fidjeland, and G Ostrovski. Human-level control through deep reinforcement learning. Nature, 518(7540):529–33, 2015.
  • [Narasimhan et al., 2015] Karthik Narasimhan, Tejas D. Kulkarni, and Regina Barzilay. Language understanding for text-based games using deep reinforcement learning. In EMNLP, pages 1–11, 2015.
  • [Pomarlan et al., 2017] Mihai Pomarlan, Sebastian Koralewski, and Michael Beetz. From natural language instructions to structured robot plans. In Gabriele Kern-Isberner, Johannes Fürnkranz, and Matthias Thimm, editors, KI 2017: Advances in Artificial Intelligence, pages 344–351, Cham, 2017. Springer International Publishing.
  • [Reimers and Gurevych, 2017] Nils Reimers and Iryna Gurevych. Reporting score distributions makes a difference: Performance study of lstm-networks for sequence tagging. In Proceedings of (EMNLP), 2017.
  • [Schaul et al., 2015] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. Computer Science, 2015.
  • [Sil and Yates, 2011] Avirup Sil and Alexander Yates. Extracting STRIPS representations of actions and events. In RANLP, pages 1–8, 2011.
  • [Sil et al., 2010] Avirup Sil, Fei Huang, and Alexander Yates. Extracting action and event semantics from web text. In AAAI, 2010.
  • [Silver et al., 2016] D Silver, A. Huang, C. J. Maddison, A Guez, L Sifre, den Driessche G Van, J Schrittwieser, I Antonoglou, V Panneershelvam, and M Lanctot. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
  • [Wang et al., 2017] Jin Wang, Zhongyuan Wang, Dawei Zhang, and Jun Yan. Combining knowledge with deep convolutional neural networks for short text classification. In Proceedings of IJCAI, pages 2915–2921, 2017.
  • [Zhang and Wallace, 2015] Ye Zhang and Byron C. Wallace. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. CoRR, abs/1510.03820, 2015.