Meta-Learning with Dynamic-Memory-Based Prototypical Network for Few-Shot Event Detection


Shumin Deng (Zhejiang University, China) 231sm@zju.edu.cn; Ningyu Zhang (Alibaba Group, China) ningyu.zny@alibaba-inc.com; Jiaojian Kang (Zhejiang University, China) kangjiaojian@zju.edu.cn; Yichi Zhang (Alibaba Group, China) yichi.zyc@alibaba-inc.com; Wei Zhang (Alibaba Group, China) lantu.zw@alibaba-inc.com; and Huajun Chen (Zhejiang University, China) huajunsir@zju.edu.cn
Abstract.

Event detection (ED), a sub-task of event extraction, involves identifying triggers and categorizing event mentions. Existing methods primarily rely upon supervised learning and require large-scale labeled event datasets, which are unfortunately not readily available in many real-life applications. In this paper, we consider and reformulate the ED task with limited labeled data as a few-shot learning problem. We propose a Dynamic-Memory-Based Prototypical Network (DMB-PN), which exploits a Dynamic Memory Network (DMN) to learn better prototypes for event types and to produce more robust sentence encodings for event mentions. Unlike vanilla prototypical networks, which compute event prototypes by simple averaging and consume each event mention only once, our model is more robust and is capable of distilling contextual information from event mentions multiple times thanks to the multi-hop mechanism of DMNs. The experiments show that DMB-PN not only deals with sample scarcity better than a series of baseline models but also performs more robustly when the variety of event types is relatively large and the instance quantity is extremely small.

event extraction, prototypical network, dynamic memory network
journalyear: 2020; copyright: acmcopyright; conference: The Thirteenth ACM International Conference on Web Search and Data Mining, February 3–7, 2020, Houston, TX, USA; booktitle: The Thirteenth ACM International Conference on Web Search and Data Mining (WSDM ’20), February 3–7, 2020, Houston, TX, USA; price: 15.00; doi: 10.1145/3336191.3371796; isbn: 978-1-4503-6822-3/20/02; ccs: Information systems → Information extraction; ccs: Computing methodologies → Information extraction; submission ID: wsdm305

1. Introduction

Event extraction (EE) is a task aimed at extracting structural event information from unstructured texts. An event is defined as a specific occurrence involving participants, described in an event mention (Chen et al., 2015). The main word or nugget (typically a verb or a noun) that most clearly expresses the occurrence of an event is called a trigger (Chen et al., 2015). In this paper, we focus on the event detection (ED) task, a subtask of EE, which aims to locate the triggers of specified event types in texts. For example, in the sentence “He is married to the Iraqi microbiologist known as Dr. Germ.”, the ED task should detect the word ‘married’ as a trigger for the event type ‘Marry’.

Typical approaches to ED follow a supervised learning paradigm, which relies upon large sets of labeled data that are unfortunately not readily available in many real-life applications. Even in the widely-used ACE-2005 corpus, a considerable proportion of event types have very few labeled instances (Nguyen et al., 2016). More importantly, new event types tend to emerge frequently in practice, whereas most traditional models are hardly able to classify new events correctly if only a small number of samples for these new event types are given.

Intuitively, people can promptly assimilate new knowledge and deduce new classes by learning from few instances, owing to the human brain’s ability to synthesize, adapt and transfer knowledge from different learned classes, which is known as the ability of “learning to learn” or “meta-learning” (Finn et al., 2017; Santoro et al., 2016; Snell et al., 2017). The process of developing a classifier that must generalize to new classes rapidly from only a small number of samples is commonly referred to as few-shot learning (FSL) (Snell et al., 2017).

Figure 1. A few-shot (3-shot) event detection example, in which italic words in the support and query set are triggers of events. ‘Label’ denotes the labeled type of event mentions, and ‘Pred’ denotes predicted types.

In this paper, we revisit the ED task with limited labeled data as an instantiation of FSL problems, i.e., Few-Shot Event Detection (FSED). Figure 1 illustrates a few-shot learning example for FSED tasks. Intuitively, the FSED model is analogous to an Event Type Learner, which attempts to learn event-type-specific meta knowledge from only a few instances in the support set, and applies what it learns to predict the event type of instances in the query set. In a typical meta-learning setting, the Event Type Learner is first trained in a meta-training step to learn meta knowledge from event types such as Marry; afterwards the model is quickly adapted, again with only three samples, to predict the results for new event types such as Divorce, which were not even seen during training.

This paper proposes to tackle the problem of FSED in few-shot and meta-learning settings. Non-parametric approaches such as siamese networks (Koch et al., 2015), matching networks (Vinyals et al., 2016), and prototypical networks (Snell et al., 2017) are among the most popular models for FSL tasks, because they are simple, entirely feed-forward, and easy to optimize. Unlike typical deep learning architectures, non-parametric approaches do not classify instances directly; instead they learn to compare in a metric space. For example, a prototypical network learns a distance function and computes a prototype for each class from its support instances. The model then compares a new sample with the prototypes and assigns it to the class with the closest prototype (Snell et al., 2017).
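To make the instance-to-prototype comparison concrete, the following NumPy sketch (with made-up toy embeddings; `prototypes` and `classify` are hypothetical helper names, not part of the paper) averages support embeddings into class prototypes and assigns a query to the nearest one, as a vanilla prototypical network does:

```python
import numpy as np

def prototypes(support, labels):
    """Average the support embeddings of each class into one prototype."""
    classes = sorted(set(labels))
    mask = np.array(labels)
    return classes, np.stack([support[mask == c].mean(axis=0) for c in classes])

def classify(query, protos):
    """Assign the query to the class with the nearest prototype (Euclidean)."""
    dists = np.linalg.norm(protos - query, axis=1)
    return int(np.argmin(dists))

# toy 2-way-2-shot episode in a 2-d embedding space
support = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.2, 1.0]])
labels = [0, 0, 1, 1]
classes, protos = prototypes(support, labels)
pred = classify(np.array([0.1, 0.1]), protos)  # nearest to the class-0 prototype
```

The key property, discussed below, is that the prototype is a single averaged vector per class, so each support instance contributes exactly once.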

Previous studies (Snell et al., 2017; Gao et al., 2019) demonstrate that the choice of distance function significantly affects the capacity of prototypical networks, so model performance is vulnerable to the quality of instance representations. However, due to the paucity of instances in FSL, key information may be lost in the noise introduced by the diversity of event mentions. Moreover, data shortage makes it difficult to learn robust contextual representations, particularly for tasks like ED in which learning context-aware embeddings for words and sentences is vital (Liu et al., 2017, 2018). As a result, in the case of FSED, there is an urgent demand for a more robust architecture that can learn contextual representations for event prototypes from limited instances.

In this work, we propose a Dynamic-Memory-Based Prototypical Network (DMB-PN), which exploits a Dynamic Memory Network (DMN) (Xiong et al., 2016; Kumar et al., 2016) to learn better prototypes for event types. Unlike vanilla prototypical networks, which compute event prototypes by simple averaging and consume the event mention encodings only once, DMB-PN, equipped with a DMN, distills contextual information from event mentions multiple times. Experiments demonstrate that DMB-PN not only deals with sample scarcity better than vanilla prototypical networks, but also performs more robustly as the shot number decreases (see the section on K-Shot Evaluations) and as the number of event types increases (see the section on N-Way Evaluations).

Additionally, the Dynamic Memory Network is also used to learn event prototypes and sentence encodings in our model. Specifically, we propose to use trigger words as the questions in a typical DMN module to produce the memory vectors, thereby producing sentence encodings that are more sensitive to trigger words. As the DMN can more fully exploit the event instances thanks to its multi-hop mechanism, DMN-based models produce more robust sentence encodings, particularly in few-shot settings, as supported by our experimental results.

In summary, the main contributions of our work are as follows:

  • Firstly, we formally define and formulate the new problem of Few-Shot Event Detection (FSED), and produce a new dataset called FewEvent tailored particularly for this problem.

  • We then propose a new framework called Dynamic-Memory-Based Prototypical Network, which exploits Dynamic Memory Network to not only learn better prototypes for event types, but also produce more robust sentence encodings for event mentions.

  • The experiments show that the prototypical network integrated with the memory mechanism outperforms a series of baseline models, particularly when the variety of event types is relatively large and the instance quantity is extremely small, owing to its capability of distilling contextual information from event instances multiple times.

The next section reviews related work on labeled-data shortage in event detection and meta-learning in few-shot NLP tasks. Section 3 presents the details of the DMB-PN architecture. Section 4 introduces the experiments and evaluation results. Section 5 concludes the paper and discusses future work.

2. Related Work

2.1. Sample Shortage Problems in ED Tasks

Traditional approaches to the task of EE primarily rely on elaborately-designed features and complicated natural language processing (NLP) tools (McClosky et al., 2011; Li et al., 2013; Hong et al., 2011). Recently, neural-network-based models have shown good performance on EE tasks (Nguyen and Grishman, 2016; Hong et al., 2018; Chen, 2019; Nguyen and Grishman, 2018; Liu et al., 2019), since (Chen et al., 2015) proposed the dynamic multi-pooling convolutional neural network (DMCNN) to automatically extract and preserve lexical-level and sentence-level features. However, these methods rely on large-scale labeled event datasets. Considering practical situations, there has been some research focusing on the shortage of labeled data. (Nguyen et al., 2016) proposes a CNN-2-STAGE model that uses a two-stage training method to detect event types not seen during training, by effectively transferring knowledge from other event types to the target one. (Peng et al., 2016) develops an event detection and co-reference system with minimal supervision, in the form of a few event examples, by viewing ED tasks as semantic similarity problems among event mentions, or between event mentions and ontologies of event types. (Huang et al., 2018) takes a fresh look at EE by mapping event instances to the corresponding event ontology, which holds event structures for each event type. Besides, some works address the problem of insufficient training data by importing external knowledge (note that we do not consider data augmentation by importing external pre-trained knowledge in this paper, and focus only on few-shot models). (Baldini Soares et al., 2019) describes a novel training setup called matching the blanks, and couples it with BERT (Devlin et al., 2019) to produce useful relation representations that are particularly effective in low-resource regimes. (Yang et al., 2019) proposes a method to automatically generate labeled data by editing prototypes, and screens out generated samples by ranking their quality.

2.2. Meta-Learning in Few-Shot NLP Tasks

In fact, research on adopting FSL for NLP tasks is extremely limited, and mostly based on metric-based methods. (Gao et al., 2019) formalizes relation classification as an FSL problem, and proposes hybrid attention-based prototypical networks for the task. (Yu et al., 2018) proposes an adaptive metric learning approach that automatically determines the best weighted combination from a set of metrics for few-shot text classification. In this paper, we also utilize a metric-based method, the prototypical network, to tackle few-shot event detection tasks. Besides, model-based methods have also been designed for meta-learning, to rapidly incorporate new information and remember it. Few-shot ED tasks with sparse labeled instances make it vital to fully exploit the available data, especially contextual information, which has been shown effective on ED tasks (Liu et al., 2017; Nguyen and Grishman, 2018). However, existing methods that utilize context only process the context once. (Xiong et al., 2016; Kumar et al., 2016) introduce the dynamic memory network (DMN), which exhibits certain reasoning capabilities in NLP tasks such as QA through its multi-hop mechanism. Inspired by this, (Liu et al., 2018) proposes the trigger detection dynamic memory network (TD-DMN) to tackle the ED problem by fully exploiting the context in documents.

Figure 2. Overview of the DMB-PN model, where TI and EC are abbreviations of trigger identification and event classification respectively. The question in TI is implicitly viewed as “Is the word a trigger or not?”, and that in few-shot EC is implicitly viewed as “How does this event mention contribute to event prototype learning?”. Note that the primitive prototypical network directly passes the input module results to the prototypical network for few-shot EC, while DMB-PN first generates dynamic-memory-based support and query encodings.

3. Method

This section introduces the general architecture and principal modules of the proposed model.

3.1. Problem Formulation

In this paper, the Few-Shot Event Detection (FSED) problem is formulated with typical N-way-K-shot descriptions. Specifically, our model is given a tiny labeled training set called the support set S, which covers N event types. Each event type has only K labeled samples, where K is typically small, e.g., 5 or 10. In addition to the support set, there is another set called the query set Q, in which the samples are unlabeled and subject to prediction based only on the observation of the few-shot samples in S.

Formally, given an event type set E = {e_1, …, e_N}, the support set S, the query set Q and the few-shot task T are defined as follows:

(1)  S = {(x_i^s, t_i^s, y_i^s)}_{i=1}^{N×K},  Q = {(x_j^q, t_j^q)}_{j=1}^{R},  T = (S, Q)

where (x_i^s, t_i^s, y_i^s) denotes an event mention instance in the support set with trigger t_i^s and event type y_i^s ∈ E. Analogously, (x_j^q, t_j^q) denotes an event mention instance in the query set, and R is the number of instances in the query set. Each instance is denoted as a word sequence x = {w_1, …, w_L}, where L is the maximum length of event mentions.

Thus, the goal of few-shot event detection is to gain the capability to predict the event type of a new instance in the query set by observing only a small number (K) of events for each event type in the support set. The training process is based on a set of tasks T_train, where each task corresponds to an individual few-shot event detection task with its own support and query set. The testing process is conducted on a set of new tasks T_test, which is constructed similarly to T_train, except that it involves only event types that have never been seen in T_train.
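The episodic training and testing regime described above can be sketched as follows; the sampler and the toy dataset layout are illustrative simplifications, not the released FewEvent format:

```python
import random

def sample_episode(dataset, n_way, k_shot, n_query):
    """Sample one N-way-K-shot task: a support set and a query set.

    `dataset` maps each event type to a list of its labeled mentions;
    this flat structure is an assumption made for the sketch.
    """
    event_types = random.sample(sorted(dataset), n_way)
    support, query = [], []
    for etype in event_types:
        mentions = random.sample(dataset[etype], k_shot + n_query)
        support += [(m, etype) for m in mentions[:k_shot]]
        query += [(m, etype) for m in mentions[k_shot:]]
    return support, query

# toy corpus: 10 event types with 20 mentions each
toy = {f"type_{i}": [f"mention_{i}_{j}" for j in range(20)] for i in range(10)}
support, query = sample_episode(toy, n_way=5, k_shot=5, n_query=2)
```

For testing, the same sampler would simply be run over a disjoint set of event types, mirroring the T_train / T_test split.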

3.2. General Architecture

Generally, we divide few-shot event detection into two sub-tasks: trigger identification and few-shot event classification. The overview of our model, DMB-PN, is shown in Figure 2.

In trigger identification, a dynamic-memory-based sentence encoder is designed to learn event mention encodings and identify triggers. Given an event mention, each word in it is vectorized into a dynamic-memory-based word embedding, and is then identified as a trigger or not based on the DMN. Specifically, for an event mention instance x = {w_1, …, w_L}, each word w_l in it is vectorized to v_l. Then the trigger t is identified and the sentence encoding s is obtained via a dynamic-memory-based sentence encoder f_s:

(2)  (t, s) = f_s(v_1, …, v_L)

In few-shot event classification, a dynamic-memory-based prototypical network, denoted as M_Proto, is proposed to classify events through FSL. Differing from the primitive prototypical network, the dynamic-memory-based one generates encodings of the support set and query set under the architecture of a dynamic memory network. The prototypical network serves as the answer module of the DMN, where the event type is predicted by comparing the query instance encoding s^q with each event prototype c_i:

(3)  p_i = M_Proto(s^q, c_i),  y^q = argmax_i p_i

where p_i denotes the probability that the query instance belongs to the i-th event type.

3.3. Trigger Identification

Input Module for TI. The input module of trigger identification contains two layers: the word encoder layer and the input fusion layer. The word encoder layer encodes each word into a vector independently, while the input fusion layer gives the encoded word vectors a chance to exchange information with each other.

Word encoder layer. For the l-th word w_l in the event mention x, the encoding includes two components: (1) a real-valued embedding w_l^{emb} that expresses the semantic and syntactic meaning of the word, pre-trained via GloVe (Pennington et al., 2014), and (2) position embeddings that encode its relative position in the sentence, including the distances from w_l to the beginning and ending of the sentence, as well as the sentence length, as three d_p-dimensional vectors, which are then concatenated into a unified position embedding p_l.

We then obtain the final input embedding for each word by concatenating its word embedding and position embedding:

(4)  v_l = [w_l^{emb}; p_l]

Input fusion layer. Given the input embeddings {v_1, …, v_L}, we generate fact vectors with a Bi-GRU:

(5)  {f_1, …, f_L} = BiGRU(v_1, …, v_L)

where L is the maximum sentence length.
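As an illustration of the input side of Equations (4)-(5), the sketch below builds the concatenated word-plus-position embedding. The random stand-in tables, the dimensions, and the single shared position table are assumptions for the sketch; a real implementation would load GloVe vectors and learn the position tables:

```python
import numpy as np

rng = np.random.default_rng(0)
d_w, d_p, max_len = 50, 5, 30                       # hypothetical dimensions
vocab = {"he": 0, "is": 1, "married": 2}
word_table = rng.normal(size=(len(vocab), d_w))     # stand-in for GloVe vectors
pos_table = rng.normal(size=(max_len + 1, d_p))     # shared by the three distance features

def input_embedding(tokens):
    """v_l = [word embedding ; dist-to-begin ; dist-to-end ; sentence length] (Eq. 4)."""
    L = len(tokens)
    vecs = []
    for l, tok in enumerate(tokens):
        p = np.concatenate([pos_table[l], pos_table[L - 1 - l], pos_table[L]])
        vecs.append(np.concatenate([word_table[vocab[tok]], p]))
    return np.stack(vecs)

V = input_embedding(["he", "is", "married"])        # shape (3, d_w + 3 * d_p)
```

These per-word vectors are what the Bi-GRU of the input fusion layer would then turn into fact vectors.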

Question Module for TI. Analogously, the question module encodes the question into a distributed vector representation. In the task of trigger identification, each word in the input sentence can be deemed a question: the question module treats each word in the event mention as implicitly asking “Is the word the trigger or not?”. The intuition here is to obtain a vector that represents the question word. Given the encoding v_l of the l-th word in the sentence, the question GRU generates the hidden state q_l via a Bi-GRU, and the question vector q for the sentence is a combination of all hidden states:

(6)  {q_1, …, q_L} = BiGRU(v_1, …, v_L),  q = (1/L) Σ_{l=1}^{L} q_l

Answer Module for TI. The answer module predicts the trigger in an event mention from the final memory vector m of the Memory Module for TI, which will be introduced in Memory Module for TI and Few-Shot EC of Section 3.4. We employ another GRU whose initial state is initialized to the last memory, a_0 = m. At each timestep, it takes the question q, the last hidden state a_{t−1}, as well as the previously predicted output y_{t−1} as input:

(7)  a_t = GRU([y_{t−1}; q], a_{t−1})
(8)  y_t = softmax(W_a a_t)

The outputs of trigger identification are trained with a cross-entropy classification error, and the loss function for trigger identification is denoted by J_TI:

(9)  J_TI = −Σ_t log p(y_t* | x, q)

where y_t* is the gold label at timestep t.

Sentence Reader Layer. The sentence reader layer is responsible for combining the words into a sentence encoding, where the words are embedded through the memory module of trigger identification. We obtain a scalar attention weight α_l for each word in a sentence by feeding the memory vector m_l generated by the Memory Module for TI into a two-layer perceptron followed by a softmax:

(10)  α_l = softmax_l(w_2 tanh(W_1 m_l + b_1) + b_2)

Then we obtain the sentence representation s as the attention-weighted sum:

(11)  s = Σ_{l=1}^{L} α_l m_l
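Equations (10)-(11) amount to a small attention-pooling step; a NumPy sketch with random weights and hypothetical sizes looks like this:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sentence_encoding(M, W1, b1, w2, b2):
    """Score each word memory with a two-layer perceptron, normalize with
    softmax (Eq. 10), and return the attention-weighted sum (Eq. 11)."""
    scores = np.tanh(M @ W1 + b1) @ w2 + b2        # one scalar per word
    alpha = softmax(scores)
    return alpha, alpha @ M

rng = np.random.default_rng(1)
L, d, h = 4, 6, 8                                  # hypothetical sizes
M = rng.normal(size=(L, d))                        # per-word memory vectors
alpha, s = sentence_encoding(M, rng.normal(size=(d, h)), np.zeros(h),
                             rng.normal(size=h), 0.0)
```

The weights here are random placeholders; in the model they would be learned jointly with the rest of the network.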

3.4. Few-Shot Event Classification

Input Module for Few-Shot EC. The input module of few-shot event classification consists of a sentence integration layer followed by an input fusion layer. The sentence integration layer gathers the sentence encodings into the support set and query set respectively. The input fusion layer gives the sentence encodings in the support set a chance to exchange information with each other.

Sentence integration layer. The support set and query set encodings are denoted by

(12)  S = {s_1^s, …, s_{N×K}^s},  Q = {s_1^q, …, s_R^q}

Input fusion layer. The input fusion operation for sentences is similar to that for words, shown in Equation (5). We generate the fact vectors for the sentences in the support set with a Bi-GRU:

(13)  {f_1, …, f_{N×K}} = BiGRU(s_1^s, …, s_{N×K}^s)

Question Module for Few-Shot EC. In the task of few-shot event classification, each event mention can be deemed a question. The question module of few-shot event classification treats each event mention as implicitly asking “How does this event mention contribute to event prototype learning?”.

The question vectors for the event mentions are obtained by feeding the sentence encodings in the support set to a Bi-GRU:

(14)  {q_1, …, q_{N×K}} = BiGRU(s_1^s, …, s_{N×K}^s)

Memory Module for TI and Few-Shot EC. The memory modules for trigger identification and few-shot event classification are almost the same, except that their inputs are word encodings and event mention encodings respectively. Given a collection of inputs, the episodic memory module chooses which parts of the inputs to focus on through an attention mechanism. It then produces a new “memory” vector considering the question as well as the previous memory. At each iteration, the memory module is able to retrieve new information that was thought to be irrelevant in previous iterations.

Specifically, the memory module contains three components: the attention gate, the attentional GRU (Xiong et al., 2016), and the memory update gate. We present its structure in Figure 3.

Figure 3. Architecture of the DMB-PN memory module. f_i denote the facts of the input, q denotes the question vector, h_i denote the candidate facts (hidden states of the attentional GRU), and m denotes the memory.

Attention gate. The attention gate determines how much the memory module should pay attention to each fact f_i, given the facts F = {f_1, …, f_L}, the question q, and the acquired knowledge stored in the memory vector m^{t−1} from the previous step. The three inputs are transformed by:

(15)  z_i^t = [f_i ∘ q; |f_i − q|; f_i ∘ m^{t−1}; |f_i − m^{t−1}|]

where “[;]”, “∘”, “−” and “|·|” denote concatenation, element-wise product, subtraction and absolute value respectively. The first two terms measure the similarity and difference between the facts and the question, and the last two terms compare the facts with the last memory state.

Let g^t of size L denote the generated attention vector. The i-th element g_i^t in g^t is the attention weight for fact f_i. g^t is obtained by transforming z^t using a two-layer perceptron:

(16)  g_i^t = softmax_i(W_2 tanh(W_1 z_i^t + b_1) + b_2)

where W_1, W_2, b_1 and b_2 are the parameters of the perceptron.
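A minimal NumPy rendering of the attention gate in Equations (15)-(16), assuming a single question/memory vector and random perceptron weights (all sizes below are hypothetical):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_gate(F, q, m, W1, b1, W2, b2):
    """Build the interaction features for every fact (Eq. 15) and turn
    them into normalized attention weights over facts (Eq. 16)."""
    Z = np.concatenate([F * q, np.abs(F - q), F * m, np.abs(F - m)], axis=1)
    return softmax(np.tanh(Z @ W1 + b1) @ W2 + b2)

rng = np.random.default_rng(2)
L, d, h = 5, 4, 8                       # hypothetical sizes
F = rng.normal(size=(L, d))             # fact vectors
q = rng.normal(size=d)                  # question vector
m = q.copy()                            # memory initialized with the question
g = attention_gate(F, q, m, rng.normal(size=(4 * d, h)), np.zeros(h),
                   rng.normal(size=h), 0.0)
```

Note the feature vector has dimension 4d, one block per interaction term, which is why W1 has 4d input rows.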

Attentional GRU. The attentional GRU takes the facts f_i and the fact attention g_i^t as input and produces the context vector c^t. At each step, the attention weight controls the mixing between the previous hidden state and the GRU proposal:

(17)  h̃_i = GRU(f_i, h_{i−1})
(18)  h_i = g_i^t h̃_i + (1 − g_i^t) h_{i−1}

The context vector c^t is the final hidden state of the attention-based GRU:

(19)  c^t = h_L

Memory update gate. The episodic memory after t passes is computed by

(20)  m^t = GRU(c^t, m^{t−1})

and the new episodic memory state is calculated by

(21)  m^t = ReLU(W_m [m^{t−1}; c^t; q] + b_m)

where “[;]” is the concatenation operator, and W_m and b_m are parameters of the update gate.
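Putting Equations (17)-(21) together, the sketch below runs one episode of the attentional GRU followed by a ReLU memory update. The GRU weight packing, the uniform attention weights, and all sizes are simplifying assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, W):
    """Plain GRU cell; W packs reset/update/candidate weights (hypothetical)."""
    Wr, Wz, Wh = W
    xh = np.concatenate([x, h])
    r = sigmoid(Wr @ xh)
    z = sigmoid(Wz @ xh)
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h]))
    return (1 - z) * h + z * h_tilde

def episode(F, g, W):
    """Attentional GRU: the attention weight g_i mixes the old state with
    the GRU proposal (Eqs. 17-18); the final state is the context c^t (Eq. 19)."""
    h = np.zeros(F.shape[1])
    for f_i, g_i in zip(F, g):
        h = g_i * gru_cell(f_i, h, W) + (1 - g_i) * h
    return h

rng = np.random.default_rng(3)
L, d = 5, 4
W = [rng.normal(size=(d, 2 * d)) for _ in range(3)]
F = rng.normal(size=(L, d))
g = np.full(L, 1.0 / L)                 # uniform attention, just for the sketch
c = episode(F, g, W)                    # context vector c^t
m_prev, q = np.zeros(d), rng.normal(size=d)
Wm = rng.normal(size=(d, 3 * d))
m_new = np.maximum(0.0, Wm @ np.concatenate([m_prev, c, q]))  # ReLU update (Eq. 21)
```

Iterating this episode-then-update loop for several passes is what gives the DMN its multi-hop behaviour.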

Memory-Based Prototypical Network for Few-Shot EC. The main idea of prototypical networks for few-shot event classification is to use a feature vector, also called a prototype, to represent each event type. The traditional approach computes the prototype by averaging all the instance embeddings of the support set for that event type. In this paper, we apply a memory-based mechanism to produce the event prototypes.

In practice, event mentions for an event type can vary greatly, and the huge diversity among instances may result in inaccurate representations of events. In order to obtain a more precise event prototype c_i, we encode each event mention in the support set by letting it interact with the other event mentions of the same event type, as calculated by Equation (21).

We then compute the probabilities of event types for the query instance s^q (Equation (12)) as follows:

(22)  p(y = e_i | s^q) = exp(−d(s^q, c_i)) / Σ_{j=1}^{N} exp(−d(s^q, c_j))

where d(·, ·) denotes the Euclidean distance.
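Equation (22) is a softmax over negative Euclidean distances; a small NumPy sketch with toy prototypes (the prototypes and query below are made up for illustration):

```python
import numpy as np

def event_type_probs(query, protos):
    """p(y = e_i | q) ∝ exp(-d(q, c_i)) with Euclidean distance d (Eq. 22)."""
    d = np.linalg.norm(protos - query, axis=1)
    e = np.exp(-(d - d.min()))              # shift for numerical stability
    return e / e.sum()

protos = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])  # toy event prototypes
p = event_type_probs(np.array([0.1, 0.0]), protos)
```

Subtracting the minimum distance before exponentiating leaves the normalized probabilities unchanged while avoiding underflow for distant prototypes.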

Loss function. We adopt the cross-entropy function as the cost function for few-shot event classification, calculated by

(23)  J_EC = −Σ_{(s^q, y) ∈ Q} log p(y | s^q)

The final loss function for few-shot event detection is a weighted sum of the trigger identification loss and the few-shot event classification loss, denoted by

(24)  J = λ J_TI + (1 − λ) J_EC

where λ is a hyper-parameter balancing the two losses, fixed in our experiments.

4. Experiments

The experiments seek: (1) to compare the dynamic-memory-based prototypical network with a series of combinations of sentence encoding models and metric models; (2) to assess the effectiveness of memory-based models from the perspectives of K-shot evaluations and N-way evaluations respectively, under different N-way-K-shot settings; (3) to provide evidence that dynamic-memory-based approaches are better suited to learning contextual representations for both event prototypes and event mentions from limited instances.

Model Encoder Metric 5-Way-5-Shot 5-Way-10-Shot 5-Way-15-Shot 10-Way-5-Shot 10-Way-10-Shot 10-Way-15-Shot
BRN-MN Bi-LSTM Match
CNN-MN CNN Match
SAT-MN Self-Attn Match
DMN-MN DMN Match
BRN-PN Bi-LSTM Proto
CNN-PN CNN Proto
SAT-PN Self-Attn Proto
DMN-PN DMN Proto
BRN-MPN Bi-LSTM M-Proto
CNN-MPN CNN M-Proto
SAT-MPN Self-Attn M-Proto
DMB-PN DMN M-Proto
Table 1. Accuracy (%) and F1 score of few-shot event classification. “Encoder” and “Metric” denote the sentence encoder and the metric-based model respectively, so the final “Model” is a combination of the two. “Match”, “Proto” and “M-Proto” are abbreviations for the matching network, the prototypical network and the memory-based prototypical network respectively. The value in brackets denotes the accuracy margin, calculated by subtracting the accuracy of the worst baseline from that of the current model under inspection.

4.1. Datasets

The FSED task should be trained and tested on few-shot event detection datasets, as in few-shot tasks of other research areas, but no FSED dataset previously existed. Thus we evaluate our models on a newly-generated dataset tailored particularly for few-shot event detection, called FewEvent. In general, it contains a large collection of instances covering a set of event types graded into finer-grained event subtypes, with each event type annotated with a substantial number of instances on average. FewEvent was built using two different construction methods.

The FewEvent dataset is now released and published at https://github.com/231sm/Low_Resource_KBP, including details of event types and their instance quantity.

In our experiment settings, a subset of the event types is selected for training, another subset for validation, and the remaining event types for testing. Note that there are no overlapping types between the training and testing sets, following the typical few-shot setting.

4.2. Baselines and Settings

Comparisons are performed against two types of baselines: sentence encoding models and metric learning models. For sentence encoder baselines, we consider four models: CNN (Kim, 2014; Zeng et al., 2014), Bi-LSTM (Huang et al., 2015), the Self-Attention model (Vaswani et al., 2017), and DMN (Kumar et al., 2016; Xiong et al., 2016). For metric-learning baselines, we mainly consider two commonly-used metric models, i.e., Matching Networks (Vinyals et al., 2016) and Prototypical Networks (Snell et al., 2017), as well as our proposed Memory-Based Prototypical Network. Combining these two sets of models in pairs, we obtain the models listed in Table 1. The combination of the DMN and the Memory-Based Prototypical Network is our proposed model, denoted as DMB-PN.

With regard to the training process, the stochastic gradient descent (SGD) (Ketkar, 2014) optimizer is used, with a fixed number of training and testing iterations. The dimensions of the memory units, word embeddings and position embeddings, the number of memory-module passes, the dropout rate used in DMB-PN to avoid over-fitting, and the learning rate are all fixed across experiments. We evaluate the performance of event detection with accuracy and F1 score.

4.3. General Comparisons

As shown in Table 1, we compare accuracies, F1 scores and accuracy margins among different combinations of sentence encoders and metric models.

A general inspection reveals that the Prototypical Network (PN) outperforms the Matching Network (MN) when the sentence encoders are the same, in almost all N-way-K-shot settings. A possible explanation is that PN learns to compare a query instance with an event prototype, i.e., instance-to-prototype matching, whereas MN compares instances in the support set with those in the query set, i.e., instance-to-instance matching. Instance-to-instance matching is more susceptible to noise in metric computation than instance-to-prototype comparison: if there are many outlier instances in the support set, instance-to-instance matching will introduce more noise. This result confirms the previous finding that prototype learning reduces the noise introduced by instance randomness (Snell et al., 2017).

Notably, the best result is achieved by DMB-PN, a prototypical network incorporating a DMN. This result echoes the statement that the dynamic-memory-based prototypical network learns better prototypes than simply averaging over the instances of the support sets, owing to its capability of distilling contextual information from event instances multiple times and in an incremental way.

4.4. K-Shot Evaluations

This section is primarily intended to assess the effectiveness of memory-based models from the K-shot perspective, comparing N-way-K-shot settings with the same way number, such as 5-way-5-shot, 5-way-10-shot and 5-way-15-shot. As shown in Table 1, the effect is reflected by the variance of the accuracy margin, denoted as (+m) in brackets and defined as the margin between the accuracy of the worst baseline model and that of the current model under inspection. We report the analysis for metric models and sentence encoders respectively.

4.4.1. On Dynamic-Memory-Based Prototypical Networks

When the sentence encoders are the same, we observe that DMB-PN achieves the best accuracy margin in both 5-way and 10-way settings. Further inspection reveals that the accuracy margin of DMB-PN increases as the shot number decreases, indicating that the model performs even better when the shot number is relatively small. In contrast, for other metric-based models such as the prototypical network, the margin does not always increase steadily as the shot number decreases; e.g., the margin for CNN-PN, a combination of CNN and Proto, first increases and then decreases. A possible reason is that the memory mechanism in DMB-PN depends less on the quantity of instances: it is still capable of learning distinguishable event prototypes even when the number of instances in the support set is very small. These results corroborate the statement that the prototypical network integrated with the memory mechanism is more applicable to few-shot classification tasks, particularly when the instance quantity is relatively small.

4.4.2. On Dynamic-Memory-Based Sentence Encoders

Given the same metric model, such as M-Proto, the DMN-based encoder achieves the best margin, e.g., for DMB-PN. A further inspection reveals that, for all models with a DMN-based encoder, such as DMN-PN and DMN-MN, the margin increases as the shot number decreases, whereas models with other encoders behave differently. These results indicate that DMN-based models are more robust in learning sentence encodings, particularly when the shot number is relatively small. A possible explanation is that the multi-hop mechanism of memory-based models is more advantageous for fully utilizing the limited instances, while the other sentence encoders consume the training samples only once.

Figure 4. N-way evaluations with fixed shot numbers. (a) and (b) illustrate the variation tendency of accuracy under one fixed shot number for models with CNN encoders and the DMN encoder respectively. (c) and (d) illustrate the corresponding results under another fixed shot setting as the way number increases.

4.5. N-Way Evaluations

This section is mainly intended to assess the effectiveness of memory-based models from the N-way perspective, comparing N-way-K-shot settings with the same shot number, as illustrated in Figure 4. Generally, the accuracy decreases as the way number increases when the shot number is fixed, which is in accordance with expectation: a larger number of ways results in a wider variety of event types to be predicted, which increases the difficulty of correct classification. We further observe that memory-based models such as CNN-MPN perform better than vanilla prototypical networks, which in turn overtake matching networks, and the margins among them increase as the way number increases. These results indicate that the memory-based prototypical network is more robust to the number of ways, as the multi-hop mechanism in memory networks contributes to learning more distinguishable event prototypes.

Figure 5. Visualization of the event prototypes, support set, and a query instance of DMB-PN in a 5-way-K-shot event detection task. Note that the five bigger dots with outlines denote event prototypes, and events of the same type are marked in the same color. The red dot represents a query instance.
Figure 6. Visualization of word attentions in event mention encodings, where instances of the support and query sets in training and testing are shown respectively. Note that the attention value becomes smaller as the green becomes lighter, and triggers are marked in bold italic. For conciseness, we only colorize the words with the largest attention values in each event mention encoding.

4.6. Case Study

This section reports case studies on the learned event prototypes and on the model's effectiveness in learning meta knowledge from the sentences of event instances.

4.6.1. On Event Prototypes

Figure 5 visualizes several samples of event prototypes and event mention encodings generated by DMB-PN. One interesting finding is that the distances between the query instance and the prototypes are fairly distinguishable, whereas it is hard to separate the query instance of Die, marked in red, from the surrounding instances of Execute, Injure, and Die, marked in dark green, yellow, and purple respectively, since their distance distributions are very similar. We can nevertheless easily predict that the query belongs to Die, as it is closest to the event prototype of Die. This example supports the claim that DMB-PN generates distinguishable event prototypes and therefore has an advantage in few-shot event detection, especially when instances lie close together in the vector space.
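The nearest-prototype prediction described above follows the standard prototypical-network recipe (Snell et al., 2017). The sketch below computes per-class prototypes by simple averaging for clarity; DMB-PN instead distills prototypes with a DMN, so this is only an illustrative baseline, and the names and toy encodings in it are hypothetical.

```python
import numpy as np

def prototypes(support, labels):
    """Vanilla prototypical-network prototypes: the per-class mean of
    support encodings. (DMB-PN replaces this mean with a DMN-based
    multi-hop aggregation; the mean is shown here for simplicity.)"""
    classes = sorted(set(labels))
    return classes, np.stack([
        support[[i for i, y in enumerate(labels) if y == c]].mean(axis=0)
        for c in classes
    ])

def predict(query, protos, classes):
    # assign the query to the nearest prototype (Euclidean distance)
    d = np.linalg.norm(protos - query, axis=1)
    return classes[int(np.argmin(d))]

# toy 2-D support encodings for two event types
support = np.array([[0., 0.], [0., 1.], [4., 4.], [4., 5.]])
labels = ["Die", "Die", "Injure", "Injure"]
classes, protos = prototypes(support, labels)
pred = predict(np.array([0.2, 0.4]), protos, classes)  # lies near the "Die" cluster
```

Even when individual support instances crowd around the query, the prototype of the correct class can remain the nearest point, which is the behavior Figure 5 illustrates.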

4.6.2. On Trigger Detection

To assess the effectiveness of learning event-type-specific meta knowledge from sentence instances, we visualize the word attentions obtained from event mention encoding via DMB-PN, as shown in Figure 6. During training, the triggers in each event mention tend to receive higher attention values than other words, and similar results are obtained during testing, indicating that DMB-PN can effectively detect triggers in event mentions. A further inspection of the training examples reveals that the other highlighted words are participants involved in an event or provide important clues for the ED task, and can be seen as arguments (Chen et al., 2015). During testing, the arguments of each event mention also receive higher attention. For example, in the event mention “Nathan divorced wallpaper salesman Bruce Nathan in 1992.”, whose trigger is “divorced”, DMB-PN ranks “divorced”, “Nathan”, “Bruce”, “1992”, and “salesman” as the top-valued words, among which the latter four all describe the Divorce event. This observation suggests that DMB-PN captures both trigger and argument information, thereby generating more accurate sentence encodings and extracting more valuable information from the limited labeled training data. It can therefore be assumed that DMB-PN assimilates more valuable meta knowledge from the few-shot samples and transfers more event-type-specific knowledge to few-shot event classification.
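The qualitative inspection above amounts to ranking words by their attention values. A minimal, hypothetical helper is sketched below; the attention values are made up for illustration (chosen so that the ranking matches the example in the text) and are not the model's actual outputs.

```python
def top_attended_words(words, attentions, k=5):
    """Return the k words with the largest attention values, as used in
    the qualitative inspection of event-mention encodings
    (hypothetical helper; the paper does not specify this code)."""
    ranked = sorted(zip(words, attentions), key=lambda p: -p[1])
    return [w for w, _ in ranked[:k]]

words = "Nathan divorced wallpaper salesman Bruce Nathan in 1992 .".split()
attn  = [0.18, 0.30, 0.02, 0.10, 0.15, 0.05, 0.01, 0.14, 0.05]  # made-up values
top5 = top_attended_words(words, attn, k=5)
# yields the trigger "divorced" first, followed by argument-like words
```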

4.7. Parameter Analysis

In this section, we study the effect of the loss ratio in Equation (24) on trigger identification.

Figure 7. The trigger identification accuracy of DMB-PN model in different few-shot tasks.

As seen in Figure 7, as the loss ratio increases, the performance of trigger identification first increases and then decreases, and the best performance is achieved at an intermediate value, which is why we choose it as the value of the hyperparameter. In general, a larger ratio yields better trigger identification than a smaller one. Intuitively, a bigger ratio pushes the model toward more precise trigger identification, but not always: in DMB-PN, the training of trigger identification and few-shot event classification interact with each other, and the final results are a combination of the two. We therefore select a median value as the loss ratio of trigger identification, and the results in Figure 7 demonstrate its effectiveness. DMB-PN thus naturally integrates trigger identification and few-shot event classification, letting them influence each other while both achieve good performance.
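Since Equation (24) is not reproduced in this section, we note only that a common way to combine two task losses is a weighted sum. The sketch below assumes the form L = λ·L_trigger + (1 − λ)·L_class; this specific form is an assumption for illustration and may differ from the paper's exact formulation.

```python
def joint_loss(loss_trigger, loss_class, lam=0.5):
    """Weighted combination of the trigger-identification and few-shot
    classification losses. The form lam*L_tri + (1-lam)*L_cls is an
    assumption; Equation (24) may weight the terms differently."""
    return lam * loss_trigger + (1.0 - lam) * loss_class

# sweeping the ratio, as in the parameter analysis of Figure 7
losses = [joint_loss(0.8, 0.4, lam) for lam in (0.1, 0.3, 0.5, 0.7, 0.9)]
```

Under such a formulation, raising the ratio emphasizes trigger identification at the expense of classification, which is consistent with choosing a median value so that the two objectives balance.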

5. Conclusion

In this paper, we propose a Dynamic-Memory-Based Prototypical Network (DMB-PN) for the few-shot event detection task in a meta-learning setting. Our approach consists of two stages: trigger identification and few-shot event classification. In the first stage, we locate the trigger in each event mention and obtain a memory-augmented sentence encoding based on the DMN. In the second stage, we utilize the dynamic-memory-based prototypical network to classify the event type of each query instance, where event mentions are encoded via the multi-hop mechanism of the DMN to capture contextual information among event mention encodings. The experimental results demonstrate that the integration of the prototypical network and the dynamic-memory-based model excels at addressing the sample-shortage problem in few-shot event detection, and that dynamic-memory-based approaches outperform other sentence encoding baselines in the context of limited labeled sentence instances, especially when the variety of event types is large and the instance quantity is small.

In the future, we will apply DMB-PN to other few-shot tasks, such as few-shot relation extraction, exploiting the contexts of texts and entities.

6. Acknowledgements

We express our gratitude to the anonymous reviewers for their hard work and kind comments, which will further improve our work. This work is funded by NSFC 91846204/61473260, the national key research program YS2018YFB140004, and the Alibaba CangJingGe (Knowledge Engine) Research Plan.


References

  • L. Baldini Soares, N. FitzGerald, J. Ling, and T. Kwiatkowski (2019) Matching the blanks: distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2895–2905.
  • K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD Conference, pp. 1247–1250.
  • Y. Chen, S. Liu, X. Zhang, K. Liu, and J. Zhao (2017) Automatically labeled data generation for large scale event extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pp. 409–419.
  • Y. Chen, L. Xu, K. Liu, D. Zeng, and J. Zhao (2015) Event extraction via dynamic multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Vol. 1, pp. 167–176.
  • Y. Chen (2019) Exploiting the ground-truth: an adversarial imitation based knowledge distillation approach for event detection.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135.
  • T. Gao, X. Han, Z. Liu, and M. Sun (2019) Hybrid attention-based prototypical networks for noisy few-shot relation classification.
  • Y. Hong, J. Zhang, B. Ma, J. Yao, G. Zhou, and Q. Zhu (2011) Using cross-entity inference to improve event extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 1127–1136.
  • Y. Hong, W. Zhou, J. Zhang, G. Zhou, and Q. Zhu (2018) Self-regulation: employing a generative adversarial network to improve event detection. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Vol. 1, pp. 515–526.
  • L. Huang, H. Ji, K. Cho, I. Dagan, S. Riedel, and C. Voss (2018) Zero-shot transfer learning for event extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 2160–2170.
  • Z. Huang, W. Xu, and K. Yu (2015) Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.
  • N. Ketkar (2014) Stochastic gradient descent. Optimization.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
  • G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2.
  • A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher (2016) Ask me anything: dynamic memory networks for natural language processing. In International Conference on Machine Learning, pp. 1378–1387.
  • Q. Li, H. Ji, and L. Huang (2013) Joint event extraction via structured prediction with global features. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 73–82.
  • J. Liu, Y. Chen, and K. Liu (2019) Exploiting the ground-truth: an adversarial imitation based knowledge distillation approach for event detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6754–6761.
  • S. Liu, R. Cheng, X. Yu, and X. Cheng (2018) Exploiting contextual information via dynamic memory network for event detection.
  • S. Liu, Y. Chen, K. Liu, and J. Zhao (2017) Exploiting argument information to improve event detection via supervised attention mechanisms. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1789–1798.
  • D. McClosky, M. Surdeanu, and C. D. Manning (2011) Event extraction as dependency parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 1626–1635.
  • T. H. Nguyen, L. Fu, K. Cho, and R. Grishman (2016) A two-stage approach for extending event detection to new types via neural networks. In Proceedings of the 1st Workshop on Representation Learning for NLP, pp. 158–165.
  • T. H. Nguyen and R. Grishman (2016) Modeling skip-grams for event detection with convolutional neural networks. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 886–891.
  • T. H. Nguyen and R. Grishman (2018) Graph convolutional networks with argument-aware pooling for event detection. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • H. Peng, Y. Song, and D. Roth (2016) Event detection and co-reference with minimal supervision. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 392–402.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing, pp. 1532–1543.
  • A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap (2016) Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pp. 1842–1850.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998–6008.
  • O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638.
  • D. Milne and I. H. Witten (2008) Learning to link with wikipedia. In ACM Conference on Information and Knowledge Management, pp. 509–518.
  • C. Xiong, S. Merity, and R. Socher (2016) Dynamic memory networks for visual and textual question answering. In International Conference on Machine Learning, pp. 2397–2406.
  • S. Yang, D. Feng, L. Qiao, Z. Kan, and D. Li (2019) Exploring pre-trained language models for event extraction and generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5284–5294.
  • M. Yu, X. Guo, J. Yi, S. Chang, S. Potdar, Y. Cheng, G. Tesauro, H. Wang, and B. Zhou (2018) Diverse few-shot text classification with multiple metrics. arXiv preprint arXiv:1805.07513.
  • D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, et al. (2014) Relation classification via convolutional deep neural network.