Neural Simile Recognition with Cyclic Multitask Learning and Local Attention

Abstract

Simile recognition aims to detect simile sentences and to extract simile components, i.e., tenors and vehicles. It involves two subtasks: simile sentence classification and simile component extraction. Recent work has shown that standard multitask learning is effective for Chinese simile recognition, but it remains unclear whether simple parameter sharing captures the mutual effects between the subtasks well. We propose a novel cyclic multitask learning framework for neural simile recognition, which stacks the subtasks and forms a loop by connecting the last subtask to the first. The framework performs the subtasks iteratively, taking the outputs of the previous subtask as additional inputs to the current one, so that the interdependence between the subtasks can be better explored. Extensive experiments show that our framework significantly outperforms the current state-of-the-art model and our carefully designed baselines, and the gains remain substantial when using BERT. The source code of this paper is available at https://github.com/DeepLearnXMU/Cyclic.

Introduction

Simile is a special type of metaphor that compares two objects (called the tenor and the vehicle) of different categories using comparator words such as “like”, “as” or “than”. A Chinese simile sentence is shown in Figure 1, where the tenor “Magnolia flower” and the vehicle “perfume” are compared using the comparator “like”. Typically, simile recognition involves two subtasks [11]: Simile Sentence Classification, which determines whether a sentence containing a comparator is a simile sentence, and Simile Component Extraction, which extracts the tenor and the vehicle from a simile sentence.

Figure 1: Attention weights generated by the simile sentence classifier of the conventional multitask learning [11], showing that the simile sentence classifier tends to focus more on simile components.

It is of great importance to study simile. Simile recognition is potentially beneficial for NLP applications such as sentiment analysis (e.g., hate speech detection), dialogue understanding and question answering, because users sometimes use similes to express their emotions. In addition, by highlighting simile components, simile recognition can help language learners better understand the implicit meanings that similes express in books and novels. However, simile recognition is very challenging. One reason is that simile sentences have syntactic and semantic structures very similar to those of literal sentences, which limits the usefulness of standard NLU techniques such as syntactic and semantic parsing. Even though comparator words provide some hints, they are also frequently used in literal comparisons, which introduces great ambiguity into this task.

Previous approaches to simile recognition are primarily based on handcrafted linguistic features and syntactic patterns [8, 18, 19, 17], which transfer poorly to new languages and domains because of the extra time needed for feature engineering. Inspired by the successful applications of neural multitask learning to many NLP tasks [12, 31, 13, 16], Liu et al. (2018) investigated a standard multitask learning framework for simile recognition, which significantly outperforms the earlier methods. Specifically, they apply a bi-directional LSTM [4] to encode each input sentence, and the encoding results are then taken as features shared by an attention-based sentence classifier, a CRF-based component extractor [7] and a language model that serves as an auxiliary task providing additional supervision signals.

Despite its success, the multitask learning framework of Liu et al. (2018) suffers from two major drawbacks. First, simple parameter sharing is unable to fully exploit the semantic interdependence between the two subtasks of simile recognition. Intuitively, the results of these subtasks can benefit each other. Potential simile components usually receive higher attention weights during simile sentence classification. Taking Figure 1 as an example, the attention weights for the tenor “Magnolia flower” and the vehicle “perfume” are much higher than those of the other words. Therefore, the simile component extractor can be more precise when it is given the information about the potential tenor and vehicle (the attention distribution) from the simile sentence classifier. Conversely, simile sentence classification becomes easier once the tenor and vehicle have been identified through component extraction, since they directly determine whether the sentence is a simile.

Second, both tenors and vehicles are usually close to comparators. In a standard simile benchmark [11], the average distances from tenors to comparators and from vehicles to comparators are 3.0 and 4.3, respectively, while the average sentence length is 29.5. As a result, the global attention mechanism used by Liu et al. (2018) can suffer from attention errors, since it considers all words in a sentence. Returning to Figure 1, irrelevant words, such as the words after “,”, distract the global attention model and receive attention weights significantly larger than zero.

To overcome the above drawbacks, we propose a novel cyclic multitask learning framework with local attention for neural simile recognition. Figure 2 shows our framework, which captures the correlations among its subtasks by feeding the output of each subtask into the next. It organizes the subtasks as a cycle that is executed K times, so that every subtask further benefits from all the others.

Taking one pass through the cycle as an example: first, a Bi-LSTM encodes the input sentence and produces a sequence of word representations. Then, a local attention model is applied to the local sequence of word representations around the comparator word, and the induced local context vector is fed into a simile sentence classifier. Next, we concatenate the attention weights generated by the local attention with the word representations, and send the results to a CRF layer to extract simile components via sequence labeling. Afterwards, the label distribution and the word representations are concatenated as the input of the sentence decoder to reconstruct the original sentence. Finally, the decoder states are summed with the word representations, and the results serve as the input for the subsequent simile classification.

Overall, our contributions are threefold:

  • We propose a novel cyclic multitask learning framework for neural simile recognition. Compared with standard multitask learning, this framework better models the intercorrelation among its subtasks.

  • We introduce a local attention mechanism for simile sentence classification. To our knowledge, no previous work has explored local attention for this task.

  • Our framework outperforms carefully designed baselines both with and without pretrained BERT [1], setting a new state of the art.

Figure 2: The architecture of our framework. Following previous work [11], ➂ is taken as an auxiliary sub-task for additional supervision.

Our Framework

In this section, we give a detailed description of our proposed framework. As shown in Figure 2, our cyclic framework chains a local attention based simile sentence classifier (➀), a CRF based simile component extractor (➁), and a Bi-LSTM sentence decoder (➂) into a cycle, where the output of each module is fed as additional input to its successor module. In this way, the interdependence between different subtasks can be better exploited. In addition, the framework contains a Bi-LSTM sentence encoder that provides shared features to these modules (➀, ➁ and ➂). Note that our framework executes the cycle K times; a single pass runs from ➀ through ➁ and ➂ and then back to ➀.
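The data flow of the cycle can be sketched in a few lines of Python. This is only an illustration of the control flow described above: the function name `run_cycle`, the callable signatures, and the toy stand-in modules are assumptions made for this example, and the reading that one pass is ➀, ➁, ➂ and then ➀ again is inferred from the description rather than taken from the released code.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def run_cycle(
    word_reprs: Sequence[Vector],
    classify: Callable[[Sequence[Vector]], Tuple[float, List[float]]],
    extract: Callable[[Sequence[Vector], List[float]], List[Vector]],
    decode: Callable[[Sequence[Vector], List[Vector]], List[Vector]],
    num_cycles: int = 1,
) -> Tuple[float, List[Vector]]:
    """Run classifier -> extractor -> decoder -> classifier for num_cycles passes."""
    if num_cycles < 1:
        raise ValueError("this sketch only covers at least one pass through the cycle")
    # First case: the classifier only sees the encoder's word representations.
    simile_prob, attn = classify(word_reprs)
    label_dists: List[Vector] = []
    for _ in range(num_cycles):
        # Extractor receives the attention weights as additional input.
        label_dists = extract(word_reprs, attn)
        # Decoder receives the label distributions as additional input.
        dec_states = decode(word_reprs, label_dists)
        # Second case: word representations are summed with decoder states.
        clf_input = [[a + b for a, b in zip(h, s)] for h, s in zip(word_reprs, dec_states)]
        simile_prob, attn = classify(clf_input)
    return simile_prob, label_dists


# Toy stand-ins (hypothetical) that only exercise the wiring.
def toy_classify(reprs):
    scores = [sum(h) for h in reprs]
    total = sum(scores) or 1.0
    attn = [s / total for s in scores]
    return max(attn), attn


def toy_extract(reprs, attn):
    return [[a, 1.0 - a] for a in attn]


def toy_decode(reprs, label_dists):
    return [[d[0]] * len(h) for h, d in zip(reprs, label_dists)]


prob, labels = run_cycle([[1.0, 1.0], [3.0, 1.0]], toy_classify, toy_extract, toy_decode)
```

In a real model the three callables would be trainable networks sharing one encoder, and all losses would be back-propagated jointly; the sketch leaves training out entirely.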

Bi-LSTM based Sentence Encoder

Given an input sentence, we follow Liu et al. (2018) and first map its words to embeddings. A Bi-LSTM is then applied to produce a sequence of word representations that are shared by our subtasks. The forward LSTM reads the sentence from left to right to learn a forward representation of each word. Similarly, the backward LSTM scans the sentence in reverse and learns a backward representation of each word. Finally, for each word, the representations from the two LSTMs are concatenated to form its word representation.

Local Attention based Simile Sentence Classifier

Figure 3: Global attention mechanism and our proposed local attention mechanism.

Simile sentence classification is a binary classification task that determines whether a sentence contains a simile. As analyzed previously, the global attention mechanism adopted by Liu et al. (2018) considers all words. Hence, it is not well suited to simile sentence classification, which mainly depends on the tenor and the vehicle around the comparator. To address this issue, we base our simile sentence classifier on a local attention mechanism. Compared with a global attention mechanism, our local attention mechanism focuses only on a dynamically chosen local context surrounding the comparator word.

We contrast our local attention mechanism with the global attention mechanism in Figure 3. In more detail, since the comparator word, such as “like”, “than” or “as”, is given in each sentence, we first take the position of the comparator word as the central position, and then dynamically generate a context window size as follows:

(1)

Here the window size is predicted from the representation vector of the central word, and the remaining symbols are learnable parameters. As the next step, we perform an attention operation on the local sequence of word representations inside this window to generate a local context vector via

(2)

where the window size is the one defined in Eq. (1) and the attention scoring uses a model parameter. Next, we stack a feed-forward network on the local context vector to induce high-level features. Finally, these features are used as inputs of an output layer to conduct classification:

(3)

where the two symbols introduced in Eq. (3) are model parameters (see footnote 1).
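To make the windowing idea concrete, here is a minimal pure-Python sketch of local attention around a fixed center position. It rests on stated assumptions: the window predictor borrows the sigmoid-scaled form that [14] uses for predictive alignment, and the attention scores are plain dot products with a hypothetical `query` vector. Neither is claimed to match Eqs. (1) and (2); all function and parameter names are invented for the example.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def predict_window(h_center: Vector, w_proj: Sequence[Vector], v: Vector, max_len: int) -> int:
    """Assumed form: D = max_len * sigmoid(v . tanh(W h_center)), at least 1."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, h_center))) for row in w_proj]
    gate = 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(v, hidden))))
    return max(1, round(max_len * gate))


def local_attention(
    word_reprs: Sequence[Vector], center: int, window: int, query: Vector
) -> Tuple[Vector, List[float]]:
    """Softmax attention restricted to positions [center - window, center + window]."""
    lo = max(0, center - window)
    hi = min(len(word_reprs), center + window + 1)
    scores = [sum(q * x for q, x in zip(query, word_reprs[i])) for i in range(lo, hi)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [0.0] * len(word_reprs)  # words outside the window get zero weight
    for offset, e in enumerate(exps):
        weights[lo + offset] = e / norm
    dim = len(word_reprs[0])
    context = [sum(weights[i] * word_reprs[i][d] for i in range(lo, hi)) for d in range(dim)]
    return context, weights


reprs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [5.0, 5.0]]
context, weights = local_attention(reprs, center=2, window=1, query=[0.0, 0.0])
```

With a zero query the three in-window words share the weight equally, while the two words outside the window receive exactly zero, which is the property that distinguishes local from global attention.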

As shown in Figure 2, there are two cases for simile sentence classification within our cyclic framework. In the first case, only the word representations are available, and they are used to produce the local attention weights that serve as additional input to the simile component extractor. In the second case, both the word representations and the states of the Bi-LSTM sentence decoder (➂ in Figure 2) are available, and we directly take the sum of the word representations and the decoder states as the classifier input to conduct simile classification. To give a unified definition, the loss function for simile sentence classification is defined as:

(4)

Intuitively, the decoder states contain useful information from the simile component extractor and the sentence decoder, which can be directly propagated to the simile sentence classifier by incorporating them into its input.

CRF based Component Extractor

We follow Liu et al. (2018) and implement simile component extraction as a sequence labeling task. As mentioned before, the results of simile sentence classification can benefit simile component extraction. Thus, we augment the word representations with the attention weights generated from Equation 2:

(5)

and then we stack a CRF layer on the augmented representations. Formally, the score of a predicted label sequence for the input sentence is defined as

(6)

where the transition matrix is updated during training and records the score of moving from one label to the next; similarly, the emission matrix records the score of assigning each tag to each word. Specifically, each entry of the emission matrix for a word comes from a label distribution vector, with one dimension per label, generated by feeding the augmented word representation into a single-layer network with a nonlinear activation function followed by an output layer. Finally, the probability of the label sequence given the sentence is

(7)

To train this extractor, we minimize the standard negative log-likelihood loss function:

(8)
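The CRF quantities described in this subsection can be illustrated with a textbook linear-chain CRF in pure Python. The sketch assumes the usual formulation (a path score equal to the sum of emission and transition scores, a partition function computed by the forward algorithm, and a negative log-likelihood loss) and omits start and stop transitions for brevity. The function names and the toy numbers are assumptions for this example, not the paper's implementation.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def _logsumexp(values: Sequence[float]) -> float:
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def path_score(emissions: Matrix, transitions: Matrix, labels: Sequence[int]) -> float:
    """Sum of emission scores plus label-to-label transition scores along one path."""
    score = sum(emissions[i][y] for i, y in enumerate(labels))
    score += sum(transitions[a][b] for a, b in zip(labels, labels[1:]))
    return score


def log_partition(emissions: Matrix, transitions: Matrix) -> float:
    """Forward algorithm: log of the summed exponentiated scores of all label paths."""
    num_labels = len(transitions)
    alpha = list(emissions[0])
    for emit in emissions[1:]:
        alpha = [
            emit[j] + _logsumexp([alpha[i] + transitions[i][j] for i in range(num_labels)])
            for j in range(num_labels)
        ]
    return _logsumexp(alpha)


def crf_nll(emissions: Matrix, transitions: Matrix, gold: Sequence[int]) -> float:
    """Negative log-likelihood of the gold label sequence."""
    return log_partition(emissions, transitions) - path_score(emissions, transitions, gold)


# Toy example: 3 words, 2 labels (values are arbitrary).
emissions = [[1.0, 0.5], [0.2, 1.5], [0.3, 0.1]]
transitions = [[0.5, -0.2], [0.1, 0.3]]
loss = crf_nll(emissions, transitions, gold=[0, 1, 0])
```

The forward recursion costs time linear in the sentence length, whereas enumerating every label sequence grows exponentially; the two agree on small inputs, which is a convenient sanity check.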

Bi-LSTM based Sentence Decoder

Due to the small number of available training instances, we follow Liu et al. (2018) and incorporate language modeling into our cyclic framework as an auxiliary task, which can help the Bi-LSTM encoder better model the sentence information.

In more detail, we concatenate each label distribution vector from the component extractor with the corresponding word representation to produce the initial state of the sentence decoder:

(9)

where the projection uses a learnable matrix. Next, the forward LSTM takes the previous hidden state and the previous word embedding as input to produce the hidden state at the current timestep:

(10)

and then predict the current word in the following way:

(11)

where the three weight matrices are trainable parameters. Formally, the loss function for the forward sentence decoder is defined as

(12)

The backward decoder has the same structure as the forward decoder but uses different model parameters; its equations are omitted for brevity. The backward loss function is defined as

(13)

Finally, the loss function for the whole decoder is defined as the sum of those in two directions:

(14)

Overall Training Objective

The final training objective over an instance, which contains a sentence, a simile tag, and a sequence of component labels, becomes

(15)

where the two weights are non-negative, have a sum bounded by 1, and are assigned beforehand to balance the importance of the three tasks.
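One plausible reading of such an objective is a convex combination of the three task losses, sketched below. The assumptions are explicit: the names `alpha` and `beta`, the rule that the third task receives the remaining weight `1 - alpha - beta`, and the assignment of weights to particular tasks are all hypothetical choices for this example and are not confirmed by the text.

```python
def total_loss(loss_cls: float, loss_ext: float, loss_lm: float,
               alpha: float, beta: float) -> float:
    """Weighted sum of the three task losses (assumed form, see the lead-in)."""
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("weights must be non-negative with alpha + beta <= 1")
    gamma = 1.0 - alpha - beta  # assumption: the remaining weight goes to the third task
    return alpha * loss_cls + beta * loss_ext + gamma * loss_lm


# Toy loss values, combined with weights 0.1 and 0.8.
combined = total_loss(1.0, 2.0, 3.0, alpha=0.1, beta=0.8)
```

Fixing the weights before training keeps the objective a single scalar, so one backward pass updates the shared encoder with gradients from all three tasks.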

Experiments

Settings

Data

We evaluate our model on a standard Chinese simile recognition benchmark [11], where each instance contains one or zero similes. Table 1 shows the basic statistics of this dataset. We follow Liu et al. (2018) and conduct 5-fold cross validation: the dataset is first divided equally into 5 folds. In each round, 4 folds are used as the training and validation sets (80% for training, 20% for validation), and the remaining fold is used for testing. For simile extraction, the components (tenors and vehicles) are tagged with the IOBES scheme [22].
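As a small illustration of the IOBES scheme mentioned above, the sketch below converts labeled spans into per-token tags. The tag names `TENOR` and `VEHICLE`, the function name `spans_to_iobes`, and the inclusive `(start, end, kind)` span format are assumptions made for this example, not the benchmark's actual label set.

```python
from typing import List, Sequence, Tuple


def spans_to_iobes(num_tokens: int, spans: Sequence[Tuple[int, int, str]]) -> List[str]:
    """Tag each token with B-/I-/E-/S- plus the span kind, or O outside any span."""
    tags = ["O"] * num_tokens
    for start, end, kind in spans:  # end index is inclusive
        if start == end:
            tags[start] = f"S-{kind}"  # single-token component
        else:
            tags[start] = f"B-{kind}"  # beginning
            for i in range(start + 1, end):
                tags[i] = f"I-{kind}"  # inside
            tags[end] = f"E-{kind}"  # end
    return tags


# A 6-token sentence with a two-token tenor and a one-token vehicle.
tags = spans_to_iobes(6, [(0, 1, "TENOR"), (4, 4, "VEHICLE")])
```

Compared with plain IOB tags, the extra E and S labels mark component boundaries explicitly, which gives a sequence labeler direct supervision on where a tenor or vehicle ends.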

#Sentence 11,337
#Simile Sentence 5,088
#Literal Sentence 6,249
#Token 334K
#Tenor 5,183
#Vehicle 5,119
#Unique tenor concept 1,680
#Unique vehicle concept 1,972
#Tenor-vehicle pair 5,214
Table 1: Statistics of our simile dataset.
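The 5-fold protocol described in the Data paragraph can be sketched as follows. The shuffling, the seed, and the choice to carve the validation set out of the four non-test folds by a simple 80/20 cut are assumptions for this example; the source only states the fold count and the 80%/20% proportions.

```python
import random
from typing import Iterator, List, Tuple


def five_fold_splits(
    num_items: int, seed: int = 0, val_ratio: float = 0.2, num_folds: int = 5
) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (train, validation, test) index lists, one triple per fold."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for k in range(num_folds):
        test = folds[k]
        rest = [i for j, fold in enumerate(folds) if j != k for i in fold]
        num_val = int(len(rest) * val_ratio)
        yield rest[num_val:], rest[:num_val], test


splits = list(five_fold_splits(100))
```

Each item lands in exactly one test fold, so averaging test scores over the five rounds uses every sentence once for evaluation.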

Hyper-parameters

For fair comparison, we use the same hyper-parameters as [11]. In particular, we use their pretrained 50-dimensional Word2Vec [15] embeddings, which are updated during training. For efficient training, we only use sentences with at most 120 words. The hidden sizes of the Bi-LSTM encoder and decoder are 128. The parameters of the sentence encoder and the bi-directional sentence decoder are shared. The word embeddings and the pre-softmax linear transformation in the sentence decoder are also shared. The batch size is 80 and the dropout rate is 0.5. We adopt Adadelta [28] as the optimizer with a learning rate of 1.0 and early stopping [21]. The optimal values of the two loss weights in Eq. (15), 0.1 and 0.8, are chosen using the validation set.

Contrast Models

We compare the following baselines and models to study the effectiveness of our cyclic MTL:

  • ME [8]. It is a maximum entropy model taking tokens, POS and dependency relation tags as features.

  • MTL [11]. A multitask learning framework, where the simile sentence classification, simile component extraction and sentence reconstruction are jointly modeled. It is the previous state-of-the-art system for simile sentence classification.

  • MTL-OP [11]. An “Optimized Pipeline” introduced by Liu et al. (2018) for improving simile component extraction. It involves two steps: first, the 1-best results of a model that jointly trains their simile sentence classifier and language model are used to filter simile sentences; then, another model that jointly trains a simile component extractor and a language model extracts simile components from these sentences. Note that this model is a pipeline only at decoding time.

  • MTL. Our implementation of MTL [11], where local attention is adopted instead of global attention.

  • MTL-Pip. It is a degraded variant of our framework without cyclic connection. Note that this is also novel as no previous work has investigated it on this task.

  • MTL-Cyc. Our cyclic multitask learning framework.

Effect of the Execution Number K

K F1-score for ➀ F1-score for ➁
0 86.09 63.15
1 86.62 73.33
2 86.65 73.40
3 86.27 73.57
4 85.89 72.83
Table 2: Experimental results on the validation set with different execution numbers K, where standard MTL results are shown when K=0.

Table 2 shows the validation results of our MTL-Cyc framework for different execution numbers K, where the K=0 row gives the results of MTL. Both simile classification (task ➀, +0.53 F1) and simile component extraction (task ➁, +10.18 F1) improve when K increases from 0 to 1, showing the effectiveness of stacking the subtasks into a loop. Further increasing K from 1 to 2 yields only marginal improvements for both subtasks while requiring more running time, and at K=4 both subtasks score lower than at K=2. This evidence indicates that our cyclic framework converges quickly, which makes it more useful in practice. Considering both efficiency and performance, we set K=1 for all experiments thereafter.

Task 1: Simile Sentence Classification

Model Precision Recall F1-score
ME [8] 76.61 78.32 77.45
MTL [11] 80.84 92.20 86.15
➀-Global [11] 77.51 88.95 82.84
➀ (local attention) 79.76 88.25 83.79
MTL(➀+➁) 81.95 87.44 84.61
MTL(➀+➂) 81.45 88.96 85.04
MTL-Pip(➀➁) 81.72 89.67 85.51
MTL-Pip(➁➀) 81.50 89.26 85.20
MTL(➀+➁+➂) 81.60 92.10 86.53
MTL-Pip(➀➂) 81.39 93.01 86.81
MTL-Pip(➁➂) 81.80 91.99 86.59
MTL-Cyc 82.12 92.60 *87.04*
Table 3: Main results on simile sentence classification. * indicates a statistically significant improvement over MTL(➀+➁+➂) under 1000 bootstrap tests [2, 6]. For the remainder of this paper, we use the same measure of statistical significance.
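A paired bootstrap comparison in the spirit of [2, 6] can be sketched as below for binary sentence-level predictions. This is a generic illustration, not the authors' evaluation script: the function names, the resampling of whole test instances, and the use of the fraction of resamples in which system A fails to beat system B as an approximate p-value are assumptions for this example.

```python
import random
from typing import Sequence


def f1_score(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 of the positive class (label 1); returns 0.0 when there is no true positive."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def paired_bootstrap(
    gold: Sequence[int],
    pred_a: Sequence[int],
    pred_b: Sequence[int],
    num_samples: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of resampled test sets on which system A does NOT beat system B."""
    rng = random.Random(seed)
    n = len(gold)
    wins = 0
    for _ in range(num_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample instances with replacement
        g = [gold[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        if f1_score(g, a) > f1_score(g, b):
            wins += 1
    return 1.0 - wins / num_samples


gold = [1, 0] * 50
p_value = paired_bootstrap(gold, pred_a=gold, pred_b=[0] * 100)
```

Because the same resampled indices are used for both systems, the comparison is paired: differences in difficulty between resamples cancel out rather than adding noise.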

Table 3 shows the experimental results on simile sentence classification. Overall, our MTL-Cyc exhibits the best performance, outperforming the previous state-of-the-art systems, ME [8] and MTL [11], as well as our baselines. In addition, we make the following observations:

Effect of Local Attention

As shown in Table 3 (Lines 4-5), when the conventional global attention is replaced with our proposed local attention, the F1-score of the simile sentence classifier improves by about 1 point (from 82.84 to 83.79). This supports the hypothesis that focusing on the local context of the comparator is more suitable for detecting simile sentences.

Effects of Simile Component Extraction and Sentence Reconstruction

Here, we incrementally add simile component extraction and sentence reconstruction to explore their contributions to simile sentence classification under two frameworks: MTL and MTL-Pip. From Table 3 (Lines 6-13), we draw the following conclusions. First, when two subtasks are modeled jointly, MTL(➀+➁), MTL(➀+➂), MTL-Pip(➀➁) and MTL-Pip(➁➀) all significantly outperform the single-task model ➀. This indicates that there is strong interdependence between the subtasks of simile recognition. Second, both MTL-Pip(➀➁) and MTL-Pip(➁➀) perform better than MTL(➀+➁), demonstrating that our framework utilizes the interdependence between subtasks better than MTL does. This is due to the bi-directional interaction between the two tasks: the earlier task provides useful information to the subsequent task, while the back-propagation of the subsequent task can also positively affect the earlier one. Furthermore, MTL-Pip(➀➁) performs better than MTL-Pip(➁➀), so we believe that the order in which subtasks are jointly modeled has an important effect on our framework. Third, jointly modeling all three subtasks (the last group in Table 3) is better than modeling two tasks (the second-to-last group in Table 3), regardless of which framework (MTL-Pip or MTL) is used. This result suggests that each subtask can provide useful information to the others. Finally, the better performance of MTL-Pip(➀➂) compared with MTL-Pip(➁➂) confirms that it is better to stack ➁ upon ➀ (➀➁) in the pipeline. This is probably because simile sentence classification (➀, binary classification) is generally easier than simile component extraction (➁, sequence labeling), so it is more reasonable to finish the easy task before the difficult one.

Effect of Cyclic Multitask Learning Framework

As shown in Table 3 (Lines 10-13), our MTL-Cyc outperforms the other contrast systems, including MTL-Pip(➀➂). Note that MTL-Pip(➀➂) is a subset of our cyclic framework and has not been investigated before. MTL-Cyc is better than both previously reported results and our strong baselines, demonstrating its effectiveness.

Task 2: Simile Component Extraction

Table 4 shows the comparison results on simile component extraction. As with the experimental results on Task 1, our MTL-Cyc and MTL-Pip still beat the other models. In particular, our MTL-Pip(➀➂) performs better than MTL-OP [11], showing the effectiveness of information sharing between subtasks during training. Moreover, we discuss the results from the following aspects:

Model Precision Recall F1-score
MTL [11] 55.99 69.89 62.11
MTL-OP [11] 61.60 73.61 67.07
➁ (single task) 54.98 66.47 60.18
MTL(➀+➁) 55.46 65.09 59.89
MTL(➁+➂) 58.07 69.53 63.29
MTL-Pip(➀➁) 63.14 70.12 66.45
MTL-Pip(➁➀) 56.87 67.75 61.84
MTL(➀+➁+➂) 55.92 71.30 62.68
MTL-Pip(➀➂) 64.08 71.60 67.63
MTL-Pip(➁➂) 57.54 73.37 64.50
MTL-Cyc 63.16 73.78 *68.05*
Table 4: Main results on simile component extraction. * indicates a statistically significant improvement over MTL(➀+➁+➂).

Effects of Simile Sentence Classification and Sentence Reconstruction

We first investigate the contributions of simile sentence classification and sentence reconstruction to simile component extraction under the MTL and MTL-Pip frameworks. Here, we draw the following conclusions. First, as in our results on Task 1, both MTL-Pip(➀➁) and MTL-Pip(➁➀) obtain better performance than MTL(➀+➁). This confirms the superiority of our framework, which enables the subtasks involved to benefit from each other more than conventional multitask learning does. Second, MTL(➀+➁) shows worse results than MTL(➁+➂). This may be because ➂ (sentence reconstruction) brings more supervision signals to the encoder than ➀ (binary classification). Third, MTL-Pip(➀➁) significantly outperforms MTL-Pip(➁➀). This observation is consistent with simile sentence classification being much easier than simile component extraction, so that the preceding simile sentence classifier can provide useful information for the subsequent simile component extraction.

Effect of Cyclic Multitask Learning Framework

Our MTL-Cyc outperforms all MTL-Pip models, even with the same number of parameters. This result indicates that our cyclic setting mitigates the error propagation and truly benefits from leveraging the interdependence between subtasks.

Analysis

In order to better understand the individual effectiveness of our proposed local attention mechanism and cyclic multitask learning framework, we carry out additional analysis from the following aspects.

Distribution of Window Size

Figure 4 shows the distribution of the predicted window size (Eq. 1) for our local attention on the test set. The window sizes predicted by MTL-Cyc tend to be much smaller than those predicted by MTL. The underlying reason is that our MTL-Cyc allows its local attention module to be enhanced and better adjusted by the outputs and loss signals of the other tasks.

Figure 4: Distribution of the predicted window size.

F1-scores over Distances between a Tenor and a Vehicle

We hypothesize that the difficulty of simile recognition increases as the distance between a tenor and a vehicle increases. To test this, we show results on different groups of test examples according to this distance in Figure 5. In both tasks, we find that the advantage of our framework becomes more obvious as the distance increases. In particular, the performance gap between our MTL-Cyc and MTL grows to as much as 15 points for difficult instances.

Figure 5: F1-scores on different groups of test instances according to the distance between a tenor and a vehicle. Solid lines are results on simile sentence classification (➀), dashed lines are results on simile component extraction (➁).

F1-scores over Sentence Lengths

Figure 6: F1-scores on different groups of test instances according to sentence lengths.

As shown in Figure 6, we compare our cyclic framework with standard MTL across different ranges of sentence length. Our framework is consistently better than MTL in all groups, showing its robustness.

Model(+BERT) Precision Recall F1-score
Task 1: Simile Sentence Classification
MTL(➀+➁) 83.31 93.61 88.16
MTL-Pip(➀➁) 83.84 94.63 88.91
MTL-Cyc 85.81 94.43 89.92
Task 2: Simile Component Extraction
MTL(➀+➁) 71.95 74.65 73.28
MTL-Pip(➀➁) 72.17 77.81 74.88
MTL-Cyc 73.97 77.61 75.74
Table 5: Test results for simile recognition and extraction using pretrained BERT.

Effect of Using BERT [1]

Recently, BERT has achieved great success in many NLP tasks by leveraging the rich knowledge within large-scale raw text via pretraining. To further demonstrate the effectiveness of our framework, we replace the pretrained word embeddings with the outputs of a Chinese BERT model (footnote 2), which is fine-tuned during simile training. Since the Bi-LSTM sentence decoder module (step ➂ in Figure 2) plays a role similar to BERT by providing an additional language modeling loss, we remove this module and directly construct a cycle from the other two tasks by feeding the outputs of simile component extraction (➁ in Figure 2) as additional inputs to simile sentence classification (➀ in Figure 2).

As shown in Table 5, our MTL-Cyc is still significantly better than MTL and MTL-Pip on both subtasks even with a strong pretrained BERT. In particular, the gains over MTL are 1.76 and 2.46 absolute F1 points for simile sentence classification and simile component extraction, respectively. Both results confirm the superiority of our MTL-Cyc framework over standard MTL and other alternatives.

Related Work

Simile Recognition and its Applications

Similes have been studied in linguistics and psycholinguistics to explore how humans process similes, comparisons and metaphors, and the interplay among the different components of these linguistic forms. More recently, simile recognition has found applications in many tasks. Veale and Hao (2007) and Veale (2012) showed that the category-specific knowledge acquired from explicit similes can help to better understand figurative language, such as metaphor and irony. Qadir et al. (2015) studied simile in sentiment classification, because people sometimes use similes instead of sentiment words to express their feelings. Since simile is beneficial to other applications, simile recognition has received increasing interest in industrial and academic research. Li et al. (2008) introduced a maximum entropy model as a simile sentence classifier and a CRF as a simile component extractor. Niculae and Yaneva (2013) and Niculae (2013) recognized comparisons and similes through the use of syntactic patterns. Niculae and Danescu-Niculescu-Mizil (2014) identified similes in product reviews using a series of linguistic cues as features. Overall, these approaches were primarily based on handcrafted linguistic features and syntactic patterns. Inspired by successful applications of multitask learning, Liu et al. (2018) introduced a neural multitask learning framework. We follow Liu et al. (2018) in using multitask modeling, but differ in that we consider the intercorrelation between the different subtasks of simile recognition.

Multitask Learning

Recently, jointly modeling multiple closely related tasks with shared representations has achieved great success on many NLP tasks, such as parsing and named entity recognition (NER) [3], NER and linking [13], text classification [12], POS tagging and parsing [31], extraction of entities and relations [16], event detection and summarization [26] and simile recognition [11]. Unlike previous work, our framework further considers the interactions between subtasks through cyclic information propagation. On the simile recognition task, whose subtasks are strongly intercorrelated, our framework shows much stronger performance than the conventional multitask learning framework.

Local Attention

In addition to neural machine translation [14, 27], the local attention mechanism has been shown to be effective on other NLP tasks, such as natural language inference [23]. We are the first to investigate the local attention mechanism for simile recognition.

Simile Recognition vs. Aspect-level Sentiment Analysis

The task of simile recognition looks similar to aspect-level sentiment classification (ASC) [10, 20]. ASC determines the sentiment regarding a certain aspect, while simile recognition detects whether there is a simile regarding a comparator word in a sentence [11]. However, simile recognition also requires extracting the corresponding tenor and vehicle if the sentence is a simile, while ASC does not require extracting any supporting evidence from the text. Hence, existing work on ASC cannot simply be applied to simile recognition with a naive adaptation. More importantly, to the best of our knowledge, our framework is also novel in, and has large potential for, the field of ASC. Specifically, the current state-of-the-art ASC model [5] shows that a decoding pipeline, which explicitly explores the intercorrelation among subtasks, surprisingly outperforms standard MTL [9], even though decoding pipelines generally suffer from error propagation. This suggests that our cyclic-MTL may further improve ASC, as the joint training and cyclic flow of our framework can model the intercorrelation among subtasks better than a simple decoding pipeline. We leave studying our cyclic-MTL on ASC for future work.

Conclusion

We presented a novel cyclic multitask learning framework for simile recognition. Compared with conventional multitask learning, our framework can better model the dependencies among the subtasks. Extensive experiments and analysis strongly demonstrate the effectiveness of our framework.

In the future, we plan to investigate the generality of our framework on other multitask learning based NLP tasks. Besides, we will explore how to improve our framework by introducing variational networks, which have been widely used in many tasks [29, 30, 24, 25].

Acknowledgments

The authors were supported by Beijing Advanced Innovation Center for Language Resources, National Natural Science Foundation of China (No. 61672440), the Fundamental Research Funds for the Central Universities (Grant No. ZK1024), and Scientific Research Project of National Language Committee of China (Grant No. YB135-49).

Footnotes

  1. For now, we follow previous work in assuming that each instance contains one or zero similes. For a future multi-simile extension, we can simply sum all the local context vectors before Equation 3.
  2. https://github.com/ymcui/Chinese-BERT-wwm

References

  1. J. Devlin, M. Chang, K. Lee and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
  2. B. Efron and R. J. Tibshirani (1994) An introduction to the bootstrap. CRC Press.
  3. J. R. Finkel and C. D. Manning (2010) Hierarchical joint learning: improving joint parsing and named entity recognition with non-jointly labeled data. In ACL.
  4. S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation.
  5. M. Hu, Y. Peng, Z. Huang, D. Li and Y. Lv (2019) Open-domain targeted sentiment analysis via span-based extraction and classification. In ACL.
  6. P. Koehn (2004) Statistical significance tests for machine translation evaluation. In EMNLP.
  7. J. D. Lafferty, A. McCallum and F. C. N. Pereira (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In ICML.
  8. B. Li, L. Yu, M. Shi and W. Qu (2008) Computation of Chinese simile with “xiang”. Chinese Information Processing.
  9. X. Li, L. Bing, P. Li and W. Lam (2019) A unified model for opinion target extraction and target sentiment prediction. In AAAI.
  10. B. Liu (2012) Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers.
  11. L. Liu, X. Hu, W. Song, R. Fu, T. Liu and G. Hu (2018) Neural multitask learning for simile recognition. In EMNLP.
  12. P. Liu, X. Qiu and X. Huang (2016) Recurrent neural network for text classification with multi-task learning. In IJCAI.
  13. G. Luo, X. Huang, C. Lin and Z. Nie (2015) Joint entity recognition and disambiguation. In EMNLP.
  14. T. Luong, H. Pham and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In EMNLP.
  15. T. Mikolov, K. Chen, G. Corrado and J. Dean (2013) Efficient estimation of word representations in vector space. In ICLR.
  16. M. Miwa and M. Bansal (2016) End-to-end relation extraction using LSTMs on sequences and tree structures. In ACL.
  17. V. Niculae and C. Danescu-Niculescu-Mizil (2014) Brighter than gold: figurative language in user generated comparisons. In EMNLP.
  18. V. Niculae and V. Yaneva (2013) Computational considerations of comparisons and similes. In ACL Student Research Workshop.
  19. V. Niculae (2013) Comparison pattern matching and creative simile recognition. In JSSP.
  20. M. Pontiki, D. Galanis, J. Pavlopoulos, H. Papageorgiou, I. Androutsopoulos and S. Manandhar (2014) SemEval-2014 task 4: aspect based sentiment analysis. In SemEval@COLING.
  21. L. Prechelt (1998) Automatic early stopping using cross validation: quantifying the criteria. Neural Networks.
  22. L. Ratinov and D. Roth (2009) Design challenges and misconceptions in named entity recognition. In CoNLL.
  23. M. Sperber, J. Niehues, G. Neubig, S. Stüker and A. Waibel (2018) Self-attentional acoustic models. In Interspeech.
  24. J. Su, S. Wu, D. Xiong, Y. Lu, X. Han and B. Zhang (2018) Variational recurrent neural machine translation. In AAAI.
  25. J. Su, S. Wu, B. Zhang, C. Wu, Y. Qin and D. Xiong (2018) A neural generative autoencoder for bilingual word embeddings. Information Sciences.
  26. Z. Wang and Y. Zhang (2017) A neural model for joint event detection and summarization. In IJCAI.
  27. B. Yang, Z. Tu, D. F. Wong, F. Meng, L. S. Chao and T. Zhang (2018) Modeling localness for self-attention networks. In EMNLP.
  28. M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. CoRR abs/1212.5701.
  29. B. Zhang, D. Xiong, J. Su, H. Duan and M. Zhang (2016) Variational neural machine translation. In EMNLP.
  30. B. Zhang, D. Xiong, J. Su, Q. Liu, R. Ji, H. Duan and M. Zhang (2016) Variational neural discourse relation recognizer. In EMNLP.
  31. Y. Zhang and D. Weiss (2016) Stack-propagation: improved representation learning for syntax. In ACL.