Multi-Task Learning for Sequence Tagging: An Empirical Study

Soravit Changpinyo, Hexiang Hu,    Fei Sha
Department of Computer Science
University of Southern California
Los Angeles, CA 90089
{schangpi,hexiangh,feisha}@usc.edu
Abstract

We study three general multi-task learning (MTL) approaches on 11 sequence tagging tasks. Our extensive empirical results show that in about 50% of the cases, jointly learning all 11 tasks improves upon either independent or pairwise learning of the tasks. We also show that pairwise MTL can inform us what tasks can benefit others or what tasks can be benefited if they are learned jointly. In particular, we identify tasks that can always benefit others as well as tasks that can always be harmed by others. Interestingly, one of our MTL approaches yields embeddings of the tasks that reveal the natural clustering of semantic and syntactic tasks. Our inquiries have opened the doors to further utilization of MTL in NLP.


1 Introduction

This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/

Multi-task learning (MTL) has long been studied in the machine learning literature, cf. [Caruana, 1997]. The technique has also been popular in NLP, for example, in [Collobert and Weston, 2008, Collobert et al., 2011, Luong et al., 2016]. The main thesis underpinning MTL is that solving many tasks together provides a shared inductive bias that leads to more robust and generalizable systems. This is especially appealing for NLP as data for many tasks are scarce — shared learning thus reduces the amount of training data needed. MTL has been validated in recent work, mostly where auxiliary tasks are used to improve the performance on a target task, for example, in sequence tagging [Søgaard and Goldberg, 2016, Bjerva et al., 2016, Plank et al., 2016, Alonso and Plank, 2017, Bingel and Søgaard, 2017].

Despite those successful applications, several key issues about the effectiveness of MTL remain open. Firstly, with only a few exceptions, most existing work focuses on "pairwise" MTL, where there is a target task and one or several (carefully) selected auxiliary tasks. However, can jointly learning many tasks benefit all of them together? A positive answer would significantly raise the utility of MTL. Secondly, how are tasks related such that one could benefit another? For instance, one plausible intuition is that syntactic and semantic tasks might each benefit tasks within their own group, while cross-group assistance is weak or unlikely. However, such notions have not been tested thoroughly on a significant number of tasks.

In this paper, we address such questions. We investigate jointly learning multiple sequence tagging tasks. Besides using independent single-task learning as a baseline and a popular shared-encoder MTL framework for sequence tagging [Collobert et al., 2011], we propose two variants of MTL in which both the encoder and the decoder can be shared by all tasks.

We conduct extensive empirical studies on 11 sequence tagging tasks; we defer the discussion of why we select these tasks to a later section. We demonstrate that there is a benefit to moving beyond "pairwise" MTL. We also obtain interesting pairwise relationships that reveal which tasks are beneficial or harmful to others, and which tasks are likely to be benefited or harmed. We find such information to be correlated with the results of MTL using more than two tasks. We also study selecting only beneficial tasks for joint training, showing that such a "greedy" approach generally improves MTL performance and highlighting the need to identify which tasks to learn jointly.

The rest of the paper is organized as follows. We describe different approaches for learning from multiple tasks in Sect. 2. We describe our experimental setup and results in Sect. 3 and Sect. 4, respectively. We discuss related work in Sect. 5. Finally, we conclude with discussion and future work in Sect. 6.

2 Multi-Task Learning for Sequence Tagging

In this section, we describe general approaches to multi-task learning (MTL) for sequence tagging. We select sequence tagging tasks for several reasons. Firstly, we want to concentrate on comparing the tasks themselves without being confounded by the design of specialized MTL methods for solving complicated tasks. Sequence tagging tasks are predicted at the word level, allowing us to focus on simpler models while still enabling varying degrees of sharing among tasks. Secondly, those tasks are often the first steps in NLP pipelines and come with extremely diverse resources. Understanding the nature of the relationships between them is likely to have a broad impact on many downstream applications.

Let $T$ be the number of tasks and $\mathcal{D}_t$ be the training data of task $t$. A dataset for each task consists of input-output pairs. In sequence tagging, each pair corresponds to a sequence of words $\boldsymbol{x} = \langle x_1, \ldots, x_L \rangle$ and their corresponding ground-truth tags $\boldsymbol{y} = \langle y_1, \ldots, y_L \rangle$, where $L$ is the sequence length. We note that our definition of "task" is not the same as "domain" or "dataset." In particular, we differentiate between tasks based on whether or not they share the label space of tags. For instance, part-of-speech tagging on the weblog domain and that on the email domain are considered the same task in this paper.

Given the training data $\{\mathcal{D}_t\}_{t=1}^{T}$, we describe how to learn one or more models to perform all the tasks. In general, our models follow the design of state-of-the-art sequence taggers [Reimers and Gurevych, 2017]. They have an encoder $E$ with parameters $\theta_E$ that encodes a sequence of word tokens $\boldsymbol{x}$ into a sequence of vectors $\boldsymbol{h}$, and a decoder $D$ with parameters $\theta_D$ that decodes the sequence of vectors into a sequence of predicted tags $\hat{\boldsymbol{y}}$. That is, $\boldsymbol{h} = E(\boldsymbol{x}; \theta_E)$ and $\hat{\boldsymbol{y}} = D(\boldsymbol{h}; \theta_D)$. The model parameters are learned by minimizing a loss function over the predicted tags $\hat{\boldsymbol{y}}$ and the ground-truth tags $\boldsymbol{y}$. In what follows, we use superscripts to differentiate instances from different tasks.
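For concreteness, a minimal PyTorch-style sketch of this encoder-decoder abstraction follows. The module names, default dimensions, and the plain linear decoder are illustrative only; our actual taggers add a character-level encoder and a CRF layer (Sect. 3.3).

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """E(x; theta_E): maps a sequence of word tokens to a sequence of vectors h."""
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.output_dim = 2 * hidden_dim

    def forward(self, tokens):              # tokens: (batch, seq_len) word indices
        h, _ = self.rnn(self.embed(tokens))
        return h                            # (batch, seq_len, output_dim)

class Decoder(nn.Module):
    """D(h; theta_D): maps encoder outputs to per-token tag scores."""
    def __init__(self, input_dim, num_tags):
        super().__init__()
        self.proj = nn.Linear(input_dim, num_tags)

    def forward(self, h):
        return self.proj(h)                 # (batch, seq_len, num_tags)
```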

Figure 1: Different settings for learning from multiple tasks considered in our experiments: single-task learning, MTL (Multi-Dec), MTL (TEDec), and MTL (TEEnc)

In single-task learning (STL), we learn $T$ models independently. For each task $t$, we have an encoder $E^t$ and a decoder $D^t$. Clearly, information is not shared between tasks in this case.

In multi-task learning (MTL), we consider two or more tasks and train an MTL model jointly over a combined loss that aggregates the per-task losses. In this paper, we consider the following general frameworks, which differ in how parameters are shared across tasks.

Multi-task learning with multiple decoders (Multi-Dec) We learn a shared encoder $E$ and task-specific decoders $D^1, \ldots, D^T$. This setting has been explored for sequence tagging in [Collobert and Weston, 2008, Collobert et al., 2011]. In the context of sequence-to-sequence learning [Sutskever et al., 2014], this is similar to the "one-to-many" MTL setting in [Luong et al., 2016].
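A sketch of the Multi-Dec sharing scheme, reusing the Encoder and Decoder modules from the sketch above; the per-token cross-entropy loss is a simplified stand-in for the CRF loss our taggers actually use, and padding is ignored for brevity.

```python
import torch.nn as nn

class MultiDecTagger(nn.Module):
    """Shared encoder, one decoder per task."""
    def __init__(self, vocab_size, tagset_sizes):        # tagset_sizes: {task: num_tags}
        super().__init__()
        self.encoder = Encoder(vocab_size)                # shared across tasks
        self.decoders = nn.ModuleDict({
            task: Decoder(self.encoder.output_dim, n)
            for task, n in tagset_sizes.items()})         # task-specific

    def forward(self, tokens, task):
        return self.decoders[task](self.encoder(tokens))

def combined_loss(model, task_batches):
    """Sum of per-task losses; task_batches maps task -> (tokens, gold_tags)."""
    xent = nn.CrossEntropyLoss()
    total = 0.0
    for task, (tokens, tags) in task_batches.items():
        scores = model(tokens, task)                      # (B, L, num_tags)
        total = total + xent(scores.flatten(0, 1), tags.flatten())
    return total
```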

Multi-task learning with task embeddings (TE) We learn a shared encoder $E$ for the input sentence as well as a shared decoder $D$. To equip our model with the ability to perform one-to-many mapping (i.e., multiple tasks), we augment the model with "task embeddings." Specifically, we additionally learn a function $f$ that maps a task ID to a vector. We explore two ways of injecting task embeddings into our models. In both cases, $f$ is simply an embedding layer that maps the task ID to a dense vector.

One approach, denoted by TEDec, is to incorporate task embeddings into the decoder. We concatenate the task embeddings with the encoder’s outputs and then feed the result to the decoder.

The other approach, denoted by TEEnc, is to combine the task embeddings with the input sequence of words at the encoder. We implement this by prepending a task token (<<upos>>, <<chunk>>, <<mwe>>, etc.) to the input sequence and treating it as a word token [Johnson et al., 2017].

While the encoder in TEDec must learn to encode a general-purpose representation of the input sentence, the encoder in TEEnc knows from the start which task it is going to perform.
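The two task-embedding variants can be sketched as follows (again reusing Encoder and Decoder from the earlier sketch). This is a simplification: it uses a dedicated embedding layer for the task vector rather than the token encoder mentioned in Sect. 3.3, treats the task ID directly as an index into the word vocabulary for TEEnc, and assumes a single decoder over the union of all tagsets.

```python
import torch
import torch.nn as nn

class TETagger(nn.Module):
    """Shared encoder AND shared decoder, conditioned on a task embedding."""
    def __init__(self, vocab_size, num_union_tags, num_tasks,
                 task_emb_dim=25, variant="TEDec"):
        super().__init__()
        self.variant = variant
        self.encoder = Encoder(vocab_size)
        self.task_embed = nn.Embedding(num_tasks, task_emb_dim)
        dec_in = self.encoder.output_dim + (task_emb_dim if variant == "TEDec" else 0)
        self.decoder = Decoder(dec_in, num_union_tags)

    def forward(self, tokens, task_id):
        if self.variant == "TEEnc":
            # Prepend a task token (assumed to live in the word vocabulary),
            # then drop its encoding so outputs stay aligned with the words.
            task_tok = torch.full((tokens.size(0), 1), task_id, dtype=tokens.dtype)
            h = self.encoder(torch.cat([task_tok, tokens], dim=1))[:, 1:]
        else:
            # TEDec: concatenate the task embedding with every encoder output.
            h = self.encoder(tokens)
            t = self.task_embed(torch.tensor([task_id]))
            h = torch.cat([h, t.expand(h.size(0), h.size(1), -1)], dim=-1)
        return self.decoder(h)
```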

Fig. 1 illustrates different settings described above. Clearly, the number of model parameters is reduced significantly when we move from STL to MTL. Which MTL model is more economical depends on several factors, including the number of tasks, the dimension of the encoder output, the general architecture of the decoder, the dimension of task embeddings, how to augment the system with task embeddings, and the degree of tagset overlap.

3 Experimental Setup

3.1 Datasets and Tasks

Dataset # sentences Token/type Task # labels Label entropy
Universal Dependencies v1.4 12543/16622 12.3/13.2 upos 17 2.5
Universal Dependencies v1.4 12543/16622 12.3/13.2 xpos 50 3.1
CoNLL-2000 8936/10948 12.3/13.3 chunk 42 2.3
CoNLL-2003 14041/20744 9.7/11.2 ner 17 0.9
Streusle 4.0 2723/3812 8.6/9.3 mwe 3 0.5
Streusle 4.0 2723/3812 8.6/9.3 supsense 212 2.2
SemCor 13851/20092 13.2/16.2 sem 75 2.2
SemCor 13851/20092 13.2/16.2 semtr 11 1.3
Broadcast News 1 880/1370 5.2/6.1 com 2 0.6
FrameNet 1.5 3711/5711 8.6/9.1 frame 2 0.5
Hyper-Text Corpus 2000/3974 6.7/9.0 hyp 2 0.4
Table 1: Datasets used in our experiments, their key characteristics, and their corresponding tasks. The numbers before and after "/" are statistics for the training data only and for all subsets of the data, respectively.

Table 1 summarizes the datasets used in our experiments, along with their corresponding tasks and important statistics. Table 2 shows an example of each task’s input-output pairs. We describe details below. For all tasks, we use the standard splits unless specified otherwise.

We perform universal and English-specific POS tagging (upos and xpos) on sentences from the English Web Treebank [Bies et al., 2012], annotated by the Universal Dependencies project [Nivre et al., 2016]. We perform syntactic chunking (chunk) on sentences from the WSJ portion of the Penn Treebank [Marcus et al., 1993], annotated by the CoNLL-2000 shared task [Tjong Kim Sang and Buchholz, 2000]. We use sections 15-18 for training. The shared task uses section 20 for testing and does not designate a development set, so we use the first 1001 sentences of section 20 for development and the remaining 1011 for testing. We perform named entity recognition (ner) on sentences from the Reuters Corpus [Lewis et al., 2004], consisting of news stories from August 1996 to August 1997, annotated by the CoNLL-2003 shared task [Tjong Kim Sang and De Meulder, 2003]. For both chunk and ner, we use the IOBES tagging scheme.

We perform multi-word expression identification (mwe) and supersense tagging (supsense) on sentences from the reviews section of the English Web Treebank, annotated under the Streusle project [Schneider and Smith, 2015] (https://github.com/nert-gu/streusle). We perform supersense (sem) and semantic trait (semtr) tagging on SemCor's sentences [Landes et al., 1998], taken from a subset of the Brown Corpus [Francis and Kučera, 1982], using the splits provided by [Alonso and Plank, 2017] for both tasks (https://github.com/bplank/multitasksemantics). For sem, the sentences are annotated with supersense tags [Miller et al., 1993] by [Ciaramita and Altun, 2006]; we consider supsense and sem as different tasks, as they use different sets of supersense tags. For semtr, [Alonso and Plank, 2017] use the EuroWordNet list of ontological types for senses [Vossen et al., 1998] to convert supersenses into coarser semantic traits.

For sentence compression (com), we identify which words to keep in a compressed version of sentences from the 1996 English Broadcast News Speech corpus (HUB4) [Graff, 1997]; the compressions were created by [Clarke and Lapata, 2006] (http://jamesclarke.net/research/resources/). We use the labels from the first annotator. For frame target identification (frame), we detect words that evoke frames [Das et al., 2014] in sentences from the British National Corpus, annotated under the FrameNet project [Baker et al., 1998]. For both com and frame, we use the splits provided by [Bingel and Søgaard, 2017]. For hyper-link detection (hyp), we identify which words in the sequence are marked with hyperlinks in text from Daniel Pipes' news-style blog, collected by [Spitkovsky et al., 2010] (https://nlp.stanford.edu/valentin/pubs/markup-data.tar.bz2). We use the "select" subset, which corresponds to marked, complete sentences.

Task Input/Output
upos once again , thank you all for an outstanding accomplishment .
ADV ADV PUNCT VERB PRON DET ADP DET ADJ NOUN PUNCT
xpos once again , thank you all for an outstanding accomplishment .
RB RB , VBP PRP DT IN DT JJ NN .
chunk the carrier also seemed eager to place blame on its american counterparts .
B-NP E-NP S-ADVP S-VP S-ADJP B-VP E-VP S-NP S-PP B-NP I-NP E-NP O
ner 6. pier francesco chili ( italy ) ducati 17541
O B-PER I-PER E-PER O S-LOC O S-ORG O
mwe had to keep in mind that the a / c broke , i feel bad it was their opening !
B I B I I O O B I I O O O O O O O O O O
supsense this place may have been something sometime ; but it way past it " sell by date " .
O n.GROUP O O v.stative O O O O O O p.Time p.Gestalt O v.possession p.Time n.TIME O O
sem a hypothetical example will illustrate this point .
O adj.all noun.cognition O verb.communication O noun.communication O
semtr he wondered if the audience would let him finish .
O Mental O O Object O Agentive O BoundedEvent O
com he made the decisions in 1995 , in early 1996 , to spend at a very high rate .
KEEP KEEP DEL KEEP DEL DEL DEL DEL DEL DEL DEL KEEP KEEP KEEP KEEP DEL KEEP KEEP KEEP
frame please continue our important partnership .
O B-TARGET O B-TARGET O O
hyp will this incident lead to a further separation of civilizations ?
O O O O O O O B-HTML B-HTML B-HTML O
Table 2: Examples of input-output pairs of the tasks in consideration

3.2 Metrics and Score Comparison

We use the span-based micro-averaged F1 score (without the O tag) for all tasks. We run each configuration three times with different initializations and compute the mean and standard deviation of the scores. To compare two scores, we use the following strategy. Let $m_1, s_1$ and $m_2, s_2$ be two sets of scores (mean and standard deviation, respectively). We say that score 1 is "higher" than score 2 if $m_1 - \rho s_1 > m_2 + \rho s_2$, where $\rho$ is a parameter that controls how strict we want the definition to be. "Lower" is defined in the same manner, with $>$ changed to $<$ and $+$ switched with $-$. $\rho$ is set to 1.5 in all of our experiments.
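A small sketch of this comparison rule (the helper names are illustrative, not from our released code):

```python
def is_higher(m1, s1, m2, s2, rho=1.5):
    """Score 1 is 'higher' than score 2 if the rho-scaled intervals do not overlap."""
    return m1 - rho * s1 > m2 + rho * s2

def is_lower(m1, s1, m2, s2, rho=1.5):
    return m1 + rho * s1 < m2 - rho * s2

# Example: pairwise MTL (+upos) vs. STL when testing on mwe (Multi-Dec scores
# from Appendix C): 59.99 +/- 0.36 vs. 53.07 +/- 0.12.
print(is_higher(59.99, 0.36, 53.07, 0.12))   # True, so upos benefits mwe
```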

3.3 Models

General architectures

We use bidirectional recurrent neural networks (biRNNs) as our encoders for both words and characters [Irsoy and Cardie, 2014, Huang et al., 2015, Lample et al., 2016, Ma and Hovy, 2016]. Our word/character sequence encoders and decoder classifiers are common in the literature and most similar to [Lample et al., 2016], except that we use two-layer biRNNs (instead of one layer) with Gated Recurrent Units (GRUs) [Cho et al., 2014] (instead of LSTMs [Hochreiter and Schmidhuber, 1997]).

Each word is represented by a 100-dimensional vector that is the concatenation of a 50-dimensional embedding vector and the 50-dimensional output of a character biRNN (whose hidden representation dimension is 25 in each direction). We feed a sequence of those 100-dimensional representations to a word biRNN, whose hidden representation dimension is 300 in each direction, resulting in a sequence of 600-dimensional vectors. In TEDec, the token encoder is also used to encode a task token (which is then concatenated to the encoder’s output), where each task is represented as a 25-dimensional vector. For decoder/classifiers, we predict a sequence of tags using a linear projection layer (to the tagset size) followed by a conditional random field (CRF) [Lafferty et al., 2001].
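A rough sketch of these dimensions is shown below (encoder only; the linear projection and CRF layers are omitted). The character embedding size and the variable names are our own illustrative choices.

```python
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    """50-d word embedding + 50-d character biRNN summary -> 2-layer word biGRU."""
    def __init__(self, word_vocab_size, char_vocab_size, char_emb_dim=25):
        super().__init__()
        self.word_embed = nn.Embedding(word_vocab_size, 50)
        self.char_embed = nn.Embedding(char_vocab_size, char_emb_dim)  # dim is an assumption
        self.char_rnn = nn.GRU(char_emb_dim, 25, bidirectional=True, batch_first=True)
        self.word_rnn = nn.GRU(100, 300, num_layers=2,
                               bidirectional=True, batch_first=True)

    def forward(self, words, chars):
        # words: (B, L) word indices; chars: (B, L, C) character indices per word
        B, L, C = chars.shape
        _, h_last = self.char_rnn(self.char_embed(chars.reshape(B * L, C)))
        char_repr = h_last.transpose(0, 1).reshape(B, L, 50)          # 25 per direction
        x = torch.cat([self.word_embed(words), char_repr], dim=-1)    # 100-d per word
        h, _ = self.word_rnn(x)
        return h                                                      # 600-d per word
```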

Implementation and training details

We implement our models in PyTorch [Paszke et al., 2017] on top of the AllenNLP library [Gardner et al., 2018]. Code is to be available at https://github.com/schangpi/.

Words are lower-cased, but characters are not. Word embeddings are initialized with GloVe [Pennington et al., 2014] trained on Wikipedia 2014 and Gigaword 5. We use the strategies suggested by [Ma and Hovy, 2016] for initializing the other parameters in our networks. Character embeddings are initialized uniformly in $[-\sqrt{3/d}, +\sqrt{3/d}]$, where $d$ is the dimension of the embeddings. Weight matrices are initialized with Xavier uniform [Glorot and Bengio, 2010], i.e., uniformly in $[-\sqrt{6/(r+c)}, +\sqrt{6/(r+c)}]$, where $r$ and $c$ are the numbers of rows and columns of the matrix. Bias vectors are initialized with zeros.
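A sketch of this initialization scheme, assuming the parameter names from the encoder sketches above:

```python
import math
import torch.nn as nn

def initialize(model: nn.Module, char_emb_dim: int = 25):
    """Initialization following Ma and Hovy (2016); parameter names follow the sketches above."""
    for name, p in model.named_parameters():
        if "word_embed" in name:
            continue                                   # word embeddings are loaded from GloVe
        if "char_embed" in name:                       # uniform in [-sqrt(3/d), +sqrt(3/d)]
            bound = math.sqrt(3.0 / char_emb_dim)
            nn.init.uniform_(p, -bound, bound)
        elif "bias" in name:
            nn.init.zeros_(p)                          # bias vectors start at zero
        elif p.dim() >= 2:
            nn.init.xavier_uniform_(p)                 # uniform in +/- sqrt(6/(rows+cols))
```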

We use Adam [Kingma and Ba, 2015] with default hyperparameters and a mini-batch size of 32. The dropout rate is 0.25 for the character encoder and 0.5 for the word encoder. We use gradient normalization [Pascanu et al., 2013] with a threshold of 5. We halve the learning rate if the validation performance does not improve for two epochs, and stop training if the validation performance does not improve for 10 epochs. We use L2 regularization with parameter 0.01 for the transition matrix of the CRF.
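A sketch of this optimization schedule; train_one_epoch and validate are placeholders for the usual epoch and evaluation loops, and the cap on the number of epochs is arbitrary rather than a value from our experiments.

```python
import torch

def fit(model, train_one_epoch, validate, max_epochs=200):
    """train_one_epoch and validate are caller-supplied placeholders."""
    optimizer = torch.optim.Adam(model.parameters())             # default hyperparameters
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", factor=0.5, patience=2)           # halve LR after 2 stale epochs
    best_f1, stale = 0.0, 0
    for _ in range(max_epochs):
        # inside train_one_epoch: clip gradient norms at 5, e.g.
        # torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
        train_one_epoch(model, optimizer)
        val_f1 = validate(model)             # average F1 over tasks (cf. Sect. 3.3)
        scheduler.step(val_f1)
        if val_f1 > best_f1:
            best_f1, stale = val_f1, 0
        else:
            stale += 1
            if stale >= 10:                  # early stopping after 10 stale epochs
                break
    return best_f1
```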

For the training of MTL models, we make sure that each mini-batch is balanced; the difference in numbers of examples from any pair of tasks is no more than 1. As a result, each epoch may not go through all examples of some tasks whose training set sizes are large. In a similar manner, during validation, the average F1 score is over all tasks rather than over all validation examples.
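A sketch of this balanced sampling; an epoch ends as soon as the smallest task runs out of examples, which is why larger tasks may not be fully covered.

```python
import random

def balanced_batches(task_examples, batch_size=32):
    """task_examples: dict mapping task name -> list of training examples."""
    tasks = list(task_examples)
    iters = {t: iter(random.sample(xs, len(xs))) for t, xs in task_examples.items()}
    per_task, extra = divmod(batch_size, len(tasks))
    while True:
        batch = []
        for i, t in enumerate(tasks):
            take = per_task + (1 if i < extra else 0)   # sizes differ by at most 1
            for _ in range(take):
                try:
                    batch.append((t, next(iters[t])))
                except StopIteration:                    # smallest task exhausted
                    return                               # end of epoch
        yield batch
```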

3.4 Various Settings for Learning from Multiple Tasks

We consider the following settings: (i) "STL", where we train each model on one task alone; (ii) "Pairwise MTL", where we train on two tasks jointly; (iii) "All MTL", where we train on all tasks jointly; (iv) "Oracle MTL", where we train on the Oracle set of the testing task jointly with the testing task; (v) "All-but-one MTL", where we train on all tasks jointly except for one (used in Sect. 4.4).

Constructing the Oracle Set of a Testing Task

The Oracle set of a task is constructed from the pairwise performances. Let $m^t_S$ and $s^t_S$ be the mean F1 score and the standard deviation of a model that is jointly trained on a set of tasks $S$ and tested on task $t$. A task $t'$ is considered "beneficial" to another (testing) task $t$ if $(m^t_{\{t,t'\}}, s^t_{\{t,t'\}})$ is "higher" than $(m^t_{\{t\}}, s^t_{\{t\}})$ (cf. Sect. 3.2). Then, the "Oracle" set for a task is the set of all of its beneficial (single) tasks. Throughout our experiments, $m^t_S$ and $s^t_S$ are computed over three rounds (cf. Sect. 3.2; standard deviations can be found in Appendix C).
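As a sketch (reusing is_higher from the snippet in Sect. 3.2, with placeholder data structures):

```python
def oracle_set(target, all_tasks, pair_scores, stl_scores, rho=1.5):
    """pair_scores[(target, t)] and stl_scores[target] hold (mean F1, std) when testing on target."""
    m_stl, s_stl = stl_scores[target]
    return {t for t in all_tasks
            if t != target and is_higher(*pair_scores[(target, t)], m_stl, s_stl, rho)}
```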

4 Results and Analysis

4.1 Main Results

Figure 2: Summary of our results for the MTL methods Multi-Dec (left), TEDec (middle), and TEEnc (right) on various settings with different types of sharing. The vertical axis is the relative improvement over STL. See text for details. Best viewed in color.

Fig. 2 summarizes our main findings. We compare the relative improvement over single-task learning (STL) across the settings with different types of sharing described in Sect. 3.4. Scores from the pairwise setting ("+One Task") are represented as a vertical bar, delineating the maximum and minimum improvement over STL obtained by jointly learning a task with one of the remaining 10 tasks. The "All" setting (red triangles) indicates jointly learning all 11 tasks. The "Oracle" setting (blue rectangles) indicates joint learning with the subset of the 11 tasks deemed beneficial based on the corresponding pairwise MTL performances, as defined in Sect. 3.4.

We observe that (1) [STL vs. Pairwise/All] Neither pairwise MTL nor All always improves upon STL; (2) [STL vs. Oracle] Oracle in general outperforms or at least does not worsen STL; (3) [All/Oracle vs. Pairwise] All does better than Pairwise on about half of the cases, while Oracle almost always does better than Pairwise; (4) [All vs. Oracle] Consider when both All and Oracle improve upon STL. For Multi-Dec and TEEnc, Oracle generally dominates All, except on the task hyp. For TEDec, their magnitudes of improvement are mostly comparable, except on semtr (Oracle is better) and on hyp (All is better). In addition, All is better than Oracle on the task com, in which case Oracle is STL.

In Appendix A, we compare different MTL approaches: Multi-Dec, TEDec, and TEEnc. There is no significant difference among them.

4.2 Pairwise MTL results

Figure 3: Pairwise MTL relationships (benefit vs. harm) using Multi-Dec (left), TEDec (middle), and TEEnc (right). A solid green (red) directed edge from one task to another denotes that the first benefits (harms) the second. Dashed green (red) edges between two tasks denote that they benefit (harm) each other. Dotted edges denote an asymmetric relationship: benefit in one direction but harm in the reverse direction. The absence of an edge denotes a neutral relationship. Best viewed in color and with a zoom-in.

Summary

The summary plot in Fig. 3 gives a bird's-eye view of the patterns in which a task might benefit or harm another one. For example, mwe is always benefited when jointly learned with any of the other 10 tasks, as its incoming edges are green; so is semtr in most cases. On the other end, com seems to harm most of the other 10 tasks, as its outgoing edges are almost always red. chunk and u/xpos generally benefit others (or at least do no harm), as most of their outgoing edges are green.

In Tables 3-5, we report F1 scores for Multi-Dec, TEDec, and TEEnc, respectively. In each table, rows denote settings in which we train our models, and columns correspond to tasks we test them on. We also include the "Average" of all pairwise scores, as well as the number of positive (#+) and negative (#-) relationships in each row or each column.

Which tasks are benefited or harmed by others in pairwise MTL?

mwe, supsense, semtr, and hyp are generally benefited by other tasks. The improvement is more significant in mwe and hyp. upos, xpos, ner, com, and frame (Multi-Dec and TEDec) are often hurt by other tasks. Finally, the results are mixed for chunk and sem.

Which tasks are beneficial or harmful?

upos, xpos, and chunk are universal helpers, beneficial in 16, 17, and 14 cases, while harmful only in 1, 3, and 0 cases, respectively. Interestingly, chunk never hurts any task, while both upos and xpos can be harmful to ner. While chunk is considered more of a syntactic task, the fact that it informs other tasks about the boundaries of phrases may aid the learning of other semantic tasks (task embeddings in Sect. 4.4 seem to support this hypothesis).

On the other hand, com, frame, and hyp are generally harmful: they are useful in 0 cases and cause performance drops in 22, 10, and 12 cases, respectively. One factor that may play a role is the training set sizes of these tasks. However, we note that both mwe and supsense (Streusle dataset) have smaller training sets than frame does, yet those tasks can still benefit some tasks. (On the other hand, ner has the largest training set but infrequently benefits other tasks, less frequently than supsense does.) Another potential cause is the fact that all those harmful tasks have the smallest label set size of 2. This, combined with small dataset sizes, leads to a higher chance of overfitting. Finally, it may be possible that the harmful tasks are simply unrelated to the others; for example, the nature of com, frame, or hyp may be very different from that of the other tasks, requiring an entirely different kind of reasoning.

Finally, ner, mwe, sem, semtr, and supsense can be beneficial or harmful, depending on which other tasks they are trained with.

upos xpos chunk ner mwe sem semtr supsense com frame hyp #+ #-
+upos 95.4 95.01 94.18 87.68 59.99 73.23 74.93 68.25 72.46 62.14 48.02 5 0
+xpos 95.38 95.04 93.97 87.61 58.87 73.34 74.91 67.78 72.83 60.77 48.81 6 1
+chunk 95.43 95.1 93.49 87.96 59.18 73.16 74.79 67.39 72.44 62.67 47.85 5 0
+ner 95.38 94.98 93.47 88.24 55.4 72.88 74.34 68.06 70.93 62.39 47.9 3 0
+mwe 95.15 94.7 93.54 88.15 53.07 72.75 74.51 66.88 71.31 61.75 47.32 1 2
+sem 95.23 94.77 93.63 87.35 60.16 72.77 74.73 68.29 72.72 61.74 48.15 5 2
+semtr 95.17 94.86 93.61 87.34 58.84 72.5 74.02 68.6 71.96 62.03 47.74 2 3
+supsense 95.08 94.75 93.2 87.9 58.81 72.81 74.61 66.81 72.24 61.94 49.23 3 1
+com 93.04 93.19 91.94 86.62 53.89 70.39 72.6 65.57 72.71 56.52 47.41 0 7
+frame 94.98 94.64 93.22 88.15 53.88 72.76 74.18 66.59 72.47 62.04 47.5 0 3
+hyp 94.84 94.46 92.96 87.98 53.08 72.47 74.23 66.47 71.82 61.02 46.73 0 4
#+ 0 0 3 0 7 3 7 6 0 0 4
#- 5 6 3 4 0 3 0 1 0 1 0
Average 94.97 94.65 93.37 87.67 57.21 72.63 74.38 67.39 72.12 61.3 47.99
All 95.04 94.31 93.44 86.38 61.43 71.53 74.26 68.1 74.54 59.71 51.41 4 4
Oracle 95.4 95.04 94.01 88.24 62.76 73.32 75.23 68.53 72.71 62.04 50.0 6 0
Table 3: F1 scores for Multi-Dec. We compare the STL setting (blue) with pairwise MTL (+task), All, and Oracle. Each column corresponds to the task tested on. Beneficial settings are in green; harmful settings are in red. The last two columns (#+/#-) indicate how many tasks are helped or harmed by the task added in that row.
upos xpos chunk ner mwe sem semtr supsense com frame hyp #+ #-
+upos 95.4 94.99 94.02 87.99 60.28 73.17 74.87 67.8 72.86 61.54 49.36 6 0
+xpos 95.4 95.04 94.18 87.65 60.32 73.21 74.84 68.3 72.87 61.44 49.23 6 1
+chunk 95.57 95.21 93.49 88.11 57.61 73.02 74.73 67.29 73.3 61.39 48.43 6 0
+ner 95.32 95.09 93.64 88.24 55.17 72.77 74.01 67.25 71.08 59.25 48.24 2 2
+mwe 95.11 94.8 93.59 87.99 53.07 72.66 74.63 66.88 70.93 56.77 45.83 1 3
+sem 95.2 94.82 93.45 87.27 58.21 72.77 74.72 68.46 73.14 60.09 47.95 3 3
+semtr 95.21 94.8 93.47 87.75 58.55 72.5 74.02 68.18 71.74 59.77 46.96 2 3
+supsense 95.05 94.81 93.25 87.94 58.75 72.71 74.52 66.81 69.13 55.68 47.29 2 4
+com 94.03 93.94 92.29 86.59 51.72 70.37 71.76 64.98 72.71 55.25 45.24 0 8
+frame 94.79 94.66 93.23 88.02 53.05 72.26 74.21 66.2 72.89 62.04 46.0 0 5
+hyp 94.35 94.56 92.86 87.91 52.98 72.15 74.19 66.52 70.47 55.35 46.73 0 5
#+ 1 1 3 0 7 3 6 4 0 0 3
#- 7 6 3 3 0 4 1 2 3 5 0
Average 95.0 94.77 93.4 87.72 56.67 72.48 74.25 67.19 71.84 58.65 47.45
All 94.95 94.42 93.64 86.8 61.97 71.72 74.36 67.98 74.61 58.14 51.31 5 5
Oracle 95.57 95.21 94.07 88.24 61.74 73.1 75.24 68.22 72.71 62.04 50.15 8 0
Table 4: F1 scores for TEDec. We compare the STL setting (blue) with pairwise MTL (+task), All, and Oracle. Each column corresponds to the task tested on. Beneficial settings are in green; harmful settings are in red. The last two columns (#+/#-) indicate how many tasks are helped or harmed by the task added in that row.
upos xpos chunk ner mwe sem semtr supsense com frame hyp #+ #-
+upos 95.4 94.94 94.0 87.43 57.61 73.11 74.85 67.76 72.09 62.27 48.27 5 1
+xpos 95.42 95.04 93.98 87.71 58.26 73.04 74.66 67.77 72.41 61.62 48.06 5 1
+chunk 95.4 95.1 93.49 88.07 58.06 73.13 74.77 67.36 72.88 62.98 47.13 3 0
+ner 95.29 95.05 93.54 88.24 53.4 72.91 74.04 67.57 70.78 63.02 48.64 1 1
+mwe 95.05 94.66 93.33 88.02 53.07 72.83 74.66 66.26 71.36 60.61 46.71 1 2
+sem 95.27 94.93 93.52 87.49 58.62 72.77 74.41 68.1 72.25 62.17 47.12 3 1
+semtr 95.23 94.97 93.45 87.29 58.31 72.17 74.02 67.64 72.15 62.79 46.1 1 2
+supsense 95.27 95.0 93.13 87.92 58.05 73.09 74.94 66.81 72.12 61.96 47.24 3 1
+com 93.6 93.12 91.86 86.75 51.71 70.18 71.35 65.55 72.71 57.65 47.81 0 7
+frame 95.0 94.55 93.29 87.99 53.3 72.49 74.63 66.75 72.1 62.04 46.66 1 2
+hyp 94.43 94.26 93.13 87.82 52.59 71.95 74.14 66.16 72.79 61.14 46.73 0 3
#+ 0 0 2 0 6 3 7 4 0 0 1
#- 4 4 3 5 0 2 1 1 1 0 0
Average 95.0 94.66 93.32 87.65 55.99 72.49 74.24 67.09 72.09 61.62 47.37
All 94.94 94.3 93.7 86.01 59.57 71.58 74.35 68.02 74.61 61.83 49.5 5 4
Oracle 95.4 95.04 93.93 88.24 61.92 73.14 75.09 69.04 72.71 62.04 48.06 6 0
Table 5: F1 scores for TEEnc. We compare the STL setting (blue) with pairwise MTL (+task), All, and Oracle. Each column corresponds to the task tested on. Beneficial settings are in green; harmful settings are in red. The last two columns (#+/#-) indicate how many tasks are helped or harmed by the task added in that row.

4.3 All MTL Results

In addition to the pairwise MTL results, we report performances in the All and Oracle MTL settings in the last two rows of Tables 3-5. We find that these performances depend largely on the trends in the corresponding pairwise MTL results. We provide examples and discussion of such observations below.

upos xpos chunk ner mwe sem semtr supsense com frame hyp #+ #-
All 95.04 94.31 93.44 86.38 61.43 71.53 74.26 68.1 74.54 59.71 51.41
All - upos 94.03 93.59 86.03 61.28 70.87 73.54 68.27 74.42 58.47 51.13 0 0
All - xpos 94.57 93.57 86.04 61.91 71.12 74.03 67.99 74.36 60.16 51.65 0 1
All - chunk 94.84 94.46 86.05 61.01 71.07 73.97 68.26 74.2 60.01 50.27 0 1
All - ner 94.81 94.3 93.59 62.69 70.82 73.51 68.16 74.08 59.17 50.86 0 2
All - mwe 94.93 94.45 93.71 86.21 71.01 73.61 68.18 74.7 59.23 50.83 0 2
All - sem 94.82 94.34 93.63 85.81 61.17 71.97 67.36 74.31 58.73 50.93 0 1
All - semtr 94.83 94.35 93.58 86.11 63.04 69.72 68.17 74.2 59.49 51.27 0 1
All - supsense 94.97 94.54 93.67 86.43 60.51 71.22 73.86 74.24 59.23 50.86 0 1
All - com 95.19 94.69 93.67 86.6 61.95 72.38 74.75 68.67 62.37 50.28 5 0
All - frame 95.15 94.57 93.7 85.9 62.62 71.48 74.24 68.47 75.03 50.89 0 0
All - hyp 94.93 94.53 93.78 86.31 62.04 71.22 74.02 68.46 74.62 59.69 1 0
#+ 1 1 1 0 0 1 1 0 0 1 0
#- 4 0 0 0 0 1 4 0 0 0 0
Table 6: F1 scores for Multi-Dec. We compare All with the All-but-one settings (All - task). Each column corresponds to the task tested on. Beneficial settings are in green; harmful settings are in red.

How much is STL vs. Pairwise MTL predictive of STL vs. All MTL?

We find that the performance of pairwise MTL is predictive of the performance of All MTL to some degree. Below we discuss the results in more detail. Note that we would like pairwise MTL to be predictive of both the direction and the magnitude of the change (whether, and by how much, the scores improve or degrade over the baseline).

When pairwise MTL improves upon STL even slightly, All improves upon STL in all cases (mwe, semtr, supsense, and hyp). This is despite the fact that jointly learning some pairs of tasks leads to performance degradation (com and frame in the case of supsense, and com in the case of semtr). Furthermore, when pairwise MTL leads to improvement in all cases (all pairwise rows in mwe and hyp), All MTL achieves even better performance, suggesting that tasks are beneficial in a complementary manner and that there is an advantage to MTL beyond two tasks.

The converse mostly holds as well. When pairwise MTL does not improve upon STL, most of the time All MTL does not improve upon STL either, with one exception: com. Specifically, the pairwise MTL performances of upos, xpos, ner, and frame (TEDec) are mostly negative, and so are their All MTL performances. Furthermore, tasks can also be harmful in a complementary manner. For instance, in the case of ner, All MTL achieves the lowest or second-lowest score when compared to any row of the pairwise MTL settings. In addition, sem's pairwise MTL performances are mixed, making the average score about the same as or slightly worse than STL. However, the performance of All MTL when tested on sem is nearly the lowest of all settings. In other words, sem is harmed more than it is benefited, but the pairwise MTL performances cannot tell us this. This suggests that harmful tasks are complementary while beneficial tasks are not.

Our results when testing on com are the most surprising. While none of the pairwise MTL settings helps (and some hurt), the performance of All MTL goes in the opposite direction, outperforming that of STL. Further characterization of task interaction is needed to reveal why this happens. One hypothesis is that instances in com that are benefited by one task may be harmed by another; the joint training of all tasks thus works because tasks regularize each other.

We believe that our results open the doors to other interesting research questions. While the pairwise MTL performance is somewhat predictive of the performance direction of All MTL (except for com), the magnitude in that direction is difficult to predict. It is clear that additional factors beyond pairwise performance contribute to the success or failure of the All MTL setting. It would be useful to automatically identify these factors or to design a metric that captures them. There have been initial attempts along this research direction in [Alonso and Plank, 2017, Bingel and Søgaard, 2017, Bjerva, 2017], in which manually defined task characteristics are found to be predictive of pairwise MTL's failure or success.

Oracle MTL

Recall that a task has an "Oracle" set when the task is benefited by some other tasks according to its pairwise results (cf. Sect. 3.4). In general, our simple heuristic works well. Out of the 20 cases where Oracle MTL performances exist, 16 are better than the performance of All MTL. In sem, upos, and xpos (TEDec), Oracle MTL is able to reverse the negative results obtained by All MTL into positive ones, leading to improved scores over STL in all cases. This suggests that pairwise MTL performances are valuable knowledge if we want to go beyond two tasks. But, as mentioned previously, pairwise performance information fails in the case of com; All MTL leads to improvement, but we do not have an Oracle set in this case.

Out of the 4 cases where Oracle MTL does not improve upon All MTL, three are when we test on hyp and one is when we test on mwe. These two tasks are not harmed by any tasks. This result seems to suggest that sometimes "neutral" tasks can help in MTL (but not always; for example, in Multi-Dec and TEEnc on mwe). This also raises the question of whether there is a more effective way to construct an oracle set.

4.4 Analysis

Task Contribution in All MTL

How much does one particular task contribute to the performance of All MTL? To investigate this, we remove one task at a time and train the rest jointly. Results for Multi-Dec are shown in Table 6; results for the other two methods are qualitatively similar and can be found in Appendix B. We find that upos, sem, and semtr are in general sensitive to a task being removed from All MTL. Moreover, at least one task significantly contributes to the success of All MTL at some point; if we remove it, the performance drops. On the other hand, com generally affects the performance of All MTL negatively, as removing it often leads to performance improvement.

Task Embeddings

Figure 4: t-SNE visualization of the embeddings of the 11 tasks that are learned from TEDec

Fig. 4 shows a t-SNE visualization [Van der Maaten and Hinton, 2008] of the task embeddings learned from TEDec in the All MTL setting (we observed that task embeddings learned from TEEnc are not consistent across multiple runs). The learned task embeddings reflect our knowledge about similarities between tasks: there are clusters of syntactic and semantic tasks. We also learn that sentence compression (com) is more syntactic, whereas multi-word expression identification (mwe) and hyper-text detection (hyp) are more semantic. Interestingly, chunk seems to lie in between, which may explain why it never harms any task in any setting (cf. Sect. 4.2).
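The visualization is straightforward to reproduce from a trained TEDec model; the sketch below assumes a task_embed embedding layer as in the earlier sketch and uses scikit-learn's t-SNE (with a perplexity below the number of tasks).

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

TASKS = ["upos", "xpos", "chunk", "ner", "mwe", "sem",
         "semtr", "supsense", "com", "frame", "hyp"]

def plot_task_embeddings(model):
    """model: a trained TEDec-style tagger with a `task_embed` layer (11 x 25)."""
    emb = model.task_embed.weight.detach().cpu().numpy()
    xy = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(emb)
    plt.scatter(xy[:, 0], xy[:, 1])
    for name, (x, y) in zip(TASKS, xy):
        plt.annotate(name, (x, y))
    plt.show()
```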

In general, it is not obvious how to translate task similarities derived from task embeddings into something indicative of MTL performance. While our task embeddings could be considered as “task characteristics” vectors, they are not guaranteed to be interpretable. We thus leave a thorough investigation of information captured by task embeddings to future work.

Nevertheless, we observe that task embeddings disentangle “sentences/tags” and “actual task” to some degree. For instance, if we consider the locations of each pair of tasks that use the same set of sentences for training in Fig. 4, we see that sem and semtr (or mwe and supsense) are not neighbors, while xpos and upos are. On the other hand, mwe and ner are neighbors, even though their label set size and entropy are not the closest. These observations suggest that hand-designed task features used in [Bingel and Søgaard, 2017] may not be the most informative characterization for predicting MTL performance.

5 Related Work

For a comprehensive overview of MTL in NLP, see Chapter 20 of [Goldberg, 2017] and [Ruder, 2017]. Here we highlight the work that is most relevant.

MTL for NLP has been popular since a unified architecture was proposed by [Collobert and Weston, 2008, Collobert et al., 2011]. For sequence-to-sequence learning [Sutskever et al., 2014], general multi-task learning frameworks are explored by [Luong et al., 2016].

Our work is different from existing work in several aspects. First, the majority of the work focuses on two tasks, often with one being the main task and the other being the auxiliary one [Søgaard and Goldberg, 2016, Bjerva et al., 2016, Plank et al., 2016, Alonso and Plank, 2017, Bingel and Søgaard, 2017]. For example, pos is the auxiliary task in [Søgaard and Goldberg, 2016] while chunk, CCG supertagging (ccg) [Clark, 2002], ner, sem, or mwe+supsense is the main one. They find that pos benefits chunk and ccg. Another line of work considers language modeling as the auxiliary objective [Godwin et al., 2016, Rei, 2017, Liu et al., 2018]. Besides sequence tagging, some work considers two high-level tasks or one high-level task with another lower-level one. Examples are dependency parsing (dep) with pos [Zhang and Weiss, 2016], with mwe [Constant and Nivre, 2016], or with semantic role labeling (srl) [Shi et al., 2016]; machine translation (translate) with pos or dep [Niehues and Cho, 2017, Eriguchi et al., 2017]; sentence extraction and com [Martins and Smith, 2009, Berg-Kirkpatrick et al., 2011, Almeida and Martins, 2013].

Exceptions to this include the work of [Collobert et al., 2011], which considers four tasks: pos, chunk, ner, and srl; [Raganato et al., 2017], which considers three: word sense disambiguation with pos and coarse-grained semantic tagging based on WordNet lexicographer files; [Hashimoto et al., 2017], which considers five: pos, chunk, dep, semantic relatedness, and textual entailment; [Niehues and Cho, 2017, Kiperwasser and Ballesteros, 2018], which both consider three: translate with pos and ner, and translate with pos and dep, respectively. We consider as many as 11 tasks jointly.

Second, we choose to focus on model architectures that are generic enough to be shared by many tasks. Our structure is similar to [Collobert et al., 2011], but we also explore frameworks related to task embeddings and propose two variants. In contrast, recent work considers stacked architectures (mostly for sequence tagging) in which tasks can supervise at different layers of a network [Søgaard and Goldberg, 2016, Klerke et al., 2016, Plank et al., 2016, Alonso and Plank, 2017, Bingel and Søgaard, 2017, Hashimoto et al., 2017]. More complicated structures require more sophisticated MTL methods when the number of tasks grows and thus prevent us from concentrating on analyzing relationships among tasks. For this reason, we leave MTL for complicated models for future work.

The purpose of our study is relevant to but different from transfer learning, where the setting designates one or more target tasks and focuses on whether the target tasks can be learned more effectively from the source tasks; see e.g., [Mou et al., 2016, Yang et al., 2017].

6 Discussion and Future Work

We conduct an empirical study on MTL for sequence tagging, which so far has mostly been studied with two or a few tasks. We also propose two alternative frameworks that augment taggers with task embeddings. Our results provide insights regarding task relatedness and show benefits of the MTL approaches. Nevertheless, we believe that our work merely scratches the surface of MTL. The characterization of task relationships seems to go beyond the performances of pairwise MTL training or the similarities of task embeddings. We are also interested in further exploring other techniques for MTL, especially when tasks become more complicated. For example, it is not clear how best to represent task specifications or how to incorporate them into NLP systems. Finally, the definition of tasks can be relaxed to include domains or languages. Combining all of these will move us toward the goal of having a single robust, generalizable NLP agent that is equipped with a diverse set of skills.

Acknowledgments

This work is partially supported by USC Graduate Fellowship, NSF IIS-1065243, 1451412, 1513966/1632803/1833137, 1208500, CCF-1139148, a Google Research Award, an Alfred P. Sloan Research Fellowship, gifts from Facebook and Netflix, and ARO# W911NF-12-1-0241 and W911NF-15-1-0484.

References

  • [Almeida and Martins, 2013] Miguel B. Almeida and André F. T. Martins. 2013. Fast and robust compressive summarization with dual decomposition and multi-task learning. In ACL.
  • [Alonso and Plank, 2017] Héctor Martínez Alonso and Barbara Plank. 2017. Multitask learning for semantic sequence prediction under varying data conditions. In EACL.
  • [Baker et al., 1998] Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet Project. In COLING-ACL.
  • [Berg-Kirkpatrick et al., 2011] Taylor Berg-Kirkpatrick, Daniel Gillick, and Dan Klein. 2011. Jointly learning to extract and compress. In ACL.
  • [Bies et al., 2012] Ann Bies, Justin Mott, Colin Warner, and Seth Kulick. 2012. English web treebank. Technical Report LDC2012T13, Linguistic Data Consortium, Philadelphia, PA.
  • [Bingel and Søgaard, 2017] Joachim Bingel and Anders Søgaard. 2017. Identifying beneficial task relations for multi-task learning in deep neural networks. In EACL.
  • [Bjerva et al., 2016] Johannes Bjerva, Barbara Plank, and Johan Bos. 2016. Semantic tagging with deep residual networks. In COLING.
  • [Bjerva, 2017] Johannes Bjerva. 2017. Will my auxiliary tagging task help? Estimating Auxiliary Tasks Effectivity in Multi-Task Learning. In NoDaLiDa.
  • [Caruana, 1997] Rich Caruana. 1997. Multitask learning. Machine Learning, 28:41–75.
  • [Cho et al., 2014] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
  • [Ciaramita and Altun, 2006] Massimiliano Ciaramita and Yasemin Altun. 2006. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In EMNLP.
  • [Clark, 2002] Stephen Clark. 2002. Supertagging for combinatory categorial grammar. In Proceedings of the Sixth International Workshop on Tree Adjoining Grammar and Related Frameworks.
  • [Clarke and Lapata, 2006] James Clarke and Mirella Lapata. 2006. Constraint-based sentence compression: An integer programming approach. In ACL.
  • [Collobert and Weston, 2008] Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML.
  • [Collobert et al., 2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
  • [Constant and Nivre, 2016] Matthieu Constant and Joakim Nivre. 2016. A transition-based system for joint lexical and syntactic analysis. In ACL.
  • [Das et al., 2014] Dipanjan Das, Desai Chen, André F. T. Martins, Nathan Schneider, and Noah A. Smith. 2014. Frame-semantic parsing. Computational Linguistics, 40:9–56.
  • [Eriguchi et al., 2017] Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to parse and translate improves neural machine translation. In ACL.
  • [Francis and Kučera, 1982] Winthrop Nelson Francis and Henry Kučera. 1982. Frequency analysis of English usage: Lexicon and grammar. Journal of English Linguistics, 18(1):64–70.
  • [Gardner et al., 2018] Matt A. Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. arXiv preprint arXiv:1803.07640.
  • [Glorot and Bengio, 2010] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS.
  • [Godwin et al., 2016] Jonathan Godwin, Pontus Stenetorp, and Sebastian Riedel. 2016. Deep semi-supervised learning with linguistically motivated sequence labeling task hierarchies. arXiv preprint arXiv:1612.09113.
  • [Goldberg, 2017] Yoav Goldberg. 2017. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.
  • [Graff, 1997] David Graff. 1997. The 1996 broadcast news speech and language-model corpus. In Proceedings of the 1997 DARPA Speech Recognition Workshop.
  • [Hashimoto et al., 2017] Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A Joint Many-Task Model: Growing a neural network for multiple NLP tasks. In EMNLP.
  • [Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–80.
  • [Huang et al., 2015] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
  • [Irsoy and Cardie, 2014] Ozan Irsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In EMNLP.
  • [Johnson et al., 2017] Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. TACL, 5:339–351.
  • [Kingma and Ba, 2015] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
  • [Kiperwasser and Ballesteros, 2018] Eliyahu Kiperwasser and Miguel Ballesteros. 2018. Scheduled multi-task learning: From syntax to translation. TACL, 6:225–240.
  • [Klerke et al., 2016] Sigrid Klerke, Yoav Goldberg, and Anders Søgaard. 2016. Improving sentence compression by learning to predict gaze. In HLT-NAACL.
  • [Lafferty et al., 2001] John D. Lafferty, Andrew D. McCallum, and Fernando Pereira. 2001. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In ICML.
  • [Lample et al., 2016] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In HLT-NAACL.
  • [Landes et al., 1998] Shari Landes, Claudia Leacock, and Randee I. Tengi. 1998. Building semantic concordances. In WordNet: An Electronic Lexical Database, pages 199–216.
  • [Lewis et al., 2004] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397.
  • [Liu et al., 2018] Liyuan Liu, Jingbo Shang, Frank F. Xu, Xiang Ren, Huan Gui, Jian Peng, and Jiawei Han. 2018. Empower sequence labeling with task-aware neural language model. In AAAI.
  • [Luong et al., 2016] Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In ICLR.
  • [Ma and Hovy, 2016] Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In ACL.
  • [Marcus et al., 1993] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
  • [Martins and Smith, 2009] André F. T. Martins and Noah A. Smith. 2009. Summarization with a joint model for sentence extraction and compression. In Proceedings of the NAACL-HLT Workshop on Integer Linear Programming for NLP.
  • [Miller et al., 1993] George A. Miller, Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Proceedings of the workshop on Human Language Technology.
  • [Mou et al., 2016] Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2016. How transferable are neural networks in NLP applications? In EMNLP.
  • [Niehues and Cho, 2017] Jan Niehues and Eunah Cho. 2017. Exploiting linguistic resources for neural machine translation using multi-task learning. In WMT.
  • [Nivre et al., 2016] Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal dependencies v1: A multilingual treebank collection. In LREC.
  • [Pascanu et al., 2013] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In ICML.
  • [Paszke et al., 2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In Proceedings of the NIPS Workshop on the future of gradient-based machine learning software and techniques.
  • [Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.
  • [Plank et al., 2016] Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In ACL.
  • [Raganato et al., 2017] Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2017. Neural sequence learning models for word sense disambiguation. In EMNLP.
  • [Rei, 2017] Marek Rei. 2017. Semi-supervised multitask learning for sequence labeling. In ACL.
  • [Reimers and Gurevych, 2017] Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In EMNLP.
  • [Ruder, 2017] Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
  • [Schneider and Smith, 2015] Nathan Schneider and Noah A. Smith. 2015. A corpus and model integrating multiword expressions and supersenses. In HLT-NAACL.
  • [Shi et al., 2016] Peng Shi, Zhiyang Teng, and Yue Zhang. 2016. Exploiting mutual benefits between syntax and semantic roles using neural network. In EMNLP.
  • [Søgaard and Goldberg, 2016] Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In ACL.
  • [Spitkovsky et al., 2010] Valentin I. Spitkovsky, Daniel Jurafsky, and Hiyan Alshawi. 2010. Profiting from mark-up: Hyper-text annotations for guided parsing. In ACL.
  • [Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
  • [Tjong Kim Sang and Buchholz, 2000] Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In CoNLL.
  • [Tjong Kim Sang and De Meulder, 2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL.
  • [Van der Maaten and Hinton, 2008] Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.
  • [Vossen et al., 1998] Piek Vossen, Laura Bloksma, Horacio Rodriguez, Salvador Climent, Nicoletta Calzolari, Adriana Roventini, Francesca Bertagna, Antonietta Alonge, and Wim Peters. 1998. The EuroWordNet Base Concepts and Top Ontology. Technical Report LE2-4003, University of Amsterdam, The Netherlands.
  • [Yang et al., 2017] Zhilin Yang, Ruslan Salakhutdinov, and William W. Cohen. 2017. Transfer learning for sequence tagging with hierarchical recurrent networks. In ICLR.
  • [Zhang and Weiss, 2016] Yuan Zhang and David Weiss. 2016. Stack-propagation: Improved representation learning for syntax. In ACL.

Appendix A Comparison between different MTL approaches

Settings Method upos xpos chunk ner mwe sem semtr supsense com frame hyp Average
STL - 95.4 95.04 93.49 88.24 53.07 72.77 74.02 66.81 72.71 62.04 46.73 74.58
Pairwise (Average) Multi-Dec 94.97 94.65 93.37 87.67 57.21 72.63 74.38 67.39 72.12 61.3 47.99 74.88
Pairwise (Average) TEDec 95.0 94.77 93.4 87.72 56.67 72.48 74.25 67.19 71.84 58.65 47.45 74.49
Pairwise (Average) TEEnc 95.0 94.66 93.32 87.65 55.99 72.49 74.24 67.09 72.09 61.62 47.37 74.68
All Multi-Dec 95.04 94.31 93.44 86.38 61.43 71.53 74.26 68.1 74.54 59.71 51.41 75.47
All TEDec 94.95 94.42 93.64 86.8 61.97 71.72 74.36 67.98 74.61 58.14 51.31 75.44
All TEEnc 94.94 94.3 93.7 86.01 59.57 71.58 74.35 68.02 74.61 61.83 49.5 75.31
All-but-one (Average) Multi-Dec 94.91 94.43 93.65 86.15 61.82 71.09 73.75 68.2 74.42 59.66 50.9 75.36
All-but-one (Average) TEDec 94.83 94.4 93.64 86.39 60.55 70.95 73.74 67.81 74.47 58.66 50.86 75.12
All-but-one (Average) TEEnc 94.77 94.35 93.53 85.96 60.23 70.83 73.64 68.15 74.05 61.23 50.15 75.17
Table 7: Comparison between MTL approaches

In Table 7, we summarize the results of different MTL approaches. We observe no significant differences between those methods.

Appendix B Additional results on All-but-one settings

Table 8 and Table 9 compare All and All-but-one settings for TEDec and TEEnc, respectively. We show similar results for Multi-Dec in the main text.

upos xpos chunk ner mwe sem semtr supsense com frame hyp #+ #-
All 94.95 94.42 93.64 86.8 61.97 71.72 74.36 67.98 74.61 58.14 51.31
All - upos 94.06 93.44 86.47 60.48 71.08 73.79 68.1 74.69 58.32 50.83 0 2
All - xpos 94.38 93.6 86.68 60.09 70.98 73.78 67.9 74.26 58.31 50.6 0 3
All - chunk 94.6 94.29 86.08 60.6 70.39 73.36 68.07 74.47 58.73 51.1 0 3
All - ner 94.69 94.31 93.69 60.48 70.64 73.59 67.51 74.49 58.19 50.44 0 4
All - mwe 94.93 94.46 93.72 86.21 71.11 74.04 67.38 74.49 57.6 50.5 0 2
All - sem 94.86 94.41 93.6 85.97 59.94 72.26 67.35 74.34 59.08 50.48 0 3
All - semtr 94.8 94.28 93.56 86.23 61.23 69.62 68.16 74.36 58.85 51.5 0 2
All - supsense 94.82 94.4 93.67 86.49 59.11 71.02 73.76 74.69 58.28 51.96 0 2
All - com 95.19 94.76 93.79 86.25 62.02 72.32 74.92 67.62 60.72 50.0 4 2
All - frame 95.03 94.6 93.64 86.68 60.52 71.11 73.9 67.69 74.49 51.23 0 2
All - hyp 94.94 94.45 93.69 86.86 61.07 71.22 74.04 68.32 74.4 58.55 0 1
#+ 1 1 0 0 0 0 1 0 0 1 0
#- 3 1 0 4 3 8 6 0 0 0 1
Table 8: F1 scores for TEDec. We compare All with the All-but-one settings (All - task). Each column corresponds to the task tested on. Beneficial settings are in green; harmful settings are in red.
upos xpos chunk ner mwe sem semtr supsense com frame hyp #+ #-
All 94.94 94.3 93.7 86.01 59.57 71.58 74.35 68.02 74.61 61.83 49.5
All - upos 94.0 93.36 85.98 59.58 70.68 73.66 68.19 74.07 60.51 50.23 0 1
All - xpos 94.24 93.29 85.8 59.81 70.57 73.64 68.47 73.94 60.13 50.39 0 4
All - chunk 94.66 94.3 85.73 61.58 70.78 73.65 67.87 73.67 61.73 50.18 0 1
All - ner 94.71 94.25 93.5 59.05 70.58 73.4 67.95 74.16 59.96 49.95 0 2
All - mwe 94.94 94.5 93.63 86.1 71.12 73.75 69.0 74.28 61.51 49.81 0 0
All - sem 94.76 94.32 93.45 85.58 59.47 72.21 67.77 74.2 61.76 50.15 1 1
All - semtr 94.68 94.25 93.54 86.02 60.59 69.86 67.96 73.81 61.31 51.72 1 2
All - supsense 94.8 94.27 93.56 86.04 59.25 70.53 73.27 74.3 59.98 50.01 0 2
All - com 95.25 94.72 93.82 86.23 60.63 72.38 75.06 67.94 63.55 48.77 4 0
All - frame 94.84 94.39 93.51 85.99 61.21 70.78 73.69 68.13 74.3 50.35 0 1
All - hyp 94.86 94.45 93.59 86.1 61.09 71.03 74.09 68.17 73.78 61.91 0 2
#+ 1 1 0 0 0 1 1 0 0 0 2
#- 1 0 3 0 0 5 4 0 3 0 0
Table 9: F1 scores for TEEnc. We compare All with the All-but-one settings (All - task). Each column corresponds to the task tested on. Beneficial settings are in green; harmful settings are in red.

Appendix C Detailed results separated by the tasks being tested on

In Tables 10-20, we provide F1 scores with standard deviations in all settings; each cell lists the mean followed by the standard deviation. Each table corresponds to a task we test our models on. Rows denote training settings and columns denote MTL approaches.

Trained with Tested on upos
Multi-Dec TEDec TEEnc
upos only 95.4 0.08

Pairwise

+xpos 95.38 0.03 95.4 0.04 95.42 0.07
+chunk 95.43 0.11 95.57 0.02 95.4 0.0
+ner 95.38 0.1 95.32 0.03 95.29 0.04
+mwe 95.15 0.05 95.11 0.07 95.05 0.05
+sem 95.23 0.14 95.2 0.05 95.27 0.08
+semtr 95.17 0.15 95.21 0.03 95.23 0.13
+supsense 95.08 0.08 95.05 0.04 95.27 0.08
+com 93.04 0.77 94.03 0.42 93.6 0.15
+frame 94.98 0.13 94.79 0.09 95.0 0.07
+hyp 94.84 0.07 94.35 0.21 94.43 0.15
Average 94.97 95.0 95.0

All-but-one

All - xpos 94.57 0.12 94.38 0.05 94.24 0.24
All - chunk 94.84 0.01 94.6 0.1 94.66 0.15
All - ner 94.81 0.07 94.69 0.05 94.71 0.07
All - mwe 94.93 0.01 94.93 0.08 94.94 0.04
All - sem 94.82 0.17 94.86 0.08 94.76 0.15
All - semtr 94.83 0.12 94.8 0.03 94.68 0.17
All - supsense 94.97 0.07 94.82 0.03 94.8 0.07
All - com 95.19 0.05 95.19 0.04 95.25 0.02
All - frame 95.15 0.07 95.03 0.17 94.84 0.1
All - hyp 94.93 0.18 94.94 0.11 94.86 0.04
All 95.04 0.03 94.95 0.08 94.94 0.1
Oracle 95.4 0.08 95.57 0.02 95.4 0.08
Table 10: F1 score tested on the task upos in different training scenarios
Tested on xpos
Trained with Multi-Dec TEDec TEEnc
xpos only 95.04 0.06

Pairwise

+upos 95.01 0.04 94.99 0.03 94.94 0.05
+chunk 95.1 0.02 95.21 0.02 95.1 0.04
+ner 94.98 0.12 95.09 0.07 95.05 0.13
+mwe 94.7 0.16 94.8 0.08 94.66 0.07
+sem 94.77 0.08 94.82 0.15 94.93 0.08
+semtr 94.86 0.02 94.8 0.09 94.97 0.09
+supsense 94.75 0.15 94.81 0.06 95.0 0.12
+com 93.19 0.75 93.94 0.21 93.12 0.44
+frame 94.64 0.06 94.66 0.05 94.55 0.06
+hyp 94.46 0.3 94.56 0.09 94.26 0.18
Average 94.65 94.77 94.66

All-but-one

All - upos 94.03 0.13 94.06 0.09 94.0 0.26
All - chunk 94.46 0.09 94.29 0.07 94.3 0.12
All - ner 94.3 0.03 94.31 0.02 94.25 0.07
All - mwe 94.45 0.05 94.46 0.12 94.5 0.09
All - sem 94.34 0.09 94.41 0.09 94.32 0.17
All - semtr 94.35 0.08 94.28 0.07 94.25 0.12
All - supsense 94.54 0.02 94.4 0.08 94.27 0.03
All - com 94.69 0.1 94.76 0.08 94.72 0.06
All - frame 94.57 0.12 94.6 0.19 94.39 0.08
All - hyp 94.53 0.07 94.45 0.1 94.45 0.07
All 94.31 0.15 94.42 0.07 94.3 0.2
Oracle 95.04 0.06 95.21 0.02 95.04 0.06
Table 11: F1 score tested on the task xpos in different training scenarios
Tested on chunk
Trained with Multi-Dec TEDec TEEnc
chunk only 93.49 0.01

Pairwise

+upos 94.18 0.02 94.02 0.08 94.0 0.15
+xpos 93.97 0.16 94.18 0.01 93.98 0.13
+ner 93.47 0.1 93.64 0.03 93.54 0.1
+mwe 93.54 0.13 93.59 0.2 93.33 0.2
+sem 93.63 0.02 93.45 0.07 93.52 0.13
+semtr 93.61 0.07 93.47 0.03 93.45 0.07
+supsense 93.2 0.21 93.25 0.15 93.13 0.13
+com 91.94 0.4 92.29 0.27 91.86 0.09
+frame 93.22 0.16 93.23 0.04 93.29 0.13
+hyp 92.96 0.08 92.86 0.08 93.13 0.04
Average 93.37 93.4 93.32

All-but-one

All - upos 93.59 0.13 93.44 0.17 93.36 0.17
All - xpos 93.57 0.19 93.6 0.05 93.29 0.21
All - ner 93.59 0.09 93.69 0.14 93.5 0.23
All - mwe 93.71 0.11 93.72 0.13 93.63 0.04
All - sem 93.63 0.08 93.6 0.11 93.45 0.13
All - semtr 93.58 0.08 93.56 0.14 93.54 0.06
All - supsense 93.67 0.08 93.67 0.12 93.56 0.12
All - com 93.67 0.12 93.79 0.14 93.82 0.05
All - frame 93.7 0.09 93.64 0.11 93.51 0.06
All - hyp 93.78 0.12 93.69 0.05 93.59 0.07
All 93.44 0.09 93.64 0.21 93.7 0.06
Oracle 94.01 0.13 94.07 0.25 93.93 0.16
Table 12: F1 score tested on the task chunk in different training scenarios
Tested on ner
Trained with Multi-Dec TEDec TEEnc
ner only 88.24 0.09

Pairwise

+upos 87.68 0.41 87.99 0.21 87.43 0.11
+xpos 87.61 0.27 87.65 0.14 87.71 0.08
+chunk 87.96 0.19 88.11 0.21 88.07 0.16
+mwe 88.15 0.23 87.99 0.15 88.02 0.36
+sem 87.35 0.16 87.27 0.36 87.49 0.25
+semtr 87.34 0.27 87.75 0.38 87.29 0.17
+supsense 87.9 0.24 87.94 0.33 87.92 0.16
+com 86.62 0.72 86.59 0.31 86.75 0.45
+frame 88.15 0.35 88.02 0.17 87.99 0.32
+hyp 87.98 0.21 87.91 0.4 87.82 0.31
Average 87.67 87.72 87.65

All-but-one

All - upos 86.03 0.53 86.47 0.14 85.98 0.29
All - xpos 86.04 0.15 86.68 0.27 85.8 0.27
All - chunk 86.05 0.1 86.08 0.49 85.73 0.2
All - mwe 86.21 0.27 86.21 0.19 86.1 0.37
All - sem 85.81 0.32 85.97 0.14 85.58 0.04
All - semtr 86.11 0.28 86.23 0.23 86.02 0.39
All - supsense 86.43 0.12 86.49 0.17 86.04 0.14
All - com 86.6 0.79 86.25 0.06 86.23 0.33
All - frame 85.9 0.29 86.68 0.15 85.99 0.3
All - hyp 86.31 0.18 86.86 0.25 86.1 0.56
All 86.38 0.12 86.8 0.08 86.01 0.4
Oracle 88.24 0.09 88.24 0.09 88.24 0.09
Table 13: F1 score tested on the task ner in different training scenarios
Tested on mwe
Trained with Multi-Dec TEDec TEEnc
mwe only 53.07 0.12

Pairwise

+upos 59.99 0.36 60.28 0.24 57.61 0.2
+xpos 58.87 0.78 60.32 0.3 58.26 0.25
+chunk 59.18 0.03 57.61 1.53 58.06 0.88
+ner 55.4 0.52 55.17 0.44 53.4 0.98
+sem 60.16 1.23 58.21 0.09 58.62 0.61
+semtr 58.84 1.45 58.55 0.28 58.31 2.24
+supsense 58.81 1.01 58.75 0.33 58.05 0.72
+com 53.89 1.41 51.72 1.01 51.71 1.05
+frame 53.88 0.76 53.05 1.32 53.3 1.15
+hyp 53.08 1.72 52.98 1.66 52.59 1.98
Average 57.21 56.67 55.99

All-but-one

All - upos 61.28 0.78 60.48 0.93 59.58 1.14
All - xpos 61.91 1.56 60.09 0.9 59.81 0.83
All - chunk 61.01 1.61 60.6 1.52 61.58 1.05
All - ner 62.69 0.26 60.48 0.15 59.05 0.4
All - sem 61.17 0.86 59.94 0.85 59.47 0.04
All - semtr 63.04 0.85 61.23 2.05 60.59 0.59
All - supsense 60.51 0.25 59.11 2.02 59.25 0.74
All - com 61.95 0.97 62.02 1.73 60.63 0.73
All - frame 62.62 0.85 60.52 0.47 61.21 0.99
All - hyp 62.04 0.6 61.07 0.51 61.09 1.06
All 61.43 1.94 61.97 0.5 59.57 0.64
Oracle 62.76 0.63 61.74 1.49 61.92 0.66
Table 14: F1 score tested on the task mwe in different training scenarios
Tested on sem
Trained with Multi-Dec TEDec TEEnc
sem only 72.77 0.04

Pairwise

+upos 73.23 0.06 73.17 0.08 73.11 0.01
+xpos 73.34 0.12 73.21 0.04 73.04 0.21
+chunk 73.16 0.05 73.02 0.05 73.13 0.07
+ner 72.88 0.08 72.77 0.19 72.91 0.08
+mwe 72.75 0.09 72.66 0.18 72.83 0.07
+semtr 72.5 0.07 72.5 0.05 72.17 0.06
+supsense 72.81 0.04 72.71 0.03 73.09 0.08
+com 70.39 0.46 70.37 0.28 70.18 0.54
+frame 72.76 0.16 72.26 0.21 72.49 0.23
+hyp 72.47 0.02 72.15 0.1 71.95 1.22
Average 72.63 72.48 72.49

All-but-one

All - upos 70.87 0.19 71.08 0.19 70.68 0.76
All - xpos 71.12 0.1 70.98 0.24 70.57 0.13
All - chunk 71.07 0.27 70.39 0.39 70.78 0.35
All - ner 70.82 0.41 70.64 0.15 70.58 0.03
All - mwe 71.01 0.14 71.11 0.17 71.12 0.29
All - semtr 69.72 0.27 69.62 0.37 69.86 0.36
All - supsense 71.22 0.29 71.02 0.16 70.53 0.19
All - com 72.38 0.08 72.32 0.23 72.38 0.17
All - frame 71.48 0.51 71.11 0.16 70.78 0.44
All - hyp 71.22 0.25 71.22 0.33 71.03 0.07
All 71.53 0.28 71.72 0.21 71.58 0.24
Oracle 73.32 0.04 73.1 0.03 73.14 0.06
Table 15: F1 score tested on the task sem in different training scenarios
Tested on semtr
Trained with Multi-Dec TEDec TEEnc
semtr only 74.02 0.04

Pairwise

+upos 74.93 0.09 74.87 0.1 74.85 0.05
+xpos 74.91 0.06 74.84 0.21 74.66 0.2
+chunk 74.79 0.13 74.73 0.12 74.77 0.13
+ner 74.34 0.08 74.01 0.05 74.04 0.07
+mwe 74.51 0.18 74.63 0.28 74.66 0.21
+sem 74.73 0.1 74.72 0.14 74.41 0.01
+supsense 74.61 0.24 74.52 0.05 74.94 0.22
+com 72.6 0.95 71.76 0.88 71.35 0.95
+frame 74.18 0.19 74.21 0.37 74.63 0.11
+hyp 74.23 0.27 74.19 0.45 74.14 0.23
Average 74.38 74.25 74.24

All-but-one

All - upos 73.54 0.54 73.79 0.46 73.66 0.97
All - xpos 74.03 0.11 73.78 0.28 73.64 0.07
All - chunk 73.97 0.22 73.36 0.05 73.65 0.39
All - ner 73.51 0.35 73.59 0.19 73.4 0.19
All - mwe 73.61 0.2 74.04 0.18 73.75 0.24
All - sem 71.97 0.3 72.26 0.28 72.21 0.48
All - supsense 73.86 0.09 73.76 0.19 73.27 0.2
All - com 74.75 0.22 74.92 0.1 75.06 0.12
All - frame 74.24 0.37 73.9 0.29 73.69 0.32
All - hyp 74.02 0.12 74.04 0.17 74.09 0.21
All 74.26 0.1 74.36 0.03 74.35 0.29
Oracle 75.23 0.06 75.24 0.13 75.09 0.02
Table 16: F1 score tested on the task semtr in different training scenarios
Tested on supsense
Trained with Multi-Dec TEDec TEEnc
supsense only 66.81 0.22

Pairwise

+upos 68.25 0.42 67.8 0.29 67.76 0.14
+xpos 67.78 0.4 68.3 0.71 67.77 0.15
+chunk 67.39 0.15 67.29 0.33 67.36 0.29
+ner 68.06 0.16 67.25 0.21 67.57 0.27
+mwe 66.88 0.14 66.88 0.24 66.26 0.9
+sem 68.29 0.21 68.46 0.38 68.1 0.59
+semtr 68.6 0.81 68.18 0.39 67.64 0.92
+com 65.57 0.17 64.98 0.34 65.55 0.18
+frame 66.59 0.07 66.2 0.16 66.75 0.22
+hyp 66.47 0.24 66.52 0.59 66.16 0.43
Average 67.39 67.19 67.09

All-but-one

All - upos 68.27 0.33 68.1 0.28 68.19 0.55
All - xpos 67.99 0.5 67.9 0.54 68.47 0.18
All - chunk 68.26 0.48 68.07 0.28 67.87 0.32
All - ner 68.16 0.26 67.51 0.4 67.95 0.24
All - mwe 68.18 0.62 67.38 0.22 69.0 0.45
All - sem 67.36 0.42 67.35 0.18 67.77 0.28
All - semtr 68.17 0.15 68.16 0.47 67.96 0.73
All - com 68.67 0.37 67.62 0.6 67.94 0.22
All - frame 68.47 0.72 67.69 0.95 68.13 0.39
All - hyp 68.46 0.37 68.32 0.18 68.17 0.36
All 68.1 0.54 67.98 0.29 68.02 0.21
Oracle 68.53 0.09 68.22 0.61 69.04 0.44
Table 17: F1 score tested on the task supsense in different training scenarios
Tested on com
Trained with Multi-Dec TEDec TEEnc
com only 72.71 0.75

Pairwise

+upos 72.46 0.34 72.86 0.12 72.09 0.36
+xpos 72.83 0.16 72.87 0.56 72.41 0.51
+chunk 72.44 0.11 73.3 0.15 72.88 0.26
+ner 70.93 0.73 71.08 0.31 70.78 0.27
+mwe 71.31 0.31 70.93 0.43 71.36 0.42
+sem 72.72 0.22 73.14 0.08 72.25 0.07
+semtr 71.96 0.16 71.74 0.46 72.15 0.5
+supsense 72.24 0.27 69.13 0.19 72.12 0.66
+frame 72.47 0.08 72.89 0.22 72.1 0.93
+hyp 71.82 0.97 70.47 0.81 72.79 0.97
Average 72.12 71.84 72.09

All-but-one

All - upos 74.42 0.24 74.69 0.26 74.07 0.19
All - xpos 74.36 0.14 74.26 0.64 73.94 0.3
All - chunk 74.2 0.13 74.47 0.26 73.67 0.23
All - ner 74.08 0.07 74.49 0.38 74.16 0.48
All - mwe 74.7 0.14 74.49 0.13 74.28 0.16
All - sem 74.31 0.1 74.34 0.42 74.2 0.28
All - semtr 74.2 0.24 74.36 0.36 73.81 0.16
All - supsense 74.24 0.44 74.69 0.52 74.3 0.13
All - frame 75.03 0.24 74.49 0.2 74.3 0.19
All - hyp 74.62 0.14 74.4 0.06 73.78 0.05
All 74.54 0.53 74.61 0.24 74.61 0.32
Oracle 72.71 0.75 72.71 0.75 72.71 0.75
Table 18: F1 score tested on the task com in different training scenarios
Tested on frame
Trained with Multi-Dec TEDec TEEnc
frame only 62.04 0.74

Pairwise

+upos 62.14 0.35 61.54 0.53 62.27 0.33
+xpos 60.77 0.39 61.44 0.06 61.62 1.01
+chunk 62.67 0.47 61.39 0.78 62.98 0.5
+ner 62.39 0.37 59.25 0.52 63.02 0.39
+mwe 61.75 0.21 56.77 2.79 60.61 0.91
+sem 61.74 0.27 60.09 0.48 62.17 0.36
+semtr 62.03 0.41 59.77 0.81 62.79 0.19
+supsense 61.94 0.43 55.68 0.61 61.96 0.18
+com 56.52 0.27 55.25 2.29 57.65 2.42
+hyp 61.02 0.62 55.35 0.5 61.14 1.77
Average 61.3 58.65 61.62

All-but-one

All - upos 58.47 1.0 58.32 0.35 60.51 0.1
All - xpos 60.16 0.42 58.31 0.8 60.13 1.38
All - chunk 60.01 0.65 58.73 0.68 61.73 0.48
All - ner 59.17 0.27 58.19 0.89 59.96 0.52
All - mwe 59.23 0.33 57.6 0.82 61.51 0.43
All - sem 58.73 0.67 59.08 0.84 61.76 0.52
All - semtr 59.49 0.79 58.85 0.51 61.31 1.16
All - supsense 59.23 0.64 58.28 0.19 59.98 1.23
All - com 62.37 0.37 60.72 0.73 63.55 0.31
All - hyp 59.69 0.41 58.55 0.29 61.91 0.59
All 59.71 0.85 58.14 0.23 61.83 0.98
Oracle 62.04 0.74 62.04 0.74 62.04 0.74
Table 19: F1 score tested on the task frame in different training scenarios
Tested on hyp
Trained with Multi-Dec TEDec TEEnc
hyp only 46.73 0.55

Pairwise

+upos 48.02 0.31 49.36 0.36 48.27 0.68
+xpos 48.81 0.36 49.23 0.55 48.06 0.02
+chunk 47.85 0.2 48.43 0.3 47.13 0.35
+ner 47.9 0.67 48.24 0.65 48.64 1.17
+mwe 47.32 0.29 45.83 0.46 46.71 0.64
+sem 48.15 0.21 47.95 0.75 47.12 0.43
+semtr 47.74 0.57 46.96 0.85 46.1 0.11
+supsense 49.23 0.13 47.29 0.41 47.24 0.43
+com 47.41 1.18 45.24 0.46 47.81 0.8
+frame 47.5 0.46 46.0 0.53 46.66 0.54
Average 47.99 47.45 47.37

All-but-one

All - upos 51.13 0.94 50.83 0.65 50.23 0.73
All - xpos 51.65 0.63 50.6 0.44 50.39 1.17
All - chunk 50.27 0.76 51.1 0.28 50.18 0.81
All - ner 50.86 0.87 50.44 0.39 49.95 0.38
All - mwe 50.83 0.61 50.5 0.9 49.81 0.44
All - sem 50.93 0.27 50.48 0.53 50.15 0.11
All - semtr 51.27 0.5 51.5 0.46 51.72 0.15
All - supsense 50.86 1.85 51.96 0.29 50.01 1.13
All - com 50.28 1.02 50.0 0.11 48.77 0.54
All - frame 50.89 0.64 51.23 1.01 50.35 0.68
All 51.41 0.25 51.31 0.55 49.5 0.05
Oracle 50.0 0.42 50.15 0.25 48.06 0.02
Table 20: F1 score tested on the task hyp in different training scenarios